Taxonomic Challenges

Over on SpokenWord.org we started with a set of “source” categories such as Conference, Interview, Lecture, Sermon and so on. These categories turned out to be rather useless since very few visitors really cared whether a recording was from a conference or a lecture, for example. What they cared about was whether it was about chemistry or China, which this taxonomy didn’t address.

Next we decided to go with a free-form tagging folksonomy as do many other content sites. For better or worse, we have a semi-automated source of tags: the <keyword> elements of the RSS feeds that supply most of our new programs. Tagging has worked quite well as a search mechanism: a way to actively find content. You can now search for chemistry or China and get reasonable results.
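As a rough illustration of that semi-automated tagging, here is a minimal sketch of pulling tags out of a feed. It assumes the keywords arrive as a single comma-separated `<itunes:keywords>` element per item, which is one common convention; feeds differ, and the sample feed and the `extract_tags` helper are hypothetical, not our production code.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment; real feeds vary in which keyword element they use.
SAMPLE_FEED = """<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>Green Chemistry in China</title>
      <itunes:keywords>Chemistry, china, Energy , chemistry</itunes:keywords>
    </item>
  </channel>
</rss>"""

# Namespace prefix map used when looking up the (assumed) keyword element.
ITUNES_NS = {"itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd"}


def extract_tags(feed_xml):
    """Return a dict mapping each item title to its sorted, de-duplicated tags."""
    root = ET.fromstring(feed_xml)
    tags_by_title = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        raw = item.findtext("itunes:keywords", default="", namespaces=ITUNES_NS)
        # Normalize: trim whitespace, lowercase, drop empties and duplicates.
        tags = {t.strip().lower() for t in raw.split(",") if t.strip()}
        tags_by_title[title] = sorted(tags)
    return tags_by_title


if __name__ == "__main__":
    print(extract_tags(SAMPLE_FEED))
```

Lowercasing and de-duplicating is what makes a search for chemistry match a feed that supplied "Chemistry".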

But we also want to present content in a more traditional manner. We want to proactively feature programs (particularly on the home page) in ways that will encourage first-time visitors to listen and view. So we’re thinking of re-instituting a taxonomy of categories in addition to our tags. Now comes the challenge of defining the categories. Here’s the taxonomy we have so far. We want to keep the count to no more than fifteen, so we need to combine where possible, but we want to make sure any spoken-word content fits into at least one category appropriately.

  • business and finance
  • science and technology
  • health and medicine
  • education
  • arts, entertainment, media and literature
  • energy/environment
  • food and drink
  • religion
  • government and politics (current affairs?)
  • sports, recreation & hobbies
  • travel/history
  • comedy (humor)

Anything missing? Remember, these are topical categories, not sources, media, etc.

Update: Here’s another option. We could simply adopt the categories used by iTunes for podcasts. It’s not perfect, but it has the advantage that all of our collections and feeds would be guaranteed compatible with iTunes’ taxonomy. Here’s the list from Apple:

  • Arts
    • Design
    • Fashion & Beauty
    • Food
    • Literature
    • Performing Arts
    • Visual Arts
  • Business
    • Business News
    • Careers
    • Investing
    • Management & Marketing
    • Shopping
  • Comedy
  • Education
    • Education Technology
    • Higher Education
    • K-12
    • Language Courses
    • Training
  • Games & Hobbies
    • Automotive
    • Aviation
    • Hobbies
    • Other Games
    • Video Games
  • Government & Organizations
    • Local
    • National
    • Non-Profit
    • Regional
  • Health
    • Alternative Health
    • Fitness & Nutrition
    • Self-Help
    • Sexuality
  • Kids & Family
  • Music
  • News & Politics
  • Religion & Spirituality
    • Buddhism
    • Christianity
    • Hinduism
    • Islam
    • Judaism
    • Other
    • Spirituality
  • Science & Medicine
    • Medicine
    • Natural Sciences
    • Social Sciences
  • Society & Culture
    • History
    • Personal Journals
    • Philosophy
    • Places & Travel
  • Sports & Recreation
    • Amateur
    • College & High School
    • Outdoor
    • Professional
  • Technology
    • Gadgets
    • Tech News
    • Podcasting
    • Software How-To
  • TV & Film

A Liberal Against a Detroit Bailout

I find myself siding with the Republicans on this one. Sorta weird. Tom Friedman has it right. I can’t see a good reason why we should put taxpayers’ dollars into a dying industry. GM, Ford and Chrysler’s management have done a miserable job, and unless they go through a serious shakeup such as a Chapter 11 bankruptcy, they shouldn’t continue to exist. The writing is on the wall for them. The Emperor has no clothes. As for investors, lenders (I am one, through funds) and management, they deserve to suffer the consequences of how these companies have been run. The only constituents who may be entitled to taxpayer assistance are the autoworkers and employees (not executives) of the small suppliers.

This brings up the union, healthcare and pension issues. Messy, to say the least. Personally, I’ve had a love/hate relationship with unions. I believe in the basic concepts of collective bargaining and I recognize that without the ability of employees to organize, employers will exploit them unfairly. But while the major U.S. unions have done an admirable job of growing benefits for their members, there now exist inequities in the benefits and pensions between union and non-union workers in this country. True, the UAW has accepted some concessions in recent years, but the fact is that GM and others are under a tremendous burden in supporting their former employees. This, by the way, is what the Republicans are thinking but not saying. By withholding from Detroit another $25 billion, they’re fostering union-busting through the bankruptcy process.

Although I’m not anti-union, this could ultimately be a good thing. Rather than spending billions on propping up the corporations, I’d like to see Obama and the Congress take this as an opportunity to start providing universal healthcare for all (not just out-of-work auto workers) and beefing up the Social Security System. I think we’re the only country in the world that ties healthcare to employment, which is nuts. And we’ve all seen what will happen if we continue down the Republican path of increased privatization of retirement benefits.

Let the Big Three go into Bankruptcy. That’s what it’s for. There’s a process that has been tweaked for decades as opposed to the Paulson/Bernanke methodology of writing checks without adequate conditions and then seeing what works and what doesn’t. Let the old and broken institutions crumble. Only then can we get to the bottom and build a more honest and sustainable world. Avoiding the inevitable never works, by definition.

Amazon CloudFront

For the past three months we’ve been beta-testing a new Amazon web service now named CloudFront. The best way to think of CloudFront is a high-performance front end for Amazon’s S3, based upon edge servers located closer to your web site’s visitors.

I’ve been favorably impressed with the new service. To try it out, I went for the low-hanging fruit by simply changing delivery of our CSS and JavaScript files to CloudFront. Performance-wise, these are our most-critical files because browsers run single-threaded while fetching and processing CSS/JS files. After the change, the download speeds of these files fluctuated between 3x and 4x faster than when delivered from our dedicated servers at The Planet in Texas. The key, in looking at the network histograms, is the all-important ‘first-byte delivery time.’ Net improvement: ~750 milliseconds for the load of any of our pages, based on measurements here, 12 miles north of San Francisco. The entire change took only about 15 minutes of effort, including creating a new S3 bucket, copying the files, modifying our code — all the changes were in one file — and establishing a new CNAME, which is optional.
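The "all the changes were in one file" part can be pictured as a single URL helper that every template calls. This is only a sketch of that idea: the hostname `static.example.org`, the `asset_url` function and the choice of extensions are all hypothetical and stand in for whatever the real code does.

```python
# Hypothetical hostname: an optional CNAME that would point at the CloudFront distribution.
CDN_HOST = "static.example.org"

# Only these asset types are routed through the CDN in this sketch.
CDN_EXTENSIONS = (".css", ".js")


def asset_url(path, use_cdn=True):
    """Return the URL for a static asset, sending CSS/JS through the CDN host."""
    path = "/" + path.lstrip("/")  # normalize to exactly one leading slash
    if use_cdn and path.lower().endswith(CDN_EXTENSIONS):
        return "https://" + CDN_HOST + path
    return path  # everything else is still served from the origin server


if __name__ == "__main__":
    print(asset_url("css/site.css"))
    print(asset_url("/images/logo.png"))
```

Keeping the switch in one function also makes it easy to fall back to the origin server by flipping `use_cdn`.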

Amazon calls CloudFront a “web service for content delivery,” which isn’t quite the same thing as a content-delivery network (CDN). The difference (for us) is that CloudFront doesn’t (yet?) operate as a pure cache, running off our “origin server” in the same way as we deliver our media files via Limelight Networks, a true CDN. In the case of Limelight, we just maintain the files on our own server, set up a CNAME that refers to Limelight’s edge servers and that’s it. When we add or modify a file on the origin server, that’s all we have to do. Limelight instantly (and I mean that literally) begins to deliver the new version worldwide. We don’t have to do anything, manual or otherwise, to keep the CDN copies of our files fresh. In the case of CloudFront, you still have to take certain actions (which could be automated, of course) to get new and updated assets from your primary servers pushed to their edge servers.

But while CloudFront may not be a pure CDN at this time, it’s extraordinarily cost-effective. It’s a no-brainer way to speed up almost any web site. For those assets like CSS, JavaScript files, frequently used images, icons, etc., the performance is as good as any CDN I’ve used but at a fraction of the cost. Pricing has two components. For assets served from U.S. edge locations:

  1. $0.170/GB data transfer out
  2. $0.010 per 1,000 GET requests

Charges are lower as volume increases, but higher for delivery from their European and Asian edge locations.
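To get a feel for what those two components add up to, here is a back-of-the-envelope estimate using only the U.S. prices listed above. The `monthly_cost` function and the traffic figures are made up for illustration, and the sketch ignores the volume discounts, the higher European and Asian rates, and any other AWS charges.

```python
# U.S. edge-location prices quoted above (lowest-volume tier).
PRICE_PER_GB_OUT = 0.170     # dollars per GB of data transfer out
PRICE_PER_1000_GETS = 0.010  # dollars per 1,000 GET requests


def monthly_cost(gb_out, get_requests):
    """Estimate a month's delivery charge in dollars for U.S. edge traffic."""
    transfer = gb_out * PRICE_PER_GB_OUT
    requests = get_requests / 1000 * PRICE_PER_1000_GETS
    return round(transfer + requests, 2)


if __name__ == "__main__":
    # Hypothetical traffic: 50 GB out and 2 million GET requests in a month.
    print(monthly_cost(50, 2_000_000))
```

For small files like CSS and JavaScript, the per-request charge can easily outweigh the transfer charge, so it is worth estimating both.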

(Aside: One thing I love about all of the AWS services is that by publishing their prices so clearly, they set a very public bar against which all other providers are instantly measured. This happened with S3, and it’s going to happen with CloudFront. Pricing of storage, hosting, servers and now content delivery was previously mysterious and highly negotiable — like by an order of magnitude. AWS has brought transparency to the world of web-service pricing.)

Consider, too, that CloudFront is a completely self-service offering with no minimums, setup costs or hassles once you’re into the whole AWS world. As for reliability, we never had a single failure or outage that I’m aware of during the entire three-month test period.

Highest-Rated Programs

Over the weekend I added a Highest Rated tab to the SpokenWord.org homepage. This is something I’ve done before on sites like IT Conversations, and it’s always a challenge. On one hand you want the feature to honestly display the highest-rated programs, but on the other hand you don’t want the list to get stale. You want to avoid the situation in which the most-popular items become increasingly popular and lock themselves into the top slots.

Working with my personal on-call mathematician, Bruce Sharpe, I’ve implemented an algorithm that is at least a good first cut. There are a number of tweakable parameters that have yet to be tweaked. The concept is to discount ratings by two factors: (1) discount each individual rating by the age of that rating; and (2) discount the adjusted average rating by the inverse of the number of ratings the object has received. Highest Rated is therefore influenced by (but not the same as) a popularity index.

At Bruce’s urging, I’m using the tanh() (hyperbolic tangent) function to determine the curves for both discounting formulas. In about 34 years of writing code I can honestly say that’s a first for me. I once wrote an entire floating-point runtime library in assembler language — yeah, that’s a challenge! — but I’ve never had much need for those trig functions myself.
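For the curious, here is one possible reading of the two discounts, written as a sketch rather than the code actually running on the site. The exact formulas aren't spelled out above, so the shapes of both curves, the two scale constants and all of the function names are assumptions made for illustration.

```python
import math

# Hypothetical tuning parameters; the real values are among those "yet to be tweaked".
AGE_SCALE_DAYS = 30.0  # how quickly an individual rating loses weight with age
COUNT_SCALE = 5.0      # how many ratings it takes before the average is trusted


def age_weight(age_days):
    """Weight of one rating: 1.0 when brand new, falling toward 0 as it ages."""
    return 1.0 - math.tanh(age_days / AGE_SCALE_DAYS)


def count_factor(num_ratings):
    """Confidence in the average: 0 with no ratings, approaching 1 with many."""
    return math.tanh(num_ratings / COUNT_SCALE)


def score(ratings):
    """Score a program from a list of (stars, age_days) pairs.

    Step 1 takes an age-weighted average of the ratings; step 2 scales that
    average down when only a few ratings have been received.
    """
    weights = [age_weight(age) for _, age in ratings]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no ratings, or all of them too old to count
    weighted_avg = sum(stars * w for (stars, _), w in zip(ratings, weights)) / total_weight
    return weighted_avg * count_factor(len(ratings))


if __name__ == "__main__":
    print(score([(5, 0)]))       # a single fresh five-star rating
    print(score([(5, 0)] * 10))  # ten fresh five-star ratings
```

The appeal of tanh here is that it rises smoothly from 0 and levels off at 1, so neither discount can run away no matter how old a rating is or how many ratings pile up.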

The Highest Rated tab on the homepage currently shows too many programs from IT Conversations because of the recovery from a recent database coding error (mine), but over the next few days, as the ratings age, the fairness of the algorithm should kick in, yielding more valuable data.