It Was 40 Years Ago Today

If you’re at least 50 years old, you probably remember quite well where you were the day that men first walked on the moon. It was an exciting and yet surreal moment. I was working that summer in San Diego, but it was a Sunday so like so many Americans I was glued to the television. After Neil Armstrong stepped off the lunar lander, I walked outside, looked up at the moon (clearly visible mid-day) and just shook my head in near disbelief. Truly amazing. I called my then-girlfriend, Cessna (now my wife of 38 years) in Kansas to compare notes. She was awed as well.

Why San Diego? I was hired by Jan Popper as a production stage manager at what was then known as San Diego International University. I spent most of my summer there directing opera workshop productions and teaching acting to foreign opera students. That was almost as surreal as men walking on the moon.

More Facebook Integration for SpokenWord.org

I’ve added a new feature to SpokenWord.org for Facebook users. When you submit a program to our database or add a program to one of your SpokenWord.org collections, you’ll be given the chance to post it to your Feed (Wall) on Facebook. Note that this only works if you’ve previously logged into our site using your Facebook ID and you’re currently logged into Facebook.

CHI Conversations Launches

CHI Conversations is a new channel from The Conversations Network, the home of IT Conversations. “CHI” refers to Computer-Human Interaction and this new channel initially features programs produced by BayCHI, the San Francisco Bay Area chapter of ACM’s SIGCHI. BayCHI has been recording the speakers at their monthly Silicon Valley meetings for many years, but those programs (many with extraordinary presentations by some of the great technologists of our times) have gone unheard by the general public until now.

Those recordings would have been lost forever if it weren’t for the efforts of BayCHI’s Steven Williams, backed by the support of BayCHI’s membership and Board of Directors. Steve has single-handedly pulled together all of the bits and pieces it takes to create a new channel on The Conversations Network. It has taken nearly a year of Steve’s efforts to get to this launch, and we’re indebted to him for his perseverance. Steve is now serving as Executive Producer of CHI Conversations.

Along with the members of TeamITC, including new volunteers from BayCHI, Steve plans to publish BayCHI’s new monthly programs as well as work through the archives of past years’ recordings at the rate of two or three programs each week.

Collection Limits for SpokenWord.org

Because SpokenWord.org collections can subscribe to feeds and even follow other collections, they can grow to a size that is unmanageable. We’ve therefore added three ways in which you can keep your collections under control.

  1. Limit the number of programs.
  2. Limit the age of programs.
  3. Limit the size of a collection’s RSS feed.

On your collection’s page, click the Info link under “Edit This Collection”.

1. “Remove oldest programs when there are more than [count] or [age].” The default value for [count] is 1,000, the maximum number of programs any collection can contain. If you want to keep your collection smaller, select another value: 10, 25, 100 or 250. As you add new programs, earlier-added programs will be removed in order to maintain the maximum size you specify.

2. Likewise [age] tells us how long to keep programs from the date you collect them. The default is never to delete them (by age), but you can change this to automatically remove programs that have been in your collection for more than one week, one month or one year.

3. “Most-recent programs to include in RSS feed: [count].” By default, we’ll include up to 100 programs from your collection in its RSS feed. But you can use this option to change that value to 10, 25, 50, 100, 250 or “all”.
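
To make the first two limits concrete, here’s a minimal sketch of the kind of pruning pass they imply. This is illustrative Python only, not the site’s actual code; the data structures and field names are assumptions.

    from datetime import datetime, timedelta

    def prune_collection(programs, max_count=1000, max_age_days=None):
        # programs: list of dicts, each with a 'collected_at' datetime.
        # max_count: option #1 -- keep at most this many programs (default 1,000).
        # max_age_days: option #2 -- drop programs collected too long ago
        #               (None means never delete by age, the default).

        # Apply the age limit first.
        if max_age_days is not None:
            cutoff = datetime.utcnow() - timedelta(days=max_age_days)
            programs = [p for p in programs if p["collected_at"] >= cutoff]

        # Then the count limit: remove the oldest-collected programs
        # until the collection is back under the maximum size.
        programs.sort(key=lambda p: p["collected_at"])
        return programs[-max_count:]

    def feed_items(programs, feed_limit=100):
        # Option #3: include only the most recently collected programs
        # in the collection's RSS feed (default 100).
        return sorted(programs, key=lambda p: p["collected_at"])[-feed_limit:]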

Note: Although you can set all of these values now, only #3 (RSS limits) is operational. We won’t turn on #1 and #2 until at least Wednesday morning (July 15) at 9am Pacific time to allow you time to modify your collections that may be affected by the change.

Facebook Connect for SpokenWord.org

Yesterday I rolled out Facebook Connect for SpokenWord.org, and if you have a Facebook account I urge you to stop by, give it a try, and let us know if it works for you. The integration is about two-thirds done, but you probably won’t notice the missing one-third. It has been an interesting process so far. I previously implemented OpenID, and I expected something similar, but that’s not the case. The concepts of the two systems are similar, but the realities are quite different. For example:

  • Facebook’s documentation is awful. Rather than one or two coherent documents, there are dozens of wiki pages written, as far as I can tell, by the developers themselves, not good tech writers. Each page is written in a different style and documents (usually incompletely) one small piece of the big picture. To actually integrate Facebook into an existing identity system, there are many — more than necessary — moving parts.
  • Although a FB user explicitly authorizes your application, FB refuses to supply his or her email address through the API. Instead, there’s a very baroque system by which you send FB hashed versions of the email addresses of all your existing registered members in advance so that Facebook can then let you know that one of them matches a FB user at the time that user authorizes your application. But if a new (to you) FB user logs into your site, you don’t have that existing data. (OpenID’s API gives you an email address if the user approves.)
  • The Facebook Terms of Service are oppressive. They must have been written by Facebook’s Business Prevention Division. For example, you are not allowed to store (in a database) any personal data you receive from Facebook Connect. When a user authorizes our app, FB sends us the user’s first and last names. We’re allowed to display those while the user is connected, but not thereafter. (We get around this by asking the user to give us this data independently.) I noticed that TechCrunch uses Facebook Connect for comments, so I was curious what would happen if I left a comment on their blog and then de-authorized the TechCrunch app. Sure enough, my comments disappeared from their site, and when I re-enabled the app, the comments re-appeared. Weird.
  • The email thing is particularly nasty, for while we’re not sending FB our users’ email addresses unencrypted (which would violate our own Privacy Policy), we are sending an MD5 hash of those addresses (see the sketch after this list). This means FB can compare the hashes we send them to the 100+ million email addresses they already have, allowing them to determine that someone is a registered member on our site even before that person authorizes the use of his/her FB identity to access our site.
  • FB requires that if a user is logged in via Facebook, you display that user’s Facebook photo on every page they view. No reason is given for this requirement, and very few Facebook Connect sites do so. (Digg is an exception.) Note that this (and other ToS issues) requires that you load FB’s supporting JavaScript on every page.
  • Oh, did I mention how bad their documentation is?
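
For the curious, the email-hash exchange described above comes down to something like this sketch on our end. It’s illustrative only: send_to_facebook() is a stand-in for whatever API call actually uploads the hashes, and the exact hash format Facebook expects may differ from plain MD5.

    import hashlib

    def email_hash(address):
        # MD5 of the normalized (trimmed, lower-cased) email address.
        normalized = address.strip().lower()
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def register_hashes_with_facebook(registered_emails, send_to_facebook):
        # Send Facebook hashed versions of our members' addresses in
        # advance, so FB can tell us when one matches a FB user who
        # later authorizes our application. We never send the raw
        # addresses themselves.
        hashes = [email_hash(e) for e in registered_emails]
        send_to_facebook(hashes)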

All of that said — and there are many more issues — we’ve had many requests for this integration as a way to make it easier to register for and log in to SpokenWord.org. I hope you find it valuable.

Email Gremlins

So I’ve been having this really strange problem. I use OS X’s Mail app along with SpamSieve for spam filtering. But recently I’ve been noticing that the spam detection has been hyperactive: way too many false positives. I tried re-training SpamSieve. No help. So then I shut it down altogether: Whoa! I was *still* getting messages sent to the spam folder. Next, all the usual steps: rebooting, re-initializing this and that. Still no help. With absolutely no spam filtering turned on, stuff was still being flagged and moved. (Any of you email geeks starting to get a clue here?)

For a totally separate reason I pulled out my MacBook Pro, and that’s when it hit me. I even caught the nasty gremlin in the act. What was it?

I use Google as my inbound and outbound email server. Yes, I use their spam filtering, too — it’s much better than SpamSieve — but that wasn’t it. Because I have three different email clients (if you count the iPhone) I use IMAP4 instead of POP3 to communicate between those clients and the Google server and keep things in sync. So here’s what was happening: My MacBook Pro had been on and running its own instances of Mail and SpamSieve. Messages would come into Google and, in some cases, my laptop would grab them. The copy of SpamSieve on that computer decided some of them were spam and would move them to the spam folder. And because I’m using IMAP4, this change was sent to the server and then to the email client running on the desktop. It was my laptop, running this other instance of my spam filtering software, that was moving messages around on the email server and hence on my desktop client. It was downright spooky to see the messages moving without a clue as to why, but as soon as I realized my laptop was also running email, it became instantly clear.

Adventures in Full-Text Search

SpokenWord.org calls itself a site for “finding and sharing audio and video spoken-word recordings.” Sounds great, but our “finding” capabilities (search, in particular) have been pretty bad. In mid-March I started writing a fancy new full-text search module that worked across database tables and allowed all sorts of customization and advanced-search features. Six weeks and a few thousand lines of code later, I had a new system that…well, sucked. There are all sorts of reasons why, but it sucked. Bottom line: It just didn’t do a decent job of finding stuff.

I then considered implementing something like Solr, based on Lucene. But the more I thought about it, the more I realized that would be only marginally better.

Searching for audio and video programs from a database that will hit 250,000 in the next few hours comes down to a few architectural issues:

  • You’ve got to search the text of titles, descriptions, keywords, tags and comments, which in our case are stored in separate database tables.
  • There are three ways of doing this: (1) read the database tables in which these strings are stored in real time; (2) in background/batch, build a separate table of the integrated text from the separate tables, then search this integrated table in real time; or (3) build the integrated table by scraping/crawling the site’s HTML pages then, as in #2, search that table in real time. (A rough sketch of technique #2 follows this list.)
  • Make your search smart by ignoring noise words, tolerating (or correcting) spelling mistakes, understanding synonyms, etc.
  • Develop a ranking algorithm to display the most-relevant results first.
  • Provide users with advanced-search options such as Boolean logic and the ability to restrict the search to a subset of objects (e.g., only programs or only feeds).
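
Here’s that rough sketch of technique #2, using a throwaway SQLite schema purely for illustration (the real SpokenWord.org tables are different): a batch job flattens titles, descriptions and tags into one integrated table, and real-time searches hit only that table.

    import sqlite3

    def rebuild_search_index(conn):
        # Batch step: flatten the per-program text scattered across
        # separate tables into one integrated search table.
        cur = conn.cursor()
        cur.execute("DROP TABLE IF EXISTS search_index")
        cur.execute("""
            CREATE TABLE search_index (
                program_id INTEGER PRIMARY KEY,
                text       TEXT
            )""")
        cur.execute("""
            INSERT INTO search_index (program_id, text)
            SELECT p.id,
                   COALESCE(p.title, '') || ' ' ||
                   COALESCE(p.description, '') || ' ' ||
                   COALESCE(group_concat(t.tag, ' '), '')
            FROM programs p
            LEFT JOIN tags t ON t.program_id = p.id
            GROUP BY p.id""")
        conn.commit()

    def search(conn, query):
        # Real-time step: a naive substring match against the
        # integrated table (Solr/Lucene would do far better here).
        return conn.execute(
            "SELECT program_id FROM search_index WHERE text LIKE ?",
            ("%" + query + "%",)).fetchall()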

My fancy search code used technique #1, and the resulting code generated some of the longest, most confusing and slowest SQL queries I’ve ever seen. And it’s buggy. Solr uses technique #2, and that’s clearly better for all sorts of reasons. #3 seemed like a particularly poor solution because (a) you lose track of the differences between titles and tags, for example, and (b) it’s kludgy. Or so I thought.

But I’ve now implemented technique #3 by outsourcing the whole thing to Google Custom Search and the initial results are spectacular. Here’s why:

  • Scraping HTML may sound kludgy, but it works.
  • Google knows how to scrape web pages better than anyone.
  • So long as you’re keeping the text you want searched in the page (e.g., not served by Ajax), Google will find it.
  • Google’s smart-search, advanced-search and relevance-ranking are better than anything you can write or find elsewhere.
  • Google does all of this with their CPU cycles, not ours, thereby eventually saving us an entire server and its management.
  • Google allows educational institutions and non-profit organizations to disable ads.
  • Google does a better job of finding what you want than is possible using an in-house full-text search with lots of customized filtering options.

This last one is important. I spent a lot of time on giving users tools for narrowing their search. For example, I provided radio buttons to distinguish between programs, feeds and collections. But it annoyed even me that users had to check one of these buttons. People would search for “IT Conversations” and find nothing because the default was to search for individual programs, not feeds, and there are no individual programs with that string in their titles or descriptions. Annoying and confusing.

Then I had a moment of clarity. Rather than proactively providing users control of the object type up front, I came up with another scheme. I changed the HTML <title>s of the pages so that they now start with strings like Audio:, Video:, Feed: and Collection:. This way (once Google re-scrapes all quarter-million pages) the search results will allow you to immediately and clearly distinguish programs (by media type) from RSS/Atom feeds and personal collections. I’ve tried it on my development server and it’s great. Because of the value of serendipity and the fact that Google’s search is so good, I find it’s much more valuable to discover objects in this way than to specify a subset of the results in advance.
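
A minimal sketch of that title-prefixing idea, with hypothetical helper and field names (this isn’t the site’s actual page-rendering code):

    # Prefix each page's HTML <title> with its object type so the
    # Google search results distinguish programs (by media type),
    # feeds and collections at a glance.
    TYPE_PREFIXES = {
        "audio": "Audio:",
        "video": "Video:",
        "feed": "Feed:",
        "collection": "Collection:",
    }

    def page_title(object_type, name):
        # e.g. page_title("feed", "IT Conversations") -> "Feed: IT Conversations"
        prefix = TYPE_PREFIXES.get(object_type, "")
        return (prefix + " " + name).strip()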

Finally, I’ve discovered that Custom Search supports a feature from regular Google search. You can specify part of a URL as a filter. For example, if you want to search only for feeds, you can start your search string with “http://spokenword.org/feed”. The result will include only our feeds. Same for /collections, /members and /programs. How cool is that? (Thank goodness for RESTful URLs!) I have yet to integrate that into the web site — a weekend project — but it means we can offer the user the ability to restrict the search to a particular type of object if that’s what they want.
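
When I do integrate it, the wiring can be as simple as prepending the appropriate URL prefix to whatever the user types. A hedged sketch, mirroring the paths mentioned above (the helper name is an assumption, not real site code):

    # Restrict a Google Custom Search query to one object type by
    # starting the search string with that type's RESTful URL prefix.
    URL_PREFIXES = {
        "feeds": "http://spokenword.org/feed",
        "collections": "http://spokenword.org/collections",
        "members": "http://spokenword.org/members",
        "programs": "http://spokenword.org/programs",
    }

    def restricted_query(object_type, terms):
        # e.g. restricted_query("feeds", "technology")
        #      -> "http://spokenword.org/feed technology"
        prefix = URL_PREFIXES.get(object_type)
        return (prefix + " " + terms) if prefix else terms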

I’m so glad that Google Custom Search works as well as it does that I’ve decided not to brood about the six weeks of my life wasted designing, coding and debugging my own search. It was another one of those learning experiences.

Note: Not all of the features described above appear on SpokenWord.org yet, and the maximum benefit won’t be visible until Google re-scrapes the site, but if you use the Search box on the top of the right-hand column you’ll get the idea. Very cool.

The Submission Wizard

Making it easier to submit content to SpokenWord.org has always been high on the to-do list. For the past seven weeks I’ve been working on a Submission Wizard, which I hope goes a long way towards that goal. It’s a wizard because it takes what you give it and tries to figure out what you meant. If you supply the URL of a media file, it will then ask you for an associated web page from which it will suggest the title, description and keywords. If you start by supplying a web-page URL, the wizard will scrape that page looking for RSS/Atom and OPML feeds. And whether it finds those feeds or you explicitly supply a feed’s URL, the wizard will give you choices of what to submit and what to add to your collection(s) before showing you all the steps it takes to follow your instructions.
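
The feed-discovery step in particular is conceptually simple: fetch the page and look for the <link> elements that advertise RSS or Atom feeds. Here’s a minimal sketch of that idea in Python; the wizard’s actual scraper handles far more cases (including OPML), and none of these names come from the real code.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    class FeedLinkFinder(HTMLParser):
        # Collects the href of every <link rel="alternate"> whose type
        # names an RSS or Atom feed.
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.feeds = []

        def handle_starttag(self, tag, attrs):
            if tag != "link":
                return
            a = dict(attrs)
            rel = (a.get("rel") or "").lower()
            if rel == "alternate" and a.get("type") in FEED_TYPES and a.get("href"):
                self.feeds.append(urljoin(self.base_url, a["href"]))

    def discover_feeds(page_url):
        # Fetch the page and return any advertised feed URLs.
        html = urlopen(page_url).read().decode("utf-8", errors="replace")
        finder = FeedLinkFinder(page_url)
        finder.feed(html)
        return finder.feeds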

After the RSS/Atom feed parser, which continues to be a maintenance challenge, the Submission Wizard is probably the most-complex single piece of code for the site. It weighs in at about 6,000 lines of new code and it’s certainly not done. Give it a try, and if it doesn’t do what you think it should, let me know. I’m particularly interested in finding more web pages that the Submission Wizard can learn how to scrape.