Adventures in Full-Text Search

SpokenWord.org calls itself a site for “finding and sharing audio and video spoken-word recordings.” Sounds great, but our “finding” capabilities (search, in particular) have been pretty bad. In mid-March I started writing a fancy new full-text search module that worked across database tables and allowed all sorts of customization and advanced-search features. Six weeks and a few thousand lines of code later, I had a new system that…well, sucked. There are all sorts of reasons why, but it sucked. Bottom line: It just didn’t do a decent job of finding stuff.

I then considered implementing something like Solr, based on Lucene. But the more I thought about it, the more I realized that would be only marginally better.

Searching for audio and video programs from a database that will hit 250,000 in the next few hours comes down to a few architectural issues:

  • You’ve got to search the text of titles, descriptions, keywords, tags and comments, which in our case are stored in separate database tables.
  • There are three ways of doing this: (1) read the database tables in which these strings are stored in real time; (2) in background/batch, build a separate table of the integrated text from the separate tables, then search this integrated table in real time; or (3) build the integrated table by scraping/crawling the site’s HTML pages then, as in #2, search that table in real time.
  • Make your search smart by ignoring noise words, being tolerant of (or correct) spelling mistakes, understand synonyms, etc.
  • Develop an ranking algorithm to display the most-relevant results first.
  • Provide users advanced-search options such as boolean logic and restricting the search to a subset of objects such as only searching programs or only searching feeds.

My fancy search code used method #1 and the resulting code generated some of the longest, most confusing and slowest SQL queries I’ve ever seen. And it’s buggy. Solr uses technique #2, and that’s clearly better for all sorts of reasons. #3 seemed like a particularly poor solution because (a) you lose track of the differences between titles and tags, for example, and (b) it’s kludgy. Or so I thought.

But I’ve now implemented technique #3 by outsourcing the whole thing to Google Custom Search and the initial results are spectacular. Here’s why:

  • Scraping HTML may sound kludgy, but it works.
  • Google knows how to scrape web pages better than anyone.
  • So long as you’re keeping the text you want searched in the page (eg, not served by Ajax) Google will find it.
  • Google’s smart-search, advanced-search and relevance-ranking are better than anything you can write or find elsewhere.
  • Google does all of this with their CPU cycles, not ours, thereby eventually saving us an entire server and its management.
  • Google allows educational institutions and non-profit organizations to disable ads.
  • Google does a better job of finding what you want than is possible using an in-house full-text search with lots of customized filtering options.

This last one is important. I spent a lot of time on giving users tools for narrowing their search. For example, I provided radio buttons to distinguish between programs, feeds and collections. But it annoyed even me that users had to check one of these buttons. People would search for “IT Conversations” and find nothing because the default was to search for individual programs not feeds and there are no individual programs with that string in their titles or descriptions. Annoying and confusing.

Then I had a moment of clarity. Rather than proactively providing users control of the object type up front, I came up with another scheme. I changed the HTML <title>s of the pages so that they now start with strings like Audio:, Video:, Feed: and Collection:. This way (once Google re-scrapes all quarter-million pages) the search results will allow you to immediately and clearly distinguish programs (by media type) from RSS/Atom feeds and personal collections. I’ve tried it on my development server and it’s great. Because of the value of serendipity and the fact that Google’s search is so good, I find it’s much more valuable to discover objects in this way than to specify a subset of the results in advance.

Finally, I’ve discovered that Custom Search supports a feature from regular Google search. You can specify part of a URL as a filter. For example, if you want to search only for feeds, you can start your search string with “http://spokenword.org/feed&#8221;. The result will only include our feeds. Same for /collections, /members and /programs. How cool is that? (Thank goodness for RESTful URLs!) I have yet to integrate that into the web site — a weekend project — but it means we can offer the user the ability to restrict the search to a particular type of object if that’s what they want.

I’m so glad that Google Custom Search works as well as it does, that I’ve decided not to brood about the six weeks of my life wasted designing, coding and debugging my own search. It was another one of those learning experiences.

Note: Not all of the features described above appear on SpokenWord.org yet, and the maximum benefit won’t be visible until Google re-scrapes the site, but if you use the Search box on the top of the right-hand column you’ll get the idea. Very cool.

5 thoughts on “Adventures in Full-Text Search

  1. we had the same experience with our online journals (http://www.livingreviews.org/). once your stuff is on the web in a reasonable way (i.e. RESTish, with good URLs and not too much ajax), you’re done search-wise. and when the site is reasonably fast and popular, even the delay between publication of new pages and them being searchable isn’t more than a day (we use sitemaps to notify the search engines). the last thing we are fighting with is frames. i guess bad design decisions do come back and bite you.

    Like

  2. (4) Create a functional index over to_tsvector(fields in table that need to be searched) for each table with text data, query the tables with a UNION
    (5) Create a field of type tsvector in each table that needs to be queried, populate it and then update it (with a AFTER TRIGGER) on each change, query as in (4).

    s/tsvector/whatever field type your DBMS uses for lexemes/g.

    Unless you’re using MySQL, then you should probably just go with Google Search.

    Like

  3. Doug, if you haven’t already, you should talk to the folks at EveryZing (http://everyzing.com). You may remember their former incarnation running the now defunct PodZinger service. They have audio/video search that can’t be beat. Normally they sell it dear, but perhaps with your non-profit status you could work something out with them. It’s worth talking to them, I think.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s