Bad RSS

The greatest challenge in keeping SpokenWord.org running on a daily basis is dealing with rogue RSS feeds. We’ve got a bit over 3,000 feeds at the moment, most of which are being scanned every hour. But I just checked the admin report, and 27 feeds (nearly 1%) have been disabled for one reason or another.

For those of you in control of your feeds, here are some of the problems we encounter on a regular basis.

  • HTTP 404 errors. If your feed URL returns Not Found, or your server isn’t accessible at all, we can’t read your feed.
  • Invalid characters. One bad character in your feed keeps our parser from reading the whole thing.
  • Missing GUIDs. Globally Unique IDs (GUIDs) are how we tell whether an <item> is new or an edited version of one we already have.
  • Duplicate GUIDs. (They’re supposed to be Unique!)
  • Incorrect MIME types. Should be one of:
    • application/rss+xml
    • application/atom+xml
    • application/xml
    The following are common, but they’re wrong:
    • text/xml
    • text/plain
    • text/html
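
For feed publishers who want to check their own headers, here is a minimal sketch of the MIME type test described in the list above. It is not SpokenWord.org's actual code; the function names are hypothetical, and it assumes you already have the raw Content-Type header value from your server's response.

```python
# The three types the list above names as correct for feeds.
ACCEPTED_FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
}


def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_accepted_feed_type(content_type_header: str) -> bool:
    """True if the header names one of the accepted feed types."""
    return media_type(content_type_header) in ACCEPTED_FEED_TYPES
```

A header such as "application/rss+xml; charset=utf-8" passes because the charset parameter is stripped before the comparison, while "text/xml" is rejected.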

The GUID issues deserve more discussion. When you rescan feeds every hour, one of the biggest challenges is figuring out whether an <item> is old, new or modified. Here’s our logic:

  • If we’ve never seen this GUID before, we assume it’s a new <item>.
  • If we’ve already ingested an <item> with this GUID, we check all the pertinent elements and attributes for changes.

The GUID allows you to make changes like correcting a spelling error in a title. We see the unchanged GUID, notice that the title has changed, and just replace the title. Without the GUID, we have a helluva time trying to figure out whether an <item> with a one-character change in its title is just that or a whole new program. We want you to be able to correct your titles, descriptions and media URLs without our system creating a duplicate program. Only your proper use of GUIDs makes that possible.
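
The new-versus-modified logic above can be sketched roughly as follows. This is an illustration only, not the code SpokenWord.org runs: the Item fields, the classify function, and the dictionary of previously seen items are all assumptions made for the example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Item:
    """Hypothetical subset of the <item> fields worth comparing."""
    guid: str
    title: str
    description: str
    media_url: str


def classify(item, seen):
    """Compare an incoming item with what was stored under the same GUID.

    `seen` maps GUID -> previously ingested Item.
    Returns ("new", []), ("unchanged", []) or ("modified", [changed field names]).
    """
    stored = seen.get(item.guid)
    if stored is None:
        # Never seen this GUID before: treat it as a new program.
        return "new", []
    changed = [
        name for name, value in asdict(item).items()
        if getattr(stored, name) != value
    ]
    return ("modified", changed) if changed else ("unchanged", [])
```

With a stable GUID, a corrected title comes back as ("modified", ["title"]), so only the title needs replacing and no duplicate program is created.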

Once you assign a GUID to a program, never change it. That means never. And make sure your GUIDs are truly globally unique. Using a unique URL from your site as a GUID is a good way to do this. No other site is likely to include http://yourdomain in its GUIDs. And never, never, never reuse a GUID for another program. You’d be amazed at the number of feeds that include the same GUID for more than one <item>. I’ve designed our system to immediately disable any feed in which a duplicate GUID is detected.
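
If you want to check your own feed before we do, a duplicate-GUID scan is easy to sketch with Python's standard library. The function below is a hypothetical example, not our ingestion code, and it assumes a plain RSS 2.0 document in which each <item> carries an un-namespaced <guid> child.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def guid_problems(rss_xml):
    """Return (number of items with no GUID, sorted list of GUIDs used more than once)."""
    root = ET.fromstring(rss_xml)
    guids = []
    missing = 0
    for item in root.iter("item"):
        # findtext returns None when <guid> is absent and "" when it is empty.
        guid = (item.findtext("guid") or "").strip()
        if guid:
            guids.append(guid)
        else:
            missing += 1
    duplicates = sorted(g for g, count in Counter(guids).items() if count > 1)
    return missing, duplicates
```

A feed is clean when this returns (0, []). Anything in the second element is a GUID shared by more than one <item>, which is exactly the condition that gets a feed disabled here.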

As a somewhat defensive move, but also to help those who submit RSS/Atom feeds to SpokenWord.org, I’ve added code that runs submitted feeds through the W3C RSS Validator. We’ll accept Warnings, but if your feed generates Errors from the validator, we will reject it. My next step is to call the W3C validator whenever we encounter a problem with a feed we’ve already accepted, and to disable any such feed that doesn’t validate.
