From the TMI category for most of you, but of interest to at least two readers…
We’re getting some bogus RSS feeds, some of them from otherwise respectable media sources. One class of problems has to do with GUIDs (Globally Unique IDs). In particular, we’re seeing a single GUID being used for several different programs, which violates the whole idea of a GUID. We thought we could depend on the GUID as the sole means of identifying a program, but when a site reuses a GUID, the program with that GUID appears to change more than once on every scan of the feed, which drives our updating logic crazy. Here’s what I think we’re going to do:
- If any <guid> appears more than once in the current <item>s of a feed, we’ll never depend on GUIDs for that feed again.
- If we’ve never seen such a duplicate GUID, we’ll use each <item>’s GUID as it’s supposed to be used: to uniquely identify the program.
- If we’ve ever found a duplicate GUID for a feed, we’ll ignore its GUIDs and look instead at the <title> elements and the url attributes of the <enclosure> elements.
- If either the title OR the URL of an item matches one that’s in the database for this feed, we’ll assume the scanned item is just a modified version of the program previously found. The reason is that we occasionally see a site change the title or the media filename of a program, but rarely both at once. (If they do change both at once AND they’ve ever used dupe GUIDs, there’s not really much we can do. We have to assume it is indeed a new program.)
- IOW, if the GUIDs have been bogus and neither an item’s title NOR media URL can be found in the database, we’ll assume it’s a new program.
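For the two readers who care, here is roughly how those rules might look in Python. This is only a sketch under assumptions, not our actual code: the Item class and the names has_duplicate_guids, update_trust, and find_existing are made up for illustration, and the "database" is just a list of previously stored items for one feed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Item:
    """Hypothetical record for one scanned or stored <item>."""
    guid: str   # text of the <guid> element
    title: str  # text of the <title> element
    url: str    # url attribute of the <enclosure> element


def has_duplicate_guids(items: Iterable[Item]) -> bool:
    """True if any GUID appears more than once among the current items."""
    counts = Counter(item.guid for item in items)
    return any(n > 1 for n in counts.values())


def update_trust(previously_trusted: bool, items: Iterable[Item]) -> bool:
    """Once a feed has shown a duplicate GUID, its GUIDs are never trusted again."""
    return previously_trusted and not has_duplicate_guids(items)


def find_existing(item: Item, stored: List[Item], guids_trusted: bool) -> Optional[Item]:
    """Return the stored program this item updates, or None if it looks new."""
    if guids_trusted:
        # Normal case: the GUID alone identifies the program.
        matches = (s for s in stored if s.guid == item.guid)
    else:
        # GUIDs are bogus for this feed: match on title OR media URL.
        matches = (s for s in stored if s.title == item.title or s.url == item.url)
    return next(matches, None)
```

The value returned by update_trust would have to be saved with the feed between scans, since the "never again" rule only works if the flag sticks. When find_existing returns None, the item is treated as a new program.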
2 thoughts on “Bad RSS”
Now isn’t that frustrating! GUID – not really that difficult to understand, is it?
Thanks for sharing, Doug. I’m storing some similar data in my podcatching tool, and I’ve been worrying over the same issues. I like seeing how other people solve the problem.