As Jon Udell pointed out, Amazon’s S3 service is filled with potential. But I’m looking for an enhancement, one that, if implemented, would add instant scalability and reliability to hundreds of thousands of applications. I didn’t invent caching or CDNs, but I’ve been a huge fan of this architecture for many years, and I wish it were more common in the web-hosting industry. Here’s a copy of a post I just left on the Amazon Web Services Developer Connection forum:
I wonder if there’s a way to use S3 as a cache or content-delivery network (CDN)?
We, like others, have an application containing a large number of large objects. The challenge is that while each of them may be modified each day, relatively few of them are downloaded by the public on any given day. Pushing new versions of every object to S3 each day would waste a great deal of bandwidth, since most of the updated versions won’t be accessed.
This is why we like caching/CDN architectures, and it’s something I’d love to see S3 support. It’s an extraordinarily cool architecture that painlessly gives small-server apps large-server scalability. Here’s how I imagine it working:
- We (the S3 customer) upload an object using the APIs.
- Along with this upload, we specify an “Origin Server URL” on our own servers where we have stored the original copy of the object.
- We publish the S3 public URL of the object for external access by the public.
- When S3 receives a request for the registered object, it first sends an HTTP HEAD request to our origin server to see whether the object has changed.
- If the object has not changed since the most-recently uploaded version, or if the origin server doesn’t respond promptly for whatever reason, S3 delivers the object to the public requester.
- However, if the object on the origin server is newer than S3’s copy, S3 fetches the new copy from the origin server and, while doing so, delivers that version to the requester.
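The flow above can be sketched in a few lines of Python. This is purely illustrative: a plain dict stands in for S3’s object store, and the origin registration and HEAD check are the *proposed* feature, not anything S3 actually supports, so all the names here are made up.

```python
import email.utils
import urllib.request

# Hypothetical sketch of the proposed S3 origin-check feature.
cache = {}  # public key -> {"body", "origin", "last_modified"}

def register(key, body, origin_url, last_modified):
    """Steps 1-2: the customer uploads an object and names its origin URL."""
    cache[key] = {"body": body, "origin": origin_url,
                  "last_modified": last_modified}

def serve(key):
    """Steps 4-6: HEAD the origin; refetch only if its copy is newer."""
    entry = cache[key]
    try:
        req = urllib.request.Request(entry["origin"], method="HEAD")
        with urllib.request.urlopen(req, timeout=2) as resp:
            stamp = resp.headers["Last-Modified"]
        origin_time = email.utils.parsedate_to_datetime(stamp).timestamp()
        if origin_time > entry["last_modified"]:
            # Origin copy is newer: fetch it, store it, then serve it.
            with urllib.request.urlopen(entry["origin"], timeout=5) as resp:
                entry["body"] = resp.read()
            entry["last_modified"] = origin_time
    except Exception:
        # Origin slow or unreachable: fall back to the cached copy (step 5).
        pass
    return entry["body"]
```

Note how the `except` clause captures the reliability claim: if the origin check fails for any reason, the cached copy is served anyway.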
If you’ve ever used a CDN or even a standard cache (like Squid) you know how brilliant this architecture can be. As I mentioned above, it *instantly* adds scalability and reliability to a small-server application. (If S3’s HEAD request fails for whatever reason, it returns its most-recent version of the object to the requester.)
An app developer can then simply write new or modified objects to his local low-capacity, low-cost server then use the APIs to upload to S3. That’s it. Done. Got an update or new version? Just write it to the origin server. Your local server goes down? No problem. The S3 infrastructure keeps on ticking.
S3’s pricing of $0.20 (USD) per GB of traffic is very good, and extremely good compared with commercial CDNs. But if you have to upload all your objects every day even though most are never downloaded, the economics deteriorate rapidly. Caching solves all of that.
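A back-of-envelope comparison makes the point. All the figures here are assumptions chosen for illustration, not our actual numbers:

```python
# Assumed workload: 100,000 objects of 1 MB each, all rewritten daily,
# but only 2% actually downloaded on a given day. S3 transfer at $0.20/GB.
objects = 100_000
size_gb = 1 / 1024              # 1 MB per object, in GB
price_per_gb = 0.20
hit_rate = 0.02                 # fraction of objects the public fetches daily

push_everything = objects * size_gb * price_per_gb             # re-upload all
pull_on_demand = objects * hit_rate * size_gb * price_per_gb   # misses only

print(f"push all:  ${push_everything:.2f}/day")
print(f"on demand: ${pull_on_demand:.2f}/day")
```

Under these assumptions the daily re-upload costs roughly fifty times as much as pulling only what visitors actually request.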
5 thoughts on “Amazon S3 as a CDN?”
Can you not do this in reverse?
1. Your server handles all requests.
2. All resource requests are actually filled from Amazon – your server is just handling the redirection.
3. Where a resource is out of date on Amazon your server would upload it first to Amazon and then return a reference to the uploaded object.
[If this imposes too long a delay you could serve it directly from your server, but that breaks the model.]
If your server is not serving content but just redirecting links, it should be able to handle high volumes?
This sounds like you want a version control system. As Ewan says, use your host as a broker and ensure your application points to the latest version of the asset. That gives you total control over what’s served from the cache and also lets you transparently change caching services should you need to. Also, any requests transacted between S3 and your site will have a cost in terms of performance, and unless the two live close together that may well outweigh the benefit of the caching infrastructure, unless of course it’s the bandwidth you’re worried about and not the performance.
Systems I am involved in take a file-versioning approach: the origin server contains a copy of every version of a file, new versions are pushed there, and front-facing references to assets are updated when a new version comes along. You could then go back and clean up any old versions on the cache to reduce your storage costs.
Caching or CDNs, particularly where you can locate your cache geographically close to your end users, are essential for any large-scale web application.
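The versioning approach this commenter describes is often done by embedding a content fingerprint in each object’s key, so every new version gets a new, immutable URL and no freshness check is ever needed. A minimal sketch, with an illustrative bucket URL:

```python
import hashlib

def versioned_key(filename, content):
    """Embed a short content hash in the key: logo.png -> logo.<hash>.png."""
    digest = hashlib.sha1(content).hexdigest()[:8]
    name, dot, ext = filename.rpartition(".")
    return f"{name}.{digest}.{ext}" if dot else f"{filename}.{digest}"

def public_url(filename, content):
    # Hypothetical bucket; pages reference this URL, updated on each release.
    return "http://s3.amazonaws.com/my-bucket/" + versioned_key(filename, content)
```

Because a changed file produces a different key, references in your pages always point at exactly the version you published, and stale copies can be deleted at leisure.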
Hey, good idea.
I just don’t understand the part about specifying the “Origin Server URL”. Do I understand you correctly that you would specify it somewhere on the bucket? Or is it a property of the file you are uploading?
Can you tell me in more detail how to set up my S3 bucket so it checks for the newer version first?
Another question. Does this mean the server checks for a new version every time a request comes in? In that case it seems pointless to put objects on S3, since your origin server would always be receiving version-validation messages. Don’t you think so?
Amazon? Infrastructure? Started with books. Soon added CDs & DVDs. Toys R Us, Borders, Target. zShops, Marketplace, E-Commerce API. People building their businesses on Amazon is cool. What else do we have lurking in the corners? I’m not totally sure how Amazon came up with AWS, but I’ll bet it went something like this. It sure makes sense that they came to like having businesses build on top of them and their expertise. And I don’t buy the argument that this is silly because Amazon’s a bookseller. What a dumb argument. In reality, Amazon’s finding ways to monetize other things they do well. More businesses should do this.
Amazon is building a CDN.