
Cloudflare now populating and using the Internet Archive’s Wayback Machine in its content distribution network application
Cloudflare and the Internet Archive are now working together to help make the web more reliable. Websites that enable Cloudflare’s Always Online service will now have their content automatically archived, and if by chance the original host is not available to Cloudflare, then the Internet Archive will step in to make sure the pages get through to users.
Cloudflare has become core infrastructure for the Web, and we are glad we can be helpful in making a more reliable web for everyone.
“The Internet Archive’s Wayback Machine has an impressive infrastructure that can archive the web at scale,” said Matthew Prince, co-founder and CEO of Cloudflare. “By working together, we can take another step toward making the Internet more resilient by stopping server issues for our customers and in turn from interrupting businesses and users online.”
For more than 20 years the Internet Archive’s Wayback Machine has been archiving much of the public Web, and making those archives available to journalists, researchers, activists, academics and the general public, in total to hundreds of thousands of people a day. To date more than 468 billion Web pages are available via the Wayback Machine and we are adding more than 1 billion new archived URLs/day.
We archive URLs that are identified via a variety of different methods, such as “crawling” from lists of millions of sites, as submitted by users via the Wayback Machine’s “Save Page Now” feature, added to Wikipedia articles, referenced in Tweets, and based on a number of other “signals” and sources, such as multiple feeds of “news” stories.
An additional source of URLs we will preserve now originates from customers of Cloudflare’s Always Online service. As new URLs are added to sites that use that service they are submitted for archiving to the Wayback Machine. In some cases this will be the first time a URL will be seen by our system and result in a “First Archive” event.
In all cases those archived URLs will be available to anyone who uses the Wayback Machine.
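The fallback behavior described above can be illustrated with a minimal sketch. This is not Cloudflare's or the Internet Archive's actual implementation: the function names `wayback_url` and `fetch_with_fallback` are hypothetical, the two fetchers are supplied by the caller, and the sketch assumes that a Wayback Machine address without a timestamp resolves to the most recent capture.

```python
from typing import Callable

# Public Wayback Machine replay prefix. Assumption: without a timestamp,
# the Wayback Machine redirects to the most recent capture of the URL.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str) -> str:
    """Build the Wayback Machine address for an original URL."""
    return WAYBACK_PREFIX + original_url


def fetch_with_fallback(
    url: str,
    fetch_origin: Callable[[str], str],
    fetch_archive: Callable[[str], str],
) -> str:
    """Try the origin first; on a connection failure, use the archived copy."""
    try:
        return fetch_origin(url)
    except ConnectionError:
        # Origin unreachable: fall back to the archived version.
        return fetch_archive(wayback_url(url))
```

Passing the fetchers in as arguments keeps the sketch independent of any particular HTTP library; a real deployment would also need to handle timeouts, HTTP error codes, and pages that have never been archived.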
By joining forces on this project we can do a better job of backing up more of the public Web, and in so doing help make the Web more useful and reliable.
If you have suggestions about how we can continue to improve our services, please don’t hesitate to drop us a note at info@archive.org.
Hi,
Excellent article. I definitely love this site. Stick with it!
Woah, this is an AWESOME collaboration.
Question: what would the process be to opt out of this feature for Cloudflare users? At the moment, you can block Wayback Machine crawlers via robots.txt, but this feature would negate that.
Cloudflare’s “Always Online” feature is only available to people who opt in to use it.
Wayback machine has been really helpful in finding content from old and expired domains.
Hope partnership with cloudflare will allow even more sites to be indexed and added to archive.
Great work. Thanks for the service in preserving the internet.
so great and informative post
thank you
I’ve been using Cloudflare for more than 2 years now and I love it. This new addition is definitely a must use for me, I think it’ll be a good one for the little <1% downtime that my blog experiences monthly.
It won't even be noticeable, it is better than to have a downtime page.
I agree totally, it's really a great combination. Am so looking out for it.
Really a better step to make web more useful.
I have a few questions related to this:
1. What if the webmaster blocks Internet Archive bot in htaccess file? Will it still archive pages through/via cloudflare?
2. How about password protected/payment wall pages where a publisher/webmaster shows first paragraph of an article and hides the rest behind a payment wall? Will such pages be archived by Internet Archive through cloudflare?
3. Can a webmaster select which of their sites pages be crawled/indexed by Internet Archive via Cloudflare? Suppose I want Homepage, Products pages, but not the shopping-cart page to be archived.
Customers of Cloudflare need to opt in to use their “Always Online” service.
Is this the reason I am getting an error when using web.archive.org/save/youtube.com?
We can’t retrieve all the files we need to display that page. Please try again later.
If I’m reading this correctly, Cloudflare users will have their pages cached and indexed in the Wayback Machine?
Will there be an opt-out option?
Customers of Cloudflare have to opt in to use their “Always Online” service.
Any plan for IPv6 support in Internet Archive?
Nice post thank you so much
That’s good news for both users and website owners. Thank you!
I have been using Cloudflare for 4 months. They are providing the best services.
This is an awesome collaboration. Great work. Thanks for the service in preserving the internet.
This is taking the meaning of “always online” a big step further!
It’s wonderful to have more such extensive and broad-based sources of URLs. I wish big search engines and DNS servers could find a privacy-friendly way to contribute too.
CloudFlare states both that they have their own crawler and that they’ll send the hostname (not specific URLs?) to Internet Archive for its own crawl.
> Enabling Always Online in the Cloudflare dashboard allows us to share your hostname with the Wayback Machine so that they can archive your website. When a website’s origin is down, Cloudflare will go to the Internet Archive to retrieve the most recently archived version of the site, so that visitors will still be able to view the site’s content.
https://blog.cloudflare.com/cloudflares-always-online-and-the-internet-archive-team-up-to-fight-origin-errors/
> Our User Agent Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +https://www.cloudflare.com/always-online) AppleWebKit/534.34
https://www.cloudflare.com/always-online/
It seems however that the latter page is actually outdated, based on this comment from jgc at CloudFlare:
> Uh, no. We’re literally doing the opposite. We used to have our own caching infrastructure for “Always Online” and we’re getting rid of it and using archive.org instead. […] We tell archive.org about the URI, they crawl it. They handle robots.txt.
https://news.ycombinator.com/item?id=24506956
So it’s definitely not about CloudFlare making or sharing any WARCs by itself. Reusing Internet Archive’s services (for a decent fee, I suppose) is good use of resources!