Cloudflare now populating and using the Internet Archive’s Wayback Machine in its content distribution network application

Cloudflare and the Internet Archive are now working together to help make the web more reliable. Websites that enable Cloudflare’s Always Online service will now have their content automatically archived, and if by chance the original host is not available to Cloudflare, then the Internet Archive will step in to make sure the pages get through to users.
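A rough sketch of that fallback, for readers who think in code. This is not Cloudflare's implementation. The names `serve_with_fallback`, `fetch_origin`, `fetch_archived`, and `OriginUnavailable` are hypothetical, and the two fetchers are passed in as plain functions so the logic can be read and run without any network access.

```python
from typing import Callable, Optional, Tuple


class OriginUnavailable(Exception):
    """Hypothetical error a fetcher raises when the origin host cannot be reached."""


def serve_with_fallback(
    url: str,
    fetch_origin: Callable[[str], str],
    fetch_archived: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (source, body): the live page when possible, else an archived copy."""
    try:
        # Normal case: the origin host answers and its content is served.
        return "origin", fetch_origin(url)
    except OriginUnavailable:
        # Origin is down: ask the archive for its most recent copy instead.
        archived = fetch_archived(url)
        if archived is None:
            # Nothing has been archived for this URL, so re-raise the origin error.
            raise
        return "archive", archived
```

Returning the source alongside the body lets a caller label the page as an archived copy rather than passing it off as live content.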

Cloudflare has become core infrastructure for the Web, and we are glad we can be helpful in making a more reliable web for everyone.

“The Internet Archive’s Wayback Machine has an impressive infrastructure that can archive the web at scale,” said Matthew Prince, co-founder and CEO of Cloudflare. “By working together, we can take another step toward making the Internet more resilient by stopping server issues for our customers and in turn from interrupting businesses and users online.”

For more than 20 years, the Internet Archive’s Wayback Machine has been archiving much of the public Web and making those archives available to journalists, researchers, activists, academics and the general public, in total to hundreds of thousands of people a day. To date, more than 468 billion Web pages are available via the Wayback Machine, and we are adding more than 1 billion new archived URLs per day.

We archive URLs identified through a variety of methods: “crawling” from lists of millions of sites, submissions by users via the Wayback Machine’s “Save Page Now” feature, links added to Wikipedia articles, URLs referenced in Tweets, and a number of other “signals” and sources, such as multiple feeds of “news” stories.

An additional source of URLs we will preserve now originates from customers of Cloudflare’s Always Online service. As new URLs are added to sites that use that service they are submitted for archiving to the Wayback Machine. In some cases this will be the first time a URL will be seen by our system and result in a “First Archive” event.

In all cases those archived URLs will be available to anyone who uses the Wayback Machine.
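One way to check whether a given URL already has an archived copy is the Wayback Machine's public availability endpoint. The sketch below only builds the query URL and reads a response body. It assumes the endpoint `https://archive.org/wayback/available` and a JSON reply with an `archived_snapshots.closest` object, as publicly documented at the time of writing. The helper names `availability_query` and `closest_snapshot` are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL; timestamp is an optional YYYYMMDD[hhmmss] hint."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a JSON reply, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Fetching the query URL with any HTTP client and passing the response text to `closest_snapshot` yields either a snapshot link or `None`. The parsing is kept separate from the network call so it can be checked against a saved reply.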

By joining forces on this project we can do a better job of backing up more of the public Web, and in so doing help make the Web more useful and reliable.

If you have suggestions about how we can continue to improve our services, please don’t hesitate to drop us a note at info@archive.org.

19 thoughts on “Cloudflare and the Wayback Machine, joining forces for a more reliable Web”

Asmat Ullah September 18, 2020 at 1:28 am

Hi,
Excellent article. I definitely love this site. Stick with it!

Tom September 18, 2020 at 10:27 am

Woah, this is an AWESOME collaboration.

Question: what would the process be to opt out of this feature for Cloudflare users? At the moment, you can block Wayback Machine crawlers via robots.txt, but this feature would negate that.

Mark Graham (post author) September 19, 2020 at 9:05 pm

Cloudflare’s “Always Online” feature is only available to people who opt-in to use it.

Rajneesh Rana September 18, 2020 at 12:23 pm

Wayback machine has been really helpful in finding content from old and expired domains.
Hope the partnership with Cloudflare will allow even more sites to be indexed and added to the archive.
Great work. Thanks for the service in preserving the internet.

search engine September 18, 2020 at 10:46 pm

so great and informative post
thank you

Saheed September 20, 2020 at 2:26 pm

I’ve been using Cloudflare for more than 2 years now and I love it. This new addition is definitely a must use for me, I think it’ll be a good one for the little <1% downtime that my blog experiences monthly.

It won't even be noticeable, it is better than to have a downtime page.

Frank obama September 26, 2020 at 10:57 pm

I agree totally, it’s really a great combination. Am so looking out for it.

Joy September 20, 2020 at 2:55 pm

Really a better step to make web more useful.
I have a few questions related to this:
1. What if the webmaster blocks the Internet Archive bot in the .htaccess file? Will it still archive pages through/via Cloudflare?
2. How about password-protected/payment-wall pages where a publisher/webmaster shows the first paragraph of an article and hides the rest behind a payment wall? Will such pages be archived by the Internet Archive through Cloudflare?
3. Can a webmaster select which of their site’s pages are crawled/indexed by the Internet Archive via Cloudflare? Suppose I want the homepage and product pages, but not the shopping-cart page, to be archived.

Mark Graham (post author) September 29, 2020 at 7:38 pm

Customers of Cloudflare need to opt-in to use their “Always Online” service.

medi September 21, 2020 at 5:20 am

Is this the reason I am getting an error when using web.archive.org/save/youtube.com?

We can’t retrieve all the files we need to display that page. Please try again later.

Chris Lever September 21, 2020 at 10:25 am

If I’m reading this correctly, Cloudflare users will have their pages cached and indexed in the Wayback Machine?

Will there be an opt-out option?

Mark Graham (post author) September 29, 2020 at 7:36 pm

Customers of Cloudflare have to opt-in to use their “Always Online” service.

Ikom Wrestling September 23, 2020 at 2:55 am

Any plan for IPv6 support in Internet Archive?

Flipkart Offer September 26, 2020 at 9:10 am

Nice post thank you so much

Younes Ben Amara September 26, 2020 at 11:28 pm

That’s good news for both users and website owners. Thank you!

Capptain Mark September 27, 2020 at 7:43 am

I have been using Cloudflare for 4 months. They are providing the best services.

Volkan September 29, 2020 at 7:24 pm

This is an awesome collaboration. Great work. Thanks for the service in preserving the internet.

Nemo September 30, 2020 at 7:41 am

This is taking the meaning of “always online” a big step further!

It’s wonderful to have more such extensive and broad-based sources of URLs. I wish big search engines and DNS servers could find a privacy-friendly way to contribute too.

CloudFlare states both that they have their own crawler and that they’ll send the hostname (not specific URLs?) to Internet Archive for its own crawl.

> Enabling Always Online in the Cloudflare dashboard allows us to share your hostname with the Wayback Machine so that they can archive your website. When a website’s origin is down, Cloudflare will go to the Internet Archive to retrieve the most recently archived version of the site, so that visitors will still be able to view the site’s content.

https://blog.cloudflare.com/cloudflares-always-online-and-the-internet-archive-team-up-to-fight-origin-errors/

> Our User Agent Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +https://www.cloudflare.com/always-online) AppleWebKit/534.34

https://www.cloudflare.com/always-online/

It seems however that the latter page is actually outdated, based on this comment from jgc at CloudFlare:

> Uh, no. We’re literally doing the opposite. We used to have our own caching infrastructure for “Always Online” and we’re getting rid of it and using archive.org instead. […] We tell archive.org about the URI, they crawl it. They handle robots.txt.

https://news.ycombinator.com/item?id=24506956

So it’s definitely not about CloudFlare making or sharing any WARCs by itself. Reusing Internet Archive’s services (for a decent fee, I suppose) is good use of resources!

Emma lash September 30, 2020 at 4:26 pm

Wayback machine has been really helpful in finding content from old and expired domains


Internet Archive Blogs

A blog from the team at archive.org

Cloudflare and the Wayback Machine, joining forces for a more reliable Web
