Category Archives: Wayback Machine – Web Archive

Keep the News in the Wayback Machine

For nearly 30 years, the Internet Archive’s Wayback Machine has helped preserve the public record.

It has captured more than 1 trillion web pages, documented history in real time, and ensured that journalists, researchers, historians, librarians, and the public can continue to access reporting long after stories are published. From breaking news and investigative journalism to local reporting and public statements, the Wayback Machine has become essential infrastructure for the public’s ability to preserve online news and culture.

Now, that preservation work is under threat.

As reported by Nieman Lab and WIRED, some publishers are blocking the Wayback Machine from preserving their reporting. As a result, some of the most important journalism being produced today may no longer be independently archived for future generations. For details on the publisher blocking, check out our FAQ: Publishers Blocking the Wayback Machine.

In response to these blocks, Fight for the Future has launched an open letter calling on major media organizations to work with the Internet Archive to ensure the news remains preserved and accessible in the Wayback Machine.

Sign the open letter here: https://www.savethearchive.com/NewsLeaders

The letter argues that preserving journalism is not only about access today, but about protecting the historical record itself:

“The freedom of journalists isn’t only the freedom to write, it’s also the freedom to have your work read and remembered for generations to come.”

At a moment when misinformation spreads rapidly, links disappear, websites change, and pressure to alter or erase reporting continues to grow, independent web preservation matters more than ever. The Wayback Machine helps make journalism more resilient by ensuring published reporting can still be referenced, verified, and studied years later.

The campaign also highlights a growing contradiction: while many publishers rely on the Wayback Machine for reporting, research, and fact-checking, some are simultaneously preventing their own journalism from being preserved.

The Internet Archive has long worked collaboratively with publishers and respects requests around access and preservation. The Wayback Machine has been designed for preservation: helping ensure that the historical record of the web is not lost.

If you believe journalism should remain accessible to historians, researchers, educators, and future generations, we encourage you to add your name to the letter.

Sign the open letter here: https://www.savethearchive.com/NewsLeaders

A Thank You to Journalists Supporting the Wayback Machine

As publishers block the Internet Archive’s Wayback Machine over unfounded concerns about AI scraping, hundreds of journalists have signed a public letter supporting the Wayback Machine and the importance of preserving the online historical record. Below, Mark Graham, director of the Wayback Machine, shares a message of thanks to the journalism community for standing up for web preservation, accountability, and access to the public record.

Journalists who would like to add their names can sign the letter here, and members of the public can sign the broader support letter here.


Dear colleagues,

On behalf of all of us at the Internet Archive, I want to thank you.

Your support for the Wayback Machine sends a clear message: preserving the record matters.

For thirty years, the Wayback Machine has worked in the background, preserving more than 1 trillion web pages so that reporting doesn’t simply vanish with the next site redesign or corporate decision. Today, more than 100 news articles every month reference, cite, or rely on material preserved by the Wayback Machine to verify claims, recover deleted information, or provide historical context.

Where previous generations could walk into a newsroom morgue or a local library archive, today’s journalists increasingly rely on digital preservation to trace accountability and verify claims that might otherwise be lost. When a source disappears, when a statement is rewritten, when a page is taken down, the ability to recover that record is not a luxury. 

The stakes are not hypothetical. A Pew Research Center study from 2024 found that 38% of webpages from a decade ago are no longer accessible, and about 25% of pages sampled across the decade have disappeared entirely. But that’s not the whole story. New analysis by Internet Archive data scientist Sawood Alam found that the Wayback Machine has rescued roughly 15% of those otherwise lost pages, preserving reporting and historical evidence that would simply no longer exist online.

We are especially grateful that you recognized the care with which we approach this work. We are your partners in preservation. We build systems designed for people, not bulk extraction; we monitor our services to manage abusive access; and we actively collaborate with publishers and newsrooms to ensure their work is preserved with integrity.

Importantly, recent reporting has also underscored a key reality of this debate. As journalist Andrew Deck reported in Marketplace Tech, many publishers blocking the Wayback Machine appear to be acting preemptively out of concern over AI scraping rather than evidence of misuse. “None of the publishers were able to point to a particular AI company or other kinds of direct evidence that their content had already been scraped by the Wayback Machine,” Deck wrote.

At a time when the pressures on journalism are mounting—from economic shifts to the rapid evolution of AI—your support sends a clear message: preserving the public record is not optional. It is essential infrastructure for a functioning democracy.

We remain committed to the important task of preserving the web. And we are deeply encouraged to know that so many of you stand with us in defending that work.

With gratitude,

Mark Graham
Director, Wayback Machine
Internet Archive

Celebrating Thirty Years of the Internet Archive with the ‘Class of 1996’

Before feeds, before algorithms, there was the Class of 1996: websites & organizations founded (or expanded) in 1996, like the Internet Archive.

On the occasion of the Internet Archive’s 30th anniversary, we’re opening the internet’s yearbook to celebrate the sites, services & scrappy experiments that helped shape the web as we know it. From class leaders like Center for Democracy and Technology to cultural icons like The Onion to the archivists making sure none of it disappears, this is a reunion worth attending.

Some are still thriving. Some have changed beyond recognition. Some are already gone. All of them remind us: the early web wasn’t just built, it was lived in.

THE MORE YOU KNOW: Did you know that some publishers are blocking the Wayback Machine from archiving their sites, putting decades of reporting and cultural history at risk of disappearing from the public record? If the web’s past matters (and the Class of 1996 reminds us that it does), now is the time to speak up. Add your name to the petition calling on publishers to stop blocking the Wayback Machine and help ensure the internet’s history remains accessible for future generations.


Class of 1996

Class President — Center for Democracy and Technology

The Center for Democracy and Technology didn’t just show up—they helped write the rules of the internet. And 30 years later, they’re still fighting to keep it open.

Go Wayback to 1996: https://web.archive.org/web/19961022174718/https://cdt.org/


Most Likely to Fix Your Computer — CNET

Before YouTube & TikTok tutorials, there was CNET, walking you through every crash, install & “have you tried turning it off and on again?”

Go Wayback to 1996: https://web.archive.org/web/19961221064020/http://www.cnet.com/


Best Dressed — eBay

eBay—Where the outfit and the backstory come with it. Vintage, rare, unforgettable…just like the early web.

Go Wayback to 1999: https://web.archive.org/web/19990117033159/http://pages.ebay.com/aw/index.html


Most Popular (Or Knows Who Is) — Alexa Internet

Before “trending,” there were rankings, and Alexa told us who ruled the web. (RIP to a real one.)

Go Wayback to 1997: https://web.archive.org/web/19970530104435/http://www.alexa.com/


Most Changed Since Freshman Year — Google

From a dorm room experiment to organizing the world’s information. Some people really did peak after high school.

Go Wayback to 1998: https://web.archive.org/web/19981111183552/http://google.stanford.edu/


Most Helpful — Ask Jeeves

Ask a question. Get an answer. Preferably in complete sentences. The internet had a butler once & he was awesome.

Go Wayback to 1996: https://web.archive.org/web/19961219064854/http://www.askjeeves.com/


Class Clown — The Onion

Making us laugh at the news online since 1996 & occasionally making it feel a little too real.

Go Wayback to 1996: https://web.archive.org/web/19961219015005/http://theonion.com/


Best Hair — Unofficial Spice Girls Fan Site

Before social media, fandom lived here: glitter text, tiled backgrounds & serious ‘Wannabe’ hair.

Go Wayback to 1996: https://web.archive.org/web/19961229144915/http://spicegirls.com/


Cutest Couple — World Wide Web Consortium & Cascading Style Sheets

Structure meets style. The web’s ultimate power couple & still going strong.

Go Wayback to 1996: https://web.archive.org/web/19961227091242/https://www.w3.org/


Most Athletic — 1996 Summer Olympics Website

One of the first times the whole world followed the games online. Faster, higher, more digital.

Go Wayback to 1996: https://web.archive.org/web/19961223003700/http://www.atlanta.olympic.org/


Most Talkative — ICQ & Hotmail

The beginning of being always reachable…for better or worse.

Go Wayback to 1997: https://web.archive.org/web/19971210072826/http://www.icq.com/

https://web.archive.org/web/19971210171246/http://hotmail.com


Most Likely to Save Everything — Internet Archive

Because the web isn’t forever, unless someone saves it.

Go Wayback to 1997: https://web.archive.org/web/19970126045828/http://www.archive.org/


Most Likely to LAN Party — Quake

Before Twitch streams there were cables, pizza & Quake. You had to be there (literally).

Go Wayback to 1996: https://web.archive.org/web/19961220085409/http://www.idsoftware.com/


Most Quotable — Salon

Smart, sharp & written to be shared.

Go Wayback to 1998: https://web.archive.org/web/19981212032509/http://www.salon1999.com/
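
Each “Go Wayback” link above follows the same pattern: a 14-digit capture timestamp, then the original address. As a minimal sketch (the function name `parse_wayback_url` is made up for illustration, and reading the digits as YYYYMMDDhhmmss is an assumption based on the links in this post rather than an official specification), a few lines of Python can pull the capture date out of any of them:

```python
import re
from datetime import datetime

# Assumed shape of a Wayback Machine link, inferred from the links in this post:
#   https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Return (capture datetime, original URL) for a Wayback Machine link.

    Hypothetical helper for illustration only.
    """
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognized Wayback Machine link: {url!r}")
    # Assumption: the 14 digits read as YYYYMMDDhhmmss.
    captured_at = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured_at, match.group(2)


captured_at, original = parse_wayback_url(
    "https://web.archive.org/web/19961022174718/https://cdt.org/"
)
print(captured_at.isoformat(), original)
# prints: 1996-10-22T17:47:18 https://cdt.org/
```

Run against the yearbook, this is a quick way to check that the year in each “Go Wayback to …” label matches the capture it points to.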

Wayback Machine Director: We Are ‘Collateral Damage’ in the Fight Between AI Companies and Publishers

In the latest episode of the Future Knowledge podcast, “Preserving the Web in the Age of AI,” Wayback Machine director Mark Graham, tech policy expert Mike Masnick, and media lawyer Kendra Albert discuss the reports that some news publishers are blocking the Wayback Machine from archiving their websites due to unfounded concerns over AI scraping.

For Graham, it’s an issue of supporting journalism and the historical record. The Wayback Machine has become “collateral damage caught up in the conflict between AI companies and publishers.”

As Graham recounts encounters with reporters and researchers, a clear pattern emerges: even the most well-resourced institutions cannot fully preserve their own digital history. The Wayback Machine has become an indispensable backstop, ensuring that the public record remains accessible even when original sources disappear.

“I was in the offices of The New York Times just a few weeks ago,” said Graham, noting that The New York Times has blocked the Wayback Machine from archiving its website, “and a senior researcher came up to me and said, ‘Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.’ I get those stories all the time.”

For Masnick, blocking the Wayback Machine “will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem.”

When considering blocking all bot activity over fears of AI scraping, Albert cautions that websites, “whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, ‘Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.'”

Listen to the full episode on the Future Knowledge podcast:

Full transcript:

Chris Freeland (00:05):
Welcome to Future Knowledge, a podcast about knowledge, creativity, and policy brought to you by the Internet Archive and Authors Alliance.

(00:14):
We tend to think of the web as a living archive, this vast searchable record of who we were, what we knew, and how we understood the world at any given moment in time. But that assumption is starting to crack. As publishers move to block AI scraping and restrict access to their content, the tools that quietly preserve our online history, like the Internet Archive’s Wayback Machine, are getting caught in the crossfire. What started as a fight over AI training data is quickly becoming something bigger, a question about whether the web itself will be able to be archived in the future. Because if preserving the web becomes a threat, what happens to our memory when the past can’t be saved? Hi, everyone. I’m Chris Freeland. I’m a librarian at the Internet Archive. I want to welcome you to today’s discussion. So we have assembled an excellent panel of experts to discuss how efforts to limit AI access are reshaping the boundaries of preservation and what’s at stake if those boundaries continue to close.

(01:17):
So today we’re joined by Mike Masnick, the founder of Techdirt, Mark Graham, the director of the Wayback Machine, and Kendra Albert, a tech and media policy expert at Albert Sellars LLP. Here to introduce our speakers and to set the stage for today’s conversation is Dave Hansen, the Executive Director of Authors Alliance.

Dave Hansen (01:37):
Thanks, Chris. Hi, everyone. So this is a little bit different for us today. Usually we’re doing book talks, and we thought that this was such an important issue and such a fast-moving issue. No one has yet written the book on what is happening with the crisis in web archiving and preserving the web. But it’s a really important issue. Authors Alliance, from our perspective, we care about this because we’re quite fond of the internet, and being able to research adequately what has happened over time across the web is so important for any sort of journalistic writing, for history. I refer to the Wayback Machine weekly at least for writing when I’m looking at back versions of documents and things like that. And I think we really do face a real crisis at this moment where this has never been an easy task. And I think we’ll hear from Mark about how the Wayback Machine takes a lot of work and other web preservation efforts take a lot of work.

(02:32):
It’s never been easy, but in the current moment, it is particularly challenging when we have news publishers and other platforms online making it not just legally or technically complicated, but in some cases, almost impossible to really engage in preserving content in an automated way. So we’re here to talk about that. And I think the three perspectives that we’ve assembled here are, I hope, going to fill in some of the pieces of what’s happening on the ground with web preservation, what’s happening in the broader policy sphere that’s driving some of this. And also what does the law have to say about this? Because often what we see happening is sort of a reflection or a shadow of the legal rights that exist. So with that, I’m going to turn it over to our speakers here. And Mark, how about we start off with you to just talk a little bit about what do you see as the major challenge right now in terms of preserving the web in the age of AI?

Mark Graham (03:32):
Sure. Well, I mean, first of all, for close to 30 years, the Internet Archive’s Wayback Machine has been archiving much of the public web, including journalism, and making that material available to people. I should note that a large percentage of this material is no longer available on the live web and indeed thousands and thousands of news sites that we have archived over the decades are no longer available. And what has been going on recently, as reported by Andrew Deck of Harvard’s Journalism School and others, is that some news organizations and other platforms, most notably the New York Times and Reddit, have begun blocking the ability, preventing the Internet Archive’s Wayback Machine from producing archives of their material and making that available. Indeed, it has been suggested that we are victims, if you will, collateral damage caught up in the conflict between AI companies and publishers.

Dave Hansen (04:35):
Thanks, Mark. Mike, how about let’s hear from you and take this whatever direction you want, but I’m particularly interested in your take on sort of the broader what’s happening in the policy sphere around this.

Mike Masnick (04:46):
Yeah. I mean, it’s a really tricky space because I think that a lot of people certainly recognize the value and importance of preserving culture and understanding culture, building things like institutions like libraries and related institutions. And yet there’s been this kind of struggle in large part because of the rise of AI, which Mark sort of hinted at in his opening, which is that until just a few years ago, most people considered the archiving of the web and of other resources as something that was seen sort of akin to the role of the library, which makes sense. But with the rise of AI tools, there is this interesting challenge in that all of the major frontier LLM models have been trained on huge corpuses of data and they’re always looking for more. And the question is where and how. And there are all sorts of other related discussions on is training fair use and things like that, which I don’t think we need to get into here, but that is the backdrop behind all of this.

(05:55):
And so companies, especially the media companies, are certainly very concerned about the way that the AI companies have gotten access to their data for training purposes and they feel that they are uncompensated and that they need to be compensated. Some of them have been working out deals. You mentioned the New York Times and Reddit, and the New York Times is suing OpenAI and has cut deals with others, and Reddit has cut deals with Google and some others. And there’s all sorts of back and forth and negotiations. And all of this debate then becomes collateral damage to that because the fear is, and I think it’s an overblown and misguided fear, that because you have organizations like the Internet Archive building the Wayback Machine, which again, they’ve built for decades. And I think most people recognize it’s just a generally useful tool for the preservation of culture and for researchers and for the journalists at some of the news organizations who are complaining, they feel that something like the Wayback Machine offers a way to go around these negotiations and undercut the negotiations in some form or another, which I think at some point in retrospect will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem, but in this sort of rush to deal with this concern of, oh my gosh, the AI companies are taking over everything, they’re looking to plug any hole and block any opportunity for the AI companies to train on their material.

(07:34):
And the collateral damage of that is that some of them are now blocking the Wayback Machine. And I think there are other efforts then to see about will there be either on the legal side, which Kendra can talk about, or just through technical measures, will there be ways to put up effectively toll booths on the internet if you want to archive this or if you want to make use of the larger corpus of data that various news organizations have put together, do you first have to pay a toll? Do the AI scanning companies have to effectively pay for the right to read that content? And that leads to a whole bunch of other downstream issues, but I’ll cut it off there and give Kendra a chance to talk as well.

Kendra Albert (08:16):
First of all, just want to, I’m really excited for this conversation. And before I get to the law, I want to obligatorily state that my experience with web archiving is sort of a little bit unique in the sense that I was on the founding team for Perma.cc, which is basically another web archiving service that’s aimed at providing sort of permanent links for use in scholarly work, court filings, et cetera, a project that owes a lot to the Wayback Machine. And so I think this is a topic that’s sort of near and dear to my heart, even outside of the sort of intellectually interesting parts of the legal analysis. So I think there’s sort of two sets of legal issues that you can kind of think about when you’re thinking about web archiving. The one that everyone thinks of, and I realize probably most people do not have an instinctive legal reaction to a conversation about web archiving, but one of them is basically copyright law.

(09:03):
The question of whether you can make a copy of someone’s work and save it, even if the purpose you’re saving it for is quite different than the purpose it was originally created for. And there’s some good case law, primarily from the early 2000s actually, suggesting that making cached copies of websites, even the full website, by Google is fair use; use of images for search results, fair use. So basically fair use being a limitation on copyright law that allows people to make use of copyrighted works without the permission of the original owner. And I think that’s how many folks think about many of these large scale web archiving projects: that they’re under fair use, and that oftentimes fair use asks questions like, “Hey, are you harming the market for the original copyrighted works? What are you really doing with this sort of use of the copyrighted works?” The test asks questions about what you’re doing.

(09:52):
And I think in the context of web archiving, especially for sort of memory institutions, for the kinds of criticism, accountability, journalism that we’re going to talk about, I think probably a little later, there’s some really strong fair use arguments, although, just like most things in the space, it’s not like we have a Supreme Court case on point about this specifically. That’s the sort of, in some ways, the easier question. And when fair use is the easier question, you’re in a bad situation from a legal perspective. The much harder question having to do with web scraping is sort of the process of scraping material online itself. And this sort of falls under a separate set of legal regimes, including things like the Computer Fraud and Abuse Act, which if you’re a little confused as to why the federal anti-hacking statute applies to certain kinds of web scraping, originally the theory had more to do with the fact that terms of service for websites would prohibit web scraping, and thus those terms of service could be used to argue that there was a CFAA violation.

(10:51):
Now, as we’re thinking more about the sort of technical restrictions that folks are placing on accessing websites, or even things like robots.txt, which I’m sure we’ll talk about, but which sort of is meant to convey signals about whether websites want themselves to be scraped, courts do have to take up the question of whether violating those signals constitutes breaking the law. And this is where it gets even more tricky because in the fair use context, you get to talk about things like, “Hey, it’s really good for general knowledge that folks can access archived versions of websites. This isn’t harming the market.” In many conversations around web scraping, whether it’s under the CFAA or some other legal theory, you’re often much less focused on the question of why are you doing this? And this I think gets to Mike’s point about the broader environment and the sort of good guy archival institutions as being collateral damage, much of what we’re seeing in the backlash around AI training, web scraping, and in legitimate concerns about just bandwidth use.

(11:50):
I’m sure Mark can speak to this more than I can, but I think it’s really important to say part of the reason people are concerned about web scraping is just because they are paying for companies to access their website at a scale that is not feasible. So when we think about the law, it’s not an area where we have super clear answers as to the legality. And a lot of it depends on the particulars and has been defended, I think, for a long time by the fact that most folks doing web archiving have been good actors like the Internet Archive, like Perma, other folks who are responsive to requests to delist material from the web or from their online archives, who are thoughtful about their engagement with people who want to have a conversation about, “Oh God, are we spamming your website? Let’s not do that.”

(12:33):
And so I think that’s meant that actually we haven’t had a ton of litigation over, okay, exactly, how is this legal under copyright? In the web scraping context, there has been much more direct litigation, including a fair amount having to do with scraping of LinkedIn, especially by commercial providers. And so when we think about the law of web scraping in particular, you’re thinking less about why are you doing it (well, there’s some slight exceptions) and more just about: are you trying to get around technological barriers? Are you trespassing on a website? The kinds of questions that aren’t typically how we think about access to things on the internet.
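
To make the robots.txt “signals” mentioned above concrete, here is a minimal sketch using Python’s standard `urllib.robotparser`. Both robots.txt files, the crawler names `ExampleArchiveBot` and `ExampleTrainingBot`, and the URL are invented for illustration; they are not the rules or user agents of any real publisher or crawler. The sketch only shows what the file asks for. Whether a crawler honors the request, and what follows legally if it does not, is the open question discussed in this conversation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns every crawler away.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that turns away one named crawler only.
SELECTIVE = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

URL = "https://news.example.com/2024/story.html"  # made-up address


def allowed(robots_txt, user_agent, url):
    """Report whether this robots.txt text asks user_agent to stay away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for name, rules in (("blanket", BLANKET), ("selective", SELECTIVE)):
    for agent in ("ExampleArchiveBot", "ExampleTrainingBot"):
        print(name, agent, allowed(rules, agent, URL))
```

Under the blanket file both crawlers are turned away; under the selective file only the named one is, which is the distinction between different kinds of bots that the panel keeps returning to.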

Dave Hansen (13:12):
Thanks, Kendra. The CFAA piece of this is just bewildering to me to think that if you follow that path to its logical conclusion, we’ve criminalized being an archivist potentially online. And it’s just wild to me that that’s the world that we’re now in. So I want to talk a little bit about the motivation for blocking a bit more. I mean, we’ve gotten into AI as, I guess, ostensibly the driver here, but then not everybody, not every news organization, not every website has seen this pot of gold and said, “We must protect it at all costs.” They haven’t shut this down across the board. And so I wanted to probe that a little bit. What’s going on there and why is there significant variation, at least at this point, across policies from … I guess we can focus on news. I know there are other websites as well.

Mark Graham (14:02):
Well, I think first of all, it should be noted that very few news organizations have actually taken these kinds of measures like the New York Times. The vast majority of the news organizations in the world are very happy for their resources to be archived. Indeed, if they hadn’t been over the decades, we would not have access to them today. Examples would include Gawker Media or MTV News, nearly a half a million articles, from the US, or maybe in Hong Kong, where news organizations like Apple Daily or The Stand were shut down for political reasons. And indeed, editors are in jail today. The only way one can access that material is from the Wayback Machine. In addition, we partner with Bard College and PEN America on a project, the Russian Independent Media Archive, focusing on archiving news from the Russian language journalism in exile and other places in the world where journalism is at risk.

(15:02):
I also note that Andrew Deck, in his reporting, could not find any examples where any publisher found evidence that material from the Wayback Machine was in fact being exploited by AI companies. So there’s, I think, a great deal at risk here, and frankly, very little, if any, evidence whatsoever of a threat to these news organizations. And at the same time, I want to emphasize that the Internet Archive has been working collaboratively and supportively with journalists for decades, and that journalism often is based on references to other journalism. I was in the offices of The New York Times just a few weeks ago, and a senior researcher came up to me and said, “Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.” I get those stories all the time.

(15:59):
And I want to emphasize that we’re not static. We don’t just do our thing and then nothing ever changes. The web is constantly changing, business environments are changing, et cetera, and we change as well. We have implemented a whole variety of mechanisms over the last few years, especially with the rise of AI company scraping, to make it such that, by and large, the Wayback Machine is limited to use by humans. The system is optimized for use by humans. We’ve taken specific measures to reduce if not eliminate bulk access to materials, especially from certain news organizations, limiting functionality in the Wayback Machine’s UI, collaborating with entities like Cloudflare, putting in place rate limiting mechanisms and a whole variety of measures. Some of these we’ve taken in collaboration with news organizations who have expressed specific legitimate concerns. So the conversation is very much open. We welcome it as we look for ways to continue to provide the vital service that we provide to archive and make available what is considered by many the first rough draft of history.

(17:11):
And by definition, that rough draft needs to be available to be able to be examined, to be interrogated, to be reviewed, to be cited and referenced. There’s just one more thought there, the citation side of it. Today, there are millions of URLs from news sites in Wikipedia articles, and a large percentage of them are only available because they’re in the Wayback Machine, because the news sites they came from just don’t exist anymore.

Mike Masnick (17:40):
The point I’ll add in terms of why are news organizations doing this is that I think it’s all part of a negotiation, right? I mean, if you look at the ones who are sort of at the forefront of trying to do this blocking, the New York Times and Gannett mainly being some of the big ones, their own business model has changed quite a lot in the last few decades and certainly in the last few years as well. And they’re very, very focused on trying to figure out how they’re going to continue to make money. And lately that has been through negotiating with large tech AI players and trying to cut deals. And the concern, which again, I would argue is misplaced, is that anything that might undercut the negotiations to make a deal and sort of prop up their business model is seen as a threat to that.

(18:33):
And so the few of them that are going around and saying that the Internet Archive is a problem or needs to be blocked from scraping their content, for the most part, they’re using that just to help them in their negotiations out of the fear that, oh, if the AI companies have a back door into getting our content, then the negotiation with us over a deal is a different proposition. I think this is a mistake on multiple levels, but that is kind of where their thinking seems to be.

Kendra Albert (19:04):
I also think to that point that there can be this sort of way of thinking about … I’m reminded of the famous dril tweet, there’s no difference between good and bad things. Sort of this idea that in order to take a stance about bot access on your platform, you have to block all of them. It doesn’t matter what they’re there for, doesn’t matter whether they’re well-behaved in terms of bandwidth use or crawling. I don’t know. I haven’t had conversations with folks. Some of it may be lawyer brain. I’m happy. I think there is a world in which lawyers looking at their legal positions with regard to scraping that might be occurring through AI sites might say, “Well, actually this is simpler.” If we don’t have to explain, we allow it for these folks because we actually think that they’re okay or we think that the uses may be fair or whatever, but we don’t allow it for these folks.

(19:51):
I can imagine that making an argument more complicated even if I sort of agree with Mike that I don’t think it’s a particularly good way to do things. And I also think websites generally, whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, “Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.” The other thing I can imagine, and I don’t think this is true for the New York Times organic, and Mark can correct me if I’m wrong, that is I think there are some circumstances where there are smaller sites that actually may not have the specific technical expertise to really understand what’s fully going on, where you have a site that just knows their bandwidth costs have gone crazy, they’re aware that folks are sort of scraping the web training for AI.

(20:39):
They’re not necessarily in a position to go through and actually distinguish different sort of bots or actors, different sort of folks who are accessing content, and so may take a uniform approach. But I mean, I think part of this has to be a conversation about, hey, independent of what you think about the training data copyright fight, which I’m not going to get into, then saying that archival uses are really important. Right now, I think it does this via RSS feeds because it’s for headlines, but there’s a bot on Bluesky and Mastodon that looks at changes to New York Times headlines from the first post to 20 minutes later. And you can see the sort of diff between those headlines. And that provides valuable media criticism, frankly. And we’re not even talking about 20 years from now. We’re talking about it within the day of people posting to be able to see how stories have changed.

(21:28):
So I don’t think folks are doing it. I don’t think the New York Times is doing it because they don’t want people seeing how their headlines have changed or that they’re stealth correcting things in the text, although they certainly do do that. But I think that some of it comes from this sort of general framing of enclosure, as Mike was talking about, or can come from a lack of going through the details to understand the differences between different types of actors who may be using somewhat similar technologies.

Mike Masnick (21:53):
Yeah. Can I just add something to that? One of the things that I think is important that overlays a whole bunch of this is the general feeling of many people somewhat reasonably about the entire AI space right now, that there’s a large sort of backlash. There was this study recently that ICE has a higher approval rating than AI technology right now. There is a general sort of conceptual backlash to this technology. Some of it based on perhaps good reasons, some of it based on perhaps not good reasons, none of that matters. Culturally, there is this general backlash, and especially for smaller, less sophisticated sites that don’t want to go through the process of having to deal with that and the nuance or just saying, “I want to opt out if I can of this technology that I feel is problematic and bad.” And if they don’t have a clear and easy way to do that, one way might be, “Well, I’m going to block any and all scraping because I vaguely know that that is being used to allow these companies that I hate to do something with my content.” And therefore, for some of them, it is not a well-thought-out “I am taking a stand against archives.”

(23:08):
They’re not thinking that far. They’re just saying like, “AI, bad. I have no control over this situation. The only thing I can do is someone has made it easy for me to block archiving or scraping, and therefore I have to do that as a stand against this technology.”

Mark Graham (23:24):
That’s true. And at the same time, I want to note that there are many other news organizations that take the opposite approach. Indeed, specifically, for example, the Poynter Institute, along with the organization behind the Investigative Reporters and Editors Conference, has partnered with the Internet Archive on a project called Today’s News for Tomorrow. And what we are doing specifically is providing free archival services to more than 300 local newsrooms across the United States to help them archive their material. They have chosen to participate in this project because they value and appreciate the importance of the archiving. And at the same time, I note that more than 200 journalists have recently signed a letter endorsing the work of the Wayback Machine, celebrating it. In fact, Rachel Maddow and others are on record supporting this and have signed the letter of support. So we are focusing here on some of the pushback from a very small number, but influential and well-known news and other sites.

(24:27):
But I want to put across the point here that generally speaking, we are able to continue to provide the service that we have for decades with the active support of, first of all, the patrons of the Internet Archive, the folks that are curious enough to want to learn from journalists writ large and from media platforms.

Mike Masnick (24:46):
Yeah. And I signed that letter and I completely agree with that thinking. I’m just sort of explaining some of the thinking. And I would even go a little bit further in that beyond just the importance of archiving and being able to use these tools that journalists use for research, I do worry a little bit, even if we’re talking about the AI technologies as well, that when you have major publications like the New York Times trying to block any and every possible way in which their writing might be read by AI tools, that that actually has problematic downstream consequences as well, where you have more problematic publications that are out there and the ones that have done more careful reporting. The New York Times sometimes does careful reporting, not always, I would say, but you want to have good reporting in these archives and in the AI tools as well as people are using them so that they’re not overrun by more problematic content.

Mark Graham (25:44):
It does. And if I could build on this just a bit, it sets a very bad precedent and that then bleeds into other areas of publishing. For example, the US government, the world’s largest publisher, uses large commercial platforms for much of its publishing. The US Agency for Global Media, the folks behind Radio Free Europe, et cetera, use YouTube to publish videos, millions of videos, thousands of which have been taken down since this new Trump administration. A couple of months ago, the State Department said that they were going to remove all of the social media posts prior to the Trump 2 administration. And as we were racing to archive more than two million social media posts, we were watching accounts from embassies, ambassadors and others over the years literally disappear from our screen as we were trying to archive them. So I think this just sets a dangerous precedent and something that we should be paying attention to in all dimensions of how we are working to preserve the materials that are published and to never trust a publisher to do the job of a library.

Dave Hansen (26:49):
As you’re talking, I’m really thinking here about some of the business model stuff that underlies so much of these concerns. And I was recalling, it was like three or four years ago, I guess, at this point working with a library that was doing a licensing deal with a rather large newspaper. And I mean, the numbers that they showed me, they’re talking about six-figure data licenses for access to the newspaper data. And we’ve had people talk about this before on here. Sarah Lamdan did a talk about her book, Data Cartels, where a lot of it focuses on Reed Elsevier, academic publisher, and there’s this real disconnect with how authors and contributors and journalists think of those outlets and what those outlets actually are from a business perspective. And I think the New York Times at this point is as much a data and analytics company as it is a newspaper.

(27:40):
Reed Elsevier specifically calls themselves a data analytics company, even though they are on paper an academic publisher. And I think it doesn’t really help solve the situation, but at least explains a little bit more to me why they are making the moves that they are around restricting access to this content, if that’s the core of your business. I still don’t like it, but that explains a little bit. So I do want to talk about some other companies and outside of news, I guess, is where I’d like to go. So Reddit has been pretty public about blocking access. They have a lawsuit right now against Anthropic. That’s been a kind of interesting one to watch. And there are lots of other commercial platforms, social media platforms, for instance, that are restricting access for web scraping and preservation. So what’s going on there? Kendra, maybe we can start with you to just talk a little bit about what’s happening in litigation with some of these other platforms.

Kendra Albert (28:37):
And I think Mark’s point and your point about these sort of platforms is I think it’s really valuable in some ways to think about the actual rights to the content or your sort of legal right to use the content as an actually functionally totally separate question from how you’re scraping it. And I think specifically with Reddit, Reddit doesn’t have the right to sue someone for copyright infringement for copying Reddit posts. I haven’t read the terms of service recently, but I’m pretty sure you’re not allowing Reddit to sue on your behalf for copyright infringement. But oftentimes the way this litigation is framed is around access to the platform, circumventing technological measures. So that’s the anti-circumvention part of the copyright statute, section 1201, or through things like there’s this fantastic (I’m not sure how I mean that word, but I’m not entirely positive) tort that used to be for when you touched someone’s car without permission, called trespass to chattels, which has also been historically used in some contexts for web scraping, although usually you need to show that there’s some form of harm to the sort of infrastructure in order to bring it.

(29:41):
We talked about the CFAA, there’s trade secret, there’s all kinds of other sort of legal claims. So in some ways when you’re thinking about how some of these platforms are choosing to in some ways back up their business model goals, Reddit has done licensing deals with AI companies. I forget which one off the top of my head, but there is a very real conversation about like, “Hey, why should we pay you for this data if we could scrape it for much less money?” Now, of course, the version that you’re going to get from Reddit, if you pay them for it, is probably going to have other advantages just in terms of the metadata, the infrastructure, being able to ask Reddit questions about how the data works, all that kinds of stuff. But when we’re thinking about the legal reality behind these decisions, I think part of it has to do with the idea of the business model.

(30:29):
And part of it has to do with, I think, some degree to which I think some of these platforms may be genuinely responding to their own users being upset. And LinkedIn scraping from before the current generative AI days is actually a really good example of this because LinkedIn brought a lot of scraping litigation against primarily business competitors that were using LinkedIn data in order to run a recruiting tool or do other things that one might want to do with professional information. And to some extent, that was protective of their business model. These were effectively their competitors or they would roll out a product that was competing with whatever that company was doing. But also, legitimately, sometimes folks had real privacy concerns about the fact that, “Hey, I shared this data on LinkedIn. I didn’t assume that it was going to go everywhere. Now it’s gone everywhere.” I think that is different than the web archiving context.

(31:22):
And I’m not saying, “Oh, this is the same thing.” But I think why I bring it up is to say that you have this sort of circumstance under which there’s a variety of different incentives for limiting access to data, and it’s impossible to disentangle them. It’s impossible to say, “Oh, this is only because of business models. Oh, this is only because people have privacy or usage concerns where this goes outside where it was supposed to be.” And that oftentimes tech companies, LinkedIn has long said actually that their primary reason for a lot of their anti-scraping tooling is to protect users’ privacy. Now, I think that that’s a hard position to defend given the sort of business model stuff, that that’s the only reason, but I don’t think it’s not part of it. So I think that when we think about the moves by companies like Reddit to restrict all kinds of access, including the Internet Archive and the Wayback Machine, you can’t just pin it to one thing.

(32:16):
And it’s not always based on one specific legal theory because oftentimes they’re trying a bunch of different stuff simultaneously, of which copyright might be one of the tools, but often is actually not the most useful if you’re talking about really significant amounts of web scraping. I hope that sort of answered your question, Dave.

Mike Masnick (32:34):
The one thing I was going to add in the Reddit context is that it is an example of where this can lead in terms of starting to test out questionable or extreme legal theories. So one of the cases that Reddit has is against this company called SerpApi, or SERP API, I don’t know how they pronounce their name. And you can argue that this is perhaps not a good company, but basically what they do is they scrape Google results and create an API so that you can programmatically make use of Google results. Google is also suing them, but that’s a separate case. But you have Reddit suing this company over copyrights that Reddit doesn’t own, as Kendra noted. It’s the users in most cases, if there’s any copyright interest at all. And they’re suing this company for scraping Google’s results, which again is not Reddit and claiming that it’s a DMCA 1201 anti-circumvention measure over something that Reddit itself hasn’t set up the technological protection measure.

(33:35):
The only thing that they’ve done is cut a $40 million deal with Google. And so you get these sort of stacking legal theories and questionable things that while you can see, okay, Reddit is upset that perhaps AI companies are routing around doing a deal with Reddit or with Google because they can use a company like SerpApi to get Google results that scrape Reddit because they have a deal with Reddit, it leads to really questionable places in terms of other types of scraping or other uses that are important and useful culturally. But because everybody’s sort of trying to figure out how do we do these things and how do we cut these deals, you see these sort of somewhat stretched legal definitions, I think, or attempts at questionable cases.

Kendra Albert (34:22):
And can I just say one more thing about Mike’s point real fast, which is I think that that’s entirely true. And I think the other thing to point out there is as much as I like to, I think it’s good to distinguish between good things and bad things. I’ll go on the record as being in favor of that. I think when we’re talking about making case law, oftentimes the sort of factors judges look at or the decisions judges make don’t say, “Okay, well, I don’t like this company because I think their business model’s bad. And so I’m going to find that they violated the CFAA because of that.” But the good guys, it’s not a CFAA violation. That part of the law usually works. We actually get to do that way more in fair use. Because of that in my current job, we often work with researchers who scrape internet platforms to look for things like bias, discrimination, to understand how platforms work, that kind of thing.

(35:07):
And those folks are subject to all the same bodies of law that get made by, well, Reddit is pissed off that you can get Reddit results from Google at this company, or Reddit feels like they’re channeling their users’ outrage that the sort of users’ data is being used for these purposes they didn’t intend. So I think it is really important to note that archiving, research, all of these kinds of uses often basically require exactly the same tools, just like the Wayback Machine uses the same, using bots to view webpages and archive them. Mark, I’m wildly dumbing down the complexity of what you do, but researchers are using the same tools to scrape data and to sort of understand how tech works. So I think it’s not actually easy to just be like, “Okay, great. This technology, this way of doing it is good or bad, and we should just make a rule generally.”

Mark Graham (36:00):
And you have to explain a little bit too about what’s at risk here beyond just news. The Internet Archive archives more than a billion URLs a day. And one of the signals that we follow is links added to Wikipedia articles, for example, all of them. And as a result of that, we have been able to identify and fix, that is, edit and replace, otherwise broken URLs that would return a 404 with archives of those references that human beings had added to Wikipedia articles over the years. More than 30 million links have been fixed in this way. Pew Research, for example, identified for a collection of URLs they looked at that were 10 years old, that 38% of them were no longer available on the live web. So what does that mean if we can’t have access to this material anymore? A variety of things. Hundreds of times a year, the Wayback Machine Team produces an affidavit to attest to the veracity of our web archives for the use by lawyers in courts.

(37:03):
And often these are cases of product liability, maybe a misrepresentation by a company, et cetera. And this material is often the critical evidence that is used to determine the outcome of the case. So there are any one of a number of applications of web archives beyond just news that are vital to our society to be able to hold those in power accountable and to be able to help those curious enough to learn to inform themselves.

Mike Masnick (37:30):
I do think that that is important to just remember the concept of the open web itself and sort of how we got here in the first place. I think it gets very easy. I mean, I even sort of got bogged down immediately on the AI aspect of all this, but the open web has been around for more than three decades at this point. And I think many of us are here because we believe in the promise of the open web and what it enabled in terms of community and culture and sharing of information and meeting people and everything. So much of what we rely on today was built on this open web. And the concept of the open web is this idea that it’s not controlled by any one entity. And it is not locked down and limited, but that we can build on it and do more with it and we can share with each other and build culture.

(38:22):
Culture is about multiple people understanding the same concepts. And that is built very much on the open web these days. And so much of where this unfortunately potentially leads to is a locking down of the open web just because of concerns about how it might be used in one particular way. And so just as I know we’re sort of getting to the Q&A part, I felt like we should emphasize that aspect of why we’re all here.

Chris Freeland (38:52):
Thank you, Mike, for acknowledging that. I’d say long live the open web. I 100% agree with everything you said. The open web is an important part of our culture and I hope that it remains that way. And Mark, I think it may be helpful if you can explain how does the Wayback Machine make data available in bulk and what kinds of protections are in place to prevent some of the abuses that have been mentioned here?

Mark Graham (39:17):
Sure. Generally speaking, we don’t make material available in bulk. The underlying files behind the Wayback Machine are generally not publicly accessible. We do provide an ability to playback, to replay individual web pages through what I refer to as the thin straw of the Wayback Machine. For those of you who have used the service, you understand what I mean, it’s pretty slow. There are certain features where one can list large numbers of URLs for a given site, for example. At the request of some publishers, including the New York Times, we’ve disabled that capability for those particular sites. We do some archiving of material that is generally considered to be publicly available, in particular material from governments. We participate with many others, including Kendra with Perma.cc at Harvard and others on doing a deep dive on material from the US government. And we do package that material up and we do make bulk access for that particular collection of web archives available to researchers and others.

(40:23):
And also, as I noted, that’s on the playback side on the archiving side or how we serve material out to the world. There are a variety of mechanisms that we put in place to do limiting, to detect and deter access to the service that is not human originated.

Chris Freeland (40:41):
Very helpful. Thank you. Question for everyone. If the Wayback Machine and other archival institutions get blocked, people are probably still going to do some archiving, but they’re going to do so in maybe less legitimate ways and screenshots and other things. And so I’d be interested in your thoughts on this issue of maybe the non-legitimate archives or the preservation by organizations that are outside of the traditional library sphere. What does that mean for the historical record?

Kendra Albert (41:07):
I’m going to just leave non-legitimate archives over there. Well, so I think there’s a couple things to think about there. One is I think, yes, certainly screenshots are not as good as a more interactive page component, but I think ultimately having something of it is better than having nothing at all. One area I work on a lot is video game preservation, where we encounter a lot of somewhat similar challenges in terms of the degree to which the technological complexity, the sort of challenges with permissions from rights holders, that kind of thing. And I think one thing that I think about a lot there is in some ways when you make it really hard for institutions to legitimately preserve things, for institutions that are big, public, who are very clear about what they do and how they do it, you do in some ways cede ground to smaller institutions that may have different practices.

(41:55):
And some of those institutions are often really good at what they do and they’re just quiet about it. And that’s great. And some of those institutions, I think we maybe all followed. There was a whole kerfuffle about, I think archive.is, which was sort of a tool that people used for archiving webpages, often getting around paywalls, that was allegedly running a fake CAPTCHA that was DDoSing a critic of the site. I think that’s a really good example of one of the potential downsides of some of the more aggressive attempts to limit automated access or access because folks were not going to that site because they necessarily would’ve preferred that site. They were going to that site because they could view content there that they weren’t able to view elsewhere or they could access an archive page that they couldn’t access elsewhere. And so I think there is a real risk in a lot of these spaces of making it very hard for institutions that want to do the right thing to effectively preserve or save works.

(42:50):
And then it’s sort of causing challenges for both the historical record and for who’s left.

Mike Masnick (42:57):
Yeah. I mean, I think there are good actors in this space. And obviously the Wayback Machine, the Internet Archive are a very clear example of a good actor. And if you continue to make life difficult for them, it is only going to push people to those who maybe are less good actors and there are other kinds of collateral damage that come along with that.

Chris Freeland (43:18):
Leaving the non-legitimate archives on the floor, but something of a related question. So should preservation institutions be treated differently from AI companies in law or policy? And are there then proactive policies that libraries need to be able to continue doing this work in the digital age?

Kendra Albert (43:37):
I mean, in some ways they already are. Section 107, the reason I kept being like, and you get to actually talk about whether what people do, Section 107, which is fair use within the US does actually care about what you’re doing with the content. Section 108 of the Copyright Act is specific to libraries and certain kinds of archival and preservation institutions, and allows them to do things that other institutions can’t do. It’s not a question of like, should we treat them differently? We already do. It then becomes a question of, “Hey, should we treat them differently anywhere else?” Is maybe the sort of question I’m asking. And I think it’s really hard in the existing scraping law context to see how that would quite work. Although I think that we did see some of that in a case called Sandvig v. DOJ where some researchers sued the DOJ over the Computer Fraud and Abuse Act’s criminal components, making it harder to do First Amendment kinds of protected research.

(44:24):
So I think there’s some inklings of that and it would be fantastic to, I think, see more engagement with this question of what are the actual uses we think are good and important and how do we promote those versus sort of, okay, just get rid of the whole thing.

Mark Graham (44:38):
Yeah. I’m going to add, first of all, I’m not a lawyer and I do recognize existing copyright and fair use allowances to substantiate the work of the Wayback Machine to support it. But at the same time, there was the Vanderbilt clause added to carve out specific explicit protections in the area of television news archiving. I should note that the Internet Archive has a very robust television news archiving program as well. But I want to flip it around a little bit and say that news is a very special category of online material. It plays a vital role in our democracy. Indeed, it’s been referred to as the fourth estate, and various measures of privilege are given to news and news organizations. And I might suggest that with those privileges and rights come certain responsibilities related to access and availability. I think living in a world that’s awash with mis- and disinformation, the Internet Archive recently co-published a paper that suggests that up to a third of new websites and webpages appearing on the public web today are at least partially AI generated.

(45:50):
And so this is a time of rapid change. In fact, if we’re paywalling and making quality journalism generally unavailable to people unless they have a subscription, which is a teeny, teeny percentage of the population, that we’re going to end up with a world more and more where the truth, the quality journalism is paywalled and therefore generally inaccessible to people, but the lies will proliferate and they will become, as they are in many cases, the dominant presence in the conversation. When I was growing up, I had a library, a physical library, and I had access to the New York Times and other magazines that I was able to read. If that library didn’t have access to that material, I simply wouldn’t have had access.

Chris Freeland (46:35):
Hat tip to Nathan J. Robinson and Current Affairs: “The truth is paywalled, but the lies are free.” I want to leave with our final question here for each of the panelists. What can anyone who’s listening here today do to help change this trajectory?

Mike Masnick (46:49):
I mean, speak about it, talk about it. Obviously use the tools well and intelligently and explain to others how you’re using these tools and why they matter. Certainly when it comes to things like potential policy or legislation, being aware of what’s happening and being willing to speak out and make sure that there is nothing that will then get in the way of important cultural institutions like the Internet Archive, but really just being a part of the conversation. I think a lot of people don’t understand where this is leading and sort of the impact on organizations like the Internet Archive and tools like the Wayback Machine. And so making sure that more people are aware, I think is the most important thing that at an individual level you can do. Obviously at institutional levels, if you do work for a news organization that is blocking access to the Internet Archive, maybe try to convince people that that is a bad idea and will have downstream cultural impacts that are not good for society, but that more depends on where people are situated.

Kendra Albert (47:54):
Mike stole one of the things I was going to say, which is I think that for folks who have institutional affiliations, I think making sure that, A, the Internet Archive is still accessing pages from your institution. And then if it’s not, making the case internally that, hey, this is why it’s important for my work, for the things that I do, for the things that I care about, which I think is going to be much more powerful coming from folks who are internal to an institution than necessarily coming from those of us who are sort of out here being like, “Doom is coming, archiving is stopping.” So I think, to the extent that folks have an institutional role where they bring attention to these issues, I think that’s really valuable.

Chris Freeland (48:33):
Mark, how about you?

Mark Graham (48:35):
Just a few things. First of all, use our service. We’re a public library and we love it when people are able to benefit from the resources that are available from our library and give us feedback about how we can do a better job at providing those services. Subscribe to our newsletters, follow us on the socials. If you’re a journalist or know a journalist, I’d recommend that you check out the Fight for the Future letter that we share here, that Chris, you shared here. And then if you’re in the Bay Area, come visit us. We host more than a hundred events a year at our facility in San Francisco and every Friday, except for I think Thanksgiving and Christmas, at one o’clock, we host a tour so you can kind of get an in-depth and personal look at what we do and how we do it.

Chris Freeland (49:23):
Thank you for that, Mark. Thank you to Mark and to Mike and to Kendra for such a fascinating conversation today and to Dave Hansen and Authors Alliance as always for facilitating and co-hosting this session. Thanks everyone. Have a great day. Thanks for joining us on this journey into the Future of Knowledge. Be sure to follow the show. New episodes drop every other Wednesday with bold ideas, fresh insights, and the voices shaping tomorrow.

On World Press Freedom Day, a Call to Keep the News Preserved

For nearly 30 years, the Internet Archive’s Wayback Machine has worked alongside journalists, researchers, and the public to ensure that the web—and the news it carries—remains part of our shared historical record. Today, on World Press Freedom Day, that mission faces a new and urgent challenge.

Some news organizations, including The New York Times, The Atlantic, and USA Today, are blocking their sites from being preserved in the Wayback Machine over unfounded concerns about AI scraping. As Andrew Deck from Nieman Lab noted in Marketplace, “None of the publishers were able to point to a particular AI company or other kinds of direct evidence that their content had already been scraped by the Wayback Machine.” As a result, important journalism is at risk of disappearing from the public record. More than 200 journalists have added their support to keeping the news in the Wayback Machine.

In response, Fight for the Future has launched a public petition calling on news leaders to work with the Internet Archive to ensure their reporting remains accessible for generations to come.

Take action

On this World Press Freedom Day, we invite you to stand with journalists and with the future of the historical record. Add your name to the public petition and join the call for news organizations to work with the Internet Archive to keep the news in the Wayback Machine.

Gone but Not Forgotten: Recovering the Dead Web

TL;DR: A Pew Research Center study found that 38% of webpages from a decade ago, and about 25% of pages sampled across the decade, are now inaccessible; our analysis shows that the Wayback Machine has rescued a large share of them, with rescued pages amounting to roughly 15% of all the sampled URLs.

In 2024, the Pew Research Center published a link-rot study, “When Online Content Disappears”. They stated, “38% of webpages that existed in 2013 are no longer accessible a decade later”. They further noted, “a quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible”. This is not an isolated report quantifying the rate of loss of online information. Numerous other link-rot studies in the last two decades have reported similar numbers or worse, depending on the context and samples.

For example, Ahrefs, an SEO company, reported in the same year, “At Least 66.5% of Links to Sites in the Last 9 Years Are Dead”. In 2021, Jonathan Zittrain published an article in The Atlantic, “The Internet Is Rotting”, in which his team analyzed about 2 million external links from New York Times (NYTimes) articles and reported that 25% of deep links had rotted. They further noted that 72% of the older links from 1998 were dead. A recent longitudinal study on link-rot from Old Dominion University (ODU), “Some URLs Are Immortal, Most Are Ephemeral”, analyzed 27.3 million URL samples from the Wayback Machine since 1996 and reported that about 65% of the sampled URLs were found dead on the live Web when checked in 2023. Brewster Kahle, the founder of the Internet Archive, has been citing numbers from the early days of the Web and stating the average life of webpages to be anywhere from 40 to 100 days. A 2026 book, “Vanishing Culture: A Report on Our Fragile Cultural Record”, by Messarra et al. highlights underlying causes of numerous recent cultural digital losses while emphasizing the critical roles libraries and archives must play to maintain our cultural history for the future.

Different studies have looked at the problem from different perspectives and contexts, hence it is often difficult to compare them side-by-side, but they all agree that an increasing number of links are rotting with the passage of time. However, some of these studies (not all) have failed to acknowledge the existence of Web archives, such as the Wayback Machine, where a portion of the dead Web might be preserved and can be used as a fallback when a reference leads to a broken link.

In this post we go through some of the link-rot studies and look at them from the perspective of the Wayback Machine to see how much of the dead Web can be rescued. Table 1 shows the status of the dead and rescued Web at a glance as sampled by a few different studies.

Study          | Year | Period    | Samples | Dead | Rescued
Pew (All)      | 2024 | 2013-2023 | 5.4M    | 26%  | 16%
Pew (General)  | 2024 | 2013-2023 | 1M      | 27%  | 13%
Zittrain NYT*  | 2021 | 2013-2013 | 88K     | 40%  | 38%
ODU NYPW       | 2024 | 1996-2021 | 27.3M   | 65%  | 65%
Table 1: Dead links from various link-rot studies rescued by the Wayback Machine.
* The NYT numbers are based on our recreated dataset.

Let us begin by looking at the study from Pew Research Center. They have generously shared their dataset with us so it was rather trivial for us (after performing some transformations and extractions, as the original dataset was stored in Parquet files) to check the URLs against the Wayback Machine to see if and when each of those were archived the first time. Their dataset contains 5.4 million unique URLs in general, news, government, and Wikipedia references categories sampled from the Common Crawl archive and Wikipedia pages. They also reported on Tweets in their post, but that dataset was not shared with us due to the restrictions posed by the usage policies.
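For readers who want to try a similar lookup on their own list of URLs, here is a minimal sketch. It assumes the public Wayback Machine CDX server endpoint and its `output=json` response shape (a header row followed by capture rows, oldest first); it is not necessarily how the lookup for this post was performed. The helper names and the sample response are hypothetical, and no network request is made.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public CDX endpoint (assumption: used here for illustration only).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def first_capture_query(url: str) -> str:
    """Build a CDX query URL asking for the earliest capture of `url`."""
    params = {
        "url": url,
        "output": "json",
        "limit": "1",  # captures are listed oldest first, so 1 row = first capture
        "fl": "timestamp,original,statuscode",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_first_capture(body: str) -> Optional[dict]:
    """Return the first capture as a dict, or None if the URL was never archived."""
    rows = json.loads(body) if body.strip() else []
    if len(rows) < 2:  # empty result, or a header row with no captures
        return None
    header, first = rows[0], rows[1]
    return dict(zip(header, first))


# Made-up response body, shaped like a CDX JSON reply.
sample_body = (
    '[["timestamp","original","statuscode"],'
    '["19970101000000","http://example.com/","200"]]'
)
capture = parse_first_capture(sample_body)
print(first_capture_query("http://example.com/"))
print(capture["timestamp"][:4] if capture else "never archived")
```

Fetching the query URL with any HTTP client and passing the response text to `parse_first_capture` would answer both "if" and "when" for a single URL; a dataset of millions of URLs would need batching and rate limiting, which this sketch leaves out.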

Before we dive into our findings, below are brief descriptions of some terms that we will use frequently:

  • Alive: URLs that return 200 OK HTTP status code when resolved
  • Dead: URLs that return HTTP error status codes, TCP connection errors, or DNS failures when resolved
  • Preserved: URLs that are Alive on the live Web as well as present in a Web archive
  • Rescued: URLs that are Dead on the live Web, but are present in a Web archive
  • Endangered: URLs that are Alive on the live Web, but are not present in any Web archive
  • Vanished: URLs that are Dead on the live Web and also not present in any Web archive
  • Archived: Preserved + Rescued
  • Accessible: Preserved + Rescued + Endangered
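The definitions above amount to a two-by-two classification on two observations per URL: whether it is alive on the live Web and whether it is present in a Web archive. The sketch below encodes that mapping; the `Status` enum and the function names are illustrative choices, not part of the study's tooling.

```python
from enum import Enum


class Status(Enum):
    PRESERVED = "preserved"    # alive and archived
    RESCUED = "rescued"        # dead but archived
    ENDANGERED = "endangered"  # alive but not archived
    VANISHED = "vanished"      # dead and not archived


def classify(alive: bool, archived: bool) -> Status:
    """Map the two observations about a URL to one of the four base categories."""
    if alive:
        return Status.PRESERVED if archived else Status.ENDANGERED
    return Status.RESCUED if archived else Status.VANISHED


def is_archived(status: Status) -> bool:
    """Archived = Preserved + Rescued."""
    return status in (Status.PRESERVED, Status.RESCUED)


def is_accessible(status: Status) -> bool:
    """Accessible = Preserved + Rescued + Endangered, i.e. everything but Vanished."""
    return status is not Status.VANISHED


print(classify(alive=False, archived=True).value)
```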

When we do not take any Web archives into account, about a quarter of all the 5.4 million sampled URLs would be considered inaccessible or dead, as illustrated in Figure 1. However, when we leverage the Wayback Machine to access otherwise dead URLs, the fraction of inaccessible or vanished URLs drops from one in every four down to only one in every ten. The Wayback Machine has about 72% of the entire dataset archived: 56% of the dataset is preserved (URLs that are still alive on the live Web) and 16% is rescued from the dead. Another 18% of the URLs in the sample are still alive but have not yet been archived in the Wayback Machine; we call these endangered, as they will vanish if they ever cease to exist on the live Web. It is worth noting that we did not account for any captures of these URLs that might be present in any of the many smaller Web archives other than the Wayback Machine, which, if accounted for, might increase the percentage of accessible URLs a little more. Moreover, we relied on HTTP status codes and did not look into the contents of the pages to check for any soft-404s (i.e., error pages that wrongly return a 200 OK HTTP status code) or other irrelevant content, which might change the numbers further.
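The percentages in this paragraph fit together as a simple partition of the sample. The short check below uses only the rounded figures quoted above (56% preserved, 16% rescued, 18% endangered) and derives the rest; because the inputs are rounded, the results are approximate.

```python
# Rounded percentages quoted in the post for the full Pew sample (5.4M URLs).
preserved = 56   # alive and archived
rescued = 16     # dead but archived
endangered = 18  # alive but not archived

# The four base categories partition the sample, so the remainder is vanished.
vanished = 100 - (preserved + rescued + endangered)

alive = preserved + endangered                 # still resolving on the live Web
dead = rescued + vanished                      # "about a quarter" of the sample
archived = preserved + rescued                 # present in the Wayback Machine
accessible = preserved + rescued + endangered  # reachable live or via the archive

print(f"vanished={vanished}% dead={dead}% archived={archived}% accessible={accessible}%")
```

The derived values line up with the text: 26% dead matches the Pew (All) row of Table 1, and 10% vanished is the "one in every ten" that remains inaccessible once the Wayback Machine is taken into account.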

Figure 1: Archiving status of all the URLs from the Pew dataset in the Wayback Machine.

A subset of about 1 million URLs from the Pew dataset is a sample of general webpages from the last decade, spanning 11 years from 2013 to 2023. They noted that about a quarter of the URLs from this subset were dead in 2023, with older URLs having a greater percentage of loss, all the way to 38% for links from 2013. We recreated their yearly graph in Figure 2 in orange, with an overlay in green of the URLs rescued by the Wayback Machine. We found that about 38% of the 38% dead URLs from 2013 (i.e., about 15% of the total) are rescued by the Wayback Machine. Moreover, of the roughly one quarter of URLs in the cumulative general sample that were considered dead, about half were rescued by the Wayback Machine. It is worth noting that the last three years in Figure 2 seem to be rescued almost completely, but this is a side effect of the ingestion of recent Common Crawl data into the Wayback Machine, and Common Crawl happens to be the source of the Pew sample.

Figure 2: Yearly archiving status of URLs from the general sample of the Pew dataset in the Wayback Machine.

We tried to get access to the dataset of about 2 million URLs from Zittrain’s NYTimes outlinks study, but we have not received it yet. In the interim, we created our own dataset by downloading all the NYTimes pages published in 2013 that are present in the Wayback Machine, extracting all the outlinks from them, and excluding all the links to pages from NYTimes itself. We were able to collect about 88 thousand such URLs this way. Then we checked the live Web status of each of the URLs (after following up to 5 redirects, if any) and also checked for their presence in the Wayback Machine. We found that 40% of the external links from NYTimes pages from 2013 were dead on the live Web, but 96% of those dead URLs are archived in the Wayback Machine. This means that only about 2% of the URLs in this sample have vanished. However, this impressive number needs to be taken with a grain of salt because we do not have the original URL sample and our own sample is derived from pages present in the Wayback Machine, which has an inherent bias: outlinks from those pages are more likely to be archived than the outlinks of pages that are not present in the Wayback Machine. That said, we will be keen to revisit these numbers if and when we get access to the original sample of URLs used in Zittrain’s study.
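The outlink-extraction step described above can be sketched with the Python standard library. This is an illustrative reconstruction under simple assumptions (only `<a href>` links are considered, and a link counts as internal when its host is the site's domain or a subdomain of it), not the pipeline actually used; the function and class names are hypothetical. The live-Web check with up to 5 redirects is left out because it needs network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_outlinks(html: str, page_url: str, own_domain: str) -> list[str]:
    """Return unique absolute http(s) links that point outside `own_domain`."""
    parser = LinkCollector()
    parser.feed(html)
    seen, result = set(), []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        host = (parts.hostname or "").lower()
        if host == own_domain or host.endswith("." + own_domain):
            continue  # link to the site itself
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_html = (
    '<a href="/section/world">World</a>'
    '<a href="https://www.nytimes.com/other">Other</a>'
    '<a href="http://example.org/report">Report</a>'
    '<a href="mailto:someone@example.org">Mail</a>'
    '<a href="http://example.org/report">Report again</a>'
)
print(external_outlinks(sample_html, "https://www.nytimes.com/2013/story.html", "nytimes.com"))
```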

A recent, and perhaps the most comprehensive, longitudinal link-rot study from ODU, on which we are a collaborator, analyzed 27.3 million URLs sampled from the index of the Wayback Machine spanning more than two and a half decades. They reported that about 65% of the sampled URLs from 1996 to 2021 were found dead in 2023. A significant number of these samples did not even resolve in DNS, indicating that many of those domain names were no longer registered. They found that most URLs die rapidly in the first few years of their existence, but some of the longest living sites are not dead yet. Luckily, all of the dead URLs in this sample are rescued by the Wayback Machine by virtue of it being the source of the sample in the first place. This also means that the ODU study cannot tell the percentages of endangered or vanished URLs, because its dataset contains no URLs that were never archived.

In summary, all of the link-rot studies, with varying numbers, indicate that the Web is brittle and an increasing number of Web resources die with the passage of time. However, we found that Web archives like the Wayback Machine play an increasingly important role in rescuing the dead Web and minimizing the fracture of the knowledge graph on the Web, but there is a lot more to do. For example, the Turn All References Blue (TARB) project has fixed more than 30 million broken links (and counting) on hundreds of wikis with the help of the InternetArchiveBot, the WaybackMedic bot, and the Wayback Machine.

While there is not a lot that can be done to resurrect the vanished Web other than attempting to find alternate locations where the content might have moved to (via projects like FABLE), we are determined to minimize the percentage of endangered URLs. However, there are some internal and external factors that limit our ability to reduce it to zero, such as resource limitations, JavaScript-heavy pages, bot blocking, loginwalls, paywalls, the deep web, and lack of timely discovery. We strive to narrow down the potential loss of our cultural heritage via different means such as ingesting feeds from MediaCloud, GDELT, Wikipedia EventStream, and more recently, becoming part of the IndexNow initiative for link discovery soon after corresponding page creation or update on the Web. Moreover, we have the Save Page Now (SPN) service and urge that when you “See Something, Save Something!”. Your continued support will help us preserve more of the Web, and preserve it better.

NOTE: This work was presented at the IIPC WAC 2025, with the talk recording available on YouTube and slides hosted in the UNT Digital Library. It was also presented at the WADL 2025.

ACKNOWLEDGEMENTS: We thank our friends at the Pew Research Center and the Old Dominion University and our colleagues Jake LaFountain, Stephen Balbach, Chris Freeland, and Mark Graham for their help and support in this work.


Dr. Sawood Alam
Research Lead, Wayback Machine
Internet Archive

Wayback Machine Director Pushes Back on AI Scraping Fears Driving Archive Blocks

As reported by Nieman Lab last month, some major media organizations—including The New York Times, The Guardian, and Reddit—have started blocking the Wayback Machine from archiving their sites over unfounded concerns about AI scraping.

Last week, tech writer Mike Masnick (Techdirt) explained why this is “a mistake we’re going to regret for generations.”

Today, Mark Graham, director of the Wayback Machine, has published a response to the Nieman Lab reporting, pushing back on the media organizations’ concerns about the Wayback Machine being a backdoor to AI scraping. Graham writes:

“These concerns are understandable, but unfounded… like others on the web today, we expend significant time and effort working to prevent such abuse.”

Read the post to learn how Graham is working to protect the integrity of the Wayback Machine, and why limiting web archiving threatens our shared digital history.

Preserving the Open Web: Inside the New Wayback Machine Plugin for WordPress 

Link rot. There’s nothing quite as frustrating as clicking on a link that leads to nowhere.

WordPress, which powers more than 40% of websites online, recently partnered with the Internet Archive to address this problem. Engineers from the Internet Archive and Automattic worked together to create a plugin that can be added to a WordPress website to improve the user experience and check the Wayback Machine for an archived version of any webpage that has been moved, changed or taken down.

The free Internet Archive Wayback Machine Link Fixer, publicly launched last fall, combats link rot by seamlessly redirecting the user to a reliable backup page when it encounters a missing page. When the plugin is added to a website, it scans the site, sees what pages exist, and automatically adds those pages to a queue to be archived. If an archived copy of a page doesn’t already exist, the page is sent for capture.

Once the software is installed on a WordPress website, the plugin will auto redirect users to the Wayback Machine version of a missing page. 
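To make the redirect idea concrete, here is a small sketch of the fallback logic in Python. It is not the plugin's code (the plugin runs inside WordPress); it only illustrates the general pattern, assuming the Wayback Machine playback URL form `https://web.archive.org/web/<timestamp>/<url>`, where an omitted timestamp leads to the most recent capture. The function names and the rule that anything other than a 200 response counts as missing are simplifying assumptions.

```python
from typing import Optional

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str, timestamp: str = "") -> str:
    """Build a playback URL; `timestamp` is YYYYMMDDhhmmss or a prefix of it."""
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"
    return f"{WAYBACK_PREFIX}{original_url}"


def resolve_link(original_url: str, http_status: Optional[int], timestamp: str = "") -> str:
    """Keep a working link as-is; send a missing one to its archived copy.

    `http_status` is None when the request failed before any response
    arrived (for example a DNS or connection error).
    """
    if http_status == 200:
        return original_url
    return wayback_url(original_url, timestamp)


print(resolve_link("https://example.com/old-post", 404, "20200101"))
```

Passing the date a link was first published as the timestamp would point readers to a capture close to what the author originally cited, rather than to whatever the page later became.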

Broken links are one of the web’s most relentless problems. Pew Research found that 38% of webpages that existed in 2013 were no longer accessible a decade later, and for web admins, “It’s a never-ending game of whack-a-mole to keep links working,” said Matt Blumberg, Product Manager with the Wayback Machine. “This new tool prevents those inevitable 404s by automatically updating links to a preserved copy and it proactively archives pages in the Wayback Machine, where they’re kept accessible for free, long-term, so your site stays usable without manual fixes.”

Many WordPress websites are homespun and are most susceptible to having links go dead. Remedying this problem is not only valuable to individuals, but also to the overall culture, said Alexander Rose, Director of Long-term Futures for Automattic Inc., the technology company behind WordPress.com.

“We need to have an accurate memory of the things that get said, posted, and the ways that we have communicated over time,” Rose said. “Otherwise we’re either doomed to repeat errors or we’re going to make choices that are uninformed by the past.”

The link fixer is expanding the “heroic effort” made by the Internet Archive over the years to preserve everything from small websites to NASA.gov and WhiteHouse.gov, he said.

“It’s very important that websites have a memory and that the web overall has a memory,” Rose said. “We are increasingly using [the web] as our only source of truth. When links go dead, in effect, the truth goes dead. This has become even more important in the world of AI.”

As the plugin rolls out, Rose and Blumberg said they are open to feedback. The goal is to make the software as easy as possible to use. Next, they will fine-tune the features and promote its broad use.

“As it becomes a solid piece of software that people know and like, then I think it has a path to being integrated much more deeply,” Rose said. “It’s early days, but every person I’ve talked to about it is excited to see the potential end of the dreaded 404 error.”

Follow the Changes: 9 Ways Web Archives are Used in Digital Investigations

Guest post from Thais Lobo, Liliana Bounegru & Jonathan W. Y. Gray, King’s College London.

This work was supported by the Centre for Digital Culture and Department of Digital Humanities at King’s College London and developed further through collaborations with researchers and students at the University of Amsterdam.


Digital journalists increasingly turn to web archives like the Wayback Machine to follow how things on the Internet break, change or disappear – from deleted posts to quietly edited pages.

The web has become not only a source of information but also the subject of media investigations, prompting journalists, researchers and activists to use digital archives to reconstruct timelines, verify claims, uncover hidden connections and hold powerful actors to account.

As online materials grow more fragile and prone to disappearance, the Internet Archive’s Wayback Machine has been critical in making “lost” web pages available – recently celebrating archiving over a trillion pages.

As we’ve previously written about on this blog, the Wayback Machine is an important resource for our work as media researchers, helping us to trace histories of digital media objects (for example, changes in ad tracker signatures of viral “fake news” sites over time).

We are also interested in how others use web archives across fields, and what we can learn from each other.

In this piece we draw on the Internet Archive’s News Stories collection to surface practices and use cultures of the Wayback Machine amongst journalists and media organisations. We analysed a dataset of about 8,600 news articles, assembled by the IA via daily Google News keyword searches since 2018.

Drawing on a combination of digital methods, machine learning and lots of reading, we surfaced nine ways that journalists use the Wayback Machine in their reporting.

***

1. following what is deleted

Shifting political alliances are a common driver of online footprint erasure. Deleted tweets have revealed past critics in current allies (here and here), and current career aspirations were juxtaposed with earlier conflicting stances in personal blogs and websites (here, here, here and here). 

Unannounced takedowns of collections or site sections on government websites often prompt investigations using archival snapshots. Examples include removed editions of presidential newsletters and deleted staff contact lists for services supporting vulnerable groups, signaling access-to-information breaches. 

The removal of official publications also enticed further contextualisation, revealing cases in which information was deleted due to being incomplete, inaccurate or inconveniently timed.

Beyond politics, erasures on corporate websites highlight commercial and reputational pressures, such as deleted statements on forced labour, product safety and climate deception.

2. following what has been altered

Subtle alterations on webpages can also reveal a plain-to-see effort to reshape narratives.

Reporting based on archived pages shows how wording edits can move in opposite directions: from hardening language on migration ahead of a policy announcement to softening controversial statements in view of a political nomination, or erasing customer protection promises prior to a bankruptcy filing. 

In other cases, small additions to online content have proved just as revealing. A before and after snapshot of a blog post showed how a supposed early warning about a virus threat was added only after the pandemic began. Similarly, changes to a social media platform’s API rules appeared shortly after third-party apps were banned, subtly reframing the policy to align with new restrictions.  

3. following what is banned

Sometimes removals are deliberate, often at the request of companies seeking to enforce copyright, control branding, or limit liability.

Reports from media investigations highlight how such bans can affect games (here, here, here and here), apps and technical reviews.

In some cases, the bans intersect with political pressures, such as Hong Kong news outlets being shuttered under pro‑Beijing pressure, and disinformation networks being taken down due to links to state actors.

4. following what is broken

Archived snapshots are also often the only way to reconstruct what preceded a link break, when it happened, and what information was effectively cut off.

For example, an investigation into a set of broken URLs on a government website revealed that the pages themselves had not been removed, but the links pointed to outdated servers, creating a false impression of secrecy that sparked a conspiracy theory.

In another case, a major technical glitch took multiple Nigerian government websites offline, cutting off access to official information and showing how even unintentional failures can undermine transparency.

5. following what is hacked

Archived snapshots of compromised websites and social media accounts serve as another kind of traceable historical record.

For example, past screenshots of Twitter’s bio page revealed inconsistencies in claims about an alleged takeover of the US president’s social media account. In other cases, such snapshots helped surface a forensic trail and distinguish unauthorised activity carried out by activists (here and here) from activity linked to cybercriminal groups (here).

6. following what is connected

Archived web data often uncovers unexpected linkages between domains’ ownership that appear unrelated on the surface.

For example, journalists used analytics codes of copies of sites maintained by the Wayback Machine to uncover disinformation networks. In another investigation, archived records verified that a website redirect to Joe Biden’s presidential campaign was unrelated to him, debunking conspiracy theories about the domain’s ownership.
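As a rough illustration of the analytics-code technique, the sketch below groups sites by a shared legacy Google Analytics account number found in their archived HTML. It assumes the older `UA-<account>-<property>` identifier format and uses made-up site names and markup; real investigations typically combine several kinds of identifiers and verify any match by hand.

```python
import re
from collections import defaultdict

# Legacy Google Analytics IDs look like UA-<account>-<property>.
UA_ID = re.compile(r"\bUA-(\d{4,10})-\d{1,4}\b")


def shared_analytics_accounts(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each analytics account seen on more than one site to those sites."""
    sites_by_account = defaultdict(set)
    for site, html in pages.items():
        for account in UA_ID.findall(html):
            sites_by_account[f"UA-{account}"].add(site)
    return {
        account: sorted(sites)
        for account, sites in sites_by_account.items()
        if len(sites) > 1
    }


# Hypothetical archived snapshots of three unrelated-looking sites.
snapshots = {
    "site-a.example": "<script>ga('create', 'UA-1234567-1', 'auto');</script>",
    "site-b.example": "<script>ga('create', 'UA-1234567-2', 'auto');</script>",
    "site-c.example": "<script>ga('create', 'UA-7654321-1', 'auto');</script>",
}
print(shared_analytics_accounts(snapshots))
```

Because archived copies keep the tracking snippet a page carried at capture time, the same comparison can be run across snapshots from different dates to see when two sites started or stopped sharing an account.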

Snapshots of a fake Black Lives Matter Facebook page and its associated websites allowed reporters to trace the individuals behind the operation. Similarly, archived versions of Amazon storefronts exposed networks of accounts generating affiliate revenue from coordinated product listings.

7. following what is reported

Archived web pages have proven vital for tracing how stories are presented across media outlets and platforms.

Investigations have examined archived versions of individual pages, such as headline coverage relying heavily on unverified claims, a news agency’s premature editorial assessment, or the unflagging of branded content.

In another case, snapshots of the Google homepage captured during the 2018 State of the Union speech disproved a viral claim that Google ignored Donald Trump’s address in favour of Barack Obama.

8. following what is unchanged

In other investigations, the most revealing detail is what did not change.

For example, during a bushfire crisis in Australia, archived pages showed that a key policy statement by the Greens party was left untouched, despite a disinformation campaign claiming the contrary.

Similarly, a social media account circulated as having been reactivated under a new wave of laissez-faire moderation was, in fact, never suspended.

9. following what is saved 

When forums, platforms and websites shut down, it is often crowdsourced archivists who capture their traces before they vanish for good.

In several reported cases, users raced to preserve spaces such as a long-running forum for sex workers, a 16-year-old Q&A site, a meme-sharing platform, and a free music library.

Archiving web pages can become part of the story.

***

These are some of the ways we’ve noticed journalists using web archives – and there are many more! If you know of other interesting examples, we’d love to hear from you.

We hope that these nine ways may help to inspire critical and creative uses of web archives to “follow the changes” – exploring what they can tell us about digital culture and society, and the times we live in.

About the authors

Thais Lobo is a research associate at the Department of Digital Humanities, King’s College London, with a previous career in journalism.

Jonathan W. Y. Gray is Co-director of the Centre for Digital Culture and Reader in Critical Infrastructure Studies at the Department of Digital Humanities, King’s College London. He is also co-founder of the Public Data Lab; research associate at the Digital Methods Initiative (University of Amsterdam) and the médialab (Sciences Po, Paris). More about his work at jonathangray.org

Liliana Bounegru is Senior Lecturer (Associate Professor) in Digital Media, Culture and Society at the Department of Digital Humanities, King’s College London. She is also co-founder of the Public Data Lab, member of the Digital Methods Initiative at the University of Amsterdam and associate of the Sciences Po Paris médialab. More about her work can be found at lilianabounegru.org.

How Librarian Megan Lotts Turned 1 Trillion Web Pages into an 8-Page Zine

How do you commemorate the preservation of 1 trillion web pages in a zine? That was Megan Lotts’ challenge when she was contacted by the Internet Archive last summer.

Lotts is an art librarian at Rutgers, the State University of New Jersey, where she promotes creativity, play, and makerspaces through her teaching and research. She designs zines (short for magazine), which are self-published, handmade objects that are often copied and shared. It was through Lotts’ involvement with zines at the American Library Association (ALA) conference that she was asked by Internet Archive librarian Chris Freeland to create one for the Internet Archive’s October celebration.

For the project, Lotts collaborated with Louisa Cohen and Drew MacDonald at the Internet Archive on images and text to incorporate. Although an avid user of the Internet Archive, Lotts said making the zine prompted her to take a deep dive and discover all new material. 

“As a librarian, this is a space where you go for history,” she said of the Internet Archive. “I’m a kind of curious, reflective person, but there were collections that I came across that I didn’t know existed.”

The final product is an 8-page zine that Lotts has shared on the Internet Archive, along with a close-up view of the pages. It includes the Wayback Machine logo, icons of various collections, and an old Polaroid photo of Internet Archive’s digital librarian, Brewster Kahle, next to a vintage computer.

The zine was printed and shared with attendees at the Oct. 22 Internet Archive party in San Francisco. Lotts took a week off from Rutgers to help unveil the zine at the festivities. Upon returning to Rutgers, she said it was fun to show students her work and explain the process. They were excited to hear about her experience, Lotts said, and what she learned behind the scenes at the headquarters.

“My students grew up with the Wayback Machine. They’ve used it since grade school,” said Lotts, 51, who remembers first accessing the Archive in college. “If you think about 1 trillion pages in less than 30 years, that’s outrageous. It’s preserving information for posterity.”

Zines need to be preserved, Lotts maintains, along with other art and cultural artifacts.

Librarian and creator Megan Lotts.

“When I give someone a zine, what I’m really hoping is that I’m giving you a moment,” Lotts said, “whether you recognize it or not, to hold this in your hands and get lost from the rest of the world. It’s just a tiny little book … I want people to look at it and think about it. That’s the beauty of the zine.”

Zines can be as elaborate as the one she produced for the Archive, she said, or as simple as creating something with a piece of paper, pen or pencil and an idea. “Those are things that most of us can access and everybody has a story,” said Lotts, who hopes the project inspires people to consider tapping into their creative side to make a zine.

“I’m noticing—as a scholar and as an educator—that people want to engage with the arts. They want to be creative,” said Lotts, who has degrees in fine arts, library science, painting and art history and teaches a class on play. “It’s really powerful for me to see students come alive and think about information and knowledge creation in a playful and exciting way.”

Lotts is the author of two books published by the American Library Association (ALA):  Advancing a Culture of Creativity in Libraries: Programming and Engagement (2021) and The Playful Library: Building Environments for Learning and Creativity (2024).

Check out her scholarship web page and website for more.