The Internet Archive is struggling to preserve the web thanks to walled gardens and paywalls

Shawn Knight

Recap: It has been more than 26 years since the Internet Archive set about preserving all sorts of digital material, including software, games, movies, images and, of course, web pages. The Wayback Machine is the mechanism that handles the ever-increasing task of collecting and collating snapshots of the Internet, and it has come a long way since the mid-90s.

Think of the Wayback Machine as a virtual time machine. With it, you can travel back into the past and view how websites looked at regular intervals throughout history. This can be immensely useful when performing research or fact checking, and equally amusing when chronicling how web design has evolved over the years.
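For anyone who would rather pull snapshots programmatically than browse them, here is a minimal sketch. It assumes the Wayback Machine's publicly documented "availability" endpoint at archive.org/wayback/available, which returns the archived capture closest to a given timestamp; the helper name and example site are illustrative, not from the article.

    import json
    import urllib.request

    def closest_snapshot(url, timestamp="20060101"):
        # Ask the Wayback Machine for the capture closest to `timestamp` (YYYYMMDD).
        query = f"https://archive.org/wayback/available?url={url}&timestamp={timestamp}"
        with urllib.request.urlopen(query) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    # Example: find how techspot.com looked around the start of 2006.
    print(closest_snapshot("techspot.com"))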

The Wayback Machine had archived two terabytes of data after just one year, a massive amount at the time. These days, you could store all of that on a $30 USB flash drive and carry it around in your pocket.

Today, the Wayback Machine contains more than 700 billion web pages in its database and is approaching 100 petabytes of storage. Unfortunately, the non-profit's work is not getting any easier, as paywalls and walled gardens like Facebook are making content increasingly difficult to capture. Will we have a neatly preserved record of today's social media activity 20 years from now?

Related reading: The Internet Archive has enhanced its Computerworld scans

Should the metaverse materialize as some predict, the Internet Archive will have to evolve its collection efforts accordingly or risk not cataloging what transpires in that digital medium.

Not everyone believes the organization has the right to do some of the stuff it does. When the Internet Archive launched the National Emergency Library with no waitlists at the start of the pandemic, several publishers said it amounted to willful mass copyright infringement.

The Internet Archive closed its emergency lending library early in hopes of avoiding a costly lawsuit, but publishers filed suit anyway. In July, both sides filed motions for summary judgment.


 
Can't they get it from the Google cache like https://12ft.io/ does?
Well no, that's the problem. The paywalls don't allow things like search bots to access the otherwise restricted content. Google doesn't have access to it either, so there isn't anything for it to cache aside from maybe the headline of a page.
 
It might be time to consider excluding a few of these more difficult customers .... sad but true ....
 
It bugs me how Balkanized the web has become over the last decade. We're so far into the post-truth era now that you might as well tune it all out and just go with your gut.
 
Well no, that's the problem. The paywalls don't allow things like search bots to access the otherwise restricted content. Google doesn't have access to it either, so there isn't anything for it to cache aside from maybe the headline of a page.
That's confusing, because that's exactly how 12ft.io does it. The only way to defeat 12ft.io is to NOT allow Google's page scraper to scrape your page. WaPo and NYT are two I know of that don't allow Google to scrape their sites. I'm sure others do it too, but I'm not aware of which ones.
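For the curious, here is a minimal sketch (not from the thread) of how you could check whether a site's robots.txt tells Googlebot to stay away, which is what determines whether Google ever has a cached copy for a tool like 12ft.io to fall back on. The site list is just the two examples named above, and the check only covers the front page; paywalled article pages are often governed by separate rules or served differently to crawlers.

    from urllib import robotparser

    # Example sites taken from the comment above; swap in any site you want to check.
    sites = ["https://www.washingtonpost.com", "https://www.nytimes.com"]

    for site in sites:
        rp = robotparser.RobotFileParser()
        rp.set_url(site + "/robots.txt")
        rp.read()  # download and parse the site's robots.txt
        allowed = rp.can_fetch("Googlebot", site + "/")
        print(f"{site}: Googlebot allowed on the front page? {allowed}")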
 
Creating a digital snapshot of the last 20 years of social media content is like cataloging every piece of trash at the local dump. If future generations never know that the Kardashians existed, they will be better for it!
 
Maybe by then people will have to actually do something to merit being called a "celebrity" rather than getting a boob job and self-declaring their status.
 