How the Internet Archive maintains an information super highway

Opensource.com

I've been an avid user of Internet Archive since I found out about it five years ago. Since then, I've used their Wayback Machine to search for fun glimpses at the World Wide Web of yore, but I've also benefited from the Archive as a distribution platform.

I've discovered musical performances by old forgotten jazz musicians, tiny indie bands, and entire orchestras performing (sometimes definitive, in my opinion) versions of classics. I have uncovered niche podcasts, forgotten noir and science fiction and horror movies, and more. I've even used sound clips from old TV ads and TV shows in some of my experimental music albums (I think they'd be classified as "sound collages" in art school).

And, maybe most importantly, I've gotten a glimpse of a culture of users and homespun archivists who are interested in getting an impossible snapshot of the world, in the finest detail.

The Archive isn't just about archiving the Internet; it's about using the Internet as an archive for world culture, over time, with no barrier to entry or qualification or justification required. And it works. An afternoon spent on the Archive is like an afternoon at the public library; it's aimless, and overwhelming, and endlessly exciting and educational.

In a way, the Archive is exactly what we were told the Internet was supposed to be, back before the popup ads and nationwide ISPs: an Information Super Highway.

Eager to find out more about the culture within the Internet Archive and the background of the people who are helping keep it alive, I spoke with the director of media and access at Internet Archive, Alexis Rossi, and Vicky Brasseur, a volunteer maintaining the Internet Archive S3 API documentation. Together, they gave a presentation about the Internet Archive's API set at All Things Open 2015.

Read more in this interview.

What do you do at Internet Archive?

Alexis Rossi (AR): Basically, I'm in charge of all the digital media ("collections" in library parlance) and the interfaces people use to access that media.

Vicky Brasseur (VB): At the moment, I guess you could call me a Volunteer Without Portfolio. I maintain the Internet Archive S3 API documentation and sometimes travel around speaking about the Archive, its collections, and how people can use it.

What is your educational background?

AR: I have a B.A. in English and a Master of Library and Information Science. I've been working with Internet content since 1996. (Before that, I used to edit cookbooks!)

VB: I'm a former programmer who long ago went to the dark side of management (which I love). I don't have an MLIS or official archivist training, but having spent over 10 years of my life working either in libraries or library software, I'm very aware of and sensitive to the importance of GLAM (galleries, libraries, archives, museums) in our lives.

How did you get involved with Internet Archive?

AR: I became involved with Internet Archive when I was working at Alexa Internet, which was also founded by Brewster Kahle. Alexa built the first version of the Wayback Machine for us, and I was there when we launched it in 2001.

VB: I became aware of the Archive soon after it was created, probably through its relationship with Project Gutenberg, and continued following it over the years. In 2011 I wrote the Archive presenting my qualifications, hoping there would be a volunteer opportunity for me. I was pleasantly surprised to be offered a full-time position developing a new project. Life happened and I moved on from the Archive, but I've found ways to remain connected to and associated with it and its inspiring mission ever since I left.

What does open mean to the Internet Archive?

AR: Open means free access to information. We are a library, and we exist to gather knowledge and make it available to people.

Why do you feel that Internet Archive is important?

AR: There are a lot of physical libraries in the world, both academic and public. They do a wonderful job of serving their communities, but they focus on serving the people who happen to be nearby. There are also several digital library initiatives, like Europeana and Hathi Trust, but they tend to have limited access and/or limited scope based on nationality or university membership or other factors.

The Internet Archive is a public library that serves the entire world. Anyone in the world can use the media from our library, and anyone can contribute media by uploading.

What was the driving force behind the recent archive.org redesign?

AR: The Internet Archive's motto is "universal access to all knowledge." We've spent years improving our storage system, learning to digitize books, gathering web pages and media, and working with partners to build collections. In other words, we've spent a long time working on the "all knowledge" portion of our motto.

We have always made those collections available, but the previous version of the web site was designed in 2002. A lot has changed on the Internet since then, and while we made lots of small changes along the way to keep up with changing technologies, we felt that we were not doing enough toward "universal access." For example, about 35% of our traffic comes from mobile devices, and the previous version of the site was not responsive and rather difficult to use on a phone. We are continuing to work on improvements for the new interface and will roll out new features as they are built.

What is an open API as opposed to an API, which ostensibly 'opens' access to any application?

VB: Many of the readers already know about Application Programming Interfaces, aka APIs. An open API is exactly what it says on the tin: These are APIs that are open to use, provide access to open resources, and (for those APIs which are in public code repositories) are open to modification and re-use. The Archive can't fulfill its mission of universal access to human knowledge if the APIs which help enable that access are at all limited. These APIs are there to be used and used freely by all.

Of course, it's always best to be a good citizen and not use those APIs to slam the Archive with thousands of downloads or uploads in a short time. That's not usually a problem though, as Internet Archive patrons by and large understand and appreciate that they're part of a larger community, all contributing to and benefiting from an astonishingly rich resource. It's really a great thing to watch happening.

Is the Internet Archive itself an open stack, or is only the API open?

AR: Our stack is Ubuntu Linux + PostgreSQL + NGINX + PHP5 (primarily) + Redis + Elasticsearch + jQuery + Less

Which archive.org APIs are available? What do they access and in what languages are they accessible?

VB: Internet Archive has made so many APIs available that I hardly know where to start answering that question. There are APIs for the Wayback Machine, Open Library, searching, and uploading. There are just so many different ways to interact with the Archive programmatically. Want to data mine 24 petabytes of content? There's an API for that. Want to eliminate 404s on your site by redirecting users to the Wayback Machine instead? There's an API for that. Want to host your podcast on the Archive so it's preserved in perpetuity? Yeah, there's an API for that. And that's only scratching the surface.
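To make the 404 use case concrete, here is a minimal Python sketch. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available and a JSON response containing `archived_snapshots` and `closest` keys. The helper names and the sample payload are hypothetical, and the sketch deliberately makes no network request, so treat it as an outline rather than a definitive integration.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the lookup URL for a page, optionally near a YYYYMMDD timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot_url(response_text):
    """Return the archived URL from a JSON response body, or None if absent."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative payload only; a real response would come from the endpoint above.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20150101000000/http://example.com/",
            "timestamp": "20150101000000",
            "status": "200",
        }
    }
})

lookup_url = build_availability_url("example.com/missing-page", "20150101")
redirect_target = closest_snapshot_url(SAMPLE_RESPONSE)
```

A site's 404 handler could fetch `lookup_url`, pass the response body to `closest_snapshot_url`, and redirect the visitor only when the result is not `None`.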

For the most part, all of the APIs are entirely language agnostic. Can you do REST? Can you do URLs? Can you parse JSON? Then you can use the APIs. For those of a Pythonic bent, Jake Johnson at the Archive has provided a really great toolset and library which wraps up a number of the APIs into one neat package: github.com/jjjake/internetarchive.
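The "URLs plus JSON" point can be sketched with nothing but Python's standard library. The example below assumes the search endpoint at archive.org/advancedsearch.php (with `q`, `fl[]`, `rows`, and `output` parameters) and the per-item metadata endpoint at archive.org/metadata/. The function names, the sample response, and the item identifiers in it are hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints; check the Archive's own API documentation before relying on them.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
METADATA_ENDPOINT = "https://archive.org/metadata/"


def build_search_url(query, fields=("identifier", "title"), rows=5):
    """Build a search URL that asks for the given fields as JSON."""
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("output", "json")]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_identifiers(response_text):
    """Pull item identifiers out of a search response body."""
    payload = json.loads(response_text)
    docs = payload.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


def metadata_url(identifier):
    """Build the metadata URL for a single item."""
    return METADATA_ENDPOINT + identifier


# Illustrative response with made-up identifiers, shaped like the assumed JSON.
SAMPLE_SEARCH_RESPONSE = json.dumps({
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"identifier": "example-item-one", "title": "First example"},
            {"identifier": "example-item-two", "title": "Second example"},
        ],
    }
})

search_url = build_search_url("jazz")
identifiers = extract_identifiers(SAMPLE_SEARCH_RESPONSE)
metadata_urls = [metadata_url(item_id) for item_id in identifiers]
```

Because every step is a URL or a JSON document, the same flow carries over to any language with an HTTP client and a JSON parser.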

We gave an introduction to all of the APIs in our talk at All Things Open, but for those who couldn't make it to Raleigh (we missed you), we've uploaded slides that will introduce people to the API options: archive.org/details/linuxconna2015-ia-apis.

Why is it important that culture remains open?

AR: All human achievements are built upon the work of others. Society evolves because of information, and increasing the number of people who have access to that information can only benefit us. We need more people with more ideas!

What can someone do to help the Archive?

VB: Aside from just using it in all its forms? Well, I can't speak for Alexis, but I would love, Love, LOVE it if some folks would pitch in and translate some API documentation. It's hard to provide universal access to all knowledge if you're only speaking in one language. I don't know how to say it in a language other than English, but: Patches Welcome.

Seth Kenlon
Seth Kenlon is a UNIX geek, free culture advocate, independent multimedia artist, and D&D nerd. He has worked in the film and computing industry, often at the same time.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.