On May 12, 1996, like a benevolent mad scientist, Brewster Kahle brought the Internet Archive to life. The World Wide Web was in its infancy and the Archive was there to capture its growing pains. Inspired by and emulating the Library at Alexandria, the Internet Archive began its mission to preserve and provide universal access to all knowledge.
On October 27, 2016, the Internet Archive celebrated its 20th birthday with a party at its beautiful headquarters in San Francisco. According to an article in the San Francisco Chronicle, over 600 people gathered to pay their respects and hear about the latest projects and features of the Archive. The Internet Archive team did not disappoint, presenting some important and impressive advances released in the past year, including:
- The Political TV Ad Archive, a spin-off of the TV News Archive that lets users search and cite thousands of advertisements aired during the 2016 US election cycle, making it a treasure trove for journalists.
- A Firefox add-on that offers Wayback Machine snapshots of webpages that return a 404 error.
- A project that has fixed millions of dead links on the English Wikipedia by redirecting them to their Wayback Machine snapshots.
- GifCities, a specialized search engine for locating retro animated GIFs from the good ol' days of GeoCities.
- A new domain summary feature for the Wayback Machine, providing fascinating historical information about websites.
IA servers by John Blyberg; CC BY (on Flickr)
Of all the projects announced at the event, though, one of the most exciting and impressive is the newly released ability to search the complete contents of all text items on the Internet Archive. Nine million text items, covering hundreds of years of human history, are now searchable in an instant.
"It's kind of magic: It is like to be able to read at the speed of light! Every day I am discovering contents that we don't even know we have," says Giovanni Damiola, the software engineer who developed the new functionality. Hailing from Italy, Giovanni started at the Archive in 2015. He's spent the last four months implementing the search feature while also keeping OpenLibrary running smoothly.
"Our search engine uses an Elastic Search cluster. The core is made with 10 servers with 22 CPU each, with a total storage of 70TB on SSD. The index at the moment is 4-5TB and it contains around 9 million documents...growing every day."
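To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures Giovanni quotes; the decimal units (1 TB = 10^12 bytes) and the even split of storage across servers are assumptions made for illustration, not details confirmed by the Archive.

```python
# Figures quoted in the article. Decimal units are an assumption: 1 TB = 10**12 bytes.
SERVERS = 10
CPUS_PER_SERVER = 22
TOTAL_SSD_TB = 70
INDEX_TB_RANGE = (4, 5)   # "4-5TB"
DOCUMENTS = 9_000_000     # "around 9 million documents"


def cluster_summary():
    """Derive rough totals and averages from the quoted figures."""
    return {
        "total_cpus": SERVERS * CPUS_PER_SERVER,
        # Assumes storage is spread evenly across servers (hypothetical).
        "ssd_per_server_tb": TOTAL_SSD_TB / SERVERS,
        # Fraction of total SSD capacity occupied by the index (low, high).
        "index_share": tuple(tb / TOTAL_SSD_TB for tb in INDEX_TB_RANGE),
        # Average index size per document in kilobytes (low, high).
        "index_kb_per_doc": tuple(
            tb * 10**12 / DOCUMENTS / 1000 for tb in INDEX_TB_RANGE
        ),
    }


if __name__ == "__main__":
    for name, value in cluster_summary().items():
        print(name, value)
```

Under these assumptions the cluster totals 220 CPUs and about 7 TB of SSD per server, the index occupies roughly 6 to 7 percent of the available storage, and each document accounts for around half a megabyte of index on average.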
Users can access the new feature by selecting the "Search full text of books" option below the search bar on the search results page. Search terms are highlighted in the results thanks to the Archive's open eBook reader, a project maintained by Richard Caceres of the Internet Archive.
The feature is still in beta but is already amazingly powerful, and even in these early days it is clearly a valuable tool for researchers. Users can provide feedback on the beta to help improve future versions of the tool.
IA truck by Jeremy Brooks; CC BY-NC (on Flickr)
Unsurprisingly, this is only the first stage in the Archive's vision for improving access to the contents of its textual holdings. When asked what the future holds, Giovanni provided some tantalizing hints. "This is just the beginning. Soon we can add more features, for example: entities recognition, allowing us to group the books in new and, not obvious, ways and categories. This tool will make it easy also to run data analysis on books' corpora."
Full-text search isn't the only search feature the Internet Archive launched last week. It also revealed improved Advanced Search filtering to help visitors find what they need among the more than 15 PB of data at their fingertips. You can read more about these filtering options in a blog post published by the Archive.
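For readers who prefer to script their searches, the Python sketch below builds a filtered query URL. It assumes the Archive's public `advancedsearch.php` endpoint and its `q`, `fl[]`, `rows`, and `output` parameters, none of which are described in this article; the helper name `build_search_url` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Archive's public search API (not named in the article).
BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(terms, mediatype="texts", fields=("identifier", "title"), rows=10):
    """Return a search URL restricted to one media type.

    `terms` is the free-text part of the query; the media type filter is
    appended using Lucene-style `field:value` syntax.
    """
    query = f"{terms} AND mediatype:{mediatype}"
    params = [("q", query)]
    # Each requested result field is passed as a repeated fl[] parameter.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("output", "json")]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("alexandria"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return matching items as JSON if the endpoint behaves as assumed here.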