The Memory of the Internet

Dr. Noha Adly

How many computers and how much space do you need to store the current Internet archive?

Adly: A little over 180 computers store the 100 terabytes of data in the Internet Archive in an area of 40 square meters. In addition, we need a large dedicated air-conditioning unit. However, storage technology keeps delivering higher and higher density.

Can you estimate how many gigabytes would have been needed for the ancient knowledge recorded on 700,000 scrolls?

Adly: This is an interesting question. An A3 sheet without many graphics, digitized at high quality, yields an RGB image of approximately 35 megabytes. If one thinks of a sheet from a scroll in the ancient library as such an A3 sheet, and supposes that one scroll needs ten sheets, then 700,000 scrolls come to 7 million sheets, or roughly 245,000 gigabytes (about 245 terabytes) for all the ancient knowledge recorded.
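The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the answer (35 megabytes per sheet, ten sheets per scroll, 700,000 scrolls); the decimal unit conversion (1 terabyte = 1,000,000 megabytes) and all names in the code are assumptions made for illustration.

```python
# Figures taken from the answer above; decimal units are an assumption.
SCROLLS = 700_000
SHEETS_PER_SCROLL = 10
MB_PER_SHEET = 35


def scroll_archive_size_mb(scrolls, sheets_per_scroll, mb_per_sheet):
    """Total size in megabytes if every sheet becomes one scanned image."""
    return scrolls * sheets_per_scroll * mb_per_sheet


total_mb = scroll_archive_size_mb(SCROLLS, SHEETS_PER_SCROLL, MB_PER_SHEET)
total_gb = total_mb / 1_000        # megabytes -> gigabytes
total_tb = total_mb / 1_000_000    # megabytes -> terabytes

print(f"{total_gb:,.0f} GB, or about {total_tb:,.0f} TB")
```

The result scales linearly with each assumption, so halving the scan size or the number of sheets per scroll halves the total.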

Owners of websites can manually prevent their pages from being archived by installing a robots.txt file on their Web server. Many websites are not linked up and are therefore not found by archive crawlers. How complete can the collection of links ever be?

Adly: It is difficult to say how much of the content published on the WWW since 1996 has not been archived in the collection, because that would require knowing how much of the WWW is not linked to. We can only guess that at least 80 percent of the WWW has been archived, but this is very difficult to confirm.

Does the archive cover other networks such as the earlier Gopher or WWW3?

Adly: No, just the WWW starting from 1996.

How many users have searched through the Internet archive since it was opened?

Adly: There is no direct way of determining how many individual users have used a public Web database such as the Internet Archive. However, there have been close to 100,000 different IP addresses that have connected to our site, which gives a good indication of how popular the service is, keeping in mind that a single IP address could represent a network of a couple hundred users.

Will a text search be available in the Internet Archive soon?

Adly: This is an important point. If you don’t know the exact website address, you cannot find the site in the Internet Archive. For that reason, the independent project “Recall” is working to make the collection text-searchable.

How are websites being archived exactly?

Adly: Archive-specific Web crawlers search the Internet in a two-month cycle, continuously archiving all the URLs they find. The crawling is based on custom software developed for this particular application.

Are there legal gray areas relating to the archiving of private or commercial pages?

Adly: There seems to always be a legal gray area when it comes to intellectual property in this digital age. However, the Internet Archive is always keen not to transgress anybody’s rights. The Wayback Machine respects robots exclusion, and requests for exclusion by legitimate authors of pages are honored.
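To illustrate what respecting robots exclusion means in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not the archive's own crawler code, which the interview does not describe. The sample rules, the `example.org` URLs, the helper name `may_archive`, and the user-agent names `ia_archiver` and `otherbot` are all assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is excluded from the whole site,
# every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, user_agent, url):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ia_archiver", "otherbot"):
    for url in ("https://example.org/index.html",
                "https://example.org/private/notes.html"):
        print(agent, url, may_archive(ROBOTS_TXT, agent, url))
```

A well-behaved crawler performs a check of this kind before fetching each page and skips any URL for which the answer is negative.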

What is the point of having several Internet archives in the world? One already exists in San Francisco, and there are plans to launch others.

Adly: The point of having archives in different parts of the world (where different lifestyles exist) is for the archives to serve as mutual backups for one another. In addition, the archives in different regions serve different cultures; for instance, the one in Alexandria holds 2,000 hours of video recordings of Egyptian television in its collection. Each archive would also better serve researchers in its own region. The third archive is presently being installed in Amsterdam.

Is the Internet archive a possible first step towards preventing the memory loss of the digital age?

Adly: Preservation is a key goal of the Internet Archive, which makes the project a good step towards preventing the memory loss of the Internet. But one should not stop at archiving the Internet and overlook other forms of media such as books, audio, and video recordings. Consistent with that vision, the library is involved in a number of projects concerned with the digital archiving of books and other intellectual material, in order to make such content openly accessible in a robust digital form along with the archived Internet pages.
