Tuesday, July 26, 2011

Bytes of History Biting the Dust

Last week the Washington Post published Daniel de Vise's article, "A Bit of Time Traveling in the Internet Age: Scholars Are Trying to Create a Past for the Perpetual Now Web" (July 18, 2011, at B1), which reported on a research effort by computer technologists that may be of interest to readers of this blog. The researchers are trying to determine how much of the historically significant material published on the internet is being archived and preserved.

Michael Nelson, a computer science professor, heads a group of researchers at Old Dominion University and the Los Alamos National Laboratory who have established a method of assessing the "many questions to ask about web archiving and digital preservation - why is archiving important? what should be archived? what is currently being archived? how often should pages be archived." Nelson and several coauthors conducted a study that answers some of these questions. The researchers published their findings in a paper entitled "How Much of the Web is Archived" (June 2011). The abstract of the paper follows.
With more and more of our business, academic, and cultural discourse contained primarily or exclusively on the Web, the problem of archiving the Web is receiving increased attention. The focal point of much of this attention is the Internet Archive’s Wayback Machine, which began archiving the Web in 1996 and as of 2010 had over 1.5 billion unique URIs, making it the largest, longest-running and most well known publicly-available web archive. Recently, there has been a proliferation of new public web archives at universities, national libraries, and other organizations. These differ in scale, ingest models, collection development policies, and the software employed. All of this leads to the question “How much of the Web is archived?” To address this question, we sampled URIs from four sources to estimate the percentage of archived URIs and the number and frequency of archived versions. From this, we extrapolate the percentage of the Surface Web that is archived.
The full paper is available here. Professor Nelson's website, which describes the team's research effort in greater detail, is here.
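
For readers curious about what "sampling URIs to estimate the percentage archived" might look like in practice, here is a minimal sketch in Python. It is not the researchers' code or method. The function name, the toy data, and the choice of a Wilson score interval are all assumptions made for illustration. The idea is simply that if you check a random sample of web addresses and count how many have at least one archived copy, you can estimate the archived share of the whole population and attach a margin of uncertainty.

```python
import math


def archived_fraction(sample, z=1.96):
    """Estimate the archived share of a URI sample.

    `sample` maps each sampled URI to the number of archived copies found
    for it (0 means no archive holds it). Returns (estimate, low, high),
    where low/high bound a Wilson score interval (z=1.96 is roughly 95%).
    This is an illustrative technique, not the one used in the paper.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample is empty")
    archived = sum(1 for copies in sample.values() if copies > 0)
    p = archived / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - margin), min(1.0, centre + margin)


# Made-up illustration data; these are NOT figures from the paper.
toy_sample = {
    "http://example.org/a": 12,
    "http://example.org/b": 0,
    "http://example.org/c": 3,
    "http://example.org/d": 0,
    "http://example.org/e": 1,
}
estimate, low, high = archived_fraction(toy_sample)
print(f"archived: {estimate:.0%} (interval {low:.0%} to {high:.0%})")
```

With only five sampled addresses the interval is very wide, which is why a study of this kind needs large samples, and why the source of the sample matters as much as its size.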

Apparently, much less is being preserved than scholars might assume, given all that we've read about the permanence of everything posted to the web. It turns out that while profit-making sites (e.g., social media) may well take care to ensure that every comment anyone posts (including off-color remarks) lives on in cyberspace, the kinds of sites where scholarly papers, blog posts, essays, photos, videos, and other digital matter of historical significance are posted may not be so fastidious in their preservation efforts.