NARA and the web harvest: a discussion of the issues

As I wrote last Thursday, the National Archives has decided not to conduct a harvest of Federal web sites at the end of this Presidential administration. In my post, I referred to this as a “public relations error.” It looks like I was right. Take a look at some of these links if you want to see how this story is being portrayed on the web:

After my post went up, I was encouraged to look into this situation more carefully. Many of the issues at stake in this controversy have their roots in key archival principles, and I think it’s our duty as archivists to bring understanding of those issues to the public debate. I’ll provide some basic background first, then discuss some of the appraisal and resource issues.

In January 2001 NARA collected a “web snapshot” by having Federal agency CIOs collect and transfer to NARA a “snapshot” of the agency’s public web site. The intent was to “ensure that we are able to document at least in part agency use of the Internet at the end of the Clinton Administration.” You can read more about this effort here. I do not believe the web records collected by this effort are currently available online.

In January 2005 NARA issued its Guidance on Web Records, which clarified that Federal web sites (both public-facing and intranets) are Federal records and must be scheduled like any other Federal record. It is the responsibility of Federal agencies and their records managers to develop schedules for their web records and submit them to NARA for approval. Those aspects of the agency web site which are determined by the agency and NARA to be of permanent value will be transferred to NARA custody in accordance with the disposition instructions in the schedule and NARA’s transfer guidance.

Around the same time the web record guidance came out, NARA conducted a web harvest of all government web sites as they existed prior to January 20, 2005. You can see the records harvested and more background here. NARA conducted another harvest of House and Senate public web sites as they existed prior to December 11, 2006, which you can see here.

There were many issues with these web harvests. They did not necessarily capture the entire public site because they only captured up to four levels of depth. They did not capture agency or Congressional intranet sites. They provided only a snapshot of the sites at one particular moment. They were very expensive.

For archivists, these web harvests should be troubling because they dispense with the process of appraisal. In effect, anything on the top four levels of an agency’s web site was determined to be of permanent value. For NARA, they also established a troublesome precedent. Would NARA routinely conduct these harvests? If NARA was already capturing their web sites, why should agencies bother to schedule or transfer their web records?

Having conducted the harvest at the end of the last Presidential administration, NARA was now faced with the decision of whether or not to do another such harvest next year. Here are some factors that might have been taken into account in their decision making:

  • Unlike in 2004, NARA has had guidance for agency web records in place for several years now. Agencies should clearly understand what their responsibilities are and the process they need to follow.
  • If they conducted another web harvest, it might send the wrong message to agencies. It might give an excuse, for those agencies looking for one, for them not to schedule their web records, because NARA is “preserving them anyway.”
  • Such web harvests are very expensive, costing perhaps millions of dollars, and NARA, like most parts of the government, is strapped for resources.
  • There are other organizations, such as the Internet Archive and NARA’s affiliated archives, the University of North Texas Libraries, which have taken on the function of preserving some aspects of Federal web sites. The UNT, for example, preserves “deceased federal agency web sites, the Congressional Research Service Reports electronic archive, and more.”
  • The harvests obligate NARA to permanently devote resources to preserve records that are not necessarily of permanent value.
  • The harvest process is in direct opposition to the archival process of appraisal.

For me, as an archivist and a former NARA employee, that’s a pretty compelling list of reasons against making another harvest.

Stacked on the other side of the argument is that there is a public expectation, created by the previous harvests, that this is something NARA regularly does. In fact, on its own web site about the harvests, NARA states:

“The National Archives and Records Administration (NARA) conducts a harvest (i.e., capture) of Federal Agency and Congressional public web sites as they exist at the end of each Presidential term and a harvest of Congressional web sites at the end of the Congressional term that does not coincide with a Presidential term.”

So deciding not to do a harvest is a break with existing practice and public statements, if not actual policy. There is also the possibility that agencies aren’t properly scheduling or transferring their web records, and that conducting a harvest preserves some records that would otherwise not be preserved by NARA (although they might be preserved by third parties, such as the Internet Archive).

I think NARA made the right decision. I now regret that in my previous post I agreed with the statement that NARA was abdicating their responsibility. They are complying with their responsibility by following the regulations and processes already in place for scheduling, appraising, accessioning, and preserving Federal records of permanent value. If there are concerns about what web records are being preserved, the available resources should be dedicated to addressing those concerns within the existing process. If the process needs to change in response to the shorter lifecycle of web records, then the process needs to be changed, not abandoned.

What I strongly disagree with was the way NARA presented their decision to the public. It appears as if the decision was announced, with very little justification or discussion, in a memo circulated only to Federal records officers. I don’t know if there was a plan for communicating the decision to the general public, but the memo made its way to a journalist. You saw the outcome in the list of links at the top of this post. Now they are having to justify a decision in the face of public outcry, and that is never a good place to be. If they had communicated their decision more effectively, and laid out all the reasoning behind their decision, they might still face public concern but they’d be in a much stronger position.

Now I am afraid that they will be forced to do another crawl, spending millions of dollars that may not have been budgeted for this activity. Agencies will have another excuse not to schedule their records, and NARA’s public image has a bit of a black eye. I believe on some blogs people are even speculating that this is some kind of Bush administration-backed effort to destroy evidence (which I have every reason to think is not true).

There are issues of archival principle here and issues of resources. It’s a real life case study playing out in front of us, and one that could have dramatic implications for NARA. What do you think about the decision and how it was handled?

9 thoughts on “NARA and the web harvest: a discussion of the issues”

  1. Thanks for revisiting this issue – it’s great to read a well-informed take on this story from an archivist perspective.

    If the rationale for not doing a web capture is that there are records schedules covering web records that ensure the preservation of the ones with archival value, then I am fine with the decision. That makes perfect sense. You wouldn’t run out and photocopy all the records from an office knowing that the ones with archival value were going to be making their way to the archives within a few years based on the records retention schedules. (A paper analogy that is probably unnecessary, but sometimes it helps to put things in “familiar” terms.) But if the rationale is as presented in the memo that was leaked – that the Internet Archive and other “archiving” sites are taking care of it – then I’m not O.K. with it. I’m all for public/private partnerships, but there needs to be an actual agreement in place that guides what is preserved and how, and my sense is that such an agreement doesn’t exist in this case. (Maybe I’m wrong.) I don’t know why the choice was to focus on the flimsier rationale, but I agree that it was a public relations blunder.

    As far as the appraisal issue goes, I’m going to point out the obvious, but it is very difficult to do item-level appraisal of web files, because the pages are usually so interconnected. Toward the end of my grad student days (2000) I worked on a pilot project examining appraisal and preservation of student organization websites. We decided relatively early on that the main appraisal would be which sites we would capture, and how deep to go in terms of hierarchy. If you started taking chunks out of less important parts of a site, not only would it take a long time, but how would you ensure that the structure and context was preserved, and not have to develop a completely new archivist-supplied structure and context? (Getting into that authenticity issue that e-records folks tend to be preoccupied with.) Things have progressed a lot since then, and technology has made the actual capturing part (which was always the simplest step) even easier, but a lot of the same issues remain because of the nature of web records.

    I’d say that hierarchically targeted web captures aren’t anti-appraisal – the appraisal is just at a less granular level. Just as we’re being urged to abandon item- or folder-level appraisal for paper records because it’s generally too time consuming and may not warrant the effort and cost, so too should we for electronic records. Appraisal at the collection/record group or series level is usually going to be the most cost effective approach, and should meet most researchers’ needs. Of course, this is also another case where working with the records creators – by helping them understand the best ways to structure their site to ensure that the most important information exists on the same level of hierarchy or within predictable directories – can solve a lot of issues upfront.

  2. On the issue of cost, which had been raised in an exchange in the comments on the FGI site, I just found a reference about the 2004 harvest in an article in Government Computer News. This states:

    “NARA hired Information Systems Support of Gaithersburg, Md., to carry out the $337,000 project. ISS subcontracted the Web harvesting to Internet Archive, a San Francisco nonprofit. Internet Archive used a seed list of URLs provided by NARA for the site scans. For each scan, Internet Archive’s software “traverses an entire Web site tree by clicking on all hyperlinks and makes copies of those pages,” Giguere said.

    In all, the Web harvest collected 6.5T of data from 1,300 civilian and 70 unrestricted Defense Department and intelligence agency domains.”

    Here’s the source:

  3. Given the daily stories about this administration’s secrecy and loss of records, I think it is important that NARA do another harvest or, at least, explain more effectively and publicly why it is not doing this. NARA always seems to bumble in the PR area, and their involvement in the reclassification scandal, the Sandy Berger case, and other such examples suggests that they need to do things far more out in the open. Of course, we can always look to the National Security Archives to take the real leadership, but it would be refreshing to see NARA step up to the plate for a change and actually not just mouth the words “public accountability” but function in a way that suggests that they believe that such accountability is important. Personally, I am not sure about the need for such a harvest, but I am 100 percent certain that NARA needs to be more open about what it is doing or not doing.

  4. We read with interest your postings on this topic.

    The National Archives and Records Administration (NARA) has posted background information regarding our web harvest decision at This background document includes links to our guidance products related to web records and the decisionmaking process we went through to arrive at our decision.

    Paul M. Wester, Jr.
    Director, Modern Records Programs
    National Archives and Records Administration

  5. This is a lengthy comment that I sent to the National Coalition for History over the weekend. To provide a context for these comments, NCH had cited a blog for its source (.govwatch) whose subject line referenced the destruction by NARA of millions of records. The NCH stated that it would ask NARA to clarify its decision at the same time NCH would join other stakeholders in protesting that decision. Hopefully, this explains the meanings of the first two paragraphs. A key point to be reiterated is: “I think it is telling that the public has not requested a single page of any of the Web records harvested in 2001, 2004, and 2006.”

    The following is my post from April 13 to the NCH website:

    While the National Coalition for History should always ask for clarification of any decision of the National Archives that may have a detrimental impact on the preservation of historically valuable documents, I disagree with the Coalition’s protest before it receives the clarification. It seems that the Coalition has already made a decision to protest regardless of the basis for the decision by NARA. Such an approach rings loudly that the decision to protest is pre-ordained and the facts be damned.

    I, for one, take issue with the .govwatch blog posting. The posting’s subject line is: “The National is Quietly Destroying Millions of Documents.” Nothing could be farther from the truth. First, the National Archives is doing nothing “quietly.” As the blog notes, the National Archives has issued a numbered directive announcing its action. A numbered directive is not “quiet.” And nowhere in the directive does the Archives state that it is destroying anything. The agency states that it will not acquire custody or accession certain records. Since the failure to accession a record will result in its destruction, one may think that this is splitting hairs. But it is not. The Archives accessions only those records that have undergone a rigorous appraisal that has determined that the records meet the stringent criteria for a permanent record. A case can be made that almost any record might be of value in the future. But the fact that a record merely has the potential of future value does not mean and should not mean that it is a permanent record worthy of archival retention. In this connection, the National Archives is allowing tens of billions of records to be ultimately destroyed. Rather than this being the subject of protest, we should commend the National Archives for responsible archival stewardship of accessioning only those records appraised as permanent.

    The .govwatch blog assumes that the Web pages “will be valuable for historians in the coming decades.” I question this assumption. The initial Web site harvest was essentially insurance against the possible loss of important records. In 2001, no archives had experience with the appraisal and archival administration of Web records. Hence the idea that Federal Web sites of the Clinton administration might have archival value prompted the harvest lest the Web records be lost. With the hindsight of the past eight years and more hands-on experience with Web records, archivists can now view Web records more professionally. I think it is telling that the public has not requested a single page of any of the Web records harvested in 2001, 2004, and 2006. Current records that have archival value almost always have much greater research demand than they will have in archival custody. Further, I argued in an article in SAA’s Archival Outlook (“Toward the Appraisal of Web Records,” July/August 2006, pp. 6, 25) that the appraisal of Web records will probably mirror other types of records and that only one percent of Web records would be appraised as meeting the criteria required for permanent records. Finally, I also think it is telling that none of the Web records from the harvest have been formally accessioned into the National Archives of the United States. The harvested Web records have yet to be the subject of an archival appraisal.

    I do not believe that archival appraisal decisions should be based on a cost-benefit analysis. It is impossible to assign a monetary value to the preservation of our national heritage. But hopefully, in its clarification of the Web site harvest, the National Archives will include the estimated costs associated with the harvesting of the Web sites, their subsequent accessioning and initial processing, and their preservation in an accessible format for at least the next generation of historians (say until 2048). I think such a bottom line would add a dose of needed reality.

    Hence I think the proper course of action for the National Coalition would be to ask for a clarification about the decision not to harvest the Web sites of the Bush administration and not to protest such a decision in the absence of such a clarification.

  6. Interesting blog, Kate, I’ve enjoyed reading your postings recently on various issues.

    Here are some lengthy ruminations over my morning coffee on a rainy Sunday morning.

    How well agencies schedule web pages and web posted information remains to be seen. Much of that depends on their ability to project what end users will need, internally or externally. That requires stepping back from the press of current business and taking a reflective look at the many ways people use information.

    External agency websites can be valuable for what they tell us about the public face of agencies, how message discipline is applied throughout an administration, and so forth. They can be quite revelatory.

    NARA’s own external website is useful for what it has included over the years (useful information on the “reclassification” flap, including Dr. Weinstein’s candid, good statement and Bill Leonard’s later excellent assessment of what happened) but also for what it has excluded (the OIG report on the Sandy Berger flap, released to an external requester under FOIA, but never posted, even in redacted form, on NARA’s external website). What is posted publicly and what is not is useful for assessing the degree of transparency possible for NARA in areas involving the Federal Records Act versus the Presidential Records Act. (I’m particularly interested in such issues due to my past work as a NARA employee between 1976 and 1990 with the Nixon tapes.)

    Internal websites may contain posted information and guidance materials of the type once found in hard copy in agency libraries: telephone directories, directives, newsletters, etc. (Agency telephone directories are valuable for looking at organizational structure in detail, as well as seeing where people who later rose in rank were assigned early in their careers.)

    Although these are useful sources of information in reconstructing decades later what happened and why, as well as in compiling biographical information, some of them now are dynamic records subject to being overwritten. Unless agencies preserved successive iterations from the time they first web posted them, there may be some information gaps. This may have happened in some organizations as I believe NARA first issued detailed guidance on preserving web information in 2005.

    Prior to 2005, some historians, such as Eduard Mark, expressed concern about knowledge gaps. Dr. Mark wrote about electronic record keeping on the H-Diplo listserv in 2001:

    “I believe that we shall forever know more about the activities of the United States Government at the beginning of the Cold War than about what it was doing at the struggle’s end. When the government’s historians talk shop, a common topic is, ‘Whatever will our successors do several decades hence?’”

    Mark seemed most focused on capture of internal memoranda and correspondence. But historians also rely on guidance and informational materials. Since this is an area where NARA and the agencies are playing catch up, I have to wonder whether some agencies and departments already may have faced periods where potentially useful web posted information was overwritten and not preserved.

    Depending on the agency, decisions on how best to share information might have been driven initially by technological factors more so than long term capture of knowledge. From reading records managers’ forums, I gather that in some agencies IT more so than RM may have driven adoption of solutions for dealing with electronic records.

    No two organizations are going to have exactly the same culture and organizational climate. So it’s hard to predict how preservation of electronic information is going to play out throughout the government. (In the survey that records expert Rick Barry did a few years ago, one RM noted, “archivists and records managers are all too often at the very bottom of the totem pole: records managers within an agency ought to answer to the CEO.”)

    Even now, given the dynamic nature of some web posted content, and the fact that program people may be immersed in current business, it may be difficult for some agencies to assess what has value and what does not. Again, from studying records managers’ forums, I suspect that this might be the case even with guidance from NARA. Consequently, end users of archived information need to play a role within the agencies in explaining what has been useful in past work with hard copy materials and publications. What data and sources supported the conclusory narratives written internally in the past? This needs to be translated by end users into what happens with the electronic, web posted equivalents. NARA can only do that in a big picture way. Internal end users of information (lawyers, analysts, historians, if an agency has them on staff) need to have a seat at the table in discussions with RM, IT, etc., and to fill in some of the dots, as needed.

