Web archiving workflow (draft)

Introduction

Web archiving at the Schlesinger Library is an integral part of the library's collecting activities. Since 2007, the library has invested staff and resources in a web archiving program. Initially the program archived sites through WAX (Harvard's web archiving tool); since 2017, the library has harvested web content via Archive-It (AI), a web archiving service for collecting and accessing cultural heritage on the web, built at the Internet Archive. Archive-It at Harvard University is centrally managed by Harvard Library Preservation Services. Contact Stephen Abrams or Tricia Patterson with questions about getting access to Archive-It.

At Schlesinger, staff can get access by asking any member of the Web Archiving Team (Jen W, Laura, Amy, Paula).

The main parts of the Schlesinger Library web archiving program include:

  • Identifying websites to archive
  • Adding the seed (URL) and metadata to our Archive-It collection
  • Scheduling an initial test crawl of each new seed
  • Reviewing the crawl, editing the seed scope if necessary to capture missing content, and saving the most complete version
  • Adding sites to the Schlesinger Library's annual or semi-annual harvesting schedule in AI

Projects

Blogs: capturing women's voices, 2007-2016?

  • In 2007, the Schlesinger Library began a project to harvest the blogs of women in order to capture voices and points of view that may be inadequately documented elsewhere, as well as to trace the growing use of internet technology by women in the early 21st century. Blogging is a fairly inexpensive and expeditious activity that suits the hectic pace of many women’s lives, and it encompasses a wide range of expression, from simple extraction and summarization of other web content to the creation of personal narratives centered on a topic or a time in one’s life. For these reasons, groups of women whose voices might not be found elsewhere have become a vivid presence in the "blogosphere," and there was concern that these primary sources would be lost. As part of its mission to document the multifaceted contributions of American women, the Library identified a small group of blogs created by African-American women, Latina women, lesbians, and women grappling with health and reproductive issues in their lives. The collection consists of a sample of blogs by women documenting their engagement with politics, both nationally and internationally; their personal lives and philosophies; and their work lives. Harvesting of these blogs began in the fall of 2007 and continued, in most cases, for a period of time.

#Metoo digital media collection project, 2017?-ongoing

  • The #Metoo digital media collection is a Schlesinger Library project that documents the digital footprint of the #metoo movement and the accompanying political, legal, and social battles in the United States. Digital media collected includes social media, news articles, statements of denial and/or apology, web-forum conversations, legislation, lawsuits, statistical studies, Fortune 500 companies’ employment manuals, hashtags related to #metoo, and more. The material in the collection dates from 2007, with the creation of the #metoo hashtag by Tarana Burke, and collecting will end when #metoo activity subsides. The collection will be made available for interdisciplinary research on #metoo. For more information and for access to the collection, see the project website: https://www.schlesinger-metooproject-radcliffe.org/

Schlesinger Library Sites, 2007-ongoing

  • The library actively archives websites created and maintained by organizations and individuals whose collections are housed at the Schlesinger Library. These websites supplement the paper-based collections and provide additional documentation of the important activities and contributions of these organizations and individuals. Below is the step-by-step web archiving workflow and schedule for Schlesinger Library Sites.

Schlesinger Library Sites Step-By-Step Workflow

Adding a New Seed

  • For now, Laura will be responsible for adding new seeds to Archive-It (AI). The pre-AI work involves discussing the creator's website with the processor or the curator and letting them guide us on the frequency of the crawl and what content is particularly valuable to capture. See the processing manual for more information on this front-end work: https://harvardwiki.atlassian.net/wiki/display/Proceed/Web+sites
    • Once the seed is ready for crawling, log in to Archive-It and click on SLSites.
    • Navigate to the Add Seeds pop-up window.
    • Copy a seed URL from a browser window displaying the desired content whenever possible. Otherwise, copy and paste the URL from a reliable source.
    • It is also possible to add multiple seed URLs at one time, following the same procedures.

Seed Scope and Settings

NB: the following elements of a URL can impact the scope of a crawl:

        • http vs https
        • www.site.org vs site.org
        • presence or absence of end slash
        • Targeting a subdirectory only, vs. starting with a subdirectory but including other parts of the site
        • Refer to https://support.archive-it.org/hc/en-us/articles/208001076-How-our-crawler-determines-scope for details; a short illustrative sketch at the end of this section shows how these variants multiply
        • Our current default settings for SL Sites crawls are as follows:
          • Access: Public
          • Frequency: Annual (confirm this with the processor or curator, if necessary)
          • Seed Type: Standard (use Standard+ if there are external links that you want to capture)
          • Modify the above settings as needed for particular situations: run crawls more frequently if warranted by site content, expand or restrict the scope as appropriate, or schedule a one-time crawl if the creator is no longer updating the site.
          • For sites with YouTube and/or Vimeo videos, Flickr pages, SoundCloud audio files, etc. see specific recommendations on AI help page: https://support.archive-it.org/hc/en-us/sections/201841373-Scoping-crawls-for-specific-types-of-sites
        • Add metadata to the seed
          • Check the box next to your seed. The “add metadata” button will become active. Click on “add metadata” and fill in the standard fields: Title and Creator. Be sure to click “add” after completing a field and then click “done” when complete.
          • The default permission source is the deed of gift or e-mail correspondence with the donor. If permission is unclear or was received in a different way, add this information to an internal notes field by clicking on the seed URL and then the Notes tab.
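
To make the effect of the URL elements listed above concrete, below is a minimal illustrative Python sketch (not an Archive-It feature; site.org is a hypothetical example) that enumerates the variant seed URLs a single site can present. Each variant is a distinct seed to the crawler, which is why it matters which form you copy from the browser.

    # Illustrative sketch only: enumerates the seed-URL variants that
    # a crawler treats as distinct starting points.
    from urllib.parse import urlparse, urlunparse

    def seed_variants(url):
        """List common variants of a seed URL (scheme, www, trailing slash)."""
        parsed = urlparse(url)
        host = parsed.netloc.removeprefix("www.")
        path = parsed.path.rstrip("/")
        variants = []
        for scheme in ("http", "https"):
            for netloc in (host, "www." + host):
                for p in (path, path + "/"):
                    variants.append(urlunparse((scheme, netloc, p, "", "", "")))
        return variants

    # Eight distinct seeds for what looks like "the same" site:
    for v in seed_variants("https://www.site.org/blog/"):
        print(v)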


Running a Test Crawl

  • Run a test crawl using the defaults (or with additional scope rules, if you think the site warrants it):
    • Go to Run Crawl
      • Choose: Test Crawl
      • Time Limit: 3 days
      • Click Crawl

QA Test Crawl and Troubleshooting tips

Note: Archive-It continually updates and enhances its crawling capabilities. For up-to-date troubleshooting tips, go to AI's help center: https://support.archive-it.org/hc/en-us

To see the Archive-It enhancement schedule or to request an enhancement: https://support.archive-it.org/hc/en-us/articles/209637783-Ongoing-and-future-Archive-It-development

  • When the test crawl is complete, web archiving team members receive an e-mail notification from AI.
  • QA the site, being sure to check many of the links, any audio/video files, and PDF attachments.
  • Often, the first test crawl will be missing information and you will need to edit the seed scope and try again.
  • Troubleshooting - Experiment!
    • Try using Brozzler if the crawl has poor results (particularly if video does not stream). For more information on Brozzler, see: https://support.archive-it.org/hc/en-us/articles/360000351986-How-and-when-to-use-Brozzler
    • Sometimes the content will archive but the formatting won't, because the CSS file was either blocked or not archived. Find the CSS file in the host list and add it to the seed scope.
    • Unclear why the crawl result is poor? Try adding Ignore Robots.txt to the seed scope (content blocked by robots.txt files can be patched in); a small sketch for checking robots.txt blocking follows this section.
    • If certain pages of a site do not have an adequate capture, consider crawling them as a separate seed. This way you can add additional scoping rules specific to the problem pages. If this strategy works, group the multiple URLs together under the creator name. For a local example see the Deanna Booher group. For more information on grouping sites: https://support.archive-it.org/hc/en-us/articles/208332743-Organize-seeds-as-a-group-
      • NOTE: grouping can also be used when a creator has multiple sites.
    • To help with troubleshooting, you may want to see the specific URLs captured for the website. Click on the Hosts tab to view a list of the URLs.
    • See number of docs, number of new docs, amount of data, amount of new data, blocked content, queued content, and content that is out of scope.
    • Adjust the crawl scope as needed to capture any important content that was missed and re-run crawl.
  • When you have a successful crawl, save it.
    • Remember to delete any unsuccessful crawls.
    • Contact Paula with the URLs for the public seed/harvest page on AI for linking to HOLLIS (and in the finding aid).
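
Before resorting to an Ignore Robots.txt rule, it can help to confirm locally that robots.txt is actually the problem. Here is a small illustrative Python sketch using the standard library's robotparser; the archive.org_bot user-agent token and the example.org URL are assumptions for illustration, not taken from this document or confirmed by Archive-It.

    # Illustrative sketch only: checks whether a site's robots.txt
    # disallows a given URL for a given crawler user-agent.
    from urllib.parse import urljoin, urlparse
    from urllib.robotparser import RobotFileParser

    def blocked_by_robots(url, user_agent="archive.org_bot"):  # UA is an assumption
        """Return True if the site's robots.txt disallows user_agent from url."""
        root = "{0.scheme}://{0.netloc}/".format(urlparse(url))
        rp = RobotFileParser(urljoin(root, "robots.txt"))
        rp.read()  # fetches and parses the live robots.txt
        return not rp.can_fetch(user_agent, url)

    # Hypothetical example: is this media file blocked for the crawler?
    print(blocked_by_robots("https://www.example.org/media/video.mp4"))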

Crawl Schedule and QC

  • We have three crawl frequencies in our collection:
    • One-time: the site is not scheduled for ongoing crawls. These sites are either no longer active, or we determined that only a one-time crawl was necessary due to the limited added value of the site (other criteria to be determined). NOTE: these seeds still need to be changed to inactive in AI.
    • Semi-annual: the site is crawled automatically every six months.
    • Annual: the site is crawled automatically once a year; this frequency constitutes the bulk of our sites.

  • We have set up automatic crawls for the semi-annual and annual frequencies through Archive-It.
    • Semi-annual is scheduled for November and May.
    • Annual is scheduled each April.
  • For sites that have been crawled using Brozzler:
    • These sites will need to be scheduled for a separate crawl. To help identify Brozzler sites for crawling, they will be given the status of ... After a first successful Brozzler crawl, you may want to try switching to a regular (non-Brozzler) crawl to capture additional data, and also adding the site to the annual crawl.
  • When the scheduled crawl is complete, web archiving team members receive an e-mail notification from AI.
  • The web archiving team will then split up the work and do a cursory review of each harvest:
    • Is the site still active? Has the site been hijacked or corrupted?
      • Sometimes it is obvious that a seed should be changed to inactive because the site is no longer available. Other times, a site has not been updated for some years and should be changed to inactive. Consider making the site inactive if there has been no activity on the site for 5-10 years, particularly if the creator is no longer living. You can also ask the manuscripts curator for guidance.
    • Redirected crawls
      • The crawler will sometimes be redirected to another site. When this happens, the redirection is indicated in the seeds list in the crawl report. Redirected crawls where the site address shows only minor changes will be bundled together. These changes include (but may not be limited to): http vs. https, www vs. no www, and a URL with or without a trailing slash. In these cases, simply update the seed name in Archive-It. A small sketch for classifying redirects as minor or significant appears at the end of this section.
      • Sometimes, however, there are sites that have been redirected by the crawler but have more significant changes in their URL. For example:

        Tiaw.org to tiaw.org/default.aspx
        Ywcaboston.org to ywboston.org
        Girlscoutseasternmass.org to gsema.org
        Suzannemasks.com to suzannebentonartist.com
        • In these types of cases, it will be necessary to add the new seed name in Archive-It for future crawls. Steps to take:
          • Change the status of the old seed to inactive.
          • Add the new seed to Archive-It, filling in the metadata fields for title and creator and adding any relevant scoping rules.
          • If necessary, crawl the new seed.
          • Group the seeds together under a new group name.
          • Notify Paula so she can update the URN in HOLLIS.

    • Does the content and formatting of the site look good (for the most part)?
      • Due to the large number of sites, particularly for the annual crawl, QA will not involve the high level of review that is usually done with a test crawl, but the reviewers will still need to determine if the capture is sufficient.
    • Reviewers will go back to the problem sites and edit the seed scope accordingly to try to get a better crawl.  These sites will then be crawled separately as test crawls and then saved.
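
For the redirect review above, the minor-vs-significant distinction can be checked programmatically. Below is an illustrative Python sketch; it assumes the third-party requests library, tiaw.org comes from the examples above, and the classification logic mirrors this document's rules rather than any Archive-It API.

    # Illustrative sketch only: follows a seed's redirects and applies the
    # minor-vs-significant rules above (scheme, www, and trailing slash
    # count as minor; anything else is significant).
    from urllib.parse import urlparse
    import requests  # third-party HTTP client, assumed installed

    def normalize(url):
        """Strip the elements treated as minor: scheme, www, trailing slash."""
        p = urlparse(url)
        return (p.netloc.removeprefix("www.") + p.path.rstrip("/")).lower()

    def classify_redirect(seed_url):
        final = requests.get(seed_url, timeout=30, allow_redirects=True).url
        if final == seed_url:
            return "no redirect"
        if normalize(final) == normalize(seed_url):
            return "minor change to %s: update the seed name in Archive-It" % final
        return "significant change to %s: retire the old seed and add a new one" % final

    # Example drawn from the list above (result depends on the live site):
    print(classify_redirect("http://tiaw.org/"))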

Discontinuing Crawls

We discontinue crawls when a site is no longer live.  In cases where organizations have gone defunct or an individual has died, wait until you see that the site itself is defunct. If a site were being maintained as a memorial, we would most likely continue to capture it. 

To discontinue crawling the site:

  • Go to the page for the individual seed and uncheck the active seed box. Click Save.
  • Under the Notes tab, enter an explanation for ending the crawl.

Example: Turned off 6/9/21. Site inactive during Annual Crawl and still inactive as of June 2021.

  • If the collection has a finding aid, ask Paula to adjust the language in the finding aid so that there is an end date for the crawl. If the collection that has been discontinued is part of a group, do not adjust the folder end date if other sites in the group are still being crawled. In the example below, both sites have been made inactive and are no longer being crawled.

                Example: E.1. Mautner Project archived web sites, 2011-2021.  Scope and content note: Mautner Project-Whitman-Walker Health  (http://www.mautnerproject.org/) and Whitman-Walker Health (https://whitman-walker.thankyou4caring.org/page.aspx?pid=290&bm=185443561).






