Web Archiving with Archive-It

Creating a crawl

Disclaimer:

  • For manuscripts, make sure to go through the standard acquisition process prior to crawling a website.
  • For Archives:
    • Websites are considered University property and do not require permission from the webmaster.
    • Social media sites are blocked in the Institutional Collection-level scope. Each institutional social media account should be crawled as its own seed. Please contact Meghan Kerr for guidance on how to update the Collection-level scope.
    • HUIT contacts Skip Kendall at Harvard University Archives when a new website is created. When a new website belongs to one of the LMA schools, Skip will notify Meghan Kerr.
  • Student groups
    • Before crawling any student group websites that do not fall under the Harvard domain, contact the group's leadership team to obtain permission to crawl.

Additional guidance on running successful crawls can be found here: https://support.archive-it.org/hc/en-us/articles/208001076-How-our-crawler-determines-scope
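
The crawler's scope rules are more involved than this (see the article above), but at their core, pages are in scope when they sit under the seed URL. Below is a minimal Python sketch, not Archive-It's actual algorithm, illustrating why a page on a different subdomain falls out of scope and needs its own seed:

    # Simplified illustration of seed-based scoping; the real crawler also
    # handles redirects, embeds, and any scope rules you add in the web application.
    from urllib.parse import urlparse

    def in_scope(seed, candidate):
        # In scope (roughly) when the host matches and the path starts with the seed's path.
        s, c = urlparse(seed), urlparse(candidate)
        return c.netloc == s.netloc and c.path.startswith(s.path)

    print(in_scope("http://hms.harvard.edu/", "http://hms.harvard.edu/about/"))    # True
    print(in_scope("http://hms.harvard.edu/", "http://cellbio.med.harvard.edu/"))  # False - needs its own seed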

Setting up crawl in Archive-It

  1. Create a test crawl first.
    1. Log in to https://partner.archive-it.org
    2. Pick your collection:
      1. Manuscript collections
      2. Institutional archives
        NOTE: you cannot move crawls between collections! Make sure you select the correct collection first.

  2. Select “Seeds” > “Add Seed”

  3. Enter one seed URL per line below to add them to this collection
    1. Make sure the seed URL ends with a trailing “/”. This allows the crawl to capture pages beyond that initial page.
    2. Note that if a lab or subpage sits outside the initial seed URL (for example, on a different subdomain), it won’t be captured. Example: crawling “http://hms.harvard.edu/” won’t capture “http://cellbio.med.harvard.edu”. This means you may lose departments or labs (see the scope sketch above).

  4. Settings:
    1. Access: Private
    2. Frequency: One time
    3. Seed type: Standard Plus

  5. Select “Add seeds”

  6. Disable robots.txt, etc. (to preview what a site’s robots.txt blocks, see the sketch after this procedure)
    1. Click on seed URL
    2. Select “Seed Scope” from tab menu
    3. Select “Ignore Robots.txt” from the drop-down menu
    4. Click “add rule”
    5. Select “save”
      NOTE: For the Institutional Collection, the Collection-level scope includes a list of URLs to block (newspapers/news websites, affiliated hospitals, professional organizations, social media sites, etc.). If there are other URLs that you think would be worth blocking at the Collection level rather than the Seed level, please check with Meghan Kerr before adding the URL. If for some reason you want one of these URLs unblocked, please contact Meghan Kerr for guidance.

  7. With your seed URL selected, click “Run Crawl”.
    1. Select “test crawl”
    2. Set time limit to “7 days”
    3. Select “OK”. Crawl is now scheduled.  

  8. Once you have approval, save your test crawl.

  9. Run QA

NOTE: Archive-It staff are very responsive and helpful. If you are having issues with specific content and the help menu isn't helping, reach out to them directly.
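
If you want to preview what a site's robots.txt would keep the crawler away from (the reason for the “Ignore Robots.txt” rule in step 6), the Python standard library can check it. A minimal sketch; the user-agent string and the paths tested below are assumptions, so substitute your own:

    # Check which paths a site's robots.txt blocks for a given user agent.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://hms.harvard.edu/robots.txt")   # placeholder seed
    rp.read()

    for path in ["/", "/calendar/", "/admin/"]:       # placeholder paths
        ok = rp.can_fetch("archive.org_bot", "http://hms.harvard.edu" + path)
        print(path, "allowed" if ok else "blocked by robots.txt")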

Creating a Seed Group and adding new seeds

  1. Navigate to the seed and click the check box to the left of the seed URL.
  2. Click "Edit Groups" in the blue menu bar and in the "Add to New Group" field, use the bib 245 as the New Group Name, but do not use any terminal punctuation.
    • e.g. Harvard Medical School. Office for Research Operations. University publications
  3. If you don't need to create a new group, select the group in the "Add To Existing Groups" box.
  4. Make sure the Group is public
    1. Go to "Manage Seed Groups" and click the Group name
    2. Click the check box next to "Visible to the Public" in the upper left hand corner (below "Group Name Visibility").

Logging seeds and crawls in CHoM Archive-It Tracking database

The Archive-It Tracking database is used to keep track of the following:

  • Seeds
  • Crawls
  • QA
  • Metadata and accessioning status

The Access database is saved on the N: drive: N:\Collections\07_Collections_Databases_and_Lists\ArchiveIt_Tracking.accdb

Adding Seeds

  1. Open table 1_SeedIDName, enter the Archive-It (AI) seed ID number followed by 3 spaces, and then the website's name as it appears in the <title> tag of the page source (a sketch for pulling the title follows this list).
  2. Open form 1_Seeds and in "Seed Name" drop-down field, select the applicable seed
  3. Fill out the remaining fields
    1. "Crawled Elsewhere?" = have any other Harvard repositories crawled this website?
  4. Do not enter crawl information in the Crawls subform - this subform in 1_Seeds is only for reference purposes
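
If you want to grab the <title> text without digging through the page source by hand, here is a minimal standard-library sketch (the URL is a placeholder, and real pages may need more careful parsing):

    # Fetch a seed and pull the text of its <title> tag for the 1_SeedIDName table.
    import re
    import urllib.request

    url = "http://hms.harvard.edu/"   # the seed you are logging (placeholder)
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    print(match.group(1).strip() if match else "(no <title> found)")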

Adding Crawls

  1. Open form 2_Crawls_QA and in "Seed Name" drop-down field, select the applicable seed
  2. Fill out remaining fields
    1. "Total Crawl Data" and "New Data" should be entered as gigabytes. Both fields cut off after 6 decimal places.

Crawl QA

Try not to wait too long to conduct QA. Websites are constantly updated, and sites or content can be removed entirely (e.g. the HMS NEPRC website). If a website is dramatically altered, you will not be able to compare the archived version to the live version. QA should be tracked in the Archive-It Tracking database. There are currently only three fields:

  • QA Notes
    • See below sections on what to look for while conducting QA
  • Do any of the subpages redirect to an earlier crawl?
  • Patch crawl(s) run?

The QA guidance in this section was adapted from both the Harvard University Archives and the NYU AI QA Manual, which has even more helpful information about how to conduct QA.

Questions to ask yourself

  • Did we get everything we wanted?
    • Pages
    • Embedded images
    • Video and audio content
    • Calendar information
    • Google docs
  • Does the website use a third-party platform (Wix, Medium, Squarespace, Tumblr, WordPress, etc.)?
  • Does the harvested site display properly?
    • It doesn’t need to be perfect but should be usable and look reasonably close to what the live site looks like.
  • Are there any associated subdomains, links, or external websites that should be considered for crawling?
  • Is there anything we may not want?
    • Are we getting content from other sites that is unnecessarily inflating the size of the crawls?

Review Crawl Report

  1. Compare the size to previous crawls; if the new one is pretty close to the older ones, we’re probably in pretty good shape.
  2. Look at the New Data vs. Total Data to see if there appears to be more content on the site than we got.
    1. Keep in mind that some of that Total Data may be either content that we already have (i.e. duplicate data) or data that's not necessary for a good crawl. It is very common to get a good crawl where there's more Total Data than New Data (a small worked example follows this list).
  3. Look at the Hosts report for the crawl
    1. If there are queued documents, do they appear to be necessary for getting a complete crawl?
      1. This can also be a way to identify crawler traps (https://support.archive-it.org/hc/en-us/articles/208332943-Identify-and-avoid-crawler-traps-).
    2. Are there hosts being crawled that are accounting for large quantities of data or documents that are clearly out of scope?
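
A small worked example of the New Data vs. Total Data check in step 2 (the figures are made up):

    total_data_gb = 12.4   # "Total Data" from the crawl report
    new_data_gb = 9.8      # "New Data" from the crawl report

    duplicate_gb = total_data_gb - new_data_gb
    print(f"{duplicate_gb:.1f} GB ({duplicate_gb / total_data_gb:.0%}) was duplicate or already-archived data")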

Examining Crawl in Wayback

  1. Navigate to the seed and click "Wayback"
    1. Once you're on the Wayback page, there might be multiple captures of the website. Only select the date of the crawl you initiated.
    2. Optional: once the seed is open in Wayback, view the site in Proxy Mode. In Proxy Mode, only the most recent capture will be visible and content won't be redirected to the live website, which occasionally occurs otherwise. Keep in mind that you can only view http websites in Proxy Mode, not https websites.
  2. In a separate browser tab, open the live version of the website you crawled.
  3. Make sure you "Enable QA" at the top of the Wayback website.
    1. In the background, Archive-It's QA feature will create a running list of missing documents; however, this list covers all missing documents for all seeds in the Collection you're working within (Manuscript; Institutional; WAM). You can narrow down the list by entering your seed URL.
    2. "Wayback QA" is accessed via the menu bar within the Collection in which your seed resides.
  4. Check pages, including hyperlinks and internal navigation. Do a cursory check for glaring missing content, such as embedded video and images (please see section Capturing Embedded Audio and Video below). Sometimes subpages won't be crawled and a patch crawl will fix the issue.
  5. Open all the main menu items on the site in new tabs
    1. Look for any missing items on those pages, both on the page and through the View Missing URLs report in the header.
    2. You may also want to open up secondary header items as it doesn’t take too much time and may reveal problems.
  6. Watch for common problems:
    1. Embedded video and audio
      1. Sometimes we can do something about these and sometimes we can't, so don't get too preoccupied with capturing them unless they're really critical content.
      2. Social media content
        1. Primarily worry about these if we’re actually crawling the social media site itself; embedded Twitter or Facebook content is hard to get and not critical.
        2. Be familiar with Archive-It’s advice on crawling social media
          1. The advice does change periodically, so if you know we're doing what we're supposed to be doing and it's not working, check back with the Archive-It advice.
          2. https://support.archive-it.org/hc/en-us/articles/208333113-Archiving-Facebook-pages
          3. https://support.archive-it.org/hc/en-us/articles/208001986-Archiving-Instagram-feeds
          4. https://support.archive-it.org/hc/en-us/articles/208002006-Archiving-Soundcloud-pages
          5. https://support.archive-it.org/hc/en-us/articles/208002016-Archiving-Tumblr-sites
          6. https://support.archive-it.org/hc/en-us/articles/208333743-Archiving-Twitter-feeds
          7. https://support.archive-it.org/hc/en-us/articles/208333753-Archiving-YouTube-videos
      3. Multi-page lists of content, e.g. https://charleshamiltonhouston.org/news/?cat=5
        1. Sometimes the crawling technology has trouble finding its way through the pages and will not capture pages beyond the first one (a sketch for checking paginated listings follows this list).
  7. If you find any significant issues with a crawl, note them, along with the date, in the "QA Notes" field of the Archive-It Tracking database. This will both help you remember why you were doing a resumption or a patch and allow others to pick up the QA of that crawl when necessary.
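
For the multi-page listings in step 6.3, one way to spot pages the crawler skipped is to generate the page URLs yourself and check whether each one resolves in Archive-It's Wayback. A hedged sketch; the collection ID, capture timestamp, and "paged" query parameter are placeholders you will need to adjust for your seed:

    # Check whether each page of a paginated listing was captured in Wayback.
    import urllib.request

    collection_id = "1234"            # your Archive-It collection ID (placeholder)
    timestamp = "20240101000000"      # capture timestamp from the Wayback URL (placeholder)
    base = "https://charleshamiltonhouston.org/news/?cat=5&paged={}"   # pagination pattern is an assumption

    for page in range(1, 6):
        wayback_url = f"https://wayback.archive-it.org/{collection_id}/{timestamp}/{base.format(page)}"
        try:
            status = urllib.request.urlopen(wayback_url).status
        except Exception as exc:
            status = exc
        print(page, status)

Any page that comes back as an error is a candidate for a patch crawl or an additional seed.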

Run a Patch Crawl

  1. Missing documents will sometimes, but not always, appear in the missing documents list linked from the header.  If they do, a patch crawl should get them.
  2. Clicking the link to a missing document that does not appear on the list will sometimes cause Archive-It to recognize the document as missing.  Occasionally it will take two or three attempts to trigger the recognition.
    1. If a large number of documents linked from a page are not showing up, it can be quickest to open them all up in tabs.
    2. You will need to reload the page or the missing documents list to determine whether or not this worked.
  3. Running a patch crawl
    1. A patch crawl only works if you have the "Enable QA" feature on. You should only run a patch crawl after you have completely QA'd the whole archived website.
    2. Navigate to "Wayback QA" and select the Missing Documents you want patch crawled.
    3. Click "Patch Crawl Selected"
    4. In the new dialog box, select the checkbox next to "capture documents blocked by robots.txt"
    5. Click "Crawl"
    6. Like all other crawls, patch crawls take up to 24 hours after completion before they will be available in Wayback.
      1. The patch crawl will appear in the Collection's list of current crawls.
      2. You can also view the "Patch Crawl Report" which is located within the "Wayback QA" section.


Capturing Embedded Audio and Video

Sometimes the crawl will not capture embedded audio or video content. If you have "Enable QA" selected and the Wayback header on an archived webpage that is known to have embedded AV looks like the following, running a patch crawl will not capture the AV: [MBK to add in screenshot of header showing that it doesn’t detect any AV]

When this happens, you will need to add the problem page as a private one-page seed URL and test crawl that seed:

  1. Select “Seeds” > “Add Seed”
  2. Settings:
    1. Access: Private
    2. Frequency: One time
    3. Seed type: One Page
  3. Select “Add seeds”
  4. Disable robots.txt, etc.
    1. Click on seed URL
    2. Select “Seed Scope” from tab menu
    3. Select “Ignore Robots.txt” from the drop-down menu
    4. Click “add rule”
    5. Select “save”
  5. With your seed URL selected, click “Run Crawl”
    1. Select “test crawl”
    2. Set time limit to “7 days”
    3. Select “OK”. Crawl is now scheduled.
  6. Once you have approval, save your test crawl
  7. Review the saved test crawl to make sure the AV was captured.

Unfortunately, you will have to create these private one-page crawls for every individual webpage where AV wasn’t captured in the original crawl.

Scheduling Crawls

Using Archive-It's "Schedule Crawls" option

If you know that the size of the website you want to crawl is consistent and does not fluctuate greatly, you might want to set up a scheduled crawl. The Archive-It User Guide has step-by-step instructions for setting up this feature. Before creating a scheduled crawl, first obtain approval from Emily Gustainis. CHoM has a limited data budget and we need to make sure that all members of the acquisitions team have access to this data budget.

Manually Scheduling Crawls

There are certain websites that should be crawled at a minimum, on an annual basis. Once a year, the following websites should be crawled:

These crawls should be run as 7-day Test Crawls. Before saving the crawls, first consult with Emily Gustainis and the Acquisitions Team to ensure that there is enough room in our data budget. While running QA on these sites, there will be content that is not captured, such as labs or other subpages too deep in the hierarchy. If patch crawls do not capture the missing documents and pages, you will need to run crawls for the missing individual seeds. While conducting QA, examples of content that you want to make sure was captured include:

  • Academic departments
  • Faculty pages/labs
  • Pages with embedded content

Description & Metadata

See: Website Archiving Metadata
