Google and HathiTrust Discovery and Data Maintenance

See also DRAFT: Alma workflow for sending material to Google for scanning (2024-)

Background

In 2006, Google began scanning portions of Harvard's collection, and intermittent scanning projects have continued over time. When Google scans a Harvard volume, it is almost always also downloaded and made available in HathiTrust. Metadata for these volumes is sent to both Google and HathiTrust as it is updated in Alma. Data elements in the record help determine whether a volume should be exposed as full-text or not. 

FIG holdings in Alma

When a title has a usable scan, it will have a holding record in Alma with library code FIG (852 $b). Please do not update FIG holdings except as noted below. Each title should have only a single FIG holding (except for the three titles with too many volumes to accommodate; see below). If you need to merge bibs for the same resource and each has a FIG holding, or you encounter a bib with multiple FIG holdings, please contact LTS (see scenario 4 below).

Most FIG holdings contain an 856 field with a link to the object in HathiTrust, per the recommendations of the Google Books Advisory and Discovery Working Group (2019). In some cases, the link will point to Google instead if that volume is not yet available in HathiTrust. When the title is a single volume, the link will point directly to the volume. Example: https://hdl.handle.net/2027/hvd.HN25KM

When the title is a serial or multi-volume work, the link will point to the catalog-level page in HathiTrust, where you can see a list of all digitized volumes, including ones digitized or held by other institutions in the case of multi-volume works. Example: https://catalog.hathitrust.org/Record/000532756

The FIG library in Alma has multiple location codes, used by system processes, that indicate the content provider and the level of availability of the scan. For a list and definitions of the location codes, see Alma location codes - usage information.

The barcode of a volume is used as the primary identifier for sending metadata to Google, and it is critical that it be maintained in the appropriate FIG holding. Local MARC holding field 976 is used for this purpose, with dedicated subfields for item-level information. Some physical volumes get withdrawn from the collection, and the 976 field is the only metadata element in Alma to link the digital object with the descriptive metadata for the title. Field 976 should only exist in FIG holdings. A given barcode in 976 should appear only once across the database. 

Usage of 976 field: 

  • b    Barcode
  • v    Item description
  • w    Enum A 
  • x    Enum B
  • y    Chron I
  • z    Chron J

Note that Google/Hathi only use $v for display, but may use the other values for sorting. 
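
As an illustration of the subfield layout above, here is a minimal Python sketch that models a 976 field and checks the rule that a barcode should appear only once. It is not part of any Alma or LTS tooling: the class name, attribute names, and both helper functions are hypothetical, and the input is assumed to be a simple list of (subfield code, value) pairs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Subfield codes come from the list above; the attribute names are
# hypothetical labels chosen for this sketch.
SUBFIELD_NAMES = {
    "b": "barcode",
    "v": "item_description",
    "w": "enum_a",
    "x": "enum_b",
    "y": "chron_i",
    "z": "chron_j",
}


@dataclass
class Field976:
    barcode: str
    item_description: Optional[str] = None
    enum_a: Optional[str] = None
    enum_b: Optional[str] = None
    chron_i: Optional[str] = None
    chron_j: Optional[str] = None


def parse_976(subfields: Iterable[Tuple[str, str]]) -> Field976:
    """Build a Field976 from (code, value) pairs, ignoring unknown codes."""
    values = {}
    for code, value in subfields:
        name = SUBFIELD_NAMES.get(code)
        if name is not None:
            values[name] = value.strip()
    if "barcode" not in values:
        raise ValueError("a 976 field needs a $b barcode")
    return Field976(**values)


def duplicate_barcodes(fields: Iterable[Field976]) -> List[str]:
    """Return barcodes that occur in more than one 976 field."""
    counts = Counter(field.barcode for field in fields)
    return sorted(barcode for barcode, n in counts.items() if n > 1)
```

A check like `duplicate_barcodes` would only be meaningful if it were run over every FIG holding, since the rule applies across the whole database rather than within one holding.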

Special handling for extra large serial sets

Three titles are so large that the holdings cannot accommodate all the 976 fields. These are all in HART, but are not represented in the FIG holdings. If any new metadata needs to be sent for volumes on these titles, please contact LTS. 

    • US congressional set. MMS ID 990082742980203941
    • Great Britain parliamentary papers. MMS ID 990023755340203941
    • US Supreme court records. MMS ID 990004521330203941

Making corrections to Alma metadata

Corrections you make to bibs with FIG holdings will automatically go to HathiTrust where they will be ingested within several days, and to Google where they will be ingested periodically. 

Certain corrections are particularly important as they may affect whether a digitized object is exposed as full-text or not. These include 008 dates and place of publication, and item description for serials. A full list is available from HathiTrust Rights Determination. For example, if the 008 Date 1 is 19uu, and you correct it to probable date 1910, then when HathiTrust processes the updated metadata it will change the work from "limited search" to full-text (for U.S. publications). 
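
To make the 008 example concrete, the following Python sketch reads Date 1 and the place of publication from a 40-character 008 string and flags dates that contain unknown digits. The character positions (07-10 for Date 1, 15-17 for place) follow the MARC 21 bibliographic format. The function names and the sample value are hypothetical, and the sketch does not reproduce HathiTrust's rights determination logic; it only shows which characters a correction such as 19uu to 1910 touches.

```python
# MARC 21 bibliographic 008 positions (0-based):
#   07-10 = Date 1, 15-17 = place of publication.
# The sample below is made up for illustration.
SAMPLE_008 = "850101" + "s" + "19uu" + "    " + "mau" + " " * 17 + "eng" + " " + "d"


def _check_length(field_008: str) -> None:
    if len(field_008) != 40:
        raise ValueError("a bibliographic 008 field has exactly 40 characters")


def date1(field_008: str) -> str:
    """Return the four characters of Date 1."""
    _check_length(field_008)
    return field_008[7:11]


def place_code(field_008: str) -> str:
    """Return the place of publication code without padding blanks."""
    _check_length(field_008)
    return field_008[15:18].strip()


def has_unknown_date(field_008: str) -> bool:
    """True when Date 1 contains 'u', the MARC fill for an unknown digit."""
    return "u" in date1(field_008)


def with_date1(field_008: str, new_date: str) -> str:
    """Return a copy of the 008 with Date 1 replaced."""
    _check_length(field_008)
    if len(new_date) != 4:
        raise ValueError("Date 1 must be four characters")
    return field_008[:7] + new_date + field_008[11:]
```

For the sample value, `has_unknown_date(SAMPLE_008)` is true, and it becomes false after `with_date1(SAMPLE_008, "1910")`.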

Situations you may encounter and what to do


Scenario 1: The volume description (enum/chron, description) in HathiTrust or Google is wrong for a given digitized object.
Action: Update the 976 field in the FIG holding so that $v has the correct item description. If $w/$x/$y/$z are present, update those as well. DO NOT CHANGE THE BARCODE IN SUBFIELD B. If the Alma item records exist and are also wrong, they should be corrected as well.

Scenario 2: The bib is for the correct resource but needs to be updated with newer copy from OCLC, or be fully cataloged.
Action: Proceed to update the bib as you normally would.

Scenario 3: There are duplicate bibs in Alma and one has a FIG holding.
Action: Merge the bibs using the MDE "merge and combine inventory" function.

Scenario 4: There are duplicate bibs in Alma and they both have FIG holdings.
Action: Report the records to LTS. LTS will:

  • create a single FIG holding that contains a 976 for every item
  • for serials/MVMs, create an 856 that goes to the Hathi record level page
  • for duplicate scans
    • where one is a better copy, or they are the same, create an 856 that goes right to the digital object in Hathi
    • if there is anything problematic about a scan, add a note to $v for future reference

Scenario 5: You come across a record where we have gaps for a multi-volume work or serial.
Action: Add a note to the 856 $z indicating which volumes we have (or don't have), or simply add "incomplete."

Scenario 6: The bib describes an edition or version of the work that does not match that of the digitized object (page scans), AND the FIG holding is the only holding on the bib.
Action: Proceed to update the bib to match the digitized resource.

Scenario 7: The bib describes an edition or version of the work that does not match that of the digitized object (page scans), AND there are physical holdings in addition to the FIG holding.
Action: If the digitized object has the same barcode as the physical piece, and there is only one physical item on the bib, proceed to update the bib to match the digitized resource. In all other cases, report the record to LTS.

Scenario 8: The bib description in Alma already matches the digitized object and was updated more than one week ago, but the bib-level metadata for the object that displays in HathiTrust or Google is wrong.
Action: Report the records to LTS. (This could signify an issue with the flow of data from Alma to Hathi/Google.)

Automated process for sending updated metadata to Hathi and Google

  • When you make an update to a bib, or to a holding 976 field, the updated record will be sent to Hathi and Google overnight. 
  • Hathi generally processes the update within 2-3 days.
  • Google processes updates periodically. 

If you come across any links that aren't working, or that go to the wrong place, please report the issue to LTS.

What to do when an item should be in full-view but is not

Check that our metadata contains the correct dates. If it is incorrect, see the section above, "Making corrections to Alma metadata."

HathiTrust's bibliographic rights determination process details whether you can expect something to be in full view or not. 

NOTE: Because international laws vary, users with IP addresses outside of the United States may not be able to access some volumes in HathiTrust/Google. 

In HOLLIS – links to HathiTrust that aren't from FIG holdings

When you are in HOLLIS, links to HathiTrust may come from one of three places:

    • FIG holdings
    • a real-time API call to HathiTrust when we have print only
    • the CDI HathiTrust collection that is searched in Catalog & Articles.

More information is available at Online links in HOLLIS search results (Library Catalog scope)

How HathiTrust handles our metadata

What match points does HathiTrust use when we send updated metadata? 

Sometimes matching can be tricky because our system numbers changed when we migrated from Aleph to Alma, and we finished a Data Sync project to update the OCLC numbers in our records. Our content may not match what we sent in the Aleph days, or we may have added content that wasn't there before.    

When we send HathiTrust an updated bib for a barcode we added in the past, the system:

    1. First looks for a match on OCLC number
    2. Next it tries to match using the Alma ID
    3. Next it tries to match using the old Aleph ID (based on the value in bib 900 $a)
      1. When records are merged in Alma, they may end up with two 900 $a fields. Only the first is sent to HathiTrust. 
    4. If none of these match, the record is rejected and LTS receives an error report
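
The matching order above can be pictured with a short Python sketch. This is only an illustration of the cascade as described here, not HathiTrust's actual code: the function name, the dictionary keys, and the idea of a lookup table keyed by (identifier type, value) are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def find_match(
    incoming: dict,
    known: Dict[Tuple[str, str], str],
) -> Optional[Tuple[str, str]]:
    """Try OCLC number, then Alma ID, then the first 900 $a (old Aleph ID).

    `incoming` is a hypothetical record summary with the keys "oclc",
    "alma_id", and "aleph_ids" (the 900 $a values, in record order).
    `known` is a hypothetical lookup of (identifier type, value) to an
    existing record ID. Returns (matched identifier type, record ID),
    or None when nothing matches and the record would be rejected.
    """
    aleph_ids = incoming.get("aleph_ids") or []
    # Only the first 900 $a is sent, so later values are never tried.
    first_aleph_id = aleph_ids[0] if aleph_ids else None

    candidates = [
        ("oclc", incoming.get("oclc")),
        ("alma_id", incoming.get("alma_id")),
        ("aleph_id", first_aleph_id),
    ]
    for kind, value in candidates:
        if value and (kind, value) in known:
            return kind, known[(kind, value)]
    return None
```

The sketch also shows why merged records can fail to match: an Aleph ID that sits in a second 900 $a is never consulted.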

How to fix cases where there are two separate Harvard bibs in HathiTrust (each with a separate barcode), and only one bib in Alma (FIG holding will have multiple barcodes)

This can happen if the records were separate in Aleph, and then identified as duplicates in Alma and merged. You can send a ticket to LTS to report this. We'll take these steps: 

    1. Background: we send only the first 900 $a (the old Aleph ID) to HathiTrust, so HathiTrust cannot match on the old Aleph ID in any 900 $a but the first.
    2. We need to send only a single bib with the first barcode to HathiTrust on one day, and then on consecutive days send a bib with each of the remaining barcodes (i.e. one barcode per day). 
    3. This will allow the HathiTrust system to recalculate the clusters. The clusters cannot be recalculated if we send all the barcodes in the same file. 
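
The one-barcode-per-day rule can be sketched as a simple schedule. The Python below is a hypothetical planning helper, not the process LTS actually runs; the function name and the choice to drop repeated barcodes are assumptions for the example.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def schedule_barcodes(barcodes: Iterable[str], start: date) -> List[Tuple[date, str]]:
    """Assign each barcode to its own consecutive day, beginning at `start`.

    Repeated barcodes are dropped (keeping the first occurrence) so that
    no barcode is scheduled twice.
    """
    unique = list(dict.fromkeys(barcodes))
    return [(start + timedelta(days=offset), barcode)
            for offset, barcode in enumerate(unique)]
```

For three barcodes and a given start date, the plan covers three consecutive days with one barcode each, which matches the requirement that the barcodes never travel in the same file.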

About HathiTrust record IDs

Record IDs change comparatively rarely: an estimated 3% of record IDs may change over the course of a year. A record ID changes when a new record is selected as the preferred record, or when a change in OCN requires re-clustering. For example, if an OCN has been succeeded by a new one, the records are reclustered under a new ID. De-duping of OCNs would likely have the same effect, because all of the separate records are clustered together for the first time.

Hathi has a redirect script in place that should keep links fresh even if there is a new record ID. It keeps track of the OCN and recordID changes to mitigate as much as possible the potential for dead links in user catalogs and third party indices, or even in papers and dissertations. 

Reporting on Google, HathiTrust, and Harvard data ❤️

Detailed reporting is available in HART; see the "Google Library Inventory (GLIB)" section of the HART wiki. A Dashboard, "Google Library Inventory Items," has been created to make reporting easier.

If the item was sent to Google for scanning, but was later removed from Google's tracking database (GRIN) due to an unusable scan, it will not appear in the Dashboard. 

Other topics

Is there a way to tell if we sent something to Google and they couldn't scan it?

    • See the "Google Library Inventory (GLIB)" section of the HART wiki
    • A note of "unscannable" in the holdings record could mean that there was no publication date on the piece; it is not necessarily a reflection of the condition of the volume

Whose bib records does HathiTrust use in its catalog? 

When multiple HathiTrust partners send a bib for the same title, HathiTrust assigns a cluster ID (CID) to the duplicate records. A scoring algorithm is used to determine which record from the cluster will be used in the HathiTrust catalog, so you may not always see the Harvard bib. See the Record Scoring section at HathiTrust for a link to the most current documentation.

Image problems or bad scans

What kind of problem you encounter determines how to report the issue:

    • Google can now improve or fix some bad scans due to ongoing image processing improvements. For issues where the image is warped or has crooked pages; repeated or out-of-order pages; poor image quality or color; scanner's hands/scanning mechanism present in the scan; or poor resolution, contact: 
      • Google: Choose the "Report a quality issue with a book" option on the Library Partners: Contact us form. Remember to include the link to the record in their system, i.e., "https://books.google.com/books?vid=HARVARD:[Harvard barcode in all caps]"
      • Updates made to scans in Google will eventually be harvested by Hathi.
      • HathiTrust: If the Google scan is better than the Hathi scan, write to support@hathitrust.org so the Digital Object Quality Corrections team can try to improve the scans. Remember to include the link to the record in their system, i.e., "https://babel.hathitrust.org/cgi/pt?id=hvd.[Harvard barcode lowercase]"

    • If the image cannot be fixed and/or the problem affects the user's ability to access the complete content, we may want to discontinue linking to the resource entirely. Google may be able to remove the scan from its database, but there is no process for HathiTrust to do so at this time. If a scan cannot be improved and is unusable:
      • Report it to LTS so that a note can be added in HART. LTS will add a note indicating that a scan is bad, and will remove the FIG holding from Alma.
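
When reporting a scan problem, the two link patterns quoted above can be filled in from a barcode. The Python sketch below is a convenience example only; the function names are hypothetical, and the URL patterns are copied from the instructions in this section.

```python
def google_report_url(barcode: str) -> str:
    """Link to the volume in Google Books (barcode in all caps)."""
    return f"https://books.google.com/books?vid=HARVARD:{barcode.strip().upper()}"


def hathi_report_url(barcode: str) -> str:
    """Link to the volume in HathiTrust (barcode in lowercase)."""
    return f"https://babel.hathitrust.org/cgi/pt?id=hvd.{barcode.strip().lower()}"
```

Both functions normalize case and strip surrounding whitespace, so a barcode pasted from Alma can be passed in unchanged.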