Data Organization and Data Transfer Guide

Imaging Services will work with Harvard libraries, museums, and archives to share digitized content (digital images and associated metadata) with external partners (e.g., funders, commercial publishers, other cultural heritage organizations).

However, Imaging Services will not:

Description of files

File naming

Image files

Image file names consist of one or more components joined by underscores, following one of these patterns: [UNIQUE_ID].jp2, [UNIQUE_ID]_[Sequence_number].jp2, [UNIQUE_ID]_[Volume_number]_[Sequence_number].jp2, or [UNIQUE_ID]_[Volume_number]_[Sequence_number]_[Issue_number].jp2

  1. [UNIQUE_ID]: Usually this will be a catalog record ID (see note 1). For example, a 9-digit HOLLIS ID (e.g. 011835322) is recommended for an item that has been cataloged in Harvard’s HOLLIS catalog system. For manuscript collections, the library or archives repository will provide a unique identifier from HOLLIS for Archival Discovery or a local tracking system. For photographs or other art objects, the library or archives repository will provide a unique identifier from JSTOR Forum or a local tracking system.

  2. [Volume_number]: (multi-volume sets only) a zero-padded volume sequence number of one or more digits, prefixed with 'v' (e.g. v5, v05 or v0005).

  3. [Issue_number]: (multi-issue sets only) a zero-padded issue sequence number of one or more digits, prefixed with 'n' (e.g. n2, n02 or n002).
  4. [Sequence_number]: 4-digit or 5-digit page sequence number (e.g. 0045, 00045).  This does not apply to photograph or art object image files.
  5. .jp2: File format (JPEG 2000, part 1) extension.
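
The naming components above can be checked mechanically. The following is a minimal sketch, not an Imaging Services tool: the function name parse_image_name and the regular expression are illustrative assumptions. It assumes the component order listed in the patterns above, and it reads from the right because a [UNIQUE_ID] may itself contain underscores (e.g. morgan_601_705_volIV). A unique identifier that happens to end in an underscore followed by four or five digits would be misread as having a sequence number.

```python
import re

# Hypothetical pattern derived from the components described above.
# The unique ID is matched lazily so that trailing volume, sequence,
# and issue components are peeled off the end of the name first.
IMAGE_NAME = re.compile(
    r"^(?P<unique_id>.+?)"
    r"(?:_(?P<volume>v\d+))?"        # e.g. v5, v05, v0005
    r"(?:_(?P<sequence>\d{4,5}))?"   # e.g. 0045, 00045
    r"(?:_(?P<issue>n\d+))?"         # e.g. n2, n02, n002
    r"\.jp2$"
)


def parse_image_name(name):
    """Split an image file name into its components (None if absent)."""
    match = IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized image file name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    for example in ("011835322.jp2", "000652831_v0002_0001.jp2", "ss_123458.jp2"):
        print(example, parse_image_name(example))
```

Photograph and art object names such as ss_123458.jp2 come back with only unique_id set, which matches the rule that sequence numbers do not apply to those files.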

Examples

Note: 

Occasionally, a project may require a different file name pattern because of a project partner's specific needs. For example, the Black Teacher Archive project names files as [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#] (for example, bta_30786193_MA_1966_038_008.jp2).
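
As an illustration of such a project-specific pattern, the sketch below assembles a name in the Black Teacher Archive form. The function name bta_file_name is hypothetical, and the 3-digit zero padding of the volume and issue numbers is an assumption inferred from the single example above, not a documented rule.

```python
def bta_file_name(oclc, state, year, volume, issue, project="bta"):
    """Build [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#].jp2.

    Assumption: volume and issue are zero-padded to 3 digits,
    as in the example bta_30786193_MA_1966_038_008.jp2.
    """
    return f"{project}_{oclc}_{state}_{year}_{volume:03d}_{issue:03d}.jp2"


if __name__ == "__main__":
    print(bta_file_name(30786193, "MA", 1966, 38, 8))
```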

OCR plain text files

OCR plain text file names have the same prefix as their corresponding image file names, with a .txt file extension.

OCR ALTO files

OCR ALTO file names have the same prefix as their corresponding image file names, with an .xml file extension.
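
Because both OCR file types share the image file's prefix, the companion names can be derived from the image name alone. A minimal sketch follows; the function name companion_names and the dictionary keys are illustrative, not part of any Imaging Services tooling.

```python
from pathlib import PurePosixPath


def companion_names(image_name):
    """Return the expected OCR file names for one .jp2 image file name."""
    path = PurePosixPath(image_name)
    if path.suffix != ".jp2":
        raise ValueError(f"expected a .jp2 file name, got {image_name!r}")
    return {
        "ocr_text": path.stem + ".txt",  # OCR plain text file
        "ocr_alto": path.stem + ".xml",  # OCR ALTO file
    }


if __name__ == "__main__":
    print(companion_names("007984492_0001.jp2"))
```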

MARCXML files (see note 2)

MARCXML file names consist of two components: [HOLLIS_ID].xml

Packaging tag files

For packaging tag file names, see the section on the “BagIt” file-packaging and -interchange protocol.

Packaging data

Simple packaging

Usually we package the data in the following directory structure.

Monograph and manuscript materials

For monograph and manuscript materials, we group files into items and items into batches.  For example, the following batch contains three items.  Two items contain only JPEG2000 image files, and one item contains JPEG2000, OCR plain text, and OCR ALTO XML files.

    [BATCH ID] (see note 3)
        |-- [UNIQUE_ID]-mets.xml (single volume METS file example: 007984492-mets.xml)
        |-- [UNIQUE_ID]_[VOLUME_ID]-mets.xml (multivolume METS file example: 000652831_v0002-mets.xml)
        |-- [UNIQUE_ID]-mets.xml (manuscript collection METS file example: morgan_601_705_volIV-mets.xml)
        |-- [HOLLIS_ID].xml (MARCXML or MODS xml file, e.g., 000652831.xml)
        |
        |-- [UNIQUE_ID]/ (manuscript collection example, e.g. morgan_601_705_volIV)
        |   |-- [UNIQUE_ID]_[####].jp2
        |   |-- morgan_601_705_volIV_0001.jp2
        |   |-- morgan_601_705_volIV_0002.jp2
        |   |-- morgan_601_705_volIV_0003.jp2
        |   |-- morgan_601_705_volIV_0004.jp2
        |   ...
        |   |-- morgan_601_705_volIV_0099.jp2
        |
        |-- [HOLLIS_ID]/ (single volume monograph example, e.g. 007984492)
        |   |-- [HOLLIS_ID]_[####].jp2
        |   |-- [HOLLIS_ID]_[####].txt
        |   |-- [HOLLIS_ID]_[####].xml
        |   |-- 007984492_0001.jp2
        |   |-- 007984492_0001.txt
        |   |-- 007984492_0001.xml
        |   |-- 007984492_0002.jp2
        |   ...
        |   |-- 007984492_0099.jp2
        |   |-- 007984492_0099.txt
        |   |-- 007984492_0099.xml
        |
        |-- [UNIQUE_ID]_[VOLUME_ID]/ (multi-volume example, e.g. 000652831_v0002)
            |-- [UNIQUE_ID]_[v####]_[####].jp2
            |-- 000652831_v0002_0001.jp2
            |-- 000652831_v0002_0002.jp2
            |-- 000652831_v0002_0003.jp2
            ...
            |-- 000652831_v0002_0099.jp2
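
A recipient may want to confirm that a received batch follows this layout. The sketch below is one possible check, written under two assumptions drawn from the example tree rather than from a stated rule: each item directory has a METS file named after it at the batch level ([item directory name]-mets.xml), and every file inside an item directory starts with the item directory name followed by an underscore. The function name check_batch is hypothetical.

```python
from pathlib import Path


def check_batch(batch_dir):
    """Return a list of problems found in a simple-packaging batch directory."""
    problems = []
    batch = Path(batch_dir)
    for item_dir in sorted(p for p in batch.iterdir() if p.is_dir()):
        # Assumption: one METS file per item, stored next to the item directory.
        mets = batch / f"{item_dir.name}-mets.xml"
        if not mets.is_file():
            problems.append(f"missing METS file: {mets.name}")
        # Assumption: item files are named [item directory name]_[####].ext
        for entry in sorted(item_dir.iterdir()):
            if not entry.name.startswith(item_dir.name + "_"):
                problems.append(
                    f"unexpected file name: {item_dir.name}/{entry.name}"
                )
    return problems


if __name__ == "__main__":
    import sys

    for problem in check_batch(sys.argv[1]):
        print(problem)
```

An empty result means only that the names are consistent with the layout shown; it says nothing about file content.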

Photograph and other art objects

For photographs and other art objects, we group files into batches.  For example, the following batch contains a set of JPEG2000 files.

[BATCH ID] (see note 3)
   |-- ss_123458.jp2
   |-- ss_458790.jp2
       ...
   |-- ss_987692.jp2

NOTE: Each delivery may contain several batches.

Use of the “BagIt” file-packaging and -interchange protocol

If the data recipient needs the data packaged in a way that facilitates data ingestion, Imaging Services can package the data following the “BagIt” specification.

The data files provided will be arranged and inventoried in accordance with the “BagIt” specification promoted by the Preservation Directorate of the Library of Congress.

To learn more about BagIt and to investigate the freely available tools for checking the integrity of BagIt-packaged data, we suggest you consult the following online resources:

Organization of files with “BagIt” packaging

<root directory>
| bag-info.txt  
| bagit.txt
| manifest-md5.txt
| tagmanifest-md5.txt
|
|-- data
    |-- [ THE SAME DIRECTORY STRUCTURE AS THE SIMPLE PACKAGING ABOVE ]
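
To illustrate how the manifest supports integrity checking, the sketch below recomputes the MD5 checksum of each file listed in manifest-md5.txt and reports any that are missing or differ. It is a simplified illustration, not a full validator: it assumes each manifest line holds a checksum, whitespace, and a path relative to the bag's root directory, and it ignores tagmanifest-md5.txt and any percent-encoded characters in paths. The function names md5_of and verify_manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_root):
    """Return the manifest paths that are missing or fail their checksum."""
    root = Path(bag_root)
    failures = []
    manifest = (root / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each line: <checksum> <path relative to the bag root>
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = root / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    import sys

    for failed in verify_manifest(sys.argv[1]):
        print("FAILED:", failed)
```

For production use, an established BagIt validation tool is the safer choice, since it also checks the tag files and the completeness of the payload.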

Delivery media and transfer method

The owning repositories may choose to deliver the data using their own methods.  The following are some of the ways we deliver the data.


1. In cases where the record identifier includes spaces, the spaces will be replaced by underscores.

2. MARCXML records are only available for items that have been cataloged in Harvard’s bibliographic database, HOLLIS.

3. Batch level identifiers are assigned to groups of titles prepared and submitted together for scanning. These named “batches” will be maintained from scanning all the way through deposit to Harvard's Digital Repository Service and transfer of data to project partners beyond the Harvard libraries. Inclusion of technical metadata is optional.