Data Organization and Data Transfer Guide

...

  • Single volume file name examples: 010010723_0001.jp2, 990079903010203941_0001.jp2
  • Multi-volume file name examples: 008105127_v0007_0001.jp2, 990041786900203941_v0001_0001.jp2
  • Multi-volume and multi-issue file name examples: 008105127_v0007_n003_0001.jp2
  • Manuscript collection file name examples: morgan_601_705_volIV_0001.jp2, sch01593c00004_0001.jp2
  • Photographs, films, art objects file name examples: SS_24369503.jp2, G3884_P4A9_1864_M3.jp2, LA-GD-UF-3-03B-CT.jp2

Note: 

Occasionally, a project may require a different file name pattern to meet project partners' specific needs. For example, the Black Teacher Archive project needs to name files as [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#].
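
The page-level naming patterns listed above (single volume, multi-volume, multi-issue, and manuscript) can be illustrated with a small parser. This is only a sketch under stated assumptions: the regular expression, the `PageFile` type, and the `parse_page_file_name` function are hypothetical names introduced here, and the pattern assumes a four-digit page sequence with optional `_v` and `_n` components, as in the examples. Object-level names such as SS_24369503.jp2 and project-specific patterns are deliberately not matched.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the examples in this guide:
#   [UNIQUE_ID][_v<volume>][_n<issue>]_<4-digit sequence>.<jp2|txt|xml>
_PAGE_FILE = re.compile(
    r"(?P<unique_id>.+?)"        # shortest prefix that lets the rest match
    r"(?:_v(?P<volume>\d+))?"    # optional volume, e.g. _v0007
    r"(?:_n(?P<issue>\d+))?"     # optional issue, e.g. _n003
    r"_(?P<sequence>\d{4})"      # page sequence, e.g. _0001
    r"\.(?P<ext>jp2|txt|xml)"    # image, OCR text, or OCR ALTO file
)


class PageFile(NamedTuple):
    unique_id: str
    volume: Optional[str]
    issue: Optional[str]
    sequence: str
    ext: str


def parse_page_file_name(name: str) -> Optional[PageFile]:
    """Split a page-level file name into its components.

    Returns None when the name does not follow the assumed pattern,
    which is the case for object-level names such as SS_24369503.jp2.
    """
    match = _PAGE_FILE.fullmatch(name)
    if match is None:
        return None
    return PageFile(**match.groupdict())
```

For example, `parse_page_file_name("008105127_v0007_n003_0001.jp2")` yields unique ID `008105127`, volume `0007`, issue `003`, and sequence `0001`. A name whose final component happens to be exactly four digits would also be read as a page file, so the sketch should not be applied to photograph or art object batches.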

...

OCR ALTO file names have the same prefix as their corresponding image file names, with an .xml file extension.
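
As a minimal sketch of this naming rule, the following derives the expected OCR ALTO file name from an image file name and pairs each image in a listing with its ALTO file when one is present. The function names `alto_name_for` and `pair_images_with_alto` are hypothetical and introduced only for illustration. Because some items contain image files only, a missing ALTO file is reported as None rather than treated as an error.

```python
from pathlib import PurePath
from typing import Dict, Iterable, Optional


def alto_name_for(image_name: str) -> str:
    """Return the ALTO file name that shares the image's prefix."""
    return PurePath(image_name).with_suffix(".xml").name


def pair_images_with_alto(file_names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each .jp2 name to its ALTO .xml name, or None if absent."""
    names = set(file_names)
    pairs: Dict[str, Optional[str]] = {}
    for name in sorted(names):
        if not name.lower().endswith(".jp2"):
            continue  # only image files are used as keys
        alto = alto_name_for(name)
        pairs[name] = alto if alto in names else None
    return pairs
```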

MARCXML files 2


MARCXML file names consist of two components: [HOLLIS_ID].xml

...

Packaging data

Simple packaging

Batches, items, files

All files will be organized into batches.

Use of the “Bagit” file-packaging and -interchange protocol

The data files provided will be arranged and inventoried in accordance with the “Bagit” specification promoted by the Preservation Directorate of the Library of Congress.

To learn more about Bagit and to explore the freely available tools for checking the integrity of Bagit-packaged data, we suggest consulting the following online resources:

A video produced by the Library of Congress to introduce the “Bagit” specification: http://www.youtube.com/watch?v=l3p3ao_JSfo

Open-source Bagit software tools: https://github.com/LibraryOfCongress

Wikipedia entry: https://en.wikipedia.org/wiki/BagIt

Organization of files and file system on portable media (i.e., portable hard drive) 3

...

Usually we package the data in the following directory structure.

For monograph and manuscript materials, we group files into items and items into batches.  For example, the following batch contains three items.  Two items contain only JPEG2000 image files, and one item contains JPEG2000, OCR plain text, and OCR ALTO XML files.

    [BATCH ID] (see note 4)
        |-- [UNIQUE_ID]-mets.xml (single volume METS file example: 007984492-mets.xml)
        |
        |-- [UNIQUE_ID]_[VOLUME_ID]-mets.xml (multivolume METS file example: 000652831_v0002-mets.xml)
        |
        |-- [UNIQUE_ID]-mets.xml (manuscript collection METS file example: morgan_601_705_volIV-mets.xml)
        |
        |-- batch.xml (see note 5) (technical metadata file)
        |
        |-- [HOLLIS_ID].xml (see note 6) (MARCXML or MODS XML file, e.g., 000652831.xml)
        |
        |-- [UNIQUE_ID (see note 7)]/ (manuscript collection example, e.g. morgan_601_705_volIV)
        |       |-- [UNIQUE_ID]_[####].jp2
        |       |-- morgan_601_705_volIV_0001.jp2
        |       |-- morgan_601_705_volIV_0002.jp2
        |       |-- morgan_601_705_volIV_0003.jp2
        |       |-- morgan_601_705_volIV_0004.jp2
        |       ...
        |       |-- morgan_601_705_volIV_0099.jp2
        |
        |-- [HOLLIS_ID]/ (single volume monograph example, e.g. 007984492)
        |       |-- [HOLLIS_ID]_[####].jp2
        |       |-- [HOLLIS_ID]_[####].txt
        |       |-- [HOLLIS_ID]_[####].xml
        |       |-- 007984492_0001.jp2
        |       |-- 007984492_0001.txt
        |       |-- 007984492_0001.xml
        |       |-- 007984492_0002.jp2
        |       ...
        |       |-- 007984492_0099.jp2
        |       |-- 007984492_0099.txt
        |       |-- 007984492_0099.xml
        |
        |-- [UNIQUE_ID]_[VOLUME_ID]/ (see note 8) (multi-volume example, e.g. 000652831_v0002)
                |-- [UNIQUE_ID]_[v####]_[####].jp2
                |-- 000652831_v0002_0001.jp2
                |-- 000652831_v0002_0002.jp2
                |-- 000652831_v0002_0003.jp2
                ...
                |-- 000652831_v0002_0099.jp2
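
A recipient may want a quick inventory of a received batch laid out as above. The following is a minimal sketch, assuming that every subdirectory of the batch directory is an item and that file types can be told apart by extension alone. The function name `summarize_batch` is hypothetical and not part of any Imaging Services tool.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def summarize_batch(batch_dir: Path) -> Dict[str, Counter]:
    """Count files by extension for each item directory in a batch.

    Files stored directly in the batch directory (METS files, batch.xml,
    MARCXML or MODS files) are ignored; only item directories are listed.
    """
    summary: Dict[str, Counter] = {}
    for item_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
        summary[item_dir.name] = Counter(
            f.suffix.lower() for f in item_dir.iterdir() if f.is_file()
        )
    return summary


# Example use (the path is a placeholder):
#   for item, counts in summarize_batch(Path("/path/to/batch")).items():
#       print(item, dict(counts))
```

An item that reports only `.jp2` files is an image-only item, while one that reports equal counts of `.jp2`, `.txt`, and `.xml` files also carries OCR plain text and OCR ALTO output.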


For photographs and other art objects, we group files into batches.  For example, the following batch contains a set of JPEG2000 files.

[BATCH ID] (see note 4)
   |-- ss_123458.jp2
   |-- ss_458790.jp2
       ...
   |-- ss_987692.jp2

 


NOTE: Each delivery may contain more than one batch.

Use of the “Bagit” file-packaging and -interchange protocol

If the data recipient needs the data packaged in a way that facilitates data ingestion, Imaging Services can package the data following the "Bagit" specification.

Organization of files with "Bagit" packaging

<root directory>
| bag-info.txt  
| bagit.txt
| manifest-md5.txt
| tagmanifest-md5.txt
|
|-- data
    |-- [ THE SAME DIRECTORY STRUCTURE AS THE SIMPLE PACKAGING ABOVE ]
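
To illustrate how the manifest in this layout supports integrity checking, here is a minimal sketch that recomputes MD5 checksums for the files listed in manifest-md5.txt. It assumes each manifest line holds a checksum, whitespace, and a path relative to the bag root, and it does not handle percent-encoded paths, files under data that are absent from the manifest, or tagmanifest-md5.txt. The function names `md5_of` and `find_manifest_problems` are hypothetical; dedicated Bagit tools perform a fuller validation.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_manifest_problems(bag_root: Path) -> List[str]:
    """Return one message per listed file that is missing or altered."""
    problems: List[str] = []
    manifest = bag_root / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only once so that paths containing spaces stay intact.
        expected, rel_path = line.split(maxsplit=1)
        target = bag_root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif md5_of(target) != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

An empty list means every file named in the manifest is present and matches its recorded checksum.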

Delivery media and transfer method

The owning repositories may choose to deliver the data using their own preferred methods.  The following are some of the ways we deliver data:

  • Hard disk, flash drive
    • The repositories can borrow the media from Imaging Services or pay for them
    • Recommended for large sets of data
  • Google shared drive

...

    • The data recipient's Google account needs to be provided to Imaging Services
  • MS shared directory
    • The data recipient's email address needs to be provided to Imaging Services
    • Suitable for small sets of data
  • Secure file transfer

...

NOTES

...


1. In cases where the record identifier includes spaces, the spaces will be replaced by underscores.

2. MARCXML records are only available for items that have been cataloged in Harvard’s bibliographic database, HOLLIS.

4. Batch-level identifiers are assigned to groups of titles prepared and submitted together for scanning. These named “batches” will be maintained from scanning all the way through deposit to Harvard's Digital Repository Service and transfer of data to project partners beyond the Harvard libraries.

5. Inclusion of technical metadata is optional.

6. Inclusion of MARCXML files is optional.

7. The title's unique identifier is used as the directory name.

8. Individual volumes or fascicles will be labeled using a two- or three-digit sequence number (e.g., v001, v002, v099, v123).