Data Organization and Data Transfer Guide

Imaging Services will work with Harvard libraries, museums, and archives to share digitized content (digital images and associated metadata) with external partners (e.g., funders, commercial publishers, other cultural heritage organizations).

However:

  • Imaging Services will not produce images or associated metadata files in formats, special format encodings, or technical specifications required by external partners when these specifications differ from the standard workflows developed to create digital content for storage in Harvard Library's Digital Repository Service.
  • Responsibility for transferring data, converting data, and providing any supplemental data needed to satisfy external partner requirements is borne by the owning repository that has contracted with Imaging Services.

Description of files

  • Image files are encoded in compliance with the JPEG2000 standard
    • Data compression: irreversible 9-7 wavelet transform for lossy compression, or reversible 5-3 wavelet transform for lossless compression.
    • Photometric interpretation:
      • RGB (color), 8-bits per channel, embedded sRGB ICC profile.
      • Grayscale, 8 or 16-bits per channel.
  • OCR plain text files are in UTF-8 encoding (optional, not available for photograph, film, and art objects).
  • OCR ALTO files conform to the ALTO standard (optional, not available for photograph, film, and art objects).
  • Structural metadata files conform to the METS standard and Harvard's METS profile for page-turned objects (optional, not available for photograph, film, and art objects).
  • MARC record files are in MARCXML or MODS format (optional, not available for photograph, film, and art objects).
  • Packaging tag files generated by the packaging application (Bagit) describe the package (optional).
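
As a quick sanity check on received files, the sketch below tests whether a file opens with the 12-byte JP2 signature box and whether a text file decodes as UTF-8. The function names are illustrative only and are not part of any Imaging Services tooling; the check does not validate compression settings, color profiles, or full standard compliance.

```python
from pathlib import Path

# The 12-byte signature box that opens every JP2 (JPEG 2000 Part 1) file.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def looks_like_jp2(path: Path) -> bool:
    """Return True if the file starts with the JP2 signature box."""
    with open(path, "rb") as handle:
        return handle.read(len(JP2_SIGNATURE)) == JP2_SIGNATURE


def is_utf8_text(path: Path) -> bool:
    """Return True if the whole file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails either check is worth a closer look with a dedicated validation tool before it is accepted.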

File naming

Image files

Image file names consist of one or more components and follow one of these patterns: [UNIQUE_ID].jp2, [UNIQUE_ID]_[Sequence_number].jp2, [UNIQUE_ID]_[Volume_number]_[Sequence_number].jp2, or [UNIQUE_ID]_[Volume_number]_[Issue_number]_[Sequence_number].jp2

  1. [UNIQUE_ID]: Usually a catalog record ID (see note 1). For example, a 9-digit HOLLIS ID (e.g. 011835322) is recommended for an item that has been cataloged in Harvard’s HOLLIS catalog system. For manuscript collections, the library or archives repository will provide a unique identifier from HOLLIS for Archival Discovery or a local tracking system. For photographs or other art objects, the library or archives repository will provide a unique identifier from JSTOR Forum or a local tracking system.

  2. [Volume_number]: (multi-volume sets only) Volume sequence number of one or more digits, optionally zero-padded, with the prefix 'v' (e.g. v5, v05 or v0005).

  3. [Issue_number]: (multi-issue sets only) Issue sequence number of one or more digits, optionally zero-padded, with the prefix 'n' (e.g. n2, n02 or n002).
  4. [Sequence_number]: 4-digit or 5-digit page sequence number (e.g. 0045, 00045). This does not apply to photograph or art object image files.
  5. .jp2: File format (JPEG 2000, part 1) extension.

Examples

  • Single volume file name examples: 010010723_0001.jp2, 990079903010203941_0001.jp2
  • Multi-volume file name examples: 008105127_v0007_0001.jp2, 990041786900203941_v0001_0001.jp2
  • Multi-volume and multi-issue file name examples: 008105127_v0007_n003_0001.jp2
  • Manuscript collection file name examples: morgan_601_705_volIV_0001.jp2, sch01593c00004_0001.jp2
  • Photographs, films, art objects file name examples: SS_24369503.jp2, G3884_P4A9_1864_M3.jp2, LA-GD-UF-3-03B-CT.jp2

Note: 

Occasionally, a project may require a different file name pattern to meet a project partner's specific needs. For example, the Black Teacher Archive project needs to name files as [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#] (for example, bta_30786193_MA_1966_038_008.jp2).
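
The naming convention can be sketched as a regular expression. This is a minimal illustration, not an official parser: the names `ImageName` and `parse_image_name` are hypothetical, and the component order (volume, then issue, then sequence) is taken from the multi-volume and multi-issue example above. Because unique identifiers may themselves contain underscores and digits, an identifier that ends in an underscore followed by four or five digits would be read as having a sequence number, and project-specific patterns such as the Black Teacher Archive one are returned as a bare identifier.

```python
import re
from typing import NamedTuple, Optional


class ImageName(NamedTuple):
    unique_id: str
    volume: Optional[str]    # e.g. "v0007", or None
    issue: Optional[str]     # e.g. "n003", or None
    sequence: Optional[str]  # e.g. "0001", or None


# The identifier is matched lazily so the optional trailing components
# are recognized even when the identifier itself contains underscores.
_IMAGE_NAME = re.compile(
    r"(?P<unique_id>.+?)"
    r"(?:_(?P<volume>v\d+))?"
    r"(?:_(?P<issue>n\d+))?"
    r"(?:_(?P<sequence>\d{4,5}))?"
    r"\.jp2"
)


def parse_image_name(file_name: str) -> Optional[ImageName]:
    """Split an image file name into its components, or return None."""
    match = _IMAGE_NAME.fullmatch(file_name)
    if match is None:
        return None
    return ImageName(**match.groupdict())
```

For instance, `parse_image_name("008105127_v0007_n003_0001.jp2")` yields the identifier `008105127`, volume `v0007`, issue `n003`, and sequence `0001`.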

OCR plain text files

OCR plain text file names have the same prefix as their corresponding image file names, with a .txt file extension.

OCR ALTO files

OCR ALTO file names have the same prefix as their corresponding image file names, with an .xml file extension.
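
The relationship between an image file name and its OCR companions can be sketched as follows. This assumes that "prefix" means everything before the .jp2 extension; the function name `companion_names` is hypothetical.

```python
from pathlib import PurePosixPath


def companion_names(image_name: str) -> dict:
    """Derive the expected OCR file names for one image file name."""
    stem = PurePosixPath(image_name).stem  # file name without ".jp2"
    return {"ocr_text": f"{stem}.txt", "ocr_alto": f"{stem}.xml"}
```

Since OCR files are optional, a missing companion file is not necessarily an error.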

MARCXML files (see note 2)

MARCXML file names consist of two components, the HOLLIS ID and the .xml extension: [HOLLIS_ID].xml

Packaging tag files

For packaging tag file names, see the section "Use of the “Bagit” file-packaging and -interchange protocol".

Packaging data

Simple packaging

Batch, Items, files

All files will be organized into Batches.  

Use of the “Bagit” file-packaging and -interchange protocol

The data files provided will be arranged and inventoried in accordance with the “Bagit” specification promoted by the Preservation Directorate of the Library of Congress.

To learn more about Bagit and the freely available tools for checking the integrity of Bagit-packaged data, we suggest consulting the following online resources:

A video produced by the Library of Congress to introduce the “Bagit” specification: http://www.youtube.com/watch?v=l3p3ao_JSfo

Open-source Bagit software tools: https://github.com/LibraryOfCongress

Wikipedia entry: https://en.wikipedia.org/wiki/BagIt
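
For a sense of what an integrity check involves, the sketch below recomputes MD5 checksums and compares them with a payload manifest. It is a simplified illustration, not a full validator: it assumes each manifest line holds a checksum, whitespace, and a path relative to the bag root, and it ignores tag manifests and encoded characters in paths. The function names are hypothetical; the Library of Congress tools linked above perform a more complete validation.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_root: Path, manifest_name: str = "manifest-md5.txt") -> list:
    """Return the listed paths that are missing or whose checksum differs."""
    problems = []
    manifest = Path(bag_root) / manifest_name
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, relative_path = line.split(maxsplit=1)
        relative_path = relative_path.strip()
        target = Path(bag_root) / relative_path
        if not target.is_file() or md5_of(target) != expected.lower():
            problems.append(relative_path)
    return problems
```

An empty list from `verify_manifest` means every file named in the manifest is present and matches its recorded checksum.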

Organization of files and file system on portable media (i.e., portable hard drive) (see note 3)

<root directory>
| bag-info.txt  
| bagit.txt
| manifest-md5.txt
| tagmanifest-md5.txt
|
|-- data
    |
    |-- [BATCH ID] (see note 4)
        |-- [UNIQUE_ID]-mets.xml (single-volume METS file example: 007984492-mets.xml)
        |
        |-- [UNIQUE_ID]_[VOLUME_ID]-mets.xml (multi-volume METS file example: 000652831_v0002-mets.xml)
        |
        |-- [UNIQUE_ID]-mets.xml (manuscript collection METS file example: morgan_601_705_volIV-mets.xml)
        |
        |-- batch.xml (see note 5) (technical metadata file)
        |
        |-- [HOLLIS_ID].xml (see note 6) (MARCXML file, e.g., 000652831.xml)
        |
        |-- [UNIQUE_ID]/ (see note 7) (manuscript collection example, e.g. morgan_601_705_volIV)
             |
             |-- [UNIQUE_ID]_[####].jp2
             |-- morgan_601_705_volIV_0001.jp2
             |-- morgan_601_705_volIV_0002.jp2
             |-- morgan_601_705_volIV_0003.jp2
             |-- morgan_601_705_volIV_0004.jp2
                 ...
             |-- morgan_601_705_volIV_0099.jp2
        |-- [HOLLIS_ID]/ (single-volume monograph example, e.g. 007984492)
             |
             |-- [HOLLIS_ID]_[####].jp2
             |-- 007984492_0001.jp2
             |-- 007984492_0002.jp2
                 ...
             |-- 007984492_0099.jp2
        |-- [UNIQUE_ID]_[VOLUME_ID]/ (see note 8) (multi-volume example, e.g. 000652831_v0002)
             |
             |-- [UNIQUE_ID]_[VOLUME_ID]_[####].jp2
             |-- 000652831_v0002_0001.jp2
             |-- 000652831_v0002_0002.jp2
             |-- 000652831_v0002_0003.jp2
               ...
             |-- 000652831_v0002_0099.jp2
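
A recipient could inventory a delivery laid out as above with a short script. The sketch below is illustrative only and assumes the layout shown: a data directory containing batch directories, each holding one directory of .jp2 files per item. The function name `inventory` is hypothetical.

```python
from pathlib import Path


def inventory(bag_root: Path) -> dict:
    """Map "batch/item" to the sorted .jp2 file names in that item directory."""
    result = {}
    data_dir = Path(bag_root) / "data"
    for batch_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        for item_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
            images = sorted(f.name for f in item_dir.glob("*.jp2"))
            result[f"{batch_dir.name}/{item_dir.name}"] = images
    return result
```

Comparing the number of images per item against the expected page count is a simple way to spot an incomplete transfer.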

Delivery media and transfer method

Hard disk, flash drive

Google shared drive, MS shared directory ...

Secure file transfer


NOTES

1. In cases where the record identifier includes spaces, the spaces will be replaced by underscores.
2. MARCXML records are only available for items that have been cataloged in Harvard’s bibliographic database, HOLLIS.
3. If more than one disk is needed, a batch may span more than one disk; the corresponding metadata files for the batch will appear on each disk.
4. Batch level identifiers are assigned to groups of titles prepared and submitted together for scanning. These named “batches” will be maintained from scanning all the way through deposit to Harvard's Digital Repository Service and transfer of data to project partners beyond the Harvard libraries.
5. Inclusion of technical metadata is optional.
6. Inclusion of MARCXML files is optional.
7. The title's unique identifier is used as the directory name.
8. Individual volumes or fascicles will be labeled using a two- or three-digit sequence number (e.g., v001, v002, v099, v123).