Data Organization and Data Transfer Guide
Imaging Services will work with Harvard libraries, museums, and archives to share digitized content (digital images and associated metadata) with external partners (e.g., funders, commercial publishers, other cultural heritage organizations).
Note:
- Imaging Services (IS) will not produce images or associated metadata files in formats, or to technical specifications, that require a deviation from the IS workflows developed to produce digital resources that meet local Harvard Library requirements.
- Converting the data that IS provides, or creating any supplemental data needed to satisfy external partner requirements, is the responsibility of the repository that has contracted with IS for services.
Description of files
- Image files are encoded in compliance with the JPEG2000 standard.
  - Data compression: irreversible 9-7 wavelet transform for lossy compression, or reversible 5-3 wavelet transform for lossless compression.
  - Photometric interpretation:
    - RGB (color), 8 bits per channel, embedded sRGB ICC profile.
    - Grayscale, 8 or 16 bits per channel, embedded sGray ICC profile.
- OCR plain text files are in UTF-8 encoding (optional; not available for photograph, film, and art objects).
- OCR ALTO files conform to the ALTO standard (optional; not available for photograph, film, and art objects).
- Structural metadata files conform to the METS standard and Harvard's METS profile for page-turned objects (optional; not available for photograph, film, and art objects).
- MARC record files are in MARCXML or MODS format (optional; not available for photograph, film, and art objects).
- Packaging tag files generated by the packaging application (Bagit) describe the package (optional).
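A recipient may want a quick sanity check that delivered files match the formats listed above. The following is a minimal sketch, not part of any Imaging Services tooling: the function names are hypothetical, and the check only confirms that a file begins with the 12-byte JP2 signature box and that a text file decodes as UTF-8. It does not validate compression settings, bit depth, or ICC profiles.

```python
from pathlib import Path

# First 12 bytes of a JP2 (JPEG 2000 Part 1) file: the signature box.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(path: Path) -> bool:
    """Return True if the file starts with the JP2 signature box."""
    with path.open("rb") as handle:
        return handle.read(len(JP2_SIGNATURE)) == JP2_SIGNATURE


def is_valid_utf8(path: Path) -> bool:
    """Return True if the whole file decodes as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Both functions read from disk only and leave the files unchanged, so they can be run against a delivery before any conversion work begins.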
File naming
Image files
Image file names consist of one or more components and follow one of these patterns: [UNIQUE_ID].jp2, [UNIQUE_ID]_[Sequence_number].jp2, [UNIQUE_ID]_[Volume_number]_[Sequence_number].jp2, or [UNIQUE_ID]_[Volume_number]_[Issue_number]_[Sequence_number].jp2
- [UNIQUE_ID]: Usually this will be a catalog record ID [1]. For example, a 9-digit HOLLIS ID (e.g. 011835322) is recommended for an item that has been cataloged in Harvard’s HOLLIS catalog system. In the case of manuscript collections, the library or archives repository will provide a unique identifier from HOLLIS for Archival Discovery or a local tracking system. In the case of photographs or other art objects, the library or archives repository will provide a unique identifier from JSTOR Forum or a local tracking system.
- [Volume_number]: (multi-volume sets only) volume sequence number of one or more digits, optionally zero-padded, with the prefix 'v' (e.g. v5, v05 or v0005).
- [Issue_number]: (multi-issue sets only) issue sequence number of one or more digits, optionally zero-padded, with the prefix 'n' (e.g. n2, n02 or n002).
- [Sequence_number]: 4-digit or 5-digit page sequence number (e.g. 0045, 00045). This does not apply to photograph or art object image files.
- .jp2: File format (JPEG 2000, part 1) extension.
Examples:
- Single volume file name examples: 010010723_0001.jp2, 990079903010203941_0001.jp2
- Multi-volume file name examples: 008105127_v0007_0001.jp2, 990041786900203941_v0001_0001.jp2
- Multi-volume and multi-issue file name examples: 008105127_v0007_n003_0001.jp2
- Manuscript collection file name examples: morgan_601_705_volIV_0001.jp2, sch01593c00004_0001.jp2
- Photographs, films, art objects file name examples: SS_24369503.jp2, G3884_P4A9_1864_M3.jp2, LA-GD-UF-3-03B-CT.jp2
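The naming components described above can be pulled apart with a regular expression. The sketch below is illustrative only: the `ImageName` type, the `parse_image_name` function, and the regular expression itself are assumptions inferred from the listed examples, not an official Imaging Services parser.

```python
import re
from typing import NamedTuple, Optional


class ImageName(NamedTuple):
    unique_id: str
    volume: Optional[str]    # e.g. "v0007", or None
    issue: Optional[str]     # e.g. "n003", or None
    sequence: Optional[str]  # e.g. "0001", or None


# Assumed reading of the patterns: optional v-prefixed volume, optional
# n-prefixed issue, optional 4- or 5-digit sequence, then ".jp2".
IMAGE_NAME = re.compile(
    r"^(?P<unique_id>.+?)"
    r"(?:_(?P<volume>v\d+))?"
    r"(?:_(?P<issue>n\d+))?"
    r"(?:_(?P<sequence>\d{4,5}))?"
    r"\.jp2$"
)


def parse_image_name(file_name: str) -> Optional[ImageName]:
    """Split an image file name into its components, or return None."""
    match = IMAGE_NAME.match(file_name)
    if match is None:
        return None
    return ImageName(**match.groupdict())


if __name__ == "__main__":
    for name in ("010010723_0001.jp2", "008105127_v0007_n003_0001.jp2",
                 "morgan_601_705_volIV_0001.jp2", "SS_24369503.jp2"):
        print(name, "->", parse_image_name(name))
```

One limitation of this sketch: a photograph or art object identifier that happens to end in an underscore followed by four or five digits would be read as an identifier plus a sequence number, so the material type should be known before relying on the result.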
Note:
Occasionally, a project may require a different file name pattern because of a project partner's specific needs. For example, the Black Teacher Archive Project needs to name files as [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#] (for example, bta_30786193_MA_1966_038_008.jp2).
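A project-specific pattern such as the Black Teacher Archive one can be handled the same way as the standard names. This is a rough sketch under stated assumptions: the field formats (lowercase project code, numeric OCLC number, two-letter state code, four-digit year) are inferred from the example name, the optional trailing page sequence number is inferred from the batch listing in the packaging section, and `parse_bta_name` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Field shapes are assumptions inferred from the example file names.
BTA_NAME = re.compile(
    r"^(?P<project>[a-z]+)"
    r"_(?P<oclc>\d+)"
    r"_(?P<state>[A-Z]{2})"
    r"_(?P<year>\d{4})"
    r"_(?P<volume>\d+)"
    r"_(?P<issue>\d+)"
    r"(?:_(?P<sequence>\d{4,5}))?"
    r"\.jp2$"
)


def parse_bta_name(file_name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named fields of a project-pattern file name, or None."""
    match = BTA_NAME.match(file_name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_bta_name("bta_30786193_MA_1966_038_008.jp2"))
```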
OCR plain text files
OCR plain text file names have the same base name as their corresponding image file names, with a .txt file extension.
OCR ALTO files
OCR ALTO file names have the same base name as their corresponding image file names, with a .xml file extension.
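Because the OCR files share the image file's base name, the expected companion names can be derived mechanically. A minimal sketch follows; `companion_names` is a hypothetical helper, and since OCR files are optional the names it returns will not always exist in a delivery.

```python
from pathlib import PurePosixPath
from typing import Dict


def companion_names(image_name: str) -> Dict[str, str]:
    """Derive the OCR plain text and OCR ALTO file names for an image file."""
    stem = PurePosixPath(image_name).stem  # file name without ".jp2"
    return {"ocr_text": f"{stem}.txt", "ocr_alto": f"{stem}.xml"}


if __name__ == "__main__":
    print(companion_names("007984492_0001.jp2"))
```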
MARCXML or MODS XML files
MARCXML or MODS XML file names [2] consist of two components: [HOLLIS_ID].xml
- [HOLLIS_ID]: HOLLIS system identifier (e.g. 011835322)
- .xml: File format (XML) extension
Packaging tag files
For packaging tag file names, see the section Use of the “Bagit” file-packaging and -interchange protocol.
Packaging data
Simple packaging
Usually we package the data in the following directory structure.
Monograph and manuscript materials
For monograph and manuscript materials, we group files into items and items into a batch [3]. For example, the following batch contains three items. Two items contain only JPEG2000 image files, and one item contains JPEG2000, OCR plain text, and OCR ALTO XML files.
[BATCH ID] (example: Batch02 or Box03)
|-- [UNIQUE_ID]-mets.xml (single volume METS file example: 007984492-mets.xml)
|-- [UNIQUE_ID]_[VOLUME_ID]-mets.xml (multivolume METS file example: 000652831_v0002-mets.xml)
|-- [UNIQUE_ID]-mets.xml (manuscript collection METS file example: morgan_601_705_volIV-mets.xml)
|-- [HOLLIS_ID].xml (MARCXML or MODS xml file, e.g., 000652831.xml)
|
|-- [UNIQUE_ID]/ (manuscript collection example, e.g. morgan_601_705_volIV)
|   |-- [UNIQUE_ID]_[####].jp2
|   |-- morgan_601_705_volIV_0001.jp2
|   |-- morgan_601_705_volIV_0002.jp2
|   |-- morgan_601_705_volIV_0003.jp2
|   |-- morgan_601_705_volIV_0004.jp2
|   ...
|   |-- morgan_601_705_volIV_0099.jp2
|
|-- [HOLLIS_ID]/ (single volume monograph example, e.g. 007984492)
|   |-- [HOLLIS_ID]_[####].jp2
|   |-- [HOLLIS_ID]_[####].txt
|   |-- [HOLLIS_ID]_[####].xml
|   |-- 007984492_0001.jp2
|   |-- 007984492_0001.txt
|   |-- 007984492_0001.xml
|   |-- 007984492_0002.jp2
|   ...
|   |-- 007984492_0099.jp2
|   |-- 007984492_0099.txt
|   |-- 007984492_0099.xml
|
|-- [UNIQUE_ID]_[VOLUME_ID]/ (multi-volume example, e.g. 000652831_v0002)
|   |-- [UNIQUE_ID]_[v####]_[####].jp2
|   |-- 000652831_v0002_0001.jp2
|   |-- 000652831_v0002_0002.jp2
|   |-- 000652831_v0002_0003.jp2
|   ...
|   |-- 000652831_v0002_0099.jp2
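A recipient could summarize a batch laid out as above with a short script. The sketch below is an assumption-laden illustration, not an Imaging Services tool: `inventory_batch` and `missing_companions` are hypothetical names, item directories are assumed to sit directly under the batch directory, and OCR companions are only expected for items that were delivered with OCR.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List, Sequence


def inventory_batch(batch_dir: Path) -> Dict[str, Counter]:
    """Map each item directory name to a count of its files per extension."""
    report: Dict[str, Counter] = {}
    for item_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
        report[item_dir.name] = Counter(
            f.suffix.lower() for f in item_dir.iterdir() if f.is_file()
        )
    return report


def missing_companions(item_dir: Path,
                       suffixes: Sequence[str] = (".txt", ".xml")) -> List[str]:
    """List OCR companion file names that are absent next to a .jp2 file."""
    missing: List[str] = []
    for image in sorted(item_dir.glob("*.jp2")):
        for suffix in suffixes:
            companion = image.with_suffix(suffix)
            if not companion.exists():
                missing.append(companion.name)
    return missing
```

For an image-only item, `missing_companions` will report every companion name, so it is only meaningful for items where OCR output was requested.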
Here is another example, from the Black Teacher Archive Project, showing a batch with project-specific file name patterns.
[BATCH ID] (example: GWU_02)
|-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]-mets.xml (issue level METS file, example: bta_45355957_VA_1957_038_005_mets.xml)
|-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]-mets.xml (issue level METS file, example: bta_45355957_VA_1957_038_006_mets.xml)
|-- [Project_code]_[OCLC#].xml (MARCXML or MODS xml file for the series, example: bta_45355957.xml)
|
|-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]/ (issue directory, example: bta_45355957_VA_1957_038_005)
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].jp2
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].txt
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].xml
|   |-- bta_45355957_VA_1957_038_005_0001.jp2
|   |-- bta_45355957_VA_1957_038_005_0001.txt
|   |-- bta_45355957_VA_1957_038_005_0001.xml
|   |-- bta_45355957_VA_1957_038_005_0002.jp2
|   |-- bta_45355957_VA_1957_038_005_0002.txt
|   |-- bta_45355957_VA_1957_038_005_0002.xml
|   ...
|   |-- bta_45355957_VA_1957_038_005_0432.jp2
|   |-- bta_45355957_VA_1957_038_005_0432.txt
|   |-- bta_45355957_VA_1957_038_005_0432.xml
|
|-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]/ (issue directory, example: bta_45355957_VA_1957_038_006)
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].jp2
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].txt
|   |-- [Project_code]_[OCLC#]_[State_code]_[Year]_[Volume#]_[Issue#]_[####].xml
|   |-- bta_45355957_VA_1957_038_006_0001.jp2
|   |-- bta_45355957_VA_1957_038_006_0001.txt
|   |-- bta_45355957_VA_1957_038_006_0001.xml
|   |-- bta_45355957_VA_1957_038_006_0002.jp2
|   |-- bta_45355957_VA_1957_038_006_0002.txt
|   |-- bta_45355957_VA_1957_038_006_0002.xml
|   ...
|   |-- bta_45355957_VA_1957_038_006_0030.jp2
|   |-- bta_45355957_VA_1957_038_006_0030.txt
|   |-- bta_45355957_VA_1957_038_006_0030.xml
Photograph and other art objects
For photographs and other art objects, we group files into batches. For example, the following batch contains a set of JPEG2000 files.
[BATCH ID] (example: Batch02 or Album03)
|-- ss_123458.jp2
|-- ss_458790.jp2
...
|-- ss_987692.jp2
NOTE: Each delivery may contain several batches.
Use of the “Bagit” file-packaging and -interchange protocol
...
Organization of files with "Bagit" packaging
<root directory>
|-- bag-info.txt
|-- bagit.txt
|-- manifest-md5.txt
|-- tagmanifest-md5.txt
|
|-- data
    |-- [ THE SAME DIRECTORY STRUCTURE AS THE SIMPLE PACKAGING ABOVE ]
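The manifest-md5.txt file in the layout above can be used to confirm that payload files arrived intact. The following is a minimal sketch using only the Python standard library: it assumes each manifest line holds an MD5 checksum, whitespace, and a path relative to the bag root (such as data/...), and it does not handle percent-encoded path characters or validate tagmanifest-md5.txt. The function names are hypothetical; a dedicated BagIt tool would normally be preferable for full validation.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_manifest(bag_root: Path,
                        manifest_name: str = "manifest-md5.txt") -> List[str]:
    """Return manifest paths that are missing or whose checksum differs."""
    failures: List[str] = []
    manifest = (bag_root / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, relative_path = line.split(maxsplit=1)
        target = bag_root / relative_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list from `verify_md5_manifest` means every file named in the manifest was found with a matching checksum; it does not detect extra files in the data directory that the manifest fails to list.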
Delivery media and transfer method
...
- Hard disk, flash drive
  - The repositories can borrow the media from Imaging Services or pay for them.
  - Recommended for large sets of data.
- Google shared drive
  - A Google account from the data recipient needs to be provided to Imaging Services.
- MS shared directory
  - The data recipient's email address needs to be provided to Imaging Services.
  - Suitable for small sets of data.
- Secure file transfer (https://filetransfer.harvard.edu)
  - The data recipient's email address needs to be provided to Imaging Services.
  - A data recipient outside Harvard University needs to set up a guest account.
  - Suitable for small sets of data that need encryption during file transfer.
...
[1] In cases where the record identifier includes spaces, the spaces will be replaced by underscores.
...
[2] MARCXML and MODS XML record files are only available for items that have been cataloged in Harvard’s bibliographic database, HOLLIS.
...
[3] Batch-level identifiers are assigned to groups of titles prepared and submitted together for scanning. These named “batches” will be maintained from scanning all the way through deposit to Harvard's Digital Repository Service and transfer of data to project partners beyond the Harvard libraries. Inclusion of technical metadata is optional.