Selecting an appropriate image file format & specification
Source material characteristics / type | Preferred image file formats (in preference order) 1 | Preferred image capture resolutions (in preference order)
---|---|---
Machine printed black and white text documents | |
Printed or handwritten documents with color content | |
Works of art, color photographs | |
Monochromatic, black and white photographs, or continuous tone black and white images | |

NOTE: All color and continuous tone black & white images should include embedded ICC display profiles (e.g., sRGB, eciRGB, AdobeRGB, sGray).2
Optical Character Recognition (OCR) and keyed text files
For digital objects that include page-images and searchable text, the Harvard Digital Repository Service (DRS) requires that deposits include one UTF-8 encoded plain text file for each page-image file. The text can be produced by OCR software or keyed manually. Optionally, an ALTO layout XML file can also be included for each image.
For example, a 5-page document deposited to DRS could include 5 sequentially named image files, plus OCR or keyed plain text files and OCR ALTO layout XML files whose names are identical to the image file names except for the file extension.
013814337
├── 013814337_0001.tif
├── 013814337_0001.txt
├── 013814337_0001.xml
├── 013814337_0002.tif
├── 013814337_0002.txt
├── 013814337_0002.xml
├── 013814337_0003.tif
├── 013814337_0003.txt
├── 013814337_0003.xml
├── 013814337_0004.tif
├── 013814337_0004.txt
├── 013814337_0004.xml
├── 013814337_0005.tif
├── 013814337_0005.txt
└── 013814337_0005.xml
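The pairing rule illustrated above can be checked before deposit. The following is a minimal sketch, not a DRS tool: the function names `missing_text_files` and `is_utf8` and the `IMAGE_EXTENSIONS` set are assumptions made for illustration only.

```python
from pathlib import PurePath

# Assumed set of page-image extensions; adjust to the formats actually deposited.
IMAGE_EXTENSIONS = {".tif", ".jpg", ".jp2"}


def missing_text_files(file_names):
    """Return page-image names that have no identically named .txt file."""
    names = set(file_names)
    missing = []
    for name in sorted(names):
        path = PurePath(name)
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            # The text file must share the image's name except for the extension.
            if path.with_suffix(".txt").name not in names:
                missing.append(name)
    return missing


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Passing the names in a document directory to `missing_text_files` lists any page-image that still needs a text file, and `is_utf8` can be applied to the raw bytes of each text file. ALTO XML files are optional, so this sketch does not check for them.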
Naming and organizing files
Before or after scanning, decide how to organize the files so that the document can be easily navigated in digital form. Documents have their own organizational structure (individual titles, volumes, issues, chapters, etc.). These meaningful structural components should be reflected in how the sequentially numbered page-images are arranged within named directories.
Example
At Harvard we might use the HOLLIS ID (the title's bibliographic catalog identifier) as the directory name for a title. It doesn't matter which item ID is used, so long as the library's bibliographer or curator has a document key that unambiguously relates the assigned item ID to a specific title and document description.
File naming restrictions
- File names must be unique.
- Each file name must be 64 characters or fewer, and the complete directory path plus file name must be 255 characters or fewer.
- Valid characters in the file name prefix are letters, digits, underscores ('_'), and hyphens ('-').
- File names should not contain spaces.
- Use a single '.' character to separate the file name prefix from the file extension. For compressed archive files (e.g., TAR), a double extension is acceptable. For example: file.tar.gz, file.tar.Z, file.tar.bz2.
- Files that share a derivative relationship (e.g., a production master .tif file and its related deliverable .jpg or .jp2 file) should share the same file name in order for Batch Builder to determine that the relationship exists (e.g. clocktower.tif and clocktower.jpg).
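The restrictions above are mechanical enough to check in code. This is a minimal sketch under stated assumptions: `check_file_name` is a hypothetical helper, not part of Batch Builder or DRS; the path length is counted as the directory path string plus the file name; and only the three archive double extensions given as examples are accepted. Uniqueness across a deposit is not checked here.

```python
import re

MAX_NAME_LENGTH = 64
MAX_PATH_LENGTH = 255
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_-]+")
# Double extensions accepted for compressed archives (assumed list, from the examples).
ARCHIVE_SUFFIXES = (".tar.gz", ".tar.Z", ".tar.bz2")


def check_file_name(file_name, directory_path=""):
    """Return a list of rule violations (empty if the name passes)."""
    problems = []
    if len(file_name) > MAX_NAME_LENGTH:
        problems.append("file name longer than 64 characters")
    if len(directory_path) + len(file_name) > MAX_PATH_LENGTH:
        problems.append("path plus file name longer than 255 characters")
    if " " in file_name:
        problems.append("file name contains a space")

    # Split off the extension, treating archive double extensions as one unit.
    for suffix in ARCHIVE_SUFFIXES:
        if file_name.endswith(suffix):
            prefix = file_name[: -len(suffix)]
            break
    else:
        prefix, dot, extension = file_name.rpartition(".")
        if not dot or not extension:
            problems.append("file name has no extension")
        if not dot:
            prefix = file_name

    # A second '.' in the prefix fails here, which enforces the single-dot rule.
    if not PREFIX_PATTERN.fullmatch(prefix):
        problems.append("prefix has characters other than letters, digits, '_' or '-'")
    return problems
```

For instance, `check_file_name("clocktower.tif")` returns an empty list, while a name containing a space or a second '.' returns one or more messages describing the problem.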
Naming schemes: In this example we use three or four components in our directory and file naming.
Components
- [Item ID]: At Harvard, we would typically use the HOLLIS catalog identifier
- [Volume ID]: Volume sequence number (multi-volume sets only, 3-digit ID)
- [Page sequence number]: Page sequence number (4-digit ID)
- [File extension]: File format extension (e.g., tif, jpg)
The document directory should be named with the item ID (lowercase characters, no spaces).
[002208174] ← this is a directory name: [ITEM_ID]
|
| ---- 002208174_0001.jpg
| ---- 002208174_0002.jpg
| ---- 002208174_0003.jpg
| ---- 002208174_0004.jpg
| ---- 002208174_0005.jpg
| ...
| ---- 002208174_0099.jpg
[007984492]
|
| ---- 007984492_0001.jpg
| ---- 007984492_0002.jpg
| ---- 007984492_0003.jpg
| ---- 007984492_0004.jpg
| ---- 007984492_0005.jpg
| ...
| ---- 007984492_0099.jpg
Multi-volume example
[ITEM_ID] ← this is the parent directory for the title.
|
| ---- [ITEM_ID]_[VOLUME_ID] ← this is the directory for the volume.
        | ---- 000652831_v001_0001.jpg
        | ---- 000652831_v001_0002.jpg
        | ---- 000652831_v001_0003.jpg
        | ---- 000652831_v001_0004.jpg
        | ---- 000652831_v001_0005.jpg
        | ...
        | ---- 000652831_v001_0099.jpg
|
|---[000652831_v002]
        |
        | ---- 000652831_v002_0001.jpg
        | ---- 000652831_v002_0002.jpg
        | ---- 000652831_v002_0003.jpg
        | ---- 000652831_v002_0004.jpg
        | ---- 000652831_v002_0005.jpg
        | ...
        | ---- 000652831_v002_0099.jpg
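The naming scheme in the examples above can be generated rather than typed by hand. The sketch below is illustrative only: `page_file_name` and `page_directory` are hypothetical helpers, and the "v" prefix with three zero-padded digits for the volume ID is inferred from the multi-volume example.

```python
def page_file_name(item_id, page, extension, volume=None):
    """Build [Item ID]_[Volume ID]_[Page sequence number].[extension]."""
    parts = [item_id]
    if volume is not None:
        parts.append(f"v{volume:03d}")  # 3-digit volume ID, multi-volume sets only
    parts.append(f"{page:04d}")  # 4-digit page sequence number
    return "_".join(parts) + "." + extension


def page_directory(item_id, volume=None):
    """Return the directory for a title, or for one volume within it."""
    if volume is None:
        return item_id
    return f"{item_id}/{item_id}_v{volume:03d}"
```

With these helpers, `page_file_name("002208174", 1, "jpg")` gives `002208174_0001.jpg`, and `page_file_name("000652831", 99, "jpg", volume=2)` gives `000652831_v002_0099.jpg`.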
Depositing files into DRS
We strongly recommend that repositories consider depositing outsourced and acquired digital content into the Harvard Digital Repository Service (DRS). You can find detailed information about DRS, including storage fees, on the LTS "DRS & Delivery Services" wiki page.
Imaging Services can deposit outsourced and acquired digital content into DRS on behalf of Harvard repositories. Please consult our "Deposit Born-digital Content" page.
- Technical Guidelines for Digitizing Cultural Heritage Materials
- Article: What is embedded color profile information?