...
...
Selecting an appropriate image file format & specification [2]
Source material characteristics / type | Preferred image file formats (in preference order) | Preferred image capture resolutions (in preference order)
---|---|---
Machine printed black and white text documents | |
Printed or handwritten documents with color content | |
Works of art, color photographs | |
Monochromatic, black and white photographs, or continuous tone black and white images | |

NOTE: ALL COLOR AND CONTINUOUS TONE BLACK & WHITE IMAGES should include embedded ICC display profiles (e.g., sRGB, eciRGB, AdobeRGB, sGray). [1]
Optical Character Recognition (OCR) and keyed text files
For digital objects that include page-images and searchable text, the Harvard Digital Repository Service (DRS) requires that deposits include one UTF-8 encoded plain text file for each corresponding page-image file. The text can be produced by OCR software or keyed manually.
For example, a 10-page document deposited to DRS would include 10 image files and 10 text files, each text file named identically to its image file except for the file suffix.
013814337
├── 013814337_0001.tif
├── 013814337_0001.txt
├── 013814337_0002.tif
├── 013814337_0002.txt
├── 013814337_0003.tif
├── 013814337_0003.txt
├── 013814337_0004.tif
├── 013814337_0004.txt
├── 013814337_0005.tif
├── 013814337_0005.txt
├── 013814337_0006.tif
├── 013814337_0006.txt
├── 013814337_0007.tif
├── 013814337_0007.txt
├── 013814337_0008.tif
├── 013814337_0008.txt
├── 013814337_0009.tif
├── 013814337_0009.txt
├── 013814337_0010.tif
└── 013814337_0010.txt
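The pairing rule above can be checked before deposit. The following is a minimal sketch, not a DRS tool: the function name `check_text_pairs` and the set of image extensions are assumptions made for illustration. It reports page-images that lack a same-named `.txt` file and text files that do not decode as UTF-8.

```python
from pathlib import Path

# Assumed set of page-image extensions; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jp2"}


def check_text_pairs(directory):
    """Return (missing, not_utf8) for one document directory.

    missing  -- image file names that have no matching .txt file
    not_utf8 -- .txt file names whose bytes are not valid UTF-8
    """
    directory = Path(directory)
    missing, not_utf8 = [], []
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = image.with_suffix(".txt")
        if not text.is_file():
            missing.append(image.name)
            continue
        try:
            text.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            not_utf8.append(text.name)
    return missing, not_utf8
```

Calling `check_text_pairs("013814337")` on a complete directory like the one listed above would be expected to return two empty lists.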
Naming and organizing files at time of scanning
Prior to scanning documents, one needs to decide how to organize the information so that it can be easily navigated in digital form. Documents have their own organizational structure (individual titles, volumes, issues, chapters, etc.). These meaningful structural components of the scanned documents need to be reflected in the organization of the sequentially numbered scanned page-images arranged within named directories.
...
Naming schemes: In this example we use three or four components in our directory and file naming.
Components
- [Item ID]: At Harvard, we would typically use the Hollis catalog identifier
- [Volume ID]: Volume sequence number (multi-volume sets only; three digits prefixed with "v", e.g., v001)
- [Page sequence number]: Four digits, zero-padded (e.g., 0001)
- [File extension]: File format extension (e.g., tif, jpg)
The document directory should be named with the item ID (lowercase characters, no spaces).
[002208174] ← this is a directory name: [ITEM_ID]
|
| ---- 002208174_0001.jpg
| ---- 002208174_0002.jpg
| ---- 002208174_0003.jpg
| ---- 002208174_0004.jpg
| ---- 002208174_0005.jpg
| ...
| ---- 002208174_0099.jpg
[007984492]
|
| ---- 007984492_0001.jpg
| ---- 007984492_0002.jpg
| ---- 007984492_0003.jpg
| ---- 007984492_0004.jpg
| ---- 007984492_0005.jpg
| ...
| ---- 007984492_0099.jpg
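A directory that follows the single-volume scheme above can be checked mechanically. This is a rough sketch under stated assumptions: the regular expression `PAGE_NAME` and the helper `find_naming_problems` are hypothetical names, and the rule that page sequence numbers run from 0001 without gaps is inferred from the "sequentially numbered" wording rather than stated as a formal requirement.

```python
import re

# Assumed pattern: [ITEM_ID]_[four-digit page sequence].[extension], all lowercase.
PAGE_NAME = re.compile(r"^(?P<item>[a-z0-9]+)_(?P<page>\d{4})\.(?P<ext>[a-z0-9]+)$")


def find_naming_problems(item_id, filenames):
    """Return a list of human-readable problems for one item directory."""
    problems = []
    pages = []
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match is None or match.group("item") != item_id:
            problems.append(f"unexpected name: {name}")
            continue
        pages.append(int(match.group("page")))
    if pages:
        # Report any sequence numbers skipped between 0001 and the highest page seen.
        gaps = sorted(set(range(1, max(pages) + 1)) - set(pages))
        problems.extend(f"missing page sequence number: {n:04d}" for n in gaps)
    return problems
```

An empty list means every file name matched the pattern and no sequence number was skipped.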
Multi-volume example
[000652831] ← this is the parent directory for the title: [ITEM_ID]
|
|---[000652831_v001] ← this is the directory for the volume: [ITEM_ID]_[VOLUME_ID]
|
| ---- 000652831_v001_0001.jpg
| ---- 000652831_v001_0002.jpg
| ---- 000652831_v001_0003.jpg
| ---- 000652831_v001_0004.jpg
| ---- 000652831_v001_0005.jpg
| ...
| ---- 000652831_v001_0099.jpg
|
|---[000652831_v002]
|
| ---- 000652831_v002_0001.jpg
| ---- 000652831_v002_0002.jpg
| ---- 000652831_v002_0003.jpg
| ---- 000652831_v002_0004.jpg
| ---- 000652831_v002_0005.jpg
| ...
| ---- 000652831_v002_0099.jpg
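Both layouts can be generated from the same few components. The sketch below is illustrative only: `page_image_path` is a hypothetical helper, and the "v" plus three digits volume format is taken from the example file names above.

```python
from pathlib import PurePosixPath


def page_image_path(item_id, page, extension, volume=None):
    """Build the relative path of one page-image file.

    Single-volume:  [ITEM_ID]/[ITEM_ID]_[PAGE].[ext]
    Multi-volume:   [ITEM_ID]/[ITEM_ID]_[VOLUME_ID]/[ITEM_ID]_[VOLUME_ID]_[PAGE].[ext]
    """
    if not 1 <= page <= 9999:
        raise ValueError("page sequence number must fit in four digits")
    if volume is None:
        directory = PurePosixPath(item_id)
        stem = f"{item_id}_{page:04d}"
    else:
        if not 1 <= volume <= 999:
            raise ValueError("volume sequence number must fit in three digits")
        volume_id = f"v{volume:03d}"  # e.g., v001, as in the example above
        directory = PurePosixPath(item_id) / f"{item_id}_{volume_id}"
        stem = f"{item_id}_{volume_id}_{page:04d}"
    return directory / f"{stem}.{extension}"
```

For instance, `page_image_path("000652831", 99, "jpg", volume=2)` yields `000652831/000652831_v002/000652831_v002_0099.jpg`, matching the last file in the listing above.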
...