Arranging and Describing Born-Digital Files

These local guidelines were created in November 2024 by Charlotte Lellman, Amber LaFountain, and Rebecca Thayer. They are based on the Harvard-wide guidance developed by the Born-Digital Content Description Task Force (BDCDTF), which was chartered by the Shared Descriptive Practices Working Group in September 2021. Previous Center documentation for describing born-digital records is available on SharePoint (CHoM login required).

Definitions 

For the purposes of processing and description, born-digital files are defined as manuscript or archival records that the Center accessioned in born-digital form, whether through a network transfer, a web crawl, or on a physical carrier such as a floppy disk or flash drive. 

Records that came to the Center in physical form (papers, objects, notebooks, photographs, etc.) but were subsequently digitized or digitally photographed are not considered born-digital. These are considered digital surrogates.  

“Born-digital files” is Harvard’s preferred umbrella term. Note that some of the examples illustrating arrangement scenarios describe born-digital material using terms that are no longer recommended.  

Arranging Born-Digital Records  

Just as it is with traditional paper-based collections, original order is an important principle of archival arrangement for entirely born-digital or hybrid collections. Often, but not always, maintaining this order is helpful for users to understand the function and context of the records.  

There are three frequently used options for the intellectual arrangement of born-digital records in hybrid collections:  

  1. Create a separate format-based series for born-digital records. A separate series for born-digital records is appropriate when the creator has maintained their born-digital records in a way that is intellectually separate from the paper-based records. This format-based series approach may also be appropriate when the digital records’ volume or disorganization would make intellectual integration prohibitively labor-intensive or too difficult to do with certainty.  

  2. Create two separate groupings for born-digital records and their paper-based counterparts at the same level of hierarchy. This is appropriate when the paper-based and born-digital records are arranged in such a way that they have a sibling relationship, where the content is mirrored.  

  3. Integrate born-digital records at the file (folder) level. This is appropriate when the digital records and paper-based records are already integrated at the archival file level (for example, a floppy disk in a folder) OR there is a direct one-to-one relationship between born-digital and paper-based records (for example, twelve audio files with twelve corresponding printed transcripts).  

These options reflect the three most common approaches collection creators take when maintaining their digital files in relation to their paper-based records.  

Arrangement types are not mutually exclusive; perhaps some media carriers or digital files are integrated with the paper-based records, but not others. Other times, no meaningful original order is present, or the original order presents challenges for use. In these cases, the processor should use the “Considerations for Arrangement” below to determine which of the three arrangement schemes makes sense for the collection (or parts of the collection) and include the rationale for arrangement in the processing plan.  

Options and examples for arranging records for entirely born-digital collections are currently a work in progress; however, they include maintaining the creator’s directory (or folder) structure and arranging by record function. 

As with all collections, the processor should consult with the Collections Services Archivist prior to starting any re-arrangement. 

Considerations for Arrangement  

When considering an arrangement structure, please keep the following in mind. 

  1. Intellectual order of materials as they arrive. 

  • How are the born-digital files arranged?  

  • For large groups of born-digital files (such as from a network transfer, hard drive, or other high-capacity external media carrier), is there an existing directory structure or sub-file structure? 

  • How are they arranged relative to the paper-based records?  

  2. Staff resources 

  • Is the arrangement scheme reasonable from a time-investment perspective? (Intellectual integration at the file level is very intensive!)  

  • Do our technical resources permit arranging the digital files in the preferred manner? 

  • Keep in mind that physical and intellectual arrangement do not have to be the same!  

  3. User discovery and access  

  • How are researchers most likely going to search the collection for the born-digital files?  

  • If a body of born-digital files (e.g., email) will need to be accessed as a unit (typically on site), it may make sense to describe it as a unit rather than parsing it into series. 

  4. Accruals 

  • Are accruals anticipated? For practical reasons, post-processing accruals may be described as a separate series, which may have the effect of creating a separate series for born-digital files.  

Arrangement Scenarios 

  1. Separate series for all born-digital files 

A separate series for born-digital records is appropriate when: 

a. the creator has maintained their born-digital records in a way that is intellectually separate from the paper-based records. Imagine, for example, a collection consisting of a hard drive with the creator’s research data (Series I) and a box of paper-based records consisting of teaching records (Series II). In this case, the two series represent different types of records (function, topic) as well as different formats. 

b. the digital records’ volume or disorganization would make intellectual integration prohibitively labor-intensive or too difficult to do with certainty. For media on external carriers, “disorganized” could mean that the carriers are not grouped by topic or function or that the media is poorly labeled. For digital files transferred directly to the network, “disorganized” could mean that the creator’s file structure is unclear or nonexistent.  

Examples:  

  • M. Judah Folkman papers, Series IX. Electronic Records, 1990-2007. Subseries A. Compact Discs, Subseries B. Floppy Disks, Subseries C. Zip Disks, Subseries D. USB Flash Drives.  

  • Sven Paulin papers, Series IX. Digital Files, 1990-2014 

    • “This series includes only those digital files that were not originally maintained with paper records, and were not easily categorized into the other series.” 

  2. Subseries or grouping of born-digital records within each series  

A subseries or grouping of born-digital records within each series is appropriate for collections where the original order of the born-digital records is parallel to the original order of the paper-based records. A creator may have maintained all their records on a particular subject in a unified way intellectually but not physically. For born-digital media on external carriers, this would mean separate housing; for born-digital records transferred directly to the network, this would mean an organizational structure for the digital files that paralleled the organizational structure for the paper-based records. Imagine a collection with three boxes of records from a creator’s teaching appointment in Mexico (Series I. Teaching records, Subseries A. Mexico teaching records) and a box of thumb drives with records from teaching in Mexico (Series I. Teaching records, Subseries B. Born-digital Mexico teaching records).  

Examples: 

  • Fredrick J. Stare papers, Series II. Correspondence Files, Subseries E. Digital Letters 

  • Marie C. McCormick papers. Series I. Research Records. Subseries D. Infant Health and Development Program Electronic Records.  

    • Subseries D is arranged in four subseries: 1. Administrative, Research, and Publishing Records, 2000-2015; 2. Correspondence, 2009-2016; 3. Research Data, 2003; and 4. IRB Consent Form Drafts, 2003. These subseries are the product of grouping the use copies to better reflect their functional contents. Each image’s files were maintained together, in part to reflect the already similar contents within each image, and in part to avoid artificially creating record groupings. Subseries labels were added to the use copies from each image or group of images, as a means to better identify and distinguish the contents of each image’s files, for easier discovery. Due to similarity of contents, Subseries B includes the files extracted from two images. All other subseries map to the contents of only one image. More information on this arrangement and subseries to image mappings may be found in the collection’s electronic records documentation folder on the network drive. 

  3. File-level Integration 

An arrangement with integration at the archival file level (also referred to as folder level) is often appropriate when the digital records and paper-based records are integrated at the archival file level in their original order. Creators often maintain electronic media carriers in the same physical folders as corresponding or related paper-based records. For example, a physical file titled “South Korean Cardiology Conference 2004” might contain programs and abstracts from the conference as well as related floppy disks storing presentation slides.  

In other cases, digital files are physically separated from paper-based records (for example, floppy disks stored in a separate box), but intellectually integrating them would make more sense for access and use.  

Lastly, in rare cases, a small collection may consist of individual born-digital files (for example, PDF documents or .wav audio files) that correspond directly to paper-based records at the digital file level. In these cases, it is appropriate to do a more granular item-level intellectual arrangement, with each paper-based record arranged intellectually with its digital-format counterpart.  

Examples:  

Describing Born-Digital Records

Listing 

  • When paper-based records and born-digital files on carrier media (such as floppy disks or CDs) are found together in the same folder, the folder’s mixed-format contents constitute one intellectual unit. For description purposes, this means that the folder translates to one “file” level component in EAD and should be listed as one line on the folder list. (See: Listing)

  • All media should be assigned a unique e-media number. When listing, note e-media numbers in the appropriate column. If multiple pieces of e-media are present in one folder, list each e-media number individually rather than giving a range. (See: E-media: Logging & Imaging)

Considerations for Description  

  1. Staff time 

Aim for aggregate description that balances staff time and researcher discovery and access.

  2. Technical limitations  

Technical limitations may interfere with the ability to process and describe born-digital records. We do not have the hardware to image all media types; our FTK software is not able to open every type of digital file; and some file formats are unsupported. If technical considerations will delay processing of digital materials, move forward with analog processing and describe born-digital files later as series or subseries. Born-digital media that cannot be processed should still be logged, listed, and included in the finding aid.

  3. Ability to provide a file manifest 

In some cases, it may be safe or low-risk to proactively make file manifests available. When a file manifest is made available, extensive description of born-digital files is redundant. In consultation with the Collections Services Archivist, carefully evaluate whether or not it is appropriate to publish a file manifest. (Note that Public Services staff can make file manifests available to users upon request). 

For example, digital files in the Marie C. McCormick papers are described at a sub-subseries level, but no file list (file manifest) is provided. Robust scope and content notes (such as this one for Series I., Subseries D., Sub-subseries 1) describe the record types and formats: 

“Consists of research administrative and regulatory records, research data, and publishing records, generated and compiled by Marie C. McCormick during her tenure as Principal Investigator for Phase IV of the Infant Health and Development Program. Research administrative records frequently consist of: meeting minutes and agendas for site directors’ meetings; IRB and protocol application records (consent forms, safety plans, confidentiality certificates, and related correspondence); financial records (budgets and invoices); grant funding records and correspondence; reports and project descriptions; and administrative correspondence and memoranda. Administrative records also include: personnel management and training records; staff and funder contact lists; data sharing policy records; participant recruitment records; and project calendars and timelines. Research regulatory records include: blank and annotated survey instruments and interview schedules; codebooks; variable lists and descriptions; and procedures manuals, protocols, and methodologies. Research data consists of: summarized and analyzed data tables and graphs; database inventories; and coded and analyzed databases and datasets. Publishing records include manuscript drafts, abstracts, bibliographies, and publishing correspondence related to IHDP research findings, frequently from previous phases of the program. Subseries also includes public speaking presentations, posters, and collected publications. 
Frequent topics in this subseries include: behavior; academic performance and engagement; physical health; obesity; puberty; sexual activity; family and peer relationships; family conflict; substance use and abuse; suicidal thoughts; safety, danger, and self-protective behavior; criminal activity; household composition and environment; parents’ marital status; family socioeconomic details; maternal parenting philosophy; maternal views of children’s behavior; mother’s friend and family relationships; maternal involvement in child’s education and social activities; and numerous other topics. Frequent file formats include: Microsoft Word documents (.doc and .docx); SAS, SPSS, and Microsoft Access dataset files (.mdb, .sas, .sas7bdat, .sav, and .sd7); text documents (.txt); comma separated value files (.csv); Adobe portable document format files (.pdf); Microsoft PowerPoint presentations (.ppt); web pages (.htm); images (.jpg); rich text format document files (.rtf); Microsoft Project files (.mpp); and Microsoft Visio diagram files (.vsd). More IHDP records may be found in Series IB and IC. More publishing records may be found in Series III.”

  4. Existing digital arrangement  

Consider listing the top-level digital folder structure as subseries with their own minimal description. This is most likely to be relevant for large units of digital records, such as hard drives. (The memory on a floppy disk, on the other hand, is so small as to make multilevel file structure irrelevant).

Note Fields  

The following fields must be included in descriptions of born-digital files. Some fields may only be required at the collection level, while others may only be required at, say, the file level. For guidance, please see Harvard’s Joint Processing Guidelines: Born-Digital Description.  

Title  

A title field is required at every level of description. For hybrid collections, use a generic term (e.g., papers, records) at the collection level. When the material being described is entirely born-digital, whether at the collection, series, or file level, include a more specific term (e.g., digital records, digital files).  

Examples: 

  • All born-digital collection:

    • Safe Space Radio digital records

  • Hybrid collection:  

    • M. Judah Folkman papers   

      • Series IX: Digital Files 

        • A. Compact Discs 

        • B. Floppy Disks 

        • C. Zip Disks 

        • D. USB Flash Drives 

Dates  

Dates should be inclusive of born-digital files at all levels of description. For born-digital files, modified dates may be relevant in addition to creation dates.  
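As an illustration only, inclusive dates for a group of born-digital files could be derived along the following lines. This Python sketch is not a Center tool: the function name inclusive_dates and its input format are hypothetical. It assumes the convention described in the standard Processing Information note later in these guidelines, where each file is dated by its creation or modification date, whichever is earlier. File system dates may not reflect actual creation, so treat the result as a starting point for review.

```python
from datetime import date


def inclusive_dates(file_dates):
    """Return an inclusive date statement such as '1990-2007'.

    file_dates is a list of (created, modified) date pairs, one per file.
    Hypothetical helper; each file is dated by the earlier of its two dates.
    """
    years = [min(created, modified).year for created, modified in file_dates]
    if not years:
        raise ValueError("at least one file is required")
    first, last = min(years), max(years)
    # Use a single year when all files fall in the same year.
    return str(first) if first == last else f"{first}-{last}"


example = [
    (date(1990, 3, 1), date(1992, 6, 15)),
    (date(2007, 1, 9), date(2007, 1, 9)),
]
print(inclusive_dates(example))
```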

Creator  

The collection creator’s name should be used at the collection level. The Joint Processing Guidelines indicate that this field should be used for “[t]he person, family, or corporate body who created or collected the materials, not the software or hardware used to create them.”

Extent  

In hybrid collections, use two extent statements: one for the analog extent and one for the digital extent.

  1. Cubic feet of physical collections, including storage media. (See: Calculating Extent)  

  2. Total size of digital content, expressed as kilobytes, megabytes, gigabytes, or terabytes (spell out the term instead of abbreviating as, e.g., GB). Use kilobytes for amounts less than 1 megabyte, use megabytes for amounts less than 1 gigabyte, etc. Round to two decimal places. (See: Calculating Extent)

Example:

  • Mary Ellen Wohl papers

    • 7.26 cubic feet (5 records center cartons, 1 half letter size document box, 1 oversized box)  

    • 0.2 Gigabytes (26 digital files in 3 digital folders)

Use born-digital files as a catch-all term; use digital video files and digital audio files as specific terms if relevant. 

At the series/subseries level, calculate the digital extent in gigabytes of the digital files that are part of that series.  

Extent: Container Summary  

  • Use a container summary anywhere you use an extent statement for born-digital records, following the format: # Gigabytes (# born-digital files in # digital folders).  

Example:

  • 276 Gigabytes (13,615 born-digital files in 2,106 digital folders)
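The unit rules under Extent and the container summary format above can be sketched in code. The following Python is an illustrative sketch, not a Center tool. The function name digital_extent is hypothetical, and the sketch assumes decimal units (1 kilobyte = 1,000 bytes), which these guidelines do not specify.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: decimal units (1 kilobyte = 1,000 bytes). The guidelines do not
# say whether to use 1,000 or 1,024; change BASE to match local practice.
BASE = 1000
UNITS = ["Kilobytes", "Megabytes", "Gigabytes", "Terabytes"]


def digital_extent(total_bytes, file_count, folder_count):
    """Build an extent statement with container summary (hypothetical helper)."""
    value = Decimal(total_bytes) / BASE  # start in kilobytes
    unit = UNITS[0]
    # Move up one unit at a time until the value is below 1,000 of that unit.
    for next_unit in UNITS[1:]:
        if value < BASE:
            break
        value /= BASE
        unit = next_unit
    # Round to two decimal places, then drop trailing zeros (276.00 -> 276).
    rounded = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    number = f"{rounded.normalize():f}"
    return (
        f"{number} {unit} "
        f"({file_count:,} born-digital files in {folder_count:,} digital folders)"
    )


print(digital_extent(276_000_000_000, 13615, 2106))
```

Under the decimal-unit assumption, a total of 276,000,000,000 bytes yields the same statement as the example above. Spelled-out units and the two-decimal rounding follow the Extent section. Singular wording (for a single file or folder) is not handled and would need adjusting by hand.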

Language  

The language field should be used for the written/spoken language of records, not computer/encoding language. It is required at the collection level.  

Scope and Content  

Include scope and content notes as needed, including born-digital genre terms, carrier media formats, and file formats. For born-digital series/subseries, cross reference related materials in an analog series/subseries. (See: Scope and Contents)

General Note 

Use this field to provide unique identifiers for e-media carriers (floppy disks, CDs, DVDs, etc.). Log e-media in the ElectronicMediaTrackingDB (N:\Collections\07_Collections_Databases_and_Lists\Electronic Records and Digital Collections). When creating the folder list template, note each e-media number individually. Do not use a number range. The e-media number column in the spreadsheet translates to the General Note field. (See: E-media: Logging & Imaging)

Conditions Governing Access  

At the collection level, the Conditions Governing Access indicates the presence of restricted material in the records described, regardless of the records’ format or manner of access (i.e., physical papers accessed in the reading room, born-digital files accessed on a networked computer onsite, or born-digital files accessed remotely). Restriction periods are used for student records, Harvard University records, patient records, and others. (See: Restrictions)  

For information on how researchers will physically access born-digital files, use the Physical Characteristics and Technical Requirements field. 

Use and modify the template language below, as appropriate:  

Collection is open for research. Access requires advance notice. 

Access to [types of records] is restricted for 50 years from the date of creation. These restrictions are noted where they appear in Series [X, Y, Z]. Access to personal, student, and patient information is restricted for 80 years from the date of creation. These restrictions appear in Series [A, B, C]. Researchers may apply for access to restricted records. Consult Public Services for further information.

Note restriction periods in the folder list template. 

On the N drive: For any restricted electronic media, calculate the year the restriction period expires by adding 50 or 80 (years) to the latest date of the born-digital files from that piece of media, then append that year to the e-media number. For example: 8346_Restricted_2077.  
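The arithmetic behind this naming convention can be sketched as follows. This is an illustrative Python sketch only; the function name restricted_label is hypothetical, and the year 1997 in the usage line is an assumed latest file date chosen to reproduce the example label.

```python
def restricted_label(emedia_number, latest_file_year, restriction_years):
    """Return a label such as '8346_Restricted_2077' (hypothetical helper).

    latest_file_year is the year of the latest born-digital file on the media;
    restriction_years is 50 or 80, per Conditions Governing Access.
    """
    if restriction_years not in (50, 80):
        raise ValueError("restriction period must be 50 or 80 years")
    expiration_year = latest_file_year + restriction_years
    return f"{emedia_number}_Restricted_{expiration_year}"


# Assumed inputs: e-media 8346, latest file dated 1997, 80-year restriction.
print(restricted_label(8346, 1997, 80))
```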

Physical Description Note  

If appropriate at any level of description, consider using the Physical Description Note field to provide more detail about file formats or media carriers.  

Abstract  

For a hybrid collection with a significant number of born-digital files, mention the born-digital files’ content and formats in the abstract.  

Arrangement  

Optional: For hybrid collections, describe the integration method for born-digital files. 

Subjects  

Include the Getty AAT terms, “Electronic records (digital records)” and/or “Web archives,” in the collection-level list of controlled access headings. Include more specific terms if needed. 

Physical Characteristics and Technical Requirements  

Use at appropriate levels of description.  

Give a more specific delineation of file formats, file format types, and/or media carriers if necessary.  

Include information about proprietary file formats, hardware/system requirements, software access needs, and any limitations on rendering. 

Note impact on delivery, e.g., if the item requires special equipment or software, or if original can’t be viewed but surrogate is available. 

Describe any special conditions or methods for accessing digital files (downloading, streaming, on-site only, etc.) that are dictated by the content of the files or by donor agreements. 

 Use and modify the language below, as appropriate:  

Access to born-digital files in this collection (as found in Series X, Y, and Z) is also subject to the restrictions described in the Conditions Governing Access note. Technical access to digital files is premised on the availability of a computer station, requisite software, and/or the ability of Public Services staff to review and/or print out records of interest in advance of an on-site visit. 

Conditions Governing Use  

Use the Center’s standard Conditions Governing Use language at the collection level:  

The Harvard Medical Library does not hold copyright on all materials in the collection. Researchers are responsible for identifying and contacting any third-party copyright holders for permission to reproduce or publish. For more information on the Center's use, publication, and reproduction policies, view our Reproductions and Use Policy.

If applicable, incorporate any separate conditions governing the (re)use of born-digital files into this note. 

Processing Information  

Describe any action taken on born-digital files or their carrier media, with the level of specificity that will be useful to the researcher. This could include modifying file names or file paths, reformatting, changing or retaining the file structure, method of capture/transfer, creation of checksums, deletion of files or folders, redaction or screening for PII, and virus scanning.  

If born-digital files in a hybrid collection were processed earlier/later or by a different person than the analog material, include parallel notes for each portion.  

If there is substantial processing information related specifically to born-digital files, consider adding a separate “Processing Information – Digital Processing” note. 

Use the following standard note for collections with born-digital files. Modify as appropriate to reflect the specific transfer method(s) and tasks applied:   

All born-digital files (as found in Series [X, Y, and Z]) were imaged using AccessData’s FTK and a Forensic Recovery of Evidence Device. Digital files were then transferred to secure network storage. Using FTK, records were screened for explicit and encrypted files, and use copies were extracted. Files that could be opened were sampled for content; however, researchers should be aware that not every file in the collection could be opened and assessed. Files for which specific software was needed, but not available to staff at the time of processing, were not reviewed. Electronic media that could not be imaged were retained and are noted in a local inventory, and any media determined to be blank were discarded.* Researchers should be aware that most dates of digital files were determined based on the file creation or modification dates (whichever is earlier); however, these dates may not always accurately reflect the actual creation or modification dates. 

If duplicates and blanks were discarded at the point of acquisition/accessioning, record this information in the Appraisal Information note instead of in the Processing Information note.  

Use the following file level processing note if media is unable to be imaged due to technical considerations (degradation, lack of software/hardware): 

Archivist was unable to image [a 5.25 inch floppy disk, 3 compact discs, etc.] (e-media numbers [x, y, z]) in this file. The [physical media item] is maintained in the folder.  

Copyright © 2024 The President and Fellows of Harvard College