Building DRS Objects and Batches from ProQuest ETD files


Background

Most Harvard schools use the ProQuest ETD Administration tool to manage Electronic Thesis and Dissertation submission. After a school accepts and approves an ETD, the actual files created by the student need to be deposited into the DRS for preservation.

DRS Content Models and Packages from ProQuest

DRS uses a 'Content Model' approach to provide the best preservation support for a limited set of file types. Any type of file, however, can be deposited into DRS using a specific Content Model called an Opaque Object. Files in an Opaque Object are given bit-level preservation, but they have limited delivery options. The DRS Content Guide explains which file formats are appropriate for each supported Content Model.

ETDs from ProQuest can have a variety of file types in each submission package. For a given ETD submission package, each file with a currently supported DRS Content Model will be deposited in a separate DRS Object and assigned a 'Role'. All files in a submission package with formats that are not currently supported by a DRS Content Model will be put into Opaque Objects.  The example in the table below illustrates how file types from a submission would be assigned to a Content Model.


Examples:

| File type | Content Model |
|---|---|
| PDF | Document |
| text/xml | Text |
| JP2, JPEG | Still Image |
| mp3 | Audio |
| MS Word Document | Document |
| CAD, PowerPoint file | Opaque |


For example, a ProQuest ETD could contain the following files, each of which would be deposited under the corresponding Content Model:

  • Thesis in PDF format => Document
  • mets.xml file => Text
  • License files in PDF format => Document
  • Supplementary files, such as thesis appendices, data sets, videos, etc., in a variety of formats => Opaque
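
The mapping above can be sketched as a simple lookup with Opaque as the fallback. This is only an illustration: the table name `CONTENT_MODEL_BY_MIME`, the function `content_model_for`, and the specific MIME type strings are assumptions, and the DRS Content Guide remains the authority on which formats each Content Model supports.

```python
# Hypothetical lookup table. The MIME type strings are assumptions chosen to
# match the file types in the table above; consult the DRS Content Guide for
# the authoritative list of supported formats.
CONTENT_MODEL_BY_MIME = {
    "application/pdf": "Document",
    "application/msword": "Document",
    "text/xml": "Text",
    "image/jp2": "Still Image",
    "image/jpeg": "Still Image",
    "audio/mpeg": "Audio",
}


def content_model_for(mime_type: str) -> str:
    """Return the Content Model for a MIME type, falling back to Opaque."""
    return CONTENT_MODEL_BY_MIME.get(mime_type.strip().lower(), "Opaque")


print(content_model_for("application/pdf"))
print(content_model_for("application/vnd.ms-powerpoint"))
```

Any format that is missing from the table, such as the PowerPoint MIME type in the second call, falls through to Opaque.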

DRS Roles and Relationships

DRS objects and files can be assigned 'Roles' to help categorize material as well as facilitate relationships between DRS objects. The THESIS and THESIS_SUPPLEMENT Roles are used specifically for ETDs. The Adding Relationships section of the BatchBuilder Guide has information on DRS relationships. 

File to DRS Role mapping

| File from ETD | DRS Object Role | Relationship from THESIS object |
|---|---|---|
| Thesis PDF | THESIS | - |
| Supplementary files | THESIS_SUPPLEMENT | HAS_SUPPLEMENT |
| License files | LICENSE | HAS_LICENSE (In Rights section of Thesis Object) |
| mets.xml | DOCUMENTATION | HAS_DOCUMENTATION |

DRS Batches, Objects, Content Models and Relationships between Objects in BatchBuilder

BatchBuilder is the main tool used to organize content for deposit into DRS. A 'batch' is the group of files and directories that is sent to DRS for deposit. Each batch has at least one 'object' of some content model along with a descriptor file, descriptor.xml, that has technical, administrative, and preservation metadata for each object and any of its files. In order to be deposited, each batch also needs an XML file, batch.xml, that has information about all the objects in the batch. A batch can include objects of different content models. BatchBuilder recreates the descriptor.xml and batch.xml files every time it processes a batch in order to capture any edits made to the objects since the last batch generation.

It is possible to define relationships between objects, as described above, even if the objects are being deposited in the same batch.  The Object Owner Supplied Name (OSN) is used to identify the object that is related.

NOTE: To create relationships between objects in the same batch, the object that is the SOURCE of the relationship (e.g. the object that declares HAS_SUPPLEMENT) has to be deposited after the object that is the TARGET of the relationship (e.g. the object with the THESIS_SUPPLEMENT role). When using the CLI version of BatchBuilder, naming the objects so that the SOURCE object always sorts alphabetically after the TARGET produces a batch that will have the correct relationships when deposited into DRS.
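
A quick way to check the naming rule in the NOTE is to sort the Object OSNs and confirm that the SOURCE object (the thesis) comes last. The sketch below uses OSNs of the form given in the examples on this page and assumes that a plain string sort matches the alphabetical order BatchBuilder uses, which this page does not spell out.

```python
# Object OSNs for one ETD, in arbitrary order.
osns = [
    "ETD_THESIS_gsas_2022_PQ_12345678",
    "ETD_SUPPLEMENT_1_gsas_2022_PQ_12345678",
    "ETD_LICENSE_1_gsas_2022_PQ_12345678",
    "ETD_DOCUMENTATION_1_gsas_2022_PQ_12345678",
]

# Assumption: BatchBuilder's alphabetical order behaves like a plain string sort.
deposit_order = sorted(osns)

# The THESIS object is the SOURCE of every relationship, so it must sort last.
thesis_is_last = deposit_order[-1].startswith("ETD_THESIS_")
print(deposit_order)
print(thesis_is_last)
```

With this naming scheme DOCUMENTATION, LICENSE, and SUPPLEMENT all sort before THESIS, so every TARGET is deposited before the object that points to it.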

Information needed for each object

  • Object Owner Supplied Name (OSN) - Unique within the DRS Owner Code
  • File Owner Supplied Name (OSN) for each file in the object - doesn't need to be unique in DRS Owner Code or anywhere else in DRS
  • Role for each object - especially important if an object is a TARGET of a relationship
    • e.g. in order to be a TARGET of 'HAS_SUPPLEMENT' the object has to have the role 'THESIS_SUPPLEMENT'
  • Relationships to other objects in the batch or already deposited in DRS


Variable definitions for use in making Object OSN, File OSN, and Harvard Metadata Link values:

SCHOOL_CODE - dropbox: gsas, dce, college, 

DEGREE_DATE_VALUE - thesis.degree qualifier="date"

PROQUEST_IDENTIFIER_VALUE - dc.identifier qualifier="other"

OBJECT_ROLE - THESIS, SUPPLEMENT_[1,2,3...], LICENSE_[1,2,3...], or DOCUMENTATION_[1,2,3...] (outlined below)


Rule for creating Object and File OSNs using variable definitions from the Alma MARCXML Template and fileSec information

Object OSN: ETD_[OBJECT_ROLE]_[SCHOOL_CODE]_[DEGREE_DATE_VALUE]_PQ_[PROQUEST_IDENTIFIER_VALUE]

The File OSN will be the same as the Object OSN with an additional 'sequence' number appended. Only Opaque Objects will need File OSNs with sequence numbers higher than 1, since only Opaque Objects can have more than one file.

File OSN: [OBJECT_OSN]_[sequence_number]

Example OSNs for an ETD from GSAS in 2022:

  • ETD_THESIS_gsas_2022_PQ_12345678
  • ETD_SUPPLEMENT_1_gsas_2022_PQ_12345678
  • ETD_SUPPLEMENT_2_gsas_2022_PQ_12345678
  • ETD_DOCUMENTATION_1_gsas_2022_PQ_12345678
  • ETD_LICENSE_1_gsas_2022_PQ_12345678
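
The OSN rules can be expressed as two small helpers. This is a minimal sketch: the function names `object_osn` and `file_osn` and their parameters are illustrative and are not part of BatchBuilder.

```python
from typing import Optional


def object_osn(role: str, school_code: str, degree_date: str,
               proquest_id: str, index: Optional[int] = None) -> str:
    """Build an Object OSN; index is the counter used for non-thesis roles."""
    role_part = role if index is None else f"{role}_{index}"
    return f"ETD_{role_part}_{school_code}_{degree_date}_PQ_{proquest_id}"


def file_osn(obj_osn: str, sequence: int = 1) -> str:
    """Build a File OSN by appending the sequence number to the Object OSN."""
    return f"{obj_osn}_{sequence}"


thesis = object_osn("THESIS", "gsas", "2022", "12345678")
supplement = object_osn("SUPPLEMENT", "gsas", "2022", "12345678", index=2)
print(thesis)
print(supplement)
print(file_osn(thesis))
```

Only Opaque Objects would call `file_osn` with a sequence greater than 1.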

Rule for creating Harvard Metadata Link value for each DRS Object (Thesis and all supplementary objects)

Harvard Metadata link Type=Local: value=PQ-[PROQUEST_IDENTIFIER_VALUE]

Example: PQ-12345678

Harvard Metadata link Type=Alma: value=[Alma MMSID]

Harvard Metadata link Type=DASH: value=[DASH ID]
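
As an illustration, the three link types could be assembled as follows. The function name `metadata_links`, the dictionary layout, and the sample Alma MMSID are assumptions made for this sketch and do not reflect a BatchBuilder data format. The DASH link is treated as optional because a DASH ID is not always available.

```python
from typing import Dict, List, Optional


def metadata_links(proquest_id: str, alma_mmsid: str,
                   dash_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Return the Harvard Metadata Link entries for one DRS object."""
    links = [
        {"type": "Local", "value": f"PQ-{proquest_id}"},
        {"type": "Alma", "value": alma_mmsid},
    ]
    if dash_id:
        # Only added when the ETD actually has a DASH ID.
        links.append({"type": "DASH", "value": dash_id})
    return links


# "99000000000000" is a placeholder, not a real Alma MMSID.
links = metadata_links("12345678", "99000000000000")
print(links)
```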


More rules for OSNs and filenames are in the BatchBuilder User Guide

Gathering information to create ETD Objects

In order to create objects for the files in an ETD submission, we must get the following values for each file:

  • Filename
  • Mime-type
  • USE type
  • ADMID

The information for all the files comes from the 'fileSec' part of the mets.xml file, except for the mets.xml file itself.

Example of files in an ETD Submission directory


FileSec of mets.xml
  <fileSec>
    <fileGrp ID="etdadmin-mets-fgrp-1" USE="CONTENT">
      <file GROUPID="etdadmin-mets-file-group" ID="etdadmin-mets-file-2132021" MIMETYPE="application/pdf" ADMID="amd_primary" SEQ="1">
        <FLocat LOCTYPE="URL" xlink:href="thesis_pdfa_allisonhyatt.pdf"/>
      </file>
      <file GROUPID="etdadmin-mets-file-group" ID="etdadmin-mets-file-2132069" MIMETYPE="application/pdf" ADMID="amd_supplemental_1" SEQ="1">
        <FLocat LOCTYPE="URL" xlink:href="appendices_pdfa_allisonhyatt.pdf"/>
      </file>
    </fileGrp>
    <fileGrp ID="etdadmin-mets-fgrp-2" USE="LICENSE">
      <file GROUPID="etdadmin-mets-file-group" ID="etdadmin-mets-file-2046147" MIMETYPE="application/pdf" ADMID="amd_license_2046147">
        <FLocat LOCTYPE="URL" xlink:href="setup_2E592954-F85C-11EA-ABB1-E61AE629DA94.pdf"/>
      </file>
    </fileGrp>
  </fileSec>
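
One possible way to pull the four values out of the fileSec is with Python's standard `xml.etree.ElementTree`. The sketch assumes the full mets.xml declares the standard METS and XLink namespaces (the fragment above omits the root element, so this is not confirmed by the example), and the function name `read_filesec` is illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces; adjust if the real mets.xml declares different ones.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp ID="etdadmin-mets-fgrp-1" USE="CONTENT">
      <file ID="etdadmin-mets-file-2132021" MIMETYPE="application/pdf" ADMID="amd_primary" SEQ="1">
        <FLocat LOCTYPE="URL" xlink:href="thesis_pdfa_allisonhyatt.pdf"/>
      </file>
    </fileGrp>
    <fileGrp ID="etdadmin-mets-fgrp-2" USE="LICENSE">
      <file ID="etdadmin-mets-file-2046147" MIMETYPE="application/pdf" ADMID="amd_license_2046147">
        <FLocat LOCTYPE="URL" xlink:href="setup_2E592954-F85C-11EA-ABB1-E61AE629DA94.pdf"/>
      </file>
    </fileGrp>
  </fileSec>
</mets>"""


def read_filesec(mets_xml: str) -> list:
    """Return filename, MIME type, USE type, and ADMID for each listed file."""
    root = ET.fromstring(mets_xml)
    rows = []
    for grp in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        for file_el in grp.iterfind("mets:file", NS):
            flocat = file_el.find("mets:FLocat", NS)
            if flocat is None:
                continue  # no location recorded for this file entry
            rows.append({
                "filename": flocat.get(f"{{{NS['xlink']}}}href"),
                "mime_type": file_el.get("MIMETYPE"),
                "use": grp.get("USE"),       # USE lives on the fileGrp
                "admid": file_el.get("ADMID"),
            })
    return rows


rows = read_filesec(SAMPLE)
for row in rows:
    print(row)
```

The mets.xml file itself is not listed in the fileSec, so it has to be added to the file list separately.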


Rule for Object ROLE assignments

| USE Type | ADMID | Object ROLE |
|---|---|---|
| CONTENT | amd_primary | THESIS |
| CONTENT | amd_supplemental_[\d] | THESIS_SUPPLEMENT |
| LICENSE | amd_license_[\d+] | LICENSE |
| N/A | mets.xml | DOCUMENTATION |
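
The role rule can be sketched as a small function. The name `object_role` is illustrative, and the patterns assume that the numeric suffix on both the supplemental and license ADMID values may be one or more digits.

```python
import re
from typing import Optional


def object_role(filename: str, use_type: Optional[str] = None,
                admid: Optional[str] = None) -> str:
    """Map a file's USE type and ADMID to its DRS Object ROLE."""
    if filename == "mets.xml":
        # mets.xml is not listed in its own fileSec, so it has no USE or ADMID.
        return "DOCUMENTATION"
    admid = admid or ""
    if use_type == "CONTENT" and admid == "amd_primary":
        return "THESIS"
    if use_type == "CONTENT" and re.fullmatch(r"amd_supplemental_\d+", admid):
        return "THESIS_SUPPLEMENT"
    if use_type == "LICENSE" and re.fullmatch(r"amd_license_\d+", admid):
        return "LICENSE"
    raise ValueError(f"No role rule for {filename!r} ({use_type}, {admid})")


print(object_role("thesis_pdfa_allisonhyatt.pdf", "CONTENT", "amd_primary"))
print(object_role("appendices_pdfa_allisonhyatt.pdf", "CONTENT", "amd_supplemental_1"))
print(object_role("mets.xml"))
```

Raising an error for unmatched combinations is a design choice for this sketch: it surfaces unexpected fileSec entries instead of silently assigning a role.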



Values assigned to each file in the submission directory

| Filename | Mime-type | ADMID | USE Type | Object ROLE |
|---|---|---|---|---|
| thesis_pdfa_allisonhyatt.pdf | application/pdf | amd_primary | CONTENT | THESIS |
| appendices_pdfa_allisonhyatt.pdf | application/pdf | amd_supplemental_1 | CONTENT | THESIS_SUPPLEMENT |
| setup_2E592954-F85C-11EA-ABB1-E61AE629DA94.pdf | application/pdf | amd_license_2046147 | LICENSE | LICENSE |
| mets.xml | text/xml | N/A | N/A | DOCUMENTATION |


Billing codes associated with each school

| ProQuest site code | School | Degree - Examples | DRS Billing code | URN Authority path |
|---|---|---|---|---|
| College | School of Engineering and Applied Sciences | Bachelor of Arts (A.B.), Bachelor of Science (S.B.) | HUL.ARCH.ETHESIS.SEAS_0001 | HUL.ARCH |
| DCE | Division of Continuing Education | Master of Liberal Arts (ALM) | HUL.ARCH.ETHESIS.DCE_0001 | HUL.ARCH |
| DIV | Harvard Divinity School | Doctor of Theology (ThD) | HUL.ARCH.EDISS.THD_0001 | HUL.ARCH |
| GSAS | Graduate School of Arts and Sciences | Doctor of Philosophy (Ph.D.) | HUL.ARCH.EDISS.PHD_0001 | HUL.ARCH |
| GSD | Graduate School of Design | Doctor of Design (DDes) | GSD.LIBR.Theses_0001 | GSD.LOEB |
| GSE | Graduate School of Education | Doctor of Education (Ed.D.) | HUL.ARCH.EDISS.EDD_0001 | HUL.ARCH |
| EDLD | Graduate School of Education | Doctor of Education Leadership (Ed.L.D.) | HUL.ARCH.EDISS.EDLC_0001 | HUL.ARCH |
| HBS | Harvard Business School | Doctor of Business Administration (D.B.A.) | HBS.BAKR.HDTC_0001 | HBS.BAKER |
| HMS | Harvard Medical School | Doctor of Medicine (M.D) | HMS.COUNT.HMST_0001 | HMS.COUNT |
| HSDM | Harvard School of Dental Medicine | Doctor of Medical Sciences (D.M.Sc) | HMS.COUNT.HSDMT_0001 | HMS.COUNT |
| HSPH | Harvard T.H. Chan School of Public Health | Doctor of Science (S.D.) | HMS.COUNT.HSPHT_0001 | HMS.COUNT |

Building the objects

Create a Content Model specific object for each file that matches a supported Content Model, and one Opaque Object for all the non-supported file formats.

  • Content Model based on file Mime-type
  • Role based on USE type and ADMID
    • Primary PDF thesis gets ROLE=THESIS
    • Objects for supplemental files in the CONTENT group get ROLE=THESIS_SUPPLEMENT
    • Objects for files in the LICENSE group get ROLE=LICENSE
    • Object that has the mets.xml file gets ROLE=DOCUMENTATION
  • Object with ROLE=THESIS gets:
    • MODS descriptive metadata from Alma using MMSID
    • HAS_SUPPLEMENT relationship to all objects with the THESIS_SUPPLEMENT Role
    • HAS_LICENSE relationship to all objects with the LICENSE Role
    • HAS_DOCUMENTATION relationship to all objects with the DOCUMENTATION Role
  • Object with any ROLE gets:
    • Harvard Metadata link Type=Local with label ProQuestID; value = PQ-[PROQUEST_IDENTIFIER_VALUE]
    • Harvard Metadata link Type=Alma - using MMSID
    • Harvard Metadata link Type=DASH - using DASH ID
  • File in any Content Model (except Text*):
    • Role=ARCHIVAL_MASTER

*Text Content Model doesn't support the ARCHIVAL_MASTER role for files
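
The grouping rule at the top of this section (one object per supported file, one shared Opaque Object for everything else) might look like this in outline. The function name `plan_objects`, the dictionary keys, and the two supplementary filenames are hypothetical. The sketch covers only Content Model grouping and file roles, not object Roles or relationships.

```python
def plan_objects(files: list) -> list:
    """Group files into objects.

    files: list of dicts with 'filename' and 'content_model' keys.
    """
    objects = []
    opaque_files = []
    for f in files:
        if f["content_model"] == "Opaque":
            opaque_files.append(f["filename"])
            continue
        objects.append({
            "content_model": f["content_model"],
            "files": [f["filename"]],
            # The Text Content Model doesn't support ARCHIVAL_MASTER for files.
            "file_role": None if f["content_model"] == "Text" else "ARCHIVAL_MASTER",
        })
    if opaque_files:
        # All unsupported formats share a single Opaque Object.
        objects.append({
            "content_model": "Opaque",
            "files": opaque_files,
            "file_role": "ARCHIVAL_MASTER",
        })
    return objects


plan = plan_objects([
    {"filename": "thesis_pdfa_allisonhyatt.pdf", "content_model": "Document"},
    {"filename": "mets.xml", "content_model": "Text"},
    {"filename": "model.dwg", "content_model": "Opaque"},   # hypothetical CAD file
    {"filename": "slides.ppt", "content_model": "Opaque"},  # hypothetical PowerPoint file
])
for obj in plan:
    print(obj)
```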

ETD deposits into DRS - data validation

  • Each batch should have the objects for only one ETD
  • Each batch should have one and only one Thesis, i.e. a Document Object with ROLE=THESIS
  • Each object in an ETD batch should have Harvard Metadata Link values for the ProQuest ID (Type=Local) and Alma ID (Type=Alma)
  • If there is a DASH ID, it should be added to Harvard Metadata Links as a DASH type
  • The Thesis document object should have MODS metadata from HOLLIS using the Alma MMSID
  • The Thesis document object should have a defined relationship with all the other objects in the ETD batch
  • Embargo information should be recorded in the Thesis and Supplements
  • All the files listed in the fileSec of the mets.xml should match a file in the zip package
    • Filenames should be sanitized before creating  DRS batches
  • Each file in the zip package, except the mets.xml, should have an entry in the fileSec of the mets.xml
  • Each file in the zip package should be in one and only one object in the ETD batch
  • The ProQuest ID for any object in a new ETD batch should not be in a previously deposited DRS object.
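
The two fileSec checks in the list above can be sketched as a set comparison. The function name `check_package` and its return format are illustrative. The zip member names could come from `zipfile.ZipFile.namelist()`, and the comparison assumes filenames on both sides have already been sanitized in the same way.

```python
def check_package(filesec_names: list, zip_names: list) -> dict:
    """Compare the files listed in the fileSec with the files in the zip package."""
    listed = set(filesec_names)
    # mets.xml is the one file that is not expected to appear in the fileSec.
    packaged = set(zip_names) - {"mets.xml"}
    return {
        "missing_from_zip": sorted(listed - packaged),
        "missing_from_filesec": sorted(packaged - listed),
    }


report = check_package(
    filesec_names=[
        "thesis_pdfa_allisonhyatt.pdf",
        "appendices_pdfa_allisonhyatt.pdf",
    ],
    zip_names=[
        "mets.xml",
        "thesis_pdfa_allisonhyatt.pdf",
        "extra_notes.txt",  # hypothetical file that the fileSec does not list
    ],
)
print(report)
```

A package passes these two checks when both lists in the report are empty.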