Advancing Open Knowledge Grant

Enhancing Slavery, Abolition, Emancipation, and Freedom: Primary Sources from Houghton Library for Deeper Research

Background

This project sets out to enhance the research, educational, and interpretative possibilities of Houghton Library’s Slavery, Abolition, Emancipation, and Freedom: Primary Sources from Houghton Library (SAEF) project. In the summer of 2020, Houghton Library committed to focusing our project digitization for the upcoming year on a curated set of 2000+ rare and unique materials illustrating African American history from the 18th century through the turn of the 20th century. The project, in process with an internal Houghton team and colleagues in Imaging Services, is planned to culminate in two primary deliverables besides hundreds of newly cataloged and digitized primary sources: a Dublin Core-based dataset of the digitized records, available through Dataverse, and an easy-to-access curated CURIOSity site. The project is guided by a trifecta of Harvard Library Values: Seek collaboration, Embrace diverse perspectives, Champion access.

SAEF is currently funded through internal Houghton resources that have been redirected away from paused digital projects, and staffed by archivists and librarians who have been able to shift duties due to the current remote work scenario. As the project has developed, it has become clear that with additional funding we would be able to provide data better suited to computational research, including text mining and image research. We have also realized that we could diversify the interpretative perspectives and pedagogical possibilities on our CURIOSity site by engaging thinkers from outside the library, including student-scholars and educational consultants.

This grant proposal is directly aligned with the following FY21 Harvard Library Priorities:

Increase our focus on supporting excellence in virtual teaching, learning and research (Image Collection as Data/Machine Learning Transcription/Pedagogical Text)
Advance diversity, inclusion, belonging and anti-racism in our workforce, services, collections and spaces (Interpretative Text/Pedagogical Text)
Simplify and advance systems to preserve, store and access library digital assets (Image Collection as Data/Machine Learning Transcription)

Project Description

The project can be described in three parts: enhancements to the computational research possibilities of the dataset through a workflow with Imaging Services that will allow users to search the entire corpus of images and OCRed text; machine learning-based handwriting transcription of the digitized manuscript collections; and the collaborative creation of interpretative and pedagogical text for the CURIOSity site. All of these parts are necessary, as they serve the greater goal: increasing access to and understanding of Houghton’s African American history materials. We are particularly interested in pursuing these goals through existing HL infrastructure, so that our progress can provide pathways for other library projects.

Image Collection as Data

As it is currently impossible for researchers and extremely difficult for staff to make bulk pulls of MARC records, ArchivesSpace finding aids, and digital objects in the DRS, we began this project with the plan to create a Dublin Core-compatible database with normalized titles, dates, locations, authors, and a controlled vocabulary of tags. The goal here is to provide researchers with a database more suitable to computational research. In meeting with Ceilyn Boyd and Katie Mika, we were assured that the database itself was formatted for usability, but that the best-case scenario would be for the image files themselves, and any accompanying OCR or transcriptions, to be directly linked to the respective data in our spreadsheets. While this is not a current service of Imaging Services, Bill Comstock, Wendy Gogel, and Boyd have devised a workflow by which Imaging Services staff could upload digital object bundles to a separate space on NextCloud cloud storage before they are uploaded into the DRS. Imaging staff are already familiar with a similar workflow, which was recently established to handle virtual course reserves. Imaging Services would also need to create high-quality JPEG files to upload as part of these bundles, as opposed to creating only preservation JPEG2000 files. Using high-quality JPEGs instead of JPEG2000s will save a substantial amount of storage space and make computational research far easier.
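As a rough illustration of what directly linking files to spreadsheet data could look like, the following Python sketch attaches image and text files to metadata rows by a shared identifier. The column names, the file naming pattern, and the sample records are all hypothetical and are not drawn from the project's actual spreadsheets or bundles.

```python
import csv
import io
from pathlib import PurePosixPath

IMAGE_SUFFIXES = {".jpg", ".jpeg"}
TEXT_SUFFIXES = {".txt"}


def build_manifest(metadata_rows, file_paths, id_field="identifier"):
    """Attach image and text files to each metadata row by shared identifier.

    Assumption (hypothetical): each file name starts with the record's
    identifier followed by an underscore, e.g. "rec001_0001.jpg".
    """
    files_by_id = {}
    for path in file_paths:
        p = PurePosixPath(path)
        record_id = p.name.split("_", 1)[0]
        suffix = p.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            kind = "images"
        elif suffix in TEXT_SUFFIXES:
            kind = "texts"
        else:
            continue  # ignore files that are neither images nor text
        entry = files_by_id.setdefault(record_id, {"images": [], "texts": []})
        entry[kind].append(path)

    manifest = []
    for row in metadata_rows:
        linked = files_by_id.get(row[id_field], {"images": [], "texts": []})
        manifest.append(
            {**row, "images": sorted(linked["images"]), "texts": sorted(linked["texts"])}
        )
    return manifest


# Invented sample data, for illustration only.
SAMPLE_CSV = """identifier,title,date
rec001,Sample letter,1865
rec002,Sample pamphlet,1838
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
files = [
    "bundle/rec001_0001.jpg",
    "bundle/rec001_0001.txt",
    "bundle/rec002_0001.jpg",
]
manifest = build_manifest(rows, files)
```

A manifest of this kind would let a researcher move from a row of descriptive metadata to the corresponding page images and text without a separate lookup.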

This workflow will both give researchers the ability to do the sorts of computational research for which large digitization projects are well suited, and serve as a pilot for articulating the functional requirements/user stories for computational research of archival materials for the next generation of DRS. This collection could open up text mining possibilities and comparative image searches that would expand our understanding of the African American freedom struggle. The requested funding here is $5000 for data curation, covering two datasets, and $288 for six months of cloud-based data storage with NextCloud. These figures are liberal estimates provided by Sonia Barbosa, Manager of Research Data, and Sharon Bayer, who has been working on NextCloud deployment.

Machine Learning Transcription

Another key goal is to provide researchers with the OCR from our digitized objects, to open research possibilities up beyond human capability. The bulk of the materials to be digitized are 19th century published materials, which OCR relatively well. There are, however, many key documents that fare less well, particularly handwritten material. Handwritten material in our collection includes letters from famed Black abolitionists, letters from formerly enslaved people to the Freedmen’s Bureau, personal recollections of Gullah-Geechee sacred music, and more.

These materials complement and supplement the published materials and would benefit from transcription. Matt Cook has been experimenting with machine learning-based transcription workflows, using a combination of open-source (throughput) and cloud-based (processing) software, which can reliably produce full-text searchable documents with 90%+ accuracy. More specifically, Caltech Library’s Handprint software, when coupled with a recently released version of Microsoft Cognitive Services’ cloud-based Read API, provides the means to quickly transcribe handwriting from a variety of sources (i.e., document types) and hands, without custom model training, at a relatively low cost. This builds on previous HL work to identify functional handwriting transcription software, particularly the working group that explored Transkribus, as well as the strong efforts of Marilyn Dunn of Schlesinger Library towards scalable handwriting transcription. Unlike Transkribus, this methodology scales well to a mixed handwriting collection, is more affordable, and requires less time spent identifying material sets and providing human-checked transcriptions. The requested funding here is ~$100 for approximately 2750 manuscript pages, and usable outputs include (per page) .json, .txt, and (annotated) .png image files. This work builds on the requested funding for Image Collection as Data. By uploading the object bundles to NextCloud, we will be able to provide easy remote access to the JPEGs for Cook to run through the Handprint and Read API software.
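To make the per-page figures concrete, the Python sketch below works out the approximate cost per page from the numbers above and checks that each page has the three output types mentioned (.json, .txt, .png). It does not call Handprint or the Read API; the page identifiers and file names are invented for illustration, and the actual output naming may differ.

```python
from pathlib import PurePosixPath

# Figures from the proposal: roughly $100 for approximately 2750 pages.
REQUESTED_USD = 100
APPROX_PAGES = 2750
cost_per_page = REQUESTED_USD / APPROX_PAGES  # a little under four cents

EXPECTED_SUFFIXES = {".json", ".txt", ".png"}


def missing_outputs(page_ids, output_paths):
    """Return {page_id: [missing suffixes]} for pages lacking any output.

    Assumption (hypothetical): each output file is named "<page_id><suffix>".
    """
    found = {page_id: set() for page_id in page_ids}
    for path in output_paths:
        p = PurePosixPath(path)
        suffix = p.suffix.lower()
        if p.stem in found and suffix in EXPECTED_SUFFIXES:
            found[p.stem].add(suffix)
    return {
        page_id: sorted(EXPECTED_SUFFIXES - suffixes)
        for page_id, suffixes in found.items()
        if suffixes != EXPECTED_SUFFIXES
    }


# Invented sample: p1 is complete, p2 is partial, p3 has no outputs yet.
sample_outputs = ["out/p1.json", "out/p1.txt", "out/p1.png", "out/p2.txt"]
gaps = missing_outputs(["p1", "p2", "p3"], sample_outputs)
```

A completeness check like this could flag pages that need to be rerun before the transcriptions are linked back to the dataset.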

Interpretative Text (Call for Participants Available Here)

A large part of increasing the accessibility of this digital collection is the creation of appropriate contextualizing/interpretative material. The original project design calls for internal Houghton staff to write this text, but with access to increased funding we would like to move in a direction that creates opportunities for students to engage with digitized primary source materials and to have a paid opportunity for public scholarship. In support of this new direction, we are requesting $350 x 5 ($1750) in funding as honoraria for the creation of 5 interpretative texts to be posted as part of the CURIOSity site. Following the project design for Colonial North America, where fellows were funded to create public history text, we would fund writings of 1000 to 1500 words covering five major themes of the digital collection: Early Republic, Civil War, Reconstruction, Abolitionists, and Black Authors. With the support of Professor Sarah Lewis, a CFP directed towards advanced undergraduate and graduate students will be circulated within the AAAS department.

Pedagogical Text (PDF Download Here)

Houghton Library’s active DIBAR Committee has identified increasing access for K-12 students as part of our inclusive outreach initiatives. In support of this new direction, and to supplement the pedagogical text for higher-ed learners that will be created in-house, we are asking for $450 in funding as an honorarium for the creation of a learning guide for early learners to be posted as part of the CURIOSity site. This work would continue projects already in place, like the May Crane Fellow working with DSI, to open our collections up to the broadest educational community possible. The fee and collaborative process for this guide’s creation have been designed with the help of education consultant Mekha McGuire, who is currently working as a consultant with the Boston Teachers Union to develop a K-12 Black Studies curriculum. McGuire, who has experience in special collections, has agreed to write this guide, contingent on funding.

Project Team and Collaborators

Houghton Library Project Team

Dorothy Berry, Digital Collections Program Manager

Grant PI and Project Manager

Christine Jacobsen, Assistant Curator of Modern Books and Manuscripts

Interpretative Text

Micah Hoggatt, Reference Librarian

Pedagogical Text

Monique Lassere, Digital Archivist

Image Collection as Data

Collaborators

Bill Comstock, Head of Imaging Services

Image Collection as Data

Ceilyn Boyd, Research Data Program Manager

Image Collection as Data

Katie Mika, Data Services Librarian, IQSS

Image Collection as Data

Matt Cook, Digital Scholarship Program Manager

Machine Learning Transcription

Vanessa Venti, Analyst for Stewardship of Digital Assets

CURIOSity Site

Mekha McGuire, K-12 Unit Developer

Pedagogical Text

Dr. Sarah Lewis, Associate Professor of History of Art and Architecture and African and African American Studies

Interpretative Text

Project Timeline

The timeline for this project is most heavily impacted by the vagaries of COVID safety. As Imaging Services responds to the needs of faculty, and as all staff manage on-site work safety measures, this timeline may shift. Luckily, the work is ongoing and can begin immediately with the next batch of digitized material.

January

Set up server access workflow with Imaging Services and begin deposits
Set up transcription workflow with DS and server access
Write CFP and pedagogical text contract agreement

February

Process materials through new storage/transcription workflows
Promote CFP with a submission deadline of early March and a final due date of early May

March

Process materials through new storage/transcription workflows

April

Process materials through new storage/transcription workflows
Review text submissions

May

Process materials through new storage/transcription workflows
Work with Dataverse and RDM on uploading data
Begin creation of CURIOSity site and final edits of commissioned writings

June

Launch CURIOSity with commissioned writings
Promote via Harvard Library and Houghton Library social media and through active outreach to University departments

Project Budget

AOK Grant Proposed Budget

Expense	Base Cost (USD)	Quantity	Subtotal (USD)
Data curation cost for Dataverse service	5000	1	5000
6 Month NextCloud storage	0.023 per GB	1000 GB	144
NextCloud back-up (recommended)	0.023 per GB	1000 GB	144
Read API	100	1	100
Interpretative Text Honoraria	350	5	1750
Pedagogical Text Honorarium	450	1	450

		Total	7588
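The total can be checked with a few lines of Python. In this sketch the two NextCloud subtotals are taken as stated in the table rather than recomputed, since the billing period behind the per-GB rate is not given; the item labels are descriptive shorthand.

```python
# Subtotals in USD, as listed in the proposed budget.
line_items = {
    "Data curation (Dataverse)": 5000,
    "NextCloud storage, 6 months": 144,   # taken as stated
    "NextCloud back-up": 144,             # taken as stated
    "Read API": 100,
    "Interpretative texts": 350 * 5,      # five texts at $350 each
    "Pedagogical text": 450 * 1,          # one learning guide
}

total = sum(line_items.values())
storage_total = line_items["NextCloud storage, 6 months"] + line_items["NextCloud back-up"]

print(f"Storage and back-up: ${storage_total}")
print(f"Total request: ${total}")
```

The storage and back-up lines together account for the $288 cited under Image Collection as Data.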

Houghton Technical Services