Note: see Glossary at the bottom for definitions of technical terms and acronyms.


I. Problem/Value Statement

Problem Statement:

Harvard Yenching Library has performed optical character recognition (OCR) on a large corpus of ancient Chinese texts (including some Japanese and Korean texts) in its collection. It would now like to deposit that text to the Digital Repository Service (DRS), associate it with existing Page Delivery Service (PDS) objects, and make the OCR searchable. Additionally, a large collection of modern Chinese Documents on Contemporary Chinese Politics has been purchased and digitized by the Harvard-Yenching Institute. The collection, including its CJK OCR, has been loaded to the DRS but is not currently searchable in Full Text Search (FTS) because FTS does not support CJK search. The CJK OCR should be searchable in FTS and deliverable in PDS.

To accommodate these two existing collections and future collections with CJK OCR, the ability to add OCR to existing DRS objects needs to be developed, and FTS needs to be upgraded to enable CJK search and delivery of the results to PDS.

Business Value:

Updating FTS to enable CJK search will let researchers ask exploratory questions of the data and get immediate answers. Full-text search makes many investigations practical that would otherwise require examining every page of a scanned text by hand, which is rarely an option worth contemplating. Updating DRS to allow adding OCR to existing PDS objects will allow the OCR for the ancient Chinese texts, and for many similar projects, to enrich existing PDS documents in DRS as OCR technology for non-Latin languages matures.

II. Vision and Approach

Describe the solution:
The solution will consist of an upgraded FTS search index built on Solr and additionally tuned to support the CJK requirements. The upgraded index will use the DRS indexing functions to update and full-text index documents in the FTS search index. As part of this solution, existing documents will need to be reindexed to support updated features within the Solr environment. The updated FTS search servlet will submit searches and return results for searches submitted through PDS or through an FTS form, in keeping with current FTS functionality. PDS will display the search results and the relevant PDS document pages, also in keeping with current FTS functionality. The API used for these functions will be kept as consistent as possible.
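To illustrate why the index needs CJK-specific tuning: CJK text is typically written without spaces between words, so whitespace tokenization yields no usable search terms. One common approach, which Solr provides through its `solr.CJKBigramFilterFactory`, is to index overlapping two-character pairs (bigrams). The sketch below shows the idea in plain Python. It is not the project's analyzer configuration, which this document does not specify; the function names and the simplified code-point ranges are assumptions made for illustration.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough check for Han, kana, and Hangul code points (simplified ranges, an assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_bigrams(text: str) -> List[str]:
    """Split text into overlapping two-character tokens for each run of CJK characters.

    Non-CJK characters act as separators and are dropped in this sketch.
    A run consisting of a single CJK character is emitted as a one-character token.
    """
    tokens: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # Emit the bigrams for the current run, then reset it.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            flush()
    if run:
        flush()
    return tokens


if __name__ == "__main__":
    # Hypothetical sample text; a query for a two-character term matches one indexed bigram.
    print(cjk_bigrams("哈佛燕京圖書館"))
```

Because both the indexed text and the query would pass through the same tokenization, a two-character query term matches wherever that pair occurs on a page, without requiring a dictionary-based word segmenter. The trade-off is a larger index and some false matches across word boundaries, which is part of what tuning would have to address.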

...

  • Tuning of the indexing engine for additional non-Latin languages is out of scope of this project
  • Supporting bulk-adding of content or metadata to the DRS as a standard generalized DRS process is out of scope of this project

III. Stakeholders/People

Who is the work being done for? (Sponsor)

...

Who will be involved in doing the work (service area, department, etc …)? [Include name, project role, and estimated percentage of time per month for the project duration]

Resource Name | Role(s) | Monthly Time Estimate | Monthly Time Estimate
Vitaly Zakuta | PM / Support / QA | % | %
TBD - Solr | Developer | 90 % | %
Chris Vicary - DRS | Developer / Development Supervisor | 10 % | %
Anthony Moulen | Technical Architect | % | %
Donald Sturgeon | CJK Domain Expert | % | %
Mingtao Zhao (Imaging Services) | Staging and bulk deposit of OCR for DRS | % | %
Sharon Yang | Review and accept CJK search functionality | % | %
TBD | Operational Resources | % | %

IV. Schedule and Cost

Schedule:

[include project phases, known reporting dates, and delivery deadlines]

...

Cost Breakdown (for large projects):

Cost Type | Estimated Cost
Hardware |
Licensing |
Contract Labor | $90,000.00
[add and remove rows as needed]

V. Other

Constraints:

  • The project may not finish all of its stages before budgeted resources run out (currently estimated at 6 months' worth of budgeted work)

...

  • Sustainability plan (open question)
  • Contents are in various Chinese character sets (modern Simplified vs. Traditional Chinese), which would complicate tuning the index
  • Contents are in additional languages besides Chinese (Japanese, Korean), which would complicate tuning the index
  • The product may meet the "done" measurements but lack desired features beyond those specified in the "definition of done."
  • The CJK upgrade may require changes to other applications in the DRS ecosystem, which are not currently planned as part of the project (this will not be definitely known until the upgrade is close to complete)
  • The association between CJK OCR files and existing page images in the DRS may be incorrect or only partially correct
  • Staged CJK OCR files (at Imaging Services or elsewhere) may not follow the directory structure and naming conventions expected by the DRS

Collections to be indexed

  1. Chinese Rare Books Collection digitization - Harvard Yenching Library, 700,000 pages (OCR needs to be ingested to DRS)
  2. Contemporary Chinese Politics - 600,000 pages, digitization complete June 2015 (already in DRS)

Glossary

CJK – Chinese, Japanese, Korean

...