I. Problem/Value Statement
Problem Statement:
Harvard-Yenching Library has performed optical character recognition (OCR) on a large corpus of ancient Chinese texts in its collection and would now like to deposit that text in the DRS, associate it with existing PDS objects, and make the OCR searchable. Additionally, a large collection of modern Chinese Documents on Contemporary Chinese Politics has been purchased and digitized by the Harvard-Yenching Institute. The collection, including its CJK OCR, has been loaded to the DRS but is not currently searchable in FTS because FTS does not support CJK search. The CJK OCR should be searchable in FTS and deliverable in PDS.
To accommodate these two existing collections and future collections with CJK OCR, the ability to add OCR to existing DRS objects needs to be developed, and FTS needs to be upgraded to enable CJK search and delivery of the results to PDS.
Business Value:
Updating FTS to enable CJK search will let researchers ask exploratory questions of the data and get immediate answers. Full-text search makes it practical to investigate questions that would otherwise require examining every page of a scanned text by hand, which in many cases would not be an option worth contemplating. Updating DRS to allow adding OCR to existing PDS objects will allow the OCR for the ancient Chinese texts, and for many similar projects, to enrich existing PDS documents in DRS as OCR technology for non-Latin languages matures. This is in turn expected to increase the use of DRS collections, as more researchers will be able to ask new questions about the data.
II. Vision and Approach
Describe the solution:
The new solution will consist of a new SOLR indexing engine defined and tuned for CJK full-text search. The SOLR indexing engine will work with a new FTS indexer that interacts with the DRS to index any PDS documents that are newly ingested into the DRS or updated in it. A new FTS search servlet will accept searches submitted through PDS/Mirador or through an FTS form and return the results. PDS/Mirador will display the search results and the relevant PDS document pages, in keeping with current FTS functionality.
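To illustrate why the index needs CJK-specific tuning: Chinese text has no spaces between words, so a whitespace-based tokenizer yields whole sentences as single tokens. One common approach is to index overlapping two-character tokens (bigrams). The Python sketch below shows the idea only. It is not the project's chosen analyzer configuration, and the function names `is_cjk`, `cjk_bigrams`, and `page_matches` are hypothetical. It also covers only the main CJK Unified Ideographs block and ignores token adjacency, so it is looser than true phrase search.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (assumption:
    extension blocks, kana, and hangul are out of scope for this sketch)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text: str) -> list[str]:
    """Split each run of CJK characters into overlapping two-character tokens."""
    tokens: list[str] = []
    run: list[str] = []
    for ch in text + " ":  # trailing space flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            tokens.append(run[0])  # a lone character is kept as its own token
        else:
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return tokens


def page_matches(query: str, page_text: str) -> bool:
    """True if every bigram of the query occurs somewhere in the page text."""
    page_tokens = set(cjk_bigrams(page_text))
    query_tokens = cjk_bigrams(query)
    return bool(query_tokens) and all(t in page_tokens for t in query_tokens)


if __name__ == "__main__":
    print(cjk_bigrams("图书馆"))
    print(page_matches("燕京", "哈佛燕京图书馆"))
```

With bigram tokens, a two-character query such as 燕京 matches a page containing 哈佛燕京图书馆 without the indexer needing a dictionary to decide where words begin and end.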
DRS will be enhanced to allow batch loading of OCR files and linking them with existing DRS PDS documents. The newly updated documents will be automatically submitted for FTS indexing.
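The batch step can be sketched as a planning pass that pairs each OCR file with an existing page before anything is loaded. The Python sketch below assumes, purely for illustration, that an OCR file's name (without extension) equals the name of the page it belongs to. The actual matching rule, the identifiers, and the function name `plan_ocr_links` are hypothetical and would be defined by the DRS design.

```python
from pathlib import PurePath


def plan_ocr_links(
    ocr_files: list[str], existing_pages: dict[str, int]
) -> tuple[dict[str, int], list[str]]:
    """Pair OCR files with existing pages by filename stem.

    existing_pages maps a page name to an object ID (both hypothetical).
    Returns (links, unmatched): links maps each OCR file path to the ID of the
    page it would be attached to; unmatched lists files with no matching page.
    """
    links: dict[str, int] = {}
    unmatched: list[str] = []
    for path in ocr_files:
        stem = PurePath(path).stem  # "batch/vol1_0001.txt" -> "vol1_0001"
        page_id = existing_pages.get(stem)
        if page_id is None:
            unmatched.append(path)  # report for review rather than guess
        else:
            links[path] = page_id
    return links, unmatched


if __name__ == "__main__":
    pages = {"vol1_0001": 1001, "vol1_0002": 1002}
    files = ["batch/vol1_0001.txt", "batch/vol1_0099.txt"]
    links, unmatched = plan_ocr_links(files, pages)
    print(links)
    print(unmatched)
```

Separating the plan from the load lets unmatched files be reviewed before the batch runs, and the set of linked object IDs gives the list of documents to submit for FTS indexing afterward.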
Deliverables/Work Products:
A replacement for the current FTS that supports CJK search.
An enhancement to the DRS to support batch loading of OCR files and linking them with existing DRS PDS documents.
Define how to measure “done”:
In Scope:
Out of Scope (for medium and large projects):
III. Stakeholders/People
Who is the work being done for? (Sponsor)
Suzanne Wones
What organizations, departments, or people will benefit from this work (for medium and large projects)
Who is funding the work?
Who will accept the work?
Who is the project manager?
Who will be involved in doing the work (service area, department, etc …)? [Include name, project role, and estimated percentage of time per month for the project duration]
Resource Name | Role(s) | Monthly Time Estimate | Monthly Time Estimate |
---|---|---|---|
Joe | PM | 50% | 50% |
Sally | Developer | 50% | 30% |
Mirena | Support, QA | 20% | 50% |
[add and remove rows as needed] |
IV. Schedule and Cost
Schedule: [include project phases, known reporting dates, and delivery deadlines]
Cost:
Cost Breakdown (for large projects):
Cost Type | Estimated Cost |
---|---|
Hardware | |
Licensing | |
Contract Labor | |
[add and remove rows as needed] |
V. Other
Constraints:
Assumptions:
Dependencies:
Risks (description, plan, impact, owner):