...

Harvard-Yenching Library has performed optical character recognition (OCR) on a large corpus of ancient Chinese texts in its collection. It would now like to deposit that text to the Digital Repository Service (DRS), associate it with existing Page Delivery Service (PDS) objects, and have the OCR be searchable. Additionally, a large collection of modern Chinese Documents on Contemporary Chinese Politics has been purchased and digitized by the Harvard-Yenching Institute. The collection, including CJK OCR, has been loaded to the DRS but is not currently searchable in Full Text Search (FTS) because FTS doesn't support CJK search. The CJK OCR should be searchable in FTS and deliverable in PDS.

...

Describe the solution:
The solution will consist of an upgraded FTS index built on Solr and additionally tuned to support the CJK requirements. The upgraded index will use the DRS indexing functions to support updating and full-text indexing of documents in the FTS index. As part of this solution, existing documents will need to be reindexed to support updated features within the Solr environment. The updated FTS search servlet will accept a search submitted through PDS/Mirador or through an FTS form and return its results, in keeping with the current FTS functionality. PDS/Mirador will display the search results and the relevant PDS document pages, in keeping with the current FTS functionality. The API used for these functions will be kept as consistent as possible.
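To illustrate why CJK text needs dedicated tuning, the sketch below shows overlapping bigram tokenization, one common way to index text that has no spaces between words. Solr ships a CJK bigram filter that works along these lines, but this charter does not say which tuning will be chosen, so the approach here is an assumption. The function names `is_cjk` and `cjk_bigrams` and the character ranges are hypothetical and deliberately rough.

```python
def is_cjk(ch: str) -> bool:
    """Rough check for common CJK scripts (illustrative ranges only)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def cjk_bigrams(text: str) -> list[str]:
    """Split each run of CJK characters into overlapping two-character tokens.

    A run of length one is kept as a single-character token.
    Non-CJK characters only act as run boundaries in this sketch.
    """
    tokens: list[str] = []
    run: list[str] = []

    def flush() -> None:
        # Emit the tokens for the current run, then reset it.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


if __name__ == "__main__":
    indexed = cjk_bigrams("中华人民共和国")
    query = cjk_bigrams("人民")
    # Every query token appears among the indexed tokens, so the query matches.
    print(indexed, all(token in indexed for token in query))
```

Because the query is tokenized the same way as the indexed text, a two-character term such as 人民 matches inside a longer unsegmented string without requiring a word dictionary. The trade-off is a larger index and occasional false matches across word boundaries.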

...

Resource Name | Role(s) | Monthly Time Estimate | Monthly Time Estimate
Vitaly Zakuta | PM / Support / QA | % | %
TBD - Solr | Developer | 90 % | %
Chris Vicary - DRS | Developer | 10 % | %
Anthony Moulen | Technical Architect | % |

...

Risks (description, plan, impact, owner):

Glossary

CJK – Chinese, Japanese, Korean

DRS – Digital Repository Service

FTS – Full Text Search

Mirador – Software used by LTS for delivery and presentation of Page Turned Objects

OCR – Optical Character Recognition

PDS – Page Delivery Service

Solr – Indexing engine technology used by LTS