
Old project page: FTS CJK Search (old project page)

Note: see Glossary at the bottom for definitions of technical terms and acronyms.

I. Problem/Value Statement

Problem Statement:

Harvard Yenching Library has performed optical character recognition (OCR) on a large corpus of ancient Chinese texts in its collection. It would now like to deposit that text to the Digital Repository Service (DRS), associate it with existing Page Delivery Service (PDS) objects, and make the OCR searchable. Additionally, a large collection of modern Chinese Documents on Contemporary Chinese Politics has been purchased and digitized by the Harvard-Yenching Institute. The collection, including its CJK OCR, has been loaded to the DRS but is not currently searchable in Full Text Search (FTS) because FTS does not support CJK search. The CJK OCR should be searchable in FTS and deliverable in PDS.

To accommodate these two existing collections and future collections with CJK OCR, the DRS needs the ability to add OCR to existing objects, and FTS needs to be upgraded to enable CJK search and delivery of the results to PDS.

Business Value:

Updating FTS to enable CJK search will let researchers ask exploratory questions of the data and get immediate answers. This makes many investigations practical that would otherwise require examining every page of a scanned text by hand. Updating DRS to allow adding OCR to existing PDS objects will let the OCR for the ancient Chinese texts, and for many similar projects, enrich existing PDS documents in the DRS as OCR technology for non-Latin languages matures.

II. Vision and Approach

Describe the solution:
The solution will consist of an upgraded FTS search index using Solr, additionally tuned to support the CJK requirements. The upgraded index will use the DRS indexing functions to support updating and full-text indexing of documents in the FTS search index. As part of this solution, existing documents will need to be reindexed to support updated features within the Solr environment. The updated FTS search servlet will accept a search submitted through PDS/Mirador or through an FTS form and return its results, in keeping with the current FTS functionality. PDS/Mirador will display the search results and the relevant PDS document pages, also in keeping with the current FTS functionality. The API used for these functions will be kept as consistent as possible.
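To illustrate why the index needs CJK-specific tuning: Chinese text has no spaces between words, so a whitespace-based tokenizer would treat a whole sentence as one token. One common approach in Lucene-based engines such as Solr is to index overlapping two-character sequences (bigrams). The sketch below shows that idea in plain Python. It is an illustration only, not the analysis chain chosen for this project, and the function names and the simplified character ranges are assumptions made for the example.

```python
def is_cjk(ch: str) -> bool:
    """Simplified range check (an assumption for this sketch, not a full
    Unicode script test)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_bigrams(text: str) -> list[str]:
    """Split text into overlapping two-character tokens per run of CJK
    characters. A run of length one is emitted as a single-character token.
    Non-CJK characters only act as separators here."""
    tokens: list[str] = []
    run: list[str] = []

    def flush() -> None:
        # Emit tokens for the current run, then reset it.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


def matches(query: str, document: str) -> bool:
    """A document matches when it contains every bigram of the query."""
    doc_tokens = set(cjk_bigrams(document))
    return all(tok in doc_tokens for tok in cjk_bigrams(query))
```

With this scheme, a query for 燕京 matches a page containing 哈佛燕京 because the bigram 燕京 appears in both token lists, even though neither string contains a space.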

DRS will be enhanced to allow batch loading of OCR files and linking them with existing DRS PDS documents. The newly updated documents will be automatically submitted for FTS indexing.
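The batch flow described above (load OCR files, link each to an existing document, queue the updated documents for indexing) can be sketched as follows. Everything in this example is hypothetical: the `<page_id>.txt` file naming convention, the `LinkResult` type, and the function names are assumptions for illustration and do not describe the actual DRS interfaces.

```python
from dataclasses import dataclass


@dataclass
class LinkResult:
    linked: dict[str, str]   # existing page ID -> OCR file name
    unmatched: list[str]     # OCR files with no matching existing page


def link_ocr_batch(ocr_files: list[str], existing_page_ids: set[str]) -> LinkResult:
    """Match each OCR file to an existing page by file name.

    Assumes (hypothetically) that files are named "<page_id>.txt".
    Files that do not correspond to an existing page are reported
    rather than loaded.
    """
    linked: dict[str, str] = {}
    unmatched: list[str] = []
    for name in ocr_files:
        page_id = name.rsplit(".", 1)[0]  # strip the extension
        if page_id in existing_page_ids:
            linked[page_id] = name
        else:
            unmatched.append(name)
    return LinkResult(linked, unmatched)


def indexing_queue(result: LinkResult) -> list[str]:
    """Page IDs that received OCR and should be submitted for indexing."""
    return sorted(result.linked)
```

Reporting unmatched files separately, instead of failing the whole batch, is one possible design choice; it lets a large batch proceed while leaving a list for manual review.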

Deliverables/Work Products: 
Upgraded FTS that supports CJK search.

An enhancement to the DRS to support batch loading OCR files and linking them with existing DRS PDS documents.

These are the individual deliverables that will be part of this project:

  • Upgrade of the indexing engine
  • Development of the DRS OCR batch import function
  • Upgrade of the index / ingest process
  • Update of the search APIs
  • Update of PDS/Mirador to support the new search solution
  • Tuning of the search index for CJK materials

Each deliverable may be run as a sub-project, with its own interested parties and sub-team.

Define how to measure “done”:
The overall project is considered done when DRS CJK PDS documents can be indexed and searched in FTS; search results can be viewed in PDS/Mirador and FTS; and the collection of OCR for existing DRS CJK PDS documents has been ingested into the DRS and is available for search in FTS and for display in PDS/Mirador and FTS.

This project will iterate in phases. For each phase, a separate definition of "done" will be specified in the project plan.

In Scope:
Support for CJK indexing and search in FTS, delivery of results to PDS/Mirador.

Enhancement to support bulk-adding CJK OCR to existing PDS objects in the DRS.

Out of Scope (for medium and large projects):
Support for indexing and search in FTS for other languages; support for bulk-adding any other metadata or content to the DRS; any UI changes or enhancements to the current FTS or PDS/Mirador.

III. Stakeholders/People

Who is the work being done for? (Sponsor)

Suzanne Wones, Associate University Librarian for Digital Strategies and Innovation

What organizations, departments, or people will benefit from this work (for medium and large projects)

Harvard Yenching Library, the Harvard-Yenching Institute, faculty, and researchers will benefit from this work.

Who is funding the work?
Harvard Library is funding this work

Who will accept the work?
Suzanne Wones, Associate University Librarian for Digital Strategies and Innovation (or her designee); a stakeholder (TBD) from Harvard Yenching Library; and Elizabeth Perry, Director of the Harvard-Yenching Institute (or her designee) will accept this work.

Who is the project manager?
Vitaly Zakuta is the project manager for this project

Who will be involved in doing the work (service area, department, etc …)? [Include name, project role, and estimated percentage of time per month for the project duration]

Resource Name | Role(s) | Monthly Time Estimate | Monthly Time Estimate
Vitaly Zakuta | PM / Support / QA | % | %
TBD - Solr | Developer | 90 % | %
Chris Vicary - DRS | Developer | 10 % | %
Anthony Moulen | Technical Architect | % |

IV. Schedule and Cost

Schedule:

[include project phases, known reporting dates, and delivery deadlines]

Cost:

Cost Breakdown (for large projects):

Cost Type | Estimated Cost
Hardware |
Licensing |
Contract Labor | $112,000.00
[add and remove rows as needed] 

V. Other

Constraints:

Assumptions:

Dependencies:

Risks (description, plan, impact, owner):

Glossary

CJK – Chinese, Japanese, Korean

DRS – Digital Repository Service

FTS – Full Text Search

Mirador – Software used by LTS for delivery and presentation of Page Turned Objects

OCR – Optical Character Recognition

PDS – Page Delivery Service

Solr – Indexing engine technology used by LTS

 

 


