Data quality assessment and improvement

This is an umbrella page for project charters specific to data quality assessment and improvement in advance of, and continuing past, a DRS Futures release. Full charters are developed for more specific projects in sub-pages.

I. Problem/Value Statement

Problem Statement

We want to improve the quality of existing records in DRS in order to take full advantage of the software architecture and workflow improvements that DRS Futures promises to bring about.

Areas of particular interest are:

  1. Substandard media files, especially images, many of which are not fit for public delivery and tax our delivery infrastructure, causing service disruptions and slowdowns.
  2. Digital records lacking accurate and exhaustive metadata, which may affect relevance of search results in content management, digital preservation, and discovery services.
  3. Relevance and appropriateness of the current DRS ontology and content models, which should be measured against the needs of today's Harvard stakeholders.

Business Value

By creating delivery files better suited for high-traffic, highly visible public platforms, we would drastically improve the overall quality of our public-facing services.

II. Vision and Approach

Even though delivery services are out of scope for the DRS Futures project, the changes necessary to improve delivery files begin at the DRS level, by creating ad-hoc deliverable copies of archival and production masters, which currently serve the dual purpose of archival and delivery. By establishing post-deposit processes that automatically generate an ad-hoc delivery file for each archival file deposited, we would be able to separate the two purposes.

At the same time, a mass conversion of existing media files that uses the same tools can be planned and executed in parallel on a different timeline. Beginning with the groups of media deemed to have the most negative effect on current delivery performance, the full set of deliverable media would be converted to delivery-optimized copies. This endeavor, which can span multiple years, should be subdivided into sub-projects, each focused on a specific media batch and described by an individual project charter under this section. The media collections for each project would be defined based on assessment actions, carried out through automated analysis of our media collections and existing error reports.
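As an illustration only, the prioritization of media batches by their effect on delivery performance could be sketched as follows. The field names (batch, size_bytes, requests) and the impact measure (file size multiplied by request count) are assumptions made for this example; the actual criteria would come out of the assessment actions described above.

```python
from collections import defaultdict


def rank_batches(files):
    """Rank media batches by a rough estimate of delivery load.

    `files` is an iterable of dicts with the hypothetical keys
    "batch", "size_bytes" and "requests". The impact of a batch is
    assumed to be the sum of size_bytes * requests over its files.
    """
    impact = defaultdict(int)
    for f in files:
        impact[f["batch"]] += f["size_bytes"] * f["requests"]
    # Highest estimated impact first: these batches would be converted first.
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"batch": "large-tiff", "size_bytes": 100, "requests": 10},
        {"batch": "small-jpeg", "size_bytes": 10, "requests": 5},
        {"batch": "large-tiff", "size_bytes": 50, "requests": 2},
    ]
    for batch, score in rank_batches(sample):
        print(batch, score)
```

Each entry in the ranked list could then map to one sub-project charter, starting from the top.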

Metadata completeness and accuracy may be assessed by establishing a set of criteria that can be computationally evaluated, to rank and classify existing digital records using automated tools. Remediation actions may be decided once an exhaustive report is available from this assessment step.
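A minimal sketch of what a computationally evaluated completeness criterion might look like is given below. The required field names, the thresholds, and the class labels are all hypothetical and stand in for criteria that would still need to be agreed on; the record structure is assumed to be a flat dictionary.

```python
# Hypothetical set of fields a record is expected to carry.
REQUIRED_FIELDS = ("title", "creator", "date", "rights")


def completeness_score(record):
    """Return the fraction of required fields that are present and non-empty."""
    filled = sum(
        1 for name in REQUIRED_FIELDS if str(record.get(name) or "").strip()
    )
    return filled / len(REQUIRED_FIELDS)


def classify(score):
    """Map a score to a coarse class; the thresholds are assumptions."""
    if score == 1.0:
        return "complete"
    if score >= 0.5:
        return "partial"
    return "poor"


def assess(records):
    """Build a report of all records, worst-scoring first."""
    report = []
    for record in records:
        score = completeness_score(record)
        report.append(
            {"id": record.get("id"), "score": score, "class": classify(score)}
        )
    return sorted(report, key=lambda row: row["score"])
```

A report of this kind, sorted from least to most complete, could serve as the input for deciding which remediation actions to take first.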

The current DRS content model and ontology should be reviewed in light of today's DRS stakeholder needs and of advances in the concepts and scope of digital preservation within the DP community. Once DRS Futures services are live, which should make it easier to modify the content model configuration, batches of digital resources can be migrated one at a time with the help of ad-hoc conversion tools. These tools would be Harvard-specific but flexible enough to be reused for any content model migration.

Our vision aligns with the following Harvard Library multi-year goals and objectives (MYGOs):

  • MYGO #8: Focus technical services on effective workflows and metadata that matter the most
    • By offering a centralized service for efficient, high-quality, and scalable processing of highly visible (public) files that can be used across campus, we would encourage discontinuing the one-off solutions that individual departments have been developing for lack of a better alternative, which incur additional maintenance costs and produce inconsistent, often substandard output.
  • MYGO #10: Focus on space as a service, considering the most cost-effective approaches to user interests, collections security and preservation, and staff needs in HL and HCL facilities
    • [ To Do ]
  • MYGO #14: Minimize the environmental impact of collections, services, and spaces
    • As described in MYGO #10, the goal of this project is to save computing resources, thus reducing the environmental footprint of our services.

Our vision aligns with the following HUIT objectives and key results (OKRs): 

  • Develop a plan for automation in each service area for critical, frequently used or heavily manual workflows
    • This project seeks to optimize one of the most critical and frequently used workflows in the content production chain.

III. In Scope/Out of Scope

Note: the Media transformation tools charters outline the work needed to create general-purpose media transformation tools for data remediation. The remediation tasks themselves, grouped by issue type, are outlined here and constitute independent projects that follow an entirely different timeline.

In Scope

  • All projects within this section deal exclusively with data migrations for improving the quality of data and metadata handled by DRS.
  • If one-off scripts and specialized tools need to be developed for a specific migration, that would fall within that migration's scope.

Out of Scope