DRS Pipelines framework project charter
I. Problem/Value Statement
Problem Statement
DRS is, and will continue to be, a highly integrated service. Data are created, updated, and deleted in many systems within and outside LTS, and many of these events need to be reflected in state changes within DRS. Examples include depositing materials, updating a source system such as Alma or ArchivesSpace, and marking resources as publishable or not.
Currently, users carry out many DRS-related actions manually, including sequences of actions that depend on certain conditions, e.g. depositing resources, creating derivatives, waiting for the resources to become available in DRS, looking up the DRS identifier(s) for those resources, and updating other systems with those identifiers. This is a cumbersome, time-consuming, and error-prone process.
In addition, if an error occurs during any of these steps, the user is often on their own to discover it (e.g. something doesn't show up in DRS after a reasonable time), report the issue to tech support, and wait for a resolution. This obviously leaves many gaps, both in terms of delays and in terms of which problems actually get caught, leaving room for unresolved issues to persist indefinitely.
Most of these actions don't require human judgment and can be automated. In fact, several LTS services, such as DAIS and the DRS queues, already automate some internal processes.
The issue with the existing automation tools is that they are not as extensive as they need to be, and there is no centralized "control tower" to monitor possibly complex chains of actions in order to quickly find and remediate issues. There is also no easy-to-use, well-documented framework for adding new tasks and connecting them to existing ones.
Several software solutions on the market provide an automation framework that takes care of the most complex part of enterprise integration: the orchestration of interdependent, asynchronous tasks. Adopting such a framework would give LTS a common platform to perform both simple and complex tasks, reuse existing services, and maintain a centralized overview of what is happening at any point in time.
Business Value
An automation framework primarily aimed at DRS-related tasks, but shared with LTS at large, would resolve several problems:
- Oversee and facilitate several data transformation pipelines, removing the need for manual intervention and reducing tech support time:
  - Pre-deposit tasks
    - Characterization
    - Metadata extraction
    - etc.
  - Post-deposit tasks
    - Derivative generation
    - Pushing resources slated for publishing to delivery systems
    - Updating dependent systems with new data (e.g. adding the DRS ID to the Alma record after ingest)
- Error reporting
- Automatically retrying recoverable errors
- Provide a facility to create and connect tasks as reusable units of work
- Comprehensive monitoring and reporting of whole event chains, rather than isolated actions
- Perform large-scale migration processes
II. Vision and Approach
The general approach to this project is to use a preexisting framework that provides all the general-purpose tools needed for our ETL pipelines, and to use it to connect in-house-developed microservices that fulfill LTS-specific tasks.
During interviews with candidates for the Data Engineer and Software Engineer positions, Apache Airflow often came up as a widely used product and was generally praised in many respects. Extensive trials confirmed this feedback and showed that Airflow is a very good fit for our goals and for our team's skill set.
The microservices shall be designed to fulfill one unit of work in isolation, e.g. receive an image identifier in a request, fetch the image, extract its metadata, and return the metadata. These units of work shall be orchestrated by the automation framework, which may run multiple services in parallel as needed and will take care of concurrency and resource usage. The microservices shall be completely agnostic of the automation framework.
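As a rough sketch of this division of labor, the hypothetical DAG below orchestrates a stand-alone metadata-extraction service over HTTP. The service name, endpoint, and payload shape are illustrative assumptions, not an existing LTS service; the orchestration constructs (the `@dag`/`@task` decorators and per-task retries) are standard Airflow features.

```python
# Minimal sketch: an Airflow DAG orchestrating a framework-agnostic
# microservice. The service URL, endpoint, and payload are hypothetical.
from datetime import datetime, timedelta

import requests
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def extract_image_metadata():

    @task(retries=3, retry_delay=timedelta(minutes=5))
    def extract(image_id: str) -> dict:
        # The microservice knows nothing about Airflow: it receives an
        # identifier, fetches the image, and returns extracted metadata.
        resp = requests.post(
            "http://metadata-extractor/extract",  # hypothetical service
            json={"image_id": image_id},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()

    @task
    def update_catalog(metadata: dict) -> None:
        # Placeholder: push the extracted metadata to a designated
        # data store or downstream system (e.g. a catalog record).
        ...

    update_catalog(extract("drs:example-id"))


extract_image_metadata()
```

Because scheduling, retries, and concurrency live entirely in the DAG, a recoverable error in the service call (e.g. a transient network failure) would be retried by Airflow without any change to the microservice itself.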
Enterprise integration patterns adopted by the IT community can be implemented in this framework.
Existing ETL services, such as DAIS, could be integrated with this framework in the future, so that we can take advantage of Airflow's management tools and gain a more complete view of the extensive processes of which the current services are only a part.
The chosen solution(s) for the DRS Futures Workspace and Archive functional areas may have their own, similarly structured automation framework(s). These can live alongside the proposed framework, which will cover all tasks outside the other frameworks' designated scope.
Our vision aligns with the following Harvard Library multi-year goals and objectives (MYGOs):
- MYGO #8: Focus technical services on effective workflows and metadata that matter the most
- By setting up a modular framework for digital preservation tasks and other types of automation, we would greatly increase the efficiency of content management and preservation workflows, removing the need for Harvard staff to carry out repetitive and tedious tasks and freeing them to focus on more intellectually demanding work.
- MYGO #12: Improve digital infrastructure, in particular to support the preservation of vulnerable audio-visual collections and their use in teaching and learning
- This project would directly improve the quality of Harvard Library's digital infrastructure by introducing modern approaches and tools to large-scale data handling.
- MYGO #24: Simplify and advance systems to preserve, store, and access library digital assets
- As with MYGO #12, this framework would facilitate the acquisition and preservation of large quantities of digital resources with less staff time, as well as remove the need to create ad-hoc tools.
Our vision aligns with the following HUIT objectives and key results (OKRs):
- Develop a plan for automation in each service area for critical, frequently used, or heavily manual workflows
- This project directly addresses the need for automation of large-scale operations.
III. In Scope/Out of Scope
In Scope
- Implementation of an automation framework based on Apache Airflow
- Deployment on an auto-scaling, self-healing platform (Kubernetes)
- Creation of a few simple microservices for testing purposes
- Extensive integration and load testing to verify robustness under heavy load
Out of Scope
- Development of all microservices for specific tasks (covered by the Media transformation tools project)
- Setting up or running large-scale migrations or data pipelines (covered by the Data quality assessment and improvement and DRS Futures core services projects)
IV. Deliverables and Work Products
Definition of Done
This project will be considered done once:
- An automation framework is set up and deployed in Kubernetes.
- Extensive tests have been performed to assess its capabilities and resource utilization under varying loads (scaling up and down).
- Roles for the various management responsibilities within the framework have been assigned, and corresponding user groups and permissions have been created in Airflow.
- Documentation has been written in the Harvard Wiki, and the staff involved have been trained to use the tools.
V. Stakeholders
| Stakeholder | Title | Participation |
|---|---|---|
| Stu Snydman | Associate University Librarian and Managing Director, Library Technology | Executive Sponsor, Business Owner |
VI. Project Team
| Project Role | Team member(s) |
|---|---|
| Technical Product Owner / LTSLT Owner | |
| Software engineers | JJ Chen (LTS), Brian Hoffman (LTS) |
| QA | Stefano Cossu, JJ Chen, Brian Hoffman |
| Functional documentation | Stefano Cossu, JJ Chen, Brian Hoffman |
| Scrum Master | Stefano Cossu |
| Project Manager | Vitaly Zakuta (LTS) |
VII. Estimated Schedule (tentative)
| Phase | Phase Start | Phase End | Completion Milestone |
|---|---|---|---|
VIII. Assumptions, Constraints, Dependencies, and Risks
Project Assumptions
- Airflow is the base product of choice.
- The DRS Futures team will be in charge of writing DAGs, with input from stakeholders.
- Any service can be integrated, not only those created or hosted by LTS, subject to appropriate security, scalability, and reliability assessments.
- Non-LTS staff may have access to the Airflow UI to monitor processes.
- The DRS Pipelines framework will only store operational data internally. All staff-created content will be pushed to and pulled from designated data stores. As such, the framework itself shall not contain any sensitive data.
Project Constraints
N/A
Project Dependencies
- In order to deploy DRS Pipelines in production, the Kubernetes cluster needs to be set up and ready for use in production.
- The solution(s) chosen for DRS Futures core services will determine the scope and extent of the DRS Pipelines framework.
Project Risks
| Description | Plan | Impact | Owner |
|---|---|---|---|
| Workflows don't scale as expected | Assess the source of the bottlenecks; these can have multiple causes. | Effectiveness of large-scale operations | |
| Core framework presents substantial problems and becomes unsustainable | Replace the framework and migrate the DAGs. While this would be a major endeavor, it could be done without changing the microservices. It is also extremely unlikely, given the support that Airflow currently receives. | Staff time | |
IX. Acceptance
Accepted by [ TODO ]
Prepared by Stefano Cossu
Effective Date: