DRS Pipelines framework project charter
I. Problem/Value Statement
Problem Statement
DRS is, and will continue to be, a highly integrated service. Data are created, updated, and deleted in many systems within and outside LTS, and many of these events need to be reflected in state changes within DRS. Examples include depositing materials, updating a source system such as Alma or ArchivesSpace, and marking resources as publishable or not.
Currently, users carry out many DRS-related actions manually, including sequences of actions that depend on certain conditions, e.g. depositing resources, creating derivatives, waiting for the resources to become available in DRS, looking up the DRS identifier(s) for those resources, and updating other systems with those identifiers. This is a cumbersome, time-consuming, and error-prone process.
In addition, if an error occurs during any of these steps, the user is often on their own to discover it (e.g. something doesn't show up in DRS after a reasonable time), report the issue to tech support, and wait for a resolution. This obviously leaves many gaps, both in terms of delays and in terms of which problems actually get caught, leaving room for unresolved issues to persist indefinitely.
Most of these actions don't require human judgment and can be automated. In fact, several LTS services, such as DAIS and the DRS queues, already automate some internal processes.
The issue with the existing automation tools is that they are not as extensive as they need to be, and there is no centralized "control tower" to monitor possibly complex chains of actions in order to quickly find and remediate issues. There is also no easy-to-use, well-documented framework for adding new tasks and connecting them to existing ones.
Several software solutions on the market provide an automation framework that takes care of the most complex part of enterprise integration: the orchestration of interdependent, asynchronous tasks. Adopting such a framework would give LTS a common platform to perform both simple and complex tasks, reuse existing services, and maintain a centralized overview of what is happening at any point in time.
Business Value
An automation framework primarily aimed at DRS-related tasks, but shared with LTS at large, would resolve several problems:
- Oversee and facilitate several data transformation pipelines, removing the need for manual intervention and reducing tech support time:
  - Pre-deposit tasks
    - Characterization
    - Metadata extraction
    - etc.
  - Post-deposit tasks
    - Derivative generation
    - Pushing resources slated for publishing to delivery systems
    - Updating dependent systems with new data (e.g. adding the DRS ID to the Alma record after ingest)
- Error reporting
- Automatically retrying recoverable errors
- Provide a facility to create and connect tasks as reusable units of work
- Comprehensive monitoring and reporting of whole event chains, rather than isolated actions
- Perform large-scale migration processes
II. Vision and Approach
The general approach to this project is to use a preexisting framework that provides all the general-purpose tools needed for our ETL pipelines, and to use it to connect in-house-developed microservices that fulfill LTS-specific tasks.
During interviews with candidates for the Data Engineer and Software Engineer positions, Apache Airflow often came up as a widely used product and was generally praised in many respects. Extensive trials confirmed this feedback and showed that Airflow is a very good fit for our goals and for our team's skill set.
The microservices shall be designed to fulfill one unit of work in isolation, e.g. receive an image identifier in a request, fetch the image, extract its metadata, and return the metadata. These units of work shall be orchestrated by the automation framework, which may run multiple services in parallel as needed and will take care of concurrency and resource usage. The microservices shall be completely agnostic of the automation framework.
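As a rough sketch of this division of labor, the hypothetical DAG below orchestrates a stand-alone metadata-extraction service over HTTP. The service name, endpoint, and payload shape are illustrative assumptions, not an existing LTS service; the orchestration constructs (the `@dag`/`@task` decorators and per-task retries) are standard Airflow features.

```python
# Minimal sketch: an Airflow DAG orchestrating a framework-agnostic
# microservice. The service URL, endpoint, and payload are hypothetical.
from datetime import datetime, timedelta

import requests
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def extract_image_metadata():

    @task(retries=3, retry_delay=timedelta(minutes=5))
    def extract(image_id: str) -> dict:
        # The microservice knows nothing about Airflow: it receives an
        # identifier, fetches the image, and returns extracted metadata.
        resp = requests.post(
            "http://metadata-extractor/extract",  # hypothetical service
            json={"image_id": image_id},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()

    @task
    def update_catalog(metadata: dict) -> None:
        # Placeholder: push the extracted metadata to a designated
        # data store or downstream system (e.g. a catalog record).
        ...

    update_catalog(extract("drs:example-id"))


extract_image_metadata()
```

Because scheduling, retries, and concurrency live entirely in the DAG, a recoverable error in the service call (e.g. a transient network failure) would be retried by Airflow without any change to the microservice itself.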
Enterprise integration patterns adopted by the IT community can be implemented in this framework.
Existing ETL services, such as DAIS, could be integrated with this framework in the future, so that we can take advantage of Airflow's management tools and gain a more complete view of the extensive processes of which the current services are only a part.
The chosen solution(s) for the DRS Futures Workspace and Archive functional areas may have their own, similarly structured automation framework(s). These can live alongside the proposed framework, which will cover all tasks outside the other frameworks' designated scope.
Our vision aligns with the following Harvard Library multi-year goals and objectives (MYGOs):
- MYGO #8: Focus technical services on effective workflows and metadata that matter the most
- By setting up a modular framework for digital preservation tasks and other types of automation, we would greatly increase the efficiency of content management and preservation workflows, removing the need for Harvard staff to carry out repetitive and tedious tasks and freeing them to focus on more intellectually demanding work.
- MYGO #12: Improve digital infrastructure, in particular to support the preservation of vulnerable audio-visual collections and their use in teaching and learning
- This project would directly improve the quality of Harvard Library's digital infrastructure by introducing modern approaches and tools to large-scale data handling.
- MYGO #24: Simplify and advance systems to preserve, store, and access library digital assets
- As with MYGO #12, this framework would facilitate the acquisition and preservation of large quantities of digital resources with less staff time, as well as remove the need to create ad-hoc tools.
Our vision aligns with the following HUIT objectives and key results (OKRs):
- Develop a plan for automation in each service area for critical, frequently used, or heavily manual workflows
- This project directly addresses the need for automation of large-scale operations.
III. In Scope/Out of Scope
In Scope
- Implementation of an automation framework based on Apache Airflow
- Deployment on an auto-scaling, self-healing platform (Kubernetes)
- Creation of a few simple microservices for testing purposes
- Extensive integration and load testing to verify robustness under heavy load
Out of Scope
- Development of all microservices for specific tasks (covered by the Media transformation tools project)
- Setting up or running large-scale migrations or data pipelines (covered by the Data quality assessment and improvement and DRS Futures core services projects)
IV. Deliverables and Work Products
Definition of Done
This project will be considered done once:
- An automation framework is set up and deployed in Kubernetes.
- Extensive tests have been performed to assess its capabilities and resource utilization under varying loads (scaling up and down).
- Roles for the various management responsibilities within the framework have been assigned, and corresponding user groups and permissions have been created in Airflow.
- Documentation has been written in the Harvard Wiki, and the staff involved have been trained to use the tools.
V. Stakeholders
| Stakeholder | Title | Participation |
|---|---|---|
| Stu Snydman | Associate University Librarian and Managing Director, Library Technology | Executive Sponsor, Business Owner |
VI. Project Team
| Project Role | Team member(s) |
|---|---|
| Technical Product Owner / LTSLT Owner | |
| Software engineers | JJ Chen (LTS), Brian Hoffman (LTS) |
| QA | Stefano Cossu, JJ Chen, Brian Hoffman |
| Functional documentation | Stefano Cossu, JJ Chen, Brian Hoffman |
| Scrum Master | Stefano Cossu |
| Project Manager | Vitaly Zakuta (LTS) |
VII. Estimated Schedule (tentative)
| Phase | Phase Start | Phase End | Completion Milestone |
|---|---|---|---|
VIII. Assumptions, Constraints, Dependencies, and Risks
Project Assumptions
- Airflow is the base product of choice.
- The DRS Futures team will be in charge of writing DAGs, with input from stakeholders.
- Any service can be integrated, not only those created or hosted by LTS, subject to appropriate security, scalability, and reliability assessments.
- Non-LTS staff may have access to the Airflow UI to monitor processes.
- The DRS Pipelines framework will only store operational data internally. All staff-created content will be pushed to and pulled from designated data stores. As such, the framework itself shall not contain any sensitive data.
Project Constraints
N/A
Project Dependencies
- In order to deploy DRS Pipelines in production, the Kubernetes cluster needs to be set up and ready for use in production.
- The solution(s) chosen for DRS Futures core services will determine the scope and extent of the DRS Pipelines framework.
Project Risks
| Description | Plan | Impact | Owner |
|---|---|---|---|
| Workflows don't scale as expected | Assess the source of the bottlenecks; these can have multiple causes. | Effectiveness of large-scale operations | |
| Core framework presents substantial problems and becomes unsustainable | Replace the framework and migrate the DAGs. While this would be a major endeavor, it could be done without changing the microservices. It is also extremely unlikely, given the support that Airflow currently receives. | Staff time | |
IX. Acceptance
Accepted by [ TODO ]
Prepared by Stefano Cossu
Effective Date: