I. Problem/Value Statement
Problem Statement
DRS, as it currently stands, sets a very low bar for the quality of the media that can be uploaded as archival masters. This allows users to preserve files even if the only available copy is not deemed preservation-worthy according to DP community standards, which is better than not having any version of some contents available; however, the lack of any validation and notification for files that may be unsuitable for online visualization has created numerous problems in MPS, the LTS delivery service. This has an especially severe effect on still images, the majority of which are stored as losslessly compressed JPEG2000 (JP2). Many of these images are causing severe performance penalties or even crashes of the LTS delivery servers. Thus, a comprehensive scan of the current image collection that identifies the problematic images, as well as a pathway to convert them, if possible, to valid, web-optimized images, has been deemed necessary.
In order to optimize images for web display, as well as to reduce storage costs on a very expensive storage tier of our infrastructure, ad-hoc derivative images should be generated that take advantage of the most advanced encoding and compression algorithms.
This approach poses a problem, because currently the same lossless JP2s are used for both archiving and delivery. When an image is published on DRS, the archival master is simply copied over to a delivery location. Image identifiers and file names are stored in delivery and name resolution systems, and cannot be easily separated. Additional steps need to be devised to replace the current references to the archival master copies with references to the new delivery derivatives.
Business Value
Leaving delivery images in their current state will continue to cause major disruption to delivery services and to drain LTS staff's time. Hence, however laborious the process of remediating and separating out archival and delivery files may be, it is necessary and worth the effort. This initiative would bring several advantages, including:
- Overall better quality and performance of delivery services;
- Reduced LTS developer and tech support time in the long run;
- Lower storage, I/O, and computing costs for delivering images;
- Better control of and insight into our images' quality;
- Content providers would no longer be obligated to provide JPEG2000 for archival masters (as long as they are provided in a preservation-worthy format, or can be converted into one upon deposit, different original image formats can be optimized into delivery formats);
- Better separation of concerns between archival and delivery files.
II. Vision and Approach
The first step of this project would be to identify the types of problems, their sources, the number of affected images, and the severity of each problem from the perspectives of end users and maintenance staff. Once that is completed, several remediation projects can be planned and prioritized according to the severity and extent of the issues that would be resolved.
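As an illustration of what a first triage pass over the collection could look like, the following is a minimal sketch, not the planned scanner. It only checks the leading bytes of each file against the JPEG 2000 signature box and the raw codestream start markers, and counts the results. The function names, the labels, and the `*.jp2` file pattern are assumptions made for this example; a real scan would also need to decode the images and measure delivery performance.

```python
from collections import Counter
from pathlib import Path

# First 12 bytes of a JP2 file: the JPEG 2000 signature box.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# A raw JPEG 2000 codestream starts with the SOC marker followed by SIZ.
J2K_CODESTREAM_START = bytes.fromhex("FF4FFF51")


def classify_header(header: bytes) -> str:
    """Return a coarse triage label for the first bytes of a file.

    The labels are hypothetical and chosen for this sketch only.
    """
    if len(header) == 0:
        return "empty"
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(J2K_CODESTREAM_START):
        return "raw-codestream"
    return "not-jpeg2000"


def scan_tree(root: Path, pattern: str = "*.jp2") -> Counter:
    """Count triage labels for every file under root matching pattern."""
    counts = Counter()
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # 12 bytes are enough to cover both signatures checked above.
            counts[classify_header(handle.read(12))] += 1
    return counts
```

Calling `scan_tree(Path("/some/delivery/root"))` (a placeholder path) would return a `Counter` keyed by label, which gives a rough count of files that are empty or mislabeled before any deeper analysis is attempted.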
In order to start any actual remediation process, several other things need to be done:
- Identifying the source of each issue type, and ensuring that no new images with the same type of problem are introduced. This may require contacting the original depositors and inquiring about the source of the images (often they come from third-party sources, or have been digitized by individual departments' staff without any quality control by an imaging specialist).
- Offering an alternative way to generate delivery derivatives. This alternative would be the media transformation tools that LTS is building for massive-scale media conversion.
- Devising and implementing a process to replace internal references to current delivery files with references to newly generated ones in LTS services.
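The reference replacement step in the last item could be sketched as follows. This is a hypothetical illustration only: it assumes the references can be represented as a mapping from image identifier to delivery path, which is a simplification of the delivery and name resolution systems, and all names are invented for the example. The sketch swaps a reference only when a new derivative exists for that identifier, and records the previous path so the change can be reverted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemapResult:
    updated: dict[str, str]   # identifier -> delivery path after the swap
    rollback: dict[str, str]  # identifier -> previous path, swapped entries only


def remap_references(
    current: dict[str, str], derivatives: dict[str, str]
) -> RemapResult:
    """Point existing references at new derivatives, keeping a rollback map.

    Identifiers without a derivative keep their current path. Derivatives for
    identifiers that are not currently referenced are ignored.
    """
    updated = dict(current)
    rollback: dict[str, str] = {}
    for identifier, new_path in derivatives.items():
        if identifier in current and current[identifier] != new_path:
            rollback[identifier] = current[identifier]
            updated[identifier] = new_path
    return RemapResult(updated=updated, rollback=rollback)
```

Keeping the rollback map separate from the updated references is a design choice for this sketch: it lets a batch of replacements be undone without consulting the archival store again.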
Our vision aligns with the following Harvard Library multi-year goals and objectives (MYGOs):
- MYGO #8: Focus technical services on effective workflows and metadata that matter the most
- By offering a centralized service for efficient, high-quality, and scalable processing of highly visible (public) images that can be used across campus, we would encourage departments to discontinue the one-off solutions they have developed for lack of a better alternative, which incur additional maintenance costs and produce inconsistent, often substandard output quality.
- MYGO #10: Focus on space as a service, considering the most cost-effective approaches to user interests, collections security and preservation, and staff needs in HL and HCL facilities
- HTJ2K is the most space- and computationally efficient format for Web quality images today. By converting existing lossless DRS images into HTJ2K we would significantly reduce storage, computing and I/O usage on the most traffic-heavy (and expensive) tier of our infrastructure.
- MYGO #14: Minimize the environmental impact of collections, services, and spaces
- As described in MYGO #10, the goal of this project is to save computing resources, thus reducing the environmental footprint of our services.
Our vision aligns with the following HUIT objectives and key results (OKRs):
- Develop a plan for automation in each service area for critical, frequently used or heavily manual workflows
- This project seeks to optimize one of the most critical and frequently used workflows in the content production chain.
III. In Scope/Out of Scope
In Scope
- Developing a plug-in for the Vips image library that uses Kakadu for reading and writing JPEG2000 images.
- Integrating the plug-in into the currently developed imgconv project.
- Keeping the plug-in up to date with future Vips upgrades
- Documentation and tests for the developed code
Note that this project is limited to still images. Other media may need a different workflow and approach.
Out of Scope
The following items are out of scope because they are achievable without this project; however, an optimized HTJ2K converter would significantly improve their quality:
- Development of a microservice for converting images based on configurable profiles (imgconv)
- Integration of imgconv into an automation framework for large scale processing (drs-pipelines)
- Conversion of defective and/or substandard delivery images into delivery-optimized HTJ2K
IV. Deliverables and Work Products
- Code, documentation and tests for a Kakadu plugin for Vips in a Git repository.
Definition of Done
This project will be considered done once:
- The Vips Kakadu plugin code is completed and committed to a Harvard-owned Git repository
- We are consistently able to compile and run the plugin
- Comprehensive tests are written for the key functions
- All tests pass
- Exhaustive relevant documentation is provided
- We are able to integrate the delivered code into our imgconv project and verify that the features and options satisfy our needs.
V. Stakeholders
Stakeholder | Title | Participation |
---|---|---|
Stu Snydman | Associate University Librarian and Managing Director, Library Technology | Executive Sponsor, Business Owner |
VI. Project Team
Project Role | Team member(s) |
---|---|
Technical Product Owner / LTSLT Owner | Stefano Cossu (LTS) |
Software engineers | John Cupitt (independent contractor - development), Pierre-Anthony Lemieux (independent contractor - consulting)*, Brian Hoffman (LTS - integration), JJ Chen (LTS - integration) |
QA | Stefano Cossu, Brian Hoffman |
Functional documentation | John Cupitt |
Scrum Master | Stefano Cossu |
Project Manager | Vitaly Zakuta (LTS) |
* Pierre-Anthony Lemieux deferred to John Cupitt for carrying out the development, remaining available for advising on Kakadu-specific topics. We have not yet established whether such consulting will be pro bono or for a fee. In the latter case, we should request a cost estimate, but ideally we would like to have one contractor billing for the whole project.
VII. Estimated Schedule (tentative)
Phase | Phase Start | Phase End | Completion Milestone |
---|---|---|---|
Development & Release | 11/27/2023 | 12/04/2023 | Develop code |
Development & Release | 12/05/2023 | 12/08/2023 | Unit and integration tests |
Development & Release | 12/11/2023 | 12/15/2023 | Complete & validate documentation |
Development & Release | 12/11/2023 | 12/22/2023 | imgconv integration and release |
VIII. Assumptions, Constraints, Dependencies, and Risks
Project Assumptions
- The code delivered by the contractor will be agnostic to external integrations.
- Integration with imgconv and DRS pipelines, including deployment infrastructure and long-term maintenance, will be the responsibility of the DRS Futures engineering team.
- Stakeholders will be available to participate in project activities and to complete tasks as requested.
- The Executive Sponsor and other stakeholders are empowered to make the decisions required for the project to be a success.
Project Constraints
- Contractor availability
- DRS Futures team availability
- Scope
- Time
- Budget
Project Dependencies
- The plugin developer will need a Kakadu SDK license to perform development, testing, and long-term maintenance of the requested software. Kakadu Software has agreed to provide John Cupitt a free and renewable Kakadu SDK license. The hand-off of that license is underway.
Project Risks
Description | Plan | Impact | Owner |
---|---|---|---|
IX. Acceptance
Accepted by [ TODO ]
Prepared by Stefano Cossu
Effective Date: [ TODO ]