I. Problem/Value Statement
Problem Statement
Part of the DRS Futures project consists of optimizing the data pipelines that feed the LTS delivery services and creating deliverable images that are separate from their archival and production masters, using a format that is optimized for delivery rather than for long-term preservation. This, combined with an extensive remediation of images that are currently undeliverable or very slow to deliver because of the way they were originally generated, should result in a drastic reduction of error rates and an improvement in media delivery times.
We have identified a format into which we want to convert all the still images in DRS, starting with the most problematic ones. This format is High-Throughput JPEG2000 (HTJ2K), a fairly recent standard that is supported by a few encoding and decoding tools. (Note: this conversion would leave the archival images in DRS unchanged; it would only replace the delivery images.)
The leading solution for HTJ2K in terms of encoding speed and efficiency is Kakadu, proprietary software that LTS is currently licensed to use and is using to generate traditional JPEG2000 images. Kakadu, however, has a few disadvantages when it comes to bulk conversion: it offers command-line tools that were never meant to be used in large-scale batch jobs. Additionally, these tools can only convert JPEG2000 to or from another format, not from JPEG2000 to JPEG2000. Since we want to convert old-style JP2 to HTJ2K, we would have to convert each image twice (JP2 to an intermediate format, then the intermediate format to HTJ2K), effectively doubling the conversion time. Multiplied by many millions of images, this can have a dramatic effect on the time scale of the project.
In order to perform large-scale image conversion and offer the best possible on-demand conversion service for future data pipelines, we need to integrate HTJ2K read and write functionality more tightly with our systems, and be able to perform a one-step conversion from old JP2 to HTJ2K.
Business Value
A more efficient, one-step conversion of existing JP2 images would greatly expedite the clearing of a very large backlog of images. Since we are planning to process tens to hundreds of millions of files, even a small decrease in processing time per image could lead to a very significant reduction in the overall duration of the project.
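The scaling argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the image count, per-image timings, and worker count are hypothetical placeholders, not measurements from DRS, and the function names are invented for this example.

```python
SECONDS_PER_DAY = 86_400


def total_days(images: int, seconds_per_image: float, workers: int = 1) -> float:
    """Wall-clock days to convert `images` files at a fixed per-image cost,
    assuming the work is spread evenly over `workers` parallel processes."""
    return images * seconds_per_image / workers / SECONDS_PER_DAY


def days_saved(images: int, two_step_seconds: float, one_step_seconds: float,
               workers: int = 1) -> float:
    """Difference in wall-clock days between a two-step and a one-step conversion."""
    return (total_days(images, two_step_seconds, workers)
            - total_days(images, one_step_seconds, workers))


if __name__ == "__main__":
    # Hypothetical inputs: 100 million images, 2 s per image for a two-step
    # conversion versus 1 s for a one-step conversion, on 100 parallel workers.
    saved = days_saved(100_000_000, two_step_seconds=2.0,
                       one_step_seconds=1.0, workers=100)
    print(f"Estimated days saved: {saved:.1f}")
```

With these placeholder figures, saving one second per image shortens the batch by roughly 11.6 days even with 100 workers running in parallel. Real estimates would need timings measured on representative DRS images.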
Moreover, by integrating a simpler and more efficient conversion tool in our future data transformation pipelines, we would expedite day-to-day operations such as image deposit and publishing in an upcoming DRS Futures scenario. For example, a content manager who has just deposited a large collection of images would wait less time to see that collection published online. This would encourage adoption of an LTS-managed centralized image conversion tool that would deliver consistently high-quality images at high speed.
II. Vision and Approach
The ideal scenario for this project is to have at our disposal the tools to convert images from any format that may be stored in DRS to HTJ2K in the simplest and most resource-efficient way.
Several approaches to this goal have been considered with the help of a Kakadu developer; they are explained in detail in this JIRA ticket. The most beneficial option in most respects is adding JPEG2000 read and write support, via Kakadu, to a Python image library (we are using Python for all our new data pipelines). This would enable not only the desired one-step conversion, but also a wide array of image analysis and manipulation operations while the image is being handled for transformation, which can be used to add validation and optimization steps within the same process.
The most flexible and efficient image library is Vips, which comes with Python bindings and command-line utilities. We have reached out to the Vips maintainer and proposed adding a plug-in to support JPEG2000 read/write using Kakadu. Vips currently has similar functionality based on OpenJPEG, which uses much slower algorithms. Happily, the Vips developer agreed to contribute to the development of such a plug-in and to keep it compatible with future Vips releases.
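As a rough illustration of what a one-step conversion through the Vips Python bindings (pyvips) could look like, the sketch below decodes a source image, runs a simple validation step, and writes the result in the same process without an intermediate file. Several things here are assumptions and not part of this proposal: the helper names `delivery_path` and `convert_one_step`, the output directory layout, and the band-count check are hypothetical. With a stock Vips build, writing a `.jp2` file goes through the existing OpenJPEG-based saver and produces traditional JPEG2000, not HTJ2K; how the Kakadu plug-in will be selected and which options it will expose is not yet defined.

```python
from pathlib import Path


def delivery_path(source: Path, out_dir: Path) -> Path:
    """Map a source image to its delivery file (hypothetical layout):
    same file stem, `.jp2` suffix, placed in `out_dir`."""
    return out_dir / (source.stem + ".jp2")


def convert_one_step(source: Path, out_dir: Path, **save_options) -> Path:
    """Decode `source` and re-encode it in a single process, with no temp file.

    `save_options` are passed straight to the saver chosen by Vips for the
    target suffix; the options accepted by a future Kakadu saver are unknown.
    """
    # Imported lazily so that the path helper above works without libvips.
    import pyvips

    target = delivery_path(source, out_dir)

    # Request sequential (streaming) access to the source pixels.
    image = pyvips.Image.new_from_file(str(source), access="sequential")

    # Example of an in-process validation step (hypothetical rule):
    # accept only greyscale, RGB, or RGBA images.
    if image.bands not in (1, 3, 4):
        raise ValueError(f"{source}: unexpected band count {image.bands}")

    # The saver is selected from the target file suffix.
    image.write_to_file(str(target), **save_options)
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(delivery_path(Path("/data/in/img001.jp2"), Path("/data/out")))
```

The point of the sketch is the shape of the workflow: one read, optional checks on the decoded image, one write. The actual plug-in interface would replace the assumptions noted above.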
The proposed plan is to hire the Vips maintainer to write the code, tests, and documentation for the Kakadu plug-in and deliver it in an LTS-owned Git repository. The plug-in will be licensed under Apache v2.0 and will be available to the public (with the caveat that Kakadu itself is proprietary software and requires a license). This may chiefly benefit institutions in the IIIF community, many of which use JPEG2000 and have Kakadu integrated in their workflows.
Our vision aligns with the following Harvard Library multi-year goals and objectives (MYGOs):
- MYGO #8: Focus technical services on effective workflows and metadata that matter the most
- By offering a centralized service for efficient, high-quality, and scalable processing of highly visible (public) images that can be used across campus, we would encourage discontinuing the one-off solutions that individual departments have been developing for lack of a better alternative, which incur additional maintenance costs and produce inconsistent, often substandard output quality.
- MYGO #10: Focus on space as a service, considering the most cost-effective approaches to user interests, collections security and preservation, and staff needs in HL and HCL facilities
- HTJ2K is the most space- and computationally efficient format for Web quality images today. By converting existing lossless DRS images into HTJ2K we would significantly reduce storage, computing and I/O usage on the most traffic-heavy (and expensive) tier of our infrastructure.
- MYGO #14: Minimize the environmental impact of collections, services, and spaces
- As described in MYGO #10, the goal of this project is to save computing resources, thus reducing the environmental footprint of our services.
Our vision aligns with the following HUIT objectives and key results (OKRs):
- Develop a plan for automation in each service area for critical, frequently used or heavily manual workflows
- This project seeks to optimize one of the most critical and frequently used workflows in the content production chain.
III. In Scope/Out of Scope
In Scope
- Developing a plug-in for the Vips image library that uses Kakadu for reading and writing JPEG2000 images.
- Integrating the plug-in into the currently developed imgconv project.
- Keeping the plug-in up to date with future Vips upgrades
- Documentation and tests for the developed code
Note that this project is limited to still images. Other media may need a different workflow and approach.
Out of Scope
The following items are out of scope because they are achievable without this project; however, an optimized HTJ2K converter would significantly improve their quality:
- Development of a microservice for converting images based on configurable profiles (imgconv)
- Integration of imgconv into an automation framework for large scale processing (drs-pipelines)
- Conversion of defective and/or substandard delivery images into delivery-optimized HTJ2K
IV. Deliverables and Work Products
- Code, documentation and tests for a Kakadu plugin for Vips in a Git repository.
Definition of Done
This project will be considered done once:
- The Vips Kakadu plugin code is completed and committed to a Harvard-owned, public Git repository
- We are able to compile and run the plugin
- Comprehensive tests are written for the key functions
- All tests pass
- Exhaustive relevant documentation is provided
- We are able to integrate the delivered code into our imgconv project and verify that the features and options satisfy our needs.
V. Stakeholders
Stakeholder | Title | Participation |
---|---|---|
Stu Snydman | Associate University Librarian and Managing Director, Library Technology | Executive Sponsor, Business Owner |
VI. Project Team
Project Role | Team member(s) |
---|---|
Technical Product Owner / LTSLT Owner | Stefano Cossu |
Software engineers | John Cupitt (independent contractor - development), Pierre-Anthony Lemieux (independent contractor - consulting)?, Brian Hoffman (LTS - integration), JJ Chen (LTS - integration) |
QA | |
Functional documentation | John Cupitt |
Scrum Master | Stefano Cossu |
Project Manager | Vitaly Zakuta |
VII. Estimated Schedule (tentative)
Phase | Phase Start | Phase End | Completion Milestone |
---|---|---|---|
Development & Release | 11/27/2023 | 12/04/2023 | Complete code |
Development & Release | 12/05/2023 | 12/08/2023 | Unit and integration tests |
Development & Release | 12/11/2023 | 12/15/2023 | Complete & validate documentation |
Development & Release | 12/11/2023 | 12/22/2023 | imgconv integration and release |
VIII. Assumptions, Constraints, Dependencies, and Risks
Project Assumptions
- The code delivered by the contractor will be agnostic to external integrations.
- The plugin developer will be provided a free and renewable Kakadu SDK license to perform development, testing, and long-term maintenance (this process is currently underway).
- Integration with imgconv and DRS pipelines, including deployment infrastructure and long-term maintenance, will be the responsibility of the DRS Futures engineering team.
- Stakeholders will be available to participate in project activities and to complete tasks as requested.
- The Executive Sponsor and other stakeholders are empowered to make the decisions required for the project to be a success.
Project Constraints
- Contractor availability
- Team availability
- Scope
- Time
- Cost
Project Dependencies
- The new v3 IIIF manifest design, required for development of Viewer architecture and functionality, is not yet available for informing development of the upgraded Viewer application and plug-ins.
Project Risks
Description | Plan | Impact | Owner |
---|---|---|---|
IX. Acceptance
Accepted by [ TODO ]
Prepared by Stefano Cossu
Effective Date: [ TODO ]