Page Comparison

DASH Development and Migration to Hosted DSpace 8

Project Goals:

Invest in sustainable infrastructure and improvements to DASH's workflow, metadata, metrics, and preservation. This will allow Harvard Library to broaden DASH’s service adoption levels and support Harvard's research lifecycle, data repositories, digital asset management, and digital scholarship systems.

Enhance DASH interoperability

Advance repository collaboration

Establish a sustainable infrastructure

...

Finalize adding finding aids and design how to display hierarchical items from Finding Aids and CURIOSity. Add CURIOSity items to the index and add document type to the eyebrow on the front end. Confirm LLM selection and system prompts through team testing. Finalize front end in preparation for release to QA in Sprint 7.

Vision

Revolutionize how researchers, students, and the global community access and explore Harvard's extensive collections, making all kinds of information easily discoverable and accessible.

Project Goals

Enhance user experience
Improve discovery and accessibility of special and archival collections and all types of digital collections, including but not limited to image, text, audio, video, born-digital, immersive (3D, XR, VR, MR), GIS, etc.
Integrate distinct digital collections discovery platforms, including developing a new one
Investigate and use AI-powered tools to enhance user experience and metadata

Problem and Value Statements

Problem Statement

The current installation (DSpace 6.2) is highly customized to accommodate the D3 workflow, waiver requests, assistance authorizations, quick submit deposit form, and the DASH Stories feedback module. Local customizations slow or prevent adoption of newer releases because local code must be rewritten for each new version. These local customizations, along with in-house platform hosting, have created an environment that lacks flexibility and presents risk for long-term viability.

Solution Business Value

Working with a new Hosting and Services partner (4Science), we will target desirable local customizations to the DSpace 8 release, and modify local practices (where necessary) in order to reduce or eliminate true local customizations. DSpace versions 7.2 and higher are being designed to WCAG 2.0 AA and AAA standards, and these versions would also enable the Harvard Libraries to use the new Entities functionality to provide improved services for journal hosting and overlay journals. There are also ongoing opportunities to collaborate with peer institutions on mutual open source development projects, most immediately around quantitative and qualitative metrics modules, researcher profiles, and metadata harvesting. These solutions will allow OSRDS and LTS to position DASH as a service that will be able to meet current and future sustainability goals for advancing open access to knowledge.

Since its founding, Harvard Library has been a guardian of the University’s memory and a gateway to the world's knowledge. We currently host an array of discovery systems that use different design approaches, organizational priorities, and technology standards. Scholars and the public expect to be able to find trustworthy information and discover resources easily regardless of the system that is managing and providing access to it.

Solution Business Value

By enabling rich cross-collection search, this project will offer end users intuitive, contextual discovery of special collections, archives and digital collections, through a mix of conversational interfaces, browsing that emphasizes the visual nature of materials when appropriate, and recommendations for similar or related resources, all informed by ongoing user research.

Alignment with Harvard Library Multi-Year Goals and Objectives

This project aligns with the FY 24 HL Goals:

Diversify and expand access to knowledge
Maximize the breadth of tangible and digital collections across Harvard and peer institutions, for the benefit of all partners
Increase our focus on acquiring, accessing, and creating digital content that is accessible to all, as open as possible, and permits creative uses of collections as data
Invest in open access infrastructure and services that support equitable, sustainable models for scholarly communication and open knowledge

Alignment with HUIT Objectives

This project aligns with the following FY 23 HUIT Goal:

Identify 20 candidate services that are “at risk” or “unsustainable” and produce action and/or remediation plans

Vision

Position DASH as an exemplary and collaborative next generation repository that supports Harvard Library initiatives in advancing open knowledge. DASH must evolve to become a more interoperable, collaborative, and accessible next-generation repository. With a more sustainable infrastructure and improvements to its workflow, metadata, metrics, and preservation, the Harvard Library can broaden DASH’s services in order to cooperate with Harvard research lifecycle, data repositories, digital asset management, and digital scholarship systems.

Goals:

Enhance DASH interoperability
Advance repository collaboration
Establish a sustainable infrastructure
Improve workflow, refine metadata, and diversify repository content

In Scope/Out of Scope

In Scope

Upgrading to DSpace 7.x or DSpace 8
Working with 4Science to develop DASH Stories for inclusion into the standard code base
Working with 4Science to develop modifications to Waivers/AA/IOAL for inclusion into the standard code base
Reassigning Harvard's authority data into the new DSpace "Entity" items and integrating authorization with HKey/OIDC
Working with 4Science to develop search enhancements for inclusion into the standard code base
Refining importing of data, search statistics, and storage statistics metrics with 4Science
Data migration, hosting and support

Out of Scope

Custom solutions for Harvard workflows unless explicitly necessary to support HUIT security protocols
Custom workflows for end-users of DASH unless explicitly necessary based on critical policy or security protocols

Deliverables and Work Products

Key Tasks and Outcomes

...

Task

...

Outcome

...

Responsible Parties

...

Group 2 development (AA/IOAL/Waiver, search enhancements, landing page)

...

Group 3 development (Import data, IR stats, Metrics)

...

Testing & Acceptance

...

Production Migration

...

Definition of Done

Stakeholders

(Who is sponsoring the work? Who is funding the work? Who will accept the work? What organizations, departments, or people will benefit from this work? Link to related governance structure wiki page(s) where relevant.)

...

Stakeholder

...

Title

...

Participation

...

Project Team

(Roles include: Product Owner, Project Manager, Scrum Master, Business Analyst, Quality Assurance Analyst, Architect, Software Engineer, Systems Engineer, UI Designer, Metadata Analyst, Subject Matter Expert, Release Coordinator)

...

Team Member

...

Affiliation

...

Project Role(s)

...

Cost and Estimated Schedule

...

Scope

In Scope

Cross-collection search for special and archival collections, focusing on the end user experience and making clear the relationships between archival objects/items and larger collections
Incorporate AI/ML technologies to offer natural language search and generative AI features like summarization, while retaining baseline search and browse functions
Provide access to digital content and act as a replacement for HOLLIS for Images and Harvard Digital Collections, extending their use cases to meet project goals: full text searching, born digital, GIS, A/V
Reimagine metadata pipeline using new technologies from AI/ML

Out of Scope

Discovery and access to licensed resources (articles, databases) and general collections

Deliverables and Work Products

Key Tasks and Outcomes

Sprints	Outcome	Responsible Parties
Sprint 1	Gained foundational understanding of back end, and established collaboration practices with each other and other HUIT and LTS colleagues. Demo was not recorded.	Technical Project Team
Sprint 2	Investigated front end frameworks and decided on React, diagramed a draft front end architecture, and "made real" step 3 (semantic retrieval) in order to help begin the front end work. See recording of demo here.	Technical Project Team
Sprint 3	Initialize front end development (big win: working with FastAPI for semantic retrieval), finish deploying semantic retrieval, experiment with one LLM generative feature, and finish indexing the Finding Aids. See recording of demo here.	Technical Project Team
Sprint 4	Continuing work on front end, making it deployable on dev and finishing back end generative AI features work. Planning for usability testing. See recorded demo here.	Technical Project Team
Sprint 5	Fix the data issues with Finding Aids, add new set to index and investigate adding CURIOSity items to index. Finalize front end work and create end to end testing. By end of sprint, estimate when usability testing can begin. See a recording of the demo here.	Technical Project Team
Sprint 6	Finalize adding finding aids to the index and design how to display hierarchical items from Finding Aids and CURIOSity. Add CURIOSity items to the index and add document type to the eyebrow on the front end. Confirm LLM selection and system prompts through team testing. Finalize front end in preparation for release to QA in Sprint 7. See a recording of the demo here.	Technical Project Team
Sprint 7	Finalize and release Collections Explorer alpha to QA so that usability testing can begin in Sprint 8. See a recording of demo here.	Technical Project Team
Sprint 8	Onboard technical lead, demo confidence score investigation on front end, begin technical approach discussions and research for data pipeline, conduct usability tests of QA. See recording of demo here.	Technical Project Team
Sprint 9	Solidifying designs for the data pipeline; deciding on a vector database and scaling considerations based on estimates of metadata records and full text. Begin work on front end components for re-use. Usability analysis will be completed for design changes to "production" Collections Explorer. See recording of demo here.	Technical Project Team
Sprint 10	Set up Airflow locally and deploy code; develop baseline for testing relevancy in Q3; continue to work on front end components and remediate accessibility. See recording of demo here.	Technical Project Team
Sprint 11	Redesign the "Results" page for Collections Explorer based on usability results. Start migration to Next.js for the front end. Evaluate the two shortlisted vector database choices and demo the creation of an embedding document and retrieval in one of the vendor products. Deploy Airflow to our development environment. See recording of demo here.	Technical Project Team
Sprint 12	Continue to make decisions around pipeline architecture and update data models and diagrams. Continue migrating the front end to Next.js. Evaluate cost estimates (AWS and Elasticsearch) and a sandbox for Elasticsearch in order to get closer to an implementation decision. See recording of demo here.	Technical Project Team
Sprint 13	Complete home page redesign sessions, continue building out ingest pipeline integrations with external services, and make a decision on the vector database. See recording of demo here.	Technical Project Team

Definition of Done

Discovery platform, including access to digital assets, is released on production environment and in use by Harvard constituents and the public.

Stakeholders

Executive Stakeholders	Title
Martha Whitehead	VP for Harvard Library and University Librarian
Stu Snydman	AUL & Managing Director for Library Technology Services
Salwa Ismail	AUL for Discovery and Access (Jan. 2025)
Tom Hyry	AUL for Archives and Special Collections

The Library Stakeholders are acting as an extended project team, meeting weekly to help inform and prioritize the work.

Library Stakeholders	Title
Amy Deschenes	Head of UX and Digital Accessibility
Kai Fay	Discovery & Access Strategic Projects Manager
Adrien Hilton	Director of Technical Services for Archives and Special Collections
Chelcie Rowell	Associate Head of Digital Collections Discovery
Shalimar Fojas White	Herman & Joan Suit Librarian, Fine Arts Library
Student interns, as needed	Harvard grad and undergraduate students

Technical Project Team

Team Member	Title	Project Role(s)
Katie Amaral	Technical Project Lead	Developer, Architecture (LTS)
Enrique Diaz	Manager of Library Software Engineering	Product Owner (LTS)
Doug Simon	Senior Digital Library Software Engineer	Developer (LTS)
JJ Chen	Digital Library Data Engineer	Developer (LTS)
Maura Meagher	Associate UX Developer	Developer (LTS)
Carolyn Caizzi	Senior IT Project Manager	Project Manager/Scrum Lead (LTS)
Meg McMahon	UX Researcher	UX Researcher/Designer (HL)

Estimated Schedule

The project is managed using the Scrum framework, and these phases/milestones will be adjusted as needed. Below is a high-level schedule. See more detailed view of project tasks here.

DASH production is live on DSpace v. 8.x hosted with 4Science

Phase	Phase Start	Phase End	Completion Milestone
1	01 Sep 2023	15 Oct 2023	Delivery of all files/configs/databases/data, etc. from HL's current DASH instance
2	01 Oct 2023		Group 1, 2, & 3 development is complete and has passed UAT
3	01 Aug 2024		
1	July 2024	September 2024	Natural language discovery platform with generative AI features for discovering digitized, special and archival collections is built and released to QA for testing.
2	October 2024	December 2024	Platform is tested by end users and improvements are recommended. Research into scaling platform for production is completed. Data pipeline is scoped and work begins. Design process for digitized collections (images) component is completed.
3	January 2025	March 2025	Data pipeline and digitized collections components begin to be built. Decision to soft launch discovery platform is made depending on data pipeline.
4	April 2025	June 2025	Continue building data pipeline and digitized collections components. Platform is monitored for costs, and analytics are gathered and reviewed to plan for a full launch in September 2025.
5-12			Years 2-3 will build out full text search integration, more types of digital collection discovery, and access, as well as continuously improve the platform. Investigation into and possible rollout of workflows for using AI to improve quality of metadata.

Assumptions, Constraints, Dependencies, and Risks

Project Assumptions

- Stakeholders either have or have identified the appropriate subject matter experts to participate in the Working Group and who can accurately and completely define the business requirements for the project and advise on prioritization of work and other project matters
- Stakeholders will have made available the time required to participate in project activities and to complete tasks as requested
- Project sponsor and other stakeholders are empowered to make the decisions required for the project to be a success
- Project sponsor will provide written approval to move forward with system development when requested as part of incremental/iterative system demonstrations

Project Constraints

- Scope - as detailed in contract
- Time - go live is projected for 9/1/2024
- Cost - $128,939.23 inclusive of development, PM, T&M, hosting and support

- Scope - Flexible (all types of digital collections depend on unknowns)
- Time - Fixed 3 year project
- Cost - Fixed 3 year budget

Project Dependencies

ArcLight implementation project
Media Presentation Service upgrade
LibraryCloud reimagine or defining a new data pipeline
DRS Futures project
Rapidly changing LLM industry

Project Risks

(Update during course of project as needed.)

Description	Plan	Impact	Owner
Rapidly changing generative AI space	Build system to be flexible, swap out models easily	Cost, trust	Technical Project Team
Library metadata quality is varied and semantic retrieval works with unstructured data	See if metadata fields can help the quality of embeddings; experiment with different embedding models, focusing on full text content and multi-modal models for digital images	Quality of retrieval	Metadata creators and Technical Project Team
Unexpected changes to other library systems like Aeon, JSTOR Forum	Account for and expect changes from external systems in design of data pipeline	Timeline delays	Technical Project Team
Staff capacity to support work of the project	Meeting weekly with stakeholders to ensure there is enough time to plan for bouts of work that include time from broader staff	Overall project success	Library Stakeholders

Acceptance

Accepted by: Colin Lukens / Library Stakeholders, August 8, 2024

Prepared by: Grace Dunbar / Carolyn Caizzi

Effective Date: August 1, 2023 / August 9, 2024

Version	Old Version 1	New Version Current
Changes made by	Carolyn Caizzi	Carolyn Caizzi
Saved on	Jul 26, 2024	Jan 06, 2025

