
I. Problem/Value Statement

Problem Statement

The Harvard Geospatial Library (HGL) enables researchers to discover and easily access the wealth of geospatial data available to the Harvard community. Data sets are available from around the world at various scales, from global to local. Each data set is delivered with complete metadata, making it easier to add to a geographic information system (GIS) and compare to other data sets about the same place.

HGL currently uses OpenGeoportal (OGP), a platform that is no longer developed or supported. The platform has led to reliability and stability problems. It is also impossible to make any improvements to the HGL user interface because there are no developers who can work on the OGP source code.

LTS has also developed custom programs for loading data into HGL’s GeoServer, which stores and delivers the map data. After a necessary infrastructure change, the loading programs stopped working for an important category of material. Scanned maps can be loaded, but the process is still very cumbersome.

HGL relies on LTS’s Access Management Service (AMS) to provide authorized access to licensed data sets. AMS is being retired. Current systems are being re-engineered to use more centrally supported Harvard systems for authentication and authorization. Sooner or later HGL will also be required to use these centrally supported systems.

The Harvard Library intends to modernize its implementation of a geospatial data access and discovery layer, establish a sustainable data-loading workflow, and make geospatial data downloadable.

Business Value

The work proposed here addresses a long-standing list of requests made by students, researchers, faculty and stakeholders over the course of several years. This project will follow the recommendation of the Harvard Geospatial Working Group and transition HGL from the current open source platform, OpenGeoportal (OGP, developed at Tufts), to a new open source platform, GeoBlacklight (GBL, developed primarily at Stanford). Harvard will become an active participant in the GBL community of users, which includes many peer institutions, among them six Ivy Plus members.

Creating a robust and sustainable environment through which maps and myriad forms of geospatial data can be discovered, explored and downloaded fulfills a core tenet of the Library’s mission and remediates an unstable and outdated data ingest solution. It is critical that the Library leverage these community resources to reduce the practical costs of ownership and development, and to increase its viability as a consortial partner in the GIS scholarly community.

...

current DRS storage infrastructure (disk and tape) is in the final year of its service term and must be replaced. At the same time, DRS business owners would like to expand the range of storage options (e.g., cloud, external, etc.) and provide greater flexibility in replication policy.

Business Value

This storage refresh is critical for maintaining continuity of DRS service. It is also expected to lower provisioning and operational costs, which can then be reflected in reduced DRS pricing. Support for a wider range of storage options will streamline future incorporation of new technical solutions. It also enables policy-driven replication that better aligns curatorially designated value and goals with the technical characteristics of the storage components best suited to meeting those goals. This in turn will reduce overall costs (by minimizing the number and type of replicas) and maximize use of finite resources (space no longer used for a copy of A can now be used for a copy of B).

II. Vision and Approach

The redesign and provisioning of HGL will use the open source GeoBlacklight platform and establish a development-to-production environment for HGL based on LTS protocols and standards. The project will build on the knowledge gained from the S.T. Lee grant project, which used GeoBlacklight to deliver index maps, and will expand the offerings to include all the types of data that are now included in HGL. The redesign will preserve existing discovery capabilities for geospatial data from non-Harvard repositories and reaffirm the Library’s commitment to extensible data ingest and discovery from sources beyond the Library.

...

infrastructural capacity will place an emphasis on leveraging existing storage capabilities within the University (e.g., FAS RC) and consortial/commercial options outside the University (e.g., NESE, Iron Mountain). Conceptually, the DRS will now incorporate a storage broker architecture in which each file will be assigned a curatorially-designated storage classification that will control the variable degree of replication. New pricing will reflect the underlying provisioning/operating costs of the various storage options. Billing will reflect the differential prices of the various storage options utilized for a given file.
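As a rough illustration of the storage broker concept described above, the following Python sketch maps a file's storage class to a set of replica locations and derives a quarterly charge from per-location prices. The class names, location keys, replication mapping, and prices are all hypothetical placeholders chosen for illustration; they are not the DRS design or its actual pricing.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical storage classes and the replica locations each one implies.
# The mapping is an illustrative assumption, not the charter's policy.
REPLICATION_POLICY: dict[str, list[str]] = {
    "high": ["disk_primary", "disk_secondary", "cloud", "tape"],
    "standard": ["disk_primary", "cloud", "tape"],
    "low": ["disk_primary", "tape"],
}

# Placeholder prices in cents per GB per quarter (not actual DRS pricing).
PRICE_CENTS_PER_GB_QUARTER: dict[str, int] = {
    "disk_primary": 5,
    "disk_secondary": 5,
    "cloud": 3,
    "tape": 1,
}


@dataclass(frozen=True)
class StoredFile:
    file_id: str
    size_gb: int
    storage_class: str  # curatorially designated classification


def replicas_for(f: StoredFile) -> list[str]:
    """Return the replica locations implied by the file's storage class."""
    return REPLICATION_POLICY[f.storage_class]


def quarterly_charge_cents(f: StoredFile) -> int:
    """Sum the per-location price over every replica the class requires."""
    return sum(
        PRICE_CENTS_PER_GB_QUARTER[location] * f.size_gb
        for location in replicas_for(f)
    )


if __name__ == "__main__":
    example = StoredFile("example-1", size_gb=10, storage_class="low")
    print(replicas_for(example), quarterly_charge_cents(example))
```

Keeping the policy and the price list as separate tables means a change in replication policy or in the cost of one storage option can be made without touching the other, which is one way to read the "differential prices" requirement.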

III. In Scope/Out of Scope

In Scope

Essential interface components

  • Authorization for restricted sets that doesn’t rely on AMS

  • Search of data using limits and facets on results

  • Relevance ranking and weighting - predefined

  • Index map display support

  • Index map facet for searching

  • Dataset preview on a map

  • Method to download vector and raster data as well as scanned maps

  • Method to download record metadata

  • Method to link back to individual record

Essential interoperability components

  • Method for providing a link from a HOLLIS record for a single data layer to the corresponding record in HGL

  • Method for providing a link from a HOLLIS record of a collection of data layers to a search result in HGL with all the data layers

  • Method for making HGL records available in HOLLIS

  • Method for sending metadata records to OpenGeoMetadata (https://github.com/OpenGeoMetadata) on at least an intermittent basis

  • Preserve existing discovery capabilities of geospatial data from non-Harvard institutions and commitment to extensibility of data ingest and discovery beyond Harvard Library

Essential infrastructure components

  • Dev/QA/Prod servers running GeoBlacklight

  • Solr index with current HGL data in GeoBlacklight Schema

  • Supported storage for index map GeoJSON files

  • Method for depositing data into GeoServer - and determining which data types will be supported

  • Data deposit method that is extensible to new spatial data sources outside of the Map Collection

  • Method for having developers/designers commit changes to interface and view

  • Evaluate current version of HGL GeoServer for compatibility with required functionality in GeoBlacklight

  • Evaluate need for database tables used for data export and download

  • Evaluate GeoCombine as a tool for managing standardized GIS metadata - to inform data publishing decisions

  • Evaluate and document a dev upgrade path for GeoServer and, if needed, its implications for data migration

Out of Scope

  • Preserving shopping cart feature from current HGL/OGP that allows for the selection of multiple files for download

  • Decision on metadata format - FGDC vs ISO 

  • Using persistent identifiers (URNs) for layer names and persistent links (URNs) in metadata

  • Preservation of vector data in DRS

  • Preservation of FGDC metadata in DRS

  • Automated method for sharing metadata records with OpenGeoMetadata

  • Web mapping services (WMS) and tile mapping services (TMS) 

  • Determining methods for reducing tile cache storage size

  • GeoServer upgrade - unless it’s for a critical need

  • Relevance ranking and weighting - user defined

  • Autosuggest with related terms

  • Making multiple formats available for ingest and export (GeoJSON, Geodatabase, GeoPackage, CSV) 

  • Making offline datasets discoverable 

  • Making geospatial data from Dataverse available for search and delivery

...

  • An HGL solution that uses Harvard centralized systems for authentication and authorization of users who want to use licensed data sets.

  • A GeoBlacklight implementation of HGL that supports search, discovery, display, download and reuse of:

    • vector and raster datasets

    • georeferenced historical maps

    • index maps

  • An HGL solution that provides access to all data in the current HGL implementation

  • Supported and documented method for depositing data into HGL

  • Supported and documented method for storing new index map data for use in HGL

  • Supported and documented infrastructure for Dev/QA/Prod instances of HGL

  • Supported and documented methods for updates and upgrades to HGL components including GeoBlacklight, GeoServer, and Solr

  • Understanding of performance expectations related to rendering large historic maps  

  • Evaluation of need for custom database tables to support integration with Alma and downloads of DRS files

  • Evaluation of GeoCombine as a tool for managing standardized GIS metadata - to inform data publishing decisions

Definition of “Done”

The HGL/GeoBlacklight project will be considered done when:

  • Stakeholders accept that in-scope work has been delivered

  • Operations team has the tools to support system deployments and upgrades

  • HGL with the GeoBlacklight front-end is deployed to production and accessible to users

  • All current HGL data layers are discoverable and deliverable

  • Stakeholders accept plan for GeoServer upgrade 

  • Documented plan to fully retire old HGL

...

  • BatchBuilder
  • WebAdmin

Essential interoperability components

  • Starfish storage management infrastructure

Essential infrastructure components

  • HUIT Research Computing managed storage infrastructure at Markley
  • FAS RC storage infrastructure at MGHPCC
  • AWS S3 Infrequent Access
  • Tape infrastructure at NESE
  • Tape warehousing at Iron Mountain
  • Snowball devices for movement of deliverable content to S3
  • Oracle DB
  • Fscheckd monitoring script
  • Quarterly billing script

Out of Scope

  • Interoperability with storage at FAS RC's HPC cluster at MGHPCC
  • Interoperability with USC DR

IV. Deliverables/Work Products

  1. HUIT Research Computing ECS at Markley
  2. FAS RC ECS at MGHPCC
  3. AWS S3 Infrequent Access
  4. Tape at NESE
  5. Tape at Iron Mountain
  6. Encryption of level 4 content written to the tape environments
  7. Definition of OCFL structure for ECS file system and S3 object storage
  8. Retrospective classification of the storage class of all existing files
  9. Extension to DB schema
  10. BatchBuilder and WebAdmin support for storage (re)classification
  11. New pricing scheme based on partial cost recovery
  12. Revised quarterly billing script supporting non-uniform storage replication
  13. Updated BatchBuilder and WebAdmin online documentation and training material
  14. Switch to a different SFTP client and updates of depositor scripts to interact with ECS-hosted drop boxes; updates to "remote dropbox" setups that exist for Media Preservation, Imaging Services and Harvard Art Museums; related training, testing and documentation

Definition of "Done"

The DRS Refresh project will be considered done when:

  1. New hardware/software components are fully deployed
  2. Migration of all retrospective content to new storage environment
  3. Disposition of all prospective content to new storage environment
  4. BatchBuilder and WebAdmin updates are complete and training and documentation describe how to specify/update curatorial storage class attribute
  5. Dynamic storage reallocation (delete/copy as necessary) is fully functional at point of initial deposit and on an ad hoc basis via WebAdmin
  6. Fscheckd monitoring script re-written and pointed at all new storage options
  7. VPDR/LLT approval of new pricing scheme in consultation with stakeholders
  8. Quarterly billing script supports non-uniform storage replication
  9. Decommissioning and removal of old equipment
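The "delete/copy as necessary" reallocation in item 5 can be pictured as a set difference between where a file's replicas currently live and where its new storage class says they should live. The Python sketch below shows only that comparison; the function name and location keys are hypothetical, and it does not model the actual DRS, BatchBuilder, or WebAdmin behavior.

```python
from __future__ import annotations


def plan_reallocation(current: set[str], target: set[str]) -> dict[str, list[str]]:
    """Compare current replica locations with the target set.

    Returns the locations that need a new copy and the locations whose
    copy is no longer required. Results are sorted for stable output.
    """
    return {
        "copy": sorted(target - current),
        "delete": sorted(current - target),
    }


if __name__ == "__main__":
    # A file reclassified so that it no longer needs a cloud replica
    # but now needs a second disk copy (location names are placeholders).
    plan = plan_reallocation(
        current={"disk_primary", "cloud", "tape"},
        target={"disk_primary", "disk_secondary", "tape"},
    )
    print(plan)
```

In practice a process built on such a plan would presumably complete and verify the copies before carrying out any deletions, so that a file never drops below its intended replica count during the transition.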

V. Stakeholders and Project Team

Stakeholders

Stakeholder

Title

Participation

Bonnie Burns

Stephen Abrams

Head of Geospatial Resources, Harvard Map Collection

Business Sponsor and Service Owner

Marc McGee

Geospatial Metadata Librarian

Product owner and metadata 

GeoSpatial Working Group

Advisory and testing support

Stu Snydman

Associate University Librarian and Managing Director for Library Technology

Advisory

...

digital preservation/ DRS business owner

Use cases, requirements, conceptual design

Stewardship Standing Committee


Review/comment

DRS collection managers


Review/comment/training

Harvard media and image digitization and preservation practitioners

  • Head of Media Preservation
  • Head of Imaging Services
  • Director of Digital Infrastructure and Emerging Technologies at Harvard Art Museums
  • Other media and image digitization and preservation practitioners who deposit and administer DRS content

Review/comment/test

Project Team**

Team Member

Role(s)

Affiliation

Enrique Diaz

Project Co-Manager & Scrum Master

Head of Design & Development, DSI, HL

Paul Aloisio

Project Co-Manager

Systems Librarian, LTS, HUIT

Phil Plencner

Software Engineer

Senior Developer, DSI, HL

Tom Scorpa

Operational Resources

Production Operations Lead, LTS, HUIT

Marc McGee

Metadata Analyst & Product Owner

ITS, HL

Scott Walker

Business Analyst

Robin Wendler

Metadata Consultant

LTS

...

Stephen Abrams

Business owner

DPS

Tricia Patterson

Business owner

DPS

Vitaly Zakuta

Project manager/Scrum master/Analyst

LTS

Anthony Moulen

Architect

LTS

Andrew Woods

Consultant

LTS

Sharon Bayer

Infrastructure project manager

LTS

Chris Vicary

Technical lead / software engineer

LTS

David Neiman

Software engineer

LTS

Jessica Jassal

Software engineer

LTS

Valdeva Crema

Software engineer

LTS

Tom Scorpa

ProdOps lead / storage manager / systems administrator

LTS

Jason Knight

ProdOps / systems administrator

LTS

Benson Smith

DB admin

LTS

Julie Wetherill

Agile product owner / Analyst / QA / Documentation / Training

LTS

Janet Taylor

UI/UX (4/27-on)

LTS

** Other team members may be added if work requires it

...

VI. Schedule

...

Phase

Phase Start

Phase End

Milestone

Milestone Date

Planning

12

3/

8

9

3/

2020

29

Project Charter approved by all stakeholders

12


3/

8/2020

29

Preparation

12

3/

8/2020

30

1

4/

19/2021

Development environment provisioned, configured and running; evaluations completed

development assessment (go/no-go)

1/19/2021

Development

1/19/2021

3/30/2021

Production-ready codebase ready for QA testing

3/30/2021

12

Technical design complete

4/12

Development


4/13


8/16

All development tasks are complete


8/16

Move to Production

3


8/

30/2021

17

4


8/

13/2021

Check ProdOps for release schedule

...

30

Move to production complete and accepted by stakeholders


8/30

VII. Key tasks and outcomes

Tasks

Outcomes

Responsible Parties

Approve Project Charter

Agree on Project Charter with regards to:

  • Stakeholders

  • Scope

  • Deliverables

  • Schedule

Business Sponsor

Meeting schedule

Sprint ceremonies

Project

Co-Managers

Manager

Project infrastructure

  • Populate Jira project board

  • Set up wiki page for LTS Operations

  • Set up dev/qa environments

  • Provision code repository

Project

Co-Managers and Business OwnerOperational Resources

Manager, Tech Lead, Architect, ProdOps

Development

  • Implementation of user stories

  • Based on scope and deliverables from charter

  • Reviewed and accepted by Product Owner

Project Team

Communication & Outreach planning

  • Demos to stakeholders, GeoSpatial Working Group, campus community

  • CC article, newsletter submission, promotion

  • Project Co-Managers

    Move new HGL to production

    • HGL with GeoBlacklight is the public interface for HGL.

    • OpenGeoportal interface shut down

    Operational Resources and Project Team

    ...

    • Email communication

    • Live updates to stakeholders (monthly?)

    Project Manager; Business Owner

    Move to production



    Project Team

    VIII. Assumptions, Risks, and Constraints

    Constraints

    • Cost: this project does not account for additional costs incurred by running multiple instances of beta and production in parallel

    Assumptions

    • Stakeholders have identified the appropriate subject matter experts to participate in the project and who can accurately and completely define the business requirements for the project

    • Stakeholders will have made available the time required to participate in project activities and to complete tasks as requested

    • Project sponsor and other stakeholders are empowered to make the decisions required for the project to be a success

    • Existing GeoServer implementation is compatible with the newest version of GeoBlacklight

    Risks

    ...

    Risk: Reliance on legacy authorization system (AMS).
    Plan:
    Integrate with HarvardKey directly using methods already developed in recent LTS/DSI projects
    Impact:
    Without authentication/authorization, restricted material would not be available for download
    Owner:
    Software Engineer

    ...

    • Service contracts on existing hardware expire in September 2021
    • Updated software needs to be in production before start of Fall Semester 2021

    Assumptions

    • Prior completion of BatchBuilder Java upgrade

    Risks

    • Risk: Work extends beyond the August 31 expiration of existing hardware
    • Plan:
    • Impact:
    • Owner:

    Appendix

    Definitions of Roles

    • Business Owner - Provide vision and direction of product
    • Product Owner - Define, prioritize, and accept work done for project
    • Project Manager - Maintain project schedule and communication
    • Scrum Master - Lead, guide, and assist project team through development work
    • Business Analyst - Provide insight into user needs to inform and refine work stories
    • Technical Lead - Lead technical design and development
    • Architect - Provide technical architecture for the solution
    • Software Engineer - Update and build software to accommodate new storage architecture
    • Production Operations - Administer new storage solution system and provide insight into its operation
    • DB Admin - Administer databases to accommodate needed functionality and facilitate needed changes
    • UI/UX - Create wireframes / mockups of new/updated UI components, and provide guidance on usability of new/updated functionality

    Glossary