Harvesting Restricted Websites Project Charter

I. Problem/Value Statement

Problem Statement: Harvard's archival repositories are required to collect web-based records that are behind various forms of authentication, but technology to acquire and make accessible such records is not currently available in the Harvard web archiving environment. Additionally, there are legal and policy implications of harvesting restricted websites which must be considered and incorporated into the technical environment.

Business Value: Enabling the harvesting of restricted websites will ensure that Harvard's repositories can meet their mandates to preserve university records.

II. Vision and Approach

Describe the solution: This project will carry out the recommendation of the /wiki/spaces/librarymeetings/pages/40297106 to implement a proxy solution that accesses Harvard's restricted sites and allows Archive-It, Harvard's subscription-based web archiving tool, to capture these pages. The project will set up a proxy server at Harvard that accepts a basic username/password exchange with Archive-It and then restricts access to pages based on a combination of policies and restrictions. LTS will manage the proxy server to ensure that security requirements are met at all times, while individual archives will administer it by providing lists of vetted subdomains. The proxy server will use a special HarvardKey ID without the second-factor authentication requirement. The account will be monitored closely by security (what that monitoring entails is still to be defined), and the source of each login with the ID will be logged so that an alert is triggered if a login comes from anywhere other than the proxy server.
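As an illustration only, the access restriction and the login-source alert described above could be sketched in Python as follows. The host names, the IP addresses, the HTTPS-only rule, and the function names are all assumptions made for this sketch; none of them come from the Task Force recommendation or from an actual HarvardKey or Archive-It interface.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical list of vetted hosts; in practice each archive would supply its own.
VETTED_HOSTS = frozenset({
    "records.example.harvard.edu",
    "intranet.example.harvard.edu",
})


def is_request_allowed(url: str, vetted_hosts: frozenset = VETTED_HOSTS) -> bool:
    """Return True only if the URL is HTTPS and its host is on the vetted list.

    The HTTPS-only rule is an assumption of this sketch, not a stated requirement.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in vetted_hosts


def should_alert(login_source_ip: str, proxy_ip: str) -> bool:
    """Return True when a login with the special ID did not come from the proxy."""
    return ipaddress.ip_address(login_source_ip) != ipaddress.ip_address(proxy_ip)
```

An exact-match host check is used here so that a look-alike host such as records.example.harvard.edu.evil.test is rejected; whether whole subdomain trees should be allowed is a policy question for the archives and security.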

Deliverables/Work Products:   

Define how to measure “done”: An archival proxy server is up and running in production at Harvard and Archive-It is successfully capturing designated content through the proxy server.

In Scope: 

Out of Scope (for medium and large projects): Managing non-Harvard restricted sites.

III. Stakeholders and Project Team

Glossary

Stakeholders

Who is sponsoring the work? Who is funding the work? Who will accept the work? What organizations, departments, or people will benefit from this work (for medium and large projects)? Link to /wiki/spaces/librarymeetings/overview where relevant.

Stakeholder | Title | Participation
Eliz Kirk | AUL for Scholarly Resources | Sponsor
Emily Gustainis | Deputy Director, Countway Library of Medicine | SME
Rachel Wise | Archivist, Baker Library | SME
/wiki/spaces/librarymeetings/pages/40239261 | | Beneficiary
Stephen Abrams | Head, Digital Preservation | SME, Manager of Archive-It consortial account
Skip Kendall | Senior Collection Development and Electronic Records Archivist | SME
LTS | | Technical Owner

Project Team

Roles: Project Manager, Business Analyst, Quality Assurance Analyst, Architect, Software Engineer, Systems Engineer, UI Designer, Metadata Analyst, Subject Matter Expert

Team Member | Role(s) | Affiliation
Abigail Bordeaux | Project Manager | LTS
Skip Kendall | SME | University Archives
Anthony Moulen | Architect | LTS
Vitaly Zakuta | Business and Quality Assurance Analyst | LTS

IV. Cost and Schedule

Define the resource commitment, project phases with their associated activities, deliverables and milestones. Include a plan for transitioning from project to operations/maintenance phase.

June - begin architecture work

Consult SMEs on their availability to identify pages (URL, security level, description) and to review the list already compiled by the Task Force (TF).

V. Key Tasks and Outcomes

Tasks | Outcomes | Responsible Parties
Establish Audit Model for Applications/Websites | Documentation of the audit requirements for proxied activities on archiving webpages. The requirements will likely cover who authorized a new site to be archived, when a site was last archived, a history of changes to sites and archiving, and any site that has been archived and then removed from the allowed list. The audit history will form the basis for proving that only authorized sites were archived and for showing who authorized them. This model will be key to determining whether the proposed design will meet the university's security requirements. | LTS
Finalize Design Architecture for Proxy Solution | Propose an architectural model by which the application will be designed. Show audit/logging endpoints, authentication connectivity, administrative interfaces, the reporting output layer, and the web passthrough methodology. The goal is that the proxy should be transparent to the server and that all connectivity should appear to come from the proxy layer, including for elements that would otherwise be available without a proxy solution. | LTS/AM
Identify Pages and Classifications for Sites to be Archived | A clear set of in-scope pages and their security levels should be outlined for the initial build-out. These pages will have to be reviewed to identify how they restrict access to their content. Based on each page's restriction model, the design architecture will have to be reviewed to ensure that it can archive the individual page or site. The design may need refinement to handle externally managed template elements and other parts of sites that are sourced outside the site's primary hosting. | Archives
Define Stories for Implementation | Define all known stories for implementation and organize them in order of priority. Identify dependencies between stories and order them appropriately within the scope of the other understood priorities. | LTS/Archives
Identify Implementation Team Members | An implementation team with the appropriate skills to build a proxy solution that achieves the stated goals will be identified. | LTS/Archives
Form Implementation Team | Assign the resources necessary to implement the work as stated. Identify any risks associated with the availability of resources to meet the stated goals and to stay within target timelines. | LTS
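To make the audit model task more concrete, the following is a minimal Python sketch of the kind of record the audit requirements describe: who authorized a site, when it was last archived, and whether anything was archived while off the allowed list. The class names, field names, and the three action labels are hypothetical and chosen only for illustration; the actual model is an outcome of the task itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    site: str            # host or URL prefix the event concerns
    action: str          # assumed labels: "authorized", "archived", "removed"
    actor: str           # person or system responsible for the action
    timestamp: datetime


@dataclass
class AuditLog:
    events: list = field(default_factory=list)

    def record(self, site, action, actor, when=None):
        """Append an event, stamping it with the current UTC time by default."""
        event = AuditEvent(site, action, actor, when or datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, site):
        """All events for one site, in the order they were recorded."""
        return [e for e in self.events if e.site == site]

    def last_archived(self, site):
        """Timestamp of the most recent capture of a site, or None."""
        times = [e.timestamp for e in self.history(site) if e.action == "archived"]
        return max(times) if times else None

    def unauthorized_captures(self):
        """Captures made while the site was not on the allowed list."""
        allowed, flagged = set(), []
        for e in sorted(self.events, key=lambda e: e.timestamp):
            if e.action == "authorized":
                allowed.add(e.site)
            elif e.action == "removed":
                allowed.discard(e.site)
            elif e.action == "archived" and e.site not in allowed:
                flagged.append(e)
        return flagged
```

Replaying the events in time order is one simple way to show that only authorized sites were archived; a production design would more likely store these records in a database with append-only guarantees.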
   

VI. Assumptions, Risks, and Constraints

Constraints:

  • Scope
  • Time
  • Cost

Assumptions:

  • Stakeholders have identified appropriate subject matter experts who will participate in the Task Force and who can accurately and completely define the business requirements for the project
  • Stakeholders will make available the time required to participate in project activities and to complete tasks as requested
  • Project sponsor and other stakeholders are empowered to make the decisions required for the project to be a success
  • Project sponsor will provide written approval to move forward with system development when requested as part of incremental/iterative system demonstrations

Dependencies:

Risks (description, plan, impact, owner):

Description | Plan | Impact | Owner
Archive-It may not meet Harvard security requirements | Make contact with HUIT Security early on to determine what information is required to confirm that the solution meets requirements. | The type of information stored at Archive-It may be limited to certain security levels. | Anthony Moulen
HUIT Identity and Access Management may not roll out the "special HarvardKey ID" on a timeline that meets project needs. | Timeline may need to be adjusted to fit what IAM is able to deliver. | Delayed implementation. | Anthony Moulen
Ensure that authentication header is not sent to AIT. | | |
Update during course of project as needed.
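For the last item in the risk table, one possible safeguard is to filter headers before the proxy relays a response to Archive-It. The sketch below is an assumption about how that might look in Python; the set of header names and the function name are hypothetical, and the real list would need to come from the security review.

```python
# Header names treated as sensitive in this sketch (an assumption, not a
# reviewed list). Comparison is case-insensitive, as HTTP header names are.
SENSITIVE_HEADERS = frozenset({
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
})


def strip_sensitive_headers(headers: dict) -> dict:
    """Return a copy of the headers without any credential-bearing entries."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }
```

A check like this could also be run against captured records as part of testing, to confirm that no credential issued for the special HarvardKey ID appears in what Archive-It stores.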