Harvesting Restricted Websites Project Charter
I. Problem/Value Statement
Problem Statement: Harvard's archival repositories are required to collect web-based records that are behind various forms of authentication, but technology to acquire and make accessible such records is not currently available in the Harvard web archiving environment. Additionally, there are legal and policy implications of harvesting restricted websites which must be considered and incorporated into the technical environment.
Business Value: Enabling the harvesting of restricted websites will ensure that Harvard's repositories can meet their mandates to preserve university records.
II. Vision and Approach
Describe the solution: This project will implement the recommendation of the /wiki/spaces/librarymeetings/pages/40297106 to deploy a proxy solution that accesses Harvard's restricted sites and allows Archive-It, Harvard's subscription-based web archiving tool, to capture these pages. A proxy server will be set up at Harvard that accepts a basic username/password exchange with Archive-It and then restricts access to pages based on a combination of policies and restrictions. LTS will manage the proxy server to ensure that security requirements are met at all times, while individual archives will administer it by providing lists of vetted subdomains. The proxy will use a special HarvardKey ID that is exempt from the two-factor authentication requirement. The account will be monitored closely by security (what does that mean?), and the source of each login with the ID will be logged so that an alert is triggered if a login comes from anywhere other than the proxy server.
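The two checks described above (limiting the proxy to vetted subdomains, and alerting when the special ID logs in from anywhere other than the proxy) could be sketched roughly as follows. This is only an illustration under stated assumptions: the hostnames, the proxy address, and the function names are hypothetical, and treating hosts beneath a vetted subdomain as vetted is an assumption rather than a decision recorded in this charter.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; in practice each archive would supply its vetted subdomains.
VETTED_SUBDOMAINS = frozenset({
    "records.example.harvard.edu",
    "intranet.example.harvard.edu",
})

# Hypothetical address of the archival proxy server.
PROXY_ADDRESS = "192.0.2.10"


def is_vetted(url: str, vetted: frozenset = VETTED_SUBDOMAINS) -> bool:
    """Return True if the URL's host is a vetted subdomain or sits beneath one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in vetted)


def login_needs_alert(source_ip: str, proxy_ip: str = PROXY_ADDRESS) -> bool:
    """Return True if a login with the special ID did not come from the proxy."""
    return source_ip != proxy_ip
```

Matching on the parsed hostname, rather than searching the whole URL string, avoids accepting a URL that merely contains a vetted name somewhere in its path.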
Deliverables/Work Products:
Define how to measure “done”: An archival proxy server is up and running in production at Harvard and Archive-It is successfully capturing designated content through the proxy server.
In Scope:
Out of Scope (for medium and large projects): Managing non-Harvard restricted sites.
III. Stakeholders and Project Team
Stakeholders
Who is sponsoring the work? Who is funding the work? Who will accept the work? What organizations, departments, or people will benefit from this work (for medium and large projects)? Link to /wiki/spaces/librarymeetings/overview where relevant.
Stakeholder | Title | Participation |
---|---|---|
Eliz Kirk | AUL for Scholarly Resources | Sponsor |
Emily Gustainis | Deputy Director, Countway Library of Medicine | SME |
Rachel Wise | Archivist, Baker Library | SME |
/wiki/spaces/librarymeetings/pages/40239261 | | Beneficiary |
Stephen Abrams | Head, Digital Preservation | SME, Manager of Archive-It consortial account |
Skip Kendall | Senior Collection Development and Electronic Records Archivist | SME |
LTS | | Technical Owner |
Project Team
Roles: Project Manager, Business Analyst, Quality Assurance Analyst, Architect, Software Engineer, Systems Engineer, UI Designer, Metadata Analyst, Subject Matter Expert
Team Member | Role(s) | Affiliation |
---|---|---|
Abigail Bordeaux | Project Manager | LTS |
Skip Kendall | SME | University Archives |
Anthony Moulen | Architect | LTS |
Vitaly Zakuta | Business and Quality Assurance Analyst | LTS |
IV. Cost and Schedule
Define the resource commitment, project phases with their associated activities, deliverables and milestones. Include a plan for transitioning from project to operations/maintenance phase.
June - begin architecture work
Consult SMEs on their availability to identify pages (URL, security level, description) and to review the list already compiled by the Task Force.
V. Key Tasks and Outcomes
Tasks | Outcomes | Responsible Parties |
---|---|---|
Establish Audit Model for Applications/Websites | Documentation of the audit requirements for proxied web-archiving activity. The requirements will likely cover who authorized a new site to be archived, when a site was last archived, the history of changes to sites and archiving, and any site that was archived and later removed from the allowed list. The audit history will form the basis for proving that only authorized sites were archived and for showing who authorized them. This model will be key to determining whether the proposed design meets the university's security requirements. | LTS |
Finalize Design Architecture for Proxy Solution | Propose an architectural model by which the application will be designed. Show audit/logging endpoints, authentication connectivity, administrative interfaces, reporting output layer and web passthrough methodology. The goal is that the proxy should be transparent to the server and all connectivity should appear to come from the proxy layer, including for elements that would otherwise be available without a proxy solution. | LTS/AM |
Identify Pages and Classifications for Sites to be Archived | A clear set of in-scope pages, with the security level of each, should be outlined for the initial build-out. These pages will have to be reviewed to identify how they restrict access to their content. Based on each page's restriction model, the design architecture will have to be reviewed to ensure that it can achieve the required goal of archiving the individual page or site. The design may need refinement to handle externally managed template elements and other parts of sites that are sourced outside the site's primary hosting. | Archives |
Define Stories for Implementation | Define all known stories for implementation and organize the stories in order of priority. Identify dependencies for stories and order appropriately within the scope of the other understood priorities. | LTS/Archives |
Identify Implementation Team Members | An implementation team will be identified with the appropriate skills to build a proxy solution to achieve the stated goals for the solution. | LTS/Archives |
Form Implementation Team | Assign resources necessary to implement the work as stated. Identify any risks associated with availability of resources to meet stated goals and to stay within target timelines. | LTS |
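
The audit model in the first task above could take roughly the following shape. This is a minimal sketch, not the agreed design: the class names, field names, and action labels ("authorized", "archived", "removed") are hypothetical, and events are assumed to be recorded in chronological order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AuditEvent:
    action: str          # assumed labels: "authorized", "archived", "removed"
    actor: str           # person or system responsible for the event
    timestamp: datetime


@dataclass
class SiteAuditRecord:
    site: str
    events: List[AuditEvent] = field(default_factory=list)

    def record(self, action: str, actor: str, when: Optional[datetime] = None) -> None:
        """Append an event; the history is never edited in place."""
        self.events.append(AuditEvent(action, actor, when or datetime.now(timezone.utc)))

    def _last(self, action: str) -> Optional[AuditEvent]:
        for event in reversed(self.events):
            if event.action == action:
                return event
        return None

    @property
    def authorized_by(self) -> Optional[str]:
        """Who most recently authorized the site for archiving."""
        event = self._last("authorized")
        return event.actor if event else None

    @property
    def last_archived(self) -> Optional[datetime]:
        """When the site was last archived."""
        event = self._last("archived")
        return event.timestamp if event else None

    @property
    def is_allowed(self) -> bool:
        """True if the latest authorize/remove event is an authorization."""
        for event in reversed(self.events):
            if event.action in ("authorized", "removed"):
                return event.action == "authorized"
        return False
```

Keeping an append-only event list means a site that was archived and later removed from the allowed list stays visible in the history, which is what the audit requirement asks for.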
VI. Assumptions, Risks, and Constraints
Constraints:
- Scope
- Time
- Cost
Assumptions:
- Stakeholders have identified the appropriate subject matter experts to participate in the Task Force and who can accurately and completely define the business requirements for the project
- Stakeholders will have made available the time required to participate in project activities and to complete tasks as requested
- Project sponsor and other stakeholders are empowered to make the decisions required for the project to be a success
- Project sponsor will provide written approval to move forward with system development when requested as part of incremental/iterative system demonstrations
Dependencies:
Risks (description, plan, impact, owner):
Description | Plan | Impact | Owner |
---|---|---|---|
Archive-It may not meet Harvard security requirements | Make contact with HUIT Security early on to determine what information is required to confirm that the solution meets requirements. | The type of information stored at Archive-It may be limited to certain security levels. | Anthony Moulen |
HUIT Identity and Access Management may not roll out the "special HarvardKey ID" on a timeline that meets project needs. | Timeline may need to be adjusted to fit what IAM is able to deliver. | Delayed implementation. | Anthony Moulen |
Ensure that the authentication header is not sent to Archive-It (AIT). | | | |
Update during the course of the project as needed. | | | |
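
One way to keep authentication headers from reaching Archive-It is to filter them out before the proxy passes a message on. The sketch below is illustrative only: the function name is hypothetical, and the list of headers treated as sensitive is an assumption that would need to be confirmed with HUIT Security.

```python
# Assumed set of headers that could carry credentials or session state.
SENSITIVE_HEADERS = frozenset({
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
})


def strip_sensitive_headers(headers: dict) -> dict:
    """Return a copy of the headers without any credential-bearing entries.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }
```

For example, passing `{"Content-Type": "text/html", "Authorization": "Basic abc"}` returns only the `Content-Type` entry.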