Preservation Activities
Methods
For each work, preservation technologists designed scalable preservation pathways that might also function for similar works. In each instance, the publisher described the features of the publication that they would like to preserve (from “required” through “nice to have”), shared the location of the live publication, and transferred any available export packages or metadata to the preservation institution. Each publication was manually reviewed in the live environment and then compared to the export package, where one was provided. After the initial review, the preservation institution experimented with workflows for converting the publication into an archival package containing all of the information required to replay it later if triggered on their platform.
Due to the number and complexity of the publications, as well as the variation within platforms, it became clear early in the project that attempting to fully process each publication using Portico’s and CLOCKSS’ existing preservation systems would have made the project too slow and too limited in scope. The preservation institutions instead focused on combining standard workflows with features and functionality not yet in place. For example, Portico tested EPUB rendition, emulation, and web harvesting, each of which would have required significant development time to implement. They configured the required new tools, wrote scripts, and developed proof-of-concept implementations to demonstrate what the output of the process could look like if the publications were consistent. Portico also created mock-ups of its access website to tie together the workflow outputs and show how the content could be presented if triggered.
Described below are the three general approaches taken to preserve the publications, as we employed them for specific publications and platforms: file transfer of information packages; web archiving; and emulation. As the project progressed to more complex examples in which the experience of the platform was an integral feature of the work, the preservation methods became more complex and experimental. In selecting which of these three methods to use, the general approach was to preserve what we had some confidence could be preserved (e.g. text, raw supplements, and metadata), while also reaching for the solution that could fulfill all of the publisher’s requirements. Thus, we used the file transfer method for all publications, often as a secondary mode even where it alone would plainly fail to meet the acceptance criteria. This is standard procedure for Portico, ensuring the retention of at least some content using the least precarious methods. For this reason, many of the publications included both an export package and an emulated or web harvested version in the archival package.
File Transfer of Information Packages
The file transfer method was deployed by Portico for all works. In some cases, we saw the possibility that this method, the standard one for most of Portico’s operations, could suffice for providing access to the work in the future; other works were too complex in their structure or user experience to be reconstructed from discrete files and metadata. Publications that could most easily be exported and reassembled from their component parts were best able to make use of this method. For each of these publications, publishers created a submission information package (SIP) containing the original EPUB file, supplemental media files, and metadata, which was transferred to Portico via FTP. Each file provided was assigned a file type based on the output of several file format tools. To the extent possible, each file was then validated, and a detailed report for that format was produced and incorporated into the object’s technical metadata. While working on this project, Portico added a new EPUB module to the JHOVE application using the World Wide Web Consortium’s EPUBCheck tool. JHOVE is an open source tool used to validate and characterize file types; it can identify structural issues that might affect future playback of a file. The new EPUB module is available to the community via the Open Preservation Foundation’s official release of JHOVE.
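To illustrate the kind of validation step described above, the following is a minimal sketch that checks the EPUBs in a submission package, shown here with the standalone EPUBCheck command-line tool rather than the JHOVE module Portico built. The jar location and SIP folder layout are assumptions for illustration, not Portico’s production workflow.

```python
# Minimal sketch: validate each EPUB in a SIP folder with EPUBCheck.
# Assumes a local copy of the EPUBCheck jar; paths are illustrative.
import subprocess
from pathlib import Path

EPUBCHECK_JAR = Path("tools/epubcheck.jar")  # assumed location of the jar

def validate_epub(epub_path: Path) -> bool:
    """Run EPUBCheck and report whether the EPUB validated cleanly."""
    result = subprocess.run(
        ["java", "-jar", str(EPUBCHECK_JAR), str(epub_path)],
        capture_output=True,
        text=True,
    )
    # EPUBCheck exits non-zero when errors are found; messages are printed to stderr.
    if result.returncode != 0:
        print(f"{epub_path.name}: validation issues\n{result.stderr}")
        return False
    print(f"{epub_path.name}: no errors detected")
    return True

if __name__ == "__main__":
    for epub in Path("sip").glob("*.epub"):  # assumed SIP layout
        validate_epub(epub)
```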
Key to the success of this file transfer method was the inclusion of all resources contained within the publication, along with metadata that provided enough information to map embedded resources to their appropriate positions in the EPUB. The availability and structure of metadata varied across publishers. In the case of Fulcrum-hosted publications, transferred SIPs contained an EPUB file, media files, and a Fulcrum manifest in the form of a CSV file exported directly from the publishing repository. Media files and the associated software for playback, viewing, or interaction were embedded or linked from within the EPUB via URL rather than contained within the EPUB itself, making rendition dependent on the persistence of those URLs.
Figure 1: Custom video viewer in a Fulcrum EPUB. Embedded using an <iframe> displaying the viewer on the Fulcrum website. From Developing Writers in Higher Education.
At the start of the Portico workflow to create an archival package, the CSV file containing metadata about the media supplements was converted to an XML format and tagged as descriptive metadata. The EPUB included an Open Packaging Format (OPF) file that contained additional information; this was also tagged as descriptive metadata and validated for errors. The two XML files were merged and transformed into header metadata in Book Interchange Tag Suite (BITS) XML, a standard for academic book content and metadata used by Portico to describe books. Where TEI XML was included in the package, full-text BITS was generated.
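As a rough sketch of the first step, a CSV manifest can be converted to XML by treating each row as a resource element. The code below is illustrative only; the column handling does not reflect the actual Fulcrum manifest schema or Portico’s BITS transform.

```python
# Illustrative sketch: convert a Fulcrum-style CSV manifest to a simple XML document.
# Column names are taken from whatever the CSV provides; the real transform is more involved.
import csv
import xml.etree.ElementTree as ET

def manifest_to_xml(csv_path: str, xml_path: str) -> None:
    root = ET.Element("manifest")
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            item = ET.SubElement(root, "resource")
            for column, value in row.items():
                # One child element per CSV column, normalizing spaces in names.
                field = ET.SubElement(item, column.strip().lower().replace(" ", "_"))
                field.text = value
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

manifest_to_xml("manifest.csv", "manifest.xml")
```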
Though it would have been possible to preserve the separate resources and leave it to a future user to reconstruct the work, this approach would not have fulfilled the publisher’s aim of preserving the original reading experience, with media embedded within its textual context. Nor would it have functioned within Portico’s business model, in which publications are “triggered” in the event that they become unavailable or inaccessible through their publisher. For this reason, Portico experimented with ways to make access scalable. The Fulcrum EPUBs use iframes to display enhanced media players that are live on the Fulcrum site. If the website is unavailable (as it would be in the event of a Portico trigger), the EPUB will display empty boxes where the media players once were, with no visible indication of how to reach those resources. One workflow that was tested required two changes. First, access copies of the media files were moved to a folder within the EPUB. Second, XML transforms were used to replace the Fulcrum media player iframes with simplified EPUB3-compatible media players, so that both the media file and the player were contained within the EPUB package. The result was a new, larger EPUB that could be used for dissemination through the Portico Portal in the event that the publication was triggered. The original unmodified EPUB was also preserved and presented alongside the access version through the portal. While this process of modifying the EPUB produced an access experience closer to the original, it did involve some loss: the feature-rich Fulcrum media players were not preserved. Instead, transform efforts focused on the vital intellectual components of the experience, for example, the ability to play an audio clip inside the text while simultaneously reading the transcript.
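The iframe replacement could be sketched as below, using lxml. The Fulcrum iframe markup, the URL-to-file mapping, and the choice between <audio> and <video> elements are simplifying assumptions; the transforms actually used in the project handled more cases than shown here.

```python
# Hedged sketch: swap Fulcrum media-player iframes in an XHTML content document
# for simple EPUB3-compatible <audio>/<video> elements pointing at local files.
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

def replace_fulcrum_iframes(xhtml_path: str, media_map: dict[str, str]) -> None:
    """Replace iframes whose src appears in media_map with a local media element."""
    tree = etree.parse(xhtml_path)
    for iframe in tree.iter(f"{{{XHTML_NS}}}iframe"):
        src = iframe.get("src", "")
        local_file = media_map.get(src)
        if not local_file:
            continue  # leave iframes we cannot resolve untouched
        tag = "audio" if local_file.endswith(".mp3") else "video"
        player = etree.Element(f"{{{XHTML_NS}}}{tag}", controls="controls")
        etree.SubElement(player, f"{{{XHTML_NS}}}source", src=local_file)
        iframe.getparent().replace(iframe, player)
    tree.write(xhtml_path, xml_declaration=True, encoding="utf-8")

# media_map would be built from the Fulcrum manifest, e.g. (hypothetical entry):
# {"https://www.fulcrum.org/embed?hdl=...": "media/clip01.mp4"}
```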
With later examples, Portico suggested that descriptive captions be added under each of the embedded multimedia elements in the new access-version EPUBs, and that these captions include the elements’ unique persistent identifiers. The caption could provide information about the embedded material should it fail to play back correctly, and the identifiers could resolve to their new Portico locations rather than the original Fulcrum website if triggered. Adding a caption during the transform, however, would require minor editorial decisions about the look and feel of a publication with its transformed content, and Portico prefers not to make such decisions without input from the publisher. The transforms and media embedding process for each individual Fulcrum publication were deemed successful, and the publishers’ intent was realized. It would currently be difficult to execute this process across all Fulcrum publications due to minor variations in how media embedding was handled between the EPUBs, which meant checking for potential issues resulting from the transforms and repeatedly tweaking the process to accommodate those variations.
Adapting to challenges at the platform level
A scalable workflow for the Fulcrum publications may be possible with more consistency around how media is embedded, since every variation introduces further complexity and fragility. For example, some Fulcrum publications used an IIIF viewer to display images, while others embedded the image file directly; some used captions, others did not. Since Portico prefers not to make editorial decisions about layout, nor to risk a brittle preservation workflow, a better solution would be for an alternative EPUB export to be built into the Fulcrum platform itself. This could generate a larger EPUB for offline use that would be more appropriate for preservation and would evolve in line with the platform’s development.
Export packages were also generated by the Manifold and Scalar platforms. The Manifold export was in development during the project. As with Fulcrum, its export package contains an EPUB and a comprehensive set of supplements and metadata packaged together using the BagIt specification. After an initial evaluation of the export feature, preservation technologists expressed concern that the metadata was spread across too many files and folders within the package: there was a top-level project metadata file and two files for each supplement serialized as JSON; XML metadata embedded within each EPUB; and a variety of small text files within each folder, in which the file name was the property name and the contents the value. While comprehensive, this would have required a very complex workflow to process the different metadata formats and files. Portico recommended that these smaller files be consolidated into fewer files that all use the same serialization format. While this package could have been used for more conventional publications, for the complex publications selected for this project it would be insufficient to preserve significant aspects of the user experience. In particular, the seamless integration of text, multimedia, and user-contributed content such as annotations and highlights was not captured. The Manifold export packages were deemed important to preserve as a baseline, but there was interest in exploring whether we could also preserve the rich experience offered on the platform.
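The consolidation Portico recommended could look something like the sketch below, which collapses the per-property text files in each supplement folder into a single JSON file. The folder layout and the .txt extension are assumptions based on the description above, not the actual Manifold export structure.

```python
# Illustrative sketch: consolidate per-property text files into one JSON file per folder.
import json
from pathlib import Path

def consolidate_supplement_metadata(export_root: Path) -> None:
    for folder in export_root.iterdir():
        if not folder.is_dir():
            continue
        properties = {}
        for prop_file in folder.glob("*.txt"):
            # File name is the property name; file contents are the value.
            properties[prop_file.stem] = prop_file.read_text(encoding="utf-8").strip()
        if properties:
            (folder / "metadata.json").write_text(
                json.dumps(properties, indent=2), encoding="utf-8"
            )

consolidate_supplement_metadata(Path("manifold_export/data"))  # assumed path
```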
Scalar’s export was an RDF package that was more suited to migration between Scalar instances than to preservation. The networked nature of the content, and the specificity of the stylesheet classes and property names to the platform, meant that without the platform the publication would be extremely challenging to render. An approach that could capture the work as a website seemed like a more scalable option.
Web Archiving
Both of the preservation service providers employed web archiving for a number of publications during the project. The preservation activities for CLOCKSS were performed by the LOCKSS development team at Stanford. The web archiving approach deployed for this project uses Heritrix, the Internet Archive's open-source, extensible web crawler. To archive publications on a particular platform, LOCKSS engineers develop a plugin for the LOCKSS software containing descriptor rules and code that describe the harvesting process specific to that platform.
At the time of the project, Portico did not have a web archiving service in production, but used the publications presented in this project to test several of the new generation of browser-supported web crawlers, a different approach from that taken by CLOCKSS. Rather than attempting to predict and simulate what a browser does, these tools use an existing browser such as Chrome to automatically crawl the website and assemble any loaded resources into a WARC file. Many modern websites use JavaScript that continues to load new resources after the initial page load and as the user interacts with the page. These new crawl tools each have options for simulating user behaviors such as scrolling, clicking, and hovering. The Portico preservation technologist tested Brozzler, Squidwarc, Memento Tracer, and Browsertrix. Each had its benefits and drawbacks in the context of this project. While Squidwarc offered the best balance of control and simplicity, Internet Archive’s Brozzler was the most mature crawling tool and was used for most of the tests. Since completing the project, Portico has moved forward with a web archiving pilot using Webrecorder’s Browsertrix Crawler, a lightweight derivative of Browsertrix that uses a Chrome browser and Pywb for capturing web pages, and Puppeteer JavaScript for simulating user behaviors.
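As a rough illustration of such a pilot workflow, Browsertrix Crawler can be driven from a script. The flags shown below follow Webrecorder’s public quick-start documentation but may vary by version; the URL and collection name are placeholders, and this is not Portico’s pilot configuration.

```python
# Minimal sketch: launch Browsertrix Crawler in Docker and package the capture as a WACZ.
import os
import subprocess

def crawl_to_wacz(start_url: str, collection: str) -> None:
    """Crawl a site with Browsertrix Crawler; output lands in ./crawls."""
    subprocess.run(
        [
            "docker", "run",
            "-v", f"{os.getcwd()}/crawls:/crawls/",  # mount the output directory
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", start_url,
            "--collection", collection,
            "--generateWACZ",  # package the capture as a WACZ archive
        ],
        check=True,
    )

crawl_to_wacz("https://example.org/publication", "test-crawl")
```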
The preservation services tested a web archiving approach with 15 publications. In each case the service evaluated what was possible to archive and how well the approach could scale. The definition of acceptable scale and precision may vary between preservation services. Services like CLOCKSS and Portico work with publishers to build solutions that are tailored to their platforms. A custom workflow is considered to be successful if it can, over long periods, automatically and accurately archive all publications on the platform without needing to be modified frequently as a result of changes or inconsistencies. Ideally the benefits of the work to tailor a solution for a publisher’s platform can be transferred to other publishers that use the same platform or standards. The guidelines accompanying this report describe ways to design web-based publications that can be successfully captured using web archiving, and also aim to minimize the effort required for customization and maintenance of preservation workflows. If a website is easy to archive, a wider variety of services can preserve the site at a lower cost, improving the likelihood that the publications will be available to future scholars.
With time to customize the web archiving tools, the preservation services had some success in creating reusable workflows that could capture the majority of the publications presented using publishing platforms like Manifold, Scalar, or Fulcrum. These customized workflows were relatively complex, however, with some taking weeks to develop as proofs of concept. Where certain types of enhanced features were embedded within the publication, or the publications were custom-designed rather than presented using publishing software, the results were mixed. The publishers understood that some loss may be unavoidable using these methods, and this was acceptable for features considered “optional” to preserve (e.g. full text search). In some cases, though, the very features that qualified a publication for inclusion in this research, such as interactive map visualizations and IIIF viewers, could not be captured using web archiving, or required manual work that was out of scope for the preservation services. Specific successes and challenges are described in the sections that follow.
Adapting to challenges at the platform level
Prior to the project start, LOCKSS had been working on a plugin for Fulcrum publications. Development of the Fulcrum plugin continued during this project, with feedback from the project team during weekly meetings feeding into iterative development. The method involved harvesting native publication content from its published location on the Fulcrum website. EPUB files were also harvested from the Fulcrum site when available. However, at the time of the project, LOCKSS was not able to ingest both EPUBs and other resources separately in a way that would maintain the linkages between them. The publications were therefore treated as websites for the purpose of harvesting.
The LOCKSS team reviewed individual Fulcrum works that had noteworthy or extensive resources embedded in, or presented alongside, the text, such as Animal Acts: Performing Species Today. The Fulcrum plugin for LOCKSS could harvest the main text of the publication. Use of third-party, remote fonts, such as those from Typekit or Google Fonts, can be an issue in web archiving; for Fulcrum, the publications’ font and icon toolkit is hosted on the platform, making it possible to capture. Links to PDFs, video, audio, and transcripts were easily discovered and downloaded by the web crawler. The functionality on the Fulcrum publication landing page that allows users to sort and group the publication’s resources was not archived. This feature adds an open-ended combination of parameters to the URL, resulting in an explosion of “pages” that the web crawler identifies as unique content. While crawling a large number of pages is possible, it requires an inordinate amount of time and an impractical amount of storage. The plugin approach was therefore successful in capturing many of the core features of the Fulcrum publications and could potentially be applied to other Fulcrum installations. For some Fulcrum publications, specific dynamic features proved challenging with this approach; these will be discussed later.
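The kind of scope rule that avoids this parameter explosion can be expressed as a simple URL filter, sketched below in Python. This is illustrative only, not the LOCKSS plugin rule syntax, and the parameter names are assumptions.

```python
# Illustrative sketch: exclude faceted sort/group URL variants from a crawl scope.
from urllib.parse import urlparse, parse_qs

FACET_PARAMS = {"sort", "order", "view", "per_page"}  # assumed parameter names

def in_crawl_scope(url: str) -> bool:
    """Keep a URL only if it carries no faceting query parameters."""
    query = parse_qs(urlparse(url).query)
    return not FACET_PARAMS.intersection(query)

assert in_crawl_scope("https://www.fulcrum.org/concern/monographs/abc123")
assert not in_crawl_scope("https://www.fulcrum.org/concern/monographs/abc123?sort=title&view=gallery")
```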
The Manifold team at the University of Minnesota Press identified six Manifold projects as candidates for analysis, and the project team selected four as representatives. The Manifold platform includes an export feature that creates detailed packages composed of EPUB versions of any texts in the project, non-textual resources, and metadata for the project and all provided files. The packages contain the core source material but lose much of the richness of the original presentation, such as the integration of user-contributed content and convenient navigation between the text and resources. Portico explored both whether the packages could be improved to reflect more detail and whether a web archiving approach could preserve the project as originally presented.
Portico initially attempted to archive a Manifold project by providing several web harvesting tools with the starting URL and allowing the crawler to automatically discover the pages of the publication. This proved ineffective due to the design of the platform site: there is no sitemap; projects use multiple URL arrangements, making it difficult to control the scope of the crawl; many HTML link tags, which are typically used by crawlers to discover content, do not reference a target URL but instead initiate a JavaScript action on click to load a new page; and new data is loaded from the server as the user interacts with some page features. All of these factors contributed to incomplete crawls.
In a second Manifold crawl attempt, Portico worked with Memento Tracer and Squidwarc. Initially, the crawl scope was controlled by encoding sequences of user behaviors to automatically crawl a single Manifold project. The size and complexity of these publications made this a difficult task, with complex behaviors required on many pages to capture all content (e.g. for each annotation: click to open, click the “more” button if the annotation panel has one, close the panel, and repeat for the next annotation). A simple proof of concept was developed that limited the scope, to see whether these tools were viable. The crawl was programmed to simulate a user navigating the publication text: it started on the hero page, clicked through to the first page of a text, clicked “next” until it reached the end of the publication, and visited each linked resource it discovered. A WARC file was generated that held all of the content loaded into the browser during this process. The replay was much better than for the initial fully automated crawl attempt but was still very buggy: within a few clicks an error would appear. Encoding all of the user behaviors needed to fulfill the publisher’s requirements would also have required substantial effort.
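The scripted navigation described above, clicking “next” until the end of a text, can be expressed as a short browser automation routine. The sketch below uses Playwright for Python purely to illustrate the idea; the project itself encoded behaviors in Memento Tracer and Squidwarc, the CSS selector is a hypothetical placeholder, and producing a WARC would additionally require running the browser behind a capture proxy.

```python
# Illustrative sketch: drive a browser through a reading view by clicking "next"
# until no next link remains. Not the tooling used in the project.
from playwright.sync_api import sync_playwright

NEXT_SELECTOR = "a.next-page"  # hypothetical selector for the "next" control

def walk_reading_view(start_url: str, max_pages: int = 500) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(max_pages):
            next_link = page.query_selector(NEXT_SELECTOR)
            if next_link is None:
                break  # reached the end of the text
            next_link.click()
            page.wait_for_load_state("networkidle")  # let JavaScript finish loading
        browser.close()

walk_reading_view("https://manifold.example.edu/read/example/section/1")  # placeholder URL
```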
The cause of the replay problems was traced to the fact that Manifold runs as a “single page application,” in which the entire template website is loaded into the browser when the user visits the first page, and additional content is then loaded by JavaScript functions that call the Manifold API to retrieve page data. When the user clicks a link to go to a new page within Manifold, the URL in the address bar is artificially changed by JavaScript, which means the page URL shown in the address bar has not really been visited as a unique location on the network. The result is that the URLs that appear in the address bar, and the resources that would load if they were visited, may not be added to the WARC file, causing sporadic errors in the replay of the web archive. To add to the complexity, the developer of Manifold expressed concerns about the stability of the web crawling mechanism, since it depended on CSS selector or JavaScript DOM paths to simulate user behaviors. Given that the platform is in active development, these paths could change frequently.
Through these experiments, Portico concluded that a full list of resource URLs needed to be identified for crawling. This list should cover both the top-level URLs that load in the address bar and all of the API URLs that are called while the user interacts with the site. Portico wrote a script to generate this list using the Manifold API; the script took approximately two weeks to write. At this point, Portico re-evaluated the crawler tool options in the context of the new method. Memento Tracer did not support feeding in a list of URLs. Squidwarc did, but during testing it was discovered to lack functionality for downloading PDFs and MP3 files, which were vital to the work. Since full control of browser behaviors was no longer needed for this approach, Portico switched to Internet Archive’s Brozzler, another browser-supported crawler. Brozzler does allow basic customization of behaviors, which was useful in other experiments but no longer needed for Manifold. Brozzler was also deemed a more stable option given that it has been running in production as part of Archive-It for some time.
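The shape of such a URL-listing script is sketched below: walk the Manifold API for one project and emit both the reader-facing URLs and the API URLs a browser would request, as a seed list for a crawler. The endpoint paths and response fields here are assumptions for illustration, not the actual Manifold API contract or Portico’s script.

```python
# Hedged sketch: build a crawler seed list for one Manifold project.
# All endpoint paths and JSON fields below are assumed, not verified.
import json
import urllib.request

BASE = "https://manifold.example.edu"  # placeholder installation
PROJECT_ID = "example-project"         # placeholder project slug

def get_json(path: str) -> dict:
    with urllib.request.urlopen(f"{BASE}{path}") as resp:
        return json.load(resp)

def list_seed_urls(project_id: str) -> list[str]:
    urls = [
        f"{BASE}/projects/{project_id}",          # top-level project page
        f"{BASE}/api/v1/projects/{project_id}",   # assumed API endpoint
    ]
    project = get_json(f"/api/v1/projects/{project_id}")  # assumed API endpoint
    texts = (
        project.get("data", {}).get("relationships", {}).get("texts", {}).get("data", [])
    )  # assumed JSON:API-style structure
    for text in texts:
        text_id = text["id"]
        urls.append(f"{BASE}/read/{text_id}")         # assumed reader URL pattern
        urls.append(f"{BASE}/api/v1/texts/{text_id}") # assumed API endpoint
    return urls

if __name__ == "__main__":
    with open("seed_urls.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(list_seed_urls(PROJECT_ID)))
```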
The URLs generated by the custom script were fed into Brozzler, which was configured to only visit the URLs provided and not attempt to discover new links. The resulting website capture was tested by Portico, and again by the publisher and project manager. The same script was used to generate the URLs for three other Manifold publications; for the second, minor improvements to the script were required to incorporate features not expressed in the first iteration. By the third and fourth publications, no further changes to the script were required. Though further testing is required, this indicated that a reusable script to generate resource URLs for crawling might be a scalable solution for Manifold publications.
Emulation
The priorities for preservation of the RavenSpace work As I Remember It were threefold: preserving customizations for representing Indigenous metadata, languages, and orthographies; maintaining the multi-path structure of the work; and retaining the opening pop-up agreement requiring users to agree to respect the expressed cultural protocols before accessing the publication. Web archiving tools allowed Portico to preserve the general page layout, including citations, notes, and embedded media; the table of contents to a second level; and the pop-up agreement, which required extra configuration for playback. Missing from the web-archived version were the custom search functionality (including the ability to search using the First Nations keyboard), content visualizations, and a dynamic curriculum explorer. Also missing were external resources: Google Maps, iframe content containing two other websites, and a YouTube video. While the web archive was deemed acceptable by the publisher, there were some concerning issues with capture and playback that might cause problems for access in the future. The web archive seemed worth preserving despite these issues, but Portico decided to explore a more advanced approach in response to the shortcomings of web archiving for As I Remember It.
Portico preserved the work a second time using emulation, which attempts to replicate the server-side configuration of a website or application. RavenSpace is built on Scalar, a Linux/Apache/MySQL/PHP (LAMP) application. Portico performed the emulation work by creating a virtual machine on an instance of EaaSI, the Emulation-as-a-Service Infrastructure developed by Yale University Library. The virtual machine contained a LAMP stack with the GUI enabled, a web browser, Scalar 2.5.12 with the Import/Export plugin installed, and the RavenSpace customizations. The Scalar Import/Export plugin supports the transfer of text data between platforms, but does not include the non-text resources or update media links. The vital components needed to recreate As I Remember It on the new server were: (a) the Scalar export data from the original platform, which was supplied as a large JSON file; (b) copies of the media files that were embedded in the publication but hosted on Omeka and DSpace; and (c) the Scalar-generated publication folder from the original web server containing all other resources used in the publication. Fortunately, these remote media files were documented in Scalar as resources, so their descriptive metadata and original file URLs were included in the data export. This supported the process of matching files to their descriptions and placing them in the context of the text. The JSON was imported into RavenSpace on the virtual machine; the publication folder was copied to the appropriate location on the web server; and all media files were copied to the publication folder under a new sub-folder so that they were local to the publication. Media paths were updated in the Scalar database to point to their new local paths. The resulting preservation copy was a fully functioning website that could run offline. It included elements that were missing from the web-archived copy: the search functionality, all levels of path navigation, and content visualizations. Google Maps and a YouTube video were not migrated due to copyright and technical challenges, but these had been deemed non-essential by the publisher.
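One step in this localization, matching the publisher-supplied media files to the original Omeka and DSpace URLs recorded in the export and copying them into the new local sub-folder, could be sketched as below. The export structure, field names, and folder locations are assumptions, and the companion step of updating paths in the Scalar database is not shown.

```python
# Hedged sketch: copy remote-hosted media next to the publication and build an
# old-URL -> new-local-path mapping from the Scalar export. Structure is assumed.
import json
import shutil
from pathlib import Path

MEDIA_DIR = Path("publication/media")    # new sub-folder local to the publication (assumed)
SUPPLIED_FILES = Path("supplied_media")  # media files delivered by the publisher (assumed)

def localize_media(export_json: Path) -> dict[str, str]:
    export = json.loads(export_json.read_text(encoding="utf-8"))
    MEDIA_DIR.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for resource in export.get("resources", []):  # assumed export structure
        original_url = resource.get("url", "")
        if "omeka" not in original_url and "dspace" not in original_url:
            continue  # only handle remote-hosted media (crude match for illustration)
        filename = original_url.rsplit("/", 1)[-1]
        source = SUPPLIED_FILES / filename
        if source.exists():
            shutil.copy2(source, MEDIA_DIR / filename)
            mapping[original_url] = f"media/{filename}"
    return mapping

# The resulting mapping would then drive the update of media paths in the database.
print(localize_media(Path("as_i_remember_it_export.json")))  # placeholder filename
```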
Because of its predictable content layout, the Scalar platform lends itself to both web archiving and emulation. CLOCKSS and Portico were able to create a web archiving workflow by extracting a sitemap-like list of URLs from the Scalar API. Though the workflows were complicated to configure, they could potentially be applied to other Scalar publications. In addition, Portico identified patterns in the installation of Scalar that could make it possible to recreate the platform with the publication loaded into it so that the website could be emulated in the future. In order for either of these methods to be scalable, publishers would need to handle attachments in a consistent way, and communicate any customizations to a preservation service provider. Scalar’s import/export tool is useful for the emulation process, but the need to handle the publication folder and media files from DSpace and Omeka separately is burdensome. The process was labor-intensive, requiring manual work and a series of conversations to identify what pieces were required. That said, it may be possible to create a script that would automatically install a publication onto an empty RavenSpace instance for emulation. This would require a very consistent and structured handoff between the publisher and the preservation institution. We are unsure how scalable such a process would be.
Filming Revolution, which its publisher, Stanford University Press, describes as a “meta-documentary about independent and documentary filmmaking in Egypt,” relies on a custom LAMP application to deliver a user experience based on a visual network structure. The website is a single page application, meaning it is heavily dependent on JavaScript to load data into the browser as the user interacts with it. Users navigate content by clicking within a visualized web of relationships between essays, videos, and themes. Both the interactive navigation and the Vimeo-hosted video content were identified as primary aspects of the work to be preserved. In a package transferred via FTP, the publisher provided the database and code from the server as well as copies of the video files from Vimeo. Portico experimented with both client-side (website harvesting) and server-side (emulation) archiving for this data-driven website.
For the client-side web harvesting approach, the existing sitemap URLs were crawled with Brozzler. Most of the Vimeo videos were not successfully captured, and the playback experience for navigation produced many technical errors. Stanford did have some prior success with a web harvesting approach while working with an expert from Webrecorder, who used different capture and playback tools. Understanding Webrecorder’s approach, and whether it could be automated and scaled, may inform further exploration of this method.
Since the raw materials for Filming Revolution were provided, as with RavenSpace, it was possible to build a functional replica of the publication’s web server and attempt to emulate it. The ability to emulate software depends on the ability to encapsulate it on a single virtual machine, in other words, to remove or modify all external dependencies so that the entire application works offline. This publication depended on numerous Vimeo videos and an external font: the application loads a font from the Google API and thus would not load at all once disconnected from the Internet, and the videos would not function without a connection to the live web and the persistence of all of those links. Because many LAMP website installations are fairly standard, an automated process could be developed if the publisher used a consistent process for building the web server or could provide a script to install all dependencies onto an empty Linux machine. Portico therefore sought to determine what would need to change within the package provided in order to develop a scalable emulation process. A new virtual machine was created by installing the LAMP stack; the website files for Filming Revolution were copied to the Apache server on the new machine; and the website database was restored to the MySQL server. The publisher supplied an offline copy of all video assets for use in the emulation. The videos were compressed, moved to a file path on the Apache server, and renamed to match their Vimeo IDs in order to simplify the required code change. The code was then modified to use these local copies of the videos instead of the copies on Vimeo. A method was devised to make the web fonts local to the application, a process that could have been simplified if the fonts chosen had been local and non-proprietary. Portico also worked with the publisher to remove any data or server information that was not appropriate for access or preservation. The results were demonstrated and shared via an EaaSI instance hosted by Portico for the project.
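The renaming step described above might look like the following sketch. The target directory and the filename-to-Vimeo-ID mapping are hypothetical placeholders; in practice the mapping would come from the publisher’s package.

```python
# Illustrative sketch: rename publisher-supplied video copies to match their Vimeo IDs
# so the application code can be pointed at local files with a minimal change.
from pathlib import Path

VIDEO_DIR = Path("/var/www/filming-revolution/videos")  # assumed local path on the Apache server

# Hypothetical mapping: original filename -> Vimeo ID (supplied by the publisher)
VIMEO_IDS = {
    "interview_cairo_01.mp4": "123456789",
    "interview_cairo_02.mp4": "123456790",
}

def rename_to_vimeo_ids(video_dir: Path) -> None:
    for original_name, vimeo_id in VIMEO_IDS.items():
        source = video_dir / original_name
        if source.exists():
            source.rename(video_dir / f"{vimeo_id}{source.suffix}")

rename_to_vimeo_ids(VIDEO_DIR)
```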
Code modifications and data cleaning took the Portico technician, an experienced web developer who had never seen the code before, approximately five days, slightly more time than it took to localize As I Remember It. For a scalable approach, Portico could not spend five days on each publication. To scale this approach, an application would need to be provided to Portico in a form that is ready for a generic LAMP installation. The most efficient way to do this would be for the original developers to design the website with sustainability and encapsulation in mind, ensuring that files are local to the application where possible and that there is a simple way to fall back to local functionality for integrations such as Vimeo. When delivering the application to the publisher, the creator would then produce a clean package that could be provided to the preservation institution. In addition to enhancing preservation, limiting third-party dependencies and building in fallback mechanisms would allow the live application to degrade gracefully and reduce future maintenance.