Report on Enhancing Services to Preserve New Forms of Scholarship: Preservation Activities

Preservation Activities

Methods

For each work, preservation technologists designed scalable preservation pathways that might function for similar works. In each instance, the publisher described the features of the publication that they would like to preserve (“required” through “nice to have”), shared the location of the live publication, and transferred any available export packages or metadata to the preservation institution. The publication was manually reviewed in the live environment, and then compared to the package where provided. After the initial review, the preservation institution experimented with workflows for converting the publication to an archival package containing all of the information required to replay it later if triggered on their platform.

Due to the number and complexity of the publications, as well as the variation within platforms, it became clear early in the project that attempting to fully process each publication using Portico’s and CLOCKSS’ existing preservation systems would have made the project too slow and too limited in scope. The preservation institutions instead focused on combining standard workflows with features and functionality not yet in place. For example, Portico tested EPUB rendition, emulation, and web harvesting, each of which would have required significant development time to implement. They configured newly required tools, wrote scripts, and developed proof-of-concept implementations in order to demonstrate what the output of the process could look like if the publications were consistent. Portico also created mock-ups of their access website to tie together the workflow outputs and show how the content could be presented if triggered.

Described below are the three general approaches taken to preserve the publications (file transfer of information packages, web archiving, and emulation), as we employed them for specific publications and platforms. As the project progressed to more complex examples in which the experience of the platform was an integral feature of the work, methods for preservation became more complex and experimental. In selecting which of these three methods to use, the general approach was to preserve what we had some confidence could be preserved (e.g. text, raw supplements, and metadata), while also reaching for the solution that could fulfill all of the publisher requirements. Thus, we used the file transfer method for all publications, often as a secondary method even when it plainly would not meet the acceptance criteria on its own. This is standard procedure for Portico, ensuring the retention of at least some content using the least precarious methods. For this reason, many of the publications included both an export package and an emulated or web harvested version in the archival package.

File Transfer of Information Packages

The file transfer method was deployed by Portico for all works. In some cases, we saw the possibility that this method, the standard one for most of Portico’s operations, could suffice for providing access to the work in the future; other works were too complex in their structure or user experience to be reconstructed with discrete files and metadata. Publications that could most easily be exported and reassembled from their component parts were best able to make use of this method. For each of these publications, publishers created a submission information package (SIP) containing the original EPUB file, supplemental media files, and metadata, which was transferred to Portico via FTP. Files provided were assigned a file type (based on the output of several file format tools). To the extent possible, each file was then validated, and a detailed report for that format was produced and incorporated into the object’s technical metadata. While working on this project, Portico added a new EPUB module to the JHOVE application using the World Wide Web Consortium’s EPUBCheck tool. JHOVE is an open-source tool used to validate and characterize file types, and it can be used to identify structural issues that might affect future playback of a file. The new EPUB module is available to the community via the Open Preservation Foundation’s official version of JHOVE.
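
As a rough illustration of this validation step, the sketch below runs JHOVE against each EPUB in a submitted package from a script. It is a minimal sketch, not Portico's production workflow: the "EPUB-ptc" module name and the "XML" output handler reflect our understanding of the JHOVE command line and should be checked against the installed version, and the "sip" folder name is a hypothetical placeholder.

```python
import subprocess
from pathlib import Path

def validate_epub(epub_path: Path, report_path: Path) -> None:
    """Characterize and validate one EPUB with JHOVE, saving an XML report.

    Assumes the JHOVE CLI is on PATH and the EPUB module is installed;
    module and handler names are assumptions to verify locally.
    """
    subprocess.run(
        [
            "jhove",
            "-m", "EPUB-ptc",      # EPUB validation/characterization module
            "-h", "XML",           # structured output handler
            "-o", str(report_path),
            str(epub_path),
        ],
        check=True,
    )

# Example: validate every EPUB found in a submission information package (SIP)
for epub in Path("sip").rglob("*.epub"):
    validate_epub(epub, epub.with_name(epub.name + ".jhove.xml"))
```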

Key to the success of this file transfer method was the inclusion of all resources contained within the publication, along with metadata that provided enough information to map embedded resources to their appropriate positions in the EPUB. The availability and structure of metadata varied across publishers. In the case of Fulcrum-hosted publications, transferred SIPs contained an EPUB file, media files, and a Fulcrum manifest in the form of a CSV file exported directly from the publishing repository. Media files, and the associated software for playback, viewing, or interaction, were embedded or linked from within the EPUB via URLs rather than contained within the EPUB itself, making rendition dependent on the persistence of those URLs.

Figure 1: Custom video viewer in a Fulcrum EPUB, embedded using an <iframe> that displays the viewer on the Fulcrum website. From Developing Writers in Higher Education.

At the start of the Portico workflow to create an archival package, the CSV file containing metadata about the media supplements was converted to an XML format and tagged as descriptive metadata. The EPUB included an Open Packaging Format (OPF) file that contained additional information, and this was also tagged as descriptive metadata and validated for errors. The two XML files were merged and transformed into header metadata in Book Interchange Tag Suite (BITS) XML, a standard for academic book content and metadata used by Portico to describe books. Where TEI XML was included in the package, full-text BITS was generated.
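
A minimal sketch of the first step in this kind of workflow, converting a CSV manifest of media supplements into simple descriptive XML. It is illustrative only: the column names (noid, title, file_name, resource_type) and file names are hypothetical, not the actual Fulcrum manifest headings or Portico's transform.

```python
import csv
import xml.etree.ElementTree as ET

def manifest_csv_to_xml(csv_path: str, xml_path: str) -> None:
    """Convert a CSV manifest of media supplements into simple descriptive XML.

    Column names below are illustrative; a real Fulcrum manifest may use
    different headings, and the production workflow maps into BITS metadata.
    """
    root = ET.Element("resources")
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            res = ET.SubElement(root, "resource", id=row.get("noid", ""))
            for field in ("title", "file_name", "resource_type"):
                child = ET.SubElement(res, field)
                child.text = row.get(field, "")
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

manifest_csv_to_xml("manifest.csv", "manifest.xml")
```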

Though it would have been possible to preserve the separate resources and leave it to a future user to reconstruct the work, this approach would not have fulfilled the publisher’s aim of preserving the original reading experience with media embedded within its textual context. Nor would it have functioned within Portico’s business model, in which publications are “triggered” in the event that they become unavailable or inaccessible through their publisher. For this reason, Portico experimented with ways to make access scalable. The Fulcrum EPUBs use iframes to display enhanced media players that are live on the Fulcrum site. That means that if the website is unavailable (as it would be in the event of a Portico trigger), the EPUB will display empty boxes where the media players once were, with no visible indication of how to reach those resources. One workflow that was tested required two changes. First, access copies of the media files were moved to a folder within the EPUB. Second, XML transforms were used to replace the Fulcrum media player iframes with simplified EPUB3-compatible media players, allowing the media file and player to be contained within the EPUB package. The result was a new, larger EPUB that could be used for dissemination through the Portico Portal in the event that the publication was triggered. The original unmodified EPUB was also preserved and presented alongside the access version through the portal. While this process of modifying the EPUB produced an access experience closer to the original, it did entail some loss. The feature-rich Fulcrum media players were not preserved. Instead, transform efforts focused on the vital intellectual components of the experience, for example, the ability to play an audio clip inside the text and simultaneously read the transcript.
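
The sketch below gives a simplified sense of the kind of iframe-to-media-element transform described above, using Python and lxml rather than XSLT. The Fulcrum host string, the audio-only assumption, the .mp3 naming, and the way the local file is derived from the iframe URL are all assumptions for illustration; a real transform would also need to rewrite the EPUB package manifest.

```python
from lxml import etree

FULCRUM_HOST = "fulcrum.org"  # assumption: iframes point at the live Fulcrum site

def replace_iframes_with_audio(xhtml_path: str, local_media_dir: str) -> None:
    """Replace Fulcrum media-player iframes in one XHTML content file with
    EPUB3-compatible <audio> elements that reference locally packaged files."""
    ns = {"x": "http://www.w3.org/1999/xhtml"}
    tree = etree.parse(xhtml_path)
    for iframe in tree.findall(".//x:iframe", ns):
        src = iframe.get("src", "")
        if FULCRUM_HOST not in src:
            continue
        # Assumption: the last URL segment is the resource identifier,
        # and an access copy named <id>.mp3 has been added to the EPUB.
        resource_id = src.rstrip("/").split("/")[-1]
        audio = etree.Element(
            "{http://www.w3.org/1999/xhtml}audio",
            controls="controls",
            src=f"{local_media_dir}/{resource_id}.mp3",
        )
        iframe.addnext(audio)                 # keep the player at the iframe's position
        iframe.getparent().remove(iframe)
    tree.write(xhtml_path, xml_declaration=True, encoding="utf-8")
```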

With later examples, Portico suggested that descriptive captions be added under each of the embedded multimedia elements in the new access-version EPUBs and that they include their unique persistent identifiers. The caption could provide information about the embedded material should it fail to play back correctly, and the identifiers could resolve to their new Portico locations rather than the original Fulcrum website if triggered. For Portico to add a caption during the transform would require minor editorial decisions about the look and feel of a publication with its transformed content, and Portico prefers not to make such decisions without input from the publisher. The transforms and media embedding process for each individual Fulcrum publication were deemed successful, and the publishers’ intent was realized. It would currently be difficult to execute this process across all Fulcrum publications, however, due to minor variations in how media embedding was handled between the EPUBs. These variations meant there was a need to check for potential issues resulting from the transforms and to repeatedly tweak the process to accommodate them.

Adapting to challenges related to embedded third party resources

Preservation specialists aimed to retrieve resources that were visually embedded in the text but located outside of the EPUB container, and to maintain their relationship to the textual content within the EPUB. This was easier to accomplish when these resources were published via the same platform as the publication, as with many examples on Fulcrum. Where these embedded resources were hosted by a third-party service, they required special attention.

The NYU Press publication By Any Media Necessary contains embedded YouTube videos which are integral to the work. Portico generally requests that the publisher submit a copy of the video files along with text-based files in EPUB or XML format. The publisher of an EPUB-based book like By Any Media Necessary could embed the video files directly in the EPUB. However, for various reasons, publishers with platforms for enhanced digital books have often chosen not to embed video in ebook files intended for distribution: EPUB files may become large and lose the portability for which the EPUB standard was designed, and these platforms also allow for playback features that may not be available even in widely used reading systems such as those built by Amazon, Kobo, and Apple. In this case, the video content was hosted on YouTube at the time of publication. NYU Press does not hold copies of the video files and could not provide them to Portico. The risk of referencing externally hosted content was evident from the start of the project: one of the videos has already been removed from YouTube, resulting in a grey box within the publication.

Without explicit permission from rights holders, Portico will not capture and store content from YouTube or similar commercial services. Portico therefore looked for options that would improve the likelihood that the video content would be preserved without storing a local copy of the content. One method that showed promise used the Internet Archive’s Save Page Now service, which allows human- or machine-initiated capture of a single webpage and immediately produces a persistent URL to the archived version of the page. Portico experimented with a proof-of-concept workflow in which embedded YouTube videos were identified within an EPUB; the corresponding URLs were submitted to the Internet Archive; and a notice was added beneath each video in the publication linking to the archived location. This method could be automated with code and provides a level of content stability. However, because the capture process is not always successful, adoption of this method may require manual quality checking. Ideally, the archiving process would take place during production—or even earlier, during the research or writing process—to mitigate the risk that videos may be deleted from YouTube, as was the case with By Any Media Necessary.
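
A sketch of how such a workflow could be automated is shown below: scan an EPUB's content files for YouTube URLs and submit each one to Save Page Now. It is a proof-of-concept under stated assumptions, not Portico's implementation: the regular expression is simplified, the EPUB filename is hypothetical, the simple GET form of the Save Page Now endpoint is used (the authenticated SPN2 API offers more control), and captures are not guaranteed to succeed.

```python
import re
import zipfile
import requests

YOUTUBE_URL = re.compile(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)/[\w/?=&.-]+")

def youtube_urls_in_epub(epub_path: str) -> set:
    """Collect YouTube URLs found in an EPUB's (X)HTML content files."""
    urls = set()
    with zipfile.ZipFile(epub_path) as z:
        for name in z.namelist():
            if name.endswith((".xhtml", ".html")):
                urls.update(YOUTUBE_URL.findall(z.read(name).decode("utf-8", "ignore")))
    return urls

def save_page_now(url: str) -> str:
    """Ask the Internet Archive to capture a URL; return the archived location."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    resp.raise_for_status()
    return resp.url  # typically redirects to the /web/<timestamp>/<url> snapshot

for url in youtube_urls_in_epub("by_any_media_necessary.epub"):  # hypothetical filename
    print(url, "->", save_page_now(url))
```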

Adapting to challenges at the platform level

A scalable workflow for the Fulcrum publications may be possible with more consistency around how media is embedded, since every variation introduces further complexity and fragility. For example, some Fulcrum publications used a IIIF viewer to display images, while others embedded the image file directly; some used captions, others did not. Since Portico prefers not to make editorial decisions on layout, nor risk a brittle preservation workflow, a better solution would be for an alternative EPUB export to be built into the Fulcrum platform itself. This could generate a larger EPUB for offline use that would be more appropriate for preservation and would evolve in line with the platform’s development.

Export packages were also generated by the Manifold and Scalar platforms. The Manifold export was in development during the project. As with Fulcrum, its export package contains an EPUB and a comprehensive set of supplements and metadata packaged together using the BagIt specification. After an initial evaluation of the export feature, preservation technologists expressed concern that the metadata was spread across too many files and folders within the package: there was a top-level project metadata file and two files for each supplement serialized as JSON; XML metadata embedded within each EPUB; and a variety of small text files within each folder in which the file name was the property name and the contents the value. While comprehensive, this would have required a very complex workflow to process the different metadata formats and files. Portico recommended that these smaller files be consolidated into fewer files that all use the same serialization format. While this package could have been used for more conventional publications, for the complex publications selected for this project the export would be insufficient to preserve significant aspects of the user experience. In particular, the seamless integration of text, multimedia, and user-contributed content such as annotations and highlights was not captured. The Manifold export packages were deemed important to preserve as a baseline, but there was interest in exploring whether we could also preserve the rich experience offered on the platform.
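
To make the recommended consolidation concrete, the sketch below collapses a folder of one-property-per-file text files into a single JSON document. The folder layout and file names are approximations of the package described above, not the actual Manifold export structure.

```python
import json
from pathlib import Path

def consolidate_metadata(supplement_dir: Path) -> dict:
    """Collapse small one-property text files (file name = property name,
    contents = value) into a single dictionary for one supplement."""
    return {
        prop_file.stem: prop_file.read_text(encoding="utf-8").strip()
        for prop_file in supplement_dir.glob("*.txt")
    }

# Assumed layout: one folder per supplement under manifold_export/resources/
export_root = Path("manifold_export/resources")
combined = {d.name: consolidate_metadata(d) for d in export_root.iterdir() if d.is_dir()}
Path("manifold_export/resource_metadata.json").write_text(
    json.dumps(combined, indent=2), encoding="utf-8"
)
```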

Scalar’s export was an RDF package that was more suited for migration between Scalar instances than preservation. The networked nature of the content and the specificity of the stylesheet classes and property names to the platform meant that without the platform, the publication would be extremely challenging to render. An approach that could capture the work as a website seemed like a more scalable option.

Web Archiving

Both of the preservation service providers employed web archiving for a number of publications during the project. The preservation activities for CLOCKSS were performed by the LOCKSS development team at Stanford. The web archiving approach deployed for this project uses Heritrix, the Internet Archive’s open-source, extensible web crawler. To archive publications on a particular platform, LOCKSS engineers develop a plugin for the LOCKSS software with descriptor rules and code describing the harvesting process specific to that platform.

At the time of the project, Portico did not have a web archiving service in production, but used the publications presented in this project to test several of the new generation of browser-supported web crawlers, a different approach from that of CLOCKSS. Rather than attempting to predict and simulate what a browser does, these tools use an existing browser such as Chrome to automatically crawl the website and assemble any loaded resources into a WARC file. Many modern websites use JavaScript that continues to load new resources after the initial page load and as the user interacts with the page. These new crawl tools each have options for simulating user behaviors such as scrolling, clicking, and hovering. The Portico preservation technologist tested Brozzler, Squidwarc, Memento Tracer, and Browsertrix. Each had its benefits and drawbacks in the context of this project. While Squidwarc offered the best balance of control and simplicity, Internet Archive’s Brozzler was the most mature crawling tool and was used for most of the tests. Since completing the project, Portico has moved forward with a web archiving pilot using Webrecorder’s Browsertrix Crawler - a lightweight derivative of Browsertrix that uses a Chrome browser and pywb for capturing web pages, and Puppeteer-driven behaviors for simulating user interactions.
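
For orientation, a single-site crawl with Browsertrix Crawler can be launched from a script roughly as follows. This is a sketch under assumptions, not part of Portico's pilot: the flag names reflect the crawler's documented options at the time of writing and may differ between releases, and the seed URL and collection name are hypothetical.

```python
import subprocess
from pathlib import Path

def crawl_with_browsertrix(url: str, collection: str) -> None:
    """Run a single-site crawl with Webrecorder's Browsertrix Crawler in Docker.

    Output is written under ./crawls/collections/<collection>/; flag names
    should be checked against the installed crawler version.
    """
    crawls = Path("crawls").absolute()
    crawls.mkdir(exist_ok=True)
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{crawls}:/crawls/",
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", url,
            "--collection", collection,
            "--scopeType", "host",     # stay within the publication's host
            "--generateWACZ",          # package WARCs plus indexes for replay
        ],
        check=True,
    )

crawl_with_browsertrix("https://manifold.example.edu/projects/example", "example-crawl")
```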

The preservation services tested a web archiving approach with 15 publications. In each case the service evaluated what was possible to archive and how well the approach could scale. The definition of acceptable scale and precision may vary between preservation services. Services like CLOCKSS and Portico work with publishers to build solutions that are tailored to their platforms. A custom workflow is considered to be successful if it can, over long periods, automatically and accurately archive all publications on the platform without needing to be modified frequently as a result of changes or inconsistencies. Ideally the benefits of the work to tailor a solution for a publisher’s platform can be transferred to other publishers that use the same platform or standards. The guidelines accompanying this report describe ways to design web-based publications that can be successfully captured using web archiving, and also aim to minimize the effort required for customization and maintenance of preservation workflows. If a website is easy to archive, a wider variety of services can preserve the site at a lower cost, improving the likelihood that the publications will be available to future scholars.

With time to customize the web archiving tools, the preservation services had some success in creating reusable workflows that could capture the majority of the publications that were presented using publishing platforms like Manifold, Scalar, or Fulcrum. These customized workflows were relatively complex, however, with some taking weeks to develop as a proof-of-concept. Where there were certain types of enhanced features embedded within the publication, or the publications were custom-designed rather than presented using publishing software, the results were mixed. The publishers understood that some loss may be unavoidable using these methods, and this was acceptable for features considered “optional” to preserve (e.g. full text search). In some cases though, the very features that qualified the project for inclusion in this research -- features such as interactive map visualizations and IIIF viewers -- could not be captured using web archiving, or required manual work that was out of scope for the preservation services. Specific successes and challenges are described in the sections that follow.

Adapting to challenges at the platform level

Prior to the project start, LOCKSS had been working on a plugin for Fulcrum publications. The development of the Fulcrum plugin continued during this project, and feedback from the project team during weekly meetings fed into an iterative development process. The method involved harvesting native publication content from its published location on the Fulcrum website. EPUB files were also harvested from the Fulcrum site when available. However, at the time of the project, LOCKSS was not able to ingest the EPUB and the other resources separately in a way that would maintain the linkages between them. Therefore, the publications were treated as websites for the purpose of harvesting.

The LOCKSS team reviewed individual Fulcrum works that had noteworthy or extensive resources embedded in, or presented alongside, the text, such as Animal Acts: Performing Species Today. The Fulcrum plugin for LOCKSS could harvest the main text of the publication. Use of third-party, remote fonts, such as those from Typekit or Google Fonts, can be an issue in web archiving; for Fulcrum, the publications’ font and icon toolkit is hosted on the platform, making it possible to capture. Links to PDFs, video, audio, and transcripts were easily discovered and downloaded by the web crawler. The functionality on the Fulcrum publication landing page that allows users to sort and group the publication’s resources was not archived. This feature adds an open-ended combination of parameters to the URL, resulting in an explosion of “pages” that the web crawler identifies as unique content. While crawling a large number of pages is possible, it requires an inordinate amount of time and an impractical amount of storage. The plugin approach was therefore successful for capturing many of the core features of the Fulcrum publications and could potentially be applied to other Fulcrum installations. For some Fulcrum publications, specific dynamic features proved challenging with this approach; these are discussed later.

The Manifold team at the University of Minnesota Press identified six Manifold projects as candidates for analysis, and the project team selected four as representatives. The Manifold platform includes an export feature that creates detailed packages composed of EPUB versions of any texts in the project, non-textual resources, and metadata for the project and all provided files. The packages contain the core source material, but lose a lot of the richness of the original presentation such as integration of user-contributed content and convenient navigation between the text and resources. Portico explored both whether the packages could be improved to reflect more detail and whether a web archiving approach could preserve the project as originally presented.

Portico initially attempted to archive a Manifold project by providing several web harvesting tools with the starting URL and allowing the crawler to automatically discover the pages of the publication. This proved ineffective due to the design of the platform site — there is no sitemap; projects use multiple URL arrangements, making it difficult to control the scope of the crawl; many HTML link tags, which are typically used by crawlers to discover content, do not reference a target URL but instead initiate a JavaScript action on click to load a new page; and new data is loaded from the server as the user interacts with some page features. All of these factors contributed to incomplete crawls.

In a second Manifold crawl attempt, Portico worked with Memento Tracer and Squidwarc. Initially the crawl scope was controlled by encoding sequences of user behaviors to automatically crawl a single Manifold project. The size and complexity of these publications made this a difficult task, with complex behaviors required on many pages to capture all content (e.g. for each annotation, click to open; if the annotation panel has a “more” button, click it; close the panel; repeat for the next annotation). A simple proof of concept was developed that limited the scope in order to see whether these tools were viable. The crawl was programmed to simulate a user navigating the publication text. It started on the hero page, clicked to the first page of a text, clicked “next” until it reached the end of the publication, and visited each linked resource it discovered. A WARC file was generated that held all of the content loaded into the browser during this process. The replay was much better than for the initial fully-automated crawl attempt but was still very buggy - within a few clicks an error would appear. It also would have taken a lot of effort to encode all of the user behaviors that would fulfill the publisher’s requirements.

The cause of the replay problems was traced to the fact that Manifold runs as a “single page application,” in which the entire template website is loaded into the browser when the user visits the first page, and additional content is then loaded using JavaScript functions that call the Manifold API to retrieve page data. When the user clicks a link to go to a new page within Manifold, the URL in the address bar is artificially changed by JavaScript, which means the page URL shown in the address bar has not really been visited as a unique location on the network. The result is that the URLs that appear in the address bar, and the resources that would load if they were visited, may not get added to the WARC file. This causes sporadic errors in the replay of the web archive. To add to the complexity, the developer of Manifold expressed concerns about the stability of the web crawling mechanism, since it depended on CSS selector or JavaScript DOM paths to simulate user behaviors. Given that the platform is in active development, these paths could change frequently.

Through these experiments, Portico concluded that a full list of resource URLs needed to be identified for crawling. This list should cover both the top-level URLs that load in the address bar and all of the API URLs that are called while the user is interacting with the site. Portico wrote a script to generate this list using the Manifold API; the script took approximately two weeks to write. At this point, Portico re-evaluated the crawler tool options in the context of the new method. Memento Tracer did not support feeding in a list of URLs. Squidwarc did allow this, but during testing it was discovered that it lacked functionality to download PDFs or MP3 files, which were vital to the work. Since full control of browser behaviors was no longer needed for this approach, Portico switched to using Internet Archive’s Brozzler - another browser-supported crawler. Brozzler does allow basic customization of behaviors, which was useful in other experiments, but this was no longer needed for Manifold. Brozzler was also deemed a more stable option given that it has been running in production as part of Archive-It for some time.
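
The shape of such a URL-list generator might look like the sketch below. It is not Portico's script: the base URL and project slug are hypothetical, and the API paths and JSON structure shown are illustrative assumptions rather than a documented description of the Manifold API; a real script would mirror the requests observed in the browser's network traffic for the platform version in use.

```python
import requests

BASE = "https://manifold.example.edu"   # hypothetical Manifold installation
PROJECT_ID = "example-project"          # hypothetical project slug

def collect_crawl_urls(base: str, project_id: str) -> list:
    """Build the URL list a crawler needs: the top-level pages a reader would
    visit plus the API calls the single page application makes to load data."""
    urls = [
        f"{base}/projects/{project_id}",
        f"{base}/api/v1/projects/{project_id}",     # assumed endpoint path
    ]
    project = requests.get(f"{base}/api/v1/projects/{project_id}").json()
    texts = (
        project.get("data", {})
        .get("relationships", {})
        .get("texts", {})
        .get("data", [])
    )
    for rel in texts:
        urls.append(f"{base}/read/{rel['id']}")          # reader page
        urls.append(f"{base}/api/v1/texts/{rel['id']}")  # data the reader loads
    return urls

with open("crawl_urls.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(collect_crawl_urls(BASE, PROJECT_ID)))
```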

The URLs generated by the custom script were fed into Brozzler, which was configured to only visit the URLs provided and not attempt to discover new links. The resulting website capture was tested by Portico, and again by the publisher and project manager. The same script was used to generate the URLs for three other Manifold publications; for the second, minor improvements to the script were required to incorporate features not expressed in the first iteration. By the third and fourth publications, no further changes to the script were required. Though further testing is required, this indicated that a reusable script to generate resource URLs for crawling might be a scalable solution for Manifold publications.

Adapting to challenges related to social media and user contributed content

Several of the web-based publications in the project had various forms of user contributed content. While this presented new technical challenges, it also raised a variety of legal and ethical questions around copyright, privacy, and safety.

A standard feature of Manifold projects is that Twitter content related to the publication is presented alongside the text. The project hero page links to the author’s Twitter profile and integrates a text-only version of any Tweets that mention the project. The publisher was interested in including this context as part of the web archive. The technical challenges of preserving the full Twitter experience are well documented - many social media platforms are extremely dynamic, with JavaScript driving continuous updates to the page content in response to user interactions, and unexpected platform updates prompting web archivists to adapt their methods. Many archives opt to instead preserve Twitter in a more raw form using the Twitter API. This is much more stable compared to the GUI, though some loss is expected.

In addition to the technical challenges, there was a broader set of questions for Portico related to rights and ethics around harvesting user-generated content and archiving it without permission. Portico decided it would not be appropriate to archive an author’s Twitter profile page, even if it was technically possible. The Tweets embedded in the Manifold hero page seemed like more of a grey area — they are text only, which limits issues of privacy and copyright surrounding visual material; they display only a Twitter handle (no full name or profile picture); and they are all related to the project. They were successfully captured as part of the web harvest, but further consideration is required to determine whether this is appropriate on legal and ethical grounds. If not appropriate, it raises a new and difficult technical challenge - to exclude a section of content from the web harvest even though it is seamlessly integrated into the page with data imported from Twitter on the server end. Standard web archiving practice would be to simply exclude the Twitter URL, but this is not possible because of Manifold’s architecture.

While this case did not come up in any of the examples, where it is possible to obtain the rights to preserve social media content, a screenshot of the post with a link to the original location, or even a link to an archived copy, would generally be more stable than embedding the Tweet. As with YouTube videos, services such as the Internet Archive’s Save Page Now may be used to generate an archive link for social media content. The success of Save Page Now in recording social media posts will depend on the evolution of social media platforms and web archivists’ ability to keep pace.

With Manifold and several other publications, there was also the broader question of user contributed content in the form of comments and annotations. Several platforms integrate the web annotation tool Hypothesis. Because Hypothesis’ terms of use indicate that annotations are in the public domain, they are legal to preserve. Manifold, however, has a local annotation tool that does not specify rights associated with user-generated annotations, which raises the same questions as for Twitter: is it legal and/or ethical to harvest these for preservation? An adjustment to the Terms of Use for Manifold could forestall legal concerns in this instance.

Lines become blurred once again when it comes to using third party comment plugins. Rhizcomics, published on a custom-built site by Michigan Publishing, uses a service called Disqus so that user comments can be added to each page. Some of these comments include uploaded images, and users have full names and profile photos on display within the webpage. Archiving these profiles freezes them in time, with the user losing the ability to control their profile or comment in the preserved copy, which again raises ethical questions about people’s right to control their public online identity. While the publisher may wish to preserve this commentary, the most likely approach for the preservation institution is to exclude the Disqus URL from the harvest. A local comment service that is text-only and is covered by an appropriate terms of use statement would allow for archiving by more cautious parties.

Adapting to challenges related to dynamic content

Publication features that require communication with a server cannot be captured well with web archiving when that communication is unpredictable or results in an open-ended number of related URLs. Examples of this include full-text search, embedded Google Maps, and IIIF viewers, all of which depend on user interactions to load additional data. Capturing this dynamic content can be further complicated if the feature relies on third-party services, since rights issues may then come into play and continuity becomes dependent on that service being sustained. The preservation technologists analyzed and attempted to address some of the challenges presented by such dynamic content.

Oplontis Villa A (“of Poppaea”) at Torre Annunziata, Italy, Volume 1. The Ancient Setting and Modern Rediscovery, published on Fulcrum, proved to be challenging for a web archiving approach. The images in the work are displayed through a IIIF Leaflet widget that allows a user to pan from left to right and zoom. Links to the images are not contained in the HTML page; they are fetched by JavaScript and appended to the page after it is downloaded to the browser. What reads to a viewer as a single image is made up of nine tiles within Leaflet. As the user interacts with the IIIF viewer using zoom and pan, the JavaScript dynamically fetches new image tiles from the server through an API. In some cases, as with Oplontis, the CLOCKSS team was able to determine the image URLs by reverse engineering the process, finding that the images were hosted by IIIF image servers on the Fulcrum website. In the case of Lake Erie Fisherman, which displays images within Fulcrum with multiple zoom levels, it was not possible to discern which resolution the Leaflet tool displays first. Without this information, the resulting image replay in the web archive appears as a gray box. In order to replicate the original presentation experience, the harvester would need to fetch all possible combinations of image tiles - another limit to preservation. One way to improve the likelihood that an ebook could be harvested as a website is for publishers to display a simplified version of their site when it is accessed by a web crawler such as LOCKSS.
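
Where the image endpoints can be identified, one way to pre-compute the tile URLs a viewer would request is to enumerate them from the image's info.json, as in the sketch below. This assumes a standard IIIF Image API 2.x server and simplifies edge-tile handling; it is not the CLOCKSS workflow, and real viewers may request slightly different region and size combinations.

```python
import math
import requests

def iiif_tile_urls(image_base: str) -> list:
    """Enumerate tile URLs for a IIIF Image API 2.x image across all advertised
    zoom levels, so a crawler can fetch what a Leaflet-style viewer would request.

    image_base is the image's base URI (the info.json URL without '/info.json').
    """
    info = requests.get(f"{image_base}/info.json").json()
    width, height = info["width"], info["height"]
    tile = info["tiles"][0]
    tile_w = tile["width"]
    urls = []
    for scale in tile["scaleFactors"]:
        step = tile_w * scale                      # full-resolution region per tile
        for y in range(0, height, step):
            for x in range(0, width, step):
                w, h = min(step, width - x), min(step, height - y)
                size = math.ceil(w / scale)        # requested pixel width of the tile
                urls.append(f"{image_base}/{x},{y},{w},{h}/{size},/0/default.jpg")
    return urls
```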

As with Fulcrum, resources embedded in Manifold publications such as images, PDFs and videos—which generate predictable and limited responses from servers—were possible to capture as part of the web archive. However, data visualizations presented a particular challenge. The publication Cut/Copy/Paste contains an interactive map created by the author which displays different combinations of image tiles as the user navigates over the map. This dynamic content changes based on user interactions and presents a similar challenge for web archiving as the IIIF viewers encountered by CLOCKSS. The Portico team opted to treat the underlying data and code for the visualizations separately from the web archive file. The data and code were preserved along with the WARC files. This combined approach to archiving was applied to another Manifold work, Metagaming: Playing, Competing, Spectating, Cheating, Trading, Making, and Breaking Videogames, which contains downloadable versions of the games referenced in the text: disk image files for Mac and executables for Windows. These were preserved as digital objects in addition to the WARC files. The addition of contextual metadata related to these files would help users to identify what would be needed to run them.

A Mid Republican House in Gabii, from University of Michigan Press on Fulcrum, presented a different challenge related to preserving a data visualization. A navigational device in the work is driven by WebGL, a JavaScript API used to create interactive 3D graphics within a browser. The publisher prioritized this 3D model and its relationship to specific locations within the text as an important element for preservation. In addition, the 3D visualization included DOI links to data records outside of Fulcrum. These records were part of a larger website that supported the ability to browse or search the full dataset. Portico first ran the work through their Fulcrum EPUB file transfer workflow, which successfully captured all of the component parts, including the visualization and metadata. However, the desired interactions between these components and the links to content from the external database were missing from this initial package export approach. A second experiment was made with web archiving, using the contents of the FTP package exported from Fulcrum to identify a list of URLs to crawl. It was possible to visit each link using Brozzler, which was configured to crawl the links in the database. The resulting preservation copy was successful on playback, including the relationships between the EPUB and the WebGL 3D model. The DOI links to the supplemental dataset, accessed via the WebGL visualization, were not expressed in the metadata and were therefore lost. As an experiment, Portico envisioned a scenario where Fulcrum had included these database DOIs in the supplied metadata file as resources, so that they would automatically be added to the web harvester crawl list. In a proof-of-concept test, this approach was successful for retaining the connection between the WebGL visualization and the web-based data, though it did not result in a comprehensive capture of the database. Portico suggested that in addition to including these DOIs in the metadata, adding the raw supplemental dataset to the package would ensure future researchers had a way to access the full data in some form, even if they could not fully navigate it using the archived version. For Portico, this case was an outlier among the Fulcrum works because it required a secondary approach to preservation (web archiving) in addition to the EPUB export, adding a layer of complexity to an already customized workflow. To support outliers like this, either extra effort is required to make the publication work well with a web archiving approach, or some manual effort is required on the part of the publisher and/or preservation service to support unique cases.
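
A small sketch of the proof-of-concept step described above: pull DOI links out of a supplied metadata file and append them to the crawler's seed list. The file names are placeholders and the metadata is treated as plain text rather than parsed against a specific schema, which keeps the illustration simple but is an assumption about how the real metadata would be handled.

```python
import re
from pathlib import Path

DOI_URL = re.compile(r"https?://doi\.org/10\.\d{4,9}/[^\s<>\"']+")

def dois_from_metadata(metadata_file: Path) -> set:
    """Collect DOI URLs from a metadata file so they can be added to the crawl list."""
    return set(DOI_URL.findall(metadata_file.read_text(encoding="utf-8")))

# Placeholder paths: the seed list produced for the publication and the
# metadata file supplied in the export package.
seed_file = Path("crawl_urls.txt")
seeds = seed_file.read_text(encoding="utf-8").splitlines()
seeds.extend(sorted(dois_from_metadata(Path("sip/metadata.xml"))))
seed_file.write_text("\n".join(seeds), encoding="utf-8")
```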

The Stanford University Press publication Constructing the Sacred: Visibility and Ritual Landscape at the Egyptian Necropolis of Saqqara is a Scalar publication that contains a complex 3D visualization which is central to the work and therefore a priority for preservation. The feature allows users to zoom, pan, and rotate selected objects in the embedded 3D visualization, which is hosted on ArcGIS, a third-party service. The ArcGIS instance contains the coordinates for the placement of the objects, and bit.ly is used to store detailed bookmarks that rotate and zoom the visualization within ArcGIS. The metadata that is needed to understand the relationship of the publication’s components is documented within the Scalar-based publication. Portico attempted to archive the visualization automatically using Brozzler, and then manually using Webrecorder. Neither resulted in a comprehensive capture of the visualization: the initial view was captured but would break as soon as the user interacted with it. The instance of ArcGIS—a third-party service—is essential to this publication and cannot be preserved in its current form. The data would need to be able to work independently of these third-party services in order to be preservable. One possible solution for improving preservability would be to make a package for the GIS data that is independent of the Scalar platform and that includes the metadata and information about the relationships between the parts. An alternative solution, if the volume of data is small enough, might be to use a visualization tool that does not require ongoing communication with the server, as seen with the WebGL visualization in A Mid Republican House in Gabii, in which the data supporting the tool is loaded into the browser when a user first opens the page. This would allow for better web archiving and remove a fragile dependency on a third-party platform. A final option would be to create a fallback equivalent, perhaps a video demonstrating the visualization, that could be displayed if the ArcGIS service is not available. Each option requires collaborative effort from the publisher and author to ensure the item can be preserved.

Emulation

The priorities for preservation for the RavenSpace work As I Remember It were threefold: preserving customizations for representing Indigenous metadata, languages, and orthographies; maintaining the multi-path structure of the work; and retaining the opening pop-up agreement requiring users to agree to respect the expressed cultural protocols before accessing the publication. Web archiving tools allowed Portico to preserve: the general page layout, including citations, notes, and embedded media; the table of contents to a second level; and the pop-up agreement, which required extra configuration for playback. Missing from the web-archived version were the custom search functionality (including the ability to search using the First Nations keyboard), content visualizations, and a dynamic curriculum explorer. Also missing were external resources: Google Maps, iframe content containing two other websites, and a YouTube video. While the web archive was deemed acceptable by the publisher, there were some concerning issues with capture and playback that might cause problems for access in the future. The web archive seemed worth preserving despite these issues, but Portico decided to explore a more advanced approach in response to the shortcomings of web archiving for As I Remember It.

Portico preserved the work a second time using emulation, which attempts to replicate the server-side configuration of a website or application. RavenSpace is built on Scalar, a Linux/Apache/MySQL/PHP (LAMP) application. Portico performed the emulation work by creating a virtual machine on an instance of EaaSI, the Emulation-as-a-Service Infrastructure developed by Yale University Library. The virtual machine contained a LAMP stack with the GUI enabled, a web browser, Scalar 2.5.12 with the Import/Export plugin installed, and the RavenSpace customizations. The Scalar Import/Export plugin supports the transfer of the text data between platforms, but does not include the non-text resources or update media links. The vital components needed to recreate As I Remember It on the new server were: (a) the Scalar export data from the original platform, which was supplied as a large JSON file; (b) copies of the media files that were embedded in the publication but hosted on Omeka and DSpace; and (c) the Scalar-generated publication folder from the original web server containing all other resources used in the publication. Fortunately, these remote media files were documented in Scalar as resources, and so their descriptive metadata and original file URLs were included in the data export. This supported the process of matching files to their descriptions and placing them in the context of the text. The JSON was imported into RavenSpace on the virtual machine; the publication folder was copied to the appropriate location on the web server; and all media files were copied to the publication folder under a new sub-folder so that they were local to the publication. Media paths were updated in the Scalar database to point to their new local paths. The resulting preservation copy was a fully functioning website that could run offline. It included elements that were missing from the web-archived copy: the search functionality, all levels of path navigation, and the content visualizations. Google Maps and a YouTube video were not migrated due to copyright and technical challenges, but these had been deemed non-essential by the publisher.
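
The path-rewriting step could look roughly like the sketch below. It is a hedged illustration only: the database, table, and column names are hypothetical placeholders (Scalar's actual schema is not described in this report), as are the Omeka and DSpace URL prefixes and the new local sub-folder; the real update targets whichever fields Scalar uses to store media locations.

```python
import pymysql

# Hypothetical remote prefixes to localize, and the new publication sub-folder.
OLD_PREFIXES = (
    "https://omeka.example.org/files/",
    "https://dspace.example.org/bitstream/",
)
NEW_PREFIX = "/asiremember/media/"

def localize_media_paths() -> None:
    """Rewrite remote media URLs in the (hypothetical) Scalar database so they
    point at copies stored locally within the publication folder."""
    conn = pymysql.connect(
        host="localhost", user="scalar", password="***", database="scalar_db"
    )
    try:
        with conn.cursor() as cur:
            for old in OLD_PREFIXES:
                cur.execute(
                    "UPDATE content SET url = REPLACE(url, %s, %s) WHERE url LIKE %s",
                    (old, NEW_PREFIX, old + "%"),
                )
        conn.commit()
    finally:
        conn.close()
```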

Because of its predictable content layout, the Scalar platform lends itself to both web archiving and emulation. CLOCKSS and Portico were able to create a web archiving workflow by extracting a sitemap-like list of URLs from the Scalar API. Though the workflows were complicated to configure, they could potentially be applied to other Scalar publications. In addition, Portico identified patterns in the installation of Scalar that could make it possible to recreate the platform with the publication loaded into it so that the website could be emulated in the future. In order for either of these methods to be scalable, publishers would need to handle attachments in a consistent way, and communicate any customizations to a preservation service provider. Scalar’s import/export tool is useful for the emulation process, but the need to handle the publication folder and media files from DSpace and Omeka separately is burdensome. The process was labor-intensive, requiring manual work and a series of conversations to identify what pieces were required. That said, it may be possible to create a script that would automatically install a publication onto an empty RavenSpace instance for emulation. This would require a very consistent and structured handoff between the publisher and the preservation institution. We are unsure how scalable such a process would be.

Filming Revolution, which its publisher Stanford University Press describes as a “meta-documentary about independent and documentary filmmaking in Egypt,” relies on a custom LAMP (Linux, Apache, MySQL, PHP) application to deliver a user experience based on a visual network structure. The website is a single page application, which means it is heavily dependent on JavaScript to load data into the browser as the user interacts with it. Users navigate content by clicking within a visualized web of relationships between essays, videos, and themes. Both the interactive navigation and the Vimeo-hosted video content were identified as primary aspects of the work to be preserved. In a package delivered via FTP, the publisher provided the database and code from the server as well as copies of the video files from Vimeo. Portico experimented with both client-side (website harvesting) and server-side (emulation) archiving for this data-driven website.

For the client-side web harvesting approach, the existing sitemap URLs were crawled with Brozzler. Most of the Vimeo videos were not successfully captured, and the playback experience for navigation produced many technical errors. Stanford did have some prior success with a web harvesting approach while working with an expert from Webrecorder, who used different capture and playback tools. Understanding Webrecorder’s approach and whether it is possible to automate and scale may inform further exploration of the possibilities for this method.

Since the raw materials for Filming Revolution were provided, as with RavenSpace, it was possible to build a functional replica of the publication’s web server and attempt to emulate it. The ability to emulate software is dependent on the ability to encapsulate it on a single virtual machine - in other words, to remove or modify all external dependencies so that the entire application works offline. Because this publication depended on numerous Vimeo videos and an external font, the site would not function without a connection to the live web and the persistence of all of those connections. The entire application depends on loading a font from the Google API and thus would not load at all once disconnected from the Internet. Because many LAMP website installations are fairly standard, an automated process could be developed if the publisher used a consistent process for building the web server or could provide a script to install all dependencies onto an empty Linux machine. Portico therefore sought to determine what would need to change within the package provided in order to develop a scalable emulation process. A new virtual machine was created by installing the LAMP stack; the website files for Filming Revolution were copied to the Apache server on the new machine; and the website database was restored to the MySQL server. The publisher supplied an offline copy of all video assets for use in the emulation. The videos were compressed, moved to a file path on the Apache server, and renamed to match their Vimeo IDs in order to make the required code change simpler. The code was then modified to use these local copies of the videos instead of the copies on Vimeo. A method was devised to make the web fonts local to the application, a process that could have been simplified if the fonts chosen had been local and non-proprietary. Portico also worked with the publisher to remove any data or server information that was not appropriate for access or preservation. The results were demonstrated and shared via an EaaSI instance hosted by Portico for the project.
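
A rough sketch of the video-localization step is shown below, assuming the Vimeo IDs appear in player URLs within the application's PHP templates and that local copies are named <vimeo-id>.mp4. The actual code change depended on how Filming Revolution's templates and player code were written, so the paths, file patterns, and substitution here are illustrative only.

```python
import re
from pathlib import Path

VIMEO_EMBED = re.compile(r"https?://player\.vimeo\.com/video/(\d+)")
LOCAL_VIDEO_DIR = "/videos"   # assumed server path holding <vimeo-id>.mp4 copies

def localize_vimeo_references(webroot: Path) -> None:
    """Point video references at local copies named after their Vimeo IDs,
    so the emulated site can play video without a live connection to Vimeo."""
    for src in webroot.rglob("*.php"):
        text = src.read_text(encoding="utf-8")
        new_text = VIMEO_EMBED.sub(rf"{LOCAL_VIDEO_DIR}/\1.mp4", text)
        if new_text != text:
            src.write_text(new_text, encoding="utf-8")

localize_vimeo_references(Path("/var/www/filming-revolution"))  # hypothetical webroot
```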

Code modifications and data cleaning took the Portico technician, an experienced web developer who had never seen the code before, approximately five days — slightly more time than it took to localize As I Remember It. For a scalable approach, Portico could not spend five days on each publication. In order to scale this approach, an application would need to be provided to Portico in a form that is ready for a generic LAMP installation. The most efficient way to do this would be for the original developers to design the website with sustainability and encapsulation in mind, ensuring that files are local to the application where possible and that there is a simple way to fall back to local functionality for integrations such as Vimeo. When delivering the work to the publisher, the creator would then produce a clean package, which could in turn be provided to the preservation institution. In addition to enhancing preservation, limiting third-party dependencies and building in fallback mechanisms would allow the live application to degrade gracefully and reduce future maintenance.
