Archiving Old Dynamic Websites

Archiver is a web based application written in Java using the Spring Framework. Its main use is to produce static versions of websites which are no longer updated with new content. This is important for many reasons; firstly, dynamic sites will typically target specific software dependencies. As time passes certain dependencies will no longer receive patches, posing a security risk, or updates, meaning it may no longer function with the latest operating systems and code environments. This is almost always a problem for system administrators who sometimes end up maintaining software that can be tens of years old on outdated systems. Archiver offers an easy means of creating a static versions of of these dynamic sites that can deployed simply to a standard web server, typically, but not limited to, Apache.

Websites will run using different software; some will be written in plain HTML and CSS whilst others will run on CMS or WIKI platforms such as WordPress, MediaWiki or Drupal. Each of these methods will provide a slightly different way of performing some tasks, writing certain elements in HTML/CSS etc and laying out structure. After analysing the problem it was clear that we would need to target specific software in order to provide a high quality solution. For this reason, the ‘Strategy’ design pattern was chosen.

In this case an interface and super implementation provided a default set of methods for dealing with the processing of individual web elements written to work in the majority of cases. It can be thought of as standard web behavior. Subclasses of this Strategy were provided to account for software differences.

We currently support the following strategies -:

Basic (Static websites for migration)
Drupal
Media Wiki
WordPress

One of the main tasks which Archiver performs is to make any links which appear in HTML, CSS or JavaScript files relative to the homepage of the website so they are not absolute links. The JSoup plugin for Java was especially useful in this case as it allows the detection of a specified tag in the HTML file. JSoup also uses a Jquery type syntax to select the different elements from the HTML e.g. “#” is used to select an ID and “.” is used to select a class. JSoup also allows invalid HTML which is useful doesn’t prevent a site from being fully archived if there are mistakes in the markup. For the CSS and JavaScript, Regex was used to create expressions in the specified format for a CSS or JavaScript link, this could then be used to find and change the links. Alongside making links relative, Archiver also adds each link which it finds to the list of files to be added into the archive folder. After archiving recursively a zip file is served up to the user.

While existing solutions are available none of them provide the comprehensive rewriting capabilities of Archiver. All the user has to do is point the webapp at a site, choose a strategy and deploy the resulting zip.

Archiver also produces a README file which provides details of all the files which have been included in the archive and lists any errors such as missing pages.

Code is available from https://bitbucket.org/ukolnisc/archiver/src

While this is working code it has not received sufficient testing which is obviously vital for this type of project. With that in mind we would love to hear your feedback.

UKOLN Dev

Digital library notes: 2009-2013

Archiving Old Dynamic Websites

Leave a Reply Cancel reply