Monthly Archives: May 2012

Archiving Old Dynamic Websites

Archiver is a web-based application written in Java using the Spring Framework. Its main use is to produce static versions of websites which are no longer updated with new content. This is important because dynamic sites typically target specific software dependencies. As time passes, some of those dependencies will no longer receive patches, which poses a security risk, or updates, which means the site may no longer function with the latest operating systems and code environments. This is almost always a problem for system administrators, who sometimes end up maintaining decades-old software on outdated systems. Archiver offers an easy means of creating static versions of these dynamic sites that can be deployed simply to a standard web server, typically, but not limited to, Apache.

 

Websites run on different software: some are written in plain HTML and CSS, whilst others run on CMS or wiki platforms such as WordPress, MediaWiki or Drupal. Each platform has a slightly different way of performing some tasks, writing certain HTML/CSS elements and laying out structure. After analysing the problem, it was clear that we would need to target specific software in order to provide a high-quality solution. For this reason, the ‘Strategy’ design pattern was chosen.

In this case, an interface and a base implementation provide a default set of methods for processing individual web elements, written to work in the majority of cases. This can be thought of as standard web behaviour. Subclasses of this base strategy account for the differences between software platforms.

We currently support the following strategies:

  • Basic (Static websites for migration)
  • Drupal
  • MediaWiki
  • WordPress
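
The arrangement described above, an interface with a default implementation and platform-specific subclasses, can be sketched roughly as follows. This is only an illustration: the type and method names (ArchiveStrategy, rewriteLink, shouldSkip and so on) are hypothetical and are not taken from Archiver's source, and the WordPress rule is an assumed example of a platform difference.

```java
import java.util.Locale;

// Hypothetical names throughout: this is not Archiver's actual API.
interface ArchiveStrategy {
    /** Rewrites a single link found in a page so it works in the static copy. */
    String rewriteLink(String link);

    /** Decides whether a URL should be left out of the archive. */
    boolean shouldSkip(String url);
}

/** Default behaviour intended to work for most plain HTML/CSS sites. */
class BasicStrategy implements ArchiveStrategy {
    @Override
    public String rewriteLink(String link) {
        // Assumed default: a directory-style link is served from index.html.
        return link.endsWith("/") ? link + "index.html" : link;
    }

    @Override
    public boolean shouldSkip(String url) {
        return false; // archive everything by default
    }
}

/** Overrides only the behaviour that differs for one platform. */
class WordPressStrategy extends BasicStrategy {
    @Override
    public boolean shouldSkip(String url) {
        // Assumption for illustration: admin and login pages are not useful in a static copy.
        return url.contains("/wp-admin/") || url.contains("wp-login.php");
    }
}

class StrategyDemo {
    /** Picks a strategy from the name the user selected; falls back to the basic one. */
    static ArchiveStrategy forName(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "wordpress":
                return new WordPressStrategy();
            default:
                return new BasicStrategy();
        }
    }

    public static void main(String[] args) {
        ArchiveStrategy strategy = forName("WordPress");
        System.out.println(strategy.rewriteLink("/about/"));
        System.out.println(strategy.shouldSkip("/wp-admin/index.php"));
    }
}
```

The point of the pattern is that the archiving code only ever talks to the interface, so supporting another platform means adding one more subclass rather than changing the crawler.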

One of the main tasks Archiver performs is to rewrite any absolute links that appear in HTML, CSS or JavaScript files so that they are relative to the homepage of the website. The jsoup library for Java was especially useful here, as it allows specified tags in an HTML file to be detected. jsoup also uses a jQuery-like syntax to select elements from the HTML, e.g. “#” selects by ID and “.” selects by class. In addition, jsoup tolerates invalid HTML, which is useful because mistakes in the markup do not prevent a site from being fully archived. For CSS and JavaScript, regular expressions matching the format of a CSS or JavaScript link were used to find and change the links. Alongside making links relative, Archiver adds each link it finds to the list of files to be included in the archive folder. Once the site has been archived recursively, a zip file is served to the user.
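
As a rough sketch of the CSS half of this step, the example below uses a regular expression to find url(...) references, makes same-site links relative to the site root, and records each one for later download. It uses only the Java standard library; the class name CssLinkRewriter, the pattern and the example.org addresses are assumptions made for illustration and are not Archiver's actual code. It also takes the simple case where the stylesheet sits at the site root.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not taken from Archiver's source.
class CssLinkRewriter {
    // Matches url(...) with optional single or double quotes around the target.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

    private final URI siteRoot;
    private final Set<String> filesToArchive = new LinkedHashSet<>();

    CssLinkRewriter(String siteRoot) {
        this.siteRoot = URI.create(siteRoot);
    }

    /** Returns the CSS with same-site absolute links made relative to the site root. */
    String rewrite(String css) {
        Matcher m = CSS_URL.matcher(css);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // leave the match untouched by default
            try {
                URI relative = siteRoot.relativize(URI.create(m.group(1)));
                if (!relative.isAbsolute()) {
                    // Same-site (or already relative) link: rewrite and queue it for archiving.
                    filesToArchive.add(relative.toString());
                    replacement = "url(" + relative + ")";
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: keep the original text rather than abort the archive.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Links discovered so far, in the order they were found. */
    Set<String> filesToArchive() {
        return filesToArchive;
    }

    public static void main(String[] args) {
        CssLinkRewriter rewriter = new CssLinkRewriter("http://example.org/");
        String css = "body { background: url(\"http://example.org/images/bg.png\"); }";
        System.out.println(rewriter.rewrite(css));
        System.out.println(rewriter.filesToArchive());
    }
}
```

Links that point at another host, and data: URIs, remain absolute after relativisation, so they are left as they were and are not queued for download.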

While existing solutions are available, none of them provides the comprehensive rewriting capabilities of Archiver. All the user has to do is point the webapp at a site, choose a strategy and deploy the resulting zip.

Archiver also produces a README file which provides details of all the files which have been included in the archive and lists any errors such as missing pages.

Code is available from https://bitbucket.org/ukolnisc/archiver/src

While this is working code, it has not received sufficient testing, which is obviously vital for this type of project. With that in mind, we would love to hear your feedback.

 

Two cities, two hack days

During March and May, I attended two very different hack days. The first was part of Bath’s first ever digital festival, aptly called Bath Digital Festival. The hack day was organised by local web development consultancy Storm.

Unlike previous Storm hack days, which have had a theme, this one was open-ended, leaving developers free to build anything they wished. Previous hack days have been successful, with some of the hacks being turned into finished products and released on Apple’s App Store, such as Spyhunt and Shaken, created by local software development company Riot.

At the hack day I teamed up with fellow Ruby developer and hardware hacker Paul Leader (who just happens to work at Storm). We had borrowed a receipt printer from Mike Ellis (organiser of Bath Digital Festival) with the intention of plumbing it up to the internet in order to print out tweets from the conference as a physical takeaway memento for festival goers.

Arduino Printer wiring diagram

Working from a highly complicated wiring diagram, we attempted to connect the printer to the internet. Unfortunately for us, after spending many hours in the morning trying to get this to work, we eventually gave up and had lunch. One of my fellow attendees sums this up quite nicely on her blog.

Conclusion

“I also spent a large part of the day sat next to Paul and Julian who were attempting to turn an old receipt printer into a tweet printer – sadly, they couldn’t get it to work, which was a shame – but it was interesting to see the processes and patience they both possessed to get to the desired result (or at least close to it).”

As is the way with most events, the wifi during the morning wasn’t quite up to par, so the other 60+ developers in the room found it hard to implement the ideas they wanted to build. After lunch the wifi was going strong and people started hacking again. I mainly spent the afternoon finding out what others were working on, and also worked on a Twitter text analysis tool with another attendee.

I think the day went really well. I spoke to some interesting people and thought the event was well organised.

MRD Hack day

The Managing Research Data hack day in Manchester was part of the JISC call of the same name, run by Simon Hodson. Although technically I am not part of any of the projects in the MRD call, I was still asked to attend. The hack day was actually a hack two days, with the room we were in staying open until the last person left.

After a morning of talks about various projects on the MRD call and various other data-related presentations, it was time to start or join a team and brainstorm some ideas. I joined forces with Nick Jackson and Harry Newton of Lincoln University and Nick Syrotiuk of Mimas. The idea for our project came from Joss Winn, who had got it from an academic at Lincoln. The basic idea was to create a system whereby an academic could see the outputs of all the research projects, not just in their own department, but across their own and every other university.

To get started, we first chose a project name from a random name generator, and then I created a GitHub project for it. The project would now and forever be known as Project Rainbow Beam. I created a simple Sinatra web app, built on top of MongoDB, to accept a JSON payload which would then be added to the Mongo database. We soon realised that the incoming JSON data needed to be sanitised, and I volunteered. As I was now chief of sanitisation, Nick J rewrote the front end using a PHP framework called CodeIgniter. To enable optimum developer communication we created a chatroom on Campfire. As we were using Campfire, it seemed a good idea to hook GitHub up to the chatroom, so that every time we pushed code, Campfire would play a vuvuzela on all of our computers.

Skipping ahead many hours, Nick J and I were the last to go to bed, having been up for many hours hacking away at the project.

By mid to late morning on day two, we had a fully Bootstrapped website, documentation, an API endpoint, a data sanitiser and a live feed which was updated via Pusher.

At the hack event, it was decided to vote on all the hack projects to see which one would win a further two days of development work, with the developers being whisked away to a hotel and given two days to make their project better. Unfortunately we didn’t win, although our project was well received. The prize went to the BitTorrent group, whose idea was to use BitTorrent and SWORD to move large research data sets around.

Conclusion

These two events were very different and targeted very different audiences. However, the common thread they shared was that they were meant for developers. Both did well in catering for developer needs: coffee, wifi and electricity. It was great to be part of these two events; I learned a lot and met lots of great people. I look forward to the next hack day and a new challenge to work on.