Author Archives: et207

Further Stanford* classes to be run in January

According to a series of posts on the ml-class forum, a set of apparently free Stanford-inspired* ten-week distance-learning classes is currently expected to start in January/February 2012. Note that, as before, the courses are not credit-bearing – take them for what they are.

It looks like neither AI-class nor DB-class will be rerun, at least not yet; if you fancied the AI-class, take Machine Learning instead. Ng is a great teacher, and because the course involves a great deal of practical coding, it is actually both fun and fulfilling – which I’m afraid I cannot say of the AI-class. That course has been informative, and at times challenging, but for me there is less satisfaction in a set of quiz grades than in building something and watching it work.

Technical courses

Machine Learning (Andrew Ng) – will be rerun in more or less its current form
Probabilistic Graphical Models (Daphne Koller) – a logical successor for ml-class survivors
Natural Language Processing (Chris Manning and Dan Jurafsky)
Cryptography (Dan Boneh)
Game Theory (Matthew O. Jackson and Yoav Shoham)
Human-Computer Interaction (Scott Klemmer)
Design and Analysis of Algorithms I (Tim Roughgarden)
Computer Science 101 (Nick Parlante) – the beginner’s guide to these strange things they call ‘computers’ and ‘code’
Software Engineering for Software as a Service (Armando Fox and David Patterson)
Computer Security (Dan Boneh, John Mitchell and Dawn Song) – how to ‘design secure systems and write secure code’

Electrical Engineering
Information Theory (Tsachy Weissman) – ‘the science of operations on data such as compression, storage and communication’. Begins in March 2012.

Complex Systems
Model Thinking (Scott E. Page) – building models of complex systems.

Entrepreneurship
Technology Entrepreneurship (Chuck Eesley) – ‘understand the formation and growth of high-impact start-ups in areas such as information, green/clean, medical and consumer technologies.’
The Lean Launchpad (Steve Blank) –  Business models, customer development, and starting up your startup.

Civil Engineering
Making Green Buildings (Martin Fischer) – how to manage sustainable building projects.

Medicine
Anatomy (Sakti Srivastava) – knee bone connected to the thigh bone, etc.

Caveat emptor
As these courses are free online, I suppose that really ought to read caveat lector or caveat auditor or something, but you know what I mean. Here’s the warning: each of these courses is supposed to take over ten hours a week. Follow the Stanford AI-Class Decision Diagram with care and attention when deciding whether to enrol.

P.S.
If you’re not a computer science or mathematics graduate, you will probably need to work on your maths for many of these courses. The Khan Academy has very useful course material for areas like basic probability, Bayes and linear algebra/matrices.

P.P.S.
If anybody wonders what an unspecified number of thousands of dedicated students attempting to finish a midterm exam before the deadline can do to a server, wonder no more:

Having seen it repeatedly whilst trying to fill in the midterm forms, today I see this message every time I close my eyes…

* As it happens, not all these classes are run by Stanford. Software Engineering for Software as a Service is a Berkeley course (although one of the instructors, Armando Fox, was previously employed at Stanford), Computer Security is a joint effort, and Model Thinking is taught by Scott Page of the University of Michigan.

AI-Class with tablet devices

Quote from Sebastian Thrun

@SebastianThrun: Who’s up for a $2000 Stanford degree?

You might have seen the intense publicity received by Stanford’s current experiment: AI-Class, not to mention the sibling efforts ML-Class and DB-Class. These were described to the public as beta releases of a new kind of education, and have been made available for free – possibly a once-in-a-lifetime offer, possibly never to be repeated. Class began in mid-October, and it’s not clear whether these will run again in their current form.

I joined two classes: AI-Class (artificial intelligence, taught by Sebastian Thrun and Peter Norvig) and ML-Class (machine learning, taught by Andrew Ng). Given that the midterm exam happens next week, I won’t be sharing my grades, but I would like to write a little about accessing these courses on various platforms.

First, a confession: despite the fact that the AI-class draws extensively on material from Russell & Norvig’s ‘Artificial Intelligence: A Modern Approach’, and the fact that I would’ve liked to use this to check out some ebook reader platforms, I haven’t been able to do so. There are various reasons for that, but the most compelling is:

Content Unavailable in the United Kingdom

Oh well.

There were other problems, anyway; the prices of Norvig’s other books suggested that I would not have been happy to pay for a Kindle copy. Keep in mind that the office wouldn’t be paying; this is something I’m doing in what I laughably refer to as ‘spare time’. Norvig’s cheapest available Kindle download, Case Studies in Common Lisp, costs £41.89. If AI:AMA cost anything like that, I’d have ended up checking out the second-hand market anyway – you can pick up a second-hand copy for between a fiver and a tenner. Even a new paper copy might have been cheaper; e-books attract VAT.
That got the Kindle out of the running very quickly. The main use I can put it to during the course is revising notes from the ML-class, which conveniently provides revision slides/PDFs.

That left the Apple iPad and Motorola Xoom, which could not only view the PDFs but also access the videos offered by each site. In the case of ML-Class, a download link was even provided for each video – perfect, I thought, I’ll download them and watch the videos in transit. One difficulty: the iPad seems to disapprove of the concept of downloading files. Safari will consent to send PDFs to iBooks, but as for storing videos for later review, the obvious solutions involve a laptop and iTunes. If you are not always online, the need for advance planning – the faff factor, if you like – increases rapidly. The determined can mitigate the problem via iPad applications such as the MyMedia download manager, but the app-centric viewpoint is frustrating. Stanford could solve this through iTunes U – but how many channels must a provider support?

The Xoom did not go to the same finishing school as the iPad, if it went to one at all. Unaware that saving files from the browser and displaying them in anything available is an uncouth habit, it simply does it. It also seems to have passed through its formative years without learning that arbitrary soft-resetting is rude, so it occasionally does that as well.

ML-Class makes extensive use of Octave, a free and fairly Matlab-compatible language and interpreter, for its weekly assignments. The idea of Octave on a mobile device is not as far-fetched as it sounds – Nokia N800/810 owners were able to use both Octave and Gnuplot. Similar packages, such as Addi and Mathmatiz, are available for Android, though in general these are works in progress. iPad owners with a desktop copy of Matlab can try connecting to it remotely via Matlab Mobile, a facility that unofficial apps also provide on Android. The interface is not, however, optimised for the iPad, and as with the problem of watching videos in transit, those with limited network connectivity will find this an imperfect solution. Why no Octave clone on iOS? The App Store, the GPL and extensible interpreters apparently don’t mix, although since Apple changed the language of their SDK agreement, some of the issues mentioned have been resolved.

To conclude: the iPad is polished, but I found myself reaching for the (heavier, clunkier) Android device instead. The Xoom is indeed something of a brick, but the iPad seems to be designed for a world with uniformly excellent 3G coverage, in which nobody ever spends much time offline.

Extended Repository PDF Assessment

As part of FixRep, a small project is being carried out to examine the use of metadata in PDF documents held in HE/FE repositories (such as the University of Bath’s Opus repository). This builds on an initial pilot carried out using PDFs harvested from Opus, about which we wrote a paper for QQML 2010 (Hewson & Tonkin, 2010).

The original study of Opus was an exploration to test out the extraction and analysis process. Obviously, an initial analysis focusing on only one repository can only support conclusions about what’s in Opus: the issues it may present, metadata usage, and so on. The extended assessment examines PDFs from about a dozen UK repositories, so that a reasonable analysis of metadata, a comparison of ‘standard’ usage, and a picture of common versus unique issues can be obtained.

So, how are we going about this?

It’s a pretty manual process at the moment – at least, each stage is kicked off by hand – and it divides into three steps…

  • Harvest the PDF files
  • Extract the metadata into a database
  • Analyse the content

Harvest…

Using wget, the repository structure containing the PDF files is copied to a local server. This process takes some time and can be rather heavy-handed in the overhead it places on the repository server through continual requests for files. If we wanted to extend the extraction and analysis process into a service that updates regularly, a more considerate approach towards the repositories would be required. However, we’ve got away with it at least this far!
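
For illustration, the invocation looks something like this – the exact flags and the target URL are illustrative rather than the precise command we ran:

# Mirror a repository's PDFs onto local disk, politely:
#   -r   : recurse through the site
#   -np  : never ascend to the parent directory
#   -w 2 : wait two seconds between requests, to spare the server
#   -A   : keep only files with the given suffix
wget -r -np -w 2 -A '.pdf' http://opus.bath.ac.uk/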

Extract & load…

A prototype service that extracts information from PDF documents was developed as part of the FixRep project. It extracts the information in a number of ways:

Header and formatting analysis, such as:

  • The version of the PDF standard in use
  • Whether certain features, such as PDF tagging, are declared to be in use
  • The software used to create the PDF
  • The publisher of the PDF
  • The date of creation and last modification

Information about the body of the documents:

  • Whether images or text could be successfully extracted from the document and, if they could, information about those data objects.
  • If any text could be extracted from the object, further information such as the language in which it appeared to be written and the number of words in the text

Information from the originating filesystem, such as:

  • document path
  • document size
  • creation date, etc.
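
To give a flavour of what’s recoverable at the header level, the pdfinfo and pdftotext tools from poppler-utils expose much the same fields – this isn’t necessarily what our extractor uses internally, just a sketch of the idea:

# Header and formatting metadata: PDF version, producer/creator
# software, creation and modification dates, whether tagging is in use.
pdfinfo paper.pdf

# Body text, of the sort that feeds language detection and word counts.
pdftotext paper.pdf - | wc -w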

The extracted information is put into intermediate files in an XML format and is then loaded into a MySQL database for…

Analysis…

The first thing we actually look at is how many of the harvested PDF files could be processed and, for those that failed, why they failed. For example, in our pilot run against the Opus content about 80% of PDF files could be processed. The 20% that failed were mainly cases in which the service was unable to extract meaningful information, while a very small number of files turned out to be ‘bad’ PDFs – that is, the files were corrupted or not in a recognisable format. Some of the errors identified were recoverable with a little manual intervention, while others meant the file had to be excluded as unprocessable.

While not definitive, this does give us a baseline expectation for the success rate of extracting meaningful information from other repositories.

Once we have the data in the database, it’s easy enough to run some SQL to generate simple statistics, such as the type and number of distinct metadata tags used, the average number of metadata tags per file, and so on. This gives us a good overview of the content of a given repository and of whether the content is consistent within and between repositories.
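
As a sketch – the table and column names here are hypothetical, standing in for whatever schema the loader actually produces:

mysql -u fixrep -p fixrep <<'SQL'
-- Which metadata tags appear, and how often?
SELECT tag_name, COUNT(*) AS occurrences
  FROM pdf_metadata
 GROUP BY tag_name
 ORDER BY occurrences DESC;

-- Average number of metadata tags per file.
SELECT AVG(n)
  FROM (SELECT file_id, COUNT(*) AS n
          FROM pdf_metadata
         GROUP BY file_id) AS per_file;
SQL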

Next steps…

The target repositories have been harvested and will soon be processed for analysis. So, unless some very unexpected processing problems crop up, we should have results and be ready to produce a report on this project in early December.

Haptics: First impressions of the Novint Falcon

Haptic feedback has been in the process of coming of age for a good long time. Logitech released the force-feedback iFeel mouse (I swear I am not making this up) about a decade ago; it got a few headlines at the time, but eventually somebody pointed out that it was essentially a mouse that went ‘buzz’, so that was that. On the other side of the scale, the CS dept at the University of Bristol once kindly permitted me to be a victim – sorry, experimental subject – for a very interesting piece of work that made use of midrange SensABLE Phantom devices. Be advised in advance that if you have to ask the price of a SensABLE device, you probably can’t afford it.

One of the problems with haptics is that it’s simply pretty hard to explain. The Phantom experiment, for example, was very cool; the brief was to ‘feel’ your way around a three-dimensional workspace and try to describe the object you could feel. Umm… it’s sort of boxy. There’s a sort of doodad here. Um, there’s a gap in the middle. What is it? Oh. Wait. There’s another doodad below the first one. What on earth is it? And in the end it would turn out to be a model of a desk, at which point you, the lab rat – sorry, experimental subject – would say, “Oh, right.”

So on the one hand, people aren’t very good at identifying objects by touch. (For a more complete discussion of our confused mumblings, see Pearson & Fraser, 2008. Read it. It’s interesting…) On the other hand, as confusing as the information may be to use, the experience of fumble-fingering your way around a 3-D model of a piece of office furniture is extremely good fun.

Of course, that meant that someone was going to build this stuff into a game, and yeah, it’s been done. From the invention of Haptic Battle Pong, which on the face of it must be one of the most amusing things you could possibly do with the most reasonably priced SensABLE device (recently on sale at 800 euros), things have moved on. But the one that caught our attention was the Novint Falcon, which first shipped in 2007 and, at $180 plus the inevitable overhead in customs charges and the like, is only a fairly expensive method of playing Pong in the workplace.

So we bought one. It looks a lot bigger in real life. And after we got it installed, and got over playing the games that came with the device – particularly one in which the player is invited to launch ducks into a series of ponds using a large catapult – we settled down to see what else we could do with it and the available frameworks, such as Chai3D.

Here are Andy Hewson’s first impressions of the Falcon and the Chai3D demos:

Pearson, W. and Fraser, M. (2008). Collaborative Identification of Haptic-Only Objects. In Proc. EuroHaptics 2008, Madrid, Spain, June 2008, pp. 806–819.

Accessing local-network-only web pages from outside the firewall

VPNs aren’t as painless as they ought to be, and setting one up purely to get at web pages hidden behind a firewall can sometimes seem like overkill. But then, there are good reasons for hiding things like internal finance-and-admin web pages behind a firewall. What to do?

OpenSSH to the rescue! Greg Tourte just pointed out an alternative to using the VPN to access these internal web pages, such as Agresso. It certainly works on Linux and on the Mac, and should also work using PuTTY (see http://home.fnal.gov/~dwd/ssh-to-browse-behind-firewall.html). This is pretty useful: for example, people within UKOLN who are having difficulty accessing internal web pages such as Agresso due to VPN problems should be able to use this method instead.

It relies on the fact that OpenSSH has a lot of little-known functionality – in this case, the ability to act as a SOCKS proxy (see http://en.wikipedia.org/wiki/SOCKS for a lengthy and boring introduction). In short, OpenSSH can tunnel through to a machine at UKOLN and then allow the browser to treat that connection as a standard web proxy. Because the proxy endpoint is inside the Bath network, it permits access to internal web pages.

Here is a blog post describing the basics of this openssh functionality:
http://alien.slackbook.org/blog/securely-browsing-the-net-using-socks/

Here are the steps required to set up a SOCKS proxy in order to access these internal web pages remotely (we just tested this on a Mac and on Linux).

1. Install FoxyProxy in Firefox: https://addons.mozilla.org/en-US/firefox/addon/2464

2. Open a terminal window and type: ssh -D 8888 yourusername@InternalServer.YourUniversity.ac.uk
Type in your password when prompted to do so. You will then get an ssh session on InternalServer; just leave this open.

3. Open FoxyProxy, and complete the following steps:

a) Add a new manually configured proxy, with the host/IP ‘localhost’ and port 8888; click ‘SOCKS proxy?’ and set it to SOCKS v5.

b) Add a new pattern defining when this proxy should be used:
Given the example of setting this up for Bath’s Agresso server:
Call the pattern something like ‘Agresso’, and put in the URL: *agresso.bath.ac.uk/*
Under ‘URL Inclusion/Exclusion’, select ‘Whitelist’, and under ‘Pattern Contains’, select ‘Wildcards’
Make sure that the newly set up proxy is enabled.

4. Type http://agresso.bath.ac.uk/ into the address bar. Hopefully, FoxyProxy will match the pattern we have just set up and send the traffic via the SOCKS proxy.
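
If you want to check the tunnel independently of the browser, a reasonably recent curl can speak SOCKS too – the URL here is, of course, just our example:

# Route the request through the SOCKS proxy, resolving the hostname
# at the far end of the tunnel rather than locally:
curl -I --socks5-hostname localhost:8888 http://agresso.bath.ac.uk/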

Caveats: Obviously, just as you need to turn on the VPN before you can access Agresso via the VPN, you will also need to run the SSH step before you can access Agresso via SSH – the SOCKS proxy only lasts for as long as the SSH session remains connected. That said, you can simplify the setup process by adding it to your ~/.ssh/config file, for example:

Host agresso
HostName InternalServer.YourUniversity.ac.uk
DynamicForward 8888
User MyUsername

If I then type ‘ssh agresso’, and type in my password, that does the trick.
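
Incidentally, if you want the tunnel without a login shell on the remote machine, OpenSSH’s -N flag (‘do not execute a remote command’) sets up the forwarding and nothing else:

ssh -N agresso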

This takes us one step closer to proving that, no matter what the problem, if it involves a network, SSH has an answer.

writeslike.us: identity information from repository metadata

  • Screenshots or diagram of prototype: searching for a person; choosing an individual; viewing information about them and people who ‘write like them’
  • Description of Prototype: explore people, publications, institutions and themes through OAI metadata
  • End User of Prototype: “Jonathan is a researcher in evolutionary linguistics. He has become very interested in possible mathematical mechanisms for describing the nature, growth and adaptation of language, as he has heard that others, such as Partha Niyogi, have done some very interesting work in this area. Unfortunately, Jonathan is not a mathematician and finds that some of the detail is hard to follow. He realises that what he really needs to do is either to go to the right sort of event or to the right sort of online forum and find some people who might be interested in exploring links between his specialist area and their own. Both of these are difficult in their own ways. To go to the right sort of event would mean identifying what sort of event that would be, and he does not have enough money to go to very many. So he chooses to look up possible events and web forums, thinking that he can look through the participant lists for names that he recognises. This is greatly simplified by a system that uses information about the papers and authors that he considers most relevant; with this information it is able to parse lists of participants in events or online communities and provide him with a rough classification of how relevant the group is likely to be to his ideas.”
  • Link to working prototype: writeslike.us
  • Link to end user documentation: http://www.ukoln.ac.uk/projects/writeslike.us
  • Link to code repository or API: http://code.google.com/p/writeslikeus/
  • Link to technical documentation: http://www.ukoln.ac.uk/projects/writeslike.us (TBA)
  • Date prototype was launched: Dec 01 2009
  • Project Team Names, Emails and Organisations: Emma Tonkin, e.tonkin@ukoln.ac.uk, UKOLN; Alexey Strelnikov, a.strelnikov@ukoln.ac.uk, UKOLN; Andrew Hewson, a.hewson@ukoln.ac.uk, UKOLN
  • Project Website: http://code.google.com/p/writeslikeus/
  • PIMS entry: https://pims.jisc.ac.uk/projects/view/1263
  • Table of Contents for Project Posts: TBA

Value Add

Probably the most important thing I discovered in this project was the importance of ‘crowdsourced’ data in filling in the gaps between metadata and common knowledge.

The availability of Wikipedia as a source of miscellaneous information – even though much of it is too loosely structured to query through something like DBpedia – has been a very important factor for us in improving the metadata, and the data we are putting together to support usage of that metadata. It’s not perfect, of course – or perhaps it’s better to say that the imperfect and rough ways in which we use the data cannot achieve the sorts of results one might like – but it seems obvious that it’s an invaluable resource for the future.

Other data sources have been invaluable for us as well, particularly DBLP, despite its strong focus on computer science (which, however, means that for training across domains we should probably be looking elsewhere too 🙂)

Finally, social tags have been less effective for our purposes than one might imagine, for one simple reason: there aren’t an awful lot of them around, and those that do exist need to be detected by a relatively complex process of resolving title/author into the most popular mirror URI(s).

We’ll be publishing some of the extracted data shortly – boring but useful stuff like lists of institutions, URLs, coordinates, enhanced metadata, and so on – so hopefully it will come in useful to others!