Monthly Archives: December 2009

writeslike.us: identity information from repository metadata

  • Screenshots or diagram of prototype:
    Writeslike.us search: searching for a person
    Writeslike.us search: choosing an individual
    Writeslike.us search: viewing information about them and people who ‘write like them’
  • Description of Prototype: Explore people, publications, institutions and themes through OAI metadata
  • End User of Prototype: “Jonathan is a researcher in evolutionary linguistics. He has become very interested in possible mathematical mechanisms for describing the nature, growth and adaptation of language, as he has heard that others, such as Partha Niyogi, have done some very interesting work in this area. Unfortunately, Jonathan is not a mathematician and finds that some of the detail is hard to follow. He realises that what he really needs to do is either to go to the right sort of event or the right sort of online forum and find some people who might be interested in exploring links between his specialist area and their own. Both of these are difficult in their own ways. To go to the right sort of event would mean identifying what sort of event that would be, and he does not have enough money to go to very many. So he chooses to look up possible events and web forums, thinking that he can look through the participant lists for names that he recognises. This is greatly simplified by a system that uses information about the papers and authors that he considers most relevant; with this information it is able to parse through lists of participants in events or online communities in order to provide him with a rough classification of how relevant the group is likely to be to his ideas.”
  • Link to working prototype: writeslike.us
  • Link to end user documentation: http://www.ukoln.ac.uk/projects/writeslike.us
  • Link to code repository or API: http://code.google.com/p/writeslikeus/
  • Link to technical documentation: http://www.ukoln.ac.uk/projects/writeslike.us (TBA)
  • Date prototype was launched: Dec 01 2009
  • Project Team Names, Emails and Organisations: Emma Tonkin, e.tonkin@ukoln.ac.uk, UKOLN; Alexey Strelnikov, a.strelnikov@ukoln.ac.uk, UKOLN; Andrew Hewson, a.hewson@ukoln.ac.uk, UKOLN
  • Project Website: http://code.google.com/p/writeslikeus/
  • PIMS entry: https://pims.jisc.ac.uk/projects/view/1263
  • Table of Content for Project Posts: TBA
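
The end-user scenario above talks about giving a rough classification of how relevant a group of participants is, based on the authors the user already considers relevant. As an illustration only, here is a minimal sketch of that idea using a plain overlap ratio. The scoring method, the function names (normalise, group_relevance) and the example names are all assumptions made for this sketch, not a description of how writeslike.us actually scores groups.

```python
def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def group_relevance(participants, relevant_authors):
    """Return the fraction of distinct participants found in relevant_authors.

    Hypothetical scoring: a simple set overlap, chosen for illustration only.
    """
    wanted = {normalise(a) for a in relevant_authors}
    people = {normalise(p) for p in participants}
    if not people:
        # An empty participant list carries no evidence either way.
        return 0.0
    return len(people & wanted) / len(people)


if __name__ == "__main__":
    # Made-up names; real input would be an event or forum participant list.
    score = group_relevance(["Ada Example", "Bob Sample"], ["ada  example"])
    print(f"relevance: {score:.2f}")
```

A real system would need something sturdier than exact string matching, since the same person can appear under several name forms.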

Value Add

Probably the most important thing I discovered in this project was how much ‘crowdsourced’ data helps in filling in the gaps between metadata and common knowledge.

The availability of Wikipedia as a source of general information is a very important factor for us in improving both the metadata and the data that we are putting together to support usage of that metadata, even though much of it is too loosely structured to search through with something like DBpedia. It’s not perfect of course – or perhaps it’s better to say that the imperfect and rough ways in which we use the data cannot yet achieve the sorts of results that one might like – but it seems obvious that it’s an invaluable resource for the future.

Other data sources have been invaluable for us as well, particularly DBLP, despite its strong focus on computer science (which does mean that for training across domains we should probably be looking elsewhere too 🙂).

Finally, social tags have been less effective for our purposes than one might imagine, largely because there aren’t an awful lot of them around, and those that do exist have to be found through a relatively complex process of resolving title/author into the most popular mirror URI(s).

We’ll be publishing some of the extracted data shortly – boring but useful stuff like lists of institutions, URLs, coordinates, enhanced metadata, etc. – so hopefully it will come in useful to others!

writeslike.us: Wins and Fails

Wins:
➢ Getting information such as institution names/URLs from Wikipedia, and widespread use of available web services in general
➢ Extracting names from OAI-DC was easier than expected – although there are still issues with identifying name pair order.
➢ Evidence-based learning methods can be applied successfully to the retrieved data to enhance it – getting into FixRep territory. The project has been very useful for establishing further use cases for ‘cleaning up’ metadata.
➢ Some interesting work in name / identity disambiguation through statistical clustering analysis. We’re looking at linking extracted info together with formal information such as that made available by the NAMES project.
➢ Storyboards defining the workflow of the system form an effective part of the agile development process, and were very useful for us.
➢ Using an SQL database as the repository was effective once problems with slow queries were addressed by normalizing data, reviewing the database schema design, and adding indexes as necessary.
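
To make the OAI-DC name extraction win (and its name-order caveat) a little more concrete, here is a minimal sketch in Python using only the standard library. It pulls dc:creator values out of a single oai_dc record and splits each one into surname and forenames with a comma heuristic. The sample record, the function names (split_name, creators) and the heuristic itself are assumptions made for illustration; this is not the project's actual code.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, as used inside oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Made-up record for illustration; real records come from an OAI-PMH harvest.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>Example, Ada B.</dc:creator>
  <dc:creator>Charles Sample</dc:creator>
</oai_dc:dc>"""


def split_name(raw):
    """Return (surname, forenames) for one creator string.

    Heuristic (an assumption): a comma means "Surname, Forenames";
    otherwise the last token is taken as the surname.
    """
    raw = " ".join(raw.split())  # collapse stray whitespace
    if "," in raw:
        surname, _, forenames = raw.partition(",")
        return surname.strip(), forenames.strip()
    # No comma: this is the ambiguous case, where name pair order is a guess.
    parts = raw.rsplit(" ", 1)
    if len(parts) == 1:
        return parts[0], ""
    return parts[1], parts[0]


def creators(record_xml):
    """Extract every non-empty dc:creator from one record as (surname, forenames)."""
    root = ET.fromstring(record_xml)
    return [
        split_name(el.text)
        for el in root.iter(DC_NS + "creator")
        if el.text and el.text.strip()
    ]


if __name__ == "__main__":
    for surname, forenames in creators(SAMPLE):
        print(surname, "|", forenames)
```

The comma-free branch is where name pair order goes wrong in practice: a string like "Sample Charles" would be split the wrong way round, which is why extra evidence is needed to settle the order.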

Fails:
➢ Natural Language Toolkit (NLTK) – we didn’t use it for its original purpose. Instead, we went back to TreeTagger, although this was not specifically trained for the sort of technical documents we were analysing.
➢ The text analysis expertise required for this project wasn’t already extant in the team. It would’ve been a good idea to have ensured training for the team to make sure we were all on the same page!
➢ Not all related documents, URIs, etc. were contained/linked in the project wiki; next time, ensure that they are.
➢ Cultural mismatch between research approach to defining requirements/expectations and development requirements/expectations. e.g. who writes the formal requirements document?
➢ Earlier storyboard scenario development would have been helpful, so a good lesson for next time.
➢ Swine flu and its effects were quite severe on this project – our Portuguese collaborators were unavailable for quite some time due to a) the danger of traveling to the UK and contracting the virus, and b) after subsequently contracting the illness in Portugal, the effects of the illness itself!