Probably the most important thing I discovered in this project was the importance of ‘crowdsourced’ data in filling in the gaps between metadata and common knowledge.
The availability of Wikipedia as a source of random information, although much of it contains inadequate structure to search through with something like dbpedia, is a very important factor for us in improving the metadata and the data that we are putting together to support usage of that metadata. It’s not perfect of course – or perhaps it’s better to say that the imperfect and rough ways in which we use the data are not able to achieve the sorts of results that one might like – but it seems obvious that it’s an invaluable resource for the future.
Other data sources have been invaluable for us as well, particularly DBLP, despite the strong focus on computer science (which, however, means that for training across domains we should probably be looking elsewhere too 🙂 )
Finally, social tags have been less effective for our purposes than one might imagine for one reason, which is that there aren’t an awful lot of them around, and those that are need to be detected by a relatively complex process of resolving title/author into the most popular mirror URI(s).
We’ll be publishing some of the extracted data shortly – boring but useful stuff like lists of institutions, urls, coordinates, enhanced metadata, etc – so hopefully it will come in useful to others!