Category Archives: research

new ideas, random thoughts

The Raspberry Pi

Unless you’ve been living under a rock for the last six months or so, you’ll almost certainly have heard of the Raspberry Pi. If not, here’s the low-down. The Raspberry Pi is a single-board computer aimed at the hobbyist and educational markets. It comes in two flavours, the Model A and the Model B, subtle references to the BBC Micro computers of the 1980s and 90s. Unfortunately, for the foreseeable future only the Model B, the higher-specification model, is scheduled for production. Lurking within the guts of this credit-card-sized wonder are 256MB of RAM and a 700MHz ARM chip that is easily capable of being pushed to 800MHz. For audio and video there are RCA video and 3.5mm audio outputs, as well as an HDMI port. Resolutions covered range from VGA all the way up to 1080p (and beyond!), with almost all PAL and NTSC video standards supported. Connectivity is a doddle, as a 10/100 Ethernet socket is included on the board. WiFi is also possible, although ARM devices are notoriously finicky about which USB adapters they will work with. I/O is covered too: two USB ports are provided (extensible with a hub), and GPIO (general-purpose input/output) pins allow connections to and from various devices, more about which shortly.

Raspberry Pi Board

The Raspberry Pi viewed side-on. Visible here are the HDMI port (front centre), the SD card slot (left), the GPIO pins (back left), the RCA video output (yellow jack), the USB ports (back right) and the Ethernet port (front right).

While the hardware of the Raspberry Pi is unfortunately almost unchangeable (short of the size of the SD card used), this is more than made up for by the choice of operating systems. In true hacking fashion, several operating systems have sprung up, each doing different things. Here is a selection:

1) Raspbian “Wheezy”

Raspbian is based on Debian, and is the recommended starting point for beginners to the Raspberry Pi. It boots to a command prompt by default, but LXDE, a lightweight X11 desktop environment, comes pre-installed. Also included are the Midori web browser and all the development tools you’d expect on a Linux system, including Python and Java. Of course, since it’s a Debian derivative, new software is a doddle to install using the package manager. Within minutes I had set up VLC and was playing 1080p video with no problems.

2) Arch Linux ARM

Arch Linux is extremely popular with the modders and tweakers of the Raspberry Pi community. Its no-frills approach centres on “simplicity and full control for the end-user”. By default, no X11 server is included – it is up to the user to decide which (if any) they would like. Obviously, this distribution is not recommended for those with little to no Linux knowledge.

3) RaspBMC

On the other end of the scale, RaspBMC is totally different from either of the distributions mentioned above. Boot the Raspberry Pi with this distribution and it becomes a fully-fledged home media centre, with the ability to play films, music and even YouTube videos. RaspBMC is based on the very popular XBMC, a cross-platform media centre used by countless people worldwide.


RaspBMC screenshot

The default home screen for the really quite good RaspBMC media centre operating system for the Raspberry Pi.

One of the main reasons that the Raspberry Pi came about was to teach children in schools about electronics and programming. To this end, the GPIO pins can be used to interact with code and feed sensor readings into programs. Unfortunately, in Raspbian at least, the Python modules for interacting with the GPIO pins are not included by default. Instructions for installing them are given here. A popular way to interface with the Raspberry Pi is a simple ribbon cable and a prototyping board, which will let you try out many different combinations before settling on something more permanent. One of the peripherals that has generated the most buzz lately is a camera module, featured here, which would pave the way to features such as image recognition for navigation, or more multimedia capabilities.
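Once the modules are installed, driving the pins from Python is pleasingly direct. Here is a minimal sketch using the RPi.GPIO module (the pin numbers and the LED/switch wiring are my own illustrative choices, not a standard circuit):

import time
import RPi.GPIO as GPIO

GPIO.setmode(GPIO.BCM)     # use Broadcom pin numbering
GPIO.setup(17, GPIO.OUT)   # pin 17 drives, say, an LED
GPIO.setup(4, GPIO.IN)     # pin 4 reads, say, a push switch

for _ in range(10):
    GPIO.output(17, GPIO.input(4))  # mirror the switch state onto the LED
    time.sleep(0.5)

GPIO.cleanup()  # release the pins when done

Run it with sudo (the GPIO device needs root access) and the LED follows the switch.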

As with most things, however, there are a few drawbacks, but what else did you expect from a machine costing £25/$35? The biggest caveat for me was initially the lack of hardware MPEG-2 decoding, which meant my whole library of movies would have to be transcoded to H.264 for smooth playback on the device. However, the Raspberry Pi Foundation has now released licences for roughly £2.50 for MPEG-2 and £1.50 for VC-1. The other gripe that some may have is the lack of expandable RAM, since the memory is stacked directly on the processor. Such users may find the VIA APC or Cubieboard a little more suitable for their needs; however, for pure value for money and form factor, the Raspberry Pi is hard to beat.

Edit (1/11/12) – As of October 15th, the Raspberry Pi now ships with 512MB RAM, making it an even more attractive proposition for its price point.

Streaming video with VLC

Occasionally one wants to stream video for various reasons, whether it’s within the institutional network or a live feed from a conference venue. A few years ago Greg Tourte and I wrote a paper about the process of streaming video from a DV camera using FireWire, encoding into Ogg Theora/Vorbis, and transmitting the result to an audience via IceCast. For no adequately explored reason I have found myself playing with VLC’s inbuilt streaming methods for various purposes over the last week or so, and since VLC isn’t especially well documented, I’ve put the results up here.


1)  Streaming video to an icecast server.

Once you have the icecast server set up this is actually shockingly easy to do. Set up a mountpoint on the server side, in your icecast.xml setup (/usr/local/etc/icecast.xml by default):

<mount>
<mount-name>/test.ogg</mount-name>
<username>myusername</username>
<password>mypassword</password>
<max-listeners>10</max-listeners>
<burst-size>65536</burst-size>
</mount>

for example.

Now, on the client side (which could be anything from Windows to Linux to MacOS, because VLC is cross-platform, but this example is Windows), try

C:\Users\Em> "C:\Program Files (x86)\VideoLAN\VLC\vlc.exe" "C:\Users\Public\Videos\My Video.wmv" --sout=#transcode{vcodec=theo,vb=800,scale=1,acodec=vorb,ab=128,channels=2,samplerate=44100}:std{access=shout,mux=ogg,dst=myusername:mypassword@myicecastserver.domain.com:port/test.ogg}

It should transcode on the fly into Ogg Vorbis/Theora and throw it at your icecast server. Viewers who go to myicecastserver.domain.com:port should be able to view it from there. Note that you can change various settings in the transcode process (for example scale=0.5, vb=400) to reduce the network bandwidth required; paradoxically, though, reducing some of these settings will actually increase the time taken by the transcoding process, so it can leave the transcode laggier than it was already.

Why transcode? Well, icecast only handles a limited format set. It’s really designed for audio data, not audiovisual. It’ll handle pretty well anything in an Ogg wrapper, though, and it is free. So if you want to stream video with Icecast, transcoding will probably be involved somewhere.

2)  Streaming from a DVD (previously recorded event)

One would expect this to be as simple as

"c:\Program Files (x86)\VideoLAN\VLC\vlc.exe" dvdsimple:///E:/#1

but as it happens this seldom works, and the reason is reaction time. Icecast is contacted with a header as soon as the streaming process begins. If it takes too long to get the DVD spun up and begin the process of streaming, icecast simply times out on you, leaving an error message along the lines of ‘WARN source/get_next_buffer Disconnecting source due to socket timeout’.

Having tested this on various platforms, I find that the following string: "vlc dvdsimple:///dev/dvd --sout='#transcode{vcodec=theo,vb=200,scale=0.4,theora-quality=10,fps=12,acodec=vorb,ab=48,channels=2}:std{access=shout,mux=ogg,dst=username:password@myicecastserver.domain.com:port/destination.ogg}' --sout-transcode-audio-sync --sout-transcode-deinterlace" works very well in some cases. Apparently the DVD drive I first tested this with is just unusually slow. This DVD, being homegrown, doesn’t require libdvdcss to view/transcode.

3) Streaming with ffmpeg2theora

Bit of a Linux solution, this one. Install libvpx, libkate, scons and ffmpeg (all available as SlackBuilds for those who are that way inclined). Install ffmpeg2theora. Install libshout and oggfwd.

Then try a command line along the lines of the following:

ffmpeg2theora /source/material/in/ffmpeg/readable/format.ext -F 16 -x 240 -c 1 -A 32 --speedlevel 2 -o /dev/stdout -- | oggfwd myicecastserver.domain.com server_port password /test2.ogg

Obviously the output of this is not exactly high quality: it’s been resized to a width of 240 pixels, the audio has been reduced in quality, and the framerate has been cut to 16fps. But all these configuration options can be played with. Here’s a useful help page: http://commons.wikimedia.org/wiki/Help:Converting_video/ffmpeg2theora

Having called this a Linux solution, it’s worth pointing out that ffmpeg2theora is available for Windows (http://v2v.cc/~j/ffmpeg2theora/download.html) and that oggfwd/ezstream (http://www.icecast.org/ezstream.php/) have been used successfully on Windows as well. It’s also worth noting that, again, VLC can do the Ogg/Theora encoding too (and has done since 2006); it’s just a question of seeing what’s better optimised for your purpose on your platform.

Note also that in this instance no username is needed, and the password used in this case is that set in the ‘<source-password>’ directive in icecast.xml.

4)  Streaming without icecast

Icecast is a useful default solution if you want to broadcast your event/recording to multiple people across the web. It’s also useful because, operating via HTTP, it doesn’t suffer from the sort of firewall/router problems that UDP-based video streaming, for example, typically encounters. On the other hand, if you’re streaming across a LAN (for example, into the next room), there’s (usually) no network border police to get in your way, and VLC also offers a direct VLC-to-VLC HTTP-based streaming solution. Unlike Icecast, though, it’s not ideal for one-to-many broadcast.

The Videolan documentation has a graphical explanation of this setup: http://www.videolan.org/doc/streaming-howto/en/ch02.html
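For the record, a minimal serving command looks something like this (the port number and mount path here are arbitrary choices of mine):

vlc "My Video.wmv" --sout="#transcode{vcodec=theo,vb=800,acodec=vorb,ab=128}:standard{access=http,mux=ogg,dst=:8080/stream.ogg}"

A viewer then simply opens http://yourserver:8080/stream.ogg in their own copy of VLC. Only a handful of clients at a time, though; for anything bigger, use Icecast.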


5) Mixing video for streaming

An obvious application to test in this context is FreeJ. Sadly it’s a bit of a pain to compile as it doesn’t seem to have been touched for a while. You’ll need to use the following approach for configuring the code:

CXXFLAGS=-D__STDC_CONSTANT_MACROS ./configure --enable-python --enable-perl --enable-java --disable-qt-ui

Typing ‘make’ will then result in: error: ‘snprintf’ was not declared in this scope. Add #include <stdio.h> to any files afflicted in this way.

You then come across a crop of errors resulting from changes in recent ffmpeg. Some of these can be resolved with a patch; for the rest, you’re better off going to the git repository rather than trying a stable version.

In principle you probably want --enable-qt-ui, but since the Qt UI doesn’t currently compile I have left it as an exercise for some other day.

And once you have FreeJ working, you need to read the tutorial. Note this advice regarding addition of an audio track to FreeJ output.



Archiving Old Dynamic Websites

Archiver is a web-based application written in Java using the Spring Framework. Its main use is to produce static versions of websites which are no longer updated with new content. This matters for several reasons: dynamic sites typically target specific software dependencies, and as time passes those dependencies stop receiving patches (posing a security risk) or updates (meaning the site may no longer function on current operating systems and code environments). This is a perennial problem for system administrators, who sometimes end up maintaining software many years old on outdated systems. Archiver offers an easy means of creating static versions of these dynamic sites that can be deployed simply to a standard web server, typically, but not limited to, Apache.


Websites run on different software: some are written in plain HTML and CSS, whilst others run on CMS or wiki platforms such as WordPress, MediaWiki or Drupal. Each of these approaches performs some tasks, writes certain elements in HTML/CSS, and lays out structure in a slightly different way. After analysing the problem it was clear that we would need to target specific software in order to provide a high-quality solution. For this reason, the ‘Strategy’ design pattern was chosen.

In this case an interface and a super implementation provide a default set of methods for processing individual web elements, written to work in the majority of cases; it can be thought of as standard web behaviour. Subclasses of this Strategy are provided to account for software differences.

We currently support the following strategies:

  • Basic (Static websites for migration)
  • Drupal
  • Media Wiki
  • WordPress

One of the main tasks which Archiver performs is to make any links which appear in HTML, CSS or JavaScript files relative to the homepage of the website, rather than absolute. The JSoup library for Java was especially useful here, as it allows the detection of a specified tag in an HTML file. JSoup uses a jQuery-style syntax to select elements from the HTML, e.g. “#” selects an ID and “.” selects a class. JSoup also tolerates invalid HTML, which is useful because mistakes in the markup don’t prevent a site from being fully archived. For the CSS and JavaScript, regular expressions matching the format of a CSS or JavaScript link were used to find and change the links. Alongside making links relative, Archiver adds each link it finds to the list of files to be included in the archive folder. After archiving recursively, a zip file is served up to the user.
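Archiver does this path arithmetic in Java, but the core idea is easy to show. Here is a rough Python sketch of relativising one absolute link against the page that contains it (the function and names are mine for illustration, not Archiver’s API):

import posixpath
from urllib.parse import urlparse

def relativise(link, page_url):
    """Rewrite an absolute link as a path relative to the containing page."""
    link_parts = urlparse(link)
    page_parts = urlparse(page_url)
    if link_parts.netloc != page_parts.netloc:
        return link  # external link: leave untouched
    page_dir = posixpath.dirname(page_parts.path) or "/"
    return posixpath.relpath(link_parts.path, start=page_dir)

# relativise("http://example.org/css/main.css",
#            "http://example.org/about/index.html")  ->  "../css/main.css"

The archived copy can then be dropped anywhere on a web server and the links still resolve.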

While existing solutions are available, none of them provides the comprehensive rewriting capabilities of Archiver. All the user has to do is point the webapp at a site, choose a strategy and deploy the resulting zip.

Archiver also produces a README file which provides details of all the files which have been included in the archive and lists any errors such as missing pages.

Code is available from https://bitbucket.org/ukolnisc/archiver/src

While this is working code, it has not yet received sufficient testing, which is obviously vital for this type of project. With that in mind, we would love to hear your feedback.


Some are more NoSQL than others

I’m no SQL Expert

Over the past few years I have had my fair share of tricky data management opportunities. There was the financial transaction database that had no keys or indexes and had to be pieced back together by getting old source code releases, finding the bugs and reversing the incorrect values. There were the gigabytes of web log records that needed cross-referencing, and finally the free-text marketing responses that had to be analysed for patterns.

This was all a warm-up for my current opportunity. With this task I have all the issues at once. I have the scale, with 22-million-odd items; I know that’s not enormous for 2012, but it is far from easily manageable. I have the lack of consistent relationships, and, the final piece of the puzzle, a lack of data quality.

What I wouldn’t give for a nice enumeration of types, something concrete to go on. Take dates, for example. For decades it has been the norm to store dates in ISO format, or at least something that can be converted back and forth. If I am really lucky I get ISO dates; a lot of the time I get something that isn’t defined but is recognisable and can be converted to ISO from, for example, ‘moon cycles since equinox format’™. Often, though, I get user-typed input, not from the same person and not even from the same system. Entering dates like “about the middle of last year” is guaranteed to anger even your most friendly neighbourhood developer.

Taken individually, none of this is that big a deal. However, producing results at reasonable web-response speeds over 22 million records, grouping, counting and cross-referencing on thousands of possible groups on standard hardware, is eluding me. If you can make them dance this way I would love to hear from you.

NoSQL Hype vs. Substance

I’m sure I am not the only one who has trouble keeping up with the latest and newest technology releases. There are lots of exciting new cool apps and services that I don’t really have the time to investigate due to the sheer quantity. Sometimes I just don’t have the patience to coax a demo app out of the latest beta release, constantly cross referencing against error messages. Finally, my least favourite, there is the kind of technology forced along by big business marketing.

Data storage is huge business, particularly on the other side of the Atlantic. There are vast sums of money at stake and even corporate survival can depend on the success or failure of given products. I’m no Commie, I don’t mind this in principle, but the amount of positioning, media attention and misinformation that then surrounds the products makes it very hard to separate the hype from the substance.

A couple of weeks ago I was deliberating, remembering all the great things I have heard about NoSQL. Maybe the NoSQL people have a point, and now I have a solid use case where my RDBMS is not suiting me. Up until this point I had discounted alternatives to my RDBMS on the grounds that any storage solution is moving bits around on a disc, and the same rules apply. Like all performance computing it is a game of caching: keep indexes in memory and look to disc as little as possible. Indexes and disc space usage are always going to be more or less equal, leaving any performance improvements to the implementation, the hardware and possibly some new algorithms. RDBMS design was based on set theory and predicate logic, or to put it another way, maths. Very little has changed to satisfy my scepticism with regard to the speed and scale increases promised by the NoSQL movement. Even the idea that there is seen to be a movement worries me. I mean, it’s hardly suffrage, anti-war or civil rights, is it?

Some are more NoSQL than others

Up until now I have been talking about NoSQL as a single entity. Of course this is just one of the misleading factors. For some reason, lots of substantially different technologies have been lumped under one umbrella. Maybe the daunting numbers necessitated this; maybe it was felt they could survive better as a combined opponent to the RDBMS. The majority of them share some common themes, but thinking of them as a single entity is particularly unhelpful. In fact several of the so-called NoSQL solutions have more in common with your SQL RDBMS than with each other. Two notable exceptions are CouchDB and Neo4j, which offer ACID compliance.

According to Wikipedia, the generally accepted types of NoSQL solution are:

document store, graph, key-value store, multivalue, object, RDF, tabular and tuple store.

Having read a few articles about the various NoSQL solutions, it seems that each author has decided to group them up and talk about the groups in some way, offering possible scenarios where each is useful. So far it has been easy to pick holes in every one of these lists, in some cases because they are out of date but mostly because even within these sub-types the feature sets can still be very different. For this reason I shall approach this from a slightly different angle: first I shall talk about themes common to most (but not all) of the NoSQL solutions, then follow up with a few types of software and the specific NoSQL products that would be useful for each.

Speed and ACID

Earlier I mentioned that I couldn’t really see how you can develop a significantly faster comparable version of a storage solution. In the case of the majority of the NoSQL products, the main selling point is horizontal scalability. To put it another way, it is easier to deploy them over lots of load-balanced clusters, which is where the performance gains come from. RDBMSs do not scale as easily in this manner.

The reason for this is that all good RDBMSs are at least approaching ACID compliance. In essence, this is your guarantee that the data you store is consistent and will be there when you want it. With ACID comes the concept of transactions, which are important for many real-world tasks; without them bank transfers would vanish and nuclear missiles would launch. The locking required does not work as easily over RDBMS clusters due to the inevitable latency.

Having said that, there are many cases where this isn’t important. You could maintain the consistency at the application level; it gets increasingly harder to maintain as system complexity grows, but it is far from impossible. Alternatively, read-only data sources are a good candidate, or maybe you just don’t care: if the odd ‘Like’ or +1 goes missing, the sun will still rise the next day. In addition I should probably point out that most people tend to agree that NoSQL means ‘Not Only SQL’. For the reasons discussed, in most cases it would represent part of a given solution. A fast NoSQL solution would work well as a client-facing readable resource in front of a large, complex dataset.

NoSchema

A relaxed, or in some cases entirely non-existent, schema is another selling point. This, for me, is the key difference. So many times my model has altered slightly and various null checks have crept into my code. You can easily see how in some cases a very relaxed schema would be a nice thing to have.

Commodity computing

Computing as a commodity has been a big driver behind many of these products. It isn’t hard to see the value of being able to spin up a few more database clusters over the busy Christmas period with little fuss; this is where horizontal scalability becomes massively appealing. Taking this further, some products have an emphasis on distribution. You could have a country or regional presence in a datacentre, where UK residents are served by one cluster/shard and Australians by another. Maybe you can offload your North American Black Friday rush to your Pacific Rim cluster, where it is 2am.

UC1: Online Store/CMS/Blog

If you thought there were a lot of NoSQL options then you are in for a treat when you start looking for a CMS. It seems that every developer has, at some point, started coding their own CMS. It isn’t hard to see that document stores are particularly suited to this task; almost the entire focus is on the document. Taking a real-world example, MongoDB and Etsy demonstrate a nice scenario for this use case. On Etsy you have various sellers all over the globe creating product pages. Some might have shipping restrictions, photos, size guides, linked products or any number of combinations. With MongoDB and a relaxed schema, a product page could be a single document with just the relevant categories embedded. I am willing to bet they don’t use it for their payment systems though.
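A hypothetical product document (every field name here is invented by me for illustration) might look like this, with each page embedding only the attributes that apply to it:

{
  "title": "Hand-knitted scarf",
  "price": 18.50,
  "seller": "woollywonders",
  "photos": ["front.jpg", "detail.jpg"],
  "shipping_restrictions": ["UK only"],
  "size_guide": {"length_cm": 150}
}

Another seller’s document might omit the size guide and restrictions entirely, and no schema change is needed.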

UC2: Caching

Memcached is probably the most common and famous example of caching in the NoSQL world*. Notoriously, thousands of memcached nodes allow us all to keep up with the interesting happenings on Facebook. These are typically used in front of a backing data store, provide hash-based most-recently-used caching, and run entirely from RAM. I think the key here is understanding that memcached can be used as part of a massive infrastructure rather than being something particularly revolutionary.

If you aren’t Facebook or similar and are thinking of putting one memcached box in front of a box or two, you might be better off exploring other routes first.

*Other k-v stores are available.
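The usual pattern here is cache-aside: try RAM first, fall back to the backing store, and populate the cache on the way out. A minimal sketch with the python-memcached client (the server address, key scheme and db_lookup stand-in are all placeholders of mine):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def db_lookup(user_id):
    # stand-in for the real (slow) backing-store query
    return {"id": user_id}

def get_user(user_id):
    key = "user:%s" % user_id
    user = mc.get(key)               # try the RAM cache first
    if user is None:
        user = db_lookup(user_id)    # miss: hit the backing data store
        mc.set(key, user, time=300)  # cache the result for five minutes
    return user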

UC3: Development

A relaxed and adaptable schema during software development has obvious benefits.

UC4: Graph Data

The most interesting type of solution, in my opinion, is the graph database, and oddly this seems to be the direction that receives the least attention. I have had a number of problems where I needed to view data from various angles at different times and the relational approach just didn’t work; I was constantly creating temporary tables of underlying data from different directions, which became hard to maintain. Expressed as a graph, I can see that the data could be far easier to work with. Again, the concept of data as a graph is hardly new, but I am about to trial Neo4j as a solution to my current problem, so I shall report back with my findings.

UC5: Analytics

The likes of Hadoop MapReduce can be well suited to analytics. Typically reporting makes it into the code at a much later stage and can easily be forgotten; I have seen many systems spending most of their cycles calculating the nightly sales reports with increasingly complicated SQL queries over their perfectly normalised data sets. Offloading the aggregation, result summarisation and general querying to batch jobs that spread across machines leaves the user-facing side with something close to real-time performance. Google, despite trying to replace it, has used a version of this behind the scenes to build the index behind your search results. It clearly scales.
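For flavour, here is the nightly sales report as a MapReduce job in the Hadoop Streaming style: two plain Python scripts on stdin/stdout (the tab-separated product/amount input format is my invention):

# mapper.py: emit product<TAB>amount for each sale record
import sys
for line in sys.stdin:
    product, amount = line.rstrip("\n").split("\t")
    print("%s\t%s" % (product, amount))

# reducer.py: input arrives sorted by key, so totals can be streamed
import sys
current, total = None, 0.0
for line in sys.stdin:
    product, amount = line.rstrip("\n").split("\t")
    if product != current:
        if current is not None:
            print("%s\t%.2f" % (current, total))
        current, total = product, 0.0
    total += float(amount)
if current is not None:
    print("%s\t%.2f" % (current, total))

Hadoop handles the sorting, distribution and retries; the two scripts stay trivially simple.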

I’m no NoSQL Expert

It is a point worth labouring that the key is picking the right tool for your data. Slightly less obviously, it is about how you need to reference that data, not only today but in the future.

Experiences

Being a developer I had itchy keyboard fingers and didn’t quite get around to researching thoroughly before I trialled MongoDB. Seemingly it was a good match for my data, with its relaxed schema, but there probably isn’t a worse match for the way I need to reference the data. Lesson learnt, until the next time. Had I not experimented, though, I would not have had the joy of expressing my MapReduce functions inside a Mongo query using JavaScript. Whose idea was that?

I am still evaluating Hadoop, the pin-up for NoSQL. I think there is a lot of potential here for MapReduce in my batch operations, a clear fit, but there is considerable set-up overhead. The Hadoop umbrella has also become quite sizeable in its own right, so I expect there is some more value in this area. Neo4j is also looking very promising. It is a graph-based ACID database and as such stands out. Relationships are treated, according to the documentation, as first-class citizens, so I am taking a look at this next. My only concern is how it performs with ad-hoc queries. Failing all this I will go back to multi-pass batch processing on my RDBMS with plenty of caching for good measure. It’s not elegant, but it works.

Social media sites in China

Introduction

When we talk about social media sites, we tend to focus on a number of well-known examples – Facebook, YouTube, and so forth. Yet there are many international social media sites of all kinds. This post is written by Jenny Luo, who is studying Electrical Power Engineering at Bath and works part-time at UKOLN. In it, she looks at some examples of social media sites in China, compares them to other popular sites, and collects together information about the sites’ APIs.

— Emma


Social network sites (similar to Facebook)

1. renren

Renren means “people and people”. Renren is very similar to Facebook, not only in terms of function but also in terms of interface. You need to register on the website first; then you can add other people as your friends. You can upload photos, write diary entries, share videos, and add or delete comments. There are also some online games. Anyway, it’s almost the same as Facebook, but in my opinion the British tend to share more pictures than the Chinese do.

Link: www.renren.com
API: http://www.programmableweb.com/api/renren
Description: This is an open platform based on OpenSocial

2. douban

Douban means “watercress”. On this website you can comment freely on books, films, and music, and find other people’s recommendations for all three. Everything shown on your Douban page is chosen by you; for example, if you are a mother the website will recommend you some recipes. Unlike Renren, which is mainly used by college students, this website is aimed at all kinds of people: it will help you find friends based on what you like, and then you can find more things you like through them. It has more than 50,000,000 users now.

Link: www.douban.com
API: http://www.douban.com/service/apidoc/reference/
Description: The link given above is the documentation for using the API. The Douban API follows the Atom and GData principles. When addition or deletion operations are used, OAuth authentication is required. There are also plenty of instructions on how to use the API to acquire different kinds of information.

3. kaixin001

The website’s name means ‘having fun’. It was set up in 2008 and to date has about 110 million registered users. The website mainly targets urban white-collar workers. It has three main groups of functions: ‘basic tools’, ‘social games’, and ‘other applications’. For example, under ‘other applications’ you can get a weather forecast, a service for buying tickets online, and many other practical applications.

Link: http://www.kaixin001.com/
API: http://wenku.baidu.com/view/671f256e1eb91a37f1115c5a.html
Description: The link given above is the documentation for the kaixin001 API. The API supports Java, PHP, .NET and a variety of other programming languages, and uses a REST connection. All the Kaixin open platform functions are invoked by using HTTP POST to send requests to http://rest.kaixin001.com/api/rest.php. The documentation then covers the Users API, Friends API, Actions API and so on.

Blogging (similar to Blogger)

1. Sina Weblog

‘Sina’ is not a Chinese word; I asked Baidu why it got this name, and apparently it means China in Latin. Sina Weblog is the most popular and mainstream blogging service in China. It includes entertainment celebrities’ blogs, intellectual celebrities’ blogs, personal blogs, and ordinary people’s blogs.

Link: http://blog.sina.com.cn/
API: http://blog.csdn.net/用户名/services/metablogapi.aspx
Description: The link above was found on a message board. It seems that, so far, Sina blog has not provided public access to its API. The Chinese characters in the link above mean ‘user name’.

2. SoHu Weblog

‘So’ means ‘searching’ and ‘Hu’ means ‘fox’, so the name means ‘search for a fox’. It’s similar to Sina’s service, but not as popular as the one above.

Link: http://blog.sohu.com/
API: http://ow.blog.sohu.com/guide#11
Description: Sohu Open Widget (SOW) is proposed by the Sohu company and is based on the UWA (Universal Widget API) principle, a widget standard applied on many platforms. A brief introduction to SOW follows. Firstly, it’s based on a standard widget principle, the UWA principle. Secondly, anyone can use it to develop their own widget, share it with friends, and add it to their Sohu blog. The page then gives plenty of information about how to apply.

Microblogging services (similar to Twitter)

1. Sina Microblogging

Sina Microblogging is very similar to Twitter; you could also call it ‘one sentence’ blogging. Users can send messages via their mobile phones’ SMS functions, WAP, the Internet, or MMS. You can send what you hear, see, and think immediately; your friends can see what you sent straight away and add comments. It has the most users of all Chinese microblogging services, and celebrity presence is its defining characteristic: Sina Microblogging invites stars and celebrities to be users and will authenticate them. After authentication, a letter ‘V’ is added after their names to distinguish them from ordinary users.

Link: http://weibo.com/
API: http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3_V2#.E5.BE.AE.E5.8D.9A
Description: The link above is an introduction to the Sina microblog API. The documentation contains detailed information about reading and writing microblog posts, users, comments, relationships, accounts, topics, searching, registration, and so on. Some entries are marked with a red Chinese character, which means they require a higher permission level and can only be used by making an application.

2. SoHu microblogging.

It’s very similar to the one above, but not as popular.

Link: http://t.sohu.com/
API: http://open.t.sohu.com/en/%E9%A6%96%E9%A1%B5
Description: The link above gives instructions for using the Sohu API. Some English labels are provided.

Instant Messaging (similar to MSN)

1. QQ

The name means cute, because its symbol is a little penguin. The software supports online chatting, video chatting, sending documents, sharing documents, QQ email, and Netdisk. You can also join groups set up by QQ users; for instance, if you like yoga, you can join a yoga group and chat with the people in it. It’s the most widely used chat software in China.

Link: www.qq.com
API: http://wiki.opensns.qq.com/wiki/%E3%80%90QQ%E7%99%BB%E5%BD%95%E3%80%91Qzone_OAuth2.0%E7%AE%80%E4%BB%8B
Description: The link above covers: 1) a brief introduction to Qzone OAuth 2.0; 2) the procedures involved in using Qzone OAuth 2.0. Step 1: acquire an access token. Qzone OAuth 2.0 provides different login checking and authorization procedures for websites, mobile applications and desktop applications, and currently offers two ways to acquire an access token: server-side mode and client-side mode. The two modes differ only in how the access token is acquired; the subsequent steps of acquiring an openid and invoking the API work the same way. Step 2: use the access token to get the corresponding openid. Step 3: use the OpenAPI to get resources. You can also get specific information about the API from the following link: http://wiki.opensns.qq.com/wiki/%E3%80%90QQ%E7%99%BB%E5%BD%95%E3%80%91API%E6%96%87%E6%A1%A3

2. MSN

It needs no further introduction, but it is not as popular as QQ in China.

Link: http://cn.msn.com/
API: http://dev.live.com/messenger/

Video Sharing (for example, YouTube)

1. YOUKU

The name means ‘you are so cool’. Youku is a very popular video-sharing website in China, founded in 2006. In 2007 it ran a campaign called ‘LOMO is everywhere’, which turned out to be very successful. It’s a website which gathers lots and lots of people who like to share their videos.

Link: www.youku.com
API: http://dev.youku.com/
Description: The link describes the basic functions of Youku’s open API: uploading videos, getting users’ related data, and broadcasting, which includes designing your own player’s appearance. Its characteristics are data in XML or JSON format, and client-side JavaScript for inserting Youku videos directly without having to add anything on the server side. The Youku API is only open to partners, so a partner ID must be applied for first.

2. Tudou

Its name means ‘potato’. You can upload, download, and share videos through this website.

Link: www.tudou.com
API: http://api.tudou.com/wiki/index.php/%E9%A6%96%E9%A1%B5
Description: The link given is the Tudou open platform documentation. Any developer can use a Tudou account to log in to the open platform and apply for an app; each app has a limited quota of interface requests. The interface includes functions that require user authorization, and others for which user authorization is not needed.

Photo sites (similar to Flickr)

1. babidou

I don’t actually know the meaning of ‘babidou’, but I think the name sounds good in Chinese and may mean ‘some interesting beans’. The Babidou internet photo album and storage centre was founded in 2005, and specially serves internet businesses. Almost all its photos come from Taobao, Yiqu and Paipai, some of the large internet shopping websites. It has a strong document management system and a user-friendly interface, is very easy and convenient to use, and is totally free.

Link: www.babidou.com
API: N/A.
Description: N/A.

2. bababian

‘Bababian’ means ‘change’. This website imitates Flickr. It has two subsites: one is for internet shops, the other is for individual photo albums.

Link: www.bababian.com
API: http://www.bababian.com/api/api.htm
Description: This is the open platform link of bababian, but a bababian account is needed to see more detailed information.

Social bookmarking services (for example AddThis or Delicious)

1. JiaThis
‘Jia’ means ‘add’ in Chinese, and its function matches its name. It provides website link collection, website sharing, and website link sending. Users can share anything they want to many popular social websites by using this tool.
Link: www.jiathis.com
API: http://www.jiathis.com/help/html/share-with-jiathis-api
Description: The link gives a standard JiaThis API interface, which makes implementation easier. This link, http://www.jiathis.com/send/?webid=shareID&url=$siteUrl&title=$siteTitle&uid=$uid, is the standard form; the share ID can be gleaned from an ID list. $siteUrl is the weblink that you want to share, $siteTitle is the shared website title, which can also be defined by yourself, and $uid (optional) is used for data statistics. Four examples are also given.

2. bshare.cn

bShare is also a Web 2.0 social sharing button tool.

Link: www.bshare.cn/index
API: http://www.bshare.cn/api
Description: The link above gives detailed information about the API. It’s an open platform.

3. Baidu share

This is almost the same as the tools above.

Link: http://share.baidu.com/
API: http://open.baidu.com/
Description: The link given above is the Baidu open platform, where you can find the APIs of various Baidu applications, such as Baidu share, Baidu map, and Baidu encyclopedia. http://share.baidu.com/get-codes gives the specific code for Baidu share, along with instructions for use: you can copy the code given and paste it anywhere in the webpage between <body> and </body>.

Online Trade (for example, eBay)

1. taobao

‘Taobao’ means ‘finding treasure’ in Chinese. Taobao is the biggest online retailer in Asia Pacific. It was founded by Alibaba in 2003. Its business includes C2C (person to person) and B2C (business to person) sales. By 2008 it had more than 98 million registered members, representing about 80% of China’s online trade, and its turnover reached 41.3 billion yuan.

Link: www.taobao.com
API: http://open.taobao.com/doc/api_list.htm?id=102
Description: The link given above is an API list for Taobao. It has APIs for users, products, businesses and so on; you can click on what you need to get more detailed information.

Deal of the day (for example: Groupon)

1. meituan

Meituan means ‘shopping happily in a group’. This group-shopping website was founded by the same company that founded Renren. Meituan recommends one vetted local service to you every day; its recommendations are meant to be of excellent quality and reasonable value.

Link: www.meituan.com
API:http://www.meituan.com/help/api
Description: The link given is the Meituan API. You can get a cities API and also a daily deals API.

A general overview of lots of Chinese social media sites, including many not covered here, can be found at: http://www.bshare.cn/share. English speakers might prefer to read it through Google Translate (Click here for a translation).

CC image by Dainis Matisons

Extended Repository PDF Assessment

As part of FixRep a small project is being carried out to examine the use of metadata in pdf documents held in HE/FE repositories (such as the University of Bath’s Opus repository). This builds on an initial pilot that was carried out using pdfs harvested from Opus, which we wrote a paper about for QQML 2010 (Hewson & Tonkin, 2010).

The original study of Opus was an exploration to test out the extraction and analysis process. Obviously an initial analysis focusing on only one repository could only be used to draw conclusions about what’s in Opus: the issues it may present, metadata usage, and so on. The extended assessment is examining pdfs from about a dozen UK repositories so that a reasonable analysis of metadata, a comparison of ‘standard’ usage, and a picture of common vs. unique issues can be obtained.

So, how are we going about this?

It’s a pretty manual process at the moment, in that each of the stages is kicked off manually, and it can be divided into three stages…

  • Harvest the pdf files
  • Extract the metadata into a database
  • Analyse content

Harvest…

Using wget, the repository structure containing the pdf files is copied to a local server. This process takes some time and can be rather heavy-handed in the overhead it places on the repository server through continual requests for files. If we wanted to extend the extraction and analysis process into a service that updates regularly, a more considerate approach towards the repositories would be required. However, we’ve got away with it at least this far!
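The invocation is along these lines (the exact flags and the URL are illustrative; the --wait pause softens the load on the server a little):

wget --recursive --no-parent --accept pdf --wait 1 http://repository.example.ac.uk/

Everything wget saves then sits in a local directory tree mirroring the repository’s structure, ready for the extraction stage.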

Extract & load…

A prototype service that extracts information from pdf documents was developed as part of the FixRep project. It extracts the information in a number of ways:

Heading and formatting analysis, such as:

  • The version of the PDF standard in use
  • Whether certain features, such as PDF tagging, are declared to be in use
  • The software used to create the PDF
  • The publisher of the PDF
  • The date of creation and last modification

Information about the body of the documents:

  • Whether images or text could be successfully extracted from the document and, if they could, information about those data objects.
  • If any text could be extracted from the object, further information such as the language in which it appeared to be written and the number of words in the text

Information from the originating filesystem, such as:

  • document path
  • document size
  • creation date, etc.

The extracted information is put into intermediate files in an XML format and is then loaded into a MySQL database for…

Analysis…

PDF Processing

The first thing we actually look at is how many of the harvested pdf files could be processed and, for those that failed, why they failed. For example, in our pilot run against the Opus content about 80% of pdf files could be processed. The 20% that failed were mainly due to the service being unable to extract meaningful information, while a very small number of files turned out to be ‘bad’ pdfs – that is, the files were corrupted or not in a recognisable file format. Some of the errors identified were recoverable with some manual intervention, while others meant the file had to be excluded as un-processable.

While not definitive, this does give us a baseline expectation for the success rate of extracting meaningful information from other repositories.

Once we have the data in the database, it’s easy enough to run some SQL to generate simple statistics, such as the type and number of distinct metadata tags used, the average number of metadata tags per file, and so on. This gives us a good overview of the content of a given repository and of whether the content is consistent within and between repositories.
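Queries of roughly this shape do the job (the table and column names here are mine for illustration, not the actual schema):

SELECT tag_name, COUNT(*) AS uses, COUNT(DISTINCT file_id) AS files_using_tag
FROM metadata_tags
GROUP BY tag_name
ORDER BY uses DESC;

SELECT COUNT(*) / COUNT(DISTINCT file_id) AS avg_tags_per_file
FROM metadata_tags;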

Next steps…

The target repositories have been harvested and will soon be processed for analysis. So, unless some very unexpected processing problems happen we should have results and be ready to produce a report on this project in early December.

Day-to-day work

1. Scrum meeting + meeting minutes

  1. See the progress made by those present: what they did yesterday, what they expect to do today, and what worked for them. This allows the manager to keep track of how the project development is progressing and how the team are performing.
  2. Find and solve potential problems before they become significant or expensive!

2. Collaboration on code

  1. Allows developers to learn from each other’s expertise; permits peer review of code (extreme programming style?) and review of functionality.
  2. A long-term goal for this approach is to encourage developers to share code, to think of it as ‘our code’ rather than ‘my code’ and to be more open to review, reuse, constructive criticism, etc.

3. Moving code between machines & testing code on different workstations

  1. A single functional installation does not mean that a development project is finished, since it may be very difficult to set up on other platforms, to understand or to reuse.
  2. It should be functionally portable and include all necessary libraries, scripts, datasets and configuration to promote remote development, reuse and external contribution to the codebase.
  3. This also encourages review and testing since it tends to highlight any difficulties with installation and use of newly developed components.

Active collaboration and a flexible approach to development in particular tend to optimise productivity, in that time spent coding also has a knowledge-sharing component, and relatively little time is spent becoming familiar with code before beginning to contribute.