September 25, 2006

Opening Up OU Administrative Content with Dapper

Over the weekend, I had a little tinker with Dapper, a tool that lets you define what are essentially screen-scraping profiles that can be applied to a set of 'look-alike' webpages, with the scraped information then presented in a variety of formats - XML, HTML, JSON, and so on.

My first attempt was to try and scrape some of the OU Library Voyager catalogue results pages. However, the HTML is such a mess on those pages that it bugged Dapper out (confirmed via an email to the Dapper team). Attempts at scraping the list of databases also met with failure, presumably because the listing is so long...

So instead I turned to the OU course catalogue, with a view to seeing what I could pull out of pages like those for T324 Keeping Ahead in Information and Communication Technologies or TU120 Beyond Google.


How to use Dapper to scrape a page is well illustrated by a screencast demo on the Dapper site, so I won't repeat it here...

So after half an hour or so, how far did I get? Well - why not see for yourself: OU Course Info v0.01a:


The information I went for was the course code and title, and the next start date (and fee), plus an attempt at pulling out some assessment data.

I also built in a query argument to let users enter a course code so that information can be scraped from any course page:
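In script terms, the query argument amounts to substituting a user-supplied course code into the catalogue page URL before the scrape runs. A minimal sketch of the idea (the URL template here is a made-up placeholder, not the real OU catalogue address):

```python
from urllib.parse import quote

# Hypothetical URL pattern for illustration only - the real OU course
# catalogue URL structure differs. The point is just that one course
# code parameter selects which 'look-alike' page gets scraped.
def course_page_url(course_code):
    return "http://example.org/ou-course?code=" + quote(course_code.upper())

print(course_page_url("t324"))  # -> http://example.org/ou-course?code=T324
```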


Usage options include obtaining an XML feed (e.g. here's an XML feed for Robotics and the Meaning of Life).
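The XML option is the one that makes the scraped data programmatically useful. As a rough sketch of consuming such a feed (the element names - courseCode, courseTitle, startDate - are assumptions for illustration, not Dapper's actual output schema):

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment of the sort of XML a Dapp might return;
# the element names are invented, not Dapper's real schema.
sample_feed = """
<dappData>
  <course>
    <courseCode>T324</courseCode>
    <courseTitle>Keeping Ahead in Information and Communication Technologies</courseTitle>
    <startDate>2007-02-01</startDate>
  </course>
</dappData>
"""

root = ET.fromstring(sample_feed)
for course in root.findall("course"):
    code = course.findtext("courseCode")
    title = course.findtext("courseTitle")
    print(code, "-", title)
```

In practice you'd fetch the feed URL with something like urllib rather than a literal string, but the parsing step is the same.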


The link option looks interesting, though I haven't tried it yet. What it seems to offer is a way of wiring, or pipelining, several Dapper apps together...

As a tool for rapid prototyping "data-web" services by screenscraping legacy web pages, Dapper looks promising. Stability issues - in the sense of parsing tatty HTML pages - need addressing, as the attempt to scrape the OU Voyager catalogue shows, but using a tool like this is certainly easier than trying to write your own regular expressions and Perl scripts!
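For comparison, here's a toy version of the hand-rolled alternative that Dapper saves you from - pulling one field out of raw HTML with a regular expression (in Python rather than Perl, and with invented markup; real catalogue pages would need many more patterns and much more error handling):

```python
import re

# Invented markup standing in for a 'tatty' legacy catalogue page.
html = '<td class="coursetitle"><b>TU120 Beyond Google</b></td>'

# One hand-written pattern per field - brittle, and it breaks the
# moment the page template changes.
match = re.search(r'class="coursetitle"><b>([^<]+)</b>', html)
title = match.group(1) if match else None
print(title)  # -> TU120 Beyond Google
```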

I wonder whether a feature to run a page through Tidy and then apply an XSLT to it to try and tidy it up/preprocess it a little before working the Dapper magic may be appropriate? It would raise the barrier to entry somewhat for novice users, but then again, it could just be offered as a power tool...

...and it would still be easier than writing Perl from scratch!
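To give a flavour of what that preprocessing step buys you: a forgiving HTML parser copes with the unclosed tags that choke a strict XML toolchain. The sketch below uses Python's standard-library HTMLParser as a crude stand-in for Tidy (a real pipeline would run Tidy to produce well-formed XML and then apply an XSLT stylesheet; this just demonstrates the idea on invented markup):

```python
from html.parser import HTMLParser

# Pull the text out of table cells in tatty HTML - note the
# unclosed <td> below, which a strict XML parser would reject.
class CellTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data.strip()

parser = CellTextExtractor()
parser.feed("<table><tr><td>T324<td>Keeping Ahead</td></tr></table>")
print(parser.cells)  # -> ['T324', 'Keeping Ahead']
```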

Anyway, I'm looking forward to using Dapper quite a bit more over the coming months in an attempt to get OU related information into a form in which it becomes more generally OUseful, so I'll keep you posted ;-)

PS 8th Feb 07: it seems that the Dapp described above has rotted somewhat, and quite a few of the fields aren't being populated. This is always going to be an issue with scraping pages, and one reason why applying things like GRDDL to pages containing meaningful semantics is a better bet. Until the day when XML feeds of content like this are provided as a matter of course, that is... Roll on XCRI.

Posted by ajh59 at September 25, 2006 11:49 PM