Now I'm not one to criticise the Library (?!;-), but over the weekend I've been playing with one of the Library's newest collections, the Open Repository Online, with a view to visualising who's co-authored papers in the collection with whom. And looking through the author information provided in the results listing for a search on a particular author name, I have to take issue with one or two things...
Like the considerable amount of ambiguity in both the naming convention and the stylistic presentation used for name references.
For example, the following names all appear in the author information for papers returned from an ORO name search on the author "pillinger":
If I was to construct a table of author names, using direct/literal pattern matching of text strings to match individuals across different article references, all of the above Pillinger variants would appear as different authors, because the text strings are different and simple string matching is dumb.
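To make that concrete, here is a minimal sketch of the naive approach, assuming the author references have already been pulled out of a results listing into an array. The `references` sample (three variants quoted elsewhere in this post) and the function name `countAuthorsLiteral` are mine, purely for illustration:

```javascript
// Hypothetical sample: three presentation variants of one author's name.
const references = [
  "Pillinger, Colin T.",
  "Pillinger, C. T.",
  "Pillinger, Colin"
];

// Count "distinct authors" by literal string identity only.
function countAuthorsLiteral(names) {
  return new Set(names).size;
}

console.log(countAuthorsLiteral(references));
```

Because a Set only collapses strings that are character-for-character identical, each variant is counted as a separate author.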
(This is just one of the issues that was mentioned at the CRiG barcamp a month or so ago, the CRIG DRY (Don't Repeat Yourself) Metadata Barcamp, in particular by the MIMAS Names project, which "is going to scope the requirements of UK institutional and subject repositories for a service that will reliably and uniquely identify individuals and institutions" in order to "provide important information about the future usefulness of a name authority service for institutional and subject-based repositories, and other applications beyond the repository sector." One of the things that was discussed then related to reconciling repository name references with staff records. But for data end users, if different presentation modes (as above) are used to reference the same individual, then simple pattern matching algorithms used in mining the data turn out to be brittle, unless the data user seeks to normalise the data in some way. Which may be advisable..? That is, always assume the data is messy/noisy...)
Where the information is the source of its own metadata, or where appropriate web page markup - such as microformats* and cool URIs - provides implicit metadata that the adventurous can build hacks around, it's quite handy if the literal textual representation of the data is coherent/self-consistent; and in the case of ORO, it doesn't appear to be...
*though bear in mind this recent second look at microformats from an accessibility point of view from the BBC: Removing Microformats from bbc.co.uk/programmes!
So this is where I start to wonder where the Library fits in to the workflow of online university research repositories if they are to be useful to data users? For example, what do we conclude if the Library can't guarantee the quality of data in their own, most recent, collection (which was born de novo only a couple of years ago)? It seems to me that "reference gardening" (which must be one of the lousiest jobs imaginable!) is one of the areas where they could really add value, although ideally the information would be collected/submitted in a way that removed ambiguity and guaranteed conformity with some standard, authoritative naming scheme.... But that will never happen...
So I appreciate that no matter how many checks and balances you put in place, there is always going to be mess. And maybe the Library can help by finding ways to help users cope with the inevitable mess... Entropy always wins out...
Ideally, of course, the data would be clean/correct, and if it wasn't, then ideally any errors, once found, would be corrected. But maybe there's also scope for the Library to accept that the data is a mess, and help data patrons find heuristics for cleaning it and removing some of the noise. For example, as well as helping users write advanced (and robust) search queries (like how to use advanced search operators and set up saved search alerts), I wonder whether they should also maintain a set of 'reference functions' that users of (messy) "data as metadata" can apply as heuristics to try and tidy up textual data.
So for example, the following JavaScript function will (I think? Err...) replace a spelled out forename in the above name referencing style with an initial:
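// The optional \.? absorbs an existing full stop, so an initial such as "C." is not turned into "C.."
function tidyName2(s){return s.replace(/(,\s*[A-Z])[a-z]*\.?/g, "$1.");}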
e.g. Pillinger, Colin T. will become Pillinger, C. T.
And this one will remove the punctuation:
function tidyName3(s){return s.replace(/[\s,.]+/g,"");}
e.g. Pillinger, C. T. will become PillingerCT
By applying several of these heuristics, we can produce a name format that removes some of the typographical ambiguity, though losing an initial may be going too far? e.g. going from PillingerCT to PillingerC may be the wrong thing to do, even though it would set up an identity between these two original forms: Pillinger, Colin and Pillinger, C. T.
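As a rough sketch of how such heuristics might be chained, the following reduces a "Surname, Forenames" reference to a comparison key and groups references by that key. The names `nameKey`, `groupByKey` and the `maxInitials` parameter are hypothetical, and the code assumes every reference has the surname before the first comma, which real repository data will not always honour:

```javascript
// Reduce "Surname, Forenames" to a key such as "PillingerCT".
// maxInitials (hypothetical) caps how many forename initials are kept.
function nameKey(name, maxInitials = Infinity) {
  const [surname, forenames = ""] = name.split(",");
  const initials = forenames
    .split(/[\s.]+/)                      // break on whitespace and full stops
    .filter(Boolean)                      // drop empty fragments
    .map((part) => part[0].toUpperCase()) // keep only the initial
    .slice(0, maxInitials)
    .join("");
  return surname.trim() + initials;
}

// Group the original references under their normalised key.
function groupByKey(names, maxInitials = Infinity) {
  const groups = new Map();
  for (const name of names) {
    const key = nameKey(name, maxInitials);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(name);
  }
  return groups;
}

const variants = [
  "Pillinger, Colin T.",
  "Pillinger, C. T.",
  "Pillinger, C.T.",
  "Pillinger, Colin"
];

console.log(groupByKey(variants));    // keeps every initial
console.log(groupByKey(variants, 1)); // keeps only the first initial
```

Keeping every initial leaves "Pillinger, Colin" in a group of its own, whereas capping the key at one initial merges all four references. That is the trade-off described above: the cap would just as happily merge two different people who share a surname and a first initial.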
That said, how many library staff in general even know what a regular expression is, let alone how to construct one to help a 'search and replace' patron tidy a data set in a text editor or even in Word (Add power to Word searches with regular expressions)?
http://orlabs.oclc.org/viaf/LC|n+88128999
http://orlabs.oclc.org/viaf/BNF|FRBNF123588880
http://errol.oclc.org/laf/n+88128999.html
Posted by: lorcan dempsey at July 7, 2008 04:33 AM
Isn't this a good argument for the Semantic Web? When we talk about authors, we do not talk about character strings but about objects (persons). So, regardless of whether we talk about TBL or Berners-Lee, since we point to a URI, e.g., http://dbpedia.org/resource/Tim_Berners-Lee, we know who we mean.
Posted by: Carsten at July 7, 2008 05:59 AM
Whilst there are authoritative name services out there, do they offer a service that will accept a scruffy/messy name and send back estimates of authoritative names, maybe with levels of confidence? eg is this what http://orlabs.oclc.org/viaf/search/VIAF?query=local.personalName+all+%22pillinger%20colin%20 demos?
With e.g. http://worldcat.org/identities/lccn-n88-128999 how do I generate "lccn-n88-128999" without a handshake and some guesswork? Also, it'd be nice to have a switch http://worldcat.org/identities/lccn-n88-128999?limit=alt_id which just returned the alternative names?
One thing the post got me thinking about, and about which I need to ponder a bit more, is the case where web page data essentially is its own metadata. In which case you need to expose the data in a consistent way (like the structure of author names, or references to particular journals); I guess an alternative would be to admit mess at the rendered page view ("Pillinger, Colin" or "Pillinger, C.T.") as long as the names are marked up with an authoritative link or microformat eg [a href="http://dbpedia.org/resource/Tim_Berners-Lee"]Berners Lee, T[/a] or [a href="http://dbpedia.org/resource/Tim_Berners-Lee"]Berners Lee, Tim[/a] would both be reconcilable back to the same person.... In which case I guess a 'presentation layer' service could accept mess ("Berners Lee, T", "Tim Berners Lee"), leave it as mess in the presentation layer, but disambiguate/reconcile the identities with another layer of info from the gardener (the [a href="http://dbpedia.org/resource/Tim_Berners-Lee"]Berners Lee, T[/a] link)?
In this way, the info one layer down is both potentially more authoritative, but also heuristic?! I guess a confidence measure could be applied in the link? [a href="http://dbpedia.org/resource/Tim_Berners-Lee" name="Tim Berners Lee, (confidence 0.82)"]Berners Lee, T[/a]
Posted by: Tony Hirst at July 7, 2008 12:24 PM
Any ambiguity in the naming conventions is surely due to the different conventions of the publishers where this material was originally published. If the source data isn't consistent how can anything aggregating that be consistent?
Databases like Scopus and Web of Science are also working on ways to pull together authors who publish under different names. It seems as if they're using affiliation as a way of linking them together, but it's difficult to tell how they're doing it technically.
Posted by: Clari Hunt at July 7, 2008 02:33 PM
Sorry, I guess we haven't done a very good job of letting people know that the ORLABS Identities service is in production as WorldCat Identities now. Anything you can do in ORLABS, you can do in WorldCat Identities.
The question asked above was, how do you get a WorldCat Identities URI? Go to the WorldCat Identities search page (http://worldcat.org/identities) and throw as dirty a name as you like at it. It will come back with a ranked list of suggestions.
Much to my surprise, Pillinger, C. T. seems to be the way the Library of Congress has controlled his name, even though they know he is also Colin T.
Thom Hickey has a nice post on how to interface with WorldCat Identities here: http://outgoing.typepad.com/outgoing/2008/06/linking-to-worl.html