April 14, 2008

Searching for Sites that Link to the Target of a TinyURL

Last week, I think, I came across a RESTful way into a set of JSON-returning Google search web services that let you call on individual Google search engines (the main web search engine, the book search engine, the blog search engine, and so on) and get nice clean results back (albeit not necessarily many at a time). If you're impatient for the detail, check the service(s) out here: Google AJAX Search API (Flash and other Non-Javascript Environments).

Google JSON search APIs

What this means, of course, is that Google searches now become trivially available within Yahoo Pipes (without the need for screenscraping). So, for example, here's a way of seeing which pages Google thinks link to a particular web page, using the link: search limit:

pipes google link search

Given an input URL (via the pipe's homepage, or as an argument in the pipe's URL), the String Regex block rewrites the URL (e.g. http://www.open.ac.uk) in the form link:www.open.ac.uk. link: is a Google advanced search limit that says "find me the sites that link to the following URL". (Google Web Search Features - Who Links to You?)
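Outside of Pipes, the rewrite step might look something like the following Python sketch. The function name `to_link_query` is made up for illustration, and the assumption that the Regex block simply strips the leading `http://` (or `https://`) is mine; the pipe's actual pattern isn't shown here.

```python
import re


def to_link_query(url: str) -> str:
    """Rewrite a URL as a Google link: query (hypothetical helper)."""
    # Assumption: the rewrite just drops the scheme and prefixes "link:".
    bare = re.sub(r"^https?://", "", url.strip())
    return "link:" + bare


if __name__ == "__main__":
    print(to_link_query("http://www.open.ac.uk"))  # prints link:www.open.ac.uk
```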

The URL Builder block then constructs the URL of a query to the Google AJAX API. Depending on the actual service URL used in the URL Builder, this might return results from the Google web search engine or the blog search engine, for example.
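As a rough sketch of what the URL Builder is doing, here is the same idea in Python. The base URL, the `web` and `blogs` service names, and the `v=1.0` parameter are written from memory of the AJAX Search API rather than taken from the pipe, so treat them as assumptions; `build_search_url` is a made-up name. The code only builds the string and makes no request.

```python
from urllib.parse import urlencode

# Assumed base URL for the Google AJAX Search API (not taken from the pipe).
BASE = "http://ajax.googleapis.com/ajax/services/search/"


def build_search_url(query, service="web", rsz=None):
    """Build a query URL for one search service (hypothetical helper).

    service: "web" or "blogs" (assumed service path names).
    rsz:     optional result size, e.g. "large".
    """
    params = {"v": "1.0", "q": query}
    if rsz:
        params["rsz"] = rsz
    # urlencode percent-encodes the ":" in "link:..." as %3A.
    return BASE + service + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("link:www.open.ac.uk"))
    print(build_search_url("link:www.open.ac.uk", service="blogs", rsz="large"))
```

Swapping the `service` argument is the equivalent of changing the service URL in the URL Builder block.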

The Fetch Data block can receive XML or JSON feeds as input. In this case, we're receiving JSON from the Google AJAX (JSON) API.

If you've been reading OUseful.info over the last few days, you'll know I've been playing with various TinyURL related pipes (e.g. The delicious history of a TinyURL).

So here's another one: a service that takes a TinyURL and then sees, via the Google web service, which web pages and which blog pages link to the page the TinyURL actually refers to... TinyURL inlinks pipe (blog and web).

Firstly, we decode the TinyURL using something I prepared earlier ;-)

tinyurl inlinks

The Split block sends the URL in two directions. One strand is going to seed the blog search, the other the web search.

Here's the blog search path:

pipes blogsearch inlinks

The loop block cycles through each item in the feed (in fact, there is only a single item, and that contains the URL pointed to by the TinyURL) and passes it to the URL inlinks (blog search) pipe. This nested pipe accepts a URL (passed from the title of the feed item that entered the Loop block) and uses it as an argument in a Google blog search AJAX API request. The result is used to replace the feed item that entered the block.

The format of the result is still in the form returned from the web service, so the Rename block makes sure that a valid RSS feed item response is produced.
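To make the Rename step a bit more concrete, here's a small Python sketch that maps a result from the JSON response onto RSS-style fields. The sample response and its field names (`responseData`, `results`, `titleNoFormatting`, `unescapedUrl`, `postUrl`, `content`) are assumptions based on my recollection of the API's output, not something copied from the pipe, and `to_rss_items` is a made-up name.

```python
import json

# Hypothetical sample response; the field names are assumptions.
SAMPLE = """
{"responseData": {"results": [
  {"titleNoFormatting": "Example post",
   "postUrl": "http://example.com/post",
   "content": "A post that links to the page"}
]}}
"""


def to_rss_items(raw_json):
    """Rename search result fields to title/link/description."""
    data = json.loads(raw_json)
    results = (data.get("responseData") or {}).get("results", [])
    items = []
    for r in results:
        items.append({
            "title": r.get("titleNoFormatting", ""),
            # Web results are assumed to use unescapedUrl, blog results postUrl.
            "link": r.get("unescapedUrl") or r.get("postUrl", ""),
            "description": r.get("content", ""),
        })
    return items


if __name__ == "__main__":
    for item in to_rss_items(SAMPLE):
        print(item["title"], item["link"])
```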

The two strands then get merged and passed through a regular expression that tidies up some escaped characters:

tinyurl inlinks pipe regexp
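In Python, the merge-and-tidy step could be sketched along these lines. Which escaped characters the pipe's regular expression actually handles isn't spelled out above, so this assumes two common cases: literal `\uXXXX` sequences left over from the JSON, and HTML entities such as `&amp;` (handled here with `html.unescape` rather than a regex). The names `tidy` and `merge_and_tidy` are made up.

```python
import html
import re


def tidy(text):
    """Undo some escaping in a string (the cases covered are assumptions)."""
    # HTML entities such as &amp; or &#39;
    text = html.unescape(text)
    # Literal \uXXXX sequences, e.g. \u0026 for "&"
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)


def merge_and_tidy(blog_items, web_items):
    """Concatenate the two strands and clean up each item's title."""
    merged = list(blog_items) + list(web_items)
    return [{**item, "title": tidy(item.get("title", ""))} for item in merged]


if __name__ == "__main__":
    blogs = [{"title": "Tom &amp; Jerry", "link": "http://example.com/a"}]
    web = [{"title": "Q \\u0026 A", "link": "http://example.com/b"}]
    for item in merge_and_tidy(blogs, web):
        print(item["title"])
```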

Easy :-)

Posted by ajh59 at April 14, 2008 10:37 AM
Comments

Thanks for the Google search API link.
You can get more results by using the rsz parameter with a value of "large", and even more results by using a simple front-end such as http://pipes.yahoo.com/pipes/pipe.info?_id=WCnVdSIK3RG9oS6fLe2fWQ

Posted by: hapdaniel at April 14, 2008 02:17 PM