Thursday, December 21, 2006

Scraping Google

So Google scrapped their SOAP API. That's terrible! What can we do? Some people have gone so far as to say this is the end of Google. Others, over at EvilAPI, have cloned the old SOAP API by parsing search results. But we don't even need to do that; we can just parse the web pages ourselves. Google's mobile device interface uses well-formed XHTML, so it's relatively easy to parse. Who needs SOAP? It took me a while to get this right, but in the end it was pretty simple to get a basic system up and working:

TUPLE: link title quote url ;

: parse-google ( xml -- seq-links )
    "div" get-name-tags 2 12 rot <slice> [
        [ "a" get-name-tag children>string ] keep ! get title
        [ ! get quote
            tag-children
            2 swap dup length 2 - swap subseq
        ] keep
        "span" get-name-tag children>string ! get URL
        1 swap [ length 2 - ] keep subseq
        <link>
    ] map ;

This, however, is pretty brittle: it will break whenever Google changes its markup, and it's poorly factored. It should probably be restructured. It's just a start.
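For readers who don't know Factor, the same idea can be sketched in Python with the standard library's XML parser. This is not the code from the article, and the XHTML fragment below is a made-up stand-in for one search-result div (the real mobile-interface markup is an assumption, and has surely changed since); the point is only that well-formed XHTML needs nothing heavier than a generic XML parser:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Link:
    """Mirrors the TUPLE: link title quote url ; from the Factor code."""
    title: str
    quote: str
    url: str

# Hypothetical result markup; the real structure of Google's
# mobile pages is an assumption here, not taken from the article.
SAMPLE = """
<div>
  <a href="#">Factor programming language</a>
  <span>factorcode.org</span>
</div>
"""

def parse_result(xhtml: str) -> Link:
    # Because the input is well-formed XHTML, a strict XML parser works;
    # no tag-soup heuristics are needed.
    div = ET.fromstring(xhtml)
    title = "".join(div.find("a").itertext())  # text of the <a> tag
    url = "".join(div.find("span").itertext())  # text of the <span> tag
    return Link(title=title, quote="", url=url)

link = parse_result(SAMPLE)
print(link.title)
print(link.url)
```

Like the Factor version, this would break the moment the markup changes; a real scraper would want more defensive selection of elements.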

This article isn't meant to be in praise of Factor for making this so amazing and easy; it is just a demonstration that complex frameworks like SOAP aren't needed to move data between computers around the web, and that just because Google removed their only official API to access their data on the server side doesn't mean we can't make our own.

1 comment:

Ray Cromwell said...

"and that just because Google removed their only official API to access their data on the server side doesn't mean we can't make our own."

Google isn't dropping API access to data on their server, they are dropping SOAP and officially moving to GData and JSON.