Scraping websites using YQL

Posted on 19. Jan, 2010 by Nikhil Sheth in Knowledge, Tips, Tools

We typically do a lot of data scraping for various projects. We use either PHP functions or cURL to scrape data from remote websites. This involves reading the content of a remote URL, removing unwanted characters from it, and then using regular expressions to pull out the relevant data.

While this is not that complex, it sometimes becomes too messy. So while exploring other solutions for the same problem, I came across YQL.

The YQL platform provides a single endpoint service that enables developers to query, filter and combine data across Yahoo! and beyond.

Using YQL, you can use a simple web service to extract data from HTML documents. As an added bonus, the YQL engine removes wrongly encoded characters and runs the retrieved data through HTML Tidy to return valid HTML. For example, to get the body content of CNN.com, all you need is:

select * from html where url="http://cnn.com"
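
To show how a statement like this travels to the web service, here is a minimal Python sketch that only builds the request URL. The endpoint is an assumption on my part (the public YQL URL as Yahoo! historically documented it), `build_yql_url` is a hypothetical helper name, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Assumption: the public YQL endpoint as historically documented by Yahoo!.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="json"):
    """Return a GET URL that carries the YQL statement in the `q` parameter."""
    # urlencode percent-encodes quotes, colons and slashes and turns spaces into '+'.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


url = build_yql_url('select * from html where url="http://cnn.com"')
print(url)
```

The whole statement goes into a single query-string parameter, which is why the quotes and the inner URL have to be encoded.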

The really cool thing about YQL is that it lets you use XPath to filter down to just the data you want to extract. For example, to get all the Indian states from india.gov.in you can use:

SELECT * FROM html WHERE url="http://india.gov.in/knowindia/state_uts.php"
AND xpath="//div[3]/div[]/ul[1]"
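
To get a feel for what a positional XPath filter selects, here is a small local sketch using Python's standard library. The markup and the path are made up for illustration (they are not the real india.gov.in page or the exact expression above), and ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; not the real india.gov.in markup.
SAMPLE = """
<html><body>
  <div id="header"><ul><li>Home</li></ul></div>
  <div id="content">
    <ul><li>Goa</li><li>Kerala</li><li>Punjab</li></ul>
    <ul><li>Delhi</li></ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Second <div> among its siblings, then its first <ul>, then every <li> in it.
states = [li.text for li in root.findall(".//div[2]/ul[1]/li")]
print(states)  # ['Goa', 'Kerala', 'Punjab']
```

Positional predicates like `div[2]` count siblings of the same tag, so they break easily when the page layout changes.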

YQL looks a lot like SQL, but instead of database tables it treats information on the web as virtual tables that developers can manipulate in a standardized way, regardless of the API the data came from. Developers only have to know YQL to quickly create simple mashups.

Currently, Yahoo! has set certain limits on the use of its infrastructure. App developers are limited to 100,000 calls per day, per IP address. If the application runs in the browser (and hence on many different IPs), the limit is a non-issue. Pullara said, “The limit targets those who would abuse the platform… people who might spin up DoS attacks. You have to have controls in place to make sure that doesn’t happen.”
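
For a server-side scraper running from a single IP, it helps to translate that daily limit into a sustained rate. This is just back-of-the-envelope arithmetic on the 100,000 figure quoted above, assuming the calls are spread evenly over 24 hours; the variable names are my own.

```python
DAILY_LIMIT = 100_000          # calls per day, per IP address (figure quoted above)
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate a single IP can sustain without exceeding the daily limit.
calls_per_second = DAILY_LIMIT / SECONDS_PER_DAY

# Minimum average gap between calls at that rate.
seconds_between_calls = SECONDS_PER_DAY / DAILY_LIMIT

print(round(calls_per_second, 2))       # 1.16
print(round(seconds_between_calls, 3))  # 0.864
```

So a single machine can average a little over one call per second all day before hitting the limit.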

I am really impressed by how easy it is to use, and I am looking forward to exploring it with other services, such as open data tables.
