January 10, 2011
How to Scrape Internet-Delivered Data for Conversion into a Database-Usable Format
"There is no data on the Internet that is actually impossible to download," writes Dan Nguyen of ProPublica's Nerd Blog in The Coder’s Cause in “Dollars for Docs”. The post is not meant to show how easy it is to violate copyright but to present "public records gathering as a programming challenge" for journalists. Using ProPublica's Dollars for Docs: What Drug Companies are Paying Your Doctor as its illustration, the post also introduces a series of how-to guides for scraping data from web pages, Flash sites, and text-based or image-only PDF files and organizing the results into searchable databases.
With the exception of Adobe Acrobat, a must-have IMHO, all the identified tools in the guides are open-source. There are five guides:
- Using Google Refine to Clean Messy Data
- Reading Data from Flash Sites
- Parsing PDFs
- Scraping HTML
- Getting Text Out of an Image-only PDF
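To give a flavor of what scraping HTML into a database-usable format involves, here is a minimal sketch in Python. It is not taken from the ProPublica guides, which use their own tools. It assumes the data sits in a simple HTML table, and the page content, table name, and column names below are all invented for illustration.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; the recipients and amounts are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Recipient</th><th>Amount</th></tr>
  <tr><td>Example Recipient A</td><td>$1,500</td></tr>
  <tr><td>Example Recipient B</td><td>$250</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_db(html):
    """Parse the table rows and load them into an in-memory SQLite database."""
    parser = TableRowParser()
    parser.feed(html)

    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per payment, amount stored as whole dollars.
    conn.execute("CREATE TABLE payments (recipient TEXT, amount_usd INTEGER)")
    for recipient, amount in parser.rows:
        cleaned = int(amount.replace("$", "").replace(",", ""))
        conn.execute("INSERT INTO payments VALUES (?, ?)", (recipient, cleaned))
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = scrape_to_db(SAMPLE_HTML)
    query = "SELECT recipient, amount_usd FROM payments ORDER BY amount_usd DESC"
    for row in conn.execute(query):
        print(row)
```

Real pages are rarely this tidy, which is why the guides devote separate installments to cleaning messy data and to formats such as Flash and PDF.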
While intended for journalists, the audience for the guides certainly extends well beyond journalism. Many tech-savvy law librarians may already know all the tools and techniques identified (perhaps even better ones). Although tech-inclined myself, I am always in catch-up mode.
At a time when requests for help with data scraping for lawful purposes, across many and varied research projects, are not uncommon, the ProPublica Nerd Blog's guides are (1) a good place to start if you do not already have the required skill set; (2) a great place to point a patron to if that's as far as your institution's mission allows; and (3) an excellent overview to read before deciding whether you are up to the challenge of doing it yourself, need to route the request to in-house tech staff, or should throw some $$ at the project by hiring someone to do the tech work. As stated in Nguyen's cover post, Scraping for Journalism: A Guide for Collecting Data, "The guides assume some familiarity with your operating system's command-line." The cover post is the one to head to because it links to all five guides and the suggested tools. Highly recommended (unless you already know all this stuff!). [JH]