Painful lessons in data journalism: scraping with Python


Lost in the woods. CC-licensed, Chris-Håvard Berge on Flickr.

Lost and found ads can be a good way to sniff out a story.

Take the ones on Craigslist about iPhones. There’s a woman who gained a husband in a quickie wedding at city hall but left her iPhone behind. Or a drunk college kid who dropped his phone on the passenger seat of a good samaritan who took him home.

Is there a bigger story about lost and stolen iPhones? To find out, I scraped Craigslist lost and found ads in all 50 states using Python and BeautifulSoup. If you want to check out or improve that code, it’s on GitHub. The full story (with charts and things!) is over at Cult of Mac.
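For a feel of what the core step looks like, here is a minimal sketch, not the code from GitHub. To stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser`. The sample markup, the `LinkTextParser` class and the `find_ads` helper are all made up for illustration; real Craigslist pages are structured differently.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag we are inside, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_ads(html, keyword="iphone"):
    """Return (href, title) pairs whose link text mentions the keyword."""
    parser = LinkTextParser()
    parser.feed(html)
    return [(href, text) for href, text in parser.links
            if keyword.lower() in text.lower()]


# Made-up sample markup (an assumption), not real Craigslist HTML.
SAMPLE_PAGE = """
<ul>
  <li><a href="/laf/1.html">Lost iPhone at city hall</a></li>
  <li><a href="/laf/2.html">Found: set of keys</a></li>
  <li><a href="/laf/3.html">LOST IPHONE in a stranger's car</a></li>
</ul>
"""

for href, title in find_ads(SAMPLE_PAGE):
    print(href, title)
```

The match is case-insensitive because ad titles are typed by hand and rarely agree on capitalization.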

The project required more fist clenching and eye straining than anticipated – even though writing a basic scraper for Craigslist is considered an easy-peasy programming project.

Let me just say it: as a novice Pythonista, I am challenged by nearly everything. I mean, command line interface, seriously? But I can get past that. I slogged through (and recommend) Learn Python the Hard Way, and finished some examples in Scraping for Journalists.

But journalism isn’t about getting “Hello World!” to run. It’s about getting your particular story to “run” correctly.

Writing a scraper for not just one URL but over a hundred (some states have multiple sites) is rough. This is also the pain of a site without an API. (In contrast, my graph of “fiscal cliff” mentions, made to test out the New York Times API, took about 10 minutes.)
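Looping over many sites might be sketched like this. It is an illustration under assumptions, not my actual scraper: the site names in `SITES`, the `URL_TEMPLATE` pattern and the `scrape_sites` helper are hypothetical, and the URL pattern should be checked against the live site. The idea is to record a failure and move on, so one bad site doesn't end the whole run.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical site names, for illustration only.
SITES = ["sfbay", "newyork", "chicago"]

# Assumed URL pattern; verify against the live site before relying on it.
URL_TEMPLATE = "https://{site}.craigslist.org/search/laf?query={query}"


def fetch(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_sites(sites, query, fetch_page=fetch, delay=1.0):
    """Fetch one search page per site; record failures instead of stopping."""
    pages, failures = {}, {}
    encoded_query = urllib.parse.quote_plus(query)
    for site in sites:
        url = URL_TEMPLATE.format(site=site, query=encoded_query)
        try:
            pages[site] = fetch_page(url)
        except Exception as error:  # one bad site should not end the run
            failures[site] = str(error)
        time.sleep(delay)  # pause between requests to be polite
    return pages, failures
```

Calling `scrape_sites(SITES, "iphone")` would return two dictionaries: the HTML for each site that loaded, and an error message for each one that didn't. Passing in your own `fetch_page` function makes it possible to try the loop without touching the network.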

You hit a wall. You read the documentation. You tinker. Still doesn’t work. You pester a friend who tells you where to look in the documentation. No luck. You plug in an answer to a similar problem you read about in Stack Overflow. That does it! Rinse. Repeat. Until it actually gathers the information you need.

Problem solving in traditional journalism often means trying the same thing over and over. It’s a numbers game. Say you need a knowledgeable voice in your story about kite energy or whether software can recognize human emotions. If you make enough phone calls or send enough emails, you’ll get that information.

Not so when it comes to programming: the code that won’t run on your computer also won’t work under the exact same conditions on your wife’s computer or if you try it on a different day.

So, yeah. My brain feels bruised from the effort. But I’ll try again.

I backed, and am looking forward to, For Journalism, the Kickstarter project creating programming lessons for data-curious journos.