One thing I'm learning as an aspiring web developer: It's probably a good idea to take notes as you puzzle through a project. Otherwise you risk re-solving the same problems the next time around. Naturally, I didn't do that as I put together my first interactive map. This post is the next-best thing: an attempt to reconstruct the process in a way that I hope will help others facing the same questions I did.
Disclaimer: This post is going to get technical and stay technical. If that understandably doesn't interest you, by all means go and look at the pretty map. Also, the final takeaway, if you don't want to read to the end: As with any storytelling medium, story conception is the hardest and most important part.
For those of you still with me: I got my first look at the rich assortment of free and open-source tools for making custom map visualizations in a class I took in February through the Gray Area Foundation for the Arts. I didn't notice when I signed up that the class was being taught by Mike Migurski of Stamen Design, a small San Francisco shop that's among the world's leading innovators in interactive maps. I describe the class in detail in several earlier posts.
Eventually, some sense was made, but not always quickly, and not always neatly. If you choose to work with Polymaps, master the first page of the documentation. Polymaps creator Mike Bostock assumes a high level of sophistication, and the in-depth methods can get confusing. Still, to get a map online that does something, the first page of the docs is most of what you need (the examples help a great deal, too).
Once I had the GeoJSON, using Polymaps to draw the different zip codes became fairly straightforward. One "duh" moment came when I realized that you don't need to use any underlying map tiles to make a map that looks like a map, at least if you're working with data like zip codes that matches the geographical contours of the place you're displaying. You can see how that looks on the zip code map I made. You'll also see that I took some simple parks data, likewise available on the city's site, and mashed it up to show the number of city-run parks per zip code. Not incredibly informative, since much of San Francisco's green space is federally managed. Still: I'd made a map that did something.
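For anyone curious how a parks-per-zip-code count can be computed, here is a minimal sketch in Python. It is not the code behind my map. It assumes the zip codes are simple GeoJSON Polygon features, the parks are GeoJSON Point features, and each zip feature carries a hypothetical "zip" property. The two square "zip codes" and three "parks" are made-up test shapes, not city data.

```python
from collections import Counter


def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def count_points_per_area(areas, points, key="zip"):
    """Count how many Point features fall inside each Polygon feature."""
    counts = Counter()
    for area in areas["features"]:
        name = area["properties"][key]  # "zip" is an assumed property name
        outer_ring = area["geometry"]["coordinates"][0]  # holes are ignored
        counts[name] = sum(
            point_in_ring(pt["geometry"]["coordinates"][0],
                          pt["geometry"]["coordinates"][1],
                          outer_ring)
            for pt in points["features"]
        )
    return counts


def square(name, x0):
    """Build a made-up 1x1 square 'zip code' whose left edge sits at x0."""
    ring = [[x0, 0], [x0 + 1, 0], [x0 + 1, 1], [x0, 1], [x0, 0]]
    return {"type": "Feature",
            "properties": {"zip": name},
            "geometry": {"type": "Polygon", "coordinates": [ring]}}


def park(lon, lat):
    """Build a made-up park as a GeoJSON Point feature."""
    return {"type": "Feature", "properties": {},
            "geometry": {"type": "Point", "coordinates": [lon, lat]}}


zips = {"type": "FeatureCollection",
        "features": [square("A", 0), square("B", 1)]}
parks = {"type": "FeatureCollection",
         "features": [park(0.5, 0.5), park(0.25, 0.75), park(1.5, 0.5)]}

counts = count_points_per_area(zips, parks)
print(dict(counts))
```

Real zip code boundaries are often MultiPolygons or have holes, which this sketch does not handle; a geometry library would be the safer choice for production work.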
My next experiment traced the official borders of the city's neighborhoods, again using data supplied from the city's website. In this case, I used a tile set, Cloudmade's Pale Dawn, which allows you to see the streets and landmarks associated with each neighborhood. Again, elementary in concept but a simple interactive created with a high degree of customization.
Which is all wonderful, but also doesn't mean very much if you don't have a good story to tell. I learned firsthand the lesson that storytelling through data visualization is at its core an exercise in data curation. You can make the world's most beautiful map, but the story it tells is only as compelling as the data underlying it.
At this point, I became fixated on building a map that displayed tweets based on various search terms, convinced that something inherently meaningful would emerge. I spent a long time wrangling Twitter's Streaming API, which allows you to access tweets as they're posted, versus a static search that needs to be refreshed every time you want new tweets. To get a bucketful of tweets to work with and write them to a file, open your preferred command-line tool and use this:
curl -d "track=your,comma,separated,search,terms" https://stream.twitter.com/1/statuses/filter.json -uYourTwitterUserName:YourPassword -o YourFilename.json
The command will funnel any new tweets containing your search terms into the file you specify after the "-o" as long as you supply valid Twitter login credentials after the "-u" (no space after the "u"). Be warned: The tweets can pile up fast.
Anyway, after all that delightfulness, I made another painful discovery: Hardly anyone geocodes their tweets. I downloaded about 7,200 tweets containing the name "Trayvon Martin" thinking I'd learn something pithy. Instead I found that just 27 of those contained the latitude and longitude coordinates from which the tweet originated. This rate holds steady over various tweet searches: somewhere around one percent or less of tweets seem to include geodata that allows you to plot them usefully on a map. There was no story, reminding me again: Story conception is tougher than execution.
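If you want to run the same check on your own download, here is a rough Python sketch. It assumes the file holds one JSON tweet per line and that a geocoded tweet carries a "coordinates" object holding a GeoJSON-style [longitude, latitude] pair. Both are assumptions about the stream format, so inspect a few lines of your own file first. The sample lines are made up for illustration.

```python
import json


def geotagged(lines):
    """Yield (lon, lat) for each tweet that carries point coordinates."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a cut-off line, e.g. at the end of the file
        if not isinstance(tweet, dict):
            continue
        # Assumed shape: {"coordinates": {"type": "Point", "coordinates": [lon, lat]}}
        coords = (tweet.get("coordinates") or {}).get("coordinates")
        if coords:
            yield tuple(coords)


# Made-up sample lines standing in for a real download.
sample = [
    '{"text": "no location here", "coordinates": null}',
    '',
    '{"text": "located", "coordinates": {"type": "Point", "coordinates": [-122.42, 37.77]}}',
    '{"text": "cut off mid-wri',
]
points = list(geotagged(sample))
print(points)

# The share from my own download: 27 geocoded tweets out of about 7,200.
rate_percent = 27 / 7200 * 100
print(round(rate_percent, 3))
```

To use it on a real file, pass the open file object instead of the sample list, for example `with open("YourFilename.json") as f: points = list(geotagged(f))`.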
At this point, I didn't have a story but knew that I wanted to use Stamen's newly released watercolor map tiles, an amazing project in which the entire interactive map of the world appears to have been painted in watercolor. On their own, the watercolor tiles have no labels, which sets an impressionistic tone. The story would require a similar feel.
Trolling back through my GAFFTA class notes for more ideas, I came across one of the first lessons, which dropped some photos from the 1906 earthquake onto a map of San Francisco. The earthquake photos had come from a site called Historypin, where anyone can post historic photos and pin them to standard Google-style maps. Here I came across part of the Ansel Adams collection posted by the Los Angeles Public Library, where the full collection resides.
The Historypin site is great, but I wanted to make something that really showcased this little-known collection with an engaging, standalone interactive that thanks to the watercolor tiles would have a vintage feel appropriate to Adams' pictures. To get there, I had to hand-cull the collection of 217 photos to eliminate duplicates and find those with specific location information. Most of the photos are Adams' contact prints and negatives from a 1939 magazine assignment. According to the library, he placed little value on them, which I expect is why the caption information is spotty, including on location specifics. Still, I ended up with more than 40 pictures I could plot on a watercolor map of L.A. Though labor-intensive, the research was enjoyable: I learned a lot about the history and geography of the city. I hope the final result offers others that same interesting window.
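One way to get a hand-culled list like this onto a map is to convert it to GeoJSON points. The Python sketch below shows the idea; it is not the script I used. The field names ("title", "image", "lat", "lon") and the sample records are placeholders I made up for illustration, not entries from the library's collection.

```python
import json


def photos_to_geojson(photos):
    """Turn photo records into a GeoJSON FeatureCollection of points.

    Records without both a latitude and a longitude are skipped,
    mirroring the cull of photos that lack specific location information.
    """
    features = []
    for photo in photos:
        if photo.get("lat") is None or photo.get("lon") is None:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [photo["lon"], photo["lat"]]},
            "properties": {"title": photo["title"], "image": photo["image"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder records: titles, file names and coordinates are invented.
photos = [
    {"title": "Example photo A", "image": "a.jpg", "lat": 34.05, "lon": -118.25},
    {"title": "Example photo B", "image": "b.jpg", "lat": None, "lon": None},
]
collection = photos_to_geojson(photos)
print(json.dumps(collection, indent=2))
```

The resulting file can then be loaded by whatever GeoJSON layer your mapping library provides.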
A final technical note: All that work I'd done with the tweets wasn't wasted. Creating the box to display the photos on the map dots was more complicated than I'd expected. I'd followed the Polymaps example of using the jQuery Tipsy plugin, which essentially creates a separate layer of CSS-rendered tooltip boxes on top of the map. I had trouble following some of the logic of the Polymaps example, which seemed to create a blank line at the bottom of the screen I couldn't eliminate. Happily, I wasn't the only one trying to work with Tipsy and SVG (the graphics format used to render the various map elements). This SVG-friendly fork made working with Tipsy on the map as easy as working with any other HTML element.
Ultimately the real value for me in this project is not the end product, but the template I patched together (which I'm happy to share), a template that's easily tweaked for use with all kinds of data. It's simple but seems powerful as an easily adaptable tool for telling a wide range of stories.
The first time the Guardian performed an act of data journalism also happened to be the first time the Guardian printed a newspaper. The May 5, 1821, edition of The Manchester Guardian included a table that listed the city's schools, how many pupils attended and how much money the schools spent each year.
Simon Rogers, editor of the present-day Guardian's Datablog, writes that at the time those seemingly mundane facts told a provocative story:
It told us, for the first time, how many pupils received free education, and how many poor children there were in the city. ... In 1821, it caused a sensation. Leaked to the Guardian by a credible source only identified as "NH", it showed how official estimates of only 8,000 children receiving free education were inaccurate: in fact the total was nearer 25,000.
Rogers cited this example during a recent talk at O'Reilly's Strata Data Conference to illustrate the newspaper's long-held commitment to using data to tell stories. The Guardian has doubled down on that commitment in recent years by turning Rogers and a team of developers loose to experiment widely with new forms of data-driven storytelling. Rogers says their blog has since become the most trafficked on the Guardian's site.
If you're interested in why Datablog has become so popular, check out some of their marquee projects, or even something as unglamorous as a bubble chart of U.K. government spending (at a glance you discover the surprising-to-me fact that Britain spends more on pensions than on its single-payer health care system).
The data team has also played an instrumental role in two of the Guardian's biggest recent stories.
When Wikileaks released the Iraq and Afghanistan war logs to select news organizations, including the Guardian, Rogers' team plowed into the spreadsheets to create interactive maps that told the story of death and violence in those countries in ways that a written summary couldn't.
At Strata, Guardian developer-advocate Michael Brunton-Spall explained how the leaked U.S. State Department cables posed a much greater challenge, since the documents came not as parsed data but as straight text. To find the needles buried in the haystack of hundreds of millions of words, Brunton-Spall said developers and journalists worked as partners to burn the haystack down. Together they deciphered what amounted to a system of tagging used by the U.S. government to categorize each cable. Under 24-hour-news-cycle deadline pressure, developers also hacked together a basic search engine (for those keeping score, they used an AJAX-Apache Solr hack developed by Reuters) that sliced up the documents word by word and allowed reporters to hunt down the keywords that mattered.
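To give a feel for what "slicing documents word by word" means, here is a toy inverted index in Python. It is only a sketch of the general technique; it is not Solr, and it is not the Guardian's or Reuters' code. The document IDs and texts are invented placeholders.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in WORD.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Invented placeholder documents, not real cables.
docs = {
    "doc-1": "Embassy reports on trade talks",
    "doc-2": "Trade delegation visits embassy",
    "doc-3": "Weather report",
}
index = build_index(docs)
hits = search(index, "embassy trade")
print(sorted(hits))
```

A real search engine adds stemming, ranking and phrase queries on top of this, which is why the developers reached for an existing tool under deadline.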
When riots in London sparked by a police shooting spread across the city and then the country, the Guardian's data journalists found ways to create what Rogers called "the new short-form": data journalism not as a months-long project but as another way to do daily news.
Datablog's frequently updated map of verified incidents quickly framed the scope of the fast-moving, rumor-riddled story as rioting spread across the country. An animation of the "riot commute" uncovered a phenomenon that only makes plain sense through a visualization: Mapping the home addresses of those accused of committing crimes against where the alleged crimes occurred, the Guardian showed that many alleged rioters seemed to make their way in from the suburbs to urban centers to join the mayhem.
My favorite visualization used more than 2.5 million riot-related tweets to show how rumors spread across Twitter and then were either verified or quashed by other users. It only really makes sense if you see it, which is exactly the point.
One revelation of the Guardian presentation at Strata was how much of the work they did relied on tech that non-coders could use. Google spreadsheets and Excel documents, Fusion Tables and Tableau: Rogers and Brunton-Spall both said a key to daily data journalism was not being too picky about the tools used so long as they told the story the journalists were trying to tell.
But the most valuable point I took from the presentation didn't involve a specific technology. "Our role as journalists has become curation," Rogers said near the start of his talk, and I realized I'd been thinking about data journalism all wrong. I've always imagined the process as hinging on scoop-getting, in which some Deep Throat tipped you off to a valuable tidbit buried in some obscure government report. Rogers conveyed that in fact the scoop is a function of the sift. With the right tools and know-how, neither of which is especially technically challenging, we as journalists can sort, filter and visualize the multitude of data constantly accumulating all around us, growing ever more pregnant with stories waiting for us to usher into the world. The journalist-as-data-curator acts as a kind of literary talent scout, rummaging through the factoid slush pile to uncover the compelling stories hidden among the spreadsheets.