The theme of day two at GAFFTA's data visualization and mapping class was where data comes from. To get under way, Michal Migurski went meta about data, explaining that data doesn't simply hover in the ether, a value-neutral substance waiting to be collected in our nets. Rather, data and data sets represent choices made by the people collecting the data that reflect their point of view, embodied in what and how they choose to measure.
He quoted Australian information designer Mitchell Whitelaw, who blogs at The Teeming Void:
Accepting data shaped by someone else’s choices is a tacit acceptance of their view of the world, their notion of what is interesting or important or valid. … If we are going to work intelligently with data we must remember that it can always be constructed some other way.
This seems like an especially important point for journalists just starting to delve into the world of data, which at first glance can seem like an ideal vision of impartiality, a realm of pure factuality that transcends "he said-she said."
Next, Migurski dove into the history of large-scale data collection. Charles Darwin's cousin and quintessential Victorian gentleman scientist Sir Francis Galton helped launch the field of modern meteorology by orchestrating, presumably through telegrams and letters sent weeks in advance, the simultaneous collection of weather data across Europe. From that effort Galton produced possibly the first specimen of the modern weather map. (He also apparently originated the practice of publishing weather maps in newspapers.)
Migurski said when you think about the logistics involved in coordinating a continent-wide data collection effort in the days when the telegraph and the steam train were the most advanced technologies you had to work with, you start to appreciate how much we take for granted the wealth of data available today. After all, what's more mundane than a weather map? Migurski pointed to the Tagging of Pacific Predators program, in which Northern California marine biologists tagged great white sharks and built a Web interface that let anyone track their location in near-real time.
"Collecting data used to be a major, major network problem," Migurski said. "Now you can just strap transmitters to animals."
Other troves accessible to data sleuths include the firehose of public social media data unleashed mainly by Twitter. We talked briefly about how to filter Twitter's Streaming API to cull tweets by location, timeframe and keyword. Twitter offers a real-time, JSON-structured stream from the firehose, but be warned: Migurski said that if you let it run for too long, your machine will crash. Migurski showed us how he used a slice of geotagged Twitter data to plot tweets on a map of San Francisco that referenced various burrito-related keywords.
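To make the burrito-map idea concrete, here's a minimal sketch of that kind of filtering, assuming tweets arrive as newline-delimited JSON with Twitter's standard `coordinates` and `text` fields. The bounding box, keywords, and function names are illustrative assumptions, not what was shown in class, and the connection/auth side of the Streaming API is omitted entirely:

```python
import json

# Rough bounding box for San Francisco and some burrito-ish keywords.
# Both are made-up illustrations, not values from the class.
SF_BBOX = (-122.52, 37.70, -122.35, 37.83)  # (west, south, east, north)
KEYWORDS = ("burrito", "taqueria", "carnitas")

def matches(tweet):
    """True if a tweet is geotagged inside the box and mentions a keyword."""
    coords = tweet.get("coordinates")
    if not coords:
        return False  # most tweets are not geotagged at all
    lon, lat = coords["coordinates"]  # GeoJSON order: [longitude, latitude]
    west, south, east, north = SF_BBOX
    in_box = west <= lon <= east and south <= lat <= north
    text = tweet.get("text", "").lower()
    return in_box and any(k in text for k in KEYWORDS)

def filter_stream(lines):
    """Filter an iterable of newline-delimited JSON tweet records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the stream sends blank keep-alive lines
        tweet = json.loads(line)
        if matches(tweet):
            yield tweet
```

In practice you'd point `filter_stream` at the live HTTP response from the Streaming API rather than a list of strings, but the filtering logic is the same either way.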
Speaking of JSON, Migurski introduced the concept to us noobs who had heard the term but couldn't quite work out what it meant or why it mattered. We learned that GeoJSON offers the most powerful way to structure your data for mapping purposes, since many mapping applications have the standard baked in. (Twitter's streaming API and JSON have moved to the top of the "must tinker with" list.)
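For the fellow noobs: a GeoJSON document is just nested JSON with a few agreed-upon field names, which is why mapping tools can consume it directly. Here's a minimal sketch (the `to_geojson` helper and the sample points are my own, not from the class) that wraps a list of points in a GeoJSON FeatureCollection:

```python
import json

def to_geojson(points):
    """Wrap (lon, lat, label) tuples in a GeoJSON FeatureCollection.

    Note that GeoJSON always orders coordinates [longitude, latitude].
    """
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"label": label},
            }
            for lon, lat, label in points
        ],
    }

# Made-up sample points in San Francisco; json.dumps turns the result
# into a string any GeoJSON-aware mapping tool can load.
doc = to_geojson([(-122.42, 37.77, "Mission"), (-122.40, 37.78, "SoMa")])
as_text = json.dumps(doc)
```

Dump a structure like this to a `.geojson` file and many mapping applications will render the points with no further conversion.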
Compared to the first class, day two definitely journeyed deeper into the geek jungle of script libraries and structured data. But I'm hanging on and eager to take a crack at a first project. The more I learn, the more apparent the benefits of mastering these tools become.