We seem, mercifully, to have reached a bit of a backlash against the data journalism/explainer hype typified by sites like Vox and FiveThirtyEight. Nevertheless, editors in search of viral content find it irresistible to crank out clever articles that purport to illuminate or explain the world with “data”.
Now, I am a big partisan of using quantitative data to understand the world. And I think the hostility to quantification in some parts of the academic Left is often misplaced. But what’s so unfortunate about the wave of shoddy data journalism is that it mostly doesn’t use data as a real tool of empirical inquiry. Instead, data becomes something you sprinkle on top of your substanceless linkbait, giving it the added appearance of having some kind of scientific weight behind it.
Some of the crappiest pop-data science comes in the form of viral maps of various kinds. Ben Blatt at Slate goes over a few of these, pertaining to things like baby names and popular bands. He shows how easy it is to craft misleading maps, even leaving aside the inherent problems with using spatial areas to represent facts about populations that occur in wildly different densities.
Having identified the pitfalls, Blatt then decided to try his hand at making his own viral map. And judging by the number of times I’ve seen his maps of the most widely spoken language in each state on Facebook, he succeeded. But in what is either a sophisticated troll or an example of “knowing too little to know what you don’t know”, Blatt’s maps themselves are pretty uninformative and misleading.
The post consists of several maps. The first simply categorizes each state according to the most commonly spoken non-English language, which is almost always Spanish. Blatt calls this map “not too interesting”, but I’d say it’s the best of the bunch. It’s the least misleading while still containing some useful information about the French-speaking clusters in the Northeast and Louisiana, and the holdout German speakers in North Dakota.
The next map, which shows the most common non-English and non-Spanish language, is also decent. It’s when he starts getting down into more and more detailed subcategories that Blatt really gets into trouble. I’ll illustrate this with the most egregious example, the map of “Most Commonly Spoken Native American Language”.
Part of the problem is the familiar statistician’s issue of sample size. The American Community Survey data that Blatt used to make his maps is extremely large, but you can still run into trouble when you’re looking at a small population and dividing it up into 50 states. Native Americans are a tiny part of the population, and those who speak an indigenous language are an even smaller fraction. The more severe issue, though, is that this map would be misleading even if it were based on a complete census of the population.
That’s because the Native American population in the United States is extremely unevenly distributed, due to the way in which the American colonial project of genocide and resettlement played out historically. In some areas, like the southwest and Alaska, there are sizable populations. In much of the east of the country, there are vanishingly small populations of people who still speak Native American languages. And without even going to the original data (although I did do that), you can see that there are some things majorly wrong here. But you need a passing familiarity with the indigenous language families of North America, which is basically what I have from a cursory study of them as a linguistics major over a decade ago.
We see that Navajo is the most commonly spoken native language in New Mexico. That’s a fairly interesting fact, as it reflects a sizable population of around 63,000 speakers. But then, we could have seen that already from the previous “non-English and non-Spanish” map.
But now look at the northeast. We find that the most commonly spoken native language in New Hampshire is Hopi; in Connecticut it’s Navajo; in New Jersey it’s Sahaptian. What does this tell us? The answer is, approximately nothing. The Navajo and Hopi languages originate in the southwest, and the Sahaptian languages in the Pacific northwest, so these values just reflect a handful of people who moved to the east coast for whatever reason. And a handful of people it is: do we really learn anything from the fact there are 36 Hopi speakers in New Hampshire, compared to only 24 speaking Muskogee (which originates in the south)? That is, if we could even know these were the right numbers. The standard errors on these estimates are larger than the estimates themselves, meaning that there is a very good chance that Muskogee, or some other language, is actually the most common native language in New Hampshire.
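To make the standard-error point concrete, here is a minimal sketch in Python of the usual back-of-the-envelope check: compare the gap between two estimates to the standard error of that gap. The counts (36 and 24) are the ones quoted above. The standard errors (40 and 30) are hypothetical placeholders, chosen only so that each is larger than its estimate, as described above; they are not the actual ACS figures. The sketch also assumes the two estimates are independent and roughly normally distributed, which is a crude approximation for counts this small.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def compare_estimates(est_a, se_a, est_b, se_b):
    """Return (z, two-sided p-value) for the difference est_a - est_b.

    Assumes independent, approximately normal estimates.
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # standard error of the difference
    z = (est_a - est_b) / se_diff
    p = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p

# Estimates quoted in the post; standard errors are HYPOTHETICAL placeholders.
hopi_est, hopi_se = 36, 40          # Hopi speakers in New Hampshire
muskogee_est, muskogee_se = 24, 30  # Muskogee speakers in New Hampshire

z, p = compare_estimates(hopi_est, hopi_se, muskogee_est, muskogee_se)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these made-up standard errors, the gap of 12 speakers is a small fraction of its own standard error, so the data cannot tell us which language is really in the lead. Plugging in the published margins of error would give the real answer, but the shape of the conclusion is unlikely to change when the errors exceed the estimates.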
I suppose this could be regarded as nitpicking, as could the similar things I could say about some of the other maps. Boy, finding out about those 170 Gujarati speakers in Wyoming sure shows me what sets that state apart from its neighbors! OMG, the few hundred Norwegian speakers in Hawaii might slightly outnumber the Swedish speakers! (Or not.) Even the “non-English and non-Spanish” map, which I generally kind of like, doesn’t quite say as much as it appears, or at least not what it appears to say. The large “German belt” in the plains and mountain west reflects low linguistic diversity more than a preponderance of Krauts. There is a small group of German speakers almost everywhere; in most of these states, the percentage of German speakers isn’t much greater than the national average, which is well under 1 percent. In some, like Idaho and Tennessee, it’s actually lower.
I belabor all this because I take data analysis seriously. The processing and presentation of quantitative data is a key way that facts are manufactured, a source of things people “know” about the world. So it bothers me to see the discursive pollution of things that are essentially vacuous “infotainment” dressed up in fancy terms like “data science” and “data journalism”. I mean, I get it: it’s fun to play with data and make maps! I just wish people would leave their experiments on their hard drives rather than setting them loose onto Facebook where they can mislead the unwary.