Maybe you caught the article by Ron Unz about Jewish-American performance at Ivy League colleges in the United States. He argues that Jewish-Americans are over-represented at Ivy League colleges, but even while they are accepted at higher rates than other racial groups, they are performing at lower levels than their peers, particularly Asian-Americans. [disclosure: I only came across it through Andrew Gelman's blog]
Andrew Gelman has a great takedown of the discussion, though it is predicated on the contributions of an anonymous individual who took an interest in the data. Gelman argues that people accepted the original Unz story uncritically because it presents its data in a statistical manner, but that the story is undercut by serious methodological problems in the way Unz estimates the proportions of Jewish-Americans.
I don’t know much about Unz, or his reasons for looking into the data set (although the fact that it is published in the American Conservative might be indicative of underlying biases). Regardless, the implications of these methodological flaws point to a major issue in how we deal with big data sets. When I was getting started with my Ph.D., I was looking at modern pollen samples from the western United States and Canada.
A common analysis tool for pollen assemblages is looking for modern analogues, but there is a potential issue within the North American Modern Pollen Dataset (Whitmore et al., 2005, PDF): it includes a number of pollen samples from Calvin Heusser’s seminal 1960 volume, “Late-Pleistocene Environments of North Pacific North America” (here’s an interesting review, only $4 for the book in 1960!). Those data sets were the first to really explore pollen in the region (aside from Hansen, for example, this from 1950), but there is no record of Cupressaceae pollen in any of the samples. Thankfully Rolf Mathewes, my advisor, had the experience and domain knowledge to point this out to me before I did too much damage, but one wonders how many times these data have been used uncritically.
In a preliminary analysis with another very large data set I was able to find significant relationships (although with low R-squared values) before cleaning up my code (a very important step!) and realizing that I had unintentionally re-sorted the column headings. This sorting meant that these significant relationships were actually meaningless; the significance was simply a result of an enormously large set of data. I caught the mistake though, so don’t judge me!
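To make the "significant but low R-squared" trap concrete, here is a minimal sketch in Python. It is not my original analysis: the data are simulated, and the function names, the slope of 0.05, and the sample sizes are all assumptions chosen for illustration. It simply shows that a relationship explaining well under 1% of the variance can still produce a tiny p-value once the sample gets large enough.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand (stdlib only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r, using a normal approximation to the
    t distribution. Reasonable for large n; rough for small n."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


def simulate(n, slope=0.05, seed=1):
    """Simulate y = slope * x + noise (hypothetical data, not real samples)
    and return (R-squared, approximate p-value)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    r = pearson_r(xs, ys)
    return r * r, approx_p_value(r, n)


if __name__ == "__main__":
    # Same weak relationship, two very different sample sizes.
    for n in (100, 200_000):
        r2, p = simulate(n)
        print(f"n = {n:>7}: R-squared = {r2:.4f}, p = {p:.3g}")
```

Running the script prints R-squared and an approximate p-value for a small and a large sample drawn from the same weak relationship. The point is only that the p-value tracks sample size as much as it tracks the strength of the relationship, so a "significant" result from a huge table says very little on its own about whether the columns mean what you think they mean.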
How easy is it to get taken in by these big data sets? To go down the rabbit hole? Without sufficient domain knowledge it becomes increasingly easy, and, with the rapid dissemination of popular posts, these analyses can have a serious impact on our public discourse, particularly when the findings confirm our biases. I discussed evidence-based policy in passing on this blog, and Mark Stabile has a more recent post here. In the context of evidence-based policy, vetting through domain knowledge becomes critical. What if we take these faulty analyses as the basis for confirming our bias and driving policy?
Ultimately, the point is that this idea of big data has the potential to be transformative, but if the skills to exploit these data sets are not developed within the disciplines (ecology, geology, and geography in my case), then we risk having others define problems and answer questions without the detailed domain knowledge necessary to properly assess the validity of the outcomes.
So what are the solutions (focusing on the sciences in particular)?
- Build capacity within disciplines for large-scale analysis [I'm working on this grant right now...].
- Develop interdisciplinary collaborations early and often [That's what PalEON is all about!].
- Ensure that the analysis is numerically repeatable, and provide reviewers (and readers) with the raw data and code necessary to run it [see here].
- Provide access to the reviews [see here], although some journals do a great job of this already.
What do you think? Are there other remedies? How do we protect ourselves from these errors, and how do we prevent them from becoming zombie ideas?