Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion was okay, and what kinds of self-promotion were acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton, HT Terry McGlynn at Small Pond Science), who weighed in on his blog with a table showing that up to 40% of papers in the Biological Sciences remain uncited in the first four years after publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents an opportunity to introduce an exciting solution to a secondary issue: What happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory’.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites versus those in the Neotoma Database, but that’s a job for another time, and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature.  If the citation patterns reported by Jeff Ollerton hold up, it’s possible that we’re missing 30%+ of the available published data when we’re doing our analyses.  So it’s not only imperative that post-grads work to promote their work, and that funding agencies push PIs to provide sustainable data management plans, but we also need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter PaleoDeepDive (Peters et al., 2014).  PaleoDeepDive is part of the larger, EarthCube-funded GeoDeepDive project, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform.  The system is trained to look for text, tables and figures in domain-specific publications, extract data, build associations and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then assigns scores to the extracted data indicating confidence levels in the associations, which can act as a check on data validity and help in building further relations as new data is acquired.  PaleoDeepDive was used to comb paleobiology journals to pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. “What happens to your work once it is published” is transformed into a two-part question: how is the paper cited, and how is the data used? More and more scientists are using public data repositories, although that’s not necessarily the case, as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community-led data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that are able to search disciplinary archives and collate the data into a single data storage location, in this case the Paleobiology Database. The problem still remains: how is the data cited?

We’ve run into the problem of citations with publications, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

Three new papers in various stages of publication.

I’ve just gone through and put some new papers into my Research page.  I’ve been busy over the past little while and it seems to be paying off.  Here are some of my latest papers, with brief summaries for your enjoyment:

Figure 5 from Goring et al., the relationships between plant richness and smoothed pollen richness and vice versa both show a slightly negative relationship (accounting for very little variability), meaning higher plant richness is associated with lower pollen richness.

Goring S, Lacourse T, Pellatt MG, Mathewes RW.  Pollen richness is not correlated to plant species richness in British Columbia, Canada.  Journal of Ecology, Accepted. [Link] [Supplement]

  • Although pollen richness has acted as a proxy for vegetation richness in the literature, our paper shows that this may not be the case.  Taphonomic processes, from release of the pollen to deposition and preservation in lake sediments, appear to degrade the signal of plant richness to the point that there is no significant relationship between plant species richness and pollen taxonomic richness.  The supplementary material includes all the R code and a sample of the raw data (we could not freely share some of the data) used to perform the analysis.

Combourieu-Nebout N, Peyron O, Bout-Roumazeilles V, Goring S, Dormoy I, Joannin S, Sadori L, Siani G, and Magny M. 2013. Holocene vegetation and climate changes in central Mediterranean inferred from a high-resolution marine pollen record (Adriatic Sea). Climate of the Past Discussions, 9:1969-2014. [Link]

  • Another great paper on Holocene and late-Glacial change in the Mediterranean, part of a Special Series in Climate of the Past.  This paper uses multiple proxies, including the use of clay mineral fractions to match climate signals from pollen to sediment transport into the Adriatic from the Po River watershed, sediment blown from the Sahara and sediment transported down the Apennines.  This paper further examines shifts in seasonal precipitation in the central Mediterranean associated with changes in insolation during the Holocene and broader scale shifts in the relative influences of major climate systems in the region.

Gill JL, McLauchlan KK, Skibbe AM, Goring S, Williams JW. Linking abundances of the dung fungus Sporormiella to the density of Plains bison: Implications for assessing grazing by megaherbivores in the paleorecord. Journal of Ecology. Early view: [Link]

  • Three great papers in a row!  This paper uses modern pollen traps in the Konza Prairie LTER to examine the relationship between Sporormiella spores and bison grazing.  This is an important link to make because Sporormiella has been used to indicate the presence of megafauna such as mammoths and mastodons in the late-glacial.  The declining signal of Sporormiella at Appleman Lake, IN was a key feature in the onset of non-analogue vegetation at the site in the late-glacial (Gill et al., 2009).  This paper provides an explicit link between the theoretical potential of the spore as an indicator of megafaunal presence and the degree of grazing at sites.

Kentucky and Tennessee during the last glacial.

There’s a new paper out in Quaternary Research by Liu et al. entitled “Vegetation history in central Kentucky and Tennessee (USA) during the last glacial and deglacial periods.”  This is an interesting paper for a number of reasons, the first being that it’s a (relatively) old record. I’m someone who has lived their whole life above the maximum extent of the last glacial, so it’s difficult for me to imagine landscapes without knowing that they’ve been glaciated at some point in their recent history.