Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion was okay, and what kinds of self promotion were acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton, HT Terry McGlynn at Small Pond Science) who also weighed in on his blog, presenting a table showing that up to 40% of papers in the Biological Sciences remain uncited within the first four years since publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents the opportunity to introduce an exciting solution to the secondary issue: What happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory‘.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites versus those in the Neotoma Database, but that’s a job for another time, and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1.  Dark data.  There is likely some excellent data down this pathway, but it's too spooky for me to want to access it, let's just ignore it for now. Photo Credit: J. Illingworth.
Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature. If the citation patterns cited by Jeff Ollerton hold up, it’s possible that we’re missing 30%+ of the available published data when we’re doing our analyses.  So it’s not only imperative that post-grads work to promote their work, and that funding agencies push PIs to provide sustainable data management plans, but we need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 1. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans.  Results are not significantly different.
Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter Paleodeepdive (Peters et al., 2014).  PaleoDeepDive is a project that is part of the larger, EarthCube funded, GeoDeepDive, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform. The system is trained to look for text, tables and figures in domain specific publications, extract data, build associations and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then then assign scores to the data extracted indicating confidence levels in the associations, which can act as a check on the data validity, and helps in building further relations as new data is accquired.  Paleodeepdive was used to comb paleobiology journals to pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. “What happens to your work once it is published” is transformed into a two part question: how is the paper cited, and how is the data used. More and more scientists are using public data repositories, although that’s not neccessarily the case as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community lead data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that are able to search disciplinary archives and collate the data into a single data storage location, in this case the Paleobiology Database. The problem still remains, how is the data cited?

We’ve run into the problem of citations with publications, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices, that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

Published by


Assistant scientist in the Department of Geography at the University of Wisconsin, Madison. Studying paleoecology and the challenges of large data synthesis.

3 thoughts on “Recovering dark data, the other side of uncited papers.”

  1. I think you raise some excellent points about data availability and lost information. There’s another dimension to this, particularly for ecology: data that is tucked away in the grey literature of consultancy reports, assessments by wildlife groups, etc.. So much great data on species occurrences, relative abundance, change over time, etc., is largely ignored because it’s not easily available. A lot of this pre-dates the digital revolution, of course, meaning that it’s even less likely to ever surface and be used.

    1. The non-academic literature issue is a great point, and that’s one of the things that is pretty impressive about the Deep Dive approach, it is able to recover grey (or gray for non-Commonwealth readers) literature as well, as long as it has a reasonable corpus. I know this is a particular problem in conservation ecology, where lots of data is tied up in government reports, and also in geo-archaeology where consultant reports are often the norm. Hopefully the efforts of the PaleoDeepDive team can be broadened to other disciplines so that we can recover much of this data. Once that’s done we have to learn how to deal with it all!

      1. It’s a promising approach, as long as the reports are digitally archived. But how much of the pre-90s grey/gray literature has not been digitised I wonder? I’m willing to bet it’s the majority.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s