Helping to fill the cloud from the bottom up.

Open data in the sciences is an aspirational goal, and one I wholeheartedly support. The efforts of EarthCube (among others) to build an infrastructure of tools that facilitate data curation and discovery in the Earth Sciences have been fundamental in moving this discussion forward in the geosciences, and the most recent ESA meeting saw the development of a new section of the society dedicated to Open Science.

One of the big challenges for open science is that making data easily accessible and making it easily discoverable can be at odds with one another. Making data “open” is as easy as posting it on a website; making it discoverable is much more complex. Borgman and colleagues (2007) lay out a critical barrier to data sharing very clearly in an excellent paper examining practices in “habitat ecology” (emphasis mine):

Also paradoxical is researchers’ awareness of extant metadata standards for reconciling, managing, and sharing their data, but their lack of use of such standards. The present dilemma is that few of the participating scientists see a need or ability to use others’ data, so they do not request data, they have no need to share their own data, they have no need for data standards, and no standardized data are available. . .

The issue, as laid out here, is that people know metadata standards exist, but they’re not using them from the ground up because they’re not using other people’s data. Granted, this paper is now eight years old, but for the vast majority of disciplinary researchers in the geosciences and biological sciences, the extent of data re-use is most likely limited to using data tables from publications, if that. [A quick plug: if you’re in the geosciences, please take a few minutes to complete this survey on data sharing and infrastructure as part of the EarthCube Initiative.]

So, for many people who are working with self-styled data formats, and metadata that is largely implicit (they’re the only ones who really understand their multi-sheet excel file), getting data into a new format (one that conforms to explicit metadata standards) can be formidable, especially if they’re dealing with a large number of data products coming out of their research.

Right now, for just a single paper I have ten different data products that need to have annotated metadata.  I’m fully aware that I need to do it, I know it’s important, but I’ve also got papers to review and write, analysis to finish, job applications to write, emails to send, etc., etc., etc., and while I understand that I can now get DOIs for my data products, it’s still not clear to me that it really means anything concrete in terms of advancement.

Don’t get me wrong, I am totally for open science, all my research is on GitHub, even partial papers, and I’m on board with data sharing.  My point here is that even for people who are interested in open science, correctly annotating data is still a barrier.

How do we address this problem? We have lots of tools that can help generate metadata, but many, if not all, of these are post hoc tools. We talk extensively, if colloquially, about the need to start metadata creation at the same time as we collect the data, but we don’t incentivise this process. The only time people realize that metadata is important is at the end of their project, and by then they’ve got a new job to start, a new project to undertake, or they’ve left academia.

Making metadata creation part of the research workflow, where metadata is a necessary component of the actual data analysis, is something I am working toward as part of the Neotoma project. The Neotoma Paleoecological Database is a community-curated database that contains sixteen different paleoecological proxies, ranging from water chemistry to pollen to diatoms to stable isotope data (see Pilaar Birch and Graham 2015). Neotoma has been used to study everything from modern patterns of plant diversity to rates of migration for plants and mammals, rates of change in community turnover through time, and species relationships to climate. It acts as both a data repository and a research tool in and of itself. A quick plug as well: the completion of a workshop this past week with the Science Education Resource Center at Carleton College in Minnesota has resulted in the development of teaching tools to help bring paleoecology into the classroom (more are on their way).

Neotoma has a database structure that includes a large amount of metadata. Due in no small part to the activities of Eric Grimm, the metadata is highly curated, and Tilia, a GUI tool for producing stratigraphic diagrams and age models from paleoecological data, is designed to store data in a format largely aligned with the Neotoma database structure.

In designing the neotoma package for R I’ve largely focused on its use as a tool to get data out of Neotoma, but the devastating closure of the Illinois State Museum by the Illinois Governor (link) has hastened the devolution of data curation for the database. The expansion of the database to include a greater number of paleoecological proxies has meant that a number of researchers have already become data stewards, checking to ensure completeness and accuracy before data is uploaded into the database.

Having the R package (and the Tilia GUI) act as a tool to get data in as well as out serves an important function: it enhances the benefits of proper data curation immediately after (or even during) data generation, because the data structures in the applications are so closely aligned with the actual database structure.

We are improving this data/metadata synergy in two ways:

  1. Data structures: The data structures within the R package (detailed in our Open Quaternary paper) remain parallel to the database. We’re also working with Mark Uhen, Shanan Peters and others at the Paleobiology Database (as part of this funded NSF EarthCube project), and elsewhere with, for example, the LiPD Project, which is proposing community data standards for paleoclimatic data (McKay and Emile-Geay, 2015).
  2. Workflow: Making paleoecological analysis easier through the use of the R package has the benefit of reducing the ultimate barrier to data upload.  This work is ongoing, but the goal is to ensure that by creating data objects in neotoma, data is already formatted correctly for upload to Neotoma, reducing the burden on Data Stewards and on the data generators themselves.
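The workflow idea above, validating metadata at the moment the data object is created rather than at upload time, can be sketched very simply. This is a hypothetical illustration in Python rather than the actual neotoma R package, and the field names are invented, not the real Neotoma schema:

```python
# A minimal sketch of "validate at creation time": check a data record for
# required metadata before it ever reaches a Data Steward. The required
# fields here are hypothetical stand-ins, not the actual Neotoma structure.
REQUIRED_FIELDS = {"site_name", "latitude", "longitude", "contact", "chronology"}

def missing_metadata(record):
    """Return, sorted, the required fields that are absent or empty."""
    return sorted(f for f in REQUIRED_FIELDS
                  if record.get(f) in (None, "", []))

record = {"site_name": "Example Lake", "latitude": 45.2, "longitude": -89.7}
print(missing_metadata(record))  # -> ['chronology', 'contact']
```

The point is that a check like this, run while the data generator still remembers the context, is far cheaper than reconstructing the same metadata months later.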

This is a community-led initiative. Although development is centralized (but open; anyone can contribute to the R package, for example), the user base of Neotoma is broad: it contains data from over 6,500 researchers, and data is contributed at a rate that continues to increase. By working directly with the data generators we can help build a direct pipeline into “big data” tools for researchers who have traditionally been somewhere out on the long tail.

Jack Williams will be talking a bit more about our activities in this Middle Tail, and why it’s critical for the development of truly integrated cyberinfrastructure in the geosciences (the lessons are applicable to ecoinformatics as well) at GSA this year (I’ll be there too, so come by and check out our session: Paleoecological Patterns, Ecological Processes and Modeled Scenarios, where we’re putting the ecology back in ge(c)ology!).

Sometimes saving time in a bottle isn’t the best idea.

As an academic you have the advantage of meeting people who do some really amazing research. You also have the advantage of doing really interesting work yourself, but you tend to spend a lot of time thinking about very obscure things: things that few other people are thinking about, and those few people tend to be spread out across the globe. I had the opportunity to join researchers from around the world at Queen’s University in Belfast, Northern Ireland earlier this month for a meeting about age-depth models, a meeting about how we think about time, and how we use it in our research.

Time is something that paleoecologists tend to think about a lot.  With the Neotoma paleoecological database time is a critical component.  It is how we arrange all the paleoecological data.  From the Neotoma Explorer you can search and plot out mammal fossils at any time in the recent (last 100,000 years or so) past, but what if our fundamental concept of time changes?

Figure 1. The particle accelerator in Belfast uses massive magnets to accelerate carbon particles to 5 million km/h.

Most of the ages in Neotoma are relative. They are derived from radiocarbon data, either directly, or within a chronology built from several radiocarbon dates (Margo Saher has a post about 14C dating here), which means that there is uncertainty around the ages we assign to each pollen sample, mammoth bone or plant fossil. To actually get a radiocarbon date you first need to send a sample of organic material out to a lab (such as the Queen’s University, Belfast Radiocarbon Lab). The samples at the radiocarbon lab are processed and put in an Accelerator Mass Spectrometer (Figure 1), where carbon ions reach speeds of millions of kilometres an hour, hurtling through a massive magnet, and are then counted, one at a time.

These counts are used to provide an estimate of age in radiocarbon years. We then use the IntCal curve to relate radiocarbon ages to calendar ages. This calibration curve relates absolutely dated material (such as tree rings) to its radiocarbon age. We need the IntCal curve because the generation of radiocarbon (14C) in the atmosphere changes over time, so there isn’t a 1:1 relationship between radiocarbon ages and calendar ages. Radiocarbon (14C) year 0 is actually AD 1950 (chosen in part because atmospheric atomic bomb testing after that date altered 14C levels), and by the time you get back to 10,000 14C years ago, the calendar date is about 1,700 years ahead of the radiocarbon age (i.e., 10,000 14C years is equivalent to 11,700 calendar years before present).
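To make the calibration step concrete, here is a deliberately simplified sketch in Python (the real work uses the full IntCal13 curve, with its own uncertainty band, via tools like OxCal or Clam). The “curve” below is just a straight line through the two anchor points mentioned above, and the dated sample is invented:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two anchor points from the text: 14C age 0 is calendar age 0 BP (AD 1950),
# and 10,000 14C years is roughly 11,700 calendar years BP. A linear
# interpolation between them is a crude stand-in for the real IntCal curve.
curve_14c = np.array([0.0, 10000.0])
curve_cal = np.array([0.0, 11700.0])

def calibrate(age_14c):
    """Map radiocarbon age(s) to calendar age(s) via the toy curve."""
    return np.interp(age_14c, curve_14c, curve_cal)

# A hypothetical dated sample: 8000 +/- 50 14C years. As in Figure 2, we push
# the whole uncertainty distribution through the curve, not just the mean.
samples_14c = rng.normal(8000.0, 50.0, size=10000)
samples_cal = calibrate(samples_14c)

print(round(float(calibrate(8000.0))))  # -> 9360 calendar years BP
print(round(float(samples_cal.std())))  # spread of the calibrated ages
```

With a real, wiggly calibration curve the calibrated distribution is usually asymmetric and sometimes multimodal, which is exactly why the uncertainty has to be carried through rather than calibrated point-by-point.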

Figure 2. A radiocarbon age estimate (in 14C years; the pink, normally distributed curve) intercepts the IntCal curve (blue ribbon). The probability density associated with this intercept builds the age estimate for that sample, in calendar years. [link]
To build a model of age and depth within a pollen core, we link radiocarbon dates to the IntCal curve (calibration) and then link each age estimate together, with their uncertainties, using specialized software such as OxCal, Clam or Bacon. This allows us to examine changes in the paleoecological record through time; basically, it allows us to do paleoecology.
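The age-depth step itself can be illustrated with a toy model that only interpolates between dated depths (Bacon, Clam and OxCal do far more, in particular propagating the dating uncertainty); the depths and ages below are invented for the sketch:

```python
import numpy as np

# A toy age-depth model: calendar ages are known at a few dated depths
# ("control points"), and every other depth gets an age by linear
# interpolation between them. Values are invented for illustration.
control_depth = np.array([0.0, 150.0, 400.0])   # depth in the core, cm
control_age = np.array([0.0, 2300.0, 11700.0])  # calendar years BP

def age_at(depth_cm):
    """Assign an age to any depth between the control points."""
    return float(np.interp(depth_cm, control_depth, control_age))

# Halfway between the two deepest controls:
print(age_at(275.0))  # -> 7000.0
```

A model this sparse is exactly the situation Figure 3 warns about: with so few control points, any assumption of constant sedimentation between them will understate the true uncertainty.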

A case for updating chronologies

The challenge for a database like Neotoma is that the IntCal curve changes over time (IntCal98, IntCal04, IntCal09, and now IntCal13) and our idea of what makes an acceptable age model (and what constitutes acceptable material for dating) also changes.

If we’re serving up data to allow for broad-scale synthesis work, which age models do we provide? If we provide only the original published model, then these models can cause significant problems for researchers working today. As I mentioned before, by the time we get back 10,000 14C years, the old models (built using only 14C ages, not calibrated ages) will be out of sync with newer data in the database, and our ability to discern patterns in the early Holocene will be affected. Indeed, identical cores built using different age models and different versions of the IntCal curve could tell us very different things about the timing of species expansions following glaciation, or changes in climate during the mid-Holocene due to shifts in the Intertropical Convergence Zone (for example).

So, if we’re going to archive these published records then we ought to keep the original age models, they’re what’s published after all, and we want to keep them as a snapshot (reproducible science and all that).  However, if we’re going to provide this data to researchers around the world, and across disciplines, for novel research purposes then we need to provide support for synthesis work.  This support requires updating the calibration curves, and potentially, the age-depth models.

So we get (finally) to the point of the meeting: how do we update age models in a reliable and reproducible manner? Interestingly, while the meeting didn’t produce a solution, we’re much closer to an endpoint. Scripted age-depth modelling software like Clam and Bacon makes the task easier, since it provides the ability to numerically reproduce output directly in R. The continued development of the Neotoma API also helps, since it allows us to pull data directly from the database and reproduce age-model construction from a common set of data.

Figure 3. This chronology depends on only three control points and assumes constant sedimentation from 7,000 years before present to the modern. No matter how you re-build this age model it’s going to underestimate uncertainty in this region.

One thing we have identified, however, is the current limitations of this task. Quite simply, there’s no point in updating some age-depth models: the lack of reliable dates (or of any dates at all) means that new models would be effectively useless. The lack of metadata in published material is also a critical concern. While some journals maintain standards for the publication of 14C dates, these are only enforced when editors or reviewers are aware of them, and are difficult to enforce post-publication.

The issue of making data open and available continues to be an exciting opportunity, but it really does reveal the importance of disciplinary knowledge when exploiting data sources. Simply put, at this point, if you’re going to use a large disciplinary database, unless you find someone who knows the data well, you need to hope that the signal is not lost in the noise (and that the signal you find is not an artifact of some other process!).