Opening access to age models through GitHub

As part of PalEON we've been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I've mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age. There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of presenting some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data. One of the critical issues is that the benchmarks we use to build age models change through time. Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog. These changes, and the diversity of age models out there in the paleo-literature, mean that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called 'clam' (Blaauw, 2010). This software is heavily used and provides the means to develop age models for paleo-reconstructions from a set of dates distributed along the length of a sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to the other was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package, making the workflow "data discovery -> data access -> data analysis -> data publication" more reproducible and easier to achieve.

To this end I talked with Maarten several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. 'clam' is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
library(devtools)
install_github("SimonGoring/clam")
install_github("ropensci/neotoma")
library(clam)
library(neotoma)

#  Set a new working directory here if you want; we will be creating a Cores
#  folder and new files & figures associated with clam in that directory.
setwd('.')

#  This example will use Three Pines Bog, a core published by Diana Gordon as
#  part of her work in Temagami.  It is stored in Neotoma with dataset.id = 7.
#  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the 'clam' function works exactly as described in the manual, except that I've added a type 6 for Stineman smoothing. In the code above you've just generated a fairly straightforward linear model for the core. Congratulations. I hope you can also see how powerful this workflow can be.
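If you want to try the added smoother, the call is the same and only the type code changes. A minimal example (the numbering for types 1 through 5 follows the clam manual; 6 is the Stineman addition described above):

#  Re-run the same core using the added Stineman smoother (type = 6) and
#  compare the output against the linear model generated above.
clam('ThreePines', type = 6)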

A future step is to do some more code cleaning (you're welcome to fork or collaborate with me on GitHub), and, hopefully at some point in the future, add the functionality of Bacon as part of a broader project.

References

Blaauw, M., 2010. Methods and code for 'classical' age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518.


Sometimes saving time in a bottle isn’t the best idea.

As an academic you have the advantage of meeting people who do some really amazing research. You also have the advantage of doing really interesting work yourself, but you tend to spend a lot of time thinking about very obscure things: things that few other people are thinking about, and those few people tend to be spread out across the globe. I had the opportunity to join researchers from around the world at Queen's University in Belfast, Northern Ireland earlier this month for a meeting about age-depth models: a meeting about how we think about time, and how we use it in our research.

Time is something that paleoecologists tend to think about a lot.  With the Neotoma paleoecological database time is a critical component.  It is how we arrange all the paleoecological data.  From the Neotoma Explorer you can search and plot out mammal fossils at any time in the recent (last 100,000 years or so) past, but what if our fundamental concept of time changes?

Figure 1. The particle accelerator in Belfast uses massive magnets to accelerate carbon particles to 5 million km/h.

Most of the ages in Neotoma are relative. They are derived from radiocarbon data, either directly, or within a chronology built from several radiocarbon dates (Margo Saher has a post about 14C dating here), which means that there is uncertainty around the ages that we assign to each pollen sample, mammoth bone or plant fossil. To actually get a radiocarbon date you first need to send a sample of organic material out to a lab (such as the Queen's University, Belfast Radiocarbon Lab). The samples at the radiocarbon lab are processed and put in an Accelerator Mass Spectrometer (Figure 1), where carbon atoms reach speeds of millions of miles an hour, hurtling through a massive magnet, and are then counted, one at a time.

These counts are used to provide an estimate of age in radiocarbon years. We then use the IntCal curve to relate radiocarbon ages to calendar ages. This calibration curve relates absolutely dated material (such as tree rings) to its radiocarbon age. We need the IntCal curve because the production of radiocarbon (14C) in the atmosphere changes over time, so there isn't a 1:1 relationship between radiocarbon ages and calendar ages. Radiocarbon (14C) year 0 is actually AD 1950 (a convention associated with atmospheric atomic bomb testing), and by the time you get back to 10,000 14C years ago, the calendar date is about 1,700 years ahead of the radiocarbon age (i.e., 10,000 14C years is equivalent to 11,700 calendar years before present).
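If you want to see that offset for yourself, here is a minimal sketch that calibrates a single 10,000 14C yr BP date. This isn't part of the clam/neotoma workflow above; it assumes the Bchron package's BchronCalibrate() function and its bundled 'intcal13' curve, and the 50-year error term is made up for illustration:

#  Not part of the clam workflow: calibrate a single radiocarbon date to see
#  the offset between 14C years and calendar years described above.
#  (The 50-year error term is invented for this example.)
install.packages("Bchron")
library(Bchron)

cal.date <- BchronCalibrate(ages = 10000, ageSds = 50, calCurves = 'intcal13')
summary(cal.date)  #  the bulk of the probability sits roughly 1,700 years
                   #  earlier than the 14C age, near 11,700 cal yr BP
plot(cal.date)     #  a quick look at the calibrated probability density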

Figure 2. A radiocarbon age estimate (in 14C years; the pink, normally distributed curve) intercepts the IntCal curve (blue ribbon). The probability density associated with this intercept builds the age estimate for that sample, in calendar years.
To build a model of age and depth within a pollen core, we link radiocarbon dates to the IntCal curve (calibration) and then link each age estimate together, with their uncertainties, using specialized software such as OxCal, Clam or Bacon. This then allows us to examine changes in the paleoecological record through time; basically, it allows us to do paleoecology.
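At its simplest, and setting aside the uncertainties that Clam and Bacon handle properly, an age-depth model is just a rule for predicting age at undated depths from a handful of dated ones. A toy sketch of that idea in base R, with made-up control points:

#  Toy example (made-up numbers): three calibrated control points and linear
#  interpolation to estimate ages at the depths of the pollen samples.
control.depth <- c(10, 150, 410)       # depth in cm
control.age   <- c(500, 4200, 11700)   # calibrated years BP
sample.depth  <- seq(20, 400, by = 20)

sample.age <- approx(x = control.depth, y = control.age, xout = sample.depth)$y
plot(sample.age, sample.depth, ylim = rev(range(sample.depth)), type = 'b',
     xlab = 'Calibrated years BP', ylab = 'Depth (cm)')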

A case for updating chronologies

The challenge for a database like Neotoma is that the IntCal curve changes over time (IntCal98, IntCal04, IntCal09, and now IntCal13) and our idea of what makes an acceptable age model (and what constitutes acceptable material for dating) also changes.

If we're serving up data to allow for broad-scale synthesis work, which age models do we provide? If we provide only the original published models, then those models can cause significant problems for researchers working today. As I mentioned before, by the time we get back 10,000 14C years the old models (built using only 14C ages, not calibrated ages) will be out of sync with newer data in the database, and our ability to discern patterns in the early Holocene will be affected. Indeed, chronologies built for identical cores using different age models and different versions of the IntCal curve could tell us very different things about the timing of species expansions following glaciation, or about changes in climate during the mid-Holocene due to shifts in the Intertropical Convergence Zone (for example).

So, if we're going to archive these published records then we ought to keep the original age models; they're what's published, after all, and we want to keep them as a snapshot (reproducible science and all that). However, if we're going to provide these data to researchers around the world, and across disciplines, for novel research purposes, then we need to provide support for synthesis work. This support requires updating the calibration curves and, potentially, the age-depth models.

So we get (finally) to the point of the meeting. How do we update age models in a reliable and reproducible manner? Interestingly, while the meeting didn't provide a full solution, we're much closer to an endpoint. Scripted age-depth modelling tools like Clam and Bacon make the task easier, since they provide the ability to numerically reproduce output directly in R. The continued development of the Neotoma API also helps, since it would allow us to pull data directly from the database and reproduce age-model construction using a common set of data.
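To make that concrete, a reproducible update could be as simple as re-running the neotoma + clam workflow from the first half of this post as a script, so that anyone can regenerate the model from the same API call. The settings below are illustrative; see the clam manual for the model type and calibration options appropriate to a real record:

#  Sketch of a scripted, reproducible age-model re-build: pull the record from
#  Neotoma, rewrite the clam input files, and re-run the model.
library(neotoma)
library(clam)

threepines <- get_download(7)
write_agefile(download = threepines[[1]], chronology = 1, path = '.',
              corename = 'ThreePinesUpdate', cal.prog = 'Clam')
clam('ThreePinesUpdate', type = 1)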

Figure 3. This chronology depends on only three control points and assumes constant sedimentation from 7000 years before present to the modern. No matter how you re-build this age model it's going to underestimate uncertainty in this region.

One thing that we have identified, however, is the set of current limitations on this task. Quite simply, there's no point in updating some age-depth models. The lack of reliable dates (or of any dates) means that new models would be effectively useless. The lack of metadata in published material is also a critical concern. While some journals maintain standards for the publication of 14C dates, these are only enforced when editors or reviewers are aware of them, and they are difficult to enforce after publication.

The issue of making data open and available continues to be an exciting opportunity, but it really does reveal the importance of disciplinary knowledge when exploiting data sources.  Simply put, at this point if you’re going to use a large disciplinary database, unless you find someone who knows the data well, you need to hope that signal is not lost in the noise (and that the signal you find is not an artifact of some other process!).