Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age. There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of presenting some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data. One of the critical issues is that the benchmarks we use to build age models change through time. Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog. These changes, and the diversity of age models in the paleo-literature, mean that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam’ (Blaauw, 2010). This is heavily used software that provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of a sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to another was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package, making the workflow “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result: ‘clam’ is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("SimonGoring/clam")
install_github("ropensci/neotoma")
require(clam)
require(neotoma)

# Use this, and change the directory location to set a new working directory if you want.  We will be creating
# a Cores folder and new files & figures associated with clam.
setwd('.')  

#  This example will use Three Pines Bog, a core published by Diana Gordon as part of her work in Temagami.  It is stored in Neotoma with
#  dataset.id = 7.  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the ‘clam’ function works exactly as it does in the manual, except that I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratulations! I hope you can also see how powerful this workflow can be.
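If you want to try the new smoother, it should just be a matter of swapping the type argument on the same prepared core (a minimal sketch; any other arguments follow the clam manual):

#  Re-run the age model using the Stineman smoother (type 6 in this package).
clam('ThreePines', type = 6)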

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub) and, hopefully, to add the functionality of Bacon as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518.

Reproducibility and R – Better results, better code, better science.

I made a short presentation for our most recent weekly lab meeting about best practices for reproducible research. There are a few key points. The first is that the benefits of reproducible research are not just for the community: producing reproducible code helps you, both after publication (higher citation rates: Piwowar and Vision, 2013) and in the long run, by improving your ability to tackle bigger projects.

Let’s face it: if you intend to pursue a career inside or outside of academia, your success is going to depend on tackling progressively larger and more complex projects. If programming is going to be a part of that, then developing good coding practice should be a priority, and one way to get into the habit is to practice. In the presentation (PDF, figShare) I point to a hierarchy (of sorts) of good scientific coding practice, and reproducible programming helps support each level:

  1. An integrated development environment (IDE) helps you organize your code in a logical manner, makes some repeatable tasks easier, and provides tools and views that make the flow of code easier to read (helping you keep track of what you’re doing).
  2. Version control helps you make incremental changes to your code, comment the changes clearly, and fix mistakes if you break something. It also helps you learn from your old mistakes: you can go back through your commit history and see how you fixed problems in the past.
  3. Embedded code helps you produce clean and concise code with a specific purpose, and it helps you in the long run by reducing the need to “find and replace” values throughout your manuscript. It helps reviewers as well. Your results are simply a summary of the analysis you perform; the code is the analysis. If you can point readers and reviewers to the code you save everyone time (see the sketch after this list).
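To make point 3 concrete, here is a minimal, hypothetical sketch of embedded code using R Markdown/knitr syntax. The data, file name and chunk label are made up; the point is that the reported number is computed from the analysis itself, so re-knitting the document updates the manuscript with no find-and-replace.

An R Markdown fragment (say, analysis.Rmd):

```{r slope-model, echo=FALSE}
#  The analysis lives in the chunk; 'pollen' is a stand-in for your real data.
pollen <- data.frame(age = c(100, 500, 1000), count = c(12, 30, 44))
fit    <- lm(count ~ age, data = pollen)
```

Counts increase by `r round(coef(fit)["age"] * 1000, 1)` grains per thousand years, a value that is recomputed every time the document is knit.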

So, take a look at the presentation, let me know what you think.  And, if you are an early-career researcher, make now the time to start good coding practice.

Two frustrations with R’s default behaviour.

EDIT:  In response to this post I have had good suggestions, both in the comments here and on reddit in /r/statistics. Thanks to all!

I use R every day.  I can think of very few times when I have booted up my computer and not had an instance of R or RStudio running.  Whether for data exploration, making graphs, or just fiddling around, R is a staple of my academic existence.

So I love R, but there are a few things that drive me nuts about its behaviour:

  1. Plotting a large data.frame produces unreadable plots.
  2. read.csv and write.csv don’t behave the same way with respect to row names.
  3. stringsAsFactors = FALSE isn’t the default (a quick illustration of points 2 and 3 follows below).
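Here is a toy round trip showing points 2 and 3 (the data and the ‘example.csv’ file name are made up):

#  A throwaway data.frame written to a throwaway file.
df <- data.frame(site = c('a', 'b'), count = c(1, 2))

#  write.csv writes the row names as an unnamed first column by default...
write.csv(df, 'example.csv')

#  ...but read.csv does not turn that column back into row names, so the
#  round trip gains an extra 'X' column.
read.csv('example.csv')

#  And read.csv converts character columns to factors by default; you have to
#  ask for stringsAsFactors = FALSE explicitly.
str(read.csv('example.csv', stringsAsFactors = FALSE))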


Neotoma, an API and data archaeology. Also, some fun with R.

Figure 1. The Neotoma project is named after the genus Neotoma, the packrats, which collect plant materials to build their middens; these middens ultimately serve as excellent paleoecological resources, if you’re okay sifting through rodent urine.

The Neotoma database is a fantastic resource, and one that should be getting lots of use from paleoecologists. Neotoma is led by the efforts of Eric Grimm, Allan Ashworth, Russ Graham, Steve Jackson and Jack Williams, and includes paleoecological data from the North American Pollen Database and FAUNMAP, with other sources coming online soon. As more and more scientific data sets are developed, and as those data sets are updated more regularly, scientists increasingly need tools to obtain new data as it is acquired. This leads us to the use of APIs in our scientific workflows. There are already R packages that take advantage of existing APIs: I’ve used ritis (part of the great rOpenSci project) a number of times to access species records in the ITIS database, there’s the twitteR package to take advantage of the Twitter API, and there are plenty of others.
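To give a sense of what one of these API calls looks like under the hood, here is a rough sketch of querying Neotoma directly with httr and jsonlite. The endpoint and query parameter are assumptions based on the v1 API, so treat them as illustrative; the neotoma package wraps requests like this so you don’t have to build them yourself.

require(httr)
require(jsonlite)

#  Assumed v1 endpoint and parameter name -- check the Neotoma API
#  documentation before relying on them.
resp <- GET('http://api.neotomadb.org/v1/data/datasets',
            query = list(taxonname = 'Pinus*'))

#  The API returns JSON; parse it and peek at the top-level structure.
datasets <- fromJSON(content(resp, as = 'text'), simplifyVector = FALSE)
str(datasets, max.level = 2)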
