The neotoma package has been available for a little while (at figshare here and at github here). I wrote it up and posted it in the middle of April before my talk at the University of Minnesota, but I’ve been delaying writing about it here until it was a bit more polished.
So here is your introduction to the neotoma package for R, hosted & supported by ROpenSci (after the break):
The neotoma package is intended to interface with the Neotoma Paleoecological Database. The database hosts paleoecological data spanning the Pliocene-Quaternary, originally from two major paleoecological databases, the North American Pollen Database and FAUNMAP. Currently Neotoma supports working groups for a number of different paleoecological datasets, and is working toward integrating more data into the database as we speak.
In the meantime, authors, including myself and Jessica Blois, have been using the Neotoma database to understand patterns of change over time in deposition rates and in community composition and dissimilarity (respectively). For our papers we had to download the entire Neotoma database before we could do any analysis. I was hoping to make this process easier by creating a package for R that would allow investigators to download and analyse data directly through APIs, producing a reproducible workflow, facilitating preliminary data analysis, and potentially giving educators an opportunity to use Neotoma data as a learning tool in course lab work.
To begin to use the neotoma package you need to do a few things first:
- Download the compressed package file to your computer and then use install.packages to install it as a package in R, or if you’re using RStudio, install it from there.
Okay, that was one thing. Look at how easy that was!
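For anyone installing from the R console rather than RStudio, here is a minimal sketch of that one step. The archive name `neotoma.tar.gz` is an assumption for illustration; use whatever the downloaded file is actually called and wherever you saved it.

```r
# Hypothetical file name: replace with the path to the archive you downloaded.
pkg_file <- "neotoma.tar.gz"

if (file.exists(pkg_file)) {
  # repos = NULL tells R to install from the local file instead of a repository.
  install.packages(pkg_file, repos = NULL, type = "source")
} else {
  message("Archive not found: ", pkg_file)
}
```

The `file.exists()` check is only there so the snippet fails gently if the path is wrong.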
Once you’ve installed the package you just have to load it:
library(neotoma)
and you’re off to the races. I’m going to show you two examples to help get you started, the first showing the distribution of sites with Spruce pollen detected around the Last Glacial Maximum (at 21ka), and the second showing the number of publications in Neotoma by year. Both are fairly simple analyses, but hopefully highlight some of the potential applications of the package and the Neotoma database.
library(ggplot2)
library(neotoma)
library(maps)
library(plyr)  # provides ldply(), used below

# Let's make a base map of North America for display:
all.world <- map_data('world')
northam <- subset(all.world, region %in% c('Canada', 'USA', 'Mexico'))

# Now let's find all the records of Spruce (Picea) at the last glacial maximum
# from the neotoma database:
lgm.picea <- get_datasets(taxonname='Picea', ageold=22000, ageyoung=20000)

# Pull the coordinates out of each dataset and stack them into one data frame:
loc.fromset <- function(x) data.frame(lat = x$Site$LatitudeNorth, long = x$Site$LongitudeWest)
lgm.sites <- ldply(lgm.picea, loc.fromset)

# Just want to get rid of the European sites:
lgm.sites <- lgm.sites[lgm.sites$long < -20,]

ggplot() +
  geom_polygon(data = northam, aes(x = long, y = lat, group = group), fill = 'white', color='gray') +
  geom_point(data = lgm.sites, aes(x = long, y = lat)) +
  annotate('text', x = -180, y = 25, label='Picea Pollen Distribution', hjust=0, size=8) +
  annotate('text', x = -180, y = 20, label='Last Glacial Maximum\n22 - 20ka', hjust=0) +
  theme_bw() +
  xlim(-180, -50) + ylim(14, 80)
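If the loc.fromset step above looks opaque, here is a small sketch of what it does, run on a mock list instead of a live query. The list structure and all the coordinates are invented for illustration and only assume that each element carries a `Site` entry with `LatitudeNorth` and `LongitudeWest`, as the code above implies. It uses base R in place of `ldply()` so it runs without plyr.

```r
# Mock stand-in for the list returned by get_datasets(); all values are invented.
mock.datasets <- list(
  list(Site = list(LatitudeNorth = 45.0, LongitudeWest = -93.2)),
  list(Site = list(LatitudeNorth = 52.1, LongitudeWest = 10.5)),
  list(Site = list(LatitudeNorth = 61.3, LongitudeWest = -149.9))
)

# Same helper as in the post: one row of coordinates per dataset.
loc.fromset <- function(x) {
  data.frame(lat = x$Site$LatitudeNorth, long = x$Site$LongitudeWest)
}

# Base-R equivalent of ldply(): apply the helper to each element, then row-bind.
mock.sites <- do.call(rbind, lapply(mock.datasets, loc.fromset))

# Same filter as in the post: keep only sites west of 20 degrees W.
mock.sites <- mock.sites[mock.sites$long < -20, ]
```

With the mock values, the site with a positive (eastern) longitude is dropped and the two western sites remain.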
And now, here’s the number of publications in Neotoma, by year (including vertebrate fossils, pollen and ostracods):
pubs <- get_publication()

# Year may come back as a factor, so convert via character to get the real years:
pub.years <- as.numeric(as.character(pubs$Year))

# position is an argument to stat_bin(), not an aesthetic, so it sits outside aes():
ggplot(data=data.frame(x = pub.years), aes(x)) +
  stat_bin(aes(y=..density..*100), position='dodge', binwidth=1) +
  theme_bw() +
  ylab('Percent of Publications') +
  xlab('Year of Publication') +
  scale_y_continuous(expand = c(0, 0.1)) +
  scale_x_continuous(breaks = seq(min(pub.years, na.rm=TRUE), 2013, by=20))
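The percentages in the plot can also be tabulated directly, which is handy if you want the peak year as a number. This is just a sketch on invented years standing in for `pubs$Year`; the factor input is an assumption, chosen to show why the `as.numeric(as.character(...))` conversion is needed (calling `as.numeric()` on a factor alone returns level codes, not years).

```r
# Invented years standing in for pubs$Year; stored as a factor with one missing value.
mock.years <- factor(c("1984", "1984", "1990", "1984", "2001", NA))
years <- as.numeric(as.character(mock.years))

# Percent of publications in each year (table() drops NA by default).
pct.by.year <- 100 * prop.table(table(years))

# Year with the largest share of publications.
peak.year <- as.numeric(names(which.max(pct.by.year)))
```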
The interesting thing about Figure 2 is that there is a clear peak in publication year at 1984. This is most likely an artifact of the database. Neotoma is currently in its first version, but there are a large number of datasets in the holding tank, waiting for the next update of the Neotoma database. We will be adding a flag to the neotoma package soon so that the user can define which version of the database to access. This will help support reproducible research by allowing supplementary data to reference a specific “snapshot” of the database, so that code samples will be numerically reproducible into the future.
I welcome everyone to try out the package and explore the Neotoma database. This is my first time making a package, so if you find bugs, have any problems, or have suggestions to improve the package please let me know, either in the comments here, on twitter or by email. This package may have a very specific niche, but that doesn’t mean it can’t be usable!