Helping to fill the cloud from the bottom up.

Open data in the sciences is an aspirational goal, and one that I wholeheartedly agree with. The efforts of EarthCube (among others) to build an infrastructure of tools to facilitate data curation and discovery in the Earth Sciences have been fundamental in moving this discussion forward in the geosciences, and the most recent ESA meeting saw the development of a new section of the society dedicated to Open Science.

One of the big challenges to open science is that making data easily accessible and easily discoverable can be at odds with one another.  Making data “open” is as easy as posting it on a website, but making it discoverable is much more complex.  Borgman and colleagues (2007) very clearly lay out a critical barrier to data sharing in an excellent paper examining practices in “habitat ecology” (emphasis mine):

Also paradoxical is researchers’ awareness of extant metadata standards for reconciling, managing, and sharing their data, but their lack of use of such standards. The present dilemma is that few of the participating scientists see a need or ability to use others’ data, so they do not request data, they have no need to share their own data, they have no need for data standards, and no standardized data are available. . .

The issue, as laid out here, is that people know that metadata standards exist, but they’re not using them from the ground up because they’re not using other people’s data.  Granted, this paper is now eight years old, but for the vast majority of disciplinary researchers in the geosciences and biological sciences the extent of data re-use is most likely limited to using data tables from publications, if that. [a quick plug: if you’re in the geosciences, please take a few minutes to complete this survey on data sharing and infrastructure as part of the EarthCube Initiative]

So, for many people who are working with self-styled data formats, and metadata that is largely implicit (they’re the only ones who really understand their multi-sheet Excel file), getting data into a new format (one that conforms to explicit metadata standards) can be formidable, especially if they’re dealing with a large number of data products coming out of their research.

Right now, for just a single paper I have ten different data products that need to have annotated metadata.  I’m fully aware that I need to do it, I know it’s important, but I’ve also got papers to review and write, analysis to finish, job applications to write, emails to send, etc., etc., etc., and while I understand that I can now get DOIs for my data products, it’s still not clear to me that it really means anything concrete in terms of advancement.

Don’t get me wrong, I am totally for open science, all my research is on GitHub, even partial papers, and I’m on board with data sharing.  My point here is that even for people who are interested in open science, correctly annotating data is still a barrier.

How do we address this problem? We have lots of tools that can help generate metadata, but many, if not all, of these are post hoc tools. We talk extensively, if colloquially, about the need to start metadata creation at the same time as we collect the data, but we don’t incentivise this process.  The only time people realize that metadata is important is at the end of their project, and by then they’ve got a new job to start, a new project to undertake, or they’ve left academia.

Making metadata creation a part of the research workflow is something I am working toward as part of the Neotoma project, where metadata is a necessary component of the actual data analysis.  The Neotoma Paleoecological Database is a community curated database that contains sixteen different paleoecological proxies, ranging from water chemistry to pollen to diatoms to stable isotope data (see Pilaar Birch and Graham 2015). Neotoma has been used to study everything from modern patterns of plant diversity, rates of migration for plants and mammals, rates of change in community turnover through time, and species relationships to climate.  It acts as both a data repository and a research tool in and of itself.  A quick plug as well: the completion of a workshop this past week with the Science Education Resource Center at Carleton College in Minnesota has resulted in the development of teaching tools to help bring paleoecology into the classroom (more are on their way).

Neotoma has a database structure that includes a large amount of metadata.  Due in no small part to the activities of Eric Grimm the metadata is highly curated, and Tilia, a GUI tool for producing stratigraphic diagrams and age models from paleoecological data, is designed to store data in a format that is largely aligned with the Neotoma database structure.

In designing the neotoma package for R I’ve largely focused on its use as a tool to get data out of Neotoma, but the devastating closure of the Illinois State Museum by the Illinois Governor (link) has hastened the devolution of data curation for the database. The expansion of the database to include a greater number of paleoecological proxies has meant that a number of researchers have already become data stewards, checking to ensure completeness and accuracy before data is uploaded into the database.

Having the R package (and the Tilia GUI) act as a tool to get data in as well as out serves an important function: it enhances the benefits of proper data curation immediately after (or even during) data generation, because the data structures in the applications are so closely aligned with the actual database structure.

We are improving this data/metadata synergy in two ways:

  1. Data structures: The data structures within the R package (detailed in our Open Quaternary paper) remain parallel to the database (a minimal sketch of what that looks like follows this list).  We’re also working with Mark Uhen, Shanan Peters and others at the Paleobiology Database (as part of this funded NSF EarthCube project), and with efforts elsewhere, for example the LiPD Project, which is proposing community data standards for paleoclimatic data (McKay and Emile-Geay, 2015).
  2. Workflow: Making paleoecological analysis easier through the use of the R package has the benefit of reducing the ultimate barrier to data upload.  This work is ongoing, but the goal is to ensure that by creating data objects in neotoma, data is already formatted correctly for upload to Neotoma, reducing the burden on Data Stewards and on the data generators themselves.
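To give a minimal sketch of what “parallel to the database” means in point 1, here is roughly what a single record looks like once it comes out of the R package. The component names are from the package as it currently stands, but treat the inspection below as illustrative rather than a formal API reference:

library(neotoma)

#  Pull a single record using its Neotoma dataset ID (Three Pines Bog,
#  dataset.id = 7, a core I use pretty often for testing):
three.pines <- get_download(7)

#  A download is a list of records; each record is itself a list whose
#  components (site and sample metadata, the taxon list, the count data,
#  chronologies) parallel tables in the Neotoma database itself:
str(three.pines[[1]], max.level = 1)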

This is a community-led initiative. Although development is centralized (but open; anyone can contribute to the R package, for example), the user base of Neotoma is broad: it contains data from over 6500 researchers, and data is contributed at a rate that continues to increase.  By working directly with the data generators we can help build a direct pipeline into “big data” tools for researchers that have traditionally been somewhere out on the long tail.

Jack Williams will be talking a bit more about our activities in this Middle Tail, and why it’s critical for the development of truly integrated cyberinfrastructure in the geosciences (the lessons are applicable to ecoinformatics as well) at GSA this year (I’ll be there too, so come by and check out our session: Paleoecological Patterns, Ecological Processes and Modeled Scenarios, where we’re putting the ecology back in ge(c)ology!).


Building your network using ORCiD and ROpenSci

Our neotoma package is part of the ROpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from them throughout the coding process.  Recently Scott Chamberlain posted some code for an R package to interface with ORCiD: the rorcid package.

To digress for a second, the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second guessing myself ever since, especially as it became more and more apparent that, in writing proposals and talking about the package and the database I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being, stick an R in it so that you don’t have to keep clarifying the differences.

So, back on point.  A little while ago I posted a network diagram culled from my cv using a bibtex parser in R (the bibtex package by Romain François).  That’s kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious about what it would look like if I went out to the second order: what does it look like when we combine my publication network with the networks of my collaborators?
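For anyone curious, that first-order network came from something along these lines. I’m reconstructing it from memory, so the file name and the author-name munging are placeholders, but read.bib() from the bibtex package does the heavy lifting:

library(bibtex)
library(igraph)

#  'cv.bib' is a placeholder for whatever BibTeX file holds your publication list:
bib <- read.bib('cv.bib')

#  Pull the formatted author names out of each entry:
authors <- lapply(seq_along(bib), function(i) format(bib[[i]]$author))

#  Every pair of co-authors on a single paper becomes an edge:
edges <- do.call(rbind, lapply(authors, function(x) {
  if (length(x) > 1) t(combn(x, 2)) else NULL
}))

#  Build and plot the (first-order) co-author network:
coauthor.graph <- graph.edgelist(edges, directed = FALSE)
plot(coauthor.graph, vertex.label.cex = 0.8)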

Figure 1. A second order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code over and over again and it will fill in as more people sign up to get ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they may have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which ORCiDs can be linked to other objects will be the real test of their staying power.  And even there, it’s not so much whether the IDs can be linked (they’re unique identifiers, so they’re easy to use) as whether other projects, institutions and data repositories will create a space for ORCiDs so that they can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account it seems like people are prepared to invest in ORCiD for the long haul, which is exciting, and provides new opportunities for data analysis and for building research networks.

So, let’s see what we can do with ORCiD and Scott’s rorcid package. This code is all available in a GitHub repository so you can modify it, fork, push or pull as you like:

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With the ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can be linked back to each of the ORCiDs registered for coauthors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors with ORCiDs.

So, for each published article we get the DOI, find all co-authors on each work who have an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function (a rough sketch of what that might look like is below).  I thought about it but since this was basically a lark I figured I’d think about it later, or leave it up to someone to add to the existing repository (feel free to fork & modify).
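For the curious, the recursion might look something like this. It leans on the get_papers() function defined further down, adds a depth argument so the crawl stops somewhere, and carries along a record of the ORCiDs it has already visited. I haven’t run this against the API, so treat it as untested scaffolding rather than code from the repository:

#  A sketch of the recursive crawl: 'get_papers' is defined below, 'depth'
#  limits how far out we go, and 'visited' stops us from querying the same
#  ORCiD twice along a branch.
crawl_orcid <- function(orcid, depth = 2, visited = character(0)) {

  if (depth == 0 || orcid %in% visited) return(NULL)

  papers <- get_papers(orcid_id(orcid = orcid, profile = "works"))

  colleagues <- unique(na.omit(papers$orcid))
  colleagues <- colleagues[!colleagues %in% c(orcid, visited)]

  deeper <- lapply(colleagues, crawl_orcid,
                   depth = depth - 1,
                   visited = c(visited, orcid))

  do.call(rbind.data.frame, c(list(papers), deeper))
}

In principle crawl_orcid('0000-0002-2700-4605', depth = 2) would rebuild the second-order network in one call, though sibling branches can still duplicate one another’s queries, so a shared cache would be the obvious next improvement.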

In the end I coded this all up and plotted it using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools):

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rorcid package from the rOpenSci GitHub repository.

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API. Once we have the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCiD.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 || (!'DOI' %in% x[,1])){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now we’ve got the functions, we’re going to get all my papers, make a list of the unique ORCIDs of my colleagues and then get all of their papers using the same ‘get_papers’ function. It’s a bit sloppy I think, but I wanted to try to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}

So now we’ve got a list with a data.frame for each author that has three columns, the DOI, the ORCID and their name. We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row x column intersection represents co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing each author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value for the cell that intersects co-authors, where names are pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)
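Since I mentioned igraph’s interactive tools: tkplot() opens the same graph in a window where you can drag vertices around, which is handy for untangling a hairball before settling on a final layout. A minimal sketch, assuming the author.adj object built above (and that you have Tcl/Tk available):

#  Open an interactive window; the return value is an id that lets you pull
#  the hand-arranged coordinates back into R.
plot.id <- tkplot(author.adj, vertex.label.cex = 0.8)

#  Once you've dragged things around, recover the layout and re-plot the
#  static version with it:
author.layout <- tkplot.getcoords(plot.id)
plot(author.adj, layout = author.layout,
     vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)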

Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion was okay, and what kinds of self-promotion were acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton, HT Terry McGlynn at Small Pond Science), who weighed in on his blog, presenting a table showing that up to 40% of papers in the Biological Sciences remain uncited within the first four years after publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents the opportunity to introduce an exciting solution to the secondary issue: What happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory’.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites Inventory versus those in the Neotoma Database, but that’s a job for another time, and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature. If the citation patterns reported by Jeff Ollerton hold up, it’s possible that we’re missing 30%+ of the available published data when we do our analyses.  So it’s not only imperative that post-grads work to promote their work, and that funding agencies push PIs to provide sustainable data management plans, but we also need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter PaleoDeepDive (Peters et al., 2014).  PaleoDeepDive is a project that is part of the larger, EarthCube-funded GeoDeepDive, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform. The system is trained to look for text, tables and figures in domain-specific publications, extract data, build associations and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then assigns scores to the extracted data indicating confidence levels in the associations, which act as a check on data validity and help in building further relations as new data is acquired.  PaleoDeepDive was used to comb paleobiology journals to pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. “What happens to your work once it is published” becomes a two-part question: how is the paper cited, and how is the data used? More and more scientists are using public data repositories, although that’s not necessarily the case everywhere, as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community-led data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that are able to search disciplinary archives and collate the data into a single data storage location, in this case the Paleobiology Database. The problem still remains: how is the data cited?

We’ve run into the problem of citations with publications, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age.  There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of representing some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data.  One of the critical issues is that the benchmarks we use to build age models change through time.  Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog.  These changes, and the diversity of age models out there in the paleo-literature, mean that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam’ (Blaauw, 2010). This is heavily used software and provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of the sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to the other was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package, making the workflow “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten Blaauw several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. ‘clam’ is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("clam", "SimonGoring")
install_github("neotoma", "ropensci")
require(clam)
require(neotoma)

# Use this, and change the directory location, to set a new working directory
# if you want.  We will be creating a Cores folder and new files & figures
# associated with clam.
setwd('.')

#  This example will use Three Pines Bog, a core published by Diana Gordon as part of her work in Temagami.  It is stored in Neotoma with
#  dataset.id = 7.  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the ‘clam’ function works exactly the same way it does in the manual, except I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratulations. I hope you can also see how powerful this workflow can be.
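If you want to see the new smoother in action, the only change is the type argument (type 6 is the Stineman smoothing mentioned above); everything else about the call stays the same:

#  Re-run the age model using Stineman smoothing (type = 6) to compare
#  against the linear model (type = 1) generated above:
clam('ThreePines', type = 6)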

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub), and, hopefully at some point in the future, add the functionality of Bacon to this as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518

EarthCube webinars and the challenges of cross-disciplinary Big Data.

EarthCube is a moon shot.  It’s an effort to bring communities broadly supported through the NSF Geosciences Directorate and the Division of Advanced Cyberinfrastructure together to create a framework that will allow us to understand our planet (and solar system) in space and in time using data and models generated across a spectrum of disciplines, and spanning scales of space and time.  A lofty goal, and a particularly complex one given the fragmentation of many disciplines, and the breadth of researchers who might be interested in participating in the overall project.

To help support and foster a sense of community around EarthCube the directorate has been sponsoring a series of webinars as part of a Research Coordination Network called “Collaboration and Cyberinfrastructure for Paleogeosciences”, or, more simply, C4P.  These webinars have been held every other Tuesday from 4 – 5pm Eastern, but are archived on the webinar website (here).

The Neotoma Paleoecological Database was featured as part of the first webinar.  Anders Noren talked about the cyberinfrastructure required to support LacCore’s operations, and Shanan Peters talked about an incredible text-mining initiative (GeoDeepDive) in one of the later webinars.

Fig 1. The flagship for the Society for American Pedologists has run aground and it is now a sitting duck for the battle machine controlled by the Canadian Association for Palynologists in the third war of Data Semantics.

It’s been interesting to watch these talks and think about how unique each of these paleo-cyberinfrastructure projects is, but also how much overlap there is in data structure, use, and tool development.  Much of the struggle for EarthCube is going to be developing a data interoperability structure and acceptable standards across disciplines.  In continuing to develop the neotoma package for R I’ve been struggling to understand how to make the data objects we pull from the Neotoma API interact well with standard R functions, and existing R packages for paleoecological data.  One of the key questions is how far do we go in developing our own tools before that tool development creates a closed ecosystem that cuts off outside development?  If I’m struggling with this question in one tiny disciplinary nook, imagine the struggle that is going to occur when geophysicists and paleobotanists get together with geochronologists and pedologists!
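To make that concrete with an example from my own tiny nook: the kind of interaction I’m aiming for is being able to pull a record straight from the Neotoma API and hand its pieces to plain old R functions without any reshaping. A minimal sketch, assuming the count table sits in the counts component of the download object (which is how the package structures things at the moment, but treat the details as illustrative):

library(neotoma)

#  Three Pines Bog again (dataset.id = 7):
three.pines <- get_download(7)

#  The count table is just a samples-by-taxa table, so standard R functions
#  apply directly.  Convert counts to proportions and cluster the samples:
counts <- as.matrix(three.pines[[1]]$counts)
props  <- counts / rowSums(counts)

sample.clusters <- hclust(dist(props))
plot(sample.clusters)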

Interoperability of these databases needs to be a key goal.  Imagine the possibilities if we could link modern biodiversity databases with Pleistocene databases such as Neotoma, and then to deep-time databases like the Paleobiology Database in a seamless manner.  Big data has clearly arrived in some disciplines, but the challenges of creating big data across disciplines are just starting.

Natural Earth Data and R in ggplot2

I really like the data layers available from naturalearthdata.  They take a simple map and make it look beautiful, which I suppose is the point.

Figure 1. Here’s an amazing background for your maps using Natural Earth Data and ggplot2

One thing that I’ve found frustrating though is the need to constantly go back and forth between R and ArcGIS/QGIS/GRASS to make figures with spatial data if I want them to look really pretty.  In preparation for AGU2013 this year (come see our poster) I decided I’d finally solve this annoying problem by figuring out how to get a pretty base raster layer into ggplot2, and then superimpose the lakes, rivers and coastline that make a nice map shine.

Here’s the solution to make this fine figure:

library(raster)
library(rgdal)
library(ggplot2)
library(reshape2)
library(plyr)

#  Assuming you have a path 'Maps' that you store your spatial files in.  This
#  is all downloaded from http://www.naturalearthdata.com/downloads/ using the
#  1:50m "Medium" scale data.

nat.earth <- stack('./Maps/NE2_50M_SR_W/NE2_50M_SR_W.tif')

ne_lakes <- readOGR('./Maps/NaturalEarth/ne_50m_lakes.shp',
                    'ne_50m_lakes')

ne_rivers <- readOGR('./Maps/NaturalEarth/ne_50m_rivers_lake_centerlines.shp',
                     'ne_50m_rivers_lake_centerlines')

ne_coast <- readOGR('./Maps/NaturalEarth/ne_50m_coastline.shp',
                     'ne_50m_coastline')

#  I have a domain I'm interested in, but there's no reason you can't define something else:
quick.subset <- function(x, longlat){

# longlat should be a vector of four values: c(xmin, xmax, ymin, ymax)
  x@data$id <- rownames(x@data)

  x.f <- fortify(x, region = "id")
  x.join <- join(x.f, x@data, by = "id")

  x.subset <- subset(x.join, x.join$long > longlat[1] & x.join$long < longlat[2] &
                           x.join$lat > longlat[3] & x.join$lat < longlat[4])

  x.subset
}

domain <- c(-98.6, -66.1, 36.5, 49.7)
lakes.subset <- quick.subset(ne_lakes, domain)
river.subset <- quick.subset(ne_rivers, domain)
coast.subset <- quick.subset(ne_coast, domain)

nat.crop <- crop(nat.earth, y=extent(domain))

rast.table <- data.frame(xyFromCell(nat.crop, 1:ncell(nat.crop)),
                         getValues(nat.crop/255))

rast.table$rgb <- with(rast.table, rgb(NE2_50M_SR_W.1,
                                       NE2_50M_SR_W.2,
                                       NE2_50M_SR_W.3,
                                      1))
# et voila!

ggplot(data = rast.table, aes(x = x, y = y)) +
  geom_tile(fill = rast.table$rgb) +
  geom_polygon(data=lakes.subset, aes(x = long, y = lat, group = group), fill = '#ADD8E6') +
  scale_alpha_discrete(range=c(1,0)) +
  geom_path(data=lakes.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  geom_path(data=river.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  geom_path(data=coast.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  xlab('') + ylab('')

Some issues still remain for me. For one, there’s a problem with holes in this plotting. You’ll notice that Manitoulin Island and half of the Keweenaw Peninsula are basically lake (the light blue background) outlined in blue, and not the textured raster. I’m not sure exactly how to fix that; there’s plenty of discussion about how to fix holes in polygons using ggplot2, but I haven’t worked out the solution yet.
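One possibility I haven’t tested is the ggpolypath package, which (as I understand it) provides a geom_polypath() meant as a drop-in replacement for geom_polygon() that respects holes. If that’s right, the fix would be swapping the lake fill layer in the plot above for something like the following (untested, so caveat emptor):

# install.packages('ggpolypath')  # an untested suggestion on my part
library(ggpolypath)

#  Replace the geom_polygon() lake layer in the ggplot() call above with this
#  layer; everything else stays the same.
geom_polypath(data = lakes.subset, aes(x = long, y = lat, group = group),
              fill = '#ADD8E6')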


If you have a solution I’m all ears.

Reproducibility and R – Better results, better code, better science.

I made a short presentation for our most recent weekly lab meeting about best practices for reproducible research.  There are a few key points; the first is that the benefits of reproducible research are not just for the community.  Producing reproducible code helps you, both after publication (higher citation rates: Piwowar and Vision, 2013) and in the long run, in terms of your ability to tackle bigger projects.

Let’s face it: if you intend to pursue a career inside or outside of academia your success is going to depend on tackling progressively larger or more complex projects.  If programming is going to be a part of that then developing good coding practice should be a priority.  One way to get into the habit of developing good practice is to practice.  In the presentation (PDF, figShare) I point to a hierarchy (of sorts) of good scientific coding practice; reproducible programming helps support that practice:

  1. An integrated development environment (IDE) helps you organize your code in a logical manner, helps make some repeatable tasks easier and provides tools and views to make the flow of code easier to read (helping you keep track of what you’re doing).
  2. Version control helps you make incremental changes to your code, to comment the changes clearly, and helps you fix mistakes if you break something.  It also helps you learn from your old mistakes, you can go back through your commit history and see how you fixed problems in the past.
  3. Embedded code helps you produce clean and concise code with a specific purpose, and it helps you in the long run by reducing the need to “find and replace” values throughout your manuscript.  It helps reviewers as well.  Your results are simply a summary of the analysis you perform; the code is the analysis.  If you can point readers and reviewers to the code you save everyone time (a minimal example of embedded code follows this list).
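For anyone who hasn’t seen it in practice, here is a minimal sketch of what embedded code looks like in an R Markdown document. knitr turns the chunk and the inline expression into numbers in the rendered text; the file and object names are just placeholders:

---
title: "A minimal reproducible manuscript"
output: html_document
---

```{r load-data, echo = FALSE}
#  All of the analysis lives in chunks like this one:
pollen <- read.csv('pollen_counts.csv')
site.count <- nrow(pollen)
```

We analysed pollen counts from `r site.count` sites.

If the underlying data change, the number in that sentence changes with it; there is no “find and replace” step between the analysis and the manuscript.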

So, take a look at the presentation, let me know what you think.  And, if you are an early-career researcher, make now the time to start good coding practice.