Building your network using ORCiD and ROpenSci

Our neotoma package is part of the rOpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from the rOpenSci folks throughout the coding process.  Recently, Scott Chamberlain posted some code for an R package to interface with ORCiD: the rorcid package.

To digress for a second: the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second-guessing myself ever since, especially as it has become more and more apparent that, in writing proposals and talking about the package and the database, I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being: stick an R in it so that you don’t have to keep clarifying the differences.

So, back on point.  A little while ago I posted a network diagram culled from my CV using a bibtex parser in R (the bibtex package by Romain François).  That was kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious what the network would look like at the second order: what happens when we combine my publication network with the networks of my collaborators?

Figure 1. A second-order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code and it will fill in as more people sign up for ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they may have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which ORCiDs can be linked to other objects will be the real test of their staying power.  And even there, it’s not so much whether the IDs can be linked (they’re unique identifiers, so they’re easy to use); it’s whether other projects, institutions and data repositories will create a space for ORCiDs so that they can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account, it seems like people are prepared to invest in ORCiD for the long haul, which is exciting, and it provides new opportunities for data analysis and for building research networks.

So, let’s see what we can do with ORCiD and Scott’s rorcid package. This code is all available in a GitHub repository, so you can modify it, fork, push or pull as you like.

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With that ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can then be linked back to the ORCiDs registered for its co-authors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors who have ORCiDs.

So, for each published article we get the DOI, find all the co-authors on that work who have an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function (there’s a rough sketch of what that might look like further down).  I thought about it, but since this was basically a lark I figured I’d think about it later, or leave it up to someone else to add to the existing repository (feel free to fork & modify).

In the end I coded this all up and plotted it using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools):

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rorcid package from the rOpenSci GitHub repository.

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API. Once we have the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCiD.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 || !'DOI' %in% x[,1]){  # '||' short-circuits when the entry is NULL
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now that we’ve got the functions, we’re going to get all my papers, make a list of the unique ORCiDs of my colleagues, and then get all of their papers using the same get_papers function. It’s a bit sloppy I think, but I wanted to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}
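
If you did want to go further down the wormhole, a recursive wrapper around the get_papers function might look something like the sketch below. This is untested and purely illustrative (crawl_orcid and its depth argument are my own invention, not part of rorcid), and in practice you would want to cache results so that you aren’t hammering the API with repeated calls.

#  A sketch of a recursive crawl: 'depth' controls how many co-author
#  steps away from the seed ORCiD we travel.
crawl_orcid <- function(orcid, depth = 2){
  papers <- get_papers(orcid_id(orcid = orcid, profile = "works"))
  if(depth == 1) return(papers)

  colleagues <- setdiff(unique(na.omit(papers$orcid)), orcid)
  deeper     <- lapply(colleagues, crawl_orcid, depth = depth - 1)

  unique(do.call(rbind.data.frame, c(list(papers), deeper)))
}

#  A second-order network seeded with my own ORCiD would then be:
#  all.df <- crawl_orcid('0000-0002-2700-4605', depth = 2)

Anyway, back to the simpler loop above.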

So now we’ve got a list with a data.frame for each author that has three columns, the DOI, the ORCID and their name. We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row x column intersection represents co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing each author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value for the cell that intersects co-authors, where names are pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)

How far do you go with coding in a collaborative project?

I’ve been coding in R since I started graduate school, and I’ve been coding in one way or another since I learned how to use Turing in high school.  I’m not an expert by any means, but I am proficient enough to do the kind of work I want to do, to know when I’m stuck and need to ask for help, and to occasionally over-reach and get completely lost in my own code.

I’ve been working hard to improve my coding practice, particularly focusing on making my code more elegant, and looking at ways to use public code as a teaching tool. In submitting our neotoma package paper to Open Quaternary I struggled with the balance between making pretty figures and making easily reproducible examples, and between providing ‘research-quality’ case studies and having the target audience (generally non-programmers) turn off at the sight of walls of code.

There’s no doubt that the quality and interpretability of your code can have an impact on subsequent papers.  In 2012 there was a paper in PNAS that showed that papers with more equations often get cited less than similar papers with fewer equations (SciAm, Fawcett and Higginson, 2012).  I suspect the same pattern of citation holds for embedded code blocks, although how frequently this happens outside specialized journals isn’t clear to me. It certainly didn’t hurt Eric Grimm’s CONISS paper (doi and PDF) which has been cited 1300+ times, but this may be the exception rather than the rule, particularly in paleoecology.

I’m currently working on a collaboration with researchers at several other Canadian universities. We’re using sets of spatio-temporal data, along with phylogenetic information across kingdoms, to do some pretty cool research.  It’s also fairly complex computationally.  One of the first things I did in co-developing the code was to go through the code-base and re-write it to make it a bit more robust to bugs and to modularize it.  This meant pulling repeated blocks of code into new functions, taking functions and putting them into their own files, generalizing some of the functions so that they could be re-used in multiple places, and generally spiffing things up with a healthy use of the plyr package.

Figure 1. This is what coding machismo looks like on me.

This, I now realize, was probably something akin to statistical machismo (maybe more like coding machismo). The use of coding ‘tricks’ limited my co-authors’ ability to understand and modify the code themselves.  It also meant that further changes to the code required more significant investments of my time. They’re all great scientists, but they’re not native coders in the way I am (not that I’m in the upper echelon myself).

This has been a real learning experience.  Coding for research is not the same as coding for an industrial application.  The culture shift in the sciences towards an open-sharing model means that we are no longer doing this work just so that we get output that “works”, but so that the code itself is an output.  Collaborative coding should mean that participants in the project are able to understand what the code does and how it works.  In many cases that means recognizing that collaborators are likely to have differing skill sets when it comes to coding, and that those different skill sets need to be respected.

In my case, going ahead and re-writing swaths of code certainly helped reduce the size of the code-base; it meant that, in general, things ran more smoothly, and that changes to the code could be accomplished relatively easily.  It also meant that I was the only one who could easily make these changes.  This is not good collaborative practice, at least at the outset.

Figure 2. A flowchart showing the relationships between classes in the neotoma package.

Having said that, there are lots of good reasons why good coding practice can be beneficial, even if some collaborators can’t immediately work through the changes.  It’s partly a matter of providing road maps, something that is rarely done.  Good commenting is useful, but more often lately I’ve been leaning toward mapping out an application with flowcharts.  It’s not pretty, but the diagram I drew up for our neotoma package paper (Goring et al., submitted) helped me debug, clean up syntax errors, and gave me an overview I didn’t really have until I drafted it.

I’m working on the same kind of chart for the project I mentioned earlier, although it can be very complicated to do.  I’ve also been making an effort to clearly report changes I’ve made using git, so that we have a good record of what’s been done, and why. Certainly, it would have been easier to do all this in the first place, but I’ve learned my lesson. As in most research, a little bit of planning can go a long way.

Writing robust, readable code is an important step toward open and reproducible science, but we need to acknowledge the fact that reproducibility should not be limited to expert coders.  Trade-offs are a fact of life in evolution, and yet some of us are unwilling to make the trade-offs in our own scientific practice.  We are told constantly to pitch our public talks to the audience.  In working with your peers on a collaboration you should respect their relative abilities and ensure that they can understand the code, which may result in eschewing fancier tricks in favor of clearly outlined steps.

If you code in such a way that people can’t work with you, you are opening yourself up to more bugs, more work and possibly, more trouble down the line.  It’s a fine balance, and as quantitative tools such as R and Python become more a part of the lingua franca for grad students and early career researchers, it’s one we’re going to have to negotiate more often.

Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age.  There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of representing some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data.  One of the critical issues is that the benchmarks we use to build age models change through time.  Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog.  These changes, and the diversity of age models out there in the paleo-literature, mean that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam’ (Blaauw, 2010). This is heavily used software, and it provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of a sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to another was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package, making the workflow “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. clam is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("SimonGoring/clam")
install_github("ropensci/neotoma")
require(clam)
require(neotoma)

# Use this, and change the directory location, to set a new working directory if you want.
# We will be creating a Cores folder and new files & figures associated with clam.
setwd('.')  

#  This example will use Three Pines Bog, a core published by Diana Gordon as part
#  of her work in Temagami.  It is stored in Neotoma as dataset.id = 7.  I use it
#  pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the clam function works exactly the same way it does in the manual, except that I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratulations. I hope you can also see how powerful this workflow can be.
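
If you want to try the new smoother the call looks the same; the only thing that changes is the type argument (type 6 here is the Stineman smoother I added, with the other type numbers following the clam manual):

#  The same core, this time fit with the Stineman smoother:
clam('ThreePines', type = 6)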

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub) and, hopefully at some point in the future, to add the functionality of Bacon to this as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518

How do you edit someone else’s code?

As academics I like to think that we’ve become fairly used to editing text documents. Whether handwriting on printed documents (fairly old school, but cool), adding comments on PDFs, or using some form of “track changes” I think we’ve learned how to do the editing, and how to incorporate those edits into a finished draft. Large collaborative projects are still often a source of difficulty (how do you deal with ten simultaneous edits of the same draft?!) but we deal.

Figure 1. If your revisions look like this you should strongly question your choice of (code) reviewer.

I’m working on several projects now that use R as a central component in analysis, and now we’re not just editing the text documents, we’re editing the code as well.

People are beginning to migrate to version control software, and the literature is increasingly discussing the utility of software programming practices (e.g., Scheller et al., 2010), but given that scientific adoption of programming tools is still in its early stages, there’s no sense that we can expect people to immediately pick up all the tools that go along with them. Yes, it would be great if people would start using GitHub or BitBucket (or other version control tools) right away, but they’re still getting used to basic programming concepts (btw, Tim Poisot has some great tips for Learning to Code in Ecology).

The other issue is that collaborating with graduate students is still a murky area. How much editing of code can you do before you’ve started doing their work for them? I think we generally have a sense of where the boundaries are for written work, but if code is part of ‘doing the experiment’, how much can you do? Editing is an opportunity to teach good coding practice, and to teach new tools to improve reproducibility and ease of use, but give the student too much and you’ve programmed everything for them.

I’m learning as I go here, and I’d appreciate tips from others (in the comments, or on twitter), but this is what I’ve started doing when working with graduate students:

  • Commenting using a ‘special’ tag:  Comments in R are just an octothorpe (#); I use #* to differentiate what I’m saying from a collaborator’s comments (there’s a short example after this list).  This is fairly extensible: someone else could comment ‘#s’ or ‘#a’ if you have multiple collaborators.
  • Where there are major structural changes (sticking things in functions) I’ll comment heavily at the top, then build the function once.  Inside the function I’ll explain what else needs to be done so that I haven’t done it all for them.
  • If similar things need to be done further down the code I’ll comment “This needs to be done as above” in a bit more detail, so they have a template and they know where they’re going.
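
To make that first point concrete, here’s roughly what the tagged comments end up looking like in a student’s script (the variable names and the original comment are invented for the sake of the example):

# calculate mean July temperature for each site
site.means <- aggregate(temp ~ site, data = july.temps, FUN = mean)
#* This works, but aggregate gets called the same way three more times below.
#* Consider wrapping it in a small function so you only have to change it in one place.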

The tricky part about editing code is that it needs to work, so it can be frustratingly difficult to do half-edits without introducing all sorts of bugs or errors.  So if code review is part of your editing process, how do you accomplish it?

R Tips I Wish I Had Learned Earlier – Using Functions

This post is part of a series expanding on suggestions from twitter users and contributors on reddit.com’s rstats forum about tips and tricks that they wish they had learned earlier.

Writing functions in R isn’t something that people start doing right away, and it is something that people often actively avoid, since functions often require rewriting code and, in some cases, mean you have to think hard about how your data look and behave in the context of your analysis. When you’re starting out on a project, or a piece of analysis, it can often seem easier to just cut and paste pieces of code, replacing key variable names. In the long term this kind of cutting and pasting can cause serious problems in your analysis.

It’s funny.  We have no problem using functions in our day-to-day programming: lm, plot, sqrt. These functions simplify our data analysis, we use them over and over, and we’d never think about using the raw code.  But when it comes to our own analysis we’re willing to cut and paste, and to let fairly simple sets of operations run into the hundreds of lines, just because we don’t want to (or don’t know how to) add a few simple lines of code to make things simpler.

Functions can improve your code in a few ways:

  1. Fewer overall lines of code, which makes it easier to collaborate on, share and manage an analysis.
  2. Faster debugging when you find mistakes.
  3. Opportunities to make further improvements using lapply, parallel applications and other *ply functions.
  4. Fewer variables in memory, and faster processing over time.

So let’s look at what a function does:  A function is a piece of code – a set of instructions – that takes in variables, performs an operation on those variables, and then returns a value.
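
As a toy illustration (nothing to do with any real analysis), here is a function that takes in a temperature in Fahrenheit, operates on it, and returns the value in Celsius:

f_to_c <- function(fahrenheit){
  #  Take a value in, do something to it, and return the result:
  (fahrenheit - 32) * 5 / 9
}

f_to_c(212)  # returns 100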

For example, the function lm takes in information about two variables, as a formula, and then calculates the linear model relating them. The actual function is 66 lines of code (you can just type lm into your console), but it includes calls to another function, lm.fit, which is 67 lines long, and that too includes calls to other functions. I’m sure we can all agree that

null.model <- lm(satisfaction ~ 1)  # the null model
my.postdoc <- lm(satisfaction ~ functions.used)
ideal.postdoc <- lm(satisfaction ~ functions.used + work.life.balance)

is much cleaner and more satisfying than 180 lines of code, and it’s much easier to edit and change.

If you look at your code and you’ve got blocks of code that you use repeatedly it’s time to sit down and do a bit of work. This should be fully reproducible:

library(analogue)
data(Pollen, Climate, Location, Biome)

#  Cleaning pollen data:
# Is the location in our bounding box (Eastern North America)
good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
Pollen.clip <- Pollen[good.loc ,]
Pollen.trans <- apply(Pollen.clip, 1, sqrt)
pol.mean <- rowMeans(Pollen.trans)
Climate.clip <- Climate[good.loc ,]
Climate.trans <- apply(Climate.clip, 1, scale)
clim.mean <- rowMeans(Climate.trans)
Location.clip <- Location[good.loc,]
Location.mean <- rowMeans(Location.clip)

This is a simple example, and it doesn’t exactly make sense (there’s no reason to average the rows). But, it is clear that for all three datasets (climate, location and pollen) we are doing the same general set of operations.

So, what are the things we are doing that are common?

  1. We’re removing a common set of sample sites.
  2. We are applying a function by rows (except for Location).
  3. We are averaging by row.

So, this seems like a decent candidate for building a function. Right now we’re at 9 lines of code, so keep that in mind.

We want to pass in a variable, so we’ll call it x, and we’re applying a function across rows before averaging, so we can pass in a function name as y. Location is a bit tricky: we don’t actually transform it, but there is a function called identity that just returns the argument passed to it, so we can use that. Lastly, we take the rowMeans, so we can pass that out. Let’s try to construct the function:

library(analogue)
data(Pollen, Climate, Location, Biome)

clip_mean <- function(x, y){
  #  edit: Gavin Simpson pointed out that I had made an error in my code, calling
  #  apply(x, 1, y) instead of apply(x.clip, 1, y).  The nice thing about writing the
  #  function is that I only needed to correct this once, instead of three times.
  #  edit 2: It was also pointed out by reddit user jeffhughes that x and y are not very 
  #  good variable names, since they're fairly generic.  I agree.  You'd be better off
  #  writing more descriptive variable names.

  good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
  x.clip <- x[good.loc, ]
  x.trans <- apply(x.clip, 1, y)
  rowMeans(x.trans)
}

pollen.mean <- clip_mean(Pollen, sqrt)
climate.mean <- clip_mean(Climate, scale)
location.mean <- clip_mean(Location, identity)

So we’re still at 9 lines of code, but to me this reads much cleaner. We could go even further by combining the lines below the good.loc assignment into something like this:

rowMeans(apply(x[good.loc,], 1, y))

which would make our 9 lines of code into 7, fairly clean lines of code.
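
Folded back into the function, that condensed version would look something like this (the same behaviour as above, just shorter):

clip_mean <- function(x, y){
  good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
  rowMeans(apply(x[good.loc, ], 1, y))
}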

The other advantage of sticking this all into a function is that all the variables created in clip_mean (e.g., x.trans) only exist within the function. The only thing that gets passed out at the end is the result of the rowMeans call. If we had assigned rowMeans to a variable inside the function (using ‘<-’) we wouldn’t even have that at the end, since you need to pass data out of the function, either by using the return command or by leaving the value as the last expression evaluated. So, even though we’ve still got 9 lines of code, we end up with only 8 variables in memory instead of 13. This is a big help, because the more variables you have the harder debugging becomes.
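
If you want to convince yourself of that scoping behaviour, a throwaway example does the trick (scope_test is just an invented name):

scope_test <- function(x){
  inside <- x * 2  # 'inside' only exists while the function is running
  inside           # the last value evaluated is what gets returned
}

scope_test(4)     # returns 8
exists('inside')  # FALSE: nothing called 'inside' ends up in your workspace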

So, to summarize, we’ve got cleaner code, fewer variables and you now look like a total pro.
If you have any tips or suggestions for working with functions please leave them in the comments.

Here are some other resources for creating functions:
Quick-R: http://www.statmethods.net/management/userfunctions.html
Working with Functions: http://en.wikibooks.org/wiki/R_Programming/Working_with_functions
Hadley Wickham’s more advanced resources for functions: http://adv-r.had.co.nz/Functions.html

Thanks again to everyone who contributed suggestions!

Setting the record straight about Search Committees and Nature & Science articles.

In a recent Twitter thread I made a lighthearted post about how a search committee might write a script in R to select candidates, based on the premise that committees only care about Nature and Science articles.  I want to set the record straight and say, directly, that those of you who read this blog, and who follow me on twitter (or who follow any of those who re-tweeted the original tweet) deserve an apology.

The tweet was as follows:

I think that this is a problematic tweet for several reasons.  First, it implies that search committees are driven only to find applicants who have papers in Nature or Science.  I have served on a search committee before, and I know people who are on search committees, so I know that this is untrue.  The more problematic point is that this tweet also implies that members of search committees have no idea how to code in R.

It was not my intention to imply this, and if you are on a search committee who is currently reviewing me, I was talking about another search committee, one that is not nearly as nice as you, and one that I didn’t really care about in the first place.  I know that as a search committee you would be much more likely to code this properly as follows:

# We made everyone submit their applications as an R data object, because it was less complicated than using our website.
applications <- list.files('./JunkFolder')
all.apps <- lapply(applications, load)

short.list <- sapply(all.apps, 
                     function(x) regexpr('^nature$|^science$', x$publications$journal, ignore.case = TRUE)>0)

You can probably work out the rest here (I don’t know where people are storing their contact information; you decided that), but I just felt bad and wanted to apologize for making it look like you didn’t know how to do a proper regular expression search and stuff. My original tweet wouldn’t have searched for any papers published in Science; it would have found people who published in “The Northern Manitoba Journal of Nature and Profiteering” and short-listed them, which obviously wasn’t your intention.

So here we search for all journals that both start (^) and end ($) with either ‘nature’ or (|) ‘science’, ignoring the case. Let’s face it: even though Quaternary Science Reviews is a great journal that publishes high quality papers, they’re mostly written by riff-raff.
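
Just to see the difference the anchoring makes (using grepl here for brevity):

journals <- c('Nature', 'Science', 'Quaternary Science Reviews',
              'The Northern Manitoba Journal of Nature and Profiteering')

#  Unanchored: matches anything that merely contains the words.
grepl('nature|science', journals, ignore.case = TRUE)
#  [1] TRUE TRUE TRUE TRUE

#  Anchored: only the exact journal names match.
grepl('^nature$|^science$', journals, ignore.case = TRUE)
#  [1] TRUE TRUE FALSE FALSE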

Natural Earth Data and R in ggplot2

I really like the data layers available from naturalearthdata.  They take a simple map and make it look beautiful, which I suppose is the point.

Figure 1. Here’s an amazing background for your maps using Natural Earth Data and ggplot2

One thing that I’ve found frustrating, though, is the need to constantly go back and forth between R and ArcGIS/QGIS/GRASS to make figures with spatial data if I want them to look really polished.  In preparation for AGU2013 this year (come see our poster) I decided I’d finally solve this annoying problem by figuring out how to get a nice base raster layer into ggplot, and then superimpose the lakes and streams that make a map look pretty.

Here’s the solution to make this fine figure:

library(raster)
library(rgdal)
library(ggplot2)
library(reshape2)
library(plyr)

#  Assuming you have a path 'Maps' that you store your spatial files in.  This
#  is all downloaded from http://www.naturalearthdata.com/downloads/ using the
#  1:50m "Medium" scale data.

nat.earth <- stack('./Maps/NE2_50M_SR_W/NE2_50M_SR_W.tif')

ne_lakes <- readOGR('./Maps/NaturalEarth/ne_50m_lakes.shp',
                    'ne_50m_lakes')

ne_rivers <- readOGR('./Maps/NaturalEarth/ne_50m_rivers_lake_centerlines.shp',
                     'ne_50m_rivers_lake_centerlines')

ne_coast <- readOGR('./Maps/NaturalEarth/ne_50m_coastline.shp',
                     'ne_50m_coastline')

#  I have a domain I'm interested in, but there's no reason you can't define something else:
quick.subset <- function(x, longlat){
  # longlat should be a vector of four values: c(xmin, xmax, ymin, ymax)
  x@data$id <- rownames(x@data)

  x.f <- fortify(x, region = "id")
  x.join <- join(x.f, x@data, by = "id")

  x.subset <- subset(x.join, x.join$long > longlat[1] & x.join$long < longlat[2] &
                           x.join$lat > longlat[3] & x.join$lat < longlat[4])

  x.subset
}

domain <- c(-98.6, -66.1, 36.5, 49.7)
lakes.subset <- quick.subset(ne_lakes, domain)
river.subset <- quick.subset(ne_rivers, domain)
coast.subset <- quick.subset(ne_coast, domain)

nat.crop <- crop(nat.earth, y=extent(domain))

rast.table <- data.frame(xyFromCell(nat.crop, 1:ncell(nat.crop)),
                         getValues(nat.crop/255))

rast.table$rgb <- with(rast.table, rgb(NE2_50M_SR_W.1,
                                       NE2_50M_SR_W.2,
                                       NE2_50M_SR_W.3,
                                      1))
# et voila!

ggplot(data = rast.table, aes(x = x, y = y)) +
  geom_tile(fill = rast.table$rgb) +
  geom_polygon(data=lakes.subset, aes(x = long, y = lat, group = group), fill = '#ADD8E6') +
  scale_alpha_discrete(range=c(1,0)) +
  geom_path(data=lakes.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  geom_path(data=river.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  geom_path(data=coast.subset, aes(x = long, y = lat, group = group), color = 'blue') +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  xlab('') + ylab('')

Some issues still remain for me. For one, there’s a problem with holes in this plotting. You’ll notice that Manitoulin Island and half of the Keweenaw Peninsula are basically lake (the light blue background) outlined in blue, and not the textured raster. I’m not sure exactly how to fix that; there’s plenty of discussion about how to fix holes in polygons using ggplot, but I haven’t worked out the solution yet.

 

If you have a solution I’m all ears.