How do you edit someone else’s code?

As academics, I like to think that we’ve become fairly used to editing text documents. Whether handwriting on printed drafts (fairly old school, but cool), adding comments to PDFs, or using some form of “track changes”, I think we’ve learned how to do the editing, and how to incorporate those edits into a finished draft. Large collaborative projects are still often a source of difficulty (how do you deal with ten simultaneous edits of the same draft?!), but we manage.

Figure 1. If your revisions look like this you should strongly question your choice of (code) reviewer.

I’m working on several projects that use R as a central component of the analysis, which means we’re no longer just editing the text documents, we’re editing the code as well.

People are beginning to migrate to version control software, and the literature increasingly discusses the utility of software development practices (e.g., Scheller et al., 2010), but given that scientific adoption of programming tools is still in its early stages, we can’t expect people to immediately pick up all the associated tools that go along with them. Yes, it would be great if people started using GitHub or BitBucket (or other version control platforms) right away, but many are still getting used to basic programming concepts (by the way, Tim Poisot has some great tips for Learning to Code in Ecology).

The other issue is that collaborating with graduate students is still a murky area. How much editing of code can you do before you’ve started doing their work for them? I think we generally have a sense of where the boundaries are for written work, but if code is part of ‘doing the experiment’, how much can you do? Editing is an opportunity to teach good coding practice, and to teach new tools to improve reproducibility and ease of use, but give the student too much and you’ve programmed everything for them.

I’m learning as I go here, and I’d appreciate tips from others (in the comments or on Twitter), but this is what I’ve started doing when working with graduate students:

  • Commenting using a ‘special’ tag: comments in R begin with an octothorpe (#); I use #* to differentiate what I’m saying from a collaborator’s comments. This is fairly extensible: with multiple collaborators, others can tag their comments ‘#s’ or ‘#a’.
  • Where there are major structural changes (e.g., moving repeated code into functions) I’ll comment heavily at the top, then build the function once. Inside the function I’ll explain what still needs to be done, so that I haven’t done it all for them.
  • If similar things need to be done further down the code I’ll comment “This needs to be done as above” in a bit more detail, so they have a template and they know where they’re going.
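To make those points concrete, a reviewer’s pass over a student’s script might look something like this. Everything here is invented for the example: the #* SG: tag is just my convention (initials plus the special comment marker), and the data and function are toys:

```r
#* SG: This block is repeated for every site below; consider wrapping it in
#* a function. I've built the first one as a template -- the remaining
#* blocks further down need the same treatment.

# Convert raw pollen counts for one site into proportions (toy data).
to_proportions <- function(counts) {
  #* SG: What happens when a site has all-zero counts? You'll divide by
  #* zero here -- decide how to handle that case before applying this
  #* everywhere.
  counts / sum(counts)
}

site_a <- c(pine = 120, oak = 45, grass = 35)
prop_a <- to_proportions(site_a)

#* SG: This needs to be done as above for site_b and site_c.
```

The point is that the review comments stand out from ordinary # comments at a glance, and the half-finished refactor leaves the student a working template rather than a finished product.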

The tricky part about editing code is that it needs to work, so it can be frustratingly difficult to do half-edits without introducing all sorts of bugs or errors.  So if code review is part of your editing process, how do you accomplish it?

EarthCube webinars and the challenges of cross-disciplinary Big Data.

EarthCube is a moon shot.  It’s an effort to bring communities broadly supported through the NSF Geosciences Directorate and the Division of Advanced Cyberinfrastructure together to create a framework that will allow us to understand our planet (and solar system) in space and in time using data and models generated across a spectrum of disciplines, and spanning scales of space and time.  A lofty goal, and a particularly complex one given the fragmentation of many disciplines, and the breadth of researchers who might be interested in participating in the overall project.

To help support and foster a sense of community around EarthCube the directorate has been sponsoring a series of webinars as part of a Research Coordination Network called “Collaboration and Cyberinfrastructure for Paleogeosciences”, or, more simply, C4P.  These webinars have been held every other Tuesday from 4 – 5pm Eastern, but are archived on the webinar website (here).

The Neotoma Paleoecological Database was featured in the first webinar.  Anders Noren talked about the cyberinfrastructure required to support LacCore‘s operations, and Shanan Peters talked about an incredible text-mining initiative (GeoDeepDive) in one of the later webinars.

Fig 1. The flagship for the Society for American Pedologists has run aground and it is now a sitting duck for the battle machine controlled by the Canadian Association for Palynologists in the third war of Data Semantics.

It’s been interesting to watch these talks and think about both how unique each of these paleo-cyberinfrastructure projects is and how much overlap there is in data structure, use, and tool development.  Much of the struggle for EarthCube is going to be developing a data interoperability structure and acceptable standards across disciplines.  In continuing to develop the neotoma package for R I’ve been struggling to understand how to make the data objects we pull from the Neotoma API interact well with standard R functions, and with existing R packages for paleoecological data.  One of the key questions is: how far do we go in developing our own tools before that tool development creates a closed ecosystem that cuts off outside development?  If I’m struggling with this question in one tiny disciplinary nook, imagine the struggle that is going to occur when geophysicists and paleobotanists get together with geochronologists and pedologists!
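One small-scale version of that interoperability question in R is whether a package’s custom objects play nicely with base functions. A sketch of the pattern I mean — the class name, fields, and data here are entirely hypothetical, not the actual structure of neotoma’s objects:

```r
# A hypothetical "download"-style object: counts plus metadata bundled
# together under a custom S3 class.
dl <- structure(
  list(
    metadata = list(site = "Example Lake"),
    counts   = data.frame(taxon = c("Pinus", "Quercus", "Poaceae"),
                          count = c(120, 45, 35))
  ),
  class = "download_example"
)

# Registering an as.data.frame() method lets the object drop straight into
# base R and other packages, rather than locking users into
# package-specific tools.
as.data.frame.download_example <- function(x, ...) x$counts

df <- as.data.frame(dl)
```

Supplying methods for standard generics like this keeps a specialized data object open to the wider R ecosystem, which is a miniature version of the cross-database interoperability problem.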

Interoperability of these databases needs to be a key goal.  Imagine the possibilities if we could link modern biodiversity databases with Pleistocene databases such as Neotoma, and then to deep-time databases like the Paleobiology Database, in a seamless manner.  Big data has clearly arrived in some disciplines, but the challenge of creating big data across disciplines is just beginning.

Writing and collaborating on GitHub, a primer for paleoecologists

At this point I’ve written a hundred times about the supplement for Goring et al. (2013), but just in case you haven’t heard:

Goring et al. (2013) uses a large vegetation dataset to test whether or not pollen richness is related to plant richness at a regional scale.  Because of the nature of the data and analysis, and because I do most of my work using R, I thought it would be a good idea to produce totally reproducible research.  To achieve this I included a version of the paper, written using RMarkdown, as a supplement to the paper.  In addition, I posted the supplement to GitHub so that people who were interested in looking at the code more deeply could create their own versions of the data, and could use the strengths of the GitHub platform to help them write their own code or do similar analyses.

This is a basic how-to to get you started writing your paper using RMarkdown, RStudio and GitHub (EDIT: if some of these instructions don’t work let me know and I’ll fix them immediately):
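The GitHub half of that workflow boils down to a handful of commands once the RMarkdown file exists. A minimal sketch — the repository and file names here are hypothetical, and the config lines are only needed on a machine where git hasn’t been set up yet:

```shell
# Create a repository for the supplement and identify yourself to git.
git init richness-supplement
git -C richness-supplement config user.name "Your Name"
git -C richness-supplement config user.email "you@example.com"

# A stub RMarkdown file standing in for the real supplement.
printf '%s\n' '---' 'title: "Supplement"' 'output: html_document' '---' \
  'Pollen richness analysis goes here.' > richness-supplement/analysis.Rmd

# Stage and commit the draft.
git -C richness-supplement add analysis.Rmd
git -C richness-supplement commit -m "First draft of the RMarkdown supplement"

# Once a matching empty repository exists on GitHub, publish it:
# git -C richness-supplement remote add origin https://github.com/USER/richness-supplement.git
# git -C richness-supplement push -u origin master
```

From there, collaborators can clone or fork the repository and every change to the analysis is tracked alongside the text.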


The Nature Geosciences Climate 2k Journal Club

Nature Publishing Group held a hangout on Google+ to discuss the recent paper in Nature Geoscience summarizing global climate change over the last 2 kyr (the article discussed is Continental-scale temperature variability during the past two millennia; here).