Versioning R code

I’ve just uploaded a version of a very large data set to our project wiki. We have been using dokuwiki to share our data, R code and ideas/meeting notes/biographies/general stuff and we’re getting to the point where a lot of data has been processed and finalized.

Going from ‘me’ code, which is messy, has digressions and is generally lightly commented, to ‘you’ code is a scary thing. Some of the data I use is bound by data sharing agreements, some of my collaborators aren’t useRs, some of my collaborators are very good useRs (which is a bit daunting as well). So how do you balance user needs, code requirements, and reproducibility?

First, I really like the R style guide used by Google. Using a consistent and readable style makes comprehension easier for non-coders and coders alike. I make an effort to comment blocks succinctly but thoroughly, and I try to keep variable names sensible without making them so long that they’re cumbersome. Here’s a more comprehensive post about writing clean code and avoiding coding problems.
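As a small illustration of the commenting and naming points above, here is a sketch in R. It is not taken from the project code: the data frame `pollen_counts`, its columns and the helper `StandardError` are all made up for the example, and the naming convention shown is only one reasonable choice rather than a statement of what any particular style guide requires.

```r
# Hypothetical example data: pollen counts for two sites (made up).
pollen_counts <- data.frame(
  site  = rep(c("A", "B"), each = 3),
  count = c(12, 15, 9, 30, 28, 35)
)

# Hypothetical helper: standard error of the mean, ignoring missing values.
StandardError <- function(values) {
  values <- values[!is.na(values)]
  sd(values) / sqrt(length(values))
}

# Summarise by site. Each object name says what it holds, without being
# so long that it gets in the way.
site_means  <- tapply(pollen_counts$count, pollen_counts$site, mean)
site_errors <- tapply(pollen_counts$count, pollen_counts$site, StandardError)
```

One comment per block, placed above the code it describes, is usually enough for a non-coder to follow what each step is for.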

The data sharing issue was interesting for me, since I haven’t really had to deal with it before. The biggest problem is how other collaborators can run the code if they can’t have access to the data. To deal with this we randomly subset the data, taking only 1% of the total data set (the size is arbitrary; it should be small enough to obfuscate the data while still retaining enough structure that your code will run). By doing this I don’t have to generate ‘fake’ data that has the same properties as the underlying data. Unfortunately it also means that some of the significance tests begin to fail on the smaller sample. How big an issue this might be isn’t clear yet. My feeling is that if the code is sound, we can assume that the hypothesis tests are sound.
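The random 1% subset described above could be drawn along these lines. This is a minimal sketch, not the code actually used for the project: `full_data` is a simulated stand-in for the restricted data set, and the column names, the seed and the output file name are all assumptions made for the example.

```r
# Simulated stand-in for the restricted data set (hypothetical columns).
set.seed(42)  # fixed seed so the same subset is drawn on every run
full_data <- data.frame(
  site_id   = seq_len(10000),
  elevation = runif(10000, min = 0, max = 3000),
  richness  = rpois(10000, lambda = 12)
)

# Fraction of rows to share: 1%, as in the text (the value is arbitrary).
sample_fraction <- 0.01
n_keep <- ceiling(nrow(full_data) * sample_fraction)

# Draw row indices without replacement and keep every column, so the
# subset has the same structure as the full data set.
keep_rows   <- sample(nrow(full_data), size = n_keep)
sample_data <- full_data[keep_rows, , drop = FALSE]

# Write the shareable subset to disk (a temporary directory here).
sample_file <- file.path(tempdir(), "sample_data.csv")
write.csv(sample_data, file = sample_file, row.names = FALSE)
```

Setting the seed means the posted sample can be regenerated exactly. A simple random draw like this can miss rare categories entirely, so if the code depends on every level of a factor being present, sampling within groups may be needed instead.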

So what do I post? I post the code, the sample data sets and then the ‘correct’ outputs so that even though hypothesis tests may fail when collaborators run the code with the sub-sampled data, they would have access to the proper output tables.
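Posting the ‘correct’ outputs next to the sample data could look something like the sketch below. It assumes a simple linear model as the analysis, with simulated data in place of the real thing; the helper `FitSummary`, the variable names and the file name are hypothetical and only there to show the idea.

```r
# Simulated stand-in for the full data set (hypothetical variables).
set.seed(1)
full_data   <- data.frame(x = rnorm(5000))
full_data$y <- 0.05 * full_data$x + rnorm(5000)

# Hypothetical helper: fit the model and return its coefficient table.
FitSummary <- function(dat) {
  fit <- lm(y ~ x, data = dat)
  as.data.frame(coef(summary(fit)))
}

# Reference output, produced once from the full data and posted with the code.
reference_output <- FitSummary(full_data)
reference_file   <- file.path(tempdir(), "reference_output.csv")
write.csv(reference_output, file = reference_file)

# What a collaborator would get from the 1% sample.
sample_data   <- full_data[sample(nrow(full_data), size = 50), ]
sample_output <- FitSummary(sample_data)

# The table has the same rows and columns either way, even though the
# estimates and p-values will differ.
same_structure <- identical(dimnames(sample_output), dimnames(reference_output))
```

A collaborator can then check that their run produces a table of the same shape, and read the posted reference table for the values that the full data set actually gives.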

Finally, the pages on the dokuwiki all follow the same format, so people know when the code was posted and can find some context about the code, a flow chart of the code, the data required and the data output.

So that’s what I do, what do you do?

(What am I listening to today? T. Rex – Jewel, Veda Hille – Bedlam!)


Published by


Assistant scientist in the Department of Geography at the University of Wisconsin, Madison. Studying paleoecology and the challenges of large data synthesis.
