Your R code as an attachment

We recently submitted a paper to QSR and I attached the R code and primary data for inclusion.  I did this for two reasons: (1) it makes the data processing completely transparent and (2) it means reviewers can see what’s actually happening during data analysis.

There are lots of reasons for including your data in research.  Piwowar et al. (2007) show that publicly sharing data from cancer microarray clinical trials was associated with a 69% increase in citation rate.  Citation rates (for what they’re worth) are still primary metrics by which we can judge the importance of our research.  For this reason alone there is good motivation for including data, but what about your code?  Roger D. Peng sets out the importance of including code in submissions in an invited article in Biostatistics (Peng, 2009).  The article lays out the importance of reproducible research, especially in a research environment where we, as ecologists and paleoecologists, are looking for signals in very noisy data.

This is becoming increasingly important as NSF moves toward a model of transparency and builds mechanisms for enforcing Data Management Plans.  I’m at the NSF Macrosystems PI Workshop in Boulder, CO right now and we’re hearing lots about data sharing, reproducibility and the importance of clear data management.

Aside from institutionally enforced reasons for sharing code, the importance of other people’s code was reinforced when I reviewed a paper recently. I had some questions about methodology, and the application of various tests in R, but I couldn’t look at the code to see how important my concerns were (is it enough to ask for major changes, or just minor tweaks to the writing of the methods section?).  R is great in that it gives you a lot of options for any operation you want to do, but when you’re writing up a paper it can be really frustrating detailing all the choices you’ve made in parameterizing your GAM, random forest, or whatever, especially if you are using multiple tests and models.  Reviewers may often have more questions about your manipulations than you’ve answered in the text:  What spatial projection did you use for your analysis?  Was distance great-circle, metric grid, or just lat/long distance?  Did you use a gamma distribution or Poisson?  What kind of interpolation did you use, and how did you implement it?

There are lots of pluses to including code, but there’s always the possibility some reviewer will be shifty enough to use your code in nefarious ways.  Is it worth worrying about this?  Probably not.  I suspect the chances of this happening are low, and, ultimately, they’re jerks, so they’ve already got to deal with that, and institutionally it’s just bad form.  I think the bigger issue might be that there’s some crazy bug in your code that results in non-significance of significant results.  Would it suck if someone found it?  Yes.  Would it suck more than having to publish a retraction later?  No.

Best Practices:

What I’d like to do here is list some best practices for including your R code in a paper.  Right now this list is pretty weak, maybe there’s not much to it, but for the sake of argument:

  1. If you want results to be absolutely equivalent, call set.seed() before you use any sort of randomization.  That way the results should be consistent between computers and users.  Without a fixed seed, if your analysis depends on some sort of randomization, it is possible for the reviewer to obtain different results.  That shouldn’t be a big deal if our results are robust, but you never know.  Peng, in Biostatistics, asks only that results be within the bounds of numerical tolerance.
  2. Comment, use clear coding, line breaks when necessary and use white-space to make the code clearer.  Basically, follow the Google R Style Guide.
  3. Ensure that your data is included.  If, for some reason you can’t include your whole dataset, then either create a synthetic dataset that matches the properties of your dataset, or take a randomized subset of your dataset.
  4. Before you submit, run the code on its own to make sure it works!
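
Here is a rough sketch of what items 1 and 3 might look like in practice.  The data frame `full_data`, its columns, the seed value and the subset size are all made up for illustration; none of this comes from our actual submission.

```r
# Fix the random number generator state so repeated runs draw the same
# "random" values. The seed value itself is arbitrary.
set.seed(1234)

# Hypothetical stand-in for a real dataset: 200 samples, each with a
# site label and a measured value.
full_data <- data.frame(
  site  = sprintf("site_%03d", 1:200),
  value = rnorm(200, mean = 10, sd = 2)
)

# Item 3: a randomized subset (here 50 of the 200 rows) that could be
# shared when the whole dataset cannot.
subset_rows <- sample(nrow(full_data), size = 50)
shared_data <- full_data[subset_rows, ]

# Resetting the same seed reproduces the same draws exactly.
set.seed(1234)
check_value <- rnorm(200, mean = 10, sd = 2)
stopifnot(identical(check_value, full_data$value))

# Record the R version and loaded packages alongside the results.
sessionInfo()
```

One caveat: a fixed seed only guarantees identical draws under the same random number generator settings.  The default behaviour of sample() changed in R 3.6.0, for example, so it is worth reporting the R version (the sessionInfo() output) along with the code.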

I’ll add to this list if you have any comments to add.  If anyone’s seen a list like this let me know, sometimes searching for things about R is insanely frustrating!

What do I have stuck in my head:

  • Love’s Gonna Pack Up – The Persuaders
  • Thrasher – Neil Young, it makes me feel sad and blows me away at the same time.  Is this one of his best songs?  I never would have believed it when I was in High School, but I sure love it now.

Published by

downwithtime

Assistant scientist in the Department of Geography at the University of Wisconsin, Madison. Studying paleoecology and the challenges of large data synthesis.

7 thoughts on “Your R code as an attachment”

  1. A useful and interesting post Simon. I do wish more palaeo types would follow these kinds of ideas. A couple of comments.

    1) I wouldn’t be so dogmatic about following a style guide such as Google’s. They adopted these coding practices because individual coders work in larger teams, there are commercial interests and need for continuity of code as people move on. Individuals and small groups may have very different requirements for coding practices. Others might just plain hate camelCase or separate.names or separate_names styles for naming objects; each to their own! The key thing is, as you say, to write clear code and comment it well, don’t over optimise unless really necessary (neat one-liners might show off R coding prowess but their meaning will often be impenetrable a month or so down the line), and stick to a style.

    2) I’m not sure of the merit of including a random subset of the data or a synthetic one? If the data can’t be disclosed for whatever reason, don’t publish a second-rate version. Data licensing issues can usually be addressed if a third party really needs the data. And we should be pushing/educating data holders etc to publish data openly.

    3) Did you consider posting your data to a repository like Dryad, figshare, or Pangaea? I would certainly recommend doing this over submission as supplementary data to a paper, for several reasons; ability to cite with a stable URI (e.g. a data DOI or similar standard), fewer places people need to search to find your data (without going through paper) which increases utility etc. On the latter point, someone will no doubt find a use for your data that you would never think of, but they may never make the link if they have to find your paper which may be an area of science they know nothing about. You can always cite your deposited data in the paper using the DOI.

    On your last point – isn’t checking the code runs on its own before you submit a bit too late? I wouldn’t trust any code that couldn’t run from scratch and certainly wouldn’t base a paper on the results of that code. A better paradigm is the reproducible research one of embedding the R code in the paper document itself (i.e. Sweave, knitr etc). All of these need to run from scratch to generate the paper so if the code doesn’t run standalone, the paper won’t be built. That seems safer all round. And with knitr and odfweave packages for example you don’t even need to know LaTeX (IIRC) to use them.

    1. Thanks Gavin, some great comments! If you’ll bear with my replies:
      1. I didn’t mean to be dogmatic, I just meant that people should really consider the readability of their code, as you’ve outlined. It’s nice that Google has a ready-made style guide. I suspect that these kinds of things develop organically within lab cultures, but in some cases (such as mine) I was the first adopter in my lab during my Ph.D, so I didn’t have any style guide available.

      2. I generally agree with this, sometimes it’s just not possible though, and in that case, what’s better, no code, no data, or the code and some second-rate data?

      3. I think that both are important. Getting a DOI for the data is great, and with moves toward centralized data services (DataONE, Pangaea, Dryad) it’s going to be easier than ever, but at the same time, people access the data through the publications as well. As far as I know there’s no reason why you can’t do both.

      4. I just meant that checking the code is useful when you are changing the hard-coded paths to your internal directories into relative paths that can run on anyone’s computer, and making sure that there aren’t weird blobs of code that rely on some hidden R history file in the directory. The code needs to be numerically stable, but sometimes changing the home directory throws errors. I’ll check out knitr/odfweave though, thanks for the suggestions!

      1. Re 2) I’d just post the code and say where the data were obtained from, plus any code required to process the raw data into whatever form was used for the paper. Anyone interested can try to obtain access to the data. Increasingly, however, we will be required to post data, especially in good journals or by our funders, so one way or another we have to address this.

        Re 4) The reproducible research paradigm really helps here especially when you realise there are tools available to help you do all this reasonably easily. Even LaTeX isn’t hard once you know a few basic things. Let us know how you get on with knitr or odfweave if you give them a go – I’d be interested in your thoughts (I’m happy enough with Sweave, Emacs and my LaTeX stack myself).

        And sorry for the “dogmatic” (and I meant to say “about advocating a specific style guide”, not “following”). In a small group or for individuals, Google’s or similar style guides are useful to get you thinking about how you write your code, things you may not have considered (you must have been sent code by students or colleagues with object names that are 20+ characters long). I just haven’t found them much use beyond that – style is often a very personal thing 🙂
