We recently submitted a paper to QSR and I attached the R code and primary data for inclusion. I did this for two reasons: (1) it makes the data processing completely transparent and (2) it means reviewers can see what’s actually happening during data analysis.
There are lots of reasons for including your data in research. Piwowar et al. (2007) show that publicly sharing data from cancer clinical trials was associated with a 69% increase in citations. Citation rates (for what they’re worth) are still primary metrics by which we can judge the importance of our research. For this reason alone there is good motivation for including data, but what about your code? Roger D. Peng sets out the importance of including code in submissions in an invited article in Biostatistics (Peng, 2009). The article lays out the importance of reproducible research, especially in a research environment where we, as ecologists and paleoecologists, are looking for signals in very noisy data.
This is becoming increasingly important as NSF moves toward a model of transparency and builds mechanisms for enforcing Data Management Plans. I’m at the NSF Macrosystems PI Workshop in Boulder, CO right now and we’re hearing lots about data sharing, reproducibility and the importance of clear data management.
Aside from institutionally enforced reasons for sharing code, the importance of other people’s code was reinforced when I reviewed a paper recently. I had some questions about methodology, and the application of various tests in R, but I couldn’t look at the code to see how important my concerns were (is it enough to ask for major changes, or just minor tweaks to the writing of the methods section?). R is great in that it gives you a lot of options for any operation you want to do, but when you’re writing up a paper it can be really frustrating detailing all the choices you’ve made in parameterizing your GAM, random forest, or whatever, especially if you are using multiple tests and models. Reviewers will often have more questions about your manipulations than your methods section answers: What spatial projection did you use for your analysis? Was distance great-circle, metric grid, or just lat/long distance? Did you use a gamma distribution or Poisson? What kind of interpolation did you use, and how did you implement it?
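To make the distance question concrete, here is a minimal sketch in base R. The function names, the site coordinates, and the spherical-Earth radius of 6371 km are my own assumptions for illustration, not anything from a particular paper. It compares a great-circle (haversine) distance with a naive distance computed directly on lat/long degrees.

```r
# Great-circle (haversine) distance in kilometres between lon/lat points.
# Assumes a spherical Earth; radius_km is an illustrative default.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Naive "distance" that treats degrees of longitude and latitude as a flat grid.
degree_distance <- function(lon1, lat1, lon2, lat2) {
  sqrt((lon2 - lon1)^2 + (lat2 - lat1)^2)
}

# Two hypothetical pairs of sites, each one degree of longitude apart.
equator_km  <- haversine_km(0, 0, 1, 0)
northern_km <- haversine_km(0, 60, 1, 60)

equator_deg  <- degree_distance(0, 0, 1, 0)
northern_deg <- degree_distance(0, 60, 1, 60)
```

On the degree grid both pairs are exactly one unit apart, but the great-circle distance is about 111 km at the equator and only about 56 km at 60°N. A reviewer with the code in hand can see which of these you used without having to ask.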
There are lots of pluses to including code, but there’s always the possibility that some reviewer will be shifty enough to use your code in nefarious ways. Is it worth worrying about this? Probably not. I suspect the chances of this happening are low and, ultimately, they’re jerks, so they’ve already got to deal with that, and institutionally it’s just bad form. I think the bigger issue might be that there’s some crazy bug in your code that means your significant results aren’t actually significant. Would it suck if someone found it? Yes. Would it suck more than having to publish a retraction later? No.
What I’d like to do here is list some best practices for including your R code in a paper. Right now this list is pretty weak, and maybe there’s not much to it, but for the sake of argument:
- If you want results to be exactly equivalent, call set.seed() before using any sort of randomization. That way the results should be consistent between computers and users. If you don’t, and your analysis depends on some sort of randomization, then it is possible for the reviewer to obtain different results. That shouldn’t be a big deal if your results are robust, but you never know. Peng, in Biostatistics, asks only that results be within the bounds of numerical tolerance.
- Comment your code, write it clearly, break lines where necessary and use white space to make the code easier to read. Basically, follow the Google R Style Guide.
- Ensure that your data is included. If, for some reason, you can’t include your whole dataset, then either create a synthetic dataset that matches the properties of your dataset, or take a randomized subset of your dataset.
- Before you submit, run the code on its own, in a fresh R session, to make sure it works!
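As a rough sketch of the seeding and subsetting points above, the snippet below builds a made-up data frame (the column names, sizes, and seed value are all hypothetical), fixes the seed, and draws a randomized subset that could be shipped with a paper. Re-seeding and repeating the draw picks the same rows.

```r
# Hypothetical stand-in for a real dataset; the columns are invented for illustration.
set.seed(42)  # fix the random number generator so the draws below are repeatable
full_data <- data.frame(
  site  = seq_len(200),
  depth = runif(200, min = 0, max = 10),
  count = rpois(200, lambda = 5)
)

# Randomized subset to share when the whole dataset can't be included.
set.seed(42)
subset_rows <- sample(nrow(full_data), size = 50)
shared_data <- full_data[subset_rows, ]

# Re-seeding and repeating the draw selects exactly the same rows.
set.seed(42)
subset_again <- sample(nrow(full_data), size = 50)
stopifnot(identical(subset_rows, subset_again))
```

One caveat: the same seed is only guaranteed to give the same draws under the same random number settings. The default method used by sample() changed in R 3.6.0, so it is worth recording the output of sessionInfo() alongside the code. To check that the script runs on its own, something like Rscript --vanilla analysis.R (where analysis.R is a placeholder for your script) starts from a clean session.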
I’ll add to this list if you have any comments to add. If anyone’s seen a list like this, let me know. Sometimes searching for things about R is insanely frustrating!
What do I have stuck in my head:
- Love’s Gonna Pack Up – The Persuaders
- Thrasher – Neil Young, it makes me feel sad and blows me away at the same time. Is this one of his best songs? I never would have believed it when I was in High School, but I sure love it now.