I’ve just uploaded a version of a very large data set to our project wiki. We have been using dokuwiki to share our data, R code and ideas/meeting notes/biographies/general stuff and we’re getting to the point where a lot of data has been processed and finalized.
Going from ‘me’ code, which is messy, has digressions and is generally lightly commented, to ‘you’ code is a scary thing. Some of the data I use is bound by data sharing agreements, some of my collaborators aren’t useRs, and some of my collaborators are very good useRs (which is a bit daunting as well). So how do you balance user needs, code requirements, and reproducibility?
First, I really like the R style guide used by Google. Using a consistent and readable style makes comprehension easier for non-coders and coders alike. I make an effort to comment blocks succinctly but thoroughly, and to keep variable names sensible but not so long that they’re cumbersome. Here’s a more comprehensive post about writing clean code and avoiding coding problems.
The data sharing issue was interesting for me. I haven’t really had to deal with this before, and the biggest problem is this: how can other collaborators run the code if they can’t have access to the data? To deal with this we randomly subset the data, taking only 1% of the total data set (the fraction is arbitrary; it should be small enough to obfuscate the data while still retaining enough structure that your code will run). By doing this I don’t have to generate ‘fake’ data that has the same properties as the underlying data. Unfortunately it also means that some of the significance tests begin to fail. How big an issue this might be isn’t clear yet. My feeling is that if the code is sound, we can make the assumption that the hypothesis tests are sound.
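Here is a minimal sketch of that sub-sampling step in base R. The data frame, its column names, the seed and the output file name are all made up for illustration; they stand in for the real (restricted) data set, which I can't show here.

```r
# Hypothetical stand-in for the restricted data set (not the real data).
set.seed(42)  # fixed seed so the same subset can be drawn again
full_data <- data.frame(
  site_id  = seq_len(10000),
  richness = rpois(10000, lambda = 12),
  temp     = rnorm(10000, mean = 8, sd = 3)
)

# Fraction of rows to keep; 1% here, but the value is arbitrary.
sample_fraction <- 0.01
n_keep <- max(1, round(sample_fraction * nrow(full_data)))

# Draw row indices at random, without replacement, and keep those rows.
keep_rows   <- sample(nrow(full_data), size = n_keep)
sample_data <- full_data[keep_rows, ]

# Write the subset to a file that can be posted alongside the code.
sample_file <- file.path(tempdir(), "sample_data.csv")
write.csv(sample_data, sample_file, row.names = FALSE)
```

Setting the seed isn't strictly necessary, but it means the posted sample can be regenerated exactly if the file is ever lost.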
So what do I post? I post the code, the sample data sets and then the ‘correct’ outputs so that even though hypothesis tests may fail when collaborators run the code with the sub-sampled data, they would have access to the proper output tables.
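As a rough sketch of what producing the ‘correct’ outputs could look like: run the same analysis function on the full data and on the sample, then save the full-data table for posting. The simulated data, the linear model, the `fit_model` helper and the file name are all assumptions for illustration, not the actual project analysis.

```r
# Simulated stand-ins for the full data and its posted sub-sample.
set.seed(1)
full_data   <- data.frame(x = rnorm(5000))
full_data$y <- 0.05 * full_data$x + rnorm(5000)
sample_data <- full_data[sample(nrow(full_data), size = 50), ]

# Hypothetical analysis step: the same function is run on either data set
# and returns the coefficient table (estimate, SE, t value, p value).
fit_model <- function(dat) {
  coef(summary(lm(y ~ x, data = dat)))
}

full_output   <- fit_model(full_data)    # the 'correct' table, from the full data
sample_output <- fit_model(sample_data)  # what collaborators would get

# Save the full-data table so it can be posted with the code and the sample.
output_file <- file.path(tempdir(), "full_data_coefficients.csv")
write.csv(full_output, output_file)
```

Collaborators can then check that their table from the sample has the same layout as the posted one, even where the p values differ.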
Finally, each page on the dokuwiki follows the same format, so people know when the code was posted and can find some context about the code, a flow chart of the code, the data required and the data output.
So that’s what I do, what do you do?