How far do you go with coding in a collaborative project?

I’ve been coding in R since I started graduate school, and I’ve been coding in one way or another since I learned how to use Turing in high school.  I’m not an expert by any means, but I am proficient enough to do the kind of work I want to do, to know when I’m stuck and need to ask for help, and to occasionally over-reach and get completely lost in my own code.

I’ve been working hard to improve my coding practice, particularly focusing on making my code more elegant and on using public code as a teaching tool. In submitting our neotoma package paper to Open Quaternary, I struggled with two balancing acts: between making pretty figures and making easily reproducible examples, and between providing ‘research-quality’ case studies and having the target audience, generally non-programmers, turn off at the sight of walls of code.

There’s no doubt that the quality and interpretability of your code can have an impact on subsequent papers.  In 2012 a paper in PNAS showed that papers with more equations tend to be cited less than similar papers with fewer equations (Fawcett and Higginson, 2012).  I suspect the same pattern of citation holds for embedded code blocks, although how frequently code appears in papers outside specialized journals isn’t clear to me. It certainly didn’t hurt Eric Grimm’s CONISS paper (Grimm, 1987), which has been cited 1300+ times, but that may be the exception rather than the rule, particularly in paleoecology.

I’m currently working on a collaboration with researchers at several other Canadian universities. We’re using sets of spatio-temporal data, along with phylogenetic information across kingdoms to do some pretty cool research.  It’s also fairly complex computationally.  One of the first things I did in co-developing the code was to go through the code-base and re-write it to try to make it a bit more robust to bugs, and to try and modularize it.  This meant pulling repeated blocks of code into new functions, taking functions and putting them into their own files, generalizing some of the functions so that they could be re-used in multiple places, and generally spiffing things up with a healthy use of the plyr package.
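
To give a flavor of what that looked like (the data and names below are invented for illustration; the real project’s objects are considerably more complex), the change was mostly about replacing near-identical copied blocks with one generalized function:

```r
library(plyr)

# A toy stand-in for the real data; `pollen` and its columns are
# hypothetical.
pollen <- data.frame(
  site  = rep(c("A", "B"), each = 4),
  taxon = rep(c("Pinus", "Quercus"), times = 4),
  count = c(10, 3, 12, 5, 8, 9, 7, 11)
)

# Before the refactor, a block like the one inside this function was
# copied, with small edits, for every site.  Pulling it into a single
# generalized function means it gets written (and debugged) once.
taxon_means <- function(counts, group_vars = c("site", "taxon")) {
  ddply(counts, group_vars, summarise,
        mean_count = mean(count, na.rm = TRUE))
}

taxon_means(pollen)
```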

Figure 1. This is what coding machismo looks like on me.

This, I now realize, was probably something akin to statistical machismo (or maybe coding machismo). The use of coding ‘tricks’ limited the ability of my co-authors to understand and modify the code themselves.  It also meant that further changes to the code required significant investments of my time. They’re all great scientists, but they’re not native coders in the way that I am (not that I’m in the upper echelon myself).
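
As an invented example of the kind of ‘trick’ I mean, both versions below do the same thing, but only one of them reads easily if you don’t already live in R (`site_list` is hypothetical):

```r
# A named list of data.frames stands in for the real site data.
site_list <- list(
  marsh  = data.frame(count = rnorm(150)),
  upland = data.frame(count = rnorm(40))
)

# The 'machismo' version: compact, but opaque unless you already know
# sapply, anonymous functions and which().
big_sites <- names(which(sapply(site_list, function(x) nrow(x) > 100)))

# The collaborative version: the same result in plain steps that a
# co-author can read, check and modify.
big_sites <- c()
for (site_name in names(site_list)) {
  if (nrow(site_list[[site_name]]) > 100) {
    big_sites <- c(big_sites, site_name)
  }
}
```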

This has been a real learning experience.  Coding for research is not the same as coding for an industrial application.  The culture shift in the sciences toward an open-sharing model means that we are no longer doing this work just to get output that “works”; the code itself is an output.  Collaborative coding should mean that participants in the project are able to understand what the code does and how it works.  In many cases that means recognizing that collaborators are likely to have differing skill sets when it comes to coding, and that those skill sets need to be respected.

In my case, going ahead and re-writing swaths of code certainly reduced the size of the code-base; it meant that, in general, things ran more smoothly and that changes to the code could be accomplished relatively easily.  It also meant that I was the only one who could easily make those changes.  That is not good collaborative practice, at least at the outset.

Figure 2. A flowchart showing the relationships between classes in the neotoma package.

Having said that, there are lots of good reasons why good coding practice can be beneficial, even if some collaborators can’t immediately work through changes.  It’s partly a matter of providing road maps, something that is rarely done.  Good commenting is useful, but lately I’ve been leaning toward mapping an application with flowcharts.  It’s not pretty, but the diagram I drew up for our neotoma package paper (Goring et al., submitted) helped me debug, clean up syntax errors, and gave me an overview I didn’t really have until I drafted it.
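
Short of a full flowchart, even a plain-comment road map at the top of the main script can serve the same purpose. A minimal sketch (all of these file names are invented):

```r
# ---------------------------------------------------------------
# analysis_main.R -- road map for this code-base
#
#   get_data.R      - pulls raw records, returns a list of
#                     data.frames, one per site
#   clean_data.R    - drops incomplete records and standardizes
#                     taxon names
#   build_models.R  - fits the models; slow, so results are
#                     cached to disk
#   make_figures.R  - all figures for the manuscript
#
# The stages run in this order; each one sources the one before.
# ---------------------------------------------------------------
```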

I’m working on the same kind of chart for the project I mentioned earlier, although it can be very complicated to do.  I’ve also been making an effort to clearly report changes I’ve made using git, so that we have a good record of what’s been done, and why. Certainly, it would have been easier to do all this in the first place, but I’ve learned my lesson. As in most research, a little bit of planning can go a long way.

Writing robust, readable code is an important step toward open and reproducible science, but we need to acknowledge that reproducibility should not be limited to expert coders.  Trade-offs are a fact of life in evolution, and yet some of us are unwilling to make trade-offs in our own scientific practice.  We are told constantly to pitch our public talks to the audience.  The same goes for collaboration: respect your peers’ relative abilities and ensure that they can understand the code, which may mean eschewing fancier tricks in favor of clearly outlined steps.

If you code in such a way that people can’t work with you, you are opening yourself up to more bugs, more work and possibly, more trouble down the line.  It’s a fine balance, and as quantitative tools such as R and Python become more a part of the lingua franca for grad students and early career researchers, it’s one we’re going to have to negotiate more often.
