Open Science, Reproducibility, Credit and Collaboration

I had the pleasure of going up to visit the Limnological Research Center (LRC) at the University of Minnesota this past week. It’s a pretty cool setup, and obviously something that we should all value very highly, both as a resource to help prepare and sample sediment cores, and as a repository for data. The LRC has more than 4000 individual cores, totaling over 13 km of lacustrine and marine sediment. A key point here is that much of this sediment is still available to sample, but it is data in its rawest, unprocessed form.

I’ve written before about open science and reproducibility. These two things obviously go hand in hand, but we as scientists navigate a tricky world. The NSF expects that data will be shared “within a reasonable time”, which is fairly open-ended. In practice this exhortation doesn’t always work. We’ve all heard about researchers who won’t share data (often for good reason), but equally there are stories of researchers who may have used data to which they have no right. In some cases this is resolved, but in others the results are not so clear cut. The fair use of data overlaps with authorship, good citizenship, fairness and a number of other issues in academia, and one person’s definition of what is fair is unlikely to be another’s.

The use of others’ data presents a fine line in a discipline that depends on the transmission of ideas and data. In a previous post I discussed the need for domain experts in “big data”, and in many ways, our ethics of data sharing are embedded both in respect for the data generator and in the need to understand the intricacies of data that is often noisy, quirky, idiomatic and dependent on the methods used to gather it. Open science, and reproducibility with it, then present a challenge for data generators, who are being asked to give up control of their data, and for data users, who must examine the ethics of their own data use.

Open science is a philosophy as much as it is an imperative. It is asking scientists to give up the unspoken reciprocal partnerships between data generators and those who use the data down the road. Previously our interactions on this front have been mediated by the need to directly interact with data generators since there were no central data repositories for many of our data needs. By moving data to central repositories, and by opening up data sets and the methods of analysis through all-inclusive data supplements, we take a chance in submitting our data, the chance that users will simply take the data without reciprocation.

Figure 1. Under-sampled climate space in British Columbia with respect to the BC Modern Pollen Database. The regions that are under-sampled (darker) are not the regions that are spatially under-sampled because of high regional climatic heterogeneity.

What I’m saying isn’t that reciprocation is necessary, but that it is a potential benefit to data generators that we may be losing in the move to central repositories. Data generation is costly, high (or higher) risk, and often slow, but it is critical to moving macro-scale research forward and to finding the teleconnections between ecosystems that can help push science forward. The other issue is that we often find that the primary data papers get cited much less often than the large-scale synthesis papers that use them. How this might affect funding in the future is unclear, but it should be obvious that data synthesis may begin to pull funds from the very researchers who generate the data. One solution is to use synthesis data sets to identify gaps in existing data networks. This is what we did in Goring et al. (2009): in developing a modern pollen data set for British Columbia, we also used the data set to identify locations that are under-sampled with respect to the regional climate (Figure 1). The hope is that this sort of work can help motivate future grant proposals, by identifying high-value targets and showing that the need has been addressed in the published literature.
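The gap-finding idea above — flagging parts of *climate* space, rather than geographic space, where few modern samples exist — can be sketched generically. This is a minimal illustration of the general approach, not the actual method of Goring et al. (2009); the toy data, bin counts and threshold are all assumptions made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each row is (mean annual temperature, annual precipitation).
# "region" stands in for every grid cell in the study area; "samples"
# are the locations that actually have a modern pollen sample.
region = rng.normal(loc=[8.0, 1200.0], scale=[4.0, 400.0], size=(5000, 2))
samples = rng.normal(loc=[9.0, 1100.0], scale=[2.0, 200.0], size=(200, 2))

# Bin climate space, then compare how often each bin occurs across the
# landscape versus among the sampled sites.  Bins that are common on the
# landscape but rare among samples are under-sampled climate space.
t_edges = np.linspace(region[:, 0].min(), region[:, 0].max(), 11)
p_edges = np.linspace(region[:, 1].min(), region[:, 1].max(), 11)

region_hist, _, _ = np.histogram2d(region[:, 0], region[:, 1],
                                   bins=[t_edges, p_edges])
sample_hist, _, _ = np.histogram2d(samples[:, 0], samples[:, 1],
                                   bins=[t_edges, p_edges])

# Under-sampling score: landscape frequency minus sample frequency.
# Positive values mark climates that exist but are poorly sampled.
score = region_hist / region_hist.sum() - sample_hist / sample_hist.sum()
n_gaps = int((score > 0.01).sum())
print(f"{n_gaps} climate-space bins look under-sampled")
```

In practice one would use real climate surfaces (e.g. interpolated temperature and precipitation grids) in place of the toy draws, and the map of positive scores, projected back into geographic space, becomes the kind of figure shown above.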

In releasing the neotoma package for R (on figshare and GitHub) we have linked open science methods (reproducible research) with a community of data generators who have contributed their data over the course of nearly 30 years to a variety of community-supported repositories, and ultimately to the Neotoma Paleoecological Database. Users can now draw data from across the globe in seconds. Compare this to the situation in the 1940s, when paleoecology researchers in North America weren’t even sure whether their colleagues were still alive after the Second World War (detailed in the Pollen Analysis Circular). Data exchange at the time relied on formal, personal relationships.
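To give a feel for what “draw data in seconds” means outside R: the Neotoma database also exposes a public web API, and a query can be assembled programmatically. This is only a sketch — the endpoint path and parameter names below are assumptions and should be checked against the current Neotoma API documentation rather than taken as the real interface:

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- illustrative only; verify the real path and
# query fields in the Neotoma API documentation before using.
BASE = "https://api.neotomadb.org/v2.0/data/datasets"

def build_query(datasettype, altmin=None, altmax=None):
    """Assemble a dataset-search URL from a few keyword filters."""
    params = {"datasettype": datasettype}
    if altmin is not None:
        params["altmin"] = altmin
    if altmax is not None:
        params["altmax"] = altmax
    return BASE + "?" + urlencode(params)

# A search for pollen datasets between 0 and 500 m elevation:
url = build_query("pollen", altmin=0, altmax=500)
print(url)
```

The point is less the specific call than the shift it represents: a request that once required a letter to a colleague is now a single line against a shared repository.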

Interestingly, we see the rise of collaborative research in the modern era, but I wonder whether this rise is the result of increasing collaboration between data generators and synthesizers, or whether we see a split, with those generating the data remaining as data generators, and those synthesizing relying increasingly on large databases. I don’t have an answer to this, but I wonder if data generators are suffering from a bit of the Matthew Effect, whereby individual data sets are losing out to synthesis work in terms of recognition and impact.

Ultimately, academics are social creatures regulated by social norms, and many of those norms are in a state of upheaval. Increasing funding pressure, the demands of meeting proposal goals and the new frontier of open science all mean that our conventions will keep shifting for some time. The discussions that have gone on in the literature and on the web have been exciting and fruitful, but I still worry that open science is going to have repercussions we haven’t anticipated, and that they’re going to be felt most by the primary data generators.

3 thoughts on “Open Science, Reproducibility, Credit and Collaboration”

  1. Hi Simon,
    Several comments/observations that strike me.

    The use of others’ data presents a fine line in a discipline that depends on the transmission of ideas and data. In a previous post I discussed the need for domain experts in “big data”, and in many ways, our ethics of data sharing are embedded both in respect for the data generator and in the need to understand the intricacies of data that is often noisy, quirky, idiomatic and dependent on the methods used to gather it. Open science, and reproducibility with it, then present a challenge for data generators, who are being asked to give up control of their data, and for data users, who must examine the ethics of their own data use.

    OK. I hear this all the time from people who are unwilling to share their data or want honorary authorship in exchange for sharing their data. Sure, I get that the data can be messy, and there is room for misinterpretation. But this line of argument is neither compelling nor productive. When researchers write papers based on such messy data, they do their best to present the facts as clearly as possible. But a reader can still misinterpret the paper and cite it in a completely incorrect context. Do the original authors need to sign off on every citation so they can ensure that the findings were understood correctly? If not, then why so with their data?

    I agreed with your post on domain expertise in the era of big data. But shouldn’t someone in the same domain be trusted to figure out what the data means and use it correctly?

    What I’m saying isn’t that reciprocation is necessary, but that it is a potential benefit to data generators that we may be losing in the move to central repositories. Data generation is costly, high (or higher) risk, and often slow, but it is critical to moving macro-scale research forward and to finding the teleconnections between ecosystems that can help push science forward. The other issue is that we often find that the primary data papers get cited much less often than the large-scale synthesis papers that use them.

    Also a little confused here. Are there people engaged in data generation as a sole activity but don’t actually write papers about those data?
    Museums, for example, can be classified as data generators. They digitize specimens for the research community and as more people use their data, it gives them reason to continue maintaining and justifying funding for such resources. In this case, this is a win-win.

    Researchers like us also generate data. And yes, I agree that it’s an expensive, time-consuming, and risky process. But I get a first crack at my data, and can use embargoes to ensure I get all my papers out before it becomes open to the larger community. Getting upset that my data was used in some fancy synthesis published in Nature, and that the prestige is not the same as a data citation, also doesn’t seem right to me.

    “How this might affect funding in the future is obviously unclear, but it should be obvious that data synthesis may begin to pull funds from the very researchers who generate the data.”

    So the subtext here again (correct me if I’m wrong) is again authorship in exchange for data. Because the awesome data I collected is now part of some big synthesis and I didn’t get a piece of the pie. Not sure how this can be called the Matthew effect.

    But anyway, I’m just trying to articulate my thoughts here. I certainly understand that going from zero to fully open is not easy. And when several stakeholders are involved one has to tread lightly.

  2. Hi Karthik, thanks for commenting. Sorry for the delay. I appreciate that you recognize that some of this is simply a matter of “going from zero to open”, and I should clarify that I’m not advocating for all data providers to be co-authors, or even directly cited, particularly if a large volume of data is being synthesized. In our Goring et al. 2012 paper we used all the cores from the Neotoma Database in the NE United States, making citing each original paper unmanageable.

    What I was trying to outline here was a shift in the way that scientists interact, where the database takes up the role of direct contact between individuals, and I wanted to try to consider some of the ways in which that shift might change aspects of our scientific culture. It got tied up in issues of authorship and credit, but they’re issues that affect data publication and use. I do think that these days synthesis papers are thought of as “sexier” in the sciences (but I have nothing to back that up), which means that the Matthew Effect can be justifiably invoked, but I think you identify a trend by which labs are beginning to straddle spatio-temporal scales in terms of the questions they are asking, so they are both data generators and data synthesizers.

    As for the point about messy data, well, it’s certainly tempting to take a large database and do all sorts of crazy analysis, and I’ve reviewed papers where people have done just that, without the expertise needed. One side effect will be that some papers make it through the review process without due care, but that should eventually take care of itself as we adjust to a new, open way of doing things.

  3. Pingback: Is pollen analysis dead? Paleoecology in the era of Big Data | The Contemplative Mammoth
