Helping to fill the cloud from the bottom up.

Open data in the sciences is an aspirational goal, and one that I wholeheartedly agree with. The efforts of EarthCube (among others) to build an infrastructure of tools to help facilitate data curation and discovery in the Earth Sciences have been fundamental in moving this discussion forward in the geosciences, and at the most recent ESA meeting saw the development of a new section of the society dedicated to Open Science.

One of the big challenges to open science is that making data easily accessible and easily discoverable can be at odds with one another.  Making data “open” is as easy as posting it on a website, but making it discoverable is much more complex.  Borgman and colleagues (2007) very clearly lay out a critical barrier to data sharing in an excellent paper examining practices in “habitat ecology” (emphasis mine):

Also paradoxical is researchers’ awareness of extant metadata standards for reconciling, managing, and sharing their data, but their lack of use of such standards. The present dilemma is that few of the participating scientists see a need or ability to use others’ data, so they do not request data, they have no need to share their own data, they have no need for data standards, and no standardized data are available. . .

The issue, as laid out here, is that people know that metadata standards exist, but they’re not using them from the ground up because they’re not using other people’s data.  Granted this paper is now eight years old, but, for the vast majority of disciplinary researchers in the geosciences and biological sciences the extent of data re-use is most likely limited to using data tables from publications, if that. [a quick plug, if you’re in the geosciences, please take a few minutes to complete this survey on data sharing and infrastructure as part of the EarthCube Initiative]

So, for many people who are working with self-styled data formats, and metadata that is largely implicit (they’re the only ones who really understand their multi-sheet excel file), getting data into a new format (one that conforms to explicit metadata standards) can be formidable, especially if they’re dealing with a large number of data products coming out of their research.

Right now, for just a single paper I have ten different data products that need to have annotated metadata.  I’m fully aware that I need to do it, I know it’s important, but I’ve also got papers to review and write, analysis to finish, job applications to write, emails to send, etc., etc., etc., and while I understand that I can now get DOIs for my data products, it’s still not clear to me that it really means anything concrete in terms of advancement.

Don’t get me wrong, I am totally for open science, all my research is on GitHub, even partial papers, and I’m on board with data sharing.  My point here is that even for people who are interested in open science, correctly annotating data is still a barrier.

How do we address this problem? We have lots of tools that can help generate metadata, but many, if not all, of these are post hoc tools. We talk extensively, if colloquially, about the need to start metadata creation at the same time as we collect the data, but we don’t incentivise this process.  The only time people realize that metdata is important is at the end of their project, and by then they’ve got a new job to start, a new project to undertake, or they’ve left academia.

Making metadata creation a part of the research workflow is something I am working toward as part of the Neotoma project. Where metadata is a necessary component of the actual data analysis.   The Neotoma Paleoecological Database is a community curated database that contains sixteen different paleoecological proxies, ranging from water chemistry to pollen to diatoms to stable isotope data (see Pilaar Birch and Graham 2015). Neotoma has been used to study everything from modern patterns of plant diversity, rates of migration for plant and mammals, rates of change in community turnover through time, and species relationships to climate.  It acts as both a data repository and a research tool in and of itself.  A quick plug as well, the completion of a workshop this past week with the Science Education Resource Center at Carleton College in Minnesota has resulted in the development of teaching tools to help bring paleoecology into the classroom (more are on their way).

Neotoma has a database structure that includes a large amount of metadata.  Due in no small part to the activities of Eric Grimm, the metadata is highly curated, and, Tilia, a GUI tool for producing stratigraphic diagrams and age models from paleoecological data, is designed to store data in a format that is largely aligned with the Neotoma database structure.

In designing the neotoma package for R I’ve largely focused on its use as a tool to get data out of Neotoma, but the devastating closure of the Illinois State Museum by the Illinois Governor (link) has hastened the devolution of data curation for the database. The expansion of the database to include a greater number of paleoecological proxies has meant that a number of researchers have already become data stewards, checking to ensure completeness and accuracy before data is uploaded into the database.

Having the R package (and the Tilia GUI) act as a tool to get data in as well as out serves an important function, it acts as a step to enhance the benefits of proper data curation immediately after (or even during) data generation because the data structures in the applications are so closely aligned with the actual database structure.

We are improving this data/metadata synergy in two ways:

  1. Data structures: The data structures within the R package (detailed in our Open Quaternary paper) remain parallel to the database.  We’re also working with Mark Uhen, Shanan Peters and others at the Paleobiology Database (as part of this funded NSF EarthCube project) and, elsewhere, for example, the LiPD Project, which is itself proposing community data standards for paleoclimatic data (McKay and Emile-Geay, 2015).
  2. Workflow: Making paleoecological analysis easier through the use of the R package has the benefit of reducing the ultimate barrier to data upload.  This work is ongoing, but the goal is to ensure that by creating data objects in neotoma, data is already formatted correctly for upload to Neotoma, reducing the burden on Data Stewards and on the data generators themselves.

This is a community led initiative, although development is centralized (but open, anyone can contribute to the R package for example), the user base of Neotoma is broad, it contains data from over 6500 researchers, and data is contributed at a rate that continues to increase.  By working directly with the data generators we can help build a direct pipeline into “big data” tools for researchers that have traditionally been somewhere out on the long tail.

Jack Williams will be talking a bit more about our activities in this Middle Tail, and why it’s critical for the development of truly integrated cyberinfrastructure in the geosciences (the lessons are applicable to ecoinformatics as well) at GSA this year (I’ll be there too, so come by and check out our session: Paleoecological Patterns, Ecological Processes and Modeled Scenarios, where we’re putting the ecology back in ge(c)ology!).

Open Science, Reproducibility, Credit and Collaboration

I had the pleasure of going up to visit the Limnological Research Center (LRC) at the University of Minnesota this past week. It’s a pretty cool setup, and obviously something that we should all value very highly, both as a resource to help prepare and sample sediment cores, and as a repository for data. The LRC has more than 4000 individual cores, totaling over 13km of lacustrine and marine sediment. A key point here is that much of this sediment is still available to sample, but, this is still data in its rawest, unprocessed form. Continue reading Open Science, Reproducibility, Credit and Collaboration