Who is a Scientist – Reflections on #AAG2016

This is the first time I’ve really been to the American Association of Geographers meeting. Last year it was held in Chicago, which is really close to Madison, and I was invited to speak at a session called “The View from the Anthropocene” organized by two great Geographers from the University of Connecticut, Kate Johnson and Megan Hill, but I had the kids & really only spent the morning there.  I’m pretty sure there was another reason as well, but I can’t remember what it was.

It was great to see Kate & Megan again this year; both of them are doing really cool stuff (check out their web pages). It was also really great to see that the momentum behind the original idea was enough to re-tool the session into the Symposium on Physical Geography at this year’s AAG meeting in San Francisco, with Anne Chin of UC-Denver on the organizing committee, and a host of fantastic speakers.

My own presentation in the session focused on the Anthropocene, and its role as both a boundary (whether you want to define it as an Epoch sensu stricto or as a philosophical concept – I think that Stanley Finney & Lucy Edwards’ article in GSA Today nicely lays out the arguments) and a lens.  The second part of that equation (the lens) is a more diffuse point, but my argument is that the major changes we see in the earth system can impact our ability to build models of the past using modern analogues, whether those be climatic or biological.  I show this using pollen and vegetation records from the Midwest, and make the connection to future projections with the example laid out in Matthes et al. (2015), where we show that the pre-industrial climate niches of plant functional types used in GCMs as part of the CMIP5 intercomparison are not better than random when compared to actual “pre-settlement” vegetation in the northeastern United States.

But I really want to talk about a single slide in my talk.  In the early part of my talk I use this slide:


This is Margaret Davis, one of the most important paleoecologists in North America, past-president of the ESA [PDF], and, importantly, a scientist who thought deeply about our past, our present and our future.  There’s no doubt she should be on the slide.  She is a critical piece of our cultural heritage as scientists, and because of her research, is uniquely well suited to show up in a slide focusing on the Anthropocene.

But it’s political too.  I put Margaret Davis up there because she’s an important scientist, but I also chose her because she’s an important female scientist. People specifically commented on the fact that I chose a female scientist, because it’s political.  It shouldn’t be.  There should be no need for me to pick someone because of their gender, and there should be no reason to comment on the fact that she was a female scientist.  It should just “be”.

Personal actions should be the manifestation of one’s political beliefs, but so much of our day to day life passes by without contemplation.  Susanne Moser, later in my session, talked about the psychological change necessary to bring society around to the task of reducing CO2, of turning around the Anthropocene, or surviving it, and I think that the un-examined life is a critical part of the problem.  If we fail to take account of how our choices affect others, or affect society, then we are going to confront an ugly future.

Everything is a choice, and the choices we make should reflect the world we want for ourselves and for the next generations. If our choices go un-examined then we wind up with the status quo.  We wind up with unbalanced panels, continued declines in under-represented minority participation in the physical sciences, and an erosion of our public institutions.

This post is maybe self-serving, but it shouldn’t have to be.  We shouldn’t have to look to people like DN Lee, the authors of Tenure She Wrote, Chanda Prescod-Weinstein, Terry McGlynn, Margaret Kosmala, Oliver Keyes, Jacquelyn Gill and so many others who advocate for change within the academic system, often penalizing themselves in the process.  We should be able to look to ourselves.

Okay, enough soap-boxing. Change yourselves.

Semantics Shememantics

In science we work, more often than not, in teams.  Whether we work with one other individual, five individuals, or interact at workshops with hundreds of strangers, it’s important that we are clearly understood.  Clarity is critical, especially when explaining complex concepts.  KISS is my second favorite acronym, even if I can’t keep to the principle (NEFLIS, a camping acronym, is my favorite – No Excuse for Living in Squalor just because you’re out in the woods).

A recently funded project I’m working on, under the aegis of EarthCube, is the harmonization of the Neotoma Paleoecological Database and the Paleobiology Database. Neotoma is a database of Quaternary fossils (mammals and microfossils such as pollen and ostracodes), and the Paleobiology Database is a database of every other kind of fossil. Both are incredible repositories for their respective communities, and powerful research tools in their own right.  My Distinguished Lecture talk at the University of Wisconsin’s Rebecca J. Holz Research Data Symposium was about the role of Community Databases in connecting researchers to Big Data tools, while getting their data into a curated form so that others could easily access and transform their data to undertake innovative research projects.

Superman Card Game by Whitman (1978) - G by andertoons, on Flickr
Figure 1. Semantic differences can be kryptonite for a project.  Especially a project that has very short arms relative to the rest of its body like Superman does in this picture. [ credit: andertoons ]

Our recent joint Neotoma-PBDB workshop, in Madison WI, showed me that, even with such closely allied data and disciplinary backgrounds, semantics matter.  We spent the first morning of the meeting having a back and forth discussion, where it kept seeming like we agreed on core concepts, but then, as the conversations progressed, we’d fall back into some sort of confusion.  As it began to seem unproductive we stepped back and checked in to see if we really were agreeing on core concepts.

While both databases contain fossil data, there is a fundamental difference in how the data are collected.  Typically, Paleobiology Database data are collected in isolation: a fossil whale, discovered & reported, is more common than a vertical stratigraphic survey on a single outcrop at a specific latitude and longitude.  In Neotoma, so much of our data comes from lake sediment cores that it makes sense that much of our data (and data collection) is described from stratigraphic sequences.

This difference may not seem like much, especially when the Holocene (the last 11,700 years) is basically an error term in much of the Paleobiology Database, but it’s enough of a difference that we were apparently agreeing on much, but then, inexplicably, disagreeing on followup discussions.

This is one of the fundamental problems in interdisciplinary research. Interdisciplinarity is as much understanding the terms another discipline uses as it is understanding the underlying philosophy and application of those terms in scientific discourse.  Learning to navigate these differences is time consuming, and requires a skill set that many of us don’t have.  At the very least, recognizing that this is a problem, and learning to address it, is a skill that is difficult to master.  It took us a morning of somewhat productive discussion before we really looked at the problem.  Once it was addressed, we kept going back to our draft semantic outline to make sure we knew what we were talking about when discussing each other’s data.
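To make the difference between the two collection models a bit more concrete, here is a minimal sketch in Python. Every class and field name in it is hypothetical and chosen for illustration only; none of it is the actual schema of Neotoma or the Paleobiology Database. It contrasts an isolated occurrence with a stratigraphic sequence, and shows what is lost when one is flattened into the other.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Occurrence:
    """One isolated fossil record: a taxon at a place, with an age range."""
    taxon: str
    latitude: float
    longitude: float
    age_older: float    # years before present
    age_younger: float  # years before present


@dataclass
class Sample:
    """One level in a core: a depth, an age estimate, and taxon counts."""
    depth_cm: float
    age: float  # years before present
    counts: Dict[str, int] = field(default_factory=dict)


@dataclass
class StratigraphicSequence:
    """An ordered set of samples from a single site, e.g. a lake core."""
    site_name: str
    latitude: float
    longitude: float
    samples: List[Sample] = field(default_factory=list)


def sequence_to_occurrences(seq: StratigraphicSequence) -> List[Occurrence]:
    """Flatten a sequence into isolated occurrences.

    Depth, counts and the link between samples are dropped here, which is
    the information a core-based community tends to treat as essential.
    """
    occurrences = []
    for sample in seq.samples:
        for taxon, count in sample.counts.items():
            if count > 0:  # a zero count is an absence, not an occurrence
                occurrences.append(
                    Occurrence(taxon, seq.latitude, seq.longitude,
                               age_older=sample.age, age_younger=sample.age)
                )
    return occurrences


# Illustrative data only: an invented two-sample core.
core = StratigraphicSequence("Example Lake", 43.07, -89.40, [
    Sample(10.0, 150.0, {"Pinus": 42, "Quercus": 0}),
    Sample(20.0, 900.0, {"Pinus": 30, "Quercus": 12}),
])
records = sequence_to_occurrences(core)
```

The conversion only runs cleanly in one direction: going from occurrences back to a sequence would require depth and ordering information that an isolated record never had, which is roughly the kind of mismatch that kept tripping up the discussion.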

This is all to say, we had a great workshop and I’m really looking forward to the developments on the horizon.  The PBDB development team is great (and so is the Neotoma team) and I think it’s going to be a very fruitful collaboration.

Helping to fill the cloud from the bottom up.

Open data in the sciences is an aspirational goal, and one that I wholeheartedly agree with. The efforts of EarthCube (among others) to build an infrastructure of tools to help facilitate data curation and discovery in the Earth Sciences have been fundamental in moving this discussion forward in the geosciences, and the most recent ESA meeting saw the development of a new section of the society dedicated to Open Science.

One of the big challenges to open science is that making data easily accessible and easily discoverable can be at odds with one another.  Making data “open” is as easy as posting it on a website, but making it discoverable is much more complex.  Borgman and colleagues (2007) very clearly lay out a critical barrier to data sharing in an excellent paper examining practices in “habitat ecology” (emphasis mine):

Also paradoxical is researchers’ awareness of extant metadata standards for reconciling, managing, and sharing their data, but their lack of use of such standards. The present dilemma is that few of the participating scientists see a need or ability to use others’ data, so they do not request data, they have no need to share their own data, they have no need for data standards, and no standardized data are available. . .

The issue, as laid out here, is that people know that metadata standards exist, but they’re not using them from the ground up because they’re not using other people’s data.  Granted this paper is now eight years old, but, for the vast majority of disciplinary researchers in the geosciences and biological sciences the extent of data re-use is most likely limited to using data tables from publications, if that. [a quick plug, if you’re in the geosciences, please take a few minutes to complete this survey on data sharing and infrastructure as part of the EarthCube Initiative]

So, for many people who are working with self-styled data formats, and metadata that is largely implicit (they’re the only ones who really understand their multi-sheet Excel file), getting data into a new format (one that conforms to explicit metadata standards) can be formidable, especially if they’re dealing with a large number of data products coming out of their research.

Right now, for just a single paper I have ten different data products that need to have annotated metadata.  I’m fully aware that I need to do it, I know it’s important, but I’ve also got papers to review and write, analysis to finish, job applications to write, emails to send, etc., etc., etc., and while I understand that I can now get DOIs for my data products, it’s still not clear to me that it really means anything concrete in terms of advancement.

Don’t get me wrong: I am totally for open science. All my research is on GitHub, even partial papers, and I’m on board with data sharing.  My point here is that even for people who are interested in open science, correctly annotating data is still a barrier.

How do we address this problem? We have lots of tools that can help generate metadata, but many, if not all, of these are post hoc tools. We talk extensively, if colloquially, about the need to start metadata creation at the same time as we collect the data, but we don’t incentivise this process.  The only time people realize that metadata is important is at the end of their project, and by then they’ve got a new job to start, a new project to undertake, or they’ve left academia.

Making metadata creation a part of the research workflow is something I am working toward as part of the Neotoma project, where metadata is a necessary component of the actual data analysis.  The Neotoma Paleoecological Database is a community curated database that contains sixteen different paleoecological proxies, ranging from water chemistry to pollen to diatoms to stable isotope data (see Pilaar Birch and Graham 2015). Neotoma has been used to study everything from modern patterns of plant diversity to rates of migration for plants and mammals, rates of change in community turnover through time, and species relationships to climate.  It acts as both a data repository and a research tool in and of itself.  A quick plug as well: the completion of a workshop this past week with the Science Education Resource Center at Carleton College in Minnesota has resulted in the development of teaching tools to help bring paleoecology into the classroom (more are on their way).

Neotoma has a database structure that includes a large amount of metadata.  Due in no small part to the activities of Eric Grimm, the metadata is highly curated, and Tilia, a GUI tool for producing stratigraphic diagrams and age models from paleoecological data, is designed to store data in a format that is largely aligned with the Neotoma database structure.

In designing the neotoma package for R I’ve largely focused on its use as a tool to get data out of Neotoma, but the devastating closure of the Illinois State Museum by the Illinois Governor (link) has hastened the devolution of data curation for the database. The expansion of the database to include a greater number of paleoecological proxies has meant that a number of researchers have already become data stewards, checking to ensure completeness and accuracy before data is uploaded into the database.

Having the R package (and the Tilia GUI) act as a tool to get data in as well as out serves an important function: it acts as a step to enhance the benefits of proper data curation immediately after (or even during) data generation, because the data structures in the applications are so closely aligned with the actual database structure.

We are improving this data/metadata synergy in two ways:

  1. Data structures: The data structures within the R package (detailed in our Open Quaternary paper) remain parallel to the database.  We’re also working with Mark Uhen, Shanan Peters and others at the Paleobiology Database (as part of this funded NSF EarthCube project) and with groups elsewhere, for example the LiPD Project, which is itself proposing community data standards for paleoclimatic data (McKay and Emile-Geay, 2015).
  2. Workflow: Making paleoecological analysis easier through the use of the R package has the benefit of reducing the ultimate barrier to data upload.  This work is ongoing, but the goal is to ensure that by creating data objects in neotoma, data is already formatted correctly for upload to Neotoma, reducing the burden on Data Stewards and on the data generators themselves.
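As a rough illustration of the workflow idea, the sketch below shows one way a data object could require its metadata at the moment it is created, so that anything that exists as an object is already in an upload-ready shape. It is written in Python rather than R, and the `Dataset` class, the `REQUIRED_METADATA` list and the field names are all assumptions made for this example; they are not the neotoma package's objects or Neotoma's actual required fields.

```python
from dataclasses import dataclass, field

# Hypothetical list of fields a data steward would otherwise have to chase up.
REQUIRED_METADATA = ("site_name", "latitude", "longitude",
                     "collector", "dataset_type")


@dataclass
class Dataset:
    """A data object that cannot be built without its core metadata."""
    metadata: dict
    samples: list = field(default_factory=list)

    def __post_init__(self):
        # Fail at creation time, while the researcher still has the answers,
        # rather than at upload time years later.
        missing = [key for key in REQUIRED_METADATA
                   if self.metadata.get(key) in (None, "")]
        if missing:
            raise ValueError("missing metadata: " + ", ".join(missing))

    def to_upload_record(self) -> dict:
        """Return a plain dictionary in the (assumed) upload layout."""
        return {"metadata": dict(self.metadata), "samples": list(self.samples)}


# Illustrative data only.
core = Dataset(
    metadata={"site_name": "Example Lake", "latitude": 43.07,
              "longitude": -89.40, "collector": "A. Researcher",
              "dataset_type": "pollen"},
    samples=[{"depth_cm": 10.0, "counts": {"Pinus": 42}}],
)
record = core.to_upload_record()
```

The design choice is simply that the check lives in the constructor: analysis code that works on `Dataset` objects can then only ever see data that already carries its metadata.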

This is a community-led initiative. Although development is centralized (but open: anyone can contribute to the R package, for example), the user base of Neotoma is broad. It contains data from over 6500 researchers, and data is contributed at a rate that continues to increase.  By working directly with the data generators we can help build a direct pipeline into “big data” tools for researchers that have traditionally been somewhere out on the long tail.

Jack Williams will be talking a bit more about our activities in this Middle Tail, and why it’s critical for the development of truly integrated cyberinfrastructure in the geosciences (the lessons are applicable to ecoinformatics as well) at GSA this year (I’ll be there too, so come by and check out our session: Paleoecological Patterns, Ecological Processes and Modeled Scenarios, where we’re putting the ecology back in ge(c)ology!).

PalEON has a video

In keeping with the theme of coring pictures I wanted to share PalEON’s new video, produced by the Environmental Change Initiative at the University of Notre Dame.  It does a good job of explaining what PalEON does and what we’re all about.  There’s also a nice sequence, starting at about 2:40, where you get to see a “frozen finger” corer in action.  We break up dry ice, create a slurry with alcohol and then drop it into the lake sediment.

Once the sediment has frozen to the sides of the corer (about 10 – 15 minutes) we bring the corer up and remove the slabs of ice from the sides, keeping track of position, dimensions and orientation so that we can piece it back together.  I’m on the platform with Jason McLachlan and Steve Jackson.

There’s a great section in there about the sociology of the PalEON project as well, although it’s brief.  So take a second and watch the video, it’s great!

EarthCube webinars and the challenges of cross-disciplinary Big Data.

EarthCube is a moon shot.  It’s an effort to bring communities broadly supported through the NSF Geosciences Directorate and the Division of Advanced Cyberinfrastructure together to create a framework that will allow us to understand our planet (and solar system) in space and in time using data and models generated across a spectrum of disciplines, and spanning scales of space and time.  A lofty goal, and a particularly complex one given the fragmentation of many disciplines, and the breadth of researchers who might be interested in participating in the overall project.

To help support and foster a sense of community around EarthCube the directorate has been sponsoring a series of webinars as part of a Research Coordination Network called “Collaboration and Cyberinfrastructure for Paleogeosciences”, or, more simply, C4P.  These webinars have been held every other Tuesday from 4 – 5pm Eastern, but are archived on the webinar website (here).

The Neotoma Paleoecological Database was featured as part of the first webinar.  Anders Noren talked about the cyberinfrastructure required to support LacCore’s operations, and Shanan Peters talked about an incredible text mining initiative (GeoDeepDive) in one of the later webinars.

Fig 1. The flagship for the Society for American Pedologists has run aground and it is now a sitting duck for the battle machine controlled by the Canadian Association for Palynologists in the third war of Data Semantics.

It’s been interesting to watch these talks and think about both how unique each of these paleo-cyberinfrastructure projects is and how much overlap there is in data structure, use, and tool development.  Much of the struggle for EarthCube is going to be developing a data interoperability structure and acceptable standards across disciplines.  In continuing to develop the neotoma package for R I’ve been struggling to understand how to make the data objects we pull from the Neotoma API interact well with standard R functions, and existing R packages for paleoecological data.  One of the key questions is how far do we go in developing our own tools before that tool development creates a closed ecosystem that cuts off outside development?  If I’m struggling with this question in one tiny disciplinary nook, imagine the struggle that is going to occur when geophysicists and paleobotanists get together with geochronologists and pedologists!
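One common way around the closed-ecosystem problem is to give every custom data object an exit into a plain table, so that general-purpose tools can take over. The sketch below shows that idea in Python, using only the standard library. The nested record layout (`site`, `samples`, `counts`) is invented for this example and is not the structure the Neotoma API or the neotoma package actually returns.

```python
import csv
import io

FIELDS = ["site", "depth_cm", "taxon", "count"]


def flatten_download(download: dict) -> list:
    """Turn one nested record into a list of flat row dictionaries."""
    site_name = download["site"]["name"]
    rows = []
    for sample in download["samples"]:
        for taxon, count in sample["counts"].items():
            rows.append({"site": site_name,
                         "depth_cm": sample["depth_cm"],
                         "taxon": taxon,
                         "count": count})
    return rows


def rows_to_csv(rows: list) -> str:
    """Write the rows with the standard csv module, one line per count."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative data only: a hypothetical nested record.
download = {
    "site": {"name": "Example Lake"},
    "samples": [
        {"depth_cm": 10.0, "counts": {"Pinus": 42, "Quercus": 3}},
        {"depth_cm": 20.0, "counts": {"Pinus": 30}},
    ],
}
rows = flatten_download(download)
table = rows_to_csv(rows)
```

Once the data is in long, one-row-per-observation form, nothing downstream needs to know about the original object, which is one way to keep custom tools from walling off outside development.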

Interoperability of these databases needs to be a key goal.  Imagine the possibilities if we could link modern biodiversity databases with Pleistocene databases such as Neotoma, and then to deep time databases like the Paleobiology Database in a seamless manner.  Big data has clearly arrived in some disciplines, but the challenge of creating big data across disciplines is just starting.

Paleo-Bloggers – A List

EDIT:  I’ve added some more blogs based on suggestions from the comments here and on twitter.  And I just keep adding more (Jan 23, 2013)!

Well, I’ve been thinking about doing a post like this for a while, a big list of paleobloggers. This post has been sitting in my drafts pile for a while and I wanted to get it moving forward with a bit of crowd sourcing. Besides, it’s hard to define what actually constitutes a “paleoecology” blog, given that blogs like The Rock Remains certainly deal with paleoecological concepts, are interested in the past and aren’t wholly about rocks, or this tiny slice at the surface of time, but these blogs aren’t exactly ecological. Of course it’s hard to define what’s ecological about downwithtime sometimes. (links below the fold)

Neotoma, an API and data archaeology. Also, some fun with R.

Image of a Neotoma packrat
Figure 1. The Neotoma project is named after the genus Neotoma, packrats, who collect plant materials to build their middens, which ultimately serve as excellent paleoecological resources, if you’re okay sifting through rodent urine.

The Neotoma database is a fantastic resource, and one that should be getting lots of use from paleoecologists.  Neotoma is led by the efforts of Eric Grimm, Allan Ashworth, Russ Graham, Steve Jackson and Jack Williams, and includes paleoecological data from the North American Pollen Database and FAUNMAP, along with other sources coming online soon.  As more and more scientific data sets are being developed, and as those data sets are updated more regularly, we’re also moving to an era where scientists increasingly need tools to obtain this new data as it is acquired.  This leads us to the use of APIs in our scientific workflows.  There are already R packages that take advantage of existing APIs: I’ve used ritis a number of times (part of the great ROpenSci project) to access species records in the ITIS database, there’s the twitteR package to take advantage of the Twitter API, and there are plenty of others.
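For readers who haven't used an API from a script before, the basic pattern is small: build a query URL from parameters, then parse the structured response. Here is a minimal sketch in Python. The base URL, the `sites` resource, the parameter names and the `success`/`data` response layout are all placeholders invented for this example; they are not the Neotoma API, and no network request is made.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/api/v1"  # placeholder, not a real service


def build_query(resource: str, **params) -> str:
    """Build a query URL; parameters are sorted so the URL is reproducible."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{resource}?{query}"


def parse_response(body: str) -> list:
    """Parse a JSON response body in the assumed {'success', 'data'} layout."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("request failed")
    return payload["data"]


url = build_query("sites", taxon="Picea", limit=10)

# A hand-written response standing in for what a server might send back.
fake_body = '{"success": true, "data": [{"site": "Example Lake"}]}'
sites = parse_response(fake_body)
```

Because the query is built from named parameters rather than pasted together by hand, the same script can be re-run later to pick up records added since the last download, which is the main appeal of an API for a data set that keeps growing.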
