Semantics Shememantics

In science we work, more often than not, in teams.  Whether we work with one other individual, five individuals, or interact at workshops with hundreds of strangers, it’s important that we are clearly understood.  Clarity is critical, especially when explaining complex concepts.  KISS is my second favorite acronym, even if I can’t keep to the principle (NEFLIS, a camping acronym, is my favorite – No Excuse for Living in Squalor just because you’re out in the woods).

A recently funded project I’m working on, under the aegis of EarthCube, is the harmonization of the Neotoma Paleoecological Database and the Paleobiology Database. Neotoma is a database of Quaternary fossils (mammals and microfossils such as pollen and ostracodes), and the Paleobiology Database is a database of every other kind of fossil. Both are incredible repositories for their respective communities, and powerful research tools in their own right.  My Distinguished Lecture talk at the University of Wisconsin’s Rebecca J. Holz Research Data Symposium was about the role of Community Databases in connecting researchers to Big Data tools, while getting their data into a curated form so that others could easily access and transform their data to undertake innovative research projects.

Superman Card Game by Whitman (1978) - G by andertoons, on Flickr
Figure 1. Semantic differences can be kryptonite for a project.  Especially a project that has very short arms relative to the rest of its body like Superman does in this picture. [ credit: andertoons ]
Our recent joint Neotoma-PBDB workshop, in Madison WI, showed me that, even with such closely allied data and disciplinary backgrounds, semantics matter.  We spent the first morning of the meeting having a back and forth discussion, where it kept seeming like we agreed on core concepts, but then, as the conversations progressed, we’d fall back into some sort of confusion.  As it began to seem unproductive we stepped back and checked in to see if we really were agreeing on core concepts.

While both databases contain fossil data, there is a fundamental difference in how the data are collected.  Paleobiology Database data are typically collected in isolation: a single fossil whale, discovered and reported, is more common than a vertical stratigraphic survey of a single outcrop at a specific latitude and longitude.  In Neotoma, so much of our data comes from lake sediment cores that it makes sense that much of our data (and data collection) is described in terms of stratigraphic sequences.

This difference may not seem like much, especially when the Holocene (the last 11,700 years) is basically an error term in much of the Paleobiology Database, but it was enough that we would appear to agree on a great deal and then, inexplicably, disagree in follow-up discussions.

This is one of the fundamental problems in interdisciplinary research. Interdisciplinarity is as much about understanding the terms another discipline uses as it is about understanding the underlying philosophy and application of those terms in scientific discourse.  Learning to navigate these differences is time consuming, and requires a skill set that many of us don’t have.  Even recognizing that there is a problem, let alone learning to address it, is a skill that is difficult to master.  It took us a morning of somewhat productive discussion before we really looked at the problem.  Once we had addressed it, we kept going back to our draft semantic outline to make sure we knew what we were talking about when discussing each other’s data.

This is all to say, we had a great workshop and I’m really looking forward to the developments on the horizon.  The PBDB development team is great (and so is the Neotoma team) and I think it’s going to be a very fruitful collaboration.

Macrosystems Ecology: The more we know the less we know.

Dynamic Ecology had a post recently asking why there wasn’t an Ecology Blogosphere. One of the answers was simply that as ecologists we often recognize the depth of knowledge of our peers and, as such, are unlikely (or unwilling) to comment in an area in which we have little expertise. This is an important point. I often feel like the longer I stay in academia the more I am surprised when I can explain a concept outside my (fairly broad) subject area clearly and concisely.  It surprises me that I have depth of knowledge in a subject that I don’t directly study.

Of course, it makes sense.  We are constantly exposed to ideas outside our disciplines in seminars, papers, on blogs & twitter, and in general discussions, but at the same time we are also exposed to people with years of intense disciplinary knowledge, who understand the subtleties and implications of their arguments.  This is exciting and frightening.  The more we know about a subject, the more we know what we don’t know.  Plus, we’re trained to listen to other people.  We ‘grew up’ academically under the guidance of others, who often had to correct us, so when we get corrected outside of our disciplines we are likely to defer, rather than fight.

This speaks to a broader issue though, and one that is addressed in the latest issue of Frontiers in Ecology and the Environment.  The challenges of global change require us to come out of our disciplinary shells and to address challenges with a new approach, defined here as Macrosystems Ecology.  At large spatial and temporal scales – the kinds of scales at which we experience life – ecosystems cease being disciplinary.  Jim Heffernan and Pat Soranno, in the lead paper (Heffernan et al., 2014) detail three ecological systems that can’t be understood without cross-scale synthesis using multi-disciplinary teams.

Figure 1. From Heffernan et al. (2014), multiple scales and disciplines interact to explain patterns of change in the Amazon basin.

The Amazonian rain forest is a perfect example of a region that is imperiled by global change, and can benefit from a Macrosystems approach.  Climate change and anthropogenic land use drive vegetation change, but vegetation change also drives climate (and, ultimately, land use decisions). This is further compounded by teleconnections related to societal demand for agricultural products around the world and the regional political climate.  To understand and address ecological problems in this region, then, we need to understand cross-scale phenomena in ecology, climatology, physical geography, human geography, economics and political science.

Macrosystems proposes a cross-scale effort, linking disciplines through common questions to examine how systems operate at regional to continental scales, and at multiple temporal scales.  These problems are necessarily complex, but by bringing together researchers in multiple disciplines we can begin to develop a more complete understanding of broad-scale ecological systems.

Interdisciplinary research is not something that many of us have trained for as ecologists (or biogeographers, or paleoecologists, or physical geographers. . . but that’s another post).  It is a complex, inter-personal interaction that requires understanding of the cultural norms within other disciplines.  Cheruvelil et al. (2014) do a great job of describing how to achieve and maintain high-functioning teams in large interdisciplinary projects, and Kendra also discusses this further in a post on her own academic blog.

Figure 2. From Goring et al. (2014). Interdisciplinary research requires effort in a number of different areas, and these efforts are not recognized under traditional reward structures.

In Goring et al. (2014) we discuss a peculiar issue that is posed by interdisciplinary research.  The reward system in academia is largely structured to favor disciplinary research.  We refer to this in our paper as a disciplinary silo.  You are in a department of X, you publish in the Journal of X, you go to the International Congress of X and you submit grant requests to the X Program of your funding agency.  All of these pathways are rewarded, and even though we often claim that teaching and broader outreach are important, they are important inasmuch as you need to not screw them up completely (a generalization, but one I’ve heard often enough).

As we move towards greater interdisciplinarity we begin to recognize that simply superimposing the traditional rewards structure onto interdisciplinary projects (Figure 2) leaves a lot to be desired.  This is particularly critical for early-career researchers.  We are asking these researchers (people like me) to collaborate broadly with researchers around the globe, to tackle complex issues in global change ecology, but, when it comes time to assess their research productivity we don’t account for the added burden that interdisciplinary research can require of a researcher.

Now, I admit, this is self-serving.  As an early-career researcher, and member of a large interdisciplinary team (PalEON), much of what we propose in Goring et al. (2014) strongly reflects my own personal experience.  Outreach activities, the complexities of dealing with multiple data sources, large multi-authored papers, posters and talks, and the coordination of researchers across disciplines are all realities for me, and for others in the project, but ultimately, we get evaluated on grants and papers.  The interdisciplinary model of research requires effort that never gets valued by hiring or tenure committees.

That’s not to say that hiring committees don’t consider this complexity, and I know they’re not just looking for Nature and Science papers, but at the same time, there is a new landscape for researchers out there, and we’re trying to evaluate them with an old map.

In Goring et al. (2014) we propose a broader set of metrics against which to evaluate members of large interdisciplinary teams (or small teams, there’s no reason to be picky).  This list of new metrics (here) includes traditional metrics (numbers of papers, size of grants), but expands the value of co-authorship, recognizing that only one person is first in the authorship list, even if people make critical contributions; provides support for non-disciplinary outputs, like policy reports, dataset generation, non-disciplinary research products (white papers, books) and the creation of tools and teaching materials; and adds value to qualitative contributions, such as facilitation roles, helping people communicate or interact across disciplinary divides.

This was an exciting set of papers to be involved with, all arising from two meetings associated with the NSF Macrosystems Biology program (part of NSF BIO’s Emerging Frontiers program).  I was lucky enough to attend both meetings, the first in Boulder CO, the second in Washington DC.  These are the kinds of meetings that are formative for early-career researchers, and as a post-doctoral researcher I clearly got a lot out of them.  The Macrosystems Biology program is funding some very exciting programs, and this Frontiers issue attempts to get to the heart of the Macrosystems approach.  It is the result of many hours and days of discussion, and many of the projects are already coming to fruition.  It is an exciting time to be an early-career researcher.  Hopefully you agree!

Sometimes saving time in a bottle isn’t the best idea.

As an academic you have the advantage of meeting people who do some really amazing research.  You also have the advantage of doing really interesting stuff yourself, but you also tend to spend a lot of time thinking about very obscure things.  Things that few other people are also thinking about, and those few people tend to be spread out across the globe.  I had the opportunity to join researchers from around the world at Queen’s University in Belfast, Northern Ireland earlier this month for a meeting about age-depth models, a meeting about how we think about time, and how we use it in our research.

Time is something that paleoecologists tend to think about a lot.  With the Neotoma paleoecological database time is a critical component.  It is how we arrange all the paleoecological data.  From the Neotoma Explorer you can search and plot out mammal fossils at any time in the recent (last 100,000 years or so) past, but what if our fundamental concept of time changes?

Figure 1. The particle accelerator in Belfast uses magnets to accelerate Carbon particles to 5 million km/h.

Most of the ages in Neotoma are relative.  They are derived from radiocarbon data, either directly, or within a chronology built from several radiocarbon dates (Margo Saher has a post about 14C dating here), which means that there is uncertainty around the ages that we assign to each pollen sample, mammoth bone or plant fossil.  To actually get a radiocarbon date you first need to send a sample of organic material out to a lab (such as the Queen’s University, Belfast Radiocarbon Lab).  The samples at the radiocarbon lab are processed and put in an Accelerator Mass Spectrometer (Figure 1) where carbon atoms reach speeds of millions of miles an hour, hurtling through a massive magnet, and are then counted, one at a time.

These counts are used to provide an estimate of age in radiocarbon years.  We then use the IntCal curve to relate radiocarbon ages to calendar ages.  This calibration curve relates absolutely dated material (such as tree rings) to their radiocarbon ages.  We need the IntCal curve since the generation of radiocarbon (14C) in the atmosphere changes over time, so there isn’t a 1:1 relationship between radiocarbon ages and calendar ages.  Radiocarbon (14C) 0 is actually 1950 (associated with atmospheric atomic bomb testing), and by the time you get back to 10,000 14C years ago, the calendar date is about 1,700 years ahead of the radiocarbon age (i.e., 10,000 14C years is equivalent to 11,700 calendar years before present).
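The calibration step described above can be sketched in a few lines of Python.  This is only an illustration: `toy_curve` is a hypothetical stand-in for IntCal (a straight line through the one pairing quoted above, 10,000 14C years at 11,700 calendar years), and the constant curve uncertainty of 20 years is an assumption, not a real IntCal value.  The point is just to show how a radiocarbon age and its error become a probability for each candidate calendar age.

```python
import math


def toy_curve(cal_age):
    """Hypothetical calibration curve, NOT IntCal.

    A straight line through (0 cal, 0 14C) and (11700 cal, 10000 14C),
    with an assumed constant curve uncertainty of 20 14C years.
    """
    mean_c14 = cal_age * 10000.0 / 11700.0
    curve_sd = 20.0
    return mean_c14, curve_sd


def calibrate(c14_age, c14_sd, cal_grid):
    """Return normalized probabilities for each calendar age in cal_grid."""
    densities = []
    for cal_age in cal_grid:
        curve_mean, curve_sd = toy_curve(cal_age)
        # Combine measurement error and curve error (assumed independent).
        variance = c14_sd ** 2 + curve_sd ** 2
        density = math.exp(-0.5 * (c14_age - curve_mean) ** 2 / variance)
        densities.append(density / math.sqrt(2 * math.pi * variance))
    total = sum(densities)
    return [d / total for d in densities]


# Candidate calendar ages (years before present), every 10 years.
cal_grid = list(range(11000, 12401, 10))
probs = calibrate(10000, 50, cal_grid)
most_likely = cal_grid[probs.index(max(probs))]
print(most_likely)
```

With the real IntCal curve the wiggles in the curve mean that a single radiocarbon age can map to several separate ranges of calendar ages, which is why calibrated ages are reported as probability distributions rather than a single number.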

Figure 2. A radiocarbon age estimate (in 14C years; the pink, normally distributed curve) intercepts the IntCal curve (blue ribbon). The probability density associated with this intercept builds the age estimate for that sample, in calendar years.

To build a model of age and depth within a pollen core, we link radiocarbon dates to the IntCal curve (calibration) and then link each age estimate together, with their uncertainties, using specialized software such as OxCal, Clam or Bacon.  This then allows us to examine changes in the paleoecological record through time.  Basically, it allows us to do paleoecology.
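As a rough sketch of what an age-depth model does, here is the simplest possible version in Python: straight-line interpolation between dated depths.  The control points are invented for illustration, and this is not how OxCal, Clam or Bacon work internally; those tools also carry the calibrated uncertainties through the model, which this sketch ignores.

```python
def interpolate_age(depth, controls):
    """Estimate the age at a depth by linear interpolation.

    controls is a list of (depth_cm, calendar_age) pairs. Depths are
    assumed to be unique; extrapolation beyond the dated range is refused.
    """
    controls = sorted(controls)
    if not controls[0][0] <= depth <= controls[-1][0]:
        raise ValueError("depth is outside the dated part of the core")
    for (d0, a0), (d1, a1) in zip(controls, controls[1:]):
        if d0 <= depth <= d1:
            # Constant sedimentation rate assumed between two controls.
            return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)


# Hypothetical core: core top (collected in 2010, so -60 years BP)
# plus two calibrated radiocarbon dates.
controls = [(0, -60), (150, 3500), (400, 11700)]

for sample_depth in (75, 275):
    print(sample_depth, interpolate_age(sample_depth, controls))
```

Every pollen sample between two dates gets its age from the line joining them, so the fewer dates a core has, the more the ages depend on the assumption of constant sedimentation.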

A case for updating chronologies

The challenge for a database like Neotoma is that the IntCal curve changes over time (IntCal98, IntCal04, IntCal09, and now IntCal13) and our idea of what makes an acceptable age model (and what constitutes acceptable material for dating) also changes.

If we’re serving up data to allow for broad scale synthesis work, which age models do we provide?  If we provide the original published model only then these models can cause significant problems for researchers working today.  As I mentioned before, by the time we get back 10,000 14C years the old models (built using only 14C ages, not calibrated ages) will be out of sync with newer data in the database, and our ability to discern patterns in the early-Holocene will be affected.  Indeed, identical cores built using different age models and different versions of the IntCal curve could tell us very different things about the timing of species expansions following glaciation, or changes in climate during the mid-Holocene due to shifts in the Intertropical Convergence Zone (for example).

So, if we’re going to archive these published records then we ought to keep the original age models, they’re what’s published after all, and we want to keep them as a snapshot (reproducible science and all that).  However, if we’re going to provide this data to researchers around the world, and across disciplines, for novel research purposes then we need to provide support for synthesis work.  This support requires updating the calibration curves, and potentially, the age-depth models.

So we get (finally) to the point of the meeting.  How do we update age-models in a reliable and reproducible manner?  While the meeting didn’t provide a solution, we’re much closer to an endpoint.  Scripted age-depth modelling software like Clam and Bacon makes the task easier, since it provides the ability to numerically reproduce output directly in R.  The continued development of the Neotoma API also helps facilitate this task, since it would allow us to pull data directly from the database and reproduce age-model construction using a common set of data.

Figure 3. This chronology depends on only three control points and assumes constant sedimentation from 7000 years before present to the modern. No matter how you re-build this age model it’s going to underestimate uncertainty in this region.

One thing that we have identified, however, is the current set of limitations on this task.  Quite simply, there’s no point in updating some age-depth models.  The lack of reliable dates (or of any dates) means that new models will be effectively useless.  The lack of metadata in published material is also a critical concern.  While some journals maintain standards for the publication of 14C dates, these are only enforced when editors or reviewers are aware of them, and are difficult to enforce post publication.

The issue of making data open and available continues to be an exciting opportunity, but it really does reveal the importance of disciplinary knowledge when exploiting data sources.  Simply put, at this point if you’re going to use a large disciplinary database, unless you find someone who knows the data well, you need to hope that signal is not lost in the noise (and that the signal you find is not an artifact of some other process!).

No one reads your blog: Reflections on the middling bottom.

Two weeks ago Terry McGlynn posted reflections about blogging on Small Pond Science, an excellent blog that combines research, teaching reflections and other assorted topics.  Two weeks ago I didn’t post anything.  Three weeks ago I didn’t post anything.  The week before I posted a comment of Alwynne Beaudoin’s that is great, but wasn’t really mine (although she gave me permission to post it).  The last thing I posted myself was a long primer on using GitHub that I posted six weeks ago.

Writing and collaborating on GitHub, a primer for paleoecologists

At this point I’ve written a hundred times about the supplement for Goring et al. (2013), but just in case you haven’t heard it:

Goring et al. (2013) uses a large vegetation dataset to test whether or not pollen richness is related to plant richness at a regional scale.  Because of the nature of the data and analysis, and because I do all of my work (or most of it) using R, I thought it would be a good idea to produce totally reproducible research.  To achieve this I included a version of the paper, written using RMarkdown, as a supplement to the paper.  In addition to this, I posted the supplement to GitHub so that people who were interested in looking at the code more deeply could create their own versions of the data and could use the strengths of the GitHub platform to help them write their own code, or do similar analyses.

This is a basic how-to to get you started writing your paper using RMarkdown, RStudio and GitHub (EDIT: if some of these instructions don’t work let me know and I’ll fix them immediately):

Who sees your review?

There have been many calls for reform of the peer review process, and lots of blog posts about problems and bad experiences with peer review (Simply Statistics, SVPow, and this COPE report).  There is lots of evidence that peer review suffers from deficiencies related to author seniority and gender (although see Marsh et al., 2011), and from variability related to the choice of reviewers (see Peters & Ceci, 1982, though the age of this paper should be noted). Indeed, recent work by Thurner and Hanel (2011) and Squazzoni and Gandelli (2012) shows how sensitive publication can be to the structure of the discipline (whether homogeneous or fragmented) and the intentions of the reviewers (whether they are competitive or collegial).

To my mind, one of the best, established models of peer review comes from the Copernicus journals of the European Geosciences Union.  I’m actually surprised that these journals are rarely referenced in debates about reviewing practice.  Each journal offers two outlets.  I’ve published with co-authors in Climate of the Past Discussions (here, here and here), where papers undergo open review by reviewers who may or may not remain anonymous (their choice); the response and revised paper then go to ‘print’ in Climate of the Past (still in review, here and here respectively).

This is the kind of open peer review that people have pushed by posting reviews on their blogs (I saw a post on twitter a couple weeks ago, but can’t find the blog; if anyone has a reference please let me know).  The question is, why not publish in journals that support the kind of open review you want?  There are a number of models out there now, and I believe there is increasing acceptance of these models, so we have choice.  Let’s use it.

What inspired me to write the post, though, was my own recent experience as a reviewer.  I just finished reviewing a fairly good paper that ultimately got rejected.  When I received the editor’s notice I went to see what the other reviewer had said, only to find that the journal does not release other reviews.  This was the first time this had happened to me and I was surprised.

I review for a number of reasons.  It helps me give back to my disciplinary community, it keeps me up to date on new papers, it gives me an opportunity to deeply read and communicate science in a way that we don’t ordinarily undertake, and it helps me improve my own skills.  The last point comes not only from my own activity, but from reading the reviews of others.  If you want a stronger peer review process, having peers see one another’s reviews is helpful.

Criticism when warranted, praise when deserved.

Storm clouds developing over a small pond in the Chilcotin
Figure 1. Ecological systems depend on objects that manifest across multiple scales. Macrosystems Ecology seeks to understand ecology at regional to continental scales, which means understanding process at each of these scales.

A quick post today to keep the posting flowing.  I’m currently working on a number of papers that are really interesting and <sarcasm>sure to be the most highly cited papers ever in the history of science</sarcasm>.  In particular the group of authors who met at the NSF Macrosystems meeting in Boulder in March (Dave Schimel has some great thoughts here on the NEON website) have been working on a set of papers discussing what exactly ‘Macrosystems Ecology’ is, how it is undertaken and what it might tell us.  The collaborators include some great researchers at all stages of their careers including Pat Soranno, Andrew Finley, Jim Heffernan and others.