Who is a Scientist – Reflections on #AAG2016

This is the first time I’ve really attended the American Association of Geographers meeting. Last year it was held in Chicago, close to Madison, and I was invited to speak at a session called “The View from the Anthropocene”, organized by two great geographers from the University of Connecticut, Kate Johnson and Megan Hill, but I had the kids and only spent the morning there.  I’m pretty sure there was another reason as well, but I can’t remember what it was.

It was great to see Kate and Megan again this year; both are doing really cool stuff (check out their web pages). Better still, the momentum behind the original idea was enough to re-tool the session into the Symposium on Physical Geography at this year’s AAG meeting in San Francisco, with Anne Chin of UC-Denver on the organizing committee and a host of fantastic speakers.

My own presentation in the session focused on the Anthropocene and its role as both a boundary (whether you want to define it as an Epoch sensu stricto or as a philosophical concept – I think Stanley Finney & Lucy Edwards’ article in GSA Today lays out the arguments nicely) and a lens.  The second part of that equation (the lens) is a more diffuse point, but my argument is that the major changes we see in the earth system can affect our ability to build models of the past using modern analogues, whether those analogues are climatic or biological.  I show this using pollen and vegetation records from the Midwest, and make the connection to future projections with the example laid out in Matthes et al. (2015), where we show that the pre-industrial climate niches of the plant functional types used in GCMs as part of the CMIP5 intercomparison perform no better than random when compared to actual “pre-settlement” vegetation in the northeastern United States.

But I really want to talk about a single slide.  In the early part of the talk I use this slide:

[Slide: Margaret Davis]

This is Margaret Davis, one of the most important paleoecologists in North America, a past-president of the ESA and, importantly, a scientist who thought deeply about our past, our present and our future.  There’s no doubt she should be on the slide.  She is a critical piece of our cultural heritage as scientists and, because of her research, she is uniquely well suited to show up in a slide focusing on the Anthropocene.

But it’s political too.  I put Margaret Davis up there because she’s an important scientist, but I also chose her because she’s an important female scientist. People specifically commented on the fact that I chose a female scientist, because it’s political.  It shouldn’t be.  There should be no need for me to pick someone because of their gender, and there should be no reason to comment on the fact that she was a female scientist.  It should just “be”.

Personal actions should be the manifestation of one’s political beliefs, but so much of our day-to-day life passes by without contemplation.  Susanne Moser, later in my session, talked about the psychological change necessary to bring society around to the task of reducing CO2, of turning around the Anthropocene, or surviving it, and I think that the un-examined life is a critical part of the problem.  If we fail to take account of how our choices affect others, or affect society, then we are going to confront an ugly future.

Everything is a choice, and the choices we make should reflect the world we want for ourselves and for the next generations. If our choices go un-examined then we wind up with the status quo.  We wind up with unbalanced panels, continued declines in under-represented minority participation in the physical sciences, and an erosion of our public institutions.

This post is maybe self-serving, but it shouldn’t have to be.  We shouldn’t have to look to people like DN Lee, the authors of Tenure She Wrote, Chanda Prescod-Weinstein, Terry McGlynn, Margaret Kosmala, Oliver Keyes, Jacquelyn Gill and so many others who advocate for change within the academic system, often penalizing themselves in the process.  We should be able to look to ourselves.

Okay, enough soap-boxing. Change yourselves.


Semantics Shememantics

In science we work, more often than not, in teams.  Whether we work with one other individual, five individuals, or interact at workshops with hundreds of strangers, it’s important that we are clearly understood.  Clarity is critical, especially when explaining complex concepts.  KISS is my second favorite acronym, even if I can’t keep to the principle (NEFLIS, a camping acronym, is my favorite – No Excuse for Living in Squalor just because you’re out in the woods).

A recently funded project I’m working on, under the aegis of EarthCube, is the harmonization of the Neotoma Paleoecological Database and the Paleobiology Database. Neotoma is a database of Quaternary fossils (mammals and microfossils such as pollen and ostracodes), and the Paleobiology Database is a database of every other kind of fossil. Both are incredible repositories for their respective communities, and powerful research tools in their own right.  My Distinguished Lecture talk at the University of Wisconsin’s Rebecca J. Holz Research Data Symposium was about the role of Community Databases in connecting researchers to Big Data tools, while getting their data into a curated form so that others can easily access and transform it to undertake innovative research projects.

Figure 1. Semantic differences can be kryptonite for a project.  Especially a project that has very short arms relative to the rest of its body like Superman does in this picture. [ credit: andertoons ]
Our recent joint Neotoma-PBDB workshop, in Madison WI, showed me that, even with such closely allied data and disciplinary backgrounds, semantics matter.  We spent the first morning of the meeting in a back-and-forth discussion where it kept seeming like we agreed on core concepts, but then, as the conversations progressed, we’d fall back into some sort of confusion.  When it began to seem unproductive, we stepped back and checked whether we really were agreeing on those core concepts.

While both databases contain fossil data, there is a fundamental difference in how the data are collected.  Paleobiology Database data are typically collected in isolation: a fossil whale, discovered and reported, is more common than a vertical stratigraphic survey of a single outcrop at a specific latitude and longitude.  In Neotoma, so much of our data comes from lake sediment cores that it makes sense that most of our data (and data collection) is described in terms of stratigraphic sequences.

This difference may not seem like much, especially when the Holocene (the last 11,700 years) is basically an error term in much of the Paleobiology Database, but it’s enough of a difference that we would apparently agree on one point and then, inexplicably, disagree in the follow-up discussion.
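To make the difference concrete, here is a rough sketch in R.  The field names are from memory and purely illustrative, so treat this as a sketch of the two data structures rather than a recipe:

library(neotoma)
library(jsonlite)

#  A Neotoma download is a stratigraphic sequence: samples ordered by depth
#  within a single site (dataset 7 is a pollen record I use elsewhere on this blog).
threepines <- get_download(7)
head(threepines[[1]]$sample.meta)

#  A Paleobiology Database query returns individual occurrences, each tied to a
#  collection and its coordinates rather than to a position in a measured section.
whales <- fromJSON("https://paleobiodb.org/data1.2/occs/list.json?base_name=Cetacea&show=coords")
head(whales$records)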

This is one of the fundamental problems in interdisciplinary research. Interdisciplinarity is as much about understanding the terms another discipline uses as it is about understanding the underlying philosophy and application of those terms in scientific discourse.  Learning to navigate these differences is time consuming, and requires a skill set that many of us don’t have.  At the very least, recognizing that this is a problem is a start; learning to address it is a skill in its own right, and one that is difficult to master.  It took us a morning of somewhat productive discussion before we really confronted the problem.  Once we had, we kept going back to our draft semantic outline to make sure we knew what we were talking about when discussing each other’s data.

This is all to say, we had a great workshop and I’m really looking forward to the developments on the horizon.  The PBDB development team is great (and so is the Neotoma team) and I think it’s going to be a very fruitful collaboration.

Explorations in outreach – Creating a Twitter bot for the Neotoma Paleoecological Database.

If you’ve ever been in doubt about whether you chose the right programming language to learn I want to lay those concerns to rest here.

For many scientists, particularly in Biology or the Earth Sciences, there is often a question about whether you should be learning R, Python, Matlab or something else.  Especially when you’re coming into scientific programming in grad school with little prior experience, this can seem like a daunting prospect.  You already don’t know anything about anything, and ultimately you wind up learning whatever you’re taught, or whatever your advisor is using, and you wonder... Is the grass greener over in Python-land? Those figures look nice, if only I had learned R... Why did I learn on an expensive closed platform?

I am here to say “Don’t worry about it”, and I want to emphasize that with an example centered around academic outreach:

The Neotoma Paleoecological Database has had an issue for several years now.  We have had a large number of datasets submitted, but very few people could actively upload datasets to the database.  Neotoma is a live database, which means that not only do new datasets get added but, as new information becomes available (for example, new taxonomic designations for certain species), existing datasets get updated.  This means that maintaining the database is very time intensive, and there has traditionally been a gap between data ingest and data publication.  To make up for this there has been a data “Holding Tank” where individual records have been available, but this wasn’t the best solution.

Fast forward to about a year ago. Eric Grimm at the Illinois State Museum updated the software package Tilia to provide greater access to the database for selected data stewards.  Each data type (including insects, pollen, mammal fossils, XRF, ostracodes and lake chemistry) has one or a few stewards who can vet and upload datasets directly to the database using the Tilia platform. This has rapidly increased the speed at which datasets enter Neotoma — over the last month more than 200 new datasets have been added — but it’s still hard to get a sense of this as an outsider, since people don’t regularly check the database unless they need data from it.

Which brings us to Twitter. Academics have taken to Twitter like academics on a grant.  Buzzfeed has posted a list of 25 Twitter feeds for nerds, Science published a somewhat contentious list of scientists to follow, and I’m on Twitter, so obviously all the cool kids are there. This led me to think that Twitter could be a good platform for publicizing new data uploads to Neotoma.  Now I just needed to learn how.

The process is fairly straightforward:

  1. Figure out what the most recently posted Neotoma datasets are:
    • This is made easier with the Neotoma API, which has a specific method for returning datasets: http://ceiwin10.cei.psu.edu/NDB/RecentUploads?months=1
    • You’ll notice (if you click) that the link returns data in a weird format.  This format is called JSON and it has been seen by many as the successor to XML (see here for more details).
  2. Check it against two files: (1) a file of everything that’s been tweeted already, and (2) a file of everything that still needs to be tweeted (since we’re not going to tweet everything at once).
  3. Append the new records to the queue of sites to tweet.
  4. Tweet.
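To make the first three steps concrete, here is a rough sketch in R (the real bot is written in Python, and the file names and JSON field names below are illustrative assumptions rather than the bot’s actual code):

library(jsonlite)

#  Step 1: fetch the most recent uploads from the Neotoma API.
recent  <- fromJSON("http://ceiwin10.cei.psu.edu/NDB/RecentUploads?months=1")
new_ids <- as.character(recent$data$DatasetID)   # field name assumed for illustration

#  Step 2: compare against what has already been tweeted or queued.
tweeted <- readLines("tweeted.txt")
queued  <- readLines("queue.txt")

#  Step 3: append anything new to the queue of sites to tweet.
to_add <- setdiff(new_ids, c(tweeted, queued))
write(to_add, file = "queue.txt", append = TRUE)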

So that’s it (generally).  I’ve been working in R for a while now, so I have a general sense of how these things might happen. The thing is, these same mechanics translate to other languages as well. The hardest thing about programming (in my opinion) is figuring out how the program ought to flow. Everything else is just window dressing. Once you get more established with a programming language you’ll learn the subtleties of the language, but for hack-y programming, you should be able to get the hang of it regardless of your language background.

As evidence, Neotomabot.  The code’s all there; I spent a day figuring out how to program it in Python, but to help myself out I planned it all first using long-hand notes, and then hacked it out using Google, StackOverflow and the Python manual.  Regardless, it’s the flow control that’s key. From my experience in R I’ve learned how “for” loops work, I know about “while” loops, I know try-catch methods exist, and I know I need to read JSON files and push out to Twitter. Given that, I can map out the program and then write the code, and that gives us Neotomabot.
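Here is roughly what that flow-control skeleton looks like, again sketched in R rather than Python; post_tweet() is a hypothetical placeholder for whatever function actually sends the tweet, not a real package function:

#  Read the queue, try to tweet the first item, and only record it as tweeted
#  if the post succeeds (tryCatch keeps a failed post from crashing the job).
queued <- readLines("queue.txt")

if (length(queued) > 0) {
  result <- tryCatch(post_tweet(queued[1]),     # hypothetical placeholder
                     error = function(e) NULL)

  if (!is.null(result)) {
    write(queued[1], file = "tweeted.txt", append = TRUE)
    writeLines(queued[-1], "queue.txt")
  }
}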

All the code is available on the GitHub repository here, except for the OAuth handles, but you can learn more about that aspect from this tutorial: How to Write a Twitter Bot. I found it very useful for getting started.  There is also a twitteR package for R, and several good tutorials for it are available (here and here).
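If you wanted to do the posting step from R, a minimal twitteR sketch looks something like this (the keys are placeholders; the OAuth setup is covered in the tutorials above):

library(twitteR)

#  Credentials are placeholders; you get the real values from your Twitter app settings.
setup_twitter_oauth(consumer_key    = "CONSUMER_KEY",
                    consumer_secret = "CONSUMER_SECRET",
                    access_token    = "ACCESS_TOKEN",
                    access_secret   = "ACCESS_SECRET")

updateStatus("New dataset added to the Neotoma Paleoecological Database!")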

So that’s it.  You don’t need to worry about picking the wrong language. Learning the basics of any language, and how to map out the solution to a problem, is the key.  Focus on these and you should be able to shift when needed.

PalEON has a video

In keeping with the theme of coring pictures I wanted to share PalEON’s new video, produced by the Environmental Change Initiative at the University of Notre Dame.  It does a good job of explaining what PalEON does and what we’re all about.  There’s also a nice sequence, starting at about 2:40, where you get to see a “frozen finger” corer in action.  We break up dry ice, create a slurry with alcohol, and then drop the corer down into the lake sediment.

Once the sediment has frozen to the sides of the corer (about 10 – 15 minutes) we bring the corer up and remove the slabs of ice from the sides, keeping track of position, dimensions and orientation so that we can piece it back together.  I’m on the platform with Jason McLachlan and Steve Jackson.

There’s also a brief section in there about the sociology of the PalEON project.  So take a second and watch the video; it’s great!

The advantages of taking a chance with a new journal – Open Quaternary

Full disclosure: I’m on the editorial board of Open Quaternary and also manage the blog, but I am not an Editor in Chief and have attempted to ensure that my role as an author and my role as an editor did not conflict.

Figure 1. Neotoma and R together at last!

We (myself, Andria Dawson, Gavin L. Simpson, Eric Grimm, Karthik Ram, Russ Graham and Jack Williams) have a paper in press at a new journal called Open Quaternary.  The paper documents an R package that we developed in collaboration with rOpenSci to access and manipulate data from the Neotoma Paleoecological Database.  In part the project started because of the needs of the PalEON project.  We needed a dynamic way to access pollen data from Neotoma, so that analysis products could be updated as new data entered the database.  We also wanted to exploit the new API developed by Brian Bills and Michael Anderson at Penn State’s Center for Environmental Informatics.
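As a small taste of what the package does (a minimal sketch; the worked examples in the paper are much more complete, and the accessor names are easiest to confirm in the package documentation):

library(neotoma)

#  Pull a single pollen record: dataset 7 is Three Pines Bog, Temagami, Ontario.
threepines <- get_download(7)

#  The download bundles the count data with its sample metadata (depths, ages),
#  so an analysis can be re-run as the database record is updated.
head(counts(threepines[[1]])[, 1:5])
head(threepines[[1]]$sample.meta)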

There are lots of thoughts about where to submit journal articles.  Nature’s Research Highlights has a nice summary of a new article in PLoS One (Salinas and Munch, 2015) that looks to identify optimum journals for submission, and Dynamic Ecology discussed the point back in 2013, in a post that drew considerable attention (here, here, and here, among others).  When we thought about where to submit, I made the conscious choice to go with an Open Access journal. I chose Open Quaternary partly because I’m on the editorial board, but also because I believe that domain specific journals are still a critical part of the publishing landscape, and because I believe in Open Access publishing.

The downside of this decision was that (1) the journal is new, so there’s a risk that people don’t know about it and it’s less ‘discoverable’; and (2) even though it’s supported by an established publishing house (Ubiquity Press), it will not obtain an impact factor until it’s relatively well established.  Although it’s important to argue that impact factors shouldn’t make a difference, it’s hard not to believe that they do.

Figure 2. When code looks crummy it’s not usable. This has since been fixed.

That said, I’m willing to invest in my future and the future of the discipline (hopefully!), and we’ve already seen a clear advantage of investing in Open Quaternary.  During the revision of our proofs we noticed that the journal’s two-column format wasn’t well suited to the blocks of code that we presented to illustrate examples in our paper.  We also lost the nice color syntax highlighting that pandoc offers when it renders RMarkdown documents (see examples in our paper’s markdown file).  With the help of the journal’s Publishing Assistant Paige MacKay, Editor in Chief Victoria Herridge and my co-authors, we were able to get the journal to publish the article in a single-column format, with syntax highlighting supported using highlight.js.

I may not have a paper in Nature, Science or Cell (the other obvious options for this paper /s), but by contributing to the early stages of a new open access publishing platform I was able to help change the journal’s standards, making future contributions more readable and ensuring that my own paper is accessible, readable, and that the technical solution we present is easily implemented.

I think that’s a win.  The first issue of Open Quaternary should be out in March; until then you can check out our GitHub repository or the PDF as submitted (compleate with typoes).

How far do you go with coding in a collaborative project?

I’ve been coding in R since I started graduate school, and I’ve been coding in one way or another since I learned how to use Turing in high school.  I’m not an expert by any means, but I am proficient enough to do the kind of work I want to do, to know when I’m stuck and need to ask for help, and to occasionally over-reach and get completely lost in my own code.

I’ve been working hard to improve my coding practice, particularly focusing on making my code more elegant, and looking at ways to use public code as a teaching tool. In submitting our neotoma package paper to Open Quaternary I struggled with the balance between making pretty figures and making easily reproducible examples, and between providing ‘research-quality’ case studies and having the target audience, generally non-programmers, turn off at the sight of walls of code.

There’s no doubt that the quality and interpretability of your code can have an impact on subsequent papers.  In 2012 there was a paper in PNAS that showed that papers with more equations often get cited less than similar papers with fewer equations (SciAm, Fawcett and Higginson, 2012).  I suspect the same pattern of citation holds for embedded code blocks, although how frequently this happens outside specialized journals isn’t clear to me. It certainly didn’t hurt Eric Grimm’s CONISS paper (doi and PDF) which has been cited 1300+ times, but this may be the exception rather than the rule, particularly in paleoecology.

I’m currently working on a collaboration with researchers at several other Canadian universities. We’re using sets of spatio-temporal data, along with phylogenetic information across kingdoms, to do some pretty cool research.  It’s also fairly complex computationally.  One of the first things I did in co-developing the code was to go through the code base and re-write it, trying to make it a bit more robust to bugs and to modularize it.  This meant pulling repeated blocks of code into new functions, putting functions into their own files, generalizing some of the functions so that they could be re-used in multiple places, and generally spiffing things up with a healthy use of the plyr package.
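As a caricature of the kind of change I mean (hypothetical code, not lifted from the actual project): a block that had been copy-pasted for every input file becomes one general function, applied across the files with plyr.

library(plyr)

#  One general summary function replaces the repeated, copy-pasted block
#  (the column name 'occurrences' is made up for the example).
summarize_file <- function(file) {
  dat <- read.csv(file)
  data.frame(file      = file,
             n_records = nrow(dat),
             mean_occ  = mean(dat$occurrences, na.rm = TRUE))
}

species_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
summaries     <- ldply(species_files, summarize_file)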

Figure 1. This is what coding machismo looks like on me.

This, I now realize, was probably something akin to statistical machismo (maybe more like coding machismo). The use of coding ‘tricks’ limited the ability of my co-authors to understand and modify the code themselves.  It also meant that further changes to the code required more significant investments of my time. They’re all great scientists, but they’re not native coders in the same way I am (not that I’m in the upper echelon myself).

This has been a real learning experience.  Coding for research is not the same as coding for an industrial application.  The culture shift in the sciences towards an open-sharing model means that we are no longer doing this work just to get output that “works”; the code itself is an output.  Collaborative coding should mean that participants in the project are able to understand what the code does and how it works.  In many cases that means recognizing that collaborators are likely to have differing skill sets when it comes to coding, and that those different skill sets need to be respected.

In my case, going ahead and re-writing swaths of code certainly helped reduce the size of the code base; it meant that, in general, things ran more smoothly and that changes to the code could be made relatively easily.  It also meant that I was the only one who could easily make those changes.  This is not good collaborative practice, at least at the outset.

Figure 2. A flowchart showing the relationships between classes in the neotoma package.

Having said that, there are lots of good reasons why good coding practice can be beneficial, even if some collaborators can’t immediately work through changes.  It’s partly a matter of providing road maps, something that is rarely done.  Good commenting is useful, but more often lately I’ve been leaning toward trying to map an application with flowcharts.  It’s not pretty, but the diagram I drew up for our neotoma package paper (Goring et al., submitted) helped me debug, clean up syntax errors, and gave me an overview I didn’t really have until I drafted it.

I’m working on the same kind of chart for the project I mentioned earlier, although it can be very complicated to do.  I’ve also been making an effort to clearly report changes I’ve made using git, so that we have a good record of what’s been done, and why. Certainly, it would have been easier to do all this in the first place, but I’ve learned my lesson. As in most research, a little bit of planning can go a long way.

Writing robust, readable code is an important step toward open and reproducible science, but we need to acknowledge the fact that reproducibility should not be limited to expert coders.  Trade-offs are a fact of life in evolution, and yet some of us are unwilling to make the trade-offs in our own scientific practice.  We are told constantly to pitch our public talks to the audience.  In working with your peers on a collaboration you should respect their relative abilities and ensure that they can understand the code, which may result in eschewing fancier tricks in favor of clearly outlined steps.

If you code in such a way that people can’t work with you, you are opening yourself up to more bugs, more work and, possibly, more trouble down the line.  It’s a fine balance, and as quantitative tools such as R and Python become more a part of the lingua franca for grad students and early career researchers, it’s one we’re going to have to negotiate more often.

Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age.  There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of representing some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data.  One of the critical issues is that the benchmarks we use to build age models change through time.  Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog.  These changes, and the diversity of age models out there in the paleo-literature, mean that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam‘ (Blaauw, 2010). This is heavily used software and provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of the sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to the other was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would let me work toward a seamless integration with the neotoma package, making the workflow of “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten Blaauw several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. ‘clam‘ is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("SimonGoring/clam")
install_github("ropensci/neotoma")
require(clam)
require(neotoma)

# Use this, and change the directory location to set a new working directory if you want.  We will be creating
# a Cores folder and new files & figures associated with clam.
setwd('.')  

#  This example will use Three Pines Bog, a core published by Diana Gordon as part of her work in Temagami.  It is stored in Neotoma with
#  dataset.id = 7.  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The ‘clam’ function works exactly the same way as described in the manual, except that I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratulations. I hope you can also see how powerful this workflow can be.
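If you want to try the new Stineman option, the call is the same with a different type argument (assuming the GitHub install above went smoothly):

#  Re-run the Three Pines age model with Stineman smoothing (type = 6).
clam('ThreePines', type = 6)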

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub) and, hopefully, to add the functionality of Bacon as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518