Semantics Shememantics

In science we work, more often than not, in teams.  Whether we work with one other individual, five individuals, or interact at workshops with hundreds of strangers, it’s important that we are clearly understood.  Clarity is critical, especially when explaining complex concepts.  KISS is my second favorite acronym, even if I can’t keep to the principle (NEFLIS, a camping acronym, is my favorite – No Excuse for Living in Squalor just because you’re out in the woods).

A recently funded project I’m working on, under the aegis of EarthCube, is the harmonization of the Neotoma Paleoecological Database and the Paleobiology Database. Neotoma is a database of Quaternary fossils (mammals and microfossils such as pollen and ostracodes), and the Paleobiology Database is a database of every other kind of fossil. Both are incredible repositories for their respective communities, and powerful research tools in their own right.  My Distinguished Lecture talk at the University of Wisconsin’s Rebecca J. Holz Research Data Symposium was about the role of Community Databases in connecting researchers to Big Data tools, while getting their data into a curated form so that others could easily access and transform their data to undertake innovative research projects.

Superman Card Game by Whitman (1978) - G by andertoons, on Flickr
Figure 1. Semantic differences can be kryptonite for a project.  Especially a project that has very short arms relative to the rest of its body like Superman does in this picture. [ credit: andertoons ]
Our recent joint Neotoma-PBDB workshop, in Madison WI, showed me that, even with such closely allied data and disciplinary backgrounds, semantics matter.  We spent the first morning of the meeting having a back and forth discussion, where it kept seeming like we agreed on core concepts, but then, as the conversations progressed, we’d fall back into some sort of confusion.  As it began to seem unproductive we stepped back and checked in to see if we really were agreeing on core concepts.

While both databases contain fossil data, there is a fundamental difference in how the data are collected.  Typically Paleobiology DB data is collected in isolation, a fossil whale, discovered & reported is more common than a vertical stratigraphic survey on a single outcrop at a specific Latitude and Longitude.  In Neotoma, so much of our data comes from lake sediment cores that it makes sense that much of our data (and data collection) is described from stratigraphic sequences.

This difference may not seem like much, especially when the Holocene (the last 11,700 years) is basically an error term in much of the Paleobiology Database, but it’s enough of a difference that we were apparently agreeing on much, but then, inexplicably, disagreeing on followup discussions.

This is one of the fundamental problems in interdisciplinary research. Interdisciplinarity is as much understanding the terms another discipline uses as it is understanding the underlying philosophy and application of those terms in scientific discourse.  Learning to navigate these differences is time consuming, and requires a skill set that many of us don’t have.  At the very least, recognizing this is a problem, and learning to address this issue is a skill that is difficult to master.  It took us a morning of somewhat productive discussion before we really looked at the problem.  Once addressed we kept going back to our draft semantic outline to make sure we knew what we were talking about when discussing each other’s data.

This is all to say, we had a great workshop and I’m really looking forward to the developments on the horizon.  The PBDB development team is great (and so is the Neotoma team) and I think it’s going to be a very fruitful collaboration.

Helping to fill the cloud from the bottom up.

Open data in the sciences is an aspirational goal, and one that I wholeheartedly agree with. The efforts of EarthCube (among others) to build an infrastructure of tools to help facilitate data curation and discovery in the Earth Sciences have been fundamental in moving this discussion forward in the geosciences, and at the most recent ESA meeting saw the development of a new section of the society dedicated to Open Science.

One of the big challenges to open science is that making data easily accessible and easily discoverable can be at odds with one another.  Making data “open” is as easy as posting it on a website, but making it discoverable is much more complex.  Borgman and colleagues (2007) very clearly lay out a critical barrier to data sharing in an excellent paper examining practices in “habitat ecology” (emphasis mine):

Also paradoxical is researchers’ awareness of extant metadata standards for reconciling, managing, and sharing their data, but their lack of use of such standards. The present dilemma is that few of the participating scientists see a need or ability to use others’ data, so they do not request data, they have no need to share their own data, they have no need for data standards, and no standardized data are available. . .

The issue, as laid out here, is that people know that metadata standards exist, but they’re not using them from the ground up because they’re not using other people’s data.  Granted this paper is now eight years old, but, for the vast majority of disciplinary researchers in the geosciences and biological sciences the extent of data re-use is most likely limited to using data tables from publications, if that. [a quick plug, if you’re in the geosciences, please take a few minutes to complete this survey on data sharing and infrastructure as part of the EarthCube Initiative]

So, for many people who are working with self-styled data formats, and metadata that is largely implicit (they’re the only ones who really understand their multi-sheet excel file), getting data into a new format (one that conforms to explicit metadata standards) can be formidable, especially if they’re dealing with a large number of data products coming out of their research.

Right now, for just a single paper I have ten different data products that need to have annotated metadata.  I’m fully aware that I need to do it, I know it’s important, but I’ve also got papers to review and write, analysis to finish, job applications to write, emails to send, etc., etc., etc., and while I understand that I can now get DOIs for my data products, it’s still not clear to me that it really means anything concrete in terms of advancement.

Don’t get me wrong, I am totally for open science, all my research is on GitHub, even partial papers, and I’m on board with data sharing.  My point here is that even for people who are interested in open science, correctly annotating data is still a barrier.

How do we address this problem? We have lots of tools that can help generate metadata, but many, if not all, of these are post hoc tools. We talk extensively, if colloquially, about the need to start metadata creation at the same time as we collect the data, but we don’t incentivise this process.  The only time people realize that metdata is important is at the end of their project, and by then they’ve got a new job to start, a new project to undertake, or they’ve left academia.

Making metadata creation a part of the research workflow is something I am working toward as part of the Neotoma project. Where metadata is a necessary component of the actual data analysis.   The Neotoma Paleoecological Database is a community curated database that contains sixteen different paleoecological proxies, ranging from water chemistry to pollen to diatoms to stable isotope data (see Pilaar Birch and Graham 2015). Neotoma has been used to study everything from modern patterns of plant diversity, rates of migration for plant and mammals, rates of change in community turnover through time, and species relationships to climate.  It acts as both a data repository and a research tool in and of itself.  A quick plug as well, the completion of a workshop this past week with the Science Education Resource Center at Carleton College in Minnesota has resulted in the development of teaching tools to help bring paleoecology into the classroom (more are on their way).

Neotoma has a database structure that includes a large amount of metadata.  Due in no small part to the activities of Eric Grimm, the metadata is highly curated, and, Tilia, a GUI tool for producing stratigraphic diagrams and age models from paleoecological data, is designed to store data in a format that is largely aligned with the Neotoma database structure.

In designing the neotoma package for R I’ve largely focused on its use as a tool to get data out of Neotoma, but the devastating closure of the Illinois State Museum by the Illinois Governor (link) has hastened the devolution of data curation for the database. The expansion of the database to include a greater number of paleoecological proxies has meant that a number of researchers have already become data stewards, checking to ensure completeness and accuracy before data is uploaded into the database.

Having the R package (and the Tilia GUI) act as a tool to get data in as well as out serves an important function, it acts as a step to enhance the benefits of proper data curation immediately after (or even during) data generation because the data structures in the applications are so closely aligned with the actual database structure.

We are improving this data/metadata synergy in two ways:

  1. Data structures: The data structures within the R package (detailed in our Open Quaternary paper) remain parallel to the database.  We’re also working with Mark Uhen, Shanan Peters and others at the Paleobiology Database (as part of this funded NSF EarthCube project) and, elsewhere, for example, the LiPD Project, which is itself proposing community data standards for paleoclimatic data (McKay and Emile-Geay, 2015).
  2. Workflow: Making paleoecological analysis easier through the use of the R package has the benefit of reducing the ultimate barrier to data upload.  This work is ongoing, but the goal is to ensure that by creating data objects in neotoma, data is already formatted correctly for upload to Neotoma, reducing the burden on Data Stewards and on the data generators themselves.

This is a community led initiative, although development is centralized (but open, anyone can contribute to the R package for example), the user base of Neotoma is broad, it contains data from over 6500 researchers, and data is contributed at a rate that continues to increase.  By working directly with the data generators we can help build a direct pipeline into “big data” tools for researchers that have traditionally been somewhere out on the long tail.

Jack Williams will be talking a bit more about our activities in this Middle Tail, and why it’s critical for the development of truly integrated cyberinfrastructure in the geosciences (the lessons are applicable to ecoinformatics as well) at GSA this year (I’ll be there too, so come by and check out our session: Paleoecological Patterns, Ecological Processes and Modeled Scenarios, where we’re putting the ecology back in ge(c)ology!).

Building your network using ORCiD and ROpenSci

Our neotoma package is part of the ROpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from them throughout the coding process.  Recently Scott Chamberlain posted some code for an R package to interface with ORCiD, the rORCiD package.

To digress for a second, the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second guessing myself ever since, especially as it became more and more apparent that, in writing proposals and talking about the package and the database I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being, stick an R in it so that you don’t have to keep clarifying the differences.

So, back on point.  A little while ago I posted a network diagram culled from my cv using a bibtex parser in R (the bibtex package by Roman François).  That’s kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious about what it would look like if I went out to the second order, what does it look like when we combine my publication network with the networks of my collaborators.

Figure 1.  A second order co-author network generated using R and ORCiD's public API.
Figure 1. A second order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code over and over again and it will fill in as more people sign up to get ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they may have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which the ORCiDs can be linked to other objects will be the real test for their staying power.  And even there, it’s not so much whether the IDs can be linked, they’re unique identifiers so they’re easy to use, it’s whether other projects, institutions and data repositories will create a space for ORCiDs so that the can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account it seems like people are prepared to invest in ORCiD for the long haul, which is exciting, and provides new opportunities for data analysis and for building research networks.

So, lets see what we can do with ORCiD and Scott’s rorcid package. This code is all available in a GitHub repository so you can modify it, fork, push or pull as you like:

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With the ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can be linked back to each of the ORCiDs registered for coauthors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors with ORCiDs.

So, for each published article we get the DOI, find all co-authors on each work who has an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function.  I thought about it but since this was basically a lark I figured I’d think about it later, or leave it up to someone to add to the existing repository (feel free to fork & modify).

In the end I coded this all up and plotted using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools:

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rOrcid package from the rOpenSci GitHub repository

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API. Once we ave the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCID.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 | (!'DOI' %in% x[,1])){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now we’ve got the functions, we’re going to get all my papers, make a list of the unique ORCIDs of my colleagues and then get all of their papers using the same ‘get_papers’ function. It’s a bit sloppy I think, but I wanted to try to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}

So now we’ve got a list with a data.frame for each author that has three columns, the DOI, the ORCID and their name. We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row x column intersection represents co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing each author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value for the cell that intersects co-authors, where names are pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)

Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion was okay, and what kinds of self promotion were acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton, HT Terry McGlynn at Small Pond Science) who also weighed in on his blog, presenting a table showing that up to 40% of papers in the Biological Sciences remain uncited within the first four years since publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents the opportunity to introduce an exciting solution to the secondary issue: What happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory‘.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites versus those in the Neotoma Database, but that’s a job for another time, and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1.  Dark data.  There is likely some excellent data down this pathway, but it's too spooky for me to want to access it, let's just ignore it for now. Photo Credit: J. Illingworth.
Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature. If the citation patterns cited by Jeff Ollerton hold up, it’s possible that we’re missing 30%+ of the available published data when we’re doing our analyses.  So it’s not only imperative that post-grads work to promote their work, and that funding agencies push PIs to provide sustainable data management plans, but we need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 1. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans.  Results are not significantly different.
Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter Paleodeepdive (Peters et al., 2014).  PaleoDeepDive is a project that is part of the larger, EarthCube funded, GeoDeepDive, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform. The system is trained to look for text, tables and figures in domain specific publications, extract data, build associations and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then then assign scores to the data extracted indicating confidence levels in the associations, which can act as a check on the data validity, and helps in building further relations as new data is accquired.  Paleodeepdive was used to comb paleobiology journals to pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. “What happens to your work once it is published” is transformed into a two part question: how is the paper cited, and how is the data used. More and more scientists are using public data repositories, although that’s not neccessarily the case as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community lead data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that are able to search disciplinary archives and collate the data into a single data storage location, in this case the Paleobiology Database. The problem still remains, how is the data cited?

We’ve run into the problem of citations with publications, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices, that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age.  There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of representing some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data.  One of the critical issues is that the benchmarks we use to build age models change through time.  Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog.  These changes, and the diversity of age models out there in the paleo-literature means that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1.  Why is there a picture of clams with bacon on them associated with this blog post?  Credit: Sheri Wetherell (click image for link)
Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam‘ (Blaauw, 2010). This is heavily used software and provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of the sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to the other was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package. Making the workflow: “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. ‘clam‘ is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("clam", "SimonGoring")
install_github("neotoma", "ropensci")
require(clam)
require(neotoma)

# Use this, and change the directory location to set a new working directory if you want.  We will be creating
# a Cores folder and new files &amp; figures associated with clam.
setwd('.')  

#  This example will use Three Pines Bog, a core published by Diana Gordon as part of her work in Temagami.  It is stored in Neotoma with
#  dataset.id = 7.  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the ‘clam’ function works exactly the same way it works in the manual, except I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratualtions. I hope you can also see how powerful this workflow can be.

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub), and, hopefully at some point in the future, add the funcitonality of Bacon to this as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518

How do you edit someone else’s code?

As academics I like to think that we’ve become fairly used to editing text documents. Whether handwriting on printed documents (fairly old school, but cool), adding comments on PDFs, or using some form of “track changes” I think we’ve learned how to do the editing, and how to incorporate those edits into a finished draft. Large collaborative projects are still often a source of difficulty (how do you deal with ten simultaneous edits of the same draft?!) but we deal.

Figure 1. If your revisions look like this you should strongly question your choice of (code) reviewer.
Figure 1. If your revisions look like this you should strongly question your choice of (code) reviewer.

I’m working on several projects now that use R as a central component in analysis, and now we’re not just editing the text documents, we’re editing the code as well.

People are beginning to migrate to version control software and the literature is increasingly discussing the utility of software programming practices (e.g., Scheller et al., 2010), but given that scientific adoption of programming tools is still in its early stages, there’s no sense that we can expect people to immediately pick up all the associated tools that go along with them. Yes, it would be great if people would start using GitHub or BitBucket (or other version control tools) right away, but they’re still getting used to basic programming concepts (btw Tim Poisot has some great tips for Learning to Code in Ecology).

The other issue is that collaborating with graduate students is still a murky area. How much editing of code can you do before you’ve started doing their work for them? I think we generally have a sense of where the boundaries are for written work, but if code is part of ‘doing the experiment’, how much can you do? Editing is an opportunity to teach good coding practice, and to teach new tools to improve reproducibility and ease of use, but give the student too much and you’ve programmed everything for them.

I’m learning as I go here, and I’d appreciate tips from others (in the comments, or on twitter), but this is what I’ve started doing when working with graduate students:

  • Commenting using a ‘special’ tag:  Comments in R are just an octothorp (#), I use #* to differentiate what I’m saying from a collaborator’s comments.  This is fairly extensible, someone else could comment ‘#s’ or ‘#a’ if you have multiple collaborators.
  • Where there are major structural changes (sticking things in functions) I’ll comment heavily at the top, then build the function once.  Inside the function I’ll explain what else needs to be done so that I haven’t done it all for them.
  • If similar things need to be done further down the code I’ll comment “This needs to be done as above” in a bit more detail, so they have a template & the know where they’re going.

The tricky part about editing code is that it needs to work, so it can be frustratingly difficult to do half-edits without introducing all sorts of bugs or errors.  So if code review is part of your editing process, how do you accomplish it?

What do citations tell us about the climate divide?

UPDATE:  An interesting turn of events has led me to write a follow-up to this post.

I came across an interesting article in Geoforum this past week:

Jankó, F., Móricz, N., & Papp Vancsó, J. (2014). Reviewing the climate change reviewers: Exploring controversy through report references and citations. Geoforum, 56, 17-34.

The authors are in the Faculty of Economics and the Faculty of Forestry at the University of West-Hungary, in Sopron, about an hour directly south of Vienna, Austria. They take an interesting quantitative and human geographic perspective of the use of citations in understanding the physical science basis of climate change from both scientific and skeptical perspectives. A number of bloggers have taken on the science in the NIPCC (Richard Telford has several posts on his blog), but this paper provides interesting insight into the human aspects of scientific report writing. As such the paper falls much more easily into human geography than it does the physical sciences it seeks to understand.

Figure 1.  Heartland's funders did not particularly like the comparison between climate science and the Unabomber. (image source: wikipedia)
Figure 1. Heartland’s funders did not particularly like the comparison between climate science and the Unabomber. (image source: wikipedia)

The issue of climate change is as much part of the domain of human geography as it is physical geography. In particular the dynamic of ‘skeptical’ backlash against the consensus of anthropogenic climate change is well worth studying.  Understanding resistance to scientific knowledge around climate change will be key to eventually moving forward with adaptation policies that can find broad acceptance.  The public self-reports as being less knowledgeable about climate change than it was in 2007 (Stoutenborough et al., 2014), and multiple, competing narratives are likely to play a role in that dynamic.

Lahsen (2013) points out that without examining the differences in perception between climate groups we risk making the science behind our current understanding of anthropogenic climate change more vulnerable to public backlash, and we frequently see interaction between place and social change within the organizations (Jankó et al. mention the impact of the grossly unpopular Unabomber billboard in Chicago on the Heartland Institute’s network of funders and climate change affiliates).

To study characteristics of resistance and acceptance of the science surrounding climate change, the authors review the citation lists of both the IPCC (AR4 – WG1, the Physical Science Basis) and the NIPCC ‘s Climate Change Reconsidered.  By examining similarities and differences in citations and the use of citations we can understand how the rhetoric around climate change science changes the interpretation of the published literature. Jankó et al. use a great quote from Bruno Latour to help guide the discussion:

Whatever the tactics, the general strategy is easy to grasp: do whatever you need to the former literature to render it as helpful as possible for the claims you are going to make. The rules are simple enough: weaken your enemies, paralyse those you cannot weaken […], help your allies if they are attacked, ensure safe communications with those who supply you with indisputable instruments […], oblige your enemies to fight one another […]; if you are not sure of winning, be humble and understated” (Latour, 1987, pp. 37–38).

Figure 2.  Citations are important, but they're rarely used in an impartial manner. (image source: wikipedia)
Figure 2. Citations are important, but they’re rarely used in an impartial manner. (image source: wikipedia)

I feel like this overstates the case for the IPCC a little bit (though I may be biased). The IPCC is not set up to directly combat ‘skeptical’ literature, as is the case of the NIPCC.  The NIPCC is explicitly structured to mirror and refute the IPCC.  Regardless, we often think that as researchers we use citations in a neutral manner, but I would argue that that’s rarely the case. Citations in the literature are selected to help bolster arguments, they’re selected because we know people, and they’re occasionally massaged to change the point of an argument in an effort to support our own.

So the question becomes, how is the literature used and modified in these summaries to help develop an agenda?

Interestingly Jankó et al. show that only 4.4% of total citations (IPCC + NIPCC) were used in both reports. This was surprising to me. I had expected that many of the primary sources to explain climate systems and their modern behaviour might have made up a much larger proportion of both reports. Jankó et al. include a table with analysis of many of the overlapping citations (Appendix B)and we see that most cases of duplicate citations show similar tone in the treatment of the citations. Differences do exist however.  Where there is extensive overlap in citations Jankó et al. have some very insightful points to make here.  One surprising point was that both reports use particular language around references they like (‘find’, ‘indicate’, ‘report’, ‘show’, ‘conclude’) and don’t like (‘claim’, and ‘contend’), although how the language is applied to individual citations varies between reports (the discussion of Tropical Cyclones is well worth a read).  The other main difference in these overlapping citations is that key NIPCC citations, challenging climate change are often found in the IPCC to support understanding of uncertainties.  Thus, what the IPCC sees as an uncertainty, the NIPCC sees as evidence against anthropogenic climate change.

The real issue that piqued my interest however was the much higher proportion of paleo-journals in the NIPCC literature.  The Holocene is cited 12 times more frequently in the NIPCC than in the IPCC,  Geology and Quaternary Research are both cited 10 times more often.  Why would skeptics cite paleoecological literature at higher rates than the IPCC?  In large part this is due to a key motivation for the NIPCC, and a particular focus in the paleoclimate sections.

The analytical goal of the NIPCC is to increase the perception of uncertainty, attempting to add more ‘non-supportive’ and ‘uncertain’ literature to the argument, and to use that increased uncertainty to take apart the arguments for anthropogenic climate change.  In this way the paleo-literature becomes a tool for skeptics with which to attack our understanding of climate change science.  Indeed, of the 18 references from the Holocene in the NIPCC, only one could be considered ‘Neutral’ while the other 17 were considered to be ‘Not Supporting’ of climate change science.  For Quaternary Research 2 citations were ‘Neutral’ and ’12 were ‘Not Supporting’.  Again, what might be considered uncertainty in the IPCC is considered evidence against in the NIPCC.

Figure 1.  Does showing climate has changed in the past prove that climate change is natural?
Figure 3. Does showing climate has changed in the past prove that climate change is natural?

Jankó et al. explain this trend by showing that the NIPCC uses the past to explain the present in such a way as to downplay the unprecendented nature of modern climate change, while the IPCC uses the past to search for analogues of modern climate change.  Effectively, the NIPCC view stops at the present:  The past was warmer, therefore change is not unprecedented.  The IPCC is searching for ways to explain the future: The past had warmer periods. What caused those changes, what happened  during those periods, and how can we use the past to constrain models for the future?

This, to my mind, is the difference between the camps.  The science marshaled in the IPCC is focused toward improving hypotheses and theoretical (and mechanistic) models.  It is prescriptive science in that uncertainties are identified, and used to improve our understanding of modern and future change.  In the ‘skeptical’ camp, science is marshaled to disprove anthropogenic causes, and when it does, the avenue of research is closed.  It is effectively a descriptive model without an overarching theoretical framework.  This allows it to attach the label ‘skeptical’ to disparate threads of knowledge across the literature, without having to concern itself with how those pieces join together.  Jankó et al. point out that the narrative style of the NIPCC report is structured around an anecdotal style, summarizing each paper individually and often adding textual quotes, while the IPCC synthesizes knowledge from multiple sources and provides block references for statements.  In one we see a descriptive format that highlights any contrary (or uncertain) position, in the other we see an effort to synthesize knowledge into a theoretical framework.

The scientific basis for anthropogenic climate change is strongly grounded in a fairly simple physical model that finds broad based theoretical support across a range of physical sciences.  The scientific community has shown that over time (since at least the 1970s), counter-examples and uncertainties found in the literature have been able to highlight weaknesses in our understanding, bu, rather than collapse the structure, these weaknesses have been marshaled to improve the science and to develop a much more robust scientific understanding of climate change.

Literature Cited

Idso, Craig Douglas, & Siegfried Fred Singer. 2009. Climate change reconsidered: 2009 report of the Nongovernmental International Panel on Climate Change (NIPCC). Nongovernmental International Panel on Climate Change.

Jankó, F., Móricz, N., & Papp Vancsó, J. (2014). Reviewing the climate change reviewers: Exploring controversy through report references and citations. Geoforum56, 17-34.

Jansen, E., J. Overpeck, K.R. Briffa, J.-C. Duplessy, F. Joos, V. Masson-Delmotte, D. Olago, B. Otto-Bliesner, W.R. Peltier, S. Rahmstorf, R. Ramesh, D. Raynaud, D. Rind, O. Solomina, R. Villalba and D. Zhang, 2007: Palaeoclimate. In: Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change [Solomon, S., D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M. Tignor and H.L. Miller (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard university press.

Lahsen, M. (2013). Climategate: the role of the social sciencesClimatic change119(3-4), 547-558.

Stoutenborough, J. W., Liu, X., & Vedlitz, A. (2014). Trends in Public Attitudes Toward Climate Change: The Influence of the Economy and Climategate on Risk, Information, and Public Policy. Risk, Hazards & Crisis in Public Policy,5(1), 22-37.