PalEON has got a video

In keeping with the theme of coring pictures I wanted to share PalEON’s new video, produced by the Environmental Change Initiative at the University of Notre Dame.  It does a good job of explaining what PalEON does and what we’re all about.  There’s also a nice sequence, starting at about 2:40, where you get to see a “frozen finger” corer in action.  We break up dry ice, create a slurry with alcohol, and then drop the corer down into the lake sediment.

Once the sediment has frozen to the sides of the corer (about 10 – 15 minutes) we bring the corer up and remove the slabs of ice from the sides, keeping track of position, dimensions and orientation so that we can piece it back together.  I’m on the platform with Jason McLachlan and Steve Jackson.

There’s a great section in there about the sociology of the PalEON project as well, although it’s brief.  So take a second and watch the video, it’s great!

The advantages of taking a chance with a new journal – Open Quaternary

Full disclosure: I’m on the editorial board of Open Quaternary and also manage the blog, but I am not an Editor in Chief and have attempted to ensure that my role as an author and my role as an editor did not conflict.

Figure 1. Neotoma and R together at last!

We (myself, Andria Dawson, Gavin L. Simpson, Eric Grimm, Karthik Ram, Russ Graham and Jack Williams) have a paper in press at a new journal called Open Quaternary.  The paper documents an R package that we developed in collaboration with rOpenSci to access and manipulate data from the Neotoma Paleoecological Database.  In part the package grew out of the needs of the PalEON project: we needed a dynamic way to access pollen data from Neotoma, so that analysis products could be updated as new data entered the database.  We also wanted to exploit the new API developed by Brian Bills and Michael Anderson at Penn State’s Center for Environmental Informatics.
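To give a flavour of what the package does, here is a minimal sketch of the kind of workflow the paper describes, using the package’s get_dataset and get_download functions.  The bounding box is arbitrary (chosen just for illustration) and the exact call signatures may differ slightly from the released version of the package:

library(neotoma)

#  Search Neotoma for pollen datasets within a bounding box
#  (lonW, latS, lonE, latN), here roughly the western Great Lakes:
pollen.sets <- get_dataset(datasettype = 'pollen',
                           loc = c(-95, 43, -87, 49))

#  Pull the full records for those datasets down from the Neotoma API:
pollen.data <- get_download(pollen.sets)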

There are lots of thoughts about where to submit journal articles.  Nature’s Research Highlights has a nice summary of a new article in PLoS One (Salinas and Munch, 2015) that tries to identify optimum journals for submission, and Dynamic Ecology discussed the question back in 2013, in a post that drew considerable attention (here, here, and here, among others).  When we thought about where to submit I made the conscious decision to go with an Open Access journal.  I chose Open Quaternary partly because I’m on the editorial board, but also because I believe that domain-specific journals are still a critical part of the publishing landscape, and because I believe in Open Access publishing.

The downside of this decision was that (1) the journal is new, so there’s a risk that people don’t know about it and it’s less ‘discoverable’; and (2) even though it’s supported by an established publishing house (Ubiquity Press), it will not obtain an impact factor until it’s relatively well established.  Although it’s important to argue that impact factors should not make a difference, it’s hard not to believe that they do.

Figure 2. When code looks crummy it’s not usable. This has since been fixed.

That said, I’m willing to invest in my future and the future of the discipline (hopefully!), and we’ve already seen a clear advantage of investing in Open Quaternary.  During the revision of our proofs we noticed that the journal’s two-column format wasn’t well suited to the blocks of code that we presented to illustrate examples in our paper.  We also lost the nice color syntax highlighting that pandoc offers when it renders RMarkdown documents (see examples in our paper’s markdown file).  With the help of the journal’s Publishing Assistant Paige MacKay, Editor in Chief Victoria Herridge and my co-authors, we were able to get the journal to publish the article in a single-column format, with syntax highlighting supported using highlight.js.
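For anyone who hasn’t tried the RMarkdown workflow, here’s a rough sketch of how a manuscript like ours might be rendered, with pandoc handling the code highlighting (the file name is hypothetical):

library(rmarkdown)

#  Render an RMarkdown manuscript to HTML; the 'highlight' argument
#  is passed through to pandoc to control code syntax highlighting.
#  'our_paper.Rmd' is a placeholder file name.
render("our_paper.Rmd",
       output_format = html_document(highlight = "tango"))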

I may not have a paper in Nature, Science or Cell (the other obvious options for this paper /s), but by contributing to the early stages of a new open access publishing platform I was able to help change the journal’s standards, making future contributions more readable and ensuring that my own paper is accessible, readable, and that the technical solution we present is easily implemented.

I think that’s a win.  The first issue of Open Quaternary should be out in March; until then you can check out our GitHub repository or the PDF as submitted (compleate with typoes).

Cross-scale ecology at the Ecological Society of America Meeting in Baltimore!

Our Organized Oral Session has been approved and a date has been assigned.  ESA 2015 is getting closer every day (the abstract deadline is coming up on February 26th!), and with it the centennial celebration of the Ecological Society of America.  We’ve managed to recruit a great group of speakers to talk about ecological research that crosses scales of time, rather than space.  Many of these studies share approaches with what we generally consider to be ‘cross-scale’ ecology, which tends to be spatially focused, but they must also deal with the additional complexity of temporal uncertainty and changing relationships between communities, climate, biogeochemical cycling and disturbance at decadal, centennial and millennial scales.

“Paleoecological patterns, ecological processes, modeled scenarios: Crossing scales to understand an uncertain future” will be held on the afternoon of Wednesday, August 12, 2015, from 1:30 PM to 5:00 PM.  We have a great line-up of speakers confirmed, so please remember to add us to your ESA schedule!

Building your network using ORCiD and rOpenSci

Our neotoma package is part of the rOpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from them throughout the coding process.  Recently Scott Chamberlain posted some code for an R package to interface with ORCiD, the rorcid package.

To digress for a second, the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second guessing myself ever since, especially as it became more and more apparent that, in writing proposals and talking about the package and the database I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being, stick an R in it so that you don’t have to keep clarifying the differences.

So, back on point.  A little while ago I posted a network diagram culled from my CV using a bibtex parser in R (the bibtex package by Romain François).  That’s kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious about what it would look like if I went out to the second order: what does the network look like when we combine my publication network with the networks of my collaborators?

Figure 1. A second order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code over and over again and it will fill in as more people sign up to get ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they may have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which ORCiDs can be linked to other objects will be the real test of their staying power.  And even there, it’s not so much whether the IDs can be linked (they’re unique identifiers, so they’re easy to use) as whether other projects, institutions and data repositories will create a space for ORCiDs so that they can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account it seems like people are prepared to invest in ORCiD for the long haul, which is exciting, and provides new opportunities for data analysis and for building research networks.

So, let’s see what we can do with ORCiD and Scott’s rorcid package.  This code is all available in a GitHub repository so you can modify it, fork, push or pull as you like:

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With the ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can be linked back to each of the ORCiDs registered for coauthors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors with ORCiDs.

So, for each published article we get the DOI, find all co-authors on each work who has an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function.  I thought about it but since this was basically a lark I figured I’d think about it later, or leave it up to someone to add to the existing repository (feel free to fork & modify).
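If anyone does want to chase it down the wormhole, here’s a rough, untested sketch of what that recursion might look like, using the get_papers helper defined a little further down in this post (crawl_coauthors is just a name I’ve made up here, and it’s depth-limited so you don’t hammer the API):

crawl_coauthors <- function(orcid, depth = 2, seen = character(0)){
  #  Recursively collect papers for an ORCiD and its co-authors,
  #  stopping 'depth' steps out and skipping IDs we've already visited:
  if(depth == 0 | orcid %in% seen) return(NULL)

  papers <- get_papers(orcid_id(orcid = orcid, profile = "works"))

  coauthors <- setdiff(unique(na.omit(papers$orcid)), c(seen, orcid))

  rbind(papers,
        do.call(rbind.data.frame,
                lapply(coauthors, crawl_coauthors,
                       depth = depth - 1, seen = c(seen, orcid))))
}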

In the end I coded this all up and plotted it using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools):

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rorcid package from the rOpenSci GitHub repository.

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API.  Once we have the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCiD.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information
  #  (using || so we short-circuit before trying to index an empty object):
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 || !'DOI' %in% x[,1]){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now that we’ve got the functions, we can get all my papers, make a list of the unique ORCiDs of my colleagues and then get all of their papers using the same ‘get_papers’ function.  It’s a bit sloppy I think, but I wanted to try to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}

So now we’ve got a list with a data.frame for each author that has three columns: the DOI, the ORCiD and the author’s name.  We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row-by-column intersection represents co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing each author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value for the cell that intersects co-authors, where names are pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)
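As an aside, the ‘fun interactive tools’ I mentioned above include igraph’s tkplot, which (assuming your R build has Tcl/Tk support) lets you drag vertices around by hand; a quick sketch of the same figure, interactively:

#  Interactive version of the same co-author graph; vertices can be
#  repositioned by hand, which helps untangle dense clusters.
tkplot(author.adj, edge.width = E(author.adj)$weight)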

Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion was okay, and what kinds of self-promotion were acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton, HT Terry McGlynn at Small Pond Science) who weighed in on his blog, presenting a table showing that up to 40% of papers in the Biological Sciences remain uncited within the first four years after publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents the opportunity to introduce an exciting solution to the secondary issue: What happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory’.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites Inventory versus those in the Neotoma Database, but that’s a job for another time, and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature.  If the citation patterns reported by Jeff Ollerton hold up, it’s possible that we’re missing 30%+ of the available published data when we’re doing our analyses.  So it’s not only imperative that post-grads work to promote their work, and that funding agencies push PIs to provide sustainable data management plans, but we also need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter PaleoDeepDive (Peters et al., 2014).  PaleoDeepDive is part of the larger, EarthCube-funded GeoDeepDive project, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform.  The system is trained to look for text, tables and figures in domain-specific publications, extract data, build associations and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then assigns scores to the extracted data indicating confidence levels in the associations, which can act as a check on data validity and help in building further relations as new data is acquired.  PaleoDeepDive was used to comb paleobiology journals to pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. "What happens to your work once it is published" becomes a two-part question: how is the paper cited, and how is the data used?  More and more scientists are using public data repositories, although that’s not necessarily the case everywhere, as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community-led data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that are able to search disciplinary archives and collate the data into a single data storage location, in this case the Paleobiology Database.  The problem still remains: how is the data cited?

We’ve run into this citation problem ourselves, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

More coring photos please!

I’ve started posting some of the coring pictures we’ve received from paleoecologists up on the Open Quaternary blog, in an effort to preserve this iconic imagery and to trace the academic genealogy within paleoecology.

I posted this message on the PALEOLIM mailing list:

Hello all, I was hoping to find pictures of coring expeditions.  In particular lake coring, but I’m happy with anything you might have handy.  The older the better, but, I’m also happy to see recent pictures.  These are some of the most iconic pictures of our research, I think it might be nice to start thinking about archiving them for posterity.

I wanted to put them into a presentation at AGU (now passed), but then I just got curious about how far back our collective memory of coring expeditions goes, and how much has really changed out on the coring platform.

If the response is sufficient I’d be happy to put them all together in some form or another (to be decided) and make them available to share.  Perhaps on the Open Quaternary discussion blog (http://openquaternary.wordpress.com).

If you do have older pictures please send them my way, any people you can identify would be appreciated.

I thought that I might get some more responses by passing the message on here as well.  If you have pictures from coring expeditions please pass them on to me.  I’ve started a database for the pictures so that we can keep track of locations, publications, participants and images, in the hopes that as we get more and more we can begin to build a resource that could be linked to something like Neotoma.
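To give a sense of what I mean by a database, here’s a rough sketch of the kind of record I’m keeping for each photo, using the Bear Lake picture below as an example.  The column names are just my working set, nothing formal, and the coordinates are left blank until they can be filled in from the associated publication:

#  One row per photograph; 'participants' is a semicolon-separated list
#  so the table stays flat until something more relational is needed.
coring.photos <- data.frame(site         = 'Bear Lake, Kaibab Plateau',
                            lat          = NA,
                            long         = NA,
                            year         = 1995,
                            participants = 'Dave Larsen; Chengyu Weng; Darren Singer',
                            publication  = 'Weng and Jackson (1999)',
                            image        = 'BearLake_coring',
                            stringsAsFactors = FALSE)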

Obviously early stages, but if people are interested in collaborating then let me know, and in the meantime, dig up those old photos and send them my way!

Figure 1. Coring at Bear Lake on the Kaibab Plateau, 1995.

Here’s a picture sent in by Steve Jackson.  This was a 1995 coring expedition in Bear Lake on the Kaibab Plateau.  From left to right: Dave Larsen, Chengyu Weng, Darren Singer.  Chengyu Weng published the pollen record from this lake in 1999 as part of a multi-site study focusing on the late-Glacial, early-Holocene history of the region (Weng and Jackson, 1999).

Enough Mixers Already.

Figure 1. The Mixer is a ubiquitous feature of academia’s social scene, but they can be complicated for a number of reasons. Photo Credit: Fotokannan, via Wikipedia.

You’re at a University, you want people to hang out, what do you do?

Mixer!

If you’re on a moderately large University campus it’s probably pretty likely that you could hit one or two departmental or institutional mixers a week.  Maybe more if you’re lucky and plugged in to the right group of people.  If you can hit them up relatively frequently then you can relax and socialize with a group of peers, or senior researchers, you can make connections, build informal bonds and, eventually build more social capital with which to establish research collaborations in the long term.  We know mixers are good.  drmellivora exhorts us to go to them on Tenure She Wrote. Every conference, department, and university has them. They must be good to go to. . .

But what if you can’t?  How many times do you need to find a babysitter, or ask your partner to look after the kids so that you can go to a mixer?  And if you can’t, consistently, are you squandering social capital?  A lot has been made about trying to find work-life balance, but extra-curricular mixers live outside of the explicit realm of ‘work’, and yet they can strongly influence work outcomes.

Johns Hopkins University runs a successful Belgian Beer Mixer that is clearly providing benefits to the university and to the researchers who participate, bringing researchers across disciplines together in an informal and relaxed environment, but we have to assume that these mixers aren’t accommodating everyone.  It is far easier to point out, and be assertive about, scheduling conflicts during work hours, and people are more accommodating about finding suitable times.  Extra-curricular socialization is much harder to confront.  It winds up being an invisible ‘pass’.  There’s no re-scheduling until after the kids go to sleep; most people head over after work and leave before dinner.  There’s no re-scheduling until later in the week because your kids live with you every day of the week (which they make up for with unconditional love).  And it’s not just kids; there are hundreds of reasons why people have to take a soft pass at mixers, and most of them don’t involve an unwillingness to make new connections.

Webb and Bartlein (1993) talk about how hanging out drinking beer on the University of Wisconsin’s Terrace helped build team cohesion within the COHMAP project, and it was pointed out in a recent Science of Team Science (SciTS) listserv posting that Stokols et al. (2008) also suggest informal social events as a way to build camaraderie among individuals and help foster team science.  But in Cheruvelil et al. (2014) we point out that one of the first steps to building effective teams is to focus on social sensitivity and emotional engagement.  Granted, this is in the context of existing projects, but mixers, in the half-shadow of work, are not always the best places to support social sensitivity.  Mixers can often be a source of social conflict, and some people may avoid going to them for that reason alone.

Surely there are more ways to bring people together in a semi-formal context that aren’t outside of work hours.  I mean, are mixers really just another way to make us work more?  If we were lawyers we’d definitely be billing, right?  And does it have to involve drinking?  Drinking is such a part of the culture of ecology (for example) that Jeremy Fox frames his dichotomy as cheap vs. non-cheap beer, and, for someone like me who is equally at home in the Earth Sciences, well, forgetaboutit.  WIRED’s article “Why Geologists Love Beer” from the 2009 AGU meeting begins: “Fact: Geologists love beer.”

So what about alternatives to the ever-present Mixer?  Organized, but informal, lunches?  Potlucks are always a bit dangerous, but organized brown-bag lunches with some focal theme can get people sitting with new neighbors, and possibly building new relationships.  They’re also more accessible, since you can schedule them at different times of the day and people are already at work; plus, having the department pitch in for free food is always a good way to get every graduate student to come.

What sorts of suggestions do you have?  What’s been successful in your department, or at a conference/workshop that you’ve attended?  I’d love to hear suggestions.