Building your network using ORCiD and ROpenSci

Our neotoma package is part of the ROpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from them throughout the coding process.  Recently Scott Chamberlain posted some code for an R package to interface with ORCiD, the rORCiD package.

To digress for a second, the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second guessing myself ever since, especially as it became more and more apparent that, in writing proposals and talking about the package and the database I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being, stick an R in it so that you don’t have to keep clarifying the differences.

So, back on point.  A little while ago I posted a network diagram culled from my cv using a bibtex parser in R (the bibtex package by Roman François).  That’s kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious about what it would look like if I went out to the second order, what does it look like when we combine my publication network with the networks of my collaborators.

Figure 1.  A second order co-author network generated using R and ORCiD's public API.
Figure 1. A second order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code over and over again and it will fill in as more people sign up to get ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they may have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which the ORCiDs can be linked to other objects will be the real test for their staying power.  And even there, it’s not so much whether the IDs can be linked, they’re unique identifiers so they’re easy to use, it’s whether other projects, institutions and data repositories will create a space for ORCiDs so that the can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account it seems like people are prepared to invest in ORCiD for the long haul, which is exciting, and provides new opportunities for data analysis and for building research networks.

So, lets see what we can do with ORCiD and Scott’s rorcid package. This code is all available in a GitHub repository so you can modify it, fork, push or pull as you like:

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With the ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can be linked back to each of the ORCiDs registered for coauthors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors with ORCiDs.

So, for each published article we get the DOI, find all co-authors on each work who has an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function.  I thought about it but since this was basically a lark I figured I’d think about it later, or leave it up to someone to add to the existing repository (feel free to fork & modify).

In the end I coded this all up and plotted using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools:

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rOrcid package from the rOpenSci GitHub repository

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API. Once we ave the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCID.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 | (!'DOI' %in% x[,1])){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now we’ve got the functions, we’re going to get all my papers, make a list of the unique ORCIDs of my colleagues and then get all of their papers using the same ‘get_papers’ function. It’s a bit sloppy I think, but I wanted to try to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}

So now we’ve got a list with a data.frame for each author that has three columns, the DOI, the ORCID and their name. We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row x column intersection represents co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing each author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value for the cell that intersects co-authors, where names are pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)
Advertisements

Opening access to age models through GitHub

As part of PalEON we’ve been working with a lot of chronologies for paleoecological reconstruction (primarily Andria Dawson at UC-Berkeley, myself, Chris Paciorek at UC-Berkeley and Jack Williams at UW-Madison). I’ve mentioned before the incredible importance of chronologies in paleoecological analysis. Plainly speaking, paleoecological analysis means little without an understanding of age.  There are a number of tools that can be used to analyse, display and understand chronological controls and chronologies for paleoecological data. The Cyber4Paleo webinars, part of the EarthCube initiative, have done an excellent job of representing some of the main tools, challenges and advances in understanding and developing chronologies for paleoecological and geological data.  One of the critical issues is that the benchmarks we use to build age models change through time.  Richard Telford did a great job of demonstrating this in a recent post on his (excellent) blog.  These changes, and the diversity of age models out there in the paleo-literature means that tools to semi-automate the generation of chronologies are becoming increasingly important in paleoecological research.

Figure 1.  Why is there a picture of clams with bacon on them associated with this blog post?  Credit: Sheri Wetherell (click image for link)
Figure 1. Why is there a picture of clams with bacon on them associated with this blog post? Credit: Sheri Wetherell (click image for link)

Among the tools available to construct chronologies is a set of R scripts called ‘clam‘ (Blaauw, 2010). This is heavily used software and provides the opportunity to develop age models for paleo-reconstructions from a set of dates along the length of the sedimentary sequence.

One of the minor frustrations I’ve had with this package is that it requires the use of a fixed ‘Cores’ folder. This means that separate projects must either share a common ‘clam’ folder, so that all age modelling happens in the same place, or that the ‘clam’ files need to be moved to a new folder for each new project. Not a big deal, but also not the cleanest.

Working with the rOpenSci folks has really taught me a lot about building and maintaining packages. To me, the obvious solution to repeatedly copying files from one location to the other was to put clam into a package. That way all the infrastructure (except a Cores folder) would be portable. The other nice piece was that this would mean I could work toward a seamless integration with the neotoma package. Making the workflow: “data discovery -> data access -> data analysis -> data publication” more reproducible and easier to achieve.

To this end I talked with Maarten several months ago, started, stopped, started again, and then, just recently, got to a point where I wanted to share the result. ‘clam‘ is now built as an R package. Below is a short vignette that demonstrates the installation and use of clam, along with the neotoma package.


#  Skip these steps if one or more packages are already installed.  
#  For the development packages it's often a good idea to update frequently.

install.packages("devtools")
require(devtools)
install_github("clam", "SimonGoring")
install_github("neotoma", "ropensci")
require(clam)
require(neotoma)

# Use this, and change the directory location to set a new working directory if you want.  We will be creating
# a Cores folder and new files &amp; figures associated with clam.
setwd('.')  

#  This example will use Three Pines Bog, a core published by Diana Gordon as part of her work in Temagami.  It is stored in Neotoma with
#  dataset.id = 7.  I use it pretty often to run things.

threepines <- get_download(7)

#  Now write a clam compatible age file (but make a Cores directory first)
if(!'Cores' %in% list.files(include.dirs=TRUE)){
  dir.create('Cores')
}

write_agefile(download = threepines[[1]], chronology = 1, path = '.', corename = 'ThreePines', cal.prog = 'Clam')

#  Now go look in the 'Cores' directory and you'll see a nicely formatted file.  You can run clam now:
clam('ThreePines', type = 1)

The code for the ‘clam’ function works exactly the same way it works in the manual, except I’ve added a type 6 for Stineman smoothing. In the code above you’ve just generated a fairly straightforward linear model for the core. Congratualtions. I hope you can also see how powerful this workflow can be.

A future step is to do some more code cleaning (you’re welcome to fork or collaborate with me on GitHub), and, hopefully at some point in the future, add the funcitonality of Bacon to this as part of a broader project.

References

Blaauw, M., 2010. Methods and code for ‘classical’ age-modelling of radiocarbon sequences. Quaternary Geochronology 5: 512-518

EarthCube webinars and the challenges of cross-disciplinary Big Data.

EarthCube is a moon shot.  It’s an effort to bring communities broadly supported through the NSF Geosciences Directorate and the Division of Advanced Cyberinfrastructure together to create a framework that will allow us to understand our planet (and solar system) in space and in time using data and models generated across a spectrum of disciplines, and spanning scales of space and time.  A lofty goal, and a particularly complex one given the fragmentation of many disciplines, and the breadth of researchers who might be interested in participating in the overall project.

To help support and foster a sense of community around EarthCube the directorate has been sponsoring a series of webinars as part of a Research Coordination Network called “Collaboration and Cyberinfrastructure for Paleogeosciences“, or, more simply C4P.  These webinars have been held every other Tuesday from 4 – 5pm Eastern, but are archived on the webinar website (here).

The Neotoma Paleoecological Database was featured as part of the first webinar.  Anders Noren talked about the cyber infrastructure required to support LacCore‘s operations, and Shanan Peters talks about an incredible text mining initiative (GeoDeepDive) in one of the later webinars.

Image
Fig 1. The flagship for the Society for American Pedologists has run aground and it is now a sitting duck for the battle machine controlled by the Canadian Association for Palynologists in the third war of Data Semantics.

It’s been interesting to watch these talks and think about both how unique each of these paleo-cyberinfrastructure projects is, but also how much overlap there is in data structure, use, and tool development.  Much of the struggle for EarthCube is going to be developing a data interoperability structure and acceptable standards across disciplines.  In continuing to develop the neotoma package for R I’ve been struggling to understand how to make the data objects we pull from the Neotoma API interact well with standard R functions, and existing R packages for paleoecological data.  One of the key questions is how far do we go in developing our own tools before that tool development creates a closed ecosystem that cuts off outside development?  If I’m struggling with this question in one tiny disciplinary nook, imagine the struggle that is going to occur when geophysicists and paleobotanists get together with geochonologists and pedologists!

Interoperability of these databases needs to be a key goal.  Imagine the possibilities if we could link modern biodiversity databases with Pleistocene databases such as Neotoma, and then to deep time databases like the Paleobiology Database in a seamless manner.  Big data has clearly arrived in some disciplines, but the challenges of creating big data across disciplines is just starting.

Open Science, Reproducibility, Credit and Collaboration

I had the pleasure of going up to visit the Limnological Research Center (LRC) at the University of Minnesota this past week. It’s a pretty cool setup, and obviously something that we should all value very highly, both as a resource to help prepare and sample sediment cores, and as a repository for data. The LRC has more than 4000 individual cores, totaling over 13km of lacustrine and marine sediment. A key point here is that much of this sediment is still available to sample, but, this is still data in its rawest, unprocessed form. Continue reading Open Science, Reproducibility, Credit and Collaboration