Make a Cool $300 (CAD) in Three Easy Steps The CAP Way

Mary Vetter, the Treasurer of the Canadian Association of Palynologists, passed this message on through our mailing list:

The Canadian Association of Palynologists Annual Student Research Award was established in 2009 to recognize students’ contributions to palynological research. The award is open to any undergraduate or graduate student who is a member, in good standing, of CAP, regardless of their nationality or country of residence. The intent of the research award is to support student research with a strong palynological component. The award consists of a three-year membership in the Association and $300 CDN, to be put toward some aspect of the student’s research.

The application should consist of: (1) a one-page statement outlining the nature of the research project, its scientific importance, the approximate timeline to completion of the project, and the aspect of the research the funds would be directed toward; (2) a CV; and (3) a letter of support from the student’s supervisor.

Applications may be submitted in French or English and should be submitted by email. Completed applications are due by March 15, 2015.

Submit applications by e-mail to Dr Francine McCarthy, CAP President (fmccarthy[at]brocku[dot]ca)

Note: Only one award will be given per year, and there will be no limit to the number of times a student can submit an application.

Joining the Canadian Association of Palynologists is fairly straightforward: you can get an application here, and you don’t even need to be Canadian. With membership you get the twice-yearly newsletter, an invitation to our annual meetings, and the chance to join a small but friendly group of researchers interested in all things small, organic-walled and fossilized.

If you know any students who might be interested please pass this along. Thanks!

PalEON has a video

In keeping with the theme of coring pictures I wanted to share PalEON’s new video, produced by the Environmental Change Initiative at the University of Notre Dame.  It does a good job of explaining what PalEON does and what we’re all about.  There’s also a nice sequence, starting at about 2:40, where you get to see a “frozen finger” corer in action.  We break up dry ice, create a slurry with alcohol, and then lower the corer into the lake sediment.

Once the sediment has frozen to the sides of the corer (about 10 – 15 minutes) we bring the corer up and remove the slabs of ice from the sides, keeping track of position, dimensions and orientation so that we can piece it back together.  I’m on the platform with Jason McLachlan and Steve Jackson.

There’s a great section in there about the sociology of the PalEON project as well, although it’s brief.  So take a second and watch the video; it’s great!

The advantages of taking a chance with a new journal – Open Quaternary

Full disclosure: I’m on the editorial board of Open Quaternary and also manage the blog, but I am not an Editor in Chief and have attempted to ensure that my role as an author and my role as an editor did not conflict.

Figure 1. Neotoma and R together at last!

We (myself, Andria Dawson, Gavin L. Simpson, Eric Grimm, Karthik Ram, Russ Graham and Jack Williams) have a paper in press at a new journal called Open Quaternary.  The paper documents an R package that we developed in collaboration with rOpenSci to access and manipulate data from the Neotoma Paleoecological Database.  The package grew in part out of the needs of the PalEON project: we needed a dynamic way to access pollen data from Neotoma, so that analysis products could be updated as new data entered the database.  We also wanted to exploit the new API developed by Brian Bills and Michael Anderson at Penn State’s Center for Environmental Informatics.
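To give a sense of what that dynamic access looks like, here is a minimal sketch of the kind of workflow the package supports.  The function names come from the neotoma package itself, but the site name is just an example and the exact arguments may differ between package versions, so treat this as illustrative rather than definitive.

library(neotoma)

#  Search for sites by name, find the pollen datasets associated with them,
#  and pull down the full records.  Re-running this later picks up any new
#  data added to Neotoma in the meantime.
marion.sites <- get_site(sitename = 'Marion Lake%')
marion.data  <- get_dataset(marion.sites, datasettype = 'pollen')
marion.rec   <- get_download(marion.data)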

There are lots of opinions about where to submit journal articles.  Nature’s Research Highlights has a nice summary of a new article in PLoS One (Salinas and Munch, 2015) that tries to identify optimal journals for submission, and Dynamic Ecology discussed the question back in 2013 in a post that drew considerable attention (here, here, and here, among others).  When we thought about where to submit, I made a conscious choice to go with an Open Access journal. I chose Open Quaternary partly because I’m on the editorial board, but also because I believe that domain-specific journals are still a critical part of the publishing landscape, and because I believe in Open Access publishing.

The downside of this decision was that (1) the journal is new, so there’s a risk that people don’t know about it and it’s less ‘discoverable’; and (2) even though it’s supported by an established publishing house (Ubiquity Press), it won’t receive an impact factor until it’s relatively well established.  We can argue that impact factors shouldn’t make a difference, but it’s hard not to believe that they do.

Figure 2. When code looks crummy it’s not usable. This has since been fixed.

That said, I’m willing to invest in my future and (hopefully!) the future of the discipline, and we’ve already seen a clear advantage of investing in Open Quaternary.  During the revision of our proofs we noticed that the journal’s two-column format wasn’t well suited to the blocks of code that we presented to illustrate examples in our paper.  We also lost the nice color syntax highlighting that pandoc offers when it renders RMarkdown documents (see examples in our paper’s markdown file).  With the help of the journal’s Publishing Assistant Paige MacKay, Editor in Chief Victoria Herridge and my co-authors, we were able to get the journal to publish the article in a single-column format, with syntax highlighting supported using highlight.js.

I may not have a paper in Nature, Science or Cell (the other obvious options for this paper /s), but by contributing to the early stages of a new open access publishing platform I was able to help shape its standards: future contributions will be more readable, my own paper is accessible and readable, and the technical solution we present is easily implemented.

I think that’s a win.  The first issue of Open Quaternary should be out in March; until then you can check out our GitHub repository or the PDF as submitted (compleate with typoes).

Cross-scale ecology at the Ecological Society of America Meeting in Baltimore!

Our Organized Oral Session has been approved and a date has been assigned.  ESA 2015 is getting closer every day (the abstract deadline is coming up on February 26th!), and with it the centennial celebration of the Ecological Society of America.  We’ve managed to recruit a great group of speakers to talk about ecological research that crosses scales of time, rather than space.  Many of these studies share approaches with what we generally consider to be ‘cross-scale’ ecology, which tends to be spatially focused, but they must also deal with the additional complexity of temporal uncertainty and changing relationships between communities, climate, biogeochemical cycling and disturbance at decadal, centennial and millennial scales.

“Paleoecological patterns, ecological processes, modeled scenarios: Crossing scales to understand an uncertain future” will be held on the afternoon of Wednesday, August 12, 2015, from 1:30 PM to 5:00 PM.  We have a great lineup of speakers confirmed, so please remember to add us to your ESA schedule!

Building your network using ORCiD and rOpenSci

Our neotoma package is part of the rOpenSci network of packages.  Wrangling data structures and learning some of the tricks we’ve implemented wouldn’t have been possible without help from them throughout the coding process.  Recently Scott Chamberlain posted some code for an R package to interface with ORCiD: the rorcid package.

To digress for a second: the neotoma package started out as rNeotoma, but I ditched the ‘r’ because, well, just because.  I’ve been second-guessing myself ever since, especially as it has become more and more apparent that, in writing proposals and talking about the package and the database, I’ve basically created a muddle.  Who knows, maybe we’ll go back to rNeotoma when we push up to CRAN.  Point being, stick an R in it so that you don’t have to keep clarifying the difference.

So, back on point.  A little while ago I posted a network diagram culled from my CV using a BibTeX parser in R (the bibtex package by Romain François).  That’s kind of fun – obviously worth blogging about – and I stuck a newer version into a job application, but I’ve really been curious about what it would look like if I went out to the second order: what happens when we combine my publication network with the networks of my collaborators?
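For anyone curious about that first-order step, here is a minimal, untested sketch of how you might pull co-author lists out of a BibTeX file with the bibtex package; my_cv.bib is a hypothetical file name.

library(bibtex)

#  Read the .bib file and pull a formatted author list out of each entry.
#  Each element of 'authors' is one paper's co-author list, which is the raw
#  material for a first-order co-author network.
entries <- read.bib('my_cv.bib')
authors <- lapply(entries, function(x) format(x$author))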

Figure 1. A second order co-author network generated using R and ORCiD’s public API.  Because we’re using the API we can keep re-running this code over and over again and it will fill in as more people sign up to get ORCiDs.

Enter ORCiD.  For those of you not familiar, ORCiD provides a unique identity code to an individual researcher.  The researcher can then identify all the research products they have published and link these to their ID.  It’s effectively a DOI for the individual.  Sign up and you are part of the Internet of Things.  In a lot of ways this is very exciting.  The extent to which ORCiDs can be linked to other objects will be the real test of their staying power.  And even there, it’s not so much whether the IDs can be linked (they’re unique identifiers, so they’re easy to use); it’s whether other projects, institutions and data repositories will create a space for ORCiDs so that they can be linked across a web of research products.

Given the number of times I’ve been asked to add an ORCiD to an online profile or account, it seems like people are prepared to invest in ORCiD for the long haul, which is exciting and provides new opportunities for data analysis and for building research networks.

So, let’s see what we can do with ORCiD and Scott’s rorcid package.  This code is all available in a GitHub repository, so you can modify it, fork, push or pull as you like:

The idea is to start with a single ORCiD, mine in this case (0000-0002-2700-4605).  With that ORCiD we then discover all of the research products associated with the ID.  Each research product with a DOI can then be linked back to the ORCiDs registered for its coauthors using the ORCiD API.  It is possible to find all co-authors by parsing some of the bibtex files associated with the structured data, but for this exercise I’m just going to stick with co-authors who have ORCiDs.

So, for each published article we get the DOI, find all co-authors on that work who have an ORCiD, and then track down each of their publications and co-authors.  If you’re interested you can go further down the wormhole by coding this as a recursive function (a rough sketch follows below).  I thought about it, but since this was basically a lark I figured I’d think about it later, or leave it up to someone to add it to the existing repository (feel free to fork & modify).
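For the curious, here is an untested sketch of what that recursive version might look like.  It leans on the get_papers() helper defined further down in this post, and crawl_orcid() and its depth argument are my own invention rather than anything in rorcid.

crawl_orcid <- function(orcid, depth = 2, seen = character(0)){
  #  Stop when we run out of depth, or when we've already visited this ORCiD
  #  on the way down this branch (avoids chasing cycles):
  if(depth == 0 || orcid %in% seen){
    return(NULL)
  }
  papers <- get_papers(orcid_id(orcid = orcid, profile = "works"))
  seen   <- c(seen, orcid)
  #  Recurse into every co-author ORCiD we haven't seen yet:
  deeper <- lapply(setdiff(unique(na.omit(papers$orcid)), seen),
                   crawl_orcid, depth = depth - 1, seen = seen)
  do.call(rbind.data.frame, c(list(papers), deeper))
}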

In the end I coded this all up and plotted it using the igraph package (I used network for my last graph, but wanted to try out igraph because it’s got some fun interactive tools):

library(devtools)
install_github('ropensci/rorcid')

You need devtools to be able to install the rorcid package from the rOpenSci GitHub repository.

library(rorcid)
library(igraph)

# The idea is to go into a user and get all their papers, 
# and all the papers of people they've published with:

simon.record <- orcid_id(orcid = '0000-0002-2700-4605', 
                         profile="works")

This gives us an ‘orcid’ object, returned using the ORCiD Public API. Once we have the object we can go in and pull out all the DOIs for each of my research products that are registered with ORCiD.

get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'
  
  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      #  Use || so we short-circuit before trying to index an empty object:
      if(length(x) == 0 || (!'DOI' %in% x[,1])){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))
  
  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = papers[x,2], fuzzy = FALSE))
    # sometimes there's no DOI
    # if that's the case then just return NA:
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$data$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$data$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$data$'personal-details.family-name.value', 
                                sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)
  
}

So now that we’ve got the functions, we can get all my papers, make a list of the unique ORCiDs of my colleagues and then get all of their papers using the same ‘get_papers’ function. It’s a bit sloppy I think, but I wanted to avoid duplicate calls to the API since my internet connection was kind of crummy.

simons <- get_papers(simon.record)

unique.orcids <- unique(simons$orcid)

all.colleagues <- list()

for(i in 1:length(unique.orcids)){
  all.colleagues[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}

So now we’ve got a list with a data.frame for each colleague; each data.frame has three columns: the paper’s DOI, and the ORCiD and name of each co-author on that paper. We want to reduce this to a single data.frame and then fill a square matrix (each row and column represents an author) where each row-by-column intersection counts co-authorship.


all.df <- do.call(rbind.data.frame, all.colleagues)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]
  
  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <- 
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]

diag(all.pairs) <- 0

Again, probably some lazy coding in the ‘for’ loop, but the point is that each row and column has a dimname representing an author, so row 1 is ‘Simon Goring’ and column 1 is also ‘Simon Goring’. All we’re doing is incrementing the value of each cell where two co-authors intersect, with names pulled from all individuals associated with each unique DOI. We end by plotting the whole thing out:


author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
#  Plot so that the width of the lines connecting the nodes reflects the
#  number of papers co-authored by both individuals.
#  This is Figure 1 of this blog post.
plot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)
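Since igraph’s interactive tools were part of the appeal, it is worth noting (as a small optional extra, not part of the figure above) that the same graph can be opened in an interactive window:

#  Optional: tkplot() opens the graph in a small interactive window where
#  nodes can be dragged around, which helps untangle the hairball.
tkplot(author.adj, vertex.label.cex = 0.8, edge.width = E(author.adj)$weight)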

Recovering dark data, the other side of uncited papers.

A little while ago, on Dynamic Ecology, a question was posed about how much self-promotion is okay, and what kinds of self-promotion are acceptable.  The results were interesting, as was the discussion in the comments.  Two weeks ago I also noticed a post by Jeff Ollerton (at the University of Northampton; HT Terry McGlynn at Small Pond Science), who weighed in on his blog, presenting a table showing that up to 40% of papers in the Biological Sciences remain uncited within the first four years after publication, with higher rates in Business, the Social Sciences and the Humanities.  The post itself is written more for the post-grad who is keen on getting their papers cited, but it presents an opportunity to introduce an exciting solution to a secondary issue: what happens to data after publication?

In 1998 the Neotoma Paleoecological Database published an ‘Unacquired Sites Inventory’.  These were paleoecological sites (sedimentary pollen records, representing vegetation change over centuries or millennia) for which published records existed, but that had not been entered into the Neotoma Paleoecological Database or the North American Pollen Database.  Even accounting for the fact that the inventory represents a snapshot that ends in 1998, it still contains sites that are, on average, older than sites contained within the Neotoma Database itself (see this post by yours truly).  It would be interesting to see the citation patterns of sites in the Unacquired Sites Inventory versus those in the Neotoma Database, but that’s a job for another time and, maybe, a data rescue grant (hit me up if you’re interested!).

Figure 1. Dark data. There is likely some excellent data down this dark pathway, but it’s too spooky for me to want to access it, let’s just ignore it for now. Photo Credit: J. Illingworth.

Regardless, citation patterns are tied to data availability (Piwowar and Vision, 2013), but the converse is also likely to be true.  There is little motivation to make data available if a paper is never cited, particularly an older paper, and little motivation for the community to access or request that data if no one knows about the paper.  This is how data goes dark.  No one knows about the paper, no one knows about the data, and large synoptic analyses miss whole swaths of the literature. If the citation patterns Jeff Ollerton reports hold up, it’s possible that we’re missing 30% or more of the available published data when we do these analyses.  So it’s not only imperative that post-grads work to promote their work and that funding agencies push PIs to provide sustainable data management plans; we also need to work to unearth that ‘dark data’ in a way that provides sufficient metadata to support secondary analysis.

Figure 2. PaleoDeepDive body size estimates generated from a publication corpus (gray bars) versus estimates directly assimilated and entered by humans. Results are not significantly different.

Enter PaleoDeepDive (Peters et al., 2014).  PaleoDeepDive is part of the larger, EarthCube-funded GeoDeepDive project, headed by Shanan Peters at the University of Wisconsin and built on the DeepDive platform. The system is trained to look for text, tables and figures in domain-specific publications, extract data, build associations, and recognize that there may be errors in the published data (e.g., misspelled scientific names).  The system then assigns confidence scores to the extracted data and its associations, which act as a check on data validity and help in building further relations as new data is acquired.  PaleoDeepDive was used to comb paleobiology journals and pull out occurrence data and morphological characteristics for the Paleobiology Database.  In this way PaleoDeepDive brings uncited data back out of the dark and pushes it into searchable databases.

These kinds of systems are potentially transformative for the sciences. “What happens to your work once it is published?” becomes a two-part question: how is the paper cited, and how is the data used? More and more scientists are using public data repositories, although that’s not necessarily the case everywhere, as Caetano and Aisenberg (2014) show for animal behaviour studies, and fragmented use of data repositories (supplemental material vs. university archives vs. community-led data repositories) means that data may still lie undiscovered.  At the same time, the barriers to older data sets are being lowered by projects like PaleoDeepDive that can search disciplinary archives and collate the data into a single storage location, in this case the Paleobiology Database. The problem remains: how is the data cited?

We’ve run into this citation problem ourselves, not just with data but with R packages as well.  Artificial reference limits in some journals preclude full citations, pushing them into web-only appendices that aren’t tracked by the dominant scholarly search engines.  That, of course, is a discussion for another time.

More coring photos please!

I’ve started posting some of the coring pictures we’ve received from paleoecologists up on the Open Quaternary blog, in an effort to preserve this iconic imagery and to trace the academic genealogy within paleoecology.

I posted this message on the PALEOLIM mailing list:

Hello all, I was hoping to find pictures of coring expeditions, in particular lake coring, but I’m happy with anything you might have handy.  The older the better, but I’m also happy to see recent pictures.  These are some of the most iconic pictures of our research, and I think it might be nice to start thinking about archiving them for posterity.

I wanted to put them into a presentation at AGU (now passed), but then I just got curious about how far back our collective memory of coring expeditions goes, and how much has really changed out on the coring platform.

If the response is sufficient I’d be happy to put them all together in some form or another (to be decided) and make them available to share.  Perhaps on the Open Quaternary discussion blog (http://openquaternary.wordpress.com).

If you do have older pictures please send them my way; any people you can identify would be appreciated.

I thought that I might get some more responses by passing the message on here as well.  If you have pictures from coring expeditions please pass them on to me.  I’ve started a database for the pictures so that we can keep track of locations, publications, participants and images, in the hope that as the collection grows it could be linked to something like Neotoma.

Obviously this is at an early stage, but if people are interested in collaborating then let me know.  In the meantime, dig up those old photos and send them my way!


Here’s a picture sent in by Steve Jackson.  This was a 1995 coring expedition on Bear Lake on the Kaibab Plateau.  From left to right: Dave Larsen, Chengyu Weng, Darren Singer.  Chengyu Weng published the pollen record from this lake in 1999 as part of a multi-site study focusing on the late-glacial and early-Holocene history of the region (Weng and Jackson, 1999).