Experimenting with ROpenSci and learning to optimize.

I learned about the ritis package when Scott Chamberlain posted about it on twitter a couple weeks ago.  It sounded great, and exactly what I was looking for.  I have a large dataset of plant species that I’ve been fooling around with, but I’ve been hesitant to do anything with it since I’m unsure about the quality of the taxonomy, and, ultimately, tracking down the correct names seemed like a chore I’d rather not attempt.  Not that I’m generally lazy, I’m just lazy about some tasks.

Anyway, ritis looked great.  It used the API from the Integrated Taxonomic Information System to return either taxonomic serial numbers (TSNs) from scientific names or scientific names from TSNs.  It does a lot more than that, but this was the initial appeal.  The nice thing about the TSN is that it is a unique identifier for a species that can plug into the ITIS API (now through the wonderful ritis R wraper) so you can retain the original taxonomic nomenclature from the field, and update it automatically through the ITIS API.  If you want to be able to keep a data trail from field to publication this is pretty important.  I also like its use of plyr, which is a great package I’ve just been learning about as well.

To make a long story short, the code works great.  The biggest problem I’ve encountered is that the ITIS servers are slow.  Painfully slow in some cases.  The slow speeds are probably magnified even further since I live in the one part of town that has really bad Internet access.

No chance I'd ever get it though.
Fig 1. A phylogeny for a very grassy knoll, recently burned.

I eventually got my data to a point where I was confident in the TSNs for each taxon (after a bit of automation and a bunch of manual entry).  From there I wanted to go back to the ITIS to find the most recent classification so I can then go on and plug it into phylomatic.  This is part of the work flow that would be most important for replicability.  If someone wanted to run the same analysis in three years, they ought to be able to do it with my code.  Ultimately, trying to build the phylogeny (see Fig. 1) using taxize didn’t work.  I have too many taxa for the call out to work so I’ll have to work around that.  I want to build a master phylogeny and then subsample it, rather than building hundreds of tiny phylogenies.

Anyway, when I initially ran the loop to do all this I kept getting errors when the connection would time out and it took a very long time to run.  To simplify the process, and to save time, I only called ITIS when a genus hadn’t been encountered before.  If I had seen the genus before I just copied its information into the table for the new species.  Here’s the code, input.table is a table with TSN codes and species names:


taxon.table <- matrix(ncol=4, nrow=nrow(input.table))
# Four columns in taxon.table: Order, Family, Genus, Species

for(i in 1:nrow(input.table))
{

if(is.na(taxon.table[i,1])){
#  If the taxon has a space in it then it is a species, otherwise it's some higher level code.

space <- regexpr(' ', input.table[i,2])

# If it is a species, see if the genus has been run before (column three of the taxon table)
# in which case we can just copy everything over and forget about the itis!
if(space > 0){ in.set <- substr(input.table[i,2], 0, space - 1) %in% taxon.table[,3] }
else         {  in.set <- FALSE }
# If it's the first time the genus has been seen:
if(in.set == FALSE){
z <- 0       test <- try(classification(input.table[i,4]))
# If we get an error (http time-out) then keep trying, up to 5 times.
while(class(test) == 'try-error' & z < 5){
test <- try(classification(input.table[i,4]))
z <- z + 1
cat('.')
}

if(!class(test) == 'try-error'){
if('Species' %in% test$rank) taxon.table[i, 4] <- as.character(test[which(test$rank == 'Species'),2])
if('Genus' %in% test$rank) taxon.table[i, 3] <- as.character(test[which(test$rank == 'Genus'),2])
if('Family' %in% test$rank) taxon.table[i, 2] <- as.character(test[which(test$rank == 'Family'),2])
if('Order' %in% test$rank) taxon.table[i, 1] <- as.character(test[which(test$rank == 'Order'),2])
# I don't know if this is important, but maybe there's no order?! Just for completeness.
missing <- which(is.na(taxon.table[i,-1]))

for(j in missing){
taxon.table[i, j + 1] <- taxon.table[i, j]
}
}
}
else{
# So if we've seen the genus before, copy the Order, Family and Genus data from another occurrence.
existing <- which(taxon.table[,3] %in% substr(input.table[i,2], 0, space - 1))[1]
taxon.table[i,] <- c(taxon.table[existing,1:3], as.character(input.table[i,2]))
}
}
cat(i, ': ', taxon.table[i,], '\n')
}

That’s it.  Did I waste my time?  It worked really well, I got no errors, and I was able to let it run overnight and didn’t have to worry about it throwing an error.  If I had it to do over again I’d keep writing out taxon.table, but that’s minor.

NOTE: I’m not sure how to format code nicely on this blog. Advice would be welcome!

Advertisements

Published by

downwithtime

Assistant scientist in the Department of Geography at the University of Wisconsin, Madison. Studying paleoecology and the challenges of large data synthesis.

4 thoughts on “Experimenting with ROpenSci and learning to optimize.”

  1. Hi Simon, Nice work.

    I think we may be able to figure out a workaround when users have large numbers of API calls they want to make to the ITIS servers. I am looking into the ITIS bulk download in SQL format, which then can be queried from R. I’ll have to modify the current functions or possibly make new ones. We probably wouldn’t provide the SQL database in the pkg, but just direct users to where to download it. Users could then update it once per month (as ITIS updates the bulk download once per month), or whenever they want.

    Did you use functions in the taxize package to query the phylomatic API?

    I know wordpress.org has syntax highlighting, but I don’t think wordpress.com does. Hence, my move to github jekyll hosting!

    Scott

    1. Thanks Scott. I have been using taxize to query phylomatic. I think I mentioned that trying to generate the large regional tree (~3500 taxa) gave me an error that seems to have come from phylomatic rather than taxize (I looked through the R code and couldn’t find the error message). I think I’m just going to download phylomatic and run it off my computer using a ‘system’ call. I guess taxize doesn’t interface with phylomatic on a home computer does it?

      1. Cool, glad you’re using it. If you mean query phylomatic from your machine (i.e. no API), then no, taxize doesn’t do that. Yeah, you could just query the master tree itself on your own machine I suppose. Do you mind sending me the file/code just for getting the tree so I can see if it also doesn’t work on my machine? I’ll ask Cam Webb to see if there is an API limit on number of taxa and any workaround.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s