Explorations in outreach – Creating a Twitter bot for the Neotoma Paleoecological Database.

If you’ve ever been in doubt about whether you chose the right programming language to learn, I want to lay those concerns to rest here.

For many scientists, particularly in Biology or the Earth Sciences, there is often a question about whether you should be learning R, Python, Matlab or something else. Especially when you’re coming into scientific programming in grad school with little prior experience, this might seem like a daunting prospect. You already don’t know anything about anything, and ultimately you wind up learning whatever you’re taught, or whatever your advisor is using, and you wonder. . . Is the grass greener over in Python-land? Those figures look nice, if only I had learned R. . . Why did I learn on an expensive closed platform?

I am here to say “Don’t worry about it”, and I want to emphasize that with an example centered around academic outreach:

The Neotoma Paleoecological Database has had an issue for several years now.  We have had a large number of datasets submitted, but very few people could actively upload datasets to the database.  Neotoma is a live database, which means that not only do new datasets get added, but, as new information becomes available (for example, new taxonomic designations for certain species) datasets get updated.  This means that maintaining the database is very time intensive and there has traditionally been a gap between data ingest and data publication.  To make up for this there has been a data “Holding Tank” where individual records have been available, but this wasn’t the best solution.

Fast forward to about a year ago. Eric Grimm at the Illinois State Museum updated the software package Tilia to give selected data stewards greater access to the database. Each data type (including insects, pollen, mammal fossils, XRF, ostracodes, lake chemistry) has one or a few stewards who can vet and upload datasets directly to the database using the Tilia platform. This has rapidly increased the speed at which datasets enter Neotoma (over the last month more than 200 new datasets have been entered), but it’s still hard to get a sense of this as an outsider since people don’t regularly check the database unless they need data from it.

Which brings us to Twitter. Academics have taken to Twitter like academics on a grant. Buzzfeed has posted a list of 25 Twitter feeds for nerds, Science published a somewhat contentious list of scientists to follow, and I’m on Twitter, so obviously all the cool kids are there. This led me to think that Twitter could be a good platform for publicizing new data uploads to Neotoma. Now I just needed to learn how.

The process is fairly straightforward:

  1. Figure out what the most recently posted Neotoma datasets are:
    • This is made easier with the Neotoma API, which has a specific method for returning datasets: http://ceiwin10.cei.psu.edu/NDB/RecentUploads?months=1
    • You’ll notice (if you click) that the link returns data in a weird format.  This format is called JSON and it has been seen by many as the successor to XML (see here for more details).
  2. Check it against two files: (1) a file of everything that’s been tweeted already, and (2) a file with everything that needs to be tweeted (since we’re not going to tweet everything at once).
  3. Append the new records to the queue of sites to tweet.
  4. Tweet.
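
A minimal sketch of steps 2 through 4, written in R rather than taken from the bot itself. Everything named here is an assumption for illustration: the `recent` data frame stands in for the parsed API response (the real field names may differ), the two vectors stand in for the two files, and `post_tweet` is a hypothetical placeholder for a call to the Twitter API.

```r
# Hypothetical stand-in for the parsed API response: one row per recent dataset.
recent <- data.frame(
  dataset_id = c(101, 102, 103, 104),
  site_name  = c("Lake A", "Bog B", "Lake C", "Cave D"),
  stringsAsFactors = FALSE
)

tweeted_ids <- c(101)  # stands in for file (1): everything already tweeted
queued_ids  <- c(102)  # stands in for file (2): everything waiting to be tweeted

# Steps 2 and 3: anything in neither file is new, so append it to the queue.
is_new     <- !(recent$dataset_id %in% c(tweeted_ids, queued_ids))
queued_ids <- c(queued_ids, recent$dataset_id[is_new])

# Hypothetical posting function; it fails for one site to simulate an API error.
post_tweet <- function(status_text) {
  if (grepl("Cave", status_text)) stop("simulated posting failure")
  invisible(status_text)
}

# Step 4: work through the queue. A failed post stays queued for the next run.
still_queued <- c()
for (id in queued_ids) {
  site        <- recent$site_name[recent$dataset_id == id]
  status_text <- paste("New Neotoma dataset:", site)
  posted <- tryCatch({
    post_tweet(status_text)
    TRUE
  }, error = function(e) FALSE)
  if (posted) {
    tweeted_ids <- c(tweeted_ids, id)
  } else {
    still_queued <- c(still_queued, id)
  }
}
queued_ids <- still_queued
```

For brevity the loop empties the whole queue in one pass; a bot that spaces its tweets out would post one item per run and write both vectors back to disk.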

So that’s it (generally).  I’ve been working in R for a while now, so I have a general sense of how these things might happen. The thing is, these same mechanics translate to other languages as well. The hardest thing about programming (in my opinion) is figuring out how the program ought to flow. Everything else is just window dressing. Once you get more established with a programming language you’ll learn the subtleties of the language, but for hack-y programming, you should be able to get the hang of it regardless of your language background.

As evidence, Neotomabot. The code’s all there; I spent a day figuring out how to program it in Python. But to help myself out I planned it all first using long-hand notes, and then hacked it out using Google, StackOverflow and the Python manual. Regardless, it’s the flow control that’s key. With my experience in R I’ve learned how “for” loops work, I know about “while” loops, I know try-catch methods exist and I know I need to read JSON files and push out to Twitter. Given that, I can map out a program and then write the code, and that gives us Neotomabot.

All the code is available on the GitHub repository here, except for the OAuth handles, but you can learn more about that aspect from this tutorial: How to Write a Twitter Bot. I found it very useful for getting started. There is also an R package, twitteR, and there are several good tutorials for it available (here, and here).

So that’s it. You don’t need to worry about picking the wrong language. Learning the basics of any language, and how to map out the solution to a problem, is the key. Focus on these and you should be able to shift when needed.

R Tips I Wish I Had Learned Earlier – Using Functions

This post is part of a series expanding on suggestions from twitter users and contributors on reddit.com’s rstats forum about tips and tricks that they wish they had learned earlier.

Writing functions in R isn’t something that people start doing right away, and it is something that people often actively avoid since functions often require rewriting code, and, in some cases, mean you have to think hard about how data looks and behaves in the context of your analysis. When you’re starting out a project, or a piece of analysis, it can often seem easier to just cut and paste pieces of code, replacing key variable names. In the long term this kind of cutting and pasting can cause serious problems in analysis.

It’s funny.  We have no problem using functions in our day to day programming: lm, plot, sqrt. These functions simplify our data analysis, we use them over and over, and we’d never think about using the raw code, but when it comes to our analysis we’re willing to cut and paste, and let fairly simple sets of operations run into the hundreds of lines just because we don’t want to (or don’t know how to) add a few simple lines of code to make operations simpler.

Functions can improve your code in a few ways:

  1. Fewer overall lines of code, which makes it easier to collaborate, share & manage analysis
  2. Faster debugging when you find mistakes
  3. Opportunities to make further improvements using lapply, parallel applications and other *ply functions.
  4. Fewer variables in memory, faster processing over time.

So let’s look at what a function does:  A function is a piece of code – a set of instructions – that takes in variables, performs an operation on those variables, and then returns a value.
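
As a small illustration of that definition, here is a sketch of a function that takes one variable in, operates on it, and returns a value. The name `to_percent` and the pollen counts are made up for the example.

```r
# Hypothetical helper: convert raw counts to percentages of the sample total.
to_percent <- function(counts) {
  total <- sum(counts)   # intermediate value, used only inside the function
  100 * counts / total   # the last expression evaluated is the value returned
}

pollen_counts <- c(oak = 30, pine = 50, birch = 20)
percentages   <- to_percent(pollen_counts)
```

The input goes in through the argument `counts`, the work happens in the body, and the result comes back out to be stored in `percentages`.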

For example, the function lm takes in information about two variables, as a formula, and then calculates the linear model relating them. The actual function is 66 lines of code (you can just type lm into your console), but it includes calls to another function, lm.fit that is 67 lines long, and that too includes calls to other functions. I’m sure we can all agree that

null.model <- lm(satisfaction ~ 1)  # the null model
my.postdoc <- lm(satisfaction ~ functions.used)
ideal.postdoc <- lm(satisfaction ~ functions.used + work.life.balance)

is much cleaner and more satisfying than 180 lines of code, and it’s much easier to edit and change.

If you look at your code and you’ve got blocks of code that you use repeatedly it’s time to sit down and do a bit of work. This should be fully reproducible:

library(analogue)  # assumption: these four datasets ship with the analogue package
data(Pollen, Climate, Location, Biome)

#  Cleaning pollen data:
# Is the location in our bounding box (Eastern North America)
good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
Pollen.clip <- Pollen[good.loc ,]
Pollen.trans <- apply(Pollen.clip, 1, sqrt)
pol.mean <- rowMeans(Pollen.trans)
Climate.clip <- Climate[good.loc ,]
Climate.trans <- apply(Climate.clip, 1, scale)
clim.mean <- rowMeans(Climate.trans)
Location.clip <- Location[good.loc,]
Location.mean <- rowMeans(Location.clip)

This is a simple example, and it doesn’t exactly make sense (there’s no reason to average the rows). But, it is clear that for all three datasets (climate, location and pollen) we are doing the same general set of operations.

So, what are the things we are doing that are common?

  1. We’re removing a common set of sample sites
  2. We are applying a function by rows (except Location)
  3. We are averaging by row.

So, this seems like a decent candidate for building a function. Right now we’re at 9 rows, so keep that in mind.

We want to pass in a variable, so we’ll call it x, and we’re applying a function across rows before averaging, so we can pass in a function name as y. Location is a bit tricky since we don’t actually transform it, but there is a function called identity that just returns the argument passed to it, so we could use that. Lastly, we do the rowMeans, so we can pass that out. Let’s try to construct the function:

data(Pollen, Climate, Location, Biome)

clip_mean <- function(x, y){
  #  edit: Gavin Simpson pointed out that I had made an error in my code, calling
  #  apply(x, 1, y) instead of apply(x.clip, 1, y).  The nice thing about writing the
  #  function is that I only needed to correct this once, instead of three times.
  #  edit 2: It was also pointed out by reddit user jeffhughes that x and y are not very 
  #  good variable names, since they're fairly generic.  I agree.  You'd be better off
  #  writing more descriptive variable names.

  good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
  x.clip <- x[good.loc, ]
  x.trans <- apply(x.clip, 1, y)
  rowMeans(x.trans)  # the last expression is the value the function returns
}

pollen.mean <- clip_mean(Pollen, sqrt)
climate.mean <- clip_mean(Climate, scale)
location.mean <- clip_mean(Location, identity)

So we’re still at 9 lines of code, but to me this reads much cleaner. We could go even further by combining the lines below the good.loc assignment into something like this:

rowMeans(apply(x[good.loc,], 1, y))

which would turn our 9 lines of code into 7 fairly clean lines of code.
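
Here is a rough sketch of that condensed form with more descriptive argument names, run on made-up data so it works without the pollen datasets. The toy matrix, the site filter and the argument names (`samples`, `transform`, `keep`) are all assumptions for illustration; the filter is passed in as an argument instead of being looked up from Biome.

```r
# Hypothetical toy data: 4 sites (rows) by 3 taxa (columns).
toy_counts <- matrix(
  c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144),
  nrow = 4,
  dimnames = list(paste0("site", 1:4), c("taxonA", "taxonB", "taxonC"))
)
keep_site <- c(TRUE, FALSE, TRUE, TRUE)  # stands in for good.loc

# Condensed version: clip, transform and average in a single expression.
clip_mean <- function(samples, transform, keep) {
  rowMeans(apply(samples[keep, , drop = FALSE], 1, transform))
}

toy_means <- clip_mean(toy_counts, sqrt, keep_site)

# Because the work lives in a function, lapply can run it over several datasets.
all_means <- lapply(
  list(original = toy_counts, quadrupled = toy_counts * 4),
  clip_mean, transform = sqrt, keep = keep_site
)
```

One thing the toy data makes visible: when the function passed to apply returns a vector, apply with a margin of 1 hands back one column per input row, so the rowMeans call here averages each taxon across the retained sites.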

The other advantage of sticking this all into a function is that all the variables created in clip_mean (e.g., x.trans) only exist within the function. The only thing that gets passed out at the end is the result of the rowMeans call. If we had assigned rowMeans to a variable (using ‘<-’) on the last line, the value would only be returned invisibly, so nothing would print unless we captured the result (it is clearer to pass data out of the function either using the return command, or by directly calling a variable or function on the last line). So, even though we’ve got 9 rows, we have 9 rows and only 8 variables in memory, instead of 13 variables. This is a big help because the more variables you have the harder debugging becomes.
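
A short sketch of the scoping behaviour, using two made-up helper functions. It also shows what happens when the last line of a function is an assignment: R still returns the value, but invisibly, so it only shows up if you capture it.

```r
# Hypothetical helper that ends with a bare expression: the result is visible.
mean_of_doubled <- function(values) {
  doubled_values <- values * 2  # exists only while the function runs
  mean(doubled_values)
}

# Same work, but the last line is an assignment: the value is returned invisibly.
mean_of_doubled_quiet <- function(values) {
  doubled_values <- values * 2
  quiet_mean <- mean(doubled_values)
}

visible_value <- mean_of_doubled(c(1, 2, 3))
quiet_value   <- mean_of_doubled_quiet(c(1, 2, 3))  # still captured when assigned

# withVisible reports whether a value would have printed at the console.
quiet_is_visible <- withVisible(mean_of_doubled_quiet(c(1, 2, 3)))$visible

# Neither local variable leaked into the calling environment.
leaked <- exists("doubled_values", inherits = FALSE) ||
  exists("quiet_mean", inherits = FALSE)
```

Both calls give the same number, but only the first would print if typed at the console on its own, and neither leaves its intermediate variables behind.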

So, to summarize, we’ve got cleaner code, fewer variables and you now look like a total pro.
If you have any tips or suggestions for working with functions please leave them in the comments.

Here are some other resources for creating functions:
Quick-R: http://www.statmethods.net/management/userfunctions.html
Working with Functions: http://en.wikibooks.org/wiki/R_Programming/Working_with_functions
Hadley Wickham’s more advanced resources for functions: http://adv-r.had.co.nz/Functions.html

Thanks again to everyone who contributed suggestions!