R Tips I Wish I Had Learned Earlier – Using Functions

This post is part of a series expanding on suggestions from twitter users and contributors on reddit.com’s rstats forum about tips and tricks that they wish they had learned earlier.

Writing functions in R isn’t something most people start doing right away, and it’s something people often actively avoid: functions require rewriting code and, in some cases, force you to think hard about how your data looks and behaves in the context of your analysis. When you’re starting a project, or a piece of analysis, it can seem easier to just cut and paste blocks of code, replacing key variable names. In the long term, though, this kind of cutting and pasting causes serious problems in an analysis.

It’s funny: we have no problem using functions in our day-to-day programming (lm, plot, sqrt). These functions simplify our data analysis, we use them over and over, and we’d never think of retyping the raw code. But when it comes to our own analysis we’re happy to cut and paste, letting fairly simple sets of operations sprawl into hundreds of lines, just because we don’t want to (or don’t know how to) add a few simple lines of code to make things simpler.

Functions can improve your code in a few ways:

  1. Fewer overall lines of code, making an analysis easier to collaborate on, share and manage
  2. Faster debugging when you find mistakes
  3. Opportunities for further improvements using lapply, parallel versions and the other *apply functions
  4. Fewer variables in memory, and faster processing over time
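Point 3 is worth a quick illustration. As a toy sketch (with made-up data, not from the examples below): once a transformation lives in a function, a single lapply call applies it to a whole list of datasets, with no copying and pasting.

```r
# A small transformation wrapped in a function (hypothetical example).
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Two made-up datasets in a list.
datasets <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# One line applies the function to every dataset in the list.
scaled <- lapply(datasets, standardize)
```

Here `scaled$a` and `scaled$b` each come out as `c(-1, 0, 1)`, because both vectors are one standard deviation apart around their mean.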

So let’s look at what a function does:  A function is a piece of code – a set of instructions – that takes in variables, performs an operation on those variables, and then returns a value.
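In its simplest form (a toy example, not from the analysis below) that looks like this:

```r
# A minimal function: takes a variable in, operates on it, returns a value.
root_mean <- function(x) {
  sqrt(mean(x))   # the last expression evaluated is returned automatically
}

root_mean(c(2, 8, 2, 4))   # returns 2, since mean is 4 and sqrt(4) is 2
```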

For example, the function lm takes in information about two variables, as a formula, and then fits the linear model relating them. The function itself is 66 lines of code (you can see it by typing lm into your console), and it includes a call to another function, lm.fit, which is itself 67 lines long and in turn calls still more functions. I’m sure we can all agree that

null.model <- lm(satisfaction ~ 1)  # the null model
my.postdoc <- lm(satisfaction ~ functions.used)
ideal.postdoc <- lm(satisfaction ~ functions.used + work.life.balance)

is much cleaner and more satisfying than 180 lines of code, and it’s much easier to edit and change.
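(The snippet above assumes variables like satisfaction and functions.used already exist in your workspace. A self-contained version with made-up data, purely for illustration, would look like:

```r
# Made-up data standing in for the postdoc example.
set.seed(1)
functions.used <- 1:20
satisfaction  <- 2 * functions.used + rnorm(20)

null.model <- lm(satisfaction ~ 1)               # the null model
my.postdoc <- lm(satisfaction ~ functions.used)  # one predictor

coef(my.postdoc)   # the fitted slope comes out close to the true value of 2
```

Nothing about the fake data matters here; the point is how compact the model-fitting calls are.)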

If you look at your code and you’ve got blocks of code that you use repeatedly it’s time to sit down and do a bit of work. This should be fully reproducible:

library(analogue)
data(Pollen, Climate, Location, Biome)

#  Cleaning pollen data:
#  Keep only samples from our biomes of interest
good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
Pollen.clip <- Pollen[good.loc ,]
Pollen.trans <- apply(Pollen.clip, 1, sqrt)
pol.mean <- rowMeans(Pollen.trans)
Climate.clip <- Climate[good.loc ,]
Climate.trans <- apply(Climate.clip, 1, scale)
clim.mean <- rowMeans(Climate.trans)
Location.clip <- Location[good.loc,]
Location.mean <- rowMeans(Location.clip)

This is a simple example, and it doesn’t exactly make sense (there’s no reason to average the rows). But, it is clear that for all three datasets (climate, location and pollen) we are doing the same general set of operations.

So, what are the things we are doing that are common?

  1. We’re removing a common set of sample sites
  2. We are applying a function by rows (except Location)
  3. We are averaging by row.

So, this seems like a decent candidate for building a function. Right now we’re at 9 rows, so keep that in mind.

We want to pass in a dataset, so we’ll call it x, and we’re applying a function across rows before averaging, so we can pass in a function as y. Location is a bit tricky: we don’t actually transform it, but R has a function called identity that simply returns whatever is passed to it, so we can use that. Lastly, we compute the rowMeans, so that’s what the function passes out. Let’s construct the function:

library(analogue)
data(Pollen, Climate, Location, Biome)

clip_mean <- function(x, y){
  #  edit: Gavin Simpson pointed out that I had made an error in my code, calling
  #  apply(x, 1, y) instead of apply(x.clip, 1, y).  The nice thing about writing the
  #  function is that I only needed to correct this once, instead of three times.
  #  edit 2: It was also pointed out by reddit user jeffhughes that x and y are not very 
  #  good variable names, since they're fairly generic.  I agree.  You'd be better off
  #  writing more descriptive variable names.

  good.loc <- Biome$Fedorova %in% c('Deciduous Forest', 'Prairies')
  x.clip <- x[good.loc, ]
  x.trans <- apply(x.clip, 1, y)
  rowMeans(x.trans)
}

pollen.mean <- clip_mean(Pollen, sqrt)
climate.mean <- clip_mean(Climate, scale)
location.mean <- clip_mean(Location, identity)

So we’re still at 9 lines of code, but to me this reads much cleaner. We could go even further by combining the lines below the good.loc assignment into something like this:

rowMeans(apply(x[good.loc,], 1, y))

which would take our 9 lines of code down to 7 fairly clean lines.
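Putting that one-liner back into the function gives the condensed version. (Sketched here with tiny made-up stand-ins for the analogue datasets, so it runs on its own; like the version above, it reads good.loc’s Biome data from the enclosing environment.)

```r
# Made-up stand-ins for the analogue Biome and Pollen datasets.
Biome  <- data.frame(Fedorova = c("Deciduous Forest", "Tundra", "Prairies"))
Pollen <- data.frame(a = c(1, 4, 9), b = c(16, 25, 36))

clip_mean <- function(x, y) {
  good.loc <- Biome$Fedorova %in% c("Deciduous Forest", "Prairies")
  # Clip, transform by row, and average, all in one line.
  rowMeans(apply(x[good.loc, ], 1, y))
}

clip_mean(Pollen, sqrt)
```

With these toy numbers, rows 1 and 3 are kept, apply() takes the square root of each row (transposing the result as apply always does), and rowMeans averages across the retained samples, returning a mean of 2 for column a and 5 for column b.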

The other advantage of putting all this into a function is that the variables created inside clip_mean (e.g., x.trans) exist only within the function. The only thing passed out at the end is the result of the rowMeans call. If we had assigned rowMeans to a variable (using <-) on the last line, we wouldn’t even have that: data leaves a function either through an explicit return call or as the value of the last expression evaluated. So even though we still have 9 lines of code, we end up with only 8 variables in memory instead of 13, and the fewer variables you have, the easier debugging becomes.
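You can check this scoping behaviour directly (a toy sketch):

```r
f <- function(x) {
  temp <- x * 2   # temp exists only inside the function
  mean(temp)      # the value of the last expression is what gets returned
}

result <- f(c(1, 2, 3))   # result is 4
exists("temp")            # FALSE: temp was discarded when f() returned
```

Only the returned value survives; everything else created inside the function is cleaned up automatically.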

So, to summarize, we’ve got cleaner code, fewer variables and you now look like a total pro.
If you have any tips or suggestions for working with functions please leave them in the comments.

Here are some other resources for creating functions:
Quick-R: http://www.statmethods.net/management/userfunctions.html
Working with Functions: http://en.wikibooks.org/wiki/R_Programming/Working_with_functions
Hadley Wickham’s more advanced resources for functions: http://adv-r.had.co.nz/Functions.html

Thanks again to everyone who contributed suggestions!

Published by

downwithtime

Assistant scientist in the Department of Geography at the University of Wisconsin, Madison. Studying paleoecology and the challenges of large data synthesis.
