The big payback

EDIT: For the sake of posterity I'm going to keep this post up, but I'd like to direct readers to this amazing resource (The R Inferno, which should be required reading!) that goes into many of the issues surrounding vectorization in a really fantastic way.

James Brown talks about the Big Payback.

(they put an ad in front of that song, the soul-less bastards!)

But I'd like to talk about the big payback in another sense. As researchers we're always looking for the payback. Our research funding is no payback for all the time we've put into reading papers that no one else has read, writing papers that no one else will read (maybe that's just me), and all the administrative hoops we have to jump through (although pity the administrators who have to put up with us!)...

I guess when it comes down to it, the big payback is really the minor paybacks we achieve from day to day: a well-formulated alternative hypothesis test, the exact right edit that skewers the word count and makes your paragraph that much clearer, or the moment when all that R coding pays off in a well-vectorized script (example to follow).

I’m a bit worried that I’ve set my sights too low, but maybe that’s life as a post-doc without a Nature paper (I did meet a Science editor yesterday, so maybe there’s that in my future though!), and after the last Nature debacle do I really feel comfortable striving for excellence?  Sadly, yes.

Anyway, the point of this blog was supposed to be coding in R, parenthetical references, some paleoecological thoughts, and some music references to make that dusty bass amp (that I helped build!) in my basement seem worth it.

So, on to coding and the little victories we find there. Lots of R advice suggests that vectorization is a huge boon when using R. The main idea is that looping operations (which are arguably simpler to conceptualize) are slower in R than vectorized operations. Although, as with most things in R, this is not exclusively true:

for(i in 1:10000) {1}

is faster than:

sapply(1:10000, function(x){1})

(see my EDIT above: sapply is actually a loop under the hood, and I clearly misunderstood the concept of vectorization)

For a language that promotes vectorization, that's some crazy stuff, R! I can understand it, but as I stretch to make a really simple example it's a bit frustrating... Could they not make this example an exception in the source code to help educators?
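If you want to check this on your own machine, here's a quick and admittedly crude comparison with system.time (the iteration count is mine, picked just to make the difference visible):

system.time(for (i in 1:1000000) 1)            # bare loop: nothing but iteration overhead
system.time(sapply(1:1000000, function(x) 1))  # pays a full function call on every element

The sapply version should come out slower, because on top of the loop that sapply runs internally, every element costs a function call.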

Regardless, I ran into a problem the other day where I needed to sort a big matrix (hundreds of thousands of rows). Each row was a point, and the columns held distances to the four nearest trees, along with a number of other variables for each tree. I needed to sort everything with respect to distance, so that all the variables relating to the closest tree ended up in column one, the variables for the second closest tree in column two, and so on.
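(To make the snippets below concrete, here's a toy stand-in for the distances and other.var matrices used in the code that follows; the sizes and values are made up.)

set.seed(42)
n <- 10  # stand-in for the hundreds of thousands of rows in the real data
distances <- matrix(runif(n * 4, 0, 100), ncol = 4)  # distance from each point to trees one through four
other.var <- matrix(rnorm(n * 4), ncol = 4)          # one companion variable per tree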

There were a few ways of doing this:

1. Create a matrix of ranks (not order!!!) for the distance matrix and apply this to each of the other matrices:

#  at this point I would like to suggest that Donald Trump has killed all my enjoyment of the O'Jays' "For the Love of Money"*, and for that I can never forgive him.

# note: apply over rows returns its result transposed, so rank.set comes out 4 x n
rank.set <- apply(distances, 1, rank, ties.method = 'random')
# if there's a tie it's probably because the values are NAs; otherwise, I don't care.
other.var <- other.var[, rank.set]

just doesn't work; I get a bunch of NAs because R isn't compliant with my demands (a matrix passed as a column index just gets flattened into one long vector, so the rows never get reordered individually).
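As an aside, since rank versus order is exactly what tripped me up here: the two are inverse permutations of one another, so which one you want depends on which side of the assignment it sits on. A tiny demonstration:

d <- c(30, 10, 20)
order(d)       # 2 3 1: positions of the smallest, second smallest, ... values
rank(d)        # 3 1 2: the position each value would land in after sorting
d[order(d)]    # 10 20 30: sorted, by indexing on the right-hand side
x <- numeric(3); x[rank(d)] <- d; x  # 10 20 30: sorted, by assigning on the left-hand side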

If you don’t want to vectorize then you’re stuck with a for loop:

rank.set <- apply(distances, 1, rank, ties.method = 'random')
for (i in seq_len(nrow(other.var))) {
  # assign each tree's values into their ranked positions
  # (rank.set is transposed, so point i sits in column i)
  other.var[i, rank.set[, i]] <- other.var[i, ]
}

which takes a brutally long time if you’ve got as big a dataset as I have.

The final option is to "vectorize" with apply (which, per the EDIT above, is really just another loop in disguise). This is the fastest way I've found, but it's not really all that great:

new.set <- cbind(distances, other.var)
newest.set <- t(apply(new.set, 1, function(x) {
  r <- rank(x[1:4], ties.method = 'random')  # columns 1:4 are distances, 5:8 the other variables
  x[c(r, 4 + r)] <- x                        # drop each tree's values into its ranked position
  x
}))
new.dist <- newest.set[, 1:4]
new.other <- newest.set[, 5:8]

Not very pretty, but it's the fastest solution I've found so far, and it makes for a decent first post. When you eventually find my blog and are amazed by the insightful content I publish, you can look back on this first post and be astounded by how far I've come.
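(And in the spirit of the EDIT up top, here's a sketch of what a genuinely vectorized version might look like, with no apply at all: base R's order can take row(distances) as a primary sort key, assuming plain numeric n-by-4 matrices like the toy data above. Note it uses order rather than rank, since here we're indexing rather than assigning.)

ord <- order(row(distances), distances)  # one call sorts within every row at once; linear indices, grouped by row
new.dist  <- matrix(distances[ord], ncol = 4, byrow = TRUE)
new.other <- matrix(other.var[ord], ncol = 4, byrow = TRUE)  # the same permutation, applied to the companion matrix

One caveat: unlike ties.method = 'random', this doesn't treat NA distances gracefully, since order pushes NAs to the end of the whole vector rather than to the end of their own row, so rows with missing trees would need handling first. On hundreds of thousands of rows, though, skipping the per-row function call is where the real speedup lives.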

That’s it for tonight. Back to AGU tomorrow.

*  While not espousing biblical truth (and choosing not to capitalize It), isn’t it ironic that the title of the song refers to the bible verse:  “For the love of money is the root of all evil: which while some coveted after, they have erred from the faith, and pierced themselves through with many sorrows”?  Does the Donald know?  Does the Donald care?
