Two frustrations with R’s default behaviour.

EDIT:  In response to this post I have had good suggestions, both in the comments here and on reddit in /r/statistics. Thanks to all!

I use R every day.  I can think of very few times when I have booted up my computer and not had an instance of R or RStudio running.  Whether for data exploration, making graphs, or just fiddling around, R is a staple of my academic existence.

So I love R, but there are a few things that drive me nuts about its behaviour:

  1. Plotting a large data.frame produces unreadable plots.
  2. read.csv and write.csv don’t behave the same with respect to row names.
  3. stringsAsFactors = FALSE isn’t the default.

1.  Plotting a large data.frame:

[Figure 1: plot of data frame size vs plotting time]

Figure 1. How long it takes to plot a data frame increases linearly with the number of rows. Imagine how long it would take if you accidentally plotted all the data in the world!

I get it, it’s a way to tell me to stop doing it, but I can’t help it. Every once in a while I forget to stick the columns into my plot command, and I wind up getting something that looks like the floor of my bathroom (Figure 1). The default plot behaviour for data.frames is frustrating, especially when you have a very large data set. With one simple command you can unleash minutes of processing time for a plot that yields no information at all. I decided to time it, for no particular reason other than to include a code snippet here:

timer <- rep(NA, 50)

for (i in 1:50) {
  rowtest <- i * 20
  # Build a (rowtest x 20) matrix of correlated columns
  example <- matrix(runif(rowtest * 20, 0, 3), ncol = 20) %*% matrix(rt(400, 2), ncol = 20)
  # This is going to take a while. . .
  # Element [1] of system.time() is the user CPU time of the plot call
  timer[i] <- system.time(plot(as.data.frame(example)))[1]
}

plot(1:50 * 20, timer, type = 'b', xlab = 'Data frame rows', ylab = 'Wasted Time')

I’m sure there are good reasons for this, and it might be possible to change the default behaviour by writing up a code snippet, but changing default behaviour of some of the base classes is probably pretty dangerous. It would be nice though if the default behaviour were to plot all columns up to a maximum number, possibly based on the pch size, so that you could at least plot until minimum readability was reached.
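
One way to approximate that behaviour without touching the base plot method is a small wrapper. This is only a sketch: `plot_capped` and its `max_cols` and `max_rows` arguments are hypothetical names of my own, not part of R, and the cut-offs are arbitrary.

```r
# Hypothetical helper (not part of base R): plot at most `max_cols` columns
# and at most `max_rows` randomly sampled rows of a data frame.
plot_capped <- function(df, max_cols = 5, max_rows = 1000, ...) {
  stopifnot(is.data.frame(df))
  keep_cols <- seq_len(min(ncol(df), max_cols))
  keep_rows <- if (nrow(df) > max_rows) {
    sample(nrow(df), max_rows)
  } else {
    seq_len(nrow(df))
  }
  shown <- df[keep_rows, keep_cols, drop = FALSE]
  if (ncol(df) > max_cols) {
    message("Plotting the first ", max_cols, " of ", ncol(df), " columns.")
  }
  plot(shown, ...)
  # Return the subset that was actually plotted, invisibly
  invisible(shown)
}

big <- as.data.frame(matrix(runif(5000 * 20), ncol = 20))
shown <- plot_capped(big)
dim(shown)
```

Because the wrapper has its own name, a forgotten column argument still costs you a plot, but a bounded one, and the base classes stay untouched.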

2. read.csv and write.csv don’t agree on whether or not rownames are default.

Try this out:

aa <- data.frame(column.one = runif(10, 0, 2),
 column.two = rnorm(10, 0, 1))

write.csv(aa, 'dead.file.csv')
bb <- read.csv('dead.file.csv')

How many columns does bb have? Three. And if you’re not paying attention, or you’re not careful coding your functions, all that great analysis you did the first time has completely changed. Of course, if you had used write.table and read.table as a pair everything would work out fine: then you’d have only two columns. It’s frustrating, and, probably, the fact that we all account for it in one way or another means that a major change in the bedrock of R would break a lot of code.
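
For reference, here is a minimal sketch of two ways to make the round trip symmetric using arguments that read.csv and write.csv already accept. The temporary file and the bb_* variable names are just for illustration.

```r
aa <- data.frame(column.one = runif(10, 0, 2),
                 column.two = rnorm(10, 0, 1))
csv_file <- tempfile(fileext = ".csv")

# Default behaviour: the row names come back as an extra first column.
write.csv(aa, csv_file)
bb_default <- read.csv(csv_file)

# Option 1: do not write the row names at all.
write.csv(aa, csv_file, row.names = FALSE)
bb_no_names <- read.csv(csv_file)

# Option 2: write them, and say which column holds them when reading.
write.csv(aa, csv_file)
bb_row_names <- read.csv(csv_file, row.names = 1)

# Column counts for the three variants: 3, 2 and 2
c(ncol(bb_default), ncol(bb_no_names), ncol(bb_row_names))
```

Option 1 is the safer choice when the row names are just 1, 2, 3 and so on; option 2 keeps them when they carry information.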

3. stringsAsFactors = FALSE isn’t the default.

Oh data.frames, oh read.table, sometimes you drive me bonkers. Maybe it’s just me, maybe it’s just the scripts I’m writing, but the fact that my strings get converted to factors all the time drives me crazy. As far as I’m concerned a factor is a derived data type: you ought to convert to a factor, not convert from a factor. It leads to even more complicated problems when, for whatever reason, you have a column that is largely numeric, with some sort of character symbol stuck in randomly (multiple NA-type strings indicating different types of unusable data, for example). Getting the numbers back from a factor is more complicated than just converting the string with as.numeric, because as.numeric on a factor returns the internal level codes; you’ve got to use as.numeric(as.character(whatever)).
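
A small sketch of that pitfall, using made-up data. Since R 4.0.0 the default is stringsAsFactors = FALSE, so it is set to TRUE explicitly here to reproduce the older behaviour; the string "missing" is an arbitrary stand-in for a non-standard NA marker.

```r
# Made-up data: a mostly numeric column with one stray string in it
raw <- "id,value\na,10\nb,20\nc,missing\nd,5\n"

as_factor <- read.csv(text = raw, stringsAsFactors = TRUE)

# as.numeric() on a factor returns the internal level codes, not the numbers
codes <- as.numeric(as_factor$value)

# Going through character first recovers the numbers; "missing" becomes NA
# (the coercion warning is suppressed here on purpose)
values <- suppressWarnings(as.numeric(as.character(as_factor$value)))

# Declaring the stray string as an NA marker avoids the problem at read time
declared <- read.csv(text = raw, na.strings = c("NA", "missing"))

codes
values
declared$value
```

Declaring na.strings up front means the column is read as numeric in the first place, so no conversion is needed afterwards.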

Okay, I’ve got that off my chest. I am genuinely interested though: is there a reason for these default behaviours? Are they some sort of legacy that hasn’t been changed, or am I pretty much the only one who encounters them? If you know, just pop your answer in the comments.


5 thoughts on “Two frustrations with R’s default behaviour.”

  1. For your first problem, I’m not sure what the desired output is so it’s hard to fix it.

    For your second the behavior makes sense. You don’t want to lose the row names when you write a csv, do you? For instance the row names may be a factor, so you want to be able to treat them as such, not just as row names that you use for easy dataframe viewing. The fault is more with how you’re reading the data in. If you want the first column to be row names just do this: bb <- read.csv('dead.file.csv', row.names=1). It makes sense to me, because R doesn't know if the row names are important or not, and the default is to include the information and not throw it out.

    Your third bug can be fixed using the .Rprofile file.

    .First <- function() {
    options(stringsAsFactors=FALSE)
    # maybe some other packages to load
    }

    You could also try the "Defaults" package, and then set default options in your .Rprofile. The problem is that you'll muck up anyone you share code with because they'll have a different .Rprofile.

    • Thanks Edmund, for the second point, it still seems odd that R assumes row names are important when writing a CSV, but assumes they aren’t when reading the file in. read.table is internally consistent, why can’t read.csv be?
      Thanks for the tip on the .Rprofile fix, but I agree, if you’re sharing code it’s a bit risky changing all your defaults. I suppose you just have to share the .Rprofile file as well.

      • Huh, that is weird that read.csv and write.csv are inconsistent but write.table and read.table aren’t. I rarely write files so I’ve never encountered the problem. I guess the easy work around is just:

        write.table(aa, 'dead.file.csv', sep=",")
        bb <- read.table('dead.file.csv',sep=",")
        bb

        instead of write.csv. I totally agree that the .Rprofile is really only useful for yourself, which is why I tend to not use it. I guess you could call the option set in the header of your script where you put your library loading calls as well.

        I'd guess that the reason for the weird defaults is that R has been developed over a number of years and different people have worked on different base package functions.

  2. Is the read.csv() vs write.csv() something you encounter a lot? Are you doing this to write out data that you need to read back into R? If so, don’t go from one representation to another and then have to read that other representation back in. Instead, serialize the object or objects to disk using save() or saveRDS(). I like to think of the write.*() funs as exports whilst the serialization functions are a native Save option for R.
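
A minimal sketch of the serialization route suggested above, assuming the data only needs to be read back into R. The temporary file is just for illustration.

```r
aa <- data.frame(column.one = runif(10, 0, 2),
                 column.two = rnorm(10, 0, 1))
rds_file <- tempfile(fileext = ".rds")

# saveRDS() stores the object itself, including column types and row names
saveRDS(aa, rds_file)
bb <- readRDS(rds_file)

# The restored object should match the original exactly
identical(aa, bb)
```

Unlike a CSV export, nothing has to be re-parsed on the way back in, so there is no row-name or factor guessing to get wrong.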

    Regarding the factors; R doing this is a pain if you don’t know to look for it. However, it is far more efficient to store a long character vector as a factor as it only needs to store the index of the factor level in the vector plus the unique factor levels. A character vector needs to store multiple copies of the same thing. At least it used to. Now R will try to be more efficient in that regard so a benefit of stringsAsFactors = TRUE is largely removed. In the particular use-case you give though, you get what you pay for; if you read data in without telling R what the na.strings are then you deserve to get back a factor ;-)

    • It is to some degree. I use read/write.csv in scripts where I have time consuming data processing, essentially:
      if (!(filename %in% list.files())) { . . . } else {
        read.csv(filename, row.names = 1)
      }

      I could use save and load, but csv files give me the options of sharing data easily with people who may not be as comfortable with R so we can discuss parts of the code, or partial results, before we get to the final results.
