Friday, April 3, 2015

R and Unlist Your List!

If you assign a list to a column in an R data frame, it won't quite work: the column ends up as a list instead of a plain numeric vector, and you need unlist to fix it. (That's the short version for the search engine snippet. It doesn't make for a great narrative intro; it's more of a concise summary.)

A few days ago I was working in R, generating a new data frame from another one. It was a little more complex than I was used to: I had to bin one variable by the values of another and compute some new percentages and frequencies in the new, smaller data frame, so I couldn't just use the non-looping approach that is common in R (and which is a lot nicer and faster). The data frame was relatively small, so one level of looping was not a problem, even on my 4.5-year-old MacBook Air.

In one section, I generated a list of values (numeric, nothing fancy) and then assigned that list to a column in my new data frame. When I printed the data frame to look at it, it looked fine, but when I ran str(my_df) or summary(my_df), something was horribly wrong: the column wasn't numeric, it was some odd list format, and it wasn't working with my ggplot.
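Here is a minimal sketch of how that symptom can be reproduced. The data frame, the bin labels, and the numbers are made up for illustration and are not my real data; only the_variable and my_df match the names used in this post.

```r
# Hypothetical stand-in for the new, smaller data frame
my_df = data.frame(bin = c("a", "b", "c"))

# A list of plain numeric values (made-up numbers)
generated_values = list(0.25, 0.5, 0.25)

# Assigning a list to a column stores it as a list column
my_df$the_variable = generated_values

print(my_df)  # the printed table looks like an ordinary numeric column
str(my_df)    # str() reports the_variable as a list

# Checks that make the problem explicit
column_is_list = is.list(my_df$the_variable)
column_is_numeric = is.numeric(my_df$the_variable)
```

In this sketch column_is_list comes out TRUE and column_is_numeric comes out FALSE, which matches what str was complaining about.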

I tried assigning the generated values directly to the column in the data frame, with something like this inside the loop, where I also incremented i:

my_df[i, 'the_variable'] = one_generated_variable

(Note that I can't use the usual R assignment arrow there, the one with the less-than sign. Google barfs on the code even though it's text, so I have to use the equals sign, which R also accepts for assignment.)

one_generated_variable was just a numeric value. Should have been fine, I thought! But no, it still came out as a list. I have no idea why. Honestly, it seems impossible, since the values were generated one at a time and assigned then and there; they were not bundled into a list first. But unlist fixed the problem.

my_df$the_variable = unlist(my_df$the_variable)
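As a small self-contained sketch of that fix, again with a made-up data frame and made-up numbers rather than my real data:

```r
# Hypothetical data frame with a list column, as in the problem described above
my_df = data.frame(bin = c("a", "b", "c"))
my_df$the_variable = list(0.25, 0.5, 0.25)

# Flatten the list column into a plain numeric vector
my_df$the_variable = unlist(my_df$the_variable)

str(my_df)  # the_variable should now be reported as num

# Checks on the result
fixed_is_numeric = is.numeric(my_df$the_variable)
fixed_is_list = is.list(my_df$the_variable)
```

One caveat, as far as I can tell: this only lines up when every list element holds exactly one value. unlist drops NULL elements and expands longer ones, so the result may no longer match the number of rows.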

That did it. I still don't understand the details, since I don't see why it was a list in the first place (aren't columns vectors anyway?). I have never run into that problem before, although I've mostly been working in Python lately.
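One piece that may help with the "aren't columns vectors?" question: in R a list counts as a vector too, just not an atomic one, so a data frame column can be either kind. This sketch does not explain how my loop ended up with a list; it only shows the distinction, using made-up values.

```r
# Two ways to hold the same three numbers (made-up values)
plain_column = c(0.25, 0.5, 0.25)     # atomic numeric vector
list_column = list(0.25, 0.5, 0.25)   # list, which is a generic vector

# Both count as vectors in R's sense
both_are_vectors = is.vector(plain_column) && is.vector(list_column)

# Only the plain one is atomic, which is what numeric code expects
only_plain_is_atomic = is.atomic(plain_column) && !is.atomic(list_column)
```

Both flags come out TRUE, so "columns are vectors" holds even for the odd list column. It is just not the kind of vector that numeric functions and plots want.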

Also, a friend put me onto data.table instead of data.frame for bigger data.