Sunday, January 25, 2015

Random R Notes - Factors, Rank v. Order, Unsplit

Some R issues I have run into recently...

I split a data frame, then split each piece again, and the analysis was taking forever. Something was wrong. On inspection, each sub-split DF still carried all the factor levels from the original DF! That is terrible when one variable has about 25,000 levels. I don't think it should have been that slow, but hey, I gave up waiting on it.

I needed droplevels. Applied to a whole DF, it removes the unused levels from every factor column. Assign the result to a new DF or just write over the old one.

your_new_df = droplevels(your_old_df)
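Here is a small sketch of the problem and the fix. The data frame and its column names (group, id) are made up for illustration, not my real data. It shows that a piece of a split keeps all the original factor levels, so splitting it again produces empty pieces until you call droplevels.

```r
# Hypothetical toy data: 'id' is a factor with four levels.
df = data.frame(group = factor(c("x", "x", "y", "y")),
                id    = factor(c("a", "b", "c", "d")))

# First split: one piece per group.
pieces = split(df, df$group)
x_piece = pieces[["x"]]

# The piece only contains ids "a" and "b", but still carries all four levels,
# so a second split also creates empty pieces for "c" and "d".
n_before = length(split(x_piece, x_piece$id))   # 4

# droplevels() removes the unused levels from every factor column.
x_clean = droplevels(x_piece)
n_after = length(split(x_clean, x_clean$id))    # 2
```

With 25,000 levels instead of four, that is a lot of empty pieces per sub-split. split() also takes a drop = TRUE argument that skips levels that do not occur, which may be enough on its own in some cases.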

(Google cannot handle the less-than sign, which R uses in its assignment arrow, read as "gets", in place of =. With either the code tag or the pre tag, it just blows it up. Annoying.)

Then I could run the ordering code on dates. But no: there is, I learned, order and there is rank. There is also sort, but I managed to avoid it somehow, so I won't discuss it here.

Note that for rank you need to figure out what to do with ties! (That is, when values are equal, how to rank them exactly.)
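To make the ties point concrete, here is a quick sketch with a made-up vector (scores is just an illustrative name). The ties.method argument of rank decides what tied values get.

```r
# Two values are tied at 20.
scores = c(10, 20, 20, 30)

# Default: tied values share the average of the ranks they would occupy.
r_average = rank(scores)                         # 1.0 2.5 2.5 4.0

# "min": tied values all get the lowest of those ranks.
r_min = rank(scores, ties.method = "min")        # 1 2 2 4

# "first": ties are broken by position, earlier element first.
r_first = rank(scores, ties.method = "first")    # 1 2 3 4
```

Which one you want depends on the analysis. I would guess "first" is the handy one when every row needs a distinct rank.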

There was a really great post about it on Stack Overflow but I can't find it at the moment. This post might help, though.

Or, I made a nice little example! I use R: as the start of the input lines since the greater than symbol and blogger are not friends.

R: the_list = c('A', 'D', 'B', 'C')

R: order(the_list)

[1] 1 3 4 2

R: rank(the_list)

[1] 1 4 2 3


So, you see the two outputs are different.
Order gives the indices that would sort the vector: put the first element first (A), then the third element would come next (B), then the fourth element (C), then the second element (D).
Rank gives each element's position in the sorted result: the first element (A) is first, the second element (D) is the fourth of them all, the third element (B) is the second overall, and the fourth element (C) is the third overall.
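A short sketch of how the two relate, reusing the same the_list. Indexing by order gives the sorted vector, and (when there are no ties) taking the order of the order gives the rank.

```r
the_list = c('A', 'D', 'B', 'C')

# Indexing the vector by its order puts it in sorted sequence.
in_order = the_list[order(the_list)]       # "A" "B" "C" "D"

# With no ties, the order of the order is the same as the rank.
order_of_order = order(order(the_list))    # 1 4 2 3
the_rank = rank(the_list)                  # 1 4 2 3
```

So order is what you use to rearrange rows, and rank is what you store as a new column next to the original values.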

Edit: I called sort!

R: sort(the_list)


[1] "A" "B" "C" "D"


That's awesome.

So after that, I wanted to unsplit. But, no: I had added the rank column, and instead of 400 rows I only got 100 (I had 100 DFs with 4 rows each). Unsplit did not seem to work well (or at all?) for me once I had added (or removed?) items. So Stack Overflow told me I needed do.call and rbind ("row bind").

rejoined_df = do.call(rbind, splitted_df)


Note that splitted_df is the result of the split() call, which I'm not showing, so it is not actually a DF: split() on a DF returns a list of DFs, one per group. You can pass that list directly to do.call, and if you use split you should familiarize yourself a bit with the resulting object.
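Putting the pieces together, here is a minimal sketch of the whole split, rank, rejoin round trip. The data, the column names (group, day, day_rank) and the choice of ties.method = "first" are all assumptions for illustration, not my actual analysis.

```r
# Hypothetical toy data: two groups with two dates each.
df = data.frame(
  group = rep(c("g1", "g2"), each = 2),
  day   = as.Date(c("2015-01-20", "2015-01-05", "2015-01-25", "2015-01-10"))
)

# split() returns a named list of data frames, one per group.
splitted_df = split(df, df$group)
is_a_list = is.list(splitted_df) && !is.data.frame(splitted_df)

# Add a rank column to each piece (dates converted to numbers before ranking).
splitted_df = lapply(splitted_df, function(piece) {
  piece$day_rank = rank(as.numeric(piece$day), ties.method = "first")
  piece
})

# Stack the pieces back into a single data frame, row by row.
rejoined_df = do.call(rbind, splitted_df)

# rbind builds row names like "g1.1"; reset them to plain 1, 2, 3, ...
rownames(rejoined_df) = NULL
```

The rejoined DF has all the original rows plus the new column, grouped by piece rather than in the original row order.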

There you have it, some random R notes.