Thursday, October 31, 2013

Halloween in EQII

Well sure, there's Nights of the Dead or whatever, but here's some fun homage I mean IP infringement almost I mean homage. Hopefully the text flow is ok, but we have "Norm Baites" and we can also wonder how many licks it takes to get to the center of a… well just a lollipop.

Wednesday, October 23, 2013

R and Regex Named Matches

I use Python and R to do stuff: Python for web scraping and text cleanup, R for the analysis. But people have expanded the functionality of the two, and they now overlap (it's hard enough not to confuse them some of the time as it is). I found I needed to use named groups in regex in R, and... couldn't figure it out. The web did not help.

SHORT VERSION: Turn on the Perl regex style (perl = TRUE) and go read some Perl regex pages, you'll be fine. Name the match: (?<name>...), to match it later: \\g{name}
This is completely different from what I was used to.
Google blogger will try to blow this post up as I want to have greater than and less than symbols. Yeah they can't get that right. Oh maybe it's working.

Typically, if I use regular expressions it's in Python, but R can do it too, and sometimes you'll want to do that. But there isn't a ton of help online about it (despite the links I have lined up to include below) and there are some things that confuse the issue (to Perl or not to Perl...).

If you just want to do some work with strings, first check out Hadley Wickham's stringr. It's awesome.

I, however, wanted to do some pattern matching that included a repeated section, so I needed regex's named group functionality, which I couldn't find or figure out in stringr. I was looking for patterns like this:

5,-1,5,-1,5....

...where 5 could be any number between 0 and 500 or so, but it would repeat. I had already removed spaces and added commas for easier parsing. (So other matches would be, like, 17,-1,17,-1,17.... etc.) So I needed to make sure the first match there was repeated, thus, named groups (or any group capture really, but I wanted to name it).

But I also couldn't figure it out in R. I can do it in Python, but the Python code for regex wouldn't work in R, alas. It was not clear what changes needed to be made.

One reason was that the \ needs to be escaped, that is, \\. So for example, \d+ needed to be \\d+. That wasn't too hard to figure out. But the rest was.
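A minimal sketch of the escaping point, assuming a made-up input string ("item 42" and the variable names are mine, purely for illustration):

```r
# In an R string literal the backslash is itself an escape character,
# so the regex \d+ has to be typed as "\\d+".
pattern <- "\\d+"
pattern_length <- nchar(pattern)   # 3 characters: backslash, d, plus

x <- "item 42"                     # hypothetical input
digits <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(digits)                      # "42"
```

The string R actually hands to the regex engine is the three-character \d+; the doubled backslash only exists in the source code.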

You can have Perl style, or POSIX, or not. Uh, what? No idea! I just needed it to work. Specifically, named groups in R. I found this page which said "Named subpatterns... are not covered here." Hmm. Another page said how "examples for the use of regex in R are rather rare" and had some useful examples. Eventually I figured I would set the Perl option and see what I could do; at least I could search on "perl", and that made all the difference as I could find out how to do named groups in Perl-style regex and there you go.

Name a group in Python: (?P<name>...)    
Name a group in R, Perl style: (?<name>...)

Note: I expect the less than and greater than symbols fail at some point.

Reference it later in Python: (?P=name)
Reference it later in R, Perl style: \\g{name}
    So curly braces (Perl?), and double backslash for R.
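
Putting the pieces together, here is a small sketch of the named group and its back-reference in R. The group name "num" and the test strings are my own choices for illustration, not the real data:

```r
# A number, then ",-1,", then the SAME number again, repeated to the end.
# (?<num>\d+) names the first number; \g{num} requires it to show up again.
pattern <- "^(?<num>\\d+),-1,\\g{num}(,-1,\\g{num})*$"

candidates <- c("5,-1,5,-1,5", "17,-1,17,-1,17", "5,-1,6,-1,5")
matched <- grepl(pattern, candidates, perl = TRUE)
print(matched)   # TRUE TRUE FALSE (the last one switches from 5 to 6)

# Pulling out what the named group captured, via the attributes that
# regexpr() attaches when perl = TRUE.
m <- regexpr(pattern, candidates[2], perl = TRUE)
idx <- which(attr(m, "capture.names") == "num")
start <- attr(m, "capture.start")[1, idx]
len <- attr(m, "capture.length")[1, idx]
num <- substring(candidates[2], start, start + len - 1)
print(num)       # "17"
```

Without perl = TRUE the (?<num>...) syntax is not understood, which is the whole reason for the SHORT VERSION above.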

Some useful Perl-regex links:

Although honestly, one problem I have with a lot of online examples (and the R help files) is that they are completely arcane. If I'm looking for help with syntax, a complex example isn't going to solve it. That's bad usability.

Post Keywords: regex, R, r-project, cran, grep, regular expressions, named groups.

Monday, October 21, 2013

iTunes Radio

Still working out a few kinks, as you can see (not repeating recently played songs). The result is I have had Fiona Apple stuck in my head for four days. (Not 'shopped.)

R and head() and tail()

So, if you're not careful using R's head() and tail() commands, you'll end up with a little surprise. Perhaps I should say not careful reading the documentation.

head() and tail() do not return just one item from the list (or whatever); by default they return up to six. So head does not mean first, and tail does not mean last.

Read carefully: "Returns the first or last parts of a vector, matrix, table, data frame or function" [Italics added.] PARTS. Plural. An 's' on there.


our_list = list(3, 7)       # Makes a list with two items: the first is 3, the second is 7. (R stores these as doubles; 3L and 7L would be integers.)

Note I used "=" since if I use a "less than" bracket, Google blogger freaks out. (The typical R code is "less than" followed by a dash, which makes an arrow, representing "gets": the left side gets [is given] the right side.) It's giving me a hard time with formatting as it is.
If you type in "our_list", R will print our_list:

[[1]]
[1] 3

[[2]]
[1] 7

So, the first item in our_list [[1]] has one item [1], which is a 3.
The second item in our_list [[2]] has one item [1], which is a 7.

I don't fully understand the difference between [[x]] and [x], it seems mysterious.
Edit: The R Inferno, 8.1.54.... aha. Still is mysterious, though.
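
For what it's worth, a small sketch of how the two kinds of brackets behave on our_list (the variable names "single" and "double" are mine, just for illustration):

```r
our_list <- list(3, 7)

single <- our_list[1]     # single brackets: still a list, of length 1, holding 3
double <- our_list[[1]]   # double brackets: the item itself, the number 3

print(class(single))      # "list"
print(class(double))      # "numeric"

print(double + 1)         # 4
# single + 1 would stop with an error, because you can't do arithmetic on a list
```

So [x] gives back a smaller list, and [[x]] reaches inside and hands over what the list was holding.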

If you type head(our_list), it would be nice to get just the head, that is, the first item. But no, you get the whole list (since the list is small you get the entire list; a larger list would return only its first six items).

What you want is:
head(our_list, n=1)

...where the 'n' gives the value of how many items you want. (You don't actually need the n=, I have noticed.)
When I try "n=3" for this two item list, it just gives the first two items (i.e., the entire list in this case) and does not give an error.
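
Here's a quick sketch of all of that in one place. The 20-item vector "long_vec" is something I made up to show the default; note too that head() on a list hands back a list, so to get the bare value you still need the double brackets:

```r
our_list <- list(3, 7)

first_part <- head(our_list, n = 1)   # a list of length 1 holding 3
last_part <- tail(our_list, n = 1)    # a list of length 1 holding 7

first_value <- our_list[[1]]                  # the bare number 3
last_value <- our_list[[length(our_list)]]    # the bare number 7

# Asking for more items than exist is not an error; you get what there is.
too_many <- head(our_list, n = 3)
print(length(too_many))               # 2

# The default n is 6, which only shows up on something longer.
long_vec <- 1:20                      # hypothetical longer vector
print(head(long_vec))                 # 1 2 3 4 5 6
print(tail(long_vec, 2))              # 19 20
```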

Note I made our_list have [[1]] == 3 and [[2]] == 7 since far too often [[1]] == 1 and [[2]] == 2 and really people that's just not clear. If you're trying to make a useful example, don't make it where the same symbols (1, 2) are being used to represent widely different things.

Also, Googling for info on R's "by" command is just impossible, as "r by" is not a specific enough search string (in-context it's fine though). That's why I like books (yes paper) sometimes, if the index is any good, there you go.

Sunday, October 6, 2013

R For Loops Indexing

R does something a little unexpected -- well, unexpected to me -- with the indexing of the for loop (and maybe this is more general, I don't know).

If you have... (note I can't use "get" with the arrow made of brackets, Google does not parse that in terms of HTML and it kills the code...)

n = 5
for (i in 1 : n+1) {
    print(i)    # stand-in for "do stuff"
}

...the index runs from 2 to 6, not 1 to 6. The +1 gets added to both ends of the range, because the : operator binds more tightly than +.
So it's parsed as (i in (1:n) + 1).

What you need is....

for (i in 1 : (n+1)) {....

This is related to off by one errors (humorous explanation and more serious explanation), but I certainly didn't expect it to be parsed like that.
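
A short sketch to check the parsing for yourself. The variable names are mine; seq_len() is a base R function that sidesteps the precedence question entirely:

```r
n <- 5

wrong <- 1:n + 1              # parsed as (1:n) + 1, giving 2 3 4 5 6
right <- 1:(n + 1)            # 1 2 3 4 5 6
also_right <- seq_len(n + 1)  # same values, no precedence to think about

print(wrong)
print(right)

# Collect the loop indices to confirm what the fixed loop really visits.
visited <- integer(0)
for (i in 1:(n + 1)) {
    visited <- c(visited, i)
}
print(visited)                # 1 2 3 4 5 6
```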