Wednesday, October 23, 2013

R and Regex Named Matches

I use Python and R to do stuff: Python for web scraping and text cleanup, R for the analysis. But people have expanded the functionality of both, and they now overlap (it's hard enough to keep the two straight as it is). I found I needed to use named groups in regex in R, and... couldn't figure it out. The web did not help.

SHORT VERSION: Turn on Perl-style regex (perl = TRUE) and go read some Perl regex pages; you'll be fine. To name a group: (?<name>...). To match it again later: \\g{name}
This is completely different from what I was used to.
Google Blogger may try to blow this post up, since I want to include greater-than and less-than symbols. Yeah, they can't get that right. Oh, maybe it's working.

Typically, if I use regular expressions it's in Python, but R can do it too, and sometimes you'll want to do that. But there isn't a ton of help online about it (despite the links I have lined up to include below) and there are some things that confuse the issue (to Perl or not to Perl...).

If you just want to do some work with strings, first check out Hadley Wickham's stringr. It's awesome.

I, however, wanted to do some pattern matching that included a repeated section, so I needed regex's named group functionality, which I couldn't find or figure out in stringr. I was looking for patterns like this:

5,-1,5,-1,5

...where 5 could be any number from 0 to 500 or so, but the same number had to repeat. I had already removed spaces and added commas for easier parsing. (So other matches would be, like, 17,-1,17,-1,17, etc.) I needed to make sure the first number matched was the one repeated; thus, named groups (or any group capture really, but I wanted to name it).

But I also couldn't figure it out in R. I can do it in Python, but the Python code for regex wouldn't work in R, alas. It was not clear what changes needed to be made.

One reason was that the \ needs to be escaped in an R string, that is, written as \\. So for example, \d+ needed to be \\d+. That wasn't too hard to figure out. But the rest was.
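A quick way to see the escaping at work, as a minimal sketch (the variable names and the test string are assumptions, picked only for illustration):

```r
# "\\d+" in R source becomes the three-character regex \d+ once R parses it.
pattern <- "\\d+"
n_chars <- nchar(pattern)   # counts the characters \, d and +
cat(pattern, "\n")          # shows the pattern as the regex engine receives it

# The escaped pattern finds a run of digits in one of the example strings.
has_number <- grepl(pattern, "17,-1,17,-1,17", perl = TRUE)
```

The doubled backslash only exists in the source code; the regex engine itself sees a single one.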

R lets you pick a regex flavor: the default extended (POSIX-style) syntax, Perl style (perl = TRUE), or plain literal matching (fixed = TRUE). Uh, what? No idea! I just needed it to work. Specifically, named groups in R. I found one page that said "Named subpatterns... are not covered here." Hmm. Another page noted that "examples for the use of regex in R are rather rare" and had some useful examples. Eventually I figured I would set the Perl option and see what I could do; at least I could search on "perl". That made all the difference: I could look up how to do named groups in Perl-style regex, and there you go.

Name a group in Python: (?P<name>...)    
Name a group in R, Perl style: (?<name>...)

Note: I expect the less-than and greater-than symbols to fail at some point.

Reference it later in Python: (?P=name)
Reference it later in R, Perl style: \\g{name}
    So the curly braces are the Perl part, and the doubled backslash is just R's string escaping (the regex itself is \g{name}).
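Putting the pieces together for the 5,-1,5,-1,5 pattern, here is a minimal sketch in R. The sample vector, the variable names, and the ^ and $ anchors are assumptions added for illustration; the part that matters is (?<num>...) plus \\g{num} with perl = TRUE.

```r
# Two strings that should match and two where the number does not repeat.
x <- c("5,-1,5,-1,5", "17,-1,17,-1,17", "5,-1,6,-1,5", "17,-1,7,-1,17")

# (?<num>\\d+) names the first number; \\g{num} demands the same text again.
# ^ and $ anchor the pattern so it must cover the whole string.
pattern <- "^(?<num>\\d+),-1,\\g{num},-1,\\g{num}$"

hits <- grepl(pattern, x, perl = TRUE)

# With perl = TRUE, regexpr() also records where each named group matched.
m     <- regexpr(pattern, x[hits], perl = TRUE)
col   <- which(attr(m, "capture.names") == "num")
start <- attr(m, "capture.start")[, col]
len   <- attr(m, "capture.length")[, col]

# Pull the repeated number out of each matching string.
repeated <- substring(x[hits], start, start + len - 1)
```

Here hits flags only the first two strings, and repeated holds the number that repeats in each of them ("5" and "17").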

Some useful Perl-regex links:
http://modernperlbooks.com/books/modern_perl/chapter_06.html
http://perldoc.perl.org/perlre.html

Although honestly, one problem I have with a lot of online examples (and the R help files) is that they are completely arcane. If I'm looking for help with syntax, a complex example isn't going to solve it; that's bad usability.


Post Keywords: regex, R, r-project, cran, grep, regular expressions, named groups.