Showing posts with label Code.

Wednesday, October 4, 2017

Minitel!

A working Minitel! This one was built in 1985 and is still going strong, cared for by Julien Mailland and Kevin Driscoll, who spoke recently at MIT's Comparative Media Studies program's weekly seminar. They have an awesome new book out about the Minitel that hits on many of the issues I also ran into studying related technologies in my dissertation back in ~2003.

As they told me, all the specs for the Minitel were released when it came out in the 1980s so that service providers could connect to it. Those specs are still available today, which is how they have a working Minitel with an Arduino device sending it data. Wow! Super cool.

Minitel, front
Minitel, back, and Arduino device

Closeup, Arduino device

Pokémon GO and Cultural Locations

This is not new, but I have an example of a problematic Pokémon GO stop near my apartment (besides the one that is in the wrong location and the one that was removed IRL): one where the cultural info is wrong in the game even though the correct information is just a block away. (And yes, I know these are all from Ingress.)

The stop is "Laughing Man", as you see here in this screen grab:


However, just a short block north is a plaque about the sidewalk tiles in the neighborhood: it's not a man, it's Geneva, and there is a fair amount of information about her that could have been included in the in-game description. This points to another problem: there's no in-game way to correct erroneous information. Granted, allowing people to submit any old thing would be a disaster, and Niantic would need real human moderators (not just machine learning), but letting people submit any old thing is how they got the data for Ingress in the first place.


Tuesday, April 11, 2017

Python 3 vs. Python 2.7

I decided it was finally time to move from Python 2 to Python 3. Having done so, I don't see why I wasn't using Python 3 years ago, although my code worked just fine, so it wasn't really a big deal.

Python 3 has two big advantages, and there's also a third reason you should be using it by now.

  1. Unicode: You don't have to worry about catching Unicode characters in string types anymore; Python 3 handles it for you. This was a real concern for me with web scraping. So much easier.
  2. For years I've read about the following dilemma in OSX, with no solution: 
    1. If you DON'T install your own copy of Python 2, you are modifying the OS's copy of important libraries and such, and that can cause problems.
    2. If you DO install your own copy of Python 2, you then have two versions of Python 2 on your computer, and that can cause problems.
    The solution... just install Python 3. These two problems aren't even relevant.
  3. But the best part was that about 99% of my code still works as is. All I've had to do so far is change print statements, from print "Print this Py2!"  to  print("Print this Py3!"), and get rid of the Unicode error catching.
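To illustrate those two changes together (the snippet below is my own made-up example, not from my actual code):

```python
# Python 3: print is a function, and str is Unicode by default.
city = "Zürich"  # no u'' prefix or .encode()/.decode() juggling needed
greeting = "Scraped page title: {}".format(city)
print(greeting)

# In Python 2 the same thing needed special handling, e.g.:
#   city = u'Z\xfcrich'
#   print city.encode('utf-8')
```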

Sunday, July 24, 2016

TKinter, ttk, and Progressbar

tl;dr: ttk.Progressbar effectively tops out at 99 by default, not 100, despite the documentation. If you try to fill it to exactly 100 or beyond, that step() call won't register.

I was building a front end for a scraper app, and at first I tried Xcode and the Interface Builder (which I first saw over two decades ago on a NeXT machine, it was glorious then and it still is), but I couldn't get it to mesh with my Python code (so much of the online help is out of date). A friend told me I was being an idiot and should try something simpler, and I settled on TKinter, which had me up and running in very little time. (The front end took only two days, but I wasn't committing every waking hour to it, and I had to figure out how to take my linear Python script and conceive of it in the looping GUI manner, which was difficult.)

I wanted a text box field so the scraper could print to it like it does with Python's print statement to the terminal (but I don't want the user to have to deal with the terminal or the console). I ended up using ScrolledText, which you have to import separately (well, as far as I can tell, and it's working, so once it works, I don't have time to poke at it too much). Some ScrolledText notes:
  1. I needed setgrid=True to make the frames resize nicely. This was VITAL; packing frames in TKinter is an art I do not yet understand.
  2. You might want state='normal' to print to the field, then state='disabled' so the user doesn't type in it (but this loses copy capability).
  3. You'll want insert(END, new_string) to print at the bottom of your field, but then you also need see(END) so that it scrolls to the bottom -- otherwise it prints at the bottom but the view stays put at the top. Details.

Then I wanted two progress bars, one to show the user the scrape progress and the second to show the parsing progress. The scraping one I needed to fudge a little, so I tried....

my_window.scrape_progress.step(10) # first init step
my_window.scrape_progress.step(20) # bigger step

my_window.scrape_progress.step(20) # another bigger step
my_window.scrape_progress.step(50) # jump to done!

Where scrape_progress is the name of the Progressbar object for my scraping progress.

As you can see, that's 10 + 20 + 20 + 50 = 100.

The bar would fill 10% (10), then to about 30% (10+20), then to about 50% (10+20+20), then it wouldn't fill anymore.

Eventually, out of annoyance while trying alternatives, I used 49 instead of 50 in the last step, and it worked.

So no, the max is not 100, it's 99: the bar's values are probably 0-99 for 100 increments, as 0-100 would be 101 increments (and the value may wrap modulo the maximum, so landing on exactly 100 doesn't register as full). I suspect that step(100) won't work, but step(99) should fill it to 100%.
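Here's a little sketch of the wrap-around arithmetic, assuming (I haven't confirmed this against the Tk source) that step() takes the new value modulo the maximum:

```python
def stepped_value(steps, maximum=100):
    """Simulate ttk.Progressbar.step(), assuming the value wraps modulo the maximum."""
    value = 0
    for amount in steps:
        value = (value + amount) % maximum
    return value

print(stepped_value([10, 20, 20, 50]))  # 100 % 100 = 0: the bar snaps back to empty
print(stepped_value([10, 20, 20, 49]))  # 99: effectively full
```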

Some code:

from Tkinter import *
from ttk import * # ttk widgets should overwrite Tkinter ones in the namespace.
import ScrolledText as tkst  # Not sure why this is its own library.

# A method from my window class def, nothing to do with the Progressbar:
def print_to_text_field(self, the_string):
    new_string = '\n' + the_string
    self.the_text_field.configure(state='normal')    # enable writing
    self.the_text_field.insert(END, new_string)      # append at the bottom
    self.the_text_field.see(END)                     # scroll so the new text is visible
    self.the_text_field.configure(state='disabled')  # block user typing again
    tk_root.update()
    tk_root.update_idletasks()


Monday, July 4, 2016

Making a Spectrum/Gradient Color Palette for R / iGraph

How to make a color gradient palette in R for iGraph (that was written tersely for search engine results), since despite some online help I still had a really hard time figuring it out. As usual, now that it works, it doesn't seem too hard, but anyways.


(I had forgotten how horrible blogger is at R code with the "gets" syntax, the arrow, the less than with a dash. Google parses it as code, not text, and it just barfs all over the page, so I think I have to use the equal sign [old school R] instead. It is also completely failing at typeface changes from courier back to default. I see why people use WordPress....)

The way I will do it here takes six steps (and so six lines of code). There are a few different ways you could do this, such as where you set the gradient, or whether you assign the colors to the vertices (nodes) in the graph object or just use them at drawing time without storing them in the graph object itself. The variable I based the gradient on is an integer, and given my analysis I'm making a ratio of "for each item in my data, what is its percentage on that variable compared to the maximum?" It's a character level in a game, so if a character is level 5 and the max level is 10, then the value I want is 0.5 (i.e. half).

Keep in mind that the gradient you use here isn't analog (like a rainbow with thousands [more I think] of colors), it's a finite number of colors, with a starting color and an ending color. If your resolution is 10 then you have ten colors in your gradient, determined by the software as 8 steps between the color you told it to start at and the color you told it to end at (8 steps + start color + end color = 10 colors).

The general conceptual steps for how I did it:
  1. Set the resolution for the gradient, that is, how many color steps there are/you want.
  2. Set up the palette object with a start color and an end color. (Don't call it "palette" like I did at first, that is apparently some other object and it will blow up your code but the error message won't help with figuring it out.)
  3. You'll want a vector of values that will match to colors in the gradient for your observations, for what I'm doing I got the maximum on the variable in one step...
  4. And then set up the vector in the second step (so, this is a vector of the same length as the number of observations you have, since each value represents the value that matches up against a color in the gradient). (In my code here, it's a ratio, but the point is you have numerical values for your observations [your nodes] that will be matched to colors in the gradient.)
  5. Create a vector that is your gradient that has the correct color value for each observation. (The examples of this I could find online were very confusing, and that's why I'm making this post.)
  6. Draw! (Or you could assign colors to your graph object and then draw.)
Let's look at some code and, on occasion, the resulting objects. (I'll include the code as one code block below this explained version.)

Don't forget library(igraph) 

Also, if you're new to iGraph, note that it uses slightly odd (well to me at least) syntax, or you can use slightly odd syntax, to access and assign values to the nodes, that is, the Vertices of your graph, with V(your_igraph_object), which looks a little odd when you do V(g)$my_variable, for instance. (Below I do use "my_whatever" to highlight user made objects, except I did use just "g" for my iGraph graph object.)

Also note that, I think, the my_palette object is actually a function, but it definitely isn't a "palette" in the sense of a selection (or vector) of colors or color values. I think that is part of what makes line 4, below, unusual. Maybe I should have used my_palette_f to be more clear, but if you've made it this far, I have faith in you. (Also note that colorRampPalette is part of R, not part of iGraph.)

Using the language from the above steps...
  1. Set resolution, I'm using 100: my_resolution = 100
  2. Set palette end points, this starts with low values at blue and high values at red: my_palette = colorRampPalette(c('blue','red'))
  3. Get the max from your variable you want colorized to make the ratio: my_max = max(V(g)$my_var_of_interest, na.rm=TRUE)
  4. Create your vector of values which will determine the color values for each node. For me it was a ratio, so based on the max value: my_vector = V(g)$my_var_of_interest / my_max
    • Notice here we have iGraph's V(g)$var syntax.
  5. Create the vector of color values, based on your variable of interest and the palette end points and the resolution (how many steps of colors). This will give you a vector of color values with the correct color value in the correct location for your variables in your df-like object: my_colors = my_palette(my_resolution)[as.numeric(cut(my_vector, breaks=my_resolution))]
    • Ok let's explain that. Take my_vector, and bin it into a number of parts -- how many? That's set by the resolution variable (my_resolution). By "bin" I mean cut it up, divide it up, separate it into my_resolution number of elements. So if I have 200 items, I am still going to have 100 colors because I want to see where on the spectrum they all fall. Take that vector as.numeric (since maybe it comes back as factors, I don't know, I didn't poke at that.) Send that resulting vector of numeric elements (which are determined by my_var_of_interest and my_resolution) to the my_palette function along with my_resolution, which returns a vector of hex color values which are the colors you want in the correct order.
  6. Draw! plot(g, vertex.color=my_colors)
    • Note that we aren't modifying the colors in the iGraph object, we're just assigning them at run time for plot(). We could assign them to the iGraph object and them draw the graph instead.
Done! Let's look at two of the resulting vectors (but you should be using RStudio of course so you can see them anyways), as when I did it helped me understand what was going on.

So, my_vector is the vector of values for the variable of interest which determine the colors. They aren't the color values themselves, they are the positions on the scale which will get mapped to colors in the spectrum / gradient. (Note I have 1,019 observations in this data.)

my_vector   num [1:1019] 0.31 0.581 0.112 0.108 0.181 ...

So, we can see these are ratios and we know they're between 0 and 1 since that's how I set it up. (A percentage of the max value in this data.) These will map to the right colors in the gradient. Note we can change the gradient, either its start color, end color, or the resolution (how many steps), and this my_vector won't change. This my_vector gets mapped to the colors. What the colors in the gradient are depends on the start color, the end color, and how many steps in the gradient there are.

Then there is also my_colors, which have colors in hex! Exciting to see it work.

my_colors   chr [1:1019] "#4D00B1" "#92006C" "#1900E5" "#1900E5" ...

If you are great at mentally mapping hex RGB values onto a blue-to-red scale (blue and red being the start [i.e. 0] and end [i.e. 1] points set in line 2 above), you'll note that the values in my_vector do indeed map to the colors in my_colors, which is cool. (Notice that the middle two hex digits, the green in RGB, are always 00, since there is no green when you go from blue to red.) Note that the 3rd and 4th values in my_colors are the same: they map from 0.112 and 0.108, which, when binned into 100 bins, most likely both land in the 0.11 bin. Thus they get the same color value, #1900E5: 19 in hex for red and E5 for blue (out of a max of FF, so lots of blue and a little red), as both observations are about 11% of the way from the bottom (blue) end to the top (red) end. This makes sense.

So, there you go.

# Set up resolution and palette.
my_resolution = 100
my_palette    = colorRampPalette(c('blue','red'))

# This gives you the colors you want for every point.
my_max    = max(V(g)$my_var_of_interest, na.rm=TRUE)
my_vector = V(g)$my_var_of_interest / my_max
my_colors = my_palette(my_resolution)[as.numeric(cut(my_vector, breaks=my_resolution))]

# Now you just need to plot it with those colors.
plot(g, vertex.color=my_colors)

Sunday, March 6, 2016

NYC School of Data

Spent most of the day yesterday at the NYC School of Data conference -- accurately billed as "NYC's civic technology & open data conference." Sponsored by a wide variety of organizations, such as Microsoft and Data & Society, the day involved a lot of great groups, such as various NYC government data departments, including great NYC people such as Manhattan Borough President Gale Brewer and New York City council member Ben Kallos, and was held at my workplace, the awesome Civic Hall.

Thursday, November 19, 2015

Python local variable referenced before assignment

tl;dr: A mix of tabs and spaces for your indent will cause this problem. At least in Python 2.7.

I post this since the answers I see on Google/Stackoverflow all talk about scope, and I didn't have a scope issue. That's pretty much it. This usually happens to me when I paste in code from an example online (like from Stackoverflow, which I do like most of the time): it comes in with spaces, but I prefer to use tabs.
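For what it's worth, Python 3 refuses to guess and fails loudly at compile time instead. Here's a made-up snippet demonstrating that, where the function body mixes a tab and spaces:

```python
# A function body indented with a tab on one line and spaces on the next,
# like what you get when pasting space-indented code into a tab-indented file.
mixed = "def f():\n\tx = 1\n        return x\n"

try:
    compile(mixed, "<pasted>", "exec")
except TabError as e:
    print("TabError:", e)  # the message mentions inconsistent tabs and spaces
```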

Monday, August 31, 2015

Python, DictReader, DictWriter

Because I can never, ever, remember exactly how to code these. A basic example of both.

import csv

data_list = []

with open(input_file, 'rU') as f:
  data_file = csv.DictReader(f)
  for line in data_file:
    data_list.append(line) # gets you a list of dicts

the_header = ['h1', 'h2', 'etc'] # column headers, a list of text strings

with open(output_file, 'w') as f:
  file_writer = csv.DictWriter(f, fieldnames=the_header)
  file_writer.writeheader()
  for line in data_list:
    file_writer.writerow(line)

Sunday, August 2, 2015

Code for All Summit

Had a great time at the Code for All summit, held here at Civic Hall. Global meets local with a variety of civic tech people and a few government and NGO people thrown in. Code for All is the global offshoot of Code for America.

One nice thing to see was that yes, sometimes the best solution is SMS and not a fancy app.

That's me in the front row second from the left.


Sunday, June 14, 2015

Python, OSX, and Computer Name

Sounds thrilling! No, not the host name, but the name you give your computer -- so my multi-core beast is "NeXTcyl" (like a NeXT cube, but a cylinder).

It was somewhat difficult to find, well, not the best way to do this in Python, but the only way I could find to do it in Python for OSX. Lots and lots of methods for the hostname: no, no, no, Google, not that. You want to call out to scutil, a command line program.

import subprocess

# check_output captures the command's stdout; the raw output ends with a newline,
# hence the .strip(). (In Python 3, check_output returns bytes, so you would
# also want a .decode().)
this_computer = subprocess.check_output(["scutil", "--get", "ComputerName"]).strip()

Essentially, use the subprocess library to call a command line program, use check_output to capture its output (important!), and split the command line command into separate arguments in the list you hand the call (also important!). I tried about four other approaches before this one, and then about three different syntaxes before it worked, since I couldn't find any good online help. Here you go. This is for OSX, not Windows or other *nixes; no idea what it will do there (nothing bad, but maybe not what you want).

(Because I have a 3.6 GB file I don't want to put in DropBox, so I have a local copy on my desktop and on my laptop, but the files are in different paths on each, so I wanted a way to detect which machine the code was running on so as to call the right file path -- I could have just tried one path and if it failed use the other, but, I only just thought of that now.)
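The machine-detection idea can be sketched like this (the names and paths below are hypothetical; you'd feed in the ComputerName from the scutil call above):

```python
# Map each machine's ComputerName to where the big file lives on that machine.
# These names and paths are made-up examples.
DATA_PATHS = {
    "NeXTcyl": "/Users/me/Desktop/big_data_file.csv",
    "MyLaptop": "/Users/me/data/big_data_file.csv",
}

def choose_data_path(computer_name):
    """Pick the right local path for whichever machine the code is running on."""
    try:
        return DATA_PATHS[computer_name]
    except KeyError:
        raise RuntimeError("No data path configured for %s" % computer_name)

print(choose_data_path("NeXTcyl"))  # /Users/me/Desktop/big_data_file.csv
```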

Friday, April 10, 2015

SQLite and Python Notes

I don't have a background in SQL, so getting the syntax correct for SQLite in Python was a little tricky, especially since it reads like it's straight out of 1983. So, here is some working syntax for a search/select and replace, and also a search/select and an iteration through the results.

Search for one result (on a unique variable) and change some of that entry's data:

cursor.execute("UPDATE outfits SET size=?, members=?, scraped=? WHERE id=?", (how_many_members, char_id_list, 1, int(outfit_id)))

db.commit() 

This assumes you know a bit about SQL. cursor is your cursor object. This snippet searches the db for any entries (lines, rows, whatever) where the id variable matches the value of outfit_id. In this case that will be at most one entry, since I declared id unique when I made the db (which is some other SQLite code that is in multiple other places on the net). So, this line finds the row I want and changes those three variables; then you commit, which actually writes it. That seems really weird to me: either do it or don't do it. I assume this made some sense back in 1983 when people wrote code in capital letters. Oh, and outfits here is the name of your table in the db. Well, it's the name of my table in my db.

Iterator on search results:

cursor.execute("SELECT * FROM outfits WHERE scraped=?", (0,)) # this selects them but doesn't return them for use. NOTE TUPLE!!! 

not_scraped_outfits = cursor.fetchall() # aha! 

for an_outfit_row in not_scraped_outfits: 
    # do your stuff here 

That seems weird to me, but I guess I don't understand the cursor idea. You SELECT in caps, but then you have to fetchall(): two steps where you would think one would do. So, you SELECT everything (the asterisk) from your table that matches the WHERE clause, here where scraped is 0, since it's a Boolean. That returns zero, one, or more rows. For me this particular code usually returns several, and then you iterate through the results. (I think there are a few ways to code the iteration, but the code I have here works, so there you go.) Execute a SELECT, which is a search (the WHERE), then fetchall the results (even though you already selected them), then you can iterate through them.

NB: Tuple! When you use the ? placeholders (a parameterized query, which guards against SQL injection -- admittedly not a big worry when you are only running local code), the argument has to be a tuple, so if you are just passing one value you need a trailing comma:

cursor.execute("SELECT * FROM outfits WHERE scraped=?", (0)) # fail 

cursor.execute("SELECT * FROM outfits WHERE scraped=?", (0,)) # success, due to the last comma there


Also, one of the many pages I was poking around at suggested SQLite Manager, a plugin for Firefox. There may be other similar things, I have no idea, but I really like it, it's free, and if you don't have anything that allows you to view the innards of your SQL db easily, I strongly recommend it. If you don't use Firefox, heck it's just another app (I tend to think I don't need three browsers on my machines, but hey).

More also, it is apparently a good idea to store really long ID numbers as text, not numeric. (Because something, somewhere, decided to round them all off so they were all wrong.)

Thursday, April 9, 2015

Kickstarter Talks

A great evening at Kickstarter HQ here in Greenpoint! Three fantastic talks about processes they use there:

  1. Kumquat: Rendering Graphs and Data from R into Rails, by Fred Benenson, Head of Data.
  2. Rack::Attack: Protect your app with this one weird gem! by Aaron Suggs, Engineering Lead.
  3. Testing Is Fun Again, by Rebecca Sliter, Engineer.

It is always great to see uses of ggplot, especially on data upon which I also use ggplot, and I got to talk to Aaron about the throttling they do to stop malicious scrapers (I fall into the non-malicious camp of course!).

They have a really great little auditorium and they were pretty awesome and had some text to speech system for anyone who was hearing impaired -- that's the bright rectangle in the lower left (it's washed out due to contrast).
Here is Aaron talking about throttling overly requesty processes, which I found really funny since I have scraped Kickstarter but hopefully for good not for evil. 
Finally, here is a ggplot chart I made of some Kickstarter data, for US Music projects, with various other long-winded details. It shows that, among those who succeeded on their first project (blue line), people with a lower pledged-per-funder ratio (left side) were slightly more likely to do a second project (higher on the Y axis) than those with a higher ratio. We call this ratio "the sugar daddy" measure, since if you are high on it, maybe your rich uncle came in at the last minute to save your project.

Friday, April 3, 2015

R and Unlist Your List!

If you try to assign a list to a column in an R data frame, it won't quite work; you need unlist. (That's the short version for the search engine snippet. It does not make for a great narrative intro, but it's the concise summary.)

A few days ago, I was working in R, generating a new data frame from another one. It was a little more complex than I was used to; for instance, I had to bin one variable by the values of another variable and make some new percentages/frequencies in the new, smaller data frame, so I couldn't just use the non-looping approach that is common to R (and which is a lot nicer and faster). The data frame was relatively small, so one loop level was not a problem, even on my 4.5-year-old MacBook Air.

In one section, I generated a list of values (numeric, nothing fancy), and then assigned that list to a column in my new data frame. When I called the data frame to look at it, it looked fine, but if I did str(my_df) or summary(my_df) something was horribly wrong -- the column wasn't a numeric column, it was some odd list format and wasn't working for my ggplot.

I tried assigning the generated values directly to the column in the data frame, with something like this inside the loop, where I also incremented i:

my_df[i, 'the_variable'] = one_generated_variable

(Note I can't use R syntax there with the greater than sign, Google barfs on the code even though it's text, so I have to use the equals sign which is older R style.)

one_generated_variable was just a numeric value. Should have been fine, I thought! But no, it still came out as a list. I have no idea why, honestly it seems impossible since the values were generated one at a time and assigned then and there -- they were not bundled into a list first. But, unlist fixed the problem.

my_df$the_variable = unlist(my_df$the_variable)

That did it. I still don't understand the details, since I don't see why it was a list in the first place (aren't columns vectors anyways?). I have never run into that problem before, although mostly I've been working in Python lately.

Also, a friend put me onto data.table instead of data.frame for bigger data.

Thursday, February 19, 2015

Nice Little Python Trick

I can't summarize this for a headline, but I have a list (CSV) of KickStarter data, where each line is a project and includes the project URL and founder KickStarter username. I wanted to go in and get the biographies for each founder who had more than two projects. First I needed to drop all the one- and two-project people (easy), but then I'd have multiple URLs for each remaining founder. So, for someone with three projects, I'd have three unique URLs but only needed one.

I imagined sorting, or splitting, or checking against founder names that were already accounted for. Horrible. Then I realized... Dictionaries! With founder as the key.

So I read the CSV as a list of dicts (typically how it is done, but so few examples online show this that it's horrid), so I had

data_list[i]['founder']
and
data_list[i]['proj_url']
to work with.

And now Google insists on line breaks there with the pre tag. Sigh.

But, the solution! Since I only needed one URL per founder, it didn't matter which one I had. So I could just loop through the data once, and grab every URL, and let the dict just overwrite founder-key entries with any URL for that founder. So, the following code only loops once, returns a nice dict object, and uses founder names as the keys.

url_dict = dict()
for project in data_list:
    url_dict[project['founder']] = project['proj_url']

So for someone with, say, three projects, it will assign the first URL to their username, then assign the second URL and overwrite the first URL, and then the same for the third URL, overwriting the second. So I end up with every username associated with one appropriate URL. Perfect. No sorting, no checking, no nothing. Automatic, essentially.
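A made-up illustration of the overwrite behavior:

```python
# Three projects, two founders; the dict keeps whichever URL comes last per founder.
data_list = [
    {'founder': 'alice', 'proj_url': 'https://example.com/p1'},
    {'founder': 'alice', 'proj_url': 'https://example.com/p2'},
    {'founder': 'bob',   'proj_url': 'https://example.com/p3'},
]

url_dict = dict()
for project in data_list:
    url_dict[project['founder']] = project['proj_url']

print(url_dict)
# {'alice': 'https://example.com/p2', 'bob': 'https://example.com/p3'}
```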

So I thought that was nice.

Sunday, January 25, 2015

Random R Notes - Factors, Rank v. Order, Unsplit

Some R issues I have run into recently...

I split a dataframe, then split it again, and the analysis was taking forever. Something was wrong. On inspection, the sub-split DF still had all the factor levels from the original DF! Terrible, since one variable had about 25,000 of them. It just took forever (which I don't think it should, but hey, I gave up on it).

I needed droplevels. You can apply it to the whole DF, removing unused levels from your old DF and then assign it (to a new one or just write over the old one).

your_new_df = droplevels(your_old_df)

(Google cannot handle the less than sign, used for "get" in R instead of =, with either the code tag or the pre tag, it just blows it up. Annoying.)

Then I could run the ordering code on dates. But no: there is, I learned, order and there is rank. There is also sort but I managed to avoid that somehow, so I won't discuss that here.

Note that for rank you need to figure out what to do with ties! (That is, when values are equal, how to rank them exactly.)

There was a really great post about it on Stackoverflow but I can't find it at the moment. This post might help, though.

Or, I made a nice little example! I use R: as the start of the input lines since the greater than symbol and blogger are not friends.

R: the_list = c('A', 'D', 'B', 'C')

R: order(the_list)

[1] 1 3 4 2

R: rank(the_list)

[1] 1 4 2 3


So, you see the two outputs are different.
Order says, put the first element first, then the third element would come next (B), then the fourth element (C), then the second element (D).
Rank says, the first element is first, the second element (D) is the fourth of them all, the third element (B) is the second overall, and the fourth element (C) is the third overall.

Edit: I called sort!

R: sort(the_list)


[1] "A" "B" "C" "D"


That's awesome.

So after that, I wanted to unsplit. But no: I had added the rank column, so instead of 400 rows I only got 100 (I had 100 df's with 4 rows each). Unsplit does not work well (or at all?) if you add (or subtract?) items. So Stackoverflow told me I needed do.call and rbind ("row bind").

rejoined_df = do.call(rbind, splitted_df)


Note that splitted_df is the result of the split() call (which I'm not showing), so it is not actually a DF; I think it's a list of DFs, or at least not a typical DF. But you can call it directly, and if you use split you should familiarize yourself a bit with the resulting object.

There you have it, some random R notes.

Thursday, January 8, 2015

Amazon's Data Fail

So, I believe I have been an Amazon customer for well over ten years -- at least eight since I moved here to NYC, and I ordered stuff from them prior to that.

However, Amazon continues to ask me if I want the college student discount almost every time I check out. This is absurd, and there is no way to toggle it off: I had to get customer support in chat, and not even he could do it; he had to bump it up to his supervisor.

Given all the data Amazon has on me, they should know better. It's not just a question of an algorithm, someone -- a team most likely -- was in charge of the implementation here, not just of the algorithm but of the page and its features.

They have the data, and the feature knowledge, yet they failed tremendously anyways. This is not the "Target knew a woman was pregnant before her parents did!" story.

Here is the actual screenshot I took, this is not some random illustrative image I grabbed from somewhere else on the web:

Update, Jan 21: Almost two full weeks later and I got the page again. Amazing and pathetic.

Friday, July 11, 2014

Python, Multiprocessing, and Queues

Here's a little bit on what I've been working on lately, using Python's multiprocessing package to use the multiple cores on my machine but also using queues (in the multiprocessing package) to gather results from the processes. I'm writing it up because I found some info on queues that wasn't particularly helpful and in one case was just wrong.

So, basics first:

import multiprocessing

Ok good, got that step out of the way. Some things you will want:

num_cores = multiprocessing.cpu_count()

That gives you an integer of how many cores you have on your machine. If you have a newer machine where the cores are hyper-threaded into two virtual cores each, it will return the virtual number (so, on my 6-core [hardware] Mac Pro it returns 12 [virtual!], which is twice as awesome as 6).

The multiprocessing seems a little weird at first. Here's some code:

proc_list = list()
    # This list is so you can "join" them. Important!
my_queue = multiprocessing.Queue()
   # Queues do q.put() and q.get() and other things. 

for a_range in chunk_ranges: 
      # The number of chunks in my code equals the number of processes.
      # "chunk_ranges" is a list of lists, [[index_a, index_b], ...]
      # indexes in a Pandas DF I am parsing. Easy to parallelize.
      # Your for loop will differ, depending on your data.
  a_proc = multiprocessing.Process(target=your_worker_function,
    args=(a_range, the_data, my_queue))
      # To your worker function, pass the args -- here it's the data range,
      # a pointer to that data, and the queue object. 
  proc_list.append(a_proc)
      # So you have a list of them and can iterate through the list
      # for join to end them nicely. 
  a_proc.start() # Starts one! 

# Waits for them to end and ends them nicely, cleanup. 
for p in proc_list: 
  p.join()
  print '%s.exitcode = %s' % (p.name, p.exitcode)
    # You don't need this but it's nice to see them end
    # and their exit status number. 

print 'Getting elements from the queue.' 
while not my_queue.empty(): 
  one_procs_data = my_queue.get() 
  # then do something with that, maybe add it to a list, depends on what it is.
  # (Caveat: empty() isn't totally reliable with multiprocessing queues;
  # if you know how many items to expect, calling get() that many times is safer.)

Ok so that's commented, but how does it work? How does the worker function ("worker" is the term of art, it appears, for the function that is multi-processed) deal with the queue object? Let's look at code for that. Note that this is a somewhat simple example with one queue, I've seen nice examples with an input queue and an output queue. This example here deals only with an output queue (because my project is chunking a Pandas DF into processable pieces, that is playing the role of the input queue essentially).

def your_worker_function(a_range, the_data, your_queue): 
  # (The signature matches the args tuple above: the chunk range,
  # the data, and the queue.)
  # Do your processing here! 
  your_queue.put(the_data_you_want_returned) 
  # That's it, no "return" needed! 
# End of your_worker_function 

Not bad! Hand the worker function the queue object like you would any function's argument. You don't need to use the "return" call, since you use ".put" on the queue object and put the data you want into it. What is nice is that Python takes care of the worker functions all putting their data into the queue so you don't have to worry about them all getting smashed up (not a technical term) and your code barfing when/if they all try to access the object at the same time. No worries! Love it.

So how does that previous code work, the code that calls the worker function?

Declare a list object to populate with the processes (the Process objects themselves, not just their names). This is important, and is so you can end them all nicely and do cleanup. Also declare your queue object.

You use a looping function (here a for loop) to give out the jobs to the right number of processes. "target" is the worker function, and then "args" are the arguments you hand to that function, just as you would normally. Then, add the process that was just made to the list and start it.

The "join" for loop -- I have no idea why it is called "join" btw -- nicely cleans up after all the processes. I'm not discussing errors and such here, that's more advanced. The for loop does indeed loop through them all, and will get to all of them, waiting for them to finish in order (I think). I was a little curious about what if the first process in the list fails or doesn't stop or something, but somehow the looping will get to and join all the ones that have finished, even if they are after an infinitely looping one earlier in the list (yes I speak from experience oops).

Then, you can call your now-populated queue object and ".get" the items from it (whatever data type you put in). Deal with them appropriately -- so, maybe you made lists, and you don't want a list of lists when you're done, you want one list, so ".extend" to the outer main list object. Whatever is appropriate for your job.
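Putting all of the above together, here is a minimal self-contained version of the pattern. The worker just sums its chunk of a list; the names sum_chunk and parallel_sum are mine, not from the snippets above:

```python
import multiprocessing

def sum_chunk(a_range, the_data, out_queue):
    # Toy worker: sum the slice of the data this process is responsible
    # for, and put the result on the shared queue instead of returning it.
    lo, hi = a_range
    out_queue.put(sum(the_data[lo:hi]))

def parallel_sum(the_data, chunk_ranges):
    my_queue = multiprocessing.Queue()
    proc_list = []
    for a_range in chunk_ranges:
        a_proc = multiprocessing.Process(target=sum_chunk,
                                         args=(a_range, the_data, my_queue))
        proc_list.append(a_proc)
        a_proc.start()
    # Drain the queue BEFORE joining: one get() per process started.
    results = [my_queue.get() for _ in proc_list]
    for p in proc_list:
        p.join()
    return results

if __name__ == '__main__':
    data = list(range(100))
    chunks = [[0, 25], [25, 50], [50, 75], [75, 100]]
    print(sum(parallel_sum(data, chunks)))  # prints 4950, same as sum(range(100))
```

Note the ordering: all the get() calls happen before any join(), which matters for reasons covered in the edits at the end of this post. (This is written as Python 3; for the Python 2 of the snippets above, the print becomes a statement.)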

There you have it. If you have 12 cores you could make it go to 11 instead of 12, but it's more fun (or at least faster) to say that you could and use 12 anyways.

Edit: Apparently, queue objects push their data through a pipe with a limited buffer, and things can grind to a halt: your processes will stop doing work but won't complete (that is, your CPU won't be doing anything but your processes won't join) if you overload your queue. The docs do warn about this, it turns out: a process that has put items on a queue won't terminate until all its buffered items have been flushed to the pipe, so you should get() everything off the queue before you join(). Oddly this doesn't crash; it just hangs, since it's a deadlock rather than an exception. Suffice to say, I have some code that is fine but grinds to a halt even though the worker processes finish their work; the processes don't join. Right now I'm looking at the queue as the culprit.

Edit 2: I took out the my_queue.put(item) call in the worker process and replaced it with a file write (using the process name for unique filenames) and.... It worked! Actually, first I just took out the my_queue.put(item) and then they all joined; so, not happy with the queue. If you are doing multiprocessing you probably have a lot of data, and the queue's pipe buffer can't hold it all while the processes wait to exit. And, worse, on OS X you can't get the size of the queue, since qsize() relies on sem_getvalue(), which isn't implemented there, and queue.full() isn't totally accurate either (the docs say as much).
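A sketch of that file-per-process workaround (the worker body and helper names here are mine; the point is that with nothing buffered in a queue, join() can't deadlock):

```python
import multiprocessing
import os
import tempfile

def worker_to_file(a_range, the_data, out_dir):
    # Instead of queue.put(), write the result to a file named after
    # the process, so each worker gets a unique filename.
    lo, hi = a_range
    result = sum(the_data[lo:hi])
    name = multiprocessing.current_process().name
    with open(os.path.join(out_dir, name + '.txt'), 'w') as f:
        f.write(str(result))

if __name__ == '__main__':
    data = list(range(100))
    out_dir = tempfile.mkdtemp()
    procs = [multiprocessing.Process(target=worker_to_file,
                                     args=(r, data, out_dir))
             for r in ([0, 50], [50, 100])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # Safe: nothing is buffered in a queue waiting to flush.

    # Collect the results back from the files.
    total = 0
    for fname in os.listdir(out_dir):
        with open(os.path.join(out_dir, fname)) as f:
            total += int(f.read())
    print(total)  # prints 4950
```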

Wednesday, June 11, 2014

Test Your Surveys

If you don't actually test your surveys, you roll out something for a major US telecommunications carrier that doesn't pick up the right value, prints blanks instead, and is clearly wrong, like this:


Wednesday, October 23, 2013

R and Regex Named Matches

I use Python and R to do stuff: Python for web scraping and text cleanup, R for the analysis. But people have expanded the functionality of the two, and they now overlap (it's hard enough not to get them confused as it is, some of the time). I found I needed to use named groups in regex in R, and... couldn't figure it out. The web did not help.

SHORT VERSION: Turn on the Perl regex style (perl = TRUE) and go read some Perl regex pages, you'll be fine. Name the match: (?<name>...); to reference it later: \\g{name}.
This is completely different from what I was used to.
Google Blogger will try to blow this post up since I want to have greater-than and less-than symbols. Yeah, they can't get that right. Oh, maybe it's working.

Typically, if I use regular expressions it's in Python, but R can do it too, and sometimes you'll want to do that. But there isn't a ton of help online about it (despite the links I have lined up to include below) and there are some things that confuse the issue (to Perl or not to Perl...).

If you just want to do some work with strings, first check out Hadley Wickham's stringr. It's awesome.

I, however, wanted to do some pattern matching that included a repeated section, so I needed regex's named group functionality, which I couldn't find or figure out in stringr. I was looking for patterns like this:

5,-1,5,-1,5

...where 5 could be any number between 0 - 500 or so, but it would repeat. I had already removed spaces and added commas for easier parsing. (So other matches would be, like, 17,-1,17,-1,17.... etc.) So I needed to make sure the first match there was repeated, thus, named groups (or any group capture really, but I wanted to name it).

But I also couldn't figure it out in R. I can do it in Python, but the Python code for regex wouldn't work in R, alas. It was not clear what changes needed to be made.

One reason was that the \ needs to be escaped, that is, \\. So for example, \d+ needed to be \\d+. That wasn't too hard to figure out. But the rest was.

You can have Perl style, or POSIX, or neither. Uh, what? No idea! I just needed it to work; specifically, named groups in R. I found this page which said "Named subpatterns... are not covered here." Hmm. Another page noted that "examples for the use of regex in R are rather rare" but had some useful examples. Eventually I figured I would set the Perl option and see what I could do; at least then I could search on "perl", and that made all the difference: once I found out how to do named groups in Perl-style regex, there you go.

Name a group in Python: (?P<name>...)    
Name a group in R, Perl style: (?<name>...)

Note: I expect the less than and greater than symbols fail at some point.

Reference it later in Python: (?P=name)
Reference it later in R, Perl style: \\g{name}
    So curly braces (Perl?), and double backslash for R.
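To make the comparison concrete, here's the Python side of it, matching the repeated-number pattern from above (the test strings are mine; in R the pattern would be the same except with (?<num>...) and \\g{num}, and perl = TRUE set):

```python
import re

# Match "N,-1,N,-1,N" where all three Ns are the same number.
pattern = re.compile(r'(?P<num>\d+),-1,(?P=num),-1,(?P=num)')

print(bool(pattern.fullmatch('5,-1,5,-1,5')))     # True
print(bool(pattern.fullmatch('17,-1,17,-1,17')))  # True
print(bool(pattern.fullmatch('5,-1,6,-1,5')))     # False: middle number differs
```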

Some useful Perl-regex links:
http://modernperlbooks.com/books/modern_perl/chapter_06.html
http://perldoc.perl.org/perlre.html

Although honestly one problem I have with a lot of online examples (and the R help files) is that they are completely arcane. If I'm looking for help with syntax, a complex example isn't going to solve it; that's bad usability.


Post Keywords: regex, R, r-project, cran, grep, regular expressions, named groups.

Monday, October 21, 2013

R and head() and tail()

So, if you're not careful using R's head() and tail() commands, you'll end up with a little surprise. Perhaps I should say not careful reading the documentation.

Head() and tail() do not return just one item from the list (or whatever); they return several. So head does not mean first, and tail does not mean last.

Read carefully: "Returns the first or last parts of a vector, matrix, table, data frame or function" [Italics added.] PARTS. Plural. An 's' on there.

Example:

our_list = list(3, 7)       # Makes a list with two items, the first is 3, the second is 7. (Strictly numerics, not integers: R integers need the L suffix, as in 3L.)


Note I used "=" since if I use a "less than" bracket, Google blogger freaks out. (The typical R code is "less than" followed by a dash, which make an arrow, representing "gets", the left side gets [is given] the right side.) It's giving me a hard time with formatting as it is.
If you type in "our_list", R will print our_list:

[[1]]
[1] 3

[[2]]
[1] 7

So, the first item in our_list [[1]] has one item [1], which is a 3.
The second item in our_list [[2]] has one item [1], which is a 7.

I don't fully understand the difference between [[x]] and [x]; it seems mysterious. (The gist: [[x]] extracts the element itself, while [x] returns a sub-list containing it.)
Edit: The R Inferno, 8.1.54.... aha. Still is mysterious, though.

If you type:
head(our_list)


...it would be nice to get just the head, that is, the first item. But no, you get several: head() defaults to n=6, so for our small list that means the entire list, and larger lists would return their first six items.

What you want is:
head(our_list, n=1)

...where the 'n' gives the value of how many items you want. (You don't actually need the n=, I have noticed.)
When I try "n=3" for this two item list, it just gives the first two items (i.e., the entire list in this case) and does not give an error.

Note I made our_list have [[1]] == 3 and [[2]] == 7 since far too often examples use [[1]] == 1 and [[2]] == 2, and really, people, that's just not clear. If you're trying to make a useful example, don't write one where the same symbols (1, 2) are used to represent widely different things.

Also, Googling for info on R's "by" command is just about impossible, as "r by" is not a specific enough search string (in context it's fine, though). That's why I sometimes like books (yes, paper): if the index is any good, there you go.