Saturday, August 2, 2014

Not Ok, Cupid

Regarding the recent OkCupid "study", there's a nice piece about both it and the horrible Facebook study that you can read over at Kottke. One thing I like is that it discusses the community in which Facebook posters exist, which I found to be an important and overlooked issue.

"It's not A/B testing. It's just being an asshole."

Thursday, July 31, 2014

Gephi and Importing an Edges List

So, I wanted to import a CSV I had made in Python into Gephi. Easy. But, no. Gephi wasn't importing it correctly: File: Open misread the file, and both the headers and the weights were being counted as nodes.

To the Internets! They were horrible. They all said to use the "Import Spreadsheet" button in the Data Table tab in the Data Laboratory. I didn't have such a button in the Data Laboratory. Bastards.

Mine looked like this:

As you can see, there is no "Import Spreadsheet" button there at all.

The Internets (like here, here, and the wiki page) kept telling me it should look like this:

BUT IT DIDN'T. Come on. Killin' me.

Solution: File: New Project. (Or command-N.) Really. That's all.

So I tried to fix the wiki page about it, but you have to have an account and I don't see anywhere to make one. So I tried the contact for the main community person, but the link to his page went nowhere, and the link for the American community guy went to some old generic page that has been retired. (I can't easily find those pages now and it's not worth the time.)

Not useful, people, not useful. A shame too, since Gephi is fantastic.

Friday, July 11, 2014

Python, Multiprocessing, and Queues

Here's a little bit on what I've been working on lately, using Python's multiprocessing package to use the multiple cores on my machine but also using queues (in the multiprocessing package) to gather results from the processes. I'm writing it up because I found some info on queues that wasn't particularly helpful and in one case was just wrong.

So, basics first:

import multiprocessing

Ok good, got that step out of the way. Some things you will want:

num_cores = multiprocessing.cpu_count()

That gives you an integer count of how many cores you have on your machine. If you have a newer machine with hyper-threading, where each physical core is so awesome that it presents itself as two logical cores, it will return the logical number (so, on my 6-core [hardware] Mac Pro it returns 12 [virtual!], which is twice as awesome as 6).
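Since the usual reason to ask for the core count is to split your data into that many pieces, here's a minimal sketch of how you might build something like the chunk_ranges list that shows up in the code below. Note that make_chunk_ranges is a hypothetical helper of my own (not part of multiprocessing), shown in Python 3 syntax:

```python
import multiprocessing

def make_chunk_ranges(n_rows, n_chunks):
    # Hypothetical helper: split [0, n_rows) into [start, stop] index
    # pairs, one per chunk, like the chunk_ranges list used below.
    step = -(-n_rows // n_chunks)  # ceiling division, so no rows are dropped
    return [[i, min(i + step, n_rows)] for i in range(0, n_rows, step)]

num_cores = multiprocessing.cpu_count()
chunk_ranges = make_chunk_ranges(100, num_cores)  # one chunk per core
print(make_chunk_ranges(100, 4))  # [[0, 25], [25, 50], [50, 75], [75, 100]]
```

Any equivalent splitting will do; the point is just that the number of chunks matches the number of processes you plan to start.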

The multiprocessing seems a little weird at first. Here's some code:

proc_list = list()
    # This list holds the processes so you can "join" them later. Important!
my_queue = multiprocessing.Queue()
   # Queues do q.put() and q.get() and other things. 

for a_range in chunk_ranges: 
      # The number of chunks in my code equals the number of processes.
      # "chunk_ranges" is a list of lists, [[index_a, index_b], ...] --
      # indexes in a Pandas DF I am parsing. Easy to parallelize.
      # Your for loop will differ, depending on your data.
  a_proc = multiprocessing.Process(target=your_worker_function,
    args=(a_range, the_data, my_queue))
      # To your worker function, pass the args -- here it's the data range,
      # a pointer to that data, and the queue object. 
  proc_list.append(a_proc)
      # Add each process to the list, so you can iterate through
      # the list later and join them to end them nicely. 
  a_proc.start() # Starts one! 

# Waits for them to end and ends them nicely, cleanup. 
for p in proc_list: 
  p.join()
  print '%s.exitcode = %s' % (p.name, p.exitcode)
    # You don't need the print, but it's nice to see them end
    # and their exit status number. 

print 'Getting elements from the queue.' 
while not my_queue.empty(): 
  one_procs_data = my_queue.get() 
  # then do something with that, maybe add it to a list, depends on what it is.
  # (Careful: .empty() isn't totally reliable; doing one .get() per
  # process you started is safer.)

Ok so that's commented, but how does it work? How does the worker function ("worker" is the term of art, it appears, for the function that is multi-processed) deal with the queue object? Let's look at code for that. Note that this is a somewhat simple example with one queue; I've seen nice examples with an input queue and an output queue. This example deals only with an output queue (because my project chunks a Pandas DF into processable pieces, the chunking essentially plays the role of the input queue).

def your_worker_function(a_range, the_data, my_queue): 
  # Do your processing here -- e.g., work on the rows of the_data
  # between a_range[0] and a_range[1], whatever your job is.
  my_queue.put(your_results) # Put whatever you produced into the queue.
  # That's it, no "return" needed! 
# End of your_worker_function 

Not bad! Hand the worker function the queue object like you would any function's argument. You don't need to use the "return" call, since you use ".put" on the queue object and put the data you want into it. What is nice is that Python takes care of the worker functions all putting their data into the queue so you don't have to worry about them all getting smashed up (not a technical term) and your code barfing when/if they all try to access the object at the same time. No worries! Love it.

So how does that previous code work, the code that calls the worker function?

Declare a list object to populate with the processes (the Process objects themselves). This is important: it's how you end them all nicely and do cleanup. Also declare your queue object.

You use a looping function (here a for loop) to give out the jobs to the right number of processes. "target" is the worker function, and then "args" are the arguments you hand to that function, just as you would normally. Then, add the process that was just made to the list and start it.

The "join" for loop -- the name comes from threading, btw: a child's line of execution "joins" back into the parent's when it finishes -- nicely cleans up after all the processes. I'm not discussing errors and such here, that's more advanced. The for loop does indeed get to all of them, blocking on each in list order; since they all run concurrently, a process later in the list may already be finished by the time you join it, in which case join() returns immediately. (If a process never finishes, though -- say it's stuck in an infinite loop -- the join loop will sit and wait on it forever. Yes, I speak from experience, oops.)

Then, you can call your now-populated queue object and ".get" the items from it (whatever data type you put in). Deal with them appropriately -- so, maybe you made lists, and you don't want a list of lists when you're done, you want one list, so ".extend" to the outer main list object. Whatever is appropriate for your job.

There you have it. If you have 12 cores you could make it go to 11 instead of 12, but it's more fun (or at least faster) to say that you could and use 12 anyways.

Edit: Apparently, queue objects have a size limit, and things can grind to a halt, with your processes stopping but not completing (that is, your CPU won't be doing anything but your processes won't join), if you overload your queue. Oddly this doesn't crash, perhaps because exceptions don't get pickled and passed around the levels of multiprocessing. (That was a very detail-thin explanation.) The deeper issue, per the module docs, is that a process that has put items on a queue won't terminate until all of that buffered data is flushed to the underlying pipe, so joining a process before draining the queue can deadlock. Suffice it to say, I have some code that is otherwise fine but grinds to a halt: the workers finish their work, but the processes don't join. Right now I'm looking at the queue as the culprit.

Edit 2: I took out the my_queue.put(item) call in the worker process and replaced it with a file write (using the process name for unique filenames) and.... It worked! Actually, first I just took out the my_queue.put(item) and then they all joined, so: not happy with the queue. If you are doing multiprocessing you probably have a lot of data, and I guess the queue can't handle it. And, worse, on OS X you can't get the size of the queue, since Queue.qsize() isn't implemented there (it raises NotImplementedError), and Queue.full() isn't totally reliable either.

Friday, July 4, 2014

How To Circumvent Your IRB In 4 Easy Steps

  1. Design a study with a big company that is going to run it without you.
    • The study can manipulate emotions yet include people with mood disorders and people under the age of 18.
  2. Have the big company run it and have them collect the data.
    • You don't know how to code in their massive and distributed server environment anyways.
  3. Approach your IRB and ask to use data that has already been collected, and because it has already been collected no human subjects approval from the IRB is needed.
    • Hope your IRB is lazy and doesn't think about this too much.
  4. Publish and over-hype your results.
    • Issue lots of non-apologies when the research community calls you on your lack of ethics.

Sunday, June 29, 2014

That Facebook Study on Manipulating Emotions

My Summary
People are instinctually driven to be a part of communities (thanks to evolution). Facebook wants to be our go-to place for easier communication with our communities (notice the similarity between those two words). We know that being a community member means celebrating the good and giving support when things are bad. By taking away both positive and negative posts, Facebook took away our ability to do that, and in doing so threatened our ability to take part in our communities, which, not incorrectly, is seen as a threat to our livelihood. That is a big part of the reaction here, and it's not getting the attention it deserves.

Let's be clear: the study was completely unethical, and it is horrifying that everyone involved was apparently blind to this obvious fact. Yes, obvious fact, and no it doesn't make it either non-obvious or not a fact that so many educated people missed, and continue to miss, this important point.

Some Links

  1. The actual, rather short, paper about the study
  2. A response at the AV Club, the first thing I learned about it.
  3. A great piece at Tumbling Conduct
  4. Another great piece at The Laboratorium, by James Grimmelmann. 
  5. A great takedown of the methods at Psychcentral
  6. Forbes wrote about it, and included the (lame) Facebook explanation from one of the authors.
  7. A good blog post about the lack of informed consent and why it matters here.
  8. A great NYTimes opinion piece by Jaron Lanier.
  9. The not-quite retraction by the journal, an "Editorial Expression of Concern".
  10. A lengthy write up at Science Based Medicine, quite good.
  11. Statsblog has a guest post that is also worth reading.
(I am editing this over the course of Sunday, Monday, and Tuesday: reflection and thought are more important than speed of posting. And now Friday to add the "Editorial Expression of Concern" from PNAS.)

Terms of Service: They Don't Care
No one gave informed consent to this, and yes, that matters. The Terms of Service is not informed consent; it is laughable to think it is. Some people are saying that because not all studies need informed consent, this one didn't need it either. That's not true.

Now it turns out that the Facebook TOS didn't actually include the word "research" in it at the time. Let's be honest though, the only real weight of this discovery is that Facebook doesn't follow its own TOS, which isn't surprising.

And now (Tuesday, July 1) I am reading that there may have been Facebook users who were under the age of 18 in the study, in a followup at Forbes which links to a login-protected WSJ article. (I am guessing that under 18 is a different category for studies and there may be some legal issue about that, but I don't do A/B research on young people.)

Cornell's IRB: Oops
And it also looks like Cornell's IRB is trying to wash its hands of the process: apparently they just rubber stamped the study because the experiment had already been run by the time it came to them. That is, the study was run without academic IRB approval. They actually have a statement about it.

Cornell's IRB statement is horrible and intentionally misleading. It says that the Cornell researchers' involvement "was limited to initial discussions, analyzing the research results and working with colleagues from Facebook to prepare the peer-reviewed paper."
What this means is that they did everything except run a bunch of extremely complicated code on the Facebook system, which would have selected user accounts for the study, manipulated the study conditions, and then data scraped the relevant data out of a big data cloud computing environment. The only people qualified to do that are the Facebook techies.

There is no "limited" part here, they did everything, from start to finish, with a bit of help on the technical side. This is a very large and total failure of the IRB process.

Furthermore, Cornell faculty member professor Hancock "was not directly engaged in human research," which is laughable. Cynically, I could say that we see here that neither Facebook nor Cornell considers us human. My real guess is that Cornell's IRB just rubber stamped this and they have a very poor oversight process, or a very weak understanding of Facebook.

The researchers had a theory that they could indeed manipulate people's behavior, as shown by what they post on Facebook, by manipulating what people saw in their feed. Some say this is irrelevant because Facebook manipulates our feeds all the time, and this is apparently in part why IRB approval was given. This is irrelevant. Facebook manipulates our news feed, yes (that word is used slightly differently in research communities than in the rest of the real world, where it is very creepy, as it should be), but by "most popular"; never before has it been suggested that it is by mood. This is totally different and an important distinction.

Effect Or Not
Some people also say that it is irrelevant because there was no effect (despite the authors of the paper claiming a finding, despite the difference being roughly equivalent to zero). But no: there was no real effect that could be measured in Facebook. We have no idea what the real-world effects were, if any. And that's important. Don't confuse big data with the real world. Big is not complete, as someone once said about big data.

That the finding was so small but statistically significant makes it a bit paradoxical to talk about. So the researchers can claim a finding -- they wrote in the paper that "We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others" [italics added] but then Sheryl Sandberg, Facebook's COO, said "Facebook cannot control emotions of users." So much for being on the same page.

The Cornell Press Release department heavily stresses the effects, repeatedly quoting one of the authors.
“Online messages influence our experience of emotions, which may affect a variety of offline behaviors,” Hancock said.
But of course they didn't take any offline measures at all.

Professor Jessica Vitak pointed out, in a Facebook thread, that it is most likely they didn't measure emotion at all (since we can't say that Facebook posts are that representative of emotion all the time). What they could have measured was something along the lines of social acceptability of the emotional leaning of posts (she summarized it much better than I did there and had a better phrase for it). We know they measured post language, but we don't really know what that represents beyond Facebook posts, if anything. That's not good science.

The Sample: Representative? No
The sample, its representativeness, and who the (non) results apply to are also problematic. Facebook users are not representative of the population at large. They just aren't. They have internet access and computer skills; not everyone has those two things. We are not really sure about the sample from the study: it's Facebook users whose posts were in English, but that is all we know about them. It is scientifically unsound to then claim that the (non) results here apply to everyone else, because we don't know enough about who the unwilling participants were and how they match up with other groups of people.

Of course if you only care about Facebook advertising, then the only relevant sample is Facebook users.

The Sample: Mental Health? Users Between 13-18?
The public health angle has only been explored by a few comments I've seen, and it's complex. I've seen one comment noting that about 10% of people have a mental health disorder; ah, here's the National Institute of Mental Health, which says 9.5%.

9.5% of 689,003 = 65,455 people in the study with a mood disorder (most likely -- this is statistics).

Could seeing fewer positive or negative posts cause problems? Yes. Will it, for any one person? We don't know, there are many many factors at play here. But if you're running a study where the point is to manipulate mood and you're going to have 65,455 people with a mood disorder in it, you need to be really clear about that and really careful, and this study comes nowhere near that standard.

Others have pointed out that, besides having no way to filter out those with mood disorders in a study meant to manipulate moods, we have no idea if the study filtered out young people.

Additionally, some people have pointed out the public health issues around this kind of experimentation and manipulation.

A/B Testing Is Done All The Time! So What?
Some have also said that it's ok because companies do A/B tests all the time (that is, tests with two conditions). Well does that make every A/B test ok? No, it does not. Also, Facebook is not like other companies -- other companies are not the home of our digital communities. Facebook likes to say how big and important they are because of this, but if these communities are so important to people then it is not okay to manipulate the emotional content in them at all. Yes, communities can be informational, but a lot of the time Facebook friends are also real world friends and family and the emotional content is really, really important.

Communication Is Community
In-group, out-group is important. This is Facebook, people who for most of us are out-group, manipulating the messaging in our in-groups. Facebook degraded our communication, and communication is community (they have the same root in English), and when out-groups do that I think it is rightly seen as a threat.

I want to stress the community angle. Communication forms community. This experiment reduced important, emotional communication in communities for hundreds of thousands of people. Taking part in emotional communication is a vital ritual for community members that both reinforces that community and affirms that person's membership in that community. This includes both our taking part in emotional support (replying to something negative) and our taking part in celebratory communication. To reduce our capability to take part in important community ritual is a direct threat to our social survival, and it is anathema for a company that wants to be, and currently is, the largest online community platform in the world. (Two of my favorite thinkers about community and ritual are Clifford Geertz, and on this topic see his chapter about a funeral in Java; and James Carey, who has written about community, communication, and ritual.)

Some people have said that because the researchers didn't add any negative posts, merely took away positive ones (in one of the test conditions), and because you could still see them on your friends' pages, that this is ok. No it's not. (Do you really go to each and every one of your friends' pages every time you go to Facebook? Do you know anyone who does? I didn't think so.) Taking away a negative post is horrible, because it takes away my ability to support a friend in need; doing so undermines my ability to act appropriately in my community, and that is hugely problematic. The same is true for my missing out on a positive post: I am denied the opportunity to take part in a positive celebration in one of my communities.

What Was The Purpose?
Some have said that the researchers weren't trying to manipulate people's emotions, just their behavior on Facebook. Well no, that's ridiculous for at least two reasons. One is the title, which contains the phrase "emotional contagion", so we know what they were thinking. The other one is of course researchers always want to have something larger to say about human behavior. You can't manipulate what people are doing in terms of their emotions without perhaps affecting their emotions. If you don't know, you are obligated to find out. But again, we have no idea how, if at all, this affected people who were in the study in their real world lives.

As some have pointed out, this was research done on a not very interesting question (this seems pretty obvious to me), on people who did not give consent, with an ineffective IRB, under academic auspices but lacking academic standards, with no consideration of real world effects, with faulty methods, and which could have been done somewhat differently looking for correlations in what people saw and what they posted using data mining and no manipulations at all.

I am actually debating quitting Facebook because of this. Google+, anyone?

John Gruber, long time computer industry expert, has a post about it with one line I'll cite: "Yes, this is creepy as hell, and indicates a complete and utter lack of respect for their users’ privacy or the integrity of their feed content. Guess what: that’s Facebook." [Italics in original.]

But it was also Cornell and two Cornell-affiliated researchers.

Friday, June 27, 2014

Picturephone Redux

I wrote a little about the Picturephone a few years ago, so it is cool to see it mentioned in the NYTimes today with a photo I hadn't seen before (this one I'm including).

NYTimes caption: "In New York on Dec. 21, 1965, Keum Ja Kim, 15, a soloist with the World Vision Orphan Choir, used the Picturephone to audition for Robert Merrill, a star with the Metropolitan Opera, who was in Washington to sing at the White House. Credit Bettmann/Corbis"

Wednesday, June 25, 2014

The Real Problem with the iPhone Fingerprint Sensor

It gets shmutzy and won't read your fingerprint until you clean it. This might involve pressing it hard enough to activate Siri. This is not a bad problem to have, "boo hoo I have a smart phone my life sucks."

When it came out, there were far too many breathless articles about how someone could make a rubber copy of your fingerprint and then hack your phone. This was stupid, and everyone knew it was stupid, but people wrote about it anyways. (And apparently a lot of people don't even have passcodes on their phones at all, so....) In order to do this, someone would need to steal your phone and make a rubber copy of your fingerprint. How many times has this happened? Zero. Why? Because the internets would blow up if it did, and that hasn't happened.

(Photo credit Apple Inc.)