Sunday, September 7, 2014

Horrible Web Ads

I am tired of horrible web ads, I am tired of women's images being used to manipulate curiosity and click-throughs, and I am tired of pathetically transparent false geo-targeting, but it's kind of funny when it doesn't work.


And the misuse of quotes. 'Rattled'? What does that even mean?

Saturday, August 30, 2014

(Mis)Information Propagation

"The One Dirty Little Secret About The Web You Don't Know!" Of course you probably do know it, and that's an intentionally horrible click-bait line. The reason you see Wikipedia's content scraped and represented in so many places is because people are too lazy to do the actual work required to do whatever it is they are trying to do (usually just make a buck). But this gets interesting, slightly, when the information is wrong.

My main, and it was going to be the sole, example was regarding the gas station / service station that does not exist around the corner from me here in Brooklyn, to which I was alerted by Apple Maps. This is why I don't use Apple Maps. I have reported it several times, and it is still there. They could use Google Street View to see very easily that there is no service station at that location (or anywhere near it), but whoever maintains that information does not care. Not at all. But first, a few quick paragraphs about the iconic "ironworkers on a beam high above NYC" photo.

The Smithsonian Magazine (and note that I love and respect the museum, and a good friend of mine works there) has an article about the photo. Which is great, except it's completely wrong. The man on the right of the beam isn't Patrick "Sonny" Glynn, despite the article's claim that "Pat Glynn is also the source for the identity of this worker, who he claims is his father, Patrick 'Sonny' Glynn." It's the grandfather of some friends of mine whom I have known for over 30 years. (And if you look closely you can see he's missing part of a finger, which he lost in a construction accident.)

To say that "for 80 years, the 11 ironworkers in the iconic photo have remained unknown" is horrible, because it's overselling, hype, and completely untrue. Just because the general public didn't know who those men were doesn't at all mean they were unknown. Just because there wasn't a source for the information didn't mean it was unknown. What does it mean to be unknown? By whom? Who gets to count as knowing?

As far as I can tell, the Smithsonian has not corrected this article at all, which is disappointing. However, there was a fair amount of press from the same time that one can find online, which I believe came about because a movie about the photo was released then.

Which brings us to Apple Maps and the service station that is not in my neighborhood.

Here's the Apple Maps image, from August 29th, 2014:



Yes, the blue circle is approximately me. Note the "7th Ave Performance Center" at 121 7th Ave. There is no such commercial establishment at that address.

But the internet, well, ok, Google, will tell you there is:
(They're all purple because I clicked them.) These are the top ten results (somehow out of almost 5 million results, which makes no sense whatsoever). Nine of them are completely wrong. Only the second one gets it right (and I looked into this a few years ago, back when I had some small hope for Apple Maps): there is probably a service station down at 7121 7th Avenue -- somewhere along the way, the leading 7 on the street address got lost, and site after site unthinkingly copies the error. (The second link there -- and I know it's a screenshot here -- also has the zip code correct.)

Which brings us to my overall annoyance. All these sites are just copying information. They don't particularly care if it's correct. That's really pathetic. Alright, I have once again submitted it as an error; maybe one day I'll see if they correct it.

When Google Was Better

Google used to be about finding information for people (search), now it's about finding information on people (advertising).

Facebook used to be about keeping track of your friends (social), now it's about keeping track of your habits (advertising).

tl;dr: Public sphere, corruption thereof by advertising, Habermas. (That was a very heavily coded sentence using an academic concept and its author.)

Saturday, August 2, 2014

Not Ok, Cupid

Regarding the recent OkCupid "study", there's a nice piece about both it and the horrible Facebook study that you can read over at Kottke. One thing I like is that it essentially discusses the community in which FB posters exist, which I found to be an important and overlooked issue.

"It's not A/B testing. It's just being an asshole."

Thursday, July 31, 2014

Gephi and Importing an Edges List

So, I wanted to import a CSV I had made in Python into Gephi. Easy. But, no. Gephi wasn't importing it correctly: File: Open wasn't reading the file correctly, the headers were being counted as nodes, and the weights were also being counted as nodes.
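The edges CSV itself is simple: a Source column, a Target column, and (optionally) a Weight column, which is the layout Gephi's Import Spreadsheet expects. Here's a minimal sketch of how one might write such a file from Python; the node names, weights, and filename are made up:

import csv

edges = [('alice', 'bob', 3), ('bob', 'carol', 1), ('alice', 'carol', 2)]

with open('edges.csv', 'wb') as f:  # 'wb' because this is Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target', 'Weight'])  # headers Gephi recognizes
    for source, target, weight in edges:
        writer.writerow([source, target, weight])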

To the Internets! They were horrible. They all said to use the "Import Spreadsheet" button in the Data Table tab in the Data Laboratory. I didn't have such a button in the Data Laboratory. Bastards.

Mine looked like this:


As you can see, there is no "Import Spreadsheet" button there at all.

The Internets (like here, here, and the wiki page) kept telling me it should look like this:


BUT IT DIDN'T. Come on. Killin' me.

Solution: File: New Project. (Or command-N.) Really. That's all.

So I tried to fix the wiki page about it, but you have to have an account and I didn't see anywhere to make one. So I tried the contact for the main community person, but the link to his page went nowhere, and the link for the American community guy went to some old generic page that has since been retired. (I can't easily find those pages now and it's not worth the time.)

Not useful, people, not useful. A shame too, since Gephi is fantastic.

Friday, July 11, 2014

Python, Multiprocessing, and Queues

Here's a little bit on what I've been working on lately: using Python's multiprocessing package to take advantage of the multiple cores on my machine, but also using queues (also in the multiprocessing package) to gather results from the processes. I'm writing it up because I found some info on queues that wasn't particularly helpful and in one case was just wrong.

So, basics first:

import multiprocessing

Ok good, got that step out of the way. Some things you will want:

num_cores = multiprocessing.cpu_count()

That gives you an integer of how many cores you have on your machine. If you have a newer machine with hyperthreading, where each physical core is so awesome that it presents itself as two logical cores, it will return the logical number (so, on my 6-core [hardware] Mac Pro it returns 12 [virtual!], which is twice as awesome as 6).
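In my case I use that number to decide how many chunks to split my data into, one chunk per process. Here's a rough sketch of how the chunk_ranges list that shows up in the code below might get built; the splitting logic is mine and will depend on your data, and it assumes the_data is the Pandas DataFrame being chunked:

num_cores = multiprocessing.cpu_count()
num_rows = len(the_data)             # the_data is the Pandas DataFrame
chunk_size = num_rows // num_cores + 1

chunk_ranges = []
for i in range(num_cores):
    start = i * chunk_size
    end = min(start + chunk_size, num_rows)
    if start < end:
        chunk_ranges.append([start, end])
# chunk_ranges is now a list of [index_a, index_b] pairs, one per process.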

The multiprocessing package seems a little weird at first. Here's some code:

proc_list = list()  # This list is so you can "join" them. Important!
my_queue = multiprocessing.Queue()  # Queues do q.put() and q.get() and other things.

for a_range in chunk_ranges:
    # The number of chunks in my code equals the number of processes.
    # "chunk_ranges" is a list of lists, [[index_a, index_b], ...],
    # indexes in a Pandas DF I am parsing. Easy to parallelize.
    # Your for loop will differ, depending on your data.
    a_proc = multiprocessing.Process(target=your_worker_function,
                                     args=(a_range, the_data, my_queue))
    # To your worker function, pass the args -- here it's the data range,
    # a pointer to that data, and the queue object.
    proc_list.append(a_proc)
    # So you have a list of them and can iterate through the list
    # for join to end them nicely.
    a_proc.start()  # Starts one!

# Waits for them to end and ends them nicely, cleanup.
for p in proc_list:
    p.join()
    print '%s.exitcode = %s' % (p.name, p.exitcode)
    # You don't need this but it's nice to see them end
    # and their exit status number.

print 'Getting elements from the queue.'
while not my_queue.empty():
    one_procs_data = my_queue.get()
    # Then do something with that, maybe add it to a list, depends on what it is.

Ok, so that's commented, but how does it work? How does the worker function ("worker" is the term of art, it appears, for the function that is multi-processed) deal with the queue object? Let's look at code for that. Note that this is a somewhat simple example with one queue; I've seen nice examples with an input queue and an output queue. This example deals only with an output queue (because my project is chunking a Pandas DF into processable pieces, which is essentially playing the role of the input queue).

def your_worker_function(a_range, the_data, your_queue):
    # The parameters match the args tuple from the calling code above.
    # Do your processing here!
    your_queue.put(the_data_you_want_returned)
    # That's it, no "return" needed!
# End of your_worker_function

Not bad! Hand the worker function the queue object like you would any function's argument. You don't need to use the "return" call, since you use ".put" on the queue object and put the data you want into it. What is nice is that Python takes care of the worker functions all putting their data into the queue so you don't have to worry about them all getting smashed up (not a technical term) and your code barfing when/if they all try to access the object at the same time. No worries! Love it.

So how does that previous code work, the code that calls the worker function?

Declare a list object to populate with the processes (the Process objects themselves). This is important, and is so you can end them all nicely and do cleanup. Also declare your queue object.

You use a looping function (here a for loop) to give out the jobs to the right number of processes. "target" is the worker function, and then "args" are the arguments you hand to that function, just as you would normally. Then, add the process that was just made to the list and start it.

The "join" for loop -- I have no idea why it is called "join" btw -- nicely cleans up after all the processes. I'm not discussing errors and such here, that's more advanced. The for loop does indeed loop through them all, and will get to all of them, waiting for them to finish in order (I think). I was a little curious about what if the first process in the list fails or doesn't stop or something, but somehow the looping will get to and join all the ones that have finished, even if they are after an infinitely looping one earlier in the list (yes I speak from experience oops).

Then, you can call your now-populated queue object and ".get" the items from it (whatever data type you put in). Deal with them appropriately -- so, maybe you made lists, and you don't want a list of lists when you're done, you want one list, so ".extend" to the outer main list object. Whatever is appropriate for your job.
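Concretely, for the list case, that last step might look like this (a tiny sketch; all_results is just my own name for the outer list):

all_results = []
while not my_queue.empty():
    one_procs_data = my_queue.get()     # here, a list from one worker
    all_results.extend(one_procs_data)  # one flat list, not a list of lists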

There you have it. If you have 12 cores you could make it go to 11 instead of 12, but it's more fun (or at least faster) to say that you could and use 12 anyways.

Edit: Apparently, queue objects have a size limit, and if you overload your queue things can grind to a halt: your processes will stop doing work but won't complete (that is, your CPU won't be doing anything, but your processes won't join). Oddly this doesn't crash, perhaps because exceptions don't get pickled and passed around the levels of multiprocessing. That was a very detail-thin explanation. Suffice it to say, I have some code that is fine but grinds to a halt: the worker functions finish, but the processes don't join. Right now I'm looking at the queue as the culprit.
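For what it's worth, the multiprocessing docs do warn that a process which has put items on a queue may not terminate until those items have been taken off it, so the usual advice is to get everything off the queue before joining. A minimal sketch of that reordering, using the same names as above and assuming each worker puts exactly one item:

results = []
for _ in range(len(proc_list)):
    results.append(my_queue.get())  # blocks until a worker has put its item

for p in proc_list:
    p.join()                        # safe now, since the queue has been drained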

Edit 2: I took out the my_queue.put(item) call in the worker process and replaced it with a file write (using the process name for unique filenames) and... it worked! Actually, first I just took out the my_queue.put(item) and then they all joined, so: not happy with the queue. If you are doing multiprocessing you might have a lot of data, but I guess the queue can't handle it. And, worse, in OS X you can't get the size of the queue, since qsize() isn't implemented there, and queue.full() (or whatever exactly) isn't totally reliable either.
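In case it's useful, here's roughly what that replacement looks like. This is a minimal sketch; the output directory, filename pattern, placeholder "work", and JSON format are my own choices here, not anything multiprocessing requires:

import json
import multiprocessing

def your_worker_function(a_range, the_data, out_dir):
    # Do your processing here, producing some result for this chunk.
    some_results = list(range(a_range[0], a_range[1]))  # placeholder for real work
    # Use the process name to get a unique filename per worker,
    # instead of putting the results on a queue.
    proc_name = multiprocessing.current_process().name
    out_path = '%s/results_%s.json' % (out_dir, proc_name)
    with open(out_path, 'w') as f:
        json.dump(some_results, f)

The parent process can then just glob that directory and read the files back in once everything has joined.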

Friday, July 4, 2014

How To Circumvent Your IRB In 4 Easy Steps

  1. Design a study with a big company that is going to run it without you.
    • The study can manipulate emotions yet include people with mood disorders and people under the age of 18.
  2. Have the big company run it and have them collect the data.
    • You don't know how to code in their massive and distributed server environment anyways.
  3. Approach your IRB and ask to use data that has already been collected, and because it has already been collected no human subjects approval from the IRB is needed.
    • Hope your IRB is lazy and doesn't think about this too much.
  4. Publish and over-hype your results.
    • Issue lots of non-apologies when the research community calls you on your lack of ethics.