Tuesday, April 19, 2016

When Companies Fail The Data

Recently, I have encountered three examples of how giant data gathering companies have completely failed to use that data in any sensible way. The companies are Facebook, Amazon, and Pandora.

Facebook served me an ad that implied Sylvester Stallone had died, without actually using any direct "passed away" words or phrases (since he hadn't). This is offensive, it's a lie, and I am not a particular fan of Stallone's films, although Rocky is a classic (but Cop Land, are you kidding me?).

Amazon continues to insist I might want Amazon Student, despite my explaining to them over a year ago that I am not a student (and my account is 16 years old). 

Pandora continues to serve me ads in Spanish (which I do speak but I'm not fluent) and for cars (I don't own a car). I even told a tech support person this and he said there was nothing he could do about it.

These examples all point to the issue of not using the data you have and not taking direct information (data) from the user when the user gives it to you (which is much easier than trying to infer it, if indeed the user is truthful). 

The Facebook ad is hugely problematic. The conclusions are that:
  1. The people at Facebook do not care about the accuracy of the ads they serve.
  2. The people at Facebook do not care if the ads they serve are purely for emotional manipulation.
  3. The people at Facebook are not using the 11 years of data they have on me to realize that I would not like this ad because:
    1. I do not like advertisements that lie.
    2. I do not like advertisements that manipulate.
    3. I am not a fan of Sylvester Stallone.
They have the data. They aren't using it.

That Amazon thinks I am a student, even though I've told them I am not and even though they can see my account has been buying stuff for 16 years, is bizarre. I told a tech support person that I am not a student. Yet the algorithm they maintain apparently is not given this information at all and continues to annoy me with an extra page when I am trying to check out (yes, a good problem to have).

They have the data. They aren't using it.

I grew up listening to FM radio, so I'm used to radio with ads. So far I use the free version of Pandora, which has ads, and I think that's fine (people should get paid). However, I am not fluent in Spanish, so Spanish-language ads are wasted on me (it's a waste of money for those advertisers). I also don't own a car, but I get ads for car service stuff (I don't even remember what, but the problem is the same). Since I think it would actually be nice to be served appropriate ads, and for those companies to get their money's worth, I text-chatted with a Pandora support person. He said he had no way to mark my account to indicate that I do not speak Spanish.

And yes, I know the image is an ad for Flonase, not for cars, it just happens to have a car--I use it here because it's in Spanish (although I am more complaining about the audio ads, images clearly work better here).

Again, they have the data. They aren't using it.

For me, these are good problems to have. I have internet access and can buy books (although if it's new I'll try to get it from my local non-chain bookstore -- yes I am serious). But all of these issues are annoying, not just because inappropriate content is being served to me, but because the companies should know better than to do that. In all cases, they either have enough information on me, or I try to give it to them, and they still can't do it. And that's the distressing part: in this age of total information, some of the biggest information companies still don't know how to use data.

Thursday, April 14, 2016

For A Decent CSV Spreadsheet App

All I want is a decent spreadsheet app that does not insist on mangling my CSV files, which often have ID numbers in them which I might want to be viewed as text and not numbers. Apple's Numbers is maddening (you have to export to CSV, extra steps, and it has a relatively low row limit, 65,535 I believe) and Microsoft's Excel is a little better but I'll use it as an example here of What You See Is Not What You Get.

I am doing some work on cities and county-level FIPS codes (in the US, FIPS codes are federal-level identifiers useful for a lot of things; here they identify counties). Some cities are large and lie in more than one county. Some of the data I have deals with cities, and the income data is on the county level, so I need to map from cities to county FIPS.

Excel did not make this easy.

The file I grabbed off the net to help me map cities to FIPS (counties) quite correctly listed all the appropriate FIPS codes for each city. I needed to narrow this down to one (Wikipedia helped a lot, the geopolitical Wikipedians are nitpickers).

FIPS codes for counties have two parts, two leading digits for the state and then three digits for the county. So all FIPS codes that start with 36, for instance, are counties in New York state.
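To make the two-part structure concrete, here is a minimal Python sketch that splits a county FIPS code into its state and county pieces. The function name `split_county_fips` is just something I made up for illustration. The key point is that the code is handled as a string, never as a number, so leading zeros survive.

```python
def split_county_fips(code: str) -> tuple:
    """Split a 5-digit county FIPS code into (state, county) strings."""
    # Treat the code as text: "01073" must stay "01" + "073".
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected 5 digits, got {code!r}")
    return code[:2], code[2:]


state, county = split_county_fips("36047")
print(state, county)  # 36 047
```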

The format from my source file looked like this:

Raleigh, NC:    37063,183
Birmingham, AL: 01073,117
New York, NY:   36005,047,061,081,085

(I am pretty sure those 5 numbers for NYC are the 5 boroughs; I know Brooklyn is its own county, Kings County.)
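Read as text, that format is easy to expand. The sketch below assumes (this is my reading of the file, not something the source documents) that the first entry on each line is a full five-digit FIPS code and every entry after it is a three-digit county part sharing the same state prefix. The helper names `expand_fips` and `parse_line` are hypothetical.

```python
def expand_fips(entry: str) -> list:
    """Expand '36005,047,061' into ['36005', '36047', '36061']."""
    parts = [p.strip() for p in entry.split(",")]
    first = parts[0]
    state = first[:2]  # two leading digits identify the state
    # Assumption: remaining parts are county-only and share that state.
    return [first] + [state + county for county in parts[1:]]


def parse_line(line: str) -> tuple:
    """Split 'City, ST:   codes' into (city, list of full FIPS codes)."""
    city, _, codes = line.partition(":")
    return city.strip(), expand_fips(codes)


print(parse_line("Birmingham, AL: 01073,117"))
# ('Birmingham, AL', ['01073', '01117'])
```

Everything stays a string from start to finish, which is exactly what a spreadsheet that guesses "number" will not do.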

Excel, however, would show the following in the main view, interpreting these IDs as numbers--errors are in the parentheses, A, B, and C:
Raleigh, NC:    37,063,183 (A)
Birmingham, AL: 1,073,117 (A,B)
New York, NY:   36,005,047,061,081,000 (A,C)

  A. Added a comma that isn't there.
  B. Dropped leading zero.
  C. Rounded rightside digits.
So there are at least three issues there, but the most difficult one is that it put a comma in after the two digits for the state, initially making me think that indeed the source file had a comma after the state component of the FIPS code. It did not. Parsing the file did not work.
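For the curious, all three symptoms fall out of one decision: treating the cell as a number. Here is a rough Python emulation of that, not Excel's actual code. It assumes the commas in the text are read as thousands separators and that only 15 significant digits are kept (Excel's numeric precision limit), with later digits turned into zeros. The function name `excel_like_display` is my own.

```python
def excel_like_display(raw: str, digits: int = 15) -> str:
    """Roughly mimic showing an ID string as a comma-grouped number."""
    # Reading the text as a number swallows the commas and the leading zero.
    n = int(raw.replace(",", ""))
    s = str(n)
    if len(s) > digits:
        # Keep only the first `digits` significant digits; zero the rest.
        n = int(s[:digits] + "0" * (len(s) - digits))
    # Re-grouping in threes puts a comma after the two state digits.
    return f"{n:,}"


print(excel_like_display("37063,183"))              # 37,063,183
print(excel_like_display("01073,117"))              # 1,073,117
print(excel_like_display("36005,047,061,081,085"))  # 36,005,047,061,081,000
```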

That was all extremely infuriating, and reminded me of Microsoft's Clippy, where the coders thought they always knew better than you. Granted, a lot of apps and even programming language packages try to be smart and guess formats, and yes this can be useful. But if there are leading zeros and commas in odd places (or not) and it's a CSV (text) file, there could be a default "read CSV as text". Of course it seems that neither of these two programs has been coded to play nice with CSV files.

As such, they are not overly useful data science tools.
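For comparison, here is what "read CSV as text" looks like outside a spreadsheet. Python's standard `csv` module returns every field as a string and does no type guessing. The sample data and the `read_rows` helper below are made up for illustration (my real source file was not laid out exactly like this).

```python
import csv
import io


def read_rows(text: str) -> list:
    """Parse CSV text into a list of dicts; every value stays a string."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample: quoted fields so the embedded commas survive.
sample = 'city,fips\n"Birmingham, AL","01073,117"\n'

rows = read_rows(sample)
print(rows[0]["fips"])  # 01073,117
```

If you use pandas instead, passing `dtype=str` to `read_csv` asks for the same thing: keep the columns as text rather than inferring numbers.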

Tuesday, April 5, 2016

Case Study in Data Ethics at Data & Society

I am pleased to announce that a case study on data ethics, by me and co-author Dr. Roei Davidson, has been published at Data & Society! Titled "The Ethics of Using Hacked Data: Patreon’s Data Hack and Academic Data Standards", it looks at issues around using hacked data (or not).

Basically, no.

But I wanted to. See the paper for details! (It's free and concise, don't worry.)

Thursday, March 24, 2016

Microsoft's Epic Twitterbot Fail

If you read this blog, you've read about the rather hilarious failure of Microsoft's experiment with a learning Twitter bot. Trolls gave it so much input it started turning out hateful, sexist, racist tweets.

So we really have to wonder...

  1. Why are Microsoft engineers so ignorant of Internet culture?
  2. Why do Microsoft engineers who program text-based bots have no idea about the range of text available?
Because these are epic failures. Epic. No wonder there are jokes about engineers being completely socially inept.

Monday, March 14, 2016

Plagued By Bad Design, Still

Design, from websites to cities to forks, is so important, all around us, and so easy to get right--but also easy to get wrong in some cases. Here's one that was easy to get right, but the designers and people who approved it still got it wrong (don't they even test these things?).

The NYC MTA information/help audio posts found in many subway stations have two words, and two buttons, as you can almost see in the first photo. Except that the second button is really hard to see (this photo unintentionally made it look worse than usual, but it's still pretty bad).

Actual info post thing.

There are two overall problems, which you can see a little in the below photo.

  1. The physical placement of the words in relation to the buttons. 
  2. The color of the buttons. 
At first glance it looks like there is one Emergency Information button. But there is a second, dark, button there. But the word Information is closest, out of both words, to the red button, and the red button is closest to the word Information. So the red button and the word Information must have some relationship.

They don't.

Notice the yellow lines are longer than the blue line.

Clearly, the Information button should be easier to see, and the two words and their actual buttons should be visually obviously related, that is, by distance (although you could also do color). One solution would look like this:
Much better!
I don't even have a degree in design. This isn't rocket science.

Sunday, March 6, 2016

Yelverton Seven

We held the seventh installment of the Yelverton Sessions (Yelverton Seven) in conjunction with CSCW 2016. Named after the location of the third meeting, held in Yelverton, England, the Yelverton Sessions combine intensive work sessions with visits to cultural and natural places of interest, not only as a break but as inspiration. And, a lot of coffee and good food. They are usually, but not always, held in conjunction with a conference.

We voted to name it after the third session as by then we realized that yes, this was a sustained effort we wanted to continue. And, who doesn't like the word Yelverton?

  1. Yelverton One, Bangor Maine and Fredericton Canada (ICA 2011).
  2. Yelverton Two, Flagstaff Arizona and The Grand Canyon (ICA 2012).
  3. Yelverton Three, Devon England (ICA 2013). 
  4. Yelverton Four, Bainbridge Washington (ICA 2014).
  5. Yelverton Five, Hong Kong (WUN Understanding Global Digital Cultures 2015).
  6. Yelverton Six, Austin Texas (2016).
  7. Yelverton Seven, Santa Cruz California (CSCW 2016). 
We don't have Y8 scheduled yet, but it will happen at some point!

NYC School of Data

Spent most of the day yesterday at the NYC School of Data conference -- accurately billed as "NYC's civic technology & open data conference." Sponsored by a wide variety of organizations, such as Microsoft and Data & Society, the day involved a lot of great organizations such as various NYC government data departments, included great NYC people such as Manhattan Borough President Gale Brewer and New York City council member Ben Kallos, and was held at my workplace, the awesome Civic Hall.