/dev/culture: Data


Saturday, July 11, 2020

Amazon Still Thinks I'm A Student

It's been over five and a half years, and multiple complaints, but Amazon still insists on showing me Amazon Prime Student advertisements. I've told them directly that I am not a student, so they have the data that I am not a student. (The funniest response was "the student flag on your account is on." Of course there is no such flag; why the text chat person felt like lying, I don't know.) Apparently they don't really want all the data; they just don't care.

Here is the latest screen grab from this week:

Tuesday, April 7, 2020

US Maps in R

A really great read (it's a chapter) about the use of maps in R (at least, US maps specifically), from Kieran Healy's Data Visualization. What are you trying to show with your map? What is your data? Is it spatial? Or, maybe it's actually about population, so why is Montana bigger than Connecticut?

There are some great projections: there's the standard geographical one, the weird "geography squished into population size" one (Figure 7.1, lower left), and the electoral college/population one, which isn't bad depending on what you are trying to do (Figure 7.1, lower right). I end up liking the one that makes all the states the same size, each a square (statebins, in section 7.3). (Of course, what is a state? They are not all comparable at all! What is Washington, D.C.? Why not Puerto Rico? Etc.!)

No post about maps is complete without XKCD's heatmap comic and another on map projections, as well as a link to the segment from The West Wing about map projections which everyone should watch.

Wednesday, March 6, 2019

Urban Data Marks

Really interesting talk at Boston CHI by Prof. Dietmar Offenhuber a few days ago, where one of the projects he detailed was "Dust Mark". So, in Stuttgart, Germany, some areas of the city suffer from air quality issues, as is true in many cities. One way to measure this over time is with reverse graffiti, which struck me as a really interesting concept and approach to urban marking. Instead of making a long-term mark or addition to measure something (say, with paint or a physical object like a meter of some sort), you can power-wash away accumulated air particulate from concrete surfaces. This can get around anti-graffiti laws, since you aren't adding anything (paint, chalk) to a surface, instead you're removing and actually cleaning the surface!

Sunday, November 5, 2017

Blogging, Ephemeral

So, I've been blogging for well over ten years now (but mostly the "readers" are myself and, I would guess given the consistent 30-40 views per post, webcrawlers), but I was blogging at UM before the earliest posts listed here (2007), which are re-posted UM posts IIRC.

Which made me think about how ephemeral this all is, although I might have a folder with the UMich blog material in it.

Relatedly, I've been thinking about digital photos, and about what Facebook, Instagram, and Tumblr are, for instance: it's all ephemeral. Someone a while ago pointed out how, in terms of records and photographs, we are in a terrible period, as none of it will be saved for posterity. I'm debating taking an afternoon and ordering up some print books of my photos, but am not even sure what I have anymore. I think I've lost my Belgrade trip photos, back in the days of my Canon ELPH (~2003?), although I might have a CD with the photos from a friend who was on the trip as well (neither my desktop nor my laptop has a CD drive, but I have an old MacBook with a CD drive that I have saved just in case).

So, no more finding ancestors' diaries and photos.

My great-grandfather and great-grandmother made a scrapbook of their trip to England in the summer of 1914, just before WWI started. My great-grandmother's parents were German and had emigrated to the US. Oops. Being German-ish in England during WWI, not a great idea. They managed to get some money somehow and got on a boat home, but it wasn't easy. Sure, if this had been 100 years later, it would have been a blog, but would it even exist for a great-grandchild to read 100 years after the fact? Our digital world hasn't been around long enough to tell, but, given the rate of digital decay we've seen, the answer is no.

Tuesday, April 11, 2017

Python 3 vs. Python 2.7

I decided it was finally time to move to Python 3 from Python 2. Having done so, I don't see why I wasn't using Python 3 years ago, although my code worked just fine so it wasn't really a big deal.

Python 3 has two big advantages, and there's also a third reason you should be using it by now.

Unicode: Strings are Unicode by default in Python 3, so you don't have to worry about catching Unicode errors in string types anymore. This is a concern for me with web scraping. So much easier.
For years I've read about the following dilemma in OSX, with no solution:

If you DON'T install your own copy of Python 2, you are modifying the OS's copy of important libraries and such, and that can cause problems.
If you DO install your own copy of Python 2, you then have two versions of Python 2 on your computer, and that can cause problems.

But, the best part was that about 99% of my code still works as is. All I've had to do so far is change print statements, from print "Print this Py2!" to print("Print this Py3!"), and get rid of the Unicode error catching.
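A minimal sketch of those two changes, assuming Python 3; the sample string is made up for illustration and is not from my scraping code.

```python
# Python 3: print is a function, and str is Unicode by default.
scraped = "Café in Zürich"  # hypothetical scraped text with non-ASCII characters

print("Print this Py3!")
print(scraped.upper())  # works on the text directly, no Unicode error catching

# Bytes only show up at the edges (files, sockets); encode and decode there.
raw = scraped.encode("utf-8")
assert raw.decode("utf-8") == scraped
```

The only place you still think about encodings is where text turns into bytes and back.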

Thursday, January 19, 2017

Oh, Amazon (Student Prime)

Now Amazon is teaming up with Facebook to think I am a student. That two giant data companies that focus precisely on user demographics are incapable of realizing that I am not a student, despite my telling one of them several times that I am not (directly, talking with customer support human beings), is incredible.

Ok that was not the image in the advert initially (it was a student), but it appears to be what I captured. Try Amazon Prime Student. No, I am not a student, I don't qualify.

Tuesday, September 20, 2016

Still Data Fail

Amazon still thinks I'm a student, but for years I've told them I'm not. How is this a good use of customer data? How is this responsive to customers? It's insane and idiotic (and annoying when I am trying to give them my money, they make it harder to do so, but yes ok ok I am still an Amazon customer so what do they care?).

Here's a screen grab from this month, September 2016:

But I've told their online help people that I'm not a student, back in April as you can see here and previously in January of 2015 as you can see here. So much for customer feedback.

Additionally, Twitter's recommender needs some help:

Data & Society is an incorporated entity like Valvoline, but the two organizations are nothing alike and neither are their Twitter feeds. Valvoline isn't marked as a "sponsored" post (i.e., paid advertising), and even if it were the mismatch is just hilarious.

And currently I live in New York and I don't own a car.

The data is there, people just aren't using it well at all.

Sunday, July 3, 2016

Gephi and iGraph: graphml

When Gephi, which is great, decides to not exactly work, you can save your Gephi graph file in graphml format and then import it into R (or Python or C/C++) using iGraph so you can also draw it the way you were hoping to. (I'm having an issue with setting the colors at all in Gephi.)

It took me a few tries to figure out which format would work. I need location (since Gephi is good at that but I don't know how to make iGraph or R's SNA package do that) and attributes for the data. So far, so good!

Some helpful pages:

Note!!!! Apparently if you make a variable in R (at least while trying to graph something with plot) and you use a variable for your palette that you name palette, you will destroy (ok ok overwrite) some other official variable or setting also named palette, but the error you get will not at all clue you in to what happened. Better to call your variable my_palette or the_palette, which is what I usually do (so why didn't I do it here?).

Wednesday, June 15, 2016

CityU Hong Kong Summer School

Had a great time teaching a class and also an impromptu session on Gephi at the City University of Hong Kong's Summer School in Social Science Research! It's in the Department of Media and Communication, and run by my friend Dr. Marko Skoric. The main instructor was Dr. Wouter van Atteveldt, who is awesome and has great hats as you can see.

I also was fortunate enough to attend CityU's Workshop on Computational Approaches to Big Data in the Social Sciences and Humanities, which was great and had lots of great speakers.

Me, showing some great students a few things about Gephi.

The three of us in front of the department sign.

Tuesday, April 19, 2016

When Companies Fail The Data

Recently, I have encountered three examples of how giant data gathering companies have completely failed to use that data in any sensible way. The companies are Facebook, Amazon, and Pandora.

Facebook served me an ad that said Sylvester Stallone had died without actually using any direct "passed away" words or phrases (since he hadn't). This is offensive, it's a lie, and I am not a particular fan of Stallone's films although Rocky is a classic (but Cop Land, are you kidding me?).

Amazon continues to insist I might want Amazon Student, despite my explaining to them over a year ago that I am not a student (and my account is 16 years old).

Pandora continues to serve me ads in Spanish (which I do speak but I'm not fluent) and for cars (I don't own a car). I even told a tech support person this and he said there was nothing he could do about it.

These examples all point to the issue of not using the data you have and not taking direct information (data) from the user when the user gives it to you (which is much easier than trying to infer it, if indeed the user is truthful).

Facebook

The Facebook ad is hugely problematic. The conclusions are that:

The people at Facebook do not care about the accuracy of the ads they serve.
The people at Facebook do not care if the ads they serve are purely for emotional manipulation.
The people at Facebook are not using the 11 years of data they have on me to realize that I would not like this ad because:

I do not like advertisements that lie.
I do not like advertisements that manipulate.
I am not a fan of Sylvester Stallone.

They have the data. They aren't using it.

Amazon

That Amazon thinks I am a student, even though I've told them I am not and even though they can see my account has been buying stuff for 16 years, is bizarre. I told a tech support person that I am not a student. Yet, the algorithm they maintain apparently is not given this information at all and continues to annoy me with an extra page when I am trying to check out (yes, a good problem to have).

They have the data. They aren't using it.

Pandora

I grew up listening to FM radio, so I'm used to radio with ads. I so far use the free version of Pandora which has ads, and I think that's fine (people should get paid). However, I am not fluent in Spanish, so Spanish language ads are wasted on me (it's a waste of money to those advertisers) and also I don't own a car, but I get ads for car service stuff (I don't even remember what, but the problem is the same). So, since I think it would actually be nice to be served appropriate ads, and that those companies are getting their money's worth, I text-chatted with a Pandora text person. He said he had no way to mark my account indicating that I do not speak Spanish.

And yes, I know the image is an ad for Flonase, not for cars, it just happens to have a car--I use it here because it's in Spanish (although I am more complaining about the audio ads, images clearly work better here).

Again, they have the data. They aren't using it.

Overall

For me, these are good problems to have. I have internet access and can buy books (although if it's new I'll try to get it from my local non-chain bookstore -- yes I am serious). But all of these issues are annoying, not just because inappropriate content is being served to me, but because the companies should know better than to do that, and in all cases, they either have enough information on me, or I try to give it to them, and they still can't do it. And that's the distressing part: in this age of total information, some of the biggest information companies still don't know how to use data.

Thursday, April 14, 2016

For A Decent CSV Spreadsheet App

All I want is a decent spreadsheet app that does not insist on mangling my CSV files, which often have ID numbers in them which I might want to view as text and not numbers. Apple's Numbers is maddening (you have to export to CSV, extra steps, and it has a relatively low row limit, 65,535 I believe) and Microsoft's Excel is a little better but I'll use it as an example here of What You See Is Not What You Get.

I am doing some work on cities and (county-level) FIPS codes. (In the US, FIPS codes are Federal-level identifiers useful for a lot of things; the ones I'm using identify counties.) Some cities are large and lie in more than one county. Some of the data I have deals with cities, and the income data is on the county level, so I need to map from cities to county FIPS.

Excel did not make this easy.

The file I grabbed off the net to help me map cities to FIPS (counties) quite correctly listed all the appropriate FIPS codes for each city. I needed to narrow this down to one (Wikipedia helped a lot, the geopolitical Wikipedians are nitpickers).

FIPS codes for counties have two parts, two leading digits for the state and then three digits for the county. So all FIPS codes that start with 36, for instance, are counties in New York state.

The format from my source file looked like this:

Raleigh, NC:    37063,183
Birmingham, AL: 01073,117
New York, NY:   36005,047,061,081,085

(I am pretty sure those 5 numbers for NYC are the 5 boroughs, I know Brooklyn is its own county, Kings county.)
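Here is a rough sketch of how that compact format can be expanded into full five-digit codes. It assumes every line looks exactly like the samples above (a full code first, then three-digit county codes in the same state); expand_fips and parse_line are names I made up for this example.

```python
def expand_fips(compact):
    """Expand '36005,047,061' into ['36005', '36047', '36061'].

    Assumes the first entry is a full 5-digit county FIPS code and every
    later entry is a 3-digit county code in the same state.
    """
    parts = [p.strip() for p in compact.split(",")]
    first = parts[0]
    if len(first) != 5 or not first.isdigit():
        raise ValueError("expected a 5-digit FIPS code first: %r" % compact)
    state = first[:2]  # two leading digits identify the state
    codes = [first]
    for county in parts[1:]:
        if len(county) != 3 or not county.isdigit():
            raise ValueError("expected a 3-digit county code: %r" % county)
        codes.append(state + county)
    return codes


def parse_line(line):
    """Split 'City, ST:   codes' into (city, list of FIPS strings)."""
    city, compact = line.split(":", 1)
    return city.strip(), expand_fips(compact)


print(parse_line("New York, NY:   36005,047,061,081,085"))
```

Everything stays a string the whole way through, which is what keeps the leading zero on codes like 01073.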

Excel, however, would show the following in the main view, interpreting these IDs as numbers--errors are in the parentheses, A, B, and C:

Raleigh, NC:    37,063,183 (A)
Birmingham, AL: 1,073,117 (A,B)
New York, NY:   36,005,047,061,081,000 (A,C)

Errors:

A. Added a comma that isn't there.
B. Dropped leading zero.
C. Rounded right-side digits.

So there are at least three issues there, but the most difficult one is that it put a comma in after the two digits for the state, initially making me think that indeed the source file had a comma after the state component of the FIPS code. It did not. Parsing the file did not work.

That was all extremely infuriating, and reminded me of Microsoft's Clippy, where the coders thought they always knew better than you. Granted, a lot of apps and even programming language packages try to be smart and guess formats, and yes this can be useful. But if there are leading zeros and commas in odd places (or not) and it's a CSV (text) file, there could be a default "read CSV as text". Of course it seems that neither of these two programs has been coded to play nice with CSV files.

As such, they are not overly useful data science tools.
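For comparison, "read CSV as text" is the default behavior of Python's standard csv module, which hands back every field as a string. A small sketch, using a made-up two-column sample (the column names city and fips are my own, not from the source file):

```python
import csv
import io

# Hypothetical sample; both fields are quoted because they contain commas.
sample = 'city,fips\n"Birmingham, AL","01073,117"\n'

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["fips"])  # leading zero intact, no added commas, nothing rounded
```

Nothing is guessed or converted unless you do it yourself afterwards.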

Tuesday, April 5, 2016

Case Study in Data Ethics at Data & Society

I am pleased to announce that a case study on data ethics, by me and co-author Dr. Roei Davidson, has been published at Data & Society! Titled "The Ethics of Using Hacked Data: Patreon’s Data Hack and Academic Data Standards", it looks at issues around using hacked data (or not).

Basically, no.

But I wanted to. See the paper for details! (It's free and concise, don't worry.)

Tuesday, January 19, 2016

Meaningless Data Viz

This Google Trends data visualization is horrible. It does indeed show "top searched candidate by state", I would guess, but that doesn't at all mean what the map implies it means -- that is, positive popularity of that candidate and also a lead over the other candidates. It doesn't even come close to showing that.

The data underlying this map could be any one of these completely different scenarios, using just the first three listed candidates to show the problem:

Some Example Possibilities
Candidate   State A   State B     State C
1. Trump    1         1,000,000   1,000,000
2. Cruz     0         0           999,999
3. Rubio    0         0           999,999

The order of the candidates in the image may be from the data, or it may be from polls, or it may be something else, we don't know.

In theoretical State A, Trump does lead, but it's meaningless and no one is searching.

In theoretical State B, Trump leads, in a statistically meaningful manner, and people are searching (but we don't know exactly on what terms, "Trump liar" and "Trump bankruptcy" and "Trump racist" are not endearing search terms).

In theoretical State C, Trump leads, but it's a statistical tie, and lots of people are searching.

Each of these scenarios is massively different, yet they would all result in the same visualization.

There are other numerical combinations, this is just a sample of three.
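To make the point concrete, here is a small sketch that uses the made-up counts from the table above. The function names are mine, and picking the maximum is only my guess at what "top searched candidate" means on the map.

```python
# Search counts per candidate for the three hypothetical states in the table.
scenarios = {
    "State A": {"Trump": 1, "Cruz": 0, "Rubio": 0},
    "State B": {"Trump": 1000000, "Cruz": 0, "Rubio": 0},
    "State C": {"Trump": 1000000, "Cruz": 999999, "Rubio": 999999},
}


def top_searched(counts):
    """Return the candidate with the most searches (what the map colors by)."""
    return max(counts, key=counts.get)


def share(counts, name):
    """Fraction of all searches in the state that went to one candidate."""
    return counts[name] / sum(counts.values())


for state, counts in scenarios.items():
    winner = top_searched(counts)
    print(state, winner, sum(counts.values()), round(share(counts, winner), 3))
```

All three states come out colored for the same candidate, even though the totals and the shares behind them are wildly different.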

This visualization also conflates geography with population; that is, it doesn't have any state-level per-capita correction. For this you need, I have learned, a cartogram (I think I've linked to that page before, it's really informative--here's one for the world with a slightly different approach). And, it only considers people who have internet access and who are using Google and who are actively searching during the debate. That leaves out lots of people.

And, it leaves out anything that isn't a state (such as Puerto Rico), although I assume Washington, DC, is in there (who can tell?). It also, and this is a minor peeve, makes it look like the top of Minnesota is connected by land (it isn't).

Edit: Apparently, this map is actually from Google, their "Google News Lab" according to one video where I got this map for the Democrats and it suffers the exact same problem:

Friday, October 30, 2015

The Civic Data Divide

I'd like to coin the term "civic data divide", and given that Google shows zero results for it, I think I can make that claim.

More importantly, I've been working on a paper looking at factors that affect the strength of a nation's open data policy. The numbers show that, although some people have theorized about the importance of both internet access and education for open civic data, neither of these factors plays a role, at least not on the national level, indicating that, as we might expect, the early users of and agitators for open civic data are those who can use it: those who have internet access and those with the education to work with numbers. This is a minority of the citizenry, and as such the national measures for overall education and overall internet connectivity do not statistically relate to the strength of a nation's open data policy.

This should not come as a surprise, given the many other socio-economic divides in terms of access we've seen before, such as the digital divide. The civic data divide, I would argue, is an extension of what scholar Pippa Norris has discussed as the democratic divide, where there are some who use the internet to engage with governance and those who do not or cannot.

Yet this is not an overly problematic scenario. Internet access and education are indeed not evenly distributed across any one country, but many of those working with civic data and open data policies work every day with the issues faced by nations and cities, and as such they are aware of and engaged with socio-economic inequalities, and, more importantly, are trying to address the issues and make things better for all citizens. Along with the expansion of civic data programs and outreach (such as MIT's Civic Data Design Lab and NYU's Center for Urban Science and Progress), although the civic data divide currently exists, the very ideals behind open civic data are working to overcome it.

Google as of October 30, 2015, 5pm NYC time.

Edit: So, this does imply that those working on civic data are able to utilize their resources (socio-economic) for education so they have data skills, and so they can do this outside of their nation's educational system. But besides those that already have the resources for education, I'm guessing there are some who instead learn the needed skills via online courses, meetups, and other alternative educational avenues. But you still have to have the time to work on these projects.

Friday, October 2, 2015

Fixing Gephi on Your Mac

TL;DR: Download JDK 1.6, then point a Gephi startup file to it maybe with one change (/Library... and not /System/Library...).

UPDATE: New Gephi is out!
Original post continues below.

The longer version is that Gephi, which I love, hasn't had an update (as of this writing) in quite some time (parts of it have, parts haven't) and it doesn't work with the newer versions of Java. The 0.9 version download file for OS X doesn't seem to be where the link says it should be (again, as of this writing in early October, 2015).

Apparently Gephi 0.8 needs Java version 6, not version 7 or 8. Java appears to have lots of parts and names, and apparently v6 is "1.6" and v7 is "1.7", which I don't understand, but I don't do Java so I'm not going to spend time figuring it out--I got Gephi to work, that was all I cared about right now.

My new laptop, to replace my five-year old one, didn't even have Java on it (ah, the purity of it all). I got the JDK (Java Development Kit) 1.6 from part of the Apple Support website, which is what you want.

But you have to tell Gephi about it, as far as I know. This article was awesome except it turned out on my machine not quite right (but close enough that I figured it out). Java wasn't where I thought it would be. This article on Stackoverflow helped me find Java 1.6, although note the article is about Java 1.7, so make the command this:

/usr/libexec/java_home -v 1.6

(Ok now you don't need to read the Stackoverflow page, just run that in your terminal if you've installed 1.6 and don't know where it is.)

Turns out (I don't know why, and I don't care, it's working) Java 1.6 wasn't in /System/Library...; it was in the same path minus the leading /System, so it was just in /Library...:

/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

So, put that line in the Gephi startup file near the top as per the directions (I made it the first command) and so far Gephi actually loads, which it wasn't doing before. Granted I haven't actually tried to do anything with it, and who knows if this will blow up other Java things (or maybe my machine really didn't have any Java on it at all).

Monday, August 31, 2015

Python, DictReader, DictWriter

Because I can never, ever, remember exactly how to code these. Example of both, basic.

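import csv

input_file = 'input.csv'   # placeholder path, use your own
output_file = 'output.csv' # placeholder path, use your own

data_list = []

# Python 3: newline='' lets the csv module handle line endings itself
# (the old 'rU' mode is gone in recent Python versions)
with open(input_file, newline='') as f:
  data_file = csv.DictReader(f)
  for line in data_file:
    data_list.append(line) # gets you a list of dicts

the_header = ['h1', 'h2', 'etc'] # column headers, a list of text strings

with open(output_file, 'w', newline='') as f:
  file_writer = csv.DictWriter(f, fieldnames=the_header)
  file_writer.writeheader()
  for line in data_list:
    file_writer.writerow(line) # each dict's keys must be in the_header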

Here I am going to experiment linking it as a script-tagged element from Gist via GitHub:

16th Century Maps for 21st Century Data Science

Maps bother me. I love them, and I'm not a geospatial GIS coding specialist, but I do visualizations, and we keep using the wrong maps. Greenland is a lot smaller than all of Africa, ok?

This is the map in my office kitchen:

It's the typical Mercator projection (projection, since you have to "project" a sphere onto a flat surface, which doesn't work well). Mercator came up with this map view in 1569, according to Wikipedia. Yet we still use it for 21st century data science! Granted just because something is old doesn't mean it's not useful, but in this case the Mercator projection was created primarily for navigation, that is, sailing the seven seas. When you present geospatial data the only thing your viewers are navigating is your data. As such this is totally the wrong mapping projection to use. Totally. Don't do it. Data visualizations are about accuracy, and using the Mercator projection starts you off with a completely inaccurate mapping. Greenland and Africa? "Africa's area is 14 times greater" than Greenland according to that Wikipedia article! Fourteen!
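To put a rough number on the distortion: on a Mercator map the linear scale grows as 1/cos(latitude), so areas are inflated by 1/cos²(latitude) relative to the equator. A quick sketch follows; the function name is mine, and 72 degrees north is only my rough stand-in for the middle of Greenland.

```python
import math


def mercator_area_inflation(lat_degrees):
    """Area exaggeration on a Mercator map at a latitude, relative to the equator."""
    return 1.0 / math.cos(math.radians(lat_degrees)) ** 2


# 0 is the equator (much of Africa is near it); 72 is an assumed latitude for Greenland.
for lat in (0, 30, 60, 72):
    print(lat, round(mercator_area_inflation(lat), 1))
```

At 60 degrees everything is drawn at about four times its true relative area, and it keeps getting worse toward the poles, which is how Greenland ends up looking comparable to Africa.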

So what to do instead?

Wikipedia has a page of many different projections, I'd vote for one of the equal-area ones, and am a fan of the Gall-Peters projection (which was the centerpiece of a great segment on The West Wing), but you'll need to decide what's best for your use.

So, I'm a little upset about the giant Mercator map in my office, but with good reason.

Monday, June 1, 2015

ICWSM 2015

Had a great time at ICWSM in Oxford last week. Great people, great place. Stayed in a hotel near the old Norman tower mound that was built in 1071. (Google's formatting is horrible with images and text, I am discovering.)

We discussed a wide range of data and social science topics, including data availability.

Original Image c/o Allie Brosh

Darren Stevenson, PhD student from my old department, Communication Studies at UM, presents!

My poster! 36 million observations!

Oxford is beautiful:

Friday, May 22, 2015

Google #datafail

I keep getting ads in my Gmail (web interface) for master's programs or programs in education.

Are you kidding me?

Google reads all my email, this blog is in a Google property, I have a Google Scholar page so they should know I have a PhD, really, and I have a Google Sites page.

Serving me ads like this makes no sense at all. None. No, really, it doesn't. It's indefensible. But yet, they do it.

Monday, April 20, 2015

I am not a number!

I was poking around the web after the TtW conference and found this blog post rather apt.

And then this excerpt from Joe Turow's The Daily You aligns well with it.