Wednesday, July 2, 2008

Code Optimizing (in SPSS)

(And optimizing, well, making workable, code in Blogger!)

So, I am co-authoring a paper with some colleagues (aka friends) and I am the stats lead for it. We're using the World Values Survey, which I recently noticed received a lot of housecleaning (the website is better, the dataset has a 1-4 file available instead of just 1-3, the dataset is much cleaner except for a bunch of Americans who have "11" on the 1-10 income scale... Yes yes, this one goes to 11, but it's not supposed to).

I don't know SPSS syntax that well, but I was a computer science minor in college, so coding is a great thing. You can easily get the syntax (CLI) for most commands from the dialog boxes (GUI), which is nice: you then have a record of what you did and can redo it in case you make a huge error. Syntax files are great.

When I cobble them together from best guesses derived from the help documentation (pretty decent) and bits and pieces of the dialog-box-generated syntax, it works. My computer is fast enough (3.06GHz) to run SPSS on our dataset (not huge), even in Parallels under XP, without any delay that I notice.

But I was finally looking at the code itself, instead of just checking whether I had actually changed all of the variable values for the IF statements, and I realized it was not at all optimized: there were three IF statements, when really there only needed to be one, with a second set nested inside it for the second conditional variable (if nation/wave=x, then if income=y). With the copying and pasting it worked fine and fast, but was like this:

IF (nation/wave = x & income = y) DO lots of stuff.
IF (nation/wave = x & income > y) DO other stuff.
IF (nation/wave = x & income < y) DO other stuff.

So basically for every case, it was checking one or two conditions (assuming it pops out of an IF as soon as the first condition fails), but it was doing this three times. So I recoded, and it worked fine but not noticeably faster. Note that each case (a person) is part of a nation, was sampled in a particular wave of the survey, and has a reported income (well, income is not always there).

IF (nation/wave = x) DO
   IF (income = y) DO lots of stuff.
   ELSE IF (income > y) DO other stuff.
   ELSE IF (income < y) DO other stuff.
   END IF.
END IF.
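
To make the difference concrete, here is a rough sketch in Python (not SPSS) that just counts how many conditions get evaluated under each layout. The six toy cases and the values of X and Y are invented for illustration, and it assumes the flat IFs short-circuit on the nation/wave test, which I have not verified for SPSS.

```python
# Hypothetical toy data: (nation_wave, income) pairs; all values are invented.
cases = [(1, 3), (1, 5), (1, 8), (2, 5), (3, 2), (3, 9)]
X, Y = 1, 5  # placeholder nation/wave code and income cut-off

# The three income tests, in the same order as the pseudocode: =, >, <
income_tests = (lambda i: i == Y, lambda i: i > Y, lambda i: i < Y)

def flat_checks(nation_wave, income):
    """Three separate IFs; each one re-tests nation/wave before income."""
    checks = 0
    for test in income_tests:
        checks += 1                # nation/wave test, repeated for every IF
        if nation_wave == X:
            checks += 1            # income test (assumed skipped otherwise)
            test(income)
    return checks

def nested_checks(nation_wave, income):
    """One outer IF on nation/wave, then an IF / ELSE IF chain on income."""
    checks = 1                     # nation/wave test, done once per case
    if nation_wave == X:
        for test in income_tests:
            checks += 1
            if test(income):
                break              # later ELSE IF branches are never tested
    return checks

flat_total = sum(flat_checks(n, i) for n, i in cases)
nested_total = sum(nested_checks(n, i) for n, i in cases)
print(flat_total, nested_total)    # 27 12
```

On this toy data the nested layout does fewer than half the checks, and the saving grows with the share of cases that fail the nation/wave test.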

The point is that if you check for nation/wave first, you eliminate a lot of cases that you aren't recoding at that point in time, so you go on to the next case instead of checking all the other stuff. Then I realized that it was still inefficient, since with the IF statements you can reduce the number of checks it has to do -- in this case, once you have checked (income < y) and (income > y), you don't need to check whether everything else is (income = y), since it has to be, unless there are messy cases with missing values, which there always are. Tight code is good, but so is error checking.

IF (nation/wave = x) DO
   IF (income < y) DO stuff.
   ELSE IF (income > y) DO other stuff.
   ELSE DO lots of stuff.
   END IF.
END IF.
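
The catch with a bare ELSE is that it takes whatever is left, not only the "equal" cases. A minimal Python sketch (again not SPSS) of that trade-off follows. It uses NaN as a stand-in for a missing income, and the function names and cut-off are made up. How SPSS itself treats missing values inside these structures may well differ, so this only illustrates the general worry.

```python
import math

Y = 5  # placeholder income cut-off

def recode_tight(income):
    """Two tests plus a catch-all ELSE, as in the tightened pseudocode."""
    if income < Y:
        return "below"
    elif income > Y:
        return "above"
    else:
        return "equal"     # also receives anything that is neither < nor >

def recode_checked(income):
    """Same logic, but missing values are screened out first."""
    if math.isnan(income):
        return "missing"
    return recode_tight(income)

missing = float("nan")         # stand-in for a missing income
print(recode_tight(missing))   # equal   (silently misclassified)
print(recode_checked(missing)) # missing
```

The extra check costs one more comparison per case, which is the price of the error checking.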

And that third version was really optimized code. The majority of cases are income < y (which is why that check comes first), and the = cases varied by nation/wave. I am really tempted to run the unoptimized and optimized versions on the larger dataset (which has all of the nations in it, not just the ones we are using, so it has thousands more cases). Agh, Blogger doesn't like the less-than and greater-than signs; it keeps hashing up the post because it is interpreting them as HTML. Time for HTML coding...

Ok, the times for the two versions of the code were the same on the big file (267,870 cases -- ok not that big really), which, given the weird "execute" syntax, makes me think that SPSS is compiling the syntax file to some extent and optimizing what it is doing. Nice if it is.