Statistical Graphics

One last time

May 9th, 2009  |  Published in Data, Social Science, Statistical Graphics

Final thing on the car-culture regression. Below is a comparison of the actual data on Vehicle Miles Traveled with my reconstruction of Nate Silver's model, and my model including lagged gas prices, housing prices, and the stock market.

Comparing two models of American driving habits

I "seasonally adjusted" the miles data by fitting a model predicting miles based only on the month of the year. The miles data (whether the actual data or the prediction from a model) is then corrected by subtracting the coefficient for the month it was collected. This data is normalized according to the level of driving in April.
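A minimal sketch of that adjustment, written in Python rather than the R used for these posts, with made-up numbers instead of the real mileage data. It relies on the fact that a regression on month dummies alone gives each month a coefficient equal to its mean minus the reference month's mean; the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (month, billions of miles). NOT the post's data.
observations = [
    ("April", 250.0), ("July", 268.0), ("January", 228.0),
    ("April", 254.0), ("July", 272.0), ("January", 232.0),
]

def month_effects(rows, reference="April"):
    """Coefficients of a month-only model, relative to the reference month.

    With only month dummies in the model, each coefficient is that month's
    mean minus the reference month's mean.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, miles in rows:
        totals[month] += miles
        counts[month] += 1
    means = {m: totals[m] / counts[m] for m in totals}
    return {m: means[m] - means[reference] for m in means}

def seasonally_adjust(rows, effects):
    """Subtract each observation's month coefficient from its value."""
    return [(month, miles - effects[month]) for month, miles in rows]

effects = month_effects(observations)
adjusted = seasonally_adjust(observations, effects)
```

Because April is the reference, its coefficient is zero and every other month is expressed at "April level" after adjustment.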

An even better fit is possible with a more complex model that includes a) average monthly temperatures and b) an interaction between gas prices and time. But this simpler model suffices to show that Silver's original finding was probably an artifact of his failure to control for wealth effects and the lagged effect of gas prices.

The lesson, I suppose, is: beware of columnists on deadline bearing regressions!

Moment of Zen

May 8th, 2009  |  Published in Data, Social Science, Statistical Graphics

Here are the variables I used in the models for the previous post. Simplistic social theories are left as an exercise for the reader.

Economic variables, 1990-2009

Attempt to Regress

May 8th, 2009  |  Published in Data, Social Science, Statistical Graphics

I'm loath to say an unkind word about Nate Silver. Besides boosting the profile of my alma mater, he's done more than anyone else to improve the reputation and sexiness of my present occupation: statistical data analyst. This is all the more welcome at a time when other people are blaming statistical models for, well, ruining everything.

But I confess to being a bit annoyed when I read Silver's recent article about the changes in American driving habits. In that article, Silver argues that we're seeing a real shift away from car culture, based on the following:

I built a regression model that accounts for both gas prices and the unemployment rate in a given month and attempts to predict from this data how much the typical American will drive. The model also accounts for the gradual increase in driving over time, as well as the seasonality of driving levels, which are much higher during the summer than during the winter.

All well and good, except that Silver doesn't provide the model or the data! He asks us to take his word for it that in January, Americans "drove about 8 percent less than the model predicted."

Now, I don't expect anyone to publish regression coefficients in Esquire magazine, but Silver does have a rather well-known website, so he could have put it there. The analysis was already done and published, so I don't see how it would have hurt Silver to publish the data after the fact. Which is what makes me suspect that he kept things deliberately vague in order to maintain a sense of mystery and awe around his regression models. Particularly because in this case, the underlying model is actually quite simple.

Which is a shame, because the simplicity of the model is actually the most appealing thing about it. It's a great example of a situation where a regression illuminates a relationship that would be really hard to discern using simple descriptive statistics. The model is a perfect balance between being simple enough to be believable, and complex enough to really gain you something over simple descriptives. In fact, it's something that I plan to refer to in the future when my less quant-y friends question the need for regressions.

Which is why I decided to recreate Silver's analysis from scratch, which took me about an hour. First I had to figure out what Silver's model was. Based on the paragraph above, I decided on:

miles = gas + unemployment + date + month
Monthly miles driven are modeled as a function of that month's average gas prices, the unemployment rate in that month, the date, and which month of the year it is. The date variable will capture the "gradual increase" in miles traveled. I use month to capture the "seasonality of driving levels". I could have grouped the months into seasons, but why not use a more precise measure if you've got it?
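To make the month term concrete, here is a rough Python sketch of how one observation expands into model columns. It assumes R's defaults (factor levels sorted alphabetically, with the first level as the baseline), which is consistent with the coefficient tables in this post having no April row. The function name and the example inputs are hypothetical.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Assumption: levels are sorted alphabetically and the first one is the baseline.
LEVELS = sorted(MONTHS)
REFERENCE = LEVELS[0]       # "April"
DUMMY_MONTHS = LEVELS[1:]   # the 11 months that get their own coefficient

def design_row(unemp, gasprice, date, month):
    """One row of the design matrix: intercept, three numeric terms, 11 dummies."""
    dummies = [1.0 if month == m else 0.0 for m in DUMMY_MONTHS]
    return [1.0, unemp, gasprice, date] + dummies

# Hypothetical inputs; the units of `date` in the original model are not stated.
row = design_row(unemp=5.0, gasprice=250.0, date=1000.0, month="July")
```

Each month coefficient is then read as the difference from April, holding the other variables fixed.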

The next step was to find the data: From different sources, I obtained data on miles traveled, gas prices, and unemployment. All of these sources start around 1990, so that's the time frame we'll have to work with.

With that in hand, it was time for some analysis. Using R, I combined the different data sources and ran myself a regression:

lm(formula = miles ~ unemp + price + date + month)
(Intercept)     98.52     3.71
unemp           -2.09     0.34
gasprice        -0.08     0.01
date             0.01     0.00
monthAugust     17.90     1.40
monthDecember   -8.82     1.40
monthFebruary  -30.26     1.42
monthJanuary   -22.03     1.40
monthJuly       17.87     1.42
monthJune       11.34     1.42
monthMarch       0.42     1.42
monthMay        12.56     1.42
monthNovember  -10.00     1.40
monthOctober     5.85     1.40
monthSeptember  -2.55     1.40

n = 222, k = 15 residual sd = 4.25, R-Squared = 0.98

That R-Squared of 0.98 means that about 98% of the actual variation in miles traveled is explained by the variables in this model. So it's a pretty comprehensive picture of the things that predict how much Americans will drive. A one point increase in the unemployment rate, in this model, predicts a 2.09 billion mile decrease in miles driven. And gas prices are in cents, so a one-cent increase in the price of gas will, all things being equal, translate into an 80 million mile decrease in miles driven.
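The arithmetic behind those two readings, as a small Python sketch. The coefficients are copied from the output above and are in billions of miles per month; the function name and the 50-cent scenario are only illustrative.

```python
# Coefficients from the first model's output (billions of miles per month).
COEF_UNEMP = -2.09   # per one-point rise in the unemployment rate
COEF_GAS = -0.08     # per one-cent rise in the gas price

def predicted_change(d_unemp_points=0.0, d_gas_cents=0.0):
    """Predicted change in monthly miles (billions), other variables held fixed."""
    return COEF_UNEMP * d_unemp_points + COEF_GAS * d_gas_cents

one_point_unemp = predicted_change(d_unemp_points=1)  # -2.09 billion miles
one_cent_gas = predicted_change(d_gas_cents=1)        # -0.08 billion = 80 million miles
fifty_cents_gas = predicted_change(d_gas_cents=50)    # hypothetical 50-cent rise
```

These are linear, all-else-equal effects; the model says nothing about whether a 50-cent jump behaves like fifty one-cent increases in practice.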

The next step was to check out Silver's assertion that recent data on miles driven is lower than the model would predict. Recall that Silver's model over-predicted January miles driven by 8 percent. My model predicts that in January, Americans should have driven 239.6 billion miles. The actual number was 222 billion miles. The prediction is--wait for it--7.9 percent more than the actual number! That's pretty amazing actually, and it indicates that my data and model must be pretty damn close to Silver's.
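For anyone checking the arithmetic, a quick Python sketch using the two figures quoted above. Note that the base matters: the 7.9 percent figure divides by the actual number, while a phrase like "less than the model predicted" would divide by the prediction. The function names are mine.

```python
predicted = 239.6  # billions of miles, this model's January prediction
actual = 222.0     # billions of miles, reported January figure

def percent_over_actual(predicted, actual):
    """How much larger the prediction is, as a percentage of the actual value."""
    return 100.0 * (predicted - actual) / actual

def percent_under_prediction(predicted, actual):
    """How far the actual value falls short, as a percentage of the prediction."""
    return 100.0 * (predicted - actual) / predicted

over = percent_over_actual(predicted, actual)        # roughly 7.9
under = percent_under_prediction(predicted, actual)  # roughly 7.3
```

Either way the gap is in the same neighborhood as Silver's "about 8 percent".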

With the model in hand, however, we can do a bit better than this. Below is a chart showing how close the model was for every month in my dataset. It's similar to the graphic accompanying Silver's Esquire article, only not as ugly and confusing.

Comparison of a regression model of vehicle miles driven with the actual value

The graph shows the difference between the prediction and the actual number. When the point is above the zero line, it means people drove more than the model would predict. When it's below the line, they drove less.

You can see here that there are multiple imperfections in the model. Mileage declined a little faster than predicted in the late 90's, and then rose faster than expected in the early 2000's. It's possible that this has something to do with a policy difference between the Bush and Clinton administrations, but I'm not enough of an expert to say.

What jumps out, though, are those last three points on the right, corresponding to this past November, December, and January. All of them are way off the prediction, and the error is bigger than for any other time period. This strongly suggests that something really has changed. What's not totally clear, though, is whether it's the car culture that's different, or whether it's this recession that's unlike the other two recessions in this data set (the early 90's and early 2000's).

The next logical step is to consider some additional variables. Some commenters at Nate's site pointed out that you might want to factor in changes in wealth--as opposed to changes in income, which are at least partly captured by the unemployment variable. Directly measuring wealth is a little tricky, but we can easily measure two things that are proxies for wealth, or people's perceptions of wealth: the stock market and the housing market. So I went google-hunting again and found two more variables: the monthly closing of the Dow, and the government's housing price index. Put those into the regression, and away we go:

lm(formula = miles ~ unemp + price + date + stocks + housing + month)
(Intercept)    117.87     4.13
unemp           -1.64     0.48
gasprice        -0.11     0.01
date             0.01     0.00
stocks           1.01     0.30
housing          0.24     0.03
monthAugust     18.40     1.20
monthDecember   -8.88     1.21
monthFebruary  -30.58     1.21
monthJanuary   -22.12     1.19
monthJuly       18.28     1.20
monthJune       11.74     1.20
monthMarch       0.30     1.20
monthMay        12.77     1.20
monthNovember  -10.02     1.21
monthOctober     6.42     1.21
monthSeptember  -1.92     1.21

n = 217, k = 17 residual sd = 3.60, R-Squared = 0.98

R-squared looks the same, but the residual standard deviation is lower, which indicates that this model predicts more of the variation in the data than the last one. And the new variables both have pretty big and statistically significant effects. The stock market close is scaled in thousands, so the coefficient indicates that for every 1000 point increase in the Dow, we drive 1 billion more miles. The housing price index defines 1991 prices as 100, and went into the 220's during the bubble. Every one point increase in that index predicts a 240 million mile increase in driving.
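A short Python sketch of what the residual standard deviation measures, assuming the usual definition (square root of the residual sum of squares divided by n minus k). The second helper backs out the residual sum of squares implied by the rounded figures in the two outputs, so those numbers are approximate; both function names are hypothetical.

```python
import math

def residual_sd(residuals, k):
    """Residual standard deviation: sqrt(RSS / (n - k)) for k fitted coefficients."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / (n - k))

def implied_rss(sd, n, k):
    """Residual sum of squares implied by a reported residual sd, n, and k."""
    return sd ** 2 * (n - k)

# Figures as reported in the two model outputs (rounded, so only approximate).
rss_first = implied_rss(4.25, n=222, k=15)
rss_second = implied_rss(3.60, n=217, k=17)

# Toy residuals, just to show the first function in use.
toy_sd = residual_sd([3.0, -3.0, 3.0, -3.0, 0.0], k=1)
```

The two models are fit on slightly different numbers of months (222 versus 217), so the comparison is suggestive rather than exact.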

Here's another version of the graph above, for our new model:

Predicted and actual miles, from a model with stock and housing prices

The same patterns are still present, but the divergence between the predictions and the actual numbers is smaller now. (Incidentally, I have no idea what happened in January of 1995. Did everyone go on a road trip without telling me?) It still looks like there's been some qualitative change in US driving habits recently, but the case is less clear cut. In particular, the late 90's now looks like another outstanding mystery. Mileage declined by more than the model expected then, but why? At the moment I have no particular hypothesis about that.

My final model tests something else that appears in Nate's article:

There is strong statistical evidence, in fact, that Americans respond rather slowly to changes in fuel prices. The cost of gas twelve months ago, for example, has historically been a much better predictor of driving behavior than the cost of gas today. In the energy crisis of the early 1980s, for instance, the price of gas peaked in March 1981, but driving did not bottom out until a year later.

OK, so let's try using the price of gas 12 months ago as a predictor along with current prices. This will force us to throw away a bit of data, but we can still fit a model on most of the data points:

lm(formula = miles ~ unemp + price + price12 + date + stocks +
 housing + month, data = data)
(Intercept)    112.28     3.82
unemp           -0.93     0.42
gasprice        -0.07     0.01
gasprice12      -0.08     0.01
date             0.01     0.00
stocks           0.93     0.26
housing          0.25     0.02
monthAugust     18.19     1.04
monthDecember   -8.99     1.05
monthFebruary  -31.26     1.06
monthJanuary   -22.20     1.05
monthJuly       18.17     1.05
monthJune       11.58     1.05
monthMarch       0.10     1.06
monthMay        12.88     1.05
monthNovember  -10.06     1.04
monthOctober     6.29     1.04
monthSeptember  -2.08     1.04

n = 210, k = 18 residual sd = 3.07, R-Squared = 0.99
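The lagged predictor in this model can be sketched in a few lines of Python. This is only an illustration with invented prices, not the code used for the post: each month is paired with the price 12 months earlier, and months that have no earlier price to pair with drop out, which is why some data gets thrown away.

```python
def add_lag(series, lag=12):
    """Pair each value with the one `lag` periods earlier.

    The first `lag` entries have no earlier value, so they are dropped.
    """
    return [(series[i], series[i - lag]) for i in range(lag, len(series))]

# Hypothetical monthly gas prices in cents (15 months).
prices = [100 + i for i in range(15)]
rows = add_lag(prices, lag=12)  # (current price, price 12 months ago)
```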

It looks like current gas prices and last year's gas prices are about equivalent in their effect on mileage. Now let's look at the graph of prediction error again:

Miles driven, predicted and actual, third model

Lo and behold, the apparently anomalous findings from the last few months have disappeared. This isn't the last word, of course, nor is it the perfect model. But it no longer appears that US driving behavior is so unusual, when you account for all the relevant economic contextual factors.

Anyhow, that's enough playing around in the data for me for the time being. In the end, this whole exercise helped me understand what I like best about Nate Silver's work. He's inventing a new media niche, call it "statistical journalist". He uses publicly available data to produce quick, topical analysis that illuminates the issues of the day in a way that neither anecdotes nor naive recitations of descriptive statistics can. He may play fast and loose at times, but his methods are transparent enough that people like me can still check up on him. I certainly hope that this kind of writing becomes an established sub-specialty with a wider base of practitioners than just Silver himself.

Graphs > Tables, again

March 16th, 2009  |  Published in Data, Statistical Graphics

Over at the Monkey Cage, Lee Sigelman notes a new study from the CDC that tries to figure out how many people and households in each state have no land line and rely entirely on cell phones. Being a good student of Andrew Gelman, my first thought upon clicking the link was: "these tables are horrible, they should be graphs!" My second thought was, "Gelman will probably come along and produce graphs of the data himself". So before that happens, I thought I'd take a stab at summarizing the paper's first couple of tables:

Cell phone only data

The intervals aren't classical 95% intervals--they're some kind of fancy estimation from the CDC that you'll have to click the link to find out about. The hollow points/dashed lines are the "modeled" estimates, and the black points/solid lines are the "direct estimates". The points are in order according to the modeled estimates.

The nice thing about displaying this graphically is that you can see how much uncertainty there is on some of these estimates, so you get a better idea of what this graph does and does not tell you. For example, Washington DC is estimated to have the highest percentage of adults in cell-only households, but the confidence intervals reveal that this doesn't really mean anything--the most you can say is that DC is on the high end of cell-only prevalence.
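The same point in a rough Python sketch, with invented numbers rather than the CDC's estimates: if the top-ranked area's interval overlaps its neighbors' intervals, the ranking among them is not well determined. Overlap is an informal check, not a formal test, and the names and figures here are hypothetical.

```python
def overlaps(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical (point estimate, low, high) percentages. NOT the CDC figures.
estimates = {
    "Area 1": (20.0, 14.0, 26.0),
    "Area 2": (18.0, 16.0, 20.0),
    "Area 3": (8.0, 6.0, 10.0),
}

top = max(estimates, key=lambda name: estimates[name][0])
top_interval = estimates[top][1:]

# Areas whose intervals overlap the top-ranked one cannot be clearly ranked below it.
not_clearly_lower = [
    name for name, (_, low, high) in estimates.items()
    if name != top and overlaps(top_interval, (low, high))
]
```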

Pessimism of the Intellect

October 30th, 2008  |  Published in Politics, Statistical Graphics

My boss is a prominent political scientist and an Obama supporter. This afternoon, he was ribbing me for being a "pox on both your houses" ultra-leftist who only grudgingly acknowledges that electing Obama would be good for the left.

After our meeting had ended, I came up with a perfect encapsulation of my feelings about Obama, which has the added benefit of being an extremely nerdy joke. My point estimate is that it does matter whether Obama wins. But my confidence interval for how much it matters includes zero. In the spirit of Jessica Hagy, I present the argument in graph form:

How I feel about Obama