Attempt to Regress

May 8th, 2009 | Published in Data, Social Science, Statistical Graphics

I'm loath to say an unkind word about Nate Silver. Besides boosting the profile of my alma mater, he's done more than anyone else to improve the reputation and sexiness of my present occupation: statistical data analyst. This is all the more welcome at a time when other people are blaming statistical models for, well, ruining everything.

But I confess to being a bit annoyed when I read Silver's recent article about the changes in American driving habits. In that article, Silver argues that we're seeing a real shift away from car culture, based on the following:

I built a regression model that accounts for both gas prices and the unemployment rate in a given month and attempts to predict from this data how much the typical American will drive. The model also accounts for the gradual increase in driving over time, as well as the seasonality of driving levels, which are much higher during the summer than during the winter.

All well and good, except that Silver doesn't provide the model or the data! He asks us to take his word for it that in January, Americans "drove about 8 percent less than the model predicted."

Now, I don't expect anyone to publish regression coefficients in Esquire magazine, but Silver does have a rather well-known website, so he could have put it there. The analysis was already done and published, so I don't see how it would have hurt Silver to publish the data after the fact. Which is what makes me suspect that he kept things deliberately vague in order to maintain a sense of mystery and awe around his regression models. Particularly because in this case, the underlying model is actually quite simple.

Which is a shame, because the simplicity of the model is actually the most appealing thing about it. It's a great example of a situation where a regression illuminates a relationship that would be really hard to discern using simple descriptive statistics. The model is a perfect balance between being simple enough to be believable, and complex enough to really gain you something over simple descriptives. In fact, it's something that I plan to refer to in the future when my less quant-y friends question the need for regressions.

So I decided to recreate Silver's analysis from scratch, which took me about an hour. First I had to figure out what Silver's model was. Based on the passage quoted above, I decided on:

miles = gas + unemployment + date + month

Monthly miles driven are modeled as a function of that month's average gas prices, the unemployment rate in that month, the date, and which month of the year it is. The date variable will capture the "gradual increase" in miles traveled. I use month to capture the "seasonality of driving levels". I could have grouped the months into seasons, but why not use a more precise measure if you've got it?
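As a small sketch of what using month as a predictor means in R: if month is stored as a factor, lm() expands it into one indicator column per month, minus a reference level. With R's default alphabetical ordering of factor levels, April sorts first and becomes the baseline, so each month coefficient reads as a difference from April. The data frame below (`toy`) is made up purely to show the encoding.

```r
# Hypothetical example: twelve month labels, one per row.
toy <- data.frame(month = factor(month.name))

# Levels are sorted alphabetically by default, so "April" is the reference.
levels(toy$month)[1]

# model.matrix() shows the columns lm() would build from the factor:
# an intercept plus 11 indicator columns, with no column for April.
X <- model.matrix(~ month, data = toy)
colnames(X)
```

Grouping months into seasons would simply mean a factor with four levels instead of twelve, and so three indicator columns instead of eleven.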

The next step was to find the data. From different sources, I obtained data on miles traveled, gas prices, and unemployment. All of these sources start around 1990, so that's the time frame we'll have to work with.

With that in hand, it was time for some analysis. Using R, I combined the different data sources and ran myself a regression:

lm(formula = miles ~ unemp + gasprice + date + month)
                coef.est coef.se
(Intercept)     98.52     3.71
unemp           -2.09     0.34
gasprice        -0.08     0.01
date             0.01     0.00
monthAugust     17.90     1.40
monthDecember   -8.82     1.40
monthFebruary  -30.26     1.42
monthJanuary   -22.03     1.40
monthJuly       17.87     1.42
monthJune       11.34     1.42
monthMarch       0.42     1.42
monthMay        12.56     1.42
monthNovember  -10.00     1.40
monthOctober     5.85     1.40
monthSeptember  -2.55     1.40
---
n = 222, k = 15
residual sd = 4.25, R-Squared = 0.98
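
The combining-and-fitting step might look roughly like the sketch below. The three data frames, their column names, the `day` variable, and every number in them are simulated stand-ins, not the series behind the table above, so the coefficients will not match. Only the shape of the workflow is the point.

```r
set.seed(1)
n <- 120
dates <- seq(as.Date("1990-01-01"), by = "month", length.out = n)

# Simulated stand-ins for the three monthly sources (all values invented).
gasprice <- runif(n, 100, 300)   # cents per gallon
unemp    <- runif(n, 4, 8)       # percent
miles    <- 200 - 2 * unemp - 0.08 * gasprice +
            0.01 * as.numeric(dates - dates[1]) + rnorm(n, sd = 4)

miles_src <- data.frame(date = dates, miles = miles)
gas_src   <- data.frame(date = dates, gasprice = gasprice)
unemp_src <- data.frame(date = dates, unemp = unemp)

# Join the sources on the month they describe.
d <- merge(merge(miles_src, gas_src, by = "date"), unemp_src, by = "date")

# Month of the year as a factor, and the date as a plain number of days.
d$month <- factor(month.name[as.integer(format(d$date, "%m"))])
d$day   <- as.numeric(d$date)

fit <- lm(miles ~ unemp + gasprice + day + month, data = d)
round(coef(summary(fit))[, c("Estimate", "Std. Error")], 2)
```

With twelve month levels this yields 15 coefficients (intercept, three numeric predictors, eleven month indicators), which is the k = 15 reported in the table above.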

That R-Squared of 0.98 means that about 98% of the actual variation in miles traveled is explained by the variables in this model. So it's a pretty comprehensive picture of the things that predict how much Americans will drive. A one point increase in the unemployment rate, in this model, predicts a 2.09 billion mile decrease in miles driven. And gas prices are in cents, so a one-cent increase in the price of gas will, all things being equal, translate into an 80 million mile decrease in miles driven.
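
To make those two readings concrete: R-squared is one minus the ratio of residual variation to total variation, and the coefficients are in billions of miles per unit of each predictor. The sketch below checks the R-squared identity on simulated data (invented numbers, hypothetical variable names `x` and `y`) and then converts the two coefficients quoted above into plain miles.

```r
set.seed(2)
# Invented data, only to check the identity; not the driving series.
x <- runif(100, 4, 8)
y <- 250 - 2 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)     # variation the model leaves unexplained
tss <- sum((y - mean(y))^2)      # total variation around the mean
r2_by_hand <- 1 - rss / tss
all.equal(r2_by_hand, summary(fit)$r.squared)

# Coefficients taken from the table, in billions of miles per unit.
unemp_coef <- -2.09   # per percentage point of unemployment
gas_coef   <- -0.08   # per cent of gas price
unemp_coef * 1e9      # miles per point of unemployment
gas_coef * 1e9        # miles per cent: a decrease of 80 million
```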

The next step was to check out Silver's assertion that recent data on miles driven is lower than the model would predict. Recall that Silver's model over-predicted January miles driven by 8 percent. My model predicts that in January, Americans should have driven 239.6 billion miles. The actual number was 222 billion miles. The prediction is--wait for it--7.9 percent more than the actual number! That's pretty amazing actually, and it indicates that my data and model must be pretty damn close to Silver's.
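
The comparison is simple arithmetic, though the exact percentage depends on which number is used as the base. A quick check with the two figures quoted above (the variable names are just for illustration):

```r
predicted <- 239.6   # billions of miles, the model's January prediction
actual    <- 222     # billions of miles, the reported January figure

# Prediction relative to the actual number.
over_pct  <- 100 * (predicted - actual) / actual
# Actual number relative to the prediction.
under_pct <- 100 * (predicted - actual) / predicted

round(c(over = over_pct, under = under_pct), 1)   # 7.9 and 7.3
```

Either way the gap is in the neighborhood of Silver's "about 8 percent".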

With the model in hand, however, we can do a bit better than this. Below is a chart showing how close the model was for every month in my dataset. It's similar to the graphic accompanying Silver's Esquire article, only not as ugly and confusing.

Comparison of a regression model of vehicle miles driven with the actual value

The graph shows the difference between the prediction and the actual number. When the point is above the zero line, it means people drove more than the model would predict. When it's below the line, they drove less.
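
A minimal sketch of how such a prediction-error series can be computed, using an invented monthly series and hypothetical column names. The sign convention is actual minus predicted, so positive values mean more driving than the model expected, which is the same convention residuals() uses.

```r
set.seed(3)
# Invented monthly series, purely for illustration.
d <- data.frame(t = 1:60)
d$actual <- 200 + 0.5 * d$t + rnorm(60, sd = 4)
fit <- lm(actual ~ t, data = d)

d$predicted <- as.numeric(fitted(fit))
d$error <- d$actual - d$predicted   # positive: drove more than predicted

# residuals() returns the same quantity.
all.equal(as.numeric(residuals(fit)), d$error)

# Draw the chart only in an interactive session.
if (interactive()) {
  plot(d$t, d$error, xlab = "Month", ylab = "Actual minus predicted")
  abline(h = 0, lty = 2)   # the zero line
}
```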

You can see here that there are multiple imperfections in the model. Mileage declined a little faster than predicted in the late 90's, and then rose faster than expected in the early 2000's. It's possible that this has something to do with a policy difference between the Bush and Clinton administrations, but I'm not enough of an expert to say.

What jumps out, though, are those last three points on the right, corresponding to this past November, December, and January. All of them are way off the prediction, and the error is bigger than for any other time period. This strongly suggests that something really has changed. What's not totally clear, though, is whether it's the car culture that's different, or whether it's this recession that's unlike the other two recessions in this data set (the early 90's and early 2000's).

The next logical step is to consider some additional variables. Some commenters at Nate's site pointed out that you might want to factor in changes in wealth--as opposed to changes in income, which are at least partly captured by the unemployment variable. Directly measuring wealth is a little tricky, but we can easily measure two things that are proxies for wealth, or people's perceptions of wealth: the stock market and the housing market. So I went google-hunting again and found two more variables: the monthly closing of the Dow, and the government's housing price index. Put those into the regression, and away we go:

lm(formula = miles ~ unemp + gasprice + date + stocks + housing + month)
               coef.est coef.se
(Intercept)    117.87     4.13
unemp           -1.64     0.48
gasprice        -0.11     0.01
date             0.01     0.00
stocks           1.01     0.30
housing          0.24     0.03
monthAugust     18.40     1.20
monthDecember   -8.88     1.21
monthFebruary  -30.58     1.21
monthJanuary   -22.12     1.19
monthJuly       18.28     1.20
monthJune       11.74     1.20
monthMarch       0.30     1.20
monthMay        12.77     1.20
monthNovember  -10.02     1.21
monthOctober     6.42     1.21
monthSeptember  -1.92     1.21
---
n = 217, k = 17
residual sd = 3.60, R-Squared = 0.98

R-squared looks the same, but the residual standard deviation is lower, which indicates that this model predicts more of the variation in the data than the last one. And the new variables both have pretty big and statistically significant effects. The stock market close is scaled in thousands, so the coefficient indicates that for every 1000 point increase in the Dow, we drive 1 billion more miles. The housing price index defines 1991 prices as 100, and went into the 220's during the bubble. Every one point increase in that index predicts a 240 million mile increase in driving.
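
One caveat when comparing the two fits: the second table reports n = 217 rather than 222, so the two models were fitted to slightly different sets of months. A cleaner comparison refits the smaller model on exactly the rows the larger one uses. A sketch with simulated data and hypothetical column names, in which the missing values are planted artificially:

```r
set.seed(4)
n <- 150
d <- data.frame(unemp    = runif(n, 4, 8),
                gasprice = runif(n, 100, 300),
                housing  = runif(n, 100, 220))
d$miles <- 200 - 2 * d$unemp - 0.08 * d$gasprice +
           0.25 * d$housing + rnorm(n, sd = 3)
d$housing[1:5] <- NA   # pretend the housing series starts later

# Keep only rows where every variable is present, then fit both models on them.
cc    <- d[complete.cases(d), ]
small <- lm(miles ~ unemp + gasprice, data = cc)
big   <- lm(miles ~ unemp + gasprice + housing, data = cc)

c(small = summary(small)$sigma,     big = summary(big)$sigma)       # residual sd
c(small = summary(small)$r.squared, big = summary(big)$r.squared)   # R-squared
```

On identical rows, adding a predictor can never lower R-squared, so the residual standard deviation is the more informative of the two numbers to watch.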

Here's another version of the graph above, for our new model:

Predicted and actual miles, from a model with stock and housing prices

The same patterns are still present, but the divergence between the predictions and the actual numbers is smaller now. (Incidentally, I have no idea what happened in January of 1995. Did everyone go on a road trip without telling me?) It still looks like there's been some qualitative change in US driving habits recently, but the case is less clear cut. In particular, the late 90's now looks like another outstanding mystery. Mileage declined by more than the model expected then, but why? At the moment I have no particular hypothesis about that.

My final model tests something else that appears in Nate's article:

There is strong statistical evidence, in fact, that Americans respond rather slowly to changes in fuel prices. The cost of gas twelve months ago, for example, has historically been a much better predictor of driving behavior than the cost of gas today. In the energy crisis of the early 1980s, for instance, the price of gas peaked in March 1981, but driving did not bottom out until a year later.

OK, so let's try using the price of gas 12 months ago as a predictor along with current prices. This will force us to throw away a bit of data, but we can still fit a model on most of the data points:

lm(formula = miles ~ unemp + gasprice + gasprice12 + date + stocks +
    housing + month, data = data)
               coef.est coef.se
(Intercept)    112.28     3.82
unemp           -0.93     0.42
gasprice        -0.07     0.01
gasprice12      -0.08     0.01
date             0.01     0.00
stocks           0.93     0.26
housing          0.25     0.02
monthAugust     18.19     1.04
monthDecember   -8.99     1.05
monthFebruary  -31.26     1.06
monthJanuary   -22.20     1.05
monthJuly       18.17     1.05
monthJune       11.58     1.05
monthMarch       0.10     1.06
monthMay        12.88     1.05
monthNovember  -10.06     1.04
monthOctober     6.29     1.04
monthSeptember  -2.08     1.04
---
n = 210, k = 18
residual sd = 3.07, R-Squared = 0.99
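
For reference, a lagged predictor like gasprice12 can be built by shifting the price column down twelve rows. This is only a sketch on invented prices, and it assumes the rows are sorted by date with no missing months.

```r
# Invented monthly prices in cents; 36 months for illustration.
d <- data.frame(gasprice = 100 + seq_len(36))

# Shift the series down by 12 rows: row i gets the price from row i - 12.
# The first 12 rows have no value from a year earlier, so they become NA.
d$gasprice12 <- c(rep(NA, 12), head(d$gasprice, -12))

d[c(1, 12, 13, 36), ]
sum(complete.cases(d))   # rows with both prices present: 24 of 36
```

By default lm() drops rows containing NA, which is why adding a lagged variable costs some data points.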

It looks like current gas prices and last year's gas prices are about equivalent in their effect on mileage. Now let's look at the graph of prediction error again:

Miles driven, predicted and actual, third model

Lo and behold, the apparently anomalous findings from the last few months have disappeared. This isn't the last word, of course, nor is it the perfect model. But it no longer appears that US driving behavior is so unusual, when you account for all the relevant economic contextual factors.

Anyhow, that's enough playing around in the data for me for the time being. In the end, this whole exercise helped me understand what I like best about Nate Silver's work. He's inventing a new media niche, call it "statistical journalist". He uses publicly available data to produce quick, topical analysis that illuminates the issues of the day in a way that neither anecdotes nor naive recitations of descriptive statistics can. He may play fast and loose at times, but his methods are transparent enough that people like me can still check up on him. I certainly hope that this kind of writing becomes an established sub-specialty with a wider base of practitioners than just Silver himself.

Peter Frase