Social Science

One last time

May 9th, 2009  |  Published in Data, Social Science, Statistical Graphics

Final thing on the car-culture regression. Below is a comparison of the actual data on Vehicle Miles Traveled with my reconstruction of Nate Silver's model, and my model including lagged gas prices, housing prices, and the stock market.

Comparing two models of American driving habits

I "seasonally adjusted" the miles data by fitting a model predicting miles based only on the month of the year. The miles data (whether the actual data or the prediction from a model) is then corrected by subtracting the coefficient for the month in which it was collected. This data is normalized according to the level of driving in April.
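
Here is a rough R sketch of that adjustment. It runs on simulated numbers, not the actual miles series, and the variable names are assumptions for illustration. Only the procedure comes from the description above: fit miles on month alone, subtract each month's coefficient, and treat April as the baseline.

```r
# Simulated stand-in for the monthly miles series (NOT the real data).
set.seed(1)
n_years <- 5
month <- relevel(factor(rep(month.name, times = n_years)), ref = "April")
true_season <- c(January = -22, February = -30, March = 0, April = 0,
                 May = 13, June = 11, July = 18, August = 18,
                 September = -3, October = 6, November = -10, December = -9)
miles <- unname(220 + true_season[as.character(month)] +
                  rnorm(length(month), sd = 2))

# Model miles using only the month of the year; April is the reference level.
season_fit <- lm(miles ~ month)

# Offset for each month: zero for April, the fitted coefficient otherwise.
offsets <- c(April = 0, coef(season_fit)[-1])
names(offsets) <- sub("^month", "", names(offsets))

# Subtract the offset for the month each observation was collected in.
adjusted <- miles - unname(offsets[as.character(month)])
```

After the adjustment, every month's average sits at the April average, which is what "normalized to April" amounts to in this sketch.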

An even better fit is possible with a more complex model that includes a) average monthly temperatures and b) an interaction between gas prices and time. But this simpler model suffices to show that Silver's original finding was probably an artifact of his failure to control for wealth effects and the lagged effect of gas prices.

The lesson, I suppose, is: beware of columnists on deadline bearing regressions!

Predictin'

May 9th, 2009  |  Published in Data, Social Science

Update to the post below: I decided to see how well my model will predict miles traveled going forward. My model only includes data through January, as Nate Silver's did. But we have the data through February now, so we can see how well the model works there. We also have almost all the data needed to predict March--the only thing missing is the government's Housing Price Index. But that doesn't change too much month to month, so I made a prediction based on the February value:

           Predicted    Actual
February     215.37     215.77
March        245.31     ??

The March numbers should be out soon, so we'll see how my model performs.
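
For the curious, here is a small R sketch of the two steps involved: carrying the last published housing index forward into a month where it is missing, and scoring the February forecast. The data frame and its housing value are hypothetical placeholders; only the February predicted and actual figures come from the table above.

```r
# Hypothetical predictor table; the housing value is a placeholder.
new_months <- data.frame(
  month   = c("February", "March"),
  housing = c(200.0, NA),          # March index not yet published
  stringsAsFactors = FALSE
)

# Carry the last observed value forward into the missing month.
is_gap <- is.na(new_months$housing)
new_months$housing[is_gap] <- new_months$housing[max(which(!is_gap))]

# Forecast error for February, the one month with a known actual value.
predicted_feb <- 215.37
actual_feb    <- 215.77
error_feb     <- actual_feb - predicted_feb      # in billions of miles
pct_error_feb <- 100 * error_feb / actual_feb    # as a share of the actual
```

With a fitted model in hand, a table like `new_months` (with all the model's predictors filled in) is what you would pass to `predict()` as `newdata`.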

Moment of Zen

May 8th, 2009  |  Published in Data, Social Science, Statistical Graphics

Here are the variables I used in the models for the previous post. Simplistic social theories are left as an exercise for the reader.

Economic variables, 1990-2009

Attempt to Regress

May 8th, 2009  |  Published in Data, Social Science, Statistical Graphics

I'm loath to say an unkind word about Nate Silver. Besides boosting the profile of my alma mater, he's done more than anyone else to improve the reputation and sexiness of my present occupation: statistical data analyst. This is all the more welcome at a time when other people are blaming statistical models for, well, ruining everything.

But I confess to being a bit annoyed when I read Silver's recent article about the changes in American driving habits. In that article, Silver argues that we're seeing a real shift away from car culture, based on the following:

I built a regression model that accounts for both gas prices and the unemployment rate in a given month and attempts to predict from this data how much the typical American will drive. The model also accounts for the gradual increase in driving over time, as well as the seasonality of driving levels, which are much higher during the summer than during the winter.

All well and good, except that Silver doesn't provide the model or the data! He asks us to take his word for it that in January, Americans "drove about 8 percent less than the model predicted."

Now, I don't expect anyone to publish regression coefficients in Esquire magazine, but Silver does have a rather well-known website, so he could have put it there. The analysis was already done and published, so I don't see how it would have hurt Silver to publish the data after the fact. Which is what makes me suspect that he kept things deliberately vague in order to maintain a sense of mystery and awe around his regression models. Particularly because in this case, the underlying model is actually quite simple.

Which is a shame, because the simplicity of the model is actually the most appealing thing about it. It's a great example of a situation where a regression illuminates a relationship that would be really hard to discern using simple descriptive statistics. The model is a perfect balance between being simple enough to be believable, and complex enough to really gain you something over simple descriptives. In fact, it's something that I plan to refer to in the future when my less quant-y friends question the need for regressions.

Which is why I decided to recreate Silver's analysis from scratch, which took me about an hour. First I had to figure out what Silver's model was. Based on the paragraph above, I decided on:

miles = gas + unemployment + date + month

Monthly miles driven are modeled as a function of that month's average gas prices, the unemployment rate in that month, the date, and which month of the year it is. The date variable will capture the "gradual increase" in miles traveled. I use month to capture the "seasonality of driving levels". I could have grouped the months into seasons, but why not use a more precise measure if you've got it?
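
In R terms, that specification might look like the sketch below. The data are simulated, and coding `date` as a simple month counter is an assumption (the post doesn't say how the date variable is stored). The point is only to show how a `month` factor expands into eleven dummy coefficients, with April, first in alphabetical order, as the baseline.

```r
# Simulated data: ten years of made-up monthly observations.
set.seed(42)
n <- 120
d <- data.frame(
  date     = seq_len(n),                              # assumed: month counter
  month    = factor(rep(month.name, length.out = n)), # levels sort alphabetically
  unemp    = runif(n, 4, 8),                          # unemployment rate, percent
  gasprice = runif(n, 100, 400)                       # gas price, cents
)
d$miles <- 100 - 2 * d$unemp - 0.08 * d$gasprice + 0.5 * d$date +
  rnorm(n, sd = 4)

# One coefficient each for unemp, gasprice and date, plus 11 month dummies.
fit <- lm(miles ~ unemp + gasprice + date + month, data = d)
names(coef(fit))
```

Counting the intercept, that gives 15 coefficients, and `monthApril` is absent because April is absorbed into the intercept.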

The next step was to find the data: From different sources, I obtained data on miles traveled, gas prices, and unemployment. All of these sources start around 1990, so that's the time frame we'll have to work with.

With that in hand, it was time for some analysis. Using R, I combined the different data sources and ran myself a regression:

lm(formula = miles ~ unemp + gasprice + date + month)
                coef.est coef.se
(Intercept)     98.52     3.71
unemp           -2.09     0.34
gasprice        -0.08     0.01
date             0.01     0.00
monthAugust     17.90     1.40
monthDecember   -8.82     1.40
monthFebruary  -30.26     1.42
monthJanuary   -22.03     1.40
monthJuly       17.87     1.42
monthJune       11.34     1.42
monthMarch       0.42     1.42
monthMay        12.56     1.42
monthNovember  -10.00     1.40
monthOctober     5.85     1.40
monthSeptember  -2.55     1.40
---
n = 222, k = 15
residual sd = 4.25, R-Squared = 0.98

That R-Squared of 0.98 means that about 98% of the actual variation in miles traveled is explained by the variables in this model. So it's a pretty comprehensive picture of the things that predict how much Americans will drive. A one point increase in the unemployment rate, in this model, predicts a 2.09 billion mile decrease in miles driven. And gas prices are in cents, so a one-cent increase in the price of gas will, all things being equal, translate into an 80 million mile decrease in miles driven.
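
To make the units concrete, here is the arithmetic in R. The two coefficients are copied from the output above; the scenario (unemployment up one point, gas up 50 cents) is a hypothetical chosen purely for illustration.

```r
# Coefficients from the regression output above (miles are in billions).
b_unemp <- -2.09   # per percentage point of unemployment
b_gas   <- -0.08   # per one-cent change in the gas price

# The gas coefficient expressed in millions of miles per cent.
gas_effect_millions <- b_gas * 1000

# Hypothetical scenario: unemployment rises 1 point, gas rises 50 cents.
scenario_change <- b_unemp * 1 + b_gas * 50   # in billions of miles
```

Under this model the hypothetical scenario implies about 6.09 billion fewer miles in a month, holding everything else fixed.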

The next step was to check out Silver's assertion that recent data on miles driven is lower than the model would predict. Recall that Silver's model over-predicted January miles driven by 8 percent. My model predicts that in January, Americans should have driven 239.6 billion miles. The actual number was 222 billion miles. The prediction is--wait for it--7.9 percent more than the actual number! That's pretty amazing actually, and it indicates that my data and model must be pretty damn close to Silver's.
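
The 7.9 percent figure is the gap measured against the actual number. A quick check in R, using the two values quoted above, also shows what the gap looks like against the prediction, which is the other denominator one might reasonably pick:

```r
predicted_jan <- 239.6   # billions of miles, from the model
actual_jan    <- 222.0   # billions of miles, observed

gap <- predicted_jan - actual_jan

# Prediction relative to the actual value ("X percent more than actual").
over_actual <- round(100 * gap / actual_jan, 1)         # 7.9

# Actual relative to the prediction ("X percent less than predicted").
under_predicted <- round(100 * gap / predicted_jan, 1)  # 7.3
```

Either way the result is in the neighborhood of Silver's "about 8 percent".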

With the model in hand, however, we can do a bit better than this. Below is a chart showing how close the model was for every month in my dataset. It's similar to the graphic accompanying Silver's Esquire article, only not as ugly and confusing.

Comparison of a regression model of vehicle miles driven with the actual value

The graph shows the difference between the prediction and the actual number. When the point is above the zero line, it means people drove more than the model would predict. When it's below the line, they drove less.
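
The sign convention is easy to get backwards, so here it is spelled out in R. The first pair of numbers is the January figure from above; the second pair is made up to show the opposite case.

```r
actual    <- c(222.0, 231.0)   # first value from the post; second is made up
predicted <- c(239.6, 229.5)   # first value from the post; second is made up

# Plotted quantity: actual minus predicted.
error <- actual - predicted

# Above zero means more driving than the model expected, below means less.
direction <- ifelse(error > 0,
                    "drove more than predicted",
                    "drove less than predicted")
```

Given a vector of dates, something like `plot(date, error)` followed by `abline(h = 0)` would draw this kind of chart in base R.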

You can see here that there are multiple imperfections in the model. Mileage declined a little faster than predicted in the late 90's, and then rose faster than expected in the early 2000's. It's possible that this has something to do with a policy difference between the Bush and Clinton administrations, but I'm not enough of an expert to say.

What jumps out, though, are those last three points on the right, corresponding to this past November, December, and January. All of them are way off the prediction, and the error is bigger than for any other time period. This strongly suggests that something really has changed. What's not totally clear, though, is whether it's the car culture that's different, or whether it's this recession that's unlike the other two recessions in this data set (the early 90's and early 2000's).

The next logical step is to consider some additional variables. Some commenters at Nate's site pointed out that you might want to factor in changes in wealth--as opposed to changes in income, which are at least partly captured by the unemployment variable. Directly measuring wealth is a little tricky, but we can easily measure two things that are proxies for wealth, or people's perceptions of wealth: the stock market and the housing market. So I went google-hunting again and found two more variables: the monthly closing of the Dow, and the government's housing price index. Put those into the regression, and away we go:

lm(formula = miles ~ unemp + gasprice + date + stocks + housing + month)
               coef.est coef.se
(Intercept)    117.87     4.13
unemp           -1.64     0.48
gasprice        -0.11     0.01
date             0.01     0.00
stocks           1.01     0.30
housing          0.24     0.03
monthAugust     18.40     1.20
monthDecember   -8.88     1.21
monthFebruary  -30.58     1.21
monthJanuary   -22.12     1.19
monthJuly       18.28     1.20
monthJune       11.74     1.20
monthMarch       0.30     1.20
monthMay        12.77     1.20
monthNovember  -10.02     1.21
monthOctober     6.42     1.21
monthSeptember  -1.92     1.21
---
n = 217, k = 17
residual sd = 3.60, R-Squared = 0.98

R-squared looks the same, but the residual standard deviation is lower, which indicates that this model predicts more of the variation in the data than the last one. And the new variables both have pretty big and statistically significant effects. The stock market close is scaled in thousands, so the coefficient indicates that for every 1000 point increase in the Dow, we drive 1 billion more miles. The housing price index defines 1991 prices as 100, and went into the 220's during the bubble. Every one point increase in that index predicts a 240 million mile increase in driving.
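
A toy example in R shows why the residual standard deviation is worth checking alongside R-squared when comparing models. Everything here is simulated; the variable names mirror the post, but the numbers are invented.

```r
# Simulated data in which stocks genuinely affect miles.
set.seed(7)
n <- 200
unemp  <- runif(n, 4, 9)
stocks <- runif(n, 3, 13)   # Dow in thousands, as in the post
miles  <- 200 - 2 * unemp + 1 * stocks + rnorm(n, sd = 3)

small <- lm(miles ~ unemp)            # leaves out the wealth proxy
large <- lm(miles ~ unemp + stocks)   # includes it

# Residual standard deviation: typical size of a prediction miss.
sd_small <- summary(small)$sigma
sd_large <- summary(large)$sigma

# R-squared for the same two fits.
r2_small <- summary(small)$r.squared
r2_large <- summary(large)$r.squared
```

When R-squared is already close to 1, as in the real models, rounding it to two decimals can hide an improvement that the residual standard deviation still shows.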

Here's another version of the graph above, for our new model:

Predicted and actual miles, from a model with stock and housing prices

The same patterns are still present, but the divergence between the predictions and the actual numbers is smaller now. (Incidentally, I have no idea what happened in January of 1995. Did everyone go on a road trip without telling me?) It still looks like there's been some qualitative change in US driving habits recently, but the case is less clear cut. In particular, the late 90's now looks like another outstanding mystery. Mileage declined by more than the model expected then, but why? At the moment I have no particular hypothesis about that.

My final model tests something else that appears in Nate's article:

There is strong statistical evidence, in fact, that Americans respond rather slowly to changes in fuel prices. The cost of gas twelve months ago, for example, has historically been a much better predictor of driving behavior than the cost of gas today. In the energy crisis of the early 1980s, for instance, the price of gas peaked in March 1981, but driving did not bottom out until a year later.

OK, so let's try using the price of gas 12 months ago as a predictor along with current prices. This will force us to throw away a bit of data, but we can still fit a model on most of the data points:
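
Building the lagged variable is a one-liner in base R. The sketch below uses two years of made-up prices; the column names are assumptions chosen to match the coefficient labels in the output. How many rows are lost in practice depends on how far back each data series goes.

```r
# Two years of made-up monthly gas prices, in cents.
d <- data.frame(
  t        = 1:24,
  gasprice = seq(150, 265, by = 5)
)

# Shift the series down 12 rows so each row carries the price from a year earlier.
d$gasprice12 <- c(rep(NA, 12), head(d$gasprice, -12))

# lm() drops rows with missing predictors by default; na.omit() shows what is left.
usable <- na.omit(d)
nrow(usable)
```

In this toy case the first twelve months have no year-ago price, so only the second year survives.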

lm(formula = miles ~ unemp + gasprice + gasprice12 + date + stocks +
    housing + month, data = data)
               coef.est coef.se
(Intercept)    112.28     3.82
unemp           -0.93     0.42
gasprice        -0.07     0.01
gasprice12      -0.08     0.01
date             0.01     0.00
stocks           0.93     0.26
housing          0.25     0.02
monthAugust     18.19     1.04
monthDecember   -8.99     1.05
monthFebruary  -31.26     1.06
monthJanuary   -22.20     1.05
monthJuly       18.17     1.05
monthJune       11.58     1.05
monthMarch       0.10     1.06
monthMay        12.88     1.05
monthNovember  -10.06     1.04
monthOctober     6.29     1.04
monthSeptember  -2.08     1.04
---
n = 210, k = 18
residual sd = 3.07, R-Squared = 0.99

It looks like current gas prices and last year's gas prices are about equivalent in their effect on mileage. Now let's look at the graph of prediction error again:

Miles driven, predicted and actual, third model

Lo and behold, the apparently anomalous findings from the last few months have disappeared. This isn't the last word, of course, nor is it the perfect model. But it no longer appears that US driving behavior is so unusual, when you account for all the relevant economic contextual factors.

Anyhow, that's enough playing around in the data for me for the time being. In the end, this whole exercise helped me understand what I like best about Nate Silver's work. He's inventing a new media niche, call it "statistical journalist". He uses publicly available data to produce quick, topical analysis that illuminates the issues of the day in a way that neither anecdotes nor naive recitations of descriptive statistics can. He may play fast and loose at times, but his methods are transparent enough that people like me can still check up on him. I certainly hope that this kind of writing becomes an established sub-specialty with a wider base of practitioners than just Silver himself.

Graphs > Tables, again

March 16th, 2009  |  Published in Data, Statistical Graphics

Over at the Monkey Cage, Lee Sigelman notes a new study from the CDC that tries to figure out how many people and households in each state have no land line and rely entirely on cell phones. Being a good student of Andrew Gelman, my first thought upon clicking the link was: "these tables are horrible, they should be graphs!" My second thought was, "Gelman will probably come along and produce graphs of the data himself". So before that happens, I thought I'd take a stab at summarizing the paper's first couple of tables:

Cell phone only data

The intervals aren't classical 95% intervals--they're some kind of fancy estimation from the CDC that you'll have to click the link to find out about. The hollow points/dashed lines are the "modeled" estimates, and the black points/solid lines are the "direct estimates". The points are in order according to the modeled estimates.

The nice thing about displaying this graphically is that you can see how much uncertainty there is on some of these estimates, so you get a better idea of what this graph does and does not tell you. For example, Washington DC is estimated to have the highest percentage of adults in cell-only households, but the confidence intervals reveal that this doesn't really mean anything--the most you can say is that DC is on the high end of cell-only prevalence.
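
The same check can be done numerically. The R sketch below uses invented estimates for four anonymous states (none of these numbers come from the CDC report) and asks which states have intervals that overlap the interval of the top-ranked one.

```r
# Invented point estimates and interval bounds, in percent (NOT CDC data).
est <- data.frame(
  state = c("A", "B", "C", "D"),
  point = c(25.0, 21.0, 17.0, 9.0),
  lower = c(17.0, 18.5, 14.0, 7.0),
  upper = c(33.0, 23.5, 20.0, 11.0),
  stringsAsFactors = FALSE
)

# Rank states by point estimate, highest first.
est <- est[order(est$point, decreasing = TRUE), ]
top <- est[1, ]

# A state can't be cleanly separated from the leader if its upper bound
# reaches the leader's lower bound.
overlaps_top <- est$upper >= top$lower
not_separable <- est$state[overlaps_top]
```

Here the top state has the widest interval, so two of the three states below it can't be distinguished from it, which is the situation described above for DC.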

Richard Rorty and the Giant Pool of Status

February 16th, 2009  |  Published in Social Science

By way of OrgTheory, I see that Gideon Lewis-Kraus has a nice little essay on Neil Gross's recent book on Richard Rorty. The piece strikes a number of resonant notes for me: on the terminally wack state of academic sociology, the status hierarchy of the university, and the relationship between intellectuals and public life. But one odd thought I had when reading the piece departs from the following passage:

Bourdieu suggested—often impolitely—that the generative basis for a career in thought was to be found in the lusty drive for the kind of symbolic and cultural "capital," his terms of greatest currency, that would help the thinker, and her field, achieve a higher status. In other words, the academy functions largely as an apparatus for refining and transmitting the cultural codes that serve the perpetuation of privilege. Professors, as "the dominated fraction of the dominant class," are the sentries of the class structure.

Gross's book is an attempt to argue that this account does not entirely apply to Rorty. Or, more precisely, that it does not apply to Rorty's later career, after he had gained tenure and a place of prominence within academic philosophy. Instead, Gross claims, it was an inner devotion to an intellectual self-concept as a "leftist American patriot", rather than a bid for status, which drove Rorty's evolution into an odd sort of postmodern pragmatist.

Lewis-Kraus has a number of insightful things to say about the uses and limitations of this account. But as I considered the matter of Rorty's cultural capital, I was put in mind of something about, well, regular old Capital, the kind that the boys at the hedge funds have been busy vaporizing of late. Last year, the NPR show "This American Life" did a great story about the present economic crisis, called "The Giant Pool of Money". The title refers to the roughly $70 trillion of accumulated capital that, in the early part of this decade, went looking for profitable investment opportunities. The trouble was, there just weren't enough low-risk high-reward opportunities--neither investing in the production of stuff nor in U.S. treasury bonds was going to cut it. So instead, this money started flowing into the mortgage market, which seemed like a low-risk, high-reward investment, until it didn't.

The important thing to notice is that once you have a really, really huge pile of money, it gets more and more difficult to find profitable ways of re-investing that money. This has been pointed out recently by various commentators explaining the root causes of the crisis, and it is a rediscovery of something originally pointed out by Marx.

Anyway, it occurred to me that Rorty's intellectual makeover could be thought of in a similar way. By the early 1970's, he had accumulated a large amount of cultural capital by basically playing the game of analytic philosophy according to the rules accepted by its leading figures. But once he had risen to the top of that group, there were bound to be limited returns to a strategy of reinvesting cultural capital into the austere discipline of analytic philosophy, what Lewis-Kraus calls "the If-P-then-Q school of compelling reasons." The most he could have hoped for was to be remembered by philosophy professors and grad students, and not by much of anybody else.

The alternative was to plough his cultural capital into a higher-risk project, but one with potentially greater returns. Namely, to stake his reputation on an attempt to break out of the confines of the philosophy department, to redefine both the place of philosophy and the vocation of the intellectual. In staking out such an iconoclastic path, there is always the danger that one will be doomed to obscurity--recognized by neither the profession you have spurned nor the public you court. But if the gambit pays off, you become precisely what Rorty became: someone read across disciplines and even outside of academia, the sort of person whom sociologists write intellectual biographies of.

Which is not to say that this is the only explanation for Rorty's career, or even the most important one. Lewis-Kraus's own observations about the dead-end trajectory of '60's philosophy and present-day sociology are perhaps more to the point. But living as we do in a society which accumulates fame and status in a small number of hands, it's worth speculating about the consequences of "overaccumulating" that status.

And as is the case with money capital, the over-accumulation of cultural capital can have beneficial as well as deleterious results. Just as the tech bubble of the late 1990's led to an overinvestment in broadband capacity that created the basis for the future growth of the Internet, so does the overaccumulation of status among a few star academics allow some of them to do truly transformative and pathbreaking work, as Rorty did.

Pessimism of the Intellect

October 30th, 2008  |  Published in Politics, Statistical Graphics

My boss is a prominent political scientist and an Obama supporter. This afternoon, he was ribbing me for being a "pox on both your houses" ultra-leftist who only grudgingly acknowledges that electing Obama would be good for the left.

After our meeting had ended, I came up with a perfect encapsulation of my feelings about Obama, which has the added benefit of being an extremely nerdy joke. My point estimate is that it does matter whether Obama wins. But my confidence interval for how much it matters includes zero. In the spirit of Jessica Hagy, I present the argument in graph form:

How I feel about Obama

Last of the TV Presidents

June 3rd, 2008  |  Published in Politics, Social Science

Reflecting on Bill Clinton's ongoing meltdown and the tawdry Vanity Fair profile, Josh Marshall writes:

Bill is a man out of his time, out of his element, which is something painful to watch and must be a unique agony for him to experience.

Bill Clinton was on so many levels the master of the politics of the 1980s and 1990s, the magic with words and connection with people, intuitively sizing up the tempo and undercurrents of the political moment. Hate him or love him, I think anybody with a feel for politics knew this. And I loved him. . . . But again and again through this cycle, in little ways and big, he's shown he's not quite in sync with this political era, doesn't quite grasp the new mechanics -- both the ideological texture and the nuts and bolts of the networked news cycle.

Thinking about this, it occurred to me that Clinton is really the last of the television presidents.  That is, he is the last President whose relationship to Americans was primarily mediated by television. The first, of course, was Kennedy, who was famously able to best Nixon on TV but not on the radio. Reagan and Clinton were the greatest of the television presidents, in the sense that they best understood how to manipulate the medium to their advantage.

I'm not sure that other eras in politics can be so adequately characterized by their dominant media. Were Roosevelt, Truman, and Eisenhower the "radio presidents"? Were their predecessors "newspaper presidents"? Still, at least for the late twentieth century, the medium was clearly an important determinant of the kind of politicians who rose to prominence.

Bush was elected as a television president, with many of the same skills as Clinton or Reagan (though to a lesser degree). Yet his downfall came, in part, because of the lack of information control in the post-TV era.  The disastrous trajectory of his regime stems, in some measure, from the shift of our media ecology toward the Internet. It was in that context that his lies, malapropisms and general buffoonery could be broadcast and passed around as blog posts and YouTube clips, without the filter of "legitimate" news organizations.

Meanwhile, if Barack Obama is elected in November, he will certainly be the first Internet president. Which raises the disheartening possibility that future historians of politics will be forced to watch this.

You are (voting for) what you eat

May 11th, 2008  |  Published in Politics, Social Science

Via Andrew Gelman, the New York Times explains what your culinary choices say about your political predilections.

I immediately wondered what my own tastes reveal about my deepest political desires. To clarify the situation, I summarized the first few paragraphs of the Times article in a table. Here are the food choices that are supposed to correspond to each candidate:

              Clinton       Obama        McCain
Fat           butter        olive oil    pizza
Beverage      white wine    latte        bourbon
Sweet         fig newton    granola      pizza

The article only gives two food choices for McCain, so I had to use pizza twice. (Commercial pizza, like all commercial food, is loaded with corn syrup after all.)

If I had to pick, I'd go with the butter over olive oil (though I love both). Beverages, bourbon wins by a mile. And I'd take the fig newton, I guess, although I don't really understand what's so Clintonian about it. Anyway, I guess I'm supposed to vote for Clinton or McCain now. Perhaps this explains my semi-irrational distaste for the Obama campaign, though.

But wait, there's more:

For example, Dr Pepper is a Republican soda. Pepsi-Cola and Sprite are Democratic. So are most clear liquors, like gin and vodka, along with white wine and Evian water. Republicans skew toward brown liquors like bourbon or scotch, red wine and Fiji water.

As Gelman asks, what about Mr. Pibb? Also, red wine and Fiji water are Republican? Seriously?

Dr. Pepper is my favorite soda, though. So maybe I should rethink my politics. Also, for a long time I believed that Dr. Pepper contained prune juice, and that the "Dr." originally advertised the beverage's laxative powers. But apparently that's just an urban legend.

The theory of theory

May 9th, 2008  |  Published in Social Science, Sociology

Teppo Felin has a post over at OrgTheory that quotes Homans' advice on theory-building. Thinking about where I agree or disagree with these strictures helped me see some of the ways I differ from much of mainstream social science. To take his points in order:

Look first at the obvious, the familiar, the common. In a science that has not established its foundations, these are the things that best repay study.

That one I agree with wholeheartedly. I guess it's something everyone from Henri Lefebvre to the Freakonomics guys would concur on. Hannah Arendt wouldn't like it, though.


State the obvious in its full generality. Science is an economy of thought only if its hypotheses sum up in a simple form a large number of facts.

This I'm much more ambivalent about. Often, attempts to theorize at maximum generality lead to theories that are false or vacuous. Just as important as generality is understanding the context in which a theory does or does not apply.


Talk about one thing at a time. That is, in choosing your words (or, more pedantically, concepts) see that they refer not to several classes of fact at the same time but to one and only one. Corollary: Once you have chosen your words, always use the same words when referring to the same things.

On the face of it, this seems like it should be uncontroversial. But I think it reflects a naive belief that scientific and literary language can easily be separated. I often find that when I'm writing up a sociological argument, I want to use different words and different constructions for the same concept, in order to make the tone seem less clunky and flat. And I think this is more than a matter of stylistics. Freshman composition to the contrary, language is not a window onto your thoughts. It is a social fact, and it is full of ambiguities and misunderstandings. In order to really get a new idea across, it is often necessary to restate it and rephrase it in many different ways, circling around your concept in order to triangulate your position in a way that is intelligible to others. If you just use one word, referring to one thing, you are at the mercy of whatever connotations and resonances that word will have for your audience. And that leaves you open to all kinds of misinterpretation.


Cut down as far as you dare the number of things you are talking about. “As few as you may; as many as you must,” is the rule governing the number of classes of fact you take into account.

This one is the flip side of the maximum-generality rule, and I object to it for similar reasons. It's implicitly anti-dialectical, since it implies that the way to understand social phenomena is to break them down into little pieces and separate them from their context, rather than fitting them into a totality.


Once you have started to talk, do not stop until you are finished. That is, describe systematically the relationships between the facts designated by your words.

That's a good one, and it's advice I should be better at following. When I have a good idea, I sometimes have a hard time cashing it out before I get sick of it and abandon it.


Recognize that your analysis must be abstract, because it deals with only a few elements of the concrete situation. Admit the dangers of abstraction, especially when action is required, but do not be afraid of abstraction.

That's a good one too, but it all depends on what you mean by abstraction. The commodity form is an abstraction I really like. The concept of utility, not so much. For Homans, of course, it would be just the opposite.