Why election models did not predict Trump’s victory

I wrote a slightly different version of this post on Facebook the day before the election, in order to argue that it was irrational to buy the claims made by some pundits, such as Sam Wang from the Princeton Election Consortium, that Clinton had a 99% chance of winning. I decided to post a revised version here because, even though the election is over, I think it still explains a lot about how polls, and the models that use them to predict the result of elections, actually work. The issue requires that we talk a bit about statistics, but don’t worry if you don’t like math: I will try to explain things in layman’s terms, and I hope you will learn a few things that people often get wrong. For instance, you will know what a poll’s margin of error actually is, and understand why 99% of sentences about polls that contain the expression “margin of error” are false.

I will only talk about election models that rely heavily on polls, because I think only that kind of model is reliable. There is another kind of election model, which relies on what people call the fundamentals (such as the unemployment rate, the rate of economic growth, etc.), but I don’t think those are reliable and I plan to write a short post explaining why. (Of course, you can design a model that relies on both polls and fundamentals, but I will ignore that possibility to keep the discussion focused and simple.) To be more specific, I will describe Wang’s basic model, because he makes everything public, so anyone can look under the hood and see how it works, but also because if I created a model of my own I would basically do the same thing. This model calculates a probability distribution over the possible outcomes in the electoral college, in terms of the number of electoral votes each candidate receives. As we shall see, however, one needs to make a number of non-trivial assumptions to compute that probability distribution, and that’s where Wang and a lot of other people made important mistakes.

The procedure used by Wang is really simple and, in my opinion, is basically what anyone sensible would do. The first step is to look at the poll average in every state and use that to calculate a probability that Clinton is going to win that state. Actually, since the probabilities that Clinton is going to win individual states are correlated, you can’t really compute those probabilities for each state separately, but I will come back to this shortly, so let’s ignore that for the moment. (Wang also uses the poll median instead of the poll average to get around the problem of outliers, which may or may not be a good idea, but I’m going to ignore that.) Once you have a probability that Clinton is going to win each state, you can compute a probability for every possible combination of state-by-state outcomes. Then, in order to get a probability that Clinton is going to win the election, you just have to sum the probabilities of every combination of state-by-state outcomes that would result in at least 270 votes for her in the electoral college. This, in a nutshell, is what Wang’s model does. (It doesn’t do exactly that, because there are 2^51 possible combinations of state-by-state outcomes, which is so enormous a number that it would take years to perform the computation. But what it does is equivalent, except that it’s much faster.) If the probability distribution calculated by that model were accurate, it would be straightforward to infer the probability that Clinton would win, but as we shall see, things are more complicated.
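To give a concrete sense of how the fast equivalent computation works, here is a minimal sketch (my own illustration, not Wang’s actual code): instead of enumerating all 2^51 combinations, you build up the distribution of electoral votes one state at a time, a standard dynamic-programming trick. The win probabilities and electoral-vote counts below are made up for the example.

```python
import numpy as np

# Hypothetical inputs: p_win[i] is the probability that Clinton wins state i,
# and ev[i] is that state's number of electoral votes (toy numbers, 4 "states").
p_win = np.array([0.95, 0.60, 0.55, 0.30])
ev = np.array([20, 16, 10, 29])

total_ev = int(ev.sum())
dist = np.zeros(total_ev + 1)
dist[0] = 1.0  # before counting any state, she has 0 electoral votes

for p, v in zip(p_win, ev):
    new = dist * (1 - p)                     # she loses the state: her EV total is unchanged
    new[v:] += dist[: total_ev + 1 - v] * p  # she wins it: shift the distribution up by v
    dist = new

# dist[k] is now the probability that Clinton gets exactly k electoral votes.
# With all 51 contests (538 EVs), her win probability would be dist[270:].sum();
# for this toy example, we report the probability of a majority of its 75 EVs.
print(dist[total_ev // 2 + 1 :].sum())
```

With all 51 contests, this takes a few tens of thousands of multiplications instead of 2^51 enumerations, which is why the model runs in seconds.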

Now, if you had done that and used the resulting probability distribution to predict the outcome of the election, given what the polls in each state said just before the election, you would have predicted that Clinton was overwhelmingly likely to win, and indeed that’s what most people said at the time. However, as I will explain shortly, you can’t directly infer that from the probability distribution in question, because, as I already noted, a lot of non-trivial assumptions went into the calculation. In particular, in order to compute that probability distribution, one has to make a lot of assumptions about non-sampling polling error, and it’s at this stage that Wang made several mistakes which proved catastrophic. That is why, in the weeks before the election, he kept going around saying that “it [was] totally over” and that he would “eat a bug” if Trump got more than 240 electoral votes. Well, we know how that turned out, don’t we…

If the probabilities that Wang’s model computes for each state were right, you could use the resulting probability distribution of outcomes in the electoral college to straightforwardly derive the probability that Clinton was going to win, which is just the probability that she gets at least 270 electoral votes. But you need to assume that the probabilities for each individual state are correct and, as it turned out, they were not. In order to understand why, we need to look at how polls were used to calculate those probabilities. The problem is that, while it’s relatively straightforward to account for one type of polling error when calculating those probabilities, it’s a lot trickier to account for another type. The type of polling error that is relatively straightforward to account for is sampling error, so let me first explain quickly what that is. This will clear up a lot of misunderstandings that most people have about the “margin of error” that journalists talk about when they report the results of polls.

Suppose that you have a big urn full of balls, some of which are blue and the others red. Say it contains several tens of thousands of balls, so it wouldn’t be practical to look at all of them. You want to know what proportion of the balls in that urn are blue and what proportion are red. So you randomly pick a sample of balls from the urn and look at the balls in that sample to see how many are blue and how many are red. It’s very important that you pick them truly randomly, which means that every ball in the urn must have the same probability of being picked. (Actually, this assumption can be relaxed, but let’s not worry about that.) That’s basically what a poll is, except that instead of looking at balls, pollsters are asking people who they’re going to vote for.

So you randomly pick 100 balls from the urn and, when you look at them, you see that 55 of them are blue and 45 are red. Now, you can’t infer that exactly 55% of the balls in the urn are blue and 45% are red, because even if you picked the balls randomly, the proportions in the sample are probably somewhat different from the proportions in the urn, just because of chance. But you can use probability theory to calculate a confidence interval, typically a 95% confidence interval, which is what gives rise to the margin of error everyone is talking about all the time.

A confidence interval for the proportion of balls in the urn that are blue is an interval around the proportion in the sample such that, if you “poll” the urn an arbitrarily large number of times and use the same method to calculate the interval every time, the actual proportion of blue balls in the urn will be inside the interval 95% of the time. For instance, suppose that in the sample I was talking about earlier (where 55% of the balls are blue and 45% are red), the margin of error is +/- 5%. It does not mean that the actual proportion of blue balls in the urn has a 95% probability of being in the interval [50%, 60%]. What it means is that, if you “poll” the urn by picking 100 balls an arbitrarily large number of times and calculate a confidence interval every time, the actual proportion of blue balls in the urn will be inside the interval you calculate 95% of the time. But the interval is going to be different, possibly by quite a lot, every time you “poll” the urn.
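If that frequentist reading seems counterintuitive, a quick simulation makes it concrete. Here is a minimal sketch (with a made-up urn composition) that repeatedly “polls” an urn whose true proportion of blue balls we know, computes the usual 95% confidence interval each time, and checks how often the interval contains the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p, n, trials = 0.53, 100, 100_000  # hypothetical urn: 53% blue balls, 100-ball "polls"

covered = 0
for _ in range(trials):
    sample_p = rng.binomial(n, true_p) / n               # share of blue balls in one "poll"
    moe = 1.96 * np.sqrt(sample_p * (1 - sample_p) / n)  # the usual 95% margin of error
    if sample_p - moe <= true_p <= sample_p + moe:
        covered += 1

print(covered / trials)  # roughly 0.95: the truth falls inside about 95% of the intervals
```

Note that every simulated interval is different; what is fixed is the long-run coverage rate, which is exactly the point made above.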

So, if you do a poll with a margin of error of 3% in Michigan and 52% of the respondents tell you that they’re going to vote for Clinton while 48% tell you that they’re going to vote for Trump, it doesn’t mean that Clinton has a 95% probability of winning that state. It just means that, if you did a poll using the same methodology an arbitrarily large number of times, the actual proportion of people who are going to vote for Clinton in Michigan would be inside the confidence interval you calculate 95% of the time, but that interval would be different every time. Moreover, it also doesn’t mean, as journalists say all the time, that Clinton’s margin in that poll is “outside the margin of error”. It is not. According to that poll, the lower bound of the confidence interval for the proportion of people who are going to vote for Clinton in Michigan is 49%, while the upper bound of the confidence interval for the proportion of people who are going to vote for Trump is 51%, so the two intervals overlap. Of course, even if the sample is truly random, the actual proportion of people who are going to vote for either candidate could be much higher or much lower. This is just one poll, and chance could have messed things up; in fact, it almost certainly did, at least to some extent.
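To spell out the arithmetic behind that last claim, using the hypothetical Michigan numbers above:

```python
clinton, trump, moe = 52, 48, 3  # the hypothetical Michigan poll, in percent

print(clinton - moe, clinton + moe)  # Clinton's 95% CI: 49 to 55
print(trump - moe, trump + moe)      # Trump's 95% CI: 45 to 51
# The two intervals overlap (49 < 51), so Clinton's 4-point lead is not
# "outside the margin of error", contrary to how journalists would report it.
```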

Okay, now that I have explained what sampling error is, let’s go back to how one could compute a probability that Clinton is going to win each state if sampling error were the only type of polling error. Suppose that, in every state, the various methodologies used by different pollsters result in the same estimator of the margin between Trump and Clinton, with the same distribution and no bias. To simplify a bit, that’s what would happen if every pollster used the same methodology and, when you use that methodology, the sample is truly random and there is no measurement error, i.e., every person who is going to vote in the state has the same probability of ending up in the sample and, if they tell you that they’re going to vote for X, they really are going to vote for X. (This definition of measurement error is somewhat unusual, for it implies that if a respondent honestly replied that he was going to vote for Clinton but changed his mind between the time he was polled and the election, this counts as measurement error. It’s probably not how most people would define measurement error, but it makes the discussion simpler, so I don’t think it’s a problem.) If that is true, you can use a theorem called the “central limit theorem” (look it up if you’ve never heard of it, it’s a truly mind-blowing result) and a theorem proven by a guy called William Gosset (who published the proof under the pseudonym “Student” because his employer, the Guinness brewery, didn’t let its employees publish) to show that, to a good approximation (provided there are enough polls), the poll average obeys a probability distribution called “Student’s t-distribution”, after the dude in question. Finally, using that distribution, you can compute the probability that Clinton is going to win based on whatever the poll average happens to be.
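To make the mechanics concrete, here is a minimal sketch of that per-state calculation under the idealized assumptions just listed. The five poll margins are invented for illustration; the procedure is just to compute the mean margin and its standard error, then ask the t-distribution how likely it is that the true margin is positive.

```python
import numpy as np
from scipy import stats

# Hypothetical Clinton margins (in points) from five polls of the same state.
margins = np.array([1.0, 2.5, 1.8, 3.2, 2.0])

n = len(margins)
mean = margins.mean()
sem = margins.std(ddof=1) / np.sqrt(n)  # standard error of the mean margin

# Under the assumptions above, (mean - true_margin) / sem follows a Student's
# t-distribution with n - 1 degrees of freedom, so P(true margin > 0), i.e.
# the probability that Clinton wins the state, is:
p_win = stats.t.cdf(mean / sem, df=n - 1)
print(f"P(Clinton wins the state) = {p_win:.3f}")
```

Note how quickly this number approaches 1: a lead of about two points across five consistent polls already yields a win probability above 99%, which foreshadows the problem discussed below.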

So far, so good, but this is where things start to get more complicated. The problem is that the assumptions on which the calculation I just explained rests are completely unrealistic. Random sampling error is not the only kind of polling error. We know that the methodologies used by pollsters do not result in an unbiased estimator of the margin between Clinton and Trump, because their samples are not truly random and because there is measurement error. Polling is a really tricky business, and pollsters have to make all sorts of assumptions to get a result from their data. In particular, they have to make assumptions about what kind of people are going to turn out to vote, in order to construct a sample or weight it appropriately, which also means that they have to make assumptions about what characteristics are relevant to predicting turnout. That’s just one reason why, if you give the same data to different pollsters, they will probably all end up with different results. (See this article in the New York Times about this.) Given how many non-trivial assumptions pollsters have to make, a lot of things can go wrong. Indeed, if you look at the history of polling, things go wrong very often. For instance, in 2012, the polls were off by 3 points (in favor of Obama), which is more than Clinton’s lead in the national RCP average on the eve of the 2016 election, and which means the 2012 polls were actually further off the mark than the 2016 polls.

First, even if there were no measurement error, since the samples used by pollsters are not truly random, the probability you compute by using the central limit theorem and Gosset’s theorem about Student’s t-distribution would still not be accurate. There is no reason to think that the samples used by various pollsters are biased in such a way that, when you look at the average, the biases cancel out. It may very well be that, on the contrary, they reinforce each other. In particular, that’s exactly what you should expect if pollsters make similar mistakes when they make assumptions about who is going to show up to vote, which is likely. For instance, if most of them overestimated the rate at which black people were going to turn out to vote, it would have biased the polls in favor of Clinton. Since Obama was not running, it was clear that black turnout was going to decrease relative to 2012 (even though many people were under the illusion that blacks would turn out en masse to stop Trump); the only question was by how much. Since black people overwhelmingly vote Democrat, a small mistake in the assumptions pollsters make about black turnout could skew their results in a significant way. As it turned out, pollsters probably overestimated black turnout in constructing their samples, at least in some key states. Furthermore, since the mistakes that pollsters make when they construct their samples are probably similar in different states, the polling errors that result from those mistaken assumptions are presumably correlated across states to some extent, which should be taken into account when calculating the probabilities that a candidate is going to win each individual state. But there is no obvious way to figure out exactly what assumptions we should make about how they are correlated.

Moreover, we know that there is measurement error, which makes the whole enterprise of predicting who is going to win the election by using polls even trickier. For instance, even before the election, many people had raised the possibility of a so-called reverse Bradley effect. The idea is that some respondents who told pollsters they were not going to vote for Trump actually were going to vote for him but were ashamed to admit it. I read some very bad arguments against the existence of a reverse Bradley effect before the election, but this post is already long and I don’t have time to explain why those arguments were terrible, so I’ll just leave it at that. In a way, the existence of a reverse Bradley effect was trivial; the important question was how big it was going to be. In fact, even after the election, it’s still very difficult to answer that question. Furthermore, insofar as the sources of measurement error are similar in different states, non-sampling polling errors in various states are probably correlated. For instance, if people who intend to vote for Trump are loath to admit it to pollsters in Ohio, it’s probably also true of people who intend to vote for Trump in Pennsylvania. Similarly, if there is some bad news for Clinton right before the election, it will presumably make the polls taken before that event unreliable in every state, as people all over the country suddenly change their minds. Again, it’s very difficult to know exactly how those errors are correlated, yet it’s absolutely essential to make the right assumptions about that if we’re going to use polls to predict the outcome of the election.

What this means is that, if you don’t take into account the fact that pollsters use biased samples and the possibility of measurement error when you calculate the probability that Clinton is going to win each individual state, the number you get will not give you a good sense of how likely Clinton is to win, because the computation rests on assumptions that we know to be false. Even if you didn’t follow everything I explained above, you can convince yourself of that by considering the fact that, in pretty much any election, ignoring those facts would result in a prediction that one candidate or the other has a probability of winning of more than 95%. But nobody thinks that, in any election, we can have that kind of certainty, which shows that a probability computed in that way does not really tell you how likely Clinton was to win. Of course, it could have been that our intuition was just wrong and that in fact most elections are not as uncertain as most people think, but in this case we now know that our intuition was not misleading.

I think it will help to work through a concrete example to convince you that, by using that method to calculate the probability that Clinton was going to win each state, one would have overestimated how certain one could be of the outcome. Suppose that, in a state, we have 5 polls that give Clinton margins of 0.2, 0.9, 0.5, 0.3 and 0.8 points (average = 0.54). In that case, according to that method, the probability that Clinton was going to win that state would have been more than 99%! (You can get a very high probability of winning for Clinton even if the poll average of her margin is very small, provided that the variance is also very small.) Now, if you had seen those polls just before the election, I’m sure you would have thought that Clinton was more likely to win (as you should have), but I doubt you would have thought that the probability she was going to win was more than 99%. And, while our intuitions are often wrong, this particular intuition is not. It would only be rational to make that inference if you had good reasons to assume that the polls are not biased. However, not only do you have no reason to think that, but on the contrary you have every reason to think that they are!
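You can check that 99% figure with the same t-distribution sketch as before:

```python
import numpy as np
from scipy import stats

margins = np.array([0.2, 0.9, 0.5, 0.3, 0.8])  # the five hypothetical polls above
t_stat = margins.mean() / (margins.std(ddof=1) / np.sqrt(len(margins)))
print(stats.t.cdf(t_stat, df=len(margins) - 1))  # ≈ 0.99, despite a lead of half a point
```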

The problem is that, when you use that kind of method, it’s so easy for the computation to result in a very high probability of winning for one of the candidates that, no matter the election, it is probably going to predict that one of the candidates is going to win with a probability of more than 95%. But it would have been irrational to have a credence that Clinton was going to win equal to the probability calculated by that method. So, no matter what the probability that Clinton was going to win was exactly (assuming it’s even meaningful to talk of a precise probability that she was going to win), it was significantly less than 99%. How much less? Fuck if I know! Trying to correct the bias in the polls yourself is probably a terrible idea, because you’re just going to inject your own biases into the process, and we don’t have enough information about exactly what assumptions pollsters make. Indeed, while I have only mentioned examples of biases in favor of Trump above, you could also have come up with plenty of reasons to think that the polls were biased against Clinton before the election. This means that, for all we knew before the election, given what the polls were saying, she could also have won in a landslide. But she didn’t and, if so many pundits didn’t see that coming, it’s because they didn’t take seriously the possibility of systematic polling error and were extremely complacent.

What you should do is somehow build into the model the uncertainty due to the possibility that the polls might be wrong because of systematic polling error. That’s apparently what Nate Silver did, which is why his estimate of the probability that Clinton was going to win was lower than Wang’s. And he was right to do so, though I have no idea whether he did it in a sensible way, because his model isn’t public. (Silver has gotten a lot of shit because, even though Clinton lost, his model said that she had a 65% chance of winning. But I think this criticism is confused, because given the evidence on the eve of the election, Silver’s prediction strikes me as perfectly reasonable. Indeed, given the evidence, I think it would have been irrational to predict that Trump was going to win. It’s just that it was also irrational to think that Clinton had a 99% chance of winning. If you throw a die which you know to be fair and predict with very high confidence that it’s going to land on 6, you’re being irrational even if the die just happens to land on 6. Scott Alexander made a similar point on his blog right before the election.) To be clear, Wang also did that, but he clearly didn’t take the possibility of systematic error as seriously as he should have.
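To illustrate how much difference this makes, here is a toy simulation (my own sketch, not Silver’s actual model, which is private): add a single error term shared by every state on top of the independent state-level noise, and the headline probability drops sharply even though nothing else changes. All the margins, electoral-vote counts, and error sizes below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical polled Clinton margins (in points) and electoral votes for six
# imaginary "states" that together add up to the real total of 538 EVs.
margin = np.array([4.0, 2.5, 1.0, -1.5, 6.0, -4.0])
ev = np.array([100, 60, 50, 80, 120, 128])

def win_prob(shared_sd, state_sd=1.5, trials=200_000):
    """Monte Carlo: draw one error shared by all states plus independent
    state-level noise, then count how often Clinton reaches 270 EVs."""
    shared = rng.normal(0, shared_sd, size=trials)               # correlated national error
    noise = rng.normal(0, state_sd, size=(trials, margin.size))  # independent state noise
    outcomes = margin + shared[:, None] + noise
    clinton_ev = (ev * (outcomes > 0)).sum(axis=1)
    return (clinton_ev >= 270).mean()

print(win_prob(shared_sd=0.0))  # independent errors only: Wang-style near-certainty
print(win_prob(shared_sd=2.0))  # with a plausible correlated error: much less certain
```

How far the number drops depends entirely on the assumed size of the correlated error, which is exactly the problem: the headline probability is extremely sensitive to an assumption that the polls themselves cannot pin down.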

People had pointed out that possibility to Wang before the election, but he dismissed it by saying that he was doing enough to take it into account. It’s likely that he was somewhat blinded by his own political bias. It’s also likely that, had he not so confidently asserted that Trump had no chance of winning throughout the campaign, his website would have been a lot less popular. (To be clear, I’m not suggesting that Wang consciously tweaked his model to give the answer that his readers wanted to hear, but perhaps the fact that his readers wanted to hear it unconsciously made him less sensitive to criticism, as he also got a lot of support from people who found his prediction comforting. I know that a lot of my friends, who were freaking out about Trump, were using PEC as a way to reassure themselves.) In particular, he vastly underestimated the correlation between polling errors in different states, as he admitted after the election.

The bottom line is that, because so many non-trivial assumptions enter into the computation of a prediction based on polls, and because the probability computed by the model is extremely sensitive to exactly what assumptions you make, it’s irrational to have the kind of confidence that Clinton was going to win exhibited by people like Wang. (In fact, it arguably doesn’t even make sense to talk of the probability that Clinton was going to win, because there were several reasonable ways of tweaking the basic model to take into account the uncertainty due to systematic polling error, which would probably have yielded significantly different results, but there was no way of adjudicating between them.) The rational thing to say before the election was that Clinton was probably going to win, but you shouldn’t have bet your house on it, because there was also a real chance that she was going to lose and, as we know, that’s exactly what happened…

EDIT: Since I first published this post a month ago, I have made several changes to it, because the original version was written hastily and contained some misleading passages.

ANOTHER EDIT: If you found this post interesting, you should also check out this article by Andrew Gelman and Julia Azari. There is a back-and-forth between them and other people in the same issue of that journal, which is also interesting.

8 thoughts

  1. It’s too bad for all the super smart statisticians. It’s good for people like us though. You can study all the math you want, but understanding non-sampling error requires actually understanding demographics and politics on a non-formulaic level. The HuffPost model had a similar error (http://www.huffingtonpost.com/entry/huffpost-pollster-poll-averages-methodology_us_57d1a3b2e4b06a74c9f361cb). It’s fun to code up a Kalman Filter, and pretty tough. Unfortunately there is a tendency to think complex math promises nice results.

    Even with sampling error, though, the assumption is that the draws across polls/states don’t have correlated errors. Unfortunately, it seems that most polls are wrong in the same ways, since presidential voting is sensitive to systematic factors. In this case, what is the ‘true’ sample size? It depends on the model, but I bet that, if you reduced the sample of states to unobserved variables, something like 3 or 4 unobserved factors would explain all of it. And polls all have to try to estimate the demographic that will actually vote, and their errors are probably correlated as well.

    But so often with estimation, even if everyone knows they are ignoring something, as long as it’s not in the model they don’t feel the need to reflect it in the confidence level. So you get these super smart stats nerds saying 99%, but really…

    Guys like Silver and Gelman get this, and are willing to use more ad-hoc corrective methods (or informative priors).

    1. I think you’re right that many people who construct election models know a lot about statistics but not enough about politics. Statistics is definitely a useful tool to analyze politics, but it’s a dangerous illusion to think that it removes the need for a good understanding of politics, which many analysts lack.

  2. According to the emails leaked by Wikileaks, the pre-election polls presented in the media used a technique called oversampling to misrepresent the results.

    **Sources**:
    * https://wikileaks.org/podesta-emails/emailid/26551
    * https://wikileaks.org/podesta-emails//fileid/26551/7326
    * https://wikileaks.org/podesta-emails/emailid/15442

    **Relevant Quotes**:
    * “I also want to get your Atlas folks to recommend oversamples for our polling before we start in February.”
    * “so we can maximize what we get out of our media polling.”
    * [For Arizona] “Research, microtargeting & polling projects – Over-sample Hispanics… – Over-sample the Native American population”
    * [For Florida] “On Independents: Tampa and Orlando are better persuasion targets than north or south Florida (check your polls before concluding this). If there are budget questions or oversamples, make sure that Tampa and Orlando are included first.”
    * [For National] “General election benchmark, 800 sample, with potential over samples in key districts/regions – Benchmark polling in targeted races, with ethnic over samples as needed – Targeting tracking polls in key races, with ethnic over samples as needed”
    * “The plan includes a possible focus on women, might be something we want to do is over sample if we are worried about a certain group later in the summer.”

    **Interpretation**:
    * This is why you see the skewed polls show Clinton +12 when other more accurate ones show Trump +2. The high Clinton ones oversample Democrats by a HUGE margin to get desired results (sometimes 20-40% more Democrats sampled). Many are created by organizations that donate to Hillary, and some are even conducted by her own SuperPACs (https://theconservativetreehouse.com/2016/10/11/media-polling-fully-exposed-about-that-nbcwsj-clinton-11-point-poll/#more-123009)!
    * They do this to make Republican voters feel discouraged and not come out to vote if they think their candidate will lose.
    * Just look at this example in Arizona (https://theconservativetreehouse.com/2016/10/19/about-that-pro-clinton-arizona-polling-narrative/): Clinton +5, but Democrats were oversampled by 34% (58 out of 100 Democrats, 24 out of 100 Republicans)! Unfortunately, the media only reports on the final number, without reporting on the over-sampling.

    1. I have no doubt that campaigns sometimes cheat with polls by voluntarily using biased samples, and I’m sure Trump’s campaign did the same thing on occasion, but I doubt that polling error was only the result of that kind of manipulation. As I say in my post, polling is very difficult, so we should expect polling error even if everyone is honest.

    2. I’ve read different accounts of what “oversampling” means in this context.

      I think the most likely meaning is an innocent (i.e., non-manipulative) one.

      One can “oversample” a group of interest — because they are being targeted for persuasion, for example — without adding their raw numbers to the poll, but rather weighting them according to their presumed actual proportion in the population of voters (registered or likely). For example, one could sample double the number of blacks in a poll from their actual proportion in the population (which is about 12%) because they are relatively uncommon, but one wants more accurate information about their preferences. When it comes to putting together the overall estimate for all voters, one would simply multiply the raw poll numbers from blacks by 0.5.

      Really, there’s no good reason for doing it any other way. Some groups are of more concern than others, and if you’re paying for a poll, you should spend your money wisely.

    3. You’re completely misinterpreting what the Podesta emails are about.

      This was issue polling, not polling to determine the state of the race. It was focus group-ing, not election predict-ing. There are certain populations they care more about, generally those they think are easier to persuade (either persuade to switch to their candidate, or persuade to get out and vote for their candidate when they would otherwise stay home, or persuade to stay home when they’d otherwise get out and vote for the other guy). They want to oversample those people in their issue polling so they can craft their messages and media buys.

      They’re paying for this stuff and plan to use it to try to figure out how to run their campaign and to see how their messages are being received. They’re not trying to fool themselves.
      It’s right there in the attachment to Podesta’s first email. Take as an example what they wanted to do for Florida:
      ——–
      Research, microtargeting and polling projects:
      • Effective statewide polling in Florida requires large sample sizes (1200 or more target market breakdowns) and bilingual interviewing in Miami-Dade, Broward, Orange, Osceola, and Hillsborough counties. Pay attention to the composition of each market to evaluate messages; do not focus on the statewide number. **Too often, decisions are made to run ads with the highest testing number statewide, a number that is driven up by responses in the Miami and Palm Beach media markets. Basic rules of thumb for effective use of research are as follows:**
      • Always conduct bilingual interviewing and complete (not “weight up”) at least 3% of the total sample in Spanish.
      • Break the state into media markets and regions that mirror the key regional breakdowns and allow targeting in the big three markets of Tampa, Orlando, and Miami.
      • Conduct microtargeting to maximize support among white Democrats in north Florida or the Trending GOP, Local Democrat/State GOP counties. A majority of these white Democrats support the Democratic candidate; the goal is to move that support from 60 to 70% into the 80% range.
      • Do not use Spanish-language research from other states in Florida. Microtargeting to Hispanics should be bilingual, with a goal of accounting for country of origin when possible. Do not buy into the misconception that the Miami-Dade Latino vote is changing linearly toward support for Democrats; instead, measure that support with a targeting and scoring project that measures the differences among Hispanics between media markets.
      • **Statewide microtargeting should help identify Republicans open to persuasion.**
      • **Conduct frequent media market studies in key markets or regions to refine the buy in each market, rather than relying on a statewide poll where the sample may or may not effectively reflect the demographics of each market.**
      • A listed sample asking for the individual will yield an exceptionally old result and will not be reflective of demographics in the state. Consistently monitor the sample to ensure it is not too old, and that it has enough African American and Hispanic voters to reflect the state.
      • **On Independents: Tampa and Orlando are better persuasion targets than north or south Florida (check your polls before concluding this). If there are budget questions or oversamples, make sure that Tampa and Orlando are included first.**
      • Persuasion programs should be targeted regionally within the state. This might require a variety of research that goes beyond statewide polling to help determine the overall targets. For example:
      • Statewide benchmark and trend polls with a large enough sample (600–800) to see changes in the Tampa and Orlando markets.
      • A microtargeting project with oversamples in specific regions or with Independents.
      • Focus groups by region or demographic group (Hispanics particularly).
      • Polls focused solely on a region or a demographic group.

  3. Fantastically, fantastically stupid. If you continue with this blather, trust me, you doom only your obtunded selves. You “history mavens” might want to read up on that pesky 12th Amendment.
    Much as a lie repeated a thousand times might mimic the truth, it still remains false.
    http://www.primarymodel.com
    Rock Star. Because he is proven right.
    Helmut Norpoth.
    http://primarymodel.com/author/
