In a tight race, it is very difficult to predict a winner with confidence. The trouble in this election cycle is that most pollsters didn’t even predict that the race would be this tight. Regardless of the actual outcome of the election, one might say that the biggest losers in the 2016 and 2020 election cycles are the pollsters and pundits.
How could they be so far off twice in a row, and by such large margins? Not just for the presidential election, but for congressional races, too? I, along with many other analytics geeks, attributed the wrong prediction of the last presidential election to sampling errors. For one, when you sample sparsely populated areas, minor misrepresentation of key factors can lead to wrong answers. When voting patterns are divided along rural and suburban lines, for example, such sampling bias is amplified even further.
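To put a rough number on the sparse-sample problem, here is a minimal sketch of the textbook margin-of-error formula for a share estimated from a simple random sample. The sample sizes (50 and 1,000) and the 50% share are hypothetical values chosen for illustration, and real polls use more complex designs than simple random sampling.

```python
import math


def margin_of_error(share: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a share from a simple random sample."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)


# Hypothetical sample sizes: a thinly sampled rural area vs. a well-sampled suburb.
rural_moe = margin_of_error(0.5, 50)
suburban_moe = margin_of_error(0.5, 1000)

print(f"n=50:   +/- {rural_moe:.1%}")
print(f"n=1000: +/- {suburban_moe:.1%}")
```

Under these assumptions, the 50-person sample carries a margin of roughly 14 points, against roughly 3 points for the 1,000-person sample. A tight race can easily disappear inside the first number.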
To avoid the same mistake, I heard that some analysts tried to oversample segments such as “Non-college-graduate White Males” this time. Apparently, that wasn’t good enough, was it? Also, if sample sizes of certain segments were manipulated, how was it done, and by how much? How do analysts make impartial judgements when faced with such challenges? It would be difficult to find even two statisticians who completely agree with each other on the methodology.
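The “by how much” question matters because an oversampled segment has to be weighted back to its share of the electorate, and that share is itself an assumption. The sketch below shows one common approach, post-stratification weighting. Every figure in it is made up for illustration, and nothing here describes what any actual pollster did.

```python
# Hypothetical figures for illustration only; none come from a real poll.
segments = {
    # name: (assumed share of electorate, respondents sampled, support for candidate A)
    "oversampled_segment": (0.20, 400, 0.60),
    "everyone_else": (0.80, 600, 0.45),
}

total_respondents = sum(n for _, n, _ in segments.values())

# Raw average: treats the sample as if it mirrored the electorate.
unweighted = sum(n * support for _, n, support in segments.values()) / total_respondents

# Post-stratified average: each segment counts by its assumed population share.
weighted = sum(share * support for share, _, support in segments.values())

print(f"unweighted: {unweighted:.3f}")
print(f"weighted:   {weighted:.3f}")
```

With these invented numbers, the raw average puts candidate A at 51% while the weighted figure puts the same candidate at 48%. Change the assumed 20% share and the answer moves again, which is exactly where two statisticians can part ways.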
Then there are human factors. They say modeling is half-science, half-art, and I used to agree with that statement wholeheartedly. But looking at such wildly wrong predictions, I am beginning to think that the art part may be problematic, at least in some cases. Old-fashioned modeling involves heavy human intervention in variable selection and in choosing the final algorithm among test models. Statisticians, who are known to be quite opinionated, can argue about seemingly simple matters such as the “optimum number of predictors in a model” until the cows come home.
In reality, no amount of statistical effort can safely offset sampling biases or, worse, “wrong” data. The old expression “garbage in, garbage out” applies here. If a survey respondent does not answer a question honestly, that answer should be treated as a wrong piece of information. Erroneous data don’t stem only from data collection or processing errors. Some data just show up wrong.
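A small simulation can illustrate why more data does not rescue dishonest answers. It is only a sketch under stated assumptions: the 52% true support and the 10% misreporting rate are hypothetical, and `simulated_poll` is an illustrative helper, not a model of any real survey.

```python
import random


def simulated_poll(true_support: float, misreport_rate: float,
                   sample_size: int, seed: int = 0) -> float:
    """Share of respondents who *say* they support a candidate.

    Each true supporter hides that support with probability `misreport_rate`.
    """
    rng = random.Random(seed)
    stated = 0
    for _ in range(sample_size):
        is_supporter = rng.random() < true_support
        answers_honestly = rng.random() >= misreport_rate
        if is_supporter and answers_honestly:
            stated += 1
    return stated / sample_size


TRUE_SUPPORT = 0.52    # hypothetical
MISREPORT_RATE = 0.10  # hypothetical

for n in (500, 5_000, 200_000):
    print(n, round(simulated_poll(TRUE_SUPPORT, MISREPORT_RATE, n), 3))
```

Under these assumptions the stated share settles near 0.52 × 0.90 = 0.468, not 0.52. A larger sample only makes the poll more confident about the wrong number.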
The human factor goes beyond model development. When faced with a prediction, how would a decision-maker react to it? Let’s say a weather forecaster predicts a 60% chance of showers tomorrow morning. How would a user apply that information? Would he carry an umbrella all day? Would he cancel his golf outing? People use information the way they see fit, and that has nothing to do with the validity of the data or the modeling methodologies employed. Would it make a difference if the forecaster were 90% certain about the rain? Maybe. In any case, it is nearly impossible to ask ordinary users to set aside all emotions when making decisions.
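For contrast, this is roughly what an emotion-free user would do with the same forecast. The sketch uses a simple cost-loss rule: act when the expected loss from doing nothing exceeds the cost of acting. The function name `should_protect` and all cost figures are hypothetical and in arbitrary units.

```python
def should_protect(rain_probability: float, protection_cost: float,
                   loss_if_unprotected: float) -> bool:
    """Cost-loss rule: act when the expected loss of inaction exceeds the cost of acting."""
    return rain_probability * loss_if_unprotected > protection_cost


# Hypothetical costs in arbitrary units.
UMBRELLA_HASSLE, SOAKED_CLOTHES = 1.0, 5.0       # cheap precaution
CANCELLATION_COST, RUINED_OUTING = 7.0, 10.0     # expensive precaution

for probability in (0.60, 0.90):
    carry = should_protect(probability, UMBRELLA_HASSLE, SOAKED_CLOTHES)
    cancel = should_protect(probability, CANCELLATION_COST, RUINED_OUTING)
    print(f"{probability:.0%} chance: carry umbrella={carry}, cancel golf={cancel}")
```

With these numbers, the umbrella decision is the same at 60% and at 90%, while the golf decision flips between them. So the “maybe” above depends on the stakes as much as on the forecast, and real users rarely write their stakes down this neatly.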
Granted that we cannot expect to eliminate emotional factors on the user side, data scientists must find more impartial ways to build predictive models. They may not have full control over the ins and outs of the data flow, but they can transform available data and select predictive variables at will. And they have full control over the methodologies for variable selection and model development. In other words, there is room to be creative beyond what they are accustomed to or trained for.
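One way to take some of the “art” out of variable selection is to fix the selection rule before looking at the results and apply it identically to every candidate. The sketch below is a deliberately crude stand-in for that idea: it screens predictors by absolute correlation with the target on synthetic data. The variable names, the 0.3 threshold, and the data are all hypothetical, and univariate screening ignores interactions, so treat it as an illustration of a rule-based process rather than a recommended method.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def screen_predictors(candidates, target, threshold):
    """Keep predictors whose |correlation| with the target meets the threshold, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in candidates.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= threshold], scores


# Synthetic data: one real driver and two pure-noise candidates.
rng = random.Random(42)
n = 500
candidates = {
    "signal": [rng.gauss(0, 1) for _ in range(n)],
    "noise_a": [rng.gauss(0, 1) for _ in range(n)],
    "noise_b": [rng.gauss(0, 1) for _ in range(n)],
}
target = [2.0 * s + rng.gauss(0, 1) for s in candidates["signal"]]

kept, scores = screen_predictors(candidates, target, threshold=0.3)
print(kept)
```

The point is not the particular rule but that nobody gets to argue a favorite variable into the model after the fact.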