Data quantity vs Data quality. A true story.

Data quality often matters more than data quantity when making an estimate or a model based on a sample.

Data quality in data science involves completeness, consistency of format, cleanliness, and accuracy of individual data points. Statistics adds the notion of representativeness.


The classic example is the Literary Digest poll of 1936, which predicted a victory for Alf Landon over Franklin Roosevelt.

The Literary Digest, a leading periodical of the day, polled its entire subscriber base plus additional lists of individuals, a total of over 10 million people, and predicted a landslide victory for Landon. George Gallup, founder of the Gallup Poll, conducted biweekly polls of just 2,000 people and accurately predicted a Roosevelt victory. The difference lay in the selection of those polled.

The Literary Digest opted for quantity, paying little attention to the method of selection.

They ended up polling those with relatively high socioeconomic status (their own subscribers, plus those who, by virtue of owning luxuries like telephones and automobiles, appeared in marketers’ lists).


The result was sample bias.

That is, the sample was different in some meaningful, nonrandom way from the larger population it was meant to represent. The term nonrandom is important: hardly any sample, including a random sample, will be exactly representative of the population. Sample bias occurs when the difference is meaningful and can be expected to continue for other samples drawn in the same way as the first.
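
To make the idea concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the population size, the share of affluent voters, and the support rates are invented numbers, not 1936 data, and the function name simulate_poll is hypothetical. It compares a large sample drawn only from an affluent subgroup with a small sample drawn at random from the whole population.

```python
import random


def simulate_poll(seed=0):
    """Compare a large biased sample with a small random sample.

    All proportions are made-up illustration values, not historical data.
    Returns (true_share, biased_estimate, random_estimate).
    """
    rng = random.Random(seed)

    # Hypothetical population of (is_affluent, supports_candidate_a) pairs.
    population = []
    for _ in range(200_000):
        affluent = rng.random() < 0.25             # assumed: 25% are affluent
        p_support = 0.40 if affluent else 0.65     # assumed support rates
        population.append((affluent, rng.random() < p_support))

    true_share = sum(s for _, s in population) / len(population)

    # Large but biased sample: only affluent people can be reached.
    affluent_votes = [s for a, s in population if a]
    big_biased = rng.sample(affluent_votes, 40_000)

    # Small sample drawn at random from the whole population.
    all_votes = [s for _, s in population]
    small_random = rng.sample(all_votes, 2_000)

    biased_estimate = sum(big_biased) / len(big_biased)
    random_estimate = sum(small_random) / len(small_random)
    return true_share, biased_estimate, random_estimate


if __name__ == "__main__":
    true_share, biased, rand = simulate_poll()
    print(f"true support:             {true_share:.3f}")
    print(f"biased sample (n=40,000): {biased:.3f}")
    print(f"random sample (n=2,000):  {rand:.3f}")
```

Under these assumptions, the large sample lands far from the true share no matter how many people it includes, because it only ever sees one subgroup. The small random sample misses by chance alone, so its error is small and shrinks as the sample grows.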