Yesterday morning I went to Xiao-Li Meng’s AMS-MAA invited address, entitled “Statistical Paradises and Paradoxes in Big Data.” My stats background is not especially strong, but one of my favorite parts of the Joint Math Meetings is going to talks outside my area that I can actually love and understand. This was one of those. Professor Meng’s introduction set high expectations, and he really delivered in content and style. He was incredibly energetic and funny.

One of Meng’s paradises of big data is “a larger general pipeline”: more people than ever are interested in statistics at all levels and are pursuing it academically. Others include better airplane/taxi/party conversations for statisticians and a current “golden era” for theoretical and methodological foundations.

However, one paradox is that big data may not be as big as it seems, when we consider quality. Most “big data” is not randomly sampled and is correspondingly prone to bias.

Dr. Meng asked us to consider: When is a large non-random sample better than a small random sample, in measurable terms? To answer the question, he presented “A trio identity for Quality, Quantity, and Difficulty,” a simple statistical identity relating measures of the quality and quantity of data.

The gist: To minimize error, one can increase quantity (proportion of total population sampled) or increase quality (randomness of sample). To see the true value of a data set, it is possible to compute the effective sample size—the estimated size of a randomly sampled data set that would give the same error as the large, non-randomly sampled set. To illustrate, Meng considered a hypothetical survey of 160 million people (half of the US population), non-randomly sampled. For particular parameters, he computed an effective sample size of 400. Wow.
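For anyone who wants to play with the numbers, here is a minimal Python sketch. It assumes the approximation n_eff ≈ f/(1 − f) × 1/ρ², which I believe is the form given in Meng's 2018 paper on this topic, where f is the fraction of the population sampled and ρ is the correlation between being included in the sample and the quantity being measured. The exact parameters from the talk are not recorded here, so ρ = 0.05 is an assumed value, picked because with f = 1/2 it gives an effective sample size of about 400. The function name is purely illustrative.

```python
def effective_sample_size(f, rho):
    """Approximate size of a simple random sample with comparable error.

    f   -- fraction of the population that ended up in the sample (0 < f < 1)
    rho -- ASSUMED correlation between being in the sample and the value
           being measured (0 would mean a perfectly representative sample)
    """
    if not 0 < f < 1:
        raise ValueError("f must be strictly between 0 and 1")
    if rho == 0:
        raise ValueError("rho must be nonzero for this approximation")
    # Assumed approximation: n_eff = f / (1 - f) * 1 / rho^2
    return f / (1 - f) / rho**2


# Half the population sampled, with a hypothetical correlation of 0.05
n_eff = effective_sample_size(f=0.5, rho=0.05)
print(round(n_eff))
```

Under this approximation the population size never appears: only the sampled fraction and the correlation matter, which is why even a tiny amount of selection bias can wipe out the advantage of a huge sample.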

People use statistics to make decisions. We may want to answer the question “What choice is most likely to result in a good outcome for people like me?” Dr. Meng pointed out that the apparent answer may depend on what “like me” means. Reference population and level of resolution matter. Simpson’s paradox may even apply: what appears to be the better choice when we consider the entire population may appear to be the worse choice within each of the subsets that partition the population. Meng used a 1986 study by C. R. Charig, D. R. Webb, S. R. Payne, and J. E. Wickham on kidney stone treatments to illustrate. The following counts and percentages of patients found the given treatments effective:

                Treatment A       Treatment B
Overall         273/350 (78%)     289/350 (83%)

Broken down by size of stone:

                Treatment A       Treatment B
Small stones    81/87 (93%)       234/270 (87%)
Large stones    192/263 (73%)     55/80 (69%)

Treatment B appears to be more effective for the population as a whole, but treatment A appears to be more effective for both people with large stones and people with small stones. Argh. Which one is more effective? How do we choose?
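To convince myself the reversal is real arithmetic and not a typo, here is a small Python sketch that recomputes the rates from the counts quoted above. The subgroup keys and helper names are my own, chosen only for illustration. It also hints at why the reversal happens: Treatment A was used mostly on the group with the lower success rates (263 of its 350 patients), while Treatment B was used mostly on the group with the higher success rates (270 of 350).

```python
# (successes, patients) per treatment within each stone-size group,
# taken from the counts quoted in the post
counts = {
    "first stone-size group": {"A": (81, 87), "B": (234, 270)},
    "second stone-size group": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, total):
    """Fraction of patients for whom the treatment was effective."""
    return successes / total


def pooled(treatment):
    """Add up successes and patients across both groups for one treatment."""
    successes = sum(group[treatment][0] for group in counts.values())
    total = sum(group[treatment][1] for group in counts.values())
    return successes, total


# Within each group, compare the two treatments
for name, group in counts.items():
    print(f"{name}: A {rate(*group['A']):.0%}, B {rate(*group['B']):.0%}")

# Pooled over both groups, compare again
print(f"overall: A {rate(*pooled('A')):.0%}, B {rate(*pooled('B')):.0%}")

# The paradox: A wins inside every group, yet B wins on the pooled data
a_wins_each_group = all(
    rate(*group["A"]) > rate(*group["B"]) for group in counts.values()
)
b_wins_overall = rate(*pooled("B")) > rate(*pooled("A"))
print(a_wins_each_group, b_wins_overall)
```

The sketch only checks the arithmetic. It does not settle which comparison is the right one for a given patient, which is exactly the “like me” question.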

As always when I go to statistics talks, one of my major take-aways is that I need to think way more carefully about statistics. And go to more statistics talks.

Also, Meng has an awesome section on rejection on his website, including a link to this interesting essay on rejection, a topic near to my heart.