Last night’s speaker, Nancy Reid of the University of Toronto, was introduced with so many accolades that I couldn’t keep track of them all. Her talk, on the ways “big data” has been overhyped in recent years, opened with a description of how statistics and data science have exploded over the last decade. She credited Markov chain Monte Carlo (MCMC) algorithms, which transformed the statistical sciences by allowing hard integrals to be replaced with easier-to-compute finite sums.
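To make the "integrals replaced with finite sums" idea concrete, here is a minimal sketch of my own, not something from Reid's talk. It uses a random-walk Metropolis sampler (one simple MCMC algorithm) to estimate E[X²] for a standard normal variable, an integral whose exact value is 1. The target density, step size, sample count, and burn-in length are all illustrative assumptions.

```python
import math
import random


def metropolis_samples(log_density, n_samples, step=1.0, start=0.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_density only needs to be known up to an additive constant.
    """
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    # Unnormalized log density of N(0, 1); the constant cancels in the ratio.
    return -0.5 * x * x


draws = metropolis_samples(standard_normal_log_density, n_samples=50_000)
kept = draws[5_000:]  # discard an (assumed) burn-in period

# The integral of x^2 * p(x) dx is approximated by a plain average.
estimate = sum(x * x for x in kept) / len(kept)
print(f"Estimated E[X^2] = {estimate:.3f} (exact value is 1)")
```

The point is the last step: once the sampler has produced draws from the target distribution, the integral reduces to an average over those draws.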
The MCMC revolution began in 1990, but in the past ten years the field of statistics has grown more rapidly than ever and has also become more interdisciplinary, largely thanks to excitement about big data. That excitement also led to the development of a new field: data science. Reid explained that data science includes concepts outside the realm of traditional statistics: acquisition and preservation of data, making data usable, reproducibility, and security and ethics considerations. These require expertise in math, statistics, computer science, and the field the data comes from, which makes data science more “outward-looking” than statistics has traditionally been.
However, Reid told us, despite all the hype about “big data”, many real-world problems do not actually use huge amounts of data. (Hence the title of her talk: “In Praise of Small Data”.) One such example is the field of extreme event attribution, which tries to tease out the role of anthropogenic climate change in extreme weather events like wildfires. A recent paper on this topic, explained Reid, tried to figure out how much human-caused climate change contributed to British Columbia’s 2017 wildfires.
To do this, the authors started with a simulation of the global climate, then downscaled it to British Columbia. They modeled the relationship between the area burned by wildfires and two climate variables: temperature and precipitation. (This actually required two statistical models.) The authors then simulated the climate 50 times for each of two time periods: the decade 1961-1970 and the decade 2011-2020. Using their model, they found a distribution for how much area was likely to have burned in wildfires during each of these decades. The area burned in the 2017 BC wildfires turned out to be far out in the right tail of the 1961-1970 distribution, and not nearly so far out in the 2011-2020 distribution.
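The tail comparison at the end of that procedure can be sketched in a few lines. This is only a toy illustration under loud assumptions: the paper's climate simulations and its two statistical models are replaced by a made-up normal distribution on log burned area, and every number below (means, spread, the "observed" value) is hypothetical rather than taken from the study. Only the count of 50 simulations per period comes from the talk.

```python
import random


def simulate_log_burned_area(mean_log_area, n_runs, rng, spread=1.0):
    """Stand-in for 'simulate the climate, then apply the burned-area model'.

    Returns one hypothetical log burned area per simulated climate run.
    """
    return [rng.gauss(mean_log_area, spread) for _ in range(n_runs)]


def exceedance_fraction(simulated, observed):
    """Fraction of simulated values at least as large as the observed one."""
    return sum(value >= observed for value in simulated) / len(simulated)


rng = random.Random(2017)
N_RUNS = 50  # the talk mentioned 50 simulations per time period

# Hypothetical means: the later decade is assumed warmer and drier,
# which shifts the burned-area distribution to the right.
early = simulate_log_burned_area(0.0, N_RUNS, rng)  # stands in for 1961-1970
late = simulate_log_burned_area(1.5, N_RUNS, rng)   # stands in for 2011-2020

observed_log_area = 2.0  # hypothetical value for the 2017 fire season

p_early = exceedance_fraction(early, observed_log_area)
p_late = exceedance_fraction(late, observed_log_area)
print(f"Share of early-period runs at or above observed: {p_early:.2f}")
print(f"Share of late-period runs at or above observed:  {p_late:.2f}")
```

A small exceedance fraction for the earlier period and a larger one for the later period is the pattern described above: the observed season sits deep in the right tail of one distribution and closer to the bulk of the other.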
This is certainly a powerful application of statistical inference techniques, but Reid pointed out that there’s actually not that much data involved in the study—it’s mostly simulations.
Reid then discussed a better example of “big data”: a study that followed 6710 people over the age of 50 and looked at how their mortality was associated with “arts engagement” (attending museums, operas, etc.). She pointed out, though, that collecting complete data for such a project requires a huge amount of time and effort. Even though Google and Facebook are collecting data on billions of people every day, it’s not high-quality data. The observations are not independent, and typically, she explained, scientists are looking for unusual events, which limits the amount of data that is actually useful. Moreover, the more data there is, the more complicated it is to work with. Reid summed it all up by noting that when it comes to data, it’s quality over quantity.