Last week, I came across this interesting piece on the perils of using “big data” to draw conclusions about the world. It analyzes, among other things, the situation of Google Flu Trends, the much-heralded public health surveillance system that turned out to be mostly a predictor of winter (and has since been withdrawn).
It seems to me that big data is a fun place to explore for patterns, and that’s all good, clean fun, but the moment you think you have discovered something new is the moment the actual work really starts. I think “data scientists” are probably on top of this problem, but are most of the people going on about big data actually data scientists?
I really do not have all that much to add to the article, but I will amateurishly opine a bit about statistical inference generally:
I’ve taken several statistics courses over my life (high school, undergrad, grad). In each one, I thought I had a solid grasp of the material (and got an “A”), until I took the next one, where I realized that my previous understanding was embarrassingly incorrect. I see no particular reason to think this pattern would ever stop if I took ever more stats classes. The point is, stats is hard. Big data does not make stats easier.
If you throw a bunch of variables at a model, it will find some that look like good predictors. This is true even if the variables are totally and utterly random and unrelated to the dependent variable (see try-it-at-home experiment below). Poking around in big data, unfortunately, only encourages people to do this and perhaps draw conclusions when they should not. So, if you are going to use big data, do have a plan in advance. Know what effect size would be “interesting” and disregard things well under that threshold, even if they appear to be “statistically significant.” Determine in advance how much power (and thus, observations) you should have to make your case, and sample from your ginormous set to a more appropriate size.
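To make the "plan in advance" advice a bit more concrete, here is a rough sketch of what it could look like in R. Everything specific in it is my own assumption for illustration: a simple two-group comparison of means, a smallest interesting difference of 0.1 standard deviations, the usual 5% significance level and 80% power, and a made-up data frame called big standing in for the ginormous data set. The only real machinery is power.t.test() and sample() from base R.

```r
# Hypothetical planning numbers: the smallest difference in means we would
# care about (delta), a guess at the standard deviation, and conventional
# significance and power targets.
plan <- power.t.test(delta = 0.1, sd = 1, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(plan$n)

# Made-up stand-in for a "ginormous" data set: one million rows, two groups,
# and no real difference between the groups.
set.seed(1)  # arbitrary seed, only so the run is repeatable
big <- data.frame(
  group = rep(c("a", "b"), each = 500000),
  value = rnorm(1000000)
)

# Draw only as many rows per group as the plan calls for.
rows_a <- sample(which(big$group == "a"), n_per_group)
rows_b <- sample(which(big$group == "b"), n_per_group)
small <- big[c(rows_a, rows_b), ]

# Run the planned test on the right-sized sample, not on all million rows.
t.test(value ~ group, data = small)
```

With these particular numbers the plan asks for roughly 1,600 observations per group, which is a tiny fraction of the million rows available. The point of sampling down is that a test on all million rows would happily flag differences far smaller than anything you decided was interesting.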
Big data sets seem like they were mostly created for purposes other than statistical inference. That makes them a form of convenience data. They might be big, but are the variables present really what you’re after? And was this data collected scientifically, in a manner designed to minimize bias? I’m told that collecting a quality data set takes effort (and money). If that’s so, it seems likely that the quality of your average big data set is low.
A lot of big data comes from log files from web services. That’s a lame place to learn about anything other than how the people who use those web services behave while they are using them. It tells you little about people who don’t use those services, or even about how the people who do use them think while they’re doing something else. Just sayin’.
Well, anyway, I’m perhaps out of my depth here, but I’ll leave you with this quick experiment, in R:
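# 10,000 observations of 200 predictors, all pure uniform noise
rows <- 10000
vars <- 200
x <- data.frame(replicate(vars, runif(rows, 0, 1)))
# The "dependent" variable is also pure noise, unrelated to x
y <- runif(rows, 0, 1)
# Regress y on every column of x
a.mod <- lm(y ~ ., data = x)
# Print the coefficient table, p-values and significance stars
summary(a.mod)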
It generates 10,000 observations of 201 variables, each drawn from a uniform random distribution on [0,1]. Then it fits an OLS model using one variable as the dependent and the remaining 200 as independents. R’s model summary is even nice enough to put friendly little asterisks next to variables that have p<0.05.
When I run it, I get 10 variables that appear to be “statistically significant at the 5% level” or better, even though the data is nothing but pure noise. This is about what one should expect from random noise: with 200 predictors and a 5% threshold, roughly 200 × 0.05 = 10 of them will clear the bar by chance alone.
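If you would rather not count asterisks by eye, something like the following sketch should do the counting for you. It reruns the same experiment from scratch and pulls the p-values out of the coefficient table that summary() produces. The seed is an arbitrary choice of mine so that reruns match, and your count will differ with a different seed.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable
rows <- 10000
vars <- 200
x <- data.frame(replicate(vars, runif(rows, 0, 1)))
y <- runif(rows, 0, 1)
a.mod <- lm(y ~ ., data = x)

# The coefficient table has one row per term; drop the intercept (row 1)
# and keep the p-value column.
p_values <- coef(summary(a.mod))[-1, "Pr(>|t|)"]

# How many noise predictors clear the 5% bar, versus how many we would
# expect to clear it by chance alone.
n_significant <- sum(p_values < 0.05)
n_expected <- vars * 0.05

cat("significant at 5%:", n_significant, "\n")
cat("expected by chance:", n_expected, "\n")
```

The exact count bounces around from run to run, but it should stay in the neighborhood of the expected number.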
Of course, the R² of the resulting model is ridiculously low (that is, the 200 variables together have low explanatory power). Moreover, the effect size of the variables is small. All as it should be, but you do have to know to look. And in a more subtle case, you can imagine what happens if you build a model with a bunch of variables that do have explanatory power, and a bunch more that are crap. Then you will see a nice R² overall, but you will still have some of your crap pop up.
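Here is a rough sketch of that more subtle case. The setup is entirely made up for illustration: 5 predictors that really do drive y (each with a true coefficient of 1), 195 that are pure noise, and a small amount of random error. The names n_real, n_noise and mixed.mod are mine.

```r
set.seed(7)  # arbitrary seed, only so the run is repeatable
rows <- 10000
n_real <- 5     # hypothetical count of predictors that truly matter
n_noise <- 195  # predictors that are pure noise

x <- data.frame(replicate(n_real + n_noise, runif(rows, 0, 1)))

# y depends only on the first n_real columns, plus a little random error.
y <- rowSums(x[, 1:n_real]) + rnorm(rows, sd = 0.1)

mixed.mod <- lm(y ~ ., data = x)
fit <- summary(mixed.mod)
coefs <- coef(fit)[-1, ]  # drop the intercept row

noise_rows <- (n_real + 1):(n_real + n_noise)
significant <- coefs[, "Pr(>|t|)"] < 0.05

cat("R-squared:", round(fit$r.squared, 3), "\n")
cat("noise variables flagged at 5%:", sum(significant[noise_rows]), "\n")

# Effect sizes tell the real predictors and the noise apart.
cat("smallest real effect:", min(abs(coefs[1:n_real, "Estimate"])), "\n")
cat("largest noise effect:", max(abs(coefs[noise_rows, "Estimate"])), "\n")
```

The overall fit looks great, and a handful of the noise variables will typically still earn their asterisks. Comparing the estimated coefficients is what gives the game away: the real effects sit near 1 while the noise effects are tiny, which is why deciding on an "interesting" effect size beforehand matters.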