Remember the recent post I wrote about Wired editor Chris Anderson’s article on how the scientific method is becoming obsolete now that large volumes of data are available? In that post, I conceded that it might be possible to develop some technologies without recourse to the underlying science:

At a more fundamental level, in spite of what Chris Anderson has to say, science is about explanations, coherent models and understanding. In my opinion, all of what Anderson shows is that, if you have enough data, you can develop technologies without having a clear handle on the underlying science; however, it is wrong to call these technologies science, and argue that you can do science without coherent models or mechanistic explanations.

Cosma Shalizi at Three-Toed Sloth (who knows more about these models than I do) sets the record straight and shows how the development of some technologies is impossible without a proper grounding in science. His post is eminently quotable, and I am going to quote it almost in its entirety:

I recently made the mistake of trying to kill some waiting-room time with Wired. (Yes, I should know better.) The cover story was a piece by editor Chris Anderson, about how having lots of data means we can just look for correlations by data mining, and drop the scientific method in favor of statistical learning algorithms. Now, I work on model discovery, but this struck me as so thoroughly, and characteristically, foolish (“saucy, ignorant contrarianism”, indeed) that I thought I was going to have to write a post picking it apart. Fortunately, Fernando Pereira (who actually knows something about machine learning) has said, crisply, what needs to be said about this. I hope he won’t mind (or charge me) if I quote him at length:

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those “patterns” would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships.

Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints (from scientific theories) all that number crunching will just memorize the experimental data.

I might add that anyone who thinks the power of data mining will let them write a spam filter without understanding linguistic structure deserves the in-box they’ll get; and that anyone who thinks they can overcome these obstacles by chanting “Bayes, Bayes, Bayes”, without also employing exactly the kind of constraints Pereira mentions, is simply ignorant of the relevant probability theory.
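To make Pereira's point about coincidental “patterns” a little more concrete, here is a minimal sketch of my own (it is not from Shalizi's or Pereira's posts). It assumes a toy setup in which both the target and every candidate feature are pure Gaussian noise, so there is nothing real to find. The function name `best_spurious_feature` and all the parameter values are hypothetical choices made for illustration only.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_spurious_feature(n_samples=20, n_features=2000, n_fresh=2000, seed=0):
    """Mine unconstrained noise for the 'best' correlate of a noise target.

    Returns (in_sample, out_of_sample): the absolute correlation of the
    selected feature on the mined sample, and on fresh observations.
    All sizes are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)

    # Target and features are independent noise: no true relationship exists.
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    features = [
        [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)
    ]

    # "Data mining" with no constraints: keep whichever feature correlates best.
    in_sample = max(abs(pearson(f, target)) for f in features)

    # Fresh observations of the selected feature and the target. Because the
    # process is pattern-free, these are simply new independent noise draws.
    fresh_feature = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_target = [rng.gauss(0, 1) for _ in range(n_fresh)]
    out_of_sample = abs(pearson(fresh_feature, fresh_target))

    return in_sample, out_of_sample


if __name__ == "__main__":
    mined, fresh = best_spurious_feature()
    print(f"best correlation found by mining: {mined:.2f}")
    print(f"same 'pattern' on fresh data:     {fresh:.2f}")
```

With only 20 observations and 2,000 candidates to choose from, the winning correlation should look impressive on the mined sample and then shrink towards zero on fresh data. The search has memorized noise, which is roughly what happens when nothing restricts the set of hypotheses.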

Have fun!

