Posts Tagged ‘Petabyte age’

Scientific methodology obsolescence: the last update?

June 28, 2008

Tom Slee at Whimsley bemoans bad ideas winning the day:

Then I see that Anita Elberse of Harvard Business School has actually looked at some data behind the same Chris Anderson’s Long Tail hypothesis (please, don’t call it a theory) and, not surprisingly, found it misguided. You would think that would cheer me up, but reading Anderson’s response (here and reprinted here) just plunged me further into gloom. Why? Because although he loses this battle (if asked I will bore people with the details, but I really don’t think there is a point), sloppy business journalism has won the war.

Face it. Chris Anderson has people at Harvard Business School of all places spending their valuable time following up his idle speculations. He comes up with a half-baked idea, has basically no data to support it, and yet other people – smart people, with real jobs and things to do – actually spend their time following up these idle daydreams, acting as his research assistants. What a waste.

And other people – smart people, probably with families and friends who could use their attention – feel they have to spend their time explaining a few of the reasons why he is wrong in his latest article. And here I am wasting my evening writing this junk.

Journalists and popular science or technology writers should take the serious thoughts of others and communicate them in an interesting and attention-getting way. I have no problem with that. But this is all backwards: a few stories from a business journalist setting the research agenda of Harvard Business School?

How did this happen?

I guess this post is also a reminder to me that I should stop updating on the issue!

Science and engineering approaches

June 27, 2008

Here is another take on Chris Anderson’s Wired piece, this time by Seth Roberts:

Varangy wonders what I think about this editorial by Chris Anderson, the editor of Wired. Anderson says “faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.” Anderson confuses statistical models with scientific ones. As far as the content goes, I’m completely unconvinced. Anderson gives no examples of this approach to science being replaced by something else.

For me, the larger lesson of the editorial is how different science is from engineering. Wired is mainly about engineering. I’m pretty sure Anderson has some grasp of the subject. Yet this editorial, which reads like something a humanities professor would write, shows that his understanding doesn’t extend to science.

It reminds me why I didn’t want to be a doctor. (Which is like being an engineer.) It seemed to me that a doctor’s world is too constrained: You deal with similar problems over and over. I wanted more uncertainty, a bigger canvas. That larger canvas came along when I tried to figure out why I was waking up too early. Rather than being like engineering (applying what we already know), this was true science: I had no idea what the answer was. There was a very wide range of possibilities.

Science and engineering are two ends of a dimension of problem-solving. The more you have an idea what the answer will be, the more it is like engineering. The wider the range of possible answers, the more it is like science. Making a living requires a steady income: much more compatible with engineering than science.

I like to think my self-experimentation has a kind of wild flavor which is the flavor of “raw” science, whereas the science most people are familiar with is “pasteurized” science — science tamed, made more certain, more ritualistic, so as to make it more compatible with making a living. Sequencing genes, for example, is pasteurized science. Taking an MRI of the brain while subjects do this or that task is pasteurized science. Pasteurized science is full of rituals and overstatements (e.g., “correlation does not equal causation”, “the plural of anecdote is not data”) that reduce unpleasant uncertainty, just as pasteurization does. Pasteurized science is more confusable with engineering.

There’s one way in which Anderson is right about the effects of more data. It has nothing to do with the difference between petabytes and gigabytes (which is what Anderson emphasizes), but it is something that having a lot more data enables: Making pictures. When you can make a picture with your data, it becomes a lot easier to see interesting patterns in it.
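
Seth’s point about pictures is easy to demonstrate with a toy sketch of my own (nothing from Seth’s post; the data below is made up purely for illustration): the two variables have a correlation coefficient of roughly zero, yet a scatter plot reveals their structure at a glance.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, 1000)
    x = np.cos(theta) + rng.normal(scale=0.05, size=theta.size)
    y = np.sin(theta) + rng.normal(scale=0.05, size=theta.size)

    # The summary number says "nothing here": the correlation is essentially zero.
    print("correlation coefficient:", round(float(np.corrcoef(x, y)[0, 1]), 3))

    # The picture says otherwise: the ring is obvious at a glance.
    plt.scatter(x, y, s=5)
    plt.title("Near-zero correlation, obvious pattern")
    plt.show()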

By the way, is Seth correct about being a doctor? I do not think so; having read Atul Gawande and Oliver Sacks, I see that most of the problems they face, and their approach to them, are more science than engineering.

By the same token, even though engineering might be applying what we already know, in the real world there is no such clear distinction; and even in cases where there is, the problems one comes across are so varied that we do not know enough to start applying. In any case, innovation is all about applying ideas which others have not even thought of as relevant to the problem at hand.

In other words, while I agree with Seth that Anderson’s piece reads as rather naive about scientific methodology, I am not sure the naivety can be attributed to his training as an engineer.

Update on scientific methodology obsolescence

June 26, 2008

Remember the recent post I wrote about Wired editor Chris Anderson’s article on how the scientific method is becoming obsolete with the availability of huge amounts of data? In that post, I conceded that it might be possible to develop some technologies without recourse to the underlying science:

At a more fundamental level, in spite of what Chris Anderson has to say, science is about explanations, coherent models and understanding. In my opinion, all of what Anderson shows is that, if you have enough data, you can develop technologies without having a clear handle on the underlying science; however, it is wrong to call these technologies science, and argue that you can do science without coherent models or mechanistic explanations.

Cosma Shalizi at Three-toed-Sloth (who knows more about these models than I do) sets the record straight, and shows how the development of some technologies is impossible without a proper grounding in science, in this eminently quotable post (which I am going to quote almost in its entirety):

I recently made the mistake of trying to kill some waiting-room time with Wired. (Yes, I should know better.) The cover story was a piece by editor Chris Anderson, about how having lots of data means we can just look for correlations by data mining, and drop the scientific method in favor of statistical learning algorithms. Now, I work on model discovery, but this struck me as so thoroughly, and characteristically, foolish — “saucy, ignorant contrarianism“, indeed — that I thought I was going to have to write a post picking it apart. Fortunately, Fernando Pereira (who actually knows something about machine learning) has said, crisply, what needs to be said about this. I hope he won’t mind (or charge me) if I quote him at length:

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those “patterns” would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships. Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.

I might add that anyone who thinks the power of data mining will let them write a spam filter without understanding linguistic structure deserves the in-box they’ll get; and that anyone who thinks they can overcome these obstacles by chanting “Bayes, Bayes, Bayes”, without also employing exactly the kind of constraints Pereira mentions, is simply ignorant of the relevant probability theory.

Have fun!
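
Pereira’s remark that unconstrained number crunching will “just memorize the experimental data” can be made concrete with a toy sketch of my own (purely illustrative, not taken from his comment or from Cosma’s post): fit ten noisy measurements with an unconstrained high-degree polynomial and with a constrained low-degree one, then compare the two on fresh data.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample(n):
        # A hypothetical "true" mechanism plus measurement noise.
        x = rng.uniform(-1.0, 1.0, n)
        return x, np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)

    x_train, y_train = sample(10)   # a small experiment
    x_test, y_test = sample(200)    # new data the models have never seen

    for degree in (9, 3):  # degree 9 can interpolate all ten points; degree 3 is constrained
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")

    # The unconstrained fit drives the training error to (near) zero, yet it
    # typically does far worse than the constrained fit on the held-out data:
    # it has memorized the ten points rather than captured the mechanism.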

Has the scientific method become obsolete?

June 24, 2008

An article in Wired by Chris Anderson argues that it has:

The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Learning to use a “computer” of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There’s no reason to cling to our old ways. It’s time to ask: What can science learn from Google?

There are several other articles in the same issue about areas where petabytes of data are the norm: crop prediction, monitoring epidemics, visualization of big data, and so on.

However, I still do not see this kind of “science without models” succeeding in all areas of science; from the examples discussed, this type of methodology seems most useful in cases where there are far too many parameters, and most of them are not controllable.

At a more fundamental level, in spite of what Chris Anderson has to say, science is about explanations, coherent models and understanding.  In my opinion, all of what Anderson shows is that, if you have enough data, you can develop technologies without having a clear handle on the underlying science; however, it is wrong to call these technologies science, and argue that you can do science without coherent models or mechanistic explanations.
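
As a footnote to the “correlation is enough” claim: mine enough unrelated variables and strong correlations turn up by chance alone, which is exactly why a model is needed to tell signal from coincidence. A toy sketch of my own (purely illustrative) makes the point:

    import numpy as np

    rng = np.random.default_rng(2)
    n_samples, n_variables = 20, 5000

    noise_variables = rng.normal(size=(n_variables, n_samples))  # every row is pure noise
    outcome = rng.normal(size=n_samples)                         # so is the "outcome"

    correlations = np.array([np.corrcoef(v, outcome)[0, 1] for v in noise_variables])
    best = np.argmax(np.abs(correlations))
    print(f"strongest of {n_variables} spurious correlations: r = {correlations[best]:.2f}")

    # With 5000 candidate variables and only 20 samples, the strongest |r| is
    # routinely above 0.6: a "pattern" found without any hypothesis, and with
    # no mechanism behind it whatsoever.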