Dealing with data (I) – Five Colors Science & Technology

Making the cut

We have too much information to deal with. What do scientists do?

It’s a truism of modern life that we’re deluged with information, more than we can possibly process. Our astronomer will never admit that he has too much data, but often he acknowledges that his processing capability is limited. In this situation, what do astronomers do?

Start at the beginning: an astronomer studies the sky. But not all of it. First, everything we call “weather” must be filtered out. It’s important to astronomy, of course, but not something astronomers study in its own right. Then we have to accept that we can’t see all the sky: half is hidden under the Earth at any one time, and some of it can never be seen (unless we’re exactly on the Equator, where few observatories are located). So in any collection of data (view of the sky), part must be removed and part is missing.

Let’s ask a specific question: how many stars are there in the sky? If you’re out on a good night in a dark place (something rare nowadays), it looks like many thousands. But if you settle down to the tedious task of charting them all, the answer depends on who is looking, and on how good the sky was on that night. Some people will see fainter stars; some nights are less “transparent.” And if you pull out a pair of binoculars you’ll see many more. So we have to refine the question: how many stars are there, above a certain brightness? That brings up the problem of measuring brightness, a difficult thing to do precisely before there were electronic detectors. And different stars have different colors, which means they’re brighter in some parts of the spectrum and fainter in others. Your limits become more complicated.
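
For readers who like to see the idea spelled out, here is a minimal sketch of what “above a certain brightness” means in practice. Everything in it is an assumption made for illustration: the five-star catalog and its magnitudes are invented, and the function name is hypothetical. The one real convention it relies on is that astronomers measure brightness in magnitudes, where a smaller number means a brighter star.

```python
# Hypothetical toy catalog: (label, apparent magnitude).
# The values are invented for illustration, not real measurements.
catalog = [("A", 1.2), ("B", 3.5), ("C", 5.8), ("D", 6.4), ("E", 8.9)]


def count_brighter_than(stars, limiting_magnitude):
    """Count stars at least as bright as the limit.

    In the magnitude system a smaller number means a brighter star,
    so "brighter than the limit" is magnitude <= limiting_magnitude.
    """
    return sum(1 for _, magnitude in stars if magnitude <= limiting_magnitude)


# The same catalog gives a different "number of stars" for each cut.
for limit in (4.0, 6.0, 9.0):
    print(f"limit {limit}: {count_brighter_than(catalog, limit)} stars")
```

The same list of stars gives a different answer for every limit, which is why the limit has to be stated along with the count. A fuller version would also have to say in which part of the spectrum the magnitudes were measured.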

It’s worse when we move on to objects more complicated (at least, to the eye) than stars. In the eighteenth century Charles Messier, hunting for comets, was troubled by a number of objects in the sky that looked (to him) like comets but weren’t. Eventually he published a list of about a hundred of them. By modern standards the selection was extremely haphazard: things he or his friends had stumbled upon in their ramblings, that in some way looked similar in their telescopes. The list is known today to contain objects of very different types: clusters of stars, wisps of gas, the ejections of old stars, distant galaxies. Wildly different things can look the same.

Some years ago, our astronomer searched photographs of the entire sky (much more systematically than Messier) looking for a certain kind of galaxy. After many follow-up observations, his list of several hundred objects turned out to contain many gas clouds, many galaxies of a kind he was not looking for, and just two of his intended targets. There may be much chaff for a very little wheat.

How do we apply these ideas to today’s information deluge? To start with, we are extremely pessimistic about concluding anything from the output of an internet search engine (to say nothing of social media). Exactly how it makes its choices is unknown to outsiders, so there’s no way of knowing what’s hidden below the horizon, or how much is (to put it delicately) “weather.” Certainly, a high ranking involves things many other people have searched for, along with what companies have paid to have advertised. Those may not be your criteria. As generators of data, Web search results are chiefly useful to study Web search engines.

We’ll continue on this subject next week.