On automatic

We point out dangers in succumbing to routine.

Our astronomer is finishing up his revision of a paper on variable stars, which has entailed (among other tasks) carefully rereading some previous work by other scientists.  The paper matching his subject most closely started with data on 3542 variable stars and applied several processing steps to analyze them.  It was all done automatically, of course, because there’s no way to plot, fit, and draw conclusions from thousands of light curves manually.  Unfortunately, on closer inspection, each step seems plausible, and would be in another context; here, though, the combination produces nonsense.  There are quality-control measures that might have been taken, but weren’t.  (To be fair to the process of science, it was a poster paper.  These are short, terse, and not peer-reviewed.  It doesn’t appear to have been followed up with a formal journal paper.)
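
To make the point concrete, here is a minimal sketch, in Python, of the kind of quality-control gate an automated light-curve pipeline could apply.  Nothing here is drawn from the poster itself: the function, the Fourier fit, and every threshold are illustrative assumptions.

```python
import numpy as np

def qc_flags(t, mag, err, period, n_harmonics=2):
    """Fit a truncated Fourier series at a candidate period and return
    simple quality-control flags.  Thresholds are illustrative guesses,
    not standard values."""
    phase = 2.0 * np.pi * t / period
    cols = [np.ones_like(t)]                      # constant term
    for k in range(1, n_harmonics + 1):           # sine/cosine harmonics
        cols += [np.sin(k * phase), np.cos(k * phase)]
    A = np.column_stack(cols)
    # Weighted least squares: divide both sides by the errors.
    coeffs, *_ = np.linalg.lstsq(A / err[:, None], mag / err, rcond=None)
    resid = (mag - A @ coeffs) / err
    chi2_red = float(resid @ resid) / (len(t) - A.shape[1])
    flags = []
    if chi2_red > 3.0:
        flags.append("bad_fit")          # model does not describe the data
    if abs(period - round(period)) < 0.01:
        flags.append("possible_alias")   # suspiciously near an integer day
    if len(t) < 30:
        flags.append("sparse_sampling")  # too few points to trust anything
    return chi2_red, flags

# Synthetic example: a 2.7-day sinusoid observed 80 times.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 80))
err = np.full_like(t, 0.02)
mag = 12.0 + 0.3 * np.sin(2.0 * np.pi * t / 2.7) + rng.normal(0.0, 0.02, t.size)
print(qc_flags(t, mag, err, period=2.7))   # good fit, no flags
print(qc_flags(t, mag, err, period=1.0))   # wrong period: flagged
```

The cheap checks are exactly the ones a human would make by eye: does the model actually fit, is the period an artifact of the observing cadence, are there enough points to mean anything?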

It reminded him of another paper, from much longer ago, dealing with galaxies.  At the time he was investigating the makeup of the Local Group of galaxies, a concentration of smaller objects around the big spirals of the Milky Way (our own) and Andromeda; something over two dozen members were then known.  A paper came out sorting some hundreds of relatively nearby galaxies into groups, and it listed 52 in the Local Group.  Our astronomer was startled: how had everyone missed the extra ones?  He looked more closely into the matter.  The authors had started with someone else’s big catalog of galaxy positions and redshifts.  A redshift is a measure of how fast a galaxy is moving away from us; in general, due to the expansion of the universe (a phrase that is misleading and inaccurate, but we’ll deal with that another time), a lower redshift means a galaxy closer to our own.  A (relatively) small number of these redshifts were wrong, because a star in our own galaxy was superposed on the distant galaxy and gave a spuriously low result.  The authors of the paper automatically placed these galaxies in the Local Group.  This should have been caught (it was a peer-reviewed journal paper) but wasn’t.
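
Schematically, the failure mode looks like this.  The sketch below is a toy illustration in Python, not the paper’s actual method; the velocity cut, the catalog entries, and the magnitude cross-check are all invented for the example.

```python
import math
from dataclasses import dataclass

H0 = 70.0                 # Hubble constant, km/s/Mpc (rough modern value)
LOCAL_GROUP_VMAX = 300.0  # illustrative velocity cut, km/s (an assumption)

@dataclass
class Galaxy:
    name: str
    cz: float             # measured redshift as a velocity, km/s
    app_mag: float        # apparent magnitude
    abs_mag_est: float    # absolute magnitude expected for its type

def naive_local_group(catalog):
    """The automatic step: everything below the velocity cut is
    declared a Local Group member, no questions asked."""
    return [g for g in catalog if g.cz < LOCAL_GROUP_VMAX]

def distance_consistent(g, tolerance_mag=3.0):
    """Cross-check two independent distance clues.  A superposed
    foreground star corrupts the redshift but not the apparent
    magnitude, so the two implied distance moduli disagree wildly.
    (The Hubble law is a poor ruler this close to home; this is a
    toy consistency test, not a real distance measurement.)"""
    d_mpc = max(g.cz, 1.0) / H0
    mu_redshift = 5.0 * math.log10(d_mpc * 1e6 / 10.0)
    mu_photometric = g.app_mag - g.abs_mag_est
    return abs(mu_redshift - mu_photometric) < tolerance_mag

catalog = [  # both entries are invented for illustration
    Galaxy("genuine dwarf", cz=120.0, app_mag=10.5, abs_mag_est=-14.0),
    Galaxy("contaminated", cz=150.0, app_mag=14.8, abs_mag_est=-20.5),
]
for g in naive_local_group(catalog):
    status = "looks fine" if distance_consistent(g) else "SUSPECT: recheck the spectrum"
    print(f"{g.name}: {status}")
```

The value of the cross-check is independence: the foreground star corrupts the measured redshift but not the galaxy’s apparent brightness, so two distance clues that ought to agree suddenly disagree by many magnitudes.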

Our astronomer is analyzing ten stars, not 3542.  In principle, more stars mean better results, but not this time.  In an earlier paper he did some dynamical calculations with 342 galaxies, tracking down the distance and redshift of each one (there were a lot of references in that paper).  He still considers the master of data to be Roger Griffin, who could discuss each data point like an old friend.

But this kind of handling of data has already become impossible.  The massive surveys of modern astronomy must be analyzed automatically, and by seriously big machines.  One can hope that those working in the field have carefully considered everything important that might afflict the data and the algorithms that work on them.  And of course any serious analysis tests its methods and builds in self-correction.  It is, however, a different kind of astronomy.
