Dealing with data (II)

The Law of Propagation of Bad Data

You need to go back to the source.

Last week we pointed out that running an internet search engine is not a good way to collect data (except, maybe, on search engines).  The algorithms used are secret, so we have no way of accounting for what is included and what is left out, a vital step in coming to any conclusions.  So what do astronomers (and other scientists) do?

There are databases, mostly on-line, that collect the sort of information astronomers need.  There are some for stars, some for variable stars, some for a particular kind of variable star; there are some for galaxies and other more distant objects; some for almost any specialized need.  The people who run these databases do not produce the data themselves.  Rather, they collect the work of others.  Each does it in its own way, but the important thing is that they state explicitly how they do it.  They may use papers published in a particular set of scientific journals; they may filter the data in any of several ways; they have procedures for combining data from different sources.  From their procedures, an astronomer can estimate what possible biases there may be, and correct for them.

And each database includes a citation to the original data.  This, too, is important.  Not because some authors are more authoritative than others, but because the details of how the data are taken impose their own biases and incompleteness.  And uncertainties: it is often vital to know whether a redshift was measured roughly from a small photograph, or with great care using a sophisticated electronic detector.  Then there’s the matter of calibration.  A scientist has to be able to get to the source.

So citation-tracing has long been a necessary skill.  In his early days, our astronomer was looking for all the data he could find on distances and redshifts of nearby galaxies.  He located one database with both, but found, in its procedures, that the distances had been estimated from the redshifts, and were thus of no use to him.  The redshifts were, but he noticed that the value given for one galaxy was different from the value given in another source.  After much time in the library, he found that the redshift had been measured three times by different groups.  Two were in accord; the third, as printed, had an obvious typo.  And that was the one that had been taken up by the database.

Thus, only partly tongue-in-cheek, he formulated the Law of Propagation of Bad Data: when there are two or more sources for data, one of which is erroneous, the bad datum will be the one selected for further publication.

Databases are enormous now.  It’s not possible to track down every entry manually, so there are sophisticated programs doing checks and cross-checks.  Simple typos are much rarer, as manual transcription has gone away.  But the final defense against the Law of Propagation of Bad Data remains the ability to go back to the source.
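By way of illustration, here is a minimal sketch, in Python, of the kind of cross-check such a program might perform.  The sources, values, and tolerance here are all invented for the example; real pipelines are of course far more elaborate.

```python
# Toy cross-check: given several published values for the same quantity,
# flag any that sit far from the consensus of the rest.
from statistics import median

def flag_discordant(measurements, tolerance=5.0):
    """measurements: list of (source, value, uncertainty) tuples.
    Returns the sources whose values lie more than `tolerance`
    quoted uncertainties away from the median of all the values."""
    mid = median(value for _, value, _ in measurements)
    return [source for source, value, sigma in measurements
            if abs(value - mid) > tolerance * sigma]

# Three hypothetical redshift measurements of one galaxy (quoted as
# velocities, in km/s); the third looks like a transposed digit.
obs = [("Group A (1978)", 1412.0, 25.0),
       ("Group B (1983)", 1405.0, 15.0),
       ("Group C (1981)", 4112.0, 30.0)]
print(flag_discordant(obs))   # -> ['Group C (1981)']
```

A database builder running something like this would still have to go back to the three original papers to decide which value is right; the program can only point out that they disagree.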
