
Scientist’s Operating Manual – Evidence; gathering, measuring, analysing

In this chapter we will look at how science gathers information about the world, and what it does with it.

[Contributors should write their bits in the comments, and I will collate them below the fold or in new posts. By the way, contributors will be named unless they don’t want to be.

This will be expanded on when inspiration strikes and I have time, but it’s been brewing long enough already.]

The process of science

Francis Bacon, who lived in the late 16th and early 17th centuries, is widely thought of as the person who first set out the way to learn about the world through science, in a work called the Novum Organum (that is, the “new instrument” or “method”). It was a logic and procedure for discovery.

According to Bacon, one should gather together as much information about a topic as one can through observation. The “evidence of the senses” was to take priority over reports from authorities. He started out by saying:

Man, being the servant and interpreter of Nature, can do and understand so much and so much only as he has observed in fact or in thought of the course of nature. Beyond this he neither knows anything nor can do anything. [Aphorism 1]

Science, as Bacon thought of it, was all about observation and reasoning. Science was a process of induction, in which a large amount of observation would lead to a generalisation or law of nature as a hypothesis. Bacon thought that science began by observing and that one would prove hypotheses that way. Later, Karl Popper argued, following the 18th century philosopher David Hume, that inductions from many singular observations to a law or theory were not justifiable inferences. So he proposed that the real test of a theory was whether it could be falsified. That is, Popper thought theories should be vulnerable to being disproven, not proven. A theory that could be disproven, but had many times not been, was the better theory. It might still be false, but it was going to be a better thing to accept than one that was immune to disproof, or one that had not yet been subjected to a test of that kind.

A hypothesis, a law, and a theory are not all the same sort of thing, but they are similar. A hypothesis is an explanation that has been proposed, and usually hypotheses are as-yet untested. When they are tested, hypotheses become theories. A law is a general statement that has no exceptions in the domain it is an explanation for, or if it does, those exceptions are at the limit or extremes. A theory, as understood by most philosophers of science, is a whole intellectual edifice that includes laws, models, evidence and experimental results. It explains, predicts, and offers avenues for further research.

Bacon’s idea about science was that it was linear:


In simple terms, a scientist makes many observations and, by generalising from them, arrives at a law or hypothesis.

Popper’s was roughly the reverse:


although he thought this was a cycle rather than a simple line. Each turn of the cycle meant that your scientific theories got more truthlike.

Bacon’s view of science is called empiricism – the idea that science starts with and entirely rests upon evidence of observation. Popper’s view is called falsificationism – the idea that the role of evidence is to eliminate hypotheses, not to construct them. In both cases, evidence plays a key role, so how evidence is gathered is the key to understanding science itself. There are, of course, many other views about the process and sequence of science, but they all involve evidence.


Evidence is collected by observation, and this is of a number of kinds. There is description of observations made, usually in the field, without much in the way of hypothesis or theory, although as Darwin once wrote, “How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service!” This is mostly called field observation, and it is the stock in trade of naturalists who observe the behaviour of animals in the wild, or geologists who survey some new area. Field observers have a number of prior ideas, of course, because they are trained in their subject, but they may have no preset hypotheses for the specific phenomena they are about to observe.

Then there is observation by experiment. Broadly speaking, this is of two kinds. One is when the experimenter has no clear idea of what they are looking for, and merely “plays” with some phenomena to see what happens. This is not regarded as a reliable form of observation, however, and at best it leads to chance discoveries, called serendipitous discovery. Also called luck…* The other kind is the kind most widely used and relied upon, and it is the kind that has to be done for a new phenomenon or explanation to be accepted: experiment to test some conjecture or hypothesis.

Experiment is sometimes thought to include a special kind of experiment: the crucial experiment, which is supposed to “prove” or “confirm” a hypothesis, so that it becomes a theory (theories are what models, hypotheses and conjectures want to be when they grow up; in science, the very best explanation is a theory). This is thought now to be wrong, but we’ll look at experiment later.


Evidence has to be measured. Saying that an animal runs fast is not enough; you have to give measured speeds, average speeds, and methods used to determine this (e.g., stopwatch, radar, use of video) so that other researchers can check the reliability of your methods if they need to do so. Measurements have two properties: precision and accuracy. A measurement is precise if it has, roughly speaking, the right number of decimal places (a measurement might be done with equipment sensitive to parts per billion, or it might be enough to get it to the nearest 100). Precision is something that the practice of that field sets up. You do not need to be more precise than you can assume the measuring equipment is capable of being.

Accuracy, on the other hand, means that you got the measurement right. You can be very precise, and quite wrong. To illustrate: suppose I say to you that the earth’s moon, on a certain date, is 225,345.0451 km away from the earth. This is extremely precise, and quite wrong – the earth’s moon is usually around 384,000 km away. The latter value is not precise (laser ranging can locate the earth’s moon to within centimetres), but it is accurate. People often mistake precision for accuracy.
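For readers who like to see such things run, here is a minimal sketch of the difference, using the moon figures above. Everything in it is an assumption made for illustration: the two instruments are hypothetical, the rough figure of 384,000 km is treated as the truth, and precision is modelled as the scatter of repeated readings, which is one common way of making “the right number of decimal places” concrete.

```python
import random
import statistics

TRUE_DISTANCE_KM = 384_000  # rough figure from the text, treated as "truth" here


def simulate(bias_km, noise_sd_km, n=1000, seed=0):
    """Return n simulated readings with a fixed offset and random scatter."""
    rng = random.Random(seed)
    return [TRUE_DISTANCE_KM + bias_km + rng.gauss(0, noise_sd_km) for _ in range(n)]


def summarise(readings):
    """Return (inaccuracy, imprecision) in km for a list of readings."""
    inaccuracy = statistics.mean(readings) - TRUE_DISTANCE_KM
    imprecision = statistics.stdev(readings)
    return inaccuracy, imprecision


# Hypothetical instrument A: tiny scatter, large systematic offset.
precise_but_wrong = simulate(bias_km=-158_654.9549, noise_sd_km=0.0001)
# Hypothetical instrument B: no offset, large scatter.
accurate_but_rough = simulate(bias_km=0.0, noise_sd_km=2_000.0)

for name, readings in [("A", precise_but_wrong), ("B", accurate_but_rough)]:
    inaccuracy, imprecision = summarise(readings)
    print(f"instrument {name}: off by {inaccuracy:,.1f} km, scatter {imprecision:,.4f} km")
```

Instrument A agrees with itself to many decimal places and is still off by well over a hundred thousand kilometres; instrument B wobbles by thousands of kilometres but averages out close to the right answer.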

[The following section is from Neil Rickert’s post; I have amended it in various ways – JSW]

Often evidence is not readily available. Scientists often need to be creative in coming up with ways of getting evidence. And, at times, this requires being creative in finding new concepts with which the new evidence can be expressed. We will start with a couple of examples, to illustrate the point.


Our first example, from physics, is the Michelson-Morley experiment. Michelson and Morley had the idea of measuring the aether drift by comparing the speed of light in two different directions. Now that sounds simple. We could measure the speed of light in each of those directions, subtract the two measurements, yielding the difference. The problem, though, is that it is difficult to measure the speed of light with anything close to the precision that would have been required for such a simple approach. So, instead, they came up with a creative plan to build an interferometer, whereby they could split a beam of light, have part travel in one direction and part in another, then recombine them in a way that wave interference could be observed. That inventive experimental design allowed observation of a difference in speeds at a far higher precision than was available for the direct measurement of the speed of light. The experiment showed no detectable aether drift, and this was taken as evidence against the aether theory of light.
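To get a feel for why the interferometer was so much more sensitive, here is a rough calculation of the fringe shift that a stationary aether would have been expected to produce when the apparatus was turned through 90 degrees. The formula is the usual textbook approximation, and the input figures (an effective path length of about 11 m, light of about 500 nm, and the earth’s orbital speed of about 30 km/s) are commonly quoted values, not taken from the text above; treat them as assumptions of this sketch.

```python
def expected_fringe_shift(path_length_m, wavelength_m, drift_speed_m_s,
                          light_speed_m_s=299_792_458.0):
    """Fringe shift predicted by a stationary-aether model for a 90 degree turn.

    Uses the textbook approximation: shift = 2 * L * (v / c)**2 / wavelength.
    """
    beta_squared = (drift_speed_m_s / light_speed_m_s) ** 2
    return 2.0 * path_length_m * beta_squared / wavelength_m


# Commonly quoted figures, assumed here for illustration only.
shift = expected_fringe_shift(path_length_m=11.0,
                              wavelength_m=500e-9,
                              drift_speed_m_s=30_000.0)

print(f"(v/c)^2 is about {(30_000.0 / 299_792_458.0) ** 2:.1e}")
print(f"expected shift is about {shift:.2f} of a fringe")
```

The effect being sought is of the order of one part in a hundred million, hopeless to find by timing light directly, yet the interference pattern turns it into an expected shift of a sizeable fraction of a fringe, which is something an observer can look for by eye.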

Defining a new species

For our second example, we will use a hypothetical case from biology. Let us suppose that there is a species, call it spX, of insects being studied. The biologists are studying the habitat and the ethology of insects of this species. As they proceed with their studies, they begin to realize that this should be considered to be two species rather than one. So they name a new species, call it spY, and they provide criteria to distinguish between organisms of spX and organisms of spY. As a result, there are now many more expressible facts. For example, facts which state differences between the behavior of organisms of spX and those of spY are now expressible, whereas they were not expressible before the new species was defined. As in the Michelson-Morley example, this allows the expression of information which is more precise than would have been previously possible.


The Michelson-Morley experiment is well known, and it is reasonably common for biologists to define new species. It seems, however, that the epistemic significance has been largely overlooked. Epistemology (the field of philosophy that investigates how knowledge is gained) is often presented as if there exists a fixed set of true propositions, and our job is to find out what those propositions are. However, the set of expressible propositions is not fixed. New methods of getting evidence, or new concepts (such as that of the species spY), expand the set of potentially expressible true propositions. So there is more to acquiring knowledge than simply picking up the facts.

The example of defining a new species fits rather well with what perceptual psychologist Eleanor Gibson described in her 1969 book Principles of Perceptual Learning and Development. For it is an example of a newly acquired ability to discriminate between organisms of spX and those of spY. Gibson used the term “perceptual learning” for acquisition of such abilities. Since the Michelson-Morley experiment also allowed for more precise observations, that could also be considered an example of perceptual learning. Our scientific knowledge is growing in ways that are not adequately described by ordinary epistemology.

[End Neil’s text]


Once evidence has been gathered, it has to be analysed. Evidence doesn’t, on its own, tell us much. A single observation, a single “data point” (or, if you are a classicist, a datum, a “given”) only tells the scientist that one measurement was made. For it to mean much, many observations have to be made to reduce the error that is inevitable in measurement (an exception is when all you can have is a single observation or data point, which often is the case in historical sciences). And these multiple observations need to be tested and analysed. If you do anything that involves large numbers, things will tend to scatter, and this includes measurements. With the best equipment and observers in the world, each measurement will be slightly different (precision, remember?), and so you have to use statistical analytical techniques to retrieve the information inherent in those observations.
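A small simulation may help show what repeating a measurement buys you. This is only a sketch: the true value, the noise level and the sample sizes are all hypothetical, and the noise is assumed to be normally distributed with no bias. It reports the scatter of the individual readings (their standard deviation) alongside the standard error of their mean.

```python
import random
import statistics


def measure(n, true_value=10.0, noise_sd=0.5, seed=1):
    """Simulate n noisy measurements of the same quantity (all values hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(true_value, noise_sd) for _ in range(n)]


def summarise(readings):
    """Return the mean, the scatter of single readings, and the standard error of the mean."""
    mean = statistics.mean(readings)
    scatter = statistics.stdev(readings)
    standard_error = scatter / len(readings) ** 0.5
    return mean, scatter, standard_error


for n in (10, 100, 10_000):
    mean, scatter, se = summarise(measure(n))
    print(f"n={n:>6}: mean={mean:.3f}  scatter={scatter:.3f}  standard error={se:.4f}")
```

Under these assumptions the scatter of individual readings stays near the noise level however many are taken, while the uncertainty in their average shrinks roughly with the square root of the number of readings. Averaging does nothing about bias, though: a miscalibrated instrument stays wrong no matter how often you read it.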

All these scattered measurements will tend to cluster around a mode, and so typically scientists will report the modal value and some error bars. These report the likely range within which the true value is expected to fall; but the usual measurement of error, p = 0.05 (which is a fancy way of saying, plus or minus 5%) itself is just a convention and doesn’t mean all that much. It was introduced by R. A. Fisher, who just chose it for argument’s sake, and it became the gold standard.

Analysis also goes on before you make observations and measurements. Statistics are often used to design experiments. But we’ll cover that later.

[Contributors: add what you think should be added please in the comments. If you are adding to an existing section, name it, and if a new section, title it. If you want to rewrite something already in, note this.]

* An apocryphal (which means, “I can’t find the source”) quotation from science fiction author Isaac Asimov that demonstrates this is “The words that herald a major scientific breakthrough are less often ‘Eureka!’ than ‘Hmmm. That’s odd.’”


  1. What is missing in your account, is that often evidence is not readily available. Scientists often need to be creative in coming up with ways of getting evidence. And, at times, this requires being creative in finding new concepts with which the new evidence can be expressed.

    • John S. Wilkins

      Excellent! Write that passage, will you? 🙂

  2. I’ll be happy to write more. Tentatively, I’ll plan on writing up something on my blog tomorrow.

  3. Bob O'H

    “If you do anything that involves large numbers, things will tend to scatter.”

    Huh? The sample size shouldn’t affect the scatter: you might have cause & effect the wrong way round: if there’s scatter, you should collect more data.

    “All these scattered measurements will tend to cluster around a mode, and so typically scientists will report the modal value and some error bars.”

    They don’t have to cluster around the mode: they might be multi-modal, or the mode might be at one extreme of the distribution.

    In addition, it’s rare that we report the mode. For continuous measurements, where the empirical mode lies is more to do with chance. The mode of the sampling distribution is trickier to estimate, so we’re usually lazy and report the mean or median instead.

    “These report the likely range within which the true value is expected to fall; but the usual measurement of error, p = 0.05 (which is a fancy way of saying, plus or minus 5%) itself is just a convention and doesn’t mean all that much.”

    Ugh – that’s a mess.

    First, I’m not sure one should talk about the true value, but I guess I should leave that to philosophers. Once you get on to estimation through model fitting, it’s only true conditional on the model being true, but (as G.E.P. Box famously said) “All models are wrong, but some are useful”.

    Second, error bars usually report the mean ± standard error. The coverage of this depends on the distribution: it’s about 0.68 for a normal distribution.

    Third, p = 0.05 is most certainly not a fancy way of saying, plus or minus 5%. It’s a way of accepting that we’ll be wrong 5% of the time (i.e. the confidence interval will not contain the actual value 5% of the time). Also, talking about p=0.05 conflates parameter estimation with hypothesis testing: you should be talking about α=0.95.

    I’m pretty sure the Fisher thing is more complicated than you imply. I mention this to make Simon Ings happy: your simplification is a little bit of whig history being written.

    • John S. Wilkins

      Don’t just comment. Edit.

    • John S. Wilkins

      Oh, and here’s why I thought Fisher’s was a matter of convention: He says so.

      • Bob O'H

        Yes, but as Jerry points out Fisher was inconsistent.

        He also mentions the issue that Fisher didn’t like the Neyman-Pearson approach of setting a p-value, and preferred to see the p as a measure on a continuous axis. So I’m not sure he really introduced it as a definitive cut-off, as you seem to imply.

        Someone should really write a popular science book about Fisher and that period. There were lots of really strong characters he got to interact with, it would be quite juicy.

  4. Bob O'H

    Damn, I buggered up the blockquotes. And a +/- became a ñ.


  5. John Harshman

    I’m looking in the description for what I (and other phylogeneticists) do, and I’m not sure it’s there. We gather data in the hopes that it will tell us the phylogeny of a group, but we generally aren’t testing any hypothesis. Rather, we are testing a very large family of hypotheses, i.e. the entire a priori credible set of trees. This is neither random exploration — we have the specific purpose of determining the phylogeny of some group, and gather that particular data we think may be useful for that purpose — nor hypothesis testing/falsification. Eventually, the credible hypotheses may be reduced to some small number, perhaps even two, at which point the process may come to resemble classical hypothesis testing.

    In other news, I think you’re missing a lot when you end with Popper. I think more scientists these days would use some kind of statistical model of knowledge than would accept the yes/no (or perhaps maybe/no) idea of falsification.

    • John S. Wilkins

      Excellent comments, John. Can you write them up or make edits to the text so I can include them?

      • John Harshman

        Not sure I can. I really don’t know much philosophy of science beyond Popper and Kuhn, neither of whom I find at all satisfactory. I have only a naive faith that better material must be out there somewhere, something that better reflects what scientists actually do.

        If there really is nothing better already published, a good place to start would be trying to gather a sample of actual practice from as many fields as possible, and then making what generalizations one can. Before that happens, a manual seems premature. And so I am attempting to contribute my own example.

        While I can’t attach any famous names, the process in my field seems to be of this sort:

        Profusion of hypotheses -> Observations -> Analysis -> Pruning of hypotheses -> Repeat. Observations add to the credibility of some hypotheses and subtract from the credibility of others. In this case, a hypothesis is the existence of a particular node of the tree in preference to all conflicting nodes. The more the observations, as analyzed, are compatible with one hypothesis and incompatible with others, the more likely we consider that hypothesis. At some point in the piling up of observations, we are entitled to conclude that a hypothesis is confirmed and that conflicting ones are falsified. Statistical tests are a reasonable way of deciding where that point is.

        I don’t quite know how to fit this into your structure.

  6. ckc (not kc)

    …the Scientist’s Operating Manual is a great idea. As a scientist who spends most of my time doing science for non-scientists (contract and service research), I know that the layman’s understanding of how science is done is, through no fault of their own, limited. I’m not sure how best to contribute, so I’ll start with some comments on the post above. [It’s very hard to resist the temptation to improve the descriptions by making them more “scientific” and simultaneously less “accessible”, let me say. I’ll try to fight that.]

    “With the best equipment and observers in the world, each measurement will be slightly different (precision, remember?),”

    perhaps “No matter how precise the equipment and observers, the real world adds noise (imprecision) to each measurement. To further complicate matters, the real world also adds bias (inaccuracy).”

    “and so you have to use statistical analytical techniques to retrieve the information inherent in those observations.”

    adding “Statistics are a scientist’s way to estimate parameters, that is to take the measurements that the real world has made messy and remove the noise and bias”

    […this is already getting into terminology, such as “parameters” that will put off the non-scientist. Maybe the best way to make this manual work would be to make it more concrete – pick a (possibly imaginary, but preferably believable) set of organisms/processes/patterns that everyone can relate to and deal with concepts in that context.]
