This is a field in which I am largely ignorant, so I will just report it and leave the commenters to interpret.

The basic idea is taken from Shannon's information theory, using a term – *surprisal* – to denote the information content of an experiment, or of a paper announcing the results of that experiment. Knudsen writes:

The larger the surprisal of a particular outcome, the greater its (*a posteriori*) scientific merit.

The surprisal of the experiment/paper is calculated using Shannon’s entropy equation (different from thermodynamic entropy – the story goes that Shannon asked John von Neumann what to call his value, and von Neumann replied, “Call it *entropy*; nobody will know what you are talking about”, because of the continuing confusion about what thermo entropy was). The equation is

$$H = -\sum_{x} p(x)\,\log p(x)$$

What determines the probability *p(x)*? Effectively, it’s the prior expectation of getting that value, which seems to me, although I’m open to correction, to be a version of Bayesian priors, from Bayes’ famous equation

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

in which *P(B)* is the prior expectation of the outcome *B* and *P(A)* is the prior probability of *A*. These priors affect the probability of event *A* occurring, or being true, once *B* is known. Change the priors and you change that probability.
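To make the role of the priors concrete, here is a minimal sketch of Bayes' rule in Python. The function name `posterior` and all the numbers are made up for illustration; none of them come from Knudsen's paper. The evidence is held fixed and only the prior for *A* changes.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) from the law of total probability over A and not-A.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical likelihoods: B is seen 80% of the time if A is true,
# 10% of the time if A is false. Only the prior for A varies.
for prior in (0.01, 0.5, 0.9):
    print(f"prior P(A) = {prior:.2f} -> posterior P(A|B) = {posterior(prior, 0.8, 0.1):.3f}")
```

With the same evidence, a sceptical prior leaves *A* improbable while a generous prior makes it close to certain, which is the sense in which changing the priors changes the answer.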

So why care? It has to do with a long-standing tradition, mostly among medical researchers and psychologists, of setting up a “null hypothesis” in experimental design. This is usually the “alternative explanation” to the hypothesis being tested. But it has come under a lot of criticism among philosophers of science, particularly from Fiona Fidler, Neil Thomason and other critics. Neil was my PhD advisor, and Fiona did her thesis on the topic.

Now I have little comprehension of this matter, and had less when I was Neil’s student and Fiona’s costudent, so I won’t go into it now and display my ignorance. Statistics is not my strong suit.

But the use of Shannon entropy in this context is interesting to me. Why use Shannon’s index? It’s a measure of the error of communication; so why should we expect it to measure the information content of experiments? I think, but do not know enough to know for sure, that it has to do with the general properties of that equation. Let me muse…

Shannon’s *H* is a measure of the probability distribution of symbols given a “population” of symbols from which any particular (next) one can be drawn. As such it’s a measure of diversity, of difference in a population, and it has a life in conservation biology as a metric of diversity (Hammond 1992) where it is sometimes called the Hammond index. So it may have wider implications than just communications.
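As a rough illustration of *H* as a diversity measure, the sketch below computes it for two invented "populations" of species labels. The function name `shannon_index` and the sample data are assumptions made for the example, and the natural logarithm is used; a different base only rescales the result.

```python
import math
from collections import Counter


def shannon_index(population):
    """Shannon's H (natural log) of the category frequencies in a sample."""
    counts = Counter(population)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Two hypothetical populations of 100 individuals each.
even = ["a", "b", "c", "d"] * 25        # four equally common species
skewed = ["a"] * 97 + ["b", "c", "d"]   # one species dominates

print(shannon_index(even))    # equals ln(4), the maximum for four categories
print(shannon_index(skewed))  # far lower: the next draw is rarely a surprise
```

The even population is maximally diverse for four categories, while the skewed one scores low even though it contains the same four species.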

What I wonder at, and lack the math to find out, is if this might be a way to tie Fisher information, which is the accuracy of a measurement, to Shannon information. Any math whizzes out there care to enlighten me?

*Late note* Bob O’Hara, a commenter on this site, has blogged a more measured and informed view on Knudsen’s paper.

Neither equation shows when viewed with any of my browsers (Camino, Firefox, Opera, Safari).

ecto played up. It should display now.

Me again. I tried reloading the page on each browser. Three failed, but this time Opera got it right. (I have no idea why.)

John, you should wander down to a stats department some day. 🙂 There have been a lot of criticisms of NHST, and nowadays I think most applied statisticians argue against it.

There are a few alternatives. One that has been gaining traction in the biological community is the use of AIC. This has an information-theoretic interpretation, but also an interpretation in terms of the ability of the model to predict replicate data. There are a few subtleties involved, which have led to a whole range of alternatives being suggested (BIC, CIC, DIC, TIC etc. etIC.).

The paper proposes a measure that does the same thing as a Bayes factor (I’m not going to dig into the maths to check how far apart they are, but it looks like there will be a monotonic transformation between the two). Bayes factors are well known, as are their problems – essentially, they can be sensitive to the prior distribution of the parameters. Of course, high energy physicists don’t need to worry about such things, as their experiments don’t have any error.

As a member of the editorial board of JNR-EEB, I feel obliged to point out that using “surprise” as a measure of scientific worth is very short-sighted.

Oh, wait. He’s a high-energy physicist, so he doesn’t need to replicate his studies.

Bob
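For readers who have not met AIC, a minimal sketch may help. It uses the standard definition AIC = 2k - 2 ln L (k fitted parameters, L the maximised likelihood) to compare two Gaussian models on invented data. The helper names and the data are assumptions for illustration only, and none of the subtleties mentioned above are handled.

```python
import math


def gaussian_loglik(data, mu, sigma2):
    """Log-likelihood of the data under a Normal(mu, sigma2) model."""
    n = len(data)
    rss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


def aic(loglik, k):
    """Akaike's information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * loglik


data = [1.2, 0.8, 1.9, 1.4, 0.7, 1.6, 1.1, 1.3]  # made-up measurements
n = len(data)

# Model 0: mean fixed at zero, variance estimated (k = 1).
s2_0 = sum(x ** 2 for x in data) / n
aic0 = aic(gaussian_loglik(data, 0.0, s2_0), k=1)

# Model 1: mean and variance both estimated (k = 2).
mean = sum(data) / n
s2_1 = sum((x - mean) ** 2 for x in data) / n
aic1 = aic(gaussian_loglik(data, mean, s2_1), k=2)

print(f"AIC, mean fixed at 0: {aic0:.2f}")
print(f"AIC, mean estimated:  {aic1:.2f}")
```

The extra parameter costs two units of AIC, so the larger model is preferred only when it improves the log-likelihood by more than one unit, as it does for these data.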

CRM, it probably worked because I figured out the HTML to make it link to the files. My blog editor did weird shit and I had to wrangle the code directly.

Bob, it’s surpris*al*, not surprise…

Stats is one of those things I regret not learning, along with pretty much the rest of mathematics, but these days I defer to those who are equationate. I knew of Akaike, because Elliott Sober has been busy discussing it in philosophy, but the subtleties, as you call them (extremely hard ideas, I call them), are beyond me.

Now I’ve re-read what you wrote (and I’m wondering if Cost-u-dent are good at removing teeth), it’s worth pointing out that Shannon’s Information appears from the way the problem is formulated – the surprisal is defined as -log_10 *p* (grrr, he should have used log_e, so that this is then simply the negative log-likelihood, or half the deviance), and then he calculates the expected surprisal (if that isn’t an oxymoron), which gives him the Shannon’s Information that he wants.

I’ll try and remember to check on the connections between Shannon’s and Fisher’s information tomorrow. There’s no direct general relationship because they’re fundamentally measuring different things.

Bob
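The point about expected surprisal can be sketched in a few lines. The function names and the probability values below are invented for illustration; the only thing taken from the discussion is the definition of surprisal as minus the log of the probability.

```python
import math


def surprisal(p, base=10.0):
    """Surprisal of an outcome with probability p: -log_base(p)."""
    return -math.log(p) / math.log(base)


def expected_surprisal(probs, base=10.0):
    """Probability-weighted mean surprisal, i.e. Shannon's H in the given base."""
    return sum(p * surprisal(p, base) for p in probs if p > 0)


probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical outcome probabilities

h10 = expected_surprisal(probs, base=10.0)    # base-10 logs
h_e = expected_surprisal(probs, base=math.e)  # natural logs

print(h10, h_e, h10 * math.log(10))
```

Switching from base 10 to natural logs only multiplies everything by ln(10), so the choice of base affects the units and not the ranking of outcomes.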

Having learned only enough statistics to know not to use any that I do not (at least superficially) understand, I expose my ignorance thusly: I understand the null hypothesis as the hypothesis that you didn’t really do anything; thus the differences between your results and do-nothing results are just due to chance. Does surprisal imply unexpected results?

You may have a typo in the defn above. If it is written as -sum_i {observed p(x_i) * ln(prior p(x_i))} you get an increasing function as the observed gets larger than the prior, for non-uniform priors. In plain English, frequent occurrences of events thought to be rare are surprising.

If you use the p*ln(p) formulation then the likelihood decreases for deviations from uniformity.
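The difference between the two formulations can be checked numerically. In the sketch below the sign is chosen so that both quantities are expected surprisals (the observed-times-log-prior sum is negated), and the prior and observed frequencies are invented for illustration.

```python
import math


def cross_entropy(observed, prior):
    """-sum_i observed_i * ln(prior_i): mean surprisal of what happened, judged by the prior."""
    return -sum(q * math.log(p) for q, p in zip(observed, prior) if q > 0)


def entropy(p):
    """-sum_i p_i * ln(p_i): the p*ln(p) formulation."""
    return cross_entropy(p, p)


prior = [0.9, 0.1]        # hypothetical prior: the second event is thought rare
as_expected = [0.9, 0.1]  # observations match the prior
rare_common = [0.5, 0.5]  # the "rare" event turns up half the time

print(cross_entropy(as_expected, prior))  # baseline
print(cross_entropy(rare_common, prior))  # larger: the rare event keeps happening
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))  # entropy is highest when uniform
```

The first measure grows when events thought to be rare occur often, while the entropy of a single distribution only records how far that distribution is from uniform.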

A related book is “Physics from Fisher Information”.

Re #7: The name “Null Hypothesis” is an unfortunate choice of words. It would be better named “Hypothesis I’d like to exclude/contradict”, but that doesn’t roll off the tongue so well, does it? It doesn’t need a zero, nor does it need to be the hypothesis that “you didn’t really do anything.” NHST is a straightforward probabilistic proof-by-contradiction.

If you read some of the math-based articles, this becomes clearer when the statisticians start talking about multiple hypotheses or sequences of hypotheses (H_0, H_1, …, H_i, …).

Of course a good Bayesian thinks all of this is nonsense anyway and wants to put priors everywhere.

Thanks Bob, that puts it in a context for me. I’ll add that link to the main article.

I’ve finally got round to blogging about this. My conclusion – of little use in practice, and some of the ideas go back to 1956. I’ll claim this as a win for the statisticians.

Bob
