modernCSLewis


Thursday, August 22, 2013

Why the Bonferroni correction is a mistake (almost always)

Posted on 10:43 PM by Unknown
* 

Why statistical adjustment for multiple comparisons (eg. the Bonferroni correction) is almost always a mistake

Bruce G Charlton

*

One thing that everybody who has ever done a course on statistics apparently remembers is that there is 'a problem' with using multiple tests of statistical significance on a single data set.

In a nutshell, if you keep looking for significant differences, or significant correlations, by two-way comparisons - then you will eventually find one by chance.

So if you were to seek the 'cause' of fingernail cancer by measuring 20 biochemical variables, you would expect one of these variables to correlate with the diagnosis at the p=0.05 level of significance - on the basis that p=0.05 is a one-in-twenty probability.
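The arithmetic behind this familiar warning is easy to check - a minimal Python sketch, assuming (for simplicity) that the 20 tests are independent and each uses the conventional p=0.05 threshold:

```python
alpha = 0.05   # conventional one-in-twenty significance threshold
m = 20         # number of biochemical variables tested

# Probability that at least one of m independent tests comes out
# 'significant' purely by chance:
p_any_false_positive = 1 - (1 - alpha) ** m
print(f"{p_any_false_positive:.3f}")  # ~0.642
```

So with 20 variables there is roughly a two-in-three chance of at least one spurious 'finding'.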

*

For some reason, everybody remembers this problem. But what to do about it is where the trouble starts - especially considering that most research studies measure more than a pair of variables and consequently want to make more than one comparison.

*

Increasing the 'stringency' of the statistical test by demanding a smaller p value in proportion to the number of comparisons - the greater the number of comparisons, the smaller the p value required before 'significance' is reached (eg the Bonferroni 'correction') - is probably the commonest suggestion; but it is the wrong answer.

http://en.wikipedia.org/wiki/Bonferroni_correction
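Mechanically, the correction just divides the significance threshold by the number of comparisons - a minimal sketch of that arithmetic:

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison p value threshold under the Bonferroni correction."""
    return alpha / n_comparisons

print(bonferroni_threshold(0.05, 1))    # one test: the usual 5% threshold
print(bonferroni_threshold(0.05, 20))   # twenty tests: demand p < 0.0025
```

With 20 comparisons, a result must reach p<0.0025 before it counts as 'significant'.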

Or rather, it is the answer to a different question; because (as it is used) it tries to provide a statistical solution to a scientific problem - it tries to replace science by statistical obfuscation, which cannot be done: thus the Bonferroni 'correction' is (in common usage) an error based upon incompetence.

As I will try to explain, in reality, the Bonferroni correction has no plausible role in mainstream research - what it does is something that almost never needs to be done.

*

Statistical tests are based on the idea that the investigator has taken a random sample from a population, and wishes to generalise from that sample to the whole population. Each random sample is a microcosm of the population from which it is drawn, so by measuring a variable in the sample one can make an estimate of the size of that variable in the population.

For example, the percentage of intended Republican voters in a random sample from the US state of Utah is an estimate of the percentage of Republican voters in the whole of Utah.

If the sample is small, then the estimate is imprecise and will have a large confidence interval - and as the size of a sample gets bigger, then the properties of the population from which it is drawn become more apparent, hence the confidence interval gets smaller and the estimate is regarded as more precise.

*

For instance, if there were only ten people randomly sampled in an opinion poll, then obviously this cannot give a precise estimate of the true proportion that would vote Republican compared with Democrat, and Libertarian voters would probably be missed-out, or if included over-estimated as a proportion, and the tiny proportion intending to vote for the Monster Raving Looney Party would almost certainly be unrepresented.

But a random sample of 1000 will yield a highly precise estimate of the major parties' support, and will let you know whether the MRLP voters are few enough to be ignored.
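The way the confidence interval shrinks with sample size can be sketched with the standard normal approximation for a sampled proportion (the 50% support figure below is hypothetical):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a
    sampled proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical 50% support, measured in polls of different sizes:
print(f"n = 10:   +/- {ci_half_width(0.5, 10):.1%}")    # roughly +/- 31%
print(f"n = 1000: +/- {ci_half_width(0.5, 1000):.1%}")  # roughly +/- 3%
```

A ten-person poll pins the true proportion down to within about thirty percentage points either way; a thousand-person poll, to about three.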

The statistical null hypothesis assumes that the two groups being compared are random samples drawn from a single population. The size of the p value estimates how probable our sample measurements would be if this null hypothesis were true.

*

Now, suppose we want to compare Democrat support in Utah and Massachusetts, to see if it is different. This is the kind of question being asked in almost all science where statistics are used. 

Random samples of opinion are taken in the two states, and it looks as if there is a higher proportion of potential Democrat voters in Massachusetts. A t-test is used to ask how likely it is that the apparent difference in Democrat support could have arisen by chance in randomly drawing two samples from the same population (taking into account that the samples have a particular size, are normally distributed, and are characterized by these particular mean and standard deviation values).
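As a sketch of this kind of test: for poll proportions the usual choice is a two-proportion z-test rather than a t-test on raw means, but the logic is identical - ask how surprising the observed gap would be if both samples came from one population. The poll numbers below are made up for illustration:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p value for the null hypothesis that two sampled
    proportions were drawn from one and the same population."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # best estimate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical polls: 350/1000 intend to vote Democrat in Utah,
# 600/1000 in Massachusetts (invented numbers).
p_value = two_proportion_z_test(350, 1000, 600, 1000)
print(f"p = {p_value:.3g}")
```

A gap that size in samples of 1000 gives a vanishingly small p value: it is wildly improbable that the two polls sampled the same population.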

It turns out that the p value is low, which means that the difference in intended Democrat voting between the samples from Utah and Massachusetts is large enough to make it improbable that the samples were drawn from the same population (ie. it is more probable that the two samples were drawn from different populations).

*

What is happening here is that we have decided that there is a significant difference between a microcosm of Massachusetts voting patterns and a microcosm of Utah voting patterns. It seems very unlikely that they could have been so different simply by the random chance of sampling from a single population. The p value merely summarizes the probabilities relating to this state of affairs.

In this example, the p value is affected only by the characteristics of the Utah and Massachusetts samples, and we use it to decide how big a sample is needed before we accept that voting patterns in Massachusetts really are different from Utah.

The necessary size of the sample (to make this decision of difference) depends on how big the difference is (the bigger the difference, the smaller the sample needed to discover it), and on the scatter around the mean (the more scattered the variation around the mean, the bigger the sample needed to discover the true mean value which is being obscured by the scatter).
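This trade-off between difference size, scatter and required sample size is the standard power calculation - a sketch using the usual normal approximation (the alpha=0.05 and 80% power figures are conventional defaults, not the author's):

```python
import math
from statistics import NormalDist

def n_per_group(delta: float, sigma: float,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group to detect a difference in means
    of size delta, given scatter sigma (standard normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_power = NormalDist().inv_cdf(power)           # ~0.84
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

print(n_per_group(delta=0.5, sigma=1.0))   # big difference: small sample
print(n_per_group(delta=0.1, sigma=1.0))   # small difference: big sample
print(n_per_group(delta=0.5, sigma=2.0))   # more scatter: bigger sample
```

Halving the difference to be detected roughly quadruples the required sample; doubling the scatter does the same.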

But the sample size needed to decide that Utah and Massachusetts are different does not (of course!) depend upon how many other samples are being compared, in different studies, involving different places.

*

Supposing we had done opinion polls in both Utah and Massachusetts, and that there really was a difference between the microcosms of Utah and Massachusetts voting intentions.

Suppose then that someone goes and does an opinion poll in Texas. Naturally, this makes no difference to our decision regarding the difference between Massachusetts and Utah.

Even if opinion polls were conducted on intended Democrat voters in every state of the Union, then these other pollsters performed statistical tests to see whether these other states differed from one another - this would have no effect whatsoever on our initial inference concerning the difference between Massachusetts and Utah.

In particular, there would be no reason to feel that it was now necessary either to take much larger samples of the Utah and Massachusetts populations, or to demand much bigger differences between the measured voting patterns, in order to feel the same level of confidence that the difference was real. Information on voting in other places is irrelevant to the question of voting in Massachusetts and Utah.

Yet this is what the Bonferroni 'correction' imposes: it falsely assumes that the addition of other comparisons somehow means that we do need a larger sample from Massachusetts and Utah in order to reach the same conclusion as before. This is just plain wrong!

*

In sum, the appropriate p value which results from comparing the Utah and Massachusetts samples ought to derive solely from the characteristics of those samples (ie. the size of the samples, and the measured proportion of Democrat voters); and is not affected by the properties of other samples, nor the number of other samples. Obviously not!

*

So, what assumptions are being made by the procedures for 'correction' of p values for multiple comparisons, such as the Bonferroni procedure? This procedure demands a smaller p value to count as a significant difference according to the number of comparisons. And therefore it assumes that the reality of a significant difference in voting intentions between Utah and Massachusetts is – somehow! – affected by the voting intentions in other states...

From where does this confusion arise?  The answer is that the multiple comparisons procedure has a different null hypothesis. The question being asked is different.

The question implicitly being asked when using the Bonferroni ‘correction’ is no longer 'what is the likelihood that these two opinion polls were performed on the same population' - but instead something along the lines of 'how likely is it that we will find differences between any two opinion polls – no matter how many opinion polls we do on a single population'.

*

In effect, the Bonferroni procedure incrementally adjusts the p values, such that every time we take another opinion poll, the p value that counts as significant gets smaller, so that the probability of finding a statistically significant difference between polls remains constant no matter how many polls we do.

In other words, the Bonferroni ‘correction’ is based upon the correct but irrelevant fact that the more states we compared by opinion polls, the greater the chance that we would find some two states with apparently different voting intentions by sheer chance variation in the polls.

But this has precisely nothing to do with a comparison of voting patterns between Utah and Massachusetts.

*

In effect, the Bp takes account of the fact that each sample may be large enough to provide a sufficiently precise microcosm of the voting intentions in two states, but the chance of a spurious 'significant' difference in at least one two-way comparison accumulates whenever we do another, and another, such comparison. So with the Bonferroni ‘correction’ in operation - no matter how many opinion polls are compared, we are no more likely to find a (spurious) difference than if only two samples were compared.
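This behaviour can be checked by simulation - a sketch in Python, where every 'state' really is polled from one and the same population (true support 50%; all numbers hypothetical), so any 'significant' pairwise difference is a false positive:

```python
import math
import random
from statistics import NormalDist

random.seed(0)

def fwer(n_states: int, alpha: float, trials: int = 500, n: int = 200) -> float:
    """Fraction of simulated runs in which ANY pairwise poll comparison
    comes out 'significant' - even though every state is, by construction,
    sampled from the same population (true support = 0.5)."""
    norm = NormalDist()
    hits = 0
    for _ in range(trials):
        # One poll of n voters per state, all from the same population:
        counts = [sum(random.random() < 0.5 for _ in range(n))
                  for _ in range(n_states)]
        significant = False
        for i in range(n_states):
            for j in range(i + 1, n_states):
                # Two-proportion z-test for the (i, j) pair:
                pooled = (counts[i] + counts[j]) / (2 * n)
                se = math.sqrt(pooled * (1 - pooled) * 2 / n)
                z = abs(counts[i] - counts[j]) / n / se
                if 2 * (1 - norm.cdf(z)) < alpha:
                    significant = True
        hits += significant
    return hits / trials

n_pairs = 5 * 4 // 2                # 10 pairwise comparisons among 5 states
raw = fwer(5, 0.05)                 # family-wise error rate, uncorrected
bonf = fwer(5, 0.05 / n_pairs)      # with the Bonferroni-adjusted threshold
print(f"uncorrected: {raw:.2f}, Bonferroni: {bonf:.2f}")
```

Uncorrected, some pairwise 'difference' turns up in a large fraction of runs; with the threshold divided by the number of comparisons, the family-wise rate drops back to roughly the nominal 5% - which is exactly, and only, the question the Bonferroni procedure answers.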

*

But why or when might this kind of statistical correction be useful and relevant?

Beats me!...

I cannot imagine any plausible scientific situation in which it would be legitimate to apply the Bonferroni procedure in practice.

I am not saying there aren’t any situations when the Bp is appropriate – but I cannot think of any, and surely such situations must be very rare indeed.

*

Yet in practice the Bonferroni procedure is used - hence mis-used - a lot; and in fact the Bp is often insisted-upon by supposed statistical experts as a condition of the refereeing process in academic and publishing situations (e.g. use the Bp or you won't pass your PhD, use the Bp or your paper will be rejected).

The usual situation is that non-scientific (hence anti-scientific) statistical incompetents (which is to say, nearly everybody) believe that the Bonferroni correction is merely a more rigorous use of significance testing - a marker of a competent researcher; when in fact it is (almost always) the marker of somebody who hasn't a clue what they are doing or why.

This situation is a damning indictment of the honesty and competence of modern researchers - who are prepared to use and indeed impose a procedure they don't understand - and apparently don't care about enough to make the ten minutes of effort necessary to understand; but instead just 'go along with' a prevailing trend that is not just arbitrary but pernicious - multiply pernicious not only in terms of systematically misinterpreting research results, but also in coercing people formally to agree to believing in nonsense; and thereby publicly to join-in with a purposive destruction of real science and the motivations of real science. 

*

So, given that applying the Bonferroni procedure is not a ‘correction’ but a nonsensical and distorting misunderstanding; then what should be done about the problem of multiple comparisons? Because the problem is real, even though the suggested answer has nothing to do with the problem.

Supposing you had been trawling data to find possibly important correlations among a large number of measured variables - a perfectly legitimate scientific procedure. There may be some magnitude of significance (e.g. p value) at which your attention would be attracted to a specific pair of variables from among all the others.

If, in the above hypothetical data trawl, fingernail cancer showed a positive association with nose-picking at a p value of 0.05 - then that is the size of the association: one which would arise by random sampling of a single population only once in twenty times. It doesn't matter how many other variables were considered alongside nose-picking in looking for causes of fingernail cancer - one, two, four or a hundred and twenty-eight: if that association between NPicking and FNail Ca is important enough to be interesting, then it is important enough to be interesting.

If a pairwise correlation or difference between two populations is big enough to be interesting, but you are unsure whether or not it might be due to repeated random sampling and multiple comparisons within a single population - then further statistical analysis of that same set of data cannot help you resolve the uncertainty.

*

But - given the reality of the problem of multiple comparisons - what to do?

The one and only rigorous answer is (if possible) to check your measurement using new data.

If you are unsure whether Utah and Massachusetts voting patterns really are different, then don't fiddle around with the statistical analysis of that poll - go out and do another poll.

And keep making observations until you are sure enough.

It's called doing science. 

* 