Null hypotheses, statistical significance and p-values
When we test research data statistically, the predominant framework is known as null hypothesis significance testing (NHST). As currently practised, NHST is an amalgam of ideas developed in the first half of the 20th century by Fisher and by Neyman and Pearson. There are many useful open-access tutorials on these approaches (e.g., Perezgonzalez, 2015).
As the name implies, NHST starts with a null hypothesis, usually denoted H0. In the case of the zander study the research questions, as noted above, were whether atmospheric pressure or air temperature would have an effect on an angler's ability to catch zander. These are easily turned into null hypotheses; for example, "air temperature has no effect on zander angling success measures". The measures of angling success I used are common-sense ones (e.g., the number of zander caught in a session per rod used per hour); for full details, you can watch the results and analysis video at the link below. The alternative to the null hypothesis is usually referred to as the alternative hypothesis and is abbreviated HA or H1. H1 is usually the simple negation of H0, for example: air temperature does have an effect on zander angling success. We can also sometimes specify the direction of the effect under H1 (e.g., higher air temperature is associated with higher catching success), or keep it in its non-directional form.
Our study comprised 40 fishing sessions, and measures of air temperature and atmospheric pressure were taken in each, along with measures of angling success. We can therefore use a very well-known index of the possible association between the weather and angling measures: the correlation coefficient. The history of the development of this measure of association was covered in an earlier blog post.
The correlation coefficient is an index of the strength of the relationship between a pair of measured variables (in our case a measure of weather conditions and a measure of angling success). If the variables have no relationship at all then theoretically the correlation coefficient will be zero. If the variables are perfectly related then the correlation will be either -1 or +1. The sign of the correlation simply reflects whether the relationship is a positive one (e.g., higher air temperature is associated with higher angling success) or a negative one (e.g., higher air pressure is associated with lower angling success). If the variables have an intermediate strength of relationship, then the correlation will lie between 0 and +1 (or between 0 and -1). Some examples of data plots with different degrees of correlation can be seen here.
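For reference, the standard (Pearson) formula for the correlation between n paired observations (x_i, y_i) is:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where x-bar and y-bar are the sample means of the two variables.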
We can then compute the correlation coefficient for the data in our study, but deciding what the value obtained implies for H0 and H1 is not straightforward, owing to a variety of issues which we explore further below. One important factor is the contribution of random variation to our measures. Although H0 states there is no relationship between a pair of variables, the value of the correlation coefficient in any specific study won't be exactly 0. Even if H0 were true, then over repeated samples the observed correlation would vary around 0 owing to random variation in the values of each variable in the pair. For example, sometimes random chance factors (such as a shoal of zander being in the area of the lake where I was fishing on a particular night) will have increased the number of fish I caught in a specific session, and at other times other random factors will have decreased the number of fish caught. Whatever the true level of correlation between the number of fish caught and the weather conditions, these random fluctuations will cause variation in the observed level of the correlation coefficient in our study. Moreover, by the so-called law of large numbers, the effects of random variation will generally be less noticeable in large data samples.
To illustrate this I wrote a simple computer programme (in Matlab, available here) to simulate the degree of correlation we can observe between two unrelated variables. In the programme I created two random variables, each independently drawn from a normal distribution. The variables were unrelated (their true correlation coefficient was zero). I then generated 1000 samples, each with 40 measures of each variable (a sample size of 40 matches the fishing study), and recorded the correlation coefficient for each sample. The plot below shows that the observed correlation coefficient for each simulated study lies in a band above or below zero. Most values are close to zero but a few are as high as 0.5 or as low as -0.5. The large variation reflects the fact that a sample size of 40 is not very large. The variation in correlation values about zero would be smaller if we had used a larger sample size, and this can be illustrated by running the programme with larger values for the sample size (see the earlier comment about the law of large numbers). The blue lines in the figure below are drawn so that 2.5% of the simulated correlation coefficients lie above the upper blue line and 2.5% lie below (i.e., are more negative than) the lower blue line.
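For readers who want to experiment, here is a minimal sketch of that kind of simulation, written to run in base Matlab or Octave. It is not the original programme (which is available at the link above); the variable names and the crude percentile calculation are mine.

% A minimal sketch of the simulation described above: two unrelated normal
% variables, sample size 40, repeated 1000 times, recording the correlation.
nSims = 1000;                      % number of simulated "studies"
nObs  = 40;                        % sample size per study (matches the fishing study)
r     = zeros(nSims, 1);
for i = 1:nSims
    x = randn(nObs, 1);            % e.g., a "weather" variable
    y = randn(nObs, 1);            % e.g., an "angling success" variable, unrelated to x
    C = corrcoef(x, y);            % 2x2 correlation matrix
    r(i) = C(1, 2);                % observed correlation for this simulated study
end
% Cut-offs analogous to the blue lines: 2.5% of the simulated correlations
% lie below the lower value and 2.5% lie above the upper value.
rSorted  = sort(r);
lowerCut = rSorted(round(0.025 * nSims));
upperCut = rSorted(round(0.975 * nSims));
fprintf('2.5%%/97.5%% cut-offs: %.3f, %.3f\n', lowerCut, upperCut);
plot(r, '.'); hold on;
plot([1 nSims], [lowerCut lowerCut], 'b-');
plot([1 nSims], [upperCut upperCut], 'b-');
xlabel('Simulated study'); ylabel('Observed correlation');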
Imagine that we got a correlation of +0.3 (or -0.3) in our study. Does this size of correlation coefficient offer evidence in favour of H1 or H0? The graph above shows that we could quite easily get a correlation that far from zero even if H0 were true and there were no real relationship between the pair of variables. The blue lines sit roughly at correlation values of +0.3 and -0.3, indicating that roughly 2.5% of the simulated correlation coefficients lie above +0.3 and another 2.5% lie below -0.3 (so about 5% are at least that far from zero). Remember that the size and range of the correlations in the figure simply reflect chance (random) variation in our pair of measures.
The NHST framework provides a criterion for deciding whether our data tend to indicate that H0 may not be true, and thus imply a real relationship between the variables. NHST does this by setting an acceptable level of so-called Type I error: the percentage of times we are content to be wrong by saying that our data suggest a real relationship when the pattern could simply have arisen under H0. In a single study we are either right or wrong in our interpretation of what the data mean. The notion of a Type I error rate requires us to imagine the hypothetical scenario in which we repeated the experiment a large number of times; it then makes sense to say that, if H0 were true, we would be wrong in about 5% of those experiments. The logic of NHST is based on the expected frequencies of data patterns across large numbers of hypothetical repetitions of an experiment; approaches to statistical analysis like NHST are therefore often referred to as frequentist.
In psychology and many other sciences it is conventional to set the acceptable level of Type I error at 5%, and this is why I drew the blue lines in the above graph to reflect this standard. If we found the correlation in our real data sample to be above the upper blue line (at or above +0.3) or below the lower blue line (at or below -0.3), then it would be common for this to be declared statistically significant, based on the 5% Type I error criterion and ignoring the sign of the correlation. Having declared a value for the correlation coefficient to be "significant", researchers will often say that they can "reject H0", and accept that they will be "wrong 5% of the time". We will see below that such statements are misleading, or even wrong, except under very specific conditions.
Incidentally, this criterion for statistical significance horrifies physicists and engineers, who set a much tougher standard in their work. The 5% criterion in psychology corresponds to a value roughly 2 standard deviations either side of the mean of a normal distribution. In particle physics the usual criterion is called 5-sigma (5 standard deviations), which equates to a two-tailed Type I error criterion of roughly 0.00006% (i.e., almost 100,000 times smaller than the 5% used in psychology). In industrial engineering, manufacturing often operates within tolerances based on a 6-sigma criterion; the corresponding critical p-value is smaller still, at about 0.0000002%. This is why other fields feel that 5% is a pretty lax/generous criterion for declaring a result to be significant.
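These tail probabilities are easy to check for yourself: the two-tailed p-value for a z-sigma criterion is erfc(z/sqrt(2)), which can be computed in base Matlab/Octave in a couple of lines (a sketch, not anything from the original post):

% Two-tailed tail probabilities for 2-, 5- and 6-sigma criteria (base Matlab/Octave).
z = [2 5 6];
pTwoTailed = erfc(z / sqrt(2));                       % two-tailed p-value for each z
fprintf('%d-sigma: p = %.3g (%.7f%%)\n', [z; pTwoTailed; 100 * pTwoTailed]);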
It is important to note that we don't need to run a computer simulation for each study in order to get the values represented by the blue lines in the above graph. If we make a series of assumptions related to the process of data collection, along with other assumptions inherent in the statistical model underpinning the correlation coefficient, plus the assumption that the null hypothesis is true, then we can compute the frequentist probabilities mentioned earlier theoretically. These probabilities summarise how often we would expect to obtain a correlation value at least as far from zero as the one we have observed, given the sample size of our study and assuming H0 was true. These probabilities are widely known as p-values, and are available in any statistical analysis programme one might use to compute the correlation coefficient from some sample data. As we have noted, using the 5% Type I error criterion leads researchers to declare any p-value smaller than 5% (i.e., below 0.05 when expressed as a proportion rather than a percentage) to be "statistically significant". We have illustrated NHST using the correlation coefficient as our "test statistic", but the same principles apply to analyses using any of the other popular test statistics you may have come across: t-tests, ANOVA F-tests, chi-squared tests, and so on.
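As an illustration of how such a p-value typically appears in software, here is a hedged sketch using Matlab's corrcoef function (also available in recent Octave); the variable names and placeholder data below are mine, not the study's:

% Sketch: a correlation and its p-value from base Matlab/Octave corrcoef.
% airTemp and catchRate are hypothetical stand-ins for the study's measures.
airTemp   = randn(40, 1);                  % placeholder data for 40 sessions
catchRate = 0.3 * airTemp + randn(40, 1);  % placeholder data with a built-in relationship
[R, P] = corrcoef(airTemp, catchRate);     % R: correlation matrix, P: p-values under H0
fprintf('r = %.3f, p = %.4f\n', R(1, 2), P(1, 2));
if P(1, 2) < 0.05
    disp('Conventionally declared "statistically significant" at the 5% level');
end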
The continuing controversy surrounding traditional statistical methods
Despite its incredibly widespread usage, the NHST framework has always been the subject of controversy amongst experts. Yet, until quite recently, most researchers carrying out statistical analyses under the NHST approach were generally not aware of (most of) the issues it raises. In the last 1-2 decades, however, it has become widely known that published results in most areas of science often fail to replicate. This has become known as the replication crisis. This is a huge topic in itself, with multiple complex causes and effects; it will probably figure in later blog posts. For the present purpose, it is enough to stress that the poor replicability of scientific findings has often been attributed, at least in part, to misunderstanding and/or deliberate abuse of the NHST framework. The subtleties, complexities and controversial aspects of NHST have therefore never been so prominent, and a large new literature, aimed at educating researchers in general, has emerged. Unfortunately, as we will see, this literature can sometimes be confusing.
To illustrate, let's consider how one of the major learned statistical societies reacted to the renewed interest in criticisms of NHST. Partly in response to the replication crisis, the American Statistical Association (ASA) produced a statement on p-values (Wasserstein & Lazar, 2016). This was the first time they had ever taken an official position on a specific matter of statistical practice. On behalf of the ASA, Wasserstein assembled a group of more than two dozen experts with varying views. After months of discussion by this group, followed by a 2-day meeting and a further 3 months of drafting, the ASA board approved the statement, which was then published (Wasserstein & Lazar, 2016). It is open access and well worth reading in full.
The statement included 6 principles to bear in mind in relation to p-values. I reproduce them below as IMO they offer useful guidance to any researcher. As you read them you might do a double-take because one or more might seem to conflict with what you (thought you) were taught about statistics. The principles given were:
1) P-values can indicate how incompatible the data are with a specified statistical model.
2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4) Proper inference requires full reporting and transparency.
5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Although there was apparently a great deal of disagreement during the discussions leading to the ASA statement, the 6 principles above seem to me to be appropriate and useful. Things did not stop there, however. In 2019, the ASA published a special issue of its in-house journal (The American Statistician) which included "43 innovative and thought-provoking papers from forward-looking statisticians" intended to help researchers navigate the thorny issues thrown up by running statistical analyses. The editorial for this special issue (Wasserstein, Schirm, & Lazar, 2019) was entitled "Moving to a world beyond 'p<0.05'". Those authors went beyond the 2016 ASA statement and said that "it is time to stop using the term “statistically significant” entirely". The President of the ASA became concerned that the 2019 editorial might be interpreted as official ASA policy, and so convened a "task force" of more than a dozen eminent statisticians, who put out a further open-access two-page statement that appeared in several statistical publications (e.g., Benjamini et al., 2021). The 2021 statement concluded that "P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data."
Are you confused? If so, then it is perhaps not surprising. It is clear that the topics are still controversial and experts do not fully agree on the right way forward. So what is the best approach for a student or researcher who is not a statistics expert?
I have found it helpful to read sources which go through the ways in which aspects of NHST have been misinterpreted and misunderstood. I find this helps prevent one from adopting bad habits, or slipping back into traps one had previously learned to avoid. Although I taught stats at university for 30 years, I sometimes find myself saying things about hypotheses, p-values, Type I error rates and so on that aren't exactly correct. Greenland et al. (2016) provide a really clear, expert, open-access paper with an exhaustive list of 25 common misinterpretations of NHST concepts.
I don't propose to go through all 25 of the misinterpretations here. I'll list just a few, trying to pick ones which might provoke the kind of double-take I mentioned earlier. They write each misinterpretation as a statement which researchers often believe is true but actually is not, and for each they explain why it is not true. When you read the examples below, if you find yourself thinking "what? I thought that was true?" then it will probably be worth reading their paper. It should help clarify why you (along with many, many other students and researchers) might have learned/thought something different. Greenland et al. talk about the "test hypothesis" in this paper, for which one can often substitute "null hypothesis"; they were just trying to be as general as possible, as it is possible to test hypotheses which are not null hypotheses.
Here's misinterpretation number 3: "A significant test result (P < 0.05) means that the test hypothesis is false or should be rejected."
Or number 4: "A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted."
Or number 7: "Statistical significance indicates a scientifically or substantively important relation has been detected."
Or number 16: "When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0.05, the results are conflicting."
Greenland et al. also cover some common misinterpretations of confidence intervals, which are often promoted as a better alternative to p-values but are themselves frequently misinterpreted.
Are there alternative statistical approaches to NHST?
A general issue underlying the misinterpretations in the Greenland et al. paper is the frequentist nature of the NHST approach. As stated earlier, NHST can give values for the expected frequencies (i.e., probabilities) of the data pattern observed in a study, if one were to repeat that study a large number of times. Moreover, these probabilities are only valid if the null hypothesis is true and various conditions concerning the data collection process and the mathematics of the statistical test are also met. Thus, under these assumptions, NHST gives us an estimate of p(D|H): the probability of the data (D) given (|) the hypothesis (H). By contrast, scientists intuitively want an answer to a different but related question: what is the probability that my hypothesis is true given the data I have collected? We can write this probability as p(H|D). The common problems with interpreting NHST analyses often derive from trying to use their results to answer a question about probabilities that the analyses were not designed to address.
An analogy might help here: imagine you have a pack of cards. You can ask a question of the following kind: what is the probability of drawing n black cards in a sample of k picks from the pack, assuming that it is a standard deck of cards and the person drawing the cards is not using any special magic techniques? This is similar to the kind of probability question that NHST is designed to answer (and the kind of assumptions involved). However, the scientist generally wants to ask a different question; namely, what is the probability that the pack is a standard deck, given that n black cards have been drawn in a sample of k picks from the pack (again under certain assumptions)?
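To make the first ("forward") question concrete, here is a small worked sketch in base Matlab/Octave; the values of n and k are mine, chosen purely for illustration:

% The probability of drawing exactly n black cards in k picks (without
% replacement) from a standard 52-card deck containing 26 black cards.
n = 5;  k = 6;                                        % illustrative values only
pForward = nchoosek(26, n) * nchoosek(26, k - n) / nchoosek(52, k);
fprintf('P(%d black in %d picks | standard deck) = %.4f\n', n, k, pForward);
% This is p(D|H): the probability of the data given the "standard deck" hypothesis.
% It says nothing directly about p(H|D), the probability that the deck is standard
% given the cards drawn; answering that requires Bayes' theorem and a prior.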
Historically, the latter type of probability was known as the inverse probability. Attempts to work out how one might estimate inverse probabilities have a long history, far longer than that of the frequentist methods described earlier. A major impetus came from the publication in 1763 of the famous essay by Thomas Bayes and Richard Price which laid out Bayes' Theorem. However, it wasn't until the mid 20th century that the conceptual framework for estimating "inverse probability" generally became referred to as Bayesian statistics or Bayesian inference (see Fienberg, 2006, for a fascinating historical account). Because Bayesian statistics more directly estimate the kinds of probabilities that scientists want to know, they are increasingly popular as an alternative to NHST: Bayesians argue they circumvent many or all of the issues with frequentist approaches such as NHST. This is a huge debate in statistics that I won't get into here. However, as explained above, I thought it might be interesting to analyse the zander study (you remember the one?) using simple Bayesian methods.
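For reference, Bayes' Theorem, written in the notation used above, is:

$$ p(H \mid D) = \frac{p(D \mid H)\,p(H)}{p(D)} $$

Moving from p(D|H) to p(H|D) therefore requires a prior probability for the hypothesis, p(H); supplying and justifying that prior is the distinctive (and most debated) ingredient of the Bayesian approach.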
The Results of the Zander Study
You can watch the final short video, which describes the analysis and results of the fishing study. The data are analysed using a Bayesian correlation analysis, reporting the so-called Bayes factors associated with these correlations. Fairly brief introductions to these methods are given in the video. We will take a much deeper dive into Bayesian statistics in later blog posts.
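For readers who haven't met them before (the video gives a fuller introduction), a Bayes factor simply compares how well the data are predicted under the two competing hypotheses:

$$ \mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} $$

Values well above 1 favour H1, values well below 1 favour H0, and values near 1 indicate that the data do not discriminate between the two hypotheses.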
Spoiler alert: the text below gives the results of the zander study. So, don't read any further right now if you want to watch the video to find out what the data showed. The video also describes the many limitations of the study.
Tl;dr (or should that be Tl;dw?): there was quite strong evidence from the study that air temperature was positively associated with my zander-catching success rate (i.e., I tended to catch more zander in sessions with higher air temperatures). By contrast, there was little evidence from the study that atmospheric pressure had any effect on zander catching. It should be noted that if there was an effect of air pressure, it was not strong enough to show up in the small sample of 40 fishing sessions with the methods I used. If anything, the Bayes factors pointed weakly towards the null hypothesis for the atmospheric pressure variable. I have posted the zander data file (available here), so anyone interested can verify the results. I might well return to this dataset in future blog posts.
Catching a Zander in January 2023
In the latter part of 2022 it was announced locally that the owner of the Old Lake at Bury Hill was going to sell his lovely waterside home and the lake. The bad news was that fishing there would stop once the sale was completed (most likely sometime early in 2023), after 60+ years of being a public fishing venue. This threw me into a slight fit of data anxiety because my spreadsheet showed that I had caught 199 zander since October 1st 2016, 158 of them as part of the above study. Therefore, when I came up with the fishing-and-reading challenge as the New Year beckoned, Bury Hill Old Lake had to be the first venue, and the zander had to be the first species of 2023.
Given the results from my study, I checked the predictions for the air temperature. It was fortunately very mild in early January in my part of Surrey (above 10 degrees centigrade until 9 pm), so January 5th was set as the evening for my first attempt. Eventually, sometime after dark, my 200th zander was caught: see the picture below. It was a typical example from Bury Hill, about 5lb in weight. Please note that UK fishermen always give their fish details in Imperial measures; this is not some kind of post-Brexit thing.
In all likelihood, this will be my last ever fish caught from the Old Lake. It was my first fish of 2023, and the only one I caught that night. This left 364 more fish to catch to achieve part (i) of the 2023 challenge. The 1st species caught for part (ii) was therefore a zander, leaving 24 more species to catch. This meant I could move on to read my second book of the year (to be reviewed in a later blog), leaving just that book and 23 more to be read to complete part (iii) of the challenge. Note to self: my blogging is lagging well behind; get your typing fingers out!
References
Benjamini, Y., De Veaux, R. D., Efron, B., Evans, S., Glickman, M., Graubard, B. I., He, X., Meng, X.-L., Reid, N. M., Stigler, S. M., Vardeman, S. B., Wikle, C. K., Wright, T., Young, L. J., & Kafadar, K. (2021). ASA President’s Task Force Statement on Statistical Significance and Replicability. Chance, 34(4), 10–11.
Fienberg, S. E. (2006). When did Bayesian inference become “Bayesian”? Bayesian Analysis, 1(1), 1–40.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6, 223.
The Angling Trust. (2021). Zander: A balanced approach. A position paper by the Angling Trust.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19.
Again I need to check that the Matlab code runs under Octave (the free clone).
A recent short commentary by Daniel Lakens shows why p-values are not measures of evidence, as is sometimes claimed. It is an open-access publication, available to all. See:
Lakens, D. (2022). Why P values are not measures of evidence. Trends in Ecology & Evolution, 37(4), 289–290.