What is the difference between confidence level and probability?

You may have figured out already that statistics isn’t exactly a science. Lots of terms are open to interpretation, and sometimes there are many words that mean the same thing—like “mean” and “average”—or sound like they should mean the same thing, like significance level and confidence level.  

Although they sound very similar, significance level and confidence level are in fact two completely different concepts. Confidence levels and confidence intervals also sound like they are related; they are usually used in conjunction with each other, which adds to the confusion. However, they have very different meanings.

In a nutshell, here are the definitions for all three.

  1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
  2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. The confidence level = 1 – alpha.
  3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest (for example, an average response). Confidence intervals are constructed using significance levels / confidence levels.

In the following sections, I’ll delve into what each of these definitions means in (relatively) plain language.

Confidence Level vs. Confidence Interval

When a confidence interval (CI) and confidence level (CL) are put together, the result is a statistically sound spread of data. For example, a result might be reported as "50% ± 6%, with 95% confidence". Let's break the statistic into its individual parts:

  • The confidence interval: 50% ± 6% = 44% to 56%
  • The confidence level: 95%

Confidence intervals are intrinsically connected to confidence levels. Confidence levels are expressed as a percentage (for example, a 90% confidence level). If you repeated an experiment or survey with a 90% confidence level, you would expect that 90% of the time your results would match the results you should get from the population. Confidence intervals are the range of results where you would expect the true value to fall. For example, suppose you survey a group of children to see how many in-app purchases they make in a year. Your test is at the 99% confidence level, and the result is a confidence interval of (250, 300). That means you estimate that they buy between 250 and 300 in-app items a year, and you're confident that if the survey were repeated, 99% of the time the results would be the same.

Let’s delve a little more into both terms.

1. The Confidence Interval

This Gallup poll states both a CI and a CL. The result of the poll concerns answers to claims that the 2016 presidential election was "rigged", with two in three Americans (66%) saying prior to the election "…that they are 'very' or 'somewhat confident' that votes will be cast and counted accurately across the country." Further down in the article is more information about the statistic: "The margin of sampling error is ±6 percentage points at the 95% confidence level."

Let's take the stated percentage first. The "66%" result is only part of the picture. It's an estimate, and if you're just trying to get a general idea about people's views on election rigging, then 66% should be good enough for most purposes, like a speech, a newspaper article, or passing along the information to your Uncle Albert, who loves a good political discussion. However, you might be interested in knowing how good that estimate actually is. For example, the real figure might be somewhere between 46% and 86% (which would make it a poor estimate), or the pollsters could have a very accurate figure: between, say, 64% and 68%. That spread of percentages (from 46% to 86%, or from 64% to 68%) is the confidence interval. But how good is this specific poll? The answer is in this line:

“The margin of sampling error is ±6 percentage points…”

What this margin of error tells us is that the reported 66% could be off by 6 percentage points either way. So our confidence interval is actually 66% plus or minus 6%, giving a possible range of 60% to 72%.
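If you want to see that arithmetic spelled out, here is a minimal Python sketch. The interval itself uses only the numbers quoted above; the sample size n is hypothetical (the poll's n isn't quoted in this article), and the margin-of-error formula shown is the usual normal approximation for a proportion at roughly 95% confidence.

    # Turn the reported estimate and margin of error into an interval.
    import math

    p_hat = 0.66        # reported share of respondents
    margin = 0.06       # reported margin of sampling error
    print(f"CI: {p_hat - margin:.0%} to {p_hat + margin:.0%}")   # 60% to 72%

    # For a proportion, the ~95% margin of error is roughly
    # 1.96 * sqrt(p_hat * (1 - p_hat) / n).
    n = 250             # hypothetical sample size, for illustration only
    moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    print(f"Margin of error with n = {n}: ±{moe:.1%}")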

2. The Confidence Level

Again, the above information is probably good enough for most purposes. But, for the sake of science, let's say you wanted to get a little more rigorous. Just because one poll reports a certain result doesn't mean that it's an accurate reflection of public opinion as a whole. In fact, many polls from different companies report different results for the same population, mostly because sampling (i.e., asking a fraction of the population instead of the whole) is never an exact science.

To make the poll results statistically sound, you want to know: if the poll were repeated (over and over), would the results be the same? Enter the confidence level. The confidence level states how confident you are that your results (whether from a poll, test, or experiment) can be repeated ad infinitum with the same result. In a perfect world, you would want your confidence level to be 100%. In other words, you want to be 100% certain that if a rival polling company, a public entity, or Joe Smith off the street were to perform the same poll, they would get the same results. But this is statistics, and nothing is ever 100%; usually, confidence levels are set at 90% to 98%.

For this particular example, Gallup reported a "95% confidence level," which means that if the poll were repeated, Gallup would expect the same results 95% of the time.

  • A 0% confidence level means you have no faith at all that if you repeated the survey you would get the same results. In fact, you're sure the results would be completely different.
  • A 100% confidence level means there is no doubt at all that if you repeated the survey you would get the same results. The results would be repeatable 100% of the time.
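The textbook repeated-sampling reading of a 95% confidence level (the same interpretation used in the article excerpted later on this page) is that about 95% of the intervals built this way would cover the true population value. Here is a small simulation sketch of that idea; the population mean, standard deviation, and sample size are invented for illustration.

    # Draw many samples from a population with a known mean and count how often
    # the 95% interval around each sample mean covers that true mean.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, sd, n, trials = 100.0, 15.0, 50, 10_000

    covered = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sd, n)
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
        covered += lo <= true_mean <= hi

    print(f"Intervals covering the true mean: {covered / trials:.1%}")  # close to 95%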

Confidence Level vs. Significance Level

Above, I defined a confidence level as answering the question: "…if the poll/test/experiment were repeated (over and over), would the results be the same?" In essence, confidence levels deal with repeatability. Significance levels, on the other hand, have nothing at all to do with repeatability. They are set at the beginning of a specific type of experiment (a "hypothesis test") and are controlled by you, the researcher.

The significance level (also called the alpha level) is a term used in hypothesis testing. More specifically, it's the probability of making the wrong decision when the null hypothesis is true. In statistical speak, another way of saying this is that it's your probability of making a Type I error.
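To make that concrete, here is a small simulation sketch of the significance level as a Type I error rate: when the null hypothesis is true (both groups really come from the same population), a test run at alpha = 0.05 rejects the null about 5% of the time. All the numbers are invented for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    trials, rejections = 10_000, 0
    for _ in range(trials):
        a = rng.normal(0, 1, 30)
        b = rng.normal(0, 1, 30)      # same distribution, so H0 is true
        _, p = stats.ttest_ind(a, b)
        rejections += p < 0.05        # wrongly rejecting a true H0 = Type I error

    print(f"False rejection rate: {rejections / trials:.1%}")  # close to 5%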

Constructing Confidence Intervals with Significance Levels

Using the normal distribution, you can create a confidence interval for any significance level with this formula:

sample statistic ± z* × (standard error)

(where z* is the multiplier, i.e., the critical value from the z-table)

Confidence intervals are constructed around a point estimate (like the mean) using statistical tables (e.g., the z-table or t-table), which give known ranges for normally distributed data. Normally distributed data are preferable because they behave in a known way, with a certain percentage of the data falling a certain distance from the mean. For example, a point estimate will fall within 1.96 standard deviations of the mean about 95% of the time.
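Here is a minimal sketch of the formula above applied to a sample mean, with invented data. It uses the z multiplier from the normal distribution; for small samples the t multiplier is the usual substitute.

    import numpy as np
    from scipy import stats

    data = np.array([12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 9.5, 10.9, 11.3])

    mean = data.mean()
    se = data.std(ddof=1) / np.sqrt(len(data))   # standard error of the mean
    z = stats.norm.ppf(1 - 0.05 / 2)             # multiplier for alpha = 0.05, about 1.96

    print(f"95% CI: {mean - z * se:.2f} to {mean + z * se:.2f}")
    # Small-sample version of the multiplier: stats.t.ppf(1 - 0.05 / 2, df=len(data) - 1)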

If you’re interested more in the math behind this idea, how to use the formula, and constructing confidence intervals using significance levels, you can find a short video on how to find a confidence interval here.

Finally, if all of this sounds like Greek to you, you can read more about significance levels, Type 1 errors and hypothesis testing in this article.

References

Update: Americans’ Confidence in Voting, Election

Review Article

Part 4 of a Series on Evaluation of Scientific Publications

Jean-Baptist du Prel (1), Gerhard Hommel (2), Bernd Röhrig (2), Maria Blettner (2)

1 Johannes Gutenberg-Universität Mainz: Zentrum für Kinder- und Jugendmedizin, Zentrum Präventive Pädiatrie
2 Johannes Gutenberg-Universität Mainz: Institut für Medizinische Biometrie, Epidemiologie und Informatik

An understanding of p-values and confidence intervals is necessary for the evaluation of scientific articles. This article will inform the reader of the meaning and interpretation of these two statistical concepts.

The uses of these two statistical concepts and the differences between them are discussed on the basis of a selective literature search concerning the methods employed in scientific articles.

P-values in scientific studies are used to determine whether a null hypothesis formulated before the performance of the study is to be accepted or rejected. In exploratory studies, p-values enable the recognition of any statistically noteworthy findings. Confidence intervals provide information about a range in which the true value lies with a certain degree of probability, as well as about the direction and strength of the demonstrated effect. This enables conclusions to be drawn about the statistical plausibility and clinical relevance of the study findings. It is often useful for both statistical measures to be reported in scientific articles, because they provide complementary types of information.

Keywords: publications, clinical research, p-value, statistics, confidence interval

People who read scientific articles must be familiar with the interpretation of p-values and confidence intervals when assessing the statistical findings. Some will have asked themselves why a p-value is given as a measure of statistical probability in certain studies, while other studies give a confidence interval and still others give both. The authors explain the two parameters on the basis of a selective literature search and describe when p-values or confidence intervals should be given. The two statistical concepts will then be compared and evaluated.

In confirmatory (evidential) studies, null hypotheses are formulated, which are then rejected or retained with the help of statistical tests. The p-value is a probability, which is the result of such a statistical test. This probability reflects the measure of evidence against the null hypothesis. Small p-values correspond to strong evidence. If the p-value is below a predefined limit, the results are designated as "statistically significant" (1). The phrase "statistically striking results" is also used in exploratory studies.

If it is to be shown that a new drug is better than an old one, the first step is to show that the two drugs are not equivalent. Thus, the hypothesis of equality is to be rejected. The null hypothesis (H0) to be rejected is then formulated in this case as follows: "There is no difference between the two treatments with respect to their effect." For example, there might be no difference between two antihypertensives with respect to their ability to reduce blood pressure. The alternative hypothesis (H1) then states that there is a difference between the two treatments. This can either be formulated as a two-tailed hypothesis (any difference) or as a one-tailed hypothesis (positive or negative effect). In this case, the expression "one-tailed" means that the direction of the expected effect is laid down when the alternative hypothesis is formulated. For example, if there is clear preliminary evidence that an antihypertensive has on average a stronger hypotensive effect than the comparator drug, the alternative hypothesis can be formulated as follows: "The difference between the mean hypotensive activity of antihypertensive 1 and the mean hypotensive activity of antihypertensive 2 is positive." However, as this requires plausible assumptions about the direction of the effect, the two-tailed hypothesis is often formulated.

For example, the data from a randomized clinical study are to be used to estimate the effect strength relevant to the question to be answered. This could, for example, be the difference between the mean decrease in blood pressure with a new and with an old antihypertensive. On this basis, the null hypothesis formulated in advance is tested with the help of a significance test. The p-value gives the probability of obtaining the present test result—or an even more extreme one—if the null hypothesis is correct. A small p-value signifies that the probability is small that the difference can purely be assigned to chance. In our example, the observed difference in mean systolic pressure might not be due to a real difference in the hypotensive activity of the two antihypertensives, but might be due to chance. However, if the p-value is < 0.05, the chance that this is the case is under 5%. To permit a decision between the null hypothesis and the alternative hypothesis, significance limits are often specified in advance, at a level of significance α. The level of significance of 0.05 (or 5%) is often chosen. If the p-value is less than this limit, the result is significant and it is agreed that the null hypothesis should be rejected and the alternative hypothesis—that there is a difference—is accepted. The specification of the level of significance also fixes the probability that the null hypothesis is wrongly rejected.
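As a rough illustration of the procedure just described, the sketch below runs a two-sample t-test on simulated blood-pressure reductions for two antihypertensives; none of the numbers come from a real study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    drug_new = rng.normal(12.0, 8.0, 60)   # hypothetical reductions in mm Hg
    drug_old = rng.normal(8.0, 8.0, 60)

    # Two-tailed H0: no difference in mean blood-pressure reduction.
    t_stat, p_value = stats.ttest_ind(drug_new, drug_old)
    print(f"p = {p_value:.3f}")
    if p_value < 0.05:                     # significance level fixed in advance
        print("Reject H0 at the 5% significance level.")
    # A one-tailed test would pass alternative='greater' to ttest_ind.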

P-values alone do not permit any direct statement about the direction or size of a difference or of a relative risk between different groups (1). However, this would be particularly useful when the results are not significant (2). For this purpose, confidence limits contain more information. Aside from p-values, at least a measure of the effect strength must be reported—for example, the difference between the mean decreases in blood pressure in the two treatment groups (3). In the final analysis, the definition of a significance limit is arbitrary and p-values can be given even without a significance limit being selected. The smaller the p-value, the less plausible is the null hypothesis that there is no difference between the treatment groups.

The confidence interval is a range of values calculated by statistical methods which includes the desired true parameter (for example, the arithmetic mean, the difference between two means, the odds ratio etc.) with a probability defined in advance (coverage probability, confidence probability, or confidence level). The confidence level of 95% is usually selected. This means that the confidence interval covers the true value in 95 of 100 studies performed (4, 5). The advantage of confidence limits in comparison with p-values is that they reflect the results at the level of data measurement (6). For instance, the lower and upper limits of the mean systolic blood pressure difference between the two treatment groups are given in mm Hg in our example.

The size of the confidence interval depends on the sample size and the standard deviation of the study groups (5). If the sample size is large, this leads to "more confidence" and a narrower confidence interval. If the confidence interval is wide, this may mean that the sample is small. If the dispersion is high, the conclusion is less certain and the confidence interval becomes wider. Finally, the size of the confidence interval is influenced by the selected level of confidence. A 99% confidence interval is wider than a 95% confidence interval. In general, with a higher probability to cover the true value the confidence interval becomes wider.
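A small sketch of these three influences on interval width, using the z-based interval for a single mean and invented values:

    import numpy as np
    from scipy import stats

    def ci_width(sd, n, level):
        # Width of a z-based interval for a mean: 2 * z * sd / sqrt(n)
        z = stats.norm.ppf(1 - (1 - level) / 2)
        return 2 * z * sd / np.sqrt(n)

    print(ci_width(sd=10, n=25,  level=0.95))   # baseline
    print(ci_width(sd=20, n=25,  level=0.95))   # higher dispersion -> wider
    print(ci_width(sd=10, n=100, level=0.95))   # larger sample -> narrower
    print(ci_width(sd=10, n=25,  level=0.99))   # higher confidence level -> wider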

In contrast to p-values, confidence intervals indicate the direction of the effect studied. Conclusions about statistical significance are possible with the help of the confidence interval. If the confidence interval does not include the value of zero effect, it can be assumed that there is a statistically significant result. In the example of the difference of the mean systolic blood pressure between the two treatment groups, the question is whether the value 0 mm Hg is within the 95% confidence interval (= not significant) or outside it (= significant). The situation is equivalent with the relative risk; if the confidence interval contains the relative risk of 1.00, the result is not significant. It would then have to be examined whether the confidence interval for the relative risk is completely under 1.00 (= protective effect) or completely above it (= increase in risk).
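The "does the interval contain the no-effect value?" check can be written in a couple of lines; the intervals below are invented for illustration.

    def significant(ci, no_effect):
        lo, hi = ci
        return not (lo <= no_effect <= hi)

    print(significant((1.2, 7.8), no_effect=0.0))    # mean difference in mm Hg: significant
    print(significant((-0.5, 6.3), no_effect=0.0))   # interval contains 0: not significant
    print(significant((0.7, 1.4), no_effect=1.0))    # relative risk interval contains 1: not significant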

Figure 1 shows the difference for the example of the mean systolic blood pressure difference between two groups. The confidence interval for the mean blood pressure difference is narrow with small variation within the sample (= low dispersion) (figure 1b), low confidence level (figure 1d) and large sample size (figure 1f). In this example, there is no significant difference between the mean systolic blood pressures in the groups if the dispersion is high (figure 1c), the confidence level is high (figure 1e) or the sample size is small (figure 1g), as the value zero is then contained in the confidence interval.

Figure 1: Using the example of the difference in the mean systolic blood pressure between two groups, the figure shows how the size of the confidence interval (a) can be modified by changes in dispersion (b, c), confidence level (d, e), and sample size (f, g). The difference between the mean systolic blood pressure in group 1 (150 mm Hg) and group 2 (145 mm Hg) was 5 mm Hg. Example modified from (6).

Although point estimates, such as the arithmetic mean, the difference between two means or the odds ratio, provide the best approximation to the true value, they do not provide any information about how exact they are. This is achieved by confidence intervals. It is of course impossible to make any precise statement about the size of the difference between the estimated parameters for the sample and the true value for the population, as the true value is unknown. However, one would like to have some confidence that the point estimate is in the vicinity of the true value (7). Confidence intervals can be used to describe the probability that the true value is within a given range.

If a confidence interval is given, several conclusions can be made. Firstly, values below the lower limit or above the upper limit are not excluded, but are improbable. With the confidence limit of 95%, each of these probabilities is only 2.5%. Values within the confidence limits, but near to the limits, are mostly less probable than values near the point estimate, which in our example with the two antihypertensives is the difference in the mean values of the reduction in blood pressure in the two treatment groups in mm Hg. Whatever the size of the confidence interval, the point estimate based on the sample is the best approximation to the true value for the population. Values in the vicinity of the point estimate are mostly plausible values. This is particularly the case if it can be assumed that the values are normally distributed.

A frequent procedure is to check whether confidence intervals include a certain limit or not and, if they do not, to regard the findings as being significant. It is however a better approach to exploit the additional information in confidence intervals. Particularly with so-called close results, the possibility should be considered that the result might have been significant with a larger sample.

Important international journals of medical science, such as the Lancet and the British Medical Journal, as well as the International Committee of Medical Journal Editors (ICMJE), recommend the use of confidence intervals (6). In particular, confidence intervals are of great help in interpreting the results of randomized clinical studies and meta-analyses. Thus the use of confidence intervals is expressly demanded in international agreements and in the CONSORT statement (8) for reporting randomized clinical studies and in the QUOROM statement (9) for reporting systematic reviews.

A clear distinction must be made between statistical significance and clinical relevance (or clinical significance). Aside from the effect strength, p-values incorporate the case numbers and the variability of the sample data. Even if the limit for statistical significance is laid down in advance, the reader must still judge the clinical relevance of statistically significant differences for himself. The same numerical value for the difference may be "statistically significant" if a large sample is taken and "not significant" if the sample is smaller. On the other hand, results of high clinical relevance are not automatically unimportant if there is no statistical significance. The cause may be that the sample is too small or that the dispersion in the samples is too great, for example, if the patient group is highly heterogeneous. For this reason, a decision for significance or lack of significance on the basis of the p-value alone may be simplistic.
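A brief simulation sketch of the sample-size point made above: the same underlying difference in mean systolic pressure tends to come out "not significant" with a small sample and "significant" with a large one. The group means and spread are invented.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    for n in (15, 500):
        a = rng.normal(120, 15, n)      # same true 3 mm Hg difference in both runs
        b = rng.normal(123, 15, n)
        _, p = stats.ttest_ind(a, b)
        print(f"n = {n}: p = {p:.3f}")  # small n: usually p > 0.05; large n: usually p < 0.05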

This can be illustrated using the example of systolic blood pressure. Figure 2 specifies a relevance limit r. A systolic blood pressure difference of at least 4 mm Hg between the two groups is then defined as clinically relevant. If the blood pressure difference is neither statistically significant nor clinically relevant (figure 2a) or statistically significant and clinically relevant (figure 2b), interpretation is easy. However, statistically significant differences in blood pressure may lie under the limit for clinical relevance and are then of no clinical importance (figure 2c). On the other hand, there may be real and clinically important differences in systolic blood pressure between the treatment groups, even though statistical significance has not been achieved (figure 2d).

Unfortunately, statistical significance is often thought to be equivalent to clinical relevance. Many research workers, readers, and journals ignore findings which are potentially clinically useful only because they are not statistically significant (4). At this point, we can criticize the practice of some scientific journals of preferably publishing significant results. A study has shown that this is mainly the case in high–impact factor journals (10). This can distort the facts ("publication bias"). Moreover, it can often be seen that a non-significant difference is interpreted as meaning that there is no difference (for example, between two treatment groups). A p-value of >0.05 only signifies that the evidence is not adequate to reject the null hypothesis—for example, that there is no difference between two alternative treatments. This does not imply that the two treatments are equivalent. The quantitative compilation of comparable studies in the form of systematic reviews or meta-analyses can then help to identify differences which had not been recognized because the number of cases in individual studies had been too low. A special article in this series is devoted to this subject.

The essential differences between p-values and confidence intervals are as follows:

  • The advantage of confidence intervals in comparison to giving p-values after hypothesis testing is that the result is given directly at the level of data measurement. Confidence intervals provide information about statistical significance, as well as the direction and strength of the effect (11). This also allows a decision about the clinical relevance of the results. If the error probability is given in advance, the size of the confidence interval depends on the data variability and the case number in the sample examined (12).

  • P-values are clearer than confidence intervals. It can be judged whether a value is greater or less than a previously specified limit. This allows a rapid decision as to whether a value is statistically significant or not. However, this type of "diagnosis on sight" can be misleading, as it can lead to clinical decisions solely based on statistics.

  • Hypothesis testing using a p-value is a binary (yes-or-no) decision. The reduction of statistical inference (inductive inference from a single sample to the total population) to this level may be simplistic. The simple distinction between "significant" and "non-significant" in isolation is not very reliable. For example, there is little difference between the evidence for p-values of 0.04 and of 0.06. Nevertheless, binary decisions based on these minor differences lead to converse decisions (1, 13). For this reason, p-values must always be given completely (suggestion: always to three decimal places) (14).

  • When a point estimate is used (for example, difference in means, relative risk), an attempt is made to draw conclusions about the situation in the target population on the basis of only a single value for the sample. Even though this figure is the best possible approximation to the true value, it is not very probable that the values are exactly the same. In contrast, confidence intervals provide a range of possible plausible values for the target population, as well as the probability with which this range covers the real value.

  • In contrast to confidence intervals, p-values give the difference from a previously specified statistical level α (15). This facilitates the evaluation of a "close" result.

  • Statistical significance must be distinguished from medical relevance or biological importance. If the sample size is large enough, even very small differences may be statistically significant (16, 17). On the other hand, even large differences may lead to non-significant results if the sample is too small (12). However, the investigator should be more interested in the size of the difference in therapeutic effect between two treatment groups in clinical studies, as this is what is important for successful treatment, rather than whether the result is statistically significant or not (18).

Taken in isolation, p-values provide a measure of the statistical plausibility of a result. With a defined level of significance, p-values allow a decision about the rejection or maintenance of a previously formulated null hypothesis in confirmatory studies. Only very restricted statements about effect strength are possible on the basis of p-values. Confidence intervals provide an adequately plausible range for the true value related to the measurement of the point estimate. Statements are possible on the direction of the effects, as well as its strength and the presence of a statistically significant result. In conclusion, it should be clearly stated that p-values and confidence intervals are not contradictory statistical concepts. If the size of the sample and the dispersion or a point estimate are known, confidence intervals can be calculated from p-values, and conversely. The two statistical concepts are complementary.
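The conversion mentioned just above can be sketched as follows for an approximately normally distributed estimate: recover the standard error from a 95% confidence interval, turn it into a z statistic and a two-sided p-value, and rebuild the interval from the estimate and standard error. The estimate and interval are invented.

    from scipy import stats

    estimate, lo, hi = 5.0, 1.1, 8.9      # e.g. a mean difference with its 95% CI

    se = (hi - lo) / (2 * 1.96)           # standard error implied by a 95% CI
    z = estimate / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    print(f"two-sided p is about {p:.3f}")

    # And back again: estimate ± 1.96 * SE reproduces the interval.
    print(f"95% CI: {estimate - 1.96 * se:.1f} to {estimate + 1.96 * se:.1f}")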

Translated from the original German by Rodney A. Yeates, M.A., Ph.D.

Conflict of interest statement

The authors declare that there is no conflict of interest as defined by the guidelines of the International Committee of Medical Journal Editors.

1. Bland M, Peacock J. Interpreting statistics with confidence. The Obstetrician and Gynaecologist. 2002;4:176–180.

2. Houle TT. Importance of effect sizes for the accumulation of knowledge. Anesthesiology. 2007;106:415–417.

3. Faller H. Signifikanz, Effektstärke und Konfidenzintervall. Rehabilitation. 2004;43:174–178.

4. Greenfield ML, Kuhn JE, Wojtys EM. A statistics primer. Confidence intervals. Am J Sports Med. 1998;26:145–149. Erratum in: Am J Sports Med 1999;27:544.

5. Bender R, Lange St. Was ist ein Konfidenzintervall? Dtsch Med Wschr. 2001;126.

6. Altman DG. Confidence intervals in practice. In: Altman DG, Machin D, Bryant TN, Gardner MJ, editors. BMJ Books; 2002. pp. 6–9.

7. Weiß C. Intervallschätzungen: Die Bedeutung eines Konfidenzintervalls. In: Basiswissen Medizinische Statistik. Springer Verlag; 1999. pp. 191–192.

8. Moher D, Schulz KF, Altman DG, für die CONSORT Gruppe. Das CONSORT Statement: Überarbeitete Empfehlungen zur Qualitätsverbesserung von Reports randomisierter Studien im Parallel-Design. Dtsch Med Wschr. 2004;129:16–20.

9. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF. Improving the quality of reports of meta-analyses of randomized controlled trials: the QUOROM statement. Quality of Reporting of Meta-analyses. Lancet. 1999;354:1896–1900.

10. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867–872.

11. Shakespeare TP, Gebski VJ, Veness MJ, Simes J. Improving interpretation of clinical studies by use of confidence levels, clinical significance curves, and risk-benefit contours. Lancet. 2001;357:1349–1353.

12. Gardner MJ, Altman DG. Confidence intervals rather than P-values: estimation rather than hypothesis testing. Br Med J. 1986;292:746–750.

13. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 1. Hypothesis testing. CMAJ. 1995;152:27–32.

14. ICH E9: Statistical Principles for Clinical Trials. International Conference on Harmonization; London, UK. 1998. Adopted by CPMP July 1998 (CPMP/ICH/363/96).

15. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355–360.

16. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 2. Interpreting study results: confidence intervals. CMAJ. 1995;152:169–173.

17. Sim J, Reid N. Statistical inference by confidence intervals: issues of interpretation and utilization. Phys Ther. 1999;79:186–195.

18. Gardner MJ, Altman DG. Confidence intervals rather than P values. Confidence intervals and statistical guidelines. In: Altman DG, Machin D, Bryant TN, Gardner MJ, editors. Statistics with Confidence. 2nd ed. BMJ Books; 2002. pp. 15–27.