Hypothesis Tests in R
This tutorial covers basic hypothesis testing in R.
- Normality tests
  - Shapiro-Wilk normality test
  - Kolmogorov-Smirnov test
- Comparing central tendencies: Tests with continuous / discrete data
  - One-sample t-test: Normally-distributed sample vs. expected mean
  - Two-sample t-test: Two normally-distributed samples
  - Wilcoxon rank sum: Two non-normally-distributed samples
  - Weighted two-sample t-test: Two continuous samples with weights
- Comparing proportions: Tests with categorical data
  - Chi-squared goodness of fit test: Sampled frequencies of categorical values vs. expected frequencies
  - Chi-squared independence test: Two sampled frequencies of categorical values
  - Weighted chi-squared independence test: Two weighted sampled frequencies of categorical values
- Comparing multiple groups: Tests with categorical and continuous / discrete data
  - Analysis of Variance (ANOVA): Normally-distributed samples in groups defined by categorical variable(s)
  - Kruskal-Wallis One-Way Analysis of Variance: Nonparametric test of the significance of differences between two or more groups
Hypothesis Testing
Science is "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method" (Merriam-Webster 2022).
The idealized world of the scientific method is question-driven, with the collection and analysis of data determined by the formulation of research questions and the testing of hypotheses. Hypotheses are tentative assumptions about what the answers to your research questions may be.
- Formulate questions: How can I understand some phenomenon?
- Literature review: What does existing research say about my questions?
- Formulate hypotheses: What do I think the answers to my questions will be?
- Collect data: What data can I gather to test my hypothesis?
- Test hypotheses: Does the data support my hypothesis?
- Communicate results: Who else needs to know about this?
- Formulate questions: Frame missing knowledge about a phenomenon as research question(s).
- Literature review: A literature review is an investigation of what existing research says about the phenomenon you are studying. A thorough literature review is essential to identify gaps in existing knowledge you can fill, and to avoid unnecessarily duplicating existing research.
- Formulate hypotheses: Develop possible answers to your research questions.
- Collect data: Acquire data that supports or refutes the hypothesis.
- Test hypotheses: Run tools to determine if the data corroborates the hypothesis.
- Communicate results: Share your findings with the broader community that might find them useful.
While the process of knowledge production is, in practice, often more iterative than this waterfall model, the testing of hypotheses is usually a fundamental element of scientific endeavors involving quantitative data.
The Problem of Induction
The scientific method looks to the past or present to build a model that can be used to infer what will happen in the future. General knowledge asserts that given a particular set of conditions, a particular outcome will or is likely to occur.
The problem of induction is that we cannot be 100% certain that what we are assuming is a general principle is not, in fact, specific to the particular set of conditions when we made our empirical observations. We cannot prove that such principles will hold true under future conditions or in different locations that we have not yet experienced (Vickers 2014).
The problem of induction is often associated with the 18th-century British philosopher David Hume. This problem is especially vexing in the study of human beings, where behaviors are a function of complex social interactions that vary over both space and time.
Falsification
One way of addressing the problem of induction was proposed by the 20th-century Viennese philosopher Karl Popper.
Rather than try to prove a hypothesis is true, which we cannot do because we cannot know all possible situations that will arise in the future, we should instead concentrate on falsification, where we try to find situations where a hypothesis is false. While you cannot prove your hypothesis will always be true, you only need to find one situation where the hypothesis is false to demonstrate that the hypothesis can be false (Popper 1962).
If a hypothesis is not demonstrated to be false by a particular test, we have corroborated that hypothesis. While corroboration does not "prove" anything with 100% certainty, by subjecting a hypothesis to multiple tests that fail to demonstrate that it is false, we can have increasing confidence that our hypothesis reflects reality.
Null and Alternative Hypotheses
In scientific inquiry, we are often concerned with whether a factor we are considering (such as taking a specific drug) results in a specific effect (such as reduced recovery time).
To evaluate whether a factor results in an effect, we will perform an experiment and / or gather data. For example, in a clinical drug trial, half of the test subjects will be given the drug, and half will be given a placebo (something that appears to be the drug but is actually a neutral substance).
Because the data we gather will usually only be a portion (sample) of total possible people or places that could be affected (population), there is a possibility that the sample is unrepresentative of the population. We use a statistical test that considers that uncertainty when assessing whether an effect is associated with a factor.
- Statistical testing begins with an alternative hypothesis (H 1 ) that states that the factor we are considering results in a particular effect. The alternative hypothesis is based on the research question and the type of statistical test being used.
- Because of the problem of induction, we cannot prove our alternative hypothesis. However, under the concept of falsification, we can evaluate the data to see if there is a significant probability that our data falsifies our alternative hypothesis (Wilkinson 2012).
- The null hypothesis (H 0 ) states that the factor has no effect. The null hypothesis is the opposite of the alternative hypothesis. The null hypothesis is what we are testing when we perform a hypothesis test.
The output of a statistical test like the t-test is a p-value. A p-value is the probability of seeing an effect at least as large as the one in the sampled data if the null hypothesis were true, that is, if the effect were purely the result of random sampling error (chance).
- If a p -value is greater than the significance level (0.05 for 5% significance) we fail to reject the null hypothesis since there is a significant possibility that our results falsify our alternative hypothesis.
- If a p -value is lower than the significance level (0.05 for 5% significance) we reject the null hypothesis and have corroborated (provided evidence for) our alternative hypothesis.
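The calculation and interpretation of the p-value goes back to the central limit theorem, which states that the means of repeated random samples approximately follow a normal distribution as the sample size grows, so random sampling error can be modeled with a normal distribution.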
Using our example of a clinical drug trial, if the mean recovery times for the two groups are close enough together that there is a significant possibility ( p > 0.05) that the recovery times are the same (falsification), we fail to reject the null hypothesis.
However, if the mean recovery times for the two groups are far enough apart that the probability they are the same is under the level of significance ( p < 0.05), we reject the null hypothesis and have corroborated our alternative hypothesis.
Significance means that an effect is "probably caused by something other than mere chance" (Merriam-Webster 2022) .
- The significance level (α) is the threshold for significance and, by convention, is usually 5%, 10%, or 1%, which corresponds to 95% confidence, 90% confidence, or 99% confidence, respectively.
- A factor is considered statistically significant if the probability that the effect we see in the data is a result of random sampling error (the p -value) is below the chosen significance level.
- A statistical test is used to evaluate whether a factor being considered is statistically significant (Gallo 2016) .
Type I vs. Type II Errors
Although we are making a binary choice between rejecting and failing to reject the null hypothesis, because we are using sampled data, there is always the possibility that the choice we have made is an error.
There are two types of errors that can occur in hypothesis testing.
- Type I error (false positive) occurs when a low p -value causes us to reject the null hypothesis, but the factor does not actually result in the effect.
- Type II error (false negative) occurs when a high p -value causes us to fail to reject the null hypothesis, but the factor does actually result in the effect.
The numbering of the errors reflects the predisposition of the scientific method to be fundamentally skeptical . Accepting a fact about the world as true when it is not true is considered worse than rejecting a fact about the world that actually is true.
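The meaning of a Type I error can be illustrated with a small simulation. The sketch below assumes a made-up population (mean 10, standard deviation 2) and repeatedly compares two groups drawn from that same population, so the null hypothesis is true by construction. The group sizes, number of repetitions, and variable names are all illustrative assumptions.

```r
# Simulate many experiments in which the null hypothesis is TRUE:
# both groups come from the same hypothetical population.
set.seed(1)
alpha <- 0.05
n_experiments <- 2000

p_values <- replicate(n_experiments, {
  control   <- rnorm(30, mean = 10, sd = 2)
  treatment <- rnorm(30, mean = 10, sd = 2)  # no real effect
  t.test(control, treatment)$p.value
})

# Share of experiments that wrongly reject the null hypothesis
type_1_rate <- mean(p_values < alpha)
print(type_1_rate)
```

The share of false positives should land close to the chosen significance level, which is why lowering the significance level reduces the risk of Type I errors (at the cost of more Type II errors).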
Statistical Significance vs. Importance
When we reject the null hypothesis, we have found information that is commonly called statistically significant. But there are multiple challenges with this terminology.
First, statistical significance is distinct from importance (NIST 2012) . For example, if sampled data reveals a statistically significant difference in cancer rates, that does not mean that the increased risk is important enough to justify expensive mitigation measures. All statistical results require critical interpretation within the context of the phenomenon being observed. People with different values and incentives can have different interpretations of whether statistically significant results are important.
Second, the use of 95% probability for defining confidence intervals is an arbitrary convention. This creates a good vs. bad binary that suggests a "finality and certitude that are rarely justified." Alternative approaches like Bayesian statistics that express results as probabilities can offer more nuanced ways of dealing with complexity and uncertainty (Clayton 2022).
Science vs. Non-science
Not all ideas can be falsified, and Popper uses the distinction between falsifiable and non-falsifiable ideas to make a distinction between science and non-science. In order for an idea to be science it must be an idea that can be demonstrated to be false.
While Popper asserts there is still value in ideas that are not falsifiable, such ideas are not science in his conception of what science is. Such non-science ideas often involve questions of subjective values or unseen forces that are complex, amorphous, or difficult to objectively observe.
Example Data
As example data, this tutorial will use a table of anonymized individual responses from the CDC's Behavioral Risk Factor Surveillance System . The BRFSS is a "system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services" (CDC 2019) .
A CSV file with the selected variables used in this tutorial is available here and can be imported into R with read.csv() .
Guidance on how to download and process this data directly from the CDC website is available here...
Variable Types
The publicly-available BRFSS data contains a wide variety of discrete, ordinal, and categorical variables. Variables often contain special codes for non-responsiveness or missing (NA) values. Examples of how to clean these variables are given here...
The BRFSS has a codebook that gives the survey questions associated with each variable, and the way that responses are encoded in the variable values.
Normality Tests
Tests are commonly divided into two groups depending on whether they are built on the assumption that the continuous variable has a normal distribution.
- Parametric tests presume a normal distribution.
- Non-parametric tests can work with normal and non-normal distributions.
The distinction between parametric and non-parametric techniques is especially important when working with small numbers of samples (less than 40 or so) from a larger population.
The normality tests given below are of limited use with large numbers of values (shapiro.test() accepts at most 5,000 values), but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used (Ghasemi and Zahediasl 2012).
The Shapiro-Wilk Normality Test
- Data: A continuous or discrete sampled variable
- R Function: shapiro.test()
- Null hypothesis (H 0 ): The population distribution from which the sample is drawn is normal
- History: Samuel Sanford Shapiro and Martin Wilk (1965)
This is an example with random values from a normal distribution.
This is an example with random values from a uniform (non-normal) distribution.
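The two examples described above might look like the following sketch. The sample size, seed, and distribution parameters are arbitrary assumptions chosen for illustration.

```r
set.seed(42)

# Random values from a normal distribution
normal_sample <- rnorm(200, mean = 100, sd = 15)
normal_result <- shapiro.test(normal_sample)
print(normal_result)

# Random values from a uniform (non-normal) distribution
uniform_sample <- runif(200, min = 0, max = 100)
uniform_result <- shapiro.test(uniform_sample)
print(uniform_result)
```

A high p-value means we fail to reject the null hypothesis of normality, while a low p-value (as expected for the uniform sample) leads us to reject it.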
The Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a more generalized test than the Shapiro-Wilk test that can be used to test whether a sample is drawn from any type of distribution.
- Data: A continuous or discrete sampled variable and a reference probability distribution
- R Function: ks.test()
- Null hypothesis (H 0 ): The population distribution from which the sample is drawn matches the reference distribution
- History: Andrey Kolmogorov (1933) and Nikolai Smirnov (1948)
- pearson.test(): The Pearson Chi-square Normality Test from the nortest library. Lower p-values (closer to 0) lead us to reject the null hypothesis that the distribution IS normal.
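As a minimal sketch of ks.test(), the example below compares two simulated samples against a normal reference distribution. The reference mean of 100 and standard deviation of 15, along with the sample sizes, are assumptions made for illustration only.

```r
set.seed(42)

# The reference distribution is named by its cumulative distribution
# function ("pnorm"), followed by that function's parameters.
normal_sample <- rnorm(500, mean = 100, sd = 15)
ks_normal <- ks.test(normal_sample, "pnorm", mean = 100, sd = 15)
print(ks_normal)

# A uniform sample over a similar range does not match the reference.
uniform_sample <- runif(500, min = 50, max = 150)
ks_uniform <- ks.test(uniform_sample, "pnorm", mean = 100, sd = 15)
print(ks_uniform)
```

Note that the reference parameters here are fixed in advance. If they are instead estimated from the same sample being tested, the resulting p-value tends to be too high.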
Modality Tests of Samples
Comparing Two Central Tendencies: Tests with Continuous / Discrete Data
One Sample T-Test (Two-Sided)
The one-sample t-test tests the significance of the difference between the mean of a sample and an expected mean.
- Data: A continuous or discrete sampled variable and a single expected mean (μ)
- Parametric (normal distributions)
- R Function: t.test()
- Null hypothesis (H 0 ): The mean of the sampled distribution matches the expected mean.
- History: William Sealy Gosset (1908)
t = (x̄ - μ) / (σ̂ / √n)
- t: The value of t used to find the p-value
- x̄: The sample mean
- μ: The expected (population) mean
- σ̂: The estimate of the standard deviation of the population (usually the standard deviation of the sample)
- n: The sample size
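To make the formula concrete, the sketch below computes t by hand and compares it with the value reported by t.test(). The ten weights are made-up values used only for illustration, and the expected mean of 178 is borrowed from the example that follows.

```r
# Hypothetical sample of adult weights in pounds (made-up values)
weights_lb <- c(150, 182, 205, 168, 191, 176, 220, 159, 187, 199)
mu <- 178  # expected mean

x_bar     <- mean(weights_lb)    # sample mean
sigma_hat <- sd(weights_lb)      # estimated population standard deviation
n         <- length(weights_lb)  # sample size

# The t statistic from the formula, and its two-sided p-value
t_manual <- (x_bar - mu) / (sigma_hat / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The same test using the built-in function
builtin <- t.test(weights_lb, mu = mu)
print(c(manual = t_manual, builtin = unname(builtin$statistic)))
print(c(manual = p_manual, builtin = builtin$p.value))
```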
T-tests should only be used when the population is at least 20 times larger than its respective sample. If the sample size is very large, even trivially small differences produce low p-values, which can make practically insignificant differences look significant.
For example, we test a hypothesis that the mean weight in IL in 2020 is different than the 2005 continental mean weight.
Walpole et al. (2012) estimated that the average adult weight in North America in 2005 was 178 pounds. We could presume that Illinois is a comparatively normal North American state that would follow the trend of both increased age and increased weight (CDC 2021) .
The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight changed between 2005 and 2020 in Illinois.
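A two-sided one-sample test like the one described above can be sketched as follows. Because the BRFSS file is not bundled here, the il_weight vector is a simulated stand-in (its mean, spread, and size are assumptions), so the output will not match results from the real survey data.

```r
set.seed(2020)

# Simulated stand-in for cleaned Illinois weights in pounds
# (NOT real BRFSS data; mean, sd, and n are assumptions)
il_weight <- rnorm(5000, mean = 182, sd = 45)

# Two-sided test against the 2005 continental mean of 178 pounds
two_sided <- t.test(il_weight, mu = 178)
print(two_sided)
```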
One Sample T-Test (One-Sided)
Because we were expecting an increase, we can modify our hypothesis that the mean weight in 2020 is higher than the continental weight in 2005. We can perform a one-sided t-test using the alternative="greater" parameter.
The low p-value leads us to again reject the null hypothesis and corroborate our alternative hypothesis that mean weight in 2020 is higher than the continental weight in 2005.
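The one-sided version differs only in the alternative parameter. As a sketch, again using a simulated stand-in for the Illinois weights (an assumption, not the real BRFSS values):

```r
set.seed(2020)

# Simulated stand-in for cleaned Illinois weights in pounds (NOT real data)
il_weight <- rnorm(5000, mean = 182, sd = 45)

# Alternative hypothesis: the true mean is GREATER than 178 pounds
one_sided <- t.test(il_weight, mu = 178, alternative = "greater")
print(one_sided)
```

When the sample mean falls on the side predicted by the alternative hypothesis, the one-sided p-value is half of the two-sided p-value.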
Note that this does not clearly evaluate whether weight increased specifically in Illinois, or, if it did, whether that was caused by an aging population or decreasingly healthy diets. Hypotheses based on such questions would require more detailed analysis of individual data.
Although we can see that the mean cancer incidence rate is higher for counties near nuclear plants, there is the possibility that the difference in means happened by accident and the nuclear plants have nothing to do with those higher rates.
The t-test allows us to test a hypothesis. Note that a t-test does not "prove" or "disprove" anything. It only gives the probability that the differences we see between two areas happened by chance. It also does not evaluate whether there are other problems with the data, such as a third variable, or inaccurate cancer incidence rate estimates.
Note that this does not prove that nuclear power plants present a higher cancer risk to their neighbors. It simply says that the slightly higher risk is probably not due to chance alone. But there are a wide variety of other related or unrelated social, environmental, or economic factors that could contribute to this difference.
Box-and-Whisker Chart
One visualization commonly used when comparing distributions (collections of numbers) is a box-and-whisker chart. The boxes show the middle half of the distribution, from the 25th percentile through the median (50th percentile) to the 75th percentile, and the whiskers show the extreme high and low values.
Although Google Sheets does not provide the capability to create box-and-whisker charts, Google Sheets does have candlestick charts , which are similar to box-and-whisker charts, and which are normally used to display the range of stock price changes over a period of time.
This video shows how to create a candlestick chart comparing the distributions of cancer incidence rates. The QUARTILE() function gets the values that divide the distribution into four equally-sized parts. This shows that while the range of incidence rates in the non-nuclear counties is wider, the bulk of the rates are below the rates in nuclear counties, giving a visual demonstration of the numeric output of our t-test.
While categorical data can often be reduced to dichotomous data and used with proportions tests or t-tests, there are situations where you are sampling data that falls into more than two categories and you would like to make hypothesis tests about those categories. This tutorial describes a group of tests that can be used with that type of data.
Two-Sample T-Test
When comparing means of values from two different groups in your sample, a two-sample t-test is in order.
The two-sample t-test tests the significance of the difference between the means of two different samples.
- Data: Two normally-distributed, continuous or discrete sampled variables, OR
- Data: A normally-distributed continuous or discrete sampled variable and a parallel dichotomous variable indicating what group each of the values in the first variable belongs to
- Null hypothesis (H 0 ): The means of the two sampled distributions are equal.
For example, given the low incomes and delicious foods prevalent in Mississippi, we might presume that average weight in Mississippi would be higher than in Illinois.
We test a hypothesis that the mean weight in IL in 2020 is less than the 2020 mean weight in Mississippi.
The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight in Illinois is less than in Mississippi.
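The two ways of supplying data to a two-sample test can be sketched as below. The survey data frame is simulated (group means of 182 and 187 pounds echo the figures quoted in this section, while the spread and sample sizes are assumptions), and its column names are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the survey (NOT real BRFSS data)
survey <- data.frame(
  state  = rep(c("IL", "MS"), each = 5000),
  weight = c(rnorm(5000, mean = 182, sd = 45),
             rnorm(5000, mean = 187, sd = 45))
)

# Form 1: two separate vectors, one per group
by_vectors <- t.test(survey$weight[survey$state == "IL"],
                     survey$weight[survey$state == "MS"],
                     alternative = "less")

# Form 2: one value variable and one dichotomous grouping variable.
# Groups are ordered alphabetically, so "less" means IL < MS.
by_formula <- t.test(weight ~ state, data = survey, alternative = "less")
print(by_formula)
```

By default t.test() does not assume equal variances in the two groups (Welch's t-test); the var.equal = TRUE parameter requests the classic pooled-variance version.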
While the difference in means is statistically significant, it is small (182 vs. 187). This calls for caution in interpretation, so that the analysis is not used simply to reinforce unhelpful stigmatization.
Wilcoxon Rank Sum Test (Mann-Whitney U-Test)
The Wilcoxon rank sum test tests the significance of the difference between the distributions of two different samples, and is commonly used to compare their central tendencies. This is a non-parametric alternative to the t-test.
- Data: Two continuous sampled variables
- Non-parametric (normal or non-normal distributions)
- R Function: wilcox.test()
- Null hypothesis (H 0 ): For randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
- History: Frank Wilcoxon (1945) and Henry Mann and Donald Whitney (1947)
The test is implemented with the wilcox.test() function.
- When the test is performed on one sample in comparison to an expected value around which the distribution is symmetrical (μ), the test is known as a Wilcoxon signed-rank test.
- When the test is performed to compare two samples, the test is known as a Wilcoxon rank sum test or, equivalently, a Mann-Whitney U test.
For this example, we will use AVEDRNK3: During the past 30 days, on the days when you drank, about how many drinks did you drink on the average?
- 1 - 76: Number of drinks
- 77: Don’t know/Not sure
- 99: Refused
- NA: Not asked or Missing
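Before analysis, the special codes listed above need to be converted to missing values so they are not treated as drink counts. A minimal sketch, using a short made-up vector in place of the real AVEDRNK3 column:

```r
# Hypothetical AVEDRNK3 responses (made-up values, not real BRFSS records)
avedrnk3 <- c(2, 1, 77, 3, 99, NA, 12, 1, 2, 4)

# Recode "Don't know" (77) and "Refused" (99) as missing
drinks <- avedrnk3
drinks[drinks %in% c(77, 99)] <- NA

print(summary(drinks))

# Histogram of the cleaned values
hist(drinks, main = "Average drinks per drinking day", xlab = "Drinks")
```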
The histogram clearly shows this to be a non-normal distribution.
Continuing the comparison of Illinois and Mississippi from above, we might presume that with all that warm weather and excellent food in Mississippi, they might be inclined to drink more. The means of the average number of drinks seem to suggest that Mississippians do drink more than Illinoisans.
We can use wilcox.test() to test a hypothesis that the average amount of drinking in Illinois is different than in Mississippi. Like the t-test, the alternative can be specified as two-sided or one-sided, and for this example we will test whether the sampled Illinois value is indeed less than the Mississippi value.
The low p-value leads us to reject the null hypothesis and corroborates our hypothesis that average drinking is lower in Illinois than in Mississippi. As before, this tells us nothing about why this is the case.
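A one-sided rank sum test of this kind might be written as in the sketch below. The drink counts are simulated from skewed (geometric) distributions as a stand-in for the cleaned AVEDRNK3 values, so the distribution parameters and sample sizes are assumptions rather than survey results.

```r
set.seed(7)

# Simulated, right-skewed drink counts (NOT real BRFSS data)
il_drinks <- rgeom(2000, prob = 0.40) + 1
ms_drinks <- rgeom(2000, prob = 0.30) + 1

print(c(IL = mean(il_drinks), MS = mean(ms_drinks)))

# Alternative hypothesis: Illinois values tend to be LESS than Mississippi values
result <- wilcox.test(il_drinks, ms_drinks, alternative = "less")
print(result)
```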
Weighted Two-Sample T-Test
The downloadable BRFSS data is raw, anonymized survey data that is biased by uneven geographic coverage of survey administration (noncoverage) and lack of responsiveness from some segments of the population (nonresponse). The X_LLCPWT field (landline, cellphone weighting) is a weighting factor added by the CDC that can be assigned to each response to compensate for these biases.
The wtd.t.test() function from the weights library has weight and weighty parameters that can be used to include weighting factors for the two samples as part of the t-test.
Comparing Proportions: Tests with Categorical Data
Chi-Squared Goodness of Fit
- Tests the significance of the difference between sampled frequencies of different values and expected frequencies of those values
- Data: A categorical sampled variable and a table of expected frequencies for each of the categories
- R Function: chisq.test()
- Null hypothesis (H 0 ): The relative proportions of categories in one variable match the expected proportions
- History: Karl Pearson (1900)
- Example Question: Are the voting preferences of voters in my district significantly different from the current national polls?
For example, we test a hypothesis that smoking rates changed between 2000 and 2020.
In 2000, the estimated rate of adult smoking in Illinois was 22.3% (Illinois Department of Public Health 2004) .
The variable we will use is SMOKDAY2: Do you now smoke cigarettes every day, some days, or not at all?
- 1: Current smoker - now smokes every day
- 2: Current smoker - now smokes some days
- 3: Not at all
- 7: Don't know
- NA: Not asked or missing - NA is used for people who have never smoked
We subset only yes/no responses in Illinois and convert into a dummy variable (yes = 1, no = 0).
The listing of the table as percentages indicates that smoking rates were halved between 2000 and 2020, but since this is sampled data, we need to run a chi-squared test to make sure the difference can't be explained by the randomness of sampling.
In this case, the very low p-value leads us to reject the null hypothesis and corroborates the alternative hypothesis that smoking rates changed between 2000 and 2020.
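A goodness of fit test of this kind can be sketched as follows. The expected proportions come from the 22.3% smoking rate for 2000 quoted above, but the observed counts are made-up numbers standing in for the tabulated 2020 survey responses.

```r
# Hypothetical 2020 counts (made up for illustration, not BRFSS results)
observed <- c(nonsmoker = 1780, smoker = 220)

# Expected proportions based on the 2000 smoking rate of 22.3%,
# listed in the same order as the observed counts
expected_p <- c(nonsmoker = 0.777, smoker = 0.223)

# Observed proportions, for comparison with the expected proportions
print(round(observed / sum(observed), 3))

result <- chisq.test(x = observed, p = expected_p)
print(result)

# Counts we would expect if the 2000 rate still applied
print(result$expected)
```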
Chi-Squared Contingency Analysis / Test of Independence
- Tests the significance of the difference in frequencies between two different groups
- Data: Two categorical sampled variables
- Null hypothesis (H 0 ): The relative proportions of one variable are independent of the second variable.
We can also compare categorical proportions between two sets of sampled categorical variables.
The chi-squared test can be used to determine if two categorical variables are independent. What is passed as the parameter is a contingency table created with the table() function that cross-classifies the number of rows that are in the categories specified by the two categorical variables.
The null hypothesis with this test is that the two categories are independent. The alternative hypothesis is that there is some dependency between the two categories.
For this example, we can compare the three categories of smokers (every day = 1, some days = 2, not at all = 3) across the two categories of states (Illinois and Mississippi).
The low p-value leads us to reject the null hypothesis that the categories are independent and corroborates our hypothesis that smoking behaviors in the two states are indeed different.
p-value = 1.516e-09
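The steps described above (building a contingency table with table() and passing it to chisq.test()) can be sketched as below. The responses are simulated with assumed category probabilities, so the output will not reproduce the p-value reported above.

```r
set.seed(3)
n <- 1500

# Simulated responses (NOT real BRFSS data); probabilities are assumptions
state <- rep(c("IL", "MS"), each = n)
smoke <- c(sample(1:3, n, replace = TRUE, prob = c(0.20, 0.10, 0.70)),
           sample(1:3, n, replace = TRUE, prob = c(0.30, 0.10, 0.60)))

# Cross-classify the two categorical variables
counts <- table(smoke, state)
print(counts)

# Column percentages make the two states easier to compare
print(round(prop.table(counts, margin = 2), 3))

result <- chisq.test(counts)
print(result)
```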
Weighted Chi-Squared Contingency Analysis
As with the weighted t-test above, the weights library contains the wtd.chi.sq() function for incorporating weighting into chi-squared contingency analysis.
As above, the even lower p-value leads us to again reject the null hypothesis that smoking behavior is independent of state.
Suppose that the Macrander campaign would like to know how partisan this election is. If people are largely choosing to vote along party lines, the campaign will seek to get their base voters out to the polls. If people are splitting their ticket, the campaign may focus their efforts more broadly.
In the example below, the Macrander campaign took a small poll of 30 people asking who they wished to vote for AND what party they most strongly affiliate with.
The output of table() shows a fairly strong relationship between party affiliation and candidates. Democrats tend to vote for Macrander, Republicans tend to vote for Stewart, and independents all vote for Miller.
This is reflected in the very low p-value from the chi-squared test. This indicates that there is a very low probability that the two categories are independent. Therefore we reject the null hypothesis.
In contrast, suppose that the poll results had shown there were a number of people crossing party lines to vote for candidates outside their party. The simulated data below uses the runif() function to randomly choose 50 party names.
The contingency table() shows no clear relationship between party affiliation and candidate. This is validated quantitatively by the chi-squared test. The fairly high p-value of 0.4018 indicates that a relationship at least this strong would appear about 40% of the time by chance alone if the two categories were independent. Therefore, we fail to reject the null hypothesis and the campaign should focus their efforts on the broader electorate.
The warning message given by the chisq.test() function indicates that the expected counts in some cells are too small for the chi-squared approximation to be accurate. The simulate.p.value = T parameter adds Monte Carlo simulation to the test to improve the estimation and get rid of the warning message. However, the best way to get rid of this message is to get a larger sample.
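A sketch of the Monte Carlo option on a small, randomly generated poll is shown below. It uses sample() rather than runif() to pick names for brevity, and the poll size of 30 and the number of replicates (B) are illustrative assumptions.

```r
set.seed(11)

# Small simulated poll with no built-in relationship between
# party and candidate (made-up data)
party     <- sample(c("Democrat", "Republican", "Independent"), 30, replace = TRUE)
candidate <- sample(c("Macrander", "Stewart", "Miller"), 30, replace = TRUE)

poll <- table(party, candidate)
print(poll)

# Monte Carlo simulation replaces the chi-squared approximation,
# which is unreliable when expected cell counts are small
result <- chisq.test(poll, simulate.p.value = TRUE, B = 5000)
print(result)
```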
Comparing Categorical and Continuous Variables
Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a test that you can use when you have a categorical variable and a continuous variable. It is a test that considers variability between means for different categories as well as the variability of observations within groups.
There are a wide variety of different extensions of ANOVA that deal with covariance (ANCOVA), multiple variables (MANOVA), and both of those together (MANCOVA). These techniques can become quite complicated and also assume that the values in the continuous variables have a normal distribution.
- Data: One or more categorical (independent) variables and one continuous (dependent) sampled variable
- R Function: aov()
- Null hypothesis (H 0 ): There is no difference in means of the groups defined by each level of the categorical (independent) variable
- History: Ronald Fisher (1921)
- Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?
As an example, we look at the continuous weight variable (WEIGHT2) split into groups by the eight income categories in INCOME2: Is your annual household income from all sources?
- 1: Less than $10,000
- 2: $10,000 to less than $15,000
- 3: $15,000 to less than $20,000
- 4: $20,000 to less than $25,000
- 5: $25,000 to less than $35,000
- 6: $35,000 to less than $50,000
- 7: $50,000 to less than $75,000
- 8: $75,000 or more
The barplot() of means does show variation among groups, although there is no clear linear relationship between income and weight.
To test whether this variation could be explained by randomness in the sample, we run the ANOVA test.
The low p-value leads us to reject the null hypothesis that there is no difference in the means of the different groups, and corroborates the alternative hypothesis that mean weights differ based on income group.
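An ANOVA of this kind might be run as in the following sketch. The data are simulated with assumed group means for the eight income categories, and the survey data frame and its column names are hypothetical stand-ins for the cleaned BRFSS variables.

```r
set.seed(5)

# Assumed mean weight for each of the eight income categories (made up)
group_means <- c(175, 185, 190, 183, 178, 188, 192, 180)
income <- rep(1:8, each = 1000)

survey <- data.frame(
  # Convert the numeric codes to a factor so aov() treats them as groups
  # rather than as a single continuous predictor
  income = factor(income),
  weight = rnorm(8000, mean = group_means[income], sd = 45)
)

# Mean weight within each income group
print(round(tapply(survey$weight, survey$income, mean), 1))

fit <- aov(weight ~ income, data = survey)
anova_table <- summary(fit)
print(anova_table)
```

The p-value appears in the Pr(>F) column of the summary table.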
However, it gives us no clear model for describing that relationship and offers no insights into why income would affect weight, especially in such a nonlinear manner.
Suppose you are performing research into obesity in your city. You take a sample of 30 people in three different neighborhoods (90 people total), collecting information on health and lifestyle. Two variables you collect are height and weight so you can calculate body mass index . Although this index can be misleading for some populations (notably very athletic people), ordinary sedentary people can be classified according to BMI:
Average BMI in the US from 2007-2010 was around 28.6 and rising, with a standard deviation of around 5.
You would like to know if there is a difference in BMI between different neighborhoods so you can know whether to target specific neighborhoods or make broader city-wide efforts. Since you have more than two groups, you cannot use a t-test.
Kruskal-Wallis One-Way Analysis of Variance
A somewhat simpler test is the Kruskal-Wallis test, which is a nonparametric analogue to ANOVA for testing the significance of differences between two or more groups.
- R Function: kruskal.test()
- Null hypothesis (H 0 ): The samples come from the same distribution.
- History: William Kruskal and W. Allen Wallis (1952)
For this example, we will investigate whether mean weight varies between the three major US urban states: New York, Illinois, and California.
To test whether this variation could be explained by randomness in the sample, we run the Kruskal-Wallis test.
The low p-value leads us to reject the null hypothesis that the samples come from the same distribution. This corroborates the alternative hypothesis that weights differ based on state.
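As a sketch of how such a test might be run, the example below uses simulated weights for the three states. The group means, spread, and sample sizes are assumptions, and the survey data frame is a hypothetical stand-in for the cleaned BRFSS data.

```r
set.seed(9)

# Simulated stand-in for the survey (NOT real BRFSS data)
survey <- data.frame(
  state  = factor(rep(c("CA", "IL", "NY"), each = 3000)),
  weight = c(rnorm(3000, mean = 178, sd = 45),
             rnorm(3000, mean = 184, sd = 45),
             rnorm(3000, mean = 172, sd = 45))
)

# Mean weight within each state
print(round(tapply(survey$weight, survey$state, mean), 1))

result <- kruskal.test(weight ~ state, data = survey)
print(result)
```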
A convenient way of visualizing a comparison between continuous and categorical data is with a box plot, which shows the distribution of a continuous variable across different groups:
A percentile is the level at which a given percentage of the values in the distribution are below: the 5th percentile means that five percent of the numbers are below that value.
The quartiles divide the distribution into four parts. 25% of the numbers are below the first quartile. 75% are below the third quartile. 50% are below the second quartile, making it the median.
Box plots can be used with both sampled data and population data.
The first parameter to the box plot is a formula: the continuous variable as a function of (the tilde) the second variable. A data= parameter can be added if you are using variables in a data frame.
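The formula interface described above can be sketched as follows, again with simulated weights in a hypothetical survey data frame (the values and column names are assumptions).

```r
set.seed(9)

# Simulated stand-in for the survey (NOT real BRFSS data)
survey <- data.frame(
  state  = factor(rep(c("CA", "IL", "NY"), each = 300)),
  weight = c(rnorm(300, mean = 178, sd = 45),
             rnorm(300, mean = 184, sd = 45),
             rnorm(300, mean = 172, sd = 45))
)

# Continuous variable ~ grouping variable, with data= naming the data frame
bp <- boxplot(weight ~ state, data = survey,
              xlab = "State", ylab = "Weight (pounds)")

# boxplot() also returns the plotted statistics: one column per group,
# with rows for lower whisker, first quartile, median, third quartile,
# and upper whisker
print(bp$stats)
```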
The chi-squared test can be used to determine if two categorical variables are independent of each other.