In data science, one of the most important tools we use is statistical testing and hypothesis testing. Generally, we want to know whether a measure has a statistically significant effect on our target, and if it does, maybe we want to use that feature in a machine learning model. Statistical testing is an important tool for feature selection, especially when we have many features and are unsure what really matters to our target variable (the variable of interest, i.e., what we are trying to predict).
Before we run the statistical tests themselves, we need to know what we are testing. This is where the hypothesis testing setup comes into play. Sometimes this can be very intuitive, but sometimes not so much, especially in a bootcamp environment where you are learning all of this in a day or two.
In this article, we are going to focus on how to write the hypotheses out. If you want to know which test to use when, I will link an amazing reference by Jagandeep Singh here. You’re welcome.
Let’s write some hypothesis tests!
The null hypothesis is often written as H0. The best way to remember how to write the null hypothesis is that it’s a glass half empty sort of measure. It is the Eeyore of data science. There is no change, no difference in the average, no effect. It does not matter.
The alternative hypothesis is usually what you are trying to find evidence to support, the eternal optimist. Denoted as H1 or HA, it is the better case scenario: there is a statistically meaningful difference in the average, some difference, some effect. (This does not tell you how large the effect is, just that there is some significant effect.)
When writing a hypothesis it is critical to write them accurately. Small changes or omissions in wording can have a huge impact.
WARNING: Variable X != sample mean of Variable X
It’s important that we do not confuse the variable (price, weight, or variable X) with the mean of the variable (average price, average weight, mean of variable X). We are NOT calculating significance on the target variable itself, but on the sample mean of the target variable.
T-Test and Z-Test
Z-Tests compare Population means or proportions, either between two samples (2-sample) or against a known value (1-sample). (Here we KNOW the Population standard deviation; note that Population is capitalized throughout to denote the known, actual population.)
If we want to look at the Population mean or the Population proportion, we are looking at a Z-Test.
One Sample Z-Test
When you cross the street, do you look only one way, or both ways? Both! We want to make sure there are no cars coming from either side. Just like crossing the street, in two-tailed testing we are looking at BOTH sides of the distribution when we ask “is it equal or not”: 1) mu greater than 40 and 2) mu less than 40 would both mean mu does not equal 40.
- H0: Population mean (mu) = 40
- H1: Population mean (mu) != 40
However, if we only care about a certain direction, we can look just one way. Here we only want to test whether it is less than 40.
- H0: Population mean (mu) ≥ 40
- H1: Population mean (mu) < 40
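As a minimal sketch in Python, here is how both versions of this test could be computed by hand with scipy. The sample values, hypothesized mean, and known sigma below are all made-up numbers for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the Population standard deviation is assumed KNOWN
sample = np.array([38.5, 41.2, 39.8, 42.1, 37.9, 40.5, 39.2, 41.8, 38.1, 40.4])
mu_0 = 40       # hypothesized Population mean
sigma = 1.5     # KNOWN Population standard deviation

# z-statistic: how many standard errors the sample mean is from mu_0
z = (sample.mean() - mu_0) / (sigma / np.sqrt(len(sample)))

# Two-tailed: H1 is mu != 40, so we look BOTH ways
p_two_tailed = 2 * stats.norm.sf(abs(z))

# One-tailed (lower): H1 is mu < 40, so we only look at the left tail
p_lower = stats.norm.cdf(z)
```

Note how the two-tailed p-value splits the rejection region across both tails, while the one-tailed version piles it all on the side named in H1.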
Two Sample Z-Test
For this example, we will use the Z-Test for a difference in proportions.
- H0: p1 = p2
- H1: p1 != p2
— Comprehension check: would this be a one- or two-tailed example?
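A sketch of the two-sample proportions z-test, computed by hand under the pooled-proportion approach; the success counts and sample sizes are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical successes and trials for the two samples
x1, n1 = 45, 100
x2, n2 = 60, 120

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)               # pooled proportion under H0: p1 = p2
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se
p_value = 2 * stats.norm.sf(abs(z))          # two-tailed, matching H1: p1 != p2
```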
T-Tests compare the difference between two sample means (2-sample) or a sample mean to a known value (1-sample). (Here the Population standard deviation is unknown.)
One Sample T-Test
Compares the mean to a known value.
- H0: The population mean = $100
- H1: The population mean != $100
— Comprehension answer: these are two-tailed examples.
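In Python this is a one-liner with scipy's `ttest_1samp`, which is two-tailed by default. The prices below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical nightly prices; is the population mean $100?
prices = np.array([92.0, 108.5, 99.0, 115.0, 87.5, 104.0, 96.5, 110.0])

# Two-tailed by default: H1 is mean != 100
t_stat, p_value = stats.ttest_1samp(prices, popmean=100)
```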
Two Sample T-Test
For example, if we wanted to test the average price of an AirBnb based on distance from the city center, we might split the listings into two samples: ≤ 3 miles and > 3 miles. Our hypotheses would then be:
- H0: The average price with a distance of ≤ 3 miles from city center = the average price with a distance > 3 miles from city center
- H1: The average price with a distance of ≤ 3 miles from city center != the average price with a distance > 3 miles from city center
The price is NOT the same as the average (mean) price.
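A sketch of this comparison with scipy's `ttest_ind`; the prices are invented for illustration, and `equal_var=False` requests Welch's t-test, which does not assume the two groups share a variance:

```python
import numpy as np
from scipy import stats

# Hypothetical nightly prices by distance from city center
near = np.array([150., 162., 175., 158., 149., 171., 168., 155.])  # <= 3 miles
far  = np.array([120., 135., 128., 142., 118., 131., 126., 139.])  # > 3 miles

# Welch's two-sample t-test; two-tailed, matching H1: means are not equal
t_stat, p_value = stats.ttest_ind(near, far, equal_var=False)
```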
Once we have calculated the test statistic or p-value, we are able to reject or fail to reject the null (Eeyore example) hypothesis.
Interpreting the results:
- If abs(t-statistic) ≤ critical value: Fail to Reject null hypothesis that the means are equal.
- If abs(t-statistic) > critical value: Reject the null hypothesis that the means are equal.
- If p ≥ alpha: Fail to Reject null hypothesis that the means are equal.
- If p < alpha: Reject null hypothesis that the means are equal.
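The decision rules above can be wrapped in a small helper. The function name `decide` is my own; the critical value comes from the t distribution at the chosen alpha (two-tailed):

```python
from scipy import stats

def decide(t_stat, p_value, df, alpha=0.05):
    # Two-tailed critical value from the t distribution
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    by_stat = "Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0"
    by_p = "Reject H0" if p_value < alpha else "Fail to reject H0"
    # The two rules always agree: |t| > t_crit exactly when p < alpha
    return by_stat, by_p
```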
One Way ANOVA
When using ANOVA testing, we are comparing the means of two or more sample groups.
- H0: The population means of all groups are equal.
- HA: At least one of the population means is significantly different from the rest.
Once we have the null and alternative hypotheses, we use the F-statistic to assess the differences between the tested means. With ANOVAs, you compare the F-value (your test statistic) against the F-critical value to interpret your results.
- F Value < F Critical Value: fail to reject the null hypothesis
- F Value > F Critical Value: reject the null hypothesis
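A sketch of a one-way ANOVA with scipy's `f_oneway`; the three groups below are made-up measurements, and the critical value comes from the F distribution with k-1 and N-k degrees of freedom:

```python
from scipy import stats

# Hypothetical measurements for three groups
group_a = [23., 25., 21., 27., 24.]
group_b = [30., 33., 29., 31., 35.]
group_c = [22., 26., 24., 23., 25.]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

# Critical value at alpha = 0.05: k - 1 = 2, N - k = 12 degrees of freedom
f_crit = stats.f.ppf(0.95, dfn=2, dfd=12)
```

Here `group_b` sits well above the other two, so we would expect the F-value to exceed the critical value and reject H0.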
Two Way ANOVA
This one is fun because you need THREE pairs of null and alternative hypotheses. This is because a two way ANOVA takes a continuous dependent variable and compares its sample means across two independent categorical variables. We test all three at the same time!
Keep in mind here that Variables A and B are independent variables and are categorical, like ‘month’, ‘education’, or ‘gender’. The dependent variable in H0–3 and H1–3 is continuous, like ‘weight’.
First Independent Variable
- H0–1: There is no difference in the sample means of Variable A. (equal)
- H1–1: There is a difference in the sample means of Variable A. (not equal)
Second Independent Variable
- H0–2: There is no difference in the sample means of Variable B. (equal)
- H1–2: There is a difference in the sample means of Variable B. (not equal)
Independent Variables on Dependent Variable
- H0–3: The effect of Variable A on the sample mean does not depend on the effect of Variable B on the sample mean. (There is no interaction between Variables A and B)
- H1–3: The effect of Variable A on the sample mean depends on the effect of Variable B on the sample mean. (There is some interaction between Variables A and B)
For a better idea in practice, here’s an example from TechnologyNetworks:
- H0: The means of all month groups are equal
- H1: The mean of at least one month group is different
- H0: The means of the gender groups are equal
- H1: The means of the gender groups are different
- H0: There is no interaction between the month and gender
- H1: There is interaction between the month and gender
Chi-Squared Tests
This test can get… complicated. Not because it is, but because there are multiple Chi-Squared test varieties. We will pare it down to these two, arguably the most common in data science:
- Goodness of Fit
- Test for Independence
Goodness of Fit: Does your observed value distribution differ from the expected/theorized values? Think of testing a die for fairness. (The die should be equally likely to roll a 6 as a 1, 2, 3, 4, or 5, but what if that is not the observed frequency?)
- H0: There is no significant difference between the observed frequency/value and the expected frequency/value.
- H1: There is a significant difference between the observed frequency/value and the expected frequency/value.
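The die-fairness example, sketched with scipy's `chisquare`; the observed roll counts are made up:

```python
from scipy import stats

# Hypothetical observed counts for faces 1-6 over 120 rolls
observed = [15, 22, 19, 25, 17, 22]

# A fair die expects each face 120 / 6 = 20 times
expected = [20] * 6

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
```

With these counts the deviations from 20 are small, so we would fail to reject H0: no evidence the die is unfair.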
Test for Independence: Is there a relationship between two categorical variables? Think of categorical variables like ‘gender’ and ‘intelligence’.
- H0: There is no relationship between variable x and variable y.
- H1: There is a relationship between variable x and variable y.
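A sketch of the independence test with scipy's `chi2_contingency`; the contingency table below (rows as one categorical variable, columns as the other) is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed counts
table = np.array([[30, 10],
                  [20, 25]])

# Returns the statistic, p-value, degrees of freedom, and
# the expected counts under H0 (no relationship)
chi2, p_value, dof, expected = stats.chi2_contingency(table)
```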
- ANOVA Testing - great resource with examples
- Chi-Squared from Slide Share
- Chi-Squared by Scary Scientist - great resource with examples