Hypothesis Writing and Interpretation for Data Science Explained

[Image: Statistics is a crucial tool for data scientists.]

In data science, one of the most important tools we use is statistical hypothesis testing. Generally, we want to know if a measure we have has a statistically significant effect on our target, and if it does, maybe we want to use that feature in a machine learning model. Statistical testing is an important tool we can use for feature selection, especially if we have many features and are unsure what really matters to our target variable (the variable of interest, or what we are looking to predict).

Before we do the statistical tests themselves, we need to know what we are testing. This is where the hypothesis testing setup comes into play. Sometimes this can be very intuitive, but sometimes not so much, especially in a bootcamp environment where you are learning all of these in a day or two.

In this article, we are going to be focusing on how to write the hypothesis out. If you want to know what test to use when, I will link an amazing reference by Jagandeep Singh here. You’re welcome.

Let’s write some hypothesis tests!

Null Hypothesis

The null hypothesis is often written as H0. The best way to remember how to write the null hypothesis is that it’s a glass half empty sort of measure. It is the Eeyore of data science. There is no change, no difference in the average, no effect. It does not matter.

Alternative Hypothesis

The alternative hypothesis is usually what you are trying to find evidence to support, the eternal optimist. Denoted as H1 or HA, it is the better case scenario that states there is a statistically meaningful difference in the average, some difference, some effect. (This does not tell you specifically how much, just that there is some significant effect.)

When writing hypotheses, it is critical to write them accurately. Small changes or omissions in wording can have a huge impact.

WARNING: Variable X != sample mean of Variable X

It’s important that we do not confuse the variable (price, weight, or variable X) with the mean of the variable (average price, average weight, mean of variable X). We are NOT testing the significance of the variable's individual values, but the significance of its mean, as estimated by the sample mean.

T-Test and Z-Test

Z-Tests compare Population means or proportions to see whether they are equal to each other (2 sample) or equal to a given value (1 sample). Here we KNOW the Population standard deviation. (Note that Population is capitalized to denote the actual, known population.)

If we want to look at the Population mean or the Population proportion, we are looking at a Z-Test.

One Sample Z-Test

When you cross the street, do you look only one way? Or both ways? Both! We want to make sure there are no cars coming from either side. Two tailed testing works the same way: when we ask whether mu is equal to 40 or not, we look at BOTH sides of the graph, because 1) mu greater than 40 and 2) mu less than 40 would both mean that mu does not equal 40.

Two Tailed:

  • H0: Population mean (mu) = 40
  • H1: Population mean (mu) != 40

However, if we only care about a certain direction, we can look just that one way. Here we only want to test if it is less than 40.

One Tailed:

  • H0: Population mean (mu) ≥ 40
  • H1: Population mean (mu) < 40
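To make the two tailed and one tailed setups concrete, here is a minimal sketch using only Python's standard library. The sample mean, population standard deviation, and sample size are made-up numbers for illustration, and `one_sample_z` is a hypothetical helper name, not something from a particular library.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, sigma, n):
    """z-statistic for H0: mu = mu0 when the population sigma is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Made-up numbers for illustration only
z = one_sample_z(sample_mean=38.5, mu0=40, sigma=5, n=50)

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two tailed (H1: mu != 40): count the area in BOTH tails
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

# One tailed (H1: mu < 40): count only the lower tail
p_one_tailed = std_normal.cdf(z)

print(f"z = {z:.3f}")
print(f"two tailed p = {p_two_tailed:.4f}")
print(f"one tailed p = {p_one_tailed:.4f}")
```

Because the sample mean here falls below 40, the one tailed p-value is exactly half of the two tailed one, which is why the direction of H1 has to be chosen before looking at the data.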

Two Sample Z-Test

For this example we will use the Z-Test difference in proportions.

  • H0: p1 = p2
  • H1: p1 != p2

— Comprehension check, would this be a one or two tailed example?
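As a rough sketch of the arithmetic behind a difference in proportions test, the snippet below computes the z-statistic with a pooled proportion and then prints the p-value under each possible alternative, so you can see how the choice of tails changes the answer. The counts are invented, and `two_proportion_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes1, n1, successes2, n2):
    """z-statistic for H0: p1 = p2, using the pooled proportion."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Made-up counts: 45 of 200 in group 1, 30 of 200 in group 2
z = two_proportion_z(45, 200, 30, 200)

std_normal = NormalDist()
p_values = {
    "two-sided": 2 * (1 - std_normal.cdf(abs(z))),  # H1: p1 != p2
    "greater": 1 - std_normal.cdf(z),               # H1: p1 > p2
    "less": std_normal.cdf(z),                      # H1: p1 < p2
}

print(f"z = {z:.3f}")
for alternative, p in p_values.items():
    print(f"{alternative}: p = {p:.4f}")
```

With these particular numbers the same z-statistic lands on different sides of alpha = 0.05 depending on the alternative, which is exactly why the wording of H1 matters.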

T-Tests compare the difference between 2 sample means (2 sample) or the sample mean to a known value (1 sample). (Here the Population standard deviation is unknown.)

One Sample T-Test

Compares the sample mean to a known value.

  • H0: The population mean = $100
  • H1: The population mean != $100

— Comprehension answer, these are two tailed examples.
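Here is a minimal sketch of the one sample T-Test against $100, using only the standard library. The ten prices are invented, and because the standard library has no t distribution, the critical value 2.262 is taken from a t table (df = 9, alpha = 0.05, two tailed) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up sample of 10 prices, for illustration only
prices = [102, 98, 110, 105, 95, 108, 101, 99, 107, 104]

MU_0 = 100          # value stated in H0
T_CRITICAL = 2.262  # t table: df = 9, alpha = 0.05, two tailed

n = len(prices)
# stdev() is the SAMPLE standard deviation (divides by n - 1),
# which is what we use when the population value is unknown
standard_error = stdev(prices) / sqrt(n)
t_stat = (mean(prices) - MU_0) / standard_error

decision = "reject H0" if abs(t_stat) > T_CRITICAL else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {n - 1}: {decision}")
```

Note that the test is run on the mean of the prices, not on any individual price.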

Two Sample T-Test

For example, if we wanted to test the average price of an Airbnb based on distance from a city center, we might break those distances down to be ≤ 3 miles or > 3 miles, giving two different samples. Our hypotheses would then be:

  • H0: The average price with a distance of ≤ 3 miles from city center = the average price with a distance > 3 miles from city center
  • H1: The average price with a distance of ≤ 3 miles from city center != the average price with a distance > 3 miles from city center

The price is NOT the same as the average (mean) price.
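The two sample comparison could be sketched like this. The prices are invented, and the sketch assumes the classic pooled (equal variance) version of the test, which is one common choice and not necessarily the one you would use on real listings. The critical value 2.145 comes from a t table (df = 14, alpha = 0.05, two tailed).

```python
from math import sqrt
from statistics import mean, variance

# Made-up nightly prices, for illustration only
near = [120, 135, 128, 140, 150, 132, 138, 145]  # <= 3 miles from city center
far = [110, 105, 118, 122, 100, 115, 108, 112]   # > 3 miles from city center

T_CRITICAL = 2.145  # t table: df = 14, alpha = 0.05, two tailed

n1, n2 = len(near), len(far)
df = n1 + n2 - 2

# Pooled variance: assumes both groups share the same population variance
pooled_variance = ((n1 - 1) * variance(near) + (n2 - 1) * variance(far)) / df
standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))

# The statistic compares the two sample MEANS, not individual prices
t_stat = (mean(near) - mean(far)) / standard_error

decision = "reject H0" if abs(t_stat) > T_CRITICAL else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {df}: {decision}")
```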

Once we have calculated the test statistic or p-value, we are able to reject or fail to reject the null (Eeyore example) hypothesis.

Interpreting the results:

  • If abs(t-statistic) ≤ critical value: Fail to Reject null hypothesis that the means are equal.
  • If abs(t-statistic) > critical value: Reject the null hypothesis that the means are equal.
  • If p ≥ alpha: Fail to Reject null hypothesis that the means are equal.
  • If p < alpha: Reject null hypothesis that the means are equal.
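The critical value rule and the p-value rule are two views of the same decision. The sketch below checks that they agree. It uses the standard normal distribution as a stand-in, because that is what Python's standard library provides. The same logic applies to a t-statistic with a t distribution. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def decide_by_critical_value(statistic, alpha=0.05):
    """Two tailed rule: compare abs(statistic) to the critical value."""
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return "reject H0" if abs(statistic) > critical else "fail to reject H0"


def decide_by_p_value(statistic, alpha=0.05):
    """Two tailed rule: compare the p-value to alpha."""
    p = 2 * (1 - STD_NORMAL.cdf(abs(statistic)))
    return "reject H0" if p < alpha else "fail to reject H0"


# Made-up test statistics, for illustration only
for statistic in (0.8, 1.5, 2.3, -3.1):
    by_critical = decide_by_critical_value(statistic)
    by_p = decide_by_p_value(statistic)
    print(f"{statistic:>5}: {by_critical} | {by_p}")
```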
[Image: ANOVA flashcard]

One Way ANOVA

When using ANOVA testing, we are comparing the population means of 2 or more groups, using a sample from each (typically 3 or more, since a T-Test already covers 2).

  • H0: The population means of all groups are equal.
  • HA: At least one of the population means is different from the others.

Once we have the null and alternative hypotheses, we use the F-statistic to test whether at least one of the group means differs from the others. (It does not tell us which one.) With ANOVAs you compare the F-Value (also called the F-Statistic), calculated from your data, to the F-Critical Value for your chosen alpha to interpret your results.

  • F Value < F Critical Value: fail to reject the null hypothesis
  • F Value > F Critical Value: reject the null hypothesis
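To show where the F value comes from, here is a small hand calculation for a one way ANOVA using only the standard library. The three groups are invented, and the critical value 3.885 is taken from an F table (df = 2 and 12, alpha = 0.05) rather than computed.

```python
from statistics import mean

# Made-up measurements for three groups, for illustration only
groups = {
    "a": [4, 5, 6, 5, 5],
    "b": [6, 7, 8, 7, 7],
    "c": [5, 6, 7, 6, 6],
}

F_CRITICAL = 3.885  # F table: df = (2, 12), alpha = 0.05

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Between-group variation: how far each group mean sits from the grand mean
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)
# Within-group variation: how far each value sits from its own group mean
ss_within = sum(
    (x - mean(values)) ** 2 for values in groups.values() for x in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)

decision = "reject H0" if f_value > F_CRITICAL else "fail to reject H0"
print(f"F = {f_value:.2f}, df = ({df_between}, {df_within}): {decision}")
```

A large F value means the group means are spread out more than the noise inside each group would explain.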

Two Way ANOVA

This one is fun because you need THREE pairs of null and alternative hypotheses. This is because a two way ANOVA test takes the mean of a continuous dependent variable and compares it across two independent categorical variables, plus their interaction. We test all three at the same time!

Keep in mind here that Variables A and B are independent variables and are categorical, like ‘month’, ‘education’, or ‘gender’. The dependent variable, whose mean appears in all three pairs of hypotheses, is continuous, like ‘weight’.

First Independent Variable

  • H0–1: There is no difference in the mean of the dependent variable across the groups of Variable A. (equal)
  • H1–1: There is a difference in the mean of the dependent variable across the groups of Variable A. (not equal)

Second Independent Variable

  • H0–2: There is no difference in the mean of the dependent variable across the groups of Variable B. (equal)
  • H1–2: There is a difference in the mean of the dependent variable across the groups of Variable B. (not equal)

Independent Variables on Dependent Variable

  • H0–3: The effect of Variable A on the sample mean does not depend on the level of Variable B. (There is no interaction between Variable A and B)
  • H1–3: The effect of Variable A on the sample mean depends on the level of Variable B. (There is some interaction between Variable A and B)

For a better idea in practice, here’s an example from TechnologyNetworks:

  • H0: The means of all month groups are equal
  • H1: The mean of at least one month group is different
  • H0: The means of the gender groups are equal
  • H1: The means of the gender groups are different
  • H0: There is no interaction between the month and gender
  • H1: There is interaction between the month and gender
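The three pairs of hypotheses map onto three F values. The sketch below works them out by hand for a small, balanced design (two levels per variable, three observations per cell). The data are invented, the level names are placeholders, and the critical value 5.318 is taken from an F table (df = 1 and 8, alpha = 0.05). Real data is rarely this tidy, so treat it as an illustration of the bookkeeping only.

```python
from statistics import mean

# Made-up, balanced data: key = (level of Variable A, level of Variable B)
data = {
    ("a1", "b1"): [10.0, 12.0, 11.0],
    ("a1", "b2"): [14.0, 15.0, 16.0],
    ("a2", "b1"): [13.0, 14.0, 15.0],
    ("a2", "b2"): [17.0, 19.0, 18.0],
}
n = 3  # observations per cell (the design must be balanced for this shortcut)

F_CRITICAL = 5.318  # F table: df = (1, 8), alpha = 0.05; every effect has 1 df here

levels_a = sorted({key[0] for key in data})
levels_b = sorted({key[1] for key in data})
grand_mean = mean(x for cell in data.values() for x in cell)


def level_mean(position, level):
    """Mean of all observations whose cell key has `level` at `position`."""
    return mean(x for key, cell in data.items() if key[position] == level for x in cell)


ss_a = len(levels_b) * n * sum((level_mean(0, a) - grand_mean) ** 2 for a in levels_a)
ss_b = len(levels_a) * n * sum((level_mean(1, b) - grand_mean) ** 2 for b in levels_b)
ss_cells = n * sum((mean(cell) - grand_mean) ** 2 for cell in data.values())
ss_ab = ss_cells - ss_a - ss_b  # what is left over is the interaction
ss_error = sum((x - mean(cell)) ** 2 for cell in data.values() for x in cell)

df_a = len(levels_a) - 1
df_b = len(levels_b) - 1
df_ab = df_a * df_b
df_error = len(levels_a) * len(levels_b) * (n - 1)
ms_error = ss_error / df_error

f_a = (ss_a / df_a) / ms_error     # tests H0-1 (Variable A)
f_b = (ss_b / df_b) / ms_error     # tests H0-2 (Variable B)
f_ab = (ss_ab / df_ab) / ms_error  # tests H0-3 (interaction)

for name, f_value in (("A", f_a), ("B", f_b), ("A x B", f_ab)):
    decision = "reject H0" if f_value > F_CRITICAL else "fail to reject H0"
    print(f"{name}: F = {f_value:.2f} -> {decision}")
```

In this made-up data both variables shift the mean, but the shift from A is the same at every level of B, so the interaction F value is zero.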
[Image: So many varieties of Chi Squared, but which pairs well with your data and hypothesis testing needs?]

CHI SQUARED

This test can get… complicated. Not because the test itself is hard, but because there are multiple Chi Squared test varieties. We will pare it down to these 2, arguably the most common in data science:

  • Goodness of Fit
  • Test for Independence

Goodness of Fit: Does the distribution of your observed values differ from the expected/theorized distribution? Think of testing a die for fairness. (A fair die should be as likely to roll a 6 as a 1, 2, 3, 4, or 5, but what if that is not the observed frequency?)

  • H0: There is no significant difference between the observed frequency/value and the expected frequency/value.
  • H1: There is a significant difference between the observed frequency/value and the expected frequency/value.
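For the die example, the statistic is just the squared gap between observed and expected counts, scaled by the expected count and summed. Here is a minimal sketch with invented roll counts. The critical value 11.07 is taken from a chi squared table (df = 5, alpha = 0.05).

```python
# Made-up counts of each face over 60 rolls, for illustration only
observed = [5, 8, 9, 8, 10, 20]

CHI2_CRITICAL = 11.07  # chi squared table: df = 5, alpha = 0.05

total_rolls = sum(observed)
# Under H0 (a fair die) every face is expected equally often
expected = [total_rolls / len(observed)] * len(observed)

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

decision = "reject H0" if chi2_stat > CHI2_CRITICAL else "fail to reject H0"
print(f"chi squared = {chi2_stat:.2f}, df = {df}: {decision}")
```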

Test for Independence: Is there a relationship between two categorical variables? Think about categorical variables like ‘gender’ and ‘intelligence’.

  • H0: There is no relationship between variable x and variable y.
  • H1: There is a relationship between variable x and variable y.
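For the test for independence, the expected count in each cell comes from the row and column totals: if x and y were unrelated, each cell would hold (row total × column total) / grand total. Below is a sketch on an invented 2 × 2 table, without any continuity correction. The critical value 3.841 is taken from a chi squared table (df = 1, alpha = 0.05).

```python
# Made-up contingency table: rows = categories of variable x,
# columns = categories of variable y
table = [
    [30, 20],
    [20, 30],
]

CHI2_CRITICAL = 3.841  # chi squared table: df = 1, alpha = 0.05

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi2_stat = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Count we would expect in this cell if x and y were independent
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)

decision = "reject H0" if chi2_stat > CHI2_CRITICAL else "fail to reject H0"
print(f"chi squared = {chi2_stat:.2f}, df = {df}: {decision}")
```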

Sources:

Statistics and Machine Learning

Hypothesis Testing from Good Data

ANOVAS — One Way and Two Way

ANOVA Testing -great resource with examples

Two Way ANOVA

ANOVA Hypothesis from Good Data

Hypothesis Testing R-Studio

Chi-Squared from Slide Share

Chi-Squared by Scary Scientist -great resource with examples

Curve model and Data Scientist based in New York City
