
Statistics Summary Part 4: Chi-Square Test

30 Jun 2020

Categories: Math

Statistics 101

Chi-Square Test

A chi-square test, also written as the $\chi^2$ test (denoted $X^2$ in this post), is a statistical hypothesis test whose test statistic follows a chi-square distribution under the null hypothesis. The best-known example is Pearson’s chi-square test and its variants. Pearson’s chi-square test determines whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

Figure: Chi-square table.

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

  1. For Goodness of Fit
  2. For Independence

We will discuss each case in detail.

(1). Goodness of Fit Test:

The goodness of fit test checks whether sample data fits the distribution of a given population (e.g. a population with a normal distribution or one with a Weibull distribution). In other words, it tells you whether your sample data represents the data you would expect to find in the actual population.
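
As a sketch of the calculation behind the test, write $O_i$ and $E_i$ for the observed and expected counts in category $i$ out of $k$ categories (the subscripted notation is introduced here for illustration):

$$ X^2 = \Sigma_{i=1}^{k}{\frac{(O_i-E_i)^2}{E_i}}, \qquad df = k - 1 $$

The value $df = k - 1$ assumes the expected counts come from a fully specified distribution; each parameter estimated from the sample reduces the degrees of freedom by one.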

Get the p-value from the Chi-square table using the calculated $X^2$ and df values:

Example 1:
You breed Puffskeins and would like to determine the pattern of inheritance for coat color and purring ability. Puffskeins come in either pink or purple and can either purr or hiss. You breed a purebred, pink purring male with a purebred, purple hissing female. All individuals of the $F_1$ generation are pink and purring. The $F_2$ offspring are shown below. Do the alleles for coat color and purring ability assort independently (assume $\alpha = 0.05$)?

| Pink and Purring | Pink and Hissing | Purple and Purring | Purple and Hissing |
|---|---|---|---|
| 143 | 60 | 55 | 18 |

Hypothesis:
Independent assortment means the $F_2$ generation is expected to show a phenotypic ratio of 9:3:3:1 (pink purring : pink hissing : purple purring : purple hissing).

$H_0: \text{The observed distribution of $F_2$ offspring fits the expected 9:3:3:1 ratio}$

$H_a: \text{The observed distribution of $F_2$ offspring does not fit the expected 9:3:3:1 ratio}$

Calculate the expected values:

| Pink and Purring | Pink and Hissing | Purple and Purring | Purple and Hissing |
|---|---|---|---|
| 155.25 | 51.75 | 51.75 | 17.25 |

The chi-square statistic:

$ X^2 = \Sigma{\frac{(O-E)^2}{E}}=\frac{(143-155.25)^2}{155.25} + \frac{(60-51.75)^2}{51.75} + \frac{(55-51.75)^2}{51.75} + \frac{(18-17.25)^2}{17.25} \approx 2.519 $

The degrees of freedom:

$df=4-1=3$

From the Chi-square table, the p-value is greater than 0.1, which exceeds $\alpha = 0.05$, so we fail to reject $H_0$. The observed distribution of $F_2$ offspring is consistent with the expected 9:3:3:1 ratio, so the data give no evidence against independent assortment of the alleles for coat color and purring ability in Puffskeins.
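
The arithmetic of Example 1 can be checked with a short script. This is a minimal sketch using only the Python standard library; the helper `chi2_sf` is a hypothetical name introduced here, and it replaces the table lookup with a closed-form upper-tail probability that is valid for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X^2 >= x) for a positive integer df (hypothetical helper)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k=0}^{df/2 - 1} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{(df-1)/2} y^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(y))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (k - 0.5) / math.gamma(k + 0.5)
    return total

observed = [143, 60, 55, 18]   # F2 counts from the example
ratio = [9, 3, 3, 1]           # ratio expected under independent assortment

n = sum(observed)
expected = [n * r / sum(ratio) for r in ratio]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(statistic, df)

print(expected, round(statistic, 3), df, p_value > 0.05)
```

If SciPy is available, `scipy.stats.chisquare(observed, f_exp=expected)` should give the same statistic along with a p-value.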

Example 2:
You are studying the pattern of dispersion of king penguins and the diagram below represents an area you sampled. Each dot is a penguin. Do the penguins display a uniform distribution (assume $\alpha=0.05$)?

Hypothesis:

$H_0: \text{There is a uniform distribution of penguins}$

$H_a: \text{There is not a uniform distribution of penguins}$

There are 25 penguins in total across 9 squares. So, if the distribution is uniform, we expect $25/9 \approx 2.778$ penguins per square.
The actual observed values are 2, 4, 4, 3, 3, 3, 2, 3, 1, so the $X^2$ statistic is:

$ X^2 = \Sigma{\frac{(O-E)^2}{E}}= \frac{(2-2.778)^2}{2.778} + \frac{(4-2.778)^2}{2.778} + \frac{(4-2.778)^2}{2.778} + \frac{(3-2.778)^2}{2.778} + \frac{(3-2.778)^2}{2.778} + \frac{(3-2.778)^2}{2.778} + \frac{(2-2.778)^2}{2.778} + \frac{(3-2.778)^2}{2.778} + \frac{(1-2.778)^2}{2.778} \approx 2.72 $

The degrees of freedom:

$df=9-1=8$

From the Chi-square table, the p-value is greater than 0.95, which exceeds $\alpha = 0.05$. So, we fail to reject $H_0$: the observed counts are consistent with a uniform distribution of penguins.
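
The same decision can be reached by comparing the statistic with a critical value instead of reading off a p-value. The sketch below assumes the standard upper-tail critical values at $\alpha = 0.05$ from a chi-square table; the dictionary name `CRITICAL_005` is hypothetical and holds only the degrees of freedom used in this post.

```python
observed = [2, 4, 4, 3, 3, 3, 2, 3, 1]   # penguins counted in each of the 9 squares

n = sum(observed)
expected = n / len(observed)              # the same expected count in every square
statistic = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

# Upper-tail critical values at alpha = 0.05, copied from a standard chi-square table.
CRITICAL_005 = {3: 7.815, 8: 15.507, 9: 16.919}

# Reject H0 only when the statistic exceeds the critical value for this df.
reject = statistic > CRITICAL_005[df]

print(round(expected, 3), round(statistic, 2), df, reject)
```

Because the statistic is far below the critical value for 8 degrees of freedom, the script reaches the same conclusion as the table lookup.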


(2). Independence Test:

A chi-square test for independence compares two variables in a contingency table to see whether they are related or independent. In a more general sense, it tests whether the distributions of categorical variables differ from each other.

Workflow:
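
In outline, for a contingency table with $r$ rows and $c$ columns, the usual steps are to compute the expected counts, then the statistic, then the degrees of freedom. The symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count), $R_i$ (row total), $C_j$ (column total), and $N$ (grand total) are notation introduced here for illustration:

$$ E_{ij} = \frac{R_i C_j}{N}, \qquad X^2 = \Sigma_{i=1}^{r}\Sigma_{j=1}^{c}{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}, \qquad df = (r-1)(c-1) $$

The p-value is then read from the Chi-square table using $X^2$ and $df$, as in the goodness of fit test.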

Example:
Given the data below, is there a relationship between fitness level and smoking habits (assume $\alpha = 0.05$)?

Hypothesis:

$H_0: \text{Fitness level and smoking habits are independent (not related)}$

$H_a: \text{Fitness level and smoking habits are not independent (related)}$

Use the expected value formula, $\frac{(\text{row total})(\text{column total})}{\text{grand total}}$, to get the expected counts:

The chi-square statistic:

$$ X^2 = \Sigma{\frac{(O-E)^2}{E}} = \frac{(113-123.75)^2}{123.75} + \frac{(113-124)^2}{124} + \frac{(110-124.26)^2}{124.26} + \cdots = 91.73 $$

And the degrees of freedom are:

$df = (r-1)(c-1) = (4-1)(4-1) = 9$

From the Chi-square table, the p-value is less than 0.005 ($\alpha = 0.05$). So, we reject $H_0$ and conclude that there is a relationship between fitness level and smoking habits.
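
The independence workflow can be sketched in a few lines of Python. The 2x3 table below is hypothetical illustration data, not the fitness and smoking counts from the example, and the function names `expected_counts` and `chi_square_independence` are likewise made up for this sketch.

```python
def expected_counts(table):
    """Expected count per cell: (row total) * (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (statistic, df, expected) for an r x c table of observed counts."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

# Hypothetical observed counts: 2 groups (rows) by 3 categories (columns).
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

statistic, df, expected = chi_square_independence(observed)

# Upper-tail critical value at alpha = 0.05 for df = 2, from a standard chi-square table.
critical_value = 5.991
reject = statistic > critical_value

print(expected, round(statistic, 3), df, reject)
```

For real data, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value; note that it applies a continuity correction by default for 2x2 tables.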
