Two-way tables and chi-square
& Scatterplots and correlation

SOC 221 • Lecture 8

Victoria Sass

Monday, July 22, 2024

Two-way tables
and chi-square

Bivariate tables
and chi-square

Overview

  • Have used one sample to draw inferences about one population

  • Have drawn inferences about the difference between two populations

    • Beginning of efforts to test statistical significance of associations between variables
      • Example: testing for gender differences in emotional intelligence (EI) = testing for an association between gender and EI
  • New goal: test statistical significance for associations in two-way tables

    • Two-way tables: tables that simultaneously cross-classify values of one variable by values of another
      • Also called bivariate tables, contingency tables, or crosstabs
    • Useful for assessing the bivariate associations between categorical variables


Association
Two variables are “associated” with each other when variation in one variable corresponds with variation in the other variable.

Synonym: relationship


If the variables X and Y are associated:

  • particular values of Y tend to coincide with particular values of X
  • the values of Y are different across different values of X
  • the average value of Y depends on the value of X

Examples:

  • Education and income are associated because people with higher levels of education tend to have higher levels of income.
  • Political attitudes are associated with age in that older people tend to be more politically conservative than younger people.

Independent Variable
An independent variable is assumed to influence a dependent variable. The assumed “cause” in an association


Examples:

  • If we assume that education affects income, education is the independent variable and income is the dependent variable.
  • If we believe that political attitudes change with age, then age is the independent variable and political attitudes represent the dependent variable.

Dependent Variable
A dependent variable is affected by one or more other variables. The assumed “effect” in an association



Generally avoid using language of “cause” and “effect” since establishing a case for causality is always difficult and rarely certain.

Associations in
bivariate tables

Goal: Understand the association between race and attitudes about the death penalty


Say you were presented with the following two tables. Can you tell whether there is an association between race and support for the death penalty?

Frequency distribution of attitudes towards the death penalty
Black and White respondents in GSS 2000

                         f      Percent
Support Death Penalty    719    80.16
Oppose Death Penalty     178    19.84
TOTAL                    897    100

Frequency distribution of race
Black and White respondents in GSS 2000

         f      Percent
Black    104    11.59
White    793    88.41
TOTAL    897    100


Two-way (contingency/bivariate) tables

  • Purpose:
    • Simultaneously cross-classify the distribution of two variables
    • Common first step in detecting and summarizing an association between variables

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Goal: Understand the association between race and attitudes about the death penalty


Title: Values of the DV by Values of the IV


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Goal: Understand the association between race and attitudes about the death penalty


Column variable (usually the independent variable)


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Goal: Understand the association between race and attitudes about the death penalty


Row variable (usually the dependent variable)


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Observed frequencies (Each cell of the table includes the count of cases with the specific combination of attributes on the two variables)

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black    White    TOTAL
Support Death Penalty    60       659      719
Oppose Death Penalty     44       134      178
TOTAL                    104      793      897

Marginals show the basic distribution of the two variables
(same information from frequency tables)

“Total” row and column are called marginals

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

Percentages can be used to compare the distribution of the dependent variable (DV) across values of the independent variable (IV)

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

Percentages can be used to compare the distribution of the dependent variable (DV) across values of the independent variable (IV)

57.69% = (60/104) * (100)

Calculate percentages of the DV within values of IV (here use column %s)
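The column-percentage calculation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `column_percentages` are assumptions made for the example, not part of the lecture.

```python
# Observed counts from the two-way table (rows: attitude, columns: race).
observed = {
    "Support": {"Black": 60, "White": 659},
    "Oppose": {"Black": 44, "White": 134},
}


def column_percentages(table):
    """Return each cell as a percentage of its column total."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: 100 * row[c] / col_totals[c] for c in columns}
        for r, row in table.items()
    }


pct = column_percentages(observed)
for outcome, row in pct.items():
    print(outcome, {group: round(value, 2) for group, value in row.items()})
```

Each column of the result is one conditional distribution of the DV, so comparing the columns is the same comparison the slide makes by eye.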

Goal: Understand the association between race and attitudes about the death penalty




Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

Percentages can be used to compare the distribution of the dependent variable (DV) across values of the independent variable (IV)

Percentages of the DV within values of the IV =
Conditional Distributions

Can detect the ASSOCIATION between variables by comparing conditional distributions…

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

DIRECTION
Is the association positive or negative?


STATISTICAL SIGNIFICANCE
How certain can we be that the association exists in the population?

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

  • Maximum (perfect) association:
    • All cases with a particular value of X have the same value on Y (no conditional variation)
    • Knowing the value of X allows for perfect prediction of Y
  • Minimum (no) association:
    • High conditional variation (value of Y varies even among those cases with the same value of X)
    • Knowing the value of X does not improve ability to predict Y

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

  • Maximum (perfect) association:
    • All cases with a particular value of X have the same value on Y (no conditional variation)
    • Knowing the value of X allows for perfect prediction of Y
  • Minimum (no) association:
    • High conditional variation (value of Y varies even among those cases with the same value of X)
    • Knowing the value of X does not improve ability to predict Y

Conditional distributions are completely dissimilar (maximum difference in column %s across values of the IV)

Two-way table of vote on Affordable Care Act (ACA) by political party
US Senators in 2009

                     Democrat        Republican      TOTAL
Voted for ACA        60 (100.00%)    0 (0.00%)       60 (60.61%)
Voted against ACA    0 (0.00%)       39 (100.00%)    39 (39.39%)
TOTAL                60 (100.00%)    39 (100.00%)    99

No conditional variation (all cases with a particular IV value have the same DV value)

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

  • Maximum (perfect) association:
    • All cases with a particular value of X have the same value on Y (no conditional variation)
    • Knowing the value of X allows for perfect prediction of Y
  • Minimum (no) association:
    • High conditional variation (value of Y varies even among those cases with the same value of X)
    • Knowing the value of X does not improve ability to predict Y

Conditional distributions are exactly the same (no difference in column %s across values of the IV; all match the marginal %s)

Two-way table of transportation to work by gender
Alltech Corp. workers 2014

                         Female           Male             TOTAL
Drive                    100 (71.43%)     150 (71.43%)     250 (71.43%)
Public Transportation    30 (21.43%)      45 (21.43%)      75 (21.43%)
Walk/bike                10 (7.14%)       15 (7.14%)       25 (7.14%)
TOTAL                    140 (100.00%)    210 (100.00%)    350 (100.00%)

High conditional variation (lots of different values of DV for cases with same IV value)

Using conditional distributions to detect an association


Two-way table: Number of Crimes Committed by Education
Sample of parolees from Florida Prisons

             Low Education    High Education    TOTAL
0 Crimes     80 (50.00%)      130 (86.67%)      210 (67.74%)
1 Crime      24 (15.00%)      15 (10.00%)       39 (12.58%)
2+ Crimes    56 (35.00%)      5 (3.33%)         61 (19.68%)
TOTAL        160 (100.00%)    150 (100.00%)     310 (100.00%)

YES: Conditional distributions are different


Is there any association?

Goal: Understand the association between race and attitudes about the death penalty


Is there an association between race and support for the death penalty? How do you know?


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

  • Conditional distributions are different from each other and from the marginal percentages
  • If attitudes were completely unassociated with race, we’d expect \(80.16\%\) of both races to support the death penalty

Goal: Understand the association between race and attitudes about the death penalty


How STRONG is the association?



Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

  • Somewhere between the extremes of no association and a perfect association
    • We can quantify the strength of the association using risk ratios

Goal: Understand the association between race and attitudes about the death penalty



How STRONG is the association?


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

RISK RATIO (a.k.a. relative risk)
The ratio of the probability of some outcome among one group to the probability of the outcome among a different group.

Goal: Understand the association between race and attitudes about the death penalty



How STRONG is the association?

Probability of supporting the death penalty

Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

For Black respondents:
\(P(SUPPORT) = 0.5769\)

For White respondents:
\(P(SUPPORT) = 0.8310\)

RISK RATIO = \(0.8310 / 0.5769 = 1.44\)

The probability of supporting the death penalty is 1.44 times as high for White respondents as for Black respondents
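As a quick check on the arithmetic, the risk ratio can be computed directly from the cell counts. A minimal sketch; the helper name `risk_ratio` and its argument order are assumptions made for illustration.

```python
def risk_ratio(events_a, total_a, events_b, total_b):
    """Probability of the outcome in group A divided by that in group B."""
    return (events_a / total_a) / (events_b / total_b)


# White respondents as group A, Black respondents as group B.
rr = risk_ratio(659, 793, 60, 104)
print(round(rr, 2))
```

Swapping the two groups gives the reciprocal ratio, so it matters which group is named as the comparison group when the result is reported.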

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

  • Maximum (perfect) association:
    • All cases with a particular value of X have the same value on Y (no conditional variation)
    • Knowing the value of X allows for perfect prediction of Y
  • Minimum (no) association:
    • High conditional variation (value of Y varies even among those cases with the same value of X)
    • Knowing the value of X does not improve ability to predict Y

DIRECTION
Is the association positive or negative?

  • Positive association: High values on Y tend to coincide with high values on X
  • Negative association: High values on Y tend to coincide with low values on X

Using conditional distributions to detect an association


Two-way table: Number of Crimes Committed by Education
Sample of parolees from Florida Prisons

             Low Education    High Education    TOTAL
0 Crimes     80 (50.00%)      130 (86.67%)      210 (67.74%)
1 Crime      24 (15.00%)      15 (10.00%)       39 (12.58%)
2+ Crimes    56 (35.00%)      5 (3.33%)         61 (19.68%)
TOTAL        160 (100.00%)    150 (100.00%)     310 (100.00%)

Negative:
Higher education associated with lower number of crimes


What is the direction of this association?

Goal: Understand the association between race and attitudes about the death penalty


What is the direction of this association?


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897


Not relevant since these are nominal variables (no higher or lower values)

Key characteristics of an association (e.g. between X and Y)

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

  • Maximum (perfect) association:
    • All cases with a particular value of X have the same value on Y (no conditional variation)
    • Knowing the value of X allows for perfect prediction of Y
  • Minimum (no) association:
    • High conditional variation (value of Y varies even among those cases with the same value of X)
    • Knowing the value of X does not improve ability to predict Y

DIRECTION
Is the association positive or negative?

  • Positive association: High values on Y tend to coincide with high values on X
  • Negative association: High values on Y tend to coincide with low values on X

STATISTICAL SIGNIFICANCE
How certain can we be that the association exists in the population?

  • Relevant when drawing inferences from an association observed in a sample to a possible association in the population
  • Determined by a statistical hypothesis test

Goal: Understand the association between race and attitudes about the death penalty


There is an association in this sample


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897


Key question: Is this association in the sample strong enough to convince us that there is a real association in the POPULATION from which the sample was drawn?

Chi-square test

  • Used to test statistical significance of associations in a two-way table (so, between categorical variables)

  • Intended to test whether a pattern or association observed in a set of sample data:

    1. represents a real association in the population from which the sample was drawn
      OR
    2. reflects random sampling error when, in reality, there is no real association in the population
  • Based on a comparison of our observed frequencies to expected frequencies.

    • Observed frequencies = the cell counts actually observed in the sample data
    • Expected frequencies = the cell counts that we would expect if there were no association between the variables

Goal: Understand the association between race and attitudes about the death penalty


What would the table look like if there were no association between the variables?


Two-way table of attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    60 (57.69%)      659 (83.10%)     719 (80.16%)
Oppose Death Penalty     44 (42.31%)      134 (16.90%)     178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

Chi-square test is based on comparison of the counts/frequencies we actually observe in the sample to what the table would look like if there were no association between the variables

Goal: Understand the association between race and attitudes about the death penalty


What would the table look like if there were no association between the variables?

Conditional distributions of the DV would match across the values of the IV (same as marginals)

EXPECTED FREQUENCIES for attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black            White            TOTAL
Support Death Penalty    (80.16%)         (80.16%)         719 (80.16%)
Oppose Death Penalty     (19.84%)         (19.84%)         178 (19.84%)
TOTAL                    104 (100.00%)    793 (100.00%)    897

Chi-square test is based on comparison of the counts/frequencies we actually observe in the sample to what the table would look like if there were no association between the variables

Goal: Understand the association between race and attitudes about the death penalty


What would the table look like if there were no association between the variables?

Conditional distributions of the DV would match across the values of the IV (same as marginals)

EXPECTED FREQUENCIES for attitudes towards the death penalty by race
Black and White respondents in GSS 2000

                         Black             White              TOTAL
Support Death Penalty    83.36 (80.16%)    635.64 (80.16%)    719 (80.16%)
Oppose Death Penalty     20.64 (19.84%)    157.36 (19.84%)    178 (19.84%)
TOTAL                    104 (100.00%)     793 (100.00%)      897

Chi-square test is based on comparison of the counts/frequencies we actually observe in the sample to what the table would look like if there were no association between the variables

Calculation of chi-square

  • Obtained value of chi-square:

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \] where
\(f_o =\) observed frequency in a given cell
\(f_e =\) frequency in the cell expected under the assumption of the null hypothesis (no association in the population)

  • Shortcut to calculate expected cell frequencies:

\[ f_e = \frac{\text{(row marginal)(column marginal)}}{n} \]

Note: Chi-square will take a value of 0 if there is no association in the sample.
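The shortcut for expected cell frequencies translates directly into code. The sketch below assumes the table is stored as a list of rows; the function name `expected_frequencies` is illustrative only.

```python
def expected_frequencies(observed):
    """Expected counts under no association: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: support, oppose. Columns: Black, White.
observed = [[60, 659], [44, 134]]
expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

The expected counts keep the same row and column totals as the observed table; only the way cases are spread across the cells changes.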

Goal: Understand the association between race and attitudes about the death penalty


OBSERVED and EXPECTED Frequencies

                         Black                                     White                                       TOTAL
Support Death Penalty    \(f_o = 60\), \(f_e = 83.36\) (57.69%)    \(f_o = 659\), \(f_e = 635.64\) (83.10%)    719 (80.16%)
Oppose Death Penalty     \(f_o = 44\), \(f_e = 20.64\) (42.31%)    \(f_o = 134\), \(f_e = 157.36\) (16.90%)    178 (19.84%)
TOTAL                    104 (100.00%)                             793 (100.00%)                               897


           \(f_o\)    \(f_e\)    \((f_o - f_e)\)    \((f_o - f_e)^2\)    \(\frac{(f_o - f_e)^2}{f_e}\)
Cell #1    60         83.36      -23.36             545.69               6.55
Cell #2    659        635.64     23.36              545.69               0.86
Cell #3    44         20.64      23.36              545.69               26.44
Cell #4    134        157.36     -23.36             545.69               3.47

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

Goal: Understand the association between race and attitudes about the death penalty


OBSERVED and EXPECTED Frequencies

                         Black                                     White                                       TOTAL
Support Death Penalty    \(f_o = 60\), \(f_e = 83.36\) (57.69%)    \(f_o = 659\), \(f_e = 635.64\) (83.10%)    719 (80.16%)
Oppose Death Penalty     \(f_o = 44\), \(f_e = 20.64\) (42.31%)    \(f_o = 134\), \(f_e = 157.36\) (16.90%)    178 (19.84%)
TOTAL                    104 (100.00%)                             793 (100.00%)                               897


           \(f_o\)    \(f_e\)    \((f_o - f_e)\)    \((f_o - f_e)^2\)    \(\frac{(f_o - f_e)^2}{f_e}\)
Cell #1    60         83.36      -23.36             545.69               6.55
Cell #2    659        635.64     23.36              545.69               0.86
Cell #3    44         20.64      23.36              545.69               26.44
Cell #4    134        157.36     -23.36             545.69               3.47

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} = 37.31 \]

Chi-square score summarizes the difference between what we observe in the sample and what we would expect to observe if there were no association between the variables.
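The cell-by-cell calculation can be sketched as a short function. The name `chi_square` and the list-of-rows layout are assumptions for this example. Note that carrying unrounded expected frequencies through the sum gives a value of about 37.32; the slide's 37.31 reflects rounding each \(f_e\) to two decimals before dividing.

```python
def chi_square(observed):
    """Sum of (f_o - f_e)^2 / f_e over every cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected count if the two variables were unassociated.
            f_e = row_totals[i] * col_totals[j] / n
            total += (f_o - f_e) ** 2 / f_e
    return total


# Rows: support, oppose. Columns: Black, White.
statistic = chi_square([[60, 659], [44, 134]])
print(round(statistic, 2))
```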

Question: Is that difference big enough to convince us that it did not just happen by chance (sampling error)?

Need a hypothesis test

Hypothesis test for two-way Chi-square

  1. Check assumptions
    • Random sample, scores are independent (i.e., each subject is allowed only one preference); no expected cell frequencies below 5.
  2. State the hypotheses
    • Null: No association in the population
    • Alternative: A real association in the population.
  3. Identify alpha and the critical value of chi-square
  4. Calculate the test statistic
    • chi-square obtained
  5. Make a decision
    • Reject or fail to reject null hypothesis
    • Make a statement about the implications for the population

Step 1: Check assumptions


OBSERVED and EXPECTED Frequencies

                         Black                                     White                                       TOTAL
Support Death Penalty    \(f_o = 60\), \(f_e = 83.36\) (57.69%)    \(f_o = 659\), \(f_e = 635.64\) (83.10%)    719 (80.16%)
Oppose Death Penalty     \(f_o = 44\), \(f_e = 20.64\) (42.31%)    \(f_o = 134\), \(f_e = 157.36\) (16.90%)    178 (19.84%)
TOTAL                    104 (100.00%)                             793 (100.00%)                               897


  • Random sample ✔️
  • Scores are independent ✔️
  • No expected cell frequencies below 5 ✔️
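The expected-frequency assumption is easy to check mechanically. A minimal sketch, assuming the table is a list of rows; the function name `smallest_expected_count` is hypothetical.

```python
def smallest_expected_count(observed):
    """Return the smallest expected cell frequency under no association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)


# Rows: support, oppose. Columns: Black, White.
smallest = smallest_expected_count([[60, 659], [44, 134]])
print(round(smallest, 2), smallest >= 5)
```

The smallest expected count always sits in the cell where the smallest row total meets the smallest column total, so that is the cell to check first by hand.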

Step 2: State the hypotheses



\(H_0\): No association between race and attitudes toward death penalty in the population


\(H_a\): A real association between race and attitudes toward death penalty in the population

Step 3: Identify alpha and the critical value of chi-square


The default alpha level is \(0.05\) (\(0.01\) and \(0.001\) are tougher alternatives)


\(df = (R-1)(C-1) = (2-1)(2-1) = 1\)


Now we can go to the chi-square distribution table to see what critical value is associated with an alpha level of 0.05 and 1 degree of freedom



Our critical value is \(3.84\)
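For 1 degree of freedom the table lookup can be reproduced without a chi-square table. The sketch below relies on a standard result that is not stated in the lecture: a chi-square variable with \(df = 1\) is the square of a standard normal variable, so the critical value is the squared two-tailed z cutoff. The function name is illustrative, and the approach does not extend to other degrees of freedom.

```python
from statistics import NormalDist


def critical_value_df1(alpha):
    """Critical chi-square value for 1 degree of freedom.

    Assumes the df = 1 relationship chi-square = z squared, so the
    cutoff is the two-tailed z cutoff for alpha, squared.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(critical_value_df1(alpha), 2))
```

Smaller alpha levels give larger critical values, which is why 0.01 and 0.001 are described as tougher alternatives.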

Step 4: Calculate the test statistic


OBSERVED and EXPECTED Frequencies

                         Black                                     White                                       TOTAL
Support Death Penalty    \(f_o = 60\), \(f_e = 83.36\) (57.69%)    \(f_o = 659\), \(f_e = 635.64\) (83.10%)    719 (80.16%)
Oppose Death Penalty     \(f_o = 44\), \(f_e = 20.64\) (42.31%)    \(f_o = 134\), \(f_e = 157.36\) (16.90%)    178 (19.84%)
TOTAL                    104 (100.00%)                             793 (100.00%)                               897


           \(f_o\)    \(f_e\)    \((f_o - f_e)\)    \((f_o - f_e)^2\)    \(\frac{(f_o - f_e)^2}{f_e}\)
Cell #1    60         83.36      -23.36             545.69               6.55
Cell #2    659        635.64     23.36              545.69               0.86
Cell #3    44         20.64      23.36              545.69               26.44
Cell #4    134        157.36     -23.36             545.69               3.47

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} = 37.31 \]

Step 5: Make a decision



Since the obtained chi-square (\(37.31\)) is greater than the critical value (\(3.84\)), I can reject the null hypothesis


This supports the research hypothesis that there IS a real association between race and attitudes toward the death penalty IN THE POPULATION

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.


In this sample, is there an association between diet and feelings towards Thanksgiving? How do you know?

Feelings about Thanksgiving

               Vegetarian    Carnivore    TOTAL
Dislikes       25                         80
Indifferent    36                         111
Likes                                     184
TOTAL          100           275          375

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.


In this sample, is there an association between diet and feelings towards Thanksgiving? How do you know?

Feelings about Thanksgiving

               Vegetarian    Carnivore    TOTAL
Dislikes       25            55           80
Indifferent    36            75           111
Likes          39            145          184
TOTAL          100           275          375

Fill in the missing observed frequencies

(note that once two cells are completed (and you have the marginals) you can complete the table)
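Filling in the table is just repeated subtraction from the marginals. The sketch below is one possible way to automate it for this particular table; the names `partial`, `row_totals`, `col_totals`, and `complete_table` are assumptions for the example, and the two-pass approach is only guaranteed to finish when enough cells are known, as they are here.

```python
# Known cells (None marks a missing count). Columns: Vegetarian, Carnivore.
partial = {
    "Dislikes": [25, None],
    "Indifferent": [36, None],
    "Likes": [None, None],
}
row_totals = {"Dislikes": 80, "Indifferent": 111, "Likes": 184}
col_totals = [100, 275]


def complete_table(partial, row_totals, col_totals):
    """Fill missing counts by subtracting known cells from the marginals."""
    table = {label: list(cells) for label, cells in partial.items()}
    # Pass 1: a row with one missing cell is fixed by its row total.
    for label, cells in table.items():
        if cells.count(None) == 1:
            j = cells.index(None)
            cells[j] = row_totals[label] - sum(c for c in cells if c is not None)
    # Pass 2: a column with one missing cell is fixed by its column total.
    for j, total in enumerate(col_totals):
        missing = [label for label, cells in table.items() if cells[j] is None]
        if len(missing) == 1:
            known = sum(cells[j] for cells in table.values() if cells[j] is not None)
            table[missing[0]][j] = total - known
    return table


completed = complete_table(partial, row_totals, col_totals)
print(completed)
```

Only two cells are free to vary once the marginals are fixed, which matches the \((R-1)(C-1) = 2\) degrees of freedom used later for this table.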

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.


In this sample, is there an association between diet and feelings towards Thanksgiving? How do you know?

Feelings about Thanksgiving

               Vegetarian     Carnivore       TOTAL
Dislikes       25 (25.00%)    55 (20.00%)     80
Indifferent    36 (36.00%)    75 (27.27%)     111
Likes          39 (39.00%)    145 (52.73%)    184
TOTAL          100            275             375

Add column percentages to better understand conditional distributions

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.


In this sample, is there an association between diet and feelings towards Thanksgiving? How do you know?

Feelings about Thanksgiving

               Vegetarian     Carnivore       TOTAL
Dislikes       25 (25.00%)    55 (20.00%)     80
Indifferent    36 (36.00%)    75 (27.27%)     111
Likes          39 (39.00%)    145 (52.73%)    184
TOTAL          100            275             375

Differences in conditional distributions indicate that there IS an association in the sample

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.


In this sample, is there an association between diet and feelings towards Thanksgiving? How do you know?

\(\frac{0.53}{0.39} = 1.36\)

Feelings about Thanksgiving

               Vegetarian     Carnivore       TOTAL
Dislikes       25 (25.00%)    55 (20.00%)     80
Indifferent    36 (36.00%)    75 (27.27%)     111
Likes          39 (39.00%)    145 (52.73%)    184
TOTAL          100            275             375

Use risk ratios to quantify the association. For example, the probability of liking Thanksgiving is 36% higher for carnivores than for vegetarians

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

Is this association statistically significant?

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

Feelings about Thanksgiving

               Vegetarian     Carnivore       TOTAL
Dislikes       25 (25.00%)    55 (20.00%)     80
Indifferent    36 (36.00%)    75 (27.27%)     111
Likes          39 (39.00%)    145 (52.73%)    184
TOTAL          100            275             375

\(H_a\): IN THE POPULATION, there is an association
between diet and attitudes towards Thanksgiving.

\(H_0\): There is no association between diet and
attitudes towards Thanksgiving IN THE POPULATION.

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

Is this association statistically significant?

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

Feelings about Thanksgiving

               Vegetarian                      Carnivore                         TOTAL
Dislikes       25, \(f_e = 21.33\) (25.00%)    55, \(f_e = 58.67\) (20.00%)      80
Indifferent    36, \(f_e = 29.60\) (36.00%)    75, \(f_e = 81.40\) (27.27%)      111
Likes          39, \(f_e = 49.07\) (39.00%)    145, \(f_e = 134.93\) (52.73%)    184
TOTAL          100                             275                               375

Find EXPECTED FREQUENCIES:

\(f_e = \frac{\text{(row marginal)(column marginal)}}{n}\)

For example: \(f_e\) for Cell 5 \(= \frac{(184)(100)}{375} = 49.07\)

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

Is this association statistically significant?

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

Feelings about Thanksgiving

               Vegetarian                      Carnivore                         TOTAL
Dislikes       25, \(f_e = 21.33\) (25.00%)    55, \(f_e = 58.67\) (20.00%)      80
Indifferent    36, \(f_e = 29.60\) (36.00%)    75, \(f_e = 81.40\) (27.27%)      111
Likes          39, \(f_e = 49.07\) (39.00%)    145, \(f_e = 134.93\) (52.73%)    184
TOTAL          100                             275                               375

Expected frequencies reflect how the table would look if there were no association between the variables (i.e., if the null hypothesis were true)

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

Is this association statistically significant?

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

          \(f_o\)    \(f_e\)    \((f_o - f_e)\)    \((f_o - f_e)^2\)    \(\frac{(f_o - f_e)^2}{f_e}\)
Cell 1    25         21.33      3.67               13.44                0.63
Cell 2    55         58.67      -3.67              13.44                0.23
Cell 3    36         29.60      6.40               40.96                1.38
Cell 4    75         81.40      -6.40              40.96                0.50
Cell 5    39         49.07      -10.07             101.34               2.07
Cell 6    145        134.93     10.07              101.34               0.75

Expected frequencies reflect how the table would look if there were no association between the variables (i.e., if the null hypothesis were true)

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

Is this association statistically significant?

\[ \chi^2 = \Sigma\frac{(f_o - f_e)^2}{f_e} \]

          \(f_o\)    \(f_e\)    \((f_o - f_e)\)    \((f_o - f_e)^2\)    \(\frac{(f_o - f_e)^2}{f_e}\)
Cell 1    25         21.33      3.67               13.44                0.63
Cell 2    55         58.67      -3.67              13.44                0.23
Cell 3    36         29.60      6.40               40.96                1.38
Cell 4    75         81.40      -6.40              40.96                0.50
Cell 5    39         49.07      -10.07             101.34               2.07
Cell 6    145        134.93     10.07              101.34               0.75

\(\chi^2 = 5.56\)
  • Need to compare our obtained \(\chi^2\) value of 5.56 with the critical value of \(\chi^2\)
  • By default we use an alpha level of 0.05
  • \(df = (R-1)(C-1) = (3-1)(2-1) = 2\)

Expected frequencies reflect how the table would look if there were no association between the variables (i.e., if the null hypothesis were true)

Another example

You collect data from a random sample of 375 individuals to look at whether feelings toward Thanksgiving differ by dietary preferences. The partial data are in the table to the right.

\[ \text{obtained } \chi^2 = 5.56 \]

\[ \text{critical value of } \chi^2 = 5.99 \]

Feelings about Thanksgiving

               Vegetarian                      Carnivore                         TOTAL
Dislikes       25, \(f_e = 21.33\) (25.00%)    55, \(f_e = 58.67\) (20.00%)      80
Indifferent    36, \(f_e = 29.60\) (36.00%)    75, \(f_e = 81.40\) (27.27%)      111
Likes          39, \(f_e = 49.07\) (39.00%)    145, \(f_e = 134.93\) (52.73%)    184
TOTAL          100                             275                               375

Since obtained value of chi-square is LESS EXTREME than the critical value we FAIL TO REJECT THE NULL HYPOTHESIS. The association observed is NOT statistically significant. Cannot be confident that the association exists in the population.
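The decision step for this table can be sketched in code. The closed form used here is a standard result that the lecture does not state: with \(df = 2\), the upper-tail probability of chi-square is \(e^{-x/2}\), so the critical value is \(-2\ln(\alpha)\). The function name is illustrative, and the formula holds only for 2 degrees of freedom.

```python
import math


def critical_value_df2(alpha):
    """Critical chi-square value for 2 degrees of freedom.

    Assumes the df = 2 upper-tail probability exp(-x / 2); solving
    exp(-x / 2) = alpha gives x = -2 * ln(alpha).
    """
    return -2 * math.log(alpha)


obtained = 5.56  # obtained chi-square from the lecture's table
critical = critical_value_df2(0.05)
reject_null = obtained > critical
print(round(critical, 2), reject_null)
```

Here `reject_null` comes out false, matching the slide's decision to fail to reject the null hypothesis.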

What happens if the sample is doubled, with the same conditional distributions?

Since the conditional distributions are unchanged by doubling the sample, and still differ between vegetarians and carnivores, there still appears to be an association IN THE SAMPLE.

Feelings about Thanksgiving, doubled sample (each cell: observed count; expected frequency \(f_e\); column percentage)
Feelings Vegetarian Carnivore TOTAL
Dislikes 50 (\(f_e = 42.67\); 25.00%) 110 (\(f_e = 117.33\); 20.00%) 160
Indifferent 72 (\(f_e = 59.20\); 36.00%) 150 (\(f_e = 162.80\); 27.27%) 222
Likes 78 (\(f_e = 98.13\); 39.00%) 290 (\(f_e = 269.87\); 52.73%) 368
TOTAL 200 550 750

\[ \text{obtained } \chi^2 = 11.13 \]

\[ \text{critical value of } \chi^2 = 5.99 \]

Cell \(f_o\) \(f_e\) \((f_o - f_e)\) \((f_o - f_e)^2\) \(\frac{(f_o - f_e)^2}{f_e}\)
Cell 1 50 42.67 7.33 53.73 1.26
Cell 2 110 117.33 -7.33 53.73 0.46
Cell 3 72 59.2 12.8 163.84 2.77
Cell 4 150 162.8 -12.8 163.84 1.01
Cell 5 78 98.13 -20.13 405.22 4.13
Cell 6 290 269.87 20.13 405.22 1.5
\(\chi^2 =\) 11.13

The obtained value of the chi-square goes way up. Now, the obtained value of chi-square \(\gt\) critical value of chi-square. The association observed in the sample IS statistically significant. We REJECT THE NULL HYPOTHESIS and find SUPPORT FOR THE ALTERNATIVE HYPOTHESIS that there is an association in the population.

Chi-square and strength of an association

  • Size of chi-square (obtained) statistic is directly proportional to sample size
    • Double cell counts = double chi-square, regardless of strength of association in the sample
    • Cut cell counts by \(\frac{1}{4}\) = \(\frac{1}{4}\) reduction in chi-square, regardless of strength of association in the sample
  • Can have a large chi-square with a weak association if \(n\) is large
  • Hard to find statistically significant associations with small \(n\)
    • Note the key assumption of no expected frequencies below 5
  • Avoid drawing conclusions about strength of an association based on size of chi-square
    • Large chi-square = stronger confidence in inference, not strength
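The proportionality between chi-square and sample size can be seen directly by doubling every cell of the Thanksgiving table. The sketch below is illustrative only; `chi_square` is a hypothetical helper, not something from the lecture.

```python
def chi_square(table):
    """Chi-square statistic for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected count
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2


original = [[25, 55], [36, 75], [39, 145]]
doubled = [[2 * count for count in row] for row in original]

chi2_original = chi_square(original)
chi2_doubled = chi_square(doubled)
print(round(chi2_original, 2), round(chi2_doubled, 2))  # 5.56 11.13
```

The column percentages are identical in both tables, so the strength of the association in the sample has not changed; only the chi-square statistic has.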

Break!

Scatterplots and correlation

Overview of Correlation

Looking at the association between variables

  • The statistical link between variables
  • Tendency for certain types of values of one variable to coincide with certain kinds of values of the other variable

FIRST step in assessing arguments that one variable (independent variable) has a causal impact on another (dependent variable)

We’ve looked for associations between nominal and ordinal variables in bivariate tables

Now we want to measure association for interval variables

STRENGTH
How strong is the tendency for certain values of Y to go with particular values of X?

DIRECTION
Is the association positive or negative?

STATISTICAL SIGNIFICANCE
How certain can we be that the association exists in the population?

Correlation and Scatterplots

Scatterplot: A graph that uses points to simultaneously display the value on two variables for each case in the data

Allows us to picture the association between variables

Example: Association between hiker weight and weight of backpack carried

body backpack
120 26
187 30
109 26
103 24
131 29
165 35
158 31
116 28

X-axis (horizontal) displays values of the independent variable (IV)

Y-axis (vertical) displays values of the dependent variable (DV)

Each dot represents a case, positioned along the X and Y axes

Can see the (positive) association
(high values on one variable tend to go with high values on the other)
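Since the plotted figure does not survive in text form, here is a rough text-mode scatterplot of the hiker data. A plotting library would normally be used; this sketch sticks to the standard library so it runs anywhere. The helper `text_scatter` and its grid size are assumptions made for illustration, and it assumes the values on each axis are not all identical.

```python
# Hiker data from the example (body weight, backpack weight).
body = [120, 187, 109, 103, 131, 165, 158, 116]
backpack = [26, 30, 26, 24, 29, 35, 31, 28]


def text_scatter(xs, ys, width=30, height=8):
    """Return a list of text rows with '*' marking each (x, y) case."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Rescale each value to a column / row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]


plot = text_scatter(body, backpack)
for line in plot:
    print("|" + line)       # vertical axis: backpack weight (DV)
print("+" + "-" * 30)       # horizontal axis: body weight (IV)
```

The marks drift from the lower left toward the upper right, which is the visual signature of a positive association.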

Correlation and Regression


The two most common tools for measuring
associations between interval variables

CORRELATION

standardized summary of association between our variables

  • does not depend on the units of the variables
    • always between -1.0 and 1.0 (0 to 1.0 in absolute value)
  • allows for comparison of associations between different pairs of variables

REGRESSION

characterizes the substantive effect of X on Y

  • how much Y differs across different values of X
  • conveyed in units of our specific independent and dependent variables

Both correlation and regression are based on a description of a line used to characterize data points in a scatterplot


Correlation Coefficient (r)
A measure of association reflecting both the strength and the direction of the association between two interval-level variables


Why it’s helpful

  • Simple:
    • one number to convey a lot about an association
  • Symmetrical:
    • correlation of X and Y same as the correlation of Y and X
  • Standardized:
    • does not depend on the units of the variables
    • always ranges from -1 to 1 (0 to 1 in absolute value)
    • therefore, allows for comparison of bivariate associations between different pairs of variables
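The "symmetrical" and "standardized" properties can be checked numerically. The sketch below uses the hiker data from the scatterplot example and assumes, purely for illustration, that the weights are in pounds; `pearson_r` is a hypothetical helper that follows the standardized-score formula with an \(n-1\) divisor.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


body_lb = [120, 187, 109, 103, 131, 165, 158, 116]
backpack_lb = [26, 30, 26, 24, 29, 35, 31, 28]

# Assumed unit change: pounds to kilograms (approximate conversion factor).
body_kg = [w * 0.4536 for w in body_lb]

r_lb = pearson_r(body_lb, backpack_lb)
r_kg = pearson_r(body_kg, backpack_lb)        # same r after changing units
r_swapped = pearson_r(backpack_lb, body_lb)   # same r with X and Y swapped

print(math.isclose(r_lb, r_kg), math.isclose(r_lb, r_swapped))
```

Because each variable is converted to standard deviation units before multiplying, rescaling a variable leaves r unchanged.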



Why you need to be careful

  • Correlation (and regression) only work well with linear associations
    • i.e., scatterplot shows a roughly straight-line pattern
  • Correlation ≠ causation
    • The observed association might be due to some third “lurking” variable
    • i.e., the association might be spurious
  • Don’t extrapolate
    • Can’t use the observed association to draw conclusions about the association among individuals with values outside of your observed range on the variables
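To see why the linearity caution matters, consider a made-up data set (not from the lecture) in which Y is completely determined by X but the pattern is U-shaped. The hypothetical helper `pearson_r` below uses the standardized-score formula with an \(n-1\) divisor.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]  # perfect U-shaped (nonlinear) relationship

r_curve = pearson_r(x, y)
print(abs(r_curve) < 1e-9)  # True: r is essentially 0 despite a perfect pattern
```

A correlation near zero therefore does not rule out a strong nonlinear association, which is why the scatterplot should be checked first.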


Interpretation

  • Sign indicates the direction of the association
    • Positive correlation indicates a positive association
      • high values on one variable tend to correspond with high values on the other variable
      • e.g., education and income have a positive correlation
    • Negative correlation indicates a negative association
      • high values on one variable tend to correspond with low values on the other variable
      • e.g., income and stress have a negative correlation


  • Value of the number indicates strength of association
    • i.e., how strong is the tendency for certain values on one variable to correspond with certain values on the other?
    • Minimum value = 0
      • Values close to 0 indicate little or no linear association
    • Maximum value = 1 or -1
      • values close to -1.0 or 1.0 indicate strong relationships
    • Rule of thumb (in absolute values):
      • 0.00 to 0.30 = weak association
      • 0.31 to 0.60 = moderate association
      • 0.61 to 1.0 = strong association

Practice


What type of correlations are the following?

  • Correlation (r) for education and income = 0.385
    moderate and positive
  • Correlation (r) for education and stress = -0.615
    strong and negative
  • Correlation (r) for hours studied and number of dates = 0.045
    weak and positive
  • Correlation (r) for number of children and number of hours of sleep = -0.421
    moderate and negative
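The practice answers follow mechanically from the rule of thumb and the sign of \(r\). A minimal sketch, with one assumption: the rule of thumb leaves small gaps (for example between 0.30 and 0.31), and the hypothetical helper `describe_r` assigns such values to the stronger band.

```python
def describe_r(r):
    """Label strength and direction using the lecture's rule of thumb."""
    size = abs(r)
    if size <= 0.30:
        strength = "weak"
    elif size <= 0.60:  # assumption: values just above 0.30 count as moderate
        strength = "moderate"
    else:               # assumption: values just above 0.60 count as strong
        strength = "strong"
    direction = "positive" if r > 0 else "negative" if r < 0 else "no direction"
    return f"{strength} and {direction}"


for r in (0.385, -0.615, 0.045, -0.421):
    print(r, describe_r(r))
```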

Scatterplots allow us to picture the association between variables



Stronger associations = more tightly clustered points

Weak associations have lots of conditional variation and not much difference in conditional distributions (distribution of the DV across values of the IV)

Strong associations have very little conditional variation and lots of difference in conditional distributions (distribution of the DV across values of the IV)

Making a scatterplot

Example: Do people who study more have more or fewer dates?

Hours studied Dates
10 1
15 2
10 5
6 1
2 0
7 3
10 4
12 3
7 1
20 3

Can see the association

Correlation coefficient allows us to quantify/summarize that association

Calculating correlation coefficient (r)

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

  • \(x_i\): value of x variable for an individual
  • \(\bar{x}\): mean of x variable
  • \(s_x\): standard deviation of x variable
  • \(y_i\): value of y variable for an individual
  • \(\bar{y}\): mean of y variable
  • \(s_y\): standard deviation of y variable
  • \(\Sigma\): add up for all individuals
  • \(\frac{1}{n-1}\): divide the whole mess by n-1

Calculating correlation coefficient (r)

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

\(\frac{x_i - \bar{x}}{s_x}\) is just the standardized score on x (i.e., how many standard deviations the individual’s value on x is from the mean of x)

\(\frac{y_i - \bar{y}}{s_y}\) is just the standardized score on y (i.e., how many standard deviations the individual’s value on y is from the mean of y)
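As a small illustration of what a standardized score is, the sketch below converts the hours-studied values from the dating example into z-scores. The helper name `standardize` is hypothetical; it uses the sample standard deviation (dividing by \(n-1\)).

```python
import math

hours = [10, 15, 10, 6, 2, 7, 10, 12, 7, 20]  # hours studied, from the example


def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]


z_hours = standardize(hours)
print([round(z, 2) for z in z_hours])
# [0.02, 1.02, 0.02, -0.78, -1.57, -0.58, 0.02, 0.42, -0.58, 2.01]
```

The person who studied 20 hours is about two standard deviations above the mean, while the person who studied 2 hours is about one and a half below it.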

Calculating Correlation Coefficient (r)

Association between hours studied and number of dates

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

Person Hours studied (\(x_i\)) Dates (\(y_i\)) \(x_i -\bar{x}\) \(\frac{x_i-\bar{x}}{s_x}\) \(y_i -\bar{y}\) \(\frac{y_i-\bar{y}}{s_y}\) \((\frac{x_i-\bar{x}}{s_x})(\frac{y_i-\bar{y}}{s_y})\)
1 10 1 0.10 0.02 -1.30 -0.83 -0.02
2 15 2 5.10 1.02 -0.30 -0.19 -0.19
3 10 5 0.10 0.02 2.70 1.72 0.03
4 6 1 -3.90 -0.78 -1.30 -0.83 0.64
5 2 0 -7.90 -1.57 -2.30 -1.47 2.31
6 7 3 -2.90 -0.58 0.70 0.45 -0.26
7 10 4 0.10 0.02 1.70 1.08 0.02
8 12 3 2.10 0.42 0.70 0.45 0.19
9 7 1 -2.90 -0.58 -1.30 -0.83 0.48
10 20 3 10.10 2.01 0.70 0.45 0.90
sum 99.00 23.00 4.11
mean 9.90 2.30
st. dev. 5.02 1.57

Steps, applied to the table above:

  1. Calculate means and standard deviations for both variables
  2. Get the deviation of the x score from the mean of x
  3. Get the standardized score of x
  4. Get the deviation of the y score from the mean of y
  5. Get the standardized score of y
  6. Calculate the product of standardized scores of x and y
  7. Do the same thing for every case
  8. Take the sum of the products of standardized x and y values
  9. Divide by n-1: \(r = \frac{4.11}{10-1} = 0.456\)
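The whole hand calculation can be reproduced in a few lines. This is a sketch for checking the arithmetic, not the lecture's own code; `mean_and_sd` is a hypothetical helper that uses the sample standard deviation.

```python
import math

hours = [10, 15, 10, 6, 2, 7, 10, 12, 7, 20]
dates = [1, 2, 5, 1, 0, 3, 4, 3, 1, 3]
n = len(hours)


def mean_and_sd(values):
    """Return the mean and the sample standard deviation (n - 1 divisor)."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return m, s


mean_x, s_x = mean_and_sd(hours)
mean_y, s_y = mean_and_sd(dates)

# Product of the two standardized scores for every person.
products = [
    ((x - mean_x) / s_x) * ((y - mean_y) / s_y)
    for x, y in zip(hours, dates)
]
total = sum(products)
r = total / (n - 1)
print(round(total, 2), round(r, 3))  # 4.11 0.456
```

Summing the rounded products shown in the table gives 4.10 rather than 4.11; the small difference is rounding, since the table's total is computed from unrounded values.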

Association between hours studied and number of dates

Hours studied Dates
10 1
15 2
10 5
6 1
2 0
7 3
10 4
12 3
7 1
20 3

Description of the association?

Correlation \(r = 0.456\)
Positive, moderate association

This is the DIRECTION and STRENGTH of
the association in the SAMPLE

Want to know whether there is an association
in the POPULATION (i.e., whether the
observed association is statistically significant)

Need a hypothesis test…

Hypothesis test for correlation


GOAL: We want to know if the association seen in the sample (as revealed by \(r\)) reflects


a real association between the two variables in the population


OR


chance sampling error when in reality the two variables are
not associated in the population

Hypothesis test for correlation

  1. Check assumptions
    • Random sample, variables roughly normally distributed in the population, linear relationship, homoscedasticity
  2. State the hypothesis
    • \(H_0: \rho = 0\) (correlation in the population is 0)
    • \(H_1: \rho \ne 0\) (correlation in the population is different from 0)
      • or \(H_1: \rho \gt 0\) (correlation in the population is greater than 0)
      • or \(H_1: \rho \lt 0\) (correlation in the population is less than 0)
  3. Identify alpha and the critical value of \(r\)
    • Use a table of critical values of \(r\) to get the critical value
    • degrees of freedom \((df) = n-2\)
  4. Calculate the test statistic
    • Use \(r\) calculated from the sample
  5. Make a decision
    • Reject or fail to reject null hypothesis
    • Make a statement about the implications for the population

Association between hours studied and number of dates

\(r = 0.456\)
Positive, moderate association in the sample

1. Check assumptions

  • Random sample?
  • Variables roughly normally distributed in the population?
    • check sample distributions for a clue
  • Linear relationship
    • Check scatterplot
  • Homoscedasticity
    • Similar error of prediction (similar spread around the line) at all values of X
    • Check scatterplot

2. State the hypotheses

  • \(H_0: \rho = 0\)
  • \(H_1: \rho \ne 0\) (2-sided; direction typically not specified)

3. Decide on alpha and identify the critical value of \(r\)

  • Use alpha = 0.05 by default
  • Use table for critical values: with \(df = n - 2 = 8\), \(r \text{ critical} = 0.6319\)

4. Calculate the test statistic

  • \(r \text{ observed} = 0.456\)

5. Make a decision

  • Since \(r \text{ observed (0.456)}\) < \(r\text{ critical (0.6319)}\), we FAIL TO REJECT \(H_0\)

We cannot say that there is an association between hours studied and number of dates in the population of students
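The critical value 0.6319 can be tied back to a t table. The sketch below rests on a standard relationship that the slides do not state: under \(H_0\), \(t = r\sqrt{df}/\sqrt{1-r^2}\) follows a t distribution with \(df = n-2\), so a critical t converts to a critical r as \(t/\sqrt{t^2 + df}\). The value 2.306 is the usual two-sided 0.05 table entry for 8 degrees of freedom and is hard-coded as an assumption; `critical_r` is a hypothetical helper.

```python
import math


def critical_r(t_critical, df):
    """Convert a critical t value into the matching critical value of r."""
    return t_critical / math.sqrt(t_critical ** 2 + df)


n = 10
df = n - 2
t_critical = 2.306  # assumed t table value: two-sided, alpha = 0.05, df = 8

r_observed = 0.456
r_critical = critical_r(t_critical, df)
reject_null = abs(r_observed) > r_critical
print(round(r_critical, 4), reject_null)  # 0.6319 False
```

With only 10 cases, the sample correlation would have to exceed roughly 0.63 in absolute value before it counts as statistically significant.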

Practice

  1. Calculate \(r\)
  2. Interpret \(r\)
  3. Test statistical significance of \(r\)

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

Example 1
Association between hours
on TikTok and # of dates?

Example 2
Association between hours playing
video games and # of dates?

Variable Values (persons 1 to 10) Mean s
Hours of playing video games 10 5 0 0 5 0 0 0 0 0 2.00 3.50
Hours on TikTok 0 3 5 15 14 25 1 1 5 22 9.10 9.21
# of dates 1 2 5 1 0 3 4 3 1 3 2.30 1.57

Example 1
Association between hours
on TikTok and # of dates?

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

Person TikTok hours (\(x_i\)) Dates (\(y_i\)) \(x_i -\bar{x}\) \(\frac{x_i-\bar{x}}{s_x}\) \(y_i -\bar{y}\) \(\frac{y_i-\bar{y}}{s_y}\) \((\frac{x_i-\bar{x}}{s_x})(\frac{y_i-\bar{y}}{s_y})\)
1 0 1 -9.10 -0.99 -1.30 -0.83 0.82
2 3 2 -6.10 -0.66 -0.30 -0.19 0.13
3 5 5 -4.10 -0.45 2.70 1.72 -0.77
4 15 1 5.90 0.64 -1.30 -0.83 -0.53
5 14 0 4.90 0.53 -2.30 -1.47 -0.78
6 25 3 15.90 1.73 0.70 0.45 0.77
7 1 4 -8.10 -0.88 1.70 1.08 -0.95
8 1 3 -8.10 -0.88 0.70 0.45 -0.39
9 5 1 -4.10 -0.45 -1.30 -0.83 0.37
10 22 3 12.90 1.40 0.70 0.45 0.63
sum 91.00 23.00 -0.71
mean 9.10 2.30
st. dev. 9.21 1.57

\(r = \frac{-0.71}{10-1} =\) \(-0.079\)

Interpretation?

Weak negative association between hours on TikTok and number of dates

Statistical significance?

Since the absolute value of r-obtained (0.079) is less extreme than the critical value of r (0.6319), we fail to reject \(H_0\) that \(\rho = 0\).
We do not have enough evidence to say that the association observed in the sample exists in the population. It is not statistically significant.

Example 2
Association between hours playing
video games and # of dates?

\[ r = \frac{1}{n-1}\Sigma(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \]

Person Video game hours (\(x_i\)) Dates (\(y_i\)) \(x_i -\bar{x}\) \(\frac{x_i-\bar{x}}{s_x}\) \(y_i -\bar{y}\) \(\frac{y_i-\bar{y}}{s_y}\) \((\frac{x_i-\bar{x}}{s_x})(\frac{y_i-\bar{y}}{s_y})\)
1 10 1 8.00 2.29 -1.30 -0.83 -1.90
2 5 2 3.00 0.86 -0.30 -0.19 -0.16
3 0 5 -2.00 -0.57 2.70 1.72 -0.99
4 0 1 -2.00 -0.57 -1.30 -0.83 0.47
5 5 0 3.00 0.86 -2.30 -1.47 -1.26
6 0 3 -2.00 -0.57 0.70 0.45 -0.26
7 0 4 -2.00 -0.57 1.70 1.08 -0.62
8 0 3 -2.00 -0.57 0.70 0.45 -0.26
9 0 1 -2.00 -0.57 -1.30 -0.83 0.47
10 0 3 -2.00 -0.57 0.70 0.45 -0.26
sum 20.00 23.00 -4.75
mean 2.00 2.30
st. dev. 3.50 1.57

\(r = \frac{-4.75}{10-1} =\) \(-0.528\)

Interpretation?

Moderate negative association between hours of video games played and number of dates.

Statistical significance?

Since absolute value of r-obtained (0.528) is less extreme than the critical value of r (0.6319), we fail to reject \(H_0\) that \(\rho = 0\).
We do not have enough evidence to say that the association observed in the sample exists in the population. It is not statistically significant.
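Both practice examples can be verified with the same formula. In this sketch `pearson_r` is a hypothetical helper, and the critical value 0.6319 is taken from the lecture (two-sided, alpha = 0.05, df = 8).

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


tiktok = [0, 3, 5, 15, 14, 25, 1, 1, 5, 22]
video_games = [10, 5, 0, 0, 5, 0, 0, 0, 0, 0]
dates = [1, 2, 5, 1, 0, 3, 4, 3, 1, 3]
r_critical = 0.6319  # from the lecture: df = 8, alpha = 0.05, two-sided

results = {}
for label, xs in (("TikTok", tiktok), ("Video games", video_games)):
    r = pearson_r(xs, dates)
    results[label] = (round(r, 3), abs(r) > r_critical)
    print(label, results[label])
# TikTok (-0.079, False)
# Video games (-0.527, False)
```

At full precision the video game correlation is about -0.527; the -0.528 in the hand calculation comes from dividing the rounded sum -4.75 by 9. The conclusion is the same either way.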

Correlation matrix

Data from a sample of 102 adults results in the correlation matrix to the right

Age Hours Worked Hours on leisure
Age 1.00 0.28 -0.40
Hours Worked 0.28 1.00 -0.61
Hours on leisure -0.40 -0.61 1.00

Interpret the correlation coefficients

Weak positive association between age and hours worked in this sample.

Moderate negative association between age and leisure hours in this sample.

Strong negative association between hours worked and leisure hours in this sample.

Which of these correlation coefficients is statistically significant at the 0.05 alpha level?

\(H_0: \rho = 0\)

\(H_1: \rho \ne 0\)

Compare observed values of \(r\) to the
critical value of \(r\)

critical value of \(r = 0.1946\)

Since all observed values of \(r\) are MORE EXTREME than the critical value of \(r\), we can reject the null hypothesis in each case and conclude that the correlations are likely non-zero IN THE POPULATION (all sample correlations statistically significant)
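The comparison above can be written out for all three pairs at once. As a hedged sketch: the critical value is rebuilt from an assumed t table entry (1.984 for two-sided alpha = 0.05 with 100 degrees of freedom) using the standard conversion \(t/\sqrt{t^2 + df}\), which the slides do not show.

```python
import math

n = 102
df = n - 2
t_critical = 1.984  # assumed t table value: two-sided, alpha = 0.05, df = 100
r_critical = t_critical / math.sqrt(t_critical ** 2 + df)

# Off-diagonal entries of the correlation matrix.
correlations = {
    ("Age", "Hours worked"): 0.28,
    ("Age", "Hours on leisure"): -0.40,
    ("Hours worked", "Hours on leisure"): -0.61,
}

# A correlation is significant when |r| exceeds the critical value.
significant = {pair: abs(r) > r_critical for pair, r in correlations.items()}

print(round(r_critical, 4))  # 0.1946
for pair, is_significant in significant.items():
    print(pair, is_significant)
```

Note that even the weak correlation of 0.28 clears the bar here: with 102 cases the critical value is much lower than the 0.6319 required with 10 cases.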

Homework