SOC 221 • Lecture 1
Monday, June 17, 2024
Statistics is NOT math…
…but it looks a lot like math.
We try to make the math easy so we can focus on the concepts, interpretation, and implications!
Social statistics are tools that allow us to quantify observations (data) about the world.
What is the relative number of red, blue, green, and yellow M&Ms in a bag?
What is the probability of pulling 2 red M&Ms in a row out of a bag of 100 M&Ms?
Are there gender differences in preference for red M&Ms?
What percentage of people have access to health insurance through work?
How did access to health care affect levels of stress during the coronavirus pandemic?
What percentage of Americans approved of Biden's handling of the vaccine rollout?
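The M&M probability question can be worked out directly once we know what is in the bag. A minimal sketch, assuming (hypothetically) that 20 of the 100 M&Ms are red and that the first M&M is not put back before the second draw. The count of 20 and the function name are illustrative assumptions, not figures from the lecture.

```python
from fractions import Fraction

def prob_two_in_a_row(n_red: int, n_total: int) -> Fraction:
    """Probability that two draws without replacement are both red."""
    first = Fraction(n_red, n_total)            # chance the first draw is red
    second = Fraction(n_red - 1, n_total - 1)   # one red (and one M&M) fewer remains
    return first * second

# Hypothetical bag: 20 red M&Ms out of 100 (assumed for illustration only).
p = prob_two_in_a_row(20, 100)
print(p, float(p))
```

Changing the assumed number of red M&Ms changes the answer, which is why the first question (the relative number of each color) has to be answered before the second.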
Survey
A tool used to ask people question(s) in order to gather information about what the subject does, feels, or thinks.
In what year were you born?
(enter a 4-digit number, e.g., 2005)
_________
Turn them into DATA…
A survey will produce a dataset for analyses
Person # | birth_year | gender | ethnicity | race |
---|---|---|---|---|
1 | 1992 | 1 | 1 | 5 |
2 | 1993 | 2 | 0 | 5 |
3 | 1995 | 2 | 0 | 2 |
4 | 1980 | 2 | 0 | 3 |
5 | 1991 | 1 | 1 | 5 |
6 | 1975 | 4 | 1 | 1 |
7 | 1960 | 1 | 0 | 6 |
8 | 1952 | 1 | 0 | 5 |
9 | 2000 | 3 | 0 | 3 |
10 | 1990 | 1 | 1 | 2 |
11 | 1993 | 2 | 1 | 4 |
12 | 1992 | 3 | 0 | 4 |
Each row in the dataset contains information on one data unit (i.e., an individual or case).
Units are the objects described by a set of data. They may be people, animals, businesses, events, etc.: whatever you collect data about.
People that you survey become units, cases, or individuals in a dataset.
Each column in the dataset contains data for one variable
Questions in the survey become variables.
A variable is a characteristic of units (cases or observations) that can take on different values or attributes.
The opposite of a "variable" is a CONSTANT (same value for all cases).
Each cell in the dataset refers to the value on a particular variable for a particular case/unit
A value is a number, word, or symbol that represents a characteristic of a particular case on a particular variable.
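The row / column / cell vocabulary can be sketched in a few lines of Python. This is a minimal illustration using the first three rows of the example dataset. The helper names (`unit`, `variable`, `value`, `is_constant`) and the `person` key are assumptions made up for this sketch.

```python
# First three rows of the example dataset: one dict per unit (case).
dataset = [
    {"person": 1, "birth_year": 1992, "gender": 1, "ethnicity": 1, "race": 5},
    {"person": 2, "birth_year": 1993, "gender": 2, "ethnicity": 0, "race": 5},
    {"person": 3, "birth_year": 1995, "gender": 2, "ethnicity": 0, "race": 2},
]

def unit(data, person_id):
    """Row: everything we know about one unit."""
    return next(row for row in data if row["person"] == person_id)

def variable(data, name):
    """Column: one value per unit for a single variable."""
    return [row[name] for row in data]

def value(data, person_id, name):
    """Cell: the value of one variable for one unit."""
    return unit(data, person_id)[name]

def is_constant(data, name):
    """A 'variable' that takes the same value for every case is a constant."""
    return len(set(variable(data, name))) == 1

print(variable(dataset, "birth_year"))
print(value(dataset, 2, "race"))
print(is_constant(dataset, "gender"))
```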
The level of measurement of a variable refers to the type of information represented in the VALUES of that variable.
Qualitative measurements
Binary variables are a sub-type of nominal variables.
Quantitative measurements
Question to ask yourself
Can the values of this variable be broken up into sub-units?
How much time did you spend studying yesterday (in hours)?
______________
How much did you study yesterday?
○ a lot
○ a moderate amount
○ a little
○ none
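The two versions of the study question produce different kinds of values, and that changes what we can sensibly compute. A rough sketch with made-up answers from five hypothetical respondents. The data and the variable names are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical answers from five respondents (invented for illustration).
hours = [0.0, 1.5, 2.0, 3.5, 0.5]        # quantitative: can be split into sub-units
amount = ["none", "a lot", "a little", "a lot", "a moderate amount"]  # ordered categories

# Hours can be added and divided, so an average is meaningful.
mean_hours = mean(hours)

# Categories can be ranked but not added; use ranks to find the middle answer.
ORDER = ["none", "a little", "a moderate amount", "a lot"]
ranks = sorted(ORDER.index(answer) for answer in amount)
median_amount = ORDER[ranks[len(ranks) // 2]]

print(mean_hours, median_amount)
```

Note that there is no meaningful "average" of "a lot" and "a little"; the most we can say is which answer falls in the middle once the answers are put in order.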
Using statistics, we can analyze these (and other) data to try to answer interesting questions about the real world.
Example: Use survey results to examine attitudes about climate change among American adults.
We have to be careful about how we use statistics to analyze and interpret data.
Example: Any of the problems above could lead to misleading conclusions regarding attitudes about climate change.
Bias
A systematic data flaw that leads us to mischaracterize reality.
Population
The entire group of people (or other units) that we want to know about.
Example: We want to know about attitudes toward climate change among American adults.
Question: What is the population of interest?
Answer: American adults.
Unfortunately, we rarely have the time, energy, or money to study the entire population of interest!
So we collect information about a sample instead…
Sample
The smaller subset of the population that you actually study/examine.
% of people in our SAMPLE who believe climate change is an urgent problem
→ Inference →
% of American adults who believe climate change is an urgent problem
Sample statistic
The characteristic of the sample that we actually observe.
Population parameter
The characteristic of the population that we are interested in knowing.
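The difference between a statistic and a parameter is easy to see in a simulation, where (unlike in real life) we get to know the whole population. A minimal sketch, assuming a made-up population of 100,000 adults in which 60% hold the view. The 60% figure is an assumption for illustration, not a real survey result.

```python
import random

random.seed(221)  # fixed seed so the sketch is reproducible

# Hypothetical population: True = "climate change is an urgent problem".
# The 60% share is assumed for illustration only.
population = [i < 60_000 for i in range(100_000)]

# Population parameter: what we want to know (normally unobservable).
parameter = sum(population) / len(population)

# Sample statistic: what we actually observe in a random sample of 1,000.
sample = random.sample(population, 1_000)
statistic = sum(sample) / len(sample)

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

The statistic will usually land close to, but not exactly on, the parameter; inference is the step of reasoning from the first to the second.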
Inference from the sample to the population only works if our sample is representative of the population of interest, i.e., the characteristics of the sample are similar in important ways to the population of interest.
In survey sampling, bias is the tendency of a sample statistic to systematically over- or under-estimate a population parameter.
Probability sampling
Any sampling approach in which each member in the population has an equal, or known, probability of being selected into the sample.
Non-probability sampling
Any sampling approach in which the probability of cases being drawn into the sample varies or is unknown.
Probability sampling methods:
Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling
Non-probability sampling methods:
Convenience sampling
Quota sampling
Network sampling
Voluntary sampling
Simple random sampling
Advantages:
- It provides a simple and fair way of selecting the sample
- It's based on randomness, so it maximizes the chance of an unbiased / representative sample
Limitations:
- Hard to do because a complete list of the population is usually not available
- Can be very expensive and time consuming
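A simple random sample can be sketched as follows, assuming we do have a complete list of the population (a sampling frame). The frame of 500 ID numbers and the function name are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance, n / len(frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers for a population of 500 people.
frame = list(range(1, 501))
srs = simple_random_sample(frame, 25, seed=1)
print(sorted(srs))
```

The code is short precisely because the hard part, building the complete list in `frame`, is assumed away.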
Systematic sampling
Advantage:
- Sometimes it is easier to apply than simple random sampling
Limitation:
- Any patterns present in the population may bias the sample
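The limitation about patterns can be illustrated with a deliberately extreme case. A rough sketch, assuming a hypothetical list in which every 10th person is a manager and a sampling interval of 10 that happens to line up with that pattern. The list, the labels, and the function name are all invented for illustration.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    start = random.Random(seed).randrange(k)
    return frame[start::k]

# Hypothetical frame with a repeating pattern: every 10th listed unit is a manager.
frame = ["manager" if i % 10 == 0 else "staff" for i in range(1000)]
true_share = frame.count("manager") / len(frame)

# The interval (k=10) matches the pattern, so the sample is all managers or all staff.
sample = systematic_sample(frame, k=10, seed=3)
share_managers = sample.count("manager") / len(sample)

print(true_share, share_managers)
```

Depending on the random start, the sample contains either only managers or no managers at all, even though managers make up 10% of the list.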
Stratified sampling
Advantages:
- It is very useful when we want to make sure that minority groups are represented in the sample
- Can calculate the probability of an individual case being included
Limitations:
- Creating stratified lists of individuals may be expensive
- Sometimes the population cannot be divided into different strata
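Both advantages of stratified sampling show up in a small sketch: the minority group is guaranteed a place in the sample, and each person's chance of selection is known. The population sizes, group labels, and function name below are hypothetical.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Split the frame into strata, then draw a simple random sample within each."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Hypothetical frame: 950 majority-group and 50 minority-group members.
frame = [("majority", i) for i in range(950)] + [("minority", i) for i in range(50)]
sample = stratified_sample(frame, stratum_of=lambda u: u[0], n_per_stratum=20, seed=7)

# Known inclusion probability in each stratum: sample size / stratum size.
inclusion_prob = {"majority": 20 / 950, "minority": 20 / 50}
print({name: len(units) for name, units in sample.items()}, inclusion_prob)
```

Because the two groups are sampled at different rates here, an analysis of the combined sample would need to account for those unequal probabilities.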
Cluster sampling
Advantage:
- Cheaper and more efficient for geographically dispersed populations
Limitation:
- The sample will be biased if the clusters do not represent the population
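Cluster sampling randomly selects whole groups rather than individuals. A minimal sketch of the one-stage version, assuming a hypothetical population of 20 neighborhoods with 30 residents each. The names and sizes are made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then include every unit in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return chosen, [unit for name in chosen for unit in clusters[name]]

# Hypothetical population: 20 neighborhoods with 30 residents each.
clusters = {
    f"neighborhood_{c:02d}": [f"resident_{c:02d}_{r:02d}" for r in range(30)]
    for c in range(20)
}
chosen, sample = cluster_sample(clusters, n_clusters=4, seed=5)
print(chosen, len(sample))
```

Only four neighborhoods need to be visited, which is where the cost savings come from; the risk is that those four happen to differ from the other sixteen.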
Convenience sampling
Advantage:
- Cheap, time-saving, and simple to implement
Limitation:
- Often leads to selection bias, so study results aren't generalizable
Quota sampling
Advantages:
- Cheaper, faster, and easier to implement than stratified sampling
- More representative than other non-probability methods
Limitations:
- Misses people who are not in that particular area at that particular time
- Since it's not based on random selection, selection bias is still possible and therefore representativeness is not guaranteed
Network sampling (also known as SNOWBALL SAMPLING)
Advantage:
- Useful when individuals in the population are difficult to identify (e.g., drug users, unhoused folks)
Limitations:
- Selection bias is likely since respondents are more similar to each other than randomly drawn individuals
- Can be slow since it relies on respondents' referrals
Voluntary sampling
Advantage:
- Similar to convenience sampling, it's cheap, time-saving, and simple to implement
Limitation:
- People with strong opinions are more likely to respond to your survey
Even with a representative sample there are additional sources of bias to be aware of
Non-response BIAS
This is a problem because those refusing to answer might differ from those who do answer. When they do, the information from the sample does not reflect the reality of the population.
Example
Those who are least concerned about climate change might be less likely to respond.
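A small simulation shows how this plays out even when the people we select are a proper random sample. A rough sketch under invented assumptions: 60% of the population is concerned, concerned people answer 70% of the time, and unconcerned people answer only 30% of the time. None of these numbers come from real data.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical population: True = concerned about climate change (60%, assumed).
population = [True] * 6_000 + [False] * 4_000
selected = rng.sample(population, 2_000)   # a proper random sample

def responds(concerned):
    """Assumed response rates: 70% if concerned, 30% if not."""
    return rng.random() < (0.7 if concerned else 0.3)

# Only those who respond end up in the data we can analyze.
respondents = [c for c in selected if responds(c)]

true_share = sum(population) / len(population)
observed_share = sum(respondents) / len(respondents)
print(f"true = {true_share:.2f}, observed among respondents = {observed_share:.2f}")
```

The share of concerned people among respondents comes out well above the true 60%, because the unconcerned are under-represented among those who answered.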
Response bias
Example: social desirability bias
A respondent overstates or understates their level of concern about the environment to match what they think the researcher wants to hear.
A common reason for response bias or non-response is the nature of the question.
What counts as a sensitive question?
Which do you think people found most sensitive?