OpenIntro Statistics

Data Basics: Organizing and Understanding Data

1.1 Case study: using stents to prevent strokes

🧭 Overview

🧠 One-sentence thesis

Data matrices organize observations and variables in a structured format that enables researchers to identify relationships between variables and distinguish between different types of data for appropriate analysis.

📌 Key points (3–5)

  • Data matrix structure: Each row represents a case (observational unit) and each column represents a variable, providing a standardized way to organize data.
  • Variable types matter: Variables are either numerical (continuous or discrete) or categorical (nominal or ordinal), and this classification determines appropriate analysis methods.
  • Association vs independence: Variables can be associated (showing a relationship) or independent (no relationship), but association does not prove causation.
  • Common confusion: Explanatory and response variables indicate a hypothesized causal direction, but labeling them this way does not prove causation—only experiments can establish causal relationships.
  • Data collection methods: Observational studies can show associations, while randomized experiments are needed to establish causal connections.

📊 Organizing data with matrices

📋 What is a data matrix

Data matrix: A structured format where each row corresponds to a unique case (observational unit) and each column corresponds to a variable.

  • This is the standard and recommended way to organize data, especially in spreadsheets.
  • New cases can be added as rows; new variables as columns.
  • Example: In the loan50 dataset, each row represents one loan with characteristics like amount, interest rate, and borrower information.

🔍 Cases and variables defined

Case (observational unit): A single entity or subject being studied, represented as one row in the data matrix.

Variable: A characteristic or measurement recorded for each case, represented as one column in the data matrix.

  • In the loan50 example: Each loan is a case, and loan_amount, interest_rate, term, grade, state, total_income, and homeownership are variables.
  • In the county example: Each county is a case with variables like population, poverty rate, and median education level.

📝 Why this structure matters

  • Allows consistent data entry and retrieval.
  • Makes it easy to review individual cases (read across a row) or analyze a single variable across all cases (read down a column).
  • Example: A gradebook can be organized with each student as a row and each assignment as a column, plus columns for student information.
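
As a sketch of this convention, here is a toy pandas DataFrame; the column names echo the loan50 variables mentioned above, and the values are invented for illustration.

```python
import pandas as pd

# Toy data matrix: each row is one case (a loan), each column is a variable.
# Values are invented; only the column names echo the loan50 example.
loans = pd.DataFrame({
    "loan_amount":   [22000, 6400, 25000],
    "interest_rate": [10.90, 9.92, 26.30],
    "term":          [60, 36, 36],
    "grade":         ["B", "B", "E"],
    "state":         ["NJ", "CA", "SC"],
})

print(loans.loc[0])             # read across a row: one case
print(loans["interest_rate"])   # read down a column: one variable
```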

🔢 Types of variables

🔢 Numerical variables

Numerical variable: A variable that takes numerical values where arithmetic operations (addition, subtraction, averaging) make sense.

Two subtypes:

| Type | Definition | Example |
| --- | --- | --- |
| Continuous | Can take any value within a range, with no jumps | Unemployment rate, height |
| Discrete | Can only take specific values, typically whole numbers with jumps | Population count, number of siblings |

  • Don't confuse: Not all numbers are numerical variables—telephone area codes use numbers but averaging them is meaningless, so they would be categorical.

🏷️ Categorical variables

Categorical variable: A variable where responses fall into categories or levels.

Two subtypes:

| Type | Definition | Example |
| --- | --- | --- |
| Nominal | Categories with no natural ordering | State names, treatment vs. control group |
| Ordinal | Categories with a natural ordering | Education level (below hs, hs diploma, some college, bachelors) |

  • In this textbook, ordinal variables are treated as nominal (unordered) categorical variables to simplify analysis.
  • Example: The state variable can take 51 values (50 states plus DC), making it categorical with 51 levels.

🎯 Classification practice

When classifying variables, ask:

  1. Does arithmetic make sense? → Numerical
  2. If numerical: Can it take any value (continuous) or only specific values (discrete)?
  3. If not numerical: Are the categories ordered (ordinal) or unordered (nominal)?

Example: In a migraine study, the group variable (treatment or control) is categorical, while num_migraines (count of migraines) is discrete numerical.
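
A toy Python helper makes the checklist explicit; the function name and arguments are hypothetical, and the inputs are a person's judgments about the variable, not the data itself.

```python
def classify_variable(arithmetic_makes_sense, takes_any_value=False, ordered=False):
    """Apply the three classification questions from the checklist above."""
    if arithmetic_makes_sense:
        return "continuous numerical" if takes_any_value else "discrete numerical"
    return "ordinal categorical" if ordered else "nominal categorical"

# The migraine-study variables from the example:
print(classify_variable(True, takes_any_value=False))   # num_migraines -> discrete numerical
print(classify_variable(False, ordered=False))          # group -> nominal categorical
```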

🔗 Relationships between variables

🔗 Association and independence

Associated variables: Variables that show some connection or discernible pattern with one another; also called dependent variables.

Independent variables: Variables with no evident relationship between them.

  • Key principle: A pair of variables is either associated or independent, never both.
  • Association can be detected visually using scatterplots for numerical variables.

📈 Types of association

Positive association: As one variable increases, the other tends to increase.

  • Example: Counties with higher median household income tend to have higher population growth rates.

Negative association: As one variable increases, the other tends to decrease.

  • Example: Counties with more multi-unit structures tend to have lower homeownership rates.

🔬 Explanatory and response variables

Explanatory variable: The variable hypothesized to affect or cause changes in another variable.

Response variable: The variable hypothesized to be affected by the explanatory variable.

  • The relationship flows: explanatory variable → might affect → response variable.
  • Example: If asking whether median household income drives population change, income is the explanatory variable and population change is the response variable.
  • Important: Labeling variables this way does NOT prove causation—it only indicates a hypothesis to be tested.

⚠️ Critical distinction: Association ≠ Causation

  • Observational studies can show association but cannot prove causation.
  • Only randomized experiments can establish causal relationships.
  • Don't confuse: Just because two variables are associated does not mean one causes the other—there may be other explanations.

🔬 Data collection methods

👁️ Observational studies

Observational study: Data collection where researchers do not directly interfere with how the data arise; they merely observe.

Methods include:

  • Surveys
  • Reviewing existing records (medical, company, etc.)
  • Following a cohort of similar individuals over time

Limitations:

  • Can provide evidence of naturally occurring associations.
  • Cannot by themselves show causal connections.
  • May be affected by non-response bias if participation rates are low.

🧪 Experiments

Experiment: A study where researchers actively assign treatments to investigate causal connections.

Randomized experiment: An experiment where individuals are randomly assigned to treatment groups.

Key features:

  • Researchers collect a sample and split individuals into groups.
  • Each group receives a different treatment (or placebo).
  • Random assignment helps ensure groups are comparable.
  • Example: Heart attack patients randomly assigned to receive either a drug or placebo, then outcomes are compared.

🎲 Random sampling principles

Simple random sample: Each case in the population has an equal chance of being included, with no implied connection between cases—equivalent to a raffle.

Why randomness matters:

  • Reduces bias in sample selection.
  • Helps ensure the sample is representative of the population.
  • Without random selection, samples may be skewed by the selector's interests or accessibility.

⚠️ Common sampling pitfalls

Convenience sample: Individuals who are easily accessible are more likely to be included.

  • Example: Surveying only people walking in one neighborhood to represent an entire city.
  • Problem: Difficult to determine what sub-population the sample actually represents.

Non-response bias: When a low percentage of randomly selected individuals actually participate.

  • Example: If only 30% of people respond to a survey, results may not represent the entire population.
  • Problem: Those who respond may differ systematically from those who don't.

Selection bias: When the method of selecting participants introduces systematic differences.

  • Example: Online product reviews may be negatively biased because dissatisfied customers are more motivated to leave reviews.
  • Problem: The sample doesn't represent the full population of users.

Data basics

1.2 Data basics

🧭 Overview

🧠 One-sentence thesis

Random sampling techniques are essential for producing reliable statistical estimates, because non-random or biased sampling undermines the validity of all statistical methods that assume randomness.

📌 Key points (3–5)

  • Why randomness matters: Almost all statistical methods rely on implied randomness; without it, estimates and errors are unreliable.
  • Bias can creep in multiple ways: hand-picked samples, non-response, and convenience samples all introduce bias even when unintentional.
  • Observational vs experimental data: observational studies can show associations but cannot prove causation because confounding variables may explain the relationship.
  • Common confusion: correlation vs causation—just because two variables are associated does not mean one causes the other; a third variable (confounder) may drive both.
  • Four random sampling methods: simple random, stratified, cluster, and multistage sampling each have different use cases and trade-offs.

🎲 Why random sampling reduces bias

🎲 The problem with hand-picked samples

  • When people select samples manually, they risk picking a biased sample even without intending to.
  • Example: A nutrition major asked to pick graduates might unconsciously select more health-related majors, making the sample unrepresentative of all graduates.
  • Random selection (like a raffle) gives each case an equal chance of inclusion and avoids these unconscious preferences.

🚫 Non-response bias

Non-response bias: when a large fraction of randomly sampled people do not respond, the results may not be representative of the entire population.

  • Even if people are picked randomly, high non-response rates (e.g., only 30% respond) make it unclear whether results reflect the full population.
  • The excerpt notes it is "difficult, and often times impossible, to completely fix this problem."
  • Example: Online product reviews—if only dissatisfied customers leave ratings, 50% negative reviews does not mean 50% of all buyers are dissatisfied.

🛒 Convenience samples

Convenience sample: individuals who are easily accessible are more likely to be included.

  • Example: Conducting a political survey by stopping people in the Bronx does not represent all of New York City.
  • It is often hard to know what sub-population a convenience sample actually represents.
  • Don't confuse: a large sample from a convenient location is still biased; size does not fix bias.

🔬 Observational studies and causation

🔬 What observational data is

Observational data: data where no treatment has been explicitly applied or withheld.

  • Examples from the excerpt: loan data and county data.
  • Observational studies can show associations or form hypotheses, but making causal conclusions is "treacherous and is not recommended."
  • Causal conclusions are more reasonable from experiments, not observational studies.

🌀 Confounding variables

Confounding variable (also called lurking variable or confounder): a variable correlated with both the explanatory and response variables.

  • Example: Suppose an observational study finds that more sunscreen use is associated with more skin cancer. Does sunscreen cause cancer?
    • No—sun exposure is the confounder: people who spend more time in the sun use more sunscreen and have higher cancer risk.
    • The diagram in the excerpt: sun exposure → use sunscreen; sun exposure → skin cancer; the direct link sunscreen → cancer is questionable.
  • Another example: Homeownership rate and percentage of multi-unit structures show a negative association. A confounding variable might be population density—dense areas have more multi-unit housing and higher property values, making homeownership less feasible.
  • One method to justify causal claims from observational data is to "exhaust the search for confounding variables," but there is no guarantee all confounders can be examined or measured.

📅 Prospective vs retrospective studies

| Type | When data is collected | Example |
| --- | --- | --- |
| Prospective | Identifies individuals and collects information as events unfold | The Nurses' Health Study: recruits nurses and follows them over years using questionnaires |
| Retrospective | Collects data after events have taken place | Reviewing past medical records |

  • Some data sets contain both prospectively and retrospectively collected variables.

🎯 Four random sampling methods

🎯 Simple random sampling

Simple random sampling: each case in the population has an equal chance of being included, and knowing one case is included provides no information about which other cases are included.

  • Analogy: writing names on slips of paper, mixing them in a bucket, and drawing out the desired number.
  • Example: To sample 120 MLB players, write all players' names on slips, mix, and draw 120.
  • This is the most intuitive form of random sampling and the baseline for statistical methods.

🏛️ Stratified sampling

Stratified sampling: divide the population into groups (strata) where similar cases are grouped together, then use simple random sampling within each stratum.

  • Example: MLB teams as strata—some teams have much more money (up to 4 times as much). Randomly sample 4 players from each of 30 teams for a total of 120 players.
  • When to use: especially useful when cases within each stratum are very similar with respect to the outcome of interest.
    • Why? More stable and precise estimates within each group lead to a more precise overall population estimate.
  • Downside: analyzing stratified data is more complex than analyzing simple random samples; methods in the textbook need extension.

🗂️ Cluster sampling

Cluster sample: break the population into many groups (clusters), sample a fixed number of clusters, and include all observations from each selected cluster.

  • When to use: more economical than other techniques; works best when there is a lot of variability within each cluster but clusters themselves look similar to one another.
  • Example: If neighborhoods are clusters, cluster sampling works well when neighborhoods are very diverse internally.
  • Downside: more advanced analysis techniques are required (though methods in the book can be extended).

🌐 Multistage sampling

Multistage sample: like cluster sampling, but instead of keeping all observations in each selected cluster, collect a random sample within each selected cluster.

  • Combines the economy of cluster sampling with additional randomness within clusters.
  • When to use: same conditions as cluster sampling—high within-cluster variability, low between-cluster variability.
  • Downside: also requires more advanced analysis techniques.

| Method | How it works | Best when | Trade-off |
| --- | --- | --- | --- |
| Simple random | Each case has an equal chance; no connection between cases | General-purpose | Baseline; simpler analysis |
| Stratified | Divide into strata; random sample within each | Cases within strata are very similar | More precise estimates; more complex analysis |
| Cluster | Sample clusters; include all observations in selected clusters | High variability within clusters; clusters similar to each other | More economical; requires advanced analysis |
| Multistage | Sample clusters; random sample within selected clusters | Same as cluster | More economical; requires advanced analysis |

Sampling Principles and Strategies

1.3 Sampling principles and strategies

🧭 Overview

🧠 One-sentence thesis

Random sampling techniques—simple, stratified, cluster, and multistage—each offer different trade-offs between cost, precision, and complexity, and the choice depends on whether the goal is to maximize similarity within groups or to economize on data collection across dispersed populations.

📌 Key points (3–5)

  • Why randomness matters: statistical methods and their error estimates are only reliable when data are collected in a random framework from a population.
  • Four main techniques: simple random, stratified, cluster, and multistage sampling, each suited to different population structures.
  • Stratified vs cluster logic: stratified sampling works best when cases within each group are very similar; cluster/multistage work best when there is high variability within clusters but clusters themselves are similar to one another.
  • Common confusion: cluster sampling includes all observations from selected clusters; multistage sampling randomly selects a subset within each selected cluster.
  • Trade-offs: stratified, cluster, and multistage methods can improve precision or reduce cost but require more complex analysis than simple random sampling.

🎲 Simple random sampling

🎲 What it is

Simple random sampling: each case in the population has an equal chance of being included in the final sample, and knowing that a case is included does not provide useful information about which other cases are included.

  • Most intuitive form of random sampling.
  • Example: to sample 120 MLB players, write all player names on slips of paper, mix them in a bucket, and draw 120 slips.
  • The key is equal probability and independence: every individual has the same chance, and one person's inclusion doesn't affect another's.

🔍 When to use it

  • When the population is relatively homogeneous or when you have no basis for grouping.
  • Downside: can be expensive if the population is geographically dispersed (e.g., a simple random sample of villagers would likely include people from most of 30 villages, requiring visits to nearly all of them).
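
A minimal sketch of the raffle analogy, with a made-up roster standing in for the MLB players:

```python
import random

players = [f"player_{i:03d}" for i in range(1, 751)]  # hypothetical roster of 750 players

random.seed(1)                        # fixed seed so the sketch is reproducible
sample = random.sample(players, 120)  # every player equally likely; no repeats
print(len(sample), sample[:3])
```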

🧩 Stratified sampling

🧩 What it is

Stratified sampling: a divide-and-conquer strategy where the population is divided into groups called strata (chosen so similar cases are grouped together), then a second sampling method (usually simple random sampling) is employed within each stratum.

  • Example: divide MLB players by team (30 teams), then randomly sample 4 players from each team for a total of 120 players.
  • The strata should group similar cases with respect to the outcome of interest.

🎯 Why similarity within strata helps

  • When cases within a stratum are very similar, the estimate for that subpopulation becomes more stable and precise.
  • Combining these precise group estimates yields a more precise overall population estimate.
  • Example: if all players on a team have similar salaries, sampling a few gives a reliable team average; aggregating these averages improves the overall estimate.

⚠️ Trade-offs

  • Advantage: more precise estimates when strata are homogeneous.
  • Disadvantage: analyzing stratified data is more complex than analyzing simple random samples; methods in the excerpt's book would need extension.
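
A sketch of the team-as-stratum scheme under the same made-up roster assumption (30 teams of 25 players each):

```python
import random

# Hypothetical strata: 30 teams, 25 players per team.
teams = {t: [f"team{t:02d}_player{p:02d}" for p in range(1, 26)] for t in range(1, 31)}

random.seed(1)
sample = []
for roster in teams.values():
    sample.extend(random.sample(roster, 4))  # simple random sample within each stratum

print(len(sample))  # 4 players x 30 teams = 120
```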

🗂️ Cluster and multistage sampling

🗂️ What cluster sampling is

Cluster sample: break the population into many groups called clusters, sample a fixed number of clusters, and include all observations from each selected cluster.

  • Example: divide a population into 9 clusters, randomly select 3 clusters, and measure every case in those 3 clusters.
  • Don't confuse: cluster sampling takes every observation in the selected clusters.

🔀 What multistage sampling is

Multistage sample: like cluster sampling, but rather than keeping all observations in each cluster, collect a random sample within each selected cluster.

  • Example: randomly select half of 30 villages, then randomly select 10 people from each selected village.
  • This is the key difference from cluster sampling: you do not measure everyone in the selected clusters.

🌍 When to use cluster or multistage

  • Best scenario: high case-to-case variability within each cluster, but clusters themselves don't look very different from one another.
  • Example: if neighborhoods are very diverse internally (rich and poor, young and old all mixed), but all neighborhoods are roughly similar to each other, cluster/multistage sampling works well.
  • Economic advantage: can substantially reduce data collection costs, especially when the population is geographically dispersed.
    • Example (from the excerpt): sampling malaria rates in 30 remote Indonesian villages. Simple random sampling would require visiting all 30 villages; multistage sampling (select half the villages, then 10 people per village) cuts travel costs while still providing reliable information.
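
A sketch of the village example with invented village sizes; dropping the stage-2 sampling and keeping everyone in the chosen villages would turn this into a cluster sample.

```python
import random

random.seed(1)
# Hypothetical clusters: 30 villages with 40-120 residents each.
villages = {v: [f"village{v:02d}_person{p:03d}"
                for p in range(1, random.randint(40, 120) + 1)]
            for v in range(1, 31)}

chosen = random.sample(sorted(villages), 15)  # stage 1: select half the villages
sample = [person
          for v in chosen
          for person in random.sample(villages[v], 10)]  # stage 2: 10 people per village

print(len(sample))  # 15 villages x 10 people = 150
```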

⚖️ Trade-offs

  • Advantage: more economical than simple random or stratified sampling when clusters are spread out.
  • Disadvantage: requires more advanced analysis techniques (the excerpt notes that methods in the book would need extension).

📊 Comparison of the four methods

| Method | How it works | When it works best | Analysis complexity |
| --- | --- | --- | --- |
| Simple random | Equal chance for every case; independent selection | Homogeneous population or no grouping basis | Simplest |
| Stratified | Divide into strata (similar cases together); sample within each stratum | High similarity within strata; want precise subgroup estimates | More complex |
| Cluster | Sample a few clusters; include all cases in selected clusters | High variability within clusters; clusters similar to each other; cost savings needed | More complex |
| Multistage | Sample a few clusters; sample some cases within each selected cluster | Same as cluster, but even more cost-effective | More complex |

🧠 Key distinction to remember

  • Stratified: cases within each group are similar → sample from every group.
  • Cluster/multistage: cases within each group are diverse, but groups are similar to each other → sample only some groups.
  • Cluster vs multistage: cluster takes everyone in selected groups; multistage takes a sample within selected groups.

⚠️ Why randomness is essential

⚠️ Reliability of statistical methods

  • The excerpt emphasizes that statistical methods and their error estimates are based on the notion of implied randomness.
  • If observational data are not collected in a random framework from a population, these methods and their associated errors are not reliable.
  • This is the foundational reason for using any of the four random sampling techniques described.

Experiments

1.4 Experiments

🧭 Overview

🧠 One-sentence thesis

Randomized experiments, which assign treatments to cases through randomization, are the gold standard for establishing causal connections between variables, but they require careful design principles and bias-reduction techniques to ensure valid conclusions.

📌 Key points (3–5)

  • What makes an experiment: researchers actively assign treatments to cases; when assignment includes randomization, it becomes a randomized experiment.
  • Four core principles: controlling other differences, randomizing assignment, replicating with sufficient sample size, and blocking by known influential variables.
  • Why randomization matters: it evens out uncontrolled variables and prevents accidental bias, making causal conclusions possible.
  • Common confusion in human studies: emotional effects (knowing you received treatment vs. not) can bias results; blinding prevents patients from knowing their group assignment.
  • Control vs. treatment groups: the treatment group receives the intervention being tested; the control group does not, serving as a baseline for comparison.

🔬 What defines an experiment

🔬 Experiments vs. other studies

Experiments: studies where researchers assign treatments to cases.

Randomized experiments: experiments where assignment includes randomization (e.g., using a coin flip to decide which treatment a patient receives).

  • The key difference from observational studies: researchers actively intervene by assigning treatments rather than passively observing.
  • Randomized experiments are fundamentally important for showing causal connections between two variables.
  • Example: a researcher decides which patients receive a new drug and which do not, rather than observing patients who already chose to take or not take the drug.

🧱 Four principles of experimental design

🎛️ Controlling

Controlling: researchers assign treatments to cases and do their best to control any other differences in the groups.

  • Goal: isolate the effect of the treatment by keeping everything else as similar as possible between groups.
  • Example: when patients take a drug in pill form, some might take it with only a sip of water while others drink an entire glass; a doctor may ask all patients to drink a 12-ounce glass of water with the pill to control for water consumption effects.
  • Don't confuse: controlling differences between groups is separate from having a "control group" (the group that doesn't receive treatment).

🎲 Randomization

Randomization: researchers randomize patients into treatment groups to account for variables that cannot be controlled.

  • Why it works: randomizing helps even out differences that researchers cannot directly control, such as dietary habits that might make some patients more susceptible to disease.
  • It also prevents accidental bias from entering the study.
  • Example: rather than letting patients choose their group or assigning based on any pattern, researchers use random assignment (like a coin flip) to determine who goes into treatment vs. control.

🔁 Replication

Replication: collecting a sufficiently large sample within a study, or repeating an entire study to verify findings.

  • The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response.
  • Two forms:
    • Within a study: collect a sufficiently large sample.
    • Across studies: a group of scientists may replicate an entire study to verify an earlier finding.

🧱 Blocking

Blocking: first grouping individuals based on a variable (other than the treatment) that is known or suspected to influence the response, then randomizing cases within each block to the treatment groups.

  • When to use: when researchers know or suspect that certain variables influence the response beyond the treatment itself.
  • How it works:
    1. Split participants into blocks based on the influential variable.
    2. Within each block, randomly assign half to treatment and half to control.
  • Example: studying a drug's effect on heart attacks, researchers might first split patients into low-risk and high-risk blocks, then randomly assign half from each block to control and half to treatment (see Figure 1.16).
  • Benefit: ensures each treatment group has an equal number of low-risk and high-risk patients, preventing imbalance.
  • Note: blocking is a more advanced technique; the first three principles (controlling, randomization, replication) are essential for any study.
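
A sketch of block-then-randomize, assuming each patient's risk label is already known (the names and labels are invented):

```python
import random

# Hypothetical pool: 40 patients, pre-classified by the blocking variable (risk).
patients = [(f"patient_{i:02d}", "high" if i % 3 == 0 else "low") for i in range(1, 41)]

random.seed(1)
assignment = {}
for block in ("low", "high"):
    ids = [pid for pid, risk in patients if risk == block]
    random.shuffle(ids)                  # randomize within the block
    half = len(ids) // 2
    for pid in ids[:half]:
        assignment[pid] = "treatment"    # half of each block to treatment
    for pid in ids[half:]:
        assignment[pid] = "control"      # the other half to control

print(sum(v == "treatment" for v in assignment.values()), "patients in treatment")
```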

🧠 Reducing bias in human experiments

🎭 The problem: emotional effects

  • Even with randomization, human studies can introduce unintentional bias.
  • Scenario: in a heart attack drug study, volunteers are randomly placed into treatment (receives drug) or control (no drug) groups.
  • Two effects emerge:
    1. The effect of interest: the drug's actual effectiveness.
    2. An emotional effect: treatment group members anticipate the fancy new drug will help them; control group members sit idly, hoping participation doesn't increase their risk.
  • The emotional effect is difficult to quantify and can bias the study results.

🕶️ The solution: blinding

Blind study: researchers keep patients uninformed about which treatment group they are in.

  • Purpose: prevent patients' knowledge of their group assignment from influencing outcomes.
  • Challenge: if a patient doesn't receive a treatment, she will know she is in the control group.
  • The excerpt hints at the solution (giving fake treatments) but does not complete the explanation in the provided text.

🆚 Treatment vs. control groups

| Group | What they receive | Purpose |
| --- | --- | --- |
| Treatment group | The drug or intervention being tested | Experience the effect of the new treatment |
| Control group | No drug treatment (or a fake treatment) | Serve as a baseline for comparison |

  • Don't confuse: the control group is not about "controlling variables" (that's the first principle); it's the comparison group that does not receive the intervention.

Examining numerical data

2.1 Examining numerical data

🧭 Overview

🧠 One-sentence thesis

Numerical data can be effectively summarized and visualized through scatterplots, dot plots, means, and histograms to reveal patterns, relationships, and distributions that help us understand the data's structure and central tendencies.

📌 Key points (3–5)

  • Scatterplots reveal relationships: they show case-by-case views of two numerical variables and can reveal linear, nonlinear, or no relationships between variables.
  • The mean as a balancing point: the sample mean (x̄) is computed by summing all observations and dividing by the count, serving as the distribution's center or "balancing point."
  • Histograms show distribution shape: by grouping data into bins and displaying counts as bars, histograms reveal data density and overall shape (such as skewness).
  • Common confusion—raw counts vs. averages: comparing raw totals can mislead when group sizes differ; standardizing through means enables fair comparisons.
  • Weighted means for unequal representation: when observations represent different-sized groups (e.g., counties with varying populations), a weighted mean accounts for these differences rather than treating all observations equally.

📊 Visualizing relationships between variables

📊 What scatterplots show

Scatterplot: a case-by-case view of data for two numerical variables, where each point represents a single case.

  • Each point on a scatterplot corresponds to one observation in the dataset.
  • Example: plotting total income versus loan amount for 50 loans creates 50 points.
  • Scatterplots quickly reveal associations: simple trends, complex patterns, or lack of relationship.
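
A minimal matplotlib sketch; the five (income, amount) pairs are invented stand-ins for the loan50 variables named above.

```python
import matplotlib.pyplot as plt

# Invented stand-ins for two numerical variables from loan50.
total_income = [60000, 85000, 42000, 110000, 73000]
loan_amount  = [15000, 22000, 8000, 30000, 18000]

plt.scatter(total_income, loan_amount)  # one point per case
plt.xlabel("total_income ($)")
plt.ylabel("loan_amount ($)")
plt.show()
```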

🔍 Linear vs. nonlinear relationships

  • Linear relationships appear as straight-line trends in the data.
  • Nonlinear relationships show curvature; the excerpt gives an example of median household income versus poverty rate showing a curved (nonlinear) pattern.
  • Some relationships may be horseshoe-shaped (∩ or ∪), occurring when a variable is beneficial only in moderation.
  • Example: health versus water consumption—some water is necessary, but too much becomes toxic.

Don't confuse: A scatterplot showing no clear pattern doesn't mean the variables are unrelated; it may mean the relationship is weak or requires different analysis methods.

📍 Single-variable summaries

📍 Dot plots for one variable

Dot plot: a one-variable scatterplot showing the distribution of a single numerical variable.

  • Each dot represents one observation's value.
  • Stacked dot plots group nearby values together, making patterns easier to see.
  • Useful for small datasets; larger samples become cluttered and harder to read.
  • Example: the excerpt shows interest rates for 50 loans as individual dots along a number line.

⚖️ The mean (average)

Mean: the sum of all observed values divided by the number of observations.

Formula in words: Add all values together, then divide by how many values you have.

  • Sample mean (x̄): calculated from sample data; serves as our best single estimate.
  • Population mean (μ): the true average across the entire population, often unknown and estimated by x̄.
  • The mean acts as the balancing point of the distribution.
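
In symbols (standard notation consistent with the description above):

```latex
\bar{x} \;=\; \frac{x_1 + x_2 + \cdots + x_n}{n} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i
```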

Why it matters:

  • Standardizes metrics into comparable units (e.g., dollars per hour, attacks per patient).
  • Enables fair comparisons across different group sizes.

💡 Using means for comparisons

| Scenario | Raw comparison problem | Mean-based solution |
| --- | --- | --- |
| Drug trial | New drug: 200 attacks (500 patients) vs. standard: 300 attacks (1,000 patients) | New: 0.4 attacks/patient vs. standard: 0.3 attacks/patient; the standard drug is better |
| Business earnings | Made $11,000 over 3 months | $17.60/hour makes it comparable to other jobs |

Don't confuse: Simple means versus weighted means—when observations represent groups of different sizes (like counties with different populations), a weighted mean accounts for these size differences rather than treating all groups equally.

📊 Histograms and distribution shape

📊 How histograms work

Histogram: a plot showing binned counts of data as bars, where bar height represents frequency or density.

  • Data values are grouped into bins (ranges).
  • Observations on bin boundaries are allocated to the lower bin.
  • Each bar's height shows how many observations fall in that bin.
  • Example: interest rates grouped into 5.0%-7.5%, 7.5%-10.0%, etc.
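
A small NumPy sketch of binning; the rates are invented. Note that np.histogram assigns boundary values to the upper bin (except the last), whereas the text's convention is the lower bin.

```python
import numpy as np

# Invented interest rates (percent).
rates = np.array([5.3, 6.1, 7.4, 7.5, 8.0, 9.9, 9.93, 10.4, 12.6, 15.1, 24.85, 26.3])

counts, edges = np.histogram(rates, bins=np.arange(5.0, 30.0, 2.5))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:5.1f}-{hi:5.1f}: {'#' * c}")   # crude text histogram
```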

📐 Reading distribution shape

  • Higher bars indicate where data are more common (higher density).
  • Skewed right (or positively skewed): long tail extending toward higher values; most data concentrated at lower values.
  • Skewed left (or negatively skewed): long tail extending toward lower values.
  • The excerpt's interest rate histogram is described as "strongly skewed to the right"—most loans have lower rates, with a few extending to much higher rates.

🔄 Dot plots vs. histograms

| Feature | Dot plot | Histogram |
| --- | --- | --- |
| Shows exact values | Yes | No (grouped into bins) |
| Best for | Small datasets | Larger datasets |
| Reveals | Individual observations | Overall distribution shape and density |

Don't confuse: Histograms sacrifice exact values for a clearer view of overall patterns—this trade-off makes large datasets interpretable but loses individual-level detail.

Considering Categorical Data

2.2 Considering categorical data

🧭 Overview

🧠 One-sentence thesis

When comparing groups of different sizes, raw counts can mislead, so we must calculate rates or averages per unit to make fair comparisons and avoid artifacts of imbalanced group sizes.

📌 Key points (3–5)

  • Why raw counts mislead: comparing absolute numbers across unequal groups creates false impressions of which is better or worse.
  • The solution: divide by group size to get per-person (or per-unit) averages, putting data into a standard unit for comparison.
  • Common confusion: a larger raw count does not mean a better rate—always check whether group sizes differ before concluding.
  • Weighted means: when units represent different numbers of people (e.g., counties), a simple average treats all units equally and distorts the true per-person figure; instead, sum all totals and divide by the total population.
  • Practical value: standardized rates (e.g., per hour, per patient) allow meaningful comparisons across jobs, treatments, or other contexts.

📊 Why raw counts fail

📊 The imbalanced-group problem

  • The excerpt shows a drug trial: 500 patients got the new drug (200 attacks), 1,000 got the standard drug (300 attacks).
  • At first glance, 200 < 300 suggests the new drug is better.
  • But this is an artifact: the groups are different sizes, so raw counts cannot be compared directly.

🔢 Calculating per-unit rates

To compare fairly, compute the average number of events per patient (or per unit) in each group.

  • New drug: 200 ÷ 500 = 0.4 attacks per patient.
  • Standard drug: 300 ÷ 1,000 = 0.3 attacks per patient.
  • Conclusion reverses: the standard drug actually has a lower rate of attacks per patient.
  • Example: Don't confuse "fewer total attacks" with "lower attack rate"—always divide by the group size.

💼 Standardizing for real-world decisions

💼 Putting earnings into a standard unit

  • The excerpt describes Emilio's food truck: $11,000 earned over 625 hours worked.
  • Raw total ($11,000) is hard to evaluate without context.
  • Standard unit: $11,000 ÷ 625 hours = $17.60 per hour.
  • Why it matters: hourly wage is a common metric, so Emilio can now compare his venture to other jobs.

⚖️ When units represent populations (weighted means)

  • The excerpt warns against a simple mean when each observation represents different numbers of people.
  • Scenario: computing average income per person across U.S. counties.
    • A county with 5,000 residents and one with 5,000,000 should not count equally.
    • Wrong approach: average the per-capita income across all counties → treats each county the same → result is $26,093.
    • Correct approach: sum total income for all counties, then divide by total population → result is $30,861.
  • This is called a weighted mean: each county's contribution is weighted by its population.
  • Don't confuse: a simple mean of rates ignores that some units are much larger; always weight by the underlying count when units differ in size.
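
A toy three-county sketch of the weighted mean; the numbers are invented, so they will not reproduce the $26,093 and $30,861 figures, which come from the full county data set.

```python
# Invented per-capita incomes and populations for three counties.
per_capita = [22_000, 28_000, 31_000]
population = [5_000, 150_000, 5_000_000]

simple_mean = sum(per_capita) / len(per_capita)  # treats every county equally
weighted_mean = (sum(r * p for r, p in zip(per_capita, population))
                 / sum(population))              # weights each county by its people

print(round(simple_mean), round(weighted_mean))  # the weighted mean tracks the big county
```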

🧮 Summary table

| Situation | Misleading approach | Correct approach | Why |
| --- | --- | --- | --- |
| Unequal group sizes | Compare raw counts | Divide by group size (rate per unit) | Raw counts reflect group size, not the true rate |
| Units of different populations | Simple mean of rates | Weighted mean (total ÷ total population) | A simple mean treats all units equally, distorting the per-person figure |
| Evaluating earnings | Look at total dollars | Compute a per-hour (or per-unit) wage | A standard unit enables comparison across contexts |

Box plots, quartiles, and robust statistics

2.3 Case study: malaria vaccine

🧭 Overview

🧠 One-sentence thesis

Box plots use five key statistics to summarize data and identify outliers, while robust statistics like the median and IQR remain stable even when extreme values change, unlike the mean and standard deviation.

📌 Key points (3–5)

  • Box plots visualize five statistics: minimum (lower whisker), Q1 (first quartile), median, Q3 (third quartile), and maximum (upper whisker), plus any outliers beyond 1.5 × IQR.
  • Median splits data in half: 50% of observations fall below it and 50% above; for even-numbered data sets, it's the average of the two middle values.
  • IQR measures middle spread: the interquartile range (Q3 − Q1) captures the middle 50% of the data and serves as a measure of variability.
  • Common confusion—robust vs. non-robust: the median and IQR barely change when extreme values shift, but the mean and standard deviation are heavily affected by outliers.
  • Choosing the right statistic: for skewed distributions, the median better represents a "typical" value; the mean is more useful when totals or scaling matter.

📦 Anatomy of a box plot

📦 The five-number summary

A box plot is built from five key statistics:

| Component | Definition | Position in plot |
| --- | --- | --- |
| Lower whisker | Lowest value within Q1 − 1.5 × IQR | Bottom of plot |
| Q1 (first quartile) | 25th percentile | Bottom of box |
| Median | 50th percentile | Dark line in box |
| Q3 (third quartile) | 75th percentile | Top of box |
| Upper whisker | Highest value within Q3 + 1.5 × IQR | Top of plot |

  • The box itself represents the middle 50% of the data.
  • Any points beyond the whiskers are marked as dots (potential outliers).

📏 The median: finding the middle

Median: If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, the median is the average of the two middle values.

  • Example: In the loan data with 50 observations, the median is (9.93% + 9.93%) / 2 = 9.93%.
  • The median divides the data so that 25% falls between Q1 and the median, and another 25% falls between the median and Q3.

📐 Interquartile range (IQR)

Interquartile range (IQR): The length of the box in a box plot, computed as IQR = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles.

  • Like standard deviation, IQR measures variability: more variable data → larger IQR.
  • The box captures the middle 50% of observations.

🔭 Whiskers and their reach

  • Whiskers extend from the box to capture data outside it, but never more than 1.5 × IQR from the box edges.
  • Upper whisker: reaches up to Q3 + 1.5 × IQR (or the highest data point within that limit).
  • Lower whisker: reaches down to Q1 − 1.5 × IQR (or the lowest data point within that limit).
  • Think of the box as the "body" and whiskers as "arms" trying to reach the rest of the data.
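
A NumPy sketch of the five statistics and whisker limits on invented rates; np.percentile's default interpolation can differ slightly from the textbook's quartile rule.

```python
import numpy as np

# Invented interest rates (percent).
rates = np.array([5.31, 6.08, 7.96, 9.43, 9.92, 9.93, 9.93, 10.9, 12.6, 15.0, 24.85, 26.3])

q1, median, q3 = np.percentile(rates, [25, 50, 75])
iqr = q3 - q1
lo_limit, hi_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower_whisker = rates[rates >= lo_limit].min()   # lowest point inside the limit
upper_whisker = rates[rates <= hi_limit].max()   # highest point inside the limit
outliers = rates[(rates < lo_limit) | (rates > hi_limit)]

print(f"Q1={q1:.2f} median={median:.2f} Q3={q3:.2f} IQR={iqr:.2f} outliers={outliers}")
```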

🚨 Outliers and extreme observations

🚨 What makes an outlier

Outlier: An observation that appears extreme relative to the rest of the data.

  • Any point beyond the whiskers is marked with a dot and considered a potential outlier.
  • Example: In the interest rate data, 24.85% and 26.30% are outliers because they lie beyond the upper whisker limit.

🔍 Why outliers matter

Examining outliers serves three purposes:

  1. Identifying strong skew: outliers often signal that the distribution is heavily skewed in one direction.
  2. Catching errors: extreme values may indicate data collection or entry mistakes.
  3. Revealing interesting patterns: outliers can point to unusual but real phenomena worth investigating.
  • Don't confuse: not every extreme value is an error—some are genuine and informative.

🛡️ Robust vs. non-robust statistics

🛡️ What "robust" means

Robust statistics: Statistics that are only slightly affected by extreme observations; the median and IQR remain stable even when outliers change dramatically.

  • The excerpt demonstrates this by modifying the most extreme interest rate (26.3%) in three scenarios: original, changed to 15%, and changed to 35%.
  • Result: the median stayed at 9.93% and IQR stayed at 5.76% in all three cases.

⚖️ Comparing robust and non-robust measures

| Statistic | Type | Behavior with extreme values |
| --- | --- | --- |
| Median | Robust | Barely changes when outliers shift |
| IQR | Robust | Stays stable; only sensitive near Q1, median, Q3 |
| Mean | Non-robust | Shifts noticeably with outliers |
| Standard deviation | Non-robust | Changes substantially with outliers |

  • Example: When the 26.3% rate was changed to 35%, the mean increased from 11.57% to 11.74% and standard deviation from 5.05% to 5.68%, but the median and IQR did not budge.
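
A sketch of the three-scenario demonstration on invented rates; the exact values differ from the text, but the pattern (stable median/IQR, moving mean/SD) is the point.

```python
import numpy as np

base = np.array([5.3, 7.1, 8.2, 9.9, 9.93, 10.4, 11.8, 13.1, 17.5, 26.3])  # invented

for extreme in (26.3, 15.0, 35.0):   # original value, then the two modifications
    x = base.copy()
    x[-1] = extreme                  # swap the most extreme observation
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(f"extreme={extreme:5.1f}  median={med:6.3f}  IQR={q3 - q1:5.2f}  "
          f"mean={x.mean():6.3f}  sd={x.std(ddof=1):5.2f}")
```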

🧠 Why the median and IQR are stable

  • The median and IQR depend only on values near Q1, the median, and Q3.
  • As long as the middle region of the data stays the same, these statistics remain unchanged.
  • Extreme values at the tails have no influence unless they move into the middle region.

🎯 Choosing the right summary statistic

🎯 When to use median vs. mean

The choice depends on your goal:

  • Median is better for typical values: When data are skewed (e.g., loan amounts with a few very large loans), the median represents what a "typical" observation looks like.

  • Mean is better for totals and scaling: If you need to understand aggregate quantities (e.g., total money needed to fund 1,000 loans), the mean is more useful because it accounts for all values proportionally.

  • Example: For right-skewed loan amounts, the median tells you what a typical borrower receives, but the mean helps estimate total capital requirements.

📊 Variability: IQR vs. standard deviation

  • IQR: Use when the data are skewed or contain outliers; it focuses on the middle 50% and ignores extremes.

  • Standard deviation: Use when the data are roughly symmetric and you want a measure that incorporates all observations, including extremes.

  • Don't confuse: both measure spread, but IQR is resistant to outliers while standard deviation is sensitive to them.

Defining probability

3.1 Defining probability

🧭 Overview

🧠 One-sentence thesis

Probability forms the theoretical foundation of statistics, and understanding it—starting with simple examples like fair dice—helps build deeper insight into statistical methods even though mastery is not required for applying techniques in this book.

📌 Key points (3–5)

  • Role of probability: it is the foundation of statistics and supports deeper understanding of methods.
  • Not strictly required: mastery of probability concepts is not necessary to apply the statistical techniques introduced later.
  • Starting point: the chapter begins with familiar, intuitive examples (e.g., rolling a die) before moving to formal concepts.
  • Common confusion: probability may feel abstract, but the chapter emphasizes that many ideas are already familiar and the formalization is what is new.

🎲 The role of probability in statistics

🎲 Foundation and purpose

  • The excerpt states that "probability forms the foundation of statistics."
  • Understanding probability provides:
    • A theoretical foundation for ideas in later chapters.
    • A path to deeper understanding of statistical methods.
  • However, the excerpt emphasizes that mastery is not required for applying the techniques in the rest of the book.

🧠 What is new for most readers

  • Many probability ideas are already familiar to readers.
  • What is likely new is the formalization of these concepts.
  • The chapter aims to make intuitive ideas more rigorous without demanding full mastery for practical use.

🎯 Introductory examples

🎯 Starting with familiar scenarios

  • The chapter begins with "basic examples that may feel more familiar" before introducing technical ideas.
  • This approach helps bridge intuition and formalization.

🎲 Example: Rolling a fair die

A die is a cube with six faces numbered 1, 2, 3, 4, 5, and 6.

Problem: What is the chance of getting 1 when rolling a die?

Solution logic:

  • If the die is fair, the chance of rolling a 1 is as good as the chance of any other number.
  • Since there are six possible outcomes, the chance must be 1-in-6, or equivalently, 1 divided by 6.

Key reasoning:

  • Fairness means all outcomes are equally likely.
  • Probability is calculated as the number of favorable outcomes (one face showing 1) divided by the total number of possible outcomes (six faces).

Example: If you roll a fair die many times, you expect to see a 1 about one-sixth of the time.

Don't confuse: "1-in-6" and "1/6" are two ways of expressing the same probability, roughly 0.167.
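
A quick simulation of the long-run 1-in-6 frequency (the roll count is arbitrary):

```python
import random

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(60_000)]  # simulate fair-die rolls
print(rolls.count(1) / len(rolls))  # lands near 1/6, about 0.1667
```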

Conditional probability

3.2 Conditional probability

🧭 Overview

🧠 One-sentence thesis

This section formalizes conditional probability concepts to provide a theoretical foundation for statistical methods, though mastery is not required for applying techniques in the rest of the book.

📌 Key points (3–5)

  • Purpose of the chapter: Probability forms the foundation of statistics and provides a path to deeper understanding of methods introduced later.
  • Not strictly required: Mastery of the probability concepts introduced is not required for applying the techniques in the rest of the book.
  • Formalization is new: While readers are likely already aware of many probability ideas, the formalization of these concepts is likely new for most.
  • Foundation for future learning: Understanding these concepts sets a better foundation for future courses beyond the applied methods.

📚 Context and purpose

📚 Why probability matters in statistics

Probability forms the foundation of statistics.

  • Statistics relies on probability as its underlying theoretical framework.
  • The chapter aims to provide a theoretical foundation for ideas in later chapters.
  • Understanding probability helps gain a deeper understanding of statistical methods, even if not strictly necessary for application.

🎯 What to expect from this chapter

  • The chapter formalizes probability concepts that may already feel familiar.
  • Formalization means taking intuitive ideas and expressing them in precise, technical terms.
  • Example: The introductory section walks through basic examples (like rolling a die) before moving to technical ideas.

🔧 How to approach this material

🔧 For applied learners

  • You can apply the statistical techniques in the rest of the book without mastering these probability concepts.
  • The probability foundation is optional for immediate application but valuable for deeper understanding.
  • Don't confuse: "not required for applying methods" does not mean "not useful"—it provides theoretical grounding.

🔧 For deeper understanding

  • This chapter provides a path to a deeper understanding of statistical methods.
  • It sets a better foundation for future courses beyond this book.
  • The formalization helps connect intuitive probability ideas to rigorous statistical reasoning.

📖 Chapter structure

📖 Topics covered

The chapter includes five main sections:

  • Defining probability (3.1)
  • Conditional probability (3.2)
  • Sampling from a small population (3.3)
  • Random variables (3.4)
  • Continuous distributions (3.5)

📖 Pedagogical approach

  • Starts with introductory examples that feel more familiar (e.g., rolling a die).
  • Moves from basic, intuitive examples to technical formalization.
  • Example: Section 3.1 begins with "Before we get into technical ideas, let's walk through some basic examples that may feel more familiar," then presents a simple die-rolling problem.

Sampling from a small population

3.3 Sampling from a small population

🧭 Overview

🧠 One-sentence thesis

The excerpt provided contains only a chapter outline and introductory framing for probability concepts, but does not include substantive content about sampling from a small population.

📌 Key points (3–5)

  • The excerpt is a table of contents fragment showing that section 3.3 is titled "Sampling from a small population" within Chapter 3 on Probability.
  • Chapter 3 establishes that probability forms the foundation of statistics and provides theoretical grounding for later applied methods.
  • The chapter introduction notes that mastery of probability concepts is not required for applying the methods in the rest of the book.
  • The excerpt includes unrelated content (exam scores, marathon times, Oscar winners) that appears to be from exercises in Chapter 2, not section 3.3.

📋 What the excerpt contains

📋 Chapter structure only

The excerpt shows:

  • Chapter 3 is titled "Probability" and includes five sections (3.1 through 3.5).
  • Section 3.3 is listed as "Sampling from a small population" but no content for this section is provided.
  • The actual text begins with section 3.1 "Defining probability" and includes only introductory examples about dice rolling.

🚫 Missing content

  • No definitions, explanations, or examples related to sampling from a small population appear in the excerpt.
  • The excerpt does not explain what constitutes a "small population," how sampling differs for small vs. large populations, or any techniques or considerations specific to this topic.
  • The unrelated content about exam scores, marathon times, and Oscar winners appears to be exercise material from Chapter 2 on summarizing data, not relevant to section 3.3.

🔍 Context provided

🔍 Chapter 3 framing

The introduction states:

"Probability forms the foundation of statistics, and you're probably already aware of many of the ideas presented in this chapter."

  • The chapter aims to formalize probability concepts that may already be intuitively familiar.
  • Theoretical foundation is provided but is described as not strictly required for applying methods in later chapters.
  • The chapter is positioned as optional for deeper understanding rather than essential for practical application.

📖 What is available

The only substantive probability content in the excerpt is Example 3.1 from section 3.1:

  • A fair die has six equally likely outcomes (1, 2, 3, 4, 5, 6).
  • The chance of rolling a 1 is 1-in-6 or 1/6.
  • This illustrates basic probability calculation when outcomes are equally likely.

Note: To create meaningful review notes for "Sampling from a small population," the actual content of section 3.3 would need to be provided.

Random variables

3.4 Random variables

🧭 Overview

🧠 One-sentence thesis

The excerpt provides only a chapter heading and introductory context for probability, but does not contain substantive content about random variables themselves.

📌 Key points (3–5)

  • What the excerpt contains: only a table of contents entry (3.4 Random variables) and general introductory remarks about probability.
  • Context provided: probability forms the foundation of statistics; formalization of probability concepts is new for most readers.
  • Scope clarification: mastery of probability concepts is not required for applying methods in the rest of the book.
  • Common confusion: the excerpt does not define or explain random variables—it only announces the topic.

📖 What the excerpt actually covers

📖 Chapter structure

The excerpt shows that "Random variables" is section 3.4 within Chapter 3 (Probability), which also includes:

  • 3.1 Defining probability
  • 3.2 Conditional probability
  • 3.3 Sampling from a small population
  • 3.5 Continuous distributions

🎯 Introductory framing

The chapter introduction states:

Probability forms the foundation of statistics, and you're probably already aware of many of the ideas presented in this chapter. However, formalization of probability concepts is likely new for most readers.

  • The chapter provides theoretical foundation for later chapters.
  • Mastery is not required for applying methods in the rest of the book.
  • The material may help gain deeper understanding and set a better foundation for future courses.

⚠️ Content limitation

⚠️ No substantive material on random variables

  • The excerpt includes only the section heading "3.4 Random variables."
  • No definition, explanation, examples, or properties of random variables are provided.
  • The excerpt cuts off after showing examples from section 3.1 (Defining probability), which cover basic die-rolling probability.

📝 What is present instead

  • Chapter 2 exercises on data summarization (exam scores, marathon times, Oscar winners' ages).
  • Chapter 3 opening remarks about probability as a foundation.
  • The beginning of section 3.1 with a simple die example.

Note: To study random variables, you would need the actual content of section 3.4, which is not included in this excerpt.

Continuous Distributions

3.5 Continuous distributions

🧭 Overview

🧠 One-sentence thesis

Continuous probability distributions use smooth curves (densities) to model numerical variables that can take any value in a range, where probabilities correspond to areas under the curve rather than individual outcomes.

📌 Key points (3–5)

  • From discrete to continuous: As histograms use narrower bins, they approach a smooth curve called a probability density function.
  • Probability as area: For continuous distributions, probability equals the area under the density curve over an interval; the total area under any density curve equals 1.
  • Zero probability for exact values: The probability of any single exact value is zero; only intervals have positive probability.
  • Common confusion: Measured vs. theoretical values—in practice, measurements are rounded (e.g., to the nearest cm), giving positive probability to "exact" values, but theoretically exact values have zero probability.
  • Independence still applies: When sampling multiple observations from a continuous distribution, the multiplication rule for independent events works the same way.

📊 Understanding continuous distributions through histograms

📊 The transition from discrete bins to smooth curves

  • Start with a histogram that groups data into bins (e.g., heights grouped by 5 cm intervals).
  • As you make bins narrower and narrower, the histogram's outline becomes smoother.
  • Eventually, the histogram resembles a smooth curve that represents the probability density function (or simply "density" or "distribution").

Example: Figure 3.24 in the excerpt shows US adult heights with progressively narrower bins—the finest binning creates an almost smooth outline.

📏 The special property of densities

Probability density function: A smooth curve representing the distribution of a continuous variable, where the total area under the curve equals 1.

  • This "area = 1" property mirrors how probabilities must sum to 1 in discrete distributions.
  • The curve itself does not give probabilities directly; instead, area under the curve over an interval gives the probability for that interval.

🧮 Computing probabilities from continuous distributions

🧮 Probability equals area under the curve

  • To find the probability that a variable falls in a certain range, calculate the area under the density curve over that range.
  • The excerpt's example: For US adult heights, the probability someone is between 180 cm and 185 cm is the shaded area under the curve in that interval, approximately 0.1157.

How it connects to histograms:

  • In a histogram, you count how many observations fall in the range and divide by the total sample size.
  • In a continuous density, you compute the area under the curve in that range (usually with a computer or calculus).

Example: The excerpt compares the histogram-based estimate (0.1172) with the density-based probability (0.1157)—they are very close.
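
A sketch of the area computation with scipy; the mean and SD here are illustrative guesses, since the excerpt does not state the parameters behind its 0.1157 figure.

```python
from scipy.stats import norm

mu, sigma = 170, 10.2   # illustrative parameters, not the excerpt's actual values
p = norm.cdf(185, mu, sigma) - norm.cdf(180, mu, sigma)  # area between 180 and 185 cm
print(round(p, 4))
```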

🎯 Exact values have zero probability

  • For a continuous variable, the probability of any single exact value is zero.
  • Why? There is no "width" to capture area at a single point.

Example: The probability that a randomly selected person is exactly 180.000... cm tall (measured perfectly) is zero.

Don't confuse with practical measurement:

  • In real life, measurements are rounded (e.g., to the nearest cm).
  • If someone's height rounds to 180 cm, that means they are between 179.5 cm and 180.5 cm—this interval has positive probability.
  • Theoretical "exact" vs. practical "rounded" is an important distinction.

🔗 Using continuous distributions with multiple observations

🔗 Independence and the multiplication rule

  • When you randomly select multiple individuals from a continuous distribution, each selection is independent (assuming sampling with replacement or from a very large population).
  • The multiplication rule for independent events still applies.

Example from the excerpt:

  • Probability one adult is between 180 and 185 cm: 0.1157
  • Probability all three randomly selected adults are in that range: 0.1157 × 0.1157 × 0.1157 = 0.0015
  • Probability none of the three are in that range: (1 − 0.1157)³ = 0.692
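
The arithmetic is quick to verify (0.1157 carried over from the example):

```python
p = 0.1157  # P(one adult is between 180 and 185 cm), from the example

print(round(p ** 3, 4))        # all three in the range: 0.0015
print(round((1 - p) ** 3, 3))  # none of the three in the range: 0.692
```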

🧩 Why this matters

  • Continuous distributions model real-world measurements (height, weight, time, etc.) more realistically than discrete counts.
  • They allow precise probability statements about ranges and support statistical inference (covered in later chapters).
  • Understanding that "probability = area" is foundational for working with normal distributions, confidence intervals, and hypothesis tests.

Note: The excerpt provided is primarily introductory material for Section 3.5 and includes limited content. The notes above cover the core concepts presented: the transition from histograms to densities, probability as area, and the zero-probability property of exact values in continuous distributions.

13

Normal distribution

4.1 Normal distribution

🧭 Overview

🧠 One-sentence thesis

The normal distribution is the most common distribution in statistics, characterized by its symmetric, unimodal, bell-shaped curve that appears ubiquitously in data analysis and statistical inference.

📌 Key points (3–5)

  • Overwhelming prevalence: among all distributions encountered in practice, the normal distribution is by far the most common.
  • Defining shape: symmetric, unimodal, and bell-shaped curve.
  • Alternative name: also commonly called the "normal curve."
  • Role in statistics: used frequently throughout statistical analysis and inference, especially in later applications.

📊 Defining characteristics

🔔 The bell curve shape

Normal distribution (normal curve): a symmetric, unimodal, bell-shaped curve that is ubiquitous throughout statistics.

  • Symmetric: the left and right sides of the curve mirror each other.
  • Unimodal: the distribution has a single peak (mode) at its center.
  • Bell-shaped: the curve rises smoothly to a central peak and falls smoothly on both sides, resembling a bell.
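
These three properties are easy to check numerically; a minimal sketch using a standard normal curve (mean 0, SD 1, chosen purely for illustration):

```python
from scipy.stats import norm

mu, d = 0.0, 1.3  # center and an arbitrary offset

# Symmetric: the density at mu - d equals the density at mu + d.
assert abs(norm.pdf(mu - d) - norm.pdf(mu + d)) < 1e-12

# Unimodal and bell-shaped: the density is highest at the center
# and falls off smoothly on either side.
assert norm.pdf(mu) > norm.pdf(mu + d) > norm.pdf(mu + 2 * d)
print("symmetry and single-peak checks pass")
```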

🏆 Why it's called "normal"

  • The excerpt emphasizes that this distribution is "overwhelmingly the most common" in practice.
  • People recognize it widely enough to call it the "normal" distribution, reflecting its standard appearance in statistical work.
  • Don't confuse: "normal" here means "common/standard," not "correct" or "ideal"—it simply describes the distribution's prevalence.

🌍 Context and importance

📈 Prevalence in practice

  • The excerpt states it is "ubiquitous throughout statistics."
  • Among all distributions seen in practice, this one stands out as the most frequent.
  • Example: when analyzing real-world data or performing statistical inference, the normal distribution appears more often than any other distribution.

🔗 Role in the broader framework

  • The chapter introduces multiple statistical distributions (geometric, binomial, negative binomial, Poisson), but the normal distribution is highlighted first and used most frequently.
  • The excerpt notes that the normal distribution is "used frequently in later chapters," indicating its foundational importance.
  • Other distributions in the chapter are described as "occasionally referenced" and "may be considered optional," underscoring the normal distribution's central role.

14

Geometric distribution

4.2 Geometric distribution

🧭 Overview

🧠 One-sentence thesis

The geometric distribution is one of several statistical distributions that arise frequently in data analysis and inference, introduced alongside other discrete and continuous distributions in this chapter.

📌 Key points (3–5)

  • Chapter context: This section is part of a chapter on distributions of random variables, which includes normal, geometric, binomial, negative binomial, and Poisson distributions.
  • Position in the chapter: The geometric distribution is the second distribution covered, immediately following the normal distribution.
  • Optional nature: While the normal distribution is used frequently in later chapters, the geometric distribution and other remaining sections may be considered optional for the book's core content.
  • Common confusion: Don't confuse the geometric distribution with the normal distribution—the normal is symmetric, unimodal, and bell-shaped (continuous), while the geometric is one of several discrete distributions.

📚 Chapter structure and context

📚 What this chapter covers

The chapter introduces several statistical distributions that frequently appear in data analysis and statistical inference:

| Distribution      | Section | Usage note                        |
|-------------------|---------|-----------------------------------|
| Normal            | 4.1     | Used frequently in later chapters |
| Geometric         | 4.2     | May be considered optional        |
| Binomial          | 4.3     | May be considered optional        |
| Negative binomial | 4.4     | May be considered optional        |
| Poisson           | 4.5     | May be considered optional        |

🎯 Purpose of these distributions

  • These distributions arise frequently in the context of data analysis or statistical inference.
  • The excerpt emphasizes that they are practical tools that appear "in practice."
  • The normal distribution receives special emphasis as "overwhelmingly the most common" and "ubiquitous throughout statistics."

🔍 The geometric distribution's place

🔍 Relationship to other distributions

  • The geometric distribution is introduced second, after the normal distribution.
  • It is grouped with binomial, negative binomial, and Poisson distributions as "remaining sections" that may be optional.
  • The excerpt does not provide the actual definition or properties of the geometric distribution itself—only its position in the chapter structure.

📖 How to approach this section

  • The normal distribution (Section 4.1) is foundational and will be referenced frequently in later chapters.
  • The geometric distribution and other sections (4.2–4.5) will be "occasionally referenced" but are not essential for the book's core content.
  • Don't confuse: "optional" does not mean unimportant—it means these distributions are used less frequently in the specific context of this textbook.

⚠️ Note on excerpt content

⚠️ Limited substantive content

The provided excerpt contains:

  • Chapter and section titles
  • A brief introductory paragraph about the chapter's scope
  • A note about the normal distribution's prominence
  • Several unrelated probability exercises (twins, breakfast costs, ice cream scooping, variance of means)

The excerpt does not include:

  • The actual definition of the geometric distribution
  • Properties, formulas, or characteristics of the geometric distribution
  • Examples or applications of the geometric distribution
  • How to distinguish it from other distributions beyond what is implied by the chapter structure

15

Binomial distribution

4.3 Binomial distribution

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content about the binomial distribution; it only lists the section title within a table of contents for Chapter 4 on distributions of random variables.

📌 Key points (3–5)

  • The excerpt is a table of contents fragment showing that section 4.3 covers the binomial distribution.
  • The binomial distribution is one of several statistical distributions discussed in Chapter 4.
  • Other distributions mentioned in the chapter include normal, geometric, negative binomial, and Poisson distributions.
  • The chapter introduction states that these distributions frequently arise in data analysis and statistical inference.
  • No definitions, formulas, properties, or examples of the binomial distribution are provided in the excerpt.

📚 Context from the excerpt

📚 Chapter structure

The excerpt shows that the binomial distribution is section 4.3 within Chapter 4, titled "Distributions of random variables."

The chapter covers five distributions in order:

  1. Normal distribution (section 4.1)
  2. Geometric distribution (section 4.2)
  3. Binomial distribution (section 4.3)
  4. Negative binomial distribution (section 4.4)
  5. Poisson distribution (section 4.5)

🎯 Chapter purpose

According to the brief introduction provided:

  • The chapter discusses statistical distributions that frequently arise in data analysis or statistical inference.
  • The normal distribution (section 4.1) is used frequently in later chapters and is emphasized.
  • Sections beyond 4.1 (including the binomial distribution section) will occasionally be referenced but may be considered optional for the content in the book.

⚠️ Content limitation

⚠️ No substantive material

The excerpt does not contain the actual content of section 4.3 on the binomial distribution. It only shows:

  • The section number and title in a table of contents
  • A brief chapter-level introduction that does not explain any specific distribution

To learn about the binomial distribution itself—its definition, properties, parameters, formulas, or applications—you would need to access the full text of section 4.3, which is not included in this excerpt.

16

Negative binomial distribution

4.4 Negative binomial distribution

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content about the negative binomial distribution; it only lists the section title in a table of contents.

📌 Key points (3–5)

  • The excerpt shows only a chapter outline listing "4.4 Negative binomial distribution" as a section heading.
  • No definition, properties, formulas, or applications of the negative binomial distribution are provided.
  • The excerpt indicates this section is part of Chapter 4 on "Distributions of random variables."
  • The chapter introduction notes that sections beyond the normal distribution (4.2–4.5, including negative binomial) are occasionally referenced but may be considered optional.

📄 What the excerpt contains

📄 Table of contents entry only

  • The excerpt includes a table of contents for Chapter 4, which lists:
    • 4.1 Normal distribution
    • 4.2 Geometric distribution
    • 4.3 Binomial distribution
    • 4.4 Negative binomial distribution
    • 4.5 Poisson distribution
  • No body text, examples, or explanations for section 4.4 appear in the provided material.

📄 Context from chapter introduction

  • Chapter 4 focuses on statistical distributions that frequently arise in data analysis or statistical inference.
  • The normal distribution (section 4.1) is used frequently in later chapters.
  • Sections 4.2 through 4.5 (including the negative binomial distribution) will be occasionally referenced but may be considered optional for the content in the book.

⚠️ Note on missing content

⚠️ No substantive material provided

  • The excerpt does not include the actual section 4.4 content.
  • No information is available about what the negative binomial distribution is, how it is defined, when it is used, or how it differs from other distributions.
  • To learn about the negative binomial distribution, the full textbook section would need to be consulted.

17

Poisson Distribution

4.5 Poisson distribution

🧭 Overview

🧠 One-sentence thesis

The Poisson distribution models the number of events occurring in a fixed population over a unit of time when events are rare and independent, making it useful for estimating counts like daily hospital admissions or customer arrivals.

📌 Key points (3–5)

  • What it models: the number of events in a large population over a unit of time (e.g., heart attacks per day in a city).
  • Key parameter: the rate λ (lambda) or μ, which is the average number of events expected per time unit.
  • When to use it: events are rare, the population is large, and events occur independently.
  • Common confusion: Poisson vs binomial—Poisson is for counting events in time/space with no fixed "trials," while binomial counts successes in a fixed number of trials.
  • Mean and standard deviation: both are functions of the rate, with mean = λ and standard deviation = √λ.

📊 Core concept and formula

📊 What the Poisson distribution describes

Poisson distribution: A probability distribution that describes the number of events occurring in a fixed population over a unit of time, given a known average rate.

  • The distribution is used when we want to know how many times something happens (not whether it happens).
  • Example: If NYC averages 4.4 heart attacks per day, the Poisson distribution can estimate the probability of observing exactly 3, or 5, or 10 heart attacks on a given day.
  • The time unit can be adjusted (hour, day, week) depending on the context.

🧮 The Poisson formula

The probability of observing exactly k events is:

P(observe k events) = (λ^k × e^(−λ)) / k!

Where:

  • k = the number of events (0, 1, 2, 3, ...)
  • λ = the rate (average number of events per time unit)
  • e ≈ 2.718 (the base of the natural logarithm)
  • k! = k factorial (e.g., 3! = 3 × 2 × 1 = 6)

Mean and standard deviation:

  • Mean = λ
  • Standard deviation = √λ

Example: If λ = 4.4 (average heart attacks per day), the mean is 4.4 and the standard deviation is √4.4 ≈ 2.1.

🔍 When to use the Poisson model

🔍 Guidelines for appropriateness

The Poisson distribution is appropriate when:

  • You are counting events (not measuring continuous quantities).
  • The population is large (e.g., all residents of a city).
  • Events occur independently of each other (one person's heart attack doesn't cause another's).

Don't confuse with:

  • Binomial distribution: Binomial requires a fixed number of trials (e.g., 10 coin flips). Poisson has no fixed number of trials—it counts events over time or space.
  • Geometric/Negative binomial: These model waiting time until a success, not the count of events in a time window.

⚠️ Relaxing the independence assumption

  • In practice, events are not always perfectly independent.
  • Example: Weddings are more common on weekends than weekdays.
  • Solution: Use different rates for different conditions (e.g., a higher rate for Saturdays, a lower rate for Tuesdays).
  • This idea forms the basis of more advanced models called generalized linear models.

📈 Interpreting the distribution shape

📈 Shape characteristics

  • The Poisson distribution is unimodal (one peak) and right-skewed (tail extends to the right).
  • As the rate λ increases, the distribution becomes more symmetric and bell-shaped.
  • Example: A histogram of daily heart attack counts in NYC (rate = 4.4) shows most days have 2–6 events, with a long tail extending to higher counts.

📉 Using the histogram

From the example in the excerpt:

  • Sample mean: 4.38 (close to the historical rate of 4.4)
  • Sample standard deviation: about 2
  • About 70% of days fall between 2.4 and 6.4 events (roughly mean ± 1 standard deviation)

Example interpretation: If you observe 10 heart attacks in one day, that would be unusually high (more than 2 standard deviations above the mean), suggesting either random variation or a change in the underlying rate.

🧪 Practical examples

🧪 Heart attacks in NYC

Scenario: NYC has about 8 million people. Historically, an average of 4.4 people per day are hospitalized for heart attacks.

Question: What is the probability that exactly 3 people will have a heart attack tomorrow?

Answer: Use the Poisson formula with λ = 4.4 and k = 3:

  • P(3 events) = (4.4^3 × e^(−4.4)) / 3!
  • This requires calculation (typically done with software or a calculator).
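
A sketch of that calculation, both straight from the formula and via scipy:

```python
from math import exp, factorial

from scipy.stats import poisson

lam, k = 4.4, 3  # rate and event count from the example

# Direct use of the formula: P(k events) = lam**k * exp(-lam) / k!
p_manual = lam ** k * exp(-lam) / factorial(k)

# The same quantity from scipy's Poisson pmf.
p_scipy = poisson.pmf(k, lam)

print(f"P(exactly 3 heart attacks) = {p_manual:.4f}")  # ~0.174
assert abs(p_manual - p_scipy) < 1e-12
```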

☕ Coffee shop customers

Scenario: A coffee shop serves an average of 75 customers per hour during the morning rush.

Questions:

  • Mean and standard deviation: Mean = 75, SD = √75 ≈ 8.66.
  • Is 60 customers unusually low? 60 is about 1.7 standard deviations below the mean, which is somewhat low but not extremely unusual.
  • Probability of exactly 70 customers: Use Poisson formula with λ = 75, k = 70.

🚗 Cars visiting a retailer

Scenario: Between 2pm and 3pm, an average of 6.5 cars visit a retailer (Monday–Thursday, no holidays).

Question: What is the probability that exactly 5 cars show up next Monday?

Answer: Use Poisson with λ = 6.5, k = 5.

Follow-up: If an average of 11.7 people visit during the same hour, is the number of people also Poisson?

  • Probably not, because the number of people per car is not constant (some cars have 1 person, others have 2 or more).
  • The Poisson model assumes each "event" is identical and independent.
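
For reference, the other scenarios above reduce to the same one-line computation with different rates:

```python
from scipy.stats import poisson

# Coffee shop: P(exactly 70 customers) when the hourly rate is 75.
print(f"P(70 customers) = {poisson.pmf(70, 75):.4f}")

# Retailer: P(exactly 5 cars) when the hourly rate is 6.5.
print(f"P(5 cars) = {poisson.pmf(5, 6.5):.4f}")  # ~0.145
```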

🔄 Comparison with other distributions

| Distribution      | What it models                         | Key difference from Poisson                           |
|-------------------|----------------------------------------|-------------------------------------------------------|
| Binomial          | Number of successes in n fixed trials  | Poisson has no fixed n; counts events over time/space |
| Geometric         | Number of trials until first success   | Poisson counts total events, not waiting time         |
| Negative binomial | Number of trials until kth success     | Poisson counts events in a time window, not trials    |
| Normal            | Continuous measurements (e.g., height) | Poisson is for discrete counts (0, 1, 2, ...)         |

When Poisson resembles normal: If λ is large (e.g., λ > 20), the Poisson distribution becomes more symmetric and can sometimes be approximated by a normal distribution with mean λ and standard deviation √λ.
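
A quick numerical illustration of that convergence, using λ = 30 as an arbitrary "large" rate and comparing the Poisson pmf at the mean with the matching normal density (a rough check, since one value is a probability and the other a density height):

```python
from math import sqrt

from scipy.stats import norm, poisson

lam = 30  # arbitrary large rate, for illustration only

p_pois = poisson.pmf(lam, lam)                    # ~0.0726
p_norm = norm.pdf(lam, loc=lam, scale=sqrt(lam))  # ~0.0728

print(f"Poisson pmf at the mean: {p_pois:.4f}")
print(f"normal density there:    {p_norm:.4f}")
```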

18

Point estimates and sampling variability

5.1 Point estimates and sampling variability

🧭 Overview

🧠 One-sentence thesis

Statistical inference focuses on understanding and quantifying the uncertainty inherent in estimating population parameters from sample data.

📌 Key points (3–5)

  • Core purpose of statistical inference: understanding and quantifying the uncertainty of parameter estimates.
  • Foundational consistency: the foundations for inference remain the same throughout all of statistics, even though equations and details change by setting.
  • Starting point: using a sample proportion to estimate a population proportion is a familiar example of the broader inference process.
  • Common confusion: while methods vary across different statistical settings, the underlying principles of quantifying uncertainty do not change.

🎯 What statistical inference does

🎯 The central concern

Statistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates.

  • Inference is not just about calculating estimates; it is about measuring how uncertain those estimates are.
  • The excerpt emphasizes two aspects:
    • Understanding uncertainty: knowing where it comes from and what it means.
    • Quantifying uncertainty: putting numbers on how much confidence we can have.

🔗 Why uncertainty matters

  • When we estimate a population parameter (e.g., a proportion) from a sample, we cannot be perfectly certain.
  • The goal is to express how much the estimate might vary if we took different samples.
  • Example: if we estimate that 60% of a population supports a policy based on a sample, statistical inference tells us how confident we can be that the true population proportion is near 60%.

🧱 Foundations across all statistics

🧱 Consistency of principles

  • The excerpt states that "the foundations for inference are the same throughout all of statistics."
  • This means the core logic—quantifying uncertainty—applies universally.
  • What changes:
    • The specific equations used.
    • The details of the method.
  • What stays the same:
    • The goal of understanding parameter uncertainty.
    • The conceptual framework.

📐 Don't confuse: methods vs foundations

  • Different statistical settings (e.g., proportions, means, regression) require different formulas.
  • However, the underlying idea—estimating a parameter and measuring its uncertainty—is always present.
  • Example: whether you are estimating a population proportion or a population mean, you still need to account for sampling variability.

📊 Starting example: sample proportion

📊 Using a sample to estimate a population

  • The excerpt introduces "using a sample proportion to estimate a population proportion" as a familiar starting point.
  • This is a concrete case of the broader inference process:
    • You have a population parameter (the true proportion).
    • You collect a sample and calculate a sample proportion (the point estimate).
    • You then assess how much that estimate might vary due to sampling.
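
A small simulation makes that variability concrete; the population proportion, sample size, and seed below are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

p_true, n = 0.60, 100  # hypothetical population proportion and survey size
n_surveys = 10_000     # number of repeated surveys to simulate

# Each simulated survey yields one point estimate p-hat.
p_hats = rng.binomial(n, p_true, size=n_surveys) / n

print(f"mean of p-hats: {p_hats.mean():.3f}")  # centers near 0.60
print(f"SD of p-hats:   {p_hats.std():.3f}")   # near sqrt(0.6 * 0.4 / 100) = 0.049
```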

🔍 What comes next

  • The excerpt mentions that after introducing the sample proportion example, the text will "create what…" (the sentence is incomplete).
  • This suggests the section will build tools (likely confidence intervals or hypothesis tests) to formalize the quantification of uncertainty.
19

Confidence intervals for a proportion

5.2 Confidence intervals for a proportion

🧭 Overview

🧠 One-sentence thesis

Confidence intervals provide a way to quantify the uncertainty around a sample proportion when estimating a population proportion.

📌 Key points (3–5)

  • Core purpose: confidence intervals express the uncertainty inherent in using a sample proportion to estimate a population proportion.
  • Foundation: the method builds on the idea of using sample proportions as estimates for population proportions.
  • Common thread: although equations and details vary across statistical settings, the foundational principles for inference remain consistent throughout statistics.
  • What this section does: it introduces the specific technique of constructing confidence intervals for proportions, building on earlier concepts of point estimates and sampling variability.

📐 The estimation problem

📐 Sample proportion as estimate

  • The excerpt positions this section within a broader framework: using a sample proportion to estimate a population proportion is described as "a familiar topic."
  • This is the starting point—we have a sample and want to say something about the whole population.
  • The challenge: a single sample proportion (a point estimate) does not convey how uncertain that estimate is.

🎯 Why uncertainty matters

  • Statistical inference is "primarily concerned with understanding and quantifying the uncertainty of parameter estimates."
  • A point estimate alone does not tell us how much the true population proportion might differ from what we observed in the sample.
  • Confidence intervals are the tool introduced to address this gap.

🔧 What confidence intervals do

🔧 Quantifying uncertainty

Confidence intervals for a proportion: a method to quantify the uncertainty around a sample proportion when estimating a population proportion.

  • Instead of giving a single number (the sample proportion), a confidence interval gives a range.
  • This range reflects the sampling variability—the fact that different samples would yield different sample proportions.
  • Example: if you survey 100 people and 60% support a policy, a confidence interval might be 50% to 70%, indicating the plausible range for the true population proportion.
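
Using the standard normal-approximation interval (p̂ ± 1.96 × SE for 95% confidence, with SE = √[p̂(1 − p̂)/n]), the survey example works out to roughly the quoted range:

```python
from math import sqrt

p_hat, n = 0.60, 100  # sample proportion and size from the example

se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # roughly (0.504, 0.696)
```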

🧩 Connection to foundations

  • The excerpt emphasizes that "the foundations for inference are the same throughout all of statistics."
  • Even though the specific formulas change depending on the setting (e.g., proportions vs. means, different sample sizes), the underlying logic of quantifying uncertainty is consistent.
  • Don't confuse: the equations may differ, but the conceptual goal—expressing uncertainty—does not.

🗺️ Context within inference

🗺️ Building on prior concepts

The excerpt places this section in a sequence:

  1. Point estimates and sampling variability (Section 5.1): introduces the idea that sample statistics vary from sample to sample.
  2. Confidence intervals for a proportion (Section 5.2): uses that variability to construct an interval estimate.
  3. Hypothesis testing for a proportion (Section 5.3): a different inference technique for proportions.
  • Confidence intervals are one of two main inference tools introduced for proportions; the other is hypothesis testing.
  • Both tools rest on the same foundation: understanding how sample proportions vary.

🧠 Unified framework

  • The excerpt states that "the foundations for inference are the same throughout all of statistics," even as "the equations and details change depending on the setting."
  • This means: once you understand how to quantify uncertainty for a proportion, the same reasoning applies to other parameters (means, differences, etc.), though the mechanics differ.
  • Example: whether you are estimating a proportion, a mean, or a regression coefficient, you always need to account for sampling variability and express uncertainty.
20

Hypothesis testing for a proportion

5.3 Hypothesis testing for a proportion

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing provides a formal framework to evaluate claims about population proportions by comparing observed sample data against a null hypothesis using p-values, allowing us to determine whether evidence is strong enough to reject a skeptical position in favor of an alternative claim.

📌 Key points (3–5)

  • What hypothesis testing evaluates: competing claims (null vs alternative) about a population parameter, where the null represents skepticism or "no difference" and the alternative represents a new claim.
  • How p-values work: the p-value measures the probability of observing data as extreme as ours if the null hypothesis were true; smaller p-values indicate stronger evidence against the null.
  • Two types of errors: Type 1 Error (rejecting a true null) vs Type 2 Error (failing to reject a false null); reducing one type generally increases the other.
  • Common confusion: failing to reject H₀ does NOT mean accepting it as true—it only means we lack sufficient evidence against it; also, use p₀ (not p̂) when checking conditions and computing SE for hypothesis tests.
  • Statistical vs practical significance: with large samples, even tiny differences become statistically significant (p-value < α) but may lack practical real-world value.

🎯 The hypothesis testing framework

🎯 Defining hypotheses

Null hypothesis (H₀): often represents a skeptical perspective or a claim to be tested.

Alternative hypothesis (Hₐ): represents an alternative claim under consideration, often a range of possible parameter values.

  • The null typically represents "no difference" or status quo
  • Our role as data scientists: play skeptic and require strong evidence before accepting the alternative
  • Example: For the Roslings' infant vaccination question (3 choices), H₀: p = 0.333 (people guess randomly), Hₐ: p ≠ 0.333 (people perform differently)

🔢 The null value

  • The null value (labeled p₀) is the specific parameter value we test against
  • Example: p₀ = 0.333 represents one-third, the probability of random guessing on a 3-choice question
  • We compare sample data to this null value to evaluate hypotheses

⚖️ Burden of proof analogy

  • Similar to US courts: defendant is innocent (H₀) or guilty (Hₐ)
  • Jurors examine evidence to see if it convincingly shows guilt beyond reasonable doubt
  • Even if unconvinced of guilt, this doesn't mean they believe the defendant is innocent
  • Likewise: failing to reject H₀ ≠ accepting H₀ as true

🤔 Why we can't just reject implausible nulls

  • Even if we don't believe p is exactly 0.333, that doesn't justify rejecting H₀
  • Without data pointing in a specific direction, rejecting H₀ is both uninteresting and pointless
  • We'd still face the original question: do people do better or worse than guessing?
  • Example: Testing new drug vs existing drug—H₀: no difference; Hₐ: new drug performs differently (better or worse)

🧪 Testing with confidence intervals

🧪 How confidence intervals inform hypothesis tests

  • If a confidence interval does NOT contain the null value p₀, we have evidence against H₀
  • If the confidence interval DOES contain p₀, we cannot say the null is implausible
  • This method works when we can construct a confidence interval for the parameter

📊 Example: Infant vaccination question

  • Data: 50 college-educated adults, 24% answered correctly
  • Check conditions: simple random sample (independence ✓); np̂ = 12, n(1-p̂) = 38 (both ≥ 10, success-failure ✓)
  • 95% CI: 0.24 ± 1.96 × 0.060 → (0.122, 0.358)
  • Since 0.333 falls within this interval, we cannot reject H₀
  • Conclusion: insufficient evidence that college-educated adults perform differently than random guessing
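
A sketch reproducing those numbers (for a confidence interval, the SE uses p̂ rather than p₀):

```python
from math import sqrt

p_hat, n = 0.24, 50
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10  # success-failure condition

se = sqrt(p_hat * (1 - p_hat) / n)                # ~0.060
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"95% CI: ({lower:.3f}, {upper:.3f})")      # ~(0.122, 0.358); contains 0.333
```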

⚠️ What "failing to reject" means

  • Failing to reject H₀ ≠ concluding H₀ is true
  • Perhaps a real difference exists, but our sample was too small to detect it
  • Don't confuse: "not rejecting" with "accepting"—we simply lack sufficient evidence

📈 Example: Children-in-2100 question

  • Larger sample: 228 college-educated adults, 14.9% answered correctly
  • 95% CI: (0.103, 0.195)
  • Since 0.333 does NOT fall in this interval, we reject H₀
  • Conclusion: data provide statistically significant evidence that the actual proportion differs from 0.333
  • Because entire CI is below 0.333, we conclude people do worse than random guessing
  • Important: always specify which confidence level you used

🔄 Confidence level matters

  • Different confidence levels (95%, 99%, 99.9%) can lead to different conclusions
  • Always be clear about which confidence level you used when stating conclusions

🎲 Understanding p-values

🎲 What a p-value measures

P-value: the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis were true.

  • Quantifies the strength of evidence against H₀
  • Smaller p-values = stronger evidence against the null
  • We compute p-values using the null distribution (sampling distribution assuming H₀ is true)

🔧 Key procedural differences from confidence intervals

When conducting hypothesis tests for proportions:

  1. Check success-failure using p₀ (not p̂): np₀ ≥ 10 and n(1 - p₀) ≥ 10
  2. Compute SE using p₀ (not p̂): SE = √[p₀(1 - p₀) / n]

Why the difference?

  • In hypothesis testing, we suppose H₀ is true (different mindset than confidence intervals)
  • We're asking: "If the null were true, how likely is our observed data?"
  • Therefore, we use the null value p₀ for conditions and calculations

📐 Computing the p-value

Steps:

  1. Verify conditions (independence and success-failure using p₀)
  2. Compute SE using p₀
  3. Calculate Z-score: Z = (p̂ - p₀) / SE
  4. Find tail area(s) in the null distribution
  5. For two-sided tests, double the single tail area to get p-value

📉 Example: Coal energy support

  • Sample: 1000 US adults, 37% support increased coal usage
  • Hypotheses: H₀: p = 0.5, Hₐ: p ≠ 0.5, α = 0.05
  • Check conditions using p₀ = 0.5:
    • Independence: simple random sample ✓
    • Success-failure: np₀ = 500, n(1-p₀) = 500 (both ≥ 10) ✓
  • SE = √[0.5 × 0.5 / 1000] = 0.016
  • Z = (0.37 - 0.5) / 0.016 = -8.125
  • Tail area ≈ 0.00000000000000022
  • P-value = 2 × tail area ≈ 0.00000000000000044 (incredibly small)
  • Decision: since p-value < 0.05, reject H₀
  • Conclusion: data provide strong evidence that support for coal differs from 50%; specifically, a majority do NOT support it
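
A sketch of the full calculation, with scipy supplying the tail area (the excerpt rounds the SE to 0.016, so the printed Z differs slightly from -8.125):

```python
from math import sqrt

from scipy.stats import norm

p_hat, p0, n = 0.37, 0.50, 1000

# For hypothesis tests, the condition check and the SE both use the null value p0.
assert n * p0 >= 10 and n * (1 - p0) >= 10

se = sqrt(p0 * (1 - p0) / n)     # ~0.016
z = (p_hat - p0) / se            # ~ -8.2
p_value = 2 * norm.cdf(-abs(z))  # two-sided: double the single tail area

print(f"Z = {z:.2f}, p-value = {p_value:.1e}")  # vanishingly small; reject H0
```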

✅ Decision rule

When the p-value is less than the significance level α, reject H₀. When the p-value is greater than α, do not reject H₀.

  • Standard significance level: α = 0.05 (5%)
  • If p-value < α: reject H₀, conclude data provide strong evidence for Hₐ
  • If p-value > α: do not reject H₀, conclude insufficient evidence against H₀
  • Always describe conclusion in context of the data

🎯 Example: Nuclear arms reduction

  • Sample: 1028 US adults, 56% support nuclear arms reduction
  • Hypotheses: H₀: p = 0.50, Hₐ: p ≠ 0.50, α = 0.05
  • Check: Independence ✓; np₀ = n(1-p₀) = 514 ≥ 10 ✓
  • SE = √[0.5 × 0.5 / 1028] = 0.0156
  • Z = (0.56 - 0.50) / 0.0156 = 3.85
  • Upper tail area ≈ 0.0001
  • P-value = 2 × 0.0001 = 0.0002
  • Decision: 0.0002 < 0.05, so reject H₀
  • Conclusion: convincing evidence that a majority of Americans supported nuclear arms reduction in March 2013

⚠️ Errors and significance levels

⚠️ Two types of errors

| Scenario         | H₀ true      | Hₐ true      |
|------------------|--------------|--------------|
| Do not reject H₀ | Okay         | Type 2 Error |
| Reject H₀        | Type 1 Error | Okay         |

Type 1 Error: rejecting the null hypothesis when H₀ is actually true.

Type 2 Error: failing to reject the null hypothesis when the alternative is actually true.

Court analogy:

  • Type 1 Error = convicting an innocent person (defendant is innocent but wrongly convicted)
  • Type 2 Error = failing to convict a guilty person (defendant is guilty but court failed to convict)

⚖️ The error trade-off

  • Key principle: if we reduce one type of error, we generally make more of the other type
  • To lower Type 1 Error rate: raise the standard (e.g., "beyond a conceivable doubt" instead of "beyond a reasonable doubt")
    • Result: fewer wrongful convictions, but more guilty people go free (more Type 2 Errors)
  • To lower Type 2 Error rate: lower the standard (e.g., "beyond a little doubt")
    • Result: more guilty people convicted, but also more wrongful convictions (more Type 1 Errors)

🎚️ Choosing a significance level

Significance level (α): if the null hypothesis is true, α indicates how often the data lead us to incorrectly reject H₀.

  • Standard level: α = 0.05 (we don't want to incorrectly reject H₀ more than 5% of the time when it's true)
  • Using 95% CI for hypothesis testing = α = 0.05
  • Using 99% CI for hypothesis testing = α = 0.01

When to adjust α:

| Situation                             | Choose                 | Reason                                                           |
|---------------------------------------|------------------------|------------------------------------------------------------------|
| Type 1 Error is dangerous/costly      | Smaller α (e.g., 0.01) | Be very cautious about rejecting H₀; demand very strong evidence |
| Type 2 Error is more dangerous/costly | Larger α (e.g., 0.10)  | Be cautious about failing to reject H₀ when Hₐ is true           |
| Data collection cost is low           | Collect more data      | Reduce Type 2 Error without affecting Type 1 Error rate          |

🚗 Example: Car manufacturer decisions

Door hinges (marginal impact):

  • Testing if new equipment produces flaws < 0.2% of the time
  • Neither error type is particularly dangerous or expensive
  • Use standard α = 0.05

Safety components (high impact):

  • Testing if new supplier's safety parts are more reliable
  • Should be eager to switch even with moderately strong evidence
  • Use larger α = 0.10 to more readily detect safety improvements

Expensive machine part:

  • Part is very expensive to replace
  • Machine usually works even if part is broken
  • H₀: part is not broken; Hₐ: part is broken
  • Failing to fix a broken part isn't very problematic, but replacing is expensive
  • Use small α = 0.01 to require very strong evidence before replacing

🔬 Statistical vs practical significance

Statistical significance: the difference is unlikely due to chance alone (p-value < α).

Practical significance: the difference is large enough to matter in real-world terms.

  • With large samples, even tiny differences become statistically significant
  • These tiny differences may lack practical value
  • Example: Online experiment detects statistically significant 0.001% increase in viewership—statistically significant but not practically significant

Role of sample size planning:

  • Data scientist should plan study size in advance
  • Consult experts to learn the smallest meaningful difference from null value
  • Obtain rough estimate of true proportion to estimate SE
  • Suggest sample size large enough to detect meaningful real differences
  • Especially important when considering costs or potential risks (e.g., health impacts in medical studies)

📋 Formal testing procedure

📋 Four-step process

Prepare:

  • Identify the parameter of interest
  • List hypotheses (H₀ and Hₐ)
  • Identify the significance level α
  • Identify sample proportion p̂ and sample size n

Check:

  • Independence: verify observations are independent (simple random sample, random assignment, or seemingly random process)
  • Success-failure condition: using p₀, verify np₀ ≥ 10 and n(1 - p₀) ≥ 10

Calculate:

  • Compute SE using p₀: SE = √[p₀(1 - p₀) / n]
  • Compute Z-score: Z = (p̂ - p₀) / SE
  • Identify the p-value from the null distribution

Conclude:

  • Compare p-value to α
  • Provide conclusion in context:
    • If p-value < α: "Reject H₀. The data provide strong evidence supporting the alternative hypothesis."
    • If p-value > α: "Do not reject H₀. We do not have sufficient evidence to reject the null hypothesis."

🔍 Verifying independence

When observations are independent:

  • Subjects undergo random assignment to treatment groups (experiments)
  • Observations come from a simple random sample
  • Sample is from a seemingly random process (use best judgment)

Optional additional check:

  • Sample should be no larger than 10% of population
  • When sample exceeds 10% of population, methods slightly overestimate sampling error
  • This is rarely an issue; when it is, methods tend to be conservative

🔀 One-sided tests (special topic)

🔀 What are one-sided tests

Two-sided test (what we've been using):

  • Hₐ: p ≠ p₀ (care about detecting if p is either above or below p₀)

One-sided test forms:

  1. Hₐ: p < p₀ (only value in detecting if parameter is less than p₀)
  2. Hₐ: p > p₀ (only value in detecting if parameter is more than p₀)

Key difference:

  • Compute the p-value as a single tail area in the direction of Hₐ (don't double it)
  • This makes the p-value smaller, so less evidence is required to reject H₀

⚠️ The heavy price of one-sided tests

  • Major risk: must disregard any interesting findings in the opposite direction
  • Example: Stent study for stroke patients
    • Researchers believed stents would help (existing research suggested benefit)
    • Would have been tempting to use one-sided test
    • But data showed opposite: patients with stents did worse
    • If they'd used one-sided test, they would have limited ability to identify harm to patients

🤔 When to use one-sided tests

Critical question to ask yourself: "What would I, or others, conclude if the data happens to go clearly in the opposite direction than my alternative hypothesis?"

  • If there's any value in making a conclusion about data going in the opposite direction, use a two-sided test
  • These considerations can be subtle—exercise caution
  • Best practice: Use two-sided tests (this book only applies two-sided tests)

🚫 Why not choose direction after seeing data

What goes wrong if we pick one-sided direction after observing data:

  • If p̂ < p₀, we'd use Hₐ: p < p₀, and any observation in lower 5% tail leads to rejecting H₀
  • If p̂ > p₀, we'd use Hₐ: p > p₀, and any observation in upper 5% tail leads to rejecting H₀
  • If H₀ were true, there's 10% chance of being in one of the two tails
  • Our testing error becomes α = 0.10, not 0.05
  • This effectively undermines the methods we're developing

Bottom line: Not being careful about when to use one-sided tests destroys the error-rate control that makes hypothesis testing rigorous.
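
A quick simulation of this failure mode, assuming H₀: p = 0.5 really is true and the analyst picks the one-sided direction only after seeing p̂:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

p0, n, trials = 0.5, 1000, 50_000
se = np.sqrt(p0 * (1 - p0) / n)

# Simulate p-hat under a true null and convert to Z-scores.
p_hats = rng.binomial(n, p0, size=trials) / n
z = (p_hats - p0) / se

# Choosing the direction after the fact rejects whenever |Z| > 1.645,
# the one-sided 5% cutoff, so both tails contribute to the error rate.
reject_rate = np.mean(np.abs(z) > 1.645)
print(f"realized Type 1 Error rate: {reject_rate:.3f}")  # about double the nominal 0.05
```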

💡 Important concepts and reminders

💡 The null distribution

Null distribution: the sampling distribution of the point estimate under the assumption that the null hypothesis is true.

  • In real applications, we never actually observe the sampling distribution
  • However, it's useful to always think of a point estimate as coming from such a hypothetical distribution
  • Understanding the sampling distribution helps us characterize and make sense of observed point estimates

💡 Double negatives in statistics

  • Statistical explanations often use double negatives
  • Examples: "the null hypothesis is not implausible" or "we failed to reject the null hypothesis"
  • Purpose: communicate that while we're not rejecting a position, we're also not saying it's correct
  • This reflects the asymmetry in hypothesis testing: we can reject H₀ with strong evidence, but we never "accept" H₀

💡 What hypothesis tests address

What they DO address:

  • Sampling error (variability from one sample to another)
  • Whether observed differences are likely due to chance

What they DON'T address:

  • Bias (systematic tendency to over- or under-estimate)
  • If data collection systematically biases results, hypothesis testing won't fix that
  • We rely on careful data collection procedures to combat bias

💡 Scope of conclusions

  • Confidence intervals and hypothesis tests are only about the population parameter
  • They say nothing about individual observations or point estimates
  • They don't predict future point estimates
  • Example: A 90% CI for solar energy support (87.1% to 90.4%) does NOT mean we're 90% confident a new survey's proportion will fall in that range

💡 Sample size effects

  • Larger samples provide more precise estimates (smaller SE)
  • Formula shows this: SE has n in the denominator, so bigger n means smaller SE
  • With large enough samples, even tiny real differences become detectable
  • This is why planning sample size is important: aim for a size that can detect the smallest meaningful difference

💡 Why 0.05 is standard

  • The α = 0.05 threshold is most common, but why?
  • Maybe the standard should be smaller or larger
  • The excerpt notes there's a 5-minute task to help clarify "why 0.05" at www.openintro.org/why05
  • The choice of significance level should depend on the context and consequences of errors

21

Inference for a single proportion

6.1 Inference for a single proportion

🧭 Overview

🧠 One-sentence thesis

This section reviews inference methods for a single proportion, including point estimates, confidence intervals, and hypothesis tests that were introduced in Chapter 5.

📌 Key points (3–5)

  • What this section covers: a review of inference methods for a single proportion from Chapter 5.
  • Three main techniques: point estimates, confidence intervals, and hypothesis tests.
  • Context: part of applying Chapter 5 methods to categorical data.
  • Foundation: uses the normal distribution to model uncertainty in the sample proportion.

📚 Chapter context

📚 Where this fits in the book

The excerpt places this section within Chapter 6, "Inference for categorical data," which applies methods and ideas from Chapter 5 to several contexts involving categorical data.

Chapter 6 structure:

  • Section 6.1: Inference for a single proportion (this section)
  • Section 6.2: Difference of two proportions
  • Section 6.3: Testing for goodness of fit using chi-square
  • Section 6.4: Testing for independence in two-way tables

🔗 Connection to earlier material

  • Chapter 5 introduced the foundational concepts: point estimates, confidence intervals, and hypothesis tests.
  • Chapter 6 applies these same ideas to categorical data contexts.
  • The normal distribution is used to model uncertainty in the sample proportion.

🔍 What this section reviews

🔍 The three inference methods

The excerpt states that the section reviews three techniques encountered in Chapter 5:

  1. Point estimates: estimating the population proportion from sample data.
  2. Confidence intervals: constructing a range of plausible values for the population proportion.
  3. Hypothesis tests: testing claims about the population proportion.

📊 The underlying model

The normal distribution can be used to model the uncertainty in the sample proportion.

  • This is the statistical foundation for all three inference methods.
  • The normal model allows us to quantify how much a sample proportion might vary from the true population proportion.

⚠️ Note on excerpt content

⚠️ Limited substantive content

The provided excerpt is primarily a chapter introduction and table of contents. It announces that the section will review inference for a single proportion but does not yet present the detailed methods, formulas, conditions, or worked examples. The actual review content appears to continue beyond what is shown in the excerpt.

22

Difference of two proportions

6.2 Difference of two proportions

🧭 Overview

🧠 One-sentence thesis

The difference of two proportions extends single-proportion inference methods to compare whether two groups have different rates of a categorical outcome, using the normal distribution to model uncertainty in the difference between sample proportions.

📌 Key points (3–5)

  • What it measures: whether two groups (e.g., unemployed vs underemployed) differ in the proportion experiencing a categorical outcome (e.g., relationship problems).
  • How it builds on prior knowledge: applies the same normal-model ideas from single-proportion inference to the difference between two proportions.
  • Core method: hypothesis testing evaluates whether observed differences are statistically significant or could arise from random variation.
  • Common confusion: a small p-value means the data are unlikely under the null hypothesis (no difference), not that the difference is large or practically important.
  • Context matters: interpreting results requires understanding what the proportions represent and what the hypothesis test is asking.

🧩 Core concept: comparing two proportions

🧩 What "difference of two proportions" means

Difference of two proportions: the comparison of the rate of a categorical outcome between two independent groups.

  • You have two groups (e.g., unemployed people and underemployed people).
  • Each group has a proportion experiencing some outcome (e.g., major relationship problems).
  • The question is: are these two proportions different, or could the observed difference be due to chance?

🔍 How it extends single-proportion inference

  • Chapter 5 covered inference for one proportion: point estimates, confidence intervals, and hypothesis tests using the normal distribution.
  • This section applies the same core ideas to the difference between two proportions.
  • The normal model still applies, but now it models the uncertainty in the difference rather than a single proportion.

🧪 Hypothesis testing for two proportions

🧪 Setting up hypotheses

  • The excerpt provides an example: comparing unemployed and underemployed respondents.
    • 27% of 1,145 unemployed respondents reported major relationship problems.
    • 25% of 675 underemployed respondents reported major relationship problems.
  • The hypothesis test evaluates: are the proportions of unemployed and underemployed people who had relationship problems different?
  • Typical structure:
    • Null hypothesis: the two proportions are equal (no difference).
    • Alternative hypothesis: the two proportions are different.

📊 Interpreting the p-value

  • The excerpt states: "The p-value for this hypothesis test is approximately 0.35."
  • What this means in context:
    • If the null hypothesis were true (i.e., unemployed and underemployed people have the same rate of relationship problems), there is about a 35% chance of observing a difference as large as (or larger than) the one in the data.
    • A p-value of 0.35 is relatively large, suggesting the observed difference (27% vs 25%) is consistent with random variation.
    • There is not strong evidence that the two proportions are different.
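
A minimal sketch of the two-proportion z-test behind this p-value, assuming the success counts can be reconstructed from the stated percentages (0.27 × 1,145 ≈ 309 and 0.25 × 675 ≈ 169); small rounding differences from the reported 0.35 are expected.

```python
# Sketch of the two-proportion z-test for the relationship-problems data.
# Counts are reconstructed from the stated percentages, so approximate.
from scipy.stats import norm

x1, n1 = 309, 1145   # unemployed respondents with relationship problems
x2, n2 = 169, 675    # underemployed respondents with relationship problems

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
se = (p_pool * (1 - p_pool) * (1/n1 + 1/n2)) ** 0.5
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                         # two-sided test
print(f"z = {z:.2f}, p-value = {p_value:.2f}")        # roughly 0.35
```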

⚠️ Don't confuse: p-value vs effect size

  • A p-value tells you whether the data are surprising under the null hypothesis, not whether the difference is large or important.
  • Example: a 2 percentage-point difference (27% vs 25%) might be statistically insignificant (p = 0.35), meaning it could easily arise by chance even if the true proportions are equal.
  • Conversely, with very large samples, even tiny differences can be statistically significant (low p-value) but may not be practically meaningful.

🔗 Connection to broader inference framework

🔗 Same core ideas, new context

  • The chapter introduction states: "we apply the methods and ideas from Chapter 5 in several contexts for categorical data."
  • For two proportions:
    • Use the normal distribution to model uncertainty (same as single proportion).
    • Apply hypothesis testing logic: define hypotheses, calculate a test statistic, find a p-value, and interpret in context.
    • The difference: the parameter of interest is now the difference between two proportions rather than a single proportion.

🧭 What comes next

  • Later in the chapter, inference techniques extend to contingency tables (two-way tables testing for independence).
  • While those methods use a different distribution (chi-square), the core ideas of hypothesis testing remain the same: define hypotheses, assess evidence, and interpret results in context.

6.3 Testing for goodness of fit using Chi-Square

🧭 Overview

🧠 One-sentence thesis

Simulation-based hypothesis testing allows researchers to determine whether an observed difference between groups is likely due to chance alone or reflects a real treatment effect by comparing the actual result to a distribution of differences generated under the assumption that the treatment has no effect.

📌 Key points (3–5)

  • Core method: simulate many random assignments under the independence model (null hypothesis) to see how often chance alone produces differences as large as the observed difference.
  • Two competing models: the independence model (H₀) says treatment has no effect and the observed difference is just random variation; the alternative model (Hₐ) says the treatment causes the observed difference.
  • Decision rule: if the observed difference is rare under the independence model (appears in only a small percentage of simulations), we reject H₀ and conclude the treatment likely has an effect.
  • Common confusion: observing a rare event in a controlled study is different from anecdotal rare events—in daily life, any outcome seems rare, but in formal studies, we specifically test whether the treatment group's outcome is unusually different from what chance would produce.
  • Error awareness: statistical inference provides tools to control how often we choose the wrong model, though errors (like rare events) can still occur.

🎲 The simulation approach

🎲 How the simulation works

The excerpt describes a vaccine study where researchers simulate what would happen if the vaccine had no effect:

  • Start with the actual patients (e.g., 6 in treatment, 14 in control).
  • Randomly reassign who gets infected, ignoring which group they're in.
  • Calculate the difference in infection rates (control rate minus treatment rate) for this random assignment.
  • Repeat this process many times (e.g., 100 simulations) to build a distribution of differences that could occur by chance alone.

Why this matters: If the independence model is true (vaccine does nothing), the simulated differences should center around zero—some random fluctuation but no systematic pattern.

Example: One simulation might produce 2/6 infected in treatment and 9/14 in control, giving a difference of 9/14 − 2/6 ≈ 0.31; another might give 3/6 in treatment and 8/14 in control, a difference of 8/14 − 3/6 ≈ 0.07. Each simulation represents one possible outcome if treatment assignment doesn't matter.
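
A sketch of this simulation in Python, assuming the group sizes above (6 treatment, 14 control); the total number of infections used here (11) is a placeholder the excerpt does not state.

```python
# Sketch of the randomization simulation under the independence model.
# Group sizes follow the excerpt; the infection total (11) is assumed.
import numpy as np

rng = np.random.default_rng(0)
n_trt, n_ctl, n_infected = 6, 14, 11

# One card per patient: 1 = infected, 0 = not infected.
cards = np.array([1] * n_infected + [0] * (n_trt + n_ctl - n_infected))

diffs = []
for _ in range(100):
    rng.shuffle(cards)                        # random reassignment under H0
    trt, ctl = cards[:n_trt], cards[n_trt:]
    diffs.append(ctl.mean() - trt.mean())     # control rate minus treatment rate

diffs = np.array(diffs)
print("fraction of simulations with difference >= 0.643:",
      np.mean(diffs >= 0.643))
```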

📊 Interpreting the simulation results

Figure 2.31 (described in the excerpt) shows 100 simulated differences stacked as dots:

  • The distribution centers around 0, as expected when treatment has no effect.
  • The actual study observed a difference of 64.3% (0.643).
  • Only about 2 out of 100 simulations (2%) produced a difference at least that large.

Key insight: A 2% occurrence rate means the observed difference is a "rare event" under the independence model—it would happen "only about 2% of the time" if the vaccine truly had no effect.

Don't confuse: "Rare" here is defined relative to the specific null distribution, not general everyday rarity.

🔀 Two competing hypotheses

🔀 The independence model (H₀)

Independence model (H₀): The vaccine has no effect on infection rate, and we just happened to observe a difference that would only occur on a rare occasion.

  • This is the "null hypothesis"—the default assumption that treatment does nothing.
  • Under H₀, any observed difference is purely due to random variation in who happened to get infected.
  • The simulation creates the distribution of differences we'd expect if H₀ were true.

🔀 The alternative model (Hₐ)

Alternative model (Hₐ): The vaccine has an effect on infection rate, and the difference we observed was actually due to the vaccine being effective at combating malaria, which explains the large difference of 64.3%.

  • This model says the treatment causes a real change in outcomes.
  • The large observed difference is not random luck but reflects the vaccine's protective effect.

⚖️ Choosing between models

The excerpt presents two options after seeing the simulation results:

  1. Do not reject H₀: Conclude the study does not provide strong evidence against the independence model—we cannot confidently say the vaccine had an effect.
  2. Reject H₀: Conclude the evidence is strong enough to reject independence and assert the vaccine was useful.

Decision guideline: "When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event."

  • Because the 64.3% difference appeared in only ~2% of simulations, it's considered sufficiently rare.
  • The excerpt concludes: "we reject the independence model in favor of the alternative"—the data provide strong evidence the vaccine offers protection.

🎯 Statistical inference and errors

🎯 What statistical inference does

Statistical inference: a field of statistics built on evaluating whether differences are due to chance; data scientists evaluate which model is most reasonable given the data.

  • It provides systematic tools to decide between competing models.
  • It helps control and evaluate how often errors occur.

⚠️ Acknowledging errors

The excerpt is explicit about limitations:

  • "Errors do occur, just like rare events, and we might choose the wrong model."
  • "While we do not always choose correctly," statistical inference gives tools to manage error rates.

Why errors happen: Even under the independence model, rare events occur 2% of the time—so 2% of the time, we might wrongly reject H₀ when it's actually true.

🚫 Anecdotal evidence vs formal studies

The excerpt includes a footnote warning about misapplying this reasoning:

  • In formal studies: we design experiments to test specific hypotheses; observing a rare outcome under H₀ is meaningful evidence.
  • In daily life: "we observe incredibly rare events every day"—any lottery number combination has 1-in-292-million odds, but some combination must occur.
  • Key distinction: "Any set of numbers we could have observed would ultimately be incredibly rare," so anecdotal rare events don't imply causation.

Don't confuse: The reasoning "this is rare under H₀, so reject H₀" applies to controlled studies with pre-specified hypotheses, not to post-hoc observations of everyday coincidences.

📋 Exercise examples (Avandia and heart transplants)

💊 Avandia cardiovascular study

The excerpt includes an exercise about two diabetes drugs:

  • Rosiglitazone (Avandia): 2,593 out of 67,593 patients had cardiovascular problems (3.8% rate).
  • Pioglitazone (Actos): 5,386 out of 159,978 patients had problems (3.4% rate).

Common reasoning errors (from the exercise):

| Statement | Assessment |
| --- | --- |
| "More patients on pioglitazone had problems (5,386 vs 2,593), so pioglitazone has a higher rate" | Misleading: ignores different group sizes; must compare rates, not counts |
| "Higher rate for rosiglitazone proves it causes problems" | Misleading: correlation doesn't prove causation; could be due to chance or confounding |
| "Cannot tell if difference is due to relationship or chance" | Correct: a simulation or statistical test is needed to evaluate it |

Simulation approach: Write each patient's outcome on a card, shuffle, deal into two groups matching the original sizes (67,593 and 159,978), and repeat 1,000 times to see how often chance produces a difference as large as observed.

❤️ Heart transplant study

Another exercise describes:

  • Control group (no transplant): 30 out of 34 died.
  • Treatment group (transplant): 45 out of 69 died.

Setup for randomization test:

  • Write "alive" on cards for patients who survived and "dead" on cards for those who didn't.
  • Shuffle and split into two groups matching the original sizes.
  • Calculate the difference in death rates (treatment minus control) for each shuffle.
  • Compare the actual observed difference to the distribution from many shuffles.

Purpose: Determine whether the difference in survival rates is larger than what random assignment alone would produce, helping assess whether transplants improve survival.

6.4 Testing for independence in two-way tables

🧭 Overview

🧠 One-sentence thesis

Randomization techniques can test whether two categorical variables (such as treatment and outcome) are independent by simulating what differences would occur by chance alone and comparing the observed difference to that distribution.

📌 Key points (3–5)

  • What independence means: whether survival is independent of treatment can be assessed by comparing proportions across groups and using visualization tools like mosaic plots.
  • Randomization testing approach: shuffle outcome labels randomly across groups many times, calculate the difference in proportions each time, and see how often chance alone produces a difference as extreme as the observed one.
  • Key setup elements: write outcomes on cards (alive/dead), shuffle and split into treatment/control groups of the original sizes, calculate the difference, and repeat to build a null distribution centered at zero.
  • Common confusion: the null distribution is centered at zero (no difference) because it assumes independence; if the observed difference falls far from this distribution, independence is unlikely.
  • Interpreting results: if the fraction of simulated differences as extreme as the observed is low, the null hypothesis (independence) should be rejected in favor of the alternative (treatment has an effect).

🧪 Setting Up the Independence Test

🧪 The research question

The Stanford Heart Transplant Study aimed to determine whether an experimental heart transplant program increased lifespan.

  • Patients were designated as transplant candidates (gravely ill, likely to benefit).
  • Some received transplants (treatment group, n=69), others did not (control group, n=34).
  • Outcome measured: survival status at the end of the study.

📊 Observed data

| Group | Died | Total | Proportion died |
| --- | --- | --- | --- |
| Control | 30 | 34 | 30/34 ≈ 0.88 |
| Treatment | 45 | 69 | 45/69 ≈ 0.65 |

  • The observed difference in death proportions (treatment - control) is approximately 0.65 - 0.88 = -0.23.
  • Negative value means fewer deaths in the treatment group.

🔍 Visual assessment

The excerpt mentions a mosaic plot for assessing independence:

  • If survival were independent of transplant status, the proportions of alive/dead should be similar across treatment and control groups.
  • The mosaic plot helps visualize whether the pattern of outcomes differs between groups.

🎲 Randomization Technique Mechanics

🎲 Core idea

Randomization technique: a method to investigate whether treatment is effective by simulating what would happen if outcomes were randomly assigned to groups (i.e., if treatment had no effect).

  • If treatment truly has no effect, any observed difference is just due to chance assignment of patients to groups.
  • By simulating many random assignments, we can see how often chance alone produces differences as large as what we actually observed.

🃏 Physical simulation setup

The excerpt describes a card-shuffling approach:

  1. Create cards: Write "alive" on cards for patients who survived, and "dead" on cards for patients who died.

    • Total cards = total patients (34 + 69 = 103).
    • Number of "alive" cards = total alive; number of "dead" cards = total dead.
  2. Shuffle and split: Shuffle all cards together, then split into two groups:

    • One group of size 69 (representing treatment).
    • Another group of size 34 (representing control).
  3. Calculate difference: For each shuffle, calculate the difference between the proportion of "dead" cards in treatment and control groups (treatment - control).

  4. Repeat: Do this 100 times (or more) to build a distribution of simulated differences.
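
A sketch of this shuffling procedure in Python; the card counts follow directly from the data above (30 + 45 = 75 dead and 28 alive out of 103 patients), and the repetition count is raised to 10,000 for a smoother null distribution.

```python
# Sketch of the card-shuffling randomization test for the transplant data.
import numpy as np

rng = np.random.default_rng(1)
n_trt, n_ctl = 69, 34
cards = np.array([1] * 75 + [0] * 28)    # 1 = dead, 0 = alive (103 cards)

obs_diff = 45/69 - 30/34                 # observed: treatment - control, about -0.23

diffs = np.empty(10_000)
for i in range(diffs.size):
    rng.shuffle(cards)                   # shuffle and split under H0
    trt, ctl = cards[:n_trt], cards[n_trt:]
    diffs[i] = trt.mean() - ctl.mean()

# Decision rule: fraction of shuffles at least as extreme as observed.
frac = np.mean(np.abs(diffs) >= abs(obs_diff))
print(f"observed diff = {obs_diff:.3f}, fraction as extreme = {frac:.4f}")
```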

📍 What the null distribution shows

  • The distribution is centered at zero because under the null hypothesis (no treatment effect, independence), there should be no systematic difference between groups.
  • Most simulated differences cluster near zero; extreme values are rare if the null hypothesis is true.

🧮 Interpreting the Simulation Results

🧮 The decision rule

If the fraction of simulations where the simulated differences are as extreme as (or more extreme than) the observed difference is low, conclude that it is unlikely to have observed such an outcome by chance, and reject the null hypothesis in favor of the alternative.

  • "As extreme as" means as far from zero (in either direction) as the observed difference.
  • "Low fraction" typically means less than 5% (a common threshold, though not explicitly stated in the excerpt).

📉 What the simulation histogram suggests

The excerpt mentions "simulation results shown below" with a histogram of simulated differences in proportions ranging from about -0.25 to 0.25.

Key observations:

  • The observed difference was approximately -0.23 (treatment had lower death rate).
  • If most simulated differences are clustered near zero and very few are as extreme as -0.23, this suggests the observed difference is unlikely under the null hypothesis.
  • Example interpretation: If only 2 out of 100 simulations produced a difference of -0.23 or more extreme, the fraction is 0.02 (2%), which is low.
  • Conclusion: Reject the null hypothesis; the treatment appears effective.

⚠️ Don't confuse

  • Null hypothesis (independence): Treatment status and survival are unrelated; any observed difference is due to chance.
  • Alternative hypothesis: Treatment status affects survival; the observed difference reflects a real treatment effect.
  • The simulation tests the null by showing what "chance alone" looks like; if the observed data doesn't fit that pattern, the null is unlikely.

📦 Additional Context from the Excerpt

📦 Box plots and efficacy

The excerpt mentions box plots of survival time:

  • Box plots can show the distribution of survival times (in days) for control vs. treatment groups.
  • If the treatment group's box plot shows longer survival times (higher median, higher quartiles), this suggests the treatment is effective.
  • This complements the proportion-based analysis by looking at how long patients survived, not just whether they survived to the end of the study.

📦 Mosaic plot reasoning

For part (a), the question asks whether survival is independent based on the mosaic plot:

  • If the plot shows similar proportions of alive/dead across treatment and control, survival appears independent of treatment.
  • If proportions differ noticeably (e.g., treatment group has a larger "alive" section), survival is likely not independent of treatment.
  • The randomization test formalizes this visual intuition with a probability calculation.

7.1 One-sample means with the t-distribution

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content about one-sample means or the t-distribution; instead, it presents a heart transplant study exercise focused on comparing survival outcomes between treatment and control groups using proportions and randomization techniques.

📌 Key points (3–5)

  • What the excerpt actually covers: a Stanford heart transplant study comparing survival between patients who received transplants versus those who did not.
  • Data structure: 34 control patients (30 died), 69 treatment patients (45 died); survival time measured in days.
  • Analysis approach: uses mosaic plots, box plots, proportions, and randomization simulation to investigate treatment effectiveness.
  • Common confusion: this is a two-group comparison problem using proportions and randomization, not a one-sample means problem with the t-distribution.
  • Core method described: randomization testing to determine whether observed differences in death rates could occur by chance.

📋 Study design and data

📋 The Stanford Heart Transplant Study

  • Purpose: determine whether an experimental heart transplant program increased lifespan.
  • Participants: patients designated as official heart transplant candidates (gravely ill, would likely benefit from a new heart).
  • Groups:
    • Treatment group: received a transplant (69 patients, 45 died)
    • Control group: did not receive a transplant (34 patients, 30 died)
  • Variables tracked: transplant (treatment vs control), survived (alive or dead at study end), survival time in days.

📊 Observed outcomes

| Group | Total patients | Died | Proportion who died |
| --- | --- | --- | --- |
| Control | 34 | 30 | 30/34 ≈ 0.88 |
| Treatment | 69 | 45 | 45/69 ≈ 0.65 |

  • The excerpt asks for calculation of these proportions as part of the analysis.
  • The difference in death rates (treatment minus control) is a key quantity for evaluation.

🔬 Analysis methods presented

🔬 Visual analysis

  • Mosaic plot: used to assess whether survival is independent of transplant status.
    • The excerpt asks whether the plot suggests independence or association.
  • Box plots: show survival time distributions for control vs treatment groups.
    • Used to suggest efficacy (effectiveness) of the treatment.
    • Example: if treatment group shows longer survival times, this suggests potential benefit.

🎲 Randomization technique

A randomization technique investigates whether the treatment is effective by simulating what would happen if group assignment were random.

Setup described in the excerpt:

  • Write "alive" on cards for patients alive at study end, "dead" on cards for patients who died.
  • Shuffle all cards and split into two groups: one representing treatment, one representing control (sizes match the actual study groups).
  • Calculate the difference in proportion of "dead" cards between groups (treatment minus control).
  • Repeat this process 100 times to build a distribution centered at a specific value.
  • Compare the actual observed difference to the simulated distribution.

Claims being tested:

  • Null hypothesis: treatment has no effect (differences are due to chance).
  • Alternative hypothesis: treatment has an effect (observed difference is unlikely under random assignment).

🎯 Interpreting simulation results

  • The excerpt shows a dot plot of "simulated differences in proportions" ranging approximately from -0.25 to 0.25.
  • Decision rule: calculate the fraction of simulations where simulated differences are as extreme as (or more extreme than) the observed difference.
  • Interpretation: if this fraction is low, the observed outcome is unlikely to occur by chance, so reject the null hypothesis in favor of the alternative (treatment is effective).
  • Don't confuse: the distribution is centered at a particular value (likely zero, representing no difference) under the assumption that treatment has no effect.

⚠️ Mismatch with title

⚠️ Content does not match "One-Sample Means with the T-Distribution"

  • What's missing: no discussion of sample means, t-distribution, t-statistics, degrees of freedom, confidence intervals for means, or hypothesis tests for a single mean.
  • What's present: two-group comparison of proportions (categorical outcome: alive/dead), randomization testing, and visual analysis.
  • Why this matters for review: if studying one-sample t-tests, this excerpt does not provide the relevant material; it covers a different statistical method (randomization inference for comparing two proportions).

7.2 Paired data

🧭 Overview

🧠 One-sentence thesis

When two sets of observations have a natural one-to-one correspondence, analyzing the differences between paired observations using t-distribution methods provides a powerful way to test whether the average difference is zero or to estimate the true average difference.

📌 Key points (3–5)

  • What makes data paired: each observation in one set has a special correspondence or connection with exactly one observation in the other set.
  • How to analyze paired data: compute the difference for each pair, then apply standard t-distribution inference methods to those differences.
  • Order consistency matters: always subtract in the same direction (e.g., always "Bookstore − Amazon") to maintain meaningful interpretation.
  • Common confusion: paired vs. independent samples—paired data requires natural correspondence between observations; different-sized data sets cannot be paired.
  • Why it works: by reducing two measurements per subject to one difference per subject, we account for subject-to-subject variability and focus on the treatment effect.

📚 What paired data means

🔗 Definition and structure

Paired data: Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other data set.

  • The key word is "correspondence"—there must be a natural link between specific observations in the two groups.
  • Example: The textbook data set has 68 books, each with both a UCLA Bookstore price and an Amazon price for the same book.
  • Each book is measured twice (once at each store), creating a natural pairing.

🎯 When to recognize pairing

Common paired scenarios from the excerpt:

  • Before-and-after measurements on the same subjects (pre-test and post-test scores; artery thickness at start and after 2 years)
  • Two measurements on the same units (reading and writing scores for the same students; stock prices on the same days)
  • Matched items (prices for the same 68 books at two different stores)

Don't confuse with: Independent samples where different subjects are measured in each group (e.g., randomly sampled men vs. randomly sampled women for salary comparison—these are different people, not paired).

🔢 The difference approach

➖ Computing differences

The excerpt emphasizes: "To analyze paired data, it is often useful to look at the difference in outcomes of each pair of observations."

Critical rule: Always subtract using a consistent order.

In the textbook example:

  • Difference = UCLA Bookstore price − Amazon price
  • First book: 47.97 − 47.45 = 0.52
  • Second book: 14.26 − 13.55 = 0.71
  • Third book: 13.50 − 12.53 = 0.97

Why consistency matters: If you switch the order partway through, positive and negative differences lose their meaning, and your analysis becomes meaningless.

📊 Treating differences as a single variable

Once you compute all differences:

  • You now have one number per pair (not two)
  • The 68 pairs become 68 differences
  • Summary statistics: mean of differences (x̄_diff = 3.58), standard deviation of differences (s_diff = 13.42), sample size (n_diff = 68)
  • A histogram of differences shows the distribution of this single variable

🧪 Hypothesis testing with paired data

🎲 Setting up the test

From Example 7.17, testing whether there is a difference between Amazon and UCLA Bookstore prices:

Hypotheses:

  • H₀: μ_diff = 0 (no difference in average textbook price)
  • Hₐ: μ_diff ≠ 0 (there is a difference in average prices)

The null hypothesis always states that the true mean difference is zero.

✅ Checking conditions

Before using the t-distribution, verify:

  1. Independence: Observations based on a simple random sample (satisfied in the textbook example)
  2. Normality: With n = 68 and no particularly extreme outliers, the normality of x̄ is satisfied

The excerpt notes: "We can use the same t-distribution techniques we applied in Section 7.1"—meaning all the standard t-test machinery applies once you have the differences.

🧮 Computing the test statistic

Standard error of the mean difference: SE = s_diff / √n_diff = 13.42 / √68 = 1.63

Test statistic (T-score): T = (x̄_diff − 0) / SE = (3.58 − 0) / 1.63 = 2.20

Degrees of freedom: df = n_diff − 1 = 68 − 1 = 67

P-value: Using statistical software, one-tail area = 0.0156; two-tailed p-value = 2 × 0.0156 = 0.0312

Conclusion: Because p-value (0.0312) < 0.05, reject the null hypothesis. Amazon prices are, on average, lower than UCLA Bookstore prices.

📏 Confidence intervals for paired data

🎯 Construction method

From Guided Practice 7.19, creating a 95% confidence interval for the average price difference:

Formula: point estimate ± t* × SE

Where:

  • Point estimate = x̄_diff = 3.58
  • t* with df = 67 at 95% confidence = 2.00 (from t-table or software)
  • SE = 1.63 (already computed)

Interval: 3.58 ± 2.00 × 1.63 → (0.32, 6.84)

Interpretation: We are 95% confident that Amazon is, on average, between $0.32 and $6.84 less expensive than the UCLA Bookstore for UCLA course books.
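
A sketch reproducing both the test and the interval from the summary statistics above (x̄_diff = 3.58, s_diff = 13.42, n = 68), using scipy; minor rounding differences from the hand calculations are expected.

```python
# Sketch of the paired t-test and 95% CI for the textbook-price data.
from scipy import stats
import math

xbar, s, n = 3.58, 13.42, 68
se = s / math.sqrt(n)                      # standard error, about 1.63
T = (xbar - 0) / se                        # test statistic, about 2.20
df = n - 1                                 # 67

p_value = 2 * stats.t.sf(abs(T), df)       # two-sided p-value, about 0.031
t_star = stats.t.ppf(0.975, df)            # critical value, about 2.00
ci = (xbar - t_star * se, xbar + t_star * se)
print(f"T = {T:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```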

🔄 Agreement between tests and intervals

The excerpt asks: "Do your results from the hypothesis test and the confidence interval agree?"

How they relate:

  • If you reject H₀: μ_diff = 0 at α = 0.05, then a 95% confidence interval should not include 0
  • In the textbook example: rejected null at 0.05 level, and the interval (0.32, 6.84) does not include 0—they agree

🤔 Practical interpretation

📖 Real-world implications

From Guided Practice 7.20, the excerpt cautions against over-interpreting the average:

What the average tells you: On average, Amazon is cheaper by about $3.58.

What the distribution tells you: Looking at the histogram in Figure 7.9:

  • Some cases show Amazon far below UCLA Bookstore prices
  • Many cases show Amazon prices above UCLA Bookstore prices
  • Most of the time, the price difference isn't that large

Practical advice: "If getting a book immediately from the bookstore is notably more convenient... it's likely a good idea to go with the UCLA Bookstore unless the price difference on a specific book happens to be quite large."

Don't confuse: Statistical significance (we have strong evidence of a difference) with practical significance (the difference may not matter much for individual decisions, especially given the variability).

📅 Context matters

The excerpt notes this is "a very different result" from 2010 data, when "Amazon prices were almost uniformly lower... and by a large margin." Markets change, so conclusions are time-dependent.

7.3 Difference of two means

🧭 Overview

🧠 One-sentence thesis

The t-distribution can be used to construct confidence intervals and conduct hypothesis tests for the difference between two population means when the data are independent (not paired) and certain conditions are met.

📌 Key points (3–5)

  • What this method tests: whether two groups have different population means (μ₁ − μ₂) using independent samples.
  • Key difference from paired data: the two samples are independent of each other, not matched observations on the same subjects.
  • Two required conditions: independence (within and between groups) and normality (checked by looking for outliers in each group separately).
  • Common confusion: paired vs. independent samples—paired data come from the same subjects measured twice; independent samples come from different subjects in two groups.
  • Standard error formula: combines variability from both groups; degrees of freedom use the smaller of (n₁ − 1) and (n₂ − 1) when software is unavailable.

📊 When to use this method

📊 Independent vs. paired data

The excerpt emphasizes that this section covers the difference of two means "under the condition that the data are not paired."

  • Independent samples: two separate groups of subjects, randomly assigned or randomly sampled.
  • Paired samples (covered elsewhere): the same subjects measured twice, or matched pairs.

Don't confuse:

  • Independent: Group A subjects are different people from Group B subjects.
  • Paired: each observation in Group A has a corresponding matched observation in Group B (e.g., before/after on the same person).

🎯 Motivating questions

The excerpt gives three application contexts:

  1. Does a treatment (embryonic stem cells) improve heart function after a heart attack?
  2. Do mothers who smoke have newborns with different average birth weight than mothers who don't smoke?
  3. Is one version of an exam harder than another version?

All share the structure: "Is there convincing evidence that the two groups have different means?"

✅ Conditions for using the t-distribution

✅ Independence, extended

Independence, extended: The data are independent within and between the two groups, e.g. the data come from independent random samples or from a randomized experiment.

  • Within each group: observations don't influence each other.
  • Between groups: Group 1 observations are independent of Group 2 observations.
  • How it's satisfied: random sampling or random assignment (randomized experiment).

Example: In the stem cell study, sheep were randomly assigned to ESC or control group, so independence is satisfied.

✅ Normality (outlier check)

Normality: We check the outliers rules of thumb for each group separately.

  • The excerpt does not require perfect normal distributions; it checks for clear outliers in each group.
  • With sample sizes over 30, the method is fairly robust.
  • How to check: look at histograms or summary statistics (min, max) for each group.

Example: In the stem cell study, histograms showed no clear outliers in either group, so the condition was met.

Don't confuse: You check normality separately for each group, not for the combined data.

🧮 Building a confidence interval

🧮 Point estimate

The point estimate for the difference in population means is simply the difference in sample means:

x̄₁ − x̄₂

Example: In the stem cell study, x̄(ESC) − x̄(control) = 3.50 − (−4.33) = 7.83.

🧮 Standard error

The standard error combines the variability from both groups:

SE = √(s₁²/n₁ + s₂²/n₂)

  • s₁, s₂ are the sample standard deviations.
  • n₁, n₂ are the sample sizes.
  • The formula accounts for uncertainty in both groups.

Example: For the stem cell study, SE = √(5.17²/9 + 2.76²/9) ≈ 1.95.

🧮 Degrees of freedom

  • With software: use the exact (complex) formula.
  • Without software: use the smaller of (n₁ − 1) and (n₂ − 1).

Example: With n(ESC) = 9 and n(control) = 9, df = 8.

🧮 Confidence interval formula

point estimate ± t★ × SE

  • t★ is the critical value from the t-distribution with the chosen confidence level and df.

Example: For 95% confidence with df = 8, t★ = 2.31, so the interval is 7.83 ± 2.31 × 1.95 → (3.32, 12.34). Interpretation: "We are 95% confident that embryonic stem cells improve the heart's pumping function by 3.32% to 12.34%."
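
A sketch of this interval computed from the summary statistics above, using the conservative df = min(n₁, n₂) − 1 rule described for work without software.

```python
# Sketch of the stem cell 95% confidence interval from summary statistics.
from scipy import stats
import math

x1, s1, n1 = 3.50, 5.17, 9    # ESC group
x2, s2, n2 = -4.33, 2.76, 9   # control group

est = x1 - x2                                  # point estimate, 7.83
se = math.sqrt(s1**2 / n1 + s2**2 / n2)        # standard error, about 1.95
df = min(n1, n2) - 1                           # conservative df = 8
t_star = stats.t.ppf(0.975, df)                # critical value, about 2.31
ci = (est - t_star * se, est + t_star * se)
print(f"point estimate = {est:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```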

🔬 Conducting a hypothesis test

🔬 Setting up hypotheses

  • Null hypothesis (H₀): There is no difference between the two population means. In notation: μ₁ − μ₂ = 0.
  • Alternative hypothesis (Hₐ): There is a difference. In notation: μ₁ − μ₂ ≠ 0 (two-sided test).

Example: For the smoking and birth weight study, H₀: μ(nonsmoker) − μ(smoker) = 0; Hₐ: μ(nonsmoker) − μ(smoker) ≠ 0.

🔬 Test statistic

T = (point estimate − null value) / SE

  • The null value is typically 0 (no difference).
  • T measures how many standard errors the observed difference is from zero.

Example: In the smoking study, T = (0.40 − 0) / 0.26 = 1.54.

🔬 P-value and conclusion

  • The p-value is the probability of observing a difference as extreme as (or more extreme than) the data, if the null hypothesis is true.
  • For a two-sided test, find the one-tail area and double it.
  • Compare p-value to the significance level α (commonly 0.05 or 0.01).
    • If p-value < α: reject H₀ (evidence of a difference).
    • If p-value ≥ α: do not reject H₀ (insufficient evidence).

Example: In the smoking study, p-value = 0.135 > 0.05, so we do not reject H₀. There is insufficient evidence of a difference in average birth weight.

Don't confuse: Failing to reject H₀ does not prove the groups are identical; it means the data did not provide strong enough evidence of a difference (possibly due to small sample size or high variability).

🔬 Type 1 and Type 2 errors

  • Type 1 error: rejecting H₀ when it is actually true (false positive).
  • Type 2 error: failing to reject H₀ when it is actually false (false negative).

Example: In the smoking study, if there truly is a difference but we failed to detect it, we made a Type 2 error. The excerpt notes that larger sample sizes increase the chance of detecting a real difference.

🧪 Case study insights

🧪 Exam versions example

An instructor gave two versions of an exam (A and B) and wanted to test if Version B was harder.

  • Hypotheses: H₀: μ(A) − μ(B) = 0; Hₐ: μ(A) − μ(B) ≠ 0.
  • Conditions: Exams were shuffled (random assignment), so independence is satisfied; no clear outliers.
  • Result: T = 1.15, p-value = 0.26 > 0.01, so do not reject H₀. The data do not convincingly show one version is harder.

🧪 Smoking and birth weight example

The excerpt includes a public service note: although the small sample did not show a statistically significant difference, larger datasets do show that smoking is associated with lower birth weights. The excerpt criticizes a 1971 tobacco industry claim that smaller babies from smoking mothers are "just as healthy"—this is false.

Lesson: A non-significant result in a small study does not mean there is no effect; it may mean the study lacked power to detect it.

🔧 Pooled standard deviation (special topic)

🔧 When to pool

Pooled standard deviation: a way to use data from both samples to better estimate the standard deviation and standard error.

  • Only use when: there is strong prior evidence (historical data or biological mechanism) that the two population standard deviations are equal.
  • Benefit: slightly more precise estimates and higher degrees of freedom.

Formula: s²(pooled) = [s₁² × (n₁ − 1) + s₂² × (n₂ − 1)] / (n₁ + n₂ − 2)

Then use s²(pooled) in place of both s₁² and s₂² in the SE formula, and df = n₁ + n₂ − 2.
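
A minimal sketch of the pooled calculation; the inputs reuse the stem cell standard deviations above purely for illustration, since the excerpt gives the formula without a worked example.

```python
# Sketch of the pooled standard deviation formula.
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled SD with df = n1 + n2 - 2, per the formula above."""
    var = (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)
    return math.sqrt(var)

s_pool = pooled_sd(5.17, 9, 2.76, 9)                 # illustrative inputs
se = math.sqrt(s_pool**2 / 9 + s_pool**2 / 9)        # SE using the pooled SD
print(f"pooled SD = {s_pool:.2f}, SE = {se:.2f}, df = {9 + 9 - 2}")
```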

🔧 Caution

Pool standard deviations only after careful consideration: a pooled standard deviation is only appropriate when background research indicates the population standard deviations are nearly equal. When the sample size is large and the condition can be adequately checked with data, the benefit of pooling the standard deviations greatly diminishes.

Don't confuse: Pooling is not the default method; it requires justification. Most of the time, use the unpooled (separate) standard deviations.

🛠️ General procedure

The excerpt emphasizes a consistent four-step approach:

  1. Prepare: Retrieve context and set up hypotheses (if doing a test).
  2. Check: Verify independence and normality (outlier check) conditions.
  3. Calculate: Compute SE, then construct a confidence interval or find the test statistic and p-value.
  4. Conclude: Interpret results in context.

This structure applies across many inference settings, with only the details changing.

7.4 Power calculations for a difference of means

🧭 Overview

🧠 One-sentence thesis

Power calculations help researchers determine the minimum sample size needed to reliably detect practically important effects in experiments, balancing the goals of detecting real differences with the costs and risks of enrolling participants.

📌 Key points (3–5)

  • What power measures: the probability that a study will detect an effect of interest if that effect truly exists (commonly targeted at 80% or 90%).
  • Why sample size planning matters: too few participants means missing important effects; too many wastes resources and may expose unnecessary participants to risk.
  • How to calculate needed sample size: work backwards from the desired power level, significance level, and minimum effect size of interest to find the required number of participants per group.
  • Common confusion: rejection regions vs power—rejection regions are defined under the null hypothesis, but power is calculated under the alternative hypothesis (when a real effect exists).
  • Key trade-off: higher power requires larger samples, but the ethical and financial costs of large trials must be weighed against the risk of missing important effects.

🎯 The core problem: balancing detection and cost

🎯 Two competing considerations in experiment planning

Researchers face a fundamental tension:

  • Need sufficient data to detect effects that matter in practice
  • Data collection has costs: financial expense and potential risk to human participants (especially in clinical trials)

Clinical trial: a health-related experiment where the subjects are people.

The solution is to determine an appropriate sample size where we can be reasonably confident (typically 80% sure) of detecting any practically important effects.

⚖️ What happens with inadequate sample size

If a study uses too few participants and fails to reject the null hypothesis, several problems arise:

  • Researchers wonder whether a real effect exists but went undetected
  • Huge investments (potentially hundreds of millions of dollars) yield inconclusive results
  • Patients were exposed to experimental treatment without gaining clear knowledge
  • A second trial may be needed, requiring years and additional millions of dollars

🔬 Setting up the hypothesis test framework

🔬 Defining hypotheses for a clinical trial

Example context: testing a new blood pressure drug against standard medication.

Hypotheses (typically two-sided in clinical trials):

  • H₀: The new drug performs exactly as well as the standard medication (mean difference = 0)
  • Hₐ: The new drug's performance differs from the standard medication (mean difference ≠ 0)

📏 Calculating the standard error

With known or estimated standard deviation and sample sizes, the standard error for the difference in means is:

  • SE = √(SD₁²/n₁ + SD₂²/n₂)
  • Example: with SD = 12 mmHg and n = 100 per group, SE = 1.70 mmHg

Don't confuse: This SE estimate may be imperfect if the assumed standard deviation doesn't match the actual study population, but it's sufficient for planning purposes.

📊 The null distribution and rejection regions

  • When degrees of freedom exceed 30, the distribution is approximately normal
  • Under H₀, the distribution is centered at 0 with standard deviation = SE
  • For α = 0.05 (two-sided), rejection regions are beyond ±1.96 × SE from zero
  • Example: with SE = 1.70, reject H₀ if the observed difference is below -3.332 or above +3.332 mmHg

💪 Understanding and computing power

💪 What power means

Power: the probability of rejecting the null hypothesis when the alternative hypothesis is actually true.

In other words: if there really is an effect of a certain size, what's the chance our study will detect it?

🔢 How to calculate power for a given sample size

The calculation requires:

  1. The alternative distribution: same shape as the null distribution but shifted to the true effect size
  2. The rejection regions: boundaries determined under the null hypothesis
  3. The overlap: the fraction of the alternative distribution that falls into the rejection regions

Example: with n = 100 per group, SE = 1.70, and true effect = -3 mmHg:

  • Alternative distribution centered at -3 mmHg
  • Lower rejection boundary at -3.332 mmHg
  • Calculate Z-score: (-3.332 - (-3)) / 1.70 = -0.20 → tail area ≈ 0.42
  • Power ≈ 42% (not good enough!)
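
A sketch that reproduces both power figures (≈ 42% at n = 100 and ≈ 97.7% at n = 500), assuming SD = 12 mmHg and a true effect of −3 mmHg as above.

```python
# Sketch of the power calculation for a difference of means.
from scipy.stats import norm
import math

def power(n_per_group, sd=12, effect=-3, alpha=0.05):
    se = math.sqrt(sd**2 / n_per_group + sd**2 / n_per_group)
    reject_below = -norm.ppf(1 - alpha / 2) * se   # lower rejection boundary
    # Probability the estimate lands below the boundary when the
    # alternative distribution is centered at `effect`.
    # (The upper rejection region is ignored, as discussed above.)
    return norm.cdf((reject_below - effect) / se)

print(f"power at n=100: {power(100):.2f}")    # about 0.42
print(f"power at n=500: {power(500):.3f}")    # about 0.977
```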

📈 Why we ignore one rejection region

When calculating power for a specific direction of effect (e.g., a decrease of 3 mmHg), we ignore the opposite-direction rejection region because:

  • There's no value in rejecting H₀ in the wrong direction
  • Example: if the truth is a decrease, detecting an increase wouldn't be meaningful

🎲 Power increases with sample size

Example: with n = 500 per group:

  • SE = 0.76 mmHg
  • Rejection boundaries at ±1.49 mmHg
  • Z-score: (-1.49 - (-3)) / 0.76 = 1.99 → tail area ≈ 0.977
  • Power ≈ 97.7% (probably too high—unnecessarily large sample)

🎯 Determining the right sample size

🎯 Target power levels

Most common practice:

  • 80% power: standard target balancing detection probability with resource constraints
  • 90% power: used when higher certainty is needed
  • These values balance high power against ethical concerns and costs

🔄 Working backwards from desired power

Instead of trying many sample sizes, solve the problem in reverse:

Steps (for 80% power at α = 0.05):

  1. Find the Z-score for 80% lower tail: Z ≈ 0.84
  2. Note that rejection region extends 1.96 × SE from null center
  3. Total distance between null and alternative centers: (0.84 + 1.96) × SE = 2.8 × SE
  4. Set this equal to the minimum effect size of interest
  5. Solve for n

Example calculation for 3 mmHg minimum effect:

  • 3 = 2.8 × SE
  • 3 = 2.8 × √(12²/n + 12²/n)
  • n = (2.8²/3²) × (12² + 12²) ≈ 251 patients per group

Don't confuse: The multiplier 2.8 is specific to 80% power and α = 0.05; different targets require different multipliers.
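
A sketch of this reverse calculation; the multiplier 2.8 is the 0.84 + 1.96 value derived above, and rounding up gives the 251-per-group target.

```python
# Sketch: solve multiplier * SE = effect for the per-group sample size.
import math

def n_per_group(effect, sd, multiplier=2.8):
    # multiplier = z_power + z_alpha/2 (0.84 + 1.96 for 80% power, alpha 0.05)
    return math.ceil(multiplier**2 / effect**2 * (sd**2 + sd**2))

print(n_per_group(3, 12))   # about 251 patients per group
```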

🔧 Adjustments for different scenarios

| Target power | Significance level | Multiplier calculation | Notes |
| --- | --- | --- | --- |
| 80% | α = 0.05 | 0.84 + 1.96 = 2.8 | Standard scenario |
| 90% | α = 0.01 | 1.28 + 2.58 = 3.86 | Higher certainty needed |

⚠️ Small sample size correction

If the initial calculated sample size is roughly 30 or smaller:

  • Rework calculations using the t-distribution instead of normal approximation
  • Use degrees of freedom implied by the initial sample size
  • Adjust the critical values (not 0.84 and 1.96)
  • The revised target will generally be slightly larger

📊 Practical considerations and trade-offs

📊 When is large sample size worth it?

The power curve shows diminishing returns:

  • Beyond 250–350 observations per group, additional participants add little detection ability
  • Example: going from 350 to 5,000 per group provides minimal power improvement

💰 Context-dependent sample size decisions

High-risk or expensive experiments (e.g., drug trials):

  • Critical to do power calculations
  • Balance detection ability against participant risk and cost
  • Avoid both under-powered studies and unnecessarily large trials

Low-risk, inexpensive experiments (e.g., website feature testing):

  • Still ensure adequate sample size for detection
  • May choose larger samples if:
    • The feature is already known to perform reasonably well
    • More precise effect estimates have value (e.g., guiding future development)
    • Ethical concerns are minimal

🤔 Important considerations for setting power

Factors to weigh when choosing target power level:

  • Risk to participants: higher risk → need higher confidence in detecting effects
  • Cost of enrollment: expensive studies → balance power against budget
  • Downside of missing effects: serious consequences → target higher power
  • Precision needs: more precise estimates → may justify larger samples even with adequate power

7.5 Comparing many means with ANOVA

🧭 Overview

🧠 One-sentence thesis

ANOVA (Analysis of Variance) provides a statistical method to test whether means differ across multiple groups simultaneously, avoiding the inflated error rates that would occur from conducting many pairwise comparisons.

📌 Key points (3–5)

  • What ANOVA tests: whether the mean outcome is the same across all groups (null hypothesis) versus at least one mean being different (alternative hypothesis)
  • Why not multiple t-tests: conducting many pairwise comparisons increases the chance of finding false differences just by chance (Type 1 Error inflation)
  • How ANOVA works: compares variability between group means (MSG) to variability within groups (MSE) using an F-statistic
  • Common confusion: rejecting the null in ANOVA tells you that some difference exists among groups, but doesn't identify which specific groups differ—pairwise tests with adjusted significance levels are needed for that
  • Key conditions: observations must be independent within and across groups, data within each group should be nearly normal, and variability across groups should be roughly equal

🎯 When to use ANOVA

🎯 The multiple comparison problem

Data snooping or data fishing: examining all data informally and only afterwards deciding which parts to formally test, leading to inflation in the Type 1 Error rate.

  • If you have many groups and do many pairwise comparisons, you're likely to eventually find a difference just by chance
  • Example: With 20 classes of students randomly assigned, some classes will look different from each other by chance alone; picking the most extreme cases for formal testing inflates error rates
  • ANOVA provides a holistic test first to check if there's evidence that at least one pair of groups differs

🎯 ANOVA hypotheses

The standard form:

  • H₀: μ₁ = μ₂ = ... = μₖ (the mean outcome is the same across all groups)
  • Hₐ: At least one mean is different

Example: Testing if average exam scores differ across three statistics lectures

  • H₀: μₐ = μᵦ = μ꜀ (average score is identical in all lectures)
  • Hₐ: The average score varies by class

🔍 How ANOVA works

🔍 Visual intuition

ANOVA assesses whether differences in group centers are large relative to the variability within each group.

When differences are hard to discern:

  • If data within each group are very volatile (spread out), real differences in means are difficult to detect
  • The within-group variability drowns out any between-group differences

When differences are noticeable:

  • Differences in group centers are large relative to the variability of individual observations within each group
  • Example: Groups IV, V, and VI show clear separation because the between-group differences are substantial compared to within-group scatter

🔍 The F-statistic

F-statistic: F = MSG/MSE, where MSG measures variability between group means and MSE measures variability within groups.

Mean Square Between Groups (MSG):

  • Scaled variance formula for the group means
  • Has degrees of freedom df_G = k - 1 (where k = number of groups)
  • If the null hypothesis is true, variation in sample means is just due to chance and shouldn't be too large

Mean Square Error (MSE):

  • Pooled variance estimate measuring variability within groups
  • Has degrees of freedom df_E = n - k (where n = total observations)
  • Serves as a benchmark for how much variability to expect if H₀ is true

Interpretation:

  • When H₀ is true, MSG and MSE should be about equal, so F ≈ 1
  • Larger F values indicate greater between-group variability relative to within-group variability
  • Stronger evidence against H₀ comes from larger F values
  • P-value is computed from the upper tail of the F-distribution with parameters df₁ = df_G and df₂ = df_E

🔍 Reading ANOVA output

Software typically presents results in a table format:

| Component | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| Between groups | df_G | SSG | MSG | F | p-value |
| Residuals | df_E | SSE | MSE | | |

  • The F-statistic and p-value appear in the last columns
  • Example: For baseball player positions (OF, IF, C) predicting on-base percentage with n=429 players and k=3 groups: df_G = 2, df_E = 426, F = 5.08, p-value = 0.0066
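
A sketch recovering the baseball p-value from the table quantities; only the F-distribution tail area is computed, since the raw player data are not in the excerpt.

```python
# Sketch: p-value from the ANOVA table values for the baseball example.
from scipy.stats import f

MSG, MSE = 0.00803, 0.00158
df_G, df_E = 2, 426

F = MSG / MSE                        # about 5.08
p_value = f.sf(F, df_G, df_E)        # upper tail of the F-distribution
print(f"F = {F:.2f}, p-value = {p_value:.4f}")   # about 0.0066
```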

✅ Checking ANOVA conditions

✅ Three required conditions

1. Independence:

  • Observations must be independent within and across groups
  • Satisfied by: simple random sampling, carefully designed experiments
  • Example: MLB player data—no obvious reasons why independence wouldn't hold for most observations

2. Nearly normal distributions:

  • Data within each group should be nearly normal
  • Less critical with larger sample sizes
  • Check using: histograms of observations from each group
  • Look for: major outliers (minor outliers are less concerning with large samples)
  • Example: With 160+ players in outfield and infield groups, apparent outliers are not a major concern

3. Constant variance:

  • Variability across groups should be approximately equal
  • Check using: side-by-side box plots, comparing standard deviations across groups
  • Especially important when sample sizes differ between groups
  • Example: Standard deviations of 0.043, 0.038, and 0.038 across three positions are reasonably consistent

✅ Diagnostic approach

Use multiple visualization methods:

  • Side-by-side box plots: assess constant variance assumption
  • Histograms for each group: check for normality and outliers
  • Summary statistics table: compare standard deviations numerically

Don't confuse: The normality condition applies to data within each group, not to the distribution of group means.

🔬 Example: Baseball player positions

🔬 Research question

Does on-base percentage (OBP) vary by player position (outfielder, infielder, catcher)?

Data: 429 MLB players from 2018 season with ≥100 at bats

  • Outfielders (OF): n=160, mean=0.320, SD=0.043
  • Infielders (IF): n=205, mean=0.318, SD=0.038
  • Catchers (C): n=64, mean=0.302, SD=0.038

Hypotheses:

  • H₀: μ_OF = μ_IF = μ_C (average OBP is equal across positions)
  • Hₐ: Average OBP varies across some (or all) groups

Why not just compare catchers vs. outfielders?

  • The largest difference is between these two groups
  • But examining data first and then picking groups to test is data snooping
  • This inflates Type 1 Error rate
  • Must test all groups simultaneously with ANOVA first

Results:

  • MSG = 0.00803, MSE = 0.00158
  • df_G = 2, df_E = 426
  • F = 5.077, p-value = 0.0066

Conclusion: Since p-value < 0.05, we reject H₀. The data provide strong evidence that average on-base percentage varies by player's primary field position.

🎯 After ANOVA: Multiple comparisons

🎯 The follow-up question

After rejecting H₀ in ANOVA, we know some difference exists, but which specific groups differ?

🎯 Pairwise comparisons with Bonferroni correction

Bonferroni correction: Use a more stringent significance level α* = α/K, where K is the number of comparisons being made.

For k groups comparing all pairs: K = k(k-1)/2

Example with three statistics lectures:

  • ANOVA showed F = 3.48, p-value = 0.0330 → reject H₀
  • Three pairwise comparisons needed: A vs B, A vs C, B vs C
  • Adjusted significance level: α* = 0.05/3 = 0.0167
  • Use pooled standard deviation from ANOVA: s_pooled = 13.61 on df = 161

Conducting each test:

  • Calculate: difference in means, standard error, T-score, p-value
  • Compare p-value to α* = 0.0167 (not 0.05)
  • Example results:
    • Lecture A vs B: p-value = 0.228 > 0.0167 → no significant difference
    • Lecture A vs C: p-value = 0.146 > 0.0167 → no significant difference
    • Lecture B vs C: p-value = 0.010 < 0.0167 → significant difference

Summary of results: no detectable difference for μₐ vs μᵦ or for μₐ vs μ꜀; significant difference for B vs C (μᵦ ≠ μ꜀).
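
A sketch of one Bonferroni-adjusted comparison. The pooled SD (13.61), df (161), and α* come from the excerpt; the group means and sizes are hypothetical placeholders chosen so the result matches the reported B vs C p-value of 0.010.

```python
# Sketch of a Bonferroni-adjusted pairwise t-test after ANOVA.
from scipy import stats
import math

alpha_star = 0.05 / 3            # K = 3 pairwise comparisons

def pairwise_t(xbar1, n1, xbar2, n2, s_pooled=13.61, df=161):
    se = math.sqrt(s_pooled**2 / n1 + s_pooled**2 / n2)
    T = (xbar1 - xbar2) / se
    return 2 * stats.t.sf(abs(T), df)        # two-sided p-value

# Hypothetical lecture B vs C means and sizes (not from the excerpt).
p = pairwise_t(72.0, 55, 78.9, 51)
verdict = "significant" if p < alpha_star else "not significant"
print(f"alpha* = {alpha_star:.4f}, p-value = {p:.3f} -> {verdict}")
```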

🎯 Important limitation

Failing to reject H₀ does not imply H₀ is true: Not finding evidence of a difference doesn't prove the means are equal.

Possible outcome: Reject H₀ with ANOVA but find no significant differences in pairwise comparisons

  • This does not invalidate the ANOVA conclusion
  • It means we couldn't identify with high confidence which specific groups differ
  • Analogy: SEC may be certain insider trading is happening at a firm (ANOVA), but evidence against any single trader may not be strong (pairwise tests)

📊 Practical considerations

📊 Sample size and power

  • Larger sample sizes increase power to detect differences
  • With very large samples, even small practical differences become statistically significant
  • Consider both statistical significance and practical importance

📊 Reporting results

When reporting ANOVA results, include:

  1. The research question and hypotheses
  2. Verification that conditions are met (or note concerns)
  3. The F-statistic, degrees of freedom, and p-value
  4. The conclusion in context
  5. If H₀ is rejected, results of pairwise comparisons with Bonferroni correction

Don't confuse: Statistical software calculates everything, but you must still verify conditions are met before trusting the results.

8.1 Fitting a line, residuals, and correlation

🧭 Overview

🧠 One-sentence thesis

Linear regression models the relationship between two numerical variables with a straight line, but real-world relationships are rarely perfect because many factors influence outcomes beyond a single predictor.

📌 Key points (3–5)

  • What linear models do: predict outcomes or evaluate whether a linear relationship exists between two numerical variables.
  • Perfect vs. realistic relationships: a perfect linear relationship means knowing x gives the exact value of y, but natural processes are never this precise.
  • Why imperfection is normal: other factors beyond the single predictor variable influence the outcome, making predictions useful but inexact.
  • Common confusion: don't expect real-world data to fit a line perfectly—even strong relationships will have variation around the line.

📐 What linear models are for

📐 Two main uses

Linear models serve two purposes:

  • Prediction: use the value of one variable to estimate the value of another.
  • Evaluation: determine whether a linear relationship exists between two numerical variables.

The excerpt emphasizes that many people encounter regression in everyday contexts, such as news articles showing straight lines overlaid on scatterplots.

🔍 The form of a linear model

The excerpt introduces the equation for a line:

y = 5 + 64.96x

  • This formula expresses y (the outcome) as a function of x (the predictor).
  • The numbers (5 and 64.96) are constants that define the specific line.
  • Example: in the stock purchase scenario, x is the number of shares and y is the total cost.

🎯 Perfect vs. imperfect relationships

✨ What a perfect linear relationship means

A perfect linear relationship means: we know the exact value of y just by knowing the value of x.

  • If the relationship is perfect, every data point falls exactly on the line with no deviation.
  • The excerpt shows an example: purchasing Target Corporation stock, where total cost is computed using a linear formula, so the fit is perfect.
  • Example: if the cost per share is fixed, then knowing the number of shares tells you the exact total cost.

🌍 Why real-world relationships are imperfect

The excerpt states:

  • "This is unrealistic in almost any natural process."
  • Real-world outcomes depend on multiple factors, not just one predictor.

Concrete scenario from the excerpt:

  • Predictor (x): family income.
  • Outcome (y): financial support a college offers a prospective student.
  • Family income provides "some useful information" but the prediction is "far from perfect."
  • Why? Other factors (not just income) play a role in determining financial support.

⚠️ Don't confuse useful with perfect

  • A linear model can be useful for prediction even when the relationship is not perfect.
  • The excerpt emphasizes that imperfection is the norm: "other factors play a role."
  • Don't expect: every data point to lie exactly on the line.
  • Do expect: the line to capture the general trend, with individual points scattered around it.

🔧 The line fitting process

🔧 What this section covers

The excerpt introduces the goals of Section 8.1:

  • Define the form of a linear model (the equation structure).
  • Explore criteria for what makes a good fit (how to judge whether a line fits the data well).
  • Introduce a new statistic called correlation (a measure related to the strength of the linear relationship).

The excerpt encourages thinking "deeply about the line fitting process," suggesting that understanding how and why lines are fitted is central to using linear regression effectively.

8.2 Least squares regression

🧭 Overview

🧠 One-sentence thesis

Least squares regression provides a method for fitting the best straight line to data where the relationship between two numerical variables is not perfect, enabling prediction and evaluation of linear relationships.

📌 Key points (3–5)

  • Purpose of linear regression: predict outcomes or evaluate whether a linear relationship exists between two numerical variables.
  • Perfect vs imperfect relationships: perfect linear relationships (knowing x gives exact y) are unrealistic in natural processes; real-world data requires fitting techniques.
  • What makes regression necessary: other factors beyond the predictor variable affect the outcome, so predictions are never perfect.
  • Common confusion: a perfect linear relationship (like a formula-based calculation) vs a fitted regression line (which approximates real-world data with variability).

📐 When and why we need regression

📐 Perfect linear relationships are rare

A perfect linear relationship means we know the exact value of y just by knowing the value of x.

  • In a perfect relationship, all data points lie exactly on a straight line.
  • The excerpt gives an example: total cost of stock purchases follows the equation y = 5 + 64.96x, where cost is computed by a linear formula.
  • Because the cost calculation is formula-based, the fit is perfect—every point falls exactly on the line.

🌍 Real-world relationships are imperfect

  • Natural processes almost never produce perfect linear relationships.
  • Other factors beyond the predictor variable (x) influence the outcome (y).
  • Example: family income (x) provides useful information about financial support a college may offer (y), but the prediction is far from perfect because other factors (not just income) play a role.
  • This variability is why we need a fitting process rather than a simple formula.

🎯 What linear regression does

🎯 Two main uses

Linear models serve two purposes:

| Use | What it means |
|---|---|
| Prediction | Estimate the value of y for a given x, even when the relationship is not perfect |
| Evaluation | Assess whether a linear relationship exists between two numerical variables |

🔍 The fitting process

  • The excerpt emphasizes thinking deeply about "the line fitting process."
  • Fitting involves finding criteria for what makes a good fit when data points do not fall perfectly on a line.
  • The section introduces the form of a linear model and explores how to measure fit quality.

📊 Key concept: correlation

  • The excerpt mentions introducing a new statistic called correlation as part of understanding line fitting.
  • Correlation helps quantify the strength and direction of the linear relationship between two variables.
  • (The excerpt does not provide the full definition or formula, only that it is introduced in this context.)

🧮 Understanding the linear model form

🧮 Equation structure

  • The excerpt shows a linear equation example: y = 5 + 64.96x
  • This represents the general form of a line: an intercept (5) plus a slope (64.96) multiplied by the predictor variable (x).
  • In perfect relationships, this equation gives exact values; in real data, the equation represents the "best fit" line through scattered points.

⚠️ Don't confuse: formula vs fitted line

  • Formula-based calculation: when y is computed directly from x using a known equation (like cost = base fee + price per unit × quantity), the relationship is perfect.
  • Fitted regression line: when data points scatter around a trend, the line is an approximation that minimizes some measure of error (least squares).
  • The excerpt contrasts the perfect stock-purchase example with the imperfect family-income example to illustrate this distinction.
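A small R sketch of this contrast using simulated data (the noise level is arbitrary, chosen only to illustrate scatter):

```r
set.seed(1)
x <- runif(50, 0, 10)
y_formula <- 5 + 64.96 * x                      # perfect: points fall exactly on the line
y_real    <- 5 + 64.96 * x + rnorm(50, sd = 40) # realistic: scatter around the trend
fit <- lm(y_real ~ x)                           # least squares recovers the trend
coef(fit)                                       # intercept and slope near 5 and 64.96
```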
8.3 Types of outliers in linear regression

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content on types of outliers in linear regression; it only lists the section title in a table of contents and provides introductory material on linear regression basics.

📌 Key points (3–5)

  • The excerpt includes only a chapter outline showing section 8.3 exists but provides no actual content on outlier types.
  • The excerpt introduces linear regression as a technique for prediction and evaluating linear relationships between numerical variables.
  • A perfect linear relationship (where y is exactly determined by x) is unrealistic in natural processes.
  • Real-world examples show that even useful predictors (like family income predicting financial aid) are imperfect due to other influencing factors.

📋 What the excerpt contains

📋 Section listing only

The excerpt shows that section 8.3 is titled "Types of outliers in linear regression" within Chapter 8, but the actual content of that section is not included.

The chapter structure shown:

  • 8.1 Fitting a line, residuals, and correlation
  • 8.2 Least squares regression
  • 8.3 Types of outliers in linear regression
  • 8.4 Inference for linear regression

📋 Introductory context provided instead

The excerpt provides general introductory material about linear regression from earlier sections (8.1), not the requested section 8.3 content on outlier types.

🔍 Linear regression basics (from included material)

🔍 What linear regression does

Linear regression is a statistical technique that can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

  • The excerpt notes that people often see regression as "straight lines overlaid on scatterplots" in news media.
  • It is described as "very powerful" but the excerpt does not elaborate on specific applications beyond prediction and relationship evaluation.

🔍 Perfect vs realistic relationships

  • Perfect linear relationship: knowing x tells you the exact value of y.
  • The excerpt states this is "unrealistic in almost any natural process."

Example: The excerpt provides a stock purchase scenario where total cost equals a fixed formula based on number of shares—this produces a perfect linear fit because cost is computed using a linear formula.

🔍 Real-world imperfection

  • Even useful predictors don't give perfect predictions.
  • Example given: family income (x) provides "useful information" about college financial support (y), but "the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances."
  • Don't confuse: a variable being useful for prediction vs being sufficient for perfect prediction—most real relationships are imperfect even when there is a genuine association.

⚠️ Content limitation note

⚠️ Missing section content

The excerpt does not contain the actual content of section 8.3 on types of outliers in linear regression. To create complete review notes on outlier types, the actual section text would be needed.


Defining probability

8.4 Inference for linear regression

🧭 Overview

🧠 One-sentence thesis

Probability provides the theoretical foundation for statistics by quantifying the likelihood of outcomes in random processes, and while mastery is not required for applying statistical methods, it deepens understanding of those methods.

📌 Key points (3–5)

  • What probability measures: the proportion of times an outcome would occur in a random process.
  • How to calculate simple probabilities: count favorable outcomes divided by total equally likely outcomes.
  • Key operations: probabilities of "or" events add; probabilities of sequential independent events multiply.
  • Common confusion: "not rolling a 2" can be calculated either as 100% minus the probability of rolling a 2, or by counting all other outcomes directly.
  • Why it matters: probability forms the foundation of statistics, even though you can apply statistical methods without deep probability knowledge.

🎲 Basic probability concepts

🎲 What probability describes

Probability: the proportion of times an outcome would occur in a random process.

  • Probability frames situations in terms of a random process giving rise to an outcome.
  • Example: Roll a die → outcome is 1, 2, 3, 4, 5, or 6.
  • Example: Flip a coin → outcome is H or T.
  • The process appears random, but probability quantifies how often each outcome occurs.

🔢 Calculating probability for equally likely outcomes

  • When all outcomes are equally likely, probability = (number of favorable outcomes) / (total number of outcomes).
  • Example: Rolling a fair die has 6 equally likely outcomes. The chance of rolling a 1 is 1/6.
  • Example: The chance of rolling a 1 or 2 is 2/6 = 1/3, because two outcomes are favorable out of six total.

🔄 Probability operations

➕ "Or" probabilities (adding outcomes)

  • When you want the probability of getting one outcome or another, count all favorable outcomes.
  • Example: Rolling 1, 2, 3, 4, 5, or 6 covers all six outcomes, so the probability is 6/6 = 100%.
  • The probability equals the sum of individual outcome probabilities when outcomes don't overlap.

➖ "Not" probabilities (complement)

  • The probability of not getting an outcome = 100% minus the probability of getting that outcome.
  • Example: The chance of not rolling a 2 = 100% - 16.67% = 83.33% or 5/6.
  • Alternative method: count the other outcomes directly (1, 3, 4, 5, or 6 = five outcomes out of six = 5/6).
  • Don't confuse: both methods give the same answer; use whichever is easier to calculate.

✖️ Sequential probabilities (multiplying)

  • When events happen in sequence, multiply their probabilities.
  • Example: If 1/6 of the time the first die is a 1, and 1/6 of those times the second die is also a 1, then the chance both dice show 1 is (1/6) × (1/6) = 1/36.
  • The key phrase is "of those times"—the second probability applies only to the subset where the first event occurred.
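These three operations in R, using the die examples above:

```r
p_1_or_2  <- 1/6 + 1/6       # "or" for disjoint outcomes: 1/3
p_not_2   <- 1 - 1/6         # complement: 5/6
p_both_1s <- (1/6) * (1/6)   # sequential independent events: 1/36
c(p_1_or_2, p_not_2, p_both_1s)
```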

📚 Role in statistics

📚 Foundation vs. application

| Aspect | What the excerpt says |
|---|---|
| Theoretical role | Probability forms the foundation of statistics |
| Practical requirement | Mastery is not required for applying methods in this book |
| Benefit | Provides deeper understanding and better foundation for future courses |
  • The excerpt emphasizes that probability is theoretical background, not a prerequisite for using statistical techniques.
  • Understanding probability helps you grasp why methods work, even if you can apply them without that knowledge.

Defining Probability

9.1 Introduction to multiple regression

🧭 Overview

🧠 One-sentence thesis

Probability quantifies the likelihood of outcomes from random processes, and understanding how to calculate and combine probabilities—whether for disjoint events, complements, or independent processes—enables us to reason systematically about uncertainty.

📌 Key points (3–5)

  • What probability measures: the proportion of times an outcome would occur if we observed a random process infinitely many times (always between 0 and 1).
  • Disjoint vs independent: disjoint events cannot both happen (mutually exclusive), while independent events mean knowing one outcome tells you nothing about the other—events cannot be both.
  • Addition Rule: for disjoint events, add their probabilities; for non-disjoint events, add probabilities then subtract the overlap to avoid double-counting.
  • Multiplication Rule: for independent processes, multiply their separate probabilities to find the probability both occur.
  • Common confusion: complements are useful shortcuts—the probability of "not A" equals 1 minus the probability of A, which simplifies many calculations.

🎲 Core concepts

🎲 Random processes and outcomes

A random process gives rise to an outcome (e.g., rolling a die → 1, 2, 3, 4, 5, or 6; flipping a coin → H or T).

  • The excerpt frames probability around processes that appear random, even if they might be deterministic but too complex to predict exactly.
  • Example: rolling a die is a random process; getting a 3 is one possible outcome.

📊 Probability definition

Probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.

  • Always takes values between 0 and 1 (or 0% to 100%).
  • The Law of Large Numbers says that as you collect more observations, the observed proportion converges to the true probability.
  • Example: if you roll a die 100,000 times, the fraction of 1s will stabilize around 1/6 ≈ 0.167.
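A quick simulation sketch in R of the convergence described above:

```r
set.seed(42)
rolls <- sample(1:6, 100000, replace = TRUE)  # simulate 100,000 die rolls
mean(rolls == 1)                              # stabilizes near 1/6 ≈ 0.167
```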

🔣 Probability notation

  • P(rolling a 1) or abbreviated P(1) when context is clear.
  • The vertical bar | means "given" and separates the outcome from the condition in conditional probability.

🧩 Disjoint events and the Addition Rule

🧩 What disjoint means

Two outcomes are disjoint or mutually exclusive if they cannot both happen.

  • Example: rolling a 1 and rolling a 2 on a single die roll are disjoint.
  • Non-example: rolling a 1 and rolling an odd number are not disjoint (both occur if you roll 1).
  • The terms "disjoint" and "mutually exclusive" are interchangeable.

➕ Addition Rule for disjoint outcomes

If A₁ and A₂ are disjoint outcomes, then P(A₁ or A₂) = P(A₁) + P(A₂).

  • When outcomes cannot happen together, simply add their probabilities.
  • Example: P(rolling 1 or 2) = 1/6 + 1/6 = 1/3.
  • Extends to many disjoint outcomes: P(A₁) + P(A₂) + ... + P(Aₖ).

📦 Events as sets

  • An event is a collection of outcomes (e.g., A = {1, 2} means "rolling 1 or 2").
  • Events are disjoint if they share no outcomes.
  • Example: A = {1, 2} and B = {4, 6} are disjoint; A and D = {2, 3} are not disjoint because they share outcome 2.

🔀 Non-disjoint events and the General Addition Rule

🔀 When events overlap

  • If events are not disjoint, they share some outcomes.
  • Simply adding their probabilities double-counts the overlap.
  • Example: in a deck of cards, "diamond" and "face card" overlap at three cards (J♦, Q♦, K♦).

🔁 General Addition Rule

If A and B are any two events, then P(A or B) = P(A) + P(B) − P(A and B).

  • Subtract the overlap P(A and B) to correct for double-counting.
  • Example: P(diamond or face card) = 13/52 + 12/52 − 3/52 = 22/52.
  • Venn diagrams help visualize overlaps: circles represent events, intersections show shared outcomes.
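The card example, computed in R:

```r
p_diamond <- 13/52
p_face    <- 12/52
p_overlap <- 3/52                 # J, Q, K of diamonds would be counted twice otherwise
p_diamond + p_face - p_overlap    # 22/52 ≈ 0.423
```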

💡 "Or" is inclusive

  • In statistics, "A or B" means A, B, or both A and B occur (inclusive or).
  • Don't confuse: if A and B are disjoint, then P(A and B) = 0, so the General Addition Rule simplifies back to the basic Addition Rule.

🔄 Complements

🔄 What a complement is

The complement of event A, denoted Aᶜ, represents all outcomes not in A.

  • A and Aᶜ are disjoint (they share no outcomes).
  • Together, A and Aᶜ cover all possible outcomes in the sample space S.
  • Example: if A = {2, 3} when rolling a die, then Aᶜ = {1, 4, 5, 6}.

➖ Complement formula

P(A) + P(Aᶜ) = 1, which rearranges to P(A) = 1 − P(Aᶜ).

  • Often easier to compute the complement than the event itself.
  • Example: P(sum of two dice is less than 12) = 1 − P(sum is 12) = 1 − 1/36 = 35/36.
  • Useful when the complement has fewer outcomes to count.

🎯 Probability distributions

🎯 What a distribution is

A probability distribution is a table of all disjoint outcomes and their associated probabilities.

Rules for probability distributions:

  1. Outcomes must be disjoint.
  2. Each probability must be between 0 and 1.
  3. Probabilities must total 1.

📊 Visualizing distributions

  • Bar plots show probability distributions: bar heights represent probabilities.
  • For numerical discrete outcomes, bars can be placed at their values (like a histogram).
  • Example: the sum of two dice has a distribution where P(7) = 6/36 is the highest probability.
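One way to tabulate the two-dice distribution in R:

```r
sums <- outer(1:6, 1:6, `+`)   # all 36 equally likely outcomes of two dice
table(sums) / 36               # P(7) = 6/36 is the largest probability
```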

🔗 Independence and the Multiplication Rule

🔗 What independence means

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other.

  • Example: flipping a coin and rolling a die are independent.
  • Non-example: stock prices usually move together, so they are not independent.
  • For events A and B to be independent: P(A and B) = P(A) × P(B).

✖️ Multiplication Rule for independent processes

If A and B are events from independent processes, then P(A and B) = P(A) × P(B).

  • Extends to k independent events: P(A₁) × P(A₂) × ... × P(Aₖ).
  • Example: probability both dice show 1 = (1/6) × (1/6) = 1/36.
  • Example: probability both of two randomly selected people are left-handed = 0.09 × 0.09 = 0.0081.

🎴 Testing independence

  • Check whether P(A and B) = P(A) × P(B) holds.
  • Example: drawing a heart (P = 1/4) and drawing an ace (P = 1/13) from a deck: P(heart and ace) = 1/52 = (1/4) × (1/13), so these events are independent.

🔍 Conditional probability

🔍 What conditional probability measures

Conditional probability is the probability of an outcome given that we know some condition is true.

  • Notation: P(A | B) reads as "probability of A given B."
  • The condition B is information we know; we compute the probability of A within that restricted context.

🧮 Conditional probability formula

P(A | B) = P(A and B) / P(B)

  • Think of it as: among all cases where B occurs, what fraction also have A?
  • Example: P(photo is about fashion | ML predicted fashion) = (number where both true) / (number where ML predicted fashion) = 197/219 = 0.900.
  • Can compute from counts or from probabilities (joint divided by marginal).
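The photo-classification example, computed in R from the counts above:

```r
p_predicted <- 219 / 1822   # marginal: P(ML predicted fashion)
p_both      <- 197 / 1822   # joint: P(fashion AND predicted fashion)
p_both / p_predicted        # conditional: 197/219 ≈ 0.900
```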

🔗 General Multiplication Rule

P(A and B) = P(A | B) × P(B)

  • This is just a rearrangement of the conditional probability formula.
  • Useful for computing joint probabilities when you know a conditional probability and a marginal probability.
  • Example: if 96.08% were not inoculated and 85.88% of those survived, then P(not inoculated and survived) = 0.8588 × 0.9608 ≈ 0.825.

📋 Marginal vs joint probabilities

  • Marginal probability: based on a single variable (e.g., P(ML predicted fashion) = 219/1822).
  • Joint probability: based on two or more variables (e.g., P(ML predicted fashion and photo is fashion) = 197/1822).
  • Marginal probabilities appear in row/column totals; joint probabilities appear in table cells.

⚠️ Observational vs causal

  • Conditional probabilities from observational data (like the smallpox inoculation example) show associations but do not prove causation.
  • Confounding variables may influence both the condition and the outcome.
  • Example: people who chose inoculation might have been healthier or wealthier, affecting survival independently of inoculation.
9.2 Model selection

🧭 Overview

🧠 One-sentence thesis

This excerpt does not contain substantive content about model selection; it consists entirely of probability calculations, conditional probability examples, and exercises from a statistics textbook chapter on probability theory.

📌 Key points (3–5)

  • The excerpt contains worked examples of conditional probability calculations (e.g., smallpox inoculation data, mammogram screening)
  • Tree diagrams and Bayes' Theorem are presented as methods for computing inverted conditional probabilities
  • The material covers sampling with and without replacement from small populations
  • Random variables, expected values, variance, and linear combinations are introduced
  • Normal distribution basics are briefly mentioned at the end

📋 Content note

📋 Mismatch between title and content

The title "9.2 Model selection" does not match the actual content of the excerpt. The provided text is from Chapter 3 ("Probability") of an introductory statistics textbook, covering:

  • Conditional probability formulas and calculations
  • Tree diagrams for organizing probabilities
  • Bayes' Theorem for inverting conditional probabilities
  • Sampling methods (with/without replacement)
  • Random variables and their properties
  • Expected value and variance
  • Linear combinations of random variables
  • Brief introduction to continuous distributions and normal distribution

📋 What is missing

There is no content about model selection in this excerpt. Model selection typically refers to choosing between competing statistical models based on criteria like fit, complexity, or predictive performance—topics not addressed here.

The excerpt appears to be a mislabeled section from a probability chapter rather than content about model selection methodology.


Standardizing with Z-scores and Normal Distribution Applications

9.3 Checking model conditions using graphs

🧭 Overview

🧠 One-sentence thesis

Z-scores standardize observations by measuring how many standard deviations they fall from the mean, enabling comparisons across different distributions and calculation of probabilities using the normal distribution.

📌 Key points (3–5)

  • What Z-scores measure: the number of standard deviations an observation falls above or below the mean
  • Why standardization matters: allows fair comparison of observations from different distributions (e.g., SAT vs ACT scores)
  • How to use Z-scores: convert raw scores to Z-scores, then use software/tables to find probabilities or percentiles
  • Common confusion: positive Z-scores mean above the mean, negative means below; larger absolute values indicate more unusual observations
  • Two-way process: can go from raw score to probability (using Z-score) or from probability/percentile to raw score (solving for x)

📏 What Z-scores are and how to calculate them

📏 Definition and formula

Z-score: The number of standard deviations an observation falls above or below the mean.

The formula for calculating a Z-score:

  • Z = (x - μ) / σ
  • Where x is the observation, μ (mu) is the mean, σ (sigma) is the standard deviation

➕ Interpreting the sign

  • Positive Z-score: observation is above the mean
  • Negative Z-score: observation is below the mean
  • Z-score of 0: observation equals the mean

Example: Ann scored 1300 on the SAT (mean 1100, SD 200). Her Z-score is Z = (1300 - 1100) / 200 = 1, meaning she scored 1 standard deviation above the mean.

🔍 Comparing unusual observations

An observation is more unusual than another if the absolute value of its Z-score is larger.

Example: For brushtail possum head lengths (mean 92.6 mm, SD 3.6 mm):

  • Possum with 95.4 mm: Z = 0.78
  • Possum with 85.8 mm: Z = -1.89
  • The second possum is more unusual because |-1.89| > |0.78|

🎯 Using Z-scores for comparisons

🎯 Comparing across different scales

Z-scores enable fair comparisons when observations come from different distributions.

Example: Ann scored 1300 on SAT (mean 1100, SD 200) versus Tom scored 24 on ACT (mean 21, SD 6).

  • Ann's Z-score: (1300 - 1100) / 200 = 1.0
  • Tom's Z-score: (24 - 21) / 6 = 0.5
  • Ann performed better relative to her peers because her Z-score is higher

Don't confuse: A higher raw score doesn't always mean better performance—you must account for the mean and standard deviation of each distribution.
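The comparison in R, using the numbers above:

```r
z_ann <- (1300 - 1100) / 200   # 1.0
z_tom <- (24 - 21) / 6         # 0.5
z_ann > z_tom                  # TRUE: Ann did better relative to her peers
```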

📊 Finding probabilities using Z-scores

📊 The process: always draw first, calculate second

Standard workflow:

  1. Draw and label a normal curve
  2. Shade the area of interest
  3. Calculate the Z-score
  4. Use software/calculator/table to find the area
  5. Interpret the result

🔢 Finding tail areas (percentiles)

To find what fraction scored below a value:

  1. Calculate Z-score
  2. Find the area to the left of that Z-score

Example: Shannon's SAT score of 1190 (mean 1100, SD 200):

  • Z = (1190 - 1100) / 200 = 0.45
  • Area to left = 0.6736
  • Shannon is at the 67th percentile

⬆️ Finding upper tail areas

Many programs return left-tail area. To find the right-tail area:

  • Calculate: 1 - (left tail area)

Example: Probability of scoring at least 1190:

  • Left tail area = 0.6736
  • Right tail area = 1 - 0.6736 = 0.3264

Don't confuse: "At least" means greater than or equal to (upper tail), while "less than" means lower tail.
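Both tail areas in R, using Shannon's Z-score:

```r
pnorm(0.45)       # left tail (percentile): 0.6736
1 - pnorm(0.45)   # right tail ("at least"): 0.3264
```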

🔄 Working backwards: from percentile to score

🔄 Finding the observation for a given percentile

When you know the percentile and want the raw score:

  1. Draw the picture with the known probability shaded
  2. Use software to find the Z-score for that percentile
  3. Solve the Z-score formula for x: x = μ + Z × σ

Example: Erik is at the 40th percentile for height (mean 70 inches, SD 3.3 inches):

  • Z-score for 40th percentile ≈ -0.25
  • Solve: -0.25 = (x - 70) / 3.3
  • x = 70 + (-0.25 × 3.3) = 69.18 inches
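In R, qnorm() goes from percentile to value directly (the small difference from 69.18 comes from rounding Z to −0.25 above):

```r
qnorm(0.40, mean = 70, sd = 3.3)   # ≈ 69.16 inches
```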

📐 Finding middle regions

To find probability between two values:

  1. Find the area below the lower value
  2. Find the area below the upper value
  3. Subtract: (upper area) - (lower area)

Or equivalently: 1 - (lower tail) - (upper tail)

Example: Probability a male adult is between 5'9" (69 inches) and 6'2" (74 inches):

  • Area below 69 inches = 0.3821
  • Area above 74 inches = 0.1131
  • Middle area = 1 - 0.3821 - 0.1131 = 0.5048
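The same middle area in R; the exact software answer differs slightly from 0.5048 because the text rounds the Z-scores:

```r
pnorm(74, mean = 70, sd = 3.3) - pnorm(69, mean = 70, sd = 3.3)   # ≈ 0.506
```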

📐 The 68-95-99.7 rule

📐 Quick estimation without calculation

68-95-99.7 rule: In a normal distribution, approximately 68% of observations fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.

| Distance from mean | Percentage within |
|---|---|
| μ ± 1σ | ~68% |
| μ ± 2σ | ~95% |
| μ ± 3σ | ~99.7% |
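The rule can be checked directly in R:

```r
pnorm(1) - pnorm(-1)   # 0.6827
pnorm(2) - pnorm(-2)   # 0.9545
pnorm(3) - pnorm(-3)   # 0.9973
```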

🎓 Practical application

Example: SAT scores with mean 1100 and SD 200:

  • About 95% score between 700 and 1500 (mean ± 2 SD)
  • About 47.5% score between 1100 and 1500 (half of the 95%)

Rare events: Being more than 4 SD from the mean is about 1-in-15,000; 5 SD is about 1-in-2 million.

💻 Tools for finding areas

💻 Three common methods

  1. Statistical software (most common in practice): Programs like R, Python, SAS, Excel, or Google Sheets

    • Example in R: pnorm(1) returns 0.8413447
    • Can specify mean and SD: pnorm(1300, mean = 1100, sd = 200)
  2. Graphing calculators: TI or Casio calculators with specific button sequences

  3. Probability tables: Found in appendices, occasionally used in classrooms but rarely in practice

Key principle: Always find the Z-score first, then use your chosen tool to find the area.


Foundations for Inference: Proportions and Sampling

9.4 Multiple regression case study: Mario Kart

🧭 Overview

🧠 One-sentence thesis

The Central Limit Theorem allows us to use the normal distribution to model sample proportions when independence and success-failure conditions are met, enabling us to construct confidence intervals and conduct hypothesis tests about population proportions.

📌 Key points (3–5)

  • When normal approximation works: Sample proportion ˆp can be modeled with a normal distribution if observations are independent and we expect at least 10 successes and 10 failures (success-failure condition).
  • Two types of inference: Confidence intervals estimate plausible ranges for population parameters; hypothesis tests evaluate claims by comparing p-values to significance levels.
  • Standard error calculation differs by context: For confidence intervals, use the sample proportion ˆp; for hypothesis tests, use the null value p₀ when checking conditions and computing standard error.
  • Common confusion: Failing to reject the null hypothesis does NOT mean accepting it as true—it only means insufficient evidence was found against it.
  • Practical vs statistical significance: Large samples can detect tiny differences that are statistically significant but have no real-world importance.

📊 Understanding sample proportions and their distributions

📊 What is a sample proportion

The sample proportion ˆp represents the fraction of observations in a sample that have a particular characteristic.

  • It serves as a point estimate for the true population proportion p
  • Example: If 887 out of 1000 surveyed adults support solar energy, then ˆp = 0.887
  • The sample proportion varies from sample to sample due to sampling variability

🎲 Sampling distribution behavior

When we take many samples and compute ˆp for each, these proportions form a sampling distribution.

Key properties when conditions are met:

  • Center: The mean equals the true population proportion p
  • Spread: Standard error = square root of [p(1-p)/n]
  • Shape: Approximately normal (bell-curved)

Don't confuse: The sampling distribution describes the behavior of ˆp across many samples, not the distribution of individual observations in one sample.

✅ Conditions for using the normal model

✅ Independence condition

Observations must be independent of each other.

How to verify:

  • Data come from a simple random sample, OR
  • Data come from a randomized experiment
  • Optional additional check: sample size is less than 10% of population (helps ensure independence when sampling without replacement)

✅ Success-failure condition

We need enough expected successes and failures.

For confidence intervals: Check that n·ˆp ≥ 10 AND n·(1-ˆp) ≥ 10

  • Use the sample proportion ˆp because we're estimating from observed data

For hypothesis tests: Check that n·p₀ ≥ 10 AND n·(1-p₀) ≥ 10

  • Use the null value p₀ because we're assuming the null hypothesis is true

Example: With n=1000 and p₀=0.5, we get 1000×0.5=500 successes and 500 failures expected, both well above 10.

🎯 Confidence intervals for proportions

🎯 General structure

A confidence interval provides a plausible range for the population parameter.

Formula: ˆp ± z* × SE

Where:

  • ˆp is the sample proportion
  • z* is the critical value (1.96 for 95% confidence, 2.58 for 99%, 1.65 for 90%)
  • SE = square root of [ˆp(1-ˆp)/n]
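A sketch in R using the solar-energy survey figures above:

```r
n    <- 1000
phat <- 887 / n
se   <- sqrt(phat * (1 - phat) / n)
phat + c(-1, 1) * qnorm(0.975) * se   # 95% CI ≈ (0.867, 0.907)
```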

🎯 What confidence level means

"95% confident" means: if we repeated the sampling process many times and built a 95% confidence interval from each sample, about 95% of those intervals would contain the true population proportion.

It does NOT mean:

  • There's a 95% probability the parameter is in this specific interval
  • 95% of individual observations fall in this range
  • Future sample proportions will fall in this interval

🎯 Margin of error

The margin of error is the "±" part: z* × SE

To reduce margin of error:

  • Increase sample size (most common approach)
  • Accept a lower confidence level (less desirable)

Example: To achieve margin of error of 0.04 with 95% confidence when p≈0.5, solve: 1.96 × square root[0.5×0.5/n] < 0.04, giving n > 600.25, so need at least 601 participants.
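The same sample-size calculation in R:

```r
n <- (qnorm(0.975) * 0.5 / 0.04)^2   # ≈ 600.2
ceiling(n)                           # 601 participants needed
```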

🧪 Hypothesis testing framework

🧪 Setting up hypotheses

Null hypothesis (H₀): Represents the skeptical position or "no effect" claim

Alternative hypothesis (Hₐ): Represents what we're looking for evidence of

Example for testing if proportion differs from 0.5:

  • H₀: p = 0.5
  • Hₐ: p ≠ 0.5

The null value (p₀) is the specific value claimed in H₀.

🧪 The p-value approach

p-value: The probability of observing data at least as extreme as our sample, if the null hypothesis were true.

Decision rule:

  • If p-value < α (significance level), reject H₀
  • If p-value ≥ α, do not reject H₀

Standard significance level: α = 0.05 (but can be adjusted based on context)

🧪 Computing the test statistic

Steps:

  1. Calculate standard error using the null value: SE = square root[p₀(1-p₀)/n]
  2. Compute Z-score: Z = (ˆp - p₀)/SE
  3. Find tail area(s) corresponding to this Z-score
  4. For two-sided tests, double the tail area to get p-value

Example: If ˆp=0.37, p₀=0.5, SE=0.016, then Z=(0.37-0.5)/0.016 = -8.125, giving an extremely small p-value.
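A sketch of these steps in R, assuming n = 1000 (consistent with the SE of about 0.016 quoted above; the text's Z = −8.125 uses the rounded SE):

```r
n    <- 1000
p0   <- 0.5
phat <- 0.37
se   <- sqrt(p0 * (1 - p0) / n)   # ≈ 0.0158
z    <- (phat - p0) / se          # ≈ -8.2
2 * pnorm(-abs(z))                # two-sided p-value, essentially 0
```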

⚠️ Understanding errors and significance

⚠️ Type 1 and Type 2 errors

| Truth | Test conclusion | Result |
|---|---|---|
| H₀ true | Reject H₀ | Type 1 Error (false positive) |
| H₀ true | Don't reject H₀ | Correct |
| Hₐ true | Reject H₀ | Correct |
| Hₐ true | Don't reject H₀ | Type 2 Error (false negative) |

Type 1 Error: Rejecting a true null hypothesis

  • Probability = α (significance level)
  • Example: Convicting an innocent person

Type 2 Error: Failing to reject a false null hypothesis

  • Example: Failing to convict a guilty person

Trade-off: Reducing one type of error generally increases the other.

⚠️ Choosing significance level

Default is α = 0.05, but adjust based on consequences:

Use smaller α (e.g., 0.01) when:

  • Type 1 Error is very costly or dangerous
  • You want to be very cautious about rejecting H₀

Use larger α (e.g., 0.10) when:

  • Type 2 Error is more costly than Type 1
  • Safety considerations favor being more sensitive to detecting effects

⚠️ Statistical vs practical significance

With very large samples, even tiny differences become statistically significant (p-value < 0.05) but may have no practical importance.

Example: An online experiment detects a "statistically significant" 0.001% increase in viewership, but this is too small to matter in practice.

🔄 Four-step procedure

🔄 Prepare

  • Identify the parameter of interest (usually p)
  • State hypotheses (for tests) or confidence level (for intervals)
  • Note the sample proportion ˆp and sample size n

🔄 Check

Verify conditions:

  • Independence: Random sample or randomized experiment
  • Success-failure: At least 10 expected successes and failures
    • Use ˆp for confidence intervals
    • Use p₀ for hypothesis tests

🔄 Calculate

  • Compute standard error (formula depends on context)
  • For confidence intervals: construct ˆp ± z*×SE
  • For hypothesis tests: compute Z-score and find p-value

🔄 Conclude

  • For confidence intervals: interpret in context ("We are 95% confident that...")
  • For hypothesis tests: compare p-value to α and state conclusion in context
  • Always relate back to the original question

📐 Difference of two proportions

📐 When to use

Comparing proportions from two independent groups.

Parameter of interest: p₁ − p₂

Point estimate: ˆp₁ − ˆp₂

📐 Conditions (extended)

Independence, extended: Data must be independent within each group AND between groups

  • Usually satisfied by two independent random samples or randomized experiment

Success-failure for both groups: Check separately for each group

  • At least 10 successes and 10 failures in group 1
  • At least 10 successes and 10 failures in group 2

📐 Standard error for two proportions

For confidence intervals: SE = square root of [ˆp₁(1-ˆp₁)/n₁ + ˆp₂(1-ˆp₂)/n₂]

For hypothesis tests when H₀ is p₁=p₂:

  • First compute pooled proportion: ˆp_pooled = (total successes)/(total observations)
  • Then: SE = square root of [ˆp_pooled(1-ˆp_pooled)/n₁ + ˆp_pooled(1-ˆp_pooled)/n₂]

Don't confuse: Use pooled proportion only when null hypothesis claims the proportions are equal.
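A minimal R sketch of the pooled test, with hypothetical counts (x1 successes out of n1, x2 out of n2):

```r
x1 <- 30; n1 <- 45   # hypothetical group 1 counts
x2 <- 35; n2 <- 45   # hypothetical group 2 counts
p_pool <- (x1 + x2) / (n1 + n2)
se <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
z  <- (x1 / n1 - x2 / n2) / se
2 * pnorm(-abs(z))   # two-sided p-value
```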

📐 Confidence interval formula

(ˆp₁ - ˆp₂) ± z* × SE

Example interpretation: "We are 95% confident that treatment increases survival rate by between -2.6% and +28.6% compared to control."

  • If the interval contains zero: insufficient evidence of a difference
  • If the interval is entirely above zero: evidence the treatment helps
  • If the interval is entirely below zero: evidence the treatment harms

🎲 Chi-square goodness of fit tests

🎲 When to use chi-square

Two main scenarios:

  1. Testing if a sample is representative of a population (multiple categories)
  2. Testing if data follow a particular distribution

Example: Do jury members' racial composition match the population of registered voters?

🎲 Chi-square test statistic

Formula: X² = sum of [(observed count - expected count)² / expected count]

For each category:

  • Compute expected count (usually: population proportion × sample size)
  • Find difference from observed count
  • Square the difference
  • Divide by expected count
  • Sum across all categories

The X² value summarizes how much observed counts deviate from expected counts.
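A sketch of the calculation in R; the counts come from the textbook's juror example (they do not appear in this excerpt) and reproduce the X² = 5.89 result discussed later in these notes:

```r
observed <- c(205, 26, 25, 19)                # jurors: White, Black, Hispanic, other
expected <- 275 * c(0.72, 0.07, 0.12, 0.09)   # voter proportion × sample size
x2 <- sum((observed - expected)^2 / expected) # 5.89
pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)   # ≈ 0.117
```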

🎲 Chi-square distribution properties

  • Always positive (since it's a sum of squared values)
  • Right-skewed, especially for small degrees of freedom
  • One parameter: degrees of freedom (df)

As df increases:

  • Distribution becomes more symmetric
  • Center moves right (mean = df)
  • Spread increases

🎲 Finding p-values

Compare the computed X² statistic to the chi-square distribution with appropriate df.

The p-value is the upper tail area beyond the observed X² value.

  • Large X² → small p-value → evidence against H₀
  • Small X² → large p-value → insufficient evidence against H₀

📋 Practical considerations

📋 Sample size planning

To achieve desired margin of error:

  1. Choose confidence level (determines z*)
  2. Estimate p (use 0.5 if unknown—gives largest needed sample)
  3. Solve: z* × square root[p(1-p)/n] = desired margin of error
  4. Always round UP for sample size

Example: For margin of error 0.04 with 95% confidence and p≈0.5, need n ≥ 601.

📋 When conditions aren't met

If success-failure condition fails:

  • For hypothesis tests: use simulation methods
  • For confidence intervals: use Clopper-Pearson interval (advanced method)

If independence fails:

  • Understand why (cluster sampling, convenience sample, etc.)
  • May need specialized methods beyond this text
  • Convenience samples generally cannot be corrected statistically

📋 Interpreting results carefully

Remember:

  • Confidence intervals describe parameters, not individual observations or future samples
  • "Not rejecting H₀" ≠ "accepting H₀ as true"
  • Statistical significance ≠ practical importance
  • Methods address sampling error, not bias from poor data collection

Chi-square distribution and finding areas

9.5 Introduction to logistic regression

🧭 Overview

🧠 One-sentence thesis

The chi-square distribution provides the mathematical foundation for testing whether observed categorical data patterns differ from what we would expect by chance alone, using a test statistic that follows a known distribution when the null hypothesis is true.

📌 Key points (3–5)

  • What the chi-square distribution is: A right-skewed distribution with one parameter (degrees of freedom) used to characterize always-positive statistics
  • How shape changes with df: As degrees of freedom increase, the distribution becomes more symmetric, the center moves right, and variability increases
  • When to use it: If the null hypothesis is true in goodness-of-fit tests, the X² statistic follows a chi-square distribution with k−1 degrees of freedom (k = number of categories)
  • Sample size requirement: Each expected count must be at least 5 to safely apply the chi-square distribution
  • Common confusion: Larger chi-square values provide stronger evidence against the null hypothesis, so we always use the upper tail for p-values

📊 Understanding the chi-square distribution

📊 What it is and its single parameter

Chi-square distribution: A distribution sometimes used to characterize data sets and statistics that are always positive and typically right skewed, with just one parameter called degrees of freedom (df).

  • Unlike the normal distribution (which has mean and standard deviation), the chi-square has only degrees of freedom
  • The df parameter influences the shape, center, and spread of the distribution
  • This distribution is specifically designed for statistics that cannot be negative

📈 How the distribution changes with degrees of freedom

Three general properties emerge as df increases:

| Property | What happens as df increases |
|---|---|
| Shape | Becomes more symmetric (less skewed) |
| Center | Moves to the right (mean = df) |
| Variability | Increases (spread becomes larger) |

Example: With df = 2, the distribution is very strongly skewed. With df = 4 or df = 9, distributions become more symmetric.

🔍 Finding areas under the chi-square curve

🔍 Methods for calculating tail areas

Three common approaches:

  • Using computer software
  • Using a graphing calculator
  • Using a chi-square table (Appendix C.3)

Don't confuse with: Normal distribution lookups—chi-square tables work differently because the distribution shape changes with each df value.

🎯 Working through examples

The excerpt provides several worked examples:

Example with df = 3, cutoff at 6.25: The upper tail area is 0.1001 (about 10%)

Example with df = 2, cutoff at 4.3: The tail area is 0.1165 (between 0.1 and 0.2 if using a table)

Example with df = 5, cutoff at 5.1: The tail area is 0.4038 (larger than 0.3 if using a table)

Key pattern: We always look at the upper tail because larger chi-square values indicate stronger evidence against the null hypothesis.
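In R, these upper-tail areas come from pchisq with lower.tail = FALSE:

```r
pchisq(6.25, df = 3, lower.tail = FALSE)   # 0.1001
pchisq(4.30, df = 2, lower.tail = FALSE)   # 0.1165
pchisq(5.10, df = 5, lower.tail = FALSE)   # 0.4038
```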

🧪 Applying chi-square to hypothesis testing

🧪 The test statistic X²

When testing goodness of fit, the test statistic is:

X² = (O₁ − E₁)² / E₁ + (O₂ − E₂)² / E₂ + ... + (Oₖ − Eₖ)² / Eₖ

Where:

  • O represents observed counts in each category
  • E represents expected counts under the null hypothesis
  • k is the number of categories

🔑 Degrees of freedom for goodness-of-fit tests

Degrees of freedom for chi-square test: When testing k categories, df = k − 1

Example: In the juror example with 4 racial categories (White, Black, Hispanic, other), df = 4 − 1 = 3.

Why k−1 and not k: This reflects the constraint that all counts must sum to the total sample size.

✅ Conditions that must be checked

Two essential conditions before performing a chi-square test:

  1. Independence: Each case contributing a count must be independent of all other cases
  2. Sample size/distribution: Each cell must have at least 5 expected cases

Important exception: When examining a table with just two bins, use one-proportion methods instead (from Section 6.1).

📉 Interpreting p-values from chi-square tests

The p-value represents the upper tail area of the chi-square distribution.

Example: In the juror case with X² = 5.89 and df = 3, the p-value is 0.1171. Since this is larger than typical significance levels (like 0.05), we do not reject the null hypothesis—the data don't provide convincing evidence of racial bias in juror selection.

Why upper tail only: Larger X² values correspond to greater differences between observed and expected counts, providing stronger evidence against the null hypothesis.

🔬 Real-world application: Stock market independence

🔬 Testing if trading days are independent

The excerpt examines whether daily stock returns from the S&P 500 show independence using waiting times until "Up" days.

Setup:

  • Label each day as Up or Down
  • Count days until each Up day occurs
  • If days are independent, waiting times should follow a geometric distribution

Null hypothesis: Stock market being up or down on a given day is independent from all other days (waiting times follow geometric distribution)

Alternative hypothesis: Days are not independent

📊 Expected vs observed counts

For 1,362 waiting time observations:

| Days waited | Observed | Expected (Geometric) |
|---|---|---|
| 1 | 717 | 743 |
| 2 | 369 | 338 |
| 3 | 155 | 154 |
| … | … | … |
| 7+ | 10 | 12 |

The calculated chi-square statistic: X² = 4.61 with df = 6, giving p-value = 0.5951

Conclusion: Cannot reject the notion that trading days are independent—no strong evidence that the market is "due" for a correction after down days.

Practical implication: The analysis suggests any dependence between days is very weak, contradicting the common belief that markets are "due" for reversals.

🎲 Two-way tables vs one-way tables

🎲 Key distinction

One-way table: Describes counts for each outcome in a single variable

Two-way table: Describes counts for combinations of outcomes for two variables

When analyzing two-way tables, the central question becomes: Are these variables related (dependent) or unrelated (independent)?

Don't confuse: The mechanics are similar, but two-way tables test relationships between variables, while one-way tables test whether a single variable follows a specified distribution.

Chi-square distribution and finding areas

🧭 Overview

🧠 One-sentence thesis

The chi-square distribution enables us to test whether observed patterns in categorical data differ meaningfully from expected patterns by providing a probability model for the test statistic when the null hypothesis is true.

📌 Key points (3–5)

  • What it measures: The chi-square distribution characterizes always-positive, typically right-skewed statistics using a single parameter (degrees of freedom)
  • How it changes: As degrees of freedom increase, the distribution becomes more symmetric, shifts right, and becomes more variable
  • When to apply it: Use chi-square when testing goodness of fit with k categories (df = k−1) or independence in two-way tables (df = (rows−1)×(columns−1))
  • Critical condition: Each expected count must be at least 5 to safely use the chi-square distribution
  • Common confusion: We always examine the upper tail for p-values because larger X² values indicate stronger evidence against the null hypothesis, not both tails like some other tests

📐 The chi-square distribution fundamentals

📐 Definition and single parameter

Chi-square distribution: A distribution used to characterize data sets and statistics that are always positive and typically right skewed, having just one parameter called degrees of freedom (df).

  • Contrasts with normal distribution, which requires two parameters (mean and standard deviation)
  • The df parameter alone determines the distribution's shape, center, and spread
  • Used primarily for calculating p-values in categorical data analysis

📈 How distribution properties change with df

Three systematic changes occur as degrees of freedom increase:

Shape: The distribution starts very strongly skewed (df = 2) and becomes progressively more symmetric with larger df values (df = 4, df = 9, and beyond)

Center: The mean of each chi-square distribution equals its degrees of freedom—so the center moves rightward as df increases

Variability: The spread (variability) inflates as degrees of freedom increase

Example: A chi-square distribution with df = 2 is extremely right-skewed, while one with df = 9 appears much more bell-shaped and symmetric.

🔢 Computing tail areas

🔢 Three methods available

The excerpt describes three approaches for finding areas:

  1. Statistical software (most precise)
  2. Graphing calculator
  3. Chi-square table (Appendix C.3, gives ranges rather than exact values)

Practical note: With tables, you can only identify a range (e.g., "between 0.1 and 0.2"), while software provides exact values.

🎯 Worked examples of area calculations

Example 1: Chi-square with df = 3, upper tail starting at 6.25

  • Result: Shaded area = 0.1001 (about 10%)

Example 2: Chi-square with df = 2, upper tail bound at 4.3

  • Result: Tail area = 0.1165
  • Using table: between 0.1 and 0.2

Example 3: Chi-square with df = 5, cutoff at 5.1

  • Result: Tail area = 0.4038
  • Using table: larger than 0.3

Example 4: Chi-square with df = 7, cutoff at 11.7

  • Result: Area = 0.1109
  • Using table: between 0.1 and 0.2

Example 5: Chi-square with df = 4, cutoff at 10

  • Result: Precise value = 0.0404
  • Using table: between 0.02 and 0.05

🧮 The chi-square test for goodness of fit

🧮 The test statistic formula

When evaluating whether observed counts O₁, O₂, ..., Oₖ in k categories differ unusually from expected counts E₁, E₂, ..., Eₖ:

X² = (O₁ − E₁)² / E₁ + (O₂ − E₂)² / E₂ + ... + (Oₖ − Eₖ)² / Eₖ

When this works: If each expected count is at least 5 and the null hypothesis is true, this statistic follows a chi-square distribution with k−1 degrees of freedom.

Why we square differences: Squaring ensures all contributions are positive and gives more weight to larger deviations (being off by 4 is more than twice as bad as being off by 2).

🎲 Degrees of freedom calculation

Degrees of freedom for one-way table: df = k − 1, where k is the number of categories or bins

Example: The juror example examined 4 racial categories (White, Black, Hispanic, other), so df = 4 − 1 = 3.

Don't confuse: This is different from two-way tables, where df = (number of rows − 1) × (number of columns − 1).

✅ Required conditions

Two conditions must be verified:

Independence: Each case contributing a count to the table must be independent of all other cases in the table

Sample size/distribution: Each particular cell count must have at least 5 expected cases

Special case: When examining a table with just two bins, use the one-proportion methods from Section 6.1 instead.

Consequence of violating conditions: Failing to check may affect the test's error rates (Type I and Type II errors).

📊 Finding and interpreting p-values

The p-value comes from the upper tail of the chi-square distribution.

Why upper tail: Larger chi-square values indicate greater discrepancies between observed and expected counts, providing stronger evidence against the null hypothesis.

Example: In the juror analysis:

  • X² = 5.89 with df = 3
  • P-value = 0.1171 (the area in the upper tail beyond 5.89)
  • Interpretation: If there truly was no racial bias, the probability of observing a test statistic this large or larger is about 11.71%
  • Conclusion: Since p-value > 0.05, we do not reject the null hypothesis—insufficient evidence of racial bias

📈 Real application: Testing stock market independence

📈 The research question

Can we determine if daily stock market movements are independent using S&P 500 data from 10 years?

Method: Examine waiting times until positive trading days

  • Each "Up" day = success
  • Each "Down" day = failure
  • If days are independent, waiting times should follow a geometric distribution

🔬 Setting up the test

Null hypothesis (H₀): The stock market being up or down on a given day is independent from all other days; waiting times follow a geometric distribution

Alternative hypothesis (Hₐ): Days are not independent; waiting times do not follow a geometric distribution

Why this matters: If past days predict future days, traders could gain an advantage.

📉 Computing expected counts

For geometric distribution with success probability 0.545:

Method:

  1. Identify probability of waiting D days: P(D) = (1 − 0.545)^(D−1) × (0.545)
  2. Multiply by total number of observations (1,362)

Example: Waiting 3 days occurs about 0.455² × 0.545 = 11.28% of the time, corresponding to 0.1128 × 1,362 = 154 expected occurrences.
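A sketch of these expected counts in R using dgeom, which counts failures before the first success, so waiting D days corresponds to D − 1 failures (the day-1 value prints as ≈ 742 because the table's 743 comes from an unrounded success probability):

```r
p <- 0.545
n <- 1362
n * dgeom(0:2, prob = p)   # expected counts for waiting 1, 2, 3 days: ≈ 742, 338, 154
```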

🧪 Test results

Observed vs expected counts for 1,362 waiting periods:

| Days | Observed | Expected |
|---|---|---|
| 1 | 717 | 743 |
| 2 | 369 | 338 |
| 3 | 155 | 154 |
| 4 | 69 | 70 |
| … | … | … |
| 7+ | 10 | 12 |

Calculation: X² = (717−743)²/743 + (369−338)²/338 + ... + (10−12)²/12 = 4.61

Degrees of freedom: k = 7 groups, so df = 7 − 1 = 6

P-value: 0.5951 (from chi-square distribution with df = 6)
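The p-value in R:

```r
pchisq(4.61, df = 6, lower.tail = FALSE)   # ≈ 0.595
```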

💡 Conclusion and practical meaning

Since p-value (0.5951) > 0.05, we do not reject H₀.

What this means: Cannot reject the notion that trading days are independent during the last 10 years of data.

Practical implication: The market is not "due" for an Up day after several Down days. Any dependence between days is very weak. This analysis suggests that patterns traders think they see may just be chance.

Important caveat: Not rejecting H₀ doesn't prove independence—it just means we lack strong evidence of dependence.

🔀 Two-way tables: Testing independence between variables

🔀 One-way vs two-way distinction

One-way table: Describes counts for each outcome in a single variable

Two-way table: Describes counts for combinations of outcomes for two variables

Key question for two-way tables: Are the variables related (dependent) or unrelated (independent)?

🧮 Computing expected counts in two-way tables

Formula for expected count: Expected Count(row i, col j) = (row i total) × (column j total) / table total

Example: For the iPod disclosure study with 219 participants:

  • Row 1 total (Disclose): 61
  • Column 1 total (General question): 73
  • Table total: 219
  • Expected count = (61 × 73) / 219 = 20.33

Logic: If variables are independent, we'd expect the same proportion in each column to fall in each row.

🎲 Degrees of freedom for two-way tables

Formula: df = (R − 1) × (C − 1), where R = number of rows and C = number of columns

Example: The iPod study had 2 rows (Disclose/Hide) and 3 columns (three question types), so df = (2−1) × (3−1) = 2

Don't confuse: This is different from one-way tables where df = k − 1. The multiplication accounts for testing independence between two dimensions.

Special guideline: When analyzing 2-by-2 contingency tables, use the two-proportion methods from Section 6.2 instead.

📊 The iPod disclosure experiment

Context: Researchers wanted to know which questions get sellers to disclose product problems.

Setup: 219 participants sold an iPod known to have frozen twice. Three scripted questions:

  • General: "What can you tell me about it?"
  • Positive Assumption: "It doesn't have any problems, does it?"
  • Negative Assumption: "What problems does it have?"

Results:

| Question type | Disclosed | Hid problem | Total |
|---|---|---|---|
| General | 2 | 71 | 73 |
| Positive Assumption | 23 | 50 | 73 |
| Negative Assumption | 36 | 37 | 73 |
| Total | 61 | 158 | 219 |

Test statistic: X² = 40.13 with df = 2

P-value: Extremely small (about 0.000000002)

Conclusion: Strong evidence that the question asked affected seller's likelihood to disclose the freezing problem. The "What problems does it have?" question was most effective.
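A sketch reproducing this test in R from the table above (chisq.test applies no continuity correction for tables larger than 2×2):

```r
ipod <- matrix(c(2, 71, 23, 50, 36, 37), nrow = 2,
               dimnames = list(c("Disclose", "Hide"),
                               c("General", "Positive", "Negative")))
chisq.test(ipod)$expected[1, 1]   # (61 × 73) / 219 ≈ 20.33
chisq.test(ipod)                  # X-squared ≈ 40.13, df = 2, p ≈ 2e-09
```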

🏥 Two-way table application: Diabetes treatments

🏥 The study design

Experiment compared three treatments for Type 2 Diabetes in patients aged 10-17:

  • Continued metformin (met)
  • Metformin + rosiglitazone (rosi)
  • Lifestyle intervention program

Outcome: Whether patient lacked glycemic control (failure) or maintained control (success)

🧪 Hypotheses

H₀: There is no difference in effectiveness of the three treatments

Hₐ: There is some difference in effectiveness between treatments (e.g., perhaps rosi performed better than lifestyle)

📊 Computing expected counts

Total results for 699 patients:

| Treatment | Failure | Success | Total |
|---|---|---|---|
| lifestyle | 109 | 125 | 234 |
| met | 120 | 112 | 232 |
| rosi | 90 | 143 | 233 |
| Total | 319 | 380 | 699 |

Example calculation: Expected count for row 1, column 1: (234 × 319) / 699 = 106.8

All expected counts:

  • Row 1: 106.8, 127.2
  • Row 2: 105.9, 126.1
  • Row 3: 106.3, 126.7

All exceed 5, so the sample size condition is met.

🔍 Test results

Test statistic: X² = 8.16 with df = (3−1) × (2−1) = 2

P-value: 0.017

Conclusion: Since p-value < 0.05, reject H₀. At least one treatment is more or less effective than the others at treating Type 2 Diabetes for glycemic control.

What we can't conclude: The test doesn't tell us which specific treatments differ—only that not all are equally effective.
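A sketch reproducing the test in R from the counts above:

```r
diabetes <- matrix(c(109, 120, 90, 125, 112, 143), nrow = 3,
                   dimnames = list(c("lifestyle", "met", "rosi"),
                                   c("Failure", "Success")))
chisq.test(diabetes)   # X-squared ≈ 8.16, df = 2, p-value ≈ 0.017
```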
