Significant Statistics

1.1 Introduction to Statistics

🧭 Overview

🧠 One-sentence thesis

Statistics translates data into knowledge by collecting, analyzing, and interpreting information to help people make educated decisions in everyday life and professional contexts.

📌 Key points (3–5)

  • What statistics does: organizes, summarizes, and draws conclusions from data through descriptive and inferential methods.
  • Two main branches: descriptive statistics (organizing and summarizing) vs. inferential statistics (drawing conclusions using probability).
  • Population vs. sample: studying an entire population is often impractical, so we select samples and use statistics to estimate parameters.
  • Common confusion: statistic vs. parameter—a statistic describes a sample; a parameter describes the whole population.
  • Why it matters: statistical methods help evaluate claims, make informed decisions, and determine confidence in conclusions.

📊 What statistics is and why we need it

📊 The core purpose

"Statistics' ultimate goal is translating data into knowledge." – Alan Agresti & Christine Franklin

  • Statistics appears everywhere: news reports, weather forecasts, education, crime data, sports, real estate, and more.
  • When you encounter sample information in media, statistical methods help you evaluate whether claims are correct.
  • Example: deciding whether to buy a house, manage a budget, or trust a news report all involve analyzing statistical information.

🎓 Practical applications

  • Many professions require statistical knowledge: economics, business, psychology, education, biology, law, computer science, police science, and early childhood development.
  • The goal is not to perform endless calculations but to interpret data to gain understanding.
  • Calculations can be done by calculators or computers; the understanding must come from you.

🔍 The two branches of statistics

🔍 Descriptive statistics

Descriptive statistics: organizing, summarizing, and presenting data.

  • This is the foundation—learning how to organize and summarize data first.
  • Data can be summarized with graphs or with numbers (e.g., finding an average).
  • Example: calculating the average grade in one class is descriptive.

🔬 Inferential statistics

Inferential statistics: formal methods for drawing useful conclusions from data while filtering out noise.

  • After studying probability and probability distributions, you use these formal methods.
  • Effective inference depends on good data collection procedures and thoughtful examination.
  • Statistical inference uses probability to determine how confident you can be that your conclusions are correct.
  • Example: using a sample average to test whether a claim about the entire population is valid.

🎲 Probability and randomness

🎲 What probability measures

Probability: a mathematical tool used to study randomness; it deals with the chance (likelihood) of an event occurring.

  • Individual outcomes are uncertain, but a regular pattern emerges with many repetitions.
  • Example: tossing a fair coin four times may not yield two heads and two tails, but tossing it 4,000 times will produce results close to half heads and half tails.
  • The expected theoretical probability of heads in one toss is one-half or 0.5.
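
The long-run pattern described above can be sketched with a short simulation (illustrative only; the seed and toss counts are arbitrary choices):

```python
import random

def heads_fraction(n_tosses, seed=0):
    """Simulate n_tosses fair-coin flips and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# A handful of tosses varies widely; thousands of tosses settle
# near the theoretical probability of 0.5.
print(heads_fraction(4))
print(heads_fraction(4000))
```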

🎯 Real-world uses

  • Predictions take the form of probabilities: likelihood of an earthquake, rain, getting an A in a course.
  • Doctors use probability to assess medical test accuracy.
  • Stockbrokers use it to determine investment returns.
  • You might use it to decide whether to buy a lottery ticket.

📖 Historical note

  • Probability theory began with studying games of chance like poker.
  • Example: Karl Pearson tossed a coin 24,000 times and got 12,012 heads; another researcher tossed 2,000 times and got 996 heads (fraction 996/2000 = 0.498, very close to 0.5).

🔑 Key terminology

🔑 Population and sample

Population: a collection of people or things under study.

Sample: a portion (or subset) of the larger population selected for study.

  • Examining an entire population takes great resources (time, money, manpower), so we often study only a sample.
  • Example: to compute overall GPA at a school, select a sample of students rather than surveying every single student.
  • Example: presidential opinion polls sample 1,000–2,000 people to represent the entire country's population.
  • To be a representative sample, the sample must reflect the characteristics of the population.

🔑 Parameter and statistic

Parameter: a number that describes a characteristic of the population.

Statistic: a number that represents a property of the sample.

  • A statistic is an estimate of a population parameter.
  • Example: the average points earned by students in one math class (sample) is a statistic; the average across all math classes (population) is a parameter.
  • Don't confuse: statistic = sample property; parameter = population property.
| Term | What it describes | Example from excerpt |
| --- | --- | --- |
| Parameter | Characteristic of the whole population | Average points across all math classes |
| Statistic | Characteristic of the sample | Average points in one math class |
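
The statistic-estimates-parameter idea can be sketched with simulated data (the population values below are hypothetical, generated only for illustration):

```python
import random

random.seed(1)
# Hypothetical population: points earned by students across all math classes.
population = [random.gauss(75, 10) for _ in range(5000)]
parameter = sum(population) / len(population)   # population mean: the parameter

one_class = random.sample(population, 30)       # one class serves as the sample
statistic = sum(one_class) / len(one_class)     # sample mean: the statistic

# The statistic is an estimate of the (usually unknown) parameter.
print(round(parameter, 1), round(statistic, 1))
```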

🔑 Individuals, variables, and data

Individuals: the units about which we are collecting information (could be a person, animal, thing, or place).

Variable: a specific characteristic or measurement that can be determined for each individual (usually represented by capital letters like X or Y).

Values: the possible observations of the variable.

Data: the actual values of the variables of interest (may be numbers or words).

  • If multiple variables are collected on an individual, the entire set may be called a case or observational unit.

📝 Example walkthrough

Study: We want to know the average amount of money first-year college students spend at ABC College on school supplies (excluding books). We randomly survey 100 first-year students. Three students spent $150, $200, and $225.

  • Population: all first-year students at ABC College.
  • Sample: the 100 students surveyed.
  • Variable: amount of money spent on school supplies.
  • Data: the actual values—$150, $200, $225, etc.
  • Statistic: the average calculated from the 100 students.
  • Parameter: the true average for all first-year students (unknown, estimated by the statistic).

🧮 The data analysis process

🧮 Four phases

The data analysis process consists of four phases:

1. Identify the research objective

  • What questions are to be answered?
  • What group should be studied?
  • Have attempts been made to answer it before?

2. Collect the information needed

  • Is data already available?
  • Can you access the entire population?
  • How can you collect a good sample?

3. Organize and summarize the information

  • What visual descriptive techniques are appropriate?
  • What numerical descriptive techniques are appropriate?
  • What aspects of the data stick out?

4. Draw conclusions from the information

  • What inferential techniques are appropriate?
  • What conclusions can be drawn?

🎯 The main concern

  • One of the main concerns in statistics is how accurately a statistic estimates a parameter.
  • Accuracy depends on how well the sample represents the population.
  • We are interested in both sample statistics and population parameters in inferential statistics.

1.2 Data Basics

🧭 Overview

🧠 One-sentence thesis

Data can be classified by type (qualitative vs. quantitative), level of measurement (nominal, ordinal, interval, ratio), and collection method (anecdotal, observational, experimental), with each classification determining what statistical operations are valid and whether causal conclusions can be drawn.

📌 Key points (3–5)

  • Two main data types: qualitative (categorical, described with words) vs. quantitative (numerical, from counting or measuring).
  • Quantitative splits into discrete and continuous: discrete comes from counting (whole numbers only); continuous comes from measuring (any value on an interval).
  • Four levels of measurement: nominal (no order), ordinal (ordered but differences not meaningful), interval (ordered with meaningful differences but no true zero), ratio (ordered with meaningful differences and true zero allowing ratios).
  • Common confusion: temperature scales (Celsius/Fahrenheit) are interval, not ratio—0° is arbitrary, so you cannot say 80° is "four times as hot" as 20°.
  • Data collection methods matter for causality: observational studies show associations but cannot prove causation due to confounding variables; only controlled experiments can establish causal relationships.

📊 Types of Data

📝 Qualitative (Categorical) Data

Qualitative or categorical data: data that can generally be described with words or letters.

  • Examples: hair color, blood type, ethnic group, car brand, street names.
  • Cannot perform mathematical operations on categorical data (e.g., you cannot calculate an "average party affiliation").
  • Example: A person's car type (Jaguar, Toyota, Honda) is categorical because it is described using words, not numbers.

🔢 Quantitative (Numerical) Data

Quantitative data: data that always takes the form of numbers, typically from counting or measuring.

  • Examples: amount of money, pulse rate, weight, number of students.
  • Mathematical operations (like calculating averages) make sense for quantitative data.
  • Splits into two subtypes: discrete and continuous.

🎯 Discrete vs. Continuous

| Type | Definition | Source | Examples |
| --- | --- | --- | --- |
| Discrete | Takes on only certain numerical values | Counting | Number of phone calls (0, 1, 2, 3), number of students, cans of soup (3 cans) |
| Continuous | All possible values on an interval (real numbers) | Measuring | Length of phone call in minutes, weight of soup (19 ounces, 14.1 ounces), duration |

  • Don't confuse: discrete is about counting (whole units), continuous is about measuring (can have decimals and any value in between).
  • Example: In a supermarket purchase, three cans of soup is discrete (you count cans), but 19 ounces of soup weight is continuous (you measure weight as precisely as possible).

📏 Levels of Measurement

🏷️ Nominal Scale

Nominal scale: categorical data where categories have no natural order.

  • Examples: colors, names, labels, favorite foods, yes/no responses, smartphone brands.
  • No agreed-upon ranking is possible—putting pizza first and sushi second creates no meaningful order.
  • Cannot be used in calculations.

📶 Ordinal Scale

Ordinal scale: similar to nominal but categories can be ordered.

  • Examples: top five national parks ranked 1–5, cruise survey responses ("excellent," "good," "satisfactory," "unsatisfactory").
  • Key difference from nominal: there is an order.
  • Key limitation: differences between values cannot be quantified—we cannot measure how much better "excellent" is than "good."
  • Cannot be used in calculations.

🌡️ Interval Scale

Interval scale: data with definite order and meaningful differences between values, but from an arbitrary starting point.

  • Examples: temperature in Celsius or Fahrenheit.
  • Differences make sense: 40° equals 100° minus 60°.
  • Critical limitation: no true zero—0°F and 0°C do not represent "no temperature," and negative values exist.
  • Can be used in calculations for differences, but ratios are meaningless: 80°C is not "four times as hot" as 20°C.
  • Don't confuse: interval has meaningful differences but no meaningful ratios.

📐 Ratio Scale

Ratio scale: data with definite order, meaningful differences, a true zero point, and meaningful ratios.

  • Examples: exam scores (out of 100 points), where 0 is the lowest possible score.
  • Example: Scores of 20, 68, 80, 92 can be ordered, differences are meaningful (92 is 24 points more than 68), and ratios work (80 is four times 20).
  • Gives the most information and allows all types of calculations.
  • The true zero makes ratios meaningful—this is what separates ratio from interval.
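
The interval-vs-ratio distinction can be checked numerically: differences between temperatures survive a change of scale, but ratios do not, because 0° is an arbitrary point. A small sketch:

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Differences are meaningful on an interval scale: the same physical gap
# appears on both scales (40 Celsius degrees = 72 Fahrenheit degrees).
diff_c = 100 - 60
diff_f = c_to_f(100) - c_to_f(60)

# Ratios are not: 80C / 20C = 4, but the same two temperatures in
# Fahrenheit give 176 / 68, roughly 2.59 -- the "four times as hot"
# claim depends on where the arbitrary zero sits.
ratio_c = 80 / 20
ratio_f = c_to_f(80) / c_to_f(20)

print(diff_c, diff_f, ratio_c, round(ratio_f, 2))
```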

⚠️ Important Note

You may collect data as numbers but report it categorically. Example: quiz scores recorded as numbers throughout the term but reported as letter grades (A, B, C, D, F) at the end.

🔄 Variation and Analysis

📉 Variation in Data

Variation: natural differences present in any set of data.

  • Example: Eight 16-ounce beverage cans measured as 15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5 ounces.
  • Causes: different people taking measurements, inexact filling processes.
  • Natural expectation: your data may vary somewhat from someone else's data for the same purpose.
  • Warning sign: if multiple people get very different results for the same measurement, re-evaluate methods and accuracy.
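
The eight can measurements above can be summarized numerically; a minimal sketch using the standard library:

```python
from statistics import mean, stdev

# Measured contents (ounces) of eight nominally 16-ounce cans,
# from the example above.
cans = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

print(f"mean  = {mean(cans):.4f} oz")
print(f"stdev = {stdev(cans):.4f} oz")
print(f"range = {max(cans) - min(cans):.1f} oz")  # from 14.8 up to 16.1
```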

🔍 Data Analysis Process

Data analysis: the process of collecting, organizing, and analyzing data.

Four phases with key questions:

  1. Identify the research objective

    • What questions need answers?
    • What group should be studied?
    • Have previous attempts been made?
  2. Collect the information needed

    • Is data already available?
    • Can you access the entire population?
    • How can you collect a good sample?
  3. Organize and summarize the information

    • What visual descriptive techniques are appropriate?
    • What numerical descriptive techniques are appropriate?
    • What aspects of the data stand out?
  4. Draw conclusions from the information

    • What inferential techniques are appropriate?
    • What conclusions can be drawn?

🔬 Data Collection Methods

🗣️ Anecdotal Evidence

  • Definition: data collected in a haphazard fashion from one or a few cases.
  • Examples:
    • "Two students took over seven years to graduate, so Duke takes longer than other colleges."
    • "A man had an adverse vaccine reaction, so vaccines must be dangerous."
  • Two major problems:
    1. Only represents one or two cases.
    2. Unclear whether cases are representative of the population.
  • Warning: we tend to remember striking or unusual cases more than typical ones.
  • Don't confuse: anecdotal evidence may be true and verifiable, but it may only represent extraordinary cases, not the broader population.

👁️ Observational Studies

Observational study: researchers collect data without directly interfering with how the data arises.

  • Methods: questionnaires, surveys, reviewing records, following groups of individuals.
  • Key limitation: can show associations between variables but cannot by themselves show causal connections.
  • Example: A study finds more sunscreen use correlates with more skin cancer. Does sunscreen cause cancer? No—sun exposure is a confounding variable (people in the sun more use more sunscreen but also get more cancer).

🔀 Confounding Variables

Confounding (lurking/conditional) variable: a variable not accounted for that may actually be important.

  • Can cause misleading or counterintuitive correlations.
  • Why observational studies cannot prove causation: unaccounted confounding variables may be the true cause.

📅 Prospective vs. Retrospective Studies

| Type | Timing | Example |
| --- | --- | --- |
| Prospective | Identifies individuals and collects information as events unfold | Nurses' Health Study (started 1976, follows nurses over time using questionnaires) |
| Retrospective | Collects data after events have taken place | Reviewing past medical records |

  • Some datasets contain both prospective and retrospective variables.

🏥 Other Study Classifications

  • Cohort study: follows a group of many similar individuals over time (often produces longitudinal data).
  • Cross-sectional study: data collection on a population at one point in time (often prospective).
  • Case-control study: compares a group with a certain characteristic to a group without it (often retrospective for rare conditions).

Example: A researcher studies the relationship between study time in medical school and depression. She reviews graduated students' medical records (retrospective) and sends questionnaires about study time (prospective). This is both a prospective and retrospective observational study.

🧪 Designed (Controlled) Experiments

  • The excerpt introduces experiments as distinct from observational studies but does not provide details in this section.
  • Key distinction: making causal conclusions based on experiments is often reasonable if we control for confounding factors.
  • Observational studies cannot make causal claims; experiments can (when properly designed).

1.3 Data Collection and Observational Studies

🧭 Overview

🧠 One-sentence thesis

Observational studies can reveal associations between variables but cannot by themselves establish causation because confounding variables may explain the observed relationships.

📌 Key points (3–5)

  • Three main data collection methods: anecdotal evidence (unreliable), observational studies (show associations), and designed experiments (can show causation).
  • Observational studies do not interfere: researchers collect data without manipulating variables, so they can only show naturally occurring associations, not causal connections.
  • Confounding variables are the key problem: a lurking variable not accounted for may actually explain the observed relationship between explanatory and response variables.
  • Common confusion—association vs causation: just because two variables are associated in an observational study does not mean one causes the other; sun exposure confounds the sunscreen–skin cancer relationship.
  • Types of observational studies: prospective (follow individuals forward in time), retrospective (review past records), cohort, cross-sectional, and case-control studies.

🔬 Explanatory and response variables

🔬 What they are

Explanatory variable: the variable that may have an effect on another variable.
Response variable: the variable that is affected by the explanatory variable.

  • The excerpt frames research questions in terms of cause and effect: "Is one brand of fertilizer more effective at growing roses than another?"
  • In that example, fertilizer brand is the explanatory variable and flower growth is the response variable.
  • These terms help clarify what researchers are trying to measure and what they think might influence it.

📊 Three data collection methods

📊 Anecdotal evidence

Anecdotal evidence: data collected in a haphazard fashion, often representing only one or two cases.

  • Why it is unreliable: it is unclear whether these cases represent the population; they may be extraordinary or unusual.
  • Example: "I met two students who took more than seven years to graduate from Duke, so it must take longer to graduate at Duke than at many other colleges."
  • The problem: you may remember the two seven-year students more vividly than the six who graduated in four years.
  • Key takeaway: instead of unusual cases, examine a sample of many cases that represent the population.

🔍 Observational studies

Observational study: researchers collect data in a way that does not directly interfere with how the data arises.

  • Researchers merely observe what happens naturally—via surveys, medical records, or following groups over time.
  • What they can show: evidence of naturally occurring associations between variables.
  • What they cannot show: causal connections by themselves.
  • Example methods: questionnaires, reviewing company records, tracking similar individuals to form hypotheses about disease development.

🧪 Designed (controlled) experiments

  • The excerpt mentions this as the third method but does not elaborate in this section.
  • Experiments are "more commonly accepted" than anecdotal evidence.
  • The distinction from observational studies is that experiments can support causal claims (covered in later sections).

🌞 The confounding variable problem

🌞 Sunscreen and skin cancer example

  • Hypothetical finding: an observational study finds that the more sunscreen someone uses, the more likely they are to have skin cancer.
  • Naive conclusion: does sunscreen cause skin cancer?
  • Reality: previous research shows sunscreen actually reduces skin cancer risk.
  • The confounder: sun exposure is unaccounted for.
    • If someone is out in the sun all day, they are more likely to use sunscreen but also more likely to get skin cancer.
    • Sun exposure is the lurking variable that explains the association.

🧩 What is a confounding variable?

Confounding variable (also called lurking or conditional variable): a variable that was not accounted for and may actually be important in explaining the observed relationship.

  • Confounding variables can cause misleading, counterintuitive, or even spurious correlations.
  • Don't confuse: an association observed in an observational study with a causal relationship—always ask "what else could explain this pattern?"

🗂️ Types of observational studies

🗂️ Prospective vs retrospective

| Type | Definition | Example |
| --- | --- | --- |
| Prospective | Identifies individuals and collects information as events unfold | Nurses' Health Study (started 1976, expanded 1989): recruits registered nurses and collects data using questionnaires over many years |
| Retrospective | Collects data after events have taken place | Researchers reviewing past events in medical records |

  • Some datasets contain both prospectively and retrospectively collected variables.
  • Example from the excerpt: a researcher studying medical school students' depression sends out a questionnaire (prospective) and reviews past medical records (retrospective).

🏥 Other classifications (life science and medical contexts)

| Type | Definition |
| --- | --- |
| Cohort study | Follows a group of many similar individuals over time, often producing longitudinal data |
| Cross-sectional study | Data collection on a population at one point in time (often prospective) |
| Case-control study | Compares a group that has a certain characteristic to a group that does not, often retrospective for rare conditions |

📝 Example walkthrough

Scenario: A researcher studies the relationship between time spent studying in medical school and depression rates. He reviews graduated students' medical records to see if they have seen a psychologist and sends out a questionnaire asking how much time they spent studying.

Question: What type of study is this?

Answer: Both prospective and retrospective observational study.

  • Sending out a questionnaire → prospective.
  • Reviewing past medical records → retrospective.

🧪 Variation in data

🧪 What variation means

Variation: differences present in any set of data.

  • Example: eight 16-ounce beverage cans were measured and produced amounts ranging from 14.8 to 16.1 ounces.
  • Why variation occurs: different people took measurements, or the exact amount was not put into the cans.
  • Manufacturers run tests to check if amounts fall within the desired range.

🔧 What to do about variation

  • Be aware that your data may vary somewhat from someone else's data for the same purpose—this is natural.
  • Warning sign: if two or more people taking the same data get very different results, re-evaluate your data-taking methods and accuracy.

📐 Data analysis process

📐 Four phases

The excerpt outlines a formal four-phase process:

  1. Identify the research objective

    • What questions are to be answered?
    • What group should be studied?
    • Have attempts been made to answer it before?
  2. Collect the information needed

    • Is data already available?
    • Can you access the entire population?
    • How can you collect a good sample?
  3. Organize and summarize the information

    • What visual descriptive techniques are appropriate?
    • What numerical descriptive techniques are appropriate?
    • What aspects of the data stick out?
  4. Draw conclusions from the information

    • What inferential techniques are appropriate?
    • What conclusions can be drawn?
  • The excerpt notes that all of these questions will be answered throughout the course.

1.4 Designed Experiments

🧭 Overview

🧠 One-sentence thesis

Designed experiments, unlike observational studies, can establish causality by randomly assigning treatments and controlling confounding factors through randomization, replication, and control.

📌 Key points (3–5)

  • Experiments vs observational studies: experiments manipulate the explanatory variable and can show causality; observational studies cannot make causal claims due to confounding factors.
  • Three main principles: randomization (spreads lurking variables equally), replication (larger samples improve accuracy), and control (placebo groups balance suggestion effects).
  • Common confusion: taking vitamin D and being healthier does not prove causation—people who take vitamins may also exercise, eat better, etc. (confounding variables).
  • Blinding matters: double-blind experiments prevent both subjects and researchers from knowing who receives active treatment, eliminating bias from expectations.
  • Design types: completely randomized, block design (group by known factors first), and matched pairs (same or very similar individuals receive different treatments).

🔬 Core experimental concepts

🔬 Explanatory vs response variables

Explanatory variable: the variable the researcher manipulates; it causes change in another variable.
Response variable: the variable affected by the explanatory variable; the outcome being measured.

  • Treatments: the different values of the explanatory variable that researchers assign.
  • Experimental unit: a single object or individual being measured.
  • Example: In a vitamin D study, vitamin D intake is the explanatory variable; health outcomes are the response variable.

⚖️ Why experiments beat observation

  • Observational studies show associations but cannot prove causation.
  • The vitamin D example illustrates this: people who take vitamin D may also exercise, eat well, avoid smoking—any of these could explain better health.
  • Experiments isolate the explanatory variable by controlling other factors, so differences in outcomes must result from the treatment itself.

🎲 The three design principles

🎲 Randomization

  • What it does: randomly assigns experimental units to treatment groups.
  • Why it works: spreads all lurking variables equally among groups, so the only difference between groups is the planned treatment.
  • Result: different outcomes in the response variable must be a direct result of different treatments, enabling causal conclusions.
  • Example: Flip a coin to assign participants to either a control group (no vitamin D) or experimental group (extra vitamin D doses).
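
The coin-flip assignment can be sketched in code (participant labels are hypothetical, for illustration only):

```python
import random

def randomize(participants, seed=42):
    """Assign each participant to control or treatment by a fair coin flip."""
    rng = random.Random(seed)
    control, treatment = [], []
    for person in participants:
        # "Flip a coin": heads -> control, tails -> treatment.
        (control if rng.random() < 0.5 else treatment).append(person)
    return control, treatment

# Hypothetical participant labels.
people = [f"subject_{i}" for i in range(10)]
control, treatment = randomize(people)
print(control)
print(treatment)
```

Note that a pure coin flip need not produce equal-sized groups; shuffling the full list and splitting it in half is a common alternative when exact balance is required.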

🔁 Replication

  • What it means: collecting a sufficiently large sample; observing more cases improves accuracy of estimates.
  • Additional forms:
    • Scientists may replicate an entire study to verify earlier findings.
    • Repeated measures: subjecting the same individuals to the same treatment more than once.
  • More observations → more accurate understanding of how the explanatory variable affects the response.

🛡️ Control and blinding

Control group: a treatment group given a placebo—a treatment that cannot influence the response variable.

  • Why control groups matter: the power of suggestion can influence outcomes as much as actual medication; one study found that believing you took a performance drug improved performance almost as much as actually taking it.
  • Blinding: participants don't know who receives active treatment vs placebo, preserving the power of suggestion.
  • Double-blind: both subjects and researchers interacting with subjects are unaware of treatment assignments.
  • Don't confuse: if participants know they're receiving a placebo, the power of suggestion disappears—blinding prevents this.

🏛️ Real-world importance

  • U.S. FDA and European Medicines Agency require two independent randomized trials before approving new drugs.
  • Example: In 1954, ~750,000 children participated in a randomized polio vaccine trial; results led to widespread successful vaccination.

🧪 Experimental design types

🧪 Completely randomized design

  • The most basic design: determine how many treatments to administer, then randomly assign participants to their respective groups.
  • No additional grouping or pairing—pure randomization.

🧱 Block design

Blocking: first grouping individuals based on variables known or suspected to influence the response, then randomly assigning treatments within each block.

  • When to use: researchers know certain outside variables affect the response.
  • How it works:
    1. Split participants into blocks (e.g., low-risk vs high-risk patients).
    2. Randomly assign half from each block to control, half to treatment.
  • Benefit: ensures each treatment group has equal representation from each block.
  • Example: In a heart attack drug study, block patients by risk level first, then randomize within each risk block.
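
The two-step block procedure (group first, then randomize within each block) can be sketched as follows; block labels and patient IDs are hypothetical:

```python
import random

def block_randomize(blocks, seed=0):
    """Within each block, randomly assign half to control, half to treatment."""
    rng = random.Random(seed)
    assignment = {"control": [], "treatment": []}
    for label, members in blocks.items():
        shuffled = members[:]
        rng.shuffle(shuffled)                      # randomize within the block...
        half = len(shuffled) // 2
        assignment["control"] += shuffled[:half]   # ...then split it evenly
        assignment["treatment"] += shuffled[half:]
    return assignment

# Hypothetical patients blocked by heart-attack risk level.
blocks = {"low_risk": [f"L{i}" for i in range(6)],
          "high_risk": [f"H{i}" for i in range(6)]}
groups = block_randomize(blocks)
print(groups)
```

Each treatment group ends up with equal representation from each block, which is the benefit described above.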

👯 Matched pairs design

Matched pairs design: very similar individuals (or the same individual) receive two different treatments, and results are compared.

  • Common forms:
    • Twin studies
    • Before-and-after measurements
    • Pre- and post-test situations
    • Crossover studies (same person tries both treatments in random order)
  • Challenge: hard to find many suitably similar individuals.
  • Advantage: very effective at controlling for individual differences.

🏊 Matched pairs example: wetsuit study

  • Question: Did a new wetsuit design increase swim velocities at the 2000 Olympics?
  • Design: Twelve competitive swimmers swam 1,500 meters twice—once in a wetsuit, once in a regular swimsuit; order randomized for each swimmer.
  • Measurement: velocity difference (wetsuit velocity minus swimsuit velocity) for each swimmer.
  • Key insight: Two measurements per individual, but analyzing the difference allows for paired comparison—each swimmer serves as their own control.
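
The paired analysis can be sketched with hypothetical velocities (the study's actual measurements are not reproduced here; these numbers exist only to show the computation):

```python
from statistics import mean

# Hypothetical velocities (m/s) for six swimmers -- illustration only.
wetsuit  = [1.57, 1.47, 1.42, 1.35, 1.22, 1.75]
swimsuit = [1.49, 1.37, 1.35, 1.27, 1.12, 1.64]

# Each swimmer is their own control: analyze the per-swimmer difference.
differences = [w - s for w, s in zip(wetsuit, swimsuit)]
print([round(d, 2) for d in differences])
print(f"mean difference = {mean(differences):.3f} m/s")
```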

🔍 Design identification practice

🔍 Sleep deprivation and driving

  • Setup: 19 professional drivers, two sessions each (normal sleep vs 27 hours sleep deprivation), treatments in random order, performance measured on driving simulation.
  • Design type: Matched pairs (same individuals receive both treatments in randomized order).

🔍 Smell and learning study

  • Setup: Subjects completed mazes multiple times wearing floral-scented or unscented masks; random assignment to order (floral first or last).
  • Variables:
    • Explanatory: scent (floral vs unscented)
    • Response: time to complete maze
  • Treatments: floral-scented mask and unscented mask
  • Lurking variables: eliminated by random assignment of treatment order—all subjects experienced both treatments.
  • Blinding: Subjects cannot be blinded (they know if they smell flowers), but researchers timing mazes can be blinded to which mask is worn.

1.5 Sampling

🧭 Overview

🧠 One-sentence thesis

Random sampling methods allow researchers to gather representative data from large populations efficiently, but careful attention to sampling technique and ethical practices is essential to avoid bias and ensure valid conclusions.

📌 Key points (3–5)

  • What sampling does: Uses a subset (sample) of a population to gather information when studying the entire population is impractical.
  • Random sampling methods: Simple random sampling (SRS), stratified sampling, cluster sampling, and systematic sampling each have different procedures and use cases.
  • Common confusion: Sampling with replacement vs without replacement—most practical surveys use without replacement, which is approximately equivalent to with replacement when the population is large.
  • Bias and errors: Sampling bias occurs when some population members are less likely to be chosen; non-sampling errors come from factors unrelated to the sampling process itself.
  • Ethics matter: Researchers must protect participants, avoid fraud, and use proper methods to ensure data integrity and validity.

🎲 Random sampling fundamentals

🎯 Why we sample

  • Gathering information about an entire population is often impossible due to cost, time, or logistics.
  • A sample should have the same characteristics as the population it represents.
  • Goal: Use random methods so each member initially has an equal chance of selection.

🔀 Simple Random Sample (SRS)

Simple random sample: Any group of n individuals is equally likely to be chosen as any other group of n individuals.

  • This is the "gold standard" method.
  • Each sample of the same size has an equal chance of being selected.
  • Example: To form a study group of three from 31 classmates, put all names in a hat and draw three, or assign numbers and use a random number generator.
  • The excerpt shows Lisa assigning two-digit IDs (00–30) to classmates, generating random decimals, extracting two-digit sequences, and selecting corresponding students.
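Lisa's procedure can be sketched in a few lines of Python (hypothetical roster names; `random.sample` never repeats an ID, so every group of three is equally likely to be chosen):

```python
import random

# Hypothetical roster: 31 classmates assigned two-digit IDs 00-30.
roster = {i: f"Student {i:02d}" for i in range(31)}

random.seed(1)  # fixed seed so the draw is reproducible
# random.sample picks 3 distinct IDs; every group of 3 is equally likely,
# which is the defining property of a simple random sample.
chosen_ids = random.sample(sorted(roster), 3)
study_group = [roster[i] for i in chosen_ids]
print(study_group)
```

The same idea scales to any population list: shuffle-free selection of n distinct members in one call.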

🧩 Other sampling techniques

📊 Stratified sampling

Stratified sample: Divide the population into groups (strata) based on a relevant characteristic, then take a proportionate number from each stratum.

  • Identify a similar characteristic and group people accordingly.
  • Then randomly select a proportionate number from each group.
  • Example: Divide a college by department (six in dept. 1, twelve in dept. 2, nine in dept. 3); for a total sample of nine, randomly choose two from dept. 1, four from dept. 2, and three from dept. 3.
  • Why use it: Ensures the sample reflects population demographics.
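The proportionate allocation in the department example can be sketched as follows (hypothetical member lists; the shares here divide evenly, so simple rounding works, though in general the rounded parts may need adjusting to sum to the sample size):

```python
import random

# Hypothetical strata: department rosters of size 6, 12, and 9 (27 people total).
strata = {
    "dept 1": list(range(6)),
    "dept 2": list(range(12)),
    "dept 3": list(range(9)),
}
total = sum(len(members) for members in strata.values())
sample_size = 9

random.seed(0)
sample = {}
for name, members in strata.items():
    # Proportionate allocation: this stratum's share of the total sample.
    k = round(len(members) / total * sample_size)
    sample[name] = random.sample(members, k)

print({name: len(chosen) for name, chosen in sample.items()})
# {'dept 1': 2, 'dept 2': 4, 'dept 3': 3}
```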

🗂️ Cluster sampling

Cluster sample: Divide the population into predefined clusters, randomly select some clusters, then include all members from those selected clusters.

  • Example: A college has five departments (the clusters); number them, randomly pick two departments, and survey everyone in those two departments.
  • Don't confuse with stratified: cluster takes all members from selected groups; stratified takes some from each group.
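A minimal sketch of the cluster procedure, using hypothetical department rosters; note that every member of each picked cluster is surveyed, which is the key contrast with stratified sampling:

```python
import random

# Hypothetical clusters: five departments, each with four members.
departments = {f"dept {d}": [f"d{d}-member{m}" for m in range(4)]
               for d in range(1, 6)}

random.seed(2)
# Randomly pick two whole departments, then include everyone in them.
picked = random.sample(sorted(departments), 2)
surveyed = [person for dept in picked for person in departments[dept]]
print(picked, len(surveyed))
```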

🔢 Systematic sampling

Systematic sample: Randomly select a starting point, then take every nth member from a list.

  • Example: You have 60 phone contacts and want a sample of 15; 60 ÷ 15 = 4, so randomly pick a starting point among the first four contacts, then select every fourth contact until you reach 15.
  • Frequently chosen because it is simple.
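Under the usual convention (step size = population size ÷ sample size, here 60 ÷ 15 = 4), systematic selection reduces to a single slicing operation:

```python
import random

# Hypothetical contact list: 60 phone contacts, target sample of 15.
contacts = [f"contact {i}" for i in range(60)]
n = 15
k = len(contacts) // n  # step size: every 4th contact

random.seed(3)
start = random.randrange(k)        # random starting point within the first step
sample = contacts[start::k][:n]    # then every k-th contact after it
print(len(sample))
```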

🌐 Multistage sampling

  • Researchers may combine techniques (e.g., stratify first, then cluster within strata).

⚖️ Sampling with vs without replacement

🔄 With replacement

  • Once a member is picked, they go back into the population and may be chosen again.
  • True random sampling is done with replacement.

🚫 Without replacement

  • A member can be chosen only once.
  • Most practical surveys use this method.
  • Key insight: When the population is large and the sample is small, sampling without replacement is approximately the same as with replacement because the chance of picking the same person twice is very low.
| Population size | Sample size | With replacement (2nd pick) | Without replacement (2nd pick) | Difference |
| --- | --- | --- | --- | --- |
| 10,000 | 1,000 | 999/10,000 = 0.0999 | 999/9,999 = 0.0999 | Negligible (same to 4 decimals) |
| 25 | 10 | 9/25 = 0.3600 | 9/24 = 0.3750 | Noticeable difference |
  • Don't confuse: The distinction matters only when the population is small relative to the sample.
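The figures above can be checked directly; `fractions.Fraction` keeps the arithmetic exact before converting for display:

```python
from fractions import Fraction

def second_pick_chance(population, favorable):
    """Chance the 2nd pick hits one of `favorable` remaining members."""
    with_repl = Fraction(favorable, population)         # member returned to the pool
    without_repl = Fraction(favorable, population - 1)  # pool shrinks by one
    return float(with_repl), float(without_repl)

# Large population: 999 favorable members out of 10,000 -> nearly identical.
print(second_pick_chance(10_000, 999))
# Small population: 9 favorable members out of 25 -> (0.36, 0.375), noticeably different.
print(second_pick_chance(25, 9))
```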

🚨 Bias and sampling errors

🎯 Sampling bias

Sampling bias: Created when some members of the population are not as likely to be chosen as others.

  • Each member should have an equally likely chance.
  • When bias occurs, incorrect conclusions can be drawn.
  • Example: Mailed surveys returned voluntarily may favor certain groups.

📉 Sampling vs non-sampling errors

  • Sampling errors: Caused by the actual sampling process (e.g., sample not large enough).
  • Non-sampling errors: Caused by factors unrelated to sampling (e.g., defective counting device).
  • A sample will never be exactly representative, so some sampling error always exists.
  • Rule of thumb: Larger samples → smaller sampling error.

🔀 Sampling variability

  • Two samples from the same population, even using the same method, will likely differ.
  • Example: Doreen uses systematic sampling and Jung uses cluster sampling to study student sleep; their samples of 500 will differ.
  • Even if both used the same method, samples would still differ.
  • Larger samples tend to produce results closer to the true population value, but variation remains.

⚠️ Common problems and critical evaluation

🛑 Convenience sampling

  • Non-random; uses results that are readily available.
  • Example: A software store surveys customers browsing in the store.
  • May be very good in some cases, highly biased in others.

🙋 Self-selected samples

  • Only people who choose to respond (e.g., call-in surveys).
  • Often unreliable because they may represent only those with strong opinions.

📏 Sample size issues

  • Samples that are too small may be unreliable.
  • Larger is better when possible.
  • Exception: Small samples may be unavoidable (e.g., crash testing cars, rare medical conditions) but can still be useful.

🎭 Other critical issues

| Problem | What it means | Impact |
| --- | --- | --- |
| Undue influence | Collecting data or asking questions in a way that influences responses | Biased results |
| Non-response | Subject refuses to participate | Sample may no longer represent the population |
| Causality confusion | Assuming one variable causes another just because they're related | Invalid conclusions; they may be connected through a third variable |
| Self-funded studies | Study performed to support the funder's claim | Potential bias; evaluate on merits, not automatically good or bad |
| Misleading data use | Improperly displayed graphs, incomplete data, lack of context | Misinterpretation |
| Confounding | Effects of multiple factors cannot be separated | Cannot draw valid conclusions about individual factors |

🛡️ Ethics in statistical research

📜 Researcher responsibilities

  • Verify proper methods are being followed.
  • The excerpt cites a fraud case: a social psychologist fabricated data in over 55 papers, driven by a desire for "elegant" results rather than truth.
  • Co-authors should have spotted statistical flaws but lacked familiarity with elementary statistics.

🏛️ Legal and institutional protections

  • U.S. Department of Health and Human Services oversees regulations for studies with human participants.
  • Research institutions establish Institutional Review Boards (IRBs) to approve studies in advance.

🔐 Key mandated protections

  • Minimize risks: Risks must be reasonable relative to projected benefits.
  • Informed consent: Risks must be clearly explained; subjects must consent in writing; documentation is required.
  • Privacy: Data must be carefully guarded to protect participant identity.

⚖️ Ethical challenges

  • Is removing a name sufficient to protect privacy, or could identity be discovered from remaining data?
  • What if unanticipated risks arise during the study?
  • Example dilemma: Does a researcher have the right to use leftover blood samples originally taken for a cholesterol test?

🚩 Unethical behaviors (from excerpt example)

| Behavior | Why it's unethical | Correction |
| --- | --- | --- |
| Selecting a convenient block to survey | Intentionally biased sample | Select areas at random |
| Skipping houses where no one is home | Omits relevant data (e.g., working families) | Make every effort to interview all target members |
| Filling in forms with random answers from other participants | Fraudulent duplication creates bias | Work diligently; never fake data |
| Commissioned by juice seller; only two brands tested; participants see brands; misleading claim in commercial | Multiple issues: conflict of interest, limited scope, undue influence, misrepresentation | Disclose funding, include more options, blind the test, report accurately |

🔍 Vigilance and knowledge

  • Fraud is more prevalent than most realize; a website catalogs retracted fraudulent studies.
  • Learning statistics empowers critical analysis of studies.
  • Professional organizations (e.g., American Statistical Association) define clear expectations.
  • Federal laws govern the use of research data.
6

Chapter 1 Wrap-Up: Research Ethics and Sampling

Chapter 1 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

Research involving human subjects requires institutional oversight to protect participants through informed consent, risk minimization, and privacy safeguards, while researchers must avoid fraudulent practices that compromise data reliability.

📌 Key points (3–5)

  • Institutional Review Boards (IRBs) must approve all studies in advance to ensure participant safety and ethical compliance.
  • Three key legal protections: minimized risks, informed written consent, and careful privacy protection of collected data.
  • Common confusion: privacy protection is not as simple as removing names—remaining data may still reveal identity.
  • Ethical sampling requires: avoiding convenience samples, making every effort to reach all target participants, and never fabricating or duplicating data.
  • Fraud is more prevalent than expected: a dedicated website catalogs retracted fraudulent studies, highlighting the importance of statistical literacy.

🛡️ Institutional protections for human subjects

🏛️ Role of IRBs

Institutional Review Boards (IRBs): oversight committees established by research institutions to review and approve all planned studies involving human subjects.

  • All studies must receive advance approval before beginning.
  • IRBs ensure that research meets ethical and legal standards.
  • This requirement applies whenever a research institution engages in research with human participants.

⚖️ Three mandated legal protections

| Protection | What it requires | Why it matters |
| --- | --- | --- |
| Risk minimization | Risks must be minimized and reasonable relative to benefits | Prevents unnecessary harm to participants |
| Informed consent | Risks clearly explained; written consent required and documented | Ensures participants understand what they're agreeing to |
| Privacy protection | Data must be carefully guarded | Protects participant identity and sensitive information |

🔍 Challenges in ethical compliance

🕵️ Privacy is complex

  • Simply removing a participant's name may not be sufficient.
  • The person's identity might still be discoverable from remaining data.
  • Example: A dataset without names might still contain age, location, occupation, and other details that together identify someone.

⚠️ Unanticipated situations

The excerpt raises difficult questions that arise in practice:

  • What happens when a study doesn't proceed as planned and new risks emerge?
  • When is informed consent truly necessary?
  • Does a researcher have the right to use biological waste (like leftover blood samples) in a study after the original purpose is complete?

Don't confuse: Meeting the letter of the law (e.g., getting written consent) vs. truly protecting participants in all circumstances.

🚫 Unethical sampling behaviors

🚫 Convenience sampling bias

Problem: A researcher selects a block where she is comfortable walking because she knows many people there.

  • This creates intentional bias by selecting a convenient rather than representative sample.
  • Claiming this sample represents the community is misleading.
  • Correction: Select areas in the community at random.

🚫 Intentional omission

Problem: No one is home at four houses; the researcher doesn't record addresses or return later.

  • Omitting relevant data creates bias.
  • Example: If gathering information about jobs and childcare, missing working families who aren't home loses relevant data.
  • Correction: Make every effort to interview all members of the target sample.

🚫 Data fabrication

Problem: A researcher running late skips four houses and later fills in forms by selecting random answers from other residents.

  • Faking data is never acceptable.
  • Even using "real" responses from other participants, the duplication is fraudulent and creates bias.
  • Correction: Work diligently to interview everyone on the route.

🎯 Study design ethics example

🧃 Fruit juice study scenario

The excerpt presents a commissioned study to determine favorite fruit juice among California teens, with multiple ethical issues:

Issue a: The survey is commissioned by a seller of a popular apple juice brand.

  • Potential conflict of interest that could bias study design or reporting.

Issue b: Only two juice types included (apple and cranberry).

  • Limited choices don't represent the full range of preferences.

Issue c: Participants can see the brand as each sample is poured.

  • Visual cues can influence taste perception and responses.

Issue d: Brand X claims "Most teens like Brand X as much as or more than Brand Y" when 25% prefer X, 33% prefer Y, and 42% have no preference.

  • Misleading interpretation: combining "prefer X" with "no preference" to claim superiority is deceptive reporting.

📚 Why statistical literacy matters

📚 Fraud is a bigger problem than realized

  • There is a dedicated website (retractionwatch.com) cataloging retracted fraudulent study articles.
  • The misuse of statistics is more prevalent than most people realize.
  • A quick glance at the site reveals the scope of the problem.

🔬 Knowledge as vigilance

  • Vigilance against fraud requires knowledge.
  • Learning basic statistical theory empowers critical analysis of studies.
  • Students of statistics must take time to consider ethical questions that arise in research.

Don't confuse: Honest mistakes or limitations vs. deliberate fraud or unethical practices—both affect reliability, but fraud is intentional deception.

7

Descriptive Statistics and Frequency Distributions

2.1 Descriptive Statistics and Frequency Distributions

🧭 Overview

🧠 One-sentence thesis

Descriptive statistics provides both graphical and numerical tools to organize, summarize, and reveal patterns in data, with frequency tables serving as a foundational method for displaying how often values occur.

📌 Key points (3–5)

  • What descriptive statistics does: organizes and presents data through visual (graphs, charts) and numerical (summary statistics) methods to make large datasets manageable.
  • Frequency tables as a starting point: show how often each value occurs and work for both categorical and quantitative data.
  • Relative vs cumulative relative frequency: relative frequency is the proportion of each value; cumulative relative frequency accumulates these proportions row by row.
  • Common confusion: class limits vs class width—the lower and upper class limits define the boundaries of a group, while class width is the difference between consecutive lower limits.
  • Choosing bins matters: for quantitative data, grouping requires decisions about bin width, number of bins (typically 5–20), and boundaries to reveal distribution without distortion.

📊 What descriptive statistics accomplishes

📊 The role of descriptive statistics

Descriptive statistics: the area of statistics focused on numerical and graphical ways to describe and display data.

  • Raw data alone can be overwhelming (e.g., a list of all house prices in a neighborhood).
  • Descriptive methods condense data into understandable summaries—averages, medians, graphs.
  • Two main types: graphical (charts, tables, graphs) and numerical (summary statistics like central tendency and variability).

🎯 Why graphs matter

  • Graphs reveal the distribution or shape of data—where values cluster and where gaps exist.
  • A visual can communicate trends faster than a table of numbers.
  • The choice of graph depends on data type: pie/bar charts for categorical data; histograms, box plots, dot plots for quantitative data.

📋 Frequency tables: the foundation

📋 What a frequency table shows

Frequency: how often each value occurs in the dataset.

  • A frequency table organizes data by listing each value (or group of values) and counting occurrences.
  • Works for both categorical and quantitative data.
  • Example: polling 20 children on favorite colors yields a simple table with colors and their counts.

🔢 Handling quantitative data

  • Discrete data with few values: straightforward—list each value and its frequency.
  • Continuous or large-range data: requires grouping into classes (bins).
  • Natural groupings often exist (e.g., ages 20–29, 30–39).

🧮 Key terms for grouped data

| Term | Definition | Example (class 30–39) |
| --- | --- | --- |
| Class (bin) | A grouping interval | 30–39 |
| Lower class limit | Starting value of the class | 30 |
| Upper class limit | Ending value of the class | 39 |
| Class width | Difference between consecutive lower limits | 40 – 30 = 10 |
| Class midpoint | Average of lower and upper limits | (30 + 39) / 2 = 34.5 |

🛠️ Guidelines for creating bins

  • Number of bins: typically 5–20; a good starting point is the square root of the sample size.
  • Bin width: should be consistent across all bins and use "reasonable" numbers (whole numbers, multiples of 5 or 10).
  • Boundaries: should not overlap, have no gaps, and cover the entire data range.
  • Starting point: carry the lower boundary to one more decimal place than the most precise data value to avoid values falling exactly on boundaries.
    • Example: if the smallest value is 6.1, start at 6.05; if smallest is 2 (an integer), start at 1.5.
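These guidelines can be sketched as a small helper (hypothetical data range and sample size; the square root of n for the bin count and rounding the width up to a whole number are the rules of thumb named above):

```python
import math

# Hypothetical data: smallest value 6.1, largest 74.9, n = 100 observations.
n, low, high = 100, 6.1, 74.9

bins = round(math.sqrt(n))                # starting guess for the number of bins
start = low - 0.05                        # one extra decimal place below the minimum
width = math.ceil((high - start) / bins)  # round width up to a "reasonable" whole number
boundaries = [start + i * width for i in range(bins + 1)]

# Boundaries don't overlap, leave no gaps, and cover the entire data range.
print(bins, width, round(boundaries[0], 2), round(boundaries[-1], 2))
```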

🔄 Relative and cumulative relative frequencies

🔄 Relative frequency

Relative frequency (RF): the ratio of the frequency of a value to the total number of observations.

  • Formula: RF = f / n, where f = frequency and n = total sample size.
  • Can be expressed as a fraction, decimal, or percentage.
  • Example: if 3 out of 40 students scored 90–100%, RF = 3/40 = 0.075 = 7.5%.
  • All relative frequencies in a table must sum to 1 (allowing for rounding).

📈 Cumulative relative frequency

Cumulative relative frequency: the accumulation of all previous relative frequencies up to the current row.

  • Calculated by adding the current row's relative frequency to the sum of all previous relative frequencies.
  • Shows the proportion of data at or below a given value.
  • Key properties:
    • First entry equals the first relative frequency (nothing to accumulate yet).
    • Last entry always equals 1.00 (100% of data accounted for).

🧪 Example walkthrough

Using the soccer player height table (100 players):

  • Heights 59.95–61.95 inches: frequency = 5, RF = 5/100 = 0.05, cumulative RF = 0.05.
  • Heights 61.95–63.95 inches: frequency = 3, RF = 0.03, cumulative RF = 0.05 + 0.03 = 0.08.
  • Heights 63.95–65.95 inches: frequency = 15, RF = 0.15, cumulative RF = 0.08 + 0.15 = 0.23.
  • Interpretation: 23% of players are shorter than 65.95 inches (read from cumulative RF).
  • To find percentage between 61.95 and 65.95 inches: add the two relevant RFs: 0.03 + 0.15 = 0.18 = 18%.

✅ Checking your work

  • Sum of all frequencies = n (sample size).
  • Sum of all relative frequencies = 1.
  • Final cumulative relative frequency = 1.
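The first three rows of the soccer-player table, together with the running total, can be reproduced in a few lines:

```python
# First three classes of the soccer-player height example; n = 100 players.
freqs = {"59.95-61.95": 5, "61.95-63.95": 3, "63.95-65.95": 15}
n = 100

running = 0.0
table = []
for interval, f in freqs.items():
    rf = f / n     # relative frequency: f / n
    running += rf  # cumulative RF accumulates row by row
    table.append((interval, f, rf, round(running, 2)))

print(table)
# [('59.95-61.95', 5, 0.05, 0.05), ('61.95-63.95', 3, 0.03, 0.08),
#  ('63.95-65.95', 15, 0.15, 0.23)]
```

The final cumulative value (0.23) is read directly as "23% of players are shorter than 65.95 inches."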

🎓 Practical application

🎓 Data collection considerations

  • For the soccer player example, the data type is quantitative continuous (height can take any value within a range).
  • To make the sample representative: obtain rosters from all teams and use simple random sampling from each.

🎓 Remember the distinction

  • You count frequencies (raw counts).
  • You calculate relative frequencies (proportions).
  • You accumulate cumulative relative frequencies (running totals of proportions).
8

Displaying and Describing Categorical Distributions

2.2 Displaying and Describing Categorical Distributions

🧭 Overview

🧠 One-sentence thesis

Categorical data is best understood through visual displays (pie charts and bar graphs) and simple numerical summaries (mode and variability), with the choice of graph depending on whether categories overlap, data is missing, or comparisons across groups are needed.

📌 Key points (3–5)

  • Visual methods come first: Start with graphs (pie charts or bar graphs) to understand categorical data before moving to numerical summaries.
  • Percentages enable fair comparisons: When comparing groups with different totals, showing percentages alongside counts makes patterns clearer.
  • Graph choice matters: Pie charts require categories that don't overlap and percentages that sum to 100%; bar graphs are more flexible and work when categories overlap or data is missing.
  • Common confusion—when percentages don't add to 100%: If students can belong to multiple categories (totals exceed 100%) or if data is missing (totals under 100%), use bar graphs, not pie charts.
  • Numerical description is limited: Categorical data can be described by the mode (most frequent category) and variability (how spread out observations are across categories), but mathematical calculations are not applicable.

📊 Visual methods for categorical data

📊 Tables with frequencies and percentages

  • Tables organize categorical data by showing both counts (frequencies) and percentages (relative frequencies).
  • Percentages are especially important when comparing datasets with different totals.
  • Example: Comparing part-time vs. full-time students at two colleges—one college has 22,496 total students, the other 14,183. Raw counts are hard to compare, but percentages (59.1% part-time vs. 71.4% part-time) reveal the pattern clearly.

🥧 Pie charts

Pie chart: categories are represented by wedges in a circle, proportional in size to the percent of individuals in each category.

  • Each wedge's size shows the proportion of that category relative to the whole.
  • When to use: Categories are mutually exclusive (no overlap) and percentages add to exactly 100%.
  • Tip: Sorting wedges by size (largest to smallest) makes the chart easier to read.
  • Example: A professor's students classified as freshmen, sophomores, juniors, or seniors—each student belongs to exactly one category.

📊 Bar graphs

Bar graphs: separate bars represent categories, where the length of each bar is proportional to the number or percent of individuals in that category.

  • Bars can be vertical or horizontal, and the height/length shows frequency or percentage.
  • More flexible than pie charts: Work even when categories overlap or data is missing.
  • Example: Facebook users by age group (13–25, 26–44, 45–64)—bars show the proportion in each group.

🔄 Pareto charts

Pareto chart: bars sorted by category size from largest to smallest.

  • A special type of bar graph that arranges categories in descending order.
  • Makes it easier to identify the mode and compare category sizes at a glance.
  • Example: Ethnicity data sorted from largest (Asian 36.1%) to smallest (Native American 0.6%) is easier to interpret than alphabetical order.

🚫 Special cases and graph selection

🚫 When percentages add to more than 100%

  • Why it happens: Individuals can belong to multiple categories.
  • What to do: Use a bar graph; pie charts cannot be used because wedges would overlap conceptually.
  • Example: Students categorized as "full-time" (40.9%), "intend to transfer" (48.6%), and "under age 25" (61.0%)—totals 150.5% because one student can be in all three categories.

🚫 When percentages add to less than 100%

  • Why it happens: Some data is missing or categorized as "Other/Unknown."
  • What to do: Use a bar graph and include a bar for the missing category if it's substantial.
  • Example: Ethnicity data totals only 90.4% because 9.6% is "Other/Unknown"—including this category in the bar graph shows its relative importance compared to small categories like Native American (0.6%).

🔍 Comparing pie charts vs. bar graphs

| Situation | Best choice | Reason |
| --- | --- | --- |
| Categories mutually exclusive, no missing data | Pie chart or bar graph | Both work; pie chart emphasizes parts of a whole |
| Categories overlap or percentages ≠ 100% | Bar graph only | Pie chart cannot represent overlapping or incomplete data |
| Comparing two groups side-by-side | Bar graph | Easier to compare bar heights than wedge sizes across two pies |
| Showing ranked order | Pareto chart (bar graph) | Sorting by size highlights the mode and relative importance |

Don't confuse: Pie charts and bar graphs both show categorical data, but pie charts require the data to represent parts of a single whole (100%), while bar graphs are more versatile.

🔢 Numerical descriptions of categorical data

🔢 Mode

Mode: the most frequently occurring value in a dataset.

  • For categorical data, the mode is the category with the highest frequency.
  • Multiple modes: A dataset can have two modes (bimodal), three modes (trimodal), or more (multimodal) if multiple categories share the highest frequency.
  • How to find it: Look for the largest wedge in a pie chart or the tallest bar in a bar graph; Pareto charts make this trivial.
  • Example: In a pie chart of student classifications, if "Freshman" is the largest wedge, it is the mode.
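Finding the mode (or modes, when categories tie) is a one-pass count; this sketch uses a hypothetical roster of student classifications:

```python
from collections import Counter

# Hypothetical roster of student classifications (categorical data).
classes = ["Freshman"] * 12 + ["Sophomore"] * 7 + ["Junior"] * 7 + ["Senior"] * 4

counts = Counter(classes)
top = max(counts.values())
# Every category tied for the highest frequency is a mode
# (two modes: bimodal; three: trimodal; more: multimodal).
modes = sorted(category for category, f in counts.items() if f == top)
print(modes)  # ['Freshman']
```

Note that only counting is possible here: categories like "Freshman" cannot be added or averaged.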

📏 Variability (diversity)

Variability in categorical data: how spread out observations are across categories, thought of as diversity.

  • High variability: Observations are fairly evenly distributed across many categories.
  • Low variability: Most observations fall into one or a few categories.
  • Visual assessment: Compare the sizes of wedges or bars—similar sizes indicate high variability; one dominant category indicates low variability.
  • Example: A college with 60% part-time and 40% full-time students shows higher variability than one with 90% full-time and 10% part-time.

Don't confuse: Variability in categorical data is not calculated numerically here; it is observed visually by comparing how evenly spread the data is across categories.

🚫 Limitations of numerical methods

  • Categorical data does not lend itself to mathematical calculations (you can't add or average categories like "Freshman" or "Asian").
  • The mode and a visual sense of variability are the main numerical descriptors available.
  • This is why visual methods (graphs) are emphasized first and remain the primary tool for understanding categorical distributions.
9

Displaying Quantitative Distributions

2.3 Displaying Quantitative Distributions

🧭 Overview

🧠 One-sentence thesis

Quantitative data can be displayed through multiple graphical methods—each with distinct strengths—and should be described by examining shape, outliers, center, and spread.

📌 Key points (3–5)

  • Graphical options for quantitative data: stem-and-leaf plots, dot plots, line graphs, histograms, frequency polygons, and time series plots—each suited to different data characteristics and purposes.
  • When to use which method: small datasets work well with stem-and-leaf plots and dot plots; large continuous datasets (100+ values) are best shown with histograms; time-ordered data requires time series plots.
  • Four key aspects to describe (SOCS): Shape, Outliers, Center, and Spread—these four features provide a complete picture of quantitative distributions.
  • Common confusion: histograms vs. frequency polygons—both show distributions, but frequency polygons (line-based) are better for comparing multiple datasets side-by-side.
  • Trade-offs: some methods (histograms) show overall patterns clearly but lose individual data points; others (stem-and-leaf, dot plots) preserve individual values but become unwieldy with large datasets.

📊 Graphical methods for small datasets

🌿 Stem-and-leaf plots

A stem-and-leaf plot divides each observation into a "stem" (first part of the number) and a "leaf" (final significant digit), arranged vertically by stem with leaves in increasing order.

  • How it works: For the number 23, stem = 2, leaf = 3; for 432, stem = 43, leaf = 2; for 9.3, stem = 9, leaf = 3.
  • Advantages: quick to create, preserves all individual data points, easy to find maximum, minimum, range, median, and quartiles.
  • Best for: small datasets, discrete or rounded continuous data.
  • Example: Exam scores 33, 42, 49, 49, 53... can be organized with stems 3, 4, 5, 6... and corresponding leaves, making it easy to see that most scores fell in the 60s–90s range.
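A stem-and-leaf plot can be built by splitting each value into `value // 10` (stem) and `value % 10` (leaf); this sketch uses a short hypothetical score list, which real data would simply replace:

```python
# Hypothetical exam scores (stem = tens digit, leaf = ones digit).
scores = [33, 42, 49, 49, 53, 55, 61, 68, 72, 74, 88, 94]

stems = {}
for s in sorted(scores):           # sorting first keeps leaves in increasing order
    stems.setdefault(s // 10, []).append(s % 10)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
# 3 | 3
# 4 | 299
# 5 | 35
# 6 | 18
# 7 | 24
# 8 | 8
# 9 | 4
```

Because every data point survives, the minimum, maximum, range, and median can be read straight off the plot.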

🌿 Back-to-back stem-and-leaf plots

  • Two datasets share the same stem column, with one set of leaves on the left and one on the right.
  • Why useful: allows direct comparison of two distributions (e.g., presidential ages at inauguration vs. at death).

🔵 Dot plots

  • A number line with dots positioned above it to represent each data value.
  • Advantages: cleaner appearance than stem-and-leaf plots, can reveal overall patterns and outliers.
  • What is an outlier: an observation that does not fit the rest of the data; may indicate a mistake (e.g., writing 50 instead of 500) or something unusual happening.
  • Example: Sleep hours 5, 5.5, 6, 6, 6, 6.5... shown as dots above a number line from 5 to 9.

📈 Graphical methods for larger or specialized datasets

📊 Histograms

A histogram consists of contiguous (adjoining) boxes with a horizontal axis labeled with what the data represents and a vertical axis labeled "frequency" or "relative frequency."

  • Rule of thumb: use histograms when the dataset has 100 or more values.
  • Advantages: can readily display large continuous datasets; shows overall shape, center, and spread clearly.
  • Trade-off: you lose individual data points.

🔧 How to construct a histogram

  1. Decide the starting and ending points (use precise values, e.g., 59.95 instead of 60).
  2. Calculate the width of each bar (class interval): subtract starting point from ending value, divide by desired number of bars.
  3. Determine boundaries for each interval.
  4. Count how many data values fall into each interval.
  5. Draw contiguous bars with heights corresponding to frequencies.
  • Note on boundaries: different researchers may set up histograms differently for the same data; there is more than one correct way.
  • Guideline for number of bars: some take the square root of the number of data values and round to the nearest whole number.
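The five steps can be sketched on hypothetical data, with the start and end points carried to one extra decimal place as recommended:

```python
# Hypothetical heights; 5 bars requested.
data = [60.3, 61.0, 62.5, 63.1, 64.4, 64.9, 65.2, 66.8, 67.7, 68.0,
        68.5, 69.1, 70.2, 71.6, 72.4, 73.0, 73.9, 74.5, 74.8, 74.9]

start, end, bars = 59.95, 74.95, 5                    # step 1: precise start/end points
width = (end - start) / bars                          # step 2: class width = 3.0
edges = [start + i * width for i in range(bars + 1)]  # step 3: interval boundaries

# Step 4: count values falling in each interval [edge, edge + width).
freqs = [sum(e <= x < e + width for x in data) for e in edges[:-1]]

# Step 5 would draw contiguous bars with these heights.
print(freqs)  # [3, 4, 4, 3, 6]
```

Offsetting the boundaries to .95 keeps any whole-number data value from landing exactly on an edge.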

📉 Frequency polygons

  • Analogous to line graphs but use binning techniques (like histograms) to make continuous data easy to interpret.
  • How to construct: examine data, decide on class intervals, plot data points at the midpoint of each interval, connect points with line segments.
  • Key advantage: more useful than histograms for comparing multiple continuous distributions by overlaying frequency polygons.
  • Example: Calculus test scores grouped into intervals (49.5–59.5, 59.5–69.5, etc.) with points connected by lines; the graph may show skewness (one side does not mirror the other).

📈 Line graphs

  • The x-axis shows the data values and the y-axis shows their frequencies; the frequency points are connected using line segments.
  • Best for: showing trends in discrete data values.
  • Note: can also be used with some ordinal categorical data.
  • Example: Survey of 40 mothers asked how many times per week a teenager must be reminded to do chores—frequencies plotted and connected with lines.

⏱️ Time series plots

A graph that displays data in chronological order, with time on the horizontal axis and the measured variable on the vertical axis.

  • Why important: when recording values of the same variable over an extended period, trends or patterns may be difficult to discern in raw data but jump out when displayed graphically.
  • How to construct: use a Cartesian coordinate system; horizontal axis = dates or time increments; vertical axis = measured values; connect points with straight lines in the order they occur.
  • What it reveals: makes trends easy to spot that might be hidden in tables.
  • Example: Annual Consumer Price Index data from 2009–2019 plotted over time shows a constant positive trend.

🔍 Describing quantitative distributions: SOCS

🔍 The four key aspects (SOCS acronym)

When describing a quantitative distribution, note at least four characteristics:

| Aspect | What to observe | How to determine |
| --- | --- | --- |
| Shape | Overall pattern of the distribution | Main characteristic determined by looking at a graph |
| Outliers | Data points that don't fit the pattern | Often identified visually |
| Center | Where data clusters or the "middle" | Can be roughly gauged visually; also calculated numerically |
| Spread | How dispersed or variable the data is | Can be roughly gauged visually; also calculated numerically |

🔍 Why SOCS matters

  • It isn't enough to just make graphs; you must interpret the information with a critical eye.
  • Shape and outliers can be determined primarily through visual inspection.
  • Center and spread can be roughly gauged visually but require numerical calculations for precision (covered in following sections).
  • Example question to consider: "Where do your data appear to cluster? How might you interpret the clustering?"
10

Describing Quantitative Distributions

2.4 Describing Quantitative Distributions

🧭 Overview

🧠 One-sentence thesis

When analyzing quantitative data distributions, we must systematically describe four key aspects—shape, outliers, center, and spread—to interpret patterns and understand what the data reveals.

📌 Key points (3–5)

  • SOCS framework: Shape, Outliers, Center, and Spread are the four essential aspects to describe in any quantitative distribution.
  • Shape matters first: The shape of a distribution (symmetric vs. skewed, unimodal vs. multimodal) dictates how to proceed with further analysis.
  • Outliers serve multiple purposes: Extreme values help identify skewness, catch data errors, and reveal interesting properties.
  • Common confusion: A "mode" in data analysis means a prominent peak in the distribution, not necessarily the single most frequent value (which may not exist in real-world datasets).
  • Visual vs. numerical: Shape and outliers can be identified visually from graphs, while center and spread can be roughly estimated visually but require numerical calculations for precision.

📊 The SOCS Framework

📊 What SOCS stands for

SOCS: Shape, Outliers, Center, Spread—a helpful acronym for remembering the four key aspects to describe when analyzing quantitative distributions.

  • This framework provides a systematic approach to data interpretation.
  • It's not enough to just create graphs; you must interpret the information critically.
  • The order matters: shape is examined first because it influences how you analyze the other aspects.

🔍 Why each component matters

  • Shape: Determines which analytical methods are appropriate.
  • Outliers: Can be identified visually and help spot data issues or interesting patterns.
  • Center: Describes the most typical value (central tendency); can be estimated visually but benefits from calculation.
  • Spread: Measures variability; range (maximum minus minimum) provides a rough visual estimate.

🎨 Understanding Shape

🎨 Symmetry vs. Skewness

| Distribution type | Description | Tail behavior |
| --- | --- | --- |
| Symmetric | Tails trail off roughly equally in both directions | Balanced on both sides |
| Right-skewed | Data trail off to the right with a longer right tail | Long tail extends right |
| Left-skewed | Data trail off to the left with a longer left tail | Long tail extends left |

  • Example: Most loans have rates under 15%, while only a handful have rates above 20% → this creates a right-skewed distribution with a long right tail.
  • Histograms are the best graphical choice for identifying shape in most situations.

🏔️ Modality (number of peaks)

Modality: the number of prominent peaks in a distribution, where a mode is represented by a prominent peak.

Three main types:

  • Unimodal: One prominent peak
  • Bimodal: Two prominent peaks
  • Multimodal: More than two prominent peaks

Important distinction: In data analysis, a mode is a prominent peak in the distribution, not necessarily the value with the most occurrences (the traditional math class definition). Many real-world datasets have no repeated values, making the traditional definition impractical.

  • Don't confuse: A small secondary peak that differs from neighboring bins by only a few observations is not counted as a separate mode.
  • The goal isn't finding a "correct" number of modes but better understanding your data—"prominent" is intentionally not rigorously defined.

🎯 Identifying Outliers

🎯 What outliers are

Outliers: extreme values that stick out visually from the rest of the data points.

  • Sometimes outliers are obvious in a histogram (e.g., a single bar far separated from the main cluster).
  • Other times they only show up upon careful examination of dot plots or through numerical methods.

🎯 Why outliers matter

Examining data for outliers serves three useful purposes:

  1. Identifying skewness in the distribution
  2. Catching errors in data collection or data entry
  3. Revealing insights into interesting properties of the data
  • Future sections will discuss numerical methods to "officially" identify outliers and how to deal with them.
  • Don't ignore outliers—they often contain valuable information about your data.

📏 Center and Spread Basics

📏 Center (central tendency)

  • Describes the distribution's most typical value.
  • Can be estimated visually from graphs.
  • More robust and appropriate calculated measures will be covered in future sections.
  • Example: Looking at a distribution, the center might appear to be roughly at value 6.

📏 Spread (variability)

  • A rough visual measure: range = maximum – minimum
  • Example: If data ranges from 3 to 7, the range is 7 – 3 = 4.
  • Like center, more robust and appropriate calculated measures exist and will be discussed later.
  • The range gives a quick sense of how dispersed the data are but is limited because it only uses two values.

🔬 Practical Analysis Example

🔬 Applying SOCS systematically

When analyzing a distribution, work through each component:

Shape: Identify if it's symmetric or skewed (and which direction).

  • Example: "The distribution is left-skewed."

Outliers: Look for values that stand out visually.

  • Example: "3 may be a potential outlier."

Center: Estimate where the typical value lies.

  • Example: "The center appears to be roughly 6."

Spread: Calculate or estimate the range.

  • Example: "Range of 7 – 3 = 4 for a rough measure of spread."

Modality: Count prominent peaks.

  • Example: "The distribution appears to be unimodal with a mode of 7."

🔬 Critical interpretation questions

Beyond just describing, ask interpretive questions:

  • Where do the data cluster?
  • How might you interpret the clustering?
  • Would different populations show the same pattern? Why or why not?
11

Measures of Location and Outliers

2.5 Measures of Location and Outliers

🧭 Overview

🧠 One-sentence thesis

Measures of location—percentiles, quartiles, and the median—quantify where observations stand within a distribution and provide the foundation for formally identifying outliers using fence rules.

📌 Key points (3–5)

  • What measures of location do: they quantify where an observation stands relative to the rest of the distribution, not just absolute values.
  • Percentiles vs quartiles: percentiles divide ordered data into hundredths; quartiles are special percentiles (25th, 50th, 75th) that divide data into quarters.
  • Two directions of percentile work: finding the kth percentile of a distribution (e.g., what score is the 90th percentile?) or finding which percentile a given value represents.
  • Common confusion: scoring in the 90th percentile does NOT mean you got 90% correct; it means 90% of scores are at or below yours.
  • Why it matters: these measures enable formal outlier detection through the fence rule and provide context for comparing individual observations to the group.

📊 Percentiles: comparing values in context

📊 What a percentile means

Percentile: indicates the relative standing of a data value when data are sorted from smallest to largest; the percentage of data values that are less than or equal to the pth percentile.

  • A percentile is about relative position, not absolute performance.
  • Low percentiles correspond to lower data values; high percentiles correspond to higher data values.
  • Example: being in the 75th percentile means 75% of observations are at or below your value, and 25% are at or above.

Don't confuse: the 90th percentile does NOT mean "90% correct" on a test—it means your score is higher than 90% of all scores.

🔍 Finding the kth percentile of a distribution

When you want to know "what value represents the 70th percentile?":

  • Order data from smallest to largest.
  • Calculate i = (k/100) × (n + 1), where k is the desired percentile and n is the total number of data points.
  • If i is a whole number, the kth percentile is the data value at position i.
  • If i is not a whole number, round i down and up to nearest integers, then average the two data values at those positions.

Example from the excerpt: For 29 ages with k = 70, i = (70/100) × 30 = 21. The 21st value in the ordered list is 64, so the 70th percentile is 64 years.
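The locator method above can be sketched in Python (the function name `percentile_value` is mine, not from the text):

```python
def percentile_value(data, k):
    """kth percentile via the locator i = (k/100)(n + 1).

    If i is a whole number, take the value at position i (1-based);
    otherwise average the values at the positions just below and above i.
    """
    xs = sorted(data)
    i = (k / 100) * (len(xs) + 1)
    if i == int(i):
        return xs[int(i) - 1]             # 1-based position -> 0-based index
    lo = int(i)                           # position rounded down (1-based)
    return (xs[lo - 1] + xs[lo]) / 2      # average the two neighboring values

# 29 ordered ages 1..29 (stand-in data): i = 0.70 * 30 = 21,
# so the 70th percentile is the 21st ordered value.
ages = list(range(1, 30))
print(percentile_value(ages, 70))  # 21
```

Note that other software uses slightly different percentile conventions, so results near interval boundaries can differ from this textbook locator.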

🔍 Finding the percentile of a given value

When you want to know "what percentile does a score of 58 represent?":

  • Order data from smallest to largest.
  • Count x = number of values below the target value (not including it).
  • Count y = number of values equal to the target value.
  • Calculate: [(x + 0.5y) / n] × 100, then round to nearest integer.

Example from the excerpt: For the value 58 with 18 values below it and 1 equal to it, the calculation gives approximately 64, so 58 is the 64th percentile.
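The reverse direction can be sketched the same way (assumed helper name `percentile_of`; the sample data below only mirrors the excerpt's counts, not its actual scores):

```python
def percentile_of(data, value):
    """Percentile rank of `value`: round([(x + 0.5y) / n] * 100)."""
    xs = sorted(data)
    x = sum(1 for v in xs if v < value)   # values strictly below the target
    y = sum(1 for v in xs if v == value)  # values equal to the target
    return round((x + 0.5 * y) / len(xs) * 100)

# 18 values below 58, one equal to it, ten above (n = 29):
scores = [50] * 18 + [58] + [60] * 10
print(percentile_of(scores, 58))  # 64
```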

🎯 Interpreting percentiles properly

A complete interpretation should include:

  • Context of the situation.
  • The data value representing the percentile.
  • The percent of individuals/items below the percentile.
  • The percent of individuals/items above the percentile.

Value judgment depends on context: In a timed exam, the first quartile (25th percentile) of 35 minutes means 25% finished in 35 minutes or less—a low percentile is "good" here because finishing faster is desirable. In other contexts, high percentiles might be preferred.

🔢 Quartiles and the five-number summary

🔢 What quartiles are

Quartiles: special percentiles that divide ordered data into quarters.

| Quartile | Equivalent percentile | Also called |
| --- | --- | --- |
| Q₁ (first quartile) | 25th percentile | Lower quartile |
| Q₂ (second quartile) | 50th percentile | Median |
| Q₃ (third quartile) | 75th percentile | Upper quartile |

📏 The median as a measure of location

Median: the number that separates ordered data into halves; half the values are at or below it, half are at or above it.

  • The median may or may not be an actual observed value—it's a position marker.
  • With an even number of observations, interpolate by averaging the two middle values.
  • Example from excerpt: For 14 ordered values, the median falls between the 7th (6.8) and 8th (7.2) values, so median = (6.8 + 7.2) / 2 = 7.

🗂️ Finding quartiles

Treat quartiles as the median of each half:

  • Find the overall median (Q₂) first.
  • Q₁ is the median of the lower half of the data.
  • Q₃ is the median of the upper half of the data.
  • If halves have an even count, interpolate or treat Q₁ and Q₃ as the 25th and 75th percentiles.

Example from excerpt: Dataset 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 has median 7, which splits into two 7-value halves. Q₁ = 2 (middle of lower half), Q₃ = 9 (middle of upper half).

📋 Five-number summary

A quick way to summarize a dataset:

  1. Minimum
  2. Q₁
  3. Median
  4. Q₃
  5. Maximum

This summary provides the building blocks for outlier detection and box plots.
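The median-of-halves procedure can be sketched in Python (function names are mine; this follows the textbook's convention of excluding the overall median from the halves when n is odd):

```python
def median(xs):
    """Middle value of ordered data; average the two middle values if n is even."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(data):
    """(min, Q1, median, Q3, max), with Q1/Q3 as medians of the halves."""
    xs = sorted(data)
    n = len(xs)
    lower, upper = xs[: n // 2], xs[(n + 1) // 2 :]  # middle value dropped if n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
print(five_number_summary(data))  # (1, 2, 7.0, 9, 11.5)
```

This reproduces the excerpt's example: median 7, Q₁ = 2, Q₃ = 9.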

🚨 Identifying outliers with the fence rule

🚨 Interquartile range (IQR)

IQR = Q₃ – Q₁

  • Measures the spread of the middle 50% of the data.
  • Used both as a measure of spread and as the basis for outlier detection.

🚧 The fence rule for outliers

Formal method to decide if a value is an outlier:

  • Lower fence (LF) = Q₁ – 1.5 × IQR
  • Upper fence (UF) = Q₃ + 1.5 × IQR
  • Any value less than LF or greater than UF is flagged as a potential outlier.

Example from excerpt: For the Sharpe Middle School exercise data with Q₁ = 20, Q₃ = 60, IQR = 40:

  • UF = 60 + 1.5(40) = 120
  • The value 300 is greater than 120, so it is a potential outlier.

Important: Potential outliers always require further investigation—they may be errors, abnormalities, or key insights.
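The fence rule is mechanical enough to sketch directly (helper names are mine):

```python
def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(data, q1, q3):
    """Values falling outside the fences, flagged for further investigation."""
    lf, uf = fences(q1, q3)
    return [x for x in data if x < lf or x > uf]

# Sharpe Middle School figures from the excerpt: Q1 = 20, Q3 = 60, IQR = 40.
print(fences(20, 60))  # (-40.0, 120.0); 300 exceeds 120, so it is flagged
```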

⚠️ Why outliers matter

  • They can distort summaries and conclusions.
  • Small samples are especially sensitive—the excerpt notes that 15 students is small, and the principal should survey more to confirm results.
  • Outliers may reveal important patterns or data quality issues.

📦 Box plots: visualizing the five-number summary

📦 What a box plot shows

A box plot graphically displays:

  • A rectangular box from Q₁ to Q₃ (contains the middle 50% of data).
  • A line inside the box at the median.
  • Whiskers extending from the box to the minimum and maximum values.
  • Sometimes dots marking outliers (when fence rules are used, whiskers extend only to the most extreme values inside the fences, and outliers are plotted as individual points).

🎨 Constructing a box plot

Steps:

  1. Start with a scaled number line (horizontal or vertical).
  2. Mark the five-number summary positions.
  3. Draw the box from Q₁ to Q₃.
  4. Draw a line (often dashed) at the median.
  5. Extend whiskers from the box edges to the min and max (or to the fences if marking outliers separately).

🔍 Reading a box plot

From the student heights example (59 to 77 inches):

  • Each quarter contains approximately 25% of the data.
  • The spread of each quarter can differ: second quarter spread = 1.5 inches (smallest), fourth quarter spread = 7 inches (largest).
  • The middle 50% of data (the box) has a range equal to the IQR.
  • More data can be packed into a smaller interval—the interval 59–65 has more than 25% of the data even though it's smaller than some other intervals.

🔄 Special cases in box plots

When values coincide:

  • If the median equals Q₃, no dashed line appears inside the box; the right edge represents both.
  • If the minimum equals Q₁, the left whisker disappears.
  • Example from excerpt: If min = Q₁ = 1, median = Q₃ = 5, max = 7, at least 25% of values equal 1, and at least 25% equal 5.

Key insight: Box plots quickly reveal concentration, spread, and symmetry of data, making them powerful tools for comparing distributions.

12

Measures of Center

2.6 Measures of Center

🧭 Overview

🧠 One-sentence thesis

The choice between mean, median, and mode as the best measure of a dataset's center depends on the distribution's shape and the presence of outliers, with the median being more robust to extreme values than the mean.

📌 Key points (3–5)

  • Three measures of center: mean (average), median (middle value), and mode (most frequent value) each describe the "typical value" differently.
  • When to use which: the mean is most common but sensitive to outliers; the median is more robust and better for skewed data or datasets with extreme values.
  • Shape matters: in symmetrical distributions, mean and median are similar; in skewed distributions, they diverge—the mean is pulled toward the tail.
  • Common confusion: the location of the median (which position) vs. the value of the median (the actual number); also, the mean can be misleading when outliers are present.
  • Grouped data estimation: when only frequency tables are available, the mean must be estimated using interval midpoints rather than exact values.

📊 The three measures explained

📊 Mean (average)

Sample mean (denoted x̄, "x bar"): calculated by adding all values and dividing by the number of values; population mean (denoted μ, "mu"): the mean of the entire population.

  • Most common measure of center.
  • Technical term: "arithmetic mean."
  • Calculation: sum all values, divide by the count.
  • Example: For the sample {1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4}, the mean = (1+1+1+2+2+3+4+4+4+4+4)/11 = 2.7.
  • Weakness: sensitive to extreme values—one very large or small value can pull the mean away from the "typical" value.

📊 Median (middle value)

Median (denoted M): the value that splits the ordered data into two equal halves.

  • Generally better when extreme values or outliers are present because it is more robust (not affected by the precise numerical values of outliers).
  • Finding the median:
    • First, arrange data in ascending order.
    • Use the location function: (n+1)/2, where n is the sample size.
    • If n is odd: the median is the middle value at that position.
    • If n is even: the location function gives a decimal ending in .5; average the values at positions n/2 and (n/2)+1.
  • Example: For 40 data values, location = (40+1)/2 = 20.5, so the median is the average of the 20th and 21st values.
  • Don't confuse: the location (which position) vs. the value (the actual number).

📊 Mode (most frequent)

Mode: the value that occurs most frequently in the dataset.

  • A dataset can have more than one mode if multiple values share the highest frequency (bimodal = two modes).
  • Example: In {430, 430, 480, 480, 495}, both 430 and 480 are modes (bimodal).
  • When useful: can reveal what "most people" experience, which may differ from the mean.
  • Example: A weight loss program advertises a mean loss of six pounds, but the mode might show most people lose only two pounds.
  • Special note: mode works for categorical data (e.g., "red" is the mode in {red, red, red, green, yellow}), but for quantitative data, the mode may not always describe where data clusters well.

🔄 How shape affects measures of center

🔄 Symmetrical distributions

  • When a distribution is symmetrical (mirror image on both sides of a vertical line), the mean and median are the same or very close.
  • If the distribution is also unimodal (one mode), the mode equals the mean and median.
  • Example: Dataset {4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10} has mean = median = mode = 7.

🔄 Skewed left (negative skew)

  • The distribution is "pulled out" to the left (left tail is longer).
  • Order relationship: mode > median > mean.
  • The mean is smallest because it is pulled toward the left tail.
  • Example: Dataset {4, 5, 6, 6, 6, 7, 7, 7, 7, 8} has mode = 7, median = 6.5, mean = 6.3.

🔄 Skewed right (positive skew)

  • The distribution is "pulled out" to the right (right tail is longer).
  • Order relationship: mode < median < mean.
  • The mean is largest because it is pulled toward the right tail.
  • Example: Dataset {6, 7, 7, 7, 7, 8, 8, 8, 9, 10} has mode = 7, median = 7.5, mean = 7.7.

🔄 Summary table

| Distribution shape | Relationship | Mean behavior |
| --- | --- | --- |
| Symmetrical | Mean ≈ Median ≈ Mode | Centrally located |
| Skewed left | Mean < Median < Mode | Pulled toward left tail |
| Skewed right | Mode < Median < Mean | Pulled toward right tail |

Key insight: The mean reflects skewing more than the median does.

💡 Choosing the right measure

💡 When median is better than mean

  • Presence of outliers: The median is not affected by extreme values.
  • Example: In a town of 50 people, 49 earn $30,000 and one earns $5,000,000. The mean is $129,400, but the median is $30,000. The median better represents the "typical" income because the $5,000,000 is an outlier.
  • Skewed distributions: The median stays closer to where most data values lie.

💡 When mean is appropriate

  • Symmetrical distributions without outliers: The mean uses all data values and is the most common measure.
  • When every value matters: The mean incorporates the magnitude of every data point.

💡 When mode is useful

  • Categorical data: The mode is the only measure of center that works for categories.
  • Revealing common experience: Shows what value occurs most often, which can differ from the average.

📐 Estimating the mean from grouped data

📐 The challenge

  • When data is grouped into intervals (frequency tables), individual values are unknown.
  • We cannot calculate an exact mean, only an estimate.

📐 The method

  • Step 1: Find the midpoint of each interval using: midpoint = (lower boundary + upper boundary) / 2.
  • Step 2: Multiply each midpoint by its frequency (f × m).
  • Step 3: Sum all products and divide by the total number of data values: mean estimate = Σ(f × m) / Σf.

📐 Example walkthrough

For a statistics test with intervals and frequencies:

  • Interval 50–56.5 (frequency 1): midpoint = 53.25, product = 53.25.
  • Interval 62.5–68.5 (frequency 4): midpoint = 65.5, product = 262.
  • (Continue for all intervals...)
  • Sum of all products = 1,460.25.
  • Total number of students = 19.
  • Estimated mean = 1,460.25 / 19 = 76.86.

Don't confuse: This is an estimate, not the exact mean, because we assume all values in an interval are at the midpoint.
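The three steps can be sketched as a short Python function (the function name and the two-class sample table are mine, not the excerpt's full 19-student table):

```python
def grouped_mean(classes):
    """Estimate a mean from (lower, upper, frequency) class intervals,
    treating every value in a class as sitting at the midpoint."""
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # sum of f * m
    n = sum(f for _, _, f in classes)                        # total observations
    return total / n

# Hypothetical two-class table: midpoints 55 and 65,
# so the estimate is (2*55 + 3*65) / 5 = 61.
print(grouped_mean([(50, 60, 2), (60, 70, 3)]))  # 61.0
```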

13

Measures of Spread

2.7 Measures of Spread

🧭 Overview

🧠 One-sentence thesis

Measures of spread quantify how dispersed data values are around the center, with the standard deviation being the most common measure and z-scores allowing standardized comparisons across different datasets.

📌 Key points (3–5)

  • What spread measures: the variability or dispersion of data values, complementing measures of center in the SOCS framework (Shape, Outliers, Center, Spread).
  • Two main measures: the interquartile range (IQR) is robust for skewed data with outliers; the standard deviation measures average deviation from the mean for symmetric data.
  • Standard deviation interpretation: small values mean data cluster near the mean; large values mean data are more dispersed; it must always be ≥ 0.
  • Common confusion: sample vs population formulas—sample standard deviation divides by (n – 1), population divides by N; this gives a better estimate of population variance.
  • Practical application: z-scores standardize observations by expressing them as number of standard deviations from the mean, enabling comparisons across different scales.

📏 Understanding spread and variability

📏 What spread means

Spread (also known as variation or variability): how concentrated or dispersed data values are in a distribution.

  • Spread complements the center—two datasets can have the same mean but very different spreads.
  • Some datasets have values concentrated closely together; others have values more spread out.
  • The shape of the distribution and presence of extreme values determine which measure is most appropriate.

🎯 When to use each measure

| Measure | Best for | Why |
| --- | --- | --- |
| IQR | Skewed distributions with outliers | Robust; not affected by extreme values; describes middle 50% |
| Standard deviation | Symmetric distributions | Captures overall variation; uses all data points |

📊 The Interquartile Range (IQR)

📊 What IQR measures

IQR = Q₃ – Q₁

  • Indicates the spread of the middle half (middle 50%) of the data.
  • A "rough but very robust" measure when outliers may be present.
  • Often used alongside the median to describe skewed distributions.
  • Example: showing the five-number summary or box plot gives all information for a skewed dataset in one place.

🛡️ Why IQR is robust

  • Not affected by extreme values because it only uses the middle 50% of data.
  • Don't confuse: IQR measures spread of the middle portion only, not the entire dataset's spread.

📐 The Standard Deviation

📐 What standard deviation measures

Standard deviation: a measure of spread that assesses how dispersed values are from their mean; essentially the "average" deviation—the distance of each observation from the mean.

Notation:

  • Sample: s (standard deviation), s² (variance)
  • Population: σ (sigma, standard deviation), σ² (variance)

🔍 Interpreting standard deviation values

  • Small standard deviation: data concentrated close to the mean, little variation.
  • Large standard deviation: data more spread out from the mean, more variation.
  • Always ≥ 0: zero means no spread (all values equal); larger values mean more spread.

Example: Two supermarkets both have 5-minute average wait times. Supermarket A has standard deviation = 2 minutes; Supermarket B has standard deviation = 4 minutes. Supermarket B has more variation—wait times are more spread out from the average.

⚙️ How standard deviation is calculated

The calculation involves several steps:

  1. Find deviations: For each value, calculate (x – mean).

    • Positive deviation: data value > mean
    • Negative deviation: data value < mean
    • Problem: deviations always sum to zero
  2. Square the deviations: Makes all values positive; sum will also be positive.

  3. Calculate variance: Average of the squared deviations.

    • Population variance (σ²): divide by N (total population size)
    • Sample variance (s²): divide by (n – 1) (sample size minus one)
  4. Take square root: Standard deviation is the square root of variance.

⚠️ Sample vs population formulas

Why divide by (n – 1) for samples instead of n?

  • The sample variance estimates the population variance.
  • Based on theoretical mathematics, dividing by (n – 1) gives a better estimate of the population variance.
  • Don't confuse: you may need to indicate on your technology which formula to use.
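The four calculation steps, with the sample/population divisor choice, can be sketched from scratch (function names are mine; Python's `statistics` module offers `stdev`/`pstdev` equivalents):

```python
from math import sqrt

def variance(data, sample=True):
    """Average squared deviation from the mean.

    Divide by n - 1 for a sample estimate, by N for a full population."""
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)  # sum of squared deviations
    return ss / (n - 1) if sample else ss / n

def stdev(data, sample=True):
    """Standard deviation: square root of the variance."""
    return sqrt(variance(data, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, squared deviations sum to 32
print(stdev(data, sample=False))  # 2.0 with the population formula
```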

🧮 Using standard deviation in context

🧮 The number line interpretation

Standard deviation can be visualized on a number line:

  • General formula: value = mean + (#ofSTDEVs)(standard deviation)
  • #ofSTDEVs = number of standard deviations (doesn't need to be an integer)

Sample formula: x = x̄ + (#ofSTDEVs)(s)
Population formula: x = μ + (#ofSTDEVs)(σ)

Example: If mean = 5 and standard deviation = 2:

  • 7 is one standard deviation to the right: 5 + (1)(2) = 7
  • 1 is two standard deviations to the left: 5 + (–2)(2) = 1

📉 When standard deviation is most helpful

  • Symmetric distributions: standard deviation is very helpful.
  • Skewed distributions: standard deviation may not be much help because the two sides have different spreads.
  • For skewed data, better to look at: first quartile, median, third quartile, smallest value, largest value.
  • Always graph your data: use histograms or box plots to understand the distribution.

🎯 Z-scores: Standardizing observations

🎯 What z-scores measure

z-score: represents the number of standard deviations between a given observation and its mean.

Formulas:

  • Sample: z = (x – x̄) / s
  • Population: z = (x – μ) / σ

🔄 Why z-scores are useful

Comparing across different scales:

  • If two datasets have different means and standard deviations, comparing raw values directly can be misleading.
  • z-scores put things on a "level playing field" for comparison.
  • Calculate how many standard deviations each value is from its mean, then compare.

Example: John's GPA is 2.85 (school mean = 3.0, SD = 0.7). Ali's GPA is 77 (school mean = 80, SD = 10).

  • John's z-score: (2.85 – 3.0) / 0.7 = –0.21
  • Ali's z-score: (77 – 80) / 10 = –0.3
  • John has the better GPA relative to his school because –0.21 > –0.3 (closer to the mean).
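The comparison reduces to one line of arithmetic per student (helper name `z_score` is mine):

```python
def z_score(x, mean, sd):
    """Number of standard deviations between an observation and its mean."""
    return (x - mean) / sd

john = z_score(2.85, 3.0, 0.7)  # about -0.21
ali = z_score(77, 80, 10)       # -0.3
print(john > ali)               # True: John sits closer to his school's mean
```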

🚨 Identifying unusual observations

Rule of thumb for symmetric, bell-shaped distributions:

  • Observations with z-scores below –2 or greater than +2 are considered "unusual."
  • Generally, an observation should be within ±2 standard deviations 95% of the time.
  • This is an approximate guideline, not a rigid rule.
  • Don't confuse: this is different from the fence rules for outliers (which use IQR).

Example: US male heights are symmetric and bell-shaped with mean = 69.7 inches, SD = 2.8 inches. "Unusually" tall would be: 69.7 + (2 × 2.8) = 75.3 inches or taller.

14

Chapter 2 Wrap-Up: Descriptive Statistics Review

Chapter 2 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

Chapter 2 covers the fundamental tools for organizing, displaying, and summarizing data through both graphical and numerical methods, distinguishing between categorical and quantitative data types.

📌 Key points (3–5)

  • Two main approaches: descriptive statistics include graphical methods (visual displays) and numerical methods (calculated measures).
  • Data type distinction: categorical data (categories/labels) vs. quantitative data (numerical measurements), each requiring different display and summary techniques.
  • Key numerical summaries: measures of center (mean, median, mode), measures of location (median, outliers), and measures of spread (standard deviation, variance, z-score).
  • Common confusion: sample vs. population—sample mean/standard deviation describe a subset, while population parameters describe the entire group.
  • Distribution characteristics: shape, center, spread, outliers, and modality together describe a quantitative distribution.

📊 Core descriptive statistics framework

📊 What descriptive statistics are

Descriptive statistics: methods for organizing, displaying, and summarizing data.

The chapter distinguishes two main types:

  • Graphical descriptive methods: visual representations of data
  • Numerical descriptive methods: calculated summary values

🔢 Distribution concept

Distribution: the pattern of frequencies across different values or categories.

Key related terms:

  • Frequency: how often each value or category appears
  • Relative frequency: the proportion or percentage of observations in each category
  • Cumulative relative frequency: running total of relative frequencies

🏷️ Data types and their methods

🏷️ Categorical data

Categorical data: data that represents categories or labels rather than numerical measurements.

  • The primary summary measure is the mode (most frequent category)
  • Variability describes how spread out observations are across categories
  • Requires different display methods than quantitative data

🔢 Quantitative data

Quantitative data: numerical measurements that can be ordered and have meaningful mathematical operations.

Two subtypes mentioned:

  • Discrete data: countable values (often whole numbers)
  • Ordinal categorical data: categories with a natural order

📏 Describing quantitative distributions

📏 Four key characteristics (SOCS)

The chapter emphasizes four features for describing quantitative distributions:

| Characteristic | What it describes |
| --- | --- |
| Shape | The overall pattern (symmetric, skewed, bell-shaped) |
| Outliers | Observations far from the typical pattern |
| Center | The typical or middle value |
| Spread | How much variability exists (also called variation or variability) |

🎯 Modality

Modality: the number of peaks or modes in a distribution.

  • Helps describe the shape of the distribution
  • Example: a distribution might have one peak (unimodal), two peaks (bimodal), or more

📍 Measures of center and location

📍 Mean (average)

Mean (average): the sum of all values divided by the number of observations.

Two types distinguished:

  • Sample mean: calculated from a subset of data
  • Population mean: calculated from all data in the entire group

📍 Median

Median: the middle value when data are ordered from smallest to largest.

  • A measure of location that divides the data in half
  • Related to identifying outliers (values far from the center)

📍 Robust measures

Robust: a measure that is not strongly affected by outliers or extreme values.

  • The median is more robust than the mean
  • Important when deciding which center measure to use

📐 Measures of spread

📐 Standard deviation

Standard deviation: a measure of how far observations typically fall from the mean.

The chapter distinguishes:

  • Sample standard deviation: calculated from sample data
  • Population standard deviation: calculated from all population data

📐 Variance

Variance: the square of the standard deviation; another measure of spread.

  • Related to standard deviation but in squared units
  • Both describe how dispersed the data are

📐 Z-score

Z-score: the number of standard deviations an observation falls from the mean.

The "rule of thumb" example:

  • Observations within ±2 standard deviations are considered typical (approximately 95% of data in bell-shaped distributions)
  • Observations beyond ±2 standard deviations may be considered unusual
  • Example given: US male heights average 69.7 inches with 2.8 inch standard deviation → 69.7 + (2 × 2.8) = 75.3 inches would be unusually tall
  • Don't confuse: this is an approximate guideline, not a rigid rule

🔧 Frequency distribution tools

🔧 Class structure

For organizing data into groups:

  • Lower class limit: the smallest value that can belong to a class
  • Upper class limit: the largest value that can belong to a class
  • Class width: the range covered by each class
  • Class midpoint: the middle value of a class interval

These tools help create organized frequency tables and histograms for quantitative data.
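A small sketch of how class limits, width, and midpoints combine into a frequency table; the data values and the choice of 5 classes are hypothetical, for illustration only:

```python
# Hypothetical sketch: building class limits for a frequency table.
import math

data = [12, 15, 21, 23, 25, 28, 31, 34, 38, 40]
num_classes = 5
width = math.ceil((max(data) - min(data)) / num_classes)  # class width

classes = []
lower = min(data)                                   # first lower class limit
for _ in range(num_classes):
    upper = lower + width - 1                       # upper class limit
    midpoint = (lower + upper) / 2                  # class midpoint
    freq = sum(lower <= x <= upper for x in data)   # class frequency
    classes.append((lower, upper, midpoint, freq))
    lower = upper + 1

for lo, hi, mid, f in classes:
    print(f"{lo}-{hi}  midpoint={mid}  freq={f}")
```

Every observation falls into exactly one class, so the frequencies sum to the number of observations.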


Introduction to Bivariate Data

3.1 Introduction to Bivariate Data

🧭 Overview

🧠 One-sentence thesis

Bivariate data analysis examines relationships between two variables—whether both categorical, both quantitative, or one of each—using visual displays like comparative box plots, contingency tables, and bar chart variations to reveal patterns and dependencies.

📌 Key points (3–5)

  • What bivariate data means: data involving two variables that may be related (e.g., exam grades and final grades, education and income).
  • Three types of bivariate relationships: categorical vs. categorical, quantitative vs. quantitative, or categorical vs. quantitative.
  • Visual tools for categorical grouping: comparative box plots work better than overlaid histograms when displaying a quantitative variable broken down by a categorical grouping variable.
  • Contingency tables: display sample values for two categorical variables in a table format, showing frequencies and facilitating probability calculations.
  • Common confusion: stacked vs. grouped bar charts—both show two categorical variables, but stacked bars show composition within each category while grouped bars place bars side-by-side for easier comparison.

📊 Types of bivariate relationships

📊 The three combinations

The excerpt identifies three possible pairings:

| Type | Description | Example from excerpt |
| --- | --- | --- |
| Categorical vs. Categorical | Both variables are categories | Driver cell phone use vs. speeding violations |
| Quantitative vs. Quantitative | Both variables are numerical | Student's second exam grade vs. final exam grade |
| Categorical vs. Quantitative | One category, one number | Profession (categorical) vs. income (quantitative) |

🎯 Focus of the chapter

  • The section briefly covers categorical vs. quantitative displays.
  • It then focuses on two categorical variables.
  • The rest of the chapter (not shown in excerpt) will focus on two quantitative variables.

📈 Displaying quantitative variables with categorical grouping

📈 When to use this approach

This applies to situations where we have a quantitative response variable being measured and want to further break it down by another categorical grouping variable.

  • The quantitative variable is the response (what you're measuring).
  • The categorical variable is the grouping or predictor (how you're dividing the data).

🔧 Visual options available

Basic overlays:

  • Overlaid line graphs
  • Overlaid histograms
  • These work when the bins for each group line up, but they often have limitations.

Better option:

  • Comparative box plot: the excerpt recommends this as "a better option" for most cases.
  • Shows distributions of the quantitative variable side-by-side for each category.

Specialized displays:

  • Heat maps: particularly suited when there is a geographical or spatial element.

⚠️ Numerical methods limitation

The excerpt notes that numerical methods exist for analyzing categorical response and quantitative predictor variables, but they are mathematically complicated and beyond the scope of the course.

🔢 Displaying two categorical variables

🔢 Building on univariate methods

The excerpt shows how bivariate categorical displays extend univariate (single-variable) tools:

  • Univariate frequency tables → Contingency tables
  • Univariate bar chart → Stacked or grouped bar charts

📋 Contingency tables

A contingency table portrays data in a way that can facilitate calculating probabilities. The table displays sample values in relation to two different variables that may be dependent or contingent on one another.

Key features:

  • Shows frequencies for combinations of two categorical variables.
  • Includes marginal totals (row totals and column totals).
  • The sum of row totals equals the sum of column totals, which equals the overall total.

Example structure from the excerpt:

  • Rows: cell phone use while driving (yes/no)
  • Columns: speeding violation in last year (yes/no)
  • Marginal row totals: 305 and 450 (sum = 755)
  • Marginal column totals: 70 and 685 (sum = 755)
  • Overall total: 755 people in the sample

What you can calculate:

  • Marginal frequencies (totals for each category)
  • Overall total
  • Marginal relative frequencies (proportions)
  • Conditional percentages (e.g., the percentage of cell phone users who had a speeding violation)
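The marginal arithmetic can be sketched with the cell phone / speeding example. The marginal totals (305, 450, 70, 685, 755) are from the text; the interior cell counts below are illustrative values chosen to be consistent with those marginals:

```python
# Sketch of contingency-table arithmetic for the cell-phone / speeding
# example. Marginal totals are from the text; the four interior cell
# counts are assumed values consistent with those marginals.

table = {
    ("cell phone: yes", "violation: yes"): 25,
    ("cell phone: yes", "violation: no"): 280,
    ("cell phone: no",  "violation: yes"): 45,
    ("cell phone: no",  "violation: no"): 405,
}

def row_total(row_label):
    return sum(v for (row, _), v in table.items() if row == row_label)

def col_total(col_label):
    return sum(v for (_, col), v in table.items() if col == col_label)

grand_total = sum(table.values())

print(row_total("cell phone: yes"), row_total("cell phone: no"))  # 305 450
print(col_total("violation: yes"), col_total("violation: no"))    # 70 685
print(grand_total)                                                # 755

# Conditional percentage: violations among cell phone users
pct = table[("cell phone: yes", "violation: yes")] / row_total("cell phone: yes")
print(round(100 * pct, 1))
```

The check that row totals and column totals both sum to the grand total (755) is exactly the property noted above.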

📊 Bar chart variations

Two main types provide more visual information than a contingency table alone:

Stacked bar charts:

  • Bars are divided into segments representing the second categorical variable.
  • Shows composition within each category of the first variable.

Grouped (side-by-side) bar charts:

  • Separate bars for each combination, placed next to each other.
  • Makes direct comparison between categories easier.

Don't confuse: Both show relationships between two categorical variables, but stacked charts emphasize composition/proportions within groups, while grouped charts emphasize direct comparison of frequencies across groups.


Visualizing Bivariate Quantitative Data

3.2 Visualizing Bivariate Quantitative Data

🧭 Overview

🧠 One-sentence thesis

Scatter plots reveal the shape, trend, and strength of relationships between two quantitative variables, helping us decide whether a linear model is appropriate and what kind of relationship exists.

📌 Key points (3–5)

  • Response vs explanatory variables: one variable (y, dependent) measures outcomes; the other (x, independent) explains changes in the response.
  • Scatter plot analysis workflow: start with a graph, look for overall patterns and deviations, use numerical descriptions (correlation), then consider a mathematical model (regression).
  • Three key features to observe: shape (linear or not), trend (positive or negative), and strength (tight clustering vs spread out).
  • Common confusion: strength is not always obvious from a scatter plot alone—points can appear somewhat scattered but still show a relationship; numerical measures help clarify strength.
  • Linear regression scope: "simple" linear regression means one independent variable and a linear relationship (two dimensions), not that the concepts are easy.

📊 Understanding variable roles

📊 Response variable (dependent variable)

Response variable (also called y, dependent variable, and predicted variable): measures or records an outcome of a study.

  • This is what you are trying to understand or predict.
  • It "responds" to changes in the other variable.
  • Example: In Amelia's basketball scenario, points scored in a game is the response variable.

📊 Explanatory variable (independent variable)

Explanatory variable (also called x, independent variable, and predictor variable): explains changes in the response variable.

  • This is the variable you think might cause or influence the outcome.
  • It "explains" why the response variable changes.
  • Example: In Amelia's scenario, hours practicing jump shot is the explanatory variable.

🔍 Deciding which is which

  • Ask: "Does changing one variable seem to lead to a change in the other?"
  • The excerpt notes we only calculate a regression line "if one of the variables helps to explain or predict the other variable."
  • Don't confuse: not all pairs of variables have a clear explanatory/response relationship; sometimes both just vary together without one causing the other.

🔬 The scatter plot workflow

🔬 Four-step process

The excerpt outlines a systematic approach when considering the relationship between two quantitative variables:

  1. Start with a graph (scatter plot): visualize the data first.
  2. Look for an overall pattern and deviations from the pattern: identify what the data show as a whole and note any outliers.
  3. Use numerical descriptions: apply correlation and coefficient of determination to quantify the relationship.
  4. Consider a mathematical model (regression): fit a line or other model if appropriate.

🎯 What to always note

When looking at a scatter plot, you always want to note:

  • Shape
  • Trend
  • Strength

These three features form the foundation of scatter plot interpretation.

🔷 Shape: Is the pattern linear?

🔷 Linear patterns

  • The excerpt focuses on linear patterns because "linear patterns are quite common."
  • A linear pattern means the points roughly follow a straight line.
  • Only when we see a linear pattern do we draw a line on the scatter plot.

🔷 When linear relationships are strong

The linear relationship is strong if the points are close to a straight line, except in the case of a horizontal line, which indicates no relationship.

  • Don't confuse: a horizontal line looks like a line but actually shows no relationship between the variables.
  • The excerpt notes "we are currently only interested in applying these ideas when we see a linear pattern," meaning other shapes exist but are not the focus here.

🔷 Other shapes

  • The excerpt mentions "we may see other shapes in a scatter plot" but does not detail them.
  • Example from the figure descriptions: exponential growth pattern (dots curve upward) or no pattern (random dots).

↗️ Trend: Positive or negative?

↗️ Positive trend

High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable.

  • A positive trend is seen when increasing x also increases y.
  • The points move upward from left to right.
  • Example: Amelia's practice hours and points scored show a positive trend—more practice leads to more points.

↘️ Negative (inverse) trend

High values of one variable occurring with low values of the other variable.

  • A negative trend is seen when increasing x appears to cause y to decrease.
  • The points move downward from left to right.
  • Example: One of the scatter plots in the excerpt shows dots moving from top left to bottom right.

➡️ No trend

  • When there is no correlation, the points are scattered randomly with no clear upward or downward direction.
  • Example: The excerpt describes a scatter plot that "appear[s] to have no correlation."

💪 Strength: How tightly clustered?

💪 Strong relationships

  • Points are clustered together closely around the linear pattern.
  • The excerpt describes this as points forming "an almost perfect line."
  • Example: "The data appear to be linear with a strong, positive correlation."

💪 Weak relationships

  • Points are more spread out from the linear pattern.
  • The excerpt describes this as "nowhere near a perfect line, but not completely random."
  • Example: "The data appear to be linear with a weak, negative correlation."

💪 Assessing strength

  • "The strength of a relationship is not always apparent in a scatter plot, but we will see them measured numerically in the future."
  • Don't confuse: weak does not mean no relationship—a weak linear relationship still shows a trend, just with more scatter around the line.

📐 Examples from the excerpt

📐 Strong positive correlation

The first example scatter plot shows:

  • Linear appearance: yes
  • Trend: positive (upward from left to right)
  • Strength: strong (points form "an almost perfect line")
  • Solution: "The data appear to be linear with a strong, positive correlation."

📐 Weak negative correlation

The second example scatter plot shows:

  • Linear appearance: yes
  • Trend: negative (downward from left to right)
  • Strength: weak (points spread out but not random)
  • Solution: "The data appear to be linear with a weak, negative correlation."

📐 No correlation

The third example scatter plot shows:

  • No clear pattern
  • Random dots all over the graph
  • Solution: "The data appear to have no correlation."

📐 Amelia's basketball practice

| Hours practicing (x) | Points scored (y) |
| --- | --- |
| 5 | 15 |
| 7 | 22 |
| 9 | 28 |
| 10 | 31 |
| 11 | 33 |
| 12 | 36 |

  • The task is to construct a scatter plot and determine if Amelia's hypothesis (more practice → more points) appears true.
  • Based on the data pattern (both variables increase together), the hypothesis would appear supported.

Measures of Association

3.3 Measures of Association

🧭 Overview

🧠 One-sentence thesis

The correlation coefficient and coefficient of determination provide numerical measures of how strongly two quantitative variables are linearly related, helping to quantify patterns that scatter plots reveal visually.

📌 Key points (3–5)

  • Why numerical measures matter: scatter plots show patterns, but numerical measures like r quantify the strength and direction of linear relationships more precisely.
  • What r tells us: the correlation coefficient ranges from –1 to +1; its value indicates strength (closer to ±1 = stronger), and its sign indicates direction (positive or negative).
  • What r² tells us: the coefficient of determination expresses what percent of variation in y can be explained by variation in x using the best-fit line.
  • Common confusion: strong correlation does not mean causation—correlation does not imply that x causes y or vice versa.
  • How to interpret: r close to 0 suggests little linear relationship, but always check the scatter plot because curved or other patterns may still exist.

📏 The correlation coefficient (r)

📏 What r measures

The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.

  • It is a single number that summarizes how tightly the data points cluster around a straight line.
  • The excerpt emphasizes that r measures linear association specifically—if the relationship is curved, r may not capture it well.
  • Technology is recommended for calculation because the formula is complex.

🔢 What the value of r tells us

  • Range: r is always between –1 and +1 (i.e., –1 ≤ r ≤ 1).
  • Strength: values of r close to –1 or +1 indicate a stronger linear relationship; values near 0 indicate a weaker or no linear relationship.
  • Perfect correlation: r = 1 means perfect positive correlation (all points lie on a straight line going up); r = –1 means perfect negative correlation (all points lie on a straight line going down). In the real world, perfect correlation rarely happens.
  • Zero correlation: r = 0 suggests likely no linear correlation, but the excerpt warns to always view the scatter plot—data with a curved or horizontal pattern may have r = 0 even though a relationship exists.

Example: If r = 0.6631 (as in the student exam scores example), the relationship is moderately strong and positive, but not perfect.

➕➖ What the sign of r tells us

  • Positive r: when x increases, y tends to increase; when x decreases, y tends to decrease (positive correlation).
  • Negative r: when x increases, y tends to decrease; when x decreases, y tends to increase (negative correlation).
  • Link to slope: the sign of r is the same as the sign of the slope of the best-fit line (b).

Example: In the matching exercise, scatter plot (a) shows points ascending from lower left to upper right → 0 < r < 1; scatter plot (b) shows points descending from upper left to lower right → –1 < r < 0; scatter plot (c) shows a horizontal pattern → r = 0.

⚠️ Correlation does not imply causation

  • The excerpt includes a key warning: strong correlation does not suggest that x causes y or y causes x.
  • This is a common confusion: just because two variables move together does not mean one is causing the other.
  • Always remember: "correlation does not imply causation."

📊 The coefficient of determination (r²)

📊 What r² measures

The coefficient of determination (r²) is the square of the correlation coefficient, usually stated as a percent rather than in decimal form.

  • It has a specific interpretation in context: r², when expressed as a percent, represents the percent of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
  • The remaining variation (1 – r²) represents the percent of variation in y that is not explained by variation in x using the regression line; this is seen as the scattering of the observed data points about the regression line.

🧮 How to calculate and interpret r²

  • Calculation: square the correlation coefficient. For example, if r = 0.6631, then r² = 0.6631² = 0.4397 (approximately 44%).
  • Interpretation in context: approximately 44% of the variation in the final exam scores can be explained by the variation in the grades on the third exam, using the best-fit regression line.
  • Unexplained variation: approximately 56% (1 – 0.44 = 0.56) of the variation in the final exam grades cannot be explained by the variation in the scores on the third exam, using the best-fit regression line. This is the scattering of the points about the line.

Example: In the student exam scores example, r = 0.6631 → r² ≈ 0.44 → about 44% of final exam score variation is explained by third exam scores; the remaining 56% is due to other factors or randomness.
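The r → r² arithmetic from the exam-score example, as a one-step check:

```python
# Squaring the correlation coefficient from the exam-score example.
r = 0.6631
r_squared = r ** 2
print(round(r_squared, 4))        # about 0.44 -> ~44% of variation explained
print(round(1 - r_squared, 4))    # about 0.56 -> ~56% unexplained
```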

🔍 Interpreting scatter plots and numerical measures together

🔍 Why visual inspection matters

  • The excerpt stresses that it is always good practice to first examine things visually.
  • Deciphering a scatter plot can be tricky, especially when it comes to the strength of a relationship, so numerical measures like r and r² are the next step.
  • However, numerical measures alone are not enough: if r = 0, there may still be a curved or other non-linear pattern visible in the scatter plot.

🔍 Matching patterns to correlation values

| Scatter plot pattern | Correlation coefficient range | Interpretation |
| --- | --- | --- |
| Points ascending from lower left to upper right | 0 < r < 1 | Positive linear relationship |
| Points descending from upper left to lower right | –1 < r < 0 | Negative linear relationship |
| Horizontal or random pattern | r = 0 | No linear relationship (but check for other patterns) |
| Points almost perfectly on an upward line | r close to +1 | Strong positive linear relationship |
| Points almost perfectly on a downward line | r close to –1 | Strong negative linear relationship |

Don't confuse: r = 0 does not mean "no relationship at all"—it means "no linear relationship." Always look at the scatter plot to see if a curved or other pattern exists.


Modeling Linear Relationships

3.4 Modeling Linear Relationships

🧭 Overview

🧠 One-sentence thesis

Linear regression allows us to fit a best-fit line through data points and use it to predict values of a dependent variable from an independent variable, though predictions are reliable only within the observed data range.

📌 Key points (3–5)

  • What linear regression does: fits a "line of best fit" (least-squares regression line) through scattered data to enable predictions.
  • The regression equation: ŷ = a + bx, where ŷ is the predicted y value, b is the slope (related to the correlation coefficient r), and a is the y-intercept.
  • Interpreting slope: the slope tells how much the dependent variable changes on average for every one-unit increase in the independent variable.
  • Common confusion: interpolation vs extrapolation—predictions are reliable only within the range of observed x values (interpolation), not outside that range (extrapolation).
  • Why it matters: regression enables practical predictions, but the y-intercept may not always make sense in context, and predictions outside the data domain are unreliable.

📐 The regression line and its equation

📐 What the line of best fit is

Line of best fit (least-squares regression line): a straight line that best represents the pattern in a scatter plot, minimizing the vertical distance of each data point to the line.

  • Data rarely fit a straight line perfectly, but we can make rough predictions if the scatter plot appears linear.
  • The process of fitting this line is called linear regression.
  • The line always passes through the point (mean of x, mean of y).

🧮 The regression equation

The equation is: ŷ = a + bx

  • ŷ (read "y hat"): the estimated or predicted value of y using the regression line; it may or may not equal the actual observed y values.
  • x: the independent variable (the predictor).
  • b (slope): calculated as b = r × (sy / sx), where r is the correlation coefficient, sy is the standard deviation of y values, and sx is the standard deviation of x values.
  • a (y-intercept): calculated using the slope and the means of x and y.

Example: In the exam score data (third exam score predicting final exam score), the regression line is ŷ = –173.51 + 4.83x.

🔍 Interpreting the slope

🔍 What the slope tells us

Slope (b): describes how changes in the independent variable relate to changes in the dependent variable.

  • Interpretation rule: For every one-unit increase in the independent variable (x), the dependent variable (y) changes by b units on average.
  • You should be able to write a sentence interpreting the slope in plain English, in the context of the data.

📊 Example interpretation

In the exam score example, the slope is b = 4.83.

  • Plain-language interpretation: For a one-point increase in the score on the third exam, the final exam score increases by 4.83 points on average.
  • Don't confuse: the slope is not the total change in y; it is the average change per one-unit increase in x.

🎯 Interpreting the y-intercept

🎯 What the y-intercept tells us

y-intercept (a): the predicted value of y when x is 0.

  • In some contexts, x = 0 makes sense; in many, it does not.
  • If x = 0 is not meaningful in the real-world context, the y-intercept may not be useful or interpretable.

📊 Example interpretation

In the exam score example, the y-intercept is a = –173.51.

  • Context check: It does not make sense for a student to score 0 on the third exam (unless they did not take it or try at all).
  • Conclusion: The y-intercept does not have a meaningful interpretation in this context.

🔮 Making predictions with the regression line

🔮 How to predict

Once you have the regression equation, substitute a value of x into the equation to predict the corresponding ŷ.

  • Reliable predictions: Only predict for x values within the range of the observed data (this is called interpolation).
  • Unreliable predictions: Predicting for x values outside the observed range is called extrapolation and is not reliable.

📊 Example predictions

Using the exam score regression line ŷ = –173.51 + 4.83x:

| x (third exam score) | Prediction method | ŷ (predicted final exam score) | Reliability |
| --- | --- | --- | --- |
| 73 | Within data range (65–75) | 179.08 | Reliable (interpolation) |
| 66 | Within data range (65–75) | 145.27 | Reliable (interpolation) |
| 90 | Outside data range (65–75) | 261.19 (impossible; max score is 100) | Unreliable (extrapolation) |

  • Why 90 is unreliable: The x values in the data are between 65 and 75. Even though you can plug 90 into the equation and calculate a y value, that prediction is not reliable because 90 is outside the domain of observed x values.
  • Don't confuse: being able to calculate a value does not mean the prediction is trustworthy.
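The predictions above can be sketched with a domain check built in, using the regression line and observed x-range (65–75) from the exam example:

```python
# Prediction with an interpolation check, using the exam-score line
# y_hat = -173.51 + 4.83x and the observed x-range 65-75 from the text.

def predict(x, a=-173.51, b=4.83, x_min=65, x_max=75):
    y_hat = a + b * x
    reliable = x_min <= x <= x_max   # interpolation vs. extrapolation
    return y_hat, reliable

y_hat, ok = predict(73)
print(round(y_hat, 2), ok)   # 179.08 True  -> interpolation, reliable
y_hat, ok = predict(90)
print(round(y_hat, 2), ok)   # 261.19 False -> extrapolation, not reliable
```

The calculation for x = 90 succeeds mechanically but is flagged as extrapolation, which is exactly the caution above.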

🎵 Practice scenario

Data on hours per week practicing a musical instrument (x) and math test scores (y) gives the line ŷ = 72.5 + 2.8x.

  • Question: Predict the math test score for a student who practices 5 hours per week.
  • Method: Substitute x = 5 into the equation: ŷ = 72.5 + 2.8(5) = 72.5 + 14 = 86.5.
  • Interpretation: A student who practices 5 hours per week is predicted to score 86.5 on the math test, on average.

⚠️ Cautions about regression

⚠️ Four key cautions

The excerpt introduces four main things to keep in mind when interpreting regression results:

  1. Linearity: Always plot a scatter diagram first. Only use regression if the scatter plot shows a linear relationship.
  2. Correlation does not imply causation: Even with a strong linear relationship and a reasonable r value, there can be confounding or lurking variables. Be wary of spurious correlations and ensure the connection makes sense.
  3. Extrapolation: Predictions are reasonable only within the domain of observed x values (interpolation), not outside that domain (extrapolation).
  4. Outliers and influential points: (The excerpt mentions this but does not elaborate further in the provided text.)

🔄 Causation confusion

  • Common issue: It may not be clear which variable affects the other.
  • Example: Does lack of sleep lead to higher stress levels, or do high stress levels lead to lack of sleep?
  • What we can do: Show an association exists, even if we cannot determine causation.

📏 Interpolation vs extrapolation

Interpolation: predicting y for x values inside the range of observed x values in the data.

Extrapolation: predicting y for x values outside the range of observed x values in the data.

  • Reliability: Interpolation is reasonable; extrapolation is not reliable.
  • Don't confuse: Just because you can calculate a prediction outside the data range does not mean it is trustworthy.

Cautions about Regression

3.5 Cautions about Regression

🧭 Overview

🧠 One-sentence thesis

Regression analysis is powerful but requires careful attention to linearity assumptions, the distinction between correlation and causation, the dangers of extrapolation, and the proper handling of outliers and influential points to avoid misuse and misinterpretation.

📌 Key points (3–5)

  • Four main cautions: linearity, correlation vs. causation, extrapolation, and outliers/influential points must all be checked when using regression.
  • Extrapolation vs. interpolation: predicting within the observed data range (interpolation) is reasonable, but predicting outside that range (extrapolation) can produce absurd results.
  • Common confusion—outliers vs. influential points: outliers are univariate (stick out in one variable), influential points are bivariate (don't follow the trend and affect the regression line slope); one does not imply the other.
  • Residuals as diagnostic tools: the difference between observed and predicted values (residuals) helps identify problematic points; points more than two standard deviations away warrant examination.
  • Don't automatically delete unusual points: outliers and influential points may contain valuable information or reveal data errors; always investigate the cause before removal.

⚠️ Four fundamental cautions

📊 Linearity assumption

  • Always plot a scatter diagram first before using regression methods.
  • Regression methods discussed here assume a linear relationship between variables.
  • If the scatter plot shows a non-linear pattern, linear regression is inappropriate even if calculations are possible.
  • Example: The CPI data example showed a significant correlation coefficient, but the scatterplot pattern indicated a curve would be more appropriate than a line.

🔗 Correlation does not imply causation

Even when there is an apparent linear relationship and a reasonable value of r, there can always be confounding or lurking variables at work.

  • Be wary of spurious correlations—make sure the connection makes sense.
  • Direction of causation may be unclear: Does lack of sleep lead to higher stress, or do high stress levels lead to lack of sleep?
  • Sometimes causation cannot be determined, but association can still be demonstrated.
  • Don't confuse: showing two variables move together (association) is not the same as proving one causes the other (causation).

🎯 Extrapolation dangers

📏 Interpolation vs. extrapolation definitions

Interpolation: the process of predicting inside of the observed x values in the data.

Extrapolation: the process of predicting outside of the observed x values in the data.

  • Predictions are reasonable only within the domain of x values in the sample data.
  • Extrapolation can produce absurd results that violate real-world constraints.

🚫 Why extrapolation fails

Example from the exam score data:

  • The data included third exam scores between 65 and 75.
  • Predicting for x = 73 (within range) is reasonable—this is interpolation.
  • Predicting for x = 50 (below range) is unreliable—this is extrapolation.
  • Predicting for x = 90 (above range) gave a final exam score of 261.19, which is impossible since the maximum score is 100.

Key lesson: The linear relationship observed in the sample range may not hold outside that range.

🔍 Outliers and influential points

🎯 Distinguishing the two concepts

| Concept | Definition | Impact |
| --- | --- | --- |
| Outlier | A point that sticks out from the rest in a single variable (univariate idea) | May or may not affect the regression line |
| Influential point | A point that does not follow the trend of the rest of the data (bivariate idea) | Can strongly affect the slope of the regression line |

  • One does not imply the other: a point can be an outlier but not influential, influential but not an outlier, both, or neither.
  • Example: A point may be an outlier in y but still fit the trend (not influential); another point may not be an outlier in x or y but still be influential (doesn't fit the trend).

🔬 Identifying outliers using residuals

Residual: the difference between the actual y value and the predicted y value, calculated as y₀ – ŷ₀ = ε₀.

  • The residual measures the vertical distance between the actual data point and the predicted point on the line.
  • Positive residual: the point lies above the line (line underestimates).
  • Negative residual: the point lies below the line (line overestimates).

Rough rule of thumb: Flag any point located further than two standard deviations above or below the best-fit line as a potential outlier.

📐 Mathematical identification methods

Graphical method:

  • Draw two extra lines parallel to the best-fit line, positioned at ±2s (two standard deviations) above and below.
  • Any points outside these boundary lines are potential outliers.
  • Example: For the exam data, Y₂ = –173.5 + 4.83x – 2(16.4) and Y₃ = –173.5 + 4.83x + 2(16.4) created the boundaries.

Numerical method:

  • Calculate each residual: y – ŷ.
  • Compare the absolute value of each residual to 2s.
  • If |y – ŷ| ≥ 2s, the point is a potential outlier.
  • Example: The student with third exam score 65 and final exam score 175 had a residual of 35, which exceeded 2(16.4) = 32.8.
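The numerical check above can be sketched directly with the line and s = 16.4 from the exam example (the text rounds the residual to 35; the unrounded value is about 34.56):

```python
# The |y - y_hat| >= 2s outlier check from the text, using the
# exam-score line y_hat = -173.51 + 4.83x and s = 16.4.

def is_potential_outlier(x, y, a=-173.51, b=4.83, s=16.4):
    residual = y - (a + b * x)          # vertical distance to the line
    return abs(residual) >= 2 * s, residual

flag, resid = is_potential_outlier(65, 175)
print(round(resid, 2), flag)   # residual ~34.56 exceeds 2s = 32.8 -> True
```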

🧮 Computing the standard deviation of residuals

The standard deviation of residuals (s) is calculated from the sum of squared errors (SSE):

Steps:

  1. Calculate each residual: εᵢ = yᵢ – ŷᵢ for all data points.
  2. Square each residual.
  3. Sum all squared residuals to get SSE.
  4. Divide by (n – 2) and take the square root: s = square root of [SSE / (n – 2)].

Note: We divide by (n – 2) as the degrees of freedom because the regression model involves two estimates.
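The steps above can be sketched as follows; the residual values are hypothetical, for illustration only:

```python
# Computing s = sqrt(SSE / (n - 2)) from a list of residuals.
# The residuals here are hypothetical, not from the book's data.
import math

residuals = [3.0, -1.5, 2.5, -4.0, 0.0, -2.0]
sse = sum(e ** 2 for e in residuals)   # sum of squared errors (SSE)
n = len(residuals)
s = math.sqrt(sse / (n - 2))           # n - 2 degrees of freedom
print(round(s, 3))
```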

🎲 Identifying influential points

High leverage points: Points that fall far from the line and can strongly influence the slope calculation.

Test for influence:

  • Remove the suspected point from the dataset.
  • Recalculate the regression line.
  • If the slope changes significantly, the point is influential.

Example from exam data:

  • Original line with outlier: ŷ = –173.51 + 4.83x, with r = 0.6631.
  • New line without outlier: ŷ = –355.19 + 7.39x, with r = 0.9121.
  • The correlation became stronger (closer to 1), indicating the removed point was influential.
  • The prediction for x = 73 changed from 179.08 to 184.28.
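The remove-and-refit test can be sketched with ordinary least squares. The data below are synthetic (not the book's exam scores); the last point deliberately breaks the trend of the others:

```python
# Sketch of the remove-and-refit test for an influential point,
# on synthetic data (not the book's exam scores).

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b   # (intercept a, slope b)

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 0]    # last point does not follow the trend

_, slope_all = fit_line(xs, ys)
_, slope_without = fit_line(xs[:-1], ys[:-1])
print(round(slope_all, 2), round(slope_without, 2))
# A large slope change after removal marks the point as influential.
```

Here the slope jumps from about 0.29 to exactly 2.0 once the trend-breaking point is removed, so that point is influential.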

🛠️ Handling unusual points

⚖️ When to keep or remove points

Do NOT automatically delete outliers or influential points:

  • They may hold valuable information about the population under study.
  • Models that ignore exceptional cases often perform poorly.
  • Example: A financial firm ignoring the largest market swings (outliers) would make poor investment decisions.

Investigate the cause:

  • Check if the point results from erroneous data (data entry error, measurement error).
  • If data is incorrect and the correct value is known, make the correction.
  • If data is correct, leave it in the dataset.

Documentation requirements:

  • If data is deleted, record what was deleted and why.
  • Provide results both with and without the deleted data when possible.

🔄 Outliers that help avoid extrapolation

  • An outlier that is not influential may actually be beneficial.
  • It can extend the range of observed x values, reducing the need to extrapolate.
  • Don't confuse: Not all outliers are problematic; examine their role carefully.

Chapter 3 Wrap-Up: Bivariate Data and Regression

Chapter 3 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

This chapter covers how to analyze relationships between two quantitative variables using visualization, correlation measures, linear regression models, and awareness of common pitfalls like outliers and extrapolation.

📌 Key points (3–5)

  • Bivariate data analysis: examining relationships between two variables (explanatory and response).
  • Measuring association: correlation coefficient (r) shows strength and direction; coefficient of determination (r²) shows proportion of variance explained.
  • Linear regression modeling: using slope and y-intercept to predict response variable values from explanatory variable values.
  • Common confusion: outliers vs influential points—outliers are extreme in Y; influential points affect the regression line's position.
  • Key caution: extrapolation (predicting outside the data range) and the impact of outliers/influential points on model accuracy.

📊 Core terminology and concepts

📊 Variables in bivariate analysis

Bivariate data: data involving two variables measured together.

Explanatory variable: the variable used to explain or predict changes in another variable.

Response variable: the variable being predicted or explained.

  • The explanatory variable is typically plotted on the x-axis; the response variable on the y-axis.
  • Example: if studying how study hours affect exam scores, study hours would be explanatory and exam scores would be response.

📋 Organizing categorical bivariate data

Contingency table: a table format for displaying the relationship between two categorical variables.

  • Used when both variables are categorical rather than quantitative.
  • Helps identify patterns of association between categories.

📏 Measuring relationships

📏 Correlation coefficient (r)

Correlation coefficient (r): a measure of the strength and direction of the linear relationship between two quantitative variables.

  • Values range from -1 to +1.
  • Positive values indicate positive association; negative values indicate negative association.
  • Closer to ±1 means stronger linear relationship; closer to 0 means weaker relationship.

📐 Coefficient of determination (r²)

Coefficient of determination (r²): the proportion of variance in the response variable that is explained by the explanatory variable.

  • Derived by squaring the correlation coefficient.
  • Expressed as a proportion or percentage.
  • Example: r² = 0.64 means 64% of the variation in the response variable is explained by the linear relationship with the explanatory variable.
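Pearson's r and r² can be computed directly from the definitional sums; a sketch with hypothetical data:

```python
import math

def correlation(xs, ys):
    """Pearson's r: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data; r^2 gives the proportion of variance explained.
r = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(round(r, 4), round(r ** 2, 4))  # 0.8528 0.7273
```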

🔧 Linear regression modeling

🔧 The regression equation

Linear regression: a method for modeling the linear relationship between variables using an equation of the form y = mx + b.

Slope: the rate of change in the response variable for each unit change in the explanatory variable.

y-intercept: the predicted value of the response variable when the explanatory variable equals zero.

  • The regression line is the "line of best fit" through the data points.
  • Used to make predictions about the response variable based on explanatory variable values.

📍 Residuals

Residuals: the vertical distances between observed data points and the predicted values on the regression line.

  • Residual = observed value - predicted value.
  • The regression line minimizes the sum of squared residuals.
  • Residual patterns help diagnose whether a linear model is appropriate.

⚠️ Cautions and pitfalls

⚠️ Extrapolation

Extrapolation: making predictions outside the range of the observed data.

  • Dangerous because the relationship may not hold beyond the observed range.
  • The linear pattern observed within the data range may not continue outside it.
  • Example: using a regression model built on data from years 2000-2010 to predict values for year 2025 involves extrapolation and increased uncertainty.

🔴 Outliers and influential points

Outliers: data points that fall far from the overall pattern, particularly in the Y (response) direction.

Influential points: data points that, if removed, would substantially change the position or slope of the regression line.

How to distinguish:

Type | Characteristic | Effect
Outlier in Y | Extreme vertical distance from the line | May or may not affect the line much
Outlier in X | Extreme horizontal position | Often influential
Influential point | Changes the regression line significantly if removed | Strong leverage on the model
  • A point can be both an outlier and influential.
  • The excerpt shows that outliers are identified by falling outside boundary lines parallel to the regression line.
  • Don't confuse: not all outliers are influential, and not all influential points are outliers in Y.

🎯 Chapter structure and resources

🎯 Chapter sections covered

The chapter included five main sections:

  1. Introduction to Bivariate Data
  2. Visualizing Bivariate Quantitative Data
  3. Measures of Association
  4. Modeling Linear Relationships
  5. Cautions about Regression

📚 Self-check components

  • Concept check quiz: interactive quiz to test comprehension.
  • Key terms list: definitions for all major concepts introduced.
  • Extra practice problems: additional exercises available at the end of the book.

Introduction to Probability and Random Variables

4.1 Introduction to Probability and Random Variables

🧭 Overview

🧠 One-sentence thesis

Probability provides a systematic framework for measuring how certain we are about outcomes of experiments, with random variables serving as mathematical models that quantify these uncertain situations.

📌 Key points (3–5)

  • What probability measures: the certainty we have about outcomes of experiments, expressed as values between 0 and 1.
  • Law of large numbers: as experiments are repeated more times, observed frequencies approach theoretical probabilities (short-term results may vary widely).
  • Complement rule: if you know the probability an event occurs, you can easily find the probability it does not occur using P(A) + P(A′) = 1.
  • Random variables as models: uppercase letters (X, Y) represent the general outcome in words; lowercase letters (x, y) represent specific numerical values.
  • Common confusion: probability describes long-term behavior, not short-term results—flipping a coin twice does not guarantee one head and one tail.

🎲 Core probability concepts

🎲 What probability is

Probability: a measure associated with how certain we are of outcomes of a particular experiment or activity.

  • Not a prediction of what will happen next, but a measure of long-term likelihood.
  • Always between 0 and 1, inclusive.
  • P(A) = 0 means event A can never happen; P(A) = 1 means A always happens; P(A) = 0.5 means A is equally likely to occur or not.

🔬 Experiments and outcomes

Experiment: a planned operation carried out under controlled conditions where the result is not predetermined.

Sample space (S): the set of all possible outcomes of an experiment.

  • Example: flipping one fair coin has sample space S = {H, T}.
  • An outcome is a single result; an event is any combination of outcomes (represented by uppercase letters like A or B).
  • Sample spaces can be represented by listing outcomes, tree diagrams, or Venn diagrams.

📊 Probability models

Probability model: a mathematical representation of a random process that lists all possible outcomes and assigns probabilities to each.

  • This is the ultimate goal when studying statistics.
  • Models allow systematic problem-solving rather than intuitive guessing.

📏 Fundamental probability rules

📏 The three axioms

The excerpt lists three foundational axioms:

  1. P(S) = 1: probabilities of all outcomes in a sample space add up to 1.
  2. 0 ≤ P(E) ≤ 1: the probability of any event must be between 0 and 1.
  3. Disjoint addition rule: for two events E₁ and E₂ with no overlap, P(E₁ or E₂) = P(E₁) + P(E₂).

The first two axioms are intuitive; the third becomes important for more complex situations.

🔄 The complement rule

Complement of event A (denoted A′ or Aᶜ): all outcomes in the sample space that are NOT included in A.

Three useful forms:

  • P(A) + P(A′) = 1
  • 1 – P(A) = P(A′)
  • 1 – P(A′) = P(A)

Example: If S = {1, 2, 3, 4, 5, 6} and A = {1, 2, 3, 4}, then A′ = {5, 6}. The probabilities must sum to 1.

Why this matters: If you know the probability something happens, you automatically know the probability it doesn't happen.

⏱️ Long-term behavior vs short-term results

⏱️ The law of large numbers

Law of large numbers: as the number of repetitions of an experiment increases, the relative frequency obtained tends to become closer to the theoretical probability.

  • Key insight: probability does NOT describe short-term results; it describes long-term expectations.
  • The excerpt emphasizes "long-term observed relative frequency" approaching theoretical probability.
  • Example from the excerpt: Karl Pearson tossed a fair coin 24,000 times and got 12,012 heads—very close to the theoretical 12,000 (50%).

🎯 Don't confuse short-term with long-term

  • Flipping a coin twice does NOT guarantee one head and one tail, even though each has 50% probability.
  • You might flip a fair coin ten times and get nine heads—this does not contradict probability theory.
  • The word "empirical" means "observed" in this context.
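A quick simulation illustrates the law of large numbers; the seed and sample sizes here are arbitrary choices:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# The relative frequency of heads drifts toward the theoretical 0.5 as n grows.
for n in (10, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```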

🎰 Random variables

🎰 What random variables are

Random variable (RV): describes the outcomes of a statistical experiment in words or as a function that assigns each element of a sample space a unique real number.

Notation conventions:

  • Uppercase letters (X, Y) denote the random variable itself (the general concept in words).
  • Lowercase letters (x, y) denote specific numerical values.
  • Example: P(X = 3) means "the probability that random variable X equals 3."

🔢 Discrete vs continuous random variables

The excerpt distinguishes two types:

Type | Description | Data produced
Discrete random variable (DRV) | Models processes with countable outcomes | Discrete data (countable)
Continuous random variable (CRV) | Models processes with uncountable outcomes | Continuous data (measurable)

Example of discrete RV: Let X = number of heads when tossing three fair coins. Possible x values are 0, 1, 2, 3—these are countable outcomes, so X is a discrete random variable.

The excerpt focuses on discrete random variables first, with continuous random variables to be revisited later.


Discrete Random Variables

4.2 Discrete Random Variables

🧭 Overview

🧠 One-sentence thesis

Discrete random variables model countable outcomes of random experiments, and their behavior can be summarized through probability distributions and measures like expected value and standard deviation.

📌 Key points (3–5)

  • What a discrete random variable is: a random variable that takes on countable values (0, 1, 2, 3, etc.) from a statistical experiment.
  • Two key functions: the probability mass function (PMF) tells you the probability of each specific value; the cumulative distribution function (CDF) tells you the probability of being less than or equal to a value.
  • Valid probability distribution requirements: each probability must be between 0 and 1 (inclusive), and all probabilities must sum to 1.
  • Common confusion: don't confuse capital X (the random variable in words) with lowercase x (the specific numeric values it can take).
  • Why measures matter: expected value (mean) gives the long-term average outcome; variance and standard deviation measure how spread out the outcomes are around that average.

🎲 What discrete random variables are

🎲 Definition and notation

A discrete random variable is a random variable that models a process or experiment that produces discrete (countable) data.

  • Discrete means you have a countable number of outcomes.
  • The random variable describes outcomes in words; its values are the actual numbers.
  • Notation distinction:
    • Capital X = the random variable itself (in words), e.g., "the number of heads"
    • Lowercase x = the specific numeric values X can take, e.g., 0, 1, 2, 3
  • Example: X stands for the number of heads when you toss three fair coins. The sample space is TTT, THH, HTH, HHT, HTT, THT, TTH, HHH. Then x = 0, 1, 2, 3.

🔢 Why "discrete" matters

  • The values are countable outcomes, not continuous.
  • Because you can count the possible values and the outcomes are random, the variable is discrete.
  • Example: number of times a newborn wakes its mother after midnight per week—x can be 0, 1, 2, 3, 4, 5 (countable).

📊 Probability distributions for discrete random variables

📊 Two main characteristics of a valid distribution

A discrete probability distribution must exhibit:

  1. Each probability is between zero and one, inclusive.
  2. The sum of the probabilities is one.
  • The distribution can be shown in a table, graph, or formula.
  • These characteristics ensure the probabilities are valid and account for all possible outcomes.

📈 Probability mass function (PMF)

The probability mass function (PMF) of a discrete random variable tells you the probability of the random variable taking on a certain value.

  • Notation: P(X = x)
  • Sometimes (erroneously) called probability distribution function (PDF).
  • Example: if X = number of days Nancy attends class per week, the PMF gives P(X = 0), P(X = 1), P(X = 2), P(X = 3).

📉 Cumulative distribution function (CDF)

The cumulative distribution function (CDF) of a discrete random variable tells you the probability of the random variable being less than or equal to a certain value.

  • Notation: P(X ≤ x)
  • It accumulates probabilities from the smallest value up to x.
  • Example: if x = 2, the CDF is P(X = 0) + P(X = 1) + P(X = 2).

🧩 Why distributions are useful

  • A probability distribution function is a pattern.
  • You fit a probability problem into a pattern or distribution to perform calculations more easily.
  • Each distribution has special characteristics; learning them helps you distinguish among different distributions.

🎯 Expected value (mean) of a discrete random variable

🎯 What expected value means

The expected value or mean of a random variable is the long-term average outcome after conducting many trials of an experiment.

  • Denoted by the Greek letter μ or E[X].
  • It is the average value you would expect after many repetitions.
  • Related to the law of large numbers: as the number of trials increases, results become closer to what we expect.

🧮 How to calculate expected value

  • Multiply each value of the random variable by its probability, then add the products.
  • It is a probability-weighted average of the values.
  • Formula: μ = sum of [x · P(x)] for all x

Example: A men's soccer team plays soccer 0, 1, or 2 days a week with probabilities 0.2, 0.5, and 0.3 respectively.

x | P(x) | x · P(x)
0 | 0.2 | 0
1 | 0.5 | 0.5
2 | 0.3 | 0.6
  • Expected value = 0 + 0.5 + 0.6 = 1.1
  • The team would, on average, expect to play soccer 1.1 days per week.
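The expected value table translates directly into code (the soccer example):

```python
values = [0, 1, 2]       # days of soccer per week
probs = [0.2, 0.5, 0.3]  # P(x) for each value

# mu = sum of x * P(x): the probability-weighted average.
mu = sum(x * p for x, p in zip(values, probs))
print(round(mu, 1))  # 1.1 days per week on average
```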

📋 Expected value table

  • Construct a table with columns: x, P(x), and x · P(x).
  • The last column gives the products; sum them to find the expected value.
  • This table is called an expected value table.

📏 Variance and standard deviation of a discrete random variable

📏 What variance and standard deviation measure

  • Variance (σ² or V[X]) and standard deviation (σ or SD[X]) measure how spread out the outcomes are around the mean.
  • Like data distributions, probability distributions have standard deviations.
  • The process is similar to finding these measures for a data sample but uses a probability-weighted approach.

🧮 How to calculate variance and standard deviation

Steps:

  1. Find the mean (μ).
  2. Subtract the mean from each value of x to get deviations (x - μ).
  3. Square each deviation: (x - μ)².
  4. Multiply each squared deviation by its probability, P(x).
  5. Sum each of the products to get the variance.
  6. Take the square root of the variance to get the standard deviation.

Formula for variance: σ² = sum of [(x - μ)² · P(x)] for all x

Formula for standard deviation: σ = square root of σ²

📊 Example calculation

For a newborn waking its mother after midnight:

  • First, find μ = 2.1 (from the expected value calculation).
  • Then, for each x, calculate (x - μ)² · P(x).
  • Sum these products: 0.1764 + 0.2662 + 0.0046 + 0.1458 + 0.2888 + 0.1682 = 1.05
  • Standard deviation σ = square root of 1.05 ≈ 1.0247
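A sketch of the full calculation. The probabilities below are reconstructed from the products (x – μ)² · P(x) listed above; they sum to 1 and reproduce μ = 2.1:

```python
import math

values = [0, 1, 2, 3, 4, 5]                  # wakings per week
probs = [0.04, 0.22, 0.46, 0.18, 0.08, 0.02]  # reconstructed P(x)

mu = sum(x * p for x, p in zip(values, probs))
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
sigma = math.sqrt(var)
print(round(mu, 2), round(var, 2), round(sigma, 4))  # 2.1 1.05 1.0247
```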

🔍 Don't confuse with data formulas

  • When all outcomes are equally likely, these formulas coincide with the mean and standard deviation of the set of possible outcomes.
  • For general probability distributions, the probability-weighted approach is necessary.

🖩 Note on calculations

  • For probability distributions, use a calculator or computer to reduce roundoff error.
  • For many special cases of probability distributions, there are shortcut formulas for calculating μ, σ, and associated probabilities.

The Binomial Distribution

4.3 The Binomial Distribution

🧭 Overview

🧠 One-sentence thesis

The binomial distribution provides a shortcut method for calculating probabilities when an experiment consists of a fixed number of independent trials, each with only two possible outcomes (success or failure) and constant probability.

📌 Key points (3–5)

  • What makes an experiment binomial: fixed number of trials (n), only two outcomes per trial, independent trials with constant probabilities (p for success, q for failure).
  • How to calculate probabilities: use the binomial probability mass function (PMF) that combines the "choose" function with probabilities of success and failure.
  • Mean and standard deviation shortcuts: for binomial distributions, mean equals n times p, and variance equals n times p times q.
  • Common confusion: not all two-outcome experiments are binomial—independence matters; if one trial affects another (like drawing without replacement), it violates binomial conditions.
  • Notation and parameters: written as X ~ B(n, p), where n is the number of trials and p is the probability of success on each trial.

🎯 The Binomial Setting

🎯 Three defining characteristics

A binomial experiment must have all three of these properties:

  1. Fixed number of trials: The letter n denotes how many times the experiment is repeated.
  2. Only two outcomes: Each trial results in either "success" or "failure." The letter p denotes probability of success, q denotes probability of failure, and p + q = 1.
  3. Independence with identical conditions: Trials are independent (one outcome doesn't affect another), and probabilities p and q stay the same for every trial.

✅ Example of a binomial situation

  • Withdrawal rate from a physics course is 30% each term.
  • Define "success" as a student who withdraws.
  • Random variable X = number of students who withdraw from a randomly selected class.
  • Each student's decision is independent, and the 30% probability remains constant.

❌ Example of a non-binomial situation

  • A committee of 10 staff and 6 students chooses a chairperson and recorder by drawing two names without replacement.
  • Why it's not binomial: The first draw affects the second. If a student is drawn first, the probability of drawing a student second is 5/15; if staff is drawn first, the probability is 6/15.
  • This violates the independence requirement.

🔍 Bernoulli trial

A Bernoulli trial: any experiment with the two-outcome and constant-probability characteristics where n = 1.

  • Named after Jacob Bernoulli (late 1600s).
  • A binomial experiment is essentially counting successes across one or more Bernoulli trials.

📐 Binomial Notation and Formula

📐 Standard notation

X ~ B(n, p): "X is a random variable with a binomial distribution."

  • Parameters: n (number of trials) and p (probability of success on each trial).
  • Range of X: can be 0, 1, 2, …, n (counting the number of successes).

🧮 Binomial probability mass function (PMF)

The formula has two parts:

Part 1 - How many ways: The "choose" function (binomial coefficient) calculates how many different ways to get x successes in n trials:

  • Written as: nCx = n! / (x! times (n - x)!)
  • The ! symbol is the factorial operator.

Part 2 - Probability of one way: Using the multiplication rule for independent events:

  • p raised to the power x (probability of x successes) times q raised to the power (n - x) (probability of n - x failures).

Combined formula:

P(X = x) = nCx times p^x times q^(n-x)
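The formula in code, using Python's built-in binomial coefficient; the worked numbers match the 20-worker example later in this section:

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Exactly 12 successes in 20 trials with p = 0.41:
print(round(binom_pmf(12, 20, 0.41), 4))  # ≈ 0.0417
```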

📊 Cumulative distribution function (CDF)

  • The binomial does not have a simple closed-form CDF.
  • The CDF is the sum of all PMF values up to that point.
  • Example: P(X ≤ 12) requires adding P(X = 0) + P(X = 1) + … + P(X = 12).

📏 Measures of the Binomial Distribution

📏 Mean and standard deviation shortcuts

For binomial distributions, there are simple formulas:

Measure | Formula | Meaning
Mean (μ) | n times p | Expected number of successes
Variance (σ²) | n times p times q | Spread of the distribution
Standard deviation (σ) | square root of (n times p times q) | Typical deviation from mean

💡 Why these shortcuts work

  • Because all binomial experiments share the same structure (fixed n, constant p), we don't need to calculate the full discrete probability distribution formulas.
  • These formulas save time compared to computing mean and variance from the general definitions.

🧪 Example calculation

In a catalog with 560 pages where 8 feature signature artists, if we sample 100 pages:

  • n = 100, p = 8/560
  • Mean = 100 times (8/560) ≈ 1.4286
  • Standard deviation = square root of (100 times 8/560 times 552/560) ≈ 1.1867
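The catalog numbers check out in a couple of lines:

```python
import math

n, p = 100, 8 / 560
q = 1 - p

mu = n * p                    # expected number of signature-artist pages
sigma = math.sqrt(n * p * q)  # typical deviation from that mean
print(round(mu, 4), round(sigma, 4))  # 1.4286 1.1867
```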

🔢 Working with Binomial Probabilities

🔢 Exact probability (using PMF)

To find the probability of exactly x successes:

  • Plug values into the binomial PMF formula.
  • Example: For 20 workers where p = 0.41, find P(X = 12) by calculating 20C12 times 0.41^12 times 0.59^8.

📈 Cumulative probability (using CDF)

For "at most" or "at least" questions:

  • At most x: P(X ≤ x) = sum all probabilities from 0 to x.
  • At least x: P(X ≥ x) = 1 - P(X ≤ x - 1).
  • More than x: P(X > x) = 1 - P(X ≤ x).

Example: "At most 12 workers" means P(X ≤ 12), which equals the sum of probabilities for X = 0, 1, 2, …, 12.

🖩 Technology note

The excerpt mentions using calculators or computers:

  • "binompdf" for exact probabilities (PMF).
  • "binomcdf" for cumulative probabilities (CDF).
  • These reduce roundoff error compared to hand calculations.

🎓 Practical Application Steps

🎓 Identifying if binomial applies

Before using binomial formulas, verify:

  1. Is there a fixed number of trials?
  2. Does each trial have only two outcomes?
  3. Are trials independent?
  4. Do probabilities stay constant?

If any answer is "no," the binomial distribution does not apply.

🎓 Defining the random variable

  • Clearly state what X represents (e.g., "X = number of students who complete homework on time").
  • Identify what counts as "success" and "failure" in context.
  • Determine the values X can take (0 to n).

🎓 Setting up the problem

  • Identify n (number of trials).
  • Identify p (probability of success per trial).
  • Calculate q = 1 - p (probability of failure).
  • Translate word problems into probability notation (e.g., "at least 40" becomes P(X ≥ 40)).

Don't confuse: "At least" means ≥, "at most" means ≤, and "more than" means >.


Continuous Random Variables

4.4 Continuous Random Variables

🧭 Overview

🧠 One-sentence thesis

Continuous random variables represent measured quantities whose probabilities are calculated as areas under curves, with the uniform and normal distributions being foundational examples where probability equals area.

📌 Key points (3–5)

  • What continuous random variables are: quantities that are measured (not counted), such as distances, weights, or time durations.
  • How probability works for CRVs: probability is represented by area under a curve, calculated using the cumulative distribution function (CDF).
  • Key difference from discrete variables: individual point probabilities are always zero; only intervals have non-zero probability.
  • Common confusion: whether a variable is discrete or continuous depends on how it is defined—counting makes it discrete, measuring makes it continuous.
  • Two important distributions introduced: the uniform distribution (all outcomes equally likely over an interval) and the normal distribution (bell-shaped, symmetric, most important distribution).

🎯 Defining continuous random variables

🎯 What makes a variable continuous

Continuous random variables (CRVs): random variables whose values are measured rather than counted.

  • Examples from the excerpt: baseball batting averages, IQ scores, telephone call duration, amount of money carried, computer chip lifespan, SAT scores.
  • The field of reliability depends on various continuous random variables.
  • The key distinction is measurement vs counting.

⚠️ How definition matters

The excerpt emphasizes that how you define the variable determines whether it is discrete or continuous:

Definition | Type | Reason
Number of miles to the nearest mile | Discrete | You count the miles
Distance you drive to work | Continuous | You measure values
Number of books in a backpack | Discrete | You count books
Weight of a book | Continuous | Weights are measured
  • Don't confuse: the same real-world quantity can be either discrete or continuous depending on whether you count or measure it.

📊 How probability works for continuous distributions

📊 Probability as area

The fundamental principle for continuous distributions:

  • PROBABILITY = AREA
  • The graph of a continuous probability distribution is a curve.
  • Probability is represented by the area under the curve.

📈 Two key functions

Probability density function (PDF): the function f(x) that corresponds to the graph; used to draw the graph of the probability distribution.

Cumulative distribution function (CDF): the function used to evaluate probability as area under the curve.

  • The PDF (symbol: f(x)) defines the curve itself.
  • The CDF calculates the actual probabilities (areas).
  • In general, calculus (integral calculus) is needed to find areas, but formulas have already been derived for common distributions.

🔢 Properties when dealing with CDFs

Measurement, not counting:

  • Outcomes are measured, not counted.

Total area:

  • The entire area under the curve and above the x-axis equals one (maximum probability).

Intervals, not points:

  • Probability is found for intervals of x values rather than individual x values.
  • P(c < x < d) is the probability that X is between values c and d.
  • This equals the area under the curve, above the x-axis, to the right of c and left of d.

Zero probability for single points:

  • The probability that x takes on any single individual value is zero: P(x = c) = 0.
  • Reason: the area between x = c and x = c has no width, therefore no area.
  • Don't confuse: this is fundamentally different from discrete variables, where individual values can have non-zero probability.

Inclusive vs exclusive doesn't matter:

  • P(c < x < d) is the same as P(c ≤ x ≤ d) because probability equals area.
  • Whether endpoints are included makes no difference (each has zero probability).

🔄 Using the CDF for "greater than" probabilities

  • P(X ≤ x) or P(X < x) is the CDF—it gives "area to the left."
  • To find "area to the right": P(X > x) = 1 – P(X < x).

📏 The uniform distribution

📏 What the uniform distribution represents

Uniform distribution: concerned with events that are equally likely to occur over an interval.

  • Notation: X ~ U(a, b), where a = the lowest value of x and b = the highest value of x.
  • The probability density function is: f(x) = 1/(b - a) for a ≤ x ≤ b.
  • This creates a horizontal line segment (constant height).
  • Be careful to note if the data is inclusive or exclusive of endpoints.

📐 Calculating probabilities with uniform distribution

The area under the uniform distribution is a rectangle:

  • Base = the interval length
  • Height = 1/(b - a)
  • Area (probability) = base × height

Example from the excerpt:

  • For X ~ U(0, 20) (a different example from the smiling-times data): f(x) = 1/20 for 0 ≤ x ≤ 20.
  • To find P(2.3 < x < 12.7): Area = (12.7 - 2.3) × (1/20) = 0.52.
  • To find P(0 < x < 2): Area = (2 - 0) × (1/20) = 0.1.
  • To find P(4 < x < 15): Area = (15 - 4) × (1/20) = 0.55.
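The rectangle areas above in code (X ~ U(0, 20)); a minimal sketch:

```python
a, b = 0, 20
height = 1 / (b - a)  # the constant density f(x) = 1/20

def uniform_prob(c, d):
    """P(c < X < d) for X ~ U(a, b), assuming a <= c <= d <= b."""
    return (d - c) * height

print(round(uniform_prob(2.3, 12.7), 2))  # 0.52
print(round(uniform_prob(0, 2), 2))       # 0.1
print(round(uniform_prob(4, 15), 2))      # 0.55
```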

🧮 Theoretical mean and standard deviation

The excerpt provides formulas (expressed in words):

  • Mean (μ) = (a + b)/2
  • Standard deviation (σ) = (b - a) divided by the square root of 12

Example with baby smiling times:

  • Data: 55 smiling times in seconds; sample mean = 11.49, sample standard deviation = 6.23.
  • Assumption: smiling times follow uniform distribution between 0 and 23 seconds.
  • Theoretical mean = (0 + 23)/2 = 11.50 seconds.
  • Theoretical standard deviation = (23 - 0)/square root of 12 = 6.64 seconds.
  • Notice: theoretical values are close to sample values in this example.
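Checking the theoretical values for the smiling-times assumption X ~ U(0, 23):

```python
import math

a, b = 0, 23
mu = (a + b) / 2                 # theoretical mean
sigma = (b - a) / math.sqrt(12)  # theoretical standard deviation
print(round(mu, 2), round(sigma, 2))  # 11.5 6.64
```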

🔔 Introduction to the normal distribution

🔔 Why the normal distribution matters

Normal (Gaussian) distribution: the most important of all distributions, continuous or otherwise.

  • Graph characteristics: symmetric, bell-shaped, and unimodal.
  • Widely used across disciplines: psychology, business, economics, sciences, nursing, mathematics.
  • Many real-world elements fit a normal distribution: IQ scores, real estate prices, and more.
  • The excerpt warns it is "even more widely abused."

🎚️ Parameters of the normal distribution

The normal distribution has two parameters:

  • Mean (μ): center of the distribution
  • Standard deviation (σ): spread of the distribution
  • Notation: X ~ N(μ, σ)

Key properties:

  • The curve is symmetric about a vertical line drawn through the mean μ.
  • In theory, the mean equals the median (because of symmetry).
  • The probability density function is complex (involves exponential and pi), but technology and tables help work around this.

📉 The normal PDF

The excerpt provides the formula (in symbols):

  • f(x) equals 1 divided by (σ times the square root of 2π), times the exponential of –(x – μ)² divided by 2σ².
  • Valid for all real numbers: negative infinity < X < positive infinity.
  • Mean μ can be any real number; standard deviation σ must be positive.
  • The CDF P(X ≤ x) can be calculated by calculus, technology, or tables (though technology has made tables almost obsolete).
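One way technology evaluates that CDF without tables is the closed form in terms of the error function, available in Python's standard library; a sketch:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma), via the error function erf."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Half the area lies below the mean, by symmetry:
print(normal_cdf(50, 50, 6))  # 0.5
```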

The Normal Distribution

4.5 The Normal Distribution

🧭 Overview

🧠 One-sentence thesis

The normal distribution is the most important probability distribution because it describes many real-world phenomena and provides practical tools—like the empirical rule and standardization—for calculating probabilities and interpreting data.

📌 Key points (3–5)

  • What the normal distribution is: a symmetric, bell-shaped, unimodal curve defined by two parameters—mean (μ) and standard deviation (σ).
  • The empirical rule (68-95-99.7 rule): approximately 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
  • Standardization process: any normal distribution can be converted to the standard normal distribution (mean = 0, standard deviation = 1) using z-scores, enabling probability lookups in z-tables.
  • Common confusion: distinguishing between "less than" (left-tail), "greater than" (right-tail), and "between" probabilities—each requires different calculations using the cumulative distribution function.
  • Working backwards: given a percentile or probability, you can reverse the standardization process to find the corresponding value on the original distribution.

📊 What is the normal distribution

📊 Definition and notation

The normal (Gaussian) distribution is a symmetric, bell-shaped, unimodal distribution defined by its mean (μ) and standard deviation (σ).

  • Notation: X ~ N(μ, σ) means "X follows a normal distribution with mean μ and standard deviation σ"
  • The curve is symmetric about a vertical line through the mean
  • In theory, mean equals median because of symmetry
  • There are infinitely many normal distributions, each determined by its specific μ and σ values

🔧 Parameters and their effects

| Parameter | Symbol | Effect on curve |
| --- | --- | --- |
| Mean | μ | Shifts the graph left or right |
| Standard deviation | σ | Changes the shape—larger σ makes it wider/flatter, smaller σ makes it narrower/taller |

  • The curve extends from negative infinity to positive infinity
  • The total area under the curve equals one (representing 100% probability)

📏 The empirical rule (68-95-99.7 rule)

📏 What the rule tells us

The empirical rule applies to any normal distribution or bell-shaped, symmetric data:

  • 68% of values lie within one standard deviation of the mean (μ ± σ, or z-scores of ±1)
  • 95% of values lie within two standard deviations of the mean (μ ± 2σ, or z-scores of ±2)
  • 99.7% of values lie within three standard deviations of the mean (μ ± 3σ, or z-scores of ±3)

🧮 Applying the empirical rule

Example: Suppose X has a normal distribution with mean 50 and standard deviation 6.

  • One standard deviation: 50 ± 6 = values between 44 and 56 contain about 68% of data (z-scores: -1 to +1)
  • Two standard deviations: 50 ± 12 = values between 38 and 62 contain about 95% of data (z-scores: -2 to +2)
  • Three standard deviations: 50 ± 18 = values between 32 and 68 contain about 99.7% of data (z-scores: -3 to +3)

Don't confuse: The percentages (68%, 95%, 99.7%) are fixed for any normal distribution; only the actual values change based on μ and σ.
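
The rule can be checked numerically. This sketch is my own (it builds the standard normal CDF from `math.erf` rather than a z-table) and standardizes each interval for μ = 50, σ = 6:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 50, 6
for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    lo, hi = mu - k * sigma, mu + k * sigma
    # P(mu - k*sigma < X < mu + k*sigma) standardizes to P(-k < Z < k).
    p = phi(k) - phi(-k)
    print(f"within {k} SD: {lo} to {hi}, probability {p:.4f} (rule says ~{rule})")
```

The exact values (0.6827, 0.9545, 0.9973) are where the rounded 68-95-99.7 figures come from.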

🔄 Finding normal probabilities

🔄 Three approaches

The excerpt identifies three main methods for finding probabilities:

  1. Complicated math: using the probability density function formula directly (impractical)
  2. The standardizing process: converting to z-scores and using z-tables (traditional method)
  3. Technology: calculators or software that compute probabilities instantly (modern approach)

📐 Types of probability calculations

  • Left-tail (less than): P(X < x) represents the area to the left of x
  • Right-tail (greater than): P(X > x) = 1 - P(X < x) using the complement rule
  • Between two values: P(a < X < b) = P(X < b) - P(X < a)

Note: For continuous distributions, P(X < x) equals P(X ≤ x) exactly (a single point carries zero probability), and likewise P(X > x) = P(X ≥ x).

🎯 The standard normal distribution and z-scores

🎯 What makes it "standard"

The standard normal distribution (SND) is the simplest form of the normal distribution with mean = 0 and standard deviation = 1, denoted Z ~ N(0, 1).

🧮 Z-score formula and interpretation

A z-score tells you how many standard deviations a value x is above (positive z) or below (negative z) the mean μ.

Formula in words: z-score = (value minus mean) divided by standard deviation

  • Positive z-score: value is above the mean
  • Negative z-score: value is below the mean
  • Zero z-score: value equals the mean

📋 Using the z-table

  • Most z-tables provide left-tailed (cumulative) probabilities: P(Z ≤ z)
  • This cumulative distribution function value is also written as Φ(z)
  • Example: P(Z ≤ -3.37) = 0.0004 means only 0.04% of values fall below z = -3.37

To find other probabilities:

  • Greater than: P(Z ≥ z) = 1 - P(Z ≤ z)
  • Between: P(a ≤ Z ≤ b) = P(Z ≤ b) - P(Z ≤ a)
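
These lookup rules translate directly into code. A sketch (the helper `phi` is mine and stands in for a z-table, via `math.erf`):

```python
from math import erf, sqrt

def phi(z):
    """Cumulative standard normal probability P(Z <= z), i.e. the z-table value."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(-3.37), 4))              # left tail: 0.0004
print(round(1 - phi(1.25), 4))           # right tail, via the complement rule
print(round(phi(2.5) - phi(1.25), 4))    # between two z-scores: 0.0994
```

This is exactly what "technology has made tables almost obsolete" means: the same cumulative values the table prints, computed on demand.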

🔀 The standardizing process

🔀 Converting any normal to standard normal

The general process: X ~ N(μ, σ) → Z ~ N(0, 1) → probability from z-table

Steps:

  1. Start with a value from your original distribution
  2. Calculate the z-score using the formula
  3. Look up the z-score in the z-table to find the probability

🧪 Example application

Scenario: Weights of 80 cm girls have mean μ = 10.2 kg and standard deviation σ = 0.8 kg, normally distributed.

Finding P(weight < 11 kg):

  • Calculate z-score: (11 - 10.2) / 0.8 = 1
  • Interpretation: 11 kg is one standard deviation above the mean
  • Look up: P(Z ≤ 1) = 0.8413

Finding P(weight > 7.9 kg):

  • Calculate z-score: (7.9 - 10.2) / 0.8 = -2.875
  • Look up: P(Z ≤ -2.88) = 0.002
  • Apply complement: P(Z ≥ -2.88) = 1 - 0.002 = 0.998

Finding P(11.2 kg < weight < 12.2 kg):

  • Calculate z₁: (11.2 - 10.2) / 0.8 = 1.25
  • Calculate z₂: (12.2 - 10.2) / 0.8 = 2.5
  • Look up both: P(Z ≤ 2.5) = 0.9938 and P(Z ≤ 1.25) = 0.8944
  • Subtract: 0.9938 - 0.8944 = 0.0994
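
All three weight calculations can be reproduced in a few lines; this sketch (helper names are mine, CDF built from `math.erf`) follows the same standardize-then-look-up steps:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 10.2, 0.8  # mean and SD of the girls' weights, in kg

def p_less(x):
    """P(X < x) after standardizing: z = (x - mu) / sigma."""
    return phi((x - mu) / sigma)

print(round(p_less(11), 4))                    # P(weight < 11) = 0.8413
print(round(1 - p_less(7.9), 3))               # P(weight > 7.9) = 0.998
print(round(p_less(12.2) - p_less(11.2), 4))   # P(11.2 < weight < 12.2) = 0.0994
```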

🔙 Working backwards (un-standardizing)

🔙 From probability to value

Sometimes you know a percentile or probability and need to find the corresponding value on the original distribution.

The reverse process: Probability in z-table → Z ~ N(0, 1) → X ~ N(μ, σ)

🧮 The un-standardizing formula

Formula in words: value = mean + (z-score × standard deviation)

Example 1: If mean = 5, standard deviation = 2, what value is three standard deviations above the mean?

  • x = 5 + (3)(2) = 11

Example 2: What value corresponds to the 90th percentile on the same distribution?

  • Look up probability 0.9 in z-table → z ≈ 1.28
  • x = 5 + (1.28)(2) = 7.56

🍊 Practical scenario

Mandarin oranges have diameters that are normally distributed with mean = 5.85 cm and standard deviation = 0.24 cm.

To find the middle 20% of diameters:

  • Middle 20% means between the 40th and 60th percentiles
  • Look up z-scores for probabilities 0.40 and 0.60 in the z-table
  • Convert each z-score back to diameter values using the un-standardizing formula
  • Result: between 5.79 cm and 5.91 cm

Don't confuse: "Middle X%" means finding two symmetric values around the mean, not starting from zero.
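
Working backwards is an inverse-CDF lookup. Python's standard library ships one (`statistics.NormalDist.inv_cdf`), so both percentile examples can be checked like this (a sketch, not the excerpt's table-based method):

```python
from statistics import NormalDist

# 90th percentile of N(5, 2): z is about 1.28, so x = 5 + 1.28 * 2
print(round(NormalDist(5, 2).inv_cdf(0.90), 2))   # 7.56

# Middle 20% of mandarin diameters, N(5.85, 0.24): 40th and 60th percentiles
oranges = NormalDist(5.85, 0.24)
low, high = oranges.inv_cdf(0.40), oranges.inv_cdf(0.60)
print(round(low, 2), round(high, 2))              # 5.79 5.91
```

`inv_cdf` does the "probability → z → x" chain in one call; the manual route via the un-standardizing formula gives the same values.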


The Normal Approximation to the Binomial

4.6 The Normal Approximation to the Binomial

🧭 Overview

🧠 One-sentence thesis

When the sample size is large enough, the normal distribution can replace the cumbersome binomial formula to estimate probabilities more quickly and easily.

📌 Key points (3–5)

  • Why approximation is needed: the binomial formula becomes tedious and nearly impossible to compute for large sample sizes, especially when calculating a range of outcomes.
  • When approximation works: the binomial distribution becomes nearly normal when both np and n(1 − p) are at least 10.
  • How to use it: replace the binomial with a normal distribution using mean μ = np and standard deviation σ = square root of np(1 − p).
  • Common confusion: for small ranges of counts, the normal approximation performs poorly unless you apply the continuity correction (adjust cutoffs by ±0.5).
  • Key trade-off: the method is faster and easier but requires conditions to be met and may need correction for narrow intervals.

🔢 The problem with large binomial calculations

🔢 Why the binomial formula becomes impractical

  • The binomial formula works well for small sample sizes, but calculating probabilities for large n is tedious and long.
  • When you need a range of observations (e.g., k = 0, 1, 2, …, 42), you must compute many individual probabilities and add them together.
  • Example: To find the probability of observing 42 or fewer smokers out of 400 people when p = 0.15, you would need to calculate 43 separate probabilities: P(k = 0) + P(k = 1) + … + P(k = 42) = 0.0054.
  • The excerpt emphasizes that this work is "nearly impossible if you do not have access to technology."

⚡ Why an alternative is desirable

  • The excerpt states we should "avoid long, tedious work if an alternative method exists that is faster, easier, and still accurate."
  • Calculating probabilities of a range of values is much easier in the normal model.
  • The normal approximation provides a practical shortcut when certain conditions are met.

📊 When and how the approximation works

📊 The transformation from binomial to normal

The excerpt shows that as sample size increases, the binomial distribution's shape changes:

  • With small n (e.g., n = 10), the distribution is blocky and skewed.
  • As n grows (n = 30, 100, 300), the distribution gradually transforms.
  • By n = 300, the histogram resembles the normal distribution—no longer blocky or skewed.

✅ Conditions for using the approximation

The binomial distribution with probability of success p is nearly normal when the sample size n is sufficiently large that np and n(1 − p) are both at least 10.

How to verify the conditions:

  • Calculate np (expected number of successes).
  • Calculate n(1 − p) (expected number of failures).
  • Both must be at least 10.

Example: For n = 400 and p = 0.15:

  • np = 400 × 0.15 = 60 ✓
  • n(1 − p) = 400 × 0.85 = 340 ✓
  • Both conditions met, so normal approximation is valid.

🧮 Parameters for the normal approximation

Once conditions are met, use a normal distribution with:

  • Mean: μ = np
  • Standard deviation: σ = square root of np(1 − p)

These parameters correspond to the mean and standard deviation of the original binomial distribution.

Example continued: For the smoking survey, use N(μ = 60, σ = 7.14) to approximate the binomial probabilities.

🎯 Computing probabilities with the approximation

  • Standardize the value using the z-score formula: Z = (observed value − μ) / σ
  • Look up the corresponding probability in the normal table or use technology.

Example: To find P(42 or fewer smokers):

  • Z = (42 − 60) / 7.14 = −2.52
  • Left tail area = 0.0059
  • This is very close to the exact binomial answer of 0.0054.
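
The condition check, the parameters, and the tail probability can all be scripted; a sketch (helper names are mine, with `math.erf` in place of a z-table):

```python
from math import erf, sqrt

n, p = 400, 0.15

# Verify the conditions: both expected counts must be at least 10.
assert n * p >= 10 and n * (1 - p) >= 10

mu = n * p                      # expected successes: 60
sigma = sqrt(n * p * (1 - p))   # sqrt(51), about 7.14

def phi(z):
    """Standard normal CDF P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(42 or fewer smokers) via the normal approximation
z = (42 - mu) / sigma
print(round(z, 2), round(phi(z), 4))   # -2.52 0.0059
```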

🔧 The continuity correction

🔧 When the approximation performs poorly

The normal approximation "tends to perform poorly when estimating the probability of a small range of counts, even when the conditions are met."

Example problem: Find the probability of observing exactly 49, 50, or 51 smokers in 400 when p = 0.15.

  • Exact binomial: 0.0649
  • Normal approximation (without correction): 0.0421
  • The difference is notable and problematic.

🔧 Why the discrepancy occurs

The excerpt explains the cause using a visual comparison:

  • The binomial probability is represented by outlined bars.
  • The normal approximation is a shaded continuous area.
  • The normal distribution's width is "0.5 units too slim on both sides of the interval."
  • The normal curve doesn't capture the full width of the discrete bars.

🔧 How to apply the continuity correction

The continuity correction: cutoff values for the lower end of a shaded region should be reduced by 0.5, and the cutoff value for the upper end should be increased by 0.5.

When to use it:

  • Most useful when examining a range of observations (narrow intervals).
  • For tail areas (wide intervals), the benefit usually disappears because the total interval is already quite wide.

Effect of the correction:

  • In the example above, the revised estimate becomes 0.0633.
  • This is much closer to the exact value of 0.0649.
  • The correction adds the missing area on both sides of the interval.
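
A sketch comparing the three numbers from the example (the exact answer summed from the binomial formula via `math.comb`, the normal figures from an `erf`-based CDF; helper names are mine):

```python
from math import comb, erf, sqrt

n, p = 400, 0.15
mu, sigma = n * p, sqrt(n * p * (1 - p))   # 60 and about 7.14

def phi(z):
    """Standard normal CDF P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_between(a, b):
    """Normal-approximation probability that the count lies between a and b."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Exact binomial P(49 <= k <= 51), summing the PMF directly
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (49, 50, 51))

print(round(exact, 4))                       # about 0.0649
print(round(normal_between(49, 51), 4))      # no correction: 0.0421
print(round(normal_between(48.5, 51.5), 4))  # with the ±0.5 correction: 0.0633
```

Widening each cutoff by 0.5 recovers the half-bar of area the continuous curve misses on each side.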

📋 Summary comparison

| Aspect | Binomial (exact) | Normal approximation |
| --- | --- | --- |
| Accuracy | Exact | Approximate (close when conditions met) |
| Computation | Tedious for large n | Fast and easy |
| Conditions | Always valid for binomial settings | Requires np ≥ 10 and n(1 − p) ≥ 10 |
| Small ranges | Accurate | Poor unless continuity correction applied |
| Wide ranges/tails | Accurate | Good approximation, correction less important |


Chapter 4 Wrap-Up: Probability and Random Variables

Chapter 4 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

This chapter wrap-up consolidates the key terminology and concepts from probability experiments, discrete and continuous random variables, binomial and normal distributions, and their approximations.

📌 Key points (3–5)

  • Chapter structure: covers six major topics from basic probability concepts through normal approximation to the binomial.
  • Key term categories: probability fundamentals, discrete variables (PMF, CDF, expected value), binomial distributions, continuous variables (PDF), and normal distributions (z-scores, quantiles).
  • Visual progression: figures illustrate how binomial distributions approach normal distributions as sample size increases (n = 10, 30, 100, 300).
  • Common confusion: discrete vs continuous random variables require different functions—PMF for discrete, PDF for continuous.
  • Self-assessment tools: includes concept check quiz and extra practice problems for review.

📚 Terminology framework

📚 Probability foundations (Section 4.1)

The chapter begins with fundamental probability concepts:

  • Probability experiment: the process that generates outcomes
  • Sample space: all possible outcomes
  • Event: a subset of outcomes
  • Law of large numbers: describes long-run behavior
  • Complement: the opposite of an event

🎲 Discrete random variables (Section 4.2)

Discrete random variable: a variable that takes on countable values.

Key functions for discrete variables:

  • Probability mass function (PMF): assigns probabilities to each discrete value
  • Cumulative distribution function (CDF): gives cumulative probabilities up to a value
  • Expected value: the long-run average outcome

🔢 Binomial distribution (Section 4.3)

Specialized discrete distribution with specific requirements:

  • Bernoulli trial: a single trial with two outcomes
  • Independence: trials do not affect each other
  • Binomial distribution: counts successes in fixed number of independent trials

📊 Continuous distributions

📊 Continuous random variables (Section 4.4)

Continuous random variable (CRV): a variable that can take any value in an interval.

  • Uses probability density function (PDF) instead of PMF
  • Uniform distribution: simplest continuous distribution where all values in a range are equally likely
  • Don't confuse: PDF values are not probabilities themselves; probabilities come from areas under the curve

🔔 Normal distribution (Section 4.5)

Normal (Gaussian) distribution: the bell-shaped distribution central to statistics.

Key concepts:

  • Empirical rule: describes percentages within standard deviations
  • Standard normal distribution (SND): the special case with mean 0 and standard deviation 1
  • z-score: standardized value showing distance from mean in standard deviations
  • Quantile: the value below which a given percentage of data falls

🔗 Connecting discrete and continuous

🔗 Normal approximation to binomial (Section 4.6)

The chapter concludes by bridging discrete and continuous distributions.

Visual evidence from Figure 4.23:

| Sample size (n) | Distribution shape |
| --- | --- |
| n = 10 | Skewed, higher values at 0-2 |
| n = 30 | More centered around 4 |
| n = 100 | Bell-shaped, range 0-20 |
| n = 300 | Clear bell shape, range 10-50 |

  • As sample size increases, the binomial distribution increasingly resembles a normal distribution
  • Continuity correction: adjustment needed when approximating discrete with continuous
  • Figure 4.24 shows a bell curve with highlighted section at x = 50, illustrating the continuous approximation

🔗 Why this matters

  • Allows use of normal distribution tools for binomial problems when sample sizes are large
  • Simplifies calculations for large n
  • Example: instead of calculating many individual binomial probabilities, use one normal probability with continuity correction

🎯 Review resources

🎯 Self-assessment tools

The wrap-up provides multiple ways to check understanding:

  • Concept check quiz: interactive quiz accessible via QR code or online link
  • Extra practice problems: additional exercises available at end of book
  • Section resources: links to all six chapter sections for targeted review

🎯 Study approach

  • Start by defining each key term without looking
  • Check definitions against the glossary
  • Work through the concept check quiz
  • Complete extra practice problems for weak areas
  • Review specific sections as needed using the provided links

Point Estimation and Sampling Distributions

5.1 Point Estimation and Sampling Distributions

🧭 Overview

🧠 One-sentence thesis

Point estimates from samples vary due to sampling variability, but as sample size increases, these estimates become both more accurate (centered on the true parameter) and more precise (less spread out), following predictable sampling distributions.

📌 Key points (3–5)

  • What point estimation is: using a sample statistic (like sample mean) to estimate an unknown population parameter (like population mean).
  • Sampling variability: different random samples from the same population produce different statistics, creating a distribution of possible estimates called a sampling distribution.
  • Good estimates require two qualities: accuracy (the estimate centers on the true parameter value) and precision (the estimates cluster tightly together).
  • Common confusion: a parameter is a fixed population value that never changes, while a statistic varies from sample to sample—don't confuse the unchanging target with the variable estimate.
  • Law of large numbers: larger sample sizes make estimates more accurate and precise, reducing fluctuations around the true parameter.

🎯 What point estimation means

🎯 The basic idea

Point estimate: a single value calculated from a sample that serves as a best guess for an unknown population parameter.

  • You use what you observe in your sample to estimate what you cannot observe in the entire population.
  • The most natural approach: use the sample version of whatever you want to know about the population.
  • Example: if you want to know the mean rent in your town, you collect several rents, average them, and that average is your point estimate of the true mean rent.

📊 Common parameters and their estimates

| Population parameter | What it measures | Sample statistic (point estimate) |
| --- | --- | --- |
| Mean of a single population | Average value | Sample mean |
| Proportion of a single population | Percentage with a trait | Sample proportion (p-hat) |
| Variance of a single population | Spread of values | Sample variance |
| Standard deviation of a single population | Spread of values | Sample standard deviation |
| Mean difference of matched pairs | Average difference within pairs | Sample mean difference |
| Difference in means of two groups | Gap between two averages | Difference in sample means |
| Difference in proportions | Gap between two percentages | Difference in sample proportions |

🔢 Concrete scenarios

Example (mean): A sample of 60 adults has a mean weight of 173.3 lbs. This sample mean is a point estimate of the unknown population mean weight (μ).

Example (proportion): A poll shows 45% approval rating for a president. This 45% is a point estimate of the true approval rating across the entire population. The true population proportion (p) remains unknown unless you survey everyone; you use the sample proportion (p-hat) as your estimate.

Example (difference in means): A sample of men yields mean weight 185.1 lbs; a sample of women yields 162.3 lbs. The difference (185.1 minus 162.3) is a point estimate for the difference in population means.

🎲 Sampling distributions and variability

🎲 Why estimates vary

  • Different random samples from the same population produce different statistics—this is sampling variability.
  • The population parameter itself is a fixed value that does not change.
  • Each time you take a new sample, you get a new estimate, even though the true parameter stays the same.
  • Example: one sample of 60 adults gives mean weight 173.3 lbs; another sample gives 169.5 lbs; a third gives 172.1 lbs—all are estimates of the same unchanging population mean.

📈 What a sampling distribution is

Sampling distribution: the distribution of all possible values of a sample statistic (like the sample mean) based on samples of a fixed size n from a certain population.

  • Think of each point estimate as one draw from a random variable.
  • The sample mean itself is a random variable (often written as X-bar) with its own mean and standard deviation.
  • In theory, the sampling distribution would require taking an infinite number of samples; in practice, we simulate it by repeatedly sampling and graphing the results.
  • Every statistic in the table above has its own unique sampling distribution.

⚠️ Don't confuse parameter and statistic

  • Parameter: the true, fixed value in the population (e.g., μ, p). It does not change.
  • Statistic: the value calculated from a sample (e.g., sample mean, p-hat). It changes from sample to sample.
  • The statistic is your estimate; the parameter is the unknown target you are trying to estimate.

✅ What makes an estimate "good"

✅ Accuracy: hitting the target

  • Accuracy means the estimate is centered on the true parameter value.
  • Mathematically: the expected value of your statistic equals the parameter.
  • Visually: the center of the sampling distribution sits right at the parameter value.
  • The law of large numbers says that as sample size increases, the sample statistic converges toward the true parameter—larger samples are more accurate.

🎯 Precision: tight clustering

  • Precision means the estimates from repeated samples are close together, not spread out.
  • Visually: the sampling distribution has a narrow spread.
  • Quantified by the standard deviation of the sampling distribution, called the standard error.
  • Smaller standard errors mean more precise estimates.
  • Larger sample sizes reduce the standard error, making estimates more precise.

📉 How sample size affects estimates

The excerpt describes a simulation where sample size increases from 1 to 500, and the sample mean weight is calculated at each step:

  • At small sample sizes (around 50), the sample mean can be 10 lbs higher or lower than the true population mean (169.7 lbs, according to CDC data).
  • As sample size increases, fluctuations around the population mean decrease.
  • Larger samples produce sample means that are less variable and more reliable.
  • Don't confuse: a single large sample is more reliable than a single small sample, but the population mean itself never changes—only our estimate of it improves.

🔑 Standard error

Standard error: the standard deviation of a sampling distribution.

  • It measures the typical distance between a sample statistic and the true parameter.
  • Smaller standard errors indicate more precise estimates.
  • Sample size affects standard error: larger samples lead to smaller standard errors.
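
The σ/√n relationship behind this is easy to see directly; a minimal sketch, assuming an illustrative population standard deviation of 15:

```python
from math import sqrt

sigma = 15  # hypothetical population standard deviation

# Standard error = sigma / sqrt(n): quadrupling n halves the standard error.
for n in (25, 100, 400):
    print(n, sigma / sqrt(n))   # 3.0, 1.5, 0.75
```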

🧮 Summary of key distinctions

| Concept | Meaning | Key point |
| --- | --- | --- |
| Parameter | True, fixed population value | Never changes; usually unknown |
| Statistic | Value from a sample | Varies from sample to sample |
| Point estimate | A single statistic used to guess a parameter | Natural choice: use sample version of what you want to know |
| Sampling variability | Different samples give different statistics | Unavoidable randomness in sampling |
| Sampling distribution | Distribution of a statistic over all possible samples | Has its own mean and standard deviation |
| Accuracy | Estimate centers on the true parameter | Improves with larger sample size (law of large numbers) |
| Precision | Estimates cluster tightly together | Measured by standard error; improves with larger sample size |
| Standard error | Standard deviation of the sampling distribution | Smaller is better; decreases as sample size increases |

The Sampling Distribution of the Sample Mean (Central Limit Theorem)

5.2 The Sampling Distribution of the Sample Mean (Central Limit Theorem)

🧭 Overview

🧠 One-sentence thesis

The Central Limit Theorem establishes that as sample size increases, the distribution of sample means approaches a normal distribution regardless of the original population's shape, making the sample mean a powerful and predictable tool for statistical inference.

📌 Key points

  • What the CLT says: drawing larger samples causes the distribution of sample means to approximate a normal distribution, even when the original population is not normal.
  • How sample size affects spread: as sample size n increases, the standard deviation of sample means (standard error) gets smaller, meaning sample means cluster more tightly around the population mean.
  • When to apply the CLT: use it when finding probabilities about a sample mean, not about individual values from the population.
  • Common confusion: the CLT applies to the distribution of sample means, not to individual observations; don't use the CLT for single-value probabilities.
  • Sample size threshold: samples should be at least 30, or the data should come from a normal distribution; populations far from normal require larger samples for the CLT to hold.

📊 What the Central Limit Theorem tells us

📊 The core claim

The Central Limit Theorem (CLT) has two alternative forms, both concerned with drawing finite samples of size n from a population with known mean μ and known standard deviation σ.

Central Limit Theorem for sample means: if you keep drawing larger and larger samples and calculating their means, the sample means form their own normal distribution (the sampling distribution). The normal distribution has the same mean as the original distribution and a variance that equals the original variance divided by the sample size.

  • The CLT says that collecting samples of size n with a "large enough" n produces a distribution that can be approximated by the normal distribution.
  • The larger n gets, the smaller the standard deviation of the sample means becomes.
  • This means the sample mean must be close to the population mean μ.

🎲 Illustration through dice rolling

The excerpt provides a concrete example:

  • Eight students roll one fair die ten times each.
  • Seven roll two fair dice ten times each.
  • Nine roll five fair dice ten times each.
  • Eleven roll ten fair dice ten times each.
  • Each person calculates the sample mean of the faces showing for each roll.

What happens as the number of dice rolled increases from one to two to five to ten:

  1. The mean of the sample means remains approximately the same.
  2. The spread of the sample means (standard deviation of the sample means) gets smaller.
  3. The graph appears steeper and thinner.

Example: One person rolls five fair dice and gets 2, 2, 3, 4, 6. The mean is (2+2+3+4+6)/5 = 3.4. This person repeats nine more times for a total of ten means. As more dice are rolled per trial, the distribution of these means becomes more normal and concentrated.
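
The dice experiment can be simulated; a sketch of my own (fixed seed for reproducibility) showing the spread of sample means shrinking as more dice are averaged per roll:

```python
import random

random.seed(1)

def mean_of_dice(k):
    """Roll k fair dice once and return the mean of the faces showing."""
    return sum(random.randint(1, 6) for _ in range(k)) / k

# For each number of dice, simulate many sample means and summarize their spread.
for k in (1, 2, 5, 10):
    means = [mean_of_dice(k) for _ in range(10_000)]
    avg = sum(means) / len(means)
    sd = (sum((m - avg) ** 2 for m in means) / len(means)) ** 0.5
    print(f"{k:>2} dice: mean of means {avg:.2f}, spread {sd:.2f}")
```

The center stays near 3.5 (the mean of one die) while the spread falls roughly as 1/√k, exactly the pattern the excerpt describes.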

🔍 How sample size changes the distribution

🔍 Patterns from normal populations

The excerpt describes sampling distributions built from taking 1,000 samples of different sample sizes from a normal population. The pattern observed:

  • As sample size increases (from n = 5 to n = 15 to n = 30), the histogram of sample means becomes more narrow around the mean.
  • All histograms follow a typical bell-curve shape.
  • The shape gets steeper and thinner with larger n.

🔍 Patterns from non-normal populations

The excerpt also describes sampling from a non-normal (exponential) population:

  • The original population histogram is skewed right.
  • As sample size increases (n = 5 to n = 15 to n = 30), the distribution of sample means still gets more narrow around the mean.
  • Even though the original population is not normal, the sampling distribution of the mean becomes more bell-shaped as n increases.

Key difference: when sampling from non-normal populations, you need larger sample sizes for the sampling distribution to look normal compared to sampling from already-normal populations.

📏 The "large enough" threshold

  • The size of the sample n that is considered "large enough" depends on the original population from which the samples are drawn.
  • General rule: the sample size should be at least 30, or the data should come from a normal distribution.
  • If the original population is far from normal, then more observations are needed for the sample means to be normal.
  • Sampling is done with replacement.

🧮 Mathematical structure of the CLT

🧮 Notation and formulas

Let X be a random variable with any distribution (known or unknown). Using a subscript that matches the random variable:

  • μ_X = the mean of X
  • σ_X = the standard deviation of X

Standard error of the mean: the standard deviation of the sample mean, written as σ_X divided by the square root of n. This assumes we know the population standard deviation.

  • The variable n is the number of values that are averaged together, not the number of times the experiment is done.
  • The standard error describes how far (on average) the sample mean will be from the population mean in repeated simple random samples of size n.

🧮 The sampling distribution

If you draw random samples of size n, the distribution of the random variable consisting of sample means is called the sampling distribution of the sample mean.

As sample size increases:

  • The random variable (the sample mean) tends to be normally distributed.
  • The sampling distribution of the mean approaches a normal distribution: the sample mean follows N(μ_X, σ_X divided by square root of n).

🧮 Connection to the law of large numbers

The excerpt explains that applying the law of large numbers here means:

  • Taking larger and larger samples from a population brings the sample mean closer and closer to μ.
  • From the CLT, we know that the sample means increasingly follow a normal distribution as n gets larger.
  • The larger n gets, the smaller the standard deviation gets (because standard deviation for the sample mean is σ divided by square root of n).
  • We can say that μ is the value that the sample means approach as n gets larger.
  • The central limit theorem illustrates the law of large numbers.

🎯 When and how to use the CLT

🎯 Deciding when to apply the CLT

Critical distinction:

  • If you are being asked to find the probability of an individual value, do NOT use the CLT. Use the distribution of its random variable.
  • If you are being asked to find the probability of the mean of a sample, then use the CLT for the mean.

Don't confuse: the CLT applies to the distribution of sample means, not to individual observations from the population.

🎯 The z-score for sample means

The z-score associated with the sample mean differs from the score of a single observation.

Remember:

  • The sample mean is the mean of one sample.
  • μ_X is the average, or center, of both X (the original distribution) and the sample mean.

We can use a z-table and standardizing, or we can use technology.

🎯 Worked example: probability of a sample mean

Problem: An unknown distribution has a mean of 90 and a standard deviation of 15. Samples of size n = 25 are drawn randomly from the population. Find the probability that the sample mean is between 85 and 92.

Solution steps:

  1. Let X represent one value from the original unknown population.
  2. The standard error of the mean is 15 divided by the square root of 25 = 15/5 = 3.
  3. Let the sample mean follow N(90, 15 divided by square root of 25).
  4. Find P(85 < sample mean < 92). This is a "between" problem requiring two z-scores.
  5. Z₁ = (85 - 90)/3 = -1.67
  6. Z₂ = (92 - 90)/3 = 0.67
  7. The probability that the sample mean is between 85 and 92 is 0.7475 - 0.0478 = 0.6997.

Another question from the same problem: Find the value that is two standard deviations above 90, the expected value of the sample mean.

Solution: Use the formula: value = μ_X + (number of standard deviations)(standard error).

  • Value = 90 + 2(3) = 96.
  • The value that is two standard deviations above the expected value is 96.
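
The worked example translates directly into code; a sketch (CDF built from `math.erf` rather than a z-table):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 90, 15, 25
se = sigma / sqrt(n)   # standard error of the mean: 15 / 5 = 3

# P(85 < sample mean < 92): standardize both endpoints with the standard error.
prob = phi((92 - mu) / se) - phi((85 - mu) / se)
print(round(prob, 4))   # 0.6997

# Value two standard errors above the expected sample mean
print(mu + 2 * se)      # 96.0
```

The unrounded z-scores (±2/3 and −5/3) give 0.6997, matching the table lookups in the excerpt.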

💡 Why the CLT matters

💡 Importance in statistical theory

The excerpt states: "It would be difficult to overstate the importance of the central limit theorem in statistical theory."

Why it's powerful:

  • Knowing that data behaves in a predictable way—even if its distribution is not normal—is a powerful tool.
  • The CLT allows us to make inferences about population means even when we don't know the shape of the original population distribution.

💡 Why we focus on sample means

Two reasons the excerpt gives for focusing on sample means:

  1. They give us a middle ground for comparison.
  2. They are easy to calculate.

The CLT makes the sample mean not just easy to calculate but also predictable in its behavior, which is essential for statistical inference.


Introduction to Confidence Intervals

5.3 Introduction to Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

Confidence intervals provide a range of reasonable values around a point estimate that is likely to capture the unknown population parameter with a predictable probability of success.

📌 Key points (3–5)

  • What a confidence interval is: a range of values (not just one number) built around a point estimate to capture the population parameter.
  • Why we need intervals instead of point estimates: sampling variability means a single point estimate is unlikely to be exactly the true parameter value.
  • How to construct the interval: use the formula (Point Estimate – Margin of Error, Point Estimate + Margin of Error).
  • Common confusion: the confidence interval is the random variable that changes from sample to sample; the population parameter is fixed.
  • What the confidence level means: not the probability that one specific interval contains the parameter, but the percentage of intervals that would contain the parameter if we repeated sampling many times.

🎯 What confidence intervals are and why we use them

🎯 The problem with point estimates

  • Inferential statistics help us make generalizations about an unknown population using sample data.
  • A point estimate is the simplest approach: use a sample statistic (like sample mean) to estimate a population parameter (like population mean).
  • Due to sampling variability, the point estimate is most likely not the exact value of the population parameter, though it should be close.
  • Example: if you survey consumers and calculate the sample mean number of songs downloaded per month, that sample mean is a point estimate for the population mean.

📏 What a confidence interval provides

Confidence interval: a range of reasonable values in which we expect the population parameter to fall.

  • Instead of a single number, we build an interval based on the point estimate.
  • There is no guarantee that a given confidence interval captures the parameter, but there is a predictable probability of success.
  • The confidence interval itself is a random variable (it changes from sample to sample), while the population parameter is fixed.
  • Don't confuse: the parameter doesn't move; our interval does. Some intervals will capture it, some won't.

🔧 How confidence intervals are built

🔧 The general structure

  • Confidence intervals for most parameters have the form:
    • (Point Estimate ± Margin of Error) = (Point Estimate – Margin of Error, Point Estimate + Margin of Error)
  • The margin of error (MoE) depends on:
    • The confidence level (or percentage of confidence).
    • The standard error of the mean.
  • Example: if a sample has a mean of 10 and MoE = 5, the 90% confidence interval is (5, 15).

🧮 The empirical rule foundation

  • When the population standard deviation is known and the sample size is large enough, the central limit theorem tells us the sample mean follows an approximately normal distribution.
  • The empirical rule (for bell-shaped distributions) says the sample mean will be within two standard deviations of the population mean in approximately 95% of samples.
  • Example (iTunes downloads): if the population standard deviation is 1 and sample size is 100, the standard deviation for the sample mean is 0.1. Two standard deviations = 0.2. So the sample mean is likely within 0.2 units of the population mean in 95% of samples.
  • Because the sample mean is within 0.2 of the population mean, the population mean is likely within 0.2 of the sample mean in 95% of samples.

📐 Constructing the interval: an example

  • Suppose a sample produced a sample mean of 2, and we know the margin of error is 0.2.
  • The unknown population mean is between 2 – 0.2 = 1.8 and 2 + 0.2 = 2.2.
  • We say: "We are about 95% confident that the unknown population mean number of songs downloaded from iTunes per month is between 1.8 and 2.2."
  • The approximate 95% confidence interval is (1.8, 2.2).
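The iTunes interval above can be reproduced in a few lines (a sketch; the figures come from the example: σ = 1, n = 100, sample mean = 2):

```python
from math import sqrt

# iTunes example: population sigma = 1, n = 100, sample mean = 2
sigma, n, xbar = 1, 100, 2
se = sigma / sqrt(n)                 # 0.1

moe = 2 * se                         # empirical rule: ~95% within 2 SEs
interval = (xbar - moe, xbar + moe)
print(interval)                      # (1.8, 2.2)
```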

🎲 What the interval implies

  • Two possibilities:
    1. The interval (1.8, 2.2) contains the true mean.
    2. Our sample produced a sample mean that is not within 0.2 units of the true mean.
  • The second possibility happens for only 5% of all samples (100% – 95%).

📊 Confidence level and alpha

📊 What the confidence level means

  • The confidence level (CL) is often thought of as the probability that the calculated confidence interval contains the true population parameter.
  • More accurately: the confidence level is the percent of confidence intervals that contain the true population parameter when repeated samples are taken.
  • Most often, people choose a confidence level of 90% or higher to be reasonably certain of their conclusions.

📊 The relationship with alpha

  • Alpha (α) represents the chance that the interval does not contain the unknown population parameter.
  • Mathematically: α + CL = 1
  • Example: if CL = 0.95, then α = 0.05 (5% chance the interval misses the parameter).

🛠️ Steps to construct a confidence interval

🛠️ The five-step process

When the population standard deviation is known:

  1. Calculate the sample mean from the sample data.
  2. Find the z-score (critical value) that corresponds to the confidence level.
  3. Calculate the margin of error.
  4. Construct the confidence interval using (sample mean – MoE, sample mean + MoE).
  5. Write a sentence that interprets the estimate in the context of the problem (use the words of the problem to explain what the confidence interval means).

🔍 Example: simple interval calculation

  • Given: sample mean = 7, margin of error = 2.5.
  • Confidence interval = (7 – 2.5, 7 + 2.5) = (4.5, 9.5).
  • Interpretation (if CL = 95%): "We estimate with 95% confidence that the true value of the population mean is between 4.5 and 9.5."

🎚️ Changing the confidence level

🎚️ How the confidence level affects the interval

  • A confidence interval for a population mean with known standard deviation is based on the sample means following an approximately normal distribution.
  • To get a 90% confidence interval, we include the central 90% of the probability of the normal distribution.
  • If we include the central 90%, we leave out a total of α = 10% in both tails, or 5% in each tail.

🎚️ Finding the critical value (z-score)

Critical value: the z-score that puts an area equal to the confidence level (in decimal form) in the middle of the standard normal distribution Z ~ N(0, 1).

  • The confidence level is the area in the middle of the standard normal distribution.
  • Since CL = 1 – α, α is the area split equally between the two tails.
  • Each tail contains an area equal to α/2.
  • The z-score that has an area to the right of α/2 is denoted by z(α/2).
  • Note: Use the area to the LEFT of z(α/2) when looking up values.

🎚️ Example: 95% confidence level

  • CL = 0.95, α = 0.05, α/2 = 0.025.
  • We write z(α/2) = z(0.025).
  • The area to the right of z(0.025) is 0.025; the area to the left is 1 – 0.025 = 0.975.
  • Using technology or a standard normal table: z(α/2) = z(0.025) = 1.96.
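Critical values for any confidence level can be found with the inverse normal CDF. A small sketch using Python's standard library (the helper name `critical_z` is illustrative):

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """z(alpha/2): the z-score that puts the central `confidence_level`
    area in the middle of the standard normal distribution."""
    alpha = 1 - confidence_level
    # area to the LEFT of z(alpha/2) is 1 - alpha/2
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(critical_z(0.95), 2))    # 1.96
print(round(critical_z(0.90), 3))    # 1.645
```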

🎚️ Example: 90% confidence level

  • To capture the central 90%, we must go out 1.645 standard deviations on either side of the calculated sample mean.
  • The value 1.645 is the z-score that results in an area of 0.90 in the center, 0.05 in the far left tail, and 0.05 in the far right tail.

📐 Standard error and the normal distribution

📐 Standard error of the mean

  • The standard deviation used must be appropriate for the parameter we are estimating.
  • For sample means, we use the standard deviation that applies to sample means: (population standard deviation) / (square root of sample size).
  • This fraction is commonly called the "standard error of the mean" to distinguish it from the population standard deviation.

📐 Central limit theorem summary

  • The sample mean is normally distributed: sample mean ~ N(population mean, standard error).
  • When the population standard deviation is known, we use a normal distribution to calculate the margin of error and construct the confidence interval.

The Behavior of Confidence Intervals

5.4 The Behavior of Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

Confidence intervals become wider when the confidence level increases or the sample size decreases, and researchers can work backwards from intervals to find margins of error or calculate the sample size needed for a desired precision.

📌 Key points (3–5)

  • What makes a "good" estimate: A narrower (more precise) confidence interval is more useful for estimation.
  • How confidence level affects width: Increasing the confidence level (e.g., from 90% to 95%) increases the margin of error, making the interval wider but more likely to capture the true parameter.
  • Working backwards: From a stated confidence interval, you can calculate both the margin of error and the sample mean using simple arithmetic.
  • Sample size planning: Researchers can use the margin of error formula solved for n to determine how many observations they need for a desired precision.
  • Common confusion: Higher confidence does not mean the parameter is more likely to be in a specific range—it means the method produces intervals that capture the true parameter more often.

📏 How confidence intervals behave

📏 Precision as a goal

A smaller, or more narrow, interval gives us a more precise and therefore useful estimate.

  • Precision is one criterion for a "good" statistical estimate.
  • The excerpt emphasizes that narrower intervals are more useful because they pin down the parameter more tightly.
  • Trade-off: You cannot always have both high confidence and high precision simultaneously.

📊 Effect of changing confidence level

The excerpt walks through the same problem with two different confidence levels:

| Confidence Level | Critical z-value | Margin of Error | Interval Width | Resulting Interval |
|---|---|---|---|---|
| 90% | 1.645 | 0.82 | Narrower | (67.18, 68.82) |
| 95% | 1.96 | 0.98 | Wider | (67.02, 68.98) |

  • Why it happens: A 95% confidence level requires capturing the true mean in 95 out of 100 intervals, so the area under the curve is larger (0.95 vs 0.90), which requires a larger critical z-value.
  • The mechanism: Larger z-value → larger margin of error → wider interval.
  • Example: The 95% interval (67.02, 68.98) is wider than the 90% interval (67.18, 68.82) for the same data.

Don't confuse: The confidence level applies to the method (the sampling and interval construction process), not to the parameter itself. The parameter is fixed; what varies is the interval you calculate from each sample.

🔄 Working backwards from intervals

🔄 Finding the margin of error

When only the confidence interval is given, you can recover the margin of error using two methods:

  • Method 1 (if you know the sample mean): Subtract the sample mean from the upper endpoint.
    • Example: Upper endpoint 68.82 minus sample mean 68 equals margin of error 0.82.
  • Method 2 (if you don't know the sample mean): Subtract the lower endpoint from the upper endpoint, then divide by two.
    • Example: (68.82 minus 67.18) divided by 2 equals 0.82.

🔄 Finding the sample mean

Similarly, you can recover the sample mean:

  • Method 1 (if you know the margin of error): Subtract the margin of error from the upper endpoint.
    • Example: 68.82 minus 0.82 equals 68.
  • Method 2 (if you don't know the margin of error): Average the two endpoints.
    • Example: (67.18 plus 68.82) divided by 2 equals 68.

Tip: Choose the method that matches the information you have available.
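Both backwards calculations are simple arithmetic; a quick sketch using the endpoints from the example:

```python
lower, upper = 67.18, 68.82

# Endpoints-only methods (Method 2 for each quantity)
moe = (upper - lower) / 2      # half the interval width
xbar = (lower + upper) / 2     # midpoint of the interval
print(moe, xbar)               # 0.82 and 68.0 (up to float rounding)
```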

🧮 Calculating required sample size

🧮 The sample size formula

The formula for sample size is n = (z times sigma divided by MoE) squared, found by solving the margin of error formula for n.

  • When to use it: Researchers planning a study who want a specific margin of error and confidence level.
  • What you need to know: the population standard deviation (sigma), the desired margin of error, and the desired confidence level (which determines z).
  • Important rule: Always round up to the next higher integer to ensure the sample size is large enough.

🧮 Example walkthrough

The excerpt gives an example: estimating the mean age of college students within 2 years with 95% confidence, given that the population standard deviation is 15 years.

  • Sigma = 15, MoE = 2, z = 1.96 (for 95% confidence).
  • Calculation: n = (1.96 times 15 divided by 2) squared = 216.09.
  • Round up to n = 217 students.

Another example: estimating mean height within 1 inch with 93% confidence, sigma = 2.5 inches.

  • z = 1.812 (the z-value for which the area to the right is 0.035, since alpha/2 = 0.035).
  • Calculation: n = (1.812 times 2.5 divided by 1) squared ≈ 20.52.
  • Round up to n = 21 students.
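Both sample-size calculations can be sketched with a small helper (the name `required_n` is illustrative; rounding up uses `math.ceil`):

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma, moe, confidence_level):
    """Sample size n = (z * sigma / MoE)^2, always rounded UP."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return ceil((z * sigma / moe) ** 2)

print(required_n(15, 2, 0.95))     # 217 (mean age example)
print(required_n(2.5, 1, 0.93))    # 21  (mean height example)
```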

Why it matters: This allows researchers to plan studies efficiently—they know in advance how many observations they need to achieve their desired precision.

🎯 Interpretation reminders

🎯 What the confidence level means

The excerpt provides two ways to interpret a confidence interval:

  • Direct interpretation: "We estimate with 95% confidence that the true population mean for all statistics exam scores is between 67.02 and 68.98."
  • Alternative interpretation: "Ninety-five percent of all confidence intervals constructed in this way contain the true value of the population mean."

🎯 What to avoid

  • Don't say: "The parameter is 95% likely to be in this interval."
  • Why not: The parameter is a fixed value; it either is or isn't in the interval. The 95% refers to the long-run success rate of the method, not the probability that this particular interval captured the parameter.
  • The excerpt emphasizes: "Be careful that you do not associate the confidence level with the parameter itself. Your parameter is a fixed value; what is changing is the sample you take and the interval you calculate. We always want to associate the CL% with the sampling process and the interval."

Introduction to Hypothesis Tests

5.5 Introduction to Hypothesis Tests

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing is a statistical method that uses sample data to decide whether there is sufficient evidence to reject a claim about a population parameter.

📌 Key points (3–5)

  • What hypothesis testing does: makes decisions about population parameters by evaluating sample data against two competing statements (null and alternative hypotheses).
  • The two hypotheses: the null hypothesis (H₀) represents the accepted or historical value; the alternative hypothesis (Hₐ) contradicts it and represents what we conclude if we reject H₀.
  • How decisions are made: compare the p-value (probability of getting results as extreme as observed) to a significance level (α); if p-value is smaller, reject the null hypothesis.
  • Common confusion: "do not reject H₀" does NOT mean H₀ is true—it only means there isn't enough evidence against it.
  • Key symbols: H₀ always contains an equal sign (=, ≤, or ≥); Hₐ never does (≠, <, or >).

🔬 The Two Competing Hypotheses

🔵 The null hypothesis (H₀)

The null hypothesis is often a statement of the accepted historical value or norm; it is the starting point you must assume from the beginning in order to show an effect exists.

  • H₀ represents the status quo or default assumption.
  • It always includes an equality symbol (=, ≤, or ≥).
  • Example: If testing whether mean GPA is 2.0, then H₀: μ = 2.0.

🔴 The alternative hypothesis (Hₐ)

The alternative hypothesis is a claim about the population that is contradictory to H₀ and what we conclude when we reject H₀.

  • Hₐ represents what you suspect might be true instead.
  • It never includes an equality symbol; uses ≠, <, or >.
  • The choice of symbol depends on the wording of the research question.
  • Example: Testing if mean GPA differs from 2.0 gives Hₐ: μ ≠ 2.0.

🔀 How they relate

| Aspect | Null Hypothesis (H₀) | Alternative Hypothesis (Hₐ) |
|---|---|---|
| Symbol | Always has = in it | Never has = in it |
| Role | Starting assumption | What we conclude if we reject H₀ |
| Examples | μ = 2.0, μ ≤ 15 | μ ≠ 2.0, μ > 15 |

  • The two hypotheses are contradictory—they cannot both be true.
  • You examine sample evidence to decide which one the data supports.

📊 The Five-Step Process

📝 Step 1: Define hypotheses

  • State both H₀ and Hₐ clearly.
  • Ensure H₀ contains an equality and Hₐ does not.

📐 Step 2: Collect data and choose distribution

  • Use sample data (often provided in classroom contexts).
  • Determine the correct distribution based on assumptions (currently focusing on z-distribution when population standard deviation is known).

🧮 Step 3: Calculate test statistic

  • The test statistic measures the distance between what you observed and what you assume to be true.
  • For testing a mean with known population standard deviation: the test statistic (z₀) quantifies the number of standard deviations between the sample mean and the population mean.
  • Formula concept: standardize the difference between sample mean and hypothesized population mean (μ₀).

⚖️ Step 4: Make a decision

  • Two methods exist: critical value method and p-value method.
  • Currently focusing on the p-value method.

✍️ Step 5: Write a conclusion

  • State your conclusion in the context of the original scenario.
  • Incorporate what the hypotheses mean in plain language.

🎯 Understanding the p-Value

🔍 What a p-value measures

The p-value is the probability that, if the null hypothesis is true, the results from another randomly selected sample will be as extreme or more extreme as the results obtained from the given sample.

  • It quantifies how unlikely your sample result would be if H₀ were actually true.
  • A small p-value means your observed result is very unlikely under the null hypothesis.
  • A large p-value means your result could easily happen by chance if H₀ is true.

📉 Interpreting p-values

  • Large p-value: do not reject the null hypothesis; the data are consistent with H₀.
  • Small p-value: reject the null hypothesis; the data provide strong evidence against H₀.
  • The smaller the p-value, the stronger the evidence against the null hypothesis.

🍞 Bread height example

Example: A baker claims bread height is more than 15 cm on average. Sample of 10 loaves has mean 17 cm. Population standard deviation is 0.5 cm.

  • H₀: μ ≤ 15 (bread height is at most 15 cm)
  • Hₐ: μ > 15 (bread height is more than 15 cm)
  • The p-value is P(sample mean > 17) ≈ 0 (approximately zero)
  • Interpretation: It is highly unlikely (almost 0% chance) that a sample mean would be at least 17 cm if the true population mean were really 15 cm or less.
  • Conclusion: Strong evidence against H₀; the true mean height is likely greater than 15 cm.
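The test statistic and p-value for the bread example can be sketched as follows (right-tailed test with known population σ):

```python
from math import sqrt
from statistics import NormalDist

# H0: mu <= 15 vs Ha: mu > 15; population sigma is known
mu0, sigma, n, xbar = 15, 0.5, 10, 17

z0 = (xbar - mu0) / (sigma / sqrt(n))      # test statistic
p_value = 1 - NormalDist().cdf(z0)         # right-tailed: P(Z > z0)
print(round(z0, 2), p_value)               # z ≈ 12.65, p ≈ 0
```

With a test statistic nearly 13 standard errors above the hypothesized mean, the p-value is effectively zero, matching the conclusion in the example.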

⚖️ Making the Decision

🎚️ The significance level (α)

  • α is a preset threshold, also called the significance level.
  • It represents the probability of a Type I error (rejecting H₀ when H₀ is actually true).
  • If not given, use α = 0.05 by default.

🔀 Decision rules

Compare the p-value to α:

| Condition | Decision | Interpretation |
|---|---|---|
| α > p-value | Reject H₀ | Results are statistically significant; sufficient evidence for Hₐ |
| α ≤ p-value | Do not reject H₀ | Results are not significant; insufficient evidence for Hₐ |
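The decision rule can be sketched as a small helper function (the name `decide` is illustrative, not from the excerpt):

```python
def decide(p_value, alpha=0.05):
    """p-value method: reject H0 only when alpha exceeds the p-value."""
    if alpha > p_value:
        return "Reject H0: results are statistically significant."
    return "Do not reject H0: insufficient evidence against H0."

print(decide(0.003))   # small p-value -> reject
print(decide(0.21))    # large p-value -> do not reject
```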

🧠 Memory aid

The excerpt provides a helpful rhyme:

  • "If the p-value is low, the null must go." (Reject H₀)
  • "If the p-value is high, the null must fly." (Do not reject H₀)

⚠️ Important clarification

Don't confuse: "Do not reject H₀" does NOT mean you believe H₀ is true.

  • It only means the sample data failed to provide sufficient evidence to cast serious doubt about H₀.
  • You are not proving H₀ correct; you simply lack enough evidence to reject it.
  • This is a critical distinction in hypothesis testing logic.

Hypothesis Tests in Depth

5.6 Hypothesis Tests in Depth

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing requires understanding not only p-values and decision rules but also the probability and consequences of errors, the difference between statistical and practical significance, and how rare events challenge assumptions.

📌 Key points (3–5)

  • Rare events logic: when sample data show outcomes that would be very unlikely under the null hypothesis, we doubt the null hypothesis.
  • Two types of errors: Type I error (rejecting a true null, probability α) and Type II error (failing to reject a false null, probability β).
  • Common confusion: statistical significance vs. practical significance—large samples can detect tiny differences that have no real-world value.
  • Decision rule: compare p-value to preset α; if α > p-value, reject H₀; if α ≤ p-value, fail to reject H₀.
  • Power of a test: equals 1 – β; larger samples reduce both error probabilities and increase power.

🎲 Rare events and hypothesis logic

🎲 How rare events challenge assumptions

  • The core reasoning: you assume the null hypothesis is true, then collect real sample data.
  • If the sample shows something that would be very unlikely under that assumption, you conclude the assumption is probably wrong.
  • The excerpt emphasizes: "your assumption is just an assumption; it is not a fact," but "your sample data are real and are showing you a fact that seems to contradict your assumption."

🎁 Example scenario

  • A basket contains 200 bubbles; the claim (null hypothesis) is that only one contains a $100 bill.
  • The first person draws a bubble and gets the $100 bill.
  • Probability of this happening is 1/200 = 0.005 (very low).
  • Because this rare event occurred, the second person doubts the original claim and suspects there may be more $100 bills.
  • Don't confuse: the rare event is not proof that the null is false, but it casts serious doubt.

⚠️ Errors in hypothesis testing

⚠️ The four possible outcomes

The excerpt presents a table with four outcomes based on whether H₀ is actually true or false and whether you reject or do not reject it:

| H₀ is actually | Action: Do not reject | Action: Reject |
|---|---|---|
| True | Correct outcome | Type I error |
| False | Type II error | Correct outcome |

🔴 Type I error (false positive)

Type I error: rejecting the null hypothesis when the null hypothesis is true.

  • Probability of Type I error = α.
  • α is often preset (commonly 0.05), meaning "we accept making this error in 5% of samples."
  • The p-value is the observed significance level: the smallest α at which the observed data would lead to rejecting H₀.
  • Example: Frank thinks his rock climbing equipment is unsafe when it is actually safe.

🔵 Type II error (false negative)

Type II error: not rejecting the null hypothesis when the null hypothesis is false.

  • Probability of Type II error = β.
  • Example: Frank thinks his rock climbing equipment is safe when it is actually not safe.
  • Don't confuse: in Frank's scenario, the Type II error has greater consequences (he uses unsafe equipment), even though Type I errors are often emphasized in decision rules.

💪 Power of the test

Power of a test = 1 – β.

  • Power is the probability of correctly rejecting a false null hypothesis.
  • Ideally, α and β should be as small as possible, and power should be as close to 1 as possible.
  • Larger sample sizes reduce both α and β, thereby increasing power.
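To make the sample-size effect concrete, here is a sketch that computes the power of a right-tailed z-test under hypothetical numbers (a true mean of 96 against a null of 90, σ = 15); the formula is the standard one for a known-σ test and is not stated in the excerpt:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test: P(reject H0 | true mean = mu_true)."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)      # rejection cutoff in z units
    # Reject when xbar > mu0 + z_crit * se; evaluate that event under mu_true
    return 1 - NormalDist().cdf(z_crit - (mu_true - mu0) / se)

# Same effect size, larger sample -> higher power (smaller beta)
print(round(power_one_sided_z(90, 96, 15, n=25), 3))     # ≈ 0.639
print(round(power_one_sided_z(90, 96, 15, n=100), 3))    # ≈ 0.991
```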

📏 Statistical vs. practical significance

📏 What the distinction means

Statistical significance: the difference is detected and the p-value is below α.

Practical significance: the difference is large enough to matter in real-world terms.

  • As sample size increases, point estimates become more precise and even very small real differences become easier to detect.
  • With a large enough sample, even the slightest difference will be statistically significant, but it may have no practical value.

🎬 Example scenario

  • An online experiment finds that adding ads to a movie review website statistically significantly increases TV show viewership by 0.001%.
  • This increase is statistically significant (detected with high confidence) but not practically significant (no real-world value).

🧪 Planning sample size

  • A data scientist should consult experts or literature to determine the smallest meaningful difference from the null value.
  • She estimates the standard error and suggests a sample size large enough to detect meaningful differences but not so large that trivial differences are over-emphasized.
  • This is especially important when costs or risks (e.g., health impacts in medical studies) are involved.

🧠 Decision rules and interpretation

🧠 How to decide: comparing p-value and α

  • If α > p-value: reject H₀. The results are statistically significant; there is sufficient evidence that H₀ is incorrect and Hₐ may be correct.
  • If α ≤ p-value: fail to reject H₀. The results are not significant; there is not sufficient evidence to conclude Hₐ may be correct.

🧠 Memory aid

The excerpt provides rhymes:

  • "If the p-value is low, the null must go" (reject H₀).
  • "If the p-value is high, the null must fly" (do not reject H₀).

⚠️ Important note on "fail to reject"

When you "do not reject H₀," it does not mean that you should believe that H₀ is true.

  • It simply means the sample data have failed to provide sufficient evidence to cast serious doubt about H₀.
  • Don't confuse: failing to reject is not the same as accepting or proving the null hypothesis.

Chapter 5 Wrap-Up: Statistical Significance and Study Design

Chapter 5 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

Statistical significance does not guarantee practical importance, and data scientists must design studies that balance sample size with the ability to detect meaningful differences while avoiding detection of trivial effects.

📌 Key points (3–5)

  • Statistical vs. practical significance: Large samples can detect tiny differences that are statistically significant but have no real-world value.
  • Sample size planning: Data scientists should consult experts to determine the smallest meaningful difference before designing a study.
  • Common confusion: A statistically significant result does not mean the effect is large or important—it only means the difference is unlikely due to chance.
  • Study design considerations: Balancing sample size involves weighing detection power against costs and risks (e.g., health impacts in medical studies).

📏 Statistical vs. Practical Significance

📊 What each term means

Statistically significant: A difference that is unlikely to have occurred by chance alone, typically detected through hypothesis testing.

Practically significant: A difference large enough to matter in real-world applications.

  • These two concepts are independent—you can have one without the other.
  • The excerpt emphasizes that statistical significance alone is not sufficient for decision-making.

🔍 How large samples create the gap

  • Larger samples → more precision: Point estimates become more accurate and can detect smaller real differences.
  • The problem: With very large samples, even trivial differences become detectable.
  • Example: An online experiment detects that additional ads increase TV show viewership by 0.001%—this is statistically significant but has no practical value.
  • Don't confuse: "Significant" in statistics means "detectable/real," not "important" or "large."

🎯 Planning Study Size

🧪 The data scientist's role

The excerpt identifies several steps a data scientist should take when planning a study:

  1. Consult experts or literature to identify the smallest meaningful difference from the null value.
  2. Obtain rough estimates of parameters (e.g., the true proportion p) to estimate the standard error.
  3. Calculate appropriate sample size that is large enough to detect meaningful differences but not wastefully large.

⚖️ Balancing considerations

| Factor | Why it matters |
|---|---|
| Detection power | Sample must be large enough to find real, meaningful differences |
| Costs | Larger samples require more resources |
| Risks | Medical studies may expose more volunteers to potential health impacts |

  • The goal is "sufficiently large enough to detect the real difference if it is meaningful."
  • Larger samples may still be used, but these calculations help when costs or risks are important constraints.

📚 Chapter structure reference

🗂️ Key terms covered

The excerpt lists terminology from throughout Chapter 5:

Foundational concepts (5.1):

  • Statistical inference, point estimate, parameter, statistic
  • Sampling variability, sampling distribution, standard error
  • Law of large numbers

Core methods (5.3–5.6):

  • Confidence interval, margin of error, critical value
  • Hypothesis test, null hypothesis (H₀), alternative hypothesis (Hₐ)
  • Test statistic, p-value, significance level

Error types (5.6):

  • Type I error, Type II error, Power

📖 Chapter organization

The chapter covers six main sections progressing from estimation to testing:

  1. Point Estimation and Sampling Distributions
  2. The Sampling Distribution of the Sample Mean (CLT)
  3. Introduction to Confidence Intervals
  4. The Behavior of Confidence Intervals
  5. Introduction to Hypothesis Tests
  6. Hypothesis Tests in Depth

The Sampling Distribution of the Sample Mean (t)

6.1 The Sampling Distribution of the Sample Mean (t)

🧭 Overview

🧠 One-sentence thesis

When the population standard deviation is unknown, we use the Student's t-distribution instead of the normal distribution, with the exact shape depending on sample size through degrees of freedom.

📌 Key points (3–5)

  • The practical problem: In real studies, we rarely know the population standard deviation σ, so we must estimate it with the sample standard deviation s.
  • Student's t-distribution: William Gosset discovered that using s instead of σ requires a different distribution, especially for small samples, which depends on sample size.
  • Degrees of freedom: The t-distribution's shape is determined by df = n - 1, where n is the sample size.
  • Common confusion: As sample size increases (around n = 30), the t-distribution becomes more like the normal distribution; modern practice uses t whenever σ is unknown, regardless of sample size.
  • Key difference from normal: The t-distribution has thicker tails and is shorter in the center compared to the standard normal distribution.

📜 Historical context and the problem

📜 Why the normal distribution wasn't enough

  • Historically, when sample sizes were large, statisticians simply replaced σ with s and used normal distribution methods with acceptable results.
  • Small sample sizes caused inaccuracies in confidence intervals when using this approach.
  • The problem: s is a less reliable estimate of σ when samples are small.

🍺 William Gosset's discovery

  • William S. Gosset (1876–1937) worked at Guinness brewery in Dublin, Ireland.
  • His experiments with hops and barley produced very few samples, making existing inference techniques inaccurate.
  • He realized the actual distribution depends on the sample size because s becomes more reliable as samples get bigger.
  • He published under the pseudonym "Student" (so readers wouldn't know he was a Guinness scientist), leading to the name "Student's t-distribution."
  • Example: A brewery scientist testing a new ingredient with only 10 samples cannot rely on normal distribution approximations.

⏳ Evolution of practice

  • Until the mid-1970s: statisticians used normal distribution for large samples (often n > 30) and t-distribution only for smaller samples.
  • Modern practice: use the Student's t-distribution whenever s is used as an estimate for σ, regardless of sample size.

🔢 Understanding the t-distribution mechanics

🔢 The t-score formula

  • The t-score is calculated as: t = (sample mean - population mean) divided by (s divided by square root of n).
  • It has the same interpretation as the z-score: it measures how far the sample mean is from the population mean μ.
  • The distribution of these t-scores follows a Student's t-distribution with n - 1 degrees of freedom.
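
The verbal formula above can be sketched in code; the sample values here are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

# t-score: how many estimated standard errors the sample mean lies
# from the hypothesized population mean mu.
def t_score(sample_mean, pop_mean, s, n):
    return (sample_mean - pop_mean) / (s / math.sqrt(n))

# Hypothetical sample: mean 103, s = 8, n = 16, hypothesized mu = 100.
t = t_score(103, 100, 8, 16)  # (103 - 100) / (8 / 4) = 1.5
df = 16 - 1                   # degrees of freedom
```

Under the null hypothesis, this t follows a Student's t-distribution with df = 15.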

🎯 Degrees of freedom (df)

Degrees of freedom: the number n - 1, which comes from the calculation of the sample standard deviation s.

  • When calculating sample standard deviation, we divide by n - 1 even though we use n deviations.
  • Why n - 1? Because the sum of all deviations equals zero, so once you know n - 1 deviations, the last one is determined—only n - 1 can "vary freely."
  • Notation: if sample size is 20, then df = 20 - 1 = 19, written as T ~ t₁₉.
  • Example: With a sample of 20 items, you have 19 degrees of freedom; knowing any 19 deviations from the mean automatically determines the 20th.

📊 Properties and behavior of the t-distribution

📊 Comparing t to normal (z)

| Property | Student's t-distribution | Standard normal distribution |
| --- | --- | --- |
| Center | Symmetric about mean of zero | Symmetric about mean of zero |
| Tails | Thicker (more probability) | Thinner |
| Center height | Shorter | Taller |
| Spread | Greater spread | Standard spread |

📈 How df affects the shape

  • As degrees of freedom increase, the t-distribution becomes more like the standard normal distribution.
  • Around df = 30, the t-distribution closely resembles the normal distribution (connecting to the Central Limit Theorem).
  • With small df (e.g., df = 5), the distribution is noticeably shorter and wider than normal.
  • Don't confuse: each sample size has its own t-distribution; they are a family of distributions, not a single distribution.
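
A minimal numeric check of the "shorter center" property, using the standard t-density formula (not from the excerpt) evaluated at zero:

```python
import math

def t_pdf(x, df):
    """Student's t density; shorter at the center for small df."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

center_t5 = t_pdf(0, 5)    # noticeably shorter than normal
center_t30 = t_pdf(0, 30)  # already close to normal
center_z = normal_pdf(0)   # about 0.3989
```

The heights increase with df, illustrating the convergence toward the standard normal.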

✅ Key properties summary

  • Symmetry: Like the normal curve, symmetric about zero.
  • Tail behavior: More probability in tails than standard normal because of greater spread.
  • Shape dependency: Exact shape depends on degrees of freedom.
  • Convergence: As df increases, approaches standard normal distribution.
  • Assumptions: The underlying population of individual observations is assumed to be normally distributed (bell-shaped) with unknown μ and unknown σ.

🛠️ When to use the t-distribution

🛠️ Deciding between distributions

  • Use t-distribution when:
    • Population standard deviation σ is unknown
    • Using sample standard deviation s as an estimate
    • Data show no skewness or outliers
    • Random sampling is assumed

💉 Example scenario

  • Study: measuring effectiveness of acupuncture in relieving pain
  • Sample: 15 subjects with sensory rate measurements
  • Data check: plots show no skewness or outliers
  • Decision: Use t-distribution with df = 14 (since n = 15, df = 15 - 1 = 14)
  • Why: no information about population standard deviation and small sample size

📋 Finding probabilities

  • t-tables: Give t-scores corresponding to confidence levels (columns) and degrees of freedom (rows).
  • Limitation: t-tables are adequate for finding critical values but very limited for finding p-values.
  • Modern approach: Calculators and computers easily calculate any Student's t-probabilities.
  • Note: Some tables show confidence level in headers; others show area in one or both tails.

Inference for the Mean in Practice

6.2 Inference for the Mean in Practice

🧭 Overview

🧠 One-sentence thesis

In practice, we almost never know the population standard deviation, so we use the t-distribution (especially for small samples) to construct confidence intervals and perform hypothesis tests for the population mean.

📌 Key points (3–5)

  • When to use t vs z: Use t when the population standard deviation σ is unknown and sample size is small (n < 30); for larger samples, z may work via the Central Limit Theorem.
  • Confidence interval structure: The general format is point estimate ± margin of error, where the margin of error uses the t critical value, sample standard deviation, and degrees of freedom (df = n – 1).
  • Hypothesis testing with t: The test statistic changes to t = (sample mean – hypothesized mean) / (sample standard deviation / √n), and p-values can be found using technology or estimated from t-tables.
  • Common confusion: t-distribution vs normal distribution—use t when σ is unknown; use z (normal) only when σ is known, which is rare in practice.
  • Key assumptions: Data must be a simple random sample from an approximately normal population (or large enough sample size), and we use the sample standard deviation to approximate the unknown population standard deviation.

📊 When to use the t-distribution

📊 Practical reality: σ is rarely known

  • In real-world scenarios, we almost never know the true population standard deviation σ.
  • The excerpt emphasizes: "In practice, we rarely know the population standard deviation."
  • For larger samples (n ≥ 30), the Central Limit Theorem allows us to use z even without knowing σ.
  • Most common scenario: Use t when σ is unknown and sample size is small (n < 30).

🔍 t vs z decision rule

| Condition | Distribution to use | Why |
| --- | --- | --- |
| σ known, any sample size | z (normal) | Population parameter is known |
| σ unknown, n ≥ 30 | z or t (either works) | CLT makes sampling distribution approximately normal |
| σ unknown, n < 30 | t | Must account for extra uncertainty from estimating σ |

  • Don't confuse: The choice depends on both whether σ is known and the sample size.

🎯 Confidence intervals using t

🎯 General structure

Confidence interval format: (PE – MoE, PE + MoE)

  • Point estimate (PE): The sample mean (x̄) estimates the population mean μ.
  • Margin of error (MoE): Calculated as t(α/2) × (s / √n), where:
    • t(α/2) is the t critical value with area α/2 to the right
    • s is the sample standard deviation
    • n is the sample size
    • Degrees of freedom: df = n – 1

🧮 Step-by-step construction

  1. Identify sample statistics: Calculate sample mean (x̄) and sample standard deviation (s) from the data.
  2. Determine confidence level: If confidence level is CL, then α = 1 – CL, and α/2 goes in each tail.
  3. Find t critical value: Use t(α/2) with df = n – 1 degrees of freedom.
  4. Calculate margin of error: MoE = t(α/2) × (s / √n).
  5. Build the interval: Lower bound = x̄ – MoE; Upper bound = x̄ + MoE.

💡 Example walkthrough

Example: A sample of 30 Leadership PACs had mean receipts of $251,854.23 and sample standard deviation of $521,130.41. Construct a 96% confidence interval.

  • Given: n = 30, so df = 30 – 1 = 29; x̄ = $251,854.23; s = $521,130.41
  • Confidence level: CL = 0.96, so α = 1 – 0.96 = 0.04, and α/2 = 0.02
  • Critical value: t(0.02) with df = 29 is 2.150
  • Margin of error: 2.150 × (521,130.41 / √30) ≈ $204,561.66
  • Interval: Lower = 251,854.23 – 204,561.66 = $47,292.57; Upper = 251,854.23 + 204,561.66 = $456,415.89
  • Interpretation: We estimate with 96% confidence that the mean amount raised by all Leadership PACs lies between $47,292.57 and $456,415.89.
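
The walkthrough above can be reproduced numerically; the t critical value 2.150 is the one given in the excerpt:

```python
import math

# 96% CI for the Leadership PAC example: x_bar +/- t * s / sqrt(n)
x_bar, s, n = 251854.23, 521130.41, 30
t_crit = 2.150  # t(0.02) with df = 29, from the excerpt

moe = t_crit * s / math.sqrt(n)       # ~ 204,561.66
lower, upper = x_bar - moe, x_bar + moe
```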

🧪 Hypothesis tests using t

🧪 Test statistic formula

t = (sample mean – hypothesized mean) / (sample standard deviation / √n)

  • The structure mirrors the z-test, but uses the sample standard deviation s instead of the population standard deviation σ.
  • Degrees of freedom: df = n – 1.
  • The test statistic measures how many standard errors the sample mean is from the hypothesized population mean.

🔬 Hypothesis test steps

  1. Set up hypotheses: State H₀ (null hypothesis, usually μ = some value) and Hₐ (alternative hypothesis, μ >, <, or ≠ some value).
  2. Choose significance level: Typically α = 0.05 or 0.01.
  3. Determine distribution: Use t with df = n – 1 when σ is unknown.
  4. Calculate test statistic: Compute t using the formula above.
  5. Find p-value: Use technology or estimate from t-tables (tables give ranges, not exact values).
  6. Compare α and p-value: If p-value < α, reject H₀; otherwise, fail to reject H₀.
  7. State conclusion: Interpret the decision in the context of the problem.

📝 Example walkthrough

Example: An instructor thinks the mean score on a statistics test is higher than 65. A sample of 10 students has scores: 65, 65, 70, 67, 66, 63, 63, 68, 72, 71. Test at α = 0.05.

  • Hypotheses: H₀: μ = 65; Hₐ: μ > 65 (right-tailed test because instructor thinks mean is higher)
  • Distribution: t with df = 10 – 1 = 9 (no population standard deviation given)
  • Sample statistics: Sample mean = 67, sample standard deviation = 3.1972
  • Test statistic: t = (67 – 65) / (3.1972 / √10) ≈ 1.98
  • p-value: P(x̄ > 67) = P(t > 1.98) ≈ 0.0396 (probability of observing a sample mean of 67 or more if H₀ is true)
  • Decision: Since α = 0.05 > p-value = 0.0396, reject H₀
  • Conclusion: At the 5% significance level, there is sufficient evidence that the mean test score is more than 65.
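
The sample statistics and test statistic can be computed directly from the listed scores:

```python
import math
import statistics

scores = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]
n = len(scores)

x_bar = statistics.mean(scores)        # 67
s = statistics.stdev(scores)           # sample standard deviation, ~3.1972
t = (x_bar - 65) / (s / math.sqrt(n))  # ~1.98, compared against t with df = 9
```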

⚠️ Interpreting p-values

  • The p-value is the probability of observing a sample result as extreme as (or more extreme than) what was observed, assuming the null hypothesis is true.
  • Example: p-value = 0.0396 means there is a 3.96% chance of getting a sample mean of 67 or higher if the true population mean is 65.
  • Don't confuse: The p-value is not the probability that the null hypothesis is true.

📊 Using t-tables for p-values

  • Technology (calculators, computers) can calculate exact p-values easily.
  • t-tables are limited: they provide ranges of p-values rather than exact values.
  • You can still compare the range to your significance level α to make a decision.

✅ Assumptions and conditions

✅ When t-tests are valid

For inference on a single population mean μ using the t-distribution, the following assumptions must be met:

  • Random sampling: Data must be a simple random sample from the population.
  • Normality: The population should be approximately normally distributed.
    • Exception: If the sample size is sufficiently large, the t-test works even if the population is not normal (thanks to the Central Limit Theorem).
  • Unknown σ: Use the sample standard deviation s to approximate the population standard deviation σ.

🔄 Comparison: t-test vs z-test assumptions

| Assumption | t-test | z-test |
| --- | --- | --- |
| Random sample | Required | Required |
| Population distribution | Approximately normal (or large n) | Normal (or large n) |
| Population standard deviation | Unknown (use s) | Known (use σ) |
| Reality | Common in practice | Rare in practice |

  • Don't confuse: The z-test requires knowing σ, which "is rarely known in reality" according to the excerpt.
  • The t-test is more practical because it accounts for the uncertainty of estimating σ from the sample.

The Sampling Distribution of the Sample Proportion

6.3 The Sampling Distribution of the Sample Proportion

🧭 Overview

🧠 One-sentence thesis

When the sample size is sufficiently large and observations are independent, the sample proportion follows a normal distribution centered at the true population proportion, allowing us to apply the central limit theorem to categorical data.

📌 Key points (3–5)

  • What we're estimating: For categorical data, we estimate the population proportion (p) using the sample proportion (p-hat), which is the count of individuals with a characteristic divided by the total sample size.
  • How the CLT applies: The sample proportion behaves like a normal distribution when np ≥ 10 and n(1 − p) ≥ 10, with center at p and standard error that decreases as sample size increases.
  • Common confusion: The distribution is never perfectly normal because the sample proportion takes discrete values (x/n), but it approximates normality well when conditions are met.
  • What affects the distribution: Sample size and the value of p both influence the shape—larger samples reduce variability, and p = 0.5 produces the largest variability for a given sample size.
  • Why conditions matter: When np or n(1 − p) is too small (below 10), the distribution becomes noticeably skewed and discrete rather than smooth and symmetric.

📊 Understanding proportion variability through simulation

📊 The simulation setup

The excerpt uses a concrete example to illustrate sampling variability:

  • Population parameter: p = 0.88 (88% of American adults support solar energy expansion)
  • Sample size: n = 1,000 American adults
  • Question: How close will a sample proportion be to the true 88%?

The simulation process:

  1. Imagine 250 million pieces of paper (one per American adult)
  2. Write "support" on 88% and "not" on 12%
  3. Randomly draw 1,000 pieces
  4. Calculate the fraction that say "support"

🔢 What the simulation revealed

Running 10,000 simulated samples produced these patterns:

  • Individual sample errors: One sample gave 0.894 (error +0.014), another 0.885 (error +0.005), another 0.878 (error −0.002), another 0.859 (error −0.021)
  • Overall distribution: The histogram of all 10,000 sample proportions showed a predictable pattern

Example: Even though the true proportion is 0.88, any single sample of 1,000 people might give estimates ranging from about 0.86 to 0.91, but most cluster near 0.88.

📐 Three characteristics of the sampling distribution

The simulation's histogram can be described by:

| Characteristic | What it shows | Notation |
| --- | --- | --- |
| Center | Mean = 0.880, matching the parameter p | Centered at p |
| Spread | Standard deviation = 0.010 (called the "standard error" for sampling distributions) | SE |
| Shape | Symmetric, bell-shaped, resembles normal distribution | Approximately normal |

Don't confuse: "Standard deviation" vs "standard error"—when discussing sampling distributions or variability of point estimates, use "standard error" rather than "standard deviation."
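
The simulation's spread of 0.010 matches the standard error formula √(p(1 − p)/n); the mini-simulation below is a hypothetical stand-in for the excerpt's 10,000 draws:

```python
import math
import random

# Standard error of a sample proportion: sqrt(p(1 - p) / n).
p, n = 0.88, 1000
se = math.sqrt(p * (1 - p) / n)  # ~0.010, the spread seen in the simulation

# A few simulated sample proportions (seeded so the draws are reproducible):
random.seed(1)
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(5)]
```

Each simulated proportion lands close to 0.88, clustering within a few standard errors.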

✅ Conditions for applying the CLT to proportions

✅ The success-failure condition

When observations are independent and the sample size is sufficiently large, the sample proportion will tend to follow a normal distribution with center at p.

Sufficiently large means:

  • np ≥ 10 (expected number of successes)
  • n(1 − p) ≥ 10 (expected number of failures)
  • Some resources use 5 as the threshold, but 10 is safer
  • Also recommended: actual successes (x) and failures (n − x) both exceed 10, implying minimum sample size of 20

Why these conditions: The normal approximation simply doesn't work well with smaller sample sizes; the distribution remains too discrete and potentially skewed.
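
The success-failure check is easy to encode (the helper name is ours, not from the excerpt):

```python
def meets_success_failure(n, p, threshold=10):
    """Check the CLT conditions n*p >= 10 and n*(1 - p) >= 10."""
    return n * p >= threshold and n * (1 - p) >= threshold

ok = meets_success_failure(1000, 0.88)      # 880 successes, 120 failures expected
too_small = meets_success_failure(50, 0.1)  # only 5 expected successes: fails
```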

🔴 What happens when conditions fail

The excerpt shows multiple distributions, comparing those that meet conditions (green) versus those that don't (red):

Patterns when conditions are NOT met:

  1. Small n or small np or n(1 − p): Distribution looks discrete (not continuous/smooth)
  2. np or n(1 − p) < 10: Distribution shows noticeable skew
  3. Very small values: Discreteness is obvious, shape far from normal

Patterns when conditions ARE met:

  1. Large np and n(1 − p): Distribution looks much more normal
  2. Very large values: Discreteness hardly evident, smooth bell curve
  3. Larger sample sizes: Variability becomes much smaller

Example: With p = 0.1 and small n, the distribution is right-skewed and lumpy; with p = 0.5 and large n, it's symmetric and smooth.

🎯 How parameters affect the distribution

🎯 Effect of the population proportion p

Center is always unbiased:

  • The sampling distribution is always centered at the true population proportion p
  • This means the sample proportion is an accurate (unbiased) estimator when data are independent

Variability depends on p:

  • For a given sample size, variability is largest when p = 0.5
  • The differences may be subtle but reflect the role of p in the standard error formula
  • As p moves toward 0 or 1, variability decreases slightly

Shape depends on p:

  • When p is near 0.5, the distribution is most symmetric
  • When p = 0.1, the distribution is right-skewed
  • When p = 0.9, the distribution is left-skewed

📏 Effect of sample size n

Variability decreases with larger n:

  • For a particular population proportion, the sampling distribution's variability decreases as sample size increases
  • This aligns with intuition: estimates based on larger samples tend to be more accurate

Shape improves with larger n:

  • Larger samples make the distribution look more normal
  • The discrete nature becomes less evident
  • Skewness becomes less pronounced

Don't confuse: The distribution will never be perfectly normal because the sample proportion always takes discrete values (x/n)—it's always a matter of degree, which is why we use the np ≥ 10 and n(1 − p) ≥ 10 guideline.


Inference for a Proportion

6.4 Inference for a Proportion

🧭 Overview

🧠 One-sentence thesis

When working with categorical data, we can perform hypothesis tests and construct confidence intervals for a population proportion using the normal approximation, provided certain sample size conditions are met.

📌 Key points (3–5)

  • What we're estimating: the population proportion p using the sample proportion (x/n), where x is the number of successes and n is the sample size.
  • When normal approximation works: conditions np ≥ 10 and n(1 − p) ≥ 10 must be met to apply the central limit theorem.
  • Hypothesis test adjustment: when testing proportions, we substitute the null hypothesis value p₀ into the standard error formula instead of using the unknown true p.
  • Common confusion: for confidence intervals we use sample proportions in the standard error, but for hypothesis tests we use the null hypothesis value p₀.
  • Sample size planning: to achieve a desired margin of error, use p = 0.5 as a conservative estimate if no prior information exists.

🔍 Recognizing proportion problems

🔍 How to identify categorical data

  • The underlying distribution is binomial, not normal.
  • You see categorical data with no mention of a mean or average.
  • If X is a binomial random variable, then X ~ B(n, p), where n is the number of trials and p is the probability of success.
  • Example: surveying whether people own cell phones (yes/no) rather than measuring how many minutes they use them.

📊 The sampling distribution

When conditions np ≥ 10 and n(1 − p) ≥ 10 are met, the sample proportion follows approximately a normal distribution.

  • The point estimate for p is the sample proportion = x/n (also sometimes denoted as p').
  • This allows us to use normal distribution methods for inference.

🧪 Hypothesis tests for proportions

🧪 Setting up the hypotheses

The hypotheses take the form:

  • H₀: p = p₀ (null hypothesis specifies a value)
  • Hₐ: p (<, >, or ≠) p₀ (alternative can be one-tailed or two-tailed)

📐 Test statistic formula

The general test statistic form is: (point estimate − null value) / standard error.

Key adjustment for proportions:

  • We substitute p₀ (the null hypothesis value) into the standard error calculation.
  • This gives: z = (sample proportion − p₀) / square root of [p₀(1 − p₀)/n]
  • Why? Because (1) we don't actually know p, and (2) in hypothesis testing we begin by assuming the null is true.

🎯 Worked scenario

Example from the excerpt: Joon tests whether 50% of first-time brides are younger than their grooms.

  • Sample: 100 brides, 53 say they are younger
  • Hypotheses: H₀: p = 0.50 vs Hₐ: p ≠ 0.50 (two-tailed)
  • Sample proportion = 53/100 = 0.53
  • Test statistic: z = 0.6
  • P-value = 0.5485
  • Decision at α = 0.01: cannot reject H₀ because p-value > α
  • Conclusion: insufficient evidence that the percentage differs from 50%
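
The Joon example can be verified numerically; note that the null value p₀ = 0.50, not the sample proportion, goes into the standard error:

```python
import math
from statistics import NormalDist

# Two-tailed z-test of H0: p = 0.50 with 53 successes in 100 trials.
p0, x, n = 0.50, 53, 100
p_hat = x / n                            # 0.53
se = math.sqrt(p0 * (1 - p0) / n)        # null value in the standard error
z = (p_hat - p0) / se                    # 0.6
p_value = 2 * (1 - NormalDist().cdf(z))  # ~0.5485 (two-tailed)
```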

📏 Confidence intervals for proportions

📏 General structure

The confidence interval follows the format: (PE − MoE, PE + MoE)

Where:

  • PE (point estimate) = sample proportion
  • MoE (margin of error) = (critical value) × (standard error)

🔧 Key formula difference

For confidence intervals (not hypothesis tests):

  • Standard error uses the sample proportions: square root of [sample proportion × (1 − sample proportion) / n]
  • We use sample proportions because p and q are unknown
  • The estimated proportions from data are used: one is the estimated proportion of successes, the other is the estimated proportion of failures

Don't confuse: In hypothesis tests we use p₀ in the standard error; in confidence intervals we use the sample proportion.

🎯 Worked scenario

Example from the excerpt: estimating cell phone ownership in a city.

  • Sample: 500 adults, 421 own cell phones
  • Sample proportion = 421/500 = 0.842
  • 95% confidence level means α = 0.05, so critical value z = 1.96
  • Margin of error = 0.032
  • Confidence interval: (0.810, 0.874)
  • Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents have cell phones
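
The interval above can be reproduced as follows:

```python
import math

# 95% CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
x, n, z = 421, 500, 1.96
p_hat = x / n                                  # 0.842
moe = z * math.sqrt(p_hat * (1 - p_hat) / n)   # ~0.032
lower, upper = p_hat - moe, p_hat + moe        # ~(0.810, 0.874)
```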

💡 What 95% confidence means

Ninety-five percent of confidence intervals constructed this way would contain the true population proportion.

📊 Sample size determination

📊 Planning for a target margin of error

If researchers want a specific margin of error, they can solve the margin of error formula for n.

The formula becomes: n = p(1 − p) × (critical value / desired margin of error)²

🤔 The p problem

Challenge: the formula requires p, but if we're estimating p, we don't know it yet.

Three solutions:

  1. Use prior information: if you have a previous sample, calculate a point estimate and plug it in
  2. Use your best guess: estimate p based on expert knowledge
  3. Use a conservative estimate: set p = 0.5

🛡️ Why p = 0.5 is conservative

  • We multiply p × (1 − p) in the formula
  • The product p × (1 − p) is maximized when both equal 0.5
  • (0.5)(0.5) = 0.25 is the largest possible product
  • Other examples: (0.6)(0.4) = 0.24; (0.3)(0.7) = 0.21; (0.2)(0.8) = 0.16
  • The largest product gives the largest n, ensuring the sample is large enough to achieve the desired confidence level and margin of error
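
A sketch of the sample-size formula, using a hypothetical target margin of error of 0.03 at 95% confidence (the function name is ours):

```python
import math

def sample_size(moe, z=1.96, p=0.5):
    """Required n for a target margin of error; p = 0.5 is the conservative choice."""
    return math.ceil(p * (1 - p) * (z / moe) ** 2)

n_conservative = sample_size(0.03)       # with p = 0.5
n_with_prior = sample_size(0.03, p=0.8)  # smaller n when p is farther from 0.5
```

Always round up: rounding down would leave the sample slightly too small for the target margin of error.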

Behavior of Confidence Intervals for a Proportion

6.5 Behavior of Confidence Intervals for a Proportion

🧭 Overview

🧠 One-sentence thesis

The "plus-four" method improves confidence interval accuracy for proportions by adding two imaginary successes and two failures to the sample data when certain conditions are met.

📌 Key points (3–5)

  • Standard method limitation: Using point estimates to calculate the standard deviation for proportion confidence intervals introduces error because the true population proportion is unknown.
  • Plus-four adjustment: Add four imaginary observations (two successes, two failures) to get adjusted sample size (n + 4) and adjusted successes (x + 2).
  • When to use plus-four: Apply this method when the confidence level is at least 90% and the sample size is at least ten.
  • Common confusion: Don't use the original sample values directly—you must adjust both the count of successes and the total sample size before calculating the confidence interval.
  • Effectiveness: Computer studies have demonstrated that this simple adjustment produces more accurate confidence intervals than the standard method.

🔧 The problem with standard estimation

⚠️ Why standard methods have error

  • When constructing a confidence interval for a population proportion, we face a fundamental problem: we don't know the true proportion.
  • Because the true proportion is unknown, we are forced to use point estimates (sample statistics) to calculate the standard deviation of the sampling distribution.
  • Studies have shown that this estimation of the standard deviation can be flawed, introducing error into the confidence interval.

🎯 The need for adjustment

  • The error is systematic and predictable enough that a correction method can improve accuracy.
  • Fortunately, there is a simple adjustment available that produces more accurate results.

🔢 The plus-four method mechanics

📐 How the adjustment works

The plus-four method: pretend you have four additional observations—two successes and two failures—then calculate the confidence interval using these adjusted values.

Step-by-step adjustment:

  • Original sample size: n
  • Original number of successes: x
  • Adjusted sample size: n + 4
  • Adjusted number of successes: x + 2
  • Then calculate the sample proportion and confidence interval using the adjusted values

🎲 Worked example from the excerpt

Scenario: 25 statistics students surveyed; 6 reported smoking in the past week. Find a 95% confidence interval using plus-four.

Original values:

  • x = 6 (successes)
  • n = 25 (sample size)

Adjusted values:

  • x = 6 + 2 = 8
  • n = 25 + 4 = 29

Calculation:

  • Sample proportion (p-hat) = 8/29 ≈ 0.276
  • Complement (q-hat) = 1 - 0.276 = 0.724
  • For 95% confidence level: z = 1.96
  • Margin of error ≈ 0.163
  • Confidence interval: 0.276 ± 0.163 = (0.113, 0.439)

Interpretation: We are 95% confident that the true proportion of all statistics students who smoke is between 0.113 and 0.439.
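
The plus-four walkthrough can be reproduced numerically:

```python
import math

# Plus-four interval: add two successes and two failures before anything else.
x, n, z = 6, 25, 1.96
x_adj, n_adj = x + 2, n + 4                      # 8 and 29

p_hat = x_adj / n_adj                            # 8/29 ~ 0.276
moe = z * math.sqrt(p_hat * (1 - p_hat) / n_adj) # ~0.163
lower, upper = p_hat - moe, p_hat + moe          # ~(0.113, 0.439)
```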

✅ When to apply plus-four

📋 Conditions for use

The excerpt specifies two requirements:

| Requirement | Threshold |
| --- | --- |
| Confidence level | At least 90% |
| Sample size | At least ten |

  • Both conditions must be met to use this method appropriately.
  • Computer studies have validated the effectiveness of this approach under these conditions.

🚫 Don't confuse with standard method

  • Standard method: Use x and n directly from the sample.
  • Plus-four method: Always add 2 to x and 4 to n before any calculations.
  • The adjustment is not optional once you decide to use plus-four—you must apply it to both the numerator and denominator of the sample proportion.

Chapter 6 Wrap-Up: Inference for Proportions and Means

Chapter 6 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

The "plus-four" method provides a more reliable way to construct confidence intervals for proportions when sample sizes are moderate (at least 10) and confidence levels are high (at least 90%).

📌 Key points (3–5)

  • When to use plus-four: Recommended when confidence level is at least 90% and sample size is at least 10.
  • How the method works: Add 2 successes and 4 total observations to the raw data before calculating the proportion.
  • What it produces: A confidence interval for the true population proportion with specified confidence level.
  • Key terms covered: Sampling distributions, confidence intervals, margin of error, point estimates, and test statistics for both means and proportions.
  • Common confusion: The adjusted values (x + 2, n + 4) are used for calculation, not the original raw counts.

🔧 The Plus-Four Method

🔧 What gets adjusted

The method modifies the raw sample data before computing the sample proportion:

  • Original successes: x
  • Adjusted successes: x + 2
  • Original sample size: n
  • Adjusted sample size: n + 4

📐 Calculation steps

  1. Add 2 to the number of successes and 4 to the sample size
  2. Calculate adjusted sample proportion: p-hat = (x + 2) / (n + 4)
  3. Calculate complement: q-hat = 1 - p-hat
  4. Find the critical z-value for the desired confidence level
  5. Compute margin of error using the formula with the critical value
  6. Construct the interval: (p-hat - MoE, p-hat + MoE)

🎯 Worked example interpretation

In the smoking example:

  • Raw data: 6 out of 25 students smoked
  • Adjusted: 8 out of 29 (after plus-four)
  • Adjusted proportion: 0.276
  • 95% CI: (0.113, 0.439)
  • Interpretation: We are 95% confident the true proportion of all statistics students who smoke is between 11.3% and 43.9%

📚 Chapter Content Summary

📚 Core topics covered

The chapter wrap-up references five main sections:

| Section | Topic |
| --- | --- |
| 6.1 | Sampling distribution of the sample mean (t-distribution) |
| 6.2 | Inference for the mean in practice |
| 6.3 | Sampling distribution of the sample proportion |
| 6.4 | Inference for a proportion |
| 6.5 | Behavior of confidence intervals for a proportion |

🔑 Key terminology introduced

  • Sampling distribution: The distribution of a statistic across all possible samples
  • Confidence interval: A range of plausible values for a population parameter
  • Margin of error (MoE): The amount added/subtracted from the point estimate to create the interval
  • Point estimate: A single value used to estimate a population parameter
  • Degrees of freedom (df): Parameter used with t-distributions
  • Population proportion (p): True proportion in the entire population
  • Sample proportion (p-hat): Proportion observed in the sample

Don't confuse: Population parameters (like p) are unknown fixed values we're trying to estimate; sample statistics (like p-hat) are calculated from data and vary from sample to sample.


Inference for Two Dependent Samples (Matched Pairs)

7.1 Inference for Two Dependent Samples (Matched Pairs)

🧭 Overview

🧠 One-sentence thesis

When comparing two related groups through matched pairs, we analyze the differences between paired observations to test whether the population mean difference is zero, using a t-distribution because the pairs are linked and sample sizes are typically small.

📌 Key points (3–5)

  • What matched pairs means: two samples are dependent when sample values have an identifiable relationship (e.g., before-and-after measurements on the same subjects).
  • The parameter of interest: we focus on μ_d (the population mean difference), not two separate means; the differences themselves become the data we analyze.
  • Why we use t-distribution: matched pairs usually have small sample sizes and unknown population standard deviation for differences, so we use t with (n – 1) degrees of freedom, where n is the number of pairs.
  • Common confusion: independent vs dependent samples—independent samples have no relationship between groups; dependent samples involve paired or matched observations from the same or similar subjects.
  • Hypothesis test structure: typically test H₀: μ_d = 0 (no difference) against Hₐ: μ_d ≠ 0 (or < 0, or > 0), using the differences as a single sample.

🔗 Understanding dependent vs independent samples

🔗 What makes samples dependent

Dependent samples consist of two groups that have some sort of identifiable relationship.

  • The key is that values in one sample are linked to specific values in the other sample.
  • Contrast: independent samples have no relationship—sample values from one population are unrelated to values from the other.
  • Example: measuring pain levels in the same patients before and after treatment creates dependent samples; measuring pain in two separate groups of patients would be independent samples.

🎯 Matched pairs design

Two samples that are dependent typically come from a matched pairs experimental design.

  • What it involves: two measurements drawn from the same (or extremely similar) individuals or objects.
  • Why it matters: pairing controls for individual variation, making it easier to detect treatment effects.
  • Example: testing a diet by measuring each person's cholesterol before and after, rather than comparing two different groups.

📊 The matched pairs analysis approach

📊 From two samples to one: the differences

The excerpt emphasizes a crucial shift in perspective:

  • Although we start with two samples, the differences form the sample that is used for analysis.
  • We calculate the difference for each pair (often "after – before" or similar).
  • These differences become a single dataset, and our parameter of interest becomes μ_d (the population mean difference).
  • Point estimate: x̄_d (the sample mean of the differences).

Don't confuse: We are not comparing two separate means; we are testing whether the mean of the differences equals zero.

🧮 Characteristics of matched pairs inference

When using inference techniques for matched or paired samples, the following should be present:

  • Simple random sampling is used
  • Sample sizes are often small
  • Two measurements are drawn from the same (or extremely similar) individuals or objects
  • Differences are calculated from the matched or paired samples
  • The differences form the sample used for analysis

📐 The sampling distribution

  • In a perfect world, if both original samples come from normal distributions, the differences are also normally distributed.
  • However, we almost never know the population standard deviation of the differences.
  • Solution: use a t-distribution with (n – 1) degrees of freedom, where n is the number of differences (pairs).
  • We cannot use Z because we don't know the population standard deviation for a difference distribution.

🧪 Hypothesis testing for mean difference

🧪 Setting up the hypotheses

The population mean difference μ_d is the parameter of interest.

Most common setup (testing for any effect):

  • H₀: μ_d = 0 (no difference)
  • Hₐ: μ_d (<, >, or ≠) 0 (depending on the research question)

The excerpt notes: "Although it is possible to test for a certain magnitude of effect, we are most often just looking for a general effect."

🧮 The test statistic

The test uses a Student's t-test for a single population mean with (n – 1) degrees of freedom.

Test statistic formula (in words):

  • t equals (sample mean of differences minus hypothesized mean difference) divided by (standard deviation of differences divided by square root of n)
  • Under H₀, the hypothesized mean difference is usually zero
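
In code, the formula in words above is a one-line helper (the function name is my own), checked here against the hypnotism summaries:

```python
import math

def paired_t(xbar_d, s_d, n, mu_0=0.0):
    """t = (x̄_d − μ₀) / (s_d / √n), compared to a t-distribution with n − 1 df."""
    return (xbar_d - mu_0) / (s_d / math.sqrt(n))

# Hypnotism example summaries: x̄_d = −3.13, s_d = 2.91, n = 8 pairs
t_stat = paired_t(-3.13, 2.91, 8)
```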

📋 Worked example: hypnotism and pain

The excerpt provides a detailed example testing whether hypnotism reduces pain:

Setup:

  • Eight subjects measured before and after hypnotism (lower score = less pain)
  • Differences calculated as "after – before"
  • Sample mean of differences: x̄_d = –3.13
  • Sample standard deviation: s_d = 2.91

Hypotheses:

  • H₀: μ_d ≥ 0 (no improvement or worse; same or more pain after)
  • Hₐ: μ_d < 0 (improvement; less pain after)

Distribution: t with df = 8 – 1 = 7

Results:

  • p-value = 0.0095
  • α = 0.05
  • Since α > p-value, reject H₀

Conclusion: At 5% significance level, there is sufficient evidence that sensory measurements are lower on average after hypnotism; hypnotism appears effective in reducing pain.
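
As an illustrative check of the reported p-value, here is the whole test end to end. The raw differences are hypothetical (chosen to match the stated x̄_d ≈ −3.13 and s_d ≈ 2.91), and the tail probability is computed with a hand-rolled numeric integration; in practice you would use a calculator, a t-table, or a statistics library:

```python
import math

def t_sf(t, df, steps=20000, span=200.0):
    # P(T > t) for a Student's t distribution, via Simpson's rule on the pdf.
    # Illustrative only; steps must be even.
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    h = span / steps
    total = pdf(t) + pdf(t + span)
    for i in range(1, steps):
        total += pdf(t + i * h) * (4 if i % 2 else 2)
    return total * h / 3

# Hypothetical "after - before" differences matching the example's summaries.
diffs = [0.2, -4.1, -1.6, -1.8, -3.2, -2.0, -2.9, -9.6]
n = len(diffs)
xbar_d = sum(diffs) / n
s_d = math.sqrt(sum((d - xbar_d) ** 2 for d in diffs) / (n - 1))

t_stat = xbar_d / (s_d / math.sqrt(n))   # hypothesized mean difference is 0
p_value = t_sf(-t_stat, n - 1)           # left-tail area = right-tail area of |t|
```

The p-value comes out near 0.0095, which is below α = 0.05, matching the reject-H₀ decision.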

⚠️ Interpreting the direction

  • Negative differences (after – before < 0) indicate improvement when "after" should be lower.
  • The alternative hypothesis direction must match the research question.
  • Example: if testing whether pain decreases, Hₐ: μ_d < 0 makes sense when calculating "after – before."

📏 Confidence intervals for mean difference

📏 General structure

The general format of a confidence interval is (PE – MoE, PE + MoE)

  • Parameter: μ_d (population mean difference)
  • Point estimate (PE): x̄_d (sample mean of differences)
  • Margin of error (MoE): calculated using the t-distribution

🧮 Margin of error formula

In words:

  • MoE equals t-critical value times (standard deviation of differences divided by square root of n)
  • The t-critical value has area to the right equal to α/2 (for a two-sided interval)
  • Use df = n – 1 degrees of freedom, where n is the number of pairs
  • s_d is the standard deviation of the differences
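
A minimal sketch of the interval using the hypnotism summaries; the critical value t* ≈ 2.365 (df = 7, area 0.025 in each tail) would normally come from a t-table or technology:

```python
import math

xbar_d, s_d, n = -3.13, 2.91, 8   # summaries from the hypnotism example
t_star = 2.365                    # t critical value for 95% confidence, df = 7

moe = t_star * s_d / math.sqrt(n)           # margin of error
ci = (xbar_d - moe, xbar_d + moe)           # (PE - MoE, PE + MoE)
```

The resulting interval lies entirely below zero, consistent with the hypothesis test's conclusion that pain is lower after hypnotism.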

🔄 When to use confidence intervals

The excerpt notes: "Confidence intervals may be calculated on their own for two samples, but often, we first want to conduct a hypothesis test to formally check if a difference exists, especially in the case of matched pairs."

Typical workflow:

  1. First conduct a hypothesis test to see if there is a statistically significant difference
  2. If significant, then estimate the magnitude with a confidence interval

📋 Worked example: strength training

The excerpt provides an example with four football players testing whether a strength class increases bench press weight:

Data:

  • Differences (after – before): {90, 11, -8, -8}
  • Sample mean: x̄_d = 21.3
  • Sample standard deviation: s_d = 46.7

Note from excerpt: The data appear right-skewed; the value 90 may be an extreme outlier pulling the mean positive, while the other three values average to negative.

Test setup:

  • H₀: μ_d ≤ 0 (no improvement)
  • Hₐ: μ_d > 0 (improvement)
  • Distribution: t₃ (df = 4 – 1 = 3)

Results:

  • p-value = 0.2150
  • At α = 0.05, do not reject H₀ (because α < p-value)

Conclusion: At 5% significance, there is not sufficient evidence to conclude that the strength class made players stronger on average.
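
The same decision can be reached with a critical-value comparison; t* = 2.353 for a right-tailed test at α = 0.05 with df = 3 is a standard table value:

```python
import math
from statistics import mean, stdev

diffs = [90, 11, -8, -8]   # after - before, from the excerpt
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Right-tailed test at alpha = 0.05 with df = n - 1 = 3.
t_crit = 2.353
reject = t_stat > t_crit   # False: the observed t is nowhere near the cutoff
```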

🛠️ Practical considerations

🛠️ Technology notes

The excerpt provides calculator instructions for TI-83+/TI-84:

Option 1: Calculate differences ahead of time and put them in a list.

Option 2:

  • Put "after" data in one list, "before" data in another
  • Create a third list by subtracting (1st list – 2nd list)
  • The calculator computes the differences automatically

Then use STAT → TESTS → 2:T-Test with the differences list as data, μ₀ = 0, and choose the appropriate alternative hypothesis direction.

⚠️ Assumptions and checks

Key assumption: The differences have a normal distribution.

  • The excerpt states this assumption explicitly in the examples.
  • With small sample sizes (common in matched pairs), normality is important.
  • Warning from the strength training example: Even when stated, check if the data actually appear normal; outliers can distort results.

🎯 Interpreting results in context

Each example concludes with a context-specific interpretation:

  • Not just "reject H₀" but what that means for the research question
  • Example: "Hypnotism appears to be effective in reducing pain" (not just "μ_d < 0")
  • Example: "There is not sufficient evidence to conclude that the strength development class helped to make the players stronger" (not just "fail to reject H₀")

Inference for Two Independent Sample Means

7.2 Inference for Two Independent Sample Means

🧭 Overview

🧠 One-sentence thesis

When comparing two independent samples, we estimate the difference between population means (μ₁ - μ₂) using sample means and account for variation by standardizing with the standard error, typically using a t-distribution when population standard deviations are unknown.

📌 Key points (3–5)

  • Parameter of interest: the difference in means (μ₁ - μ₂), estimated by the difference in sample means (x̄₁ - x̄₂).
  • Why standardization matters: very different means can occur by chance if there is great variation, so we divide by the standard error to account for this.
  • Which distribution to use: Z-distribution when both population standard deviations are known (rare); t-distribution when they are unknown (typical).
  • Common confusion: degrees of freedom for two independent samples require either a complex formula or a conservative estimate using the smaller of (n₁-1) or (n₂-1).
  • Assumptions: underlying normal distribution with no outliers or skewness for t-distribution; these can be relaxed for larger samples.

📊 Core parameter and estimation

📊 What we're measuring

Parameter of interest: the difference in means, μ₁ - μ₂, with a point estimate of x̄₁ - x̄₂.

  • The comparison focuses on the difference between two population means, not the means themselves.
  • We use independent samples—there is no apparent relationship between them.
  • The point estimate is simply the difference between the two sample means.

🎯 Why standardization is necessary

  • A difference between two sample means depends on both the means themselves and their respective standard deviations.
  • Very different means can occur by chance if there is great variation among individual samples.
  • To account for variation, we take the difference of sample means and divide by the standard error, standardizing the difference.
  • This standardization allows us to determine whether an observed difference is meaningful or just due to random variation.

🔢 When population standard deviations are known (Z)

🔢 The ideal but unlikely scenario

  • This situation is unlikely because population standard deviations are rarely known.
  • If both means' sampling distributions are normal, the sampling distribution for the difference between means is also normal.
  • Both populations must be normal for this to hold.

📐 Standard error and test statistic

  • The standard error combines the standard errors of each sampling distribution.
  • The sampling distribution of (x̄₁ - x̄₂) is approximately normal when we know both standard deviations.
  • The z-test statistic formula divides the difference in sample means by the standard error.
  • In practice, this approach is only considered for two very large samples, since we rarely know even one population's standard deviation.

🎚️ Confidence interval structure

  • Form: (PE - MoE, PE + MoE)
  • Point estimate (PE): x̄₁ - x̄₂
  • Margin of error (MoE) components:
    • Critical z-value with area to the right equal to α/2
    • Standard error (SE) based on known population standard deviations

🧮 When population standard deviations are unknown (t)

🧮 The typical real-world case

  • Most likely, we will not know the population standard deviations.
  • We estimate them using the two sample standard deviations from our independent samples.
  • In this case, we use a t sampling distribution.
  • The standard error formula incorporates both sample standard deviations (s₁ and s₂) and sample sizes (n₁ and n₂).

✅ Assumptions for using t-distribution

  • Need to assume an underlying normal distribution.
  • No outliers or skewness should be present.
  • Relaxing assumptions: as sample sizes get bigger, we can relax these assumptions.
  • For very large sample sizes, we can typically just use the Z distribution instead.

🔧 Degrees of freedom challenge

Two methods for determining degrees of freedom:

| Method | Description | When to use |
| --- | --- | --- |
| Precise calculation | Complex formula involving both sample sizes and standard deviations | When you have access to a computer or calculator |
| Conservative estimate | min(n₁ − 1, n₂ − 1) | Working on your own without technology |
  • The precise df are not always a whole number; usually round down.
  • It is not necessary to compute the precise formula by hand—find reliable technology.
  • Don't confuse: the conservative method always uses the smaller of the two sample degrees of freedom, which is simpler but less precise.
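
Both methods are easy to compute; the precise formula is commonly known as the Welch–Satterthwaite approximation. A sketch with hypothetical sample summaries:

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (round down in practice)."""
    a, b = s1 ** 2 / n1, s2 ** 2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def conservative_df(n1, n2):
    """The simpler, more conservative choice: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)

# Hypothetical samples: s1 = 2, n1 = 10 and s2 = 3, n2 = 15.
precise = welch_df(2, 10, 3, 15)    # not a whole number
simple = conservative_df(10, 15)    # always the smaller of the two
```

Note how the conservative choice (9) is well below the precise value (about 23); fewer degrees of freedom make the test harder to pass, which is why the estimate is called conservative.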

🧪 Hypothesis testing for two means

🧪 Setting up hypotheses

When the parameter of interest is μ₁ - μ₂, we are often interested in an effect between the two groups.

  • To show an effect, first assume there is no difference in the null hypothesis:
    • H₀: μ₁ - μ₂ = 0 OR H₀: μ₁ = μ₂
    • Hₐ: μ₁ - μ₂ (<, >, ≠) 0 OR Hₐ: μ₁ (<, >, ≠) μ₂

📏 The t-test statistic

Components of the test statistic:

  • Numerator: difference in sample means minus the hypothesized difference (typically zero)
  • Denominator: standard error incorporating s₁, s₂, n₁, and n₂
  • s₁ and s₂ (sample standard deviations) are estimates of σ₁ and σ₂
  • x̄₁ and x̄₂ are the sample means; μ₁ and μ₂ are the population means
  • Note: in the null hypothesis, we are typically assuming μ₁ - μ₂ = 0
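
The components above combine into a short function (names are my own), shown here with hypothetical summaries:

```python
import math

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2, delta0=0.0):
    """t = ((x̄₁ − x̄₂) − Δ₀) / sqrt(s₁²/n₁ + s₂²/n₂); Δ₀ is usually 0 under H₀."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (xbar1 - xbar2 - delta0) / se

# Hypothetical summaries for two independent groups.
t_stat = two_sample_t(10.0, 2.0, 25, 8.0, 3.0, 25)
```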

🔍 The hypothesis test process

  • The steps to a hypothesis test never change, regardless of the parameter.
  • Calculate the test statistic using the formula above.
  • Compare to the t-distribution with appropriate degrees of freedom.
  • Make a decision based on the p-value or critical value approach.

📊 Confidence intervals for two means

📊 Estimating the difference

  • After identifying a difference in a hypothesis test, we may want to estimate it.
  • Confidence interval form: (PE - MoE, PE + MoE)
  • Point estimate (PE): x̄₁ - x̄₂

🎯 Margin of error components

  • Critical t-value: t* with area to the right equal to α/2
  • Standard error (SE): incorporates both sample standard deviations and sample sizes
  • The margin of error accounts for sampling variability in both groups.

Example: If we want to estimate how much taller one group is than another on average, the confidence interval gives a range of plausible values for the true difference, not just a single point estimate.
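
A sketch of such an interval with hypothetical large-sample height summaries; because both samples are large, the Z distribution is used per the note above:

```python
import math
from statistics import NormalDist

# Hypothetical height summaries (cm) for two large independent groups.
xbar1, s1, n1 = 178.0, 7.0, 400
xbar2, s2, n2 = 172.0, 6.5, 400

se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
z_star = NormalDist().inv_cdf(0.975)        # about 1.96 for 95% confidence
pe = xbar1 - xbar2                          # point estimate of mu1 - mu2
ci = (pe - z_star * se, pe + z_star * se)   # (PE - MoE, PE + MoE)
```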


Inference for Two-Sample Proportions

7.3 Inference for Two-Sample Proportions

🧭 Overview

🧠 One-sentence thesis

When comparing two population proportions, we use hypothesis tests and confidence intervals to determine whether observed differences in sample proportions reflect true population differences or merely chance variation.

📌 Key points (3–5)

  • What we're comparing: two independent population proportions (p₁ and p₂) using sample estimates.
  • Conditions required: simple random samples, independence, at least five successes and five failures in each sample, and populations at least 10–20 times the sample size.
  • Pooled proportion for testing: when the null hypothesis assumes no difference (p₁ = p₂), we combine both samples to estimate a single pooled proportion.
  • Common confusion: hypothesis tests use a pooled proportion in the standard error, but confidence intervals estimate each proportion separately.
  • Two inference tools: hypothesis tests determine if a difference exists; confidence intervals estimate the size of that difference.

📋 Requirements and setup

✅ Conditions for valid inference

Before conducting inference on two independent population proportions, verify:

  • Both samples are simple random samples and independent of each other.
  • Each sample has at least five successes and at least five failures.
  • Each population is at least 10 to 20 times larger than its sample (prevents over-sampling).

📊 Sampling distribution

The difference of two proportions (p̂₁ - p̂₂) follows an approximate normal distribution.

  • The parameter of interest is p₁ - p₂ (difference in population proportions).
  • We estimate it with p̂₁ - p̂₂ (difference in sample proportions).
  • Standard error calculations differ between hypothesis tests and confidence intervals.

🧪 Hypothesis testing for two proportions

🎯 Typical null hypothesis

  • Generally: H₀: p₁ = p₂ (the two proportions are the same).
  • Alternative form: H₀: p₁ - p₂ = 0.
  • The alternative hypothesis can be two-tailed (≠) or one-tailed (< or >).

🔄 Pooled proportion

Pooled proportion (pₚ): a combined estimate of the common proportion when assuming the null hypothesis is true.

Calculation: pₚ = (total successes from both samples) / (total observations from both samples)

  • Use this pooled proportion in the test statistic calculation.
  • Also calculate 1 - pₚ for the standard error.
  • Example from excerpt: pₚ = (20 + 12) / (200 + 200) = 32/400 = 0.08

📐 Test statistic

The z-test statistic formula uses:

  • Numerator: (p̂₁ - p̂₂) - 0 (the observed difference minus the hypothesized difference).
  • Denominator: standard error calculated using the pooled proportion pₚ.

🔍 Medication example walkthrough

Scenario: Testing if two hive medications (A and B) have different reaction proportions.

  • Medication A: 20 out of 200 adults still had hives (p̂ₐ = 0.1).
  • Medication B: 12 out of 200 adults still had hives (p̂ᵦ = 0.06).
  • Significance level: α = 0.01 (1%).

Steps:

  1. Random variable: p̂ₐ - p̂ᵦ = difference in the proportions of adults who still had hives 30 minutes after taking the medication.
  2. Hypotheses: H₀: pₐ = pᵦ vs Hₐ: pₐ ≠ pᵦ (two-tailed because "is a difference").
  3. Pooled proportion: pₚ = 32/400 = 0.08.
  4. Calculate p-value: 0.1404.
  5. Compare: α = 0.01 < p-value = 0.1404.
  6. Decision: Do not reject H₀.
  7. Conclusion: At 1% significance, insufficient evidence to conclude a difference exists.

Don't confuse: The phrase "is a difference" signals a two-tailed test; "greater than" or "less than" would signal one-tailed.
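
The whole walkthrough is a few lines of arithmetic; this sketch reproduces the pooled proportion and p-value reported above:

```python
import math
from statistics import NormalDist

# Medication example from the excerpt: "successes" = still had hives.
x_a, n_a = 20, 200
x_b, n_b = 12, 200

p_hat_a, p_hat_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)        # 32/400 = 0.08

# Standard error uses the pooled proportion because H0 assumes p_a = p_b.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_hat_a - p_hat_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed test
```

Since the p-value (about 0.1404) exceeds α = 0.01, we do not reject H₀.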

📏 Confidence intervals for two proportions

🎯 Purpose and structure

Once a hypothesis test identifies a difference, a confidence interval estimates the size of that difference.

General form: (PE - MoE, PE + MoE)

  • Point estimate (PE): p̂₁ - p̂₂
  • Margin of error (MoE): critical value × standard error

🔢 Formula components

Complete formula: (p̂₁ - p̂₂) ± z* × SE

Where:

  • z* is the critical value with area to the right equal to α/2.
  • SE (standard error) uses individual sample proportions, not the pooled proportion.
  • If p₁ and p₂ are unknown, estimate them with p̂₁ and p̂₂.
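
A sketch of the interval, reusing the medication example's sample proportions (note the unpooled standard error, unlike the hypothesis test):

```python
import math
from statistics import NormalDist

p1, n1 = 20 / 200, 200   # medication A, from the excerpt
p2, n2 = 12 / 200, 200   # medication B

# Unpooled SE: each proportion is estimated separately.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star = NormalDist().inv_cdf(0.975)      # 95% confidence
pe = p1 - p2
ci = (pe - z_star * se, pe + z_star * se)
```

The interval contains 0, which is consistent with the hypothesis test's failure to find a significant difference.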

⚖️ Key difference from hypothesis tests

| Aspect | Hypothesis test | Confidence interval |
| --- | --- | --- |
| Assumption | Assumes H₀ true (p₁ = p₂) | No assumption about equality |
| Proportion used | Pooled proportion pₚ | Separate estimates p̂₁ and p̂₂ |
| Purpose | Detect if difference exists | Estimate size of difference |

Don't confuse: The pooled proportion is only for hypothesis tests; confidence intervals estimate each proportion separately because we're not assuming they're equal.


Chapter 7 Wrap-Up: Two-Sample Inference

Chapter 7 Wrap-Up

🧭 Overview

🧠 One-sentence thesis

Chapter 7 covers methods for comparing two groups—either dependent (matched pairs) or independent samples—using hypothesis tests and confidence intervals for means and proportions.

📌 Key points (3–5)

  • Three main scenarios: matched pairs (dependent samples), two independent sample means, and two-sample proportions.
  • Confidence interval structure: all follow the form (Point Estimate – Margin of Error, Point Estimate + Margin of Error).
  • Two-proportion inference: uses pooled proportion for hypothesis tests; uses sample proportions for confidence intervals.
  • Common confusion: dependent vs independent samples—matched pairs require different methods than independent samples.
  • Key terms span: placebo, independence, matched pairs, degrees of freedom, pooled proportion, and standard error.

📊 Two-Sample Proportion Methods

📊 Confidence interval formula

The excerpt provides the structure for estimating the difference between two proportions:

Confidence interval form: (PE – MoE, PE + MoE)

  • Point estimate (PE): the difference between the two sample proportions (written as p̂₁ – p̂₂ in the excerpt).
  • Margin of Error (MoE): constructed using a z critical value multiplied by the standard error.
  • The z critical value corresponds to the desired confidence level (area to the right equals α/2).

🔧 Standard error construction

  • When population proportions p₁ and p₂ are unknown, estimate them with the sample proportions.
  • The standard error (SE) incorporates both sample proportions and sample sizes.
  • Don't confuse: hypothesis tests may use a pooled proportion, but confidence intervals use individual sample proportions.

🧪 Hypothesis test example

The excerpt mentions a valve pressure test:

  • Valve A: 15 out of 100 cracked under 4,500 psi.
  • Valve B: 6 out of 100 cracked under 4,500 psi.
  • Test at 5% significance level to determine if there is a difference in pressure tolerances.

Example: This setup compares two independent proportions (failure rates) to see if one valve type is more prone to cracking.
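
The excerpt states only the setup; as an illustrative sketch, the pooled-proportion method from Section 7.3 would be applied like this:

```python
import math
from statistics import NormalDist

# Valve example setup from the excerpt; the computation below is illustrative.
x_a, n_a = 15, 100   # valve A cracks under 4,500 psi
x_b, n_b = 6, 100    # valve B cracks under 4,500 psi

p_pool = (x_a + x_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (x_a / n_a - x_b / n_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed: "is there a difference"
reject = p_value < 0.05
```

Under these numbers the p-value falls below 0.05, so the test would reject H₀ and conclude the cracking proportions differ.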

🗂️ Chapter structure and key terms

🗂️ Three main sections

The chapter is organized by sample type and data type:

| Section | Focus | Key concept |
| --- | --- | --- |
| 7.1 | Two dependent samples (matched pairs) | Population mean difference |
| 7.2 | Two independent sample means | Degrees of freedom, standard error |
| 7.3 | Two-sample proportions | Pooled proportion |

📚 Important terminology

The wrap-up lists key terms students should be able to define:

  • 7.1 terms: placebo, inference, quantitative/categorical data, independence, matched pairs, population mean difference, sampling distribution, point estimate.
  • 7.2 terms: standard error, degrees of freedom, confidence interval.
  • 7.3 terms: population proportion, pooled proportion.

✅ Self-assessment tools

  • The chapter includes a concept check quiz (interactive element).
  • Extra practice problems are available in a separate section at the end of the book.
  • Section resources are provided with links and QR codes for multimedia materials (podcasts, videos, lecture notes, worked examples).