Introduction to Statistical Thinking

1.1 Student Learning Objectives: Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

Confidence intervals estimate unknown parameters using a range of values that contains the true parameter with a prescribed probability (the confidence level), and can be constructed for expectations, variances, and event probabilities using different methods depending on sample size and distribution assumptions.

📌 Key points (3–5)

  • What a confidence interval is: an estimate of an unknown parameter by a range of values (not a single point), calculated from data, that contains the parameter with a prescribed probability.
  • Confidence level: the probability that the interval includes the unknown population parameter.
  • Construction methods vary: large samples use the Normal approximation via the Central Limit Theorem; small samples of Normal measurements require different techniques.
  • Common confusion: confidence intervals vs point estimators—intervals give a range with a probability guarantee, not a single number.
  • Practical skill: computing the sample size needed to achieve a confidence interval of a given width.

🎯 What confidence intervals are

🎯 Definition and purpose

A confidence interval is a method for estimating the unknown value of a parameter using an interval of numbers calculated from the data, likely to include the unknown population parameter.

  • Unlike point estimators (which produce a single number), confidence intervals produce a range.
  • The interval is calculated from observed data.
  • The interval is "likely" to include the true parameter—this likelihood is quantified by the confidence level.

📊 Confidence level

The confidence level is the probability of the event that the interval includes the unknown population parameter.

  • It is a prescribed probability, chosen in advance (e.g., 95%, 99%).
  • Example: if the confidence level is 95%, then 95% of such intervals constructed from repeated samples would contain the true parameter.
  • Don't confuse: the confidence level is not the probability that a specific computed interval contains the parameter; it describes the long-run frequency of coverage across many samples.

🔧 Construction methods for different scenarios

🔧 For expectation and event probability (large samples)

  • The excerpt states that construction methods for the expectation of a measurement and for the probability of an event rely on the Normal approximation suggested by the Central Limit Theorem.
  • This approximation is valid when the sample size is large enough.
  • Example: estimating the average height in a population or the proportion of voters favoring a candidate.
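
To make the recipe concrete, here is a minimal R sketch (not taken from the excerpt) of a 95% interval for an expectation and for an event probability based on the Normal approximation; the data vector x, the proportion 0.42, and the sample size 500 are illustrative values only.

    # 95% interval for an expectation (Normal approximation, large sample)
    x <- c(5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9)  # illustrative data
    z <- qnorm(0.975)                      # about 1.96, the 97.5% Normal quantile
    se <- sd(x) / sqrt(length(x))          # estimated standard error of the mean
    c(mean(x) - z * se, mean(x) + z * se)  # lower and upper limits

    # 95% interval for an event probability, e.g. 42% preference in a sample of 500
    p.hat <- 0.42; n <- 500
    p.hat + c(-1, 1) * z * sqrt(p.hat * (1 - p.hat) / n)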

🔧 For Normal measurements (small samples)

  • When the sample is small, the construction of confidence intervals is considered specifically in the context of Normal measurements.
  • The excerpt distinguishes this from the large-sample case: different techniques apply when the underlying distribution is Normal but the sample is not large.
  • This allows confidence intervals for both the expectation and the variance of a Normal measurement.
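
As a hedged illustration of the small-sample case (assuming the measurements are Normal; the simulated vector x below stands in for real data), the t distribution gives an interval for the expectation and the chi-square distribution gives one for the variance:

    x <- rnorm(10, mean = 170, sd = 10)     # a small simulated Normal sample
    t.test(x, conf.level = 0.95)$conf.int   # t-based interval for the expectation
    n <- length(x)
    (n - 1) * var(x) / qchisq(c(0.975, 0.025), df = n - 1)  # interval for the variance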

📐 Practical applications

📐 Types of parameters covered

The chapter discusses constructing confidence intervals for:

Parameter | Description
Expectation of a measurement | The mean or average value of a random variable
Probability of an event | The proportion or likelihood of an outcome
Variance of a Normal measurement | The spread of a Normal distribution (small sample case)

📐 Sample size determination

  • One learning objective is to compute the sample size that will produce a confidence interval of a given width.
  • This is a planning tool: before collecting data, determine how many observations are needed to achieve a desired precision (interval width).
  • Example: if you want to estimate a population mean within ±2 units with 95% confidence, you can calculate the required sample size.
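
A minimal sketch of that calculation, assuming a 95% confidence level, a desired half-width (margin) of 2 units, and a guessed standard deviation of 10; both numbers are hypothetical and would be replaced by values relevant to the application:

    margin <- 2                      # desired half-width of the interval
    sigma  <- 10                     # guessed standard deviation of the measurement
    z <- qnorm(0.975)                # about 1.96 for a 95% confidence level
    ceiling((z * sigma / margin)^2)  # required sample size, rounded up (97 here)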

🧩 Relationship to earlier concepts

🧩 Background: estimator performance

The excerpt includes a brief review of estimator properties (from the previous chapter):

  • Bias: the difference between the expected value of an estimator and the true parameter value.
  • Variance: the concentration of the estimator's distribution around its expected value.
  • Mean Square Error (MSE): equals variance plus the square of the bias; if the estimator is unbiased, MSE equals variance.

These concepts underpin the construction of confidence intervals, because intervals are built from estimators whose properties (bias, variance) determine the interval's reliability.
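
These relations can be checked numerically. The sketch below (not from the excerpt) simulates the sample mean of 20 Normal observations; the true expectation 3 and standard deviation 2 are arbitrary choices:

    mu <- 3                                   # true expectation, known in the simulation
    est <- replicate(10^4, mean(rnorm(20, mean = mu, sd = 2)))
    bias <- mean(est) - mu                    # close to 0: the sample mean is unbiased
    variance <- var(est)                      # close to 2^2 / 20 = 0.2
    c(bias = bias, variance = variance, MSE = variance + bias^2)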

🧩 Optimal vs robust estimators (context)

  • The excerpt mentions a forum discussion on optimal estimators (best under an assumed model) versus robust estimators (less sensitive to model misspecification).
  • Example given: for a Uniform distribution, the mid-range estimator is better than the sample average if the model is correct, but the sample average is more robust if the distribution is not symmetric.
  • This context highlights that confidence interval construction depends on model assumptions (e.g., Normality, large sample size), and robustness considerations may influence method choice.
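
A small simulation sketch (illustrative only, assuming Uniform(0, 1) measurements with expectation 0.5) makes the trade-off visible: under the correct Uniform model the mid-range has the smaller variance, which is what makes it the better estimator there.

    mid.range <- function(x) (min(x) + max(x)) / 2      # average of the two extremes
    avg <- replicate(10^4, mean(runif(20)))             # sample average, n = 20
    mid <- replicate(10^4, mid.range(runif(20)))        # mid-range estimator
    c(var.average = var(avg), var.midrange = var(mid))  # mid-range wins under this model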

1.2 Why Learn Statistics?

🧭 Overview

🧠 One-sentence thesis

Statistics is essential for making informed decisions about claims and information encountered in everyday life and across nearly all professional fields.

📌 Key points (3–5)

  • Why statistics matters: statistical information appears everywhere—news, media, Internet—and you need techniques to analyze it thoughtfully.
  • What statistics enables: making the "best educated guess" when evaluating statements, claims, or "facts" based on sample information.
  • Where statistics is used: required in economics, business, psychology, education, biology, law, computer science, police science, early childhood development, and personal decisions like buying a house or managing a budget.
  • How statistics works: probability and statistics work together—descriptive statistics organize and summarize data, while inferential statistics use probability to determine if conclusions are reliable.
  • Common confusion: statistics is not just about numbers; it involves collection, analysis, interpretation, and presentation of data to support decision-making.

📰 Statistics in everyday life

📰 Where you encounter statistics

  • Statistical information appears in:
    • Newspapers and television news programs
    • Internet content
    • Topics like crime, sports, education, politics, and real estate
  • You are typically given sample information from which you must make decisions.

🤔 Why you need statistical skills

  • When you read an article or watch a news program, you receive statistical information.
  • You may need to make a decision about the correctness of a statement, claim, or "fact."
  • Statistical methods help you make the "best educated guess" rather than relying on intuition alone.
  • Example: evaluating whether a claim in a news report is supported by the data presented.

💼 Statistics in professional and personal contexts

💼 Professional applications

The excerpt lists fields that require at least one course in statistics:

Field | Implication
Economics, Business | Analyzing market trends, financial data
Psychology, Education | Understanding research studies, test results
Biology | Interpreting experimental data
Law, Police science | Evaluating evidence, crime statistics
Computer science | Data analysis, algorithms
Early childhood development | Research on child development patterns

🏠 Personal applications

  • Buying a house: evaluating real estate data and market trends.
  • Managing a budget: analyzing spending patterns and financial information.
  • The excerpt emphasizes that statistical literacy is "in your own best self-interest."

🔬 What statistics is

🔬 Definition and scope

The science of statistics deals with the collection, analysis, interpretation, and presentation of data.

  • Statistics is not just calculation; it encompasses four distinct activities:
    • Collection: gathering data
    • Analysis: examining patterns
    • Interpretation: drawing meaning
    • Presentation: communicating findings
  • We see and use data in our everyday lives.
  • Being able to use data correctly is essential in many professions.

📊 Example: sleep data analysis

The excerpt provides a concrete example with sleep hours data: 5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9.

This data is presented in a bar plot:

  • The horizontal x-axis shows time (hours of sleep).
  • The vertical y-axis shows frequency (how many people).
  • Each bar's length corresponds to the number of data points with that value.
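
A minimal sketch of how such a bar plot can be produced in R (the object name sleep is chosen here; the functions involved are introduced in Section 1.6):

    sleep <- c(5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9)
    plot(table(sleep))   # x-axis: hours of sleep, y-axis: frequency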

🧐 Analytical questions

The excerpt poses questions to illustrate statistical thinking:

  • Would data from a different group look the same or different? Why?
  • Would the same-sized, same-age group produce the same results? Why or why not?
  • Where does the data cluster? How could you interpret the clustering?

These questions demonstrate that statistics involves analyzing and interpreting data, not just collecting it.

🧮 Two branches of statistics

📝 Descriptive statistics

Organizing and summarizing data is called descriptive statistics.

  • Purpose: to organize and summarize data.
  • Methods: two main ways:
    • Graphing: visual representations like bar plots.
    • Numbers: calculating summaries like averages.
  • Example: the bar plot of sleep data is descriptive statistics—it organizes raw numbers into a visual summary.

🔍 Inferential statistics

The formal methods are called inferential statistics.

  • Purpose: drawing conclusions from "good" data using formal methods.
  • How it works: uses probabilistic concepts to determine if conclusions are reliable or not.
  • Relationship to probability: "probability and statistics work together"—you cannot assess reliability without understanding probability.
  • Don't confuse: descriptive statistics summarize what you observe; inferential statistics assess whether patterns are meaningful or just chance.

🔗 How they connect

  • Descriptive statistics is covered in the first part of the book.
  • Inferential statistics is covered in the second part.
  • Effective interpretation of data is based on both: first summarize (descriptive), then assess reliability (inferential).

1.3 Statistics

🧭 Overview

🧠 One-sentence thesis

Statistics provides methods to collect, analyze, interpret, and present data so that we can make informed decisions under uncertainty, while probability theory quantifies that uncertainty and enables reliable conclusions from samples.

📌 Key points (3–5)

  • What statistics does: collects, analyzes, interprets, and presents data to help make educated decisions.
  • Two branches: descriptive statistics (organizing and summarizing data through graphs and numbers) vs. inferential statistics (drawing conclusions from data using probability).
  • Common confusion: statistics is not about performing many calculations—the goal is understanding your data; calculations are just tools.
  • How probability supports statistics: probability quantifies uncertainty and explains patterns in large samples, allowing extrapolation from observed samples to entire populations.
  • Why it matters: statistical literacy is essential across professions (economics, psychology, law, etc.) and everyday decisions (budgeting, evaluating claims).

📊 What statistics is and why it matters

📊 The science of statistics

Statistics: the science that deals with the collection, analysis, interpretation, and presentation of data.

  • Data appears everywhere—news, sports, crime reports, real estate, politics.
  • When you encounter sample information (e.g., in a newspaper), you need techniques to analyze it thoughtfully and decide whether a claim is correct.
  • Statistical methods help you make the "best educated guess."

🌍 Real-world relevance

  • Many professions require statistics: economics, business, psychology, education, biology, law, computer science, police science, early childhood development.
  • Everyday decisions also rely on data: buying a house, managing a budget, evaluating health information.
  • Example: If you see a claim about average sleep hours or crime rates, statistical literacy helps you assess its validity.

🔍 Two branches of statistics

🔍 Descriptive statistics

Descriptive statistics: organizing and summarizing data.

  • Two main tools:
    • Graphing (e.g., bar plots that show frequency of values).
    • Numbers (e.g., finding an average).
  • Example from the excerpt: A bar plot shows how many people sleep 5, 5.5, 6, 6.5, 7, 8, or 9 hours per night. The x-axis is time (hours), the y-axis is frequency (how many people), and bar length corresponds to the count.
  • Questions to ask when interpreting:
    • Would a different group produce the same plot? Why or why not?
    • Where does the data cluster? (e.g., most people sleep around 6–7 hours)
    • What does clustering mean? (e.g., typical sleep duration for this group)

🔬 Inferential statistics

Inferential statistics: formal methods for drawing conclusions from "good" data.

  • Uses probabilistic concepts to determine if conclusions are reliable.
  • Goes beyond describing the sample—extrapolates to the entire population.
  • Effective inference depends on:
    • Good procedures for producing data.
    • Thoughtful examination of the data.
  • Don't confuse: The goal is not to perform many calculations using formulas, but to gain understanding. Calculations can be done by calculator or computer; understanding must come from you.

🎲 How probability supports statistics

🎲 What probability is

Probability: the mathematical theory used to study uncertainty; it formalizes and quantifies the notion of uncertainty.

  • Deals with the chance of an event occurring.
  • When outcomes are equally likely, probability of each outcome = 1 divided by the number of potential outcomes.
  • Example: Tossing a fair coin has two equally likely outcomes (head or tail), so probability of each is 1/2.

🔁 Pattern regularity in large samples

  • Short-term uncertainty: If you toss a coin 4 times, you may not get exactly 2 heads and 2 tails.
  • Long-term regularity: If you toss the same coin 4,000 times, outcomes will be close to 2,000 heads and 2,000 tails.
    • Very unlikely to get more than 2,060 tails or fewer than 1,940 tails.
    • This consistency matches the theoretical probability (1/2 per toss).
  • Key insight: Even though a few repetitions are uncertain, a regular pattern emerges when the number of repetitions is large.

🔗 Linking probability and statistics

  • Statistics exploits this pattern regularity to make extrapolations from an observed sample to the entire population.
  • Probability began with games of chance (e.g., poker) but now predicts earthquakes, rain, exam outcomes, and more.
  • Without probability, we cannot assess whether conclusions from data are reliable or just due to chance.

🛠️ Practical approach to learning statistics

🛠️ Focus on understanding, not formulas

  • You will encounter many mathematical formulas describing statistical procedures.
  • Always remember: the goal is to understand your data, not to perform numerous calculations.
  • Calculations are tools (done by calculator or computer); interpretation and insight come from you.

🧪 Working with data thoughtfully

  • Good statistical practice requires:
    • Producing data with sound procedures.
    • Examining data carefully (e.g., looking for clustering, outliers, patterns).
  • Example questions from the sleep-hours data:
    • Would a same-sized, same-age group produce the same results? Why or why not?
    • What does the clustering around 6–7 hours tell us about typical sleep patterns?
  • Grasping the basics of statistics builds confidence in making real-life decisions.

1.4 Probability

🧭 Overview

🧠 One-sentence thesis

Probability provides the mathematical foundation for quantifying uncertainty and enables statistics to extrapolate from observed samples to entire populations by exploiting the regular patterns that emerge in large numbers of repetitions.

📌 Key points (3–5)

  • What probability measures: the chance of an event occurring; when outcomes are equally likely, probability equals one divided by the number of potential outcomes.
  • Pattern regularity: although a few repetitions produce uncertain outcomes, large numbers of repetitions reveal regular patterns (e.g., 4,000 coin tosses approach 2,000 heads).
  • Role in statistics: statistics exploits this pattern regularity to make inferences from samples to populations.
  • Common confusion: short-run vs long-run—small samples may deviate widely from expected probabilities, but large samples converge toward theoretical probabilities.
  • Applications: probability is used to predict earthquakes, rain, vaccination risks, investment returns, and other uncertain events.

🎲 Core concept: what probability measures

🎲 Definition and basic calculation

Probability: the mathematical theory used to study uncertainty; it provides tools for the formalization and quantification of the notion of uncertainty.

  • Probability deals with the chance of an event occurring.
  • When all potential outcomes are equally likely, the probability of each outcome is calculated as:
    • One divided by the number of potential outcomes.
  • Example: Tossing a fair coin has two equally likely outcomes (head or tail), so the probability of each is 1/2.

🔍 Why "equally likely" matters

  • The excerpt specifies that the simple formula (one divided by number of outcomes) applies only when outcomes are equally likely.
  • A fair coin means head and tail have the same chance; if the coin were biased, this formula would not apply.

🔁 Pattern regularity: from uncertainty to predictability

🔁 Small vs large repetitions

  • Few repetitions: outcomes are uncertain and may not match theoretical probability.
    • Example: Tossing a fair coin 4 times may not yield exactly 2 heads and 2 tails.
  • Many repetitions: outcomes converge toward theoretical probability.
    • Example: Tossing the same coin 4,000 times will produce close to 2,000 heads and 2,000 tails.
    • The excerpt states it is "very unlikely" to get more than 2,060 tails or fewer than 1,940 tails (out of 4,000 tosses).
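
A short simulation sketch (assuming a fair coin and using R's random number generator; exact counts differ from run to run) illustrates this convergence:

    tosses <- sample(c("head", "tail"), size = 4000, replace = TRUE)
    table(tosses)   # each count should come out close to 2,000

    # estimated chance of more than 2,060 tails in 4,000 tosses (only a few percent)
    mean(replicate(10^4, sum(sample(0:1, 4000, replace = TRUE)) > 2060))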

📊 The key insight

  • "Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when the number of repetitions is large."
  • This regularity is what makes probability useful: we can predict long-run behavior even when individual trials are unpredictable.

🔗 How probability connects to statistics

🔗 Exploiting pattern regularity

  • Statistics uses the regular pattern that emerges in large samples to make extrapolations from the observed sample to the entire population.
  • The idea: if we observe a sample and see a pattern, we can infer that the population follows a similar pattern—because large samples reliably reflect underlying probabilities.

🧪 Philosophical approach in this course

  • The excerpt notes that this introductory course will not develop the mathematical theory of probability.
  • Instead, the focus is on:
    • Philosophical aspects of probability theory.
    • Computerized simulations to demonstrate probabilistic computations used in statistical inference.
  • Don't confuse: this course emphasizes understanding and application, not formal proofs or advanced probability mathematics.

🌍 Applications of probability

🌍 Real-world uses

The excerpt lists several domains where probability is applied:

Domain | Application
Natural disasters | Predicting the likelihood of an earthquake
Weather | Predicting the chance of rain
Education | Estimating the probability of getting an "A" in a course
Medicine | Determining the chance a vaccination causes the disease it is meant to prevent
Finance | Determining the rate of return on investments
Personal decisions | Deciding whether to buy a lottery ticket

🎰 Historical origin

  • The theory of probability began with the study of games of chance such as poker.
  • Today it has expanded far beyond gambling to inform decisions in science, medicine, business, and everyday life.

1.5 Key Terms

🧭 Overview

🧠 One-sentence thesis

Statistics studies populations by selecting samples, calculating statistics from sample data to estimate population parameters, with accuracy depending on how well the sample represents the population.

📌 Key points (3–5)

  • Population vs sample: a population is the entire collection under study; a sample is a selected portion used to gain information about the population.
  • Statistic vs parameter: a statistic is a number calculated from sample data; a parameter is a number describing the entire population.
  • Why sampling matters: examining an entire population is costly and time-consuming, so we use samples and estimate parameters from statistics.
  • Common confusion: don't confuse a statistic (property of the sample) with a parameter (property of the population)—the statistic estimates the parameter.
  • Representative samples: accuracy of estimation depends on whether the sample contains the characteristics of the population.

🎯 Population and sample

🎯 What a population is

Population: an entire collection of persons, things, or objects under study.

  • The population is the full group you want to learn about.
  • Example: all students at your school, all math classes, all voters in a country.

🔬 What a sample is

Sample: a portion (or subset) of the larger population selected for study.

  • You select a sample to gain information about the population without examining every member.
  • Why use samples: examining an entire population takes too much time and money; sampling is practical.
  • Example: selecting some students from your school to compute overall grade point average; polling 1,000–2,000 people to represent views of an entire country; testing some 16-ounce drink containers to check if they contain 16 ounces.

📊 What data are

Data: the result of sampling from a population.

  • Data are the actual measurements or observations collected from the sample.
  • Example: students' grade point averages collected from the sample; exam scores; container volumes.

🔢 Statistic and parameter

🔢 What a statistic is

Statistic: a number that is a property of the sample.

  • You calculate a statistic from the sample data.
  • Example: the average number of points earned by students in one math class (if that class is the sample).
  • The statistic is used to estimate the corresponding population parameter.

🏛️ What a parameter is

Parameter: a number that is a property of the population.

  • A parameter describes the entire population.
  • Example: the average number of points earned per student over all math classes (if all classes are the population).

⚖️ Statistic vs parameter comparison

Concept | Scope | Role | Example (from excerpt)
Statistic | Calculated from the sample | Estimates the parameter | Average points in one math class
Parameter | Describes the entire population | The true value we want to know | Average points over all math classes

Don't confuse: A statistic is not the same as a parameter. The statistic is what you compute from your sample; the parameter is the true population value you are trying to estimate.

🎯 Accuracy and representativeness

🎯 How accuracy depends on the sample

  • Main concern in statistics: how accurately a statistic estimates a parameter.
  • Accuracy depends on how well the sample represents the population.

🔑 What makes a sample representative

Representative sample: a sample that contains the characteristics of the population.

  • The sample must reflect the population's features to produce accurate estimates.
  • If the sample does not match the population's characteristics, the statistic will be a poor estimate of the parameter.

📐 Common measures: average and proportion

📐 Average

  • What it is: add up all values and divide by the number of values.
  • Example: exam scores of 86, 75, and 92 → average = (86 + 75 + 92) divided by 3 = 84.3 (to one decimal place).

📐 Proportion

  • What it is: the fraction of the total that belongs to a category.
  • Example: in a class of 40 students, 22 are men and 18 are women → proportion of men = 22 divided by 40; proportion of women = 18 divided by 40.
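
Both computations take a single line in R; in this sketch the object names scores and sex are chosen for illustration:

    scores <- c(86, 75, 92)
    round(mean(scores), 1)       # average: 84.3
    sex <- c(rep("men", 22), rep("women", 18))
    table(sex) / length(sex)     # proportions: men 0.55, women 0.45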

Note: The excerpt states that average and proportion are discussed in more detail in later chapters.

1.6 The R Programming Environment

🧭 Overview

🧠 One-sentence thesis

R is a widely used open-source statistical programming system that enables both standard and advanced statistical analysis through hands-on practice with code, functions, and data manipulation.

📌 Key points (3–5)

  • What R is: an open-source system for statistical analysis and programming, popular in academia for developing new statistical tools.
  • How to learn R: only through practice—run code yourself, experiment with changes, and don't fear mistakes (worst case: restart the computer).
  • Basic workflow: R is object-oriented; you create and manipulate objects using functions, assign results to named objects, and can save/load them.
  • Common confusion: R distinguishes between capital and small letters (e.g., X and x are different objects).
  • Why use RStudio: it provides an integrated development environment with console, editor, plotting tools, and workspace management.

🖥️ Getting started with R

📥 Installation and setup

  • Download R from the R project home page (https://www.r-project.org).
  • Also install RStudio (https://www.rstudio.com), which provides a more user-friendly interface.
  • RStudio includes four panels: editor (top-left), R-Console (bottom-left), and two right panels showing environment, history, files, plots, and help.

📁 Good practices

  • Create a separate folder for each project to store data and R code.
  • Set the working directory at the start of each session using the setwd function.
  • Example: setwd("~/IntroStat") points R to your project folder.

🧩 Core R concepts

🧩 Object-oriented system

R is an object-oriented programming system where you create and manipulate objects using functions.

  • During a session, you create objects (data, results) and apply functions to them.
  • Most functions are written in the R language itself.
  • You can write new functions or modify existing ones for specific needs.

🎯 The prompt and execution

  • The > prompt indicates the system is ready to receive commands.
  • Type an expression (e.g., 1+2) and hit Return to execute it.
  • If no other action is specified, R applies the default action (usually displaying the result on screen).
  • Example: typing 1+2 and hitting Return displays [1] 3.

💾 Creating and naming objects

  • To save output for later use, assign it to a named object using the assignment operator <-.
  • The arrow is typed as < followed by -.
  • Example: X <- c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9) creates an object named X.
  • When you assign to an object, no output appears on screen—the result is stored in memory instead.

🔤 Case sensitivity

  • Don't confuse: R distinguishes between capital and small letters.
  • X and x are completely different objects.
  • Typing x when only X exists produces an error: object 'x' not found.
  • You are free to choose names for objects you create (e.g., my.vector), but system function names are fixed.

🔧 Working with functions and data

🔧 The c function

The c function combines its arguments and produces a sequence with those arguments as components.

  • Arguments are the inputs to a function, separated by commas inside brackets.
  • Example: c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9) creates a sequence of 14 numbers.
  • The excerpt uses "sequence" to refer to what R calls a vector.

📊 The table function

  • Applies to a data object and produces a frequency count of different values.
  • Example: table(X) shows how many times each value appears in X.
  • Output format: the first row shows the unique values, the second row shows their frequencies.

📈 The plot function

  • Can create visualizations from data.
  • Example: plot(table(X)) produces a bar-plot showing the frequency distribution.
  • To plot different data, simply replace X with another object name (e.g., plot(table(my.vector))).
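
Putting the three functions together, a minimal sketch of the workflow described above (the data are the sleep hours used earlier in this chapter):

    X <- c(5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9)  # create the data object
    table(X)        # frequency count of each value
    plot(table(X))  # bar plot of the frequencies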

💾 Saving and loading work

💾 Persistence of objects

  • Objects created during a session exist only in memory.
  • At the end of a session, objects are erased unless specifically saved.
  • Don't confuse: having an object in memory vs. having it saved on the hard disk.

💾 Save and load functions

  • Use save to store an object on disk: save(X, file = "X.RData").
  • Use rm to delete an object from memory: rm(X).
  • Use load to restore a saved object: load("X.RData").
  • After loading, the object is back in memory and can be used again.
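
The full save/remove/load cycle as a sketch, assuming a session in which X already exists and the working directory has been set:

    save(X, file = "X.RData")   # write X to a file in the working directory
    rm(X)                       # delete X from memory; typing X now gives an error
    load("X.RData")             # restore X from the saved file
    X                           # the object is back in memory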

🎓 Learning approach

🎓 Practice is essential

Learning R, like learning any programming language, can be achieved only through practice.

  • Don't just read code—run it yourself in parallel with reading explanations.
  • Experiment: introduce changes in code and data, observe how output changes.
  • The excerpt strongly recommends hands-on experimentation.

🎓 Don't fear mistakes

  • You should not be afraid to experiment.
  • Worst case: the computer may crash or freeze.
  • Solution: restarting the computer solves the problem.
  • The excerpt emphasizes that mistakes are part of learning and not dangerous.

🎓 Code navigation tips

  • Browse through previous lines using up and down arrow keys.
  • Move along a line using left and right arrow keys.
  • You can recall and edit earlier commands instead of retyping everything.
  • Example: find a previous line with c(...), move to the beginning, add X <-, and hit Return.

1.7 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply the fundamental statistical concepts of population, sample, parameter, and statistic to real-world scenarios involving polling and observational data.

📌 Key points (3–5)

  • Exercise 1.1 tests conceptual distinctions: identifying which elements in a polling scenario are populations, samples, parameters, or statistics.
  • Exercise 1.2 focuses on data observation: counting frequencies and identifying modes (most/least common values) in a small dataset.
  • Common confusion: distinguishing between a parameter (characteristic of the entire population) and a statistic (characteristic calculated from the sample).
  • Practical application: both exercises require applying definitions to concrete examples—polling results and customer counts.

🗳️ Exercise 1.1: Polling scenario

🗳️ The setup

  • A political candidate wants to assess her chances in party primaries.
  • A polling agency surveys 500 registered voters of the party.
  • One question asks about willingness to vote for a female candidate:
    • 42% prefer a female candidate
    • 38% say gender is irrelevant
    • The rest (20%) prefer a male candidate

🔍 What you must identify

The exercise asks you to classify four items as:

  1. Population: the entire group of interest
  2. Sample: the subset actually observed
  3. Parameter: a numerical characteristic of the population
  4. Statistic: a numerical characteristic calculated from the sample

📋 The four items to classify

Item | Description
1 | The 500 registered voters
2 | The percentage, among all registered voters of the given party, of those that prefer a male candidate
3 | The number 42% that corresponds to the percentage of those that prefer a female candidate
4 | The voters in the state that are registered to the given party

💡 How to approach this

  • Don't confuse: "all registered voters of the party" (population) vs. "the 500 voters surveyed" (sample).
  • Don't confuse: a percentage calculated from the 500 surveyed (statistic) vs. the true percentage across all party voters (parameter).
  • Example reasoning: Item 3 (42%) comes from the 500-person survey, so it describes the sample, not the entire population.

📊 Exercise 1.2: Customer waiting data

📊 The dataset

Number of customers waiting at a coffee shop opening, recorded over 25 days:

4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3

🔢 Three tasks

🔢 Task 1: Count days with 5 customers

  • Identify how many days had exactly 5 customers waiting.
  • Method: scan the list and count occurrences of the value 5.

🔢 Task 2: Find the mode (most frequent value)

The number of waiting customers that occurred the largest number of times.

  • This is asking for the mode: the value that appears most often in the dataset.
  • Example approach: count how many times each value (0, 1, 2, 3, 4, 5) appears, then identify which has the highest count.

🔢 Task 3: Find the least frequent value

The number of waiting customers that occurred the least number of times.

  • Opposite of the mode: which value appears fewest times?
  • Don't confuse: "least frequent value" (which number appears least often) vs. "smallest value" (which is 0 in this dataset).

🧮 How to solve

  • The excerpt earlier showed that table(X) constructs a table of counts of different values.
  • You can apply table() to this dataset to see the frequency of each customer count.
  • Example: table() applied to this dataset shows that 1 appears 8 times and 2 appears 6 times, which lets you identify the mode and the least frequent value (see the sketch below).
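
A minimal sketch of that computation (the object name n.cust is a choice made here):

    n.cust <- c(4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3)
    table(n.cust)   # frequency of each number of waiting customers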

1.8 Summary

🧭 Overview

🧠 One-sentence thesis

This summary consolidates the foundational concepts of statistics—data, statistics, parameters, populations, and samples—and introduces basic R functions for data manipulation and visualization.

📌 Key points (3–5)

  • Core definitions: data, statistic, parameter, population, and sample are the building blocks of statistical reasoning.
  • Statistics vs. probability: statistics processes and infers from data; probability models randomness mathematically.
  • Common confusion: a statistic (a numerical summary of sample data) estimates a parameter (a numerical characteristic of the entire population).
  • R workflow basics: setting directories, combining data, saving/loading objects, counting values, and plotting.
  • Key question: whether a sample can appropriately represent an entire population depends on context and research goals.

📚 Core terminology

📊 Data

Data: A set of observations taken on a sample from a population.

  • Data are the raw observations you collect from a subset (sample) of the group you care about (population).
  • They are not yet summaries or conclusions—just the recorded values.

🔢 Statistic

Statistic: A numerical characteristic of the data. A statistic estimates the corresponding population parameter.

  • A statistic is computed from the sample (e.g., the average number of forum contributions this term).
  • It serves as an estimate of the true population value (the parameter).
  • Example: 42% of 500 polled voters prefer a female candidate—that 42% is a statistic.

🎯 Parameter

  • A parameter is a numerical characteristic of the entire population (not just the sample).
  • Example: the true percentage of all registered voters who prefer a male candidate is a parameter.
  • Don't confuse: statistic (sample-based, observed) vs. parameter (population-based, usually unknown).

🌍 Population and sample

  • Population: the complete group you want to understand (e.g., all registered voters of a given party in a state).
  • Sample: the subset you actually observe (e.g., 500 polled voters).
  • The sample is used to make inferences about the population.

🔬 Statistics vs. probability

📐 Statistics

Statistics: The science that deals with processing, presentation and inference from data.

  • Focuses on collecting, summarizing, and drawing conclusions from observed data.
  • Example: using sample data to estimate population parameters.

🎲 Probability

Probability: A mathematical field that models and investigates the notion of randomness.

  • Provides the theoretical foundation for understanding variability and uncertainty.
  • Statistics applies probability models to real data.

💻 R functions introduced

🗂️ Workspace management

Function | Purpose
setwd(dir) | Set the working directory to dir
save(..., file) | Write R objects to the specified file
load() | Reload datasets written with save
rm() | Remove objects from the workspace

🔧 Data manipulation and visualization

Function | Purpose
c() | Combine arguments into a vector
table() | Construct a table of counts of different values
plot() | Generic function for plotting R objects

📊 Example workflow

  • To visualize frequency counts of data stored in object X:
    • Use plot(table(X)).
    • The table() function counts occurrences of each value; plot() displays the counts as a bar chart.
  • To apply the same workflow to other data (e.g., my.vector), replace X with my.vector: plot(table(my.vector)).

🧪 Exercise examples

🗳️ Polling scenario (Exercise 1.1)

  • Context: 500 registered voters polled; 42% prefer a female candidate, 38% say gender is irrelevant, the rest prefer a male candidate.
  • Identify:
    • Population: all voters registered to the given party in the state.
    • Sample: the 500 registered voters who were polled.
    • Statistic: the 42% (computed from the sample).
    • Parameter: the true percentage among all registered voters who prefer a male candidate (unknown, to be estimated).

☕ Customer counts (Exercise 1.2)

  • Data: number of customers waiting at a coffee shop over 25 days: 4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3.
  • Tasks:
    1. Count how many days had 5 customers waiting.
    2. Identify the most frequent count (the mode).
    3. Identify the least frequent count.
  • These tasks illustrate using table() to summarize frequency distributions.

🤔 Discussion question

🎯 Can a sample represent the entire population?

  • The excerpt poses: is it appropriate to use only a sample to represent the entire population?
  • Context matters: the answer depends on the research question, the target population, and how the sample is selected.
  • Example approach: identify a question from your field, define the population of interest, and discuss whether a sample would be suitable for investigating that question.
  • Key tension: samples are practical (cheaper, faster) but introduce uncertainty about how well they reflect the population.

2.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing provides formal guidelines for choosing between two competing options by determining whether a meaningful phenomenon exists beyond random noise.

📌 Key points (3–5)

  • Purpose of hypothesis testing: a crucial component in decision-making when selecting between two competing options.
  • What it detects: meaningful phenomena hidden in environments contaminated by random noise.
  • Scope of this chapter: formulation of statistical hypothesis testing, decision rules, and testing expectations and probabilities.
  • Common confusion: hypothesis testing answers "Is there a phenomenon at all?" (the first step), not "What is the exact effect?" or "Why does it happen?"
  • Critical awareness: students must identify limitations and dangers of misinterpreting test conclusions.

🎯 What hypothesis testing does

🎯 Core purpose

Hypothesis testing: formal guidelines for selecting one of two competing options in decision-making.

  • It is not about estimating a parameter value; it is about choosing between two alternatives.
  • The excerpt frames it as a decision-making tool, not merely a calculation.
  • Example: An organization must decide whether a new process is better than the old one—hypothesis testing provides a structured way to make that choice.

🔍 What it detects

  • Statistical inference aims to detect and characterize meaningful phenomena that may be hidden in random noise.
  • Hypothesis testing is typically the first step in the inference process.
  • It answers: "Is there a phenomenon at all?"—not "How large is it?" or "What causes it?"
  • Don't confuse: hypothesis testing establishes presence/absence of an effect, not the magnitude or mechanism.

📐 Scope and structure

📐 What this chapter covers

The excerpt states the chapter deals with:

  • Formulation of statistical hypothesis testing.
  • Decision rules associated with hypothesis testing.
  • Context: testing hypotheses about the expectation of a measurement and the probability of an event.
  • Future chapters: hypothesis testing for other parameters.

🧩 Learning outcomes

By the end of the chapter, students should be able to:

Skill | Description
Formulate hypotheses | Set up statistical hypotheses for testing
Test expectations and probabilities | Use sample data to test hypotheses about measurement expectations and event probabilities
Recognize limitations | Identify the limitations of hypothesis testing and the danger of misinterpreting conclusions

⚠️ Critical awareness

⚠️ Limitations and misinterpretation

  • The excerpt explicitly warns about limitations of statistical hypothesis testing.
  • There is a danger of misinterpretation of the test's conclusions.
  • Students must learn to identify these pitfalls—not just how to perform the test mechanically.
  • Example: A test result might be statistically significant but not practically meaningful, or a non-significant result might be due to small sample size rather than absence of an effect.

🧭 Hypothesis testing as a first step

  • The excerpt emphasizes that hypothesis testing is "typically the first" step in inference.
  • It does not provide a complete picture on its own; it is part of a larger process.
  • Don't confuse: passing the hypothesis test does not automatically mean the phenomenon is large, important, or well-understood—it only means "something is there."

2.2 The Sampled Data

🧭 Overview

🧠 One-sentence thesis

Statistics aims to learn population characteristics from samples, and understanding variation—both in data itself and across different samples—is central to quantifying uncertainty and making valid inferences.

📌 Key points (3–5)

  • Central role of variation: Variation exists in all data (measurement conditions, sampling randomness) and assessing it is the statistician's main concern.
  • Sample-to-sample differences: Two random samples from the same population will differ due to sampling variation, but larger samples produce characteristics closer to true population values.
  • Frequency distributions: The primary way to summarize data variability is through frequency, relative frequency, and cumulative relative frequency tables.
  • Common confusion: Variation arises from two sources—measurement error for a given observation vs. sampling variation (different individuals in different samples).
  • Critical evaluation needed: Biased samples, poor data quality, small sample sizes, and confounding factors can all produce misleading conclusions if not carefully assessed.

📊 Understanding variation in data

📊 What variation means

Variation: differences present in any set of data, arising from measurement conditions or inexact quantities.

  • Variation is not an error to eliminate—it is an inherent feature that statistics must quantify.
  • Example: Eight 16-ounce beverage cans measured 15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5 ounces; the amounts vary from can to can rather than all being exactly 16 ounces.
  • Sources: measurement conditions varied, or the exact target amount was not achieved in production.

🔀 Sampling variation vs. measurement variation

  • Measurement variation: differences in recorded values for the same unit under different conditions.
  • Sampling variation: differences between samples because different individuals are randomly selected.
  • Example: Two researchers (Doreen and Jung) each sample 50 students to study sleep time; randomness ensures different students appear in each sample, producing different sleep patterns even if both samples represent the population well.
  • Don't confuse: the samples differ not because of measurement error, but because they contain different people.

📏 Sample size and uncertainty

  • Larger samples produce sample characteristics (e.g., averages) closer to the true population value.
  • Even with larger samples, two samples will still differ from each other, but the difference shrinks.
  • Practical guideline: polls with 1200–1500 observations are considered sufficient if sampling is random and well-executed.
  • The theory of statistical inference (covered later in the source material) justifies these claims and provides techniques to quantify uncertainty.

📋 Frequency distributions

📋 What frequency measures

Frequency: the number of times a given value occurs in a data set.

  • Example: Twenty students reported daily work hours: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.
  • Frequency table shows: 2 hours (3 students), 3 hours (5 students), 4 hours (3 students), 5 hours (6 students), 6 hours (2 students), 7 hours (1 student).
  • The total of the frequency column equals the sample size (20 students).

🔢 Relative frequency

Relative frequency: the fraction of times a value occurs, calculated by dividing each frequency by the total number of observations.

  • Can be expressed as fractions, decimals, or percents.
  • Example: For 2 hours, relative frequency = 3/20 = 0.15 (or 15%).
  • The sum of all relative frequencies always equals 1 (or 100%).

📈 Cumulative relative frequency

Cumulative relative frequency: the accumulation of all previous relative frequencies up to and including the current value.

  • Calculation: add all previous relative frequencies to the current value's relative frequency.
  • Example progression for work hours:
    • 2 hours: 0.15
    • 3 hours: 0.15 + 0.25 = 0.40
    • 4 hours: 0.40 + 0.15 = 0.55
    • 5 hours: 0.55 + 0.30 = 0.85
    • 6 hours: 0.85 + 0.10 = 0.95
    • 7 hours: 0.95 + 0.05 = 1.00
  • The last entry is always 1.00, indicating 100% of data has been accumulated.
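
The three columns can be reproduced in R with table, a division, and cumsum; in this sketch the object name work is chosen for the work-hours data above:

    work <- c(5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3)
    freq <- table(work)                  # frequency of each value
    rel.freq <- freq / length(work)      # relative frequency
    cum.rel.freq <- cumsum(rel.freq)     # cumulative relative frequency
    cbind(freq, rel.freq, cum.rel.freq)  # the three columns side by side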

⚠️ Critical evaluation of data

⚠️ Why critical evaluation matters

  • Inappropriate sampling or data collection methods produce samples that do not represent the target population.
  • Naïve statistical analysis of flawed data leads to misleading conclusions.
  • The excerpt emphasizes evaluating analyses critically before accepting their conclusions.

🚫 Common data problems

Problem | Description | Impact
Biased samples | Sample not representative of population | Inaccurate, invalid results
Data quality issues | Errors from inaccurate form handling, input mistakes | Unreliable data requiring cleaning
Self-selected samples | Only responses from volunteers (e.g., call-in surveys) | Often biased
Sample size too small | Insufficient observations | Unreliable conclusions (though sometimes unavoidable, e.g., crash testing)
Undue influence | Questions or collection methods that influence responses | Distorted data
Causality confusion | Correlation between two variables mistaken for causation | May both relate to a third variable
Self-funded studies | Research by organizations supporting their own claims | Potential bias; requires careful evaluation
Misleading displays | Improperly displayed graphs, incomplete data | False impressions
Confounding | Effects of multiple factors cannot be separated | Impossible to draw valid conclusions about individual factors

🔍 How to approach questionable studies

  • Do not automatically assume a study is good or bad based on its source.
  • Evaluate it on its merits and the quality of work done.
  • Read the study carefully and assess whether it is impartial.

💾 Working with external data

💾 Why external data import matters

  • Manual data entry is inefficient for large datasets.
  • Real-world statistical work requires reading data from files.
  • The excerpt introduces the Comma Separated Values (CSV) format as a standard method.

📁 CSV file format

  • CSV files are ordinary text files.
  • Can be created manually or by converting data from other formats.
  • Can be browsed and edited using spreadsheet programs (Excel, Calc from LibreOffice).
  • Example file mentioned: "ex1.csv" containing sex and height data for 100 individuals.

🗂️ Workflow recommendations

  • Create a special directory for course materials (example: "IntroStat").
  • Store data files in this directory before reading them.
  • Important caution: never edit raw data files directly; keep the originals unchanged and document any changes through R scripts (this point is expanded in the next section).

2.3 Reading External Data into R

🧭 Overview

🧠 One-sentence thesis

Reading external data into R requires setting a working directory, using the read.csv function to import CSV files into data frames, and understanding that variables are classified as either factors (qualitative) or numeric (quantitative).

📌 Key points (3–5)

  • Why external data reading matters: Real datasets are too large to enter manually, so R must read from files (CSV format is common).
  • The workflow: Save the CSV file in a directory, set that directory as R's working directory, then use read.csv() to import the file into a data frame object.
  • Data frames structure: Columns are variables (measurements), rows are observations (subjects); each variable has a type.
  • Two major variable types: Factors (qualitative/categorical data like sex or hair color) vs numeric (quantitative data like height or pulse rate).
  • Common confusion: Don't confuse the working directory (where R looks for files by default) with any other directory; files must be in the working directory or you must provide the full path.

📂 Preparing to read data

📂 Saving the file locally

  • Before reading a file into R, obtain a copy and store it in a directory on your computer.
  • The excerpt recommends creating a special directory (example name: "IntroStat") to keep all course-related material organized.
  • Important: Never edit raw data files directly. Keep them in a separate directory and never overwrite them. Any changes should be documented through R scripts and saved under a different name.

🗂️ CSV file format

CSV (Comma Separated Values) files are ordinary text files that can be created manually or by converting data from other formats.

  • Common tools for creating/editing CSV files: Excel (Microsoft Office) or Calc (LibreOffice).
  • Opening a CSV file in a spreadsheet program displays data in cells; the first row should contain variable names (preferably single character strings with no spaces).
  • Following rows contain data values.
  • When saving from a spreadsheet, choose the CSV format in the "Save by Type" selection.

🎯 Setting the working directory

The working directory is the first place R searches for files; files produced by R are saved there.

  • R must be notified where the file is located in order to read it.
  • In RStudio, set the working directory via the "Files" panel:
    • Click "More" on the toolbar at the top of the Files panel.
    • Select "Set As Working Directory" from the menu.
    • Browse to the target directory, select it, and click "OK".
  • Example: If you save "ex1.csv" in the "IntroStat" directory, set "IntroStat" as the working directory.

📝 Using R scripts for reproducibility

  • Working only from the R console makes it hard to retrace analysis steps without documentation.
  • Better approach: Organize R code in scripts (plain text files containing R commands).
  • Executing scripts performs the full analysis and makes work reproducible.
  • In RStudio:
    • Click the first button on the main toolbar.
    • Select "R-script" from the popup menu.
    • Add R commands in the editing panel.
    • Save the script.
    • Select lines and click "Run" to execute them in the R console.
  • Once a working directory is set, the history of subsequent R sessions is stored there; objects created in a session can be uploaded next time if you save the session image.
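
A minimal sketch of such a script (the directory name IntroStat and the file name ex1.csv follow the examples in this chapter; read.csv itself is described in the next subsection):

    # analysis.R -- a small, reproducible analysis script
    setwd("~/IntroStat")           # point R at the project directory
    ex.1 <- read.csv("ex1.csv")    # read the data file into a data frame
    head(ex.1)                     # inspect the first rows
    plot(table(ex.1$sex))          # bar plot of the factor variable "sex"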

🔄 Reading CSV files into R

🔄 The read.csv function

  • Use the R function read.csv to read CSV files.
  • Syntax: ex.1 <- read.csv("_data/ex1.csv")
    • Input: the address of a CSV file (placed between double-quotes).
    • Output: a data frame object with the file's content.
  • If the file is in the working directory, giving just the file name is sufficient.
  • If the file is elsewhere, provide the complete address including the path.
  • The file need not be on the computer; you can provide a URL (internet address) as the address.

📊 Example: reading "ex1.csv"

The excerpt uses a file "ex1.csv" containing data on sex and height of 100 individuals, available at http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv.

After running ex.1 <- read.csv("_data/ex1.csv") and inspecting with head(ex.1), the output shows:

  • Row 1: id 5696379, FEMALE, height 182
  • Row 2: id 3019088, MALE, height 168
  • Row 3: id 2038883, MALE, height 172
  • Row 4: id 1920587, FEMALE, height 154
  • Row 5: id 6006813, MALE, height 174
  • Row 6: id 4055945, FEMALE, height 176

🗃️ Understanding data frames

🗃️ Data frame structure

Data frames are the standard tabular format of storing statistical data in R.

  • Columns = variables: correspond to measurements.
  • Rows = observations: correspond to subjects.
  • Example from "ex1.csv" with 100 subjects:
    • Subject 1: female, height 182 cm, id 5696379.
    • Subject 98: male, height 195 cm, id 9383288.

🏷️ Variables in the example

The "ex1.csv" data frame has three variables:

Variable | Description | Type
id | 7-digit unique identifier of the subject | Numeric
sex | Sex of each subject (values: "MALE" or "FEMALE") | Factor
height | Height in centimeters | Numeric
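
A few standard functions can be used to inspect this structure; a sketch, assuming ex.1 was created with read.csv as above:

    dim(ex.1)        # 100 rows (observations) and 3 columns (variables)
    names(ex.1)      # the variable names: "id", "sex", "height"
    summary(ex.1)    # a summary of each variable
    ex.1$height[1]   # the height of the first subject: 182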

🔢 Data types in R

🔢 Two major types

R associates each variable with a type characterizing its content:

Type | Description | Examples
Factor (Qualitative) | Categorizing or describing attributes; generally words or letters | Hair color, blood type, ethnic group, car type, street name, sex
Numeric (Quantitative) | Result of counting or measuring; always numbers | Amount of money, pulse rate, weight, number of people, height

🎨 Factors (qualitative data)

  • Represent categories or levels.
  • Examples: hair color (black, dark brown, light brown, blonde, gray, red), blood type (AB+, O-, B+).
  • Limitation: Many numerical techniques don't apply. For example, it doesn't make sense to find an average hair color or blood type.
  • When values are qualitative or level-based, the variable is a factor.

📏 Numeric variables (quantitative data)

  • Always numbers; usually the data of choice because many analytical methods are available.
  • Result of counting or measuring population attributes.
  • Examples: amount of money, pulse rate, weight, number of residents, number of students.
  • Subtypes:
    • Discrete: Result of counting (e.g., number of students).
    • Continuous: Result of measuring (implied but not fully detailed in this excerpt).
  • When values are numerical, the variable is a quantitative or numeric variable.

🔍 How to distinguish

  • Factor: If you can describe it with words/categories and can't meaningfully average it → factor.
  • Numeric: If it's a number from counting or measuring and you can perform arithmetic operations → numeric.
  • Example from "ex1.csv": sex is a factor (categories "MALE"/"FEMALE"), height is numeric (measured in cm).
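
The classification can be checked directly in R, as in this sketch (assuming ex.1 as above; recent R versions may read text columns as character rather than factor, and factor() converts them):

    class(ex.1$height)        # "numeric" or "integer": arithmetic methods apply
    class(ex.1$sex)           # "factor" (or "character" in newer R versions)
    levels(factor(ex.1$sex))  # the categories: "FEMALE", "MALE"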

2.4 Data Types and Variables

🧭 Overview

🧠 One-sentence thesis

Understanding the distinction between factors (qualitative data) and numeric variables (quantitative data) is essential because different statistical methods apply to each type, even when factors are coded with numbers.

📌 Key points (3–5)

  • Two major data types: factors (qualitative) and numeric (quantitative); R associates a type with each variable that determines which analysis methods can be used.
  • Factors vs numeric: factors describe categories or attributes (hair color, blood type); numeric data result from counting or measuring (pulse rate, weight).
  • Common confusion: coding factor levels with numbers does not make them numeric—they must still be treated as factors, or R will apply inappropriate methods.
  • Discrete vs continuous numeric: discrete comes from counting (number of books), continuous from measuring (weight of backpacks); R stores both as "numeric" and does not distinguish between them.
  • Why type matters: different statistical methods apply to different data types; for example, finding an average hair color or blood type does not make sense.

📊 Core data types in R

📊 Factors (qualitative data)

Factor: qualitative data associated with categorization or the description of an attribute.

  • The type in R is "factor."
  • Examples from the excerpt: hair color (black, blonde, gray), blood type (AB+, O-, B+), sex, backpack color (red, black, green, gray).
  • Generally described by words or letters, not numbers.
  • Why it matters: many numerical techniques do not apply—you cannot compute an average hair color or blood type.

🔢 Numeric variables (quantitative data)

Quantitative data: data generated by numerical measurements; always numbers.

  • The type in R is "numeric."
  • Result from counting or measuring attributes of a population.
  • Examples: amount of money, pulse rate, weight, number of phone calls, height (182 cm, 195 cm).
  • Why preferred: many analysis methods are available for numeric data.

🧮 Subdivisions of numeric data

🧮 Discrete numeric data

Quantitative discrete data: all data that are the result of counting; take on only certain numerical values.

  • Examples from the excerpt: number of phone calls (0, 1, 2, 3, etc.), number of books in backpacks (3, 4, 2, 1), number of calves born to cows.
  • You cannot have 2.5 books or 3.7 phone calls.

📏 Continuous numeric data

Quantitative continuous data: data that are the result of measuring on a continuous scale, assuming accurate measurement.

  • Examples: angles in radians (π/6, π/3, π/2, etc.), weights of backpacks (6.2, 7, 6.8, 9.1, 4.3 pounds).
  • Can take any value within a range.
  • Note from excerpt: backpacks carrying three books can have different weights because weight is measured, not counted.

⚙️ How R handles numeric subtypes

  • R does not distinguish between discrete and continuous numeric data.
  • Both are stored as "numeric."
  • The distinction between continuous and discrete is usually not reflected in the statistical methods used.
  • Implication: we treat all numeric data as one category for analysis purposes.

⚠️ Common confusion: numbers that are really factors

⚠️ Coding factors with numbers

  • One may code categories of qualitative data with numerical values (e.g., 1 = male, 2 = female).
  • Critical point: the resulting data should nonetheless be treated as a factor, not as numeric.
  • Example from the excerpt: quiz scores recorded as numbers throughout the term but reported categorically as A, B, C, D, or F at the end.

🔍 Why type assignment matters

  • Different statistical methods are applied to different data types.
  • If a factor coded with numbers is mislabeled as numeric, R will treat it as quantitative data and apply inappropriate methods.
  • What to do: make sure that factors using numbers to denote levels are labeled as factors in R.
  • By default, R saves variables containing non-numeric values as factors; otherwise, variables are saved as numeric.
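
A small illustration of this point, using a hypothetical set of survey answers coded 1 to 5 (the data and labels below are invented for the example):

```r
# Hypothetical survey answers coded with numbers, then relabeled as a factor.
answers <- c(1, 2, 2, 5, 3, 4, 1, 5)

answers.factor <- factor(answers, levels = 1:5,
                         labels = c("strongly disagree", "disagree", "neutral",
                                    "agree", "strongly agree"))

class(answers)          # "numeric": R would happily (and misleadingly) average these
class(answers.factor)   # "factor": now treated as qualitative data
table(answers.factor)   # frequency of each level, the appropriate summary
```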

🗂️ Data structures

🗂️ Data frames

Data frame: a tabular format for storing statistical data; columns correspond to variables and rows correspond to observations.

  • Each column represents a variable (a measurement recorded for each subject).
  • Each row represents an observation (the evaluation of variables for a given subject).
  • Example from the excerpt: a data set with 100 subjects; subject 1 is a female of height 182 cm with ID 5696379; subject 98 is a male of height 195 cm with ID 9383288.

🏷️ Variables and observations

Variable: a measurement that may be carried out over a collection of subjects; outcome may be numerical (quantitative variable) or non-numeric (factor).

Observation: the evaluation of a variable (or variables) for a given subject.

  • In the example: "sex" is a factor, "height" is a numeric variable.
  • Rows = subjects = observations; columns = variables.

📁 File formats and R functions

📁 CSV files

CSV files: a digital format for storing data frames.

  • Can be read into R using read.csv(file), which creates a data frame.

🛠️ Key R functions introduced

Function | Purpose
data.frame() | Creates data frames; fundamental data structure for modeling
head() / tail() | Return first or last parts of a data frame
sum() / cumsum() | Return sum and cumulative sum of values
read.csv(file) | Reads a table-format file and creates a data frame
table() | Creates frequency tables (used in Exercise 2.2)

📝 Exercise examples

📝 Exercise 2.1: Hurricane frequency table

  • A relative frequency table on hurricanes (1851–2004) with missing entries.
  • Categories based on minimum wind speed.
  • Tasks: calculate relative frequency of category 1 hits, and of category 4 or higher.

📝 Exercise 2.2: Calves born to cows

  • Data on number of calves born during cows' productive years, stored in R object "calves."
  • Code provided: freq <- table(calves) and cumsum(freq), whose output shows the cumulative counts; see the sketch after this list.
  • Tasks: determine total number of cows, how many gave birth to 4 calves, and relative frequency of cows with at least 4 calves.
  • Type note: number of calves is discrete numeric data (result of counting).
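
The exercise's own data are not reproduced here, but the workflow it refers to can be sketched with invented values (the vector below is hypothetical):

```r
# Hypothetical stand-in for the "calves" data of Exercise 2.2.
calves <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 2, 3, 4)   # number of calves per cow

freq <- table(calves)     # frequency of each count
cumsum(freq)              # cumulative counts; the last entry is the total number of cows

sum(freq)                                            # total number of cows
freq["4"]                                            # how many cows gave birth to 4 calves
sum(freq[as.numeric(names(freq)) >= 4]) / sum(freq)  # relative frequency of at least 4 calves
```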

Summary

2.5 Summary

🧭 Overview

🧠 One-sentence thesis

This summary consolidates key statistical terminology and R functions for sampling and data structures, distinguishing between qualitative factors and quantitative measurements to ensure appropriate analysis methods are applied.

📌 Key points (3–5)

  • Population vs sample: population is all subjects under study; sample is a representative portion.
  • Frequency concepts: frequency counts occurrences; relative frequency is the ratio to total; cumulative relative frequency sums up to a given value.
  • Data structures: data frames store variables (columns) and observations (rows); CSV files are the digital format.
  • Common confusion: factors coded with numbers should still be treated as qualitative, not quantitative—variable type determines which statistical methods apply.
  • Why variable type matters: R applies different methods to factors vs numeric data, so correct labeling (especially for numeric-coded factors) is essential.

📊 Core terminology

📊 Population and sample

Population: The collection, or set, of all individuals, objects, or measurements whose properties are being studied.

Sample: A portion of the population under study. A sample is representative if it characterizes the population being studied.

  • Population = the entire group you want to understand.
  • Sample = the subset you actually measure.
  • A sample should characterize (represent) the population to be useful.

📊 Frequency measures

Frequency: The number of times a value occurs in the data.

Relative Frequency: The ratio between the frequency and the size of data.

Cumulative Relative Frequency: The term applies to an ordered set of data values from smallest to largest. The cumulative relative frequency is the sum of the relative frequencies for all values that are less than or equal to the given value.

Measure | What it counts | How it's calculated
Frequency | Raw count of occurrences | Direct count
Relative Frequency | Proportion of total | Frequency ÷ total size
Cumulative Relative Frequency | Running total up to a value | Sum of all relative frequencies ≤ that value

  • Cumulative relative frequency requires ordered data (smallest to largest).
  • Example: if values 1, 2, 3 have relative frequencies 0.2, 0.3, 0.5, then cumulative relative frequency at 2 is 0.2 + 0.3 = 0.5.
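
The same example can be reproduced with the functions listed in this summary; the vector below simply encodes relative frequencies 0.2, 0.3, and 0.5 for the values 1, 2, and 3:

```r
# Ten observations with relative frequencies 0.2, 0.3, 0.5 for the values 1, 2, 3.
x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

freq <- table(x)               # frequencies: 2, 3, 5
rel.freq <- freq / sum(freq)   # relative frequencies: 0.2, 0.3, 0.5
cumsum(rel.freq)               # cumulative relative frequencies: 0.2, 0.5, 1.0
```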

🗂️ Data structures

🗂️ Data frame

Data Frame: A tabular format for storing statistical data. Columns correspond to variables and rows correspond to observations.

  • Columns = variables (measurements you take).
  • Rows = observations (individual subjects or cases).
  • This is the fundamental structure for most R modeling.

🗂️ Variable and observation

Variable: A measurement that may be carried out over a collection of subjects. The outcome of the measurement may be numerical, which produces a quantitative variable; or it may be non-numeric, in which case a factor is produced.

Observation: The evaluation of a variable (or variables) for a given subject.

  • Variable = what you measure (e.g., height, category).
  • Observation = one subject's recorded values.

🗂️ CSV files

CSV Files: A digital format for storing data frames.

  • CSV = comma-separated values.
  • Used to store and exchange data frames digitally.

🔢 Variable types

🔢 Factor vs quantitative

Factor: Qualitative data that is associated with categorization or the description of an attribute.

Quantitative: Data generated by numerical measurements.

  • Factor = categories or labels (e.g., A/B/C grades, gender).
  • Quantitative = numeric measurements (e.g., height, weight).
  • Don't confuse: even if you code factor levels with numbers (e.g., 1 = male, 2 = female), the variable should still be treated as a factor, not quantitative.

⚠️ Why variable type matters

  • R applies different statistical methods depending on whether data is factor or numeric.
  • The excerpt emphasizes: "one should make sure that the variables that are analyzed have the appropriate type."
  • Common mistake: factors coded with numbers may be incorrectly treated as quantitative by R unless explicitly labeled as factors.
  • Example: quiz scores reported as A/B/C/D/F are factors; even if you code them as 1/2/3/4/5, they should remain factors for analysis.

🛠️ R functions introduced

🛠️ Data manipulation

  • data.frame(): creates data frames (tightly coupled collections of variables).
  • head() and tail(): return the first or last parts of a data frame.
  • read.csv(file): reads a CSV file and creates a data frame from it.

🛠️ Calculation functions

  • sum(): returns the sum of values.
  • cumsum(): returns the cumulative sum of values.

💬 Discussion point

💬 Coding factors with numbers

  • The excerpt notes a common practice: coding factor levels using numerical values.
  • The forum question asks: what are the benefits or disadvantages of this practice?
  • Key consideration from the excerpt: even when coded numerically, factors must be treated as qualitative, not quantitative, to apply correct statistical methods.
  • Example: an organization might code survey responses (strongly disagree = 1, disagree = 2, neutral = 3, agree = 4, strongly agree = 5), but these should remain factors for proper analysis.

Comparing Two Samples: Student Learning Objectives

3.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

This chapter extends statistical inference to examine how one variable (explanatory) affects the distribution of another variable (response) by comparing two sub-samples split by a two-level factor.

📌 Key points (3–5)

  • Core framework: the next 3 chapters study the relationship between two variables—one (response) whose distribution is investigated, and one (explanatory) that may affect it.
  • This chapter's scope: the explanatory variable is a factor with two levels, splitting the sample into two sub-samples for comparison.
  • Three inference tools: point estimation, confidence intervals, and hypothesis testing are all used to compare distributions.
  • Common confusion: distinguish the response (the variable whose distribution we study) from the explanatory variable (the factor that may influence the response).
  • Practical methods: R functions t.test (for comparing expectations) and var.test (for comparing variances) carry out the inference.

🔍 Core framework: response vs explanatory variable

🔍 Response variable

The variable whose distribution is being investigated.

  • This is the outcome or measurement you want to understand.
  • The chapter focuses on how its distribution differs between groups.

🔍 Explanatory variable

The variable which may have an effect on the distribution of the response.

  • In this chapter, the explanatory variable is a factor with two levels.
  • It divides the sample into two sub-samples.
  • Example: An organization collects data on employee productivity (response) and wants to see if training status (trained vs. untrained, a two-level factor) affects it.

🔀 How the split works

  • The two-level factor creates two groups.
  • Statistical inference then compares the distribution of the response variable across these two groups.
  • Don't confuse: the explanatory variable is not being "measured" in the same way as the response; it is a grouping criterion.

🧰 Three inference tools for comparison

🧰 Point estimation

  • Compute summary statistics (e.g., means, variances) separately for each sub-sample.
  • These estimates describe the center and spread of the response in each group.

🧰 Confidence intervals

  • Construct intervals to estimate the difference between the two groups (e.g., difference in means).
  • Provides a range of plausible values for the true difference.

🧰 Hypothesis testing

  • Test whether the observed difference between the two sub-samples is statistically significant.
  • Uses the same logic as earlier chapters: null hypothesis (no difference) vs. alternative hypothesis (there is a difference).

🛠️ R functions for two-sample inference

🛠️ t.test for comparing expectations

  • Purpose: investigate the difference between the expectations (means) of the response variable in the two sub-samples.
  • This is the primary tool when you want to know if the average response differs between the two groups.
  • Example: Compare average productivity between trained and untrained employees.

🛠️ var.test for comparing variances

  • Purpose: investigate the ratio between the variances of the response variable in the two sub-samples.
  • Variance measures spread; this test checks if variability differs between groups.
  • Example: Check if productivity variability is higher in one training group than the other.

📚 Context and learning goals

📚 Transition from single-variable inference

  • Earlier chapters focused on the distribution of a single variable.
  • Now the focus shifts to relationships: how one variable influences another.
  • This chapter is the first of three dealing with two-variable inference.

📚 What students should be able to do

Goal | Description
Define estimators, intervals, and tests | Understand the statistical tools for comparing distributions between two sub-populations
Apply t.test | Use R to compare means (expectations) between two groups
Apply var.test | Use R to compare variances between two groups

  • The emphasis is on both conceptual understanding (what the tools measure) and practical application (using R functions).
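
As a hedged sketch of how these two functions are called, consider a made-up data frame with a two-level factor "group" and a numeric "response" (the column names and simulated values are assumptions, not the book's data):

```r
# Simulated two-group data; the group sizes, means, and spreads are invented.
set.seed(1)
dat <- data.frame(group = rep(c("A", "B"), each = 30),
                  response = c(rnorm(30, mean = 5, sd = 2),
                               rnorm(30, mean = 6, sd = 2)))

t.test(response ~ group, data = dat)    # compares the expectations of the two sub-samples
var.test(response ~ group, data = dat)  # compares the variances via their ratio
```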

Displaying Data with Histograms and Box Plots

3.2 Displaying Data

🧭 Overview

🧠 One-sentence thesis

Histograms and box plots provide graphical representations of data frequency and distribution, allowing viewers to quickly assess shape, center, spread, and extreme values.

📌 Key points (3–5)

  • Histograms display continuous numerical data by dividing the range into equal intervals (boxes) whose heights represent the count of observations in each interval.
  • Box plots summarize data using five values: smallest value, first quartile, median, third quartile, and largest value.
  • When to use histograms: rule of thumb is 100 values or more; they can readily display large data sets.
  • Common confusion: In this book, histogram box height represents frequency (count), not area; some books use area to represent frequency or relative frequency.
  • Quartiles divide data into quarters: the median (Q2) splits data in half; Q1 is the middle of the lower half; Q3 is the middle of the upper half.

📊 Histograms

📊 What a histogram shows

Histogram: a frequently used method for displaying the distribution of continuous numerical data.

  • A histogram consists of contiguous boxes with a horizontal axis (labeled with what the data represents) and a vertical axis (labeled "Frequency").
  • By examining the histogram, one can appreciate:
    • The shape of the data
    • The center of the data
    • The spread of the data

🔨 How histograms are constructed

  • The range of the data (x-axis) is divided into equal intervals, which form the bases of the boxes.
  • The height of each box represents the count of observations that fall within that interval.
  • Example: A box with base between 160 and 170 has height 19 if there are 19 subjects with height greater than 160 but no more than 170 (160 < height ≤ 170).

💡 Height vs area representation

Don't confuse: Different conventions exist for histograms.

Convention | What the box represents
This book (R default) | Height = frequency (count)
Some other books | Area = frequency or relative frequency

  • In the area convention, for the same example above, height would be 19/10 = 1.9 (if area = frequency) or (19/100)/10 = 0.019 (if area = relative frequency).
  • This book follows R's default: height represents frequency.

🖥️ Creating histograms in R

  • Use the hist function applied to a sequence of numerical data.
  • To extract a variable from a data frame: use the format dataframe.name$variable.name.
  • Example: hist(ex.1$height) creates a histogram of the height variable from the ex.1 data frame.
  • The input to hist must be a numeric sequence.

📦 Box Plots

📦 What a box plot shows

Box plot (or box-whisker plot): gives a good graphical overall impression of the concentration of the data and shows how far from most of the data the extreme values are.

  • Constructed from five values:
    1. Smallest value
    2. First quartile (Q1)
    3. Median (Q2)
    4. Third quartile (Q3)
    5. Largest value

📏 The median (Q2)

Median: a number that measures the "center" of the data; it separates ordered data into halves.

  • Think of it as the "middle value" (though it does not have to be an observed value).
  • Half the values are the same size or smaller than the median; half are the same size or larger.
  • How to find it:
    • Order the data from smallest to largest.
    • If there is an odd number of values, the median is the middle value.
    • If there is an even number of values, the median is the average of the two middle values.

Example: For the ordered data 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 (14 values), the median is between the 7th value (6.8) and 8th value (7.2). Calculate: (6.8 + 7.2) / 2 = 7.

🔢 Quartiles (Q1 and Q3)

Quartiles: numbers that separate the data into quarters; they may or may not be part of the data.

  • First quartile (Q1): the middle value of the lower half of the data.
    • One-fourth of the values are the same or less than Q1.
    • Three-fourths of the values are more than Q1.
  • Third quartile (Q3): the middle value of the upper half of the data.

How to find quartiles:

  1. First find the median (Q2).
  2. Split the data into lower and upper halves (excluding the median if it's a data point).
  3. Q1 is the median of the lower half.
  4. Q3 is the median of the upper half.

Example: Using the same ordered data with median = 7:

  • Lower half: 1, 1, 2, 2, 4, 6, 6.8 → Q1 = 2 (middle value)
  • Upper half: 7.2, 8, 8.3, 9, 10, 10, 11.5 → Q3 = 9 (middle value)
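
The same five-number computation can be checked in R (note that R's quantile function uses an interpolation rule that may differ slightly from the "median of each half" method used above):

```r
# The ordered example data from the text.
x <- c(1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5)

median(x)                                  # 7
quantile(x, probs = c(0.25, 0.5, 0.75))    # R's default rule gives slightly different quartiles
boxplot(x)                                 # box plot built from the five summary values
```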

Measures of the Center of Data

3.3 Measures of the Center of Data

🧭 Overview

🧠 One-sentence thesis

The mean and median are the two most widely used measures of central location, and the median is generally better when extreme values or outliers are present because it is not affected by their precise numerical values.

📌 Key points (3–5)

  • Two main measures: mean (average) and median (middle value) are the most widely used measures of the center of data.
  • When to prefer median: the median is generally a better measure when there are extreme values or outliers because it is not affected by their precise numerical values.
  • How to calculate mean: add all values and divide by the count, or multiply each distinct value by its relative frequency and sum the products.
  • Common confusion: in symmetrical distributions the mean and median are the same, but in skewed distributions they differ—the mean is pulled toward the tail.
  • Why it matters: choosing the right measure of center depends on the shape of the distribution and the presence of outliers.

📊 The median as a measure of center

📊 What the median is

The median is a number that separates ordered data into halves; half the values are the same size or smaller than the median and half the values are the same size or larger than it.

  • The median is a way of measuring the "center" of the data.
  • You can think of it as the "middle value," although it does not actually have to be one of the observed values.
  • To find the median, order the data from smallest to largest and locate the middle.

🔢 How to calculate the median

When there is an even number of observations:

  • The median is between the two middle values.
  • Add the two middle values together and divide by 2.
  • Example: For the ordered data 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 (14 values), the median is between the 7th value (6.8) and the 8th value (7.2). The median is (6.8 + 7.2) divided by 2, which equals 7.

When there is an odd number of observations:

  • The median is the single middle value after ordering.

🎯 Why the median matters

  • Half of the values are smaller than the median and half are larger.
  • The median is not affected by the precise numerical values of outliers or extreme values.
  • This makes it a better measure of center when the data contains outliers.

📐 The mean as a measure of center

📐 What the mean is

The mean (average) is calculated by adding together all the values and dividing the result by the number of values.

  • The mean is the most commonly used measure of the center.
  • It is denoted by placing a bar over the variable name, e.g., x-bar (pronounced "x bar").
  • Example: To calculate the average weight of 50 people, add together the 50 weights and divide the result by 50.

🔢 Two ways to calculate the mean

Method 1: Direct averaging

  • Add all data points together and divide by the total number of observations.
  • Example: For the data 1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4 (11 values), the mean is (1+1+1+2+2+3+4+4+4+4+4) divided by 11, which equals 2.7.

Method 2: Using relative frequencies

  • Multiply each distinct value by its relative frequency, then sum the products across all values.
  • Example: For the same data, the distinct values are 1, 2, 3, and 4 with relative frequencies of 3/11, 2/11, 1/11, and 5/11, respectively. The mean is 1×(3/11) + 2×(2/11) + 3×(1/11) + 4×(5/11) = 2.7.
  • Both methods produce the same result.
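
Both computations can be written out directly in R for the same data:

```r
# The example data: 1 appears 3 times, 2 twice, 3 once, 4 five times.
x <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)

mean(x)                                        # Method 1: direct averaging (about 2.7)

rel.freq <- table(x) / length(x)               # relative frequency of each distinct value
sum(as.numeric(names(rel.freq)) * rel.freq)    # Method 2: weighted sum, same result
```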

⚠️ Limitation of the mean

  • The mean is affected by extreme values or outliers.
  • When outliers are present, the median is generally a better measure of the center.
  • Despite this limitation, the mean is still the most commonly used measure.

🔄 Quartiles and the inter-quartile range

🔄 What quartiles are

Quartiles are numbers that separate the data into quarters; they may or may not be part of the data.

  • The first quartile (Q1) is the middle value of the lower half of the data; one-fourth of the values are the same or less than Q1 and three-fourths are more.
  • The second quartile is the median; it separates the data into two halves.
  • The third quartile (Q3) is the middle value of the upper half of the data; three-fourths of the values are less than Q3 and one-fourth are more.

📏 How to find quartiles

  1. First find the median (second quartile).
  2. The first quartile is the middle value of the lower half of the data.
  3. The third quartile is the middle value of the upper half of the data.

Example: For the ordered data 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5:

  • The median is 7.
  • The lower half is 1, 1, 2, 2, 4, 6, 6.8; the first quartile (Q1) is 2.
  • The upper half is 7.2, 8, 8.3, 9, 10, 10, 11.5; the third quartile (Q3) is 9.

📊 Inter-quartile range (IQR)

The inter-quartile range is the distance between the third quartile and the first quartile: IQR = Q3 − Q1.

  • The IQR measures the spread of the central 50% of the data.
  • It is used to identify potential outliers.
  • Example: If Q3 = 9 and Q1 = 2, then IQR = 9 − 2 = 7.

🚨 Outliers and their identification

🚨 What outliers are

Outliers are values that do not fit with the rest of the data and lie outside of the normal range; they are data points with values that are much too large or much too small in comparison to the vast majority of the observations.

  • Outliers may have a substantial effect on the outcome of statistical analysis.
  • It is important to be alerted to the presence of outliers.

🔍 How to identify outliers using IQR

Upper threshold:

  • A data point larger than Q3 + 1.5 × IQR is marked as a potential outlier.

Lower threshold:

  • A data point smaller than Q1 − 1.5 × IQR is marked as a potential outlier.

Example: With Q1 = 2, Q3 = 9, and IQR = 7:

  • Upper threshold = 9 + 1.5 × 7 = 19.5
  • Lower threshold = 2 − 1.5 × 7 = −8.5
  • All data points between −8.5 and 19.5 are not outliers.
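
The threshold computation can be reproduced in R, using the quartiles obtained by hand above:

```r
# Outlier thresholds for the example data, with the hand-computed quartiles.
x <- c(1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5)
q1 <- 2
q3 <- 9

iqr <- q3 - q1               # 7
upper <- q3 + 1.5 * iqr      # 19.5
lower <- q1 - 1.5 * iqr      # -8.5

x[x > upper | x < lower]     # empty: no point in this data set is a potential outlier
```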

📦 Box plots and outliers

  • In a box plot, outliers are marked as points above or below the endpoints of the whiskers.
  • The whiskers extend from the ends of the box to the smallest and largest data values that are not outliers.
  • Example: For the height data with Q1 = 158.0, Q3 = 180.2, and IQR = 22.2, the lower threshold is 124.7. The minimal observation (117.0) is less than this threshold, so it is an outlier and is marked as a point below the end of the lower whisker. The second smallest observation (129) lies above the lower threshold and marks the endpoint of the lower whisker.

⚖️ Skewness and the relationship between mean and median

⚖️ Symmetrical distributions

  • A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and to the right of the vertical line are mirror images of each other.
  • In a perfectly symmetrical distribution, the mean and the median are the same.
  • Example: For the data 4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10, both the mean and the median are 7.

📉 Skewed distributions

  • When the histogram is not symmetrical, the distribution is skewed.
  • The excerpt describes a distribution that is "skewed to the left" when the right-hand side seems "chopped off" compared to the left side—the shape is pulled out towards the left.
  • In skewed distributions, the mean and median differ because the mean is affected by the extreme values in the tail.

🔀 Don't confuse symmetry with skewness

Distribution type | Shape | Mean vs Median
Symmetrical | Mirror images on both sides of a vertical line | Mean = Median
Skewed | One side is "chopped off" or pulled out | Mean ≠ Median; mean is pulled toward the tail

  • The key distinction: symmetry means the mean and median coincide; skewness means they differ.
  • The mean is more sensitive to extreme values, so it shifts toward the tail in skewed distributions.

Measures of the Spread of Data

3.4 Measures of the Spread of Data

🧭 Overview

🧠 One-sentence thesis

The standard deviation and variance quantify how spread out data values are from their mean, providing a numerical measure of variability that complements the inter-quartile range.

📌 Key points (3–5)

  • Two main spread measures: inter-quartile range (IQR) and standard deviation, with standard deviation being the most important.
  • Variance calculation: average of squared deviations from the mean (divided by n-1 for sample variance).
  • Standard deviation interpretation: measures distance of data values from the mean in the same units as the original data.
  • Common confusion: variance vs. standard deviation—variance is squared and has different units; standard deviation is the square root and matches data units.
  • When standard deviation works best: most helpful for symmetrical distributions; less useful for skewed distributions where quartiles are better.

📐 Core spread measures

📏 Inter-quartile range (IQR)

  • Introduced earlier in the context of box plots.
  • Measures the spread of the middle 50% of the data.
  • The excerpt notes this exists but focuses primarily on standard deviation as the most important measure.

📊 Standard deviation as primary measure

The most important measure of spread is the standard deviation.

  • Standard deviation is the main tool statisticians use to quantify variability.
  • Before calculating standard deviation, you must first compute the variance.
  • The two concepts are directly related: standard deviation is the square root of variance.

🧮 Computing variance and standard deviation

🔢 Understanding deviations

If x_i is a data value for subject i and x̄ is the sample mean, then x_i − x̄ is called the deviation of subject i from the mean, or simply the deviation.

  • A deviation measures how far each individual data point is from the average.
  • Every data value has its own deviation.
  • Example: if the mean is 10.525 and one value is 9, the deviation is 9 - 10.525 = -1.525.
  • Negative deviations mean the value is below the mean; positive deviations mean above the mean.

📦 Variance formula

The variance is in principle the average of the squares of the deviations.

Sample variance calculation steps:

  1. Find the mean (x̄)
  2. Calculate each deviation: (x_i - x̄)
  3. Square each deviation: (x_i - x̄)²
  4. Sum all squared deviations
  5. Divide by (n - 1), where n is the number of data values

Why divide by n-1 instead of n?

  • This produces the "sample variance."
  • The reason stems from statistical inference theory (covered later in the source material).
  • For large datasets, dividing by n or n-1 makes little practical difference.

🔨 Standard deviation formula

The standard deviation is obtained by taking the square root of the variance.

  • Formula: s = √(s²), where s² is the variance.
  • The standard deviation measures spread in the same units as the original data.
  • Example: if data is in years, variance is in "years squared" but standard deviation is in years.

Don't confuse: Variance has squared units (hard to interpret directly); standard deviation returns to original units (easier to interpret).

🎯 Interpreting standard deviation

📏 Measuring distance from the mean

The standard deviation is a number that measures how far data values are from their mean.

How to read standard deviation values:

  • s = 0: no spread at all; all data values are identical.
  • s > 0: data values vary; larger s means more spread out.
  • Very large s: data values are very spread out, possibly with outliers.

Example interpretation from the excerpt:

  • If the mean is 5 and standard deviation is 2:
    • A value of 7 is "one standard deviation larger than the mean" (5 + 1×2 = 7)
    • A value of 1 is "two standard deviations smaller than the mean" (5 - 2×2 = 1)

⚠️ When standard deviation is less useful

The excerpt cautions that standard deviation is not always the best choice:

Distribution type | Standard deviation usefulness | Better alternative
Symmetrical | Very helpful | (none needed)
Skewed | Less helpful | First quartile, median, third quartile, min, max

Why skewed distributions are problematic:

  • The two sides of a skewed distribution have different spreads.
  • A single number (standard deviation) cannot capture this asymmetry well.
  • Quartiles and range provide a more complete picture.

🎨 Graphing for better understanding

  • The excerpt recommends graphing data to get a better "feel" for deviations and standard deviation.
  • Visual inspection helps you understand whether the standard deviation is meaningful for your particular dataset.
  • When first learning, standard deviation "may not be too simple to interpret"—visualization helps build intuition.

🛠️ Practical computation example

👥 Fifth grade class ages

The excerpt walks through a complete example with 20 students' ages:

Data: 9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5

Step-by-step calculation:

  1. Mean: 10.525 years
  2. Deviations: First value: 9 - 10.525 = -1.525; last value: 11.5 - 10.525 = 0.975
  3. Squared deviations: First: (-1.525)² = 2.325625; last: (0.975)² = 0.950625
  4. Sum of squared deviations divided by (n − 1): 9.7375 / 19 = 0.5125 (this is the variance)
  5. Standard deviation: √0.5125 = 0.7158911 years

💻 Using functions

The excerpt mentions that statistical software provides built-in functions:

  • var(x) computes sample variance
  • sd(x) computes standard deviation
  • These functions handle all the steps automatically.
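
For the fifth grade example above, the built-in functions reproduce the hand calculation:

```r
# Ages of the 20 fifth graders from the worked example.
ages <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
          11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5)

mean(ages)   # 10.525
var(ages)    # 0.5125, the sum of squared deviations divided by n - 1 = 19
sd(ages)     # 0.7158911, the square root of the variance
```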

Exercises on Descriptive Statistics

3.5 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises test the ability to match summary statistics with visual representations (histograms and box plots) and to compute key descriptive measures like mean, standard deviation, median, and IQR from frequency data.

📌 Key points (3–5)

  • Exercise 3.1 focus: matching numerical summaries with the correct histogram and box plot, plus identifying outliers.
  • Exercise 3.2 focus: calculating mean, standard deviation, median, IQR, and distance in standard deviations from a frequency table.
  • Common confusion: distinguishing which visual (histogram vs box plot) corresponds to which summary—requires understanding how quartiles, medians, and ranges map to visual features.
  • Why it matters: these exercises reinforce interpretation of summary statistics and their relationship to data distribution shapes.

📊 Exercise 3.1: Matching summaries to visuals

📊 The task

Three data sequences (x1, x2, x3) have been summarized using the summary function in R. Three histograms (Figure 3.3) and three box plots (Figure 3.4) are provided in random orders.

Students must:

  1. Match each summary result with the correct histogram and box plot
  2. Determine if 0.000 in sequence x1 is an outlier
  3. Determine if 6.414 in sequence x3 is an outlier

🔍 What the summaries show

Sequence | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max
x1 | 0.000 | 2.498 | 3.218 | 3.081 | 3.840 | 4.871
x2 | 0.0001083 | 0.5772000 | 1.5070000 | 1.8420000 | 2.9050000 | 4.9880000
x3 | 2.200 | 3.391 | 4.020 | 4.077 | 4.690 | 6.414

🎯 Matching strategy

  • Compare the range (min to max) shown in each summary to the x-axis range in histograms and box plots
  • Check whether the median and quartiles align with the center and spread shown in box plots
  • Look for skewness: if mean differs noticeably from median, the histogram should show asymmetry

⚠️ Outlier identification

The excerpt earlier explained that outliers are "observations with values outside the normal range." To determine if a value is an outlier, students should apply the IQR rule or examine whether the value falls far from the quartile structure shown in the box plot.

🧮 Exercise 3.2: Computing from frequency tables

🧮 The data

A frequency table shows the number of toilet facilities counted in 30 buildings:

Value (x) | 2 | 4 | 6 | 8 | 10
Frequency | 10 | 6 | 10 | 2 | 2

📐 Required calculations

1. Mean (x̄)

  • Use the formula: mean equals sum of all values divided by number of values
  • With frequency data: sum each value times its frequency, then divide by total count (30)

2. Sample standard deviation

  • First compute deviations from the mean: (x - x̄)
  • Square each deviation, multiply by frequency, sum them
  • Divide by (n - 1), then take the square root

3. Median

  • The middle value when data is ordered
  • With 30 observations, the median is between the 15th and 16th values

4. Interquartile range (IQR)

IQR = Q3 - Q1 (the distance between the third quartile and the first quartile)

5. Standard deviations from the mean

  • Calculate how many standard deviations the value 10 is from the mean
  • Use the relationship: (value - mean) / standard deviation

💡 Working with frequency tables

  • Don't confuse: the frequency table shows 5 distinct values, but represents 30 total observations
  • Example: the value 2 appears 10 times, so it contributes 2×10 = 20 to the sum
  • When finding the median, mentally "expand" the frequency table to see which position the middle observation falls at
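
One convenient way to carry out these calculations is to expand the frequency table into the full data vector with rep() and then apply the standard functions; the snippet below is a sketch of that approach (IQR() uses R's default quantile rule):

```r
# Expand the frequency table into the 30 individual observations.
x <- rep(c(2, 4, 6, 8, 10), times = c(10, 6, 10, 2, 2))

length(x)                 # 30 observations in total
mean(x)                   # sample mean
sd(x)                     # sample standard deviation
median(x)                 # middle of the ordered data
IQR(x)                    # Q3 - Q1, using R's default quantile rule
(10 - mean(x)) / sd(x)    # how many standard deviations the value 10 is above the mean
```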

📚 Supporting definitions from the glossary

The excerpt includes key definitions students need for these exercises:

Mean: A number that measures the central tendency. By definition, the mean for a sample is the sum of all values in the sample divided by the number of values in the sample.

Median: A number that separates ordered data into halves: half the values are the same number or smaller than the median and half the values are the same number or larger.

Interquartile Range (IQR): The distance between the third quartile (Q3) and the first quartile (Q1). IQR = Q3 - Q1.

Sample Standard Deviation: A number that is equal to the square root of the variance and measures how far data values are from their mean.

Outlier: An observation that does not fit the rest of the data.


Comparing Two Samples: Summary

3.6 Summary

🧭 Overview

🧠 One-sentence thesis

This chapter provides formulas and methods for comparing two groups on either their means (expectations) or their variances, using t-tests and F-tests respectively.

📌 Key points (3–5)

  • Two types of comparisons: testing equality of expectations (means) between two groups, and testing equality of variances between two groups.
  • Key formulas: t-statistic for comparing means; F-statistic (ratio of sample variances) for comparing variances.
  • Confidence intervals: constructed for the difference in means and for the ratio of variances.
  • Common confusion: the t-test compares differences in means, while the F-test compares the ratio of variances—different structures for different parameters.
  • Design vs analysis: the glossary and forum discussion emphasize that good study design (quality and quantity of data) is as important as the statistical test itself.

📐 Comparing expectations (means)

📐 Test statistic for equality of expectations

The test statistic for equality of expectations is: t = (x̄ₐ − x̄ᵦ) / √(s²ₐ/nₐ + s²ᵦ/nᵦ).

  • What it measures: how many standard errors the difference in sample means is from zero.
  • Structure: the numerator is the difference between the two sample means; the denominator is the estimated standard error of that difference, combining both sample variances and sample sizes.
  • Example: if Group A has mean 124.3 and Group B has mean 80.5, the numerator is their difference; the denominator accounts for both sample variances and sample sizes.

🎯 Confidence interval for the difference in means

  • Formula: (x̄ₐ − x̄ᵦ) ± qnorm(0.975) √(s²ₐ/nₐ + s²ᵦ/nᵦ).
  • Interpretation: a range estimate for the true difference in population means.
  • The multiplier qnorm(0.975) corresponds to the 97.5th percentile of the standard Normal distribution (for a 95% confidence level).

📊 Comparing variances

📊 Test statistic for equality of variances

The test statistic for equality of variances is: f = s²ₐ / s²ᵦ.

  • What it measures: the ratio of the two sample variances.
  • Why a ratio: under the null hypothesis that the population variances are equal, this ratio should be close to 1.
  • Example: if Group A has variance 13.4² and Group B has variance 16.7², the F-statistic is (13.4²)/(16.7²).

🔍 Confidence interval for the ratio of variances

  • Formula: [(s²ₐ/s²ᵦ) / qf(0.975, dfₐ, dfᵦ), (s²ₐ/s²ᵦ) / qf(0.025, dfₐ, dfᵦ)].
  • Structure: the point estimate (ratio of sample variances) is divided by F-distribution quantiles to form the interval.
  • The degrees of freedom dfₐ and dfᵦ depend on the sample sizes (typically nₐ − 1 and nᵦ − 1).
  • Don't confuse: the interval is for the ratio of variances, not the difference; it is asymmetric around the point estimate.

🧪 Robustness and assumptions

🧪 Normality assumption for the F-test

  • The excerpt notes that the F-test for equality of variances assumes measurements are Normally distributed.
  • Robustness question: Exercise 13.2 asks whether the test maintains its nominal 5% significance level when data come from non-Normal distributions (e.g., Exponential).
  • Example: comparing the test's actual Type I error rate under Normal(4, 4²) versus Exponential(1/4) distributions.

🗂️ Key terminology

🗂️ Response and explanatory variables

Term | Definition (from glossary)
Response | The variable whose distribution one seeks to investigate.
Explanatory Variable | A variable that may affect the distribution of the response.

  • In the magnet-pain example, the pain score is the response; the factor "active" (active magnet vs placebo) is the explanatory variable with two levels.

🎲 Design considerations

  • The forum discussion emphasizes that design (how data are collected) is as important as analysis.
  • Quantity vs quality: large sample size is valuable, but high-quality data (e.g., face-to-face interviews vs telephone surveys) may yield more trustworthy conclusions.
  • Example: a telephone survey may reach many people quickly (quantity), but face-to-face interviews may produce more accurate responses (quality).

📋 Summary of formulas

Purpose | Formula | Notes
Test for equal means | t = (x̄ₐ − x̄ᵦ) / √(s²ₐ/nₐ + s²ᵦ/nᵦ) | Difference in means divided by its estimated standard error
CI for difference in means | (x̄ₐ − x̄ᵦ) ± qnorm(0.975) √(s²ₐ/nₐ + s²ᵦ/nᵦ) | 95% confidence interval
Test for equal variances | f = s²ₐ / s²ᵦ | Ratio of sample variances
CI for ratio of variances | [(s²ₐ/s²ᵦ)/qf(0.975, dfₐ, dfᵦ), (s²ₐ/s²ᵦ)/qf(0.025, dfₐ, dfᵦ)] | Asymmetric interval using F-quantiles

  • All tests in the exercises are conducted at the 5% significance level.
  • The chapter provides both hypothesis-testing and interval-estimation approaches for the same parameters.
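
The formulas in the table can be evaluated directly in R; the summary statistics below reuse the illustrative means and standard deviations mentioned earlier in this section, while the sample sizes are invented for the sketch:

```r
# Illustrative summary statistics (sample sizes are made up for this sketch).
x.bar.a <- 124.3; s.a <- 13.4; n.a <- 29
x.bar.b <- 80.5;  s.b <- 16.7; n.b <- 31

se <- sqrt(s.a^2 / n.a + s.b^2 / n.b)                 # standard error of the difference
t.stat <- (x.bar.a - x.bar.b) / se                    # test statistic for equal expectations
ci.means <- (x.bar.a - x.bar.b) + c(-1, 1) * qnorm(0.975) * se   # 95% CI for the difference

f.stat <- s.a^2 / s.b^2                               # test statistic for equal variances
ci.ratio <- c(f.stat / qf(0.975, n.a - 1, n.b - 1),   # 95% CI for the ratio of variances
              f.stat / qf(0.025, n.a - 1, n.b - 1))
```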

Student Learning Objectives

4.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

This chapter transitions from comparing two groups (where the explanatory variable is a two-level factor) to examining relationships where both response and explanatory variables are numeric, using linear regression.

📌 Key points (3–5)

  • What the previous chapter covered: numeric response with a two-level factor as the explanatory variable (comparing two groups).
  • What this chapter will cover: situations where both the response and the explanatory variable are numeric.
  • Key shift: moving from group comparison to modeling continuous relationships.
  • Common confusion: the previous chapter used factors (categorical) to explain numeric responses; this chapter uses numeric variables to explain numeric responses—the type of explanatory variable changes the analysis method.

🔄 Transition from Chapter 13

🔄 What Chapter 13 examined

  • The previous chapter focused on a specific scenario:
    • Response variable: numeric (e.g., pain score, measurement values).
    • Explanatory variable: a factor with exactly two levels (e.g., treatment vs. control, active vs. placebo).
  • This setup is about comparing two groups or two samples.
  • Example: comparing pain scores between patients who received an active magnet (level 1) versus those who received an inactive placebo (level 2).

🆕 What Chapter 14 will examine

  • This chapter deals with a different scenario:
    • Response variable: still numeric.
    • Explanatory variable: also numeric (not a categorical factor).
  • The focus shifts to modeling the relationship between two continuous variables.
  • The method introduced is linear regression.

🔀 Understanding the shift

🔀 Type of explanatory variable matters

Chapter | Response type | Explanatory type | Method
13 | Numeric | Factor (2 levels) | Two-sample tests (t-test, F-test)
14 | Numeric | Numeric | Linear regression

  • Don't confuse: both chapters deal with numeric responses, but the nature of the explanatory variable determines the statistical approach.
  • When the explanatory variable is categorical (a factor), you compare groups.
  • When the explanatory variable is numeric, you model how the response changes as the explanatory variable changes continuously.

📐 What linear regression does

  • Linear regression examines how a numeric response variable changes in relation to a numeric explanatory variable.
  • It models the relationship as a straight line (or linear pattern).
  • Example: instead of comparing two fixed groups, you might examine how pain score changes continuously with dosage level, or how sales change with advertising spend.

4.2 Different Forms of Variability

4.2 Different Forms of Variability

🧭 Overview

🧠 One-sentence thesis

Variability comes in three distinct forms—data variability (observed sample), population variability (all members), and random variable variability (uncertainty before measurement)—and understanding these abstract types is essential for statistical practice even though we cannot directly observe population or random variable values.

📌 Key points (3–5)

  • Three types of variability: data variability (sample we see), population variability (entire population), and random variable variability (uncertainty/randomness in measurements).
  • Population vs. data variability: population variability covers all members of the population, not just the sample; data variability is only the observed sample.
  • Random variable variability models uncertainty: it represents the uncertainty about what value will be obtained before a measurement is made (e.g., selecting one subject at random).
  • Common confusion: data variability is concrete and observable; population and random variable variability are theoretical—we discuss them even though we don't observe the full population or the pre-measurement uncertainty.
  • Same tools apply: all three types can be displayed with the same graphical tools and numerical summaries (with possible modifications).

🔍 The three forms of variability

📊 Data variability (what we observe)

  • This is the variability in the concrete list of data values that is presented to us.
  • Example: the height measurements in the sample of 100 subjects from the "ex.1" dataset.
  • We actually get to see these values; they are observable.
  • This type was examined in Chapters 2 and 3.

🌐 Population variability (the entire group)

Population variability: the variability of the quantity of interest across all members of the population, not only those selected into the sample.

  • Nature: similar to data variability, but covers everyone, not just the sample.
  • Key difference: it corresponds to the entire population (e.g., 100,000 people), including the 100 subjects in the sample.
  • Example: when we examine height values across the entire population of 100,000, different people have different heights—that spread is population variability.
  • Important: we do not actually observe the full population data in real statistical practice; the example in this chapter is artificially constructed for illustration only.
  • Don't confuse: we can discuss and theorize about population variability even though we don't see the list of measurements for the entire population.

🎲 Random variable variability (uncertainty before measurement)

Random variable variability: a mathematical concept that models the notion of randomness in measurements or the uncertainty regarding the outcome of a measurement.

  • Core idea: before selecting and measuring, we are uncertain what value will be obtained.
  • Example: imagine a population of 100,000; you are about to select one subject at random and measure their height. Before the selection, you don't know which height value you will get—different subjects yield different measurements, and you don't know beforehand which subject will be selected. This uncertainty is random variable variability.
  • Broader scope: random variables can be defined for more abstract settings, not necessarily tied to a specific population; they provide models for randomness and uncertainty in measurements.
  • The same definitions used for the "sampling one subject" example apply to more abstract constructions.

🧩 Theoretical vs. observable variability

🧩 What we see vs. what we theorize

Type of variability | Observable? | What it represents
Data variability | Yes | Concrete list of sample values we have
Population variability | No | Variability across the entire population (we don't observe all of it)
Random variable variability | No | Uncertainty/randomness before measurement

  • Data variability relates to quantities we actually observe.
  • Population and random variable variability are not associated with quantities we actually get to observe.
  • Example: we see the sample data, but not the data for the rest of the population.
  • The discussion of population and random variable variability is theoretical in nature, yet this theoretical discussion is instrumental for understanding statistics.

🔧 Same tools, different contexts

  • All three types of variability can be:
    • Displayed using graphical tools (plots).
    • Characterized with numerical summaries (mean, median, quartiles, etc.).
  • Essentially the same type of plots and numerical summaries, possibly with some modifications, may and will be applied to all three forms.
  • The excerpt emphasizes that these are "essential foundations for the practice of statistics."

🎯 Why distinguish these forms

🎯 Foundation for statistical practice

  • The excerpt states that understanding these different forms of variability is part of the "essential foundations for the practice of statistics."
  • Subsequent chapters (up to the end of this part of the book) build on these concepts.
  • Even though population and random variable variability are abstract and not directly observable, they are necessary for understanding how statistics works.

🎯 Context of the example

  • The chapter uses an artificially constructed population example (100,000 members, including the 100-subject sample from "ex.1") for illustration.
  • In actual statistical practice, one does not obtain measurements from the entire population—only from the subjects in the sample.
  • The theoretical discussion helps bridge the gap between what we observe (sample) and what we want to understand (population and uncertainty).

A Population

4.3 A Population

🧭 Overview

🧠 One-sentence thesis

The variability of an entire population can be characterized using numerical summaries like the population mean (μ), which exists as a fixed parameter even when we cannot observe all population values, and understanding this theoretical variability is essential for statistical inference.

📌 Key points (3–5)

  • Three types of variability: variability of the data (observed sample), variability of the population (theoretical, usually unobserved), and other abstract variability (introduced later).
  • Data vs population: we observe the sample data directly, but population variability is theoretical—we discuss it without seeing the entire population's measurements.
  • Population mean (μ): computed by summing all values in the population and dividing by population size N; it is a parameter (a fixed characteristic of the population).
  • Common confusion: sample mean (x̄) vs population mean (μ)—the sample mean is calculated from observed data (size n), while the population mean is calculated from all N values in the population (usually unknown in practice).
  • Why it matters: even though we cannot compute μ in real life, it exists as an unknown quantity that statistics aims to estimate from sample data.

📊 Types of variability

📊 Variability of the data

  • What it is: variability in a concrete list of data values that we actually observe.
  • The excerpt uses the example of "ex1.csv" with 100 subjects' heights and sex.
  • This is the variability we dealt with in earlier chapters—applied to the sample we have in hand.

🌐 Variability of the population

  • What it is: variability across the entire target population, which we typically do not observe.
  • The discussion is theoretical in nature—we talk about the population's distribution even though we don't have measurements from every member.
  • Example: the excerpt constructs an artificial population file "pop1.csv" with 100,000 subjects for illustration; in real statistics, we only observe the sample, not the full population.
  • Don't confuse: this is not about the sample we see; it's about the larger group from which the sample was drawn.

🔮 Abstract variability

  • The excerpt mentions a third type: more abstract random variables used to model randomness and uncertainty in measurements.
  • These need not be tied to a specific population.
  • The same graphical tools and numerical summaries apply to all three types of variability.

🧮 Numerical summaries: sample vs population

🧮 Sample mean (x̄)

Sample mean: the arithmetic average of the data, computed by summing all observed values and dividing by the number of observations (n).

  • Formula (in words): x̄ = (Sum of all values in the data) / (Number of values in the data).
  • Example: in "ex1.csv" with n = 100 subjects, the mean height is 170.1 cm.
  • This is a quantity we can compute because we have the data.

🎯 Population mean (μ)

Population mean (μ, pronounced "mew"): the arithmetic average of all values in the entire population, computed by summing all population values and dividing by the population size (N).

  • Formula (in words): μ = (Sum of all values in the population) / (Number of values in the population).
  • Example: in "pop1.csv" with N = 100,000 subjects, the mean height is 170 cm.
  • Key point: in actual practice, we will not have all population values, so we cannot compute μ directly—but it still exists as a fixed number.
  • Don't confuse: μ is not the same as x̄; μ is the true population average (unknown in real life), while x̄ is the observed sample average (known from data).

📐 Other summaries

Both sample and population can be summarized using:

  • Minimum and maximum: smallest and largest values.
  • Median: the middle value when data are ordered.
  • Quartiles: first quartile (25th percentile), median (50th percentile), third quartile (75th percentile).
  • Example: in the population, height ranges from 117 to 217 cm, with the central 50% between 162 and 178 cm.
SummarySample ("ex1.csv", n=100)Population ("pop1.csv", N=100,000)
Meanx̄ = 170.1 cmμ = 170 cm
Median171.0 cm170 cm
Min117.0 cm117 cm
Max208.0 cm217 cm
1st Qu.158.0 cm162 cm
3rd Qu.180.2 cm178 cm

🏛️ Parameters and their role

🏛️ What is a parameter?

Parameter: a characteristic of the distribution of an entire population.

  • The population mean μ is a parameter.
  • Parameters are fixed numbers that describe the population, even though we usually don't know their values.
  • Don't confuse: a parameter (population characteristic) vs a statistic (sample characteristic)—the excerpt hints that statistics aims to estimate unknown parameters from observed sample data.

🔍 Why discuss theoretical variability?

  • Even though we don't observe the entire population, we can still discuss its variability in a theoretical way.
  • This theoretical discussion is instrumental for understanding statistics—it provides the foundation for inference.
  • Example: knowing that μ exists (even if unknown) allows us to frame the problem of estimating it from the sample mean x̄.

📈 Visualizing population distribution

📈 Bar plot of population height

  • The excerpt describes a bar plot (Figure 4.1) showing the frequency of each height value in the population.
  • Each vertical bar represents how many subjects have that height.
  • The distribution is centered at 170 cm, with integer values ranging from 117 to 217 cm.
  • The same graphical tools used for sample data (bar plots, histograms, etc.) can be applied to population data, with possible modifications.

🔄 Similarity in computation

  • The formulas for sample mean and population mean are structurally identical—both are arithmetic averages.
  • The only difference: sample mean uses the n values we observe; population mean uses all N values in the population.
  • This similarity helps us understand that the sample mean is a natural estimate of the population mean.
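
A short sketch of this comparison, assuming both CSV files are available in the working directory and that the population file stores heights in a column named height:

```r
# Sample and (artificially constructed) population files from the chapter.
ex.1 <- read.csv("ex1.csv")
pop.1 <- read.csv("pop1.csv")

mean(ex.1$height)      # x-bar, the sample mean of the n = 100 observed heights
mean(pop.1$height)     # mu, the population mean of all N = 100,000 heights
summary(pop.1$height)  # min, quartiles, median, mean, max for the whole population
```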

Random Variables

4.4 Random Variables

🧭 Overview

🧠 One-sentence thesis

A random variable represents the future outcome of a measurement before it is taken, characterized by its sample space (all possible values) and the probability of each value, and once measured it becomes data rather than a random variable.

📌 Key points (3–5)

  • What a random variable is: the outcome of a measurement before it happens, with a distribution over potential values rather than a single known value.
  • Sample space and probability: the sample space is the collection of all possible values; probability is the likelihood of each value (analogous to relative frequency in a population).
  • Common confusion: random variable vs. data—a random variable exists before measurement; after measurement reveals a specific value, it becomes data and is no longer random.
  • How probability is computed: probability of a value equals its relative frequency in the population (frequency divided by population size).
  • Notation conventions: random variables are denoted by capital letters (X, Y, Z); their realized values by lowercase (x, y, z); probability by P.

🎲 What is a random variable

🎲 Definition and nature

A random variable refers to numerical values, typically the outcome of an observation, a measurement, or a function thereof, characterized by the collection of potential values it may obtain (the sample space) and the likelihood of obtaining each value (the probability).

  • It does not have a specific value yet—only a collection of potential values with a distribution.
  • You cannot predict the exact outcome in advance, but you can describe patterns (e.g., the range of possible values, the probability of each).
  • Example: Sampling one person's height from a population—before you measure, the height is a random variable; after you measure and get 162 cm, it becomes data.

🔄 The transition from random variable to data

  • Before measurement: the outcome is uncertain → random variable.
  • After measurement: a specific value is revealed → data.
  • Don't confuse: the same quantity changes status depending on whether you have observed it yet.

📦 Sample space and distribution

📦 Sample space

  • The sample space is the set of all values the random variable can take.
  • Example: In the imaginary city population, heights range from 117 to 217 cm (94 distinct integer values), so the sample space contains these 94 values.

📊 Probability function

  • The probability of a value is its relative frequency in the population: frequency of that value divided by total population size N.
  • Example: 3,476 subjects have height 168 cm out of 100,000 total → P(X = 168) = 3,476 / 100,000 = 0.03476.
  • Example: 488 subjects have height 192 cm → P(X = 192) = 0.00488.
  • Example: 393 subjects have height ≥ 200 cm → P(X ≥ 200) = 0.00393.

📐 Distribution

  • The probability function defined over all values in the sample space gives the distribution of the random variable.
  • The distribution can come from:
    • Relative frequency in a population (as in the height example).
    • Theoretical modeling (mentioned for later chapters).

🔢 Notation and conventions

🔤 Symbols

| Symbol | Meaning |
|---|---|
| X, Y, Z | Random variables (capital Latin letters) |
| x, y, z | Specific values the random variables may obtain (lowercase) |
| P | Probability |

  • Example: P(X = 168) means "the probability that random variable X takes the value 168."
  • Example: P(X ≥ 200) means "the probability that X is 200 or greater."
  • Example: P(|X − 170| ≤ 10) means "the probability that X differs from 170 by at most 10," equivalent to P(160 ≤ X ≤ 180).

🧮 Computing probabilities

🧮 Probability as relative frequency

  • To find the probability that a random variable falls in a certain range, compute the relative frequency of values in that range within the population.
  • Example: To find P(160 ≤ X ≤ 180) for height X, count how many of the 100,000 subjects have height between 160 and 180, then divide by 100,000.
  • The excerpt shows this yields 0.64541, meaning about 64.5% of the population falls in that range.

🛠️ Computational method (using R example)

  • The excerpt illustrates using a logical expression to filter values:
    • Compute the absolute difference from a target (e.g., |X − 170|).
    • Test whether it is within a threshold (e.g., ≤ 10).
    • Count the proportion of TRUE results.
  • Example with a small sequence Y = (6.3, 6.9, 6.6, 3.4, 5.5, 4.3, 6.5, 4.7, 6.1, 5.3):
    • Compute |Y − 5| to get distances from 5.
    • Test |Y − 5| ≤ 1 to see which values are within 1 unit of 5.
    • Result: 4 out of 10 values satisfy the condition → proportion = 0.4.
  • The same logic applies to the full population: the mean of the logical sequence (TRUE = 1, FALSE = 0) gives the relative frequency, which is the probability.
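
A minimal R sketch of this computation, using the short sequence above (the same mean of a logical expression would give the probability when applied to all 100,000 population heights):

```r
Y <- c(6.3, 6.9, 6.6, 3.4, 5.5, 4.3, 6.5, 4.7, 6.1, 5.3)

abs(Y - 5)             # distances from 5
abs(Y - 5) <= 1        # TRUE for the values within 1 unit of 5
mean(abs(Y - 5) <= 1)  # proportion of TRUEs: 0.4
```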

➕ Sum of probabilities

  • Probabilities over the entire sample space sum to 1.
  • Example: If X can be 0, 1, 2, or 3, and P(X = 1) = 0.25, P(X = 2) = 0.15, P(X = 3) = 0.10, then:
    • P(X > 0) = 0.25 + 0.15 + 0.10 = 0.50.
    • Therefore P(X = 0) = 1 − 0.50 = 0.50.
  • Don't confuse: this is the same principle as relative frequencies summing to 1.

🔗 Connection to population parameters

🔗 Population vs. sample

  • The excerpt contrasts population (entire target group, size N) with sample (subset, size n).
  • Population characteristics are called parameters (e.g., population mean μ, population variance σ², population standard deviation σ).
  • Sample characteristics are computed from observed data (e.g., sample mean x̄, sample variance s²).

📏 Population mean and variance

Population mean μ: sum of all values in the population divided by population size N.

Population variance σ²: the average of squared deviations from the population mean; computed as the sum of (xᵢ − μ)² divided by N.

Population standard deviation σ: the square root of the population variance.

  • Example: For the imaginary city height population, σ² = 126.1576 and σ = 11.23199.
  • In practice, you usually do not know the true population parameters; you estimate them from sample data.
  • Don't confuse: the sample variance divides by (n − 1), while the population variance divides by N; for large samples, dividing by n or by n − 1 makes a negligible difference.

🎯 Why parameters matter for random variables

  • When you sample randomly from a population, the random variable inherits the population's distribution.
  • The population's relative frequencies become the probabilities for the random variable.
  • Example: The sequence of all 100,000 heights represents the distribution of the random variable X (height of a randomly sampled person).
24

Probability and Statistics: Random Variables and Expectation

4.5 Probability and Statistics

🧭 Overview

🧠 One-sentence thesis

Random variables are characterized by their probability distributions, and the expectation (weighted average of values by their probabilities) marks the center of that distribution just as the mean marks the center of data or a population.

📌 Key points (3–5)

  • Probability function: assigns probabilities to all possible values a random variable can take; probabilities sum to 1.
  • Computing probabilities for events: sum the probabilities of all values that belong to the event (subset of the sample space).
  • Expectation as center: the expectation E(X) is computed as the sum of each value multiplied by its probability, analogous to a weighted average.
  • Common confusion: expectation vs. data mean—expectation uses probabilities (theoretical), data mean uses relative frequencies (observed), but the formulas are structurally identical.
  • Cumulative probability: like cumulative relative frequency, it sums probabilities for all values less than or equal to a given value.

🎲 Random Variables and Probability Functions

🎲 What a random variable is

A random variable is a quantity that may obtain different values, each with an associated probability.

  • The sample space is the collection of all possible values the random variable can take.
  • The probability function assigns a probability P(X = x) to each value x in the sample space.
  • Example: A random variable X may take values {0, 1, 2, 3} with specified probabilities for each.

📊 Properties of probability functions

  • All probabilities are non-negative.
  • The sum of probabilities over the entire sample space equals 1.
  • This mirrors the properties of relative frequency from data distributions.

Don't confuse: Probability function (theoretical model) vs. relative frequency (empirical observation)—both describe distributions and sum to 1, but one is based on theory/modeling and the other on actual data counts.

📈 Cumulative probability

  • Order the values from smallest to largest.
  • The cumulative probability at a given value is the sum of probabilities for all values less than or equal to that value.
  • Analogous to cumulative relative frequency in data analysis.

🎯 Computing Probabilities for Events

🎯 What an event is

An event is a subset of values from the sample space.

  • Example from the excerpt: the event {0.5 ≤ X ≤ 2.3} or {X = odd}.
  • Events represent conditions or ranges we want to calculate probabilities for.

➕ How to compute event probabilities

  • Sum the probabilities of all values that belong to the event.
  • Example: If X can be {0, 1, 2, 3} with P(X=1)=0.25, P(X=2)=0.15, P(X=3)=0.10, and P(X=0)=0.50, then:
    • P(0.5 ≤ X ≤ 2.3) = P(X=1) + P(X=2) = 0.25 + 0.15 = 0.40
    • P(X = odd) = P(X=1) + P(X=3) = 0.25 + 0.10 = 0.35

🔍 Finding missing probabilities

  • Since probabilities sum to 1, you can find an unknown probability by subtraction.
  • Example: If P(X=1) + P(X=2) + P(X=3) = 0.50, then P(X=0) = 1 − 0.50 = 0.50.

📐 Expectation: The Center of a Distribution

📐 Definition of expectation

Expectation E(X) = sum over all values of (value × probability of that value).

In formula terms (rewritten in words):

  • E(X) equals the sum, over all unique values x in the sample space, of (x times P(x)).

🔗 Connection to data mean

The excerpt shows two ways to compute a data mean:

  1. Sum all data points and divide by sample size.
  2. Weighted sum: each unique value times its relative frequency.

| Concept | Formula structure | Weight used |
|---|---|---|
| Data mean | Sum of (value × relative frequency) | Relative frequency f/n |
| Expectation | Sum of (value × probability) | Probability P(x) |

  • The only difference is that expectation uses probabilities (theoretical) while the data mean uses relative frequencies (observed).
  • Otherwise, the definitions are identical in structure.

🧮 Example calculation

For the random variable X with distribution:

| Value | Probability |
|---|---|
| 0 | 0.50 |
| 1 | 0.25 |
| 2 | 0.15 |
| 3 | 0.10 |

E(X) = 0×0.50 + 1×0.25 + 2×0.15 + 3×0.10 = 0 + 0.25 + 0.30 + 0.30 = 0.85
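
The same calculation as a short R sketch (the vectors are simply the table above):

```r
x <- c(0, 1, 2, 3)
p <- c(0.50, 0.25, 0.15, 0.10)
sum(x * p)  # expectation E(X) = 0.85
```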

🎯 Expectation vs. population mean

  • The excerpt mentions that for the height example, the expectation equals μ (the population mean).
  • This is not an accident: the expectation of a potential measurement from a population equals the population mean.
  • The expectation represents the theoretical center, just as the population mean represents the actual center of all population values.

Don't confuse: Expectation is for random variables (before observation); mean is for data (after observation) or populations (all actual values). They use the same mathematical structure but apply to different contexts.

25

Probability Exercises and Random Variable Calculations

4.6 Exercises

🧭 Overview

🧠 One-sentence thesis

This section provides exercises that apply the concepts of probability distributions, expectation, and variance to concrete problems involving random variables and games of chance.

📌 Key points (3–5)

  • What the exercises cover: computing probabilities from distributions, calculating expectations and variances, and applying probability to real scenarios (games of chance).
  • Key skill: finding the unknown probability parameter that makes the distribution valid (probabilities must sum to 1).
  • Event probability calculation: summing probabilities of all values that belong to the event (e.g., odd values, intervals).
  • Common confusion: expectation can be negative when modeling gains/losses—it represents the average outcome, not necessarily a positive result.
  • Why it matters: these exercises reinforce the mechanics of working with discrete random variables and interpreting probabilistic results in decision contexts.

📊 Working with probability distributions

📊 Finding the missing probability parameter

The excerpt presents Table 4.4 with a random variable Y whose probabilities depend on an unknown parameter p.

  • The first question asks for the value of p.
  • Since all probabilities in a distribution must sum to 1, you solve for p by adding all given probabilities and setting the sum equal to 1.
  • Example: if the table shows probabilities 1p, 2p, 3p, 4p, 5p, 6p for values 0 through 5, then 1p + 2p + 3p + 4p + 5p + 6p = 21p = 1, so p = 1/21 (see the sketch below).
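
A sketch of that bookkeeping in R, under the hypothetical weights used in the example above (the actual Table 4.4 may use different weights):

```r
w <- 1:6         # hypothetical weights: probabilities 1p, 2p, ..., 6p for values 0..5
p <- 1 / sum(w)  # probabilities must sum to 1, so p = 1/21
p * w            # probability of each value
sum(p * w)       # sanity check: the probabilities sum to 1
```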

🎯 Computing event probabilities

Questions 2–5 ask for probabilities of various events:

| Event notation | What it means | How to compute |
|---|---|---|
| P(Y < 3) | Probability Y is less than 3 | Sum probabilities for all values less than 3 |
| P(Y = odd) | Probability Y is an odd number | Sum probabilities for values 1, 3, 5 |
| P(1 ≤ Y < 4) | Probability Y is in [1, 4) | Sum probabilities for values 1, 2, 3 |
| P(\|Y − 3\| < 1.5) | Probability Y is within 1.5 of 3 | Sum probabilities for values whose distance from 3 is less than 1.5 |

  • The excerpt emphasizes that events are subsets of the sample space.
  • The probability of an event is computed by summing the probabilities of all values that belong to that subset.

🧮 Expectation and variance calculations

🧮 Computing expectation

Question 6 asks for E(Y).

Expectation: the sum of each value multiplied by its probability, E(X) = sum over all x of (x × P(x)).

  • Multiply each value in the sample space by its probability.
  • Sum all these products.
  • Example from the excerpt: E(X) = 0×0.5 + 1×0.25 + 2×0.15 + 3×0.10 = 0.85.

📏 Computing variance and standard deviation

Questions 7 and 8 ask for V(Y) and the standard deviation.

Variance: the sum of squared deviations from the expectation, weighted by probabilities, V(X) = sum over all x of ((x - E(X))² × P(x)).

Steps:

  1. Compute the expectation E(Y) first.
  2. For each value x, compute the deviation (x - E(Y)).
  3. Square each deviation.
  4. Multiply each squared deviation by the probability of that value.
  5. Sum all these products to get the variance.
  6. The standard deviation is the square root of the variance.
  • Example from the excerpt: V(X) = (0-0.85)²×0.5 + (1-0.85)²×0.25 + (2-0.85)²×0.15 + (3-0.85)²×0.10 = 1.0275.
  • Standard deviation = √1.0275 ≈ 1.014.
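
In R, the full chain from expectation to standard deviation for the example distribution:

```r
x <- c(0, 1, 2, 3)
p <- c(0.50, 0.25, 0.15, 0.10)
E <- sum(x * p)          # expectation: 0.85
V <- sum((x - E)^2 * p)  # variance: 1.0275
sqrt(V)                  # standard deviation: about 1.014
```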

🎲 Game of chance application

🎲 Modeling a coin-toss game

Exercise 4.2 describes a game:

  • Cost to play: $2
  • Rule: toss a coin three times; if all three are Heads, win $10; otherwise, lose the $2 investment.

🎯 Probability of winning and losing

Question 1 asks for the probability of winning.

  • Winning requires all three tosses to be Heads.
  • The excerpt does not give the coin's probability, but the standard assumption is a fair coin (probability 1/2 of Heads on each toss).
  • Probability of three Heads in a row = (1/2) × (1/2) × (1/2) = 1/8.

Question 2 asks for the probability of losing.

  • Losing is the complement of winning.
  • P(lose) = 1 - P(win) = 1 - 1/8 = 7/8.

💰 Expected gain

Question 3 asks for the expected gain.

  • The excerpt notes that "the expectation can obtain a negative value."
  • This is important: expectation represents the average outcome, not a guaranteed positive result.
  • Define the random variable as net gain (winnings minus cost).
  • If you win: gain = $10 - $2 = $8, with probability 1/8.
  • If you lose: gain = $0 - $2 = -$2, with probability 7/8.
  • Expected gain = 8 × (1/8) + (-2) × (7/8) = 1 - 14/8 = 1 - 1.75 = -$0.75.
  • Don't confuse: a negative expectation means on average you lose money per game, even though you might win in any single play.
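
A quick R check of the expected-gain arithmetic (assuming a fair coin, as above):

```r
gain <- c(8, -2)     # net gain when winning / losing
prob <- c(1/8, 7/8)  # probabilities of winning / losing with a fair coin
sum(gain * prob)     # expected gain: -0.75 dollars per game
```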

📚 Context and purpose

📚 Why these exercises matter

The excerpt places these exercises after introducing:

  • The definition of random variables and sample spaces.

  • How to compute probabilities of events by summing probabilities of constituent values.

  • The formulas for expectation and variance.

These exercises reinforce the mechanical skills needed to work with discrete random variables, and they also illustrate how probability models apply to real decisions (e.g., whether a game is favorable).

🔍 Connection to earlier material

The excerpt references earlier sections:

  • Table 4.1 and the example random variable X with values 0, 1, 2, 3 and probabilities 0.50, 0.25, 0.15, 0.10.

  • The formulas for expectation and variance, which parallel the formulas for sample mean and sample variance but use probabilities instead of relative frequencies.

  • The concept of events as subsets of the sample space, with probability computed by summation.

  • The exercises in 4.6 apply these same principles to new random variables (Y and the game outcome).

26

Summary of Probability Foundations

4.7 Summary

🧭 Overview

🧠 One-sentence thesis

Probability provides the mathematical foundation for statistics by modeling measurement variability through random variables, enabling inference from uncertain data.

📌 Key points (3–5)

  • Historical development: probability evolved from gambling mathematics to a tool for modeling measurement variability in science.
  • Key discovery: averages of multiple independent measurements are less variable and more reproducible than single measurements.
  • Random variable as the core concept: the probabilistic model for a measurement's value before it is taken, linking probability theory to statistical inference.
  • Common confusion: the sample space includes not just the outcome that occurred, but all outcomes that could have occurred—this context is essential for interpretation.
  • Foundation for statistics: probability theory (Part I) underpins statistical inference methods (Part II).

📜 Historical context and motivation

🎲 Origins in gambling

  • Probability was first introduced as a branch of mathematics to investigate uncertainty in gambling and games of chance.
  • This mathematical framework was later adapted to scientific measurement.

🔬 Application to measurement variability

  • During the early 19th century, probability began modeling variability in scientific measurements.
  • Scientists observed that repeated measurements of the same quantity produced different results—there was inherent variability.
  • Key insight: general laws govern this variability across repetitions.

📉 Reduced variability through averaging

One of the major achievements of probability was the development of a mathematical theory that explains the phenomenon of reduced variability observed when averages are used instead of single measurements.

  • Averages of several independent repeats are less variable and more reproducible than individual measurements.
  • This mathematical theory (discussed in Chapter 7 of the source) became a cornerstone of statistical practice.
  • Example: if you measure a length five times and get slightly different values each time, the average of those five measurements will be more reliable than any single measurement.

🎯 Core probabilistic concepts

🎲 Random variable

Random Variable: The probabilistic model for the value of a measurement, before the measurement is taken.

  • It represents uncertainty before the fact.
  • Not the actual outcome, but the model of all possible outcomes and their likelihoods.
  • This concept is key for understanding statistics.

🗂️ Sample space

Sample Space: The set of all values a random variable may obtain.

  • Includes the outcome that actually occurred plus all outcomes that could have occurred but did not materialize.
  • The rationale: putting the observed outcome in context by considering what else was possible.
  • Don't confuse: the sample space is not just "what happened"—it's "what could have happened."

📊 Probability

Probability: A number between 0 and 1 assigned to a subset of the sample space, indicating the likelihood of the random variable obtaining a value in that subset.

  • Quantifies how likely different outcomes are.
  • Always between 0 (impossible) and 1 (certain).

📍 Expectation

Expectation: The central value for a random variable. The expectation of the random variable X is marked by E(X).

  • Formula: E(X) = sum over all x of (x × P(x))
  • Represents the "average" or "center" of the distribution.
  • Can be negative (e.g., expected loss in a game).

📏 Variance and standard deviation

Variance: The (squared) spread of a random variable. The variance of the random variable X is marked by V(X). The standard deviation is the square root of the variance.

  • Formula: V(X) = sum over all x of ((x − E(X))² × P(x))
  • Measures how much values tend to differ from the expectation.
  • Standard deviation is the square root of variance, giving spread in the original units.

🔗 Relationship to statistics

🧱 Probability as foundation

  • Probability serves as the mathematical foundation for the development of statistical theory.
  • The book structure reflects this: Part I covers probability theory, Part II covers statistical inference.

🔍 From theory to inference

Statistics is the study of methods for inference based on data.

  • Statistical inference uses probability theory to draw conclusions from uncertain data.
  • The random variable concept bridges the gap: it models what we don't yet know, allowing us to reason about data before we collect it.

💡 Practical example from the excerpt

🏭 Quality assurance scenario

The excerpt provides a concrete example to illustrate the sample space concept:

  • Context: A factory produces car parts; QA personnel test batches before shipment.
  • Measurement: 20 parts are selected and tested; count how many fail quality testing.
  • Random variable: the number of parts (out of 20) that fail.
  • Sample space: any of the numbers 0, 1, 2, …, 20.
    • 0 = all parts passed
    • 1 = one part failed, 19 passed
    • 2 = two parts failed, 18 passed
    • etc.
  • Why the full sample space matters: even if the actual outcome is "0 failures," understanding that 1, 2, …, 20 failures were possible provides context for interpreting the result.

🎰 Game of chance scenario

The excerpt includes an exercise illustrating expectation:

  • Setup: invest $2 to play; toss a coin three times; win $10 if all three tosses are "Head," otherwise lose the $2.
  • Random variable: the player's gain (which can be negative).
  • Key point: expectation can be negative, representing an expected loss.
  • This example shows how probability models real decisions under uncertainty.

📐 Summary formulas

The excerpt provides key formulas relating population parameters to random variable properties:

| Concept | Population formula | Random variable formula |
|---|---|---|
| Average/Expectation | μ = (1/N) × sum of all xᵢ | E(X) = sum of (x × P(x)) |
| Variance | σ² = (1/N) × sum of (xᵢ − μ)² | V(X) = sum of ((x − E(X))² × P(x)) |

  • Population size N = number of people, things, etc. in the population.
  • The random variable formulas weight each value by its probability, analogous to how population formulas weight each value equally (1/N).
27

Student Learning Objectives for Bernoulli Response Analysis

5.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

This chapter extends statistical inference to cases where the response variable is Bernoulli (binary/two-level), using either a two-level factor or a numeric explanatory variable to model event probabilities.

📌 Key points (3–5)

  • What changes: the response is now Bernoulli (indicator of an event or a two-level factor) instead of numeric.
  • Two scenarios: explanatory variable can be either a factor with two levels (use prop.test) or numeric (use glm for logistic regression).
  • Common confusion: prop.test was introduced for single-sample probability analysis (Chapter 12); here it is extended to compare probabilities between two sub-samples, similar to how t.test compares numeric responses.
  • New tool: the glm function (Generalized Linear Model) fits logistic regression when the explanatory variable is numeric.
  • Learning goals: produce mosaic plots, compare event probabilities between groups, define and fit logistic regression models, and perform inference.

🔄 Progression from previous chapters

📚 What Chapters 13 and 14 covered

  • Both chapters dealt with a numeric response and an explanatory variable.
  • Chapter 13: explanatory variable was a factor with two levels, splitting the sample into two sub-samples.
  • Chapter 14: explanatory variable was numeric, producing a linear trend with the response.

🆕 What this chapter adds

Bernoulli response: a variable that indicates the occurrence of an event or is a factor with two levels.

  • The key shift is from numeric response to binary/event-based response.
  • The explanatory variable remains either a two-level factor or numeric, but the modeling approach changes to accommodate the binary nature of the response.

🧩 Two-level factor as explanatory variable

🧩 Using prop.test for comparison

  • When the explanatory variable is a factor with two levels, use the prop.test function.
  • This function was previously used in Chapter 12 for analyzing the probability of an event in a single sample.
  • Here it is extended to compare probabilities between two sub-samples.

🔍 Parallel with t.test

| Function | Response type | Single-sample use | Two-sample use |
|---|---|---|---|
| t.test | Numeric | Analyze mean in one sample | Compare means between two sub-samples |
| prop.test | Bernoulli | Analyze event probability in one sample | Compare event probabilities between two sub-samples |

  • Don't confuse: prop.test is not limited to single-sample analysis; it extends to comparing two groups, just as t.test does for numeric responses.
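
A rough illustration of the two-sample use of prop.test; the counts below are invented for the sketch and are not the textbook's data:

```r
# hypothetical counts: 30 events out of 100 trials in group A, 45 out of 120 in group B
prop.test(x = c(30, 45), n = c(100, 120))
```

The output reports the two estimated proportions, a confidence interval for their difference, and a test of the hypothesis that the two probabilities are equal.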

📊 Mosaic plots

  • One learning objective is to produce mosaic plots of the response and the explanatory variable.
  • These plots visualize the relationship between a Bernoulli response and a categorical explanatory variable.

📈 Numeric explanatory variable and logistic regression

📈 When to use glm

  • When the explanatory variable is numeric and the response is Bernoulli, use the glm function.
  • glm stands for Generalized Linear Model.
  • It fits an appropriate regression model to the data, specifically a logistic regression model.

🔗 What logistic regression does

Logistic regression model: relates the probability of an event in the response to a numeric explanatory variable.

  • Unlike linear regression (which models a numeric response directly), logistic regression models the probability of an event occurring.
  • The model accommodates the binary nature of the response by linking the probability to the explanatory variable through a logistic function.

🛠️ Fitting and inference

  • Use glm to fit the logistic regression model to data.
  • After fitting, produce statistical inference on the fitted model (e.g., test significance of coefficients, estimate probabilities).
  • Example: An organization collects data on whether an event occurred (yes/no) and a numeric predictor; glm estimates how the predictor affects the probability of the event.

🎯 Learning objectives summary

🎯 Skills to acquire

By the end of this chapter, students should be able to:

  1. Visualize: Produce mosaic plots showing the relationship between a Bernoulli response and an explanatory variable.
  2. Compare probabilities: Apply prop.test to compare the probability of an event between two sub-populations (when the explanatory variable is a two-level factor).
  3. Define logistic regression: Understand the logistic regression model that relates event probability to a numeric explanatory variable.
  4. Fit and infer: Use glm to fit the logistic regression model to data and perform statistical inference on the fitted model.

🔑 Key distinction

  • Two-level factor explanatory variable → use prop.test (compare probabilities between groups).
  • Numeric explanatory variable → use glm (fit logistic regression to model probability as a function of the numeric variable).
28

Discrete Random Variables

5.2 Discrete Random Variables

🧭 Overview

🧠 One-sentence thesis

Discrete random variables model measurements that take separated values (like integers), and the Binomial distribution specifically models the count of "successes" when a two-outcome trial is repeated multiple times with a fixed probability of success.

📌 Key points (3–5)

  • What discrete random variables are: random variables whose sample space consists of separated values (e.g., integers), not a continuum.
  • How the Binomial model works: it counts the number of successes in a fixed number of independent trials, each with the same probability of success.
  • Key Binomial parameters: n (number of trials) and p (probability of success per trial) fully define the distribution, written as X ~ Binomial(n, p).
  • Common confusion: the sample space (possible values) stays the same for different p values, but the probabilities assigned to each value change.
  • Why it matters: Binomial models apply to real-world counting scenarios like quality control, polling, and repeated experiments.

🎲 What makes a random variable discrete

🎲 Definition and characteristics

A discrete random variable is one for which the values in the sample space are separated from each other, such as integers.

  • The sample space is the collection of all values the measurement may potentially obtain.
  • Each value in the sample space has an associated probability.
  • All probabilities are positive and sum to one (like relative frequencies).
  • A random variable is fully defined by identifying its sample space and the probabilities of each value.

🔢 Example: quality control

  • Testing 20 parts for defects: the number of defective parts can be 0, 1, 2, …, 20.
  • The sample space is the set of integers {0, 1, 2, …, 20}.
  • Each integer corresponds to a different outcome (e.g., 0 means all passed, 1 means one failed and 19 passed, etc.).

🎯 The Binomial random variable

🎯 When to use the Binomial model

The Binomial random variable applies when:

  • A trial has exactly two possible outcomes: "Success" and "Failure".
  • The trial is repeated a fixed number of times (n).
  • Each trial has the same probability of success (p), where 0 < p < 1.
  • The count of successes across all trials is the random variable.

Notation: X ~ Binomial(n, p) means X follows a Binomial distribution with n trials and success probability p.

🪙 Example: coin tossing

  • Toss 10 fair coins; designate "Head" as success and "Tail" as failure.
  • Probability of "Head" is 1/2 for each coin.
  • X = total number of "Heads" follows Binomial(10, 0.5).
  • Sample space: {0, 1, 2, …, 10}.
  • X = 0 means all coins show "Tail"; X = 1 means one "Head" and nine "Tails"; etc.

🎲 Example: rolling a die

  • Roll a die 10 times; count how many times the face "3" appears.
  • "Success" = face "3" appears; probability p = 1/6.
  • X ~ Binomial(10, 1/6).
  • Sample space is still {0, 1, …, 10}, but probabilities differ from the coin example.

📊 Computing Binomial probabilities

📊 Using the dbinom function

  • The R function dbinom computes the probability P(X = x) for a Binomial random variable.
  • Inputs: a sequence of values, n, and p.
  • Output: the probability for each value in the input sequence.

Example: Compute the probability that X is odd when X ~ Binomial(10, 0.5):

  • Input odd values: {1, 3, 5, 7, 9}.
  • dbinom(c(1,3,5,7,9), 10, 0.5) returns the probabilities for each odd value.
  • Sum these probabilities to get P(X is odd) = 0.5.
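
The corresponding R computation:

```r
dbinom(c(1, 3, 5, 7, 9), 10, 0.5)       # P(X = x) for each odd value
sum(dbinom(c(1, 3, 5, 7, 9), 10, 0.5))  # P(X is odd) = 0.5
```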

📈 Cumulative probabilities with pbinom

  • The function pbinom computes the cumulative probability P(X ≤ x).
  • This is the sum of all probabilities for values less than or equal to x.
  • Example: P(X ≤ 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3).

Don't confuse: dbinom gives the probability of a single value; pbinom gives the cumulative probability up to and including that value.
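
A small sketch showing that the cumulative probability equals the sum of the individual probabilities:

```r
pbinom(3, 10, 0.5)         # P(X <= 3) directly
sum(dbinom(0:3, 10, 0.5))  # same value, as P(X=0) + P(X=1) + P(X=2) + P(X=3)
```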

📊 Visualizing the distribution

  • A bar plot displays the Binomial distribution: a vertical bar above each sample-space value, with height equal to the probability.
  • Example: Figure 5.1 shows the Binomial(10, 0.5) distribution with bars over integers 0 through 10.

🧮 Expectation and variance of the Binomial

🧮 General formulas for any discrete random variable

Expectation: E(X) = sum over all x of (x × P(X = x))

Variance: V(X) = sum over all x of ((x − E(X))² × P(x))

  • Expectation measures the central location of the distribution.
  • Variance (and standard deviation, the square root of variance) measures the spread.

🎯 Simplified formulas for the Binomial

For X ~ Binomial(n, p):

  • Expectation: E(X) = n × p
  • Variance: V(X) = n × p × (1 − p)

Example: For Binomial(10, 0.5):

  • E(X) = 10 × 0.5 = 5
  • V(X) = 10 × 0.5 × 0.5 = 2.5
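
A short sketch verifying that the shortcut formulas agree with the general definitions for Binomial(10, 0.5):

```r
n <- 10; p <- 0.5
x <- 0:n
prob <- dbinom(x, n, p)
E <- sum(x * prob)     # matches n * p = 5
sum((x - E)^2 * prob)  # matches n * p * (1 - p) = 2.5
```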

🔄 How changing p affects the distribution

  • The sample space remains the same (e.g., {0, 1, …, 10}) regardless of p.
  • The probabilities assigned to each value change when p changes.
  • Example comparison (all with n = 10):

| p value | Expectation | Variance | Probability pattern |
|---|---|---|---|
| 0.5 | 5 | 2.5 | Symmetric around 5 |
| 1/6 | 1.67 | 1.39 | Higher probabilities for smaller values |
| 0.6 | 6 | 2.4 | Shifted toward larger values |

Don't confuse: Changing p does not change the sample space, only the probability distribution over that space.

🌍 Real-world applications

🗳️ Example: pre-election polling

  • A candidate wants to estimate the percentage of support (p) in the population.
  • A sample of 300 people is selected.
  • X = count of supporters in the sample.
  • Model: X ~ Binomial(300, p).
  • Each person is either a supporter ("Success") or not ("Failure"), with probability p.

🔍 Example: quality control

  • 20 items are tested; count the number of faulty items.
  • Let p = probability that an item is faulty.
  • X = total number of faulty items follows Binomial(20, p).
  • This model helps make statements about the true defect rate p based on the observed count.

🔗 Connection to statistical inference

  • In both examples, the actual observed count is related to the theoretical Binomial distribution.
  • Statistical inference uses this relationship to make conclusions about the unknown probability p.
29

Continuous Random Variables

5.3 Continuous Random Variable

🧭 Overview

🧠 One-sentence thesis

Continuous random variables model measurements that can take any value in an interval, using densities instead of probabilities for individual values, with integration replacing summation in calculations of expectation and variance.

📌 Key points (3–5)

  • Core difference from discrete: Continuous random variables assign probabilities to intervals rather than individual values; densities replace probability functions.
  • How densities work: Area under the density curve corresponds to probability; the total area equals 1.
  • Computation shift: Integration replaces summation—expectation becomes integral of (x × density), variance becomes integral of ((x − E(X))² × density).
  • Common confusion: Density values are not probabilities; only areas under the density curve represent probabilities.
  • Two key distributions: Uniform (all values in an interval equally likely) and Exponential (models times between events, with smaller values more likely).

🔄 From discrete to continuous

🔄 What changes in the sample space

  • Discrete random variables: sample space consists of separated values (e.g., integers {0, 1, 2, …}).
  • Continuous random variables: sample space is the entire real line or an interval of real numbers (possibly open-ended).
  • The continuum of possible outcomes requires a different way to describe the distribution.

📊 How distributions are described differently

| Aspect | Discrete | Continuous |
|---|---|---|
| Probability assignment | To specific values | To intervals of values |
| Display method | Table, formula, or bar plot | Density function (like a histogram) |
| Key tool | Probability function P(x) | Density function f(x) |
| Interpretation | Bar height = probability | Area under curve = probability |

  • Don't confuse: In continuous distributions, the probability of any single exact value is zero; only intervals have positive probability.

🧮 Computation formulas

Expectation (discrete): E(X) = sum over all x of (x × P(x))
Expectation (continuous): E(X) = integral of (x × f(x)) dx

Variance (discrete): V(X) = sum over all x of ((x − E(X))² × P(x))
Variance (continuous): V(X) = integral of ((x − E(X))² × f(x)) dx

  • The intuitive meaning remains: expectation identifies the central location, standard deviation (square root of variance) summarizes total spread.

📏 The Uniform distribution

📏 What it models

Uniform distribution: Models measurements that may have values in a given interval, with all values in that interval equally likely to occur.

  • Notation: X ~ Uniform(a, b) means X is uniformly distributed over the interval [a, b].
  • Example: X ~ Uniform(3, 7) assigns equal likelihood to any value between 3 and 7.

🎨 Density shape and properties

  • The density forms a rectangle over the interval [a, b].
  • Height of the rectangle = 1 / (b − a), so that total area = (b − a) × (1 / (b − a)) = 1.
  • Density is zero outside [a, b].
  • Example: For Uniform(3, 7), density height = 1/4 over [3, 7]; density = 0 elsewhere.

🧮 Expectation and variance

  • Expectation: E(X) = (a + b) / 2 (the midpoint of the interval).
  • Variance: V(X) = (b − a)² / 12.
  • Standard deviation: square root of variance.
  • Example: For Uniform(3, 7), E(X) = 5, V(X) = 16/12 ≈ 1.333, SD ≈ 1.155.

🔍 Computing probabilities

  • Use the function "punif" to compute cumulative probability P(X ≤ x).
  • Probability of an interval = area of the rectangle over that interval.
  • Example: For X ~ Uniform(3, 7), P(X ≤ 4.73) = (4.73 − 3) / (7 − 3) = 1.73 / 4 = 0.4325.
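
In R, a minimal check of this probability:

```r
punif(4.73, 3, 7)     # P(X <= 4.73) for X ~ Uniform(3, 7): 0.4325
(4.73 - 3) / (7 - 3)  # the same value, computed as the area of the rectangle
```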

💡 Application examples

  • Rain drop position (Example 5.5): The position where a rain drop hits a power line between two utility poles can be modeled by Uniform—any position is equally likely.
  • Genetic crossover (Example 5.6): In the Haldane model, the position of a crossover between two loci on a chromosome follows a Uniform distribution.

⏱️ The Exponential distribution

⏱️ What it models

Exponential distribution: Frequently used to model times between events (e.g., time between phone calls, time until a component malfunctions).

  • Notation: X ~ Exponential(λ), where λ is the rate parameter.
  • Sample space: all non-negative numbers [0, ∞).
  • Smaller values are more likely than larger values—the density is highest near 0 and decreases as x increases.

🔗 Connection to Poisson

  • The Exponential and Poisson distributions are tightly interconnected.
  • If time between occurrences has Exponential(λ) distribution, then the total number of occurrences in a unit time interval has Poisson(λ) distribution.
  • Example: If decays per second follow Poisson(λ), then time between decays follows Exponential(λ).

🧮 Expectation and variance

  • Expectation: E(X) = 1 / λ.
  • Variance: V(X) = 1 / λ².
  • Standard deviation: 1 / λ.
  • Inverse relationship: larger rate λ → smaller expectation and spread (events happen more frequently, so times between them are shorter).

🔍 Computing probabilities

  • Use "dexp" to compute density values.
  • Use "pexp" to compute cumulative probability P(X ≤ x).
  • Probability of an interval (a, b] = P(X ≤ b) − P(X ≤ a) = area under the density curve between a and b.
  • Example: For X ~ Exponential(0.5), P(2 < X ≤ 6) = pexp(6, 0.5) − pexp(2, 0.5) ≈ 0.318.
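
In R:

```r
pexp(6, 0.5) - pexp(2, 0.5)  # P(2 < X <= 6) for X ~ Exponential(0.5): about 0.318
```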

📉 Effect of the rate parameter

  • As λ increases, the density shifts toward smaller values (the curve becomes steeper near 0).
  • Higher rate → more occurrences per unit time → shorter times between occurrences.
  • Example: Exponential(2) has smaller expected time between events than Exponential(0.5).

💡 Application examples

  • Rain drop timing (Example 5.7): Times between consecutive hits of a power line by rain drops follow an Exponential distribution.
  • Radioactive decay (Example 5.8): Times between radioactive decays are modeled by Exponential; the rate λ equals the expected count of decays per second (the Poisson expectation).

🎯 Key computational tools

🎯 R functions for Uniform

  • dunif(x, a, b): computes density at value x for Uniform(a, b).
  • punif(x, a, b): computes cumulative probability P(X ≤ x).
  • Density is constant (1 / (b − a)) inside [a, b] and zero outside.

🎯 R functions for Exponential

  • dexp(x, λ): computes density at value x for Exponential(λ).
  • pexp(x, λ): computes cumulative probability P(X ≤ x).
  • Density formula: f(x) = λ × exp(−λx) for x ≥ 0, and 0 for x < 0.

🎯 Plotting densities

  • Generate a sequence of x values using "seq(start, end, length = n)".
  • Compute density at each x value.
  • Use "plot(x, density, type = 'l')" to draw a line plot connecting the points.
  • Example: For Uniform(3, 7), plot shows a flat line at height 0.25 over [3, 7] and zero elsewhere.
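
A minimal plotting sketch for the Uniform(3, 7) density (the plotting range 0 to 10 is an arbitrary display choice):

```r
x <- seq(0, 10, length = 1000)  # grid of x values
den <- dunif(x, 3, 7)           # density: 0.25 inside [3, 7], 0 outside
plot(x, den, type = "l", ylab = "Density")
```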
30

5.4 Exercises

5.4 Exercises

🧭 Overview

🧠 One-sentence thesis

This section introduces exercises that apply logistic regression concepts, beginning with a comparison between the Mediterranean diet and a low-fat diet in the context of health risks.

📌 Key points (3–5)

  • What logistic regression models: the relationship between the probability of an event (e.g., having four doors) and an explanatory variable (e.g., car length).
  • The core formula: p_i = exp(a + b·x_i) / (1 + exp(a + b·x_i)), where a and b are coefficients.
  • How to test relationships: the null hypothesis H₀: b = 0 tests whether the explanatory variable and the event probability are unrelated.
  • Common confusion: the response can be TRUE/FALSE logical values or 1/0 numeric values—both represent whether the event occurred.
  • Why the slope matters: when coefficient b equals zero, there is no relationship; rejecting this hypothesis means the explanatory variable affects the probability.

📐 The logistic regression model

📐 What the model relates

Logistic regression: a method for investigating relations between the probability of an event and explanatory variables.

  • The model connects p_i (probability of the event for observation i) to x_i (the value of the explanatory variable for that observation).
  • It is not a simple linear relationship; instead, it uses an exponential transformation to keep probabilities between 0 and 1.
  • Example: In the car data, the event is "having four doors" and the explanatory variable is "car length."

🔢 The probability formula

The relationship is given by:

  • p_i = exp(a + b·x_i) / (1 + exp(a + b·x_i))
  • a and b are coefficients common to all observations.
  • This formula ensures that no matter what values a, b, and x_i take, p_i stays between 0 and 1.

📊 The linear form

An equivalent way to express the same relationship:

  • log(p_i / (1 − p_i)) = a + b·x_i
  • This states that a function of the probability has a linear trend with the explanatory variable.
  • The left side is called the log-odds or logit.

🔧 Fitting the model in practice

🔧 Using the glm function

The excerpt shows how to fit logistic regression using the glm function:

  • The argument family=binomial tells the function to use the logistic regression model.
  • The formula involves a response and an explanatory variable.
  • The argument data=cars informs the function where the variables are located.
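
A sketch of the call described above, assuming the textbook's cars data frame (with the variables num.of.doors and length) has already been read in:

```r
# logistic regression of the event "the car has four doors" on car length
fit <- glm(num.of.doors == "four" ~ length,
           family = binomial, data = cars)
summary(fit)  # coefficient estimates and tests of H0: coefficient = 0
confint(fit)  # confidence intervals for the coefficients
```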

🎯 Response variable formats

The response can take two forms:

  • Logical values: TRUE or FALSE (e.g., the expression num.of.doors == "four" produces TRUE when the car has 4 doors, FALSE when it has 2 doors).
  • Numeric values: 1 or 0, where 1 corresponds to the event occurring and 0 to the event not occurring.

Don't confuse: Both formats represent the same thing—whether the event happened for each observation.

📋 Interpreting the output

📋 Coefficient estimates

The fitted model provides estimates for the coefficients:

  • Intercept (a): estimated as -13.14767
  • Slope (b): estimated as 0.07726 for the length variable
  • These coefficients define the relationship between car length and the probability of having four doors.

🧪 Testing for a relationship

The key hypothesis test:

  • Null hypothesis H₀: b = 0 claims there is no relation between the explanatory variable and the response.
  • When b equals 0, the probability of the event and the explanatory variable are unrelated.
  • In the car example, this hypothesis is clearly rejected (p-value 2.37 × 10⁻⁷).
  • Conclusion: car length is related to the probability of having four doors.

📏 Confidence intervals

The confint function produces confidence intervals for the coefficients:

  • For the intercept: 2.5% = -18.50384373, 97.5% = -8.3180877
  • For the slope (length): 2.5% = 0.04938358, 97.5% = 0.1082429
  • These intervals quantify the uncertainty in the coefficient estimates.

🔍 Comparison with linear regression

🔍 Similarities in reporting

Both logistic regression and linear regression reports contain:

  • Estimates of the coefficients a and b
  • Tests for the equality of these coefficients to zero
  • Standard errors and significance codes

🔍 Key differences

The excerpt notes there are "similarities and differences" between the reports:

  • Logistic regression uses a different model structure (exponential transformation).
  • The interpretation of coefficients differs: in logistic regression, b affects the log-odds, not the response directly.
  • The response variable is binary (event occurred or not) rather than continuous.

🏋️ Exercise context

🏋️ Exercise 15.1 setup

The exercise section begins with:

  • A comparison between the Mediterranean diet and the low-fat diet recommended by the American Heart Association.
  • The context is risks for illness.
  • This suggests applying logistic regression to compare health outcomes (a binary event: illness or not) between two diet groups.
31

The Negative-Binomial Distribution and Summary of Random Variables

5.5 Summary

🧭 Overview

🧠 One-sentence thesis

The Negative-Binomial distribution is a discrete random variable model characterized by two parameters (r and p), and this summary chapter consolidates key discrete and continuous distributions—Binomial, Poisson, Uniform, and Exponential—each serving as theoretical models for different types of uncertainty.

📌 Key points (3–5)

  • Negative-Binomial structure: a discrete distribution over non-negative integers {0, 1, 2, …}, specified by parameters r and p.
  • Four core distributions summarized: Binomial (successes in n trials), Poisson (rare events), Uniform (equally likely outcomes over an interval), Exponential (times between events).
  • Common confusion: discrete vs continuous—discrete random variables use summation for expectation/variance; continuous use integration with a density function.
  • Practical modeling: each distribution models a different real-world uncertainty (e.g., Exponential for radioactive decay timing, Poisson for decay counts).
  • Parameter effects: changing r and p in Negative-Binomial shifts the shape, expectation, and variance of the distribution.

🎲 The Negative-Binomial distribution

🎲 What it is

Negative-Binomial distribution: a discrete, integer-valued random variable with sample space {0, 1, 2, …}, specified by parameters r and p.

  • Notation: X ∼ Negative-Binomial(r, p)
  • The parameters r and p control the shape and spread of the distribution.
  • Example: X₁ ∼ Negative-Binomial(2, 0.5), X₂ ∼ Negative-Binomial(4, 0.5), X₃ ∼ Negative-Binomial(8, 0.8) produce different bar plots over integers 0 to 15.

📊 How parameters affect the distribution

  • The excerpt presents three random variables with different (r, p) pairs and asks to match them to bar plots.
  • Expectation and variance pairs given:
    • E(X) = 4, V(X) = 8
    • E(X) = 2, V(X) = 4
    • E(X) = 2, V(X) = 2.5
  • The task is to match by inspecting the structure of bar plots (center and spread), not by formula.
  • Don't confuse: the same expectation can have different variances depending on the parameters; variance affects how spread out the bars are.
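
A sketch for producing the three bar plots; R's dnbinom uses the (size = r, prob = p) parameterization, which matches the expectation and variance formulas listed below:

```r
x <- 0:15
barplot(dnbinom(x, size = 2, prob = 0.5), names.arg = x)  # E(X1) = 2, V(X1) = 4
barplot(dnbinom(x, size = 4, prob = 0.5), names.arg = x)  # E(X2) = 4, V(X2) = 8
barplot(dnbinom(x, size = 8, prob = 0.8), names.arg = x)  # E(X3) = 2, V(X3) = 2.5
```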

📚 Summary of key distributions

📚 Discrete distributions

| Distribution | What it models | Notation | Expectation | Variance |
|---|---|---|---|---|
| Binomial | Number of successes in n independent trials, each with probability p | Binomial(n, p) | np | np(1 − p) |
| Poisson | Approximation for rare events when the expected count is λ | Poisson(λ) | λ | λ |
| Negative-Binomial | (Introduced in this section) | Negative-Binomial(r, p) | r(1 − p)/p | r(1 − p)/p² |

  • Binomial: counts successes among repeated trials.
  • Poisson: models rare occurrences; expectation equals variance.
  • Negative-Binomial: generalizes count models with two parameters.

📚 Continuous distributions

| Distribution | What it models | Notation | Expectation | Variance |
|---|---|---|---|---|
| Uniform | Equally likely outcomes over the interval [a, b] | Uniform(a, b) | (a + b)/2 | (b − a)²/12 |
| Exponential | Times between events | Exponential(λ) | 1/λ | 1/λ² |

  • Uniform: flat density over an interval; center is midpoint.
  • Exponential: models waiting times; rate parameter λ controls both mean and variance.

🔍 Density for continuous variables

Density: a histogram-like curve describing the distribution of a continuous random variable; the area under the curve corresponds to probability.

  • Continuous random variables do not have probabilities at single points; probability is area under the density curve over an interval.
  • Don't confuse: discrete variables use probability mass (P(x)); continuous use density (f(x)) and integrate to get probability.

🧮 Formulas for expectation and variance

🧮 Discrete random variables

  • Expectation: E(X) = sum over all x of (x × P(x))
    • Multiply each outcome by its probability, then add them up.
  • Variance: V(X) = sum over all x of ((x − E(X))² × P(x))
    • Measure squared deviations from the mean, weighted by probability.

🧮 Continuous random variables

  • Expectation: E(X) = integral of (x × f(x)) dx
    • Integrate the product of outcome and density over the entire range.
  • Variance: V(X) = integral of ((x − E(X))² × f(x)) dx
    • Integrate squared deviations weighted by density.
  • Don't confuse: summation (Σ) for discrete, integration (∫) for continuous; the logic is the same, but the math operation differs.

🌍 Real-world modeling example

🌍 Radioactive decay

  • The excerpt gives Carbon-14 decay as an example:
    • Time until decay: modeled by Exponential distribution.
    • Number of decays per second: modeled by Poisson distribution.
    • The decay rate (3.8394 × 10⁻¹² particles per second) is the rate parameter λ for the Exponential and the expectation for the Poisson.
  • Example use: dating ancient specimens by computing probabilities based on the Exponential model.
  • This illustrates why theoretical models matter: they enable precise calculations for real phenomena.

🌍 Forum question: usefulness of theoretical models

  • The excerpt asks: is it useful to have a theoretical model for real-life uncertainty?
  • It suggests considering examples from your own field where a random variable (Binomial, Poisson, Uniform, Exponential, or Negative-Binomial) could model a situation.
  • The importance lies in enabling predictions, computations, and understanding of variability.
32

Student Learning Objective

6.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

This chapter concludes the book by reviewing the topics covered in the second part, which dealt with statistical modeling and inference.

📌 Key points (3–5)

  • Purpose: The chapter serves as a conclusion to the book.
  • Content focus: Reviews topics from the second part of the book.
  • Topic area: The second part dealt with statistical modeling and related methods.

📚 Chapter structure and purpose

📚 Concluding chapter

  • This is the final chapter of the book.
  • It provides a review rather than introducing new material.
  • The review focuses specifically on the second part of the book, not the entire text.

🔄 Review scope

  • The excerpt indicates the chapter will review topics from the second part.
  • The second part dealt with statistical concepts and methods (based on context from preceding sections).
  • The excerpt does not provide details about what specific topics will be reviewed, only that a review will occur.

⚠️ Note on excerpt content

The provided excerpt is extremely brief and consists primarily of a chapter heading and an incomplete introductory sentence. The substantive content of the chapter—what topics are reviewed, how they are summarized, or what case studies are presented—is not included in this excerpt. The excerpt ends mid-sentence ("the part that dealt") without completing the thought about what the second part of the book covered.

33

The Normal Random Variable

6.2 The Normal Random Variable

🧭 Overview

🧠 One-sentence thesis

The Normal distribution is the most important distribution in statistics because it models many measurements directly and also emerges as an approximation for other distributions like the Binomial under appropriate conditions.

📌 Key points (3–5)

  • What the Normal distribution is: a continuous, bell-shaped distribution over all numbers (negative or positive), symmetric about its expectation, denoted Normal(μ, σ²).
  • How to compute probabilities: use the pnorm function for cumulative probabilities and qnorm for percentiles; probabilities for intervals are computed as differences of cumulative probabilities.
  • Standardization: any Normal random variable X can be transformed into a standard Normal Z (with mean 0 and variance 1) by the formula Z = (X − μ)/σ, which simplifies probability calculations.
  • Common confusion: the third argument to pnorm is the standard deviation σ (not the variance σ²); for example, if variance is 9, enter 3.
  • Why it matters: the Normal distribution approximates other distributions (e.g., Binomial) and serves as a generic model for measurements influenced by many independent factors.

📐 Defining the Normal distribution

📐 Notation and parameters

Normal distribution: denoted X ∼ Normal(μ, σ²), where μ = E(X) is the expectation and σ² = V(X) is the variance.

  • The distribution is continuous and defined over all real numbers.
  • It is symmetric about the expectation μ.
  • Values near the expectation are more likely; values much larger or smaller are substantially less likely.

📊 Shape and variance

  • Smaller variance → more concentrated around the expectation.
  • Larger variance → more spread out.
  • Example: Normal(0,1) is more concentrated than Normal(2,9); Normal(−3, 0.25) is even more concentrated because variance is only 0.25.

🧬 Real-world examples

  • IQ scores: X ∼ Normal(100, 15²). The population mean is set to 100 and standard deviation to 15.
  • Human height: influenced by many genetic and environmental factors, so heights tend to follow a Normal distribution.
  • General principle: any measurement produced by the combination of many independent factors is likely to be Normal.

🔢 Computing probabilities and using R functions

🔢 The pnorm function

  • pnorm(x, mean, sd) computes the cumulative probability P(X ≤ x).
  • Default values: mean=0 and sd=1 (standard Normal).
  • Example: for X ∼ Normal(2, 9), the probability P(0 < X ≤ 5) is computed as:
    • P(X ≤ 5) − P(X ≤ 0) = pnorm(5, 2, 3) - pnorm(0, 2, 3) ≈ 0.589.
    • Notice: the third argument is 3 (the standard deviation), not 9 (the variance).

🔢 The dnorm function

  • dnorm(x, mean, sd) computes the density (height of the curve) at x.
  • Used for plotting the distribution, not for probabilities (since probabilities for continuous distributions are areas, not point values).

🔢 Interval probabilities

  • For a continuous distribution, P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a).
  • The probability of a single point is zero, so P(X = a) = 0.

🎯 Standardization and the standard Normal

🎯 What is a z-score?

A z-score is the original measurement expressed in units of standard deviation from the expectation.

  • Formula: z = (x − μ) / σ.
  • Example: if μ = 2 and σ = 3, then x = 0 has z-score (0 − 2)/3 = −2/3 (i.e., 2/3 standard deviations below the mean).
  • Similarly, x = 5 has z-score (5 − 2)/3 = 1 (i.e., 1 standard deviation above the mean).

🎯 The standard Normal distribution

The standard Normal distribution is Normal(0, 1): expectation 0 and variance 1.

  • If X ∼ Normal(μ, σ²), then Z = (X − μ)/σ ∼ Normal(0, 1).
  • Why standardize? It allows you to use a single reference distribution (the standard Normal) for all Normal computations.
  • Example: P(0 < X ≤ 5) for X ∼ Normal(2, 9) equals P(−2/3 < Z ≤ 1) for Z ∼ Normal(0, 1).

🎯 Using standardization in R

  • You can compute probabilities by standardizing manually:
    • pnorm((5-2)/3) - pnorm((0-2)/3) gives the same result as pnorm(5, 2, 3) - pnorm(0, 2, 3).
  • When you omit the mean and sd arguments, pnorm defaults to the standard Normal.

📏 Percentiles and central intervals

📏 What is a percentile?

The p-percentile of a distribution is the value x such that the cumulative probability up to x equals p.

  • Example: the 2.5%-percentile is the value with 2.5% of the distribution to its left.
  • The 97.5%-percentile has 97.5% to its left (and 2.5% to its right).

📏 The qnorm function

  • qnorm(p, mean, sd) computes the p-percentile of the Normal distribution.
  • Default: mean=0 and sd=1 (standard Normal).
  • Example: qnorm(0.975) ≈ 1.96 and qnorm(0.025) ≈ −1.96.

📏 Central 95% interval

  • For the standard Normal, 95% of the distribution lies in [−1.96, 1.96].
  • For any Normal(μ, σ²), the central 95% interval is [μ − 1.96·σ, μ + 1.96·σ], often written μ ± 1.96·σ.
  • Example: for X ∼ Normal(2, 9) with σ = 3, the interval is [2 − 1.96·3, 2 + 1.96·3] = [−3.88, 7.88].

📏 How to compute in R

  • Direct: qnorm(0.975, 2, 3) and qnorm(0.025, 2, 3).
  • Via standardization: 2 + qnorm(0.975)*3 and 2 + qnorm(0.025)*3.
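
A runnable version of both approaches for X ∼ Normal(2, 9):

```r
qnorm(c(0.025, 0.975), mean = 2, sd = 3)  # central 95% interval: about -3.88 and 7.88
2 + qnorm(c(0.025, 0.975)) * 3            # same interval via standardization
```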

📦 Outliers and the interquartile range

📦 Quartiles and the interquartile range

  • First quartile (Q1): 25% of the distribution is to the left; computed as qnorm(0.25).
  • Third quartile (Q3): 75% of the distribution is to the left; computed as qnorm(0.75).
  • Interquartile range (IQR): Q3 − Q1.
  • For the standard Normal: Q3 ≈ 0.674, Q1 ≈ −0.674, so IQR ≈ 1.349.

📦 Outlier definition

  • A value is an outlier if it is more than 1.5 times the IQR away from the central box.
  • Upper threshold: Q3 + 1.5·IQR.
  • Lower threshold: Q1 − 1.5·IQR.
  • For the standard Normal: thresholds are approximately ±2.698.

📦 Probability of outliers

  • For the standard Normal, the probability of an outlier is computed as:
    • 2 × (1 − pnorm(2.697959)) ≈ 0.007 (about 0.7%).
  • This means outliers are rare in a Normal distribution.
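
The quantities above can be reproduced in R for the standard Normal (a minimal sketch of the computation described in this section):

    q1 <- qnorm(0.25)          # first quartile, about -0.674
    q3 <- qnorm(0.75)          # third quartile, about 0.674
    iqr <- q3 - q1             # interquartile range, about 1.349
    upper <- q3 + 1.5 * iqr    # upper outlier threshold, about 2.698
    2 * (1 - pnorm(upper))     # probability of an outlier, about 0.007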

🔄 Normal approximation of the Binomial

🔄 When to use the approximation

  • The Normal distribution can approximate the Binomial distribution under appropriate conditions (large number of trials).
  • Example: X ∼ Binomial(4000, 0.5) can be approximated by a Normal distribution with the same expectation and variance.

🔄 Matching parameters

  • If X ∼ Binomial(n, p), then:
    • Expectation: E(X) = n·p.
    • Variance: V(X) = n·p·(1 − p).
    • Standard deviation: σ = square root of variance.
  • Example: for Binomial(4000, 0.5), E(X) = 2000 and V(X) = 1000, so σ ≈ 31.62.

🔄 Computing probabilities

  • Exact (Binomial): P(1940 ≤ X ≤ 2060) = pbinom(2060, 4000, 0.5) - pbinom(1939, 4000, 0.5) ≈ 0.944.
  • Approximate (Normal): use Normal(2000, 1000) with the same expectation and variance.
  • The Normal approximation is not exact but is often very close and computationally simpler for large n.
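
A short R sketch of the comparison, mirroring the calls quoted above:

    # exact Binomial probability P(1940 <= X <= 2060)
    pbinom(2060, 4000, 0.5) - pbinom(1939, 4000, 0.5)              # about 0.944
    # Normal approximation with matched expectation 2000 and sd sqrt(1000)
    pnorm(2060, 2000, sqrt(1000)) - pnorm(1939, 2000, sqrt(1000))  # very close to the exact value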

🔄 Why it works

  • The Central Limit Theorem (covered in the next chapter) provides the mathematical foundation for this approximation.
  • The Normal distribution emerges as an approximation of data characteristics even when the original distribution is not Normal.

Method | Distribution | Computation | Result
Exact | Binomial(4000, 0.5) | pbinom differences | ≈ 0.944
Approximate | Normal(2000, 1000) | pnorm with matched parameters | Close to exact

Approximation of the Binomial Distribution

6.3 Approximation of the Binomial Distribution

🧭 Overview

🧠 One-sentence thesis

The Normal distribution can approximate Binomial probabilities when the number of trials is large, offering a practical computational tool that extends to many other distributions beyond just the Binomial case.

📌 Key points (3–5)

  • Normal approximation replaces Binomial calculations: use a Normal distribution with the same expectation and standard deviation as the Binomial to approximate probabilities and percentiles.
  • Continuity correction improves accuracy: adding or subtracting 0.5 to discrete values before applying Normal approximation produces much better results, especially for smaller sample sizes.
  • When Normal approximation may fail: it works poorly when the number of trials n is small or when the success probability p is very close to 0 or 1.
  • Common confusion—Poisson vs Normal: when n is large but p is very small, Poisson approximation may outperform Normal approximation; don't assume Normal is always best.
  • Broader significance: the Normal approximation applies to a wide class of distributions, not just Binomial, making it valuable even when exact computational tools are unavailable.

🎲 How Normal approximation works

🎲 The basic method

Normal approximation: replace Binomial computations with Normal distribution computations using the same expectation and standard deviation.

  • For a Binomial random variable X with parameters (n, p):
    • Expectation: E(X) = n × p
    • Variance: V(X) = n × p × (1 - p)
    • Standard deviation: square root of variance
  • Use the Normal distribution with these same parameters to compute probabilities.
  • Example: For Binomial(4000, 0.5), expectation = 2000 and variance = 1000, so approximate with the Normal distribution having expectation 2000 and standard deviation √1000 ≈ 31.62, i.e., Normal(2000, 1000) in the Normal(μ, σ²) notation.

🎯 Accuracy of the approximation

  • In the coin-toss example with 4000 trials, the Normal approximation (0.9442441) agreed with the exact Binomial probability (0.9442883) to 3 significant digits.
  • The approximation also works for finding percentiles: the central 95% region computed both ways yielded the same interval [1938, 2062].
  • The key insight: this is an approximation, not an exact computation, but it can be very accurate when conditions are right.

🔧 Continuity correction technique

🔧 Why correction is needed

  • The Binomial distribution is discrete (takes only integer values), while the Normal distribution is continuous.
  • A naïve Normal approximation may not align well with the discrete bars of the Binomial probability histogram.
  • The excerpt shows an example where uncorrected approximation gave 0.1159989 but the true probability was 0.1595230—a poor match.

🔧 How to apply continuity correction

Continuity correction: associate each discrete value x with the interval [x - 0.5, x + 0.5] under the Normal density.

  • For P(X ≤ 6), use the Normal probability P(X ≤ 6.5) instead of P(X ≤ 6).
  • In the example with Binomial(30, 0.3), the corrected approximation (0.1596193) was much closer to the target (0.1595230) than the uncorrected version (0.1159989).
  • Recommendation: always apply continuity correction when approximating discrete distributions with the Normal.
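
A sketch of the Binomial(30, 0.3) comparison described above (expectation 9, standard deviation sqrt(30 * 0.3 * 0.7)):

    sd.x <- sqrt(30 * 0.3 * 0.7)
    pbinom(6, 30, 0.3)       # exact probability P(X <= 6), about 0.1595
    pnorm(6, 9, sd.x)        # naive Normal approximation, about 0.1160
    pnorm(6.5, 9, sd.x)      # with continuity correction, about 0.1596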

📊 Visual interpretation

  • Each bar in the Binomial histogram is centered at an integer x.
  • The continuity correction uses the area under the Normal curve from x - 0.5 to x + 0.5 to represent that bar.
  • This better captures the discrete probability mass.

⚠️ When Normal approximation fails

⚠️ Small number of trials

  • The Normal approximation is valid when n (number of trials) is large.
  • When n is relatively small, the approximation may not be good even with continuity correction.
  • The excerpt demonstrates this with Binomial(30, 0.3), where the approximation was acceptable but not as precise as with larger n.

⚠️ Extreme success probabilities

  • When p (probability of success) is too close to 0 or too close to 1, Normal approximation may fail.
  • In such cases, the Poisson distribution may provide a better approximation.
  • Don't confuse: large n alone doesn't guarantee good Normal approximation; p must not be too extreme.

⚠️ When to use Poisson instead

Poisson approximation: replace Binomial computations with Poisson distribution computations using the same expectation.

  • Best when n is large and p is small (or, by symmetry, when p is close to 1, in which case one counts failures instead of successes).
  • The excerpt compares three cases with expectation = 2:
    • n = 20, p = 0.1: Normal approximation slightly better
    • n = 200, p = 0.01: Poisson slightly better
    • n = 2000, p = 0.001: Poisson much better (0.8571235 vs true 0.8572138; Normal gave 0.8556984)
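
A sketch of the most extreme case (n = 2000, p = 0.001, expectation 2). The event being approximated is not stated here; P(X ≤ 3), with a continuity correction for the Normal, is an assumption that is consistent with the values quoted above:

    n <- 2000; p <- 0.001
    pbinom(3, n, p)                           # exact, about 0.8572
    ppois(3, n * p)                           # Poisson approximation, about 0.8571
    pnorm(3.5, n * p, sqrt(n * p * (1 - p)))  # Normal with continuity correction, about 0.8557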

🌍 Broader significance

🌍 Beyond the Binomial

  • The real importance is not just approximating Binomial distributions specifically.
  • The Normal distribution can approximate a wide class of distributions, with Binomial being only one example.
  • This means Normal-based computations can be valid even when:
    • Exact computational tools are not available
    • The exact distribution is unknown

🌍 Limitations of universality

  • Not every distribution is well approximated by the Normal.
  • Example from the excerpt: wealth distribution in a population tends to be skewed (more than 50% of people possess less than 50% of wealth).
  • For skewed distributions, the Exponential or similar distributions may be more appropriate than Normal.
  • Don't assume Normal approximation works everywhere; consider the shape and characteristics of the actual distribution.

6.4 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply Normal distribution concepts to real-world problems (lift weight limits) and compare different approximation methods (Normal vs Poisson) for Binomial probabilities.

📌 Key points (3–5)

  • Exercise 6.1 focus: using Normal distribution to assess safety probabilities for lift occupancy limits with different numbers of people.
  • Exercise 6.2 focus: computing a Binomial probability exactly and then comparing three approximation methods (Normal without correction, Normal with continuity correction, Poisson).
  • Common confusion: continuity correction—when approximating discrete distributions (Binomial) with continuous ones (Normal), adding 0.5 can improve accuracy.
  • Practical context: the lift problem demonstrates how probability models inform safety regulations by comparing risk levels under different scenarios.
  • Approximation trade-offs: the excerpt shows that Poisson works better when n is large and p is small, while Normal approximation can be better in other cases.

🏋️ Lift weight capacity problem

🏋️ Problem setup (Exercise 6.1)

  • Scenario: determining maximum occupancy for a lift by comparing probabilities of exceeding a weight limit.
  • Two cases to compare:
    • 8 people: total weight follows Normal distribution with mean 560 kg, standard deviation 57 kg
    • 9 people: total weight follows Normal distribution with mean 630 kg, standard deviation 61 kg
  • The critical threshold is 650 kg in both cases.

📊 Questions asked

The exercise requires four calculations:

Question | What to find
1 | Probability that 8 people exceed 650 kg
2 | Probability that 9 people exceed 650 kg
3 | Central region containing 80% of the distribution for 8 people
4 | Central region containing 80% of the distribution for 9 people

🎯 Purpose

  • Comparing probabilities (questions 1 vs 2) helps assess which occupancy limit is safer.
  • Central regions (questions 3 and 4) identify the typical weight ranges, which inform design decisions.
  • Example: if 9 people have a much higher probability of exceeding 650 kg than 8 people, regulations should limit occupancy to 8.
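
A minimal sketch of how the four quantities could be computed in R, assuming the Normal models stated above (an illustration, not the book's solution):

    1 - pnorm(650, mean = 560, sd = 57)        # P(total weight of 8 people exceeds 650 kg)
    1 - pnorm(650, mean = 630, sd = 61)        # P(total weight of 9 people exceeds 650 kg)
    qnorm(c(0.10, 0.90), mean = 560, sd = 57)  # central 80% region for 8 people
    qnorm(c(0.10, 0.90), mean = 630, sd = 61)  # central 80% region for 9 people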

🔢 Binomial approximation comparison

🔢 Problem setup (Exercise 6.2)

Given: X follows Binomial distribution with n = 27 and p = 0.32

  • Target probability: P(X > 11)
  • The exercise asks for four different ways to compute or approximate this probability.

🧮 Four computation methods

Method | Description | Notes from excerpt
Exact | Direct Binomial calculation | The true value to compare against
Normal (no correction) | Use Normal with same mean and SD | May be less accurate for discrete distributions
Normal (with correction) | Add continuity correction (typically ±0.5) | Often improves accuracy when approximating discrete with continuous
Poisson | Use Poisson with same expectation | Works best when n is large and p is small

⚠️ When to use which approximation

The excerpt's earlier examples show:

  • Normal approximation: better when n is moderate and p is not extremely small (e.g., n = 20, p = 0.1).
  • Poisson approximation: more accurate when n is large and p is small (e.g., n = 2000, p = 0.001).
  • Continuity correction: improves Normal approximation by accounting for the discrete nature of Binomial.

Don't confuse: in Exercise 6.2, n = 27 and p = 0.32 is a moderate case, so the Normal approximation (especially with continuity correction) may perform reasonably well, but Poisson is not ideal here because p is not small.

📐 Continuity correction explained

  • Binomial is discrete (counts: 0, 1, 2, ...), but Normal is continuous.
  • To approximate P(X > 11) with Normal, use P(X > 11.5) instead—this shifts the boundary by 0.5 to better match the discrete steps.
  • Example: the excerpt shows that for n = 20, p = 0.1, the Normal approximation with continuity correction (0.8682238) is closer to the exact Binomial (0.8670467) than the Poisson approximation (0.8571235).
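
A sketch of the four computations for Exercise 6.2 (X ∼ Binomial(27, 0.32), target P(X > 11)); the calls below are one reasonable way to set them up, not the book's solution:

    n <- 27; p <- 0.32
    mu <- n * p; sig <- sqrt(n * p * (1 - p))
    1 - pbinom(11, n, p)     # exact
    1 - pnorm(11, mu, sig)   # Normal, no continuity correction
    1 - pnorm(11.5, mu, sig) # Normal, with continuity correction
    1 - ppois(11, mu)        # Poisson with matched expectation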

🎓 Learning goals

🎓 Skills practiced

  • Applying Normal distribution: calculating probabilities and percentiles for real-world measurements (weights).
  • Comparing approximations: understanding trade-offs between exact computation and different approximation methods.
  • Interpreting results: using probability calculations to inform decisions (e.g., safety regulations).

🔍 Why these exercises matter

  • Exercise 6.1 connects abstract probability theory to concrete safety decisions.
  • Exercise 6.2 reinforces understanding of when different approximations are appropriate and how continuity corrections work.
  • Both exercises require translating problem descriptions into the correct distribution parameters and then performing the calculations.

The Normal Random Variable: Summary

6.5 Summary

🧭 Overview

🧠 One-sentence thesis

The Normal distribution serves as a widely-used bell-shaped model for measurements and provides the foundation for approximating other distributions like the Binomial and Poisson.

📌 Key points (3–5)

  • Normal distribution: A bell-shaped distribution characterized by mean (μ) and variance (σ²), frequently used to model measurements.
  • Standard Normal: The special case Normal(0,1) used for standardized measurements.
  • Approximation methods: The Normal can approximate Binomial computations (using matching expectation and standard deviation), and Poisson can also approximate Binomial (using matching expectation).
  • Common confusion: Mathematical models vs. reality—the Normal distribution's sample space includes all real numbers (even negative), but real measurements (e.g., IQ scores) may only produce positive values.
  • Percentiles: A way to find the value x where a given probability p of the distribution falls at or below x.

📊 Core distributions

📊 Normal Random Variable

Normal Random Variable: A bell-shaped distribution that is frequently used to model a measurement. The distribution is denoted Normal(μ, σ²).

  • Parameters: μ (mean) and σ² (variance) fully describe the distribution.
  • Shape: Bell-shaped, symmetric around the mean.
  • Use case: Modeling measurements in practice.

🎯 Standard Normal Distribution

Standard Normal Distribution: The Normal(0, 1) distribution; the distribution of a standardized Normal measurement.

  • Special case: Mean = 0, variance = 1.
  • Purpose: Standardization allows comparison across different Normal distributions.
  • Relationship: Any Normal measurement can be converted to Standard Normal through standardization.

🔄 Approximation techniques

🔄 Normal Approximation of the Binomial

Normal Approximation of the Binomial: Approximate computations associated with the Binomial distribution with parallel computations that use the Normal distribution with the same expectation and standard deviation as the Binomial.

  • Key idea: Replace Binomial calculations with Normal calculations.
  • Matching requirement: Use a Normal distribution with the same expectation and standard deviation as the original Binomial.
  • When useful: Simplifies probability computations for Binomial random variables.
  • Refinement: Can be applied with or without a continuity correction.

🎲 Poisson Approximation of the Binomial

Poisson Approximation of the Binomial: Approximate computations associated with the Binomial distribution with parallel computations that use the Poisson distribution with the same expectation as the Binomial.

  • Key idea: Replace Binomial calculations with Poisson calculations.
  • Matching requirement: Use a Poisson distribution with the same expectation as the original Binomial (only expectation needs to match, not standard deviation).
  • Alternative approach: Another way to simplify Binomial computations.

📏 Percentiles

📏 What percentiles represent

Percentile: Given a probability p (equivalently, a percent p·100%), the value x is the p·100%-percentile of a random variable X if it satisfies the equation P(X ≤ x) = p.

  • Definition: The value x such that the probability of being at or below x equals p.
  • Interpretation: Divides the distribution at a specific probability level.
  • Example: If p = 0.80, the 80th percentile is the value x where 80% of the distribution falls at or below x.

⚠️ Models vs. reality

⚠️ Discrepancies between theory and practice

The excerpt raises an important question: should mathematical models perfectly match reality, or is a partial match sufficient?

Example from the excerpt: IQ testing

  • Model: IQ scores are modeled as Normal(100, 15²), i.e., mean 100 and standard deviation 15.
  • Theoretical sample space: All real numbers, including negative numbers.
  • Reality: IQ tests produce only positive values.
  • Discrepancy: The model allows impossible values (negative IQ scores).

Two perspectives:

  • Perfect match view: Models should characterize all important features exactly.
  • Partial match view: Models should capture key features; minor discrepancies are acceptable if the model provides insight.

Don't confuse: A model being imperfect does not necessarily make it inappropriate—the question is whether discrepancies affect the model's usefulness for the intended purpose.


Student Learning Objective

7.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

The sampling distribution connects observed data from a specific sample to the random variable concept by treating data summaries (like the sample average) as random variables that vary across all possible samples, and for large samples the Central Limit Theorem shows that the sample average follows a Normal distribution.

📌 Key points (3–5)

  • What sampling distribution is: the distribution of a statistic (e.g., sample average) across all possible samples that could be selected, not just the one observed sample.
  • Data vs random sample: observed data is known and fixed; a random sample represents what will be or could have been selected, with unknown content but known probabilities.
  • Statistic as random variable: a statistic computed from observed data is a number; the same statistic considered before sampling is a random variable with its own sampling distribution.
  • Common confusion: don't confuse the distribution of individual measurements in the population with the sampling distribution of the sample average—the latter describes how averages vary across samples.
  • Central Limit Theorem: for large samples, the sampling distribution of the sample average can be approximated by the Normal distribution, linking the expectation and standard deviation of measurements to those of the sample average.

🔗 From random variable to random sample

🔗 Random variable recap

  • A random variable represents a measurement before it is taken.
  • Example: selecting a random person and measuring height—before selection, height is a random variable with a sample space (all possible heights in the population) and probabilities (relative frequencies).
  • After selection and measurement, you get an observed value, which is no longer random.

🎲 Random sample concept

A random sample: the data that will be selected when taking a sample, prior to the selection itself.

  • Observed data: the actual values from a sample that has been taken; content is known.
  • Random sample: the potential data before sampling; content is unknown but possible evaluations and their probabilities are known.
  • The sample space of the random sample is the collection of all possible evaluations the sample may have.
  • The distribution of the random sample assigns probabilities to these evaluations.
  • Alternatively (past tense): the sample space is what could have been selected, with probabilities for each possibility.

Don't confuse: a random sample is not a single random variable—it is a collection of measurements considered before they are observed.

📊 Statistics and sampling distributions

📊 What a statistic is

A statistic: a function applied to the data.

  • Examples: sample average, sample variance, sample standard deviation, median.
  • Each statistic applies a specific formula to the data.
  • When computed from observed data, a statistic is a known number.

🎯 Statistic as random variable

  • The same formula can be applied to a random sample (before data is observed).
  • Before sampling, we don't know what values the sample will include, so we cannot know the statistic's value.
  • However, we can identify:
    • The sample space of the statistic: all possible values it can take.
    • The sampling distribution of the statistic: the probabilities of each possible value, derived from the random sample's distribution.

Example:

  • Sample average from observed data → a number (e.g., 72.3).
  • Sample average from a random sample → a random variable with its own distribution (the sampling distribution of the sample average).

🔄 Sampling distribution defined

Sampling distribution of a statistic: the distribution of that statistic across all possible samples.

  • Applies to any statistic: median, variance, etc.
  • The sampling distribution describes how the statistic varies from sample to sample.

Don't confuse:

  • The distribution of individual measurements (e.g., heights in the population).
  • The sampling distribution of a statistic (e.g., how sample averages vary across different samples).

🧮 Central Limit Theorem and learning goals

🎓 What students should be able to do

By the end of the chapter, students should:

Goal | Description
Comprehend sampling distribution | Understand the notion and simulate the sampling distribution of the sample average
Relate expectations and standard deviations | Connect the expectation and standard deviation of a measurement to those of the sample average
Apply Central Limit Theorem | Use the theorem to approximate the sampling distribution of sample averages

📈 Central Limit Theorem preview

  • The excerpt states that for large samples, the sampling distribution of the sample average may be approximated by the Normal distribution.
  • This is a mathematical theorem that proves the approximation.
  • It links individual measurement properties (expectation, standard deviation) to the sample average's distribution.

Why it matters: even if individual measurements are not Normally distributed, their average (for large samples) will be approximately Normal, enabling statistical inference.

🌐 Extension to abstract settings

  • Random variables model uncertainty in future measurements, not just in population sampling.
  • Examples: Binomial, Poisson (for counting), Uniform, Exponential, Normal (for continuous measurements).
  • The sampling distribution concept extends to sequences of measurements taken independently.
  • Each measurement is independent of the others, forming a sequence.

Don't confuse: a sequence of independent measurements with a single sample from a population—both use sampling distributions but in different contexts.


The Sampling Distribution

7.2 The Sampling Distribution

🧭 Overview

🧠 One-sentence thesis

The sampling distribution describes the probability distribution of a statistic (such as the sample average) across all possible samples, enabling us to assess how close our estimates are likely to be to the true population parameters before we even collect the data.

📌 Key points (3–5)

  • Random sample vs. observed data: A random sample represents all possible samples that could be selected (with their probabilities) before selection; the observed data is what we actually get after selection.
  • Statistic as random variable: Any statistic (average, variance, median) computed on observed data is a known number, but the same statistic applied to a random sample becomes a random variable with its own distribution—the sampling distribution.
  • How sampling distributions work: The sampling distribution assigns probabilities to all possible values a statistic can take, inherited from the probabilities of the samples themselves.
  • Common confusion: Don't confuse the distribution of the original population (e.g., heights ranging 127–217 cm) with the sampling distribution of the sample average (which is centered at the same place but much less spread out).
  • Why it matters: Sampling distributions let us assess the probability that our estimate will fall within a certain range of the true parameter, even when we don't know the true parameter.

🎲 Random samples and statistics

🎲 What a random sample is

A random sample is the data that will be selected when taking a sample, prior to the selection itself.

  • The content is unknown before selection, just as a random variable's value is unknown before observation.
  • The random sample has a sample space (all possible evaluations the sample could take) and a distribution (the probabilities of those evaluations).
  • After selection, the sample becomes observed data with known values—no longer random.

Example: Before drawing 100 people from a population, the random sample is "all possible sets of 100 people" with their probabilities; after drawing, we have one specific set of 100 heights.

📊 What a statistic is

A statistic is a function of the data.

  • Examples: sample average, sample variance, sample standard deviation, median.
  • Each statistic applies a specific formula to the data.
  • In the context of observed data: the statistic is a known number.
  • In the context of a random sample: the same formula becomes a random variable, because we don't yet know which sample we'll get.

Don't confuse: The sample average of your observed data (e.g., 171.14 cm) is fixed; the sample average as a random variable (before you collect data) can take many possible values.

🌊 The sampling distribution concept

🌊 Definition and construction

The sampling distribution of a statistic is the distribution of that statistic when considered as a random variable across all possible samples.

  • Each possible sample has a probability (from the random sample distribution).
  • Apply the statistic formula to each possible sample → each outcome of the statistic inherits the probability of the sample that produced it.
  • The collection of all these outcomes and their probabilities is the sampling distribution.

Example: For samples of size 100 from a population, each sample produces a sample average. The sampling distribution of the sample average is the probability distribution of all those averages.

🔢 Sampling distribution in different contexts

The excerpt describes two settings:

Context | Description | Example
Population sampling | Taking a sample from a finite population | Drawing 100 heights from a population of 100,000 people
Theoretical models | Taking independent measurements from a probability distribution | 64 independent measurements from Binomial(10, 0.5)

  • In both cases, the sampling distribution describes how a statistic (like the average) behaves across all possible samples.
  • The term "sample" can mean a subset from a population or a sequence of independent measurements.

🎯 Why sampling distributions matter

  • Probabilistic assessment: We can ask, "What is the probability that my sample average will be within 1 cm of the true population mean?" before collecting data.
  • Inference under uncertainty: In realistic settings, we don't know the true population parameter, but the sampling distribution lets us assess how far our estimate is likely to be from the target.
  • The excerpt emphasizes: "The sampling distribution is the vehicle that may enable us to address these questions."

📐 Properties and behavior of sampling distributions

📐 Center and spread

The excerpt illustrates with a population of heights:

  • Population distribution: heights range from 127 to 217 cm, centered around 170 cm, with standard deviation about 11.23 cm.
  • Sampling distribution of the sample average (for samples of size 100): values range essentially from 166 to 174 cm, centered around 170 cm, with standard deviation about 1.12 cm.

Key observation:

  • The expectation (center) of the sample average equals the expectation of the population.
  • The standard deviation of the sample average is much smaller (about 10 times smaller in this example) than the standard deviation of the population.
  • "This result is not accidental and actually reflects a general phenomena."

Don't confuse: The sample average is less spread out than individual measurements, even though both are centered at the same location.

🧮 Computing probabilities from the sampling distribution

The excerpt shows how to compute the probability that the sample average falls within 1 cm of the population mean:

  • Simulate a large number of samples (e.g., 100,000).
  • For each sample, compute the statistic (e.g., the average).
  • The proportion of simulated statistics that satisfy the condition approximates the true probability.
  • In the example, about 62.6% of sample averages fall within 1 cm of the population mean.

Why simulation? The total number of possible samples of size 100 from 100,000 individuals is on the order of 10^342, which cannot be handled by any computer. Simulating a large number of samples approximates the true sampling distribution.
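
A minimal simulation sketch in the spirit of the excerpt, assuming the population heights are stored in a numeric vector named pop.height (the name is illustrative):

    # approximate the sampling distribution of the average of samples of size 100
    X.bar <- rep(0, 10^5)
    for (i in 1:10^5) {
      X.bar[i] <- mean(sample(pop.height, 100))   # average of one simulated sample
    }
    mean(abs(X.bar - mean(pop.height)) <= 1)      # proportion of averages within 1 cm of the mean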

🔄 Examples and applications

🔄 Sampling from a population

The excerpt uses an imaginary population file with 100,000 individuals' heights:

  • First sample: 100 individuals selected, sample average = 171.14 cm (true population mean = 170.035 cm).
  • Second sample: different 100 individuals, sample average = 169.3 cm.
  • Third sample: yet another set, sample average = 171.26 cm.
  • Each sample gives a different estimate, but all are reasonably close to the true mean.

Insight: "Was it pure luck that we got such good estimates?" The sampling distribution answers this by showing the probability of getting estimates within a certain range.

🎲 Theoretical model example

Example from the excerpt (Example 7.1):

  • A poll to estimate the proportion p of supporters in a population.
  • Sample size n = 300.
  • X = number of supporters in the sample has a Binomial(300, p) distribution—this is the sampling distribution of X.
  • The sample proportion X/300 is used to estimate p.
  • The sampling distribution of X/300 can be used to assess the discrepancy between the estimate and the true parameter.

Another example: 64 independent measurements from Binomial(10, 0.5), denoted X₁, X₂, …, X₆₄. The sample average is the sum of these 64 measurements divided by 64. The sampling distribution of this average can be approximated by simulation.
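
A sketch of the second example: simulating the sampling distribution of the average of 64 independent Binomial(10, 0.5) measurements:

    X.bar <- rep(0, 10^5)
    for (i in 1:10^5) {
      X.bar[i] <- mean(rbinom(64, 10, 0.5))   # average of 64 simulated measurements
    }
    quantile(X.bar, c(0.025, 0.975))          # interval containing the central 95% of the averages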

🛠️ Practical note on simulation

The excerpt includes a technical note:

  • Simulating sampling distributions can be computationally intensive.
  • Running 100,000 iterations may take a long time depending on your computer.
  • You can reduce the number of iterations (e.g., to 10,000) for faster but less accurate results—"The results of the simulation will be less accurate, but will still be meaningful."

Law of Large Numbers and Central Limit Theorem

7.3 Law of Large Numbers and Central Limit Theorem

🧭 Overview

🧠 One-sentence thesis

The Law of Large Numbers and the Central Limit Theorem are mathematical theorems that describe how the sampling distribution of the sample average becomes more concentrated and approximately Normal as sample size increases, enabling practical probability computations even when the original measurement distribution is unknown.

📌 Key points (3–5)

  • Law of Large Numbers: as sample size grows, the sampling distribution of the sample average becomes more concentrated around the expectation (variance decreases as V(X)/n).
  • Central Limit Theorem (CLT): the standardized sample average converges to the standard Normal distribution as sample size increases, regardless of the original measurement distribution.
  • Practical application: CLT allows probability computations using the Normal distribution when you know only the expectation, variance, and sample size—not the actual measurement distribution.
  • Common confusion: CLT applies only when sample size is "large enough"; for small samples or highly skewed distributions, the Normal approximation may be poor and misleading.
  • How to tell if sample size is sufficient: symmetric distributions (e.g., Uniform) converge quickly (n=10 may suffice); skewed distributions (e.g., Exponential) require much larger samples (n=1000) for good Normal approximation.

📐 Law of Large Numbers

📐 What it states

The Law of Large Numbers states that, as the sample size becomes larger, the sampling distribution of the sample average becomes more and more concentrated about the expectation.

  • "Concentrated" means the variance shrinks and values cluster tightly around the mean.
  • The expectation of the sample average remains E(X̄) = E(X) for all sample sizes.
  • The variance of the sample average decreases according to V(X̄) = V(X)/n.

🔍 Demonstration with Uniform distribution

The excerpt simulates Uniform(3,7) measurements for three sample sizes: n=10, n=100, n=1000.

Sample size | Expectation of X̄ | Variance of X̄
n = 10 | 5 (unchanged) | ≈ 0.134
n = 100 | 5 (unchanged) | ≈ 0.013
n = 1000 | 5 (unchanged) | ≈ 0.001

  • Variance of a single Uniform(3,7) measurement: V(X) = (7−3)²/12 ≈ 1.333.
  • Variance shrinks by factor of n: V(X̄) = 1.333/10 ≈ 0.133, 1.333/100 ≈ 0.013, etc.
  • Smaller variance → distribution is more tightly packed around expectation 5.
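
A sketch that reproduces one row of the table (n = 100), simulating Uniform(3, 7) measurements:

    X.bar <- rep(0, 10^5)
    for (i in 1:10^5) X.bar[i] <- mean(runif(100, 3, 7))
    mean(X.bar)   # close to the expectation 5
    var(X.bar)    # close to (7 - 3)^2 / 12 / 100, i.e. about 0.013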

🎯 Why it matters

  • Variance measures spread around the expectation.
  • The smaller the variance, the more concentrated the distribution.
  • Larger sample size → more reliable estimate of the true expectation.

🔔 Central Limit Theorem (CLT)

🔔 What it states

The Central Limit Theorem states that, with the increase in sample size, the sample average converges (after standardization) to the standard Normal distribution.

  • "Standardization" means forming Z = (X̄ − E(X̄)) / √V(X̄) = √n(X̄ − E(X)) / √V(X).
  • The standardized average magnifies the shrinking deviation by √n.
  • The distribution of Z approaches the standard Normal distribution N(0,1) as n grows.

🧮 The standardized sample average

The excerpt defines:

Z = (X̄ − E(X̄)) / √V(X̄)

which can be rewritten as:

Z = √n(X̄ − E(X)) / √V(X)

  • Numerator: the deviation X̄ − E(X) shrinks as n increases.
  • √n magnifies this shrinking deviation to keep the scale stable.
  • The result is that Z has a distribution that stabilizes to N(0,1).

📊 Key insight

  • CLT applies regardless of the original measurement distribution.
  • You do not need to know the actual distribution of X; you only need E(X), V(X), and n.
  • This makes CLT extremely powerful for practical probability computations.

🧪 How sample size affects convergence

🧪 Uniform distribution example

The excerpt simulates Uniform(3,7) measurements and plots the density of the standardized average for n=10, 100, 1000 alongside the standard Normal density.

  • n=10: the density curve almost overlaps the Normal curve.
  • n=100, n=1000: even closer match.
  • Conclusion: for symmetric distributions like Uniform, CLT approximation is good even for small n.

⚠️ Exponential distribution example

The excerpt repeats the simulation with Exponential(0.5) measurements (E(X)=2, V(X)=4).

Sample size | Match with Normal
n = 10 | Clear distinction; red curve differs from black Normal curve
n = 100 | Better match (green curve), but not perfect
n = 1000 | Very good agreement (blue curve)

  • Why the difference? Exponential distribution is highly skewed (not symmetric).
  • Skewed distributions require much larger sample sizes for the CLT approximation to be accurate.
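
A sketch of the Exponential comparison: simulate the standardized average for a given n and overlay the standard Normal density (Exponential(0.5), so E(X) = 2 and V(X) = 4, as stated above):

    n <- 100
    z <- rep(0, 10^4)
    for (i in 1:10^4) {
      x <- rexp(n, rate = 0.5)
      z[i] <- sqrt(n) * (mean(x) - 2) / sqrt(4)   # standardized sample average
    }
    plot(density(z))                       # simulated density of the standardized average
    curve(dnorm(x), add = TRUE, lty = 2)   # standard Normal for comparison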

🚨 Don't confuse: "large enough" depends on the distribution

  • Symmetric, well-behaved distributions → CLT works well even for n=10.
  • Skewed or heavy-tailed distributions → may need n=100 or n=1000 for good approximation.
  • Warning from the excerpt: "When the sample is small a careless application of the Central Limit Theorem may produce misleading conclusions."

🛠️ Applying the Central Limit Theorem

🛠️ What you need to know

To use CLT for probability computations, you need:

  1. E(X): the expectation of a single measurement.
  2. V(X) or SD(X): the variance (or standard deviation) of a single measurement.
  3. n: the sample size.

Then the sampling distribution of X̄ is approximately Normal with:

  • Expectation: E(X̄) = E(X)
  • Variance: V(X̄) = V(X)/n
  • Standard deviation: SD(X̄) = √(V(X)/n)

📏 Example: central interval for Binomial average

The excerpt refers to an earlier example (Subsection 7.2.3):

  • Goal: find the central interval containing 95% of the sampling distribution of a Binomial average.
  • Method: compute the 2.5% and 97.5% percentiles of the Normal distribution with the same E(X̄) and SD(X̄).
  • Result: boundaries from the Normal approximation agreed well with boundaries from simulation.

Example: if E(X̄)=5 and SD(X̄)=0.2, the 95% central interval is approximately [4.61, 5.39] using Normal percentiles.
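
In R this is a one-line computation for the illustrative values E(X̄) = 5 and SD(X̄) = 0.2:

    qnorm(c(0.025, 0.975), mean = 5, sd = 0.2)   # approximately 4.61 and 5.39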

🎓 When to apply

  • Whenever you need probability computations for the sample average.
  • Especially useful when the actual distribution of X is unknown or complicated.
  • More examples are provided in exercises and later chapters.

⚠️ Treat with caution

"With all its usefulness, one should treat the Central Limit Theorem with a grain of salt. The approximation may be valid for large samples, but may be bad for samples that are not large enough."

  • Always check whether your sample size is large enough for your distribution.
  • For skewed or unusual distributions, verify the approximation with simulation if possible.
  • Small sample + CLT = potential for misleading conclusions.

📚 Formulas summary

📚 Key relationships

Quantity | Formula | Meaning
Expectation of X̄ | E(X̄) = E(X) | Sample average has same expectation as a single measurement
Variance of X̄ | V(X̄) = V(X)/n | Variance shrinks by a factor of n
Standardized X̄ | Z = (X̄ − E(X)) / √(V(X)/n) | Magnifies the deviation by √n
CLT conclusion | Z ≈ N(0, 1) for large n | Standardized average is approximately standard Normal

🔑 Why these formulas matter

  • E(X̄) = E(X): the average is an unbiased estimator of the true expectation.
  • V(X̄) = V(X)/n: larger samples give more precise estimates (smaller variance).
  • Standardization: allows comparison across different scales and sample sizes.
  • CLT: bridges any distribution to the Normal, enabling universal probability computations.

7.4 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply the Central Limit Theorem and the Law of Large Numbers to compute probabilities and intervals for the sample average of uniformly distributed measurements.

📌 Key points (3–5)

  • Problem setup: 25 random hits uniformly distributed along a 10 nm detector; compute properties of the average location.
  • What to find: expectation, standard deviation, probability of the average falling in a region, and a central interval containing 99% of the distribution.
  • Key tool: the Central Limit Theorem approximates the distribution of the sample average as Normal when sample size is large.
  • Common confusion: the exercises ask about the average location (a statistic), not individual hit locations—don't confuse the sampling distribution of the average with the distribution of a single measurement.
  • Why it matters: these calculations illustrate how the sampling distribution of the sample average behaves and how to use the Normal approximation in practice.

🎯 Problem context

🎯 The scenario

  • A linear detector is 10 nm long.
  • 25 hits occur at random locations, uniformly distributed along the detector.
  • Locations are measured from a specified endpoint.
  • The average of the 25 locations is computed.

🔍 What the exercises ask

The exercises focus on the sampling distribution of the sample average:

  1. Expectation of the average location.
  2. Standard deviation of the average location.
  3. Probability that the average location falls in the left-most third of the detector (using the Central Limit Theorem).
  4. A central interval of the form 5 ± c that contains 99% of the distribution of the average (using the Central Limit Theorem).

Don't confuse: the questions are about the average of 25 measurements, not about a single hit location.

📐 Key formulas and concepts

📐 Expectation and variance of the sample average

The excerpt provides summary formulas:

Expectation of the sample average: E(X̄) = E(X)

Variance of the sample average: Var(X̄) = Var(X) / n

  • X̄ denotes the sample average.
  • X denotes a single measurement.
  • n is the sample size (here, n = 25).
  • The standard deviation of the sample average is the square root of the variance: SD(X̄) = SD(X) / √n.

Example: if individual hits have expectation 5 nm and standard deviation σ, then the average of 25 hits has expectation 5 nm and standard deviation σ / √25 = σ / 5.

🔔 The Central Limit Theorem

The Central Limit Theorem: A mathematical result regarding the sampling distribution of the sample average. States that the distribution of the average is approximately Normal when the sample size is large.

  • Even if individual measurements are not Normal (here, they are uniform), the average of many measurements is approximately Normal.
  • This allows us to use the Normal distribution to compute probabilities and intervals for the sample average.
  • The excerpt notes that the sample size n goes to infinity in the theorem, but in practice the approximation is used for finite (but "large enough") sample sizes.

Don't confuse: the Central Limit Theorem applies to the sample average, not to individual measurements. Individual hits are uniformly distributed; the average of 25 hits is approximately Normal.

🧮 Applying the theorem

🧮 Computing probabilities

  • Question 3 asks for the probability that the average location is in the left-most third of the detector.
  • The left-most third of a 10 nm detector is the interval [0, 10/3] nm.
  • Use the Central Limit Theorem: approximate the distribution of the average as Normal with mean E(X̄) and standard deviation SD(X̄).
  • Convert to a standard Normal variable and use Normal tables or software.

🧮 Finding central intervals

  • Question 4 asks for a central interval 5 ± c that contains 99% of the distribution of the average.
  • The center is 5 nm (the midpoint of the detector, which is the expectation for a uniform distribution).
  • Use the Central Limit Theorem: the average is approximately Normal.
  • Find the value c such that P(5 − c ≤ X̄ ≤ 5 + c) = 0.99.
  • This corresponds to finding the 99.5th percentile of the standard Normal distribution (since 0.5% is in each tail) and scaling by SD(X̄).

Example: if SD(X̄) = s, then c ≈ 2.576 × s (since the 99.5th percentile of the standard Normal is approximately 2.576).
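
A sketch of these two Normal-approximation steps, assuming the hit locations are Uniform(0, 10) so that E(X) = 5 and V(X) = 100/12 (an illustration of the setup, not a worked solution):

    n <- 25
    sd.bar <- sqrt((10 - 0)^2 / 12 / n)   # standard deviation of the average
    pnorm(10/3, mean = 5, sd = sd.bar)    # P(average lies in the left-most third), via the CLT
    qnorm(0.995) * sd.bar                 # the value c in the 99% central interval 5 +/- c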

🧠 Conceptual foundations

🧠 The Law of Large Numbers

The Law of Large Numbers: A mathematical result regarding the sampling distribution of the sample average. States that the distribution of the average of measurements is highly concentrated in the vicinity of the expectation of a measurement when the sample size is large.

  • As sample size increases, the sample average becomes more tightly concentrated around the true expectation.
  • This is reflected in the formula Var(X̄) = Var(X) / n: as n grows, the variance (and standard deviation) of the average shrinks.

🧠 Practical use of limit theorems

The excerpt includes a discussion question:

  • Some say the Law of Large Numbers and the Central Limit Theorem are useless because they deal with infinite sample sizes, but real samples are finite.
  • The excerpt counters: these theorems are used in practice to make statements about the sampling distribution of the sample average (e.g., computing central regions with 95% or 99% coverage).
  • The Normal distribution is used in these computations, justified by the Central Limit Theorem.

Don't confuse: "large" sample size does not mean infinite; it means "large enough" for the Normal approximation to be accurate. In practice, n = 25 is often considered sufficient for the Central Limit Theorem to apply, especially if the underlying distribution is not too skewed.

📚 Related definitions

📚 Key terms from the glossary

Term | Definition (from excerpt)
Random Sample | The probabilistic model for the values of measurements in the sample, before the measurement is taken.
Sampling Distribution | The distribution of a random sample.
Sampling Distribution of a Statistic | A statistic is a function of the data. When applied to a random sample, it becomes a random variable; its distribution is the sampling distribution.
Sampling Distribution of the Sample Average | The distribution of the sample average, considered as a random variable.

Don't confuse: the data (observed values) vs. the random sample (the probabilistic model before measurement). The exercises ask about the random sample (the model), not about a specific set of 25 observed hits.

📚 Variability and statistics

The excerpt notes:

  • Statistics deals with variability in different forms and for different reasons.
  • Descriptive statistics examines the distribution of observed data (plots, tables, summaries like mean and standard deviation).
  • Probability examines the distribution of all data sets that could have been observed (the frame of reference is the probabilistic model, not the data at hand).

In these exercises, the frame of reference is probability: we model the 25 hits as a random sample and compute properties of the sample average under that model.


Summary

7.5 Summary

🧭 Overview

🧠 One-sentence thesis

The first part of the book introduces the fundamentals of statistics and probability as tools for making inferences about populations from sample data, with a focus on understanding and handling different forms of variability.

📌 Key points (3–5)

  • Core purpose: Statistics uses sample data to make inferences about population parameters, with probability theory assessing the reliability of those inferences.
  • Central challenge: All statistical work deals with variability—understanding its source (data itself vs. all possible samples) is essential.
  • Common confusion: Descriptive statistics examines the distribution of observed data, while probability examines the distribution of all possible samples (the sample space); the observed sample is just one realization among many.
  • Key tools: Data frames store measurements; plots, tables, and numerical summaries describe distributions; probabilistic models (Binomial, Poisson, Normal, etc.) provide the theoretical framework for inference.
  • Practical application: Sample characteristics (e.g., sample mean) estimate population parameters (e.g., population mean), and sampling distribution variability quantifies estimation accuracy.

📊 Two forms of variability

📊 Descriptive statistics: variability in the data

Descriptive statistics examines the distribution of data. The frame of reference is the data itself.

  • What it measures: How the observed measurements vary within the collected sample.
  • Tools used: Bar plots, histograms, box plots; frequency and cumulative frequency tables; mean, median, standard deviation.
  • Purpose: Summarize and understand the distribution of the given data set.
  • Example: A sample of 100 measurements shows a mean of 50 and a standard deviation of 10—these describe that specific sample.

🎲 Probability: variability across all possible samples

In probability, the frame of reference is not the data at hand but, instead, it is all data sets that could have been sampled (the sample space of the sampling distribution).

  • What it measures: How statistics (functions of the sample) would vary if we repeated the sampling process many times.
  • Key insight: The observed sample is only one realization within the sample space; it has no special role compared to all other potential samples.
  • Tools used: Similar plots, tables, and summaries as descriptive statistics, but applied to the distribution of possible sample outcomes.
  • Don't confuse: Probability does not directly describe the data you have; it describes the behavior of the sampling process that generated your data.

🔗 How probability enables inference

🔗 The indirect but essential relationship

  • The connection between probabilistic variability and observed data is indirect but forms the basis for statistical inference.
  • How it works:
    • Use sample characteristics (e.g., sample mean) to estimate population parameters (e.g., population mean).
    • Use the probabilistic description of the sampling distribution to assess how reliable that estimate is.
  • Example: If the sample mean is 50, probability theory tells us how much that estimate might vary across different samples, allowing us to quantify accuracy.

📐 Formulas for the sample average

The excerpt provides two key formulas:

Formula | Meaning
Expectation of sample average: E(X̄) = E(X) | The expectation of the sample average equals the population average
Variance of sample average: V(X̄) = V(X)/n | The variability of the sample average decreases as the sample size n increases

  • Why this matters: As sample size grows, the sample average becomes more concentrated around the population average (lower variance).
  • This mathematical relationship justifies using the sample average to estimate the population average.

🗂️ Data structure and sampling

🗂️ Data frames

  • Structure: Columns represent measured variables; rows represent observations from the selected sample.
  • Variable types: Numeric (discrete or continuous) and factors (categorical).
  • The excerpt mentions learning to produce data frames and read data into R for analysis.

🎯 Simple random sampling

  • Core principle: A sample must represent the population it came from.
  • Focus of this book: Simple random sampling, where each subset of a given size has equal probability of selection.
  • Important caveat: Other sampling designs exist and may be more appropriate in specific applications.

🧮 Probabilistic models

🧮 Theoretical modeling in statistics

Statistics, like many other empirically driven forms of science, uses theoretical modeling for assessing and interpreting observational data. In statistics this modeling component usually takes the form of a probabilistic model for the measurements as random variables.

  • Purpose: Provide a structured framework for understanding how measurements behave as random variables.
  • Examples encountered: Binomial, Poisson, Uniform, Exponential, and Normal distributions.
  • Practical note: Many more models exist in the literature; R functions are available for applying them when appropriate.

🔍 Simple sampling model

  • Assumption: Each subset of a given size from the population has equal probability of being selected as the sample.
  • This is the foundational model for simple random sampling.

🤔 Addressing a common objection

🤔 "Infinite sample sizes are impractical"

The excerpt poses a question: Some people say the Law of Large Numbers and Central Limit Theorem are useless because they deal with sample sizes going to infinity, but real samples are finite.

The excerpt's response:

  • These theorems are used to solve practical problems despite the infinity assumption.
  • Example given: When making statistical inference, we often need to describe the sampling distribution of the sample average (e.g., find the central region containing 95% of the distribution).
    • The Normal distribution is used in the computation.
    • The justification is the Central Limit Theorem.
  • Merit: Provides a usable approximation for finite samples.
  • Weakness: The excerpt asks readers to identify weaknesses themselves, acknowledging that applying abstract theory to finite samples involves trade-offs.

🎯 Integration and next steps

🎯 Purpose of this overview

  • Relate the concepts and methods from the first part of the book to each other.
  • Put them in perspective before moving to the second part (statistical procedures for analyzing data).

🧩 Skills to integrate

The excerpt lists three learning objectives:

  • Better understanding of the relation between descriptive statistics, probability, and inferential statistics.
  • Distinguishing between different uses of the concept of variability.
  • Integrating tools from previous chapters to solve complex problems.

Student Learning Objective

8.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

Statistical inference uses probabilistic models to extrapolate from sampled data to the entire population, with the Central Limit Theorem enabling practical approximations of the sample average's distribution regardless of the underlying measurement distribution.

📌 Key points (3–5)

  • Indirect relation between probability and data: the observed sample is only one realization among all possible realizations in the sample space; probabilistic analysis considers all potential outcomes, not just the observed one.
  • What statistics are: functions of sampled data used for inference; when computed on random samples, they become random variables with their own distributions.
  • Central Limit Theorem's power: the sample average's distribution can be approximated by the Normal distribution for large enough samples, depending only on the measurement's expectation and variance, not its exact distribution.
  • Common confusion: the distribution of a statistic (e.g., sample average) is not the same as the distribution of individual measurements—e.g., averaging Uniform measurements does not produce a Uniform distribution.
  • Why it matters: enables estimation of population parameters and quantification of estimation accuracy using the variability of the sampling distribution.

🎲 Probabilistic modeling vs observed data

🎲 The sample space perspective

  • The excerpt emphasizes that probabilistic analysis considers the sample space—all possible samples that could be drawn.
  • The actual observed sample is just one realization; it has no special role compared to other potential realizations in the probabilistic context.
  • This may seem counterintuitive: why analyze what could have been sampled rather than what was sampled?

🔗 The indirect but crucial relation

The relation between probabilistic variability and the observed data is not direct, but this indirect relation is the basis for making statistical inference.

  • How it works: characteristics of the observed data (e.g., sample average) are used to extrapolate to the entire population.
  • Why probability matters: the probabilistic description of the sampling distribution quantifies how reliable that extrapolation is.
  • Example: estimate the population average from the sample average; use the sampling distribution's variability to assess the accuracy of this estimate.
  • Don't confuse: "analyzing the sample" (descriptive) vs "using the sample to infer about the population" (inferential)—the latter requires probabilistic modeling.

📐 Statistics as random variables

📐 What a statistic is

A statistic is a function of sampled data that is used for making statistical inference.

  • When you compute a statistic (e.g., the sample average) from a random sample, the result is itself a random variable from a probabilistic viewpoint.
  • The statistic's distribution depends on the measurements' distribution but is not identical to it.

🔄 Distribution transformation

  • The excerpt gives a clear example: the average of a sample from the Uniform distribution does not follow the Uniform distribution.
  • In general, the relation between a measurement's distribution and a statistic's distribution can be complex.
  • This is a key point for avoiding confusion: computing a function of random variables changes the distribution.

Concept | What it is | Distribution
Individual measurement | One data point from the population | Follows the population distribution (e.g., Uniform, Exponential, Normal)
Statistic (e.g., sample average) | Function of multiple measurements | Has its own distribution, different from the measurement distribution

🌟 The Central Limit Theorem

🌟 What it says

The Central Limit Theorem provides an approximation for the distribution of the sample average that typically improves as sample size increases.

Three key facts:

  1. Expectation: the sample average's expectation equals the expectation of a single measurement.
  2. Variance: the sample average's variance equals the variance of a single measurement divided by the sample size.
  3. Shape: the sample average's distribution can be approximated by the Normal distribution (with the expectation and standard deviation from facts 1 and 2).

🎯 Why it's powerful

  • The approximation is valid for practically any distribution of the measurement.
  • The statistic's distribution depends on the underlying measurement distribution only through expectation and variance, not through other characteristics.
  • Example: whether individual measurements are Uniform, Exponential, or any other shape, the sample average will be approximately Normal if the sample is large enough.

📏 Extensions beyond the sample average

  • The theorem applies to quantities proportional to the sample average.
  • Sum of the sample: obtained by multiplying the sample average by the sample size n, so the theorem approximates the distribution of sums.
  • The theorem generalizes further to smooth functions of the sample average, increasing its applicability.

⚠️ Practical requirement

  • The approximation works "provided that the sample size is not too small."
  • The excerpt states that for practical computations, you only need to figure out the expectation and variance of the underlying measurement; the exact distribution is irrelevant.

🧮 Application framework

🧮 What you need for inference

The excerpt outlines the practical steps for using the Central Limit Theorem:

  1. Identify the measurement distribution: determine the expectation and variance of a single measurement.
  2. Translate to the sample average: use the relations E(X̄) = E(X) and V(X̄) = V(X)/n.
  3. Apply Normal approximation: compute probabilities or percentiles using the Normal distribution with the sample average's expectation and standard deviation.

📊 Example scenario structure

The excerpt introduces an example (stress scores on a college campus) to demonstrate:

  • Measurements follow a Uniform distribution from 1 to 5.
  • Sample size is 75 students.
  • Questions include: probability that the average is less than 2, the 90th percentile for the average, probability that the total is less than 200, and the 90th percentile for the total.

Key formulas mentioned:

  • For Uniform(a, b): expectation = (a + b)/2, variance = (b − a)² / 12.
  • For the sample average: expectation = E(X), variance = V(X)/n.

Don't confuse: questions about the average vs questions about the total—both can be answered using the Central Limit Theorem, but the total requires multiplying by the sample size.
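
To see this concretely, the following minimal R sketch (not from the text; the seed and the 100,000 repetitions are choices made here for illustration) simulates the sampling distribution of the average of 75 Uniform(1, 5) stress scores and compares it with the values the Central Limit Theorem predicts.

set.seed(1)                      # arbitrary seed, for reproducibility
n <- 75
X.bar <- rep(0, 10^5)            # storage for the simulated sample averages
for (i in 1:10^5) {
  X <- runif(n, 1, 5)            # one sample of 75 stress scores
  X.bar[i] <- mean(X)            # its sample average
}
mean(X.bar)                      # close to E(X) = (1 + 5)/2 = 3
sd(X.bar)                        # close to sqrt(((5 - 1)^2/12)/75), about 0.133
hist(X.bar)                      # bell-shaped, unlike the flat Uniform density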

43

An Overview of Statistics and the Central Limit Theorem

8.2 An Overview

🧭 Overview

🧠 One-sentence thesis

The Central Limit Theorem allows us to approximate the distribution of the sample average using the Normal distribution, regardless of the underlying measurement distribution, provided we know the expectation and variance and the sample size is large enough.

📌 Key points (3–5)

  • What a statistic is: a function of sampled data used for statistical inference; when computed on a random sample, it is itself a random variable with its own distribution.
  • The key insight of the Central Limit Theorem: the sample average's distribution can be approximated by the Normal distribution, with expectation equal to the measurement's expectation and variance equal to the measurement's variance divided by sample size.
  • What makes it powerful: the approximation works for practically any underlying measurement distribution and depends only on expectation and variance, not other characteristics.
  • Common confusion: the distribution of the sample average is not the same as the distribution of individual measurements (e.g., averaging Uniform measurements does not produce a Uniform distribution).
  • Why it matters: enables practical probability calculations for sample averages, sums, and smooth functions of averages without needing to know the exact measurement distribution.

📊 Statistics as random variables

📊 What a statistic is

A statistic is a function of sampled data that is used for making statistical inference.

  • When you compute a statistic (like an average) from a random sample, the result is itself a random variable from a probabilistic point of view.
  • The statistic has its own distribution, which depends on but is not identical to the distribution of the individual measurements.

🔄 The distribution mismatch

  • Key point: the distribution of a statistic is generally different from the distribution of a single measurement.
  • Example: if you sample from a Uniform distribution and compute the average, that average does not follow a Uniform distribution.
  • The relationship between the measurement distribution and the statistic distribution can be complex in general.
  • Don't confuse: "distribution of measurements" ≠ "distribution of a statistic computed from those measurements."

🎯 The Central Limit Theorem

🎯 What the theorem says

The Central Limit Theorem provides a simple approximation for the sample average, especially for large samples:

  • Expectation: E(sample average) = E(single measurement)
  • Variance: V(sample average) = V(single measurement) / n (where n is sample size)
  • Distribution shape: the sample average can be approximated by the Normal distribution with the above expectation and variance

🌟 Why it's remarkable

  • The approximation is valid for practically any distribution of the underlying measurement.
  • The sample average's distribution depends on the measurement distribution only through expectation and variance, not through other characteristics (e.g., skewness, shape).
  • The approximation typically improves as sample size increases.

🔗 Extensions beyond the average

The theorem applies to more than just the sample average:

| Quantity | How it relates | Why the theorem applies |
| --- | --- | --- |
| Sample sum | Sum = average × n | Proportional to the sample average |
| Smooth functions of the average | f(average) | Generalization of the theorem |

  • Because the sum is just the average multiplied by sample size, the Central Limit Theorem can approximate the distribution of sums.
  • The theorem can be generalized much further, including to smooth functions of the sample average, greatly increasing its applicability.

🧮 Practical application example

🧮 The stress score problem

The excerpt provides a concrete example to demonstrate how to use the Central Limit Theorem:

Setup:

  • Stress scores follow a continuous Uniform distribution from 1 to 5
  • Sample size: 75 students
  • Goal: compute probabilities and percentiles for the sample average and total

📐 The solution approach

Step 1: Identify the measurement distribution

  • X (stress score of a random student) ~ Uniform(1, 5)

Step 2: Compute expectation and variance of a single measurement

  • Use formulas for Uniform distribution:
    • E(X) = (a + b) / 2
    • V(X) = (b - a)² / 12

Step 3: Translate to the sample average

  • E(sample average) = E(X)
  • V(sample average) = V(X) / n

Step 4: Apply the Central Limit Theorem

  • Once you have the expectation and variance of the sample average, you can use the Normal distribution to compute probabilities and percentiles.

🎲 Types of questions answered

The example shows four types of calculations:

  1. Probability that the average is less than a value
  2. A percentile for the average
  3. Probability that the total (sum) is less than a value
  4. A percentile for the total

All four rely on the same principle: the Central Limit Theorem allows Normal distribution approximations for both averages and sums.

44

Integrated Applications

8.3 Integrated Applications

🧭 Overview

🧠 One-sentence thesis

The Central Limit Theorem allows us to compute probabilities for sample averages using the Normal distribution regardless of the underlying measurement distribution, provided we know the expectation and variance and the sample size is not too small.

📌 Key points (3–5)

  • Core message of CLT: For sample averages, we can use Normal distribution approximations; only the expectation and variance of the underlying measurement matter, not its exact distribution.
  • How to apply CLT: Calculate the expectation and variance of a single observation, translate them to the sample average (expectation stays the same, variance divides by n), then use Normal distribution functions.
  • Sample average vs total sum: The total sum is less than a threshold if and only if the average is less than that threshold divided by n; percentiles scale by multiplying by n.
  • Common confusion: When dealing with integer-valued data, the continuity correction can improve Normal approximation accuracy (e.g., "less than 200" becomes "less than or equal to 199.5" for continuous approximation).
  • Practical applications: CLT enables quality control decisions, probability calculations for various distributions (Uniform, Exponential), and comparison of statistical estimators through simulation.

📐 Applying the Central Limit Theorem

📐 The CLT workflow

The excerpt emphasizes a systematic approach:

  1. Identify the distribution and parameters of a single measurement X.
  2. Compute E(X) and V(X) using distribution-specific formulas.
  3. Translate to the sample average: E(X̄) = E(X) and V(X̄) = V(X)/n.
  4. "Forget about" the original distribution and use only Normal distribution functions.

Central Limit Theorem application principle: After computing expectation μ and standard deviation σ for the sample average, the distribution of the sample average is approximately Normal(μ, σ²), regardless of the original distribution.

🔄 From single observation to sample average

  • Expectation: The expectation of the sample average equals the expectation of a single observation.
  • Variance: The variance of the sample average equals the variance of a single observation divided by the sample size n.
  • Standard deviation: The standard deviation of the sample average is the standard deviation of a single observation divided by the square root of n.
  • Example: If a single stress score has mean 3 and variance 4/3, then for n=75, the sample average has mean 3 and standard deviation √(4/3)/√75 ≈ 0.133.

🔗 Connecting average and total sum

The excerpt shows a key relationship:

  • The total sum of n observations equals n times the sample average.
  • Probability translation: P(Total < threshold) = P(Average < threshold/n).
  • Percentile translation: If the p-th percentile of the average is q, then the p-th percentile of the total is n·q.
  • Example: If 90% of the distribution of the average is less than 3.171, then 90% of the total sum (for n=75) is less than 75 × 3.171 = 237.8.

🎯 Worked examples with different distributions

🎯 Continuous Uniform distribution (Example 1)

Setup: Stress scores follow Uniform(1, 5), sample size n=75.

  • Formulas used: E(X) = (a+b)/2 = 3, V(X) = (b−a)²/12 = 16/12.
  • Sample average parameters: μ̄ = 3, σ̄ = √(16/12)/√75 ≈ 0.133.
  • Probability calculation: P(X̄ < 2) computed using pnorm(2, 3, 0.133) ≈ 0 (extremely unlikely).
  • Percentile calculation: 90th percentile of X̄ is qnorm(0.9, 3, 0.133) ≈ 3.171.
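
These numbers can be reproduced directly in R. The following is a minimal sketch of the calls described above (the names mu and sig are illustrative); the last two lines carry out the translation to the total by dividing the threshold by n and multiplying the percentile by n.

# Example 1: average of n = 75 Uniform(1, 5) stress scores
n   <- 75
mu  <- (1 + 5)/2                  # E(X) = 3
sig <- sqrt((5 - 1)^2/12)/sqrt(n) # sd of the sample average, about 0.133
pnorm(2, mu, sig)                 # P(average < 2), essentially 0
qnorm(0.9, mu, sig)               # 90th percentile of the average, about 3.171
pnorm(200/n, mu, sig)             # P(total < 200) = P(average < 200/75)
n * qnorm(0.9, mu, sig)           # 90th percentile of the total, about 237.8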

🎲 Discrete Uniform distribution (Example 2)

Setup: Stress scores are integers {1, 2, 3, 4, 5} with equal probability 1/5 each, n=75.

  • Key difference from Example 1: the discrete scores have a slightly larger variance for a single observation (2 versus 4/3), so the standard deviation of the sample average is also larger (≈ 0.163 versus ≈ 0.133).
  • Continuity correction: Because the sum is integer-valued, approximate P(Sum < 200) as P(Sum ≤ 199) ≈ P(Normal variable ≤ 199.5).
  • Translation to average scale: P(Average ≤ 199.5/75) gives a more accurate approximation (0.0187 vs 0.0206 without correction).
  • Don't confuse: The continuity correction applies when the data are discrete; it adjusts the threshold by 0.5 before converting to the continuous Normal scale.
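
In code, the continuity correction amounts to replacing the threshold 200 by 199.5 before dividing by n. A hedged sketch, assuming the equal-probability model above (under which a single score has variance 2):

# Example 2: discrete scores 1, ..., 5 with probability 1/5 each, n = 75
n   <- 75
mu  <- 3
sig <- sqrt(2/n)                  # sd of the sample average, about 0.163
pnorm(200/n, mu, sig)             # without correction, about 0.0206
pnorm(199.5/n, mu, sig)           # with continuity correction, about 0.0187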

⏱️ Exponential distribution (Example 3)

Setup: Excess phone time follows Exponential(λ), with mean 22 minutes, n=80.

  • Parameter identification: For Exponential, E(X) = 1/λ = 22, so λ = 1/22; also V(X) = 1/λ² = 484.
  • Sample average parameters: μ̄ = 22, σ̄ = √(484/80) ≈ 2.46.
  • Upper-tail probability: P(X̄ > 20) = 1 − P(X̄ ≤ 20) = 1 − pnorm(20, 22, 2.46) ≈ 0.792.
  • Why CLT matters here: The Exponential distribution is highly skewed, yet the sample average is approximately Normal for n=80.
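
A minimal R sketch of this calculation (variable names are illustrative):

# Example 3: Exponential excess phone time with mean 22 minutes, n = 80
n   <- 80
lam <- 1/22                       # rate, since E(X) = 1/lambda = 22
mu  <- 1/lam
sig <- sqrt((1/lam^2)/n)          # sd of the sample average, about 2.46
1 - pnorm(20, mu, sig)            # P(average > 20), about 0.79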

🏭 Quality control application (Example 4)

🏭 The quality control scenario

A beverage company fills cans with expected content 16.0 ounces and standard deviation 0.10 ounces. Every hour, 50 cans are sampled; if the average is below a threshold, production stops for recalibration.

🔍 Single can vs sample average

  • Question 4.1: Probability that a single can is below 15.95 ounces.
    • Answer: Not enough information—only expectation and standard deviation are given, not the actual distribution.
    • Don't confuse: "Normal production conditions" does not mean the measurement follows a Normal distribution.
  • Question 4.2: Probability that the sample average of 50 cans is below 15.95.
    • Answer: the CLT applies; X̄ is approximately Normal with mean 16 and standard deviation 0.1/√50 ≈ 0.0141, so P(X̄ < 15.95) ≈ 0.0002.
    • Why CLT helps: Even without knowing the single-can distribution, we can approximate probabilities for the average.

🚨 Setting a control threshold

  • Goal: Find a threshold such that P(stopping | normal conditions) = 5%.
  • Solution: Compute the 5th percentile of the sample average distribution: qnorm(0.05, 16, 0.1/√50) ≈ 15.977.
  • Interpretation: If the average of 50 cans is below 15.977 ounces, stop and recalibrate (this will happen 5% of the time even under normal conditions).
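
Both the probability and the threshold reduce to one-line Normal calculations; a minimal sketch, assuming the setup above (n = 50 cans, mean 16.0 ounces, standard deviation 0.10 ounces):

# Example 4: sample average of 50 cans under normal production conditions
n   <- 50
mu  <- 16
sig <- 0.1/sqrt(n)                # sd of the sample average, about 0.0141
pnorm(15.95, mu, sig)             # P(average < 15.95), about 0.0002
qnorm(0.05, mu, sig)              # 5% control threshold, about 15.977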

📊 Analyzing real data

The excerpt describes a file "QC.csv" with 8 hours of measurements (50 cans per hour).

  • Question 4.4: Which hours required recalibration?
    • Hours h3 (mean 15.91) and h8 (mean 15.974) are below the threshold 15.977.
  • Question 4.5: Which hours show suspected outliers?
    • Box plots reveal outliers in hours h4, h6, h7, and h8.
  • Tool: The boxplot(QC) function in R plots all variables side by side for easy comparison.
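
One possible way to carry out this analysis, assuming QC.csv is in the working directory and its columns are the eight hourly samples (h1 through h8) of 50 cans each:

QC <- read.csv("QC.csv")                          # 8 columns of 50 fill measurements
hourly <- sapply(QC, mean)                        # the eight hourly averages
hourly
hourly < qnorm(0.05, 16, 0.1/sqrt(50))            # TRUE marks hours that trigger recalibration
boxplot(QC)                                       # side-by-side box plots to spot outliers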

🔬 Comparing estimators via simulation (Example 5)

🔬 The estimation problem

A measurement follows Uniform(0, b) for unknown b. Two statisticians propose different estimators based on a sample of n=100:

  • Statistician A: Use 2X̄ (twice the sample average).
    • Motivation: E(X) = b/2, so E(2X̄) = b.
  • Statistician B: Use the maximum observation.
    • Motivation: The largest observation is close to b.

📏 Mean square error (MSE)

Mean square error: MSE = V(T) + (E(T) − b)², where T is the statistic.

  • Interpretation: MSE measures how far the statistic tends to be from the true value b, combining both variance (spread) and bias (systematic error).
  • Decision rule: Prefer the statistic with smaller MSE.

🧪 Simulation results

The excerpt uses 100,000 simulated samples to estimate expectation, variance, and MSE for each statistic.

| True b | Statistic | Expectation | Variance | MSE |
| --- | --- | --- | --- | --- |
| 10 | A (2X̄) | 9.998 | 0.333 | 0.333 |
| 10 | B (max) | 9.901 | 0.0098 | 0.0197 |
| 13.7 | A (2X̄) | 13.695 | 0.622 | 0.622 |
| 13.7 | B (max) | 13.564 | 0.0181 | 0.0362 |

  • Conclusion: Statistician B's estimator (maximum) has smaller MSE in both cases, making it preferable for estimating b in this Uniform setting.
  • Why B wins: Although B is slightly biased (expectation < b), its much smaller variance more than compensates, resulting in lower overall error.

🛠️ Simulation workflow

The excerpt demonstrates a standard Monte Carlo approach:

  1. Create empty sequences A and B of length 100,000.
  2. In each iteration: generate a sample, compute both statistics, store results.
  3. After the loop: compute mean(A), var(A), mean(B), var(B), and MSE for each.
  4. Compare MSE values to choose the better estimator.
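
A runnable version of this workflow for the case b = 10 (the seed is arbitrary, and in practice b is unknown; it is fixed here only so the simulation can be run):

set.seed(1)
b <- 10                           # true parameter, assumed known for the simulation
n <- 100
A <- rep(0, 10^5)                 # storage for Statistician A's estimates
B <- rep(0, 10^5)                 # storage for Statistician B's estimates
for (i in 1:10^5) {
  X    <- runif(n, 0, b)          # one sample of 100 Uniform(0, b) measurements
  A[i] <- 2*mean(X)               # twice the sample average
  B[i] <- max(X)                  # the largest observation
}
c(mean(A), var(A), var(A) + (mean(A) - b)^2)      # expectation, variance, MSE of A
c(mean(B), var(B), var(B) + (mean(B) - b)^2)      # expectation, variance, MSE of B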

🔑 Key computational techniques

🔑 R functions for Normal distribution

  • pnorm(x, mu, sigma): Cumulative probability P(X ≤ x) for Normal(mu, sigma²).
  • qnorm(p, mu, sigma): The p-th quantile (percentile) of Normal(mu, sigma²).
  • Upper-tail probability: P(X > x) = 1 − pnorm(x, mu, sigma).

🔑 Working with data frames

  • read.csv("file.csv"): Load data into a data frame.
  • summary(df): Overview of all variables (min, quartiles, mean, max).
  • mean(df$variable): Compute the mean of a specific column with full precision.
  • boxplot(df): Create side-by-side box plots for all variables to spot outliers.

🔑 Simulation loop pattern

result <- rep(0, number_of_simulations)    # storage for the simulated statistic values
for (i in 1:number_of_simulations) {
  sample <- generate_sample()              # placeholder: e.g. runif(n, 0, b) or rexp(n, lam)
  result[i] <- compute_statistic(sample)   # placeholder: e.g. 2*mean(sample) or max(sample)
}
mean(result)  # estimate the expectation of the statistic
var(result)   # estimate its variance
  • Purpose: Approximate the sampling distribution of a statistic when analytical formulas are difficult or unavailable.
45

Student Learning Objectives

9.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Statistical inference uses formal computations on observed data—interpreted through probability and sampling distributions—to gain insight about population parameters via estimation and hypothesis testing.

📌 Key points (3–5)

  • What inference does: extends beyond describing observed data to making conclusions about population parameters using probabilistic reasoning.
  • Two main tools: point estimation (best guess of a parameter) and confidence intervals (likely range containing the parameter); hypothesis testing (formal method to choose between competing theories).
  • Why sampling distributions matter: the justification for any statistic or test comes from examining its probabilistic behavior across all potential datasets.
  • Common confusion: descriptive statistics describes what you see; probability considers what could have been; inference uses probability to conclude what is true in the population.
  • Error is unavoidable: decisions based on test statistics can fall on the wrong side of a threshold by chance, but sampling distributions let us control error rates.

🔍 Three frames of reference

📊 Descriptive statistics

  • Investigates characteristics of the data using graphs and numerical summaries.
  • Frame of reference: the observed data only.
  • You describe what you have; you do not generalize beyond it.

🎲 Probability

  • Extends the frame to include all datasets that could have potentially emerged.
  • The observed data is one realization among many possible outcomes.
  • This shift is the foundation for inference.

🔬 Inferential statistics

  • Goal: gain insight about population parameters from observed data.
  • Method: apply formal computations (statistics) to the data, then interpret results in the probabilistic context.
  • The justification for choosing a particular computation comes from examining its properties across all potential datasets (its sampling distribution).

🎯 Point estimation and confidence intervals

🎯 Point estimation

Point estimation: attempts to obtain the best guess for the value of a population parameter.

Estimator: a statistic that produces such a guess.

  • An estimator is a function of the data.
  • Why one estimator is better than another: prefer an estimator whose sampling distribution is more concentrated around the true parameter value.
  • Example: A car manufacturer tests 10 new cars and measures fuel consumption. The parameter of interest is the average consumption among all cars of that type. The sample average of the 10 cars is a point estimate.

📏 Confidence intervals

Confidence interval: an interval computed from the data that is most likely to contain the population parameter.

Confidence level: the sampling probability that the confidence interval will indeed contain the parameter value.

  • Instead of a single guess, you construct a range.
  • Confidence intervals are designed to achieve a prescribed confidence level (e.g., 95%).
  • Don't confuse: the confidence level is a probability about the procedure (across all potential samples), not a probability that the parameter lies in one specific computed interval.

🧪 Hypothesis testing

🧪 The scientific paradigm

  • Science proposes new theories that predict empirical outcomes.
  • If empirical evidence supports the new hypothesis but contradicts the old theory, the old theory is rejected.
  • Otherwise, the established theory is retained.
  • Statistical hypothesis testing formalizes this decision process.

⚖️ How hypothesis testing works

Test statistic: a statistic computed to decide which of two hypotheses (old vs. new) is more consistent with the data.

  • Each hypothesis predicts a different distribution for the measurements.
  • A threshold is set; depending on where the test statistic falls relative to this threshold, you decide whether to reject the old theory.
  • Key mechanism: the decision rule is based on the sampling distribution of the test statistic.

⚠️ Errors are unavoidable

  • The test statistic may fall on the wrong side of the threshold by chance.
  • The excerpt emphasizes that the decision rule is "not error proof."
  • However, by examining the sampling distribution, you can understand and control the probability of such errors.
  • Don't confuse: a single test outcome can be wrong; the sampling distribution tells you how often the procedure will err in the long run.

📚 Learning objectives summary

By the end of this chapter, students should be able to:

| Objective | What it means |
| --- | --- |
| Define key terms | Understand point estimation, estimator, confidence interval, confidence level, hypothesis testing, test statistic |
| Recognize variables in "cars.csv" | Identify the dataset used for demonstrations in Chapters 9–15 |
| Revise probability concepts | Review random variables, sampling distribution, and the Central Limit Theorem (all essential for inference) |

  • The excerpt notes that the "cars.csv" data frame will be presented in the third section and probability topics will be reviewed in the fourth section.
  • These foundational concepts (sampling distribution, Central Limit Theorem) underpin all inferential methods.
46

Key Terms in Statistical Inference

9.2 Key Terms

🧭 Overview

🧠 One-sentence thesis

Statistical inference uses formal computations on observed data—evaluated through their probabilistic properties across all potential datasets—to gain insight into population parameters via point estimation, confidence intervals, and hypothesis testing.

📌 Key points (3–5)

  • Descriptive vs inferential statistics: descriptive examines only the observed data; inferential extends the frame to all potential datasets and uses probability to interpret computations.
  • Three main tools: point estimation (best guess of a parameter), confidence intervals (likely range for a parameter), and hypothesis testing (formal decision between competing theories).
  • Why sampling distributions matter: the justification for any statistic or test comes from examining its probabilistic behavior across all possible samples.
  • Common confusion: hypothesis testing is not error-proof—the test statistic can fall on the wrong side of the threshold by chance, so significance level measures the probability of wrongly rejecting the old theory.
  • Real-world application: these methods are used to compare treatments in clinical trials, estimate fuel consumption in manufacturing, and make decisions under uncertainty.

📊 Descriptive vs inferential statistics

📊 Descriptive statistics

Descriptive statistics: investigates the characteristics of the data by using graphical tools and numerical summaries.

  • The frame of reference is only the observed data.
  • You describe what you see, without extending conclusions beyond the sample.

🎲 Probability and inferential statistics

Inferential statistics: aims to gain insight regarding the population parameters from the observed data.

  • The frame of reference includes all datasets that could have potentially emerged, with the observed data as one among many.
  • Probability provides the context: you consider what would happen if you applied the same computation to every possible sample.
  • The justification for a specific computation comes from examining its probabilistic properties in the sampling distribution.

Don't confuse: Descriptive statistics summarizes what you have; inferential statistics uses what you have to learn about the larger population you didn't observe.

🎯 Point estimation

🎯 What point estimation does

Point estimation: attempts to obtain the best guess to the value of a population parameter.

Estimator: a statistic that produces such a guess.

  • An estimator is a function of the data (a statistic).
  • You prefer an estimator whose sampling distribution is more concentrated around the true parameter value.
  • The choice of estimator is justified by its probabilistic characteristics in the sampling distribution.

🚗 Fuel consumption example

  • Scenario: A car manufacturer wants to know the average fuel consumption of a new car type (the population parameter).
  • Method: Apply a standard test cycle to a sample of 10 cars and measure their fuel consumptions.
  • Point estimate: The average consumption of the 10 cars is the best guess for the average consumption of all cars of that type.

📏 Confidence intervals

📏 What a confidence interval is

Confidence interval: an interval computed on the basis of the data that is most likely to contain the population parameter.

Confidence level: the sampling probability that the confidence interval will indeed contain the parameter value.

  • Instead of a single guess, you construct a range that is likely to include the true parameter.
  • Confidence intervals are constructed to have a prescribed confidence level (e.g., 95%).
  • The confidence level is a probability statement about the procedure, not about any single interval: if you repeated the sampling many times, the prescribed percentage of intervals would contain the true parameter.

Don't confuse: The confidence level is not the probability that this particular interval contains the parameter; it is the long-run success rate of the interval-construction method.

🧪 Hypothesis testing

🧪 The scientific paradigm

  • New theories propose predictions that can be examined empirically.
  • If empirical evidence is consistent with the new hypothesis but not the old theory, the old theory is rejected in favor of the new one.
  • Otherwise, the established theory maintains its status.

Hypothesis testing: a formal method for determining which of two hypotheses should prevail using this paradigm.

🧪 How hypothesis testing works

  • Each hypothesis (old and new) predicts a different distribution for the empirical measurements.
  • A test statistic is computed from the data to decide which distribution is more in tune with the observations.
  • A threshold is set: depending on where the test statistic falls relative to this threshold, you decide whether or not to reject the old theory.

⚠️ Errors and significance level

  • The decision rule is not error-proof: the test statistic may fall on the wrong side of the threshold by chance.
  • By examining the sampling distribution of the test statistic, you can assess the probability of making an error.

Significance level: the probability of erroneously rejecting the currently accepted theory (the old one).

  • The threshold is selected to ensure a small enough significance level.

Don't confuse: A low significance level does not mean the test is always correct; it only controls the probability of one type of error (wrongly rejecting the old theory).

🚗 Factory comparison example

  • Scenario: A car is manufactured in two different factories; you want to test whether fuel consumption is the same for both.
  • Method: 5 cars from one factory and 5 from the other are tested.
  • Test statistic: The absolute value of the difference between the average consumption of the first 5 and the average consumption of the other 5.
  • Decision: If the test statistic exceeds the threshold, you reject the hypothesis that consumption is the same.

💊 Clinical trials example

  • Scenario: Before a new medical treatment is approved, it undergoes clinical trials.
  • Method: Some patients receive the new treatment; others receive the standard treatment.
  • Application: Statistical tests compare the two groups.
  • Decision: The new treatment is released only if it is shown to be beneficial with statistical significance and has no unacceptable side effects.

🔗 Why sampling distributions are central

🔗 Justification for formal computations

  • The excerpt emphasizes that the interpretation of any computation is carried out in the probabilistic context.
  • You consider the application of the formal computation to all potential datasets, not just the one you observed.
  • The sampling distribution of a statistic shows how that statistic would behave across all possible samples.
  • This probabilistic examination justifies why a particular statistic or test is used on the observed data.

🔗 Summary table

| Tool | What it does | How it is justified |
| --- | --- | --- |
| Point estimation | Produces a best guess for a parameter | Sampling distribution is concentrated around the true parameter |
| Confidence interval | Constructs a range likely to contain the parameter | Confidence level (sampling probability) is prescribed |
| Hypothesis testing | Decides between two competing theories | Significance level (error probability) is controlled via the sampling distribution of the test statistic |

Key takeaway: All three tools rely on examining the probabilistic properties of statistics in the context of the sampling distribution.

47

The Cars Data Set

9.3 The Cars Data Set

🧭 Overview

🧠 One-sentence thesis

The cars data set serves as a concrete, recurring example throughout the book's statistical inference chapters, containing 205 observations of car models with 17 variables (6 categorical, 11 numeric) that will be used to demonstrate point estimation, confidence intervals, hypothesis testing, and regression.

📌 Key points (3–5)

  • Purpose: This data set will be used consistently across Chapters 10–15 to make inferential procedures concrete and demonstrate different statistical methods.
  • Structure: Contains 205 car models with 17 variables—6 factors (qualitative attributes like make, fuel type, body style) and 11 numeric measurements (dimensions, performance, price).
  • Missing values: Some variables have "NA" entries (missing observations), but their relative frequency is low enough not to be a major concern for analysis.
  • Common confusion: Missing values are handled differently by different R functions—analysts must ensure the appropriate method is applied for each specific analysis rather than assuming a default behavior.
  • Variables span multiple domains: The data set includes manufacturer information, physical dimensions, engine characteristics, fuel efficiency, and pricing, enabling diverse types of statistical comparisons.

📊 Data set characteristics

📁 Source and accessibility

  • The data is stored in a CSV file named "cars.csv" available at http://pluto.huji.ac.il/~msby/StatThink/Datasets/cars.csv.
  • Based on 205 observations from the original "Automobiles" data set at the UCI Machine Learning Repository.
  • Original source: 1985 Model Import Car and Truck Specifications and 1985 Ward's Automotive Yearbook, assembled by Jeffrey C. Schlimmer.
  • The current file uses 17 of the 26 variables available in the original source.

🔢 Data structure overview

The data frame contains two types of variables:

| Variable type | Count | How R summarizes them | Examples |
| --- | --- | --- | --- |
| Factors (qualitative) | 6 | Lists attributes and their frequencies; shows the most frequent if many exist | make, fuel.type, num.of.doors, body.style, drive.wheels, engine.location |
| Numeric (quantitative) | 11 | Shows min, max, three quartiles (Q1, median, Q3), and mean | wheel.base, length, width, height, curb.weight, engine.size, horsepower, peak.rpm, city.mpg, highway.mpg, price |
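
A minimal sketch of loading and summarizing the data, assuming the file has been downloaded to the working directory (it can also be read directly from the URL above):

cars <- read.csv("cars.csv")      # 205 rows, 17 variables
summary(cars)                     # factors: frequency tables; numeric: min, quartiles, mean, max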

🏷️ Categorical variables (factors)

🚗 Manufacturer and fuel information

  • make: The name of the car producer
    • Example from summary: Toyota has 32 cars (most frequent), Nissan 18, Mazda 17, Honda 13, Mitsubishi 13, Subaru 12
  • fuel.type: The type of fuel used by the car
    • Two categories: diesel (20 cars) or gas (185 cars)
    • Has 2 missing values (NA's)

🚪 Physical configuration

  • num.of.doors: The number of passenger doors
    • Two categories: four doors (114 cars) or two doors (89 cars)
    • Has 2 missing values
  • body.style: The type of the car
    • Categories include: sedan (96), hatchback (70), wagon (25), hardtop (8), convertible (6)
  • drive.wheels: The wheels powered by the engine
    • Three categories: front-wheel drive/fwd (120), rear-wheel drive/rwd (76), four-wheel drive/4wd (9)
  • engine.location: The location in the car of the engine
    • Two categories: front (202 cars), rear (3 cars)

📏 Numeric variables (measurements)

📐 Dimensional measurements

All dimensions are measured in inches:

  • wheel.base: Distance between the centers of the front and rear wheels
    • Range: 86.60 to 120.90 inches; median 97.00; mean 98.76
  • length: The length of the body of the car
    • Range: 141.1 to 208.1 inches; median 173.2; mean 174.0
  • width: The width of the body of the car
    • Range: 60.30 to 72.30 inches; median 65.50; mean 65.91
  • height: The height of the car
    • Range: 47.80 to 59.80 inches; median 54.10; mean 53.72
    • Has 2 missing values

⚙️ Weight and engine characteristics

  • curb.weight: Total weight in pounds of a vehicle with standard equipment and a full tank of fuel, but with no passengers or cargo
    • Range: 1488 to 4066 pounds; median 2414; mean 2556
  • engine.size: (Description cut off in excerpt, but numeric variable shown in summary)
    • Range: 61.0 to 326.0; median 120.0; mean 126.9

🏎️ Performance and efficiency

  • horsepower: Engine power output
    • Range: 48.0 to 288.0; median 95.0; mean 104.3
    • Has 2 missing values
  • peak.rpm: Peak revolutions per minute
    • Range: 4150 to 6600; median 5200; mean 5125
    • Has 2 missing values
  • city.mpg: Fuel efficiency in city driving (miles per gallon)
    • Range: 13.00 to 49.00; median 24.00; mean 25.22
  • highway.mpg: Fuel efficiency in highway driving (miles per gallon)
    • Range: 16.00 to 54.00; median 30.00; mean 30.75

💰 Pricing

  • price: Cost of the vehicle
    • Range: 5118 to 45400 (currency units not specified); median 10295; mean 13207
    • Has 4 missing values

⚠️ Missing values considerations

🔍 What missing values are

Missing values: observations for which a value for a given variable is not recorded; R uses the symbol "NA" to identify them.

  • In the cars data frame, missing values appear in: num.of.doors (2), fuel.type (2), height (2), horsepower (2), peak.rpm (2), and price (4).
  • If you scan the CSV file directly with a spreadsheet, you will encounter the "NA" symbol where values are missing.

🎯 When to be concerned

The excerpt distinguishes two scenarios:

| Concern level | Condition | Reason | Action needed |
| --- | --- | --- | --- |
| Low concern | Relative frequency of missing values is low | Less likely to bias conclusions | The cars data set falls into this category |
| High concern | Relative frequency is substantial AND the reason for missing data is related to the phenomena under investigation | Naïve statistical inference may produce biased conclusions | More careful handling required |

🛠️ Practical warning

  • Different R functions may have different ways of dealing with missing values.
  • Don't confuse: There is no single default behavior—you must verify that the appropriate method is applied for each specific analysis.
  • Example: One function might exclude all rows with any NA, while another might only exclude NAs in the specific variable being analyzed.
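
One way to locate the missing values, assuming the cars data frame has been loaded as above:

colSums(is.na(cars))              # number of NA entries in each of the 17 variables
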
48

The Sampling Distribution

9.4 The Sampling Distribution

🧭 Overview

🧠 One-sentence thesis

The sampling distribution transforms statistics from single observed values into random variables, enabling statistical inference by modeling how those statistics would vary across repeated samples.

📌 Key points (3–5)

  • Observed vs. random context: A statistic computed from data is a single value (lowercase notation, e.g., x̄), but when viewed across all possible samples it becomes a random variable (uppercase notation, e.g., X̄).
  • Two approaches to sampling distribution: Either imagine repeated sampling from a real population, or assign a theoretical distribution to measurements and model repeated samples from that distribution.
  • Determining the distribution: Use the Normal approximation (via the Central Limit Theorem for large samples) or use simulation to approximate the sampling distribution.
  • Common confusion: The sampling distribution of a statistic is usually not the same as the distribution of individual measurements—e.g., even if measurements are Uniform, the sample average is approximately Normal (for large n).
  • Why it matters: The sampling distribution is the foundation for all statistical inference—point estimation, confidence intervals, and hypothesis testing all depend on knowing how a statistic behaves across samples.

📊 Statistics: observed values vs. random variables

📊 What a statistic is

A statistic is a function or formula applied to a data frame.

  • Examples: sample average, sample standard deviation, smallest value, largest value, quartiles, median, frequency of a category.
  • When computed on the actual data frame, a statistic yields a single number.
  • When considered across all possible random samples, the same formula produces a random variable—this is the sampling distribution of the statistic.

🔤 Notation: lowercase vs. uppercase

The excerpt uses notation to distinguish the two contexts:

| Context | Notation | Meaning |
| --- | --- | --- |
| Observed data | Lowercase (x₁, x₂, …, xₙ; x̄; s²) | Non-random quantities computed from the actual data |
| Random sample | Uppercase (X₁, X₂, …, Xₙ; X̄; S²) | Random variables representing all possible samples |

  • Example: x̄ = (x₁ + x₂ + ⋯ + xₙ) / n is the observed sample average.
  • Example: X̄ = (X₁ + X₂ + ⋯ + Xₙ) / n is the random variable representing the sample average across all possible samples.
  • The same formula is applied in both cases, but the interpretation differs.

🧮 Sample variance example

  • Observed: s² = sum of squared deviations / (n − 1) = Σ(xᵢ − x̄)² / (n − 1)
  • Random: S² = sum of squared deviations / (n − 1) = Σ(Xᵢ − X̄)² / (n − 1)
  • S² is a random variable; s² is its evaluation on the specific observed sample.

🌍 Two ways to define sampling distribution

🌍 Population-based approach

  • Imagine a larger population from which the observed data is a random sample.
  • Example: The "cars" data frame contains 205 car types from 1985; the population could be all car types sold in the U.S. during the 1980s.
  • The sampling distribution arises from imagining we could have sampled a different year (e.g., 1987 instead of 1985).
  • This approach is natural for surveys of a specific target population.

🎲 Theoretical model approach

  • Assign a theoretical probability distribution to the measurements.
  • Example: Model car prices as following an Exponential(λ) distribution, where λ is an unknown parameter.
  • The sampling distribution is then 205 (or 201, excluding missing values) independent copies from that Exponential distribution.
  • This is the standard approach in statistical inference and is used throughout the rest of the book.
  • Don't confuse: The theoretical model is not claiming the population literally follows that distribution; it is a simplifying assumption to enable inference.

🔍 Factor variables

  • Sampling distribution applies to categorical data too.
  • Example: The frequency of diesel cars in the data is 20, but in another sample (another year) it could differ.
  • Theoretical model: Assume each car has probability p of being diesel; then the frequency follows a Binomial(205, p) distribution.
  • The parameter p is unknown and must be estimated from the data.

🎯 Theoretical distributions for observations

The excerpt reviews five key distributions used to model individual measurements:

| Distribution | Use case | Parameters | Sample space | Expectation | Variance | R functions |
| --- | --- | --- | --- | --- | --- | --- |
| Binomial(n, p) | Count of occurrences in n trials | n (trials), p (success probability) | {0, 1, 2, …, n} | np | np(1 − p) | dbinom, pbinom, qbinom, rbinom |
| Poisson(λ) | Counts when n is large, p is small | λ (expectation) | {0, 1, 2, …} | λ | λ | dpois, ppois, qpois, rpois |
| Uniform(a, b) | All values in [a, b] equally likely | a, b (endpoints) | [a, b] | (a + b)/2 | (b − a)²/12 | dunif, punif, qunif, runif |
| Exponential(λ) | Times between events; smaller values more likely | λ (rate) | Positive numbers | 1/λ | 1/λ² | dexp, pexp, qexp, rexp |
| Normal(μ, σ²) | Generic model; approximation for many statistics | μ (expectation), σ² (variance) | All real numbers | μ | σ² | dnorm, pnorm, qnorm, rnorm |

🧩 Role in inference

  • Statistical inference problems involve identifying a theoretical model for measurements.
  • The model is a function of an unknown parameter (e.g., p, λ, μ, σ²).
  • The goal is to make statements about the parameter using observed data: estimate it (point estimation), construct an interval (confidence interval), or test a hypothesis (hypothesis testing).

🔔 Sampling distribution of statistics

🔔 How statistics inherit distributions

  • A statistic computed from measurements inherits a distribution from the sampling distribution of those measurements.
  • The sampling distribution of a statistic depends on:
    • The sample size (n)
    • The parameters of the measurement distribution
  • This distribution is often complex and not the same as the distribution of individual measurements.

🏷️ Types of statistics by inference goal

| Inference goal | Statistic name | Purpose |
| --- | --- | --- |
| Point estimation | Estimator | Guess the value of the parameter |
| Confidence interval | Confidence interval | Propose an interval containing the parameter with prescribed probability |
| Hypothesis testing | Test statistic | Test whether the parameter equals a specific value |

⚠️ Not always the same distribution

  • Common confusion: The sampling distribution of a statistic is usually not the same as the distribution of individual measurements.
  • Example: If measurements are Uniform, the sample average is not Uniform—it is approximately Normal (for large n).
  • Exception: The minimum of n Exponential(λ) measurements is Exponential(nλ), regardless of sample size.

📐 The Normal approximation

📐 When it applies

  • The Central Limit Theorem states that the sample average (or functions of it) is approximately Normal for large sample sizes, regardless of the distribution of individual measurements.
  • This is the most important scenario for identifying the sampling distribution of a statistic.
  • The approximation also applies to some other statistics (e.g., sample median under certain conditions), even if they are not functions of the sample average.

🧮 Using the approximation for the sample average

  • The sample average X̄ has:
    • Expectation = expectation of a single measurement
    • Variance = variance of a single measurement / n
    • Standard deviation = standard deviation of a single measurement / √n
  • To approximate probabilities: treat X̄ as Normal with the above expectation and standard deviation.
  • Example: For Normal(μ, σ²), the central 95% region is μ ± 1.96σ. For the sample average, it is μ ± 1.96(σ / √n).

⚠️ When it does not apply

  • The Normal approximation is not universal.
  • Example: The minimum of Exponential measurements follows an Exponential distribution (with rate nλ), not a Normal distribution, even for large n.
  • Always check whether the approximation is appropriate for the specific statistic and setting.

🖥️ Simulations for validation and computation

🖥️ Purpose of simulations

  1. Validation: Check whether the Normal approximation is appropriate by comparing it to simulated results.
  2. Computation: Calculate probabilities when the Normal approximation does not hold.

🔄 How to simulate

  1. Assume a model for the distribution of observations and specify parameter values.
  2. Generate many random samples (e.g., 100,000) of size n from that distribution using R functions (e.g., rexp, runif).
  3. Compute the statistic for each sample and store the results.
  4. The collection of simulated statistic values approximates the sampling distribution.
  5. Compare probabilities from the simulation to those from the Normal approximation (or compute probabilities directly if no approximation is available).

🧪 Example: sample average of Exponential prices

  • Assume car prices follow Exponential(1/12000), so expectation = 12,000.
  • Sample size n = 201; standard deviation of X̄ = (1/λ)/√201 = 12,000/√201 ≈ 0.0705 × 12,000 ≈ 846.
  • Normal approximation: the central 95% region is 12,000 ± 1.96 × 0.0705 × 12,000.
  • Simulation: Generate 100,000 samples of size 201 from Exponential(1/12000), compute the average for each, and check the proportion falling in the interval.
  • Result: Simulation gives approximately 0.95, confirming the Normal approximation is adequate.
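
A minimal R sketch of this validation (the seed is arbitrary; 100,000 repetitions as suggested above):

set.seed(1)
lam <- 1/12000                    # rate of the assumed Exponential model
n   <- 201
mu  <- 1/lam                      # expectation of the average, 12,000
sig <- (1/lam)/sqrt(n)            # sd of the average, about 0.0705 * 12,000
X.bar <- rep(0, 10^5)
for (i in 1:10^5) {
  X.bar[i] <- mean(rexp(n, lam))  # average of one simulated sample of prices
}
mean(abs(X.bar - mu) <= qnorm(0.975)*sig)   # proportion inside mu +/- 1.96*sig, about 0.95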

🧪 Example: mid-range statistic (no Normal approximation)

Mid-range statistic: the average of the largest and smallest values in the sample.

  • Suppose 100 observations from Uniform(3, 7).
  • The Normal approximation does not apply to the mid-range.
  • Use simulation:
    • Generate samples of size 100 from Uniform(3, 7) using runif.
    • For each sample, compute min and max using the min and max functions.
    • Compute mid-range = (min + max) / 2.
    • The distribution of simulated mid-range values approximates the sampling distribution.
    • Use this to find the central 95% range or other probabilities.
  • Don't confuse: Simulations are for demonstration; in practice, you may need to try multiple parameter values to gain confidence.
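
A minimal R sketch of this simulation (the seed is arbitrary; 100,000 repetitions as suggested above):

set.seed(1)
mid.range <- rep(0, 10^5)
for (i in 1:10^5) {
  X <- runif(100, 3, 7)                  # one sample of 100 Uniform(3, 7) observations
  mid.range[i] <- (min(X) + max(X))/2    # its mid-range
}
quantile(mid.range, c(0.025, 0.975))     # central 95% range, roughly [4.94, 5.06]
mean(mid.range)                          # about 5
sd(mid.range)                            # about 0.028
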
49

Exercises

9.5 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise section applies simulation-based inference techniques to a real clinical study examining whether magnetic fields reduce chronic pain in postpolio patients.

📌 Key points (3–5)

  • Simulation for sampling distributions: The 95% central range of a statistic's sampling distribution can be approximated by finding the 0.025 and 0.975 percentiles of simulated values.
  • The quantile function: Used to identify percentiles of simulated sequences; the p-percentile is the value where proportion p of entries are smaller and proportion 1−p are larger.
  • Computing distribution summaries: Expectation and standard deviation of a sampling distribution are approximated by applying mean and sd functions to the simulated sequence.
  • Real-world application: The exercises use data from 50 postpolio patients comparing active magnetic treatment versus inactive placebo on pain score changes.
  • Common confusion: Don't confuse the simulated sequence of statistic values with the original data—the simulation generates many possible values of the statistic to approximate its distribution.

🎲 Simulation mechanics

🎲 Approximating the central 95% range

  • The excerpt demonstrates simulating 100,000 values of the mid-range statistic from a Uniform(3,7) distribution.
  • The central region containing 95% of the sampling distribution is found by identifying two percentiles: 0.025 and 0.975.
  • Between these percentiles lie 95% of the simulated values.
  • Example: For the mid-range statistic, approximately 95% of values fall in the range [4.941680, 5.059004].

📊 The quantile function

The p-percentile of a sequence is a number with the property that the proportion of entries with values smaller than that number is p and the proportion of entries with values larger than the number is 1−p.

  • First argument: a sequence of values (the simulated statistics).
  • Second argument: a number p between 0 and 1, or a sequence of such numbers.
  • Output: the p-percentile(s) of the sequence.
  • The p-percentile of the simulated sequence approximates the p-percentile of the true sampling distribution.

🧮 Computing expectation and standard deviation

  • Expectation: Apply the mean function to the simulated sequence.
    • Example: The mid-range statistic's expectation is approximately 5.000067 (practically equal to 5).
  • Standard deviation: Apply the sd function to the simulated sequence.
    • Example: The mid-range statistic's standard deviation is approximately 0.028.
  • These summaries describe the sampling distribution, not the original data.

🧲 The magnetic pain relief study

🧲 Study design

  • Research question: Can chronic pain in postpolio patients be relieved by magnetic fields applied over a pain trigger point?
  • Sample: 50 patients with post-polio pain syndrome.
  • Treatment groups:
    • Active magnetic device (indicated by active = 1; first 29 patients)
    • Inactive placebo device (indicated by active = 2; last 21 patients)
  • Measurements: Pain rated before (score1) and after (score2) device application; change is the difference between these scores.

📁 Data structure

  • Data stored in file magnets.csv with variables: score1, score2, change, and active.
  • The variable active indicates treatment condition (1 = active magnet, 2 = inactive placebo).
  • Don't confuse: The first 29 patients form one group (active) and the last 21 form another (placebo); subsetting requires expressions like change[1:29] and change[30:50].

📝 Exercise tasks

📝 Descriptive analysis (Exercise 9.1)

The exercise asks students to:

  1. Compute the sample average of the change variable across all patients.
  2. Determine whether active is a factor or numeric variable.
  3. Calculate the average change separately for active magnet recipients (first 29) and placebo recipients (last 21).
  4. Compute the sample standard deviation of change for each treatment group.
  5. Produce boxplots of change for each group and count outliers in each.
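
One possible set of R calls for these tasks (a sketch, not an official solution), assuming magnets.csv is in the working directory and contains the variables named above:

magnets <- read.csv("magnets.csv")
mean(magnets$change)                     # 1. average change over all 50 patients
summary(magnets$active)                  # 2. inspect whether active is a factor or numeric
mean(magnets$change[1:29])               # 3. average change, active-device group
mean(magnets$change[30:50])              #    average change, placebo group
sd(magnets$change[1:29])                 # 4. sd of change, active group
sd(magnets$change[30:50])                #    sd of change, placebo group
boxplot(magnets$change[1:29], magnets$change[30:50],
        names = c("active", "placebo"))  # 5. box plots; count the outliers in each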

📝 Statistical testing preview (Exercise 9.2)

  • The excerpt mentions that Chapter 13 will present a statistical test for comparing the expected value between the two treatment groups.
  • This foreshadows hypothesis testing to determine if there is a real difference in pain relief between active and placebo treatments.
50

Summary

9.6 Summary

🧭 Overview

🧠 One-sentence thesis

Statistical inference provides methods—point estimation, confidence intervals, and hypothesis testing—to draw conclusions about population parameters from observed sample data.

📌 Key points (3–5)

  • What statistical inference does: uses observed data to gain insight into population parameters.
  • Three main methods: point estimation (best guess), confidence intervals (likely range), and hypothesis testing (choosing between hypotheses).
  • Confidence interval meaning: the interval most likely to contain the true parameter; confidence level = sampling probability that the interval captures the parameter.
  • Hypothesis testing logic: choose between two hypotheses (one currently accepted) based on a test statistic; significance level = probability of falsely rejecting the accepted hypothesis.
  • Common confusion: an estimate (the observed value) vs an estimator (the statistic/formula that produces the guess).

📊 The three pillars of statistical inference

📍 Point estimation

Point estimation: an attempt to obtain the best guess of the value of a population parameter.

  • Estimator vs estimate:
    • An estimator is a statistic (a formula/rule) that produces a guess.
    • An estimate is the observed value you get when you apply the estimator to your data.
  • Example: if you use the sample mean as your estimator, the actual number you calculate from your sample is the estimate.
  • Don't confuse: the estimator is the method; the estimate is the result.

📏 Confidence interval

Confidence interval: an interval that is most likely to contain the population parameter.

  • What it tells you: a range of values where the true parameter probably lies.
  • Confidence level: the sampling probability that the confidence interval contains the parameter value.
    • This is a probability statement about the procedure, not about any single interval.
    • Example: a 95% confidence level means that if you repeated the sampling many times, 95% of the intervals you construct would contain the true parameter.
  • The excerpt emphasizes "most likely to contain"—it is not a guarantee, but a probabilistic statement.

🧪 Hypothesis testing

Hypothesis testing: a method for determining between two hypotheses, with one of the two being the currently accepted hypothesis.

  • How it works: you compute a test statistic from your data and use it to decide which hypothesis to accept.
  • Significance level: the probability of falsely rejecting the currently accepted hypothesis.
    • This is the risk you accept of making a Type I error (rejecting a true hypothesis).
  • Example: if the test statistic falls far from what you'd expect under the accepted hypothesis, you might reject it in favor of the alternative.
  • Don't confuse: the significance level is not the probability that the accepted hypothesis is true; it is the probability of incorrectly rejecting it if it were true.

🔗 How the methods relate

🔗 Shared foundation

All three methods rely on understanding the sampling distribution of statistics:

  • Point estimation uses a statistic to guess the parameter.
  • Confidence intervals use the sampling distribution to build a range.
  • Hypothesis testing uses the sampling distribution to assess whether observed data is consistent with a hypothesis.

🔗 Complementary roles

| Method | Purpose | Output |
| --- | --- | --- |
| Point estimation | Best single guess | A number (the estimate) |
| Confidence interval | Likely range for the parameter | An interval with a confidence level |
| Hypothesis testing | Choose between competing claims | A decision (reject or not reject) |

  • These are not competing approaches; they answer different questions about the same population parameter.
51

Point Estimation: Student Learning Objectives

10.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Point estimation uses observed data to produce the best possible guess of an unknown parameter's value, with estimators chosen by criteria that measure how close their values tend to be to the true parameter.

📌 Key points (3–5)

  • What estimation is: using a statistic (a formula applied to data) to guess the value of a parameter that characterizes a distribution.
  • Core problem: selecting among competing estimators (e.g., sample average vs. sample mid-range for expectation) using criteria like bias, variance, and mean squared error (MSE).
  • Common confusion: estimator vs. estimate—an estimator is the formula/procedure; an estimate is the number it produces from a specific dataset.
  • Why it matters: estimation is the foundation for drawing solid conclusions about phenomena from sample data, assuming observations come from a probabilistic model.
  • Key challenge: missing values may contain information, and deleting them can bias analysis.

🎯 The estimation problem

🎯 What point estimation does

Point estimation: producing the best possible guess of the value of a parameter on the basis of available data.

  • The parameter is a summary or characteristic of the distribution from which observations emerge (e.g., the rate λ in an Exponential(λ) distribution of car prices).
  • The data is a sample of observations, each the outcome of a measurement on a subject.
  • Example: given a sample of car prices, estimate the rate parameter that specifies the Exponential distribution of all car prices.

🔧 Estimator vs. estimate

  • Estimator: a statistic—a formula applied to the data—that tries to guess the parameter's value.
  • Estimate: the number produced when the estimator is applied to a specific dataset.
  • Don't confuse: the estimator is the procedure; the estimate is the result.

🧩 Competing estimators

  • Multiple estimators can target the same parameter.
  • Example: for the expectation (central location) of a distribution, natural candidates include:
    • Sample average
    • Sample mid-range
  • The chapter's main topic is identifying criteria to choose which estimator to use for which parameter.

📏 Criteria for choosing estimators

📏 Three key measures

The excerpt introduces three criteria for assessing estimator performance:

| Criterion | What it measures |
| --- | --- |
| Bias | Whether the estimator systematically over- or under-estimates the parameter |
| Variance | How much the estimator's values fluctuate across different samples |
| Mean Squared Error (MSE) | Overall closeness: combines bias and variance to measure typical distance from the true parameter |

  • Goal: seek an estimator that tends to obtain values as close as possible to the parameter's true value.
  • These criteria help compare estimators and select the best one for a given estimation problem.

🎓 Learning objectives

By the end of the chapter, students should be able to:

  • Recognize issues associated with parameter estimation.
  • Define bias, variance, and MSE of an estimator.
  • Estimate parameters from data and assess the estimation procedure's performance.

🧮 Estimation targets

🧮 Expectation (central location)

  • The expectation of a measurement is always of interest—it describes the distribution's central location.
  • Natural estimator: the sample average.
  • Alternative: sample mid-range (presented in a previous chapter).
  • The chapter discusses how to choose between such candidates.

📊 Variance and standard deviation (spread)

  • These summaries characterize the distribution's spread.
  • The chapter covers estimation methods for variance and standard deviation.

🎲 Distribution parameters

  • Theoretical models (Normal, Exponential, Poisson, etc.) are characterized by parameters.
  • Example: the rate λ in an Exponential(λ) distribution.
  • The chapter discusses ways to estimate parameters that characterize these distributions from observed data.

⚠️ Missing values and their impact

⚠️ What missing values are

Missing value: an observation for which the value of the measurement is not recorded.

  • In R, missing values are identified by the symbol "NA".
  • Most statistical procedures delete observations with missing values and conduct inference on the remaining observations.

🚨 The danger of deletion

  • Some argue that deleting observations with missing values is dangerous and may lead to biased analysis.
  • Reason: missing values may contain information relevant for inference.
  • Example from clinical trials: the goal is to assess a new treatment's effect on survival of patients with a life-threatening illness over two years.
    • Time of death is recorded for patients who die during the trial.
    • Time of death is missing for patients who survive the entire two years.
    • Yet, one is advised not to ignore these patients—their survival is informative about the treatment's effectiveness.
  • Don't confuse: "missing" does not always mean "uninformative"; the pattern of missingness can carry meaning.

🔍 When missing values matter

  • In some cases, missing values contain information; in others, they do not.
  • When they do contain information, deleting them can distort the analysis.
  • Example: in the survival trial, ignoring survivors would underestimate the treatment's benefit.
  • When formulating an answer, consider examples from your own field: does the missingness itself tell you something about the phenomenon?

🧪 Practical example: estimating car price expectation

🧪 Using the sample average

  • The excerpt uses car prices from "cars.csv" to illustrate estimation of the expectation.
  • Natural estimator: the sample average, computed in R with mean(cars$price).

🛑 Encountering missing values

  • Applying mean to the price variable produces NA because the variable contains 4 missing values.
  • By default, when mean is applied to a sequence with missing values, it returns NA.
  • The function's behavior with missing values is controlled by the argument na.rm (not fully explained in the excerpt).
  • Implication: the analyst must decide how to handle missing values—delete them, or account for them in some other way.
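
A minimal sketch of this behavior in R, assuming the file "cars.csv" sits in the working directory (the file name and the price column are the ones named in the excerpt):

```r
# Read the car data (file name taken from the excerpt).
cars <- read.csv("cars.csv")

# The price variable contains missing values, so the plain mean is NA.
mean(cars$price)              # returns NA
sum(is.na(cars$price))        # counts the missing entries (4 in the excerpt)

# Setting na.rm = TRUE tells mean() to drop the missing values first.
mean(cars$price, na.rm = TRUE)
```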

10.2 Estimating Parameters

🧭 Overview

🧠 One-sentence thesis

When sample sizes are small, confidence intervals for Normal measurements require the t-distribution and chi-square distribution instead of the Normal approximation, ensuring accurate inference even with limited data.

📌 Key points (3–5)

  • When Normal methods apply: If measurements themselves are Normally distributed, exact confidence intervals can be constructed even for small samples using the t-distribution (for means) and chi-square distribution (for variances).
  • Key difference from large-sample methods: Replace the standard Normal percentile (1.96) with the t-distribution percentile, which has wider tails and depends on degrees of freedom (n − 1).
  • Common confusion: The t-distribution vs Normal—small samples need t-distribution percentiles (e.g., 2.09 for n=20), but as sample size grows, t-percentiles converge to Normal percentiles (≈1.96).
  • Variance intervals are asymmetric: Unlike mean intervals, variance confidence intervals use the chi-square distribution, which is not symmetric, requiring separate lower and upper percentiles.
  • Why it matters: Proper small-sample methods avoid the invalid assumption that sample variance equals population variance and that the Central Limit Theorem applies.

📏 Confidence intervals for different confidence levels

📏 Adjusting the percentile for other confidence levels

  • The 95% confidence level uses 1.96 (or qt(0.975, n−1)) because it leaves 2.5% in each tail.
  • For a 90% confidence level, use 1.645 (the 0.95-percentile of the standard Normal), leaving 5% in each tail.
  • The resulting intervals:
    • For an expectation: X̄ ± 1.645 · S / √n
    • For a proportion: P̂ ± 1.645 · √{P̂(1 − P̂) / n}
  • Don't confuse: The percentile changes with the confidence level, but the structure of the interval remains the same.
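
The percentiles quoted above can be checked directly in R; a short sketch, with made-up summary values standing in for real data:

```r
qnorm(0.975)   # ~1.96, used for a 95% confidence level
qnorm(0.95)    # ~1.645, used for a 90% confidence level

# Illustrative 90% interval for an expectation (x.bar, s, n are made up).
x.bar <- 10; s <- 2; n <- 50
x.bar + c(-1, 1) * qnorm(0.95) * s / sqrt(n)
```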

🔔 Confidence intervals for Normal means (small samples)

🔔 Why the t-distribution is needed

  • Large-sample assumption breaks down: The Central Limit Theorem approximation may be poor, and substituting sample variance for population variance introduces additional uncertainty.
  • When measurements are Normal: The exact distribution of the standardized sample average (X̄ − E(X)) / (S / √n) is known—it follows the t-distribution with (n − 1) degrees of freedom.

t-distribution: A bell-shaped, symmetric distribution similar to the standard Normal but with wider tails; characterized by degrees of freedom.

📐 Structure of the confidence interval

  • Formula: X̄ ± qt(0.975, n−1) · S / √n
  • The only change from the large-sample method is replacing 1.96 with the t-distribution percentile.
  • Degrees of freedom: (n − 1), the number of observations used to estimate variance minus 1.
  • Example: For n = 20, qt(0.975, 19) ≈ 2.09; for n = 185, qt(0.975, 184) ≈ 1.97 (close to 1.96).

🚗 Fuel consumption example

  • Context: Difference in miles per gallon between highway and city driving for diesel (n=20) and gas (n=185) cars.
  • Diesel cars:
    • Sample mean: 4.45
    • Sample SD: 2.78
    • 95% CI: [3.15, 5.75]
  • Gas cars:
    • Sample mean: 5.65
    • Sample SD: 1.43
    • 95% CI: [5.44, 5.86]
  • Don't confuse: The diesel interval is wider because the smaller sample size (20) leads to a larger t-percentile and more uncertainty.
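
A sketch that reproduces these intervals from the rounded summary statistics quoted above (the endpoints therefore match only approximately):

```r
# Diesel cars: n = 20, mean 4.45, sd 2.78.
4.45 + c(-1, 1) * qt(0.975, 19) * 2.78 / sqrt(20)    # roughly [3.15, 5.75]

# Gas cars: n = 185, mean 5.65, sd 1.43.
5.65 + c(-1, 1) * qt(0.975, 184) * 1.43 / sqrt(185)  # roughly [5.44, 5.86]
```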

📊 Confidence intervals for Normal variances

📊 The chi-square distribution

  • Why a new distribution: For Normal measurements, the quantity (n − 1)S² / σ² follows the chi-square distribution with (n − 1) degrees of freedom, denoted χ²(n−1).

Chi-square distribution: A distribution associated with the sum of squares of Normal variables; not symmetric; parameterized by degrees of freedom.

  • The chi-square is not symmetric, so the central 95% region requires both the 0.025- and 0.975-percentiles.

🧮 Structure of the variance confidence interval

  • Formula: [ (n − 1)·S² / qchisq(0.975, n−1) , (n − 1)·S² / qchisq(0.025, n−1) ]
  • The sample variance S² is multiplied by ratios involving degrees of freedom and chi-square percentiles.
  • Left boundary: Uses the larger probability (0.975) → smaller ratio → lower limit.
  • Right boundary: Uses the smaller probability (0.025) → larger ratio → upper limit.

🚗 Fuel consumption variance example

  • Diesel cars (n=20):
    • Estimated variance: 7.73
    • 95% CI: [4.47, 16.50]
  • Gas cars (n=185):
    • Estimated variance: 2.06
    • 95% CI: [1.69, 2.55]
  • Don't confuse: The interval is not symmetric around the point estimate because the chi-square distribution is asymmetric.
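
The endpoints can be reproduced from the quoted sample variances; a short sketch:

```r
# Diesel cars: n = 20, sample variance 7.73.
n <- 20; s2 <- 7.73
(n - 1) * s2 / qchisq(c(0.975, 0.025), n - 1)   # roughly [4.47, 16.50]

# Gas cars: n = 185, sample variance 2.06.
n <- 185; s2 <- 2.06
(n - 1) * s2 / qchisq(c(0.975, 0.025), n - 1)   # roughly [1.69, 2.55]
```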

🎯 Simulation validation

🎯 Checking the confidence level

  • Setup: Simulate 100,000 samples of size n=20 from a Normal distribution with mean μ=4 and variance σ²=9.
  • Mean confidence interval: Computed using the t-distribution method; 94.93% of intervals contained the true mean (nominal 95%).
  • Variance confidence interval: Computed using the chi-square method; 94.90% of intervals contained the true variance (nominal 95%).
  • Conclusion: The simulated confidence levels match the nominal 95% level, confirming the methods are exact when measurements are Normal.
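
A sketch of such a check, using the same setup as the excerpt (n = 20, mu = 4, variance 9, 100,000 repetitions):

```r
set.seed(1)                           # for reproducibility
mu <- 4; sig <- 3; n <- 20; B <- 1e5
cover.mean <- cover.var <- logical(B)
for (b in 1:B) {
  x <- rnorm(n, mu, sig)
  m <- mean(x); s <- sd(x)
  # t-based interval for the mean
  ci.m <- m + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)
  # chi-square-based interval for the variance
  ci.v <- (n - 1) * s^2 / qchisq(c(0.975, 0.025), n - 1)
  cover.mean[b] <- ci.m[1] <= mu & mu <= ci.m[2]
  cover.var[b]  <- ci.v[1] <= sig^2 & sig^2 <= ci.v[2]
}
mean(cover.mean)   # should be close to the nominal 0.95
mean(cover.var)    # should be close to the nominal 0.95
```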

📐 Choosing sample size

📐 Why sample size matters

  • Larger samples improve accuracy but increase cost.
  • Goal: Find the minimal sample size that achieves a desired level of precision (e.g., margin of error).

📐 Example: Opinion poll for a proportion

  • Question: How large must n be so that the sample proportion is within 0.05 (5 percentage points), or within 0.025, of the population proportion with 95% confidence?
  • Approach: The margin of error of the confidence interval for a proportion is 1.96 · √{P̂(1 − P̂) / n}.
  • To guarantee a margin of at most 0.05 (or 0.025), solve for n.
  • Don't confuse: This is about the width of the interval, not the confidence level; a narrower interval requires a larger sample.

📋 Summary table: Distributions for small-sample inference

| Parameter | Distribution used | Degrees of freedom | Percentile function | Confidence interval structure |
| --- | --- | --- | --- | --- |
| Mean (Normal) | t-distribution | n − 1 | qt(0.975, n−1) | X̄ ± qt(0.975, n−1) · S / √n |
| Variance (Normal) | Chi-square | n − 1 | qchisq(0.025, n−1) and qchisq(0.975, n−1) | [(n−1)S²/qchisq(0.975, n−1), (n−1)S²/qchisq(0.025, n−1)] |

10.3 Estimation of the Expectation

🧭 Overview

🧠 One-sentence thesis

Statistics provides guidelines for selecting the minimal sample size needed to achieve a desired level of accuracy in estimation, balancing statistical validity with resource constraints.

📌 Key points (3–5)

  • Why sample size matters: larger samples improve statistical accuracy but increase costs, so the goal is to find the minimum sufficient size.
  • How to determine sample size: use the confidence interval range to work backward and calculate the required n.
  • Key insight for proportions: the maximum value of p(1−p) is 1/4, which allows calculating a worst-case sample size without knowing the true proportion.
  • Common confusion: accuracy vs sample size—doubling accuracy (halving the margin of error) requires quadrupling the sample size, not doubling it.
  • Practical implication: even small improvements in precision can demand dramatically larger samples.

📐 The sample size problem

🎯 Why sample size planning matters

  • Statistical perspective: larger samples are usually better for accuracy.
  • Resource perspective: more observations typically mean higher expenses.
  • Design goal: collect the minimal number of observations still sufficient to reach a valid conclusion.
  • This is a narrow aspect of experiment design, but an important practical consideration.

📊 Example scenario: opinion poll

The excerpt uses a concrete case to illustrate the problem:

  • Goal: estimate the proportion of people who support a specific candidate.
  • Accuracy requirement: the sample proportion should be within 5 percentage points (or 2.5 points) of the true population percentage with high probability.
  • Method: use a confidence interval for the proportion; if the margin of error is no more than the desired accuracy, the requirement is met.

🔢 Calculating required sample size for proportions

🧮 The confidence interval structure

Confidence interval for proportion: P̂ ± 1.96 · √[P̂(1 − P̂)/n]

  • For a 95% confidence level, the multiplier is 1.96.
  • Margin of error (the ± part): 1.96 · √[P̂(1 − P̂)/n].
  • The full width of the interval is twice this margin.

🔑 The key mathematical insight

The challenge is that P̂(1 − P̂) is a random quantity; we do not know it in advance.

Solution: use the maximum possible value.

  • The function f(p) = p(1−p) reaches its maximum at p = 1/2.
  • The maximum value is 1/4.
  • Therefore: 1.96 · √[P̂(1 − P̂)/n] ≤ 1.96 · √[0.25/n] = 0.98/√n.

This upper bound allows us to calculate a sample size that works in the worst case, regardless of the true proportion.

📝 Working through the calculation

For a margin of 0.05 (the estimate within 5 percentage points of the truth):

  • Requirement: 0.98/√n ≤ 0.05
  • Rearrange: √n ≥ 0.98/0.05 = 19.6
  • Square both sides: n ≥ (19.6)² = 384.16
  • Conclusion: n should be larger than 384; n = 385 is sufficient.

For a margin of 0.025 (within 2.5 percentage points):

  • Requirement: 0.98/√n ≤ 0.025
  • Rearrange: √n ≥ 0.98/0.025 = 39.2
  • Square both sides: n ≥ (39.2)² = 1536.64
  • Conclusion: n = 1537 will do.

Don't confuse: the margin of error is the half-width of the interval (the ± part); the full width, from lower bound to upper bound, is twice the margin. A sketch of the calculation follows.
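
A one-line version of the calculation, using the worst-case bound 0.98/√n derived above:

```r
margin <- c(0.05, 0.025)
# Worst case p = 1/2, so the margin is at most 0.98 / sqrt(n); solve for n.
ceiling((0.98 / margin)^2)   # 385 and 1537
```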

⚖️ The accuracy-sample size trade-off

📈 Non-linear relationship

The excerpt highlights a striking result:

  • Halving the margin of error (from 0.05 to 0.025, i.e., from ±5 points to ±2.5 points) requires a sample size roughly 4 times larger (from 385 to 1537).

Why this happens:

  • Sample size appears under a square root in the formula.
  • To halve the margin of error, you must quadruple n (because √4 = 2).
  • This is a general pattern: small improvements in precision can demand large increases in sample size.

💡 Practical implication

Example: An organization planning a survey must weigh:

  • Benefit: more precise estimates (smaller margin of error).
  • Cost: potentially much larger sample (more time, money, effort).

The quadratic relationship means diminishing returns: each additional unit of precision becomes progressively more expensive.

10.4 Estimation of the Variance and Standard Deviation

🧭 Overview

🧠 One-sentence thesis

This section demonstrates how to estimate variance and standard deviation from sample data and construct confidence intervals for these parameters, particularly for small samples under the assumption of Normal distribution.

📌 Key points (3–5)

  • What is being estimated: variance and standard deviation of a population from sample data, alongside expectations (means).
  • Two approaches to confidence intervals: one uses Normal approximation of the sample average; the other assumes observations are Normally distributed (more suitable for small samples).
  • Common confusion: large vs small samples—large samples rely on general limit theorems (like the Central Limit Theorem) with fewer distributional assumptions, while small samples require strong assumptions about the distribution (e.g., Normality).
  • Chi-square distribution role: used to construct confidence intervals for variance when data are assumed Normal.
  • Why it matters: accurate estimation of variance and standard deviation is essential for understanding variability and making valid inferences, especially when sample sizes are limited.

📊 Estimation tasks and methods

📊 Estimating expectation and standard deviation

  • The excerpt asks students to estimate the expectation (mean) and standard deviation from all observations in a dataset.
  • It also requests these estimates for subgroups (e.g., only students who received a "charismatic" summary).
  • Plain language: calculate the average rating and the spread (standard deviation) to summarize the data.

🔢 Confidence intervals for the expectation

  • The excerpt instructs construction of a 99% confidence interval for the mean rating among a subgroup, assuming ratings follow a Normal distribution.
  • This interval is "most likely to contain the population parameter."
  • Example: if students rated a teacher after receiving a charismatic summary, the interval estimates the true average rating for all such students.

📐 Confidence intervals for the variance

  • The excerpt also asks for a 90% confidence interval for the variance of ratings in the same subgroup, again assuming Normality.
  • The chi-square distribution is used for this purpose (as noted in the glossary).
  • Don't confuse: variance intervals use chi-square distribution, not the t-distribution used for means.
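
A sketch of both intervals, with a made-up vector of ratings standing in for the "charismatic" subgroup (the numbers are hypothetical, purely for illustration):

```r
# Hypothetical ratings for the subgroup (made-up values).
ratings <- c(4, 5, 3, 4, 5, 4, 2, 5, 4, 3, 5, 4)
n <- length(ratings)

# 99% confidence interval for the expectation (t-distribution).
mean(ratings) + c(-1, 1) * qt(0.995, n - 1) * sd(ratings) / sqrt(n)

# 90% confidence interval for the variance (chi-square distribution).
(n - 1) * var(ratings) / qchisq(c(0.95, 0.05), n - 1)
```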

🔍 Two approaches to confidence intervals

🔍 Normal approximation vs assuming Normality

The excerpt contrasts two methods for constructing confidence intervals:

| Approach | Basis | When appropriate |
| --- | --- | --- |
| Normal approximation | Uses the Central Limit Theorem; the sample average is approximately Normal | Large samples; fewer distributional assumptions needed |
| Assume Normal observations | Assumes the data themselves are Normally distributed | Small samples; requires strong distributional assumptions |

🎲 Simulation to check actual confidence level

  • The excerpt describes an exercise where twenty observations are drawn from an Exponential distribution (not Normal).
  • Students simulate confidence intervals using both approaches and compute the actual confidence level (the proportion of intervals that contain the true parameter).
  • The nominal confidence level is 95%, but the actual level may differ if assumptions are violated.
  • Example: if the data are not Normal, the method assuming Normality may produce intervals that cover the true mean less than 95% of the time.
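
A sketch of this simulation, assuming an Exponential distribution with rate 0.5 (an arbitrary choice for illustration, so the true expectation is 2):

```r
set.seed(1)
n <- 20; B <- 1e5; rate <- 0.5; true.mean <- 1 / rate
cover.norm <- cover.t <- logical(B)
for (b in 1:B) {
  x <- rexp(n, rate)
  m <- mean(x); se <- sd(x) / sqrt(n)
  ci.norm <- m + c(-1, 1) * qnorm(0.975) * se       # Normal-approximation interval
  ci.t    <- m + c(-1, 1) * qt(0.975, n - 1) * se   # interval assuming Normal data
  cover.norm[b] <- ci.norm[1] <= true.mean & true.mean <= ci.norm[2]
  cover.t[b]    <- ci.t[1]    <= true.mean & true.mean <= ci.t[2]
}
mean(cover.norm)   # actual confidence level of the nominal 95% Normal interval
mean(cover.t)      # actual confidence level of the nominal 95% t interval
```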

🧩 Key statistical concepts

🧩 Confidence interval

Confidence Interval: An interval that is most likely to contain the population parameter.

  • It is not a guarantee; it is a probabilistic statement.
  • The confidence level indicates how often such intervals, constructed from random samples, will contain the true parameter.

🧩 Confidence level

Confidence Level: The sampling probability that random confidence intervals contain the parameter value.

  • An observed interval (from one sample) does not have a probability of containing the parameter; rather, it was constructed using a formula that works correctly a certain percentage of the time across many samples.
  • Example: a 99% confidence level means that if you repeated the sampling and interval construction many times, 99% of those intervals would contain the true parameter.

📉 t-Distribution

t-Distribution: A bell-shaped distribution that resembles the standard Normal distribution but has wider tails. The distribution is characterized by a positive parameter called degrees of freedom.

  • Used for constructing confidence intervals for means when the sample size is small and data are assumed Normal.
  • Wider tails reflect greater uncertainty with smaller samples.

📈 Chi-Square Distribution

Chi-Square Distribution: A distribution associated with the sum of squares of Normal random variables. The distribution obtains only positive values and it is not symmetric. The distribution is characterized by a positive parameter called degrees of freedom.

  • Used for constructing confidence intervals for variance.
  • Only takes positive values because variance is always non-negative.

⚖️ Large vs small samples

⚖️ Large samples: fewer assumptions

  • With large samples, general limit theorems (like the Central Limit Theorem) justify inference under broad conditions.
  • You do not need to assume a specific distribution for the data.
  • Example: even if individual observations are not Normal, the sample mean's distribution becomes approximately Normal as sample size grows.

⚖️ Small samples: strong assumptions required

  • For small sample sizes, you must make strong assumptions about the distribution (e.g., that observations are Normally distributed) to justify the validity of confidence intervals.
  • Don't confuse: the same procedure may be valid for large samples without strong assumptions but invalid for small samples unless those assumptions hold.
  • The excerpt emphasizes this trade-off in the forum discussion prompt.

10.5 Estimation of Other Parameters

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content about "Estimation of Other Parameters" but instead presents exercises, a summary chapter on confidence intervals, and an introduction to hypothesis testing.

📌 Key points (3–5)

  • What the excerpt actually covers: exercises on confidence intervals, glossary definitions for confidence intervals and related distributions, and an introduction to hypothesis testing.
  • Confidence interval formulas: provided for expectation and probability at 95% confidence level using different distributions.
  • Key distributions: t-distribution and chi-square distribution are introduced as tools for small-sample inference.
  • Common confusion: nominal vs. actual confidence level—when sample size is small and data are not Normal, the stated confidence level may not match the true coverage probability.
  • Transition to hypothesis testing: the excerpt shifts focus from estimation (confidence intervals) to testing competing hypotheses.

📚 Content overview

📚 What is present in the excerpt

The excerpt contains:

  • Two exercises on confidence intervals (one about simulation, one about insurance company surveys).
  • A summary section with glossary definitions.
  • A forum discussion prompt about small-sample inference.
  • Formulas for 95% confidence intervals.
  • The opening of Chapter 12 on hypothesis testing.

📚 What is missing

  • There is no substantive discussion of "Estimation of Other Parameters" as the title suggests.
  • The excerpt does not introduce new parameter estimation methods beyond what is already covered in confidence intervals.

🔑 Key definitions from the summary

🔑 Confidence interval

Confidence Interval: An interval that is most likely to contain the population parameter.

  • It is a range estimate, not a point estimate.
  • The interval is constructed from sample data.

🔑 Confidence level

Confidence Level: The sampling probability that random confidence intervals contain the parameter value. The confidence level of an observed interval indicates that it was constructed using a formula that produces, when applied to random samples, such random intervals.

  • It describes the long-run frequency property of the method, not the probability for one specific interval.
  • Example: a 95% confidence level means that if you repeatedly sample and construct intervals, 95% of those intervals will contain the true parameter.

🔑 t-Distribution

t-Distribution: A bell-shaped distribution that resembles the standard Normal distribution but has wider tails. The distribution is characterized by a positive parameter called degrees of freedom.

  • Used when sample size is small and population standard deviation is unknown.
  • Wider tails reflect greater uncertainty.

🔑 Chi-square distribution

Chi-Square Distribution: A distribution associated with the sum of squares of Normal random variables. The distribution obtains only positive values and it is not symmetric. The distribution is characterized by a positive parameter called degrees of freedom.

  • Used for variance estimation.
  • Only takes positive values; skewed to the right.

⚠️ Small-sample inference challenges

⚠️ The core problem

The forum discussion highlights a critical issue:

  • Large samples: fewer assumptions needed; the Central Limit Theorem justifies inference under general conditions.
  • Small samples: strong assumptions required (e.g., data must be Normally distributed) to justify the validity of procedures.

⚠️ Nominal vs. actual confidence level

  • Nominal confidence level: the stated or intended level (e.g., 95%).
  • Actual confidence level: the true coverage probability when the method is applied.
  • When they differ: if sample size is small and the data distribution is not Normal, the nominal level may not match the actual level.
  • Example: you construct a "95% confidence interval" using a formula that assumes Normality, but your data are skewed—the true coverage might be 90% or 88%, not 95%.

⚠️ The debate

  • One view: making inferences with small samples is worthless because you cannot verify the distributional assumptions.
  • Counterpoint (implied): procedures can still be useful if assumptions are reasonable or if sensitivity is checked (e.g., via simulation, as in Exercise 11.2).

📐 Confidence interval formulas (95% level)

The excerpt provides four formulas for 95% confidence intervals:

| Parameter | Formula | Notes |
| --- | --- | --- |
| Expectation (large sample or unknown distribution) | x̄ ± qnorm(0.975) · s/√n | Uses Normal quantile; relies on CLT |
| Probability | p̂ ± qnorm(0.975) · √[p̂(1−p̂)/n] | For proportions; uses Normal approximation |
| Normal expectation (small sample) | x̄ ± qt(0.975, n−1) · s/√n | Uses t-distribution with n−1 degrees of freedom |
| Normal variance | [(n−1)s²/qchisq(0.975, n−1), (n−1)s²/qchisq(0.025, n−1)] | Uses chi-square distribution; interval for variance |

📐 Key distinctions

  • qnorm vs. qt: qnorm is for the standard Normal; qt is for the t-distribution (wider tails, used when sample size is small).
  • Variance interval: unlike mean intervals, the variance interval uses the chi-square distribution and is not symmetric around the sample variance.
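
The four recipes translate directly into R; a sketch with placeholder summaries (x.bar, s, n, x.count are made-up stand-ins, not values from the excerpt):

```r
# Placeholder summaries; replace with values from your own data.
x.bar <- 10; s <- 2; n <- 30          # sample mean, sd, size
x.count <- 12; p.hat <- x.count / n   # successes and sample proportion

# Expectation, large sample (Normal quantile):
x.bar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)

# Probability (proportion):
p.hat + c(-1, 1) * qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / n)

# Normal expectation, small sample (t quantile):
x.bar + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)

# Normal variance (chi-square quantiles):
(n - 1) * s^2 / qchisq(c(0.975, 0.025), n - 1)
```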

🔬 Transition to hypothesis testing

🔬 What hypothesis testing does

The excerpt introduces Chapter 12:

  • Purpose: decide between two competing options; answer "Is there a phenomenon at all?"
  • Context: testing hypotheses about the expectation of a measurement or the probability of an event.
  • Role in inference: hypothesis testing is typically the first step—before estimating parameters, you test whether there is something to estimate.

🔬 Learning objectives (from Chapter 12 intro)

  • Formulate statistical hypotheses for testing.
  • Test hypotheses about expectation and probability based on a sample.
  • Identify limitations and dangers of misinterpreting test conclusions.

🔬 Don't confuse

  • Confidence intervals (Chapter 11): estimate a range for a parameter.
  • Hypothesis testing (Chapter 12): decide whether a specific claim about a parameter is supported by data.
  • Both use similar distributions (Normal, t, chi-square) but serve different inferential goals.

10.6 Exercises

🧭 Overview

🧠 One-sentence thesis

Statistical hypothesis testing provides a formal framework for deciding between a null hypothesis (the currently accepted explanation) and an alternative hypothesis (a new claim), using sample data to determine whether observed differences are statistically significant rather than due to random chance.

📌 Key points (3–5)

  • Core structure: hypothesis testing splits into three steps—formulating null and alternative hypotheses, specifying the test statistic and rejection region, and reaching a conclusion based on observed data.
  • Two error types: Type I error (rejecting a true null hypothesis) is controlled by the significance level (typically 5%), while Type II error (failing to reject a false null hypothesis) relates to statistical power.
  • Common confusion: simple-minded significance vs. statistical significance—a smaller deviation can be more statistically significant if sampling variability is much smaller; the test compares deviation to the standard error, not just absolute difference.
  • The p-value: measures the probability of observing data as extreme as (or more extreme than) what was observed, assuming the null hypothesis is true; reject the null when p-value < 0.05 (at 5% significance level).
  • Sample size matters: the same estimated difference can lead to different conclusions depending on sample size, because larger samples reduce sampling variability and increase the ability to detect real effects.

🏗️ The structure of hypothesis testing

🏗️ Three-step process

The testing process follows a logical sequence:

  1. Formulate hypotheses (before seeing data): split parameter values into two collections—null hypothesis H₀ (phenomenon absent) and alternative hypothesis H₁ (phenomenon present)
  2. Specify the test (before seeing data): choose a test statistic and define the rejection region (values that lead to rejecting H₀)
  3. Reach conclusion (using observed data): compute the test statistic from the sample and decide whether it falls in the rejection region

Null hypothesis (H₀): the sub-collection of parameter values where the phenomenon is absent; the currently accepted theory; the hypothesis where erroneously rejecting it is more severe.

Alternative hypothesis (H₁): the sub-collection reflecting the presence of the phenomenon; the new theory challenging the established one.

🎯 Formulating hypotheses

The formulation depends on what phenomenon you want to investigate:

  • Two-sided alternative: H₁: E(X) ≠ value (testing for any change)
  • One-sided alternatives:
    • H₁: E(X) < value (testing for decrease)
    • H₁: E(X) > value (testing for increase)

Example: To test whether car prices changed, use H₀: E(X) = 13,662 vs. H₁: E(X) ≠ 13,662. To test specifically for a price rise, use H₀: E(X) ≥ 13,662 vs. H₁: E(X) < 13,662.

Don't confuse: The null hypothesis is not always "equals"—it can be "greater than or equal" or "less than or equal" depending on what phenomenon you're investigating.

📊 Test statistic and rejection region

Test statistic: a statistic that summarizes the sample data to decide between the two hypotheses.

Rejection region: a set of values for the test statistic; if the observed value falls in this region, reject H₀.

For testing expectations, the t-statistic is commonly used:

  • Formula: t = (sample mean − null value) / (sample standard deviation / √n)
  • This measures the discrepancy in units of the estimated standard error

Rejection regions by alternative type:

  • Two-sided: reject if |t| > threshold (e.g., 1.972 for n=201)
  • Greater than: reject if t > threshold (e.g., 1.653)
  • Less than: reject if t < negative threshold (e.g., −1.653)

The threshold is chosen to achieve the desired significance level (typically 5%).
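
The thresholds quoted above come from the t-distribution with n − 1 degrees of freedom; a sketch for n = 201, with made-up summary values for the statistic itself:

```r
n <- 201
qt(0.975, n - 1)   # ~1.972: two-sided rejection threshold at the 5% level
qt(0.95, n - 1)    # ~1.653: one-sided rejection threshold at the 5% level

# The t statistic from placeholder summaries (made-up numbers).
x.bar <- 9.6; mu0 <- 10; s <- 3
t.stat <- (x.bar - mu0) / (s / sqrt(n))
abs(t.stat) > qt(0.975, n - 1)   # TRUE means reject the two-sided null
```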

⚠️ Error types and probabilities

⚠️ Two types of error

| Error type | What happens | When it occurs |
| --- | --- | --- |
| Type I | Reject H₀ when H₀ is true | False positive; claiming a phenomenon exists when it doesn't |
| Type II | Fail to reject H₀ when H₁ is true | False negative; missing a real phenomenon |

The two errors are not treated symmetrically: Type I error is considered more severe, so the test is designed to control its probability at a pre-specified level.

⚠️ Significance level and power

Significance level: the probability of Type I error; the probability (computed under H₀) of rejecting H₀. Commonly set at 5% or 1%.

Statistical power: the probability (computed under H₁) of rejecting H₀; equals 1 − probability of Type II error.

When comparing two tests with the same significance level, prefer the one with higher statistical power.

Why the asymmetry? In scientific research, the currently accepted theory is designated as H₀. A novel claim requires strong evidence to replace it. Similarly, a new drug must demonstrate clear benefit before approval—the null is "no better than current treatment."

⚠️ Interpreting significance

The excerpt emphasizes that statistical significance differs from simple-minded significance:

  • Simple assessment: looks only at the size of the deviation from the null value
  • Statistical assessment: compares the deviation to the sampling variability (standard error)

Example: Two groups showed the same deviation from the null expectation (about 0.275), but one had p-value = 0.127 (not significant) while the other had p-value = 0.052 (nearly significant). The difference was due to different sample standard deviations (1.806 vs. 1.422)—smaller variability makes the same deviation more statistically significant.

📈 The p-value approach

📈 What the p-value measures

p-value: the probability, computed under the null hypothesis, of obtaining a test statistic as extreme as (or more extreme than) the observed value.

The p-value is itself a test statistic. It equals the significance level of a test where the observed value serves as the threshold.

Decision rule:

  • If p-value < 0.05, reject H₀ at the 5% significance level
  • If p-value < 0.01, reject H₀ at the 1% significance level

📈 Computing the p-value

For a two-sided test with observed t = −0.811:

  • p-value = P(|T| > 0.811) under H₀
  • This equals twice the upper tail probability (by symmetry of the t-distribution)
  • Formula: 2 × [1 − P(T ≤ 0.811)]

For one-sided tests:

  • Greater than alternative: p-value = P(T > observed value)
  • Less than alternative: p-value = P(T < observed value)

Advantage of p-values: No need to look up critical thresholds; simply compare the p-value directly to your chosen significance level.
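
These tail probabilities are available through pt(); a sketch for the observed value t = −0.811, assuming the 200 degrees of freedom of the car-price example in this section:

```r
t.obs <- -0.811; df <- 200

# Two-sided p-value: both tails, using the symmetry of the t-distribution.
2 * (1 - pt(abs(t.obs), df))   # ~0.418

# One-sided p-values.
1 - pt(t.obs, df)   # alternative "greater"
pt(t.obs, df)       # alternative "less"
```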

📈 P-value and sample size

The excerpt illustrates how sample size affects conclusions:

  • With n=20 and estimated proportion = 0.3 (vs. null of 0.5): p-value = 0.118, do not reject
  • With n=200 and estimated proportion = 0.3 (vs. null of 0.5): p-value = 0.00000002, strongly reject

The estimated value (0.3) is identical, but the larger sample reduces sampling variability, making the same discrepancy highly significant.

Key lesson: Statistical testing is based on the relative discrepancy compared to sampling variability, not the absolute discrepancy alone.

🧪 Testing expectations (t-test)

🧪 When to use the t-test

The t-test is used to test hypotheses about the expected value (mean) of a measurement:

  • Statistical model: observations are a random sample
  • Parameter of interest: E(X), the expectation
  • Test statistic: t = (sample mean − null value) / (s / √n)
  • Distribution under H₀: t-distribution with n−1 degrees of freedom

🧪 Applying the t-test in practice

The excerpt demonstrates using the function t.test with car price data:

Basic syntax: t.test(data, mu=null_value)

  • data: the sample observations
  • mu: the expected value under H₀ (default is 0)

For one-sided alternatives, add:

  • alternative="greater" for H₁: E(X) > null value
  • alternative="less" for H₁: E(X) < null value
  • Default is alternative="two.sided"

🧪 Interpreting t-test output

The output includes:

  • Test statistic value (t = ...)
  • Degrees of freedom (df = n−1)
  • p-value: compare to significance level
  • Confidence interval: for two-sided tests, a 95% interval; for one-sided, a one-sided interval like [lower bound, ∞)
  • Sample estimate: the sample mean

Example interpretation: If p-value = 0.418 > 0.05, do not reject H₀; conclude the expected value is not significantly different from the null value.
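
A sketch of the call for the car-price example described above (t.test omits the 4 missing prices before testing):

```r
cars <- read.csv("cars.csv")

# Two-sided test of H0: E(X) = 13662 for the price of a car.
t.test(cars$price, mu = 13662)

# One-sided alternatives use the same call with an extra argument.
t.test(cars$price, mu = 13662, alternative = "less")
t.test(cars$price, mu = 13662, alternative = "greater")
```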

🧪 Subsetting data for testing

The excerpt shows how to test hypotheses for subgroups using logical indexing:

  • Create a logical variable (e.g., heavy <- weight > 2414)
  • Use it to subset: data[heavy] selects observations where heavy is TRUE
  • Use negation: data[!heavy] selects observations where heavy is FALSE

This allows testing the same hypothesis separately for different subgroups (e.g., heavier vs. lighter cars).

🎲 Testing proportions

🎲 When to use proportion tests

Proportion tests examine hypotheses about the probability p of an event:

  • Connection to expectations: p is the expected value of a Bernoulli variable (1 if event occurs, 0 otherwise)
  • Estimator: sample proportion p̂ = (number of occurrences) / n
  • Variance under H₀: V(p̂) = p₀(1−p₀)/n, where p₀ is the null value

🎲 The test statistic for proportions

The test statistic measures the standardized deviation:

Z = (p̂ − p₀) / √[p₀(1−p₀)/n]

Under H₀, Z is approximately standard Normal (by Central Limit Theorem), so Z² follows a chi-square distribution with 1 degree of freedom.

Rejection region: {Z² > c} for some threshold c, or equivalently {|Z| > √c}.

🎲 Applying prop.test

The excerpt demonstrates using the function prop.test:

Basic syntax: prop.test(x, n)

  • x: number of occurrences of the event
  • n: total sample size
  • Default null probability is 0.5; change with p=value

Example: Testing whether the median weight for diesel cars is 2,414 lb:

  • 6 out of 20 diesel cars are below the threshold
  • prop.test(6, 20) tests H₀: p = 0.5
  • Output: X-squared = 2.45, p-value = 0.118, do not reject

Continuity correction: By default, the function applies Yates' continuity correction (the same adjustment used in the Normal approximation of the Binomial distribution). It can be disabled with correct=FALSE.

🎲 Sample size and proportion tests

The excerpt demonstrates the effect of sample size:

  • With n=20, x=6: p̂=0.3, p-value=0.118 (not significant)
  • With n=200, x=60: p̂=0.3, p-value=0.00000002 (highly significant)

Even though the estimated proportion is identical (0.3 vs. null of 0.5), the larger sample makes the difference statistically significant because sampling variability decreases with sample size.
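
A sketch of both calls, using the counts quoted above:

```r
# 6 successes out of 20: the deviation from p = 0.5 is not significant.
prop.test(6, 20)       # X-squared ~2.45, p-value ~0.118

# Same proportion, ten times the sample size: strongly significant.
prop.test(60, 200)     # p-value ~2e-8

# The continuity correction can be switched off if desired.
prop.test(6, 20, correct = FALSE)
```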

🔍 Important considerations

🔍 Robustness and assumptions

The t-test assumes measurements are Normally distributed. The excerpt mentions examining robustness to violations of this assumption (e.g., testing with Exponential or Uniform distributions instead of Normal).

When sample size is small and the distribution is not Normal, the nominal significance level may not match the actual significance level.

🔍 Confidence intervals in testing

The excerpt notes that confidence intervals and hypothesis tests are related:

  • A 95% confidence interval is reported alongside the test
  • For one-sided tests, a one-sided confidence interval is given (e.g., [lower bound, ∞))
  • If the null value falls outside the 95% confidence interval, the null hypothesis would be rejected at the 5% level

🔍 Practical interpretation

The excerpt emphasizes several practical points:

  • Conservatism in science: statisticians advocate caution to add objectivity; investigators may prefer bold discoveries
  • Publication bias: many journals require p-value < 0.05 for publication; should results with p-value around 10% also be published?
  • Context matters: the severity of Type I vs. Type II errors depends on the application (e.g., drug approval, scientific discovery)

Don't confuse: Failing to reject H₀ does not prove H₀ is true; it only means the data do not provide sufficient evidence against it.

10.7 Summary

🧭 Overview

🧠 One-sentence thesis

Statistical hypothesis testing provides formal guidelines for deciding whether observed differences reflect real phenomena or merely random noise, but its validity depends critically on sample size and distributional assumptions.

📌 Key points (3–5)

  • Core purpose of hypothesis testing: to determine whether observed data reveal a meaningful phenomenon or can be explained by randomness alone.
  • The validity problem: confidence intervals and tests rely on assumptions (large sample size or Normal distribution for small samples) that cannot always be verified, creating a gap between nominal and actual significance levels.
  • Common confusion: a higher observed value does not automatically mean a significant difference—context (e.g., inflation adjustment) and statistical testing are needed to distinguish signal from noise.
  • Two main testing contexts: testing hypotheses about expectations (means) and about probabilities of events.

⚠️ Limitations of small-sample inference

⚠️ The assumption problem

  • Confidence intervals and hypothesis tests are provably valid under two conditions:
    • The sample size is large, or
    • The sample size is small but the observations follow a Normal distribution.
  • When the sample is small and the distribution is not Normal, the nominal significance level (what the formula claims) may not match the actual significance level (what you really get).

🤔 The verification dilemma

  • The excerpt raises a critical question: How can we trust conclusions that depend on distributional assumptions we cannot verify?
  • With small samples, you cannot reliably check whether the data are Normal, yet the validity of your interval or test depends on that assumption.
  • Don't confuse: "the formula always works" vs. "the formula's guarantees hold only under certain conditions."

📐 Confidence interval formulas (95% level)

The excerpt lists four formulas (notation preserved from the source):

| Parameter | Formula | Notes |
| --- | --- | --- |
| Expectation (general) | x̄ ± qnorm(0.975) · s/√n | Large-sample or Normal assumption |
| Probability | p̂ ± qnorm(0.975) · √[p̂(1−p̂)/n] | For proportions |
| Normal expectation (mean) | x̄ ± qt(0.975, n−1) · s/√n | Uses t-distribution when Normal |
| Normal variance | [(n−1)s²/qchisq(0.975, n−1), (n−1)s²/qchisq(0.025, n−1)] | Interval for variance under Normality |

  • These are computational recipes; their correctness depends on the conditions discussed above.

🧪 What hypothesis testing does

🧪 The central question

Hypothesis testing tries to answer: "Is there a phenomenon at all?"

  • Statistical inference detects and characterizes meaningful phenomena hidden in random noise.
  • Hypothesis testing is typically the first step: before estimating effect sizes or building models, you ask whether the effect exists.

🔍 The basic approach

  • Determine whether the observed data can or cannot be reasonably explained by a model of pure randomness (no phenomenon).
  • If randomness alone is an unlikely explanation, you conclude a phenomenon is present.
  • Example: Observed car prices in 2009 (inflation-adjusted to 1985 dollars) are $13,662 vs. the 1985 average of $13,207—is the $455 difference real or just noise?

🎯 Two main contexts

The excerpt mentions testing in two settings:

  • Expectation of a measurement (e.g., mean car price).
  • Probability of an event (e.g., proportion of cars above a threshold).

🚗 Illustrative example: car prices

🚗 The raw comparison

  • 1985 average car price: $13,207.
  • 2009 average car price: $27,958 (nominal dollars).
  • Clearly higher, but this comparison ignores inflation.

💵 Adjusting for context

  • After adjusting for inflation, the 2009 price corresponds to $13,662 in 1985 dollars.
  • The difference is now $13,662 − $13,207 = $455.
  • Don't confuse: "higher number" vs. "statistically significant difference"—the latter requires a formal test.

🧪 The statistical question

  • Is $455 a significant difference, or could it arise from random variation in the data?
  • The excerpt mentions using the function t.test to conduct this test (details deferred to later sections).
  • This illustrates the core logic: observed difference → adjust for context → test whether the adjusted difference exceeds what randomness would produce.

🎓 Learning goals for hypothesis testing

The excerpt lists three objectives (from the Student Learning Objectives):

  1. Formulate statistical hypotheses for testing.
  2. Test hypotheses about the expectation of a measurement and the probability of an event, based on a sample.
  3. Identify limitations of statistical hypothesis testing and the danger of misinterpretation of the test's conclusions.
  • Emphasis on both mechanics (how to test) and critical thinking (when conclusions are trustworthy, when they are not).

11.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

The t-test and proportion test both assess whether sample data provide sufficient evidence to reject a null hypothesis about a population parameter, with statistical significance depending not just on the deviation from the null value but on the sampling variability relative to that deviation.

📌 Key points (3–5)

  • Equivalence of p-value and T statistic: The p-value test and the T statistic test are equivalent—one rejects the null if and only if the other does—but the p-value is compared directly to the significance level without further computation.
  • Statistical vs. simple-minded significance: Two samples with equal deviations from the null can have very different statistical significance if their sampling variabilities differ; smaller standard deviation makes the same deviation more significant.
  • Subsetting with logical sequences: A logical sequence (TRUE/FALSE values) can index another sequence to extract components that meet a condition, enabling hypothesis tests on subgroups.
  • One-sided vs. two-sided tests: One-sided tests have p-values half the size of two-sided tests (for the same test statistic) because they consider probability in only one tail of the distribution.
  • Common confusion—deviation vs. significance: A larger deviation from the null expectation is not always more statistically significant; the ratio of deviation to standard error determines significance.

🔬 Testing hypotheses on the difference in fuel consumption

🚗 The variable and research question

  • The variable dif.mpg measures the difference in miles-per-gallon between highway and city driving for each car type.
  • Summary statistics: range 0–11, median 6, mean 5.532, with 50% of values between 5 and 7.
  • Research question: Does the expected difference in fuel consumption between highway and city conditions differ by car weight?

⚖️ Competing predictions about weight and fuel difference

Two opposing conjectures exist:

| Conjecture | Reasoning | Prediction |
| --- | --- | --- |
| Heavier cars → larger difference | Urban traffic requires frequent speed changes; heavier cars need more energy to accelerate, so city driving is less efficient for them | Heavier cars have a bigger highway-city difference |
| Heavier cars → smaller difference | Heavier cars do fewer miles per gallon overall; the difference between two smaller numbers may be smaller than the difference between two larger numbers | Heavier cars have a smaller highway-city difference |

🔪 Dividing cars into weight groups

  • The threshold is set at the median curb weight: 2,414 lb.
  • Cars above this threshold are "heavy" (102 cars); cars at or below are "light" (103 cars).
  • The logical variable heavy is TRUE for cars above the threshold and FALSE otherwise.

🧮 Subsetting data with logical sequences

🔍 How logical indexing works

Logical indexing: Using a sequence of TRUE/FALSE values to select components from another sequence of the same length; only positions with TRUE are extracted.

Example:

  • w <- c(5,3,4,6,2,9) and d <- c(13,22,0,12,6,20)
  • w > 5 produces [FALSE, FALSE, FALSE, TRUE, FALSE, TRUE]
  • d[w > 5] extracts the 4th and 6th elements of d: [12, 20]

🔄 Reversing logical values with !

  • The operator ! reverses TRUE to FALSE and FALSE to TRUE.
  • !(w > 5) produces [TRUE, TRUE, TRUE, FALSE, TRUE, FALSE]
  • d[!(w > 5)] extracts elements of d where w is ≤ 5: [13, 22, 0, 6]
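
The toy example above as runnable code:

```r
w <- c(5, 3, 4, 6, 2, 9)
d <- c(13, 22, 0, 12, 6, 20)

w > 5          # FALSE FALSE FALSE  TRUE FALSE  TRUE
d[w > 5]       # 12 20: components of d where w exceeds 5
d[!(w > 5)]    # 13 22  0  6: components of d where w is at most 5
```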

🚙 Applying to car data

  • dif.mpg[heavy] extracts fuel differences for heavier cars (curb weight > 2,414 lb).
  • dif.mpg[!heavy] extracts fuel differences for lighter cars (curb weight ≤ 2,414 lb).

📊 Conducting t-tests on weight subgroups

🏋️ Test for heavier cars

Hypotheses:

  • Null H₀: E(X) = 5.53 (expected difference for heavier cars equals the overall average)
  • Alternative H₁: E(X) ≠ 5.53 (two-sided)

Results:

  • Test statistic: t = −1.5385, degrees of freedom = 101
  • p-value = 0.127
  • Sample mean = 5.254902
  • 95% confidence interval: [4.900198, 5.609606]

Conclusion: The null hypothesis is not rejected at the 5% significance level (0.127 > 0.05); no significant evidence that heavier cars differ from the overall average.

🪶 Test for lighter cars

Hypotheses: Same structure as heavier cars.

Results:

  • Test statistic: t = 1.9692, degrees of freedom = 102
  • p-value = 0.05164
  • Sample mean = 5.805825
  • 95% confidence interval: [5.528002, 6.083649]

Conclusion: The null hypothesis is not rejected at the 5% level, but the p-value is very close to 0.05, suggesting the expected difference for lighter cars is almost significantly different from the overall average.

🧠 Why the difference in significance?

Both groups have nearly identical deviations from the null:

  • Heavier cars: 5.254902 − 5.53 = −0.275098
  • Lighter cars: 5.805825 − 5.53 = 0.275825

The key difference is sampling variability:

  • Sample standard deviation for lighter cars: 1.421531
  • Sample standard deviation for heavier cars: 1.805856

The T statistic measures the ratio of deviation to the estimated standard error (S/√n). A smaller standard deviation makes the same deviation more statistically significant.

Don't confuse: A larger absolute deviation does not guarantee higher significance; the deviation must be large relative to the standard error.

🎯 One-sided hypothesis tests

➡️ Testing H₁: E(X) > 5.53 for lighter cars

  • Specified by alternative="greater" in the t.test function.
  • Test statistic: t = 1.9692 (same as two-sided)
  • p-value = 0.02582 (half of the two-sided p-value)
  • One-sided confidence interval: [5.573323, ∞)

Interpretation: The p-value is the probability that the test statistic is greater than the observed value under the null. The null hypothesis is rejected at the 5% level (0.02582 < 0.05).

⬅️ Testing H₁: E(X) < 5.53 for lighter cars

  • Specified by alternative="less".
  • Test statistic: t = 1.9692
  • p-value = 0.9742
  • One-sided confidence interval: (−∞, 6.038328]

Interpretation: The p-value is the probability that the test statistic is less than the observed value. The null hypothesis is not rejected (0.9742 >> 0.05).
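
A sketch that puts the pieces of this section together; the column names curb.weight, highway.mpg, and city.mpg are assumed from the excerpt's description of the data:

```r
cars <- read.csv("cars.csv")
dif.mpg <- cars$highway.mpg - cars$city.mpg
heavy <- cars$curb.weight > 2414           # TRUE for the heavier half of the cars

# Two-sided tests of H0: E(X) = 5.53 in each weight group.
t.test(dif.mpg[heavy],  mu = 5.53)         # p-value ~0.127 in the excerpt
t.test(dif.mpg[!heavy], mu = 5.53)         # p-value ~0.052 in the excerpt

# One-sided alternatives for the lighter cars.
t.test(dif.mpg[!heavy], mu = 5.53, alternative = "greater")   # p-value ~0.026
t.test(dif.mpg[!heavy], mu = 5.53, alternative = "less")      # p-value ~0.974
```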

🔀 One-sided vs. two-sided p-values

  • For the same test statistic, the one-sided p-value in the direction supported by the data is half the two-sided p-value (here 0.02582 vs. 0.05164); in the opposite direction it is the complement (0.9742).
  • Why: The two-sided test considers both tails (probability of being greater than |t| or less than −|t|), while the one-sided test considers only one tail.

🎲 Testing hypotheses on proportions

🧩 Relating proportions to expectations

Probability p of an event: Can be estimated by the sample proportion P̂, the relative frequency of the event in the sample.

  • The underlying model uses a Bernoulli random variable X: X = 1 if the event occurs, X = 0 otherwise.
  • The probability p is the expectation of X.
  • The estimator is the sample average of X.

📐 The test statistic for proportions

Hypotheses:

  • Null H₀: p = 0.5
  • Alternative H₁: p ≠ 0.5

Test statistic:

Z = (P̂ − 0.5) / √[0.5(1 − 0.5)/n]

  • Numerator: deviation of the sample proportion from the null value.
  • Denominator: standard deviation of the sample proportion under the null hypothesis.
  • The variance of P̂ is p(1 − p)/n; under H₀, this becomes 0.5(1 − 0.5)/n.

🎯 Rejection region

  • If H₀ is true, the sampling distribution of Z is centered at 0.
  • Values of Z much larger or much smaller than 0 indicate the null is unlikely.
  • Rejection region: {|Z| > c} for some threshold c set to control the significance level.

Don't confuse: This test is similar to the t-test but uses the standard deviation under the null hypothesis (not the sample standard deviation), making it a Z-test rather than a t-test.

11.2 Intervals for Mean and Proportion

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing for proportions uses the sample proportion and its standard deviation to determine whether an observed probability differs significantly from a hypothesized value, with larger samples making it easier to detect the same discrepancy.

📌 Key points (3–5)

  • What the test measures: whether the probability of an event equals a specific hypothesized value (e.g., p = 0.5) by comparing the sample proportion to the null hypothesis value.
  • How the test statistic works: it standardizes the deviation of the sample proportion from the null value using the standard deviation under the null hypothesis.
  • Distribution used: the test statistic approximately follows a chi-square distribution on 1 degree of freedom (or equivalently, the square of a standard Normal variable).
  • Common confusion: a large discrepancy between sample proportion and null value does not automatically mean rejection—sample size matters because smaller samples have higher variability.
  • Why sample size matters: the same estimated proportion can lead to rejection with a large sample but not with a small sample, because larger samples reduce sampling variability.

🧪 The statistical model and setup

🧪 Bernoulli framework

  • The problem relates to estimating the probability p of some event.
  • The event is modeled using a Bernoulli random variable X:
    • X = 1 when the event occurs
    • X = 0 when it does not
  • The probability p is the expectation of X.
  • The sample proportion (denoted P̂) is the sample average of this measurement.

🎯 Hypothesis formulation

The excerpt illustrates testing:

  • Null hypothesis H₀: p = 0.5 (the probability equals a specific value, here one half)
  • Alternative hypothesis H₁: p ≠ 0.5 (the probability is not equal to this value)

The framework can be adapted to other null values besides 0.5 by replacing the appropriate probability.

📐 Constructing the test statistic

📐 The standardized proportion

The test statistic measures the ratio between:

  1. The deviation of the estimator from its null expected value
  2. The standard deviation of the estimator

Test statistic: Z = (P̂ − 0.5) / √[0.5(1 − 0.5)/n]

  • The numerator is the difference between the sample proportion and the null value.
  • The denominator is the standard deviation of the sample proportion under the null hypothesis.
  • Under the null hypothesis, the variance of P̂ is V(P̂) = p(1 − p)/n, which becomes 0.5(1 − 0.5)/n when p = 0.5.

🎲 Distribution and rejection region

  • If the null hypothesis is true, the value 0 is the center of the sampling distribution of Z.
  • Values much larger or much smaller than 0 indicate the null hypothesis is unlikely.
  • The rejection region takes the form {|Z| > c} or equivalently {Z² > c²}, where c is a threshold.
  • The threshold c is set high enough to ensure the required significance level (the probability under the null of obtaining a value in the rejection region).

📊 Using the chi-square distribution

  • By the Central Limit Theorem, the distribution of the test statistic is approximately Normal.
  • If Z has the standard Normal distribution, then Z² has a chi-square distribution on one degree of freedom.
  • This allows computation of approximate thresholds and p-values.

🔬 Worked example: diesel car weights

🔬 The research question

  • The sample median curb weight is 2,414 lb (half the cars above, half below).
  • Question: Is the median weight of diesel cars also 2,414 lb?
  • If 2,414 lb were the population median for diesel cars, the probability that a random diesel car does not exceed this weight would be 0.5.

📋 The data

  • Total diesel cars in sample: 20
  • Diesel cars with weight below 2,414 lb: 6
  • Diesel cars with weight above 2,414 lb: 14
  • Sample proportion: P̂ = 6/20 = 0.3

🧮 Test results

The test statistic (with continuity correction) obtains the value 2.45.

  • The p-value is 0.1175 (the probability that a chi-square distribution on 1 degree of freedom exceeds 2.45).
  • Since 0.1175 > 0.05, the null hypothesis is not rejected at the 5% significance level.
  • The 95% confidence interval for p is [0.1284, 0.5433].

Don't confuse: Although the estimated proportion (0.3) deviates substantially from the null value (0.5), the null hypothesis is not rejected because the sample size is small (n = 20), leading to high sampling variability.
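
The reported statistic and p-value can be reproduced by hand; a short sketch:

```r
p.hat <- 6 / 20; p0 <- 0.5; n <- 20

# Yates-corrected statistic, as used by prop.test by default.
stat <- (abs(p.hat - p0) - 0.5 / n)^2 / (p0 * (1 - p0) / n)
stat                       # 2.45
1 - pchisq(stat, df = 1)   # ~0.1175, the reported p-value

# The same test via the built-in function.
prop.test(6, 20)
```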

📏 The critical role of sample size

📏 Same proportion, different sample size

The excerpt demonstrates the same test with:

  • Sample size n = 200
  • Number of occurrences: 60
  • Sample proportion: P̂ = 60/200 = 0.3 (identical to the previous example)

📏 Dramatically different conclusion

  • Test statistic: 31.205
  • p-value: 2.322 × 10⁻⁸ (far below 0.05)
  • The null hypothesis is rejected "with flying colors"
  • The 95% confidence interval is [0.2384, 0.3694] (narrower than before)

🔑 Key insight: relative discrepancy

The basic characteristic of statistical hypothesis testing: consideration is based not on the discrepancy of the estimator from the null value, but on the relative discrepancy in comparison to the sampling variability of the estimator.

| Sample size | Variability | Chances of rejection for the same discrepancy |
| --- | --- | --- |
| Smaller | Larger | Lower |
| Larger | Smaller | Higher |

Example: The same estimated proportion (0.3) versus null value (0.5) leads to non-rejection with n = 20 but strong rejection with n = 200, because larger samples reduce variability and make discrepancies more detectable.

🛠️ Technical notes

🛠️ Continuity correction

The excerpt mentions that the test statistic uses Yates' correction for continuity by default.

  • With continuity correction: (|P̂ − p₀| − 0.5/n)² / [p₀(1 − p₀)/n]
  • Without correction: (P̂ − p₀)² / [p₀(1 − p₀)/n]
  • This is similar to the continuity correction used for Normal approximation of the Binomial distribution.

🛠️ Confidence interval and point estimate

The test output provides:

  • A confidence interval for the probability of the event
  • The point estimate P-hat (the sample proportion)
  • These complement the hypothesis test by showing the range of plausible values and the best single estimate.

11.3 Intervals for Normal Measurements

🧭 Overview

🧠 One-sentence thesis

When sample sizes are small, constructing valid confidence intervals for Normal measurements requires using the t-distribution for the mean and the chi-square distribution for the variance instead of relying on large-sample approximations.

📌 Key points (3–5)

  • Why small samples need special treatment: Large-sample methods rely on the Central Limit Theorem and assume the sample variance is close to the true variance; both assumptions may fail when n is small.
  • When these methods apply: The measurements themselves must be Normally distributed; applying these methods to non-Normal data (e.g., car prices) with small samples produces dubious results.
  • Key difference from large-sample intervals: Replace the standard Normal percentile (1.96) with the t-distribution percentile, which has wider tails and depends on degrees of freedom (n − 1).
  • Common confusion: The t-distribution percentile is noticeably larger than 1.96 for small samples but converges to 1.96 as sample size increases, so large-sample and small-sample methods produce essentially the same intervals when n is large.
  • Extension to variance: The chi-square distribution enables confidence intervals for the variance itself, not just the mean.

📏 Why large-sample methods fail for small samples

📏 Two key assumptions break down

Large-sample confidence intervals (from the previous section) rest on two assumptions:

  1. Central Limit Theorem applies: The sampling distribution of the sample average is approximately Normal.
  2. Sample variance ≈ true variance: The sample standard deviation S is an accurate estimator of the measurement's standard deviation.
  • When the sample size is small, the Normal approximation may be poor and S may differ substantially from the measurement's actual standard deviation.
  • Consequence: The reasoning that produced intervals like X̄ ± 1.96 · S / √n may no longer be valid.

⚠️ Normality assumption required

For small samples, making valid inference requires more detailed modeling of the distribution of the measurements; specifically, we assume the measurements themselves are Normally distributed.

  • This assumption does not fit all scenarios.
  • Example: Car prices are better modeled by the Exponential distribution, not the Normal distribution.
  • Don't confuse: "Normal sampling distribution of X̄" (large-sample assumption via CLT) vs. "Normal distribution of individual measurements" (small-sample requirement).
  • Blind application of these methods to non-Normal variables with small samples is not recommended.

🎯 Confidence intervals for the Normal mean

🎯 The t-distribution replaces the standard Normal

When measurements are Normally distributed, the exact distribution of the standardized sample average (with S substituting the standard deviation) is:

(X̄ − E(X)) / (S / √n) follows a t-distribution on (n − 1) degrees of freedom, denoted t(n−1).

  • Degrees of freedom: The number of observations used to estimate the variance, minus 1.
  • The t-distribution is bell-shaped and symmetric, like the standard Normal, but has wider tails.
  • Key advantage: This is an exact relation, not an approximation (when measurements are Normal).

📐 Structure of the confidence interval

The 95% confidence interval for the expectation of a Normal measurement is:

X̄ ± qt(0.975, n−1) · S / √n

  • Structure is essentially identical to the large-sample interval.
  • Only difference: Replace 1.96 (the 0.975-percentile of the standard Normal) with qt(0.975, n−1) (the 0.975-percentile of the t-distribution on n−1 degrees of freedom).
  • The middle 95% of the t-distribution is bracketed by [−qt(0.975, n−1), qt(0.975, n−1)].

🚗 Example: Fuel consumption differences

The excerpt analyzes the difference in miles per gallon between highway and urban driving for diesel vs. gas cars.

  • Variables: "dif.mpg" = highway.mpg − city.mpg; "fuel" = fuel type (diesel or gas).
  • Sample sizes: 20 diesel cars, 185 gas cars.
  • Concern: With only 20 diesel observations, large-sample methods may be questionable.

Results for diesel cars:

  • Sample mean: 4.45
  • Sample standard deviation: 2.781045
  • t-percentile: qt(0.975, 19) = 2.093024
  • 95% confidence interval: [3.148431, 5.751569]

Results for gas cars:

  • Sample mean: 5.648649
  • Sample standard deviation: 1.433607
  • t-percentile: qt(0.975, 184) = 1.972941
  • 95% confidence interval: [5.440699, 5.856598]
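
Either interval can be reproduced directly from the summary statistics quoted above; a minimal R sketch for the diesel group (numbers taken from the text, not recomputed from the raw data):

```r
# 95% t-interval for the expectation, diesel cars, from reported summary stats
xbar <- 4.45        # sample mean of dif.mpg (diesel)
s    <- 2.781045    # sample standard deviation (diesel)
n    <- 20          # number of diesel cars
margin <- qt(0.975, df = n - 1) * s / sqrt(n)
c(xbar - margin, xbar + margin)   # approximately [3.1484, 5.7516]
```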

🔍 Small vs. large sample comparison

Sample size | t-percentile | Comparison to 1.96
20 (diesel) | 2.093024 | Noticeably larger
185 (gas) | 1.972941 | Essentially equal to 1.96
  • For large samples, the t-distribution percentile converges to the standard Normal percentile.
  • Implication: For large n, the method from Subsection 11.2.2 and the method in this section produce essentially the same confidence intervals.

📊 Confidence intervals for the Normal variance

📊 The chi-square distribution for variance

To construct a confidence interval for the variance σ², we use the sample variance S² and a new distribution:

When measurements are Normally distributed, the random variable (n − 1)S² / σ² follows a chi-square distribution on (n − 1) degrees of freedom, denoted χ²(n−1).

  • The chi-square distribution is associated with the sum of squares of Normal variables.
  • Not symmetric: Unlike the t-distribution, we must compute both the 0.025- and 0.975-percentiles to bracket the central 95%.

🧮 Structure of the confidence interval for variance

The 95% confidence interval for the variance σ² is:

[ (n − 1) · S² / qchisq(0.975, n−1) , (n − 1) · S² / qchisq(0.025, n−1) ]

  • Left endpoint: Multiply S² by (n − 1) / qchisq(0.975, n−1); this ratio is less than 1 (making the bound smaller).
  • Right endpoint: Multiply S² by (n − 1) / qchisq(0.025, n−1); this ratio is greater than 1 (making the bound larger).
  • The percentile associated with the larger probability (0.975) goes in the denominator of the left bound; the percentile for the smaller probability (0.025) goes in the denominator of the right bound.

🚗 Example: Variance of fuel consumption differences

Diesel cars (n = 20):

  • Sample variance: 7.734211
  • 95% confidence interval: [4.473047, 16.499155]

Gas cars (n = 185):

  • Sample variance: 2.055229
  • 95% confidence interval: [1.692336, 2.549466]
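
As a check, the diesel interval can be reproduced from the reported sample variance; a minimal R sketch:

```r
# 95% chi-square interval for the variance, diesel cars
s2 <- 7.734211   # reported sample variance (diesel)
n  <- 20
c((n - 1) * s2 / qchisq(0.975, df = n - 1),
  (n - 1) * s2 / qchisq(0.025, df = n - 1))   # approximately [4.473, 16.499]
```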

✅ Simulation verification

The excerpt simulates 100,000 samples from a Normal distribution with μ = 4, σ² = 9, n = 20:

  • Confidence interval for mean: Computed confidence level = 0.94934 ≈ 95% (nominal).
  • Confidence interval for variance: Computed confidence level = 0.94896 ≈ 95% (nominal).
  • Both simulations confirm that the nominal 95% confidence level matches the actual coverage probability when measurements are Normal.
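
A minimal sketch of such a coverage simulation (the loop structure, object names, and seed are illustrative, not the excerpt's exact code):

```r
# Coverage check: Normal samples with mu = 4, sigma^2 = 9, n = 20
set.seed(1)
mu <- 4; sig <- 3; n <- 20; B <- 1e5
mean.cover <- var.cover <- rep(FALSE, B)
for (i in 1:B) {
  x <- rnorm(n, mu, sig)
  m <- qt(0.975, n - 1) * sd(x) / sqrt(n)            # half-width of the t-interval
  mean.cover[i] <- abs(mean(x) - mu) <= m            # does the mean interval cover mu?
  lo <- (n - 1) * var(x) / qchisq(0.975, n - 1)
  hi <- (n - 1) * var(x) / qchisq(0.025, n - 1)
  var.cover[i] <- (lo <= sig^2) & (sig^2 <= hi)      # does the variance interval cover sigma^2?
}
mean(mean.cover)   # close to 0.95
mean(var.cover)    # close to 0.95
```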

🔧 General construction strategy

🔧 The three-step recipe

  1. Identify the distribution of a random variable associated with the parameter of interest.
    • For the mean: (X̄ − E(X)) / (S / √n) ~ t(n−1).
    • For the variance: (n − 1)S² / σ² ~ χ²(n−1).
  2. Find the central region that contains the desired probability (e.g., 95%).
    • For t: [−qt(0.975, n−1), qt(0.975, n−1)].
    • For χ²: [qchisq(0.025, n−1), qchisq(0.975, n−1)].
  3. Reformulate the event to put the parameter between a lower and upper limit computed from the data.
    • These limits become the boundaries of the confidence interval.
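
For the variance, step 3 of the recipe can be written out explicitly. A short derivation in the notation above, writing q₀.₀₂₅ and q₀.₉₇₅ for qchisq(0.025, n−1) and qchisq(0.975, n−1):

```latex
0.95 = P\big( q_{0.025} \le (n-1)S^2/\sigma^2 \le q_{0.975} \big)
     = P\big( (n-1)S^2/q_{0.975} \le \sigma^2 \le (n-1)S^2/q_{0.025} \big)
```

so the interval [(n − 1)S²/qchisq(0.975, n−1), (n − 1)S²/qchisq(0.025, n−1)] contains σ² with probability 0.95, which is the formula given in the previous subsection.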

🔧 Why the chi-square interval looks different

  • The chi-square distribution is not symmetric, so the central 95% region is not symmetric around the mode.
  • We must compute both tails explicitly: 2.5% below qchisq(0.025, n−1) and 2.5% above qchisq(0.975, n−1).
  • The resulting interval multiplies S² by two different ratios (one < 1, one > 1) to form the lower and upper bounds.

Choosing the Sample Size

11.4 Choosing the Sample Size

🧭 Overview

🧠 One-sentence thesis

Statistics provides guidelines for selecting the minimal sample size needed to achieve valid conclusions with high probability while optimizing resources, balancing statistical accuracy against the cost of collecting more observations.

📌 Key points (3–5)

  • Why sample size matters: larger samples improve statistical accuracy but increase expenses, so the goal is to find the minimum sufficient size.
  • How to determine sample size: use confidence interval range requirements to calculate the needed number of observations.
  • Key insight for proportions: the maximum value of the function p(1 − p) is 1/4, which allows calculating a worst-case sample size that works regardless of the actual proportion.
  • Common confusion: doubling accuracy does not double sample size—increasing accuracy by 50% requires a sample size 4 times larger.
  • Application context: opinion polls estimating population proportions, where you specify how close the sample estimate must be to the true value.

🎯 The sample size problem

🎯 Balancing accuracy and resources

  • From a statistical perspective, larger sample sizes are usually preferable.
  • However, increasing sample size typically increases expenses.
  • The practical goal: collect the minimal number of observations that is still sufficient to reach a valid conclusion.
  • This is an important consideration at the planning stage of experiments or surveys.

📊 Opinion poll example

The excerpt uses a concrete scenario to illustrate the problem:

An opinion poll aimed at estimating the proportion in the population of those that support a specific candidate considering running for office.

Research question: How large must the sample be to assure, with high probability, that the proportion of supporters in the sample is within 0.05 (or 0.025) of the proportion in the population?

🔧 The calculation method

🔧 Using confidence intervals

  • The natural approach is via a confidence interval for the proportion.
  • If the range of the confidence interval (the ± term) is no more than the desired value (0.05 or 0.025), then with probability equal to the confidence level, the population proportion is within that distance of the sample proportion.

📐 The 95% confidence interval structure

For a proportion with 95% confidence level:

Confidence interval structure: P-hat ± 1.96 · √[P-hat(1 − P-hat)/n]

  • The range of the confidence interval is: 1.96 · √[P-hat(1 − P-hat)/n]
  • This is the distance from P-hat to either endpoint (the margin of error); the full width of the interval is twice this amount.
  • The question becomes: how large should n be to guarantee this range is no more than the desired value?

🔑 The key mathematical insight

The answer depends on P-hat(1 − P-hat), which is a random quantity. However:

  • The quadratic function f(p) = p(1 − p) has a maximum value of 1/4.
  • This maximum occurs at p = 1/2.
  • Therefore: 1.96 · √[P-hat(1 − P-hat)/n] ≤ 1.96 · √(0.25/n) = 0.98/√n

This upper bound works regardless of the actual value of P-hat, providing a worst-case guarantee.

📏 Worked examples

📏 Example 1: Range of 0.05

To ensure the range is no more than 0.05:

  • Start with: 0.98/√n ≤ 0.05
  • Rearrange: √n ≥ 0.98/0.05 = 19.6
  • Square both sides: n ≥ 19.6² = 384.16

Conclusion: n should be larger than 384; for example, n = 385 should be sufficient.

📏 Example 2: Range of 0.025

To ensure the range is no more than 0.025:

  • Start with: 0.98/√n ≤ 0.025
  • Rearrange: √n ≥ 0.98/0.025 = 39.2
  • Square both sides: n ≥ 39.2² = 1536.64

Conclusion: n = 1537 will do.
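
These two calculations can be packaged as a small helper; the function name below (sample.size) is illustrative, not from the excerpt:

```r
# Worst-case sample size so that 1.96 * sqrt(phat * (1 - phat) / n) <= d,
# using the bound p(1 - p) <= 1/4
sample.size <- function(d, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # 1.96 when conf = 0.95
  ceiling((z * 0.5 / d)^2)
}
sample.size(0.05)    # 385
sample.size(0.025)   # 1537
```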

⚠️ The accuracy-sample size relationship

Important observation: Cutting the target range in half (from 0.05 to 0.025) requires a sample size roughly 4 times larger (from 385 to 1537).

  • This is not a linear relationship.
  • Halving the margin of error requires quadrupling the sample size.
  • Don't confuse: a 50% improvement in accuracy ≠ a 50% increase in sample size.

🧮 Why the maximum matters

🧮 The worst-case approach

The excerpt uses the maximum value of p(1 − p) = 1/4 to provide a conservative estimate:

  • Before collecting data, we don't know what P-hat will be.
  • By using the maximum possible value (1/4), we ensure the calculated sample size will be sufficient no matter what proportion we observe.
  • This is a safe upper bound that guarantees the desired accuracy.

🔍 The mathematical detail

The excerpt provides the calculus reasoning:

  • The derivative of f(p) = p(1 − p) is f'(p) = 1 − 2p.
  • Setting f'(p) = 0 gives p = 1/2 as the maximizer.
  • Plugging p = 1/2 into the function gives 1/4 as the maximal value.

This mathematical fact allows the practical sample size calculation to proceed without knowing the true population proportion in advance.


Exercises

11.5 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply confidence interval methods to real and simulated data, comparing approaches under different distributional assumptions and sample sizes.

📌 Key points (3–5)

  • Three exercise types: analyzing real teacher-rating data, simulating confidence intervals under misspecified distributions, and designing sample sizes for proportion estimation.
  • Normal vs. t-based intervals: Exercise 11.2 compares confidence intervals constructed using Normal approximation versus t-distribution when the true distribution is Exponential, revealing differences in actual vs. nominal confidence levels.
  • Sample size planning: Exercise 11.3 addresses how to determine minimal sample size to achieve a target margin of error at a given confidence level.
  • Common confusion: nominal confidence level (what the formula claims) vs. actual confidence level (what simulation reveals when assumptions are violated).
  • Why it matters: exercises demonstrate that distributional assumptions affect interval validity, especially with small samples.

📊 Exercise 11.1: Teacher rating data

📁 Dataset and variables

  • Data source: "teacher.csv" contains ratings of a lecturer after students received one of two prior summaries.
  • Experimental design:
    • Condition "C": summary described lecturer as charismatic.
    • Condition "P": summary described lecturer as punitive.
    • All subjects watched the same 20-minute lecture by the same instructor.
  • Task 1: Identify each variable's name and type (factor or numeric).

📏 Overall and conditional estimates

Task 2: Overall statistics

  • Estimate the expectation (mean) and standard deviation of teacher ratings across all students.

Task 3: Conditional statistics

  • Estimate the expectation and standard deviation only for students in condition "C" (charismatic summary).
  • This isolates the effect of the prior description on ratings.

🎯 Confidence intervals for charismatic condition

Task 4: Interval for the mean (99% confidence)

  • Construct a 99% confidence interval for the expectation of ratings among students given the charismatic summary.
  • Assumption: ratings follow a Normal distribution.

Task 5: Interval for the variance (90% confidence)

  • Construct a 90% confidence interval for the variance of ratings in the charismatic condition.
  • Assumption: ratings follow a Normal distribution.
  • Don't confuse: this interval is for variance (spread), not mean (center).

🔬 Exercise 11.2: Simulation under misspecification

🎲 Setup and true distribution

  • Sample size: 20 observations.
  • True distribution: measurements are actually Exponential(1/4), not Normal.
  • Goal: compare two confidence interval methods when the Normality assumption is violated.

🔍 Two interval construction methods

Method | Basis | Nominal confidence level
First case | Normal approximation of the sample average | 95%
Second case | Assumes observations are Normally distributed | 95%
  • Both claim 95% confidence, but their actual performance may differ when the true distribution is Exponential.

📉 Actual vs. nominal confidence level

Task 1: Simulate the first case

  • Use simulation to compute the actual confidence level (the proportion of intervals that contain the true parameter) for the Normal approximation method.

Task 2: Simulate the second case

  • Use simulation to compute the actual confidence level for the method that assumes Normal observations.

Task 3: Compare and choose

  • Decide which approach is preferable based on simulation results.
  • The actual confidence level may be lower than the nominal 95% when assumptions are violated.
  • Don't confuse: nominal confidence level is what the formula promises; actual confidence level is what happens in practice under the true distribution.
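
One possible way to set up this simulation (a sketch only; the seed, loop structure, and object names are illustrative):

```r
# Exercise 11.2 setup: Exponential(1/4) data, n = 20, nominal level 95%
set.seed(1)
mu <- 4; n <- 20; B <- 1e5
cover.norm <- cover.t <- rep(FALSE, B)
for (i in 1:B) {
  x <- rexp(n, rate = 1/4)
  m.norm <- qnorm(0.975) * sd(x) / sqrt(n)      # first case: Normal approximation
  m.t    <- qt(0.975, n - 1) * sd(x) / sqrt(n)  # second case: assumes Normal data
  cover.norm[i] <- abs(mean(x) - mu) <= m.norm
  cover.t[i]    <- abs(mean(x) - mu) <= m.t
}
mean(cover.norm)   # actual confidence level, first case
mean(cover.t)      # actual confidence level, second case
```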

📐 Exercise 11.3: Sample size for proportions

🎯 Planning sample size (Task 1)

  • Context: insurance companies want to estimate the population proportion of drivers who always buckle up.
  • Requirements:
    • 99% confidence level.
    • Margin of error: up to 0.03 (3 percentage points).
  • Question: what is the minimal sample size needed to meet these requirements?
  • This is a design question answered before collecting data.

📊 Constructing an interval from collected data (Task 2)

Survey results

  • Sample size: 400 drivers surveyed.
  • Observed count: 320 drivers claim to always buckle up.
  • Sample proportion: 320/400 = 0.80.

Task

  • Construct an 80% confidence interval for the population proportion of drivers who claim to always buckle up.
  • Note the lower confidence level (80%) compared to the planning stage (99%)—different questions may require different confidence levels.

📚 Summary concepts from the chapter

📖 Key definitions

Confidence Interval: An interval that is most likely to contain the population parameter.

Confidence Level: The sampling probability that random confidence intervals contain the parameter value; it indicates the interval was constructed using a formula that produces such intervals when applied to random samples.

t-Distribution: A bell-shaped distribution that resembles the standard Normal distribution but has wider tails, characterized by a positive parameter called degrees of freedom.

Chi-Square Distribution: A distribution associated with the sum of squares of Normal random variables; obtains only positive values, is not symmetric, and is characterized by a positive parameter called degrees of freedom.

🔑 Large vs. small samples

  • Large samples: fewer a-priori assumptions needed; general limit theorems (e.g., Central Limit Theorem) establish validity under general conditions.
  • Small samples: strong distributional assumptions required to justify the validity of inference procedures.
  • Don't confuse: the same method may be robust with large samples but sensitive to violations with small samples.

Summary

11.6 Summary

🧭 Overview

🧠 One-sentence thesis

This summary consolidates the core tools of confidence intervals—definitions, distributions, formulas, and the trade-off between large-sample generality and small-sample assumptions—and introduces hypothesis testing as the next step in statistical inference.

📌 Key points (3–5)

  • What confidence intervals are: intervals most likely to contain the true population parameter, constructed using formulas that work across random samples.
  • Key distributions: the t-distribution (bell-shaped, wider tails than Normal, used for small samples) and the chi-square distribution (positive-only, asymmetric, used for variance).
  • Large vs small samples: large samples rely on general limit theorems (e.g., Central Limit Theorem); small samples require strong distributional assumptions (e.g., Normality) that cannot always be verified.
  • Common confusion: nominal vs actual confidence level—if the sample is small and the data are not Normal, the stated confidence level may not match the true coverage probability.
  • Standard formulas: 95% confidence intervals for expectation, probability, and variance are provided with specific R quantile functions.

📚 Glossary and core concepts

📖 Confidence Interval

An interval that is most likely to contain the population parameter.

  • Not a guarantee, but a probabilistic statement about the procedure.
  • The interval itself is fixed once computed; the randomness is in the sampling process.

📖 Confidence Level

The sampling probability that random confidence intervals contain the parameter value.

  • It describes the long-run success rate of the formula, not the probability that a single observed interval contains the parameter.
  • Example: a 95% confidence level means that if you repeat the sampling and interval construction many times, about 95% of those intervals will capture the true parameter.

📖 t-Distribution

A bell-shaped distribution that resembles the standard Normal distribution but has wider tails, characterized by a positive parameter called degrees of freedom.

  • Used when the sample size is small and the population standard deviation is unknown.
  • As degrees of freedom increase, the t-distribution approaches the Normal distribution.

📖 Chi-Square Distribution

A distribution associated with the sum of squares of Normal random variables; it obtains only positive values and is not symmetric, characterized by a positive parameter called degrees of freedom.

  • Used for constructing confidence intervals for variance.
  • The excerpt provides a formula using chi-square quantiles for the variance interval.

🔍 Large samples vs small samples

🔍 Large samples: fewer assumptions

  • With large samples, general limit theorems (e.g., the Central Limit Theorem) justify the validity of inference under broad conditions.
  • You do not need to assume a specific distribution for the observations.
  • The nominal confidence level is approximately correct even if the data are not perfectly Normal.

🔍 Small samples: strong assumptions required

  • For small sample sizes, you must assume a specific distribution (typically Normal) to justify the procedure.
  • The problem: these assumptions cannot be verified from the data alone.
  • Risk: if the true distribution differs from the assumed one, the nominal confidence level may not match the actual confidence level.
  • Example: you construct a 95% confidence interval assuming Normality, but if the data are skewed, the true coverage might be only 90% or 85%.

🤔 The debate: are small-sample inferences worthless?

  • The excerpt poses a forum question: can you trust conclusions that depend on unverifiable assumptions?
  • One view: small-sample inference is risky because the nominal and actual significance levels may diverge.
  • Implication: always check sample size and distributional assumptions before relying on confidence intervals or tests.

📐 Standard formulas (95% confidence level)

The excerpt provides four formulas for 95% confidence intervals:

Parameter | Formula | Notes
Expectation (large sample or unknown distribution) | x̄ ± qnorm(0.975) · s/√n | Uses Normal quantile; relies on CLT for large n
Probability (proportion) | p̂ ± qnorm(0.975) · √[p̂(1−p̂)/n] | For sample proportion p̂; large-sample approximation
Normal Expectation (small sample, Normal data) | x̄ ± qt(0.975, n−1) · s/√n | Uses t-distribution with n−1 degrees of freedom
Normal Variance (small sample, Normal data) | [(n−1)s²/qchisq(0.975, n−1), (n−1)s²/qchisq(0.025, n−1)] | Uses chi-square quantiles; interval is asymmetric

📐 How to read the formulas

  • x̄: sample mean; s: sample standard deviation; n: sample size; p̂: sample proportion.
  • qnorm(0.975): the 97.5th percentile of the standard Normal distribution (≈1.96).
  • qt(0.975, n−1): the 97.5th percentile of the t-distribution with n−1 degrees of freedom.
  • qchisq(0.975, n−1) and qchisq(0.025, n−1): the 97.5th and 2.5th percentiles of the chi-square distribution.
  • The variance interval is constructed by inverting the chi-square distribution; note that the 0.975-quantile sits in the denominator of the lower bound and the 0.025-quantile in the denominator of the upper bound.
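
The two large-sample formulas can be written as one-line R helpers (the small-sample t and chi-square versions are sketched in the Section 11.3 notes above); the function names are illustrative:

```r
# 95% intervals from the first two rows of the table
ci.mean.large <- function(xbar, s, n)   # expectation, large sample
  xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)
ci.prop <- function(phat, n)            # probability (proportion)
  phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)

ci.prop(0.3, 200)   # compare with the prop.test interval in Section 12.4,
                    # which also applies a continuity correction
```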

📐 When to use which formula

  • Expectation (large sample): use the Normal quantile formula when n is large (typically n ≥ 30) or when the Central Limit Theorem applies.
  • Expectation (small sample, Normal data): use the t-distribution formula when n is small and you can assume the data are Normal.
  • Variance (Normal data): use the chi-square formula only when the data are Normal; this assumption is critical.
  • Probability: use the proportion formula when the sample size is large enough that np̂ and n(1−p̂) are both at least 5.

🧪 Practical exercises (from the excerpt)

The excerpt includes two exercises that illustrate the concepts:

🧪 Exercise on simulation and confidence levels

  • Task 1: compute the actual confidence level via simulation for a confidence interval with a nominal 95% level.
  • Task 2: compare two approaches (details not fully specified in the excerpt).
  • Purpose: demonstrate that nominal and actual confidence levels can differ, especially under violated assumptions.

🧪 Exercise on insurance and sample size

  • Scenario: insurance companies want to estimate the proportion of drivers who always buckle up.
  • Task 1: find the minimal sample size for a 99% confidence interval with a margin of error of 0.03.
  • Task 2: given a sample of 400 drivers (320 claim to buckle up), construct an 80% confidence interval for the population proportion.
  • Purpose: practice sample-size calculation and interval construction for proportions.

🔗 Transition to hypothesis testing

🔗 What comes next

  • The excerpt ends by introducing hypothesis testing as the next chapter.
  • Hypothesis testing is described as a crucial component in decision making where one of two competing options must be selected.
  • It provides formal guidelines for choosing between options and is typically the first step in inference: "Is there a phenomenon at all?"
  • The next chapter will cover formulation of hypotheses, decision rules, and testing in the context of expectations and probabilities.

🔗 Why hypothesis testing follows confidence intervals

  • Confidence intervals estimate a parameter; hypothesis testing decides whether a specific claim about the parameter is supported by the data.
  • Both rely on the same underlying distributions (Normal, t, chi-square) and the same large-sample vs small-sample trade-offs.

Statistical Hypothesis Testing

12.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Statistical hypothesis testing provides formal guidelines for deciding whether observed data reveal a meaningful phenomenon or can be reasonably explained by random noise alone.

📌 Key points (3–5)

  • Purpose of hypothesis testing: to answer "Is there a phenomenon at all?" by determining whether observed data can be explained by randomness without invoking the phenomenon.
  • Typical role: hypothesis testing is usually the first step in statistical inference, detecting whether something meaningful exists before characterizing it.
  • Core question: whether a difference (e.g., between two values) is significant or merely due to random variation.
  • Common confusion: a difference in numbers does not automatically mean significance—context (e.g., inflation, sample variability) must be considered before concluding that a phenomenon exists.
  • Limitations: the chapter emphasizes identifying the dangers of misinterpreting test conclusions and recognizing the limitations of hypothesis testing.

🎯 What hypothesis testing does

🎯 Detecting phenomena in noisy data

Statistical inference is used to detect and characterize meaningful phenomena that may be hidden in an environment contaminated by random noise.

  • The excerpt frames hypothesis testing as a tool for separating signal from noise.
  • The central question: can the observed data be reasonably explained by a model of randomness that does not involve the phenomenon?
  • If randomness alone cannot explain the data, we infer that a phenomenon exists.
  • Example: if a measurement differs from an expected value, hypothesis testing helps decide whether that difference reflects a real change or just random fluctuation.

🔍 The first step in inference

  • Hypothesis testing is described as "an important step, typically the first, in the process of making inferences."
  • Before characterizing or modeling a phenomenon, you must first establish that it exists.
  • Don't confuse: hypothesis testing does not describe how or why a phenomenon works; it only answers whether it is present.

🧪 The structure of a hypothesis test

🧪 Formulating hypotheses

  • The excerpt states that students should be able to "formulate statistical hypothesis for testing."
  • A hypothesis test involves two competing options: one that assumes no phenomenon (randomness only) and one that assumes the phenomenon exists.
  • The test uses sample data to decide which option is more consistent with the observations.

📐 Testing expectations and probabilities

  • The chapter focuses on two contexts:
    • Expectation of a measurement: testing whether the mean of a variable has changed or differs from a reference value.
    • Probability of an event: testing whether the likelihood of an outcome has shifted.
  • Example: the excerpt illustrates testing whether the average car price in 2009 (adjusted for inflation) differs significantly from the 1985 average.

⚠️ Limitations and misinterpretation

  • The excerpt emphasizes that students must "identify the limitations of statistical hypothesis testing and the danger of misinterpretation of the test's conclusions."
  • A key danger: concluding that a numerical difference is meaningful without accounting for variability or context.
  • Don't confuse: a test result does not prove that a phenomenon exists with certainty; it only indicates whether the data are consistent with randomness or not.

📊 Example: car prices over time

📊 The question

  • The excerpt asks: "Do Americans pay today for cars a different price than what they used to pay in the 80's?"
  • Average price in 1985: $13,207.
  • Average price in 2009: $27,958 (nominal).
  • After adjusting for inflation, the 2009 price corresponds to $13,662 in 1985 dollars.

📊 Why the difference matters

  • The raw difference ($27,958 vs. $13,207) is large, but inflation explains most of it.
  • The inflation-adjusted difference ($13,662 vs. $13,207) is much smaller.
  • The question becomes: "Is the difference between $13,207 and $13,662 significant or is it not so?"
  • This is the kind of question hypothesis testing is designed to answer.

📊 Conducting the test

  • The excerpt mentions that "a statistical test" is carried out using the function "t.test."
  • The test helps decide whether the $455 difference (in 1985 dollars) reflects a real change in car prices or is merely random variation.
  • Example: if the test concludes the difference is not significant, we would say that car prices have not meaningfully changed (after accounting for inflation and variability).

🧠 Key takeaways for decision making

🧠 Formal guidelines for choosing between options

  • Hypothesis testing provides "formal guidelines for making such a selection" when one of two competing options must be chosen.
  • It structures the decision process: define hypotheses, collect data, apply a test, and interpret the result.

🧠 Context is essential

  • The car price example shows that raw numbers can be misleading.
  • Inflation adjustment is necessary for a fair comparison; without it, the conclusion would be incorrect.
  • Similarly, hypothesis testing accounts for variability and sample size to avoid mistaking noise for a real effect.

🧠 Recognizing what tests cannot do

  • The excerpt warns of "the danger of misinterpretation of the test's conclusions."
  • A test does not prove causation, nor does it quantify the size or importance of an effect—it only assesses whether the data are consistent with randomness.
  • Don't confuse: "statistically significant" does not always mean "practically important," and "not significant" does not prove that no effect exists.

The Theory of Hypothesis Testing

12.2 The Theory of Hypothesis Testing

🧭 Overview

🧠 One-sentence thesis

Statistical hypothesis testing provides a structured framework to decide whether observed data can reasonably be explained by randomness alone or whether it provides significant evidence for a phenomenon, with the decision rule designed to control the probability of falsely rejecting the status quo.

📌 Key points (3–5)

  • Core purpose: determine whether observed data can be reasonably explained by a model of randomness that does not involve the phenomena of interest.
  • Three-step structure: (i) formulate null and alternative hypotheses, (ii) specify the test statistic and rejection region, (iii) apply the test to data and reach a conclusion.
  • Asymmetric error treatment: Type I error (rejecting a true null hypothesis) is considered more severe than Type II error (failing to reject a false null hypothesis); significance level controls Type I error probability.
  • Common confusion—null vs alternative: the null hypothesis represents the absence of the phenomenon (status quo, conservative explanation), while the alternative represents its presence (the claim requiring strong evidence).
  • p-value interpretation: the p-value equals the significance level at which the observed test statistic would just barely lead to rejection; reject the null hypothesis when p-value < 0.05 (or chosen significance level).

🔬 The structure of hypothesis testing

🎯 Step 1: Formulating the hypotheses

Null hypothesis (H₀): the sub-collection of parameter values where the phenomenon is absent.

Alternative hypothesis (H₁): the sub-collection of parameter values reflecting the presence of the phenomenon.

  • The formulation splits the range of possible parameter values into two complementary subsets.
  • The phenomenon of interest determines which hypothesis is which.
  • Example: To investigate whether car prices changed, if the 1985 expected price was E(X) and the 2009 inflation-adjusted price was $13,662:
    • Null hypothesis: E(X) = 13,662 (no change, phenomenon absent)
    • Alternative hypothesis: E(X) ≠ 13,662 (change occurred, phenomenon present)

🔀 Two-sided vs one-sided alternatives

Type | Form | When to use
Two-sided | H₁: E(X) ≠ 13,662 | Testing for any change (increase or decrease)
One-sided (lower) | H₁: E(X) < 13,662 | Testing specifically for a rise in price (1985 price was lower)
One-sided (upper) | H₁: E(X) > 13,662 | Testing specifically for a decrease in price (1985 price was higher)
  • The choice depends on what phenomenon you want to investigate.
  • Don't confuse: a value outside the alternative can still belong to the null hypothesis in one-sided tests (e.g., E(X) > 13,662 belongs to H₀ when testing for a rise).

⚙️ Step 2: Specifying the test

Test statistic: a statistic computed from the data that measures evidence against the null hypothesis.

Rejection region: the subset of test statistic values that lead to rejecting the null hypothesis.

  • The decision rule: reject H₀ if the test statistic falls in the rejection region; otherwise accept H₀.
  • Example: For testing E(X) = 13,662 vs E(X) ≠ 13,662, the T statistic is:
    • T = (X̄ - 13,662) / (S / √n)
    • Where X̄ is sample average, S is sample standard deviation, n is sample size.
  • This statistic measures the discrepancy between the estimated expectation and the null expectation, in units of the estimated standard deviation of the sample average.
  • Under the null hypothesis, T should be near 0; extreme values (very positive or very negative) suggest the null is false.

🎚️ Rejection regions for different alternatives

Alternative hypothesis | Rejection region | Threshold
H₁: E(X) ≠ 13,662 (two-sided) | |T| > c | c = 0.975-percentile of the t-distribution (≈ 1.972 for df = 200)
H₁: E(X) < 13,662 (one-sided) | T < c | c = 0.05-percentile of the t-distribution (≈ −1.653 for df = 200)
H₁: E(X) > 13,662 (one-sided) | T > c | c = 0.95-percentile of the t-distribution (≈ 1.653 for df = 200)
  • The threshold c is chosen based on the desired significance level and the degrees of freedom (n - 1).
  • The rule of thumb for two-sided tests at 5% significance: use the 0.975-percentile of the t-distribution.
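
The thresholds in the table are just t-distribution quantiles; for df = 200 they can be checked directly in R:

```r
qt(0.975, 200)   # about  1.972  (two-sided rejection region |T| > c)
qt(0.950, 200)   # about  1.653  (one-sided, H1: E(X) > 13,662)
qt(0.050, 200)   # about -1.653  (one-sided, H1: E(X) < 13,662)
```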

✅ Step 3: Reaching a conclusion

  • Compute the observed value of the test statistic from the actual data.
  • Check whether this observed value falls in the rejection region.
  • If yes → reject H₀ (accept H₁); if no → accept H₀.
  • Example: In the car price data, the observed T statistic was t = -0.8115.
    • The threshold for |T| was 1.972.
    • Since |−0.8115| = 0.8115 < 1.972, the value does not fall in the rejection region.
    • Conclusion: accept the null hypothesis—the expected 1985 car price was not significantly different from the current inflation-adjusted price.

⚠️ Error types and probabilities

🚨 Two types of errors

Type I error: rejecting the null hypothesis when it is actually true.

Type II error: accepting the null hypothesis when the alternative is actually true.

Decision | H₀ is true | H₁ is true
Accept H₀ | ✓ Correct | Type II error
Reject H₀ | Type I error | ✓ Correct
  • Example: If E(X) actually equals 13,662 but we reject H₀ because |T| > 1.972, that is a Type I error.
  • Example: If E(X) actually differs from 13,662 but we accept H₀ because |T| ≤ 1.972, that is a Type II error.

⚖️ Asymmetric treatment of errors

  • Type I error is considered more severe than Type II error.
  • The test is designed to control the probability of Type I error at an acceptable level (the significance level).
  • Reducing Type II error probability is desirable but secondary.
  • This explains why the threshold is set relatively high (e.g., 1.972): the sample average must be significantly different from the null expectation, not just different, to reject H₀.

📊 Error probabilities

Significance level: the probability of making a Type I error; typically set at 5% or 1%.

Statistical power: the probability of correctly rejecting the null hypothesis when the alternative is true; equals 1 - Probability of Type II error.

Outcome | Under H₀ | Under H₁
P(reject H₀) | Significance level | Statistical power
P(accept H₀) | 1 − Significance level | Probability of Type II error
  • Setting the threshold at the 0.975-percentile of the t-distribution produces a test with 5% significance level.
  • When comparing two tests with the same significance level, prefer the one with higher statistical power.

🧪 Why null = status quo

  • The null hypothesis is designated for the situation where the cost of making an error is greater.
  • In drug trials: H₀ = "new drug is no better than current treatment"; only strong evidence of benefit leads to approval.
  • In scientific research: H₀ = currently accepted theory (conservative explanation); a novel claim (H₁) requires strong empirical evidence to replace it.
  • The rejection region corresponds to values unlikely under the current theory; observing such values suggests the theory is inadequate.

📉 The p-value

🔢 What the p-value measures

p-value: the probability of observing a test statistic as extreme as (or more extreme than) the observed value, computed under the null hypothesis; equivalently, the significance level at which the observed statistic would just reach the rejection threshold.

  • For the T statistic with rejection region {|T| > c}, the p-value equals P(|T| > |t|), where t is the observed value.
  • Example: Observed t = -0.8115, so p-value = P(|T| > 0.8115).
  • Under H₀, T follows a t-distribution with n - 1 degrees of freedom.
  • By symmetry of the t-distribution: P(|T| > 0.8115) = 2 · P(T > 0.8115) = 2 · [1 - P(T ≤ 0.8115)] ≈ 0.4181.

🎲 The p-value as a test statistic

  • The p-value is a function of the data, so it is itself a statistic.
  • Different data sets produce different observed T values, hence different p-values.
  • Decision rule using p-value:
    • If p-value < 0.05 (for 5% significance level) → reject H₀.
    • If p-value ≥ 0.05 → accept H₀.
  • Example: p-value = 0.4181 > 0.05, so accept H₀.

🔄 Equivalence of T-test and p-value test

  • The test based directly on the T statistic and the test based on the p-value are equivalent.
  • One rejects H₀ if and only if the other does.
  • Advantage of the p-value: no need to look up the threshold percentile; simply compare the p-value directly to the chosen significance level (e.g., 0.05 or 0.01).
  • Advantage of the T statistic approach: provides insight into the structure of the rejection region.

🧮 Computing the p-value

  • For two-sided alternative H₁: E(X) ≠ 13,662:
    • p-value = P(|T| > |observed t|) = 2 · [1 - cumulative probability at |observed t|].
  • For one-sided alternative H₁: E(X) < 13,662:
    • p-value = P(T < observed t) = cumulative probability at observed t.
  • For one-sided alternative H₁: E(X) > 13,662:
    • p-value = P(T > observed t) = 1 - cumulative probability at observed t.
  • Don't confuse: the p-value is not the probability that H₀ is true; it is the probability of the data (or more extreme) given that H₀ is true.

🚗 The car price example walkthrough

📋 The research question

  • Question: Do Americans pay a different price for cars today than in the 1980s (adjusting for inflation)?
  • Data: 1985 car prices from "cars.csv"; average was $13,207.
  • Comparison: 2009 average car price was $27,958 in nominal terms, or $13,662 in 1985 dollars after adjusting for inflation.
  • The observed difference: $13,662 - $13,207 = $455. Is this difference significant or just random variation?

🧪 Applying the three-step framework

Step 1: Formulate hypotheses

  • Null hypothesis H₀: E(X) = 13,662 (no change in relative price).
  • Alternative hypothesis H₁: E(X) ≠ 13,662 (relative price changed).
  • This is a two-sided alternative because we are testing for any change, not a specific direction.

Step 2: Specify the test

  • Test statistic: T = (X̄ - 13,662) / (S / √n).
  • Sample size: n = 201 (after excluding 4 missing values).
  • Rejection region: {|T| > 1.972}, where 1.972 is the 0.975-percentile of the t-distribution with 200 degrees of freedom.
  • This gives a 5% significance level.

Step 3: Compute and conclude

  • Observed sample average: X̄ = 13,207.13.
  • Observed sample standard deviation: S = 7,947.066.
  • Observed T statistic: t = (13,207.13 - 13,662) / (7,947.066 / √201) ≈ -0.8115.
  • Check: |−0.8115| = 0.8115 < 1.972, so t does not fall in the rejection region.
  • Conclusion: accept H₀—the expected price of a car in 1985 was not significantly different from the current inflation-adjusted price.

📊 Using the p-value

  • The p-value reported by the t.test function was 0.4181.
  • Since 0.4181 > 0.05, we accept H₀.
  • Interpretation: if the null hypothesis were true, we would observe a T statistic as extreme as -0.8115 (or more extreme) about 42% of the time—this is not unusual, so we have no reason to reject H₀.
  • The 95% confidence interval for E(X) was [12,101.80, 14,312.46], which includes 13,662, consistent with accepting H₀.
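
The headline numbers can be reproduced from the reported summary statistics alone; a minimal sketch (the original analysis runs t.test on the raw prices):

```r
xbar <- 13207.13; s <- 7947.066; n <- 201
t.obs <- (xbar - 13662) / (s / sqrt(n))        # about -0.8115
p.val <- 2 * (1 - pt(abs(t.obs), df = n - 1))  # about  0.4181
c(t.obs, p.val, qt(0.975, n - 1))              # |t| is below the 1.972 threshold
```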

🔍 Why this matters

  • The example illustrates that a nominal difference ($455) may not be statistically significant when accounting for sampling variability.
  • The test provides a principled way to distinguish real changes from random fluctuations.
  • The high p-value (0.4181) indicates weak evidence against the null hypothesis; the data are quite consistent with no change in relative car prices.


Testing Hypothesis on Expectation

12.3 Testing Hypothesis on Expectation

🧭 Overview

🧠 One-sentence thesis

The t-test evaluates whether a sample mean differs significantly from a hypothesized value by comparing the deviation to the sampling variability, and statistical significance depends not just on the size of the deviation but on how large it is relative to the standard error.

📌 Key points (3–5)

  • What the t-test examines: whether the expected value (mean) of a variable in a population differs from a specific hypothesized value.
  • How to interpret results: the p-value is compared directly to the significance level (e.g., 0.05); if the p-value is smaller, the null hypothesis is rejected.
  • Common confusion: a larger deviation from the null value does not always mean stronger evidence—statistical significance depends on the deviation relative to the standard deviation of the sample mean.
  • Subsetting with logical sequences: you can select subsets of data by using logical TRUE/FALSE sequences as indices, which is useful for testing hypotheses on different groups.
  • One-sided vs two-sided tests: one-sided tests have smaller p-values (half of the two-sided p-value) because they only consider deviations in one direction.

🚗 The fuel consumption example

🚗 The variable under study

  • The variable dif.mpg measures the difference in miles-per-gallon between highway and city driving conditions for each car type.
  • Summary statistics: values range from 0 to 11, median is 6, mean is 5.532, and 50% of values fall between 5 and 7.
  • The variable takes only integer values.

🤔 The research question

Two competing conjectures about heavier vs lighter cars:

  • Conjecture 1 (heavier cars → larger difference): Urban traffic involves frequent speed changes; heavier cars require more energy for acceleration, so the highway-city difference might be larger for heavier cars.
  • Conjecture 2 (heavier cars → smaller difference): Heavier cars do fewer miles per gallon overall; the difference between two smaller numbers (heavy cars) may be smaller than the difference between two larger numbers (light cars).

The excerpt tests whether the expected difference for heavier and lighter cars differs from the overall average of 5.53.

🔪 Dividing cars into weight groups

  • Cars are split at the median weight of 2,414 lb.
  • Cars above 2,414 lb are labeled "heavy" (102 cars); cars at or below are labeled "light" (103 cars).
  • The variable heavy is a logical sequence: TRUE for heavy cars, FALSE for light cars.

🧮 Conducting the t-test

🧮 Testing heavier cars

The null hypothesis is H₀: E(X) = 5.53 (expected difference for heavier cars equals the overall average).
The alternative is H₁: E(X) ≠ 5.53 (two-sided).

Results for heavier cars:

  • Test statistic: t = -1.5385
  • Degrees of freedom: 101
  • p-value: 0.127
  • Sample mean: 5.254902
  • 95% confidence interval: [4.900198, 5.609606]

Interpretation:

  • The p-value (0.127) is larger than 0.05, so the null hypothesis is not rejected at the 5% significance level.
  • Conclusion: we cannot conclude that the expected difference for heavier cars is significantly different from 5.53.

🧮 Testing lighter cars

The same null and alternative hypotheses are tested for lighter cars.

Results for lighter cars:

  • Test statistic: t = 1.9692
  • Degrees of freedom: 102
  • p-value: 0.05164
  • Sample mean: 5.805825
  • 95% confidence interval: [5.528002, 6.083649]

Interpretation:

  • The p-value (0.05164) is just slightly larger than 0.05, so the null hypothesis is not rejected, but it is very close to the threshold.
  • Conclusion: we almost conclude that the expected difference for lighter cars is significantly different from 5.53.

🔍 Why the difference in results?

The excerpt explains why the test for lighter cars came closer to rejecting the null:

Factor | Heavier cars | Lighter cars
Deviation from null | 5.254902 − 5.53 = −0.275098 | 5.805825 − 5.53 = 0.275825
Absolute deviation | ~0.275 | ~0.276
Sample size | 102 | 103
Sample standard deviation | 1.805856 | 1.421531
Test statistic | −1.5385 | 1.9692
  • The deviations from the null are practically equal in absolute value.
  • The sample sizes are almost equal.
  • The key difference: the sample standard deviation for lighter cars (1.421531) is much smaller than for heavier cars (1.805856).
  • The T statistic is the ratio of deviation to standard error (standard deviation divided by square root of n), so a smaller standard deviation leads to a larger test statistic and a smaller p-value.

Important lesson: Simple-minded significance (just looking at the size of the deviation) is not the same as statistical significance. A smaller deviation can be more statistically significant if the sampling variability is much smaller.
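
The two test statistics (and their p-values) can be recomputed from the reported summary statistics, which makes the role of the standard deviation explicit; a minimal sketch:

```r
t.heavy <- (5.254902 - 5.53) / (1.805856 / sqrt(102))   # about -1.5385
t.light <- (5.805825 - 5.53) / (1.421531 / sqrt(103))   # about  1.9692
2 * (1 - pt(abs(t.heavy), df = 101))   # about 0.127
2 * (1 - pt(abs(t.light), df = 102))   # about 0.0516
```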

🔀 Subsetting data with logical sequences

🔀 How logical indexing works

The excerpt demonstrates an alternative to position-based indexing: using logical TRUE/FALSE sequences.

Example with two sequences:

  • w <- c(5,3,4,6,2,9) and d <- c(13,22,0,12,6,20)
  • w > 5 produces [FALSE, FALSE, FALSE, TRUE, FALSE, TRUE]
  • d[w > 5] selects elements of d where w > 5 is TRUE, yielding [12, 20] (the 4th and 6th elements).

Reversing with the ! operator:

  • !(w > 5) produces [TRUE, TRUE, TRUE, FALSE, TRUE, FALSE]
  • d[!(w > 5)] selects elements where w ≤ 5, yielding [13, 22, 0, 6].
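
The same example as runnable R code:

```r
w <- c(5, 3, 4, 6, 2, 9)
d <- c(13, 22, 0, 12, 6, 20)
w > 5          # FALSE FALSE FALSE  TRUE FALSE  TRUE
d[w > 5]       # 12 20
d[!(w > 5)]    # 13 22  0  6
```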

🔀 Applying logical indexing to the car data

  • dif.mpg[heavy] selects the differences for cars where heavy is TRUE (heavier cars).
  • dif.mpg[!heavy] selects the differences for cars where heavy is FALSE (lighter cars).
  • This approach is used to apply the t-test separately to each weight group.

🎯 One-sided tests

🎯 Testing the alternative H₁: E(X) > 5.53

For lighter cars, the excerpt tests whether the expected difference is greater than 5.53.

Code: t.test(dif.mpg[!heavy], mu=5.53, alternative="greater")

Results:

  • Test statistic: t = 1.9692 (same as two-sided)
  • Degrees of freedom: 102 (same as two-sided)
  • p-value: 0.02582 (half of the two-sided p-value)
  • One-sided confidence interval: [5.573323, ∞)

Interpretation:

  • The p-value for the one-sided test is the probability that the test statistic is larger than the observed value.
  • The two-sided p-value is twice this figure because it includes both tails.
  • The null hypothesis is rejected at the 5% level (0.02582 < 0.05).
  • The one-sided confidence interval gives the smallest value the expectation may reasonably obtain.

🎯 Testing the alternative H₁: E(X) < 5.53

For lighter cars, the excerpt also tests whether the expected difference is less than 5.53.

Code: t.test(dif.mpg[!heavy], mu=5.53, alternative="less")

Results:

  • Test statistic: t = 1.9692
  • p-value: 0.9742
  • One-sided confidence interval: (−∞, 6.038328]

Interpretation:

  • The p-value (0.9742) is the probability that the test statistic is less than the observed value.
  • The null hypothesis is clearly not rejected.

🎯 Comparison of one-sided and two-sided tests

  • The test statistic and degrees of freedom are the same for all three tests (two-sided, greater, less).
  • The p-value for the one-sided "greater" test is half the two-sided p-value.
  • The p-value for the one-sided "less" test is very large (close to 1) because the observed sample mean is above the null value.
  • Don't confuse: the choice of one-sided vs two-sided depends on the research question, not on the data.
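
The relation between the three p-values can be verified directly from the common test statistic (t = 1.9692, df = 102); a minimal sketch:

```r
t.obs <- 1.9692
1 - pt(t.obs, df = 102)        # "greater": about 0.02582
pt(t.obs, df = 102)            # "less":    about 0.9742
2 * (1 - pt(t.obs, df = 102))  # two-sided: about 0.05164
```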

📊 Testing hypotheses on proportions

📊 The setup for proportion tests

The excerpt introduces testing hypotheses on the probability p of an event.

A probability p can be estimated by the observed relative frequency of the event in the sample, denoted P-hat.

  • The estimation is associated with a Bernoulli random variable X: X = 1 when the event occurs, X = 0 when it does not.
  • The probability p is the expectation of X.
  • The estimator P-hat is the sample average of X.

📊 The test statistic for proportions

Null hypothesis: H₀: p = 0.5
Alternative hypothesis: H₁: p ≠ 0.5

  • The variance of P-hat is V(P-hat) = p(1−p)/n.

  • Under the null hypothesis, the variance is V(P-hat) = 0.5(1−0.5)/n.

  • The test statistic is the standardized sample proportion:

    Z = (P-hat − 0.5) / sqrt(0.5(1−0.5)/n)

  • This measures the ratio between the deviation of the estimator from its null expected value and the standard deviation of the estimator.

  • If the null hypothesis is true, the center of the sampling distribution of Z is 0.

  • Values of Z much larger or much smaller than 0 indicate the null hypothesis is unlikely.

  • The rejection region has the form |Z| > c for some threshold c.

📊 Relation to the t-test

  • Testing hypotheses on proportions is similar to testing hypotheses on expectations, since p is the expectation of the Bernoulli variable X.
  • A similar (though not identical) test to the t-test is used for proportions.
  • The excerpt mentions that the function prop.test is used for conducting such tests (details not provided in this excerpt).

Testing Hypothesis on Proportion

12.4 Testing Hypothesis on Proportion

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing on proportions uses the standardized sample proportion as a test statistic to determine whether an observed probability differs significantly from a hypothesized value, with the test's power depending critically on sample size relative to sampling variability.

📌 Key points (3–5)

  • What the test examines: whether the probability of an event equals a specific hypothesized value (e.g., p = 0.5).
  • How the test statistic works: the standardized sample proportion measures the ratio of deviation from the null value to the standard deviation of the estimator.
  • Distribution approximation: the Central Limit Theorem allows using Normal (or chi-square) distributions to compute thresholds and p-values.
  • Common confusion: a large difference between estimated and null proportions does not guarantee rejection—what matters is the relative discrepancy compared to sampling variability.
  • Sample size effect: larger samples reduce variability, making the same discrepancy more likely to lead to rejection.

🧮 The statistical framework

🎲 Connection to Bernoulli variables

The probability p of an event is estimated by the observed relative frequency in the sample, denoted P-hat.

  • The underlying model uses a Bernoulli random variable X:
    • X = 1 when the event occurs
    • X = 0 when it does not
  • The parameter p is the expectation of X
  • The estimator P-hat is the sample average of this measurement
  • This formulation connects proportion testing to testing hypotheses about expectations

📐 Variance under the null hypothesis

  • The variance of the estimator P-hat is: V(P-hat) = p(1−p)/n
  • Under the null hypothesis H₀: p = 0.5, the variance becomes: V(P-hat) = 0.5(1−0.5)/n = 0.25/n
  • This variance formula is crucial for standardizing the test statistic

🔬 Constructing the test

📊 The test statistic

The standardized sample proportion takes the form:

Z = (P-hat − 0.5) / sqrt(0.5(1−0.5)/n)

  • Numerator: deviation of the estimator from its null expected value
  • Denominator: standard deviation of the estimator under the null
  • The statistic measures how many standard deviations the observed proportion is from the hypothesized value

🎯 Rejection region

  • If H₀: p = 0.5 is true, then 0 is the center of the sampling distribution of Z
  • Values much larger or smaller than 0 indicate the null hypothesis is unlikely
  • Rejection region: {|Z| > c} for some threshold c
  • Equivalently: {Z² > c²}
  • The threshold c is set high enough to ensure the required significance level

📈 Distribution approximation

  • The Central Limit Theorem implies the test statistic is approximately Normal
  • Normal computations produce approximate thresholds and p-values
  • If Z has a standard Normal distribution, then Z² has a chi-square distribution with one degree of freedom
  • This chi-square distribution is used for computing p-values

🚗 Worked example: diesel car weights

🔍 The research question

  • Sample median curb weight: 2,414 lb (half above, half below)
  • If this is the population median, probability of a random car not exceeding this weight = 0.5
  • Hypothesis: Is the median weight of diesel cars also 2,414 lb?
  • Sample: 20 diesel cars out of 205 total car types

📋 The data structure

A 2×2 table summarizing weight group and fuel type:

Fuel type | Light (≤2,414 lb) | Heavy (>2,414 lb)
Diesel | 6 | 14
Gas | 97 | 88
  • 6 diesel cars below threshold, 14 above
  • Event of interest: diesel car with weight below threshold
  • Frequency: 6 out of 20 diesel cars

🧪 Test results (n=20)

  • Test statistic (with continuity correction): X-squared = 2.45
  • Degrees of freedom: 1
  • p-value: 0.1175
  • Conclusion: null hypothesis is NOT rejected at 5% significance level
  • Sample estimate: p-hat = 6/20 = 0.3
  • 95% confidence interval: [0.1284, 0.5433]

🔬 Comparison with larger sample (n=200)

When testing with 60 occurrences out of 200 (same proportion 0.3):

  • Test statistic: X-squared = 31.205
  • p-value: 2.322 × 10⁻⁸ (far below 0.05)
  • Conclusion: null hypothesis IS rejected decisively
  • Same estimated proportion (0.3) but different conclusion due to sample size
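
Both tests can be run with prop.test; a minimal sketch matching the counts quoted above (the default Yates continuity correction is applied):

```r
prop.test(6, 20, p = 0.5)     # X-squared about 2.45,   p-value about 0.1175
prop.test(60, 200, p = 0.5)   # X-squared about 31.205, p-value about 2.3e-08
```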

💡 Key insight: relative discrepancy

⚖️ Why sample size matters

The example demonstrates a fundamental principle:

  • Not about absolute discrepancy: The difference between p-hat = 0.3 and p₀ = 0.5 is the same in both examples
  • About relative discrepancy: What matters is the discrepancy compared to sampling variability
  • Larger sample → smaller variability → same discrepancy becomes more significant
  • The test considers whether the observed difference is large relative to what random sampling variation would produce

🎲 Sampling variability decreases with n

  • When n = 20: variability is large, so p-hat = 0.3 is not significantly different from 0.5
  • When n = 200: variability is smaller, so the same proportion (0.3) is highly significant
  • Don't confuse: a large absolute difference does not guarantee rejection if the sample is small

🛠️ Technical notes

🔧 Continuity correction

  • The test statistic uses Yates' correction for continuity by default
  • With correction: [|p-hat − p₀| − 0.5/n]² / [p₀(1−p₀)/n]
  • Without correction: [p-hat − p₀]² / [p₀(1−p₀)/n]
  • The correction adjusts for approximating a discrete distribution with a continuous one
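
The corrected statistic can also be computed by hand to confirm the two values reported above; a minimal sketch (the helper name is illustrative):

```r
yates <- function(phat, p0, n) (abs(phat - p0) - 0.5 / n)^2 / (p0 * (1 - p0) / n)
yates(0.3, 0.5, 20)                        # 2.45
yates(0.3, 0.5, 200)                       # 31.205
1 - pchisq(yates(0.3, 0.5, 20), df = 1)    # p-value about 0.1175
```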

🎚️ Default null probability

  • Default null hypothesis: p = 0.5
  • This value can be modified to test other hypothesized probabilities
  • The test framework remains the same; only the null value changes

Exercises

12.5 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply hypothesis testing concepts to real scenarios—testing for placebo effects, examining test robustness when assumptions are violated, and practicing decision-making with different significance levels.

📌 Key points (3–5)

  • Exercise 12.1: Tests whether a placebo effect exists by examining pain score changes in control patients who received inactive treatment.
  • Exercise 12.2: Explores how robust the t-test remains when data come from non-Normal distributions (Exponential and Uniform).
  • Exercise 12.3: Practices hypothesis testing decisions by varying the null hypothesis value and significance level with the same dataset.
  • Common confusion: The same data can lead to different conclusions depending on the significance level chosen (5% vs 1%) or the null hypothesis being tested.
  • Key skill: Translating research questions into formal null and alternative hypotheses in terms of expected values.

🧪 Exercise 12.1: Testing for placebo effect

🎯 The research question

  • Context: Clinical trials use placebo controls (inactive treatments that look identical to real treatment) because patients often respond to the act of being treated regardless of active ingredients.
  • Dataset: The "magnets.csv" file contains pain measurements; the last 21 observations are control patients who received inactive devices.
  • Goal: Determine whether a placebo effect exists in this pain relief study.

📝 What you need to do

📐 Part 1: Formulate hypotheses

  • Let X = change in pain score (before treatment minus after treatment) for placebo patients.
  • Express null and alternative hypotheses in terms of E(X):
    • Null hypothesis: Should reflect the situation where placebo effect is absent (no change in pain).
    • Alternative hypothesis: Should reflect the presence of a placebo effect.

🔍 Part 2: Identify the data

  • Find which observations in the dataset correspond to the control group (the 21 patients who received inactive placebo).

⚖️ Part 3: Conduct the test

  • Carry out the hypothesis test using a 5% significance level.
  • Report your conclusion about whether a placebo effect is present.

🔬 Exercise 12.2: Robustness of the t-test

🧩 The assumption being tested

The t-test assumes measurements are Normally distributed.

  • Question: What happens when this assumption is violated?
  • Task: Compute the actual significance level when the nominal level is 5%.

📊 Two scenarios to examine

| Scenario | Distribution | Parameters | Sample size |
| --- | --- | --- | --- |
| Part 1 | Exponential | rate = 1/4 (so E(X) = 4) | n = 20 |
| Part 2 | Uniform | range (0, 8) (so E(X) = 4) | n = 20 |

  • Test setup: Two-sided t-test of H₀: E(X) = 4 versus H₁: E(X) ≠ 4.
  • Goal: See if the actual Type I error rate matches the nominal 5% level when data aren't Normal.
  • Why it matters: Understanding robustness tells you when you can safely use the t-test with non-Normal data.
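
A Monte Carlo sketch of the computation the exercise asks for, under the assumption that the actual significance level is estimated by simulating many samples from the null distribution and recording how often t.test rejects (the number of repetitions and the seed are arbitrary choices, not part of the exercise):

set.seed(1)                      # arbitrary seed for reproducibility
reject <- rep(NA, 10^4)
for (i in 1:10^4) {
  x <- rexp(20, rate = 1/4)      # Exponential data with E(X) = 4; use runif(20, 0, 8) for Part 2
  reject[i] <- t.test(x, mu = 4)$p.value < 0.05
}
mean(reject)                     # estimated actual significance level of the nominal 5% test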

🎲 Exercise 12.3: Varying test parameters

📋 Given information

  • Sample size: n = 55
  • Sample mean: x̄ = 22.7
  • Sample standard deviation: s = 5.4

🔄 Three different testing scenarios

🔹 Part 1: Standard test

  • Hypotheses: H₀: E(X) = 20 versus H₁: E(X) ≠ 20
  • Significance level: 5%
  • Question: Do you reject the null hypothesis?

🔹 Part 2: Stricter significance level

  • Same hypotheses: H₀: E(X) = 20 versus H₁: E(X) ≠ 20
  • Significance level: 1% (more conservative)
  • Question: Do you reject the null hypothesis now?
  • Key insight: The same data may lead to different decisions with different significance thresholds.

🔹 Part 3: Different null value

  • New hypotheses: H₀: E(X) = 24 versus H₁: E(X) ≠ 24
  • Significance level: Back to 5%
  • Question: Do you reject this null hypothesis?
  • Key insight: Moving the null hypothesis value closer to the sample mean makes rejection less likely.

⚠️ Don't confuse

  • Significance level vs. decision: A lower significance level (1% vs 5%) makes it harder to reject the null, even with identical data.
  • Null value vs. sample mean: The farther the null hypothesis value is from the sample mean, the more likely you are to reject it (all else equal).

Testing Hypotheses

12.6 Summary

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing is a method for choosing between two competing claims—the currently accepted null hypothesis and the challenging alternative—by calculating the probability of wrongly rejecting the null hypothesis (the significance level) and using a test statistic to decide.

📌 Key points (3–5)

  • What hypothesis testing does: determines which of two hypotheses to accept based on sample data, controlling the risk of falsely rejecting the established theory.
  • Two types of error: Type I (rejecting a true null hypothesis) and Type II (failing to reject a false null hypothesis); the test is designed to control Type I error at the significance level.
  • How the decision works: compute a test statistic from the data; if it falls in the rejection region, reject the null hypothesis, otherwise do not reject it.
  • Common confusion: the null hypothesis is the "currently accepted" or "conservative" position (the one worse to wrongly reject), not necessarily the claim you want to prove; the alternative is the new challenge.
  • p-value interpretation: the p-value equals the significance level at which the observed statistic would just barely lead to rejection; smaller p-values indicate stronger evidence against the null.

🧩 Core framework

🧩 Null and alternative hypotheses

Null Hypothesis (H₀): A sub-collection that emerges in response to the situation when the phenomenon is absent. The established scientific theory that is being challenged. The hypothesis which is worse to erroneously reject.

Alternative Hypothesis (H₁): A sub-collection that emerges in response to the presence of the investigated phenomenon. The new scientific theory that challenges the currently established theory.

  • The null hypothesis represents the "status quo" or "no effect" scenario.
  • The alternative represents the new claim or the presence of the effect you are investigating.
  • The test is constructed to protect the null hypothesis: you only reject it if the data provide strong enough evidence.
  • Example: If you want to show a placebo effect exists, the null hypothesis is "placebo effect is absent" (no difference in pain scores), and the alternative is "placebo effect is present" (a difference exists).

📊 Test statistic and rejection region

Test Statistic: A statistic that summarizes the data in the sample in order to decide between the two alternatives.

Rejection Region: A set of values that the test statistic may obtain. If the observed value of the test statistic belongs to the rejection region then the null hypothesis is rejected. Otherwise, the null hypothesis is not rejected.

  • The test statistic condenses the sample information into a single number.
  • The rejection region is determined by the significance level and the structure of the test (two-sided, greater than, or less than).
  • If the observed test statistic falls in the rejection region, you reject H₀; otherwise, you do not reject H₀.
  • Don't confuse "not rejecting H₀" with "accepting H₀"—the test is designed only to control the error of wrongly rejecting H₀, not to prove H₀ is true.

⚠️ Errors and control

⚠️ Type I and Type II errors

Type I Error: The null hypothesis is correct but it is rejected by the test.

Type II Error: The alternative hypothesis holds but the null hypothesis is not rejected by the test.

| Error type | What happens | Controlled by |
| --- | --- | --- |
| Type I | Reject a true H₀ | Significance level (chosen in advance) |
| Type II | Fail to reject a false H₀ | Statistical power (1 − probability of Type II error) |

  • The test is constructed to keep the probability of Type I error at the chosen significance level (commonly 5%).
  • Type II error probability is not directly controlled by the test design; instead, statistical power (the probability of correctly rejecting a false H₀) is reported.
  • Example: A 5% significance level means that if H₀ is true, there is a 5% chance the test will wrongly reject it.

🎯 Significance level and statistical power

Significance Level: The probability of a Type I error. The probability, computed under the null hypothesis, of rejecting the null hypothesis. The test is constructed to have a given significance level. A commonly used significance level is 5%.

Statistical Power: The probability, computed under the alternative hypothesis, of rejecting the null hypothesis. The statistical power is equal to 1 minus the probability of a Type II error.

  • The significance level is set before you see the data (e.g., 5% or 1%).
  • A lower significance level (e.g., 1% instead of 5%) makes the test more conservative: you are less likely to reject H₀, reducing Type I error but potentially increasing Type II error.
  • Higher statistical power means the test is better at detecting a true effect when it exists.
  • Don't confuse: the significance level is computed under H₀; statistical power is computed under H₁.

📐 The t-test for expectations

📐 Test statistic formula

The excerpt provides the test statistic for testing a hypothesis about the expectation (mean):

Test Statistic for Expectation: t = (x̄ − μ₀) / (s / √n)

  • x̄ is the sample average.
  • μ₀ is the hypothesized value of the expectation under H₀.
  • s is the sample standard deviation.
  • n is the sample size.
  • This statistic measures how many standard errors the sample mean is away from the hypothesized mean.

🔀 Rejection regions for different tests

The excerpt lists three structures:

| Test type | Rejection rule | When to use |
| --- | --- | --- |
| Two-sided | Reject H₀ if \|t\| > qt(0.975, n−1) | H₁: E(X) ≠ μ₀ (mean is different) |
| Greater than | Reject H₀ if t > qt(0.95, n−1) | H₁: E(X) > μ₀ (mean is larger) |
| Less than | Reject H₀ if t < qt(0.05, n−1) | H₁: E(X) < μ₀ (mean is smaller) |

  • "qt(0.975, n−1)" refers to the quantile (critical value) from the t-distribution with n−1 degrees of freedom.
  • For a two-sided test at 5% significance, you split the 5% into 2.5% in each tail, so you use the 97.5th percentile.
  • For a one-sided test (greater than) at 5%, you use the 95th percentile.
  • Example: If you compute t = 2.8 and the critical value is 2.0, then |t| > 2.0, so you reject H₀ in a two-sided test.

🧪 Robustness and assumptions

The excerpt mentions that the t-test assumes measurements are Normally distributed, but Exercise 12.2 examines "robustness" to divergence from this assumption.

  • Robustness means the test still performs reasonably well even if the data are not perfectly Normal.
  • The excerpt does not provide conclusions about robustness, only that it is examined for Exponential and Uniform distributions.
  • Don't confuse: the t-test is designed for Normal data, but in practice it may still be used (with caution) for moderately non-Normal data, especially with larger sample sizes.

📉 The p-value

📉 What the p-value measures

p-value: A form of a test statistic. It is associated with a specific test statistic and a structure of the rejection region. The p-value is equal to the significance level of the test in which the observed value of the statistic serves as the threshold.

  • The p-value is the smallest significance level at which you would reject H₀ given the observed data.
  • A smaller p-value indicates stronger evidence against H₀.
  • If the p-value is less than your chosen significance level (e.g., 5%), you reject H₀; if it is greater, you do not reject H₀.
  • Example: If p-value = 3%, and your significance level is 5%, then 3% < 5%, so you reject H₀. If your significance level is 1%, then 3% > 1%, so you do not reject H₀.

📉 Interpretation and reporting

The excerpt's discussion forum raises the question of whether results with p-value between 5% and 10% should be published.

  • Many journals require p-value < 5% for "statistical significance."
  • The excerpt notes a tension: conservatism (protecting against Type I error) versus boldness (not missing new discoveries).
  • The excerpt does not prescribe an answer, but highlights that the choice of significance level affects what gets reported and accepted.
  • Don't confuse: a p-value slightly above 5% (e.g., 6% or 10%) does not mean "no effect"; it means the evidence is not strong enough to reject H₀ at the 5% level.

🧮 Worked example from Exercise 12.3

The excerpt provides a scenario to illustrate the test:

  • Hypotheses: H₀: E(X) = 20 versus H₁: E(X) ≠ 20 (two-sided test).
  • Data: sample size n = 55, sample mean x̄ = 22.7, sample standard deviation s = 5.4.
  • Test statistic: t = (22.7 − 20) / (5.4 / √55) = 2.7 / (5.4 / 7.416) ≈ 2.7 / 0.728 ≈ 3.71 (approximate calculation for illustration).
  • Decision at 5% significance: Compare |t| to the critical value from the t-distribution with 54 degrees of freedom. If |t| exceeds the critical value (approximately 2.0 for large samples), reject H₀.
  • Decision at 1% significance: the critical value is larger (approximately 2.7 for 54 degrees of freedom), so the null hypothesis is rejected only if |t| exceeds this stricter threshold.
  • Different null hypothesis: If H₀: E(X) = 24, then t = (22.7 − 24) / (5.4 / √55) ≈ −1.3 / 0.728 ≈ −1.79. If |t| < critical value, do not reject H₀.

The excerpt does not provide the final numerical answers, but the structure shows how changing the significance level or the null hypothesis value changes the decision.
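
A short R sketch that reproduces the calculations above from the given summary statistics (the critical values come from the t-distribution with 54 degrees of freedom):

t1 <- (22.7 - 20) / (5.4 / sqrt(55))   # about 3.71
t2 <- (22.7 - 24) / (5.4 / sqrt(55))   # about -1.79
qt(0.975, df = 54)                     # two-sided 5% critical value, about 2.00
qt(0.995, df = 54)                     # two-sided 1% critical value, about 2.67
c(abs(t1), abs(t2)) > qt(0.975, 54)    # which statistics fall in the 5% rejection region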


Comparing Two Samples

13.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Statistical inference for two samples examines how an explanatory variable with two levels affects the distribution of a response variable by comparing the distributions in the two resulting sub-samples.

📌 Key points (3–5)

  • Core relationship: In statistical relations, one variable (explanatory) affects the distribution of another variable (response), not its exact value.
  • Two-sample structure: An explanatory variable with two levels splits the sample into two sub-samples; inference compares the response distributions between them.
  • What to compare: Point estimation, confidence intervals, and hypothesis testing for expectations (using t.test) and variances (using var.test).
  • Common confusion: Statistical relations differ from mathematical functions—the explanatory variable changes the distribution of the response, not a deterministic value.
  • Real-world example: Clinical trials split patients into treatment and control groups; approval depends on whether the treatment group's response distribution is better.

🔍 Statistical relations vs mathematical functions

🔍 How statistical relations differ

  • In a mathematical function, one variable directly determines the value of another.
  • In a statistical relation, the explanatory variable affects the distribution of the response variable, not a single outcome.
    • For one value of the explanatory variable, the response has one distribution.
    • For a different value, the response may have a different distribution.
  • Don't confuse: The response is not a function of the explanatory variable; instead, its possible values and probabilities change.

📊 Response and explanatory variables defined

Response: The variable whose distribution is being investigated.

Explanatory variable: The variable which may have an effect on the distribution of the response.

  • The explanatory variable is the "cause" side; the response is the "effect" side.
  • Example: In a clinical trial, the treatment assignment (treatment vs control) is the explanatory variable; the patient's health outcome is the response.

🧪 Two-sample comparison framework

🧪 Structure of the problem

  • The explanatory variable is a factor with two levels (e.g., treatment vs control, group A vs group B).
  • This factor splits the overall sample into two sub-samples.
  • Statistical inference compares the distributions of the response variable in these two sub-samples.

🎯 What inference covers

The excerpt states that inference involves three components:

| Component | What it does |
| --- | --- |
| Point estimation | Estimates a parameter (e.g., difference in means) from the two sub-samples |
| Confidence intervals | Provides a range of plausible values for the parameter |
| Hypothesis testing | Tests whether the distributions differ in a specific way (e.g., different means or variances) |

🛠️ R functions for inference

  • t.test: Investigates the difference between the expectations (means) of the response in the two sub-samples.
  • var.test: Investigates the ratio between the variances of the response in the two sub-samples.
  • These functions carry out the statistical inference automatically.

🏥 Clinical trial example

🏥 How the design works

  • A group of patients is randomly divided into two sub-groups: treatment and control.
  • The treatment sub-group receives the new treatment anonymously.
  • The control sub-group receives the currently standard treatment.
  • The response to the medical intervention is measured for each patient.

✅ Approval criterion

  • The new treatment is approved for marketing only if the response distribution for the treatment sub-group is better than for the control sub-group.
  • This is a treatment-control experimental design used in many scientific and industrial settings.
  • Example: If the treatment group shows a higher mean recovery rate or lower variance in symptoms, the treatment may be approved.

🔬 Variables in the trial

  • Explanatory variable: Treatment assignment (treatment vs control)—this is the factor with two levels.
  • Response: The measured outcome of the medical intervention for each patient.
  • The inference compares the response distributions between the two groups to determine if the treatment has an effect.

Comparing Two Distributions

13.2 Comparing Two Distributions

🧭 Overview

🧠 One-sentence thesis

Statistical inference for comparing two distributions examines how an explanatory variable affects the distribution of a response variable, typically by testing whether the distributions differ between two sub-groups and estimating the magnitude of that difference.

📌 Key points (3–5)

  • Response vs. explanatory variable: the response is the variable whose distribution is investigated; the explanatory variable (a factor with two levels) determines which sub-group each observation belongs to.
  • Statistical relationships are distributional: unlike mathematical functions, statistical relationships mean the distribution of the response changes with the explanatory variable, not that one variable directly determines the other.
  • Clinical trial paradigm: the treatment-control design splits patients into two sub-groups and compares the response distribution between them to determine if the treatment is better.
  • Common confusion: statistical relationships vs. functional relationships—in statistics, the response is not a direct function of the explanatory variable; instead, the explanatory variable affects the distribution of possible response values.
  • Inference toolkit: comparing two distributions involves hypothesis testing (e.g., are the expectations equal?), point estimation, and confidence intervals for parameters like the difference in means or ratio of variances.

🔍 Core concepts

🔍 Response and explanatory variables

Response: the variable whose distribution is being investigated.

Explanatory variable: the variable which may have an effect on the distribution of the response.

  • The explanatory variable in this chapter is a factor with two levels (e.g., treatment vs. control).
  • Each level corresponds to a sub-sample of the data.
  • The goal is to compare the response distribution across the two sub-samples.

🔗 Statistical vs. mathematical relationships

Mathematical relationship:

  • One variable is a direct function of the other.
  • The value of the first variable is determined by the value of the second.

Statistical relationship:

  • The distribution of one variable is affected by the value of the other.
  • For a given value of the explanatory variable, the response may have one distribution; for a different value, the response may have a different distribution.
  • The response is not a direct function of the explanatory variable.

Example: In a clinical trial, whether a patient receives treatment or control (explanatory variable) affects the distribution of health outcomes (response), but does not uniquely determine any single patient's outcome.

Don't confuse: A statistical relationship does not mean "if X changes, Y changes by a fixed amount." It means "if X changes, the range and likelihood of Y's values change."

🧪 The treatment-control design

🧪 How the design works

  • A group of patients is randomly divided into two sub-groups: treatment and control.
  • The treatment sub-group receives the new treatment (anonymously administered).
  • The control sub-group receives the currently standard treatment.
  • The response to the medical intervention is measured for each patient.

✅ Approval criterion

  • The new treatment is approved for marketing only if the response is better for the treatment sub-group than for the control sub-group.
  • This requires comparing the distribution of the response variable between the two sub-samples.

🔬 Broader application

  • The treatment-control design is used in many scientific and industrial settings, not just clinical trials.
  • It is a special case of investigating the effect of an explanatory variable (with two levels) on a response variable.

📊 Types of comparisons

📊 Numeric response

When the response is a numeric measurement, the analysis may compare:

| Comparison type | What is compared |
| --- | --- |
| Expectations | The mean (expected value) of the response in one sub-group vs. the other |
| Variances | The variance of the response in one sub-group vs. the other |

  • The excerpt mentions using the R function t.test to investigate differences between expectations.
  • The excerpt mentions using the R function var.test to investigate the ratio between variances.

🎯 Event indicator response

  • If the response is the indicator of the occurrence of an event (e.g., did the patient recover? yes/no), the analysis compares two probabilities.
  • Specifically: the probability of the event in the treatment group vs. the probability in the control group.

Don't confuse: The type of comparison depends on the nature of the response variable—numeric responses lead to comparisons of means or variances, while event indicators lead to comparisons of probabilities.

🧰 Inference toolkit

🧰 Hypothesis testing

  • Null hypothesis: the distribution of the response is the same in both sub-groups.
  • Alternative hypothesis: the distribution is not the same.
  • Testing helps answer: "Are the two expectations (or variances, or probabilities) equal?"

📏 Point estimation and confidence intervals

  • Point estimation: a single best guess for a parameter (e.g., the difference between the two means).
  • Confidence interval: a range of plausible values for the parameter.
  • These tools help assess the magnitude of the difference when the null hypothesis is rejected.

🔧 R functions mentioned

  • t.test: investigates the difference between the expectations of the response variable in the two sub-samples.
  • var.test: investigates the ratio between the variances of the response variable in the two sub-samples.

🗺️ Context and scope

🗺️ Where this chapter fits

  • Previous chapters: dealt with the distribution of a single measurement.
  • This chapter: compares the distribution of a numerical response between two sub-groups determined by a factor (explanatory variable with two levels).
  • Next chapter: will consider the case where the explanatory variable is numeric (not a two-level factor).
  • Subsequent chapter: will describe inference when the response is the indicator of the occurrence of an event.

🎯 Why this matters

  • Most applications are interested in relationships between several measurements, not just the characteristics of a single measurement.
  • Understanding how one measurement affects another is central to scientific and industrial decision-making.
  • Example: In a clinical trial, the decision to approve a new treatment depends on comparing response distributions between treatment and control groups.

Comparing the Sample Means

13.3 Comparing the Sample Means

🧭 Overview

🧠 One-sentence thesis

Statistical inference for comparing two sub-sample means uses the Welch Two Sample t-test to determine whether the expectations differ and constructs confidence intervals to estimate the size of that difference.

📌 Key points (3–5)

  • What the inference addresses: testing equality of expectations in two sub-samples and estimating the difference between them.
  • The Welch t-test structure: uses a test statistic based on the difference between sub-sample averages, standardized by the estimated standard deviation of that difference.
  • Confidence interval for the difference: constructed from the difference in averages ± a percentile times the estimated standard deviation, similar to single-sample intervals but accounting for variability in both sub-samples.
  • Common confusion: the model involves two random variables (X_a and X_b) with potentially different distributions, not a single measurement split into groups.
  • Practical interpretation: a p-value less than 0.05 leads to rejecting the null hypothesis that expectations are equal; the confidence interval quantifies the plausible range for the difference.

🔬 The two-population statistical model

🔬 Two random variables, not one

The model deals with two populations rather than one population.

  • In single-population inference, one random variable X represents the measurement.
  • Here, two sub-populations exist (e.g., lighter vs. heavier cars), each with its own random variable:
    • X_a: measurement for the first sub-population (e.g., cars ≤ 2,414 lb).
    • X_b: measurement for the second sub-population (e.g., cars > 2,414 lb).
  • Each random variable may have a different distribution, so their expectations E(X_a) and E(X_b) and variances V(X_a) and V(X_b) may differ.

📊 Parameters and estimators

| Parameter | Estimator | Description |
| --- | --- | --- |
| E(X_a) | X̄_a | Expectation of first sub-population; estimated by first sub-sample average |
| E(X_b) | X̄_b | Expectation of second sub-population; estimated by second sub-sample average |
| V(X_a) | S²_a | Variance of first sub-population; estimated by first sub-sample variance |
| V(X_b) | S²_b | Variance of second sub-population; estimated by second sub-sample variance |

  • The goal is to make inference about the difference in expectations: E(X_a) − E(X_b).
  • The natural estimator for this difference is the difference in averages: X̄_a − X̄_b.

🔍 Example setup: miles-per-gallon difference

  • Response variable: difference in miles-per-gallon between highway and city driving (dif.mpg).
  • Explanatory factor: whether curb weight is above 2,414 lb (heavy, with levels FALSE and TRUE).
  • Sub-sample a (FALSE): lighter cars, n_a = 103, average = 5.805825, variance = 2.020750.
  • Sub-sample b (TRUE): heavier cars, n_b = 102, average = 5.254902, variance = 3.261114.
  • Observed difference in averages: 5.805825 − 5.254902 = 0.550923.

🧮 Confidence interval for the difference

🧮 Construction logic

  • The confidence interval for E(X_a) − E(X_b) follows the same logic as the single-sample interval but accounts for variability in both sub-samples.
  • Start with the estimator X̄_a − X̄_b and its deviation from the true difference: {X̄_a − X̄_b} − {E(X_a) − E(X_b)}.
  • Standardize by dividing by the standard deviation of the estimator.

📐 Variance of the difference in averages

The variance of the difference is V(X̄_a − X̄_b) = V(X̄_a) + V(X̄_b) = V(X_a)/n_a + V(X_b)/n_b.

  • Both averages contribute to variability; the total is the sum of the two contributions.
  • This formula assumes the two sub-samples are independent.
  • Don't confuse: variance of a difference (or sum) of independent random variables is the sum of the variances, not the difference.

🎯 The standardized deviation and Normal approximation

  • The standardized deviation is:
    • Z = {X̄_a − X̄_b − (E(X_a) − E(X_b))} / √(V(X_a)/n_a + V(X_b)/n_b)
  • When both n_a and n_b are large, Z is approximately standard Normal by the Central Limit Theorem.
  • Therefore, P(−1.96 ≤ Z ≤ 1.96) ≈ 0.95.

🔧 Substituting sample variances

  • The unknown variances V(X_a) and V(X_b) are estimated by S²_a and S²_b.
  • When both sub-sample sizes are large, these estimators are good approximations.
  • The approximate event becomes:
    • {−1.96 · √(S²_a/n_a + S²_b/n_b) ≤ X̄_a − X̄_b − (E(X_a) − E(X_b)) ≤ 1.96 · √(S²_a/n_a + S²_b/n_b)}
  • Rearranging to put the parameter in the center gives the confidence interval:
    • X̄_a − X̄_b ± 1.96 · √(S²_a/n_a + S²_b/n_b)

📊 Example calculation

  • Difference in averages: 0.550923.
  • Estimated standard deviation of the difference:
    • √(2.020750/103 + 3.261114/102) = 0.227135.
  • Confidence interval: 0.550923 ± 1.96 · 0.227135 = [0.1057384, 0.9961076].
  • Note: the function t.test uses the t-distribution percentile (1.972425) instead of 1.96, giving [0.1029150, 0.9989315], which is very similar.
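
A few lines of R reproduce this interval from the sub-sample summaries reported above; replacing the Normal percentile with the t-distribution percentile gives the t.test version:

se <- sqrt(2.020750/103 + 3.261114/102)                     # 0.227135, estimated sd of the difference
(5.805825 - 5.254902) + c(-1, 1) * qnorm(0.975) * se        # approximately [0.1057, 0.9961]
(5.805825 - 5.254902) + c(-1, 1) * qt(0.975, 191.561) * se  # approximately [0.1029, 0.9989], as in t.test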

🧪 The Welch Two Sample t-test

🧪 Hypotheses

  • Null hypothesis H₀: E(X_a) = E(X_b), or equivalently E(X_a) − E(X_b) = 0.
  • Alternative hypothesis H₁: E(X_a) ≠ E(X_b), or equivalently E(X_a) − E(X_b) ≠ 0.
  • The test determines whether the data provide sufficient evidence to reject the null hypothesis that the expectations are equal.

📏 Test statistic

The T statistic is the ratio between the deviation of the estimator from the null value of the parameter, divided by the estimated standard deviation of the estimator.

  • In this setting:
    • T = (X̄_a − X̄_b − 0) / √(S²_a/n_a + S²_b/n_b) = (X̄_a − X̄_b) / √(S²_a/n_a + S²_b/n_b)
  • The numerator is the difference in averages (since the null value is 0).
  • The denominator is the estimated standard deviation of the difference.

🎲 Distribution under the null hypothesis

  • When both n_a and n_b are large, the limit distribution of T is standard Normal.
  • For a refined approximation (especially when measurements are Normally distributed), the t-distribution is used.
  • The Welch test uses the t-distribution with degrees of freedom df computed by a specific formula (in the example, df = 191.561).

🔍 Example calculation

  • Observed T statistic: 0.550923 / 0.227135 = 2.425531.
  • This matches the value "t = 2.4255" in the t.test output.

📊 p-value and decision

The p-value is the probability of obtaining values of the test statistic more extreme than the value obtained in the data, computed under the null hypothesis.

  • For a symmetric distribution (Normal or t), the probability of obtaining a value in either tail is twice the probability in the upper tail:
    • P(|T| > 2.4255) = 2 × P(T > 2.4255) = 2 × [1 − P(T ≤ 2.4255)]
  • Using the t-distribution with df = 191.561: p-value = 0.01621.
  • Since 0.01621 < 0.05, reject the null hypothesis at the 5% significance level.
  • Conclusion: the expectations are not equal; weight has an effect on the expected difference in miles-per-gallon.
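
In R, this p-value can be reproduced directly from the t-distribution with the reported degrees of freedom:

2 * (1 - pt(2.4255, df = 191.561))   # approximately 0.01621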

⚠️ Don't confuse: rejection region vs. p-value

  • The rejection region approach: reject H₀ if |T| > qt(0.975, df).
  • The p-value approach: reject H₀ if p-value < 0.05.
  • Both are equivalent; the p-value is the probability that the test statistic falls in the rejection region under the null hypothesis.

🛠️ Practical implementation in R

🛠️ Defining the factor

  • The variable heavy is redefined as a factor using the function factor applied to a logical sequence (cars$curb.weight > 2414).
  • This produces a factor with two levels: "FALSE" (lighter cars) and "TRUE" (heavier cars).
  • Note: the redefined heavy is no longer a logical sequence and cannot be used as an index to select components.

📈 Visualizing the comparison

  • The function plot applied to the formula dif.mpg ~ heavy produces two box plots, one for each level of the factor.
  • The box plots show that the distribution of the response for heavier cars (TRUE) tends to have smaller values than for lighter cars (FALSE).

🧪 Running the test

  • The function t.test applied to the formula dif.mpg ~ heavy performs the Welch Two Sample t-test.
  • The output includes:
    • Test statistic: t = 2.4255.
    • Degrees of freedom: df = 191.56.
    • p-value: 0.01621.
    • 95% confidence interval: [0.1029150, 0.9989315].
    • Sample estimates: mean in group FALSE = 5.805825, mean in group TRUE = 5.254902.

🔢 Computing sub-sample statistics

  • The function table(heavy) counts observations in each level: 103 (FALSE) and 102 (TRUE).
  • The function tapply(dif.mpg, heavy, mean) computes the average for each level.
  • The function tapply(dif.mpg, heavy, var) computes the variance for each level.
  • These values match those reported by t.test.
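
Putting these steps together, a sketch of the commands described in this section (assuming the cars data frame and the dif.mpg variable from the chapter's example):

heavy <- factor(cars$curb.weight > 2414)  # two-level factor: FALSE (lighter) / TRUE (heavier)
plot(dif.mpg ~ heavy)                     # side-by-side box plots of the response
t.test(dif.mpg ~ heavy)                   # Welch Two Sample t-test: t = 2.4255, p-value = 0.01621
table(heavy)                              # 103 FALSE, 102 TRUE
tapply(dif.mpg, heavy, mean)              # group averages
tapply(dif.mpg, heavy, var)               # group variances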

Comparing Sample Variances

13.4 Comparing Sample Variances

🧭 Overview

🧠 One-sentence thesis

The F-distribution enables inference on the ratio of variances between two sub-populations, allowing both confidence intervals and hypothesis tests to determine whether variances differ significantly.

📌 Key points (3–5)

  • What is compared: the ratio of variances between two sub-populations, using the ratio of sample variances as the estimator.
  • The F-distribution: the sampling distribution of the ratio of two sample variances (when measurements are Normal), characterized by two degrees of freedom parameters (numerator and denominator).
  • Confidence interval construction: reformulate the probability statement so the ratio of population variances is isolated in the center, using F-distribution percentiles.
  • Hypothesis testing: test whether variances are equal by checking if the observed F statistic is extreme (much larger or smaller than 1); reject the null hypothesis if the p-value is below the significance level.
  • Common confusion: the F-distribution requires two degrees of freedom (one for each sample variance), not one; the numerator and denominator degrees of freedom are both sample size minus 1.

📐 The ratio of variances and the F-distribution

📐 What we compare

  • The goal is to compare the variance in sub-population a, denoted V(X_a), to the variance in sub-population b, denoted V(X_b).
  • The basis for comparison is the ratio of the two sample variances: S²_a and S²_b, computed from the two sub-samples.
  • The ratio of variances V(X_a) / V(X_b) is the parameter of interest.

🎲 The F-distribution

The F-distribution: the sampling distribution of the ratio of two sample variances (each divided by its corresponding population variance), characterized by two degrees of freedom parameters.

  • The random variable (S²_a / V(X_a)) / (S²_b / V(X_b)) follows an F-distribution with (n_a - 1, n_b - 1) degrees of freedom.
  • The first parameter (numerator degrees of freedom) is n_a - 1, the number of observations in the first sub-sample minus 1.
  • The second parameter (denominator degrees of freedom) is n_b - 1, the number of observations in the second sub-sample minus 1.
  • Important: this statement holds when the measurement has a Normal distribution; if not Normal, the ratio does not follow the F-distribution.

🔧 Computing F-distribution percentiles

  • In R, the function "qf" computes percentiles of the F-distribution.
  • Example: "qf(0.025, dfa, dfb)" gives the 0.025-percentile, where dfa = n_a - 1 and dfb = n_b - 1.
  • Similarly, "qf(0.975, dfa, dfb)" gives the 0.975-percentile.
  • Between these two percentiles lies 95% of the F-distribution.

🔍 Confidence interval for the ratio of variances

🔍 Starting probability statement

  • The probability that the ratio (S²_a / V(X_a)) / (S²_b / V(X_b)) falls between the 0.025 and 0.975 percentiles is 0.95.
  • In symbols: P(qf(0.025, dfa, dfb) ≤ (S²_a / V(X_a)) / (S²_b / V(X_b)) ≤ qf(0.975, dfa, dfb)) = 0.95.

🔄 Reformulation to isolate the parameter

  • Rewrite the event so the ratio of population variances V(X_a) / V(X_b) is in the center.
  • The resulting 95% confidence interval is:
    • Lower bound: (S²_a / S²_b) / qf(0.975, dfa, dfb)
    • Upper bound: (S²_a / S²_b) / qf(0.025, dfa, dfb)
  • Notice the inversion: the larger percentile (0.975) appears in the denominator of the lower bound, and vice versa.

📊 Interpretation

  • This interval estimates the ratio V(X_a) / V(X_b) with 95% confidence.
  • If the interval does not contain 1, it suggests the variances are different.

🧪 Hypothesis testing for equality of variances

🧪 Null and alternative hypotheses

  • Null hypothesis H₀: V(X_a) / V(X_b) = 1 (the variances are equal).
  • Alternative hypothesis H₁: V(X_a) / V(X_b) ≠ 1 (the variances are not equal).
  • The test statistic is F = S²_a / S²_b.

🎯 Decision rule

  • Under the null hypothesis, the test statistic F follows an F(n_a - 1, n_b - 1) distribution.
  • Reject the null hypothesis if F is either much smaller or much larger than 1:
    • Reject if F < qf(0.025, dfa, dfb), or
    • Reject if F > qf(0.975, dfa, dfb).
  • The significance level of this test is 5%.

📈 Computing the p-value

  • The p-value depends on whether the observed value f is less than or greater than 1.
  • If f < 1: p-value = 2 · P(F < f) (twice the lower tail probability).
  • If f > 1: p-value = 2 · P(F > f) = 2 · [1 - P(F ≤ f)] (twice the upper tail probability).
  • Reject the null hypothesis at the 5% significance level if the p-value is less than 0.05.

⚠️ Why twice the tail probability

  • The alternative hypothesis is two-sided (variances can be unequal in either direction).
  • Doubling the tail probability accounts for both extremes (much smaller or much larger than 1).

🖥️ Example: comparing variances in R

🖥️ Using the var.test function

  • The function "var.test" performs the F-test for equality of variances.
  • Input: a formula like "dif.mpg ~ heavy", with a numeric variable on the left and a two-level factor on the right.
  • The function produces the test statistic, degrees of freedom, p-value, confidence interval, and estimated ratio of variances.

📋 Interpreting the output

  • Test statistic: F = 0.6197 (the ratio of sample variances S²_a / S²_b).
  • Degrees of freedom: numerator df = 102, denominator df = 101.
  • P-value: 0.01663, which is less than 0.05.
  • Conclusion: reject the null hypothesis; the two variances are significantly different.
  • Estimated ratio: 0.6196502 (the sample variance ratio).
  • 95% confidence interval: [0.4189200, 0.9162126] (does not contain 1, consistent with rejection).

🔢 Manual verification

  • Sample variances: s²_a = 2.020750, s²_b = 3.261114.
  • Sample sizes: n_a = 103, n_b = 102.
  • Test statistic: s²_a / s²_b = 2.020750 / 3.261114 = 0.6196502 (matches the report).
  • Degrees of freedom: dfa = 102, dfb = 101.
  • Since f = 0.6196502 < 1, the p-value is 2 · pf(0.6196502, 102, 101) = 0.01662612 (matches the report after rounding).
  • Percentiles: qf(0.025, 102, 101) = 0.676317, qf(0.975, 102, 101) = 1.479161.
  • Confidence interval: [0.6196502 / 1.479161, 0.6196502 / 0.676317] = [0.4189200, 0.9162127] (matches the report).
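
The same verification can be written in R with pf and qf (numbers as reported above):

f <- 2.020750 / 3.261114                             # 0.6196502, ratio of sample variances
2 * pf(f, df1 = 102, df2 = 101)                      # p-value, approximately 0.0166 (f < 1, so lower tail)
c(f / qf(0.975, 102, 101), f / qf(0.025, 102, 101))  # 95% confidence interval, approximately [0.4189, 0.9162]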

🧩 Don't confuse

  • The test statistic F is the ratio of sample variances, not the ratio of population variances.
  • The confidence interval is for the ratio of population variances V(X_a) / V(X_b), not for the individual variances.
  • The F-distribution has two parameters (numerator and denominator degrees of freedom), unlike the chi-square distribution used for a single variance (which has one parameter).

13.5 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply F-test methods for comparing variances and means across two samples, including hypothesis testing at the 5% significance level and examining robustness when distributional assumptions are violated.

📌 Key points (3–5)

  • Exercise 13.1 focus: testing differences in pain scores between treatment and control groups (both means and variances) using magnet trial data.
  • Exercise 13.2 focus: examining how robust the F-test for variance equality is when data diverge from the Normal distribution assumption.
  • Exercise 13.3 focus: estimating the ratio of variances given summary statistics from two sub-samples of equal size.
  • Common confusion: the F-test assumes Normality; robustness exercises check what happens when this assumption fails (e.g., Exponential distribution).
  • Key technique: all hypothesis tests use a 5% significance level and involve comparing variances or means between two groups.

🧪 Exercise 13.1: Magnet pain trial analysis

🧪 Study design and variables

  • Data source: file "magnets.csv" from a randomized trial.
  • Groups: patients randomly assigned to treatment (active magnet, level "1") or control (inactive placebo, level "2").
  • Response variables:
    • change: difference in pain score before vs. after treatment
    • score1: pain score before device application
  • Explanatory variable: factor active with two levels.

📋 Four hypothesis tests required

| Question | What to compare | Type of test |
| --- | --- | --- |
| 1 | Expectation of pain score before application | Test for difference in means |
| 2 | Variance of pain score before application | F-test for variance equality |
| 3 | Expectation of change in score | Test for difference in means |
| 4 | Variance of change in score | F-test for variance equality |

  • All tests conducted at 5% significance level.
  • Example: Question 1 asks whether the average pre-treatment pain differs between groups; Question 2 asks whether the variability in pre-treatment pain differs.

🔍 Don't confuse

  • Before vs. after: Questions 1–2 use score1 (baseline); Questions 3–4 use change (treatment effect).
  • The trial is randomized, so baseline differences (Questions 1–2) test whether randomization worked; treatment effect differences (Questions 3–4) test whether the magnet had an impact.

🔬 Exercise 13.2: Robustness of the F-test

🔬 Purpose and setup

  • Goal: compute the actual significance level of the F-test when the Normality assumption is violated.
  • Null hypothesis: H₀: V(Xₐ) = V(Xᵦ) versus H₁: V(Xₐ) ≠ V(Xᵦ) (two-sided test).
  • Sample sizes: nₐ = 29 and nᵦ = 21.
  • Nominal significance level: 5% (the level the test is designed for under Normality).

📊 Two scenarios to compare

| Case | Distribution | Parameters |
| --- | --- | --- |
| 1 | Normal | X ~ Normal(4, 4²) |
| 2 | Exponential | X ~ Exponential(1/4) |

  • What "robustness" means here: the F-test is constructed assuming Normal data; this exercise checks whether the test maintains its 5% significance level when data follow a different distribution (Exponential).
  • Example: if the actual significance level under Exponential data is much higher than 5%, the test is not robust—it rejects H₀ too often when it's true.

⚠️ Don't confuse

  • Nominal vs. actual significance level: nominal is what the test claims (5%); actual is what happens in practice when assumptions are violated.
  • The exercise does not ask whether the test has power; it asks whether the test controls Type I error correctly under non-Normal data.

📐 Exercise 13.3: Estimating variance ratio from summary statistics

📐 Given information

  • First sub-sample: sample mean x̄ₐ = 124.3, sample standard deviation sₐ = 13.4, size nₐ = 15.
  • Second sub-sample: sample mean x̄ᵦ = 80.5, sample standard deviation sᵦ = 16.7, size nᵦ = 15.
  • Target: estimate the ratio of variances V(Xₐ) / V(Xᵦ).

🧮 How to compute the estimate

  • The natural estimate of the variance ratio is the ratio of sample variances: sₐ² / sᵦ².
  • Sample variances are obtained by squaring the given standard deviations.
  • Example: sₐ² = (13.4)² and sᵦ² = (16.7)², so the estimate is (13.4)² / (16.7)².
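
A quick R computation of the point estimate; the last line adds a 95% interval using the F-percentiles for (14, 14) degrees of freedom, which goes slightly beyond what the exercise asks for:

13.4^2 / 16.7^2                                   # estimated ratio of variances, about 0.644
(13.4^2 / 16.7^2) / qf(c(0.975, 0.025), 14, 14)   # roughly [0.22, 1.92]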

🔍 Don't confuse

  • The exercise provides standard deviations, not variances; you must square them first.
  • The sample means (x̄ₐ and x̄ᵦ) are given but are not directly used in estimating the variance ratio—they would be relevant for comparing means, not variances.

Comparing Two Samples: Summary

13.6 Summary

🧭 Overview

🧠 One-sentence thesis

This chapter provides formulas and methods for comparing two groups on both their means (expectations) and variances, using t-tests for equality of expectations and F-tests for equality of variances.

📌 Key points (3–5)

  • Response vs explanatory variable: the response is what you measure; the explanatory variable is what may affect it.
  • Two main comparisons: testing whether two groups differ in their average (expectation) or in their spread (variance).
  • Design vs analysis: good study design—how you collect data—is as important as statistical analysis; quality of data often matters more than quantity.
  • Common confusion: sample size affects confidence intervals—larger samples give tighter intervals even when point estimates stay the same.

📊 Core concepts

📊 Response variable

Response: The variable whose distribution one seeks to investigate.

  • This is the outcome you measure.
  • Example: pain score after treatment.

📊 Explanatory variable

Explanatory Variable: A variable that may affect the distribution of the response.

  • This is the factor you think might influence the outcome.
  • Example: whether an active magnet or inactive placebo was applied (a factor with two levels).

🧪 Testing equality of expectations

🧪 The t-test statistic

  • Formula: t = (x̄_a − x̄_b) / √(s²_a/n_a + s²_b/n_b).
  • This measures how far apart the two sample means are, relative to the variability within each group.
  • Used to test whether the true means of the two groups differ.

🧪 Confidence interval for the difference in means

  • Formula: (x̄_a − x̄_b) ± qnorm(0.975) · √(s²_a/n_a + s²_b/n_b).
  • Provides a range of plausible values for the true difference in expectations.
  • Example: if the interval does not include zero, there is evidence of a difference.

🔬 Testing equality of variances

🔬 The F-test statistic

  • Formula: f = s²_a / s²_b.
  • This is the ratio of the two sample variances.
  • Used to test whether the two groups have the same spread (variance).

🔬 Confidence interval for the ratio of variances

  • Formula: [ (s²_a/s²_b) / qf(0.975, dfa, dfb) , (s²_a/s²_b) / qf(0.025, dfa, dfb) ].
  • Provides a range for the true ratio of variances.
  • Don't confuse: the F-test assumes Normal distributions; robustness to this assumption should be checked (as in Exercise 13.2).

🎯 Design and data quality

🎯 Why design matters

  • The excerpt emphasizes that statistics plays an important role not just in analysis but in the design stage—deciding how to collect data.
  • Good design improves the chances that conclusions will be meaningful and trustworthy.

🎯 Quantity vs quality

  • Some claim quantity (amount of data) is most important; others say quality is more important.
  • Example from the excerpt: telephone surveys can reach many people quickly (high quantity), but face-to-face interviews may yield better-quality responses.
  • Poor quality or insufficient quantity can both undermine the validity of conclusions.

📐 Sample size effects

📐 How sample size changes inference

  • Exercise 13.3 illustrates: when sample size increases from n = 15 to n = 150 (but sample means and standard deviations stay the same), the point estimate of the variance ratio does not change, but the confidence interval becomes narrower.
  • Larger samples reduce uncertainty, leading to tighter intervals and more precise inference.
  • Don't confuse: the estimate itself (e.g., the ratio of sample variances) is unchanged, but our confidence in it improves.

Linear Regression: Student Learning Objectives

14.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

Linear regression describes the linear relationship between two numeric variables by fitting a line to the data, enabling both parameter estimation and statistical inference about how the explanatory variable affects the response distribution.

📌 Key points (3–5)

  • What linear regression does: fits a line to data when both response and explanatory variables are numeric, summarizing the effect of the explanatory variable on the response distribution.
  • Core skills: produce scatter plots, understand how linear equations relate to lines, fit models using tools, and conduct statistical inference on fitted models.
  • Key metric R²: measures the percentage of response variability explained by the regression model, related to residual variability and response variance.
  • Common confusion: distinguish between plotting raw data (scatter plots) versus fitting a model (the regression line that summarizes the relationship).
  • Context: this extends previous work on numeric responses with two-level factors to the case where the explanatory variable is also numeric.

📊 What linear regression addresses

📊 The scenario

  • Previous context: earlier chapters examined numeric responses with factor explanatory variables (two levels).
  • This chapter's focus: both response and explanatory variables are numeric.
  • The method used to describe relations between two numeric variables is called regression; specifically, linear regression for linear relationships.

🎯 Purpose of the line

The line summarizes the effect of the explanatory variable on the distribution of the response.

  • The fitted line is not just a visual aid—it represents the model of how one variable affects the other.
  • Statistical inference can be conducted on this model: point estimation of parameters, confidence intervals, and hypothesis testing.

🔧 Core capabilities

🔧 Scatter plots

  • What they show: graphical representation of both variables together, revealing potential relationships.
  • Each observation appears as a point in a two-dimensional plane.
  • The x-axis represents the explanatory variable; the y-axis represents the response.

📐 Lines and equations

  • Understanding how a line relates to the parameters of a linear equation.
  • Ability to add lines to scatter plots to visualize the fitted relationship.

🛠️ Model fitting

  • Use statistical functions (the excerpt mentions "lm") to fit linear regression to data.
  • Conduct statistical inference on the fitted model: estimate parameters, build confidence intervals, test hypotheses.

📈 Assessing model quality

📈 R² and variability

The excerpt emphasizes understanding the relationship among three concepts:

| Concept | Role |
| --- | --- |
| R² | Percentage of response variability explained by the regression model |
| Regression residuals | Variability not captured by the model |
| Response variance | Total variability in the response variable |

  • Why it matters: R² tells you how much of the variation in the response is accounted for by the linear relationship with the explanatory variable.
  • Don't confuse: R² is about explained variability; residuals represent unexplained variability; together they relate to the total variance of the response.
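
A minimal sketch of this relationship for a fitted simple regression (the variable names y and x are placeholders, not from the source):

fit <- lm(y ~ x)                                   # fitted simple linear regression
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # equals the R-squared reported by summary(fit)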

🎓 Learning outcome

By the end of the chapter, students should be able to explain how these three quantities relate to one another—understanding that the model's explanatory power (R²) is tied to how much variability remains in the residuals versus the original response variance.


Points and Lines

14.2 Points and Lines

🧭 Overview

🧠 One-sentence thesis

Scatter plots visualize the relationship between two numeric variables, and linear equations with intercepts and slopes describe straight-line trends that can be added to these plots to represent patterns in the data.

📌 Key points (3–5)

  • Scatter plots: each observation becomes a point with x (explanatory variable) and y (response variable) coordinates.
  • Linear equations: take the form y = a + b·x, where a is the intercept and b is the slope.
  • Intercept vs slope: intercept is where the line crosses the y-axis (value when x=0); slope is the change in y for each unit change in x.
  • Common confusion: positive slope means increasing line, negative slope means decreasing line, zero slope means constant (horizontal) line.
  • Adding lines to plots: high-level functions create entire plots; low-level functions (like abline) add features to existing plots.

📊 Scatter plots and data visualization

📍 What a scatter plot shows

  • A scatter plot represents pairs of observations as points on a two-dimensional graph.
  • The x-axis shows the explanatory variable; the y-axis shows the response variable.
  • Each observation in the dataset becomes one point at coordinates (x, y).

Example: In a dataset with 10 observations, there will be 10 points on the scatter plot. If the first observation has x=4.5 and y=9.5, that point appears at those coordinates.

🚗 The cars dataset example

The excerpt describes a scatter plot of horsepower versus engine size:

  • Engine size: volume in cubic inches swept by pistons inside cylinders (explanatory variable).
  • Horsepower: power of the engine in horsepower units (response variable).
  • The plot shows that horsepower tends to increase as engine size increases, following an overall linear trend, though points don't lie exactly on a single line.

🔍 Observing trends

  • When you examine a scatter plot, you can see whether the response tends to increase, decrease, or stay constant as the explanatory variable changes.
  • The excerpt notes that data points may follow a linear trend (straight line pattern) even if they're not located exactly on one line.
  • Linear regression (discussed in subsequent sections) describes and assesses this linear trend.

📐 Linear equations and lines

🧮 The linear equation form

Linear equation: y = a + b·x, where y and x are variables, and a and b are coefficients.

  • a = intercept
  • b = slope
  • A linear equation produces a straight line when plotted on a graph.
  • For each x value, the equation gives you the corresponding y value; all such (x, y) pairs form a straight line.

📏 What the intercept means

Intercept: the value of y when the line crosses the y-axis.

  • Equivalently, it's the result when you set x=0 in the linear equation.
  • Example from the excerpt:
    • Green line (y = 7 + x): crosses y-axis at y=7
    • Blue line (y = 14 - 2x): crosses y-axis at y=14
    • Red line (y = 8.97): stays constant at y=8.97

📈 What the slope means

Slope: the change in y for each unit change in x.

| Slope value | Line behavior | Example from excerpt |
| --- | --- | --- |
| Positive (b = 1) | Increasing line | Green line: when x changes from 0 to 1, y increases from 7 to 8 (change of +1) |
| Negative (b = −2) | Decreasing line | Blue line: when x changes from 0 to 1, y decreases from 14 to 12 (change of −2) |
| Zero (b = 0) | Constant (horizontal) line | Red line: y stays at 8.97 regardless of x value |

Don't confuse: The slope tells you the rate of change, not the absolute level. A line can have a high intercept but zero slope (constant), or a low intercept but steep positive slope (rapidly increasing).

🎨 Adding lines to plots in R

🔧 High-level vs low-level plotting functions

The excerpt distinguishes two categories:

| Function type | What it does | Examples |
| --- | --- | --- |
| High-level | Produces an entire plot (axes, labels, etc.) | plot, hist, boxplot |
| Low-level | Adds features to an existing plot | abline |

➕ The abline function

  • abline adds a straight line to an existing plot.
  • First argument: intercept
  • Second argument: slope
  • Additional arguments can specify characteristics like color (col="color.name").

Example from the excerpt:

plot(y ~ x)                    # Create scatter plot
abline(7, 1, col="green")      # Add green line: intercept=7, slope=1
abline(14, -2, col="blue")     # Add blue line: intercept=14, slope=-2
abline(mean(y), 0, col="red")  # Add red line: intercept=average of y, slope=0

📊 The regression line concept

  • The regression line is the line that best describes the linear trend between explanatory and response variables.
  • It's not just any line—it's the best description of the linear trend in the data.
  • The red constant line (at the average value of y) partly reflects the data but doesn't capture the trend.
  • The regression line reflects more information by including a description of how y changes with x.

Don't confuse: Any line can be drawn on a scatter plot, but the regression line is specifically computed to be the best fit for the data's linear trend (discussed in section 14.3).


Linear Regression

14.3 Linear Regression

🧭 Overview

🧠 One-sentence thesis

Linear regression fits a line to data that describes the linear trend of a response variable as a function of an explanatory variable, and when treated as a statistical model it allows hypothesis testing and confidence intervals for the slope and intercept parameters.

📌 Key points (3–5)

  • What the regression line is: a line characterized by an intercept and a slope, computed from data, that describes the linear trend between response (y-axis) and explanatory variable (x-axis).
  • How the coefficients are computed: the slope is the ratio of covariance (between response and explanatory variable) to variance (of the explanatory variable); the intercept is calculated so the line passes through the point (mean of x, mean of y).
  • Regression as a statistical model: the intercept and slope become estimates of population parameters; the expectation of the response is a linear function of the explanatory variable.
  • Common confusion—descriptive vs. inferential: fitting a line to describe observed data is descriptive; testing whether the slope is zero (i.e., whether the explanatory variable affects the response) is inferential and requires a sampling distribution.
  • Why it matters: testing the null hypothesis that slope equals zero tells us whether the explanatory variable affects the response; confidence intervals quantify uncertainty about the true slope and intercept.

📐 Fitting the regression line

📐 The lm function and coefficients

The R function lm (Linear Model) fits the regression line to data; input is a formula with response to the left of ~ and explanatory variable to the right; output is the fitted linear regression model.

  • The output displays the intercept and the slope (the coefficient multiplying x).
  • Example: for fish weight (y) vs. length (x), the fitted model gave intercept = 4.616 and slope = 1.427.
  • The regression line can be added to a scatter plot using abline(fit).
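
A minimal sketch of this workflow (the x and y vectors here are hypothetical stand-ins, not the book's fish measurements):

x <- c(1.2, 2.1, 2.8, 3.5, 4.5, 5.0, 5.9, 6.4, 7.3, 8.0)       # hypothetical lengths
y <- c(6.1, 7.9, 8.0, 10.2, 9.5, 12.8, 13.1, 12.9, 16.0, 15.7)  # hypothetical weights
fit <- lm(y ~ x)             # fit the linear model: response ~ explanatory
fit                          # prints the estimated intercept and slope
plot(y ~ x)                  # scatter plot of the data
abline(fit)                  # add the fitted regression line to the plot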

📊 Interpreting the line

  • The line passes through the points, balancing those above and below, capturing the linear trend.
  • Slope interpretation: when x increases by 1 unit, y increases by the slope amount (1.427 in the fish example).
  • Intercept interpretation: the y-value when x = 0; in the fish example, when x = 0, y ≈ 4.616.
  • Example: at x = 1, y ≈ 6; at x = 2, y ≈ 7.5 (an increase of ~1.5, consistent with slope 1.427); at x = 0, y ≈ 4.6 (the intercept).

🧮 Manual computation of coefficients

  • Slope formula: covariance(y, x) / variance(x).
    • Covariance measures joint variability: sum of products of deviations divided by (n − 1).
    • Covariance formula: sum of (y_i − mean(y)) × (x_i − mean(x)) / (n − 1).
  • Intercept formula: mean(y) − slope × mean(x).
    • This ensures the line passes through (mean(x), mean(y)).
  • Example: applying these formulas manually to the fish data yields the same coefficients as lm.
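
A minimal sketch of the manual computation, reusing the hypothetical x and y vectors from the sketch above; the results should match coef(lm(y ~ x)):

b <- cov(y, x) / var(x)       # slope: covariance of y and x divided by variance of x
a <- mean(y) - b * mean(x)    # intercept: forces the line through (mean(x), mean(y))
c(intercept = a, slope = b)   # compare with the coefficients reported by lm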

🚗 Another example: engine size and horsepower

  • Fitted model: lm(horsepower ~ engine.size, data=cars).
  • Coefficients: intercept = 6.6414, slope = 0.7695.
  • Interpretation: as engine size increases, horsepower increases; the regression line describes this general linear trend.

🔬 Inference and hypothesis testing

🔬 The statistical model for regression

In the linear regression model, the expectation of the response Y_i is a linear function of the explanatory variable: E(Y_i) = a + b × x_i, where a (intercept) and b (slope) are population parameters common to all observations.

  • The regression line represents the average trend of the response in the population.
  • The observed data is one realization of the sampling distribution; inference aims to make statements about the population parameters based on the sample.

🧪 Testing the null hypothesis that slope = 0

  • Null hypothesis H₀: b = 0 (the explanatory variable does not affect the distribution of the response; the expected value of the response is constant).
  • Alternative H₁: b ≠ 0 (the explanatory variable does affect the response).
  • The function summary(fit) produces a table with:
    • Estimate: the fitted slope (e.g., 0.76949 for engine size vs. horsepower).
    • Standard error: estimated standard deviation of the slope estimator (0.03919).
    • Test statistic: (estimate − 0) / standard error = 0.76949 / 0.03919 ≈ 19.63.
    • p-value: probability of observing such a statistic if H₀ is true; here, extremely small (< 2e-16), so reject H₀.
  • Sampling distribution: under H₀, the test statistic is asymptotically standard Normal; if the response is Normal, it follows a t-distribution with n − 2 degrees of freedom.
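
A sketch showing how the reported numbers relate to one another, assuming the cars data frame from the book's example is already loaded (the column names follow the excerpt):

fit <- lm(horsepower ~ engine.size, data = cars)          # engine size vs. horsepower example
summary(fit)                                              # coefficients table with tests
est <- coef(summary(fit))["engine.size", "Estimate"]      # fitted slope
se  <- coef(summary(fit))["engine.size", "Std. Error"]    # its standard error
est / se                                                  # reproduces the reported test statistic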

🔍 Inference for the intercept

  • The summary table also tests H₀: a = 0.
  • Example: for engine size vs. horsepower, intercept estimate = 6.64138, standard error = 5.23318, test statistic = 1.269, p-value = 0.206 (do not reject H₀).
  • Caution: the intercept is the expected response when x = 0; if 0 is not in the range of observed x, inference on the intercept requires extrapolation and is sensitive to model misspecification.
    • Example: engine size = 0 has no physical meaning, so treat the intercept inference cautiously.

📏 Confidence intervals for parameters

  • Crude interval: estimate ± 1.96 × standard error (using Normal approximation).
    • Slope: 0.76949 ± 1.96 × 0.03919 = [0.693, 0.846].
    • Intercept: 6.64138 ± 1.96 × 5.23318 = [−3.616, 16.898].
  • Using confint(fit): computes intervals using the t-distribution; results are similar to crude intervals but slightly more accurate.
    • Slope: [0.692, 0.847]; Intercept: [−3.678, 16.960].
  • Don't confuse: the confidence interval quantifies uncertainty about the true population parameter, not the variability of individual observations.
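
A sketch comparing the two approaches, reusing fit, est, and se from the sketch above:

est + c(-1, 1) * 1.96 * se    # crude interval: estimate ± 1.96 × standard error
confint(fit, level = 0.95)    # t-based intervals for both intercept and slope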

📦 Residuals and R-squared (preview)

📦 What residuals are

Residuals: the differences between the observed values of the response and their estimated expected values according to the regression model.

  • Residuals are the regression equivalent of deviations from the sample average.
  • They measure the variability not accounted for by the regression model.

📦 R-squared as a measure of explained variability

  • Total variability: variability of deviations from the mean.
  • Residual variability: variability of residuals (what the model does not explain).
  • R-squared: 1 − (residual variability / total variability).
    • Interpreted as the fraction of the variability of the response explained by the regression model.
  • Example: in the engine size vs. horsepower model, the summary output shows "Multiple R-squared: 0.6574," meaning ~65.7% of horsepower variability is explained by engine size.

🧩 Degrees of freedom and missing data

  • Degrees of freedom for the t-distribution: n − 2 (number of observations minus 2).
  • Example: for the cars data, 2 observations were deleted due to missing horsepower measurements, leaving n = 203; degrees of freedom = 203 − 2 = 201.

14.4 R-Squared and the Variance of Residuals

🧭 Overview

🧠 One-sentence thesis

R-squared measures the fraction of the response's variability that is explained by the regression model, serving as a key indicator of how well the model fits the data.

📌 Key points (3–5)

  • What residuals are: the difference between each observed response value and its estimated expectation according to the regression model.
  • How R-squared is computed: 1 minus the ratio of (variance of residuals from regression) to (variance of response from its average).
  • What R-squared tells us: values range from 0 to 1; closer to 1 means the regression line explains more variability; closer to 0 means little linear trend.
  • Common confusion: residuals from regression vs deviations from the average—the regression line minimizes residual variability, so its residuals are always smaller than deviations from the average.
  • Why it matters: R-squared allows comparison of different explanatory variables to see which better explains the response.

📏 Understanding residuals

📏 What a residual is

Residual from regression: the difference between the value of the response for an observation and the estimated expectation of the response under the regression model.

  • For an observation (x_i, y_i), the estimated expectation is a + b · x_i, where a and b are the estimated intercept and slope.
  • The residual is y_i minus (a + b · x_i).
  • Example: For a fish observation (4.5, 9.5) with estimated intercept 4.6165 and slope 1.4274, the estimated expectation is 4.6165 + 1.4274 · 4.5 = 11.0398, so the residual is 9.5 − 11.0398 = −1.5398.
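
The same arithmetic as a couple of R lines (coefficients copied from the excerpt):

a <- 4.6165; b <- 1.4274      # estimated intercept and slope
a + b * 4.5                   # estimated expectation at x = 4.5: 11.0398
9.5 - (a + b * 4.5)           # residual for the observation (4.5, 9.5): -1.5398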

📊 Visualizing residuals

  • Residuals are represented by vertical arrows from each data point to the regression line.
  • The point where the arrow hits the regression line corresponds to the estimated expectation for that observation.
  • There are as many residuals as there are observations.
  • The average of residuals from the regression line is always equal to 0.

🔧 Computing residuals in R

  • The function residuals takes a fitted regression model as input and outputs the sequence of residuals.
  • Example: Applying residuals(fit) to the fish data produces 10 residuals, one for each observation.

📐 Variance of residuals and standard deviation

📐 Estimated standard deviation of the response

  • The variability of the response about the regression line is estimated by the sum of squares of the residuals divided by (number of observations minus 2).
  • Dividing by n − 2 produces an unbiased estimator of the variance.
  • Taking the square root gives the estimated standard deviation.
  • Example: For the fish data, sqrt(sum(residuals(fit)^2) / 8) = 2.790787, which matches the "Residual standard error" in the summary report.
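
A sketch of the computation, assuming fit holds the regression fitted to the 10 fish observations:

res <- residuals(fit)          # one residual per observation
n <- length(res)               # 10 observations in the fish example
sqrt(sum(res^2) / (n - 2))     # residual standard error: divide by n − 2, then take the square root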

🎯 Why n − 2 degrees of freedom

  • The division by n − 2 (instead of n − 1) accounts for estimating two parameters (intercept and slope) from the data.
  • This adjustment produces an unbiased estimate of the response's variability about the regression model.

📊 R-squared: measuring explained variability

📊 Two forms of variation

| Type of variation | What it measures | How it's computed |
| --- | --- | --- |
| Variation from average | Deviations of response from its average | Sum of squared red arrows (deviations from mean) ÷ (n − 1) = sample variance |
| Variation from regression | Residuals from the fitted regression line | Sum of squared black arrows (residuals) ÷ (n − 1) |

  • The ratio between these two quantities gives the relative variability that remains after fitting the regression line.

🔢 How R-squared is calculated

  • R-squared = 1 − (variance of residuals from regression) / (variance of response).
  • The regression line minimizes residual variability among all possible straight lines, so the variance of regression residuals is always less than the variance of the response.
  • Therefore, the ratio is less than 1, and R-squared is between 0 and 1.
  • Example: For the fish data, 1 - var(residuals(fit)) / var(y) = 0.3297, matching the "Multiple R-squared" in the report.
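
A sketch of the computation, assuming fit and the response vector y from the fish example:

1 - var(residuals(fit)) / var(y)    # R-squared: 1 − residual variance / response variance
summary(fit)$r.squared              # the same value, extracted from the summary report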

🎯 Interpreting R-squared values

  • R-squared = 1: All data points lie exactly on the regression line; the model explains 100% of variability.
  • R-squared = 0: No linear trend in the data; the regression line explains none of the variability.
  • Between 0 and 1: The fraction of response variability explained by the regression model.
  • The closer points are to the regression line, the larger R-squared becomes.

🔧 Adjusted R-squared

  • Adjusted R-squared uses an unbiased estimate of variability: sum of squared residuals divided by (n − 2) instead of (n − 1).
  • Example: For the fish data, 1 - (sum(residuals(fit)^2) / 8) / var(y) = 0.246, matching "Adjusted R-squared" in the report.
  • The difference between adjusted and unadjusted R-squared becomes negligible for larger sample sizes.
  • Which to use is a matter of personal preference.

🔍 Using R-squared to compare models

🔍 Comparing explanatory variables

  • R-squared provides a convenient measure of goodness of fit, allowing comparison of different explanatory variables.
  • Example: For the cars data, two models were fitted to predict horsepower:
    • Engine size as explanatory variable: R-squared = 0.6574 (about 2/3 of variability explained).
    • Car length as explanatory variable: R-squared = 0.308 (less than 1/3 of variability explained).
  • Conclusion: Engine size explains twice as much variability as car length and is a better explanatory variable.
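
A sketch of the comparison, assuming the cars data frame with horsepower, engine.size, and length columns (column names follow the excerpt's description):

fit.size   <- lm(horsepower ~ engine.size, data = cars)
fit.length <- lm(horsepower ~ length, data = cars)
summary(fit.size)$r.squared       # about 0.66: roughly 2/3 of the variability explained
summary(fit.length)$r.squared     # about 0.31: less than 1/3 of the variability explained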

⚠️ Don't confuse statistical significance with explanatory power

  • Both engine size and car length may be statistically significant (reject null hypothesis of zero slope with very small p-values).
  • However, statistical significance does not tell you which variable better explains the response.
  • R-squared quantifies explanatory power: a higher R-squared means the variable accounts for more of the response's variability.

📋 Summary report components

📋 What the summary function reports

The summary function applied to a fitted model produces several components:

  1. Formula: Identifies the response and explanatory variable.
  2. Residuals summary: Min, 1Q, Median, 3Q, Max of residuals (average not reported because it's always 0).
  3. Coefficients table: Estimates, standard errors, t-values, and p-values for intercept and slope.
  4. Residual standard error: Estimated standard deviation of response from regression, with degrees of freedom (n − 2).
  5. Multiple R-squared and Adjusted R-squared: Fraction of variability explained by the model.
  6. F-statistic: Overall goodness of fit test (for simple linear regression, this is the square of the t-value for the slope test, with identical p-value).

14.5 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply linear regression concepts by asking students to match equations to lines, compute regression coefficients, analyze real datasets, and compare model fits using R-squared values.

📌 Key points (3–5)

  • Matching visual representations: identify which linear equation corresponds to a plotted line and which observation corresponds to a marked point.
  • Computing regression components: calculate intercepts, slopes, expected values, and residuals from given regression models.
  • Real-world data analysis: fit regression models to datasets (AIDS cases, union membership, car fuel consumption) and assess statistical significance.
  • Model comparison: use R-squared to determine which explanatory variable better explains response variability.
  • Common confusion: don't confuse the regression line equation with individual data points—the line represents expected values, not actual observations.

📝 Exercise types and skills

📊 Exercise 14.1: Visual interpretation

  • Task: Match a red line to one of three equations (y = 4, y = 5 - 2x, or y = x) and identify which observation a red triangle represents.
  • Data source: Table with 10 observations showing x and y coordinates.
  • Skills tested: reading scatter plots, recognizing linear equation forms, connecting visual and algebraic representations.

🧮 Exercise 14.2: Basic regression calculations

  • Given model: E(Y_i) = 2.13 · x_i - 3.60
  • Tasks:
    1. Identify the intercept (-3.60) and slope (2.13) values.
    2. Calculate the expected response value for the 3rd observation when x₃ = 4.2.
  • Skills tested: understanding regression equation structure, computing predicted values.

📈 Exercise 14.3: AIDS data analysis

  • Dataset: "aids.csv" containing year, diagnosed cases, and deaths from 1981-2002 in the United States.
  • Tasks:
    1. Regress deaths on diagnosed cases: find slope, confidence interval, test significance.
    2. Create scatter plot with regression line and assess fit quality.
    3. Regress diagnosed cases on year: find slope, confidence interval, test significance.
    4. Create scatter plot for year vs. diagnosed and assess fit quality.
  • Skills tested: fitting models in R, interpreting statistical significance, visual assessment of model appropriateness.

🔢 Computational exercises

🧪 Exercise 14.4: Union membership analysis

  • Dataset: U.S. labor force union membership percentages from 1945-1993.
  • Tasks:
    1. Create scatter plot with regression line and assess model reasonableness.
    2. Compute sample averages, standard deviations, and covariance for both variables.
    3. Use these summary statistics to recompute regression coefficients manually.
  • Skills tested: manual calculation of regression parameters, understanding the mathematical foundation behind regression formulas.

🔍 Exercise 14.5: Residual calculations

  • Given: Fitted model with a = 2.5 (intercept) and b = -1.13 (slope); four observations provided.
  • Tasks:
    1. Calculate the estimated expected response for the 4th observation (x₄ = 6.7).
    2. Compute the residual (difference between actual y₄ = 0.17 and predicted value).
  • Skills tested: applying regression equations, understanding residuals as deviations from the fitted line.

🚗 Exercise 14.6: Car fuel consumption comparison

  • Context: Revisits Chapter 13 example about highway vs. city fuel consumption differences.
  • Tasks:
    1. Fit regression using "curb.weight" as explanatory variable; test slope significance and find fraction of standard deviation explained.
    2. Fit regression using "engine.size" as explanatory variable; test slope significance and find fraction of standard deviation explained.
    3. Compare which model fits better.
  • Skills tested: comparing multiple regression models, using R-squared to assess explanatory power, understanding that different variables explain different amounts of variability.

🎯 Key learning objectives

🎯 Understanding model fit quality

  • R-squared measures the fraction of response variability explained by the explanatory variable.
  • Higher R-squared indicates better fit.
  • Example from context: engine size R-squared = 0.6574 (about 2/3) vs. car length R-squared = 0.308 (less than 1/3), so engine size is the better explanatory variable.

🎯 Statistical significance vs. practical fit

  • A slope can be statistically significant (p-value < 0.05, rejecting null hypothesis that slope = 0) but still explain only a small fraction of variability.
  • Visual assessment through scatter plots helps determine if a linear model is appropriate for the data pattern.
  • Don't confuse: significance tells you if there's a relationship; R-squared tells you how strong it is.

14.6 Summary

🧭 Overview

🧠 One-sentence thesis

Linear regression models the relationship between a numeric explanatory variable and a numeric response by fitting a straight line that minimizes residuals, with R-squared measuring how much of the response's variability the model explains.

📌 Key points (3–5)

  • What regression does: describes how one variable (explanatory) affects the distribution of another (response) using a linear trend.
  • Key components: intercept (where the line crosses the y-axis), slope (change in response per unit change in explanatory variable), and residuals (differences between actual and predicted values).
  • How to measure fit: R-squared represents the fraction of response variability explained by the regression line; values range from 0 to 1.
  • Common confusion: covariance measures joint variability of two variables, but the slope is covariance divided by variance of the explanatory variable—they are related but not the same.
  • Model choice debate: complex models fit data more closely, but simpler models are easier to interpret and may provide more insight.

📊 Core regression concepts

📊 What regression relates

Regression: Relates different variables that are measured on the same sample. Regression models are used to describe the effect of one of the variables on the distribution of the other. The former is called the explanatory variable and the latter is called the response.

Linear Regression: The effect of a numeric explanatory variable on the distribution of a numeric response is described in terms of a linear trend.

  • Both variables must be numeric for linear regression.
  • The explanatory variable is the one you think might influence the response.
  • The relationship is captured by a straight line.

📈 Scatter plot

Scatter Plot: A plot that presents the data in a pair of numeric variables. The axes represent the variables and each point represents an observation.

  • Visual tool to see the relationship between two numeric variables.
  • Each observation becomes one point on the graph.
  • Helps identify whether a linear trend exists.

🔢 Line components and formulas

🔢 Intercept and slope

Intercept: A coefficient of a linear equation. Equals the value of y when the line crosses the y-axis.

Slope: A coefficient of a linear equation. The change in the value of y for each unit change in the value of x. A positive slope corresponds to an increasing line and a negative slope corresponds to a decreasing line.

  • Linear equation form: y = a + b · x, where a is intercept and b is slope.
  • Intercept (a): the predicted response when the explanatory variable equals zero.
  • Slope (b): how much the response changes for each one-unit increase in the explanatory variable.
  • Example: if slope is 2, then increasing x by 1 unit increases predicted y by 2 units.

📐 How slope is calculated

  • Covariance: measures joint variability of two numeric variables; equals the sum of the product of deviations from the mean, divided by (number of observations minus 1).
    • Formula in words: sum of [(each y minus mean y) times (each x minus mean x)] divided by (n minus 1).
  • Regression slope: b = Covariance(x, y) / Variance(x).
  • Regression intercept: a = mean of y minus (b times mean of x).
  • Don't confuse: covariance is not the slope itself; slope adjusts covariance by the variance of x.

🎯 Measuring model fit

🎯 Residuals

Residuals from Regression: The differences between the observed values of the response and the estimated expectations of the response under the regression model (the predicted responses).

  • Formula: residual = actual y minus predicted y, or y_i − (a + b·x_i).
  • Each observation has its own residual.
  • Positive residual: actual value is above the line; negative residual: actual value is below the line.
  • The regression line is chosen to minimize the sum of squared residuals.

🎯 Residual variance

  • Estimate of residual variance: sum of squared residuals divided by (n minus 2).
  • Formula in words: sum of [y_i minus (a + b·x_i)] squared, divided by (n − 2).
  • Measures how much variability remains after fitting the line.

🎯 R-squared

R-Square: the difference between 1 and the ratio of the variance of the residuals from the regression to the variance of the response. Its value is between 0 and 1, and it represents the fraction of the variability of the response that is explained by the regression line.

  • Formula in words: 1 minus [sum of squared residuals divided by sum of squared deviations of y from its mean].
  • Interpretation:
    • R² = 0 means the line explains none of the variability (no better than just using the mean).
    • R² = 1 means the line explains all the variability (perfect fit).
    • R² = 0.7 means 70% of the response's variability is explained by the regression.
  • Example: if R² = 0.6, the regression model accounts for 60% of why the response values differ from each other.

🧮 The regression model

🧮 Population vs sample

  • The regression model: E(Y_i) = a + b · x_i, where a and b are population parameters.
  • In practice, a and b are estimated from the data.
  • All formulas for residuals, residual variance, and R-squared use the estimated values of a and b.

🧮 Formula summary table

| Component | Formula in words |
| --- | --- |
| Linear equation | y = a + b · x |
| Covariance | Sum of [(y_i − mean y)(x_i − mean x)] / (n − 1) |
| Slope | Covariance(x, y) / Variance(x) |
| Intercept | Mean of y − (slope × mean of x) |
| Residual | y_i − (a + b·x_i) |
| Residual variance | Sum of squared residuals / (n − 2) |
| R-squared | 1 − [sum of squared residuals / sum of squared deviations of y] |

💬 Model complexity debate

💬 Simple vs complex models

  • The excerpt presents two viewpoints:
    • Complex models: try to fit the data as closely as possible.
    • Simple models: may be more remote from the data but are easier to interpret and provide more insight.
  • The excerpt does not resolve the debate but poses it as a discussion question.
  • Context matters: when reporting findings to others, clarity and interpretability may be more valuable than perfect fit.
  • Don't confuse: "better fit" (closer to data) is not always the same as "better model" (more useful for understanding and communication).

15.1 Student Learning Objectives

🧭 Overview

🧠 One-sentence thesis

This chapter extends statistical inference to cases where the response is a Bernoulli (two-level) variable, using different methods depending on whether the explanatory variable is a factor or numeric.

📌 Key points (3–5)

  • What changes: Chapters 13 and 14 dealt with numeric responses; this chapter shifts to Bernoulli (TRUE/FALSE or two-level factor) responses.
  • Two scenarios: when the explanatory variable is a factor with two levels, use prop.test; when it is numeric, use glm (Generalized Linear Model).
  • Common confusion: prop.test was used in Chapter 12 for a single sample; here it compares two sub-samples, just as t.test was used for both single-sample and two-sample numeric responses.
  • New model: logistic regression relates the probability of an event to a numeric explanatory variable.
  • Skills to gain: produce mosaic plots, compare event probabilities between sub-populations, define and fit logistic regression models, and perform inference on fitted models.

🔄 Transition from previous chapters

📊 What Chapters 13 and 14 covered

  • Both chapters introduced statistical inference involving a response and an explanatory variable.
  • The response was numeric in both cases.
  • Chapter 13: explanatory variable was a factor with two levels, splitting the sample into two sub-samples.
  • Chapter 14: explanatory variable was numeric, producing a linear trend together with the response.

🎯 What this chapter adds

  • The response is now a Bernoulli variable (two levels: TRUE or FALSE).
  • A Bernoulli variable can be:
    • An indicator of whether an event occurs.
    • A factor with two levels.
  • The explanatory variable can be either:
    • A factor with two levels, or
    • A numeric variable.

🛠️ Methods and tools

🧪 When the explanatory variable is a factor with two levels

Use the function prop.test to compare the probability of an event between two sub-populations.

  • Parallel to earlier use: prop.test was introduced in Chapter 12 for analyzing the probability of an event in a single sample.
  • New application: here it compares two sub-samples.
  • Analogy: similar to how t.test was used for numeric responses in both single-sample and two-sample comparisons.
  • Example: An organization wants to compare the success rate (event occurrence) between two settings; prop.test tests whether the probabilities differ.

📈 When the explanatory variable is numeric

Use the function glm (Generalized Linear Model) to fit a logistic regression model.

  • What logistic regression does: relates the probability of an event in the response to a numeric explanatory variable.
  • Why a new model: linear regression (Chapter 14) was for numeric responses; logistic regression handles Bernoulli responses.
  • Don't confuse: logistic regression models probability (bounded between 0 and 1), not a continuous numeric outcome.

🎓 Learning objectives

📊 Visualization

  • Mosaic plots: produce plots showing the relationship between the Bernoulli response and the explanatory variable.
  • These plots help visualize the distribution of the two-level response across different values or levels of the explanatory variable.

🔍 Inference for two sub-populations

  • Apply prop.test: compare the probability of an event between two sub-populations.
  • This involves:
    • Estimating sample proportions in each sub-sample.
    • Testing whether the probabilities differ.
    • Constructing confidence intervals for the difference.

🧮 Logistic regression

  • Define the model: understand how logistic regression links a numeric explanatory variable to the probability of an event.
  • Fit the model: use glm to estimate model parameters from data.
  • Inference: produce statistical tests and confidence intervals on the fitted logistic regression model.

📝 Context: Section 15.2 preview

🔄 Parallel to Chapter 13

  • Chapter 13 setup: compared the expectation of a numeric response between two sub-populations (denoted a and b).
    • Used sample means (X̄ₐ and X̄ᵦ) and sample variances (S²ₐ and S²ᵦ).
    • Applied t.test for inference.
  • This chapter's setup: compares the probability of an event between two sub-populations a and b.
    • Denote probabilities as pₐ and pᵦ.
    • Use sample proportions (P̂ₐ and P̂ᵦ) as estimators.
    • Apply prop.test for inference.

🎯 Inference goal

  • Test whether the two probabilities (pₐ and pᵦ) are equal.
  • Construct confidence intervals for the difference or ratio of probabilities.
  • Example: An organization runs two different programs and wants to know if the success probability differs between them; sample proportions from each program are used to estimate and compare the true probabilities.

15.2 Comparing Sample Proportions

🧭 Overview

🧠 One-sentence thesis

When comparing two sub-populations with a Bernoulli (TRUE/FALSE) response, we use sample proportions to test whether the probabilities of the event differ between the two groups and to construct confidence intervals for the difference.

📌 Key points (3–5)

  • What we compare: the probability of an event (or one level of a two-level response) between two sub-populations or settings, denoted p_a and p_b.
  • How we estimate: use sub-sample proportions (P-hat_a and P-hat_b) as natural estimators of the true probabilities.
  • Key inference tools: confidence interval for the difference (p_a − p_b) and a test of the null hypothesis that the two probabilities are equal.
  • Common confusion: this parallels comparing means of numerical responses (Section 13.3), but uses proportions instead of averages, estimates standard deviation differently, and applies a continuity correction.
  • Visualization: a mosaic plot shows the relative frequency of response levels within each explanatory-variable level, making it easy to see differences in proportions.

🔍 The inference problem

🔍 Bernoulli response structure

A Bernoulli response has two levels, "TRUE" or "FALSE" (often coded as 1 or 0, "success" or "failure", or any other pair of levels).

  • The response may be an indicator of an event or correspond to one level of a two-level factor.
  • Example: the variable "num.of.doors" (two levels: "two" and "four") is treated as a Bernoulli response.

🎯 Goal of the analysis

  • We examine the same event (or response level) in two different settings or sub-populations, labeled a and b.
  • Denote the probabilities of the event in each sub-population by p_a and p_b.
  • Our aim is to compare these two probabilities statistically.

📊 Natural estimators

  • The sub-sample proportions P-hat_a and P-hat_b (the observed proportions of the event in each sub-sample) serve as estimators of p_a and p_b.
  • These estimators are used to carry out inference: constructing confidence intervals and testing hypotheses.

🔧 Methods and differences from comparing means

🔧 Parallel to comparing expectations

  • This section parallels Section 13.3, which compared the expectation of a numerical response between two sub-populations.
  • In that section, we used sub-sample averages (X-bar_a and X-bar_b) and variances (S²_a and S²_b) to compare expectations E(X_a) and E(X_b).
  • Here, we replace averages with proportions and adjust the estimation of standard deviation.

🔧 Key differences

| Aspect | Comparing means (13.3) | Comparing proportions (15.2) |
| --- | --- | --- |
| Statistic | Sub-sample averages | Sub-sample proportions |
| Standard deviation | Estimated from sample variances | Estimated differently (details not given) |
| Additional adjustment | None mentioned | Continuity correction applied |

  • Don't confuse: the principles are similar, but the derivations and tools are not identical.
  • The excerpt does not discuss theoretical details; it focuses on demonstrating the application.

📋 Working with data: frequency tables and mosaic plots

📋 Creating a frequency table

  • The function "table" applied to two factors produces a 2×2 frequency table of joint frequencies.
  • Each entry shows the count of observations with the corresponding combination of levels.
  • Example: in the cars dataset, the table of "fuel.type" (diesel/gas) versus "num.of.doors" (two/four) shows:
    • 16 diesel cars with four doors
    • 3 diesel cars with two doors
    • 98 gas cars with four doors
    • 86 gas cars with two doors
  • Total entries: 16 + 3 + 98 + 86 = 203 (the dataset size minus two missing values).

🎨 Mosaic plot visualization

  • A mosaic plot graphically represents the relation between two factors.
  • The plot is produced when the input to "plot" is a formula with both response and explanatory variables as factors.
  • Structure:
    • The x-axis shows levels of the explanatory variable (e.g., "diesel" and "gas").
    • Each level gets a vertical rectangle; the area of each rectangle represents the relative frequency of that level.
    • Each vertical rectangle is divided into horizontal sub-rectangles for the response levels (e.g., "four" and "two" doors).
    • The relative area of horizontal rectangles within each vertical rectangle shows the relative frequency of the response levels within that sub-population.
  • Example interpretation: the plot shows diesel cars are less frequent overall, but the relative frequency of four-door cars is larger among diesel cars than among gas cars.
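
A sketch of both steps, assuming the cars data frame with the factors fuel.type and num.of.doors as in the excerpt:

table(cars$fuel.type, cars$num.of.doors)       # 2×2 joint frequency table (explanatory first)
plot(num.of.doors ~ fuel.type, data = cars)    # mosaic plot: response ~ explanatory, both factors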

🧪 Hypothesis testing with prop.test

🧪 The null hypothesis

  • The null hypothesis is H₀: p_a = p_b (the probabilities of the event are equal in both sub-populations).
  • This is tested against the two-sided alternative H₁: p_a ≠ p_b (the probabilities differ).

🧪 Using the prop.test function

  • The output of "table" serves as input to "prop.test".
  • Important: the Bernoulli response variable should be the second variable in the table; the explanatory factor should be the first.
  • The function tests equality of the probability of the first column between the rows of the table.
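
A sketch, reusing the frequency table from the sketch above (explanatory factor as rows, Bernoulli response as columns):

prop.test(table(cars$fuel.type, cars$num.of.doors))   # tests H0: p_a = p_b, with continuity correction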

🧪 Interpreting the output

  • Sample proportions: reported at the bottom as "prop 1" and "prop 2".
    • Example: P-hat_a = 16/19 ≈ 0.8421 (diesel cars with four doors), P-hat_b = 98/184 ≈ 0.5326 (gas cars with four doors).
  • Confidence interval: reported under "95 percent confidence interval" for the difference p_a − p_b.
    • Example: [0.1013542, 0.5176389].
  • Test statistic: labeled "X-squared" (e.g., 5.5021).
    • It measures the deviation between the estimated difference (sub-sample proportions) and the null hypothesis value (p_a − p_b = 0), divided by the estimated standard deviation, then squared.
    • A continuity correction is applied to make the null distribution closer to the limiting chi-square distribution with one degree of freedom.
  • p-value: computed based on the chi-square distribution.
    • Example: p-value = 0.01899, which is less than 0.05.
    • Conclusion: reject the null hypothesis at the 5% significance level; the sub-population probabilities are different.

🧪 Don't confuse

  • The test statistic here is based on proportions and uses a continuity correction, unlike the t-test for means.
  • The p-value is computed from a chi-square distribution, not a t-distribution.

15.3 Logistic Regression

🧭 Overview

🧠 One-sentence thesis

Logistic regression investigates the relationship between the probability of an event (a Bernoulli response) and a numeric explanatory variable by fitting a model that treats the probability as a function of that variable, enabling inference about whether the two are related.

📌 Key points (3–5)

  • What logistic regression does: models the probability of an event as a function of a numeric explanatory variable, rather than fitting a straight line to a numeric response.
  • When to use it: when the response is a Bernoulli indicator (an event occurs or not) and the explanatory variable is numeric, not when both are numeric as in linear regression.
  • The model structure: the probability p_i for observation i is related to the explanatory variable x_i through a specific formula involving coefficients a and b, which are estimated from data.
  • Common confusion: logistic regression is not linear regression—it models probability of an event, not the expectation of a numeric response; the response is a factor level indicator, not a continuous measurement.
  • Testing the relationship: the null hypothesis H₀: b = 0 tests whether the explanatory variable and the event probability are unrelated; rejecting it means a relationship exists.

🔄 From linear regression to logistic regression

🔄 Parallel structure with linear regression

  • Linear regression (Chapter 14) fits a straight line to a scatter plot of numeric data points; the line represents the expectation of the response as a function of the explanatory variable.
  • Logistic regression replaces the numeric response with the probability of an event associated with a Bernoulli (indicator) response.
  • Both methods estimate coefficients from data and use them for inference about the relationship between variables.

🔀 Key difference in response type

  • Linear regression: response is a regular numeric variable.
  • Logistic regression: response is the indicator of a level of a factor—a TRUE/FALSE or 1/0 outcome marking whether an event occurred.
  • Example from the excerpt: the response is whether a car has four doors (the event) versus two doors, not a continuous measurement.
  • Don't confuse: the response in logistic regression is not numeric; it is a categorical indicator transformed into a probability.

📊 Understanding the data relationship

📊 Mosaic plot visualization

  • The excerpt uses a mosaic plot to show the relationship between the response (number of doors) and the explanatory variable (car length).
  • The x-axis represents the explanatory variable (length); the total area is divided into vertical rectangles corresponding to intervals of length values.
  • Width of each vertical rectangle: represents the relative frequency of that length interval (analogous to a histogram turned on its side).
  • Horizontal sub-rectangles within each vertical rectangle: show the relative frequency of the response levels (four doors vs. two doors) within that length interval.
  • Example: darker rectangles represent cars with four doors; brighter rectangles represent cars with two doors; the relative area shows the proportion of each type within each length interval.

🔍 Identifying patterns

  • From the mosaic plot, one can observe that the relative frequency of cars with four doors increases overall with increasing car length.
  • This visual pattern suggests a relationship between the explanatory variable (length) and the probability of the event (having four doors).
  • Logistic regression formalizes this investigation by estimating parameters and testing whether the relationship is statistically significant.

🧮 The logistic regression model

🧮 Model formula

The statistical model in logistic regression relates the probability p_i (the probability of the event for observation i) to x_i (the value of the explanatory variable for that observation) by the formula:
p_i = e^(a + b·x_i) / (1 + e^(a + b·x_i))
where a and b are coefficients common to all observations.

  • Alternative form: log(p_i / [1 - p_i]) = a + b·x_i
    • This states that a function of the probability (the log-odds) has a linear relationship with the explanatory variable.
  • Coefficients a and b: estimated from the data; a is the intercept, b is the slope.
  • The model assumes that the probability of the event changes systematically with the explanatory variable according to this specific functional form.

🔧 Fitting the model in practice

  • The excerpt uses the R function glm with the argument family=binomial to fit logistic regression.
  • Response specification: may be a sequence of logical TRUE/FALSE values (TRUE when the event occurs, FALSE otherwise) or 1/0 values (1 for event, 0 for no event).
    • Example: the response num.of.doors == "four" produces TRUE when a car has four doors and FALSE when it has two doors.
  • Explanatory variable: a numeric variable (e.g., car length).
  • The fitted model object stores the estimated coefficients and other information.
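
A sketch of the fit, assuming the cars data frame with the num.of.doors factor and the numeric length column as in the excerpt:

fit.doors <- glm((num.of.doors == "four") ~ length, family = binomial, data = cars)
summary(fit.doors)      # coefficient estimates, standard errors, z-values, and p-values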

🧪 Inference and hypothesis testing

🧪 Testing for a relationship

  • The null hypothesis is H₀: b = 0, which claims that the slope coefficient is zero.
  • Interpretation: if b = 0, the probability of the event is unrelated to the explanatory variable; the explanatory variable has no effect.
  • The alternative hypothesis is that b ≠ 0, meaning a relationship exists.
  • The excerpt reports a p-value of 2.37 × 10⁻⁷ for the slope coefficient, which is much smaller than 0.05.
  • Conclusion: the null hypothesis is clearly rejected; there is strong evidence of a relationship between car length and the probability of having four doors.

📏 Coefficient estimates and confidence intervals

  • Estimated intercept (a): -13.14767
  • Estimated slope (b): 0.07726
  • These estimates determine the fitted logistic curve relating probability to the explanatory variable.
  • Confidence intervals for the coefficients can be obtained using the confint function applied to the fitted model.
    • Example from the excerpt: the 95% confidence interval for the slope is approximately [0.04938, 0.10824].
  • Both the intercept and slope are tested for equality to zero; the report includes standard errors, z-values, and p-values for each coefficient.
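
To see what the estimates imply, they can be plugged into the model formula; a sketch using the values quoted above and a hypothetical car length of 180:

a <- -13.14767; b <- 0.07726                # estimates quoted in the excerpt
exp(a + b * 180) / (1 + exp(a + b * 180))   # probability of four doors at length 180: about 0.68
plogis(a + b * 180)                         # the same value via the logistic distribution function
confint(fit.doors)                          # confidence intervals for the intercept and the slope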

📋 Similarities and differences with linear regression reports

| Aspect | Linear regression | Logistic regression |
| --- | --- | --- |
| Coefficients reported | Intercept and slope | Intercept and slope |
| Hypothesis tested | Coefficients equal to zero | Coefficients equal to zero |
| Response type | Numeric | Bernoulli indicator (event/no event) |
| Model form | Straight line for expectation | Probability as a function via log-odds |
| Interpretation of b = 0 | No linear trend | No relationship between probability and explanatory variable |

  • Don't confuse: although both methods estimate coefficients and test hypotheses, the underlying models and interpretations differ because the response types are fundamentally different.

15.4 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply logistic regression to real-world health and medical datasets, testing whether explanatory variables (diet type, steroid levels) are statistically related to binary or categorical outcomes (health status, syndrome type).

📌 Key points (3–5)

  • Exercise 15.1 (diet study): tests whether Mediterranean diet vs. AHA diet affects the probability of staying healthy after a heart attack, using a two-sided test and confidence intervals for the difference in probabilities.
  • Exercise 15.2 (Cushing's syndrome): examines whether urinary steroid excretion rates relate to syndrome type, using mosaic plots and logistic regression with and without unknown-type observations.
  • Key task pattern: produce frequency tables or plots, test the null hypothesis of no relation (b = 0), and compute confidence intervals for coefficients.
  • Common confusion: statistical association vs. causality—the forum question asks whether finding a significant relation between explanatory variable and response implies causation.
  • Tool reminder: logistic regression relates a numeric explanatory variable to a binary/indicator response; roles can be reversed using different tools.

🍽️ Exercise 15.1: Mediterranean diet and heart attack survival

🍽️ Study design and data

  • Context: 605 heart attack survivors randomly assigned to either (1) AHA "prudent diet step 1" or (2) Mediterranean diet (more bread, cereals, fruit, vegetables, grains, fish; less meat and delicatessen).
  • Monitoring: four-year follow-up tracking deaths, cancer, non-fatal illness, or healthy status.
  • Data file: diet.csv contains two factors:
    • health: condition (healthy, non-fatal illness, cancer, or dead)
    • type: diet type (Mediterranean or AHA)

📊 Question 1: Frequency table

  • Produce a frequency table of health by type.
  • Read off the number of healthy subjects in each diet group (Mediterranean vs. AHA).

🧪 Question 2: Hypothesis test

  • Null hypothesis: the probability of staying healthy is the same for both diets.
  • Alternative: two-sided test at 5% significance level.
  • Goal: determine whether diet type is statistically related to the probability of remaining healthy.

📏 Question 3: Confidence interval

  • Compute a 95% confidence interval for the difference between the two probabilities of keeping healthy (Mediterranean minus AHA).
  • This interval quantifies the magnitude and uncertainty of the diet effect.
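
One possible sketch of the workflow for this exercise (reading the file with read.csv and the level label "healthy" are assumptions based on the excerpt's description; treat it as an illustration, not the official solution):

diet <- read.csv("diet.csv")                        # health and type factors
table(diet$health, diet$type)                       # Question 1: frequency table of health by type
tab <- table(diet$type, diet$health == "healthy")   # diet type as rows, healthy indicator as columns
prop.test(tab[, c("TRUE", "FALSE")])                # Questions 2-3: test and 95% CI, healthy counts first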

🏥 Exercise 15.2: Cushing's syndrome and steroid excretion

🏥 Background and data

  • Cushing's syndrome: disorder from a pituitary adenoma causing high cortisol levels; first described by Harvey Cushing in 1932.
  • Data file: coshings.csv with 27 patients and three variables:
    • type: underlying syndrome type (a = adenoma, b = bilateral hyperplasia, c = carcinoma, u = unknown)
    • tetra: urinary excretion rate (mg/24hr) of Tetrahydrocortisone (a steroid)
    • pregn: urinary excretion rate (mg/24hr) of Pregnanetriol (another steroid)

📊 Question 1: Histogram and mosaic plot

  • Plot the histogram of tetra to see its distribution.
  • Plot a mosaic plot with type as response and tetra as explanatory variable.
  • Mosaic plot interpretation: vertical rectangles represent the distribution of the explanatory variable; horizontal rectangles within them represent the response distribution.
  • Specific task: interpret the second vertical rectangle from the right (third from the left) in the mosaic plot—what information does it convey about the relationship?

🧪 Question 2: Test and confidence interval (all data)

  • Null hypothesis: no relation between tetra (explanatory) and the indicator "type equals b" (response).
  • Fit a logistic regression model and test whether the slope coefficient b is zero.
  • Compute a confidence interval for the parameter describing the relation.

🧪 Question 3: Repeat analysis excluding unknowns

  • Fit the model using only observations where type is known (exclude type = u).
  • Hint: use the argument subset=(type!="u") in the fitting function.
  • Reflection: which analysis (with or without unknowns) is more appropriate? Consider whether including unknown types adds noise or useful information.
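
One possible sketch for Questions 2 and 3 (the data frame name and the use of read.csv are assumptions; the file and column names follow the excerpt):

cushings <- read.csv("coshings.csv")                    # type, tetra, pregn
fit.all <- glm((type == "b") ~ tetra, family = binomial, data = cushings)
summary(fit.all)                                        # Question 2: test of H0: slope = 0
confint(fit.all)                                        # confidence intervals for the coefficients
fit.known <- glm((type == "b") ~ tetra, family = binomial,
                 data = cushings, subset = (type != "u"))   # Question 3: exclude unknown types
summary(fit.known)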

🧩 Key concepts and tools

🧩 Mosaic plot

Mosaic Plot: A plot that describes the relation between a response factor and an explanatory variable. Vertical rectangles represent the distribution of the explanatory variable. Horizontal rectangles within the vertical ones represent the distribution of the response.

  • Useful for visualizing how a categorical response varies across levels of a numeric or categorical explanatory variable.

🧩 Logistic regression

Logistic Regression: A type of regression that relates an explanatory variable to a response of the form of an indicator of an event.

  • Probability formula: p_i = e^(a + b·x_i) / (1 + e^(a + b·x_i))
    • p_i is the probability of the event for subject i
    • a is the intercept, b is the slope, x_i is the explanatory variable value
  • Predictor (log-odds) formula: log(p_i / (1 - p_i)) = a + b·x_i
    • The log of the odds (p / (1 - p)) is a linear function of x
  • Testing: when b = 0, the probability and explanatory variable are unrelated; reject the null hypothesis H₀: b = 0 if the p-value is small.
  • Confidence intervals: use confint() on the fitted model to get intervals for a and b.

🧩 Fitting in R

  • Use glm() function with the model formula (response ~ explanatory) and data= argument.
  • Response can be logical (TRUE/FALSE) or numeric (1/0), where 1 or TRUE indicates the event occurred.
  • Example from the excerpt: num.of.doors=="four" produces TRUE when a car has four doors and FALSE when it has two doors.
  • Apply summary() to the fitted model to see coefficient estimates, standard errors, and p-values.

💬 Forum discussion: Association vs. causality

💬 The causality question

  • Terminology issue: terms like "explanatory variable" and "response" suggest causality, but some argue statistics only examines joint distributions, not causation.
  • Question posed: Can statistical reasoning determine causality, or does it only reveal association?
  • Suggested approach: consider a specific situation where causality must be determined; can tools from the book (hypothesis tests, confidence intervals, regression) aid in that determination?

💬 Reversing roles

  • The excerpt notes that an analysis with one variable as response and the other as explanatory can be reversed (swapping roles), possibly using a different statistical tool.
  • Key observation: a significant statistical finding usually remains significant when roles are reversed.
  • Don't confuse: statistical significance of a relation does not automatically imply that one variable causes the other; confounding factors and study design matter.

📐 Formulas summary

| Formula | Expression | Meaning |
| --- | --- | --- |
| Probability | p_i = e^(a + b·x_i) / (1 + e^(a + b·x_i)) | Probability of event for subject i as a function of x_i |
| Predictor (log-odds) | log(p_i / (1 − p_i)) = a + b·x_i | Log of the odds is a linear function of the explanatory variable |

  • a: intercept coefficient
  • b: slope coefficient (if b = 0, no relation between x and probability)
  • x_i: value of the explanatory variable for subject i

16.1 Student Learning Objective

🧭 Overview

🧠 One-sentence thesis

This chapter aims to consolidate understanding of statistical inference methods—estimation and hypothesis testing—and apply them to real case studies, preparing students to use these tools for analyzing actual data.

📌 Key points (3–5)

  • Two main forms of inference: estimation (determining parameter values) and hypothesis testing (choosing between competing claims about populations).
  • Estimation tools: point estimates give single values; confidence intervals give ranges, evaluated by confidence level and mean square error (MSE).
  • Hypothesis testing mechanics: uses a test statistic and rejection region to decide whether to reject the null hypothesis, controlled by significance level (Type I error bound).
  • Common confusion: statistical association vs causality—two variables can be statistically related without one causing the other; reversing explanatory and response roles usually preserves significance but may require different tools.
  • Chapter goals: review inference concepts, apply methods to real data, and motivate further learning in statistics.

📚 What statistical inference does

📚 Making statements from samples to populations

  • Statistical inference is the science of making general statements about an entire population based on data from a sample.
  • It relies on theoretical models that produce the sampling distribution—the distribution of a statistic across many possible samples.
  • Procedures are evaluated by their properties in the context of this sampling distribution, and summaries of these properties accompany the results.

🔄 Two forms of inference

The excerpt identifies two main approaches:

| Form | Goal | Tools |
| --- | --- | --- |
| Estimation | Determine the value of a population parameter | Point estimates, confidence intervals |
| Hypothesis testing | Decide between two competing hypotheses about parameters | Test statistic, rejection region, significance level |

🎯 Estimation methods

🎯 Point estimates

  • A point estimate provides a single value as the best guess for a population parameter.
  • Quality is assessed using mean square error (MSE), which measures how far the estimate tends to be from the true parameter value across samples.

📏 Confidence intervals

  • A confidence interval gives a range of plausible values for the parameter.
  • Evaluated by the confidence level: the probability (in repeated sampling) that the interval contains the true parameter.
  • Example: a 95% confidence level means that if we repeated the sampling process many times, about 95% of the intervals would capture the true parameter.

🧪 Hypothesis testing framework

🧪 Structure of a test

Hypothesis testing: deciding between two competing hypotheses formulated in terms of population parameters.

  • The null hypothesis is the default claim (the one we assume unless evidence contradicts it).
  • The test statistic is a number computed from the sample data.
  • The rejection region is the set of test statistic values that lead us to reject the null hypothesis.
  • Decision rule: reject the null if the test statistic falls in the rejection region.

⚠️ Type I error and significance level

  • Type I error: erroneously rejecting the null hypothesis when it is actually true.
  • The significance level is the bound on the probability of a Type I error—the maximum risk we accept of making this mistake.
  • Example: a 5% significance level means we design the test so that, if the null is true, we reject it incorrectly at most 5% of the time.

🔗 Statistical models and causality

🔗 Explanatory vs response variables

  • Statistical models often relate an explanatory variable (predictor) to a response variable (outcome).
  • The language suggests causality, but the excerpt warns: statistical association does not prove causation.
  • Example: two variables can be statistically related without one causing the other (e.g., both might be influenced by a third factor).

🔄 Reversing roles

  • An analysis can be reversed: swap the explanatory and response variables, possibly using a different statistical tool.
  • The excerpt notes that a significant statistical finding usually remains significant when roles are exchanged, though the method may differ.
  • Don't confuse: reversing roles does not establish causality in the opposite direction; it only explores the association from another angle.

💬 Forum discussion prompt

The excerpt poses a question: Can statistical reasoning determine causality, or does statistics only examine joint distributions?

  • The text invites consideration of specific situations where causality must be determined and whether the tools discussed (models for factors, numeric variables, logistic regression) can aid in that process.
  • This is a conceptual challenge, not a settled answer in the excerpt.

📊 Logistic regression (brief mention)

📊 What it is

Logistic regression: a type of regression that relates an explanatory variable to a response in the form of an indicator of an event (a binary outcome).

  • Used when the response is an event indicator (yes/no, success/failure).
  • The excerpt provides two formulas (rewritten in words):
    • Probability form: the probability p_i equals e raised to (a + b times x_i), divided by 1 plus that same exponential.
    • Predictor form: the log of the odds (p_i divided by 1 minus p_i) equals a plus b times x_i.
  • These formulas model how the probability of the event changes with the explanatory variable.

🎓 Chapter learning objectives

🎓 What students should achieve

By the end of Chapter 16, students should be able to:

  • Review the concepts and methods for statistical inference presented in the second part of the book.
  • Apply these methods to the analysis of real data (the chapter includes two case studies).
  • Develop resolve to learn more statistics—motivating continued study beyond this book.

🗂️ Chapter structure

  • Starts with a short review of inference topics.
  • Main part: statistical analysis of two case studies using the tools discussed.
  • Closes with concluding remarks.

16.2 A Review

🧭 Overview

🧠 One-sentence thesis

Statistical inference enables us to make general statements about entire populations based on sample data by using theoretical models and procedures evaluated through their sampling distribution properties.

📌 Key points (3–5)

  • Two main forms of inference: estimation (determining parameter values) and hypothesis testing (deciding between competing hypotheses).
  • How procedures are evaluated: estimation uses mean square error and confidence level; testing uses significance level (Type I error bound) and statistical power.
  • What can be analyzed: single measurements (numeric variables or factors) and relationships between pairs of measurements (comparison or regression models).
  • Common confusion: the role of assumptions—statistical models attempt to reflect reality but require healthy skepticism; always check validity against actual data.
  • Foundation principle: inference relies on theoretical models that produce sampling distributions, and procedures with desirable properties are applied to real data.

📊 Two forms of statistical inference

📊 Estimation

Estimation: the goal is to determine the value of a parameter in the population.

  • Two tools are available:
    • Point estimates: a single value estimate of the parameter.
    • Confidence intervals: a range of plausible values.
  • How quality is assessed:
    • Point estimators are evaluated using mean square error (MSE).
    • Confidence intervals are evaluated using the confidence level.
  • Example: estimating the average income in a population, where you might report a point estimate or a 95% confidence interval (a minimal computational sketch follows this list).
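
A minimal sketch of both tools in Python (illustration only; the data array and the 95% level are hypothetical choices, not taken from the book):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of a numeric measurement
x = np.array([12.1, 9.8, 11.4, 10.7, 13.0, 10.2, 11.9, 12.5])

n = len(x)
xbar = x.mean()                  # point estimate of the expectation
se = x.std(ddof=1) / np.sqrt(n)  # estimated standard error of the mean

# t-based 95% confidence interval for the expectation (n - 1 degrees of freedom)
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = xbar - t_crit * se, xbar + t_crit * se
print(f"point estimate = {xbar:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```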

🧪 Hypothesis testing

Hypothesis testing: the target is to decide between two competing hypotheses formulated in terms of population parameters.

  • How it works:
    • Construct a statistical test using a test statistic and a rejection region.
    • The default hypothesis (null hypothesis) is rejected if the test statistic falls in the rejection region.
  • How quality is assessed:
    • Significance level: a bound on the probability of Type I error (erroneously rejecting the null hypothesis when it is true).
    • Statistical power: the probability of rightfully rejecting the null hypothesis when it is false.
  • Don't confuse: Type I error is rejecting a true null; power measures correctly rejecting a false null. A toy sketch of the test logic follows this list.
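
A toy sketch of this logic in Python (the data and the null value 5.0 are made up for illustration):

```python
import numpy as np
from scipy import stats

alpha = 0.05  # significance level: bound on the Type I error probability
x = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0])

# Null hypothesis: the expectation of the measurement equals 5.0
t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)

# Rejecting when the p-value is below alpha is equivalent to the test
# statistic falling in the rejection region of a level-alpha test.
decision = "reject" if p_value < alpha else "do not reject"
print(f"t = {t_stat:.2f}, p = {p_value:.3f}: {decision} the null hypothesis")
```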

🔢 Types of measurements analyzed

🔢 Single measurements

The excerpt distinguishes between two types of single measurements:

| Type | What is inferred | Examples of parameters |
| --- | --- | --- |
| Numeric variables | Expectation and/or variance | Mean, standard deviation |
| Factors | Probability of obtaining a level or an event | Proportion, event probability |

  • For numeric variables: inference focuses on central tendency (expectation) and spread (variance).
  • For factors: inference focuses on the probability of outcomes.

🔗 Relations between pairs of measurements

Statistical models describe the relations between variables; one variable is designated as the response, and the other as the explanatory variable (which may affect the distribution of the response).

  • When the explanatory variable is a factor with two levels:
    • The analysis reduces to comparison of two sub-populations, each associated with one level.
    • Example: comparing outcomes between a treatment group and a control group.
  • When the explanatory variable is numeric:
    • A regression model may be applied.
    • The type of regression depends on the response:
      • Linear regression: for numeric responses.
      • Logistic regression: for categorical (event-indicator) responses; see the sketch after this list.
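
A minimal sketch of this choice with statsmodels; the simulated data frame and the column names y_num, y_bin, and x are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one numeric response, one binary (event-indicator) response,
# and one numeric explanatory variable.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y_num": 2.0 + 0.5 * x + rng.normal(size=100),
    "y_bin": (rng.random(100) < 1 / (1 + np.exp(-x))).astype(int),
})

linear_fit = smf.ols("y_num ~ x", data=df).fit()      # numeric response: linear regression
logistic_fit = smf.logit("y_bin ~ x", data=df).fit()  # event indicator: logistic regression
print(linear_fit.params)
print(logistic_fit.params)
```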

🧠 The role of assumptions and models

🧠 Statistical models as foundations

  • What models do: they attempt to reflect reality and produce the sampling distribution that underlies inference.
  • How procedures are chosen: procedures with desirable properties (in the context of the sampling distribution) are applied to the data.
  • The output may include summaries that describe these theoretical properties.

⚠️ Healthy skepticism toward assumptions

The excerpt emphasizes a three-step approach to using models:

  1. Be aware: know what the assumptions are.
  2. Ask: how reasonable are these assumptions in the context of the specific analysis?
  3. Check: validate the assumptions as much as possible in light of the information at hand.
  • Practical advice: plot the data and compare the plot to the assumptions of the model.
  • Don't confuse: models are tools that attempt to reflect reality, not perfect representations; always question and verify.

🎯 Summary of the inference framework

🎯 The overall process

  • Starting point: data from a sample.
  • Theoretical basis: statistical models that produce sampling distributions.
  • Evaluation: procedures are evaluated based on their properties in the sampling distribution context.
  • Application: procedures with desirable properties are applied to the data.
  • Output: general statements about the entire population, with summaries describing theoretical properties.

🎯 Key properties to assess

| Inference form | Main property | What it measures |
| --- | --- | --- |
| Estimation (point) | Mean square error (MSE) | Accuracy of the estimate |
| Estimation (interval) | Confidence level | Probability the interval contains the true parameter |
| Hypothesis testing | Significance level | Upper bound on the Type I error probability |
| Hypothesis testing | Statistical power | Probability of correctly rejecting a false null |

Case Studies in Statistical Inference

16.3 Case Studies

🧭 Overview

🧠 One-sentence thesis

These case studies demonstrate how to apply statistical inference methods—including t-tests, variance tests, and linear regression—to real data while critically examining model assumptions and potential violations that may affect the validity of conclusions.

📌 Key points (3–5)

  • Two real-world examples: physicians' reactions to patient weight (comparing two groups) and physical strength's relationship to job performance (regression analysis).
  • Core workflow: plot data first, apply appropriate tests (t-test for group comparison, linear regression for numeric relationships), then validate assumptions.
  • Assumption checking is critical: nominal p-values may be misleading if data violate model assumptions (e.g., skewness, non-Normality of residuals).
  • Common confusion: a statistically significant p-value does not automatically mean the analysis is valid—always check distributional assumptions and data features (e.g., concentration at default values, skewness).
  • Robustness checks: when assumptions are questionable, use alternative tests (e.g., converting numeric response to a factor and using proportion tests) to validate findings.

📊 Case Study 1: Physicians' Reactions to Patient Weight

🎯 Research question and design

Study goal: examine whether physicians' predicted time spent with patients differs based on patient weight (average BMI=23 vs. overweight BMI=30).

  • Sample: 122 primary care physicians in Houston; 72 responses analyzed (33 average-weight, 38 overweight patients).
  • Explanatory variable: patient weight (factor with two levels).
  • Response variable: predicted time in minutes (numeric, range 5–60 minutes, mean ≈27.8, median=30).
  • Method: two-sample t-test to compare the group means (a sketch of such a test follows this list).
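
A minimal sketch of such a two-sample t-test in Python; the file name physicians.csv and the column names weight and time are assumptions for illustration, not the book's data set or code:

```python
import pandas as pd
from scipy import stats

# Hypothetical layout: one row per physician, 'weight' is "BMI=23" or "BMI=30",
# 'time' is the predicted number of minutes.
physicians = pd.read_csv("physicians.csv")

time_23 = physicians.loc[physicians["weight"] == "BMI=23", "time"]
time_30 = physicians.loc[physicians["weight"] == "BMI=30", "time"]

# Two-sample t-test for equality of the two expectations
# (equal variances assumed here; the variance test reported below supports that).
t_stat, p_value = stats.ttest_ind(time_23, time_30)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```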

🔍 Initial data exploration reveals a problem

High concentration at the default value:

  • 30 of 72 physicians (42%) marked exactly "30 minutes."
  • This is the middle value and may reflect physicians completing the form without careful thought rather than true predictions.
  • Implication: the response may not accurately measure physicians' genuine expectations.

Distribution by group:

  • BMI=23 group: 14 responses at 30 minutes, fairly symmetric distribution around it.
  • BMI=30 group: 16 responses at 30 minutes, but only 2 responses above 30 (skewed distribution).
  • The skewness in the overweight group raises concerns about the validity of the t-test, which assumes approximate Normality.

📈 Primary t-test results

Findings:

  • p-value = 0.0058 (< 0.05) → reject the null hypothesis of equal means.
  • Estimated difference: 31.36 − 24.74 ≈ 6.6 minutes (physicians expect to spend more time with average-weight patients).
  • 95% confidence interval: [1.99, 11.27] minutes.
  • Variance test (F-test): p-value = 0.89 → no evidence of unequal variances; ratio ≈1.04, CI [0.53, 2.08].

Interpretation caution: the skewness in the BMI=30 group may inflate the significance of the t-test.
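
For the variance comparison, a variance-ratio (F) test can be sketched as follows, reusing the hypothetical time_23 and time_30 series from the previous sketch (this mirrors the kind of test reported above, not the book's exact code):

```python
import numpy as np
from scipy import stats

def variance_ratio_test(x, y):
    """Two-sided F test for equality of two variances (assumes Normal samples)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f_ratio = x.var(ddof=1) / y.var(ddof=1)
    df1, df2 = len(x) - 1, len(y) - 1
    cdf = stats.f.cdf(f_ratio, df1, df2)
    p_value = 2 * min(cdf, 1 - cdf)  # two-sided p-value
    return f_ratio, p_value

f_ratio, p_value = variance_ratio_test(time_23, time_30)
print(f"F = {f_ratio:.2f}, p = {p_value:.2f}")
```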

🛡️ Robustness check with a proportion test

Why needed: the t-test assumes symmetric distributions; the BMI=30 group is skewed.

Alternative approach:

  • Convert response to a factor: "≥30 minutes" (TRUE) vs. "<30 minutes" (FALSE).
  • Use a two-sample proportion test to compare the probability of TRUE between groups.

Results:

  • p-value = 0.054 (just above 0.05).
  • Proportions with predicted time ≥30 minutes: about 70% in the BMI=23 group vs. about 47% (18 of 38) in the BMI=30 group, consistent with the direction of the t-test.
  • Interpretation: the proportion test provides weaker evidence (borderline significance) compared to the t-test, suggesting the t-test may overstate the strength of evidence due to distributional issues.

Conclusion: there is some evidence that physicians expect to spend less time with overweight patients, but the effect is less clear-cut when accounting for data skewness.
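
A minimal sketch of this robustness check, again assuming the hypothetical physicians data frame; proportions_ztest from statsmodels is one way to compare the two proportions (the book's tool may differ, e.g. a chi-square-based proportion test):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Convert the numeric response to a factor: TRUE if the predicted time is at least 30 minutes
over30 = physicians["time"] >= 30
groups = ["BMI=23", "BMI=30"]

counts = np.array([over30[physicians["weight"] == g].sum() for g in groups])
nobs = np.array([(physicians["weight"] == g).sum() for g in groups])

# Two-sample test for equality of the probabilities of the event {time >= 30}
z_stat, p_value = proportions_ztest(counts, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```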

🏗️ Case Study 2: Physical Strength and Job Performance

🎯 Research question and design

Study goal: determine whether simple strength measurements (grip and arm strength) can predict job performance in physically demanding occupations, providing a safe and cost-effective selection tool.

  • Sample: 147 workers in physically demanding jobs.
  • Explanatory variables:
    • grip: grip strength score (range 29–189, mean 110.2).
    • arm: arm strength score (range 19–132, mean 78.75).
  • Response variables:
    • ratings: supervisor ratings of physical job performance (range 21.6–57.2, mean 41.01).
    • sims: simulation task scores (range −4.17 to 5.17, mean 0.20).
  • Method: linear regression (response ~ explanatory variable).

📉 Regression analysis: grip strength predicting simulations

Scatter plot and regression line:

  • Positive linear trend: higher grip strength → higher simulation scores.
  • Regression line follows the overall trend reasonably well.

Model summary:

  • Slope estimate: 0.045 (p < 2e-16, highly significant).
  • R² = 0.41 → grip explains 41% of variance in simulation scores.
  • Proportion of SD explained: √0.41 ≈ 0.64 (about 64% of the standard deviation). A sketch of such a fit follows this list.
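
A minimal sketch of such a fit with statsmodels; the file name workers.csv and the column names mirror the variables described above but are assumptions of this illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with the columns 'grip', 'arm', 'sims', and 'ratings'
workers = pd.read_csv("workers.csv")

fit = smf.ols("sims ~ grip", data=workers).fit()
print(fit.params["grip"])     # slope estimate
print(fit.rsquared)           # R-squared: share of the variance explained
print(np.sqrt(fit.rsquared))  # share of the standard deviation explained
```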

💪 Regression analysis: arm strength predicting simulations

Model summary:

  • Slope estimate: 0.055 (p < 2e-16, highly significant).
  • R² = 0.47 → arm explains 47% of variance, slightly better than grip.
  • Proportion of SD explained: √0.47 ≈ 0.69 (69%).

🔗 Combined score: using both grip and arm

Why combine: pooling information from both measurements may improve prediction.

Combined score formula:

score = −5.434 + 0.024 × grip + 0.037 × arm

Regression of simulations on the combined score:

  • R² = 0.54 → the combined score explains 54% of variance.
  • Proportion of SD explained: √0.54 ≈ 0.74 (74%).
  • Interpretation: almost three-quarters of the variability in simulation performance can be predicted from the simple strength tests.

⚠️ Checking the relationship between grip and arm

Scatter plot of grip vs. arm:

  • The two strength measures show a strong positive linear trend.
  • Implication: grip and arm largely measure the same underlying property (general strength), so combining them yields only modest improvement over using either alone.
  • Recommendation: consider adding other strength measures that capture different dimensions (e.g., leg strength, endurance) to improve prediction further.

🧪 Validating regression assumptions: residual analysis

Why check residuals: linear regression assumes residuals (observed − fitted) are Normally distributed.

Histogram of residuals:

  • Appears symmetric and bell-shaped, consistent with Normality.

Quantile-Quantile (Q-Q) plot:

Q-Q plot: plots empirical percentiles of residuals against theoretical Normal percentiles; deviations from a straight line indicate non-Normality.

  • Points fall close to a straight line → no evidence of violation of Normality assumption.
  • Conclusion: the regression model assumptions appear reasonable, so the reported p-values and confidence intervals can be trusted. A plotting sketch of these diagnostics follows this list.
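
A plotting sketch of these diagnostics, reusing the fit object from the regression sketch above; the histogram and scipy's probplot stand in for whatever plotting commands the book itself uses:

```python
import matplotlib.pyplot as plt
from scipy import stats

residuals = fit.resid  # observed minus fitted values

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].hist(residuals, bins=15)
axes[0].set_title("Histogram of residuals")

# Q-Q plot: empirical quantiles of the residuals against theoretical Normal quantiles
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Normal Q-Q plot of residuals")
plt.tight_layout()
plt.show()
```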

🔄 Next steps (left as exercise)

  • Repeat the analysis using ratings (supervisor ratings) as the response instead of sims.
  • The workflow is the same, but conclusions may differ.

🧰 General lessons from the case studies

🧰 Always plot first

  • Histograms, box plots, and scatter plots reveal data features (e.g., skewness, outliers, concentration at default values) that summary statistics alone cannot show.
  • Plotting helps identify potential violations of model assumptions before running tests.

🧰 Understand what the test assumes

| Test | Key assumptions | What to check |
| --- | --- | --- |
| t-test | Approximate Normality of each group | Box plots for symmetry; watch for skewness |
| Variance F-test | Normality in both groups | Histograms; sensitive to non-Normality |
| Linear regression | Residuals are Normally distributed | Q-Q plot of residuals; histogram |
| Proportion test | Large enough sample in each cell | Check cell counts (≥5 recommended) |

🧰 Use robustness checks when assumptions are questionable

  • If data are skewed or have unusual features, apply an alternative test that is less sensitive to those issues.
  • Example: when the t-test assumption is doubtful, convert the response to a factor and use a proportion test.
  • If results agree across methods → stronger evidence; if they diverge → interpret cautiously.

🧰 R² and explained variance

  • R²: the proportion of variance in the response explained by the model.
  • √R²: the proportion of the standard deviation explained (often more intuitive).
  • Example: R²=0.54 means 54% of variance explained, or √0.54≈74% of SD explained.

🧰 Don't confuse statistical significance with practical validity

  • A small p-value does not guarantee the analysis is correct if assumptions are violated.
  • Always examine the data and residuals; a "significant" result from a flawed model may be misleading.

🎓 What's missing from this book (summary remarks)

🎓 More complex models

  • Real applications often require models beyond simple two-group comparisons or single-predictor regression (e.g., multiple regression, interactions, nonlinear models).
  • Each model has specific estimation methods, assumptions, and diagnostic tools.

🎓 Prediction vs. inference

  • The book focuses on estimation (e.g., estimating means, slopes) and hypothesis testing (e.g., is there a difference?).
  • Another major task is prediction: forecasting future observations and quantifying uncertainty (e.g., prediction intervals).
  • Regression is a natural tool for prediction, but the focus shifts from parameter estimates to forecasting accuracy; a small prediction-interval sketch follows this list.
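
A small prediction-interval sketch, reusing the hypothetical fit object from the regression sketch in the case study; the new grip values are made up:

```python
import pandas as pd

# 95% prediction intervals for the simulation score of new workers with given grip scores
new_workers = pd.DataFrame({"grip": [90, 110, 130]})
pred = fit.get_prediction(new_workers)
print(pred.summary_frame(alpha=0.05)[["mean", "obs_ci_lower", "obs_ci_upper"]])
```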

🎓 Design and other statistical tasks

  • The excerpt ends mid-sentence mentioning "design," likely referring to experimental design (how to plan studies to answer questions efficiently).
  • Other tasks not covered: causal inference, model selection, handling missing data, etc.

Summary of an Introductory Statistics Course

16.4 Summary

🧭 Overview

🧠 One-sentence thesis

This introductory statistics book covers only the minimum essential elements, and true statistical competence requires learning more complex models, prediction methods, experimental design, and the underlying mathematical theory.

📌 Key points (3–5)

  • What this book covered: simple statistical models sufficient for basic academic data analysis, but typical applications require more complexity.
  • What's missing: prediction tasks, statistical design (especially data collection planning), and the mathematical theory underlying statistics.
  • Common confusion: statistics is not just about final analysis—good design at the data collection stage can prevent problems that even expert analysis cannot fix later.
  • Key warning: statistics can be misused; learning statistics helps distinguish valid from invalid applications.
  • The real goal: finishing this book should mark the beginning, not the end, of learning statistics.

📚 Limitations of simple models

📚 More complex models are needed in practice

  • The book presented the simplest possible statistical models.
  • Real applications typically require more complex models.
  • Each complex model needs:
    • Specific estimation and testing methods
    • Computational tools for analysis
    • Understanding of probabilistic assumptions to interpret output and diagnose problems

🔍 Understanding assumptions matters

  • Inference characteristics (significance levels, confidence levels) depend on model assumptions.
  • Users must be able to:
    • Diagnose when assumptions are violated
    • Assess how severe the violation is
    • Judge whether findings remain valid despite violations

🔮 Beyond estimation and testing

🔮 Prediction as a distinct task

  • Statistics can be used for tasks other than estimation and hypothesis testing.
  • Prediction: assessing what future observations may be and their likely range of values.
  • Regression is a natural tool for prediction.
  • Don't confuse: prediction of future response values is different from testing or estimating parameter values.

📐 Statistical design and data collection

Statistical design: planning how to collect data with the final analysis requirements in mind.

  • Design includes more than just sample size determination.
  • Sample size selection typically arises in hypothesis testing contexts, with criteria such as a minimal test power (the minimal probability of detecting a true finding); a small sketch follows this list.
  • Crucial role: an experienced statistician can ensure collected data is appropriate for the intended analysis.
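
A small sketch of a power-based sample-size calculation with statsmodels; the effect size, significance level, and target power are made-up planning values:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# with a 5% significance level and 80% power in a two-sample t-test.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group
```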

⚠️ The cost of poor planning

  • Common problem: researchers collect data first, then seek statistical help when it's too late.
  • The data may not provide a satisfactory answer to the research question.
  • Key insight from the excerpt: "good statisticians are required for the final analysis only in the case where the initial planning was poor" (stated with some exaggeration).

🧮 The mathematical foundation

🧮 Why mathematical theory matters

  • This course introduced minimal mathematics intentionally.
  • Serious learning of statistics requires familiarity with relevant mathematical theory.
  • What's needed:
    • Deep knowledge of probability theory
    • Familiarity with the rich and growing body of research on mathematical aspects of data analysis
  • Bottom line: "One cannot be a good statistician unless one becomes familiar with the important aspects of this theory."

💡 Final perspective

💡 Statistics can be used or misused

  • The book ends with the famous quotation: "Lies, damned lies, and statistics."
  • Learning statistics provides tools to distinguish valid use from misuse.
  • The author's goal: this book should mark the beginning of learning statistics, not the end.