🧭 Overview
🧠 One-sentence thesis
Statistical hypothesis testing provides a formal framework for deciding between a null hypothesis (the currently accepted explanation) and an alternative hypothesis (a new claim), using sample data to determine whether observed differences are statistically significant rather than due to random chance.
📌 Key points (3–5)
- Core structure: hypothesis testing splits into three steps—formulating null and alternative hypotheses, specifying the test statistic and rejection region, and reaching a conclusion based on observed data.
- Two error types: Type I error (rejecting a true null hypothesis) is controlled by the significance level (typically 5%), while Type II error (failing to reject a false null hypothesis) relates to statistical power.
- Common confusion: simple-minded significance vs. statistical significance—a smaller deviation can be more statistically significant if sampling variability is much smaller; the test compares deviation to the standard error, not just absolute difference.
- The p-value: measures the probability of observing data as extreme as (or more extreme than) what was observed, assuming the null hypothesis is true; reject the null when p-value < 0.05 (at 5% significance level).
- Sample size matters: the same estimated difference can lead to different conclusions depending on sample size, because larger samples reduce sampling variability and increase the ability to detect real effects.
🏗️ The structure of hypothesis testing
🏗️ Three-step process
The testing process follows a logical sequence:
- Formulate hypotheses (before seeing data): split parameter values into two collections—null hypothesis H₀ (phenomenon absent) and alternative hypothesis H₁ (phenomenon present)
- Specify the test (before seeing data): choose a test statistic and define the rejection region (values that lead to rejecting H₀)
- Reach conclusion (using observed data): compute the test statistic from the sample and decide whether it falls in the rejection region
Null hypothesis (H₀): the sub-collection of parameter values for which the phenomenon is absent; the currently accepted theory; the hypothesis for which erroneous rejection is the more severe error.
Alternative hypothesis (H₁): the sub-collection reflecting the presence of the phenomenon; the new theory challenging the established one.
🎯 Formulating hypotheses
The formulation depends on what phenomenon you want to investigate:
- Two-sided alternative: H₁: E(X) ≠ value (testing for any change)
- One-sided alternatives:
  - H₁: E(X) < value (testing for a decrease)
  - H₁: E(X) > value (testing for an increase)
Example: To test whether car prices changed, use H₀: E(X) = 13,662 vs. H₁: E(X) ≠ 13,662. To test specifically for a price rise, use H₀: E(X) ≤ 13,662 vs. H₁: E(X) > 13,662.
Don't confuse: The null hypothesis is not always "equals"—it can be "greater than or equal" or "less than or equal" depending on what phenomenon you're investigating.
📊 Test statistic and rejection region
Test statistic: a statistic that summarizes the sample data to decide between the two hypotheses.
Rejection region: a set of values for the test statistic; if the observed value falls in this region, reject H₀.
For testing expectations, the t-statistic is commonly used:
- Formula: t = (sample mean − null value) / (sample standard deviation / √n)
- This measures the discrepancy in units of the estimated standard error
Rejection regions by alternative type:
- Two-sided: reject if |t| > threshold (e.g., 1.972 for n=201)
- Greater than: reject if t > threshold (e.g., 1.653)
- Less than: reject if t < negative threshold (e.g., −1.653)
The threshold is chosen to achieve the desired significance level (typically 5%).
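As a minimal sketch (not from the excerpt), the computation can be written directly in R; the sample x and null value mu0 below are hypothetical:

```r
set.seed(1)
x <- rnorm(201, mean = 13500, sd = 2500)  # hypothetical sample of 201 prices
mu0 <- 13662                              # null value for E(X)

n <- length(x)
t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # discrepancy in standard-error units

# Two-sided rejection region at the 5% level: reject if |t| > qt(0.975, n - 1)
threshold <- qt(0.975, df = n - 1)  # about 1.972 for n = 201
abs(t.stat) > threshold             # TRUE means reject H0
```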
⚠️ Error types and probabilities
⚠️ Two types of error
| Error type | What happens | When it occurs |
|---|---|---|
| Type I | Reject H₀ when H₀ is true | False positive; claiming a phenomenon exists when it doesn't |
| Type II | Fail to reject H₀ when H₁ is true | False negative; missing a real phenomenon |
The two errors are not treated symmetrically: Type I error is considered more severe, so the test is designed to control its probability at a pre-specified level.
⚠️ Significance level and power
Significance level: the probability of Type I error; the probability (computed under H₀) of rejecting H₀. Commonly set at 5% or 1%.
Statistical power: the probability (computed under H₁) of rejecting H₀; equals 1 − probability of Type II error.
When comparing two tests with the same significance level, prefer the one with higher statistical power.
Why the asymmetry? In scientific research, the currently accepted theory is designated as H₀. A novel claim requires strong evidence to replace it. Similarly, a new drug must demonstrate clear benefit before approval—the null is "no better than current treatment."
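Both probabilities can be estimated by simulation. The sketch below is not from the excerpt; it assumes Normal data, a sample of size 20, and one specific alternative (E(X) = 0.5):

```r
set.seed(2)
reject <- function(mu) {
  x <- rnorm(20, mean = mu, sd = 1)   # one simulated sample
  t.test(x, mu = 0)$p.value < 0.05    # TRUE if H0: E(X) = 0 is rejected
}

# Significance level: rejection probability computed under H0 (mu = 0)
mean(replicate(10000, reject(0)))    # close to the nominal 0.05

# Power: rejection probability computed under H1 (here mu = 0.5)
mean(replicate(10000, reject(0.5)))  # roughly 0.5-0.6 in this setting
```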
⚠️ Interpreting significance
The excerpt emphasizes that statistical significance differs from simple-minded significance:
- Simple assessment: looks only at the size of the deviation from the null value
- Statistical assessment: compares the deviation to the sampling variability (standard error)
Example: Two groups showed the same deviation from the null expectation (about 0.275), but one had p-value = 0.127 (not significant) while the other had p-value = 0.052 (nearly significant). The difference was due to different sample standard deviations (1.806 vs. 1.422)—smaller variability makes the same deviation more statistically significant.
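This can be checked from the summary statistics alone. The group size is not stated in the excerpt, so the sketch below assumes roughly 100 observations per group, a hypothetical value that approximately reproduces the reported p-values:

```r
# Two-sided p-value from a deviation, a standard deviation, and a sample size
p.from.summary <- function(dev, s, n) {
  t <- dev / (s / sqrt(n))
  2 * (1 - pt(abs(t), df = n - 1))
}
p.from.summary(0.275, 1.806, 100)  # about 0.13: larger spread, larger p-value
p.from.summary(0.275, 1.422, 100)  # about 0.06: smaller spread, smaller p-value
```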
📈 The p-value approach
📈 What the p-value measures
p-value: the probability, computed under the null hypothesis, of obtaining a test statistic as extreme as (or more extreme than) the observed value.
The p-value is itself a test statistic. It equals the significance level of a test where the observed value serves as the threshold.
Decision rule:
- If p-value < 0.05, reject H₀ at the 5% significance level
- If p-value < 0.01, reject H₀ at the 1% significance level
📈 Computing the p-value
For a two-sided test with observed t = −0.811:
- p-value = P(|T| > 0.811) under H₀
- This equals twice the upper tail probability (by symmetry of the t-distribution)
- Formula: 2 × [1 − P(T ≤ 0.811)]
For one-sided tests:
- Greater than alternative: p-value = P(T > observed value)
- Less than alternative: p-value = P(T < observed value)
Advantage of p-values: No need to look up critical thresholds; simply compare the p-value directly to your chosen significance level.
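In R these tail probabilities come from pt, the cumulative distribution function of the t-distribution. A sketch for the observed value t = −0.811, assuming 200 degrees of freedom (the n = 201 car-price sample):

```r
t.obs <- -0.811
df <- 200  # n - 1; assumes the n = 201 car-price sample

2 * (1 - pt(abs(t.obs), df))  # two-sided p-value, about 0.418
1 - pt(t.obs, df)             # one-sided, H1: E(X) > null value
pt(t.obs, df)                 # one-sided, H1: E(X) < null value
```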
📈 P-value and sample size
The excerpt illustrates how sample size affects conclusions:
- With n=20 and estimated proportion = 0.3 (vs. null of 0.5): p-value = 0.118, do not reject
- With n=200 and estimated proportion = 0.3 (vs. null of 0.5): p-value = 0.00000002, strongly reject
The estimated value (0.3) is identical, but the larger sample reduces sampling variability, making the same discrepancy highly significant.
Key lesson: Statistical testing is based on the relative discrepancy compared to sampling variability, not the absolute discrepancy alone.
🧪 Testing expectations (t-test)
🧪 When to use the t-test
The t-test is used to test hypotheses about the expected value (mean) of a measurement:
- Statistical model: observations are a random sample
- Parameter of interest: E(X), the expectation
- Test statistic: t = (sample mean − null value) / (s / √n)
- Distribution under H₀: t-distribution with n−1 degrees of freedom
🧪 Applying the t-test in practice
The excerpt demonstrates using the function t.test with car price data:
Basic syntax: t.test(data, mu=null_value)
- data: the sample observations
- mu: the expected value under H₀ (default is 0)
- For one-sided alternatives, add alternative="greater" for H₁: E(X) > null value, or alternative="less" for H₁: E(X) < null value
- Default is alternative="two.sided"
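A hedged example of the call; the vector price is an illustrative stand-in for the excerpt's car-price data, which is not reproduced here:

```r
# Two-sided test of H0: E(X) = 13662
t.test(price, mu = 13662)

# One-sided test of H0: E(X) <= 13662 against H1: E(X) > 13662
t.test(price, mu = 13662, alternative = "greater")
```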
🧪 Interpreting t-test output
The output includes:
- Test statistic value (t = ...)
- Degrees of freedom (df = n−1)
- p-value: compare to significance level
- Confidence interval: for two-sided tests, a 95% interval; for one-sided, a one-sided interval like [lower bound, ∞)
- Sample estimate: the sample mean
Example interpretation: If p-value = 0.418 > 0.05, do not reject H₀; conclude the expected value is not significantly different from the null value.
🧪 Subsetting data for testing
The excerpt shows how to test hypotheses for subgroups using logical indexing:
- Create a logical variable (e.g., heavy <- weight > 2414)
- Use it to subset: data[heavy] selects observations where heavy is TRUE
- Use negation: data[!heavy] selects observations where heavy is FALSE
This allows testing the same hypothesis separately for different subgroups (e.g., heavier vs. lighter cars).
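A sketch under assumed names (a data frame cars with columns weight and price; neither name is confirmed by the excerpt):

```r
heavy <- cars$weight > 2414  # logical indicator: TRUE for heavier cars

t.test(cars$price[heavy],  mu = 13662)  # test within the heavier subgroup
t.test(cars$price[!heavy], mu = 13662)  # test within the lighter subgroup
```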
🎲 Testing proportions
🎲 When to use proportion tests
Proportion tests examine hypotheses about the probability p of an event:
- Connection to expectations: p is the expected value of a Bernoulli variable (1 if event occurs, 0 otherwise)
- Estimator: sample proportion p̂ = (number of occurrences) / n
- Variance under H₀: V(p̂) = p₀(1−p₀)/n, where p₀ is the null value
🎲 The test statistic for proportions
The test statistic measures the standardized deviation:
Z = (p̂ − p₀) / √[p₀(1−p₀)/n]
Under H₀, Z is approximately standard Normal (by Central Limit Theorem), so Z² follows a chi-square distribution with 1 degree of freedom.
Rejection region: {Z² > c} for some threshold c, or equivalently {|Z| > √c}.
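A minimal sketch of the statistic computed by hand, using the 6-out-of-20 example that appears later in the excerpt; the version without continuity correction is shown for clarity:

```r
x <- 6; n <- 20; p0 <- 0.5   # occurrences, sample size, null probability
p.hat <- x / n

Z <- (p.hat - p0) / sqrt(p0 * (1 - p0) / n)
Z^2                       # 3.2, the chi-square statistic (no correction)
qchisq(0.95, df = 1)      # 5% threshold, about 3.84
1 - pchisq(Z^2, df = 1)   # p-value without the continuity correction
```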
🎲 Applying prop.test
The excerpt demonstrates using the function prop.test:
Basic syntax: prop.test(x, n)
- x: number of occurrences of the event
- n: total sample size
- Default null probability is 0.5; change with p=value
Example: Testing whether the median weight for diesel cars is 2,414 lb:
- 6 out of 20 diesel cars are below the threshold
- prop.test(6, 20) tests H₀: p = 0.5
- Output: X-squared = 2.45, p-value = 0.118, do not reject
Continuity correction: By default, the function applies Yates' continuity correction (analogous to the correction used when approximating the Binomial by the Normal distribution). This can be disabled with correct=FALSE.
🎲 Sample size and proportion tests
The excerpt demonstrates the effect of sample size:
- With n=20, x=6: p̂=0.3, p-value=0.118 (not significant)
- With n=200, x=60: p̂=0.3, p-value=0.00000002 (highly significant)
Even though the estimated proportion is identical (0.3 vs. null of 0.5), the larger sample makes the difference statistically significant because sampling variability decreases with sample size.
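The comparison is a two-line check in R (values as reported in the excerpt):

```r
prop.test(6, 20)    # p-hat = 0.3, p-value about 0.118: do not reject
prop.test(60, 200)  # p-hat = 0.3, p-value about 2e-08: reject decisively
```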
🔍 Important considerations
🔍 Robustness and assumptions
The t-test assumes measurements are Normally distributed. The excerpt mentions examining robustness to violations of this assumption (e.g., testing with Exponential or Uniform distributions instead of Normal).
When sample size is small and the distribution is not Normal, the nominal significance level may not match the actual significance level.
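A simulation sketch (not from the excerpt) that estimates the actual Type I error rate of the nominal 5% t-test when the data are Exponential rather than Normal, with a deliberately small sample:

```r
set.seed(3)
n <- 10  # small sample, where the Normality assumption matters most

# Exponential(1) data have E(X) = 1, so H0: E(X) = 1 is true here
actual.level <- mean(replicate(10000,
  t.test(rexp(n, rate = 1), mu = 1)$p.value < 0.05))
actual.level  # can noticeably exceed the nominal 0.05 for skewed data
```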
🔍 Confidence intervals in testing
The excerpt notes that confidence intervals and hypothesis tests are related:
- A 95% confidence interval is reported alongside the test
- For one-sided tests, a one-sided confidence interval is given (e.g., [lower bound, ∞))
- If the null value falls outside the 95% confidence interval, the null hypothesis would be rejected at the 5% level
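A sketch of this duality with hypothetical data; the sample x and the null value 9 are illustrative:

```r
set.seed(4)
x <- rnorm(30, mean = 10, sd = 2)  # hypothetical sample

fit <- t.test(x, mu = 9)
fit$conf.int        # 95% confidence interval for E(X)
fit$p.value < 0.05  # TRUE exactly when 9 falls outside that interval
```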
🔍 Practical interpretation
The excerpt emphasizes several practical points:
- Conservatism in science: statisticians advocate caution to add objectivity; investigators may prefer bold discoveries
- Publication bias: many journals require p-value < 0.05 for publication; should results with p-value around 10% also be published?
- Context matters: the severity of Type I vs. Type II errors depends on the application (e.g., drug approval, scientific discovery)
Don't confuse: Failing to reject H₀ does not prove H₀ is true; it only means the data do not provide sufficient evidence against it.