Introduction to Statistics

1

What Are Statistics

What Are Statistics

🧭 Overview

🧠 One-sentence thesis

Statistics is not merely a collection of numerical facts but a comprehensive discipline for analyzing, interpreting, displaying, and making decisions based on data—requiring careful attention to how numbers are chosen and interpreted to avoid misleading conclusions.

📌 Key points (3–5)

  • Statistics as facts vs. discipline: Statistics includes numerical facts (e.g., earthquake measurements, demographic data) but more broadly refers to techniques for analyzing and interpreting data.
  • Numbers can be right, interpretations wrong: The same numerical data can lead to flawed conclusions if context, causation, or confounding factors are misunderstood.
  • Common confusion: correlation vs. causation: A statistical relationship between two variables does not prove one causes the other; third variables or temporal effects may be responsible.
  • Incomplete information misleads: Percentages and comparisons without baseline rates or full context can create false impressions about trends or societal changes.
  • Why it matters: Understanding statistics properly is essential because misinterpretation leads to incorrect decisions and false beliefs about cause-and-effect relationships.

📊 Two faces of statistics

📊 Statistics as numerical facts

The excerpt opens with examples of statistics as concrete numbers:

  • Earthquake magnitudes (9.2 on Richter scale)
  • Crime ratios (men commit murder at 10× the rate of women)
  • Health statistics (1 in 8 South Africans HIV positive)
  • Demographic projections (15 elderly per newborn by 2020)

These are descriptive facts and figures—quantitative statements about the world.

🔬 Statistics as a discipline

In the broadest sense, "statistics" refers to a range of techniques and procedures for analyzing, interpreting, displaying, and making decisions based on data.

  • The study involves both mathematical calculations and critical thinking about how numbers are chosen and interpreted.
  • It's not just computation; it requires understanding context, sources of bias, and alternative explanations.
  • Don't confuse: Statistics (the discipline) with statistics (individual numbers)—the former is the methodology for working with the latter.

⚠️ Three common interpretation errors

⚠️ History effects (temporal confounding)

Scenario from excerpt: Ice cream sales increased 30% in the three months after a new advertisement launched in late May, leading to the conclusion that the ad was effective.

The flaw:

  • Ice cream consumption naturally rises in June, July, and August regardless of advertising.
  • This is called a history effect: outcomes are attributed to one variable (the ad) when another variable related to the passage of time is actually responsible.
  • Example: Any ice cream brand would likely see increased sales in summer months; the ad may have had no effect at all.

🔗 Third-variable problem (spurious correlation)

Scenario from excerpt: Cities with more churches have more crime, leading to the conclusion that churches lead to crime.

The flaw:

  • Both variables are caused by a third factor: population size.
  • Bigger cities have both more churches and more crime simply because they have more people.
  • Don't confuse: A correlation (two things occurring together) with causation (one causing the other).
  • The excerpt notes this will be discussed in detail in Chapter 6, emphasizing that people erroneously believe in a causal relationship between two variables when a third variable causes both.

📉 Missing baseline information

Scenario from excerpt: Interracial marriages increased 75% compared to 25 years ago, leading to the conclusion that society now accepts interracial marriages.

The flaw:

  • The percentage increase is meaningless without knowing the baseline rate.
  • If only 1% of marriages were interracial 25 years ago, a 75% increase means 1.75% now—hardly evidence of widespread acceptance.
  • Additional missing information: Has the rate fluctuated over the years? Is this year actually the highest?
  • Key lesson: Relative changes (percentages) can be dramatic even when absolute numbers remain very small.

🧮 What statistics requires

🧮 Math and calculation

  • The study involves mathematical computations with numbers.
  • Quantitative skills are necessary but not sufficient.

🤔 Critical interpretation

The excerpt emphasizes that statistics "relies heavily on how the numbers are chosen and how the statistics are interpreted."

What this means:

  • Where did the data come from? (sampling, measurement)
  • What context is needed to understand the numbers?
  • What alternative explanations exist?
  • What information is missing?

The central warning: "You will find that the numbers may be right, but the interpretation may be wrong."

Aspect         | What's needed              | Why it matters
Calculation    | Math skills                | Get the numbers right
Selection      | Understanding data sources | Avoid biased samples
Interpretation | Critical thinking          | Avoid false conclusions
Context        | Domain knowledge           | Understand what numbers mean

2

Importance of Statistics

Importance of Statistics

🧭 Overview

🧠 One-sentence thesis

Statistics provides essential tools for evaluating data and claims intelligently, enabling people to distinguish good reasoning from faulty manipulation in everyday decisions.

📌 Key points (3–5)

  • What statistics really means: not just facts and figures, but a range of techniques for analyzing, interpreting, displaying, and making decisions based on data.
  • Why statistics matters: without the ability to distinguish good from faulty reasoning, people are vulnerable to manipulation and poor decisions.
  • Common confusion: correlation vs causation—two variables appearing together does not mean one causes the other; a third variable may cause both.
  • Interpreting percentages: raw percentages can be misleading without context (baseline rates, absolute numbers, trends over time).
  • Where statistics appears: claims from psychology, health, law, sports, business, and virtually every facet of contemporary life.

🚨 Common statistical fallacies

🚨 History effect (time-related confounding)

History effect: interpreting outcomes as the result of one variable when another variable having to do with the passage of time is actually responsible.

  • The ice cream example: A 30% sales increase after advertising in late May was attributed to the ad's effectiveness.
  • The flaw: Ice cream consumption naturally increases in June, July, and August regardless of advertisements.
  • Why it matters: People mistakenly credit one factor (the ad) when time-related patterns are the real cause.
  • Example: An organization launches a campaign in spring and sees summer growth, but the growth may be seasonal rather than campaign-driven.

🔗 Third-variable problem

Third-variable problem: a third variable can cause both situations, but people erroneously believe there is a causal relationship between the two primary variables.

  • The churches example: "The more churches in a city, the more crime there is. Thus, churches lead to crime."
  • The flaw: Larger populations cause both more churches and more crime; population size is the third variable.
  • Don't confuse: Correlation (two things occurring together) with causation (one thing causing the other).
  • Example: Two outcomes may both be driven by a common underlying factor rather than one causing the other.

📊 Insufficient context for percentages

  • The interracial marriage example: "75% more interracial marriages are occurring this year than 25 years ago. Thus, our society accepts interracial marriages."
  • The flaws:
    • Missing baseline: If only 1% of marriages were interracial 25 years ago, 75% more means 1.75%—hardly evidence of widespread acceptance.
    • Missing trend information: The statistic doesn't show whether there have been dramatic fluctuations or whether this year is even the highest.
  • Key lesson: Percentages without absolute numbers, rates, or historical context can be misleading.

🛠️ What statistics encompasses

🛠️ Beyond facts and figures

  • The excerpt emphasizes that statistics is not only facts and figures.
  • In the broadest sense, statistics refers to:
    • Techniques for analyzing data
    • Procedures for interpreting data
    • Methods for displaying data
    • Tools for making decisions based on data
  • It is a comprehensive toolkit, not just a collection of numbers.

🎯 Taking control through statistical literacy

  • "Taking control of your life" partly means being able to properly evaluate data and claims.
  • Without the ability to distinguish good from faulty reasoning, people are:
    • Vulnerable to manipulation
    • Prone to decisions not in their best interest
  • Statistics provides the tools needed to react intelligently to information heard or read.

🌍 Statistics in everyday life

🌍 Diverse domains

The excerpt lists claims from multiple fields to show how pervasive statistics are:

Domain            | Example claim
Health            | Almost 85% of lung cancers in men and 45% in women are tobacco-related
Consumer products | 4 out of 5 dentists recommend Dentine
Safety            | Condoms are effective 94% of the time
Social issues     | Women make 75 cents to every dollar a man makes when they work the same job
Psychology        | People tend to be more persuasive when they look others directly in the eye and speak loudly and quickly
Sports            | People predict it is very unlikely there will ever be another baseball player with a batting average over 400
Probability       | There is an 80% chance that in a room full of 30 people at least two people will share the same birthday

🗣️ Statistical character of claims

  • All the listed claims are statistical in character.
  • They come from psychology, health, law, sports, business, and more.
  • Data and data interpretation show up in discourse from virtually every facet of contemporary life.
  • The excerpt notes that "79.48% of all statistics are made up on the spot"—a humorous reminder to be skeptical and evaluate claims carefully.
3

Descriptive Statistics

Descriptive Statistics

🧭 Overview

🧠 One-sentence thesis

Descriptive statistics provide tools to summarize and understand data without generalizing beyond it, enabling intelligent evaluation of the numerical claims that bombard us daily.

📌 Key points (3–5)

  • What descriptive statistics are: numbers used to summarize and describe collected data (e.g., percentages, averages).
  • What they do NOT do: descriptive statistics do not generalize beyond the data at hand—that is the job of inferential statistics.
  • Why they matter: they help you evaluate claims, detect misleading numbers, and make informed decisions in everyday life.
  • Common confusion: descriptive vs. inferential—descriptive statistics only describe the data you have; inferential statistics draw conclusions about cases beyond your data.
  • Real-world presence: statistics appear in advertising, health claims, sports, business, and nearly every aspect of contemporary life.

📊 What descriptive statistics are

📊 Definition and scope

Descriptive statistics: numbers that are used to summarize and describe data.

  • "Data" refers to information collected from experiments, surveys, historical records, etc. (Note: "data" is plural; one piece is a "datum.")
  • Any number you compute from your data counts as a descriptive statistic.
  • Multiple descriptive statistics are often used together to give a full picture.
  • Example: If analyzing birth certificates, a descriptive statistic might be the percentage issued in a particular state or the average age of mothers.

🔍 What they do and do NOT do

  • They describe: they summarize the information you have.
  • They do NOT generalize: descriptive statistics do not involve extending conclusions beyond the data at hand.
  • Don't confuse: generalizing to other cases is the business of inferential statistics, a separate topic.
  • Descriptive statistics are "just descriptive"—they offer insight into the data itself, not predictions or broader claims.

🌍 Why statistics matter in everyday life

🌍 Claims everywhere

The excerpt lists diverse statistical claims from psychology, health, law, sports, and business:

  • "4 out of 5 dentists recommend Dentine."
  • "Almost 85% of lung cancers in men and 45% in women are tobacco-related."
  • "Women make 75 cents to every dollar a man makes when they work the same job."
  • "79.48% of all statistics are made up on the spot."

These examples show that data and data interpretation appear in virtually every facet of contemporary life.

🛡️ Credibility and manipulation

  • Statistics are often presented to add credibility to an argument or advice (e.g., television advertisements).
  • Many numbers thrown about do not represent careful statistical analysis—they can be misleading and push you into decisions you might regret.
  • The British Prime Minister Benjamin Disraeli is quoted as saying, "There are three kinds of lies—lies, damned lies, and statistics."
  • Learning statistics is a long step toward taking control of your life and becoming an intelligent consumer of statistical claims.

❓ Your first reflex: question the statistics

  • To be an intelligent consumer, your first reflex must be to question the statistics you encounter.
  • Think about the numbers, their sources, and most importantly, the procedures used to generate them.
  • Do not blindly accept numbers or findings; reform your statistical habits.

📋 Examples of descriptive statistics in action

💼 Occupation salaries

Table 1 in the excerpt shows average salaries for various U.S. occupations in 1999:

  • Pediatricians: $112,760
  • Dentists: $106,130
  • Elementary school teachers: $39,560
  • Police officers: $38,710
  • Floral designers: $18,980

Insight: We pay the people who educate our children and protect our citizens much less than the people who care for our feet or teeth (podiatrists and dentists).

👫 Unmarried men and women in metro areas

Table 2 shows the number of unmarried men per 100 unmarried women in U.S. metro areas in 1990:

  • Jacksonville, NC: 224 men per 100 women (most men)
  • Sarasota, FL: 66 men per 100 women (most women)

Insight: Descriptive statistics can be useful if you are looking for an opposite-sex partner.

  • The excerpt speculates that more women in Sarasota might be due to elderly individuals moving there and women outliving men, but emphasizes: "in the absence of proper data, this is only speculation."

🏃 Olympic marathon times

Table 3 shows winning times for men (since 1896) and women (since 1984):

  • Women's times range from 2:23:14 (2000) to 2:32:41 (1992).
  • Men's winning times have fallen from 3:28:53 in 1904 (the slowest on record) to 2:32:35 by 1920 and to well under two and a half hours in recent Games.

Insight: Descriptive statistics are central to the world of sports, producing numerous numbers like shooting percentages and race times.

🔄 Distinguishing descriptive from inferential statistics

🔄 The key difference

Aspect       | Descriptive statistics                     | Inferential statistics
What it does | Summarizes and describes the data you have | Generalizes from your data to other cases
Scope        | Only the data at hand                      | Beyond the data at hand
Example      | Average salary of teachers in your dataset | Predicting the average salary of all teachers nationwide

  • The excerpt emphasizes: "Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand."
  • Don't confuse: if you compute an average from your data, that is descriptive; if you use that average to make claims about a larger population, that is inferential.

🎯 The dual purpose of learning statistics

🎯 Defensive and appreciative

  • Defensive: Detect the deceptive use of statistics—protect yourself from fraudulent claims wrapped up as numbers.
  • Appreciative: Recognize statistical evidence that properly supports a stated conclusion.
  • Statistics are all around you, sometimes used well, sometimes not—you must learn to distinguish the two cases.
  • The excerpt closes on a positive note: just as important as detecting deception is appreciating the proper use of statistics.
4

Inferential Statistics

Inferential Statistics

🧭 Overview

🧠 One-sentence thesis

Inferential statistics uses mathematical procedures to convert information from a small sample into intelligent guesses about a larger population, but the validity of these inferences depends critically on how the sample is chosen.

📌 Key points (3–5)

  • What inferential statistics does: converts sample information into inferences about the entire population using mathematical procedures.
  • Why sampling method matters: a sample must be representative of the population; biased samples (e.g., only volunteers, only one geographic region) lead to invalid inferences.
  • Random vs non-random sampling: simple random sampling gives every member equal chance of selection, but non-random samples over-represent some groups and under-represent others.
  • Common confusion: random sampling (how you select subjects from a population) vs random assignment (how you divide an already-selected sample into experimental groups); failure to randomize assignment is more serious than a non-random sample in experiments.
  • Sample size matters: even random samples can be unrepresentative if too small; larger samples are more likely to reflect the population accurately.

🎯 Core concepts

🎯 Population vs sample

Population: the larger set of data from which a sample is drawn.

Sample: a small subset of a larger set of data used to draw inferences about the larger set.

  • The population is the entire group you want to understand (e.g., all Americans, all college seniors, all twins in a registry).
  • The sample is the smaller group you actually study.
  • You query the sample and then generalize to the population.
  • Example: To understand American attitudes about voting fairness, you cannot ask every single American; instead, you ask a few thousand and infer the attitudes of the entire country from their responses.

🧮 What inferential statistics does

Inferential statistics: the mathematical procedures whereby we convert information about the sample into intelligent guesses about the population.

  • It is not just describing the sample; it is using the sample to make claims about the population.
  • Example: If a sample of college seniors took an average of 3.2 math classes, inferential statistics helps you speculate that 3.2 approximates the number for all seniors in the population.
  • The procedures are mathematical and take into account factors like sample size.

🚨 Sampling bias and representativeness

🚨 What makes a sample biased

  • A sample is biased when it over-represents one kind of member at the expense of others.
  • Biased samples cannot be used to infer the attitudes or characteristics of the entire population.
  • Example: A sample made up entirely of Florida residents cannot represent all Americans; a sample of only Republicans cannot represent the political views of the entire country.
  • Example: A teacher asks the 10 students in the front row for their test scores and concludes the class did extremely well. The sample (front-row students) is biased because front-row students tend to be more interested and perform higher than the population (all students in the class).
  • Example: A coach asks for volunteers to do cartwheels and concludes freshmen can do an average of 16 cartwheels. The sample (volunteers) is biased because people who can't do cartwheels probably did not volunteer.

🔍 Why representativeness matters

  • Inferential statistics are based on the assumption that sampling is random.
  • A random sample is trusted to represent different segments of society in close to the appropriate proportions (if the sample is large enough).
  • If the sample is not representative, inferences to the population will be wrong.
  • Example: If you sample too many math majors or too many technical institutions, your estimate of the average number of math classes taken by all college seniors will be too high.

🎲 Simple random sampling

🎲 Definition and requirements

Simple random sampling: a sampling method that requires every member of the population to have an equal chance of being selected into the sample.

  • In addition, the selection of one member must be independent of the selection of every other member.
  • Picking one member must not increase or decrease the probability of picking any other member.
  • Simple random sampling chooses a sample by pure chance.
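
A minimal sketch of simple random sampling in code, assuming a hypothetical registry of 10,000 member IDs; Python's random.sample draws without replacement, giving every member the same chance of ending up in the sample.

```python
import random

# Hypothetical population: ID numbers for 10,000 registry members.
population = list(range(10_000))

# Simple random sample of 200: every member has an equal chance of being selected.
sample = random.sample(population, k=200)
print(len(sample), sample[:5])
```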

🎲 Example of non-random sampling

  • A researcher studying twins selects all those in the National Twin Registry whose last name begins with Z, then every other name beginning with B.
  • Problems:
    • Choosing only Z names does not give every individual an equal chance.
    • It risks over-representing ethnic groups with many Z surnames.
    • The "every-other-one" procedure for B names means adjacent names cannot both be selected, violating independence.
  • Conclusion: This is not simple random sampling; it is biased.

📏 Sample size matters

  • Random samples, especially small ones, are not necessarily representative of the population.
  • Example: A random sample of 20 subjects from a population with equal males and females has a probability of 0.06 that 70% or more would be female—such a sample would not be representative, even though it was drawn randomly.
  • Only a large sample size makes it likely that the sample is close to representative.
  • Inferential statistics take sample size into account when generalizing from samples to populations.
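
The 0.06 figure can be checked with a short binomial calculation (a sketch using only the standard library): with 20 draws from a population that is half female, the probability of 14 or more females (70% or more) works out to about 0.058.

```python
from math import comb

n, p = 20, 0.5
# P(14 or more females in a random sample of 20) when the population is 50% female
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(14, n + 1))
print(round(prob, 3))  # 0.058 -- roughly the 0.06 quoted above
```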

🔀 Random assignment vs random sampling

🔀 Random assignment in experiments

Random assignment: the random division of a sample into two or more groups (e.g., treatment and control).

  • In experimental research, populations are often hypothetical (e.g., there is no actual population of people taking a new drug).
  • A sample is taken from a defined population, then randomly divided into groups.
  • Example: In a study of an anti-depressant, a sample of people with depression is randomly divided into a drug group and a placebo group.
  • Random assignment is critical for the validity of an experiment.

🔀 Why random assignment matters more than random sampling in experiments

  • Failure to assign subjects randomly to groups invalidates the experimental findings.
  • Example: If the first 20 subjects to show up are assigned to the experimental group and the second 20 to the control group, late arrivals (who may be more depressed) would bias the control group.
  • A non-random sample simply restricts the generalizability of the results (you can only generalize to the population you sampled from).
  • Don't confuse: Random sampling = how you select subjects from a population; random assignment = how you divide an already-selected sample into experimental groups.
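
A sketch of random assignment (as opposed to random sampling), assuming a hypothetical list of 40 already-recruited participants: shuffling the whole sample before splitting it removes any link between arrival order and group membership.

```python
import random

# Hypothetical sample: 40 participants with depression, listed in arrival order.
participants = [f"P{i:02d}" for i in range(1, 41)]

random.shuffle(participants)       # randomize the order
drug_group = participants[:20]     # first half after shuffling -> anti-depressant
placebo_group = participants[20:]  # second half -> placebo
print(drug_group[:3], placebo_group[:3])
```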

🧩 Stratified sampling

🧩 What stratified sampling is

Stratified random sampling: a method used when the population has distinct "strata" or groups; you first identify members of each group, then randomly sample from each subgroup so that the sizes of the subgroups in the sample are proportional to their sizes in the population.

  • This method makes the sample more representative than simple random sampling when the population has clear subgroups.

🧩 Example of stratified sampling

  • You want to study views on capital punishment at a university with 70% day students (average age 19) and 30% night students (average age 39).
  • Night students may have different views than day students.
  • To ensure representativeness, your sample of 200 students should consist of 140 day students (70%) and 60 night students (30%).
  • The proportion of day students in the sample matches the proportion in the population, making inferences more secure.
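
A sketch of proportional stratified sampling for the day/night example, with hypothetical rosters; the subgroup sample sizes are fixed by the population proportions (70% and 30%) before any random draws are made.

```python
import random

# Hypothetical rosters.
day_students = [f"day_{i}" for i in range(7000)]      # 70% of the student body
night_students = [f"night_{i}" for i in range(3000)]  # 30% of the student body

total_n = 200
n_day = round(total_n * 0.70)   # 140 day students
n_night = total_n - n_day       # 60 night students

sample = random.sample(day_students, n_day) + random.sample(night_students, n_night)
print(len(sample))  # 200, with strata in the same proportions as the population
```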

🔑 Key takeaways

🔑 When inferences are valid

  • Inferences to a population are valid only if:
    • The sample is representative (not biased).
    • The sampling method is appropriate (ideally random or stratified random).
    • The sample size is large enough.
  • You can only generalize to the population from which the sample was drawn (e.g., if you sample from the National Twin Registry, you can only generalize to twins in that registry, not all twins in the world).

🔑 Practical challenges

  • Simple random sampling is sometimes not feasible (e.g., how do you contact people without phones? how do you account for people who just moved?).
  • For this reason, other sampling techniques (like stratified sampling) have been devised.
  • The excerpt emphasizes that the sampling procedure, not the results, defines whether a sample is random.
5

Levels of Measurement

Variables

🧭 Overview

🧠 One-sentence thesis

The level of measurement (nominal, ordinal, interval, or ratio) determines what statistical operations are meaningful for a variable, and misunderstanding these distinctions can lead to serious analytical errors.

📌 Key points (3–5)

  • Four scale types exist: nominal (categories only), ordinal (ordered categories), interval (equal intervals, no true zero), and ratio (equal intervals with true zero).
  • Each scale has different properties: higher-level scales preserve all properties of lower-level scales plus additional ones.
  • Scale type constrains valid operations: you cannot meaningfully compute averages of nominal data, and ratios are only meaningful for ratio scales.
  • Common confusion: changing response format to numbers does not change the underlying scale—a 1-to-4 satisfaction rating is still ordinal, not interval.
  • Psychological measures are often ordinal: rating scales (e.g., 5-point or 7-point) typically do not guarantee equal intervals across the range.

📏 The four scale types

🏷️ Nominal scales

Nominal scale: measurement that names or categorizes responses without implying any ordering among them.

  • The lowest level of measurement.
  • Variables like gender, favorite color, religion, or handedness.
  • No sense in which one category comes "before" or "after" another.
  • Example: classifying people by favorite color—green is not "ahead of" blue.
  • Don't confuse with: just because you assign numbers as codes (e.g., Blue=1, Red=2) doesn't make it a higher scale.

📊 Ordinal scales

Ordinal scale: measurement in which responses are ordered from least to most, but differences between levels are not necessarily equal.

  • Items are ranked or ordered (e.g., "very dissatisfied" to "very satisfied").
  • You can say one person is more satisfied than another.
  • Key limitation: the difference between "very dissatisfied" and "somewhat dissatisfied" may not equal the difference between "somewhat dissatisfied" and "somewhat satisfied."
  • The intervals between adjacent scale values do not necessarily represent equal psychological distances.
  • Example: consumer satisfaction ratings—the mental step from 1 to 2 may not equal the step from 3 to 4.
  • Don't confuse with: changing to numbers (1, 2, 3, 4) does not make intervals equal; the scale remains ordinal.

📐 Interval scales

Interval scale: a numerical scale in which intervals have the same interpretation throughout, but there is no true zero point.

  • Equal differences at any point on the scale represent equal differences in the underlying quantity.
  • Example: Fahrenheit temperature—the difference between 30° and 40° equals the difference between 80° and 90°.
  • Key limitation: no true zero point. Zero degrees Fahrenheit does not mean "absence of temperature"; it's an arbitrary label.
  • Because there's no true zero, ratios are not meaningful—you cannot say 80° is "twice as hot" as 40° (the ratio would change if you shifted the zero point).

📏 Ratio scales

Ratio scale: the most informative scale; an interval scale with a true zero point indicating the absence of the quantity being measured.

  • Combines all properties of lower scales: naming (nominal), ordering (ordinal), equal intervals (interval), plus meaningful ratios.
  • Zero means complete absence of the quantity.
  • Examples:
    • Kelvin temperature scale (absolute zero exists).
    • Money (zero money = no money).
    • Number of items recalled in a memory test.
  • Ratios are meaningful: someone with 50 cents has twice as much money as someone with 25 cents.
  • Example: if one temperature is twice another on the Kelvin scale, it has twice the kinetic energy.

🧪 Psychological measurement challenges

🧪 Rating scales are typically ordinal

  • Psychological research frequently uses 5-point or 7-point rating scales (e.g., pain level, product liking, confidence).
  • These are ordinal scales: no assurance that a given difference represents the same thing across the range.
  • Example: reducing pain from level 3 to level 2 may not represent the same relief as reducing from level 7 to level 6.

🧪 Number of items recalled: ratio or not?

  • At first glance, "number of items correctly recalled" seems like a ratio scale:
    • True zero exists (some subjects recall zero items).
    • A difference of one item is consistent across the scale.
    • Valid to say someone who recalled 12 items recalled twice as many as someone who recalled 6.
  • But it's more complicated: if some items are easy and others are difficult, the difference between recalling 2 vs. 3 easy items may not signify the same memory difference as recalling 7 vs. 8 items (where the 8th is a difficult item).
  • Conclusion: it is often inappropriate to treat psychological measurement scales as interval or ratio.

⚠️ Consequences of misunderstanding scale types

⚠️ Invalid operations lead to nonsense

  • The relationship between scale type and valid statistics is crucial.
  • Example: if favorite color is coded as Blue=1, Red=2, Yellow=3, Green=4, Purple=5, computing the average code is meaningless.
    • If the average is 3, it would be nonsense to conclude "the average favorite color is yellow."
    • This is like "counting the number of letters in the name of a snake to see how long the beast is."

⚠️ Can you compute the mean of ordinal data?

  • Statisticians have debated this for decades.
  • Prevailing opinion: for almost all practical situations, the mean of an ordinally-measured variable is meaningful.
  • Caution: there are extreme situations where computing the mean can be very misleading.

⚠️ Choosing appropriate statistics requires judgment

  • Understanding scale types helps you choose meaningful statistics.
  • "Statistics is not just recipes!"—good judgment is required.

📊 Summary table

Scale type | Ordering? | Equal intervals? | True zero? | Example                   | Valid operations
Nominal    | No        | No               | No         | Favorite color, gender    | Frequency counts, mode
Ordinal    | Yes       | No               | No         | Satisfaction rating (1-5) | Median, percentiles
Interval   | Yes       | Yes              | No         | Fahrenheit temperature    | Mean, standard deviation (but not ratios)
Ratio      | Yes       | Yes              | Yes        | Money, Kelvin temperature | All operations, including ratios

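As a rough illustration of how scale type constrains the summary statistic (a sketch with made-up data): a mode for nominal categories, a median for ordinal ratings, a mean for a ratio-scaled amount.

```python
from statistics import mean, median, mode

favorite_color = ["blue", "red", "blue", "green", "blue"]  # nominal: categories only
satisfaction = [1, 2, 2, 3, 4, 4, 4]                       # ordinal: 1-4 ratings
money_cents = [0, 25, 50, 50, 100]                         # ratio: true zero, ratios meaningful

print(mode(favorite_color))  # 'blue' -- most frequent category
print(median(satisfaction))  # 3      -- middle rating
print(mean(money_cents))     # 45.0   -- average amount; 50 cents is twice 25 cents
```
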
6

Percentiles

Percentiles

🧭 Overview

🧠 One-sentence thesis

Percentiles divide a distribution into parts, enabling researchers to understand data position and spread through cumulative frequency representations and comparisons across distributions.

📌 Key points (3–5)

  • What percentiles show: Position within a distribution; the 25th, 50th, and 75th percentiles are especially important for describing spread.
  • Cumulative frequency polygons: Plot cumulative counts against values, showing how data accumulate across the range.
  • Comparing distributions: Overlaying frequency polygons or cumulative frequency polygons reveals differences in central tendency and spread between groups.
  • Common confusion: Frequency polygons vs. cumulative frequency polygons—the former shows counts at each value; the latter shows running totals up to each value.
  • Why it matters: Percentiles and their graphical representations (like box plots) summarize distributions compactly and enable visual comparison.

📊 Frequency representations

📊 Frequency polygons

Frequency polygon: a graph where points representing frequencies are connected by lines.

  • Each point corresponds to a test score (or bin) on the X-axis and its frequency on the Y-axis.
  • Lines connect the points, forming a polygon shape.
  • Why useful: Makes it easy to see the shape of the distribution at a glance.
  • Example: In a psychology test with scores from 35 to 175, the polygon peaks around the most common scores and tapers at the extremes.

📈 Cumulative frequency polygons

Cumulative frequency polygon: a graph where each point shows the total number of observations up to and including that value.

  • The Y-axis shows cumulative frequency (running total).
  • The curve always rises (or stays flat) as you move right along the X-axis.
  • Why useful: Shows how data accumulate; makes it easy to read percentiles directly from the graph.
  • Example: For the psychology test, the cumulative frequency at score 105 shows how many students scored 105 or below.
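
A sketch of the cumulative idea with hypothetical binned score counts: a running total of the frequencies tells you how many observations fall at or below each value.

```python
from itertools import accumulate

# Hypothetical frequency table: (upper score of bin, number of students in bin)
bins = [(55, 3), (75, 10), (95, 18), (105, 11), (135, 6), (175, 2)]

cumulative = list(accumulate(count for _, count in bins))
for (upper, _), running_total in zip(bins, cumulative):
    print(f"{running_total} students scored {upper} or below")
```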

🔍 Overlaying polygons for comparison

  • Purpose: Compare two or more distributions on the same graph.
  • Frequency polygons or cumulative frequency polygons can be overlaid.
  • Example: A cursor-movement task compared small vs. large targets. Overlaying the frequency polygons showed that times were generally longer for the small target, though there was some overlap.
  • Don't confuse: Overlaying frequency polygons shows where distributions differ in shape and center; overlaying cumulative frequency polygons shows differences in how quickly data accumulate.

🎯 Percentiles and box plots

🎯 Key percentiles (hinges)

  • In box plot terminology, the 25th and 75th percentiles are called hinges; the 50th percentile is the median.
  • Lower hinge: 25th percentile.
  • Median: 50th percentile (the middle line in the box).
  • Upper hinge: 75th percentile.
  • These three values define the "box" in a box plot.

📦 Box plots

Box plot: a graphical display showing the 25th, 50th, and 75th percentiles, along with whiskers extending to adjacent values and markers for outliers.

  • The box: Extends from the 25th to the 75th percentile; the line inside is the median.
  • Whiskers: Vertical lines extending from the box to the "adjacent values" (the largest/smallest values that are not outliers).
  • Outliers: Marked with small circles (outside values) or asterisks (far-out values) if they fall beyond the inner or outer fences.
  • Mean: Sometimes shown with a plus sign (+) inside or near the box.
  • Example: For women's times in a study, the 25th percentile was 17, the median was 19, and the 75th percentile was 20. The box plot showed one outside value at 29.

📏 H-spread and fences

  • H-spread: Upper hinge minus lower hinge (the height of the box).
  • Step: 1.5 times the H-spread.
  • Inner fences: Upper hinge + 1 step; lower hinge − 1 step.
  • Outer fences: Upper hinge + 2 steps; lower hinge − 2 steps.
  • Adjacent values: The largest data value below the upper inner fence and the smallest data value above the lower inner fence.
  • Outside values: Values beyond an inner fence but not beyond an outer fence.
  • Far-out values: Values beyond an outer fence.
  • Example: For women's times, H-spread = 3, step = 4.5, upper inner fence = 24.5, lower inner fence = 12.5. The upper adjacent value was 24; the outside value was 29.
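
The hinge-and-fence arithmetic is mechanical enough to script. A sketch, using hypothetical scores constructed so the summary reproduces the values reported above (hinges 17 and 20, median 19, upper adjacent value 24, outside value 29); note that different packages compute percentiles in slightly different ways.

```python
import numpy as np

def box_plot_summary(scores):
    scores = np.sort(np.asarray(scores, dtype=float))
    lower_hinge, median, upper_hinge = np.percentile(scores, [25, 50, 75])
    h_spread = upper_hinge - lower_hinge
    step = 1.5 * h_spread
    inner = (lower_hinge - step, upper_hinge + step)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)
    inside = scores[(scores >= inner[0]) & (scores <= inner[1])]
    outside = scores[((scores < inner[0]) | (scores > inner[1]))
                     & (scores >= outer[0]) & (scores <= outer[1])]
    far_out = scores[(scores < outer[0]) | (scores > outer[1])]
    return {"hinges": (lower_hinge, upper_hinge), "median": median,
            "H-spread": h_spread, "step": step,
            "inner fences": inner, "outer fences": outer,
            "adjacent values": (inside.min(), inside.max()),
            "outside values": list(outside), "far-out values": list(far_out)}

times = [15, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 24, 29]  # hypothetical times (s)
print(box_plot_summary(times))
```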

🔄 Using percentiles for comparison

🔄 Comparing groups with box plots

  • Parallel box plots: Multiple box plots side by side, one for each group.
  • Makes it easy to compare medians, spreads, and outliers across groups.
  • Example: Comparing men's and women's times showed that half of women's times were between 17 and 20 seconds, while half of men's times were between 19 and 25.5 seconds.

🔄 Interpreting differences

  • If the boxes don't overlap much, the groups differ in central tendency.
  • If one box is much taller than another, that group has more variability.
  • Outliers in one group but not another suggest asymmetry or unusual observations.
  • Don't confuse: A box plot shows the middle 50% of the data (between the hinges), not the full range. Whiskers and outlier markers show the rest.
7

Levels of Measurement

Levels of Measurement

🧭 Overview

🧠 One-sentence thesis

The excerpt does not contain substantive content about levels of measurement; it consists primarily of statistical tables, formulas, and exercise questions from a statistics textbook.

📌 Key points (3–5)

  • The excerpt includes a critical-value table for Spearman's ρ (rho) correlation coefficient with various sample sizes and significance levels.
  • A brief statistical literacy example discusses the Wilcoxon rank sum test comparing troponin levels in patients with and without right ventricular strain.
  • The excerpt mentions "Chapter 1: Levels of Measurement" as a prerequisite but does not explain what levels of measurement are.
  • The material appears to be from a statistics textbook chapter on non-parametric tests and effect sizes, not specifically about measurement levels.
  • No actual teaching content about levels of measurement (nominal, ordinal, interval, ratio) is present in the excerpt.

📊 What the excerpt contains

📊 Statistical tables and reference material

The excerpt provides:

  • A critical-value table for Spearman's rank correlation coefficient (ρ) showing thresholds for sample sizes from N=5 to N=50
  • Values are given for both one-tailed and two-tailed tests at .05 and .01 significance levels
  • Example: For N=5, the critical value for a one-tailed test at .05 level is 0.90

🔬 Statistical literacy example

A brief case describes comparing troponin concentration in patients with and without signs of right ventricular strain using the Wilcoxon rank sum test.

The example notes:

  • Patients with right ventricular strain had higher median troponin (0.03 ng/ml) than those without (< 0.01 ng/ml)
  • The difference was significant at p<0.001
  • The authors suggest the Wilcoxon test may have been chosen because distributions were very non-normal

⚠️ Content limitation

⚠️ Missing core content

The excerpt does not explain:

  • What levels of measurement are
  • The different types of measurement scales
  • How to identify or distinguish between measurement levels
  • Why measurement levels matter for statistical analysis

The title "Levels of Measurement" appears only as a prerequisite reference, not as the actual topic covered in this excerpt.

8

Distributions

Distributions

🧭 Overview

🧠 One-sentence thesis

This excerpt defines key statistical terms related to variance assumptions, independence, measurement scales, and distribution shapes, which are foundational for understanding regression, experimental design, and inferential statistics.

📌 Key points (3–5)

  • Variance assumptions: homogeneity of variance and homoscedasticity both require equal spread across groups or predictor values.
  • Independence concepts: variables are independent if knowing one tells you nothing about the other; events are independent if one's occurrence doesn't change the probability of the other.
  • Common confusion: independence of variables vs. independence of events—variables relate to correlation (Pearson's r = 0), events relate to conditional probability (P(A|B) = P(A)).
  • Measurement scales: interval scales have equal intervals throughout but no true zero, unlike ratio scales.
  • Distribution shape: kurtosis measures tail thickness—leptokurtic means long/fat tails, platykurtic means short/thin tails, normal distributions have zero kurtosis.

📏 Variance and spread assumptions

📏 Homogeneity of variance

The assumption that the variances of all the populations are equal.

  • This is a requirement for many statistical tests.
  • It means the spread of data is the same across different groups being compared.

📏 Homoscedasticity

In linear regression, the assumption that the variance around the regression line is the same for all values of the predictor variable.

  • This is the regression-specific version of equal variance.
  • The scatter of points around the fitted line should be consistent whether the predictor is low or high.
  • Don't confuse: homogeneity of variance applies to group comparisons; homoscedasticity applies to regression contexts.

📏 H-Spread and interquartile range

  • H-Spread: the difference between the upper hinge and the lower hinge in a box plot.
  • Interquartile Range (IQR): the 75th percentile minus the 25th percentile; a robust measure of variability.
  • Both describe the middle 50% of the data.

🔗 Independence concepts

🔗 Independence of variables

Two variables are said to be independent if the value of one variable provides no information about the value of the other variable.

  • Independent variables are uncorrelated: Pearson's r would be 0.
  • Knowing one variable's value doesn't help predict the other.
  • Example: if Variable A and Variable B are independent, observing A tells you nothing about B.

🔗 Independence of events

Events A and B are independent events if the probability of Event B occurring is the same whether or not Event A occurs.

  • Formal definition: P(A|B) = P(A) and P(B|A) = P(B).
  • Example from the excerpt: throwing two dice—the probability the second die shows 1 is independent of whether the first die showed 1.
  • Don't confuse: this is about probability of outcomes, not correlation of measurements.
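
A quick simulation of the dice example (a sketch; the counts fluctuate from run to run): the relative frequency of the second die showing 1 is about 1/6 whether or not the first die showed 1.

```python
import random

rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

p_b = sum(b == 1 for _, b in rolls) / len(rolls)
second_when_first_is_1 = [b for a, b in rolls if a == 1]
p_b_given_a = sum(b == 1 for b in second_when_first_is_1) / len(second_when_first_is_1)

print(round(p_b, 3), round(p_b_given_a, 3))  # both close to 1/6, about 0.167
```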

🔗 Independent variable (factor)

Variables that are manipulated by the experimenter, as opposed to dependent variables.

  • Most experiments observe the effect of independent variables on dependent variables.
  • This is a different use of "independent"—it means "controlled by the researcher," not "statistically unrelated."

📐 Measurement scales

📐 Interval scale

One of four commonly used levels of measurement, an interval scale is a numerical scale in which intervals have the same meaning throughout.

  • Equal intervals: the difference between 30 and 40 has the same meaning as the difference between 80 and 90.
  • Example from the excerpt: Fahrenheit temperature—each 10-degree interval represents the same physical temperature difference.
  • Key limitation: interval scales do not have a true zero point, unlike ratio scales.
  • Example: 0°F does not mean "no temperature"; it is an arbitrary point on the scale.

📊 Distribution shape and tails

📊 Kurtosis

Kurtosis measures how fat or thin the tails of a distribution are relative to a normal distribution.

  • Normal distributions have zero kurtosis (the reference point).
  • The excerpt notes kurtosis is "commonly defined" by a formula (not reproduced here in words).

📊 Leptokurtic distributions

A distribution with long tails relative to a normal distribution is leptokurtic.

  • "Long tails" means more extreme values appear more often than in a normal distribution.
  • These distributions have positive kurtosis.

📊 Platykurtic distributions

  • Distributions with short tails are called platykurtic.
  • "Short tails" means fewer extreme values than a normal distribution.
  • These distributions have negative kurtosis.

Distribution type | Tail length      | Kurtosis value
Leptokurtic       | Long/fat tails   | Positive
Normal            | Reference        | Zero
Platykurtic       | Short/thin tails | Negative
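
A sketch comparing sample kurtosis for three simulated shapes, assuming SciPy is available; scipy.stats.kurtosis reports excess kurtosis, so a normal distribution comes out near zero.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 100_000

samples = {
    "normal (reference)": rng.normal(size=n),        # kurtosis near 0
    "laplace (leptokurtic)": rng.laplace(size=n),     # long tails, kurtosis near +3
    "uniform (platykurtic)": rng.uniform(-1, 1, n),   # short tails, kurtosis near -1.2
}

for name, x in samples.items():
    print(name, round(kurtosis(x), 2))
```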

🧪 Experimental and regression concepts

🧪 Level

When a factor consists of various treatment conditions, each treatment condition is considered a level of that factor.

  • Example from the excerpt: if the factor is drug dosage and three doses are tested, each dosage is one level, and the factor has three levels.
  • This terminology is used in experimental design and ANOVA.

🧪 Interaction

Two independent variables interact if the effect of one of the variables differs depending on the level of the other variable.

  • The effect of one variable is not constant—it changes based on the other variable.
  • Example from the excerpt's interaction plot: the effect of dosage is different for males than for females.

🧪 Influence (in regression)

Influence refers to the degree to which a single observation in regression influences the estimation of the regression parameters.

  • Measured by how much predicted scores for other observations would differ if that observation were removed.
  • High-influence points can strongly affect the regression line.

🔢 Inferential statistics and estimation

🔢 Inferential statistics

The branch of statistics concerned with drawing conclusions about a population from a sample.

  • Generally done through random sampling.
  • Inferences are made about central tendency or other aspects of a distribution.

🔢 Interval estimate

An interval estimate is a range of scores likely to contain the estimated parameter.

  • Can be used synonymously with "confidence interval."
  • Provides a range rather than a single point estimate.

📈 Graphical techniques

📈 Jitter

When points in a graph are jittered, they are moved horizontally so that all the points can be seen and none are hidden due to overlapping values.

  • This is a visualization technique to reveal overlapping data points.
  • The excerpt mentions an example is shown (in the original figure).

📈 Interaction plot

  • Displays levels of one variable on the X axis.
  • Has a separate line for the means of each level of the other variable.
  • The Y axis is the dependent variable.
  • The excerpt's example shows that the effect of dosage differs for men vs. women (non-parallel lines indicate interaction).
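
A sketch of an interaction plot with made-up group means (dosage levels on the X axis, one line per gender, the dependent variable on the Y axis); non-parallel lines are the visual signature of an interaction.

```python
import matplotlib.pyplot as plt

doses = [0, 25, 50, 100]              # levels of the dosage factor
female_means = [2.0, 3.1, 4.6, 5.0]   # hypothetical group means
male_means = [2.1, 2.4, 2.8, 4.7]     # hypothetical group means; dosage acts differently

plt.plot(doses, female_means, marker="o", label="Female")
plt.plot(doses, male_means, marker="s", label="Male")
plt.xlabel("Dosage")
plt.ylabel("Mean of dependent variable")
plt.legend()
plt.show()
```
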
9

Summation Notation

Summation Notation

🧭 Overview

🧠 One-sentence thesis

Summation notation provides a compact mathematical shorthand for expressing the addition of numbers, which is essential for statistical formulas and avoids confusion between operations like summing-then-squaring versus squaring-then-summing.

📌 Key points (3–5)

  • What summation notation does: expresses the sum of numbers using the Greek letter Σ (sigma) with index variables to specify which values to add.
  • How the notation works: the index variable (e.g., i = 1 at the bottom and 4 at the top) tells you where to start and stop summing.
  • Common confusion: the order of operations matters—summing all values then squaring the total gives a completely different result from squaring each value first then summing those squares.
  • Abbreviated notation: when no index values are shown, it means sum all values of the variable.
  • Cross products: summation notation can also express the sum of products of two variables (XY).

📐 Basic summation structure

📐 The sigma symbol and index

The Greek letter Σ indicates summation.

  • The notation includes three parts: the sigma symbol, the index specification, and the variable to be summed.
  • The index variable (commonly i) appears at the bottom with a starting value (e.g., "i = 1") and at the top with an ending value (e.g., 4).
  • The variable name with subscript (e.g., X sub i) tells you which values to add as i changes.
  • Example: If you have four grape weights labeled X₁, X₂, X₃, X₄, the notation with i = 1 at bottom and 4 at top means add X₁ + X₂ + X₃ + X₄.

🔢 Partial sums

  • You can sum only a subset of values by changing the upper limit of the index.
  • Example: If the index goes from 1 to 3 (instead of 1 to 4), you sum only the first three values.
  • The excerpt shows: summing from i = 1 to 3 gives 4.6 + 5.1 + 4.9, which equals the total minus the fourth value.

✂️ Abbreviated notation

  • When all values of a variable should be summed, the index limits are often omitted.
  • The notation then shows just Σ X without any i = 1 or upper limit.
  • This shorthand means "sum all the values of X."

⚠️ Order of operations

⚠️ Sum-then-square vs square-then-sum

The excerpt emphasizes a critical distinction:

Operation       | Notation | What it means                                 | Example result
Sum then square | (Σ X)²   | Add all values, then square the total         | 19² = 361
Square then sum | Σ X²     | Square each value first, then add the squares | 21.16 + 26.01 + 24.01 + 19.36 = 90.54

  • These two operations produce very different results even though they use the same numbers.
  • The parentheses and position of the exponent determine the order.
  • Don't confuse: (Σ X)² ≠ Σ X² because the first squares the sum (19² = 361) while the second sums the squares (90.54).

🧮 Squaring individual values

  • When the notation shows Σ X², you square each individual value before summing.
  • Example: For values 4.6, 5.1, 4.9, 4.4, you compute 4.6² + 5.1² + 4.9² + 4.4².
  • This gives 21.16 + 26.01 + 24.01 + 19.36 = 90.54.
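
A sketch with the grape-weight values, showing that the two orders of operations really do give the two different results quoted above.

```python
weights = [4.6, 5.1, 4.9, 4.4]                   # the four X values

total = sum(weights)                             # Σ X  = 19.0
sum_then_square = total ** 2                     # (Σ X)² = 361.0
square_then_sum = sum(x ** 2 for x in weights)   # Σ X²  ≈ 90.54

print(total, sum_then_square, round(square_then_sum, 2))
```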

🔗 Cross products

🔗 Summing products of two variables

  • Some formulas require multiplying corresponding values from two variables (X and Y) and then summing those products.
  • The notation Σ XY means: for each index value, multiply X by Y, then add all those products.
  • Example: If X values are 1, 2, 3 and Y values are 3, 2, 7, the cross products are 1×3=3, 2×2=4, 3×7=21, and the sum is 3 + 4 + 21 = 28.

📊 Cross product table

The excerpt shows a table format for organizing cross products:

X | Y | XY
1 | 3 | 3
2 | 2 | 4
3 | 7 | 21

  • The third column shows each product, and the sum of that column is Σ XY.
  • This notation is written as Σ XY = 28 in the example.
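
The cross-product sum can be written the same way (a sketch using the example values):

```python
X = [1, 2, 3]
Y = [3, 2, 7]

cross_products = [x * y for x, y in zip(X, Y)]  # [3, 4, 21]
print(sum(cross_products))                      # Σ XY = 28
```
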
10

Linear Transformations

Linear Transformations

🧭 Overview

🧠 One-sentence thesis

Linear transformations preserve the structure of statistical relationships by scaling measures of spread while shifting measures of central tendency in predictable ways.

📌 Key points (3–5)

  • What a linear transformation is: multiplying a variable by a constant and/or adding a constant (e.g., converting Fahrenheit to Celsius).
  • Effect on central tendency: the mean and median transform by the same rule—multiply by the constant, then add the constant.
  • Effect on spread: the standard deviation is multiplied by the constant; the variance is multiplied by the constant squared.
  • What stays unchanged: the correlation between variables is unaffected by linear transformations.
  • Common confusion: adding a constant shifts location but does not change spread; only multiplication affects variability.

📐 What is a linear transformation

📐 Definition and formula

A linear transformation creates a new variable Y from an existing variable X using the formula: Y = bX + A, where b and A are constants.

  • The transformation has two parts: multiplication (scaling) and addition (shifting).
  • Example from the excerpt: converting Fahrenheit to Celsius uses C = 0.55556F - 17.7778.
  • The constant b scales the values; the constant A shifts them.

🌡️ Temperature conversion example

The excerpt uses temperatures in five cities to illustrate:

  • Original data: temperatures in Fahrenheit.
  • Transformation: multiply each temperature by 0.556, then subtract 17.7778 to get Celsius.
  • This is a linear transformation because it follows the Y = bX + A pattern.

📊 How transformations affect central tendency

📊 Mean and median behavior

  • Both the mean and median follow the same transformation rule.
  • If the original mean is μ, the new mean is b·μ + A.
  • If the original median is M, the new median is b·M + A.

Example from the excerpt:

  • Mean temperature in Fahrenheit: 54 degrees.
  • After transformation: (0.556)(54) - 17.7778 = 12.22 degrees Celsius.
  • The same rule applies to the median.

🔄 Why this works

  • Linear transformations preserve order and relative spacing.
  • Every value is transformed by the same rule, so the "center" of the distribution shifts and scales accordingly.
  • Don't confuse: the transformation changes the numerical value but not the relative position of the mean or median within the distribution.

📏 How transformations affect variability

📏 Standard deviation rule

  • The standard deviation is multiplied by the absolute value of the constant b.
  • If the original standard deviation is σ, the new standard deviation is |b|·σ (simply b·σ when b is positive).

Example from the excerpt:

  • Standard deviation in Fahrenheit: 18.166.
  • Standard deviation in Celsius: 18.166 × 0.556 = 10.092.

📦 Variance rule

  • The variance is multiplied by b squared.
  • If the original variance is σ², the new variance is b²·σ².

Example from the excerpt:

  • Variance in Fahrenheit: 330.
  • Variance in Celsius: 330 × (0.556)² = 101.852.

➕ Why adding a constant doesn't change spread

  • Adding the same number to every value shifts the entire distribution but doesn't change how spread out the values are.
  • Only the multiplication part (b) affects variability.
  • Example: adding 5 points to every test score doesn't change the standard deviation of the scores.

🔗 Correlation is unaffected

🔗 A critical property

  • Multiplying a variable by a constant and/or adding a constant does not change its correlation with other variables.
  • This means the strength and direction of relationships are preserved.

🔍 Why this matters

  • The correlation between Weight and Height is the same whether Height is measured in inches, feet, or miles.
  • Adding five points to every student's test score would not change the correlation of the test score with GPA.
  • This property makes correlation a "scale-free" measure of association.

⚠️ Don't confuse

  • While the numerical values of variables change under transformation, the pattern of their relationship does not.
  • This is why correlation is useful for comparing relationships across different units of measurement.

📋 Summary formula

📋 Complete transformation rules

If variable X has mean μ, standard deviation σ, and variance σ², and you create Y = bX + A:

Measure                          | Original (X) | Transformed (Y)
Mean                             | μ            | b·μ + A
Standard deviation               | σ            | b·σ
Variance                         | σ²           | b²·σ²
Correlation with other variables | r            | r (unchanged)
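
A sketch verifying these rules numerically with hypothetical Fahrenheit temperatures (the excerpt's five city values are not reproduced here):

```python
import numpy as np

fahrenheit = np.array([30.0, 42.0, 55.0, 68.0, 80.0])  # hypothetical temperatures
b, A = 0.5556, -17.7778                                 # Celsius ≈ b·F + A
celsius = b * fahrenheit + A

print(celsius.mean(), b * fahrenheit.mean() + A)        # mean follows b·μ + A
print(celsius.std(), abs(b) * fahrenheit.std())         # SD is multiplied by |b|
print(celsius.var(), b**2 * fahrenheit.var())           # variance is multiplied by b²

humidity = np.array([30.0, 45.0, 50.0, 60.0, 72.0])     # hypothetical second variable
print(np.corrcoef(fahrenheit, humidity)[0, 1],
      np.corrcoef(celsius, humidity)[0, 1])             # correlation is unchanged
```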

🔬 Linear algebra note

The excerpt notes that "linear transformation" is defined differently in the field of linear algebra, and provides a link for details (not included in the excerpt).

11

Logarithms

Logarithms

🧭 Overview

🧠 One-sentence thesis

The Poisson distribution uses the base of natural logarithms (e) to calculate probabilities of various numbers of events when the mean number of events is known and the events are independent.

📌 Key points (3–5)

  • What the Poisson distribution calculates: probabilities of various numbers of "successes" (events) based on the mean number of successes.
  • Key requirement: the events must be independent for the Poisson distribution to apply.
  • The role of logarithms: e (the base of natural logarithms, approximately 2.7183) is a core component of the Poisson formula.
  • Mean and variance: both the mean and variance of the Poisson distribution equal μ (the mean number of successes).
  • Common confusion: "success" does not mean a positive outcome; it simply means the outcome in question occurs.

📐 The Poisson distribution formula

📐 Core components

The excerpt introduces a formula for calculating Poisson probabilities with three key elements:

  • e: the base of natural logarithms (2.7183)
  • μ: the mean number of "successes"
  • x: the number of "successes" in question

The formula uses these components to compute the probability of observing exactly x events when the mean is μ.

🔢 How the formula works

  • The formula involves e raised to the negative mean (e to the power of negative μ).
  • It also involves μ raised to the power of x.
  • The denominator includes x factorial (x!).
  • Example: If the mean number of calls to a fire station on a weekday is 8, you can calculate the probability of exactly 11 calls on a given weekday by setting μ = 8 and x = 11.
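
A sketch of the calculation for the fire-station example; writing the described pieces out, the probability of exactly x events is e^(−μ) · μ^x / x!.

```python
from math import exp, factorial

def poisson_probability(x, mu):
    """Probability of exactly x events when the mean number of events is mu."""
    return exp(-mu) * mu**x / factorial(x)

print(round(poisson_probability(11, mu=8), 4))  # about 0.0722 for 11 calls when the mean is 8
```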

🎯 When to use the Poisson distribution

🎯 Independence requirement

The Poisson distribution can be used to calculate the probabilities of various numbers of "successes" based on the mean number of successes, provided the various events must be independent.

  • The events being counted must not influence each other.
  • If events are not independent, the Poisson distribution does not apply.

🔄 What "success" means

  • The term "success" does not carry its traditional positive meaning.
  • It simply refers to the outcome in question occurring.
  • Example: In the fire station scenario, a "success" is a call to the station—not necessarily a positive event, just the event being counted.

📊 Properties of the Poisson distribution

📊 Mean and variance

Property | Value
Mean     | μ
Variance | μ

  • Both the mean and the variance of the Poisson distribution are equal to μ.
  • This is a distinctive feature: unlike many distributions where mean and variance differ, the Poisson distribution has them equal.
  • Example: If the mean number of calls is 8, the variance is also 8.
12

Graphing Qualitative Variables

Graphing Qualitative Variables

🧭 Overview

🧠 One-sentence thesis

Box plots reveal key distribution features—medians, hinges, and outliers—and are especially effective for comparing groups, though they sacrifice detail that histograms preserve.

📌 Key points (3–5)

  • What box plots show: median, hinges (25th and 75th percentiles), whiskers, and outliers; half the data lies between the hinges.
  • Strengths: excellent for portraying extreme values and comparing distributions side-by-side.
  • Limitations: many distribution details are hidden; histograms or stem-and-leaf displays are needed for finer structure.
  • Common confusion: box plots vs histograms—box plots summarize key statistics but do not show the full shape; histograms reveal detailed frequency patterns.
  • Variations exist: options include marking outliers, showing means, jittering individual points, and scaling box width by sample size.

📦 Anatomy of a box plot

📏 Core components

A box plot displays five key statistics plus outliers:

Hinges: the 25th and 75th percentiles of the distribution.

Median: the middle value (50th percentile).

  • The box spans from the lower hinge to the upper hinge; half of all scores fall inside this box.
  • Whiskers extend from the hinges toward the minimum and maximum values (within defined fences).
  • Outliers are marked separately (often with symbols like circles or plus signs) when they fall beyond the fences.

Example: In the women's time data, half the scores lie between 17 and 20 seconds (the hinges), and the median is 19 seconds.

🧮 Additional markers

  • Mean: some box plots mark the mean with a line or plus sign (distinct from the median).
  • Overall mean: a gray line may indicate the grand mean across all groups.
  • Individual scores: dots can represent each data point; when scores are rounded, one dot may represent multiple subjects.

Don't confuse: the median (middle value) and the mean (arithmetic average) are different; box plots can show both.

🔍 Interpreting distributions with box plots

📊 Comparing groups

Box plots are especially powerful for side-by-side comparisons.

  • The excerpt compares women's and men's times:
    • Women: hinges at 17 and 20 seconds.
    • Men: hinges at 19 and 25.5 seconds.
  • Interpretation: women generally named colors faster (lower times), though one woman was slower than almost all men (an outlier).

Example: If two groups have the same median but different hinge spreads, the group with wider hinges has more variability in the middle 50% of scores.

📐 Detecting skew

  • Positive skew: longer whisker in the positive direction; mean larger than median.
  • Negative skew: longer whisker in the negative direction; mean smaller than median.
  • The excerpt notes that skew can be inferred from whisker length and mean-median comparison.

Don't confuse: a longer whisker does not mean more data in that direction—it reflects the spread of extreme values, not frequency.

🎨 Variations and customization

🎛️ Common options

Statistical software offers several box plot styles; the excerpt lists five variations:

| Feature | What it does |
| --- | --- |
| Outlier marking | Some plots omit outlier symbols entirely |
| Mean indicators | Green lines or plus signs can mark group means |
| Grand mean line | A gray line shows the overall mean across all groups |
| Individual points | Dots represent each score (may overlap if rounded) |
| Box width | Width proportional to sample size (e.g., wider box for 31 women vs 16 men) |

🎲 Jittering

  • Purpose: spread out dots at the same horizontal position so multiple occurrences of the same score are visible.
  • How it works: each dot is randomly shifted horizontally (without overlapping exactly).
  • Trade-off: helps reveal repeated scores, but some points may still be obscured depending on dot size and resolution.

Example: If five subjects all scored 20 seconds, jittering spreads the five dots horizontally at the height of 20 seconds instead of stacking them invisibly on top of one another (a minimal sketch follows below).
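
A minimal sketch of horizontal jitter, assuming an arbitrary group position and jitter width (the repeated score of 20 seconds mirrors the example above):

```python
import random

scores = [20, 20, 20, 20, 20]   # five subjects with the same score
x_center = 1.0                  # arbitrary horizontal position of this group
width = 0.05                    # arbitrary maximum jitter offset

# Shift each dot by a small random horizontal amount so repeated scores
# no longer sit exactly on top of one another.
points = [(x_center + random.uniform(-width, width), s) for s in scores]
for x, y in points:
    print(f"x = {x:.3f}, y = {y}")   # feed these (x, y) pairs to a scatter plot
```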

⚖️ When to use box plots vs other graphs

✅ Box plot strengths

  • Portray extreme values clearly.
  • Compare distributions efficiently (multiple groups side-by-side).
  • Summarize key statistics (median, hinges, range) at a glance.

❌ Box plot limitations

  • Many details of the distribution are not revealed.
  • Fine structure (e.g., bimodality, gaps) is hidden.
  • The excerpt recommends creating a histogram or stem-and-leaf display to examine these details.

🧭 Choosing the right graph

  • The excerpt advises: "When exploring your data, you should try several ways of visualizing them."
  • Which graphs to include in a report depends on which aspects of the data you consider most important.
  • No firm rules exist; different styles suit different situations.

Don't confuse: box plots and histograms serve different purposes—box plots summarize key statistics for comparison; histograms show the full frequency distribution.

📊 Bar charts for qualitative variables

📈 What bar charts show

The excerpt introduces bar charts as tools for illustrating frequencies of different categories.

Example: A bar chart shows how many iMac purchasers were previous Macintosh users, previous Windows users, or new computer purchasers.

🔄 Beyond frequency counts

  • Bar charts can present other kinds of quantitative information, not just counts.
  • The excerpt mentions (but does not detail) a bar chart showing percent increases in financial indices.

Don't confuse: bar charts (for categorical data) vs histograms (for continuous data)—bar charts have gaps between bars; histograms have adjacent bars.

13

Graphing Quantitative Variables

Graphing Quantitative Variables

🧭 Overview

🧠 One-sentence thesis

Different graph types serve different purposes when displaying quantitative variables, with the choice depending on data size, the need to compare distributions, and the specific insights you want to reveal.

📌 Key points (3–5)

  • What quantitative variables are: variables measured on a numeric scale (height, weight, response time, exam scores), distinct from categorical variables (favorite color, city of birth).
  • Multiple graph types available: stem and leaf displays, histograms, frequency polygons, box plots, bar charts, line graphs, dot plots, and scatter plots—each suited to different situations.
  • Data size matters: some graphs (stem and leaf) work best for small to moderate data; others (histograms) handle large datasets better.
  • Common confusion: don't use line graphs for qualitative/categorical X-axis variables—line graphs imply a natural numeric ordering that categorical data lack.
  • Purpose-driven choice: some graphs (box plots) excel at comparing distributions; others (scatter plots) show relationships between two variables.

📊 Quantitative vs categorical variables

📊 What quantitative variables are

Quantitative variables: variables measured on a numeric scale.

  • Examples from the excerpt: height, weight, response time, subjective rating of pain, temperature, score on an exam.
  • Key characteristic: ordering and measuring are meaningful.
  • Example: temperature can be ranked (higher/lower) and differences can be measured (10 degrees warmer).

🔀 How they differ from categorical variables

Categorical (qualitative) variables: variables such as favorite color, religion, city of birth, favorite sport in which there is no ordering or measuring involved.

  • Categorical variables have no inherent numeric order.
  • Don't confuse: even if you code categories as numbers (e.g., 1 = red, 2 = blue), the numbers are labels, not measurements.
  • Example: "favorite sport" has no natural ranking—basketball isn't "more" than soccer in a numeric sense.

🎨 Graph types and their purposes

🎨 Seven main types for quantitative data

The excerpt lists seven graph types for quantitative variables:

| Graph type | Best use case (from excerpt) |
| --- | --- |
| Stem and leaf displays | Small to moderate amounts of data |
| Histograms | Large amounts of data |
| Frequency polygons | (Not specified in excerpt) |
| Box plots | Depicting differences between distributions |
| Bar charts | (Not specified in excerpt) |
| Line graphs | (Not specified in excerpt) |
| Dot plots | (Not specified in excerpt) |

  • Scatter plots are mentioned separately—they show the relationship between two variables and are discussed in a different chapter.

🔍 Matching graph to data size

  • Small to moderate data: stem and leaf displays work well because you can see individual data points.
  • Large data: histograms are better suited—they summarize large datasets without overwhelming detail.
  • Don't confuse: a stem and leaf display for 10,000 data points would be unreadable; a histogram for 10 data points loses too much detail.

📐 Comparing distributions

  • Box plots are specifically highlighted as "good at depicting differences between distributions."
  • This means: when you want to compare two or more groups side-by-side, box plots make patterns and differences clearer.
  • Example: comparing exam scores across three classes—box plots let you quickly see which class has higher median scores or more variability.

🌿 Stem and leaf displays in detail

🌿 What a stem and leaf display shows

A stem and leaf display is a graphical method of displaying data. It is particularly useful when your data are not too numerous.

  • Purpose: clarify the shape of the distribution while preserving individual data values.
  • Structure: stems (left side) represent higher-order digits (e.g., 10's place); leaves (right side) represent lower-order digits (e.g., 1's place).
  • Example from the excerpt: stem 3 with leaves 2, 3, 3, 7 represents the values 32, 33, 33, and 37.

📖 How to read a stem and leaf display

The excerpt uses touchdown passes data:

  • Stem 3 with leaves 2, 3, 3, 7 → values 32, 33, 33, 37.
  • Stem 2 with 12 leaves → represents 12 data points in the 20s range (e.g., two 20s, three 21s, three 22s, one 23, two 28s, one 29).
  • Stem 0 with leaves 9, 6 → values 09 and 06 (i.e., 9 and 6).
  • Every leaf stands for "the result of adding the leaf to 10 times its stem."
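
A small Python sketch of that split for two-digit data (the values echo the ones read off above; the grouping code is just one way to build the display):

```python
from collections import defaultdict

values = [37, 33, 33, 32, 9, 6]   # values read off the display above

# Split each value v into stem = v // 10 and leaf = v % 10, so every leaf
# is "the result of adding the leaf to 10 times its stem".
stems = defaultdict(list)
for v in sorted(values):
    stems[v // 10].append(v % 10)

for stem in sorted(stems, reverse=True):   # larger stems printed first
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
# 3 | 2 3 3 7
# 0 | 6 9
```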

🔎 What you can see in a stem and leaf display

  • Shape of the distribution: you can quickly see where most data cluster and where outliers lie.
  • Precise values: unlike a histogram, you can recover the exact data points from the display.
  • Example from the excerpt: "most of the teams had between 10 and 29 passing TD's, with a few having more and a few having less."
  • Don't confuse: a stem and leaf display is not just a list—it organizes data to reveal patterns while keeping detail.

🔄 Back-to-back stem and leaf displays

  • The learning objectives mention "back-to-back stem and leaf displays" for comparing two datasets.
  • Structure: one stem column in the middle, leaves for one dataset on the left, leaves for the other dataset on the right.
  • Purpose: directly compare two distributions side-by-side.
  • Example: comparing touchdown passes for two different seasons—you can see which season had more high-scoring teams.

⚖️ When to use stem and leaf displays

  • Best for: small to moderate datasets where you want to see both shape and individual values.
  • Not appropriate for: very large datasets (too many leaves make the display cluttered and hard to read).
  • Judgment criterion from learning objectives: "Judge whether a stem and leaf display is appropriate for a given data set."

⚠️ Common graphing mistakes

⚠️ Misleading distortions in bar charts

The excerpt discusses three distortions (from earlier sections on qualitative data, but principles apply):

  1. Lie factor: the ratio of the effect shown in a graph to the size of the effect in the data.

    • Acceptable range: 0.95 to 1.05.
    • Lie factors greater than 1.05 or less than 0.95 produce unacceptable distortion.
    • Example: Figure 5 in the excerpt has a lie factor greater than 8, grossly exaggerating differences.
  2. Baseline distortion: setting the Y-axis baseline to a value other than zero.

    • Normal baseline: zero (the least number of cases that could occur).
    • Example: Figure 6 uses a baseline of 50, making a 12% value "seem minuscule" compared to its true size.
    • Why it matters: differences in bar areas suggest a different story than the true differences in percentages.
  3. Using line graphs for qualitative X-axes: a serious mistake.

    • Line graphs imply a natural numeric ordering of X-axis values.
    • Example: Figure 7 shows card games ordered alphabetically, giving "the false impression that the games are naturally ordered in a numerical way."
    • Don't confuse: a line graph is "essentially a bar graph with the tops of the bars represented by points joined by lines"—but the connection only makes sense if the X-axis has a meaningful order.

🛑 When not to use a line graph

It is a serious mistake to use a line graph when the X-axis contains merely qualitative variables.

  • Qualitative variables (e.g., game names, cities, colors) have no inherent numeric order.
  • Connecting points with lines falsely suggests continuity or progression.
  • Example from the excerpt: card games ordered alphabetically—the line connecting them implies a relationship that doesn't exist.
  • Correct alternative: use a bar chart for qualitative X-axis variables.
14

Stem and Leaf Displays

Stem and Leaf Displays

🧭 Overview

🧠 One-sentence thesis

Stem and leaf displays are graphical tools that organize numerical data by splitting each value into a "stem" (leading digit(s)) and a "leaf" (trailing digit), allowing quick visualization of distribution shape and individual data points.

📌 Key points (3–5)

  • What they show: Individual data values arranged in a way that reveals the distribution's shape, center, and spread.
  • How they're structured: Each number is split into a stem (left side) and leaf (right side), with stems listed vertically and leaves horizontally.
  • Back-to-back displays: Two groups can be compared by placing leaves on opposite sides of a shared stem column.
  • When to use them: Useful for small to moderate datasets where you want to see both the overall pattern and the actual values.
  • Common confusion: Unlike histograms that show only frequency, stem and leaf displays preserve the original data values.

📊 Structure and reading

📊 Basic anatomy

A stem and leaf display splits each data value into two parts: the stem (leading digit or digits) and the leaf (trailing digit).

  • The stem forms the vertical axis (left column).
  • Leaves extend horizontally to the right of each stem.
  • Each leaf represents one data point.
  • Example: For the value 47, the stem might be 4 and the leaf would be 7.

🔍 Reading the display

  • Stems are listed in ascending order from top to bottom.
  • Leaves for each stem are typically arranged in ascending order from left to right.
  • The display shows both the shape of the distribution and the actual data values.
  • Example from the excerpt: In a chess memory experiment display, stem "3" with a leaf of "0" represents the value 30, i.e., a score in the 30s range.

🔄 Back-to-back displays

🔄 Comparing two groups

Back-to-back stem and leaf displays place two distributions on opposite sides of a shared stem column, enabling direct comparison.

  • The stem column sits in the middle.
  • One group's leaves extend to the left.
  • The other group's leaves extend to the right.
  • This format makes it easy to compare centers, spreads, and shapes of two distributions.

🎯 Chess memory example

The excerpt shows a back-to-back display comparing non-chess-players (left) and tournament players (right):

  • Non-players' scores cluster in the lower ranges (20s-40s).
  • Tournament players' scores cluster in higher ranges (40s-80s).
  • The center of the non-players' distribution is much lower than the center of the tournament players' distribution (the excerpt's sentence mistakenly repeats "tournament players," but the surrounding context makes the intended comparison clear).
  • This visual comparison immediately reveals the performance difference between groups.

🛠️ Practical applications

🛠️ Exercise contexts

The excerpt mentions several scenarios where stem and leaf displays are requested:

  • Comparing experimental conditions (e.g., false smile vs. neutral conditions).
  • Examining treatment effects (e.g., ADHD treatment with placebo responses).
  • Analyzing academic data (e.g., high school vs. university GPAs).
  • Describing distribution shape (e.g., identifying skew or other patterns).

💻 Creation methods

  • May require manual construction if computer programs don't support the format.
  • The excerpt notes: "It may be hard to find a computer program to do this for you, so be prepared to do it by hand."
  • Despite being older technology, stem and leaf displays remain valuable for small datasets and teaching purposes.

📐 Relationship to other displays

  • Compared to histograms: Stem and leaf displays preserve actual data values; histograms show only frequencies.
  • Compared to box plots: Stem and leaf displays show the full distribution; box plots summarize with quartiles.
  • Used alongside: Often created together with other displays (histograms, line graphs, box plots) for comprehensive data exploration.

🎓 Prerequisites and context

🎓 Learning sequence

The excerpt places stem and leaf displays early in the statistical learning sequence:

  • Appears in Chapter 2 (after Chapter 1 on Distributions).
  • Listed as a prerequisite for understanding central tendency concepts.
  • Serves as a foundation for more advanced summarizing techniques.

🎯 Connection to distribution concepts

  • Helps visualize where the center of a distribution is located.
  • Makes spread and variability visible.
  • Reveals distribution shape (symmetric, skewed, bimodal).
  • Example: The chess memory display clearly shows the location difference between the two groups' centers, illustrating the concept of central tendency through visual comparison.
15

Histograms

Histograms

🧭 Overview

🧠 One-sentence thesis

Histograms can be used to visualize distributions and estimate central tendency measures, with the balance point on a fulcrum representing the mean of the distribution.

📌 Key points (3–5)

  • Balance point interpretation: when a histogram is balanced on a fulcrum, the fulcrum position indicates the mean of the distribution.
  • Estimating from histograms: mean, median, and mode can be approximated by examining the shape and balance of a histogram.
  • Histogram and skew: histograms reveal whether a distribution has positive skew, negative skew, or no skew.
  • Common confusion: mean vs median—the mean is the balance point (center of mass), while the median is the middle value when data are ordered; in skewed distributions these differ.

⚖️ Balance and the mean

⚖️ Fulcrum as the mean

The histogram is in balance on the fulcrum.

  • The excerpt presents a histogram balanced on a fulcrum and asks for the mean, median, and mode.
  • The balance point (fulcrum) corresponds to the mean of the distribution.
  • Why: the mean is the center of mass—the point where the distribution would balance if it were a physical object.
  • Example: if a histogram balances at a certain point on the x-axis, that point is the mean of the data.

📏 Approximating mean, median, and mode

  • The question asks to "approximate where necessary," meaning exact calculation is not always possible from a histogram alone.
  • Mean: the balance point (fulcrum position).
  • Median: the value that divides the area of the histogram in half (50th percentile).
  • Mode: the peak(s) of the histogram—the value(s) with the highest frequency.
  • Don't confuse: these three measures coincide only in perfectly symmetric distributions; in skewed distributions they differ.
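
One way to make the balance-point idea concrete is to approximate the mean as the frequency-weighted average of the bin midpoints; the counts below are illustrative, not from the excerpt:

```python
midpoints   = [35, 45, 55, 65, 75]   # illustrative bin midpoints
frequencies = [2, 5, 9, 5, 2]        # illustrative bin counts

total = sum(frequencies)
# Treat each bar as a weight sitting at its midpoint; the weighted average
# is where the fulcrum would balance the histogram.
approx_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / total
print(approx_mean)   # 55.0 for this symmetric example
```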

📊 Using histograms to assess distribution shape

📊 Skewness

  • The excerpt references checking whether a variable (Anger-Out) has positive skew, negative skew, or no skew.
  • Histograms reveal skew visually:
    • Positive skew: long tail to the right; mean > median.
    • Negative skew: long tail to the left; mean < median.
    • No skew: symmetric; mean ≈ median.

🔍 Mean vs median from histogram shape

  • The excerpt asks whether the mean or median of a variable is larger based on the histogram.
  • How to tell:
    • If the histogram has a long right tail (positive skew), the mean is pulled higher than the median.
    • If symmetric, mean and median are approximately equal.
  • Example: a histogram of "perday" (from the Flatulence case study) can suggest which measure is larger before calculation.

📐 Measures derived from histograms

📐 Range and interquartile range

  • The excerpt asks for the range (maximum minus minimum) and interquartile range (IQR: 75th percentile minus 25th percentile) of Anger-In scores.
  • Histograms show the spread of data:
    • Range: the span from the leftmost to rightmost bars.
    • IQR: the middle 50% of the data, less sensitive to outliers than the range.

📊 Variance and standard deviation

  • The excerpt asks for the variance of Control-In scores for athletes and non-athletes.
  • Variance measures how spread out the data are around the mean.
  • Histograms with wider spreads indicate higher variance; narrower spreads indicate lower variance.
  • Don't confuse: variance is the average squared deviation from the mean; standard deviation is its square root (same units as the data).
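
A minimal sketch of these spread measures with Python's standard library (the scores are made up for illustration; the excerpt's case-study data are not included here):

```python
import statistics

scores = [12, 15, 15, 17, 18, 20, 21, 24, 29]   # illustrative scores

data_range = max(scores) - min(scores)          # maximum minus minimum
q1, _, q3 = statistics.quantiles(scores, n=4)   # 25th and 75th percentiles
iqr = q3 - q1                                   # interquartile range

variance = statistics.pvariance(scores)         # average squared deviation
std_dev = statistics.pstdev(scores)             # square root of the variance

print(data_range, iqr, variance, round(std_dev, 3))
```
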
16

Frequency Polygons

Frequency Polygons

🧭 Overview

🧠 One-sentence thesis

Frequency polygons serve the same purpose as histograms for understanding distribution shapes but are especially helpful for comparing multiple data sets and displaying cumulative frequencies.

📌 Key points (3–5)

  • What frequency polygons show: the shape of distributions, just like histograms, but in a line-graph format.
  • When they excel: comparing sets of data by overlaying multiple polygons on the same graph.
  • Two main types: regular frequency polygons (showing counts per interval) and cumulative frequency polygons (showing running totals).
  • Common confusion: the X-axis labels represent the middle of each class interval, not the boundaries; the polygon extends one interval below and above the actual data to touch the X-axis.
  • How to read overlaid polygons: overlapping lines reveal differences in distribution—e.g., one distribution shifted higher or lower, or one more spread out than another.

📐 How to construct a frequency polygon

📏 Setting up the axes

  • Start by choosing a class interval, just as you would for a histogram.
  • Draw an X-axis representing the values of the scores.
  • Mark the middle of each class interval with a tick mark and label it with that middle value.
  • Draw a Y-axis to indicate the frequency of each class.

📍 Plotting the points

  • Place a point in the middle of each class interval at the height corresponding to its frequency.
  • Connect the points with straight lines.
  • Include one class interval below the lowest value in your data and one above the highest value so the graph touches the X-axis on both sides.
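
A minimal matplotlib sketch of those steps (the midpoints and counts are illustrative; note the empty interval added at each end so the polygon touches the X-axis):

```python
import matplotlib.pyplot as plt

midpoints   = [35, 45, 55, 65, 75, 85, 95]   # class-interval midpoints
frequencies = [0, 3, 10, 21, 19, 6, 0]       # illustrative counts; 0s at the ends

# Plot one point per interval midpoint and join the points with straight lines.
plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Score (interval midpoint)")
plt.ylabel("Frequency")
plt.title("Frequency polygon")
plt.show()
```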

🧮 Example: Psychology test scores

The excerpt describes a frequency polygon for 642 psychology test scores:

  • The first label on the X-axis is 35, representing an interval from 29.5 to 39.5 (frequency = 0, since the lowest score is 46).
  • The point labeled 45 represents the interval from 39.5 to 49.5 (frequency = 3).
  • The point at 85 has 147 scores in its interval.
  • Most scores fall between 65 and 115.
  • The distribution is not symmetric: good scores (to the right) trail off more gradually than poor scores (to the left), indicating the distribution is skewed.

📈 Cumulative frequency polygons

📊 What cumulative frequency means

Cumulative frequency polygon: a frequency polygon where the Y value for each point is the number of observations in the corresponding class interval plus all numbers in lower intervals.

  • Instead of showing the count for each interval alone, it shows a running total.
  • The final point on the cumulative polygon equals the total number of observations.

🔢 Example: Cumulative psychology test scores

  • The interval labeled "35" has 0 scores.
  • The interval "45" has 3 scores.
  • The interval "55" has 10 scores.
  • Therefore, the Y value at "55" is 0 + 3 + 10 = 13 (cumulative).
  • Since 642 students took the test, the cumulative frequency for the last interval is 642.
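
A tiny sketch of the running-total idea (the first three counts come from the example above; the final count is a placeholder, not the real data):

```python
from itertools import accumulate

labels = [35, 45, 55, 65]      # interval midpoints
freqs  = [0, 3, 10, 53]        # 0, 3, 10 are from the example; 53 is a placeholder

cumulative = list(accumulate(freqs))   # each value adds all lower intervals
print(dict(zip(labels, cumulative)))   # {35: 0, 45: 3, 55: 13, 65: 66}
```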

🔀 Comparing distributions with overlaid polygons

🔀 Why overlay frequency polygons

  • Frequency polygons are especially useful for comparing sets of data.
  • By overlaying (drawing multiple polygons on the same graph), you can directly see differences in shape, center, and spread.

🎯 Example: Cursor task with two target sizes

The excerpt describes a computer cursor task:

  • Goal: move the cursor to a target as fast as possible.
  • Two conditions: 20 trials with a small rectangle target, 20 trials with a large rectangle target.
  • Time to reach the target was recorded for each trial.
  • The two distributions (one for each target size) are plotted together.
  • What the overlaid polygons reveal: although there is some overlap in times, it generally took longer to move the cursor to the small target than to the large one.

📉 Overlaid cumulative frequency polygons

  • It is also possible to plot two cumulative frequency distributions in the same graph.
  • The excerpt illustrates this with the same cursor task data.
  • The difference in distributions for the two targets is again evident in the cumulative version.
  • Don't confuse: a cumulative polygon always rises (or stays flat) as you move right, because it is a running total; a regular frequency polygon can go up and down.

🧩 Reading the shape of a distribution

🧩 What the polygon reveals

  • You can easily discern the shape of the distribution from a frequency polygon.
  • The excerpt's psychology test example shows most scores are between 65 and 115.
  • The distribution is not symmetric: good scores trail off more gradually than poor scores.
  • This asymmetry is called skewness (the excerpt notes this will be studied more systematically in a later chapter).

🔍 Interpreting overlaps and shifts

  • When comparing overlaid polygons, look for:
    • Shift: one distribution centered higher or lower than another.
    • Spread: one distribution more spread out (wider) or more concentrated (narrower).
    • Overlap: how much the two distributions share the same range of values.
  • Example: In the cursor task, the small-target polygon is shifted to the right (longer times) compared to the large-target polygon, with some overlap in the middle range.
17

Box Plots

Box Plots

🧭 Overview

🧠 One-sentence thesis

Box plots are graphical tools that display distribution features and enable side-by-side comparisons, with specific components marking typical values, spread, and outliers.

📌 Key points (3–5)

  • What box plots show: distribution features including center, spread, and atypical observations.
  • Key structural components: hinges (quartiles), fences (boundaries for identifying outliers), and outside values.
  • Parallel box plots: multiple box plots on the same Y-axis allow direct comparison of distributions across groups.
  • Common confusion: outer fence vs inner fence—outer fences are two steps from the hinge, inner fences are one step; outside values fall between these boundaries.
  • Why they matter: box plots make it easy to spot outliers and compare distribution shapes visually.

📦 Box plot structure and boundaries

📦 Hinges and fences

  • Hinges: the lower and upper hinges correspond to the quartiles of the distribution (the excerpt does not define them further, but they mark the box boundaries).
  • Steps: the distance used to define fences (the excerpt does not specify the step size formula, only that fences are measured in "steps" from the hinges).

🚧 Inner and outer fences

The lower outer fence is two steps below the lower hinge; the upper outer fence is two steps above the upper hinge.

  • Outer fences: mark the boundary two steps away from the nearest hinge.
  • Inner fences: (implied by the "outside values" definition) mark the boundary one step away from the nearest hinge.
  • These fences are not drawn on the plot itself but serve as thresholds for classifying data points.

🔴 Outside values

Outside values are more than one step beyond the nearest hinge but not more than two steps. They are beyond an inner fence but not beyond an outer fence.

  • These are moderately extreme observations.
  • They lie in the zone between the inner and outer fences.
  • Don't confuse: outside values are not the same as outliers (see below).

⚠️ Outliers

Outliers are atypical, infrequent observations; values that have an extreme deviation from the center of the distribution.

  • The excerpt notes there is no universally-agreed criterion for defining an outlier.
  • Outliers should only be discarded with extreme caution.
  • One should always assess the effects of outliers on statistical conclusions.
  • Example: a value more than two steps beyond a hinge (beyond the outer fence) might be considered an outlier, but the excerpt does not mandate this rule.
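
A rough sketch of the fence logic, assuming the common convention (not stated in the excerpt) that one step is 1.5 times the distance between the hinges, and using ordinary quartiles as stand-ins for the hinges:

```python
import statistics

data = [14, 17, 17, 18, 19, 19, 20, 20, 22, 29]   # illustrative values only

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles as hinge stand-ins
step = 1.5 * (q3 - q1)                        # assumed step size (Tukey's rule)

inner = (q1 - step, q3 + step)                # one step beyond each hinge
outer = (q1 - 2 * step, q3 + 2 * step)        # two steps beyond each hinge

# Outside values: beyond an inner fence but not beyond an outer fence.
outside = [x for x in data
           if (outer[0] <= x < inner[0]) or (inner[1] < x <= outer[1])]
beyond_outer = [x for x in data if x < outer[0] or x > outer[1]]
print(inner, outer, outside, beyond_outer)
```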

📊 Parallel box plots for comparison

📊 What parallel box plots are

Two or more box plots drawn on the same Y-axis.

  • The excerpt also uses the term "pairwise comparisons" for this arrangement, alongside "parallel box plots."
  • They allow direct visual comparison of distribution features across groups.

🔍 Why use them

  • Useful for comparing features of distributions across different samples or categories.
  • Example: the excerpt mentions "the times it took samples of women and men to do a task"—parallel box plots on the same Y-axis let you compare medians, spreads, and outliers between the two groups at a glance.
  • The shared Y-axis ensures that differences in scale and position are immediately visible.

🧩 Related concepts in the excerpt

🧩 Parameter vs statistic

A parameter is a value calculated in a population. A statistic is a value computed in a sample to estimate a parameter.

  • Box plots typically display sample statistics (e.g., sample quartiles).
  • Don't confuse: the box plot itself is a visualization tool, not a parameter or statistic.

🧩 Ordinal scales

An ordinal scale is a set of ordered values with no set distance between scale values.

  • Example: "Very Poor, Poor, Average, Good, Very Good."
  • Box plots can be used with ordinal data, but the "step" distances in fences assume interval or ratio scales (the excerpt does not address this limitation directly).

🧩 Non-representative samples

A non-representative sample does not accurately reflect the population.

  • If a box plot is drawn from a non-representative sample, the distribution features shown will not generalize to the population.
  • Always consider sample representativeness when interpreting box plots.
18

Bar Charts

Bar Charts

🧭 Overview

🧠 One-sentence thesis

Bar charts can display not only frequency counts but also other quantitative information such as percentage changes and means, though box plots are often superior for showing means because they reveal more about distributions.

📌 Key points (3–5)

  • Beyond frequency: Bar charts can show quantitative information like percentage increases, not just category counts.
  • Effective for time trends: Bar charts are particularly good at showing change over time (e.g., inflation fluctuations).
  • Comparing experimental means: Bar charts can display means of different conditions, but box plots are recommended instead because they provide more distributional information without using more space.
  • Common confusion: Don't assume bar charts are always best for means—box plots reveal more about the data (spread, outliers) in the same space.
  • When to use: Bar charts suit qualitative categories and ordered variables such as time periods; the warning about merely qualitative X-axes applies to line graphs, not to bar charts.

📊 What bar charts can display

📊 Frequency counts (traditional use)

  • The excerpt references an earlier section showing bar charts for qualitative variables.
  • Example: a bar chart showing how many iMac buyers were previous Macintosh users, previous Windows users, or new computer purchasers.
  • The Y-axis represents the number (count) of subjects in each category.

📈 Other quantitative information

Bar charts can present other kinds of quantitative information, not just frequency counts.

  • The Y-axis does not have to be frequency; it can be any signed quantity.
  • Example: percentage increase in stock indexes (Dow Jones, S & P, Nasdaq) from one date to another.
    • Some indexes had "negative increases" (i.e., they decreased in value).
    • The Y-axis is "percentage increase," not a count.

⏱️ Showing change over time

⏱️ Why bar charts work well for time trends

  • Bar charts are particularly effective for showing change over time.
  • Example: percent increase in the Consumer Price Index (CPI) over four three-month periods.
    • Each bar represents the percent increase for the three months ending at the indicated date.
    • The fluctuation in inflation becomes apparent in the graph.

📉 Visualizing trends

  • When the data points are ordered chronologically, bar charts make it easy to see patterns (rising, falling, fluctuating).
  • The excerpt emphasizes that the graph makes fluctuations "apparent."

🧪 Comparing experimental conditions

🧪 Displaying means

  • Bar charts are often used to compare the means of different experimental conditions.
  • Example: mean time to move a cursor to a small target vs. a large target.
    • On average, more time was required for small targets than for large ones.
    • Each bar represents the mean time for one condition.

⚠️ Why box plots are better

Although bar charts can display means, we do not recommend them for this purpose. Box plots should be used instead since they provide more information than bar charts without taking up more space.

  • A box plot of the cursor-movement data reveals more about the distribution of movement times than the bar chart does.
  • Don't confuse: A bar chart shows only the mean (a single summary number); a box plot shows the median, quartiles, spread, and potential outliers—all in the same space.
  • Example: The excerpt contrasts a bar chart (showing only mean time) with a box plot (showing the full distribution of times) for the same cursor data.

🔍 Appropriateness and limitations

🔍 When bar charts are appropriate

  • Bar charts work when you want to emphasize discrete categories or time periods.
  • They are effective for ordered variables on both axes (e.g., time on X-axis, percentage change on Y-axis).

🚫 A caution about line graphs, not bar charts

  • The warning about merely qualitative (unordered) X-axis variables applies to line graphs; bar charts are the standard display for unordered categories (as in the iMac example above).
  • Example: The excerpt mentions (but does not fully show) an inappropriate line graph of card game data from Yahoo with a qualitative X-axis; a bar chart or dot plot is the suitable alternative there.
  • Common mistake: connecting unordered categories with lines implies a numerical ordering that does not exist.

🆚 Bar charts vs. other graphs

| Graph type | Best for | Limitation |
| --- | --- | --- |
| Bar chart | Frequency counts, percentage changes, time trends | Shows only summary statistics (e.g., mean); hides distribution details |
| Box plot | Comparing distributions, showing means + spread | Takes same space but reveals quartiles, outliers, and variability |
| Line graph | Emphasizing continuous change over time | Only appropriate when both axes are ordered variables |

  • The excerpt advises: "Which graphs you include in your report should depend on how well different graphs reveal the aspects of the data you consider most important."
  • When exploring data, try several visualization methods.
19

Line Graphs

Line Graphs

🧭 Overview

🧠 One-sentence thesis

Line graphs emphasize change over time by connecting data points with lines, but they should only be used when both axes display ordered (not merely qualitative) variables.

📌 Key points (3–5)

  • What a line graph is: a bar graph with the tops of bars represented by points joined by lines (the rest of the bar is suppressed).
  • When to use line graphs: appropriate only when both X- and Y-axes display ordered variables; generally better than bar charts for comparing changes over time.
  • Common confusion: do not use line graphs when the X-axis contains merely qualitative variables—this gives a false impression of natural numerical ordering.
  • Why line graphs matter: they emphasize period-to-period change and make it easier to see patterns like steady progression across multiple components.

📊 What line graphs are and how they relate to bar charts

📊 Definition and structure

A line graph is a bar graph with the tops of the bars represented by points joined by lines (the rest of the bar is suppressed).

  • Start with a bar chart concept, then replace bars with connected points.
  • The line connecting the points emphasizes the change from period to period, not just the individual values.
  • Example: the excerpt shows Consumer Price Index (CPI) data first as a bar chart, then as a line graph—the line graph makes the trend over time more visible.

🔄 Comparison with bar charts

  • Bar charts and line graphs can both display the same data when both axes are ordered.
  • Key difference: line graphs are generally better at comparing changes over time.
  • The excerpt notes that "although the figures are similar, the line graph emphasizes the change from period to period."

✅ When line graphs are appropriate

✅ Both axes must be ordered

  • Line graphs are appropriate only when both the X- and Y-axes display ordered (rather than qualitative) variables.
  • "Ordered" means the variables have a natural sequence (e.g., time, temperature, quantity).
  • Don't confuse: ordered variables vs qualitative variables—qualitative variables (like categories with no inherent order) should not be on the X-axis of a line graph.

📈 Comparing changes over time

  • Line graphs excel at showing trends and patterns across time or other ordered sequences.
  • Example from the excerpt: a line graph showing percent changes in five components of the CPI makes it easy to see that medical costs had a steadier progression than other components.
  • The excerpt states: "Although you could create an analogous bar chart, its interpretation would not be as easy."

❌ When line graphs are misleading

❌ Qualitative X-axis problem

  • It is misleading to use a line graph when the X-axis contains merely qualitative variables.
  • The defect: it gives the false impression that the categories are naturally ordered in a numerical way.
  • Example: the excerpt shows an inappropriate line graph of card game data (Blackjack, Bridge, Canasta, etc.) with number of players on different days.
    • The games have no natural numerical order.
    • Connecting them with lines suggests a progression that does not exist.
  • The excerpt explicitly calls this figure "inappropriately used."

🚫 How to avoid this mistake

  • Before choosing a line graph, check whether the X-axis variable has a meaningful order.
  • If the X-axis is categorical without inherent sequence (e.g., game names, product types), use a bar chart or dot plot instead.
  • Line graphs should be reserved for situations where the connection between points reflects a real progression (typically time or another continuous/ordered variable).

🎯 Practical advantages of line graphs

🎯 Emphasizing trends and patterns

  • Line graphs make it easier to see:
    • Overall direction (increasing, decreasing, stable).
    • Relative steadiness or volatility across different series.
  • Example: the five-component CPI graph allows quick visual comparison of which components are more stable or more volatile over the same time period.

🎯 Multiple series comparison

  • Line graphs can display multiple data series (e.g., different CPI components) on the same plot.
  • Each series gets its own line, making it easy to compare trajectories.
  • The excerpt shows this with housing, medical care, food and beverage, recreation, and transportation all on one graph.
20

Dot Plots

Dot Plots

🧭 Overview

🧠 One-sentence thesis

Dot plots are a flexible graphing tool that can display frequencies either by the number of dots or by the position of dots along an axis, and choosing the right layout makes it easier to compare specific categories or groups.

📌 Key points (3–5)

  • Two ways to show frequency: dot plots can represent frequency by counting individual dots (each dot = one item) or by placing a single dot at a position on a scale.
  • What dot plots display: they work for various types of information, including counts of categorical data (e.g., colors, card games).
  • Layout matters for comparisons: different arrangements (side-by-side vs. grouped) make it easier to compare either across categories or across groups (e.g., days of the week).
  • Common confusion: dot plots are not the same as line graphs—dot plots are appropriate for categorical data that have no natural numerical order, whereas line graphs imply a continuous or ordered relationship.
  • When to use: dot plots are suitable when you want to show frequencies or counts for discrete categories in a visually clear way.

📊 Two methods of representing frequency

📊 Counting individual dots

In this method, each dot represents a single item, and the total number of dots shows the frequency.

  • Example: Figure 1 shows M&M colors. There are 3 blue M&Ms (3 dots), 19 brown M&Ms (19 dots), etc.
  • This approach is intuitive: you literally count the dots to see how many items fall into each category.
  • Best for small counts where each individual item can be shown.

📍 Position on a scale

In this method, the location of a dot (rather than the number of dots) represents the frequency.

  • Example: Figure 2 shows the number of people playing card games on a Wednesday. A single dot is placed at the position corresponding to the count.
  • The horizontal axis shows the frequency scale (e.g., 1000, 2000, 3000…), and the dot's position tells you the value.
  • This approach is more compact and works well for larger counts.

🔀 Layout choices for comparisons

🔀 Comparing categories within one group

  • When you have one set of data (e.g., card games on Wednesday only), a simple dot plot shows how categories differ from each other.
  • Example: Figure 2 makes it easy to see which card game is most popular on Wednesday.

🔀 Comparing the same categories across two groups

  • When you have two groups (e.g., Wednesday and Sunday), layout affects what comparisons are easiest.

| Layout | What it emphasizes | Example from excerpt |
| --- | --- | --- |
| Separate rows for each group | Easy to compare popularity of games within each day; harder to compare the same game across days | Figure 3 (Sunday and Wednesday in separate rows) |
| Side-by-side dots for each category | Easy to compare days for a specific game while still showing differences among games | Figure 4 (Wednesday and Sunday dots next to each other for each game) |

  • The excerpt notes that Figure 4 "makes it easy to compare the days of the week for specific games while still portraying differences among games."
  • Don't confuse: the same data can be arranged in different ways; choose the layout that highlights the comparison you care about most.

⚠️ When dot plots are appropriate

⚠️ Categorical vs. ordered data

  • Dot plots work well for categorical data (data that fall into distinct groups with no inherent numerical order).
  • Example: card games (Poker, Bridge, Gin, etc.) are categories, not numbers on a scale.
  • The excerpt warns against using line graphs for categorical data: earlier text mentions that a line graph "gives the false impression that the games are naturally ordered in a numerical way."
  • Don't confuse: a dot plot for categories vs. a line graph for continuous or time-ordered data. If there is no natural ordering or progression, a dot plot (or bar chart) is more appropriate than a line graph.

⚠️ Judging appropriateness

  • The learning objectives state you should "judge whether a dot plot would be appropriate for a given data set."
  • Ask: Are the categories discrete and unordered? If yes, a dot plot is a good choice.
  • Ask: Do you want to show individual counts or frequencies clearly? Dot plots make this visible.

🛠️ Practical considerations

🛠️ Choosing between dot plot variants

  • If counts are small and you want to emphasize individual items, use the "each dot = one item" method.
  • If counts are large or you want a cleaner look, use the "position on a scale" method.
  • If you need to compare two groups, decide whether you want to emphasize within-group or between-group comparisons, and arrange the layout accordingly.

🛠️ What dot plots reveal

  • Dot plots make it easy to see which categories have the highest or lowest frequencies at a glance.
  • They also show the distribution of counts across categories without requiring numerical labels on every bar (as in a bar chart).
  • Example: In Figure 1, you can immediately see that brown M&Ms are the most common and blue M&Ms are the least common in the bag.
21

What is Central Tendency?

What is Central Tendency?

🧭 Overview

🧠 One-sentence thesis

Central tendency describes where the center of a distribution is located, allowing us to compare individual scores to the group and understand what is typical in a dataset.

📌 Key points (3–5)

  • Why we care about central tendency: comparing an individual score to the group helps us interpret whether that score is high, low, or typical.
  • What central tendency measures: the location of the center of a distribution of scores or values.
  • How distributions differ by center: two groups can have very different centers (e.g., non-chess-players vs. tournament players), making comparison meaningful.
  • Common confusion: there are multiple ways to define "center"—the excerpt mentions three different definitions, each useful in different contexts.
  • Real-world application: knowing the center helps answer questions like "How did the class do?" or "Is my score above or below average?"

🎯 Why central tendency matters

🎯 Comparing individual scores to the group

  • When you receive a score (e.g., 3 out of 5 on a quiz), the raw number alone doesn't tell you much.
  • You naturally want to know: "How did others do?"
  • Your reaction depends on where your score falls relative to the distribution of all scores.
  • Example: A score of 3 feels very different if everyone else scored 3 (you're at the center), versus if everyone else scored 4 or 5 (you're below the center), versus if everyone else scored 1 or 2 (you're above the center).

📊 Understanding what is typical

  • Central tendency tells you what is typical or expected in a group.
  • It provides a reference point for interpreting any single observation.
  • The excerpt emphasizes that "comparing individual scores to a distribution of scores is fundamental to statistics."

📍 How the center reveals group differences

📍 Comparing two distributions

The excerpt presents a chess memory experiment:

  • Non-players: people who don't play chess
  • Tournament players: people who play chess extensively
  • Both groups tried to reconstruct chess positions; scores represent pieces correctly placed (maximum 89).
  • The center of the distribution for non-players is much lower than the center for tournament players.
  • This difference in centers shows that tournament players perform better on average.

🔍 What "center" means intuitively

The center of a distribution: the location around which the scores cluster or balance.

  • It's not just one number; it's a concept that can be defined in multiple ways.
  • The excerpt promises three formal definitions, each capturing a different aspect of "center."
  • Don't confuse: "center" is not the same as "highest score" or "most common score" necessarily—it depends on which definition you use.

🧮 Three ways to think about center

🧮 Multiple definitions exist

The excerpt states there are "at least three different ways of thinking about the center of a distribution."

  • Each definition is useful in different contexts.
  • The text introduces the idea that center can be defined formally in multiple ways, but the specific definitions are covered in later sections (not included in this excerpt).
  • This multiplicity is intentional, not redundant—different measures capture different aspects of what "typical" or "central" means.

⚖️ Balance and symmetry

The excerpt mentions that "the balance is different for symmetric distributions than it is for asymmetric distributions."

  • Symmetric distributions: the two sides mirror each other around the center.
  • Asymmetric distributions: one side may have a longer tail, shifting where the "balance point" lies.
  • This hints that the concept of center must account for the shape of the distribution.

📋 Practical example: the quiz scenario

📋 Three possible outcomes

| Dataset | Your score | Others' scores | Where you stand |
| --- | --- | --- | --- |
| A | 3 | All scored 3 | At the exact center—same as everyone |
| B | 3 | Four students scored 4, 4, 4, 5 | Below the center—everyone else did better |
| C | 3 | Four students scored 2, 2, 2, 1 | Above the center—you did better than everyone |

  • Your raw score (3) is identical in all three cases.
  • Your interpretation and emotional reaction depend entirely on the center of the distribution.
  • Dataset A: neutral (you're average).
  • Dataset B: disappointing (you're below center).
  • Dataset C: satisfying (you're above center).

🎓 The natural question students ask

When you get a score back, you immediately ask:

  • "What did you get?" (to neighbors)
  • "How did the class do?" (to the instructor)
  • This shows that humans intuitively understand the importance of central tendency for interpreting individual performance.
22

Measures of Central Tendency

Measures of Central Tendency

🧭 Overview

🧠 One-sentence thesis

The center of a distribution can be defined in three distinct ways—as a balance point, as the value minimizing absolute deviations, or as the value minimizing squared deviations—each corresponding to different formal measures (mean, median, and mode).

📌 Key points (3–5)

  • Three definitions of center: balance point, smallest absolute deviation, and smallest squared deviation—each captures a different intuitive sense of "middle."
  • Three formal measures: the mean (arithmetic average), the median (midpoint value), and the mode (most frequent value) operationalize these definitions.
  • Common confusion: the mean and median are not the same thing; the mean balances a distribution like a fulcrum, while the median minimizes the sum of absolute deviations.
  • How to compute each: mean = sum divided by count; median = middle value (or average of two middle values); mode = most frequent value.
  • Why it matters: different measures suit different contexts; understanding which definition of "center" you need helps choose the right measure.

⚖️ Three ways to define the center

⚖️ Balance point

The center as the point at which the distribution balances, like a fulcrum on a scale.

  • Imagine placing each data value as a weight on a number line; the balance point is where a fulcrum would keep the scale level.
  • Example: For the numbers 2, 3, 4, 9, 16, the balance point is at 6.8—not the geometric middle, but the point where the "weights" balance.
  • This definition works even for asymmetric distributions; the fulcrum shifts toward heavier clusters.
  • Don't confuse: the balance point is not necessarily halfway between the lowest and highest values.

📏 Smallest absolute deviation

The center as the value that minimizes the sum of absolute deviations (distances) from all other values.

  • For each candidate center, calculate the absolute difference from every data point, then sum them.
  • The value with the smallest total is "closest overall" to all the data.
  • Example: For 2, 3, 4, 9, 16, the sum of absolute deviations from a target of 10 is 28 and from 5 is 21; the smallest possible sum, 20, is reached at the median, 4.
  • This definition emphasizes minimizing total distance without caring about direction (positive or negative).

📐 Smallest squared deviation

The center as the value that minimizes the sum of squared deviations from all other values.

  • For each candidate center, calculate the squared difference from every data point, then sum them.
  • Squaring penalizes larger deviations more heavily than smaller ones.
  • Example: For 2, 3, 4, 9, 16, the sum of squared deviations from a target of 10 is 186 and from 5 is 151; the smallest possible sum, 134.8, is reached at the mean, 6.8 (a brute-force check appears after this subsection).
  • This definition is harder to compute by trial and error but is foundational for many statistical methods.
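
A brute-force check of both claims for the excerpt's numbers, scanning candidate centers on a 0.1-wide grid (the grid is just an illustrative choice):

```python
data = [2, 3, 4, 9, 16]

def sum_abs_dev(center):
    return sum(abs(x - center) for x in data)

def sum_sq_dev(center):
    return sum((x - center) ** 2 for x in data)

candidates = [i / 10 for i in range(0, 201)]   # 0.0, 0.1, ..., 20.0

best_abs = min(candidates, key=sum_abs_dev)
best_sq = min(candidates, key=sum_sq_dev)

print(best_abs, sum_abs_dev(best_abs))   # 4.0 and 20    -> the median
print(best_sq, sum_sq_dev(best_sq))      # 6.8 and 134.8 -> the mean
```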

📊 The three formal measures

📊 Arithmetic mean

The sum of all numbers divided by the count of numbers.

  • Symbol: μ (population mean) or M (sample mean).
  • Formula in words: add up all values, then divide by how many values there are.
  • Example: For 1, 2, 3, 6, 8, the mean is (1+2+3+6+8)/5 = 20/5 = 4.
  • The excerpt shows a real dataset: 31 NFL teams' touchdown passes sum to 634, so the mean is 634/31 = 20.4516.
  • The mean corresponds to the balance point definition.

📊 Median

The midpoint value: half the data is above it, half below.

  • Also called the 50th percentile.
  • Odd count: the median is the middle number when values are sorted.
    • Example: For 2, 4, 7, the median is 4.
  • Even count: the median is the average of the two middle numbers.
    • Example: For 2, 4, 7, 12, the median is (4+7)/2 = 5.5.
  • For the NFL dataset (31 teams), the 16th value when sorted is 20, with 15 values below and 15 above.
  • The median corresponds to the smallest absolute deviation definition.

📊 Mode

The most frequently occurring value.

  • For discrete data: simply the value that appears most often.
  • Example: In the NFL dataset, 18 touchdown passes occurred 4 times (more than any other value), so the mode is 18.
  • For continuous data: values rarely repeat exactly, so the mode is computed from a grouped frequency distribution (the midpoint of the interval with the highest frequency).
  • Example: If the interval 600-700 has the highest frequency, the mode is 650 (the middle of that interval).

🔗 How definitions connect to measures

🔗 Mean as balance point

  • The excerpt explicitly states: "The mean is the point on which a distribution would balance."
  • This matches the fulcrum/balance scale definition from the first part.

🔗 Median minimizes absolute deviations

  • The excerpt explicitly states: "The median is the value that minimizes the sum of absolute deviations."
  • This matches the "smallest absolute deviation" definition.

🔗 Mean minimizes squared deviations

  • The excerpt explicitly states: "The mean is the value that minimizes the sum of the squared deviations."
  • This matches the "smallest squared deviation" definition.
  • Note: the mean appears in two definitions (balance point and squared deviations)—these are mathematically equivalent properties.

🧮 Computation summary

| Measure | How to compute | Example (2, 3, 4, 9, 16) |
| --- | --- | --- |
| Mean | Sum ÷ count | (2+3+4+9+16)/5 = 6.8 |
| Median | Middle value (or average of two middle) | 4 (the 3rd of 5 values) |
| Mode | Most frequent value | Not applicable (all appear once) |
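
A quick check of the table with Python's standard library (the values 2, 3, 4, 9, 16 are the excerpt's running example):

```python
import statistics

data = [2, 3, 4, 9, 16]

print(statistics.mean(data))      # 6.8  (sum of 34 divided by 5)
print(statistics.median(data))    # 4    (the middle of the sorted values)

# Every value occurs exactly once, so no single value is "most frequent";
# multimode() makes that explicit by returning all tied values.
print(statistics.multimode(data)) # [2, 3, 4, 9, 16]
```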

🧮 Special cases

  • Median with ties: When multiple values are the same, use the formula for the 50th percentile (mentioned but not detailed in the excerpt).
  • Mode with continuous data: Group into intervals and find the midpoint of the highest-frequency interval.
  • Symbols: Use μ or M for mean depending on whether you have a population or sample; the formula structure is identical.
23

Median and Mean

Median and Mean

🧭 Overview

🧠 One-sentence thesis

The excerpt provides exercises and case-study questions for practicing statistical concepts related to summarizing distributions, but contains no substantive instructional content about median and mean themselves.

📌 Key points (3–5)

  • The excerpt consists entirely of practice exercises and case-study questions
  • Questions cover topics including measures of central tendency, variability, transformations, and data entry errors
  • No definitions, explanations, or instructional material about median or mean are present
  • The exercises reference external case studies (Angry Moods, Flatulence, Stroop, etc.) whose data are not included

📋 Content assessment

📋 What the excerpt contains

The excerpt is a collection of numbered exercises (problems 1–31) that ask students to:

  • Compute means, medians, and other statistics from provided datasets
  • Analyze the effects of data transformations on measures of central tendency
  • Identify which measures change when data entry errors are corrected
  • Apply concepts to various case studies

📋 What is missing

  • No instructional content: The excerpt contains no definitions of median or mean
  • No explanations: There are no worked examples or conceptual explanations
  • No context: The case studies referenced (AM, F, S, PR, SL, AT, AR) are mentioned but their data and descriptions are not provided
  • Exercise-only format: This appears to be an end-of-chapter problem set, not the chapter content itself

🔍 Note for learners

🔍 Purpose of this excerpt

This excerpt serves as practice material to test understanding of concepts that would have been taught in earlier sections (not included here). Without the preceding instructional content, these exercises cannot be completed or understood in isolation.

🔍 What you would need

To work through these problems, you would need:

  • The actual chapter content defining median, mean, variance, standard deviation, and other measures
  • Access to the case study datasets referenced throughout
  • Understanding of the prerequisite material listed (Chapter 3: Median and Mean)

The excerpt lacks substantive content for creating meaningful review notes about median and mean.

24

Additional Measures of Central Tendency

Additional Measures of Central Tendency

🧭 Overview

🧠 One-sentence thesis

The excerpt demonstrates that failing to reject a null hypothesis does not prove it is true, and researchers must avoid concluding that two effects differ simply because one is statistically significant and the other is not.

📌 Key points (3–5)

  • Core error: Accepting the null hypothesis when it is not rejected leads to faulty conclusions about which effects exist or differ.
  • Significance vs. difference: One simple effect being significant (p = 0.02) and another not (p = 0.09) does not prove the two effects are different from each other.
  • Interaction testing: A non-significant interaction (p = 0.08) means the data support the hypothesis but not strongly enough for confident conclusions.
  • Common confusion: Researchers mistakenly interpret "not significant" as "zero effect" rather than "insufficient evidence."
  • Components of interaction: Interactions can be broken into testable portions using specific comparisons with coefficients.

⚠️ Logical errors in hypothesis testing

⚠️ Mistakenly accepting the null hypothesis

  • When a test does not reject the null hypothesis, it means the evidence is insufficient—not that the null hypothesis is true.
  • The excerpt warns against concluding that a simple effect is zero just because it is not significant.
  • Example: If an interaction test yields p = 0.08 (not significant at 0.05), the proper conclusion is that the experiment supports the hypothesis but not strongly enough for confidence.

🚫 Comparing significance across simple effects

The error: Concluding two simple effects are different because one is significant and the other is not.

  • The imaginary experiment tested whether addicted people show a larger increase in brain activity than non-addicted people following treatment.
  • Results:
    • Treatment effect for Addicted group: p = 0.02 (significant)
    • Treatment effect for Non-Addicted group: p = 0.09 (not significant)
  • Faulty logic: The researcher concluded that because one effect is significant and the other is not, the hypothesis of a greater effect for the Addicted group is demonstrated.
  • Why it's wrong: This reasoning accepts the null hypothesis (zero effect) for the Non-Addicted group based solely on lack of significance, which is not justified.

🧩 Components of interaction

🧩 What components of interaction are

  • Interactions can be divided into testable portions or components.
  • The excerpt uses a diet and weight loss study with three diets (Control, Diet A, Diet B) and two age groups (Teens, Adults).
  • Over one portion of the graph, lines are parallel (no interaction); over another portion, they are not (interaction present).

🔍 Testing specific portions

  • The difference between Diet A and Control was essentially the same for teens and adults (parallel lines).
  • The difference between Diet B and Diet A was much larger for teens than for adults (non-parallel lines).
  • These portions can be tested using the method of specific comparisons.

📊 Coefficients for component testing

The excerpt provides coefficients to test the difference between Teens and Adults on the difference between Diets A and B:

Age Group | Diet | Coefficient
Teen | Control | 0
Teen | A | 1
Teen | B | -1
Adult | Control | 0
Adult | A | -1
Adult | B | 1
  • These coefficients isolate the specific comparison of interest within the larger interaction.
  • The same considerations for multiple comparisons and orthogonal comparisons apply to components of interactions.
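To make the coefficient table concrete, here is a small sketch of how such coefficients isolate one component of the interaction: the contrast is the sum of each coefficient times its cell mean. The cell means below are invented for illustration; only the coefficients come from the excerpt.

```python
# Hypothetical cell means (weight loss) for illustration only.
cell_means = {
    ("Teen", "Control"): 2.0, ("Teen", "A"): 5.0, ("Teen", "B"): 11.0,
    ("Adult", "Control"): 2.5, ("Adult", "A"): 5.5, ("Adult", "B"): 6.5,
}
# Coefficients from the excerpt: Teens vs. Adults on the Diet A vs. Diet B difference.
coefficients = {
    ("Teen", "Control"): 0, ("Teen", "A"): 1, ("Teen", "B"): -1,
    ("Adult", "Control"): 0, ("Adult", "A"): -1, ("Adult", "B"): 1,
}

# The contrast compares the (A - B) difference for Teens with the same
# difference for Adults; a value far from 0 suggests that this portion
# of the interaction is present.
contrast = sum(coefficients[cell] * cell_means[cell] for cell in cell_means)
print(contrast)   # (5 - 11) - (5.5 - 6.5) = -5.0
```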

🔬 Proper interpretation of non-significant results

🔬 What p = 0.08 means

  • The interaction test resulted in p = 0.08, not quite low enough to be significant at the conventional 0.05 level.
  • Proper conclusion: The experiment supports the researcher's hypothesis, but not strongly enough to allow a confident conclusion.
  • Improper conclusion: The hypothesis is false or the interaction does not exist.

🧠 Why "not significant" ≠ "no effect"

  • A non-significant result means the data do not provide strong enough evidence to reject the null hypothesis.
  • It does not mean the null hypothesis is true or that the effect size is zero.
  • Don't confuse: "We cannot conclude there is an effect" with "We conclude there is no effect."
25

Comparing Measures of Central Tendency

Comparing Measures of Central Tendency

🧭 Overview

🧠 One-sentence thesis

The mean, median, and other measures of central tendency differ systematically in skewed distributions, so reporting multiple measures tells a more complete story than relying on any single statistic.

📌 Key points (3–5)

  • Symmetric vs skewed: in symmetric distributions all measures are equal (except mode in bimodal cases), but skewed distributions produce different values across measures.
  • Positive skew pattern: the mean is typically higher than the median when distributions have a positive skew.
  • Common confusion: the mode can be much lower than other measures even when skew is slight; don't assume all measures cluster together.
  • Why it matters: a single measure (e.g., only the mean or only the median) hides important information in skewed data; reporting mean, median, and trimean/trimmed mean gives a fuller picture.
  • Real-world practice: media typically report the median for skewed distributions like salaries and house prices, but more statistics would be more informative.

📊 Behavior in symmetric distributions

📊 When measures agree

For symmetric distributions, the mean, median, trimean, and trimmed mean are equal, as is the mode except in bimodal distributions.

  • Symmetric = no skew → all the central tendency measures converge to the same value.
  • The only exception: bimodal distributions (two peaks) can have a mode that differs even when the distribution is symmetric.
  • This agreement makes choosing a measure straightforward when data are symmetric.

📈 Behavior in skewed distributions

📈 Slight positive skew example

The excerpt presents test scores (642 introductory psychology students) with a slight positive skew:

Measure | Value
Mode | 84.00
Median | 90.00
Geometric Mean | 89.70
Trimean | 90.25
Mean trimmed 50% | 89.81
Mean | 91.58
  • Pattern: mean (91.58) is higher than the median (90.00), which is typical for positive skew.
  • Mode is much lower: 84.00, considerably below all other measures—don't assume the mode will be close to the median or mean.
  • Trimean and trimmed mean: usually fall between the median and mean, though in this case the trimmed mean is slightly lower than the median.
  • Geometric mean: lower than all measures except the mode.

📈 Pronounced positive skew example

Baseball salaries (1994) show a much larger positive skew:

Measure | Value (thousands of dollars)
Mode | 250
Median | 500
Geometric Mean | 555
Trimean | 792
Mean trimmed 50% | 619
Mean | 1,183
  • Large differences: the mean is more than double the median, and nearly five times the mode.
  • Why each measure alone is misleading:
    • Reporting only the mean ($1,183,000) suggests most players earn that much, but only about one third do.
    • Reporting only the mode ($250,000) or median ($500,000) hides the fact that some players make many millions.
  • Don't confuse: a higher mean does not mean "most people earn that amount"; it reflects the pull of very high values in the tail.

🔍 General rule for positive skew

When distributions have a positive skew, the mean is typically higher than the median, although it may not be in bimodal distributions.

  • The long tail on the right pulls the mean upward.
  • The median is less affected by extreme values, so it stays closer to the center of the bulk of the data.
  • Bimodal distributions can break this pattern.
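A quick, hedged illustration of this pattern in Python. The numbers below are invented salary-like values with a long right tail; they are not data from the excerpt.

```python
from statistics import mean, median

# Invented positively skewed values (in thousands): most are modest,
# a few are very large, mimicking a salary-style distribution.
values = [250, 250, 300, 400, 500, 600, 800, 1200, 3500, 9000]

print(mean(values))    # 1680.0 -- pulled upward by the extreme values in the tail
print(median(values))  # 550.0  -- stays near the bulk of the data
```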

📢 Reporting recommendations

📢 No single measure is sufficient

No single measure of central tendency is sufficient for data such as these.

  • When measures differ substantially, one number cannot capture the distribution's shape.
  • Example: answering "What do baseball players make?" with only the mean or only the median omits critical information about the spread and tail.

📢 What to report

The excerpt recommends:

  • Mean, median, and either the trimean or the mean trimmed 50% should be reported when measures differ.
  • Mode is sometimes worth reporting as well (especially when it is far from other measures).
  • This combination gives readers a sense of both the center and the skew.

📢 Media practice

  • The median is usually reported to summarize the center of skewed distributions (e.g., median salaries, median house prices).
  • The excerpt notes this is better than reporting only the mean, but "it would be informative to hear more statistics."
  • Don't confuse: the median alone is an improvement over the mean alone for skewed data, but still incomplete.
26

Measures of Variability

Measures of Variability

🧭 Overview

🧠 One-sentence thesis

Measures of variability quantify how spread out a distribution is, with four main measures—range, interquartile range, variance, and standard deviation—each capturing different aspects of dispersion around the center.

📌 Key points (3–5)

  • What variability measures: how "spread out" or dispersed scores are in a distribution (synonyms: spread, dispersion).
  • Four main measures: range (simplest), interquartile range (middle 50%), variance (average squared deviation from mean), and standard deviation (square root of variance).
  • Common confusion: variance vs. standard deviation—variance uses squared deviations (units squared), while standard deviation returns to original units and is easier to interpret.
  • Population vs. sample formulas: when estimating population variance from a sample, divide by N-1 instead of N to avoid underestimation.
  • Why it matters: standard deviation is especially useful for normal distributions, where known percentages fall within 1 or 2 standard deviations of the mean (68% and 95%).

📏 What variability means

📏 The concept of spread

Variability refers to how "spread out" a group of scores is.

  • The excerpt illustrates this with two quizzes, both with mean = 7.0, but very different distributions.
  • Quiz 1: scores densely packed (less variability).
  • Quiz 2: scores more spread out (greater variability).
  • The terms variability, spread, and dispersion are synonyms.
  • Example: Two datasets can have identical means but completely different spreads—one tightly clustered, one widely scattered.

🔍 Why multiple measures exist

  • No single measure captures all aspects of spread.
  • Different measures emphasize different features: extreme values, middle bulk, or average distance from center.
  • The excerpt presents four frequently used measures, each with trade-offs.

📐 Simple measures: Range and IQR

📐 Range

The range is simply the highest score minus the lowest score.

  • How to calculate: highest score − lowest score.
  • Example from excerpt: numbers 10, 2, 5, 6, 7, 3, 4 → range = 10 − 2 = 8.
  • For Quiz 1: range = 9 − 5 = 4.
  • For Quiz 2: range = 10 − 4 = 6 (larger spread).
  • Limitation: sensitive to extreme outliers; only uses two data points.

📊 Interquartile range (IQR)

The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution.

  • Formula: IQR = 75th percentile − 25th percentile.
  • Also called the H-spread (using "upper hinge" and "lower hinge" terminology from box plots).
  • Example from excerpt:
    • Quiz 1: 75th percentile = 8, 25th percentile = 6 → IQR = 2.
    • Quiz 2: 75th percentile = 9, 25th percentile = 5 → IQR = 4 (greater spread).
  • Semi-interquartile range: IQR divided by 2; for symmetric distributions, median ± semi-IQR contains half the scores.
  • Don't confuse: IQR ignores the tails (top and bottom 25%), unlike range which uses extremes.
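A short sketch of the range and IQR calculations, assuming NumPy is available. Note that np.percentile's default interpolation may not match the textbook's percentile formula exactly, so hand-worked quiz values can differ slightly.

```python
import numpy as np

numbers = [10, 2, 5, 6, 7, 3, 4]   # example numbers from the excerpt

# Range: highest score minus lowest score -> 10 - 2 = 8
print(max(numbers) - min(numbers))  # 8

# IQR: 75th percentile minus 25th percentile.
# np.percentile uses linear interpolation by default, which may differ
# slightly from the textbook's percentile definition.
q75, q25 = np.percentile(numbers, [75, 25])
print(q75 - q25)
```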

🧮 Variance: Average squared deviation

🧮 What variance measures

Variance is defined as the average squared difference of the scores from the mean.

  • Focuses on how close scores are to the middle (using the mean as the center).
  • Key insight: the deviations from the mean always sum (and hence average) to 0, so we square the deviations to keep positive and negative deviations from cancelling.
  • Example from excerpt (Quiz 1):
    • Mean = 7.0.
    • Each score's deviation from mean is calculated (e.g., 9 − 7 = 2).
    • Square each deviation (e.g., 2² = 4).
    • Average the squared deviations → variance = 1.5.
  • Quiz 2 has variance = 6.7 (much larger, reflecting greater spread).

🔢 Population variance formula

  • Formula in words: variance (σ²) = sum of (each score minus mean)² divided by N.
  • σ² is the symbol for population variance.
  • μ is the population mean.
  • N is the number of scores.

🔢 Sample variance formula (estimating population)

  • Why different: using N underestimates the population variance when working from a sample.
  • Corrected formula: divide by N−1 instead of N.
  • s² is the symbol for sample variance estimate.
  • M is the sample mean.
  • Example from excerpt: scores 1, 2, 4, 5 sampled from a larger population.
    • M = (1+2+4+5)/4 = 3.
    • s² = [(1−3)² + (2−3)² + (4−3)² + (5−3)²] / (4−1) = 10/3 = 3.333.
  • Don't confuse: population formula (÷N) vs. sample estimate formula (÷N−1).

🧮 Alternate computational formulas

  • The excerpt provides shortcut formulas easier for hand calculation:
    • Population: variance = [sum of (each score squared) − (sum of scores)²/N] / N.
    • Sample: variance = [sum of (each score squared) − (sum of scores)²/N] / (N−1).
  • Example verification: using scores 1, 2, 4, 5:
    • Sum of scores squared = 1² + 2² + 4² + 5² = 46.
    • (Sum of scores)²/N = 12²/4 = 36.
    • s² = (46 − 36) / 3 = 3.333 (same result).
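A small Python sketch of the sample-variance formulas above, using the excerpt's scores 1, 2, 4, 5 to confirm that the definitional and shortcut versions agree.

```python
# Scores sampled from a larger population (the excerpt's example).
scores = [1, 2, 4, 5]
N = len(scores)
M = sum(scores) / N                      # sample mean = 3.0

# Definitional sample estimate: squared deviations divided by N - 1.
s2_definitional = sum((x - M) ** 2 for x in scores) / (N - 1)
print(s2_definitional)                   # 3.333...

# Shortcut (computational) formula: same result without deviation scores.
sum_sq = sum(x ** 2 for x in scores)     # 46
sq_sum = sum(scores) ** 2 / N            # 144 / 4 = 36
s2_shortcut = (sum_sq - sq_sum) / (N - 1)
print(s2_shortcut)                       # 3.333...

# Dividing by N instead of N - 1 gives 10 / 4 = 2.5 for these numbers,
# illustrating how the uncorrected formula yields a smaller value.
print(sum((x - M) ** 2 for x in scores) / N)   # 2.5
```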

📉 Standard deviation: The square root step

📉 What standard deviation is

The standard deviation is simply the square root of the variance.

  • Symbol: σ for population standard deviation, s for sample estimate.
  • Example from excerpt: Quiz 1 variance = 1.5 → standard deviation = 1.225; Quiz 2 variance = 6.7 → standard deviation = 2.588.
  • Why take the square root: variance is in squared units (hard to interpret); standard deviation returns to original units.

📊 Why standard deviation is especially useful

  • For normal (or approximately normal) distributions, known percentages fall within standard deviation intervals:
    • 68% of the distribution is within 1 standard deviation of the mean.
    • 95% of the distribution is within 2 standard deviations of the mean.
  • Example from excerpt: normal distribution with mean = 50, standard deviation = 10:
    • 68% of scores fall between 50 − 10 = 40 and 50 + 10 = 60.
    • 95% of scores fall between 50 − 2×10 = 30 and 50 + 2×10 = 70.
  • The excerpt shows two normal distributions (Figure 2):
    • Red: mean = 40, SD = 5 → 68% between 35 and 45.
    • Blue: mean = 60, SD = 10 → 68% between 50 and 70.
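As a numeric check of the 68% and 95% figures, the sketch below uses SciPy's normal distribution (an assumption of this example, not part of the excerpt) with the mean-50, SD-10 case.

```python
from scipy.stats import norm

mean, sd = 50, 10

# Proportion within 1 standard deviation of the mean (roughly 68%).
within_1sd = norm.cdf(mean + sd, loc=mean, scale=sd) - norm.cdf(mean - sd, loc=mean, scale=sd)
print(round(within_1sd, 4))   # ~0.6827

# Proportion within 2 standard deviations of the mean (roughly 95%).
within_2sd = norm.cdf(mean + 2 * sd, loc=mean, scale=sd) - norm.cdf(mean - 2 * sd, loc=mean, scale=sd)
print(round(within_2sd, 4))   # ~0.9545
```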

🔍 Interpreting standard deviation

  • Larger standard deviation = more spread out.
  • Smaller standard deviation = more tightly clustered around the mean.
  • Example: Quiz 2 (SD = 2.588) has greater variability than Quiz 1 (SD = 1.225), consistent with the visual spread in the bar charts.

📋 Summary comparison

Measure | What it captures | Formula (in words) | Pros | Cons
Range | Distance between extremes | Highest − lowest | Simplest to calculate | Sensitive to outliers; ignores middle scores
Interquartile range (IQR) | Spread of middle 50% | 75th percentile − 25th percentile | Resistant to outliers | Ignores tails
Variance | Average squared deviation from mean | Sum of (score − mean)² / N (or N−1) | Uses all data; foundation for other statistics | Units are squared (hard to interpret)
Standard deviation | Square root of variance | Square root of variance | Same units as data; interpretable with normal distributions | Still affected by outliers

🧠 Choosing the right measure

  • The excerpt does not prescribe one "best" measure; each serves different purposes.
  • For symmetric, normal-like distributions: standard deviation is especially informative.
  • For skewed or outlier-prone data: IQR may be more robust.
  • Range is quick but crude; variance is foundational but less intuitive than standard deviation.
27

Shapes of Distributions

Shapes of Distributions

🧭 Overview

🧠 One-sentence thesis

Skewness and kurtosis provide numerical measures that describe how a distribution's shape deviates from symmetry and normal peakedness, helping statisticians understand data patterns beyond central tendency and spread.

📌 Key points (3–5)

  • What skew measures: the direction and degree to which a distribution's tail extends to one side, with positive skew having tails extending to the right.
  • Skew and central tendency relationship: distributions with positive skew normally have larger means than medians (e.g., mean more than twice the median in baseball salaries).
  • Two skew formulas: Pearson's measure uses 3 times (mean minus median) divided by standard deviation; the more common measure is the "third moment about the mean."
  • Common confusion: don't confuse skew direction with which measure (mean vs median) is larger—positive skew means the mean is typically higher than the median, not lower.
  • Kurtosis definition: measures the "peakedness" of a distribution, with 3 subtracted so that a normal distribution has zero kurtosis.

📊 Understanding skewness

📊 What positive skew looks like

  • A distribution with positive skew has tails that extend to the right.
  • Example from the excerpt: baseball player salaries show very large positive skew, with most players earning less but a few earning extremely high amounts.
  • The histogram shows counts concentrated at lower values with a long tail stretching toward higher values.

📈 How skew affects mean vs median

Distributions with positive skew normally have larger means than medians.

  • The baseball salary example demonstrates this clearly:
    • Mean: $1,183,417
    • Median: $500,000
    • The mean is more than twice as high as the median
  • Why this happens: extreme values in the tail pull the mean upward, but the median stays anchored to the middle position.
  • Don't confuse: this is the normal pattern, not a universal rule—the excerpt says "normally have," indicating typical behavior.

🧮 Measuring skewness numerically

🧮 Pearson's measure of skew

The excerpt presents Pearson's simple and convenient formula:

Pearson's skew = 3 times (mean minus median) divided by standard deviation

  • For the baseball salaries:
    • Standard deviation: 1,390,922
    • Calculation: 3 times (1,183,417 minus 500,000) divided by 1,390,922 equals 1.47
  • This measure is "simple and convenient" according to the excerpt.
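The Pearson skew calculation translates directly into a couple of lines of Python. The numbers below are the summary statistics quoted in the excerpt; the third-moment measure is not shown here because the excerpt does not give its formula.

```python
# Summary statistics for the 1994 baseball salaries, as given in the excerpt.
mean, median, sd = 1_183_417, 500_000, 1_390_922

# Pearson's measure of skew: 3 * (mean - median) / standard deviation
pearson_skew = 3 * (mean - median) / sd
print(round(pearson_skew, 2))   # 1.47
```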

🔢 Third moment about the mean

  • The excerpt states this is "more commonly used" than Pearson's measure.
  • It is "sometimes referred to as the third moment about the mean."
  • The excerpt mentions it exists but does not provide the detailed formula in the text portion.
  • Just as there are several measures of central tendency, there is more than one measure of skew.

📐 Understanding kurtosis

📐 What kurtosis measures

Kurtosis: a measure of a distribution's "peakedness," constructed in a way similar to skew.

  • The measure is "similar to the definition of skew" in its construction.
  • The value 3 is subtracted from the calculation.
  • Purpose of subtracting 3: to define "no kurtosis" as the kurtosis of a normal distribution.
  • Without this adjustment, a normal distribution would have a kurtosis of 3, which would be confusing when trying to identify departures from normality.

🎯 Interpreting kurtosis values

  • Zero kurtosis = normal distribution (after the subtraction of 3).
  • Positive kurtosis = more peaked or heavier tails than normal.
  • Negative kurtosis = flatter or lighter tails than normal.
  • The excerpt emphasizes that the subtraction is specifically to make the normal distribution the reference point (zero).

🔗 Relationship to other concepts

🔗 Prerequisites mentioned

The excerpt lists prerequisites showing how skew and kurtosis build on earlier concepts:

Prerequisite | Why it matters
Distributions (Chapter 1) | Foundation for understanding shape differences
Measures of Central Tendency (Chapter 3) | Mean and median are used in skew calculations
Variability (Chapter 3) | Standard deviation appears in Pearson's skew formula

🎓 Learning objectives stated

  1. Compute skew using two different formulas (Pearson's and third moment)
  2. Compute kurtosis
  • These objectives indicate that the section teaches how to calculate these measures, not just understand them conceptually.
28

Effects of Linear Transformations

Effects of Linear Transformations

🧭 Overview

🧠 One-sentence thesis

Linear transformations (multiplying by a constant and adding a constant) change the mean and standard deviation of a distribution in predictable ways, which is essential for standardizing data and working with the standard normal distribution.

📌 Key points (3–5)

  • What linear transformations do: multiplying all values by a constant and/or adding a constant shifts and scales a distribution systematically.
  • How they affect the mean and standard deviation: multiplying by a constant a scales the mean by a and the standard deviation by |a|; adding a constant b shifts the mean by b but leaves the standard deviation unchanged.
  • The Z-score transformation: converting raw data to Z-scores is a linear transformation that produces a distribution with mean 0 and standard deviation 1.
  • Common confusion: the standard normal distribution is not a different kind of distribution—it's just a normal distribution that has been standardized through linear transformation.
  • Why it matters: standardizing allows us to use a single reference table (the Z table) for all normal distributions, and to compare values from different scales.

🔄 What linear transformations are

🔄 The basic operation

Linear transformation: applying the formula Y = a × X + b to every value in a dataset, where a and b are constants.

  • This is called "linear" because the relationship between X and Y is a straight line.
  • a is the scale factor (multiplier); b is the shift (additive constant).
  • Example: if you have temperatures in Celsius and want Fahrenheit, you use F = 1.8 × C + 32.

📊 How they change distributions

  • Multiplying by a: stretches (if |a| > 1) or compresses (if |a| < 1) the distribution; if a is negative, it also flips the distribution.
  • Adding b: slides the entire distribution left (if b < 0) or right (if b > 0) without changing its shape or spread.
  • The shape of the distribution (e.g., normal, skewed) does not change—only location and scale change.

📐 Effects on mean and standard deviation

📐 Mean transformation

  • If the original mean is μ, then after the transformation Y = a × X + b, the new mean is a × μ + b.
  • Example: if the mean of X is 50, and you compute Y = 8 × X + 75, the new mean is 8 × 50 + 75 = 475.

📏 Standard deviation transformation

  • If the original standard deviation is σ, then after the transformation Y = a × X + b, the new standard deviation is |a| × σ.
  • Key point: adding a constant (b) does not change the standard deviation—it only shifts the center.
  • Example: if the standard deviation of X is 10, and you compute Y = 8 × X + 75, the new standard deviation is 8 × 10 = 80.

🔍 Don't confuse

  • Don't confuse "adding a constant" with "multiplying by a constant":
    • Adding changes the mean but not the spread.
    • Multiplying changes both the mean and the spread.
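A small sketch of these rules in Python. The toy dataset is an assumption chosen so that its mean is 50 and standard deviation is 10, matching the worked example with Y = 8 × X + 75.

```python
import numpy as np

# Dataset chosen (for illustration) to have mean 50 and SD 10.
x = np.array([35.0, 45.0, 50.0, 55.0, 65.0])
a, b = 8, 75                     # linear transformation Y = a * X + b
y = a * x + b

print(x.mean(), y.mean())        # 50.0 -> 475.0  (new mean = a * old mean + b)
print(x.std(), y.std())          # 10.0 -> 80.0   (new SD = |a| * old SD; adding b changes nothing)
```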

🎯 The Z-score transformation (standardization)

🎯 What standardization is

Standardizing a distribution: transforming all values to Z-scores using Z = (X − μ) / σ, which produces a distribution with mean 0 and standard deviation 1.

  • This is a linear transformation: Z = (1/σ) × X + (−μ/σ).
  • Every value is expressed as "how many standard deviations away from the mean."
  • Example: if X = 26, μ = 50, and σ = 10, then Z = (26 − 50)/10 = −2.4 (i.e., 2.4 standard deviations below the mean).

📊 Why standardize

  • Comparison across scales: Z-scores let you compare values from different distributions (e.g., SAT scores vs. GPA).
  • Using a single reference table: once standardized, all normal distributions become the standard normal distribution (mean 0, standard deviation 1), so you can use one Z table for all problems.
  • Simplifies calculations: many statistical procedures assume or work more easily with standardized data.

🔍 Don't confuse

  • Don't think the standard normal distribution is a fundamentally different distribution—it's just a normal distribution that has been shifted and scaled to have mean 0 and standard deviation 1.
  • The shape remains normal; only the numbers on the axis change.

🧮 Working with the standard normal distribution

🧮 The standard normal distribution

  • Mean = 0, standard deviation = 1.
  • Denoted by the letter Z.
  • The cumulative distribution function (CDF) is often written as Φ(z), which gives the area (probability) below z.

📋 Using Z tables

  • A Z table lists values of z and the corresponding area below z (the cumulative probability).
  • Example from the excerpt: z = −2.5 corresponds to an area of 0.0062, meaning about 0.62% of the distribution is below −2.5.
  • To find probabilities for any normal distribution, first convert the raw value X to a Z-score, then look up the Z-score in the table.

🔄 Converting back to the original scale

  • If you know a Z-score and want the corresponding raw value, use X = μ + σ × Z.
  • Example: if μ = 50, σ = 10, and Z = 1.5, then X = 50 + 10 × 1.5 = 65.
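The sketch below ties these steps together, with SciPy's normal CDF standing in for a printed Z table (an assumption of this example; the excerpt itself refers to tables and applets).

```python
from scipy.stats import norm

mu, sigma = 50, 10

# Raw score -> Z-score: how many standard deviations below/above the mean.
x = 26
z = (x - mu) / sigma
print(z)                          # -2.4

# Area below the score (what a Z table gives); norm.cdf plays the role of Phi.
print(round(norm.cdf(z), 4))      # ~0.0082
print(round(norm.cdf(-2.5), 4))   # 0.0062, the table value cited in the excerpt

# Z-score -> raw score on the original scale: X = mu + sigma * Z.
print(mu + sigma * 1.5)           # 65.0
```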

🖥️ Using calculators/applets

  • The excerpt mentions using online calculators (applets) that let you enter the mean and standard deviation directly, so you don't have to manually convert to Z-scores.
  • You input the raw value, mean, and standard deviation, and the calculator returns the area/probability.
  • Example: to find the area below 26 in a distribution with mean 50 and standard deviation 10, you can enter those values directly and get the same result as converting to Z = −2.4 and using a Z table.

🔗 Connection to other topics

🔗 Normal approximation to the binomial

  • The excerpt briefly mentions that the normal distribution can approximate the binomial distribution (a discrete distribution).
  • The binomial has a mean μ = N × π and variance σ² = N × π × (1 − π), where N is the number of trials and π is the probability of success.
  • To use the normal approximation, you standardize the binomial values using these parameters.
  • Continuity correction: because the binomial is discrete and the normal is continuous, you adjust by 0.5 (e.g., to approximate the probability of exactly 8 successes, you find the area from 7.5 to 8.5 in the normal distribution).
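A brief numeric sketch of the continuity correction. The binomial parameters (N = 12, π = 0.5) are hypothetical choices for illustration; only the formulas come from the excerpt, and SciPy supplies the normal CDF.

```python
from math import comb, sqrt
from scipy.stats import norm

# Hypothetical binomial: N and pi are illustrative, not from the excerpt.
N, pi = 12, 0.5
mu = N * pi                        # binomial mean
sigma = sqrt(N * pi * (1 - pi))    # binomial standard deviation

# Exact probability of exactly 8 successes.
exact = comb(N, 8) * pi**8 * (1 - pi)**(N - 8)
print(round(exact, 4))             # ~0.1208

# Normal approximation with continuity correction: area from 7.5 to 8.5.
approx = norm.cdf(8.5, loc=mu, scale=sigma) - norm.cdf(7.5, loc=mu, scale=sigma)
print(round(approx, 4))            # ~0.119 (close to the exact value)
```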

🔗 Q-Q plots

  • The excerpt mentions Q-Q plots as a prerequisite for the standard normal distribution section.
  • Q-Q plots compare sample quantiles to theoretical quantiles; if data are normal, the plot should be close to a straight line.
  • Standardizing data (transforming to Z-scores) is often done before making a Q-Q plot to check normality.
29

Variance Sum Law I

Variance Sum Law I

🧭 Overview

🧠 One-sentence thesis

The variance of a sampling distribution of the mean equals the population variance divided by the sample size, which can be derived directly from the variance sum law.

📌 Key points (3–5)

  • Core relationship: the variance of the sampling distribution of the mean is the population variance divided by N (sample size).
  • Effect of sample size: larger samples produce smaller variance in the sampling distribution, meaning sample means cluster more tightly around the population mean.
  • Derivation method: the variance sum law provides a straightforward way to prove why variance shrinks by a factor of N.
  • Common confusion: the variance of the sum grows with N, but the variance of the mean (which divides the sum by N) shrinks with N.
  • Practical implication: this formula underpins the standard error of the mean, a key measure in inferential statistics.

📐 The fundamental variance formula

📐 Variance of the sampling distribution of the mean

The variance of the sampling distribution of the mean is the population variance divided by N, the sample size.

  • Formula in words: variance of sampling distribution of mean = (population variance) / N
  • The excerpt uses the notation: variance of M = (sigma squared) / N
  • This is not the variance of a single observation; it is the variance of means computed from samples of size N.

📉 How sample size affects variance

  • Larger N → smaller variance: as the sample size increases, the variance of the sampling distribution decreases.
  • Why it matters: sample means from larger samples are less spread out and closer to the true population mean.
  • Example: if you compute means from samples of 100 instead of 10, the means will vary much less from sample to sample.

🧮 Deriving the formula using the variance sum law

🧮 Step-by-step derivation

The excerpt provides an optional but clear derivation:

  1. Start with a sum of N numbers sampled from a population with variance sigma squared.
  2. Variance of the sum: by the variance sum law, the variance of the sum of N independent values is N times sigma squared.
    • For three numbers: sigma squared + sigma squared + sigma squared = 3 sigma squared.
    • For N numbers: N sigma squared.
  3. Variance of the mean: the mean is (1/N) times the sum.
    • When you multiply a random variable by a constant, the variance is multiplied by the square of that constant.
    • So variance of the mean = (1/N) squared times (variance of the sum) = (1 divided by N squared) times (N sigma squared) = sigma squared divided by N.

🔍 Why the sum grows but the mean shrinks

  • Don't confuse: the variance of the sum increases with N (more terms add more variability).
  • But the mean divides the sum by N, so the variance is scaled down by N squared, leaving sigma squared divided by N.
  • This is why averaging reduces variability: you are spreading the total variability over N observations.
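A simulation sketch of the σ²/N result. The normal population and the specific numbers are assumptions for illustration; the relationship itself holds for any population with finite variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, reps = 25.0, 10, 200_000   # population variance, sample size, repetitions

# Draw many samples of size N and record each sample mean.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, N))
sample_means = samples.mean(axis=1)

print(sample_means.var())   # empirically close to sigma2 / N
print(sigma2 / N)           # 2.5
```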

🔗 Connection to the standard error

🔗 Standard error of the mean

  • The excerpt mentions "the standard error" at the end but does not define it fully in this section.
  • Context from the prerequisite material: the standard error of the mean is the standard deviation of the sampling distribution of the mean.
  • Since variance of the sampling distribution of the mean = sigma squared / N, the standard error = sigma / square root of N.
  • This measure tells you how much sample means typically differ from the population mean.

📊 Practical use

  • Knowing the variance (or standard error) of the sampling distribution helps you gauge how close your sample mean is likely to be to the true population mean.
  • Example: if the standard error is small, your sample mean is probably very close to the population mean; if it is large, there is more uncertainty.
30

Introduction to Bivariate Data

Introduction to Bivariate Data

🧭 Overview

🧠 One-sentence thesis

Bivariate data analysis reveals relationships between two variables that cannot be understood by examining each variable separately, and scatter plots provide the primary graphical tool for visualizing these paired relationships.

📌 Key points (3–5)

  • What bivariate data is: data with two variables collected on each individual, where the pairing between variables matters.
  • Why pairing matters: separating variables into individual summaries (means, histograms) loses critical information about the relationship between them.
  • How to visualize relationships: scatter plots maintain the pairing and reveal both the direction (positive/negative association) and shape (linear/nonlinear) of relationships.
  • Common confusion: linear vs nonlinear relationships—points clustering along a straight line indicate linear; curved patterns (like a parabola) indicate nonlinear.
  • Pearson's correlation: the most common numerical measure of linear relationship strength, ranging from -1 (perfect negative) to +1 (perfect positive).

📊 What bivariate data captures

📊 Definition and purpose

Bivariate data: data consisting of two quantitative variables collected for each individual.

  • Examples from the excerpt:
    • Health studies: age, sex, height, weight, blood pressure, cholesterol
    • Economic studies: personal income and years of education
    • University admissions: high school GPA and standardized test scores
    • Spousal ages: husband's age and wife's age

🔗 Why pairing matters

  • The excerpt emphasizes that pairing within each observation is lost when variables are separated.
  • Even with complete summary statistics (mean, standard deviation, histograms) for each variable individually, you cannot answer questions about the relationship:
    • What percentage of couples has younger husbands than wives?
    • What is the average age of husbands with 45-year-old wives?
    • How does one variable change as the other changes?
  • Example: The spousal age data shows husbands tend to be slightly older, but this pattern is only visible when you maintain the pairing—separate histograms and means cannot reveal it.

Don't confuse: Univariate summaries (histograms, means) describe each variable's distribution, but they do not describe the relationship between variables.

📈 Scatter plots as the primary tool

📈 What a scatter plot shows

  • A scatter plot displays bivariate data graphically while maintaining the pairing.
  • One variable is plotted on the x-axis, the other on the y-axis.
  • Each point represents one paired observation.
  • Example: In the spousal age scatter plot, the x-axis is husband's age, the y-axis is wife's age, and each point is one couple.
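A minimal plotting sketch, assuming matplotlib is available. The paired ages below are invented; the actual spousal-age dataset from the excerpt is not reproduced here.

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations (one couple per point).
husband_age = [25, 32, 38, 44, 51, 60]
wife_age = [23, 31, 36, 45, 49, 58]

plt.scatter(husband_age, wife_age)   # each point keeps the pairing intact
plt.xlabel("Husband's age")
plt.ylabel("Wife's age")
plt.title("Scatter plot of paired ages")
plt.show()
```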

🔍 Two key characteristics revealed

The excerpt identifies two important features visible in scatter plots:

  1. Direction of association:

    • Positive association: when one variable (Y) increases, the other variable (X) also increases.
    • Negative association: when Y decreases as X increases.
    • Example: Spousal ages show positive association—older husbands tend to have older wives.
  2. Shape of relationship:

    • Linear relationship: points cluster along a straight line.
    • Nonlinear relationship: points follow a curved pattern.
    • Example: Galileo's projectile motion data (release height vs. distance traveled) shows a parabolic (nonlinear) relationship—a straight line would not fit the points well.

🎯 Examples from the excerpt

Dataset | Variables | Association | Shape | Pearson's r
Spousal ages | Husband's age, Wife's age | Positive (older husbands → older wives) | Linear, tightly clustered | 0.97
Physical workers | Grip strength, Arm strength | Positive (stronger grip → stronger arm) | Linear, less tightly clustered | 0.63
Galileo's experiment | Release height, Distance traveled | Positive | Nonlinear (parabola) | Not given
Figure 3 example | X, Y | None | No pattern | 0

Don't confuse: A strong relationship (points close to a line) vs. a weak relationship (points scattered widely). Both can be linear, but the strength differs—this is captured by how tightly points cluster around the line.

🔢 Pearson's correlation coefficient

🔢 What it measures

Pearson product-moment correlation coefficient: a measure of the strength of the linear relationship between two variables.

  • Symbol: ρ (rho) in the population, r in a sample.
  • The excerpt notes that if the relationship is not linear, Pearson's correlation does not adequately represent the strength of the relationship.

📏 Range and interpretation

  • Possible range: -1 to +1.
  • Perfect positive linear relationship: r = +1 (all points lie exactly on an upward-sloping line).
  • Perfect negative linear relationship: r = -1 (all points lie exactly on a downward-sloping line).
  • No linear relationship: r = 0 (no pattern, or a nonlinear pattern).

Example interpretations from the excerpt:

  • r = 0.97 (spousal ages): very strong positive linear relationship, points cluster very tightly along a line.
  • r = 0.63 (grip and arm strength): moderate positive linear relationship, points cluster less tightly.
  • r = 0 (Figure 3): no linear relationship at all.

🔄 Key properties

The excerpt on "Properties of Pearson's r" highlights:

  1. Symmetry: The correlation of X with Y is the same as the correlation of Y with X.

    • Example: The correlation of Weight with Height equals the correlation of Height with Weight.
  2. Range reminder: -1 to +1, where -1 and +1 represent perfect linear relationships and 0 represents no linear relationship.

Don't confuse: r = 0 does not mean "no relationship"—it means "no linear relationship." A strong nonlinear relationship (like Galileo's parabola) can have r near 0.

🧩 What comes next

🧩 Chapter structure

The excerpt outlines the chapter's organization:

  • The introductory section (covered here) gives examples and introduces scatter plots.
  • The next five sections discuss Pearson's correlation in detail.
  • The final section ("Variance Sum Law II") applies Pearson's correlation to generalize a law to bivariate data.

🎓 Prerequisites and learning objectives

Prerequisites listed:

  • Variables, distributions, histograms (Chapter 1–2)
  • Measures of central tendency, variability, shapes of distributions (Chapter 3)

Learning objectives for the introduction:

  1. Define bivariate data.
  2. Define scatter plot.
  3. Distinguish linear from nonlinear relationships.
  4. Identify positive and negative associations from scatter plots.
31

Values of the Pearson Correlation

Values of the Pearson Correlation

🧭 Overview

🧠 One-sentence thesis

The sampling distribution of Pearson's r becomes increasingly negatively skewed as the population correlation ρ approaches 1.0, because r cannot exceed 1.0 and therefore has limited room to vary in the positive direction.

📌 Key points (3–5)

  • What the sampling distribution of r shows: the distribution of sample correlation values (r) obtained from repeated random samples when the population correlation (ρ) is known.
  • Shape is not symmetric: the sampling distribution of r is negatively skewed, not normal, especially when ρ is large.
  • Why the skew happens: r cannot exceed 1.0, so the distribution cannot extend as far in the positive direction as it can in the negative direction.
  • Common confusion: the greater the population correlation ρ, the more pronounced the skew—don't assume all sampling distributions of r look the same.
  • Practical implication: when ρ is high (e.g., 0.90), the distribution has a very short positive tail and a long negative tail.

📊 The sampling distribution of r

📊 What it represents

The sampling distribution of r: the distribution of sample correlation values (r) obtained after repeated random samples from a population with a known correlation ρ.

  • If you draw many samples of the same size from a population, each sample will yield a slightly different r.
  • The collection of all these r values forms the sampling distribution.
  • Example: if the population correlation between quantitative and verbal SAT scores is ρ = 0.60, and you sample 12 students repeatedly, each sample's r will vary around 0.60.

🔢 Key notation

  • ρ (rho): the population correlation (the true correlation in the entire population).
  • r: the sample correlation (the correlation calculated from a single sample).
  • The excerpt uses N = 12 (sample size of 12 students) in its examples.

🔀 Shape and skewness

🔀 Not symmetric

  • The sampling distribution of r is negatively skewed, not normal.
  • This means the distribution has a longer tail on the left (negative side) and a shorter tail on the right (positive side).
  • Don't confuse: many sampling distributions (e.g., of means) are approximately normal, but the sampling distribution of r is not.

🚧 Why r is skewed

  • The constraint: r cannot take values greater than 1.0.
  • Because of this upper limit, the distribution cannot extend as far in the positive direction as it can in the negative direction.
  • The result is a distribution that is "cut off" on the right and stretched out on the left.

📈 Effect of ρ on skewness

  • The greater the value of ρ, the more pronounced the skew.
  • When ρ is small or moderate, the skew is less noticeable.
  • When ρ is large (e.g., 0.90), the skew is very pronounced.

📐 Examples of different ρ values

📐 Moderate correlation: ρ = 0.60

  • The excerpt describes the sampling distribution for N = 12 and ρ = 0.60.
  • The distribution is negatively skewed but not extremely so.
  • There is still some room for r to vary in the positive direction (from 0.60 up to 1.0).

📐 High correlation: ρ = 0.90

  • The excerpt describes the sampling distribution for N = 12 and ρ = 0.90.
  • This distribution has a very short positive tail and a long negative tail.
  • Because ρ is already close to 1.0, there is very little room for sample r values to be higher than 0.90, but plenty of room for them to be lower.
  • Example: if the population correlation is 0.90, a sample might yield r = 0.85 or r = 0.70 (negative direction), but it is unlikely to yield r = 0.95 or r = 1.0 (positive direction is constrained).

🧮 Comparison of distributions

Population correlation (ρ) | Sample size (N) | Shape of sampling distribution | Reason
0.60 | 12 | Negatively skewed (moderate) | r can vary from 0.60 up to 1.0 (positive direction) or down to much lower values, but the positive range is limited
0.90 | 12 | Negatively skewed (very pronounced) | r has very little room to vary upward (0.90 to 1.0) but much room to vary downward, creating a very short positive tail and a long negative tail
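The skew is easy to see by simulation. The sketch below assumes a bivariate normal population (an assumption of this example, not stated in the excerpt) with ρ = 0.90 and N = 12, and records the sample r from many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, N, reps = 0.90, 12, 20_000

# Repeatedly sample N pairs from a bivariate normal population with
# correlation rho, and record the sample correlation r each time.
cov = [[1.0, rho], [rho, 1.0]]
rs = np.empty(reps)
for i in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=N).T
    rs[i] = np.corrcoef(x, y)[0, 1]

# The r values pile up just below 1.0 with a long left tail,
# i.e. the sampling distribution is negatively skewed.
print(np.median(rs), rs.mean())
print((rs > 0.95).mean(), (rs < 0.80).mean())   # short right tail, longer left tail
```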
32

Properties of Pearson's r

Properties of Pearson's r

🧭 Overview

🧠 One-sentence thesis

Pearson's r has specific mathematical properties—ranging from -1 to 1, symmetric between variables, and unaffected by linear transformations—that make it a robust measure of linear relationships.

📌 Key points (3–5)

  • Range and meaning: r ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 meaning no linear relationship.
  • Symmetry property: the correlation of X with Y equals the correlation of Y with X.
  • Invariance under linear transformations: multiplying by a constant or adding a constant does not change r.
  • Common confusion: r measures only linear relationships; the excerpt emphasizes "linear" repeatedly, not all types of relationships.
  • Real-world values: with actual data, you rarely get exactly -1, 0, or 1; examples show r = 0.97 for spousal ages and r = 0.63 for grip/arm strength.

📏 Range and interpretation

📏 The possible values of r

A basic property of Pearson's r is that its possible range is from -1 to 1.

  • -1: a perfect negative linear relationship (as one variable increases, the other decreases perfectly linearly).
  • 0: no linear relationship between the variables.
  • 1: a perfect positive linear relationship (as one variable increases, the other increases perfectly linearly).

🌍 Real data examples

The excerpt provides context for interpreting r values in practice:

Dataset | r value | Interpretation
Spousal ages | 0.97 | Very strong positive relationship
Grip strength and arm strength | 0.63 | Moderate positive relationship
  • With real data, you would not expect to get values of exactly -1, 0, or 1.
  • These examples show that r quantifies the strength of the linear relationship on a continuous scale.

🔄 Symmetry property

🔄 Correlation is bidirectional

Pearson's correlation is symmetric in the sense that the correlation of X with Y is the same as the correlation of Y with X.

  • The order of variables does not matter.
  • Example from the excerpt: the correlation of Weight with Height is the same as the correlation of Height with Weight.
  • This property reflects that correlation measures association, not causation or direction.

🔧 Invariance under linear transformations

🔧 What linear transformations are

A critical property of Pearson's r is that it is unaffected by linear transformations. This means that multiplying a variable by a constant and/or adding a constant does not change the correlation of that variable with other variables.

Linear transformations include:

  • Multiplying by a constant (e.g., changing units)
  • Adding a constant (e.g., shifting all values)

📐 Why this matters for measurement

The excerpt gives two practical examples:

Example 1 (units): The correlation of Weight and Height does not depend on whether Height is measured in inches, feet, or even miles.

  • This means r is unit-free; you can compare correlations across different measurement systems.

Example 2 (shifting scores): Adding five points to every student's test score would not change the correlation of the test score with other variables such as GPA.

  • Shifting all values by the same amount preserves the relationship structure.

🎯 Design purpose

  • The excerpt notes that "Pearson's r is designed so that the correlation between height and weight is the same whether height is measured in inches or in feet."
  • This invariance property makes r a standardized measure that can be compared across contexts.
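The invariance property can be checked directly. The height and weight numbers below are hypothetical illustrations; the point is only that rescaling one variable and shifting the other leaves r unchanged.

```python
import numpy as np

# Hypothetical height (inches) and weight (pounds) data for illustration.
height_in = np.array([60.0, 64.0, 66.0, 69.0, 72.0, 75.0])
weight_lb = np.array([115.0, 130.0, 148.0, 160.0, 178.0, 200.0])

r_original = np.corrcoef(height_in, weight_lb)[0, 1]

# Linear transformations: change units (multiply) and shift scores (add).
height_ft = height_in / 12          # inches -> feet
weight_shifted = weight_lb + 5      # add a constant to every value

r_transformed = np.corrcoef(height_ft, weight_shifted)[0, 1]

print(round(r_original, 6) == round(r_transformed, 6))   # True: r is unchanged
```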
33

Computing Pearson's r

Computing Pearson's r

🧭 Overview

🧠 One-sentence thesis

Pearson's r measures the strength and direction of a linear relationship between two variables by comparing how their deviation scores co-vary, and it remains unchanged by linear transformations like scaling or shifting the data.

📌 Key points (3–5)

  • What Pearson's r measures: the linear relationship between two variables using deviation scores (deviations from the mean).
  • Why the sum of xy reveals relationship: when X and Y are related, positive deviations pair with positive deviations (and negative with negative), making the product xy consistently positive or negative; when unrelated, the sum of xy is small.
  • Invariance to linear transformations: multiplying by a constant or adding a constant does not change the correlation (e.g., measuring height in inches vs. feet gives the same r).
  • Common confusion: the raw sum of xy depends on scale, so Pearson's r divides by the square root of the product of the sums of squared deviations to standardize the result.
  • Alternative formulas: a computational formula exists that skips the deviation-score step for easier calculation.

🔢 Deviation scores and the logic of correlation

📏 What deviation scores are

Deviation scores (x and y): each score is a deviation from the mean, computed by subtracting the mean of X from each X value (creating x) and the mean of Y from each Y value (creating y).

  • The means of x and y are both 0 by construction.
  • Deviation scores show how far each observation is from the center of the distribution.
  • Example: if X has values 1, 3, 5, 5, 6 with mean 4, then x = -3, -1, 1, 1, 2.

🔗 Why the sum of xy reveals relationship

The excerpt explains that the product xy captures whether high values of one variable pair with high values of the other:

  • No relationship: positive x values are equally likely to pair with positive or negative y values, so xy products cancel out and the sum is small.
  • Positive relationship: positive x pairs with positive y (both above their means) and negative x pairs with negative y (both below their means), so all xy products are positive and the sum is large.
  • Negative relationship: positive x pairs with negative y and vice versa, so all xy products are negative and the sum is negative.

Example from Table 1: X = 1, 3, 5, 5, 6 and Y = 4, 6, 10, 12, 13 have x = -3, -1, 1, 1, 2 and y = -5, -3, 1, 3, 4; the xy column is 15, 3, 1, 3, 8, all positive, summing to 30, indicating a positive relationship.

🧮 The formula for Pearson's r

🧮 Conceptual formula using deviation scores

The excerpt presents the formula:

r = (sum of xy) / square root of [(sum of x squared) times (sum of y squared)]

In words: divide the sum of the products of deviation scores by the square root of the product of the sum of squared x deviations and the sum of squared y deviations.

  • The denominator standardizes the numerator so that r is unaffected by the scale of X or Y.
  • From Table 1: sum of xy = 30, sum of x squared = 16, sum of y squared = 60, so r = 30 / square root of (16 times 60) = 30 / square root of 960 = 30 / 30.984 = 0.968.

⚙️ Alternative computational formula

The excerpt mentions an alternative formula that avoids computing deviation scores:

An alternative computational formula that avoids the step of computing deviation scores is provided.

  • This formula works directly with the raw X and Y values.
  • It is easier for hand calculation but less intuitive conceptually.
  • The excerpt does not provide the full formula in the given text, only references it.

🔄 Invariance to linear transformations

🔄 What linear transformations are

Linear transformations: multiplying a variable by a constant and/or adding a constant.

  • Example: converting height from inches to feet (multiplying by 1/12) or adding 5 points to every test score.
  • The excerpt emphasizes that these operations do not change the correlation.

🔄 Why this property matters

  • The correlation between Weight and Height is the same whether Height is measured in inches, feet, or miles.
  • Adding five points to every student's test score does not change the correlation of the test score with other variables such as GPA.
  • This property ensures that r measures the relationship structure, not the units or origin of measurement.
  • Don't confuse: changing the scale or shifting the data changes the raw sum of xy, but the standardized r remains the same because both numerator and denominator adjust proportionally.

📊 Example calculation walkthrough

📊 Step-by-step from Table 1

The excerpt provides a complete worked example:

Step | What to do | Result from Table 1
1. Compute means | Mean of X and mean of Y | Mean X = 4, Mean Y = 9
2. Compute deviation scores | x = X − mean X, y = Y − mean Y | x: -3, -1, 1, 1, 2; y: -5, -3, 1, 3, 4
3. Multiply deviations | xy for each pair | xy: 15, 3, 1, 3, 8
4. Square deviations | x² and y² | x²: 9, 1, 1, 1, 4; y²: 25, 9, 1, 9, 16
5. Sum columns | Sum xy, sum x², sum y² | Sum xy = 30, sum x² = 16, sum y² = 60
6. Apply formula | r = sum xy / √(sum x² × sum y²) | r = 30 / 30.984 = 0.968
  • The final r = 0.968 indicates a very strong positive linear relationship.
  • Each step builds on deviation scores to capture co-variation and then standardize it.
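The same steps in Python, using the Table 1 data from the excerpt and cross-checking against NumPy's built-in correlation.

```python
import numpy as np

# The paired data from Table 1.
X = np.array([1, 3, 5, 5, 6], dtype=float)
Y = np.array([4, 6, 10, 12, 13], dtype=float)

# Deviation scores: subtract each variable's mean.
x = X - X.mean()                 # [-3, -1, 1, 1, 2]
y = Y - Y.mean()                 # [-5, -3, 1, 3, 4]

# Conceptual formula: sum of xy over sqrt(sum x^2 * sum y^2).
r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(round(r, 3))               # 0.968

# Cross-check with NumPy's built-in correlation.
print(round(np.corrcoef(X, Y)[0, 1], 3))   # 0.968
```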
34

Variance Sum Law II

Variance Sum Law II

🧭 Overview

🧠 One-sentence thesis

When two variables are correlated, the variance of their sum or difference must include an additional term that accounts for the correlation between them, unlike the simpler formula used when variables are independent.

📌 Key points

  • Extension of independence: Variance Sum Law I assumes independence; Variance Sum Law II handles correlated variables.
  • The correlation term: When variables X and Y are correlated, you add (for sums) or subtract (for differences) a term involving twice the correlation times the product of the standard deviations.
  • Practical use: If you know the variance of each variable and their correlation, you can compute the variance of their sum or difference.
  • Common confusion: Don't forget the correlation term—ignoring it when variables are correlated will give the wrong variance.
  • Notation difference: Use rho (the population correlation) in the population formula; use r (the sample correlation) in the sample formula.

📐 The basic formulas

📐 When variables are independent

Variance Sum Law I: When X and Y are independent, the variance of X plus or minus Y equals the variance of X plus the variance of Y.

  • Written in words: variance of (X ± Y) = variance of X + variance of Y.
  • The sign (plus or minus) between X and Y does not matter when they are independent.
  • This is the simpler case covered in a prerequisite chapter.

📐 When variables are correlated

Variance Sum Law II: When X and Y are correlated, the variance of (X ± Y) = variance of X + variance of Y ± 2 times rho times the square root of (variance of X) times the square root of (variance of Y).

  • The ± sign in the formula matches the ± sign in (X ± Y).
  • For a sum (X + Y): use plus 2 times rho times square root of variance of X times square root of variance of Y.
  • For a difference (X − Y): use minus 2 times rho times square root of variance of X times square root of variance of Y.
  • Rho is the population correlation between X and Y.

🧮 Worked example with SAT scores

🧮 Setting up the problem

The excerpt gives a concrete example:

  • Variance of verbal SAT = 10,000
  • Variance of quantitative SAT = 11,000
  • Correlation between verbal and quantitative = 0.50

➕ Variance of the sum (total SAT)

To find the variance of total SAT (verbal + quantitative):

  • Start with variance of verbal + variance of quantitative = 10,000 + 11,000 = 21,000.
  • Add the correlation term: 2 times 0.5 times square root of 10,000 times square root of 11,000.
  • Square root of 10,000 = 100; square root of 11,000 ≈ 104.88.
  • Correlation term ≈ 2 × 0.5 × 100 × 104.88 ≈ 10,488.
  • Total variance = 21,000 + 10,488 = 31,488.

➖ Variance of the difference

To find the variance of (verbal − quantitative):

  • Start with the same base: 10,000 + 11,000 = 21,000.
  • Subtract the correlation term: 21,000 − 10,488 = 10,512.
  • Notice: the variance of the difference is smaller than the variance of the sum when the correlation is positive.

🔍 Why the sign matters

  • Positive correlation means the two variables tend to move together.
  • When you add them, this co-movement increases variability → larger variance.
  • When you subtract them, the co-movement partially cancels out → smaller variance.
  • Example: If both verbal and quantitative scores are high together, their sum varies more; their difference varies less.
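The SAT example reduces to a few lines of arithmetic, sketched below with the variances and correlation quoted in the excerpt.

```python
from math import sqrt

var_verbal, var_quant, rho = 10_000, 11_000, 0.50

# Correlation term: 2 * rho * sd_verbal * sd_quant
corr_term = 2 * rho * sqrt(var_verbal) * sqrt(var_quant)

var_sum = var_verbal + var_quant + corr_term    # variance of (verbal + quantitative)
var_diff = var_verbal + var_quant - corr_term   # variance of (verbal - quantitative)

print(round(corr_term))   # ~10,488
print(round(var_sum))     # ~31,488
print(round(var_diff))    # ~10,512
```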

🔤 Notation for samples vs populations

🔤 Population notation

  • Use rho (ρ) for the population correlation.
  • The formula is written with rho when referring to the true population relationship.

🔤 Sample notation

  • When variances and correlation are computed from a sample, use r (the sample correlation) instead of rho.
  • The structure of the formula remains the same; only the symbol changes.
  • Don't confuse: rho is a parameter (population); r is a statistic (sample).

⚠️ Key reminders

⚠️ When to use which law

Situation | Formula to use | Key feature
X and Y are independent | Variance Sum Law I | No correlation term
X and Y are correlated | Variance Sum Law II | Include the ± 2 × correlation term

⚠️ Common mistake

  • Forgetting the correlation term when variables are not independent will underestimate (for sums) or overestimate (for differences) the true variance.
  • Always check: are the variables correlated? If yes, you must use Variance Sum Law II.
35

Remarks on the Concept of "Probability"

Remarks on the Concept of “Probability”

🧭 Overview

🧠 One-sentence thesis

Probability can be understood through three distinct approaches—symmetry, relative frequency, and subjective belief—but the frequentist interpretation is most widely adopted in statistical practice because it provides objective criteria for evaluating probability claims over the long run.

📌 Key points (3–5)

  • Three approaches to probability: symmetry (equal outcomes), relative frequency (long-run proportions), and subjective (personal belief/opinion).
  • Frequentist interpretation dominates: most statistical work uses the frequency approach, where probability means the proportion of times an event occurs in many repetitions.
  • Common confusion—right vs. wrong predictions: a single outcome doesn't prove a probability wrong; a probability is accurate if events occur at that rate over many trials.
  • Nondogmatic probabilities: almost all probabilities of interest are neither 0 (impossible) nor 1 (certain).
  • Why subjective probability is problematic: it lacks objective criteria to judge who is "right" or "wrong" when two people assign different probabilities to the same event.

🎲 Three ways to think about probability

🎲 Symmetry approach

Symmetry approach: when outcomes are indistinguishable in any way that affects which will occur, each outcome has equal probability.

  • If there are N symmetrical outcomes, the probability of any one is 1/N.
  • Example: A fair coin has two symmetrical outcomes (heads/tails), so probability of heads = 1/2.
  • Example: A six-sided die has six symmetrical sides, so probability of any one side = 1/6.
  • The key assumption: no feature distinguishes one outcome from another in terms of likelihood.

📊 Relative frequency approach

Relative frequency approach: probability is the proportion of times an event occurs as the number of trials increases toward infinity.

  • If you toss a coin millions of times, the proportion of heads approaches 1/2.
  • Example: If it rained on 62% of the last 100,000 days in Seattle, the probability of rain tomorrow might be taken as 0.62.
  • Why this is often unreasonable: raw historical frequency ignores relevant conditions.
    • If tomorrow is August 1 (a day when rain is rare), you should only count August 1 occurrences.
    • But even that isn't enough—humidity, wind direction, and other factors matter.
    • The sample of truly comparable prior cases shrinks to nearly zero.
    • Climate change makes past meteorological history misleading.

🧠 Subjective approach

Subjective approach: probability reflects personal opinion or willingness to bet at certain odds.

  • Used for questions that don't fit symmetry or frequency frameworks.
  • Example: "What is the probability Ms. Garcia defeats Mr. Smith in an election?" reflects the speaker's personal judgment.
  • Major drawback: loses objective content; probability becomes mere opinion.
    • Two people can assign different probabilities to the same event with no criterion for calling one "right" and the other "wrong."
    • You cannot judge correctness simply by which outcome actually happens (see next section).

✅ What makes a probability "right" or "wrong"

✅ Single outcomes don't prove anything

  • Don't confuse: a correct probability with a correct prediction of a single event.
  • Example: Weather forecaster says "10% chance of rain," and it rains.
    • You might be furious, but the forecaster was not wrong.
    • She did not say it would not rain, only that rain was unlikely.
    • She would be flatly wrong only if she said probability = 0 and it rained.
  • Example: You assign probability 1/6 to rolling a six with a fair die; your friend assigns 2/3.
    • You are right and your friend is wrong—even if the die shows a six.
    • The correctness is about the probability assignment, not the single outcome.

📈 Long-run frequency is the test

  • A probability is accurate if events occur at that rate over many trials.
  • Example: If you track the weather forecaster over a long period and find it rains on 50% of days she said "10% chance," then her probability assessments are wrong.
  • According to the frequency interpretation: "probability of rain is 0.10" means it will rain 10% of the days on which rain is forecast with this probability.

🔢 Frequentist approach in practice

🔢 Why frequentist is preferred

  • The present text (and most work in the field) adopts the frequentist approach in most cases.
  • It provides objective criteria: probabilities can be checked against long-run frequencies.
  • Unlike subjective probability, there is a standard for "right" and "wrong."

🔢 Nondogmatic probabilities

  • Almost all probabilities encountered in statistics are neither 0 nor 1.
  • Probability 0 = no chance of occurring (event is impossible).
  • Probability 1 = certain to occur.
  • Hard to find examples of interest to statistics with probability exactly 0 or 1.
  • Example: Even the probability the Sun will come up tomorrow is less than 1.

🎯 Computing probability with equally-likely outcomes

🎯 Basic formula

  • When all outcomes are equally likely, probability = (number of favorable outcomes) / (total number of possible outcomes).
  • "Favorable" means "favorable to the event in question happening," not necessarily favorable to your well-being.

🎯 Single event example

  • Roll a six-sided die: six possible outcomes, all equally likely.
  • Probability of rolling a one = 1/6.
  • Probability of rolling either a one or a six:
    • Two favorable outcomes (one or six).
    • Six possible outcomes.
    • Probability = 2/6 = 1/3.
36

Basic Concepts of Probability

Basic Concepts

🧭 Overview

🧠 One-sentence thesis

Probability quantifies the likelihood of events by comparing favorable outcomes to all possible outcomes, and understanding independence, conditional probability, and common fallacies enables accurate calculation of single and combined event probabilities.

📌 Key points (3–5)

  • Equally-likely outcomes: When all outcomes have the same chance, probability equals favorable outcomes divided by total possible outcomes.
  • Independent vs dependent events: Independent events don't affect each other's probabilities; dependent events require conditional probability calculations.
  • "And" vs "Or" logic: P(A and B) multiplies probabilities for independent events; P(A or B) adds probabilities minus the overlap.
  • Common confusion: The gambler's fallacy—past independent events do not influence future ones (e.g., a coin has no "memory" of previous flips).
  • Conditional probability: When one event affects another, use P(A and B) = P(A) × P(B|A), not simple multiplication.

🎲 Single event probability

🎲 The basic formula

Probability = (number of favorable outcomes) / (number of possible outcomes)

  • Applies when all outcomes are equally likely.
  • Example: Rolling a six-sided die, the probability of getting a one is 1/6 (one favorable outcome out of six possible).
  • Example: Drawing an ace from a 52-card deck is 4/52 = 1/13 (four aces among 52 cards).

🍒 The equal-probability assumption

  • The formula depends critically on every outcome having the same chance.
  • Example: In a bag with 14 sweet and 6 sour cherries, P(sweet) = 14/20 = 7/10 only if each cherry is equally likely to be picked.
  • Don't confuse: If sweet cherries are smaller, they're harder to grab, violating the equal-probability assumption.

🎯 Complement rule

  • If P(A) is the probability an event occurs, then 1 - P(A) is the probability it does not occur.
  • Example: If the probability of rolling a total of 6 with two dice is 5/36, then P(not 6) = 1 - 5/36 = 31/36.

🔗 Independent events

🔗 What independence means

Events A and B are independent if the probability of B occurring is the same whether or not A occurs.

  • Example: Tossing a fair coin twice—the second toss has P(heads) = 1/2 regardless of the first toss result.
  • Counterexample: "Rain in Houston tomorrow" and "Rain in nearby Galveston tomorrow" are not independent; rain in one city makes rain in the other more likely.

✖️ Probability of A and B (both occurring)

  • For independent events: P(A and B) = P(A) × P(B)
  • Example: Flipping heads twice in a row = (1/2) × (1/2) = 1/4.
  • Example: Flipping a coin (heads) and rolling a die (getting 1) = (1/2) × (1/6) = 1/12.
  • Example: Drawing a heart, replacing it, then drawing a black card = (1/4) × (1/2) = 1/8 (replacement makes events independent).

➕ Probability of A or B (at least one occurring)

  • Formula: P(A or B) = P(A) + P(B) - P(A and B)
  • The subtraction avoids double-counting the case where both happen.
  • "Or" here is inclusive—it includes the possibility that both A and B occur.

| Scenario | Calculation | Result |
| --- | --- | --- |
| Head on first flip or head on second flip | (1/2) + (1/2) - (1/4) | 3/4 |
| Rolling a 6 or flipping heads | (1/6) + (1/2) - (1/12) | 7/12 |

🔄 Alternate method for "or" problems

  • Compute the probability of neither event happening, then subtract from 1.
  • Example: P(6 or head) = 1 - P(not 6 and not head) = 1 - (5/6 × 1/2) = 1 - 5/12 = 7/12.
  • Advantage: This method extends easily to three or more events.
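
A short sketch checking the "6 or head" example both ways (exact fractions, Python standard library):

```python
from fractions import Fraction

p_six = Fraction(1, 6)   # rolling a 6 on a fair die
p_head = Fraction(1, 2)  # flipping heads on a fair coin; events assumed independent

# Method 1: addition rule, subtracting the overlap
p_or = p_six + p_head - p_six * p_head

# Method 2: 1 minus the probability that neither event happens
p_or_alt = 1 - (1 - p_six) * (1 - p_head)

print(p_or, p_or_alt)  # 7/12 7/12
```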

🔀 Dependent events and conditional probability

🔀 When events are not independent

  • If the first event changes the situation for the second, the events are dependent.
  • Example: Drawing two aces from a deck without replacement—after the first ace, only 3 aces remain out of 51 cards.

📊 Conditional probability notation

P(B|A) means "the probability of B occurring given that A has occurred."

  • Example: P(ace on second draw | ace on first draw) = 3/51 = 1/17.
  • The vertical bar "|" is read as "given."

🧮 Formula for dependent events

  • P(A and B) = P(A) × P(B|A)
  • Example: Drawing two aces = (4/52) × (3/51) = 1/221.
  • Don't confuse: This is not (4/52) × (4/52), which would incorrectly assume independence.

🃏 Multiple pathways example

  • Problem: Drawing the Ace of Diamonds and a black card (in any order).
  • Case 1: Ace of Diamonds first, then black = (1/52) × (26/51) = 1/102.
  • Case 2: Black card first, then Ace of Diamonds = (1/2) × (1/51) = 1/102.
  • Since the cases are mutually exclusive, add them: 1/102 + 1/102 = 1/51.
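
A brute-force check of this example, enumerating every ordered two-card draw (a sketch; the deck encoding is illustrative):

```python
from itertools import permutations

# Encode the deck as (rank, suit) pairs; spades (S) and clubs (C) are the black suits
ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['D', 'H', 'S', 'C']
deck = [(rank, suit) for suit in suits for rank in ranks]

favorable = 0
total = 0
for first, second in permutations(deck, 2):   # all 52 × 51 ordered draws without replacement
    total += 1
    pair = {first, second}
    has_ace_of_diamonds = ('A', 'D') in pair
    has_black_card = any(suit in ('S', 'C') for _, suit in pair)
    if has_ace_of_diamonds and has_black_card:
        favorable += 1

print(favorable, total)            # 52 of 2652 ordered draws
print(round(favorable / total, 4)) # 0.0196, i.e. 1/51
```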

🎂 Birthday problem

🎂 The surprising result

  • With 25 people in a room, the probability that at least two share a birthday is 0.569 (not 25/365 = 0.068).
  • The key is to calculate the probability of no matches, then subtract from 1.

🧩 How to calculate it

  • P(second person doesn't match first) = 364/365.
  • P(third doesn't match first two | no previous matches) = 363/365.
  • Continue: P₄ = 362/365, P₅ = 361/365, ..., P₂₅ = 341/365.
  • Multiply all these together: P(no matches) = (364/365) × (363/365) × ... × (341/365) = 0.431.
  • Therefore, P(at least one match) = 1 - 0.431 = 0.569.
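
The same birthday calculation as a short loop:

```python
# Probability that at least two of 25 people share a birthday (365 equally likely days)
p_no_match = 1.0
for k in range(1, 25):            # person k+1 must avoid the k birthdays already taken
    p_no_match *= (365 - k) / 365

print(round(p_no_match, 3))       # 0.431
print(round(1 - p_no_match, 3))   # 0.569
```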

🎰 Gambler's fallacy

🎰 The mistaken belief

  • Fallacy: After five heads in a row, tails is "due" on the sixth flip.
  • Reality: Each flip is independent; the coin has no memory. P(heads on sixth flip) = 1/2.

📈 What actually happens in the long run

  • The proportion of heads approaches 0.5, but the difference between number of heads and tails does not shrink to zero.
  • Example from simulation: After 1,500,000 flips, there were 968 more tails than heads, yet the proportion of heads was 0.500 (rounded to three decimals).
  • Don't confuse: "Balancing out" refers to proportions, not absolute counts.
37

Permutations and Combinations

Permutations and Combinations

🧭 Overview

🧠 One-sentence thesis

Permutations and combinations provide formulas to count the number of ways to select items from a group, with permutations caring about order and combinations ignoring it.

📌 Key points (3–5)

  • Counting orders: The number of ways to arrange n items is n factorial (n!).
  • Multiplication rule: When making sequential choices, multiply the number of options at each step to get total possibilities.
  • Permutations vs combinations: Permutations count ordered selections (red then yellow ≠ yellow then red), while combinations count unordered selections (red and yellow = yellow and red).
  • Common confusion: Order matters for permutations but not for combinations—choosing the same items in different sequences counts as one combination but multiple permutations.
  • Formulas: Permutations use n!/(n-r)!, combinations use n!/[r!(n-r)!], where n is the total items and r is how many you select.

🔢 Counting all possible orders

🔢 What factorial means

The number of orders = n!, where n is the number of pieces to be picked up. The symbol "!" stands for factorial.

  • Factorial means multiplying all whole numbers from n down to 1.
  • Examples from the excerpt:
    • 3! = 3 × 2 × 1 = 6
    • 4! = 4 × 3 × 2 × 1 = 24
    • 5! = 5 × 4 × 3 × 2 × 1 = 120
  • This counts every possible sequence when you pick up all n items one at a time.

🍬 Candy example

The excerpt uses three pieces of candy (green, yellow, red) to illustrate ordering:

  • There are 3! = 6 possible orders to pick them up.
  • Two orders start with red, two start with yellow, two start with green.
  • If there were 5 pieces of candy, there would be 5! = 120 different orders.

✖️ Multiplication rule for sequential choices

✖️ How the rule works

  • When you make a series of independent choices, multiply the number of options at each step.
  • The excerpt explains: "first there is a choice among 3 soups. Then, for each of these choices there is a choice among 6 entrées resulting in 3 × 6 = 18 possibilities. Then, for each of these 18 possibilities there are 4 possible desserts yielding 18 × 4 = 72 total possibilities."

🍽️ Restaurant menu example

A small restaurant has:

  • 3 soups
  • 6 entrées
  • 4 desserts

Total possible meals = 3 × 6 × 4 = 72.

🎯 Permutations: order matters

🎯 What permutations count

Permutations: the number of ways r things can be selected from a group of n things, where order counts.

  • Notation: nPr (or sometimes written as P(n,r))
  • Formula: n! / (n - r)!
  • "It is important to note that order counts in permutations. That is, choosing red and then yellow is counted separately from choosing yellow and then red."

🍬 Four-candy example

Suppose there are four pieces of candy (red, yellow, green, brown) and you pick exactly two:

  • First choice: any of 4 colors
  • Second choice: any of the remaining 3 colors
  • Total: 4 × 3 = 12 possibilities

Using the formula:

  • 4P2 = 4! / (4 - 2)! = (4 × 3 × 2 × 1) / (2 × 1) = 12

📋 All twelve permutations

The excerpt lists all 12 ordered pairs:

  1. red, yellow
  2. red, green
  3. red, brown
  4. yellow, red
  5. yellow, green
  6. yellow, brown
  7. green, red
  8. green, yellow
  9. green, brown
  10. brown, red
  11. brown, yellow
  12. brown, green

Notice that "red, yellow" and "yellow, red" are both counted.
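
A quick way to reproduce this list and count with the standard library:

```python
from itertools import permutations
from math import perm

candies = ['red', 'yellow', 'green', 'brown']

ordered_pairs = list(permutations(candies, 2))  # every ordered selection of 2 candies
print(len(ordered_pairs))                       # 12
print(perm(4, 2))                               # 12, via n! / (n - r)!
```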

🎲 Combinations: order does not matter

🎲 What combinations count

Combinations: the number of ways to select items when order of choice is not considered.

  • Notation: nCr (or sometimes written as C(n,r))
  • Formula: n! / [r! × (n - r)!]
  • "In counting combinations, choosing red and then yellow is the same as choosing yellow and then red because in both cases you end up with one red piece and one yellow piece."

🍬 Same four-candy example, now unordered

With the same four candies (red, yellow, green, brown), if you only care about which two you end up with (not the order):

  • The excerpt modifies the permutation table: repeated combinations get an "x" instead of a number.
  • For example, "yellow then red" gets an "x" because "red and yellow" was already counted.
  • Result: only 6 distinct combinations.

Using the formula:

  • 4C2 = 4! / [2! × (4 - 2)!] = (4 × 3 × 2 × 1) / [(2 × 1) × (2 × 1)] = 6

📋 All six combinations

The excerpt shows these six unordered pairs:

  1. red, yellow
  2. red, green
  3. red, brown
  4. yellow, green
  5. yellow, brown
  6. green, brown

🍕 Pizza toppings example

Suppose there are six kinds of toppings and you want to order exactly 3:

  • n = 6 (total toppings available)
  • r = 3 (number you are selecting)
  • 6C3 = 6! / [3! × (6 - 3)!] = (6 × 5 × 4 × 3 × 2 × 1) / [(3 × 2 × 1) × (3 × 2 × 1)] = 20

There are 20 different combinations of exactly 3 toppings.
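
The corresponding check for combinations, including the pizza-topping count:

```python
from itertools import combinations
from math import comb

candies = ['red', 'yellow', 'green', 'brown']

print(list(combinations(candies, 2)))  # the 6 unordered pairs
print(comb(4, 2))                      # 6, via n! / [r! × (n - r)!]
print(comb(6, 3))                      # 20 ways to choose 3 of 6 pizza toppings
```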

🔍 Key distinction: permutations vs combinations

| Aspect | Permutations | Combinations |
| --- | --- | --- |
| Order matters? | Yes: red then yellow ≠ yellow then red | No: red and yellow = yellow and red |
| Formula | n! / (n - r)! | n! / [r! × (n - r)!] |
| Example (4 candies, pick 2) | 12 outcomes | 6 outcomes |
| When to use | When the sequence/arrangement matters | When only the final selection matters |

🔍 Don't confuse

  • For the same n and r, permutations always give a count at least as large as combinations, because permutations count every ordering separately.
  • The combination formula divides the permutation formula by r! to remove the duplicate orderings.
38

Binomial Distribution

Binomial Distribution

🧭 Overview

🧠 One-sentence thesis

The binomial distribution calculates the probability of achieving a specific number of successes across independent trials when each trial has the same probability of success.

📌 Key points (3–5)

  • What it models: probabilities of different numbers of successes (e.g., heads) across N independent trials, each with probability π of success.
  • How to calculate: use the binomial formula to find the probability of exactly x successes out of N trials.
  • Cumulative probabilities: to find the probability of a range of outcomes (e.g., 0 to 3 heads), sum the individual probabilities for each outcome in that range.
  • Common confusion: the probability π can vary—the coin doesn't have to be fair (0.5); biased coins (e.g., π = 0.4) also follow binomial distributions.
  • Central tendency: the mean number of successes is N times π; the variance accounts for both the number of trials and the probability of failure.

🎲 What the binomial distribution represents

🎲 Discrete probability distribution

A discrete probability distribution shows the probability for each possible value on the X-axis.

  • The binomial distribution is a specific type of discrete distribution.
  • It applies when you have a fixed number of independent trials (N) and each trial has the same probability of success (π).
  • Example: flipping a coin twice (N = 2) with probability 0.5 of heads (π = 0.5) produces a binomial distribution showing probabilities for 0, 1, or 2 heads.

🎯 Defining "success"

  • A "success" is whatever outcome you're counting (e.g., getting heads in a coin flip).
  • The excerpt defines heads as a success in the coin-flip example.
  • The probability of success on each trial is denoted by π (the Greek letter pi).

📐 The binomial formula

📐 General formula structure

The formula calculates the probability of exactly x successes out of N trials:

P(x) = [N! / (x! × (N - x)!)] × π^x × (1 - π)^(N - x)

Where:

  • P(x) is the probability of x successes
  • N is the number of trials
  • π is the probability of success on a given trial
  • The factorial term [N! / (x! × (N - x)!)] counts the number of ways to arrange x successes among N trials

🪙 Fair coin example (π = 0.5, N = 2)

The excerpt works through flipping a fair coin twice:

| Number of heads (x) | Calculation | Probability |
| --- | --- | --- |
| 0 | (2! / (0! × 2!)) × 0.5^0 × 0.5^2 = 1 × 1 × 0.25 | 0.25 |
| 1 | (2! / (1! × 1!)) × 0.5^1 × 0.5^1 = 2 × 0.5 × 0.5 | 0.50 |
| 2 | (2! / (2! × 0!)) × 0.5^2 × 0.5^0 = 1 × 0.25 × 1 | 0.25 |

  • The probability of exactly one head is 0.50 because two different outcomes (Heads-Tails and Tails-Heads) both produce one head.
  • The probability of one or more heads is 0.50 + 0.25 = 0.75 (add the probabilities of mutually exclusive outcomes).

🎲 Biased coin example (π = 0.4, N = 2)

  • The excerpt asks: if the probability of heads is only 0.4, what is the probability of getting heads at least once in two tosses?
  • "At least once" means one or more heads, so you calculate P(1) + P(2).
  • Substituting π = 0.4 into the formula yields 0.64.
  • Don't confuse: the binomial distribution works for any probability π between 0 and 1, not just fair (0.5) scenarios.
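
A minimal sketch of the binomial formula that reproduces both examples (the helper name binom_pmf is illustrative, not from the excerpt):

```python
from math import comb

def binom_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials, each with success probability pi."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Fair coin, two flips: P(0), P(1), P(2) heads
print([round(binom_pmf(x, 2, 0.5), 2) for x in range(3)])      # [0.25, 0.5, 0.25]

# Biased coin (pi = 0.4): probability of at least one head in two flips
print(round(binom_pmf(1, 2, 0.4) + binom_pmf(2, 2, 0.4), 2))   # 0.64
```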

📊 Cumulative probabilities

📊 Summing individual probabilities

  • To find the probability of a range of outcomes (e.g., 0 to 3 heads in 12 tosses), calculate the probability of each exact outcome and add them together.
  • Example from the excerpt: P(0 to 3 heads in 12 tosses) = P(0) + P(1) + P(2) + P(3) = 0.0002 + 0.0029 + 0.0161 + 0.0537 = 0.073.
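
The same cumulative sum as a short sketch:

```python
from math import comb

# P(0 to 3 heads in 12 tosses of a fair coin): add the exact-count probabilities
p = sum(comb(12, x) * 0.5**x * 0.5**(12 - x) for x in range(4))
print(round(p, 3))  # 0.073
```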

🧮 Why cumulative calculations are tedious

  • Computing each individual probability with the binomial formula and then summing can be time-consuming, especially for large N or wide ranges.
  • The excerpt mentions that binomial calculators exist to simplify these computations.

📏 Mean and variance

📏 Mean (expected value)

The mean of a binomial distribution is μ = N × π.

  • This represents the average number of successes you would expect over many repetitions of the experiment.
  • Example: tossing a fair coin 12 times (N = 12, π = 0.5) gives a mean of 12 × 0.5 = 6 heads.
  • Intuition: on average, half the tosses come up heads when the coin is fair.

📐 Variance

The variance of a binomial distribution is σ² = N × π × (1 - π).

  • Variance measures the spread of the distribution—how much the number of successes varies around the mean.
  • The term (1 - π) represents the probability of failure on each trial.
  • The variance depends on both the number of trials (N) and the balance between success and failure probabilities.
  • Don't confuse: variance is not just N × π; it also includes the (1 - π) factor, which accounts for the probability of not succeeding.
39

Poisson Distribution

Poisson Distribution

🧭 Overview

🧠 One-sentence thesis

The Poisson distribution calculates probabilities of various counts of independent events when you know the mean number of occurrences.

📌 Key points (3–5)

  • What it calculates: probabilities of getting a specific number of "successes" based on the mean number of successes.
  • Key requirement: the various events must be independent of each other.
  • Core parameters: uses the mean (μ) to compute probabilities; variance also equals μ.
  • Common confusion: "success" does not mean positive outcome—it just means the outcome in question occurs.
  • When to use: situations where you know the average rate of events and want to find the probability of a specific count.

📐 Core formula and parameters

🧮 The Poisson formula

The excerpt provides a formula with three components:

  • e: the base of natural logarithms (2.7183)
  • μ (mu): the mean number of "successes"
  • x: the number of "successes" in question

The formula calculates the probability of observing exactly x occurrences when the mean is μ.
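
The excerpt does not write the formula out; assembled from those three components in the standard way (same notation style as the binomial formula above), the Poisson probability of exactly x occurrences is:

P(x) = (μ^x × e^(-μ)) / x!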

📊 Mean and variance

The mean of the Poisson distribution is μ. The variance is also equal to μ.

  • Both the mean and variance have the same value in a Poisson distribution.
  • This is a distinctive property that sets it apart from other distributions.
  • Example from the excerpt: if the mean is 8, then the variance is also 8.

🔥 Worked example: fire station calls

🎯 The scenario

The excerpt uses a fire station example:

  • Known information: mean number of calls to a fire station on a weekday is 8
  • Question: what is the probability that on a given weekday there would be 11 calls?

🧩 Applying the formula

  • μ = 8 (the mean)
  • x = 11 (the number of calls in question)
  • The result is 0.072 (7.2% probability)

This shows how to use the Poisson distribution when you know the average rate and want to find the probability of a specific count.
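
A minimal check of this example (the helper name poisson_pmf is illustrative):

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x independent events when the mean count is mu."""
    return exp(-mu) * mu**x / factorial(x)

print(round(poisson_pmf(11, 8), 3))  # 0.072
```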

⚠️ Key requirement: independence

🔗 What independence means

The excerpt emphasizes: "In order to apply the Poisson distribution, the various events must be independent."

  • Each event must not affect the probability of other events occurring.
  • Without independence, the Poisson distribution cannot be validly applied.
  • Don't confuse: this is a requirement for using the distribution, not just a nice-to-have property.

📝 Understanding "success"

Keep in mind that the term "success" does not really mean success in the traditional positive sense. It just means that the outcome in question occurs.

  • "Success" is a technical term, not a value judgment.
  • In the fire station example, a call (which might represent an emergency) is counted as a "success" for calculation purposes.
  • The term simply identifies the event you are counting.
40

Multinomial Distribution

Multinomial Distribution

🧭 Overview

🧠 One-sentence thesis

The multinomial distribution extends the binomial distribution to compute probabilities when each event has more than two possible outcomes.

📌 Key points (3–5)

  • What it extends: the binomial distribution handles binary outcomes (two possibilities), while the multinomial handles three or more outcomes per event.
  • What it computes: the probability of obtaining a specific combination of outcomes across multiple events.
  • Key requirement: each event must have the same set of possible outcomes with fixed probabilities.
  • Common confusion: don't confuse the number of events (n) with the number of times each outcome occurs (n₁, n₂, n₃, etc.).
  • Formula structure: uses factorials of outcome counts and multiplies by each outcome's probability raised to its count.

🎲 From binomial to multinomial

🎲 What the binomial distribution handles

  • The binomial distribution computes probabilities for binary outcomes—events with only two possible results.
  • Example from the excerpt: getting 6 heads out of 10 coin flips (heads or tails only).

🔀 What the multinomial distribution adds

The multinomial distribution can be used to compute the probabilities in situations in which there are more than two possible outcomes.

  • It generalizes the binomial case to three or more outcomes per event.
  • Each event still has the same set of possible outcomes with the same probabilities.
  • Example: a chess game can end in three ways—Player A wins, Player B wins, or draw—rather than just two.

Don't confuse: the multinomial is not for different events with different outcome sets; it's for repeated events where each event has the same multiple outcomes.

🧮 The multinomial formula

🧮 Formula for three outcomes

The excerpt provides the formula for three possible outcomes:

p = (n!) / ((n₁!)(n₂!)(n₃!)) × p₁^n₁ × p₂^n₂ × p₃^n₃

Where:

  • n = total number of events
  • n₁, n₂, n₃ = number of times each outcome occurs
  • p₁, p₂, p₃ = probability of each outcome
  • p = probability of obtaining that specific combination

📐 Formula for k outcomes

The general form extends to any number of outcomes:

p = (n!) / ((n₁!)(n₂!)...(nₖ!)) × p₁^n₁ × p₂^n₂ × ... × pₖ^nₖ

  • The numerator is the factorial of the total number of events.
  • The denominator is the product of factorials of each outcome count.
  • Then multiply by each outcome probability raised to its count.

🏁 Worked example: chess games

🏁 The scenario

The excerpt uses a chess example:

  • Two players play 12 games (n = 12).
  • Player A wins with probability 0.40 (p₁ = 0.40).
  • Player B wins with probability 0.35 (p₂ = 0.35).
  • Games draw with probability 0.25 (p₃ = 0.25).

Question: What is the probability that in 12 games, Player A wins 7, Player B wins 2, and 3 are drawn?

🧮 Applying the formula

The excerpt assigns:

  • n₁ = 7 (Player A wins)
  • n₂ = 2 (Player B wins)
  • n₃ = 3 (draws)

Calculation:

  • p = (12!) / ((7!)(2!)(3!)) × (0.40)⁷ × (0.35)² × (0.25)³
  • p = 0.0248

Interpretation: there is about a 2.48% chance of this exact outcome combination occurring.
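
A sketch reproducing the calculation (the helper name multinomial_p is illustrative):

```python
from math import factorial

def multinomial_p(counts, probs):
    """Probability of one specific combination of outcome counts, given fixed outcome probabilities."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)          # exact integer multinomial coefficient
    p = float(coeff)
    for c, prob in zip(counts, probs):
        p *= prob**c
    return p

# 12 games: A wins 7 (p = 0.40), B wins 2 (p = 0.35), 3 draws (p = 0.25)
print(round(multinomial_p([7, 2, 3], [0.40, 0.35, 0.25]), 4))  # 0.0248
```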

⚠️ What the formula requires

  • The sum of n₁, n₂, and n₃ must equal n (7 + 2 + 3 = 12).
  • The probabilities p₁, p₂, and p₃ must sum to 1.0 (0.40 + 0.35 + 0.25 = 1.0).
  • Each game is independent with the same outcome probabilities.
41

Hypergeometric Distribution

Hypergeometric Distribution

🧭 Overview

🧠 One-sentence thesis

The hypergeometric distribution calculates probabilities for sampling without replacement, where each draw changes the composition of the remaining population.

📌 Key points (3–5)

  • When to use it: sampling without replacement—once you draw an item, it does not go back into the pool before the next draw.
  • What it calculates: the probability of getting exactly a certain number of "successes" in your sample, given a fixed number of successes in the whole population.
  • Key distinction from binomial: the binomial distribution assumes replacement (or independence); hypergeometric does not replace items, so probabilities shift after each draw.
  • Common confusion: the hypergeometric uses the population size and sample size explicitly, whereas binomial only needs a fixed probability per trial.
  • Practical use: useful for problems like drawing cards from a deck, sampling defective items from a batch, or any scenario where the pool shrinks with each draw.

🎯 What the hypergeometric distribution does

🎯 Purpose and context

The hypergeometric distribution is used to calculate probabilities when sampling without replacement.

  • "Without replacement" means once you pick an item, you do not put it back before the next pick.
  • This changes the odds for each subsequent draw because the population composition changes.
  • Example: drawing cards from a deck—after you draw one card, there are only 51 cards left, and the number of aces may have changed.

🔍 The core question it answers

  • Given a population with a known number of "successes" (e.g., aces in a deck), what is the probability that a sample of a certain size will contain exactly a specified number of those successes?
  • The excerpt's example: a deck has 52 cards, 4 of which are aces; if you draw 3 cards without replacement, what is the probability that exactly 2 of them are aces?

🧮 How the formula works

🧮 The hypergeometric formula components

The excerpt gives the formula in words:

  • p = the probability of obtaining exactly x successes
  • k = the number of "successes" in the population (e.g., 4 aces in the deck)
  • x = the number of "successes" you want in your sample (e.g., 2 aces)
  • N = the size of the population (e.g., 52 cards)
  • n = the number sampled (e.g., 3 cards drawn)
  • C notation = combinations (the number of ways to choose items)

The formula structure is:

  • p = [(kCx) × ((N-k)C(n-x))] / (NCn)

In plain language:

  • Numerator: ways to choose x successes from k successes, times ways to choose the remaining (n minus x) items from the (N minus k) non-successes.
  • Denominator: total ways to choose n items from N items.

📐 Worked example from the excerpt

| Parameter | Value | Meaning |
| --- | --- | --- |
| k | 4 | 4 aces in the deck |
| x | 2 | want exactly 2 aces |
| N | 52 | 52 cards total |
| n | 3 | drawing 3 cards |

  • The excerpt calculates: p = (4 C 2) times (48 C 1) divided by (52 C 3)
  • The result is approximately 0.013, or about 1.3% chance of getting exactly 2 aces in 3 draws.
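
A sketch of the formula applied to this example (the helper name hypergeom_pmf is illustrative):

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(exactly x successes) when drawing n items without replacement
    from a population of N items containing k successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# 3 cards drawn from a 52-card deck with 4 aces: probability of exactly 2 aces
print(round(hypergeom_pmf(2, N=52, k=4, n=3), 3))  # 0.013
```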

🧩 Why combinations matter

  • Combinations count the number of ways to select items when order does not matter.
  • The hypergeometric formula uses combinations because the order in which you draw the cards (or items) does not affect whether you have 2 aces; only the count matters.
  • Don't confuse: permutations (order matters) vs combinations (order does not matter)—here, we use combinations.

📊 Mean and standard deviation

📊 Summary statistics for the hypergeometric distribution

The excerpt provides formulas for the mean and standard deviation:

Mean:

  • mean = (n × k) / N
  • Interpretation: the expected number of successes in your sample.
  • Example: if you draw 3 cards from a deck with 4 aces out of 52, the mean number of aces is (3 times 4) divided by 52, which is about 0.23 aces on average.

Standard deviation:

  • sd = square root of [ n × k × (N - k) × (N - n) / (N² × (N - 1)) ]
  • This measures the spread or variability in the number of successes you might get across many samples.
  • The formula accounts for the finite population size and the fact that sampling without replacement reduces variability compared to sampling with replacement.
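
The same two formulas evaluated for the deck example (a sketch):

```python
from math import sqrt

N, k, n = 52, 4, 3   # population size, successes in population, sample size

mean = n * k / N
sd = sqrt(n * k * (N - k) * (N - n) / (N**2 * (N - 1)))

print(round(mean, 2))  # 0.23 aces expected in the 3-card sample
print(round(sd, 2))    # 0.45
```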

🔄 Why these formulas differ from binomial

  • The binomial distribution's mean is n times p (where p is a fixed probability).
  • The hypergeometric mean looks similar—(n times k) divided by N is essentially n times (k divided by N), where (k divided by N) is the proportion of successes in the population.
  • However, the standard deviation formula includes a correction factor (N minus n) divided by (N minus 1), which reflects the finite population and lack of replacement.
  • Don't confuse: binomial assumes each trial is independent (replacement or infinite population); hypergeometric adjusts for dependence (no replacement, finite population).

🆚 Hypergeometric vs binomial

🆚 Key distinction: replacement

| Feature | Hypergeometric | Binomial |
| --- | --- | --- |
| Replacement | No replacement: items are not returned | With replacement, or an effectively infinite population |
| Probability per trial | Changes after each draw | Stays constant |
| Parameters | N (population size), k (successes in population), n (sample size), x (successes in sample) | n (number of trials), p (fixed probability of success) |
| Use case | Drawing from a finite pool without putting items back | Independent trials with the same probability each time |

🧩 When to use hypergeometric

  • Use the hypergeometric distribution when:
    • You have a finite population.
    • You sample without replacement.
    • You want the probability of a specific count of successes in your sample.
  • Example: sampling defective items from a batch, drawing cards from a deck, or selecting committee members from a group with known characteristics.

⚠️ Common confusion

  • Don't assume "95% accurate" or similar statements automatically mean you can use a simple probability—context matters.
  • The excerpt briefly mentions a disease-testing scenario at the end (though it is cut off), hinting that accuracy involves both miss rates and false positive rates, and that base rates (how common the condition is) also matter.
  • This is a reminder that probability problems often require careful attention to the sampling method and population structure, not just a single "accuracy" number.
42

Base Rates

Base Rates

🧭 Overview

🧠 One-sentence thesis

The probability that you actually have a disease given a positive test result depends critically on the base rate (how common the disease is in the population), not just the test's accuracy, and ignoring base rates leads to dramatic overestimation of risk.

📌 Key points (3–5)

  • Base rate definition: the proportion of people in the population who actually have the condition being tested for.
  • Two types of test errors: misses (failing to detect disease when present) and false positives (indicating disease when absent)—these rates are not necessarily equal.
  • Common confusion: a 95% accurate test does NOT mean 95% probability you have the disease if you test positive; the base rate must be factored in.
  • How base rates change probabilities: when a disease is rare (low base rate), even a highly accurate test will produce many more false positives than true positives.
  • Two calculation methods: tree diagrams (counting outcomes in a population) and Bayes' theorem (using conditional probabilities) both yield the same result.

🧪 The medical test scenario

🧪 Initial misconception

  • You test positive for Disease X at a physical exam.
  • The test is described as "95% accurate."
  • Intuitive but wrong conclusion: probability of having the disease = 0.95.
  • The excerpt emphasizes this is "not that simple" because accuracy alone is insufficient information.

🔍 Understanding test accuracy properly

The excerpt explains that "95% accurate" is ambiguous because tests make two distinct types of errors:

Miss: the test fails to detect the disease when you actually have it.

False positive: the test indicates disease when you do not have it.

Example from the excerpt:

  • Test accurately detects disease in 99% of people who have it (miss rate = 0.01).
  • Test accurately indicates no disease in 91% of people who do not have it (false positive rate = 0.09).
  • Even with this information, concluding probability = 0.91 is still incorrect.

Don't confuse: test accuracy with the probability you have the disease—the missing piece is the base rate.

📊 Working through the numbers

📊 The base rate

Base rate: the proportion of people having the disease in the relevant population.

  • In the example, Disease X is rare: only 2% of people in your situation have it.
  • This means out of 1,000,000 people tested, 20,000 have the disease and 980,000 do not.

🧮 Counting true positives

  • Of the 20,000 with disease, the test detects it in 99%.
  • True positives = 19,800 people correctly identified.

🧮 Counting false positives

  • Of the 980,000 without disease, 9% test positive (false positive rate = 0.09).
  • False positives = 88,200 people incorrectly diagnosed.

🎯 The final calculation

Total people testing positive = 19,800 (true) + 88,200 (false) = 108,000

Probability you have the disease given a positive test:

  • 19,800 divided by 108,000 = 0.1833 (about 18%)
  • Not 95%, not 91%, but only 18.33%.

Key insight: Even though the test is highly accurate, the low base rate means most positive results are false positives.

| Group | Number | Test Positive | Test Negative |
| --- | --- | --- | --- |
| No disease | 980,000 | 88,200 | 891,800 |
| Disease | 20,000 | 19,800 | 200 |

The excerpt notes: "if you look only at the people testing positive (shown in red), only 19,800 (0.1833) of the 108,000 testing positive actually have the disease."

🧠 Bayes' Theorem approach

🧠 What Bayes' Theorem does

Bayes' Theorem: considers both the prior probability of an event and the diagnostic value of a test to determine the posterior probability of the event.

  • Prior probability: what you know before the test (the base rate).
  • Diagnostic value: how the test changes your belief (sensitivity and false positive rate).
  • Posterior probability: what you conclude after the test result.

🔢 The formula components

The excerpt defines:

  • Event D = you have Disease X
  • Event T = you test positive
  • P(D) = 0.02 (base rate)
  • P(D') = 0.98 (probability you don't have it)
  • P(T|D) = 0.99 (probability of positive test given disease)
  • P(T|D') = 0.09 (probability of positive test given no disease)

🔢 The calculation

The formula given is:

P(D|T) = [P(T|D) × P(D)] / [P(T|D) × P(D) + P(T|D') × P(D')]

Plugging in numbers:

  • Numerator: (0.99) × (0.02) = 0.0198
  • Denominator: (0.99) × (0.02) + (0.09) × (0.98) = 0.0198 + 0.0882 = 0.108
  • Result: 0.0198 / 0.108 = 0.1833

The excerpt confirms: "which is the same value computed previously."
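
Both routes to 0.1833, as a minimal sketch:

```python
# Bayes' theorem route
p_d = 0.02              # base rate: P(Disease X)
p_t_given_d = 0.99      # test detects the disease when it is present
p_t_given_not_d = 0.09  # false positive rate

p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * (1 - p_d))
print(round(p_d_given_t, 4))  # 0.1833

# Counting route, for a population of 1,000,000 people
true_positives = 1_000_000 * p_d * p_t_given_d              # 19,800
false_positives = 1_000_000 * (1 - p_d) * p_t_given_not_d   # 88,200
print(round(true_positives / (true_positives + false_positives), 4))  # 0.1833
```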

⚠️ Real-world implications

⚠️ Warning signs example

The excerpt includes a brief discussion of FBI warning signs for school shooters:

  • Even if warning signs are accurate indicators, the base rate of actual school shooters is extremely low.
  • Most students showing warning signs would never become shooters.
  • Implication: "it is necessary to take this base rate information into account" before concluding a student will be a shooter.
  • Actions based solely on warning signs will likely target many students who were never at risk.

Don't confuse: a good predictor (high accuracy) with a reliable individual diagnosis when the base rate is very low—false positives will dominate.

43

Scientific Method

Scientific Method

🧭 Overview

🧠 One-sentence thesis

The scientific method depends on empirical data collected systematically, uses theories that must be testable and potentially disconfirmable, and relies on parsimony and the ability to survive repeated hypothesis testing rather than absolute proof.

📌 Key points (3–5)

  • Empirical foundation: Scientific investigation requires systematically collected empirical data, but does not always require experimentation—observational studies are also valid.
  • Theories cannot be proved, only tested: Theories can never be 100% certain; they must be potentially disconfirmable and lead to testable hypotheses.
  • Disconfirmation vs confirmation: A disconfirmed hypothesis means the theory is incorrect; a confirmed hypothesis makes the theory more useful but does not "prove" it.
  • Common confusion: Scientific vs faith-based explanations—scientific explanations must be testable; faith-based explanations do not need to be testable (neither is "cosmically" wrong, but only the former is scientific).
  • Parsimony matters: Good theories explain many findings with few constructs; theories that require constant modification become less parsimonious and may be replaced.

🔬 What makes an investigation scientific

🔬 Empirical data requirement

Scientific investigation depends on empirical data collected systematically.

  • "Empirical" means data from observation or measurement, not speculation.
  • "Systematically" means following a consistent, planned method—not haphazard collection.
  • Important: Experimentation (manipulating variables) is not always required.
    • Observational studies in astronomy, developmental psychology, and ethology are common and provide valuable scientific information.
    • Example: Observing star movements or child development over time without intervention is still scientific.

🧪 Testability and disconfirmability

Scientific theories must be potentially disconfirmable.

  • A theory that can accommodate all possible results is not scientific.
  • The theory must lead to testable hypotheses—predictions that can be checked against data.
  • Why this matters: If no possible observation could ever contradict the theory, it cannot be tested and is not scientific.
  • Example from the excerpt: The secondary reinforcement theory of attachment predicted that infants bond with parents through food pairing. Experiments with infant monkeys (fed by wire surrogates but clinging to cloth surrogates) disconfirmed this theory.

🆚 Scientific vs faith-based explanations

| Type | Requirement | Status if untestable |
| --- | --- | --- |
| Scientific | Must be testable | Not scientific |
| Faith-based | Based on faith; does not need to be testable | Not incorrect "in some cosmic sense," just not scientific |

  • Don't confuse: "Not scientific" does not mean "wrong"—it means the explanation does not follow the scientific method.

🔄 How theories are tested and evolve

🔄 Deductive reasoning: testing hypotheses

  • Process: A hypothesis is developed from a theory, then confirmed or disconfirmed.
  • If disconfirmed: The theory from which the hypothesis was deduced is incorrect.
  • If confirmed: The theory has survived a test and becomes more useful and better regarded—but is not "proved."
  • Example: If a theory predicts X and X is observed, the theory is supported but not confirmed as absolute truth.

🧠 Inductive reasoning: building theories

  • Where theories come from: A scientist observes many empirical findings and, through a "generally poorly understood process called induction," develops a framework to explain them.
  • This is the creative, synthesis step—not purely logical deduction.

🪶 Parsimony as a virtue

An important attribute of a good scientific theory is that it is parsimonious.

  • Parsimonious = simple in the sense of using relatively few constructs to explain many empirical findings.
  • A theory with as many assumptions as predictions is not valuable.
  • Why it matters: Parsimony makes theories easier to test, understand, and apply.

🔁 Theory modification and replacement

🔁 When disconfirmation happens

  • Strictly speaking, a disconfirmed hypothesis disconfirms the theory.
  • In practice: Disconfirmation rarely leads to immediate abandonment of the theory.
  • Instead, the theory is modified to accommodate the inconsistent finding.

🔁 When theories are replaced

  • If a theory must be modified over and over to accommodate new findings, it becomes less and less parsimonious.
  • This leads to discontent and the search for a new theory.
  • Replacement condition: If a new theory can explain the same facts in a more parsimonious way, it will eventually supersede the old theory.
  • Example: A theory that originally had 3 core assumptions but now requires 15 ad-hoc adjustments may be replaced by a simpler theory with 4 assumptions that explains all the same findings.

🚫 Don't confuse: proof vs support

  • Common confusion: A confirmed hypothesis does not "confirm" or "prove" the theory.
  • A theory is never 100% certain because a new empirical finding inconsistent with it could always be discovered.
  • Confirmation only means the theory has survived one more test and is more useful—not that it is absolutely true.
44

Measurement and Experimental Design

Measurement

🧭 Overview

🧠 One-sentence thesis

Reliable and valid measurement, combined with careful experimental design and sampling methods, enables researchers to draw meaningful causal inferences from data while avoiding systematic biases.

📌 Key points (3–5)

  • Reliability measures consistency—how closely a test correlates with parallel forms of itself, limited by the ratio of true score variance to total variance.
  • Validity has three types: face validity (appears to measure what it should), predictive validity (predicts relevant behavior), and construct validity (correlates appropriately with related and unrelated measures).
  • Common confusion: reliability sets an upper limit on validity—a test cannot correlate with another measure more than it correlates with itself.
  • Sampling bias refers to the method of sampling, not the sample itself; types include self-selection bias, undercoverage bias, and survivorship bias.
  • Experimental design choices (between-subjects vs. within-subjects, factorial designs) determine what causal inferences can be drawn and how powerfully effects can be detected.

📏 Reliability concepts

📏 What reliability measures

Reliability: the consistency of a test, defined by how closely it correlates with parallel forms of itself.

  • Reliability is not about whether the test measures the right thing—that's validity.
  • It reflects the proportion of variance due to true scores versus error.
  • Formula concept: reliability equals true score variance divided by total test score variance (true score variance plus error variance).
  • Example: if true score variance is 80 and total test variance is 100, reliability is 0.80.

📐 True score concept

True score: the score a person would obtain if there were no measurement error.

  • Every observed test score consists of true score plus error.
  • Reliability increases when true score variance is large relative to error variance.
  • Don't confuse: a single test administration gives an observed score, not a true score.

📊 Test length and reliability

  • Adding more items increases reliability—if the new items have the same characteristics as the old ones.
  • The excerpt provides a formula showing how reliability changes when test length changes by a factor.
  • Example given: increasing test length by 1.5 times raises reliability from 0.70 to 0.78.
  • Warning: adding poor-quality items can decrease reliability instead of increasing it.
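
The excerpt does not reproduce the test-length formula itself; the standard Spearman-Brown prophecy formula matches the 0.70 → 0.78 example, so here is a hedged sketch using it (the function name is illustrative):

```python
def lengthened_reliability(r, factor):
    """Spearman-Brown prophecy formula: reliability after lengthening a test by `factor`,
    assuming the added items behave like the existing ones."""
    return factor * r / (1 + (factor - 1) * r)

print(round(lengthened_reliability(0.70, 1.5), 2))  # 0.78
```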

🔗 Reliability limits validity

  • A test's correlation with another measure will generally be lower than the test's reliability.
  • Theoretical maximum: a test can correlate up to the square root of its reliability with another measure.
  • Example: if reliability is 0.81, maximum correlation with another measure is 0.90.
  • Red flag: correlations above this limit suggest data analysis errors (as found in some fMRI studies).

✅ Validity concepts

👁️ Face validity

Face validity: whether a test appears "on its face" to measure what it is supposed to measure.

  • Most straightforward type of validity.
  • Example: an Asian history test with Asian history questions has high face validity; one with American history questions does not.
  • This is about appearance, not actual measurement quality.

🎯 Predictive validity

Predictive validity (empirical validity): a test's ability to predict a relevant behavior.

  • Validated by correlation with real-world outcomes.
  • Example: SAT tests are validated by their ability to predict college grades.
  • The stronger the prediction, the higher the predictive validity.

🧩 Construct validity

Construct validity: whether a test's pattern of correlations with other measures aligns with the construct it claims to measure.

  • Established through two components:
    • Convergent validity: correlates highly with other measures of the same construct
    • Divergent validity: does not correlate highly with measures of different constructs
  • Example: a spatial ability test should correlate highly with other spatial tests but less with verbal ability or social intelligence tests.
  • Don't confuse: some constructs may overlap, making validation complex.

📊 Data collection principles

📊 Recording information

  • Core rule: ask for information in the way it will be most accurately reported.
  • Example: ask for height in feet and inches (how people know it), then convert to inches for analysis—don't ask people to convert on the fly.
  • Verbal data must be coded numerically for statistical analysis.

🔢 Conversion strategies

  • Verbal descriptions can be converted to numbers for analysis.
  • Example conversion scale:

| Verbal | Numeric |
| --- | --- |
| Very Little | 1 |
| Little | 2 |
| Moderate | 3 |
| Lots | 4 |
| Very Lots | 5 |

📐 Precision decisions

  • Decide on measurement precision before data collection begins.
  • Example problem: recording race times to only one decimal place created a tie when more precision was needed.
  • You can always collapse detailed data later, but you cannot add precision after collection.
  • Principle: measure more detail if uncertain—you can discard it later but cannot recover it.

📝 Specificity in questionnaires

  • Qualitative terms ("a lot," "little") are imprecise.
  • Better: ask for specific quantities (hours studied, exact exam score).
  • Even better: have subjects keep logs rather than relying on memory.

🎲 Sampling bias types

🎲 What sampling bias means

  • Key distinction: sampling bias refers to the method of sampling, not the sample itself.
  • Random sampling does not guarantee a representative sample—it only ensures differences are due to chance.
  • Biased sampling methods can sometimes produce representative samples by chance.

🙋 Self-selection bias

  • Occurs when people who volunteer differ systematically from the population.
  • Example: students volunteering for a study about intimate details of sex lives would not represent all students.
  • Example: online surveys about computer use attract people more interested in technology.
  • Can also occur after enrollment: subjects who leave an experiment may differ from those who stay.

📉 Undercoverage bias

  • Sampling too few observations from a segment of the population.
  • Famous example: 1936 Literary Digest poll predicted Landon would win against Roosevelt, but Roosevelt won by a large margin.
  • Cause: poorer people (more likely to support Roosevelt) were undercovered because they were less likely to have telephones.
  • Additional factor: nonresponse bias—Landon supporters were more likely to return surveys.

💀 Survivorship bias

  • Occurs when observations at the end are a non-random subset of those at the beginning.
  • Stock fund example: poorly-performing funds are eliminated or merged, so calculating mean appreciation of existing funds overestimates typical performance.
  • WWII example: Abraham Wald analyzed hits on returning aircraft—recommended armor for locations without hits, because hits there likely brought planes down (they didn't survive to return).
  • Don't confuse: looking only at survivors gives a biased picture of the whole population.

🧪 Experimental design structures

🧪 Between-subjects designs

Between-subjects design: different groups of subjects receive different experimental treatments.

  • Each subject experiences only one condition.
  • Example: subjects randomly assigned to either "charismatic teacher" or "punitive teacher" condition.
  • The independent variable has different levels, each tested with a different group.
  • Random assignment ensures differences between groups are chance differences, but does not eliminate them.
  • Inferential statistics determine whether observed differences are likely due to treatment or chance.

🔄 Within-subjects designs

Within-subjects design: the same subjects perform at all levels of the independent variable.

  • Also called repeated-measures designs.
  • Example: ADHD study where each subject tested under all four dose levels.
  • Each subject serves as their own control.

Key advantage: controls for individual differences in overall performance levels.

  • Some subjects naturally perform better than others regardless of condition.
  • Comparing the same subject across conditions removes this variability.
  • Result: within-subjects designs typically have more statistical power than between-subjects designs.

⚖️ Counterbalancing

  • Problem: in within-subjects designs, order effects can confound results.
  • Example problem: if all subjects receive doses in the same order (low to high), practice effects might be mistaken for dose effects.
  • Solution: counterbalance the order of presentations.
  • Each treatment appears in each sequential position an equal number of times across subjects.
  • Limitation: does not work well if there are complex dependencies between treatment order and outcomes—use between-subjects design instead.

🏗️ Factorial designs

Factorial design: includes all combinations of levels of multiple independent variables.

  • Notation example: Associate's Weight (2) × Associate's Relationship (2) means two independent variables, each with two levels.
  • All four combinations are tested: (obese girlfriend, obese acquaintance, average girlfriend, average acquaintance).
  • Key advantage: can assess interactions—whether the effect of one variable depends on the level of another variable.
  • Example: does associate's weight have a larger effect when the associate is a girlfriend versus an acquaintance?
  • Can have three or more independent variables.

🔀 Complex designs

  • Can combine between-subjects and within-subjects variables in the same experiment.
  • Example: gender (between-subjects) combined with type of priming word and type of response word (both within-subjects).

🔗 Causation and inference

🔗 Causation in experiments

  • Random assignment to conditions does not eliminate differences in unmeasured variables—it makes them chance differences.
  • Example problem: in a sleep drug experiment, the control group might by chance have more stressed subjects, affecting sleep.
  • Solution: use within-condition variance to assess combined effects of all unmeasured variables.
  • Variance among subjects in the same condition reflects unmeasured variable effects.
  • Statistical methods calculate the probability that unmeasured variables could produce the observed difference.
  • If that probability is low, infer the treatment had an effect.
  • Don't confuse: total certainty is impossible—there is always some probability the difference occurred by chance.

🔗 Third-variable problem

Third-variable problem: a third variable is responsible for the correlation between two other variables.

  • Classic example: in 1970s Taiwan, positive correlation between contraception use and number of electric appliances.
  • Neither causes the other—education level affects both.
  • Makes it difficult to infer causation from correlation alone.

🔍 Converging evidence approach

  • Better than assuming no third-variable problem exists.
  • Use multiple types of evidence pointing to the same conclusion.
  • Example: smoking causes cancer conclusion based on:
    • Retrospective studies
    • Prospective studies
    • Lab studies with animals
    • Theoretical understanding of cancer causes

↔️ Direction of causality

  • Correlation does not indicate which variable causes which.
  • Example: correlation between public debt and GDP growth—does debt slow growth, or does slow growth increase debt?
  • Most evidence supports the latter.

💊 HDL and niacin case

  • Low HDL is a risk factor for heart disease.
  • Niacin increases HDL levels.
  • Assumption: niacin causes HDL increase, which causes lower heart disease risk.
  • Test: randomly assign low-HDL patients to niacin or no-niacin condition.
  • Finding that niacin increases HDL without decreasing heart disease would cast doubt on the causal relationship.
  • This is exactly what an NIH study found.
45

Basics of Data Collection

Basics of Data Collection

🧭 Overview

🧠 One-sentence thesis

Data must be converted into numerical form for statistical analysis, and researchers should collect information at the most accurate and detailed level possible because lost precision cannot be recovered later.

📌 Key points (3–5)

  • Why numerical conversion matters: statistical programs require numbers, not verbal descriptions like "5'4"" or "lots."
  • The core rule of data collection: ask for information in the way it will be most accurately reported by participants.
  • Precision cannot be added later: you can always collapse detailed data into broader categories, but you cannot expand imprecise data to include more detail after collection.
  • Common confusion: recording too little vs. too much detail—err on the side of more precision; you can discard extra digits later but cannot retrieve them once lost.
  • Questionnaire design trade-off: qualitative terms ("little," "moderate," "lots") are easier to answer but harder to analyze; precise numbers (hours, percentages) yield clearer statistical results but require accurate participant knowledge.

🔢 Converting verbal data to numbers

🔢 Why conversion is necessary

  • Statistical programs and calculators cannot process verbal descriptions.
  • Example: you cannot compute the average of "5'4"", "6'1"", and "5'10"" without first converting them to a common numerical unit.

📏 Height conversion example

  • The excerpt shows converting feet-and-inches to inches only:
    • 5'4" becomes (5 × 12) + 4 = 64 inches
    • 6'1" becomes (6 × 12) + 1 = 73 inches
  • Once all heights are in inches, a statistical program can easily calculate the mean.
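
To make the conversion arithmetic above concrete, here is a minimal Python sketch (using the three example heights from the excerpt) that converts feet-and-inches to inches and then averages the result:

```python
# Minimal sketch: convert feet-and-inches heights to inches, then average them.
heights_ft_in = [(5, 4), (6, 1), (5, 10)]           # heights as (feet, inches)

heights_in = [feet * 12 + inches for feet, inches in heights_ft_in]
print(heights_in)                                   # [64, 73, 70]

mean_height = sum(heights_in) / len(heights_in)
print(round(mean_height, 1))                        # 69.0
```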

🎚️ Ordinal scale conversion

  • Verbal descriptions like "Very Little," "Little," "Moderate," "Lots," "Very Lots" can be mapped to numbers 1, 2, 3, 4, 5.
  • This allows calculation of means and other statistics on originally qualitative data.

| Verbal description | Numerical code |
| --- | --- |
| Very Little | 1 |
| Little | 2 |
| Moderate | 3 |
| Lots | 4 |
| Very Lots | 5 |
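
A minimal sketch of this mapping in Python (the survey responses below are hypothetical, not from the excerpt):

```python
# Minimal sketch: map ordinal verbal responses to numeric codes so a mean can be computed.
codes = {"Very Little": 1, "Little": 2, "Moderate": 3, "Lots": 4, "Very Lots": 5}

responses = ["Moderate", "Lots", "Little", "Moderate"]   # hypothetical survey answers
numeric = [codes[r] for r in responses]

print(numeric)                         # [3, 4, 2, 3]
print(sum(numeric) / len(numeric))     # 3.0
```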

🎯 The golden rule: accuracy first

🎯 Ask in the most accurate form

The number one rule of data collection is to ask for information in such a way as it will be most accurately reported.

  • Most people know their height in feet and inches, not in inches alone.
  • Asking participants to convert on the fly introduces errors.
  • Better: collect in the familiar format (feet and inches), then the researcher converts it.

⚠️ Don't confuse convenience with accuracy

  • It might seem simpler to ask for height in inches directly, but participants cannot "quickly and accurately convert it into inches 'on the fly.'"
  • Researcher convenience should not compromise data quality.

📊 How much detail to record

📊 The track meet example

  • A digital clock shows runner times to eight decimal places (e.g., 22.93219780 seconds).
  • Recording only one decimal place (22.9) led to a tie between two runners who actually had different times.
  • Key lesson: once data collection is over, lost precision cannot be recovered—"you cannot go back in time and record running times to more decimal places."

🔽 You can collapse, but not expand

  • You can always decide later to use fewer digits or broader categories.
  • You cannot add detail that was never measured.
  • Example: if you record 22.9, you have permanently lost the information that the true time was 22.93219780.

🧭 Planning ahead

  • "Think very carefully about the scales and specificity of information needed in your research before you begin collecting data."
  • If unsure whether you'll need more detail, measure it—you can discard it later if unnecessary.
  • The excerpt notes you "probably would not need to record eight digits to the right of the decimal point," but one digit is "clearly too few."

📝 Questionnaire design trade-offs

📝 Qualitative vs. quantitative questions

The excerpt contrasts two versions of a student study-time questionnaire:

| Approach | Study time question | Grade question | Analysis difficulty |
| --- | --- | --- | --- |
| Qualitative | "a lot / moderate / little" | "A / B / C / D / F" | Hard to see patterns; imprecise |
| Quantitative | "How many hours studied?" | "% Correct on exam" | Clear numerical relationship |

🧠 Why precision helps analysis

  • With qualitative terms ("Little," "Lots," "B"), "it's difficult to tell" if more study leads to better grades.
  • With exact numbers (5 hours → 71%, 13 hours → 97%), the relationship is much clearer.

🤔 When to use each approach

  • The excerpt does not say qualitative scales are always wrong; it shows that precise numbers yield clearer statistical results.
  • Important caveat: "this assumes the students would know how many hours they studied."
  • If participants cannot accurately recall hours, asking for a number may violate the golden rule (accuracy first).
  • Solution suggested: "ask them to keep a log of their study time as they study" to ensure accurate numerical data.

⚖️ Don't confuse participant burden with data quality

  • Asking for exact hours or keeping a log is more work for participants.
  • But if the research question requires precision, the extra effort is justified.
  • The key is whether participants can provide accurate answers in the requested format.
46

Sampling Bias

Sampling Bias

🧭 Overview

🧠 One-sentence thesis

Sampling bias arises from the method of selecting observations, not the sample itself, and can systematically exclude or overrepresent certain groups, leading to non-representative samples that undermine valid generalization.

📌 Key points (3–5)

  • What sampling bias is: a problem with the method of sampling, not necessarily the resulting sample—even random sampling can produce unrepresentative samples by chance, and biased methods can sometimes yield representative samples.
  • Self-selection bias: occurs when people who volunteer or stay in a study differ systematically from the population, because participation is not random.
  • Undercoverage bias: happens when too few observations are sampled from a segment of the population, leaving that group underrepresented.
  • Survivorship bias: arises when only observations that "survive" to the end are recorded, excluding those that dropped out or failed, creating a skewed picture.
  • Common confusion: sampling bias vs. a bad sample—bias refers to the process, not the outcome; a biased method tends to produce non-representative samples, but is not guaranteed to do so every time.

🧩 Core concept: what sampling bias is

🧩 Definition and key distinction

Sampling bias refers to the method of sampling, not the sample itself.

  • It is about how you select observations, not what you end up with.
  • A random sampling method does not guarantee a representative sample every time—chance variation can still occur.
  • Conversely, a biased sampling method does not guarantee that every sample will be greatly non-representative—sometimes a biased method might accidentally produce a reasonable sample.
  • Why this matters: you cannot judge bias by inspecting one sample; you must evaluate the selection process.

🔍 Don't confuse: method vs. outcome

  • The excerpt emphasizes that "there is no guarantee that random sampling will result in a sample representative of the population just as not every sample obtained using a biased sampling method will be greatly non-representative."
  • Example: flipping a fair coin 10 times might yield 8 heads by chance (unrepresentative outcome), but the method is still unbiased.
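
To make the method-vs-outcome point concrete, here is a small Python simulation (a sketch, not from the excerpt) estimating how often an unbiased method, fair coin flips, still produces a lopsided sample:

```python
# Minimal sketch: an unbiased sampling method (fair coin flips) can still
# produce an unrepresentative sample (8 or more heads out of 10) by chance.
import random

random.seed(1)
trials = 100_000
extreme = sum(1 for _ in range(trials)
              if sum(random.randint(0, 1) for _ in range(10)) >= 8)

# Roughly 5-6% of samples are this lopsided even though the method is unbiased.
print(extreme / trials)
```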

🙋 Self-selection bias

🙋 What it is

Self-selection bias: people who volunteer or choose to participate differ systematically from the population.

  • Occurs when subjects "self-select" themselves into the study.
  • The sample is likely to differ in important ways from the population the experimenter wants to draw conclusions about.

🎯 When it happens

The excerpt gives two scenarios:

  1. Before enrollment: volunteers sign up knowing what the study is about.

    • Example: a university newspaper ad asks students to volunteer for a study discussing intimate details of their sex lives—those who volunteer are not representative of all students.
    • Example: an online survey about computer use attracts people more interested in technology than typical.
  2. After enrollment: subjects enlist without knowing the study topic, then leave when they find out.

    • Example: subjects show up for an experiment, learn it will ask intimate questions, and many leave—resulting in a biased sample of those who stayed.

📺 Common examples

  • "Non-scientific" polls on television or websites suffer greatly from self-selection bias because only certain types of people choose to respond.

🕳️ Undercoverage bias

🕳️ What it is

Undercoverage bias: sampling too few observations from a segment of the population, leaving that group underrepresented.

  • A segment of the population is systematically under-sampled.
  • The resulting sample does not reflect the true proportions in the population.

📰 Historical example: the 1936 Literary Digest poll

  • The poll predicted Landon would win the election against Roosevelt by a large margin; in fact, Roosevelt won by a large margin.
  • Common explanation: poorer people were undercovered because they were less likely to have telephones, and this group was more likely to support Roosevelt.
  • Additional factor: a detailed analysis by Squire (1988) showed there was also a nonresponse bias (a form of self-selection bias)—those favoring Landon were more likely to return their survey than those favoring Roosevelt.

🔍 Don't confuse: undercoverage vs. self-selection

  • Undercoverage: the sampling frame or method fails to reach certain groups.
  • Self-selection: certain groups choose not to participate or to drop out.
  • The 1936 poll had both problems.

🛡️ Survivorship bias

🛡️ What it is

Survivorship bias: the observations recorded at the end are a non-random subset of those present at the beginning, because some observations "did not survive."

  • Only the survivors are included in the analysis, excluding those that dropped out, failed, or were eliminated.
  • This creates a skewed picture that overrepresents success or performance.

📈 Example: stock fund performance

  • Scenario: calculate the mean 10-year appreciation of stock funds that exist today.
  • Problem: poorly-performing funds are often eliminated or merged into other funds over time.
  • Result: the sample includes only funds that survived 10 years, which tend to be better-performing; the poorly-performing funds that did not survive are excluded.
  • Implication: the calculated mean is biased upward, and cannot be validly generalized to all stock funds of the same type.
  • The excerpt notes there is good evidence that this survivorship bias is substantial (Malkiel, 1995).
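
A minimal simulation of this upward bias, using entirely hypothetical return figures and an arbitrary survival cutoff (neither comes from the excerpt):

```python
# Minimal sketch (hypothetical numbers): funds with poor returns are dropped,
# so the mean return of the "survivors" overstates the mean of all funds.
import random

random.seed(0)
all_returns = [random.gauss(0.05, 0.10) for _ in range(1000)]   # true 10-year returns
survivors = [r for r in all_returns if r > -0.05]               # poor performers eliminated

print(sum(all_returns) / len(all_returns))   # close to the true mean of 0.05
print(sum(survivors) / len(survivors))       # biased upward
```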

✈️ Example: World War II aircraft armor

  • Scenario: statistician Abraham Wald analyzed the distribution of hits from anti-aircraft fire on aircraft returning from missions, to decide where to place extra armor.
  • Naive approach: put armor where returning planes were frequently hit, to reduce damage there.
  • Survivorship bias problem: only a subset of aircraft return—those that were hit in certain locations and did not return are missing from the data.
  • Wald's insight: if there were few hits in a certain location on returning planes, then hits in that location were likely to bring a plane down (so those planes did not return).
  • Recommendation: locations without hits on the returning planes should be given extra armor, because hits there are fatal.
  • A detailed and mathematical description of Wald's work can be found in Mangel and Samaniego (1984).

🔍 Don't confuse: survivors vs. the full population

  • Survivorship bias means you see only the "winners" or "survivors."
  • The missing observations (failures, casualties, eliminated funds) are systematically different.
  • Analyzing only survivors leads to overly optimistic or misleading conclusions.
47

Experimental Designs

Experimental Designs

🧭 Overview

🧠 One-sentence thesis

Experimental designs vary in how subjects are assigned to treatments—between-subjects designs use different groups for each condition while within-subjects designs test the same subjects under all conditions—and these choices affect the experiment's ability to detect real treatment effects while controlling for individual differences and order effects.

📌 Key points (3–5)

  • Between-subjects designs: different groups of subjects receive different treatments; differences between conditions are comparisons between different people.
  • Within-subjects designs: the same subjects are tested under all treatment levels; each subject serves as their own control, increasing statistical power.
  • Common confusion: order effects vs. treatment effects—in within-subjects designs, counterbalancing is needed so that the sequence of tasks doesn't confound the results.
  • Factorial designs: experiments can have multiple independent variables with multiple levels, allowing researchers to test interactions between variables.
  • Key advantage of within-subjects: controls for individual differences in baseline performance, making it easier to detect true treatment effects.

🔀 Between-Subjects Designs

🔀 What defines a between-subjects design

Between-subjects design: the various experimental treatments are given to different groups of subjects.

  • Each subject experiences only one level of the independent variable.
  • Comparisons are made between different groups of people.
  • Example: In the Teacher Ratings study, one group was told the instructor was charismatic, another group was told the instructor was punitive—different subjects in each condition.

🎲 The role of random assignment

  • Subjects are randomly assigned to treatment conditions.
  • Random assignment ensures all differences between groups are chance differences, not systematic biases.
  • Don't confuse: random assignment eliminates systematic bias but does not eliminate chance differences—groups can still differ by luck.
  • Inferential statistics help distinguish real treatment effects from chance differences.

📊 Multiple levels of one variable

  • Independent variables can have more than two levels.
  • Example: The "Smiles and Leniency" study had four levels—false smile, felt smile, miserable smile, and neutral control.
  • Important: four levels still means one independent variable, not four variables.

🔢 Multi-Factor Designs

🔢 What factorial designs allow

Factorial design: all combinations of levels of multiple independent variables are included.

  • Notation example: Associate's Weight (2) x Associate's Relationship (2) means two independent variables, each with two levels.
  • All four combinations are tested: obese girlfriend, obese acquaintance, average-weight girlfriend, average-weight acquaintance.
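
A minimal sketch of enumerating the cells of this 2 x 2 design (the level names come from the excerpt; itertools is simply one convenient way to list all combinations):

```python
# Minimal sketch: enumerate all cells of a 2 x 2 factorial design.
from itertools import product

weight = ["obese", "average-weight"]
relationship = ["girlfriend", "acquaintance"]

for cell in product(weight, relationship):
    print(cell)
# ('obese', 'girlfriend'), ('obese', 'acquaintance'),
# ('average-weight', 'girlfriend'), ('average-weight', 'acquaintance')
```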

🔗 Testing interactions

  • Factorial designs let researchers ask whether one variable's effect depends on the level of another variable.
  • Example: Does the associate's weight matter more when the associate is a girlfriend versus just an acquaintance?
  • When one variable's effect changes depending on another variable, this is called an interaction.
  • Why it matters: running two separate single-variable experiments would miss these interaction effects entirely.

🔁 Within-Subjects Designs

🔁 What defines a within-subjects design

Within-subjects design: the same subjects perform at all levels of the independent variable.

  • Also called repeated-measures designs.
  • Each person is tested multiple times, once under each condition.
  • Example: In the ADHD Treatment study, every subject received all four dose levels at different times.

⚖️ Counterbalancing to avoid confounds

Counterbalancing: giving treatments in different orders so that each treatment appears in each sequential position an equal number of times.

  • The problem: if every subject received doses in the same order (low to high), practice effects could be mistaken for treatment effects.
  • The solution: vary the order across subjects systematically.
  • Example table from the excerpt:
    • Subject 1: 0 mg first, .15 mg second, .30 mg third, .60 mg fourth
    • Subject 2: .15 mg first, .30 mg second, .60 mg third, 0 mg fourth
    • Subject 3: .30 mg first, .60 mg second, 0 mg third, .15 mg fourth
    • Subject 4: .60 mg first, 0 mg second, .15 mg third, .30 mg fourth
  • Limitation: counterbalancing doesn't work well if there are complex dependencies between which treatment precedes which; in those cases, between-subjects designs are better.
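
A minimal sketch that reproduces the cyclically rotated (Latin-square style) orders listed above, so each dose appears once in each serial position:

```python
# Minimal sketch: build the rotated dose orders shown in the table above.
doses = ["0 mg", ".15 mg", ".30 mg", ".60 mg"]

orders = [doses[i:] + doses[:i] for i in range(len(doses))]
for subject, order in enumerate(orders, start=1):
    print(f"Subject {subject}: {order}")
# Subject 1: ['0 mg', '.15 mg', '.30 mg', '.60 mg']
# Subject 2: ['.15 mg', '.30 mg', '.60 mg', '0 mg']
# ...
```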

💪 The key advantage of within-subjects designs

  • Individual differences are controlled: people naturally vary in baseline performance (some solve problems faster, some have higher blood pressure).
  • Within-subjects designs compare each person to themselves across conditions, removing the noise from individual differences.
  • This gives within-subjects designs more power—they are better able to detect real treatment effects.
  • Each subject "serves as his or her own control."

🧩 Complex and Mixed Designs

🧩 Combining design types

  • Experiments can mix between-subjects and within-subjects variables in the same study.
  • Example: The "Weapons and Aggression" study had one between-subjects variable (gender) and two within-subjects variables (type of priming word and type of word to respond to).
  • This flexibility allows researchers to control individual differences on some factors while comparing different groups on others.
48

Causation

Causation

🧭 Overview

🧠 One-sentence thesis

Statistical methods allow researchers to infer causation in experiments by using within-condition variance to assess whether unmeasured variables could explain observed differences, while non-experimental designs require converging evidence to establish causal relationships.

📌 Key points (3–5)

  • Random assignment creates chance differences: Random assignment to experimental/control groups does not eliminate differences in unmeasured variables—it only ensures differences are due to chance.
  • Within-condition variance measures unmeasured variables: Since everyone in a condition is treated identically, differences among subjects within that condition reflect the combined effects of all unmeasured variables.
  • The third-variable problem: In non-experimental designs, correlation does not imply causation because a third variable may be responsible for the relationship between two other variables.
  • Common confusion: Random assignment vs. no differences—random assignment does not guarantee groups are identical, only that differences are chance-based rather than systematic.
  • Direction of causality: Even when correlation exists, it may be unclear which variable causes which.

🔬 How experiments establish causation

🎲 The role of random assignment

  • In a simple experiment, subjects are sampled randomly from a population and assigned randomly to experimental or control groups.
  • Example: An experimental group receives a drug for insomnia, a control group receives a placebo, and researchers measure minutes slept.
  • Random assignment ensures differences on unmeasured variables (stress, caffeine consumption, genetics, prior sleep) are chance differences, not systematic ones.
  • Don't confuse: Random assignment does not eliminate differences—it only makes them random rather than systematic.

🔍 The unmeasured variables problem

Unmeasured variables: factors that affect the dependent variable but are not directly measured or controlled in the experiment.

  • Many variables affect outcomes (e.g., stress, physiological factors, caffeine intake).
  • Perhaps by chance, many control group subjects were under high stress, making it harder to sleep.
  • The fact that greater stress was due to chance does not mean it couldn't be responsible for the observed difference.
  • The challenge: It is impossible to measure every variable that affects the dependent variable.

📊 How variance solves the problem

  • Although individual unmeasured variables cannot be assessed, their combined effects can be measured.
  • Since everyone in a given condition is treated the same, differences in their scores must be due to unmeasured variables.
  • Within-condition variance measures the sum total of effects from all unmeasured variables.
  • Statistical methods use this variance to calculate the probability that unmeasured variables could produce a difference as large as the one observed.
  • If that probability is low, researchers infer the treatment had an effect (this is why it's called inferential statistics).
  • Important limitation: There is always some nonzero probability the difference occurred by chance, so total certainty is impossible.
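
The excerpt does not name a specific procedure; an independent-samples t test is one standard way this logic is carried out. A minimal sketch with made-up sleep times (SciPy assumed, not mentioned in the excerpt):

```python
# Minimal sketch (made-up numbers): the t test compares the group difference
# against within-condition variability to estimate the probability that chance
# (i.e., unmeasured variables) alone could produce a difference this large.
from scipy import stats

drug    = [412, 378, 440, 395, 420, 431, 389, 402]   # minutes slept, experimental group
placebo = [365, 340, 398, 372, 355, 381, 347, 360]   # minutes slept, control group

t, p = stats.ttest_ind(drug, placebo)
print(round(t, 2), round(p, 4))   # a small p means chance alone is an unlikely explanation
```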

🔗 Causation in non-experimental designs

⚠️ The third-variable problem

Third-variable problem: a third variable is responsible for the correlation between two other variables, creating a spurious relationship.

  • The saying "correlation does not mean causation" reflects this main fallacy.
  • Example from the excerpt: In Taiwan in the 1970s, there was a positive correlation between contraception use and number of electric appliances in one's house.
  • Using contraception does not cause you to buy appliances (or vice versa).
  • Instead, the third variable of education level affects both.

🔄 The direction-of-causality problem

  • A correlation between two variables does not indicate which variable is causing which.
  • Example: A strong correlation exists between public debt and GDP growth.
  • Some argue public debt slows growth; most evidence supports the alternative that slow growth increases public debt.
  • The correlation alone cannot tell us the causal direction.

🛠️ Approaches to inferring causation without experiments

❌ The weak approach: Assume no third variable

  • One approach is to simply assume you do not have a third-variable problem.
  • The excerpt notes this approach is common but not very satisfactory.
  • Warning: This assumption may be hidden behind complex causal models with sophisticated mathematics.

✅ The better approach: Converging evidence

  • A more difficult but better approach is to find converging evidence from multiple sources.
  • Example: Concluding that smoking causes cancer required converging evidence from:
    • Retrospective studies
    • Prospective studies
    • Lab studies with animals
    • Theoretical understandings of cancer causes
  • Multiple lines of evidence pointing to the same conclusion strengthen causal inference.

💊 Real-world application: HDL and heart disease

The excerpt includes a statistical literacy example:

  • Low HDL levels are a known risk factor for heart disease.
  • Niacin increases HDL levels and has been recommended for patients with low HDL.
  • The assumption: Niacin causes HDL to increase, thus causing lower risk for heart disease.
  • The test: Randomly assign patients with low HDL to receive niacin or not.
  • The finding: An NIH study found niacin increased HDL without decreasing heart disease, casting doubt on the causal relationship between HDL and heart disease.
  • This demonstrates that correlation (HDL and heart disease) does not guarantee causation.
49

Introduction to Normal Distributions

Introduction to Normal Distributions

🧭 Overview

🧠 One-sentence thesis

The normal distribution is fundamental to statistics because many natural phenomena follow it approximately, and even when they don't, the distribution of sample means becomes normal as sample size increases.

📌 Key points (3–5)

  • Historical origin: the normal distribution was discovered through studying measurement errors in astronomy, where small errors occurred more frequently than large errors and followed a symmetric pattern.
  • Central limit theorem: even if the original data is not normally distributed, the means of repeated samples will be very nearly normal, and larger samples produce distributions closer to normal.
  • Why it matters for testing: most statistical tests assume normal distributions, but they work well even with roughly normal data because the distribution of means is very close to normal.
  • Common confusion: the normal distribution applies both to naturally occurring phenomena (like height and weight) and to distributions of sample means from non-normal populations.
  • Standard areas: 68% of any normal distribution falls within one standard deviation of the mean; 95% falls within 1.96 standard deviations.

📜 Historical development

🔭 Discovery through measurement errors

  • Galileo (17th century) observed that astronomical measurement errors had two key properties:
    • Errors were symmetric (equally likely above or below the true value).
    • Small errors occurred more frequently than large errors.
  • These observations led to several hypothesized error distributions, but the correct formula wasn't discovered until the early 19th century.

🧮 Mathematical formulation

  • Two mathematicians independently developed the normal distribution formula:
    • Adrain in 1808
    • Gauss in 1809
  • Both showed that measurement errors fit this distribution well.
  • Example: imperfect astronomical instruments and observers produced errors that clustered near zero and tapered off symmetrically.

🎯 Laplace and the central limit theorem

  • Laplace discovered the same distribution in 1778 while deriving the central limit theorem.
  • His key insight:

    Even if a distribution is not normally distributed, the means of repeated samples from the distribution would be very nearly normally distributed, and the larger the sample size, the closer the distribution of means would be to a normal distribution.

  • This discovery is "extremely important" because it extends the usefulness of the normal distribution far beyond naturally normal phenomena.

👤 Application to human characteristics

  • Quételet was the first to apply the normal distribution to human characteristics.
  • He noted that traits such as height, weight, and strength were normally distributed.
  • This showed the distribution applied beyond measurement errors to natural biological variation.

🌍 Why the normal distribution matters

🌿 Natural phenomena

The distributions of many natural phenomena are at least approximately normally distributed.

  • The importance stems primarily from this empirical fact.
  • Many real-world measurements cluster around a central value with symmetric tails.

🧪 Statistical testing implications

  • Most statistical procedures for testing differences between means assume normal distributions.
  • Why tests still work: because the distribution of means is very close to normal, these tests perform well even if the original distribution is only roughly normal.
  • Don't confuse: the requirement is not that every data point be perfectly normal, but that the distribution of sample means be approximately normal (which the central limit theorem guarantees).

📏 Standard properties of normal distributions

📐 The 68% rule

  • General rule: 68% of the area of any normal distribution is within one standard deviation of the mean.
  • This holds regardless of the specific mean or standard deviation values.
  • Example from the excerpt:
    • Distribution with mean 50 and standard deviation 10: 68% falls between 40 and 60.
    • Distribution with mean 100 and standard deviation 20: 68% falls between 80 and 120.

📊 The 95% rule

  • 95% of the area is within 1.96 standard deviations of the mean.
  • For quick approximations, it is sometimes useful to round to 2 standard deviations rather than 1.96.
  • Example: a normal distribution with mean 75 and standard deviation 10 has 95% of its area between 55.4 and 94.6.
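
A quick numerical check of these benchmark areas (a sketch using SciPy's normal CDF and inverse CDF; SciPy is not mentioned in the excerpt):

```python
# Minimal sketch: verify the 68% and 95% areas with the normal CDF.
from scipy.stats import norm

print(round(norm.cdf(1) - norm.cdf(-1), 3))        # ~0.683: within 1 SD of the mean
print(round(norm.cdf(1.96) - norm.cdf(-1.96), 3))  # ~0.95:  within 1.96 SD of the mean

# The mean-75, SD-10 example from the text:
print(round(norm.ppf(0.025, 75, 10), 1),           # ~55.4
      round(norm.ppf(0.975, 75, 10), 1))           # ~94.6
```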

🎲 Standard normal distribution

A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution.

  • Any value from any normal distribution can be transformed to the standard normal using the formula Z = (X - μ)/σ, where:
    • Z is the value on the standard normal distribution
    • X is the value on the original distribution
    • μ is the mean of the original distribution
    • σ is the standard deviation of the original distribution
  • Example: to find what portion of a distribution with mean 50 and standard deviation 10 is below 26, transform to Z: (26 - 50)/10 = -2.4. Tables show 0.0082 (or 0.82%) of the distribution is below this value.

📋 Using tables and calculators

  • Areas under the normal distribution can be calculated using tables or online calculators.
  • Standard normal tables show:
    • First column: Z values (number of standard deviations from the mean)
    • Second column: area below that Z value
  • Example: Z of negative 2.5 represents a value 2.5 standard deviations below the mean, with area 0.0062 below it.
  • If all values in a distribution are transformed to Z scores, the resulting distribution will have a mean of 0 and a standard deviation of 1.
50

Standard Normal Distribution and Z-Scores

Standard Normal Distribution and Z-Scores

🧭 Overview

🧠 One-sentence thesis

The standard normal distribution provides a universal reference framework for comparing values from any normal distribution by converting them into standardized Z-scores that measure distance from the mean in standard-deviation units.

📌 Key points (3–5)

  • What Z-scores represent: the number of standard deviations a value is above or below the mean
  • How to standardize: any normal distribution can be transformed to standard normal (mean = 0, SD = 1) using the formula Z = (X - μ)/σ
  • Why standardization matters: allows lookup of probabilities in standard tables and comparison across different scales
  • Common confusion: the Z-score itself is not a probability; it must be converted to an area under the curve using tables or calculators
  • Key application: the normal distribution approximates the binomial distribution when sample sizes are large enough

📊 The standard normal distribution

📊 What makes it "standard"

The standard normal distribution: a normal distribution with mean (μ) = 0 and standard deviation (σ) = 1.

  • Any normal distribution can be converted to this standard form
  • The Z column in tables represents standard deviations from the mean
  • Example: Z = -2.5 means a value 2.5 standard deviations below the mean
  • The corresponding area (probability) for Z = -2.5 is 0.0062

🔢 Reading Z-tables

  • Tables show cumulative probabilities (area below a given Z value)
  • The first column lists Z values (e.g., -2.4, -2.39, -2.38)
  • The second column shows the proportion of the distribution below that Z
  • Example: for Z = -2.4, the area below is 0.0082 (about 0.82% of values fall below)

🔄 Converting between distributions

🔄 The transformation formula

The excerpt provides the standardization formula:

  • Z = (X - μ)/σ
  • Z is the standardized value
  • X is the original value
  • μ is the population mean
  • σ is the population standard deviation

📐 Worked example

The excerpt demonstrates: "What portion of a normal distribution with a mean of 50 and a standard deviation of 10 is below 26?"

Steps:

  1. Apply the formula: Z = (26 - 50)/10 = -2.4
  2. Look up Z = -2.4 in the table
  3. Find that 0.0082 of the distribution is below this value
  4. Interpretation: about 0.82% of values fall below 26

Don't confuse: You can work directly with the original distribution using software, but the Z-transformation is essential when using standard tables.
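
The same worked example as a SciPy sketch (SciPy is an assumption here; any statistics package with a normal CDF works the same way):

```python
# Minimal sketch: the worked example, both via the Z transformation and directly.
from scipy.stats import norm

mu, sigma, x = 50, 10, 26
z = (x - mu) / sigma
print(z)                                  # -2.4
print(round(norm.cdf(z), 4))              # 0.0082 via the standard normal
print(round(norm.cdf(x, mu, sigma), 4))   # same answer without standardizing
```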

🎯 The standardizing process

Standardizing the distribution: transforming all values to Z-scores so the distribution has mean 0 and standard deviation 1.

  • If you convert every value in a distribution to Z-scores, the result is the standard normal distribution
  • This process preserves the shape and relative positions of values
  • It changes the scale but not the relationships between values

🔗 Normal approximation to binomial

🔗 When to use the approximation

The excerpt explains that the normal distribution can approximate the binomial distribution under certain conditions:

  • Rule of thumb: the approximation is good if both Nπ and N(1-π) are greater than 10
  • N is the number of trials
  • π is the probability of success

🎲 The continuity correction

Because the binomial is discrete and the normal is continuous:

  • A specific value (e.g., 8 heads) must be represented as a range
  • Use 7.5 to 8.5 to represent "8 heads"
  • This accounts for the fact that continuous distributions assign zero probability to single points

📊 Example calculation

The excerpt shows approximating "8 heads out of 10 flips":

Given:

  • Mean μ = Nπ = (10)(0.5) = 5
  • Variance σ² = Nπ(1-π) = (10)(0.5)(0.5) = 2.5
  • Standard deviation σ = 1.5811

Process:

  1. Find area below 8.5: convert to Z = (8.5 - 5)/1.5811 = 2.21, area = 0.987
  2. Find area below 7.5: convert to Z = (7.5 - 5)/1.5811 = 1.58, area = 0.943
  3. Subtract: 0.987 - 0.943 = 0.044

Don't confuse: For a range of outcomes (e.g., 8 to 10), use 7.5 to 10.5 as the continuous interval.

⚠️ Limitations and cautions

⚠️ Tail risk and extreme events

The excerpt mentions a "Statistical Literacy" discussion:

  • Critics argue that extreme events (more than 3 standard deviations from the mean) occur more frequently in reality than normal distributions predict
  • The assumption of normality has been called a "Great Intellectual Fraud" in some risk analyses
  • Highly-skewed distributions have more extreme values than normal distributions

📉 When normal assumptions fail

If the normal distribution is used to assess tail events when the true distribution is skewed:

  • The "tail risk" will be underestimated
  • Events more than 3 SD from the mean are very rare for normal distributions
  • But they are not as rare for other distribution shapes

Example: Financial market crashes may be more common than normal distribution models predict, leading to underestimation of risk.


Note: This excerpt focuses on the mechanics of using the standard normal distribution and Z-scores for probability calculations, with particular emphasis on the normal approximation to the binomial and awareness of when normal assumptions may not hold.

51

Areas Under Normal Distributions

Areas Under Normal Distributions

🧭 Overview

🧠 One-sentence thesis

The area under a normal distribution curve between any two points represents the probability of obtaining a value within that range, which is fundamental for statistical inference and hypothesis testing.

📌 Key points (3–5)

  • What areas represent: The total area under a normal curve equals 1 (or 100%), and any portion represents the probability of values falling in that range
  • How to find areas: Use z-scores to standardize values, then look up areas in normal distribution tables or calculators
  • Standard deviations and areas: Approximately 68% of values fall within ±1 SD, 95% within ±1.96 SD, and 99% within ±2.58 SD from the mean
  • Common confusion: The area represents probability, not the actual number of observations—you must multiply by sample size to get expected counts
  • Why it matters: Computing areas under normal curves is essential for confidence intervals, hypothesis tests, and determining statistical significance

📊 Understanding probability as area

📊 The fundamental concept

Area under the normal distribution curve: The proportion of the total area under the curve between two points, representing the probability of a value falling in that interval.

  • The entire area under any probability distribution curve equals 1.0 (representing 100% probability)
  • A specific region's area tells you the probability of randomly selecting a value from that region
  • Example: If the area between z = 0 and z = 1 is 0.34, there's a 34% chance of getting a value in that range

🎯 Why areas equal probabilities

  • For continuous distributions, we cannot calculate the probability of getting an exact value (it's infinitesimally small)
  • Instead, we calculate probabilities for ranges or intervals
  • The normal curve is constructed so that area = probability by definition

🔢 Computing areas using z-scores

🔢 The standardization process

To find areas under any normal distribution:

  1. Convert raw scores to z-scores using: z = (X - μ) / σ
  2. Use the standard normal distribution (mean = 0, SD = 1)
  3. Look up the area in a table or use a calculator

📱 Using normal distribution calculators

  • Modern calculators/software make this easier than tables
  • Input: mean, standard deviation, and the value(s) of interest
  • Output: the area (probability) for your specified region
  • Can find areas above, below, or between values

📋 Reading z-tables (traditional method)

  • Tables typically show area from the mean to a positive z-score
  • For negative z-scores, use symmetry of the normal curve
  • May need to add or subtract areas to get the desired region

🎲 Common probability calculations

🎲 Finding area above a value

  • Calculate z-score for your value
  • Find area from mean to that z-score
  • If z is positive: subtract from 0.5 to get upper tail
  • If z is negative: add to 0.5 to get upper tail
  • Example: To find P(X > 110) when μ = 100, σ = 15, first find z = (110-100)/15 = 0.67, then find the area above z = 0.67

🎲 Finding area below a value

  • Calculate z-score
  • If z is positive: add 0.5 to the area from mean to z
  • If z is negative: subtract the area from 0.5
  • This gives you the cumulative probability

🎲 Finding area between two values

  • Calculate z-scores for both values
  • Find area from mean to each z-score
  • Add the two areas if z-scores are on opposite sides of mean
  • Subtract the smaller from larger if both on same side
  • Example: Area between z = -1 and z = +1 is approximately 0.68

🔑 Key benchmark areas

🔑 The empirical rule (68-95-99.7)

| Range | Approximate area | z-values |
| --- | --- | --- |
| μ ± 1σ | 68% | ±1.00 |
| μ ± 2σ | 95% | ±1.96 for exactly 95% |
| μ ± 3σ | 99.7% | ±3.00 (±2.58 gives exactly 99%) |
  • These are approximations that are easy to remember
  • The exact values for 95% and 99% use z = 1.96 and z = 2.58
  • Useful for quick mental estimates

🔑 Common significance levels

  • For α = 0.05 (two-tailed): critical z = ±1.96
  • For α = 0.01 (two-tailed): critical z = ±2.58
  • For α = 0.05 (one-tailed): critical z = 1.645
  • These values define "unusual" or "significant" results in hypothesis testing

⚠️ Common pitfalls

⚠️ Don't confuse area with count

  • Area represents probability (a proportion)
  • To get expected number of observations: multiply area by total N
  • Example: If area = 0.30 and N = 100, expect about 30 observations in that range

⚠️ Watch for one-tailed vs. two-tailed

  • One-tailed: area in one direction only (above OR below)
  • Two-tailed: area in both tails combined (above AND below)
  • The same area threshold means different z-values depending on which you need

⚠️ Remember the symmetry property

  • The normal distribution is perfectly symmetric around the mean
  • Area from mean to +z equals area from mean to -z
  • Use this to simplify calculations
52

Chi Square Distribution

Chi Square Distribution

🧭 Overview

🧠 One-sentence thesis

The Chi Square distribution is a particularly useful statistical distribution that enables testing whether observed frequencies match theoretical expectations and whether two nominal variables are associated.

📌 Key points (3–5)

  • What Chi Square is: a distribution proven particularly useful in statistics for specific types of tests.
  • Two main applications: testing differences between expected and observed frequencies (one-way tables), and testing associations between two nominal variables (contingency tables).
  • Common confusion: the "Chi Square Test" often refers specifically to the contingency table application, not all uses of the distribution.
  • Foundation needed: understanding of distributions, the standard normal distribution, and degrees of freedom are prerequisites.

📊 Two main statistical applications

📊 One-way tables

One-way tables: using the Chi Square distribution to test the difference between theoretically expected and observed frequencies.

  • This application compares what you observe in data against what theory or expectation predicts.
  • The test evaluates whether differences between expected and observed counts are larger than would occur by chance.
  • Example: if theory predicts equal frequencies across categories, the test checks whether actual data deviates significantly from equal distribution.

📊 Contingency tables

Contingency tables: using Chi Square to test the association between two nominal variables.

  • This tests whether two categorical variables are related or independent.
  • This use is so common it is often called "the Chi Square Test" (singular, as if it were the only application).
  • Don't confuse: "Chi Square Test" can refer to this specific contingency table application, but Chi Square distribution has other uses too.
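
A minimal sketch of both uses with SciPy (the counts below are hypothetical, and SciPy is an assumption; the excerpt does not specify software):

```python
# Minimal sketch: goodness-of-fit (one-way table) and association (contingency table).
from scipy.stats import chisquare, chi2_contingency

# One-way table: are observed counts consistent with equal expected frequencies?
observed = [18, 25, 17]                      # hypothetical counts across three categories
stat, p = chisquare(observed)                # expected frequencies default to equal
print(round(stat, 2), round(p, 3))

# Contingency table: are two nominal variables associated?
table = [[20, 30],
         [25, 25]]                           # hypothetical 2 x 2 table of counts
stat, p, df, expected = chi2_contingency(table)
print(round(stat, 2), round(p, 3), df)
```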

🔗 Relationship to other concepts

🔗 Prerequisites

The excerpt identifies three foundational concepts needed:

| Prerequisite | Why it matters |
| --- | --- |
| Distributions | Chi Square is itself a distribution |
| Standard Normal Distribution | Foundation for understanding Chi Square |
| Degrees of Freedom | Required parameter for Chi Square tests |

🔗 Context in statistics

  • The excerpt positions Chi Square as having "proven to be particularly useful" rather than merely theoretical.
  • The distribution serves as a tool for practical hypothesis testing in multiple scenarios.
53

Normal Approximation to the Binomial

Normal Approximation to the Binomial

🧭 Overview

🧠 One-sentence thesis

The normal distribution can be used to approximate binomial probabilities by treating discrete outcomes as continuous intervals, and this approximation works well when both N times π and N times (1 minus π) are greater than 10.

📌 Key points (3–5)

  • Core technique: Use the normal distribution to approximate binomial probabilities by converting discrete values into continuous ranges (e.g., 8 heads becomes the interval 7.5 to 8.5).
  • Why the adjustment is needed: The binomial distribution is discrete (specific points), while the normal distribution is continuous (areas under curves), so the probability of any exact point in a continuous distribution is zero.
  • When the approximation works: The rule of thumb is that both N times π and N times (1 minus π) must be greater than 10 for good accuracy.
  • Common confusion: Don't try to find the probability of an exact value (e.g., exactly 8 heads) directly on the normal curve; instead, find the area for a range (7.5 to 8.5).
  • How to calculate: Convert binomial parameters to normal parameters (mean and standard deviation), then use Z-scores or a normal area calculator to find probabilities.

🔄 The fundamental problem and solution

❌ Why direct approximation fails

  • The binomial distribution assigns probabilities to specific discrete outcomes (0 heads, 1 head, 2 heads, etc.).
  • The normal distribution is continuous—it describes areas under a curve, not individual points.
  • The key issue: In a continuous distribution, the probability of any single exact value is always zero.
  • Example from the excerpt: The probability of getting exactly 1.897 standard deviations above the mean is 0 in a continuous distribution.

✅ The rounding solution

The solution is to round off and consider any value from 7.5 to 8.5 to represent an outcome of 8 heads.

  • Instead of asking "What is the probability of exactly 8 heads?", ask "What is the probability of 7.5 to 8.5 heads?"
  • This converts the discrete binomial question into a continuous normal area question.
  • The area under the normal curve from 7.5 to 8.5 approximates the binomial probability of 8 heads.
  • Don't confuse: This is not the same as finding the probability of "7 to 9 heads"—the half-unit buffer (0.5) on each side is essential for accuracy.

🧮 Step-by-step calculation method

📊 Setting up the normal parameters

The excerpt uses a coin-flip example: 10 flips of a fair coin, finding the probability of 8 heads.

Binomial parameters:

  • N = 10 (number of flips)
  • π = 0.5 (probability of heads on each flip)

Convert to normal parameters:

  • Mean: μ = N times π = 10 times 0.5 = 5
  • Variance: σ squared = N times π times (1 minus π) = 10 times 0.5 times 0.5 = 2.5
  • Standard deviation: σ = 1.5811

🎯 Computing the area

To find the probability of 8 heads:

  1. Define the interval: Use 7.5 to 8.5 (not just the point 8).
  2. Find the area below 8.5:
    • Z = (8.5 minus 5) divided by 1.5811 = 2.21
    • Area below Z = 2.21 is 0.987
  3. Find the area below 7.5:
    • Z = (7.5 minus 5) divided by 1.5811 = 1.58
    • Area below Z = 1.58 is 0.943
  4. Subtract: 0.987 minus 0.943 = 0.044

The approximation gives a probability of 0.044, which the excerpt notes is "very accurate" for these parameters.
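
A minimal sketch comparing the approximation with the exact binomial probability (SciPy assumed, not named in the excerpt):

```python
# Minimal sketch: normal approximation to P(8 heads in 10 flips) vs. the exact value.
from scipy.stats import norm, binom

N, p = 10, 0.5
mu = N * p
sigma = (N * p * (1 - p)) ** 0.5            # 1.5811

approx = norm.cdf(8.5, mu, sigma) - norm.cdf(7.5, mu, sigma)
exact = binom.pmf(8, N, p)

print(round(approx, 4))   # ~0.0435 (0.044 with the rounded table values above)
print(round(exact, 4))    # 0.0439, i.e. 45/1024
```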

📏 For a range of outcomes

The same logic extends to ranges:

  • To find the probability of 8 to 10 heads, calculate the area from 7.5 to 10.5.
  • Always add/subtract 0.5 to create the continuous interval around the discrete values.

✔️ When the approximation is adequate

📐 The rule of thumb

A rule of thumb is that the approximation is good if both N π and N(1 minus π) are both greater than 10.

What this means:

  • N π represents the expected number of "successes."
  • N times (1 minus π) represents the expected number of "failures."
  • Both must exceed 10 for the normal approximation to be reliable.

Why it matters:

  • If N or π are too small, the binomial distribution may be too skewed or lumpy for the smooth normal curve to fit well.
  • Example check for the coin flip: Nπ = 10 × 0.5 = 5 and N(1 - π) = 10 × 0.5 = 5. Both are below 10, so this example technically falls short of the rule of thumb, yet the excerpt states the approximation is still "very accurate" here.

🔍 Accuracy depends on parameters

  • The accuracy varies with different values of N (number of trials) and π (probability of success).
  • Larger N generally improves the approximation.
  • Values of π closer to 0.5 (symmetric cases) tend to work better than extreme values near 0 or 1.
54

Quantile-Quantile (Q-Q) Plots

Quantile-Quantile (Q-Q) Plots

🧭 Overview

🧠 One-sentence thesis

Q-Q plots provide a visual diagnostic tool that reveals whether sample data follow an assumed distribution by plotting sample quantiles against theoretical quantiles, with points falling on a straight line when the distributional assumption is correct.

📌 Key points (3–5)

  • Purpose: Q-Q plots check whether data follow a specific distribution (e.g., uniform, normal) by comparing observed quantiles to theoretical quantiles.
  • Interpretation: When data match the assumed distribution, points lie approximately on a straight line; departures from the line indicate violations of the distributional assumption.
  • Advantage over histograms: Q-Q plots avoid arbitrary design choices like bin width that affect histogram appearance.
  • Sample size matters: Larger samples produce Q-Q plots closer to the reference line even when the distribution is correct; small samples show more scatter.
  • Common confusion: Different departure patterns signal different problems—skewness shows one characteristic curve, heavy tails (kurtosis) show another.

📊 Why Q-Q plots exist

📊 The distributional assumption problem

  • Many statistical methods assume data follow a specific distribution (e.g., normal, uniform).
  • Before applying these methods, researchers need to verify the assumption holds.
  • Visual tools help assess whether the assumption is reasonable.

📉 Limitations of alternative methods

Histograms have design ambiguity:

  • The excerpt shows three histograms of the same uniform data with 10, 5, and 3 bins.
  • Visual perception varies dramatically depending on bin count.
  • No clear guidance on which histogram to trust.

CDF plots work but are less intuitive:

  • The cumulative distribution function (CDF) approach compares empirical CDF to theoretical CDF.
  • For uniform data, this means comparing a staircase function to the line y = x.
  • Q-Q plots provide essentially the same information with axes reversed, often easier to interpret.

🎯 How Q-Q plots work for uniform data

🎯 The basic construction

Q-Q plot for uniform data: a plot of theoretical quantiles (x-axis) against sample quantiles (y-axis), where sample quantiles are the sorted data values.

The uniform case is simplest:

  • For n data points, divide the interval (0,1) into n equal parts.
  • The theoretical quantile for the i-th ordered value is (i - 0.5)/n (the middle of the i-th interval).
  • Plot each pair: (theoretical quantile, observed value).

🔢 Small example with 5 points

The excerpt provides data: 0.03, 0.24, 0.41, 0.59, 0.67

| Sorted data | Rank (i) | Middle of interval | Plot point |
| --- | --- | --- | --- |
| 0.03 | 1 | 0.1 | (0.1, 0.03) |
| 0.24 | 2 | 0.3 | (0.3, 0.24) |
| 0.41 | 3 | 0.5 | (0.5, 0.41) |
| 0.59 | 4 | 0.7 | (0.7, 0.59) |
| 0.67 | 5 | 0.9 | (0.9, 0.67) |
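
A minimal sketch that builds the five (theoretical quantile, sample quantile) pairs in the table above from the excerpt's data:

```python
# Minimal sketch: pair each sorted value with the middle of its interval, (i - 0.5)/n.
data = [0.03, 0.24, 0.41, 0.59, 0.67]

n = len(data)
sample_q = sorted(data)
theoretical_q = [(i - 0.5) / n for i in range(1, n + 1)]   # 0.1, 0.3, 0.5, 0.7, 0.9

print(list(zip(theoretical_q, sample_q)))
# [(0.1, 0.03), (0.3, 0.24), (0.5, 0.41), (0.7, 0.59), (0.9, 0.67)]
```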

📏 The reference line

  • If data are truly uniform, points should fall near y = x.
  • The excerpt shows that with n = 1000, the Q-Q plot is almost identical to y = x.
  • With n = 10, more scatter appears even when data are uniform.

🔔 How Q-Q plots work for normal data

🔔 The key difference from uniform

For normal data, the theoretical quantile is not simply the middle of an interval but the inverse of the normal CDF applied to that middle value.

Why this matters:

  • The normal distribution is not flat like the uniform.
  • We need to find the z-value such that a specific fraction of the normal distribution falls below it.
  • Example: For the first of 5 points, we want the z-value where 10% of the distribution is below—this is z = -1.28.

🧮 Computing theoretical quantiles

The theoretical quantile corresponding to the i-th ordered sample value is:

  • The inverse of the normal CDF applied to (i - 0.5)/n
  • Written in the excerpt as: inverse-Phi((i - 0.5)/n)
  • For a sample of size 100, the first few expected quantiles are -2.576, -2.170, and -1.960.
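
A minimal sketch reproducing those first few theoretical quantiles with the inverse normal CDF (SciPy's `norm.ppf` is used here as a stand-in for inverse-Phi):

```python
# Minimal sketch: normal theoretical quantiles for a sample of size 100.
from scipy.stats import norm

n = 100
theoretical = [norm.ppf((i - 0.5) / n) for i in range(1, 4)]
print([round(q, 3) for q in theoretical])   # [-2.576, -2.17, -1.96]
```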

📐 The reference line for general mean and scale

For standardized data (mean 0, standard deviation 1):

  • The reference line is y = x.

For raw data with mean M and standard deviation s:

  • The reference line becomes y = M + s·x.
  • Alternatively, standardize the data first, then use y = x.

🚨 Detecting departures from normality

🚨 Skewed data

The excerpt shows a chi-squared distribution example:

  • Points match the theoretical line at the median and extremes.
  • Data are symmetric around the median.
  • Points are closer to the median than expected (for this particular skewed distribution).
  • The pattern curves away from the reference line in a characteristic way.

🚨 Heavy-tailed data (high kurtosis)

The excerpt shows a Student's t-distribution example:

  • Points follow the normal curve fairly closely in the middle.
  • The last dozen or so points on each extreme depart from the line.
  • This indicates more extreme values than a normal distribution would produce.

🚨 Don't confuse with sample size effects

  • Small samples (n = 10) show scatter even when the distribution is correct.
  • Large samples (n = 1000) hug the line closely when the distribution is correct.
  • Judge departures relative to what you'd expect for that sample size.

📚 Real data example: SAT case study

📚 The context

The excerpt analyzes 105 college students with:

  • Verbal SAT scores
  • University GPA

📚 What the Q-Q plots revealed

Verbal SAT:

  • Follows normal distribution reasonably well.
  • Some departure in extreme tails.

University GPA:

  • Highly non-normal Q-Q plot.
  • Pattern similar to the heavy-tailed simulation shown earlier.
  • Histograms revealed the data are bimodal with about 20% of students in a separate low-GPA cluster.

📚 Why this matters

  • The Q-Q plot flagged a problem that required further investigation.
  • The bimodal structure has implications for correlation analysis (correlation was 0.65 overall, 0.59 excluding the cluster).
  • Example: Before computing statistics that assume normality, the Q-Q plot warns that assumptions are violated.

🔬 Technical advantages

🔬 No arbitrary parameters

  • Unlike histograms, Q-Q plots require no choice of bin width or bin count.
  • The construction is fully determined by the data and the assumed distribution.

🔬 Formal hypothesis testing

The excerpt mentions (for advanced use):

  • The correlation coefficient of the n points in the Q-Q plot can test normality formally.
  • If this correlation is below a threshold (close to 0.95 for modest sample sizes), reject the null hypothesis that data are normal.

🔬 Connection to probability theory

  • The probability integral transform maps any random variable X to uniform (0,1) via Y = F(X), where F is the CDF of X.
  • This explains why Q-Q plots on standardized data approach y = x when the model is correct.
  • Q-Q plots act as "probability graph paper" that linearizes correctly distributed data.
55

Contour Plots

Contour Plots

🧭 Overview

🧠 One-sentence thesis

Contour plots visualize how a third variable behaves across two dimensions by drawing lines or shaded regions that connect points with the same value on that third variable.

📌 Key points (3–5)

  • What a contour plot shows: an X-Y plot where each contour line represents a constant value on a third variable.
  • Two visualization styles: lines connecting equal values, or shaded areas representing ranges of values.
  • Real example: the excerpt uses breakfast cereal data—fat and carbohydrates on the axes, calories shown as contours.
  • Common confusion: the third variable (e.g., calories) may not be exactly determined by the two plotted variables alone; other factors (like sugar and protein in cereals) can also contribute.
  • Why it matters: contour plots let you see patterns in three-dimensional data on a flat, two-dimensional surface.

📐 What contour plots display

📐 The basic structure

A contour plot contains a number of contour lines. Each contour line is shown in an X-Y plot and has a constant value on a third variable.

  • The X and Y axes show two variables.
  • The third variable is represented by lines (or shaded regions) rather than a third spatial axis.
  • Each line connects all the points where the third variable has the same value.

🥣 The cereal example

The excerpt uses breakfast cereal data to illustrate:

  • X axis: fat content.
  • Y axis: non-sugar carbohydrates.
  • Third variable (contours): calories.
  • Each contour line shows combinations of fat and carbohydrates that yield the same calorie count.

Don't confuse: The number of calories is not determined exactly by fat and carbohydrates alone—cereals also differ in sugar and protein. The contour lines show the relationship, but other factors contribute to the total.

🎨 Two ways to draw contour plots

🎨 Line-based contours (Figure 1)

  • Each line is labeled with a specific value (e.g., 75, 100, 125, 150 calories).
  • Lines connect points with the same calorie count.
  • Example: A cereal with 2 grams of fat and 20 grams of carbohydrates might lie on the "100 calorie" line.

🎨 Shaded-area contours (Figure 2)

  • Instead of lines, regions are shaded.
  • Each shaded area represents values less than or equal to the label shown to the right of the area.
  • This style emphasizes ranges rather than exact boundaries.

| Style | What it shows | Use case |
| --- | --- | --- |
| Lines | Exact boundaries for each value | Precise comparisons across specific levels |
| Shaded areas | Ranges (≤ a threshold) | Emphasizing regions and gradients |
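
A minimal matplotlib sketch of both styles. The cereal dataset is not included in the excerpt, so calories are approximated from the standard energy factors (about 9 kcal per gram of fat and 4 kcal per gram of carbohydrate) purely for illustration:

```python
# Minimal sketch: line-based contours (left) vs. shaded-area contours (right).
import numpy as np
import matplotlib.pyplot as plt

fat = np.linspace(0, 5, 100)                 # grams of fat (x axis)
carbs = np.linspace(0, 30, 100)              # grams of non-sugar carbohydrate (y axis)
F, C = np.meshgrid(fat, carbs)
calories = 9 * F + 4 * C                     # third variable, shown as contours

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
cs = ax1.contour(F, C, calories, levels=[75, 100, 125, 150])   # labeled contour lines
ax1.clabel(cs)
ax2.contourf(F, C, calories, levels=[0, 75, 100, 125, 150])    # shaded-area contours
plt.show()
```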

🔗 Relationship to 3D plots

🔗 Contour plots as a 2D alternative

  • The excerpt transitions to discussing 3D plots, which show the same three variables in three spatial dimensions.
  • A contour plot is essentially a "flattened" view of 3D data.
  • Example: The same cereal data (fat, carbohydrates, calories) can be shown as a 3D scatter plot where calories become the height, or as a contour plot where calories are represented by lines on a flat surface.

🔗 When contour plots are useful

  • They avoid the need to rotate or interpret depth in 3D space.
  • They make it easier to read exact values and compare levels at a glance.
  • Don't confuse: A contour plot does not show individual data points as clearly as a scatter plot; it summarizes the relationship into bands or lines.
56

3D Plots

3D Plots

🧭 Overview

🧠 One-sentence thesis

3D plots extend two-dimensional scatter plots to display three variables simultaneously, and interactive rotation can reveal systematic patterns in data that are not visible from a single fixed perspective.

📌 Key points (3–5)

  • What 3D plots show: data in three dimensions, where each axis represents one variable (e.g., fat, carbohydrates, and calories).
  • Adding a fourth dimension: a nominal variable (e.g., manufacturer) can be represented using different colors.
  • Interactive rotation value: rotating the axes to view data from different angles can reveal hidden patterns or systematic relationships not apparent from one viewpoint.
  • Common confusion: a single static view may look random, but rotating to a different perspective can expose non-random structure—always explore multiple angles.
  • Why it matters: 3D plots help detect data patterns, validate random number generators, and understand multivariate relationships.

📊 What 3D plots display

📊 Three-dimensional scatter plots

3D plots: visualizations that show data in three dimensions, extending the concept of two-dimensional scatter plots.

  • Each axis corresponds to one variable.
  • Example from the excerpt: fat, non-sugar carbohydrates, and calories from cereal types are plotted on three axes.
  • The plot places each data point in 3D space according to its values on all three variables.

🎨 Representing a fourth dimension

  • A fourth variable can be added as long as it is nominal (categorical).
  • The excerpt uses color coding to represent different manufacturers.
  • Example: each manufacturer's cereals appear in a distinct color, allowing comparison across groups within the same 3D space.

| Dimension | How it is shown |
| --- | --- |
| First three variables | X, Y, Z axes (continuous) |
| Fourth variable (nominal) | Color or symbol coding |
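
A minimal matplotlib sketch of this idea, using random stand-in data (the cereal dataset is not included in the excerpt) and color to carry the nominal fourth variable:

```python
# Minimal sketch: 3D scatter plot with a nominal fourth variable shown as color.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fat = rng.uniform(0, 5, 30)                           # stand-in grams of fat
carbs = rng.uniform(5, 25, 30)                        # stand-in grams of carbohydrate
calories = 9 * fat + 4 * carbs + rng.normal(0, 10, 30)
maker = rng.integers(0, 3, 30)                        # nominal variable: 3 "manufacturers"

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(fat, carbs, calories, c=maker)             # color codes the fourth dimension
ax.set_xlabel("Fat")
ax.set_ylabel("Carbohydrates")
ax.set_zlabel("Calories")
plt.show()
```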

🔄 Interactive rotation and its value

🔄 Viewing from different angles

  • Many statistical packages allow interactive rotation of the 3D axes.
  • The same dataset can look very different from different vantage points.
  • The excerpt shows two figures (Figure 1 and Figure 2) of the same cereal data from different perspectives.

🔍 Revealing hidden patterns

  • Rotating 3D plots can uncover aspects of the data not otherwise apparent.
  • The excerpt provides a striking example: data from a pseudo random number generator.
    • Figure 4 (one perspective): the data appear random and unsystematic.
    • Figure 5 (different perspective): the same data clearly show a non-random, systematic pattern.
  • Conclusion from the excerpt: "Clearly they were not generated by a random process" (visible only after rotation).

Don't confuse: A single static 3D view is like looking at a sculpture from one side—you may miss important structure. Always rotate to explore multiple perspectives.

🛠️ Practical use case

  • Example scenario: testing a pseudo random number generator.
    • If the generator is truly random, the 3D plot should show no systematic structure from any angle.
    • If rotation reveals patterns (as in Figures 4 and 5), the generator is flawed.
  • This demonstrates that 3D plots are not just for display—they are diagnostic tools.

🔗 Relationship to contour plots

🔗 Contour plots as an alternative

  • The excerpt begins by describing contour plots, which also represent three variables but in a 2D format.
  • Contour plots use lines (or shaded areas) to show constant values of a third variable on an X-Y plane.
  • Example: lines show carbohydrate and fat levels for cereals with the same number of calories.

🔗 When to use each

  • Contour plots: useful when you need a 2D representation; easier to print and annotate.
  • 3D plots: better for exploring relationships interactively and detecting patterns through rotation.
  • The excerpt does not state one is superior; they are complementary tools for visualizing three variables.

Note: The excerpt mentions that calorie values are not determined exactly by fat and non-sugar carbohydrates alone, since cereals also differ in sugar and protein—this reminds us that visualizing three variables does not capture all sources of variation.

57

Introduction to Sampling Distributions

Introduction to Sampling Distributions

🧭 Overview

🧠 One-sentence thesis

Sampling distributions—the theoretical distributions of statistics computed from all possible samples—form the foundation of inferential statistics by showing how sample statistics vary and how close they are likely to be to population parameters.

📌 Key points (3–5)

  • What a sampling distribution is: a theoretical distribution showing all possible values of a statistic (like the mean) and their probabilities when sampling from a population.
  • Two ways to conceptualize it: either enumerate all possible outcomes, or imagine taking thousands of samples repeatedly and plotting the relative frequency distribution.
  • Central limit theorem: as sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the population's shape.
  • Common confusion: sampling distributions are theoretical, not empirical—you don't actually observe them directly; you estimate their properties from your sample data.
  • Why it matters for inference: knowing the sampling distribution (especially its standard error) tells you how much sample statistics vary and how close your sample statistic is likely to be to the true population parameter.

🎱 Understanding sampling distributions through discrete examples

🎱 The pool ball example

The excerpt introduces sampling distributions using three pool balls numbered 1, 2, and 3.

  • When you sample two balls with replacement and compute the mean, there are nine possible outcomes.
  • The possible means are: 1.0, 1.5, 2.0, 2.5, or 3.0.
  • Each mean has a specific probability (relative frequency):
    • Mean of 1.0: probability 0.111 (1 out of 9 outcomes)
    • Mean of 1.5: probability 0.222 (2 out of 9 outcomes)
    • Mean of 2.0: probability 0.333 (3 out of 9 outcomes)
    • Mean of 2.5: probability 0.222 (2 out of 9 outcomes)
    • Mean of 3.0: probability 0.111 (1 out of 9 outcomes)

Sampling distribution of the mean: the distribution showing all possible values a sample mean can take and the probability of each value.
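
The nine-outcome enumeration above is easy to verify directly; here is a short sketch using only the Python standard library.

```python
# Enumerate the sampling distribution of the mean for the pool-ball example:
# balls numbered 1-3, samples of N = 2 drawn with replacement.
from itertools import product
from collections import Counter

balls = [1, 2, 3]
outcomes = list(product(balls, repeat=2))        # 9 equally likely ordered pairs

mean_counts = Counter((a + b) / 2 for a, b in outcomes)
for mean, count in sorted(mean_counts.items()):
    print(f"mean {mean:.1f}: probability {count / len(outcomes):.3f} ({count} of {len(outcomes)})")
# mean 1.0: 0.111, 1.5: 0.222, 2.0: 0.333, 2.5: 0.222, 3.0: 0.111
```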

🔄 Two ways to think about sampling distributions

Method 1: Enumerate all possible outcomes

  • List every possible sample that could be drawn.
  • Calculate the statistic (e.g., mean) for each sample.
  • Count the frequency of each statistic value.
  • This works well for simple, discrete cases.

Method 2: Repeated sampling conceptualization

  • Imagine drawing a sample, computing the statistic, and recording it.
  • Repeat this process thousands of times.
  • Plot the relative frequency distribution of the recorded statistics.
  • As the number of samples approaches infinity, this relative frequency distribution approaches the true sampling distribution.
  • This method is more practical for complex or continuous distributions.

📊 Every statistic has a sampling distribution

The excerpt emphasizes that sampling distributions exist for any statistic, not just the mean.

Example: The range (largest minus smallest value) also has a sampling distribution.

  • For the three pool balls with N=2, possible ranges are 0, 1, or 2.
  • Range of 0: probability 0.333
  • Range of 1: probability 0.444
  • Range of 2: probability 0.222

Don't confuse: Different sample sizes produce different sampling distributions—the sampling distribution for N=3 will differ from N=2.

🌊 Continuous distributions

🌊 Moving from discrete to continuous

When the population is continuous (or nearly so, like 1,000 pool balls numbered 0.001 to 1.000), enumeration becomes impractical or impossible.

  • With 1,000 balls, there are 1,000,000 possible pairs (1,000 × 1,000).
  • For truly continuous distributions, you cannot list all outcomes because there are infinitely many.
  • In continuous distributions, individual values have zero probability; instead we work with probability densities.

The repeated-sampling conceptualization becomes essential here: imagine repeatedly sampling and computing means to build up the relative frequency distribution.

📏 Properties of the sampling distribution of the mean

📏 Mean of the sampling distribution

The mean of the sampling distribution of the mean equals the mean of the population.

  • If the population mean is μ, then the mean of the sampling distribution (written as μ_M) is also μ.
  • This means sample means are "centered" around the true population mean.

📐 Variance and standard error

The variance of the sampling distribution of the mean is calculated as:

  • Variance = (population variance) divided by N (sample size)
  • Written as: σ²_M = σ² / N

Key insight: Larger sample sizes produce smaller variance in the sampling distribution—sample means cluster more tightly around the population mean.

Standard error of the mean: the standard deviation of the sampling distribution of the mean.

  • It equals the square root of the variance: σ_M = σ / √N
  • The standard error measures how much sample means differ from each other and from the population mean.
  • Smaller standard error → sample means are very close to the population mean
  • Larger standard error → sample means vary considerably

Example: If your sample mean is 125 and the standard error is 5, with a normal distribution your sample mean is likely within 10 units of the population mean (within two standard errors).

🔔 Central limit theorem

🔔 The remarkable theorem

Central limit theorem: Given a population with finite mean μ and finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N increases.

What makes this remarkable: Regardless of the parent population's shape, the sampling distribution of the mean becomes approximately normal as sample size increases.

🔔 Demonstration with uniform distribution

The excerpt describes a simulation starting with a uniform (flat) population distribution:

  • N = 2: The sampling distribution is far from normal but shows scores are denser in the middle than the tails.
  • N = 10: The sampling distribution is quite close to normal.
  • Both distributions have the same mean, but N=10 has smaller spread.

Even when the parent population is "very non-normal," the sampling distribution of the mean approximates a normal distribution as N increases.

Don't confuse: The population distribution can be any shape; it's the sampling distribution of the mean that becomes normal with larger N.
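
A quick simulation sketch of this demonstration. The uniform(0, 1) parent population and the number of repetitions are assumptions for illustration; the excerpt's figures are not reproduced here.

```python
# Simulating the sampling distribution of the mean from a uniform parent
# population, mirroring the N = 2 vs. N = 10 comparison described above.
import numpy as np

rng = np.random.default_rng(42)
population_sd = np.sqrt(1 / 12)              # SD of a uniform(0, 1) population

for n in (2, 10):
    sample_means = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    print(f"N = {n:2d}: mean of sample means = {sample_means.mean():.3f}, "
          f"SD of sample means = {sample_means.std():.3f}, "
          f"predicted sigma/sqrt(N) = {population_sd / np.sqrt(n):.3f}")
# Both are centered near 0.5; the spread shrinks as sigma/sqrt(N), and a
# histogram of the N = 10 means is already close to a normal curve.
```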

🔗 Connection to inferential statistics

🔗 How sampling distributions enable inference

In practice, the process works in reverse from the examples:

  1. You collect sample data (you have one sample, not all possible samples).
  2. From your sample data, you estimate parameters of the sampling distribution.
  3. Knowledge of the sampling distribution tells you how much sample statistics typically vary.
  4. This gives you a sense of how close your particular sample statistic is likely to be to the population parameter.

🔗 Using the standard error

The standard error is directly available from the sampling distribution and provides crucial information:

  • It measures the typical distance between sample means and the population mean.
  • If the standard error is small, your sample mean is likely close to the population mean.
  • If the standard error is large, there's more uncertainty about how close your sample mean is to the population mean.

Remember: All statistics (variance, difference between means, correlation, proportions) have sampling distributions, not just the mean. Later sections cover these other sampling distributions.

58

Sampling Distribution of the Mean

Sampling Distribution of the Mean

🧭 Overview

🧠 One-sentence thesis

The sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the parent population's shape, and its variance decreases as sample size grows.

📌 Key points (3–5)

  • Mean of the sampling distribution: equals the population mean μ.
  • Variance shrinks with sample size: the variance of the sampling distribution of the mean is the population variance divided by N (sample size).
  • Central limit theorem: as N increases, the sampling distribution of the mean approaches a normal distribution, no matter what the parent population looks like.
  • Common confusion: larger sample size → smaller variance of the sampling distribution (not the population variance itself).
  • Standard error of the mean: the standard deviation of the sampling distribution of the mean; it decreases as N increases.

📏 Mean of the sampling distribution

📏 What the mean equals

The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled.

  • Notation: μ_M (mean of the sampling distribution of the mean) = μ (population mean).
  • This is straightforward: if you repeatedly sample from a population and compute the mean each time, the average of all those sample means equals the population mean.
  • Example: if a population has a mean of 50, the mean of all possible sample means is also 50.

📉 Variance and standard error

📉 Variance of the sampling distribution

The variance of the sampling distribution of the mean is the population variance divided by N, the sample size.

  • Formula in words: variance of sampling distribution of the mean = (population variance) / N.
  • Why it matters: the larger the sample size, the smaller the variance of the sampling distribution of the mean.
  • This means sample means cluster more tightly around the population mean when samples are larger.

🔢 How the variance formula is derived (optional)

  • The excerpt provides an optional derivation using the variance sum law.
  • For N numbers sampled from a population with variance σ², the variance of the sum is N σ².
  • Since the mean is 1/N times the sum, the variance of the mean is (1/N²) times the variance of the sum, which equals σ²/N.
  • Don't confuse: this is the variance of the sampling distribution, not the variance of individual scores.

📐 Standard error of the mean

The standard error of the mean is the standard deviation of the sampling distribution of the mean.

  • It is the square root of the variance of the sampling distribution of the mean.
  • Formula in words: standard error = square root of (population variance / N).
  • Notation: σ_M (the subscript M indicates it is the standard error of the mean).
  • Example: if population variance is 64 and N = 16, the standard error is square root of (64/16) = 2.

🔔 Central limit theorem

🔔 What the theorem states

Given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ²/N as N, the sample size, increases.

  • The mean and variance formulas are not new, but what is remarkable is the shape: the sampling distribution becomes normal as N increases, regardless of the parent population's shape.
  • This holds even if the parent population is uniform, skewed, or very non-normal.

🎲 Simulation examples

The excerpt describes simulations to illustrate the central limit theorem:

| Parent population | Sample size N | Result |
| --- | --- | --- |
| Uniform | N = 2 | Distribution is far from normal, but scores are denser in the middle than in the tails |
| Uniform | N = 10 | Distribution is quite close to a normal distribution |
| Very non-normal | Larger N | Sampling distribution approximates a normal distribution; slight positive skew may remain |

  • Notice: the means of the distributions are the same, but the spread (variance) is smaller for larger N.
  • Don't confuse: the parent population does not need to be normal; the sampling distribution of the mean becomes normal as N increases.

📊 Key takeaway

  • The larger the sample size, the closer the sampling distribution of the mean is to a normal distribution.
  • This property is fundamental for many inferential statistics techniques.
59

Sampling Distribution of Difference Between Means

Sampling Distribution of Difference Between Means

🧭 Overview

🧠 One-sentence thesis

The sampling distribution of the difference between two sample means is normally distributed and allows us to calculate the probability that one sample mean will exceed another by a given amount.

📌 Key points

  • What it describes: the distribution of differences between means from repeated samples drawn from two populations.
  • Shape and parameters: normally distributed with a mean equal to the difference in population means and a calculable standard error.
  • How to use it: convert the difference of interest into standard deviations above or below the mean, then find the probability using a Z table or calculator.
  • Simplified formula: when sample sizes and population variances are equal, the standard error formula becomes simpler and subscripts are unnecessary.
  • Common confusion: the question is not "what is the difference?" but "what is the probability of observing a difference of X or more?"

📐 Distribution characteristics

📊 Shape and parameters

The sampling distribution of the difference between means is normally distributed.

  • Mean of the distribution: equals the difference between the two population means (population mean 1 minus population mean 2).
  • Standard error: calculated from the population variances and sample sizes of both groups.
  • The excerpt shows that for Species 1 (mean 32) and Species 2 (mean 22), the sampling distribution has mean = 10 and standard error = 3.317.

🔢 Standard error formula

  • The general formula incorporates both population variances and both sample sizes.
  • The excerpt states: "the formula for the standard error of the difference between means is much simpler if the sample sizes and the population variances are equal."
  • When variances and sample sizes are the same, there is no need to use subscripts 1 and 2 to differentiate the terms.
  • Example: when both groups have variance 64 and sample size 8, the standard error simplifies to 4.

🧮 Probability calculations

🎯 Converting to standard deviations

  • To find the probability of a given difference, express that difference in terms of standard deviations from the mean.
  • The excerpt shows: "A difference between means of 0 or higher is a difference of 10/4 = 2.5 standard deviations above the mean of -10."
  • Once converted to standard deviations, use a Z table or normal calculator to find the area/probability.

📈 Species example

The excerpt presents a scenario with two species:

  • Species 1: population mean 32, sample size 10
  • Species 2: population mean 22, sample size 14
  • Question: probability that the sample mean from Species 1 exceeds the sample mean from Species 2 by 5 or more?
  • The sampling distribution has mean 10 and standard error 3.317.
  • The area above 5 in this distribution is 0.934, so the probability is 0.934.

👦👧 Height example

The excerpt also presents a height comparison:

  • Boys: mean height 175 cm, variance 64
  • Girls: mean height 165 cm, variance 64
  • Both samples have size 8
  • Question: probability that the mean height of the sample of girls would be higher than the mean height of the sample of boys?
  • The distribution of (girls - boys) has mean = 165 - 175 = -10 and standard error = 4.
  • A difference of 0 or higher is 2.5 standard deviations above the mean of -10.
  • The probability of a score 2.5 or more standard deviations above the mean is 0.0062.
  • Don't confuse: even though boys are taller in the population, it is not impossible (just unlikely) for a sample of girls to have a higher mean than a sample of boys.
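
A minimal check of both worked examples above, assuming SciPy is available. The species standard error (3.317) is taken as given from the excerpt, while the height example is computed from its stated parameters.

```python
# Checking the two worked examples with a normal-distribution calculation
# (norm.sf gives the upper-tail area).
from math import sqrt
from scipy.stats import norm

# Species example: the excerpt gives mean = 10 and standard error = 3.317.
print(norm.sf(5, loc=10, scale=3.317))           # ~0.934

# Height example: girls minus boys, both variances 64, both samples of 8.
mean_diff = 165 - 175                            # -10
se_diff = sqrt(64 / 8 + 64 / 8)                  # 4.0
print(norm.sf(0, loc=mean_diff, scale=se_diff))  # ~0.0062
```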

🔍 Interpretation guidance

🧠 Understanding the question

  • The excerpt emphasizes that "without doing any calculations, you probably know that the probability is pretty high since the difference in population means is 10."
  • The question is always framed as: what is the probability of observing a certain difference (or more extreme) given the population parameters?
  • Example: "In other words, what is the probability that the mean height of girls minus the mean height of boys is greater than 0?"

⚠️ Unlikely vs impossible

  • The height example shows that even when the population difference is large (boys 10 cm taller on average), it is "not inconceivable" that a sample could show the opposite pattern.
  • The probability is very small (0.0062) but not zero.
  • This illustrates sampling variability: individual samples can deviate from population parameters.
60

Sampling Distribution of Pearson's r

Sampling Distribution of Pearson's r

🧭 Overview

🧠 One-sentence thesis

The sampling distribution of Pearson's r is negatively skewed (not normal), so Fisher's z' transformation is needed to compute probabilities about sample correlations using normal distribution methods.

📌 Key points (3–5)

  • Shape problem: The sampling distribution of r is negatively skewed, not symmetric, because r cannot exceed 1.0, and the skew becomes more pronounced as the population correlation (ρ) increases.
  • Why normal methods fail: Even if you know the mean and standard error of r's sampling distribution, you cannot compute probabilities directly because the distribution is not normal.
  • Fisher's solution: Transform r to z' using Fisher's formula; z' is normally distributed with a known standard error, allowing probability calculations.
  • Common confusion: Don't confuse the population correlation ρ (the true value) with the sample correlation r (which varies from sample to sample).
  • Practical use: To find the probability of obtaining a sample r above a certain value, convert both ρ and the target r to z', then use normal distribution methods.

📐 The shape of the sampling distribution

📐 Why r's distribution is skewed

The sampling distribution of r: the distribution of sample correlation values obtained from repeated random samples of the same size from a population with correlation ρ.

  • If you repeatedly sample (e.g., 12 students) from a population where ρ = 0.60, each sample's r will differ slightly from 0.60.
  • The distribution of these r values is negatively skewed, not symmetric.
  • Reason for skew: r cannot be greater than 1.0, so the distribution cannot extend as far in the positive direction as it can in the negative direction.
  • The positive tail is constrained; the negative tail can stretch further.

📉 How skew increases with ρ

  • The greater the population correlation ρ, the more pronounced the skew.
  • Example from the excerpt:
    • When ρ = 0.60 (N = 12), the distribution shows moderate negative skew.
    • When ρ = 0.90 (N = 12), the distribution has a very short positive tail and a long negative tail.
  • Don't confuse: The skew is a property of the sampling distribution, not the population data itself.

🔄 Fisher's z' transformation

🔄 Why transformation is necessary

  • You might think you only need the mean and standard error of r's sampling distribution to compute probabilities.
  • Problem: Since the sampling distribution is not normal, knowing the mean and standard error is not enough.
  • Solution: Fisher developed a transformation that converts r to a variable z' that is normally distributed.

🧮 The z' formula and its properties

  • The transformation formula is: z' = 0.5 times the natural logarithm of [(1 + r) / (1 - r)].
  • The excerpt notes that the formula details are not important because you typically use a table or calculator.
  • Key properties of z':
    • z' is normally distributed.
    • z' has a known standard error: 1 divided by the square root of (N - 3), where N is the number of pairs of scores.

🔢 Standard error of z'

  • The standard error of z' depends only on sample size N.
  • Formula (in words): standard error = 1 / square root of (N - 3).
  • Example: For N = 12, the standard error is 1 / square root of 9 = 1 / 3 = 0.333.

🎯 Computing probabilities with z'

🎯 Step-by-step procedure

The excerpt provides a worked example: finding the probability of getting a sample correlation of 0.75 or above in a sample of 12 from a population with ρ = 0.60.

Steps:

  1. Convert both the population correlation (ρ = 0.60) and the target sample value (r = 0.75) to their z' values.
    • z' for 0.60 = 0.693
    • z' for 0.75 = 0.973
  2. Calculate the standard error of z' for the given N.
    • For N = 12: standard error = 0.333
  3. Reframe the question: given a normal distribution with mean 0.693 and standard deviation 0.333, what is the probability of obtaining a value of 0.973 or higher?
  4. Use normal distribution methods:
    • Either use a normal calculator directly (answer: 0.20).
    • Or compute z-score: z = (X - μ) / σ = (0.973 - 0.693) / 0.333 = 0.841, then use a table to find the area above 0.841 is 0.20.
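
A short sketch of the same calculation, assuming NumPy and SciPy are available (np.arctanh computes Fisher's z' transformation).

```python
# Reproducing the worked example: P(r >= 0.75) when rho = 0.60 and N = 12.
import numpy as np
from scipy.stats import norm

rho, r_target, n = 0.60, 0.75, 12

z_rho = np.arctanh(rho)            # Fisher z' of 0.60 -> ~0.693
z_r = np.arctanh(r_target)         # Fisher z' of 0.75 -> ~0.973
se = 1 / np.sqrt(n - 3)            # ~0.333

print((z_r - z_rho) / se)                  # z-score ~0.84
print(norm.sf(z_r, loc=z_rho, scale=se))   # probability ~0.20
```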

📊 Interpretation

  • The probability is 0.20, meaning there is a 20% chance of obtaining a sample r of 0.75 or higher when the true population correlation is 0.60.
  • Don't confuse: The z in the final step (z = 0.841) is a standard normal z-score, not the same as z' (the transformed correlation value).

🔑 Key distinctions

🔑 Population ρ vs. sample r

| Symbol | Meaning | Nature |
| --- | --- | --- |
| ρ (rho) | Population correlation | Fixed parameter; the true correlation in the entire population |
| r | Sample correlation | Random variable; varies from sample to sample |

  • Example: If ρ = 0.60, a sample of 12 students will not yield exactly r = 0.60; different samples produce different r values.

🔑 r vs. z'

| Variable | Distribution | Use |
| --- | --- | --- |
| r | Negatively skewed | Cannot directly compute probabilities using normal methods |
| z' | Normal | Allows probability calculations using standard normal distribution techniques |

  • The transformation from r to z' is the bridge that enables normal distribution methods to be applied to correlation problems.
61

Sampling Distribution of p

Sampling Distribution of p

🧭 Overview

🧠 One-sentence thesis

The sampling distribution of p (a sample proportion) differs from the binomial distribution in that it deals with means rather than totals, and it becomes approximately normal under certain conditions, allowing us to make inferences about population proportions.

📌 Key points (3–5)

  • What p represents: p is the sample proportion (mean of 0s and 1s), while the binomial distribution deals with the total count of successes
  • Mean and standard error: The mean of the sampling distribution equals the population proportion π, and the standard error decreases with larger sample sizes
  • Normal approximation conditions: The sampling distribution of p is approximately normal when both Nπ and N(1-π) are greater than 10
  • Common confusion: Don't confuse the population proportion π with the sample proportion p; π is the parameter we're estimating, p is our sample statistic
  • Practical use: Understanding this distribution allows us to construct confidence intervals and conduct hypothesis tests about proportions

📊 Core concepts

📊 What is p?

p: the sample proportion, calculated as the mean of binary (0/1) scores

  • p represents the proportion of "successes" in a sample
  • It's computed as the number of successes divided by the sample size
  • Example: If 7 out of 10 voters prefer a candidate, p = 0.70

📊 Relationship to binomial distribution

The key difference between p and the binomial:

  • Binomial: counts total successes (e.g., 7 successes out of 10 trials)
  • Sampling distribution of p: deals with the mean proportion (e.g., 0.70)
  • The binomial has mean μ = Nπ; dividing by N gives the mean of p as simply π

🎯 Mean and standard error

🎯 Mean of the sampling distribution

The mean of the sampling distribution of p equals the population proportion:

  • μ_p = π
  • This means p is an unbiased estimator of π

🎯 Standard error formula

The standard error of p is calculated as:

  • Standard error = square root of [π(1-π)/N]
  • Where π is the population proportion and N is the sample size
  • Larger samples produce smaller standard errors (more precise estimates)

Example: For π = 0.60 and N = 10:

  • Standard error = square root of [0.60(1-0.60)/10] = 0.155
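
A two-line check of the example, plus the rule-of-thumb counts (standard library only).

```python
# Standard error of a sample proportion, using the example's pi = 0.60, N = 10.
from math import sqrt

pi, n = 0.60, 10
se_p = sqrt(pi * (1 - pi) / n)
print(round(se_p, 3))            # 0.155

# Rule-of-thumb counts for the normal approximation:
print(n * pi, n * (1 - pi))      # 6.0 and 4.0 -- N(1 - pi) < 10, so use caution
```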

🔔 Normal approximation

🔔 When is the approximation valid?

Rule of thumb: The sampling distribution of p is approximately normal when both Nπ and N(1-π) are greater than 10

  • This ensures enough observations in both categories (successes and failures)
  • With smaller samples, the distribution may be noticeably discrete and skewed
  • The approximation improves as sample size increases

🔔 Why it matters

When the normal approximation holds:

  • We can use z-scores and normal distribution tables
  • Confidence intervals can be constructed using standard normal methods
  • Hypothesis tests become more straightforward

Don't confuse: Even when N(1-π) is less than 10, the approximation may still be "quite good" (as noted in the example where N(1-π) = 4), but caution is warranted.

⚠️ Discrete vs. continuous

⚠️ The discrete nature of p

  • With N = 10, p can only take values like 0.50 or 0.60, not 0.55
  • The sampling distribution is technically discrete, not continuous
  • The normal distribution is continuous, so it's an approximation
  • This is why we need sufficient sample size for the approximation to work well

🔍 Practical application

🔍 Using the sampling distribution

The sampling distribution of p allows us to:

  1. Estimate population proportions: Use p to estimate π
  2. Construct confidence intervals: Determine a range of plausible values for π
  3. Test hypotheses: Assess whether observed proportions differ from hypothesized values
  4. Calculate probabilities: Determine how likely various sample outcomes are

Example scenario: A pollster samples voters to estimate support for a candidate. Understanding the sampling distribution of p helps quantify the uncertainty in the estimate and determine whether observed differences between candidates are meaningful or just sampling variation.

62

Introduction to Estimation

Introduction to Estimation

🧭 Overview

🧠 One-sentence thesis

Degrees of freedom for an estimate equal the number of observations minus the number of parameters estimated along the way, and this concept explains why estimating the population mean from sample data reduces the independence of variance estimates.

📌 Key points (3–5)

  • What degrees of freedom measure: the number of independent pieces of information used in an estimate.
  • Independence requirement: estimates are independent only when based on randomly selected, unrelated observations—not when one observation influences another's calculation.
  • Key formula: degrees of freedom = number of values minus number of parameters estimated en route to the final estimate.
  • Common confusion: when you estimate the population mean from your sample first, the variance estimates lose independence because every observation contributed to calculating that mean.
  • Practical rule: for variance estimates, df = N - 1, where N is the number of observations, because you must estimate one parameter (the mean) before estimating variance.

🔢 Understanding degrees of freedom

🔢 What degrees of freedom represent

Degrees of freedom: the number of independent pieces of information on which an estimate is based.

  • It is not simply "how many data points you have"; it is "how many truly independent pieces of information remain."
  • Each independent observation contributes one degree of freedom.
  • Example: If you sample one Martian with height 8 and the population mean is known to be 6, you have one independent squared deviation (8-6)² = 4, so df = 1.

🎲 Independence of estimates

  • Two estimates are independent when they are based on separately and randomly selected observations.
  • The excerpt emphasizes: estimates would not be independent if, after sampling one Martian, you chose its brother as the second Martian.
  • Independence means one observation does not influence the calculation of another.

🧮 Estimating variance when the mean is known

🧮 Simple case: known population mean

  • If the population mean (μ) is known, each observation gives an independent estimate of variance.
  • Example scenario:
    • Population mean of Martian heights = 6
    • Sample one Martian: height = 8 → squared deviation = (8-6)² = 4 → estimate variance = 4 (df = 1)
    • Sample a second Martian: height = 5 → squared deviation = (5-6)² = 1 → estimate variance = 1 (df = 1)
    • Average the two estimates: (4 + 1)/2 = 2.5 (df = 2, because two independent pieces of information)

🔗 Why these estimates are independent

  • Each Martian was independently and randomly selected.
  • The height of the first Martian does not affect the height or calculation of the second.
  • Therefore, you have two degrees of freedom.

🔄 Estimating variance when the mean is unknown

🔄 The problem: estimating the mean first

  • In practice, the population mean is usually unknown, so you must estimate it using the sample mean (M).
  • This estimation step affects degrees of freedom.
  • Example scenario:
    • Sample two Martians: heights 8 and 5
    • Estimate the mean: M = (8 + 5)/2 = 6.5
    • Compute variance estimates:
      • Estimate 1 = (8 - 6.5)² = 2.25
      • Estimate 2 = (5 - 6.5)² = 2.25

🚫 Why these estimates are NOT independent

  • Each height contributed to calculating M.
  • The first Martian's height of 8 influenced M, which then influenced Estimate 2.
  • Example of dependence: If the first height had been 10 instead of 8, then M would be 7.5, and Estimate 2 would be (5 - 7.5)² = 6.25 instead of 2.25.
  • Another way to see it: if you know the mean and one score, you can compute the other score. (If one score is 5 and the mean is 6.5, the total must be 13, so the other score is 13 - 5 = 8.)
  • Don't confuse: having two observations does not automatically give you two degrees of freedom when you estimate parameters from those same observations.

📐 General rule for degrees of freedom

📐 The formula

Degrees of freedom = number of values minus number of parameters estimated en route to the estimate in question.

  • In the Martians example with unknown mean:
    • Number of values = 2 (heights 8 and 5)
    • Number of parameters estimated = 1 (the mean μ)
    • Degrees of freedom = 2 - 1 = 1

📊 Applying the rule to variance estimation

| Scenario | Number of observations (N) | Parameters estimated | Degrees of freedom |
| --- | --- | --- | --- |
| Known population mean | N | 0 | N |
| Unknown mean (estimated from sample) | N | 1 (the mean) | N - 1 |

  • If you sample 12 Martians and estimate the mean from the sample, your variance estimate has 12 - 1 = 11 degrees of freedom.
  • The denominator in the sample variance formula is the degrees of freedom (N - 1).

🔍 Why N - 1 for variance

  • You use N observations.
  • You estimate one parameter (the population mean) before estimating variance.
  • Therefore, df = N - 1.
  • This explains why the formula for estimating variance in a sample has N - 1 in the denominator.
63

Degrees of Freedom

Degrees of Freedom

🧭 Overview

🧠 One-sentence thesis

Degrees of freedom for an estimate equal the number of independent pieces of information used, which is the number of values minus the number of parameters estimated along the way.

📌 Key points (3–5)

  • What degrees of freedom measure: the number of independent pieces of information on which an estimate is based.
  • Independence matters: two estimates are independent only if they are based on separate, randomly selected observations; if one value influences the calculation of a parameter used in another estimate, they are not independent.
  • The general rule: degrees of freedom = number of values − number of parameters estimated en route to the estimate of interest.
  • Common confusion: when estimating variance from a sample, you must first estimate the mean from the same data, which reduces degrees of freedom by 1 (so df = N − 1, not N).
  • Why it matters: degrees of freedom appear in the denominator of the sample variance formula and affect the reliability of estimates.

🧮 What degrees of freedom measure

🧮 Independent pieces of information

Degrees of freedom: the number of independent pieces of information on which an estimate is based.

  • It is not simply the number of observations; it is the number of independent contributions to the estimate.
  • Each independent observation adds one degree of freedom.
  • Example: If you sample one Martian with height 8 and the population mean is known to be 6, the squared deviation (8 − 6)² = 4 is one independent piece of information, so df = 1.

🔗 When estimates are independent

  • Two estimates are independent if they come from separate, randomly selected observations.
  • The excerpt gives a positive example: sampling two Martians independently and computing two separate squared deviations from the known population mean yields two independent estimates, so df = 2.
  • Don't confuse: Independence is lost if the second observation is chosen based on the first (e.g., choosing the first Martian's brother as the second sample).

🔄 How estimating parameters reduces degrees of freedom

🔄 Estimating the mean first

  • In practice, the population mean is usually unknown, so you must estimate it from the sample before estimating variance.
  • This estimation step creates dependence among the squared deviations.
  • Example: Two Martians have heights 8 and 5. The sample mean M = (8 + 5)/2 = 6.5. Both squared deviations (8 − 6.5)² and (5 − 6.5)² depend on M, which was computed from both heights.

🔗 Why the estimates are no longer independent

  • Each height contributed to calculating M, so each squared deviation is influenced by all the data.
  • The excerpt explains: if the first height had been 10 instead of 8, M would have been 7.5, and the second squared deviation (5 − 7.5)² would have been 6.25 instead of 2.25.
  • Another way to see the dependence: if you know the mean and one score, you can deduce the other score. Example: if M = 6.5 and one score is 5, the total must be 13, so the other score is 13 − 5 = 8.

📉 The general rule

Degrees of freedom for an estimate = number of values − number of parameters estimated en route to the estimate in question.

  • In the Martians example: 2 values (8 and 5) − 1 parameter estimated (the mean μ) = 1 degree of freedom.
  • If you sampled 12 Martians, df = 12 − 1 = 11.
  • For estimating variance from a sample: df = N − 1, where N is the number of observations.

📐 Degrees of freedom in the variance formula

📐 The denominator

  • The excerpt recalls the formula for estimating variance in a sample (the formula itself is not fully shown in the excerpt, but the text states):
    • The denominator of the sample variance formula is the degrees of freedom.
  • This means the sample variance divides the sum of squared deviations by (N − 1), not N.
  • Why: because one degree of freedom is "used up" by estimating the mean from the same data.

🧩 Summary table

| Scenario | Number of values | Parameters estimated en route | Degrees of freedom |
| --- | --- | --- | --- |
| Known population mean, 1 observation | 1 | 0 | 1 |
| Known population mean, 2 observations | 2 | 0 | 2 |
| Unknown mean, 2 observations | 2 | 1 (the mean) | 1 |
| Unknown mean, N observations | N | 1 (the mean) | N − 1 |
64

Characteristics of Estimators

Characteristics of Estimators

🧭 Overview

🧠 One-sentence thesis

Estimators are judged by whether they systematically over- or underestimate the true parameter (bias) and by how much their values vary from sample to sample (sampling variability), with unbiased and low-variability estimators being more desirable.

📌 Key points (3–5)

  • Bias: whether an estimator systematically over- or underestimates the parameter; the sample mean is unbiased, but sample variance with N in the denominator is biased downward.
  • Sampling variability: how much an estimator varies from sample to sample, measured by its standard error; smaller standard error means less variability.
  • Common confusion: bias vs. variability are independent—an estimator can be biased but have low variability (like Scale 1), or unbiased but highly variable (like Scale 2).
  • Efficiency: statistics with smaller standard errors are more efficient; for example, the mean is more efficient than the median for normal distributions.
  • Why N-1 matters: using N-1 instead of N in the variance formula corrects the bias and gives an unbiased estimate of population variance.

🎯 Understanding bias

📏 What bias means

A statistic is biased if the long-term average value of the statistic is not the parameter it is estimating.

  • More formally: a statistic is biased if the mean of its sampling distribution does not equal the parameter.
  • The mean of the sampling distribution is called the expected value of the statistic.
  • Bias is about systematic tendency, not individual errors—any single estimate may be too high or too low, but bias means there's a consistent direction.

⚖️ The bathroom scale analogy

The excerpt uses two scales to illustrate bias vs. variability:

| Scale | Bias | Variability | Behavior |
| --- | --- | --- | --- |
| Scale 1 | Biased (+1 lb) | Low (±0.02 lb) | Always overstates weight by ~1 pound, but very consistent |
| Scale 2 | Unbiased | High | Wildly variable readings, but the average of many measurements equals true weight |

  • Don't confuse: Scale 1 is "fairly accurate" in the sense that it's never far off (max 1.02 lb error), even though it's biased. Scale 2 is unbiased but often very far from the true value.
  • Example: If your true weight is 150 lb, Scale 1 always reads 150.98–151.02 lb; Scale 2 might read 140 one time and 165 the next, but averages to 150 over many weighings.

✅ Unbiased estimators

Sample mean:

  • The mean of the sampling distribution of the sample mean equals the population mean μ.
  • Therefore, the sample mean is an unbiased estimate of μ.
  • There is no systematic tendency to over- or underestimate.

Sample variance:

  • Population variance formula uses N in the denominator.
  • Sample variance formula uses N-1 in the denominator.
  • If you use N instead of N-1 when estimating from a sample, the estimates tend to be too low (biased downward).
  • Using N-1 (the degrees of freedom) gives an unbiased estimate of the population variance.
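
A small simulation sketch of this bias. The normal population with variance 100 and the sample size of 5 are arbitrary choices for illustration; the point is only the comparison of the two denominators.

```python
# Simulation: dividing by N underestimates the population variance on average,
# while dividing by N - 1 is (approximately) unbiased.
import numpy as np

rng = np.random.default_rng(0)
pop_var, n, reps = 100.0, 5, 200_000

samples = rng.normal(loc=50, scale=np.sqrt(pop_var), size=(reps, n))
biased = samples.var(axis=1, ddof=0)       # denominator N
unbiased = samples.var(axis=1, ddof=1)     # denominator N - 1

print(biased.mean())      # ~80: systematically low, by a factor of (N - 1)/N
print(unbiased.mean())    # ~100: matches the population variance
```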

📊 Understanding sampling variability

📉 What sampling variability measures

The sampling variability of a statistic refers to how much the statistic varies from sample to sample.

  • It is usually measured by the standard error of the statistic.
  • Smaller standard error = less sampling variability = more precise estimates.
  • Example: the standard error of the mean measures how much sample means vary across different samples.

🔢 How sample size affects variability

The excerpt gives the formula for the variance of the sampling distribution of the mean (in words):

  • Variance of the sampling distribution of the mean = (population variance) divided by (sample size N).
  • Larger sample size N → smaller variance → smaller standard error → lower sampling variability.
  • This means bigger samples give more stable, less variable estimates.

⚡ Efficiency: comparing estimators

The smaller the standard error of a statistic, the more efficient the statistic.

  • Relative efficiency is typically defined as the ratio of the standard errors of two statistics.
  • Sometimes it's defined as the ratio of their squared standard errors.
  • Example from the excerpt: for normal distributions, the standard error of the median is larger than the standard error of the mean, so the mean is more efficient.
  • Don't confuse: efficiency is about variability, not bias—both estimators might be unbiased, but one varies less from sample to sample.

🔗 Bias and variability together

🧩 Independent dimensions

The excerpt emphasizes that bias and variability are separate characteristics:

  • An estimator can be biased but have low variability (Scale 1: consistently wrong by the same amount).
  • An estimator can be unbiased but have high variability (Scale 2: correct on average but wildly inconsistent).
  • Ideally, you want an estimator that is both unbiased and has low sampling variability.

🎯 Practical implications

  • Scale 1's measurements are "never more than 1.02 pounds from your actual weight," even though biased—this shows low variability can still be useful.
  • Scale 2 gives the right answer "on average" (unbiased), but individual measurements can be "vastly" off—this shows high variability is a problem even without bias.
  • The choice between estimators depends on both characteristics: sometimes a slightly biased but much less variable estimator may be preferred in practice, though the excerpt focuses on the theoretical ideal of unbiased estimators.
65

Confidence Intervals

Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

The t distribution is used instead of the normal distribution when estimating confidence intervals from small samples, because it accounts for the extra uncertainty introduced by estimating the population standard deviation from the sample.

📌 Key points (3–5)

  • Why t differs from normal: the t distribution has more area in the tails (is leptokurtic), reflecting greater uncertainty when the population standard deviation is unknown.
  • How degrees of freedom matter: with fewer degrees of freedom, the t distribution requires wider intervals (larger multipliers) to achieve the same confidence level; as df increases, t approaches the normal distribution.
  • Common confusion: 1.96 standard deviations captures 95% of a normal distribution, but less than 95% of a t distribution with low df—you must use a larger multiplier from the t table.
  • When to use which: use the normal distribution when the population standard deviation is known; use the t distribution when it is estimated from the sample.

📊 Comparing t and normal distributions

📊 Shape differences

The t distribution is leptokurtic: it has relatively more scores in the tails and fewer in the center compared to the normal distribution.

  • The normal distribution has more probability concentrated near the mean.
  • The t distribution spreads more probability into the tails, reflecting the added uncertainty from estimating the standard deviation.
  • Example: Figure 1 in the excerpt shows that the 2 df t distribution has the lowest peak (flattest center), then 4 df, then 10 df, and the standard normal has the highest peak.

🔄 Convergence with degrees of freedom

  • As degrees of freedom increase, the t distribution approaches the shape of the normal distribution.
  • With very large samples, the difference between t and normal becomes negligible.
  • Don't confuse: the t distribution is not "a different kind of data"—it is the appropriate model when you estimate the standard error from the sample rather than knowing the population value.

🧮 Critical values and confidence intervals

🧮 Why 1.96 is not enough for t

  • For a normal distribution, 1.96 standard deviations from the mean captures 95% of the area.
  • For a t distribution with low df, the percentage within 1.96 standard deviations is less than 95% because of the heavier tails.
  • You must use a larger multiplier from the t table to achieve the same confidence level.

📋 Reading the t table

The excerpt provides an abbreviated table showing the number of standard deviations required to contain 95% and 99% of the t distribution for various degrees of freedom:

| df | 95% interval | 99% interval |
| --- | --- | --- |
| 2 | 4.303 | 9.925 |
| 3 | 3.182 | 5.841 |
| 4 | 2.776 | 4.604 |
| 5 | 2.571 | 4.032 |
| 8 | 2.306 | 3.355 |
| 10 | 2.228 | 3.169 |
| 20 | 2.086 | 2.845 |
| 50 | 2.009 | 2.678 |
| 100 | 1.984 | 2.626 |
| Normal | 1.96 | 2.58 |

  • Notice that with 8 df, you need 2.306 (not 1.96) to capture 95%.
  • As df increases toward 100, the t value approaches the normal value of 1.96.

🧪 Worked example from the excerpt

🧪 The problem setup

  • Sample size: 9 values from a normal population.
  • Degrees of freedom: N - 1 = 8.
  • Question: What is the probability that the sample mean M would be within 1.96 times the estimated standard error (s_M) of the population mean μ?

🧪 The solution

  • From the t table with 8 df, 2.306 standard errors are required for 95% probability.
  • Since 1.96 is smaller than 2.306, the probability that M falls within 1.96 s_M of μ is less than 0.95.
  • Using a t distribution calculator: 0.086 of the area is more than 1.96 standard deviations from the mean.
  • Therefore, the probability is 1 - 0.086 = 0.914 (91.4%).
  • This is lower than the 0.95 that would apply if the population standard deviation were known (normal distribution case).
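
A quick check of these numbers, assuming SciPy's t distribution is available.

```python
# Area beyond +/-1.96 estimated standard errors for a t distribution with 8 df.
from scipy.stats import t

tail_area = 2 * t.sf(1.96, df=8)       # both tails combined
print(round(tail_area, 3))             # ~0.086
print(round(1 - tail_area, 3))         # ~0.914 (vs. 0.95 if sigma were known)
print(round(t.ppf(0.975, df=8), 3))    # 2.306 -- the multiplier 95% actually requires
```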

🧪 Why the probability is lower

  • When you estimate the standard error from the sample (using s_M instead of the known σ_M), you introduce additional uncertainty.
  • The t distribution accounts for this by spreading more probability into the tails.
  • Example: if you use the normal-distribution multiplier (1.96) with a t distribution, you underestimate the width needed for 95% confidence.

🔀 Choosing between t and normal

🔀 The decision rule

| Condition | Distribution to use | Reason |
| --- | --- | --- |
| Population standard deviation (σ) is known | Normal | No estimation uncertainty |
| Population standard deviation is estimated from sample (s) | t | Must account for estimation uncertainty |

🔀 Practical implication

  • In most real-world problems, you do not know the population standard deviation.
  • You estimate it from your sample, so you use the t distribution.
  • The excerpt's worked example demonstrates that using the normal distribution when you should use t leads to overconfidence (claiming 95% when the true probability is only 91.4%).
66

Introduction to Confidence Intervals

Introduction to Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

Confidence intervals provide a range of plausible values for a population parameter based on sample data, with the width and calculation method depending on whether the population standard deviation is known and on the sample size.

📌 Key points (3–5)

  • What a confidence interval estimates: A range of values likely to contain the true population parameter (e.g., mean, proportion, correlation).
  • Why we use the t distribution instead of normal: When the population standard deviation is unknown and must be estimated from the sample, the t distribution accounts for additional uncertainty, especially with small samples.
  • How sample size affects precision: Larger samples produce narrower confidence intervals; smaller samples require wider intervals (larger t values) to maintain the same confidence level.
  • Common confusion—known vs. estimated standard deviation: If σ is known, use the normal distribution and Z values; if σ must be estimated (the usual case), use the t distribution and t values.
  • Interpreting the interval: A 95% confidence interval means that if we repeated the sampling process many times, about 95% of the intervals constructed would contain the true population parameter.

📊 The t distribution and why it matters

📊 What the t distribution is

The t distribution is a probability distribution used when estimating population parameters from small samples with unknown population standard deviation.

  • The t distribution is similar to the normal distribution but has heavier tails (is leptokurtic).
  • This means more probability is in the extreme tails, so you must go farther from the mean to capture the same percentage of the distribution.
  • Example: For a normal distribution, 95% of values fall within 1.96 standard deviations of the mean. For a t distribution with 8 degrees of freedom, you need 2.306 standard deviations to capture 95%.

🔄 How t approaches normal as sample size increases

  • As degrees of freedom increase (larger sample sizes), the t distribution becomes more like the standard normal distribution.
  • With very large samples (e.g., 100+), the t and normal distributions are nearly identical.
  • The excerpt shows that for df = 2, the t value for 95% confidence is 4.303, but for df = 100, it drops to 1.984 (very close to the normal value of 1.96).

📐 Degrees of freedom

  • For a single sample mean: df = N - 1, where N is the sample size.
  • Degrees of freedom represent the number of independent pieces of information available to estimate variability.
  • Don't confuse: More degrees of freedom = closer to normal distribution = smaller t values needed.

🧮 Constructing confidence intervals for the mean

🧮 When σ is known (pedagogical case)

The formula is:

  • Lower limit = M - Z₀.₉₅ × σₘ
  • Upper limit = M + Z₀.₉₅ × σₘ

Where:

  • M is the sample mean
  • Z₀.₉₅ is the Z value for the desired confidence level (1.96 for 95%)
  • σₘ is the standard error of the mean = σ / √N

Example from the excerpt: Population mean = 90, σ = 36, sample size = 9.

  • Standard error = 36/3 = 12
  • 95% CI: 90 ± (1.96)(12) = 66.48 to 113.52

Why this is unrealistic: In practice, you rarely know the population standard deviation σ before you know the population mean.

🧮 When σ is estimated (realistic case)

The formula is:

  • Lower limit = M - (t_CL)(sₘ)
  • Upper limit = M + (t_CL)(sₘ)

Where:

  • M is the sample mean
  • t_CL is the t value for the desired confidence level and df = N - 1
  • sₘ is the estimated standard error = s / √N, where s is the sample standard deviation

Example from the excerpt: Sample of 5 values (2, 3, 5, 6, 9), mean = 5, s² = 7.5.

  • sₘ = √(7.5/5) = 1.225
  • For df = 4, t₀.₉₅ = 2.776
  • 95% CI: 5 ± (2.776)(1.225) = 1.60 to 8.40

Key difference: Using t instead of Z, and estimating the standard error from the sample rather than knowing it from the population.
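
A minimal sketch of the same calculation with NumPy and SciPy, reproducing the 1.60 to 8.40 interval.

```python
# 95% confidence interval for the small sample (2, 3, 5, 6, 9), using the
# t distribution because the standard deviation is estimated from the sample.
import numpy as np
from scipy.stats import t

x = np.array([2, 3, 5, 6, 9])
m = x.mean()                               # 5.0
s_m = x.std(ddof=1) / np.sqrt(len(x))      # estimated standard error, ~1.225
t_crit = t.ppf(0.975, df=len(x) - 1)       # 2.776 for 4 df

print(m - t_crit * s_m, m + t_crit * s_m)  # ~1.60 to ~8.40
```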

🔬 Real example: Stroop interference effect

🔬 The study setup

  • 47 subjects named ink colors in two conditions: interference (word conflicts with color) vs. neutral (colored rectangles).
  • The question: What is the population mean difference in response time?

🔬 Calculating the interval

Given data:

  • Mean difference = 16.362 seconds
  • Standard deviation of differences = 7.470 seconds
  • Standard error = 1.090
  • df = 47 - 1 = 46
  • t₀.₉₅ for 46 df = 2.013

Calculation:

  • Lower limit = 16.362 - (2.013)(1.090) = 14.17
  • Upper limit = 16.362 + (2.013)(1.090) = 18.56

Interpretation: We can be 95% confident that the true population interference effect is between 14.17 and 18.56 seconds.

📏 Confidence intervals for other parameters

📏 Difference between two means

When comparing two independent groups (e.g., males vs. females):

Assumptions:

  1. Homogeneity of variance (both populations have the same variance)
  2. Normal distributions
  3. Independent sampling

Key steps:

  • Pool the variances to estimate the common population variance (MSE)
  • Calculate the standard error of the difference
  • Use df = (n₁ - 1) + (n₂ - 1)
  • Apply the same confidence interval formula structure

Example from excerpt: Female mean = 5.353, male mean = 3.882, difference = 1.471.

  • 95% CI: 0.29 to 2.65
  • Interpretation: The population mean for females is likely 0.29 to 2.65 points higher than for males.

📏 Correlation coefficient

Special challenge: The sampling distribution of r (Pearson correlation) is not normal.

Solution—Fisher's z' transformation:

  1. Convert r to z' (a transformed value with approximately normal distribution)
  2. Construct confidence interval for z'
  3. Convert the interval endpoints back to r

Example from excerpt: r = -0.654 (N = 34)

  • z' = -0.78
  • Standard error of z' = 1/√(N-3) = 1/√31 = 0.180
  • 95% CI for z': -0.78 ± (1.96)(0.18) = -1.13 to -0.43
  • Converting back: r from -0.81 to -0.40

Why transformation is needed: Without it, the confidence interval would be inaccurate, especially for correlations far from zero.
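
A short sketch of the same steps with NumPy (arctanh and tanh implement the z' transformation and its inverse), using the r = -0.654, N = 34 example above.

```python
# Confidence interval for a correlation via Fisher's z' transformation.
import numpy as np

r, n = -0.654, 34
z = np.arctanh(r)                 # ~-0.78 on the z' scale
se = 1 / np.sqrt(n - 3)           # ~0.180

lo_z, hi_z = z - 1.96 * se, z + 1.96 * se      # interval on the z' scale
print(np.tanh(lo_z), np.tanh(hi_z))            # ~-0.81 to ~-0.40 on the r scale
```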

📏 Proportions

For a sample proportion p (e.g., 260 out of 500 voters favor a candidate):

Formula:

  • Standard error: sₚ = √[p(1-p)/N]
  • Apply continuity correction: subtract 0.5/N from lower limit, add 0.5/N to upper limit
  • Use Z values (normal distribution) for large samples

Example from excerpt: p = 0.52, N = 500

  • sₚ = 0.0223
  • 95% CI: 0.475 to 0.565 (47.5% to 56.5%)

Important note: The margin of error for the difference between two proportions is twice the margin of error for a single proportion.
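
A minimal sketch of this calculation (standard library only), reproducing the 47.5% to 56.5% interval from the poll example.

```python
# 95% confidence interval for a proportion with the continuity correction,
# using the poll example: 260 of 500 voters, p = 0.52.
from math import sqrt

p, n = 0.52, 500
s_p = sqrt(p * (1 - p) / n)              # ~0.0223
margin = 1.96 * s_p
correction = 0.5 / n

print(round(p - margin - correction, 3),
      round(p + margin + correction, 3))  # ~0.475 to ~0.565
```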

⚠️ Common pitfalls and clarifications

⚠️ What confidence level means

  • A 95% confidence interval does NOT mean "there is a 95% probability the true parameter is in this interval."
  • Correct interpretation: If we repeated the sampling process many times, about 95% of the intervals we construct would contain the true parameter.
  • The parameter is fixed (not random); the interval is random (varies with each sample).

⚠️ Wider vs. narrower intervals

| Factor | Effect on width | Reason |
| --- | --- | --- |
| Higher confidence level (99% vs. 95%) | Wider | Need a larger multiplier to be more confident |
| Larger sample size | Narrower | Standard error decreases with √N |
| More variability in data | Wider | Larger standard deviation increases the standard error |
| Using t vs. Z (small samples) | Wider | t distribution has heavier tails |

⚠️ Don't confuse statistical and practical significance

  • A confidence interval that excludes zero indicates statistical significance.
  • But a very narrow interval (e.g., 0.01 to 0.03) might be statistically significant yet practically unimportant.
  • Always consider the magnitude of the effect, not just whether zero is excluded.
67

t Distribution

t Distribution

🧭 Overview

🧠 One-sentence thesis

The t distribution enables hypothesis testing of means when the population standard deviation is unknown, extending the logic of normal distribution testing to real-world scenarios where we must estimate variability from sample data.

📌 Key points (3–5)

  • Core purpose: test hypotheses about means when σ (population standard deviation) is not known and must be estimated from the sample.
  • Key distinction: differs from normal distribution testing, which requires knowing σ; the t distribution accounts for the additional uncertainty from estimating σ.
  • Common confusion: one-tailed vs two-tailed tests—one-tailed tests look for effects in a specific direction, while two-tailed tests check for any difference.
  • Type I and Type II errors: choosing alpha level (.01 vs .05) affects both the risk of false positives (Type I) and false negatives (Type II).
  • Practical application: used for comparing single means to values, comparing two group means, and multiple pairwise comparisons.

🧪 Testing scenarios

🧪 Single mean testing

The excerpt introduces testing a single sample mean against a specified value.

  • When σ is known: compute probability using normal distributions.
  • When σ is estimated: use the t distribution instead.
  • The t distribution is necessary because estimating σ from the sample introduces additional uncertainty.

🔀 Comparing two means

The excerpt mentions testing the difference between two means for independent groups.

  • Example null hypothesis format: the difference between two population means equals zero.
  • The excerpt notes that "Ho: M1 = M2" is not proper notation (question 14 asks why).

📊 Multiple comparisons

The chapter covers:

  • All pairwise comparisons among means
  • Specific planned comparisons
  • Comparisons with correlated observations

The excerpt emphasizes these are not necessarily "post-hoc" tests after ANOVA, contrary to some textbook treatments.

🎯 One-tailed vs two-tailed tests

🎯 What they test

  • One-tailed: looks for an effect in one specific direction only.
  • Two-tailed: checks whether there is any difference, regardless of direction.

🎯 When to use one-tailed

The excerpt raises the question: "Is expecting an effect in a certain direction sufficient basis for using a one-tailed test?" (question 15).

  • One-tailed tests have an advantage (question 10 asks what it is).
  • The directional expectation alone may not justify one-tailed testing.

🎯 Probability conversion

From question 17:

  • If a two-tailed probability is .03, the one-tailed probability in the specified direction would be .015 (half the two-tailed value).
  • If the effect were in the opposite direction, the one-tailed probability would be .985 (1 - .015).

🎯 Error rate differences

Question 16 notes that Type I and Type II error rates differ between one-tailed and two-tailed tests.

⚠️ Error types and significance

⚠️ Type I error

Type I error: rejecting the null hypothesis when it is actually true (false positive).

  • The alpha level directly sets the Type I error rate.
  • From question 18: if you choose alpha = .01 and the null hypothesis is true, the probability of Type I error is .01.
  • If the null hypothesis is false, you cannot make a Type I error (question 18b).
  • Question 24: "A researcher risks making a Type I error any time the null hypothesis is rejected" (true/false).

⚠️ Type II error

Type II error: failing to reject the null hypothesis when it is actually false (false negative).

  • Denoted by beta (β).
  • Trade-off with Type I error: question 13 asks whether beta is higher for alpha = .05 or .01.

⚠️ Alpha level choice

| Alpha level | Interpretation | Trade-off |
| --- | --- | --- |
| .01 | More conservative | Lower Type I error risk, but higher beta (Type II error risk) |
| .05 | Less conservative | Higher Type I error risk, but lower beta |

  • Question 13 asks which is more conservative.
  • Question 20 (true/false): "It is easier to reject the null hypothesis if the researcher uses a smaller alpha level."

⚠️ Sample size effects

Question 21 (true/false): "You are more likely to make a Type I error when using a small sample than when using a large sample."

📐 Probability value vs significance level

📐 Key distinction

Question 11 asks to distinguish between these two concepts:

  • Significance level (alpha): chosen before the analysis; the threshold for decision-making.
  • Probability value (p-value): calculated from the data; compared against alpha.

📐 Interpretation example

From question 12:

  • An experimental group (test-taking class) had mean SAT score 503.
  • Control group had mean 499.
  • Difference was significant at p = .037.
  • The question asks what to conclude about class effectiveness.

This illustrates that statistical significance (p < .05) does not automatically mean practical importance—a 4-point difference may be statistically significant with 100 subjects per group but may not represent meaningful improvement.

🚫 Common hypothesis testing confusions

🚫 What you don't test

Question 19: "Why doesn't it make sense to test the hypothesis that the sample mean is 42?"

  • Hypothesis tests are about population parameters, not sample statistics.
  • The sample mean is a known, observed value—there is no uncertainty to test.

🚫 Accepting vs failing to reject

From questions 22–23:

  • Question 22 (true/false): "You accept the alternative hypothesis when you reject the null hypothesis."
  • Question 23 (true/false): "You do not accept the null hypothesis when you fail to reject it."

These questions highlight the asymmetry in hypothesis testing:

  • Rejecting the null provides evidence for the alternative.
  • Failing to reject does not prove the null is true—it only means insufficient evidence against it.
68

Confidence Interval for the Mean

Confidence Interval for the Mean

🧭 Overview

🧠 One-sentence thesis

Confidence intervals estimate a population mean from sample data by constructing a range that will contain the true mean a specified percentage of the time, using either the normal distribution (when population standard deviation is known) or the t distribution (when it must be estimated from the sample).

📌 Key points (3–5)

  • Core purpose: A confidence interval estimates the population mean from a sample mean, creating a range likely to contain the true population value.
  • Two scenarios: Use the normal distribution when population standard deviation (σ) is known; use the t distribution when standard deviation must be estimated from sample data.
  • Common confusion: The t distribution vs normal distribution—t has more area in the tails (is leptokurtic) for small samples, requiring you to extend farther from the mean to capture the same confidence level.
  • Sample size matters: With large samples (100+), the t distribution closely resembles the normal distribution; with small samples, the t distribution requires larger multipliers (e.g., 2.78 instead of 1.96 for 95% confidence with N=5).
  • Degrees of freedom: For confidence intervals on the mean, degrees of freedom equals N - 1, where N is the sample size.

📐 Building blocks: sampling distribution and standard error

📊 The sampling distribution of the mean

  • The excerpt works backwards from a known population to illustrate the concept.
  • Example scenario: weights of 10-year-old children are normally distributed with mean 90 and standard deviation 36.
  • For a sample size of 9, the sampling distribution of the mean has:
    • Mean = population mean (90)
    • Standard deviation (standard error) = population standard deviation divided by square root of N = 36/3 = 12

📏 What the 95% range means

  • 95% of sample means fall within 1.96 standard deviations of the population mean.
  • In the example: 90 ± (1.96)(12) gives the range 66.48 to 113.52.
  • Key insight: If 95% of sample means are within 23.52 units of the true mean, then an interval of M ± 23.52 will contain the population mean 95% of the time.

🔢 Computing confidence intervals when σ is known

🧮 The formula

Lower limit = M - Z.95 × σ_M
Upper limit = M + Z.95 × σ_M

Where:

  • M is the sample mean
  • Z.95 is the number of standard deviations needed to contain 95% of the area (1.96 for 95% confidence)
  • σ_M is the standard error of the mean

📝 Worked example

Given five numbers sampled from a normal distribution with known standard deviation 2.5: {2, 3, 5, 6, 9}

Steps:

  1. Compute sample mean: M = (2+3+5+6+9)/5 = 5
  2. Compute standard error: 2.5 / square root of 5 = 1.118
  3. Find Z value: 1.96 for 95% confidence (or 2.58 for 99%)
  4. Calculate limits:
    • Lower: 5 - (1.96)(1.118) = 2.81
    • Upper: 5 + (1.96)(1.118) = 7.19
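
A quick standard-library check of this worked example:

```python
import math
from statistics import mean

data = [2, 3, 5, 6, 9]
sigma = 2.5                        # population standard deviation, known by assumption
m = mean(data)                     # sample mean = 5
se = sigma / math.sqrt(len(data))  # standard error = 1.118
z = 1.96                           # multiplier for 95% confidence
print(m - z * se, m + z * se)      # approximately 2.81 and 7.19
```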

⚠️ The unrealistic assumption

  • The excerpt acknowledges this scenario is pedagogical: in practice, you rarely know the population standard deviation if you don't know the population mean.
  • This simpler case helps explain the logic before moving to the realistic scenario.

📊 Computing confidence intervals when σ is estimated (t distribution)

🔄 When to use t instead of normal

Use the t distribution rather than the normal distribution when the variance is not known and has to be estimated from sample data.

Key differences:

  • The t distribution is leptokurtic (has relatively more scores in its tails) compared to the normal distribution.
  • You must extend farther from the mean to contain the same proportion of area.
  • Example: 95% of a normal distribution is within 1.96 standard deviations, but with t distribution and sample size 5, you need 2.78 standard deviations.

📉 Sample size effect

| Sample size | Distribution behavior |
| --- | --- |
| Large (100+) | t distribution very similar to normal |
| Small (e.g., 5) | t distribution has much heavier tails |

🧮 The modified formula

Lower limit = M - (t_CL)(s_M)
Upper limit = M + (t_CL)(s_M)

Where:

  • M is the sample mean
  • t_CL is the t value for the desired confidence level
  • s_M is the estimated standard error of the mean (computed from sample data)

📝 Worked example with estimation

Given five numbers: {2, 3, 5, 6, 9} with unknown population standard deviation

Steps:

  1. Compute sample mean: M = 5
  2. Compute sample variance: s² = 7.5
  3. Estimate standard error: s_M = square root of (7.5/5) = 1.225
  4. Find t value: For df = N - 1 = 4 and 95% confidence, t = 2.776 (from table)
  5. Calculate limits:
    • Lower: 5 - (2.776)(1.225) = 1.60
    • Upper: 5 + (2.776)(1.225) = 8.40
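
A sketch of the same calculation with σ estimated from the data, assuming SciPy is available for the t quantile:

```python
import math
from statistics import mean, variance
from scipy import stats

data = [2, 3, 5, 6, 9]
m = mean(data)                                  # sample mean = 5
s2 = variance(data)                             # sample variance (n - 1 denominator) = 7.5
s_m = math.sqrt(s2 / len(data))                 # estimated standard error = 1.225
t_crit = stats.t.ppf(0.975, df=len(data) - 1)   # 2.776 for df = 4 and 95% confidence
print(m - t_crit * s_m, m + t_crit * s_m)       # approximately 1.60 and 8.40
```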

🔍 Degrees of freedom

  • For confidence intervals on the mean: df = N - 1
  • Used to look up the appropriate t value in a t table or calculator.
  • Don't confuse: degrees of freedom is not the same as sample size; it's one less.

🧪 Real data example: Stroop interference

🎨 The study context

  • 47 subjects named ink colors of words where the word and ink color conflicted.
  • Example: the word "blue" written in red ink—correct response is "red."
  • Compared to a control condition of naming colored rectangles.
  • Goal: estimate the population interference effect (time difference).

📊 The calculation

Given data:

  • Sample size: 47 subjects
  • Mean time difference: 16.362 seconds
  • Standard deviation: 7.470 seconds
  • Standard error: 1.090
  • Degrees of freedom: 47 - 1 = 46
  • t value for 95% confidence: 2.013

Result:

  • Lower limit: 16.362 - (2.013)(1.090) = 14.17
  • Upper limit: 16.362 + (2.013)(1.090) = 18.56

💡 Interpretation

The interference effect for the whole population is likely to be between 14.17 and 18.56 seconds.

  • This range estimates the true population parameter from sample data.
  • The confidence interval quantifies the uncertainty in the estimate.
69

Difference between Means

Difference between Means

🧭 Overview

🧠 One-sentence thesis

A confidence interval on the difference between two sample means estimates the likely range of the true population difference, assuming equal variances, normal distributions, and independent sampling.

📌 Key points (3–5)

  • What it estimates: the difference between population means using the difference between sample means as the starting point.
  • Three key assumptions: equal variance in both populations (homogeneity of variance), normal distributions, and independent sampling.
  • How to compute: use the sample mean difference, the estimated standard error of the difference, and the appropriate t-value for the desired confidence level.
  • Common confusion: the sample difference (e.g., 1.47) is not the main interest—what matters is the confidence interval for the population difference.
  • Degrees of freedom: calculated as (n₁ - 1) + (n₂ - 1), which determines which t-value to use.

📐 Core concepts

📐 What the confidence interval estimates

A confidence interval on the difference between means: a range of values likely to contain the true difference between two population means.

  • Researchers care more about the population difference than specific sample values.
  • The sample difference is used as an estimate, but the confidence interval reveals the accuracy of that estimate.
  • Example: In the Animal Research study, females rated animal research as more wrong (mean 5.35) than males (mean 3.88), giving a sample difference of 1.47. The confidence interval tells us the likely range for the true population difference.

🔍 The three assumptions

The excerpt lists three assumptions required to construct the confidence interval:

  1. Homogeneity of variance: both populations have the same variance.
  2. Normal distributions: both populations are normally distributed.
  3. Independent sampling: each value is sampled independently from every other value.

Important note: Small-to-moderate violations of assumptions 1 and 2 do not make much difference, according to the excerpt.

🧮 Calculation steps

🧮 Step 1: Estimate the pooled variance (MSE)

  • Since we assume equal population variances, we average the two sample variances.
  • Formula in words: MSE equals the sum of the two sample variances divided by 2.
  • Example: For the Animal Research study, MSE = (2.743 + 2.985) / 2 = 2.864.
  • MSE stands for "mean square error" and represents the mean squared deviation of each score from its group's mean.

🧮 Step 2: Calculate the standard error of the difference

  • The standard error of the difference between means is estimated using MSE and the sample sizes.
  • Formula in words: the standard error equals the square root of (2 times MSE divided by n), where n is the sample size per group (when both groups have equal size).
  • Example: With MSE = 2.864 and n = 17, the standard error = square root of (2 × 2.864 / 17) = 0.5805.

🧮 Step 3: Find the appropriate t-value

  • Degrees of freedom = (n₁ - 1) + (n₂ - 1).
  • When sample sizes are equal, this simplifies to 2(n - 1).
  • Example: With n₁ = n₂ = 17, degrees of freedom = 16 + 16 = 32.
  • For a 95% confidence interval with 32 degrees of freedom, t = 2.037 (from a t table or calculator).

🧮 Step 4: Compute the confidence interval

The formula in words:

  • Lower limit = (difference between sample means) - (t-value × standard error)
  • Upper limit = (difference between sample means) + (t-value × standard error)

Example calculation:

  • Difference between means = 5.353 - 3.882 = 1.471
  • Lower limit = 1.471 - (2.037 × 0.5805) = 0.29
  • Upper limit = 1.471 + (2.037 × 0.5805) = 2.65
  • Interpretation: The population difference is likely between 0.29 and 2.65, meaning females likely rate animal research as more wrong than males by this amount.
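
A minimal sketch of Steps 1 through 4 with the Animal Research numbers, assuming SciPy for the t quantile:

```python
import math
from scipy import stats

mean_f, mean_m = 5.353, 3.882   # group means (females, males)
var_f, var_m = 2.743, 2.985     # group variances
n = 17                          # per-group sample size (equal groups)

mse = (var_f + var_m) / 2                    # step 1: pooled variance = 2.864
se_diff = math.sqrt(2 * mse / n)             # step 2: standard error of the difference = 0.5805
t_crit = stats.t.ppf(0.975, df=2 * (n - 1))  # step 3: t for 95% confidence, df = 32, about 2.037
diff = mean_f - mean_m                       # 1.471
print(diff - t_crit * se_diff, diff + t_crit * se_diff)  # step 4: approximately 0.29 and 2.65
```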

🔧 Special cases and practical considerations

🔧 Unequal sample sizes

When sample sizes differ, the calculations become more complex:

  • MSE must weight the larger sample more heavily using a sum of squares error (SSE) approach.
  • The standard error formula uses the harmonic mean of the sample sizes instead of the simple sample size.
  • Harmonic mean formula in words: 2 divided by (1/n₁ + 1/n₂).
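
A one-line sketch of the harmonic-mean substitution (the sample sizes below are made up for illustration):

```python
n1, n2 = 12, 22              # hypothetical unequal sample sizes
n_h = 2 / (1 / n1 + 1 / n2)  # harmonic mean, about 15.5
print(n_h)
```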

💻 Data formatting for computer analysis

Most computer programs require data in a specific format:

  • Use two variables: one specifying the group and one for the score.
  • Example: Instead of two separate columns for Group 1 and Group 2, create a "Group" column (with values 1 or 2) and a "Score" column.
| Group | Score |
| --- | --- |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
| 2 | 5 |
| 2 | 6 |
| 2 | 7 |
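
A minimal sketch of reshaping two "wide" columns into this long format using plain Python (the score lists mirror the table above):

```python
# Reshape two separate score lists ("wide" format) into the group/score
# ("long") format that most statistics programs expect.
group1_scores = [3, 4, 5]
group2_scores = [5, 6, 7]

long_format = [(1, s) for s in group1_scores] + [(2, s) for s in group2_scores]
for group, score in long_format:
    print(group, score)   # prints rows of (Group, Score) pairs: (1, 3), (1, 4), ..., (2, 7)
```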

📊 Real example: Stroop interference effect

The excerpt includes a second example with 47 subjects:

  • Mean time difference = 16.362 seconds
  • Standard deviation = 7.470 seconds
  • Standard error = 1.090
  • Degrees of freedom = 46
  • t-value for 95% confidence = 2.013
  • Confidence interval: 14.17 to 18.56 seconds
  • Interpretation: The interference effect in the population is likely between 14.17 and 18.56 seconds.
70

Correlation and Hypothesis Testing

Correlation

🧭 Overview

🧠 One-sentence thesis

Statistical significance testing determines whether observed effects (such as correlations or differences between means) are unlikely to have occurred by chance, but a significant result does not prove the effect is large or that the null hypothesis is certainly false.

📌 Key points (3–5)

  • What significance testing does: it calculates the probability of obtaining the observed data (or more extreme) assuming the null hypothesis is true, not the probability that the null hypothesis is false.
  • Confidence intervals and significance: if a 95% confidence interval does not contain zero, the effect is significant at the 0.05 level; the interval shows which parameter values are plausible.
  • Common confusion: a non-significant result does not mean the null hypothesis is true or even supported—it only means the data are inconclusive.
  • Type I vs Type II errors: rejecting a true null hypothesis (Type I) vs failing to reject a false null hypothesis (Type II); the α level controls Type I error rate.
  • One-tailed vs two-tailed tests: one-tailed tests are appropriate only when effects in one direction are not worth distinguishing from zero; two-tailed tests are standard in research.

📊 Confidence intervals for correlation

📊 Converting r to z' (Fisher transformation)

  • The excerpt shows a worked example: a correlation of r = -0.654 (from 34 observations) is converted to z' = -0.78 using a calculator.
  • Why this matters: the sampling distribution of z' is approximately normal, making confidence interval calculations straightforward.
  • The standard error formula is given as 1 divided by the square root of (N - 3), where N is the sample size.
  • Example: with N = 34, the standard error is 0.180.

🔢 Computing the interval

  • For a 95% confidence interval, use Z = 1.96 (the critical value for 95% coverage).
  • The interval in z' units: -0.78 ± (1.96)(0.18) = [-1.13, -0.43].
  • Final step: convert the z' endpoints back to r using a table or calculator.
  • Result: the population correlation ρ is likely between -0.81 and -0.40.
  • For 99% confidence, use Z = 2.58, yielding a wider interval: -0.84 ≤ ρ ≤ -0.31.

⚠️ Interpretation reminder

  • The interval gives plausible values for the population parameter ρ, not the sample statistic r.
  • Don't confuse: the confidence level (95%) refers to the long-run success rate of the method, not the probability that this specific interval contains ρ.

🗳️ Confidence intervals for proportions

🗳️ Estimating a population proportion

  • The excerpt describes a poll: 260 out of 500 voters favor a candidate, so the sample proportion p = 0.52.
  • The estimated standard error is the square root of [p(1 - p) / n] = sqrt[0.52(0.48) / 500] = 0.0223.

🔧 Continuity correction

  • Because a discrete distribution (binomial) is approximated by a continuous distribution (normal), a correction is applied.
  • Subtract 0.5/N from the lower limit and add 0.5/N to the upper limit.
  • With N = 500, the correction is 0.001 in each direction.

📐 The 95% confidence interval

  • Lower: 0.52 - (1.96)(0.0223) - 0.001 = 0.475
  • Upper: 0.52 + (1.96)(0.0223) + 0.001 = 0.565
  • Conclusion: between 47.5% and 56.5% of voters favor the candidate; the margin of error is 4.5%.
  • Common media mistake: the margin of error for the difference between two candidates is twice as large (9%), not 4.5%.

📱 Small-sample example (iPhone retention)

  • The excerpt mentions a survey in which 58 of 62 iPhone users planned to keep their phone (proportion ≈ 0.94).
  • The 95% confidence interval extends from 0.87 to 1.0 (or 0.85 to 0.97 by some methods).
  • Key insight: even with a small sample, the lower bound (0.85 or 0.87) indicates the vast majority plan to stay with iPhone, so a strong conclusion is justified.

🧪 Logic of hypothesis testing

🧪 What the probability value means

The probability value (p-value): the probability of obtaining a result as extreme or more extreme than the observed result, given that the null hypothesis is true.

  • It is not the probability that the null hypothesis is false.
  • Example: if Mr. Bond correctly identifies 13 out of 16 martinis, p = 0.0106 is the probability of 13+ correct if he were guessing, not the probability he can tell the difference.
  • Don't confuse: probability of data given hypothesis vs probability of hypothesis given data (the latter requires Bayesian methods).

🎯 The null hypothesis

  • The null hypothesis typically states that an effect is zero (e.g., μ₁ = μ₂ or ρ = 0).
  • It is usually the opposite of the researcher's hypothesis and is put forward hoping it can be rejected.
  • Example: in the Physicians' Reactions study, the null hypothesis is that physicians spend equal time with obese and average-weight patients.

🔀 Alternative hypothesis

  • If the null hypothesis is rejected, the alternative hypothesis is accepted.
  • For a two-tailed test, rejecting μ₁ = μ₂ means accepting either μ₁ < μ₂ or μ₁ > μ₂, depending on the sample data.
  • The excerpt notes that some textbooks incorrectly claim you cannot conclude the direction of the effect; Kaiser (1960) showed this conclusion is justified.

⚖️ Significance levels and errors

⚖️ What "statistically significant" means

  • An effect is statistically significant if the p-value is below the α level (commonly 0.05 or 0.01).
  • Critical distinction: statistical significance means the effect is not exactly zero; it does not mean the effect is large or important.
  • The word "significant" originally meant "signifies something" (i.e., the effect is real), not "large."

🚨 Type I error

Type I error: rejecting a true null hypothesis.

  • The α level is the probability of a Type I error given that the null hypothesis is true.
  • If the null hypothesis is false, a Type I error is impossible.
  • Lower α (e.g., 0.01 vs 0.05) reduces the Type I error rate but increases Type II error rate.

🚫 Type II error

Type II error: failing to reject a false null hypothesis.

  • The probability of a Type II error (given the null hypothesis is false) is called β.
  • Power = 1 - β, the probability of correctly rejecting a false null hypothesis.
  • A Type II error is not really an "error" in the sense of a wrong conclusion—it just means the data are inconclusive.

❌ Why not to accept the null hypothesis

  • A non-significant result does not mean the null hypothesis is true.
  • It is impossible to distinguish "no effect" from "a very small effect" with limited data.
  • Example: if Mr. Bond has π = 0.51 (barely better than chance) but the test yields p = 0.62, concluding he cannot tell the difference would be wrong.
  • Proper interpretation: "no credible evidence of an effect" rather than "the effect is zero."

🔄 One-tailed vs two-tailed tests

🔄 What they test

| Test type | Null hypothesis | When to use |
| --- | --- | --- |
| Two-tailed | π = 0.5 (or μ₁ = μ₂) | When deviations in either direction matter |
| One-tailed | π ≤ 0.5 (or μ₁ ≤ μ₂) | When only one direction is of interest |

  • Two-tailed: reject if the result is extreme in either direction; p-value includes both tails.
  • One-tailed: reject only if the result is extreme in the predicted direction; p-value is half that of the two-tailed test (for symmetric distributions).

🧭 When one-tailed tests are appropriate

  • Use a one-tailed test only when an effect in the non-predicted direction is not worth distinguishing from no effect.
  • Example: testing a cold treatment—only interested in whether it's better than placebo, not whether it's worse.
  • Not justified simply because the researcher predicts a direction; if a strong effect appears in the opposite direction, the researcher should be able to conclude something.

⚠️ Decide before looking at data

  • Always choose one-tailed vs two-tailed before analyzing the data.
  • Two-tailed tests are much more common in scientific research because unexpected findings are usually worth noting.

🔍 Interpreting results correctly

🔍 Significant results

  • A significant result (p < α) means the null hypothesis is rejected, but not with absolute certainty.
  • Strength of evidence: p = 0.003 provides stronger evidence than p = 0.049; rejecting the null is not all-or-none.
  • The direction of the effect is established by the sample data (e.g., if sample mean for obese patients is lower, conclude the population mean is lower).

🤔 Non-significant results

  • A non-significant result (p ≥ α) does not mean the null hypothesis is probably true.
  • It means the data do not provide strong evidence against the null hypothesis.
  • Proper interpretation: the test is inconclusive; more research is needed.

📈 When non-significant results are encouraging

  • Example: a researcher tests a new anxiety treatment twice; both times it performs better than the traditional treatment, but neither result is significant (p = 0.11 and p = 0.07).
  • Combining the two studies (using methods for combining probabilities) yields p = 0.045, which is significant.
  • Lesson: weak evidence from multiple studies can add up to strong evidence.
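
The excerpt does not name the combining method; Fisher's method is one standard choice and, under that assumption, reproduces the quoted 0.045. A sketch using SciPy for the chi-square tail probability:

```python
import math
from scipy import stats

p_values = [0.11, 0.07]                                    # two non-significant results in the same direction
chi_sq = -2 * sum(math.log(p) for p in p_values)           # Fisher's combining statistic, about 9.73
combined_p = stats.chi2.sf(chi_sq, df=2 * len(p_values))   # df = 2k; here about 0.045
print(combined_p)
```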

📏 Using confidence intervals

  • A confidence interval can show that an effect is small even if not exactly zero.
  • Example: if the 95% confidence interval for a treatment benefit is [-4, 8] minutes, the benefit is at most 8 minutes.
  • This does not prove the null hypothesis is true, but it bounds the size of the effect.

🔗 Relationship between confidence intervals and significance tests

🔗 The connection

  • If a 95% confidence interval does not contain zero, the effect is significant at the 0.05 level.
  • If the interval contains zero, the effect is not significant at that level.
  • Example: in the Physicians' Reactions study, the 95% confidence interval for the difference is [2.00, 11.26]; since zero is not in the interval, the effect is significant (p = 0.0057).

🎯 Direction of the effect

  • When an effect is significant, all values in the confidence interval are on the same side of zero (all positive or all negative).
  • This allows the researcher to specify the direction of the effect.
  • Even if the null hypothesis of "exactly no difference" is known to be false before the experiment (e.g., aspirin vs acetaminophen), a significant test establishes which treatment is better.

🚫 Why non-significant ≠ null is true

  • If the confidence interval contains zero, every value in the interval (including zero) is plausible.
  • But there are infinitely many other plausible values in the interval, so zero is not singled out as "probably true."

🚩 Common misconceptions

🚩 Misconception 1: p-value = probability null is false

  • Wrong: "The p-value is the probability that the null hypothesis is false."
  • Correct: The p-value is the probability of the data (or more extreme) given the null hypothesis is true.

🚩 Misconception 2: low p-value = large effect

  • Wrong: "A low p-value means the effect is large."
  • Correct: A low p-value means the result is unlikely under the null hypothesis; it can occur even with small effects if the sample size is large.

🚩 Misconception 3: non-significant = null is probably true

  • Wrong: "A non-significant result means the null hypothesis is probably true."
  • Correct: A non-significant result means the data do not conclusively demonstrate the null hypothesis is false; it does not support the null hypothesis.

📋 Steps in hypothesis testing

📋 The four-step process

  1. Specify the null hypothesis: typically that a parameter equals zero (two-tailed) or is ≤ or ≥ zero (one-tailed).
  2. Specify the α level: commonly 0.05 or 0.01 (the significance level).
  3. Compute the p-value: the probability of obtaining the observed result (or more extreme) if the null hypothesis is true.
  4. Compare p-value to α: if p < α, reject the null hypothesis; if p ≥ α, the findings are inconclusive (do not accept the null hypothesis).

🎚️ Two approaches to significance testing

| Approach | Description | Best for |
| --- | --- | --- |
| Fisher | Treat p-value as a continuous measure of evidence strength | Scientific research |
| Neyman-Pearson | Use α as a fixed decision rule (reject or don't reject) | Yes/no decisions (e.g., quality control) |

  • The excerpt recommends the Fisher approach for scientific research: p = 0.001 is stronger evidence than p = 0.049, and p = 0.06 is weaker than p = 0.34.
  • The Neyman-Pearson approach treats all p < α the same and all p ≥ α the same, which is more suitable for applications requiring immediate action.
71

Proportion

Proportion

🧭 Overview

🧠 One-sentence thesis

Confidence intervals for population proportions allow researchers to estimate the true proportion in a population from sample data, with adjustments for discrete distributions and clear interpretation of margin of error.

📌 Key points (3–5)

  • What a proportion confidence interval estimates: the range within which the true population proportion (π) likely falls, based on a sample proportion (p).
  • Key formula components: uses the sample proportion, standard error calculated from p(1-p)/n, and a Z-score (e.g., 1.96 for 95% confidence).
  • Continuity correction: subtract 0.5/N from the lower limit and add 0.5/N to the upper limit because we approximate a discrete distribution with a continuous normal distribution.
  • Common confusion: margin of error for a single proportion vs. margin of error for the difference between two proportions—the latter is twice as large.
  • Sample size matters: even small samples can yield strong conclusions if the confidence interval bounds are far from critical thresholds.

📐 Building the confidence interval

📐 Core formula structure

The confidence interval for a proportion is built from:

  • Center: the sample proportion p
  • Spread: Z-score times the standard error
  • Adjustment: continuity correction of ±0.5/N

Standard error of p: the square root of p(1-p) divided by n.

The formula structure is:

  • Lower limit = p - (Z × standard error) - 0.5/N
  • Upper limit = p + (Z × standard error) + 0.5/N

🔢 Z-score selection

  • For 95% confidence, Z = 1.96
  • For 99% confidence, Z = 2.58
  • The Z-score is "the number of standard deviations extending from the mean of a normal distribution required to contain [the desired area]."
  • Found using a normal distribution calculator.

🧮 Worked example: election poll

A pollster samples 500 voters; 260 favor the candidate (p = 0.52).

Step 1: Calculate standard error

  • s_p = √[0.52(1-0.52)/500] = 0.0223

Step 2: Apply Z for 95% confidence (1.96)

  • Interval before correction: 0.52 ± (1.96)(0.0223) = 0.52 ± 0.044

Step 3: Apply continuity correction (±0.001 for N=500)

  • Lower: 0.52 - 0.044 - 0.001 = 0.475
  • Upper: 0.52 + 0.044 + 0.001 = 0.565

Result: 0.475 ≤ π ≤ 0.565 (or 47.5% to 56.5%)
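
A standard-library sketch of Steps 1 through 3 for this poll:

```python
import math

p, n = 0.52, 500                 # sample proportion and sample size
z = 1.96                         # multiplier for 95% confidence
se = math.sqrt(p * (1 - p) / n)  # step 1: standard error, about 0.0223
cc = 0.5 / n                     # step 3: continuity correction, 0.001 here
lower = p - z * se - cc          # about 0.475
upper = p + z * se + cc          # about 0.565
print(lower, upper)
```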

🔧 The continuity correction

🔧 Why it's needed

  • We are "approximating a discrete distribution with a continuous distribution (the normal distribution)."
  • Proportions are inherently discrete (counts of successes), but the normal curve is continuous.
  • The correction adjusts the interval boundaries to account for this mismatch.

🔧 How to apply it

  • Subtract 0.5/N from the lower limit
  • Add 0.5/N to the upper limit
  • Example: with N=500, the correction is ±0.001 (or ±0.1 percentage points)

Don't confuse: this is not the same as the margin of error; it's a separate technical adjustment.

⚠️ Interpreting margin of error

⚠️ Single proportion vs. difference

The excerpt emphasizes a critical distinction the media often gets wrong:

| Context | Margin of error |
| --- | --- |
| Single proportion (e.g., % favoring candidate) | ±4.5% in the example |
| Difference between two proportions (e.g., candidate lead) | ±9% (twice the single margin) |

  • The interval 47.5% to 56.5% tells us the candidate's support level.
  • It does NOT directly tell us the margin in a two-person race.
  • "The margin of error for the difference is 9%, twice the margin of error for the individual percent."

⚠️ Media misreporting

The excerpt warns: "Keep this in mind when you hear reports in the media; the media often get this wrong."

Example: If the poll shows 52% for Candidate A, the margin of error for the race outcome (A's lead over B) is double the margin for A's individual percentage.

📊 Sample size and precision

📊 Small samples can still be informative

The excerpt includes a Statistical Literacy case about iPhone vs. Android retention:

  • iPhone: 58 kept, 4 changed (proportion = 0.94)
  • Sample size: only 62 iPhone users

Key insight: "The article contains the strong caution: 'It's only a tiny sample, so large conclusions must not be drawn.' This caution appears to be a welcome change... But has this report understated the importance of the study?"

The 95% confidence interval extends from 0.87 to 1.0 (or 0.85 to 0.97 by some methods).

  • Even the lower bound (85-87%) indicates the vast majority plan to buy another iPhone.
  • "A strong conclusion can be made even with this sample size."

📊 When small samples suffice

Don't confuse: a small sample is not automatically uninformative.

  • If the confidence interval bounds are far from critical thresholds (e.g., 50% in a two-way choice), strong conclusions are justified.
  • Precision depends on both sample size and the proportion itself (closer to 0.5 = more variability).

🎯 Practical application steps

🎯 Summary workflow

  1. Calculate sample proportion p
  2. Compute standard error: √[p(1-p)/n]
  3. Choose confidence level and find Z-score
  4. Compute interval: p ± Z × standard error
  5. Apply continuity correction: ±0.5/N
  6. Interpret in context, watching for single vs. difference margin of error

🎯 Parameter notation

  • π (pi): the true population proportion (unknown parameter)
  • p: the sample proportion (statistic, used to estimate π)
  • The confidence interval provides a range estimate for π based on p.
72

Introduction to Hypothesis Testing

Introduction

🧭 Overview

🧠 One-sentence thesis

Statistical significance testing uses probability values to determine whether observed differences are real effects rather than chance, but a statistically significant result does not necessarily mean the effect is large or practically important.

📌 Key points (3–5)

  • What the null hypothesis is: typically states that a parameter equals zero (no difference, no correlation, or sometimes a specific value like 0.5 for chance).
  • How significance testing works: if the probability of observing data as extreme as the sample (assuming the null hypothesis is true) is very low, researchers reject the null hypothesis.
  • Common confusion: "statistically significant" does NOT mean "important" or "large"—it only means the effect is unlikely to be exactly zero; a tiny effect can be statistically significant with a large sample.
  • Two approaches to testing: Fisher's approach treats the p-value as a continuous measure of evidence strength; Neyman-Pearson's approach uses a fixed cutoff (alpha level) for yes/no decisions.
  • Conventional thresholds: p < 0.05 is the standard cutoff; p < 0.01 is more conservative; values between 0.05 and 0.10 are considered weak evidence.

🎯 The null hypothesis

🎯 What it states

Null hypothesis: a statement that a population parameter equals a specific value, typically zero (no difference, no effect, no correlation).

  • The null hypothesis is usually the opposite of what the researcher expects or hopes to find.
  • Researchers put it forward hoping to discredit and reject it.
  • Example: in the Physicians' Reactions study, the null hypothesis was that physicians spend equal time with obese and average-weight patients (μ_obese = μ_average or μ_obese - μ_average = 0).

📐 Common forms

| Study type | Null hypothesis | Meaning |
| --- | --- | --- |
| Comparing two groups | μ₁ = μ₂ or μ₁ - μ₂ = 0 | No difference between population means |
| Correlation study | ρ = 0 | No relationship in the population (ρ is the population correlation, not r, the sample correlation) |
| Chance performance | π = 0.5 | Performance equals chance level (e.g., coin flipping) |

  • Don't confuse: ρ (population correlation) vs. r (sample correlation).
  • The null hypothesis can be a value other than zero when testing against a specific benchmark (e.g., chance = 0.5).

✅ Alternative hypothesis

  • If the null hypothesis is rejected, the alternative hypothesis is accepted.
  • The alternative is simply the reverse of the null.
  • Example: if μ_obese = μ_average is rejected, the alternatives are μ_obese < μ_average or μ_obese > μ_average.
  • The direction of the sample means determines which alternative is adopted.
  • The excerpt notes that it is justified to conclude which population mean is larger based on the sample direction (Kaiser, 1960).

🔬 How significance testing works

🔬 The logic of probability values

  • A low probability value casts doubt on the null hypothesis.
  • The probability value answers: "If the null hypothesis were true, how unlikely would it be to observe a difference as large or larger than what we saw in the sample?"
  • Example: in the Physicians' Reactions study, the sample difference was 6.7 minutes; the probability value was 0.0057, meaning such a large difference would be very unlikely if the null hypothesis were true.
  • Therefore, researchers rejected the null hypothesis and concluded that physicians intend to spend less time with obese patients in the population.

📏 The alpha (α) level

Alpha level (α) or significance level: the probability value below which the null hypothesis is rejected.

  • Conventional cutoff: α = 0.05 (5%).
  • More conservative: α = 0.01 (1%).
  • If the probability value is less than α, the null hypothesis is rejected and the effect is called statistically significant.

⚖️ Interpreting probability values (Fisher's approach)

| Probability value | Interpretation |
| --- | --- |
| Below 0.01 | Strong evidence against the null hypothesis |
| Between 0.01 and 0.05 | Null hypothesis typically rejected, but with less confidence |
| Between 0.05 and 0.10 | Weak evidence; by convention, not low enough to reject |
| Above 0.10 | Less evidence that the null hypothesis is false |

  • This approach treats the p-value as a continuous measure of evidence strength.
  • Example: p = 0.0057 provides strong evidence; p = 0.049 provides weaker evidence, even though both are below 0.05.

⚠️ What "statistically significant" really means

⚠️ Not the same as "important"

Statistically significant: the null hypothesis of exactly no effect is rejected; the effect is not exactly zero.

  • Key warning: statistical significance does NOT mean the effect is large, important, or practically meaningful.
  • "Significant" in everyday language means "important," but in statistics it originally meant "signifies something" (i.e., the effect is real, not due to chance).
  • Over time, the everyday meaning changed, leading to confusion.

🔍 Statistical vs. practical significance

  • Statistical significance: confidence that the effect is not exactly zero.
  • Practical significance: whether the effect is large enough to matter in the real world.
  • Don't confuse: a small effect can be highly significant if the sample size is large enough.
  • Example: a tiny difference in time spent with patients could be statistically significant with thousands of patients, but the 6.7-minute difference might not be practically important for scheduling.

🛤️ Two approaches to significance testing

🛤️ Fisher's approach (evidence-based)

  • Conduct the test and use the probability value to reflect the strength of evidence against the null hypothesis.
  • The p-value is treated as a continuous measure: smaller values = stronger evidence.
  • More suitable for scientific research, where the goal is to assess the weight of evidence.
  • Example: p = 0.001 is treated as much stronger evidence than p = 0.049.

🛤️ Neyman-Pearson approach (decision-based)

  • Specify an α level before analyzing the data (e.g., α = 0.05).
  • If p < α, reject the null hypothesis; if p ≥ α, do not reject.
  • All results below α are treated identically; all results above α are treated identically.
  • Example: p = 0.049 and p = 0.001 are treated the same (both reject); p = 0.06 and p = 0.34 are treated the same (both do not reject).
  • More suitable for yes/no decisions in applied settings.
  • Example: deciding whether to shut down a malfunctioning machine in a manufacturing plant—the manager needs a clear decision, not a nuanced assessment of evidence strength.

🧭 Which approach to use

  • The excerpt adopts Fisher's approach for scientific research.
  • The Neyman-Pearson approach is better when a binary decision must be made.
73

Significance Testing

Significance Testing

🧭 Overview

🧠 One-sentence thesis

Significance testing determines whether an observed effect is real or due to chance, and the interpretation depends on whether you treat probability values as a spectrum of evidence (Fisher's approach) or as a binary decision rule (Neyman-Pearson approach).

📌 Key points (3–5)

  • Two competing approaches: Fisher treats probability values as a continuous measure of evidence strength; Neyman-Pearson uses a fixed α cutoff for yes/no decisions.
  • What "significant" means: the effect is real and not due to chance, but the term's meaning has shifted over time and can be misinterpreted.
  • Common confusion: a non-significant result does not prove the null hypothesis is true—it only means the data do not provide strong evidence against it.
  • Type I vs Type II errors: Type I is rejecting a true null hypothesis (a real error); Type II is failing to reject a false null hypothesis (not really an error, just inconclusive).
  • One-tailed vs two-tailed: one-tailed tests look for effects in only one direction; two-tailed tests look for deviations in either direction; the choice must be made before seeing the data.

🔬 Two approaches to significance testing

🔬 Fisher's continuous-evidence approach

  • Probability values (p-values) reflect the strength of evidence against the null hypothesis.
  • Interpretation is on a spectrum:
    • p < 0.01: strong evidence the null hypothesis is false
    • 0.01 ≤ p < 0.05: null hypothesis typically rejected, but with less confidence
    • 0.05 ≤ p < 0.10: weak evidence; by convention, not low enough to reject
    • Higher p-values: even less evidence against the null
  • No fixed cutoff: the researcher weighs the evidence and may conclude "more research is needed" rather than making a definitive decision.
  • Best for scientific research: allows nuanced interpretation and acknowledges uncertainty.

⚖️ Neyman-Pearson's decision-rule approach

  • Specify an α level (significance level) before analyzing the data.
  • Decision rule:
    • If p < α → reject the null hypothesis
    • If p ≥ α → do not reject the null hypothesis
  • All-or-nothing: p = 0.049 and p = 0.001 are treated identically (both significant); p = 0.06 and p = 0.34 are treated identically (both not significant).
  • The degree of significance does not matter—only whether the threshold is crossed.
  • Best for yes/no decisions: suitable when an immediate action is required (e.g., shutting down a malfunctioning machine in a plant).

🧩 Which approach to use

  • The excerpt adopts Fisher's approach for scientific research.
  • Neyman-Pearson is more suitable for applications requiring a clear action (e.g., quality control, operational decisions).
  • Example: A plant manager needs to decide whether to shut down a machine for repair—binary decision. A researcher studying a phenomenon can conclude "some evidence exists, but more study is needed"—continuous evidence.

⚠️ Type I and Type II errors

⚠️ Type I error: rejecting a true null hypothesis

Type I error: occurs when a significance test results in the rejection of a true null hypothesis.

  • What it means: you conclude there is an effect when in reality there is none.
  • Example: In the Physicians' Reactions case study, p = 0.0057, so the null hypothesis (no difference in time spent with obese vs average-weight patients) was rejected. If the null hypothesis is actually true and the large sample difference occurred by chance, this conclusion is a Type I error.
  • Relationship to α: The α level affects the Type I error rate—lower α means lower Type I error rate.
  • Common confusion: α is not the probability of a Type I error in general; it is the probability of a Type I error given that the null hypothesis is true. If the null hypothesis is false, a Type I error is impossible.

🔄 Type II error: failing to reject a false null hypothesis

Type II error: failing to reject a false null hypothesis.

  • What it means: the test is not significant, so you do not reject the null hypothesis, but the null hypothesis is actually false.
  • Not really an error: A non-significant result means the data do not provide strong evidence that the null hypothesis is false—it does not mean the null hypothesis is true.
  • Correct interpretation: the test is inconclusive, not confirmatory of the null hypothesis.
  • Don't confuse: Failing to reject the null ≠ accepting the null. A researcher should never conclude the null hypothesis is true based on a non-significant result.
  • Probability of Type II error: denoted β (beta); can only occur if the null hypothesis is false.
  • Power: the probability of correctly rejecting a false null hypothesis = 1 - β.

📊 Comparison of error types

| Error type | What happens | When it can occur | Is it a real mistake? |
| --- | --- | --- | --- |
| Type I | Reject a true null hypothesis | Only if the null is true | Yes (an erroneous conclusion) |
| Type II | Fail to reject a false null hypothesis | Only if the null is false | No (just inconclusive) |

🎯 One-tailed vs two-tailed tests

🎯 What they are

One-tailed probability: a probability calculated in only one tail of the distribution.

Two-tailed probability: a probability calculated in both tails of a distribution.

  • One-tailed test: looks for an effect in only one direction (e.g., "Is the result better than chance?").
  • Two-tailed test: looks for an effect in either direction (e.g., "Is the result different from chance, whether better or worse?").

🧪 Example: James Bond martini case

  • Mr. Bond judged 16 martinis (shaken or stirred) and was correct 13 times.
  • Probability of 13 or more correct by guessing alone = 0.0106 (one-tailed, upper tail only).
  • If we ask "What is the probability of a result as extreme or more extreme?", we consider both tails: 13/16 is as extreme as 3/16. Since the binomial is symmetric at π = 0.5, the two-tailed probability = 2 × 0.0106 = 0.0212.
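
A standard-library sketch of the binomial arithmetic behind these two probabilities:

```python
from math import comb

n, k = 16, 13                                                     # 16 martinis, 13 correct
one_tailed = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n    # P(13 or more correct | guessing) ≈ 0.0106
two_tailed = 2 * one_tailed                                       # symmetric at pi = 0.5, so ≈ 0.0212
print(one_tailed, two_tailed)
```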

🔍 Which test to use

  • Two-tailed: use when you want to detect any deviation from the null, in either direction.
    • Question: "Can Mr. Bond tell the difference between shaken and stirred?" (He could be much better or much worse than chance.)
    • Null hypothesis: π = 0.5
    • Alternative hypothesis: π ≠ 0.5
  • One-tailed: use when you only care about an effect in one direction.
    • Question: "Is Mr. Bond better than chance?" (Only interested in performance above chance.)
    • Null hypothesis: π ≤ 0.5
    • Alternative hypothesis: π > 0.5
    • If Mr. Bond were correct on only 3/16 trials, the one-tailed probability (right tail) would be very high, and the null would not be rejected.

⏰ When to decide

  • Always decide before looking at the data whether you will use a one-tailed or two-tailed test.
  • Two-tailed tests are much more common in scientific research because any outcome different from chance is usually worth noting.
  • One-tailed tests are appropriate when it is not important to distinguish between no effect and an effect in the unexpected direction.
    • Example: Testing a cold treatment. The researcher only cares if the treatment is better than placebo. If it is worse or the same, the drug is worthless either way—no need to distinguish.

🚫 Common misuse

  • Some argue a one-tailed test is justified whenever the researcher predicts the direction of an effect.
  • Problem: If the effect comes out strongly in the non-predicted direction, the researcher cannot justifiably conclude the effect is zero.
  • The excerpt notes this is unrealistic (text cuts off, but the implication is that ignoring a strong opposite effect is not scientifically sound).

📋 Summary table

| Test type | Null hypothesis | Alternative hypothesis | When to use |
| --- | --- | --- | --- |
| Two-tailed | π = 0.5 | π ≠ 0.5 | Want to detect any deviation from the null (either direction) |
| One-tailed | π ≤ 0.5 | π > 0.5 | Only care about an effect in one direction; the opposite direction is unimportant |
74

Type I and II Errors

Type I and II Errors

🧭 Overview

🧠 One-sentence thesis

Understanding Type I and Type II errors is essential for interpreting significance tests correctly, because rejecting or failing to reject the null hypothesis both carry specific risks that affect how confidently we can draw conclusions from data.

📌 Key points (3–5)

  • Type I error: rejecting a true null hypothesis (false positive); controlled by the alpha (α) level.
  • Type II error: failing to reject a false null hypothesis (false negative); related to beta (β), where power = 1 - β.
  • Common confusion: a non-significant result does NOT prove the null hypothesis is true—it only means the evidence is inconclusive.
  • Trade-off: lowering the risk of Type I error (using a smaller α) increases the risk of Type II error (lower power).
  • Why it matters: understanding these errors helps researchers design better experiments and interpret results appropriately.

⚖️ The two types of errors

🔴 Type I error (false positive)

Type I error: rejecting the null hypothesis when it is actually true.

  • This is a "false alarm"—concluding there is an effect when there really isn't one.
  • The probability of making a Type I error is controlled by the significance level (α), typically set at 0.05 or 0.01.
  • Example: concluding a drug works when it actually has no effect.
  • When it happens: when your sample data happen to be extreme just by chance, even though the null hypothesis is true.

🔵 Type II error (false negative)

Type II error: failing to reject the null hypothesis when it is actually false.

  • This is a "miss"—failing to detect a real effect.
  • The probability of making a Type II error is denoted β (beta).
  • Power is defined as 1 - β, the probability of correctly rejecting a false null hypothesis.
  • Example: concluding a drug doesn't work when it actually does have an effect.
  • When it happens: when your sample size is too small, the effect size is small, or there is too much variability in the data.

📊 Comparison table

| Error type | What happens | Probability | Consequence |
| --- | --- | --- | --- |
| Type I | Reject a true H₀ | α (e.g., 0.05) | False positive; conclude an effect exists when it doesn't |
| Type II | Fail to reject a false H₀ | β | False negative; miss a real effect |

🎯 Controlling error rates

🎚️ The alpha (α) level

  • The significance level is the threshold for rejecting the null hypothesis.
  • Common values: 0.05 (5% chance of Type I error) or 0.01 (1% chance).
  • More conservative α (e.g., 0.01 instead of 0.05):
    • Lower risk of Type I error
    • Higher risk of Type II error (lower power)
    • Harder to detect real effects

💪 Power (1 - β)

  • Power is the probability of correctly rejecting a false null hypothesis.
  • Higher power means better ability to detect real effects.
  • Power is increased by:
    • Larger sample sizes
    • Larger effect sizes
    • Lower variability in data
    • Using a less conservative α level

⚖️ The trade-off

  • You cannot simultaneously minimize both Type I and Type II errors without changing other aspects of the study (like sample size).
  • Don't confuse: choosing a smaller α makes you less likely to make a Type I error, but it makes you MORE likely to make a Type II error (unless you increase sample size).

🚫 Common misconceptions

❌ Misconception 1: p-value is the probability H₀ is false

Wrong interpretation: "If p = 0.03, there's a 3% chance the null hypothesis is true."

Correct interpretation: The p-value is the probability of obtaining data as extreme as (or more extreme than) what was observed, assuming the null hypothesis is true. It is NOT the probability that the null hypothesis is false.

❌ Misconception 2: Non-significant = null hypothesis is true

Wrong interpretation: "The difference wasn't significant, so the two groups are the same."

Correct interpretation: A non-significant result means the data do not provide strong evidence against the null hypothesis. It does NOT prove the null hypothesis is true. The effect might be real but too small to detect with the current sample size.

  • Example: If you fail to reject H₀, you should say "there is no convincing evidence of a difference," NOT "there is no difference."

❌ Misconception 3: Low p-value = large effect

Wrong interpretation: "p < 0.001, so the effect must be huge."

Correct interpretation: A low p-value indicates that the result is unlikely if the null hypothesis were true, but it doesn't tell you the size of the effect. With a very large sample, even a tiny effect can be statistically significant.

🧪 Practical implications

🔬 When designing experiments

  • Consider power before collecting data: ensure your study has a reasonable chance (often 0.80 or 80%) of detecting an effect if one exists.
  • Avoid underpowered studies—they waste resources and are unlikely to yield conclusive results.
  • Use power analysis to determine appropriate sample sizes.

📖 When interpreting results

✅ Significant result (reject H₀)

  • You have evidence against the null hypothesis.
  • Strength of evidence varies: p = 0.049 is weaker evidence than p = 0.003.
  • Rejecting H₀ is not all-or-none; consider the magnitude of the p-value.

🤷 Non-significant result (fail to reject H₀)

  • Do NOT accept the null hypothesis.
  • The result is inconclusive—you cannot distinguish between "no effect" and "effect too small to detect."
  • A non-significant result can still provide weak support for an effect if the direction is consistent with predictions.
  • Multiple non-significant studies in the same direction can combine to provide significant evidence.

🎲 Example scenario

Suppose a researcher tests whether a new teaching method improves test scores:

  • Type I error: concluding the method works when it actually doesn't (wasting resources implementing it).
  • Type II error: concluding the method doesn't work when it actually does (missing an opportunity to improve education).
  • The researcher must balance these risks when choosing α and designing the study.

🔑 Key takeaways

📝 Remember

  • Type I error = false positive (α controls this risk)
  • Type II error = false negative (β; power = 1 - β)
  • Non-significant ≠ no effect; it means inconclusive evidence
  • Lower α → lower Type I error risk BUT higher Type II error risk
  • Always consider power when planning research

🎓 Best practices

  1. Decide on α level and whether to use one-tailed or two-tailed tests before looking at data.
  2. Don't accept H₀ when you fail to reject it.
  3. Report exact p-values, not just "significant" or "not significant."
  4. Consider effect sizes and confidence intervals, not just p-values.
  5. Use power analysis to ensure adequate sample sizes.
75

One- and Two-Tailed Tests

One- and Two-Tailed Tests

🧭 Overview

🧠 One-sentence thesis

The choice between one-tailed and two-tailed tests affects the probability of correctly rejecting a false null hypothesis (statistical power), with one-tailed tests offering higher power when the direction of the effect is predicted in advance.

📌 Key points (3–5)

  • What power measures: the probability that a researcher will correctly reject a false null hypothesis.
  • How sample size affects power: larger samples increase power, giving researchers more ability to detect true effects.
  • One-tailed vs two-tailed distinction: one-tailed tests predict a specific direction in advance (e.g., "higher than"), and the choice of test determines the critical value and therefore the power.
  • Common confusion: power is not the same as significance level—power is about detecting true effects, while significance level (e.g., 0.05) is about controlling false positives.
  • What researchers control: sample size and choice of test type (one-tailed vs two-tailed) are under the experimenter's control and directly affect power.

🎯 Understanding statistical power

🎯 What power means

Power: the probability that a researcher will correctly reject a false null hypothesis.

  • Power is not about whether the null hypothesis is true or false in reality; it measures the test's ability to detect a false null hypothesis when one exists.
  • The excerpt repeatedly frames power as "the probability of correctly rejecting" a hypothesis that is actually false.
  • Example: if the true population mean is 80 but the null hypothesis claims it is 75 or lower, power is the chance the test will correctly reject that null hypothesis.

🔢 How power is calculated

The excerpt walks through a concrete calculation:

  • A math achievement test has a known mean of 75 and standard deviation of 10.
  • A researcher tests whether a new teaching method produces a higher mean.
  • The researcher plans to sample 25 subjects and use a one-tailed test.
  • The true population mean (unknown to the researcher) is actually 80.
  • The standard error of the mean is calculated as 10 divided by 5 (the square root of 25), which equals 2.
  • If the null hypothesis (mean equals 75) is true, the probability of getting a sample mean of 78.29 or higher is 0.05—this is the critical value.
  • The researcher will reject the null hypothesis if the sample mean is 78.29 or larger.
  • Given that the true population mean is 80, the probability of getting a sample mean of 78.29 or higher is 0.80.
  • Therefore, power equals 0.80.

Don't confuse: the 0.05 threshold is the significance level (Type I error rate), while 0.80 is the power (probability of correctly detecting the true effect).
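A minimal sketch of the walkthrough above, assuming SciPy is available; the critical value and power follow directly from the normal distribution with the stated mean, standard error, and α = 0.05.

```python
from scipy.stats import norm

mu_null, mu_true, sigma, n, alpha = 75, 80, 10, 25, 0.05
se = sigma / n ** 0.5                                    # 10 / 5 = 2

# One-tailed critical value under the null hypothesis (mean = 75)
critical = norm.ppf(1 - alpha, loc=mu_null, scale=se)    # about 78.29

# Power: probability of exceeding the critical value when the true mean is 80
power = norm.sf(critical, loc=mu_true, scale=se)         # about 0.80
print(round(critical, 2), round(power, 2))

# Larger samples raise power: repeating the calculation with n = 100 gives roughly 0.9996
se_100 = sigma / 100 ** 0.5
print(round(norm.sf(norm.ppf(1 - alpha, mu_null, se_100), mu_true, se_100), 4))
```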

🧪 One-tailed tests in practice

🧪 What a one-tailed test predicts

  • The excerpt describes a one-tailed test as testing "whether the sample mean is significantly higher than 75."
  • This means the researcher has predicted the direction of the effect in advance (higher, not just different).
  • The test only looks for evidence in one direction.

📐 How the critical value is set

  • The excerpt shows that with a one-tailed test at the 0.05 level, the critical value is 78.29.
  • This is the point where "the probability of a sample mean being greater than or equal to 78.29 is 0.05" if the null hypothesis is true.
  • The researcher rejects the null hypothesis only if the sample mean reaches or exceeds this value.

Example: if the sample mean turns out to be 79, it exceeds 78.29, so the researcher rejects the null hypothesis and concludes the new method works.

🔧 Factors under researcher control

🔧 Sample size

  • The excerpt states: "the larger the sample size, the higher the power."
  • Sample size is "typically under an experimenter's control."
  • Increasing sample size is described as "one way to increase power."
  • Mechanism: larger samples reduce the standard error of the mean, making it easier to detect true differences.

🔧 Assumptions about population parameters

  • The excerpt notes that the researcher "assumes that the population standard deviation with the new method is the same as with the old method (10)."
  • When the population variance is known (or assumed), the researcher can use the normal distribution instead of the t distribution.
  • The excerpt acknowledges this is "rarely true in practice" but useful for teaching purposes.

Don't confuse: knowing the population variance vs. estimating it from the sample—the former allows use of the normal distribution, the latter requires the t distribution, and "power calculators are available for situations in which the experimenter does not know the population variance."

📊 Comparing test scenarios

📊 Bond example

The excerpt briefly mentions another power calculation:

  • Bond's true probability of being correct on each trial is 0.75.
  • The probability that he will be correct on 12 or more trials is 0.63.
  • Therefore, power is 0.63 (reproduced in the sketch below).

This illustrates that power varies depending on the true effect size and the test design.
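The Bond calculation can be checked the same way; the number of trials (16) is taken from the Bond case study described later in this excerpt, and the 12-or-more criterion is the rejection rule stated above. A sketch assuming SciPy:

```python
from scipy.stats import binom

# True probability correct = 0.75; reject the null if Bond gets 12 or more of 16 trials right
power = binom.sf(11, n=16, p=0.75)   # P(X >= 12), about 0.63
print(round(power, 2))
```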

📊 Complexity for other tests

  • Normal distribution (known variance): power calculation is straightforward, as shown in the example.
  • t tests: more complex.
  • Analysis of Variance: more complex.
  • In general: many programs compute power for these cases.

The excerpt notes that "calculation of power is more complex for t tests and for Analysis of Variance" and recommends using programs that compute power.

76

Interpreting Significant Results

Interpreting Significant Results

🧭 Overview

🧠 One-sentence thesis

Rejecting the null hypothesis is not an all-or-none decision, and statistical significance establishes both that an effect exists and its direction, even when the null hypothesis was unlikely to be exactly true before the experiment.

📌 Key points (3–5)

  • Not all rejections are equal: p = 0.049 gives less confidence than p = 0.003, so rejecting the null hypothesis is a matter of degree, not absolute.
  • Direction matters: rejecting the null hypothesis in a two-tailed test tells you which population mean is larger, not just that they differ.
  • Null hypotheses are often false before testing: in many real situations (e.g., comparing two drugs), the populations almost certainly differ; the test establishes which direction the difference goes.
  • Common confusion: some textbooks incorrectly claim that rejecting a two-tailed null hypothesis only shows populations differ, not which is larger—but a two-tailed test at 0.05 is equivalent to two one-tailed tests at 0.025 each, so direction is valid.
  • Why it matters: significance tests are useful even when you already know populations differ, because they reveal the direction of the effect.

🎚️ Degrees of confidence in rejection

🎚️ Rejecting is not all-or-none

  • When p is below the alpha level, the null hypothesis is rejected and the effect is statistically significant.
  • However, not all statistically significant effects should be treated the same way.
  • You should have less confidence that the null hypothesis is false if p = 0.049 than if p = 0.003.
  • The smaller the p-value, the stronger the evidence against the null hypothesis.

🔄 What happens after rejection

  • If the null hypothesis is rejected, the alternative hypothesis is accepted.
  • The alternative hypothesis is the logical opposite of the null hypothesis.
  • Example: if the null hypothesis is "probability ≤ 0.5" and it is rejected, then the alternative "probability > 0.5" is accepted.

🧭 Direction in one-tailed vs two-tailed tests

🎯 One-tailed test direction

  • Example from the James Bond case study:
    • Mr. Bond judged whether a martini had been shaken or stirred in 16 trials.
    • Null hypothesis: probability ≤ 0.5 (no better than chance).
    • Alternative hypothesis: probability > 0.5 (better than chance).
  • If the null is rejected, the direction is already built into the alternative hypothesis.

⚖️ Two-tailed test direction

  • Example from the Physicians' Reactions case study:
    • Null hypothesis: mean for obese patients = mean for average-weight patients.
    • If rejected, there are two possible alternatives:
      • Mean for obese < mean for average-weight
      • Mean for obese > mean for average-weight
  • The direction of the sample means determines which alternative is adopted.
  • If the sample mean for obese patients is significantly lower, conclude that the population mean for obese patients is lower.

🔍 Common confusion: can you conclude direction?

  • Some textbooks incorrectly state that rejecting the null hypothesis (two means are equal) does not justify concluding which population mean is larger.
  • They claim you can only conclude that the means differ, not which is bigger.
  • This is wrong: a two-tailed test at the 0.05 level is equivalent to two separate one-tailed tests, each at the 0.025 level.
  • The two one-tailed null hypotheses are:
    • Mean for obese ≥ mean for average-weight
    • Mean for obese ≤ mean for average-weight
  • Rejecting one of these clearly establishes the direction of the difference.

🧪 When the null hypothesis is already unlikely

🧪 Null hypotheses that are practically false

  • In many real situations, it is very unlikely that two conditions will have exactly the same population means.
  • Example: it is practically impossible that aspirin and acetaminophen provide exactly the same degree of pain relief.
  • Therefore, even before the experiment, the researcher knows the null hypothesis of exactly no difference is false.

🎯 What the test still tells you

  • The researcher does not know which drug offers more relief before the experiment.
  • If the test is significant, then the direction of the difference is established.
  • This makes significance tests useful even when you already suspect the null hypothesis is false.
  • The test answers: "Which one is better?" not just "Are they different?"

🔗 Connection to confidence intervals

  • The excerpt notes that this point is also made in the section on the relationship between confidence intervals and significance tests.
  • (The excerpt does not elaborate further on this connection.)
77

Interpreting Non-Significant Results

Interpreting Non-Significant Results

🧭 Overview

🧠 One-sentence thesis

A non-significant result does not prove the null hypothesis is true; it simply means the data provide little evidence against it, and sometimes multiple non-significant results can together support an effect.

📌 Key points (3–5)

  • What non-significant means: the data provide little or no evidence that the null hypothesis is false, not that the null hypothesis is true.
  • The core problem: it is impossible to distinguish a true null effect (no difference) from a very small effect using a single non-significant test.
  • Common confusion: "not rejecting" the null hypothesis is not the same as "accepting" it—accepting the null is a serious error.
  • How non-significant results can help: a non-significant result in the predicted direction can increase confidence that an effect exists, and combining multiple non-significant results can yield significance.
  • What you can conclude: confidence intervals can show that an effect is likely small, but never that it is exactly zero.

🚫 What non-significant results do NOT mean

🚫 Non-significant ≠ null hypothesis is true

  • A high probability value (e.g., p = 0.62) means the data are consistent with the null hypothesis, but this is not evidence for the null.
  • The excerpt emphasizes: "the high probability value is not evidence that the null hypothesis is true."
  • Why: the test cannot tell the difference between "no effect" and "a very tiny effect."

🎯 The Bond example: a tiny true effect

  • Suppose Mr. Bond has a 0.51 probability of being correct (slightly better than chance, so the null hypothesis π = 0.50 is false).
  • He scores 49 out of 100 correct; the probability of this result (or better) under the null is 0.62.
  • The test does not reject the null (p = 0.62 >> 0.05), yet we know the null is actually false.
  • Don't confuse: "failing to reject" with "the null is true"—the test simply lacks power to detect such a small effect.

⚠️ Accepting the null hypothesis is an error

Accepting the null hypothesis: concluding that the null hypothesis is true based on a non-significant result.

  • The excerpt calls this "a serious error."
  • Rule: Do not accept the null hypothesis when you do not reject it.
  • Instead, report: "There is no credible evidence for an effect, but no proof that the effect does not exist."

🔍 Why you cannot prove a negative

  • The excerpt uses an analogy: claiming to have been Socrates in an earlier life.
  • No one can definitively prove you were not Socrates; absence of evidence is not evidence of absence.
  • Similarly, a non-significant result does not prove the effect is zero.

🔄 How non-significant results can increase confidence

🔄 The anxiety treatment example

  • A researcher tests a new treatment vs. traditional treatment (20 subjects, 10 per group).
  • The new treatment shows lower mean anxiety, but the difference is not significant (p = 0.11).
  • Naive interpretation: "The new treatment is no better."
  • Sophisticated interpretation: "The data weakly support that the new treatment is better, even though not significant."
  • The researcher should have more confidence in the new treatment than before the experiment, though the evidence is weak and inconclusive.

🔁 Repeating the experiment

  • The researcher repeats the study: again, the new treatment is better, but not significant (p = 0.07).
  • Naive view: "Two failures to find significance → the new treatment is unlikely to be better."
  • Sophisticated view: "Two out of two times the new treatment was better; weak support from each can combine into strong support."
  • Using a method for combining probabilities, p = 0.11 and p = 0.07 together yield p = 0.045, which is significant.
  • Key insight: multiple non-significant results in the same direction can accumulate into a significant finding.
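The excerpt does not name the method used to combine the two probabilities, but Fisher's method reproduces the reported value. A minimal sketch assuming SciPy:

```python
from scipy.stats import combine_pvalues

stat, p = combine_pvalues([0.11, 0.07], method="fisher")
print(round(p, 3))   # about 0.045, significant at the 0.05 level
```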

📏 Using confidence intervals to assess effect size

📏 What confidence intervals can show

  • Although you can never prove an effect is exactly zero, a confidence interval can demonstrate that an effect is likely small.
  • Method: compute a 95% confidence interval around the observed difference.
  • If all values in the interval are small, conclude the effect is small.

💤 The insomnia treatment example

  • Mean time to fall asleep is 2 minutes shorter for the treatment group (not significant).
  • The 95% confidence interval ranges from -4 to +8 minutes.
  • What you can conclude: the benefit is at most 8 minutes.
  • What you cannot conclude: that the null hypothesis is true or even "supported."
  • The interval includes zero, so the data are consistent with no effect, but also consistent with a small effect in either direction.

📊 Summary table: interpreting non-significant results

  • Single non-significant result. Naive: "No effect exists." Sophisticated: "No strong evidence for an effect; cannot rule out a small effect."
  • Non-significant result in predicted direction. Naive: "Treatment doesn't work." Sophisticated: "Weak support for the effect; consider replication."
  • Multiple non-significant results in same direction. Naive: "Multiple failures → no effect." Sophisticated: "Accumulating weak evidence; combine probabilities."
  • Confidence interval includes zero. Naive: "Null hypothesis is true." Sophisticated: "Effect is likely small; cannot conclude it is zero."

🧠 Practical guidance

🧠 What to report

  • State: "There is no credible evidence for an effect."
  • Add: "But there is no proof that the effect does not exist."
  • If the result trends in the predicted direction, mention that the data are consistent with (but do not strongly support) the hypothesis.

🧠 What to do next

  • If a non-significant result trends in the predicted direction, consider replication.
  • Combine evidence from multiple studies (even if each is non-significant) to assess cumulative support.
  • Use confidence intervals to quantify the plausible range of effect sizes.

🧠 Don't confuse

  • "Not rejecting the null" vs. "accepting the null": the former is correct; the latter is an error.
  • "No evidence of effect" vs. "evidence of no effect": a non-significant result provides the former, not the latter.
78

Steps in Hypothesis Testing

Steps in Hypothesis Testing

🧭 Overview

🧠 One-sentence thesis

Hypothesis testing follows a four-step process that compares a computed probability value against a preset significance level to decide whether to reject the null hypothesis, though rejection is not all-or-none and failure to reject does not prove the null hypothesis true.

📌 Key points (3–5)

  • The four-step process: specify the null hypothesis, set the significance level (α), compute the probability value (p value), and compare p to α to decide whether to reject the null hypothesis.
  • Null hypothesis formulation: for two-tailed tests, typically states a parameter equals zero; for one-tailed tests, states a parameter is greater than or equal to zero or less than or equal to zero.
  • Rejection is not binary: lower p values give more confidence that the null hypothesis is false, but p > 0.05 is considered inconclusive, not proof that the null hypothesis is true.
  • Common confusion: failing to reject the null hypothesis does not support or prove it true—it only means the data are not strong enough to reject it.
  • Confidence intervals and significance: if a 95% confidence interval excludes zero, the test is significant at the 0.05 level; values outside the interval are rejected as plausible.

📋 The four-step process

📋 Step 1: Specify the null hypothesis

The null hypothesis is the statement being tested, typically that a parameter equals zero or has no effect.

  • Two-tailed test: the null hypothesis is usually that a parameter equals zero.
    • Example: μ₁ - μ₂ = 0, which is the same as saying μ₁ = μ₂.
  • One-tailed test: the null hypothesis states a parameter is greater than or equal to zero, or less than or equal to zero.
    • If you predict μ₁ is larger than μ₂, the null hypothesis (the reverse) is μ₂ - μ₁ ≥ 0, equivalent to μ₁ ≤ μ₂.
  • The null hypothesis is always the opposite of the research prediction in one-tailed tests.

📋 Step 2: Specify the significance level (α)

The significance level (α level) is the threshold probability used to decide whether to reject the null hypothesis.

  • Typical values are 0.05 and 0.01.
  • This is set before computing the probability value.
  • It represents the maximum probability of rejecting the null hypothesis when it is actually true (Type I error rate).

📋 Step 3: Compute the probability value (p value)

The probability value (p value) is the probability of obtaining a sample statistic as different or more different from the parameter specified in the null hypothesis, given that the null hypothesis is true.

  • It measures how extreme the observed data are under the assumption that the null hypothesis is correct.
  • The p value is calculated from the sample data and the null hypothesis.
  • Example: if the null hypothesis is that two groups have equal means, the p value tells you how likely it is to see a difference as large as (or larger than) the one observed if the groups truly have equal means.

📋 Step 4: Compare p value with α level

  • If p < α: reject the null hypothesis.
  • If p ≥ α: do not reject the null hypothesis (findings are inconclusive).
  • The decision is not all-or-none: the lower the p value, the more confidence you can have that the null hypothesis is false.
  • If p is higher than the conventional α level of 0.05, most scientists consider the findings inconclusive.

⚠️ What rejection and non-rejection mean

⚠️ Rejection is not binary

  • Rejecting the null hypothesis is not an all-or-none decision.
  • Lower p values provide more confidence that the null hypothesis is false.
  • Example: a p value of 0.001 gives much stronger evidence against the null hypothesis than a p value of 0.04, even though both lead to rejection at α = 0.05.

⚠️ Failure to reject ≠ accepting the null hypothesis

  • Don't confuse: failing to reject the null hypothesis does not mean the null hypothesis is true or even supported.
  • It only means you do not have sufficiently strong data to reject it.
  • The excerpt emphasizes: "There is never a statistical basis for concluding that an effect is exactly zero."
  • Example: if a treatment shows a 2-minute improvement in sleep time but p > 0.05, you cannot conclude the treatment has zero effect—only that the evidence is not strong enough to reject zero effect.

🔗 Confidence intervals and significance tests

🔗 The relationship between confidence intervals and significance

If a statistic is significantly different from 0 at the 0.05 level, then the 95% confidence interval will not contain 0.

  • All values inside the confidence interval are plausible values for the parameter.
  • Values outside the interval are rejected as plausible values.
  • Example: in the Physicians' Reactions case study, the 95% confidence interval for the difference between means extends from 2.00 to 11.26. Since zero is below 2.00, zero is rejected as a plausible value, so the test is significant (p = 0.0057).

🔗 Why confidence intervals clarify non-acceptance of the null hypothesis

  • Confidence intervals make it clear that you should not accept the null hypothesis even when you fail to reject it.
  • Example: suppose a treatment for insomnia reduces sleep time by 2 minutes on average, and this difference is not significant. If the 95% confidence interval ranges from -4 to 8 minutes, the researcher can conclude the benefit is eight minutes or less, but cannot conclude the null hypothesis is true or supported.
  • The interval shows a range of plausible values, not proof that the true effect is zero.

Scenario summary:

  • Interval excludes zero (e.g., 2.00 to 11.26): test is significant; reject the null hypothesis.
  • Interval includes zero (e.g., -4 to 8): test is not significant; do not reject the null hypothesis, but cannot accept it either.
  • Interval is narrow and excludes zero (e.g., 2.00 to 3.00): strong evidence against the null hypothesis.
  • Interval is wide and includes zero (e.g., -10 to 15): weak evidence; the effect could be small or zero.
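A small sketch of the interval-plus-test logic, assuming SciPy; the mean difference, standard error, and degrees of freedom below are hypothetical placeholders, chosen only to produce an interval that includes zero.

```python
from scipy.stats import t

def ci_and_test(mean_diff, se, df, conf=0.95):
    """Confidence interval and two-tailed p value for a single difference."""
    half = t.ppf(1 - (1 - conf) / 2, df) * se
    p = 2 * t.sf(abs(mean_diff) / se, df)
    return (mean_diff - half, mean_diff + half), p

# Hypothetical numbers: a 2-minute benefit, standard error of 3 minutes, df = 30
interval, p = ci_and_test(2.0, 3.0, 30)
print(interval, round(p, 2))   # interval spans zero and p > 0.05: inconclusive, not "no effect"
```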
79

Significance Testing and Confidence Intervals

Significance Testing and Confidence Intervals

🧭 Overview

🧠 One-sentence thesis

Confidence intervals and significance tests are closely linked: when a test is significant at a given level, the corresponding confidence interval will exclude the null hypothesis value, making clear that failing to reject the null does not mean accepting it.

📌 Key points (3–5)

  • The link between CIs and tests: if a statistic is significantly different from zero at the 0.05 level, the 95% confidence interval will not contain zero.
  • Direction of effect: when significant, all confidence interval values fall on the same side of zero (all positive or all negative), establishing the direction of the difference.
  • Why not to accept the null: when a CI contains zero, every value in the interval is plausible—including zero, but also infinitely many other values—so you cannot conclude the null is true.
  • Common confusion: a non-significant result does not mean the null hypothesis is probably true; it only means the data do not conclusively show it is false.
  • Practical implication: researchers often know the null (exactly zero difference) is false before testing, but significance tests reveal the direction and plausibility of the effect size.

🔗 The relationship between confidence intervals and significance tests

🔗 How CIs tell you about significance

If a statistic is significantly different from 0 at the 0.05 level, then the 95% confidence interval will not contain 0.

  • The confidence interval contains all plausible values for the parameter.
  • Values outside the interval are rejected as plausible.
  • Example: In the Physicians' Reactions case study, the 95% CI extends from 2.00 to 11.26. Since zero is below 2.00, it is rejected, and the test is significant (p = 0.0057).
  • The same relationship holds for 99% CIs and the 0.01 significance level.

🎯 Establishing direction of effect

  • When an effect is significant, all values in the confidence interval are on the same side of zero (either all positive or all negative).
  • This allows the researcher to specify the direction of the effect.
  • Example: If comparing aspirin and acetaminophen for pain relief, the researcher may know beforehand that the null hypothesis of exactly no difference is false, but does not know which drug is better. A significant test with a CI entirely above (or below) zero establishes the direction.

❌ Why you cannot accept the null hypothesis

❌ Non-significant results and plausible values

  • If the 95% confidence interval contains zero, the effect is not significant at the 0.05 level.
  • Zero cannot be rejected as a plausible value, but neither can any other value in the interval.
  • There are infinitely many values in the interval (assuming continuous measurement), and none can be rejected.
  • Don't confuse: "failing to reject the null" with "accepting the null"—the data simply do not provide strong enough evidence either way.

❌ What non-significance really means

A non-significant outcome means that the data do not conclusively demonstrate that the null hypothesis is false.

  • It does not mean the null hypothesis is probably true.
  • Failure to reject the null does not constitute support for it; it just means you lack sufficiently strong data to reject it.

🚫 Common misconceptions about significance testing

🚫 Misconception 1: p-value = probability the null is false

  • Misconception: the p-value is the probability that the null hypothesis is false.
  • Proper interpretation: the p-value is the probability of obtaining data as extreme or more extreme given that the null hypothesis is true; it is the probability of the data given the null, not the probability that the null is false.

🚫 Misconception 2: Low p-value = large effect

  • A low p-value indicates the sample outcome would be very unlikely if the null were true.
  • It does not necessarily indicate a large effect size.
  • A low p-value can occur with small effect sizes if the sample size is large.

🚫 Misconception 3: Non-significant = null is probably true

  • Already covered above: non-significant results mean the data do not conclusively show the null is false.
  • They do not provide evidence that the null is true.

📐 The four-step significance testing process

📐 Step 1: State the null hypothesis

  • The null hypothesis specifies a parameter value to test against.
  • Example: testing whether a claimed psychic can predict coin flips, or whether a drug differs from placebo.

📐 Step 2: Set the significance level (α)

  • The significance level (α) is chosen before computing the probability value.
  • Typical values are 0.05 and 0.01.

📐 Step 3: Compute the p-value

  • The p-value is the probability of obtaining a sample statistic as different or more different from the null hypothesis parameter, given that the null hypothesis is true.

📐 Step 4: Compare and decide

  • If the p-value is lower than α, reject the null hypothesis.
  • Rejecting the null is not all-or-none: the lower the p-value, the more confidence you have that the null is false.
  • If the p-value is higher than the conventional α level of 0.05, most scientists consider findings inconclusive.
80

Misconceptions About Significance Testing

Misconceptions

🧭 Overview

🧠 One-sentence thesis

Three common misconceptions about significance testing lead researchers to misinterpret p-values, effect sizes, and non-significant results.

📌 Key points (3–5)

  • What p-values actually mean: the probability of obtaining data this extreme if the null hypothesis is true, not the probability that the null hypothesis is false.
  • Low p-value ≠ large effect: statistical significance can occur even with small effects when sample sizes are large.
  • Non-significance ≠ null is true: failing to reject the null hypothesis does not prove it is correct; it only means the data are inconclusive.
  • Common confusion: distinguishing between "probability of data given the null" versus "probability of the null given the data"—these are not the same.
  • Why it matters: these misinterpretations can lead to incorrect conclusions about research findings.

❌ The p-value misconception

❌ What people wrongly believe

Many researchers mistakenly think:

The probability value is the probability that the null hypothesis is false.

This is incorrect and represents a fundamental misunderstanding of what significance tests measure.

✅ The correct interpretation

The probability value is the probability of a result as extreme or more extreme given that the null hypothesis is true.

  • The p-value tells you: "If the null hypothesis were true, how likely would I be to see data like this (or more extreme)?"
  • It is the probability of the data given the null hypothesis.
  • It is not the probability that the null hypothesis is false.
  • Don't confuse: P(data | null is true) versus P(null is true | data)—these are fundamentally different conditional probabilities.

🔍 Why this matters

Example: A p-value of 0.05 does not mean there is a 95% chance the null hypothesis is false. It means that if the null hypothesis were true, you would see results this extreme 5% of the time.

📏 The effect size misconception

📏 What people wrongly believe

A common error is thinking:

A low probability value indicates a large effect.

This confuses statistical significance with practical importance.

✅ The correct interpretation

A low probability value indicates that the sample outcome (or one more extreme) would be very unlikely if the null hypothesis were true.

  • A low p-value can occur with small effect sizes, particularly if the sample size is large.
  • Statistical significance depends on both effect size and sample size.
  • With a very large sample, even tiny, practically meaningless differences can be statistically significant.

🔍 Why this matters

Example: An organization tests a new procedure with 10,000 participants and finds a statistically significant improvement (p = 0.001). However, the actual improvement might be so small that it has no practical value—the large sample size made even a tiny effect detectable.
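A quick illustration of this point, as a sketch with made-up numbers: a tiny true effect becomes "highly significant" once the sample is large enough.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.03, scale=1.0, size=200_000)   # true mean shift of only 0.03 SD
t_stat, p = ttest_1samp(sample, popmean=0.0)
print(round(float(sample.mean()), 3), p)                 # mean near 0.03, yet p is vanishingly small
```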

🚫 The non-significance misconception

🚫 What people wrongly believe

Researchers often incorrectly conclude:

A non-significant outcome means that the null hypothesis is probably true.

This treats "failure to reject" as "acceptance," which is logically invalid.

✅ The correct interpretation

A non-significant outcome means that the data do not conclusively demonstrate that the null hypothesis is false.

  • Non-significant results are inconclusive, not confirmatory.
  • The data simply lack sufficient strength to reject the null hypothesis.
  • Many plausible values remain possible, including values far from the null hypothesis.

🔍 Why this matters

Example: A researcher tests whether a new treatment works better than a placebo and finds p = 0.20 (non-significant). This does not prove the treatment is ineffective—it only means the study did not provide strong enough evidence either way. The treatment might still have an effect that a larger study could detect.

🔗 Connection to confidence intervals

The excerpt's earlier discussion explains why non-significance doesn't support the null:

  • Every value in a confidence interval is plausible.
  • If zero is in the interval, it cannot be rejected.
  • However, an infinite number of other values are also in the interval, and none of them can be rejected either.
  • Therefore, failing to reject zero does not make zero more likely than any other plausible value.
81

Testing a Single Mean

Testing a Single Mean

🧭 Overview

🧠 One-sentence thesis

Testing a single mean determines whether a sample mean differs significantly from a hypothesized population mean by calculating the probability that such a difference could occur by chance.

📌 Key points (3–5)

  • What the test does: computes the probability that a sample mean differs from a hypothesized population mean (μ) by the observed amount or more.
  • Two scenarios: when population standard deviation (σ) is known, use the normal distribution and Z; when σ is unknown (typical), estimate it with sample standard deviation (s) and use the t distribution.
  • One-tailed vs two-tailed: one-tailed tests probability in one direction only; two-tailed tests probability of difference in either direction and doubles the probability.
  • Common confusion: the test does not prove the null hypothesis is true when you fail to reject it; it only shows the sample result is not unlikely under the null hypothesis.
  • Key assumption: the population (or population of difference scores) must be normally distributed.

📐 The basic logic and computation

📐 What you are testing

The null hypothesis: the population mean (μ) equals some specific hypothesized value.

  • The test asks: "If the null hypothesis is true, how likely is it to observe a sample mean at least as different from the hypothesized mean as the one we got?"
  • Example: In the subliminal message experiment, nine subjects chose pictures; the null hypothesis is that the population mean number of suggested pictures chosen is 50 (no effect).
  • The sample mean was 51, differing by 1 from the hypothesized mean of 50.

🧮 The sampling distribution of the mean

  • To compute the probability, you need the sampling distribution of the mean.
  • The mean of the sampling distribution equals the hypothesized population mean (μ_M = μ).
  • The standard deviation of the sampling distribution (called the standard error) depends on whether you know the population standard deviation.

🔢 When population standard deviation (σ) is known

🔢 Computing the standard error

  • If σ is known, the standard error is σ divided by the square root of N.
  • Example: In the subliminal message experiment, the variance was 25 (from the binomial distribution), so σ = 5. With N = 9, the standard error = 5/3 = 1.667.

📊 Using the normal distribution

  • Assume the sampling distribution of the mean is normally distributed.
  • Compute the probability of obtaining a sample mean as extreme or more extreme than observed.
  • Example: With a hypothesized mean of 50, standard error of 1.667, and sample mean of 51, the probability of a mean ≥ 51 is 0.274 (one-tailed).
  • Since 0.274 is not very low, the result is not significant; the null hypothesis is not rejected.

🔄 One-tailed vs two-tailed tests

  • One-tailed: computes probability in one direction only (e.g., mean ≥ 51).
  • Two-tailed: computes probability of a difference in either direction (e.g., mean ≤ 49 or mean ≥ 51).
  • The two-tailed probability is twice the one-tailed probability.
  • Example: The two-tailed probability for the subliminal message test is 0.548 (twice 0.274).

🧪 The Z formula (historical method)

  • Before calculators, probabilities were computed using the standard normal distribution.
  • Formula: Z = (M - μ) / σ_M, where M is the sample mean, μ is the hypothesized mean, and σ_M is the standard error.
  • Example: Z = (51 - 50) / 1.667 = 0.60, which gives the same probability as the normal distribution calculator.
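A sketch of the same Z computation, assuming SciPy:

```python
from scipy.stats import norm

mu, se, m = 50, 1.667, 51
z = (m - mu) / se            # 0.60
p_one = norm.sf(z)           # about 0.274 (one-tailed)
p_two = 2 * p_one            # about 0.548 (two-tailed)
print(round(z, 2), round(p_one, 3), round(p_two, 3))
```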

📉 When population standard deviation (σ) is unknown (typical case)

📉 Estimating the standard error

  • In real-world analyses, σ is almost never known.
  • Estimate σ with the sample standard deviation (s).
  • Estimate the standard error (σ_M) with s_M = s / square root of N.

🧬 Using the t distribution

  • When σ is estimated, use the t distribution instead of the normal distribution.
  • Formula: t = (M - μ) / s_M, where M is the sample mean, μ is the hypothesized mean, and s_M is the estimated standard error.
  • Notice the similarity to the Z formula; the difference is using the estimated standard error and the t distribution.

💊 Example: ADHD treatment study

  • 24 children with ADHD were tested under placebo (0 mg) and high dosage (0.6 mg) of methylphenidate.
  • Difference scores (D60 - D0) were computed for each child; positive scores mean better performance under the drug.
  • Null hypothesis: the population mean difference score is 0 (no drug effect).
  • Sample mean difference (M) = 4.958, sample standard deviation (s) = 7.538, N = 24.
  • Estimated standard error = 7.538 / square root of 24 = 1.54.
  • t = 4.958 / 1.54 = 3.22.

🎯 Degrees of freedom and probability

  • Degrees of freedom = N - 1.
  • Example: With N = 24, degrees of freedom = 23.
  • The probability of t ≤ -3.22 or t ≥ 3.22 is 0.0038 (very low).
  • Conclusion: Reject the null hypothesis; the drug condition has a higher population mean than the placebo condition.
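The raw difference scores are not reproduced here, but the reported summary statistics are enough to recompute the test. A sketch assuming SciPy:

```python
from math import sqrt
from scipy.stats import t

m, s, n = 4.958, 7.538, 24          # reported mean, SD, and N of the difference scores
se = s / sqrt(n)                    # about 1.54
t_stat = m / se                     # about 3.22
p_two = 2 * t.sf(abs(t_stat), n - 1)
print(round(t_stat, 2), round(p_two, 4))   # about 3.22 and 0.0038 with df = 23
```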

⚠️ Don't confuse: difference scores vs individual scores

  • In the ADHD example, the assumption is that the population of difference scores is normally distributed, not necessarily the individual scores in each condition.
  • Each difference score represents one subject's change, so the test is on a single mean (the mean of the differences).

✅ Required assumptions

✅ Independence

  • Each value must be sampled independently from each other value.
  • Violating this assumption can invalidate the test.

✅ Normality

  • The values must be sampled from a normal distribution.
  • In the case of difference scores (like the ADHD example), the population of difference scores must be normally distributed.
82

Differences between Two Means (Independent Groups)

Differences between Two Means (Independent Groups)

🧭 Overview

🧠 One-sentence thesis

When comparing means from two independent groups, the t-test evaluates whether the observed difference is large enough (relative to sampling variability) to reject the null hypothesis that the population means are equal.

📌 Key points (3–5)

  • What the test measures: whether the difference between two sample means is statistically significant, i.e., unlikely to occur if the population means are truly equal.
  • Three key assumptions: homogeneity of variance (both populations have the same variance), normal distributions, and independence (each subject provides only one score).
  • How the test works: compute the difference between sample means, estimate the standard error of that difference, then calculate t and its probability.
  • Common confusion: small-to-moderate violations of the variance and normality assumptions are tolerable, but violating independence (e.g., one subject providing two scores) is serious.
  • Unequal sample sizes: when groups have different sizes, the variance estimate and standard error formulas must be adjusted using the harmonic mean.

📐 Core logic and formula

📐 The general significance-test structure

The excerpt applies the general significance-test formula:

  • Statistic: the difference between the two sample means (M₁ - M₂).
  • Hypothesized value: 0 (the null hypothesis states that the population means are equal).
  • Standard error: the estimated standard error of the difference between means.

🧮 The t formula for independent groups

t = (M₁ - M₂) / s(M₁ - M₂)

  • M₁ and M₂ are the two sample means.
  • s(M₁ - M₂) is the estimated standard error of the difference.
  • The null hypothesis is that the population mean difference is 0, so we do not subtract a hypothesized value.

🔧 Three assumptions

🔧 Homogeneity of variance

The two populations have the same variance.

  • This assumption allows us to pool the two sample variances into a single estimate of the population variance.
  • Small-to-moderate violations do not seriously affect the test.

📊 Normality

The populations are normally distributed.

  • The test assumes that the scores in each population follow a normal distribution.
  • Again, small-to-moderate violations are tolerable.

🚫 Independence

Each value is sampled independently from each other value; each subject provides only one score.

  • If a subject provides two scores, the scores are not independent.
  • Don't confuse: this is the most critical assumption; violating it is serious and requires a different test (the correlated t-test, covered later in the chapter).

🧪 Worked example: Animal Research study

🧪 The data

The excerpt uses data from the "Animal Research" case study:

  • Females: n = 17, mean = 5.353, variance = 2.743
  • Males: n = 17, mean = 3.882, variance = 2.985
  • Sample difference: 5.353 - 3.882 = 1.47

The question is whether this difference reflects a real population difference or just sampling variability.

🧮 Step 1: Compute the statistic

M₁ - M₂ = 5.3529 - 3.8824 = 1.4705

🧮 Step 2: Estimate the variance (MSE)

Because we assume homogeneity of variance, we average the two sample variances:

  • MSE = (2.743 + 2.985) / 2 = 2.864
  • MSE is the estimate of the common population variance (σ²).

🧮 Step 3: Compute the standard error

The formula for the standard error of the difference between means (when sample sizes are equal) is:

  • s(M₁ - M₂) = square root of (2 × MSE / n)
  • Here: square root of (2 × 2.864 / 17) = 0.5805

🧮 Step 4: Compute t

t = 1.4705 / 0.5805 = 2.533

🧮 Step 5: Find the probability

  • Degrees of freedom: (n₁ - 1) + (n₂ - 1) = 16 + 16 = 32
  • Using a t distribution calculator with df = 32, the two-tailed probability for t = 2.533 is 0.0164.
  • This is the probability of observing a t as extreme as ±2.533 if the null hypothesis (no population difference) were true.
  • A one-tailed test would give half this probability: 0.0082.

🎯 Conclusion

Because the probability (0.0164) is low, we reject the null hypothesis and conclude that the population mean for females is different from the population mean for males.
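The five steps above can be reproduced from the summary statistics alone; a sketch assuming SciPy (with the raw scores, scipy.stats.ttest_ind would give the same result directly).

```python
from math import sqrt
from scipy.stats import t

m1, v1, n1 = 5.353, 2.743, 17    # females: mean, variance, n
m2, v2, n2 = 3.882, 2.985, 17    # males: mean, variance, n

mse = (v1 + v2) / 2                      # 2.864 (equal n, so a simple average)
se = sqrt(2 * mse / n1)                  # 0.5805
t_stat = (m1 - m2) / se                  # 2.533
df = (n1 - 1) + (n2 - 1)                 # 32
p_two = 2 * t.sf(abs(t_stat), df)
print(round(t_stat, 3), df, round(p_two, 4))   # about 2.533, 32, 0.0164
```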

⚖️ Unequal sample sizes

⚖️ Why the calculation changes

When the two groups have different sample sizes, the variance estimate (MSE) must weight the larger group more heavily, and the standard error formula must use the harmonic mean of the sample sizes.

🧮 Computing SSE and MSE

The sum of squares error (SSE) is:

  • SSE = sum of (each score in group 1 - M₁)² + sum of (each score in group 2 - M₂)²
  • Then MSE = SSE / df, where df = (n₁ - 1) + (n₂ - 1)

Example from the excerpt:

  • Group 1: 3, 4, 5 (M₁ = 4)
  • Group 2: 2, 4 (M₂ = 3)
  • SSE = (3-4)² + (4-4)² + (5-4)² + (2-3)² + (4-3)² = 4
  • df = (3-1) + (2-1) = 3
  • MSE = 4 / 3 = 1.333

🧮 The harmonic mean

The harmonic mean (nₕ) is computed as:

  • nₕ = 2 / (1/n₁ + 1/n₂)
  • In the example: nₕ = 2 / (1/3 + 1/2) = 2.4

Then the standard error is:

  • s(M₁ - M₂) = square root of (2 × MSE / nₕ)
  • = square root of (2 × 1.333 / 2.4) = 1.054

Finally:

  • t = (4 - 3) / 1.054 = 0.949
  • Two-tailed p = 0.413 (not significant)
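A sketch of the unequal-n calculation using the toy data above, assuming SciPy:

```python
from math import sqrt
from scipy.stats import t

group1, group2 = [3, 4, 5], [2, 4]
m1, m2 = sum(group1) / len(group1), sum(group2) / len(group2)

sse = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)   # 4
df = (len(group1) - 1) + (len(group2) - 1)                                      # 3
mse = sse / df                                                                  # 1.333

nh = 2 / (1 / len(group1) + 1 / len(group2))   # harmonic mean = 2.4
se = sqrt(2 * mse / nh)                        # 1.054
t_stat = (m1 - m2) / se                        # 0.949
p_two = 2 * t.sf(abs(t_stat), df)
print(round(t_stat, 3), round(p_two, 3))       # about 0.949 and 0.413
```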

💾 Formatting data for computer analysis

💾 The required format

Most statistical software requires data in "long" format with two variables:

  • One variable specifies the group (e.g., 1 or 2).
  • The other variable contains the score.

💾 Example transformation

Original data (wide format):

  Group 1   Group 2
  3         2
  4         6
  5         8

Reformatted data (long format):

  G   Y
  1   3
  1   4
  1   5
  2   2
  2   6
  2   8

  • G = group identifier
  • Y = score

This format allows the software to correctly identify which scores belong to which group.
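If the data start in wide format, reshaping to long format is a short step in pandas; a sketch assuming pandas is available (column names are illustrative).

```python
import pandas as pd

wide = pd.DataFrame({"Group 1": [3, 4, 5], "Group 2": [2, 6, 8]})

# Melt into long format: one column identifying the group, one holding the score
long = wide.melt(var_name="G", value_name="Y")
long["G"] = long["G"].map({"Group 1": 1, "Group 2": 2})
print(long)
```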

83

All Pairwise Comparisons Among Means

All Pairwise Comparisons Among Means

🧭 Overview

🧠 One-sentence thesis

The Tukey HSD test controls the inflated Type I error rate that arises when comparing multiple group means, unlike performing multiple independent t tests which increases the probability of false positives.

📌 Key points (3–5)

  • The core problem: doing separate t tests for all pairs of means inflates the Type I error rate—with 12 means, the probability of at least one false positive reaches about 0.70.
  • What Tukey HSD does: controls the Type I error rate by using the studentized range distribution, which accounts for the number of means being compared.
  • How it differs from t tests: computations are similar to independent-groups t tests, but the test statistic (Q) is evaluated against a different distribution.
  • Common confusion: Tukey HSD is often presented as a "follow-up to ANOVA," but it can be used independently without performing ANOVA first.
  • Interpreting non-significant results: failing to reject the null does not mean two means are the same; it only means there is not convincing evidence they are different.

🚨 The multiple comparisons problem

🚨 Why multiple t tests inflate error rates

  • When you compare more than two groups, you naturally want to test all pairs of means.
  • Each individual test has a Type I error rate (e.g., 0.05), but performing multiple tests increases the overall probability of making at least one Type I error.
  • Example: with 12 means, there are 66 possible pairwise comparisons; if all population means were actually the same, the probability of at least one false positive is about 0.70.

📈 How error rates grow with more means

  • The number of pairwise comparisons grows rapidly: 2 means = 1 comparison, 12 means = 66 comparisons.
  • The probability of a Type I error increases as a function of the number of means being compared.
  • The excerpt shows that with just 6 means, the error rate is already substantially inflated above the nominal 0.05 level.

🔧 The Tukey HSD test

🔧 What Tukey HSD is

Tukey Honestly Significant Difference test: a method that controls the Type I error rate when comparing multiple means by using the studentized range distribution.

  • The studentized range distribution takes into account the number of means being compared, unlike the standard t distribution.
  • It adjusts the critical values to maintain the overall Type I error rate at the desired level (e.g., 0.05) across all comparisons.

🧮 How to compute Tukey HSD

The steps are similar to an independent-groups t test:

  1. Compute means and variances for each group.
  2. Compute MSE (Mean Squared Error), which is simply the mean of all group variances.
  3. Compute the test statistic for each pair of means: Q = (Mᵢ - Mⱼ) / sqrt(MSE / n), where n is the number of observations per group.
  4. Compute p-values using the Studentized Range Calculator, with degrees of freedom equal to the total number of observations minus the number of means.

Example from the excerpt: In the leniency study with 4 groups and 34 observations per group, df = 136 - 4 = 132. The only significant comparison was between the false smile and the neutral smile.
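The p-values reported later in this section can be checked from the Q statistics using the studentized range distribution; a sketch assuming a recent SciPy version (which provides scipy.stats.studentized_range).

```python
from scipy.stats import studentized_range

k, df = 4, 132                    # 4 means; 136 observations - 4 groups = 132 error df
for q in (1.65, 4.48):            # Q statistics reported for the leniency comparisons
    print(q, round(studentized_range.sf(q, k, df), 3))
# q = 1.65 -> p about 0.65 (not significant); q = 4.48 -> p about 0.01 (significant)
```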

⚖️ Equal vs unequal sample sizes

  • When sample sizes are equal across groups, use the common sample size n in calculations.
  • When sample sizes are unequal, use the harmonic mean of the two sample sizes for each comparison.
  • SSE (Sum of Squares Error) is computed by summing the squared deviations of each observation from its group mean across all groups.
  • MSE is then computed as SSE divided by degrees of freedom error (total observations minus number of groups).

🧩 Interpreting results correctly

🧩 The paradox of non-significant differences

The excerpt presents an apparent contradiction from the leniency study:

  • False smile = Miserable smile (not significant)
  • Miserable smile = Neutral control (not significant)
  • False smile ≠ Neutral control (significant)

Why this is not actually a contradiction:

  • Non-significant does not mean "the same"—it means "not enough evidence to conclude they are different."
  • The proper conclusion: the false smile is higher than the control, and the miserable smile is either (a) equal to the false smile, (b) equal to the control, or (c) somewhere in-between.
  • Don't confuse: "fail to reject the null" with "accept the null"—these are not the same thing.

📊 Comparison table structure

The excerpt shows results in a table format:

  Comparison          Difference   Q      p-value
  False - Felt        0.46         1.65   0.649
  False - Miserable   0.46         1.65   0.649
  False - Neutral     1.25         4.48   0.01

Only the comparison with p < 0.05 is considered significant.

🔍 Assumptions and practical considerations

🔍 What Tukey HSD assumes

The assumptions are essentially the same as for an independent-groups t test:

  • Normality: the test is quite robust to violations of this assumption.
  • Homogeneity of variance: more problematic than in the two-sample case because MSE is based on data from all groups.
  • Independence of observations: important and should not be violated.

💻 Computer analysis and ANOVA connection

  • Format data the same way as for independent-groups t tests, but code each group with a unique number (1, 2, 3, 4, etc.).
  • Some smaller programs cannot compute Tukey's test directly but can compute ANOVA, which provides the MSE needed for Tukey calculations.
  • The "Mean Square Error" from an ANOVA summary table is the same MSE used in Tukey's test.

🎯 Tukey as a standalone test

  • Common misconception: Tukey HSD must follow an ANOVA.
  • Reality: there is no logical or statistical reason why Tukey cannot be used independently of ANOVA.
  • The excerpt cites Wilkinson and the Task Force on Statistical Inference (1999) as support for this position.
  • You do not need to compute or even understand ANOVA to use the Tukey test appropriately.
84

Specific Comparisons (Independent Groups)

Specific Comparisons (Independent Groups)

🧭 Overview

🧠 One-sentence thesis

Planned comparisons using linear combinations and coefficients allow researchers to test complex hypotheses about differences among multiple group means beyond simple pairwise tests, while controlling for Type I error through methods like the Bonferroni correction.

📌 Key points (3–5)

  • What planned comparisons do: test complex hypotheses (e.g., comparing averages of multiple groups or differences between differences) decided before looking at data.
  • How linear combinations work: express any comparison as a weighted sum of means using coefficients that sum to zero.
  • Testing significance: use the formula involving L (the linear combination), MSE (mean squared error), coefficients, and sample sizes to compute a t-statistic.
  • Common confusion: per-comparison vs familywise error rate—testing multiple comparisons increases the chance of at least one Type I error across the family, not just for individual tests.
  • Why it matters: allows testing interactions (when one variable's effect differs by level of another) and controlling overall error rates when making multiple comparisons.

🔧 What are planned comparisons

🔧 Definition and purpose

Planned comparisons: comparisons among means that are decided on before looking at the data.

  • These go beyond simple two-group comparisons (like an independent-groups t test).
  • They allow testing more complex hypotheses, such as:
    • Comparing the average of several groups to the average of other groups
    • Testing whether differences between groups vary across conditions (interactions)
  • The excerpt contrasts planned comparisons with unplanned comparisons, which require different procedures (not covered in this excerpt).

🧮 Linear combinations and coefficients

Linear combination: a weighted sum of means, expressed as L = c₁M₁ + c₂M₂ + ... + cₖMₖ, where cᵢ is the coefficient and Mᵢ is the mean of the i-th group.

  • Coefficients are the multipliers applied to each group mean.
  • The sum of all coefficients must equal zero for valid comparisons.
  • Example from the excerpt: To compare success vs failure conditions across two esteem groups, use coefficients 0.5, 0.5, -0.5, -0.5 for the four groups. The products sum to L = 0.083.

Don't confuse: The linear combination L is not the mean itself—it's a weighted difference that represents the specific comparison you want to test.

📊 Testing a specific comparison

📊 The significance test formula

The excerpt provides this formula for testing whether L differs significantly from zero:

t = L / sqrt(MSE × Σ(cᵢ² / n))

Where:

  • L = the linear combination (the weighted sum of means)
  • MSE = mean squared error (the average of the group variances)
  • cᵢ = the coefficient for group i
  • n = number of subjects in each group

📊 Degrees of freedom

df = N - k

Where:

  • N = total number of subjects across all groups
  • k = number of groups

Example from the excerpt: With 24 subjects and 4 groups, df = 20.

📊 Step-by-step process

  1. Define the comparison using coefficients that sum to zero.
  2. Compute L by multiplying each mean by its coefficient and summing.
  3. Calculate MSE as the mean of all group variances.
  4. Compute the sum of (each coefficient squared divided by n).
  5. Calculate t using the formula above.
  6. Find the p-value using the t-distribution with df = N - k.

Example from the excerpt: Testing success vs failure (ignoring esteem) yielded t = 0.16, p = 0.874 (not significant).
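A sketch of this test under the values used in this section's examples (MSE = 1.625 and n = 6 per group, with the means ordered High/Success, High/Failure, Low/Success, Low/Failure as in the interaction table later in this section); the helper name is illustrative.

```python
from math import sqrt
from scipy.stats import t

def planned_comparison(means, coeffs, mse, n, df):
    """t test of a linear combination of group means (equal n per group)."""
    L = sum(c * m for c, m in zip(coeffs, means))
    se = sqrt(mse * sum(c ** 2 / n for c in coeffs))
    t_stat = L / se
    return round(L, 3), round(t_stat, 2), round(2 * t.sf(abs(t_stat), df), 4)

means = [7.333, 4.833, 5.500, 7.833]   # High/Success, High/Failure, Low/Success, Low/Failure
mse, n, df = 1.625, 6, 20              # MSE, n per group, df = 24 - 4

# Success vs failure, ignoring esteem (coefficients reordered to match the means above)
print(planned_comparison(means, [0.5, -0.5, 0.5, -0.5], mse, n, df))   # L ~ 0.083, t ~ 0.16, p ~ 0.87

# Interaction: the difference between differences
print(planned_comparison(means, [1, -1, -1, 1], mse, n, df))           # L ~ 4.83, t ~ 4.64, p ~ 0.0002
```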

🔀 Testing interactions (differences between differences)

🔀 What is an interaction

Interaction: when the effect of one variable differs as a function of the level of another variable.

  • The excerpt's example: Does the effect of outcome (success/failure) differ depending on self-esteem level?
  • For high-self-esteem subjects: success - failure = 7.333 - 4.833 = 2.500
  • For low-self-esteem subjects: success - failure = 5.500 - 7.833 = -2.333
  • The difference between these differences = 2.500 - (-2.333) = 4.833

🔀 Coefficients for testing interactions

To test a difference between differences, the coefficients form a pattern:

  Self-Esteem   Outcome   Mean    Coefficient
  High          Success   7.333    1
  High          Failure   4.833   -1
  Low           Success   5.500   -1
  Low           Failure   7.833    1

  • Notice the coefficients: 1, -1, -1, 1 (sum = 0).
  • This pattern captures: (High Success - High Failure) - (Low Success - Low Failure).

🔀 Interpreting the result

In the excerpt's example:

  • L = 4.83
  • Sum of squared coefficients = 1² + (-1)² + (-1)² + 1² = 4, and n = 6 per group, so Σ(cᵢ²/n) = 4/6
  • With MSE = 1.625: t = 4.83 / sqrt(1.625 × 4/6) = 4.64
  • p = 0.0002 (highly significant), with df = 24 - 4 = 20

Interpretation: Success led to more self-attribution for high-self-esteem subjects but less self-attribution for low-self-esteem subjects—the effect of outcome depends on esteem level.

⚠️ Controlling error rates with multiple comparisons

⚠️ Two types of error rates

The excerpt distinguishes:

  • Per-comparison error rate: the probability of a Type I error for a single comparison (e.g., α = 0.05).
  • Familywise error rate (FW): the probability of making one or more Type I errors across a family (set) of comparisons.

Common confusion: If you run multiple comparisons each at α = 0.05, your chance of at least one false positive across all tests is higher than 0.05.

⚠️ The Bonferroni inequality

Bonferroni inequality: FW ≤ c × α, where c is the number of comparisons and α is the per-comparison error rate.

  • This provides an upper bound (conservative approximation) for the familywise error rate.
  • In practice, FW is approximated by c × α.
  • Example: If you make 2 comparisons at α = 0.05 each, FW ≤ 2 × 0.05 = 0.10.

⚠️ The Bonferroni correction

Bonferroni correction: To control familywise error rate at α, use α/c as the per-comparison error rate.

  • If you want FW = 0.05 and plan c = 5 comparisons, test each at α = 0.05/5 = 0.01.
  • This correction is conservative—it ensures FW does not exceed your target α.
  • Because the correction is conservative, the actual familywise error rate will generally come out below the target level (the excerpt's sentence is cut off at this point).

Don't confuse: The Bonferroni correction makes each individual test harder to pass (lower α per test) to keep the overall family error rate under control.
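A sketch of the correction in code, assuming statsmodels is available; the p-values are hypothetical placeholders.

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.011, 0.028, 0.003, 0.040, 0.251]   # hypothetical per-comparison p-values
reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05, method="bonferroni")
print(alpha_bonf)   # 0.05 / 5 = 0.01 per-comparison threshold
print(reject)       # only the test with p < 0.01 survives the correction
```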

🖥️ Computational notes

🖥️ Data formatting

  • Format data the same way as for independent-groups t test.
  • Code groups as 1, 2, 3, 4, etc. (instead of just 1 or 2 for two groups).

🖥️ Software and ANOVA

  • Full-featured programs (SAS, SPSS, R) can compute Tukey's test directly.
  • Smaller programs may not, but they can compute Analysis of Variance (ANOVA), which provides the MSE needed for planned comparisons.
  • The excerpt shows an ANOVA summary table where the "Mean Square" (MS) in the "Error" row is the MSE (2.6489 in the example, rounded to 2.65).

🖥️ Unequal sample sizes (optional)

When groups have different numbers of subjects:

  1. Compute Sum of Squares Error (SSE): sum across all groups of the squared deviations from each group mean.
  2. Compute degrees of freedom: df_e = N - k.
  3. Compute MSE: MSE = SSE / df_e.
  4. For each comparison: use the harmonic mean of the two sample sizes (n_h) instead of a single n.

All other calculations remain the same.

85

Difference Between Two Means (Correlated Pairs)

Difference Between Two Means (Correlated Pairs)

🧭 Overview

🧠 One-sentence thesis

When the same subjects are measured under two conditions, a correlated-pairs t test (which treats each subject as their own control) has greater power than an independent-groups test because it removes between-subject variability from the analysis.

📌 Key points (3–5)

  • When to use correlated pairs: when the two sets of scores come from the same subjects tested in both conditions, not from independent groups.
  • How the test works: compute the difference score for each subject (Condition A minus Condition B), then test whether the mean of these differences is significantly different from zero.
  • Why it has more power: each subject serves as their own control, so differences between subjects do not inflate the standard error; the result is a larger t value and greater ability to detect real effects.
  • Common confusion: correlated pairs vs independent groups—if you mistakenly use an independent-groups test on correlated data, you will lose power and may fail to detect a real difference.
  • Alternative names: this test is also called a "correlated t test" or "related-pairs t test."

🔍 Recognizing correlated pairs vs independent groups

🔍 What makes data correlated

  • The excerpt emphasizes that correlated pairs occur when the same subjects are tested under both conditions.
  • Example: In the ADHD Treatment study, 24 children were each tested under a placebo (D0) and a high-dosage (D60) condition; the D0 scores and D60 scores come from the same children, so the two sets of scores are not independent.
  • The scatter plot in the excerpt shows that children who score higher in D0 tend to score higher in D60 (r = 0.80), confirming the two variables are not independent.

⚠️ Don't confuse with independent groups

  • Independent groups: two separate groups of subjects, one group tested in Condition A and a different group tested in Condition B.
  • Correlated pairs: one group of subjects, each subject tested in both Condition A and Condition B.
  • The excerpt warns that using an independent-groups t test on correlated data will give the wrong result (lower power, possibly non-significant when the true difference is real).

🧮 How to compute the correlated-pairs t test

🧮 Step-by-step procedure

  1. Compute difference scores: for each subject, subtract the score in one condition from the score in the other condition (e.g., D60 minus D0).
  2. Test the mean difference: treat the difference scores as a single sample and test whether their mean is significantly different from zero (using the single-mean t test).
  3. The excerpt states: "In general, the correlated t test is computed by first computing the differences between the two scores for each subject. Then, a test of a single mean is computed on the mean of these difference scores."

📊 Example from the excerpt

  • The ADHD study compared placebo (D0) and 60-mg dosage (D60) conditions.
  • Difference scores (D60 minus D0) were computed for each of the 24 children (shown in Table 1).
  • The mean difference score was 4.96.
  • The t test result: t = 3.22, degrees of freedom = 23, p = 0.0038 (significant).
  • Interpretation: the high-dosage condition produced significantly more correct responses than the placebo condition.
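
A minimal sketch of this procedure in Python (NumPy and SciPy assumed). The five score pairs below are made up purely to show the mechanics; they are not the ADHD data:

```python
import numpy as np
from scipy import stats

def correlated_pairs_t(cond_a, cond_b):
    """Paired t test: a single-mean t test on the per-subject difference scores."""
    diffs = np.asarray(cond_a, dtype=float) - np.asarray(cond_b, dtype=float)
    n = len(diffs)
    sem = diffs.std(ddof=1) / np.sqrt(n)   # standard error of the mean difference
    t = diffs.mean() / sem
    p = 2 * stats.t.sf(abs(t), n - 1)      # two-tailed p value
    return diffs.mean(), t, n - 1, p

d60 = [57, 62, 49, 55, 60]   # hypothetical high-dosage scores
d0  = [50, 59, 47, 51, 52]   # hypothetical placebo scores
print(correlated_pairs_t(d60, d0))   # (mean difference, t, df, p)
```

Run on the actual 24 pairs of D0 and D60 scores, this same computation is what yields t = 3.22, df = 23, p = 0.0038; scipy.stats.ttest_rel gives an identical result.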

💪 Why correlated-pairs tests have greater power

💪 Each subject as their own control

  • The excerpt explains that in a correlated t test, "each difference score is a comparison of performance in one condition with the performance of that same subject in another condition."
  • This makes each subject "their own control," meaning that stable individual differences (e.g., one child is generally better at the task than another) are removed from the analysis.
  • Between-subject variability does not enter into the difference scores, so the standard error of the difference between means is smaller.

💪 Smaller standard error, larger t

  • The formula for t has the standard error in the denominator.
  • A smaller standard error results in a larger t value, making it easier to detect a real effect (higher power).
  • The excerpt states: "This is a typical result: correlated t tests almost always have greater power than independent-groups t tests."

💪 Contrast with independent-groups test

  • The excerpt shows what happens if you mistakenly use an independent-groups test on the same data:
    • Independent-groups result: t = 1.42, df = 46, p = 0.15 (not significant).
    • Correlated-pairs result: t = 3.22, df = 23, p = 0.0038 (significant).
  • The independent-groups test failed to detect the difference because it included between-subject variability in the standard error, inflating it and reducing power.

🔢 Testing differences between differences (interaction)

🔢 What is a difference between differences

  • The excerpt introduces a more complex question: does the effect of one variable (e.g., success vs failure outcome) differ depending on the level of another variable (e.g., high vs low self-esteem)?
  • To test this, you compute "a difference between differences."
  • Example from the excerpt:
    • For high-self-esteem subjects: success minus failure = 7.333 minus 4.833 = 2.500.
    • For low-self-esteem subjects: success minus failure = 5.500 minus 7.833 = -2.333.
    • Difference between differences: 2.500 minus (-2.333) = 4.833.

🔢 How to compute it with coefficients

  • The excerpt shows that the difference between differences can be expressed as a weighted sum using coefficients.
  • The calculation (7.33 - 4.83) - (5.5 - 7.83) can be rewritten as (1)×7.33 + (-1)×4.83 + (-1)×5.5 + (1)×7.83.
  • The coefficients (1, -1, -1, 1) are applied to the four group means, and the sum of products gives the difference between differences (4.83).
  • The excerpt then computes a t test on this difference; the result is p = 0.0002 (highly significant).
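
The same arithmetic as a weighted sum, in a minimal sketch:

```python
means = [7.33, 4.83, 5.5, 7.83]   # the four group means from the excerpt
coefs = [1, -1, -1, 1]            # coefficients for the difference between differences
L = sum(c * m for c, m in zip(coefs, means))
print(round(L, 2))                # 4.83
```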

🔢 Interaction interpretation

  • The excerpt explains that this test is checking for an interaction: "there is an interaction when the effect of one variable differs as a function of the level of another variable."
  • In the example: for high-self-esteem subjects, success led to more self-attribution than failure; for low-self-esteem subjects, success led to less self-attribution than failure.
  • The effect of outcome (success vs failure) depends on self-esteem level.

🧪 Multiple comparisons and error rates

🧪 Per-comparison vs familywise error rate

Per-comparison error rate: the probability of a Type I error for a particular comparison.

Familywise error rate: the probability of making one or more Type I errors in a family or set of comparisons.

  • If you do multiple comparisons, your chance of making at least one Type I error increases.
  • Example: if you use alpha = 0.05 for each of two comparisons, the per-comparison rate is 0.05, but the familywise rate is higher (up to 2 × 0.05 = 0.10).

🧪 Bonferroni inequality and correction

Bonferroni inequality: FW ≤ c × alpha, where FW is the familywise error rate, c is the number of comparisons, and alpha is the per-comparison error rate.

  • The excerpt states that "FW can be approximated by c × alpha" and that this is a conservative approximation (FW is generally less than c × alpha).
  • Bonferroni correction: to control the familywise error rate at alpha, use alpha/c as the per-comparison error rate for each test.
  • Alternatively, multiply the p-value by c and compare it to the original alpha level.
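
A minimal sketch of both forms of the correction (the example numbers are only illustrative):

```python
def bonferroni_alpha(familywise_alpha, n_comparisons):
    """Per-comparison alpha that keeps the familywise rate at the target level."""
    return familywise_alpha / n_comparisons

def bonferroni_adjust_p(p_value, n_comparisons):
    """Equivalent form: inflate the p value and compare it to the original alpha."""
    return min(1.0, p_value * n_comparisons)

print(bonferroni_alpha(0.05, 2))     # 0.025 per comparison
print(bonferroni_adjust_p(0.03, 2))  # 0.06 -> not significant at alpha = 0.05
```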

🧪 Should you control familywise error rate?

  • Disadvantage of controlling: it makes it harder to reach significance for any given comparison (lower power).
  • Advantage of controlling: lower chance of making a Type I error.
  • The excerpt argues that the decision depends on whether the comparisons are testing related hypotheses or completely different hypotheses.
  • Example: if one researcher tests male vs female babies' crawling age and a colleague uses the same data to test winter-born vs summer-born babies, the excerpt says "there is no reason you should be penalized (by lower power) just because your colleague used the same data to address a different research question." Therefore, controlling the familywise rate is not necessary.
  • Contrast: if you do four comparisons (male vs female, over 40 vs under 40, vegetarians vs non-vegetarians, firstborns vs others) all asking whether anything affects coin-flip prediction, the excerpt notes that "the whole series of comparisons could be seen as addressing the general question of whether anything affects the ability to predict the outcome of a coin flip," so controlling the familywise rate may be appropriate.

🔗 Orthogonal (independent) comparisons

🔗 What orthogonal means

Orthogonal comparisons: independent comparisons; if the sum of the products of the coefficients is 0, then the comparisons are orthogonal.

  • The excerpt provides a simple test: multiply the coefficients of two comparisons element-by-element, then sum the products; if the sum is 0, the comparisons are orthogonal.

🔗 Example of orthogonal comparisons

  • Table 6 in the excerpt shows two comparisons from the attribution study.
  • Comparison 1 (C1) coefficients: 0.5, 0.5, -0.5, -0.5.
  • Comparison 2 (C2) coefficients: 1, -1, -1, 1.
  • Products: 0.5, -0.5, 0.5, -0.5.
  • Sum of products: 0.5 + (-0.5) + 0.5 + (-0.5) = 0.
  • Therefore, the two comparisons are orthogonal.

🔗 Example of non-orthogonal comparisons

  • Table 7 shows two comparisons that are not orthogonal.
  • Comparison 1: high-self-esteem vs low-self-esteem for the whole sample (coefficients 0.5, -0.5, 0.5, -0.5).
  • Comparison 2: high-self-esteem vs low-self-esteem for the success group only (coefficients 0.5, -0.5, 0, 0).
  • Sum of products: 0.25 + 0.25 + 0 + 0 = 0.5 (not 0).
  • The excerpt notes that "the comparison of high-self-esteem subjects to low-self-esteem subjects for the whole sample is not independent of the comparison for the success group only," so they are not orthogonal.
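
A small sketch of this coefficient check, applied to the comparisons in Tables 6 and 7 (a tolerance guards against floating-point rounding):

```python
def are_orthogonal(c1, c2, tol=1e-9):
    """Orthogonal if the sum of element-by-element coefficient products is 0."""
    return abs(sum(a * b for a, b in zip(c1, c2))) < tol

print(are_orthogonal([0.5, 0.5, -0.5, -0.5], [1, -1, -1, 1]))     # True  (Table 6)
print(are_orthogonal([0.5, -0.5, 0.5, -0.5], [0.5, -0.5, 0, 0]))  # False (Table 7: sum = 0.5)
```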
86

Specific Comparisons (Correlated Observations)

Specific Comparisons (Correlated Observations)

🧭 Overview

🧠 One-sentence thesis

When comparing means from the same subjects measured under different conditions, a correlated t-test (treating each subject as their own control) has greater statistical power than an independent-groups test because it removes between-subject variability.

📌 Key points (3–5)

  • When to use correlated comparisons: when the same subjects are measured in multiple conditions, creating paired/dependent observations rather than independent groups.
  • How it works: compute the difference score for each subject across conditions, then test whether the mean of these differences is significantly different from zero.
  • Why it's more powerful: correlated t-tests make each subject "their own control," removing individual differences from the analysis and reducing the standard error.
  • Common confusion: correlated vs independent groups—if the same subjects appear in both conditions, the data are correlated (not independent), and using an independent-groups test will have lower power and may fail to detect a real difference.
  • Linear comparisons: you can test complex hypotheses (e.g., weighted combinations of conditions) by computing a comparison score for each subject and testing if its mean differs from zero.

🔄 Correlated vs Independent Groups

🔄 Recognizing correlated pairs

Correlated pairs (also called "related pairs" or "repeated measures"): scores from the same subjects measured under different conditions.

  • The key indicator: one group of subjects, each tested in multiple conditions.
  • Example from the excerpt: 24 children with ADHD tested under both placebo (D0) and high-dosage (D60) conditions—the same children appear in both datasets.
  • The excerpt shows a scatter plot with correlation r = 0.80 between D0 and D60 scores, confirming the two variables are not independent.

⚠️ Why independent-groups methods fail here

  • If you mistakenly use an independent-groups t-test on correlated data, you ignore the pairing.
  • The excerpt gives a concrete example: using the wrong method yielded t = 1.42, p = 0.15 (not significant), while the correct correlated t-test yielded t = 3.22, p = 0.0038 (significant).
  • Don't confuse: independent groups = different subjects in each condition; correlated pairs = same subjects in each condition.

🧮 Computing the Correlated t-Test

🧮 Step-by-step procedure

  1. Compute difference scores: for each subject, subtract one condition's score from the other (e.g., D60 - D0).
  2. Test the mean difference: use a single-sample t-test to determine if the mean of these difference scores is significantly different from zero.

The excerpt provides a table showing D0, D60, and D60-D0 for each child; the mean difference is 4.96, which is tested against zero.

📐 The t formula

The excerpt references the single-mean t-test formula (shown in words):

  • t = (sample mean of differences minus hypothesized population mean) divided by (standard error of the mean of differences).
  • For testing whether a difference exists, the hypothesized population mean is 0.

Example from the excerpt (the weapons-and-aggression L₁ comparison described below): mean difference = 5.875, standard error = 4.2646, so t = 5.875 / 4.2646 = 1.378.

🔋 Why Correlated Tests Have Greater Power

🔋 Subjects as their own controls

  • Core mechanism: each difference score compares one subject's performance in condition A with that same subject's performance in condition B.
  • This removes between-subject variability—differences in baseline ability, motivation, etc., do not inflate the error term.
  • The excerpt states: "This makes each subject 'their own control' and keeps differences between subjects from entering into the analysis."

📉 Smaller standard error

The excerpt explains (in an optional technical section) that the variance of difference scores depends on:

  • Variance in condition X
  • Variance in condition Y
  • Minus twice the product of (correlation × SD of X × SD of Y)

The formula (in words): Variance of (X - Y) = Variance of X + Variance of Y - 2 × r × SD_X × SD_Y.

  • Higher correlation → lower variance of differences → smaller standard error → larger t.
  • Example from the excerpt: with r = 0.80, the variance of difference scores (56.82) is much smaller than the sum of the two condition variances (128.02 + 151.78 = 279.80).
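
Plugging the excerpt's values into that formula gives a quick check (the small gap from 56.82 reflects rounding in the reported inputs):

```python
from math import sqrt

var_x, var_y, r = 128.02, 151.78, 0.80
var_diff = var_x + var_y - 2 * r * sqrt(var_x) * sqrt(var_y)
print(round(var_diff, 2))   # about 56.8, far below var_x + var_y = 279.80
```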

🧪 Linear Comparisons with Correlated Data

🧪 Testing complex hypotheses

The excerpt describes a "Weapons and Aggression" study with four conditions per subject:

  • aw: aggressive word after weapon prime
  • an: aggressive word after non-weapon prime
  • cw: control word after weapon prime
  • cn: control word after non-weapon prime

You can test whether weapon primes speed reading overall by computing a linear comparison for each subject:

  • L₁ = (an + cn) - (aw + cw)

🧪 Procedure for linear comparisons

  1. Compute the comparison score for each subject using the specified weights.
  2. Test whether the mean comparison score differs from zero using a single-sample t-test.

Example from the excerpt:

  • Subject 1: L₁ = (440 + 452) - (447 + 432) = 13
  • Across 32 subjects, mean L₁ = 5.875, SE = 4.2646, t = 1.378, p = 0.178 (not significant).
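
A minimal sketch of the per-subject comparison score and the test of its mean against zero (NumPy and SciPy assumed). The five subjects are the ones listed in the data-structure table later in these notes; the reported t = 1.378 comes from all 32 subjects:

```python
import numpy as np
from scipy import stats

def linear_comparison_t(scores, coefs):
    """Comparison score per subject, then a single-mean t test against zero."""
    L = np.asarray(scores, dtype=float) @ np.asarray(coefs, dtype=float)
    n = len(L)
    t = L.mean() / (L.std(ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), n - 1)
    return L, t, p

# Columns: aw, an, cw, cn.  Weights (-1, 1, -1, 1) give L1 = (an + cn) - (aw + cw).
subjects = [[447, 440, 432, 452],
            [427, 437, 469, 451],
            [417, 418, 445, 434],
            [348, 371, 353, 344],
            [471, 443, 462, 463]]
L1, t, p = linear_comparison_t(subjects, [-1, 1, -1, 1])
print(L1)   # [ 13.  -8. -10.  14. -27.]
```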

🎯 Testing interaction effects

A second comparison tests whether weapon primes affect aggressive words differently than control words:

  • L₂ = (an - aw) - (cn - cw)

This is the priming effect for aggressive words minus the priming effect for control words.

  • Example: Subject 1's L₂ = (440 - 447) - (452 - 432) = -27
  • Across 32 subjects, mean L₂ = 8.4375, SE = 3.9128, t = 2.156, p = 0.039 (significant).

🔀 Multiple and Orthogonal Comparisons

🔀 Multiple comparisons

  • The excerpt notes that issues with multiple comparisons (e.g., inflated Type I error) are the same for correlated observations as for independent groups.
  • The "Pairwise Comparisons" section mentions the Bonferroni correction (though details are cut off).

🔀 Orthogonal comparisons

Orthogonal comparisons: comparisons that are statistically independent (uncorrelated).

  • The excerpt states that orthogonal comparisons with correlated observations are very rare.
  • To assess dependence between two comparisons, you can correlate them directly.
  • Example: L₁ and L₂ from the weapons study are correlated r = 0.24 in the sample, indicating some dependence.
87

Pairwise Comparisons (Correlated Observations)

Pairwise Comparisons (Correlated Observations)

🧭 Overview

🧠 One-sentence thesis

When comparing multiple pairs of means from the same subjects, the Bonferroni correction applied to correlated-pairs t-tests is the standard practice because the Tukey HSD test's assumption of equal variance across all pairwise differences is unlikely to hold.

📌 Key points (3–5)

  • Why not Tukey HSD for correlated data: The Tukey test assumes the variance of difference scores is the same for all pairwise differences, which is unlikely when the same subjects provide multiple scores.
  • Standard practice: Compare each pair of means using the correlated-pairs t-test method plus the Bonferroni correction to control familywise error rate.
  • Bonferroni correction formula: Divide the desired familywise error rate (e.g., 0.05) by the number of pairwise comparisons to get the per-comparison alpha level.
  • Common confusion: The standard deviations of pairwise differences can vary widely across different pairs of conditions, violating the Tukey test's equal-variance assumption.
  • Practical implication: All comparisons must meet the adjusted (stricter) significance threshold to be declared significant at the familywise level.

🚫 Why Tukey HSD doesn't work here

🚫 The violated assumption

The Tukey HSD test makes an assumption that is unlikely to hold: The variance of difference scores is the same for all pairwise differences between means.

  • The Tukey HSD test was recommended for independent groups (covered in an earlier section).
  • When you have one group with several scores from the same subjects (correlated observations), this equal-variance assumption typically fails.
  • The excerpt emphasizes this is "unlikely to hold" in practice.

📊 Evidence of unequal variances

The Stroop Interference case study illustrates this problem clearly:

| Comparison | Standard deviation (Sd) |
| --- | --- |
| W-C | 2.99 |
| W-I | 7.84 |
| C-I | 7.47 |

  • Notice how different the standard deviations are across the three pairwise comparisons.
  • The excerpt states: "For the Tukey test to be valid, all population values of the standard deviation would have to be the same."
  • Don't confuse: This is about the standard deviation of the difference scores, not the original scores themselves.

🔧 The standard practice method

🔧 Two-step procedure

The standard approach combines two techniques:

  1. Base method: Use the correlated-pairs t-test (from the "Difference Between Two Means (Correlated Pairs)" section) for each pair.
  2. Error control: Apply the Bonferroni correction (from the "Specific Comparisons" section) to maintain the familywise error rate.

🧮 Bonferroni correction calculation

The Bonferroni correction: Divide the desired familywise error rate by the number of comparisons to get the per-comparison alpha level.

How to compute:

  • Determine the number of possible pairwise comparisons.
  • Divide your target familywise error rate by that number.

Example from the excerpt:

  • Four means → six possible pairwise comparisons.
  • Familywise error rate = 0.05.
  • Per-comparison alpha = 0.05 / 6 = 0.0083.
  • Each individual test must have p < 0.0083 to be declared significant at the 0.05 familywise level.

📐 Number of comparisons

For k means, the number of pairwise comparisons is k(k-1)/2:

  • 3 means → 3 comparisons
  • 4 means → 6 comparisons
  • 5 means → 10 comparisons
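
A minimal sketch tying the k(k − 1)/2 count to the adjusted per-comparison alpha:

```python
def n_pairwise(k):
    """Number of pairwise comparisons among k means."""
    return k * (k - 1) // 2

for k in (3, 4, 5):
    c = n_pairwise(k)
    print(k, c, round(0.05 / c, 4))   # k, comparisons, per-comparison alpha at FW = 0.05
# 3 -> 3 comparisons, alpha 0.0167; 4 -> 6, 0.0083; 5 -> 10, 0.005
```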

📚 Worked example: Stroop Interference

📚 Study design

  • Subjects: 47 subjects (same people in all conditions).
  • Tasks: Three tasks per subject—"words," "color," and "interference."
  • Goal: Compute all pairwise comparisons of task times.

🔢 Step 1: Calculate difference scores

For each subject, compute the difference between each pair of conditions:

ComparisonWhat it represents
W-CWords time minus Color time
W-IWords time minus Interference time
C-IColor time minus Interference time

Example for one subject (from Table 1):

  • W-C = -3
  • W-I = -24
  • C-I = -21

🔢 Step 2: Compute t-tests for each pair

For all 47 subjects, the excerpt provides summary statistics:

| Comparison | Mean | Sd | Sem | t | p |
| --- | --- | --- | --- | --- | --- |
| W-C | -4.15 | 2.99 | 0.44 | -9.53 | <0.001 |
| W-I | -20.51 | 7.84 | 1.14 | -17.93 | <0.001 |
| C-I | -16.36 | 7.47 | 1.09 | -15.02 | <0.001 |

  • The t values are computed by dividing the mean by the standard error of the mean.
  • Degrees of freedom = 47 - 1 = 46.

✅ Step 3: Apply Bonferroni correction

  • Three comparisons → adjusted alpha = 0.05 / 3 = 0.0167.
  • Decision rule: A comparison is significant at the 0.05 familywise level only if its p-value is below 0.0167.
  • Result: All three p-values are far below 0.0167 (all < 0.001), so all pairwise differences are significant.

🔍 Why the correction matters

Without the Bonferroni correction:

  • Each test would use alpha = 0.05.
  • The chance of at least one false positive (Type I error) across multiple tests increases.

With the correction:

  • The familywise error rate is controlled at 0.05.
  • Each individual test uses a stricter threshold (0.0167 in this example).

🧩 How to compute the correlated-pairs t-test

🧩 The difference-score approach

Although the excerpt references the "Difference Between Two Means (Correlated Pairs)" section for details, it illustrates the core idea:

  1. For each subject, compute the difference between the two conditions being compared.
  2. Treat these difference scores as a single sample.
  3. Test whether the mean difference is significantly different from zero using a one-sample t-test.

🧮 Formula components

The excerpt shows the t formula in words:

  • t = (M - μ) / s_M
  • M = sample mean of the difference scores
  • μ = hypothesized population mean (typically 0, meaning "no difference")
  • s_M = estimated standard error of the mean

Example from the weapons-and-aggression data (L1 comparison):

  • Mean difference (M) = 5.875
  • Standard error (s_M) = 4.2646
  • Hypothesized mean (μ) = 0
  • t = (5.875 - 0) / 4.2646 = 1.378
  • Degrees of freedom = 32 - 1 = 31
  • Two-tailed p = 0.178 (not significant)

📋 Data structure

The excerpt provides a clear example of how data are organized:

Table format (five subjects shown):

| Subject | aw | an | cw | cn | L1 |
| --- | --- | --- | --- | --- | --- |
| 1 | 447 | 440 | 432 | 452 | 13 |
| 2 | 427 | 437 | 469 | 451 | -8 |
| 3 | 417 | 418 | 445 | 434 | -10 |
| 4 | 348 | 371 | 353 | 344 | 14 |
| 5 | 471 | 443 | 462 | 463 | -27 |

  • Each row = one subject.
  • Each column (aw, an, cw, cn) = one condition.
  • L1 column = the computed difference score for that subject.

Don't confuse: The original scores (aw, an, etc.) are in milliseconds, but what matters for the t-test is the difference (L1), not the raw scores.

🔬 Additional context from the excerpt

🔬 Orthogonal comparisons

The excerpt briefly mentions orthogonal comparisons:

  • You can assess dependence between two comparisons by correlating them directly.
  • Example: L1 and L2 in the weapons-and-aggression data are correlated 0.24.
  • The excerpt notes: "Orthogonal comparisons with correlated observations are very rare."

🔬 Multiple comparisons

Issues associated with doing multiple comparisons are the same for related observations as they are for multiple comparisons among independent groups.

  • The same principles (controlling familywise error, using corrections) apply whether observations are independent or correlated.
  • The difference is in how you compute each individual test (independent-groups t-test vs. correlated-pairs t-test).
88

Introduction to Power

Introduction to Power

🧭 Overview

🧠 One-sentence thesis

Statistical power is the probability that a researcher will correctly reject a false null hypothesis, and it depends on factors such as sample size, effect size, significance level, population variability, and whether the test is one-tailed or two-tailed.

📌 Key points (3–5)

  • What power measures: the probability of correctly rejecting a false null hypothesis (avoiding a Type II error).
  • How power is calculated: by determining the probability that a sample statistic will fall in the rejection region when the alternative hypothesis is true.
  • Key factors affecting power: sample size, effect size (difference between true and hypothesized population parameters), significance level (alpha), population standard deviation, and test directionality (one-tailed vs two-tailed).
  • Common confusion: power depends on the true population parameter (which the researcher doesn't know in practice), not the hypothesized null value.
  • Why it matters: higher power means a better chance of detecting a real effect; researchers can increase power by adjusting factors under their control, especially sample size.

🎯 What power means

🎯 Definition and purpose

Power: the probability that a researcher will correctly reject a false null hypothesis.

  • Power answers the question: "If the null hypothesis is actually false, what is the chance my test will detect it?"
  • It is the complement of a Type II error (failing to reject a false null hypothesis).
  • Example: if power = 0.80, there is an 80% chance the test will correctly identify that the null hypothesis is false.

🔢 How power is computed

The excerpt provides two worked examples to illustrate the calculation process:

Example 1 (Bond's ability):

  • Bond claims he can correctly identify a drink 0.75 of the time.
  • If his true ability is 0.75, the probability that he will be correct on 12 or more trials is 0.63.
  • Therefore, power = 0.63.

Example 2 (Math achievement test):

  • A test has a known mean of 75 and standard deviation of 10.
  • The true population mean for a new teaching method is 80 (unknown to the researcher).
  • The researcher samples 25 subjects and uses a one-tailed test with significance level 0.05.
  • Standard error of the mean = 10 / √25 = 2.
  • The rejection region starts at a sample mean of 78.29 or higher (the cutoff for p = 0.05 under the null hypothesis).
  • When the true mean is 80, the probability of getting a sample mean ≥ 78.29 is 0.80.
  • Therefore, power = 0.80.

📐 Key calculation steps

  1. Determine the rejection region under the null hypothesis (e.g., sample mean ≥ 78.29 for alpha = 0.05).
  2. Find the probability that the sample statistic will fall in that rejection region given the true population parameter (e.g., true mean = 80).
  3. That probability is the power.

Don't confuse: Power is calculated assuming the true parameter value (which the researcher doesn't actually know), not the null hypothesis value. The examples treat the true mean as unknown to the researcher, yet the calculation requires a value for it; one is supplied for pedagogical purposes.

🧮 Calculation considerations

🧮 When population variance is known

  • The excerpt notes that assuming the researcher knows the population variance is "rarely true in practice" but useful for teaching.
  • When the population standard deviation is known, the normal distribution (z-distribution) can be used instead of the t distribution.
  • Power calculators are available for situations where the population variance is unknown.

🧮 Complexity for different tests

  • Power calculation is "more complex for t tests and for Analysis of Variance."
  • Many programs exist to compute power for these more complex scenarios.

🔧 Factors that affect power

🔧 Overview of factors

The excerpt introduces a section on factors affecting power, noting that:

  • Several factors affect the power of a statistical test.
  • Some factors are under the experimenter's control; others are not.
  • Five factors will be discussed (though the excerpt cuts off before listing all).

📏 Sample size

The larger the sample size, the higher the power.

  • Sample size (N) is typically under the experimenter's control.
  • Increasing sample size is one way to increase power.
  • Larger samples reduce the standard error of the mean, making it easier to detect a true difference.
  • Example: In the math achievement example, the standard error = σ / √N. With N = 25, standard error = 10 / 5 = 2. A larger N would reduce this further, narrowing the sampling distribution and increasing the chance of detecting the true mean of 80.

🎚️ Other factors (introduced but not detailed)

The excerpt mentions that five factors affect power but cuts off after discussing sample size. The setup example defines:

  • Population standard deviation (σ): affects the spread of the sampling distribution.
  • True population mean (μ): the actual parameter value (larger difference from the null hypothesis value would affect power).
  • Significance level: the 0.05 cutoff used in the example determines the rejection region.
  • Test directionality: the example uses a one-tailed test.

Note: The excerpt does not provide detailed explanations of how these other factors affect power, only that they will be covered in the section.

📊 Interpreting power values

📊 What different power values mean

| Power value | Interpretation |
| --- | --- |
| 0.63 | 63% chance of correctly rejecting the false null hypothesis |
| 0.80 | 80% chance of correctly rejecting the false null hypothesis |
| Higher power | Better chance of detecting a real effect |

  • Power of 0.80 is often considered a reasonable target in research design.
  • Low power means a high risk of missing a real effect (Type II error).

📊 Practical implications

  • Researchers can use power calculations before conducting a study to determine appropriate sample sizes.
  • The calculation requires assumptions about the true population parameter (which must be estimated based on prior research or theory).
  • Power analysis helps balance the cost of larger samples against the benefit of higher detection probability.
89

Example Calculations

Example Calculations

🧭 Overview

🧠 One-sentence thesis

Power—the probability of correctly rejecting a false null hypothesis—can be calculated by finding the probability that a sample statistic will exceed the critical value when the true population parameter differs from the null hypothesis value.

📌 Key points (3–5)

  • What power measures: the probability that a researcher will correctly reject a false null hypothesis.
  • How to calculate power: determine the critical value under the null hypothesis, then find the probability of exceeding that critical value under the true population parameter.
  • Key inputs needed: population standard deviation, sample size, significance level (alpha), and the true population parameter.
  • Common confusion: the null hypothesis distribution vs. the true population distribution—power is calculated using the true distribution, not the null.
  • Two worked examples: one with a binomial scenario (Bond's ability) and one with a normal distribution (math achievement test).

🎲 The Bond example: binomial power

🎲 Setup and question

  • Bond has a true ability to be correct on 0.75 of trials.
  • The question: what is the probability he will be correct on 12 or more trials?
  • This probability represents power in this context.

🎲 Result

  • The probability Bond will be correct on 12 or more trials is 0.63.
  • Therefore, power = 0.63 in this scenario.
  • This is a binomial calculation where the true success rate (0.75) determines the probability of achieving a certain number of successes.
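
A minimal sketch of the binomial calculation (SciPy assumed). The excerpt does not state the total number of trials; 16 is assumed here only because it reproduces the reported 0.63:

```python
from scipy import stats

n_trials, p_true, cutoff = 16, 0.75, 12   # n_trials = 16 is an assumption, not from the excerpt
power = stats.binom.sf(cutoff - 1, n_trials, p_true)   # P(correct on 12 or more trials)
print(round(power, 2))   # 0.63
```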

📐 The teaching method example: normal distribution power

📐 Problem setup

  • A math achievement test has a known mean of 75 and standard deviation of 10.
  • A researcher tests whether a new teaching method produces a higher mean.
  • True (unknown) population mean for the new method is 80.
  • Sample size: 25 subjects.
  • Test type: one-tailed test.
  • Question: what is the probability the researcher will correctly reject the false null hypothesis that the population mean is 75 or lower?

📐 Key assumptions

  • The population standard deviation remains 10 with the new method.
  • The distribution is normal.
  • Because the population standard deviation is assumed known, the normal distribution (not the t distribution) is used to compute the p value.

📐 Step 1: Calculate the standard error

Standard error of the mean (σ_M): the standard deviation divided by the square root of the sample size.

  • Formula: standard error equals population standard deviation divided by square root of sample size.
  • Calculation: 10 divided by square root of 25 equals 10 divided by 5 equals 2.
  • This measures the variability of sample means.

📐 Step 2: Find the critical value under the null hypothesis

  • Under the null hypothesis (population mean = 75), find the sample mean value that corresponds to a 0.05 probability in the upper tail.
  • Figure 3 in the excerpt shows this distribution.
  • Result: if the sample mean M is 78.29 or larger, the researcher will reject the null hypothesis.
  • This is the critical value: the threshold for declaring significance at alpha = 0.05.

📐 Step 3: Calculate power under the true population mean

  • Under the true population mean (80, not 75), what is the probability of getting a sample mean greater than 78.29?
  • Figure 4 in the excerpt shows this distribution centered at 80.
  • Result: the probability is 0.80.
  • Don't confuse: this is not the probability under the null hypothesis; it is the probability under the true parameter.

📐 Conclusion

  • The probability that the researcher will correctly reject the false null hypothesis is 0.80.
  • Power = 0.80 for this test.
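
A minimal sketch of the full calculation using SciPy's normal distribution (the known-σ case described in the excerpt):

```python
from math import sqrt
from scipy import stats

mu_null, mu_true, sigma, n, alpha = 75, 80, 10, 25, 0.05
se = sigma / sqrt(n)                                               # 2.0
critical_mean = stats.norm.ppf(1 - alpha, loc=mu_null, scale=se)   # about 78.29
power = stats.norm.sf(critical_mean, loc=mu_true, scale=se)        # about 0.80
print(round(critical_mean, 2), round(power, 2))
```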

🔧 Practical notes

🔧 When population variance is unknown

  • The excerpt notes that assuming a known population variance is rarely true in practice.
  • This assumption is used here for pedagogical purposes (to simplify the example).
  • Power calculators are available for situations where the experimenter does not know the population variance.
  • For t tests and Analysis of Variance, power calculation is more complex, and many programs exist to compute power.

🔧 The two-distribution logic

| Distribution | Purpose | Center | What it shows |
| --- | --- | --- | --- |
| Null hypothesis distribution | Find the critical value | Hypothesized mean (75) | Threshold for rejecting the null (78.29 at alpha = 0.05) |
| True population distribution | Calculate power | True mean (80) | Probability of exceeding the critical value (0.80) |

  • Power requires thinking about two distributions simultaneously: one under the null hypothesis and one under the true parameter.
  • Example: the critical value 78.29 is "far out" in the null distribution (5% tail) but is "closer to center" in the true distribution (80), making it likely to be exceeded.
90

Factors Affecting Power

Factors Affecting Power

🧭 Overview

🧠 One-sentence thesis

Statistical power—the probability of correctly rejecting a false null hypothesis—is influenced by sample size, standard deviation, effect size, significance level, and whether the test is one- or two-tailed, with some factors under the experimenter's control and others not.

📌 Key points (3–5)

  • What power measures: the probability that a test will correctly reject a false null hypothesis.
  • Five key factors: sample size, standard deviation, difference between hypothesized and true mean (effect size), significance level, and one- vs two-tailed tests.
  • Control vs no control: experimenters can control sample size, sometimes reduce standard deviation, and choose significance level and test direction; they cannot control the true population mean.
  • Common confusion: significance level vs power—lowering the significance level (making it more stringent) actually reduces power, creating a trade-off.
  • Why it matters: understanding these factors helps researchers design studies that are more likely to detect real effects when they exist.

📏 Sample characteristics

📏 Sample size

  • Larger sample size → higher power.
  • The excerpt shows this relationship in Figure 1: as N increases, power increases for both standard deviations tested (10 and 15).
  • Sample size is typically under the experimenter's control, making it one way to increase power.
  • Limitation: using a large sample size can be difficult and/or expensive.
  • Example: If a researcher wants to detect whether a new teaching method raises test scores above 75, testing 100 students will give higher power than testing 20 students.

📐 Standard deviation

Standard deviation (σ): a measure of variability in the population.

  • Smaller standard deviation → higher power.
  • The excerpt states: "For all values of N, power is higher for the standard deviation of 10 than for the standard deviation of 15 (except, of course, when N = 0)."
  • Why: less variability makes it easier to detect a real difference between the hypothesized mean and the true mean.

🎯 How experimenters can reduce standard deviation

The excerpt identifies three methods:

  • Sampling from a homogeneous population: choosing subjects who are more similar reduces variability.
  • Reducing random measurement error: improving measurement precision lowers noise.
  • Applying experimental procedures consistently: standardizing procedures prevents unnecessary variation.

Don't confuse: standard deviation is a property of the population or measurement process, not the sample size—increasing N does not change σ, but it does change how precisely we estimate the mean.

🎯 Effect size

🎯 Difference between hypothesized and true mean

  • Larger effect size → higher power.
  • The excerpt states: "Naturally, the larger the effect size, the more likely it is that an experiment would find a significant effect."
  • Figure 2 shows that as the true population mean (μ) moves farther from the hypothesized mean (75), power increases for both standard deviations.
  • Why: a bigger real difference is easier to detect statistically.
  • Example: If the new teaching method raises the true mean to 85 instead of 80, the test is more likely to detect the improvement.

⚠️ Lack of control

  • The experimenter does not control the true population mean.
  • The excerpt emphasizes: "Assume that although the experimenter does not know it, the population mean μ for the new method is larger than 75."
  • This factor is a property of reality, not a design choice.

⚖️ Test design choices

⚖️ Significance level

Significance level (α): the threshold probability for rejecting the null hypothesis (e.g., 0.05 or 0.01).

  • More stringent (lower) significance level → lower power.
  • The excerpt states: "There is a trade-off between the significance level and power: the more stringent (lower) the significance level, the lower the power."
  • Figure 3 shows that power is lower for α = 0.01 than for α = 0.05.
  • Why: requiring stronger evidence to reject the null hypothesis makes it harder to reject, even when the null is false.
  • Example: If a researcher sets α = 0.01 instead of 0.05, they need more extreme sample results to reject the null, reducing the chance of detecting a real effect.

🔀 One-tailed vs two-tailed tests

  • One-tailed test → higher power (as long as the hypothesized direction is correct).
  • The excerpt states: "Power is higher with a one-tailed test than with a two-tailed test as long as the hypothesized direction is correct."
  • Equivalence: "A one-tailed test at the 0.05 level has the same power as a two-tailed test at the 0.10 level."
  • Why: the excerpt explains, "A one-tailed test, in effect, raises the significance level" (by concentrating the rejection region in one tail).
  • Don't confuse: this advantage only holds if the direction is correct—if the real effect is in the opposite direction, a one-tailed test will have zero power to detect it.

📊 Summary of factors

| Factor | Effect on power | Under experimenter control? | How to increase power |
| --- | --- | --- | --- |
| Sample size (N) | Larger N → higher power | Yes | Increase sample size (if feasible) |
| Standard deviation (σ) | Smaller σ → higher power | Sometimes | Sample homogeneously, reduce measurement error, standardize procedures |
| Effect size (true μ − hypothesized μ) | Larger difference → higher power | No | Cannot control; depends on reality |
| Significance level (α) | Lower α → lower power | Yes | Use less stringent α (but increases Type I error risk) |
| One- vs two-tailed | One-tailed → higher power | Yes | Use one-tailed test if direction is known (but loses power in opposite direction) |

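A short sketch that reuses the normal-approximation power calculation from the earlier example to show the direction of each effect (the non-baseline values are illustrative variations, not from the excerpt):

```python
from math import sqrt
from scipy import stats

def power_one_tailed(mu_null, mu_true, sigma, n, alpha):
    """Power of a one-tailed z test of the mean (known-sigma normal approximation)."""
    se = sigma / sqrt(n)
    cutoff = stats.norm.ppf(1 - alpha, loc=mu_null, scale=se)
    return stats.norm.sf(cutoff, loc=mu_true, scale=se)

print(round(power_one_tailed(75, 80, 10, 25, 0.05), 2))   # 0.80  baseline from the excerpt
print(round(power_one_tailed(75, 80, 10, 100, 0.05), 2))  # larger N      -> higher power
print(round(power_one_tailed(75, 80, 15, 25, 0.05), 2))   # larger sigma  -> lower power
print(round(power_one_tailed(75, 80, 10, 25, 0.01), 2))   # stricter alpha -> lower power
print(round(power_one_tailed(75, 85, 10, 25, 0.05), 2))   # larger effect -> higher power
```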
91

Introduction to Linear Regression

Introduction to Linear Regression

🧭 Overview

🧠 One-sentence thesis

Linear regression allows you to predict one variable from another by finding the line that minimizes squared errors, and the calculations are assumption-free though the line's formula depends on means, standard deviations, and correlation.

📌 Key points (3–5)

  • What the regression line does: it is the line for which the sum of squared errors is lower than for any other line.
  • The regression formula: Y' = bX + A, where Y' is the predicted score, b is the slope, and A is the Y intercept.
  • How to compute slope and intercept: slope b is calculated from the correlation and standard deviations; intercept A is calculated from the means and slope.
  • Standardized variables simplify the equation: when variables are standardized (mean = 0, standard deviation = 1), the slope equals the correlation and the intercept equals 0.
  • Common confusion: the calculations themselves are assumption-free; assumptions only matter when you do inferential statistics with regression.

📐 The regression line formula

📐 Basic formula structure

The formula for a regression line is Y' = bX + A, where Y' is the predicted score, b is the slope of the line, and A is the Y intercept.

  • Y' is not the actual observed score; it is the predicted score for a given X.
  • The line is chosen so that the sum of squared errors is lower than for any other regression line.
  • Example: if the equation is Y' = 0.425X + 0.785, then for X = 1, Y' = (0.425)(1) + 0.785 = 1.21; for X = 2, Y' = (0.425)(2) + 0.785 = 1.64.

🧮 What you need to compute the line

The calculations require five statistics:

| Statistic | Meaning |
| --- | --- |
| M_X | Mean of X |
| M_Y | Mean of Y |
| s_X | Standard deviation of X |
| s_Y | Standard deviation of Y |
| r | Correlation between X and Y |

  • In the age of computers, the regression line is typically computed with statistical software.
  • However, the calculations are relatively easy and are given for anyone who is interested.

🔢 Computing slope and intercept

🔢 Slope (b)

  • The slope can be calculated as: b = r × (s_Y / s_X).
  • Example from the excerpt: b = (0.627) × (1.072 / 1.581) = 0.425.
  • The slope depends on both the correlation and the ratio of the two standard deviations.

🔢 Intercept (A)

  • The intercept can be calculated as: A = M_Y − b × M_X.
  • Example from the excerpt: A = 2.06 − (0.425)(3) = 0.785.
  • The intercept adjusts the line so it passes through the point (M_X, M_Y).
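
A minimal sketch of these two formulas, using the summary statistics reported in the excerpt:

```python
def regression_line(m_x, m_y, s_x, s_y, r):
    """Least-squares slope and intercept from the five summary statistics."""
    b = r * (s_y / s_x)
    a = m_y - b * m_x
    return b, a

b, a = regression_line(m_x=3, m_y=2.06, s_x=1.581, s_y=1.072, r=0.627)
print(round(b, 3), round(a, 3))   # 0.425 0.785
```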

📊 Sample vs population

  • The excerpt notes that all calculations have been shown in terms of sample statistics rather than population parameters.
  • The formulas are the same; simply use the parameter values for means, standard deviations, and the correlation.

🎯 Standardized variables

🎯 Simplified formula

When variables are standardized so that their means are equal to 0 and standard deviations are equal to 1, then b = r and A = 0.

  • This makes the regression line: Z_Y' = (r)(Z_X).
  • Z_Y' is the predicted standard score for Y, r is the correlation, and Z_X is the standardized score for X.
  • The slope of the regression equation for standardized variables is r.

🎯 Why standardization simplifies

  • When means are 0, the intercept A = M_Y − b × M_X becomes 0 − 0 = 0.
  • When standard deviations are both 1, the slope b = r × (s_Y / s_X) becomes r × (1 / 1) = r.
  • Don't confuse: standardizing changes the scale but not the relationship; the correlation remains the same.

🎓 Real example: predicting university GPA

🎓 The prediction task

  • The excerpt describes a case study with 105 computer science majors at a local state school.
  • The goal: predict a student's university GPA if you know his or her high school GPA.
  • The scatter plot shows a strong positive relationship; the correlation is 0.78.

🎓 The regression equation

  • The regression equation is: Univ GPA' = (0.675)(High School GPA) + 1.097.
  • Example: a student with a high school GPA of 3 would be predicted to have a university GPA of (0.675)(3) + 1.097 = 3.12.

⚠️ Assumptions and limitations

⚠️ Calculations are assumption-free

  • It may surprise you, but the calculations shown in this section are assumption-free.
  • The formulas for slope and intercept do not require any distributional assumptions.

⚠️ When assumptions matter

  • If the relationship between X and Y is not linear, a different shaped function could fit the data better.
  • Inferential statistics in regression are based on several assumptions, and these assumptions are presented in a later section of this chapter.
  • Don't confuse: computing the line itself is assumption-free; making inferences (e.g., confidence intervals, hypothesis tests) requires assumptions.
92

Partitioning the Sums of Squares

Partitioning the Sums of Squares

🧭 Overview

🧠 One-sentence thesis

Regression partitions the total variation in Y into explained variation (from predictions) and unexplained variation (from errors), and the proportion of variation explained equals r squared.

📌 Key points (3–5)

  • What partitioning means: the total variation in Y (SSY) splits into variation of predicted scores (SSY') and variation of prediction errors (SSE).
  • The partition formula: SSY = SSY' + SSE, where total variation equals explained variation plus unexplained variation.
  • How r² connects to variation: r squared is the proportion of variation explained; for example, if r = 0.4, then 16% of variation is explained.
  • Common confusion: variation vs variance—the relationships hold for both; variance is just variation divided by N or N-1.
  • Why it matters: partitioning shows how much of the outcome's variability the predictor accounts for versus how much remains unpredictable.

📐 Understanding variation in Y

📏 What sum of squares Y measures

Sum of squares Y (SSY): the sum of the squared deviations of Y from the mean of Y.

  • Formula (population): SSY = sum of (Y - mean of Y) squared
  • Formula (sample): use the sample mean M instead of the population mean
  • It quantifies the total spread or variability in the outcome variable Y.
  • Example: if Y values are 1, 2, 1.3, 3.75, 2.25 with mean 2.06, then SSY = 4.597 (sum of squared deviations).

🔤 Deviation scores notation

  • Deviation scores are differences from the mean; by convention, lowercase letters denote them.
  • y = Y - mean of Y (the deviation of each Y value from its mean)
  • The sum of y always equals zero because deviations above and below the mean cancel out.
  • Using deviation scores simplifies formulas and makes the partitioning clearer.

🧩 The two parts of variation

🎯 Sum of squares predicted (SSY')

Sum of squares predicted (SSY'): the sum of the squared deviations of the predicted scores from the mean predicted score.

  • It measures how much the predicted values vary around their mean.
  • Formula: SSY' = sum of (Y' - mean of Y') squared, or sum of y'² in deviation notation.
  • This is the explained variation—the part of Y's variability that the regression line accounts for.
  • Example: in the excerpt's data, SSY' = 1.806.

❌ Sum of squares error (SSE)

Sum of squares error (SSE): the sum of the squared errors of prediction.

  • It measures how much the actual Y values differ from their predicted values.
  • Formula: SSE = sum of (Y - Y')²
  • This is the unexplained variation—the part of Y's variability that the predictor does not capture.
  • Example: in the excerpt's data, SSE = 2.791.
  • Don't confuse: the mean of (Y - Y') is always zero, meaning errors average out, but the squared errors (SSE) capture total prediction inaccuracy.

➕ The partition equation

The total variation splits exactly into two parts:

| Component | Meaning | Value in example |
| --- | --- | --- |
| SSY | Total variation in Y | 4.597 |
| SSY' | Explained variation (from predictions) | 1.806 |
| SSE | Unexplained variation (from errors) | 2.791 |

  • SSY = SSY' + SSE
  • 4.597 = 1.806 + 2.791
  • This holds for any regression; the total is always the sum of explained and unexplained parts.
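
A minimal sketch that reproduces the partition from the excerpt's observed and predicted scores (NumPy assumed):

```python
import numpy as np

y  = np.array([1, 2, 1.3, 3.75, 2.25])           # observed Y
yp = np.array([1.21, 1.635, 2.06, 2.485, 2.91])  # predicted Y'

ssy  = ((y - y.mean()) ** 2).sum()     # total variation, about 4.597
ssyp = ((yp - yp.mean()) ** 2).sum()   # explained variation, about 1.806
sse  = ((y - yp) ** 2).sum()           # unexplained variation, about 2.791
print(round(ssy, 3), round(ssyp + sse, 3))   # the partition: SSY = SSY' + SSE
print(round(ssyp / ssy, 3))                  # proportion explained, about 0.393 = r squared
```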

📊 Proportion of variation explained

📈 Calculating proportions

  • Proportion explained = SSY' divided by SSY
  • Proportion unexplained = SSE divided by SSY
  • These two proportions always add up to 1.
  • Example: 1.806 / 4.597 ≈ 0.393 explained; 2.791 / 4.597 ≈ 0.607 unexplained.

🔗 The relationship to r squared

r squared is the proportion of variation explained.

  • If r = 1, then r² = 1, meaning 100% of variation is explained (perfect prediction).
  • If r = 0, then r² = 0, meaning 0% of variation is explained (no predictive power).
  • Example: if r = 0.4, then r² = 0.16, so 16% of the variation in Y is explained by X.
  • This connects correlation strength directly to how much variability the predictor accounts for.

📉 Variation vs variance

  • The excerpt notes that variance is variation divided by N (population) or N-1 (sample).
  • All the relationships for variation also hold for variance:
    • Variance of Y = variance of Y' + variance of errors
    • r² is the proportion of variance explained, just as it is the proportion of variation explained.
  • Don't confuse: variation (sum of squares) and variance (mean of squares) follow the same partitioning logic; they differ only by a scaling factor.

📋 Summary table structure

📊 Organizing the partition

The excerpt describes summarizing the partition in a table with columns for source of variation, degrees of freedom (df), and sums of squares.

| Source | Degrees of freedom | Sum of squares |
| --- | --- | --- |
| Explained (Y') | Number of predictors (1 in simple regression) | SSY' |
| Error | Total observations minus 2 | SSE |
| Total | Total observations minus 1 | SSY |

🔢 Degrees of freedom rules

  • Explained df: equals the number of predictor variables (always 1 in simple regression).
  • Error df: equals total number of observations minus 2.
  • Total df: equals total number of observations minus 1.
  • Example: with 5 observations, error df = 5 - 2 = 3, total df = 5 - 1 = 4.
  • These degrees of freedom are used later for inferential statistics (not covered in this excerpt).
93

Standard Error of the Estimate

Standard Error of the Estimate

🧭 Overview

🧠 One-sentence thesis

The standard error of the estimate measures how accurate regression predictions are by quantifying the typical distance between actual values and predicted values.

📌 Key points (3–5)

  • What it measures: the accuracy of predictions made by a regression line; smaller standard error means points cluster closer to the line.
  • How it's calculated: it is the standard deviation of the errors of prediction (the differences between actual Y and predicted Y').
  • Relationship to correlation: the standard error is directly related to Pearson's r—when r² is high (more variation explained), the standard error is lower.
  • Common confusion: population vs sample formulas—sample formulas use N-2 in the denominator (not N-1) because two parameters (slope and intercept) were estimated.
  • Why it matters: it quantifies prediction accuracy and is used in inferential statistics for regression.

📏 What the standard error of the estimate means

📏 Definition and purpose

Standard error of the estimate: a measure of the accuracy of predictions made by a regression line.

  • It tells you how tightly the actual data points cluster around the regression line.
  • The excerpt shows two graphs: in Graph A, points are closer to the line than in Graph B, so predictions in Graph A are more accurate.
  • Example: if the standard error is small, the actual Y values are close to the predicted Y' values; if large, predictions are less reliable.

🔍 Visual interpretation

  • You can judge the size of the standard error from a scatter plot by looking at how spread out the points are around the regression line.
  • Closer points → smaller standard error → more accurate predictions.
  • More scattered points → larger standard error → less accurate predictions.

🧮 How to calculate the standard error

🧮 Population formula (basic)

The standard error of the estimate is calculated as:

  • Numerator: sum of squared differences between actual scores (Y) and predicted scores (Y').
  • Denominator: N (the number of pairs of scores).
  • Take the square root of the result.

In words: the standard error of the estimate equals the square root of the sum of (Y minus Y') squared, divided by N.

📊 Example calculation

The excerpt provides a worked example with five X, Y pairs:

| X | Y | Y' | Y - Y' | (Y - Y')² |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1.21 | -0.21 | 0.044 |
| 2 | 2 | 1.635 | 0.365 | 0.133 |
| 3 | 1.3 | 2.06 | -0.76 | 0.578 |
| 4 | 3.75 | 2.485 | 1.265 | 1.6 |
| 5 | 2.25 | 2.91 | -0.66 | 0.436 |

Sums: X = 15, Y = 10.30, (Y - Y')² = 2.791.

  • The sum of squared errors is 2.791.
  • Standard error = square root of (2.791 divided by 5) = 0.747.
  • Notice that the mean of (Y - Y') is zero: some Y values are higher than predicted, some lower, but the average difference is zero.

🔗 Connection to standard deviation

  • The formula for the standard error of the estimate is similar to the formula for standard deviation.
  • In fact, the standard error of the estimate is the standard deviation of the errors of prediction.
  • Each (Y - Y') is an error of prediction, and the standard error measures the typical size of these errors.

🔗 Relationship to correlation and variance

🔗 Formula using Pearson's correlation

The standard error can also be computed using Pearson's correlation (r):

  • Standard error = square root of [(1 minus r²) times SSY, divided by N].
  • SSY is the total sum of squares for Y (the total variation in Y).

📊 Why r² matters

  • r² is the proportion of variation explained by the regression.
  • (1 - r²) is the proportion of variation not explained (the unexplained variation).
  • The standard error is based on the unexplained variation.
  • Example from the excerpt: r = 0.6268, so r² = 0.393. About 39% of variation is explained, and 61% is unexplained. The standard error reflects that 61%.

🧩 Worked example with correlation

Using the same data:

  • Mean of Y = 2.06, SSY = 4.597, r = 0.6268.
  • Standard error = square root of [(1 - 0.6268²) times 4.597, divided by 5] = square root of (2.791 / 5) = 0.747.
  • This matches the result from the direct calculation.
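
A quick numerical check of the three formulas with the excerpt's values (tiny differences from the printed figures are rounding):

```python
from math import sqrt

sse, ssy, r, n = 2.791, 4.597, 0.6268, 5

print(sqrt(sse / n))                 # about 0.747, direct population form
print(sqrt((1 - r ** 2) * ssy / n))  # about 0.747, via the correlation
print(sqrt(sse / (n - 2)))           # about 0.96, sample form with N - 2
```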

🔄 Partitioning variation

The excerpt explains that:

  • Total variation (SSY) = explained variation (SSY') + unexplained variation (SSE).
  • The standard error is based on SSE (the sum of squared errors).
  • The same relationships hold for variance: total variance = variance of predictions + variance of errors.
  • r² is the proportion of variance explained, as well as the proportion of variation explained.

🧪 Sample vs population formulas

🧪 Key difference

When computing the standard error from a sample (not a population), the denominator changes:

  • Population formula: divide by N.
  • Sample formula: divide by (N - 2).

🔍 Why N - 2?

  • The reason is that two parameters (the slope and the intercept) were estimated to compute the sum of squares.
  • This is analogous to using (N - 1) for sample variance, but here two degrees of freedom are lost instead of one.
  • Don't confuse: it's not (N - 1) as in standard deviation; it's (N - 2) because of the two estimated parameters.

📐 Sample formulas

For a sample:

  • Standard error = square root of [sum of (Y - Y')² divided by (N - 2)].
  • Or using correlation: standard error = square root of [(1 - r²) times SSY, divided by (N - 2)].
  • Example from the excerpt: with the same data treated as a sample, standard error = square root of (2.791 / 3) = 0.964.

📋 Summary table for partitioning

📋 Structure of the summary table

The excerpt shows how to organize the partitioning of variation in a table:

| Source | Sum of Squares | df | Mean Square |
| --- | --- | --- | --- |
| Explained | 1.806 | 1 | 1.806 |
| Error | 2.791 | 3 | 0.93 |
| Total | 4.597 | 4 | |

🔢 Degrees of freedom (df)

  • Explained df: equals the number of predictor variables (always 1 in simple regression).
  • Error df: equals total number of observations minus 2 (in this example, 5 - 2 = 3).
  • Total df: equals total number of observations minus 1 (in this example, 5 - 1 = 4).

📊 Mean square

  • Mean square is calculated by dividing the sum of squares by the degrees of freedom.
  • Example: error mean square = 2.791 / 3 = 0.93.
  • The mean square for error is the variance of the errors.
94

Inferential Statistics for b and r

Inferential Statistics for b and r

🧭 Overview

🧠 One-sentence thesis

When the regression slope (b) is significantly different from zero, the correlation coefficient (r) is also significantly different from zero, and both can be tested using t-tests with the same result.

📌 Key points (3–5)

  • Core assumptions: Inferential statistics for regression require linearity, homoscedasticity (equal variance around the regression line), and normally distributed prediction errors.
  • Testing the slope: The slope b is tested against zero using a t-test with degrees of freedom N-2, where N is the number of pairs of scores.
  • Testing correlation: Pearson's r is tested using a similar t-test formula, and the t value obtained is identical to the t value for testing the slope.
  • Common confusion: Homoscedasticity vs heteroscedasticity—heteroscedasticity means variance around the regression line changes with X values (e.g., predictions are accurate for some X ranges but not others).
  • Confidence intervals: Both the slope and correlation can be estimated with confidence intervals using the t distribution and standard errors.

📋 Assumptions for inferential statistics

📏 Linearity

The relationship between the two variables is linear.

  • This assumption applies to the population, not the sample.
  • No assumptions are needed to find the best-fitting line itself, but they are required for significance tests and confidence intervals.

📊 Homoscedasticity

The variance around the regression line is the same for all values of X.

  • What it means: Prediction errors should have similar spread across all X values.
  • Violation example: The excerpt describes university GPA predicted from high-school GPA—predictions are very good (points close to the line) for students with high high-school GPAs, but not very good (points far from the line) for students with low high-school GPAs.
  • Don't confuse: This is about variance of errors, not about whether X or Y themselves are normally distributed.

🔔 Normal distribution of errors

The errors of prediction are distributed normally.

  • This means the distributions of deviations from the regression line are normally distributed.
  • It does not mean that X or Y must be normally distributed.

🧪 Testing the slope (b)

🧮 The t-test formula for the slope

The general t-test formula is applied as:

  • Statistic: the sample slope (b)
  • Hypothesized value: 0
  • Degrees of freedom: df = N - 2, where N is the number of pairs of scores

The formula structure is: t = (statistic minus hypothesized value) divided by the standard error.

📐 Standard error of the slope

The estimated standard error of b is:

  • s_b = s_est divided by the square root of SSX
  • s_est: the standard error of the estimate
  • SSX: the sum of squared deviations of X from the mean of X

SSX is calculated as the sum of (X minus mean of X) squared.

The standard error of the estimate can be calculated as the square root of (sum of squared prediction errors divided by N).

🔢 Worked example

The excerpt provides data with 5 pairs of scores:

X | Y | Predicted Y | Error | Squared error
1 | 1 | 1.21 | -0.21 | 0.044
2 | 2 | 1.635 | 0.365 | 0.133
3 | 1.3 | 2.06 | -0.76 | 0.578
4 | 3.75 | 2.485 | 1.265 | 1.6
5 | 2.25 | 2.91 | -0.66 | 0.436

  • Sum of squared errors = 2.791
  • s_est = square root of (2.791 / 5) = 0.747 (for population formula)
  • For sample formula, use N-2 in denominator: square root of (2.791 / 3) = 0.964
  • SSX = 10.00
  • s_b = 0.964 / square root of 10 = 0.305
  • Slope b = 0.425
  • t = 0.425 / 0.305 = 1.39
  • df = 5 - 2 = 3
  • Two-tailed p value = 0.26, so the slope is not significantly different from 0
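
The arithmetic above can be checked with a short script. The following is a minimal Python sketch (NumPy and SciPy assumed available) that reproduces the slope, its standard error, the t value, and the two-tailed p value from the five pairs of scores:

```python
import numpy as np
from scipy import stats

# The five pairs of scores from the table above
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])
N = len(X)

SSX = np.sum((X - X.mean()) ** 2)                  # 10.0
b = np.sum((X - X.mean()) * (Y - Y.mean())) / SSX  # slope = 0.425
a = Y.mean() - b * X.mean()                        # intercept
errors = Y - (a + b * X)                           # prediction errors

s_est = np.sqrt(np.sum(errors ** 2) / (N - 2))     # 0.964 (sample formula, N - 2)
s_b = s_est / np.sqrt(SSX)                         # 0.305
t = (b - 0) / s_b                                  # 1.39
p = 2 * stats.t.sf(abs(t), df=N - 2)               # two-tailed p ≈ 0.26

print(b, s_b, t, p)
```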

❓ Why N-2 degrees of freedom

The denominator is N-2 rather than N-1 because two parameters (the slope and the intercept) were estimated in order to estimate the sum of squares.

🎯 Confidence interval for the slope

📊 Formula structure

For the 95% confidence interval:

  • Lower limit: b minus (t_0.95 times s_b)
  • Upper limit: b plus (t_0.95 times s_b)

where t_0.95 is the value of t to use for the 95% confidence interval, looked up in a t distribution table using df = N - 2.

🔢 Example calculation

Using the example data with df = 3:

  • t_0.95 = 3.182 (from the t table)
  • Lower limit: 0.425 - (3.182 × 0.305) = -0.55
  • Upper limit: 0.425 + (3.182 × 0.305) = 1.40
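
Continuing the same sketch, the interval limits above follow from looking up t_0.95 with scipy.stats.t.ppf for df = 3:

```python
from scipy import stats

b, s_b, df = 0.425, 0.305, 3
t_crit = stats.t.ppf(0.975, df)   # 3.182 for a 95% interval (two-tailed)
lower = b - t_crit * s_b          # ≈ -0.55
upper = b + t_crit * s_b          # ≈ 1.40
print(lower, upper)
```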

🔗 Testing Pearson's correlation (r)

🧮 The t-test formula for correlation

The formula for testing Pearson's r is:

  • t = r times the square root of [(N - 2) divided by (1 minus r squared)]

where N is the number of pairs of scores.

🔄 Equivalence with slope test

For the example data:

  • r = 0.627
  • N = 5
  • t = 0.627 times square root of [3 / (1 - 0.627 squared)] = 1.39
  • Key finding: This is the same t value obtained in the t test of b.
  • Degrees of freedom is also the same: N - 2 = 3.

Don't confuse: Although the formulas look different, testing the slope and testing the correlation are equivalent—if one is significant, the other must be significant too.
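
The equivalence can be verified numerically. A minimal sketch plugging the example values into the formula for testing r:

```python
import numpy as np
from scipy import stats

r, N = 0.627, 5
t_r = r * np.sqrt((N - 2) / (1 - r ** 2))   # ≈ 1.39, identical to the t for the slope
p = 2 * stats.t.sf(abs(t_r), df=N - 2)      # same two-tailed p ≈ 0.26
print(t_r, p)
```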

📏 Standard error formulas

🔢 Population vs sample formulas

The excerpt shows two versions of the standard error of the estimate:

Context | Denominator | Reason
Population | N | All parameters known
Sample | N - 2 | Two parameters (slope and intercept) were estimated

🔗 Alternative formula using correlation

There is a version of the standard error formula in terms of Pearson's correlation:

  • Uses ρ (the population value of Pearson's correlation) and SSY (sum of squared deviations of Y from the mean of Y)
  • For the example: mean of Y = 2.06, SSY = 4.597, r = 0.6268
  • The formula produces the same value (0.747) as the direct calculation from prediction errors.
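
As a quick numerical check, the sketch below substitutes the sample r for ρ in the correlation-based formula (an assumption made here only to verify the arithmetic):

```python
import numpy as np

r, SSY, N = 0.6268, 4.597, 5
s_est = np.sqrt((1 - r ** 2) * SSY / N)   # ≈ 0.747, matching the direct calculation
print(s_est)
```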
95

Influential Observations

Influential Observations

🧭 Overview

🧠 One-sentence thesis

An observation's influence on regression results depends on both its leverage (how far its predictor value is from the mean) and its distance (how large its prediction error is), with high values of both creating extremely influential points that can dramatically alter the regression line.

📌 Key points (3–5)

  • What influence means: how much predictions for other observations would change if a given observation were excluded from the analysis.
  • Two factors that create influence: leverage (how extreme the predictor value is) and distance (how large the prediction error is).
  • Common confusion: high leverage alone or high distance alone does not guarantee high influence—it's the combination that matters.
  • Rule of thumb: Cook's D over 1.0 suggests too much influence, though this should be applied judiciously.
  • Why it matters: a single influential observation can greatly affect the regression slope and predictions, so identifying them is critical for interpreting results correctly.

🎯 Understanding influence

🎯 What influence measures

Influence: how much the predicted scores for other observations would differ if the observation in question were not included.

  • Influence is measured by Cook's D, which is proportional to the sum of squared differences between two sets of predictions:
    • Predictions made with all observations included
    • Predictions made leaving out the observation in question
  • If predictions stay the same with or without the observation → no influence
  • If predictions differ greatly when the observation is excluded → the observation is influential
  • Example: If removing one data point causes the regression line to shift dramatically and changes predictions for all other points, that point has high influence.

📏 Cook's D threshold

  • A common rule of thumb: Cook's D > 1.0 indicates too much influence
  • The excerpt emphasizes this should be "applied judiciously and not thoughtlessly"
  • Don't confuse: this is a guideline, not an absolute cutoff

🔧 The two components of influence

⚖️ Leverage: distance from the mean predictor

Leverage: based on how much the observation's value on the predictor variable differs from the mean of the predictor variable.

  • The greater the leverage, the more potential to be influential
  • Key insight: an observation at the mean of the predictor has no influence on the slope, regardless of its criterion value
  • An observation extreme on the predictor has the potential to affect the slope greatly (depending on its distance)
  • Example: An observation with a predictor value far from the mean has high leverage; one near the mean has low leverage.

How leverage is calculated:

  • Standardize the predictor variable (mean = 0, standard deviation = 1)
  • Square the observation's standardized predictor value
  • Add 1 and divide by the number of observations

📐 Distance: prediction error size

Distance: based on the error of prediction for the observation—the greater the error of prediction, the greater the distance.

  • Most commonly measured by the studentized residual
  • Related to the error of prediction divided by the standard deviation of prediction errors
  • Important detail: the predicted score is derived from a regression equation that excludes the observation in question
  • Example: An observation far from the regression line (large prediction error) has high distance; one close to the line has low distance.

🔗 Why both matter together

  • High distance with low leverage → not much influence
  • High leverage with low distance → not much influence
  • High leverage and high distance → extremely influential
  • Don't confuse: you cannot assess influence by looking at only one component

📊 Example walkthrough

📊 Five observations compared

The excerpt provides a dataset with five observations (A through E) to illustrate how leverage, distance, and influence interact:

Observation | X | Y | Leverage (h) | Studentized Residual (R) | Cook's D
A | 1 | 2 | 0.39 | -1.02 | 0.40
B | 2 | 3 | 0.27 | -0.56 | 0.06
C | 3 | 5 | 0.21 | 0.89 | 0.11
D | 4 | 6 | 0.20 | 1.22 | 0.19
E | 8 | 7 | 0.73 | -1.68 | 8.86

🔍 Pattern analysis

Observation A:

  • Fairly high leverage + relatively high residual → moderately high influence (0.4)

Observation B:

  • Small leverage + relatively small residual → very little influence (0.06)

Observation C:

  • Small leverage + relatively high residual → relatively low influence (0.11)
  • Demonstrates: high distance alone doesn't create high influence if leverage is low

Observation D:

  • Lowest leverage + second highest residual → much less influence (0.19) than Observation A
  • Key lesson: even though its residual is much higher than A's, its influence is lower because of low leverage

Observation E:

  • By far the largest leverage (0.73) + largest residual (-1.68) → extremely influential (8.86)
  • This is the only observation exceeding the Cook's D > 1.0 threshold
  • The combination of high leverage and high distance makes it dominate the regression results

📈 Visual interpretation

The excerpt describes Figure 1, which shows:

  • Blue line: regression with all observations included
  • Red line: regression with the circled observation excluded
  • For Observation E: it lies very close to the blue line (when included) but very far from the red line (when excluded)
  • This visual demonstrates that E's presence dramatically changes where the regression line is positioned

🧮 Calculation details (optional)

🧮 Computing Cook's D

Three steps:

  1. Predict all scores twice: once using all observations, once excluding the observation in question
  2. Compute the sum of squared differences between these two sets of predictions
  3. Divide by 2 times the MSE (mean squared error)

🧮 Computing leverage (h)

Three steps:

  1. Standardize the predictor variable (mean = 0, SD = 1)
  2. Square the observation's standardized predictor value
  3. Add 1 and divide by the number of observations

Note: the excerpt states that studentized residual calculations are "a bit complex and beyond the scope of this work," so full distance-calculation details are not provided.
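
Both recipes above can be applied directly to the five observations from the table. A minimal NumPy sketch that reproduces the leverage values and Cook's D (studentized residuals are omitted, since the excerpt does not give their formula):

```python
import numpy as np

# Five observations (A through E) from the table above
X = np.array([1.0, 2.0, 3.0, 4.0, 8.0])
Y = np.array([2.0, 3.0, 5.0, 6.0, 7.0])
N = len(X)

def fit(x, y):
    """Least-squares slope and intercept for a simple regression."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return b, y.mean() - b * x.mean()

# Leverage, following the excerpt's recipe: standardize the predictor
# (sample SD), square each standardized value, add 1, divide by N.
z = (X - X.mean()) / X.std(ddof=1)
leverage = (1 + z ** 2) / N          # ≈ 0.39, 0.27, 0.21, 0.20, 0.73

# Cook's D: sum of squared differences between predictions made with all
# observations and predictions made leaving one out, divided by 2 * MSE.
b_all, a_all = fit(X, Y)
pred_all = a_all + b_all * X
MSE = np.sum((Y - pred_all) ** 2) / (N - 2)

cooks_d = []
for i in range(N):
    keep = np.arange(N) != i
    b_i, a_i = fit(X[keep], Y[keep])
    cooks_d.append(np.sum((pred_all - (a_i + b_i * X)) ** 2) / (2 * MSE))

print(leverage)   # leverage for A..E
print(cooks_d)    # ≈ [0.40, 0.06, 0.11, 0.19, 8.86]
```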

96

Regression Toward the Mean

Regression Toward the Mean

🧭 Overview

🧠 One-sentence thesis

Regression toward the mean explains why extreme performances that depend partly on luck tend to be closer to average on retests, because only the skill component—not the luck—carries over to future performance.

📌 Key points (3–5)

  • Core phenomenon: People with extreme scores on a test that includes both skill and chance will, on average, score closer to the mean on a retest because their luck does not persist.
  • When it occurs: Regression toward the mean happens whenever the correlation between test and retest is less than perfect (r < 1), which is true for any measure with a chance component.
  • Mathematical basis: The regression equation for standardized variables shows that predicted scores move closer to the mean by a factor equal to the correlation coefficient.
  • Common confusion: Mistaking regression effects for real treatment effects—extreme performers naturally move toward the mean on retests, even without any intervention.
  • Why it matters: Failing to account for regression toward the mean leads to incorrect conclusions about causation, especially in experiments without control groups.

🎲 Pure chance vs. mixed tasks

🎲 Pure chance: coin flipping

The excerpt begins with a purely chance-based task to illustrate the concept clearly.

  • In a simulation, 25 people each predicted 12 coin flips.
  • One subject got 10 correct—clearly very lucky.
  • Key insight: No matter how well someone did, the best prediction for their retest score is 6 (the mean), because each flip has probability 0.5 and there are 12 trials.
  • The binomial distribution with N = 12 and p = 0.50 has a mean of 6.
  • Example: Even the luckiest subject who scored 10/12 is predicted to score 6/12 on a retest, because luck does not carry over.

🎯 Mixed tasks: skill plus chance

The excerpt then introduces "Test A," which combines skill and chance.

  • Test A: 6 coin flip predictions + 6 history questions (mean score on history = 4).
  • A subject scoring very high (e.g., 10/12) likely did well on both components.
  • Why regression occurs: To score 10/12 with only 4 correct history answers would require getting all 6 coin predictions correct—exceptionally good luck.
  • On Test B (similar format), the subject's history knowledge remains helpful, but their coin-flip luck will not persist.
  • Prediction: Their Test B score will fall somewhere between their Test A score and the mean of Test B.

Regression toward the mean: the tendency of subjects with high values on a measure that includes chance and skill to score closer to the mean on a retest.

🧩 Why regression toward the mean happens

🧩 The essence: skill persists, luck does not

The excerpt emphasizes the fundamental mechanism:

  • People with high scores tend to be above average in both skill and luck.
  • Only the skill portion is relevant to future performance; luck is random and does not carry over.
  • Similarly, people with low scores tend to be below average in both skill and luck, and their bad luck will not persist.
  • Important clarification: Not every high scorer had above-average luck, but on average they did.

📚 Real-world example: exam scores

Almost every behavioral measure has both skill and chance components.

  • A student's final exam grade depends mainly on knowledge (skill).
  • Chance factors include:
    • The exam covers only a subset of material—maybe the student was lucky that poorly understood topics were underrepresented.
    • Random choices between problem-solving approaches that happened to be correct.
    • External factors like fatigue from a random early-morning phone call.
    • Guessing on multiple-choice questions.
  • Because of these chance elements, extreme scores will tend to regress toward the mean on a retest.

📐 Mathematical foundation

📐 The regression equation

The excerpt provides the formula for predicting standardized scores:

Z_Y' = (r)(Z_X)

Where:

  • Z_Y' is the predicted standardized score on the retest (Y).
  • Z_X is the standardized score on the initial test (X).
  • r is the correlation between test and retest.

Key implications:

  • If the absolute value of r is less than 1, the predicted Z_Y' will be closer to 0 (the mean for standardized scores) than Z_X is.
  • If r = 0 (pure chance task), the predicted score is always the mean (0), regardless of the initial score.
  • Regression toward the mean occurs whenever r < 1, which is true for any test-retest situation with a chance component.
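
A small simulation makes the implication concrete. The sketch below assumes each score is the sum of a stable skill component and independent luck (an assumption for illustration, not a claim about any particular test); people selected for extreme first scores average much closer to the mean on the retest:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

skill = rng.normal(0, 1, n)           # stable component: carries over to the retest
test = skill + rng.normal(0, 1, n)    # first score = skill + luck
retest = skill + rng.normal(0, 1, n)  # retest gets new, independent luck

top = test > np.quantile(test, 0.95)          # people with extreme first scores
print(test[top].mean(), retest[top].mean())   # retest mean is much closer to the overall mean of 0
```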

📊 SAT example

The excerpt illustrates with Math SAT predicting Verbal SAT (r = 0.835).

Initial score | Predicted score | Interpretation
Math SAT: 1.6 SD above mean | Verbal SAT: (0.835)(1.6) = 1.34 SD above mean | Closer to the mean
Math SAT: far below mean | Verbal SAT: higher than Math SAT | Also closer to the mean

  • A student 1.6 standard deviations above the mean on Math SAT is predicted to score only 1.34 standard deviations above the mean on Verbal SAT.
  • The prediction "regresses" toward the mean by a factor equal to the correlation.

🎯 General principle

The excerpt states the conditions clearly:

  • Regression toward the mean occurs whenever observations are selected based on performance on a task with a random component.
  • You are choosing people partly on skill and partly on luck.
  • Since luck cannot be expected to persist, the best prediction for a second trial is between their first-trial performance and the mean.
  • Degree of regression: The greater the role of chance, the more regression toward the mean occurs.

⚠️ Common errors and misinterpretations

⚠️ The flight instructor example

The excerpt provides a compelling real-world case from Daniel Kahneman's autobiography.

The situation:

  • Kahneman was teaching flight instructors that praise is more effective than punishment.
  • An instructor objected: "In my experience, praising a cadet for a clean maneuver is typically followed by worse performance, whereas screaming at a cadet for bad execution is typically followed by improvement."

The error:

  • The instructor attributed the changes to his praise or criticism.
  • Reality: This is exactly what regression toward the mean predicts.
  • A pilot's performance varies randomly from maneuver to maneuver, even with considerable skill.
  • An extremely clean maneuver likely involved some good luck; the next performance will probably be lower (luck disappears), regardless of praise.
  • A poor performance likely involved bad luck; the next performance will probably be better, regardless of criticism.

Kahneman's demonstration:

  • He had instructors toss a coin at a target twice.
  • Those who did best the first time deteriorated on the second attempt.
  • Those who did worst the first time improved.
  • This demonstrated regression toward the mean in a pure-chance task, making the point undeniable.

⚾ Baseball batting averages

The excerpt analyzes the 10 players with the highest batting averages in 1998.

Prediction: Based on regression toward the mean, these players should have lower batting averages in 1999.

Results:

  • 7 out of 10 players had lower batting averages in 1999.
  • Those who increased were only slightly higher; those who decreased were much lower.
  • Average decrease: 33 points.
  • Important: Most still had excellent 1999 averages, showing that skill was an important component.

Pattern | Interpretation
Most players declined | Consistent with regression toward the mean
Some players increased | Regression is a statistical tendency, not a guarantee for every individual
Still above average in 1999 | High skill persists; only the luck component regressed

Related phenomena:

  • "Sophomore Slump": A "rookie of the year" typically does less well in the second season.
  • "Sports Illustrated Cover Jinx": Similar regression effect.

🧪 Experimental design errors

🧪 Reading-improvement program (hypothetical)

The excerpt describes a flawed experiment:

Design:

  • All first graders took a reading test.
  • The 50 lowest-scoring readers were enrolled in a reading-improvement program.
  • Students were retested after the program; mean improvement was large.

The problem:

  • The initial poor performance was likely due partly to bad luck.
  • Their luck would be expected to improve on the retest, increasing scores with or without the treatment program.
  • Confounding: Regression effects are mixed with real treatment effects, making it impossible to draw firm conclusions.

💊 Propranolol and SAT scores (real example)

The excerpt describes an actual flawed study:

Design:

  • 25 high-school students were chosen because IQ tests and other academic performance indicated they had not done as well as expected on the SAT.
  • They were given propranolol (a drug thought to reduce test anxiety).
  • On a retest, students improved an average of 120 points (vs. 38 points expected from retesting alone).

The problem:

  • The selection method likely chose students who had bad luck on their first SAT.
  • These students would likely have increased their scores on a retest with or without propranolol.
  • Propranolol effects and regression effects are confounded.
  • Conclusion: No firm conclusions can be drawn about whether propranolol actually helped.

How to fix it:

  • Randomly assign students to a propranolol group and a control group.
  • Regression effects would then be the same for both groups on average.
  • A significant difference between groups would provide good evidence for a propranolol effect.

🔍 Key distinction: individual vs. average

Don't confuse: Regression toward the mean is a statistical tendency, not a deterministic rule.

  • The excerpt emphasizes: "Although the predicted scores for every individual will be lower, some of the predictions will be wrong."
  • Example: In the baseball data, some players increased their batting averages from 1998 to 1999, even though the average declined.
  • Regression toward the mean describes what happens on average for a group, not what must happen for every individual.
97

Introduction to Multiple Regression

Introduction to Multiple Regression

🧭 Overview

🧠 One-sentence thesis

Multiple regression extends simple linear regression by predicting a criterion variable from two or more predictor variables simultaneously, allowing researchers to assess each predictor's unique contribution while controlling for the others.

📌 Key points (3–5)

  • What multiple regression does: predicts a criterion variable using a linear combination of two or more predictor variables, finding weights that minimize squared prediction errors.
  • Regression coefficients as partial slopes: each coefficient represents the relationship between one predictor and the criterion holding all other predictors constant.
  • Confounded variance problem: when predictors are correlated, simply adding their individual explained variances double-counts shared variance; the sum of squares explained together is usually less than the sum of separate simple regressions.
  • Common confusion: unique vs. confounded variance—highly correlated predictors may each explain little variance uniquely, yet together explain substantial variance.
  • Testing significance: comparing a complete model (all predictors) to a reduced model (some predictors omitted) reveals whether the omitted variables contribute significantly.

📐 The multiple regression equation

📐 Basic structure

The multiple regression equation: a formula that predicts the criterion variable by summing weighted predictor variables plus a constant.

  • General form: Predicted criterion = (b₁ × Predictor₁) + (b₂ × Predictor₂) + constant
  • Example from the excerpt: UGPA' = 0.541 × HSGPA + 0.008 × SAT + 0.540
  • Each b value is chosen to minimize the sum of squared prediction errors.

🔢 Regression coefficients (weights)

  • The b values are called regression coefficients or regression weights (synonymous terms).
  • They indicate how much the predicted criterion changes when that predictor increases by one unit.
  • Example: the coefficient 0.541 for HSGPA means that for every 1-point increase in high-school GPA, university GPA is predicted to increase by 0.541 points (holding SAT constant).

📊 Multiple correlation (R)

Multiple correlation (R): the correlation between the predicted scores and the actual scores on the criterion variable.

  • In the example, R = 0.79 (correlation between predicted UGPA and actual UGPA).
  • R is always positive because negative predictor–criterion correlations produce negative regression weights, keeping the predicted–actual correlation positive.
  • R² represents the proportion of variance in the criterion explained by all predictors together.

🔍 Understanding partial slopes

🔍 What "partial" means

A regression coefficient in multiple regression is the slope of the linear relationship between the criterion and the part of a predictor that is independent of all other predictors.

  • "Partial slope" = the relationship between predictor and criterion with other predictors held constant.
  • It answers: "If two people have the same value on all other predictors but differ by 1 on this predictor, how much do we expect them to differ on the criterion?"

🧮 How partial slopes are computed

The excerpt explains a two-step process for HSGPA's coefficient:

  1. Create residuals: predict HSGPA from SAT and save the prediction errors (residuals). These residuals represent the part of HSGPA that is independent of SAT.
  2. Find the slope: regress the criterion (UGPA) on these residuals. The slope is the partial regression coefficient.
  • Example: HSGPA.SAT (residuals after predicting HSGPA from SAT) has zero correlation with SAT by construction.
  • The slope of UGPA regressed on HSGPA.SAT is 0.541—the same as the multiple regression coefficient for HSGPA.
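
The two-step procedure can be demonstrated with synthetic data. The sketch below is illustrative only: the variables hsgpa, sat, and ugpa are simulated stand-ins, not the excerpt's actual dataset, and the point is simply that the slope on the residuals matches the multiple-regression coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative synthetic data (placeholders for the real HSGPA/SAT/UGPA data)
hsgpa = rng.normal(3.0, 0.5, 200)
sat = 800 + 200 * hsgpa + rng.normal(0, 80, 200)
ugpa = 0.5 + 0.4 * hsgpa + 0.001 * sat + rng.normal(0, 0.3, 200)

# Multiple regression of the criterion on both predictors
design = np.column_stack([hsgpa, sat, np.ones_like(hsgpa)])
b_hsgpa, b_sat, intercept = np.linalg.lstsq(design, ugpa, rcond=None)[0]

# Step 1: residualize HSGPA on SAT (the part of HSGPA independent of SAT)
slope = np.cov(hsgpa, sat)[0, 1] / np.var(sat, ddof=1)
resid = hsgpa - (hsgpa.mean() + slope * (sat - sat.mean()))

# Step 2: the slope of the criterion regressed on those residuals equals
# the partial regression coefficient from the multiple regression
partial = np.cov(ugpa, resid)[0, 1] / np.var(resid, ddof=1)
print(np.isclose(partial, b_hsgpa))   # True
```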

📏 Interpreting partial slopes

  • The coefficient 0.541 for HSGPA means: holding SAT constant, a 1-point increase in HSGPA is associated with a 0.54-point increase in UGPA.
  • If two students have the same SAT but differ in HSGPA by 2 points, we predict they differ in UGPA by 2 × 0.54 = 1.08.
  • Don't confuse: this is not the total relationship between HSGPA and UGPA; it's the relationship after removing overlap with SAT.

⚖️ Beta weights (standardized coefficients)

Beta weight (β): a regression coefficient computed on standardized variables (all with standard deviation = 1).

  • Beta weights allow comparison across predictors measured on different scales.
  • They represent the change in the criterion (in standard deviations) per one standard-deviation change in the predictor, holding others constant.
  • Example: β = 0.625 for HSGPA and β = 0.198 for SAT means a one-SD increase in HSGPA has a larger effect than a one-SD increase in SAT.
  • Practical implication: if you already know HSGPA, SAT adds little predictive value; but if HSGPA is unknown, SAT is more useful (its simple-regression β is 0.68).

🧩 Partitioning variance

🧩 Total variance breakdown

Just as in simple regression, the total sum of squares (SSY) splits into:

  • SS explained (SSY'): variance in the criterion predicted by the model.
  • SS error (SSE): variance not explained (prediction errors).
  • Formula: SSY = SSY' + SSE
  • Example: 20.798 = 12.961 + 7.837

🔀 The confounding problem

Approach | Sum of squares explained | Why it's wrong or right
Simple regression: HSGPA alone | 12.64 | Counts all variance HSGPA can explain
Simple regression: SAT alone | 9.75 | Counts all variance SAT can explain
Sum of the two | 22.39 | Wrong: double-counts shared variance
Multiple regression: both together | 12.96 | Correct: counts shared variance once

  • HSGPA and SAT are highly correlated (r = 0.78), so much variance is confounded (explainable by either predictor).
  • Simply adding separate sums of squares inflates the total because the overlap is counted twice.

🧱 Three-part partition

The excerpt shows how to split explained variance into three pieces:

Source | Sum of squares | Proportion | Meaning
HSGPA (unique) | 3.21 | 0.15 | Explained by HSGPA alone, independent of SAT
SAT (unique) | 0.32 | 0.02 | Explained by SAT alone, independent of HSGPA
Confounded | 9.43 | 0.45 | Could be attributed to either predictor
Error | 7.84 | 0.38 | Unexplained
Total | 20.80 | 1.00 |

  • Most explained variance (9.43 out of 12.96) is confounded between the two predictors.
  • Don't confuse: small unique contributions (0.15 and 0.02) don't mean the predictors are unimportant—together they explain 0.62 of the variance.

🔬 Computing unique contributions

Unique sum of squares for a predictor = SS for the complete model − SS for the reduced model (omitting that predictor).

  • Complete model: includes all predictors (HSGPA and SAT → SS = 12.96).
  • Reduced model for HSGPA: omits HSGPA, includes only SAT (SS = 9.75).
  • Unique SS for HSGPA = 12.96 − 9.75 = 3.21.
  • Similarly, unique SS for SAT = 12.96 − 12.64 = 0.32.

🎯 Why consider sets of variables

When predictors are highly intercorrelated:

  • Individual unique contributions may be tiny.
  • Yet the set of predictors together may explain substantial variance.
  • Example: multiple cognitive-ability measures may overlap heavily, so none explains much uniquely, but the set explains a lot.
  • Solution: test the variance explained by the entire set (includes unique + confounded variance within the set, excluding confounding with outside variables).

📊 Significance testing

📊 General F-test formula

The excerpt presents a formula comparing a complete model (all predictors) to a reduced model (some predictors omitted):

  • Numerator: (SS_complete − SS_reduced) / (number of predictors in complete − number in reduced)
  • Denominator: (SS_total − SS_complete) / (N − number of predictors in complete − 1)
  • This ratio equals MS_explained / MS_error.
  • Degrees of freedom: numerator = p_c − p_r; denominator = N − p_c − 1.

If F is significant, the omitted variables contribute significantly beyond the retained variables.

🧪 Testing overall R²

To test whether R² differs significantly from zero:

  • Define the reduced model as having no predictors (SS_reduced = 0, p_r = 0).
  • The formula simplifies to: F = (SS_complete / p_c) / [(SS_total − SS_complete) / (N − p_c − 1)]
  • Example: F(2, 102) = (12.96/2) / [(20.80 − 12.96)/102] = 6.48 / 0.077 = 84.35, p < 0.001.
  • Conclusion: the model explains significant variance.

🔎 Testing unique contributions of individual predictors

To test whether a single predictor explains unique variance:

  • Reduced model: includes all predictors except the one being tested.
  • Example for HSGPA: reduced model has only SAT (SS = 9.75).
  • F(1, 102) = (12.96 − 9.75) / [(20.80 − 12.96)/102] = 3.21 / 0.077 = 41.80, p < 0.001.
  • Example for SAT: reduced model has only HSGPA (SS = 12.64).
  • F(1, 102) = (12.96 − 12.64) / 0.077 = 0.32 / 0.077 = 4.19, p = 0.0432.
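
These F ratios can be reproduced from the sums of squares alone. A minimal sketch, assuming N = 105 students (implied by the 102 denominator degrees of freedom with two predictors):

```python
from scipy import stats

SS_total = 20.80
SS_complete = 12.96      # HSGPA and SAT together
N = 105                  # implied by N - 2 - 1 = 102 denominator df

def f_test(ss_complete, ss_reduced, p_c, p_r):
    """General F test comparing a complete model to a reduced model."""
    num = (ss_complete - ss_reduced) / (p_c - p_r)
    den = (SS_total - ss_complete) / (N - p_c - 1)
    F = num / den
    p = stats.f.sf(F, p_c - p_r, N - p_c - 1)
    return F, p

print(f_test(SS_complete, 0.00, 2, 0))    # overall R-squared: F ≈ 84, p < .001
print(f_test(SS_complete, 9.75, 2, 1))    # unique contribution of HSGPA: F ≈ 42, p < .001
print(f_test(SS_complete, 12.64, 2, 1))   # unique contribution of SAT: F ≈ 4.2, p ≈ .04
```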

🔗 Equivalence of tests

  • Testing the unique variance explained by a predictor is identical to testing whether its regression coefficient is zero.
  • Both reflect the predictor–criterion relationship independent of other predictors.
  • If unique variance ≠ 0, then the coefficient ≠ 0 (and vice versa).

⚙️ Assumptions

⚙️ When assumptions matter

No assumptions are necessary for computing regression coefficients or partitioning sums of squares, but several assumptions are required for interpreting inferential statistics.

  • Moderate violations of Assumptions 1–3 don't seriously affect tests of predictor significance.
  • Even small violations cause problems for confidence intervals on predictions for specific observations.

📏 Assumption 1: Normality of residuals

  • Residuals = errors of prediction (actual scores − predicted scores).
  • Assumption: residuals are normally distributed.
  • The excerpt's Q-Q plot shows minor deviations (lower tail doesn't increase as expected; highest value is higher than expected), but overall the distribution is close to normal.

📏 Assumption 2: Homoscedasticity

Homoscedasticity: the variance of prediction errors is the same for all predicted values.

  • Assumption: error variance is constant across the range of predictions.
  • The excerpt's example violates this: errors are much larger for low-to-medium predicted scores than for high predicted scores.
  • Consequence: confidence intervals for low predicted UGPA would underestimate uncertainty.

📏 Assumption 3: Linearity

  • Assumption: the relationship between each predictor and the criterion is linear.
  • If violated, predictions may systematically overestimate for one range of predictor values and underestimate for another.
  • The excerpt does not provide a plot or test for this assumption in the example data.

🔧 Other inferential statistics

The excerpt mentions two important topics beyond its scope:

  1. Confidence intervals on regression slopes: quantify uncertainty about the true population coefficients.
  2. Confidence intervals on predictions for specific observations: quantify uncertainty about a predicted score for a new individual.

Both can be computed by standard statistical software (R, SPSS, STATA, SAS, JMP).

98

One-Factor ANOVA (Between Subjects)

Introduction

🧭 Overview

🧠 One-sentence thesis

One-factor ANOVA tests whether population means across multiple conditions are equal by comparing variation between groups to variation within groups, and rejecting the null hypothesis indicates at least one group mean differs from the others.

📌 Key points (3–5)

  • What ANOVA tests: the null hypothesis that all condition population means are equal (μ₁ = μ₂ = ... = μₖ).
  • Between-subjects design: different subjects are used for each level of the factor, so comparisons are between separate groups.
  • Two key estimates: Mean Square Between (MSB) and Mean Square Error (MSE) capture different sources of variation.
  • Common confusion: ANOVA is conceptually a two-tailed test even though only one tail of the F distribution is used; also, don't confuse between-subjects factors (different groups) with within-subjects factors (same subjects, repeated measures).
  • Relationship to t-test: the F distribution is related to the t distribution, connecting ANOVA to simpler two-group comparisons.

🔬 ANOVA design fundamentals

🔬 What is a one-factor ANOVA

One-factor ANOVA: an analysis of variance with a single independent variable (factor).

  • The factor can have multiple levels (conditions).
  • Example: the "Smiles and Leniency" study has one factor ("Type of Smile") with four levels (false, felt, miserable, neutral).
  • Each level is represented by a separate group of subjects (34 subjects per condition).

🧑‍🤝‍🧑 Between-subjects factor

Between-subjects factor (or between-subjects variable): a factor where different subjects are used for each level.

  • Comparisons are made between different groups of subjects.
  • The term "between subjects" emphasizes that each subject appears in only one condition.
  • Don't confuse with within-subjects factors, where the same subjects are tested under all levels (repeated measures).

Factor type | Subject assignment | Comparison type
Between-subjects | Different subjects per level | Between separate groups
Within-subjects | Same subjects across all levels | Within the same subjects (repeated measures)

🧪 The null hypothesis

  • ANOVA tests whether all population means are identical:
    • H₀: μ₁ = μ₂ = ... = μₖ (where k is the number of conditions).
  • Example: in the "Smiles and Leniency" study, k = 4, so H₀: μ_false = μ_felt = μ_miserable = μ_neutral.
  • If the null hypothesis is rejected, at least one population mean differs from the others (the excerpt does not specify which one).

📐 Key components of ANOVA

📐 Mean Square Error (MSE)

Mean Square Error (MSE): an estimate of variation within groups (error variation).

  • The excerpt states that MSE estimates something specific when H₀ is true and something different when H₀ is false (the exact estimates are listed in the learning objectives but not detailed in the excerpt body).
  • MSE captures the variability of scores within each condition.

📐 Mean Square Between (MSB)

Mean Square Between (MSB): an estimate of variation between group means.

  • Like MSE, MSB estimates different quantities depending on whether the null hypothesis is true or false.
  • MSB reflects how much the condition means differ from each other.

🔢 The F statistic

  • ANOVA computes an F statistic from MSB and MSE.
  • F has two degrees of freedom parameters (the excerpt mentions computing them but does not provide the formulas).
  • The F distribution has a specific shape (described in the learning objectives but not detailed in the excerpt).

🎯 Why ANOVA is "two-tailed" conceptually

  • The excerpt states that ANOVA is best thought of as a two-tailed test even though literally only one tail of the F distribution is used.
  • This reflects that ANOVA detects any difference among means, whether some are higher or lower than others.
  • Don't confuse: the F test uses only the upper tail, but it is sensitive to deviations in any direction.

🔗 Connections and assumptions

🔗 Relationship to the t distribution

  • The excerpt mentions a relationship between the t and F distributions.
  • This connects ANOVA (which can compare multiple groups) to the t-test (which compares two groups).

📋 Assumptions of one-way ANOVA

  • The excerpt lists "assumptions of a one-way ANOVA" as a learning objective but does not detail them in the provided text.

🧮 Partitioning sums of squares

  • ANOVA partitions total variation into:
    • Variation due to conditions (between groups).
    • Error variation (within groups).
  • This breakdown allows the test to compare systematic differences (conditions) to random variation (error).

🖥️ Practical considerations

🖥️ Data format for analysis

  • The excerpt mentions formatting data for use with a computer statistics program.
  • In the "Smiles and Leniency" study, there was one score per subject, with 34 subjects in each of four conditions.

🧩 Multi-factor designs (context)

  • The excerpt briefly describes factorial designs with more than one factor.
  • Example: a Gender (2) × Age (3) design has two factors (gender with 2 levels, age with 3 levels) and 6 total groups when all combinations are included.
  • Complex designs may mix between-subjects and within-subjects factors.
  • Don't confuse: one-factor ANOVA (this section) has only a single factor; multi-factor ANOVA (e.g., two-way ANOVA) has two or more factors.
99

Analysis of Variance Designs

Analysis of Variance Designs

🧭 Overview

🧠 One-sentence thesis

ANOVA designs organize experiments by factors (independent variables) and their levels, distinguishing between-subjects and within-subjects comparisons, with factorial designs combining multiple factors to test whether population means differ across conditions.

📌 Key points (3–5)

  • Factor and levels: a factor is an independent variable in ANOVA terminology; levels are the different conditions or values of that factor.
  • Between-subjects vs within-subjects: between-subjects factors use different groups for each level; within-subjects factors test the same subjects under all levels (repeated measures).
  • Factorial design: when all combinations of levels from multiple factors are included, enabling study of interactions between factors.
  • Common confusion: ANOVA tells you that at least one mean differs, but not which means differ; the Tukey HSD test is more specific and can be used without ANOVA.
  • Why learn ANOVA: it is the most common technique for comparing means and is needed to understand research reports, even though Tukey HSD may be preferable in simple cases.

🔬 Factors and levels

🏷️ What a factor is

Factor: a synonym for independent variable in ANOVA design terminology.

  • An independent variable is a variable manipulated by the experimenter.
  • Example: in the "Smiles and Leniency" study, "Type of Smile" is the factor.
  • The factor is the broad category being tested, not the specific conditions.

📊 What levels are

Levels: the different conditions or values within a factor.

  • Levels are the specific variations of the independent variable.
  • Example: "Type of Smile" has four levels—neutral, false, felt, and miserable.
  • The number of levels determines how many groups or conditions are compared.

🔢 One-way vs two-way ANOVA

  • One-way ANOVA: only one factor is present.
    • Example: comparing four smile types (one factor, four levels).
  • Two-way ANOVA: two factors are present.
    • Example: age (three levels: 8, 10, 12 years) and gender (two levels: male, female) on reading speed.
  • The number in "one-way" or "two-way" refers to the number of factors, not levels.

👥 Between-subjects vs within-subjects factors

👥 Between-subjects factor

Between-subjects factor (or between-subjects variable): a factor where different subjects are used for each level.

  • Comparisons are made between different groups of subjects.
  • Each subject experiences only one level of the factor.
  • Example: in "Smiles and Leniency," four separate groups of subjects saw the four smile types.
  • The term "between subjects" emphasizes that you are comparing across distinct groups.

🔁 Within-subjects factor

Within-subjects factor (or within-subjects variable): a factor where the same subjects are used for all levels.

  • Comparisons are made within the same subjects across different conditions.
  • Each subject experiences every level of the factor.
  • Also called repeated-measures variables because the same subjects are measured multiple times.
  • Example: in the "ADHD Treatment" study, every subject was tested with each of four dosage levels (0, 0.15, 0.30, 0.60 mg/kg).
  • Don't confuse: within-subjects does not mean "within one condition"; it means the same people are tested under all conditions.

🧩 Factorial designs

🧩 What a factorial design is

Factorial design: a design where all combinations of the levels of all factors are included.

  • Used when an experiment has more than one factor.
  • Allows examination of how factors interact, not just their individual effects.
  • Example: a Gender (2) × Age (3) factorial design has 2 levels of gender and 3 levels of age, creating 6 groups total (female-8, female-10, female-12, male-8, male-10, male-12).

📐 Describing factorial designs

  • A concise notation uses the format: Factor1 (number of levels) × Factor2 (number of levels).
  • The numbers in parentheses indicate how many levels each factor has.
  • Example: Gender (2) × Age (3) means gender has 2 levels and age has 3 levels.

🔀 Complex designs

  • Designs can have more than two factors.
  • They may combine between-subjects and within-subjects factors in the same experiment.
  • The excerpt does not detail these, but notes they are common in practice.

🎯 ANOVA vs Tukey HSD

🎯 What ANOVA tells you

  • ANOVA tests the omnibus null hypothesis: all population means are equal (H₀: μ₁ = μ₂ = ... = μₖ, where k is the number of conditions).
  • If the null hypothesis is rejected, the conclusion is that at least one population mean is different from at least one other mean.
  • ANOVA does not reveal which specific means differ from which.
  • It offers less specific information than the Tukey HSD test.

🔍 Why Tukey HSD is preferable

  • The Tukey HSD test identifies which specific means are different.
  • Some textbooks introduce Tukey only as a follow-up to ANOVA, but there is no logical or statistical reason to require ANOVA first.
  • You can use the Tukey test even without computing an ANOVA.

📚 Why learn ANOVA anyway

Reason | Explanation
Complex analyses | Some complex types of analyses can be done with ANOVA but not with the Tukey test
Understanding research | ANOVA is by far the most commonly-used technique for comparing means; understanding it is important to understand research reports

  • Don't confuse: ANOVA being "most common" does not mean it is always the best choice for a given analysis.
100

Between- and Within-Subjects Factors

Between- and Within-Subjects Factors

🧭 Overview

🧠 One-sentence thesis

The distinction between between-subjects and within-subjects factors determines whether comparisons are made across different groups of people or across repeated measurements of the same people, and this classification shapes the structure of multi-factor experimental designs.

📌 Key points (3–5)

  • Between-subjects factor: each level of the factor uses a different group of subjects; comparisons are between different groups.
  • Within-subjects factor: the same subjects are tested at every level of the factor; comparisons are within the same group across conditions (also called repeated measures).
  • Common confusion: "between" vs "within" refers to where the comparison happens—between different people or within the same people—not to the number of conditions.
  • Factorial designs: when all combinations of factor levels are included, the design is factorial (e.g., Gender (2) × Age (3) means 2 genders × 3 age levels = 6 groups total).
  • Real experiments often mix both: complex designs may combine between-subjects and within-subjects factors in the same study.

🔬 Between-subjects factors

🔬 What a between-subjects factor is

Between-subjects factor (or between-subjects variable): a factor whose levels are represented by separate groups of subjects.

  • The term "between subjects" reflects that comparisons are between different groups of subjects.
  • Each subject appears in only one level of the factor.
  • Example: In the "Smiles and Leniency" study, the factor "Type of Smile" had four levels (false, felt, miserable, neutral), and each level used a different group of 34 subjects. Comparisons were made between these four separate groups.

🧑‍🤝‍🧑 Why it's called "between"

  • The key is that you are comparing Group A vs Group B vs Group C, etc.
  • Different people in each condition → the variance you measure is between those different people.

🔁 Within-subjects factors

🔁 What a within-subjects factor is

Within-subjects factor (or within-subjects variable): a factor whose levels are all tested using the same subjects.

  • Comparisons are not between different groups but between conditions within the same subjects.
  • There is only one group of subjects, measured repeatedly under different conditions.
  • Example: In the "ADHD Treatment" study, every subject was tested with each of four dosage levels (0, 0.15, 0.30, 0.60 mg/kg). The same people were measured four times, so comparisons were within those same individuals across dosages.

🔄 Repeated-measures terminology

  • Within-subjects variables are sometimes called repeated-measures variables because there are repeated measurements of the same subjects.
  • Don't confuse: "repeated measures" does not mean "measuring the same thing twice for reliability"; it means measuring the same person under different experimental conditions.

🆚 How to distinguish between- vs within-subjects

Aspect | Between-subjects | Within-subjects
Groups | Different subjects in each level | Same subjects in all levels
Comparison | Between different people | Within the same people
Number of scores per subject | One score per subject (one condition only) | Multiple scores per subject (one per condition)
Also called | Between-subjects variable | Repeated-measures variable

  • The excerpt emphasizes: if you see "different subjects" for each level → between; if you see "every subject tested with each level" → within.

🧩 Multi-factor designs

🧩 What multi-factor designs are

  • It is common for designs to have more than one factor.
  • Example: A hypothetical study of the effects of age and gender on reading speed.
    • Factor 1: Gender (2 levels: male, female)
    • Factor 2: Age (3 levels: 8 years, 10 years, 12 years)
    • Total groups: 2 × 3 = 6 different groups (female-8, female-10, female-12, male-8, male-10, male-12).

🏗️ Factorial designs

Factorial design: a design in which all combinations of the levels of all factors are included.

  • When every possible combination is tested, the design is factorial.
  • A concise notation: Gender (2) × Age (3) factorial design, where the numbers in parentheses indicate the number of levels for each factor.
  • This notation quickly tells you the structure: 2 levels of gender, 3 levels of age, and all 6 combinations are included.

🔀 Complex designs

  • Designs frequently have more than two factors.
  • They may have combinations of between- and within-subjects factors in the same experiment.
  • Example: Age and gender might be between-subjects (different people in each age-gender group), while dosage level might be within-subjects (same people tested at multiple dosages).

📊 ANOVA terminology context

📊 One-way vs two-way ANOVA

  • The excerpt briefly mentions that the number of factors determines the ANOVA name:
    • One factor → one-way ANOVA
    • Two factors → two-way ANOVA
  • Example: The age and gender reading-speed study would use a two-way ANOVA because it has two factors (age and gender).
  • Age would have three levels; gender would have two levels.

🧮 Levels of a factor

  • A level is a specific value or category within a factor.
  • Example: The factor "age" has three levels (8, 10, 12 years); the factor "gender" has two levels (male, female).
  • The total number of groups in a factorial design = product of the number of levels of all factors.
101

One-Factor ANOVA (Between Subjects)

One-Factor ANOVA (Between Subjects)

🧭 Overview

🧠 One-sentence thesis

One-factor ANOVA tests whether population means across multiple groups are equal by comparing two variance estimates—MSE (within-group variance) and MSB (between-group variance)—where a much larger MSB than MSE indicates the population means likely differ.

📌 Key points (3–5)

  • What ANOVA tests: the null hypothesis that all population means are equal across k conditions.
  • Two variance estimates: MSE estimates population variance regardless of whether means are equal; MSB estimates population variance only if means are equal, otherwise it estimates something larger.
  • The F ratio logic: if MSB is much larger than MSE, population means are unlikely to be equal; the F ratio (MSB/MSE) quantifies this comparison.
  • Common confusion: ANOVA is best thought of as a two-tailed test even though only the right tail of the F distribution is used, because F is sensitive to any pattern of differences among means.
  • Partitioning variation: ANOVA splits total variation into variation due to conditions (SSQ condition) and unexplained variation (SSQ error).

📐 The two variance estimates

📊 Mean Square Error (MSE)

MSE: an estimate of population variance (σ²) based on differences among scores within the groups.

  • MSE estimates σ² regardless of whether the null hypothesis is true.
  • It is computed as the mean of the sample variances across all groups.
  • Example: In the "Smiles and Leniency" study with four groups, MSE = 2.6489.
  • Why it's stable: differences in population means do not affect variances within groups, so MSE always estimates σ².

📈 Mean Square Between (MSB)

MSB: an estimate based on differences among the sample means.

  • MSB estimates σ² only if the population means are equal.
  • If population means are not equal, MSB estimates a quantity larger than σ².
  • Computed in three steps:
    1. Compute the means for each group.
    2. Compute the variance of those means.
    3. Multiply the variance of the means by n (the number of observations in each group).
  • Example: For the leniency data, variance of the four sample means = 0.270; MSB = 0.270 × 34 = 9.179.

🔍 The key comparison

  • If population means are equal, both MSE and MSB estimate σ² and should be about the same.
  • If population means are not equal, MSB will be larger than MSE.
  • The larger the differences among sample means, the larger the MSB.
  • Don't confuse: MSE and MSB are based on different aspects of the data—MSE from sample variances, MSB from sample means.

🧮 Computing the F ratio and testing significance

🎲 The F ratio

  • The F ratio is the ratio of MSB to MSE: F = MSB / MSE.
  • Example: F = 9.179 / 2.649 = 3.465.
  • This means MSB is 3.465 times higher than MSE.
  • Interpretation: Would this ratio be likely if all population means were equal? That depends on sample size.

📉 The F distribution

  • The F distribution has a positive skew.
  • Its shape depends on two degrees of freedom parameters:
    • Numerator df (for MSB): k - 1, where k is the number of groups.
    • Denominator df (for MSE): N - k, where N is total number of observations.
  • Example: With k = 4 groups and N = 136 observations, df numerator = 3, df denominator = 132.
  • The area to the right of the observed F value represents the probability of getting that F or larger if the null hypothesis is true.
  • Example: For F = 3.465, p = 0.018, so the null hypothesis can be rejected.
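
A minimal sketch of the final step, computing F from MSB and MSE and looking up the right-tail probability of the F(3, 132) distribution:

```python
from scipy import stats

MSB, MSE = 9.179, 2.649
k, N = 4, 136                     # four smile conditions, 34 subjects each

F = MSB / MSE                     # ≈ 3.465
p = stats.f.sf(F, k - 1, N - k)   # ≈ 0.018, the area to the right of F
print(F, p)
```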

🔄 One-tailed or two-tailed?

  • Literally, the probability is from the right-hand tail of the F distribution (one-tailed).
  • However, the F ratio is sensitive to any pattern of differences among means.
  • Best considered a two-tailed test because it tests a two-tailed hypothesis.

🔗 Relationship to the t test

  • When there are only two groups, ANOVA and an independent-groups t test give the same result.
  • The relationship: F(1, df) = t², where the denominator degrees of freedom of the F test equal the degrees of freedom of the t test.
  • It does not matter which test you use for two groups—the results will always be the same.

🧱 Assumptions and requirements

✅ Three key assumptions

  1. Homogeneity of variance: The populations have the same variance.
  2. Normality: The populations are normally distributed.
  3. Independence: Each value is sampled independently from each other value; each subject provides only one score.
  • These are the same assumptions as for a t test, except they apply to two or more groups.
  • Don't confuse: If a subject provides two scores, the values are not independent and require a different analysis (within-subjects ANOVA).

📏 Sample size notation

  • n: the number of observations in each group.
  • N: the total number of observations across all groups.
  • Example: Four groups with 34 observations each → n = 34, N = 136.
  • The calculations in this section assume equal sample sizes; unequal sizes require adjusted formulas.

🧩 Partitioning variation (sums of squares)

🌐 Sources of variation

  • Scores differ for many reasons: experimental condition, individual differences, unmeasured factors (mood, etc.).
  • All variation except the experimental condition is called error (unexplained variance).
  • ANOVA partitions total variation into:
    • Variation due to conditions (SSQ condition).
    • Variation due to error (SSQ error).

📊 Total sum of squares (SSQ total)

SSQ total: the sum of squared differences between each score and the grand mean.

  • Grand mean (GM): the mean of all subjects across all conditions.
  • Formula: take each score, subtract GM, square the difference, sum all squared values.
  • Example: For the leniency study, SSQ total = 377.19.

🎯 Sum of squares condition (SSQ condition)

  • Based on differences among the sample means.
  • Formula (equal sample sizes): n × [(M₁ - GM)² + (M₂ - GM)² + ... + (Mₖ - GM)²], where Mᵢ is the mean for condition i.
  • Example: SSQ condition = 27.535.
  • For unequal sample sizes, each term is weighted by its sample size nᵢ.

⚠️ Sum of squares error (SSQ error)

  • The sum of squared deviations of each score from its group mean.
  • Can be computed directly or by subtraction: SSQ error = SSQ total - SSQ condition.
  • Example: SSQ error = 377.19 - 27.53 = 349.66.

🔢 From sums of squares to mean squares

  • MSB = SSQ condition / df numerator, where df numerator = k - 1.
  • MSE = SSQ error / df denominator, where df denominator = N - k (also called df error).
  • Example: MSB = 27.535 / 3 = 9.18; MSE = 349.66 / 132 = 2.65.

📋 Summary table and data formatting

📑 ANOVA summary table

A standard way to present ANOVA results:

Source | df | SSQ | MS | F | p
Condition | k - 1 | SSQ condition | MSB | F ratio | probability
Error | N - k | SSQ error | MSE | |
Total | N - 1 | SSQ total | | |

  • Mean squares are always sums of squares divided by degrees of freedom.
  • F and p are relevant only to the Condition row.

💾 Formatting data for computer analysis

  • Most programs require two variables:
    1. A grouping variable (which group the subject is in).
    2. The score variable (the dependent variable).
  • Example: Instead of separate columns for each group, use one column for group labels (1, 2, 3) and one column for scores.

Group | Score
1 | 3
1 | 4
1 | 5
2 | 2
2 | 4
2 | 6
3 | 8
3 | 5
3 | 5
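
Using the small formatted dataset above, a minimal sketch that computes the sums of squares, mean squares, and F by hand and cross-checks the result with scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

# The three groups from the formatting example above
g1 = np.array([3, 4, 5])
g2 = np.array([2, 4, 6])
g3 = np.array([8, 5, 5])

groups = [g1, g2, g3]
k = len(groups)
n = len(g1)                          # equal sample sizes
N = k * n
GM = np.mean(np.concatenate(groups)) # grand mean

SSQ_condition = n * sum((g.mean() - GM) ** 2 for g in groups)
SSQ_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
MSB = SSQ_condition / (k - 1)
MSE = SSQ_error / (N - k)
F = MSB / MSE
p = stats.f.sf(F, k - 1, N - k)

print(F, p)                          # manual calculation
print(stats.f_oneway(g1, g2, g3))    # scipy gives the same F and p
```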
102

Multi-Factor Between-Subjects Designs

Multi-Factor Between-Subjects Designs

🧭 Overview

🧠 One-sentence thesis

Multi-factor between-subjects designs allow researchers to test not only the separate effects of multiple independent variables (main effects) but also whether those effects depend on one another (interactions), making them more efficient and informative than conducting separate single-factor studies.

📌 Key points (3–5)

  • Main effect vs. simple effect: A main effect averages across all levels of other variables, while a simple effect examines one variable at a single level of another variable.
  • Interaction defined: An interaction occurs when the effect of one independent variable differs depending on the level of another variable—equivalently, when simple effects differ.
  • Common confusion—parallel lines: Interaction is indicated by non-parallel lines in a plot; lines do not have to cross for an interaction to exist, only fail to be parallel.
  • Marginal means: The mean of all means at one level of a variable, used to assess main effects.
  • Three-way interactions: A three-way interaction means that two-way interactions themselves differ across levels of a third variable.

🔬 Core concepts and terminology

🔬 Main effects

A main effect of an independent variable is the effect of the variable averaging over the levels of the other variable(s).

  • Main effects are assessed using marginal means.
  • A marginal mean for one level of a variable is the mean of the means of all levels of the other variable(s).
  • Example: In a study with Weight (Obese vs. Typical) and Relationship (Girlfriend vs. Acquaintance), the marginal mean for "Obese" is the mean of "Girlfriend Obese" (5.65) and "Acquaintance Obese" (6.15), which equals 5.90.
  • The main effect of Weight compares the marginal means for Obese (5.90) vs. Typical (6.39).

🔬 Simple effects

The simple effect of a variable is the effect of the variable at a single level of another variable.

  • Unlike main effects, simple effects do not average across levels.
  • Example: The simple effect of Weight at the "Girlfriend" level is the difference between "Girlfriend Typical" (6.19) and "Girlfriend Obese" (5.65) = 0.54.
  • The simple effect of Weight at the "Acquaintance" level is 6.59 - 6.15 = 0.44.
  • Simple effects are the building blocks for understanding interactions.

🔬 Interactions

There is an interaction when the effect of one variable differs depending on the level of a second variable.

  • Equivalent definition: There is an interaction when the simple effects differ.
  • Example: If the effect of companion weight is larger when the companion is a girlfriend than when she is an acquaintance, Weight and Relationship interact.
  • In the example data, the simple effects of Weight are 0.54 and 0.44—these are not significantly different, so no interaction is present.
  • Don't confuse: Lack of evidence for an interaction does not prove there is no interaction in the population; you do not accept the null hypothesis just because you fail to reject it.

📊 ANOVA for multi-factor designs

📊 Degrees of freedom

  • Main effect df: Number of levels of the variable minus one.
    • Example: Weight has two levels (Obese, Typical), so df = 2 - 1 = 1.
  • Interaction df: Product of the df's of the variables in the interaction.
    • Example: Weight × Relationship interaction has df = 1 × 1 = 1.
  • Error df: Total number of observations minus the total number of groups.
    • Example: 176 observations and 4 groups → dfe = 176 - 4 = 172.
  • Total df: Number of observations minus 1, or the sum of all other df's.
    • Example: 176 - 1 = 175.

📊 F-ratio and significance testing

  • The F-ratio for each effect is computed by dividing the mean square (MS) for the effect by the mean square error (MSE).
  • Example: For Weight, F = 10.4673 / 1.6844 = 6.214.
  • The p-value indicates the probability of obtaining an F as large or larger if there were no effect in the population.
  • Example: Weight has p = 0.0136, so the null hypothesis of no main effect is rejected; being accompanied by an obese companion lowers ratings.
  • Example: The Weight × Relationship interaction has p = 0.8043, providing no evidence for an interaction.

📊 ANOVA summary table structure

| Source | df | SSQ | MS | F | p |
|---|---|---|---|---|---|
| Weight | 1 | 10.4673 | 10.4673 | 6.214 | 0.0136 |
| Relation | 1 | 8.8144 | 8.8144 | 5.233 | 0.0234 |
| W × R | 1 | 0.1038 | 0.1038 | 0.062 | 0.8043 |
| Error | 172 | 289.7132 | 1.6844 | | |
| Total | 175 | 310.1818 | | | |
  • Mean square (MS) = Sum of squares (SSQ) / df.
  • F and p are relevant only to effects, not to Error or Total.
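
The F and p values in the table follow directly from the sums of squares and degrees of freedom; here is a small sketch that reproduces them, assuming scipy is available:

```python
from scipy.stats import f

n_obs, n_groups = 176, 4
df_weight = 2 - 1            # levels of Weight minus one
df_relation = 2 - 1          # levels of Relationship minus one
df_interaction = df_weight * df_relation
df_error = n_obs - n_groups  # 172

ssq = {"Weight": 10.4673, "Relation": 8.8144, "W x R": 0.1038, "Error": 289.7132}
mse = ssq["Error"] / df_error   # about 1.6844

for effect, df_effect in [("Weight", df_weight), ("Relation", df_relation), ("W x R", df_interaction)]:
    ms = ssq[effect] / df_effect
    F = ms / mse
    p = f.sf(F, df_effect, df_error)   # upper-tail probability of the F distribution
    print(f"{effect}: F = {F:.3f}, p = {p:.4f}")
```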

📈 Plotting and interpreting interactions

📈 Interaction plots

An interaction plot displays the dependent variable on the Y-axis, one independent variable on the X-axis, and separate lines for each level of the other independent variable.

  • Label lines directly on the graph rather than using a legend.
  • Non-parallel lines indicate interaction; the lines do not have to cross.
  • Example: In the Weight × Relationship plot, the lines for Girlfriend and Acquaintance are nearly parallel, consistent with the non-significant interaction.

📈 When to use lines vs. bars

  • Use lines when the X-axis variable has three or more levels and is numeric (ordered).
  • Use bars (or better, box plots) when the X-axis variable is qualitative (categorical).
  • Box plots convey distributional information (medians, quantiles, overlap) that bar charts do not.

📈 Example with significant interaction

  • In a hypothetical study of Esteem (High vs. Low) and Outcome (Success vs. Failure) on attribution to self:
    • High-esteem subjects: Success led to more self-attribution than Failure.
    • Low-esteem subjects: Failure led to more self-attribution than Success.
  • The lines cross, clearly showing non-parallel patterns.
  • The significant Outcome × Esteem interaction (p = 0.0002) confirms that the effect of Outcome differs by Esteem level.

🧩 Three-factor designs

🧩 Structure and effects

  • Three-factor designs include three independent variables.
  • Seven effects are tested:
    • Three main effects (one for each variable).
    • Three two-way interactions.
    • One three-way interaction.
  • Example: A study of Hole Shape (Hex vs. Round), Assembly Method (Staked vs. Spun), and Barrel Surface (Knurled vs. Smooth) on breaking torque.

🧩 Degrees of freedom in three-factor designs

  • Main effect df = number of levels - 1 (same as two-factor).
  • Two-way interaction df = product of the df's of the two variables.
  • Three-way interaction df = product of all three main-effect df's.
  • Example: All three factors have two levels, so all main effects have df = 1, and the three-way interaction has df = 1 × 1 × 1 = 1.
  • Error df = number of observations - number of groups.
    • Example: 64 observations, 8 groups → dfe = 56.

🧩 Interpreting three-way interactions

A three-way interaction means that the two-way interactions differ as a function of the level of the third variable.

  • Plot the two-way interactions separately for each level of the third variable.
  • Example: Plot Barrel × Assembly separately for Hex and for Round.
    • For Hex: lines nearly parallel (little interaction).
    • For Round: lines diverge (larger interaction).
  • A significant three-way interaction indicates that this difference in two-way interactions is real.

💾 Formatting data for computer analysis

💾 Data structure

  • Most ANOVA programs require data in "long format":
    • One column for each independent variable (coded numerically).
    • One column for the dependent variable (the score).
  • Each row represents one observation.

💾 Example coding

  • For the Esteem × Outcome study:
    • Esteem: 1 = High, 2 = Low.
    • Outcome: 1 = Success, 2 = Failure.
    • Each subject's data is one row with three columns: outcome, esteem, attrib.
| outcome | esteem | attrib |
|---|---|---|
| 1 | 1 | 7 |
| 1 | 1 | 8 |
| 2 | 1 | 4 |
| 2 | 2 | 9 |
  • This format allows the program to identify which group each observation belongs to and compute the ANOVA.
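
As a sketch of how such long-format data would actually be analyzed, assuming pandas and statsmodels are available (the scores below are made up solely to illustrate the layout, not taken from the study):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Long format: one row per subject, one column per variable.
# outcome: 1 = Success, 2 = Failure; esteem: 1 = High, 2 = Low; attrib = score.
data = pd.DataFrame({
    "outcome": [1, 1, 1, 1, 2, 2, 2, 2],
    "esteem":  [1, 1, 2, 2, 1, 1, 2, 2],
    "attrib":  [7, 8, 4, 5, 4, 3, 9, 8],   # made-up scores for illustration
})

# Two-way between-subjects ANOVA: both main effects and the interaction.
model = ols("attrib ~ C(outcome) * C(esteem)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```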

⚠️ Unequal sample sizes

⚠️ The problem of confounding

  • Unequal sample sizes (unequal n) can create confounding: the effects of different variables become entangled.
  • Example (absurd design): A study of Diet (Low-Fat vs. High-Fat) and Exercise (Moderate vs. None) with no subjects in "Low-Fat No-Exercise" or "High-Fat Moderate-Exercise."
  • In such cases, it is impossible to separate the effect of Diet from the effect of Exercise.
  • Even less extreme unequal n can complicate interpretation, especially if the cause of unequal n is related to the variables being studied.

⚠️ Weighted vs. unweighted means

  • Weighted means: Each group mean is weighted by its sample size.
  • Unweighted means: Each group mean is given equal weight regardless of sample size.
  • The choice affects the calculation of main effects and can lead to different conclusions.

⚠️ Types of sums of squares

  • Type I sums of squares: Sequential; the order in which effects are entered matters.
  • Type III sums of squares: Each effect is adjusted for all other effects; order does not matter.
  • Type III is more common in multi-factor designs with unequal n because it tests each effect controlling for the others.

⚠️ Interpretation depends on the cause of unequal n

  • By design: Researcher intentionally uses unequal n (e.g., oversampling a rare group).
  • By accident: Random attrition or data collection issues.
  • By necessity: Real-world constraints (e.g., fewer people in one category).
  • The cause affects whether unequal n introduces bias and how to interpret results.
  • Don't confuse: Unequal n due to a variable related to the outcome can bias estimates; unequal n from random processes is less problematic but still complicates analysis.
103

Unequal Sample Sizes

Unequal Sample Sizes

🧭 Overview

🧠 One-sentence thesis

Unequal sample sizes across experimental conditions create confounding between factors, and the choice between weighted versus unweighted means (and between Type I, II, and III sums of squares) depends on whether the unequal n arose from the research design or from differential dropout.

📌 Key points (3–5)

  • The core problem: unequal sample sizes cause confounding—you cannot tell whether a difference is due to one factor or another because the factors are entangled.
  • Weighted vs unweighted means: weighted means ignore other variables and preserve confounding; unweighted means control for other variables and eliminate confounding.
  • Type III vs Type I vs Type II sums of squares: Type III does not apportion confounded variance to any source (most common); Type I assigns it sequentially; Type II assigns main-effect/interaction confounding to main effects.
  • Common confusion: the same data yield different conclusions depending on whether means are weighted by sample size or not—weighted means reflect "ignoring" other factors; unweighted means reflect "controlling for" other factors.
  • Why the cause matters: if unequal n arose from random assignment, use Type III (equal weighting); if it reflects population proportions, use Type II (sample-size weighting); if it arose from differential dropout due to treatment, no statistical adjustment is valid.

🧩 The problem of confounding

🧩 What confounding means

Confounding: when two factors are entangled so that you cannot tell which one caused an observed difference.

  • Unequal sample sizes create confounding because some combinations of factors have more subjects than others.
  • In the extreme case, if certain cells have zero subjects, the factors are completely confounded—every subject in one level of Factor A is also in one level of Factor B, so you cannot separate their effects.
  • Example: In the absurd "Diet and Exercise" design (Table 2 and 3), all low-fat subjects did moderate exercise and all high-fat subjects did no exercise. The 20-unit difference in cholesterol change could be due to diet, exercise, or both—there is no way to know.

🔍 Partial confounding

  • Even when cells are not empty, unequal n causes partial confounding.
  • Example: In Table 4, 80% of low-fat subjects exercised versus 20% of high-fat subjects. Diet and Exercise are confounded but not completely.
  • The confounding means that a simple comparison of diet groups mixes in the effect of exercise.

⚖️ Weighted versus unweighted means

⚖️ How weighted means are computed

  • Weighted mean: multiply each cell mean by its sample size, sum, and divide by the total N.
  • For the low-fat condition in Table 4: (4 × (−27.5) + 1 × (−20)) / 5 = −26.
  • The weighted mean is the same as the mean of all individual scores in that condition—it "ignores" the other factor (Exercise).

🎯 How unweighted means are computed

  • Unweighted mean: simply average the cell means without regard to sample size.
  • For the low-fat condition: (−27.5 + (−20)) / 2 = −23.75.
  • This treats each cell equally and "controls for" the other factor.

🧭 Why unweighted means eliminate confounding

  • Weighted means are affected by the imbalance in sample sizes across the other factor, so they mix in that factor's effect.
  • Unweighted means give each cell equal weight, so they are not distorted by the confounding.
  • Example: The weighted-mean difference for Diet is −22 (−26 vs −4), but most low-fat subjects exercised and most high-fat subjects did not. The unweighted-mean difference is −15.625 (−23.75 vs −8.125), which is a better measure of the Diet main effect because it controls for Exercise.
  • Don't confuse: "ignoring" a factor (weighted) versus "controlling for" a factor (unweighted)—ignoring means you let the confounding remain; controlling means you remove it.
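
A small sketch of both calculations. The low-fat cell means and sizes come from the text above; the high-fat cell means (−15 with n = 1 and −1.25 with n = 4) are inferred so that they reproduce the quoted weighted (−4) and unweighted (−8.125) means:

```python
# (cell mean, cell n) for each Diet x Exercise cell; values are cholesterol change.
low_fat  = [(-27.5, 4), (-20.0, 1)]    # Moderate exercise, No exercise
high_fat = [(-15.0, 1), (-1.25, 4)]    # inferred from the quoted marginal means

def weighted_mean(cells):
    """Weight each cell mean by its sample size (ignores the other factor)."""
    return sum(m * n for m, n in cells) / sum(n for _, n in cells)

def unweighted_mean(cells):
    """Give each cell mean equal weight (controls for the other factor)."""
    return sum(m for m, _ in cells) / len(cells)

print(weighted_mean(low_fat), weighted_mean(high_fat))        # -26.0, -4.0
print(unweighted_mean(low_fat), unweighted_mean(high_fat))    # -23.75, -8.125
print(weighted_mean(low_fat) - weighted_mean(high_fat))       # -22.0
print(unweighted_mean(low_fat) - unweighted_mean(high_fat))   # -15.625
```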

📛 Terminology in software

  • SPSS calls unweighted means estimated marginal means.
  • SAS and SAS JMP call them least squares means.

🧮 Types of sums of squares

🧮 Type III sums of squares

Type III sums of squares: confounded variance is not apportioned to any source of variation; each effect is tested controlling for all other effects.

  • This is by far the most common type; if sums of squares are not labeled, assume Type III.
  • In Table 5, the sum of squares for Diet, Exercise, D×E, and Error add up to 902.625, but the total is 1722—the "missing" 819.375 is the confounded variance that is not assigned to any source.
  • Type III weights all cell means equally (unweighted means).

🧮 Type I sums of squares

Type I sums of squares: all confounded variance is apportioned to sources of variation in the order the effects are listed.

  • The first effect gets any variance confounded between it and later effects; the second gets variance confounded with later effects but not the first, etc.
  • In Table 6, Diet is listed first and gets the confounded variance, so its sum of squares jumps from 390.625 (Type III) to 1210 (Type I).
  • With Type I, the sums of squares add up to the total.
  • Not recommended unless there is a strong theoretical reason to assign confounded variance to one effect over another (which is rare).

🧮 Type II sums of squares

Type II sums of squares: confounded variance between main effects is not apportioned, but confounded variance between main effects and interactions is apportioned to the main effects.

  • In the example, there is no confounding between the D×E interaction and the main effects, so Type II equals Type III.
  • Type II weights cell means by sample size (weighted means).
  • Type II is more powerful if there is truly no interaction, because it assigns interaction-confounded variance to main effects and uses sample-size weighting for better estimates.

🔀 Comparison table

| Type | How confounded variance is handled | Weighting of means | When total SSQ = sum of source SSQ |
|---|---|---|---|
| Type I | Assigned sequentially by order listed | Depends on order | Always |
| Type II | Main-effect/interaction confounding → main effects; main-effect/main-effect confounding → not assigned | Sample size (weighted) | Only if no main-effect confounding |
| Type III | Not assigned to any source | Equal (unweighted) | Only if no confounding |
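
With data in long format, the three types of sums of squares can be requested directly from statsmodels (assumed available); the tiny unequal-n data set below is made up for illustration. Note that Type III is only meaningful when the factors use sum-to-zero coding:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Made-up long-format data with unequal cell sizes (for illustration only).
data = pd.DataFrame({
    "diet":     ["low"] * 5 + ["high"] * 5,
    "exercise": ["mod", "mod", "mod", "mod", "none",
                 "mod", "none", "none", "none", "none"],
    "change":   [-22, -25, -30, -28, -18, -14, -4, -2, -6, -1],
})

model = ols("change ~ C(diet) + C(exercise) + C(diet):C(exercise)", data=data).fit()
print(sm.stats.anova_lm(model, typ=1))   # Type I: depends on the order of terms
print(sm.stats.anova_lm(model, typ=2))   # Type II

# Type III requires sum-to-zero (effects) coding of the factors.
model_sum = ols("change ~ C(diet, Sum) * C(exercise, Sum)", data=data).fit()
print(sm.stats.anova_lm(model_sum, typ=3))
```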

🤔 Which type to use

🤔 Arguments against Type I

  • Type I allows confounded variance between two main effects to be assigned to whichever is listed first.
  • Unless there is a strong argument for how to apportion it (rarely the case), Type I is not recommended.

🤔 Type II versus Type III

  • No consensus in the field.
  • Type II is more powerful if there is no interaction, for two reasons:
    1. Variance confounded between main effect and interaction is properly assigned to the main effect.
    2. Weighting by sample size gives better estimates.
  • Some suggest using Type II if the interaction is not significant.
  • Caution (Maxwell & Delaney, 2003): a non-significant interaction may itself be a Type II error (a real interaction that went undetected); switching to Type II sums of squares in that situation inflates the Type I error rate for the main-effect tests.
  • General recommendation: use Type III sums of squares.

🧪 When Type II might be justified

  • Type II and Type III test different hypotheses because they weight means differently.
  • Example (Figure 1): with unequal sample sizes, Type III (equal weighting) finds no main effect of B (marginal means both equal 8), but Type II (sample-size weighting) finds a main effect (weighted marginal means 8.33 vs 9.2).
  • If sample sizes reflect population proportions (e.g., sampling intact groups), it makes sense to weight means by sample size → use Type II.
  • If sample sizes arose from random assignment (just happened to be unequal), estimate what would have happened with equal n → use Type III.
  • Don't confuse: the hypothesis "is there a difference in the population with these proportions?" (Type II) versus "is there a difference if all cells were equally sized?" (Type III).

🧠 Interaction considerations

  • Maxwell & Delaney recognize that some prefer Type II when there are strong theoretical reasons to suspect no interaction and the p-value is much higher than 0.05.
  • However, Tukey (1991) and others argue that no effect (main or interaction) is exactly zero in the population—the role of significance testing is to determine the direction of an effect, not just whether it is non-zero.
  • If you assume no interaction, you should fit an ANOVA model without an interaction term, rather than using Type II in a model that includes interaction.

⚠️ Causes of unequal sample sizes

⚠️ When methods are invalid

  • None of the methods for dealing with unequal n are valid if the treatment itself caused the unequal sample sizes.
  • Example: An experiment on public-speaking anxiety randomly assigns 10 subjects to an experimental group (describe an embarrassing situation) and 10 to a control group (describe last meal). Four experimental subjects withdraw because they don't want to describe an embarrassing situation; no control subjects withdraw.
  • Even if the analysis shows a significant effect, you cannot conclude the treatment had an effect, because subjects willing to describe an embarrassing situation likely differ from those who are not.
  • The differential dropout destroyed the random assignment—a critical feature of experimental design.
  • No statistical adjustment can fix this flaw.

⚠️ Valid causes of unequal n

  • By design: the researcher intentionally uses different sample sizes (e.g., to reflect population proportions).
  • By accident: random assignment happens to produce slightly unequal groups.
  • By necessity: sampling intact groups with naturally different sizes.
  • In these cases, the choice of Type II versus Type III depends on whether the sample sizes are meaningful (reflect population structure) or arbitrary (random variation).
104

Tests Supplementing ANOVA

Tests Supplementing ANOVA

🧭 Overview

🧠 One-sentence thesis

After a significant ANOVA result, supplementary tests such as pairwise comparisons, specific contrasts, and simple effects tests help identify which means differ and how interactions should be interpreted.

📌 Key points (3–5)

  • What ANOVA tells you: a significant one-factor ANOVA only shows that at least one population mean differs from at least one other, not which ones differ.
  • Main effects can be followed up: use Tukey HSD for all pairwise comparisons or specific contrasts to test targeted hypotheses among marginal means.
  • Interactions require careful interpretation: a significant interaction means simple effects differ from each other; describe the interaction by comparing simple effects, not by testing whether each simple effect is zero.
  • Common confusion: testing simple effects does not explain an interaction—the interaction itself already tells you that simple effects differ; simple effects tests are useful for understanding whether main effects generalize, not for explaining the interaction.
  • Don't accept the null hypothesis: a non-significant simple effect does not mean the effect is zero; concluding two simple effects differ because one is significant and the other is not is faulty logic.

🔍 What ANOVA tells you and what it doesn't

🔍 The null hypothesis in one-factor ANOVA

The null hypothesis tested in a one-factor ANOVA: all population means are equal (H₀: μ₁ = μ₂ = ... = μₖ, where k is the number of conditions).

  • When the null hypothesis is rejected, you only know that at least one population mean is different from at least one other.
  • The ANOVA does not tell you which specific means differ or how many pairs differ.
  • Example: if you have three treatment groups and ANOVA is significant, you don't know if all three differ from each other or just one differs from the other two.

🔍 Main effects in multi-factor designs

  • A significant main effect (e.g., Factor A) indicates that at least one marginal mean for A differs from at least one other marginal mean.
  • Main effects can be followed up the same way as significant effects in one-way designs.
  • The excerpt provides a made-up example with three levels of Factor A and two levels of Factor B; the significant main effect of A (p = 0.002) means at least one marginal mean for A is different.

🧮 Pairwise and specific comparisons

🧮 Tukey HSD test for pairwise comparisons

  • Purpose: test all pairwise comparisons among means in a one-factor ANOVA or among marginal means in a multi-factor ANOVA.
  • Formula for equal sample sizes: the test statistic involves the difference between two marginal means (Mᵢ - Mⱼ), the mean square error (MSE) from the ANOVA, and n (the number of scores each mean is based on).
  • The probability value is computed using the Studentized Range Calculator with degrees of freedom equal to the error degrees of freedom.
  • Example from the excerpt: comparing the three marginal means for Factor A (A₁, A₂, A₃), the Tukey HSD test showed A₁ is significantly lower than both A₂ (p = 0.01) and A₃ (p = 0.002), but A₂ and A₃ do not differ significantly (p = 0.746).
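
For reference, a sketch of the test statistic described above for equal sample sizes, written in symbols (n is the number of scores each marginal mean is based on):

```latex
Q = \frac{M_i - M_j}{\sqrt{MSE / n}}
```

Q is then referred to the studentized range distribution with the error degrees of freedom, as with the calculator mentioned above.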

🧮 Specific comparisons (contrasts)

  • Purpose: test targeted hypotheses, such as comparing one mean to the average of two others.
  • How it works: assign coefficients (cᵢ) to each marginal mean (Mᵢ) and compute L = sum of (cᵢ × Mᵢ).
  • Example: to compare A₁ with the average of A₂ and A₃, use coefficients 1, -0.5, -0.5, yielding L = (1)(5.125) + (-0.5)(7.375) + (-0.5)(7.875) = -2.5.
  • Then compute a t-statistic using MSE and n, with degrees of freedom equal to the error degrees of freedom from the ANOVA.
  • In the example, the difference between A₁ and the average of A₂ and A₃ was significant (p = 0.0005).
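
In symbols, the contrast computation described above can be sketched as (again with n the number of scores each marginal mean is based on):

```latex
L = \sum_i c_i M_i,
\qquad
t = \frac{L}{\sqrt{\dfrac{\sum_i c_i^{2}\, MSE}{n}}},
\qquad
df = df_{\text{error}}
```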

🧮 Important considerations

  • Issues concerning multiple comparisons and orthogonal comparisons (discussed elsewhere in the source material) also apply here.
  • These tests are valid whether or not they are preceded by an ANOVA.

🔀 Understanding and describing interactions

🔀 What an interaction means

An interaction means that the simple effects are different.

  • Simple effect: the effect of one factor at a single level of another factor.
  • A significant interaction indicates that the simple effects differ from each other, so the main effect (the mean of the simple effects) does not tell the whole story.
  • Don't confuse: the interaction tells you that simple effects differ, but it does not tell you whether any simple effect differs from zero.

🔀 How to describe an interaction

  1. First step: construct an interaction plot to visualize the pattern.
  2. Second step: describe the interaction in clear, jargon-free language by comparing the simple effects.
  3. Example from the excerpt (self-esteem and outcome):
    • "The effect of Outcome differed depending on the subject's self-esteem. The difference between the attribution to self following success and the attribution to self following failure was larger for high-self-esteem subjects (mean difference = 2.50) than for low-self-esteem subjects (mean difference = -2.33)."
  4. No further analyses are needed to understand the interaction itself—the interaction's significance already tells you that the simple effects differ.

🔀 When the lines are parallel vs. not parallel

  • In an interaction plot, if lines are parallel over the entire graph, there is no interaction.
  • If lines are not parallel (they converge, diverge, or cross), there is an interaction.
  • Example: in a diet study with teens and adults, the difference between Diet A and Control was the same for both age groups (parallel lines), but the difference between Diet B and Diet A was much larger for teens than adults (non-parallel lines).

⚠️ Proper and improper uses of simple effects tests

⚠️ Why simple effects tests do not explain interactions

  • Common mistake: researchers report testing simple effects "to explain the interaction."
  • Why this is invalid: an interaction does not depend on whether simple effects are zero; it only depends on whether they differ from each other.
  • The interaction's significance already tells you that simple effects differ—testing whether they differ from zero adds no information about the interaction itself.

⚠️ When simple effects tests are useful

  • Valid reason: to determine whether main effects generalize.
  • A significant interaction means the main effects are not general—the effect of one factor depends on the level of the other factor.
  • Example: in the self-esteem study, the main effect of Outcome is not very informative; the effect of outcome should be considered separately for high- and low-self-esteem subjects.
  • Testing simple effects can show that Success significantly increases attribution to self for high-self-esteem subjects but significantly lowers it for low-self-esteem subjects—an easy-to-interpret result.

⚠️ Don't accept the null hypothesis

  • Problem: if neither simple effect is significant, researchers sometimes conclude both are zero.
  • Why this is wrong: a non-significant result does not mean the effect is zero; you should not accept the null hypothesis just because it is not rejected.
  • Correct conclusion: if neither simple effect is significant, conclude that the simple effects differ and at least one is not zero, but you cannot determine which one(s).

⚠️ Don't conclude effects differ based on significance vs. non-significance

  • Example from the excerpt: a researcher hypothesized that addicted people would show a larger increase in brain activity than non-addicted people.
  • The interaction was not quite significant (p = 0.08).
  • The researcher then tested simple effects: Treatment was significant for the Addicted group (p = 0.02) but not for the Non-Addicted group (p = 0.09).
  • Faulty logic: concluding that the hypothesis is demonstrated because one simple effect is significant and the other is not.
  • Why it's wrong: this is based on accepting the null hypothesis for the Non-Addicted group just because the result is not significant.
  • Correct interpretation: the experiment supports the hypothesis, but not strongly enough for a confident conclusion.

🧩 Components of interaction (optional)

🧩 Testing portions of an interaction

  • Sometimes an interaction is not uniform across the entire graph—lines may be parallel over one portion and not parallel over another.
  • Example: in a diet study with Control, Diet A, and Diet B for teens and adults:
    • The difference between Diet A and Control was essentially the same for teens and adults (parallel).
    • The difference between Diet B and Diet A was much larger for teens than for adults (not parallel).
  • You can test these portions (components of interactions) using specific comparisons.

🧩 How to test a component

  • Assign coefficients to each cell mean to isolate the component of interest.
  • Example: to test the difference between Teens and Adults on the difference between Diets A and B, use coefficients:
| Age Group | Diet | Coefficient |
|---|---|---|
| Teen | Control | 0 |
| Teen | A | 1 |
| Teen | B | -1 |
| Adult | Control | 0 |
| Adult | A | -1 |
| Adult | B | 1 |
  • The same considerations regarding multiple comparisons and orthogonal comparisons apply to these tests.
105

Within-Subjects ANOVA

Within-Subjects ANOVA

🧭 Overview

🧠 One-sentence thesis

Within-subjects ANOVA is more powerful than between-subjects ANOVA because it controls for individual differences by comparing each subject's performance across different conditions, treating each person as their own control.

📌 Key points (3–5)

  • What within-subjects factors are: comparisons of the same subjects under different conditions (repeated measures on each person).
  • Why they have more power: individual differences in overall performance are controlled because each subject serves as their own control.
  • Error as interaction: the error term reflects how much the effect of the condition differs across different subjects (Subjects × Condition interaction).
  • Common confusion: sphericity assumption—within-subjects ANOVA assumes equal variances and correlations among dependent variables; violating this inflates Type I error, unlike most ANOVA assumptions.
  • Carryover effects: performing in one condition can affect performance in subsequent conditions, which can invalidate results if not symmetric or counterbalanced.

🔄 Core concepts

🔄 Within-subjects vs between-subjects factors

Within-subjects factors: involve comparisons of the same subjects under different conditions; each subject's performance is measured at each level of the factor.

Between-subjects factors: each subject's performance is measured only once, and comparisons are among different groups of subjects.

  • In the ADHD Treatment study, each child was measured four times (once per drug dose), so "Dose" is a within-subjects factor.
  • Each subject is tested under all conditions rather than being assigned to just one condition.
  • Also called a repeated-measures factor because repeated measurements are taken on each subject.

💪 Why within-subjects designs have more power

  • Individual differences are controlled: subjects invariably differ from one another in overall performance (e.g., some solve problems faster, some have higher blood pressure).
  • Within-subjects designs compare a subject's score in one condition to the same subject's scores in other conditions.
  • Each subject serves as his or her own control.
  • This control of individual differences typically gives within-subjects designs considerably more power than between-subjects designs.
  • Example: in a problem-solving experiment, some subjects will be better regardless of condition; within-subjects design removes this variability from the error term.

📊 ANOVA structure for within-subjects designs

📊 Sources of variation in one-factor within-subjects ANOVA

The ANOVA summary table includes these sources:

| Source | What it represents |
|---|---|
| Subjects | Differences among subjects; larger when subjects differ more from each other |
| Condition (e.g., Dosage) | Differences between condition levels; zero if all condition means are equal |
| Error | Subjects × Condition interaction; how much the condition effect differs for different subjects |
| Total | Overall variation in all scores |
  • Subjects: if all subjects had exactly the same mean across conditions, this sum of squares would be zero.
  • Condition: if the means for all condition levels were equal, this sum of squares would be zero.
  • Error: reflects the degree to which the effect of the condition is different for different subjects.

🔀 Error as interaction

Error in within-subjects ANOVA: the extent to which the effect of one variable (Condition) differs depending on the level of another variable (Subjects); each subject is a different level of the variable "Subjects."

  • If all subjects responded very similarly to the treatment, error would be low.
  • Example: if all subjects performed moderately better with high dose than placebo, error would be low.
  • If some subjects did better with placebo while others did better with high dose, error would be high.
  • The less consistent the effect of the condition, the larger the condition effect must be to reach significance.
  • Don't confuse: this is different from between-subjects error, which includes individual differences; here, individual differences are removed into the "Subjects" source.

📐 Degrees of freedom

For a one-factor within-subjects design with k levels and n subjects:

  • Subjects: n − 1
  • Condition: k − 1
  • Error: (n − 1) × (k − 1)
  • Total: (n × k) − 1

Example: 24 children tested under 4 dosage levels:

  • Subjects df = 23
  • Dosage df = 3
  • Error df = 23 × 3 = 69
  • Total df = 95

🧮 F-test equivalence

  • The F for the condition is mean square condition divided by mean square error.
  • When comparing only two conditions, this F test is equivalent to the t test for correlated pairs, with F = t².
  • Example: comparing placebo vs. high dose yielded F = 10.38, p = 0.004.
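
A minimal sketch of a one-factor within-subjects ANOVA in long format, assuming statsmodels is available; the subjects, dose labels, and scores below are invented to show the structure, not the ADHD study's data:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: every subject contributes one row per dosage level.
data = pd.DataFrame({
    "subject": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "dose":    ["placebo", "low", "med", "high"] * 3,
    "score":   [45, 50, 53, 52, 38, 41, 44, 47, 50, 49, 55, 58],  # made-up scores
})

# One-factor within-subjects ANOVA: F = MS(dose) / MS(subjects x dose).
result = AnovaRM(data=data, depvar="score", subject="subject", within=["dose"]).fit()
print(result)
```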

🚧 Challenges and assumptions

⚠️ Carryover effects

Symmetric carryover effects (manageable):

  • Subjects get fatigued, so they do worse on the second condition.
  • As long as presentation order is counterbalanced (half do Condition A first, half do B first), the fatigue effect adds noise but doesn't invalidate results.
  • The effect is symmetric: having A first affects B the same as having B first affects A.

Asymmetric carryover effects (serious problem):

  • Performance in Condition B is much better if preceded by Condition A, but performance in A is about the same regardless of whether preceded by B.
  • Example: in a memory experiment with two conditions (judging word meaning vs. judging word sound), if subjects are given a surprise memory test after the first condition, there's no surprise after the second—subjects would try to memorize words in the second condition.
  • With asymmetric carryover, a between-subjects design is probably better.

🔵 Assumption of sphericity

Sphericity: the assumption that all correlations among dependent variables are equal and all variances are equal.

  • This is a restrictive assumption about variances and correlations.
  • Example from Stroop study: correlation between word reading and color naming (0.7013) is much higher than correlation between either of these and interference (0.1583 and 0.2382).
  • Variances also differ greatly (word reading: 15.77, color naming: 13.92, interference: 55.07).
  • The assumption refers to populations, not samples, but sample data can indicate the assumption is not met in the population.
  • This assumption is rarely met in practice.

💥 Consequences of violating sphericity

  • Unlike most ANOVA assumptions, violating sphericity leads to a substantial increase in Type I error rate.
  • ANOVA is robust to most assumption violations, but sphericity is an exception.
  • Violations had received little attention historically, but current consensus is that it's no longer acceptable to ignore them.
  • Don't confuse: this is different from normality or homogeneity of variance assumptions, which ANOVA handles robustly.

🛠️ Approaches to dealing with sphericity violations

Conservative test (for highly significant effects):

  • Divide degrees of freedom numerator and denominator by (number of scores per subject − 1).
  • Example: Task effect with df = 2 and 90 becomes df = 1 and 45 after adjustment.
  • If still significant after this adjustment, no need to worry about the violation.
  • Very conservative; should only be used when probability value is very low.

Epsilon (ε) corrections (better but complex):

  • Multiply degrees of freedom by a quantity called ε.
  • Two methods: Huynh-Feldt (H-F, slightly preferred) and Greenhouse-Geisser (G-G, slightly too conservative).
  • Example: for the dosage effect in Table 2, original p = 0.003 with df = 3, 69; with conservative correction (df = 1, 23), p = 0.032—a more cautious conclusion.

Multivariate approach:

  • Use a multivariate approach to within-subjects variables.
  • Has much to recommend it but is beyond the scope of the excerpt.

🔀 Mixed designs (one between- and one within-subjects factor)

🔀 Structure of mixed designs

Example: Stroop Interference study

  • Between-subjects factor: Gender (males vs. females)
  • Within-subjects factor: Task (naming colors, reading color words, naming ink color of color words)
  • This creates a Gender (2) × Task (3) design.

📋 Two error terms in mixed designs

| Source | df | What it tests |
|---|---|---|
| Gender | 1 | Between-subjects variable |
| Error (between) | 45 | Error for between-subjects variable |
| Task | 2 | Within-subjects variable |
| Gender × Task | 2 | Interaction |
| Error (within) | 90 | Error for within-subjects variable and interaction |
  • There are two separate error terms: one for the between-subjects variable and one for both the within-subjects variable and the interaction.
  • Typically, mean square error for the between-subjects variable is higher than the other mean square error.
  • In the Stroop example, MS error for Gender (41.79) is about twice as large as MS error for Task and interaction (20.89).

🧮 Degrees of freedom for mixed designs

  • Between-subjects variable: number of levels − 1 (Gender: 2 levels, so df = 1)
  • Within-subjects variable: number of levels − 1 (Task: 3 levels, so df = 2)
  • Interaction: product of the two df values (1 × 2 = 2)
  • Error (between): number of subjects − number of between-subjects groups
  • Error (within): (number of subjects − number of groups) × (within-subjects df)

Example with 47 subjects (some males, some females) tested on 3 tasks:

  • Gender df = 1
  • Error (between) df = 45
  • Task df = 2
  • Gender × Task df = 2
  • Error (within) df = 90
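
A quick arithmetic check of these degrees-of-freedom rules for the 47-subject example (plain Python, nothing assumed beyond the counts given above):

```python
n_subjects, n_between_groups, n_task_levels = 47, 2, 3

df_gender = n_between_groups - 1                   # 1
df_error_between = n_subjects - n_between_groups   # 45
df_task = n_task_levels - 1                        # 2
df_interaction = df_gender * df_task               # 2
df_error_within = df_error_between * df_task       # 90

print(df_gender, df_error_between, df_task, df_interaction, df_error_within)
```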
106

Log Transformations

Log Transformations

🧭 Overview

🧠 One-sentence thesis

When measurement scales have inherently meaningful units, the raw difference between means is a clear and interpretable measure of effect size, but standardized measures are needed when scales lack intrinsic meaning or when comparing across different measurement contexts.

📌 Key points (3–5)

  • When raw differences work best: if the measurement scale has meaningful units in its own right, the simple difference between means is interpretable and effective.
  • Example of meaningful units: sleep duration in minutes—a 61.8-minute increase directly shows the practical benefit of a treatment.
  • When standardized measures are needed: when units are not inherently meaningful or when comparing effects across different scales, standardized measures (like g or d) account for variability.
  • Common confusion: raw differences vs. standardized measures—raw differences are clearer when units are meaningful; standardized measures are needed when units are arbitrary or when comparing across studies.
  • Role of variability: the variability of subjects affects the size of standardized measures, making them context-dependent.

📏 When raw differences are meaningful

📏 Inherent meaningfulness of scales

When the units of a measurement scale are meaningful in their own right, then the difference between means is a good and easily interpretable measure of effect size.

  • The excerpt emphasizes that some scales have units that people can directly understand and use.
  • If the scale is meaningful, you do not need to transform or standardize the difference—just report the raw difference.
  • Example: minutes of sleep, dollars saved, or kilograms lost are all inherently meaningful.

💊 Sleep study example

  • A 2000 study by Holbrook, Crowther, Lotter, Cheng, and King tested benzodiazepine for insomnia.
  • Compared to placebo, the drug increased total sleep duration by a mean of 61.8 minutes.
  • This raw difference clearly shows the degree of effectiveness—anyone can understand what 61.8 extra minutes of sleep means.
  • Important caveat: the drug was found to sometimes have adverse side effects, so the benefit must be weighed against risks.

Why this works:

  • Minutes are a ratio scale with a true zero and equal intervals.
  • The difference is directly interpretable without needing context about variability or other studies.

🔢 When standardized measures are needed

🔢 Scales without inherent meaning

  • Not all measurement scales have units that are meaningful in their own right.
  • When units are arbitrary (e.g., scores on a psychological test with no real-world referent), raw differences are hard to interpret.
  • Standardized measures (like g or d) express the difference in terms of variability, making them comparable across different scales and studies.

📊 Standardized measures: g and d

  • The excerpt mentions two standardized measures: g and d.
  • These measures adjust the raw difference by the variability of the subjects.
  • They allow comparisons even when the original units differ or are not inherently meaningful.

How variability matters:

  • The same raw difference can represent a large or small effect depending on how much subjects vary.
  • Example: a 10-point difference is large if most subjects score within a 5-point range, but small if subjects vary by 50 points.
  • Standardized measures capture this context by dividing the difference by a measure of variability.
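
Both g and d take the general form of a mean difference divided by a standard deviation; the excerpt does not define them, so the sketch below only shows that general shape (which standard deviation is used is what distinguishes the specific measures):

```latex
\text{standardized difference} = \frac{M_1 - M_2}{s}
```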

🔍 Don't confuse: raw vs. standardized

| Measure type | When to use | What it shows |
|---|---|---|
| Raw difference | Units are meaningful (e.g., minutes, dollars) | Direct, practical effect size |
| Standardized (g, d) | Units are arbitrary or comparing across studies | Effect size relative to variability |
  • The excerpt states that "the variability of subjects affects the size of standardized measures," so standardized measures are context-dependent.
  • Raw differences are more transparent when the scale is meaningful; standardized measures are needed when the scale is not or when cross-study comparison is required.

🧮 Ratio scales and proportional differences

🧮 Ratio scales

  • The excerpt notes that "when the dependent variable is measured on a ratio scale, it is often informative to consider the proportional difference."
  • Ratio scales have a true zero and equal intervals, so proportions and ratios are meaningful.
  • Example: if a treatment doubles sleep duration, the proportional change (100%) is informative alongside the raw difference.

Why proportional differences matter:

  • They capture relative change, which can be easier to compare across different baseline levels.
  • A 60-minute increase means more if baseline sleep is 120 minutes than if it is 480 minutes.
107

Tukey Ladder of Powers

Tukey Ladder of Powers

🧭 Overview

🧠 One-sentence thesis

The Tukey ladder of powers provides a systematic method for transforming variables using power functions to reveal linear relationships in bivariate data or to reduce skewness in distributions, thereby improving statistical analysis.

📌 Key points (3–5)

  • Core purpose: Transform variables using power transformations (x raised to λ) to make relationships linear or distributions more normal.
  • How to choose λ: Find the value that maximizes the correlation coefficient (closest to 1) for linearity, or reduces skew for normality.
  • Special convention: When λ = 0, the transformation is defined as the logarithm (not the constant 1) to maintain continuity in the ladder.
  • Common confusion: Negative λ values reverse the variable's order (e.g., 1/x decreases when x increases), so the transformation is redefined as -(x to the λ) to preserve ordering.
  • Practical impact: Transformations can change statistical significance (e.g., a non-significant t-test can become significant after log transformation).

🔢 The transformation ladder structure

🪜 What the ladder contains

The Tukey ladder of transformations: an orderly sequence of power transformations indexed by parameter λ.

  • The basic forms are y = b₀ + b₁ × (X to the λ) or (y to the λ) = b₀ + b₁ × X.
  • λ is chosen to make the relationship as close to a straight line as possible.
  • Linear relationships are special; if a transformation works, you should consider changing the measurement scale for the rest of the analysis.

📊 Standard ladder values

| λ | Transformation | Notes |
|---|---|---|
| 2 | x² | Square |
| 1 | x | Raw data (unchanged) |
| 1/2 | √x | Square root |
| 0 | log x | Logarithm (by convention) |
| -1/2 | 1/√x | Reciprocal square root |
| -1 | 1/x | Reciprocal |
| -2 | 1/x² | Reciprocal square |
  • There is no constraint on λ values; any real number is valid.
  • Example: The relationship y = b₀ + b₁/x corresponds to λ = -1.

🔄 The λ = 0 convention

  • Mathematically, x to the power 0 equals 1 for every x, so taken literally the λ = 0 rung would be a useless constant.
  • Tukey suggests defining the transformation at λ = 0 as the logarithm function instead.
  • This maintains continuity and usefulness in the ladder.

🔧 Modified transformation for negative λ

⚠️ The ordering problem

  • When λ is negative, x to the λ reverses the variable's direction.
  • Example: If x is increasing, then 1/x is decreasing.
  • This reversal can confuse interpretation.

🛠️ The formal definition

The modified Tukey transformation is defined as:

  • x to the λ if λ > 0
  • log x if λ = 0
  • -(x to the λ) if λ < 0

The negation when λ < 0 preserves the order of the variable after transformation.

📋 Modified ladder table

| λ | Modified transformation |
|---|---|
| -2 | -1/x² |
| -1 | -1/x |
| -1/2 | -1/√x |
| 0 | log x |
| 1/2 | √x |
| 1 | x |
| 2 | x² |

⚙️ Practical constraint

  • Generally limit to variables where x > 0 to avoid issues with negative values.
  • For some dependent variables (e.g., number of errors), it is convenient to add 1 to x before applying the transformation.

📈 Finding the best transformation for linearity

🎯 The optimization goal

Goal: find a value of λ that makes the scatter diagram as linear as possible.

  • The scatter diagram smoothly morphs from convex to concave as λ increases.
  • Intuitively, there is a unique best choice of λ corresponding to the "most linear" graph.

📐 Using correlation as the objective function

  • The correlation coefficient r measures the linearity of a scatter diagram.
  • If points fall on a straight line, their correlation is r = 1.
  • Method: Plot the correlation coefficient as a function of λ and find the maximum.
  • No need to worry about r = -1 because the Tukey transformed variable is defined to be positively correlated with x itself.

🇺🇸 US population example (1670–1860)

  • The US population followed an exponential curve during this period (as Malthus predicted for geometric population growth).
  • The logarithmic transformation (λ = 0) applied to y makes the relationship almost perfectly linear.
  • The fitted line corresponds to the population being multiplied by about 1.35 each decade, i.e., a growth rate of about 35% per decade.
  • The correlation plot shows λ = 0 is nearly optimal by the correlation criterion.

Don't confuse: The raw data (λ = 1) shows a curved exponential pattern; only after transformation (λ = 0) does the relationship become linear.

🗽 New York state example (1790–2008)

  • Something unusual happened starting in 1970 (mass migration to the West and South as rust belt industries shut down).
  • Computing the best λ using data from 1790–1960 yields λ = 0.41.
  • The value λ = 0.41 is not obvious; one might reasonably choose λ = 0.50 for practical reasons.
  • Example: When the optimal λ is not a standard value, you may round to a nearby convenient value.

🌵 Arizona state example (1910–2005)

  • Arizona has attracted many retirees and immigrants.
  • The growth of population in Arizona is logarithmic (λ ≈ -0.02, very close to 0).
  • The logarithmic pattern appears to still hold through 2005.

Don't confuse: Different regions can have different growth patterns; the US as a whole had exponential growth (λ = 0 on y), while Arizona had logarithmic growth (λ ≈ 0).

📊 Reducing skewness in distributions

🎲 Why reduce skew

Many statistical methods (e.g., t-tests, ANOVA) assume normal distributions.

  • Although these methods are relatively robust to violations of normality, transforming distributions to reduce skew can markedly increase their power.
  • Power refers to the ability to detect true effects.

🔬 Stereograms case study example

  • The raw data is very skewed.
  • Before transformation (λ = 1): t-test p-value = 0.056 (not conventionally significant).
  • After log transformation (λ = 0): p-value = 0.023 (conventionally significant).
  • The log transformation reduces the skew greatly.

Key takeaway: The same data can yield different statistical conclusions depending on the transformation; the transformation that reduces skew can reveal effects hidden in the raw data.

📉 How λ affects skewness

  • Decreasing λ makes the distribution less positively skewed.
  • λ = 1 is the raw data.
  • λ = 0 (log) shows slight positive skew but much less than the raw data.
  • Values below 0 result in negative skew.

Don't confuse: The direction of the effect—lowering λ reduces positive skew, but going too low introduces negative skew; there is an optimal λ for approximate normality.

🔍 Visual inspection

  • The excerpt describes distributions of data transformed with various values of λ.
  • You can visually inspect which λ produces the most symmetric (least skewed) distribution.
  • This visual method complements the correlation-based method for linearity.
108

Box-Cox Transformations

Box-Cox Transformations

🧭 Overview

🧠 One-sentence thesis

The Box-Cox transformation provides a rigorous mathematical framework for transforming variables—including the logarithm as a special case when λ equals zero—to improve data analysis by reducing skewness and achieving normality.

📌 Key points (3–5)

  • What Box-Cox is: a family of transformations indexed by λ, defined as (x to the power λ minus 1) divided by λ, that includes the logarithm when λ = 0.
  • How it relates to Tukey: the Box-Cox formula is a scaled version of Tukey's power transformation; both preserve ordering when λ is negative.
  • The λ = 0 special case: when λ = 0, the formula becomes an indeterminate form (0/0), but through calculus it resolves to the logarithm—giving a rigorous justification for inserting log at λ = 0.
  • Common confusion: the Box-Cox transformation may look different from Tukey's, but they are closely related; the key difference is scaling and the mathematical treatment of λ = 0.
  • Why it matters: Box-Cox transformations eliminate skewness, help achieve normality for statistical tests, and improve regression analysis by optimizing correlation or minimizing residual variance.

📐 The Box-Cox formula and its connection to Tukey

📐 Definition of the Box-Cox transformation

The Box-Cox transformation of the variable x, indexed by λ, is defined as: x′ (subscript λ) equals (x to the power λ minus 1) divided by λ.

  • This is a scaled version of the Tukey power transformation x to the power λ.
  • At first glance, the formula appears different from Tukey's, but a closer look reveals the connection.

🔄 Preserving order when λ < 0

  • When λ is less than 0, both the Tukey transformation and the Box-Cox transformation change the sign of x to the power λ to preserve the ordering of the data.
  • This ensures that larger x values still map to larger transformed values, even though the power is negative.

🔗 The fixed point at x = 1

  • With the Box-Cox definition, x = 1 always maps to x′ (subscript λ) = 0 for all values of λ.
  • This anchoring property is useful for comparing transformations across different λ values.
  • Example: in the figures, the red point at (1, 0) shows this fixed mapping.

🧮 The special case: λ = 0 and the logarithm

🧮 The indeterminate form

  • When λ = 0, the Box-Cox formula becomes the indeterminate form 0/0.
  • The excerpt rewrites the formula using the exponential function: x′ (subscript λ) equals (e to the power (λ times log(x)) minus 1) divided by λ.
  • Expanding the exponential as a series: approximately (1 plus λ times log(x) plus one-half times λ squared times log(x) squared plus ...) minus 1, all divided by λ.

📏 Resolving to the logarithm

  • As λ approaches 0, the series expansion simplifies to log(x).
  • The excerpt notes this can also be obtained using l'Hôpital's rule from calculus.
  • Why this matters: this gives a rigorous mathematical explanation for Tukey's suggestion that the log transformation (which is not a polynomial transformation) may be inserted at the value λ = 0.
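
Written out, the definition and the limiting argument described above are:

```latex
x'_{\lambda} = \frac{x^{\lambda} - 1}{\lambda},
\qquad
\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda}
= \lim_{\lambda \to 0} \frac{e^{\lambda \log x} - 1}{\lambda}
= \lim_{\lambda \to 0} \frac{\lambda \log x + \tfrac{1}{2}\lambda^{2}(\log x)^{2} + \cdots}{\lambda}
= \log x
```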

📊 Visualizing the transformation

  • In the top row of Figure 1, when λ = 1, the transformation simply shifts x to the value x minus 1, which is a straight line.
  • In the bottom row (on a semi-logarithmic scale), when λ = 0, the transformation corresponds to a logarithmic transformation, which now appears as a straight line.
  • Figure 2 superimposes a larger collection of transformations on a semi-logarithmic scale for λ ranging from −2 to 3.
  • Don't confuse: the transformation changes shape depending on λ; the log is just one member of this family.

🎯 Transformation to normality

🎯 The goal: eliminating skewness

  • An important use of variable transformation is to eliminate skewness and other distributional features that complicate analysis.
  • Often the goal is to find a simple transformation that leads to normality.
  • The excerpt discusses how to assess normality using q-q plots: data that are normal lead to a straight line on the q-q plot.

🔍 Finding the optimal λ

  • Since the correlation coefficient is maximized when a scatter diagram is linear, the same approach can be used to find the most normal transformation.
  • Specifically, form n pairs: (Φ inverse of ((i minus 0.5) divided by n), x subscript (i)), for i = 1, 2, ..., n.
    • Φ inverse is the inverse cumulative distribution function (CDF) of the normal density.
    • x subscript (i) denotes the i-th sorted value of the data set.
  • The value of λ that gives the greatest correlation between these pairs is the optimal transformation.
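
A sketch of this procedure, assuming numpy and scipy are available; the array x stands for the positive data being transformed:

```python
import numpy as np
from scipy.stats import norm

def boxcox(x, lam):
    """Box-Cox transformation; lambda = 0 gives the logarithm."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < 1e-10:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def most_normal_lambda(x, lambdas=np.linspace(-1, 1, 201)):
    """Lambda whose Box-Cox transform correlates most strongly with normal quantiles."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = len(x_sorted)
    quantiles = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # Phi^{-1}((i - 0.5) / n)
    corrs = [np.corrcoef(quantiles, boxcox(x_sorted, lam))[0, 1] for lam in lambdas]
    return lambdas[int(np.argmax(corrs))]

# lam_star = most_normal_lambda(x)   # e.g., about 0.21 for the income data described above
```

scipy.stats.boxcox can also transform the data and choose λ for you (by maximum likelihood when no λ is supplied), but the loop above follows the excerpt's correlation criterion directly.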

💷 Example: British household income data

  • The excerpt presents a large sample of British household incomes from 1973, normalized to have mean equal to one (n = 7,125).
  • Such data are often strongly skewed, as shown in Figure 3 (left).
  • The data were sorted and paired with the 7,125 normal quantiles.
  • The value of λ that gave the greatest correlation (r = 0.9944) was λ = 0.21.
  • The kernel density plot of the optimally transformed data (Figure 4, left) is much less skewed than the original, though there is still an extra "component" in the distribution that might reflect the poor.
  • Economists often analyze the logarithm of income, corresponding to λ = 0 (Figure 4, right). The correlation is only r = 0.9901 in this case, but for convenience, the log-transform probably will be preferred.
  • Key insight: the optimal λ may not always be the most convenient; practical considerations (like interpretability) matter.

🔧 Other applications: regression analysis

🔧 Transforming predictor variables

  • Regression analysis is another application where variable transformation is frequently applied.
  • For the model: y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
  • And fitted model: ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₚxₚ
  • Each of the predictor variables x subscript j can be transformed.
  • The usual criterion is the variance of the residuals, given by: (1 divided by n) times the sum from i=1 to n of (ŷ subscript i minus y subscript i) squared.

🔧 Transforming the response variable

  • Occasionally, the response variable y may be transformed.
  • Important caution: care must be taken because the variance of the residuals is not comparable as λ varies.
  • Let ḡ subscript y represent the geometric mean of the response variables: ḡ subscript y equals (the product from i=1 to n of y subscript i) to the power (1 divided by n).
  • The transformed response is defined as: y′ (subscript λ) equals ((y to the power λ minus 1) divided by λ) times ḡ subscript y to the power (λ minus 1).
  • When λ = 0 (the logarithmic case): y′ (subscript 0) equals ḡ subscript y times log(y).
  • For more examples and discussions, the excerpt refers to Kutner, Nachtsheim, Neter, and Li (2004).
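
In symbols, the residual-variance criterion and the geometric-mean-scaled response transformation described above are:

```latex
\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2},
\qquad
\bar{g}_y = \Big(\prod_{i=1}^{n} y_i\Big)^{1/n},
\qquad
y'_{\lambda} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}\,\bar{g}_y^{\,\lambda - 1}, & \lambda \neq 0,\\[2ex]
\bar{g}_y \log y, & \lambda = 0.
\end{cases}
```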

📚 Historical note

📚 Origin of the Box-Cox transformation

  • George Box and Sir David Cox collaborated on one paper (Box & Cox, 1964).
  • The story is that while Cox was visiting Box at Wisconsin, they decided they should write a paper together because of the similarity of their names (and that both are British).
  • In fact, Professor Box is married to the daughter of Sir Ronald Fisher.
  • This anecdote highlights the collaborative and sometimes serendipitous nature of statistical research.
109

Chi Square Distribution

Chi Square Distribution

🧭 Overview

🧠 One-sentence thesis

The Chi Square distribution, which arises from summing squared standard normal deviates, provides a way to test whether observed data differ significantly from theoretical expectations.

📌 Key points (3–5)

  • What it is: the distribution of the sum of squared standard normal deviates; degrees of freedom equal the number of deviates being summed.
  • How its shape changes: as degrees of freedom increase, the distribution becomes less skewed and approaches a normal distribution.
  • Mean and skew: the mean equals the degrees of freedom; distributions are positively skewed, with skew decreasing as degrees of freedom increase.
  • Common confusion: Chi Square with one degree of freedom is just a single normal deviate squared—the area below 4 in Chi Square(1) equals the standard normal area between −2 and +2, since 4 = 2 squared.
  • Why it matters: many test statistics are approximately Chi Square distributed, enabling tests of observed vs. expected frequencies and relationships between categorical variables.

🔢 What the Chi Square distribution is

🔢 Definition and construction

The Chi Square distribution is the distribution of the sum of squared standard normal deviates.

  • A standard normal deviate is a random sample from the standard normal distribution.
  • You square each deviate, then sum them; that sum follows a Chi Square distribution.
  • The degrees of freedom (df) = the number of standard normal deviates being summed.

🎲 Chi Square with one degree of freedom

  • Written as χ²(1).
  • It is simply the distribution of a single normal deviate squared.
  • Example: the area of a Chi Square distribution below 4 is the same as the area of a standard normal distribution between −2 and +2, because 4 = 2².
  • Don't confuse: Chi Square(1) is not "one sample from Chi Square"; it is "one squared normal deviate."

🧮 Example calculation

  • Problem: sample two scores from a standard normal distribution, square each, and sum the squares. What is the probability the sum is six or higher?
  • Since two scores are sampled, use Chi Square with 2 degrees of freedom.
  • A Chi Square calculator shows the probability of Chi Square (2 df) being six or higher is 0.050.
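
A quick scipy check of both facts, the χ²(1)/normal correspondence and the two-deviate example (scipy assumed available):

```python
from scipy.stats import chi2, norm

# Chi Square with 1 df is a squared standard normal deviate:
# P(chi2_1 < 4) equals P(-2 < Z < 2).
print(chi2.cdf(4, df=1))            # about 0.9545
print(norm.cdf(2) - norm.cdf(-2))   # about 0.9545

# Sum of two squared standard normal deviates: P(chi2_2 >= 6).
print(chi2.sf(6, df=2))             # about 0.050
```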

📐 How the shape changes with degrees of freedom

📊 Mean and skewness

  • Mean: the mean of a Chi Square distribution equals its degrees of freedom.
  • Skewness: Chi Square distributions are positively skewed.
  • As degrees of freedom increase, the degree of skew decreases.

🔄 Approaching normality

  • As degrees of freedom increase, the Chi Square distribution approaches a normal distribution.
  • The excerpt mentions density functions for Chi Square distributions with 2, 4, and 6 degrees of freedom show decreasing skew.
  • Example: Chi Square(2) is highly skewed; Chi Square(6) is less skewed and closer to symmetric.

🧪 Why Chi Square is important

🧪 Test statistics

  • Many test statistics are approximately distributed as Chi Square.
  • This makes the distribution a foundation for significance testing.

📋 Common applications

Application | What it tests
One-way tables (goodness of fit) | Deviations between theoretically expected and observed frequencies
Contingency tables | Relationship between categorical variables
  • The excerpt notes numerous other tests beyond its scope are based on the Chi Square distribution.

🎯 Testing goodness of fit (one-way tables)

🎯 The basic idea

  • Use Chi Square to test whether observed data differ significantly from theoretical expectations.
  • Example: roll a fair six-sided die 36 times. Each outcome should have probability 1/6, but sample frequencies will vary by chance.
  • Question: are the observed frequencies consistent with the hypothesis that the die is fair?

🧪 The significance test approach

  • Null hypothesis: the die is fair (outcomes follow a uniform distribution).
  • Test: compute the probability of obtaining frequencies at least as discrepant from a uniform distribution as those observed in the sample.
  • If this probability is sufficiently low, reject the null hypothesis.
  • Don't confuse: finding that frequencies differ does not automatically mean the die is unfair—chance differences always occur; the test quantifies whether differences are too large to be due to chance alone.

📊 Example data

The excerpt provides a table of outcome frequencies from rolling a six-sided die 36 times:

Outcome | Frequency
1 | 8
2 | 5
3 | 9
4 | 2
5 | 7
6 | 5
  • Some outcomes (e.g., "3" came up 9 times) occurred more frequently than others (e.g., "4" came up only 2 times).
  • The first step is to compute the expected frequency for each outcome given the null hypothesis is true.
110

One-Way Tables (Testing Goodness of Fit)

One-Way Tables (Testing Goodness of Fit)

🧭 Overview

🧠 One-sentence thesis

The Chi Square test for one-way tables determines whether observed frequencies differ significantly from theoretically expected frequencies, allowing us to test hypotheses like whether a die is fair or whether data follow a normal distribution.

📌 Key points (3–5)

  • What the test does: compares observed frequencies in data to expected frequencies predicted by a theoretical hypothesis (e.g., uniform distribution, normal distribution).
  • How it works: compute expected frequencies under the null hypothesis, calculate Chi Square by summing (E-O)²/E for each outcome, then compare to the Chi Square distribution.
  • Degrees of freedom: always k-1, where k is the number of categories or outcomes.
  • Common confusion: "expected" frequencies are theoretical predictions, not what we literally expect to see in any single sample—chance differences always occur.
  • Interpretation: if the probability of obtaining Chi Square this large (or larger) is very low, reject the null hypothesis that the theoretical model fits the data.

🎲 What goodness-of-fit tests measure

🎯 The core question

  • The test asks: "Are the differences between what we observed and what theory predicts small enough to be due to chance alone?"
  • The null hypothesis is that the theoretical model (e.g., a fair die, a normal distribution) is correct.
  • We do not expect perfect agreement; the test accounts for natural sampling variation.

📊 Observed vs expected frequencies

Expected frequency: the frequency predicted for each outcome if the null hypothesis is true, calculated from the theoretical probability and the total sample size.

  • Observed frequency (O): what actually happened in the data.
  • Expected frequency (E): what the theoretical model predicts.
  • Example: For a fair six-sided die rolled 36 times, the probability of any outcome is 1/6, so E = (1/6)(36) = 6 for each face.
  • Don't confuse: "expected" is a theoretical benchmark, not a claim that every sample will match it exactly.

🧮 How to compute the Chi Square statistic

🔢 The calculation formula

For each outcome, compute:

  • (E - O)² / E

Then sum these values across all outcomes to get the Chi Square statistic.

📝 Step-by-step example: testing a die

Scenario: A six-sided die is rolled 36 times. The outcomes are:

Outcome | Frequency (O)
1 | 8
2 | 5
3 | 9
4 | 2
5 | 7
6 | 5

Step 1: Compute expected frequency for each outcome.

  • E = (1/6)(36) = 6 for all outcomes (assuming a fair die).

Step 2: For each outcome, calculate (E - O)² / E:

Outcome | E | O | (E-O)²/E
1 | 6 | 8 | 0.667
2 | 6 | 5 | 0.167
3 | 6 | 9 | 1.5
4 | 6 | 2 | 2.667
5 | 6 | 7 | 0.167
6 | 6 | 5 | 0.167

Step 3: Sum all values in the last column:

  • Chi Square = 0.667 + 0.167 + 1.5 + 2.667 + 0.167 + 0.167 = 5.333

🔑 Degrees of freedom

  • Formula: df = k - 1, where k is the number of categories.
  • Example: Six outcomes on a die → df = 6 - 1 = 5.

📉 Interpreting the result

🧪 Using the Chi Square distribution

  • The Chi Square statistic follows a Chi Square distribution with k-1 degrees of freedom.
  • Use a Chi Square calculator or table to find the probability of obtaining a Chi Square value this large or larger.
  • Example: For Chi Square = 5.333 with df = 5, the probability is 0.377.

✅ Decision rule

  • If the probability is sufficiently low (typically below 0.05), reject the null hypothesis.
  • If the probability is high, do not reject the null hypothesis.
  • Example (die): p = 0.377 is not low, so we cannot reject the hypothesis that the die is fair—the observed differences could easily be due to chance.
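
A minimal sketch of the whole die test in Python; scipy's chisquare is an assumed convenience, since the excerpt works from the table and a Chi Square calculator.

```python
# Minimal sketch of the die goodness-of-fit test described above.
from scipy.stats import chisquare

observed = [8, 5, 9, 2, 7, 5]         # frequencies of outcomes 1-6 in 36 rolls
expected = [6] * 6                    # (1/6)(36) = 6 under the fair-die hypothesis

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 3), round(p, 3))    # 5.333 0.377 -> cannot reject "the die is fair"
```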

⚠️ What rejection means

  • Rejecting the null means the observed data are too discrepant from the theoretical model to be explained by chance alone.
  • Not rejecting means the data are consistent with the model (but does not prove the model is true).

📚 Testing other distributions: normality example

🎓 Scenario: University GPA scores

The excerpt tests whether 105 University GPA scores are normally distributed.

Step 1: Divide the normal distribution into ranges and compute expected proportions.

Range | Proportion | E (expected) | O (observed)
Above 1 | 0.159 | 16.695 | 9
0 to 1 | 0.341 | 35.805 | 60
-1 to 0 | 0.341 | 35.805 | 17
Below -1 | 0.159 | 16.695 | 19
  • E is calculated by multiplying the total number of scores (105) by the proportion.
  • Example: E for "0 to 1" = 0.341 × 105 = 35.805.

Step 2: Compute Chi Square using the same formula.

  • The excerpt states Chi Square = 30.09 (calculation details not fully shown).

Step 3: Determine degrees of freedom.

  • Four ranges → df = 4 - 1 = 3.

Step 4: Interpret.

  • The probability is p < 0.001, which is very low.
  • Conclusion: Reject the null hypothesis that the scores are normally distributed.
  • The observed frequencies (especially 60 scores in the "0 to 1" range vs. expected ~36) deviate too much from what a normal distribution would predict.
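
A minimal sketch of the same normality check, assuming scipy; the observed counts and proportions are the ones listed above.

```python
# Minimal sketch of the GPA normality check: compare observed counts in the four
# ranges with the counts a normal distribution predicts.
from scipy.stats import chisquare

observed = [9, 60, 17, 19]                      # Above 1, 0 to 1, -1 to 0, Below -1
proportions = [0.159, 0.341, 0.341, 0.159]      # normal-distribution areas for each range
expected = [p * 105 for p in proportions]       # 105 GPA scores in total

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2), p)     # about 30.09, with p < 0.001 -> reject normality
```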

🧩 Why this matters

  • Goodness-of-fit tests can be applied to any theoretical distribution, not just uniform distributions.
  • They help verify assumptions (e.g., normality) that underlie other statistical methods.
111

Contingency Tables

Contingency Tables

🧭 Overview

🧠 One-sentence thesis

Chi Square tests on contingency tables determine whether there is a statistically significant relationship between two nominal (categorical) variables by comparing observed frequencies to expected frequencies under the null hypothesis of no relationship.

📌 Key points (3–5)

  • What contingency tables test: whether two nominal variables are related (e.g., diet type and health outcome).
  • How expected frequencies are computed: for each cell, multiply the row total by the column total and divide by the grand total.
  • How to compute Chi Square: sum the squared differences between observed and expected frequencies, divided by expected frequencies, across all cells.
  • Degrees of freedom formula: (number of rows minus 1) times (number of columns minus 1).
  • Key assumption: each subject contributes data to only one cell; the sum of all cell frequencies must equal the number of subjects.

📊 What contingency tables test

📊 The null hypothesis

The null hypothesis tested is that there is no relationship between the two nominal variables.

  • The excerpt asks "whether there is a significant relationship between diet and outcome."
  • If the null hypothesis is true, the two variables are independent—knowing one variable tells you nothing about the other.
  • Example: in the Mediterranean Diet and Health case study, the null is that diet (AHA vs. Mediterranean) and outcome (cancers, fatal heart disease, non-fatal heart disease, healthy) are unrelated.

📋 Structure of a contingency table

  • Rows represent levels of one nominal variable; columns represent levels of the other.
  • Each cell shows the observed frequency (count) for that combination.
  • Row totals, column totals, and a grand total are computed.
  • Example: the Diet and Health table has 2 rows (AHA, Mediterranean) and 4 columns (Cancers, Fatal Heart Disease, Non-Fatal Heart Disease, Healthy), with a grand total of 605 subjects.

🧮 Computing expected frequencies

🧮 The formula

The expected frequency for a cell in the i-th row and j-th column is equal to (row total × column total) / grand total.

  • Notation: E(i,j) = (T(i) × T(j)) / T, where T(i) is the row total, T(j) is the column total, and T is the grand total.
  • This formula assumes no relationship: the proportion of the row total that falls in a given column should match the overall proportion for that column.

🔍 Example calculation

  • In the Diet and Health study, 22 out of 605 subjects developed cancer (proportion = 0.0364).
  • If there is no relationship, we expect 0.0364 of the 303 AHA diet subjects to develop cancer: (0.0364)(303) = 11.02.
  • Equivalently, (303 × 22) / 605 = 11.02.
  • Similarly, for the Mediterranean diet: (302 × 22) / 605 = 10.98.
  • The excerpt shows all expected frequencies in parentheses in Table 2.
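
A minimal sketch of the expected-frequency formula, applied to the cancer-column figures quoted above (plain Python, no extra libraries).

```python
# Minimal sketch of E(i,j) = (row total x column total) / grand total.
def expected(row_total, column_total, grand_total):
    """Expected cell frequency under the null hypothesis of no relationship."""
    return row_total * column_total / grand_total

print(round(expected(303, 22, 605), 2))   # AHA diet & cancer: 11.02
print(round(expected(302, 22, 605), 2))   # Mediterranean diet & cancer: 10.98
```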

🧪 Computing Chi Square and degrees of freedom

🧪 Chi Square statistic

  • Chi Square is computed by summing across all cells: for each cell, take (observed − expected)² / expected.
  • The excerpt states the formula in words: "Chi Square is computed by summing the squared differences between observed and expected frequencies, divided by expected frequencies."
  • Example: for the Diet and Health study, Chi Square = 16.55.

🔢 Degrees of freedom

Degrees of freedom = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.

  • For the Diet and Health table: (2 − 1)(4 − 1) = 3 degrees of freedom.
  • The excerpt notes that for one-way tables (earlier in the chapter), degrees of freedom is "the number of outcomes minus one."

📉 Interpreting the result

  • Use a Chi Square distribution calculator to find the p-value.
  • Example: Chi Square = 16.55 with 3 df gives p = 0.0009.
  • Since p < 0.05 (or another chosen alpha), reject the null hypothesis of no relationship.
  • Conclusion: there is a significant relationship between diet and outcome.

⚠️ Key assumptions and warnings

⚠️ Each subject contributes to only one cell

A key assumption is that each subject contributes data to only one cell.

  • The sum of all cell frequencies must equal the number of subjects.
  • Don't confuse: if subjects contribute to multiple cells, the test is invalid.
  • Example of violation: the anagram problem data (Table 3) has 16 subjects but 32 total cell frequencies (each subject attempted two problems). The excerpt states "It would not be valid to use the Chi Square test on these data."

📏 Sample size requirements

  • The total number of subjects should be at least 20 for the Chi Square approximation to be adequate.
  • Some authors recommend a correction for continuity (Yates correction for 2×2 tables) when expected cell frequencies are below 5, but the excerpt notes "Research in statistics has shown that this practice is not advisable."

🧩 Additional examples

🧩 Saffron and liver cancer

  • Experimental group (saffron, n=24): 4 developed cancer.
  • Control group (no saffron, n=8): 6 developed cancer.
  • Chi Square test yields χ²(df=1) = 9.50, p = 0.002.
  • Conclusion: the difference is statistically significant.
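
A minimal sketch of this 2×2 test, assuming scipy; correction=False skips the Yates continuity correction that the excerpt advises against.

```python
# Minimal sketch of the saffron example as a 2x2 contingency table.
from scipy.stats import chi2_contingency

#                 cancer  no cancer
table = [[4, 20],          # saffron group (n = 24)
         [6, 2]]           # control group (n = 8)

stat, p, df, expected = chi2_contingency(table, correction=False)
print(round(stat, 2), df, round(p, 3))   # 9.5 1 0.002
```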

🧩 Smoking and incontinence

  • The excerpt provides a scenario: 322 incontinent subjects (113 smokers, 51 former smokers, 158 never smoked) vs. 284 control subjects (68 smokers, 23 former smokers, 193 never smoked).
  • The exercise asks to create a table, compute expected frequencies, conduct a Chi Square test, and interpret the results.
  • This illustrates the same process: set up a contingency table, compute expected frequencies, calculate Chi Square, find the p-value, and draw a conclusion.
112

Benefits

Benefits

🧭 Overview

🧠 One-sentence thesis

Nonparametric rank-based tests like Spearman's ρ and the Wilcoxon rank sum test provide valid significance testing when distributions are non-normal or sample sizes are small, using permutation logic and critical-value tables instead of parametric assumptions.

📌 Key points (3–5)

  • Permutation probability: The probability of observing a correlation as large or larger is calculated by counting arrangements of ranks, not by assuming a distribution shape.
  • One-tailed vs two-tailed: One-tailed tests count only arrangements in one direction; two-tailed tests double the probability or require a higher critical value.
  • Critical value tables: When sample sizes grow, counting all permutations becomes impractical, so tables of critical values for different significance levels and sample sizes are used.
  • When to use rank tests: The Wilcoxon rank sum test (and similar nonparametric methods) are chosen when distributions are very non-normal, avoiding the assumptions required by t tests.
  • Common confusion: The same observed statistic can be significant in a one-tailed test but not in a two-tailed test because the critical values differ.

🎲 Permutation-based probability

🎲 How permutation probability works

  • The excerpt explains that probability is calculated by counting the number of rank arrangements that produce a correlation as large or larger than the observed value, then dividing by the total number of possible arrangements.
  • Example: With 5 observations, there are 5! = 120 possible arrangements; if 5 of them give a correlation ≥ 0.90, the one-tailed probability is 5/120 = 0.042.
  • This approach does not assume any particular distribution shape; it relies only on the logic of permutations.

🔄 One-tailed vs two-tailed probabilities

  • One-tailed: counts only arrangements that give a correlation as large or larger in one direction.
  • Two-tailed: counts both directions (as large or larger, or as small or smaller).
  • The excerpt shows that the one-tailed probability is 0.042, and the two-tailed probability is 0.084 (double the one-tailed value).
  • Don't confuse: the same observed correlation (0.90) can be significant one-tailed but not two-tailed, because the two-tailed test requires a more extreme result.

📋 Critical value tables

📋 Why tables are needed

  • When sample size is even moderately large, counting all permutations by hand becomes impractical.
  • The excerpt states: "it is convenient to have a table of critical values."
  • These tables list the minimum correlation (or test statistic) needed to reach significance at common levels (e.g., 0.05, 0.01) for different sample sizes.

📊 Reading the critical value table

Sample size (N) | 0.05 two-tail | 0.01 two-tail | 0.05 one-tail | 0.01 one-tail
5 | 1.0 | - | 0.90 | 1.0
6 | 0.886 | 1.0 | 0.829 | 0.943
10 | 0.648 | 0.794 | 0.564 | 0.745
20 | 0.447 | 0.570 | 0.380 | 0.520
  • The excerpt shows that for N = 5, the one-tailed critical value at 0.05 is 0.90; since the sample correlation is 0.90, it is significant at the 0.05 level (one-tailed).
  • For a two-tailed test at 0.05, the critical value is 1.0, so Spearman's ρ is not significant two-tailed.
  • As sample size increases, critical values generally decrease (it becomes easier to reach significance with more data).

🧪 Application: Wilcoxon rank sum test

🧪 When to use rank-based tests

  • The excerpt describes a study comparing troponin levels in patients with and without right ventricular strain.
  • The authors used the Wilcoxon rank sum test instead of a t test.
  • The excerpt asks: "Why might the authors have used the Wilcoxon test rather than a t test?"
  • Answer provided: "Perhaps the distributions were very non-normal."
  • Rank-based tests do not assume normal distributions, making them appropriate when data are skewed or have outliers.

📈 Results interpretation

  • Patients with right ventricular strain had higher troponin (median = 0.03 ng/ml) than those without (median < 0.01 ng/ml), p < 0.001.
  • The very small p-value (< 0.001) indicates strong evidence of a difference.
  • The excerpt suggests that "typically a transformation can be done to make a distribution" more normal, but the sentence is incomplete; the implication is that if transformation is difficult or inappropriate, a rank test is a good alternative.
  • Don't confuse: using a rank test does not mean the conclusion is weaker; it means the method is better suited to the data's distribution.
113

Randomization Tests: Two Conditions

Randomization Tests: Two Conditions

🧭 Overview

🧠 One-sentence thesis

Randomization tests assess whether an observed difference between groups could occur by chance by examining all possible ways the data could be reassigned to groups.

📌 Key points (3–5)

  • Core logic: Consider all possible ways to divide observed values into groups, then see where the actual data falls in that distribution.
  • How probability is calculated: Count how many arrangements produce differences as large or larger than observed, divide by total possible arrangements.
  • One-tailed vs two-tailed: One-tailed considers only one direction of difference; two-tailed considers absolute value (both directions).
  • Common confusion: The test doesn't assume the treatments differ—it asks "if treatments had identical effects, how likely is this result?"
  • Practical limitation: Exact enumeration is time-consuming for larger samples; random sampling of arrangements (via computer) approximates the exact probability.

🔍 The fundamental logic

🔍 What randomization tests ask

The central question is: Would a difference this large or larger be likely if the two treatments had identical effects?

  • The test does not start by assuming groups differ.
  • Instead, it treats all observed values as a single pool that could have been assigned to either group.
  • By examining all possible reassignments, we see how unusual the actual assignment is.

🎲 All possible arrangements

The approach taken by randomization tests is to consider all possible ways the values obtained in the experiment could be assigned to the two groups.

  • For n total values divided into groups of size r and (n - r), the number of combinations is calculated using the combinations formula.
  • Example from excerpt: 8 values divided into two groups of 4 each = 70 possible arrangements.
  • Each arrangement represents one way the experiment could have turned out if assignment were purely random.

📊 Computing the probability

📊 Counting favorable outcomes

The probability calculation has two steps:

  1. Count arrangements as extreme or more extreme: How many of the possible arrangements produce a difference between means as large or larger than observed?
  2. Divide by total arrangements: This fraction is the probability of observing such a difference by chance alone.

Example from the excerpt:

  • Observed difference between means: 10
  • Total possible arrangements: 70
  • Arrangements with difference ≥ 10: 3 (including the actual data)
  • One-tailed probability: 3/70 = 0.0429

🔀 One-tailed vs two-tailed probabilities

Type | What it considers | Example from excerpt
One-tailed | Only one direction (e.g., Experimental > Control) | 3/70 = 0.0429
Two-tailed | Both directions (absolute value of difference) | 6/70 = 0.0857
  • Don't confuse: the two-tailed value is not defined by doubling the one-tailed value; it counts arrangements in which either group's mean exceeds the other's by at least the observed amount (here that count is 6, which happens to be exactly twice the one-tailed count of 3).
  • The excerpt notes "only one direction of difference is considered (Experimental larger than Control)" for the one-tailed case.

💻 Practical implementation

💻 Exact enumeration for small samples

  • Listing all possible arrangements works well for very small sample sizes.
  • The excerpt states: "Clearly, this type of analysis would be very time consuming for even moderate sample sizes. Therefore, it is most useful for very small sample sizes."

💻 Computer approximation for larger samples

An alternate approach made practical by computer software:

  • Randomly divide the data into groups thousands of times.
  • Count the proportion of times the difference is as big or bigger than the actual data.
  • If the number of random divisions is very large, this proportion approximates the exact probability.

Why this works: With enough random samples, the empirical distribution of differences converges to the true distribution of all possible arrangements.

Example: Instead of examining all 70 arrangements, randomly generate 10,000 arrangements and count how many produce differences ≥ 10.

📋 Worked example walkthrough

📋 The experimental setup

The excerpt provides fictitious data:

  • Experimental Group: 7, 8, 11, 30 (mean = 14)
  • Control Group: 0, 2, 5, 9 (mean = 4)
  • Observed difference: 14 - 4 = 10

📋 Finding extreme arrangements

To find arrangements as extreme as the observed data:

  • The excerpt identifies two rearrangements that would produce a bigger difference than 10:
    1. Score of 7 in Control, score of 9 in Experimental
    2. Score of 8 in Control, score of 9 in Experimental
  • Including the actual data: 3 total arrangements with difference ≥ 10
  • Result: probability = 3/70 = 0.0429 (one-tailed)

Key insight: We're not testing whether the groups are different—we're asking how often random assignment alone would produce a difference this large.
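
A minimal sketch of the full enumeration, using itertools to generate all 70 splits of the eight pooled scores.

```python
# Minimal sketch of the exact randomization test: enumerate every way to assign
# four of the eight observed values to the Experimental group.
from itertools import combinations

values = [7, 8, 11, 30, 0, 2, 5, 9]                            # all eight scores pooled
observed_diff = (7 + 8 + 11 + 30) / 4 - (0 + 2 + 5 + 9) / 4    # 14 - 4 = 10

as_large = 0        # one-tailed count: Experimental mean - Control mean >= 10
as_extreme = 0      # two-tailed count: |difference| >= 10
for experimental in combinations(values, 4):
    control_sum = sum(values) - sum(experimental)
    diff = sum(experimental) / 4 - control_sum / 4
    as_large += diff >= observed_diff
    as_extreme += abs(diff) >= observed_diff

print(as_large, round(as_large / 70, 4))       # 3 0.0429 (one-tailed)
print(as_extreme, round(as_extreme / 70, 4))   # 6 0.0857 (two-tailed)
```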

114

Randomization Tests: Two or More Conditions

Randomization Tests: Two or More Conditions

🧭 Overview

🧠 One-sentence thesis

Randomization tests for more than two groups work by computing a test statistic (such as the F ratio) for the actual data and then determining what proportion of all possible data rearrangements would produce a statistic as large or larger, with a very small proportion indicating a real treatment effect.

📌 Key points (3–5)

  • Core method: randomly rearrange the data into groups many times and count how often the difference is as big or bigger than the actual result.
  • Test statistic choice: when comparing several means, the F ratio from one-way ANOVA is convenient as a measure of how different the groups are.
  • Calculation approach: either list all possible arrangements (practical only for very small samples) or use computer software to randomly divide data thousands of times.
  • Common confusion: the F ratio here is not used to test significance directly—it serves as a measure of group differences; significance comes from the proportion of arrangements.
  • Interpretation: if the proportion of arrangements with an F as large or larger is very small, the observed difference is unlikely to occur by chance alone.

🔢 The randomization method

🎯 Choosing a test statistic

  • The first step is to decide on a test statistic that measures how different the groups are.
  • For more than two means, the F ratio is convenient.
  • The F ratio is computed from a one-way ANOVA, but not to test significance directly—it quantifies the magnitude of group differences.
  • Example: In the fictitious three-group experiment, the F ratio for the actual data arrangement is 2.06.

🔄 Counting arrangements

The core question: how many arrangements of the data result in a test statistic as large or larger than the one obtained from the actual data?

  • After computing the test statistic for the actual data, determine how many possible rearrangements of the data would produce an equal or larger statistic.
  • The proportion of such arrangements tells you how likely the observed result is under the assumption of no treatment effect.
  • Don't confuse: you are not testing whether the F ratio itself is significant in the traditional sense; you are counting how rare your observed F is among all possible data arrangements.

🖥️ Practical approaches

For very small samples:

  • List all possible ways to divide the data into groups.
  • Compute the test statistic for each arrangement.
  • Count how many produce a statistic as large or larger than the actual one.
  • This is very time-consuming even for moderate sample sizes.

For larger samples (computer-based):

  • Randomly divide the data into groups thousands of times.
  • Count the proportion of times the difference is as big or bigger than the actual result.
  • If the number of random divisions is very large, this proportion will be very close to the exact proportion from listing all possible arrangements.
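
A minimal sketch of that shuffling approach, assuming scipy only for the F ratio; the data passed in at the end are made up for illustration and are not the excerpt's table.

```python
# Minimal sketch: approximate the randomization p-value for an F ratio by repeatedly
# shuffling the pooled scores back into groups of the original sizes.
import random
from scipy.stats import f_oneway

def randomization_p(groups, trials=10_000, seed=0):
    """Proportion of random regroupings whose F is at least the observed F."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    observed_f = f_oneway(*groups).statistic
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        regrouped, start = [], 0
        for n in sizes:
            regrouped.append(pooled[start:start + n])
            start += n
        hits += f_oneway(*regrouped).statistic >= observed_f
    return hits / trials

# Illustrative (made-up) data, not the excerpt's table:
print(randomization_p([[3, 5, 9], [6, 8, 12], [10, 14, 15]]))
```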

📊 Worked example with three groups

📋 The data and initial F ratio

The excerpt presents a fictitious experiment with three groups: T1, T2, and Control.

T1 | T2 | Control
7 | 8 | 11
12 | 14 | 19
21 | 12 | 20
2 | 5 | 9
  • The F ratio for this actual arrangement is 2.06.
  • This becomes the benchmark: we need to find how many arrangements produce F ≥ 2.06.

🔁 Arrangements with the same F

There are 6 arrangements that produce the same F of 2.06:

  1. T1, T2, Control (the original)
  2. T1, Control, T2
  3. T2, T1, Control
  4. T2, Control, T1
  5. Control, T1, T2
  6. Control, T2, T1
  • These are simply the six ways to rearrange the three columns.
  • Each of these six arrangements yields the same F ratio because the group labels are just permuted.

🔼 Arrangements with a larger F

For each of the 6 arrangements above, there are two specific swaps that lead to a higher F ratio:

  • Swapping the 7 for the 9 gives an F of 2.08.
  • Swapping the 8 for the 9 gives an F of 2.07.

Example: Starting from the original arrangement and swapping 7 and 9:

T1 | T2 | Control
9 | 8 | 11
12 | 14 | 19
21 | 12 | 20
2 | 5 | 7
  • This produces F = 2.08, which is larger than 2.06.
  • Since there are 6 base arrangements and each has 2 swaps that increase F, there are 6 × 2 = 12 arrangements with a larger F.

🧮 Total count and proportion

Arrangements with F as large or larger:

  • 6 arrangements with F = 2.06 (same as actual)
  • 12 arrangements with F > 2.06 (larger than actual)
  • Total: 6 + 12 = 18 arrangements

Total possible arrangements:

  • The formula for the total number of arrangements is given by the multinomial coefficient.
  • For this example: n = number of observations in each group (assumed equal), k = number of groups.
  • The calculation yields 13,824 total possible arrangements.

Proportion:

  • 18 / 13,824 = 0.0013
  • This means only about 0.13% of all possible arrangements produce an F as large or larger than the observed 2.06.

💡 Interpretation

  • If there were no treatment effect, it would be very unlikely to obtain an F as large or larger than 2.06.
  • The very small proportion (0.0013) suggests that the observed group differences are unlikely to be due to chance alone.
  • This provides evidence for a real treatment effect.

🧮 Calculating total arrangements

📐 The formula

The total number of possible arrangements is computed from a multinomial coefficient formula:

  • The formula involves factorials of the total number of observations and the number of observations in each group.
  • In the example: the 12 values shown above, divided into three equal groups of four, yield the 13,824 possible arrangements reported in the excerpt.
  • This formula assumes you are counting all distinct ways to assign the observed values to the group labels.

⚠️ Practical note

  • For even moderate sample sizes, the number of possible arrangements grows very large.
  • This is why computer-based random sampling (dividing the data thousands of times) is the practical approach for most real applications.
  • The exact enumeration method shown in the example is feasible only for very small datasets.
115

Randomization Tests: Association (Pearson's r)

Randomization Tests: Association (Pearson's r)

🧭 Overview

🧠 One-sentence thesis

A randomization test for Pearson's r provides a significance test that makes no distributional assumptions by comparing the actual correlation to all possible rearrangements of one variable.

📌 Key points (3–5)

  • What the test does: tests the significance of Pearson's r without assuming normality, unlike traditional parametric tests.
  • How it works: fix the X variable and rearrange the Y variable in all possible ways, then count how many arrangements produce correlations as extreme as the observed one.
  • Calculating the p-value: divide the number of arrangements with correlations as extreme or more extreme by the total number of possible arrangements (N factorial).
  • Common confusion: one-tailed vs two-tailed—for two-tailed tests, you must count both high positive and low negative correlations; the two-tailed probability is not necessarily double the one-tailed probability in randomization tests.

🔧 The randomization approach

🔧 Core method

The approach is to consider the X variable fixed and compare the correlation obtained in the actual data to the correlations that could be obtained by rearranging the Y variable.

  • Unlike the traditional parametric test (described in the prerequisite section on inferential statistics for b and r), this method assumes no particular distribution.
  • You keep one variable (X) in its original order and systematically rearrange the other variable (Y).
  • Each rearrangement produces a different correlation coefficient.
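
A minimal sketch of the fix-X, rearrange-Y procedure; the two arrays are made-up illustrations rather than the excerpt's data.

```python
# Minimal sketch: enumerate every ordering of Y, compute Pearson's r for each,
# and count how many are at least as large as the observed r.
from itertools import permutations
from math import factorial
import numpy as np

x = np.array([2.0, 4.0, 5.0, 9.0])     # illustrative X values (held fixed)
y = np.array([1.0, 3.0, 2.0, 6.0])     # illustrative Y values (rearranged)

observed_r = np.corrcoef(x, y)[0, 1]
count = sum(np.corrcoef(x, np.array(p))[0, 1] >= observed_r - 1e-12
            for p in permutations(y))
print(round(count / factorial(len(y)), 3))   # one-tailed permutation p-value
```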

🎯 What you're comparing

  • The actual observed correlation is compared against the distribution of all possible correlations from rearrangements.
  • You count how many rearrangements produce correlations as extreme or more extreme than what you actually observed.
  • Example: if the observed r is 0.385, you count arrangements that give r ≥ 0.385 (for a one-tailed test).

📊 Worked example

📊 The data

The excerpt provides a small dataset with 5 pairs of X and Y values:

  • Original data: X values are 1, 2.4, 3.8, 4, 11; Y values are 1, 2, 2.3, 3.7, 2.5
  • The observed correlation between X and Y is 0.385

🔢 Counting extreme arrangements

  • There is only one arrangement of Y that produces a higher correlation than 0.385.
  • That arrangement pairs the Y values differently with the fixed X values and produces r = 0.945.
  • Total arrangements as extreme or more extreme: 2 (the actual arrangement plus the one that gives 0.945).

🧮 Computing the probability

  • Total possible arrangements = N factorial, where N is the number of pairs.
  • For 5 pairs: 5! = 120 possible arrangements.
  • One-tailed probability = 2/120 = 0.017.
  • This means if there were truly no association, only 1.7% of random arrangements would produce a correlation this strong or stronger.

🔀 One-tailed vs two-tailed tests

🔀 One-tailed probability

  • Counts only arrangements that give r as large or larger than the observed value.
  • In the example: arrangements with r ≥ 0.385.

🔀 Two-tailed probability

  • Must also count arrangements where r is as negative or more negative than the opposite of the observed value.
  • In the example: would also count arrangements with r ≤ -0.385.
  • Don't confuse: in randomization tests, the two-tailed probability is not necessarily double the one-tailed probability (unlike some parametric tests).

✅ When to use this test

✅ Advantages over parametric tests

  • Makes no distributional assumptions (the traditional test assumes normality).
  • Useful when sample sizes are small or when normality cannot be assumed.
  • Provides an exact test based on the actual data permutations rather than relying on theoretical distributions.
116

Randomization Tests: Contingency Tables (Fisher's Exact Test)

Randomization Tests: Contingency Tables: (Fisher's Exact Test)

🧭 Overview

🧠 One-sentence thesis

Fisher's Exact Test provides a randomization-based significance test for contingency tables by calculating the exact probability of observed and more extreme frequency patterns given fixed row and column totals, though it can be less powerful than the Chi Square test because of this fixed-margin assumption.

📌 Key points (3–5)

  • What Fisher's Exact Test does: tests the relationship between two nominal variables (e.g., difference in proportions) by treating row and column totals as fixed and computing exact probabilities of frequency patterns.
  • How it works: calculate the probability of the observed pattern plus all more extreme patterns using a combinatorial formula, without relying on approximate distributions.
  • Why "exact" matters: unlike Chi Square, it does not depend on an approximate distribution, but this exactness comes at a cost—it assumes fixed marginals.
  • Common confusion: Fisher's Exact Test is "exact" but not necessarily more powerful; the Chi Square test often has better power because it does not fix both marginal totals.
  • Two-tailed probability quirk: in Fisher's Exact Test (and randomization tests generally), the two-tailed p-value is not always double the one-tailed p-value.

🧪 The logic of Fisher's Exact Test

🧪 When to use it

  • The excerpt describes testing the relationship between two nominal variables, with a special focus on differences between proportions.
  • Example scenario: an experiment with an Experimental Group and a Control Group, each with 4 subjects, testing whether they solved an anagram problem.
  • The test applies when you have a contingency table (e.g., 2×2) and want to know if the pattern of frequencies is significant.

🔢 The core idea: fixed marginals

Fisher's Exact Test takes the row totals and column totals as "given" and adds the probability of obtaining the pattern of frequencies obtained in the experiment and the probabilities of all other patterns that reflect a greater difference between conditions.

  • You treat the marginal totals (row sums and column sums) as fixed.
  • Then you ask: "Given these fixed totals, how likely is the observed pattern (or a more extreme one) to occur by chance?"
  • This is a randomization approach: you consider all possible rearrangements of the data that preserve the marginals.

📐 The formula

The excerpt provides a formula for the probability of any given frequency pattern:

  • Probability = [n! × (N − n)! × R! × (N − R)!] / [N! × r! × (n − r)! × (R − r)! × (N − n − R + r)!]

Where:

  • N = total sample size
  • n = sample size for the first group
  • r = number of successes in the first group
  • R = total number of successes
  • For the anagram example: N = 8, n = 4, r = 3, R = 3.
  • Plugging in: probability = (4! × 4! × 3! × 5!) / (8! × 3! × 1! × 0! × 4!) = 1/14 ≈ 0.0714.

🎯 Computing the p-value

🎯 One-tailed probability

  • The one-tailed p-value is the probability of the observed pattern plus all more extreme patterns in the same direction.
  • In the anagram example, the observed pattern (3 successes in Experimental, 0 in Control) has probability 0.0714.
  • The excerpt states: "Since more extreme outcomes do not exist given the row and column totals, the p value is 0.0714."
  • This is one-tailed because it only considers outcomes favoring the Experimental Group.

🎯 Two-tailed probability

  • For a two-tailed test, you also count patterns that are equally or more extreme in the opposite direction.
  • The excerpt shows an equally extreme outcome favoring the Control Group (0 successes in Experimental, 3 in Control), which also has probability 0.0714.
  • Therefore, the two-tailed p-value = 0.0714 + 0.0714 = 0.1428.
  • Important note: "In the Fisher Exact Test, the two-tailed probability is not necessarily double the one-tailed probability."
    • This is because the distribution of possible patterns may not be symmetric.
    • Don't assume you can simply multiply the one-tailed p-value by 2.
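
A minimal sketch of the anagram example, assuming scipy's fisher_exact; the excerpt itself computes these probabilities by hand.

```python
# Minimal sketch of Fisher's Exact Test on the anagram 2x2 table.
from scipy.stats import fisher_exact

#               solved  not solved
table = [[3, 1],          # Experimental group (n = 4)
         [0, 4]]          # Control group (n = 4)

p_one_tailed = fisher_exact(table, alternative='greater')[1]    # [1] is the p-value
p_two_tailed = fisher_exact(table, alternative='two-sided')[1]
print(round(p_one_tailed, 3), round(p_two_tailed, 3))           # 0.071 0.143
```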

⚖️ Comparing Fisher's Exact Test to Chi Square

⚖️ What "exact" means

The Fisher Exact Test is "exact" in the sense that it is not based on a statistic that is approximately distributed as, for example, Chi Square.

  • Chi Square tests rely on an approximation (the Chi Square distribution).
  • Fisher's Exact Test computes exact probabilities using combinatorics, so it does not depend on an approximate distribution.

⚖️ The power trade-off

  • Fisher's Exact Test can be considerably less powerful than the Chi Square test because it assumes both marginal totals are fixed.
  • In reality, often only one marginal (e.g., group sizes) is truly fixed by the experimental design; the other marginal (e.g., total number of successes) may be free to vary.
  • By fixing both marginals, Fisher's test is more conservative (i.e., it may fail to detect real effects more often).

⚖️ When Chi Square is preferred

  • The excerpt notes: "Even though the Chi Square test is an approximate test, the approximation is quite good in most cases."
  • Chi Square "tends to have too low a Type I error rate more often than too high a Type I error rate."
    • In other words, the Chi Square test tends to be conservative in practice: its false-positive rate falls below the nominal level more often than above it.
  • Because Chi Square does not fix both marginals, it often has better power to detect real differences.

Test | Exact or approximate? | Assumes fixed marginals? | Power
Fisher's Exact Test | Exact (combinatorial) | Both row and column totals | Lower (more conservative)
Chi Square Test | Approximate (Chi Square distribution) | Typically only one marginal | Higher (less conservative)

⚖️ Don't confuse

  • "Exact" does not mean "better" or "more powerful."
  • Fisher's Exact Test is exact in its probability calculation, but this comes at the cost of reduced power due to the fixed-marginal assumption.
  • Chi Square is approximate but often more appropriate and powerful for real-world data.

🔗 Connection to general randomization tests

🔗 The randomization principle

  • The excerpt begins by describing a randomization test for correlation: you fix one variable (X) and consider all possible rearrangements of the other variable (Y).
  • For each rearrangement, you compute the test statistic (e.g., correlation r).
  • The p-value is the proportion of rearrangements that produce a statistic as extreme or more extreme than the observed one.
  • Example: with 5 pairs of scores, there are 5! = 120 possible arrangements of Y. If 2 arrangements give r ≥ 0.385, the one-tailed p-value is 2/120 = 0.017.

🔗 Fisher's Exact Test as a randomization test

  • Fisher's Exact Test applies the same logic to contingency tables.
  • Instead of rearranging raw scores, you consider all possible frequency patterns that preserve the row and column totals.
  • Each pattern has a probability computed by the combinatorial formula.
  • The p-value is the sum of probabilities for the observed pattern and all more extreme patterns.

🔗 Two-tailed quirk in randomization tests

  • The excerpt emphasizes: "In randomization tests, the two-tailed probability is not necessarily double the one-tailed probability."
  • This applies to both the correlation example and Fisher's Exact Test.
  • Reason: the distribution of possible outcomes under randomization may be asymmetric, so extreme outcomes in one direction may not mirror those in the other direction.
117

Rank Randomization: Two Conditions (Mann-Whitney U, Wilcoxon Rank Sum)

Rank Randomization: Two Conditions (Mann-Whitney U, Wilcoxon Rank Sum)

🧭 Overview

🧠 One-sentence thesis

Rank randomization tests convert raw scores to ranks before computing a randomization test, trading some statistical power for computational ease and the availability of significance tables.

📌 Key points (3–5)

  • Core trade-off: rank randomization tests are easier to compute and have lookup tables, but lose some information (and power) compared to randomization tests on original numbers.
  • How the test works: convert all scores to ranks, then calculate the proportion of possible rank arrangements that produce a difference as large or larger than the observed data.
  • Two common names: Mann-Whitney U test and Wilcoxon Rank Sum Test refer to the same rank randomization procedure for two-condition designs.
  • Common confusion: one-tailed vs two-tailed probability—one-tailed considers only one direction of difference; two-tailed doubles the probability to account for both directions.
  • Practical shortcuts: for small samples, use critical-value tables; for larger samples, use a normal approximation formula with Z.

🔄 From raw scores to ranks

🔄 Why convert to ranks

  • The excerpt states that randomization tests are "very difficult to compute" because they require enumerating all possible rearrangements of the original data.
  • By converting scores to ranks first, the test becomes simpler: you work with integers 1, 2, 3, … instead of arbitrary values.
  • Advantage: tables exist that give critical values for rank sums, so you don't have to count every arrangement by hand.
  • Disadvantage: "some information is lost when the numbers are converted to ranks," so rank randomization tests are "generally less powerful" than tests on the original data.

📊 Example conversion

The excerpt provides fictitious data:

Group | Original scores
Experimental | 7, 8, 11, 30
Control | 0, 2, 5, 9

After ranking all eight values together (smallest = 1, largest = 8):

Group | Ranks | Rank sum
Experimental | 4, 5, 7, 8 | 24
Control | 1, 2, 3, 6 | 12
  • The sum of all ranks (1 through 8) is always constant (36 in this case).
  • Because the total is fixed, you can use a computational shortcut: find how many arrangements give the Experimental group a rank sum ≥ 24.

🎲 Computing the probability

🎲 Counting possible arrangements

The number of ways to divide n observations into two groups of size r and n−r is given by the combinations formula: n choose r.

  • For the example: 8 observations divided into two groups of 4 each → 8 choose 4 = 70 possible arrangements.
  • The test asks: of these 70 ways, how many produce a rank sum for the Experimental group as extreme or more extreme than 24?

🎲 Enumerating extreme arrangements

The excerpt shows three rearrangements that yield rank sums ≥ 24:

Experimental ranks | Rank sum
6, 5, 7, 8 | 26
4, 6, 7, 8 | 25
3, 6, 7, 8 | 24
  • Including the observed arrangement (4, 5, 7, 8 → 24), there are 4 arrangements with rank sum ≥ 24.
  • One-tailed probability = 4 / 70 ≈ 0.057.
  • Two-tailed probability = 2 × 0.057 = 0.114, because you also count arrangements with rank sums as small or smaller than the observed sum (8 arrangements total out of 70).
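
A minimal sketch of that enumeration, listing every possible set of four ranks for the Experimental group.

```python
# Minimal sketch of the rank-sum enumeration: 70 ways to give 4 of the ranks 1-8
# to the Experimental group.
from itertools import combinations

observed_sum = 4 + 5 + 7 + 8                       # Experimental rank sum = 24
arrangements = list(combinations(range(1, 9), 4))  # 70 possible rank assignments

as_large = sum(sum(a) >= observed_sum for a in arrangements)        # one-tailed tail
as_small = sum(sum(a) <= 36 - observed_sum for a in arrangements)   # mirror-image tail
print(round(as_large / 70, 3), round((as_large + as_small) / 70, 3))   # 0.057 0.114
```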

🔍 One-tailed vs two-tailed

  • One-tailed: you predict the direction of the difference (e.g., Experimental > Control), so you count only arrangements where the Experimental rank sum is as large or larger.
  • Two-tailed: you allow for either direction, so you count arrangements where the rank sum is either (a) as large or larger or (b) as small or smaller.
  • Don't confuse: the two-tailed p is not simply "double the one-tailed p" in all contexts, but in this symmetric rank-sum case, the excerpt states it is (2)(0.057) = 0.114.

📋 Using tables and formulas

📋 Critical-value tables for small samples

The excerpt provides a table for equal sample sizes from 4 to 10 per group.

Example from Table 6 (one-tailed test, both groups n=4):

Significance level | Critical rank sum (higher group)
0.05 | 25
0.025 | 26
  • For the fictitious data, the observed rank sum is 24, which is below 25, so p > 0.05 (one-tailed).
  • The excerpt confirms by counting that p ≈ 0.057.
  • Why tables are useful: with larger samples (e.g., 10 per group), counting all arrangements "becomes very time consuming," so the critical value suffices for a significance decision.

📐 Normal approximation for larger samples

When sample sizes are moderate to large, the rank sum is approximately normally distributed. The excerpt gives the formula:

Z = [W_a − n_a(n_a + n_b + 1)/2] / sqrt[n_a n_b (n_a + n_b + 1)/12]

where:

  • W_a = sum of ranks for the first group,
  • n_a = sample size for the first group,
  • n_b = sample size for the second group.

Example from the Stereograms Case Study:

  • W_a = 1911, n_a = 43, n_b = 35.
  • Plugging in: Z = 2.13, which corresponds to a two-tailed p = 0.033 (using a normal distribution calculator).
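
A minimal sketch of the Z formula applied to those numbers, assuming scipy for the normal tail area.

```python
# Minimal sketch of the normal approximation with W_a = 1911, n_a = 43, n_b = 35.
from math import sqrt
from scipy.stats import norm

W_a, n_a, n_b = 1911, 43, 35
z = (W_a - n_a * (n_a + n_b + 1) / 2) / sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
p_two_tailed = 2 * norm.sf(abs(z))
print(round(z, 2), round(p_two_tailed, 3))   # 2.13 0.033
```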

🧮 When to use which method

Sample size | Method | Reason
Small (≤10/group) | Critical-value table | Exact, easy lookup
Moderate to large | Normal approximation (Z formula) | Counting arrangements is impractical
  • The excerpt notes that the normal approximation is valid for "moderate to large sample sizes."
  • For very small samples, the table is both exact and convenient.

🔑 Key terminology

🔑 Mann-Whitney U vs Wilcoxon Rank Sum

"There are several names for rank randomization tests for differences in central tendency. The two most common are the Mann-Whitney U test and the Wilcoxon Rank Sum Test."

  • These are two names for the same procedure: convert scores to ranks, then test whether the rank sums differ more than expected by chance.
  • The excerpt does not distinguish between them further; they are used interchangeably in practice.

🔑 Rank sum

  • The rank sum is the sum of the ranks assigned to one group.
  • Because the total of all ranks is fixed, you can focus on just one group's rank sum as the test statistic.
  • Example: if the Experimental group has ranks 4, 5, 7, 8, the rank sum is 24.

🔑 Ties and mean ranks

  • The excerpt briefly mentions ties: "since there are 'ties' in the data, the mean rank of the ties is used."
  • Example: ten scores of 2.5 tie for ranks 4 through 13; the average of 4, 5, …, 13 is 8.5, so all tied values get rank 8.5.
  • This adjustment preserves the total rank sum and avoids arbitrary ordering of tied values.
118

Rank Randomization: Two or More Conditions (Kruskal-Wallis)

Rank Randomization: Two or More Conditions (Kruskal-Wallis)

🧭 Overview

🧠 One-sentence thesis

The Kruskal-Wallis test extends rank-randomization logic to designs with more than two groups, testing for differences in central tendency by converting data to ranks and comparing the result to a Chi Square distribution.

📌 Key points (3–5)

  • What it extends: the Kruskal-Wallis test is the multi-group version of the Wilcoxon test, handling designs with one between-subjects variable and more than two groups.
  • How it works: convert all data to ranks (ignoring group membership), sum the ranks for each group, compute the H statistic, and test it against a Chi Square distribution with k-1 degrees of freedom.
  • Handling ties: when scores tie, assign each the mean of the ranks they would occupy.
  • Common confusion: the Kruskal-Wallis is not a direct extension of the two-group Z test shown at the start; it uses a different statistic (H) and a different distribution (Chi Square).
  • Why it matters: it allows significance testing for central tendency differences across multiple groups without assuming normality.

🔧 Core mechanism

🔧 What the Kruskal-Wallis test does

The Kruskal-Wallis test: a rank-randomization test that extends the Wilcoxon test to designs with more than two groups, testing for differences in central tendency in designs with one between-subjects variable.

  • It is not limited to two groups; it handles k groups (where k is greater than 2).
  • The test is based on a statistic H that is approximately distributed as Chi Square.
  • Example: if you have four experimental conditions (like the "Smiles and Leniency" case study), you can use Kruskal-Wallis to test whether central tendency differs across them.

📐 The H statistic formula

The formula for H is:

H = -3(N + 1) + [12 / (N(N + 1))] × Σ (T_i² / n_i), summed over all k groups

Where:

  • N is the total number of observations across all groups.
  • T_i is the sum of ranks for the i-th group.
  • n_i is the sample size for the i-th group.
  • k is the number of groups.
  • The formula combines the rank sums from each group, weighted by sample size, and adjusts by the total sample size.
  • The result is a single number (H) that summarizes how much the rank distributions differ across groups.

🔢 Step-by-step procedure

🔢 Step 1: Convert data to ranks

  • Rank all observations together, ignoring which group they belong to.
  • If there are ties, assign each tied value the mean of the ranks it would occupy.
  • Example: ten scores of 2.5 tied for ranks 4 through 13; the average of 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 is 8.5, so all ten scores receive rank 8.5.

🔢 Step 2: Sum ranks for each group

  • After ranking, add up the ranks for each group separately.
  • Example from the "Smiles and Leniency" case study (four conditions, each with 34 observations):
    • False: 2732.0
    • Felt: 2385.5
    • Miserable: 2424.5
    • Neutral: 1776.0

🔢 Step 3: Compute H

  • Plug the rank sums, sample sizes, and total N into the formula.
  • Example calculation (with N = 136, k = 4, each n_i = 34):
    • H = -3(136 + 1) + (12 / (136 × 137)) × ((2732²/34) + (2385.5²/34) + (2424.5²/34) + (1776²/34))
    • H = 9.28

🔢 Step 4: Test significance with Chi Square

  • Use a Chi Square distribution with k - 1 degrees of freedom.
  • Example: for k = 4 groups, df = 3; H = 9.28 yields p = 0.028.
  • If p is below your significance threshold, reject the null hypothesis of no difference in central tendency.
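
A minimal sketch of the H calculation for these rank sums, assuming scipy for the Chi Square tail area.

```python
# Minimal sketch of the Kruskal-Wallis H statistic for the Smiles and Leniency rank sums.
from scipy.stats import chi2

rank_sums = [2732.0, 2385.5, 2424.5, 1776.0]   # False, Felt, Miserable, Neutral
n_per_group = 34
N = n_per_group * len(rank_sums)               # 136 observations in total

H = -3 * (N + 1) + (12 / (N * (N + 1))) * sum(T**2 / n_per_group for T in rank_sums)
p = chi2.sf(H, df=len(rank_sums) - 1)          # k - 1 = 3 degrees of freedom
print(round(H, 2), round(p, 3))                # H = 9.28; p comes from Chi Square(3)
```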

🧮 Handling ties

🧮 Why ties matter

  • Real data often contain identical scores.
  • Assigning the same rank to all tied values would distort the rank distribution.

🧮 The mean-rank method

  • When multiple observations tie for a range of ranks, assign each the average of those ranks.
  • Example: if ten scores tie for ranks 4–13, compute (4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13) / 10 = 8.5, and assign 8.5 to all ten.
  • This preserves the total sum of ranks and avoids bias.

🔍 Interpretation and comparison

🔍 What the test tells you

  • A significant H indicates that at least one group's central tendency differs from the others.
  • The test does not tell you which specific groups differ; it is an omnibus test.
  • Example: in the "Smiles and Leniency" study, H = 9.28 with p = 0.028 allows rejection of the null hypothesis that all four conditions have the same leniency.

🔍 Don't confuse with the two-group test

  • The excerpt opens with a two-group example (Z = 2.13, p = 0.033) using a different formula and a normal approximation.
  • The Kruskal-Wallis test uses H and Chi Square, not Z and the normal distribution.
  • The two-group test is a special case; Kruskal-Wallis generalizes to k ≥ 2 groups.

📊 Summary table

Element | Description
Purpose | Test for differences in central tendency across k groups
Data transformation | Convert all observations to ranks (ignoring group membership)
Statistic | H, computed from rank sums and sample sizes
Distribution | Chi Square with k - 1 degrees of freedom
Ties | Assign mean of the ranks the tied values would occupy
Null hypothesis | No difference in central tendency across groups
Rejection | If p-value from Chi Square is below threshold, reject the null
119

Rank Randomization for Association (Spearman's ρ)

Rank Randomization for Association (Spearman's ρ)

🧭 Overview

🧠 One-sentence thesis

Spearman's ρ tests whether two variables are associated by converting data to ranks and comparing the observed rank correlation to all possible rearrangements, allowing significance testing without assuming a particular distribution shape.

📌 Key points (3–5)

  • What Spearman's ρ is: the correlation coefficient calculated on ranked data rather than raw values.
  • How the test works: fix one variable's ranks, rearrange the other variable's ranks in all possible ways, and count how many arrangements produce correlations as high or higher than the observed correlation.
  • Significance calculation: the p-value equals the number of arrangements meeting or exceeding the observed correlation divided by the total possible arrangements (N factorial).
  • Common confusion: one-tailed vs two-tailed—one-tailed counts arrangements with correlations as large or larger; two-tailed must account for both directions.
  • Practical use: critical value tables exist for moderate sample sizes because counting all arrangements becomes impractical as N grows.

🔢 From raw data to ranks

🔢 Converting to ranks

  • The excerpt shows raw X and Y values first, then converts each variable to ranks separately.
  • Each value gets a rank based on its position when sorted: smallest = 1, next = 2, and so on.
  • Example from the excerpt: raw X values (1, 2.4, 3.8, 4, 11) become ranks (1, 2, 3, 4, 5); raw Y values (1, 2, 2.3, 3.7, 2.5) become ranks (1, 2, 3, 5, 4).

📊 Why rank instead of raw values

The rank randomization test for association is equivalent to the randomization test for Pearson's r except that the numbers are converted to ranks before the analysis is done.

  • Ranking removes the influence of the exact numerical distances between values.
  • The test focuses on the order relationship rather than the magnitude of differences.
  • This approach is more robust when distributions are non-normal or contain outliers.

🔀 The randomization logic

🔀 Fixing one variable and rearranging the other

  • The approach treats the X variable's ranks as fixed.
  • The test compares the actual correlation to correlations obtained by rearranging the Y variable's ranks in every possible way.
  • Example from the excerpt: the observed correlation between X ranks (1, 2, 3, 4, 5) and Y ranks (1, 2, 3, 5, 4) is 0.90.

🎲 Counting favorable arrangements

  • The excerpt identifies arrangements that produce correlations as high or higher than the observed 0.90.
  • One arrangement (switching the fourth and fifth Y values) produces a correlation of 1.0.
  • Three other arrangements also produce a correlation of 0.90.
  • Total favorable arrangements: 5 (including the observed data itself).

🧮 Total possible arrangements

  • The number of possible arrangements equals N factorial (N!), where N is the number of pairs.
  • For the example with 5 pairs: 5! = 120 possible arrangements.
  • Don't confuse: this counts all possible orderings of the Y ranks, not just some subset.

📐 Calculating significance

📐 The p-value formula

  • The probability value equals the number of favorable arrangements divided by the total possible arrangements.
  • Example from the excerpt: 5 favorable arrangements out of 120 total = 5/120 = 0.042.
  • This is a one-tailed probability because it counts only arrangements with correlations as large or larger.
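
A minimal sketch of that count, holding the X ranks fixed and running through all 120 orderings of the Y ranks (numpy is an assumed convenience).

```python
# Minimal sketch: Spearman's rho is Pearson's r computed on ranks, so enumerate
# all orderings of the Y ranks and count correlations of 0.90 or higher.
from itertools import permutations
from math import factorial
import numpy as np

x_ranks = np.array([1, 2, 3, 4, 5])
y_ranks = [1, 2, 3, 5, 4]                      # observed Y ranks; rho = 0.90

observed_rho = np.corrcoef(x_ranks, y_ranks)[0, 1]
count = sum(np.corrcoef(x_ranks, perm)[0, 1] >= observed_rho - 1e-12
            for perm in permutations(y_ranks))
print(count, round(count / factorial(5), 3))   # 5 arrangements -> 5/120 = 0.042
```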

↔️ One-tailed vs two-tailed

Test type | What it counts | Example p-value
One-tailed | Arrangements with correlations as large or larger in one direction | 0.042
Two-tailed | Arrangements with correlations as extreme in either direction | 0.084 (double the one-tailed)
  • The excerpt shows the two-tailed probability is 0.084, exactly double the one-tailed value.
  • For the example data with N=5, the critical value for a two-tailed test at 0.05 level is 1.0, so the observed ρ of 0.90 is not significant two-tailed.
  • The critical value for a one-tailed test at 0.05 level is 0.90, so the observed ρ of 0.90 is exactly significant one-tailed.

📋 Using critical value tables

📋 Why tables are needed

  • Counting all arrangements becomes impractical when sample size is even moderately large.
  • The excerpt states: "Since it is hard to count up all the possibilities when the sample size is even moderately large, it is convenient to have a table of critical values."
  • Tables provide cutoff values for common significance levels without exhaustive counting.

📊 Reading the critical value table

The excerpt provides a table with columns for different significance levels:

  • N: sample size (number of pairs)
  • .05 2-tail and .01 2-tail: critical values for two-tailed tests
  • .05 1-tail and .01 1-tail: critical values for one-tailed tests

Example interpretation:

  • For N=5, one-tailed 0.05 level: critical value is 0.90
  • For N=5, two-tailed 0.05 level: critical value is 1.0
  • As N increases, critical values generally decrease (easier to reach significance with larger samples)

🔍 Applying the table to the example

  • The example has N=5 and observed Spearman's ρ = 0.90.
  • One-tailed test at 0.05 level: critical value is 0.90, so the result is significant (observed equals critical).
  • Two-tailed test at 0.05 level: critical value is 1.0, so the result is not significant (observed is less than critical).
  • This matches the calculated p-values: 0.042 (one-tailed) is less than 0.05, but 0.084 (two-tailed) is greater than 0.05.
120

Proportions

Proportions

🧭 Overview

🧠 One-sentence thesis

When comparing two proportions (such as health outcomes between diets), multiple effect-size measures exist, and while relative risk reduction can sound impressive, absolute risk reduction and number needed to treat often provide more practical insight, especially when baseline risks are low.

📌 Key points (3–5)

  • What the section addresses: how to measure the effect size when comparing two proportions (e.g., health outcomes in two diet groups).
  • Four measures introduced: Absolute Risk Reduction (ARR), Relative Risk Reduction (RRR), odds ratio, and number needed to treat (NNT).
  • Common confusion: RRR can exaggerate importance when absolute risks are very low—a 50% relative reduction sounds large, but if baseline risk is tiny, the practical benefit is minimal.
  • Which measure is best: each has proper uses, but ARR and NNT often give a clearer picture of practical impact than RRR alone.

📏 When proportions are self-evident vs. when comparison is needed

📏 Single proportions

  • Sometimes a single proportion is easy to interpret on its own.
  • Example: the excerpt mentions that 24% of white non-Hispanic adults in the U.S. (2006–2008) were obese—this percentage directly indicates the magnitude of the problem.

🔀 Comparing two proportions

  • Often the research question involves comparing outcomes between two groups.
  • The excerpt uses the Mediterranean Diet study: one group followed the AHA diet, another followed the Mediterranean diet.
  • Key comparison: proportion of people who stayed healthy throughout the study.
    • AHA diet: 0.79 healthy (so 0.21 not healthy).
    • Mediterranean diet: 0.90 healthy (so 0.10 not healthy).
  • The question becomes: how do we best measure the benefit of switching diets?

📐 Four measures of effect size for proportions

➖ Absolute Risk Reduction (ARR)

Absolute Risk Reduction (ARR): the difference between the proportion with the ailment in the control group and the proportion in the treatment group.

  • Formula (in words): ARR equals C minus T, where C is the proportion in the control group with the ailment and T is the proportion in the treatment group.
  • For the diet example:
    • C = 0.21 (not healthy on AHA diet).
    • T = 0.10 (not healthy on Mediterranean diet).
    • ARR = 0.21 − 0.10 = 0.11.
  • Interpretation: switching from the AHA diet to the Mediterranean diet reduces the proportion of non-healthy people by 0.11 (11 percentage points).

📉 Relative Risk Reduction (RRR)

Relative Risk Reduction (RRR): the proportional decrease in the ailment rate, expressed as a percentage of the control group's rate.

  • Formula (in words): RRR equals (C minus T) divided by C, then multiplied by 100.
  • For the diet example:
    • (0.21 − 0.10) / 0.21 × 100 = 52%.
  • Interpretation: the proportion of non-healthy people on the Mediterranean diet is 52% lower than on the AHA diet.
  • Don't confuse: RRR sounds large, but it is relative to the baseline; when baseline risk is very low, a high RRR can still mean a tiny absolute benefit.

🎲 Odds ratio

  • The odds ratio compares the odds of the outcome in the two groups (not the probabilities).
  • For the diet example:
    • Odds of being healthy on Mediterranean diet: 90:10 = 9:1.
    • Odds of being healthy on AHA diet: 79:21 ≈ 3.76:1.
    • Odds ratio = 9 / 3.76 ≈ 2.39.
  • Interpretation: the odds of being healthy on the Mediterranean diet are 2.39 times the odds on the AHA diet.
  • Note: this is the ratio of odds, not the ratio of probabilities.

🔢 Number Needed to Treat (NNT)

Number Needed to Treat (NNT): the number of people who must switch to the treatment in order to prevent one additional case of the ailment.

  • Formula (in words): N equals 1 divided by ARR.
  • For the diet example:
    • N = 1 / 0.11 ≈ 9.
  • Interpretation: for every nine people who switch from the AHA diet to the Mediterranean diet, one person who would otherwise not be healthy is expected to stay healthy.
  • This measure translates the effect into a concrete, practical number.
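
A short sketch computing all four measures for the diet proportions quoted above; the variable names c and t follow the section's wording (control and treatment).

```python
# Effect-size measures for the diet example
# (AHA diet: 0.21 not healthy; Mediterranean diet: 0.10 not healthy).
c = 0.21   # proportion with the ailment in the control (AHA) group
t = 0.10   # proportion with the ailment in the treatment (Mediterranean) group

arr = c - t                                   # Absolute Risk Reduction ≈ 0.11
rrr = (c - t) / c * 100                       # Relative Risk Reduction ≈ 52%
odds_control = (1 - c) / c                    # odds of staying healthy ≈ 3.76:1
odds_treatment = (1 - t) / t                  # odds of staying healthy = 9:1
odds_ratio = odds_treatment / odds_control    # ≈ 2.39
nnt = 1 / arr                                 # Number Needed to Treat ≈ 9.1, about 9

print(round(arr, 2), round(rrr), round(odds_ratio, 2), round(nnt))
# 0.11 52 2.39 9
```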

⚖️ Which measure is best?

⚖️ Trade-offs and cautions

  • The excerpt states that each measure has its proper uses, but RRR can exaggerate importance, especially when absolute risks are low.
  • Example from the excerpt:
    • A drug reduces disease risk from 1 in 1,000,000 to 1 in 2,000,000.
    • RRR = 50% (sounds impressive).
    • ARR = 0.0000005 (the practical reduction is minimal).
  • Common confusion: a high RRR does not always mean a large practical benefit; always check the baseline risk and the ARR.
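
The same arithmetic for the low-baseline example above, with one extra line applying the NNT formula from the previous subsection (the NNT value is derived here, not quoted from the excerpt).

```python
# Why a large RRR can be practically negligible: the hypothetical drug cuts
# risk from 1 in 1,000,000 to 1 in 2,000,000.
baseline_risk = 1 / 1_000_000
treated_risk = 1 / 2_000_000

rrr = (baseline_risk - treated_risk) / baseline_risk * 100   # 50%, sounds impressive
arr = baseline_risk - treated_risk                           # 0.0000005, tiny in absolute terms
nnt = 1 / arr                                                # derived: 2,000,000 treated per case prevented

print(f"RRR = {rrr:.0f}%, ARR = {arr:.7f}, NNT ≈ {nnt:,.0f}")
# RRR = 50%, ARR = 0.0000005, NNT ≈ 2,000,000
```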

🧭 Practical guidance

  • When baseline risks are very low, ARR and NNT provide clearer insight into real-world impact.
  • RRR is useful for understanding proportional change but should not be interpreted in isolation.
  • The odds ratio is common in certain statistical contexts (e.g., logistic regression) but is less intuitive for direct interpretation of benefit.

📊 Summary table

| Measure | Formula (in words) | Diet example value | What it tells you |
| --- | --- | --- | --- |
| ARR | C − T | 0.11 | Absolute reduction in proportion with ailment |
| RRR | (C − T) / C × 100 | 52% | Proportional reduction relative to control |
| Odds ratio | (odds in treatment) / (odds in control) | 2.39 | Ratio of odds of the outcome |
| NNT | 1 / ARR | 9 | Number of people to treat to prevent one case |

Difference Between Two Means

Difference Between Two Means

🧭 Overview

🧠 One-sentence thesis

When comparing two groups, the choice between raw mean differences and standardized measures depends on whether the measurement scale has inherent meaning, with standardized measures like Hedges' g and Cohen's d allowing interpretation even when scale units are arbitrary.

📌 Key points (3–5)

  • When raw differences work: If the measurement scale has inherent meaning (e.g., minutes of sleep), the simple difference between means is interpretable on its own.
  • When standardization is needed: If the scale lacks clear meaning (e.g., a 7-point attitude rating), standardized measures express the difference in standard deviation units, making it scale-free.
  • Two standardized measures: Hedges' g and Cohen's d both divide the mean difference by the standard deviation; they differ only in the denominator (N-1 vs N).
  • Common confusion: Standardized effect sizes depend heavily on subject variability—the same treatment effect will appear larger in a homogeneous group than in a heterogeneous group, even though the actual treatment impact is identical.
  • Context matters: A "small" effect by conventional guidelines can still be practically important depending on the situation (e.g., selection at extreme thresholds).

📏 When to use raw mean differences

📏 Inherently meaningful scales

When the units of a measurement scale are meaningful in their own right, then the difference between means is a good and easily interpretable measure of effect size.

  • If the scale has real-world meaning (e.g., minutes, kilograms, dollars), the raw difference directly shows the magnitude of the effect.
  • Example: A drug increased total sleep duration by a mean of 61.8 minutes compared to placebo—this number is immediately understandable without further transformation.

📐 Proportional differences for ratio scales

  • When the dependent variable is measured on a ratio scale (has a true zero), it is often informative to consider the proportional difference in addition to the absolute difference.
  • The same absolute difference can mean very different things depending on the baseline.
  • Example: A 61.8-minute increase represents a 51% increase if the placebo group slept 120 minutes, but only a 15% increase if they slept 420 minutes.

🔢 Log transformations and percent changes

  • If a log transformation is applied to the dependent variable, equal percent changes on the original scale result in equal absolute changes on the log scale.
  • Example: A 10% increase from 400 to 440 minutes and a 10% increase from 300 to 330 minutes both produce the same log difference: 0.041 on the log base 10 scale.
  • This property makes log scales useful for comparing proportional effects across different baselines.
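
A quick check of the log-scale property described above.

```python
# Equal percent changes give equal differences on the log scale:
# a 10% increase from 400 to 440 minutes vs. a 10% increase from 300 to 330 minutes.
import math

diff_1 = math.log10(440) - math.log10(400)
diff_2 = math.log10(330) - math.log10(300)

print(round(diff_1, 3), round(diff_2, 3))   # both equal log10(1.1)
# 0.041 0.041
```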

🎯 Standardized measures for arbitrary scales

🎯 When scales lack inherent meaning

  • Many times the dependent variable is measured on a scale that is not inherently meaningful.
  • Example: Attitudes toward animal research measured on a 7-point scale—a 1.47-unit difference is hard to interpret because it is not clear whether this should be considered large or small.
  • Solution: Express the difference in standardized units (number of standard deviations the means differ by).

🧮 Hedges' g and Cohen's d formulas

Both measures consist of the difference between means divided by the standard deviation; they differ only in the denominator:

| Measure | Denominator of the effect size | Variance estimate used |
| --- | --- | --- |
| Hedges' g | √MSE | Mean square error (MSE), which carries the N-1 correction |
| Cohen's d | A standard deviation computed with N | The same pooled variability, divided by N instead of N-1 |

  • Where M₁ is the mean of the first group, M₂ is the mean of the second group, MSE is the mean square error, and N is the total number of observations.
  • Both formulas produce a scale-free measure: the original units are replaced by standardized units, interpretable even if the original scale units do not have clear meaning.

📊 Example: Animal research attitudes

  • Women's mean rating: 5.353; men's mean rating: 3.882; MSE: 2.864.
  • Hedges' g calculated to be 0.87.
  • It is more meaningful to say the means were 0.87 standard deviations apart than 1.47 scale units apart, since the scale units are not well defined.
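
A quick check of the reported value: the mean difference divided by the square root of the MSE, consistent with the numbers above.

```python
# Hedges' g for the animal-research attitude example.
import math

mean_women = 5.353
mean_men = 3.882
mse = 2.864

g = (mean_women - mean_men) / math.sqrt(mse)
print(round(g, 2))   # 0.87: the means are about 0.87 standard deviations apart
```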

📏 Cohen's guidelines for effect size

Cohen (1988) suggested widely adopted guidelines:

| Effect size | Classification |
| --- | --- |
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

  • Based on these guidelines, the 0.87 effect in the animal research example is a large effect.
  • Important caveat: These guidelines are somewhat arbitrary and have not been universally accepted; other important factors may be ignored if these definitions are used mechanically (e.g., for sample size planning).

⚠️ Interpretational issues and confusions

⚠️ Context determines importance

It is important to realize that the importance of an effect depends on the context.

  • A small effect can make a big difference if only extreme observations are of interest.
  • Example: Two groups of students (red and blue) with means of 52 and 50, both with standard deviation 10—only a 0.2 standard deviation difference, generally considered small.
  • But if only students scoring 70 or higher qualify for a selective program:
    • Proportion of red students qualifying: 0.036
    • Proportion of blue students qualifying: 0.023
    • Ratio: 1.6:1, meaning 62% red vs 38% blue among 100 accepted students.
  • In most contexts this would be considered an important difference, even though the raw effect size is "small."
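
A sketch reproducing the selection example with normal-distribution tail areas (assuming normally distributed scores, which is what the quoted proportions imply).

```python
# A 0.2 SD difference in means becomes a noticeably unequal mix among high scorers.
from scipy.stats import norm

cutoff = 70
p_red = norm.sf(cutoff, loc=52, scale=10)    # P(score >= 70 | mean 52, sd 10) ≈ 0.036
p_blue = norm.sf(cutoff, loc=50, scale=10)   # P(score >= 70 | mean 50, sd 10) ≈ 0.023

ratio = p_red / p_blue                        # ≈ 1.6
share_red = p_red / (p_red + p_blue)          # ≈ 0.61, close to the 62/38 split above
print(round(p_red, 3), round(p_blue, 3), round(ratio, 1), round(share_red, 2))
# 0.036 0.023 1.6 0.61
```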

🔄 Subject variability strongly affects standardized measures

When the effect size is measured in standard deviation units, it is important to recognize that the variability in the subjects has a large influence on the effect size measure.

  • Key confusion: Two experiments with the same treatment effect can show very different standardized effect sizes if subject populations differ in homogeneity.
  • Example: Two hypothetical experiments on an exercise program's effect on blood pressure:
    • Both experiments: mean effect = 10 mmHg reduction in systolic blood pressure.
    • Experiment 1: standard deviation = 20 → standardized effect size = 0.50
    • Experiment 2: standard deviation = 30 → standardized effect size = 0.33
  • The standardized effect sizes differ even though the effectiveness of the treatment is exactly the same in the two experiments.
  • Don't confuse: A larger standardized effect size does not always mean a more effective treatment—it may simply reflect a more homogeneous subject population.

🧪 Comparing across studies

  • Because standardized measures depend on subject variability, comparing effect sizes across studies with different populations requires caution.
  • More homogeneous subjects → larger standardized effect size for the same absolute treatment effect.
  • More heterogeneous subjects → smaller standardized effect size for the same absolute treatment effect.

Proportion of Variance Explained

Proportion of Variance Explained

🧭 Overview

🧠 One-sentence thesis

Proportion of variance explained measures effect size by quantifying how much of the total variation in outcomes is attributable to a specific variable, with ω² preferred over η² because it provides an unbiased estimate.

📌 Key points (3–5)

  • What it measures: the proportion of total variance in outcomes that can be attributed to an experimental condition or predictor variable, expressed as a percentage or decimal.
  • Two main estimators: η² (eta squared) tends to overestimate and is biased; ω² (omega squared) is unbiased and recommended.
  • Context dependency: the proportion explained depends on both the specific levels of the independent variable used and the variability of the population sampled—it is not a universal property of a variable.
  • Common confusion: in factorial designs, distinguish between overall proportion (relative to total variance) vs. partial proportion (relative to variance for that effect plus error only).
  • Analogous measures in regression: R² in multiple regression is analogous to η² and is biased; adjusted R² is analogous to ω² and reduces bias.

📊 Understanding variance explained in ANOVA

📊 Why scores vary

  • In any experiment, subjects' responses vary for many reasons.
  • Example from the "Smiles and Leniency" case study: leniency scores varied because subjects were in different smile conditions, had different baseline leniency tendencies, were in different moods, and reacted differently to the stimulus person.
  • The goal is to determine what proportion of this total variation is due to the experimental condition.

📏 How to compute the proportion

Proportion of variance explained: the ratio of variance attributable to the experimental condition to the total variance in scores.

Method 1: Variance comparison

  • Compute the overall variance of all scores (ignoring conditions).
  • Compute the mean of the variances within each treatment condition.
  • The difference reflects variance due to conditions.
  • Example: overall variance = 2.794; mean within-condition variance = 2.649; the small difference shows "Smile Condition" explains little variance.

Method 2: Sum of squares ratio

  • More convenient: divide sum of squares for conditions by sum of squares total.
  • Example: SSQ conditions = 27.544; SSQ total = 377.189; proportion = 27.544 / 377.189 = 0.073 (7.3%).

🔄 Alternative view: proportional reduction in error

  • SSQ total (377.189) = variation when condition is ignored.
  • SSQ error (349.654) = variation remaining after accounting for condition.
  • Reduction in error = 377.189 − 349.654 = 27.535.
  • Proportional reduction = 27.535 / 377.189 = 0.073, the same as variance explained.
  • This shows that "variance explained" and "reduction in error" are two ways of describing the same concept.
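
A quick check that the two computations described above give the same proportion.

```python
# Variance explained vs. proportional reduction in error,
# using the Smiles and Leniency sums of squares quoted above.
ssq_conditions = 27.544
ssq_error = 349.654
ssq_total = 377.189

eta_squared = ssq_conditions / ssq_total                   # ≈ 0.073
reduction_in_error = (ssq_total - ssq_error) / ssq_total   # ≈ 0.073, the same quantity

print(round(eta_squared, 3), round(reduction_in_error, 3))
# 0.073 0.073
```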

⚖️ Choosing between η² and ω²

⚖️ η² (eta squared)

η²: the proportion of total variance explained by a variable, computed as SSQ conditions divided by SSQ total.

  • Bias problem: η² tends to overestimate the true proportion of variance explained.
  • Despite being reported by leading statistics packages, it is not recommended due to this positive bias.

✅ ω² (omega squared)

ω²: an unbiased estimate of the proportion of variance explained.

  • Formula (in words): (SSQ conditions minus (k − 1) times MSE) divided by (SSQ total plus MSE), where k is the number of conditions and MSE is mean square error.
  • Example: with k = 4 and the given data, ω² = 0.052 (compared to η² = 0.073).
  • ω² is smaller than η² because it corrects for the positive bias.
  • Recommendation: use ω² instead of η².
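
A sketch of the ω² formula with the example's numbers; MSE is taken to be the mean within-condition variance (2.649) quoted earlier, which equals the ANOVA mean square error when the groups are the same size.

```python
# Omega squared for the Smiles and Leniency example, using the formula above.
k = 4                      # number of smile conditions
ssq_conditions = 27.544
ssq_total = 377.189
mse = 2.649                # mean within-condition variance, assumed equal to MSE

omega_squared = (ssq_conditions - (k - 1) * mse) / (ssq_total + mse)
print(round(omega_squared, 3))   # 0.052, smaller than eta squared's 0.073
```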

🎯 Context dependency of variance explained

🎯 Not a universal property

  • The proportion of variance explained is not a general characteristic of an independent variable.
  • It depends on:
    1. The specific levels of the independent variable used in the experiment.
    2. The variability of the population sampled.

🍺 Example: alcohol and driving ability

| Design | Doses | Population | Implication |
| --- | --- | --- | --- |
| Design 1 | 0.00, 0.30, 0.60 | All drivers 16–80 years | Smaller dose range → less variance due to Dose; more diverse population → larger total variance → smaller proportion explained |
| Design 2 | 0.00, 0.50, 1.00 | Experienced drivers 25–30 years | Larger dose range → more variance due to Dose; less diverse population → smaller total variance → larger proportion explained |

  • Design 2 manipulates alcohol more strongly (larger range), increasing variance due to Dose.
  • Design 1 includes a more diverse set of drivers, increasing total variance.
  • Result: proportion of variance explained by Dose would be much less in Design 1 than Design 2.
  • Don't confuse: a larger effect size in one study vs. another does not necessarily mean the variable is "more important"—it may reflect design choices and population characteristics.

🧩 Factorial designs: overall vs. partial measures

🧩 Multiple sources of variation

  • In a one-factor design: SSQ total = SSQ condition + SSQ error.
  • In an A × B factorial design: SSQ total = SSQ A + SSQ B + SSQ A×B + SSQ error.
  • Question: should proportion explained for A be computed relative to SSQ total or relative to (SSQ A + SSQ error)?

🔢 Example: age and reading method

  • Hypothetical experiment: Age (6 vs. 12 years) × Treatment (experimental vs. control).
  • Standard deviation in each cell = 5; 10 subjects per cell; total N = 40.
  • Means:
| Age | Experimental | Control |
| --- | --- | --- |
| 6 | 40 | 42 |
| 12 | 50 | 56 |

  • SSQ Age = 1440 (very large, as expected—reading ability differs greatly between 6- and 12-year-olds).
  • SSQ Condition = 160; SSQ A×C = 40; SSQ error = 900; SSQ total = 2540.
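
The sums of squares quoted above can be re-derived from the cell means; a minimal sketch under the stated design (10 subjects per cell, within-cell standard deviation of 5).

```python
# Re-deriving the factorial sums of squares from the cell means.
n_per_cell = 10
sd_within = 5
cell_means = {("6", "experimental"): 40, ("6", "control"): 42,
              ("12", "experimental"): 50, ("12", "control"): 56}

grand = sum(cell_means.values()) / 4                                     # 47
age_means = {"6": (40 + 42) / 2, "12": (50 + 56) / 2}                    # 41 and 53
cond_means = {"experimental": (40 + 50) / 2, "control": (42 + 56) / 2}   # 45 and 49

ssq_age = 2 * n_per_cell * sum((m - grand) ** 2 for m in age_means.values())    # 1440
ssq_cond = 2 * n_per_cell * sum((m - grand) ** 2 for m in cond_means.values())  # 160
ssq_interaction = n_per_cell * sum(
    (cell_means[(a, c)] - age_means[a] - cond_means[c] + grand) ** 2
    for a in age_means for c in cond_means
)                                                                                # 40
ssq_error = 4 * (n_per_cell - 1) * sd_within ** 2    # (n - 1) * s^2 summed over 4 cells = 900
ssq_total = ssq_age + ssq_cond + ssq_interaction + ssq_error                     # 2540

print(ssq_age, ssq_cond, ssq_interaction, ssq_error, ssq_total)
# 1440.0 160.0 40.0 900 2540.0
```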

📐 Four measures of effect size

| Source | df | SSQ | η² | partial η² | ω² | partial ω² |
| --- | --- | --- | --- | --- | --- | --- |
| Age | 1 | 1440 | 0.567 | 0.615 | 0.552 | 0.586 |
| Condition | 1 | 160 | 0.063 | 0.151 | 0.053 | 0.119 |
| A × C | 1 | 40 | 0.016 | 0.043 | 0.006 | 0.015 |
| Error | 36 | 900 | | | | |
| Total | 39 | 2540 | | | | |

🔍 η² vs. partial η²

  • η² for Age = SSQ Age / SSQ total = 1440 / 2540 = 0.567.
    • Interpretation: Age explains 56.7% of the total variation.
  • Partial η² for Age = SSQ Age / (SSQ Age + SSQ error) = 1440 / 2340 = 0.615.
    • Interpretation: Age explains 61.5% of the variation in Age plus error (ignoring other effects).
  • Partial η² is larger because the denominator is smaller (excludes SSQ Condition and SSQ A×C).
  • The difference is even larger for Condition: η² = 0.063 vs. partial η² = 0.151, because SSQ Age is large and makes a big difference when included in the denominator.
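
A quick computation of η² and partial η² for all three effects from the sums of squares above.

```python
# Overall vs. partial eta squared for the Age x Condition example.
ssq = {"Age": 1440, "Condition": 160, "A x C": 40, "Error": 900}
ssq_total = sum(ssq.values())   # 2540

for effect in ("Age", "Condition", "A x C"):
    eta_sq = ssq[effect] / ssq_total                              # relative to total variance
    partial_eta_sq = ssq[effect] / (ssq[effect] + ssq["Error"])   # relative to effect + error
    print(effect, round(eta_sq, 3), round(partial_eta_sq, 3))
# Age 0.567 0.615
# Condition 0.063 0.151
# A x C 0.016 0.043
```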

✅ ω² vs. partial ω²

  • As with one-factor designs, ω² is preferred over η² because η² has a positive bias.
  • ω² values are smaller than η²: for Age, ω² = 0.552 vs. η² = 0.567.
  • Partial ω² is computed relative to (SSQ effect + SSQ error), analogous to partial η².
  • Formula for ω² (in words): (SSQ effect minus df effect times MSE) divided by (SSQ total plus MSE). The partial version keeps the same numerator but divides by (SSQ effect plus (N minus df effect) times MSE), where N is the total number of observations; both are checked in the sketch below.
  • Choice is subjective: neither ω² nor partial ω² is "correct"—they answer different questions. Understand the difference and know which your software computes (some packages label them incorrectly).
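
The same comparison for ω² and partial ω². The partial formula used here is the standard one rather than one spelled out in the excerpt, but it reproduces the table values above.

```python
# Omega squared and partial omega squared for the Age x Condition example.
# Partial formula (assumed standard form):
#   (SSQ_effect - df_effect * MSE) / (SSQ_effect + (N - df_effect) * MSE)
df_error, ssq_error = 36, 900
mse = ssq_error / df_error          # 25.0
n_total = 40
ssq_total = 2540

effects = {"Age": (1, 1440), "Condition": (1, 160), "A x C": (1, 40)}
for name, (df_effect, ssq_effect) in effects.items():
    omega_sq = (ssq_effect - df_effect * mse) / (ssq_total + mse)
    partial_omega_sq = (ssq_effect - df_effect * mse) / (
        ssq_effect + (n_total - df_effect) * mse
    )
    print(name, round(omega_sq, 3), round(partial_omega_sq, 3))
# Age 0.552 0.586
# Condition 0.053 0.119
# A x C 0.006 0.015
```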

📈 Correlational studies and regression

📈 Variance explained in regression

  • In multiple regression, the sum of squares for Y (criterion variable) is partitioned into SSQ explained and SSQ error.
  • Proportion of variance explained = SSQ explained / SSQ total.
  • In simple regression: proportion = r² (correlation squared).
  • In multiple regression: proportion = R² (multiple correlation squared).

⚖️ R² is biased, adjusted R² is better

R²: analogous to η² and is a biased estimate of variance explained.

Adjusted R²: analogous to ω² and is less biased (though not completely unbiased).

  • Formula for adjusted R² (in words): 1 minus [(1 minus R²) times (N minus 1) divided by (N minus p minus 1)], where N is total observations and p is the number of predictor variables.
  • Recommendation: use adjusted R² instead of R² to reduce bias, especially with smaller samples or more predictors.
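
A minimal sketch of the adjustment; the R², N, and p values below are hypothetical, chosen only to illustrate how the correction pulls R² down.

```python
# Adjusted R squared, following the formula stated above.
def adjusted_r_squared(r_squared, n, p):
    """1 - (1 - R^2) * (N - 1) / (N - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical example: R^2 = 0.50 from N = 30 observations and p = 5 predictors.
print(round(adjusted_r_squared(0.50, 30, 5), 3))   # 0.396, noticeably below the raw 0.50
```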