Research Methods in Psychology


1. Methods of Knowing

🧭 Overview

🧠 One-sentence thesis

The scientific method is the most reliable way to acquire valid knowledge because it combines systematic empiricism with logical reasoning, overcoming the weaknesses of intuition, authority, rationalism, and empiricism used alone.

📌 Key points (3–5)

  • Five methods of knowing: intuition, authority, rationalism, empiricism, and the scientific method—each has strengths and weaknesses.
  • Why non-scientific methods fail: intuition is driven by biases, authority figures can be wrong or misleading, rationalism depends on correct premises, and empiricism alone can deceive through limited or distorted observations.
  • What makes science different: the scientific method uses systematic empiricism (structured observations under controlled conditions) plus rationalism to test ideas and reach valid conclusions.
  • Common confusion: empiricism vs. systematic empiricism—casual observation (e.g., "all swans I've seen are white") is not the same as the controlled, structured observation science requires.
  • Trade-off: the scientific method is most reliable but requires time, resources, and can only answer empirical questions.

🧩 The five methods of acquiring knowledge

🔮 Intuition

Intuition: relying on our gut feelings, emotions, and instincts to guide us; believing what feels true rather than examining facts or using rational thought.

  • The appeal: quick decisions without paralyzing analysis; sometimes intuition-based decisions can be superior to analysis-based ones.
  • The problem: intuitions can be wrong because they are driven by cognitive and motivational biases, not logical reasoning or scientific evidence.
  • Example: Your friend acts strange and won't look you in the eye → you intuit they are lying. But they might just be preoccupied or uncomfortable for unrelated reasons.

👨‍🏫 Authority

Authority: accepting new ideas because an authority figure (parents, media, doctors, religious leaders, government, professors) states they are true.

  • Why we use it: we don't have time to independently research every piece of knowledge.
  • The problem: authority figures may be wrong, may use only their own intuition, or may have reasons to mislead you.
  • Example: Parents say "make your bed in the morning," but making the bed creates a warm, damp environment where mites thrive; leaving sheets open is actually less hospitable to mites.
  • Historical warning: unquestioning obedience to authority has led to atrocities (Salem Witch Trials, Nazi war crimes).
  • What we can do: evaluate credentials, evaluate the methods used to reach conclusions, and check for motives to mislead.

🧮 Rationalism

Rationalism: using logic and reasoning to acquire new knowledge by stating premises and following logical rules to arrive at sound conclusions.

  • How it works: given correct premises and valid logic, you can reach a conclusion without direct observation.
  • Example: Premise 1: All swans are white. Premise 2: This is a swan. Conclusion: This swan is white (no need to see it).
  • The problem: if premises are wrong or there is an error in logic, the conclusion will not be valid.
  • Example: The premise "all swans are white" is incorrect—there are black swans in Australia.
  • Additional risk: unless formally trained in logic, it is easy to make errors.
  • When it works: if premises are correct and logical rules are followed appropriately, this is a sound means of acquiring knowledge.
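The swan syllogism can be modeled as a toy piece of executable logic (a hypothetical sketch, not from the text): the deductive step itself is valid, but the conclusion is only as trustworthy as the premises it starts from.

```python
# Toy model of the swan syllogism: the deduction is valid,
# but the conclusion inherits any error in the premises.
def deduce_swan_color(all_swans_are_white: bool, is_swan: bool) -> str:
    """Apply the syllogism: if both premises hold, conclude 'white'."""
    if all_swans_are_white and is_swan:
        return "white"
    return "cannot conclude"

# With the (actually false) premise accepted, the logic yields "white"
print(deduce_swan_color(True, True))    # → white
# Reject the premise (black swans exist) and the conclusion evaporates
print(deduce_swan_color(False, True))   # → cannot conclude
```

The logic never fails; it is the false premise "all swans are white" that produces the wrong belief.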

👁️ Empiricism

Empiricism: acquiring knowledge through observation and experience.

  • Why it feels reliable: direct sensory evidence (e.g., "I have only ever seen white swans, so all swans are white").
  • The problem: we are limited in what we can experience and observe, and our senses can deceive us.
  • Example: For centuries people believed the world was flat because it appears flat.
  • Example: Visual illusions trick our senses.
  • Additional issue: prior experiences can alter the way we perceive events.
  • Don't confuse: casual empiricism (everyday observation) vs. systematic empiricism (structured, controlled observation used in science).

🔬 The scientific method

🔬 What the scientific method is

The scientific method: a process of systematically collecting and evaluating evidence to test ideas and answer questions.

  • Scientists may use intuition, authority, rationalism, and empiricism to generate new ideas, but they don't stop there.
  • The key difference: scientists go further by using systematic empiricism (careful observations under controlled conditions) to test ideas, and they use rationalism to arrive at valid conclusions.

⚙️ Systematic empiricism

Systematic empiricism: structured observations made under various controlled conditions, not just casual everyday observation.

  • Science relies on observations, but not just any observations—they must be systematic.
  • Example: Researchers didn't trust stereotypes or informal observations about whether women talk more than men; they systematically recorded, counted, and compared the number of words spoken by a large sample of women and men.
  • When systematic observations conflicted with stereotypes, they trusted the systematic observations.

✅ Strengths of the scientific method

  • Most likely to produce valid knowledge among all five methods.
  • Combines the best of rationalism (logical reasoning) and empiricism (observation) while controlling for biases and errors.

⚠️ Limitations of the scientific method

| Limitation | Explanation |
| --- | --- |
| Time and resources | The scientific method can require considerable time and resources, so it is not always feasible. |
| Scope | It cannot be used to answer all questions—only empirical questions (those that can be tested through observation). |


2. Understanding Science

🧭 Overview

🧠 One-sentence thesis

Science is defined not by its subject matter or tools but by a general approach—systematic empiricism, empirical questions, and public knowledge—that distinguishes it from pseudoscience and allows it to self-correct over time.

📌 Key points (3–5)

  • What makes something science: not the topic or equipment, but three fundamental features—systematic empiricism, empirical questions, and public knowledge.
  • Systematic empiricism: scientists insist on checking ideas against carefully planned, recorded, and analyzed observations, not stereotypes or informal hunches.
  • Empirical vs. non-empirical questions: science answers questions about how the world actually is, not questions about values (what ought to be).
  • Common confusion: pseudoscience may look scientific (impressive terms, research claims) but lacks one or more of the three features—especially falsifiability.
  • Why it matters: publication enables collaboration across time/space and allows the scientific community to detect and correct errors, making knowledge increasingly accurate.

🔬 What defines science

🔬 Not the subject matter or tools

  • Psychology, astronomy, biology, and chemistry share no common topic: celestial bodies, living organisms, matter, and human behavior are all different.
  • They also share no common equipment: a biologist would not know what to do with a radio telescope; a chemist would not know how to track a moose population.
  • The common thread: a general approach to understanding the natural world.
  • Example: Psychology is a science because it applies this same approach to one aspect of the natural world—human behavior.

🧩 The three fundamental features

All sciences share three core characteristics:

| Feature | What it means | Why it matters |
| --- | --- | --- |
| Systematic empiricism | Learning based on carefully planned, recorded, and analyzed observations | Scientists check ideas against observations, not stereotypes or informal impressions |
| Empirical questions | Questions about how the world actually is, answerable by observation | Science can answer "is X true?" but not "is X good/just/beautiful?" |
| Public knowledge | Publishing methods, results, and conclusions for others to evaluate | Enables collaboration and self-correction over time |

🔍 Systematic empiricism

🔍 What it means

Empiricism: learning based on observation.

  • Scientists learn about the natural world systematically: they carefully plan, make, record, and analyze observations.
  • Logical reasoning and creativity play roles, but scientists are unique in their insistence on checking ideas against systematic observations.

📊 Example: word count study

  • Mehl and colleagues did not trust stereotypes or their own informal observations about whether women talk more than men.
  • Instead, they systematically recorded, counted, and compared the number of words spoken by a large sample of women and men.
  • When systematic observations conflicted with stereotypes, they trusted the observations.
  • Don't confuse: informal observation (anecdotes, personal impressions) with systematic observation (planned, recorded, analyzed data).
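The systematic comparison above can be sketched in a few lines (all numbers below are invented for illustration; they are not Mehl et al.'s actual data): record counts, then compare averages instead of trusting impressions.

```python
# Sketch of systematic empiricism in the style of the word-count
# study: record daily word counts for each person sampled, then
# compare group averages (hypothetical numbers, not the real data).
from statistics import mean

daily_words_women = [15800, 17200, 14900, 16500, 15100]
daily_words_men   = [16100, 15400, 17000, 14800, 16300]

avg_women = mean(daily_words_women)  # 15900.0
avg_men = mean(daily_words_men)      # 15920.0

print(f"Women: {avg_women:.0f} words/day; Men: {avg_men:.0f} words/day")
print(f"Difference: {abs(avg_women - avg_men):.0f} words/day")
```

The point is the procedure, not the numbers: a planned, recorded, analyzed comparison can contradict a confidently held stereotype.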

❓ Empirical questions

❓ What counts as empirical

  • Empirical questions are about the way the world actually is and can be answered by systematically observing it.
  • Example: "Do women talk more than men?" is empirical—either they do or they don't, and observation can determine which.

🚫 What science cannot answer

Science is not in a position to answer questions about:

  • Values: whether things are good/bad, just/unjust, beautiful/ugly.
  • Ought statements: how the world should be.

| Question type | Example | Can science answer? |
| --- | --- | --- |
| Empirical (is) | Is a stereotype accurate? | Yes—observe and compare |
| Value (ought) | Is it wrong to hold inaccurate stereotypes? | No—this is a value judgment |
| Empirical (is) | Does criminal behavior have a genetic basis? | Yes—study genetics and behavior |
| Value (ought) | What actions should be illegal? | No—this is a normative question |

  • Important for psychology researchers: be mindful of this distinction.

📢 Public knowledge

📢 What publication means

  • After asking empirical questions, making systematic observations, and drawing conclusions, scientists publish their work.
  • Typically: write an article for a professional journal, putting the research question in context, describing methods in detail, and clearly presenting results and conclusions.
  • Increasingly: open access journals make articles freely available to all, so publicly-funded research creates truly public knowledge.

🤝 Why publication is essential (reason 1): collaboration

  • Science is a social process—a large-scale collaboration among many researchers distributed across time and space.
  • Current scientific knowledge is based on many different studies by many different researchers who have shared their work publicly over many years.

🔧 Why publication is essential (reason 2): self-correction

  • Individual scientists understand their methods can be flawed and conclusions incorrect.
  • Publication allows others in the scientific community to detect and correct these errors.
  • Over time, scientific knowledge increasingly reflects the way the world actually is.

🔁 Example: the Many Labs Replication Project

  • A large, coordinated effort by prominent psychological scientists worldwide to replicate findings from 13 classic and contemporary studies.
  • Original finding (Schnall et al., 2008): washing one's hands leads people to view moral transgressions (e.g., keeping money from a found wallet, using a kitten for sexual arousal) as less wrong.
  • If reliable, this might explain why many religious traditions associate physical cleanliness with moral purity.
  • Replication attempt (Johnson et al., 2013): using the same materials and nearly identical procedures with a much larger sample, the Many Labs researchers were unable to replicate the original finding.
  • Suggests the original finding may have stemmed from a relatively small sample size (which can lead to unreliable results).
  • Current status: we cannot definitively conclude the handwashing effect does not exist, but the effort demonstrates the collaborative and cautious nature of scientific progress.

🎭 Science versus pseudoscience

🎭 What pseudoscience is

Pseudoscience: activities and beliefs that are claimed to be scientific by their proponents—and may appear scientific at first glance—but are not.

  • A set of beliefs or activities is pseudoscientific if:
    • (a) its adherents claim or imply it is scientific, but
    • (b) it lacks one or more of the three features of science.

🔄 Example: biorhythms

  • The theory: people's physical, intellectual, and emotional abilities run in cycles from birth until death.
    • Physical cycle: 23 days
    • Intellectual cycle: 33 days
    • Emotional cycle: 28 days
  • Example application: schedule an exam when your intellectual cycle is at a high point.
  • Why it looks scientific: the theory has been around for over 100 years; numerous popular books and websites use impressive, scientific-sounding terms like "sinusoidal wave" and "bioelectricity."
  • The problem: scientific evidence indicates biorhythms do not exist (Hines, 1998).
  • Don't confuse: biorhythms (pseudoscience) with sleep cycles or circadian rhythms (which do have a scientific basis).
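The cycles claimed by biorhythm theory are plain sine waves; this sketch computes them only to make the claim concrete (as noted above, the scientific evidence indicates the cycles do not exist).

```python
# The three cycles claimed by biorhythm theory, computed as sine
# waves over days since birth. This illustrates the pseudoscientific
# claim itself; research (Hines, 1998) indicates it has no validity.
import math

# Periods (in days) claimed by the theory
CYCLE_PERIODS = {"physical": 23, "emotional": 28, "intellectual": 33}

def biorhythm_level(days_since_birth: int, cycle: str) -> float:
    """Claimed level of a cycle, from -1 (low) to +1 (high)."""
    period = CYCLE_PERIODS[cycle]
    return math.sin(2 * math.pi * days_since_birth / period)

# The "emotional" cycle would peak a quarter-period (7 days) after birth
print(biorhythm_level(7, "emotional"))   # → 1.0
```

Note that being computable is not the same as being real: the formula is precise, impressive-sounding, and empirically unsupported.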

🚩 How to identify pseudoscience

🚩 Lacks systematic empiricism

  • Either there is no relevant scientific research, or (as with biorhythms) there is relevant research but it is ignored.

🚩 Lacks public knowledge

  • Promoters might claim to have conducted scientific research but never publish it in a way that allows others to evaluate it.

🚩 Does not address empirical questions (unfalsifiable claims)

  • Philosopher Karl Popper emphasized this: any scientific claim must be expressed so that there are observations that would—if made—count as evidence against the claim.
  • Falsifiable: the claim that women talk more than men is falsifiable because systematic observations could reveal either that they do or that they don't.

🔮 Example: extrasensory perception (ESP)

  • Many believers in ESP and other psychic powers claim such powers can disappear when observed too closely.
  • This makes it so that no possible observation would count as evidence against ESP:
    • If a self-proclaimed psychic predicts the future at better-than-chance levels → consistent with psychic powers.
    • If she fails to predict the future at better-than-chance levels → also consistent, because her powers supposedly disappear when observed closely.
  • Unfalsifiable = not scientific.

⚠️ Why we should care about pseudoscience

⚠️ Reason 1: sharpens understanding of science

  • Learning about pseudoscience brings the fundamental features of science—and their importance—into sharper focus.

⚠️ Reason 2: pseudoscience is widespread and harmful

  • Biorhythms, psychic powers, astrology, and many other pseudoscientific beliefs are widely held and promoted on the Internet, TV, books, and magazines.
  • Far from harmless: promotion of these beliefs often results in great personal toll.
  • Example: believers in pseudoscience opt for "treatments" such as homeopathy for serious medical conditions instead of empirically-supported treatments.
  • Learning what makes them pseudoscientific helps us identify and evaluate such beliefs and practices.

⚠️ Reason 3: distinguishing psychology from "pseudo psychology"

  • Many pseudosciences purport to explain human behavior and mental processes: biorhythms, astrology, graphology (handwriting analysis), magnet therapy for pain control.
  • Important for psychology students to distinguish their field clearly from pseudo psychology.

📚 Examples of pseudoscience

From The Skeptic's Dictionary and similar sources:

| Pseudoscience | What it claims |
| --- | --- |
| Cryptozoology | Study of "hidden" creatures like Bigfoot, the Loch Ness monster, and the chupacabra |
| Pseudoscientific psychotherapies | Past-life regression, rebirthing therapy, bioscream therapy |
| Homeopathy | Treatment of medical conditions using natural substances diluted sometimes to the point of no longer being present |
| Pyramidology | Odd theories about Egyptian pyramids (e.g., built by extraterrestrials); the idea that pyramids have healing and special powers |

3. Goals of Science

🧭 Overview

🧠 One-sentence thesis

Science in psychology pursues three goals—description, prediction, and explanation—to achieve detailed and accurate knowledge that goes beyond natural curiosity and common sense.

📌 Key points (3–5)

  • Three goals: describe (careful observation), predict (use observed relationships), and explain (determine causes).
  • Why science is needed: natural curiosity alone is not enough; scientific research produces the detailed knowledge that fills psychology textbooks.
  • Basic vs applied research: basic research seeks understanding for its own sake; applied research addresses practical problems—but the distinction is not always clear-cut.
  • Common confusion: the same research can serve both basic and applied purposes; for example, basic findings on sex differences might inform therapy, and applied driving research might reveal basic perceptual processes.

🎯 The three goals of science

🔍 Goal 1: Describe

The first and most basic goal of science is to describe, achieved by making careful observations.

  • What it means: systematically observe and record phenomena without yet explaining why they occur.
  • How it works: collect data through records, surveys, or direct observation to document patterns.
  • Example: researchers surveyed medical marijuana patients and found that the primary symptom they treat is pain, followed by anxiety and depression.
  • This goal establishes what is happening before moving to prediction or explanation.

🔮 Goal 2: Predict

Once we have observed with some regularity that two behaviors or events are systematically related to one another, we can use that information to predict whether an event or behavior will occur in a certain situation.

  • What it means: use observed regularities to forecast future occurrences.
  • How accuracy works: predictions will not be 100% accurate, but if the relationship is strong, accuracy will be greater than chance.
  • Example: knowing that most medical marijuana patients use marijuana to treat pain allows predicting that an individual user likely experiences pain.
  • Don't confuse: prediction does not require understanding why the relationship exists—only that it does.
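The better-than-chance idea can be made concrete with a small simulation (the 64% base rate is invented for illustration, not taken from the study): always predicting the most common outcome beats chance, with no understanding of why the relationship exists.

```python
# Sketch of prediction from an observed regularity: if most patients
# in a sample report pain, predicting "pain" for every new patient
# beats chance. The base rate here is hypothetical.
import random

random.seed(1)
base_rate = 0.64  # hypothetical share of patients who treat pain

# Simulate 10,000 patients; predict "pain" for every one of them
has_pain = [random.random() < base_rate for _ in range(10_000)]
accuracy = sum(has_pain) / len(has_pain)

print(f"Always-predict-pain accuracy: {accuracy:.1%} (chance = 50.0%)")
```

Accuracy lands near the base rate rather than at 100%, which matches the point above: predictions from regularities are imperfect but better than chance.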

🧬 Goal 3: Explain

The third and ultimate goal of science is to explain, which involves determining the causes of behavior.

  • What it means: identify the underlying mechanisms and causal relationships.
  • How it goes deeper: ask why and how a phenomenon occurs, not just that it occurs.
  • Example: Does marijuana reduce inflammation, which in turn reduces pain? Or does it reduce the distress associated with pain rather than pain itself?
  • This goal taps into the causal processes behind observed patterns.

🔬 Basic versus applied research

🧪 Basic research

Basic research in psychology is conducted primarily for the sake of achieving a more detailed and accurate understanding of human behavior, without necessarily trying to address any particular practical problem.

  • Purpose: knowledge for its own sake.
  • Focus: understanding mechanisms, relationships, and principles.
  • Example: research by Mehl and colleagues (mentioned in the excerpt) falls into this category.

🛠️ Applied research

Applied research is conducted primarily to address some practical problem.

  • Purpose: solve real-world issues.
  • Focus: practical outcomes and interventions.
  • Example: research on the effects of cell phone use on driving was prompted by safety concerns and led to laws limiting the practice.

🔄 The blurred line

| Aspect | What the excerpt says |
| --- | --- |
| Distinction | Convenient but not always clear-cut |
| Basic → Applied | Basic research on sex differences in talkativeness could eventually affect marriage therapy |
| Applied → Basic | Applied research on cell phone use and driving could produce new insights into perception, attention, and action |

  • Don't confuse: a single study can serve both purposes; the same findings may advance theory and inform practice.

🌍 Why scientific research is necessary

🌱 Science grew from natural curiosity

  • People have always been curious about the natural world, including themselves and their behavior.
  • This curiosity is probably why students study psychology in the first place.
  • Science became the best way to achieve detailed and accurate knowledge.

📚 Scientific research fills textbooks

  • Most phenomena and theories in psychology textbooks are products of scientific research.
  • Examples from a typical introductory textbook:
    • Specific cortical areas for language and perception
    • Principles of classical and operant conditioning
    • Biases in reasoning and judgment
    • People's surprising tendency to obey authority
  • What we know now only scratches the surface of what we can know, so research continues.

4. Science and Common Sense

🧭 Overview

🧠 One-sentence thesis

Scientific psychology is necessary because common sense and intuition often lead to incorrect beliefs about human behavior, which is why scientists cultivate skepticism and tolerance for uncertainty.

📌 Key points (3–5)

  • Why common sense fails: Forming accurate beliefs requires observation, memory, and analysis beyond our natural capacity, so we rely on mental shortcuts that can mislead us.
  • Folk psychology is often wrong: Many widely held intuitive beliefs about behavior (e.g., "venting anger relieves it" or "people only use 10% of their brain") have been disproven by research.
  • Confirmation bias compounds errors: We notice and remember cases that confirm our beliefs while ignoring or forgetting cases that contradict them.
  • Common confusion: Skepticism ≠ cynicism—it means pausing to consider alternatives and search for evidence, not distrusting everything.
  • Scientists embrace uncertainty: Accepting what we don't know opens the door to new research questions.

🤔 Why common sense is unreliable

🧠 The limits of folk psychology

Folk psychology: intuitive beliefs about people's behavior, thoughts, and feelings that we all hold collectively.

  • Much of folk psychology is probably reasonably accurate, but much is not.
  • We lack the natural capacity to form detailed and accurate beliefs about complex human behavior.
  • Example: We cannot mentally count words spoken by women and men we encounter, calculate daily averages, and compare them accurately—yet we form confident beliefs about who talks more.

🔍 Research contradicts intuition

The excerpt provides specific examples where scientific research has overturned common sense:

| Common belief | What research shows |
| --- | --- |
| Anger can be relieved by "letting it out" (punching, screaming) | This approach leaves people feeling more angry, not less |
| No one confesses to crimes they didn't commit (unless tortured) | False confessions are surprisingly common and occur for various reasons |
| People use only 10% of their brain power | This is a myth (from the "50 Great Myths" list) |
| Most people experience a midlife crisis in their 40s or 50s | This is a myth |
| Students learn best when teaching styles match learning styles | This is a myth |
| Low self-esteem is a major cause of psychological problems | This is a myth |
| Psychiatric admissions and crimes increase during full moons | This is a myth |

💡 Why we want to believe

  • We sometimes hold incorrect beliefs because it would be nice if they were true.
  • Example: Many believe calorie-reducing diets are effective long-term treatments for obesity, but thorough scientific review shows they are not. People may continue believing in dieting because:
    • It gives them hope for losing weight if they are obese.
    • It makes them feel good about their own "self-control" if they are not obese.

🧩 How we form wrong beliefs

🧩 Mental shortcuts (heuristics)

  • We rely on mental shortcuts when forming and maintaining beliefs because detailed observation and analysis exceed our natural powers.
  • If a belief is widely shared—especially if endorsed by "experts"—and makes intuitive sense, we tend to assume it is true.
  • This is not inherently bad, but it can lead us astray when the shortcuts produce incorrect conclusions.

🔍 Confirmation bias

Confirmation bias: the tendency to focus on cases that confirm our intuitive beliefs and ignore or forget cases that disconfirm them.

  • Once we believe something, we selectively notice evidence that supports it.
  • Example: Once we believe women are more talkative than men, we tend to:
    • Notice and remember talkative women and silent men.
    • Ignore or forget silent women and talkative men.
  • This reinforces the original (possibly incorrect) belief even when contradictory evidence is all around us.
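A toy simulation (all numbers invented) makes the mechanism vivid: even when the two groups are identical, an observer who always remembers belief-consistent cases and forgets about half of the inconsistent ones ends up with a lopsided tally.

```python
# Toy simulation of confirmation bias: women and men are equally
# talkative here, so there is no real difference to find.
import random

random.seed(42)

# 1,000 observations of (sex, is_talkative); talkativeness is 50/50
observations = [(random.choice(["woman", "man"]), random.random() < 0.5)
                for _ in range(1000)]

def tally(obs):
    """Count talkative women and talkative men in a set of observations."""
    women = sum(1 for sex, talk in obs if sex == "woman" and talk)
    men = sum(1 for sex, talk in obs if sex == "man" and talk)
    return women, men

# Biased memory: always keep belief-consistent cases (talkative women,
# quiet men); keep belief-inconsistent cases only about half the time.
remembered = [(sex, talk) for sex, talk in observations
              if (sex == "woman" and talk) or (sex == "man" and not talk)
              or random.random() < 0.5]

print("Full record (talkative women vs men):", tally(observations))
print("Biased memory (talkative women vs men):", tally(remembered))
```

The full record shows roughly equal counts; the "remembered" record makes women look far more talkative, reinforcing the belief that produced the bias.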

🧠 Limited cognitive capacity

  • The excerpt emphasizes that forming accurate beliefs "requires powers of observation, memory, and analysis to an extent that we do not naturally possess."
  • We cannot process all relevant information systematically in our heads, so we fall back on shortcuts that feel right but may be wrong.

🔬 The scientific attitude

🤨 Skepticism

Skepticism: an attitude of pausing to consider alternatives and search for evidence—especially systematically collected empirical evidence—when there is enough at stake to justify doing so.

  • What skepticism is NOT:
    • Being cynical or distrustful.
    • Questioning every belief or claim one comes across (which would be impossible).
  • What skepticism IS:
    • Pausing to consider alternatives.
    • Searching for evidence when the stakes are high enough.
  • Example: You read that giving children a weekly allowance helps them develop financial responsibility. A skeptical attitude means:
    • Pausing to ask whether receiving an allowance might instead teach children to spend money or be more materialistic.
    • Asking what evidence supports the claim: Is the author a researcher? Is scientific evidence cited?
    • If important enough, turning to the research literature to see if anyone has studied it.

🤷 Tolerance for uncertainty

Tolerance for uncertainty: accepting that there are many things scientists simply do not know.

  • Scientists accept that evidence is often insufficient to fully evaluate a belief or claim.
  • Example: There is no scientific evidence that receiving an allowance causes children to be more financially responsible, nor is there evidence that it causes them to be materialistic.
  • From a practical perspective: This uncertainty can be problematic (e.g., making it hard to decide what to do when children ask for an allowance).
  • From a scientific perspective: This uncertainty is exciting—it means there are interesting, empirically testable questions that science (and perhaps you as a researcher) can answer.

🧪 Scientists are susceptible too

  • The excerpt emphasizes that scientists—especially psychologists—understand they are "just as susceptible as anyone else to intuitive but incorrect beliefs."
  • This self-awareness is why they cultivate skepticism and tolerance for uncertainty as deliberate practices, not just natural tendencies.

5. Experimental and Clinical Psychologists

🧭 Overview

🧠 One-sentence thesis

Scientific research in psychology is conducted primarily by experimental psychologists, while clinical psychologists apply scientific findings to practice, and both fields rely on empirical evidence to understand and treat psychological problems effectively.

📌 Key points (3–5)

  • Who does research: Doctoral-level experimental psychologists (usually Ph.D.s) conduct most scientific research, often in academic settings, while clinical psychologists focus on diagnosis and treatment.
  • Clinical practice is scientific: Psychological disorders are part of the natural world, so questions about their causes and treatments must be tested empirically, not answered by intuition alone.
  • Empirically supported treatments: Treatments proven effective through systematic observation (e.g., CBT, exposure therapy) should guide clinical practice.
  • Common confusion: Popular claims about psychological profiles (e.g., adult children of alcoholics having distinct traits) often lack scientific support despite sounding plausible.
  • Why scientific literacy matters for clinicians: Even clinicians who don't conduct research need to read and evaluate evidence to make treatment decisions based on the best available science.

👥 Who Conducts Psychological Research

🎓 Experimental psychologists

  • Credentials: Typically hold doctoral degrees (Ph.D.) or master's degrees in psychology and related fields.
  • Work settings:
    • Government agencies (researching public policy impacts)
    • National associations (e.g., American Psychological Association)
    • Non-profit organizations
    • Private sector (product marketing, organizational behavior)
    • Most common: College and university faculty who collaborate with graduate and undergraduate students

🔬 Research vs. clinical training

  • The majority of researchers are not trained or licensed as clinicians.
  • Instead, they have expertise in subfields like:
    • Behavioral neuroscience
    • Cognitive psychology
    • Developmental psychology
    • Personality psychology
    • Social psychology
  • Some clinical psychology researchers do hold clinical licenses, but this is not the norm for experimental psychologists.

💡 Motivations for research

  • Intellectual and technical challenges
  • Satisfaction of contributing to scientific knowledge about human behavior
  • Students can get involved as research assistants or participants to explore whether they enjoy the research process

🏥 Clinical Psychology as Applied Science

🩺 What clinical practice encompasses

Clinical practice of psychology: the diagnosis and treatment of psychological disorders and related problems.

The excerpt uses this term broadly to include:

  • Clinical and counseling psychologists
  • School psychologists
  • Marriage and family therapists
  • Licensed clinical social workers
  • Others working individually or in small groups to address psychological problems

🔍 Why clinical questions are scientific

  • Psychological disorders are part of the natural world, making questions about them empirically testable.
  • We cannot rely on intuition or common sense for accurate answers.
  • Example from the excerpt: Popular belief claims adult children of alcoholics have a distinct personality profile (low self-esteem, powerlessness, intimacy difficulties), but scientific research shows they are no more likely to have these problems than anyone else.

Don't confuse: Something sounding plausible ≠ scientifically supported. Many widely believed claims lack empirical evidence.

🧪 Testing treatment effectiveness

  • Questions about psychotherapy effectiveness are empirically testable.
  • Method: Systematic observation comparing people who receive a treatment versus those who don't (or receive an alternative).
  • If depressed people receiving a new psychotherapy improve more than a similar group without it, the treatment shows effectiveness.
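
The group comparison described above can be sketched in miniature. This is an illustrative sketch only: the improvement scores and group sizes below are invented, not data from the excerpt. It shows the logic of comparing mean improvement in a treatment group against a control group and summarizing the difference with a simple effect size.

```python
import math
import statistics

# Hypothetical depression-improvement scores (higher = more improvement).
treatment = [8, 6, 7, 9, 5, 7, 8, 6]   # received the new psychotherapy
control = [4, 5, 3, 6, 4, 5, 2, 4]     # received no treatment

mean_t = statistics.mean(treatment)
mean_c = statistics.mean(control)

# Pooled standard deviation, then Cohen's d as a simple effect size.
n_t, n_c = len(treatment), len(control)
var_t = statistics.variance(treatment)  # sample variance (n - 1 denominator)
var_c = statistics.variance(control)
sd_pooled = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
d = (mean_t - mean_c) / sd_pooled

print(f"treatment mean = {mean_t}, control mean = {mean_c}, d = {d:.2f}")
```

A real effectiveness study would of course add random assignment, a larger sample, and a formal significance test; the point here is only the systematic comparison of groups.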

💊 Empirically Supported Treatments

📋 Definition and importance

Empirically supported treatments: treatments that have been studied scientifically and shown to result in greater improvement than no treatment, a placebo, or some alternative treatment.

  • These include many forms of psychotherapy.
  • Can be as effective as standard drug therapies.

🛠️ Examples of supported treatments

The excerpt lists several with strong empirical support:

| Treatment | Effective for |
| --- | --- |
| Acceptance and Commitment Therapy (ACT) | Depression, mixed anxiety disorders, psychosis, chronic pain, OCD |
| Behavioral couples therapy | Alcohol use disorders |
| Cognitive Behavioral Therapy (CBT) | Many disorders including eating disorders, depression, anxiety disorders |
| Exposure therapy | Post-traumatic stress disorder, phobias |
| Exposure therapy with response prevention | Obsessive-compulsive disorder |
| Family-based treatment | Eating disorders |

⚖️ Debate in clinical psychology

  • One side argues: The field hasn't paid enough attention to scientific research (e.g., failing to use empirically supported treatments); changes needed in training and practice evaluation.
  • Other side argues: These claims are exaggerated; suggested changes unnecessary.
  • Agreement on both sides: A scientific approach is essential for diagnosing and treating psychological problems based on detailed, accurate knowledge and evidence of effective treatments.

📚 Why clinicians need scientific literacy

Even clinicians who never conduct their own studies must be scientifically literate to:

  • Read and evaluate new research
  • Make treatment decisions based on the best available evidence
  • Avoid relying on intuition or unproven popular claims

Key principle: Clinical practice should be guided by systematic observation and empirical evidence, not just tradition or what sounds reasonable.

6. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Psychology relies on the scientific method rather than intuition or common sense because science provides systematic, empirical, and publicly verifiable knowledge that can accurately describe, predict, and explain human behavior.

📌 Key points (3–5)

  • Five ways of knowing: Knowledge can be acquired through intuition, authority, rationalism, empiricism, and the scientific method—psychology uses the scientific method.
  • Three features of science: Systematic empiricism, empirical questions, and public knowledge distinguish science from pseudoscience.
  • Three research goals: Psychologists conduct research to describe phenomena, predict future behaviors, and explain causes.
  • Common confusion: Folk psychology (intuitions about behavior) often turns out to be wrong, which is why psychology relies on science rather than common sense.
  • Clinical practice connection: Scientific research provides the detailed knowledge needed to diagnose psychological problems and establish which treatments actually work.

🔬 What Makes Psychology a Science

🔬 The three fundamental features

Psychology qualifies as a science because it possesses three core characteristics:

Systematic empiricism: A structured approach to observing and measuring the natural world.

Empirical questions: Questions that can be answered through observation and measurement.

Public knowledge: Findings that are shared openly and can be verified by others.

  • These features distinguish genuine science from pseudoscience.
  • Pseudoscience refers to beliefs and activities claimed to be scientific but lacking one or more of these three features.
  • Example: A claim about human behavior that cannot be tested through observation would lack empirical questions and thus be pseudoscientific.

🧠 Five pathways to knowledge

The excerpt identifies multiple ways people acquire knowledge:

| Method | Description |
| --- | --- |
| Intuition | Direct, immediate knowing without reasoning |
| Authority | Learning from experts or trusted sources |
| Rationalism | Using logical reasoning |
| Empiricism | Learning through observation and experience |
| Scientific method | Systematic combination of empiricism with controlled testing |

  • Psychology specifically adopts the scientific method as its primary approach.
  • Don't confuse: All five methods can produce knowledge, but only the scientific method provides the systematic verification that science requires.

🎯 Three Goals of Psychological Research

📊 Description, prediction, and explanation

Psychologists conduct research with three distinct purposes:

  1. To describe basic phenomena: Document what happens and how behaviors manifest.
  2. To make predictions: Forecast future behaviors based on patterns.
  3. To explain causes: Identify why behaviors occur.
  • Each goal serves a different function in building scientific understanding.
  • Example: A researcher might first describe how students take notes (description), predict that laptop users will remember less (prediction), then explain that verbatim transcription prevents deeper processing (explanation).

🔬 Basic vs. applied research

The excerpt distinguishes two research orientations:

Basic research: Conducted to learn about human behavior for its own sake.

Applied research: Conducted to solve some practical problem.

  • Both types are valuable to the field.
  • The distinction between the two is not always clear-cut—research can serve both purposes.
  • Example: Studying memory processes might be basic research, while testing a specific study technique for students would be applied research.

🧐 Critical Thinking in Psychology

🤔 Skepticism as a core attitude

Researchers cultivate specific thinking habits:

Skepticism: Searching for evidence and considering alternatives before accepting a claim about human behavior as true.

  • This means not accepting claims at face value, even when they seem intuitive.
  • Psychologists demand evidence before concluding something is true.
  • Example: Before accepting that "crisis counseling immediately after trauma helps long-term coping," a skeptical researcher would look for controlled studies testing this claim.

⚖️ Tolerance for uncertainty

Tolerance for uncertainty: Withholding judgment about whether a claim is true or not when there is insufficient evidence to decide.

  • Researchers are comfortable saying "we don't know yet" when evidence is lacking.
  • This prevents premature conclusions based on incomplete information.
  • Don't confuse: Tolerance for uncertainty is not the same as believing nothing can be known—it means waiting for adequate evidence.

🚫 Why folk psychology fails

The excerpt emphasizes a critical point:

  • People's intuitions about human behavior (folk psychology) often turn out to be wrong.
  • This is one primary reason psychology relies on science rather than common sense.
  • Example: The belief "you cannot truly love another person unless you love yourself" may feel intuitively true but requires empirical testing to verify.

👥 Who Does Psychological Research

🎓 Professional researchers

  • Scientific research in psychology is conducted mainly by people with doctoral degrees in psychology and related fields.
  • Most of these researchers are college and university faculty members.
  • They conduct research for both professional and personal reasons, and to contribute to scientific knowledge about human behavior.

🔬 Experimental vs. clinical psychologists

The excerpt clarifies an important distinction:

  • Most psychologists are experimental psychologists who conduct research.
  • Clinical practice (diagnosis and treatment of psychological problems) is one important application of the scientific discipline.
  • Don't confuse: Clinical practice is not separate from science—it should be informed by scientific research.

🏥 Science and Clinical Practice

💊 Why research matters for treatment

Scientific research is relevant to clinical practice for two key reasons:

  1. Detailed knowledge: It provides detailed and accurate knowledge about psychological problems.
  2. Treatment effectiveness: It establishes whether treatments are effective.
  • The excerpt mentions empirically supported treatments—interventions proven effective through research.
  • Example treatments listed: cognitive-behavioral therapy for depression and anxiety, exposure therapy for OCD, family-based treatment for eating disorders.

⚠️ The debate about scientific approaches

The excerpt acknowledges ongoing discussion in the field:

  • Some in the clinical psychology community argue the field has not paid enough attention to scientific research.
  • They suggest changes in how clinicians are trained and how treatments are evaluated.
  • Others believe these claims are exaggerated and changes are unnecessary.
  • Agreement on both sides: A scientific approach to clinical psychology is essential for diagnosing and treating problems based on detailed, accurate knowledge.

📚 Scientific literacy for clinicians

Even clinicians who never conduct research themselves need to be scientifically literate:

  • They must be able to read and evaluate new research.
  • They should make treatment decisions based on the best available evidence.
  • This ensures practice is grounded in what actually works, not just intuition or tradition.

🎨 The "art form" argument

The excerpt addresses a common claim:

  • Some clinicians argue their work is an "art form" based on intuition and personal experience.
  • They claim it cannot be evaluated scientifically.
  • The exercises ask readers to consider satisfaction with such a clinician from three perspectives: as a potential client, as a judge deciding on expert testimony, and as an insurance representative deciding on reimbursement.
  • This highlights that stakeholders have legitimate interests in whether clinical practice is scientifically grounded.

🔍 Distinguishing Empirical from Non-Empirical Questions

🔍 What makes a question empirical

Empirical questions: Questions that can be answered through observation and measurement.

  • These are the only questions science can address.
  • Non-empirical questions cannot be answered through observation alone.
  • Example: "Does studying in the same location improve effectiveness?" is empirical—it can be tested. "Is it morally right to study on Sundays?" is non-empirical—it involves values, not observable facts.

🧪 Falsifiability matters

The exercises raise the concept of falsifiability:

  • A claim is falsifiable if evidence could potentially prove it wrong.
  • Example claim: "People's choice of spouse is influenced by their parents—some choose similar spouses, others choose different ones."
  • This claim is problematic because it covers all possibilities (similar OR different), making it unfalsifiable—no observation could disprove it.
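
The unfalsifiable spouse claim can be shown in miniature: a claim that "predicts" every possible observation can never be contradicted by data. The functions and similarity scores below are hypothetical, purely for illustration.

```python
# A falsifiable claim rules out some observations; an unfalsifiable
# claim is "confirmed" no matter what we observe.

def falsifiable_claim(spouse_similarity: float) -> bool:
    """'People choose spouses similar to their parents': predicts high similarity."""
    return spouse_similarity > 0.5

def unfalsifiable_claim(spouse_similarity: float) -> bool:
    """'Some choose similar spouses, others different ones': fits everything."""
    return True  # similar OR different both count as "support"

observations = [0.1, 0.4, 0.6, 0.9]  # invented parent-spouse similarity scores

# The falsifiable claim can fail on some observation; the unfalsifiable
# claim is consistent with every observation, so no data can disprove it.
can_fail = any(not falsifiable_claim(x) for x in observations)
never_fails = all(unfalsifiable_claim(x) for x in observations)
```

Because `unfalsifiable_claim` returns `True` for any input, no conceivable observation could count against it, which is exactly why the original claim is not scientifically testable.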
7. A Model of Scientific Research in Psychology

🧭 Overview

🧠 One-sentence thesis

Scientific research in psychology operates as a self-sustaining cycle where research questions lead to empirical studies, which produce published findings that generate new questions and inspire further research.

📌 Key points (3–5)

  • The research cycle: formulate question → conduct study → analyze data → draw conclusions → publish → new questions emerge.
  • Where questions originate: from the research literature itself, informal observations, or practical problems needing solutions.
  • Always check existing literature: even when questions come from outside the cycle, researchers must review what has already been published before proceeding.
  • Common confusion: research is not a one-time linear process but a continuous cycle where each study contributes to and draws from the broader research literature.
  • Real-world application: the model applies to diverse topics, from gender stereotypes (talkativeness study) to safety issues (cell phone use while driving).

🔄 The research cycle structure

🔄 Five core stages

The model describes research as moving through five connected stages:

  1. Formulate a research question – identify what you want to investigate
  2. Conduct an empirical study – design and carry out a study to answer the question
  3. Analyze the resulting data – examine what the data shows
  4. Draw conclusions – determine what the answer to the question is
  5. Publish the results – share findings so they become part of the research literature

♻️ Why it's a cycle, not a line

  • The research literature (all published research in a field) is one of the primary sources of new research questions.
  • New research leads to new questions, which lead to new research, and so on.
  • Publication is not the end—each study suggests many new questions that other researchers (or the same team) will tackle next.
  • Example: The talkativeness study found little difference between women and men, but this result raised new questions about reliability and potential cultural differences.

🌱 Sources of research questions

🌱 Three main origins

Research questions can start from three places:

| Source | Description | Example from excerpt |
| --- | --- | --- |
| Research literature | Questions arising from published studies | Checking if talkativeness question had been adequately addressed |
| Informal observations | Direct observations or secondhand information from non-scientific sources | People's stereotypes about women being more talkative |
| Practical problems | Real-world issues that need solutions | Whether cell phone use impairs driving ability |

🔍 The literature check requirement

  • Even when questions originate from informal observations or practical problems, researchers must start by checking the research literature.
  • Purpose: see if the question has already been answered and refine it based on what previous research found.
  • Don't confuse: starting with an observation doesn't mean skipping the literature review—every path requires consulting existing research.

📱 Example: Cell phone and driving research

📱 How the question emerged

  • During the 1990s, as cell phones became widespread, people began wondering whether cell phone use had a negative effect on driving.
  • Many psychologists decided to tackle this question scientifically.
  • Previous research showed that engaging in a simple verbal task impairs performance on a perceptual or motor task carried out at the same time.
  • However, no one had studied the effect specifically of cell phone use on driving.

📱 What researchers did

  • Under carefully controlled conditions, researchers compared people's driving performance while using a cell phone with their performance while not using a cell phone.
  • Testing occurred both in the lab and on the road.
  • They found that cell phone use impaired drivers' ability to detect road hazards, slowed their reaction times, and reduced their control of the vehicle.

📱 How the cycle continued

  • Each new study was published and became part of the growing research literature on this topic.
  • Other research teams subsequently demonstrated that cell phone conversations carry a greater risk than conversations with a passenger who is aware of driving conditions.
  • Because the passenger is aware of the driving conditions, those conditions often become a topic of conversation, which cannot happen on a phone call.
  • Example: One study showed passengers adjust their conversation based on driving conditions, while phone callers cannot see the road.

👥 Example: Talkativeness and gender research

👥 Question origin and literature check

  • The research question—whether women are more talkative than men—was suggested by two sources:
    • People's stereotypes
    • Claims published in the research literature about the relative talkativeness of women and men
  • When researchers checked the research literature, they found this question had not been adequately addressed in scientific studies.

👥 The study and its contribution

  • Researchers conducted a careful empirical study.
  • They analyzed the results and found very little difference between women and men.
  • They formed their conclusions and published their work so it became part of the research literature.

👥 New questions generated

  • The publication is not the end of the story.
  • Their work suggests many new questions:
    • About the reliability of the result
    • About potential cultural differences
    • Other questions likely to be taken up by them and by other researchers inspired by their work

📚 The research literature concept

📚 What it is

Research literature: all the published research in a particular field.

  • It is a primary source of new research questions.
  • Each published study becomes part of this literature and can inspire future work.
  • The literature grows continuously as new studies are published.

📚 Its role in the cycle

  • Researchers consult the literature to see what has already been found.
  • Published findings suggest gaps, contradictions, or extensions that need investigation.
  • The literature provides the foundation for refining new questions.
  • Don't confuse: the research literature is not just background reading—it actively shapes what questions get asked and how they are framed.
8. Finding a Research Topic

🧭 Overview

🧠 One-sentence thesis

Good research questions emerge not from mysterious inspiration but from ordinary thinking strategies applied to informal observations, practical problems, and especially previous research, which must be systematically reviewed early in the research process.

📌 Key points (3–5)

  • Research questions are not magical: Coming up with questions is a creative but ordinary process using simple strategies and persistence, not mysterious genius.
  • Three main sources of inspiration: informal observations (direct or secondhand), practical problems (applied domains), and previous research (the most common source).
  • The research literature matters early: Reviewing published research helps discover questions, avoid duplication, evaluate interestingness, plan methods, and position your study.
  • Common confusion—what counts as literature: The research literature includes professional journal articles and scholarly books, but excludes self-help books, Wikipedia, websites, and pop psychology sources not reviewed by researchers.
  • Previous research builds collaboration: Science is large-scale collaboration where researchers read each other's work and conduct new studies to build on it.

💡 Where research ideas come from

💡 Informal observations

Informal observations: direct observations of our own and others' behavior as well as secondhand observations from non-scientific sources such as newspapers, books, blogs, and so on.

  • These are everyday experiences or things you notice in daily life or media.
  • Example: noticing you always seem to be in the slowest grocery store line, or reading a newspaper story about people donating to a family whose house burned down.
  • Can spark questions like "Do most people think the same thing?" or "Who makes such donations and why?"
  • Famous example: Stanley Milgram's obedience research was inspired by journalistic reports of Nazi war criminal trials where defendants claimed they were "only obeying orders."

🛠️ Practical problems

  • Real-world issues in applied domains like law, health, education, and sports.
  • These lead directly to applied research questions.
  • Examples from the excerpt:
    • Does taking lecture notes by hand improve exam performance?
    • How effective is psychotherapy versus drug therapy for depression?
    • To what extent do cell phones impair driving ability?
    • How can we teach children to read more efficiently?
    • What is the best mental preparation for running a marathon?

📚 Previous research (most common)

  • Science is a large-scale collaboration: many researchers read and evaluate each other's work and conduct new studies to build on it.
  • Experienced researchers are familiar with previous work in their area and have long lists of ideas.
  • For novice researchers:
    • Consult with experienced researchers (e.g., students consult faculty).
    • Pick up any professional journal and read titles and abstracts.
    • Example: one issue of Psychological Science contained articles on perception of shapes, anti-Semitism, police lineups, the meaning of death, second-language learning, people who seek negative emotional experiences, and more.
    • If you have a specific topic (e.g., memory) or domain (e.g., health care), look through specific journals like Memory & Cognition or Health Psychology.

📖 The research literature and why to review it

📖 What the research literature is

The research literature in any field: all the published research in that field.

  • Reviewing the literature means finding, reading, and summarizing published research relevant to your topic.
  • The psychology research literature is enormous: millions of scholarly articles and books dating to the beginning of the field, and it continues to grow.
  • Boundaries are somewhat fuzzy, but the literature consists almost entirely of:
    • Articles in professional journals
    • Scholarly books in psychology and related fields

❌ What is NOT part of the research literature

| Source type | Why excluded |
| --- | --- |
| Self-help and pop psychology books | Intended for the general public; not reviewed by researchers |
| Dictionary and encyclopedia entries | Not research-based |
| Websites | Not reviewed; often based on common sense or personal experience |
| Wikipedia | Authors are anonymous and may lack formal training or expertise; content continually changes; unsuitable as a basis for scientific research |

  • Don't confuse: Wikipedia contains valuable information, but it is unreliable for research purposes because it lacks expert review and stable content.

🔍 Why review the literature early

Reviewing the research literature early in the research process helps in several ways:

  1. Avoid duplication: It can tell you if a research question has already been answered.
  2. Evaluate interestingness: It can help you evaluate whether a research question is interesting.
  3. Plan your method: It can give you ideas for how to conduct your own study.
  4. Position your work: It can tell you how your study fits into the research literature.
  5. Discover new questions: It helps you discover new research questions (in addition to the other benefits).

📰 Professional journals

Professional journals: periodicals that publish original research articles.

  • There are thousands of professional journals that publish research in psychology and related fields.
  • Usually published monthly or quarterly in individual issues, each containing several articles.
  • Issues are organized into volumes, which usually consist of all the issues for a calendar year.
  • These are the primary component of the research literature.

🧠 The creative process demystified

🧠 Research creativity is ordinary

  • Novice researchers often find coming up with good research questions difficult and stressful.
  • One reason: it appears to be a creative process that seems mysterious—even magical—with experienced researchers seeming to pull interesting questions "out of thin air."
  • Reality: Psychological research on creativity has shown it is neither mysterious nor magical.
  • It is largely the product of ordinary thinking strategies and persistence.
  • The excerpt emphasizes that simple strategies can be used to:
    • Find general research ideas
    • Turn those ideas into empirically testable research questions
    • Evaluate those questions for interestingness and feasibility

🎯 From idea to question

  • Research questions often begin as more general research ideas—usually focusing on some behavior or psychological characteristic (e.g., talkativeness, learning, depression, bungee jumping).
  • The process involves:
    1. Starting with a general idea
    2. Turning it into an empirically testable research question
    3. Evaluating the question for interestingness and feasibility
  • This is a systematic process, not a mysterious leap.
9. Generating Good Research Questions

🧭 Overview

🧠 One-sentence thesis

A good research question must be empirically testable, interesting to the scientific community (because its answer is in doubt, fills a gap, or has practical implications), and feasible to answer given available resources.

📌 Key points (3–5)

  • How to generate testable questions: Start with a behavior or characteristic, conceptualize it as a variable, then ask about its frequency/intensity or its relationships with other variables.
  • What makes a question interesting: The answer must be in doubt (reasonable chance of multiple answers), fill a gap in existing research, and/or have important practical implications.
  • Common confusion: A question that has never been studied is not automatically interesting—it must also have uncertain answers and relevance to the scientific community.
  • Feasibility matters: Time, money, equipment, technical skill, and participant access all constrain which questions can be successfully pursued.
  • Refining existing questions: If a question has already been studied, refine it by changing how variables are measured, examining different populations, or exploring different situations.

🔬 Turning ideas into testable questions

🔬 Starting with frequency or intensity

  • If you have a behavior or psychological characteristic in mind, first conceptualize it as a variable.
  • Ask: How frequent or intense is it?
  • Examples from the excerpt: "How many words on average do people speak per day? How accurate are our memories of traumatic events? What percentage of people have sought professional help for depression?"
  • If the question has never been studied scientifically (discovered through literature review), it might be worth pursuing.

🔗 Turning characteristics into relationships

When a frequency/intensity question has already been answered, turn it into a question about relationships by asking:

| General question | What it generates |
| --- | --- |
| What are possible causes? | Independent variables that might influence the behavior |
| What are possible effects? | Dependent variables the behavior might influence |
| What types of people show more/less? | Individual difference variables (e.g., family size and talkativeness) |
| What situations elicit more/less? | Situational variables (e.g., same-sex vs. mixed-sex groups and talkativeness) |

  • Each answer becomes a second variable, suggesting a relationship to test.
  • Example from the excerpt: If interested in talkativeness, you might ask whether family size causes differences in talkativeness, or whether same-sex groups elicit more talkativeness than mixed-sex groups.

🔄 Refining already-studied questions

Don't give up if your question has already been studied—the fact that it was published suggests it interests the scientific community.

Refine the question by asking:

  • Are there other ways to define and measure the variables? (e.g., measuring talkativeness as number of different people spoken to, not just word count)
  • Are there types of people for whom the relationship might be stronger or weaker? (e.g., elderly people or people from other cultures, not just university students)
  • Are there situations where the relationship might differ—including practically important situations?

Example from the excerpt: Research showed women and men speak the same number of words per day among U.S. and Mexican university students, but you can still ask whether this holds for other measurements, age groups, or cultures.

⭐ What makes a question interesting

⭐ The answer must be in doubt

Interestingness: the quality that makes a research question engaging to the scientific community, not just personally.

  • A question is interesting when there is reasonable chance the answer will be something we didn't already know.
  • How to assess: Try to think of reasons to expect different answers, especially ones that conflict with common sense.
  • If you can think of reasons for at least two different answers → the question might be interesting.
  • If you can think of reasons for only one answer → probably not interesting.

Example from the excerpt: "Are women more talkative than men?" is interesting because the stereotype suggests yes, but similar verbal abilities suggest no.

Don't confuse: "Never studied before" ≠ automatically interesting. The question "Do people feel pain when you punch them in the jaw?" has never been formally studied but is not interesting because there is no reason to expect any answer other than yes.

📚 Filling a gap in the literature

  • The question should be a natural one for people familiar with existing research.
  • It should not just be unstudied, but should logically follow from what is already known.
  • Example from the excerpt: Whether taking lecture notes by hand improves exam performance would naturally occur to anyone familiar with research on note taking and shallow processing.

🌍 Practical implications

  • Does answering the question have important real-world consequences?
  • Examples from the excerpt:
    • Note-taking by hand → implications for education and classroom technology policies
    • Cell phone use while driving → personal safety and legal debates about restrictions

🛠️ Feasibility considerations

🛠️ Resource constraints

Feasibility: whether a research question can be successfully answered given available resources.

Factors that affect feasibility:

  • Time
  • Money
  • Equipment and materials
  • Technical knowledge and skill
  • Access to research participants

Researchers must evaluate these factors to avoid wasting effort on research they cannot complete.

🎯 Simplicity can be powerful

  • Don't confuse complexity with quality: research does not have to be complicated to produce interesting and important results.
  • Professional journals contain both complex studies (longitudinal designs, neuroimaging, multi-variable analyses supported by grants and teams) and simple studies (convenience samples of students with paper-and-pencil tasks).
  • Both types can yield valuable findings.

♻️ Use proven methods

  • General good practice: Use methods that have already been used successfully by other researchers unless you have good reasons not to.
  • Benefits:
    • The approach is "tried and true" (feasibility)
    • Provides continuity with previous research
    • Makes it easier to compare your results with others' results
    • Helps understand how others' research informs yours and vice versa
  • Example from the excerpt: If you want to manipulate mood to make people happy, use one of the many successful approaches from prior research (e.g., paying a compliment).

📖 Using the literature strategically

📖 Mining discussion sections

  • Look closely at the discussion section of recent research articles on your topic.
  • This is the last major section where researchers:
    • Summarize results
    • Interpret them in context of past research
    • Suggest directions for future research
  • These suggestions often take the form of specific research questions.
  • Good strategy because suggested questions have already been identified as interesting and important by experienced researchers.

📖 What to focus on in your review

The excerpt mentions that literature reviews should help you:

  • Refine your research question
  • Identify appropriate research methods
  • Place your research in context of previous research
  • Write an effective research report

📖 How many sources?

  • One study found that professional psychology journals averaged about 50 sources cited per article.
  • This gives a rough idea of what professional researchers consider adequate.
  • Students might be assigned lower minimums, but the principles for selecting useful sources remain the same.
  • Focus on recent research (past 5 years as a general rule), except for classic articles that appear in nearly every reference list.

10. Developing a Hypothesis

🧭 Overview

🧠 One-sentence thesis

Hypotheses are testable predictions derived from theories that allow researchers to systematically evaluate explanations of phenomena through empirical study.

📌 Key points (3–5)

  • Theory vs hypothesis: theories are broad explanations of phenomena; hypotheses are specific, testable predictions about what should be observed if a theory is accurate.
  • How hypotheses connect to theories: hypotheses typically follow an "if-then" structure—if the theory is correct, then a specific outcome should occur.
  • Common confusion: in everyday language "theory" means an untested guess, but in science a theory is simply an explanation—it may be untested or extensively tested and well-supported.
  • Best hypotheses distinguish competing theories: the strongest hypotheses make different predictions depending on which theory is correct, allowing researchers to determine which explanation is better.
  • Three characteristics of good hypotheses: they must be testable and falsifiable, logical (informed by theory or observation), and positive (stating that a relationship exists rather than that it doesn't).

🔬 Understanding theories and hypotheses

🔬 What a theory is

A theory is a coherent explanation or interpretation of one or more phenomena.

  • Theories go beyond just describing what happens—they include variables, structures, processes, or organizing principles that haven't been directly observed.
  • Example from the excerpt: drive theory explains both social facilitation and social inhibition by proposing that being watched creates physiological arousal, which increases the likelihood of the dominant response.
  • Don't confuse: outside science, "theory" often means "wild guess," but in science it simply means an explanation—it can be untested OR extensively tested and well-supported.
  • The theory of evolution and germ theory are called "theories" because they explain phenomena, not because they're uncertain.

🎯 What a hypothesis is

A hypothesis is a specific prediction about a new phenomenon that should be observed if a particular theory is accurate.

  • A hypothesis can also be an explanation that relies on just a few key concepts.
  • They are often specific predictions about what will happen in a particular study.
  • Developed by considering existing evidence and using reasoning to infer what will happen in a specific context.
  • Example: "If drive theory is correct, then cockroaches should run through a straight runway faster when other cockroaches are present."
  • Hypotheses can be expressed as statements or rephrased as questions.

🔗 The if-then relationship

  • The relationship between a theory and its hypotheses always has an if-then structure.
  • This relationship makes it possible to test theories by observing whether predicted outcomes actually occur.
  • Example from the excerpt: "If the habituation theory is correct, then expressive writing about positive experiences should not be effective."

🛠️ How to derive hypotheses from theories

🛠️ Two main approaches

| Approach | How it works | Example from excerpt |
| --- | --- | --- |
| Start with a research question | Generate a question, then ask if any theory implies an answer | Wonder if expressive writing about positive experiences improves health; check if habituation theory predicts an answer |
| Focus on theory components | Identify parts of a theory not yet directly observed and make predictions about them | Focus on the habituation process itself—predict people should show fewer signs of emotional distress with each writing session |

🏆 The best hypotheses: distinguishing competing theories

  • The strongest hypotheses make opposite predictions depending on which theory is correct.
  • Example from the excerpt: researchers tested two theories about self-judgment (number of examples vs. ease of retrieval).
    • Number-of-examples theory predicted: recalling 12 examples → judge self as more assertive.
    • Ease-of-retrieval theory predicted: recalling 6 examples → judge self as more assertive.
  • Only one prediction could be confirmed, providing particularly convincing evidence for one theory over the other.
  • Result: participants who recalled fewer examples judged themselves as more assertive, supporting the ease-of-retrieval theory.

🔄 The hypothetico-deductive method

🔄 The cycle of theory testing

The excerpt describes a systematic process researchers use:

  1. Start with phenomena: observe a set of events or patterns
  2. Construct or choose a theory: develop an explanation or select an existing one
  3. Derive a hypothesis: make a prediction about what should be observed if the theory is correct
  4. Conduct empirical study: test the hypothesis through research
  5. Reevaluate the theory: revise the theory based on results if necessary
  6. Repeat: derive new hypotheses from the revised theory
  • This is conceptualized as a cycle because it continues indefinitely.
  • The process meshes with the general model of scientific research, creating "theoretically motivated" or "theory-driven" research.

🪳 Example: Zajonc's cockroach study

The excerpt provides a detailed example of this method in action:

  • Starting point: contradictory pattern of results about social facilitation and inhibition
  • Theory constructed: drive theory—being watched causes physiological arousal, increasing the dominant response
  • Prediction: presence of others should improve performance on easy tasks but inhibit performance on difficult tasks
  • Hypothesis tested: cockroaches should run faster through a straight runway (easy) but slower through a cross-shaped maze (difficult) when other cockroaches are present
  • Method: cockroaches ran to escape light either alone or with other cockroaches in "audience boxes"
  • Result: hypothesis confirmed—cockroaches reached their goal more quickly in the straight runway but more slowly in the maze when others were present
  • Outcome: provided support for drive theory

✅ Characteristics of a good hypothesis

✅ Must be testable and falsifiable

  • Researchers must be able to test the hypothesis using scientific methods.
  • It must be possible to gather evidence that would disconfirm the hypothesis if it is false.
  • This connects to Popper's falsifiability criterion.

✅ Must be logical

  • Hypotheses are more than random guesses—they should be informed by previous theories or observations and logical reasoning.
  • Deductive reasoning: typically begin with a broad, general theory and generate a more specific hypothesis to test.
  • Inductive reasoning: occasionally, when no theory exists, use specific observations or research findings to form a more general hypothesis.

✅ Must be positive

  • The hypothesis should make a positive statement about the existence of a relationship or effect.
  • Should NOT state that a relationship or effect does not exist.
  • Why: scientists don't set out to show that relationships don't exist; statistical testing assumes no effect exists (the null hypothesis) and then seeks evidence to reject that assumption.
  • Don't confuse: this may seem backward, but it reflects the nature of the scientific method and statistical theory.

📝 Incorporating theory into research

📝 Two basic formats

| Format | When to use | Structure |
| --- | --- | --- |
| Question-first | Applied research questions; questions existing theories don't address | Raise research question → conduct study → offer theories to explain results |
| Theory-first | Existing theory addresses the question; hypothesis is surprising or conflicts with other theories | Describe existing theories → derive hypothesis → test in new study → reevaluate theory |

📝 Why use theories

  • Gives guidance in coming up with experiment ideas and possible projects.
  • Lends legitimacy to your work.
  • Psychologists have developed many theories about human behaviors over time.
  • Using established theories helps researchers break new ground rather than limiting idea development.

11. Designing a Research Study

🧭 Overview

🧠 One-sentence thesis

Designing a research study requires operationally defining variables, selecting an appropriate sample, and choosing between experimental methods (which allow causal conclusions) and non-experimental methods (which describe and predict but cannot establish causation).

📌 Key points (3–5)

  • Variables must be operationally defined: abstract constructs like "depression" must be transformed into measurable observations.
  • Experimental vs. non-experimental research: only experimental methods (manipulating independent variables while controlling extraneous variables) can establish causal relationships.
  • Common confusion—internal vs. external validity: laboratory studies have high internal validity (strong causal conclusions) but low external validity (harder to generalize); field studies show the opposite trade-off.
  • Sampling matters: researchers study samples but want to generalize to populations; random sampling is ideal but convenience sampling is more common in psychology.
  • Confounds vs. extraneous variables: confounds vary systematically with the independent variable and ruin causal conclusions; extraneous variables that don't vary systematically are less problematic.

🔬 Variables and operational definitions

🔬 What is a variable

Variable: a quantity or quality that varies across people or situations.

  • Almost everything varies (height, major, talkativeness, number of siblings).
  • Constants (things that don't vary) are rare—the speed of light is one example.
  • Variables are the focus of research questions in psychology.

📊 Types of variables

| Type | Definition | Examples |
| --- | --- | --- |
| Quantitative variable | A quantity measured by assigning a number | Height, level of talkativeness, number of siblings |
| Categorical variable | A quality measured by assigning a category label | Chosen major, nationality, occupation |

🎯 Operational definitions

Operational definition: a definition of the variable in terms of precisely how it is to be measured.

  • Why needed: most psychological constructs cannot be directly observed (e.g., "depression" is abstract).
  • How it works: transform abstract constructs into observable, measurable forms.
  • Example: depression can be operationally defined as:
    • Scores on the Beck Depression Inventory
    • Number of depressive symptoms
    • Whether diagnosed with major depressive disorder
  • Best practice: choose operational definitions that have been used extensively in research literature.

👥 Populations and sampling

👥 Population vs. sample

Population: the very large group of people researchers want to draw conclusions about.

Sample: a small subset of the population that is actually studied.

  • Example: studying a few hundred university students to draw conclusions about men and women in general.
  • The population depends on research goals (all American teenagers, children with autism, professional athletes, all humans, etc.).

🎲 Sampling methods

Simple random sampling:

  • Every member of the population has an equal chance of being selected.
  • Example: randomly selecting 100 registered voters from a complete list.
  • Problem: difficult or impossible in most psychological research because populations are not clearly defined.
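
A minimal sketch of simple random sampling, assuming a hypothetical, fully enumerated voter list (the kind of complete population list that rarely exists in psychological research):

```python
import random

# Hypothetical complete list of registered voters.
population = [f"voter_{i}" for i in range(1, 10001)]

# Simple random sampling: every member has an equal chance of being
# selected, and sampling is without replacement.
sample = random.sample(population, k=100)

print(len(sample))       # 100 voters selected
print(len(set(sample)))  # 100 distinct voters (no repeats)
```

Convenience sampling, by contrast, has no such selection mechanism: whoever happens to be available becomes the sample.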

Convenience sampling:

  • The sample consists of individuals who happen to be nearby and willing to participate.
  • Example: introductory psychology students.
  • Problem: the sample might not be representative of the population, making generalization less appropriate.

⚠️ Representative samples

  • A representative sample is similar to the population in important respects.
  • Important for drawing valid conclusions about the broader population.

🧪 Experimental research

🧪 Purpose and method

  • Goal: test hypotheses about causal relationships between variables.
  • Why it's unique: the experimental method is the only method that allows determination of causal relationships.
  • How it works: manipulate one or more variables while controlling extraneous variables, then measure effects on participants' responses.

🔑 Key terms in experimental research

Independent variable: the variable the experimenter manipulates (the presumed cause).

Dependent variable: the variable the experimenter measures (the presumed effect).

Extraneous variables: any variable other than the independent and dependent variables.

Confounds: a specific type of extraneous variable that systematically varies along with the variables under investigation and provides an alternative explanation for the results.

💡 Example: lighting and productivity

  • Independent variable: lighting conditions (bright lights vs. dim lights).
  • Dependent variable: workers' productivity.
  • Confound scenario: if bright lights are noisy, then noise varies systematically with light—we can't tell if productivity differences are due to light or noise.
  • Extraneous variable scenario: if there is noise both when lights are on and off, noise is merely extraneous (not confounding) because it doesn't vary systematically with the independent variable.

⚠️ Don't confuse: confounds vs. extraneous variables

  • Confounds are bad: they disrupt the ability to make causal conclusions because they provide alternative explanations.
  • Extraneous variables are less problematic: unless they vary systematically with the independent variable, they cannot be competing explanations.
  • Control requirement: researchers must ensure extraneous variables don't become confounding variables to make valid causal conclusions.

📋 Non-experimental research

📋 Purpose and method

  • Goal: describe characteristics of people, describe relationships between variables, and use those relationships to make predictions.
  • How it works: simply measure variables as they naturally occur without manipulating them.
  • Important note: non-experimental does not mean nonscientific—it is scientific in nature.

🎯 What non-experimental research can and cannot do

Can fulfill two goals of science:

  • Describe
  • Predict

Cannot fulfill one goal:

  • Cannot make causal conclusions
  • Cannot say that one variable causes another variable

💡 Examples

  • Measuring the number of traffic fatalities involving cell phone use (no manipulation).
  • Recording drivers' genders and cell phone use at an intersection to see if men or women are more likely to use phones while driving.

🏢 Laboratory vs. field studies

🏢 Definitions

Laboratory study: a study conducted in the laboratory environment.

Field study: a study conducted in the real-world, in a natural environment.

🔍 Internal validity

Internal validity: the degree to which we can confidently infer a causal relationship between variables.

Laboratory experiments:

  • Typically have high internal validity.
  • High control over environment and extraneous variables.
  • When only the manipulated variable differs between conditions, we can confidently conclude the independent variable caused changes in the dependent variable.

Field studies:

  • Typically have lower internal validity.
  • Less control over environment and potential extraneous variables.
  • Less appropriate to arrive at causal conclusions.

🌍 External validity

External validity: the degree to which we can generalize the findings to other circumstances or settings, like the real-world environment.

The trade-off:

  • When internal validity is high, external validity tends to be low.
  • When internal validity is low, external validity tends to be high.

| Study type | Internal validity | External validity | Why |
| --- | --- | --- | --- |
| Laboratory | High | Low | Artificial, sterile environment; hard to generalize to real world |
| Field | Low | High | Real-world environment; easier to generalize |

🔬 Field experiments

Field experiments: studies where an independent variable is manipulated in a natural setting and extraneous variables are controlled.

  • Can have both high external validity and high internal validity.
  • Quality depends on the level of control of extraneous variables.
  • Combines benefits of both approaches when done well.

12. Analyzing the Data

🧭 Overview

🧠 One-sentence thesis

Researchers use descriptive statistics to summarize their data and inferential statistics to determine whether their findings reflect real population effects or merely random chance, though both Type I and Type II errors remain possible.

📌 Key points (3–5)

  • Two types of statistics: descriptive statistics summarize data; inferential statistics generalize from sample to population.
  • Statistical significance threshold: results with less than 5% chance of being due to random error are considered statistically significant and appropriate to generalize.
  • Two kinds of errors: Type I errors (false positives) occur when researchers conclude there's an effect when there isn't one; Type II errors (missed opportunities) occur when real effects go undetected.
  • Common confusion: the 5% threshold balances risks—lowering it reduces Type I errors but increases Type II errors.
  • Probabilistic nature: statistics provide probability assessments, not absolute certainty about whether effects are real.

📊 Descriptive statistics: summarizing the data

📊 What descriptive statistics do

Descriptive statistics: used to organize or summarize a set of data.

  • They describe what the data look like without making claims about broader populations.
  • Examples include percentages, measures of central tendency, measures of dispersion, and correlation coefficients.

📍 Measures of central tendency

These describe the typical or central score of a distribution:

| Measure | Definition |
| --- | --- |
| Mode | The most frequently occurring score |
| Median | The midpoint of a distribution of scores |
| Mean | The average of a distribution of scores |

  • In experimental research, means are calculated separately for each group and compared to see if they differ.
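
All three measures are available in Python's standard library; a short sketch using hypothetical quiz scores for two groups:

```python
from statistics import mean, median, mode

# Hypothetical quiz scores for two experimental groups.
group_a = [4, 7, 7, 8, 9]
group_b = [3, 5, 6, 6, 10]

print(mode(group_a))    # 7: the most frequently occurring score
print(median(group_a))  # 7: the midpoint of the sorted scores
print(mean(group_a))    # 7: the average

# In experimental research, group means are computed separately and compared.
print(mean(group_a) - mean(group_b))  # 1: the difference between group means
```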

📏 Measures of dispersion

These describe the degree of spread in scores:

Range: measures the distance between the highest and lowest scores in a distribution.

Standard deviation: measures the average distance of scores from the mean.

Variance: the standard deviation squared; measures distance from the mean in a different unit.

  • Key question: Are scores clustered around the mean, or is there high variability?
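
The three dispersion measures can be sketched the same way (the scores are hypothetical; `pstdev`/`pvariance` treat the data as a whole population rather than a sample):

```python
from statistics import pstdev, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

score_range = max(scores) - min(scores)  # distance between highest and lowest
sd = pstdev(scores)                      # population standard deviation
var = pvariance(scores)                  # variance (standard deviation squared)

print(score_range)  # 7
print(sd)           # 2.0
print(var)          # 4
```

Note that the variance is exactly the standard deviation squared, so the two describe the same spread in different units.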

🔗 Correlation coefficients

Correlation coefficient: describes the strength and direction of the relationship between two variables.

  • Values range from −1.00 (strongest negative relationship) to +1.00 (strongest positive relationship).
  • A value of 0 means no relationship exists.
  • Positive correlation: as one variable increases, the other increases (e.g., height and weight).
  • Negative correlation: as one variable increases, the other decreases (e.g., stress and happiness).
  • Commonly used in non-experimental research.
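
One common correlation coefficient (Pearson's r) can be computed as the average product of paired z-scores; the height and weight values below are hypothetical:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    # Pearson correlation: average product of paired z-scores.
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

heights = [60, 62, 65, 68, 70]       # hypothetical heights (inches)
weights = [115, 120, 135, 150, 160]  # hypothetical weights (pounds)

# Taller people in this toy data weigh more: r is close to +1.00.
print(round(pearson_r(heights, weights), 2))

# A variable correlated with its own negation gives the strongest
# negative relationship, r = -1.00.
print(round(pearson_r(heights, [-h for h in heights]), 2))
```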

🔍 Inferential statistics: generalizing to populations

🔍 Purpose of inferential statistics

Inferential statistics: allow researchers to draw conclusions about a population based on data from a sample.

  • Researchers sample from a population but want to generalize findings to the broader population.
  • Critical question: Are the observed effects due to random chance variability or do they reflect real effects?

✅ Statistical significance

Statistically significant effect: one that is unlikely due to random chance and therefore likely represents a real effect in the population.

  • Convention: results with less than 5% chance of being due to random error are considered statistically significant.
  • When an effect is statistically significant, it is appropriate to generalize from sample to population.
  • When there's more than 5% chance the effect is due to chance alone, the result is not statistically significant.
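
The decision rule itself is simple enough to sketch directly (the p-values here are hypothetical):

```python
ALPHA = 0.05  # conventional 5% significance threshold

def is_significant(p_value, alpha=ALPHA):
    # Significant when the probability that chance alone produced the
    # observed result falls below the threshold.
    return p_value < alpha

print(is_significant(0.03))  # True: appropriate to generalize to the population
print(is_significant(0.20))  # False: not statistically significant
```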

⚠️ Probabilistic nature

  • Statistics provide probabilities, not absolute certainty.
  • The 5% threshold keeps the probability of an incorrect decision low.
  • But mistakes can always be made—this is where Type I and Type II errors come in.

⚠️ Two types of errors

❌ Type I error (false positive)

Type I error: when a researcher concludes results are statistically significant (claiming there is an effect in the population) when in reality there is no real effect and the results are just due to chance (a fluke).

  • With the 5% threshold, researchers have a 5% chance or less of making a Type I error.
  • Example: A researcher claims a treatment works when it actually doesn't—the positive result was just luck.

🔇 Type II error (missed opportunity)

Type II error: when a researcher concludes results are not statistically significant when in reality there is a real effect in the population and they just missed detecting it.

  • More likely when the threshold is set too low (e.g., 1% instead of 5%) or when the sample is too small.
  • Example: A researcher concludes a treatment doesn't work when it actually does—they failed to detect the real effect.

⚖️ The tradeoff

  • Don't confuse: You cannot eliminate both errors simultaneously.
  • When you reduce the chances of Type I errors (by lowering the threshold), you increase the chances of Type II errors.
  • The 5% convention balances these two risks.

| Error type | What it is | When it's more likely |
| --- | --- | --- |
| Type I | False positive (claiming an effect that doesn't exist) | When the threshold is set too high |
| Type II | Missed opportunity (missing a real effect) | When the threshold is set too low or the sample is too small |
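
The tradeoff can be illustrated with a small simulation. This is a sketch, not anything from the excerpt: it assumes a hypothetical one-sample z-test (known SD of 1, testing against a mean of 0) and simulates many studies under each scenario:

```python
import random
from statistics import NormalDist

random.seed(1)
ND = NormalDist()

def study_is_significant(true_mean, n=25, alpha=0.05):
    # Hypothetical study: sample n scores from a normal distribution
    # (SD = 1) and z-test the sample mean against 0.
    sample_mean = sum(random.gauss(true_mean, 1) for _ in range(n)) / n
    z = sample_mean / (1 / n ** 0.5)
    p = 2 * (1 - ND.cdf(abs(z)))
    return p < alpha

runs = 2000
# Null is true (no real effect): any significant result is a Type I error.
type1_rate = sum(study_is_significant(0.0) for _ in range(runs)) / runs
# A real effect exists: any non-significant result is a Type II error.
type2_rate = 1 - sum(study_is_significant(0.5) for _ in range(runs)) / runs

print(round(type1_rate, 2))  # close to 0.05, matching the threshold
print(round(type2_rate, 2))  # the miss rate; lowering alpha would raise it
```

Rerunning with a stricter `alpha` (say 0.01) drives the first rate down and the second rate up, which is exactly the tradeoff the 5% convention balances.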

13. Drawing Conclusions and Reporting the Results

🧭 Overview

🧠 One-sentence thesis

Because research findings are probabilistic and subject to error, scientists support or refute theories rather than proving them, and they share results through peer-reviewed journals, books, and conference presentations.

📌 Key points (3–5)

  • Statistical significance supports but never proves theories: confirming a hypothesis strengthens a theory, but Type I errors, alternative theories, and the problem of induction prevent absolute proof.
  • Disconfirmed hypotheses don't automatically disprove theories: results may reflect Type II errors, flawed designs, or minor unstated assumptions rather than fundamental theory failure.
  • Common confusion: "proof" vs. "evidence"—scientists avoid "prove" because all studies are probabilistic and flawed; only evidence accumulates, never certainty.
  • Reporting hierarchy: peer-reviewed journal articles are most prestigious, followed by edited book chapters, then conference presentations (oral or poster).
  • Why scientists persist after disconfirmation: researchers may improve designs, modify theories, or identify unstated assumptions before abandoning a theory.

🔬 What conclusions researchers can draw

✅ When results support the hypothesis

  • If results are statistically significant and consistent with the hypothesis and theory, researchers conclude the theory is supported.
  • The theory made an accurate prediction and now accounts for a new phenomenon.
  • This strengthens the theory but does not prove it.

❌ When results disconfirm the hypothesis

  • If a hypothesis is disconfirmed in a systematic empirical study, the theory is weakened.
  • The theory made an inaccurate prediction and fails to account for a new phenomenon.
  • However, disconfirmation does not automatically mean the theory is false (see complications below).

🚫 Why scientists avoid "scientific proof"

🎲 Reason 1: Type I errors

  • A confirmed hypothesis might reflect a Type I error (false positive).
  • The result may be statistically significant by chance when no real effect exists in the population.
  • With a 5% threshold, researchers accept a 5% chance of this error.

🔄 Reason 2: Alternative theories

  • Other plausible theories might imply the same hypothesis.
  • Confirming the hypothesis strengthens all those theories equally, not just the one being tested.
  • You cannot isolate which theory is "correct" from a single confirmation.

🦢 Reason 3: The problem of induction

The philosophical "problem of induction": one cannot definitively prove a general principle (e.g., "All swans are white") just by observing confirming cases (e.g., white swans)—no matter how many.

  • It is always possible that a disconfirming case (e.g., a black swan) will eventually appear.
  • Future tests of the hypothesis or new hypotheses from the theory might be disconfirmed.
  • For these reasons, scientists treat even highly successful theories as subject to revision based on new observations.

📊 The bottom line

  • Because statistics are probabilistic and all studies have flaws, there is no such thing as scientific proof.
  • There is only scientific evidence.

🔍 Complications when hypotheses are disconfirmed

🧩 Formal logic vs. scientific practice

  • Formal logic: "if A then B" and "not B" necessarily lead to "not A."
    • If A is the theory and B is the hypothesis, disconfirming B ("not B") should mean the theory is incorrect ("not A").
  • In practice: scientists do not give up on theories so easily.

🛠️ Reason 1: Methodological issues

  • A disconfirmed hypothesis could result from:
    • A Type II error (missed opportunity)—a real effect exists but was not detected.
    • A faulty research design—the researcher may not have successfully manipulated the independent variable or measured the dependent variable.

🧪 Reason 2: Unstated minor assumptions

  • Disconfirmation might mean some unstated but relatively minor assumption of the theory was not met.
  • Example: If Zajonc had failed to find social facilitation in cockroaches, he could have concluded that drive theory still applies but only to animals with sufficiently complex nervous systems.
  • Evidence from a study can be used to modify a theory rather than abandon it.

⚖️ When to abandon a theory

  • Researchers are not free to ignore repeated disconfirmations.
  • If they cannot improve research designs or modify theories to account for repeated disconfirmations, they must eventually abandon their theories and replace them with more successful ones.

📢 How scientists share their findings

📄 Peer-reviewed journal articles (most prestigious)

  • The most prestigious way to report findings.
  • Manuscripts must typically adhere to APA style (American Psychological Association writing style).
  • Articles undergo rigorous peer review before publication.

📚 Book chapters in edited books

  • Another way to report findings is by writing a chapter published in an edited book.
  • Preferably the editor puts the chapter through peer review, but this is not always the case.
  • Some scientists are invited by editors to write book chapters without peer review.

🎤 Conference presentations

| Presentation type | Description | Duration/format |
| --- | --- | --- |
| Oral presentation | Getting up in front of an audience of fellow scientists and giving a talk, then fielding questions | 10 minutes to 1 hour (depending on the conference) |
| Poster presentation | Summarizing the study on a large poster with a brief overview of purpose, methods, results, and discussion; presenter stands by the poster and discusses it with people who pass by | 1–2 hours |

  • Presenting at a conference is a fun way to disseminate findings.
  • It is a great way to get feedback from peers before attempting the more rigorous peer-review process for journal publication.

14. Key Takeaways and Exercise

🧭 Overview

🧠 One-sentence thesis

Psychological research follows a cyclical process from question to publication, requiring careful planning, appropriate methods, theoretical grounding, and ethical reporting to contribute meaningfully to the scientific literature.

📌 Key points (3–5)

  • The research cycle: research questions lead to empirical studies, which are published and become part of the literature that informs new questions.
  • Theory vs hypothesis: theories are broad explanations of larger phenomena, while hypotheses are specific predictions about particular study outcomes.
  • Experimental vs non-experimental: experimental research manipulates an independent variable to observe effects, while non-experimental research measures variables as they naturally occur.
  • Common confusion: theories can be supported but never proved; similarly, disconfirming a hypothesis does not necessarily disprove the underlying theory.
  • Statistical significance: inferential statistics help determine whether findings are unlikely due to chance, but probabilistic conclusions mean errors are always possible.

🔄 The Research Process

🔄 The cyclical model

The research process in psychology: a research question based on the research literature leads to an empirical study, the results of which are published and become part of the research literature.

  • The process is continuous and self-reinforcing
  • Each study builds on previous work and informs future research
  • Early literature review is essential to refine questions, identify methods, contextualize work, and prepare effective reports

📚 The research literature

  • Consists of all published research in psychology
  • Primarily includes articles in professional journals and scholarly books
  • PsycINFO is among the best tools for finding previous research—a computer database cataloging millions of articles, books, and book chapters

🎯 Developing Research Questions

🎯 Generating questions

  • Questions should be expressed in terms of variables and relationships between variables
  • Can be suggested by other researchers or generated by asking general questions about behavior or psychological characteristics
  • Must be evaluated for both interestingness and feasibility before proceeding

💡 Evaluating interestingness

Key factors that affect whether a question is interesting:

  • Doubt: the extent to which the answer is uncertain
  • Gap-filling: whether it addresses a gap in the research literature
  • Practical implications: whether it has important real-world applications

⚙️ Evaluating feasibility

Factors that affect whether a question can be answered:

  • Time available
  • Money and funding
  • Technical knowledge and skill
  • Access to special equipment and research participants

🧪 Theories and Hypotheses

🧪 Understanding the distinction

| Concept | Scope | Function |
| --- | --- | --- |
| Theory | Broad in nature | Explains larger bodies of data |
| Hypothesis | More specific | Makes predictions about outcomes of particular studies |

🔬 The hypothetico-deductive method

Psychologists use this method like other scientists:

  1. Construct theories to explain phenomena (or work with existing theories)
  2. Derive hypotheses from their theories
  3. Test the hypotheses
  4. Reevaluate the theories in light of new results

Important: Working with theories is not "icing on the cake"—it is a basic ingredient of psychological research.

⚠️ Limits of proof

  • Theories can be supported but not proved
  • Disconfirming a hypothesis does not necessarily mean the theory has been disproved
  • This is a fundamental principle of scientific reasoning

🔢 Variables and Sampling

🔢 Types of variables

Variables: characteristics that vary across people or situations.

Two main types:

  • Quantitative: numerical measures (e.g., age)
  • Categorical: distinct categories (e.g., course subject)

👥 Sampling participants

Sample: a small subset of a larger population selected to participate in the research study.

Common sampling methods include:

  • Convenience sampling
  • Simple random sampling
  • Many other approaches

🧬 Research Designs

🧬 Experimental vs non-experimental

| Design type | Key feature | What it involves |
| --- | --- | --- |
| Experimental | Manipulation | Manipulating an independent variable to observe effects on a measured dependent variable |
| Non-experimental | Observation | Measuring variables as they naturally occur (without manipulating anything) |

🏢 Field vs laboratory research

  • Laboratory experiments: tend to have high internal validity (allowing strong causal conclusions)
  • Field studies: often have more external validity (allowing generalization to the real world)

Don't confuse: internal validity (causal conclusions) with external validity (generalizability)—each setting offers different strengths.

📊 Statistical Analysis

📊 Descriptive statistics

Measures of central tendency (describing typical/average/center scores):

  • Mean
  • Median
  • Mode

Measures of dispersion (describing how spread apart scores are):

  • Range
  • Standard deviation
  • Variance
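All six descriptive statistics above can be computed with Python's standard `statistics` module. The quiz scores below are hypothetical, made-up data for illustration:

```python
import statistics

# Hypothetical quiz scores (illustrative data, not from the excerpt).
scores = [3, 5, 5, 6, 7, 8, 10]

# Central tendency: the typical/center score
mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle score when sorted
mode = statistics.mode(scores)      # most frequent score

# Dispersion: how spread apart the scores are
score_range = max(scores) - min(scores)  # highest minus lowest
sd = statistics.stdev(scores)            # sample standard deviation
var = statistics.variance(scores)        # sample variance (sd squared)

print(median, mode, score_range)  # 6 5 7
```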

📈 Inferential statistics

Inferential statistics: allow researchers to determine whether findings are statistically significant—that is, whether they are unlikely to be due to chance alone and therefore likely to represent a real effect in the population.

⚠️ Statistical errors

Because statistics are probabilistic in nature, we can never be certain that our conclusions are correct. Two types of errors are possible:

  • Type I error: concluding an effect is real when it is not
  • Type II error: concluding there is no effect when there actually is a real effect in the population
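The Type I error rate can be made concrete by simulation. The sketch below is an illustrative assumption, not from the excerpt: it runs many experiments in which the null hypothesis is actually true, applies a two-tailed z-test at alpha = .05, and counts how often a "significant" result appears by chance alone, which happens at roughly the alpha rate:

```python
import math
import random

random.seed(1)
critical_z = 1.96   # two-tailed critical value for alpha = .05
n, sigma = 30, 1.0  # assumed sample size and known population SD
trials = 2000
false_positives = 0

for _ in range(trials):
    # The null hypothesis is TRUE here: the population mean is exactly 0.
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    if abs(z) > critical_z:
        false_positives += 1  # Type I error: concluding an effect is real

print(false_positives / trials)  # should hover near alpha = .05
```

A Type II error would be the mirror-image simulation: make the population mean nonzero and count how often the test *fails* to reject.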

📢 Reporting Results

📢 Venues for dissemination

The final step of the research process involves reporting results through:

  • Scientific conferences (often via poster presentations)
  • Journal articles
  • Books

📋 Conference presentations

  • Researchers present work on large posters providing brief overviews of purpose, methods, results, and discussion
  • Presenters stand by their poster for an hour or two to discuss with attendees
  • A great way to get feedback from peers before undergoing the more rigorous peer review required for journal publication

15. Moral Foundations of Ethical Research

🧭 Overview

🧠 One-sentence thesis

Ethical psychological research requires balancing four moral principles—weighing risks against benefits, acting with integrity, seeking justice, and respecting rights and dignity—across three affected groups: research participants, the scientific community, and society.

📌 Key points (3–5)

  • Framework structure: Ethics must consider how four moral principles apply to three groups (participants, scientific community, society).
  • Unavoidable conflict: Ethical tensions are inherent in research because risks and benefits often affect different groups, and complete truthfulness can conflict with scientific validity.
  • Risks vs. benefits are not directly comparable: Participants may bear the risks while science or society gains the benefits, making tradeoffs difficult.
  • Common confusion: Acting with integrity (honesty) can conflict with conducting valuable research that requires deception—researchers must resolve this responsibly, not avoid it entirely.
  • Trust is foundational: All four principles depend on trust between researchers, participants, the scientific community, and society.

🧩 What ethics means in research

🧩 Definition and scope

Ethics: the branch of philosophy concerned with morality—what it means to behave morally and how people can achieve that goal; also a set of principles and practices that provide moral guidance in a particular field.

  • Ethics applies to many fields: business, medicine, teaching, and scientific research.
  • Psychological research raises many ethical issues, especially when human participants are involved.
  • The excerpt emphasizes that a general framework helps researchers think through these issues systematically.

📊 The framework table

The excerpt presents a table with:

  • Rows: four moral principles (weighing risks vs. benefits, acting responsibly and with integrity, seeking justice, respecting rights and dignity).
  • Columns: three affected groups (research participants, scientific community, society).
  • Purpose: A thorough ethical analysis must consider how each principle applies to each group.

⚖️ Weighing risks against benefits

⚖️ What counts as risks and benefits

For research participants:

  • Risks: treatment might fail or harm, procedures might cause physical or psychological harm, privacy might be violated.
  • Benefits: receiving helpful treatment, learning about psychology, satisfaction from contributing to knowledge, compensation (money or course credit).

For the scientific community and society:

  • Risks: wasting time/money/effort on uninteresting or poorly designed studies; research results could be misunderstood or misapplied with harmful consequences.
  • Benefits: advancing scientific knowledge and contributing to societal welfare.

⚖️ The challenge of comparison

  • Risks and benefits are often not directly comparable.
  • Common pattern: risks fall primarily on participants, while benefits go primarily to science or society.
  • Example: Milgram's obedience study caused severe psychological stress to participants (sweating, trembling, stuttering, nervous laughter, convulsive seizures, one participant "reduced to a twitching, stuttering wreck") but produced important scientific insights about obedience to authority with implications for understanding events like the Holocaust.
  • The excerpt asks: "Was it worth it?" and notes that competent, well-meaning researchers can disagree.

⚖️ The MMR vaccine study harm

  • The retracted study linking the MMR vaccine to autism caused harm to both science and society.
  • Science: other researchers wasted resources on unnecessary follow-up research.
  • Society: people avoided the vaccine, increasing risk of measles, mumps, and rubella; many people, including children, died as a result.

🤝 Acting responsibly and with integrity

🤝 What integrity means

  • Carrying out research thoroughly and competently.
  • Meeting professional obligations.
  • Being truthful.
  • Why it matters: Integrity promotes trust, which is essential for all effective human relationships.

🤝 Trust from participants

Participants must trust that researchers:

  • Are honest about what the study involves.
  • Will keep promises (e.g., maintain confidentiality).
  • Will maximize benefits and minimize risk.

🤝 Trust from the scientific community and society

  • Researchers must conduct research thoroughly and competently.
  • Researchers must report honestly.
  • When trust is violated, consequences follow: wasted resources, public harm (as in the MMR vaccine case).

🤝 The deception dilemma

  • Some research questions are difficult or impossible to answer without deceiving participants.
  • Example: Milgram's study required deception (participants were told they were studying punishment and learning, not obedience).
  • Conflict: Acting with integrity (honesty) can conflict with doing research that advances knowledge and benefits society.
  • Don't confuse: The excerpt does not say deception is always wrong; it says researchers must deal with this conflict responsibly.

⚖️ Seeking justice

⚖️ Fairness to participants

  • Treat participants fairly: adequate compensation, benefits and risks distributed across all participants.
  • Example: In a study of a new psychotherapy, some participants receive treatment while others are in a control group with no treatment. If the therapy is effective, it would be fair to offer it to the control group after the study ends.

⚖️ Historical injustice at the societal level

  • Some groups have historically faced more than their fair share of research risks: people who are institutionalized, disabled, or belong to racial or ethnic minorities.
  • Tuskegee syphilis study (1932–1972): Poor African American men near Tuskegee, Alabama, were told they were being treated for "bad blood." They were given some free medical care but were not treated for their syphilis. Researchers observed how the disease developed in untreated patients. Even after penicillin became standard treatment in the 1940s, these men were denied treatment and not given an opportunity to leave. The study ended only after journalists and activists made details public.
  • President Bill Clinton formally apologized in 1997: "They believed they had found hope when they were offered free medical care by the United States Public Health Service. They were betrayed."
  • Researchers must now consider justice and fairness at the societal level.

🛡️ Respecting people's rights and dignity

🛡️ Autonomy and informed consent

Autonomy: people's right to make their own choices and take their own actions free from coercion.

Informed consent: researchers obtain and document people's agreement to participate after having informed them of everything that might reasonably be expected to affect their decision.

  • Participants must be told what might affect their decision to participate.
  • Tuskegee example: Participants agreed to participate but were not told they had syphilis and would be denied treatment. Had they known, they likely would not have agreed. They did not give true informed consent.
  • Milgram example: Participants were not told they might be "reduced to a twitching, stuttering wreck." Many likely would not have agreed had they known. They did not give true informed consent.

🛡️ Privacy and confidentiality

Privacy: people's right to decide what information about them is shared with others.

Confidentiality: an agreement not to disclose participants' personal information without their consent or appropriate legal authorization.

Anonymity: when participants' names and other personally identifiable information are not collected at all.

  • Researchers must maintain confidentiality.
  • Anonymity is even more protective (no identifiable information collected).

🔄 Unavoidable ethical conflict

🔄 Why conflict is inherent

  • Almost no psychological research is completely risk-free, so there will almost always be conflict between risks and benefits.
  • Research beneficial to one group (e.g., scientific community) can harm another (e.g., participants), creating difficult tradeoffs.
  • Being completely truthful can make it difficult or impossible to conduct scientifically valid studies on important questions.

🔄 Easy vs. difficult conflicts

  • Easy to resolve: Deceiving participants and subjecting them to physical harm would not be justified by filling a small gap in the research literature—nearly everyone agrees.
  • Difficult to resolve: Competent, well-meaning researchers can disagree.
  • Example: A study on "personal space" in a public men's room secretly observed participants to see if they took longer to urinate when another man (a confederate) was at a nearby urinal. Some critics found this an unjustified assault on human dignity; the researchers argued they had carefully considered the ethics, minimized risks, and concluded benefits outweighed risks (they had interviewed preliminary participants who were not bothered by being observed).

🔄 Responsible resolution

  • Ethical conflict cannot be eliminated completely, but it can be dealt with responsibly and constructively.
  • How: Thoroughly and carefully think through the ethical issues, minimize risks, weigh risks against benefits.
  • Be able to explain ethical decisions to others, seek feedback, and ultimately take responsibility.

16. From Moral Principles to Ethics Codes

🧭 Overview

🧠 One-sentence thesis

Broad moral principles have been translated into detailed, enforceable ethics codes—such as the APA Ethics Code—that guide researchers through common ethical issues like informed consent, deception, and scholarly integrity.

📌 Key points (3–5)

  • Why codes exist: Even people who agree on general moral principles can disagree on specific ethical issues, so detailed codes provide concrete guidance.
  • Historical progression: Ethics codes evolved from the Nuremberg Code (1947) through the Declaration of Helsinki (1964) and the Belmont Report (1978) to today's federal regulations and professional standards.
  • Key mechanisms: Institutional Review Boards (IRBs) review research protocols and classify studies by risk level (exempt, expedited, or greater than minimal risk).
  • Common confusion: Informed consent is not just signing a form—it requires genuine understanding and voluntary agreement, which may need more than a signature.
  • Core standards: The APA Ethics Code Standard 8 addresses research-specific issues including informed consent, deception, debriefing, animal research, and scholarly integrity.

📜 Historical development of ethics codes

📜 The Nuremberg Code (1947)

  • One of the earliest ethics codes, written in conjunction with trials of Nazi physicians who conducted cruel experiments on concentration camp prisoners.
  • Provided a standard to compare the behavior of those on trial; many were convicted and imprisoned or sentenced to death.
  • Key contributions:
    • Emphasized carefully weighing risks against benefits
    • Stressed the need for informed consent

🌍 Declaration of Helsinki (1964)

  • Created by the World Medical Council as a similar ethics code.
  • What it added:

    A written protocol—a detailed description of the research—that is reviewed by an independent committee.

  • Has been revised several times, most recently in 2004.

🇺🇸 The Belmont Report (1978)

  • Published in the United States in response to concerns about the Tuskegee study and other problematic research.
  • Three core principles it recognized:
| Principle | Definition | Implication |
|---|---|---|
| Justice | Conducting research in a way that distributes risks and benefits fairly across different groups at the societal level | Fairness in who bears research burdens and who benefits |
| Respect for persons | Acknowledges individuals' autonomy and protection for those with diminished autonomy (e.g., prisoners, children) | Translates to the need for informed consent |
| Beneficence | Maximizing the benefits of research while minimizing harms to participants and society | Risk-benefit analysis is mandatory |

⚖️ Federal Policy for the Protection of Human Subjects

  • The Belmont Report became the basis of federal laws that apply to research conducted, supported, or regulated by the federal government.
  • Most important requirement: Universities, hospitals, and other institutions receiving federal support must establish an Institutional Review Board (IRB).

🛡️ Institutional Review Boards and risk levels

🛡️ What an IRB does

Institutional Review Board (IRB): a committee responsible for reviewing research protocols for potential ethical problems.

  • Composition requirements:

    • At least five people with varying backgrounds
    • Members of different professions
    • Scientists and nonscientists
    • Men and women
    • At least one person not otherwise affiliated with the institution
  • IRB responsibilities:

    • Make sure risks are minimized
    • Ensure benefits outweigh risks
    • Verify research is carried out fairly
    • Check that informed consent procedure is adequate

📊 Three levels of research risk

| Risk level | Definition | Review process | Examples |
|---|---|---|---|
| Exempt | Lowest level of risk | Once approved, exempt from regular continuous review | Research on normal educational activities; standard psychological measures and surveys (nonsensitive, confidential); existing data from public sources |
| Expedited | Somewhat higher risk, but still no greater than minimal risk | Done by one IRB member or a separate committee under IRB authority | Risks no greater than those encountered by healthy people in daily life or during routine physical/psychological examinations |
| Greater than minimal risk | Does not qualify for exempt or expedited review | Must be reviewed by the full board of IRB members | Research posing risks beyond everyday life |

Don't confuse: "Exempt" does not mean "no review"—it means exempt from continuous review after initial approval.

📋 APA Ethics Code Standard 8: Research and Publication

📋 Overview of Standard 8

  • The APA's Ethical Principles of Psychologists and Code of Conduct (APA Ethics Code) was first published in 1953 and revised several times, most recently in 2010.
  • Includes about 150 specific ethical standards.
  • Standard 8 is the most relevant part for research, covering:
    • Institutional approval
    • Informed consent
    • Deception
    • Debriefing
    • Use of nonhuman animals
    • Scholarly integrity (reporting results, plagiarism, publication credit)

✅ Institutional approval (8.01)

  • When institutional approval is required, psychologists must:
    • Provide accurate information about research proposals
    • Obtain approval prior to conducting research
    • Conduct research in accordance with the approved protocol

🤝 Informed consent requirements

🤝 What informed consent means (8.02)

Informed consent: obtaining and documenting people's agreement to participate in a study, having informed them of everything that might reasonably be expected to affect their decision.

What participants must be told (8 elements):

  1. Purpose of the research, expected duration, and procedures
  2. Right to decline to participate and to withdraw once participation has begun
  3. Foreseeable consequences of declining or withdrawing
  4. Reasonably foreseeable factors that may influence willingness to participate (potential risks, discomfort, adverse effects)
  5. Any prospective research benefits
  6. Limits of confidentiality
  7. Incentives for participation
  8. Whom to contact for questions about the research and participants' rights
  • Researchers must also provide an opportunity for prospective participants to ask questions and receive answers.

🎥 Recording voices and images (8.03)

  • Psychologists must obtain informed consent before recording voices or images for data collection.
  • Exceptions (consent not required):
    • Research consists solely of naturalistic observations in public places, and recording will not cause personal identification or harm
    • Research design includes deception, and consent for recording is obtained during debriefing

👥 Special populations (8.04)

  • When conducting research with clients/patients, students, or subordinates, psychologists must protect them from adverse consequences of declining or withdrawing.
  • When research participation is a course requirement or extra credit opportunity, the prospective participant must be given the choice of equitable alternative activities.

🚫 When informed consent can be dispensed with (8.05)

Psychologists may dispense with informed consent only when:

  1. Research would not reasonably be assumed to create distress or harm, AND involves:
    • Study of normal educational practices in educational settings
    • Only anonymous questionnaires, naturalistic observations, or archival research where disclosure would not place participants at risk and confidentiality is protected
    • Study of job/organization effectiveness in organizational settings with no risk to employability and confidentiality protected
  2. Where otherwise permitted by law or federal/institutional regulations

💰 Inducements for participation (8.06)

  • Psychologists must avoid offering excessive or inappropriate financial or other inducements that are likely to coerce participation.
  • When offering professional services as an inducement, psychologists must clarify the nature of services, risks, obligations, and limitations.

⚠️ Common confusion: consent forms vs. genuine consent

Don't confuse signing a form with true informed consent:

  • Many participants do not actually read consent forms or read them but do not understand them.
  • Participants often mistake consent forms for legal documents and mistakenly believe signing them means giving up their right to sue the researcher.
  • Best practice (even with competent adults):
    • Tell participants about risks and benefits
    • Demonstrate the procedure
    • Ask if they have questions
    • Remind them of their right to withdraw at any time
    • In addition to having them read and sign a consent form

Example: An organization conducts a study with competent adults. Instead of only handing out consent forms, researchers verbally explain the study, show what participants will do, answer questions, and emphasize that withdrawal is allowed at any time—this approach ensures genuine understanding, not just a signature.

🎭 Deception and debriefing

🎭 When deception is allowed (8.07)

Psychologists may use deception only when:

  1. The use of deceptive techniques is justified by the study's significant prospective scientific, educational, or applied value
  2. Effective nondeceptive alternative procedures are not feasible

Absolute prohibition:

  • Psychologists do not deceive prospective participants about research that is reasonably expected to cause physical pain or severe emotional distress.

Disclosure requirement:

  • Psychologists must explain any deception that is an integral feature of the design as early as feasible, preferably at the conclusion of participation, but no later than at the conclusion of data collection.
  • Participants must be permitted to withdraw their data after learning about the deception.

💬 Debriefing requirements (8.08)

  • Psychologists must provide a prompt opportunity for participants to obtain appropriate information about the nature, results, and conclusions of the research.
  • They must take reasonable steps to correct any misconceptions participants may have.

When debriefing can be delayed:

  • If scientific or humane values justify delaying or withholding information, psychologists must take reasonable measures to reduce the risk of harm.

Harm mitigation:

  • When psychologists become aware that research procedures have harmed a participant, they must take reasonable steps to minimize the harm.

🐾 Humane care and use of animals

🐾 Animal research standards (8.09)

Compliance and supervision:

  • Psychologists must acquire, care for, use, and dispose of animals in compliance with current federal, state, and local laws and regulations, and with professional standards.
  • Psychologists trained in research methods and experienced in the care of laboratory animals must supervise all procedures involving animals.
  • They are responsible for ensuring appropriate consideration of animals' comfort, health, and humane treatment.

Training requirement:

  • Psychologists must ensure that all individuals under their supervision who are using animals have received instruction in research methods and in the care, maintenance, and handling of the species being used.

Minimizing harm:

  • Psychologists must make reasonable efforts to minimize the discomfort, infection, illness, and pain of animal subjects.
  • A procedure subjecting animals to pain, stress, or privation may be used only when:
    • An alternative procedure is unavailable
    • The goal is justified by its prospective scientific, educational, or applied value

Surgical and termination procedures:

  • Psychologists must perform surgical procedures under appropriate anesthesia and follow techniques to avoid infection and minimize pain during and after surgery.
  • When it is appropriate that an animal's life be terminated, psychologists must proceed rapidly, with an effort to minimize pain and in accordance with accepted procedures.

📝 Scholarly integrity

📝 Reporting research results (8.10)

Fabrication prohibition:

  • Psychologists do not fabricate data.

Error correction:

  • If psychologists discover significant errors in their published data, they must take reasonable steps to correct such errors in a correction, retraction, erratum, or other appropriate publication means.

📄 Plagiarism (8.11)

Psychologists do not present portions of another's work or data as their own, even if the other work or data source is cited occasionally.

  • This means proper attribution is required throughout, not just occasional citation.

✍️ Publication credit (8.12)

Responsibility and credit:

  • Psychologists take responsibility and credit, including authorship credit, only for work they have actually performed or to which they have substantially contributed.

Accurate reflection of contributions:

  • Principal authorship and other publication credits must accurately reflect the relative scientific or professional contributions of the individuals involved, regardless of their relative status.
  • Mere possession of an institutional position (e.g., department chair) does not justify authorship credit.
  • Minor contributions are acknowledged appropriately, such as in footnotes or in an introductory statement.

Student authorship:

  • Except under exceptional circumstances, a student is listed as principal author on any multiple-authored article that is substantially based on the student's doctoral dissertation.
  • Faculty advisors must discuss publication credit with students as early as feasible and throughout the research and publication process.

🔄 Duplicate publication and data sharing (8.13–8.14)

Duplicate publication (8.13):

  • Psychologists do not publish, as original data, data that have been previously published.
  • This does not preclude republishing data when accompanied by proper acknowledgment.

Sharing data for verification (8.14):

  • After research results are published, psychologists must not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis.
  • Conditions:
    • Data used only for verification purpose
    • Confidentiality of participants can be protected
    • Legal rights concerning proprietary data do not preclude release
  • Psychologists may require that individuals or groups requesting data be responsible for costs associated with providing such information.
  • Psychologists who request data from others may use shared data only for the declared purpose and must obtain prior written agreement for all other uses.

🔍 Reviewers (8.15)

  • Psychologists who review material submitted for presentation, publication, grant, or research proposal review must respect the confidentiality of and the proprietary rights in such information of those who submitted it.

17. Putting Ethics Into Practice

🧭 Overview

🧠 One-sentence thesis

Conducting ethical research requires researchers to proactively identify risks and deception, minimize harm through careful design choices, and maintain integrity from study design through publication.

📌 Key points (3–5)

  • Researcher responsibility: Lack of awareness is not a defense; researchers must know ethics codes, institutional policies, and seek clarification when uncertain.
  • Risk identification and minimization: List all potential harms (physical, psychological, confidentiality violations), seek diverse input, and use design modifications, prescreening, or confidentiality protections to reduce risks.
  • Deception management: Minimize all forms of deception (active misleading, withholding information, allowing false assumptions); use only when truly necessary and reveal during debriefing.
  • Common confusion: Informed consent is not just signing a form—it requires explaining risks/benefits, demonstrating procedures, answering questions, and reminding participants of withdrawal rights.
  • Ongoing vigilance: Ethics work continues after approval—monitor participants for unanticipated reactions, protect confidentiality during data collection, and maintain integrity through publication.

🎯 Foundational responsibilities

📚 Know your ethical obligations

  • The APA Ethics Code explicitly states that ignorance of standards is not a defense against charges of unethical conduct.
  • Minimum requirements for new researchers:
    • Read and understand relevant sections of the APA Ethics Code
    • Distinguish minimal risk from at-risk research
    • Know your institution's specific policies and procedures
    • Understand how to prepare and submit protocols for IRB review
  • When in doubt: Review ethics codes, read how others resolved similar issues, or consult experienced researchers, your IRB, or course instructor.
  • Key principle: You as the researcher must ultimately take responsibility for the ethics of your research.

🔍 Seek input proactively

  • Researchers often underestimate risks or overlook them completely.
  • Example from the excerpt: An emergency medical technician researcher wanted to show gruesome crime/accident photos but greatly underestimated how disturbing these were to most people because of her professional exposure.
  • Why diverse input matters: Nonresearchers may better understand the participant perspective; collaborators may spot risks you missed.

🛡️ Identifying and minimizing risks

📋 List all potential risks

Start by comprehensively listing risks, including:

  • Physical harm
  • Psychological harm (distress, embarrassment, fear)
  • Confidentiality violations

Important consideration: Some risks apply only to certain participants.

  • Example: A survey about fear of crimes might not bother most people, but could upset crime victims.

🔧 Three strategies to minimize risks

| Strategy | How it works | Example from excerpt |
|---|---|---|
| Modify research design | Shorten procedures; replace upsetting materials with milder versions | Burger's 2009 Milgram replication stopped at 150-V instead of 450-V, avoiding severe stress while still answering the research question |
| Prescreening | Identify and eliminate high-risk participants through informed consent warnings or data collection | Burger used questionnaires and clinical psychologist interviews to exclude participants with physical/psychological vulnerabilities |
| Confidentiality protections | Keep consent forms separate from data; collect only necessary personal information; prevent unintentional sharing | Administer personal surveys individually in private rather than in public settings where responses could be overheard |

🔬 The Burger replication example

Burger's 2009 replication of Milgram's obedience study stopped the procedure when participants were about to administer the 150-V shock.

  • Rationale: In Milgram's original study, (a) severe negative reactions occurred after this point, and (b) most participants who gave the 150-V shock continued to the maximum.
  • Result: Burger could compare results up to 150-V and estimate continuation rates without subjecting participants to severe stress.
  • Finding: Contemporary participants were just as obedient as Milgram's original participants.

🎭 Managing deception

🔍 Recognize all forms of deception

Deception is not only active misleading; it includes:

  • Allowing participants to make incorrect assumptions
  • Withholding information about the full design or purpose
  • Using confederates or phony equipment
  • Presenting false feedback about performance

✂️ Minimize or eliminate deception

First principle: According to the APA Ethics Code, deception is acceptable only if there is no other way to answer your research question.

Practical example from the excerpt:

  • Deceptive design: Show photos of "college professors" (actually family/friends) and ask participants to rate teaching ability.
  • Non-deceptive alternative: Tell participants to imagine the photos are of professors and rate them as if they were.

⏰ When to reveal information

  • Generally acceptable: Wait until debriefing to reveal the research question, as long as you describe procedure, risks, and benefits during informed consent.
  • Why this is acceptable: Knowing the research question (e.g., "Does professor age affect expectations?") could invalidate results—participants might rate differently because they think you want them to, or rate the same to avoid appearing prejudiced.
  • Minimizing even mild deception: Inform participants that while you've accurately described procedure/risks/benefits, you will reveal the research question afterward—they consent to having information withheld temporarily.

⚖️ Weighing risks against benefits

📊 Identify all benefits

Consider benefits to:

  • Research participants (e.g., relevant practical information, referrals)
  • Science (advancing knowledge)
  • Society (practical implications)
  • Student researchers (learning to conduct research)

🎚️ The risk-benefit threshold

| Risk level | Benefit requirement | Justification |
|---|---|---|
| Minimal risk (no more than daily life or routine exams) | Small benefit to participants, science, or society | Generally sufficient to justify the research |
| More than minimal risk | Greater benefits required | Study must be well designed, answer scientifically interesting questions, or have clear practical implications |
| Potential for lasting harm | Rarely justified | Research causing more than minor harm or lasting distress is rarely considered justified by benefits |

Ethical boundary: It is unethical to subject people to pain, fear, or embarrassment merely to satisfy personal curiosity.

📝 Creating consent and debriefing procedures

✅ Informed consent requirements

Don't confuse: Informed consent is not just having participants sign a form.

Comprehensive informed consent process:

  1. During recruitment: Provide as much study information as possible so those who might object can avoid it.
  2. Oral explanation: Prepare a script or talking points to explain the study in simple, everyday language—cover procedure, risks, benefits, and right to withdraw.
  3. Written form: Create a consent form covering all points in APA Standard 8.02a for participants to read and sign.
  4. If using deception: Include (orally and in writing) that you are withholding some information about design/purpose but will reveal it during debriefing.

Why forms alone are insufficient:

  • Many participants don't actually read consent forms or don't understand them.
  • Participants often mistake consent forms for legal documents and wrongly believe signing means giving up the right to sue.
  • Best practice: Tell participants about risks/benefits, demonstrate the procedure, ask if they have questions, and remind them of withdrawal rights—in addition to having them sign a form.

🗣️ Effective debriefing

Key components:

  • Reveal the research question and full study design
  • Explain what happened in other conditions (if participants were in only one)
  • If deception was used: Reveal it as soon as possible, apologize, explain why it was necessary, and correct misconceptions
  • Provide additional benefits: Relevant practical information, pamphlets, referrals to counseling or other resources

Example from excerpt: In a study on attitudes toward domestic abuse, provide pamphlets about domestic abuse and referral information to university counseling.

Critical: Schedule plenty of time—informed consent and debriefing cannot be effective if rushed.

⚠️ When informed consent is not necessary

According to APA Standard 8.05, informed consent is not required when:

  • Research is not expected to cause harm
  • The procedure is straightforward
  • The study is conducted in the context of ordinary activities

Examples:

  • Observing whether people hold doors open for others outside a public building
  • A college instructor comparing two legitimate teaching methods across course sections

🔐 Approval and follow-through

📄 Getting institutional approval

Protocol requirements (what you must describe):

  • Purpose of the study
  • Research design and procedure
  • Risks and benefits
  • Steps taken to minimize risks
  • Informed consent and debriefing procedures

Mindset: Don't view IRB approval as merely an obstacle; treat it as an opportunity to think through ethics and consult with experienced others who offer different perspectives.

If the IRB has concerns: Address them promptly and in good faith, even if it means further modifying your design before resubmitting.

🔄 Maintaining ethics during research

Ethics work does not end at approval:

  1. Stick to your protocol or seek additional approval for anything beyond minor changes.
  2. Monitor participants for unanticipated reactions and seek feedback during debriefing.
    • Milgram criticism: Although he didn't know participants would have severe reactions initially, he knew after testing the first several participants and should have made adjustments then.
  3. Protect confidentiality: Keep consent forms and data safe and separate; ensure no one has access to personal information (intentionally or unintentionally).

📚 Publication integrity

  • Address authorship early: Decide with collaborators who will be authors and in what order.
  • Avoid plagiarism in your writing.
  • Never fabricate or alter data: Your scientific goal is to learn how the world actually is; your duty is to report honestly and accurately.
  • Remember: Unexpected results are often as interesting as expected ones, or even more so.

18. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Ethical responsibility in psychological research extends from initial design through publication, requiring researchers to balance moral principles, minimize risks, follow institutional protocols, and maintain integrity at every stage.

📌 Key points (3–5)

  • Four moral principles guide ethical decisions: weighing risks against benefits, acting responsibly and with integrity, seeking justice, and respecting people's rights and dignity—applied to participants, science, and society.
  • Ethical conflict is unavoidable: researchers must minimize risks, weigh trade-offs, explain decisions, seek feedback, and take responsibility.
  • Multiple ethics codes provide guidance: Nuremberg Code, Declaration of Helsinki, Belmont Report, Federal Policy, and especially APA Ethics Code Standard 8 (informed consent, deception, debriefing, animal subjects, scholarly integrity).
  • Common confusion—informed consent vs. signing a form: informed consent is a process of informing participants of everything that might affect their decision, not merely having them sign a document.
  • Ethical duties continue after IRB approval: monitor participant reactions, protect confidentiality, maintain integrity through publication, avoid fabrication or plagiarism, and report results honestly.

📋 The ethical approval process

📋 Writing the protocol

  • Institutional approval requires a written protocol describing:
    • Purpose of the study
    • Research design and procedure
    • Risks and benefits
    • Steps taken to minimize risks
    • Informed consent and debriefing procedures
  • Don't confuse: the approval process is not just an obstacle but an opportunity to consult with experienced reviewers who offer different perspectives.

🔄 Responding to IRB feedback

  • Address IRB questions or concerns promptly and in good faith.
  • Be prepared to make further modifications to your design and procedure before resubmitting.
  • Example: if the IRB raises concerns about participant risk, you may need to add safeguards or change procedures.

🛡️ Minimizing risks and deception

🛡️ Concrete steps before the study

The excerpt emphasizes several practical actions:

  • Make changes to your research design: adjust procedures to reduce potential harm.
  • Prescreen participants: identify and eliminate high-risk individuals who might be harmed.
  • Provide maximum information: give participants as much detail as possible during informed consent and debriefing.
  • Schedule adequate time: informed consent and debriefing cannot be effective if rushed.

🚫 When deception is allowed

The APA Ethics Code allows deception when: the benefits outweigh the risks, participants cannot reasonably be expected to be harmed, there is no way to conduct the study without deception, and participants are informed of the deception as soon as possible.

  • Some researchers argue deception is never justified, but the APA code permits it under strict conditions.
  • Don't confuse: using deception does not mean you can skip debriefing—participants must be informed "as soon as possible."

🔍 Ongoing ethical responsibilities

🔍 During the research

  • Stick to your approved protocol or seek additional approval for anything beyond minor changes.
  • Monitor participants for unanticipated reactions: the excerpt cites Milgram's study as a cautionary example—after the first several participants had severe negative reactions, adjustments should have been made.
  • Seek feedback during debriefing: learn whether participants experienced unexpected harm or distress.

🔒 Protecting confidentiality

  • Keep consent forms and data safe and separate from each other.
  • Ensure no one (intentionally or unintentionally) has access to any participant's personal information.
  • Be alert for potential violations throughout the study.

📝 Through publication and beyond

Responsibility | What it means
Address authorship early | Decide with collaborators at the start who will be authors and in what order
Avoid plagiarism | Maintain integrity in your writing
Report honestly | Your scientific goal is to learn how the world actually is; your duty is to report results accurately
Do not fabricate or alter data | Never be tempted to change results, even unexpected ones; unexpected results are often as interesting as expected ones, or more so

🧠 The framework for ethical thinking

🧠 Four moral principles

The excerpt identifies four principles that must be considered:

  1. Weighing risks against benefits: balance potential harm against potential knowledge gained.
  2. Acting responsibly and with integrity: follow through on commitments and maintain honesty.
  3. Seeking justice: ensure fair treatment and distribution of research burdens and benefits.
  4. Respecting people's rights and dignity: honor autonomy and protect vulnerable individuals.

👥 Three groups to consider

Apply each principle to:

  • Research participants: those directly involved in the study.
  • Science: the broader scientific community and knowledge base.
  • Society: the public who may be affected by research findings or applications.

Example: a study might benefit science (new knowledge) but pose risks to participants (distress) and raise justice concerns for society (if findings are misused).

📚 Key ethics codes and standards

📚 Major written codes

The excerpt lists several foundational documents:

  • Nuremberg Code: early post-WWII ethics guidelines.
  • Declaration of Helsinki: international medical research ethics.
  • Belmont Report: U.S. framework for research ethics.
  • Federal Policy for the Protection of Human Subjects: U.S. government regulations.

📖 APA Ethics Code Standard 8

Standard 8 of the APA Ethics Code is the most important for psychology researchers, covering informed consent, deception, debriefing, use of nonhuman animal subjects, and scholarly integrity.

  • The full APA Ethics Code includes many standards for clinical practice, but Standard 8 is specifically for research.
  • Don't confuse: the APA code is broader than just research ethics, but Standard 8 is the key section for researchers.

✅ Informed consent essentials

✅ What informed consent really is

Informed consent: the process of obtaining and documenting people's agreement to participate in a study, having informed them of everything that might reasonably be expected to affect their decision.

  • It is a process, not a single event.
  • Although it often involves reading and signing a consent form, it is not equivalent to signing a form.
  • The goal is to ensure participants understand what they are agreeing to and can make a free choice.

📄 What to include

According to Standard 8.02 (referenced in the exercises), informed consent should cover:

  • Nature of the research
  • Risks and benefits
  • Right to withdraw
  • Confidentiality protections
  • Contact information for questions
  • Any other information that might reasonably affect the decision to participate

Example: if a study involves discussing traumatic experiences, participants must be told this in advance so they can decide whether to participate.


19. Understanding Psychological Measurement

🧭 Overview

🧠 One-sentence thesis

Psychological measurement assigns scores systematically to represent characteristics of individuals, requiring researchers to define constructs both conceptually and operationally, then choose appropriate measurement levels to capture meaningful information.

📌 Key points (3–5)

  • What measurement is: systematically assigning scores to represent characteristics, not requiring specific instruments—only a systematic procedure.
  • Psychological constructs: variables like personality traits or emotions that cannot be observed directly because they represent tendencies or internal processes, not single observable behaviors.
  • Conceptual vs operational definitions: conceptual definitions describe what a construct is and how it relates to other variables; operational definitions specify exactly how to measure it.
  • Common confusion: constructs vs simple variables—age and weight are straightforward to measure, but constructs like self-esteem or extraversion require indirect measurement because they summarize complex behaviors and internal processes.
  • Levels of measurement matter: nominal, ordinal, interval, and ratio scales communicate different amounts of quantitative information and determine which statistical procedures are appropriate.

📏 What measurement means in psychology

📏 The general definition

Measurement: the assignment of scores to individuals so that the scores represent some characteristic of the individuals.

  • This definition applies to everyday measurement (bathroom scales, thermometers) and scientific measurement in all fields.
  • In psychology (often called psychometrics), measurement does not require special instruments—only a systematic procedure.
  • Example: A cognitive psychologist measures working memory capacity using a backward digit span task (reading digits, asking the person to repeat them in reverse order, increasing the list length until an error occurs). The longest correct list length is the score representing working memory capacity.
  • Example: A clinical psychologist measures depression using the Beck Depression Inventory (21 self-report items about symptoms over the past 2 weeks). The sum of ratings is the score representing current depression level.
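The "systematic procedure" idea can be sketched directly in code. The two scoring rules below are simplified illustrations of the excerpt's examples, not the official instruments:

```python
def backward_digit_span(trial_correct):
    """Score a backward digit span task (simplified sketch).

    trial_correct[i] is True if the list of length (i + 2) was repeated
    correctly in reverse order; testing stops at the first error, and the
    score is the longest list length repeated correctly."""
    score = 0
    for i, correct in enumerate(trial_correct):
        if not correct:
            break
        score = i + 2
    return score

def depression_score(item_ratings):
    """BDI-style score (sketch): the sum of the 21 self-report item ratings."""
    assert len(item_ratings) == 21, "expected one rating per item"
    return sum(item_ratings)

print(backward_digit_span([True, True, True, False]))  # lengths 2-4 correct -> 4
print(depression_score([1] * 21))                      # every item rated 1 -> 21
```

Whatever the instrument, the score comes from a fixed, repeatable rule; that is what makes the procedure a measurement.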

🔑 The key requirement

  • Measurement requires some systematic procedure for assigning scores so they represent the characteristic of interest.
  • It does not depend on what tools you use or how complex the procedure is.

🧩 Psychological constructs

🧩 What constructs are

Constructs: variables that cannot be observed directly, including personality traits (e.g., extraversion), emotional states (e.g., fear), attitudes (e.g., toward taxes), and abilities (e.g., athleticism).

  • Most variables psychologists study are constructs, not simple variables like age, height, or weight.
  • Constructs are pronounced "CON-structs."

🚫 Why constructs cannot be observed directly

Reason 1: They represent tendencies

  • Constructs often describe general tendencies to think, feel, or act in certain ways across situations.
  • Example: Saying a student is highly extraverted does not mean she is behaving extravertedly right now (she might be sitting quietly reading). It means she has a general tendency to behave in extraverted ways (outgoing, enjoying social interactions) across various situations.

Reason 2: They involve internal processes

  • Constructs often involve internal processes not obvious to outside observers.
  • Example: Fear involves activation of nervous system structures, plus certain thoughts, feelings, and behaviors—none necessarily visible.
  • Don't confuse: A construct is not reducible to any single thought, feeling, act, or physiological process. It is a summary of a complex set of behaviors and internal processes.

🌟 Example: The Big Five personality dimensions

The excerpt presents the Big Five as five broad dimensions capturing much variation in human personality, each defined by six more specific "facets":

Dimension | Example facets
Openness to experience | Fantasy, Aesthetics, Feelings, Actions, Ideas, Values
Conscientiousness | Competence, Order, Dutifulness, Achievement/Striving, Self-discipline, Deliberation
Extraversion | Warmth, Gregariousness, Assertiveness, Activity, Excitement seeking, Positive emotions
Agreeableness | Trust, Straightforwardness, Altruism, Compliance, Modesty, Tender-mindedness
Neuroticism | Worry, Anger, Discouragement, Self-consciousness, Impulsivity, Vulnerability

🔬 Conceptual vs operational definitions

🔬 Conceptual definitions

Conceptual definition: describes the behaviors and internal processes that make up a construct, along with how it relates to other variables.

  • Example: Neuroticism is people's tendency to experience negative emotions (anxiety, anger, sadness) across various situations. It has a strong genetic component, remains fairly stable over time, and is positively correlated with experiencing pain and other physical symptoms.
  • Researchers develop conceptual definitions that are more detailed, precise, and empirically accurate than dictionary definitions.
  • The research literature often includes different conceptual definitions of the same construct because researchers propose definitions, test them empirically, and revise or replace them as needed.

🔧 Operational definitions

Operational definition: a definition of a variable in terms of precisely how it is to be measured.

Three broad categories of measures:

Category | Description | Example from excerpt
Self-report measures | Participants report on their own thoughts, feelings, and actions | Rosenberg Self-Esteem Scale
Behavioral measures | Some aspect of participants' behavior is observed and recorded | Bandura's operational definition of physical aggression: counting specific acts (hitting a Bobo doll with a mallet, punching it, kicking it) during a 20-minute period
Physiological measures | Recording physiological processes | Heart rate, blood pressure, galvanic skin response, hormone levels, brain electrical activity and blood flow

🔄 Multiple operational definitions and converging operations

  • Any given construct will have multiple operational definitions.
  • Example: Stress has been operationally defined as:
    • Social Readjustment Rating Scale: self-report of stressful events in the past year with severity points (e.g., divorce = 73 points, job change = 36 points).
    • Hassles and Uplifts Scale: focuses on everyday stressors like misplacing things.
    • Perceived Stress Scale: focuses on feelings of stress (e.g., "How often have you felt nervous and stressed?").
    • Physiological variables: blood pressure, cortisol levels.

Converging operations: using multiple operational definitions of the same construct, either within a study or across studies, so the various definitions "converge" on the same construct.

  • When scores from different operational definitions are closely related and produce similar patterns of results, this is good evidence the construct is being measured effectively and is useful.
  • Example: Various stress measures are all correlated with each other and with immune system functioning. This allows researchers to draw general conclusions like "stress is negatively correlated with immune system functioning" rather than narrow conclusions about specific scales.
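Convergence can be checked by correlating scores from the different operational definitions. The sketch below uses made-up scores for six people (the variable names and values are hypothetical, not from any real scale):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for six people on three operational definitions of stress.
life_events = [3, 7, 5, 9, 4, 8]    # event-checklist style score
daily_hassles = [2, 8, 5, 9, 3, 7]  # everyday-stressor style score
perceived = [4, 6, 5, 8, 4, 9]      # felt-stress style score

# If the measures converge on the same construct, every pair should correlate.
pairs = [(life_events, daily_hassles),
         (life_events, perceived),
         (daily_hassles, perceived)]
for a, b in pairs:
    print(round(pearson(a, b), 2))
```

High pairwise correlations are the pattern that licenses general conclusions about "stress" rather than about one particular scale.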

📊 Levels of measurement

📊 Why levels matter

Psychologist S. S. Stevens suggested that scores can communicate more or less quantitative information about a variable. The level of measurement determines:

  • What type of information the scores communicate.
  • Which statistical procedures can be used.
  • What conclusions can be drawn.

Example: Race officials could rank runners (first, second, etc.) or time them with a stopwatch (11.5 s, 12.1 s, etc.). Both methods convey how the runners performed relative to one another, but the stopwatch also communicates how much longer one runner took than another.

🏷️ Nominal level

Nominal level: used for categorical variables; involves assigning scores that are category labels.

  • Category labels communicate whether two individuals are the same or different on the variable.
  • Example: Asking about marital status or ethnicity.
  • Key limitation: No ordering among responses. Green is not "ahead of" blue when classifying favorite colors.
  • This is the lowest level of measurement.
  • Only the mode can be used as a measure of central tendency.

📶 Ordinal level

Ordinal level: assigning scores that represent rank order of individuals.

  • Ranks communicate whether individuals are the same/different and whether one is higher or lower on the variable.
  • Example: Consumer satisfaction with microwaves rated as "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." The items are ordered from least to most satisfied.
  • Key limitation: The difference between two levels cannot be assumed equal to the difference between two other levels. The gap between "very dissatisfied" and "somewhat dissatisfied" may not equal the gap between "somewhat dissatisfied" and "somewhat satisfied."
  • Statisticians say: differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale.
  • Median or mode can be used as measures of central tendency.

📏 Interval level

Interval level: assigning scores using numerical scales in which intervals have the same interpretation throughout.

  • Example: Fahrenheit or Celsius temperature scales. The difference between 30° and 40° represents the same temperature difference as between 80° and 90°.
  • Key limitation: No true zero point, even if a value is labeled "zero." Zero degrees Fahrenheit does not represent complete absence of temperature. The "zero" label is applied for historical reasons.
  • Because there is no true zero, ratios do not make sense. You cannot say 80° is "twice as hot" as 40° because this depends on an arbitrary decision about where to start the scale.
  • Example in psychology: IQ is often considered interval level. A score of 0 would not indicate complete absence of IQ, and someone with IQ 140 does not have twice the IQ of someone with IQ 70. However, the difference between IQ 80 and 100 equals the difference between IQ 120 and 140.
  • Mean, median, or mode can be used as measures of central tendency.
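The "no meaningful ratios" point can be checked directly: the same two temperatures give different ratios depending on where the scale happens to put its zero.

```python
def f_to_c(f):
    """Convert Fahrenheit to Celsius (two interval scales with different zeros)."""
    return (f - 32) * 5 / 9

ratio_f = 80 / 40                  # 2.0 on the Fahrenheit scale
ratio_c = f_to_c(80) / f_to_c(40)  # 6.0 on the Celsius scale: same heat, different "ratio"
```

Because the ratio changes with the arbitrary choice of zero point, "twice as hot" has no fixed meaning at the interval level.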

📐 Ratio level

Ratio level: assigning scores with a true zero point that represents the complete absence of the quantity.

  • Examples: Height in meters, weight in kilograms, counts of discrete objects (number of siblings, number of correct answers on an exam).
  • A ratio scale combines all three earlier scales: it provides labels (like nominal), ordering (like ordinal), and equal intervals (like interval), plus equal ratios at different places on the scale have the same meaning.
  • Example: The Kelvin temperature scale has absolute zero, making it a ratio scale. If one temperature is twice as high as another on the Kelvin scale, it has twice the kinetic energy.
  • Example: Money in your pocket (25 cents, 50 cents). Zero money implies absence of money, so someone with 50 cents has twice as much as someone with 25 cents.
  • Any measure of central tendency (mean, median, mode) can be used.
  • Only ratio-level measurement allows meaningful statements about ratios of scores.

📋 Summary table

Level | Category labels | Rank order | Equal intervals | True zero | Central tendency options
Nominal | Yes | No | No | No | Mode only
Ordinal | Yes | Yes | No | No | Median or mode
Interval | Yes | Yes | Yes | No | Mean, median, or mode
Ratio | Yes | Yes | Yes | Yes | Mean, median, or mode

⚠️ Don't confuse

  • Interval vs ratio: Both have equal intervals, but only ratio has a true zero. This means only ratio-level measurements allow meaningful ratio statements (e.g., "twice as much").
  • Ordinal vs interval: Ordinal tells you the order but not whether gaps are equal; interval guarantees equal gaps throughout the scale.

20. Reliability and Validity of Measurement

🧭 Overview

🧠 One-sentence thesis

Psychologists must collect data to demonstrate that their measures work by assessing reliability (consistency) and validity (whether scores truly represent the intended construct), rather than simply assuming their measures are accurate.

📌 Key points (3–5)

  • Core principle: Researchers do not assume measures work—they collect data to demonstrate effectiveness and stop using measures that fail to perform.
  • Reliability types: Consistency across time (test-retest), across items (internal consistency), and across raters (inter-rater reliability).
  • Validity types: Face validity (appears to measure the construct), content validity (covers the full construct), and criterion validity (correlates with expected variables).
  • Common confusion: A measure can be extremely reliable but have no validity—reliability is necessary but not sufficient for a good measure.
  • Convergent vs discriminant validity: Scores should correlate with similar constructs (convergent) but not with conceptually distinct ones (discriminant).

🔄 Understanding Reliability

🔄 What reliability measures

Reliability: the consistency of a measure.

  • Reliability is not about accuracy or truth; it's about whether the measure produces consistent results.
  • Psychologists examine three dimensions of consistency, each appropriate for different situations.
  • Bathroom scale analogy: if you've been dieting and your clothes fit loosely, but the scale says you gained 10 pounds, you would conclude the scale is broken and stop trusting it; in the same way, researchers stop using measures that fail to perform consistently.

⏱️ Test-retest reliability

Test-retest reliability: the extent to which a measure produces consistent scores across time for constructs assumed to be stable.

When it applies:

  • Used for constructs assumed to be consistent over time (intelligence, self-esteem, Big Five personality dimensions).
  • Example: A highly intelligent person today should score similarly next week on a good intelligence measure.

How to assess:

  • Administer the measure to the same group twice at different times.
  • Graph data in a scatterplot and compute the correlation coefficient.
  • A correlation of +.80 or greater indicates good reliability.
  • The excerpt mentions a Rosenberg Self-Esteem Scale example with a correlation of +.95 (excellent).

Don't confuse: Not all constructs should be stable—mood changes by nature, so a low test-retest correlation for mood over a month would not be concerning.
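The assessment procedure above is just a Pearson correlation between the two administrations. A minimal sketch, using made-up scores for five people tested two weeks apart:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical self-esteem scores for five people at two testing sessions.
time1 = [22, 25, 30, 18, 27]
time2 = [21, 26, 29, 19, 28]

r = pearson(time1, time2)  # compare against the +.80 rule of thumb
print(round(r, 2))
```

For this fabricated data the scores track closely across sessions, so r comfortably exceeds the +.80 threshold.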

🧩 Internal consistency

Internal consistency: the consistency of people's responses across items on a multiple-item measure.

Why it matters:

  • All items should reflect the same underlying construct, so responses should correlate with each other.
  • Example: On the Rosenberg Self-Esteem Scale, people who agree they are "a person of worth" should tend to agree they have "a number of good qualities."
  • Applies to behavioral and physiological measures too (e.g., consistently high or low bets across trials in a risk-seeking game).

Assessment methods:

Method | Description | Good threshold
Split-half correlation | Split items into two sets (e.g., even/odd or first/second half), compute scores for each set, examine the correlation between them | +.80 or greater
Cronbach's α (alpha) | The mean of all possible split-half correlations for a set of items | +.80 or greater

  • Example: The excerpt describes a split-half correlation of +.88 for even vs. odd items on the Rosenberg Self-Esteem Scale.
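Both statistics can be computed from a matrix of item responses. The data below is invented (four items, five respondents), and the α formula used is the standard variance form, which equals the mean-of-split-halves definition under its usual assumptions:

```python
from math import sqrt
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

def split_half(items):
    """Correlate totals from odd-numbered items with totals from even-numbered items."""
    odd = [sum(s) for s in zip(*items[0::2])]
    even = [sum(s) for s in zip(*items[1::2])]
    return pearson(odd, even)

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(items)
    totals = [sum(s) for s in zip(*items)]
    item_var = sum(pvariance(i) for i in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

# Hypothetical responses: each row is one item's scores across five respondents.
items = [[1, 2, 3, 4, 5],
         [2, 2, 3, 5, 5],
         [1, 3, 3, 4, 4],
         [2, 2, 4, 4, 5]]

print(round(split_half(items), 2), round(cronbach_alpha(items), 2))
```

Because these invented respondents answer all four items in a similar way, both statistics land above the +.80 rule of thumb.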

👥 Inter-rater reliability

Inter-rater reliability: the extent to which different observers are consistent in their judgments.

When it's needed:

  • Behavioral measures involving significant observer judgment.
  • Example: Video-recording students' social skills during first meetings, then having multiple observers rate each student—ratings should be highly correlated if social skills are detectable.
  • Also relevant to the Bandura Bobo doll study mentioned: observers' counts of aggressive acts should correlate highly.

Assessment:

  • Cronbach's α for quantitative judgments.
  • Cohen's κ (kappa) for categorical judgments.

✅ Understanding Validity

✅ Core validity concept

Validity: the extent to which scores from a measure represent the variable they are intended to measure.

Key insight:

  • Reliability is necessary but not sufficient for validity.
  • The "absurd example" from the excerpt: measuring self-esteem by index finger length would have excellent test-retest reliability but absolutely no validity—finger length indicates nothing about self-esteem.
  • Validity requires multiple types of evidence beyond reliability.

👁️ Face validity

Face validity: the extent to which a measurement method appears "on its face" to measure the construct of interest.

Characteristics:

  • Usually assessed informally, based on whether the measure seems related to the construct.
  • Example: A self-esteem questionnaire including items about being "a person of worth" has good face validity; finger-length measurement has poor face validity.

Important limitation:

  • Face validity is "at best a very weak kind of evidence."
  • Based on intuitions about human behavior, which are frequently wrong.
  • Many established measures work well despite lacking face validity.
  • Example: The MMPI-2 uses items like "I enjoy detective or mystery stories" to measure aggression suppression—no obvious relationship, but the pattern of responses matches those who suppress aggression.

📋 Content validity

Content validity: the extent to which a measure "covers" the construct of interest.

How it works:

  • Check the measurement method against the conceptual definition of the construct.
  • Example: If test anxiety is defined as involving both nervous system activation (nervous feelings) and negative thoughts, the measure should include items about both.
  • Example: Attitudes involve thoughts, feelings, and actions—so measuring attitudes toward exercise must reflect all three aspects.

Assessment:

  • Not usually quantitative; assessed by carefully checking against the conceptual definition.

🎯 Criterion validity

Criterion validity: the extent to which people's scores on a measure are correlated with other variables (criteria) that one would expect them to be correlated with.

Types:

Type | Timing | Example
Concurrent validity | Criterion measured at the same time | Test anxiety scores correlated with blood pressure during an exam
Predictive validity | Criterion measured in the future | Test anxiety scores predict future exam performance

How it works:

  • Test anxiety scores should be negatively correlated with exam performance and course grades.
  • Test anxiety scores should be positively correlated with general anxiety.
  • Physical risk-taking scores should correlate with extreme activities, speeding tickets, and broken bones.

Criteria can include other measures of the same construct:

  • Example: The Need for Cognition Scale (measuring how much people value thinking) was shown to correlate positively with academic achievement and negatively with dogmatism.
  • Over the years, it has been correlated with advertisement effectiveness, interest in politics, and juror decisions.

🔗 Convergent and discriminant validity

Convergent validity:

Convergent validity: the extent to which scores on a measure are correlated with other measures of the same construct.

  • New measures should correlate positively with existing established measures of the same construct.
  • Provides evidence that the measure captures the intended construct.

Discriminant validity:

Discriminant validity: the extent to which scores on a measure are NOT correlated with measures of conceptually distinct variables.

  • Self-esteem (stable general attitude) should not be highly correlated with mood (current feeling).
  • If a new self-esteem measure highly correlates with mood, it may be measuring mood instead.

Example from the excerpt:

  • Need for Cognition Scale showed only weak correlation with cognitive style (analytical vs. holistic thinking).
  • No correlation with test anxiety or social desirability—all these low correlations provide evidence of discriminant validity.

Don't confuse: Convergent validity requires high correlations with similar constructs; discriminant validity requires low correlations with distinct constructs—both are needed to demonstrate a measure captures what it's supposed to and nothing else.


21. Practical Strategies for Psychological Measurement

🧭 Overview

🧠 One-sentence thesis

Measuring psychological constructs requires a systematic four-step process—conceptual definition, operational definition, implementation, and evaluation—with careful attention to reliability, validity, and minimizing participant reactivity at every stage.

📌 Key points (3–5)

  • The four-step measurement process: conceptually define the construct, operationally define it, implement the measure, and evaluate its reliability and validity.
  • Use existing vs. create new measures: existing measures save time and allow comparison with prior research, but new measures may be needed when none exist or to test convergent validity.
  • Common confusion—single vs. multiple items: single items are vulnerable to random errors and misunderstandings; multiple items that can be summed or averaged produce more reliable scores and better content validity.
  • Participant reactivity threatens validity: socially desirable responding, demand characteristics, and researcher expectations can bias results unless precautions like anonymity and standardized procedures are used.
  • Evaluation is always necessary: even well-established measures must be assessed for reliability and validity in your specific sample and testing conditions.

🎯 The four-step measurement process

🎯 Step 1: Conceptually defining the construct

A clear and complete conceptual definition of a construct is a prerequisite for good measurement.

  • You must know precisely what you mean by the construct before you can measure it.
  • Vague definitions lead to poor measurement choices.
  • Example: "memory" is too broad—you must specify whether you mean long-term episodic memory, skill memory, prospective memory, etc.
  • How to do it: read the research literature carefully and pay attention to how others have defined the construct.
  • Don't confuse: a conceptual definition (what the construct means theoretically) with an operational definition (how you will measure it).

🎯 Step 2: Operationally defining the construct

An operational definition is a definition of the variable in terms of precisely how it is to be measured.

  • Most psychological variables are abstract and cannot be directly observed, so they must be transformed into something observable.
  • The same construct can be operationally defined in many different ways.
  • Example: stress can be measured as scores on the Perceived Stress Scale, cortisol concentrations in saliva, or number of recent stressful life events.
  • This step involves deciding whether to use an existing measure or create your own.

🎯 Step 3: Implementing the measure

  • Administer the measure in ways that maximize reliability and validity.
  • Covered in detail in a later section.

🎯 Step 4: Evaluating the measure

  • Assess the reliability and validity of the measure with your specific sample and conditions.
  • Covered in detail in a later section.

🔍 Choosing between existing and new measures

🔍 Advantages of using existing measures

Three main advantages:

  1. You save time and effort creating your own
  2. There is already evidence the measure is valid (if used successfully before)
  3. Your results can be compared and combined with previous results
  • Other researchers may expect you to use an existing reliable and valid measure unless you have a good reason not to.
  • If multiple existing measures are available, choose based on: most common, best reliability/validity evidence, best fit for your specific aspect of interest, or easiest to use.
  • Example: The Ten-Item Personality Inventory (TIPI) measures all Big Five personality dimensions with just 10 items—less reliable than longer measures, but useful when testing time is severely limited.

🔍 Where to find existing measures

For research measures (usually free with proper citation):

  • Detailed descriptions in published research articles
  • Later articles may describe briefly and reference the original article
  • Directory of Unpublished Experimental Measures (APA publication)
  • PsycTESTS (APA catalog/collection)

For proprietary measures (must be purchased):

  • Many clinical psychology applications
  • Examples: standard intelligence tests, Beck Depression Inventory, Minnesota Multiphasic Personality Inventory (MMPI)
  • Details in reference books: Tests in Print and Mental Measurements Yearbook (often in university libraries)

🔍 When to create your own measure

Valid reasons:

  • No existing measure of your construct
  • Existing measures are too difficult or time-consuming
  • You want to evaluate convergent validity by seeing if a new measure works like existing ones

🛠️ Guidelines for creating new measures

🛠️ Start with existing measures as templates

  • Most new measures are variations of existing measures.
  • Look to research literature for ideas.
  • Possible modifications: adapt an existing questionnaire, create a paper-and-pencil version of a computerized measure (or vice versa), or adapt a measure used for another purpose.
  • Example: The Stroop task (quickly naming colors that color words are printed in) was adapted for social anxiety research—people high in social anxiety are slower at color naming when words have negative social connotations like "stupid."

🛠️ Strive for simplicity

Why simplicity matters:

  • Participants are not as interested in your research as you are
  • They vary widely in ability to understand and carry out tasks

How to achieve simplicity:

  • Create clear instructions using simple language
  • Present instructions in writing or read aloud (or both)
  • Include one or more practice items so participants can become familiar
  • Build in opportunity for questions before continuing
  • Keep the measure brief to avoid boring or frustrating participants

🛠️ Use multiple items, not single items

Two reasons multiple items are better:

| Reason | Explanation |
| --- | --- |
| Content validity | Multiple items often required to cover a construct adequately |
| Reliability | Single items can be influenced by irrelevant factors (misunderstanding, distraction, simple errors); when several responses are summed or averaged, these irrelevant factors cancel out |

Important requirement: Multiple items must be structured so they can be combined into a single overall score by summing or averaging.

  • Don't confuse: asking different types of questions (annual income, credit score, thriftiness rating) is not a true multiple-item measure if there's no obvious way to combine responses.
  • Example of proper multiple-item measure: have people rate the degree to which 10 statements about financial responsibility describe them on the same five-point scale.
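As a rough illustration of that requirement (the ratings below are invented), items that all use the same five-point scale can be combined into a single overall score by simple summing or averaging:

```python
# Hypothetical ratings of 10 financial-responsibility statements,
# each on the same five-point scale (1 = not at all, 5 = very much).
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]

# Because every item uses the same scale, the responses can be combined
# into a single overall score by summing or averaging them.
total_score = sum(ratings)                # summed score
mean_score = sum(ratings) / len(ratings)  # averaged score
```

Annual income, credit score, and a thriftiness rating could not be combined this way, which is why that mix is not a true multiple-item measure.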

🛠️ Test your measure on several people first

Best way to ensure quality:

  • Observe them as they complete the task
  • Time them
  • Ask them afterward to comment on: how easy or difficult it was, whether instructions were clear, and anything else you wonder about
  • Better to discover problems before large-scale data collection begins

⚙️ Implementing measures effectively

⚙️ Optimize testing conditions

  • Test everyone under similar conditions
  • Ideally: quiet and free of distractions
  • Group testing is efficient but can create distractions that reduce reliability and validity
  • Use previous research as a guide—if others successfully tested people in groups with a particular measure, consider doing so too

⚙️ Minimize participant reactivity

Three types of reactivity that threaten validity:

| Type | Definition | Example |
| --- | --- | --- |
| Socially desirable responding | Participants do or say things because they believe they are socially appropriate | People with low self-esteem agree that they feel they are "a person of worth," not because they really feel this way but because they believe it is the socially appropriate response |
| Demand characteristics | Subtle cues that reveal how the researcher expects participants to behave | A participant whose attitude toward exercise is measured immediately after reading about heart disease dangers might respond more favorably because she believes she is expected to |
| Researcher expectation effects | The researcher's own expectations bias participants' behaviors in unintended ways | (Not detailed in excerpt) |

⚙️ Precautions to reduce reactivity

Procedural safeguards:

  • Make the procedure clear and brief so participants aren't tempted to vent frustrations on your results
  • Guarantee participants' anonymity and make this clear to them
  • Seat group participants far enough apart that they cannot see each other's responses
  • Give everyone the same type of writing implement (so they can't be identified by, e.g., a pink glitter pen)
  • Allow participants to seal completed questionnaires into individual envelopes or put them into a drop box

Information management:

  • Informed consent requires telling participants what they will be doing, but does not require revealing your hypothesis or information suggesting how you expect them to respond
  • Example: A questionnaire measuring financial responsibility need not be titled "Are You Financially Responsible?"—it could be "Money Questionnaire" or have no title at all

Researcher blinding:

  • Have the measure administered by a helper who is "blind" (unaware of the measure's intent or any hypothesis being tested)
  • Standardize all interactions between researchers and participants (e.g., always reading the same instructions word for word)

📊 Evaluating your measure

📊 Why evaluation is always necessary

  • Even if a measure has been used extensively by other researchers with evidence of reliability and validity, you should not assume it worked as expected for your particular sample and testing conditions
  • You now have additional evidence bearing on reliability and validity that should be added to the research literature

📊 Assessing reliability

Test-retest reliability:

  • In most research designs, not possible because participants are tested at only one time
  • For a new measure, you might design a study specifically to assess test-retest reliability by testing the same participants at two separate times
  • Sometimes a study designed to answer a different question still allows assessment
  • Example: A psychology instructor measures students' attitudes toward critical thinking at the beginning and end of the semester—even if there's no change, he can look at the correlation between scores at the two times
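A minimal sketch of that assessment, with hypothetical scores: test-retest reliability is just the Pearson correlation between the same participants' scores at the two testing times.

```python
# Hypothetical attitude scores for six participants, measured twice.
time1 = [32, 28, 40, 25, 35, 30]
time2 = [30, 27, 41, 26, 33, 31]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A high positive r (conventionally above about +.80) indicates good
# test-retest reliability.
r = pearson_r(time1, time2)
```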

Internal consistency:

  • Customary to assess for any multiple-item measure
  • Usually by looking at split-half correlation or Cronbach's alpha
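One common formula for Cronbach's alpha can be sketched directly; the responses below are hypothetical, and the sketch uses population variances throughout (a standard convention when numerator and denominator match).

```python
# Sketch: Cronbach's alpha for hypothetical responses.
# Rows are participants, columns are items on the same scale.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]

def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(responses[0])  # number of items
item_vars = [variance([row[i] for row in responses]) for i in range(k)]
total_var = variance([sum(row) for row in responses])

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Values of alpha near +.80 or above are conventionally taken as evidence of good internal consistency.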

📊 Assessing criterion validity

Multiple strategies:

  • If your study included more than one measure of the same construct or measures of conceptually distinct constructs, look at correlations among these measures to ensure they fit your expectations
  • A successful experimental manipulation provides evidence of criterion validity
  • Example: MacDonald and Martineau manipulated participants' moods by having them think positive or negative thoughts; their mood measure showed a distinct difference between groups, simultaneously providing evidence that the mood manipulation worked and that the mood measure was valid

📊 What if data cast doubt on reliability or validity?

Ask why—several possibilities:

  • Something wrong with your measure or how you administered it
  • Something wrong with your conceptual definition
  • Your experimental manipulation failed
  • Example: If a mood measure showed no difference between people instructed to think positive versus negative thoughts, maybe participants didn't actually think the thoughts they were supposed to, or the thoughts didn't actually affect their moods

Next steps: "Back to the drawing board" to revise the measure, revise the conceptual definition, or try a new manipulation.

22

Key Takeaways and Exercises

22. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Psychological measurement requires researchers to continuously evaluate and demonstrate that their measures are both reliable (consistent) and valid (actually measuring what they intend), rather than simply assuming their tools work.

📌 Key points (3–5)

  • Measurement fundamentals: Psychological constructs (intelligence, self-esteem, depression) are not directly observable, so researchers must define them conceptually and measure them operationally using multiple methods.
  • Four levels of measurement: Variables can be measured at nominal, ordinal, interval, or ratio levels, each providing different amounts of quantitative information and affecting statistical choices.
  • Reliability vs validity: Reliability means consistency (across time, items, and raters); validity means the scores actually represent the intended construct—these are distinct criteria.
  • Common confusion: Researchers cannot assume measures work—they must conduct research to demonstrate reliability and validity, and this assessment is ongoing, not one-time.
  • Practical approach: Good measurement starts with clear conceptual definitions, involves choosing or creating appropriate measures, and requires reevaluation with each new dataset.

📏 What psychological measurement involves

📏 Definition and scope

Measurement is the assignment of scores to individuals so that the scores represent some characteristic of the individuals.

  • Psychological measurement can use:
    • Self-report (questionnaires, scales)
    • Behavioral observation (actions, response times)
    • Physiological measures (biological indicators)

🔍 Psychological constructs

  • Constructs like intelligence, self-esteem, and depression are not directly observable.
  • They represent behavioral tendencies or complex patterns of behavior and internal processes.
  • Researchers must conceptually define these constructs accurately to describe what they actually are.

🔧 Operational definitions

  • For any conceptual definition, there are many different operational definitions (ways of measuring).
  • Converging operations: using multiple operational definitions is a common strategy.
  • Example: Sexual jealousy could be measured through self-report questionnaires, behavioral observations of reactions, or physiological responses.

📊 Levels of measurement

📊 Four distinct levels

The excerpt identifies four levels that communicate increasing amounts of quantitative information:

| Level | What it measures | Example from exercises |
| --- | --- | --- |
| Nominal | Categories without order | Right-handed vs. left-handed |
| Ordinal | Ranked order without equal intervals | Exam completion order (1st, 2nd, 3rd) |
| Interval | Equal intervals, no true zero | (Not explicitly exemplified) |
| Ratio | Equal intervals with true zero | Number of doctor visits |

📊 Why it matters

  • The level of measurement affects:
    • What kinds of statistics you can use
    • What conclusions you can draw from your data

🎯 Reliability: consistency of measurement

🎯 Three types of reliability

Reliability means consistency across different dimensions:

  1. Test-retest reliability: consistency across time

    • Does the measure give similar results when administered at different times?
  2. Internal consistency: consistency across items

    • Do different items on the same measure correlate with each other?
    • Example: The excerpt mentions assessing split-half correlation (even vs. odd items) for the Rosenberg Self-Esteem Scale.
  3. Interrater reliability: consistency across researchers

    • Do different observers or raters produce similar scores?

🎯 Don't confuse with validity

  • Reliability is about consistency, not accuracy.
  • A measure can be reliable (consistent) but still not valid (not measuring what it claims to measure).

✅ Validity: measuring what you intend

✅ Core definition

Validity is the extent to which the scores actually represent the variable they are intended to.

  • Validity is a judgment based on various types of evidence, not a single test.
  • It is an ongoing assessment, not a one-time determination.

✅ Types of evidence for validity

Face and content validity:

  • Does the measure appear to cover the construct of interest?
  • Example: An exam's face validity relates to whether it appears to measure the intended knowledge.

Criterion validity:

  • Are scores correlated with other variables they should be correlated with?
  • Are scores not correlated with conceptually distinct variables?
  • Example: A successful experimental manipulation (mood induction showing differences between positive/negative thought groups) provides evidence of criterion validity.

✅ What to do when validity is questioned

The excerpt emphasizes asking "why" if new data cast doubt on reliability or validity:

  • Maybe something is wrong with the measure or how it was administered
  • Maybe the conceptual definition needs revision
  • Maybe the experimental manipulation failed
  • Example: If a mood measure shows no difference between positive and negative thought groups, perhaps participants didn't think the intended thoughts or the thoughts didn't affect mood
  • Result: "back to the drawing board" to revise the measure, definition, or manipulation

🛠️ Practical strategies for good measurement

🛠️ Starting point: clear conceptual definition

  • Good measurement begins with clear and detailed thinking about what you want to measure.
  • Review the research literature to understand how others have defined the construct.

🛠️ Choosing or creating measures

Decision factors:

  • Availability of existing measures
  • Adequacy of existing measures for your purposes
  • If existing measures are available and adequate, use them; if not, create new ones.

🛠️ Maximizing reliability and validity

  • Take several simple steps when creating new measures.
  • Apply these steps when implementing both existing and new measures.
  • The excerpt emphasizes that these steps "can help maximize reliability and validity."

🛠️ Ongoing evaluation

  • Once you have used a measure, reevaluate its reliability and validity based on your new data.
  • Don't confuse: Assessment of reliability and validity is an ongoing process, not a one-time event.
  • Researchers do not simply assume their measures work—they conduct research to demonstrate effectiveness.
  • If they cannot show measures work, they stop using them.

🔄 The continuous cycle

🔄 Research as demonstration

  • Psychological researchers actively demonstrate that their measures work.
  • This is not a passive assumption but an active research requirement.

🔄 Multiple sources of evidence

When evaluating a measure, consider:

  • The measure's reliability (consistency)
  • Whether it covers the construct of interest
  • Whether scores correlate appropriately with expected variables
  • Whether scores are distinct from conceptually different variables

🔄 Iterative improvement

  • Each new dataset provides an opportunity to reassess measurement quality.
  • This creates a cycle of continuous improvement and validation.

23

Experiment Basics

23. Experiment Basics

🧭 Overview

🧠 One-sentence thesis

Experiments uniquely establish causal relationships between variables by actively manipulating the independent variable while controlling extraneous variables, making them one of psychology's most powerful research tools.

📌 Key points (3–5)

  • What defines an experiment: manipulation of the independent variable (systematically changing its levels) and control of extraneous variables (holding them constant).
  • Manipulation vs. comparison: actively intervening to change a variable is fundamentally different from comparing groups that already differ—only manipulation supports causal conclusions.
  • Extraneous variables create two problems: they add "noise" (variability) that obscures effects, and they can become confounding variables that provide alternative explanations.
  • Common confusion: manipulation vs. control—researchers manipulate the independent variable by changing its levels and control other variables by holding them constant; these are distinct operations despite similar everyday meanings.
  • Treatment studies require control conditions: placebo controls, no-treatment controls, or wait-list controls are necessary to rule out expectation effects and determine whether treatments actually work.

🔬 What makes a study an experiment

🔬 Two fundamental features

An experiment is a type of study designed specifically to answer the question of whether there is a causal relationship between two variables.

Feature 1: Manipulation of the independent variable

  • Researchers systematically vary the level of the independent variable.
  • The different levels are called conditions.
  • Example: Darley and Latané told participants there were either one, two, or five other students in a discussion—this created three conditions of a single independent variable (number of witnesses).

Feature 2: Control of extraneous variables

  • Researchers hold constant or minimize variability in variables other than the independent and dependent variables.

Extraneous variables: anything that varies in the context of a study other than the independent and dependent variables.

  • Example: testing all participants in the same room, using identical instructions, random assignment to conditions.

🔍 Why only experiments support causal claims

  • The combination of manipulation + control allows researchers to isolate the effect of the independent variable.
  • Other study types (e.g., comparing people who already differ) cannot rule out alternative explanations because pre-existing groups likely differ in multiple ways.

🎛️ Manipulating the independent variable

🎛️ What manipulation means

  • Active intervention by the researcher to change the independent variable's level systematically.
  • Different groups experience different levels, or the same group experiences different levels at different times.
  • Example: instructing some participants to write about traumatic experiences and others to write about neutral experiences creates a "traumatic condition" and a "neutral condition."

⚠️ Manipulation vs. pre-existing differences

  • Comparing groups that already differ before the study begins is not manipulation.
  • Example: comparing health of people who already keep a journal vs. those who don't is not an experiment.
  • Why it matters: groups that differ in one way at the start likely differ in other ways (conscientiousness, introversion, stress levels), so any observed difference could be due to those other factors, not the variable of interest.
  • Don't confuse: observing natural variation ≠ experimental manipulation; only active intervention counts as manipulation.

📊 Single-factor designs

| Design type | Description | Example from excerpt |
| --- | --- | --- |
| Single-factor two-level | One independent variable with two conditions | Comparing one witness vs. five witnesses |
| Single-factor multi-level | One independent variable with more than two conditions | Darley & Latané used one, two, and five witnesses (three conditions) |

  • Multi-level designs can provide greater insights than two-level designs.

🚫 When experiments are impossible

  • Some variables cannot be manipulated for practical or ethical reasons.
  • Example: whether people had a significant early illness experience cannot be manipulated, so studying its effect on hypochondriasis requires nonexperimental approaches.
  • This limitation does not mean the relationship cannot be studied—only that it must be done differently.

🎚️ Controlling extraneous variables

🎚️ Why control matters

Extraneous variable: anything that varies in the context of a study other than the independent and dependent variables.

  • Many extraneous variables likely affect the dependent variable.
  • Examples: in a study on expressive writing and health, extraneous variables include writing ability, diet, gender, time of day, whether writing by hand or computer, weather.
  • Uncontrolled extraneous variables make it difficult to separate the independent variable's effect from other influences.

🔊 Problem 1: Extraneous variables as "noise"

  • Extraneous variables add variability to the data, making the independent variable's effect harder to detect.
  • Example scenario: In a mood-and-memory study, ideally everyone in the happy condition would recall exactly four events and everyone in the sad condition exactly three. In reality, individual differences (fewer memories to draw on, less effective strategies, lower motivation) create variability around those averages.
  • The mean difference stays the same, but greater variability makes the effect "much less obvious."
  • Solution: control extraneous variables so data are less noisy and effects are easier to detect.

🔀 Problem 2: Extraneous variables as confounding variables

A confounding variable is an extraneous variable that differs on average across levels of the independent variable (i.e., it varies systematically with the independent variable).

  • Key distinction: an extraneous variable becomes confounding when it differs systematically between conditions.
  • Example: IQ is an extraneous variable in almost all experiments. If participants with lower and higher IQs are roughly equally distributed across conditions, IQ is acceptable (even desirable) variation. But if one condition has substantially higher average IQ than another, IQ is now a confounding variable.
  • Why "confounding" means "confusing": confounding variables provide an alternative explanation for observed differences.
  • Example scenario: participants in a positive mood condition score higher on memory than those in a negative mood condition. But if the positive mood group also has higher average IQ, it's unclear whether the higher scores were caused by positive mood or higher IQ.

🛠️ How to control extraneous variables

Method 1: Hold them constant

  • Test all participants in the same location, give identical instructions, treat them the same way.
  • Hold participant variables constant: e.g., many language studies limit participants to right-handed people (left-handed people process language differently, which would add noise).
  • Extreme version: limit to one very specific category (e.g., 20-year-old, heterosexual, female, right-handed psychology majors).
  • Trade-off: homogeneous samples reduce noise but lower external validity (generalizability). Example: results from younger lesbian women might not apply to older gay men.
  • In many situations, the advantages of a diverse sample (increased external validity) outweigh the noise reduction from a homogeneous one.

Method 2: Random assignment to conditions

  • Assigns participants to conditions using a random process.
  • Ensures different groups are, on average, highly similar on all extraneous variables (IQ, gender, motivation, health, etc.).
  • Prevents extraneous variables from becoming confounding variables.
  • This is a "much more general approach" than holding variables constant.

🏥 Treatment and control conditions

🏥 What treatments are

A treatment is any intervention meant to change people's behavior for the better.

  • Includes psychotherapies and medical treatments for disorders.
  • Also includes interventions to improve learning, promote conservation, reduce prejudice, etc.

🧪 Basic experimental structure

  • Treatment condition: participants receive the treatment.
  • Control condition: participants do not receive the treatment.
  • Random assignment to conditions.
  • If the treatment group ends up better off, the researcher can conclude the treatment works.

Randomized clinical trial: an experiment testing the effectiveness of psychotherapies or medical treatments.

🍬 The placebo problem

What placebos are

A placebo is a simulated treatment that lacks any active ingredient or element that should make it effective. A placebo effect is a positive effect of such a treatment.

  • Examples: eating chicken soup for a cold, placing soap under bed sheets to stop leg cramps—probably nothing more than placebos.
  • Probably driven primarily by people's expectations that they will improve.
  • Expectation to improve can reduce stress, anxiety, and depression, which can alter perceptions and even improve immune system functioning.

Why placebos are a problem

  • If a treatment group improves more than a no-treatment control group, you cannot conclude the treatment worked—improvement might be due to expectations, not the treatment itself.
  • Example scenario: treatment group improves more than no-treatment group, but this could be because treatment participants expected to improve while control participants did not.

Placebo effects are surprisingly powerful

  • Work not only for psychological disorders (depression, anxiety, insomnia) but also for physiological disorders (asthma, ulcers, warts).
  • Even "sham surgery" can be as effective as actual surgery.
  • Example: arthroscopic knee surgery study—control participants were prepped, received tranquilizer and incisions, but not the actual procedure. Result: sham surgery group improved just as much as treatment groups in knee pain and function.

🎯 Solutions to the placebo problem

| Control condition type | Description | Purpose |
| --- | --- | --- |
| No-treatment control | Participants receive no treatment | Basic comparison, but vulnerable to placebo effects |
| Placebo control | Participants receive a placebo that looks like treatment but lacks active ingredient | If both groups expect improvement, any extra improvement in treatment group must be due to treatment, not expectations |
| Wait-list control | Participants told they will receive treatment but must wait | Allows comparison with people not currently receiving treatment but who still expect to improve eventually |
| Best-available-treatment control | Compare new treatment with best existing alternative | Once an effective treatment exists, the question becomes "Does it work better than what's already available?" |

Placebo control details

  • Example: treatment group takes a pill, placebo group takes identical-looking "sugar pill" without active ingredient.
  • In psychotherapy research, placebo might involve unstructured talk with a therapist.
  • Informed consent requires telling participants they will be assigned to either treatment or placebo—though not which one until the study ends.
  • Often, control participants are offered the real treatment after the experiment.

Don't confuse: all control conditions involve withholding the treatment under study, but they differ in what participants do receive and what they expect, which affects how well they rule out placebo effects.

24

Experimental Design

24. Experimental Design

🧭 Overview

🧠 One-sentence thesis

Experiments can be designed either between-subjects (each participant experiences one condition) or within-subjects (each participant experiences all conditions), and the choice depends on balancing control of extraneous variables against practical constraints like carryover effects and testing time.

📌 Key points (3–5)

  • Between-subjects vs within-subjects: the core design choice is whether each participant sees one condition or all conditions.
  • Random assignment controls extraneous variables: randomly assigning participants to conditions ensures groups are similar on average, preventing confounds.
  • Within-subjects designs face carryover effects: practice, fatigue, and context effects can occur when participants experience multiple conditions, but counterbalancing addresses this.
  • Common confusion: random assignment (assigning participants to conditions) is not the same as random sampling (selecting participants from a population).
  • Design trade-offs: between-subjects designs are simpler and avoid carryover effects; within-subjects designs control participant variables better and need fewer participants.

🔀 Between-subjects experiments

🔀 What defines between-subjects design

Between-subjects experiment: each participant is tested in only one condition.

  • Each person experiences only one level of the independent variable.
  • Different groups of participants are compared.
  • Example: 100 students split so 50 write about a traumatic event and 50 write about a neutral event.

⚖️ The control challenge

  • The researcher must ensure that groups are highly similar on average across all extraneous variables (gender, IQ, motivation, health, etc.).
  • If groups differ systematically on an extraneous variable, that variable becomes a confounding variable.
  • The goal: any observed difference between conditions reflects the independent variable, not pre-existing differences.

🎲 Random assignment

🎲 What random assignment means

Random assignment: using a random process to decide which participants are tested in which conditions.

  • Don't confuse: random assignment (assigning sample members to conditions) ≠ random sampling (selecting a sample from a population).
  • Random sampling is rarely used in psychology; random assignment is essential in experiments.

🎯 Two criteria for strict random assignment

  1. Each participant has an equal chance of being assigned to each condition (e.g., 50% chance for two conditions).
  2. Each participant is assigned independently of others.

Simple methods:

  • Flip a coin for each participant (heads = Condition A, tails = Condition B).
  • Use a computer to generate random integers (1 = Condition A, 2 = Condition B, 3 = Condition C).
  • In practice, a full sequence is created ahead of time and each new participant gets the next condition in the sequence.
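The sequence-generation step above can be sketched in a few lines of Python (an illustrative snippet; the function name `random_assignment` is mine, not from the text):

```python
import random

def random_assignment(n_participants, conditions):
    """Strict random assignment: each participant has an equal chance of
    each condition, assigned independently of every other participant."""
    return [random.choice(conditions) for _ in range(n_participants)]

# Create the full sequence ahead of time; each new participant
# simply receives the next condition in the sequence.
sequence = random_assignment(10, ["A", "B"])
```

Because each draw is independent, group sizes can end up unequal by chance, which is the problem block randomization addresses.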

🧱 Block randomization

  • Problem with strict random assignment: unequal sample sizes across conditions.
  • Solution: block randomization keeps group sizes as equal as possible.
  • How it works: all conditions occur once before any repeat; within each "block," conditions appear in random order.
  • Example table shows 9 participants assigned to three conditions (A, B, C) in blocks.
  • The Research Randomizer website can generate block randomization sequences.
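A block-randomized sequence like the 9-participant example can be generated with a short sketch (illustrative Python; `block_randomization` is a hypothetical helper name):

```python
import random

def block_randomization(n_blocks, conditions):
    """Each block contains every condition exactly once, in random order,
    so group sizes never differ by more than one participant."""
    sequence = []
    for _ in range(n_blocks):
        block = list(conditions)
        random.shuffle(block)  # random order within this block
        sequence.extend(block)
    return sequence

# 3 blocks of 3 conditions -> 9 participants, exactly 3 per condition
sequence = block_randomization(3, ["A", "B", "C"])
```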

🤔 Limitations and strengths

  • Random assignment is not guaranteed to control all extraneous variables—by chance, one group might be older or more motivated.
  • Why it still works:
    • Random assignment equalizes groups more reliably than intuition suggests, especially with large samples.
    • Inferential statistics account for the "fallibility" of random assignment.
    • Confounds from random assignment are likely detected when experiments are replicated.
  • Random assignment is always considered a strength of experimental design.

🔗 Matched-groups design

Matched-groups design: participants in different conditions are matched on the dependent variable or extraneous variables before the independent variable is manipulated.

  • This guarantees certain variables won't be confounded across conditions.
  • Example: measure health in all participants, rank-order them by health, then randomly assign the two healthiest to different conditions, the next two healthiest to different conditions, and so on.
  • Result: conditions are matched on health at the start, so any difference at the end is due to the manipulation.
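The rank-and-pair procedure described above can be sketched as follows (illustrative Python for the two-condition case; names and data are mine):

```python
import random

def matched_groups(participants):
    """Rank participants on the matching variable (e.g., health), then
    randomly split each adjacent pair between the two conditions.
    Assumes an even number of (name, score) tuples."""
    ranked = sorted(participants, key=lambda p: p[1], reverse=True)
    group_a, group_b = [], []
    for i in range(0, len(ranked), 2):
        pair = [ranked[i], ranked[i + 1]]
        random.shuffle(pair)  # random assignment within the matched pair
        group_a.append(pair[0])
        group_b.append(pair[1])
    return group_a, group_b

# Four participants measured on health before the manipulation
participants = [("p1", 88), ("p2", 92), ("p3", 75), ("p4", 70)]
a, b = matched_groups(participants)
```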

🔄 Within-subjects experiments

🔄 What defines within-subjects design

Within-subjects experiment: each participant is tested under all conditions.

  • The same people experience every level of the independent variable.
  • Example: the same group judges both an attractive defendant and an unattractive defendant.

✅ Primary advantage

  • Maximum control of extraneous participant variables: participants in all conditions have the same IQ, socioeconomic status, number of siblings, etc.—because they are the same people.
  • Statistical procedures can remove the effect of participant variables, making data less noisy and effects easier to detect.
  • Requires fewer participants than between-subjects designs to detect the same effect size.

⚠️ Primary disadvantage: order effects

Order effect: participants' responses in various conditions are affected by the order of conditions.

Types of carryover effects:

| Type | Definition | Example |
| --- | --- | --- |
| Practice effect | Participants perform better in later conditions because they've practiced | Task performance improves over trials |
| Fatigue effect | Participants perform worse in later conditions because they're tired or bored | Task performance declines over trials |
| Context effect (contrast effect) | Being tested in one condition changes how participants perceive stimuli in later conditions | An average defendant judged more harshly after an attractive defendant than after an unattractive one |

🧩 Additional problems

  • Easier to guess the hypothesis: if a participant judges an attractive then an unattractive defendant, they may guess the hypothesis is about attractiveness affecting guilt.
  • This knowledge could lead them to judge more harshly (because they think that's expected) or similarly (to be "fair").
  • Order as a confounding variable: if the attractive condition is always first and unattractive always second, any difference might be due to order, not attractiveness.

⚖️ Counterbalancing

⚖️ What counterbalancing does

Counterbalancing: testing different participants in different orders.

  • Solution to the problem of order effects.
  • Controls the order of conditions so it's no longer a confounding variable.
  • Makes it possible to detect carryover effects by analyzing data separately for each order.

🔢 Complete counterbalancing

  • An equal number of participants complete each possible order of conditions.
  • Two conditions: half do A then B, half do B then A.
  • Three conditions: six possible orders (ABC, ACB, BAC, BCA, CAB, CBA); some participants tested in each.
  • Four conditions = 24 orders; five conditions = 120 orders.
  • Participants are randomly assigned to orders.
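Enumerating all n! orders and cycling participants through them can be sketched like this (illustrative Python using the standard library; names are mine):

```python
import itertools
import random

def complete_counterbalancing(participants, conditions):
    """Assign each participant one of the n! possible condition orders,
    cycling through the orders so each is used equally often."""
    orders = list(itertools.permutations(conditions))
    random.shuffle(participants)  # random assignment of people to orders
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

# Three conditions -> 6 orders (ABC, ACB, BAC, BCA, CAB, CBA);
# 12 participants means each order is used exactly twice.
assignment = complete_counterbalancing(list(range(12)), ["A", "B", "C"])
```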

🔲 Latin square design

  • A more efficient counterbalancing method for large numbers of conditions.
  • Uses equal rows and columns (like a Sudoku puzzle): no treatment repeats in a row or column.
  • Each condition appears at each ordinal position once and precedes/follows each other condition once.
  • Example for four treatments requires only 4 versions instead of 24.
  • A 6-condition experiment needs only 6 orders instead of 720.
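One standard construction with these properties, for an even number of conditions, is the Williams design; a minimal sketch, assuming conditions are numbered 0 to n-1:

```python
def balanced_latin_square(n):
    """Williams design for an even n: each condition appears once at each
    ordinal position, and each condition immediately precedes and follows
    every other condition exactly once."""
    # First row alternates between the low and high ends: 0, 1, n-1, 2, n-2, ...
    first = [0]
    for j in range(1, n):
        first.append((j + 1) // 2 if j % 2 else n - j // 2)
    # Remaining rows are cyclic shifts of the first row.
    return [[(c + k) % n for c in first] for k in range(n)]

# 4 conditions -> only 4 orders instead of 24
square = balanced_latin_square(4)
```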

🎲 Random counterbalancing

  • Used when the number of conditions is very large.
  • The order of conditions is randomly determined for each participant.
  • Not as powerful as complete or Latin square counterbalancing (more random error).
  • An option when order effects are likely small and conditions are numerous.

📊 Context matters: the "9 vs 221" study

  • Researcher Michael Birnbaum argued that lack of context in between-subjects designs can be a bigger problem than context effects in within-subjects designs.
  • Experiment: one group rated the number 9, another rated 221 on a scale of "very very small" to "very very large."
  • Result: participants rated 9 as larger (mean 5.13) than 221 (mean 3.10)!
  • Explanation: participants compared 9 to other one-digit numbers (relatively large) and 221 to other three-digit numbers (relatively small).
  • Lesson: between-subjects designs can create misleading results when context is missing.

🔀 Simultaneous within-subjects designs

🔀 Mixed presentation approach

  • Instead of testing one condition at a time, participants respond to multiple stimuli in a mixed sequence.
  • Example: instead of judging 10 attractive defendants then 10 unattractive defendants, present all 20 in a mixed sequence.
  • The researcher computes each participant's mean rating for each type.
  • Example: participants study a single list with both negative and positive adjectives, then recall as many as possible; count each type recalled.
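Computing each participant's mean rating per stimulus type is a simple aggregation; an illustrative sketch (function name and data are made up):

```python
def mean_by_type(responses):
    """Average one participant's ratings separately for each stimulus
    type presented in a mixed sequence."""
    by_type = {}
    for stim_type, rating in responses:
        by_type.setdefault(stim_type, []).append(rating)
    return {t: sum(r) / len(r) for t, r in by_type.items()}

# One participant's guilt ratings for a mixed sequence of defendants
ratings = [("attractive", 3), ("unattractive", 5),
           ("attractive", 4), ("unattractive", 6)]
means = mean_by_type(ratings)  # {"attractive": 3.5, "unattractive": 5.5}
```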

🤔 Choosing between designs

🤔 Trade-offs summary

| Aspect | Between-subjects | Within-subjects |
| --- | --- | --- |
| Conceptual simplicity | Simpler | More complex |
| Testing time per participant | Less | More |
| Carryover effects | Not an issue | Requires counterbalancing |
| Control of participant variables | Less control | Maximum control |
| Data noise | More noise | Less noise (easier to detect effects) |
| Number of participants needed | More | Fewer |

📋 Rule of thumb

  • If possible, use a within-subjects design with proper counterbalancing when:
    • You have enough time per participant.
    • You have no serious concerns about carryover effects.
  • Use between-subjects when:
    • A within-subjects design is difficult or impossible.
    • Testing time per participant is limited (e.g., doctor's waiting room, grocery store line).
    • The treatment produces long-term change that makes control condition testing impossible (e.g., prejudice reduction, psychotherapy effectiveness).

🔄 Mixed methods approach

  • Using one design type doesn't preclude using the other in a different study.
  • Professional researchers often use both between-subjects and within-subjects designs to answer the same research question.
  • This mixed methods approach provides converging evidence.

25. Experimentation and Validity

🧭 Overview

🧠 One-sentence thesis

Experiments are evaluated using four types of validity—internal, external, construct, and statistical—and researchers must prioritize among them because no single study can maximize all four simultaneously.

📌 Key points (3–5)

  • Internal validity: whether the study design supports the conclusion that the independent variable caused changes in the dependent variable; experiments are high in internal validity due to manipulation and control.
  • External validity: whether results generalize to other people and situations beyond those studied; depends on similarity to real-world contexts (mundane realism) and psychological processes (psychological realism).
  • Construct and statistical validity: construct validity concerns how well the operationalization matches the research question; statistical validity concerns proper statistical treatment and adequate sample size.
  • Common confusion: external validity is not just about sample size—small samples threaten statistical validity, not necessarily external validity; artificial settings don't always mean low external validity if psychological processes generalize.
  • Trade-offs matter: researchers must prioritize validities because achieving high validity in all four areas simultaneously is often impossible.

🔬 Internal Validity: Establishing Causation

🔬 What internal validity measures

Internal validity: the degree to which a study's design supports the conclusion that the independent variable caused observed differences in the dependent variable.

  • Correlation does not imply causation—two statistically related variables don't necessarily have a causal relationship.
  • Example: people who exercise regularly being happier doesn't prove exercise causes happiness; happiness might cause exercise, or a third factor (like physical health) might cause both.

🎯 Why experiments have high internal validity

The logic of experimental causation:

  • Create two or more highly similar conditions
  • Manipulate only the independent variable to produce one difference between conditions
  • Any later difference between conditions must have been caused by that manipulation

Key mechanisms that support internal validity:

  • Manipulation of the independent variable by the researcher
  • Control of extraneous variables
  • Random assignment to minimize confounds

🔄 Contrast with non-experimental designs

| Design type | Internal validity | Why |
| --- | --- | --- |
| Experimental | High | Variables are manipulated and controlled |
| Non-experimental (correlational) | Low | Variables are only measured, not manipulated |

Example from the excerpt: Darley and Latané's study had high internal validity because the only difference between conditions was the number of students participants believed were involved, so that difference must have caused the helping behavior differences.

🌍 External Validity: Generalizability

🌍 What external validity measures

External validity: the degree to which a study's design supports generalizing results to people and situations beyond those actually studied.

  • Not about whether the study "feels real," but whether conclusions apply broadly
  • General rule: higher when participants and situations studied are similar to those researchers want to generalize to

🎭 Two types of realism

Mundane realism:

  • Participants encounter situations similar to everyday life
  • Example: studying shoppers' actual cereal-buying decisions in a real grocery store = high mundane realism
  • Example: undergraduates judging colors on a computer screen = low mundane realism

Psychological realism:

  • The same mental process operates in both the laboratory and the real world
  • Example: visual processing of colors has high psychological realism even in an artificial lab setting

🚫 Don't confuse: Artificial settings ≠ automatically low external validity

Why experiments aren't always low in external validity:

  1. Field experiments exist: conducted entirely outside the laboratory

    • Example: Cialdini's hotel towel study manipulated messages on cards in real hotel rooms; guests who read that most guests reuse towels reused their own towels more often
    • High external validity because conducted in the actual setting of interest
  2. Psychological processes generalize: experiments often study processes that operate across many people and situations

    • Example: Fredrickson's swimsuit/math test study seemed artificial, but the self-objectification process it revealed likely operates in many women and situations
    • The specific situation (math test in swimsuit) doesn't need to recur; the underlying attention-diversion process does

🏗️ Construct Validity: Quality of Operationalization

🏗️ What construct validity measures

Construct validity: the quality of the experiment's manipulations—how well the operationalization speaks to the research question.

The operationalization process:

  • Start with research question
  • Convert to experimental design (operationalization)
  • Evaluate: does the manipulation clearly test what you intended?

📐 Example: Evaluating construct validity

Darley and Latané's diffusion of responsibility study:

  • Research question: Does helping behavior become diffused?
  • Operationalization: Increased number of potential helpers in a crisis situation
  • Evaluation: Very high construct validity—the manipulation clearly tests diffusion

Thought experiment with fewer conditions:

  • If only two conditions (one student or two): lower construct validity
  • Why? Could be mere presence of others (social inhibition), not clearly diffusion of responsibility
  • With five conditions: might reveal whether decrease continues or plateaus—more nuanced understanding

⚖️ Balancing conditions

  • More conditions don't automatically mean higher construct validity
  • Design should match the specific research question
  • Consider how well each condition illuminates the phenomenon

📊 Statistical Validity: Sound Analysis

📊 What statistical validity measures

Statistical validity: the proper statistical treatment of data and the soundness of researchers' statistical conclusions.

What threatens statistical validity:

  • Using the wrong type of inferential test (t-tests, ANOVA, regression, correlation, etc.)
  • Ignoring the scale of measurement of the dependent variable
  • Violating statistical assumptions (e.g., assuming normal distribution when data aren't normally distributed)
  • Using statistics anyway when assumptions aren't met

👥 The sample size critique

Common confusion: Small samples and validity types

  • People often say small samples threaten external validity (can't generalize from small sample)
  • Actually threatens statistical validity, not necessarily external validity
  • Why? Some studies with small samples (even one person) are still illuminating for psychological research
  • The real issue: whether statistics can detect the effect, which depends on sample size

🔍 Power analysis

Power analysis: a calculation that determines the number of participants needed to detect an effect of a specific size.

What affects detecting an effect:

  • Whether a relationship really exists between variables
  • Number of conditions in the study
  • Size of the sample
  • Researchers should conduct power analysis when designing studies
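For a rough sense of the arithmetic, the common normal-approximation formula n ≈ 2 × ((z_alpha + z_beta) / d)^2 per group can be sketched (an approximation only, and my own illustration; real studies should use a proper power-analysis tool):

```python
from math import ceil

def n_per_group(effect_size_d, z_alpha=1.96, z_beta=0.84):
    """Approximate participants needed per group for a two-group
    comparison: z_alpha corresponds to a two-tailed alpha of .05,
    z_beta to 80% power. A planning heuristic, not an exact answer."""
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

# A "medium" effect (d = 0.5) needs about 63 participants per group
# under this approximation; smaller effects need far more.
n = n_per_group(0.5)
```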

⚖️ Prioritizing and Trading Off Validities

⚖️ The impossibility of perfection

Core principle: Researchers must prioritize because high validity in all four areas simultaneously is often not possible.

Example from the excerpt:

  • Cialdini's hotel towel study: high external validity but more modest statistical validity
  • This doesn't invalidate the study; it shows where future follow-up studies could improve

🎯 Common patterns in psychology research

According to the excerpt:

  • Many psychology studies have high internal validity and high construct validity
  • These studies sometimes sacrifice external validity
  • This is a deliberate trade-off, not a flaw

When reading or designing experiments, ask:

  • Which validities does this study prioritize?
  • Are the trade-offs appropriate for the research question?
  • Where might follow-up studies improve?

26. Practical Considerations

🧭 Overview

🧠 One-sentence thesis

Conducting an experiment requires careful attention to participant recruitment, procedure standardization, and pilot testing to ensure that the study is feasible, unbiased, and capable of detecting real effects.

📌 Key points (3–5)

  • Recruiting participants: researchers must plan how to obtain participants, whether through formal subject pools, advertisements, or field selection with well-defined rules.
  • Standardizing procedures: unintended variation (e.g., experimenter behavior, instructions) can introduce extraneous variables or confounds, so protocols must be consistent across all participants.
  • Experimenter expectancy effects: experimenters' expectations can unintentionally influence participant behavior, which is why blinding and standardization are critical.
  • Common confusion: null results may mean no real effect exists, or they may mean the manipulation failed—manipulation checks help distinguish these scenarios.
  • Pilot testing: small-scale trials before the full study reveal problems with instructions, timing, manipulations, and data recording.

👥 Recruiting participants

👥 Formal subject pools

  • Many institutions maintain subject pools: established groups who have agreed to participate in research.
  • Example: introductory psychology students who must participate in studies to meet course requirements, signing up via online systems.

📢 Other recruitment methods

  • Posting or publishing advertisements.
  • Making personal appeals to groups representing the population of interest.
  • Example: a researcher studying older adults might speak at a retirement community meeting to ask for volunteers.

🎯 Selection in field experiments

  • In field studies, participants are not "recruited" but selected according to pre-defined rules.
  • Example: in a supermarket stairway study on smiling and helping, the confederate gazed at the first person aged 20–50 who gazed back.
  • Why this matters: clear selection rules prevent bias (e.g., choosing friendly-looking people only when smiling).
  • An IRB can approve waiving informed consent when the study poses no harm and the behavior occurs in the course of ordinary activities.

🧑‍🎓 The volunteer subject issue

Volunteers differ predictably from non-volunteers:

| Characteristic | Volunteers tend to be... |
| --- | --- |
| Interest | More interested in the research topic |
| Education | More educated |
| Approval need | Higher in need for approval |
| IQ | Higher in IQ |
| Sociability | More sociable |
| Social class | Higher in social class |

  • External validity concern: if volunteers behave differently than the general population, results may not generalize.
  • Example: rational arguments might work better on volunteers (due to higher education/IQ) than on the general population.

🔧 Standardizing the procedure

🔧 Why standardization matters

Standardization: carrying out the procedure in the same way for all participants regardless of condition.

  • Extraneous variables can easily creep in: one experimenter gives clear instructions, another gives vague ones; one is warm, another cold.
  • If these variables vary systematically across conditions, they become confounding variables and offer alternative explanations.
  • Example: if treatment-group participants are tested by a warm experimenter and control-group participants by a cold one, the apparent treatment effect might actually be an experimenter-demeanor effect.

🧪 Experimenter expectancy effect

Experimenter expectancy effect: unintended variation in procedure caused by the experimenter's expectations about how participants "should" behave.

  • Experimenters may unintentionally give clearer instructions, more encouragement, or more time to participants they expect to perform better.
  • Classic example: students told their rats were "maze-bright" (vs. "maze-dull") saw better maze performance over five days, even though the rats were genetically similar. Students' expectations led them to handle "bright" rats more positively, which affected the rats' learning.
  • Don't confuse: this is not deliberate bias; it is unconscious influence through subtle behavioral cues.

🧑‍🔬 Experimenter sex as an extraneous variable

  • Male and female experimenters interact slightly differently with participants, and participants respond differently to male vs. female experimenters.
  • Example: in a pain-tolerance study (hands in icy water), male participants tolerated pain longer with a female experimenter, and female participants tolerated it longer with a male experimenter.

📋 How to standardize

  • Written protocol: specify everything experimenters do and say from greeting to dismissal.
  • Standard instructions: participants read them or experimenters read them word-for-word.
  • Automate: use software or slide shows to deliver stimuli and record responses.
  • Anticipate questions: raise and answer them in instructions or develop standard answers.
  • Train experimenters together: have them practice on each other.
  • Counterbalance experimenters: ensure each experimenter tests participants in all conditions.

🕶️ Blinding

Double-blind study: neither participants nor experimenters know which condition each participant is in.

  • Minimizes experimenter expectancy effects by minimizing expectations.
  • Example: in a drug study, neither participant nor experimenter knows who receives the drug vs. placebo.
  • Single-blind study: only the participant is blind to the condition.
  • Blinding may be impossible when the investigator is the only experimenter, or when the experimenter must carry out different procedures in different conditions.

📝 Record keeping and manipulation checks

📝 Good record-keeping practices

  • Generate a written sequence of conditions before the study begins; test each new participant in the next condition.
  • Add to the list: demographic info, date/time/place of testing, experimenter name, and comments about unusual occurrences or questions.
  • Confidentiality: do not include participants' names or identifying information with their data.
  • Identification numbers: assign each participant a consecutive number (starting with 1) and write it on all response sheets or questionnaires.

✅ Manipulation check

Manipulation check: a separate measure of the construct the researcher is trying to manipulate, used to confirm the independent variable was successfully manipulated.

  • Purpose: verify that the manipulation worked as intended.
  • Example: if trying to manipulate stress by telling participants they must give a speech, measure stress via questionnaire or blood pressure afterward.

❓ Why manipulation checks matter for null results

  • Null result: no significant effect of the independent variable on the dependent variable.
  • Two possible explanations:
    1. The independent variable truly has no effect.
    2. The manipulation of the independent variable failed.
  • A manipulation check distinguishes these: if the check shows the manipulation worked, the null result likely reflects a genuine lack of effect; if it shows the manipulation failed, a stronger manipulation is needed.
  • Example: exposing participants to happy/sad music to induce moods, then measuring childhood-event recall. If no effect is found, a mood measure (manipulation check) reveals whether the music actually changed moods.

⏰ When to conduct manipulation checks

  • Usually done at the end of the procedure to ensure the effect lasted and to avoid drawing attention to the manipulation (avoiding demand characteristics).
  • Best practice: include a manipulation check in the pilot test to catch problems early.

🧪 Pilot testing

🧪 What is a pilot test?

Pilot test: a small-scale study conducted to make sure a new procedure works as planned.

  • Participants can be recruited formally (from a subject pool) or informally (family, friends, classmates).
  • The sample can be small, but large enough to give confidence the procedure works.

🧪 Key questions a pilot test answers

  • Do participants understand the instructions?
  • What misunderstandings, mistakes, or questions arise?
  • Do participants become bored or frustrated?
  • Is an indirect manipulation effective? (Requires a manipulation check.)
  • Can participants guess the research question or hypothesis? (Are there demand characteristics?)
  • How long does the procedure take?
  • Are computer programs or automated procedures working properly?
  • Are data being recorded correctly?

🧪 How to conduct a pilot test

  • Observe participants carefully during the procedure.
  • Talk with them afterward: participants may hesitate to criticize, so emphasize that their feedback is part of a pilot test and genuinely helpful.
  • Iterate: if problems arise, solve them, pilot test the new procedure, and repeat until ready to proceed with the actual study.

27. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Experimental research requires careful manipulation of independent variables, control of extraneous variables, and thoughtful design choices to ensure both internal validity (causal conclusions) and external validity (generalizability).

📌 Key points (3–5)

  • Core experimental elements: manipulation of an independent variable, measurement of a dependent variable, and control of extraneous variables.
  • Control conditions matter: experiments need appropriate comparison groups (no-treatment, placebo, wait-list, or best-alternative controls).
  • Design trade-offs: between-subjects vs. within-subjects designs each have pros and cons that require careful consideration.
  • Common confusion: extraneous variables vs. confounds—a confound is an extraneous variable that varies systematically with the independent variable, not just any extra variable.
  • Validity balance: internal validity (supporting causal claims) and external validity (generalizability) both matter, though experiments may seem "artificial."

🔬 Fundamental experimental components

🔬 What defines an experiment

An experiment is a type of empirical study that features the manipulation of an independent variable, the measurement of a dependent variable, and control of extraneous variables.

  • Manipulation: the researcher actively changes the independent variable.
  • Measurement: the dependent variable is observed and recorded.
  • Control: extraneous variables are managed to prevent confounding.

⚠️ Extraneous variables vs. confounds

An extraneous variable is any variable other than the independent and dependent variables. A confound is an extraneous variable that varies systematically with the independent variable.

  • Not all extraneous variables are confounds—only those that change along with the independent variable.
  • Example: if stressed participants are always tested in the morning and unstressed participants in the afternoon, time-of-day becomes a confound.
  • Don't confuse: random noise (extraneous but not systematic) vs. confounds (systematic co-variation).

🎯 Control conditions and comparisons

🎯 Types of control conditions

Experimental research on treatment effectiveness requires both a treatment condition and a control condition:

| Control type | What it provides |
| --- | --- |
| No-treatment control | Baseline with no intervention |
| Placebo control | Controls for expectation effects |
| Wait-list control | Ethical alternative: participants receive treatment later |
| Best available alternative | Compares new treatment to existing standard |

🎯 Why controls matter

  • Without appropriate controls, you cannot determine whether the independent variable caused the observed effect.
  • The choice of control depends on the research question and ethical considerations.

🔄 Design choices: Between vs. Within

🔄 Between-subjects designs

  • Different participants are assigned to different conditions.
  • Requires random assignment to control extraneous variables and prevent confounding.

🔄 Within-subjects designs

  • The same participants experience all conditions.
  • Requires counterbalancing of condition orders to control for order effects.

🔄 Deciding which to use

  • The excerpt emphasizes that "deciding which to use in a particular situation requires careful consideration of the pros and cons of each approach."
  • Both random assignment (between-subjects) and counterbalancing (within-subjects) serve the same fundamental purpose: controlling extraneous variables so they don't become confounds.

📏 Validity considerations

📏 Internal validity

Studies are high in internal validity to the extent that the way they are conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable.

  • Experiments are generally high in internal validity because of:
    • Manipulation of the independent variable
    • Control of extraneous variables
  • Internal validity = confidence in causal claims.

📏 External validity

Studies are high in external validity to the extent that the result can be generalized to people and situations beyond those actually studied.

  • Experiments can seem "artificial" and appear low in external validity.
  • Important consideration: whether the psychological processes under study are likely to operate in other people and situations.
  • Don't assume artificiality automatically means poor generalizability.

🛠️ Practical implementation

🛠️ Recruiting participants

Several effective methods for recruiting research participants:

  • Formal subject pools
  • Advertisements
  • Personal appeals
  • Field experiments require well-defined participant selection procedures

🛠️ Standardizing procedures

  • Standardization minimizes extraneous variables.
  • Includes controlling for experimenter expectancy effects (when researcher expectations influence results).
  • Example: In a memory study with stressed vs. unstressed conditions, standardization might include using identical word lists, timing, room conditions, and instructions across all participants.

🛠️ Pilot testing

  • Conduct one or more small-scale pilot tests before the full experiment.
  • Purpose: ensure the procedure works as planned.
  • Allows identification and correction of problems before investing in full data collection.

28. Overview of Non-Experimental Research

🧭 Overview

🧠 One-sentence thesis

Non-experimental research measures variables as they naturally occur without manipulating an independent variable, making it essential when experimental manipulation is impossible, unethical, or inappropriate for the research question.

📌 Key points (3–5)

  • What defines non-experimental research: lacks manipulation of an independent variable; researchers simply measure variables as they naturally occur.
  • When to use it: appropriate when the research question involves a single variable, non-causal relationships, situations where manipulation is impossible/unethical, or exploratory questions.
  • Two main types: correlational research (measuring statistical relationships between variables) and observational research (observing behavior without manipulation).
  • Common confusion: non-experimental research cannot establish causation like experiments can, but this does not make it less important—it serves different scientific goals (describe and predict vs. explain).
  • Internal validity trade-off: non-experimental designs are lower in internal validity than experiments because they lack manipulation and control of extraneous variables.

🔬 What non-experimental research is

🔬 Core definition

Non-experimental research: research that lacks the manipulation of an independent variable; researchers measure variables as they naturally occur (in the lab or real world).

  • The key distinction from experimental research is the absence of manipulation.
  • Researchers observe and measure but do not intervene or assign conditions.
  • Example: measuring people's self-esteem and their GPAs to see if the two are related—no variable is manipulated.

⚖️ Why the distinction matters

  • Most psychology researchers consider the experimental vs. non-experimental distinction extremely important.
  • Causal conclusions: experimental research can provide strong evidence that changes in an independent variable cause differences in a dependent variable.
  • Non-experimental research generally cannot establish causation.
  • Don't confuse: "cannot establish causation" does not mean "less important"—it simply serves different purposes.

🎯 When to choose non-experimental research

🎯 Four key situations

The excerpt identifies when non-experimental research is preferred or necessary:

| Situation | Example from excerpt |
| --- | --- |
| Single variable question | How accurate are people's first impressions? |
| Non-causal relationship | Is there a correlation between verbal and mathematical intelligence? |
| Manipulation impossible/unethical | Does hippocampus damage impair long-term memory formation? (cannot ethically damage brains) |
| Exploratory/experiential question | What is it like to be a working mother diagnosed with depression? |

🎯 Matching research goals to methods

  • The three goals of science: describe, predict, and explain.
  • If the goal is to explain causal relationships → experimental approach typically preferred.
  • If the goal is to describe or predict → non-experimental approach is appropriate.
  • The two approaches can be complementary: the excerpt describes how Milgram's original observational obedience study was followed by experiments manipulating variables like distance between participants.

📊 Two main types

📊 Correlational research

Correlational research: the researcher measures two variables with little or no attempt to control extraneous variables and then assesses the relationship between them.

  • Most common type of non-experimental research in psychology.
  • Focuses on the statistical relationship between two variables.
  • Does not include manipulation of an independent variable.
  • Example: collecting data on students' self-esteem and GPAs to see if they are statistically related.

👁️ Observational research

Observational research: focuses on making observations of behavior in a natural or laboratory setting without manipulating anything.

  • The excerpt cites Milgram's original obedience study as an example—he observed all participants performing the same task under the same conditions.
  • Another example: the Loftus and Pickrell study observing whether participants "remembered" mildly traumatic childhood events that never actually happened (nearly a third "remembered" at least one event).
  • Can be conducted in natural settings or laboratories.

⏳ Studying change over time

⏳ Three non-experimental approaches

When psychologists want to study change over time (e.g., aging), they use one of three approaches:

| Approach | What it involves | Advantage | Limitation |
| --- | --- | --- | --- |
| Cross-sectional | Comparing pre-existing groups at different ages | Quick, immediate results | Differences may reflect cohort effects, not true age effects |
| Longitudinal | Following one group over time as they age | Superior for studying aging effects | Time-consuming, requires greater investment |
| Cross-sequential | Following multiple age groups over a shorter period | Immediate comparisons; can later distinguish age vs. cohort effects (combines benefits of both approaches) | Not noted in the excerpt |

⏳ The cohort effect problem

  • Cohort effect: differences between groups may reflect the generation people come from rather than a direct effect of age.
  • Example: comparing 20-year-olds to 60-year-olds today—differences might be due to being born in different eras, not aging itself.
  • Cross-sequential studies help address this by following different age groups over time to determine whether original differences are true age effects or cohort effects.

📈 Quantitative vs. qualitative approaches

📈 Quantitative research

  • Data consist of numbers analyzed using statistical techniques.
  • Most research discussed in the excerpt is quantitative.

📝 Qualitative research

  • Data are usually nonnumerical and cannot be analyzed using statistical techniques.
  • Example: Rosenhan's observational study of psychiatric wards—data were notes taken by "pseudopatients" (people pretending to have heard voices) and hospital records.
  • Analysis consists mainly of written descriptions supported by concrete examples.
  • The excerpt quotes Rosenhan describing staff "depersonalization": "Upon being admitted, I and other pseudopatients took the initial physical examinations in a semi-public room, where staff members went about their own business as if we were not there."
  • Different analysis tools: thematic analysis (themes in data) or conversation analysis (how words were said).

🔍 Internal validity considerations

🔍 What internal validity means here

Internal validity: the extent to which the design of a study supports the conclusion that changes in the independent variable caused any observed differences in the dependent variable.

🔍 How designs compare

The excerpt describes a hierarchy of internal validity:

  1. Experimental research (highest): uses manipulation and control to rule out alternative explanations.
  2. Quasi-experimental research (middle): contains some but not all features of true experiments (e.g., may lack random assignment).
  3. Non-experimental/correlational research (lowest): fails to use manipulation or control.

🔍 Important nuances

  • There is overlap between categories—a poorly designed experiment can have lower internal validity than a well-designed quasi-experiment.
  • Example of quasi-experimental limitation: starting an anti-bullying program in one school vs. another—without random assignment, students in the treatment school might differ in other ways (selection effect).
  • Internal validity is only one of several validities to consider (as noted in Chapter 5 of the source material).
29

Correlational Research

29. Correlational Research

🧭 Overview

🧠 One-sentence thesis

Correlational research measures the statistical relationship between two variables without manipulation, allowing researchers to describe and predict relationships even when experiments are impossible or unethical, though such studies cannot establish causation.

📌 Key points (3–5)

  • What defines correlational research: measuring two variables and assessing their statistical relationship with little or no control of extraneous variables, without manipulating either variable.
  • Why researchers choose it: when causal relationships are not the focus, when manipulation is impossible/impractical/unethical, or when high external validity is needed.
  • Strength vs direction: Pearson's r ranges from −1.00 to +1.00; the absolute value indicates strength (small ~±.10, medium ~±.30, large ~±.50), while the sign indicates direction (positive or negative).
  • Common confusion: correlation does not imply causation due to the directionality problem (X causes Y or Y causes X?) and the third-variable problem (Z causes both X and Y).
  • Trade-off with experiments: correlational studies typically have lower internal validity but higher external validity than experiments because they reflect real-world conditions without artificial controls.

🔬 What correlational research is and isn't

🔬 Defining features

Correlational research: a type of non-experimental research in which the researcher measures two variables (binary or continuous) and assesses the statistical relationship (i.e., the correlation) between them with little or no effort to control extraneous variables.

  • The key criterion is that neither variable is manipulated—both are simply measured.
  • This definition holds regardless of whether variables are quantitative or categorical.
  • The terms "independent variable" and "dependent variable" do not apply because nothing is manipulated.

🎯 Common misconception about variable types

A beginning researcher might think correlational research requires two quantitative variables (e.g., test scores), but this is incorrect.

  • The defining feature is measurement without manipulation, not the type of variable.
  • Example: Comparing self-esteem scores between American and Japanese college students is correlational because nationality was not manipulated—it was measured.
  • Example: Comparing college faculty and factory workers on need for cognition is correlational because occupation was not manipulated.

Don't confuse: What makes a study correlational vs experimental is how the study is conducted (manipulation or not), not the types of variables, graphs, or statistics used.

🔍 Ambiguous cases

The excerpt presents a hypothetical study on to-do lists and stress to illustrate the importance of knowing how variables were handled:

  • If participants were randomly assigned to make or not make to-do lists → experiment.
  • If participants were simply asked whether they already make to-do lists → correlational study.
  • The data and variables look identical, but the conclusions differ: experiments can support causal claims; correlational studies cannot.

🎯 Why researchers choose correlational designs

🎯 Goals of description and prediction

Two goals of science are to describe and to predict, and correlational research supports both.

  • Researchers can describe the strength and direction of relationships between variables.
  • If a relationship exists, scores on one variable can predict scores on the other using regression (a statistical technique mentioned for later discussion).
  • Causal explanation is not always the goal; sometimes researchers simply want to know whether and how strongly variables are related.

🚫 When manipulation is not possible

Even when researchers suspect a causal relationship, they may be unable to manipulate the independent variable.

  • Impossible: Some variables cannot be manipulated (e.g., nationality, past experiences).
  • Impractical: Manipulation may be too costly or logistically difficult.
  • Unethical: Researchers cannot ethically manipulate certain variables.

Example: A researcher interested in the relationship between cannabis use frequency and memory abilities cannot ethically manipulate how often people use cannabis, so must measure both variables and assess their correlation.

📏 Establishing reliability and validity

Correlation is used to evaluate measurement quality.

Example: A researcher administering a brief extraversion test alongside a longer, validated extraversion test can check whether scores on the two tests are strongly correlated. This assesses the validity of the brief test without any causal claim or manipulation.

🌍 Higher external validity

Correlational research often has higher external validity than experimental research due to the internal-external validity trade-off.

| Aspect | Experiments | Correlational studies |
| --- | --- | --- |
| Internal validity | Higher (more controls) | Lower (no manipulation/control) |
| External validity | Lower (artificial conditions) | Higher (real-world conditions) |
| Generalizability | May not reflect reality | More likely to reflect real relationships |

  • As experiments add controls to increase internal validity, they often introduce artificial conditions that reduce external validity.
  • Correlational studies measure variables as they naturally occur, so results are more likely to generalize to real-world settings.

🔗 Providing converging evidence

Correlational research can complement experimental research to strengthen confidence in a theory.

  • If a theory is supported by both a high-internal-validity experiment and a high-external-validity correlational study, researchers can be more confident in the theory's validity.
  • Example: Correlational studies showing a relationship between watching violent television and aggressive behavior have been complemented by experiments confirming the relationship is causal.

📊 Understanding correlation coefficients

📊 Positive and negative relationships

The excerpt uses scatterplots to illustrate relationships between quantitative variables.

Positive relationship: higher scores on one variable tend to be associated with higher scores on the other (both move in the same direction, either up or down).

Negative relationship: higher scores on one variable tend to be associated with lower scores on the other (they move in opposite directions).

Example (positive): People under more stress tend to have more physical symptoms—as stress increases, symptoms increase.

Example (negative): Higher stress is associated with lower immune system functioning—as stress increases, immune function decreases.

📏 Pearson's r: strength and direction

Pearson's Correlation Coefficient (Pearson's r): a statistic measuring the strength of a correlation between quantitative variables, ranging from −1.00 to +1.00.

| Value | Meaning |
| --- | --- |
| −1.00 | Strongest possible negative relationship |
| 0 | No relationship (shapeless cloud on scatterplot) |
| +1.00 | Strongest possible positive relationship |

Interpreting strength (absolute value):

  • Near ±.10 = small
  • Near ±.30 = medium
  • Near ±.50 = large
  • Most psychology correlations (except reliability coefficients) are small or medium

Don't confuse: The sign (+ or −) indicates direction, not strength. Pearson's r values of +.30 and −.30 are equally strong; one is moderately positive, the other moderately negative.

As the absolute value of r moves toward 1.00, points on a scatterplot come closer to falling on a single straight line.
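The strength-versus-direction point can be checked with a small pure-Python sketch. The helper below implements the standard Pearson's r formula; the stress/symptom numbers are invented for illustration (they are not data from the excerpt):

```python
def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: stress vs. physical symptoms (positive relationship)
stress   = [1, 2, 3, 4, 5, 6]
symptoms = [2, 3, 3, 5, 6, 6]

# Hypothetical scores: stress vs. immune functioning (negative relationship);
# this is the symptoms list reversed, so the association flips direction only
immune   = [6, 6, 5, 3, 3, 2]

r_pos = pearson_r(stress, symptoms)
r_neg = pearson_r(stress, immune)
print(round(r_pos, 2), round(r_neg, 2))
```

The two coefficients have the same absolute value but opposite signs, illustrating that the sign carries direction, not strength.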

⚠️ When Pearson's r is misleading

⚠️ Nonlinear relationships

Pearson's r is a good measure only for linear relationships (points approximated by a straight line), not nonlinear relationships (points better approximated by a curve).

Example: The relationship between hours of sleep per night and depression level is curvilinear: people who get about eight hours are least depressed, while those who get too little or too much are more depressed, so a plot of depression against sleep forms a "U". Even though this is a fairly strong relationship, Pearson's r would be close to zero because the points are not well fit by a straight line.

Best practice: Make a scatterplot first to confirm the relationship is approximately linear before using Pearson's r.
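A quick demonstration of why the scatterplot check matters: below, y depends perfectly on x through a curve (an inverted U), yet Pearson's r comes out exactly zero. The variable names and numbers are invented for illustration:

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Perfectly curvilinear (inverted-U) relationship: y = -(x - 4)**2
hours     = [1, 2, 3, 4, 5, 6, 7]          # illustrative x values
wellbeing = [-(h - 4) ** 2 for h in hours]  # peaks in the middle

r = pearson_r(hours, wellbeing)
print(r)  # 0.0: a straight line fits this curve no better than no line at all
```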

⚠️ Restriction of range

Restriction of range: when one or both variables have a limited range in the sample relative to the population, Pearson's r can be misleading.

Example: There is a strong negative correlation (r = −.77) between age and enjoyment of hip hop music in the general population. However, if data are collected only from 18- to 24-year-olds, the correlation in that restricted age range is 0, making the relationship seem weak or nonexistent.

Best practices:

  • Design studies to avoid restriction of range (e.g., collect data from people of a wide age range if age is a primary variable).
  • Examine data for possible restriction of range.
  • Interpret Pearson's r in light of any restrictions found.
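The restriction-of-range effect can be reproduced with made-up data patterned after the excerpt's age/hip-hop example (the specific numbers are invented so the restricted subsample comes out exactly uncorrelated):

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical age vs. enjoyment of hip hop across a wide age range
ages      = [18, 20, 22, 24, 30, 40, 50, 60, 70]
enjoyment = [ 8,  9,  9,  8,  7,  5,  4,  2,  1]

r_full = pearson_r(ages, enjoyment)  # strong negative across the full range

# Same population, but sampling only the 18- to 24-year-olds
r_restricted = pearson_r(ages[:4], enjoyment[:4])  # relationship vanishes

print(round(r_full, 2), round(r_restricted, 2))
```

In the full sample the correlation is strongly negative; restricted to the narrow age band, it drops to zero even though the underlying relationship has not changed.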

🚫 Why correlation does not imply causation

🚫 The core principle

"Correlation does not imply causation" is a fundamental rule in interpreting correlational research.

Example: A 2012 study found a positive correlation (r = +.79) between a nation's per capita chocolate consumption and the number of Nobel prizes awarded to its citizens. This does not mean eating chocolate causes people to win Nobel prizes.

The excerpt identifies two reasons correlation does not imply causation: the directionality problem and the third-variable problem.

🔄 The directionality problem

Directionality problem: Two variables, X and Y, can be statistically related because X causes Y or because Y causes X.

Example: People who exercise are happier on average than people who do not exercise.

  • Possible interpretation 1: Exercising causes happiness.
  • Possible interpretation 2: Happiness causes exercise (perhaps being happy gives people more energy or leads them to seek social opportunities at the gym).

Both interpretations are consistent with the correlation, so the correlation alone cannot determine which is correct.

🔺 The third-variable problem

Third-variable problem: Two variables, X and Y, can be statistically related not because X causes Y or Y causes X, but because some third variable, Z, causes both X and Y.

Spurious correlations: correlations that result from a third variable.

Example (chocolate and Nobel prizes): The correlation probably reflects geography—European countries tend to have both higher per capita chocolate consumption and greater investment in education and technology per capita.

Example (exercise and happiness): Physical health could be the third variable causing both—being physically healthy could cause people to exercise and cause them to be happier.

🛠️ How to address causation questions

The most effective way to address directionality and third-variable problems is to conduct an experiment.

Example: Instead of measuring how much people exercise, a researcher could randomly assign half the participants to run on a treadmill for 15 minutes and the rest to sit on a couch for 15 minutes.

Why this matters:

  • If exercisers end up in more positive moods, it cannot be because their moods affected how much they exercised (the researcher determined exercise through random assignment).
  • It cannot be because a third variable (e.g., physical health) affected both exercise and mood.
  • Experiments eliminate the directionality and third-variable problems, allowing researchers to draw firm conclusions about causal relationships.

📰 Media misinterpretation

The excerpt warns that many journalists do not understand that correlation does not imply causation.

Example headline: "Lots of Candy Could Lead to Violence" (based on a study showing children who ate candy daily were more likely to be arrested for violent offenses later in life).

Critical thinking questions:

  • Could candy really "lead to" violence?
  • What alternative explanations exist? (Perhaps a third variable, such as parenting style or socioeconomic factors, causes both candy consumption and later behavior.)
  • How could the headline be rewritten to avoid misleading readers?

📍 Data collection in correlational research

📍 Flexibility of settings

The defining feature is non-manipulation, not the location or method of measurement.

  • Variables can be measured in a laboratory (e.g., computerized tasks for backward digit span and risky decision-making).
  • Variables can be measured in natural settings (e.g., asking people at a shopping mall about environmental attitudes and shopping habits).
  • Both approaches are correlational as long as no variable is manipulated.

Don't confuse: The setting (lab vs field) does not determine whether a study is correlational or experimental; manipulation does.

30

Complex Correlation

30. Complex Correlation

🧭 Overview

🧠 One-sentence thesis

Complex correlational research allows researchers to measure multiple variables simultaneously and use statistical techniques like partial correlation and multiple regression to explore possible causal relationships while controlling for third variables, though it cannot definitively establish causation.

📌 Key points (3–5)

  • Why use complex designs: Researchers measure several variables at once to assess relationships among them, especially when experiments are impractical or unethical.
  • Statistical control vs experimental control: Instead of random assignment, researchers measure potential third variables and include them in statistical analyses (partial correlation, multiple regression).
  • Common confusion: "Correlation does not imply causation" is true, but complex correlational methods can rule out some alternative explanations and show patterns consistent with certain causal interpretations.
  • Factor analysis: A technique that organizes many variables into smaller clusters (factors) representing underlying constructs.
  • Prediction vs description: Regression allows prediction of one variable from another; multiple regression shows whether a predictor contributes over and above other predictors.

📊 Measuring multiple variables

📊 Basic approach

  • Researchers measure several variables—either binary (0/1) or continuous—and assess statistical relationships among them.
  • Variables can be transformed if needed (e.g., skewed distributions converted to binary: did not occur = 0, did occur = 1).

🔬 Example studies from the excerpt

  • Optimism and heart health: Radcliffe and Klein measured optimism and found it related to better health behaviors, knowledge of risk factors, and accurate risk beliefs.
  • Relationship aggression: Jouriles et al. measured physical and psychological aggression (as binary variables) and psychological distress; found that physical aggression was moderately linked to psychological aggression, which in turn related to distress.
  • Need for Cognition Scale validation: Cacioppo and Petty measured need for cognition alongside intelligence, social desirability, and dogmatism to validate their new scale.

🗂️ Correlation matrix

Correlation matrix: a table showing the correlation (Pearson's r) between every possible pair of variables in a study.

  • Only half the matrix is filled because the other half would duplicate the same information.
  • Diagonal values (correlation of a variable with itself) are always +1.00, so they are replaced with dashes.
  • Example from the excerpt: need for cognition correlated +.39 with intelligence, −.27 with dogmatism, +.08 with social desirability.
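A correlation matrix is just Pearson's r applied to every pair of variables. A minimal sketch, using three made-up variables loosely named after the excerpt's example (the numbers themselves are invented, not Cacioppo and Petty's data):

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Three hypothetical measures from the same participants
data = {
    "need_for_cognition": [3, 5, 4, 6, 7, 5],
    "intelligence":       [90, 110, 100, 115, 125, 105],
    "dogmatism":          [7, 4, 6, 3, 2, 5],
}

# Print only the lower half of the matrix; the diagonal (a variable
# with itself) is always +1.00, and the upper half duplicates the lower.
names = list(data)
for i, a in enumerate(names):
    for b in names[:i]:
        print(f"{a} x {b}: {pearson_r(data[a], data[b]):+.2f}")
```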

🧩 Factor analysis

🧩 What it does

Factor analysis: a statistical technique that organizes many conceptually similar variables into a smaller number of clusters (factors), where variables are strongly correlated within each cluster but weakly correlated between clusters.

  • Each cluster is interpreted as multiple measures of the same underlying construct.
  • Example: mental tasks often organize into mathematical intelligence and verbal intelligence factors.

🎵 Music preferences example

  • Rentfrow and Gosling asked students to rate 14 music genres, then used factor analysis to identify four factors:
    • Reflective and Complex: blues, jazz, classical, folk
    • Intense and Rebellious: rock, alternative, heavy metal
    • Upbeat and Conventional: country, soundtrack, religious, pop
    • Energetic and Rhythmic: rap/hip-hop, soul/funk, electronica

⚠️ Important clarifications

  • Factors are not categories: People are not "either/or." Factors operate independently—someone high in one factor can be high or low in another.
  • Interpretation is up to researchers: Factor analysis reveals structure; researchers must label factors and explain why that structure exists (e.g., Big Five personality factors may be controlled by different genes).

🔍 Exploring causal relationships

🔍 The challenge

  • The saying "correlation does not imply causation" is true—correlational research cannot unambiguously establish causation.
  • However, complex correlational research can rule out some alternative explanations through statistical control of potential third variables.

🛠️ Partial correlation

Partial correlation: a technique that examines the relationship between two variables while statistically controlling for one or more potential third variables.

  • Instead of controlling variables through random assignment (as in experiments), researchers measure them and include them in the analysis.
  • The technique examines the part of each variable that is independent of the third variable.

📺 Example: TV violence and aggression

Scenario: A researcher wants to know if violent TV viewing relates to aggression, but worries that socioeconomic status (SES) might be a third variable.

Steps:

  1. Measure violent TV viewing, aggression, and SES.
  2. Find the simple correlation (say, +.35).
  3. Use partial correlation to control for SES.

Possible outcomes:

  • If partial correlation remains +.34 → SES is not driving the relationship.
  • If partial correlation drops to +.03 → SES is a third variable driving the relationship.
  • If partial correlation drops to +.20 → SES accounts for some, but not all, of the relationship.
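The partial correlation can be computed directly from the three pairwise correlations with the standard first-order formula. In this sketch the +.35 is the excerpt's hypothetical TV/aggression correlation, while the SES correlations (.10 and .60) are assumed values chosen to reproduce the two contrasting outcomes above:

```python
def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y with Z statistically controlled."""
    return (r_xy - r_xz * r_yz) / (((1 - r_xz**2) * (1 - r_yz**2)) ** 0.5)

r_tv_agg = 0.35  # simple correlation: violent TV viewing and aggression

# If SES is only weakly related to both variables, controlling for it
# changes little:
weak = partial_r(r_tv_agg, r_xz=0.10, r_yz=0.10)    # about +.34

# If SES is strongly related to both, the partial correlation collapses:
strong = partial_r(r_tv_agg, r_xz=0.60, r_yz=0.60)  # about -.02, essentially zero

print(round(weak, 2), round(strong, 2))
```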

⚠️ Limitations

  • Partial correlation does not solve the directionality problem (which variable causes which).
  • There may be other third variables the researcher did not consider or control.

📈 Regression techniques

📈 Simple regression

Regression: a statistical technique that allows researchers to predict one variable (outcome/criterion variable) given another (predictor variable).

  • Formula: Y = b₁X₁
    • Y = predicted score on the outcome variable
    • X₁ = score on the predictor variable
    • b₁ = regression weight (slope of the line)
  • The regression weight indicates how much Y changes for each one-unit change in X.
  • Example: Once we know IQ correlates with GPA, we can use IQ scores to predict GPA.
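A least-squares sketch of the IQ-to-GPA prediction idea. The (IQ, GPA) pairs are invented for illustration, and an intercept term is included so the fitted line need not pass through the origin:

```python
# Hypothetical (IQ, GPA) pairs
iq  = [95, 100, 105, 110, 115, 120]
gpa = [2.4, 2.7, 2.9, 3.2, 3.3, 3.6]

n = len(iq)
mx, my = sum(iq) / n, sum(gpa) / n

# Least-squares slope (regression weight) and intercept
b1 = (sum((x - mx) * (y - my) for x, y in zip(iq, gpa))
      / sum((x - mx) ** 2 for x in iq))
b0 = my - b1 * mx

def predict(x):
    """Predicted GPA for a given IQ score."""
    return b0 + b1 * x

print(round(predict(112), 2))  # predicted GPA for an IQ of 112
```

The slope b1 is how much predicted GPA changes per one-point change in IQ; once b0 and b1 are estimated, any new IQ score yields a prediction.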

📈 Multiple regression

  • Involves measuring several predictor variables (X₁, X₂, X₃, …) and using them to predict an outcome variable (Y).
  • Formula: Y = b₁X₁ + b₂X₂ + b₃X₃ + … + bᵢXᵢ
  • Each regression weight shows how much that predictor contributes to the outcome.

🎯 Key advantage: "over and above"

  • Multiple regression shows whether a predictor contributes to the outcome over and above other predictors (i.e., after statistically controlling for them).
  • Example from the excerpt: Does income relate to happiness over and above health? Does health relate to happiness over and above income?
    • Multiple regression can answer both by controlling for the overlap between income and health.
    • Research has shown both income and health make extremely small contributions to happiness except in cases of severe poverty or illness.
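The "over and above" logic in miniature: in the invented data below, the outcome is driven entirely by X₁, and X₂ only looks predictive because it correlates with X₁. Solving the two-predictor normal equations gives X₂ a weight of zero once X₁ is in the model (variable roles are labeled after the excerpt's income/health example purely for illustration):

```python
def two_predictor_regression(x1, x2, y):
    """Fit Y = b0 + b1*X1 + b2*X2 by least squares (normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

x1 = [1, 2, 3, 4, 5, 6]       # e.g., income (illustrative units)
x2 = [2, 1, 4, 3, 6, 5]       # e.g., health; correlated with x1
y  = [2 * v + 1 for v in x1]  # outcome depends only on x1

b0, b1, b2 = two_predictor_regression(x1, x2, y)
print(b0, b1, b2)  # b2 = 0: x2 contributes nothing over and above x1
```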

🔄 Don't confuse

  • Prediction vs causation: Regression allows prediction, but purely correlational approaches cannot definitively establish that one variable causes another.
  • Best use: Show patterns of relationships consistent with some causal interpretations and inconsistent with others.
31

Qualitative Research

31. Qualitative Research

🧭 Overview

🧠 One-sentence thesis

Qualitative research complements quantitative methods by generating new research questions, providing rich contextual descriptions of human experience, and revealing the lived reality of participants in ways that statistical analysis cannot capture.

📌 Key points (3–5)

  • What qualitative research does differently: starts with broad questions, collects detailed non-numerical data from small samples, and uses interpretive (not statistical) analysis to understand participant experience.
  • Core strengths: generates novel research questions, provides "thick description" of behavior in real-world contexts, and conveys the "lived experience" of participants.
  • How it differs from quantitative: quantitative gives precise answers to specific questions and draws general conclusions; qualitative explores depth, context, and subjective meaning.
  • Common confusion: the distinction is less about data collection method and more about how data are analyzed—coding interview transcripts statistically makes it quantitative.
  • Mixed-methods approach: combining both methods (hypothesis generation + testing, or triangulation) produces richer, more complete understanding than either alone.

🔍 What qualitative research is

🔍 Core definition and approach

Qualitative research: an approach that begins with less focused research questions, collects large amounts of relatively "unfiltered" data from relatively small numbers of individuals, and describes data using nonstatistical techniques.

  • Researchers are usually less concerned with drawing general conclusions about human behavior than with understanding participant experience in detail.
  • The focus is on what it is like from the participants' own perspectives.
  • Example: Lindqvist and colleagues interviewed families of 10 teenage suicide victims in their homes, asking them to talk about the victim and anything else they wanted to share—not testing a specific hypothesis but exploring the variety of reactions families had.

📊 Contrast with quantitative research

| Dimension | Qualitative | Quantitative |
| --- | --- | --- |
| Starting point | Broad research question | Focused hypothesis |
| Sample | Small number, in-depth | Large number, less depth |
| Data type | Non-numerical, "unfiltered" | Numerical, structured |
| Analysis | Interpretive techniques (grounded theory, thematic analysis) | Statistical techniques |
| Goal | Understand experience and context | Draw general conclusions |
| Conclusions | Based on researcher interpretation | Based on statistical analysis |

Don't confuse: The key distinction is analysis method, not just data collection—if you code qualitative interviews numerically and run statistics, it becomes quantitative research.

💪 Strengths of qualitative research

💡 Generating new research questions

  • Qualitative research excels at discovering questions that researchers might not have thought to ask.
  • Example: The suicide study revealed that families struggled with the question of "why," especially when the suicide was unexpected—this relationship can now be tested quantitatively, but it's unclear whether the question would have arisen without sitting down and listening to families.

📖 Thick description

  • Qualitative research provides rich, detailed descriptions of human behavior in real-world contexts.
  • "Thick description" (Geertz, 1973) captures nuance and complexity that numbers cannot convey.
  • Example: All families in the suicide study spontaneously offered to show the interviewer the victim's bedroom or the suicide location—revealing the importance of these physical spaces. A quantitative study would be unlikely to discover this detail.

🧑‍🤝‍🧑 Conveying lived experience

  • Qualitative research communicates what it is actually like to be a member of a particular group in a particular situation.
  • The "lived experience" of research participants comes through in their own words and perspectives.
  • This depth of understanding is not accessible through statistical summaries alone.

🛠️ Data collection methods

🎤 Interviews

  • Most common approach for psychological qualitative research.
  • Unstructured interviews: small number of general questions or prompts; participants talk about what interests them.
  • Structured interviews: strict script that the interviewer follows exactly.
  • Semi-structured interviews (most common): a few consistent questions with follow-up questions on topics that emerge.
  • Interviews can be lengthy and detailed but are conducted with relatively small samples.
  • Example: The suicide study used unstructured interviews because researchers knew that disclosure about such a sensitive topic should be led by families, not by researchers.

👥 Focus groups

  • Small groups of people participate together in interviews focused on a particular topic or issue.
  • Interaction among participants can sometimes bring out more information than one-on-one interviews.
  • Standard technique in business and industry for understanding consumer preferences.
  • Content is usually recorded and transcribed for later analysis.

Watch out for: Group dynamics can affect results—desire to be liked may lead to inaccurate answers; highly extraverted participants can dominate discussions.

🔬 Data analysis in qualitative research

🌱 Grounded theory approach

Grounded theory: an approach where researchers start with the data and develop a theory or interpretation that is "grounded in" those data.

  • Reverses the quantitative process: instead of starting with theory → hypothesis → data, qualitative researchers go data → themes → theory.
  • Analysis stages:
    1. Identify ideas that are repeated throughout the data
    2. Organize these ideas into a smaller number of broader themes
    3. Write a theoretical narrative—an interpretation of the data in terms of identified themes

📝 Example: Postpartum depression study

Abrams and Curran studied postpartum depression symptoms among low-income mothers through unstructured interviews with 19 participants.

Five broad themes identified:

| Theme | Repeating ideas |
| --- | --- |
| Ambivalence | "I wasn't prepared for this baby," "I didn't want to have any more children" |
| Caregiving overload | "Please stop crying," "I need a break," "I can't do this anymore" |
| Juggling | "No time to breathe," "Everyone depends on me," "Navigating the maze" |
| Mothering alone | "I really don't have any help," "My baby has no father" |
| Real-life worry | "I don't have any money," "Will my baby be OK?" "It's not safe here" |
  • The theoretical narrative focused on participants' experience of symptoms not as an abstract "affective disorder" but as closely tied to the daily struggle of raising children alone under difficult circumstances.
  • Supported by many direct quotations from participants themselves.

🤝 Mixed-methods research

🔄 Combining quantitative and qualitative

Mixed-methods research: the combination of quantitative and qualitative approaches in a single study.

Many researchers now agree that the two approaches can and should be combined.

🧪 Hypothesis generation + testing

  • Use qualitative research to generate hypotheses by exploring phenomena in depth.
  • Use quantitative research to test those hypotheses with larger samples and statistical analysis.
  • Example: A qualitative study might suggest that families experiencing unexpected suicide have more difficulty resolving "why" questions; a quantitative study could then measure these variables in a large sample to test the relationship.

🔺 Triangulation

  • Use both methods simultaneously to study the same questions and compare results.
  • If results converge: they reinforce and enrich each other.
  • If results diverge: they suggest an interesting new question—why do they diverge and how can they be reconciled?

📐 Example: Female engineering students study

Trenor and colleagues investigated the experience of female engineering students:

  • Phase 1 (quantitative): Survey showed no statistical differences in sense of belonging across ethnic groups.
  • Phase 2 (qualitative): Interviews revealed that many minority students reported how cultural diversity enhanced their sense of belonging.
  • Without the qualitative component, researchers might have drawn the wrong conclusion that ethnicity doesn't matter for sense of belonging.

Don't confuse: Some say quantitative is best for identifying behaviors (the "what") and qualitative for understanding meaning (the "why"), but researchers increasingly argue for breaking down this artificial divide—both methods can investigate the same questions in complementary ways.

⚖️ Addressing criticisms

🎯 Quantitative researchers' concerns

  • Lack of objectivity
  • Difficult to evaluate reliability and validity
  • Do not allow generalization beyond those studied

Response: Qualitative researchers are well aware of these issues and have developed frameworks for addressing objectivity, reliability, validity, and generalizability (beyond the scope of this excerpt).

🎨 Qualitative researchers' concerns

  • Quantitative methods overlook the richness of human behavior and experience
  • They answer only simple questions about easily quantifiable variables

Response: Quantitative researchers do not believe all human behavior can be described by a small number of variables; they use simplification as a strategy for uncovering general principles, not as a claim that simplification captures everything.


32. Observational Research

🧭 Overview

🧠 One-sentence thesis

Observational research systematically records behavior without manipulation or control; it provides detailed descriptions of variables but cannot establish causal relationships.

📌 Key points (3–5)

  • What observational research is: Non-experimental studies that systematically observe and record behavior to describe variables or obtain snapshots of characteristics.
  • Main types: Naturalistic observation (behavior in natural environment), participant observation (researcher becomes active member), structured observation (specific behaviors in controlled settings), case studies (in-depth individual examination), and archival research (analyzing existing data).
  • Data can be qualitative or quantitative: Methods may collect qualitative data (detailed descriptions), quantitative data (numerical measurements), or both (mixed-methods).
  • Common confusion—disguised vs undisguised: Disguised observation means participants don't know they're being studied (less reactive but raises ethical concerns); undisguised means participants are aware (more ethical but may cause reactivity).
  • Key limitation: Cannot determine causation because nothing is manipulated or controlled—only describes what happens, not why.

🔬 Naturalistic Observation

🌳 What it involves

Naturalistic observation: an observational method that involves observing people's behavior in the environment in which it typically occurs.

  • This is a type of field research (not laboratory research).
  • Researchers observe behavior where it naturally happens—grocery stores, playgrounds, psychiatric wards, or in Jane Goodall's case, chimpanzees in East Africa.
  • The goal is to see authentic behavior in real-world contexts.

🎭 Disguised vs undisguised approaches

| Approach | What it means | Advantages | Concerns |
| --- | --- | --- | --- |
| Disguised | Participants unaware they're being studied | Less reactive; higher validity; more natural behavior | Ethical issues if privacy is violated |
| Undisguised | Participants know they're being observed | More ethical; can obtain informed consent | Reactivity—people may act differently when watched |

Ethical guideline: Disguised observation is acceptable if participants remain anonymous and behavior occurs in public settings where people have no expectation of privacy.

Example: Observing shoppers putting items in carts is acceptable (public behavior); observing bathroom behavior violates privacy expectations.

🎬 The Hawthorne effect and habituation

  • Reactivity: when a measure changes participants' behavior.
  • Hawthorne effect: people act differently when they know they're being observed and studied.
  • Don't confuse: Initial reactivity vs long-term habituation—people often become used to being observed and eventually behave naturally (as seen in reality TV shows like Big Brother).

👥 Participant Observation

🤝 How it differs from naturalistic observation

Participant observation: researchers become active participants in the group or situation they are studying.

  • Very similar to naturalistic observation—same natural settings, same types of data (interviews, notes, documents, photographs).
  • Only difference: researchers actively join the group instead of just watching from outside.
  • Rationale: Some information is only accessible to or interpretable by active group members.

🎭 Disguised vs undisguised participation

Disguised participant observation: Researchers pretend to be regular members and conceal their researcher identity.

Example: Leon Festinger and his colleagues infiltrated a doomsday cult (the Seekers) to study how members coped when their apocalypse prediction failed on December 21, 1954. Members didn't know researchers were studying them. This work led to insights about cognitive dissonance.

Example: Rosenhan's 1973 study where researchers posed as psychiatric patients to observe how staff treated patients in psychiatric hospitals.

Undisguised participant observation: Researchers join the group but disclose their true identity as researchers.

Example: Amy Wilkins spent 12 months attending a university religious organization's meetings, participating and interviewing members while they knew she was a researcher studying them.

⚖️ Ethical and practical trade-offs

  • Ethical concerns with disguised approach: No informed consent possible; involves deception by withholding researcher identity.
  • When disguised may be necessary: Accessing protective groups (like cults) that wouldn't allow known researchers.
  • Advantage of disguised: Less prone to reactivity.
  • Limitations of both approaches: Researcher presence may change group dynamics; developing relationships may reduce objectivity and increase experimenter bias.

📊 Structured Observation

🎯 What makes it "structured"

Structured observation: investigator makes careful observations of one or more specific behaviors in a particular setting that is more structured than naturalistic or participant observation settings.

  • Often conducted in laboratory environments or natural settings that researchers have structured in some way.
  • Key difference from naturalistic/participant observation: Emphasis on gathering quantitative rather than qualitative data.
  • Researchers focus on a limited set of specific behaviors rather than recording everything.
  • This allows behaviors to be quantified and measured.

🚶 Examples of structured observation

Levine & Norenzayan's pace-of-life study: Measured how long pedestrians took to walk 60 feet in different countries.

  • Controlled conditions: main business hours, clear summer days, flat unobstructed sidewalks, only solo walkers, excluded children/disabled/window-shoppers.
  • Found: Canadians and Swedes walked 60 feet in under 13 seconds; Brazilians and Romanians took close to 17 seconds.
  • Why control conditions: Makes data collection manageable and controls extraneous variables (e.g., weather effects).

Kraut & Johnston's bowling study: Observed bowlers' facial reactions when facing pins vs when turning toward companions.

  • Created specific list of reactions to code: "closed smile," "open smile," "laugh," "neutral face," "look down," "look away," "face cover."
  • Found: Bowlers rarely smiled facing the pins but smiled much more after turning toward companions—suggesting smiling is social communication, not just happiness expression.

Cohen's culture-of-honor study: Observers rated emotional reactions of participants who were deliberately bumped and insulted by a confederate in a hallway.

  • Hypothesis: Participants from southern U.S. (culture of honor) would react with more aggression than northern participants.
  • Prediction was supported by observational data.

🔢 Coding process

Coding: a process that requires clearly defining a set of target behaviors, then categorizing participants in terms of which behaviors they engaged in and how many times.

  • Observers must define target behaviors so different observers code them the same way.
  • Interrater reliability is critical: Multiple raters code the same behaviors independently; researchers must show close agreement.
  • Example: Kraut and Johnston had two observers independently code videotaped reactions; they agreed 97% of the time (good interrater reliability).
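
Agreement figures like Kraut and Johnston's 97% are simply the proportion of observations that two raters coded identically. A minimal Python sketch (the reaction codes below are hypothetical illustrations, not the study's data):

```python
# Percent agreement: share of observations two raters coded identically.
def percent_agreement(rater_a, rater_b):
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same observations")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

# Hypothetical codes for five videotaped reactions.
rater_1 = ["open smile", "neutral face", "laugh", "look down", "open smile"]
rater_2 = ["open smile", "neutral face", "laugh", "look away", "open smile"]

print(percent_agreement(rater_1, rater_2))  # 80.0
```

Percent agreement is the simplest index of interrater reliability; chance-corrected measures such as Cohen's kappa are often reported alongside it.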

⚖️ Advantages and limitations

Advantages:

  • Far more efficient than naturalistic/participant observation (focused on specific behaviors saves time and expense).
  • Structured environment encourages behaviors of interest (less waiting time).
  • Greater control over the environment.

Limitations:

  • More control may make environment less natural (decreases external validity).
  • Unclear whether laboratory observations generalize to real world.
  • Often not disguised, so more concerns with reactivity.

🔍 Case Studies

📋 What case studies examine

Case study: an in-depth examination of an individual.

  • Sometimes completed on social units (e.g., a cult) or events (e.g., a natural disaster).
  • Most commonly in psychology: detailed description and analysis of an individual.
  • Often the individual has a rare/unusual condition, disorder, or specific brain damage.

🧠 The case of HM

Background: HM had severe epilepsy; in 1953, surgeon removed large sections of his hippocampus to stop seizures.

Outcome:

  • Surgery resolved epilepsy; IQ and personality unaffected.
  • But HM developed anterograde amnesia—lost ability to form new long-term memories.
  • Short-term memory preserved (could carry on conversations, remember short strings of information).
  • Could not consolidate new information from short-term to long-term memory.
  • Would completely forget conversations after they ended.

Why important:

  • Suggested dissociation between short-term and long-term memory (two different abilities in different brain areas).
  • Showed temporal lobes are crucial for memory consolidation.
  • Provided insights into normal memory processes through studying an abnormal case.

📚 Other famous case studies

  • Anna O. (Sigmund Freud): Used to illustrate psychoanalysis principles; woman who couldn't drink fluids, which Freud attributed to repressed memory.
  • Little Albert (Watson & Rayner, 1920): Child who allegedly learned to fear white rat and other furry objects through conditioning.

🔬 Methods used in case studies

  • Tend to be more qualitative in nature.
  • In-depth, often longitudinal examination.
  • May or may not observe in natural setting.
  • Focus on detailed descriptions rather than statistical analyses (though some quantitative data may be included, e.g., comparing depression scores before/after treatment).
  • Various tools: interviews, naturalistic observation, structured observation, psychological testing (IQ tests), physiological measurements (brain scans).

⚖️ Value and limitations

Value:

  • Provide detailed analysis not found in other methods.
  • Greater insights from detailed examination.
  • May reveal what to study more extensively in future controlled research.
  • Often the only way to study rare conditions (impossible to find large enough sample for quantitative methods).
  • Can provide insights into normal behavior even when studying rare individuals.

Critical limitations:

  • Cannot provide evidence for theories—only inspiration to formulate hypotheses that must then be tested with rigorous quantitative methods.
  • Internal validity problems: Lack proper experimental controls; cannot determine causation; cannot rule out alternative explanations.
    • Example: HM's surgeon may have accidentally lesioned another brain area that contributed to memory problems.
  • External validity problems: Single individual (typically abnormal) means results cannot be generalized to others.
  • Researcher bias: Ample opportunity for theoretical biases to color or bias the case description.
    • Example: Accusations that the researcher who studied HM destroyed contradictory data that didn't support her theory.

Don't confuse: Case studies can inspire theories but cannot support them—they suffer from problems with both internal and external validity.

📂 Archival Research

📊 What archival research involves

Archival research: analyzing archival data that have already been collected for some other purpose.

  • Another approach often considered observational research.
  • Uses existing data rather than collecting new observations.

🏷️ Example: Implicit egotism study

Pelham and colleagues studied "implicit egotism"—tendency for people to prefer things similar to themselves.

  • Examined Social Security records.
  • Found: Women named Virginia, Georgia, Louise, and Florence were especially likely to have moved to states of Virginia, Georgia, Louisiana, and Florida, respectively.
  • Measurement was relatively straightforward (counting names in records).

📝 Content analysis approach

Peterson and colleagues studied relationship between optimism and health using data collected decades earlier.

Process:

  1. In 1940s: Healthy male college students completed open-ended questionnaire about difficult wartime experiences.
  2. In late 1980s: Researchers reviewed responses to measure "explanatory style" (habitual ways of explaining bad events).
    • Pessimistic people: blame themselves, expect long-term negative consequences affecting many life aspects.
    • Optimistic people: blame outside forces, expect limited negative consequences.
  3. Identified all negative events and causal explanations from questionnaires; wrote them on index cards.
  4. Separate raters rated each explanation on three dimensions of optimism-pessimism.
  5. Averaged ratings to produce explanatory style score for each participant.
  6. Assessed statistical relationship between undergraduate explanatory style and health at approximately age 60.

Result: More optimistic as undergraduates → healthier as older men (Pearson's r = +.25).

🔍 Content analysis defined

Content analysis: a family of systematic approaches to measurement using complex archival data.

  • Like structured observation, requires specifying keywords, phrases, or ideas, then finding all occurrences in the data.
  • Occurrences can be counted, timed (e.g., time devoted to entertainment topics on news), or analyzed in various ways.
  • Requires systematic, clearly defined procedures (similar to coding in structured observation).
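
The counting step can be sketched directly: fix the keywords and matching rule in advance, then tally every whole-word occurrence. The transcript snippet and keyword list here are hypothetical illustrations, not data from the studies discussed.

```python
import re
from collections import Counter

# Hypothetical interview excerpt (not data from the studies discussed).
transcript = """I kept asking why. Nobody could tell me why it happened.
We just wanted some help, any help at all."""

keywords = {"why", "help"}

# Lowercase, split into words, and count occurrences of each keyword.
tokens = re.findall(r"[a-z']+", transcript.lower())
counts = Counter(t for t in tokens if t in keywords)

print(counts["why"], counts["help"])  # 2 2
```

Occurrences could instead be timed or rated; what makes the procedure systematic is that the keywords and the matching rule are defined before counting begins.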

33. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Non-experimental research designs—including correlational, observational, and qualitative approaches—cannot establish causation but serve essential roles in describing relationships, making predictions, and exploring complex patterns among variables.

📌 Key points (3–5)

  • Core distinction: Non-experimental research lacks manipulation of an independent variable, unlike experimental designs.
  • Causation vs correlation: A statistical relationship between X and Y does not prove X causes Y; Y might cause X, or a third variable Z might cause both.
  • Common confusion: Internal validity hierarchy—experimental research is high, correlational is low, and quasi-experimental falls in between.
  • Multiple methods serve different purposes: Correlational research establishes reliability/validity and makes predictions; qualitative research explores broader questions with detailed data; observational methods range from naturalistic to structured approaches.
  • Complementary approaches: Quantitative and qualitative methods work together—qualitative research can generate hypotheses that quantitative research then tests.

🔬 Types of non-experimental research

🔬 Correlational research

Correlational research: focuses on statistical relationships between variables that are measured but not manipulated.

  • Measures two variables and assesses the relationship between them without manipulating an independent variable.
  • Low in internal validity: Cannot establish causal relationships.
  • What it can do: Establish reliability and validity, provide converging evidence, describe relationships, and make predictions.
  • Example: A researcher measures driver impulsivity and examines its statistical relationship with number of traffic tickets received.

👁️ Observational research

Observational research: participants are observed and their behavior is recorded without the researcher interfering or manipulating any variables.

Five main approaches:

| Approach | Definition | Key feature |
| --- | --- | --- |
| Naturalistic observation | Observe people in their natural setting | No interference with natural environment |
| Participant observation | Researcher becomes an active member of the group | Insider perspective |
| Structured observation | Code a small number of behaviors quantitatively | Systematic, quantifiable coding |
| Case studies | Collect in-depth information on a single individual | Deep dive into one case |
| Archival research | Analyze existing data | Uses pre-existing records |

🗣️ Qualitative research

Qualitative research: an important alternative to quantitative research that generally involves asking broader research questions, collecting more detailed data (e.g., interviews), and using non-statistical analyses.

  • Asks broader research questions than quantitative approaches.
  • Collects more detailed data (such as interviews).
  • Uses non-statistical analyses.
  • Complementary role: Can generate hypotheses that quantitative research then tests.
  • Example: Conducting detailed interviews with unmarried teenage fathers to learn about their feelings and thoughts about their role, then summarizing their feelings in a written narrative.

⚠️ The causation problem

⚠️ Correlation does not imply causation

Correlation does not imply causation: A statistical relationship between two variables, X and Y, does not necessarily mean that X causes Y.

Three possible explanations for any correlation:

  1. X causes Y
  2. Y causes X (reverse causation)
  3. A third variable Z causes both X and Y

Don't confuse: A strong statistical relationship with a causal relationship—they are not the same thing.

🔄 The directionality problem

  • When two variables are correlated, we cannot determine which one causes the other from correlation alone.
  • Example: People who exercise more tend to weigh less—but does exercise cause lower weight, or does lower weight make people more likely to exercise?

🎯 Third variable problem

  • A hidden variable might cause both observed variables.
  • Example: People who eat more lobster tend to live longer—but this doesn't mean lobster causes longevity; wealth (third variable) might enable both lobster consumption and better healthcare.

📊 Understanding correlation coefficients

📊 Range and meaning

  • Range: Correlation coefficients can range from -1 to +1.
  • Sign indicates direction: Positive (+) means variables move together; negative (-) means they move in opposite directions.
  • Numerical value indicates strength: Closer to -1 or +1 means stronger relationship; closer to 0 means weaker relationship.
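
These properties follow from the formula: Pearson's r is the covariance of the two variables scaled by their standard deviations, which forces the result into the -1 to +1 range. A minimal Python sketch with hypothetical scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance scaled into the range -1 to +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical scores: a perfect positive and a perfect negative relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ +1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0
```

Real data fall between the extremes: the closer |r| is to 1, the more tightly the points cluster around a straight line.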

🔢 Complex correlational research

  • Explores relationships among several variables in the same study.
  • Advanced techniques: Partial correlation and multiple regression.
  • What these can show: Patterns of relationships consistent with some causal interpretations and inconsistent with others.
  • What they cannot do: Unambiguously establish that one variable causes another.
  • Example: A multiple regression analysis might show that intelligence is not related to critical thinking course performance, but need for cognition is—revealing which variables matter most.

📋 Correlation matrix

  • A tool for displaying multiple correlations in one study.
  • Shows Pearson's r values (correlation coefficients) between all pairs of variables.
  • Example variables: depression, anxiety, self-esteem, and happiness—each pair would have its own correlation coefficient.

🎯 Internal validity hierarchy

🎯 Comparing research designs

| Research type | Internal validity | Why |
| --- | --- | --- |
| Experimental | High | Manipulates independent variable; controls confounds |
| Quasi-experimental | Medium | Some manipulation but lacks full control |
| Correlational | Low | No manipulation; cannot rule out alternative explanations |

🔍 Distinguishing experimental from non-experimental

Key question: Does the researcher manipulate an independent variable?

  • Experimental: Researcher randomly assigns patients with low back pain to either hypnosis treatment or exercise treatment, then measures pain after 3 months.
  • Non-experimental: A manager studies the correlation between new employees' college GPAs and their first-year performance reports (no manipulation, just measurement).

Don't confuse: Measuring multiple variables with manipulating variables—measurement alone is correlational, not experimental.

🧪 Practical applications and exercises

🧪 Identifying research designs

When evaluating a study, ask:

  1. Is an independent variable manipulated? (If yes → experimental; if no → non-experimental)
  2. Are variables measured and relationships assessed? (If yes → likely correlational)
  3. Are participants observed without interference? (If yes → observational)
  4. Is the focus on detailed, non-statistical data? (If yes → likely qualitative)

🤔 Evaluating case studies

When reading published case studies, consider:

  • Internal validity problems: What alternative explanations exist? What confounds weren't controlled?
  • External validity problems: Can findings generalize beyond this single case?
  • Hypothesis generation: What testable predictions emerge from this case?

💭 Qualitative vs quantitative differences

Example: A study of girls who play youth baseball

Qualitative approach:

  • Broader research questions
  • Detailed interviews (one-on-one or focus groups)
  • Rich narrative data about experiences and feelings
  • Non-statistical analysis

Quantitative approach:

  • Specific, measurable questions
  • Surveys with numerical scales
  • Statistical analysis of patterns
  • Generalizable findings

34. Overview of Survey Research

🧭 Overview

🧠 One-sentence thesis

Survey research is a flexible quantitative and qualitative method that uses self-reports and careful sampling to study everything from voting intentions to mental health prevalence, and it has evolved from early social reform documentation into a rigorous scientific approach used across psychology and social sciences.

📌 Key points (3–5)

  • Two defining characteristics: variables measured through self-reports (questionnaires/interviews) and strong emphasis on large random samples for accurate population estimates.
  • Flexible applications: can describe single variables (e.g., disorder prevalence), assess relationships between variables, or be integrated into experimental designs with manipulated independent variables.
  • Historical validation: the 1936 election demonstrated scientific survey methods' superiority when George Gallup correctly predicted Roosevelt's landslide victory using small careful samples, while a magazine's massive "straw poll" failed.
  • Common confusion: survey research is typically non-experimental (descriptive or correlational), but it can also be used within experimental research when researchers manipulate variables.
  • Modern importance: continues as a primary data collection method, particularly valuable for estimating prevalence of conditions and identifying statistical relationships in large, diverse populations.

📊 What survey research is

📊 Core definition and characteristics

Survey research: a quantitative and qualitative method with two key features—(1) variables measured using self-reports through questionnaires or interviews, and (2) considerable attention to sampling, especially large random samples.

  • Participants are called respondents in survey research terminology.
  • Random sampling is routinely used in survey research, unlike most other psychology approaches.
  • "Almost anything goes" beyond these two characteristics—surveys vary widely in length, delivery method (in-person, phone, mail, Internet), and topic.

🔬 Experimental vs non-experimental uses

Most survey research is non-experimental:

  • Describes single variables (e.g., percentage preferring a candidate, prevalence of a disorder)
  • Assesses statistical relationships between variables (e.g., income and health)

But surveys can be experimental:

  • Example from the excerpt: Lerner and colleagues' study after the September 2001 terrorist attacks used self-reports and a large national sample (survey characteristics) while manipulating emotion (anger vs. fear) to measure effects on risk judgments (an experimental characteristic)
  • This shows survey methods can test causal hypotheses, not just describe or correlate.

🕰️ Historical development and validation

🕰️ Origins in social reform

  • Roots in English and American "social surveys" around 1900
  • Early researchers and reformers documented social problems like poverty
  • By the 1930s, US government conducted surveys on economic and social conditions
  • Need for population-level conclusions drove advances in sampling procedures

🗳️ The watershed 1936 election

The failed prediction:

  • Literary Digest sent ballots (which doubled as subscription requests) to millions of Americans
  • Based on this "straw poll," editors predicted Landon would win in a landslide

The scientific triumph:

  • New pollsters used scientific methods with much smaller samples
  • Predicted the opposite: Roosevelt would win in a landslide
  • George Gallup publicly criticized Literary Digest methods before the election and guaranteed his prediction would be correct
  • Roosevelt did win in a landslide, demonstrating effectiveness of careful survey methodology

Don't confuse: bigger sample ≠ better accuracy; the Literary Digest had millions of responses but poor methodology, while Gallup's smaller scientific sample was accurate.

📚 Expansion into academia

  • Gallup's success led to the 1948 first national election survey by the Survey Research Center at University of Michigan
  • Evolved into American National Election Studies (Stanford-Michigan collaboration), continuing today
  • Spread into political science, sociology, public health as primary data collection approach
  • Psychologists in the 1930s advanced questionnaire design (e.g., the Likert scale, still in use today)
  • Strong historical link to social psychology studying attitudes, stereotypes, and prejudice
  • Early attitude researchers sought larger, more diverse samples beyond convenience samples of university students

🧠 Modern applications in psychology

🧠 Mental health prevalence research

The National Comorbidity Survey example:

  • Large-scale mental health survey conducted in the United States
  • Nearly 10,000 adults given structured mental health interviews in their homes (2002-2003)
  • Measured lifetime prevalence (percentage developing a problem sometime in their lifetime)

Key findings from the excerpt's table:

| Disorder | Total % | Female % | Male % |
| --- | --- | --- | --- |
| Generalized anxiety disorder | 5.7 | 7.1 | 4.2 |
| Obsessive-compulsive disorder | 2.3 | 3.1 | 1.6 |
| Major depressive disorder | 16.9 | 20.2 | 13.2 |
| Bipolar disorder | 4.4 | 4.5 | 4.3 |
| Alcohol abuse | 13.2 | 7.5 | 19.6 |
| Drug abuse | 8.0 | 4.8 | 11.6 |

💡 Value for research and practice

  • Helps basic researchers understand causes and correlates of mental disorders
  • Provides clinicians and policymakers with accurate information about how common disorders are
  • Enables identification of statistical relationships among disorders and other factors

🔬 Experimental survey research

The 2001 terrorism study example:

  • Researchers surveyed nearly 2,000 American teens and adults (ages 13-88) after September 2001 attacks
  • Asked about reactions to attacks and judgments of terrorism-related and other risks
  • Descriptive findings: participants overestimated most risks; females more than males; no teen-adult differences
  • Experimental manipulation: some primed for anger (asked what made them angry, shown anger-evoking photo/audio), others primed for fear (parallel fear treatment)
  • Causal finding: anger-primed participants perceived less risk than fear-primed participants, showing risk perceptions are tied to specific emotions
  • This demonstrates surveys on large diverse samples can supplement laboratory studies on university students

🎯 Why survey research matters

🎯 Flexibility and scope

Survey research is a flexible approach because:

  • Can study basic research questions (theoretical relationships) and applied questions (practical problems)
  • Works for diverse topics: voting, consumer preferences, social attitudes, health, mental disorders
  • Accommodates both quantitative analysis (statistics) and qualitative analysis (many questions lend themselves to qualitative approaches)
  • Can be purely descriptive, correlational, or integrated into experimental designs

🎯 Unique strengths

  • Only psychology approach routinely using random sampling: provides most accurate population estimates
  • Large-scale capability: can reach thousands of participants across diverse demographics
  • Real-world relevance: studies people in natural contexts (homes, daily life) rather than only laboratory settings
  • Policy impact: provides data for understanding social conditions and informing decisions

35. Constructing Surveys

🧭 Overview

🧠 One-sentence thesis

Constructing effective surveys requires understanding how respondents cognitively process questions and applying principles that minimize unintended context effects to maximize reliability and validity of answers.

📌 Key points (3–5)

  • Survey responding is a multi-step cognitive process: respondents must interpret questions, retrieve information from memory, form judgments, format responses, and edit their answers.
  • Context effects can bias responses: the order of items, response options provided, and question wording can systematically influence answers in unintended ways.
  • BRUSO principles guide effective item writing: items should be Brief, Relevant, Unambiguous, Specific, and Objective.
  • Common confusion—open vs closed items: open-ended items allow free responses and are qualitative; closed-ended items provide fixed options and are quantitative.
  • Rating scales need careful design: response options should be mutually exclusive, exhaustive, and balanced around a neutral midpoint.

🧠 Survey responding as a psychological process

🧠 The cognitive model of survey response

Responding to a survey is not straightforward—it involves five distinct mental steps:

  1. Interpret the question – decide what the question is really asking
  2. Retrieve information – recall relevant facts or beliefs from memory
  3. Form a tentative judgment – use the retrieved information to arrive at an answer
  4. Format the response – convert the judgment into one of the provided response options
  5. Edit the response – decide whether to report the answer as-is or modify it

Example: A question like "How many alcoholic drinks do you consume in a typical day?" requires respondents to decide whether beer and wine count, whether "typical" means weekday or weekend, how to recall their drinking behavior, how to calculate an average, and whether to report honestly or adjust their answer to avoid looking bad.

🔍 Why this matters

  • What seems like a simple question becomes complex when you consider all the cognitive steps involved.
  • Each step introduces opportunities for unintended influences on the answer.
  • Understanding this process helps researchers design questions that minimize confusion and bias.

⚠️ Context effects on survey responses

📊 Item-order effects

Item-order effect: when the order in which items are presented affects people's responses.

  • Earlier items can change how respondents interpret later items or what information they retrieve.
  • Example: When college students were asked about dating frequency before life satisfaction, the correlation between the two was +.66; when life satisfaction came first, the correlation was only −.12.
  • Why this happens: Answering the dating question first made that information more accessible in memory, so respondents used it as a basis for rating life satisfaction.
  • How to mitigate: Rotate or randomize question order when there is no natural sequence; counterbalancing reduces order effects.
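The rotation/randomization safeguard can be sketched in a few lines — each respondent sees the items in an independently shuffled order, so no single ordering dominates the data (the items and function name here are illustrative, not from the excerpt):

```python
import random

# Hypothetical survey items; any item-order effect is spread evenly
# across respondents when each one sees an independent shuffle.
items = [
    "How often do you go on dates?",
    "How satisfied are you with your life as a whole?",
    "How satisfied are you with your health?",
]

def randomized_order(items, seed=None):
    """Return a per-respondent copy of the items in random order."""
    rng = random.Random(seed)
    shuffled = items[:]  # copy, so the master list is untouched
    rng.shuffle(shuffled)
    return shuffled

# Each respondent gets their own ordering.
for respondent_id in range(3):
    print(respondent_id, randomized_order(items, seed=respondent_id))
```

Seeding by respondent makes each ordering reproducible, which is useful when you later want to test for order effects in the collected data.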

🎚️ Response option effects

The range and wording of response options can systematically shift answers:

  • When asked how often they are "really irritated" with options from "less than once a year" to "more than once a month," people think of major irritations and report low frequency.
  • With options from "less than once a day" to "several times a month," people think of minor irritations and report high frequency.
  • Middle-option bias: People assume middle options represent "normal" and gravitate toward them if they see themselves as typical.
  • Example: People report watching more TV when response options center on 4 hours than when centered on 2 hours.

🔄 Don't confuse: content vs context

  • Content effects relate to what the question asks about.
  • Context effects relate to how the question is presented—order, wording, response options—not the topic itself.

✍️ Writing effective survey items

📝 Open-ended vs closed-ended items

| Feature | Open-ended | Closed-ended |
| --- | --- | --- |
| Format | Ask a question; allow any answer | Provide fixed response options |
| When to use | Early research stages; unknown responses; qualitative goals | Well-defined variables; quantitative analysis |
| Advantages | No influence on responses; exploratory | Quick to complete; easy to analyze numerically |
| Disadvantages | Time-consuming; require qualitative coding; higher skip rates | Must anticipate all relevant responses; can introduce bias |

Example open-ended: "Please describe a time when you were discriminated against because of your age."

Example closed-ended: "Have you ever in your adult life been depressed for a period of 2 weeks or more? Yes / No"

🎯 The BRUSO model

BRUSO: an acronym for principles of effective questionnaire items—Brief, Relevant, Unambiguous, Specific, and Objective.

| Principle | What it means | Poor example | Effective example |
| --- | --- | --- | --- |
| Brief | Short and to the point; avoid unnecessary words | "Are you now or have you ever been the possessor of a firearm?" | "Have you ever owned a gun?" |
| Relevant | Only ask what matters to the research question | Asking sexual orientation when it's not relevant | Omit the item unless clearly relevant |
| Unambiguous | Can be interpreted in only one way | "Are you a gun person?" | "Do you currently own a gun?" |
| Specific | Clear what the response should be about | "How much have you read about the new gun control measure and sales tax?" (double-barreled) | Split into two items: one about gun control, one about sales tax |
| Objective | Does not reveal researcher's opinions or lead respondents | "How much do you support the new gun control measure?" | "What is your view of the new gun control measure?" |

Best practice: Conduct a pilot test and ask people to explain how they interpreted the question.

🚫 Avoid double-barreled items

  • Double-barreled items ask about two separate issues but allow only one response.
  • Example: "Please rate the extent to which you have been feeling anxious and depressed."
  • Solution: Split into two items—one about anxiety, one about depression.

📊 Designing response scales

🏷️ Rating scales for closed-ended items

Rating scale: an ordered set of responses that participants must choose from.

How many options?

  • Five-point scales work best for unipolar constructs (e.g., frequency: Never, Rarely, Sometimes, Often, Always).
  • Seven-point scales work best for bipolar constructs (e.g., liking: Like very much → Dislike very much).
  • More options (0–10) are appropriate for familiar dimensions like attractiveness, pain, or likelihood.

Branching for bipolar scales:

  • First ask a general question: "Do you generally like or dislike ice cream?"
  • Then refine with relevant choices from the seven-point scale.
  • This improves both reliability and validity.

⚖️ Balanced vs unbalanced scales

Unbalanced (avoid):

  • Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely

Balanced (preferred):

  • Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely

  • The most extreme options should be balanced around a neutral midpoint.

  • Middle option debate: Including a neutral option (e.g., "Neither agree nor disagree") allows genuine neutrality but may encourage default responses; omitting it forces deeper thought but may frustrate truly neutral respondents.

🔢 Categorical response options

  • Mutually exclusive: categories should not overlap (e.g., "Protestant" and "Catholic" are mutually exclusive; "Christian" and "Catholic" are not).
  • Exhaustive: categories should cover all possible responses; if not feasible, include an "Other" option with space to specify.
  • If respondents can belong to multiple categories (e.g., race), instruct them to "choose all that apply."

📏 What is a Likert scale?

Likert scale: a specific approach where respondents rate their agreement with multiple statements (both favorable and unfavorable) about a person, group, or idea on a 5-point scale (Strongly Agree to Strongly Disagree); responses are summed to produce an overall attitude score.

Don't confuse: Not every rating scale is a Likert scale—only those measuring attitudes through agreement with multiple statements. A simple 0-to-10 satisfaction scale is just a "rating scale."
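Likert scoring can be made concrete with a short sketch: agreement ratings are summed after reverse-coding the unfavorably worded statements (the item names and ratings below are hypothetical):

```python
def likert_score(responses, reverse_items, scale_max=5):
    """Sum 1-5 agreement ratings, reverse-coding unfavorable statements.

    responses: dict mapping item name -> rating
               (1 = Strongly Disagree, 5 = Strongly Agree)
    reverse_items: items worded unfavorably, so agreement should count
                   *against* the overall attitude score.
    """
    total = 0
    for item, rating in responses.items():
        if item in reverse_items:
            rating = (scale_max + 1) - rating  # 5 -> 1, 4 -> 2, etc.
        total += rating
    return total

# Hypothetical attitude items about a policy (names are illustrative).
responses = {"policy_is_fair": 5, "policy_is_wasteful": 1, "policy_helps": 4}
score = likert_score(responses, reverse_items={"policy_is_wasteful"})
print(score)  # 5 + (6 - 1) + 4 = 14
```

Without the reverse-coding step, strong disagreement with an unfavorable statement would wrongly lower the attitude score instead of raising it.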

📋 Formatting the survey

📢 Introduction functions

Every survey needs a written or spoken introduction that serves two purposes:

  1. Encourage participation:

    • Briefly explain the survey's purpose and importance
    • Identify the sponsor (university-based surveys generate higher response rates)
    • Acknowledge the respondent's importance
    • Describe any incentives
  2. Establish informed consent:

    • Topics covered
    • Estimated time required
    • Option to withdraw at any time
    • Confidentiality protections
    • Written consent forms are not always required (completion of the survey may serve as evidence of consent for minimal-risk research)

🗂️ Organizing questionnaire items

Order of presentation:

  • Start with clear instructions and examples of how to use response scales.
  • Most important items first – respondents are most interested and least fatigued at the beginning.
  • Group by topic or type – items using the same rating scale should be together for efficiency.
  • Demographic items last – least interesting but easy to answer even if respondents are tired.
  • End with an expression of appreciation.

Why this order matters:

  • Maximizes data quality for the most critical research questions.
  • Reduces respondent burden and annoyance.
  • Minimizes dropout before completing key items.

36. Conducting Surveys

🧭 Overview

🧠 One-sentence thesis

Survey research achieves accurate population estimates primarily through probability sampling methods that give every member a known selection chance, combined with strategies to maximize response rates and minimize sampling bias.

📌 Key points (3–5)

  • Probability vs. non-probability sampling: probability sampling specifies selection probabilities for each population member; non-probability sampling (convenience, snowball, quota, self-selection) does not.
  • Why probability sampling matters for surveys: survey researchers need accurate population estimates, which require probability samples rather than the convenience samples common in other psychological research.
  • Sampling bias and non-response bias: samples can misrepresent the population if selection is skewed or if non-responders differ systematically from responders.
  • Common confusion—sample size vs. population size: confidence intervals depend on sample size, not population size; a sample of 1,000 yields similar confidence whether the population is 100,000 or 100 million.
  • Survey methods trade-offs: in-person interviews have highest response rates but highest cost; mail and internet surveys cost less but risk lower response rates and greater non-response bias.

📊 Sampling fundamentals

📊 Two broad categories of sampling

Probability sampling: the researcher can specify the probability that each population member will be selected for the sample.

Non-probability sampling: the researcher cannot specify these probabilities.

  • Most psychological research uses non-probability sampling (convenience, snowball, quota, self-selection).
  • Survey researchers prefer probability sampling because accurate population estimates depend on it.
  • Example: election outcome estimates require probability samples of likely registered voters, since margins are often only a few percentage points.

🎯 What probability sampling requires

Clear population specification:

  • Depends on the research question.
  • Examples: all registered voters in Washington State; American consumers who purchased a car in the past year; women over 40 in Seattle who received a mammogram in the past decade.

Sampling frame:

Sampling frame: essentially a list of all members of the population from which to select respondents.

  • Sources: telephone directories, voter registration lists, hospital/insurance records, maps (for selecting cities, streets, households).
  • Without a sampling frame, most probability methods cannot be used (except cluster sampling).

🔢 Probability sampling methods

🎲 Simple random sampling

Simple random sampling: done so that each individual in the population has an equal probability of being selected.

  • Could involve drawing names from a hat, but more commonly uses computerized sorting/selection.
  • Random-digit dialing: a computer randomly generates phone numbers from possible numbers within a geographic area (common in telephone surveys).
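Given a sampling frame, simple random sampling is a one-liner in most languages; a minimal Python sketch (the frame of voter IDs is hypothetical):

```python
import random

# Hypothetical sampling frame: a list of population member IDs.
frame = [f"voter_{i:05d}" for i in range(10_000)]

def simple_random_sample(frame, n, seed=0):
    """Select n members without replacement; each member of the frame
    has an equal probability of being chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

sample = simple_random_sample(frame, n=100)
print(len(sample))  # 100 distinct members
```

This is the computerized equivalent of drawing names from a hat; `rng.sample` guarantees no member is selected twice.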

🧱 Stratified random sampling

Stratified random sampling: the population is divided into subgroups or "strata" (usually by demographics), then a random sample is taken from each stratum.

Two variants:

| Type | Purpose | Example from excerpt |
| --- | --- | --- |
| Proportionate | Match subgroup proportions in population | 12.6% of Americans are African American → ensure ~126 of 1,000 respondents are African American |
| Disproportionate | Oversample small subgroups to draw valid conclusions | Asian Americans are ~5.6% of population; simple random sample of 1,000 might include too few; oversample to ensure enough for valid conclusions |
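The proportionate variant reduces to a simple allocation calculation. A sketch, assuming largest-remainder rounding so the allocations always sum to the total (the 12.6% share is from the excerpt; the remaining split is illustrative only):

```python
def proportionate_allocation(strata_shares, total_n):
    """Allocate a total sample size across strata in proportion to each
    stratum's share of the population (largest-remainder rounding)."""
    raw = {s: share * total_n for s, share in strata_shares.items()}
    alloc = {s: int(v) for s, v in raw.items()}
    # Hand any leftover slots to the largest fractional remainders.
    leftover = total_n - sum(alloc.values())
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:leftover]:
        alloc[s] += 1
    return alloc

# 12.6% share is from the excerpt; the other shares are illustrative.
shares = {"African American": 0.126, "Asian American": 0.056, "Other": 0.818}
print(proportionate_allocation(shares, 1000))
# {'African American': 126, 'Asian American': 56, 'Other': 818}
```

A disproportionate design would simply override these numbers for the small strata (e.g., force the Asian American allocation up), then reweight during analysis.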

🗂️ Cluster sampling

Cluster sampling: larger clusters of individuals are randomly sampled, then individuals within each cluster are randomly sampled.

  • Only probability method that does not require a sampling frame.
  • Useful for face-to-face interviews because it minimizes travel.
  • Example: instead of traveling to 200 small towns to interview 200 residents, travel to 10 towns and interview 20 residents in each.
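The two-stage structure of the town example can be sketched directly — note that only the chosen clusters ever need a resident list, which is why no master sampling frame is required (all names and sizes below are hypothetical):

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster, seed=0):
    """Two-stage cluster sampling: randomly pick clusters, then randomly
    pick individuals within each chosen cluster. Needs no master list of
    individuals -- only a list of clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return {c: rng.sample(clusters[c], per_cluster) for c in chosen}

# Hypothetical: 200 towns, each with a locally known resident list.
towns = {f"town_{i}": [f"town_{i}_res_{j}" for j in range(500)]
         for i in range(200)}
picked = cluster_sample(towns, n_clusters=10, per_cluster=20)
print(sum(len(v) for v in picked.values()))  # 200 residents from only 10 towns
```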

📏 Sample size considerations

📏 How large should a sample be?

Two factors:

  1. Confidence level desired: larger samples yield statistics closer to the population value.
  2. Budget constraint: larger samples cost more time, effort, and money.
  • Most survey research uses 100–1,000 respondents.
  • Conducting a power analysis beforehand helps balance these trade-offs.

🔍 Why ~1,000 is often adequate

Confidence intervals and sample size:

  • A sample of 1,000 American adults is considered good even though the population is ~252 million (only about 0.0004% of the population)
  • Example with 50% voting for incumbent:
    • 100 voters → 95% confidence interval: 40–60%
    • 1,000 voters → 95% confidence interval: 47–53%
    • 2,000 voters → 95% confidence interval: 48–52%
  • The interval shrinks as sample size increases, but at a slower rate; beyond 1,000, the gain is often not worth the cost.

Surprising fact—population size doesn't matter:

  • Confidence intervals depend only on sample size, not population size.
  • A sample of 1,000 produces a 47–53% confidence interval whether the population is 100,000, 1 million, or 100 million.
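The intervals above follow from the standard normal-approximation formula, in which the population size appears nowhere — a quick sketch reproducing the excerpt's numbers:

```python
import math

def ci95_for_proportion(p, n):
    """Approximate 95% confidence interval for a sample proportion,
    using the normal approximation: margin = 1.96 * sqrt(p(1-p)/n).
    Note that population size does not appear in the formula."""
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return (p - margin, p + margin)

# With 50% of sampled voters favoring the incumbent:
for n in (100, 1000, 2000):
    lo, hi = ci95_for_proportion(0.50, n)
    print(f"n={n}: {lo:.0%} to {hi:.0%}")
# n=100: 40% to 60%
# n=1000: 47% to 53%
# n=2000: 48% to 52%
```

The margin shrinks with the square root of n, which is why quadrupling the sample only halves the interval — and why gains beyond ~1,000 respondents are rarely worth the cost.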

⚠️ Sampling bias

⚠️ What is sampling bias?

Sampling bias: occurs when a sample is selected in such a way that it is not representative of the entire population and therefore produces inaccurate results.

  • Historical example: the Literary Digest straw poll in 1936 was far off because mailing lists (from telephone directories and automobile registrations) over-represented wealthier people, who were more likely to vote for Landon.
  • Gallup succeeded by sampling less wealthy people as well.

🚫 Non-response bias

Non-response bias: occurs when survey non-responders differ from responders in systematic ways.

Why it happens:

  • Not everyone selected responds; some have died/moved, others decline (too busy, not interested, don't participate on principle).

Example from the excerpt:

  • A mail survey on alcohol consumption had only ~50% response after initial contact and two follow-up reminders.
  • Researchers visited non-responders' homes (up to five times) and found an especially high proportion of abstainers (nondrinkers).
  • Original estimates based only on responders were too high because abstainers were underrepresented.

Don't confuse: statistical corrections for non-response bias exist but rely on assumptions (e.g., non-responders resemble late responders) that may not be correct.

✅ Minimizing non-response bias

Best approach: maximize the response rate.

Factors that increase response rates:

  • Survey method: in-person interviews highest, then telephone, then mail and internet.
  • Pre-notification: send a short message informing potential respondents they will be asked to participate soon.
  • Follow-up reminders: send simple reminders to non-responders after a few weeks.
  • Questionnaire design: keep surveys short, simple, and on topic (perceived length and complexity reduce response).
  • Incentives: offering cash is reliable (but ethical limits exist—incentives must not be so large as to be coercive).

📞 Survey methods

📞 Four main ways to conduct surveys

| Method | Response rate | Personal contact | Cost | Notes from excerpt |
| --- | --- | --- | --- | --- |
| In-person interviews | Highest | Closest | Highest (by far) | Important when interviewer must see/judge respondents (e.g., mental health interviews) |
| Telephone surveys | Lower than in-person | Some | Costly but less than in-person | Telephone directories less comprehensive today (more people have only cell phones, no landlines) |
| Mail surveys | Even lower | None | Less costly | Most susceptible to non-response bias |
| Internet surveys | Varies | None | Lowest | Becoming dominant; easy to construct; methods in rapid development |

🌐 Internet survey approaches

Initial contact by mail with link:

  • Does not necessarily produce higher response rates than ordinary mail surveys.

Initial contact by email with direct link:

  • Works well when the population has known email addresses and uses them regularly (e.g., university community).
  • For other populations, hard/impossible to find comprehensive email address lists as sampling frames.

Posting on websites:

  • Request to participate with link posted on sites visited by population members.
  • Very difficult to get a random sample this way (visitors likely differ from the population as a whole).

Why internet surveys are growing:

  • Low cost.
  • More people online than ever.
  • Likely to become the dominant approach.

🧪 Myths about web-based studies

The excerpt notes that concerns about online data collection have been found to be myths:

| Preconception | Finding |
| --- | --- |
| Internet samples are not demographically diverse | Internet samples are more diverse than traditional samples in many domains (though not completely representative) |
| Internet users are maladjusted, socially isolated, or depressed | Internet users do not differ from nonusers on markers of adjustment and depression |
| Internet-based findings differ from other methods | Evidence so far suggests internet findings are consistent with traditional methods (e.g., self-esteem, personality), but more data needed |

🛠️ Online survey tools

🛠️ Tools mentioned in the excerpt

Free accounts (limited items and respondents, useful for small-scale surveys and practice):

  • SurveyMonkey
  • PsyToolkit (free, noncommercial, also does experimental paradigms)
  • Qualtrics
  • PsycData

Canadian-hosted sites (to avoid US Patriot Act data seizure concerns):

  • Fluid Surveys
  • Simple Survey
  • Lime Survey

🤖 Amazon Mechanical Turk (MTurk)

  • Originally for usability testing; now has database of over 500,000 workers from over 190 countries.
  • Can deploy simple tasks (e.g., testing different question wording) at very low cost (a few cents for <5 minutes).
  • Lauded as an inexpensive way to gather high-quality data.
  • Example use: set parameters matching your sample frame and run experiments cheaply.

37. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Quasi-experimental designs manipulate an independent variable but lack random assignment or counterbalancing, placing them between non-experimental studies and true experiments in internal validity, and they are commonly used in field settings where random assignment is difficult.

📌 Key points (3–5)

  • What quasi-experiments are: research that resembles experiments but lacks either random assignment to conditions or counterbalancing, making groups nonequivalent.
  • Why they matter: they eliminate directionality problems but do not eliminate confounding variables, so internal validity is moderate.
  • Common confusion: quasi-experiments vs. true experiments—random assignment or counterbalancing is the key difference; without these safeguards, groups may differ in important ways beyond the treatment.
  • Where they are used: typically in field settings (schools, workplaces, clinics) where random assignment is impractical.
  • Design types covered: one-group designs (posttest only, pretest-posttest, interrupted time series) and nonequivalent groups designs (posttest only, pretest-posttest, interrupted time series with groups, switching replication).

🔬 What makes a study quasi-experimental

🔬 Definition and core features

Quasi-experimental research: research that resembles experimental research but lacks either random assignment to conditions or counterbalancing.

  • The independent variable is manipulated before the dependent variable is measured.
  • Either a control group is missing, or participants are not randomly assigned to conditions.
  • Because the IV is manipulated first, quasi-experiments eliminate the directionality problem (you know which variable came first).
  • However, without random assignment or counterbalancing, confounding variables remain a threat.

📊 Internal validity position

| Study type | Random assignment / counterbalancing | Confounding variables | Internal validity |
| --- | --- | --- | --- |
| Non-experimental | No manipulation | Present | Lowest |
| Quasi-experimental | Manipulation, but no random assignment or counterbalancing | Present | Moderate |
| True experiment | Manipulation + random assignment or counterbalancing | Controlled | Highest |

  • Quasi-experiments fall between non-experimental and true experiments in internal validity.
  • They are often the best option when random assignment is impossible or unethical.

🏥 Typical use cases

  • Conducted in field settings (schools, hospitals, workplaces).
  • Often used to evaluate the effectiveness of a treatment or intervention (e.g., psychotherapy, educational programs).
  • Example: evaluating an anti-drug program in one school without being able to randomly assign students to treatment and control conditions.

🧪 One-group designs

🧪 One-group posttest only design

One-group posttest only design: a treatment is implemented and then a dependent variable is measured once after the treatment.

  • How it works: implement treatment → measure outcome once.
  • Example: an anti-drug program is delivered to elementary students, then their attitudes toward drugs are measured immediately after.
  • Major limitation: no control or comparison group, so you cannot know what would have happened without the treatment.
  • Common misuse: advertisers claim "80% of women noticed brighter skin after using Brand X"—without a comparison group, this statistic is meaningless.
  • Verdict: this is the weakest quasi-experimental design.

📏 One-group pretest-posttest design

One-group pretest-posttest design: the dependent variable is measured once before the treatment and once after.

  • How it works: measure outcome → implement treatment → measure outcome again.
  • Example: measure students' attitudes toward drugs, deliver anti-drug program, measure attitudes again.
  • Similarity to within-subjects experiments: each participant is tested before and after treatment, but the order is not counterbalanced (you cannot "untreat" someone).
  • Improvement over posttest only: you can see whether scores changed from pretest to posttest.
  • Problem: many alternative explanations (threats to internal validity) can account for the change.

⚠️ Threats to internal validity in pretest-posttest designs

⚠️ History

  • Other events might occur between pretest and posttest that cause the change.
  • Example: a celebrity drug overdose is widely reported, and students' attitudes toward drugs become more negative—not because of the program, but because of the news event.

⚠️ Maturation

  • Participants naturally change over time (growing, learning, developing).
  • Example: in a year-long program, students may become better reasoners or less impulsive simply because they are maturing, not because of the program.

⚠️ Testing

  • The act of measuring the dependent variable at pretest can affect responses at posttest.
  • Example: completing a survey about attitudes toward drugs may prompt students to think more deeply about the topic, leading to attitude change independent of the program.

⚠️ Instrumentation

  • The measuring instrument or observer changes over time.
  • Example: observers may gain skill, become fatigued, or shift their standards; participants may take the pretest seriously but become bored and careless at posttest.

⚠️ Regression to the mean

Regression to the mean: individuals who score extremely high or low on a variable on one occasion will tend to score less extremely on the next occasion.

  • Example: a bowler with a long-term average of 150 who bowls a 220 will almost certainly score lower next time—her score will "regress" toward her mean.
  • When it is a problem: when participants are selected for study because of their extreme scores.
  • Example: if only students with extremely favorable attitudes toward drugs are given the anti-drug program, their scores will likely be lower at posttest even without any treatment effect.
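A toy simulation (not from the excerpt) makes the effect visible: each score is a stable true level plus independent noise, with no treatment at all, yet the group selected for extreme scores "improves" at the second measurement:

```python
import random

# Each score = stable true level + independent measurement noise.
rng = random.Random(42)
true_level = 150
scores_t1 = [true_level + rng.gauss(0, 20) for _ in range(10_000)]
scores_t2 = [true_level + rng.gauss(0, 20) for _ in range(10_000)]

# Select only the extreme high scorers at time 1, as a pretest would.
extreme = [i for i, s in enumerate(scores_t1) if s > 190]
mean_t1 = sum(scores_t1[i] for i in extreme) / len(extreme)
mean_t2 = sum(scores_t2[i] for i in extreme) / len(extreme)
print(round(mean_t1), round(mean_t2))  # time-2 mean falls back toward 150
```

The selected group's first-occasion scores are high partly because of lucky noise; the noise is re-drawn on the second occasion, so the group's mean drifts back toward the true level even though nothing changed.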

⚠️ Spontaneous remission

Spontaneous remission: the tendency for many medical and psychological problems to improve over time without any treatment.

  • Example: common cold sufferers improve in a week even without treatment; severely depressed people often improve somewhat over 6 months without treatment.
  • One study found that participants in waitlist control conditions improved 10–15% before receiving any treatment.
  • Implication: you must be very cautious about inferring causality from pretest-posttest designs.

🛡️ How to address these threats

  • The excerpt recommends adding a control group that does not receive the treatment.
  • A control group would be subject to the same threats (history, maturation, testing, instrumentation, regression, spontaneous remission), so any difference between treatment and control groups can be attributed to the treatment.
  • Important: adding a control group means the design is no longer a one-group design.

📖 Case study: Does psychotherapy work?

📖 Early evidence (Eysenck, 1952)

  • Early studies used pretest-posttest designs.
  • Eysenck summarized 24 studies: about two-thirds of patients improved from pretest to posttest.
  • However, Eysenck compared these results with archival data (state hospital and insurance records) showing that similar patients recovered at about the same rate without psychotherapy.
  • Conclusion: the improvement might be no more than spontaneous remission.
  • Eysenck did not conclude psychotherapy was ineffective; he concluded there was no evidence that it was, and he called for "properly planned and executed experimental studies."

📖 Later evidence (Smith, Glass, & Miller, 1980)

  • By 1980, hundreds of experiments had been conducted with random assignment to treatment and control conditions.
  • Overall, psychotherapy was quite effective: about 80% of treatment participants improved more than the average control participant.
  • Subsequent research has focused on the conditions under which different types of psychotherapy are more or less effective.

⏱️ Interrupted time-series design

Interrupted time-series design: a set of measurements taken at intervals over a period of time, "interrupted" by a treatment.

  • How it works: measure outcome repeatedly before treatment → introduce treatment → measure outcome repeatedly after treatment.
  • Example: a factory measures worker productivity each week for a year, then reduces work shifts from 10 hours to 8 hours; productivity increases quickly and remains elevated for months, suggesting the shift reduction caused the increase.
  • Advantage over simple pretest-posttest: multiple measurements before and after the treatment help distinguish treatment effects from normal variation.
  • Example: if student absences are measured weekly, a single drop from Week 7 to Week 8 might look like a treatment effect in a simple pretest-posttest design, but multiple measurements reveal it is just normal week-to-week variation.
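The logic of the design can be shown with toy numbers (all values below are invented for illustration): with many measurements on each side of the interruption, a sustained jump stands out against ordinary week-to-week variation in a way a single pre/post comparison cannot.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical weekly productivity, before and after a shift change.
before = [98, 101, 99, 100, 102, 97, 100, 103]    # fluctuates around 100
after  = [110, 112, 109, 111, 113, 110, 112, 111]  # fluctuates around 111

# The within-phase range (97-103, 109-113) is the normal variation;
# the ~11-point jump at the interruption clearly exceeds it.
print(round(mean(before), 1), round(mean(after), 1))  # 100.0 111.0
```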

🔀 Nonequivalent groups designs

🔀 What makes groups nonequivalent

  • In true between-subjects experiments, random assignment creates equivalent groups.
  • When participants are not randomly assigned, the resulting groups are likely dissimilar in important ways—they are nonequivalent.

Nonequivalent groups design: a between-subjects design in which participants have not been randomly assigned to conditions.

📊 Posttest only nonequivalent groups design

Posttest only nonequivalent groups design: participants in one group are exposed to a treatment, a nonequivalent group is not exposed, and then the two groups are compared.

  • How it works: one group receives treatment, another does not, then both are measured once.
  • Example: one third-grade class receives a new method of teaching fractions (treatment group), another class does not (control group), then both classes are tested on fractions knowledge.
  • Problem: students are not randomly assigned to classes, so there could be important differences.
    • Parents of higher-achieving students might request a particular teacher.
    • The principal might assign "troublemakers" to a stronger disciplinarian.
    • Teachers' styles and classroom environments might differ.
  • Improvement strategies: select two classes at the same school with similar standardized test scores, same-sex teachers of similar age and teaching style.
  • Limitation: even with these steps, important confounding variables may remain without true random assignment.

📈 Pretest-posttest nonequivalent groups design

Pretest-posttest nonequivalent groups design: a treatment group is given a pretest, receives a treatment, and then is given a posttest; at the same time, a nonequivalent control group is given a pretest, does not receive the treatment, and then is given a posttest.

  • How it works: both groups measured before treatment → treatment group receives treatment, control does not → both groups measured after treatment.
  • Key question: do participants who receive the treatment improve more than participants who do not?
  • Example: students in one school receive an anti-drug program (pretest → program → posttest); students in a similar school do not receive the program (pretest → no program → posttest).
  • Improvement over posttest only: if students in the treatment condition become more negative toward drugs than students in the control condition, this is better evidence for a treatment effect.
  • Remaining threats: something could occur at one school but not the other (differential history).
    • Example: a student drug overdose at one school would affect students there but not at the other school.
  • Note: if participants were randomly assigned to conditions, this would become a true between-groups experiment.

⏱️ Interrupted time-series design with nonequivalent groups

Interrupted time-series design with nonequivalent groups: a set of measurements taken at intervals over time both before and after an intervention in two or more nonequivalent groups.

  • How it works: measure outcome repeatedly in treatment and control groups → introduce treatment to one group → continue measuring outcome repeatedly in both groups.
  • Example: one manufacturing company reduces work shifts from 10 to 8 hours (treatment group); another company does not change shift length (control group); productivity is measured weekly in both companies.
  • Advantage: if productivity increases quickly and remains elevated in the treatment group but stays consistent in the control group, this provides better evidence for treatment effectiveness.
  • Example: if student absences drop after an instructor begins taking attendance in one section of a course but remain high in another section where attendance is not taken, this provides superior evidence that taking attendance reduced absences.

🔄 Pretest-posttest design with switching replication

Pretest-posttest design with switching replication: nonequivalent groups are given a pretest, one group receives treatment while the control does not, the dependent variable is assessed again, then the treatment is added to the control group, and the dependent variable is assessed one last time.

  • How it works:
    1. Measure outcome in both groups.
    2. Introduce treatment to Group 1 only.
    3. Measure outcome in both groups.
    4. Introduce treatment to Group 2 (Group 1 continues treatment).
    5. Measure outcome in both groups.
  • Example: measure depression in patients and students → introduce exercise intervention to patients only → measure depression again (patients should improve, students should not) → introduce exercise to students (while patients continue) → measure depression again (now students should improve).
  • Strengths:
    • Built-in replication: evidence for treatment efficacy in two different samples.
    • Better control over history effects: unlikely that an outside event would perfectly coincide with treatment introduction in both groups.
    • Controls for maturation and instrumentation: both groups would show the same rates of spontaneous remission and the same measurement changes.
  • Remaining threats: demand characteristics, placebo effects, and experimenter expectancy effects can still be problems (but can be controlled using methods from Chapter 5).

🔁 Switching replication with treatment removal design

Switching replication with treatment removal design: the treatment is removed from the first group when it is added to the second group.

  • How it works:
    1. Measure outcome in both groups.
    2. Introduce treatment to Group 1 only.
    3. Measure outcome in both groups (Group 1 should improve, Group 2 should not).
    4. Remove treatment from Group 1 and introduce it to Group 2.
    5. Measure outcome in both groups (Group 2 should improve, Group 1 should worsen).
  • Example: measure depression in patients and students → patients start exercising → measure depression (patients improve, students do not) → patients stop exercising, students start exercising → measure depression (students improve, patients worsen).
  • Strengths:
    • Demonstrates treatment effect in two groups staggered over time.
    • Demonstrates reversal of treatment effect after treatment is withdrawn.
    • Provides strong evidence for treatment efficacy.
    • Provides evidence for whether treatment effects persist after withdrawal.

📝 Survey research key takeaways (from excerpt)

📝 Sampling bias and non-response bias

Sampling bias: when a sample is not representative of the population and therefore produces inaccurate results.

Non-response bias: when people who do not respond to the survey differ in important ways from people who do respond.

  • Non-response bias is the most pervasive form of sampling bias.
  • How to minimize non-response bias:
    • Prenotify respondents.
    • Send reminders.
    • Construct short, easy-to-complete questionnaires.
    • Offer incentives.

📝 Survey methods

| Method | Response rate | Cost |
|---|---|---|
| In-person | Highest | Most expensive |
| Telephone | Moderate | Moderate |
| Mail | Lower | Less expensive |
| Internet | Lower | Least expensive |
  • Internet surveys are likely to become the dominant approach because of their low cost.

38. One-Group Designs

🧭 Overview

🧠 One-sentence thesis

One-group quasi-experimental designs manipulate an independent variable but lack control groups or random assignment, placing them between non-experimental studies and true experiments in internal validity.

📌 Key points (3–5)

  • What quasi-experiments are: research that resembles experiments but lacks either random assignment or control groups, though the independent variable is still manipulated.
  • Three main one-group designs: posttest only, pretest-posttest, and interrupted time-series, each with increasing levels of measurement.
  • Key advantage over non-experimental research: manipulation before measurement eliminates the directionality problem.
  • Common confusion: these designs cannot eliminate confounding variables like true experiments can, because groups are not equivalent or counterbalancing is not used.
  • Multiple threats to validity: history, maturation, testing, instrumentation, regression to the mean, and spontaneous remission can all offer alternative explanations for observed changes.

🔬 What makes research quasi-experimental

🔬 Defining characteristics

Quasi-experimental research: research that resembles experimental research but is not true experimental research, missing either random assignment to conditions or counterbalancing safeguards.

  • The prefix "quasi" means "resembling"—these studies look like experiments but lack key protections.
  • An independent variable is manipulated, but either no control group exists or participants are not randomly assigned.
  • Example: implementing an anti-drug program and measuring attitudes, but without comparing to a group that didn't receive the program.

⚖️ Where quasi-experiments fall on the validity spectrum

| Research type | Directionality problem | Confounding variables | Internal validity |
|---|---|---|---|
| Non-experimental | Present | Present | Lowest |
| Quasi-experimental | Eliminated | Present | Medium |
| True experimental | Eliminated | Eliminated | Highest |
  • Because the independent variable is manipulated before the dependent variable is measured, the directionality problem is solved.
  • However, without random assignment or counterbalancing, confounding variables remain a problem.
  • Quasi-experiments are most likely conducted in field settings where random assignment is difficult or impossible.

📊 The three one-group designs

📊 One-group posttest only design

One-group posttest only design: a treatment is implemented and then a dependent variable is measured once after the treatment.

  • This is the weakest type of quasi-experimental design.
  • The major limitation is the complete lack of a control or comparison group.
  • There is no way to determine what would have happened without the treatment.
  • Example: measuring students' attitudes toward illegal drugs immediately after an anti-drug program, with no baseline or comparison.
  • Don't confuse: advertisers often report statistics from this design (e.g., "80% of women noticed brighter skin") that are meaningless without a comparison group.

📊 One-group pretest-posttest design

One-group pretest-posttest design: the dependent variable is measured once before the treatment is implemented and once after.

  • Similar to a within-subjects experiment where each participant is tested under control then treatment conditions.
  • Unlike within-subjects experiments, the order cannot be counterbalanced—participants cannot be "untreated" after treatment.
  • If posttest scores are better than pretest scores, the treatment might be responsible, but certainty is low.
  • Example: measuring attitudes toward drugs one week, implementing the anti-drug program the next week, then measuring attitudes again.

📊 Interrupted time-series design

Interrupted time-series design: a time series (set of measurements taken at intervals over time) is "interrupted" by a treatment, with multiple measurements before and after.

  • A variant of the pretest-posttest design with multiple measurements both before and after treatment.
  • The multiple measurements help distinguish treatment effects from normal variation.
  • Example: measuring worker productivity each week for a year, then reducing work shifts from 10 to 8 hours and continuing to measure productivity.
  • Advantage over simple pretest-posttest: if there had been only one measurement before and one after, normal week-to-week variation might be mistaken for a treatment effect.
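
The advantage of multiple measurements can be sketched numerically. The code below simulates invented weekly productivity data (all values hypothetical) and compares the pre/post change against ordinary week-to-week variation:

```python
import random
import statistics

random.seed(0)

# Hypothetical sketch: weekly productivity before and after shortening
# shifts (numbers invented for illustration). With many measurements,
# a treatment effect can be judged against normal weekly variation.
before = [100 + random.gauss(0, 3) for _ in range(20)]  # 20 pre-treatment weeks
after = [108 + random.gauss(0, 3) for _ in range(20)]   # 20 post-treatment weeks

shift = statistics.mean(after) - statistics.mean(before)
noise = statistics.stdev(before)  # typical week-to-week variation

print(f"Mean change after treatment: {shift:.1f}")
print(f"Week-to-week SD before:      {noise:.1f}")
# A single pre/post pair could differ by several points just by chance;
# a sustained jump across many weeks is much harder to explain away.
```

With only one measurement before and one after, a chance fluctuation of a few points could masquerade as a treatment effect; the long series makes the sustained shift visible.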

⚠️ Threats to internal validity

⚠️ History

History: other things might have happened between the pretest and posttest that caused a change.

  • Events outside the study can produce the observed effect.
  • Example: in an anti-drug program study, a celebrity might die of a drug overdose between measurements, or an anti-drug program might air on television.

⚠️ Maturation

Maturation: participants might have changed between measurements in ways they were going to anyway because they are growing and learning.

  • Natural development over time can mimic treatment effects.
  • Example: in a year-long anti-drug program, participants might naturally become less impulsive or better reasoners, which could change their attitudes independent of the program.

⚠️ Testing

Testing: the act of measuring the dependent variable during the pretest affects participants' responses at posttest.

  • Simply completing a measure can inspire thinking and conversations that produce changes.
  • Example: completing a measure of attitudes toward illegal drugs may inspire further reflection that then changes posttest scores.

⚠️ Instrumentation

Instrumentation: the basic characteristics of the measuring instrument change over time.

  • Human observers may gain skill, become fatigued, or change their standards.
  • Participants may take measurements less seriously over time.
  • Example: participants may take an attitude measure very seriously when it's novel at pretest but become bored and less careful at posttest.

⚠️ Regression to the mean

Regression to the mean: an individual who scores extremely high or low on a variable on one occasion will tend to score less extremely on the next occasion.

  • This is a statistical fact, not a real change in the underlying characteristic.
  • Particularly problematic when participants are selected for study because of their extreme scores.
  • Example: students with extremely favorable attitudes toward drugs who are selected for the program will likely score lower (less favorable) at posttest even without any program effect, simply because their initial scores were extreme.
  • Example: a bowler with a long-term average of 150 who bowls a 220 will almost certainly score lower in the next game—the score will "regress" toward the mean of 150.
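
The bowler example can be made concrete with a small simulation (parameters invented, not from the text): each score is a stable skill level plus random luck, and players selected for an extreme first score fall back toward the long-term average on an independent second score:

```python
import random

random.seed(1)

# Minimal simulation of regression to the mean (parameters invented).
SKILL = 150    # long-term average, like the bowler in the example
LUCK_SD = 30   # game-to-game random variation

def game() -> float:
    """One game: stable skill plus random luck."""
    return SKILL + random.gauss(0, LUCK_SD)

# Simulate many players; select those whose FIRST game was extreme (>= 200).
first = [game() for _ in range(10_000)]
second = [game() for _ in range(10_000)]
extreme = [(f, s) for f, s in zip(first, second) if f >= 200]

mean_first = sum(f for f, _ in extreme) / len(extreme)
mean_second = sum(s for _, s in extreme) / len(extreme)
print(f"Selected players' first game:  {mean_first:.0f}")  # well above 150
print(f"Same players' second game:     {mean_second:.0f}")  # back near 150
```

No one's underlying skill changed between games; the selected group's scores "regress" simply because their extreme first scores included unusually good luck.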

⚠️ Spontaneous remission

Spontaneous remission: the tendency for many medical and psychological problems to improve over time without any form of treatment.

  • Closely related to regression to the mean and extremely important in psychological research.
  • Example: 100 people with common colds will probably improve in a week regardless of any treatment like chicken soup.
  • Example: severely depressed people today are likely to be less depressed on average in 6 months without treatment—research found waitlist control participants improved 10 to 15% before receiving any treatment.
  • Caution: one must generally be very cautious about inferring causality from pretest-posttest designs because of this threat.

🛡️ Addressing validity threats

🛡️ The control group solution

  • A common approach to ruling out threats to internal validity is adding a control group that does not receive the treatment.
  • The control group would be subject to the same threats from history, maturation, testing, instrumentation, regression to the mean, and spontaneous remission.
  • This allows the researcher to measure the actual effect of the treatment (if any).
  • Important note: including a control group means the design is no longer a one-group design—it becomes a different type of quasi-experimental or true experimental design.

🛡️ Historical example: Does psychotherapy work?

  • Early studies on psychotherapy effectiveness used pretest-posttest designs.
  • In 1952, researcher Hans Eysenck summarized 24 studies showing about two-thirds of patients improved between pretest and posttest.
  • However, Eysenck compared these results with archival data showing similar patients recovered at about the same rate without psychotherapy.
  • This suggested the improvement might be no more than spontaneous remission.
  • Eysenck did not conclude psychotherapy was ineffective—only that there was no evidence it was effective, and he called for properly planned experimental studies.
  • By 1980, hundreds of experiments with random assignment to treatment and control conditions had been conducted.
  • These later studies found psychotherapy was quite effective, with about 80% of treatment participants improving more than the average control participant.

39. Non-Equivalent Groups Designs

🧭 Overview

🧠 One-sentence thesis

Non-equivalent groups designs are between-subjects quasi-experiments in which participants are not randomly assigned to conditions, creating groups that may differ in important ways and introducing potential confounding variables that threaten internal validity.

📌 Key points (3–5)

  • Core definition: Non-equivalent groups designs lack random assignment, so the groups being compared may be dissimilar in ways beyond the treatment.
  • Common confusion: The key distinction from true experiments is random assignment—without it, observed differences might be due to pre-existing group differences rather than the treatment.
  • Multiple design types: Posttest-only, pretest-posttest, interrupted time-series with control groups, and switching replication designs each add different controls to strengthen internal validity.
  • Improving validity: Adding pretests, control groups, multiple measurement points, and switching replications can reduce (but not eliminate) threats from confounding variables like history, maturation, and selection.
  • Remaining threats: Even improved designs cannot fully control for demand characteristics, placebo effects, and experimenter expectancy effects without additional methods.

🔍 What makes groups "non-equivalent"

🔍 The absence of random assignment

A nonequivalent groups design is a between-subjects design in which participants have not been randomly assigned to conditions.

  • In true experiments, random assignment creates groups that researchers consider equivalent—likely to be quite similar.
  • Without random assignment, the resulting groups are likely to be dissimilar in some ways.
  • Example: If one third-grade class receives a new teaching method and another class serves as a control, the classes may differ because parents requested certain teachers, principals assigned "troublemakers" to specific classes, or teachers have different styles.

⚠️ Why this matters for interpretation

  • Any observed difference between groups might be caused by the treatment or by pre-existing differences (confounding variables).
  • Researchers can take steps to make groups as similar as possible (e.g., selecting classes at the same school, matching on test scores, choosing similar teachers), but without random assignment, the possibility of uncontrolled confounding variables remains.

📐 Basic non-equivalent designs

📐 Posttest-only nonequivalent groups design

In this design, participants in one group are exposed to a treatment, a nonequivalent group is not exposed to the treatment, and then the two groups are compared.

  • Structure: Treatment group receives intervention → both groups measured afterward.
  • Weakness: No baseline measurement, so any observed difference could be due to pre-existing group differences rather than the treatment.
  • Example: One class of third graders learns fractions with a new method, another class uses the old method, then both take a test. If the new-method class scores higher, it might be because of the method—or because higher-achieving students were already in that class.

📊 Pretest-posttest nonequivalent groups design

There is a treatment group that is given a pretest, receives a treatment, and then is given a posttest. At the same time there is a nonequivalent control group that is given a pretest, does not receive the treatment, and then is given a posttest.

  • Structure: Both groups measured before → treatment group receives intervention → both groups measured after.
  • Key question: Not simply whether the treatment group improves, but whether it improves more than the control group.
  • Advantage over posttest-only: If both groups change similarly, the change is likely due to history (e.g., news of a celebrity drug overdose) or maturation (e.g., improved reasoning) rather than the treatment.
  • Remaining threat: Differential history, where something occurs at one location but not the other (e.g., a student drug overdose at one school, or an asbestos problem closing the other school).
  • Example: Students at one school receive an anti-drug program and show more negative attitudes toward drugs at posttest. If students at a similar school (no program) show the same attitude change, the change is probably not due to the program.

🔄 When it becomes a true experiment

  • If participants in a pretest-posttest design are randomly assigned to conditions, it becomes a true between-groups experiment rather than a quasi-experiment.
  • This is the kind of experiment that has been conducted many times to demonstrate the effectiveness of psychotherapy.

🕒 Time-series and replication designs

🕒 Interrupted time-series design with nonequivalent groups

This design involves taking a set of measurements at intervals over a period of time both before and after an intervention of interest in two or more nonequivalent groups.

  • Structure: Multiple measurements before and after the intervention in both treatment and control groups.
  • Advantage: If the treatment group shows a rapid change after the intervention while the control group remains consistent, this provides better evidence for treatment effectiveness.
  • Example: A manufacturing company measures worker productivity each week for a year before and after reducing shifts from 10 to 8 hours. Another company (not changing shifts) serves as a control. If productivity increases quickly in the treatment company but stays consistent in the control company, the shift change likely caused the improvement.
  • Example: Attendance is taken in one section of a research methods course (treatment), while another section does not take attendance (control). If absences drop in the treatment section but remain high in the control section, this provides superior evidence that taking attendance reduces absences.

🔁 Pretest-posttest design with switching replication

Nonequivalent groups are administered a pretest of the dependent variable, then one group receives a treatment while a nonequivalent control group does not receive a treatment, the dependent variable is assessed again, and then the treatment is added to the control group, and finally the dependent variable is assessed one last time.

  • Structure: Pretest both groups → treat Group 1 → measure both → treat Group 2 (Group 1 continues) → measure both.
  • Key feature: The treatment is introduced to the second group in a staggered, delayed fashion.
  • Example: Patients with depression receive an exercise intervention while students with depression do not. Depression is measured. If patients improve but students do not, this suggests the treatment works. Then the students begin exercising. If their depression now decreases, this replicates the finding.

💪 Strengths of switching replication

| Strength | Explanation |
|---|---|
| Built-in replication | Evidence for treatment efficacy in two different samples (e.g., patients and students) |
| Control for history | Unlikely that an outside event would coincide with treatment introduction in the first group and delayed introduction in the second group |
| Control for maturation | Both groups would show the same rates of spontaneous remission if maturation were the cause |
| Control for instrumentation | If the measurement instrument changes, the change would be consistent across both groups |
  • Remaining threats: Demand characteristics, placebo effects, and experimenter expectancy effects can still be problems (but can be controlled using methods from earlier chapters).

🔄 Switching replication with treatment removal design

The treatment is removed from the first group when it is added to the second group.

  • Structure: Pretest both → treat Group 1 → measure both → remove treatment from Group 1 and add to Group 2 → measure both.
  • Key difference from basic switching replication: The first group stops receiving the treatment when the second group starts.
  • Example: Patients exercise for a week and depression decreases; students do not exercise and depression stays the same. Then patients stop exercising and students start. If students' depression now decreases and patients' depression increases, this provides strong evidence that the exercise intervention is effective.

🏆 Why this design is powerful

  • Demonstrates treatment effect in two groups staggered over time: Replicates the finding.
  • Demonstrates reversal of treatment effect after removal: Shows that the treatment effect depends on the treatment being present.
  • Evidence for maintenance: Can reveal whether the treatment continues to show effects after it has been withdrawn.
  • Don't confuse: This design is stronger than basic switching replication because it shows the treatment effect can be reversed, not just replicated.

40. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Quasi-experimental designs manipulate independent variables without random assignment, offering higher internal validity than non-experimental studies but lower than true experiments, with switching replication designs providing the strongest evidence.

📌 Key points (3–5)

  • What quasi-experimental research is: manipulation of an independent variable without random assignment or counterbalancing.
  • Two main categories: within-subjects designs (one-group posttest only, one-group pretest-posttest, interrupted time-series) and between-subjects designs (posttest only with nonequivalent groups, pretest-posttest with nonequivalent groups, interrupted time-series with nonequivalent groups, switching replication variants).
  • Internal validity trade-off: eliminates directionality problems through manipulation but retains confounding variable problems due to lack of random assignment.
  • Common confusion: switching replication designs come in two forms—basic (treatment continues in first group) vs. treatment removal (treatment is withdrawn from first group when added to second).
  • Strongest design: switching replication designs (especially with treatment removal) provide the highest internal validity among quasi-experimental approaches.

🔬 Core design types

🔬 Within-subjects quasi-experimental designs

Three main types exist:

  • One-group posttest only: measure outcome after treatment only
  • One-group pretest-posttest: measure before and after treatment
  • Interrupted time-series: multiple measurements over time with treatment introduced at a point

All test the same participants across conditions without random assignment.

🔬 Between-subjects quasi-experimental designs

Five main types exist:

  • Posttest only with nonequivalent groups: compare groups after treatment
  • Pretest-posttest with nonequivalent groups: compare groups before and after treatment
  • Interrupted time-series with nonequivalent groups: time-series comparison across groups
  • Pretest-posttest with switching replication: second group receives treatment later while first continues
  • Switching replication with treatment removal: treatment removed from first group when added to second

Groups are not randomly assigned.

🎯 Switching replication designs

🎯 Basic switching replication

In a basic pretest-posttest design with switching replication, the first group receives a treatment and the second group receives the same treatment a little bit later on (while the initial group continues to receive the treatment).

  • First group gets treatment immediately
  • Second group gets treatment after a delay
  • First group continues receiving treatment throughout
  • Demonstrates effect in two groups staggered over time

Example: Patients with depression start exercising immediately; students with depression start exercising one week later. Both groups continue exercising once started.

🎯 Treatment removal variant

In a switching replication with treatment removal design, the treatment is removed from the first group when it is added to the second group.

  • Treatment withdrawn from first group when second group begins
  • Demonstrates both treatment effect and reversal
  • Provides evidence for whether effects persist after withdrawal
  • Strongest evidence for treatment efficacy among quasi-experimental designs

Example: Patients exercise for one week, then stop. Simultaneously, students begin exercising. Depression should decrease in students and increase in patients.

Don't confuse: Basic switching replication keeps the first group in treatment; treatment removal withdraws it.

⚖️ Internal validity considerations

⚖️ What quasi-experiments eliminate

  • Directionality problem: solved through manipulation of the independent variable
  • Researchers control when and how the treatment is applied
  • Can establish temporal precedence (cause before effect)

⚖️ What quasi-experiments retain

  • Confounding variables problem: remains due to lack of random assignment
  • Groups may differ in systematic ways beyond the treatment
  • Cannot rule out alternative explanations as confidently as true experiments

⚖️ Validity hierarchy

| Design type | Internal validity level | Reason |
|---|---|---|
| Non-experimental | Lowest | No manipulation, no random assignment |
| Quasi-experimental | Medium | Manipulation present, no random assignment |
| True experiment | Highest | Both manipulation and random assignment |
| Switching replication (quasi) | Highest among quasi | Demonstrates effect, replication, and sometimes reversal |

🚨 Threats to validity in practice exercises

🚨 Multiple confounds example

The excerpt presents a scenario: two professors test daily quizzes by having Professor A give quizzes and Professor B not give quizzes, then compare final exam performance.

Five potential confounding variables that might differ:

  • Teaching style differences between professors
  • Class meeting time (morning vs. afternoon)
  • Student characteristics (different enrollment patterns)
  • Classroom environment differences
  • Professor expectations and enthusiasm

Why this matters: Without random assignment, any observed difference could stem from these confounds rather than the quizzes themselves.

🚨 Specific validity threats

For a study measuring obese children's weight before and after a 3-month activity program:

  • Regression to the mean: extreme scores tend to move toward average on retesting
  • Spontaneous remission: natural improvement over time without intervention
  • History: external events during the 3 months that affect weight
  • Maturation: natural developmental changes in growing children

Each threat offers an alternative explanation for any observed weight change beyond the program's effect.


41. Setting Up a Factorial Experiment

🧭 Overview

🧠 One-sentence thesis

Factorial designs allow researchers to study multiple independent variables simultaneously, enabling them to detect not only main effects of each variable but also interactions between variables—which are often among the most interesting findings in psychological research.

📌 Key points (3–5)

  • What factorial designs are: experiments that combine each level of one independent variable with each level of others to create all possible conditions.
  • Why researchers use multiple independent variables: to answer more sophisticated questions and to discover whether the effect of one variable depends on the level of another (interactions).
  • How to read factorial notation: the numbers indicate how many independent variables and how many levels each has (e.g., 2 × 2 means two variables, each with two levels).
  • Common confusion: distinguishing between-subjects, within-subjects, and mixed factorial designs—each determines whether participants experience one condition or multiple conditions.
  • Non-manipulated variables: factorial designs can include measured (not manipulated) participant variables, but causal conclusions can only be drawn about manipulated variables.

🔬 Why Include Multiple Independent Variables

🔬 Answering sophisticated research questions

  • Including multiple independent variables in one experiment allows researchers to address multiple questions simultaneously.
  • Example: Instead of conducting separate studies on disgust's effect on moral judgment and private body consciousness's effect on moral judgment, Schnall and colleagues studied both in one experiment.
  • This approach is more efficient than running separate studies for each variable.

🔗 Discovering interactions

Interaction: when the effect of one independent variable depends on the level of another independent variable.

  • Interactions often represent the most interesting results in psychological research.
  • Example: Schnall and colleagues found that disgust affected moral judgments, but only for participants high in private body consciousness—this is an interaction.
  • Without studying both variables together, this nuanced finding would have been missed.

📊 Understanding Factorial Design Structure

📊 What factorial design means

Factorial design: each level of one independent variable is combined with each level of the others to produce all possible combinations, with each combination becoming a condition in the experiment.

  • A factorial design table shows all possible combinations of the independent variables.
  • Example: cell phone use (yes vs. no) and time of day (day vs. night) creates four conditions: using phone during day, not using phone during day, using phone at night, not using phone at night.

🔢 Reading factorial notation

The notation tells you key information about the design:

| Notation | Meaning | Number of conditions |
|---|---|---|
| 2 × 2 | Two variables, each with two levels | 4 |
| 3 × 2 | Two variables: one with three levels, one with two | 6 |
| 2 × 2 × 2 | Three variables, each with two levels | 8 |
| 4 × 5 | Two variables: one with four levels, one with five | 20 |
  • The number of digits tells you how many independent variables.
  • Each digit's value tells you how many levels that variable has.
  • Multiply the numbers to get the total number of conditions.
  • Don't confuse: 2 × 2, 3 × 3, and 2 × 3 all have two independent variables (two numbers in notation), but different numbers of levels and conditions.
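
The rules above can be sketched in a few lines, using the cell-phone example (condition labels are illustrative):

```python
from itertools import product

# Enumerate the conditions of a factorial design by crossing every level
# of each independent variable (the 2 × 2 cell-phone example).
phone = ["phone", "no phone"]
time_of_day = ["day", "night"]

conditions = list(product(phone, time_of_day))
print(len(conditions))  # 2 × 2 = 4 conditions
for c in conditions:
    print(c)

# The notation generalizes: multiplying the numbers of levels gives the
# total number of conditions, e.g. a 2 × 2 × 2 × 3 design has 24.
levels = [2, 2, 2, 3]
n_conditions = 1
for n in levels:
    n_conditions *= n
print(n_conditions)  # 24
```

The digit count in the notation matches the number of lists crossed, and each digit matches one list's length.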

⚠️ Practical limits

  • In practice, designs rarely exceed three independent variables with more than two or three levels each.
  • Reason 1: The number of conditions can become unmanageable (e.g., a 2 × 2 × 2 × 3 design has 24 conditions).
  • Reason 2: The number of participants required to populate all conditions while maintaining adequate statistical power can make the design unfeasible.

👥 Assigning Participants to Conditions

👥 Between-subjects factorial design

Between-subjects factorial design: all independent variables are manipulated between subjects, so each participant is tested in only one condition.

  • Example: Each participant is tested either while using a cell phone or not, and either during the day or at night—experiencing only one of the four conditions.
  • Advantages: conceptually simpler, avoids order/carryover effects, minimizes time and effort per participant.

🔄 Within-subjects factorial design

Within-subjects factorial design: all independent variables are manipulated within subjects, so each participant is tested in all conditions.

  • Example: Each participant is tested both while using a cell phone and while not, and both during the day and during the night—experiencing all four conditions.
  • Advantages: more efficient for the researcher, controls extraneous participant variables.

🔀 Mixed factorial design

Mixed factorial design: one independent variable is manipulated between subjects and another within subjects.

  • Example: Test the same participants both while using and not using a cell phone (within-subjects), but test each participant either during the day or at night (between-subjects).
  • Each participant in this mixed design would be tested in two of the four conditions.
  • Regardless of design type, assignment to conditions or orders is typically done randomly.

🧪 Non-Manipulated Independent Variables

🧪 Including measured variables

Non-manipulated independent variable: a variable the researcher measures but does not manipulate.

  • These are usually participant variables (e.g., private body consciousness, hypochondriasis, self-esteem, gender).
  • Example: Schnall and colleagues manipulated disgust (clean vs. messy room) but only measured private body consciousness.
  • By definition, non-manipulated variables are between-subjects factors—a participant who is high in a trait cannot also be tested at the low level of that trait.

⚖️ Causal conclusions: critical limitation

  • Studies with at least one manipulated variable are generally considered experiments.
  • Important: Causal conclusions can only be drawn about manipulated variables, not measured ones.
  • Example: Schnall and colleagues could conclude disgust caused harsher moral judgments (they manipulated room cleanliness), but they could not conclude private body consciousness caused harsher judgments (they only measured it).
  • Don't confuse: A measured variable showing an effect might actually be caused by a third variable (e.g., neuroticism might cause both high body consciousness and strict moral codes).

🔍 Non-Experimental Factorial Studies

🔍 When no variables are manipulated

  • Factorial designs can include only non-manipulated independent variables—but then they are non-experimental, not experiments.
  • Example: A researcher measures participants' moods (positive vs. negative), self-esteem (high vs. low), and willingness to have unprotected sex—a 2 × 2 design with no manipulation.
  • This is a non-experimental study because neither independent variable was manipulated.

⚠️ Limitations of non-experimental factorial designs

  • As always, be cautious about inferring causality from non-experimental studies.
  • The directionality problem and third-variable problem still apply.
  • Example: An apparent effect of mood on willingness to have unprotected sex might be caused by any other variable correlated with mood.
  • Don't confuse: The factorial structure alone does not make a study experimental—at least one variable must be manipulated.


42. Interpreting the Results of a Factorial Experiment

🧭 Overview

🧠 One-sentence thesis

Factorial experiments yield three types of results—main effects (the overall effect of one independent variable averaged across others), interaction effects (when one variable's effect depends on another's level), and simple effects (used to break down interactions by examining each variable at specific levels of the other)—and interactions often reveal more nuanced patterns than main effects alone.

📌 Key points (3–5)

  • Main effects measure the overall impact of one independent variable averaged across all levels of the other independent variable(s); each independent variable has one main effect.
  • Interaction effects occur when the effect of one independent variable depends on the level of another; they can be spreading (effect present at one level but not another, or stronger at one level) or crossover (effects in opposite directions).
  • Simple effects analyses are needed when an interaction is present, because main effects can be misleading; simple effects examine each independent variable at each level of the other variable(s).
  • Common confusion: main effects vs. simple effects—main effects average across the other variable, while simple effects look at each level separately; simple effects are only necessary when an interaction exists.
  • Graphing conventions: one independent variable goes on the x-axis, the other is shown by different-colored bars or lines, and the y-axis always shows the dependent variable; line graphs are used when the x-axis variable is quantitative or represents time.

📊 Visualizing factorial results

📊 How to graph two independent variables

  • One independent variable is placed on the x-axis.
  • The other independent variable is represented by different-colored bars (bar graph) or different-formatted lines (line graph).
  • The y-axis is always reserved for the dependent variable.
  • The choice of which variable goes on the x-axis comes down to which arrangement communicates the results most clearly.

📈 When to use bar graphs vs. line graphs

  • Bar graphs: appropriate when both independent variables are categorical.
  • Line graphs: appropriate when the x-axis variable is quantitative with a small number of distinct levels, or when representing measurements over a time interval (time series).
  • Example: A 2×2 design with time of day (day vs. night) on the x-axis and cell phone use (no vs. yes) as different-colored bars would be a bar graph.
  • Example: A 4×2 design with psychotherapy length (quantitative) on the x-axis and psychotherapy type as different-formatted lines would be a line graph.

🎯 Main effects

🎯 What a main effect measures

Main effect: the effect of one independent variable on the dependent variable—averaging across the levels of the other independent variable.

  • There is one main effect for each independent variable in the study.
  • Main effects are calculated by averaging across all levels of the other independent variable(s).
  • Example: In a study of cell phone use and time of day on driving performance, the main effect of cell phone use would compare driving performance with vs. without cell phones, averaged across both day and night conditions.
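As a worked example with invented cell means (the numbers below are illustrative, not from the text), a main effect is computed by averaging the cell means across the levels of the other independent variable:

```python
# Hypothetical cell means: driving errors in a 2 x 2 design
# (cell phone use x time of day). All values are made up for illustration.
cell_means = {
    ("phone", "day"): 7, ("no_phone", "day"): 4,
    ("phone", "night"): 11, ("no_phone", "night"): 6,
}

def main_effect_means(factor_index):
    """Average the cell means across the levels of the other factor."""
    totals = {}
    for cell, mean in cell_means.items():
        totals.setdefault(cell[factor_index], []).append(mean)
    return {level: sum(v) / len(v) for level, v in totals.items()}

print(main_effect_means(0))  # phone use, averaged over day/night
print(main_effect_means(1))  # time of day, averaged over phone use
```

With these numbers, the main effect of phone use compares 9.0 vs. 5.0 errors, and the main effect of time of day compares 5.5 (day) vs. 8.5 (night).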

🔄 Independence of main effects

  • Main effects are independent of each other.
  • Whether or not there is a main effect of one independent variable says nothing about whether there is a main effect of the other.
  • Example: A study might show a clear main effect of psychotherapy length (longer therapy works better) without necessarily showing a main effect of psychotherapy type.
  • Don't confuse: the presence or absence of one main effect does not predict the presence or absence of another main effect.

📊 Identifying main effects in graphs

  • In bar graphs: compare the average height of bars for one variable across all levels of the other.
  • Example: If blue bars are, on average, higher than red bars across all x-axis positions, there is a main effect of the variable represented by bar color.
  • Example: If performance is better during the day than at night—both when using cell phones and when not—there is a main effect of time of day.

🔀 Interaction effects

🔀 What an interaction means

Interaction effect (or just "interaction"): when the effect of one independent variable depends on the level of another.

  • An interaction indicates that the relationship between one independent variable and the dependent variable changes depending on the level of the other independent variable.
  • You already understand interactions intuitively from everyday life.
  • Example: Your decision to see a movie depends on both which movie it is (main effect of movie type) and who is coming with you (interaction if your decision about the romantic comedy depends on who else is there).
  • Example: Drug interactions—Viagra and nitrates each have beneficial main effects, but their combination can be lethal (a very important interaction).

🔀 Research examples of interactions

  • Psychotherapy and motivation: The effect of receiving psychotherapy is stronger among people highly motivated to change than among those not motivated—the effect of therapy depends on motivation level.
  • Room cleanliness and body consciousness: The effect of room cleanliness (messy vs. clean) on moral judgments depended on private body consciousness; if participants were high in private body consciousness, those in the messy room made harsher judgments, but if low in private body consciousness, room condition did not matter.
  • Hypochondriasis and word type: People high in hypochondriasis recalled negative health-related words more accurately than people low in hypochondriasis, but recalled non-health-related words about the same—the effect of hypochondriasis depends on word type.

📐 Types of interactions: spreading interactions

Spreading interactions occur when one independent variable has an effect at one level of the other variable but no effect (or a weaker effect) at another level.

  • Type 1 spreading: Independent variable B has an effect at level 1 of independent variable A (bars differ in height on the left side) but no effect at level 2 of A (bars are the same height on the right side).
    • Example: Disgust had an effect on moral judgments for those high in private body consciousness but no effect for those low in private body consciousness.
  • Type 2 spreading: Independent variable B has a stronger effect at level 1 of A than at level 2 of A (larger difference in bar heights on the left, smaller difference on the right).
    • Example: Using a cell phone had a strong effect on driving at night and a weaker effect during the day.

❌ Types of interactions: crossover interactions

Crossover interaction: when independent variable B has an effect at both levels of independent variable A, but the effects are in opposite directions.

  • In a line graph, the two lines literally "cross over" each other.
  • Example: Introverts perform better than extraverts when they have not ingested caffeine, but extraverts perform better than introverts when they have ingested 4 mg of caffeine per kilogram of body weight—the effect of personality reverses depending on caffeine level.
  • Don't confuse with spreading interactions: in crossover interactions, the effect reverses direction; in spreading interactions, the effect is present or stronger at one level but not reversed.
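Both patterns can be captured numerically: in a 2 × 2 design, an interaction corresponds to a nonzero "difference of differences" between cell means. A minimal sketch with invented cell means:

```python
# Sketch: the interaction contrast in a 2 x 2 design is the effect of B
# at level 1 of A minus the effect of B at level 2 of A.
# All cell means below are invented for illustration.
def interaction_contrast(a1b1, a1b2, a2b1, a2b2):
    return (a1b2 - a1b1) - (a2b2 - a2b1)

# Spreading: B matters at A1 (difference of 4) but not at A2 (difference of 0)
print(interaction_contrast(5, 9, 5, 5))   # nonzero -> interaction
# Crossover: B's effect reverses direction across levels of A
print(interaction_contrast(4, 8, 8, 4))   # nonzero -> interaction
# Parallel effects: same difference at both levels of A
print(interaction_contrast(4, 8, 6, 10))  # zero -> no interaction
```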

🔬 Simple effects analyses

🔬 Why simple effects are needed

  • When an interaction is present, main effects can be misleading because they average across conditions where effects differ or reverse.
  • Example: In the caffeine-personality study, averaging across introversion and extraversion would show no main effect of caffeine (because positive effects on extraverts are canceled by negative effects on introverts), but this would miss the real story.
  • Simple effects provide a way of breaking down the interaction to figure out precisely what is going on.
  • Simple effects analyses are only necessary when an interaction is detected; when there is no interaction, main effects tell the complete and accurate story.

🔬 What simple effects measure

Simple effects: the effects of each independent variable at each level of the other independent variable(s).

  • Unlike main effects (which average across the other variable), simple effects examine one variable at a specific level of the other variable.
  • Example: Instead of examining the overall effect of caffeine (averaged across personality types), researchers examine the effect of caffeine in introverts and then separately examine the effect of caffeine in extraverts.
  • Example: Schnall and colleagues examined the effect of disgust separately for people high in private body consciousness (found an effect) and for people low in private body consciousness (found no effect).

🔬 How many simple effects to examine

The number of simple effects depends on the design:

| Design | Conditions | Main effects | Simple effects |
| --- | --- | --- | --- |
| 2 × 2 | 4 | 2 | 4 |
| 2 × 3 | 6 | 2 | 5 |
| 3 × 3 | 9 | 2 | 6 |
  • Number of main effects depends simply on the number of independent variables (one per variable).
  • Number of simple effects depends on the number of levels of the independent variables (a separate analysis of each independent variable is conducted at each level of the other).
  • Example: In a 2×2 design, you examine the effect of variable A at level 1 of B, the effect of A at level 2 of B, the effect of B at level 1 of A, and the effect of B at level 2 of A (four simple effects total).
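For a two-factor design, these counts follow a simple rule that can be sketched directly (function name ours): one main effect per independent variable, and the simple effects total the sum of the two level counts, since each variable is examined once at every level of the other.

```python
# Sketch for a two-factor design with a_levels x b_levels:
# - main effects: one per independent variable
# - simple effects: A examined at each of B's levels, plus B at each of A's
def effect_counts(a_levels, b_levels):
    main_effects = 2
    simple_effects = b_levels + a_levels
    return main_effects, simple_effects

print(effect_counts(2, 2))  # 2 x 2 -> (2, 4)
print(effect_counts(2, 3))  # 2 x 3 -> (2, 5)
print(effect_counts(3, 3))  # 3 x 3 -> (2, 6)
```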

🔬 Example: breaking down the hypochondriasis interaction

  • Brown and colleagues found an interaction between word type (health-related or not) and hypochondriasis (high or low) on word recall.
  • To break down this interaction, they examined the effect of hypochondriasis at each level of word type:
    • Effect of hypochondriasis on recall of health-related words: people high in hypochondriasis recalled more than people low in hypochondriasis.
    • Effect of hypochondriasis on recall of non-health-related words: no effect of hypochondriasis.
  • This simple effects analysis revealed that hypochondriasis only affected recall for health-related words, clarifying the nature of the interaction.

🚫 Non-experimental factorial designs

🚫 When factorial designs are not experiments

  • Factorial designs can include only non-manipulated independent variables, in which case they are no longer experiments but are instead non-experimental.
  • Example: A study that simply measures participants' moods (positive vs. negative) and self-esteem (high vs. low) and their willingness to have unprotected sex is a 2×2 factorial design, but because neither variable was manipulated, it is non-experimental.
  • Don't confuse: A similar study by MacDonald and Martineau (2002) was an experiment because they manipulated participants' moods.

🚫 Causal inference limitations

  • Because neither independent variable is manipulated in a non-experimental factorial design, one must be cautious about inferring causality.
  • The directionality problem and third-variable problem apply.
  • Example: An effect of participants' moods on their willingness to have unprotected sex might be caused by any other variable that happens to be correlated with their moods.
  • Non-manipulated variables (like gender) can be included in factorial designs, but they limit the causal conclusions that can be made about the effects of the non-manipulated variable on the dependent variable.

43. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Factorial designs allow researchers to study multiple independent variables simultaneously, revealing not only each variable's overall effect but also how variables interact to produce effects that depend on one another.

📌 Key points (3–5)

  • What factorial designs do: combine each level of one independent variable with each level of others to create all possible experimental conditions.
  • Main effects vs interactions: main effects show the overall impact of one variable averaged across all others; interactions show when one variable's effect depends on the level of another variable.
  • Common confusion: distinguishing main effects from interactions—main effects are averages; interactions reveal that the effect changes depending on other variables.
  • Non-manipulated variables: can be included (e.g., gender) but limit causal conclusions about that variable's effects.
  • Simple effects analysis: breaks down interactions by examining each independent variable at each level of the other variable.

🔬 Factorial design structure

🔬 What factorial designs combine

Factorial design: an approach in which each level of one independent variable is combined with each level of the others to create all possible conditions.

  • This creates a complete grid of experimental conditions.
  • Example: if Variable A has 2 levels and Variable B has 3 levels, the factorial design produces 2 × 3 = 6 conditions.
  • The excerpt emphasizes this is the most common approach when researchers include multiple independent variables.

🔀 Manipulation types

Each independent variable can be handled in two ways:

  • Between-subjects: different participants experience different levels.
  • Within-subjects: the same participants experience all levels.
  • Researchers choose based on the nature of the variable and practical constraints.

⚠️ Non-manipulated variables

  • Variables like gender can be included in factorial designs.
  • Important limitation: these limit the causal conclusions that can be made about the non-manipulated variable's effects on the dependent variable.
  • Don't confuse: manipulated variables allow causal inference; non-manipulated variables only allow correlational conclusions even within a factorial design.

📊 Main effects and interactions

📊 Main effects

Main effect of an independent variable: its overall effect averaged across all other independent variables.

  • There is one main effect for each independent variable in the design.
  • It answers: "What is this variable's impact, on average, ignoring the other variables?"
  • Example: In a 2 × 2 design with variables A and B, there is one main effect for A and one main effect for B.

🔗 Interactions

Interaction: occurs between two independent variables when the effect of one depends on the level of the other.

  • The excerpt highlights that some of the most interesting research questions and results in psychology are specifically about interactions.
  • An interaction means the variables don't work independently—their combined effect is different from simply adding their separate effects.
  • Example: Variable A might increase the outcome at Level 1 of Variable B but decrease it at Level 2 of Variable B.

🔍 Simple effects analysis

Simple effects analysis: a means for researchers to break down interactions by examining the effect of each independent variable at each level of the other independent variable.

  • Used when an interaction is found.
  • Provides a detailed look at how one variable works at specific levels of another.
  • Complexity note: The number of simple effects analyses depends on the number of levels of the independent variables (a separate analysis of each independent variable is conducted at each level of the other independent variable).
  • Example from excerpt: A design with nine conditions would need to look at 2 main effects and 6 simple effects.

🧮 Counting effects in factorial designs

🧮 How many main effects

  • The number of main effects depends simply on the number of independent variables included.
  • One main effect can be explored for each independent variable.
  • Example: 3 independent variables → 3 main effects.

🧮 How many simple effects

  • The number of simple effects analyses depends on the number of levels of the independent variables.
  • More levels create more simple effects to examine.
  • The excerpt's example: nine conditions → 2 main effects + 6 simple effects.

🎯 Practice exercises

🎯 Identifying variables

The excerpt suggests practicing by:

  • Returning to five article titles (referenced earlier in the text).
  • For each, identify the independent variables and the dependent variable.

🎯 Creating a factorial design table

Practice task:

  • Design an experiment on the effects of room temperature and noise level on performance on the MCAT.
  • Indicate whether each independent variable will be manipulated between-subjects or within-subjects.
  • Explain why you chose each manipulation type.

🎯 Sketching results patterns

Practice drawing 8 different bar graphs for a 2 × 2 factorial experiment to depict:

| Pattern | Main effect of A | Main effect of B | Interaction |
| --- | --- | --- | --- |
| 1 | No | No | No |
| 2 | Yes | No | No |
| 3 | No | Yes | No |
| 4 | Yes | Yes | No |
| 5 | Yes | Yes | Yes |
| 6 | Yes | No | Yes |
| 7 | No | Yes | Yes |
| 8 | No | No | Yes |
  • This exercise helps distinguish between main effects (overall trends) and interactions (when effects depend on other variables).
  • Don't confuse: a graph can show main effects without interactions (parallel lines) or interactions without main effects (crossing lines that average out).

44. Overview of Single-Subject Research

🧭 Overview

🧠 One-sentence thesis

Single-subject research focuses intensively on the behavior of a small number of participants to discover strong, consistent causal effects that are important for real-world application, offering an alternative to traditional group research that can hide individual differences.

📌 Key points (3–5)

  • What it is: A quantitative approach studying 2–10 participants in detail, focusing on each individual's behavior rather than group averages.
  • Core assumptions: It emphasizes discovering causal relationships through experimental manipulation, studying strong effects with social importance, and revealing individual differences that group research may hide.
  • Common confusion: Single-subject research is not the same as qualitative case studies—it uses experimental control, highly structured data, and quantitative analysis, not narrative descriptions.
  • Who uses it: Researchers in applied behavior analysis, experimental analysis of behavior, and clinicians across theoretical perspectives studying therapeutic change.
  • Why it matters: Treatments that help half but harm half of participants appear to have no effect in group research, but single-subject research reveals these critical individual differences.

🔬 What single-subject research is

🔬 Definition and scope

Single-subject research: a type of quantitative research that involves studying in detail the behavior of each of a small number of participants.

  • Despite the name, it typically involves 2–10 participants, not just one.
  • Also called small-n designs (where n = sample size).
  • The focus is on each individual's behavior, not group averages.

🆚 How it differs from group research

| Aspect | Single-subject research | Group research |
| --- | --- | --- |
| Number of participants | 2–10 (small n) | Large numbers |
| Focus | Individual behavior patterns | Group means, standard deviations |
| Data analysis | Quantitative, per individual | Quantitative, aggregated |
| Prevalence | Alternative approach | Most common in psychology |

⚠️ Not the same as case studies

Don't confuse with qualitative approaches:

  • Case studies (Chapter 6): In-depth qualitative analysis and description of an individual.
  • Qualitative research: Focuses on subjective experience, unstructured data (e.g., detailed interviews), narrative analysis.
  • Single-subject research: Focuses on objective behavior through experimental manipulation, highly structured data, and quantitative analysis.

Example: A case study might describe a patient's subjective experience through interviews; single-subject research would experimentally manipulate a treatment and measure specific behaviors with numerical data.

🧱 Core assumptions

🧱 Focus intensively on individuals

Why study individuals rather than groups?

Two key reasons:

  1. Group research can hide individual differences

    • A treatment with positive effects for half the participants and negative effects for the other half averages to no effect.
    • Single-subject research would reveal these opposing patterns.
    • Example: If a teaching method helps 5 students but confuses 5 others, group research shows zero average improvement, missing the fact that it's harmful for half the class.
  2. Sometimes one individual is the primary interest

    • A school psychologist wants to change a particular disruptive student's behavior.
    • Conducting a study on that specific student is more direct and effective than relying only on published group research.
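The averaging problem in reason 1 can be sketched with invented change scores: a treatment that helps half the participants and hurts the other half produces a group mean of zero.

```python
# Invented change scores for 10 participants (positive = improvement).
# Half improve, half get worse by roughly the same amount.
changes = [4, 5, 3, 4, 4, -4, -5, -3, -4, -4]

group_mean = sum(changes) / len(changes)
helped = sum(c > 0 for c in changes)
hurt = sum(c < 0 for c in changes)

print(group_mean)    # group research sees "no effect"
print(helped, hurt)  # single-subject research sees two opposing patterns
```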

🔍 Discover causal relationships

Single-subject research is experimental:

  • Manipulates an independent variable (the treatment).
  • Carefully measures a dependent variable (the behavior).
  • Controls extraneous variables.
  • Has good internal validity.

Example from the excerpt: Hall and colleagues measured studying behavior:

  • First under no-treatment control (baseline).
  • Then under treatment (positive teacher attention).
  • Then back to control (treatment removed).
  • Then treatment reintroduced.

Result: Clear increase when treatment introduced, decrease when removed, increase when reintroduced → strong evidence the treatment caused the improvement.
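The logic of that result can be sketched with invented phase means (percent of time spent studying; the values below are illustrative, not Hall's data):

```python
# Hypothetical ABAB phase means: A = baseline, B = treatment.
phase_means = {"A1": 30, "B1": 80, "A2": 35, "B2": 85}

# Evidence for a causal effect: behavior tracks the treatment across phases.
rose_with_treatment = phase_means["B1"] > phase_means["A1"]
reversed_at_withdrawal = phase_means["A2"] < phase_means["B1"]
rose_again = phase_means["B2"] > phase_means["A2"]

print(rose_with_treatment and reversed_at_withdrawal and rose_again)
```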

💪 Study strong, consistent, socially important effects

Social validity: treatments that have substantial effects on important behaviors and can be implemented reliably in real-world contexts.

What applied researchers want:

  • Strong effects: Not tiny statistical differences, but meaningful changes.
  • Consistent effects: Reliable across measurements.
  • Biological or social importance: Behaviors that matter in real life.
  • Implementable: Can be used in actual settings (classrooms, clinics, etc.).

Example: Hall's study had good social validity because:

  • It showed strong and consistent effects.
  • Studying behavior is obviously important to teachers, parents, and students.
  • Teachers found the treatment (positive attention) easy to implement in chaotic elementary school classrooms.

📜 Historical context and users

📜 Early foundations

Psychology's founders used single-subject approaches:

  • Wilhelm Wundt (late 1800s): Studied sensation and consciousness by focusing intensively on each of a small number of participants.
  • Herman Ebbinghaus: Memory research.
  • Ivan Pavlov: Classical conditioning research.
  • These studies are still described in introductory psychology textbooks.

🐀 Mid-20th century: Skinner and experimental analysis of behavior

B. F. Skinner's contributions (1938):

  • Clarified assumptions underlying single-subject research.
  • Refined techniques.
  • Used it to describe how rewards, punishments, and external factors affect behavior over time.

Experimental analysis of behavior:

  • Carried out primarily with nonhuman subjects (rats and pigeons).
  • Remains an important subfield of psychology.
  • Relies almost exclusively on single-subject research.
  • Published in the Journal of the Experimental Analysis of Behavior.

👥 1960s onward: Applied behavior analysis

Applied behavior analysis:

  • By the 1960s, researchers applied this approach to humans.
  • Focuses on applied research with practical goals.
  • Important in contemporary research on:
    • Developmental disabilities
    • Education
    • Organizational behavior
    • Health
    • Many other areas
  • Published in the Journal of Applied Behavior Analysis (including Hall's study).

🌐 Beyond the behavioral perspective

Single-subject research is not limited to one theory:

  • Most contemporary single-subject research is from the behavioral perspective.
  • But it can address questions from any theoretical perspective.

Examples:

  • A studying technique based on cognitive principles of learning and memory could be tested on individual high school students.
  • Clinicians from any perspective (behavioral, cognitive, psychodynamic, humanistic) can use it to:
    • Study therapeutic change processes with individual clients.
    • Document clients' improvement.


45. Single-Subject Research Designs

🧭 Overview

🧠 One-sentence thesis

Single-subject research designs establish causal relationships by repeatedly measuring individual participants across distinct phases, using visual inspection to determine whether treatments produce clear, replicable changes in behavior.

📌 Key points (3–5)

  • Core approach: Measure the same participant repeatedly over time, test under one condition per phase, and wait for steady state before changing conditions.
  • Reversal designs (ABA/ABAB): Establish baseline, introduce treatment, remove treatment to show the effect reverses—greatly increasing internal validity.
  • Multiple-baseline designs: Introduce treatment at different times across participants, behaviors, or settings to rule out coincidence without removing the treatment.
  • Common confusion: Single-subject research is not just "small n" group research—it uses visual inspection of individual data patterns (level, trend, latency) rather than averaging across participants and relying on inferential statistics.
  • Historical roots: Developed by B.F. Skinner for experimental analysis of behavior; now widely used in applied behavior analysis for developmental disabilities, education, organizational behavior, and health.

🔬 Common features across designs

📊 Repeated measurement over time

  • The dependent variable (y-axis) is measured repeatedly at regular intervals over time (x-axis).
  • This creates a continuous record of behavior rather than a single snapshot.
  • The study is divided into distinct phases, each designated by a capital letter (A, B, C, etc.).
  • Each phase tests the participant under one condition.

⚖️ The steady state strategy

Steady state strategy: The researcher waits until the participant's behavior becomes fairly consistent from observation to observation before changing conditions.

  • The change from one condition to the next does not occur after a fixed time or number of observations.
  • Instead, it depends on the participant's behavior reaching consistency.
  • Why it matters: When the dependent variable has reached steady state, any change across conditions will be relatively easy to detect—this minimizes "noise" in the data.
  • Example: A researcher measuring study time waits until the daily measurements stabilize before introducing a reward system.
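One simple way to operationalize the steady state criterion is to require that the last few observations vary within a small range before changing conditions (the window size and tolerance below are invented for illustration):

```python
# Sketch of a steady-state check: behavior counts as "steady" when the
# last `window` observations span no more than `tolerance` units.
def is_steady(observations, window=5, tolerance=2.0):
    recent = observations[-window:]
    return len(recent) == window and max(recent) - min(recent) <= tolerance

# Early observations are noisy, but the last five have settled.
print(is_steady([12, 30, 25, 21, 20, 21, 19, 20]))  # stable -> change conditions
print(is_steady([12, 30, 25, 21, 35, 21, 19, 50]))  # still noisy -> keep waiting
```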

🔄 Reversal designs

🔄 Basic ABA design

Reversal design (ABA design): A baseline phase (A) is established, treatment is introduced (B), then the treatment is removed and baseline conditions are restored (A).

Phase breakdown:

  • Phase A (baseline): Establish the level of responding before any treatment; this is the control condition.
  • Phase B (treatment): Introduce the treatment; wait for adjustment period and then steady state.
  • Phase A (return to baseline): Remove treatment; wait again for steady state.

The design can be extended: ABAB (reintroduce treatment), ABABA (another baseline), and so on.

🔍 Why reversal is necessary

  • An AB design alone is essentially an interrupted time-series design—if behavior changes after treatment, something else might have changed at the same time.
  • The reversal solves this: If the dependent variable changes with treatment introduction and then changes back with treatment removal, the treatment is much more clearly the cause.
  • This greatly increases internal validity.
  • Don't confuse: The reversal only works if the treatment does not create a permanent effect; if behavior stays changed after removal, it's unclear whether the treatment caused the original change or something else coincided with it.

🔀 Multiple-treatment reversal design

  • A baseline phase is followed by separate phases introducing different treatments.
  • Example: Baseline (A) → positive attention (B) → mild punishment (C) → return to baseline (A) → reintroduce treatments in reverse order (C, B) to control carryover effects.
  • This particular sequence of phases can be symbolized as ABCACB.

⚡ Alternating treatments design

  • Two or more treatments are alternated quickly on a regular schedule.
  • Example: Positive attention one day, mild punishment the next, alternating throughout.
  • Or one treatment in the morning, another in the afternoon.
  • Limitation: Only works when treatments are fast-acting.

📐 Multiple-baseline designs

❓ Why use multiple-baseline instead of reversal

Two problems with reversal designs:

  1. Ethical issue: If a treatment is working (e.g., reducing self-injury), it may be unethical to remove it just to demonstrate reversal.
  2. Permanent effects: The dependent variable may not return to baseline when treatment is removed—either because the treatment had a lasting positive effect, or because something else (not the treatment) caused the change.

Solution: Use a multiple-baseline design, which does not require removing the treatment.

👥 Multiple-baseline across participants

  • Establish a baseline for each of several participants.
  • Introduce the treatment at a different time for each participant.
  • Each participant is essentially tested in an AB design.
  • Logic: If behavior changes when treatment is introduced for one participant, it might be coincidence; if it changes for multiple participants at different introduction times, coincidence is very unlikely.

Example from the excerpt:

  • Ross & Horner (2009) studied bullying behavior at three schools.
  • They observed two problem students at each school during baseline.
  • They implemented a bullying prevention program at School 1 after 2 weeks, School 2 after 4 weeks, School 3 after 6 weeks.
  • Aggressive behaviors dropped shortly after the program was implemented at each school.
  • If the program had been introduced at all three schools simultaneously, a coincidental event (holiday, TV program, weather change) could explain the results—but three separate coincidences at different times is very unlikely.

🎯 Multiple-baseline across behaviors

  • Multiple baselines are established for the same participant but for different dependent variables.
  • The treatment is introduced at a different time for each dependent variable.

Example:

  • An office worker has two tasks: making sales calls and writing reports.
  • Baseline: Measure both for several weeks.
  • Introduce goal-setting treatment for sales calls first, then later for report writing.
  • If productivity increases on both tasks after treatment introduction (at different times), the treatment is likely responsible.

🏠 Multiple-baseline across settings

  • Multiple baselines are established for the same participant but in different settings.

Example:

  • Baseline: Measure time a child spends reading during free time at school and at home.
  • Introduce positive attention first at school, later at home.
  • If reading time increases in each setting after treatment introduction, the treatment is likely responsible.

📊 Data analysis: Visual inspection

👁️ Visual inspection vs. statistics

Visual inspection: Plotting individual participants' data, looking carefully at those data, and making judgments about whether and to what extent the independent variable had an effect.

Key differences from group research:

  • Group research combines data across participants, uses means/standard deviations, and relies on inferential statistics.
  • Single-subject research plots individual data and relies heavily on visual inspection.
  • Inferential statistics are typically not used (though this is changing).

📏 Three factors in visual inspection

| Factor | Definition | What it suggests |
| --- | --- | --- |
| Level | How high or low the dependent variable is in one condition vs. another | Much higher or lower in one condition → treatment had an effect |
| Trend | Gradual increases or decreases across observations | Begins increasing/decreasing with condition change → treatment had an effect; especially telling when the trend changes direction (e.g., unwanted behavior increasing during baseline, then decreasing with treatment) |
| Latency | Time it takes for the dependent variable to begin changing after a condition change | Short latency (change happens immediately) → treatment likely responsible |

Example of effective treatment (top panel, Figure 10.4):

  • Fairly obvious changes in level and trend from condition to condition.
  • Short latencies—change happens immediately.
  • Conclusion: Treatment was responsible for changes.

Example of ineffective treatment (bottom panel, Figure 10.4):

  • Small changes in level.
  • Increasing trend in treatment condition looks like a continuation of a baseline trend.
  • Conclusion: Treatment was not responsible for changes.
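Although visual inspection is a judgment process, two of the three factors can be quantified directly: level as a phase mean, trend as a least-squares slope (latency would require timestamped observations). A minimal Python sketch, using hypothetical phase data rather than any study's actual measurements:

```python
def level(phase):
    """Level: the average value of the dependent variable within a phase."""
    return sum(phase) / len(phase)

def trend(phase):
    """Trend: slope of a least-squares line through the phase data,
    assuming observations are equally spaced in time."""
    n = len(phase)
    mean_x = (n - 1) / 2
    mean_y = sum(phase) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(phase))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical data: a fairly stable baseline, then a jump in level
# and a clear rising trend after the treatment is introduced.
baseline = [20, 22, 21, 23, 22]
treatment = [40, 45, 50, 55, 60]
print(level(baseline), level(treatment))  # 21.6 50.0
print(trend(baseline), trend(treatment))  # treatment slope is 5.0 per observation
```

A large jump in level plus a trend change between phases is the numeric analogue of what the top panel of Figure 10.4 shows visually.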

🔢 Statistical approaches (supplementary)

  • Becoming more common but still debated.
  • Approach 1: Compute mean and standard deviation for each participant under each condition; apply t-tests or ANOVA (averaging across participants is less common).
  • Approach 2: Compute percentage of non-overlapping data (PND)—the percentage of responses in the treatment condition that are more extreme than the most extreme response in the control condition.
    • Example: In Hall et al.'s study, all measures of Robbie's study time in the first treatment were greater than the highest baseline measure → PND = 100%.
    • Greater PND → stronger treatment effect.
  • Don't confuse: Formal statistics are considered a supplement to visual inspection, not a replacement.
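The PND calculation is simple enough to sketch in a few lines of Python. The numbers below are hypothetical study-time percentages chosen to mirror the Robbie example, not actual data from Hall et al.:

```python
def pnd(baseline, treatment, higher_is_better=True):
    """Percentage of non-overlapping data: the share of treatment-phase
    observations more extreme than the most extreme baseline observation."""
    if higher_is_better:
        cutoff = max(baseline)
        non_overlapping = [x for x in treatment if x > cutoff]
    else:
        cutoff = min(baseline)
        non_overlapping = [x for x in treatment if x < cutoff]
    return 100 * len(non_overlapping) / len(treatment)

# Hypothetical percentages of session time spent studying:
baseline = [25, 30, 28, 35, 27]
treatment = [70, 75, 80, 72, 78]
print(pnd(baseline, treatment))  # 100.0 — every treatment point exceeds the baseline maximum
```

For a behavior the treatment is meant to reduce (e.g., self-injury), `higher_is_better=False` counts treatment observations below the baseline minimum instead.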

🌍 Applications and scope

🧪 Historical development

  • B.F. Skinner clarified assumptions and refined techniques in the mid-20th century.
  • Used primarily with nonhuman subjects (rats, pigeons) to study how rewards, punishments, and external factors affect behavior over time.
  • Called the experimental analysis of behavior—remains an important subfield relying almost exclusively on single-subject research.
  • Journal: Journal of the Experimental Analysis of Behavior.

🏥 Applied behavior analysis

  • By the 1960s, researchers began using this approach for applied research primarily with humans.
  • Called applied behavior analysis (Baer, Wolf, & Risley, 1968).
  • Plays an especially important role in research on developmental disabilities, education, organizational behavior, and health.
  • Journal: Journal of Applied Behavior Analysis.

🧠 Beyond the behavioral perspective

  • Although most contemporary single-subject research is conducted from the behavioral perspective, it can in principle address questions from any theoretical perspective.
  • Example: A studying technique based on cognitive principles of learning and memory could be evaluated on individual high school students using the single-subject approach.
  • Clinicians of any perspective (behavioral, cognitive, psychodynamic, humanistic) can use it to study therapeutic change with individual clients and document improvement.
46

The Single-Subject Versus Group "Debate"

46. The Single-Subject Versus Group "Debate"

🧭 Overview

🧠 One-sentence thesis

Single-subject research and group research are complementary quantitative approaches with different strengths and weaknesses, best suited for answering different kinds of research questions rather than being in true opposition.

📌 Key points (3–5)

  • Both are quantitative: Single-subject and group research both manipulate independent variables, measure dependent variables, and control extraneous variables to establish causal relationships.
  • Main disagreements: Debates center on data analysis methods (visual inspection vs. statistics) and external validity (generalizing from few vs. many participants).
  • Common confusion: Studying large groups does not automatically solve generalization problems—generalizing to individuals from group averages can be misleading, and generalization also depends on similarity of situations studied.
  • Complementary strengths: Single-subject research excels at detecting strong, consistent effects in individuals; group research excels at detecting weak effects and interactions at the population level.
  • Converging evidence principle: No single study design is perfect; examining multiple studies with different flaws that point to the same conclusion increases confidence in findings.

📊 Data analysis disagreements

📊 Group researchers' concerns about visual inspection

Group research advocates worry that visual inspection (the primary single-subject analysis method) has three problems:

  • Not sensitive enough to detect weak treatment effects
  • Unreliable: different researchers may reach different conclusions from the same data
  • Hard to summarize: overall judgments ("effective" or "not effective") cannot be clearly compared across studies, unlike statistical measures of relationship strength

🔬 Single-subject researchers' response

Single-subject researchers acknowledge these concerns but argue their methods minimize the problems:

  • They use the steady state strategy combined with focus on strong and consistent effects
  • If an effect is too weak to detect visually, they work to increase effect strength or reduce data noise by controlling extraneous variables
  • If the effect remains difficult to detect, they consider it neither strong enough nor consistent enough to pursue further
  • Many now use statistical analysis as a supplement to visual inspection, especially for cross-study comparisons

📉 Single-subject researchers' concerns about group means

Turning the tables, single-subject researchers point out that focusing on group means can be highly misleading:

Example: Imagine a treatment has a strong positive effect on half the participants and an equally strong negative effect on the other half. In a between-subjects experiment, these effects cancel out statistically—the treatment group mean equals the control group mean, making it appear the treatment had no effect when it actually had a strong effect on every participant.

🔄 Group researchers' response

Group researchers share this concern and use several strategies to address it:

| Strategy | How it helps |
| --- | --- |
| Examine distributions | Looking at histograms can reveal bimodal distributions showing both positive and negative effects |
| Within-subjects designs | Allow observation of effects at the individual level and specification of what percentage of participants show strong, medium, weak, or negative effects |
| Factorial designs | Can examine whether effects differ across different groups (e.g., introverts vs. extraverts) |

🌍 External validity disagreements

🌍 Group researchers' concern about generalization

Advocates of group research question whether results from just a few participants will generalize to others in the population.

Example: If a treatment reduces self-injury in two children with intellectual disabilities, how can we know it will work for other children with intellectual delays?

🔁 Single-subject researchers' response

Single-subject researchers address generalization concerns through multiple strategies:

  • Strong, consistent effects observed even in small samples are likely to generalize to the population
  • Emphasis on replication: They replicate findings with another small sample, perhaps with slightly different participants or conditions; each similar result increases confidence in generality
  • Historical success: Principles of classical and operant conditioning—discovered using single-subject approaches—have successfully generalized across an incredibly wide range of species and situations

🔄 Single-subject researchers' counter-concern

Single-subject researchers point out that large groups don't entirely solve the generalization-to-individuals problem:

Example: A treatment shows a small positive effect on average in a large group study. Most likely, some participants showed small positive effects, others showed large positive effects, and still others showed small negative effects. When applying this treatment to another large group, we can predict a small average effect; but when applying it to a particular individual, we cannot predict whether the effect will be small, large, or even negative.

🎯 Generalization requires more than sample size

Single-subject researchers emphasize that group researchers face a similar problem with situations:

  • Researchers studying cell phone use on a closed oval track want to generalize to real-world driving situations
  • This requires generalizing from a single situation to a population of situations
  • Ability to generalize depends on careful consideration of the similarity of both participants and situations studied to the populations one wants to generalize to—not just sheer number of participants

🤝 Complementary methods framework

🤝 When single-subject research is best

Single-subject research is particularly appropriate for:

  • Testing treatment effectiveness on individuals when the focus is on strong, consistent, and biologically or socially important effects
  • Situations where the behavior of particular individuals is of interest
  • Clinicians working with one individual at a time—may be their only option for systematic quantitative research

🤝 When group research is best

Group research is ideal for:

  • Testing treatment effectiveness at the group level
  • Detecting weak effects, which can be interesting for many reasons (e.g., finding a weak effect might lead to treatment refinements that produce larger, more meaningful effects)
  • Studying interactions between treatments and participant characteristics (e.g., effectiveness differs for high vs. low motivation)
  • Answering questions about independent variables that cannot be manipulated (e.g., number of siblings, extraversion, culture)

🧬 Research traditions matter

The most important factor affecting which approach a researcher uses is research tradition:

  • Researchers in experimental analysis of behavior and applied behavior analysis learn to conceptualize questions in ways amenable to single-subject approaches
  • Researchers in most other areas of psychology learn to conceptualize questions in ways amenable to group approaches
  • Many topics successfully integrate both traditions

Example: Research on innate "number sense"—awareness of how many objects or events experienced without counting—has used both single-subject research with rats and birds and group research with human infants, showing strikingly similar abilities across populations.

🔍 The principle of converging evidence

🔍 No perfect design exists

The principle of converging evidence: examine the pattern of flaws across multiple studies because this pattern can either support or undermine conclusions.

Key insights:

  • Every research design has strengths and weaknesses
  • True experiments have high internal validity but may have problems with external validity
  • Non-experimental research (e.g., correlational) often has good external validity but poor internal validity
  • No single study can be definitive—this is why there is no "scientific proof," only scientific evidence

🔍 How to evaluate converging evidence

Scientists evaluate theories by looking at overall trends in multiple partially flawed studies:

| Pattern of flaws | Impact on confidence |
| --- | --- |
| All studies flawed in the same way (e.g., all correlational, with third-variable and directionality problems) | Undermines confidence—the consistency may result from the shared flaw |
| Studies flawed in different ways, with the weaknesses of some balanced by the strengths of others (e.g., the low external validity of experiments balanced by the high external validity of correlational studies) | Increases confidence—diverse approaches pointing to the same conclusion |

🔍 Progress, not perfection

Don't confuse: The media often tries to reach strong conclusions from one study, but scientists focus on evaluating a body of research.

  • Psychologists use a diverse set of approaches with complementary strengths
  • If many studies using different designs converge on the same conclusion, confidence in that conclusion increases dramatically
  • In science, we strive for progress, not perfection
47

Key Takeaways and Exercises

47. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Single-subject research and group research are complementary methods with different strengths that suit different research questions, and both contribute valuable evidence through distinct approaches to experimental control and generalization.

📌 Key points (3–5)

  • What single-subject research is: experimental study of objective behavior through manipulation and control, using highly structured quantitative data—distinct from qualitative case studies.
  • Core design logic: measure the dependent variable repeatedly over time and change conditions only when behavior reaches a steady state, allowing clear observation of whether the independent variable causes changes.
  • Two main designs: reversal designs (baseline → treatment → baseline) and multiple-baseline designs (staggered treatment introduction across participants, variables, or settings).
  • Common confusion: single-subject vs qualitative research—single-subject research uses experimental manipulation and quantitative analysis, not open-ended qualitative exploration.
  • Complementary relationship: disagreements between single-subject and group researchers center on data analysis and external validity, but the methods are best seen as complementary rather than competing.

🔬 What single-subject research is

🔬 Definition and scope

Single-subject research: experimental study focusing on understanding objective behavior through experimental manipulation and control, collecting highly structured data, and analyzing those data quantitatively.

  • It is not the same as qualitative research on one person or a few individuals.
  • The key distinction: single-subject research uses experimental methods and quantitative analysis, not narrative or interpretive approaches.
  • Example: A researcher systematically introduces and removes a treatment while measuring specific behaviors numerically, rather than conducting open-ended interviews.

📜 Historical context and theoretical ties

  • Single-subject research has existed since the beginning of psychology as a field.
  • Today it is most strongly associated with the behavioral theoretical perspective.
  • However, it can in principle be used to study behavior from any perspective, not just behavioral.

🧪 Core design logic

🧪 Repeated measurement and steady state

  • The typical approach: measure the dependent variable repeatedly over time.
  • Change conditions (e.g., from baseline to treatment) only when the dependent variable has reached a steady state.
  • Why this matters: this approach allows the researcher to see whether changes in the independent variable are causing changes in the dependent variable.
  • Don't confuse: "steady state" means stable, predictable behavior—not zero behavior or perfect flatness, but consistency that lets you detect real changes.

🔄 Reversal design

Reversal design: the participant is tested in a baseline condition, then tested in a treatment condition, and then returned to baseline.

  • The logic: if the dependent variable changes with the introduction of the treatment and then changes back with the return to baseline, this provides strong evidence of a treatment effect.
  • Example: A child's tooth-brushing frequency is measured (baseline), then positive attention is given after brushing (treatment), then attention is withdrawn (return to baseline)—if brushing increases during treatment and decreases when attention stops, the treatment likely caused the change.

📊 Multiple-baseline design

Multiple-baseline design: baselines are established for different participants, different dependent variables, or different settings—and the treatment is introduced at a different time on each baseline.

  • The logic: if the introduction of the treatment is followed by a change in the dependent variable on each baseline, this provides strong evidence of a treatment effect.
  • The staggered timing is crucial—it shows that change happens only when treatment is introduced, not due to other time-related factors.
  • Example: Three students start self-testing at different weeks; if each student's spelling test performance improves only after their own self-testing begins, the treatment is likely effective.

📈 Data analysis approach

📈 Graphing and visual judgment

  • Single-subject researchers typically analyze their data by graphing them.
  • Judgments about whether the independent variable is affecting the dependent variable are based on three features:
    • Level: the average value or magnitude of the dependent variable.
    • Trend: the direction and slope of change over time.
    • Latency: how quickly the dependent variable changes after the independent variable is introduced.
  • This is a visual, pattern-recognition approach rather than statistical hypothesis testing.

🤝 Relationship between single-subject and group research

🤝 Points of disagreement

| Issue | Nature of disagreement |
| --- | --- |
| Data analysis | Single-subject research relies on visual judgment; group research relies on statistical tests |
| External validity | Especially generalization to other people—single-subject research studies one or a few participants; group research studies many |

  • These differences sometimes lead to disagreements between single-subject and group researchers.
  • Example of criticism: A single-subject study on one man with social anxiety disorder might be criticized because it "cannot be generalized to others."
  • Example of counter-criticism: A group study showing "average" effects might be criticized because averages "cannot be generalized to individuals."

🤝 Complementary strengths

  • Single-subject research and group research are probably best seen as complementary methods.
  • Each has different strengths and weaknesses.
  • Each is appropriate for answering different kinds of research questions.
  • Don't confuse: "complementary" means they work together to build knowledge, not that one is superior or that they are interchangeable.

🛠️ Practice exercises

🛠️ Reading and summarizing

  • Find and read a published article in psychology that reports new single-subject research.
  • An archive of articles is available in the Journal of Applied Behavior Analysis.
  • Write a short summary of the study.

🛠️ Designing studies

The excerpt provides three research questions for practice design:

  • Does positive attention from a parent increase a child's tooth-brushing behavior?
  • Does self-testing while studying improve a student's performance on weekly spelling tests?
  • Does regular exercise help relieve depression?

For each, specify:

  • The treatment (what is introduced or changed).
  • Operational definition of the dependent variable (how behavior is measured).
  • When and where observations will be made.

🛠️ Graphing and interpretation

  • Create a graph that displays hypothetical results for a designed study.
  • Write a paragraph describing what the results show.
  • Be sure to comment on level, trend, and latency.

🛠️ Responding to criticisms

  • How to respond to the criticism that a single-subject study "cannot be generalized to others."
  • How to respond to the criticism that group study "average effects cannot be generalized to individuals."

🛠️ Redesign and comparison

  • Redesign a single-subject study (Hall and colleagues, mentioned at the beginning of the chapter) as a group study.
  • List the strengths and weaknesses of the new study compared with the original study.

🛠️ Generation effect application

Generation effect: the fact that people who generate information as they are learning it (e.g., by self-testing) recall it better later than do people who simply review information.

  • Design a single-subject study on the generation effect applied to university students learning brain anatomy.
48

American Psychological Association (APA) Style

48. American Psychological Association (APA) Style

🧭 Overview

🧠 One-sentence thesis

APA style is a standardized set of writing guidelines designed to facilitate scientific communication in psychology by promoting clarity, consistency, and objectivity in presenting research.

📌 Key points (3–5)

  • Purpose of APA style: facilitates scientific communication by standardizing organization, content, and expression in research writing—making it easier to write and read research.
  • Three levels: overall article organization (sections in fixed order), high-level style (formal and straightforward expression), and low-level style (specific formatting rules for citations, numbers, tables, etc.).
  • Not synonymous with "good writing": APA style is a genre appropriate for psychological research contexts; other contexts (literary analysis, newspaper articles) require different styles (MLA, AP).
  • Common confusion: APA style vs. general writing quality—adopting APA style means choosing a format appropriate to the task, not abandoning good writing principles.
  • Reflects scientific values: many seemingly arbitrary rules actually promote objectivity, collaboration, tentative conclusions, and unbiased language.

📚 What APA style is and why it exists

📖 Definition and origin

APA style: a set of guidelines for writing in psychology and related fields, set down in the Publication Manual of the American Psychological Association.

  • Originated in 1929 as a short journal article providing basic manuscript standards.
  • Now in its sixth edition, nearly 300 pages long.
  • Primary purpose: facilitate scientific communication by promoting clarity and standardizing organization and content.

🎯 Why standardization matters

  • Easier to write: you know what information to present, in what order, and in what style.
  • Easier to read: research is presented in familiar and expected ways.
  • Science as collaboration: unless you make your research public, you are not really engaged in science—APA style supports this large-scale collaboration among researchers distributed across space and time.

🔀 APA style as a genre

  • APA style is a genre appropriate for presenting psychological research in academic and professional contexts.
  • It is not synonymous with "good writing" in general.
  • Different writing tasks require different styles:
    • Literary analysis → MLA style
    • Newspaper article → AP style
    • Empirical research report → APA style
  • Part of being a good writer is adopting a style appropriate to the task at hand.

🏗️ The three levels of APA style

📋 Level 1: Overall organization of an article

Empirical research reports have several distinct sections that always appear in the same order:

| Section | Purpose |
| --- | --- |
| Title page | Presents the article title, author names, and affiliations |
| Abstract | Summarizes the research |
| Introduction | Describes previous research and the rationale for the current study |
| Method | Describes how the study was conducted |
| Results | Describes the results of the study |
| Discussion | Summarizes the study and discusses its implications |
| References | Lists the references cited throughout the article |

✍️ Level 2: High-level style (clear expression of ideas)

Covered in Chapter 3 "Writing Clearly and Concisely" of the Publication Manual.

🎩 Formal tone

  • APA-style writing is formal rather than informal.
  • Appropriate for communicating with professional colleagues (researchers and practitioners) who share an interest in the topic.
  • These colleagues are not necessarily similar to the writer or to each other (e.g., a graduate student in British Columbia might write for a young psychotherapist in Toronto and a respected professor in Tokyo).
  • Avoid: slang, contractions, pop culture references, humor, and other elements acceptable in informal writing or conversation.

🔍 Straightforward communication

  • Communicates ideas as simply and clearly as possible.
  • Puts the focus on the ideas themselves, not on how they are communicated.
  • Minimize: literary devices (metaphor, imagery, irony, suspense), humor.
  • Use: short, direct sentences.
  • Technical terms: use them to improve communication, not to sound more "scientific."
  • Example: write "participants immersed their hands in a bucket of ice water" rather than "were subjected to a pain-inducement apparatus."
  • Don't confuse: using technical terms appropriately (e.g., "between-subjects design") is better than avoiding them when they communicate clearly.

🔧 Level 3: Low-level style (specific formatting rules)

Covered in Chapters 4–7 of the Publication Manual.

  • Includes all specific guidelines for spelling, grammar, references and citations, numbers and statistics, figures and tables, etc.
  • So many guidelines that even experienced professionals need to consult the Publication Manual regularly.

⚠️ Top 10 common APA style errors

Based on analysis of manuscripts submitted to one professional journal over 6 years:

| Error type | Example |
| --- | --- |
| 1. Use of numbers | Failing to use numerals for 10 and above |
| 2. Hyphenation | Failing to hyphenate compound adjectives before a noun (e.g., "role-playing technique") |
| 3. Use of et al. | Failing to use it after a reference is cited for the first time |
| 4. Headings | Not capitalizing headings correctly |
| 5. Use of since | Using since to mean because |
| 6. Tables and figures | Not formatting them in APA style; repeating information already in the text |
| 7. Use of commas | Failing to use a comma before and or or in a series of three or more elements |
| 8. Use of abbreviations | Failing to spell out a term completely before introducing an abbreviation |
| 9. Spacing | Not consistently double-spacing between lines |
| 10. Use of "&" in references | Using & in the text or and in parentheses |

🔬 How APA style reflects scientific values

🧪 Features that promote scientific objectivity

Many features of APA style that seem arbitrary actually reflect psychologists' scientific values and assumptions:

| APA style feature | Scientific value or assumption it reflects |
| --- | --- |
| Very few direct quotations of other researchers | Phenomena and theories are objective and do not depend on the specific words a particular researcher used |
| Criticisms directed at work, not at researchers personally | The focus is on drawing general conclusions about the world, not on the personalities of particular researchers |
| Many references and citations | Scientific research is a large-scale collaboration among many researchers |
| Empirical reports organized with specific sections in a fixed order | There is an ideal approach to conducting empirical research (even if it is not always achieved) |
| Researchers "hedge" their conclusions (e.g., "The results suggest that…") | Scientific knowledge is tentative and always subject to revision based on new empirical results |

🌍 Avoiding biased language

Two reasons to avoid biased language:

  1. Avoid offending people interested in your work.
  2. Promote scientific objectivity and accuracy.

🛡️ General principles for avoiding bias

Principle 1: Be sensitive to labels

  • Avoid terms that are offensive or have negative connotations.
  • Avoid terms that identify people with a disorder or problem they happen to have.
  • Put the "person first."
  • Example: "people diagnosed with schizophrenia" is better than "schizophrenics."

Principle 2: Use more specific terms

  • More specific is better than more general.
  • Example: "Chinese Americans" is better than "Asian Americans" if everyone in the group is Chinese American.

Principle 3: Avoid objectifying participants

  • Acknowledge their active contribution to the research.
  • Example: "The students completed the questionnaire" is better than "The subjects were administered the questionnaire."
  • This also makes for clearer, more engaging writing.

📊 Examples of avoiding biased language

| Instead of… | Use… |
| --- | --- |
| man, men | men and women, people |
| firemen | firefighters |
| homosexuals, gays, bisexuals | lesbians, gay men, bisexual men, bisexual women |
| minority | a specific group label (e.g., African American) |
| neurotics | people scoring high in neuroticism |
| special children | children with learning disabilities |

🔄 Note on "subjects" vs. "participants"

  • Previous edition strongly discouraged subjects (except for nonhumans) and encouraged participants.
  • Current edition acknowledges subjects can still be appropriate in areas where it has traditionally been used (e.g., basic memory research).
  • Encourages use of more specific terms when possible: university students, children, respondents, etc.

📝 Example: "sexual orientation" vs. "sexual preference"

  • Use "sexual orientation" instead of "sexual preference."
  • Reason: people do not generally experience their orientation as a "preference," nor is it as easily changeable as this term suggests.
  • This is both for accuracy and to avoid offense.

📚 References and citations in APA style

📖 Why references are important

  • Science is a large-scale collaboration among researchers.
  • References to the work of other researchers are extremely important.
  • This importance is reflected in extensive and detailed rules for formatting and using them.

📄 The reference list

  • Appears at the end of an APA-style article or book chapter.
  • Contains references to all works cited in the text (and only the works cited in the text).
  • Begins on its own page with the heading "References," centered in upper and lower case.
  • References listed alphabetically by last name of first named author.
  • Everything is double-spaced.

📰 Formatting journal article references

Generic format:

Author, A. A., Author, B. B., & Author, C. C. (year). Title of article. Title of Journal, volume(issue), pp–pp. doi:xx.xxxxxxxxxx

Concrete example:

Adair, J. G., & Vohra, N. (2003). The explosion of knowledge, references, and citations: Psychology's unique response to a crisis. American Psychologist, 58(1), 15–23. doi:10.1037/0003-066X.58.1.15

🔍 Key features to notice

  • Hanging indent: first line not indented, all subsequent lines are.
  • Author order: appears as on the article, reflecting relative contributions.
  • Author names: last names and initials only, separated by commas with ampersand (&) before the last author (even with only two authors).
  • Article title: only first word capitalized (except proper nouns/adjectives or first word of subtitle).
  • Journal title: all important words capitalized.
  • Italicization: journal title and volume number italicized; issue number (in parentheses) is not.
  • DOI: digital object identifier provides permanent link to the article; include if available (found in electronic database records or on first page of published article).
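The assembly rules above are mechanical enough to sketch in code. This is a simplified illustration, not an official APA tool: the function name and interface are invented, and plain text cannot show the italics that a real reference would apply to the journal title and volume number.

```python
def format_journal_reference(authors, year, title, journal, volume, issue, pages, doi=None):
    """Assemble an APA-style journal reference string (simplified sketch).

    authors: list of (last_name, initials) tuples, e.g. ("Adair", "J. G.").
    Italics (journal title, volume) cannot be represented in plain text.
    """
    names = [f"{last}, {initials}" for last, initials in authors]
    if len(names) > 1:
        # Ampersand before the last author, even with only two authors.
        author_str = ", ".join(names[:-1]) + ", & " + names[-1]
    else:
        author_str = names[0]
    ref = f"{author_str} ({year}). {title}. {journal}, {volume}({issue}), {pages}."
    if doi:
        ref += f" doi:{doi}"
    return ref
```

Called with the Adair and Vohra details from the concrete example, this reproduces the reference string shown above (minus italics).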

📕 Formatting book references

Generic format:

Author, A. A. (year). Title of book. Location: Publisher.

Concrete example:

Kashdan, T., & Biswas-Diener, R. (2014). The upside of your dark side. New York, NY: Hudson Street Press.

📗 Formatting book chapter references

Generic format:

Author, A. A., Author, B. B., & Author, C. C. (year). Title of chapter. In A. A. Editor, B. B. Editor, & C. C. Editor (Eds.), Title of book (pp. xxx–xxx). Location: Publisher.

Concrete example:

Lilienfeld, S. O., & Lynn, S. J. (2003). Dissociative identity disorder: Multiple personalities, multiple controversies. In S. O. Lilienfeld, S. J. Lynn, & J. M. Lohr (Eds.), Science and pseudoscience in clinical psychology (pp. 109–142). New York, NY: Guilford Press.

🔍 Key differences from journal articles

  • Editor names: first and middle initials followed by last names (not reversed), with "Eds." (or "Ed." for one) in parentheses after final editor's name.
  • Book title: only first word capitalized (with exceptions noted for article titles); entire title italicized.
  • Chapter page numbers: appear in parentheses after book title with abbreviation "pp."
  • Ending: location of publication and publisher, separated by a colon.

💬 Reference citations in the text

🎯 What must be cited

  • Phenomena discovered by other researchers.
  • Theories they have developed.
  • Hypotheses they have derived.
  • Specific methods they have used (e.g., specific questionnaires or stimulus materials).
  • Factual information that is not common knowledge (so others can check it).

🚫 What does not need citations

  • Widely shared methodological and statistical concepts (e.g., between-subjects design, t test).
  • Statements so broad they would be difficult to argue with (e.g., "Working memory plays a role in many daily activities").
  • Warning: "common knowledge" about human behavior is often incorrect—when in doubt, find a reference or remove the assertion.

✍️ Two ways to cite in text

🔤 Method 1: Authors' names in the sentence

  • Use authors' last names (no first names or initials) followed immediately by year in parentheses.

Examples:

  • "Burger (2008) conducted a replication of Milgram's (1963) original obedience study."
  • "Although many people believe that women are more talkative than men, Mehl, Vazire, Ramirez-Esparza, Slatcher, and Pennebaker (2007) found essentially no difference in the number of words spoken by male and female college students."

Things to notice:

  • Authors' names are treated grammatically as names of people, not as things ("a replication of Milgram's (1963) study" is better than "a replication of Milgram (1963)").
  • Two authors: names not separated by commas.
  • Three or more authors: names separated by commas.
  • Use the word and (not ampersand) to join authors' names.
  • Year follows immediately after final author's name.
  • Year only needs to be included the first time a work is cited in the same paragraph.

📌 Method 2: Parenthetical citation

  • Include authors' last names and year in parentheses following the idea being credited.

Examples:

  • "People can be surprisingly obedient to authority figures (Burger, 2008; Milgram, 1963)."
  • "Recent evidence suggests that men and women are similarly talkative (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007)."

Things to notice:

  • Often placed at end of sentence to minimize disruption.
  • Always includes the year, even when citation is given multiple times in the same paragraph.
  • Multiple citations in same parentheses: organized alphabetically by first author's name, separated by semicolons.
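The alphabetize-and-semicolon rule for multiple citations in one set of parentheses is simple enough to express as a sketch (the function name and interface are my own, not from any citation library):

```python
def parenthetical(citations):
    """Build a parenthetical citation from (first_author, year) pairs.

    Citations are ordered alphabetically by first author's name and
    separated by semicolons, as described above.
    """
    ordered = sorted(citations, key=lambda c: c[0])
    return "(" + "; ".join(f"{author}, {year}" for author, year in ordered) + ")"

# Reproduces the Burger/Milgram example from the text:
print(parenthetical([("Milgram", 1963), ("Burger", 2008)]))
# → (Burger, 2008; Milgram, 1963)
```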

🔀 Choosing between the two styles

  • No strict rules; most articles contain a mixture.
  • Method 1 works well when:
    • You want to emphasize the person who conducted the research (e.g., comparing theories of two prominent researchers).
    • You are describing a particular study in detail.
  • Method 2 works well when:
    • You are discussing a general idea.
    • You want to include multiple citations for the same idea.

🔤 Using "et al." correctly

Et al.: abbreviation for the Latin term et alia, meaning "and others."

Rules for using et al.:

  • More than two but fewer than six authors: include all names when first cited; after that, use first author's name followed by "et al."
  • Only two authors: include both names in every citation.
  • Six or more authors: use first author's name followed by "et al." every time (even the first time).

Examples:

  • "Recall that Mehl et al. (2007) found that women and men spoke about the same number of words per day on average."
  • "There is a strong positive correlation between the number of daily hassles and the number of symptoms people experience (Kanner et al., 1981)."

Formatting notes:

  • No comma between first author's name and "et al."
  • No period after "et" (it is a complete word).
  • Period after "al." (it is an abbreviation for alia).
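Taken together, the "et al." rules above amount to a small decision procedure. Here is a sketch, with an invented function name and the conjunction handling simplified (real in-text citations also switch between "and" and "&" depending on whether the names are inside parentheses):

```python
def cite_authors(last_names, first_citation):
    """Return the author portion of an in-text citation per the et al. rules above."""
    n = len(last_names)
    if n <= 2:
        # One or two authors: always name everyone, in every citation.
        return " and ".join(last_names)
    if n >= 6 or not first_citation:
        # Six or more authors, or any citation after the first:
        # first author's name followed by "et al." (no comma before it).
        return f"{last_names[0]} et al."
    # Three to five authors, cited for the first time: list all names.
    return ", ".join(last_names[:-1]) + ", and " + last_names[-1]
```

For example, `cite_authors(["Mehl", "Vazire", "Ramirez-Esparza", "Slatcher", "Pennebaker"], first_citation=False)` yields "Mehl et al.", matching the example above.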

49. Writing a Research Report in American Psychological Association (APA) Style

🧭 Overview

🧠 One-sentence thesis

An APA-style empirical research report follows a standardized structure—title page, abstract, introduction, method, results, discussion, and references—designed to present new research findings clearly and enable replication by other researchers.

📌 Key points (3–5)

  • Standard structure: empirical reports include title page, abstract, introduction (with opening, literature review, closing), method, results, discussion, and references.
  • Introduction logic: the literature review must build an argument for why the research question is worth addressing, not just list past studies.
  • Method clarity: the method section must be detailed enough that other researchers could replicate the study by following the description.
  • Common confusion: the design describes the overall structure (independent/dependent variables, manipulations), while the procedure describes what participants actually did step-by-step.
  • Discussion balance: acknowledge limitations honestly but don't overdo it—focus on two or three that could have influenced results, not routine issues.

📄 Title page and abstract

📄 Title page elements

  • The title is centered in the upper half, with important words capitalized.
  • Should clearly and concisely communicate primary variables and research questions in about 12 words or fewer.
  • Sometimes requires a main title and subtitle separated by a colon.
  • Below the title: authors' names in order reflecting contribution, then institutional affiliation.
  • For publication submissions: include an author note with full affiliations, acknowledgments, and contact information.

📝 Abstract structure

The abstract is a summary of the study, usually limited to about 200 words.

  • Appears on the second page with the heading "Abstract."
  • First line is not indented.
  • Must present: the research question, a summary of the method, the basic results, and the most important conclusions.
  • Note: Because of the word limit, writing a good abstract is challenging but essential.

🎯 Introduction architecture

🎯 The opening (1–2 paragraphs)

  • Introduces the research question and explains why it is interesting.
  • Researcher Daryl Bem recommends starting with general observations about the topic in ordinary language (not technical jargon).
  • Should be about people and their behavior, not about researchers or their research.
  • Concrete examples are very useful.
  • After capturing attention, explain why the research question matters: Will it fill a gap? Test a theory? Have practical implications?

Don't confuse: A poor opening talks about theories or past research first; a good opening starts with relatable observations about human behavior.

📚 The literature review

  • Describes relevant previous research but is not simply a list of past studies.
  • Must constitute an argument for why the research question is worth addressing.
  • By the end, readers should be convinced the research question makes sense and the present study is a logical next step.

Structure strategies:

  • Describe a phenomenon + studies demonstrating it → competing theories → hypothesis to test theories.
  • Describe one phenomenon → describe an inconsistent phenomenon → propose a theory resolving the inconsistency → hypothesis to test the theory.
  • In applied research: describe phenomenon/theory → how it applies to real-world situation → suggest a test.

Writing tips:

  • Start with an outline of main points in the order you want to make them.
  • Begin the literature review by summarizing your argument before making it.
  • Open each paragraph with a sentence that summarizes the main point and links to preceding points (these provide transitions).
  • Your goal is to argue why the question is interesting, not necessarily why your favorite answer is correct—the review must be balanced.
  • Discuss contradictory evidence; ignoring it is not acceptable.
  • It is acceptable to argue that the balance of research supports a phenomenon, but not to ignore inconsistencies.

🎬 The closing (final 1–2 paragraphs)

Two important elements:

  1. A clear, formal statement of the main research question and hypothesis (often in terms of operational definitions).
  2. A brief overview of the method and comment on its appropriateness.

Example: The excerpt shows how Darley and Latané (1968) concluded their introduction by stating their hypothesis about bystanders, then explaining what conditions their experiment needed to fulfill.

🔬 Method section

🔬 Core principle

The method section should be clear and detailed enough that other researchers could replicate the study by following your "recipe."

  • Must describe all important elements: participant demographics, recruitment, random assignment, variable manipulation/measurement, counterbalancing, etc.
  • Avoid irrelevant details (e.g., specific classroom number, pencil type).
  • Begins immediately after introduction with heading "Method" (not "Methods") centered.

👥 Participants subsection

  • First subsection, left justified and italicized.
  • Indicates how many participants there were, the number of women and men, some indication of their age, other relevant demographics, how they were recruited, and any incentives.

🏗️ Three organizational approaches

  • Simple (Participants → Design and procedure): methods are relatively simple, describable in a few paragraphs.
  • Typical (Participants → Design → Procedure): both design and procedure are complicated, each requiring multiple paragraphs.
  • Complex (Participants → Materials → Design → Procedure): there are complicated materials to describe (questionnaires, stimuli, etc.).

🔍 Design vs. procedure distinction

Design = the overall structure:

  • What were the independent and dependent variables?
  • Was the independent variable manipulated between or within subjects?
  • How were variables operationally defined?

Procedure = how the study was carried out:

  • Often works well to describe in terms of what participants did rather than what researchers did.
  • Example: participants gave informed consent, read instructions, completed practice trials, completed test trials, completed questionnaires, were debriefed and excused.

Don't confuse: Design is the blueprint; procedure is the step-by-step execution.

🧰 Materials subsection

  • Good place to describe complicated materials: multiple questionnaires, written vignettes, perceptual stimuli, etc.
  • Heading can be modified to reflect content: "Questionnaires," "Stimuli," etc.
  • Also where you present reliability and/or validity of measures (test-retest correlations, Cronbach's α, etc.).

📊 Results section

📊 What to include

  • Present main results of the study, including statistical analyses.
  • Does not include raw data (individual responses/scores), but researchers should save them and make them available upon request.
  • Many journals now encourage or require open sharing of raw data and materials online.

🔢 Organization and preliminary issues

Preliminary issues to address:

  1. Whether any participants or responses were excluded and why (rationale should be clear).
  2. How multiple responses were combined to produce primary variables (e.g., mean ratings, percentage correct, number correct minus incorrect).
  3. Whether the manipulation was successful (report manipulation check results).

Tackling primary research questions:

  • Answer one at a time with clear organization.
  • Approach options: most general to specific, or main question first then secondary ones.

📝 Structure for each result (Bem's recommendation)

  1. Remind the reader of the research question.
  2. Give the answer to the research question in words.
  3. Present the relevant statistics.
  4. Qualify the answer if necessary.
  5. Summarize the result.

Key insight: Only step 3 involves numbers. The basic results should be clear even to a reader who skips over the numbers.

💬 Discussion section

💬 Typical elements

The discussion usually includes some combination of:

  • Summary of the research
  • Theoretical implications
  • Practical implications
  • Limitations
  • Suggestions for future research

📋 Summary and implications

  • Typically begins with a summary providing a clear answer to the research question.
  • Short report with single study: might require only a sentence.
  • Longer report with multiple studies: might require a paragraph or two.
  • Followed by theoretical implications: Do results support existing theories? If not, how can they be explained?
  • You don't need a definitive explanation, but outline one or more possible explanations.
  • In applied research: discuss practical implications—how can results be used, by whom, to accomplish real-world goals?

⚠️ Limitations discussion

  • Discuss problems with internal or external validity, manipulation effectiveness, measure reliability, participant understanding, or suspicion.
  • Don't overdo it: All studies have limitations; readers understand different samples or measures might produce different results.
  • Pick two or three limitations that seem like they could have influenced results.
  • Explain how they could have influenced results and suggest ways to deal with them.
  • Avoid mentioning routine issues unless there's good reason to think they would have changed results.

🔮 Future research suggestions

  • Not just a list of new questions.
  • Discuss two or three of the most important unresolved issues.
  • Identify and clarify each question, suggest alternative answers, and suggest ways they could be studied.

🎬 Ending strategies

  • Some researchers end with a sweeping or thought-provoking conclusion.
  • Example: Darley and Latané ended by discussing how understanding situational forces might help people overcome hesitation to intervene.
  • Caution: This can be difficult to pull off; may sound overreaching or banal.
  • Often better to simply return to the problem introduced in the opening paragraph and clearly state how your research addressed it.

📚 References and supplemental materials

📚 References section

  • Begins on a new page with "References" centered at the top.
  • All references cited in text are listed alphabetically by first author's last name.
  • If same first author: alphabetically by second author's last name.
  • If all authors the same: chronologically by year of publication.
  • Everything is double-spaced both within and between references.

📎 Appendices

An appendix is appropriate for supplemental material that would interrupt the flow of the research report if presented within any major section.

Appropriate content:

  • Lists of stimulus words
  • Questionnaire items
  • Detailed descriptions of special equipment or unusual statistical analyses
  • References to studies included in a meta-analysis

Formatting:

  • Each begins on a new page.
  • If only one: heading is "Appendix" centered at top.
  • If more than one: "Appendix A," "Appendix B," etc., in order first mentioned in text.

📊 Tables and figures

  • Both used to present results.
  • Figures can also display graphs, illustrate theories (flowcharts), display stimuli, outline procedures, etc.
  • Each appears on its own page after any appendices.
  • Tables numbered in order first mentioned ("Table 1," "Table 2," etc.).
  • Figures numbered the same way ("Figure 1," "Figure 2," etc.).
  • Tables: brief explanatory title with important words capitalized appears above.
  • Figures: brief explanatory caption where only first word of each sentence is capitalized (aside from proper nouns).

50. Other Presentation Formats in Psychology Research

🧭 Overview

🧠 One-sentence thesis

Researchers in psychology present their work through multiple formats beyond journal manuscripts—including review articles, conference talks, and posters—each with distinct structural and stylistic conventions suited to different communication contexts.

📌 Key points (3–5)

  • Beyond empirical reports: Review/theoretical articles, final manuscripts, and conference presentations are all valid APA-style formats with different purposes.
  • Conference presentations come in two forms: oral presentations (talks with slides, 10–20 minutes) and posters (visual displays during interactive sessions).
  • Format adapts to purpose: Final manuscripts (dissertations, theses) may deviate from strict APA style for readability; conference formats prioritize visual clarity and interaction.
  • Common confusion: Copy manuscripts (for journal submission) vs. final manuscripts (dissertations/theses)—final manuscripts may place tables/figures inline for easier reading rather than at the end.
  • Posters encourage interaction: Unlike talks, poster sessions create opportunities for direct conversation between researchers and visitors.

📝 Manuscript variations

📝 Review and theoretical articles

Review articles: summarize research on a particular topic without presenting new empirical results. Theoretical articles: review articles that present a new theory.

  • Structure mirrors empirical reports: title page, abstract, references, appendices, tables, figures.
  • Key difference: no method or results section (because no new empirical data).
  • Body organization:
    • Opening: identifies topic and explains importance
    • Literature review: organizes previous research, identifies relationships or gaps
    • Closing: summarizes conclusions, suggests future directions or discusses implications
  • Sections and headings vary by article (unlike the fixed structure of empirical reports).
  • In theoretical articles, much of the body presents the new theory itself.

📄 Final manuscripts

Final manuscripts: prepared by the author in final form with no intention of submitting for publication elsewhere (e.g., dissertations, theses, student papers).

  • May differ from strict APA style to improve readability.
  • Example difference: tables and figures placed near discussion points instead of at the manuscript's end.
  • Don't confuse: Dissertations/theses may not adhere strictly to APA formatting, even though they use APA style principles.
  • For student papers: always check instructor requirements—research methods courses usually require submission-ready manuscript format.

🎤 Oral presentations at conferences

🎤 Structure and timing

  • Duration: 10–20 minutes, with last few minutes for audience questions.
  • At larger conferences: grouped into hour-or-two sessions on the same general topic.
  • Presenters submit abstracts in advance; peer review is less rigorous than journal submission.

🖼️ Slide design principles

  • Slide count: no more than one slide per minute.
  • Structure: mirrors an APA research report (title/authors → background → method → results → conclusions).
  • Content: main points in bulleted lists or simple tables/figures.
  • Role: slides are visual aids, not the focus; the presenter speaks to the audience.

🗣️ Presentation style

  • Look at audience members, not just slides.
  • Conversational tone: less formal than APA writing, more formal than casual conversation.
  • Slides support the talk; they don't replace the speaker.

🖼️ Poster presentations

🖼️ Format and setting

  • Presented during one- to two-hour poster sessions in large conference rooms.
  • Presenters stand near posters on bulletin boards; visitors circulate, read, and discuss.
  • Typical size: approximately four feet wide by three feet high.
  • Increasingly popular format—one recent APA conference featured nearly 2,000 posters across 16 sessions.

📐 Content organization

Standard sections (similar to research reports):

  • Title, author names and affiliations
  • Introduction
  • Method
  • Results
  • Discussion or conclusions
  • References
  • Acknowledgments
  • Abstract may be omitted (the poster itself is already a summary)

🎨 Design for clarity

Font sizes (for crowded, noisy environments):

  • Title and authors: ~72 points
  • Main text: ~28 points

Layout principles:

  • Organize into sections with clear headings
  • Text in sentences or bulleted points, not paragraphs
  • Column layout (top-to-bottom flow) preferred over row layout—allows multiple readers simultaneously without crowding
  • Figures can be more colorful than in manuscripts
  • May include visual stimuli photos, apparatus images, or participant simulation examples
  • Decorative elements acceptable but avoid overdoing

🤝 Interactive purpose

  • Posters facilitate researcher interaction—a primary advantage.
  • Presenters should:
    • Stand by their poster
    • Greet visitors and offer to describe research
    • Use poster as visual aid during explanations
    • Be prepared for questions and critical comments
  • Good practice: have detailed write-ups available, offer to send more information, or provide contact details for follow-up.

🌐 Conference landscape

🌐 Types of conferences

Professional conferences: where researchers share work (distinct from clinical-practice conferences).

  • Range: small-scale (dozen researchers, one afternoon) to large-scale (thousands of researchers, several days)
  • Formal presentations: talks and posters (plus informal discussions)

🌐 Presentation acceptance

  • Requires submitting an abstract in advance
  • Acceptance process: peer review, but typically less rigorous than journal manuscript review

51. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

APA style is a three-level system of guidelines for writing psychology research that governs article organization, high-level formal writing, and low-level formatting rules, enabling researchers to communicate findings through empirical reports, reviews, and conference presentations.

📌 Key points (3–5)

  • Three levels of APA style: organization of research articles, high-level formal/straightforward writing style, and low-level specific rules (grammar, spelling, references).
  • Core sections of an empirical report: abstract, introduction (opening + literature review + closing), method, results, discussion, and references—each with specific functions.
  • Introduction structure: opens with the research question, reviews previous research to argue why the current study is worth doing, and closes by restating the question and commenting on method.
  • Multiple presentation formats: APA-style empirical reports, theoretical/review articles, final manuscripts (dissertations, theses, student papers), and conference talks/posters (less detailed, designed for interaction).
  • Common confusion: conference presentations follow some APA guidelines but are considerably less detailed than full research reports; their function is to present new research and facilitate interaction, not exhaustive documentation.

📐 The three levels of APA style

📐 Organizational level

  • What it governs: the structure of a research article.
  • The excerpt identifies several standard sections: abstract, introduction, method, results, discussion, and references.
  • Each section has a defined role in the overall argument and documentation of the study.

✍️ High-level style

  • What it means: writing in a formal and straightforward way.
  • This level concerns tone, clarity, and directness—not specific formatting rules.
  • The goal is to communicate research findings clearly to other researchers and practitioners.

🔍 Low-level style

  • What it includes: many specific rules of grammar, spelling, and formatting of references.
  • These are the detailed mechanics that ensure consistency across psychology publications.
  • References and reference citations have specific formatting and citation rules.

📄 Structure of an APA-style empirical research report

📄 Standard sections

The excerpt lists the main sections in order:

  • Abstract: brief summary of the entire study.
  • Introduction: present the research question, review the literature, justify the study.
  • Method: describe the procedure in enough detail for replication.
  • Results: report findings in an organized fashion with statistics and explanations.
  • Discussion: summarize, discuss implications and limitations, suggest future research.
  • References: list all sources cited.

📖 Introduction components

The introduction has three parts:

  1. Opening: presents the research question.
  2. Literature review: describes previous research on the topic; constitutes an argument for why the current study is worth doing.
  3. Closing: restates the research question and comments on the method.

Don't confuse: the literature review is not just a summary of past work—it builds an argument for the value of the current study.

🔬 Method section requirements

The method section describes the method in enough detail that another researcher could replicate the study.

  • Minimum subsections: participants subsection and design and procedure subsection.
  • The standard is replicability: another researcher should be able to reproduce the study from this description.

📊 Results section approach

  • Organization: results are described in an organized fashion.
  • Dual presentation: each primary result is presented in terms of statistical results but also explained in words.
  • The excerpt emphasizes that numbers alone are not sufficient; verbal explanation is required.

💬 Discussion section elements

The discussion typically includes:

  • Summary of the study
  • Theoretical implications
  • Practical implications
  • Limitations of the study
  • Suggestions for further research

🎤 Other presentation formats

🎤 Beyond empirical reports

The excerpt identifies several formats for presenting psychology research:

  • Theoretical and review articles: not empirical reports but still follow APA conventions.
  • Final manuscripts: dissertations, theses, and student papers.
  • Conference presentations: talks and posters at professional conferences.

🎨 Conference talks and posters

  • Relationship to APA style: follow some APA style guidelines but are considerably less detailed than APA-style research reports.
  • Primary function: to present new research to interested researchers and facilitate further interaction among researchers.
  • Key difference: not exhaustive documentation; designed for engagement and discussion rather than complete replication information.

Example: A poster might show the research question, key method details, main results, and conclusions, but omit the full literature review and detailed procedure that would appear in a journal article.

Don't confuse: conference presentations are not "incomplete" research reports—they serve a different purpose (interaction and dissemination) and appropriately use a different level of detail.

🔗 References and citations

🔗 Importance in APA style

References and reference citations are an important part of APA style.

  • The excerpt emphasizes that there are specific rules for formatting references and for citing them in the text of an article.
  • This is part of the low-level style that ensures consistency and allows readers to locate sources.

🔗 Two components

  1. Reference citations: how sources are mentioned in the text of the article.
  2. References: the full list of sources at the end, with specific formatting requirements.

🛠️ Practice exercises

The excerpt provides several practice activities:

🛠️ Comparison exercise

  • Find a research description in popular media (magazine, newspaper, blog, website).
  • Identify five specific differences between that description and how it would be written in APA style.
  • Purpose: understand the distinctive features of formal academic writing versus general-audience writing.

🛠️ Error correction

  • Find and correct errors in fictional APA-style references and citations.
  • Purpose: learn the specific formatting rules through active correction.

🛠️ Evaluation exercises

  • Rate the effectiveness of article openings in a professional journal.
  • Identify where introduction components (opening, literature review, closing) begin and end.
  • Highlight discussion elements (summary, implications, limitations, suggestions) in different colors.
  • Purpose: develop skill in recognizing and evaluating the standard components of APA-style writing.

🛠️ Poster analysis

  • Find examples of conference posters online (search terms: "psychology" and "poster").
  • Identify main strengths and weaknesses based on chapter information.
  • Purpose: apply knowledge of presentation formats to real examples and develop critical evaluation skills.

52. Describing Single Variables

🧭 Overview

🧠 One-sentence thesis

Describing single variables through distributions, central tendency, and variability allows researchers to summarize and understand the characteristics of their data before examining relationships between variables.

📌 Key points (3–5)

  • What a distribution shows: how scores are spread across the levels of a variable, displayed through frequency tables and histograms.
  • Central tendency measures: mean, median, and mode each describe the "middle" of a distribution in different ways.
  • Variability measures: range and standard deviation quantify how spread out scores are around the center.
  • Common confusion: mean vs. median—the mean can be misleading in skewed distributions because outliers pull it toward the tail, while the median stays at the true middle.
  • Location within a distribution: percentile ranks and z scores describe where an individual score sits relative to the rest of the data.

📊 Understanding distributions

📊 What a distribution is

Distribution: the way scores are distributed across the levels of a variable.

  • It shows the pattern of how often each value occurs in your data.
  • Example: in a sample of 100 students, the distribution of "number of siblings" might show 10 with no siblings, 30 with one sibling, 40 with two siblings, etc.
  • Distributions apply to both quantitative variables (like test scores) and categorical variables (like sex).

📋 Frequency tables

Frequency table: a table listing each value of a variable and how often it occurs.

  • The first column lists values (usually highest to lowest).
  • The second column shows the frequency (count) of each value.
  • Benefits: quickly see the range, most/least common scores, and any extreme outliers.
  • Grouped frequency tables: when scores span a wide range, group them into equal-width ranges (usually 5–15 ranges).
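
The frequency-table layout described above can be sketched with Python's `collections.Counter`. The sibling counts of 10, 30, and 40 come from the excerpt's example; the 20 students with three siblings are assumed here just to fill out the sample of 100.

```python
from collections import Counter

# Sibling counts for a sample of 100 students: 10 with none, 30 with one,
# 40 with two (from the excerpt); 20 with three (assumed for illustration).
siblings = [0] * 10 + [1] * 30 + [2] * 40 + [3] * 20

def frequency_table(scores):
    """Return (value, frequency) pairs, highest value first,
    mirroring the frequency-table layout described above."""
    return sorted(Counter(scores).items(), reverse=True)

for value, freq in frequency_table(siblings):
    print(value, freq)
```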

📈 Histograms

Histogram: a graphical display of a distribution using vertical bars.

  • The x-axis shows the variable levels; the y-axis shows frequency.
  • Each bar's height represents how many individuals have that score.
  • For quantitative variables, bars typically touch; for categorical variables, small gaps separate them.
  • Advantage: even quicker to grasp patterns than a frequency table.

🔍 Distribution shapes and patterns

🔍 Peaks: unimodal vs. bimodal

  • Unimodal: one distinct peak near the middle, with tails tapering in both directions (most common).
  • Bimodal: two distinct peaks, indicating two clusters in the data.
  • Example: a bimodal distribution on the Beck Depression Inventory might show one peak for non-depressed individuals and another for clinically depressed individuals.

⚖️ Symmetry and skew

| Shape | Description | Visual pattern |
| --- | --- | --- |
| Symmetrical | Left and right halves are mirror images | Peak in center, equal tails |
| Negatively skewed | Peak shifted toward upper end | Long tail stretches toward lower scores |
| Positively skewed | Peak shifted toward lower end | Long tail stretches toward higher scores |

⚠️ Outliers

Outlier: an extreme score much higher or lower than the rest of the distribution.

  • May represent truly extreme cases (e.g., one clinically depressed person in a happy sample).
  • May also represent errors, misunderstandings, or equipment malfunctions.
  • Don't confuse: an outlier is not just "different"—it's dramatically separated from the cluster of other scores.

📍 Measures of central tendency

📍 What central tendency means

Central tendency: the middle point around which scores in a distribution cluster (also called the average).

  • It summarizes "where the data is centered."
  • Three common measures: mean, median, and mode—each useful in different situations.

🧮 Mean

Mean (M): the sum of all scores divided by the number of scores.

  • Formula in words: add up all the scores, then divide by how many scores there are.
  • Most common measure because it's easy to understand and has useful statistical properties.
  • Weakness: sensitive to outliers and skew—one extreme score can pull the mean far from the typical value.
  • Example: reaction times of 200, 250, 280, 250 ms have a mean of 245 ms, but adding one score of 5,000 ms raises the mean to 1,196 ms—no longer representative.

🎯 Median

Median: the middle score, where half the scores are lower and half are higher.

  • To find it: arrange scores from lowest to highest and pick the middle one.
  • With an even number of scores, take the value halfway between the two middle scores.
  • Strength: not affected by outliers or skew—stays at the true middle.
  • Preferred for highly skewed distributions (like reaction times).

🏔️ Mode

Mode: the most frequent score in a distribution.

  • Simply the value that appears most often.
  • The only measure of central tendency that works for categorical variables.
  • Example: if more students scored 22 on a self-esteem scale than any other value, 22 is the mode.

🔄 How they relate in different shapes

  • Symmetrical and unimodal: mean, median, and mode are very close together at the peak.
  • Bimodal: mean and median fall between the two peaks; mode is at the tallest peak.
  • Skewed: mean is pulled toward the longer tail; median stays closer to the peak; mode is at the peak.
  • You don't have to choose just one—each provides slightly different information.
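
A quick sketch with Python's `statistics` module reproduces the reaction-time example above and shows why the median resists the outlier that distorts the mean:

```python
import statistics

times = [200, 250, 280, 250]           # reaction times in ms (from the excerpt)
print(statistics.mean(times))           # 245 -- sum divided by the number of scores
print(statistics.median(times))         # 250 -- value halfway between the two middle scores
print(statistics.mode(times))           # 250 -- the most frequent score

times_with_outlier = times + [5000]     # one distracted participant
print(statistics.mean(times_with_outlier))    # 1196 -- pulled far toward the outlier
print(statistics.median(times_with_outlier))  # 250  -- unaffected
```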

📏 Measures of variability

📏 What variability means

Variability: the extent to which scores vary around their central tendency.

  • Two distributions can have the same mean but very different spreads.
  • Low variability: scores cluster tightly around the center.
  • High variability: scores spread across a much wider range.

📐 Range

Range: the difference between the highest and lowest scores.

  • Formula: highest score minus lowest score.
  • Easy to compute and understand.
  • Weakness: misleading when outliers are present—one extreme score can inflate the range dramatically.
  • Example: exam scores from 90 to 100 have a range of 10, but one score of 20 would increase the range to 80.

📊 Standard deviation

Standard deviation (SD): the average distance between the scores and the mean.

  • Most common measure of variability.
  • Tells you how much scores differ from the mean "on average."
  • How to compute: find each score's distance from the mean, square those distances, find the mean of the squared distances (called variance), then take the square root.
  • Always positive (because distances are squared).
  • Example: a standard deviation of 1.69 means scores differ from the mean by about 1.69 units on average.

🔢 Variance

Variance (SD²): the mean of the squared differences from the mean.

  • Itself a measure of variability, but mainly used in inferential statistics.
  • The standard deviation is simply the square root of the variance.

⚙️ N vs. N−1 in calculations

  • Dividing by N: appropriate when simply describing variability in your sample.
  • Dividing by N−1: corrects for the tendency of sample standard deviation to underestimate the population standard deviation; used by most calculators and software.
  • Researchers typically use N−1 because they want to draw conclusions about the larger population.
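
The N vs. N−1 distinction maps directly onto Python's `statistics` module: `pvariance` and `pstdev` divide by N (describing the sample), while `stdev` divides by N−1 (estimating the population). The scores below are illustrative, not from the excerpt.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative scores (not from the excerpt)

# Describing the sample itself: divide by N
print(statistics.pvariance(scores))  # 4 -- mean of the squared distances from the mean
print(statistics.pstdev(scores))     # 2.0 -- square root of the variance

# Estimating the population: divide by N - 1 (what most software does)
print(round(statistics.stdev(scores), 2))  # 2.14 -- slightly larger, correcting the underestimate
```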

🎯 Locating individual scores

🎯 Percentile rank

Percentile rank: the percentage of scores in the distribution that are lower than a given score.

  • To find it: count how many scores are lower, divide by the total number of scores, multiply by 100.
  • Example: if 32 of 40 scores are lower than 23, a score of 23 has a percentile rank of 80 (the 80th percentile).
  • Commonly used to report standardized test results.
  • A percentile rank of 40 means you scored higher than 40% of people who took the test.

🎯 Z score

Z score: the difference between an individual's score and the mean, divided by the standard deviation.

  • Formula in words: subtract the mean from the score, then divide by the standard deviation.
  • Expresses "how many standard deviations above or below the mean" a score is.
  • Example: in an IQ distribution with mean 100 and SD 15, a score of 110 has z = (110−100)/15 = +0.67 (about two-thirds of a standard deviation above the mean).
  • A score of 85 has z = (85−100)/15 = −1.00 (one standard deviation below the mean).
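
Both location measures are one-line formulas. A minimal sketch (the helper names are mine, not from the excerpt) using the excerpt's IQ numbers and its "32 of 40 scores are lower" example:

```python
def percentile_rank(scores, score):
    """Percentage of scores in the distribution lower than `score`."""
    lower = sum(1 for s in scores if s < score)
    return 100 * lower / len(scores)

def z_score(score, mean, sd):
    """How many standard deviations `score` lies above or below the mean."""
    return (score - mean) / sd

# IQ distribution from the excerpt: mean 100, SD 15
print(round(z_score(110, 100, 15), 2))       # 0.67
print(z_score(85, 100, 15))                  # -1.0

# 32 of 40 scores lower -> 80th percentile
print(percentile_rank(list(range(40)), 32))  # 80.0
```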

🔍 Why z scores matter

  • Provide a standardized way to describe location within any distribution.
  • Sometimes used to report standardized test results.
  • Help define outliers: scores with z < −3.00 or z > +3.00 (more than three standard deviations from the mean) are often considered outliers.
  • Play an important role in other statistical computations.

53. Describing Statistical Relationships

🧭 Overview

🧠 One-sentence thesis

Statistical relationships between variables can be quantified using effect-size measures like Cohen's d for group differences and Pearson's r for correlations, enabling researchers to communicate the strength of relationships in standardized units across different studies and measures.

📌 Key points (3–5)

  • Group differences: Described using means and standard deviations of each group; Cohen's d quantifies effect size as the difference between means in standard deviation units.
  • Correlation strength: Pearson's r measures linear relationships between quantitative variables, ranging from −1.00 to +1.00.
  • Interpretation guidelines: Small, medium, and large effect sizes have conventional thresholds (Cohen's d: 0.20, 0.50, 0.80; Pearson's r: ±0.10, ±0.30, ±0.50).
  • Common confusion: "Effect size" does not imply causation—the term applies to both experimental and correlational studies, but only experiments support causal claims.
  • Limitations to watch: Nonlinear relationships and restriction of range can make Pearson's r misleading; always examine scatterplots and consider the full population range.

📊 Describing group differences

📏 Means and standard deviations

  • Group or condition differences are typically reported as the mean and standard deviation for each group.
  • Example from the excerpt: In a phobia treatment study, the exposure condition had a mean fear rating of 3.47 (SD = 1.77), the education condition had a mean of 4.83 (SD = 1.52), and the control condition had a mean of 5.56 (SD = 1.21).
  • Bar graphs are commonly used to display group means visually.

📐 Cohen's d as effect size

Cohen's d: The difference between two group means divided by the standard deviation, expressed as d = (M₁ − M₂) / SD.

  • What it measures: How many standard deviations apart the two group means are.
  • Why it's useful: Cohen's d has the same meaning regardless of the variable or measurement scale—a d of 0.50 always means "half a standard deviation apart," whether measuring self-esteem scores, reaction times, or blood pressure.
  • Interpretation guidelines:
| Strength | Cohen's d |
| --- | --- |
| Small | 0.20 |
| Medium | 0.50 |
| Large | 0.80 |
  • Example: The phobia study showed d = 0.82 between exposure and education conditions, indicating a large effect.
  • Don't confuse: Cohen's d is conventionally reported as a positive value (subtract the smaller mean from the larger); a sign, when one is used, shows only which group scored higher, not the strength of the effect.
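
A sketch of the calculation. The excerpt's formula just says "divided by the standard deviation" without specifying which; pooling the two groups' SDs (an assumption, and one that presumes roughly equal group sizes) reproduces the d = 0.82 reported for the phobia study:

```python
from math import sqrt

def cohens_d(m1, sd1, m2, sd2):
    """Difference between two group means in standard-deviation units.
    Pooled SD of the two groups is an assumption; the excerpt only
    says 'divided by the standard deviation'."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return abs(m1 - m2) / pooled_sd

# Phobia study: exposure M = 3.47 (SD = 1.77), education M = 4.83 (SD = 1.52)
print(round(cohens_d(3.47, 1.77, 4.83, 1.52), 2))  # 0.82 -- the large effect reported
```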

⚠️ The "effect size" terminology trap

  • The term "effect size" can be misleading because it suggests causation.
  • In experiments (random assignment): A Cohen's d can support causal claims (e.g., "exercising caused an increase in happiness").
  • In correlational studies: The same d value only describes the magnitude of difference (e.g., "exercisers were happier than non-exercisers by this amount").
  • Simply calling it an "effect size" does not make the relationship causal.

📈 Describing correlations between quantitative variables

📉 Visual representations

  • Line graphs: Used when the x-axis variable has a small number of distinct values or categories.
    • Example: Response time across four quartiles of alphabetical name position.
  • Scatterplots: Used when the x-axis variable has many possible values.
    • Example: Self-esteem scores at two time points, with each point representing one individual.

🔗 Types of relationships

| Relationship type | Pattern | Example from excerpt |
| --- | --- | --- |
| Positive | Higher scores on one variable associate with higher scores on the other (lower left to upper right) | Self-esteem scores at Time 1 vs. Time 2 |
| Negative | Higher scores on one variable associate with lower scores on the other (upper left to lower right) | Alphabetical name position vs. response speed |
| Linear | Points fit well to a straight line | Both examples above |
| Nonlinear | Points fit better to a curved line | Sleep duration vs. depression (U-shaped curve) |

📊 Pearson's r

Pearson's r: A statistic measuring the strength of linear correlation between quantitative variables; the "mean cross-product of z scores."

  • Range: −1.00 (strongest negative) through 0 (no relationship) to +1.00 (strongest positive).
  • Sign meaning: The sign (+ or −) indicates direction only, not strength; r = +0.30 and r = −0.30 are equally strong.
  • Interpretation guidelines:
| Strength | Pearson's r |
| --- | --- |
| Small | ±0.10 |
| Medium | ±0.30 |
| Large | ±0.50 |

🧮 How Pearson's r is computed

  1. Convert all X scores to z-scores: subtract the mean of X, divide by SD of X.
  2. Convert all Y scores to z-scores: subtract the mean of Y, divide by SD of Y.
  3. For each individual, multiply their X z-score by their Y z-score (the "cross-product").
  4. Take the mean of all cross-products—that's Pearson's r.
  • Conceptually: Pearson's r tells you how consistently high scores on one variable pair with high (or low) scores on the other, in standardized units.
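
The four steps translate line for line into code. This sketch uses population SDs (divide by N), one common convention for the "mean cross-product of z scores" definition:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's r as the mean cross-product of z scores (steps 1-4 above)."""
    mx, sx = statistics.mean(xs), statistics.pstdev(xs)
    my, sy = statistics.mean(ys), statistics.pstdev(ys)
    zx = [(x - mx) / sx for x in xs]           # step 1: X z-scores
    zy = [(y - my) / sy for y in ys]           # step 2: Y z-scores
    cross = [a * b for a, b in zip(zx, zy)]    # step 3: cross-products
    return statistics.mean(cross)              # step 4: their mean is r

print(round(pearson_r([1, 2, 3], [2, 4, 6]), 2))  # 1.0  -- perfect positive
print(round(pearson_r([1, 2, 3], [6, 4, 2]), 2))  # -1.0 -- perfect negative
```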

⚠️ When Pearson's r can mislead

🔄 Nonlinear relationships

  • Pearson's r only measures linear relationships.
  • If the true relationship is curved (e.g., the U-shaped sleep-depression example), Pearson's r will be close to zero even though a strong relationship exists.
  • What to do: Always create a scatterplot first to confirm the relationship is approximately linear before relying on Pearson's r.

🔒 Restriction of range

Restriction of range: When one or both variables have a limited range in the sample compared to the full population.

  • Example from the excerpt: Age and hip-hop enjoyment show a strong negative correlation (r = −0.77) across all ages, but if you only sample 18- to 24-year-olds, the correlation appears to be zero.
  • Why it happens: Within a narrow slice of the population, there may be little variation to detect.
  • What to do:
    • Design studies to include a wide range of values on key variables.
    • Examine your data for possible restriction of range.
    • Interpret Pearson's r cautiously if range is restricted.
  • Don't confuse: A weak correlation in a restricted sample does not mean the variables are unrelated in the broader population.
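
Restriction of range can be demonstrated with made-up numbers echoing the age/hip-hop example (the excerpt's r = −.77 comes from actual data; these toy values merely reproduce the pattern):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's r as the mean cross-product of z scores (population SDs)."""
    mx, sx = statistics.mean(xs), statistics.pstdev(xs)
    my, sy = statistics.mean(ys), statistics.pstdev(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return statistics.mean(a * b for a, b in zip(zx, zy))

ages = [18, 20, 22, 24, 40, 50, 60, 70]
enjoyment = [10, 9, 9, 10, 6, 4, 3, 1]   # hypothetical hip-hop enjoyment ratings

print(round(pearson_r(ages, enjoyment), 2))          # -0.99 -- strong across all ages
print(round(pearson_r(ages[:4], enjoyment[:4]), 2))  # 0.0   -- vanishes among 18-24-year-olds
```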

🔬 Real-world application: Gender similarities

🧪 Sex differences expressed as Cohen's d

The excerpt describes researcher Janet Shibley Hyde's work translating sex-difference findings into Cohen's d values (positive = men score higher; negative = women score higher):

| Variable | Cohen's d | Interpretation |
| --- | --- | --- |
| Mathematical problem solving | +0.08 | Trivial |
| Reading comprehension | −0.09 | Trivial |
| Smiling | −0.40 | Small to medium |
| Aggression | +0.50 | Medium |
| Attitudes toward casual sex | +0.81 | Large |
| Leadership effectiveness | −0.02 | Trivial |

🎯 The gender similarities hypothesis

  • Although some variables show large differences, the vast majority show small or trivial differences (d < 0.10).
  • Hyde argues it makes as much sense to emphasize fundamental similarities between men and women as to emphasize differences.
  • Example: The talkativeness difference mentioned elsewhere in the book was d = 0.06, trivial.

54. Expressing Your Results

🧭 Overview

🧠 One-sentence thesis

Descriptive statistical results must be presented clearly and efficiently in writing, figures, and tables following APA style guidelines so that readers can understand findings without referring back to the main text.

📌 Key points (3–5)

  • When to use each format: write out results when you have a small number; use figures (graphs) or tables when you have many results to report more clearly.
  • Figures must stand alone: every graph should add new information (not repeat text/tables), be as simple as possible, and be interpretable on its own with only the caption.
  • APA number rules: use words for numbers below 10 (except statistics), numerals for 10+, and always present statistical results as numerals rounded to two decimals.
  • Common confusion—narrative vs. parenthetical: write out "mean" and "standard deviation" in sentences, but use symbols M and SD in parentheses.
  • Error bars show variability: bars extending from graph points usually represent standard error (not standard deviation), helping readers see if differences are statistically significant.

✍️ Writing out statistics

✍️ Number formatting rules

Statistical results are always presented in the form of numerals rather than words and are usually rounded to two decimal places.

  • Use words only for numbers less than 10 that are not statistical results.
  • Use numerals for 10 and above.
  • Exception: all statistics use numerals, even if below 10.
  • Example: "The mean age of the participants was 22.43 years with a standard deviation of 2.34."

📝 Narrative vs. parenthetical presentation

Two ways to present the same information:

| Format | Terms | Symbols | Example |
| --- | --- | --- | --- |
| Narrative | "mean" and "standard deviation" spelled out | Not used | "The treatment group had a mean of 23.40" |
| Parenthetical | Abbreviated | M and SD | "(M = 4.05, SD = 2.32)" |
  • Don't confuse: you can mix both in one sentence—narrative for context, parenthetical for precision.
  • Example: "Among the participants with low self-esteem, those in a negative mood expressed stronger intentions (M = 4.05, SD = 2.32) than those in a positive mood (M = 2.15, SD = 2.27)."

🔄 Parallel construction

  • Present similar results in similar ways for clarity.
  • Bad example (non-parallel): "The treatment group had a mean of 23.40 (SD = 9.33), while 20.87 was the mean of the control group, which had a standard deviation of 8.45."
  • Good example (parallel): "The treatment group had a mean of 23.40 (SD = 9.33), while the control group had a mean of 20.87 (SD = 8.45)."

📊 Creating figures (graphs)

📊 Three core principles for figures

  1. Add information, don't repeat: if a figure presents information more clearly than text or a table, keep the figure and eliminate the redundant format.
  2. Keep it simple: avoid unnecessary color, decoration, or complexity.
  3. Make it self-contained: a reader should understand the basic result from the figure and caption alone, without consulting the text.

📐 Technical layout guidelines

Graph dimensions and axes:

  • Scatterplots, bar graphs, and line graphs should be slightly wider than tall.
  • Independent variable on the x-axis (horizontal), dependent variable on the y-axis (vertical).
  • Values increase left-to-right on x-axis, bottom-to-top on y-axis.
  • Both axes should begin at zero.

Labels and text:

  • Axis labels must be clear, concise, include units of measurement (if not in caption), and run parallel to the axis.
  • Legends appear within the figure.
  • Use the same simple font throughout; size between 8 and 14 points.

🏷️ Caption requirements

Every figure caption has three parts:

  1. Title: "Figure" + number (in order of appearance) + period, all italicized.
  2. Description: brief summary of what the figure shows, ending with a period.
  3. Interpretation notes: any abbreviations, units, error bar definitions, etc., needed to read the figure.

Example structure: Figure 12. Mean response time by alphabetical position. Response times are expressed as z scores. Error bars represent standard errors.

📈 Types of graphs

📊 Bar graphs

Bar graphs are generally used to present and compare the mean scores for two or more groups or conditions.

  • Each bar represents the mean for one group or condition.
  • Error bars: small vertical lines extending up and down from the top of each bar.
    • Usually extend one standard error in each direction (not standard deviation).
    • Standard error = standard deviation ÷ square root of sample size.
    • Why it matters: a difference greater than two standard errors is typically statistically significant, so you can "see" significance from the graph.
  • Example: comparing mean phobia severity ratings for an education treatment group vs. an exposure treatment group.
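
The error-bar arithmetic is one line. The SD below comes from the phobia example earlier in the excerpt; the group size of 25 is assumed for illustration:

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / sqrt(n)

# Education group SD = 1.52 (from the excerpt); n = 25 is assumed
print(round(standard_error(1.52, 25), 3))  # 0.304 -- the length of each error bar arm
```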

📉 Line graphs

Line graphs are used when the independent variable is measured in a more continuous manner (e.g., time) or to present correlations between quantitative variables.

  • Each point represents the mean score at one level of the independent variable.
  • Points are connected by lines to show trends over continuous variables.
  • Also include error bars (standard error).
  • Convention: use a bar graph when the x-axis variable is categorical; use a line graph when it is quantitative.
  • Don't confuse: line graphs and bar graphs show the same type of relationship (differences in average scores), just formatted differently based on variable type.

🔵 Scatterplots

Scatterplots are used to present correlations and relationships between quantitative variables when the variable on the x-axis has a large number of levels.

  • Each point represents an individual (not a group mean).
  • No lines connect the points (unlike line graphs).
  • When variables on both axes are similar and on the same scale, make the axes the same length to emphasize this.
  • Overlapping points: if multiple individuals fall at the same spot, offset points slightly, show the count in parentheses, or make the point larger/darker.
  • Regression line: the straight line that best fits the points can be included to show the trend.

📋 Creating tables

📋 General table principles

Tables follow the same three core principles as figures:

  • Add important information (don't duplicate text).
  • Keep them as simple as possible.
  • Make them interpretable on their own.

📊 Tables of means and standard deviations

Most common use: present several means and standard deviations for complex designs with multiple independent and dependent variables.

Formatting rules:

  • Horizontal lines only at the top, bottom, and just beneath column headings.
  • Every column has a heading, including the leftmost.
  • Use spanning headings over multiple columns to organize information efficiently.
  • Number tables consecutively (Table 1, Table 2, etc.).
  • Give each table a brief, clear, descriptive title.

Example scenario: a study with low/high self-esteem participants in negative/positive moods, measuring intentions and attitudes toward unprotected sex—all means and standard deviations organized in one table by mood and self-esteem level.

🔗 Correlation matrices

A correlation matrix is a table that presents correlations—usually measured by Pearson's r—among several variables.

Structure:

  • Only half the table is filled in because the other half would be identical (correlation of A with B = correlation of B with A).
  • The diagonal (correlation of a variable with itself) is always 1.00, so these cells are replaced with dashes for readability.
  • Example: a study examining relationships between working memory, executive function, processing speed, vocabulary, episodic memory, and age—all pairwise correlations shown in one compact table.

📝 Relationship between tables and text

  • Precise statistical results in a table do not need to be repeated in the text.
  • The writer should note major trends and alert readers to specific details of particular interest (e.g., "the correlation between working memory and executive function was extremely strong at .96").

55. Conducting Your Analyses

🧭 Overview

🧠 One-sentence thesis

Analyzing research data requires systematic preparation, careful screening for errors and outliers, and a clear distinction between testing planned hypotheses and exploring unexpected patterns.

📌 Key points (3–5)

  • Data preparation is essential: check for completeness, accuracy, missing responses, and store raw data securely before any analysis begins.
  • Preliminary checks come first: assess internal consistency of measures, examine distributions of each variable, and identify outliers before testing hypotheses.
  • Planned vs exploratory analyses must be distinguished: planned analyses test pre-existing hypotheses; exploratory analyses search for unexpected patterns but require skepticism and replication.
  • Common confusion: outliers are not always errors—they may represent genuine extreme responses, so analyze data both with and without them when results differ.
  • Descriptive statistics tell the story: understand what happened in your study at the descriptive level before moving to inferential statistics.

🗂️ Preparing raw data

🔒 Security and storage

  • Remove any information that could identify individual participants.
  • Store data in a secure location (locked room or password-protected computer).
  • Keep consent forms in a separate secure location.
  • Make photocopies or backup files and store them in another secure location.
  • Professional researchers keep raw data and consent forms for several years in case questions arise later.

🔍 Checking for completeness and accuracy

Raw data check: examine data to ensure they are complete and appear accurately recorded, whether recorded by participants, researchers, or computer programs.

  • Look for illegible or missing responses.
  • Identify obvious misunderstandings (e.g., a response of "12" on a 1-to-10 scale).
  • Decide whether problems are severe enough to exclude a participant's data.
  • If main independent or dependent variable information is missing, or several responses are missing/suspicious, consider exclusion.
  • Important: never throw away or delete excluded data—set it aside and keep notes explaining why you excluded it, as you must report this information.

📊 Creating the data file

  • Use a spreadsheet program (Microsoft Excel) or statistical analysis program (SPSS).
  • Standard format: each row = one participant; each column = one variable (with variable name at top).
  • First column typically contains participant identification numbers.
  • Follow with demographic information, independent variables, then dependent variables.
  • Categorical variables can be entered as labels (e.g., "M" and "F") or numbers (e.g., "0" and "1").

📝 Handling multiple-response measures

  • Enter each response as a separate variable in the spreadsheet rather than combining by hand.
  • Use software functions to combine them (e.g., "AVERAGE" in Excel or "Compute" in SPSS).
  • Benefits of this approach:
    • More accurate than manual calculation
    • Allows error detection and correction
    • Enables assessment of internal consistency
    • Permits analysis of individual responses later if needed

Example: For a self-esteem measure with four items, enter SE1, SE2, SE3, SE4 as separate columns, then use software to compute the TOTAL.
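
The combine-by-software idea can be sketched in Python, the analogue of Excel's AVERAGE or SPSS's Compute. The item columns SE1–SE4 follow the example above; the ratings themselves are made up:

```python
import statistics

# One row per participant, one column per variable, as described above.
rows = [
    {"id": 1, "SE1": 3, "SE2": 4, "SE3": 3, "SE4": 4},
    {"id": 2, "SE1": 5, "SE2": 5, "SE3": 4, "SE4": 5},
]

items = ["SE1", "SE2", "SE3", "SE4"]
for row in rows:
    # Let the software combine the items rather than averaging by hand
    row["TOTAL"] = statistics.mean(row[item] for item in items)

print([row["TOTAL"] for row in rows])  # [3.5, 4.75]
```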

🔬 Preliminary analyses

🧪 Assessing internal consistency

  • For multiple-response measures, assess internal consistency before main analyses.
  • Statistical programs can compute Cronbach's α (for internal consistency) or Cohen's κ (for inter-rater agreement).
  • Alternative: compute and evaluate a split-half correlation if advanced statistics are beyond your comfort level.
  • Note: this step is not necessary for manipulated independent variables, whose levels are assigned by the researcher rather than measured.

📈 Analyzing each variable separately

  • Make histograms for each important variable.
  • Note the shapes of distributions.
  • Compute common measures of central tendency and variability.
  • Crucial: understand what these statistics mean in terms of your actual variables.

Example: A distribution of self-report happiness ratings on a 1-to-10 scale might show mean = 8.25, SD = 1.14, unimodal and negatively skewed. This means most participants rated themselves fairly high on happiness, with a small number rating themselves noticeably lower.

🎯 Identifying and handling outliers

Outliers require careful examination and decision-making:

| Outlier type | What it might mean | How to handle |
| --- | --- | --- |
| Data entry error | Response entered incorrectly | Correct the data file and continue |
| Misunderstanding/inattention | Participant didn't understand task | May justify exclusion; keep notes on criteria |
| Genuine extreme response | Honest, accurate estimate | Consider keeping; use median or analyze both ways |

Key principle: If you exclude outliers, keep notes on which responses/participants you excluded and why, apply criteria consistently, and report exclusions when presenting results.

⚖️ When outliers might be genuine

  • Example from the excerpt: In a university student sample, most reported fewer than 15 sexual partners, but a few reported 60 or 70.
  • These extreme scores might represent errors, misunderstandings, or exaggerations—but could also be honest and accurate.
  • Strategies when outliers might be real:
    • Use the median and other statistics not strongly affected by outliers
    • Analyze data both including and excluding outliers
    • If results are essentially the same, leave outliers in
    • If results differ, report both analyses and discuss the differences
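
The analyze-both-ways strategy in miniature, with made-up partner counts echoing the example (most under 15, one extreme score):

```python
import statistics

# Self-reported number of partners: illustrative values, not the study's data
partners = [1, 2, 3, 3, 5, 8, 12, 70]
without_outlier = [p for p in partners if p < 15]

# The median barely moves; the mean diverges sharply
print(statistics.median(partners), statistics.median(without_outlier))        # 4.0 3
print(statistics.mean(partners), round(statistics.mean(without_outlier), 2))  # 13 4.86
```

If both versions tell the same story, the outlier can stay in; if not, both analyses should be reported.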

🔎 Planned vs exploratory analyses

📋 Planned analyses

Planned analysis: testing a relationship that you expected in your hypothesis before you designed the study.

  • Conduct these first to answer your primary research questions.
  • If you expected a difference between groups: compute relevant means and standard deviations, make a bar graph, compute Cohen's d.
  • If you expected a correlation: make a line graph or scatterplot (check for nonlinearity and restriction of range), compute Pearson's r.

🎣 Exploratory analyses

Exploratory analysis: an analysis undertaken without an existing hypothesis, exploring data for relationships you did not predict.

The excerpt quotes researcher Daryl Bem's advice to:

  • Examine data from every angle
  • Analyze subgroups separately (e.g., sexes separately)
  • Create new composite indexes
  • If a datum suggests a new hypothesis, look for additional evidence elsewhere in the data
  • Reorganize data to bring dim traces of interesting patterns into bolder relief
  • Go on a "fishing expedition" for something interesting

⚠️ Why the distinction matters

  • Critical issue: Complex data sets are likely to include "patterns" that occurred entirely by chance.
  • Every unplanned analysis increases the likelihood these chance patterns will appear real (a Type I error).
  • Results discovered during exploratory analyses should be:
    • Viewed skeptically
    • Replicated in at least one new study before being presented as findings
    • Clearly labeled as exploratory in your report
  • Exploratory findings can provide the basis for future research and material for the discussion section.

Don't confuse: Planned analyses test what you predicted; exploratory analyses generate new ideas that need further testing.

📊 Understanding your descriptive statistics

💡 Descriptive statistics tell "what happened"

  • Beginning researchers sometimes forget that descriptive statistics reveal the actual results of the study.
  • Inferential statistics are important (covered in the next chapter), but descriptive statistics come first.

🔢 Examples of descriptive clarity

The excerpt provides two scenarios where descriptive statistics alone make the results clear:

Scenario 1 - Clear treatment effect:

  • Treatment group (n=50): Mean = 34.32, SD = 10.45
  • Control group (n=50): Mean = 21.45, SD = 9.22
  • Cohen's d = 1.31 (extremely strong)
  • Although inferential statistics (like a t-test) would be required in a formal report, the descriptive statistics alone show the treatment worked.

Scenario 2 - Clear lack of relationship:

  • Scatterplot shows an indistinct "cloud" of points
  • Pearson's r = −.02 (trivial)
  • Although inferential statistics would be required in a formal report, the descriptive statistics alone show the variables are essentially unrelated.

🎯 The key principle

Always ensure you thoroughly understand your results at a descriptive level first, then move on to inferential statistics. Descriptive statistics show what actually happened in your study; inferential statistics help determine whether those results likely apply to the broader population.

56

Key Takeaways and Exercises

56. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Descriptive statistics—distributions, central tendency, variability, effect sizes, and proper presentation—form the foundation for understanding what actually happened in a study before moving to inferential tests.

📌 Key points (3–5)

  • Distributions first: Every variable has a distribution that can be described by shape (unimodal/bimodal, symmetrical/skewed), frequency tables, and histograms.
  • Three ways to describe center and spread: Central tendency uses mean, median, and mode; variability uses range and standard deviation.
  • Effect sizes matter: Cohen's d measures group differences (±0.20/0.50/0.80 = small/medium/large); Pearson's r measures correlations (±.10/.30/.50 = small/medium/large).
  • Common confusion: Researchers sometimes rush to inferential statistics without thoroughly understanding their descriptive statistics, which actually tell "what happened."
  • Presentation rules: APA style has specific rules for text, graphs, and tables—use words for numbers under 10, round to two decimals, and ensure graphs/tables add information rather than repeat it.

📊 Understanding distributions

📊 What a distribution shows

Distribution: the way scores are distributed across the levels of a variable.

  • Every variable in your data has a distribution—not just a single number.
  • You describe it using frequency tables and histograms (visual representations).
  • The shape matters: is it unimodal (one peak) or bimodal (two peaks)? Symmetrical or skewed?

📍 Locating individual scores

Two ways to describe where a score sits within its distribution:

  • Percentile rank: the percentage of scores below that score.
  • z score: the difference between the score and the mean, divided by the standard deviation.

Example: If a score has a percentile rank of 75, it means 75% of scores fall below it.
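Both location measures translate into a few lines of Python; the scores below are hypothetical, not from the excerpt:

```python
def percentile_rank(scores, x):
    """Percentage of scores strictly below x."""
    return 100 * sum(s < x for s in scores) / len(scores)

def z_score(scores, x):
    """(x - mean) / standard deviation, using the population SD."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return (x - mean) / sd

scores = [2, 4, 6, 8, 10]            # hypothetical data
print(percentile_rank(scores, 8))    # 60.0 -> 60% of scores fall below 8
print(round(z_score(scores, 8), 2))  # 0.71 -> 8 is ~0.71 SDs above the mean
```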

📏 Measuring center and spread

📏 Central tendency (the "middle")

Three statistics describe the center:

| Statistic | Definition | When to use |
| --- | --- | --- |
| Mean | Sum of scores divided by the number of scores | Most common measure |
| Median | The middle score | Better when data are skewed |
| Mode | The most common score | Useful for categorical data |

📐 Variability (the "spread")

Two statistics describe how scattered the data are:

  • Range: difference between highest and lowest scores (simple but crude).
  • Standard deviation: the average amount by which scores differ from the mean (more precise).

Don't confuse: Standard deviation is not the same as range—it accounts for all scores, not just the extremes.
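The distinction can be made concrete with a short sketch (hypothetical scores): two data sets can share the same range while their standard deviations, which use every score, differ.

```python
def spread(scores):
    """Return (range, population standard deviation) for a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    rng = max(scores) - min(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return rng, sd

# Two hypothetical samples with the same range but different spreads:
tight = [1, 5, 5, 5, 9]   # most scores hug the mean
loose = [1, 1, 5, 9, 9]   # scores pile up at the extremes
print(spread(tight))  # range 8, smaller SD
print(spread(loose))  # range 8, larger SD
```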

🔬 Describing relationships and differences

🔬 Cohen's d for group differences

Cohen's d: a measure of effect size for differences between two group or condition means, calculated as the difference of the means divided by the standard deviation.

Interpretation guidelines:

  • ±0.20 = small effect

  • ±0.50 = medium effect

  • ±0.80 = large effect

  • Typically presented in bar graphs.

  • Example: If a treatment group has mean = 34.32 (SD = 10.45) and control group has mean = 21.45 (SD = 9.22), with d = 1.31, this is an extremely strong effect—the descriptive statistics alone show the treatment worked.
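The scenario's effect size can be reproduced directly. A minimal sketch: the excerpt's definition divides the mean difference by "the standard deviation"; averaging the two group SDs (one common choice, assumed here) recovers the reported d = 1.31.

```python
def cohens_d(m1, sd1, m2, sd2):
    """Mean difference divided by the averaged group standard deviations.
    Averaging the two SDs is one simple choice; a pooled SD is also common."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

# Treatment vs. control scenario from the excerpt:
d = cohens_d(34.32, 10.45, 21.45, 9.22)
print(round(d, 2))  # 1.31 -> well past the +/-0.80 "large effect" guideline
```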

🔬 Pearson's r for correlations

Pearson's r: a measure of relationship strength for relationships between quantitative variables, calculated as the mean cross-product of the two sets of z scores.

Interpretation guidelines:

  • ±.10 = small relationship

  • ±.30 = medium relationship

  • ±.50 = large relationship

  • Typically presented in line graphs or scatterplots.

  • Example: If a scatterplot shows an indistinct "cloud" and r = −.02, the variables are essentially unrelated, even before running inferential tests.
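The definition above—the mean cross-product of the two sets of z scores—translates directly into code (hypothetical data):

```python
def pearson_r(xs, ys):
    """Pearson's r as the mean cross-product of z scores (population SDs)."""
    n = len(xs)
    def zs(vals):
        mean = sum(vals) / n
        sd = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
        return [(v - mean) / sd for v in vals]
    zx, zy = zs(xs), zs(ys)
    return sum(a * b for a, b in zip(zx, zy)) / n

# Hypothetical data: a perfect positive and a perfect negative relationship
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 2))  # 1.0
print(round(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]), 2))  # -1.0
```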

📝 APA-style presentation rules

📝 Text presentation

Key rules for presenting numbers in text:

  • Use words only for numbers less than 10 that do not represent precise statistical results.
  • Round results to two decimal places.
  • Use words (e.g., "mean") in the text and symbols (e.g., "M") in parentheses.

📊 Graphs and tables

Three principles:

  1. Add information rather than repeating what's already in the text.
  2. Keep it simple—avoid unnecessary complexity.
  3. Make them interpretable on their own with descriptive captions (graphs) or titles (tables).

Don't confuse: Simple results go in text; complex results go in graphs or tables.

🧹 Preparing and cleaning data

🧹 Preliminary analysis steps

Before analyzing, you must:

  • Check the reliability of measures.
  • Evaluate the effectiveness of any manipulations.
  • Examine the distributions of individual variables.
  • Identify outliers.

🚨 Handling outliers

Outliers that appear to result from error, misunderstanding, or lack of effort can be excluded, but:

  • Apply exclusion criteria the same way to all data.
  • Describe your criteria when presenting results.
  • Set aside (don't destroy or delete) excluded data in case they're needed later.

Example: A participant estimating heights reports "84 inches"—this might be an error worth investigating before deciding whether to exclude.
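One common screening heuristic (an illustration, not a rule from the excerpt) flags scores more than about 2.5–3 standard deviations from the mean; the cutoff and the height data below are assumptions:

```python
def flag_outliers(scores, threshold=2.5):
    """Flag scores whose |z| exceeds the threshold. The 2.5-SD cutoff is a
    common heuristic, not a rule from the excerpt; always inspect flagged
    values before excluding them, and set aside (don't delete) exclusions."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [s for s in scores if abs((s - mean) / sd) > threshold]

# Hypothetical height estimates (inches); 84 is the suspicious report
heights = [63, 64, 65, 65, 66, 67, 67, 68, 69, 84]
print(flag_outliers(heights))  # [84]
```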

🎯 Descriptive statistics come first

🎯 Why descriptive statistics matter

The excerpt emphasizes: descriptive statistics really tell "what happened" in your study.

  • Beginning researchers sometimes forget this and rush to inferential statistics.
  • You should always thoroughly understand your results at a descriptive level first.
  • Then move on to inferential statistics (which tell whether results likely apply to the population).

Example: If treatment mean = 34.32 and control mean = 21.45 with d = 1.31, it should be clear from descriptives alone that the treatment worked—even though you still need to report inferential tests formally.

57

Understanding Null Hypothesis Testing

57. Understanding Null Hypothesis Testing

🧭 Overview

🧠 One-sentence thesis

Null hypothesis testing provides a formal method for researchers to decide whether a statistical relationship observed in a sample reflects a real relationship in the population or merely occurred by chance due to sampling error.

📌 Key points (3–5)

  • Purpose: Helps researchers choose between two interpretations—either there is a real relationship in the population, or the sample relationship is just sampling error.
  • Core logic: Assume the null hypothesis (no relationship) is true, calculate how likely the sample result would be under that assumption, and reject the null if the result would be extremely unlikely.
  • What determines significance: Both relationship strength and sample size matter—stronger relationships and larger samples make it easier to reject the null hypothesis.
  • Common confusion: The p value is NOT the probability that the null hypothesis is true; it is the probability of obtaining the sample result (or more extreme) if the null hypothesis were true.
  • Statistical vs. practical significance: A statistically significant result is not necessarily strong or important—even very weak relationships can be significant with large enough samples.

🎯 The purpose and problem

🎯 Why null hypothesis testing exists

The purpose of null hypothesis testing is simply to help researchers decide between two interpretations of a statistical relationship in a sample.

  • Researchers measure variables in a sample (e.g., 50 adults with depression) and compute statistics (e.g., means, correlations).
  • The goal is to draw conclusions about the population (all adults with depression), where the true values are called parameters.
  • The problem: Sample statistics are not perfect estimates—they vary randomly from sample to sample due to sampling error (random variability, not a mistake).
  • Example: The mean number of depressive symptoms might be 8.73 in one sample, 6.45 in another, and 9.44 in a third, even when all samples come from the same population.

🔀 Two possible interpretations

Any statistical relationship in a sample can mean one of two things:

  1. There is a relationship in the population, and the sample reflects it.
  2. There is no relationship in the population, and the sample relationship is just sampling error ("occurred by chance").
  • Example: Mehl and colleagues found women spoke a mean of 16,215 words/day and men 15,669 words/day, but concluded there was no real sex difference in the population—the small sample difference was likely due to chance.
  • Example: Kanner and colleagues found a correlation of +.60 between daily hassles and symptoms and concluded there is a real relationship in the population—such a strong correlation would be unlikely by chance alone.

🧮 The logic and mechanics

🧮 How null hypothesis testing works

The process follows these steps:

  1. Assume the null hypothesis (H₀) is true: There is no relationship in the population.
  2. Determine how likely the sample result would be if the null hypothesis were true.
  3. Make a decision:
    • If the sample result would be extremely unlikely, reject the null hypothesis in favor of the alternative hypothesis (H₁: there is a relationship).
    • If the result would not be extremely unlikely, retain the null hypothesis (do not conclude there is a relationship).

📊 The p value

The p value is the probability of obtaining the sample result or a more extreme result if the null hypothesis were true.

  • A low p value means the sample result would be unlikely if the null hypothesis were true → reject the null.
  • A p value that is not low means the sample result would be likely if the null hypothesis were true → retain the null.
  • The criterion for "low enough" is called α (alpha), almost always set to .05.
  • If p ≤ .05, the result is statistically significant and the null hypothesis is rejected.
  • If p > .05, the null hypothesis is retained (researchers "fail to reject" it, but never "accept" it).

❌ The most common misunderstanding

Wrong interpretation: "A p value of .02 means there is only a 2% chance the result is due to chance and a 98% chance it reflects a real relationship."

Correct interpretation: "A p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time."

  • The p value is about the sample result, not about the truth of any hypothesis.
  • Don't confuse: p is not the probability that the null hypothesis is true or false.
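The correct reading of p can be made concrete by simulation: assume the null hypothesis is true (both groups drawn from the same population) and count how often a difference at least as extreme as the observed one appears. A sketch with hypothetical numbers:

```python
import random

def simulated_p(observed_diff, n_per_group=10, sims=5000, seed=42):
    """Two-tailed p by simulation: draw both groups from the SAME standard
    normal population (the null hypothesis) and count how often the absolute
    mean difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(sims):
        g1 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        g2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(g1) / n_per_group - sum(g2) / n_per_group
        if abs(diff) >= observed_diff:
            extreme += 1
    return extreme / sims

# A hypothetical observed difference of 0.9 with n = 10 per group:
p = simulated_p(0.9)
print(p)  # a small proportion: under the null, a result this extreme is rare
```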

🔢 What drives statistical significance

🔢 Two key factors

The answer to "What is the p value?" depends on just two things:

  1. Strength of the relationship in the sample (e.g., Cohen's d or Pearson's r).
  2. Sample size (N).

General rule: The stronger the sample relationship and the larger the sample, the lower the p value (the less likely the result would be if the null hypothesis were true).

📐 Intuitive examples

  • Strong relationship, large sample: A study with 500 women and 500 men finds Cohen's d = 0.50. If there were really no sex difference in the population, this strong result from such a large sample should seem highly unlikely → reject the null.
  • Weak relationship, small sample: A study with 3 women and 3 men finds d = 0.10. If there were no sex difference, this weak result from such a small sample should seem likely → retain the null.
  • Trade-offs: A weak result can be significant if the sample is large enough; a strong result can be significant even if the sample is small.

📋 Rough guideline table

The excerpt provides a table showing how relationship strength and sample size combine to determine significance. Key patterns:

  • Weak relationships with medium or small samples are never statistically significant.
  • Strong relationships with medium or larger samples are always statistically significant.
  • Medium relationships require at least a medium sample to be significant.

Example: For an independent-samples t-test with N = 50 per group, a medium effect (d = .50) would be significant, but a weak effect (d = .20) would not.

⚖️ Statistical vs. practical significance

⚖️ A crucial distinction

Practical significance refers to the importance or usefulness of the result in some real-world context.

  • A statistically significant result is not necessarily strong or important.
  • Even a very weak relationship can be statistically significant if the sample is large enough.
  • Example: Sex differences in math problem-solving and leadership are statistically significant but actually quite weak—perhaps even "trivial"—yet the word "significant" can mislead people into thinking they are strong and important.
  • In clinical practice, this is called clinical significance: a treatment might produce a statistically significant effect but still not be strong enough to justify the time, effort, and cost, especially if easier and cheaper treatments exist.

🛡️ How to avoid confusion

  • Always report an effect size measure (e.g., Cohen's d, Pearson's r) alongside the p value.
  • The p value alone cannot substitute for relationship strength because it also depends on sample size.
  • Remember: Statistical significance tells you whether an effect exists; effect size tells you how strong it is; practical significance tells you whether it matters.

Note: The excerpt also introduces examples of specific null hypothesis tests (t-tests, ANOVA, tests of correlation coefficients) and discusses errors in null hypothesis testing (Type I and Type II errors), statistical power, criticisms of null hypothesis testing, and the replicability crisis. These topics are covered in subsequent sections of the chapter.

58

Some Basic Null Hypothesis Tests

58. Some Basic Null Hypothesis Tests

🧭 Overview

🧠 One-sentence thesis

Null hypothesis tests—including t-tests, ANOVA, and correlation tests—provide standardized procedures for deciding whether sample data provide enough evidence to conclude that an effect exists in the population.

📌 Key points (3–5)

  • What these tests do: compare sample statistics (means, correlations) against a null hypothesis that claims no effect exists in the population.
  • The t-test family: three versions (one-sample, dependent-samples, independent-samples) handle different research designs involving mean comparisons.
  • ANOVA for multiple groups: when comparing more than two means, ANOVA replaces multiple t-tests and controls the risk of false positives.
  • Common confusion: one-tailed vs two-tailed tests—one-tailed tests require a directional prediction before data collection and only reject the null in that expected direction.
  • Decision rule: if the p-value ≤ .05, reject the null hypothesis; if p > .05, retain it (do not conclude the effect exists).

🧪 The t-test family

🔬 One-sample t-test

One-sample t-test: compares a sample mean (M) with a hypothetical population mean (μ₀) that provides an interesting standard of comparison.

  • Null hypothesis: the population mean equals the hypothetical mean (μ = μ₀).
  • Alternative hypothesis: the population mean differs from the hypothetical mean (μ ≠ μ₀).
  • How it works: compute a t statistic from the sample mean, hypothetical mean, sample standard deviation, and sample size; then find the p-value.
  • Example: A health psychologist shows students a cookie with 250 calories and asks them to estimate. The sample mean is 212 calories, SD = 39.17, N = 10. The computed t = -3.07 with p = .013, so he rejects the null and concludes students underestimate calories.

🔄 Dependent-samples t-test

Dependent-samples t-test (paired-samples t-test): compares two means for the same sample tested at two different times or under two different conditions.

  • When to use: pretest-posttest designs or within-subjects experiments.
  • Key step: reduce each participant's two scores to a single difference score (subtract one from the other).
  • Then: treat it as a one-sample t-test where the hypothetical population mean is 0 (no average difference).
  • Example: The psychologist tests a training program by measuring calorie estimates before and after. Difference scores average 8.50, SD = 27.27, N = 10. The t = 0.99, p = .174 (one-tailed), so he retains the null—no evidence the training works.
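Both examples above reduce to the same computation, t = (M − μ₀) / (SD / √N); difference scores simply set the hypothetical mean to 0. A sketch using the excerpt's summary statistics:

```python
def one_sample_t(mean, sd, n, mu0=0.0):
    """t = (M - mu0) / (SD / sqrt(N)); degrees of freedom = N - 1."""
    return (mean - mu0) / (sd / n ** 0.5)

# One-sample example: M = 212, SD = 39.17, N = 10, hypothetical mean 250
print(round(one_sample_t(212, 39.17, 10, mu0=250), 2))  # -3.07

# Dependent-samples example via difference scores: M = 8.50, SD = 27.27, N = 10
print(round(one_sample_t(8.50, 27.27, 10), 2))  # 0.99
```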

⚖️ Independent-samples t-test

Independent-samples t-test: compares the means of two separate samples (M₁ and M₂).

  • When to use: between-subjects experiments or cross-sectional designs (e.g., comparing two pre-existing groups).
  • Null hypothesis: the two population means are equal (μ₁ = μ₂).
  • Formula complexity: must account for two sample means, two standard deviations, and two sample sizes; degrees of freedom = N − 2.
  • Example: Comparing junk-food eaters (M = 168.12) vs non-junk-food eaters (M = 220.71). The t = -2.74, p = .015, so he rejects the null and concludes junk-food eaters underestimate more.

🔀 One-tailed vs two-tailed tests

  • Two-tailed: reject the null if the sample is extreme in either direction; use when you have no directional expectation.
  • One-tailed: reject only if extreme in one pre-specified direction; requires a directional hypothesis before data collection.
  • Trade-off: one-tailed tests have less extreme critical values (easier to reject the null in the expected direction) but cannot reject if the effect goes the opposite way.
  • Don't confuse: you must decide which test to use before collecting data, based on your theoretical expectations.

📊 Analysis of Variance (ANOVA)

🎯 One-way ANOVA

One-way ANOVA: used to compare the means of more than two samples in a between-subjects design.

  • Null hypothesis: all group means are equal in the population (μ₁ = μ₂ = … = μ_G).
  • Alternative hypothesis: not all means are equal (at least one differs).
  • Test statistic F: ratio of two variance estimates—mean squares between groups (MS_B) divided by mean squares within groups (MS_W).
  • Why F works: when the null is true, F clusters around 1; larger F values indicate greater between-group differences relative to within-group variability.
  • Example: Comparing calorie estimates of psychology majors (M = 187.50), nutrition majors (M = 195.00), and dieticians (M = 238.13). F = 9.92, p = .0009, so reject the null—the groups differ.
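The F ratio's mechanics can be sketched directly from the definitions above (MS_B divided by MS_W); the three groups below are hypothetical:

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within for a list of groups (lists of scores)."""
    all_scores = [s for g in groups for s in g]
    n_total, n_groups = len(all_scores), len(groups)
    grand_mean = sum(all_scores) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((s - m) ** 2
                    for g, m in zip(groups, group_means) for s in g)
    ms_between = ss_between / (n_groups - 1)      # df_B = G - 1
    ms_within = ss_within / (n_total - n_groups)  # df_W = N - G
    return ms_between / ms_within

# Hypothetical scores for three groups; the third clearly differs
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]]))  # 21.0
```

A larger F means the between-group differences are large relative to the within-group variability; when the null is true, F clusters around 1.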

🔍 Post hoc comparisons

  • The problem: a significant ANOVA tells you "not all means are equal" but not which specific pairs differ.
  • Why not just run multiple t-tests: conducting many t-tests inflates the risk of mistakenly rejecting a true null hypothesis (Type I error).
  • Solution: use modified t-test procedures (Bonferroni, Fisher's LSD, Tukey's HSD) that keep the overall error rate near 5%.
  • Don't confuse: post hoc tests are follow-ups after a significant ANOVA, not replacements for it.

🔁 Repeated-measures ANOVA

  • When to use: within-subjects designs where the same participants are tested under different conditions or at different times.
  • Key advantage: can measure and subtract stable individual differences (e.g., baseline reaction time) from the within-groups variability, making the test more sensitive.
  • Result: lower MS_W, higher F, better chance of detecting real effects.

🧩 Factorial ANOVA

  • When to use: factorial designs with more than one independent variable.
  • What it produces: separate F ratios and p-values for each main effect and each interaction.
  • Example: Testing participant major (psychology vs nutrition) and food type (cookie vs hamburger) would yield F and p for the main effect of major, main effect of food type, and the major × food type interaction.

📈 Testing correlation coefficients

🔗 Correlation test logic

Test of the correlation coefficient: determines whether a relationship between two quantitative variables exists in the population.

  • Null hypothesis: no relationship in the population (ρ = 0, where ρ is the population correlation).
  • Alternative hypothesis: a relationship exists (ρ ≠ 0).
  • How it works: compute Pearson's r for the sample; statistical software provides the associated p-value.
  • Degrees of freedom: N − 2.

📉 Example correlation test

  • A health psychologist examines the correlation between calorie estimates and weight in 22 students.
  • She finds r = −.21, p = .348 (two-tailed).
  • Because p > .05, she retains the null hypothesis—no evidence of a relationship.
  • If computing by hand, she would compare her r to the critical value for 20 degrees of freedom (.444); her r is less extreme, confirming p > .05.

📋 Interpreting ANOVA output

📊 The ANOVA table

| Component | What it shows | Why it matters |
| --- | --- | --- |
| SS (sum of squares) | Total variability between and within groups | Intermediate calculation |
| df (degrees of freedom) | Between groups: G − 1; within groups: N − G | Determines F distribution shape |
| MS (mean squares) | SS divided by df | The variance estimates used in F |
| F ratio | MS_B / MS_W | The test statistic |
| p-value | Probability of F this extreme if null is true | The decision criterion |
| F_crit | Critical value for α = .05 | Hand-calculation decision threshold |

  • Most researchers report only F, degrees of freedom, and p-value in their write-ups.
  • Example APA format: F(2, 21) = 9.92, p < .001.

🎲 Critical values and decision rules

📏 Using critical value tables

  • What they show: the threshold value of the test statistic (t, F, r) that corresponds to p = .05.
  • Decision rule: if your computed statistic is more extreme than the critical value, p < .05 → reject the null.
  • Degrees of freedom matter: each row in the table corresponds to a different df; use the row matching your study.
  • Don't confuse: two-tailed tests have more extreme critical values than one-tailed tests (because "extreme" includes both tails).

🧮 Software vs hand calculation

  • Modern practice: statistical software (SPSS, Excel, online tools) computes the exact p-value automatically.
  • Hand calculation: still useful for understanding the logic; use critical value tables to make the reject/retain decision.
  • Same conclusion either way: whether you compare p to .05 or compare your statistic to the critical value, the decision is identical.
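
The decision rule can be sketched with the correlation example from this section (r = −.21 against the .444 critical value for 20 degrees of freedom):

```python
def decide(statistic, critical_value):
    """Reject the null only if the computed statistic is more extreme
    (larger in absolute value) than the critical value for p = .05."""
    return "reject" if abs(statistic) > abs(critical_value) else "retain"

# Correlation example: r = -.21, two-tailed critical r (df = 20) = .444
print(decide(-0.21, 0.444))  # retain -> no evidence of a relationship
```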

59

Additional Considerations in Null Hypothesis Testing

59. Additional Considerations

🧭 Overview

🧠 One-sentence thesis

Null hypothesis testing, while standard in psychology, carries inherent risks of Type I and Type II errors and faces criticisms that have led researchers to supplement it with effect sizes, confidence intervals, and more transparent practices.

📌 Key points (3–5)

  • Two kinds of errors: Type I (rejecting a true null hypothesis) and Type II (retaining a false null hypothesis) both occur due to sampling variability and design limitations.
  • Statistical power matters: power is the probability of correctly rejecting a false null hypothesis, and researchers should aim for at least .80 power when planning studies.
  • The file drawer problem: non-significant results often go unpublished, leading the literature to overstate effect strengths and contain more Type I errors than expected.
  • Common confusion: a p value of .05 does not mean there is a 95% chance the result will replicate or that the null hypothesis has only a 5% chance of being true—it is the probability of the data if the null were true.
  • Solutions to criticisms: researchers now supplement null hypothesis tests with effect sizes (e.g., Cohen's d, Pearson's r) and confidence intervals to provide more informative results.

⚠️ Errors in hypothesis testing

⚠️ Type I error (false positive)

Type I error: rejecting the null hypothesis when it is actually true in the population.

  • This means concluding there is a relationship when there really is not.
  • Occurs because sampling error alone can occasionally produce extreme results even when the null is true.
  • When alpha is set at .05, researchers will make a Type I error 5% of the time when the null is true.
  • Example: concluding a treatment works when it actually has no effect.

⚠️ Type II error (false negative)

Type II error: retaining the null hypothesis when it is actually false in the population.

  • This means concluding there is no relationship when one actually exists.
  • Occurs primarily when a study lacks adequate statistical power (often due to small sample size).
  • Example: concluding a treatment doesn't work when it actually does have an effect.

⚖️ The tradeoff between error types

  • Reducing Type I errors (by setting alpha below .05, e.g., to .01) makes it harder to reject false null hypotheses, thus increasing Type II errors.
  • Reducing Type II errors (by setting alpha above .05, e.g., to .10) makes it easier to reject true null hypotheses, thus increasing Type I errors.
  • The .05 convention represents a compromise that keeps both error rates at acceptable levels.

📁 The file drawer problem

📁 What it is

File drawer problem: the tendency for statistically significant results to be published while non-significant results are filed away and never published.

  • Researchers are more likely to submit significant results; editors and reviewers are more likely to accept them.
  • Non-significant results end up "in a file drawer" (or a computer folder).

📁 Why it matters

  • The published literature likely contains a higher proportion of Type I errors than statistical theory would predict.
  • Even when a real relationship exists in the population, the published literature overstates its strength.
  • Example: if the true population correlation is weak (ρ = +.10), sampling error will produce results ranging from weak negative to moderately strong positive, but only the moderate-to-strong positive studies get published, making the effect appear stronger than it really is.

📁 Potential solutions

  • Registered reports: editors evaluate research based on the question and method before knowing the results, so non-significant findings are equally publishable.
  • Share non-significant results: researchers can post them in public repositories, present them at conferences, or submit to journals devoted to null results (e.g., Journal of Articles in Support of the Null Hypothesis).
  • Avoid p-hacking: researchers should not manipulate their analysis (removing outliers arbitrarily, selectively reporting variables, etc.) to achieve a desired p value.

🔋 Statistical power

🔋 What power is

Statistical power: the probability of rejecting the null hypothesis given the sample size and expected relationship strength.

  • Power is the complement of the probability of a Type II error: Power = 1 − P(Type II error).
  • Example: a study with 50 participants and an expected r = +.30 has power of .59, meaning a 59% chance of rejecting the null if the population correlation truly is +.30 (and a 41% chance of a Type II error).
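Power can be estimated by Monte Carlo: repeatedly draw samples from a population where the effect truly exists and count the fraction of significant results. A sketch for the excerpt's scenario (ρ = .30, N = 50); the critical r of ≈.279 for 48 degrees of freedom is taken from standard tables and is an assumption, not a value from the excerpt:

```python
import random

def sample_r(xs, ys):
    """Pearson's r as the mean cross-product of z scores."""
    n = len(xs)
    def zs(vals):
        m = sum(vals) / n
        sd = (sum((v - m) ** 2 for v in vals) / n) ** 0.5
        return [(v - m) / sd for v in vals]
    return sum(a * b for a, b in zip(zs(xs), zs(ys))) / n

def simulated_power(rho=0.30, n=50, r_crit=0.279, sims=2000, seed=1):
    """Fraction of simulated studies that reject the null when the population
    correlation really is rho. r_crit ~ .279 is the two-tailed .05 critical
    value for df = 48 (from standard tables; an assumed input)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y = rho*x + sqrt(1 - rho^2)*e gives population correlation rho
        ys = [rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for x in xs]
        if abs(sample_r(xs, ys)) > r_crit:
            hits += 1
    return hits / sims

print(simulated_power())  # an estimate near the excerpt's .59 figure
```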

🔋 Adequate power guideline

  • A common guideline is that power should be at least .80 (80% chance of detecting the effect if it exists).
  • Weak relationships require very large samples to achieve adequate power.

| Relationship strength | Independent-samples t test | Test of Pearson's r |
| --- | --- | --- |
| Strong (d = .80, r = .50) | 52 | 28 |
| Medium (d = .50, r = .30) | 128 | 84 |
| Weak (d = .20, r = .10) | 788 | 782 |

🔋 How to increase power

  • Increase relationship strength: use a stronger manipulation or control extraneous variables better (e.g., use within-subjects instead of between-subjects designs).
  • Increase sample size: the usual strategy; for any expected effect, there is always some sample size large enough to achieve adequate power.
  • Researchers should check power before collecting data to avoid wasting time on underpowered studies.

🧐 Criticisms of null hypothesis testing

🧐 Misunderstandings

  • The p value is not the probability the null is true; it is the probability of the data if the null were true.
  • 1 − p is not the probability of replication: in one study, 60% of professional researchers mistakenly thought p = .01 meant a 99% chance of replicating the result, but actual replication probability depends on power, which is often much lower.

🧐 Arbitrary cutoffs

  • The rigid .05 threshold makes little sense: a p of .04 is deemed "significant" and publishable, while .06 is "not significant," even though the two results are nearly identical.
  • This convention contributes to the file drawer problem and prevents good research from being published.

🧐 Limited informativeness

  • Rejecting the null only tells us there is some nonzero relationship, not how strong it is.
  • Some critics argue the null hypothesis (that the relationship is exactly zero) is never literally true, so rejecting it is uninformative.
  • Defenders (e.g., Abelson) counter that null hypothesis testing provides a principled way to show results are not mere chance, especially for new phenomena.

🧐 A radical step

  • In 2015, the journal Basic and Applied Social Psychology banned null hypothesis testing and p values, emphasizing descriptive statistics and effect sizes instead.
  • This move has not been widely adopted but continues the conversation about what we know and how we know it.

🛠️ Recommendations and solutions

🛠️ Report effect sizes

  • Every null hypothesis test should be accompanied by an effect size measure (e.g., Cohen's d, Pearson's r).
  • Effect sizes estimate the strength of the relationship in the population, not just whether one exists.
  • Don't confuse: the p value cannot substitute for effect size because p also depends on sample size—even a very weak effect can be significant with a large enough sample.

🛠️ Use confidence intervals

Confidence interval: a range of values computed so that some percentage of the time (usually 95%) the population parameter will lie within that range.

  • Example: a sample mean of 200 with a 95% confidence interval of 160 to 240 means there is a 95% chance the population mean lies between 160 and 240.
  • Advantages:
    • Easier to interpret than null hypothesis tests.
    • Provide the information needed to conduct null hypothesis tests: if a hypothetical population value falls outside the interval, the sample mean is significantly different from it at the .05 level.
  • Example: the interval 160–240 tells us the sample mean is significantly different from a hypothetical population mean of 250 (because 250 is outside the interval).
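A minimal sketch of the interval logic; the standard error below is hypothetical, back-solved so the 95% interval matches the 160–240 example, and 1.96 is the usual large-sample multiplier:

```python
def confidence_interval(mean, se, z=1.96):
    """95% confidence interval: mean +/- 1.96 standard errors (large-sample z)."""
    return mean - z * se, mean + z * se

# Hypothetical SE chosen so the interval matches the 160-240 example
lo, hi = confidence_interval(200, 20.4)
print(round(lo), round(hi))  # 160 240
print(lo <= 250 <= hi)       # False -> 200 differs significantly from 250
print(lo <= 230 <= hi)       # True  -> 200 is not significantly different from 230
```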

🛠️ Alternative approaches

  • Bayesian statistics: researchers specify probabilities that the null and alternative hypotheses are true before the study, then update those probabilities based on the data.
  • It is too early to say whether Bayesian methods will become common in psychology.
  • For now, null hypothesis testing—supported by effect sizes and confidence intervals—remains the dominant approach.

60

From the "Replicability Crisis" to Open Science Practices

60. From the "Replicability Crisis" to Open Science Practices

🧭 Overview

🧠 One-sentence thesis

Psychology's replicability crisis—the widespread failure to reproduce earlier findings—has prompted the field to adopt open science practices that increase transparency, rigor, and the sharing of research materials and data.

📌 Key points (3–5)

  • What the replicability crisis is: many published psychology studies fail to replicate when other researchers repeat them, with only 36 of 100 studies in one major project showing statistically significant effects in replication attempts.
  • Questionable research practices: selective deletion of outliers, cherry-picking results, HARKing (hypothesizing after results are known), and p-hacking contribute to low replicability.
  • Common confusion: a failure to replicate does not automatically discredit the original study—differences in statistical power, populations, procedures, or moderating variables could explain different results.
  • Open science solutions: pre-registering hypotheses, sharing research materials and raw data, publishing null findings, and conducting high-quality replications increase scientific rigor.
  • Why it matters: these practices restore integrity to psychological research and are now being adopted by hundreds of journals and funding agencies.

🔬 Understanding the replicability crisis

📉 What the crisis revealed

The excerpt describes the "replicability crisis" as:

A phrase that refers to the inability of researchers to replicate earlier research findings.

Key evidence from the Reproducibility Project:

  • 270 psychologists worldwide tested 100 previously published experiments
  • 97 of the original 100 studies had found statistically significant effects
  • Only 36 of the replications found statistically significant effects
  • Effect sizes in replications were, on average, half of those in original studies

🤔 What failure to replicate means (and doesn't mean)

The excerpt emphasizes an important distinction:

  • A replication failure by itself does not necessarily discredit the original study
  • Differences that could explain different results include:
    • Statistical power differences
    • Different populations sampled
    • Different procedures used
    • Effects of moderating variables

Don't confuse: "failure to replicate" with "proof the original was wrong"—the relationship is more nuanced.

🎯 Two interpretations of the crisis

  • Expected characteristic: failure to replicate is a normal part of cumulative scientific progress
  • Systematic problem: evidence of publication bias favoring counter-intuitive findings over replication studies, plus widespread questionable research practices

⚠️ Questionable research practices

🗑️ Data manipulation practices

The excerpt lists specific problematic behaviors:

  1. Selective deletion of outliers: removing data points to artificially inflate statistical relationships among measured variables

  2. Selective reporting (cherry-picking): reporting only findings that support one's hypotheses, ignoring others

🎲 HARKing

HARKing: hypothesizing after the results are known

  • Mining data without an a priori (beforehand) hypothesis
  • Finding a statistically significant result
  • Then claiming that result had been originally predicted
  • This reverses the scientific method

📊 P-hacking

A practice where researchers:

  • Perform inferential statistical calculations to see if a result is significant
  • Decide whether to recruit additional participants based on those results
  • Collect more data if needed to reach significance

Why this is problematic: the probability of finding a statistically significant result is influenced by the number of participants in the study, so this manipulates that probability.

🚫 Outright fraud

The excerpt mentions fabrication of data (referencing Diederik Stapel from Chapter 3), but notes this is fraud rather than a "research practice."

Example: A researcher collects data from 20 participants, finds p = 0.08 (not significant), then recruits 10 more participants and recalculates until reaching p < 0.05—this is p-hacking.
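This optional-stopping pattern can be simulated to show the inflation directly. The sketch below is illustrative: the normal approximation to the t test, the sample sizes, and the number of simulated studies are all assumptions, not details from the excerpt:

```python
import random
from statistics import NormalDist, fmean, stdev

def p_value(sample, mu0=0.0):
    """Two-sided one-sample test, normal approximation to the t statistic."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (stdev(sample) / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(p_hack, trials=4000, seed=42):
    """Fraction of null-true studies (population mean really is 0)
    declared significant. With p_hack=True, a non-significant n=20
    study recruits 10 more participants and tests again."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        data = [rng.gauss(0, 1) for _ in range(20)]
        p = p_value(data)
        if p_hack and p >= 0.05:
            data += [rng.gauss(0, 1) for _ in range(10)]
            p = p_value(data)
        if p < 0.05:
            hits += 1
    return hits / trials

honest = false_positive_rate(p_hack=False)
hacked = false_positive_rate(p_hack=True)
# The second chance at significance can only push the false-positive
# rate above the nominal 5% level, never below it.
```

Because data are only ever added after a "failed" first test, the peeking rule gives every null-true study an extra chance to cross p < .05, which is exactly why the probability of a false positive rises.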

🛠️ Enhancing scientific rigor

💪 Statistical power and design

Ways to increase reliability:

  • Designing studies with sufficient statistical power
  • This increases the reliability of findings

📰 Publication practices

  • Publishing both null and significant findings
  • This counteracts publication bias
  • Reduces the file drawer problem (unpublished null results)

📝 Detailed documentation

  • Describing research designs in sufficient detail
  • Enables other researchers to replicate using identical or very similar procedures

🔄 High-quality replications

  • Conducting replications with care
  • Publishing replication results (whether successful or not)

🌐 Open science practices

🏅 Digital badges and incentives

The excerpt describes a system used by journals like Psychological Science:

Digital badges are awarded to researchers who:

  1. Pre-registered their hypotheses and data analysis plans
  2. Openly shared their research materials with other researchers (enabling replication attempts)
  3. Made their raw data available to other researchers

📋 Transparency and Openness Promotion (TOP) Guidelines

The excerpt includes a detailed table showing four levels (0-3) of transparency across eight criteria:

  • Citation Standards: citing data, code, and materials
  • Data Transparency: availability and posting of data
  • Analytic Methods (Code) Transparency: availability and posting of analysis code
  • Research Materials Transparency: availability and posting of materials
  • Design and Analysis Transparency: standards for describing research design
  • Preregistration of studies: registering hypotheses before data collection
  • Preregistration of analysis plans: registering analysis plans before seeing data
  • Replication: journal policies on replication studies

Level progression example for data transparency:

  • Level 0: Journal says nothing
  • Level 1: Article states whether data are available and where
  • Level 2: Data must be posted to a trusted repository
  • Level 3: Data must be posted and analyses reproduced independently before publication

🌍 Institutional adoption

The Center for Open Science has spearheaded these initiatives, leading to:

  • Formal adoption by more than 500 journals
  • Adoption by 50+ organizations
  • The list grows each week

💰 Funding agency requirements

Federal funding agencies now require:

  • Canada (Tri-Council): publication of publicly-funded research in open access journals
  • United States (National Science Foundation): similar open access requirements

The excerpt concludes: "it certainly appears that the future of science and psychology will be one that embraces greater 'openness.'"

61

Key Takeaways and Exercises

61. Key Takeaways and Exercises

🧭 Overview

🧠 One-sentence thesis

Null hypothesis testing provides a formal framework for deciding whether sample relationships reflect real population patterns or chance, but it requires careful interpretation alongside effect sizes and has prompted psychology to adopt open science practices in response to replication challenges.

📌 Key points (3–5)

  • Core logic of null hypothesis testing: assume the null hypothesis is true, calculate how likely the sample result would be under that assumption (p-value), then reject or retain the null based on that probability.
  • What determines statistical significance: both relationship strength and sample size—even weak relationships can be significant with large samples.
  • Common confusion: statistical significance ≠ practical importance; a result can be statistically significant yet trivial in real-world terms.
  • Error types: Type I error (rejecting a true null) vs Type II error (failing to reject a false null); statistical power is the probability of correctly rejecting a false null.
  • Open science response: psychology's replication crisis has led to transparency practices like pre-registration, data sharing, and digital badges to address questionable research practices.

🧪 Null hypothesis testing framework

🎯 The basic logic

Null hypothesis testing: a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.

  • Step 1: Assume the null hypothesis is true (i.e., no real relationship exists in the population).
  • Step 2: Calculate how likely the observed sample result would be if that assumption were correct.
  • Step 3: Make a decision:
    • If the sample result would be unlikely under the null → reject the null in favor of the alternative hypothesis.
    • If the sample result would not be unlikely → retain the null hypothesis.
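The three steps can be sketched as a single function. This is an illustrative one-sample test using a normal approximation to the t test; the `mu0` value and the rating data below are hypothetical:

```python
from statistics import NormalDist, fmean, stdev

def null_hypothesis_test(sample, mu0, alpha=0.05):
    """Step 1: assume H0 (population mean == mu0).
    Step 2: compute how unlikely the sample mean would be under H0
            (two-sided p-value, normal approximation).
    Step 3: reject H0 if that probability is below alpha, else retain."""
    n = len(sample)
    se = stdev(sample) / n ** 0.5
    z = (fmean(sample) - mu0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p, ("reject H0" if p < alpha else "retain H0")

# Hypothetical ratings well above mu0 = 4 -> small p -> reject the null.
p, decision = null_hypothesis_test(
    [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2], mu0=4.0
)
```

The same data tested against a `mu0` close to the sample mean (say 5.1) would yield a large p-value and a "retain H0" decision, which is the other branch of Step 3.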

📊 What the p-value means

  • The p-value is the probability of obtaining the sample result (or more extreme) if the null hypothesis were true.
  • It is based on two considerations:
    1. Relationship strength in the sample
    2. Sample size
  • Quick judgments about statistical significance can often be made by considering these two factors together.

Example: A small correlation in a very large sample might be statistically significant, while a larger correlation in a tiny sample might not be.

⚠️ Don't confuse significance with importance

  • Statistical significance tells you whether a result is unlikely due to chance alone.
  • Relationship strength and practical significance tell you whether the result matters in the real world.
  • Even weak relationships can be statistically significant if the sample size is large enough.
  • Always consider relationship strength and practical significance in addition to statistical significance.

🔧 Common null hypothesis tests

📏 t-tests for comparing means

The excerpt describes three types of t-tests:

  • One-sample t-test: used to compare one sample mean with a hypothetical population mean (a specific value of interest)
  • Dependent-samples t-test: used in within-subjects designs; compares two means from the same participants
  • Independent-samples t-test: used in between-subjects designs; compares two means from different groups

Example: To test whether university students rate themselves as friendlier than average (mean = 4), use a one-sample t-test comparing their mean rating to 4.

📐 ANOVA for comparing multiple means

Analysis of variance (ANOVA): a statistical test used when there are more than two groups or condition means to be compared.

Three main types mentioned:

  • One-way ANOVA: between-subjects designs with one independent variable
  • Repeated-measures ANOVA: within-subjects designs
  • Factorial ANOVA: factorial designs (multiple independent variables)

🔗 Testing correlations

  • A null hypothesis test of Pearson's r compares a sample correlation with a hypothetical population value of 0.
  • The test determines whether the observed correlation is unlikely to occur if the true population correlation were zero.
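This test can be made concrete with the standard t statistic for a correlation, t = r·sqrt((n − 2)/(1 − r²)). The sketch below uses a normal approximation to that statistic's p-value (adequate for moderate-to-large n), and the r and n values are illustrative:

```python
from statistics import NormalDist

def r_p_value(r, n):
    """Two-sided test of H0: population correlation = 0,
    via t = r * sqrt((n - 2) / (1 - r**2)), normal approximation."""
    t = r * ((n - 2) / (1 - r ** 2)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(t)))

# A weak correlation in a large sample can be significant while a
# stronger correlation in a tiny sample is not.
weak_large = r_p_value(0.10, 1000)
strong_small = r_p_value(0.50, 10)
```

This also illustrates the earlier point about significance depending on both relationship strength and sample size: r = .10 with n = 1000 is significant, while r = .50 with n = 10 is not.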

⚖️ Errors and statistical power

❌ Type I and Type II errors

The decision to reject or retain the null hypothesis is not guaranteed to be correct:

  • Type I error: rejecting the null hypothesis when the null is actually true
  • Type II error: failing to reject (retaining) the null hypothesis when the null is actually false

Example: In comparing two psychotherapy forms, a Type I error means concluding one is better when they're actually equally effective; a Type II error means concluding they're equally effective when one truly is better.

💪 Statistical power

Statistical power: the probability of rejecting the null hypothesis given the expected strength of the relationship in the population and the sample size.

  • Power depends on:
    • The expected relationship strength in the population
    • The sample size
  • Researchers should ensure their studies have adequate statistical power before conducting them.
  • Higher power reduces the risk of Type II errors (missing real effects).
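Both determinants of power can be seen in a Monte Carlo sketch. The effect size, sample sizes, and normal-approximation test below are illustrative assumptions, not values from the excerpt:

```python
import random
from statistics import NormalDist, fmean, stdev

def estimated_power(effect, n, alpha=0.05, trials=2000, seed=7):
    """Estimate power by simulation: the fraction of studies that
    reject H0 when the true population mean is `effect` standard
    deviations away from the null value of 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(effect, 1) for _ in range(n)]
        z = fmean(sample) / (stdev(sample) / n ** 0.5)
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        if p < alpha:
            rejections += 1
    return rejections / trials

# Larger samples (or larger true effects) yield higher power,
# i.e., fewer Type II errors.
low_power = estimated_power(effect=0.5, n=15)
high_power = estimated_power(effect=0.5, n=60)
```

Running a sketch like this before data collection is one way to check that a planned study has adequate power, as the bullet above recommends.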

🔬 Criticisms and the replicability crisis

🤔 Criticisms of null hypothesis testing

The excerpt notes three main criticisms:

  1. Researchers misunderstand it (e.g., common misinterpretations of p-values)
  2. It is illogical (philosophical objections to the approach)
  3. It is uninformative (doesn't tell you effect size or practical importance)

Counter-argument: Others argue it serves an important purpose, especially when used with:

  • Effect size measures
  • Confidence intervals
  • Other complementary techniques

Despite criticisms, it remains the dominant approach to inferential statistics in psychology.

🔄 The replication crisis

  • In recent years, psychology has grappled with a failure to replicate research findings.
  • Two interpretations:
    • Some view this as a normal aspect of science
    • Others suggest it highlights problems stemming from questionable research practices

Don't confuse: A single failed replication doesn't necessarily invalidate a finding, but systematic replication failures point to deeper issues.

🌐 Open science practices as a response

Open science practices: increase the transparency and openness of the research process.

Key practices mentioned:

  • Pre-registration of hypotheses: researchers declare their predictions and analysis plans before collecting data
  • Sharing of raw data: making datasets available to other researchers
  • Sharing of research materials: providing access to stimuli, protocols, and procedures
  • Digital badges: incentives to encourage these transparent practices

Why these matter: They address questionable research practices like p-hacking (manipulating data or analyses to achieve significant results) and HARKing (Hypothesizing After Results are Known).

Example: Pre-registration prevents researchers from changing their hypotheses after seeing the data, which would inflate false-positive rates.