Applied Combinatorics

1 An Introduction to Combinatorics

1.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Combinatorics addresses concrete counting and optimization problems through three main themes—discrete structures, enumeration, and algorithms—and this introduction illustrates the subject's accessible yet deep nature by posing sample problems from distribution, graph theory, number theory, and optimization.

📌 Key points (3–5)

  • Three principal themes: discrete structures (graphs, posets, patterns, etc.), enumeration (permutations, combinations, recurrence relations, etc.), and algorithms/optimization (sorting, shortest paths, network flows, etc.).
  • Enumeration focus: many basic problems count distributions of objects into cells, where distinguishability of objects and cells matters.
  • Common confusion: whether objects (or cells) are distinguishable changes the counting problem—e.g., ten identical dollar bills vs. ten distinct books.
  • Accessible but deep: the course starts with informal, concrete examples but will require more precision later; many questions remain open even for experts.

🎯 What combinatorics studies

🎯 The three principal themes

  • Discrete structures: graphs, digraphs, networks, designs, posets, strings, patterns, distributions, coverings, partitions.
  • Enumeration: permutations, combinations, inclusion/exclusion, generating functions, recurrence relations, Pólya counting.
  • Algorithms and optimization: sorting, Eulerian circuits, Hamiltonian cycles, planarity testing, graph coloring, spanning trees, shortest paths, network flows, bipartite matchings, chain partitions.
  • The excerpt emphasizes that combinatorics is "very concrete" with a wide range of applications, but also has an "intellectually appealing theoretical side."
  • The goal is to give a taste of both practical and theoretical aspects.

📚 Approach of the introduction

  • This preliminary chapter provides a "first look" at combinatorial problems through informal examples.
  • The discussion is deliberately informal at this stage to motivate topics; precision comes later.
  • Many questions are posed that cannot be answered immediately—some may never be fully answered (the excerpt jokes that solving them might earn you fame and a Ph.D.).

🔢 Enumeration problems: distributions of objects into cells

🔢 Core concept: objects and cells

Enumeration in combinatorics: counting the number of ways to distribute objects into cells, where distinguishability of objects and cells matters, and cells may be arranged in patterns.

  • The excerpt introduces this through a concrete scenario: Amanda distributing money or books to her three children (Dawn, Keesha, and Seth).
  • The key variable is whether objects (dollar bills vs. books) and cells (children) are distinguishable.

💵 Indistinguishable objects: ten dollar bills

Question 1: Amanda has ten one-dollar bills to give to her three children. How many ways can she distribute them?

  • Hidden assumption: Amanda does not distinguish individual dollar bills (e.g., by serial numbers).
  • She only decides the amount each child receives.
  • Example distributions:
    • Dawn gets 4, Keesha gets 4, Seth gets 2.
    • Keesha gets all 10, Dawn and Seth get 0 (though this may not make them happy).

Question 2: How many sequences of the form a₁, a₂, a₃ exist, where a₁ ≥ a₂ ≥ a₃ and a₁ + a₂ + a₃ = 10?

  • This reformulates the distribution problem as counting non-increasing sequences that sum to 10.
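For numbers this small, Question 2 can be settled by brute force; a minimal Python sketch (not from the text) that enumerates the non-increasing triples directly:

```python
# Count non-increasing triples (a1, a2, a3) of non-negative integers
# with a1 >= a2 >= a3 and a1 + a2 + a3 == 10 (Question 2).
count = sum(
    1
    for a1 in range(11)
    for a2 in range(a1 + 1)      # a2 <= a1
    for a3 in range(a2 + 1)      # a3 <= a2
    if a1 + a2 + a3 == 10
)
print(count)  # 14
```

The 14 triples are exactly the partitions of 10 into at most three parts.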

Question 3: If each child must receive at least one dollar, how do the answers change?

  • This adds a constraint (minimum distribution per cell).

📚 Distinguishable objects: ten books

Question 4: Amanda has ten distinct books (the top 10 from the New York Times best-seller list) to give to her children. How many ways can she distribute them?

  • Hidden assumption: the ten books are all different (distinguishable).
  • This is a fundamentally different counting problem from the dollar bills.

Question 5: The books are labeled B₁, B₂, ..., B₁₀. How many sets {S₁, S₂, S₃} exist where S₁, S₂, S₃ are pairwise disjoint and their union is {B₁, B₂, ..., B₁₀}?

  • This reformulates the problem in terms of set partitions.
  • Each child receives a subset of books; the subsets do not overlap and together cover all books.
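Because the books are distinct, Question 4 (with empty hands allowed) can be counted by letting each book independently go to one of the three children; a quick sanity check, not from the text:

```python
from itertools import product

children = ("Dawn", "Keesha", "Seth")
# Each of the 10 distinct books is assigned independently to one child,
# so there are 3**10 labeled distributions.
assignments = list(product(children, repeat=10))
print(len(assignments))  # 59049 == 3**10
```

Note this counts distributions to named children; Question 5's unordered sets {S₁, S₂, S₃} are a different count.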

Question 6: If each child must receive at least one book, how do the answers change?

  • Again, a minimum constraint per cell.

🔍 Distinguishability matters

Don't confuse:

  • Indistinguishable objects (dollar bills): only the total amount per child matters.
  • Distinguishable objects (books): which specific items each child receives matters.

The excerpt highlights this as a "hidden assumption" that changes the nature of the counting problem.

📏 Scaling up

Question 7: How would we answer these questions if "ten" were really "ten thousand" and "three" were "three thousand"?

  • Could you write the answer on a single page?
  • This hints at the need for formulas, generating functions, or algorithms rather than brute-force enumeration.

🔗 Other combinatorial domains (mentioned)

🔗 Graph theory, number theory, optimization

The excerpt states that the preliminary chapter will choose examples from:

  • Enumeration (covered above)
  • Graph theory
  • Number theory
  • Optimization

However, the provided text does not yet detail the graph theory, number theory, or optimization examples. It mentions that a circular necklace problem with six beads of three colors will be discussed (Figure 1.1 shows four necklaces, the first three being the same), but the excerpt cuts off before elaborating.

🎨 Necklace problem (teaser)

  • A circular necklace with six beads will be assembled using three different colors.
  • The excerpt notes that the first three necklaces shown are "actually the same necklace" (each has three red beads), hinting at symmetry and equivalence under rotation/reflection.
  • This is a preview of more advanced enumeration (likely Pólya counting, mentioned in the themes).

🎓 Course philosophy

🎓 Informal to formal progression

  • The introduction is deliberately informal to make combinatorics accessible.
  • Later stages will require more precision.
  • The excerpt warns: "you'll only be able to answer a few [questions] now. Later, you'll be able to answer many more … but most likely you'll never be able to answer them all."

🎓 Motivation and engagement

  • The tone is encouraging and slightly humorous (e.g., "if we're wrong, you'll become very famous" and "you'll get an A++ and maybe even a Ph.D.").
  • The goal is to show that combinatorics is both "fascinating and captivating" and has real-world applications.

1.2 Enumeration

🧭 Overview

🧠 One-sentence thesis

Enumeration problems in combinatorics center on counting the number of ways to distribute objects into cells under various constraints, and the difficulty of these problems grows dramatically with scale.

📌 Key points (3–5)

  • Core task: counting distributions of objects into cells, where objects and cells may or may not be distinguishable, and cells may be arranged in patterns.
  • Distinguishability matters: whether we can tell objects apart (e.g., ten different books vs. ten identical dollar bills) fundamentally changes the counting problem.
  • Constraints change counts: adding requirements like "each child gets at least one" alters the answer to distribution questions.
  • Common confusion: distinguishing objects vs. amounts—distributing ten identical dollar bills means deciding amounts each child receives, not tracking individual bills by serial number.
  • Scale challenge: problems that are manageable with small numbers (ten objects, three recipients) become computationally difficult or even impossible to write down when scaled to thousands.

💵 Distributing indistinguishable objects

💵 The dollar bill problem

When distributing ten one dollar bills to three children, the assumption is that Amanda does not distinguish individual dollar bills by serial numbers; she only decides the amount each child receives.

  • The objects (dollar bills) are treated as identical.
  • What matters is the total each child gets, not which specific bills.
  • Example: giving Dawn 4 dollars, Keesha 4 dollars, and Seth 2 dollars is one distribution; giving all 10 dollars to Keesha is another.

📊 Sequences in non-increasing order

The excerpt asks: how many sequences of the form a₁, a₂, a₃ exist where:

  • a₁ ≥ a₂ ≥ a₃
  • a₁ + a₂ + a₃ = 10

This reformulates the distribution problem as counting ordered sequences that sum to a fixed total.

➕ Adding constraints

  • If each child must receive at least one dollar, both the distribution count and the sequence count change.
  • The constraint reduces the number of valid distributions by eliminating cases where any child receives zero.
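The effect of the constraint on the sequence count can be checked by the same brute force, restricted to parts of at least 1 (a sketch, not from the text):

```python
# Non-increasing triples with every part >= 1 summing to 10
# (the "each child gets at least one dollar" constraint).
constrained = [
    (a1, a2, a3)
    for a1 in range(1, 11)
    for a2 in range(1, a1 + 1)
    for a3 in range(1, a2 + 1)
    if a1 + a2 + a3 == 10
]
print(len(constrained))  # 8
```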

📚 Distributing distinguishable objects

📚 The book problem

When distributing ten books from the New York Times best-seller list, the hidden assumption is that the ten books are all different.

  • Unlike dollar bills, books are distinguishable objects.
  • Each book can be identified (labeled B₁, B₂, …, B₁₀).
  • The question becomes: how many ways can we assign distinct books to three children?

🔢 Set partitions

The excerpt describes the distribution in set notation:

  • Each child receives a set of books: S₁, S₂, S₃.
  • The sets are pairwise disjoint (no book goes to two children).
  • The union S₁ ∪ S₂ ∪ S₃ equals the full set {B₁, B₂, …, B₁₀}.

This is a set partition problem: counting ways to divide a set into non-overlapping subsets.

➕ Minimum allocation constraint

  • If each child must receive at least one book, the count changes.
  • This eliminates partitions where any Sᵢ is empty.
  • Don't confuse: "at least one book" is different from "at least one dollar"—the former involves distinct items, the latter identical units.

📿 Necklace problems and symmetry

📿 Circular arrangements

The excerpt introduces necklaces made with beads of different colors:

  • A necklace is circular, so rotations of the same arrangement are considered identical.
  • Example: three necklaces shown with three red beads, two blue, and one green are actually the same necklace because one can be rotated into another.
  • A fourth necklace with the same color counts is different because no rotation makes it match the first three.

🎨 Counting with and without color constraints

The excerpt poses three necklace questions:

  1. How many necklaces with exactly three red, two blue, one green?
  2. How many necklaces using red, blue, green beads (not all colors required)?
  3. How many necklaces where all three colors must be used?

Each question adds or removes constraints on color usage.
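The excerpt does not give a counting method, but arrangements up to rotation are classically counted with Burnside's lemma (a preview of the Pólya counting named among the themes). A sketch, assuming rotations only; allowing flips as well would require the dihedral group instead:

```python
from math import gcd

def necklaces(n_beads, n_colors):
    """Count colorings of a cycle of n_beads up to rotation,
    via Burnside: average n_colors**gcd(k, n) over rotations k."""
    total = sum(n_colors ** gcd(k, n_beads) for k in range(n_beads))
    return total // n_beads

print(necklaces(6, 3))  # 130 six-bead necklaces from 3 colors, rotations only
```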

🔄 Symmetry and equivalence

  • Rotations create equivalence classes: arrangements that look different in a list but are the same necklace.
  • Don't confuse: linear arrangements (where order matters from a fixed starting point) vs. circular arrangements (where rotations are identical).

🚀 Scaling and computational limits

🚀 From small to large numbers

The excerpt repeatedly asks: what if ten becomes ten thousand and three becomes three thousand?

  • Small-scale problems (10 objects, 3 recipients) can be solved by hand or simple enumeration.
  • Large-scale problems (10,000 objects, 3,000 recipients) may produce answers too large to write on a single page.
  • Example: the number of ways to distribute 10,000 items might require special software and long computation time.

💻 Practical challenges

  • Small (10 objects, 3 recipients): manageable by hand or basic calculation.
  • Large (thousands of objects/recipients): the answer may be too large to write; software is required.
  • Very large (e.g., 6,000-bead necklaces, 3,000 colors): exact computation may be infeasible; special algorithms are needed.
  • The excerpt hints that even knowing the exact answer may require understanding what software to use and how long the computation would take.
  • This foreshadows the importance of algorithms and optimization in combinatorics.

1.3 Combinatorics and Graph Theory

🧭 Overview

🧠 One-sentence thesis

Graphs provide a precise, computable way to model relationships and structures, enabling us to pose and solve challenging combinatorial problems about paths, cliques, independent sets, and real-world scenarios like radio frequency assignment.

📌 Key points (3–5)

  • What a graph is: a vertex set and a collection of edges (2-element subsets of vertices), precisely describable in a text file.
  • Core graph concepts: adjacency, paths, cycles, cliques, independent sets, and components—all natural ways to describe structure.
  • Computational challenge: some graph questions (like finding a cycle) may be easier to verify or solve than others (like finding a large clique), even for the same graph.
  • Common confusion: adjacency vs. edge—two vertices are adjacent if there is an edge between them; not all vertex pairs are adjacent.
  • Real-world modeling: graphs naturally represent problems like radio station frequency assignment, where proximity constraints become edges.

📐 What graphs are and how to describe them

📐 Definition and structure

A graph G consists of a vertex set V and a collection E of 2-element subsets of V. Elements of E are called edges.

  • In this course, the vertex set is almost always V = {1, 2, 3, ..., n} for some positive integer n.
  • This convention allows graphs to be described precisely with a text file:
    • First line: a single integer n (number of vertices).
    • Each remaining line: a pair of distinct integers specifying an edge.
  • Example: The excerpt shows a graph with 9 vertices and 10 edges defined by a text file.

🗂️ Why this matters

  • Graphs can be stored and processed by computers.
  • The text-file format makes it possible to work with large graphs (e.g., 1500 vertices) and ask computational questions.
  • Don't confuse: the graph itself is the abstract structure (vertices and edges); the text file is just one way to represent it.
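The text-file convention described above translates directly into an adjacency structure; a minimal parser (the sample edge list is hypothetical, not Figure 1.3's):

```python
def parse_graph(text):
    """Parse the format 'first line n, then one edge per line'
    into a dictionary mapping each vertex to its set of neighbors."""
    lines = text.strip().splitlines()
    n = int(lines[0])
    adj = {v: set() for v in range(1, n + 1)}
    for line in lines[1:]:
        u, v = map(int, line.split())
        adj[u].add(v)
        adj[v].add(u)
    return adj

# Hypothetical 5-vertex example in the described format.
sample = """5
1 2
2 3
3 4
4 5
5 1"""
g = parse_graph(sample)
print(sorted(g[2]))  # neighbors of vertex 2
```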

🧩 Core graph terminology

🧩 Vertices and edges

  • Vertices: the elements of the vertex set (e.g., 1, 2, 3, ..., 9).
  • Edge: a 2-element subset like {2, 6}.
  • Adjacent vertices: two vertices are adjacent if there is an edge between them (e.g., vertices 5 and 9 are adjacent if {5, 9} is an edge).
  • Not an edge: if {5, 4} is not an edge, there is no direct connection between vertices 5 and 4.
  • Not adjacent: vertices 3 and 7 are not adjacent if {3, 7} is not an edge.

🛤️ Paths and cycles

  • Path: a sequence of vertices where consecutive pairs are edges.
    • Example: P = (4, 3, 1, 7, 9, 5) is a path of length 5 from vertex 4 to vertex 5.
    • Length = number of edges in the path.
  • Cycle: a closed path, i.e., a path whose last vertex is adjacent to its first.
    • Example: C = (5, 9, 7, 1) is a cycle of length 4.
  • Don't confuse: path length counts edges, not vertices.

🔗 Components and connectivity

  • Disconnected graph: a graph in which some pair of vertices is joined by no path; it falls into separate pieces.
  • Component: a maximal connected subgraph.
    • Example: the graph G is disconnected and has two components; one component has vertex set {2, 6, 8}.

🔺 Triangles, cliques, and independent sets

  • Triangle: a set of three vertices all adjacent to each other (e.g., {1, 5, 7}).
  • Clique: a set of vertices where every pair is adjacent.
    • Example: {1, 7, 5, 9} is a clique of size 4.
  • Independent set: a set of vertices where no pair is adjacent.
    • Example: {4, 2, 8, 5} is an independent set of size 4.
  • Don't confuse: cliques are "all connected"; independent sets are "none connected."

🔍 Challenging graph problems

🔍 Questions about a graph

The excerpt poses several questions for a given graph G (Figure 1.4, with 24 vertices):

  1. What is the largest k for which G has a path of length k?
  2. What is the largest k for which G has a cycle of length k?
  3. What is the largest k for which G has a clique of size k?
  4. What is the largest k for which G has an independent set of size k?
  5. What is the shortest path from vertex 7 to vertex 6?

⚖️ Computational difficulty

  • Verification vs. finding: Suppose we have a graph on 1500 vertices and ask whether it contains a cycle of length at least 500.
    • Raoul says yes, Carla says no. How do we decide who is right?
    • If someone claims a cycle exists, they can show it (verification may be easier).
  • Clique vs. cycle: Is determining whether a graph has a clique of size 500 harder, easier, or about the same as determining whether it has a cycle of size 500?
    • Helene says she doesn't think the graph has a clique of size 500 but isn't certain. Is it reasonable to insist she make up her mind?
    • The excerpt hints that some problems (like finding large cliques) may be computationally harder than others (like finding cycles).
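The verification asymmetry can be made concrete: checking Raoul's claimed cycle is a quick scan over consecutive pairs, whereas finding one from scratch may not be. A sketch of such a checker (the adjacency sets below are hypothetical, loosely echoing the cycle C = (5, 9, 7, 1)):

```python
def is_cycle(adj, vertices):
    """Check that the listed distinct vertices form a cycle:
    each consecutive pair is adjacent, last is adjacent to first."""
    if len(vertices) < 3 or len(set(vertices)) != len(vertices):
        return False
    return all(
        vertices[(i + 1) % len(vertices)] in adj[v]
        for i, v in enumerate(vertices)
    )

# Hypothetical adjacency sets containing just one 4-cycle.
adj = {5: {9, 1}, 9: {5, 7}, 7: {9, 1}, 1: {7, 5}}
print(is_cycle(adj, [5, 9, 7, 1]))  # True
print(is_cycle(adj, [5, 7, 9, 1]))  # False: 5 and 7 are not adjacent here
```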

🌍 Real-world graph modeling

📻 Radio station frequency assignment

  • Problem setup: Radio stations in the plane must broadcast on different frequencies if they are closer than 200 miles apart to avoid interference.
  • Graph representation: Define a graph where:
    • Each vertex = a radio station.
    • Each edge = a pair of stations closer than 200 miles apart.
  • Questions:
    • The excerpt shows that 6 different frequencies are enough. Can you do better?
    • Can you find 4 stations each within 200 miles of the other 3? (This is asking for a clique of size 4.)
    • Can you find 8 stations each more than 200 miles away from the other 7? (This is asking for an independent set of size 8.)
  • Why it matters: The graph structure makes the problem precise and computable.

🎓 Class enrollment problem

  • Problem: How big must an applied combinatorics class be so that there are either:
    • (a) six students with each pair having taken at least one other class together, or
    • (b) six students with each pair together in a class for the first time?
  • Graph interpretation: This is a graph problem about finding structure (cliques or independent sets) in a network of students.
  • Difficulty: Is this really a hard problem, or can we figure it out in just a few minutes, scribbling on a napkin?

🔢 Combinatorics and number theory connection

🔢 Collatz sequences

Form a sequence of positive integers using the following rules: Start with a positive integer n > 1. If n is odd, the next number is 3n + 1. If n is even, the next number is n/2. Halt if you ever reach 1.

  • Example starting with 28: 28, 14, 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1.
  • Example starting with 19: 19, 58, 29, 88, 44, 22, ... (then it matches the first sequence from 22 onward).
  • Open question: Is there some positive integer n (possibly quite large) so that if you start from n, you will never reach 1?
  • Why it matters: This is a combinatorial/number-theoretic problem with no known answer.

🔐 Factoring and least common multiples

  • Middle school method: To add fractions, find the least common multiple (LCM) of the denominators.
    • Example: 2/15 + 7/12 = 8/60 + 35/60 = 43/60 (LCM of 15 and 12 is 60).
  • Easy if you can factor: If you know the prime factorizations, finding the LCM is straightforward.
    • Example: LCM of 351785000 = 2³ × 5⁴ × 7 × 19 × 23² and 316752027900 = 2² × 3 × 5² × 7³ × 11 × 23⁴ is 2³ × 3 × 5⁴ × 7³ × 11 × 19 × 23⁴ = 300914426505000.
  • Challenge: Can you factor 1961? How about 1348433?
  • Hard problem: Suppose c = 55684901170770357082442831733350405217163692355899511509652043138898236817075547152153799 is the product of two primes a and b. Can you find them?
  • Computational reality: Most calculators can't handle 20-digit numbers, much less 500-digit numbers. But computer algebra systems like SageMath can multiply large numbers instantly, yet factoring large numbers may take a very long time.
  • Practical question: If factoring is hard, can you find the LCM of two 500-digit integers by another method? How should middle school students be taught to add fractions?

💻 Software tools

  • SageMath: An open-source computer algebra system that can factor integers, multiply large numbers, and solve combinatorial problems.
  • Example: The excerpt shows SageMath code that factors 300914426505000 instantly and multiplies two large primes a and b to get c.
  • Limitation: Asking SageMath to factor c (the product of two large primes) will likely take a very long time, illustrating that factoring is computationally hard.

1.4 Combinatorics and Number Theory

🧭 Overview

🧠 One-sentence thesis

Combinatorial techniques prove useful in number theory, particularly for problems involving sequences, factorization, and integer partitions, even though number theory itself is not the primary focus of this text.

📌 Key points (3–5)

  • Connection between fields: Combinatorial methods can illuminate number theory problems, even though the subject is not number theory itself.
  • Factorization difficulty: Factoring large integers into primes is computationally hard, while multiplying primes is easy—this asymmetry matters for practical problems like adding fractions with large denominators.
  • Integer partitions: Counting the ways to partition an integer (e.g., into odd parts vs. distinct parts) is an enumerative problem with surprising patterns.
  • Common confusion: Don't assume that because multiplication is easy, factorization must also be easy—the difficulty is asymmetric.
  • Computational tools: Software like SageMath can multiply large numbers instantly but may struggle to factor them, illustrating the computational gap.

🔢 Sequences and open problems

🔢 Collatz sequences

Collatz sequences: sequences formed by starting with a positive integer n > 1, then applying the rule "if n is odd, the next number is 3n + 1; if n is even, the next number is n divided by 2," halting if you reach 1.

  • The excerpt gives two examples:
    • Starting with 28: the sequence is 28, 14, 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1.
    • Starting with 19: the sequence is 19, 58, 29, 88, 44, 22, and then it merges with the first sequence.
  • Observation: For any starting number between 100 and 200, the sequence eventually reaches 1.
  • Open question: Is there some (possibly very large) positive integer n such that starting from n, you will never reach 1?
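The rule is easy to implement, which makes it simple to reproduce the excerpt's examples; a minimal sketch:

```python
def collatz(n):
    """Return the sequence from n down to 1 under the rule:
    3n + 1 when n is odd, n // 2 when n is even."""
    seq = [n]
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        seq.append(n)
    return seq

print(collatz(28))
# [28, 14, 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1]
```

Starting from 19 produces 19, 58, 29, 88, 44, 22, after which it merges with the sequence above. No proof is known that the loop terminates for every n.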

🤔 Why this matters

  • The Collatz conjecture is an example of a simple-to-state problem that remains unsolved.
  • It shows how combinatorial iteration rules can lead to deep, unresolved questions in number theory.

🧮 Factorization and least common multiples

🧮 Finding least common multiples via prime factorization

  • Middle school students learn to add fractions by finding the least common multiple (LCM) of denominators.
  • Example from the excerpt:
    • To add 2/15 + 7/12, find LCM(15, 12) = 60, so the sum is 8/60 + 35/60 = 43/60.
  • If you know the prime factorizations, finding the LCM is easy:
    • For 351785000 = 2³ × 5⁴ × 7 × 19 × 23² and 316752027900 = 2² × 3 × 5² × 7³ × 11 × 23⁴, the LCM is 2³ × 3 × 5⁴ × 7³ × 11 × 19 × 23⁴.
    • You take the highest power of each prime that appears in either factorization.
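There is also a classical way to get the LCM without factoring at all: lcm(a, b) = a·b / gcd(a, b), and Euclid's algorithm computes the gcd quickly even for enormous integers. Checking this against the excerpt's numbers:

```python
from math import gcd

a, b = 351785000, 316752027900
# Euclid's algorithm makes gcd fast even for 500-digit integers,
# so lcm(a, b) = a * b // gcd(a, b) sidesteps factoring entirely.
lcm = a * b // gcd(a, b)
print(lcm)  # 300914426505000, matching the prime-factorization answer
```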

🔐 The hard part: factoring large integers

  • The excerpt asks: Can you factor 1961? How about 1348433?
  • Challenge problem: The integer c = 556849011707703570824428317333504052171636923558995115096520431388982368170755475721537999 is the product of two primes a and b. Can you find them?
  • The excerpt reveals how the challenge was constructed:
    • Two large primes were found: a = 2425967623052370772757633156976982469681 and b = 22953686867719691230002707821868552601124472329079.
    • SageMath can multiply a × b instantly.
    • But if you ask SageMath to factor c, you will likely wait a very long time (more than a couple of minutes).

⚖️ Asymmetry: multiplication vs. factorization

  • Multiplication is easy: Even for numbers with hundreds of digits, computers can multiply them quickly.
  • Factorization is hard: Finding the prime factors of a large composite number is computationally difficult.
  • Don't confuse: The ease of one operation does not imply the ease of its inverse.

🎓 Implications for teaching

  • The excerpt asks: If factoring is hard, can you find the LCM of two 500-digit integers by another method?
  • How should middle school students be taught to add fractions if the underlying factorization problem is so difficult?
  • Example: Most calculators can't even handle 20-digit numbers, but specialized software (Maple, Mathematica, SageMath) can work with 500-digit numbers.

🧩 Integer partitions

🧩 What are integer partitions?

Integer partition: a way of writing a positive integer as a sum of positive integers, where order does not matter.

  • The excerpt shows all 22 partitions of 8 (see Figure 1.11 in the source).
  • Two special types are highlighted:
    • Partitions into odd parts: only odd numbers appear (e.g., 5 + 3, 3 + 3 + 1 + 1, 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1).
    • Partitions into distinct parts: all parts are different (e.g., 8, 7 + 1, 6 + 2, 5 + 3, 5 + 2 + 1, 4 + 3 + 1).

🔍 A surprising pattern

  • For the integer 8, there are exactly 6 partitions into odd parts and exactly 6 partitions into distinct parts.
  • Question posed: What if we asked you to find the number of integer partitions of 25892?
  • Deeper question: Do you think the number of partitions of 25892 into odd parts equals the number of partitions into distinct parts?
  • Key insight: Is there a way to answer this question without actually calculating the number of partitions of each type?

🎯 Why this is combinatorial

  • Counting partitions is an enumerative problem.
  • The equality (or pattern) between partitions into odd parts and partitions into distinct parts suggests a deeper combinatorial structure.
  • Example: Instead of brute-force counting for 25892, there may be a bijection (one-to-one correspondence) or generating function that proves the equality.
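The pattern for 8 is easy to verify directly, and the same brute force tests other small integers (the deeper fact hinted at here, that the two counts always agree, is a theorem of Euler); a sketch:

```python
def partitions(n, max_part=None):
    """Yield partitions of n as non-increasing tuples of positive parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest

parts = list(partitions(8))
odd = [p for p in parts if all(x % 2 == 1 for x in p)]
distinct = [p for p in parts if len(set(p)) == len(p)]
print(len(parts), len(odd), len(distinct))  # 22 6 6
```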

🛠️ Computational tools

🛠️ SageMath and computer algebra systems

  • The excerpt mentions that most calculators cannot add or multiply two 20-digit numbers, let alone 500-digit numbers.
  • However, it is straightforward to write a computer program to handle such tasks.
  • Commercial tools: Maple and Mathematica.
  • Open source: SageMath, which can be used for free online via the SageMath Cloud.
  • The text includes interactive SageMath cells for factorization and multiplication.

⚡ What SageMath can and cannot do quickly

  • Multiply two large primes: instant (a × b returns c immediately).
  • Factor a large composite: very slow (factoring c takes more than a couple of minutes).
  • Don't confuse: Just because a tool can perform one operation quickly does not mean it can perform the inverse operation quickly.

1.5 Combinatorics and Geometry

🧭 Overview

🧠 One-sentence thesis

Geometric problems—such as counting regions formed by lines, determining whether points define a certain number of lines, or deciding if a graph can be drawn without crossings—can be solved or analyzed using combinatorial techniques that rely on counting and structural constraints rather than continuous methods.

📌 Key points (3–5)

  • Enumerative flavor in geometry: questions about lines, regions, points, and graph drawings can be answered by combinatorial counting and reasoning.
  • Lines and regions: a family of lines (each pair intersecting, no three meeting at one point) determines a predictable number of regions; the challenge is to find a formula or pattern.
  • Points and lines: given a set of points, the number of lines they determine is constrained by combinatorial rules; some claims (e.g., 882 points determining exactly 752 lines) may be impossible.
  • Graph drawing and crossings: some graphs cannot be drawn in the plane without edge crossings, and this can sometimes be determined just by counting vertices and edges—without examining the full structure.
  • Common confusion: continuous geometry (calculus-style optimization) vs. discrete geometry (integer constraints, graph structure)—many geometric problems are inherently combinatorial and cannot be solved by continuous approximation alone.

📐 Lines, regions, and counting

📐 Lines determining regions

The excerpt presents a family of 4 lines in the plane (Figure 1.13):

  • Each pair of lines intersects.
  • No point belongs to more than two lines (no three lines meet at one point).
  • These 4 lines determine 11 regions.

Question posed: Under the same restrictions, how many regions would 8947 lines determine?

Key insight: The number of regions is determined by the number of lines and the intersection pattern; different arrangements (respecting the constraints) may or may not yield different counts.
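The excerpt withholds the formula, but the standard incremental argument for lines in general position gives it: line k is cut by the k − 1 earlier lines into k pieces, each of which splits one region, so n lines yield 1 + n + n(n − 1)/2 regions. A sketch under that assumption:

```python
def regions(n):
    """Regions determined by n lines in general position
    (every pair meets, no three lines share a point):
    1 + n + n*(n-1)//2, by adding lines one at a time."""
    return 1 + n + n * (n - 1) // 2

print(regions(4))     # 11, matching the 4-line figure
print(regions(8947))  # 40028879
```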

📍 Points and lines they determine

Scenario (Example 1.14): Mandy claims she has found a set of 882 points in the plane that determine exactly 752 lines. Tobias disputes this claim.

The number of lines determined by a set of points is constrained by combinatorial rules.

  • The excerpt does not give the formula, but it implies that certain combinations of points and lines are impossible.
  • Who is right? The excerpt leaves this as an open question for the reader, but the framing suggests that combinatorial reasoning can resolve the dispute without drawing all the points.

🖼️ Graph drawing and edge crossings

🖼️ Graphs that cannot be drawn without crossings

The excerpt introduces a graph G (Figure 1.16) with 10 vertices and crossing edges in the drawing shown.

Question: Can you redraw G without crossing edges?

  • Some graphs must have crossing edges in any planar drawing.
  • The excerpt does not state whether this particular graph is planar or not, but it sets up the question.

🔢 Counting vertices and edges to determine planarity

Scenario (Example 1.15 continued): Sam and Deborah are given a graph with 2,843,952 vertices and 9,748,032 edges. The homework asks whether it can be drawn without edge crossings.

  • Deborah looks only at the number of vertices and edges and says "no."
  • Sam questions how she can be certain without examining the graph's structure.

Key insight: There is a combinatorial constraint (not stated explicitly in the excerpt, but implied) that relates the number of vertices and edges in a planar graph. If a graph has too many edges relative to its vertices, it cannot be planar—regardless of its specific structure.

Don't confuse:

  • Checking every possible drawing (impossible for large graphs) vs. using a counting argument based on vertices and edges alone.
  • The excerpt emphasizes that Deborah can justify her answer definitively using only these counts.
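The constraint the excerpt alludes to is, in standard treatments, the bound e ≤ 3v − 6 for simple planar graphs on v ≥ 3 vertices (a consequence of Euler's formula); Deborah's numbers violate it, which is how a count alone can settle the question. A sketch under that assumption:

```python
def can_be_planar(v, e):
    """Necessary (not sufficient) condition for planarity of a
    simple graph on v >= 3 vertices: e <= 3*v - 6."""
    return e <= 3 * v - 6

v, e = 2843952, 9748032
print(can_be_planar(v, e))  # False: 9748032 > 3*2843952 - 6 = 8531850
```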

🧮 Combinatorics vs. continuous methods

🧮 Discrete vs. continuous optimization

The excerpt contrasts:

  • Continuous optimization (calculus): farmers fencing land, crossing rivers—variables can take any real value, even irrational numbers.
  • Discrete optimization (combinatorics): only integer values make sense; many such problems are "very hard to solve in general."

Why this matters for geometry:

  • Geometric problems involving points, lines, regions, and graph drawings are inherently discrete.
  • Continuous approximations (e.g., treating a graph drawing as a calculus problem) do not work; combinatorial structure and counting arguments are required.

🔍 Examples of discrete geometric problems

The excerpt lists several:

  • Counting regions formed by lines (integer count, depends on combinatorial arrangement).
  • Determining whether a set of points can define a given number of lines (integer constraint).
  • Deciding if a graph can be drawn without crossings (structural, not continuous).

Example (implicit): You cannot "approximately" draw a graph without crossings; either it is planar or it is not—a discrete, yes-or-no answer determined by combinatorial properties.

1.6 Combinatorics and Optimization

🧭 Overview

🧠 One-sentence thesis

Many optimization problems in combinatorics involve only integer values and appear similar on the surface, yet some turn out to be very hard to solve in general—even with computers—while others may be tractable.

📌 Key points (3–5)

  • Discrete vs continuous optimization: Combinatorial optimization problems require integer solutions (e.g., which cities to visit, which edges to include), unlike continuous calculus problems where any real number is allowed.
  • Graph-based optimization examples: shortest paths, traveling salesperson tours, highway inspection tours, and minimum-cost spanning trees all involve graphs but may differ greatly in difficulty.
  • Common confusion: Problems that sound similar (e.g., different tour or path problems on the same graph) may actually be quite different in computational difficulty—"similar sounding problems might actually be quite different in the end."
  • Scalability challenge: Even with powerful computers, some problems (e.g., enumerating all spanning trees in a large graph, solving very large Sudoku puzzles) may be impractical to solve by brute force.
  • Real-world relevance: These problems model practical scenarios like network design, logistics, and resource allocation, not just abstract puzzles.

🚗 Shortest paths and tours

🛣️ Shortest path problem

Shortest path: finding the minimum-distance route between two vertices in a weighted graph.

  • Setup: Vertices represent cities, edges represent highways, and edge weights represent distances.
  • Question: What is the shortest path from vertex E to vertex B?
  • Why it matters: This is a fundamental routing problem; the excerpt presents it as one of several optimization questions on the same graph.

🧳 Traveling salesperson problem

  • Setup: A salesperson (Ariel) has a home base (city A) and must visit all other cities at least once, then return home.
  • Goal: Minimize the total distance traveled.
  • Extra constraint: Can Ariel accomplish such a tour visiting each city exactly once?
  • Implication: This problem is presented alongside the shortest-path problem, but the excerpt hints (through student dialogue) that "similar sounding problems might actually be quite different"—suggesting this may be harder.

🛠️ Highway inspection tour

  • Setup: An engineer (Sanjay) must traverse every highway (edge) each month, starting from city E.
  • Goal: Minimize total distance while covering all edges.
  • Extra constraint: Can Sanjay make such a tour traveling along each highway exactly once?
  • Distinction: This problem focuses on edges (highways) rather than vertices (cities), so it is structurally different from the traveling salesperson problem.

🌳 Spanning trees and network design

🏦 Minimum spanning tree problem

Spanning tree: a subgraph that connects all vertices with no cycles; in the bank network model, it allows any branch to communicate with any other via relays.

  • Setup: Vertices are branch banks, edge weights are the cost (in millions of dollars) of building a data link.
  • Goal: Build a network (spanning tree) that connects all branches at minimum total cost.
  • Example calculation: The spanning tree shown has weight 12 + 25 + 19 + 18 + 23 + 19 = 116 (million dollars).

🔢 Counting spanning trees

  • Challenge: How many spanning trees does a given graph have?
  • Scalability question: For a large graph (e.g., 2875 vertices), does it make sense to enumerate all spanning trees and pick the cheapest?
  • General question: For a positive integer n, how many trees have vertex set {1, 2, 3, …, n}?
  • Implication: Brute-force enumeration may be impractical; the excerpt raises the question without answering it, highlighting the difficulty of combinatorial explosion.
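The general question has a classical answer not given in the excerpt: Cayley's formula says there are n^(n−2) labeled trees on vertex set {1, 2, …, n}. A quick sketch of how fast this count grows, which is why brute-force enumeration fails:

```python
# Cayley's formula (not stated in the excerpt): the number of labeled trees
# on vertex set {1, ..., n} is n**(n - 2); for the complete graph K_n this
# is also its number of spanning trees.
def labeled_tree_count(n: int) -> int:
    return n ** (n - 2) if n >= 2 else 1

for n in (3, 10, 50):
    print(n, labeled_tree_count(n))
# Already at n = 50 the count has 82 digits; enumerating all spanning trees
# of a 2875-vertex graph to pick the cheapest one is hopeless.
```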

🧩 Sudoku as a combinatorial problem

🎲 What is a Sudoku puzzle

Sudoku puzzle: a 9×9 array of cells that, when completed, has the integers 1, 2, …, 9 appearing exactly once in each row, each column, and each of the nine 3×3 subsquares.

  • Legitimacy requirement: A proper Sudoku puzzle must have a unique solution.
  • Difficulty variation: The excerpt mentions "fairly easy" and "far more challenging" puzzles, as well as "Evil" ones on the web.

🤔 Open questions about Sudoku

  • Puzzle generation: How does one create good (difficult) Sudoku puzzles?
  • Puzzle solving: How could a computer be used to solve puzzles? What makes some easy and others hard?
  • Scalability: Newspapers include 16×16 Sudoku puzzles; how difficult would it be to solve a 1024×1024 Sudoku puzzle, even with a powerful computer?
  • Why it matters: The excerpt frames Sudoku as having "more substance than you might think at first glance," suggesting it exemplifies broader combinatorial and computational challenges.

💬 Student perspectives on difficulty

🧐 Distinguishing easy from hard problems

  • Alice's assumption: "A fair to middling computer science major could probably write programs to solve any of them [network problems]."
  • Dave's caution: "Similar sounding problems might actually be quite different in the end. Maybe we'll learn to tell the difference."
  • Carlos's insight: "It might not be so easy to distinguish hard problems from easy ones."
  • Key takeaway: The excerpt uses student dialogue to emphasize that superficial similarity does not imply similar computational difficulty—a central theme in combinatorial optimization.

🌍 Practical relevance

  • Zori's concern: "Who's going to pay me to count necklaces, distribute library books or solve Sudoku puzzles?"
  • Bob's response: "The problems on networks and graphs seemed to have practical applications. I heard my uncle, a very successful business guy, talk about franchising problems that sound just like those."
  • Implication: The excerpt acknowledges skepticism about abstract problems but points to real-world applications in business, logistics, and network design.
1.7 Sudoku Puzzles

🧭 Overview

🧠 One-sentence thesis

Sudoku puzzles illustrate how seemingly simple combinatorial problems can vary dramatically in difficulty, raising questions about what makes some instances easy to solve while others remain computationally hard even for powerful computers.

📌 Key points (3–5)

  • What a Sudoku puzzle is: a 9×9 grid where digits 1–9 must appear exactly once in each row, column, and 3×3 subsquare, with a unique solution.
  • Practical questions raised: how to generate difficult puzzles, how to use computers to solve them, and what distinguishes easy from hard instances.
  • Scalability concern: larger Sudoku variants (e.g., 1024×1024) may be extremely difficult to solve even with powerful computers.
  • Common confusion: problems that "sound the same" (like various network/graph problems) may actually differ greatly in computational difficulty.
  • Broader theme: concrete combinatorial problems can have deep complexity differences that are not immediately obvious.

🧩 Structure and rules

🧩 Basic definition

A Sudoku puzzle is a 9×9 array of cells that, when completed, has the integers 1, 2, …, 9 appearing exactly once in each row and each column.

  • Additionally, the numbers 1–9 must appear exactly once in each of the nine 3×3 subsquares (identified by darkened borders).
  • This third constraint—the subsquare rule—is what makes the puzzles "so fascinating."
  • A legitimate Sudoku puzzle must have a unique solution.

📏 Variants and extensions

  • The size can be expanded in an obvious way.
  • Many newspapers include a 16×16 Sudoku puzzle in their Sunday edition.
  • The excerpt mentions hypothetical 1024×1024 puzzles to illustrate scalability questions.

❓ Open questions about difficulty

❓ Generation and solving

The excerpt poses several unanswered questions:

| Question | Context |
| --- | --- |
| How to make good (difficult) puzzles? | Rory wants to create puzzles that are hard for Mandy to solve |
| How to use a computer to solve puzzles? | Mandy wants to solve Rory's constructed puzzles computationally |
| What makes some puzzles easy vs hard? | Difficulty varies significantly between instances |

  • The excerpt shows two example puzzles: one "fairly easy" and one "far more challenging."
  • These questions highlight that understanding difficulty is non-trivial.

🖥️ Computational scalability

  • For very large puzzles (e.g., 1024×1024), the excerpt asks: "How difficult would it be to solve… even if you had access to a powerful computer?"
  • This question suggests that size alone may make the problem computationally intractable.
  • The implication is that brute-force approaches may not scale well.

💡 Broader implications from discussion

💡 Distinguishing problem difficulty

The student discussion (Section 1.8) reinforces themes relevant to Sudoku:

  • Dave's caution: "Similar sounding problems might actually be quite different in the end. Maybe we'll learn to tell the difference."
  • Carlos's insight: "It might not be so easy to distinguish hard problems from easy ones."

Don't confuse: problems that appear structurally similar (like different combinatorial puzzles or network problems) may have vastly different computational complexity.

🌐 Practical resources

  • Sudoku puzzle software is available for all major operating systems.
  • Web resources exist (the excerpt mentions http://www.websudoku.com with "Evil" difficulty puzzles).
  • The excerpt notes these are popular but humorously warns against playing them during class.

🔗 Context within combinatorics

🔗 Why Sudoku matters

  • The excerpt states this example "has more substance than you might think at first glance."
  • Sudoku illustrates core combinatorial themes: constraint satisfaction, uniqueness of solutions, and computational difficulty.
  • It connects to the broader chapter theme of distinguishing easy from hard combinatorial problems.

🔗 Relation to other problems

  • The discussion section groups Sudoku with other concrete problems (necklaces, library books, network/graph problems).
  • Students debate whether these problems have practical value or are merely academic exercises.
  • Bob notes that "network problems seemed to have practical applications," suggesting Sudoku may be viewed as more recreational—but the computational questions it raises are serious.
1.8 Discussion

🧭 Overview

🧠 One-sentence thesis

The discussion reveals that combinatorics problems appear concrete and accessible but may hide deep differences in difficulty, with some problems being computationally easy and others potentially very hard despite sounding similar.

📌 Key points (3–5)

  • Student reactions vary: some find the problems practical (networks/graphs), others question real-world relevance (necklaces, Sudoku), and some appreciate the concrete, understandable nature of the material.
  • Apparent simplicity vs hidden complexity: problems that sound similar may be fundamentally different in difficulty; distinguishing "hard" from "easy" problems is not straightforward.
  • Teaching approach differs from calculus: the class poses many problems and asks students to solve them, rather than teaching a fixed method for problem types.
  • Common confusion: computational ease assumption—one student speculates that a "fair to middling" programmer could solve all the network problems, but another suggests this may not be true.
  • Accessibility: nearly all students felt they understood the day's material because it was "very concrete."

💬 Student perspectives on the course

💬 Expectations vs reality

  • Xing expected a calculus-like structure: the professor teaches methods, students apply them to problem types.
  • Instead, the class posed many problems and asked whether students could solve them.
  • This is a pedagogical shift: less direct instruction, more exploration.

🌍 Practical relevance concerns

  • Zori questions who will pay her to "count necklaces, distribute library books or solve Sudoku puzzles" after graduation.
  • Bob counters that network and graph problems "seemed to have practical applications," citing his uncle's franchising problems that "sound just like those."
  • Example: business optimization problems may map onto graph/network structures studied in class.

🎯 Concrete and accessible

  • Alice notes that "almost all" students understood the material because it was "so very concrete."
  • The problems were tangible and easy to grasp at a surface level.
  • Don't confuse: understanding the problem statement vs. understanding how to solve it or how hard it is.

🧩 Apparent simplicity vs hidden complexity

🧩 The computational ease assumption

  • Alice speculates: "A fair to middling computer science major could probably write programs to solve any of them [network problems]."
  • This assumes that similar-sounding problems are equally easy to solve computationally.

⚠️ The difficulty distinction challenge

  • Dave counters: "Similar sounding problems might actually be quite different in the end. Maybe we'll learn to tell the difference."
  • Carlos adds: "It might not be so easy to distinguish hard problems from easy ones."
  • Key insight: surface similarity does not imply computational similarity.
  • Example: two graph problems may both involve finding paths, but one might be solvable in seconds while the other takes years even on a supercomputer.

🔍 Why distinguishing matters

  • If you cannot tell hard from easy problems, you may:
    • Waste time trying to find efficient solutions to inherently hard problems.
    • Miss opportunities to apply known efficient methods to problems that look hard but are actually easy.
  • The course may teach students to recognize these distinctions.

🧠 Pedagogical implications

🧠 Problem-driven learning

  • The class does not follow a "here's the formula, now apply it" model.
  • Instead, it presents a variety of problems and asks students to engage with them.
  • This approach may help students develop problem-solving intuition and recognize structural differences.

🧠 Building discernment

  • The discussion hints that a key learning goal is to "tell the difference" between problem types.
  • Students will need to learn which problems are tractable and which are not, even when they sound similar.
  • This skill is valuable in real-world applications where problem difficulty is not always obvious upfront.
2.1 Strings: A First Look

🧭 Overview

🧠 One-sentence thesis

Strings (sequences of characters from a set) form the foundation of combinatorial mathematics because they model written communication and computation, and counting them requires multiplying the number of choices at each position.

📌 Key points (3–5)

  • What a string is: a function from positions {1, 2, ..., n} to a set X, written as x₁x₂...xₙ, where each xᵢ is a character from X.
  • Multiple names for the same concept: strings are also called sequences (when X is numbers), words (X is an alphabet of letters), or arrays (in computing).
  • How to count strings: multiply the number of options at each position—if position i has kᵢ choices, the total is k₁ × k₂ × ... × kₙ.
  • Common confusion: strings vs permutations—strings allow repetition of characters (e.g., "aababb"), while permutations require all characters to be distinct (covered in section 2.2).
  • Why it matters: strings model license plates, machine instructions, usernames, and all forms of digital and written communication.

📝 What strings are

📝 Formal definition

X-string of length n: a function s : {1, 2, ..., n} → X, where X is a set.

  • The set X is called the set of characters (or letters, or alphabet).
  • The element s(i) is the i-th character of s.
  • Notation: write s = "x₁x₂x₃...xₙ" instead of s(1) = x₁, s(2) = x₂, etc.

🔤 Alternative terminology

The excerpt lists several equivalent terms depending on context:

| Term | Context | Example |
| --- | --- | --- |
| String | General combinatorics | x₁x₂...xₙ |
| Sequence | X is numbers, defined by a rule | sᵢ = 2i - 1 (odd integers) |
| Word | X is an alphabet of letters | "aababbccabcbb" is a 13-letter word on alphabet {a, b, c} |
| Array | Computing languages | Data structure in programming |

  • Don't confuse: these are all the same mathematical object—a function from positions to a character set.

🧮 Cartesian product view

  • When position i must use a character from a specific subset Xᵢ ⊆ X, the string is an element of the cartesian product X₁ × X₂ × ... × Xₙ.
  • Written as n-tuples: (x₁, x₂, ..., xₙ) where xᵢ ∈ Xᵢ for all i.

🔢 Counting strings: the multiplication principle

🔢 Core counting rule

  • The size of a product of sets is the product of the sets' sizes.
  • If position 1 has k₁ options, position 2 has k₂ options, ..., position n has kₙ options, then the total number of strings is k₁ × k₂ × ... × kₙ.

🚗 Georgia license plates (Example 2.1)

Setup: Four digits (first cannot be 0) + one space + three capital letters.

  • Position 1: 9 options (digits 1–9)
  • Positions 2–4: 10 options each (digits 0–9)
  • Position 5: 1 option (space)
  • Positions 6–8: 26 options each (capital letters A–Z)

Calculation: 9 × 10³ × 1 × 26³ = 158,184,000 total license plates.

Why multiplication works: The excerpt explains by focusing on the digit portion. Think of four blanks to fill. The first blank has 9 options. For strings starting with 1, there are 1000 possibilities (from 1000 to 1999). Alternatively, positions 2, 3, 4 each have 10 options, so 10 × 10 × 10 = 1000. Since the analysis doesn't depend on which digit (1–9) is chosen first, each of the 9 initial choices gives 1000 strings, totaling 9 × 1000 = 9000.
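The whole plate count is just a product of per-position option counts; a minimal sketch of the computation:

```python
# Georgia license plate format: 4 digits (first nonzero), a space, 3 capital letters.
options = [9, 10, 10, 10, 1, 26, 26, 26]  # choices available at each of the 8 positions

total = 1
for k in options:
    total *= k   # multiplication principle: independent choices multiply
print(total)     # 158184000
```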

💻 Special types of strings

💻 Binary strings (bit strings)

0–1 string (binary string, bit string): an X-string where X = {0, 1}.

  • Example: A 32-bit machine instruction is a bit string of length 32.
  • Each position has 2 options (0 or 1).
  • General formula: The number of bit strings of length n is 2ⁿ.
  • For n = 32: 2³² = 4,294,967,296 possible instructions.

💻 Ternary strings

Ternary string: an X-string where X = {0, 1, 2}.

  • Each position has 3 options.
  • The number of ternary strings of length n is 3ⁿ.

🌐 Website usernames (Example 2.3)

Setup: 13-position string with varying constraints per position.

| Position | Allowed characters | Count |
| --- | --- | --- |
| 1 | Upper-case letters (U) | 26 |
| 2–6 | Upper/lower letters + digits (U + L + D) | 62 each |
| 7 | '@' or '.' | 2 |
| 8–12 | Lower-case letters + '*', '%', '#' (L + 3 symbols) | 29 each |
| 13 | Digits (D) | 10 |

Calculation: 26 × 62⁵ × 2 × 29⁵ × 10 = 9,771,287,250,890,863,360 possible usernames.

Method: Visualize 13 blanks, write the number of options above each blank, then multiply all the numbers together because each choice is independent.
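The blank-filling method translates directly into a product over the 13 positions, with the per-position counts from the table above:

```python
import math

# Per-position option counts for the 13-character username (Example 2.3):
# one upper-case letter, five alphanumerics, '@' or '.',
# five lower-case-or-symbol characters, one digit.
counts = [26] + [62] * 5 + [2] + [29] * 5 + [10]

total_usernames = math.prod(counts)  # multiplication principle over the 13 blanks
print(total_usernames)               # 9771287250890863360
```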

🔄 Strings vs permutations (preview)

🔄 Repetition allowed vs distinct characters

  • Strings (this section): characters may repeat.
    • Example: "01110000" is a valid 8-character bit string.
    • Example: "aababb" uses 'a' and 'b' multiple times.
  • Permutations (section 2.2): all characters must be distinct.
    • Example: Drawing letters from a bag without replacement.
    • "yellow" cannot be a permutation (uses 'l' twice).
    • "jacket" can be a permutation (all letters distinct).

🔄 Counting permutations (brief preview)

  • For a 6-character permutation from 26 English letters:
    • Position 1: 26 choices
    • Position 2: 25 choices (one letter removed)
    • Position 3: 24 choices
    • ...
    • Position 6: 21 choices
    • Total: 26 × 25 × 24 × 23 × 22 × 21
  • This is a little more than half (about 54%) of the total number of 6-character strings (26⁶ = 308,915,776), because repetition is not allowed.

Don't confuse: The excerpt emphasizes that permutations require |X| ≥ n (the set must have at least as many elements as the string length), whereas ordinary strings have no such restriction.

2.2 Permutations

🧭 Overview

🧠 One-sentence thesis

Permutations count strings where each symbol appears at most once, and the formula P(m, n) = m!/(m - n)! gives the number of ways to arrange n distinct items chosen from m items when order matters.

📌 Key points (3–5)

  • What a permutation is: a string where all characters are distinct (no symbol repeats).
  • How permutations differ from general strings: general strings allow repetition (e.g., "01110000"), but permutations do not (e.g., "jacket" is valid, "yellow" is not because "l" repeats).
  • The counting formula: P(m, n) = m × (m - 1) × (m - 2) × … × (m - n + 1) counts permutations of length n from an m-element set.
  • Common confusion: permutations vs combinations—permutations care about order (different positions = different outcomes), while combinations (introduced at the end) treat positions as equal.
  • Why it matters: permutations model real scenarios like officer elections, license plates with distinct digits, and drawing items without replacement.

🔤 From strings to permutations

🔤 What general strings allow

  • In the previous section, strings allowed repetition of symbols.
  • Example: "01110000" is a valid bit string of length eight, even though "0" and "1" repeat multiple times.
  • The number of strings of length n from an m-element set X is m^n (m choices per position, n positions).

🚫 When repetition is not allowed

Permutation: an X-string s = x₁x₂…xₙ where all n characters are distinct.

  • Many real-world scenarios require that each symbol be used at most once.
  • Example: Drawing 26 letters from a bag one at a time without returning them—"yellow" cannot be formed (uses "l" twice), but "jacket" can.
  • A permutation of length n from set X requires that the size of X is at least n (|X| ≥ n).

🧮 Counting permutations

🧮 The factorial and P(m, n) formula

  • Factorial notation: n! = n × (n - 1) × (n - 2) × … × 3 × 2 × 1; by convention, 0! = 1.
    • Example: 7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040.
  • P(m, n) formula: For integers m ≥ n ≥ 0,
    • P(m, n) = m! / (m - n)! = m × (m - 1) × … × (m - n + 1).
    • Example: P(9, 3) = 9 × 8 × 7 = 504.
    • Example: P(8, 4) = 8 × 7 × 6 × 5 = 1680.
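Python's standard library exposes these functions directly (math.factorial, and math.perm in 3.8+), so the examples above can be checked mechanically:

```python
import math

# P(m, n) = m! / (m - n)!  -- the number of length-n permutations from an m-set.
def P(m: int, n: int) -> int:
    return math.factorial(m) // math.factorial(m - n)

print(P(9, 3))                      # 504
print(P(8, 4))                      # 1680
print(P(9, 3) == math.perm(9, 3))  # the built-in agrees: True
```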

🔍 Why the formula works (Proposition 2.6)

  • When constructing a permutation of length n from an m-element set X:
    • First position (x₁): m choices (any element of X).
    • Second position (x₂): m - 1 choices (any element except x₁).
    • Third position (x₃): m - 2 choices (any element except x₁ and x₂).
    • nth position (xₙ): m - n + 1 choices (any element except the previous n - 1 chosen).
  • Multiplying these together: m × (m - 1) × (m - 2) × … × (m - n + 1) = P(m, n).
  • Example: Drawing 6 letters from 26 without replacement → P(26, 6) = 26 × 25 × 24 × 23 × 22 × 21 strings.

🎯 Applications and examples

🎯 Officer elections (Example 2.7)

  • Scenario: Elect 4 class officers (President, Vice President, Secretary, Treasurer) from 80 students.
  • Why permutations apply: Each position is distinct; the order matters (President ≠ Treasurer).
  • Calculation: P(80, 4) = 80 × 79 × 78 × 77 = 37,957,920 possible slates.
  • Don't confuse: If positions were equal (e.g., an executive council with no titles), order would not matter—that's a combination problem (introduced at the end of the excerpt).

🚗 License plates with distinct digits (Example 2.8)

The excerpt revisits license plates from Example 2.1, adding no-repetition constraints.

| Constraint | Calculation | Explanation |
| --- | --- | --- |
| Three letters must be distinct | P(26, 3) = 26 × 25 × 24 = 15,600 | Instead of 26³, use permutation formula |
| Three digits (positions 2–4) distinct, but can repeat first digit | 9 × P(10, 3) × 26³ | First digit: 9 choices (nonzero); next three digits: P(10, 3); letters: 26³ |
| All four digits distinct (first digit nonzero) | 9 × P(9, 3) × 26³ | First digit: 9 choices; next three digits chosen from remaining 9 → P(9, 3) |

  • Example: If the first digit is 5, the next three digits must be chosen from {0,1,2,3,4,6,7,8,9} (9 elements), so P(9, 3) ways.
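The three variants differ only in which factor becomes a permutation count; a sketch of all three, with P(m, n) computed via math.perm:

```python
import math

# Example 2.8 variants: which parts of the plate must be distinct.
distinct_letters = 9 * 10**3 * math.perm(26, 3)  # three distinct letters
distinct_tail    = 9 * math.perm(10, 3) * 26**3  # digits 2-4 distinct (may repeat digit 1)
all_distinct     = 9 * math.perm(9, 3) * 26**3   # all four digits distinct

print(distinct_letters)  # 140400000
print(distinct_tail)
print(all_distinct)
```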

🔄 Transition to combinations

🔄 When order does not matter

  • The excerpt ends by introducing a new scenario: electing an executive council of 4 students from 80, where all positions are equal.
  • Key difference: There is no difference between Alice winning the "first" seat vs. the "fourth" seat—only the set of 4 students matters, not the order.
  • This motivates combinations (covered in section 2.3): counting subsets without regard to order.
  • Don't confuse: Permutations count ordered arrangements; combinations count unordered selections.
2.3 Combinations

🧭 Overview

🧠 One-sentence thesis

Combinations count the number of ways to select a subset of k elements from an n-element set without regard to order, and this count equals n! divided by k!(n−k)!.

📌 Key points (3–5)

  • What combinations measure: the number of k-element subsets of a set, where order does not matter (unlike permutations).
  • How to compute combinations: C(n, k) = P(n, k) / k! = n! / (k!(n−k)!), because each subset can be arranged in k! different orders.
  • Key symmetry: choosing k elements to include is the same as choosing (n−k) elements to exclude, so C(n, k) = C(n, n−k).
  • Common confusion: combinations vs permutations—combinations ignore order (executive council seats are equal), permutations care about order (President vs Treasurer are different roles).
  • Useful correspondence: subsets of an n-element set correspond one-to-one with bit strings of length n.

🔢 Permutations formula (recap)

🔢 P(m, n) definition

P(m, n) = m! / (m−n)! = m(m−1)(m−2)⋯(m−n+1)

  • This counts the number of permutations (strings where all elements are distinct) of length n chosen from an m-element set.
  • Example: P(9, 3) = 9 × 8 × 7 = 504.
  • Example: P(68, 23) = 20732231223375515741894286164203929600000.

🧮 Why the formula works

  • When constructing a permutation of length n from an m-element set X:
    • m choices for the first position
    • (m−1) choices for the second position (cannot reuse the first element)
    • (m−2) choices for the third position
    • (m−n+1) choices for the nth position
  • Multiply all choices together: m(m−1)(m−2)⋯(m−n+1) = P(m, n).

📋 Example: officer slate

  • Electing 4 officers (President, Vice President, Secretary, Treasurer) from 80 students.
  • Order matters (President ≠ Treasurer), so this is a permutation problem.
  • Answer: P(80, 4) = 37,957,920 different slates.

🚗 Example: license plates with distinct letters

  • Georgia license plates with three distinct letters (positions 6–8).
  • Instead of 26³ = 17,576 ways (with repetition), now P(26, 3) = 26 × 25 × 24 = 15,600 ways.
  • Total plates (with 9 choices for first digit, 10 choices each for next 3 digits): 9 × 10³ × P(26, 3) = 140,400,000.
  • If the three digits in positions 2–4 must also be distinct: 9 × P(10, 3) × 26³.
  • If all four digits must be distinct (no repeats among positions 1–4): 9 × P(9, 3) × 26³.

🎯 What combinations are

🎯 Combination definition

A k-element subset of a set X is called a combination of size k.

  • The number of k-element subsets of an n-element set is denoted C(n, k) or (n choose k).
  • Also called binomial coefficients.
  • Read as "n choose k" or "the number of combinations of n things, taken k at a time."

🔄 Combinations vs permutations

| Aspect | Permutations | Combinations |
| --- | --- | --- |
| Order matters? | Yes | No |
| Example | Officer slate: President, VP, Secretary, Treasurer are different roles | Executive council: all 4 seats are equal |
| Formula | P(n, k) = n! / (n−k)! | C(n, k) = n! / (k!(n−k)!) |
| Count from 80 students, choose 4 | P(80, 4) = 37,957,920 | C(80, 4) = 1,581,580 |

  • Don't confuse: if positions are interchangeable (executive council), use combinations; if positions are distinct (officer roles), use permutations.

🧩 Combination formula and proof

🧩 The formula

C(n, k) = P(n, k) / k! = n! / (k!(n−k)!)

  • Key insight: each k-element subset can be arranged in k! different orders (permutations).
  • So P(n, k) counts all permutations, but each subset is counted k! times.
  • Divide by k! to avoid overcounting: C(n, k) = P(n, k) / k!.

🔍 Why this works

  • Start with P(n, k) permutations of length k from an n-element set.
  • Each of the C(n, k) subsets generates exactly k! permutations (by reordering its elements).
  • Therefore: k! × C(n, k) = P(n, k).
  • Solve for C(n, k): C(n, k) = P(n, k) / k!.

📊 Example: executive council

  • Electing a 4-member executive council from 80 students.
  • Order does not matter (all seats are equal).
  • Answer: C(80, 4) = P(80, 4) / 4! = 37,957,920 / 24 = 1,581,580.
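Both routes to the council count, dividing P(80, 4) by 4! and Python's built-in math.comb, agree:

```python
import math

slates = math.perm(80, 4)                # ordered: 37957920 officer slates
councils = slates // math.factorial(4)   # unordered: each council is counted 4! times
print(councils)                          # 1581580

print(math.comb(80, 4) == councils)      # True
```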

🔁 Symmetry and correspondence

🔁 Symmetry property

C(n, k) = C(n, n−k) for all integers n and k with 0 ≤ k ≤ n.

  • Choosing k elements to include is the same as choosing (n−k) elements to exclude.
  • Example: choosing 4 winners from 80 students = choosing 76 losers from 80 students.
  • This symmetry is useful for simplifying calculations.

🍽️ Example: vegetable plate

  • A restaurant offers 21 vegetable options.
  • A vegetable plate includes 4 different vegetables (order does not matter).
  • Answer: C(21, 4) = 5,985 different vegetable plates.

💾 Bit string correspondence

  • There is a one-to-one correspondence between subsets of an n-element set X and bit strings of length n.
  • If X = {x₁, x₂, …, xₙ}, a subset A ⊆ X corresponds to a bit string s where s(i) = 1 if and only if xᵢ ∈ A.
  • Example: X = {a, b, c, d, e, f, g, h}, subset {b, c, g} corresponds to bit string 01100010.
  • The number of bit strings of length 8 with exactly three 1's is C(8, 3) = 56.
  • Total number of subsets of an n-element set: 2ⁿ (each element is either in or out, 2 choices per element).
2.4 Combinatorial Proofs

🧭 Overview

🧠 One-sentence thesis

Combinatorial proofs demonstrate identities by showing that both sides count the same collection of objects in different ways, often yielding elegant arguments that avoid tedious algebra.

📌 Key points (3–5)

  • What combinatorial proofs do: prove identities by counting the same set of objects two different ways—left side and right side each represent a valid count of the same thing.
  • Why they matter: they replace complicated algebraic manipulations with intuitive visual or counting arguments that reveal why a formula is true.
  • Core technique: partition or group objects by some property (e.g., position of last 1 in a bit string, number of elements in a subset) so the right side sums over all cases.
  • Common confusion: don't confuse "counting the same objects" with "getting the same number by coincidence"—the proof must show both sides count exactly the same collection.
  • Key insight: many problems that don't look like "choosing subsets" can be reframed using binomial coefficients (e.g., distributing identical objects, integer solutions to equations).

🔢 Binomial coefficients and combinations

🔢 Definition and notation

Combination of size k: a k-element subset of a finite set X.

Binomial coefficient, written (n choose k) or C(n, k): the number of k-element subsets of an n-element set.

  • Read as "n choose k" or "the number of combinations of n things, taken k at a time."
  • Example: electing a 4-member executive council from 80 students is C(80, 4) = 1,581,580 ways.

🧮 Formula for binomial coefficients

The excerpt gives:

(n choose k) = C(n, k) = P(n, k) / k! = n! / (k! (n − k)!)

where P(n, k) is the number of permutations of length k from an n-element set.

Why this formula works (overcounting argument):

  • P(n, k) counts all k-permutations (order matters).
  • Each k-element subset can be turned into k! different permutations.
  • So each subset is overcounted k! times in P(n, k).
  • Dividing by k! corrects the overcount: C(n, k) = P(n, k) / k!.

🔄 Symmetry property

The excerpt states:

(n choose k) = (n choose n − k)

Interpretation: choosing k elements to include in a subset is the same as choosing (n − k) elements to exclude.

Example: electing 4 winners from 80 students is the same as choosing 76 losers.

🧵 Subsets and bit strings correspondence

The excerpt introduces a natural one-to-one correspondence:

  • Let X = {x₁, x₂, ..., xₙ} be an n-element set.
  • Each subset A ⊆ X corresponds to a bit string of length n.
  • The i-th bit is 1 if and only if xᵢ ∈ A.

Example: if X = {a, b, c, d, e, f, g, h}, the subset {b, c, g} corresponds to the bit string 01100010.

Implication: there are C(8, 3) = 56 bit strings of length 8 with exactly three 1's (corresponding to 3-element subsets).

Total number of subsets: the excerpt asks "what is the total number of subsets of an n-element set?" (This is answered in the next section: 2ⁿ.)
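The subset-to-bit-string correspondence can be checked directly. A small Python sketch (`to_bits` is a helper name introduced here for illustration):

```python
from itertools import combinations
from math import comb

X = list("abcdefgh")

def to_bits(A):
    # i-th bit is 1 if and only if x_i is in the subset A
    return "".join("1" if x in A else "0" for x in X)

assert to_bits({"b", "c", "g"}) == "01100010"

# 3-element subsets of an 8-element set <-> length-8 bit strings with three 1's
assert len(list(combinations(X, 3))) == comb(8, 3) == 56
```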

🎨 The combinatorial proof technique

🎨 What is a combinatorial proof?

The excerpt explains:

Combinatorial arguments count one thing in two different ways to prove an identity.

  • Both sides of an equation count the same collection of objects.
  • The left side groups or counts them one way; the right side counts them another way.
  • Since both count the same set, they must be equal.
  • This avoids "large amounts of tedious algebraic manipulations."

🎯 Core strategy

  1. Identify what collection of objects both sides count.
  2. Show the left side counts all objects.
  3. Show the right side partitions the same objects by some property (e.g., "according to the number of 0's" or "according to the last occurrence of a 1").
  4. Conclude both sides are equal because they count the same thing.

Don't confuse: a combinatorial proof is not just "both sides happen to equal the same number"—you must explicitly describe the same set of objects being counted.

🧩 Examples of combinatorial proofs

🧩 Sum of first n integers

Identity: 1 + 2 + 3 + ⋯ + n = n(n + 1) / 2

Proof idea (using Figure 2.14):

  • Consider an (n+1) × (n+1) array of dots.
  • Total dots: (n+1)².
  • Exactly n+1 dots lie on the main diagonal.
  • The off-diagonal dots split into two equal parts: above and below the diagonal.
  • Each part has S(n) = 1 + 2 + 3 + ⋯ + n dots.
  • So: S(n) = [(n+1)² − (n+1)] / 2 = n(n+1) / 2.

🧩 Sum of first n odd integers

Identity: 1 + 3 + 5 + ⋯ + (2n − 1) = n²

Proof idea (using Figure 2.16):

  • The left side is the sum of the first n odd integers.
  • Visually, arranging dots in an n × n square shows this sum equals n².
  • (The excerpt says "this is clearly equal to n²" from the figure.)

🧩 Sum of all binomial coefficients

Identity: (n 0) + (n 1) + (n 2) + ⋯ + (n n) = 2ⁿ

Proof:

  • Left side: groups bit strings of length n by the number of 1's.
    • (n k) counts strings with exactly k ones.
    • Summing over all k from 0 to n counts all bit strings.
  • Right side: there are 2ⁿ bit strings of length n (each position has 2 choices).
  • Both sides count the same set of bit strings.
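Both sides of the identity can be compared by brute force for a small n (a Python sketch):

```python
from math import comb

n = 12

# Left side: group bit strings by their number of 1's
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n

# Literal check: classify all 2^n strings by popcount
counts = [0] * (n + 1)
for s in range(2 ** n):
    counts[bin(s).count("1")] += 1
assert counts == [comb(n, k) for k in range(n + 1)]
```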

🧩 Sum involving binomial coefficients (Pascal's identity variant)

Identity: C(n, k+1) = C(k, k) + C(k+1, k) + C(k+2, k) + ⋯ + C(n−1, k)

Proof:

  • Left side: counts bit strings of length n with exactly k+1 ones.
  • Right side: partitions these strings by the position of the last 1.
    • If the last 1 is in position i (say i = k+5), the remaining k ones must appear in the first i − 1 = k + 4 positions, giving C(k+4, k) strings for that case.
    • Summing over all possible positions of the last 1 gives the right side.
  • Both sides count the same bit strings.

Special case: when k = 1 (so k+1 = 2), this reduces to C(n, 2) = 1 + 2 + ⋯ + (n − 1), the formula for the sum of the first n − 1 positive integers.
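This "position of the last 1" identity is also easy to verify numerically (Python sketch; small n and k chosen arbitrarily):

```python
from math import comb

n, k = 12, 3
# C(k,k) + C(k+1,k) + ... + C(n-1,k) = C(n, k+1)
assert sum(comb(i, k) for i in range(k, n)) == comb(n, k + 1)
```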

🧩 Powers of 3

Identity: 3ⁿ = (n 0)·2⁰ + (n 1)·2¹ + (n 2)·2² + ⋯ + (n n)·2ⁿ

Proof:

  • Left side: counts all strings of length n over the alphabet {0, 1, 2}.
  • Right side: partitions these strings by the number of positions that are not 2.
    • If k positions are not 2 (say k = 6), choose those positions in C(n, k) ways.
    • Fill those positions with 0 or 1: 2ᵏ ways.
    • Summing over all possible numbers of non-2 positions gives the right side.
  • Both sides count the same {0, 1, 2}-strings.
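Grouping the {0, 1, 2}-strings by their number of non-2 positions can be checked exhaustively for a small n (Python sketch):

```python
from itertools import product
from math import comb

n = 6
assert 3 ** n == sum(comb(n, k) * 2 ** k for k in range(n + 1))

# strings with exactly k positions not equal to '2': C(n, k) * 2^k of them
counts = [0] * (n + 1)
for s in product("012", repeat=n):
    counts[n - s.count("2")] += 1
assert counts == [comb(n, k) * 2 ** k for k in range(n + 1)]
```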

🧩 Sum of squares of binomial coefficients

Identity: (2n n) = (n 0)² + (n 1)² + (n 2)² + ⋯ + (n n)²

Proof:

  • Left side: counts bit strings of length 2n with exactly n ones (half the bits are 0's).
  • Right side: partitions these strings by the number of 1's in the first n positions.
    • If there are k ones in the first n positions, there must be (n − k) ones in the last n positions.
    • Number of ways: C(n, k) · C(n, n−k) = C(n, k)² (using the symmetry property).
    • Summing over all k gives the right side.
  • Both sides count the same bit strings.
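A quick numeric check of the sum-of-squares identity (Python sketch):

```python
from math import comb

n = 8
# partition length-2n strings with n ones by the count of 1's in the first half
assert comb(2 * n, n) == sum(comb(n, k) ** 2 for k in range(n + 1))
```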

🎁 Distributing identical objects (stars and bars)

🎁 Basic problem: each recipient gets at least one

Problem: distribute 18 identical folders among 4 employees (Audrey, Bart, Cecilia, Darren) so each gets at least one.

Solution (using gaps/dividers):

  • Imagine 18 folders in a row.
  • There are 17 gaps between them.
  • Choose 3 gaps and place a divider in each.
  • This divides the folders into 4 non-empty groups (one for each employee).
  • Answer: C(17, 3).

Example (Figure 2.22): Audrey gets 6, Bart gets 1, Cecilia gets 4, Darren gets 7.

🎁 No restriction: recipients may get zero

Problem: distribute 18 identical folders among 4 employees with no restriction (some may get zero).

Solution (artificial inflation trick):

  • Artificially inflate each person's allocation by 1.
  • Artificially inflate the total number of folders by 4 (one per person).
  • Now we have 22 = 18 + 4 folders.
  • Choose 3 gaps from 21 gaps to guarantee each person gets at least 1 (artificially).
  • The actual allocation is one less (so may be zero).
  • Answer: C(21, 3).

🎁 Partial restriction: only some recipients guaranteed

Problem: distribute 18 folders so only Audrey and Cecilia are guaranteed at least one; Bart and Darren may get zero.

Solution:

  • Artificially inflate only Bart and Darren's allocations (add 1 each).
  • Leave Audrey and Cecilia's allocations as is.
  • Total folders: 18 + 2 = 20.
  • Choose 3 gaps from 19 gaps.
  • Answer: C(19, 3).
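All three folder answers can be confirmed by brute force (a Python sketch; `count_distributions` is a helper name introduced here):

```python
from math import comb

def count_distributions(total, minimums):
    # count ways to write `total` as an ordered sum meeting per-person minimums
    def rec(i, left):
        if i == len(minimums) - 1:
            return 1 if left >= minimums[i] else 0
        return sum(rec(i + 1, left - x) for x in range(minimums[i], left + 1))
    return rec(0, total)

assert count_distributions(18, [1, 1, 1, 1]) == comb(17, 3)  # all at least one
assert count_distributions(18, [0, 0, 0, 0]) == comb(21, 3)  # zeros allowed
assert count_distributions(18, [1, 0, 1, 0]) == comb(19, 3)  # Audrey, Cecilia only
```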

🎁 Reformulation as integer solutions

The excerpt reformulates these problems as counting integer solutions to inequalities or equations.

Example: count integer solutions to x₁ + x₂ + x₃ + x₄ + x₅ + x₆ = 538 (or ≤ 538) subject to restrictions on xᵢ.

  • All xᵢ > 0, equality: C(537, 5) (each variable at least 1; 537 gaps among 538 objects, choose 5 dividers)
  • All xᵢ ≥ 0, equality: C(543, 5) (inflate by 6, one per variable: 538 + 6 = 544 objects, 543 gaps, choose 5)

(The excerpt cuts off after listing two cases.)

Don't confuse:

  • "All xᵢ > 0" (each variable at least 1) vs. "all xᵢ ≥ 0" (variables may be zero).
  • Equality (x₁ + ⋯ + xₙ = k) vs. inequality (x₁ + ⋯ + xₙ ≤ k)—for inequality, introduce a slack variable to convert to equality.

The Ubiquitous Nature of Binomial Coefficients

2.5 The Ubiquitous Nature of Binomial Coefficients

🧭 Overview

🧠 One-sentence thesis

Binomial coefficients solve a wide range of counting problems—from distributing identical objects and solving integer equations to counting lattice paths—even when the problems do not initially appear to involve choosing subsets from sets.

📌 Key points (3–5)

  • Core insight: Many combinatorial problems can be reframed as "choosing k items from n positions," making binomial coefficients the natural counting tool.
  • Distribution problems: Distributing identical objects into distinct categories (with or without minimum requirements) translates into choosing divider positions among gaps.
  • Integer solutions: Counting solutions to equations like x₁ + x₂ + ... + xₖ = n with various constraints is equivalent to distribution problems.
  • Lattice paths: Paths on a grid from one point to another correspond to strings of moves, counted by binomial coefficients.
  • Common confusion: When constraints require "at least one" vs "zero or more," use the "artificial inflation" trick—add extra items temporarily to convert the problem into a simpler form.

🎁 Distributing identical objects

🎁 Basic distribution with minimum requirements

The excerpt presents a scenario: distribute 18 identical folders among four employees (Audrey, Bart, Cecilia, Darren) so that each receives at least one folder.

Method:

  • Imagine the 18 folders in a row, creating 17 gaps between them.
  • Choose 3 of these 17 gaps and place dividers in them.
  • The dividers split the folders into 4 non-empty groups (one per employee).
  • Answer: C(17, 3) ways.

The number of ways to distribute n identical objects into k distinct categories (each receiving at least one) equals the number of ways to choose (k−1) dividers from (n−1) gaps.

Example: If Audrey gets 6 folders, Bart 1, Cecilia 4, and Darren 7, the dividers are placed after the 6th, 7th, and 11th folders.

🎁 Distribution allowing zero allocations

When the restriction "each must receive at least one" is dropped, the excerpt uses an "artificial inflation" trick:

  • Artificially add 1 folder to each person's allocation.
  • Inflate the total number of folders by 4 (one per person): 18 + 4 = 22 folders.
  • Now choose 3 gaps from 21 gaps (between 22 folders).
  • The actual allocation is one less than the artificial allocation (so zero is allowed).
  • Answer: C(21, 3) ways.

Why this works: By guaranteeing everyone gets at least 1 in the inflated problem, then subtracting 1 from each allocation, we allow zero in the original problem.

🎁 Partial minimum requirements

The excerpt also considers: only Audrey and Cecilia must receive at least one folder; Bart and Darren may receive zero.

Method:

  • Artificially inflate only Bart and Darren's allocations by 1 each.
  • Total folders: 18 + 2 = 20; gaps: 19.
  • Choose 3 gaps from 19.
  • Answer: C(19, 3) ways.

Don't confuse: Inflate only the allocations that are allowed to be zero; leave the "must be at least one" allocations as-is.

🔢 Integer solutions to equations and inequalities

🔢 Reformulation as integer solutions

The excerpt reformulates distribution problems in terms of counting integer solutions to:

x₁ + x₂ + x₃ + x₄ + x₅ + x₆ = 538 (or ≤ 538)

with various restrictions on the variables.

Key cases (from the excerpt):

  • All xᵢ > 0, equality: C(537, 5) (538 − 1 = 537 gaps; choose 5 dividers)
  • All xᵢ ≥ 0, equality: C(543, 5) (inflate by 6: (538 + 6) − 1 = 543 gaps)
  • x₁, x₂, x₄, x₆ > 0; x₃ ≥ 52; x₅ ≥ 194; equality: C(291, 3) (subtract the minimums first, then distribute the remainder)
  • All xᵢ > 0, strict inequality (< 538): C(537, 6) (introduce x₇, the balance; x₇ must be positive)
  • All xᵢ ≥ 0, strict inequality: C(543, 6) (introduce x₇; only x₇ must be positive)
  • All xᵢ ≥ 0, non-strict inequality (≤ 538): C(544, 6) (introduce x₇ ≥ 0; all variables ≥ 0)

🔢 Handling inequalities

When the problem involves an inequality (e.g., x₁ + ... + x₆ ≤ 538):

  • Introduce a new variable x₇ representing the "balance" or "slack."
  • Convert the inequality into an equality: x₁ + ... + x₆ + x₇ = 538.
  • If the original inequality is strict (<), then x₇ must be positive.
  • If non-strict (≤), then x₇ ≥ 0.

Example: For all xᵢ > 0 and strict inequality, introduce x₇ > 0, giving 7 variables all positive; answer C(537, 6).
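The slack-variable conversion can be checked on a tiny analog of the 538 problem (Python sketch; three positive variables instead of six, and a small total N standing in for 538):

```python
from math import comb

N = 20  # a small stand-in for 538
# brute force: positive (a, b, c) with a + b + c < N
brute = sum(1 for a in range(1, N) for b in range(1, N) for c in range(1, N)
            if a + b + c < N)
# slack variable: a + b + c + x4 = N with all four positive -> C(N-1, 3)
assert brute == comb(N - 1, 3)
```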

🗺️ Lattice paths

🗺️ What is a lattice path

A lattice path in the plane: a sequence of integer coordinate pairs (m₁, n₁), (m₂, n₂), ..., (mₜ, nₜ) where each step either moves right (horizontal: mᵢ₊₁ = mᵢ + 1, nᵢ₊₁ = nᵢ) or up (vertical: mᵢ₊₁ = mᵢ, nᵢ₊₁ = nᵢ + 1).

  • A lattice path is equivalent to a string over {H, V}, where H = horizontal move, V = vertical move.
  • The excerpt illustrates a path from (0, 0) to (13, 8).

🗺️ Counting lattice paths

General formula: The number of lattice paths from (m, n) to (p, q) is C((p − m) + (q − n), p − m).

Why:

  • Total moves: (p − m) horizontal + (q − n) vertical = (p − m) + (q − n) moves.
  • Choose which (p − m) positions are horizontal moves.
  • Answer: C(total moves, horizontal moves).

Example: From (0, 0) to (n, n) requires 2n moves (n horizontal, n vertical), so C(2n, n) paths.
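The move-string encoding makes a direct check easy (Python sketch for a small endpoint; `lattice_paths` is a helper name introduced here):

```python
from itertools import product
from math import comb

def lattice_paths(m, n, p, q):
    # choose which of the (p - m) + (q - n) moves are horizontal
    return comb((p - m) + (q - n), p - m)

# enumerate all {H, V}-strings of length 7 with four H's: paths (0,0) -> (4,3)
paths = sum(1 for s in product("HV", repeat=7) if s.count("H") == 4)
assert paths == lattice_paths(0, 0, 4, 3) == comb(7, 4)
```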

🗺️ Catalan numbers and "good" paths

The excerpt defines:

A lattice path from (0, 0) to (n, n) is good if it never goes above the diagonal line y = x; otherwise it is bad.

  • A path is good if, at every step, the number of V's (vertical moves) in any initial segment never exceeds the number of H's (horizontal moves).
  • Example of good path: "HHVHVVHHHVHVVV" (from (0,0) to (7,7)).
  • Example of bad path: "HVHVHHVVVHVHHV" (after 9 moves: 5 V's and 4 H's, so it crosses above the diagonal).

Catalan number formula: The number of good paths from (0, 0) to (n, n) is

C(n) = (1 / (n+1)) × C(2n, n)

🗺️ Deriving the Catalan formula via bijection

The excerpt uses a bijection argument:

  1. Total paths P: All lattice paths from (0,0) to (n,n) have |P| = C(2n, n).
  2. Partition P into good (G) and bad (B): P = G ∪ B.
  3. Transform each bad path s into a path s′:
    • Find the first position i where s crosses above the diagonal (i must be odd, say i = 2j+1).
    • At position i, s has j H's and (j+1) V's.
    • The "tail" of s (remaining 2n − 2j − 1 positions) has (n−j) H's and (n−j−1) V's.
    • Swap H's and V's in the tail: H → V, V → H.
    • The new path s′ has (n+1) V's and (n−1) H's total, so it goes from (0,0) to (n−1, n+1).
  4. This transformation is a bijection between B (bad paths from (0,0) to (n,n)) and P′ (all paths from (0,0) to (n−1, n+1)).
  5. Count: |P′| = C(2n, n−1), so |B| = C(2n, n−1).
  6. Therefore: |G| = |P| − |B| = C(2n, n) − C(2n, n−1) = (1/(n+1)) × C(2n, n).

Don't confuse: The transformation swaps moves only in the tail (after the first crossing), not the entire path.
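The good/bad count can be verified exhaustively for small n (Python sketch; `is_good` is a helper name introduced here):

```python
from itertools import product
from math import comb

def is_good(s):
    # good: in every prefix, V's never outnumber H's (never above y = x)
    h = v = 0
    for move in s:
        if move == "H":
            h += 1
        else:
            v += 1
        if v > h:
            return False
    return True

n = 6
good = sum(1 for s in product("HV", repeat=2 * n)
           if s.count("H") == n and is_good(s))
assert good == comb(2 * n, n) - comb(2 * n, n - 1)
assert good == comb(2 * n, n) // (n + 1)
```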

🧮 Enumerative techniques highlighted

🧮 Bijection

The excerpt demonstrates counting by establishing a one-to-one correspondence between two sets:

  • One set is "easier" to count.
  • The bijection proves both sets have the same size.
  • Example: Bad lattice paths ↔ paths from (0,0) to (n−1, n+1).

🧮 Complementary counting

Instead of counting the objects we want directly, count the total and subtract the objects we do not want.

  • Example: Good paths = Total paths − Bad paths.
  • This technique is useful when the "unwanted" objects are easier to count.

Why it works: If we can count |Total| and |Unwanted|, then |Wanted| = |Total| − |Unwanted|.


The Binomial Theorem

2.6 The Binomial Theorem

🧭 Overview

🧠 One-sentence thesis

The Binomial Theorem provides a formula for expanding powers of a sum (x + y)ⁿ into a sum of terms involving binomial coefficients, and this principle extends naturally to multinomial coefficients when more than two terms are summed.

📌 Key points (3–5)

  • What the Binomial Theorem states: (x + y)ⁿ expands into a sum where each term has the form (n choose i) times xⁿ⁻ⁱ times yⁱ, for i from 0 to n.
  • Why it works: each term in the expansion corresponds to choosing y from exactly i of the n factors, and x from the remaining n − i factors.
  • Multinomial generalization: when expanding (x₁ + x₂ + ... + xᵣ)ⁿ, the coefficients become multinomial coefficients n!/(k₁!k₂!...kᵣ!) where k₁ + k₂ + ... + kᵣ = n.
  • Common confusion: multinomial coefficients look complicated, but they follow the same counting logic as binomial coefficients—choosing how many times each term appears in the product.
  • Practical use: you can find the coefficient of a specific term in an expansion without computing the entire expansion.

🧮 The Binomial Theorem statement and proof

📐 The theorem

Binomial Theorem: Let x and y be real numbers with x, y, and x + y nonzero. Then for every non-negative integer n, (x + y)ⁿ = sum from i=0 to n of (n choose i) times xⁿ⁻ⁱ times yⁱ.

  • The theorem gives an exact formula for expanding any power of a binomial (a sum of two terms).
  • Each term in the expansion has a binomial coefficient (n choose i), a power of x, and a power of y.
  • The powers of x and y always add up to n in each term.

🔍 Why the formula works

The proof views (x + y)ⁿ as a product of n identical factors:

  • Write (x + y)ⁿ = (x + y)(x + y)(x + y)...(x + y) with n factors.
  • Each term in the expansion results from choosing either x or y from each of the n factors.
  • If you choose x exactly n − i times and y exactly i times, the resulting product is xⁿ⁻ⁱyⁱ.
  • The key counting step: the number of ways to get xⁿ⁻ⁱyⁱ equals the number of ways to choose i factors (from which to take y), which is exactly (n choose i).

Example: To expand (x + y)³, you multiply three factors. If you choose y from exactly one factor and x from the other two, you get xxy, xyx, or yxx—three ways total, which matches (3 choose 1) = 3. So the x²y term has coefficient 3.

🎯 Finding a single coefficient

The excerpt emphasizes that you often want just one coefficient, not the full expansion.

  • Example from the text: The coefficient of x⁵y⁸ in (2x − 3y)¹³ is (13 choose 5) times 2⁵ times (−3)⁸.
  • You identify which term you need (here, x⁵y⁸), note that 5 + 8 = 13, then apply the Binomial Theorem formula directly.
  • No need to expand all 14 terms; just compute the one coefficient.
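The single coefficient can be cross-checked by expanding the polynomial directly, tracking only the power of x (the power of y is determined by the degree). A Python sketch:

```python
from collections import defaultdict
from math import comb

coeffs = {0: 1}  # the constant polynomial 1
for _ in range(13):  # multiply by (2x - 3y) thirteen times
    nxt = defaultdict(int)
    for i, c in coeffs.items():
        nxt[i + 1] += 2 * c   # the 2x term raises the x-power
        nxt[i] += -3 * c      # the -3y term leaves the x-power alone
    coeffs = nxt

# coefficient of x^5 y^8
assert coeffs[5] == comb(13, 5) * 2 ** 5 * (-3) ** 8
```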

🎨 Multinomial coefficients

🖌️ Counting with more than two colors

The excerpt introduces multinomial coefficients by analogy with painting problems:

  • Two colors (binomial): Choose k elements out of n to paint red; the rest are blue. Number of ways = (n choose k).
  • Three colors: Choose k₁ elements to paint red, k₂ to paint blue, and the remaining k₃ = n − (k₁ + k₂) to paint green.
    • First choose k₁ of n for red: (n choose k₁) ways.
    • Then choose k₂ of the remaining n − k₁ for blue: (n − k₁ choose k₂) ways.
    • The remaining k₃ elements are automatically green.
    • Total = (n choose k₁) times (n − k₁ choose k₂) = n! / (k₁! k₂! k₃!).

📝 General multinomial coefficient notation

Multinomial coefficient: (n choose k₁, k₂, k₃, ..., kᵣ) = n! / (k₁! k₂! k₃! ... kᵣ!)

  • This counts the number of ways to partition n objects into r groups of sizes k₁, k₂, ..., kᵣ (where k₁ + k₂ + ... + kᵣ = n).
  • The notation has "overkill" because kᵣ is determined by n and k₁, ..., kᵣ₋₁, but it is written explicitly for clarity.
  • Example from the text: (8 choose 3, 2, 1, 2) = 8! / (3! 2! 1! 2!) = 40320 / 24 = 1680.

🔤 Counting rearrangements of strings

Example from the excerpt: How many rearrangements of the string "MITCHELTKELLERANDWILLIAMTTROTTERAREGENIUSES!!" are possible?

  • The string has 45 characters total.
  • Count each distinct character: 3 A's, 1 C, 1 D, 7 E's, 1 G, 1 H, 4 I's, 1 K, 5 L's, 2 M's, 2 N's, 1 O, 4 R's, 2 S's, 6 T's, 1 U, 1 W, 2 !'s.
  • The number of distinct rearrangements is the multinomial coefficient: 45! / (3! 1! 1! 7! 1! 1! 4! 1! 5! 2! 2! 1! 4! 2! 6! 1! 1! 2!).
  • Why this works: you are assigning 45 positions to the characters, where 3 positions get A, 7 get E, etc.
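A small helper makes both examples in this section computable (Python sketch; `multinomial` is a name introduced here):

```python
from math import factorial, prod

def multinomial(ks):
    # n! / (k1! k2! ... kr!), where n = k1 + ... + kr
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

assert multinomial([3, 2, 1, 2]) == 1680  # the (8 choose 3, 2, 1, 2) example

counts = [3, 1, 1, 7, 1, 1, 4, 1, 5, 2, 2, 1, 4, 2, 6, 1, 1, 2]  # A, C, D, E, ...
assert sum(counts) == 45
rearrangements = multinomial(counts)  # number of distinct rearrangements
```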

🌳 The Multinomial Theorem

🧩 Statement of the theorem

Multinomial Theorem: Let x₁, x₂, ..., xᵣ be nonzero real numbers with their sum nonzero. Then for every n ≥ 0, (x₁ + x₂ + ... + xᵣ)ⁿ = sum over all k₁ + k₂ + ... + kᵣ = n of (n choose k₁, k₂, ..., kᵣ) times x₁^k₁ times x₂^k₂ times ... times xᵣ^kᵣ.

  • This generalizes the Binomial Theorem from two terms to r terms.
  • Each term in the expansion corresponds to a way of choosing how many times each xᵢ appears in the product, subject to the total being n.
  • The coefficient is the multinomial coefficient counting those choices.

🎲 Finding a specific coefficient

Example from the excerpt: What is the coefficient of x⁹⁹y⁶⁰z¹⁴ in (2x³ + y − z²)¹⁰⁰?

  • The exponents must satisfy: 3k₁ = 99, k₂ = 60, 2k₃ = 14, and k₁ + k₂ + k₃ = 100.
  • Solve: k₁ = 33, k₂ = 60, k₃ = 7. Check: 33 + 60 + 7 = 100 ✓.
  • The coefficient is (100 choose 33, 60, 7) times 2³³ times 1⁶⁰ times (−1)⁷.

What about x⁹⁹y⁶¹z¹³?

  • Try: 3k₁ = 99 → k₁ = 33, k₂ = 61, 2k₃ = 13 → k₃ = 6.5 (not an integer!).
  • Since k₃ must be a whole number, there is no such term in the expansion; the coefficient is 0.
  • Don't confuse: just because exponents "look close" doesn't mean the term exists—check that all kᵢ are non-negative integers and sum to n.

🔗 Connection to earlier counting techniques

The excerpt notes that the Binomial Theorem section follows an example (2.28) that used two common enumerative techniques:

  • Bijection: establish a one-to-one correspondence between two classes of objects, one easier to count than the other.
  • Complementary counting: count the objects you do not want and subtract from the total.
  • These techniques underpin many combinatorial proofs, including the proof of the Binomial Theorem (which uses a bijection between terms in the expansion and ways of choosing factors).
  • The multinomial coefficients also rely on counting partitions, a fundamental combinatorial idea.

Multinomial Coefficients

2.7 Multinomial Coefficients

🧭 Overview

🧠 One-sentence thesis

Multinomial coefficients generalize binomial coefficients to count the ways of partitioning a set into more than two groups, and they appear naturally in the expansion of powers of sums with more than two terms.

📌 Key points (3–5)

  • What multinomial coefficients count: the number of ways to partition n elements into r groups of specified sizes k₁, k₂, ..., kᵣ.
  • How they generalize binomial coefficients: binomial coefficients handle two groups (choose k, leave the rest), while multinomial coefficients handle r groups.
  • The formula: n! divided by the product of all the k factorials: n! / (k₁! k₂! ... kᵣ!).
  • Common confusion: the notation includes "overkill"—the last group size kᵣ is determined by n and the earlier k values, just as binomial coefficients don't write both k and n-k.
  • Why they matter: they solve counting problems with multiple categories and give coefficients in the Multinomial Theorem for expanding powers of sums.

🎨 The coloring interpretation

🎨 From two colors to three colors

The excerpt motivates multinomial coefficients through a coloring problem:

  • Two colors (binomial case): Choose k elements from n to paint red; the rest are blue. Number of ways = binomial coefficient (n choose k).
  • Three colors (multinomial case): Choose k₁ elements to paint red, k₂ to paint blue, and the remaining k₃ = n - (k₁ + k₂) to paint green.

🧮 Counting the three-color case

The excerpt computes this step-by-step:

  1. Choose k₁ elements from n to paint red: (n choose k₁) ways.
  2. From the remaining n - k₁ elements, choose k₂ to paint blue: (n - k₁ choose k₂) ways.
  3. Paint the remaining k₃ elements green (no choice left).

Multiplying these gives:

(n choose k₁) × (n - k₁ choose k₂) = [n! / (k₁! (n - k₁)!)] × [(n - k₁)! / (k₂! (n - k₁ - k₂)!)] = n! / (k₁! k₂! k₃!)

This is the multinomial coefficient.

🔢 General notation and definition

Multinomial coefficient: (n choose k₁, k₂, k₃, ..., kᵣ) = n! / (k₁! k₂! k₃! ... kᵣ!)

  • The k values must sum to n: k₁ + k₂ + ... + kᵣ = n.
  • Example from the excerpt: (8 choose 3, 2, 1, 2) = 8! / (3! 2! 1! 2!) = 40320 / 24 = 1680.

🔍 Notation "overkill"

The excerpt notes:

  • The last group size kᵣ is determined by n and k₁, ..., kᵣ₋₁, so writing it is redundant.
  • Binomial coefficients don't write both parts: we write (8 choose 3), not (8 choose 3, 5).
  • But multinomial notation includes all k values for clarity.

Don't confuse: The redundancy is acknowledged but kept for explicitness; it does not mean the formula is wrong.

🔤 Counting rearrangements of strings

🔤 The string rearrangement problem

Example from the excerpt:

How many different rearrangements of the string "MITCHELTKELLERANDWILLIAMTTROTTERAREGENIUSES!!" are possible if all letters and characters must be used?

🧮 Solution approach

The excerpt counts the characters:

  • Total: 45 characters.
  • Distribution: 3 A's, 1 C, 1 D, 7 E's, 1 G, 1 H, 4 I's, 1 K, 5 L's, 2 M's, 2 N's, 1 O, 4 R's, 2 S's, 6 T's, 1 U, 1 W, 2 !'s.

The number of rearrangements is the multinomial coefficient:

45! / (3! 1! 1! 7! 1! 1! 4! 1! 5! 2! 2! 1! 4! 2! 6! 1! 1! 2!)

🧠 Why this works

  • We are partitioning 45 positions into groups: 3 positions for A, 7 for E, etc.
  • Each group corresponds to identical characters, so we divide by the factorial of each group size to avoid overcounting indistinguishable arrangements.

Example: If we had just "AAB", there are 3! = 6 permutations of positions, but swapping the two A's produces the same string, so we divide by 2! to get 3! / 2! = 3 distinct strings: AAB, ABA, BAA.
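The "AAB" example can be verified by literally deduplicating the 3! permutations (Python sketch):

```python
from itertools import permutations

# permutations() treats the two A's as distinct positions, so the set
# comprehension dedupes the indistinguishable arrangements
distinct = {"".join(p) for p in permutations("AAB")}
assert distinct == {"AAB", "ABA", "BAA"}
assert len(distinct) == 6 // 2  # 3! / 2!
```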

📐 The Multinomial Theorem

📐 Statement of the theorem

Multinomial Theorem: Let x₁, x₂, ..., xᵣ be nonzero real numbers with their sum not equal to 0. Then for every nonnegative integer n:

(x₁ + x₂ + ... + xᵣ)ⁿ = sum over all k₁ + k₂ + ... + kᵣ = n of [(n choose k₁, k₂, ..., kᵣ) × x₁^k₁ × x₂^k₂ × ... × xᵣ^kᵣ]

  • This generalizes the Binomial Theorem (which handles r = 2).
  • Each term in the expansion corresponds to a way of choosing exponents k₁, ..., kᵣ that sum to n.

🧮 Finding a specific coefficient

Example from the excerpt:

What is the coefficient of x⁹⁹ y⁶⁰ z¹⁴ in (2x³ + y - z²)¹⁰⁰?

Solution:

  • Expand using the Multinomial Theorem: terms have the form (100 choose k₁, k₂, k₃) × (2x³)^k₁ × y^k₂ × (-z²)^k₃.
  • Simplify: (100 choose k₁, k₂, k₃) × 2^k₁ × x^(3k₁) × y^k₂ × (-1)^k₃ × z^(2k₃).
  • For x⁹⁹ y⁶⁰ z¹⁴, we need:
    • 3k₁ = 99 → k₁ = 33
    • k₂ = 60
    • 2k₃ = 14 → k₃ = 7
  • Check: k₁ + k₂ + k₃ = 33 + 60 + 7 = 100 ✓
  • Coefficient: (100 choose 33, 60, 7) × 2³³ × (-1)⁷ = -(100 choose 33, 60, 7) × 2³³.

🚫 When a term does not appear

The excerpt also asks about x⁹⁹ y⁶¹ z¹³:

  • For z¹³, we need 2k₃ = 13, but k₃ must be an integer.
  • Since 13 is odd, no integer k₃ satisfies this.
  • Therefore, the coefficient is 0.

Don't confuse: A coefficient of 0 does not mean the formula is wrong; it means that particular term does not appear in the expansion.

🔗 Connection to binomial coefficients

🔗 Special case: r = 2

When r = 2, the multinomial coefficient reduces to the binomial coefficient:

  • Coefficient: multinomial (n choose k₁, k₂, ..., kᵣ) vs. binomial (n choose k)
  • Formula: n! / (k₁! k₂! ... kᵣ!) vs. n! / (k! (n−k)!)
  • Constraint: k₁ + k₂ + ... + kᵣ = n vs. k + (n−k) = n
  • The binomial coefficient is the number of ways to choose k elements (first group) and leave n-k (second group).
  • The multinomial coefficient generalizes this to r groups of specified sizes.

🧩 Why the formulas match

The excerpt's derivation for three colors shows:

  • (n choose k₁) × (n - k₁ choose k₂) simplifies to n! / (k₁! k₂! k₃!).
  • This is exactly the multinomial coefficient formula.
  • The binomial case is just the first step: (n choose k₁) = n! / (k₁! (n - k₁)!), which is (n choose k₁, n - k₁) in multinomial notation.

Discussion

2.8 Discussion

🧭 Overview

🧠 One-sentence thesis

While computers can perform arithmetic operations (addition, multiplication) on very large integers extremely quickly, operations like factoring large integers or computing large exponentiations remain computationally challenging even for powerful machines.

📌 Key points (3–5)

  • Addition is easy: Even humans can add two 800-digit integers in minutes; computers do it almost instantly.
  • Multiplication scales well: Computers can multiply thousand-digit integers in about one second, far faster than humans.
  • Factoring is hard: Determining whether a large integer is prime or the product of two primes is very difficult, even for supercomputers.
  • Exponentiation complexity: Naive exponentiation (repeated multiplication) may require nested loops and take a long time, though faster methods may exist.
  • Common confusion: Not all operations on large integers are equally difficult—addition and multiplication are fast, but factoring and certain exponentiations are computationally expensive.

💻 What computers can do quickly

➕ Addition of large integers

  • SageMath (software from Chapter 1) treats big integers as strings.
  • Xing tested adding two integers, each with more than 800 digits.
  • Result: The software found the sum "about as fast as he could hit the enter key."
  • Alice's perspective: A human could do this in a couple of minutes with pencil and paper, so it's "not so impressive."
  • Key point: Addition is straightforward even for very large numbers.

✖️ Multiplication of large integers

  • Dave noted that very few humans would want to multiply two large integers by hand.
  • Xing's test: Two integers, each with more than 1,000 digits.
  • Result: The netbook found the product in about one second.
  • Why it matters: Multiplication is much harder for humans but still very fast for computers.
  • Practical tip: Xing used copy-paste to avoid typing errors when entering large numbers.

🔐 What remains computationally hard

🧩 Factoring large integers

Factoring an integer with several hundred digits is likely to be very challenging, not only for a netbook, but also for a supercomputer.

  • Dave asked whether the software could factor big integers.
  • Carlos explained the difficulty: If a large integer is either prime or the product of two large primes, detecting which case holds "could be very difficult."
  • Don't confuse: Just because computers can multiply large numbers quickly doesn't mean they can reverse the process (factoring) quickly.
  • Example scenario: Given a several-hundred-digit integer, determining its prime factors is computationally expensive.

🔢 Exponentiation challenges

  • Dave's question: Can the software calculate a to the power b when both are large integers?
  • Xing's initial response: "That shouldn't be a problem" because exponentiation is "just multiplying a times itself a total of b times."
  • Yolanda's concern: This approach involves nested loops, which "might take a long time for such a program to halt."
  • Carlos's thought: There might be ways to speed up such computations (though not detailed in the excerpt).
  • Key insight: Naive exponentiation (repeated multiplication b times) is inefficient; the computational cost depends on the algorithm used.
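One standard speed-up Carlos may have had in mind is square-and-multiply, which needs about log₂ b multiplications instead of b − 1 (a Python sketch; note that Python's built-in pow already works this way):

```python
def fast_pow(a, b):
    # square-and-multiply: scan the bits of b from least significant upward
    result = 1
    while b > 0:
        if b & 1:          # this bit of the exponent is set
            result *= a
        a *= a             # square the base for the next bit
        b >>= 1
    return result

assert fast_pow(3, 20) == 3 ** 20
assert fast_pow(7, 0) == 1
```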

🤔 Real-world problem example

🔍 Catalan number verification

  • Alice found a web problem: "Is 838200020310007224300 a Catalan number?"
  • Her question: "How would you answer this? Do you have to use special software?"
  • The excerpt does not provide the answer, but raises the question of whether specialized algorithms or software are needed for such problems.
  • Implication: Some problems involving large integers may require mathematical insight beyond brute-force computation.

😰 Student reactions

😟 Zori's concern

  • Zori was "not happy" and "gloomily envisioned a future job hunt in which she was compelled to use big integer arithmetic as a job skill."
  • Her reaction: "Arrgghh."
  • Context: The discussion highlights that computational skills with large integers may be relevant in certain technical fields, which some students find daunting.

Exercises on Strings, Counting, and Combinatorics

2.9 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set applies counting principles—including product rule, permutations, combinations, and lattice paths—to solve problems involving strings, distributions, and combinatorial identities.

📌 Key points (3–5)

  • String counting: many problems ask how many strings (passwords, license plates, record identifiers) satisfy multiple constraints simultaneously.
  • Distribution problems: several exercises involve distributing identical objects (pencils, candy, donuts) to people with constraints, often solved using integer equations or inequalities.
  • Lattice paths: problems count paths on a grid from one point to another, sometimes avoiding or passing through specified points.
  • Common confusion: constraints like "at least one" vs "exactly" vs "no more than" require different counting techniques (inclusion-exclusion, stars-and-bars adjustments).
  • Combinatorial identities and arguments: some exercises ask for proofs using combinatorial reasoning (e.g., choosing teams with captains, bijections for Catalan numbers).

🔐 String and password problems

🔐 Password and identifier constraints

Many exercises (3, 5, 7, 8, 9, 10) ask: how many strings of a given length satisfy multiple conditions?

Typical constraints include:

  • Specific positions must be letters, digits, or symbols.
  • Certain characters must be distinct.
  • Certain characters must (or must not) be vowels.
  • At least one character must satisfy a property.

Example (Exercise 5): License plates of the form l₁ l₂ l₃ d₁ d₂ d₃ where at least one digit is nonzero and at least one letter is K.

  • This is an "at least one" problem: use complementary counting or inclusion-exclusion.

Don't confuse:

  • "Exactly three positions" vs "at least three positions."
  • "Distinct digits" (all different) vs "digits may repeat."
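The complementary-counting pattern for Exercise 5 is easy to sanity-check by machine. The sketch below (an illustrative check, assuming uppercase letters and decimal digits) compares the two formula factors against brute force:

```python
from itertools import product
from string import ascii_uppercase

# Complementary counting for Exercise 5's pattern:
# letter triples with at least one K, times digit triples that are not all zero
letters_with_k = 26**3 - 25**3
digits_nonzero = 10**3 - 1
formula = letters_with_k * digits_nonzero

# Brute-force verification of each factor
brute_letters = sum(1 for t in product(ascii_uppercase, repeat=3) if "K" in t)
brute_digits = sum(1 for t in product("0123456789", repeat=3) if t != ("0", "0", "0"))
assert formula == brute_letters * brute_digits
print(formula)
```

Subtracting the "no K" and "all zero" cases from the totals is exactly the complementary-counting step.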

🔤 Vowel and consonant restrictions

Exercises 8, 9, 10 impose conditions on vowels (A, E, I, O, U) appearing in specific positions or exactly a certain number of times.

Example (Exercise 10):

  • Precisely four symbols are the letter 't'.
  • Precisely three characters are distinct vowels from {a, e, i, o, u}.
  • First and last symbols are distinct digits.

Approach:

  1. Choose positions for each type of character.
  2. Count ways to fill those positions.
  3. Multiply counts (product rule).

🍩 Distribution and selection problems

🍩 Donuts, ice cream, and choices

Exercises 11 and 14 involve selecting items with or without order/repetition.

Exercise 11 (donuts):

  • (a) Selecting a type for each of six people (order matters, repetition allowed): product rule.
  • (b) Different type for each person: permutations (no repetition).
  • (c) Six different types, order irrelevant: combinations.

Exercise 14 (banana split):

  • Choose 3 different ice cream flavors (order matters because the bowl is asymmetric).
  • Each scoop gets one of 6 sauces (repetition allowed).
  • Choose 3 sprinkled toppings from 10.
  • Multiply all counts.

Don't confuse:

  • Ordered selection (permutations) vs unordered (combinations).
  • "Different types" (no repetition) vs "may repeat."

🎁 Distributing identical objects

Exercises 15, 19 ask: distribute identical items (pencils, candy) to people with lower/upper bounds on how many each receives.

Example (Exercise 15): Distribute 25 identical pencils to Ahmed, Barbara, Casper, Dieter such that:

  • Ahmed and Dieter each get at least 1.
  • Casper gets at most 5.
  • Barbara gets at least 4.

Approach (stars and bars with constraints):

  1. Give minimum amounts first (Ahmed 1, Dieter 1, Barbara 4).
  2. Remaining pencils: 25 - 6 = 19.
  3. Casper has received nothing so far, so he can receive 0 to 5 of the remaining pencils.
  4. Use inclusion-exclusion or case-by-case counting for upper bounds.
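The steps above can be sketched with Python's math.comb; the inclusion-exclusion count and the case-by-case count over Casper's share should agree:

```python
from math import comb

# After giving Ahmed 1, Dieter 1, and Barbara 4 pencils, 19 remain.
# Count non-negative solutions with Casper capped at 5 via inclusion-exclusion:
# all solutions minus those where Casper receives at least 6.
count = comb(19 + 3, 3) - comb((19 - 6) + 3, 3)

# Case-by-case check over Casper's share c = 0..5
by_cases = sum(comb((19 - c) + 2, 2) for c in range(6))
assert count == by_cases
print(count)
```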

🔢 Integer solutions to equations and inequalities

🔢 Equations with non-negative or positive constraints

Exercises 16, 17, 18 ask for the number of integer solutions to equations like x₁ + x₂ + x₃ + x₄ + x₅ = 63 or inequalities like x₁ + x₂ + x₃ + x₄ + x₅ ≤ 63.

Key distinctions:

| Constraint | Meaning | Technique |
| --- | --- | --- |
| xᵢ > 0 | Each variable at least 1 | Substitute yᵢ = xᵢ − 1, solve for yᵢ ≥ 0 |
| xᵢ ≥ 0 | Each variable at least 0 | Stars and bars directly |
| ≤ k (inequality) | Sum at most k | Introduce slack variable: x₁ + ... + xₙ + s = k, s ≥ 0 |
| xᵢ ≤ m (upper bound) | Variable capped at m | Use inclusion-exclusion or generating functions |

Example (Exercise 16d): x₁ + x₂ + x₃ + x₄ + x₅ = 63, all xᵢ ≥ 0, x₂ ≥ 10.

  • Substitute y₂ = x₂ - 10, so y₂ ≥ 0.
  • Solve x₁ + y₂ + x₃ + x₄ + x₅ = 53, all ≥ 0.

Don't confuse:

  • Equality (=) vs inequality (≤): inequalities need a slack variable.
  • Lower bounds (≥) vs upper bounds (≤): upper bounds often require inclusion-exclusion.
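The lower-bound substitution can be sketched generically (the count_solutions helper is an illustrative name, not from the exercises); a brute-force check on a smaller analog confirms the shift:

```python
from math import comb
from itertools import product

def count_solutions(total, mins):
    """Stars and bars after shifting each variable down by its lower bound."""
    shifted = total - sum(mins)
    n = len(mins)
    return comb(shifted + n - 1, n - 1)

# Exercise 16d: x1 + x2 + x3 + x4 + x5 = 63 with x2 >= 10, others >= 0
print(count_solutions(63, [0, 10, 0, 0, 0]))  # C(57, 4)

# Brute-force check on a smaller analog: total 12, x2 >= 3
brute = sum(1 for xs in product(range(13), repeat=5)
            if sum(xs) == 12 and xs[1] >= 3)
assert brute == count_solutions(12, [0, 3, 0, 0, 0])
```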

🎓 Classroom candy distribution (Exercise 19)

450 identical candies, 65 students:

  • One student (contest winner) gets at least 10.
  • 34 students get at least 1.
  • 30 students may get 0.
  • Teacher may keep leftovers (so total distributed ≤ 450).

Approach:

  1. Give contest winner 10, give 34 students 1 each: 10 + 34 = 44 used.
  2. Remaining: 450 - 44 = 406.
  3. Distribute at most 406 among the 65 students (each ≥ 0): introduce a slack variable for the candies the teacher keeps.
  4. Part (b) adds: one student (diabetic) gets at most 7 total (already has 1, so at most 6 more).

🗺️ Lattice paths

🗺️ Counting paths on a grid

Exercises 22–28 ask: how many lattice paths from point (a, b) to (c, d) on a grid, moving only right (R) or up (U)?

Lattice path: a path on a grid that moves one step right or one step up at each step.

Basic formula:

  • From (0, 0) to (m, n): need m right steps and n up steps.
  • Total steps: m + n.
  • Number of paths: C(m + n, m) = C(m + n, n) (choose which steps are right).

Example (Exercise 22): From (0, 0) to (10, 12):

  • 10 right, 12 up, total 22 steps.
  • Paths: C(22, 10).

🚧 Paths through or avoiding points

Exercises 24–28 add constraints: must pass through a point, or must avoid a point.

Passing through a point (Exercise 24):

  • From (0, 0) to (10, 12) through (3, 5):
  • Paths = [paths (0,0) → (3,5)] × [paths (3,5) → (10,12)].
  • (0,0) → (3,5): C(8, 3).
  • (3,5) → (10,12): C(14, 7).
  • Total: C(8, 3) × C(14, 7).

Avoiding a point (Exercise 26):

  • From (0, 0) to (14, 73) not through (6, 37):
  • Total paths: C(87, 14).
  • Subtract paths through (6, 37): C(43, 6) × C(44, 8).

Example (Exercise 27, bank robber):

  • From (1st St, 1st Ave) to (7th St, 5th Ave), avoid (4th St, 4th Ave).
  • Movement: 6 blocks east, 4 blocks north.
  • Total paths: C(10, 6).
  • Subtract paths through police: [paths to (4,4)] × [paths from (4,4) to (7,5)].

Don't confuse:

  • "Through" (multiply path counts) vs "avoiding" (subtract path counts).
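Both patterns can be sketched together (the paths helper is an illustrative name); by construction, the through-count and the avoiding-count partition the total:

```python
from math import comb

def paths(a, b, c, d):
    """Lattice paths from (a, b) to (c, d) using only right and up steps."""
    return comb((c - a) + (d - b), c - a)

# "Through" multiplies the counts for the two legs (Exercise 24's pattern)
through = paths(0, 0, 3, 5) * paths(3, 5, 10, 12)
# "Avoiding" subtracts the through-count from the total (Exercise 26's pattern)
avoiding = paths(0, 0, 10, 12) - through
assert through + avoiding == paths(0, 0, 10, 12)
print(through, avoiding)
```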

🧮 Combinatorial identities and proofs

🧮 Proving identities by counting

Exercises 20, 21 ask for combinatorial arguments (count the same set in two ways).

Exercise 20: Prove k · C(n, k) = n · C(n-1, k-1).

Combinatorial argument (hint: team with captain):

  • Left side: choose k people from n, then choose 1 captain from the k.
  • Right side: choose 1 captain from n first, then choose k-1 teammates from remaining n-1.
  • Both count the same thing: a team of k people with a designated captain.

Exercise 21: Prove Σ(j=0 to k) C(m, j) · C(w, k-j) = C(m+w, k).

Combinatorial argument:

  • Right side: choose k items from m + w items total.
  • Left side: partition the m+w items into two groups (m and w); for each j, choose j from the first group and k-j from the second.
  • Both count ways to choose k items from m + w.
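A combinatorial argument is a proof, not a computation, but both identities are easy to spot-check numerically with Python's math.comb:

```python
from math import comb

# Exercise 20's identity: k * C(n, k) == n * C(n-1, k-1)
for n in range(1, 12):
    for k in range(1, n + 1):
        assert k * comb(n, k) == n * comb(n - 1, k - 1)

# Exercise 21's identity (Vandermonde-style convolution)
for m in range(6):
    for w in range(6):
        for k in range(m + w + 1):
            lhs = sum(comb(m, j) * comb(w, k - j) for j in range(k + 1))
            assert lhs == comb(m + w, k)

print("both identities verified")
```

Note that math.comb(n, k) returns 0 when k > n, which matches the convention that there are no ways to choose more items than are available.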

🐱 Catalan numbers (Exercise 33)

Exercise 33 asks for bijective arguments showing three classes of objects are enumerated by the nth Catalan number C(n).

(a) Parenthesizations of n+1 factors:

  • Example: for 4 factors, there are 5 ways (listed in the exercise).

(b) Sequences of n ones and n negative-ones with non-negative partial sums:

  • Each prefix sum ≥ 0.

(c) Sequences 1 ≤ a₁ ≤ ... ≤ aₙ with aᵢ ≤ i:

  • Example for n=3: 111, 112, 113, 122, 123.

Hint for (c): Think of lattice paths and "boxes below the path."
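Class (b) can be checked numerically against the closed form C(n) = C(2n, n)/(n + 1) (the closed form is standard, though the exercise itself asks for bijections, not computation):

```python
from math import comb
from itertools import product

def catalan(n):
    # Closed form for the nth Catalan number
    return comb(2 * n, n) // (n + 1)

def ballot_count(n):
    """Class (b): n ones and n minus-ones with every prefix sum non-negative."""
    count = 0
    for seq in product((1, -1), repeat=2 * n):
        if sum(seq) != 0:
            continue
        prefix, ok = 0, True
        for step in seq:
            prefix += step
            if prefix < 0:
                ok = False
                break
        count += ok
    return count

for n in range(1, 7):
    assert ballot_count(n) == catalan(n)
print([catalan(n) for n in range(1, 7)])
```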

🎲 Miscellaneous counting problems

🎲 Teams and competitions

Exercise 6 (students lining up):

  • (a) 10 students from group 1 line up: 10! ways.
  • (b) 30 students line up alternating by group (1, 2, 3, 1, 2, 3, ...): multiply permutations of each group.

Exercise 12 (korfball team):

  • Team needs 4 men and 4 women.
  • Choose 4 from 7 men: C(7, 4).
  • Choose 4 from 11 women: C(11, 4).
  • Total: C(7, 4) × C(11, 4).

Exercise 13 (programming competition):

  • (a) Top 4 places (order matters): P(20, 4).
  • (b) Then choose 4 honorable mentions from remaining 16: C(16, 4).
  • Total outcomes: P(20, 4) × C(16, 4).
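A one-line sketch of Exercise 13 using Python's math.perm and math.comb:

```python
from math import comb, perm

# Order matters for the top four places, then an unordered set of four
# honorable mentions from the remaining contestants.
top_four = perm(20, 4)
honorable = comb(16, 4)
print(top_four * honorable)
```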

🎨 Rearrangements and multinomial coefficients

Exercise 31 (rearranging letters in words):

  • Count permutations of letters where some letters repeat.
  • Formula: (total letters)! / (product of factorials of each letter's frequency).
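The formula translates directly into a short function (the example word "MISSISSIPPI" is a standard illustration, not necessarily one of the exercise's words):

```python
from math import factorial
from collections import Counter

def arrangements(word):
    """Distinct rearrangements: total! divided by each letter-frequency factorial."""
    result = factorial(len(word))
    for repeat in Counter(word).values():
        result //= factorial(repeat)
    return result

print(arrangements("MISSISSIPPI"))  # 11! / (1! * 4! * 4! * 2!)
```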

Exercise 32 (painting elements):

  • 27 elements painted in 6 colors with specified counts.
  • Multinomial coefficient: 27! / (7! · 6! · 2! · 7! · 5! · 0!).

📐 Polynomial coefficients

Exercise 29: Coefficient of x¹⁵ y¹²⁰ z²⁵ in (2x + 3y² + z)¹⁰⁰.

  • Use multinomial theorem: terms have form C(100; a, b, c) · (2x)ᵃ · (3y²)ᵇ · zᶜ where a + b + c = 100.
  • Match exponents: a = 15, 2b = 120 → b = 60, c = 25.
  • Check: 15 + 60 + 25 = 100 ✓.
  • Coefficient: C(100; 15, 60, 25) · 2¹⁵ · 3⁶⁰.
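The exponent-matching recipe can be checked mechanically; the multinomial helper and the brute-force expansion below are illustrative sketches, with the brute-force check run on a small analog of the same trinomial:

```python
from math import factorial

def multinomial(n, *parts):
    """n! divided by the factorial of each part (parts must sum to n)."""
    assert sum(parts) == n
    out = factorial(n)
    for p in parts:
        out //= factorial(p)
    return out

# Exercise 29: coefficient of x^15 y^120 z^25 in (2x + 3y^2 + z)^100
coef = multinomial(100, 15, 60, 25) * 2**15 * 3**60

# Check the same recipe on a small analog:
# coefficient of x y^4 z in (2x + 3y^2 + z)^4, expanded term by term
poly = {(0, 0, 0): 1}
for _ in range(4):
    expanded = {}
    for (a, b, c), v in poly.items():
        for (da, db, dc), w in (((1, 0, 0), 2), ((0, 2, 0), 3), ((0, 0, 1), 1)):
            key = (a + da, b + db, c + dc)
            expanded[key] = expanded.get(key, 0) + v * w
    poly = expanded
assert poly[(1, 4, 1)] == multinomial(4, 1, 2, 1) * 2 * 3**2
```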

Exercise 30: Coefficient of x¹² y²⁴ in (x³ + 2xy² + y + 3)¹⁸.

  • Careful: x and y appear in multiple terms.
  • Expand using multinomial, match exponents for x and y separately.

Introduction to Induction and Recursion

3.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The chapter introduces recursion and induction as fundamentally important concepts in combinatorics and computer science, beginning with a motivating question about whether every set of positive integers must have a smallest element.

📌 Key points (3–5)

  • The door-prize question: Given any set of positive integers (even infinitely many), must there always be a least one?
  • Well-Ordered Property: Every non-empty set of positive integers has a least element—this is a foundational principle, not an obvious fact.
  • Assembly required: The positive integers and basic operations like addition and multiplication must be formally defined (via Peano Postulates); they don't come "for free."
  • Common confusion: Sequences and formulas with "..." notation may seem obvious, but their meaning is not always clear without precise definitions.
  • Chapter scope: The chapter will show how recursive formulas arise naturally, how to compute with them, and how to prove statements using mathematical induction.

🎟️ The motivating scenario

🎟️ The door-prize problem

  • A professor gives each student a ticket with a distinct positive integer.
  • The prize (one dollar) goes to the student with the lowest numbered ticket.
  • Key question: Must the prize be awarded? In other words, must there always be a least ticket number?

🔢 Generalizing the question

  • More broadly: Is it true that any set of positive integers always has a least element?
  • What if there are infinitely many students, each holding a ticket?
  • The excerpt hints that the answer is "yes," but the reasoning is more subtle than it first appears.

🏗️ Foundations of the positive integers

🏗️ "Some assembly required"

The positive integers come with "some assembly required."

  • The set of positive integers (denoted ℤ⁺) is not just "given" in mathematics; it must be constructed formally.
  • The excerpt references Peano Postulates (discussed in Appendix B) as the starting point for defining positive integers.
  • Addition and multiplication are not automatic; they must be defined as part of this construction.
  • Don't confuse: the familiar properties of numbers are conclusions, not starting assumptions.

🧱 The Well-Ordered Property

Principle 3.1 (Well Ordered Property of the Positive Integers): Every non-empty set of positive integers has a least element.

  • This principle is a by-product of the formal development of the positive integers.
  • It is presented as a foundational property, not something that can be taken for granted.
  • Implication for the door prize: The professor will indeed have to pay someone a dollar, even if there are infinitely many students in the class.

⚠️ The challenge of interpreting statements

⚠️ Sequences and the "..." notation

The excerpt presents several sequences and asks what the next term should be:

  1. 2, 5, 8, 11, 14, 17, 20, 23, 26, ...
  2. 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, ...
  3. 1, 2, 5, 14, 42, 132, 429, 1430, 4862, ...
  4. 2, 6, 12, 20, 30, 42, 56, 72, 90, 110, ...
  5. 2, 3, 6, 11, 18, 27, 38, 51, ...
  • The excerpt describes these as "pretty easy stuff," but then presents a much harder sequence:
    • 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, ...
  • The authors claim they "really have in mind something very concrete," but it is "far from obvious" without explanation.

⚠️ The danger of informal notation

  • The excerpt also raises a question about formulas like:
    • 1 + 2 + 3 + ... + n = n(n + 1)/2
  • What do the dots mean? The excerpt emphasizes that even simple-looking notation requires careful definition.
  • Don't confuse: intuitive understanding with rigorous meaning—the chapter will address how to make these statements precise.

🎯 Chapter goals

🎯 What the chapter will cover

  • Recursive formulas: How they arise naturally in combinatorial problems.
  • Computation with recursion: How to use recursive definitions to perform calculations.
  • Mathematical induction: The Principle of Mathematical Induction and how to apply it to prove combinatorial statements.
  • Code snippets: Examples of how functions are defined recursively in computer programs.

🎯 Why recursion and induction matter

  • The excerpt states that recursion and induction are "fundamentally important" in combinatorial mathematics and computer science.
  • These concepts are the twin pillars for understanding how to define, compute, and prove properties of discrete structures.

3.2 The Positive Integers are Well Ordered

🧭 Overview

🧠 One-sentence thesis

The Well Ordered Property guarantees that every non-empty set of positive integers has a least element, which ensures that problems like finding the lowest ticket number always have a definite answer.

📌 Key points (3–5)

  • The Well Ordered Property: every non-empty set of positive integers must have a least element.
  • Why it matters: this property ensures that questions like "who has the lowest ticket?" always have an answer, even with infinitely many tickets.
  • Not as obvious as it seems: the positive integers require careful construction from foundational postulates (Peano Postulates), and basic operations like addition and multiplication must be defined.
  • Common confusion: the property feels "natural," but it is actually a fundamental result that emerges from how the number system is built, not a self-evident fact.
  • Applies to infinite sets too: even if there are infinitely many students drawing tickets, the Well Ordered Property still guarantees a winner.

🎟️ The motivating problem

🎟️ The door prize scenario

  • A professor gives each student a ticket with a distinct positive integer.
  • The prize goes to whoever holds the lowest-numbered ticket.
  • Key question: Must there always be a winner? In other words, must there always be a least ticket number?
  • The excerpt extends this: what if there are infinitely many students, each with a ticket?

🤔 Why the question is deeper than it looks

  • Most people answer "yes" immediately because it feels obvious.
  • However, the excerpt emphasizes that this is "a much more complex subject than you might think at first."
  • The positive integers come with "some assembly required"—they are built from foundational axioms (Peano Postulates), and operations like addition and multiplication must be explicitly defined.
  • The Well Ordered Property is not a trivial observation; it is a fundamental property that results from this construction.

🏛️ The Well Ordered Property

🏛️ Statement of the principle

Principle 3.1 (Well Ordered Property of the Positive Integers): Every non-empty set of positive integers has a least element.

  • This applies to any non-empty set of positive integers, no matter how it is chosen.
  • It does not matter whether the set is finite or infinite.

✅ Immediate consequence for the door prize

  • Because the set of ticket numbers drawn by students is a non-empty set of positive integers, the Well Ordered Property guarantees there is a least element.
  • Therefore, the professor will indeed have to pay someone a dollar—even if there are infinitely many students in the class.
  • Example: If students draw tickets numbered {7, 3, 15, 2, 100}, the least element is 2, so the student with ticket 2 wins.

🔍 Don't confuse: "natural" vs. "proven"

  • The property feels intuitively obvious, but it is not a starting assumption.
  • It is a consequence of how the positive integers are constructed from the Peano Postulates.
  • The excerpt warns that "you may be surprised to learn that this is really a much more complex subject than you might think at first."

🔧 Foundational context

🔧 Building the positive integers

  • The excerpt notes that the positive integers are developed starting from the Peano Postulates (discussed in Appendix B of the source).
  • Basic operations (addition, multiplication) do not come "for free"; they must be explicitly defined.
  • The Well Ordered Property is a by-product of this careful construction.

📚 Why this matters for the course

  • Understanding that the Well Ordered Property is a fundamental result (not just common sense) is important for rigorous mathematical reasoning.
  • It underpins proofs and definitions that will appear later, including the Principle of Mathematical Induction (mentioned in the chapter introduction).

3.3 The Meaning of Statements

🧭 Overview

🧠 One-sentence thesis

Mathematical notation like summation and factorial symbols requires precise recursive definitions to avoid ambiguity, and the Well Ordered Property of positive integers guarantees that such recursive definitions work for all positive integers.

📌 Key points (3–5)

  • The ambiguity problem: expressions like "1 + 2 + 3 + … + 6" are not precisely defined without clarifying what pattern the dots represent.
  • Recursive definitions solve ambiguity: by defining a base case (e.g., when n = 1) and a rule for building larger cases from smaller ones, we make notation precise.
  • Well Ordered Property guarantees completeness: every non-empty set of positive integers has a least element, which ensures recursive definitions cover all positive integers.
  • Common confusion: dots (…) in a sequence can mean different things depending on the underlying function or pattern—without a definition, the notation is incomplete.
  • Two ways to compute: some quantities (like binomial coefficients) can be calculated either directly via a formula or recursively via simpler cases.

🧩 The ambiguity of informal notation

🧩 What the dots mean

  • The excerpt presents several sequences (e.g., 2, 5, 8, 11, … or 1, 2, 3, 4, 1, 2, 3, 4, 5, …) and asks for the next term.
  • The problem: without a precise rule, the same notation can represent completely different sequences.
  • Example: "1 + 2 + 3 + … + 6" could mean the sum of the first six positive integers (yielding 21) or the sum of the first 19 terms of a more complicated sequence.
  • Key takeaway: dots are shorthand, but they are not a definition—they require a clarifying comment or formula.

🔍 Why this matters

  • Mathematical statements must be unambiguous.
  • Operations like addition and multiplication are binary (they combine two things at a time), so expressions with many terms need a precise order of operations.
  • The excerpt emphasizes: "without a clarifying comment or two, the notation … isn't precisely defined."

🔁 Recursive definitions

🔁 How recursion works

A recursive definition specifies a base case (starting point) and a rule for building larger cases from smaller ones.

  • Base case: define the expression for the smallest input (e.g., n = 1).
  • Recursive step: define the expression for n in terms of n − 1 (or smaller values).
  • Example from the excerpt: to define the sum ∑ᵢ₌₁ⁿ f(i):
    • Base: ∑ᵢ₌₁¹ f(i) = f(1)
    • Recursive: ∑ᵢ₌₁ⁿ f(i) = f(n) + ∑ᵢ₌₁ⁿ⁻¹ f(i) when n > 1

🔢 Factorial example

  • The excerpt points out that writing "n! = n · (n−1) · (n−2) · … · 3 · 2 · 1" has the same ambiguity problem (what do the dots mean?).
  • Recursive definition of factorial:
    • Base: 1! = 1 (or 0! = 1 if starting at 0)
    • Recursive: n! = n · (n−1)! when n > 1
  • This removes ambiguity: each factorial is defined in terms of the previous one.

💻 Code example

The excerpt provides a SageMath/Python function:

def sumrecursive(n):
    if n == 1:
        return 2
    else:
        return sumrecursive(n-1) + (n*n - 2*n + 3)
  • Base case: sumrecursive(1) = 2
  • Recursive case: sumrecursive(n) = sumrecursive(n−1) + (n² − 2n + 3)
  • This gives a precise meaning to the expression 2 + 3 + 6 + 11 + 18 + … + (n² − 2n + 3).
  • Example: sumrecursive(3) = sumrecursive(2) + (9 − 6 + 3) = sumrecursive(2) + 6. Working backward, sumrecursive(2) = 2 + 3 = 5, so sumrecursive(3) = 5 + 6 = 11.

🏛️ The Well Ordered Property

🏛️ What it says

Well Ordered Property of the Positive Integers: Every non-empty set of positive integers has a least element.

  • This is a fundamental property of the positive integers (ℕ).
  • It means: if you have any collection of positive integers (as long as it's not empty), there is always a smallest one.

🔗 Why it guarantees recursive definitions work

  • Suppose we define something recursively (base case + recursive step).
  • To prove the definition covers all positive integers, consider the set of integers for which the expression is not defined.
  • If this set is non-empty, it has a least element (by the Well Ordered Property).
  • But the recursive definition allows us to define the expression for that least element (using the base case or the recursive step applied to smaller values).
  • This contradiction shows the set must be empty—so the definition works for all positive integers.
  • Don't confuse: the Well Ordered Property is not about whether a sequence has a pattern; it's about the structure of the positive integers themselves.

📐 Binomial coefficients and Pascal's triangle

📐 Two ways to define binomial coefficients

The excerpt mentions that binomial coefficients (n choose k) can be defined in two ways:

| Method | Definition | Notes |
| --- | --- | --- |
| Factorial formula | (n choose k) = n! / (k! · (n−k)!) | Direct arithmetic; uses factorial notation |
| Recursive formula | Base: (n choose 0) = (n choose n) = 1; Recursive: (n choose k) = (n−1 choose k−1) + (n−1 choose k) when 0 < k < n | Uses only addition; builds from smaller cases |

🔺 Pascal's triangle

  • The recursive formula corresponds to Pascal's triangle: each entry (except the 1s at the ends) is the sum of the two entries above it (left and right).
  • Combinatorial interpretation: both sides count k-element subsets of {1, 2, …, n}. The right-hand side groups them into subsets that contain n and subsets that don't.
  • Example: (4 choose 2) = (3 choose 1) + (3 choose 2) = 3 + 3 = 6.

⚡ Computational efficiency

  • The excerpt poses a question: which method is faster for large n (e.g., n around 1800–2000, k around 800)?
  • The factorial formula requires computing very large products and then dividing.
  • The recursive formula (Pascal's triangle) requires only additions.
  • The excerpt does not answer the question directly but invites the reader to experiment.

🔢 Recursive solutions to combinatorial problems

🔢 Example: regions formed by lines

The excerpt gives an example (Example 3.3):

  • Problem: n lines are drawn in the plane such that each pair crosses and no three lines meet at the same point. Let r(n) be the number of regions the plane is divided into.
  • Observations: r(1) = 2, r(2) = 4, r(3) = 7, r(4) = 11.
  • Recursive formula:
    • Base: r(1) = 2
    • Recursive: r(n) = n + r(n−1) when n > 1
  • Why this works: label the lines L₁, L₂, …, Lₙ. The nth line Lₙ crosses the previous n−1 lines at n−1 points, dividing Lₙ into n segments (two infinite, the rest finite). Each segment splits an existing region into two, adding n new regions.
  • Example: r(5) = 5 + r(4) = 5 + 11 = 16.

🎯 The power of recursion

  • Recursive formulas often have a natural interpretation (e.g., "adding one more line adds n regions").
  • They allow us to compute values step-by-step without needing a closed-form formula.
  • The excerpt notes that Chapter 9 will return to these problems and obtain "even more compact solutions" (closed-form formulas).

3.4 Binomial Coefficients Revisited

🧭 Overview

🧠 One-sentence thesis

A recursive formula for binomial coefficients—based on Pascal's triangle—provides a computationally efficient alternative to the factorial definition by using only addition.

📌 Key points (3–5)

  • Two definitions of binomial coefficients: the factorial formula and the recursive formula (Pascal's triangle).
  • Recursive rule: if 0 < k < n, then (n choose k) = (n−1 choose k−1) + (n−1 choose k); boundary cases (k=0 or k=n) equal 1.
  • Combinatorial interpretation: the recursion counts k-element subsets by splitting them into those containing element n and those not containing it.
  • Common confusion: factorial-based calculation vs recursion-based calculation—recursion uses only addition and can be faster for large n and moderate k.
  • Pascal's triangle structure: each interior entry is the sum of the two entries diagonally above it.

🔢 Two ways to compute binomial coefficients

🔢 Factorial definition

The binomial coefficient (n choose k) was originally defined in terms of factorial notation.

  • The factorial-based formula is: (n choose k) = n! / (k! × (n − k)!).
  • This definition is "complete and legally-correct" but requires computing large factorials and division.
  • Example: to compute (1800 choose 800), you must calculate very large factorials.

➕ Recursive formula

Let n and k be integers with 0 ≤ k ≤ n. If k = 0 or k = n, set (n choose k) = 1. If 0 < k < n, set (n choose k) = (n−1 choose k−1) + (n−1 choose k).

  • This recursion uses only addition, no multiplication or division.
  • Boundary conditions: (n choose 0) = 1 and (n choose n) = 1.
  • For interior values, the formula breaks the problem into two smaller subproblems.
  • Don't confuse: the recursive formula does not compute factorials at all; it builds up values by repeated addition.
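The recursion translates directly into a memoized function; this sketch compares it against Python's built-in math.comb to confirm that the addition-only rule and the factorial formula agree:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def binom(n, k):
    # Boundary entries of Pascal's triangle are 1
    if k == 0 or k == n:
        return 1
    # Interior entry: sum of the two entries above it
    return binom(n - 1, k - 1) + binom(n - 1, k)

# The addition-only recursion agrees with the factorial formula
for n in range(15):
    for k in range(n + 1):
        assert binom(n, k) == comb(n, k)
print(binom(10, 4))
```

The lru_cache decorator stores already-computed entries, so each value is added together only once, just as in filling out Pascal's triangle row by row.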

🧩 Combinatorial meaning of the recursion

🧩 Counting subsets by cases

  • Both sides of the recursion count the number of k-element subsets of {1, 2, …, n}.
  • The right-hand side groups these subsets into two disjoint cases:
    • Subsets that contain element n: there are (n−1 choose k−1) of these (choose the remaining k−1 elements from the first n−1 elements).
    • Subsets that do not contain element n: there are (n−1 choose k) of these (choose all k elements from the first n−1 elements).
  • Example: to count 3-element subsets of {1,2,3,4}, split into those containing 4 (choose 2 from {1,2,3}) and those not containing 4 (choose 3 from {1,2,3}).

📐 Pascal's triangle

📐 Structure and pattern

  • Pascal's triangle displays the binomial coefficients in rows, with row n containing (n choose 0), (n choose 1), …, (n choose n).
  • Each row starts and ends with 1.
  • Interior entries: each entry is the sum of the entry to the left and the entry to the right in the row directly above.
  • Example (from the excerpt):
    • Row 4: 1, 4, 6, 4, 1
    • Row 5: 1, 5, 10, 10, 5, 1
    • The 10 in row 5 is 4 + 6 from row 4.

📐 Visual representation

The excerpt provides the first nine rows of Pascal's triangle:

| Row | Entries |
| --- | --- |
| 0 | 1 |
| 1 | 1, 1 |
| 2 | 1, 2, 1 |
| 3 | 1, 3, 3, 1 |
| 4 | 1, 4, 6, 4, 1 |
| 5 | 1, 5, 10, 10, 5, 1 |
| 6 | 1, 6, 15, 20, 15, 6, 1 |
| 7 | 1, 7, 21, 35, 35, 21, 7, 1 |
| 8 | 1, 8, 28, 56, 70, 56, 28, 8, 1 |

⚡ Computational efficiency comparison

⚡ Factorial vs recursion speed

  • The excerpt mentions an experiment: which method is faster for computing (n choose m) when n is between 1800 and 2000 and m is around 800?
  • Factorial method: requires computing very large integers (factorials of ~1800) and then division.
  • Recursion method: uses only addition, building up from smaller values.
  • The excerpt implies (via the question) that recursion can be faster for large n and moderate k, especially when using libraries that handle big integers as strings.
  • Don't confuse: "faster" depends on implementation and the specific values of n and k; the excerpt does not give a definitive answer but prompts the reader to think about the trade-offs.

🔗 Connection to other recursive problems

🔗 Recursive problem-solving pattern

  • The section introduces binomial coefficients as an example of recursive definitions.
  • The excerpt also mentions other recursive combinatorial problems (lines dividing the plane, checkerboard tilings, ternary strings) in section 3.5.
  • Common pattern: define base cases, then express the solution for n in terms of solutions for smaller values.
  • Example from the excerpt (lines in the plane): r(1) = 2, and for n > 1, r(n) = n + r(n−1).
  • Example from the excerpt (checkerboard tilings): t(1) = 1, t(2) = 2, and for n > 2, t(n) = t(n−1) + t(n−2).
  • Example from the excerpt (ternary strings): g(1) = 3, g(2) = 8, and for n > 2, g(n) = 2×g(n−1) + (g(n−1) − g(n−2)).

🔗 Why recursion matters

  • Recursion provides a systematic way to compute values by breaking problems into smaller subproblems.
  • It often leads to efficient algorithms (as with Pascal's triangle) and has natural combinatorial interpretations.
  • The excerpt notes that "in Chapter 9, we return to these problems and obtain even more compact solutions," suggesting that recursion is a stepping stone to deeper techniques.

3.5 Solving Combinatorial Problems Recursively

🧭 Overview

🧠 One-sentence thesis

Recursive formulas allow us to solve combinatorial problems by breaking them into smaller instances of the same problem, which can be computed efficiently even for large inputs.

📌 Key points (3–5)

  • Core strategy: identify a base case and a recursive relationship that expresses the solution for n in terms of solutions for smaller values.
  • Combinatorial examples: counting regions formed by lines, tiling patterns, and "good" strings all follow recursive patterns.
  • Greatest common divisor: the Euclidean algorithm recursively reduces the problem by replacing (m, n) with (n, remainder).
  • Merge sort: sorting can be done recursively by splitting a sequence, sorting each half, then merging the sorted halves.
  • Common confusion: recursive vs iterative—recursive solutions are elegant but may use more memory; iterative loops can be more efficient for the same problem.

🔢 Combinatorial counting problems

🔢 Lines and regions in the plane

Problem: n lines in the plane, each pair crossing, no three meeting at one point; find r(n), the number of regions.

  • Base case: r(1) = 2 (one line splits the plane into two regions).
  • Recursive formula: r(n) = n + r(n − 1) for n > 1.
  • Why it works: when you add the n-th line (call it L_n), it crosses the previous n − 1 lines at n − 1 points, dividing L_n into n segments (two infinite, the rest finite). Each segment splits an existing region into two, adding n new regions.
  • Example: r(5) = 5 + 11 = 16, r(6) = 6 + 16 = 22, r(7) = 7 + 22 = 29.
  • Practical note: even by hand, you can compute r(100) "before lunch."
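Computing r(100) takes a computer far less time than lunch; a minimal iterative sketch of the recursion:

```python
def regions(n):
    """r(1) = 2; r(n) = n + r(n-1), computed iteratively."""
    r = 2
    for k in range(2, n + 1):
        r += k
    return r

assert [regions(n) for n in range(1, 8)] == [2, 4, 7, 11, 16, 22, 29]
print(regions(100))
```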

🧩 Tiling a checkerboard

Problem: tile a 2 × n checkerboard with 1 × 2 and 2 × 1 rectangles; find t(n), the number of tilings.

  • Base cases: t(1) = 1, t(2) = 2.
  • Recursive formula: t(n) = t(n − 1) + t(n − 2) for n > 2.
  • Why it works: consider the rectangle covering the upper-right corner square.
    • If it is vertical, the first n − 1 columns form a valid tiling → t(n − 1) ways.
    • If it is horizontal, the rectangle below it must also be horizontal, and the first n − 2 columns form a valid tiling → t(n − 2) ways.
  • Example: t(3) = 1 + 2 = 3, t(4) = 2 + 3 = 5, t(5) = 3 + 5 = 8.
  • Practical note: t(100) is doable by hand; a computer can easily compute t(1000).
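The Fibonacci-style recursion for t(n) can be computed iteratively in the same way, keeping only the last two values:

```python
def tilings(n):
    """t(1) = 1, t(2) = 2; t(n) = t(n-1) + t(n-2)."""
    if n == 1:
        return 1
    prev, cur = 1, 2
    for _ in range(n - 2):
        prev, cur = cur, prev + cur
    return cur

assert [tilings(n) for n in range(1, 6)] == [1, 2, 3, 5, 8]
print(tilings(100))
```

Because only two previous values are kept, this handles t(1000) with no trouble, as the text suggests.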

🔤 Good ternary strings

Problem: a ternary string (digits 0, 1, 2) is "good" if it never has a 2 immediately followed by a 0; find g(n), the number of good strings of length n.

  • Base cases: g(1) = 3 (all single-digit strings are good), g(2) = 8 (only "2,0" is bad, so 9 − 1 = 8).
  • Recursive formula: g(n) = 3·g(n − 1) − g(n − 2) for n > 2.
  • Why it works: partition good strings of length n by their last character.
    • Ending in 1: any good string of length n − 1 can precede it → g(n − 1) ways.
    • Ending in 2: any good string of length n − 1 can precede it → g(n − 1) ways.
    • Ending in 0: can be preceded by a good string of length n − 1 that does not end in 2. There are g(n − 1) good strings of length n − 1, and exactly g(n − 2) of them end in 2, so g(n − 1) − g(n − 2) ways.
    • Total: g(n − 1) + g(n − 1) + [g(n − 1) − g(n − 2)] = 3·g(n − 1) − g(n − 2).
  • Example: g(3) = 3·8 − 3 = 21, g(4) = 3·21 − 8 = 55.
  • Practical note: g(100) is doable by hand; a computer can compute g(5000).
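
The same iterative pattern handles the good-string recurrence (a sketch; the function name g_good is ours):

```python
def g_good(n):
    """Ternary strings of length n with no '2' immediately followed by '0',
    via g(1) = 3, g(2) = 8, g(n) = 3*g(n-1) - g(n-2)."""
    if n == 1:
        return 3
    a, b = 3, 8  # g(1), g(2)
    for _ in range(3, n + 1):
        a, b = b, 3 * b - a
    return b
```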

🔍 Greatest common divisor via recursion

🔍 Division theorem

Division Theorem: For positive integers m and n, there exist unique integers q (quotient) and r (remainder) such that m = q·n + r and 0 ≤ r < n.

  • This is the foundation for the Euclidean algorithm.
  • The excerpt proves existence by contradiction from a smallest counterexample (an argument equivalent to strong induction).
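
As a quick illustration, Python's built-in divmod returns exactly the quotient-remainder pair the theorem guarantees; here with m = 3920 and n = 252:

```python
# divmod returns the pair guaranteed by the Division Theorem:
# m = q*n + r with 0 <= r < n (for positive m and n).
q, r = divmod(3920, 252)
# q == 15, r == 140, and 3920 == 15 * 252 + 140
```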

🔁 Euclidean algorithm

Euclidean Algorithm: Let m > n be positive integers, and let m = q·n + r with 0 ≤ r < n. If r > 0, then gcd(m, n) = gcd(n, r). If r = 0, then gcd(m, n) = n.

  • Why it works: from m = q·n + r, we can write m − q·n = r. Any divisor of m and n must also divide r; conversely, any divisor of n and r must also divide m. So the set of common divisors is the same, and the greatest is the same.
  • Recursive code snippet (from the excerpt, completed into a full function):
    def gcd(m, n):
        if m % n == 0: return n
        else: return gcd(n, m % n)
    
  • Trade-off: recursive calls are elegant but use more memory; an iterative loop version is more memory-efficient.
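
The iterative version the excerpt alludes to is a short loop (a sketch; the function name gcd_iter is ours):

```python
def gcd_iter(m, n):
    """Iterative Euclidean algorithm: replace (m, n) by (n, m % n)
    until the remainder is zero; no call stack is consumed."""
    while n != 0:
        m, n = n, m % n
    return m
```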

🧮 Linear Diophantine equations

Theorem: For positive integers m and n, the equation am + bn = c has a solution in integers a and b (not necessarily non-negative) if and only if c is a multiple of gcd(m, n).

  • How to find a and b: work backward through the Euclidean algorithm steps, "solving" each equation for the remainder and substituting.
  • Example (from the excerpt): for m = 3920, n = 252:
    1. 3920 = 15·252 + 140
    2. 252 = 1·140 + 112
    3. 140 = 1·112 + 28
    4. 112 = 4·28 + 0 → gcd = 28
    • Work backward:
      • 28 = 140 − 1·112
      • 28 = 140 − 1·(252 − 1·140) = 2·140 − 1·252
      • 28 = 2·(3920 − 15·252) − 1·252 = 2·3920 − 31·252
    • So a = 2, b = −31.
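
The back-substitution can be automated; a sketch of the standard extended Euclidean algorithm (the function name ext_gcd is ours):

```python
def ext_gcd(m, n):
    """Return (d, a, b) with d = gcd(m, n) and d = a*m + b*n."""
    if n == 0:
        return m, 1, 0
    d, a, b = ext_gcd(n, m % n)
    # d = a*n + b*(m % n) = a*n + b*(m - (m//n)*n)
    #   = b*m + (a - (m//n)*b)*n
    return d, b, a - (m // n) * b
```

For the excerpt's example, ext_gcd(3920, 252) returns (28, 2, -31), matching a = 2 and b = −31.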

🔀 Merge sort: recursive sorting

🔀 The merge operation

Merge: given two sorted sequences u₀ < u₁ < … < u_(s−1) and v₀ < v₁ < … < v_(t−1), combine them into a single sorted sequence of length s + t.

  • Algorithm:
    1. Compare the smallest remaining element in each sequence.
    2. Append the smaller one to the output list and remove it from its sequence.
    3. Repeat until both sequences are exhausted.
  • Code logic (from the excerpt): maintain pointers p and q for the two sequences; at each step, append min(u[p], v[q]) and advance the corresponding pointer.
  • Example: merging [1, 2, 7, 9, 11, 15] and [3, 5, 8, 100, 130, 275] yields [1, 2, 3, 5, 7, 8, 9, 11, 15, 100, 130, 275].

🔀 Recursive merge sort strategy

Merge sort: to sort a sequence a₁, a₂, …, aₙ, split it into two halves, recursively sort each half, then merge the sorted halves.

  • Steps:
    1. Set s = ⌈n/2⌉ and t = ⌊n/2⌋.
    2. Let u be the first s elements, v be the last t elements.
    3. Recursively sort u and v.
    4. Merge the two sorted subsequences.
  • Example (from the excerpt): to sort (2, 8, 5, 9, 3, 7, 4, 1, 6), split into (2, 8, 5, 9, 3) and (7, 4, 1, 6), sort each, then merge.
  • Why it's optimal: merge sort is "one of several optimal algorithms for sorting" (the excerpt notes that introductory computer science courses cover this in depth).
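
The merge step and the recursive split fit together as follows (a minimal sketch; the helper names are ours):

```python
def merge(u, v):
    """Merge two sorted lists into one sorted list."""
    out, p, q = [], 0, 0
    while p < len(u) and q < len(v):
        if u[p] <= v[q]:
            out.append(u[p]); p += 1
        else:
            out.append(v[q]); q += 1
    return out + u[p:] + v[q:]  # append whichever tail remains

def merge_sort(a):
    """Split into halves, recursively sort each, then merge."""
    if len(a) <= 1:
        return a
    s = (len(a) + 1) // 2  # s = ceil(n/2), so t = floor(n/2)
    return merge(merge_sort(a[:s]), merge_sort(a[s:]))
```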

⚖️ Recursive vs iterative trade-offs

  • Recursive approach: elegant, mirrors the problem structure, but uses more memory due to function call overhead.
  • Iterative approach: uses only a loop, more memory-efficient, but may be less intuitive.
  • Example: the Euclidean algorithm can be written either way; the excerpt mentions "the disadvantage of this approach is the somewhat wasteful use of memory due to recursive function calls."
  • Don't confuse: recursion is a problem-solving strategy, not a performance requirement—many recursive solutions can be rewritten iteratively.

Mathematical Induction

3.6 Mathematical Induction

🧭 Overview

🧠 One-sentence thesis

The Principle of Mathematical Induction provides a method to prove that an open statement is true for all positive integers by verifying it for the base case and showing that truth for any integer k implies truth for k+1.

📌 Key points (3–5)

  • What open statements are: mathematical statements involving a positive integer n that may be valid for some, none, or all positive integers.
  • The induction principle: if a statement is true for n=1 and assuming truth for k implies truth for k+1, then the statement is true for all positive integers.
  • Common confusion: not all open statements are true for all n—some have limited solutions, some have none, and only certain statements (like sums and factorials) hold universally.
  • Connection to recursion: inductive definitions are the "twin" of recursive definitions, offering an alternative way to define operations like factorial and summation.
  • Logical foundation: the Principle of Mathematical Induction is logically equivalent to the Well Ordered Property of the Positive Integers.

🔢 Understanding open statements

🔢 What open statements are

Open statements: mathematical statements involving a positive integer n that can be considered as equations valid for certain values of n.

  • They are not always true; they are "open" in the sense that their validity depends on which value of n you choose.
  • The excerpt emphasizes that open statements can have different solution sets among positive integers.

📊 Range of validity

The excerpt provides seven examples showing different validity patterns:

| Statement type | Example from excerpt | Number of solutions |
| --- | --- | --- |
| Single solution | 2n + 7 = 13 | Only n = 3 |
| No solutions | 3n − 5 = 9 | Never valid |
| Limited solutions | n² − 5n + 9 = 3 | Exactly two solutions |
| Limited solutions | 8n − 3 < 48 | Six solutions |
| Always true | 8n − 3 > 0 | All positive integers |
| Always true | (n+3)(n+2) = n² + 5n + 6 | All positive integers |
| Always true | n² − 6n + 13 ≥ 0 | All positive integers |

🧩 More complex statements

The excerpt then introduces harder-to-verify statements:

  • The sum of the first n positive integers equals n(n+1)/2.
  • The sum of the first n odd positive integers equals n².
  • A statement involving n to the nth power, factorials, and large constants when n=14.

Why these matter: Simple statements like "2n + 7 = 13" can be checked by algebra, but statements about sums or products for "all n" require a systematic proof method—this is where induction comes in.

🔁 The Principle of Mathematical Induction

🔁 The core principle

Principle of Mathematical Induction: Let S_n be an open statement involving a positive integer n. If S_1 is true, and if for each positive integer k, assuming that the statement S_k is true implies that the statement S_(k+1) is true, then S_n is true for every positive integer n.

In plain language:

  • Step 1 (base case): Show the statement is true when n=1.
  • Step 2 (inductive step): Assume the statement is true for some arbitrary positive integer k, then prove it must also be true for k+1.
  • Conclusion: If both steps succeed, the statement is true for all positive integers.

🧠 Why it works

  • The excerpt notes that this principle is "logically equivalent to the Well Ordered Property of the Positive Integers."
  • Think of it as a domino effect: if the first domino falls (base case) and each domino knocks over the next (inductive step), all dominos fall.
  • Don't confuse: induction does not mean "check a few cases and assume the rest"—you must prove the implication "k true → k+1 true" in general, not just verify specific numbers.

🔗 Connection to recursion

  • The excerpt calls induction the "powerful twin of recursion."
  • Recursive definitions build up from smaller cases; induction proves properties by assuming smaller cases and deducing the next.
  • Both rely on a base case and a step-by-step progression.

🏗️ Inductive definitions

🏗️ Recasting recursion as induction

The excerpt explains that "recursive definitions can also be recast in an inductive setting."

Example 1: Factorial

  • Set 1! = 1 (base case).
  • Whenever k! has been defined, set (k+1)! = (k+1) · k!.
  • This is an inductive definition: each new value is defined in terms of the previous one.

Example 2: Summation

  • Set the sum from i=1 to 1 of f(i) equal to f(1).
  • Set the sum from i=1 to k+1 of f(i) equal to the sum from i=1 to k of f(i) plus f(k+1).
  • Again, the definition builds step-by-step from the base case.

✖️ Defining multiplication inductively

The excerpt gives a deeper example: suppose you know addition but have never heard of multiplication. You can define it inductively:

  • Let m be a positive integer.
  • Set m · 1 = m (base case).
  • Set m · (k+1) = m · k + m (inductive step).

Important note: This defines multiplication but does not establish properties like commutativity or associativity—those would require separate proofs.

Don't confuse: an inductive definition tells you what an operation is; proving properties of that operation (e.g., that m · n = n · m) is a separate task, often done by induction itself.

📝 Proofs by induction

📝 The standard example

The excerpt introduces "the 'Hello World' example" of induction proofs:

Proposition: For every positive integer n, the sum of the first n positive integers is n(n+1)/2.

In other words: 1 + 2 + 3 + ... + n = n(n+1)/2.

How the proof would proceed (the excerpt states the proposition but does not show the full proof):

  1. Base case: Check n=1. The sum of the first 1 positive integer is 1, and 1(1+1)/2 = 1. ✓
  2. Inductive step: Assume the formula holds for some k (i.e., 1 + 2 + ... + k = k(k+1)/2). Then show it holds for k+1:
    • The sum of the first k+1 integers is (1 + 2 + ... + k) + (k+1).
    • By the inductive hypothesis, this equals k(k+1)/2 + (k+1).
    • Simplify: k(k+1)/2 + (k+1) = [k(k+1) + 2(k+1)]/2 = [(k+1)(k+2)]/2, which is the formula for n=k+1. ✓
  3. Conclusion: By induction, the formula holds for all positive integers n.

🎯 Why induction is necessary

  • You cannot verify the statement for infinitely many values of n by hand.
  • Induction provides a finite, rigorous proof that covers all cases.
  • The excerpt emphasizes that statements like "the sum of the first n odd positive integers is n²" require this systematic approach to establish validity.

Inductive Definitions

3.7 Inductive Definitions

🧭 Overview

🧠 One-sentence thesis

Inductive definitions provide a systematic way to define operations and sequences by specifying a base case and a rule for building each next term from the previous one, mirroring the structure of mathematical induction.

📌 Key points (3–5)

  • What inductive definitions do: define operations or sequences by giving a starting value and a rule to generate the next term from the current one.
  • Relationship to recursion: inductive definitions are essentially recursive definitions recast in an inductive setting—primarily a matter of taste.
  • How they work: specify the first case explicitly, then define each subsequent case in terms of the previous case(s).
  • Common confusion: inductive definitions define what something is but do not automatically prove properties like commutativity or associativity—those require separate proofs.
  • Why they matter: they allow rigorous construction of familiar operations (like multiplication) from more basic operations (like addition).

🔧 Structure of inductive definitions

🔧 Two-part structure

Every inductive definition has two components:

  • Base case: explicitly define the first value (e.g., when n = 1)
  • Inductive step: define the (k+1)-th value in terms of the k-th value

This mirrors the structure of mathematical induction proofs but is used for defining rather than proving.

🔄 Relationship to recursion

"Recursive definitions can also be recast in an inductive setting. Although it is primarily a matter of taste..."

  • The excerpt emphasizes that inductive and recursive definitions are equivalent ways of expressing the same idea.
  • The choice between them is stylistic rather than mathematical.
  • Example: factorial can be defined recursively or inductively—both capture the same operation.

📐 Examples of inductive definitions

📐 Factorial definition

The excerpt gives factorial as a first example:

  • Base case: set 1! = 1
  • Inductive step: whenever k! has been defined, set (k+1)! = (k+1) · k!

This defines factorial for all positive integers by building each value from the previous one.

📐 Summation notation

The excerpt defines summation inductively:

  • Base case: the sum from i=1 to 1 of f(i) equals f(1)
  • Inductive step: the sum from i=1 to k+1 of f(i) equals [the sum from i=1 to k of f(i)] + f(k+1)

The excerpt notes this uses "abbreviated form" with some English phrases omitted, but the meaning should be clear.

📐 Multiplication from addition

The excerpt provides a more foundational example—defining multiplication assuming only addition is known:

"Let m be a positive integer. Then set m · 1 = m and m · (k+1) = m · k + m"

  • Base case: m times 1 equals m
  • Inductive step: m times (k+1) equals (m times k) plus m
  • This defines what multiplication means but doesn't establish familiar properties.

Don't confuse: defining an operation versus proving its properties. The excerpt emphasizes that this definition "doesn't do anything in terms of establishing such familiar properties as the commutative and associative properties"—those require separate proofs.
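
The inductive definition can be transcribed directly into code (a sketch; only addition appears in the body, and recursion depth limits apply for large k):

```python
def mul(m, k):
    """Multiplication defined inductively from addition:
    m * 1 = m, and m * (k+1) = m * k + m."""
    if k == 1:
        return m  # base case
    return mul(m, k - 1) + m  # inductive step
```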

🎯 What inductive definitions accomplish

🎯 Rigorous construction

Inductive definitions allow building up number systems and operations from scratch:

  • Start with basic operations (like addition)
  • Define more complex operations (like multiplication) in terms of simpler ones
  • Each step is explicit and verifiable

Example: The multiplication definition shows how to construct this operation rigorously, even though we use it intuitively every day.

🎯 Limitations and next steps

The excerpt points out an important limitation:

| What inductive definitions provide | What they don't provide |
| --- | --- |
| Precise meaning of an operation | Proofs of properties like commutativity |
| A way to compute values | Proofs of properties like associativity |
| Rigorous foundation | Verification that familiar rules hold |

The excerpt directs readers to "check out some of the details in Appendix B" for establishing these additional properties, indicating that proving properties requires work beyond the definition itself.


Proofs by Induction

3.8 Proofs by Induction

🧭 Overview

🧠 One-sentence thesis

Mathematical induction proves that an open statement is true for all positive integers by showing it holds for a base case and that truth for any integer k implies truth for k+1.

📌 Key points (3–5)

  • What induction proves: an open statement S_n is true for every positive integer n by establishing a base case and an inductive step.
  • Two essential steps: the basis step (prove S_1 is true) and the inductive step (assume S_k is true, then prove S_{k+1} is true).
  • Common mistake in the inductive step: starting with the entirety of S_{k+1} and manipulating it until you get a true statement—this can accidentally start with something false and arrive at something true through valid algebra.
  • How to distinguish inductive hypothesis from goal: S_k is what you assume (the inductive hypothesis); S_{k+1} is what you must prove.
  • Why it works: the Principle of Mathematical Induction is logically equivalent to the Well Ordered Property of the Positive Integers.

🔧 The Principle of Mathematical Induction

🔧 What the principle states

Principle of Mathematical Induction: Let S_n be an open statement involving a positive integer n. If S_1 is true, and if for each positive integer k, assuming that the statement S_k is true implies that the statement S_{k+1} is true, then S_n is true for every positive integer n.

  • An open statement is a statement involving a variable (here, the positive integer n).
  • The principle gives a two-step recipe: prove the first case, then prove the chain of implications.
  • The excerpt notes this principle is logically equivalent to the Well Ordered Property of the Positive Integers.

🧩 Why induction works

  • Once you show S_1 is true and that S_k → S_{k+1} for every k, you create a domino effect:
    • S_1 is true (basis).
    • S_1 true implies S_2 true (inductive step with k=1).
    • S_2 true implies S_3 true (inductive step with k=2).
    • And so on for all positive integers.

📐 The two key steps in every induction proof

📐 Basis step

  • What it is: prove that S_1 is true (or S_5 if you only need the statement for n ≥ 5, etc.).
  • How to do it: if S_n is an equation, evaluate both the left-hand side and the right-hand side of S_1 separately and show they are equal.
  • Don't just write it down: the excerpt emphasizes you must prove S_1 is true, not just state it.

Example from the excerpt: For the statement "sum of first n positive integers = n(n+1)/2," the basis step checks n=1:

  • Left side: 1
  • Right side: 1(1+1)/2 = 1
  • They are equal, so S_1 is true.

🔁 Inductive step

  • What it is: assume S_k is true for some positive integer k (this assumption is the inductive hypothesis), then prove S_{k+1} is true.
  • How to do it (for equations/inequalities):
    • Work on one side of S_{k+1} (usually the left-hand side).
    • Find a place to apply the inductive hypothesis (substitute the formula for S_k).
    • Continue manipulating until you obtain the other side of S_{k+1}.
  • Alternative approach: work on the left-hand side of S_{k+1} separately and the right-hand side of S_{k+1} separately; if you can manipulate both to the same form, you've shown they are equal.

⚠️ Common mistake

  • What NOT to do: start with the entirety of S_{k+1} and manipulate it until you obtain a true statement.
  • Why this is dangerous: it is possible to start with something false and, through valid algebraic steps, obtain a true statement—this does not prove S_{k+1} is true.
  • Correct approach: work from one side of S_{k+1} toward the other, applying the inductive hypothesis along the way.

🧮 Example: Sum of first n positive integers

🧮 The proposition

Proposition 3.12: For every positive integer n, the sum of the first n positive integers is n(n+1)/2, i.e., the sum from i=1 to n of i equals n(n+1)/2.

🧮 Proof structure (detailed version)

  1. Identify the open statement: Let S_n be "the sum from i=1 to n of i equals n(n+1)/2."
  2. Basis step: Prove S_1 is true.
    • Left side of S_1: 1
    • Right side of S_1: 1(1+1)/2 = 1
    • They are equal, so S_1 is true.
  3. Inductive step: Assume S_k is true (i.e., the sum from i=1 to k of i equals k(k+1)/2). Prove S_{k+1} is true.
    • Consider the left side of S_{k+1}: the sum from i=1 to k+1 of i.
    • Rewrite it: (sum from i=1 to k of i) + (k+1).
    • Apply the inductive hypothesis: k(k+1)/2 + (k+1).
    • Simplify: k(k+1)/2 + (k+1) = (k² + 3k + 2)/2 = (k+1)(k+2)/2.
    • This is the right side of S_{k+1}, so S_{k+1} is true.
  4. Conclusion: By the Principle of Mathematical Induction, S_n is true for all positive integers n.

🧮 Refined proof style

  • The excerpt shows a second, more concise version of the same proof.
  • It omits explicit mention of "S_n" and "open statement" but follows the same logic.
  • The refined style is preferred once you have experience; beginners may find the detailed version clearer.

🔢 Example: Sum of first n odd positive integers

🔢 The proposition

Proposition 3.13: For each positive integer n, the sum of the first n odd positive integers is n², i.e., the sum from i=1 to n of (2i - 1) equals n².

🔢 Proof outline

  1. Basis step: When n=1, the left side is 2(1)-1 = 1, and the right side is 1² = 1. True.
  2. Inductive step: Assume the formula holds when n=k, i.e., the sum from i=1 to k of (2i-1) equals k².
    • Consider the sum from i=1 to k+1 of (2i-1).
    • Rewrite: (sum from i=1 to k of (2i-1)) + (2k+1).
    • Apply inductive hypothesis: k² + (2k+1).
    • Simplify: k² + 2k + 1 = (k+1)².
    • This is the right side for n=k+1, so S_{k+1} is true.
  3. Conclusion: By induction, the proposition holds for all positive integers n.
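
Propositions 3.12 and 3.13 are easy to spot-check numerically (a sanity check, not a substitute for the induction proofs):

```python
# Check 1 + 2 + ... + n == n(n+1)/2 and 1 + 3 + ... + (2n-1) == n^2
# for a range of small n.
for n in range(1, 101):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
    assert sum(2 * i - 1 for i in range(1, n + 1)) == n * n
```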

🔢 Combinatorial vs. inductive proofs

  • The excerpt notes that some mathematicians prefer combinatorial proofs (given in an earlier section) because they reveal "what is really going on."
  • However, you should be able to give a formal proof by mathematical induction when pressed.
  • Preference: give a combinatorial proof when you can find one; otherwise, use induction.

🔺 Example: Sum of binomial coefficients

🔺 The proposition

Proposition 3.14: Let n and k be non-negative integers with n ≥ k. Then the sum from i=k to n of the binomial coefficient (i choose k) equals (n+1 choose k+1).

🔺 Proof strategy

  • Fix k: treat k as a constant and prove the formula by induction on n.
  • Basis step: When n=k, the left side is (k choose k) = 1, and the right side is (k+1 choose k+1) = 1. True.
  • Inductive step: Assume the formula holds when n=m (for some m ≥ k), i.e., the sum from i=k to m of (i choose k) equals (m+1 choose k+1).
    • Consider the sum from i=k to m+1 of (i choose k).
    • Rewrite: (sum from i=k to m of (i choose k)) + (m+1 choose k).
    • Apply inductive hypothesis: (m+1 choose k+1) + (m+1 choose k).
    • Use the binomial identity: (m+1 choose k+1) + (m+1 choose k) = (m+2 choose k+1).
    • This is the right side for n=m+1, so the formula holds for n=m+1.
  • Conclusion: By induction, the proposition holds for all n ≥ k.
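
Proposition 3.14 can likewise be spot-checked for small parameters (a sanity check using Python's math.comb):

```python
from math import comb

# Verify sum_{i=k}^{n} C(i, k) == C(n+1, k+1) for small k and n.
for k in range(0, 8):
    for n in range(k, 20):
        assert sum(comb(i, k) for i in range(k, n + 1)) == comb(n + 1, k + 1)
```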

🔍 Strong Induction (introduction)

🔍 When ordinary induction is not sufficient

  • The excerpt introduces a scenario where the standard Principle of Mathematical Induction "does not seem sufficient."
  • Example: A function f(n) is defined recursively by f(n) = 2f(n-1) - f(n-2), with f(1)=3 and f(2)=5.
  • Bob computes f(3)=7 and f(4)=9, and conjectures that f(n) = 2n+1 for all n ≥ 1.
  • He tries to prove this by induction: basis step is fine (f(1)=3=2·1+1), but in the inductive step, assuming f(k)=2k+1 is not enough to prove f(k+1)=2(k+1)+1, because the recursive definition of f(k+1) depends on both f(k) and f(k-1).

🔍 The limitation

  • The standard inductive hypothesis (assume S_k is true) only gives you information about one previous case.
  • When a recursive definition or statement depends on multiple previous cases, you need a stronger assumption.
  • The excerpt hints at "Strong Induction" as the solution (the section title is "3.9 Strong Induction"), but the text cuts off before explaining the principle.

🔍 What to expect (based on the excerpt)

  • Strong Induction will allow you to assume S_1, S_2, ..., S_k are all true (not just S_k), and then prove S_{k+1}.
  • This stronger inductive hypothesis will handle recursive definitions that depend on more than one previous term.

Strong Induction

3.9 Strong Induction

🧭 Overview

🧠 One-sentence thesis

Strong induction extends ordinary induction by allowing you to assume the statement holds for all smaller cases (not just the immediately preceding case), which is necessary when a recursive definition depends on multiple earlier terms.

📌 Key points (3–5)

  • Why ordinary induction sometimes fails: when a recursive formula depends on more than one previous term (e.g., f(n) = 2f(n−1) − f(n−2)), assuming only the k-th case is true doesn't give you enough information.
  • What strong induction adds: you assume the statement holds for all integers m with 1 ≤ m ≤ k, not just for m = k.
  • Common confusion: strong induction is not a different principle—it is logically equivalent to ordinary induction but sometimes more convenient ("stronger" means you assume more in the inductive step).
  • The "bootstrap" phenomenon: proving something stronger can actually make the proof easier, because you have more to work with in the inductive step.

🔄 When ordinary induction is not enough

🔄 Bob's problem with a two-term recurrence

  • Bob is given a function defined by f(n) = 2f(n−1) − f(n−2), with f(1) = 3 and f(2) = 5.
  • He conjectures that f(n) = 2n + 1 for all n ≥ 1.
  • Base step works: f(1) = 3 = 2·1 + 1 ✓
  • Inductive step breaks down:
    • Assume f(k) = 2k + 1 (ordinary induction hypothesis).
    • Then f(k+1) = 2f(k) − f(k−1) = 2(2k + 1) − f(k−1).
    • But Bob doesn't know what f(k−1) equals under the induction hypothesis—he only assumed the formula for k, not for k−1.
  • Why it fails: the recurrence depends on two earlier values, but ordinary induction only gives you one.

🧩 What Bob needs

  • To compute f(k+1), he needs both f(k) = 2k + 1 and f(k−1) = 2(k−1) + 1.
  • In other words, he needs the formula to hold for all m with 1 ≤ m ≤ k, not just m = k.
  • This is exactly what strong induction provides.

💪 The Strong Principle of Mathematical Induction

💪 Statement of the principle

Strong Principle of Mathematical Induction: To prove that an open statement S_n is valid for all n ≥ 1, it is enough to:

(a) Show that S_1 is valid (base step), and

(b) Show that S_(k+1) is valid whenever S_m is valid for all integers m with 1 ≤ m ≤ k (strong inductive step).

  • Key difference from ordinary induction: in step (b), you assume S_m holds for every m from 1 up to k, not just for m = k.
  • The excerpt notes that this principle is "trivial" in the sense that it is logically valid—it is actually stronger (you assume more) than ordinary induction, so it is at least as powerful.

🔧 How Bob uses it

  • Base step: f(1) = 3 = 2·1 + 1 ✓
  • Strong inductive step: Assume f(m) = 2m + 1 for all 1 ≤ m ≤ k.
  • Then:
    • f(k+1) = 2f(k) − f(k−1)
    • = 2(2k + 1) − (2(k−1) + 1) [using the hypothesis for both k and k−1]
    • = 4k + 2 − 2k + 2 − 1
    • = 2k + 3
    • = 2(k+1) + 1 ✓
  • The proof now works because both f(k) and f(k−1) are covered by the strong induction hypothesis.
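
Bob's conjecture is easy to test by machine as well (a sketch; the iterative f mirrors the two-term recurrence):

```python
def f(n):
    """f(1) = 3, f(2) = 5, f(n) = 2*f(n-1) - f(n-2)."""
    a, b = 3, 5
    if n == 1:
        return a
    for _ in range(3, n + 1):
        a, b = b, 2 * b - a
    return b

# The closed form proved by strong induction:
assert all(f(n) == 2 * n + 1 for n in range(1, 50))
```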

🎯 The "bootstrap" phenomenon

  • What it means: sometimes proving a stronger statement is actually easier than proving a weaker one.
  • Why: a stronger hypothesis in the inductive step gives you more tools to work with.
  • Example: assuming f(m) = 2m + 1 for all m ≤ k (strong) is more useful than assuming it only for m = k (ordinary), even though you are trying to prove the same final result.
  • The excerpt calls this a "bootstrap" phenomenon—you pull yourself up by assuming more.

🆚 Strong vs ordinary induction

🆚 Logical relationship

| Aspect | Ordinary induction | Strong induction |
| --- | --- | --- |
| Inductive hypothesis | Assume S_k holds | Assume S_m holds for all 1 ≤ m ≤ k |
| What you prove | S_(k+1) | S_(k+1) |
| Logical strength | Weaker assumption | Stronger assumption (you assume more) |
| Validity | Valid | Also valid (and at least as powerful) |

  • The excerpt emphasizes that strong induction is "stronger than the principle of induction"—meaning you assume more in the hypothesis, so it is easier to prove the inductive step.
  • Don't confuse: "stronger" does not mean it proves more theorems; it means the hypothesis is stronger (you get to assume more), which can make certain proofs possible.

🔍 When to use which

  • Ordinary induction: sufficient when the statement or recurrence depends only on the immediately preceding case.
  • Strong induction: necessary (or at least much more convenient) when the definition or argument depends on multiple earlier cases.
  • Example: if f(n) depends on f(n−1) and f(n−2), you need strong induction to have both values available in the inductive step.

🗨️ Discussion points from the excerpt

🗨️ Combinatorial vs inductive proofs

  • The excerpt mentions an ongoing debate: some prefer combinatorial proofs (which "show what is really going on"), others prefer formal induction proofs.
  • The excerpt's perspective: "you should prefer to give a combinatorial proof—when you can find one. But if pressed, you should be able to give a formal proof by mathematical induction."
  • One character (Dave) insists "Combinatorial proofs can always be made rigorous," while another (Xing) prefers induction as a formal proof method.

🗨️ Recursion vs induction in programming

  • Xing notes that in programming, recursion can "overload the stack," so loops are often preferred.
  • Example: computing the greatest common divisor using a loop (iterative) vs using recursion with backtracking.
  • This is a practical distinction, not a logical one—recursion and induction are closely related conceptually, but recursion has computational costs.

🗨️ The strange sequence example

  • Alice mentions a sequence: 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, …
  • Dave explains: the n-th term is the fewest number of U.S. coins required to make n cents.
  • This is an aside illustrating a real-world recursion/induction problem (though not directly related to strong induction).

3.10 Discussion

3.10 Discussion

🧭 Overview

🧠 One-sentence thesis

The discussion explores the relationship between combinatorial proofs and induction, the practical meaning of a recursive sequence (fewest coins for change), and the real-world relevance of induction and recursion in verifying large software systems.

📌 Key points (3–5)

  • Combinatorial vs induction debate: Xing prefers induction as more rigorous; Dave argues combinatorial proofs can be made rigorous too.
  • The mysterious sequence explained: The sequence represents the fewest number of U.S. coins needed to make change for n cents.
  • Recursion vs loops in programming: Xing notes that recursion can overload the stack, so loops are often preferred for efficiency (e.g., computing greatest common divisor).
  • Common confusion: Induction and recursion appear similar but serve different purposes—Bob doesn't see the difference; Xing and Carlos clarify through programming and verification contexts.
  • Practical value: Zori questions real-world use; Carlos explains that induction and recursion principles underpin correctness proofs for large software projects.

🤔 Proof methods debate

🤔 Combinatorial proofs vs induction

  • Xing's view: Induction is a formal proof; combinatorial arguments might not be "really a proof."
  • Dave's counter: "Combinatorial proofs can always be made rigorous."
  • The excerpt does not resolve the debate—it shows two valid perspectives on proof style.
  • Don't confuse: this is about proof style preference, not about whether one method is mathematically invalid.

🪙 The mysterious sequence

🪙 What the sequence represents

The term aₙ is the fewest number of U.S. coins required to total to n cents.

  • The sequence 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, ... puzzled Alice.
  • Dave explains: each term is the minimum number of coins (using U.S. denominations: penny, nickel, dime, quarter, etc.) needed to make n cents.
  • Example: to make 6 cents, you can use 1 nickel + 1 penny = 2 coins rather than 6 pennies = 6 coins, so the 6th term of the sequence is 2.
  • Carlos finds this explanation clever; others groan.
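
Dave's interpretation can be reproduced with a short dynamic program (a sketch; the denomination set, which includes the half-dollar, and the function name min_coins are our assumptions):

```python
def min_coins(n, denominations=(1, 5, 10, 25, 50)):
    """Fewest U.S. coins totaling n cents, by dynamic programming:
    a(c) = 1 + min over coins d <= c of a(c - d), with a(0) = 0."""
    a = [0] * (n + 1)
    for cents in range(1, n + 1):
        a[cents] = 1 + min(a[cents - d] for d in denominations if d <= cents)
    return a[n]
```

The first terms it produces match the sequence Alice quotes.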

🔁 Recursion vs loops

🔁 Programming perspective on recursion

  • Bob's confusion: "I still don't see any difference between induction and recursion."
  • Dave's quip: "No one does."
  • Xing's clarification: In many programming languages, recursion is avoided in favor of loops to prevent overloading the stack.
    • Recursion with backtracking uses more memory.
    • Loops can accomplish the same task (e.g., computing greatest common divisor and finding coefficients a, b such that d = am + bn) with less storage.
  • Don't confuse: recursion (a programming technique) with induction (a proof technique)—they are related conceptually but serve different purposes.

💼 Practical relevance

💼 Real-world applications

  • Zori's skepticism: "Who's going to pay me to find greatest common divisors?"
  • Dave's blunt answer: "Nobody."
  • Alice's insight: There may be underlying principles with practical application.
  • Carlos's key point: Induction and recursion principles are essential for establishing correctness in large software projects.
    • Big projects have hundreds of thousands of lines of code.
    • Different teams write different parts at different times.
    • Proving that the program does what it's supposed to do is very difficult.
  • Zori's reaction: She perks up, seeing a potential way to earn a salary (i.e., software verification work).

💼 Bob's naive view

  • Bob: "When I write a program, I just pay attention to details and after just a few corrections, they always work."
  • Alice's rebuttal: "Maybe that's because you don't do anything complicated."
  • This highlights the gap between small, simple programs (where informal testing suffices) and large, complex systems (where formal reasoning about correctness is necessary).

🧩 Summary of perspectives

| Person | View |
| --- | --- |
| Xing | Prefers induction; sees it as more rigorous than combinatorial proofs; knows recursion can be inefficient in programming |
| Dave | Defends combinatorial proofs as rigorous; provides the coin-change interpretation of the sequence |
| Alice | Questions Dave's explanations but sees potential practical value in the principles |
| Bob | Confused about induction vs recursion; thinks simple testing is enough for programs |
| Carlos | Sees induction/recursion as foundational for software correctness proofs in large projects |
| Zori | Initially skeptical of practical value; becomes interested when she hears about software verification as a career |
| Yolanda | Impressed by Xing's programming knowledge (mentioned briefly) |

Strong Induction and Recursion

3.11 Exercises

🧭 Overview

🧠 One-sentence thesis

The Strong Principle of Mathematical Induction allows you to assume all earlier cases (not just the immediately preceding one) when proving the next case, which is sometimes necessary to complete proofs that ordinary induction cannot handle.

📌 Key points (3–5)

  • Why ordinary induction can fail: Bob's proof got stuck because he needed f(k−1) but only knew f(k); the inductive hypothesis wasn't strong enough.
  • What strong induction adds: you may assume S_m is valid for all integers m with 1 ≤ m ≤ k when proving S_(k+1).
  • The "bootstrap" phenomenon: proving something stronger can paradoxically make the proof easier, because you have more to work with in the inductive step.
  • Common confusion: strong induction vs ordinary induction—strong induction assumes all prior cases, ordinary induction assumes only the immediately previous case.
  • Recursion vs induction: recursion defines a sequence step-by-step; induction proves a formula for that sequence; they are closely related but serve different purposes.

🚧 When ordinary induction hits a wall

🚧 Bob's stuck proof

  • Bob tried to prove f(n) = 2n + 1 by ordinary induction.
  • In the inductive step, he assumed f(k) = 2k + 1 and tried to show f(k+1) = 2(k+1) + 1.
  • The recurrence relation gave him:
    • f(k+1) = 2 f(k) − f(k−1) = 2(2k + 1) − f(k−1)
  • The problem: he needed to know f(k−1) = 2(k−1) + 1, but his hypothesis only told him about f(k).
  • He was "totally perplexed" and ready to give up.

🔍 Why the hypothesis wasn't enough

  • Ordinary induction: "If S_k is true, then S_(k+1) is true."
  • Bob only knew that the formula held at k; he had no guarantee it held at k−1.
  • Example: to climb from step k to step k+1, Bob needed a foothold at step k−1, but his inductive hypothesis didn't give him that foothold.

💪 Strong induction to the rescue

💪 The Strong Principle of Mathematical Induction

To prove that an open statement S_n is valid for all n ≥ 1, it is enough to:

(a) Show that S_1 is valid, and
(b) Show that S_(k+1) is valid whenever S_m is valid for all integers m with 1 ≤ m ≤ k.

  • Key difference: in step (b), you assume the statement holds for every m from 1 up to k, not just at k.
  • This gives you much more information to work with in the inductive step.

🥾 The "bootstrap" phenomenon

  • Combinatorial mathematicians call this the "bootstrap" phenomenon: proving something stronger can make the proof easier.
  • Counterintuitive: you'd think a stronger claim is harder to prove, but the stronger hypothesis in the inductive step gives you more leverage.
  • Example: Bob now knows f(m) = 2m + 1 for all m ≤ k, so he can confidently use f(k−1) = 2(k−1) + 1 in his calculation.

✅ Bob's completed proof

  • With strong induction, Bob can now finish:
    • f(k+1) = 2(2k + 1) − f(k−1) = 2(2k + 1) − [2(k−1) + 1] = 2k + 3 = 2(k+1) + 1.
  • The proof works, and Bob can "power down his computer and enjoy his coffee."

🔄 Recursion vs induction: related but distinct

🔄 What recursion does

  • Recursion: defines a sequence in terms of earlier terms.
  • Example: f(n) = 2 f(n−1) − f(n−2) + 6 with base cases f(0) = 2, f(1) = 4.
  • It tells you how to compute the next term if you know the previous ones.

🔄 What induction does

  • Induction: proves a formula or property holds for all n.
  • Example: prove f(n) = 3n² − n + 2 for all n ≥ 0.
  • It establishes correctness of a closed-form expression or general statement.

🔄 How they relate

  • A recursive definition often suggests what you need to prove by induction.
  • Induction (especially strong induction) is the tool to verify that a proposed formula matches the recursion.
  • Don't confuse: recursion is a definition; induction is a proof technique.
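The relationship can be sketched numerically: compute f(n) from the recursive definition above and compare it against the closed form that induction would verify. The function names here are invented for illustration:

```python
def f_recursive(n):
    """Compute f(n) = 2 f(n-1) - f(n-2) + 6 with f(0) = 2, f(1) = 4, iteratively."""
    if n == 0:
        return 2
    prev2, prev1 = 2, 4              # f(0), f(1)
    for _ in range(2, n + 1):
        prev2, prev1 = prev1, 2 * prev1 - prev2 + 6
    return prev1

def f_closed(n):
    """The proposed closed form: f(n) = 3n^2 - n + 2."""
    return 3 * n * n - n + 2

# The two definitions agree on every value we check -- a strong-induction
# proof establishes this agreement for all n at once.
print(all(f_recursive(n) == f_closed(n) for n in range(100)))   # → True
```

A numerical check like this is not a proof, but it is exactly the kind of evidence that suggests what to prove by induction.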

🔄 Practical considerations (from the discussion)

  • In programming, recursion can "overload the stack" if not managed carefully.
  • Loops can sometimes replace recursion with less memory overhead.
  • Example: computing the greatest common divisor (gcd) and coefficients a, b such that d = am + bn can be done iteratively without backtracking.
  • Induction and recursion underlie reasoning about program correctness, especially in large software projects with many lines of code.

🧩 The mysterious sequence and other discussion points

🧩 The coin-change sequence

  • Alice mentioned a "weird sequence": 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, ...
  • Dave explained: the term a_n is the minimum number of U.S. coins required to make n cents.
  • This is an example of a recursively defined sequence arising from a real-world optimization problem.

🧩 Combinatorial vs inductive proofs

  • Xing preferred induction; he felt combinatorial proofs "weren't really a proof."
  • Dave countered: "Combinatorial proofs can always be made rigorous."
  • The debate reflects different styles: combinatorial proofs give insight into why a formula is true; inductive proofs are more mechanical and formal.

🧩 Practical value

  • Zori was skeptical: "Who's going to pay me to find greatest common divisors?"
  • Carlos pointed out: the principles behind induction and recursion are central to establishing that computer programs do what you intend.
  • Large software projects require correctness proofs, and induction/recursion are foundational tools for that task.

📝 Exercise themes (overview only)

The exercises cover:

| Theme | Examples |
| --- | --- |
| Recursive counting | Record identifiers, checkerboard tilings, string patterns without forbidden substrings |
| Greatest common divisor (gcd) | Finding gcd and coefficients a, b such that am + bn = d |
| Induction proofs | Summation formulas, divisibility, binomial theorem, closed-form solutions to recurrences |
| Combinatorial vs inductive proofs | Proving the same formula both ways to compare approaches |
| Algorithm analysis | Lower bound on sorting (using factorial and Stirling's approximation) |
  • The exercises reinforce the distinction between defining a sequence recursively and proving a formula by induction.
  • Several problems ask for both a recursive formula and its use to compute specific values.
  • Don't confuse: finding a recursion (modeling the problem) vs proving a closed form (verifying the formula).

4.1 The Pigeon Hole Principle

🧭 Overview

🧠 One-sentence thesis

The Pigeon Hole Principle guarantees that when you map more items to fewer slots, at least two items must land in the same slot, and this simple idea proves surprisingly powerful results like the Erdős–Szekeres theorem about unavoidable patterns in sequences.

📌 Key points (3–5)

  • What the principle states: if you have more items than containers, at least two items must share a container.
  • Connection to injective functions: a function from a larger set to a smaller set cannot be one-to-one (injective).
  • Common confusion: the principle seems trivial but enables non-obvious proofs—e.g., forcing monotone subsequences in any long sequence.
  • The Erdős–Szekeres application: any sequence of mn + 1 distinct numbers must contain either an increasing subsequence of m + 1 terms or a decreasing subsequence of n + 1 terms.
  • Broader theme: "total disarray is impossible"—structure and patterns are unavoidable when sets are large enough.

🔑 Core definitions

🔑 Injective (one-to-one) functions

A function f : X → Y is 1–1 (one-to-one, or an injection, or injective) when f(x) ≠ f(x′) for all x, x′ in X with x ≠ x′.

  • In plain language: different inputs always produce different outputs.
  • When a function is injective, the size of the domain X cannot exceed the size of the codomain Y.
  • Example: if you assign each person a unique ID number, the number of people cannot exceed the number of available IDs.

🕳️ The Pigeon Hole Principle (Proposition 4.1)

If f : X → Y is a function and |X| > |Y|, then there exists an element y in Y and distinct elements x, x′ in X so that f(x) = f(x′) = y.

  • Informal version: if you put n + 1 pigeons into n holes, at least one hole must contain two pigeons.
  • Why it works: there simply aren't enough distinct outputs to give every input its own unique image.
  • Don't confuse: this is not about probability or typical cases—it is a certainty whenever the domain is larger than the codomain.
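The proposition translates directly into code: feed more inputs than there are possible outputs through any function, and a collision is guaranteed. A minimal sketch (`find_collision` is an invented name):

```python
def find_collision(items, f):
    """Return a pair (x, x2) with f(x) == f(x2); such a pair is guaranteed
    to exist whenever len(items) exceeds the number of possible outputs."""
    seen = {}                         # output value -> first input producing it
    for x in items:
        y = f(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
    return None                       # only possible when |X| <= |Y|

# 11 pigeons, 10 holes: two of the numbers 0..10 must share a last digit.
x1, x2 = find_collision(range(11), lambda x: x % 10)
print(x1, x2)                         # → 0 10
```

Note that the code does not need to know anything about f: the collision is forced purely by counting.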

🎯 The Erdős–Szekeres theorem

📐 Statement (Theorem 4.2)

If m and n are non-negative integers, then any sequence of mn + 1 distinct real numbers either has an increasing subsequence of m + 1 terms, or it has a decreasing subsequence of n + 1 terms.

  • This guarantees unavoidable structure: no matter how you arrange mn + 1 distinct numbers, you cannot avoid one of these two patterns.
  • Example: with m = 2 and n = 2, any sequence of 5 distinct numbers must contain either 3 increasing terms or 3 decreasing terms.

🧩 How the proof uses the Pigeon Hole Principle

Setup:

  • Let σ = (x₁, x₂, x₃, …, x_{mn+1}) be a sequence of mn + 1 distinct real numbers.
  • For each position i, define:
    • aᵢ = maximum length of an increasing subsequence starting with xᵢ
    • bᵢ = maximum length of a decreasing subsequence ending with xᵢ

Case analysis:

  • If any aᵢ ≥ m + 1, we have an increasing subsequence of m + 1 terms → done.
  • If any bᵢ ≥ n + 1, we have a decreasing subsequence of n + 1 terms → done.
  • Otherwise, assume aᵢ ≤ m and bᵢ ≤ n for all i.

Applying the Pigeon Hole Principle:

  • Each position i has a pair (aᵢ, bᵢ) where 1 ≤ aᵢ ≤ m and 1 ≤ bᵢ ≤ n.
  • There are only mn possible distinct ordered pairs (a, b).
  • But we have mn + 1 positions, so by the Pigeon Hole Principle, two positions i₁ < i₂ must have the same pair: (a_{i₁}, b_{i₁}) = (a_{i₂}, b_{i₂}).

Reaching a contradiction:

  • Since x_{i₁} and x_{i₂} are distinct, either x_{i₁} < x_{i₂} or x_{i₁} > x_{i₂}.
  • If x_{i₁} < x_{i₂}: any increasing subsequence starting with x_{i₂} can be extended by prepending x_{i₁}, so a_{i₁} > a_{i₂}—contradicting (a_{i₁}, b_{i₁}) = (a_{i₂}, b_{i₂}).
  • If x_{i₁} > x_{i₂}: any decreasing subsequence ending with x_{i₁} can be extended by appending x_{i₂}, so b_{i₂} > b_{i₁}—again a contradiction.
  • Therefore, the assumption that all aᵢ ≤ m and all bᵢ ≤ n must be false.
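The pairs (aᵢ, bᵢ) from the proof can be computed directly. This sketch (`es_pairs` is an invented name) shows, for a sample sequence with m = n = 2, that the five pairs really are all distinct, and that some aᵢ or bᵢ exceeds 2, as the theorem forces:

```python
def es_pairs(seq):
    """For each position i, compute (a_i, b_i): the length of the longest
    increasing subsequence starting at i and of the longest decreasing
    subsequence ending at i."""
    n = len(seq)
    a = [1] * n
    for i in range(n - 2, -1, -1):               # work backwards for a_i
        for j in range(i + 1, n):
            if seq[j] > seq[i]:
                a[i] = max(a[i], 1 + a[j])
    b = [1] * n
    for i in range(n):                           # work forwards for b_i
        for j in range(i):
            if seq[j] > seq[i]:
                b[i] = max(b[i], 1 + b[j])
    return list(zip(a, b))

# mn + 1 = 5 distinct numbers; the 5 pairs are distinct, and a_0 = 3
# witnesses an increasing subsequence of m + 1 = 3 terms (3, 4, 5).
print(es_pairs([3, 1, 4, 1.5, 5]))
# → [(3, 1), (3, 2), (2, 1), (2, 2), (1, 1)]
```

Had every aᵢ ≤ 2 and bᵢ ≤ 2, the five pairs would be drawn from only four possibilities, and the Pigeon Hole Principle would force a repeat.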

🔍 Why this matters

  • The proof shows that even when you try to avoid patterns, the Pigeon Hole Principle forces structure to emerge.
  • Don't confuse: the theorem does not tell you which pattern appears, only that at least one must.

🌐 Broader context

🌐 Generalizations and theme

  • The excerpt notes that Chapter 11 will explore "powerful generalizations" of the Pigeon Hole Principle.
  • Common theme: "total disarray is impossible"—when sets are large enough relative to the available structure, order and patterns become unavoidable.
  • Example context: the excerpt mentions Dave's desire to never repeat himself, and Alice's observation that avoiding repetition requires "lots and lots of options"—a practical illustration of the principle.

4.2 An Introduction to Complexity Theory

🧭 Overview

🧠 One-sentence thesis

Some computational problems can be solved efficiently even for large inputs, while others require so many operations that no computer—present or future—can solve them in reasonable time when the input grows large.

📌 Key points (3–5)

  • Three problem types: membership testing (easy), finding triples with a target sum (harder), and partitioning a set into equal-sum subsets (intractable).
  • Running time matters: algorithms are compared by how the number of operations grows with input size n—proportional to n is fast, proportional to n³ is slower, proportional to 2^n is unworkable.
  • Certificates for "yes" answers: a yes answer can be verified efficiently by providing a certificate (e.g., the location of the answer), but a "no" answer may require checking all possibilities.
  • Common confusion: a problem being "easy to state" does not mean it is easy to solve—the partition problem is simple to describe but has roughly 2^n cases to check.
  • Why it matters: understanding complexity helps distinguish problems we can solve with better computers from problems that remain out of reach no matter how fast hardware becomes.

🔍 Three problems with different difficulty

🔍 Problem 1: Membership testing (easy)

Given a set S of 10,000 distinct positive integers (each at most 100,000), is 83,172 one of the integers in S?

  • How to solve: check each number in S one by one until you find 83,172 or exhaust the list.
  • Running time: proportional to n (the size of S), because you do at most n tests.
  • Why it scales: even if S grows to 1,000,000 integers, a modest computer can still finish quickly.
  • Example: with 10,000 numbers, you do at most 10,000 comparisons; a netbook handles this "in a heartbeat."

🔍 Problem 2: Finding three numbers with a target sum (harder)

Are there three integers in S whose sum is 143,297?

  • How to solve: consider all 3-element subsets of S and test whether their sum equals the target.
  • Running time: proportional to n³, because the number of 3-element subsets is C(n, 3), which grows like n³.
  • Why it becomes hard: for n = 10,000, there are about 166 billion triples to test; for n = 1,000,000, there are more than 10^17 triples—"off the table with today's hardware."
  • Don't confuse: testing one triple is easy; the problem is the sheer number of triples grows very fast.
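The brute-force approach described above is a few lines of code; the cost is entirely in the C(n, 3) iterations, not in any single check. A small sketch on toy data (the set and target are invented for illustration):

```python
from itertools import combinations

def has_three_sum(s, target):
    """Brute force over all C(n, 3) triples -- roughly n**3 / 6 sum checks."""
    return any(x + y + z == target for x, y, z in combinations(s, 3))

print(has_three_sum([2, 7, 11, 15, 20], 29))   # 2 + 7 + 20 = 29 → True
print(has_three_sum([2, 7, 11, 15, 20], 23))   # no triple works → False
```

For n = 5 this finishes instantly; for n = 1,000,000 the same loop would need more than 10^17 iterations, which is the point of the section.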

🔍 Problem 3: Equal-sum partition (intractable)

Can the set S be partitioned into two disjoint subsets A and B so that the sum of integers in A equals the sum of integers in B?

  • How to solve (in theory): test all possible partitions of S into two complementary subsets.
  • Running time: proportional to 2^n, because there are 2^(n−1) − 1 nontrivial partition pairs.
  • Why it is unworkable: for n = 10,000, there are approximately 10^3000 partitions—"no piece of hardware on the planet will touch that assignment."
  • Key insight: even if you build faster computers, you only shift the threshold slightly; problems with exponential growth remain out of reach.
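The exhaustive search is easy to write down for tiny sets, which makes the exponential blow-up vivid: the code below works for n = 5 but would be hopeless for n = 10,000. A sketch with invented toy data:

```python
from itertools import combinations

def equal_sum_partition(s):
    """Try all 2**(n-1) ways to split s; return (A, B) with equal sums, or None.

    Fixing s[0] inside A avoids examining each partition twice.
    """
    total = sum(s)
    if total % 2:
        return None                       # odd total: no equal split can exist
    rest = s[1:]
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            a = (s[0],) + combo
            if sum(a) * 2 == total:
                b = list(s)
                for x in a:
                    b.remove(x)           # B = S \ A
                return list(a), b
    return None

print(equal_sum_partition([3, 1, 4, 2, 6]))   # → ([3, 1, 4], [2, 6])
```

Doubling the size of s doubles the number of subsets to try, so each added element multiplies the worst-case work by 2.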

📜 Certificates: verifying answers efficiently

📜 What a certificate is

A certificate is additional information that allows an impartial referee to verify a "yes" answer efficiently.

  • It is not the full algorithm or all the work you did; it is a short proof that the answer is correct.
  • The referee can check the certificate quickly, even if finding it was hard.

✅ Certificates for "yes" answers

| Problem | Certificate | How the referee checks |
| --- | --- | --- |
| Membership (Problem 1) | "83,172 is on line 584" | Look at line 584 and confirm it is 83,172 |
| Three-sum (Problem 2) | "Numbers at lines 12, 305, 8001 sum to 143,297" | Check those three numbers and verify their sum |
| Partition (Problem 3) | "Subset A = {list of numbers}" | Verify all numbers are in S, form B = S \ A, compute both sums, confirm they are equal |
  • In all three cases, the certificate is small (a few numbers or locations) and checking it is fast.
  • Example: for the partition problem, you don't provide the algorithm that searched all 2^n partitions; you only provide the one partition that works.
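The referee's job for the partition problem takes only linear work, regardless of how long the searcher toiled. A minimal sketch of such a checker (the function name is invented):

```python
def verify_partition_certificate(s, a):
    """Referee's check: is a a sub-collection of s whose sum equals the
    sum of the remaining elements?  Fast, even though finding a suitable
    a may have required exponential search."""
    b = list(s)
    for x in a:
        if x not in b:
            return False                  # certificate uses a number not in S
        b.remove(x)                       # build B = S \ A as we go
    return sum(a) == sum(b)

print(verify_partition_certificate([3, 1, 4, 2, 6], [3, 1, 4]))   # → True
print(verify_partition_certificate([3, 1, 4, 2, 6], [3, 1]))      # → False
```

This asymmetry, hard to find but easy to check, is exactly the distinction the next sections formalize as P versus NP.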

❌ Certificates for "no" answers

  • For Problems 1 and 2, a "no" answer can be verified by running the full algorithm (checking all elements or all triples).
  • For Problem 3, a "no" answer has no efficient certificate: you cannot say "we checked all 10^3000 partitions"—that is impossible.
  • The excerpt concludes: "The best we could say is that we tried to find a suitable partition and were unable to do so. As a result, we don't know what the correct answer to the question actually is."
  • Don't confuse: being unable to find a solution is not the same as proving no solution exists.

⚙️ Operations and input size

⚙️ What an operation is

An operation is a basic step in an algorithm, such as comparing two integers, updating a variable, or checking whether two subset sums are equal.

  • The meaning is intentionally imprecise: different algorithms have different "basic steps."
  • Key assumption: there exists a constant c such that any operation takes at most time c on a computer.
  • Different computers yield different values of c, but "that is a discrepancy which we can safely ignore."

📏 Input size

  • Definition: roughly the number of "blocks" of data; for the three problems, the input size is n = 10,000 (the number of integers in S).
  • The excerpt mostly ignores the size of individual integers (e.g., each at most 100,000), focusing on the count of items.
  • Limitation: if a single item is huge (e.g., a 700-megabyte file), just reading it takes significant time, so input size is not the whole story.
  • Example: determining whether a file x appears in a directory structure is easy by name, but checking for an exact copy of a large file is much harder.

📐 Big "Oh" notation

📐 Definition

f = O(g) means there exists a constant c and an integer n₀ such that f(n) ≤ c · g(n) whenever n > n₀.

  • Read as "f is Big Oh of g."
  • Modern interpretation: if f and g both count operations for two algorithms with input size n, then f = O(g) means f is "no harder than g when the problem size is large."

📐 Common benchmarks

The excerpt lists natural functions to compare against (in rough order of growth):

  • log log n
  • log n
  • square root of n
  • n^α where α < 1
  • n
  • n log n
  • n²
  • n³
  • n^c where c > 1 is a constant
  • 2^n
  • n!
  • 2^(n²)

📐 Example from sorting

  • The excerpt mentions (from an earlier subsection) that there are sorting algorithms with running time O(n log n), where n is the number of integers to be sorted.
  • This means the number of operations grows roughly like n log n, which is much better than n² or 2^n.

4.3 The Big "Oh" and Little "Oh" Notations

🧭 Overview

🧠 One-sentence thesis

Big Oh and Little oh notations provide a formal way to compare algorithm efficiency by describing how one function grows relative to another as input size becomes large, with Big Oh capturing "no harder than" and Little oh capturing "strictly dominated by."

📌 Key points (3–5)

  • Big Oh (O) meaning: f = O(g) means f grows no faster than g (up to a constant multiple) for large inputs; used to say one algorithm is "no harder than" another.
  • Little oh (o) meaning: f = o(g) means f grows strictly slower than g; the ratio f(n)/g(n) approaches zero as n grows.
  • Common confusion: Big Oh allows f to be much smaller than g (it's an upper bound), while Little oh guarantees f is strictly dominated; Big Oh does not mean "approximately equal."
  • Benchmark functions: algorithms are compared against standard growth rates like log n, n, n log n, n², 2ⁿ, etc.
  • Why it matters: these notations let us classify algorithm difficulty and compare running times without worrying about exact constants or small inputs.

📐 Big Oh notation

📐 Formal definition

f = O(g) when there exists a constant c and an integer n₀ such that f(n) ≤ c·g(n) whenever n > n₀.

  • This captures "f is no harder than g when the problem size is large."
  • The constant c and threshold n₀ are allowed to be any fixed values; we only care about behavior for large n.
  • The excerpt emphasizes: if f and g both describe operation counts for algorithms, then f = O(g) means f is no harder than g for large inputs.

🎯 What Big Oh tells you (and what it doesn't)

  • What it tells you: f does not grow faster than g (up to a constant factor).
  • What it does NOT tell you: f might be much smaller than g; Big Oh is an upper bound, not a tight estimate.
  • The excerpt warns: "when we write f = O(g), we are implying in some sense that f is no bigger than g, but it may in fact be much smaller."
  • Example: if an algorithm runs in n operations, it is also O(n²) and O(n³), but those descriptions are not tight.

🔢 Standard benchmark functions

The excerpt lists natural benchmarks for comparison:

| Growth rate | Examples |
| --- | --- |
| Very slow | log log n, log n |
| Sublinear | √n, n^α where α < 1 |
| Linear and polynomial | n, n², n³, n^c (c > 1 constant) |
| Linearithmic | n log n |
| Exponential and worse | 2ⁿ, n!, 2^(n²) |
  • Sorting algorithms: running time O(n log n) where n is the number of integers.
  • Shortest paths in a graph: running time O(n²) where n is the number of vertices.
  • Graph coloring (chromatic number ≤ 3): no known algorithm with running time O(n^c) for any constant c.

📉 Little oh notation

📉 Formal definition

f = o(g) when the limit as n approaches infinity of f(n)/g(n) equals 0.

  • This means f grows strictly slower than g; g eventually dominates f by an unbounded factor.
  • Both f(n) and g(n) must be positive for all n.
  • The excerpt says: "f is 'Little oh' of g when lim (n→∞) f(n)/g(n) = 0."

🔍 Examples from the excerpt

  • ln n = o(n^0.2): logarithm grows much slower than any positive power of n.
  • n^α = o(n^β) whenever 0 < α < β: smaller exponents are dominated by larger ones.
  • n^100 = o(c^n) for every c > 1: any polynomial is dominated by any exponential.
  • f(n) = o(1) means lim (n→∞) f(n) = 0: the function approaches zero.
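The first example can be watched numerically: the ratio ln n / n^0.2 shrinks as n grows, which is what f = o(g) asserts. A small sketch (the sample points are arbitrary; for n this large the ratio only decreases):

```python
import math

# f(n) = ln n, g(n) = n**0.2; f = o(g) means f(n)/g(n) -> 0.
for n in (10**3, 10**6, 10**9, 10**12):
    print(n, math.log(n) / n ** 0.2)
```

The printed ratios fall steadily toward 0, even though both numerator and denominator grow without bound.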

⚖️ Big Oh vs Little oh

| Notation | Meaning | Relationship |
| --- | --- | --- |
| f = O(g) | f grows no faster than g (up to constant) | Upper bound; f may equal or be smaller |
| f = o(g) | f grows strictly slower than g | Strict domination; f is eventually negligible |
  • Don't confuse: Big Oh allows f and g to grow at the same rate (e.g., 2n = O(n)); Little oh requires f to be strictly slower.
  • Example: n = O(n) is true, but n = o(n) is false; however, n = o(n²) is true.

🧮 Practical implications

🧮 Algorithm comparison

  • When comparing two algorithms with operation counts f(n) and g(n):
    • If f = O(g), the first algorithm is "no harder" than the second for large inputs.
    • If f = o(g), the first algorithm is strictly more efficient for large inputs.
  • The excerpt emphasizes: "the meaning of f = O(g) is that f is no harder than g when the problem size is large."

🚫 What we don't know

  • The excerpt mentions an open problem: determining whether a graph's chromatic number is at most 3.
  • "No one knows whether there is a constant c and an algorithm for determining whether the chromatic number of a graph is at most three which has running time O(n^c)."
  • This illustrates that for some problems, we cannot yet classify their difficulty using polynomial benchmarks.

🎓 Why constants and small inputs are ignored

  • Big Oh and Little oh focus on asymptotic behavior (large n).
  • Constants like c and thresholds like n₀ are allowed because:
    • Different machines and implementations change constants.
    • For large enough inputs, growth rate matters more than constant factors.
  • Example: an algorithm with 1000n operations is O(n), same as one with 2n operations; for large n, both are "linear."

4.4 Exact Versus Approximate

🧭 Overview

🧠 One-sentence thesis

The distinction between polynomial-time solvable problems (class P) and problems where solutions can be verified quickly (class NP) remains one of the most important unsolved questions in computer science, with profound implications for which problems can be solved efficiently versus only checked efficiently.

📌 Key points (3–5)

  • What P means: problems that admit polynomial-time algorithms—solvable in O(n^c) steps for some constant c.
  • What NP means: yes-no problems where a "yes" certificate can be verified in polynomial time, even if finding it might be hard.
  • The famous question: whether P = NP, i.e., whether every problem whose solution can be checked quickly can also be solved quickly.
  • Common confusion: P ⊆ NP is known (anything solvable quickly can also be verified quickly), but whether NP ⊆ P is the open question.
  • Why it matters: determining which problems have fast algorithms versus which only have fast verification affects practical problem-solving and algorithm design.

🔢 Mathematical context: growth rates

📈 How functions grow to infinity

The excerpt begins with a mathematical aside about how π(n) (the prime-counting function) grows:

  • Some functions grow "slowly": log n, log log n
  • Some grow "quickly": 2^n, n!, 2^(2^n)
  • π(n) grows like n / ln n (Legendre's 1796 conjecture)

🏆 The Prime Number Theorem

The limit as n approaches infinity of (π(n) × ln n) / n equals 1.

  • Proved independently by Hadamard and de la Vallée-Poussin in 1896, exactly 100 years after the conjecture
  • Used techniques rooted in Riemann's complex analysis
  • Still an active research topic at the boundary of analysis and number theory
  • Why mentioned: illustrates that understanding growth rates of functions is a deep mathematical question
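The convergence in the Prime Number Theorem can be sampled directly by counting primes with a sieve and forming the ratio π(n)·ln n / n. A small sketch (the sample points are arbitrary; convergence to 1 is famously slow):

```python
import math

def prime_count(n):
    """pi(n): the number of primes <= n, via a sieve of Eratosthenes."""
    if n < 2:
        return 0
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for q in range(p * p, n + 1, p):
                is_prime[q] = False
    return sum(is_prime)

# The Prime Number Theorem: pi(n) * ln(n) / n -> 1 as n -> infinity.
for n in (10**3, 10**4, 10**5, 10**6):
    pi = prime_count(n)
    print(n, pi, pi * math.log(n) / n)
```

For example, π(1000) = 168, and the printed ratios drift slowly down toward 1.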

🚀 Class P: polynomial-time algorithms

⏱️ What polynomial time means

Polynomial time: a problem can be solved by an algorithm A with running time O(n^c) for some constant c > 0, where n is the input size.

  • The symbol P is suggestive of "polynomial"
  • The text emphasizes problems where a certificate (solution) can be found in polynomial time
  • Example: the first two problems in the chapter belong to P with O(n) and O(n³) running times
  • Example: determining whether a graph is 2-colorable or connected both admit polynomial-time algorithms

🤔 Difficulty of membership

  • It may be very difficult to determine whether a problem belongs to class P
  • The excerpt mentions the subset sum problem: "we don't see how to give a fast algorithm for solving the third problem, but that doesn't mean that there isn't one"
  • Don't confuse: not knowing a fast algorithm ≠ proving no fast algorithm exists

🎯 Class NP: verifiable in polynomial time

🔍 What NP means

Class NP: yes-no problems for which there is a certificate for a "yes" answer whose correctness can be verified in polynomial time.

  • Formal name: the class of nondeterministic polynomial time problems
  • Key distinction: finding a solution versus checking a solution
  • Example: the subset sum problem "definitely belongs to this class"—given a proposed subset, you can quickly verify whether it sums to the target

✅ Certificate verification

  • A certificate is a proposed solution to a yes-no problem
  • For NP problems, if the answer is "yes," there exists a certificate that can be checked quickly
  • The checking process must run in polynomial time
  • Don't confuse: NP does not mean "not polynomial" or "non-polynomial"—it means nondeterministic polynomial

❓ The P versus NP question

🧩 The relationship between P and NP

| What we know | What we don't know |
| --- | --- |
| P ⊆ NP (any problem solvable in polynomial time can also be verified in polynomial time) | Whether P = NP |
| Subset sum is in NP | Whether subset sum is in P |
  • "Evidently, any problem belonging to P also belongs to NP"
  • The famous question: are the two classes the same?

🏔️ Why it's hard

  • "It seems difficult to believe that there is a polynomial time algorithm for settling the third problem (the subset sum problem)"
  • No one has come close to settling this issue
  • The excerpt suggests this is "the most famous question at the boundary of combinatorial mathematics, theoretical computer science and mathematical logic"
  • Described as "notoriously challenging"

💡 Practical implications

The discussion section reveals different perspectives:

  • Xing's view: "Any finite problem can be solved. There is always a way to list all the possibilities, compare them one by one and take the best one."
  • Alice's counterpoint: Problems might take a long time "just because it is big"—example of multiplying two integers stored on full DVDs
  • Carlos's intuition: "There are really hard problems that any algorithm will take a long time to solve and not just because the input size is large"
  • Don't confuse: solvability in principle (always possible for finite problems) versus solvability in practice (polynomial time)

🎓 Implications and open questions

🔬 Determining membership in P

  • Very difficult to prove a problem is or isn't in P
  • Absence of a known fast algorithm doesn't prove impossibility
  • The excerpt encourages: "Maybe we all need to study harder!"

🌟 The stakes

  • The author hints at the importance: "if you get a good idea, be sure to discuss it with one or both authors of this text before you go public"
  • "If it turns out that you are right, you are certain to treasure a photo opportunity"
  • Practical implications: "people could earn a nice income solving problems faster and more accurately than their competition"

4.5 Discussion

🧭 Overview

🧠 One-sentence thesis

Even though any finite problem can theoretically be solved by checking all possibilities, some problems take so long that they are impractical to solve, and distinguishing "hard" problems from "easy" ones has real-world implications.

📌 Key points (3–5)

  • Bob's confusion: whether all problems can be solved in principle.
  • Xing's clarification: any finite problem can be solved by listing all possibilities, but that doesn't mean it's practical.
  • Alice's observation: a problem might take a long time simply because the input size is large (e.g., multiplying two very large integers).
  • Carlos's intuition: some problems are "really hard" not just because of input size, but because any algorithm will take a long time—though he can't yet formulate such a problem.
  • Common confusion: solvability in principle vs. solvability in practice—just because you can list all possibilities doesn't mean you can do it quickly enough to matter.

🤔 Can all problems be solved?

🤔 Bob's question

  • Bob wonders why it can't be the case that all problems can be solved—maybe we just don't know how yet.
  • This reflects a common beginner confusion: confusing "theoretically possible" with "practically feasible."

✅ Xing's answer: yes, in principle

Any finite problem can be solved by listing all possibilities, comparing them one by one, and taking the best one as the answer.

  • What this means: if a problem has a finite number of possible solutions, you can always check every single one.
  • Why it matters: this establishes that solvability is not the issue—speed is.
  • Example: to find the best subset from a set of 100 elements, you could check all 2^100 subsets, but that would take longer than the age of the universe.

⏱️ Why some problems take a long time

📦 Alice's point: large input size

  • A problem might take a long time simply because the input is huge.
  • Example from the excerpt: multiplying two integers, each stored on a completely full DVD.
    • Even with a large computer and fancy software, the sheer size of the data makes the task slow.
  • Key insight: input size alone can make a problem slow, independent of the algorithm's efficiency.

🧩 Carlos's intuition: intrinsically hard problems

  • Carlos suspects there are "really hard problems" where any algorithm will take a long time, not just because the input is large.
  • He admits he doesn't yet know how to formulate such a problem, but he believes they exist.
  • What this suggests: some problems may be fundamentally difficult—no clever algorithm can make them fast.
  • Don't confuse: "hard because the input is big" (Alice's point) vs. "hard because the problem structure itself resists fast solutions" (Carlos's point).

💼 Practical implications

💼 Zori's observation

  • Even Zori, who was less enthusiastic about the complexity discussion, sensed that the question of which problems can be solved quickly has practical implications.
  • Why it matters: people can earn income by solving problems faster and more accurately than their competition.
  • Example: an organization that can solve scheduling or optimization problems faster than rivals gains a competitive advantage.

🔄 Summary of perspectives

| Person | View | Implication |
|---|---|---|
| Bob | Maybe all problems can be solved, we just don't know how | Confuses theoretical solvability with practical feasibility |
| Xing | Any finite problem can be solved by checking all possibilities | Establishes that solvability is not the issue—speed is |
| Alice | Large input size alone can make a problem slow | Input size is one source of difficulty |
| Carlos | Some problems are intrinsically hard, independent of input size | Problem structure itself may resist fast solutions |
| Zori | Speed matters in the real world | Practical implications for business and competition |

Graph Theory Basics

4.6 Exercises

🧭 Overview

🧠 One-sentence thesis

Graphs—consisting of vertices and edges—provide a fundamental discrete structure for modeling relationships and connectivity problems, with key concepts including paths, cycles, trees, and the distinction between simple graphs and multigraphs.

📌 Key points (3–5)

  • What a graph is: a pair (V, E) where V is a set of vertices and E is a set of 2-element subsets of V (edges).
  • Core structural concepts: adjacency, degree, paths, cycles, connectedness, and trees (connected acyclic graphs).
  • Common confusion: simple graphs vs. multigraphs—simple graphs have no loops or multiple edges; multigraphs allow both.
  • Isomorphism: two graphs are isomorphic if there's a bijection preserving adjacency, meaning they have the same structure even if drawn differently.
  • Fundamental result: the sum of all vertex degrees equals twice the number of edges, implying the number of odd-degree vertices is always even.

🔷 What graphs are and basic definitions

🔷 Graph structure

Graph: A graph G is a pair (V, E) where V is a set (almost always finite) and E is a set of 2-element subsets of V. Elements of V are called vertices and elements of E are called edges.

  • The edge {x, y} is abbreviated as xy (and xy means the same as yx).
  • Vertex set: V; edge set: E.
  • A drawing of a graph is a helpful visualization but is not the graph itself—the same graph can be drawn many different ways.

🔗 Adjacency and incidence

  • Adjacent vertices: distinct vertices x and y are adjacent when xy ∈ E; otherwise they are non-adjacent.
  • Incident: the edge xy is incident to vertices x and y.
  • Neighbors: adjacent vertices are also called neighbors.
  • Neighborhood: the neighborhood of vertex x is the set of all vertices adjacent to x.

Example: In a graph with vertices {a, b, c, d, e} and edges {ab, cd, ad}, vertices d and a are neighbors; the neighborhood of d is {a, c}; the neighborhood of e is the empty set.

📊 Degree

Degree: The degree of a vertex v in graph G, denoted deg_G(v), is the number of vertices in its neighborhood, or equivalently, the number of edges incident to it.

  • If the graph is clear from context, write deg(v).
  • Example: If d is adjacent to a and c, then deg(d) = 2.

🧩 Special types of graphs

🧩 Complete and independent graphs

| Type | Definition | Notation |
|---|---|---|
| Complete graph | Every distinct pair of vertices is adjacent | K_n (n vertices) |
| Independent graph | No pair of distinct vertices is adjacent | I_n (n vertices) |
  • In K_n, xy is an edge for every distinct pair x, y ∈ V.
  • In I_n, xy is not an edge for any distinct pair x, y ∈ V.

🌲 Trees and forests

Tree: A connected acyclic graph (a connected graph with no cycles on three or more vertices).

Forest: An acyclic graph (may be disconnected).

  • Spanning tree: a subgraph H of a connected graph G that is both a spanning subgraph (same vertex set) and a tree.
  • Leaf: a vertex v in a tree T with deg_T(v) = 1.

Key result: Every tree on n ≥ 2 vertices has at least two leaves.

  • Proof idea (by induction): Delete an edge e to get two smaller tree components; by induction each has at least two leaves; in the worst case two of these are endpoints of e, so at least two remain leaves in the original tree.

🔄 Paths and cycles

Walk: A sequence (x₁, x₂, ..., xₙ) of vertices where xᵢxᵢ₊₁ is an edge for each i = 1, 2, ..., n−1. Vertices need not be distinct.

Path: A walk with all vertices distinct. Denoted P_n for a path on n vertices.

Cycle: A path (x₁, x₂, ..., xₙ) of n distinct vertices (n ≥ 3) where x₁xₙ is also an edge. Denoted C_n for a cycle on n vertices.

  • Length of a path P_n: n − 1 edges.
  • Length of a cycle C_n: n edges.

Example: A path from x₁ to xₙ emphasizes the start and end vertices.

🔗 Subgraphs and graph relationships

🔗 Subgraph types

| Type | Vertex condition | Edge condition |
|---|---|---|
| Subgraph | W ⊆ V | F ⊆ E |
| Induced subgraph | W ⊆ V | F = {xy ∈ E : x, y ∈ W} |
| Spanning subgraph | W = V | F ⊆ E |
  • Induced subgraph: completely defined by its vertex set and the original graph G—includes all edges from G between vertices in W.
  • Spanning subgraph: has the same vertex set as G but possibly fewer edges.

🔄 Isomorphism

Isomorphic: Graphs G = (V, E) and H = (W, F) are isomorphic (written G ≅ H) when there exists a bijection f: V → W such that x is adjacent to y in G if and only if f(x) is adjacent to f(y) in H.

  • Isomorphism preserves adjacency structure.
  • Two graphs can look different when drawn but be isomorphic.
  • "G contains H" often means there is a subgraph of G isomorphic to H.

Don't confuse: Same number of vertices and edges does not guarantee isomorphism—the adjacency pattern must match.

Example: Two graphs with 6 vertices and the same number of edges may not be isomorphic if their degree sequences or connectivity patterns differ.

🌐 Connectivity and components

🌐 Connected vs. disconnected

Connected: A graph G is connected when there is a path from x to y in G for every x, y ∈ V; otherwise G is disconnected.

  • Component: a maximal connected subgraph of a disconnected graph G.
    • "Maximal" means there is no larger connected subgraph containing it.

Example: A graph with vertices {a, b, c, d, e} and edges {ab, cd} is disconnected (no path from e to c); it has three components.

🔢 Fundamental degree theorem

Theorem (First Theorem of Graph Theory): For any graph G = (V, E),

The sum of all vertex degrees equals twice the number of edges: Σ_{v ∈ V} deg_G(v) = 2|E|.

  • Why: Each edge e = vw contributes 1 to deg(v) and 1 to deg(w), so it is counted twice on the left side; it is counted twice on the right side (as 2 times 1 edge).

Corollary: For any graph, the number of vertices of odd degree is even.

  • Proof idea: The sum of degrees is even (= 2|E|); if an odd number of vertices had odd degree, the sum would be odd—contradiction.

🔀 Multigraphs: loops and multiple edges

🔀 Simple graphs vs. multigraphs

| Term | Loops allowed? | Multiple edges allowed? |
|---|---|---|
| Simple graph (or just "graph" in this text) | No | No |
| Multigraph | Yes | Yes |
  • Loop: an edge with both endpoints being the same vertex.
  • Multiple edges: more than one edge between the same pair of vertices.

Terminology note: Different authors use different conventions; in this text, "graph" always means simple graph unless stated otherwise.

🛣️ Real-world motivation

  • Example: Cities and highways—multiple highways may connect the same two cities (multiple edges); a highway may leave and return to the same city (loop).
  • Simple graphs cannot model these features; multigraphs can.

Don't confuse: If a problem allows loops but not multiple edges (or vice versa), state the restriction explicitly in English.


Note on exercises: The excerpt includes a list of exercise problems (e.g., determining arithmetic progressions, sums, products, pigeonhole principle applications) but these are problem statements without solutions or explanatory content, so they are not detailed here.


5.1 Basic Notation and Terminology for Graphs

🧭 Overview

🧠 One-sentence thesis

Graphs are a fundamental discrete structure consisting of vertices and edges that model relationships between objects, and understanding their basic notation, properties, and special types is essential for applying graph theory to real-world problems.

📌 Key points (3–5)

  • What a graph is: A graph G = (V, E) consists of a vertex set V and an edge set E of 2-element subsets of V; edges represent relationships between vertices.
  • Key properties: Degree of a vertex counts its neighbors; the sum of all degrees equals twice the number of edges (Theorem 5.9).
  • Special graph types: Complete graphs (all pairs connected), independent graphs (no edges), paths, cycles, trees (connected acyclic graphs), and bipartite graphs (vertices partitioned into two independent sets).
  • Common confusion: A graph drawing is just a visualization tool, not the graph itself; the same graph can be drawn many different ways without changing its structure.
  • Connectivity and structure: Graphs can be connected (a path exists between every pair of vertices) or disconnected (split into components); trees are minimal connected graphs with no cycles.

📐 Core definitions and structure

📐 What is a graph?

A graph G is a pair (V, E) where V is a set (almost always finite) and E is a set of 2-element subsets of V. Elements of V are called vertices and elements of E are called edges.

  • The edge {x, y} is abbreviated as xy, and xy ∈ E means exactly the same as yx ∈ E.
  • Vertices x and y are adjacent (or neighbors) when xy ∈ E; otherwise they are non-adjacent.
  • An edge xy is incident to vertices x and y.
  • Example: A graph with V = {a, b, c, d, e} and E = {{a, b}, {c, d}, {a, d}} is perfectly valid even though vertex e has no edges incident to it.

🎨 Graph visualization

  • A graph is commonly visualized by drawing a point for each vertex and a line connecting two vertices if they are adjacent.
  • Don't confuse: The drawing is a helpful tool but is not the same as the graph itself; the same graph can be drawn in many different ways.
  • Example: The graph G = ({a, b, c, d, e}, {{a, b}, {c, d}, {a, d}}) can be drawn with vertices arranged in a line, a circle, or any other configuration.

🔢 Degree of a vertex

The degree of a vertex v in a graph G, denoted deg_G(v), is the number of vertices in its neighborhood, or equivalently, the number of edges incident to it.

  • The neighborhood of a vertex x is the set of vertices adjacent to x.
  • Example: In a graph where vertex d is adjacent to vertices a and c, the neighborhood of d is {a, c} and deg(d) = 2.
  • If vertex e has no neighbors, then deg(e) = 0.

🔗 Relationships between vertices

🔗 Walks, paths, and cycles

A walk is a sequence (x₁, x₂, ..., xₙ) of vertices where xᵢxᵢ₊₁ is an edge for each i = 1, 2, ..., n−1. Vertices in a walk need not be distinct.

A path is a walk where all vertices are distinct.

A cycle is a path (x₁, x₂, ..., xₙ) of n ≥ 3 distinct vertices where x₁xₙ is also an edge.

  • The length of a path or cycle is the number of edges it contains.
  • Path Pₙ has n vertices and length n−1; cycle Cₙ has n vertices and length n.
  • Example: In a graph, the sequence (a, b, c, d) where each consecutive pair is connected is a path from a to d of length 3.
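These three definitions translate almost word-for-word into code. A minimal sketch, assuming the graph is given as an adjacency dict mapping each vertex to the set of its neighbors (an illustrative representation, not one fixed by the excerpt):

```python
def is_walk(seq, adj):
    """A walk: every pair of consecutive vertices must be adjacent."""
    return all(b in adj[a] for a, b in zip(seq, seq[1:]))

def is_path(seq, adj):
    """A path: a walk in which all vertices are distinct."""
    return is_walk(seq, adj) and len(set(seq)) == len(seq)

def is_cycle(seq, adj):
    """A cycle: a path on n >= 3 distinct vertices whose first and
    last vertices are also adjacent."""
    return len(seq) >= 3 and is_path(seq, adj) and seq[0] in adj[seq[-1]]
```

For example, on the 4-cycle with edges ab, bc, cd, da, the sequence (a, b, a, d) is a walk but not a path, while (a, b, c, d) is both a path and (closing with the edge da) a cycle.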

📏 Distance

The distance d(u, v) between vertices u and v is the length of a shortest path from u to v.

  • By convention, d(u, u) = 0.
  • Distance is only defined for vertices in the same connected component.

🌳 Special graph types

🌳 Complete and independent graphs

A complete graph is a graph where xy is an edge for every distinct pair x, y ∈ V.

An independent graph is a graph where xy ∉ E for every distinct pair x, y ∈ V.

  • Complete graph on n vertices is denoted Kₙ; independent graph on n vertices is denoted Iₙ.
  • Example: K₅ has 5 vertices and every pair is connected, giving 10 edges total (since there are C(5,2) = 10 pairs).

🌲 Trees and forests

An acyclic graph is one that does not contain any cycle on three or more vertices. Acyclic graphs are also called forests.

A tree is a connected acyclic graph.

A spanning tree of a connected graph G = (V, E) is a subgraph H = (W, F) that is both a spanning subgraph (W = V) and a tree.

  • Key property: Every tree on n ≥ 2 vertices has at least two leaves (vertices of degree 1).
  • Trees are minimal connected graphs: removing any edge disconnects them, and adding any edge creates a cycle.
  • Example: A path Pₙ is a tree; a star graph (one central vertex connected to n−1 leaves) is also a tree.

🎭 Bipartite graphs

A bipartite graph is a graph where the vertex set V can be partitioned into two sets A and B such that the subgraphs induced by A and B are independent (no edge has both endpoints in A or both in B).

  • Bipartite graphs are exactly the 2-colorable graphs.
  • Key characterization: A graph is bipartite if and only if it does not contain an odd cycle (Theorem 5.21).
  • Complete bipartite graph K_{m,n} has vertex set V₁ ∪ V₂ with |V₁| = m and |V₂| = n, and an edge xy if and only if x ∈ V₁ and y ∈ V₂.
  • Example: A graph modeling students and languages they speak is naturally bipartite, with students on one side and languages on the other.
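Since bipartite graphs are exactly the 2-colorable graphs, a breadth-first search can test bipartiteness directly. A sketch, again assuming an adjacency-dict representation:

```python
from collections import deque

def two_color(adj):
    """Attempt a 2-coloring by BFS: color each new vertex opposite to
    the vertex it was discovered from. Returns a color dict if the
    graph is bipartite, or None if some edge joins two vertices of the
    same color (which happens exactly when an odd cycle exists)."""
    color = {}
    for s in adj:                      # handle each component separately
        if s in color:
            continue
        color[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    q.append(v)
                elif color[v] == color[u]:
                    return None        # odd cycle found
    return color
```

On C₄ this returns a valid 2-coloring; on C₅ (an odd cycle) it returns None, matching the characterization in Theorem 5.21.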

🔍 Graph relationships and properties

🔍 Subgraphs

H = (W, F) is a subgraph of G = (V, E) when W ⊆ V and F ⊆ E.

H is an induced subgraph when W ⊆ V and F = {xy ∈ E : x, y ∈ W} (all edges of G with both endpoints in W).

H is a spanning subgraph when W = V.

  • An induced subgraph is completely determined by its vertex set and the original graph.
  • Example: Removing a vertex and all its incident edges from a graph produces an induced subgraph.

🔄 Isomorphism

Graphs G = (V, E) and H = (W, F) are isomorphic (written G ≅ H) when there exists a bijection f: V → W such that x is adjacent to y in G if and only if f(x) is adjacent to f(y) in H.

  • Isomorphic graphs have the same structure but possibly different vertex labels.
  • Don't confuse: Two graphs can have the same number of vertices and edges but not be isomorphic; the pattern of connections matters.
  • Example: Two graphs are not isomorphic if one has a vertex of degree 3 and the other does not, even if they have the same number of vertices and edges.
  • Writers often say G "contains" H when there is a subgraph of G isomorphic to H.

🔗 Connectivity

A graph G is connected when there is a path from x to y for every x, y ∈ V; otherwise it is disconnected.

A component of a disconnected graph G is a maximal connected subgraph (one that is not contained in any larger connected subgraph).

  • Testing connectivity: A positive answer can be justified by providing a spanning tree; a negative answer by providing a partition V = V₁ ∪ V₂ with no edges between V₁ and V₂.
  • Example: A graph with vertices {a, b, c, d, e} where a-b and c-d are the only edges is disconnected with three components: {a, b}, {c, d}, and the isolated vertex {e}.
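A depth-first search finds the components and, as a by-product, the tree edges it discovers form a spanning tree of each component, which is exactly the positive certificate mentioned above. A sketch under the same adjacency-dict assumption:

```python
def components(adj):
    """Find connected components by DFS. For each component, also
    return the discovered tree edges, which form a spanning tree of
    that component (a certificate of its connectedness)."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, tree, stack = [s], [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    tree.append((u, v))   # tree edge of the DFS forest
                    stack.append(v)
        comps.append((comp, tree))
    return comps
```

Each spanning tree has one fewer edge than its component has vertices, as expected for trees.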

📊 Fundamental theorems

📊 First Theorem of Graph Theory

Theorem 5.9: Let deg_G(v) denote the degree of vertex v in graph G = (V, E). Then the sum over all v in V of deg_G(v) equals 2|E|.

  • Why this works: Each edge e = vw contributes 1 to deg(v) and 1 to deg(w), so it is counted twice on the left side; it is clearly counted twice on the right side.
  • Corollary 5.10: For any graph, the number of vertices of odd degree is even.
  • Example: If a graph has 5 edges, the sum of all vertex degrees must be 10.
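Theorem 5.9 and its corollary are easy to confirm on any concrete edge list. A small sketch (the example edges below are the ones from Section 5.1; isolated vertices, having degree 0, simply do not appear in the tally):

```python
from collections import Counter

def degree_check(edges):
    """Tally degrees from an edge list and confirm Theorem 5.9
    (degree sum = 2|E|) and Corollary 5.10 (evenly many odd-degree
    vertices) on that concrete graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    assert sum(deg.values()) == 2 * len(edges)             # Theorem 5.9
    assert sum(1 for d in deg.values() if d % 2) % 2 == 0  # Corollary 5.10
    return dict(deg)
```

For the edge list {ab, cd, ad}, the degree sum is 2 + 1 + 1 + 2 = 6 = 2 · 3, and exactly two vertices (b and c) have odd degree.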

🌿 Properties of trees

Proposition 5.11: Every tree on n ≥ 2 vertices has at least two leaves.

  • Proof idea (by induction): For n = 2, the only tree is K₂, which has two leaves. For larger n, pick an edge e and delete it to get two components; each component is a tree with fewer vertices, so by induction each has at least two leaves. In the worst case, two of these leaves are the endpoints of e, so at least two vertices are leaves in the original tree.
  • This property is fundamental to many tree algorithms.

🎲 Multigraphs

🎲 Loops and multiple edges

  • A simple graph (what we call just "graph" in this text) has no loops or multiple edges.
  • A multigraph allows both loops (edges with both endpoints the same vertex) and multiple edges (more than one edge between the same pair of vertices).
  • Multiple edges: Between two nearby cities, there can be several interconnecting highways; traveling on one is fundamentally different from traveling on another.
  • Loops: A highway that leaves a city, goes through the countryside, and returns to the same city.
  • Example: The Königsberg bridge problem is naturally modeled as a multigraph, with land masses as vertices and bridges as edges (including multiple edges between the same pair of land masses).


5.2 Multigraphs: Loops and Multiple Edges

🧭 Overview

🧠 One-sentence thesis

Multigraphs extend simple graphs by allowing loops (edges from a vertex to itself) and multiple edges (more than one edge between the same pair of vertices), making them better models for real-world networks where multiple connections between the same points are possible.

📌 Key points (3–5)

  • Simple graphs vs multigraphs: simple graphs forbid loops and multiple edges; multigraphs allow both.
  • Why multigraphs matter: real-world networks (e.g., highways between cities) often have multiple distinct connections between the same nodes.
  • Terminology convention: in this text, "graph" always means simple graph; "multigraph" explicitly allows loops and multiple edges.
  • Common confusion: the terminology is not standard across all sources—always check the author's definition at the start of a paper or chapter.
  • Flexibility: if you need only loops or only multiple edges (but not both), state the restriction explicitly in plain English.

🌐 Motivation from real networks

🛣️ Highway network example

The excerpt uses a highway network to motivate multigraphs:

  • Vertices represent cities.
  • Edges represent highways.
  • In a simple graph model, two cities are either connected by one edge or not connected at all.

Limitations of simple graphs for this scenario:

  1. Multiple highways: Two nearby cities may have several distinct highways connecting them, and traveling on one is fundamentally different from traveling on another.
  2. Loop highways: A highway can leave a city, pass through the countryside, and return to the same city—this is an edge with both endpoints being the same vertex.

Example: City A and City B might have three separate highways (multiple edges), and City A might have a scenic loop road that starts and ends at A (a loop).

🔄 What multiple edges and loops capture

  • Multiple edges: more than one edge between two adjacent vertices.
  • Loops: an edge with both endpoints being the same vertex.
  • You can have more than one loop at the same vertex.

These features let the model represent richer structure that simple graphs cannot.

📖 Definitions and terminology

📖 Simple graph

A simple graph is a graph with no loops and no multiple edges.

  • In this text, the word "graph" without qualification always means simple graph.
  • This is a convention; other authors may define "graph" differently.

📖 Multigraph

A multigraph is a graph that can have loops and multiple edges.

  • When the excerpt says "multigraph," both loops and multiple edges are allowed.
  • The excerpt emphasizes that terminology is "far from standard"—different authors use different conventions.

🔀 Mixed cases

What if you want to allow loops but not multiple edges, or vice versa?

  • The excerpt says: "If we really needed to talk about such graphs, then the English language comes to our rescue, and we just state the restriction explicitly!"
  • In other words, there is no single standard term for these intermediate cases; you simply describe the rules in words.

Don't confuse:

  • "Graph" in this text = simple graph (no loops, no multiple edges).
  • "Multigraph" = allows both loops and multiple edges.
  • Always check the author's conventions at the beginning of a paper or chapter.

🧩 How authors signal their conventions

🧩 Common opening sentences

The excerpt lists two typical ways authors clarify their usage:

| Statement | Meaning |
|---|---|
| "In this paper, all graphs will be simple, i.e., we will not allow loops or multiple edges." | Only simple graphs; no loops, no multiple edges. |
| "In this paper, graphs can have loops and multiple edges." | Multigraphs are allowed. |
  • These sentences usually appear at the start of a discussion or paper.
  • They set the ground rules for the entire work.

🧩 Why this matters

  • Graph theory terminology is not universal.
  • Without an explicit statement, readers might assume different definitions.
  • The excerpt warns that "the terminology is far from standard," so always look for the author's definition.

Example: One author might use "graph" to mean what this text calls "multigraph," while another uses "graph" to mean simple graph. Reading the opening clarification prevents confusion.


5.3 Eulerian and Hamiltonian Graphs

🧭 Overview

🧠 One-sentence thesis

Eulerian graphs can be completely characterized by a simple degree condition (connected with all even degrees), whereas determining whether a graph is Hamiltonian remains computationally difficult despite sufficient conditions like Dirac's theorem.

📌 Key points (3–5)

  • Eulerian characterization: A graph is eulerian if and only if it is connected and every vertex has even degree—this gives a fast algorithm.
  • Hamiltonian difficulty: No known quick method exists to determine if a graph is hamiltonian, though sufficient conditions (like Dirac's theorem) guarantee hamiltonicity.
  • Common confusion: Eulerian circuits traverse every edge exactly once; hamiltonian cycles visit every vertex exactly once—these are fundamentally different properties.
  • Practical algorithms: The proof of the eulerian theorem provides a deterministic algorithm that either finds an eulerian circuit, detects disconnection, or finds an odd-degree vertex.
  • Historical application: Euler used his theorem to prove the Königsberg bridge problem had no solution because the corresponding multigraph had vertices of odd degree.

🔄 Eulerian Graphs

🔄 Definition and characterization

Eulerian graph: A graph that contains an eulerian circuit—a sequence of vertices (x₀, x₁, ..., xₜ) where x₀ = xₜ, every edge appears exactly once, and consecutive vertices are connected by edges.

Theorem 5.13: A graph G is eulerian if and only if it is connected and every vertex has even degree.

  • The "only if" direction is intuitive: if you traverse every edge exactly once and return to the start, each vertex must have edges pairing up (one entering, one exiting).
  • For each vertex x, the number of edges exiting x equals the number entering x, and every incident edge either exits or enters.
  • This characterization works even for multigraphs (graphs with multiple edges between the same vertices).

🔍 Why the even-degree condition works

When an eulerian circuit exists:

  • View each edge xᵢxᵢ₊₁ as "exiting" xᵢ and "entering" xᵢ₊₁.
  • Every edge incident with a vertex x either exits from x or enters x.
  • Since the circuit is closed and uses each edge exactly once, exits and entrances must balance perfectly.
  • Therefore, the degree (total incident edges) must be even.

Don't confuse: The condition is about degree (number of incident edges), not about the number of vertices or the length of paths.

🛠️ Algorithm for finding eulerian circuits

The proof provides a deterministic process that will either:

  1. Find an eulerian circuit,
  2. Show the graph is disconnected, or
  3. Find a vertex of odd degree.

How it works:

  • Label vertices 1, 2, ..., n and start with x₀ = 1.
  • Begin with a trivial circuit C = (1).
  • Maintain a partial circuit C = (x₀, x₁, ..., xₜ) with x₀ = xₜ = 1.
  • Mark edges as "traversed" or "not traversed."
  • If not all edges are traversed, find the least integer i where xᵢ has an untraversed incident edge.
  • If no such i exists but untraversed edges remain, the graph is disconnected.
  • From u₀ = xᵢ, build a sequence (u₀, u₁, ..., uₛ) by always choosing the least-numbered untraversed neighbor.
  • If u₀ ≠ uₛ, then u₀ and uₛ have odd degree.
  • If u₀ = uₛ, expand the circuit by replacing xᵢ with the sequence (u₀, u₁, ..., uₛ).

Example from the excerpt: Starting with graph G in Figure 5.14 (connected, all even degrees):

  • C = (1)
  • C = (1, 2, 4, 3, 1) — start next from 2
  • C = (1, 2, 5, 8, 2, 4, 3, 1) — start next from 4
  • C = (1, 2, 5, 8, 2, 4, 6, 7, 4, 9, 6, 10, 4, 3, 1) — start next from 7
  • C = (1, 2, 5, 8, 2, 4, 6, 7, 9, 11, 7, 4, 9, 6, 10, 4, 3, 1) — Done!
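The splice-in process above can be sketched in Python. This is a hedged illustration run on a small graph of my own choosing (the full edge list of Figure 5.14 is not given in the excerpt), with edges supplied as a list of vertex pairs:

```python
from collections import defaultdict

def eulerian_circuit(edges, start=1):
    """Follow the deterministic process described above: keep a partial
    circuit, find the first circuit vertex with an untraversed incident
    edge, walk from it (always taking the least-numbered untraversed
    neighbor) until stuck, and splice the closed sub-tour in."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    for nbrs in adj.values():
        nbrs.sort()                            # least-numbered first
    unused = defaultdict(int)                  # untraversed-edge multiset
    for u, v in edges:
        unused[frozenset((u, v))] += 1
    circuit, remaining = [start], len(edges)
    while remaining:
        for i, x in enumerate(circuit):        # least i with a free edge
            if any(unused[frozenset((x, w))] for w in adj[x]):
                break
        else:
            raise ValueError("graph is disconnected")
        tour, u = [x], x
        while True:                            # walk until stuck
            w = next((w for w in adj[u] if unused[frozenset((u, w))]), None)
            if w is None:
                break
            unused[frozenset((u, w))] -= 1
            remaining -= 1
            tour.append(w)
            u = w
        if u != x:                             # stuck away from start
            raise ValueError(f"vertices {x} and {u} have odd degree")
        circuit[i:i + 1] = tour                # splice the sub-tour in
    return circuit
```

On the "bowtie" graph with edges 12, 23, 13, 34, 45, 35 (all degrees even, connected), this produces a closed walk from 1 using each edge exactly once.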

🌉 The Königsberg bridge problem

Euler applied his theorem to the famous Königsberg bridge problem:

  • The multigraph has each land mass as a vertex and each bridge as an edge.
  • Multiple edges exist between the same pairs of vertices.
  • The graph is not eulerian because it has vertices of odd degree.
  • Therefore, citizens could not find a route crossing each bridge exactly once and returning to the start.

Don't confuse: This is a multigraph (multiple edges allowed), but Theorem 5.13 still applies—the even-degree condition is necessary and sufficient.

🔁 Hamiltonian Graphs

🔁 Definition

Hamiltonian graph: A graph containing a hamiltonian cycle—a sequence (x₁, x₂, ..., xₙ) where every vertex appears exactly once, x₁xₙ is an edge, and xᵢxᵢ₊₁ is an edge for each i = 1, 2, ..., n−1.

Key difference from eulerian:

  • Eulerian: traverse every edge exactly once.
  • Hamiltonian: visit every vertex exactly once.

🔍 Examples and counterexamples

From Figure 5.16:

  • Graph G: both eulerian and hamiltonian.
  • Graph H: hamiltonian but not eulerian.

The Petersen graph (Figure 5.17):

  • Famous example that is not hamiltonian.
  • Demonstrates that recognizing hamiltonian graphs is non-trivial.
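For small graphs, non-hamiltonicity can still be verified by exhaustive search. A backtracking sketch (worst-case exponential, which is consistent with the absence of any known fast general method); the Petersen graph encoding below is an assumption of mine, using the standard outer-cycle/spokes/pentagram labeling:

```python
def is_hamiltonian(n, edges):
    """Backtracking search for a hamiltonian cycle on vertices 0..n-1."""
    if n < 3:
        return False                          # a cycle needs >= 3 vertices
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def extend(path, used):
        if len(path) == n:
            return path[0] in adj[path[-1]]   # can the cycle close?
        return any(extend(path + [w], used | {w})
                   for w in adj[path[-1]] if w not in used)

    return extend([0], {0})

# Petersen graph: outer 5-cycle, five spokes, inner pentagram
petersen = ([(i, (i + 1) % 5) for i in range(5)]
            + [(i, i + 5) for i in range(5)]
            + [(5 + i, 5 + (i + 2) % 5) for i in range(5)])
```

The search confirms that C₅ is hamiltonian and the Petersen graph is not.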

⚠️ Computational difficulty

Unlike the eulerian case:

  • No known quick method exists to determine whether a graph is hamiltonian.
  • The problem is computationally hard (in contrast to the polynomial-time algorithm for eulerian graphs).
  • However, sufficient conditions exist that guarantee hamiltonicity.

Don't confuse: The absence of a fast general algorithm doesn't mean we can never prove a graph is hamiltonian—it means we lack a simple characterization like the even-degree condition for eulerian graphs.

🎯 Dirac's Sufficient Condition

🎯 Theorem 5.18 (Dirac)

Statement: If G is a graph on n ≥ 3 vertices and each vertex in G has at least ⌈n/2⌉ neighbors, then G is hamiltonian.

  • This is a sufficient condition, not necessary: graphs with lower minimum degree can still be hamiltonian.
  • The condition guarantees enough connectivity to force a hamiltonian cycle.

🧩 Proof strategy

The proof uses contradiction and the pigeonhole principle:

  1. Assume failure: Suppose n is the smallest integer for which a counterexample exists (clearly n ≥ 4).
  2. Find longest path: Let P = (x₁, x₂, ..., xₜ) be a longest path in G.
  3. Neighbors on the path: All neighbors of both x₁ and xₜ must appear on this path (otherwise we could extend it).
  4. Pigeonhole argument: By the pigeonhole principle, there exists an integer i with 1 ≤ i < t such that x₁xᵢ₊₁ and xᵢxₜ are both edges.
  5. Construct cycle: This implies C = (x₁, x₂, ..., xᵢ, xₜ, xₜ₋₁, ..., xᵢ₊₁) is a cycle of length t.
  6. Extend to hamiltonian: Note that ⌈n/2⌉ < t < n (the degree condition forces t > ⌈n/2⌉, and t < n since G is not hamiltonian). If y is any vertex not on C, then y must have a neighbor on C (again by the degree condition), implying G has a path on t + 1 vertices—contradiction.

Why it works: The high minimum degree forces so much connectivity that any maximal path can be "closed" into a cycle, and any vertex outside that cycle must connect to it, allowing extension.

🔄 Comparing Eulerian and Hamiltonian Properties

| Property | Eulerian | Hamiltonian |
|---|---|---|
| What it traverses | Every edge exactly once | Every vertex exactly once |
| Characterization | Simple: connected + all even degrees | No simple characterization known |
| Algorithm | Deterministic polynomial-time | No known fast general algorithm |
| Sufficient condition | The characterization is both necessary and sufficient | Dirac's theorem (and others) give sufficient conditions only |
| Applies to multigraphs | Yes | Definition uses simple graphs |

Common confusion: A graph can be eulerian but not hamiltonian, hamiltonian but not eulerian, both, or neither—the properties are independent.

Example: The first graph in Figure 5.16 is both; the second is hamiltonian but not eulerian; the Petersen graph is neither.


5.4 Graph Coloring

🧭 Overview

🧠 One-sentence thesis

Chromatic number and clique number can differ arbitrarily—there exist triangle-free graphs requiring arbitrarily many colors—and while determining chromatic number is generally hard, special graph classes like interval graphs admit efficient coloring algorithms.

📌 Key points (3–5)

  • Chromatic vs clique number gap: For every t ≥ 3, there exist graphs with chromatic number t but clique number only 2 (triangle-free graphs needing many colors).
  • Computational difficulty: No polynomial-time algorithm is known for finding chromatic number or maximum clique; verifying a certificate is easy, but finding one appears very hard.
  • First Fit (greedy) algorithm: Colors vertices in order, assigning each the smallest color not used by already-colored neighbors; performance depends critically on vertex ordering.
  • Common confusion: Knowing clique number does not help find chromatic number (or vice versa) because the gap can be arbitrarily large.
  • Interval graphs exception: For interval graphs, chromatic number equals clique number, and First Fit with the right ordering achieves optimal coloring efficiently.
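The First Fit procedure named in the key points is short enough to state in code. A sketch; the vertex ordering is passed in explicitly because, as noted above, performance depends critically on it:

```python
def first_fit(order, adj):
    """First Fit (greedy) coloring: scan vertices in the given order,
    assigning each the smallest color not already used on one of its
    previously colored neighbors."""
    color = {}
    for v in order:
        taken = {color[w] for w in adj[v] if w in color}
        color[v] = next(c for c in range(len(adj) + 1) if c not in taken)
    return color
```

On C₅ in the natural order 0, 1, 2, 3, 4, First Fit uses three colors (0, 1, 0, 1, 2), which is optimal here since χ(C₅) = 3; on other graphs a bad ordering can force far more colors than the chromatic number.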

🧩 The Pigeon Hole Principle foundation

🧩 Generalized Pigeon Hole Principle

Proposition 5.24 (Generalized Pigeon Hole Principle): If f : X → Y is a function and |X| ≥ (m − 1)|Y| + 1, then there exists an element y ∈ Y and distinct elements x₁, …, xₘ ∈ X so that f(xᵢ) = y for i = 1, …, m.

  • Plain language: If you map enough elements from X into Y, at least m of them must land on the same element of Y.
  • The excerpt uses this to prove that chromatic number can be forced upward: if you try to color with too few colors, many vertices must share the same color, creating conflicts.
  • Example: If |X| ≥ 2|Y| + 1, then some element y ∈ Y must receive at least three elements of X (the case m = 3).
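Proposition 5.24 is constructive: bucketing the elements of X by their image and taking the fullest bucket always produces the promised witness. A small sketch (the function name is my own):

```python
from collections import defaultdict

def pigeonhole_witness(f, X, Y, m):
    """Bucket X by f and return a y with its preimages. Whenever
    |X| >= (m-1)|Y| + 1, the fullest bucket has at least m elements
    (Proposition 5.24), which the assertion checks."""
    buckets = defaultdict(list)
    for x in X:
        buckets[f(x)].append(x)
    y, xs = max(buckets.items(), key=lambda kv: len(kv[1]))
    assert len(X) < (m - 1) * len(Y) + 1 or len(xs) >= m
    return y, xs
```

With X = {0, …, 6}, Y = {0, 1, 2}, and f(x) = x mod 3, we have |X| = 7 = 2·3 + 1, so some residue class must contain at least three elements.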

🔺 Triangle-free graphs with large chromatic number

🔺 Main result (Proposition 5.25)

Proposition 5.25: For every t ≥ 3, there exists a graph Gₜ so that χ(Gₜ) = t and ω(Gₜ) = 2.

  • What it means: You can build graphs that need t colors but contain no triangle (largest clique is just an edge).
  • Why it matters: Chromatic number and clique number are not tightly related; knowing one does not determine the other.
  • The excerpt provides two different proofs (Kelly & Kelly; Mycielski).

🏗️ First construction (Kelly & Kelly)

Base case: Start with G₃ = C₅ (5-cycle), which is triangle-free and needs 3 colors.

Inductive step (building Gₜ₊₁ from Gₜ):

  1. Begin with an independent set I of size t(nₜ − 1) + 1, where nₜ is the number of vertices in Gₜ.
  2. For every nₜ-element subset S of I, attach a copy of Gₜ with vertices of S adjacent to corresponding vertices in that copy.
  3. Vertices in different copies of Gₜ are not adjacent; each vertex in I has at most one neighbor in any given copy.

Why ω(Gₜ₊₁) = 2:

  • Any triangle must contain a vertex from I (since Gₜ is triangle-free).
  • No two vertices in I are adjacent.
  • Each vertex in I is adjacent to at most one vertex in any fixed copy of Gₜ.
  • So the other two vertices of a triangle would have to come from distinct copies, but vertices in different copies are not adjacent.

Why χ(Gₜ₊₁) = t + 1:

  • Upper bound: Use t colors on copies of Gₜ and one new color on I → at most t + 1 colors.
  • Lower bound: If only t colors are used, by the Generalized Pigeon Hole Principle, some nₜ-element subset of I has all vertices the same color. That color cannot be used in the attached copy of Gₜ, forcing a contradiction.

🏗️ Second construction (Mycielski)

Base case: Again G₃ = C₅.

Inductive step (building Gₜ₊₁ from Gₜ):

  1. Start with independent set I of only nₜ points: y₁, …, yₙₜ.
  2. Add a copy of Gₜ with vertices x₁, …, xₙₜ; make yᵢ adjacent to xⱼ if and only if xᵢ is adjacent to xⱼ in Gₜ.
  3. Add a new vertex z adjacent to all vertices in I.

Why ω(Gₜ₊₁) = 2: Clearly triangle-free (similar reasoning).

Why χ(Gₜ₊₁) = t + 1:

  • Upper bound: Color Gₜ with {1, …, t}, use color t + 1 on I, and color z with 1 → at most t + 1 colors.
  • Lower bound (by contradiction): Suppose χ(Gₜ₊₁) = t. Let φ be a proper coloring using {1, …, t}; assume φ(z) = t. Consider the nonempty set S of vertices in the copy of Gₜ colored t. For each xᵢ in S, change its color to match φ(yᵢ), which cannot be t (since z is colored t). This yields a proper coloring of Gₜ with only t − 1 colors (because xᵢ and yᵢ are adjacent to the same vertices in the copy of Gₜ), contradicting χ(Gₜ) = t.

Don't confuse: The two constructions use different sizes for the independent set I and different attachment rules, but both achieve the same result.
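
Mycielski's step is easy to carry out programmatically. A sketch in Python (function name ours), applying one step to C₅; the result is the 11-vertex Grötzsch graph, triangle-free with chromatic number 4:

```python
def mycielskian(n, edges):
    """One Mycielski step: from a graph on vertices 0..n-1, build a graph
    on 2n + 1 vertices.  Vertex n + i is the shadow vertex y_i; vertex
    2n is the new vertex z adjacent to every y_i."""
    new_edges = list(edges)                     # keep the copy of G_t
    for (u, v) in edges:
        new_edges.append((n + u, v))            # y_u ~ x_v whenever x_u ~ x_v
        new_edges.append((u, n + v))            # x_u ~ y_v likewise
    new_edges += [(n + i, 2 * n) for i in range(n)]   # z ~ every y_i
    return 2 * n + 1, new_edges

c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n2, edges2 = mycielskian(5, c5)                 # 11 vertices, 20 edges
```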

🖥️ Computational difficulty of chromatic number

🖥️ The decision problem

Question: Given a graph G, is χ(G) ≤ t?

  • Easy to verify: If someone gives you a proper coloring with at most t colors, you can check it quickly.
  • Hard to find: No polynomial-time algorithm is known for finding a proper coloring with the fewest colors.
  • Similarly, "Is ω(G) ≥ k?" is easy to verify (check a given clique) but hard to find.

🖥️ Why the gap matters

  • Since χ(G) and ω(G) can differ arbitrarily (Proposition 5.25), being able to find one value does not generally help find the other.
  • Many believe no polynomial-time algorithm exists for either problem.

🎨 The First Fit (greedy) algorithm

🎨 How it works

  1. Fix an ordering of vertices: V = {v₁, v₂, …, vₙ}.
  2. Color v₁ with color 1.
  3. For each vᵢ₊₁ (assuming v₁, …, vᵢ are already colored), assign the smallest positive integer color not used by any of its already-colored neighbors in {v₁, …, vᵢ}.
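
The steps above fit in a few lines. A Python sketch (the graph and orderings are illustrative, in the spirit of the bipartite example the excerpt attributes to Figure 5.26):

```python
def first_fit(order, adj):
    """Greedy coloring: visit vertices in the given order, giving each the
    smallest positive color unused on its already-colored neighbors."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# Bipartite graph: ai ~ bj exactly when i != j (so it is 2-colorable).
adj = {f"{s}{i}": [f"{t}{j}" for j in range(1, 4) if j != i]
       for (s, t) in (("a", "b"), ("b", "a")) for i in range(1, 4)}
by_side = first_fit(["a1", "a2", "a3", "b1", "b2", "b3"], adj)
alternating = first_fit(["a1", "b1", "a2", "b2", "a3", "b3"], adj)
```

With one side listed first, two colors suffice; alternating between the sides forces a third, which is exactly why the ordering matters.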

🎨 Ordering is critical

  • The excerpt states: "the ordering of V is vital to the ability of the First Fit algorithm to color G using χ(G) colors."
  • Example: Figure 5.26 shows two orderings of the same bipartite graph; different orderings can lead to different numbers of colors used.
  • In general: Finding an optimal ordering is just as difficult as coloring G, so this simple algorithm does not work well in general.

Don't confuse: First Fit always produces a proper coloring, but it may use more colors than χ(G) unless the ordering is chosen carefully.

📊 Interval graphs: a tractable case

📊 Intersection graphs

Intersection graph: Given an indexed family of sets F = {S_α : α ∈ V}, the graph G has vertex set V, and vertices x and y are adjacent if and only if Sₓ ∩ Sᵧ ≠ ∅.

  • Every graph is an intersection graph (the excerpt asks "Why?").
  • To make the concept useful, restrict the types of sets allowed.

📊 Interval graphs

Interval graph: The intersection graph of a family of closed intervals of the real line ℝ.

  • Example: Figure 5.27 shows six intervals {a, b, c, d, e, f}; the corresponding interval graph has an edge between x and y if and only if intervals x and y overlap.

📊 Optimal coloring for interval graphs (Theorem 5.28)

Theorem 5.28: If G = (V, E) is an interval graph, then χ(G) = ω(G).

Proof idea:

  1. For each vertex v, let I(v) = [aᵥ, bᵥ] be its interval.
  2. Order vertices as {v₁, v₂, …, vₙ} such that a₁ ≤ a₂ ≤ … ≤ aₙ (ties broken arbitrarily).
  3. Apply First Fit with this ordering.
  4. When coloring vᵢ, each already-colored neighbor vⱼ (with j < i) has aⱼ ≤ aᵢ and bⱼ ≥ aᵢ (its interval starts no later than vᵢ's and must overlap it), so its interval contains the point aᵢ.
  5. All those intervals share the point aᵢ, so vᵢ and its previously-colored neighbors form a clique.
  6. So vᵢ is adjacent to at most ω(G) − 1 already-colored vertices.
  7. A color from {1, 2, …, ω(G)} will be available; the algorithm assigns the smallest such color.
  8. Therefore χ(G) ≤ ω(G); since always χ(G) ≥ ω(G), we have χ(G) = ω(G).

Why this works: The natural ordering (by left endpoint) ensures that when First Fit colors a vertex, its already-colored neighbors form a clique, so the number of colors needed never exceeds the clique number.
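
A sketch of the proof's algorithm in Python (the interval data is invented for illustration): sort by left endpoint, then run First Fit, treating overlap as adjacency.

```python
def color_intervals(intervals):
    """First Fit on closed intervals ordered by left endpoint; for interval
    graphs this uses exactly omega(G) colors (Theorem 5.28)."""
    color = {}
    for (a, b) in sorted(intervals):
        # previously colored intervals that overlap [a, b] are neighbors
        used = {k for (c, d), k in color.items() if c <= b and a <= d}
        k = 1
        while k in used:
            k += 1
        color[(a, b)] = k
    return color

# Three intervals share the point 3.5 (a clique of size 3); [7, 9] only
# meets [3, 8], so it can reuse color 1.
colors = color_intervals([(0, 4), (1, 5), (3, 8), (7, 9)])
```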

📊 Perfect graphs

Perfect graph: A graph G such that χ(H) = ω(H) for every induced subgraph H.

  • Since an induced subgraph of an interval graph is an interval graph, Theorem 5.28 shows interval graphs are perfect.
  • The excerpt notes: "The study of perfect graphs originated in connection with the theory of communications networks and has proved to be a major area of research in graph theory for many years now."

🌐 Planar graphs preview

🌐 The utilities problem

  • Setup: Connect three utilities (water, electricity, natural gas) to three homes.
  • Graph model: Vertex for each utility, vertex for each home, edge from each utility to each home → complete bipartite graph K₃,₃.
  • Question: Can this graph be drawn in the plane so edges intersect only at vertices?

🌐 Why it matters

  • While the utilities example might seem contrived (lines can be buried at different depths), the question of planar drawing is important in microchip and circuit board design.
  • In those contexts, the material is so thin that placing connections at different depths is not an option.

Note: The excerpt cuts off mid-sentence; the full discussion of planar graphs continues in Section 5.5.

5.5 Planar Graphs

🧭 Overview

🧠 One-sentence thesis

A graph is planar if it can be drawn in the plane with edges crossing only at vertices, and Kuratowski's Theorem shows that planarity fundamentally depends on whether the graph avoids containing subgraphs homeomorphic to K₅ or K₃,₃.

📌 Key points (3–5)

  • What planar means: a graph is planar if it has a drawing where edges intersect only at shared vertices (not in the middle of edges).
  • Euler's formula: for any planar drawing of a connected graph, n − m + f = 2 (vertices minus edges plus faces equals 2).
  • Edge limit test: a planar graph on n vertices (n ≥ 3) has at most 3n − 6 edges; more edges means the graph cannot be planar.
  • Common confusion: passing the edge-count test (≤ 3n − 6 edges) does not guarantee planarity—K₃,₃ has 6 vertices and 9 edges (which passes) but is still nonplanar.
  • Kuratowski's characterization: a graph is planar if and only if it contains no subgraph homeomorphic to K₅ or K₃,₃, reducing all planarity questions to these two forbidden structures.

🏠 Motivating problem and definitions

🏠 The utilities-and-homes puzzle

  • Setup: connect three utilities (water, electricity, gas) to three homes without any lines crossing.
  • This is modeled as the complete bipartite graph K₃,₃ (one vertex per utility, one per home, edges from each utility to each home).
  • The question becomes: can K₃,₃ be drawn in the plane without edge crossings?
  • Real-world relevance: microchip and circuit-board design, where connections are so thin that placing them at different depths is impossible or severely restricted.

📐 Core definitions

Drawing of a graph: associating vertices with points in the plane and edges with simple polygonal arcs (finite sequences of line segments that don't cross themselves) connecting the points corresponding to their endpoints.

Planar drawing: a drawing in which polygonal arcs for two edges intersect only at a point corresponding to a vertex to which both edges are incident.

Planar graph: a graph that has a planar drawing (i.e., it can be drawn without crossings, even if some drawings do have crossings).

Face: a region bounded by edges and vertices in a planar drawing, containing no other vertices or edges inside; the unbounded exterior region also counts as a face.

  • Example: a planar drawing with 6 vertices and 9 edges determines 5 faces (including the unbounded outer region).
  • Don't confuse: a graph may have many drawings; planarity means at least one drawing is crossing-free.

🧮 Euler's formula and the edge-count test

🧮 Euler's formula for planar graphs

Euler's Formula (Theorem 5.32): Let G be a connected planar graph with n vertices and m edges. Every planar drawing of G has f faces, where n − m + f = 2.

  • This holds for any planar drawing of any planar graph.
  • The number 2 comes from a fundamental property of the plane.
  • Example: K₄ has 4 vertices, 6 edges, and 4 faces in its planar drawing: 4 − 6 + 4 = 2 ✓

Proof idea (by induction on m):

  • Base case (m = 0): a single vertex, one face, so 1 − 0 + 1 = 2.
  • Inductive step: pick an edge e and delete it to form G′.
    • If G′ is connected: removing e merges two faces into one (so f′ = f − 1), and m′ = m − 1, n′ = n. Substituting into n′ − m′ + f′ = 2 gives n − (m − 1) + (f − 1) = 2, which simplifies to n − m + f = 2.
    • If G′ is disconnected (two components G′₁ and G′₂): apply induction to each component, add the equations, and account for the unbounded face being counted twice (so f = f′₁ + f′₂ − 1). This also yields n − m + f = 2.

📏 The edge-count inequality (Theorem 5.33)

Theorem 5.33: A planar graph on n vertices (n ≥ 3) has at most 3n − 6 edges.

Derivation:

  • Count edge-face pairs (e, F) where edge e is part of face F's boundary.
  • Each edge bounds at most 2 faces → total pairs p ≤ 2m.
  • Each face is bounded by at least 3 edges → total pairs p ≥ 3f.
  • So 3f ≤ 2m, i.e., f ≤ (2m)/3.
  • Substitute into Euler's formula: m = n + f − 2 ≤ n + (2m)/3 − 2.
  • Solve: m/3 ≤ n − 2, so m ≤ 3n − 6.

Using the contrapositive:

  • If a graph has more than 3n − 6 edges, it cannot be planar.
  • Example: K₅ has 5 vertices and C(5,2) = 10 edges. Since 10 > 3·5 − 6 = 9, K₅ is nonplanar. Any graph containing K₅ is also nonplanar.
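
The contrapositive test is a one-liner; note that it can only ever prove nonplanarity. A sketch (function name ours):

```python
def fails_edge_count(n, m):
    """True means provably nonplanar (m > 3n - 6, for n >= 3); False proves
    nothing -- the graph may still be nonplanar."""
    return m > 3 * n - 6

k5_nonplanar = fails_edge_count(5, 10)     # 10 > 9: K5 is nonplanar
k33_inconclusive = fails_edge_count(6, 9)  # 9 <= 12: the test says nothing
```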

⚠️ Limitations of the edge-count test

  • Important: passing the test (m ≤ 3n − 6) does not prove a graph is planar.
  • Example: K₃,₃ has 6 vertices and 9 edges. Since 9 = 3·6 − 6, it passes the test—but K₃,₃ is actually nonplanar.
  • To prove K₃,₃ is nonplanar, we need a refined argument using Euler's formula and properties of bipartite graphs.

🚫 Proving K₃,₃ is nonplanar

🚫 Refined edge-face counting for K₃,₃

  • K₃,₃ has 6 vertices and 9 edges.
  • Assume it has a planar drawing. Then by Euler's formula, f = 2 − n + m = 2 − 6 + 9 = 5 faces.
  • Count edge-face pairs:
    • From the edge side: each edge bounds 2 faces → 2m = 18 pairs.
    • From the face side: each face is bounded by a cycle. Since K₃,₃ is bipartite, it has no odd cycles—every cycle has even length ≥ 4.
    • So each face is bounded by at least 4 edges. If fₖ is the number of faces bounded by k edges, the face side counts 4f₄ + 6f₆ + … ≥ 4(f₄ + f₆ + …) = 4f pairs.
    • But we know f = 5, so the face side gives at least 4·5 = 20 pairs.
  • Contradiction: the same count of pairs p would have to satisfy both p ≤ 18 (edge side) and p ≥ 20 (face side).
  • Therefore, K₃,₃ cannot have a planar drawing; it is nonplanar.
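
The arithmetic of the contradiction can be replayed directly (a sketch; the variable names are ours):

```python
# Suppose K3,3 had a planar drawing.
n, m = 6, 9
f = 2 - n + m          # Euler's formula forces f = 5 faces
edge_side = 2 * m      # each edge borders at most 2 faces: p <= 18
face_side = 4 * f      # each face needs >= 4 boundary edges: p >= 20
contradiction = face_side > edge_side   # no such count p can exist
```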

🔗 Homeomorphism and Kuratowski's Theorem

🔗 Elementary subdivision and homeomorphism

Elementary subdivision: given a graph G with edge uv, form a new graph G′ by adding a new vertex v′ and replacing edge uv with two edges uv′ and v′v (i.e., "split" the edge by inserting a vertex of degree 2).

Homeomorphic graphs: two graphs G₁ and G₂ are homeomorphic if they can be obtained from the same graph by a (possibly empty) sequence of elementary subdivisions.

  • Key insight: homeomorphic graphs have the same planarity properties—if G is nonplanar, any elementary subdivision of G is also nonplanar.
  • Example: subdividing any edge of K₅ still yields a nonplanar graph.

🏆 Kuratowski's Theorem (Theorem 5.34)

Kuratowski's Theorem: A graph is planar if and only if it does not contain a subgraph homeomorphic to either K₅ or K₃,₃.

  • This reduces the entire question of planarity to checking for two forbidden structures.
  • The theorem was proved by Polish mathematician Kazimierz Kuratowski in 1930; the proof is beyond the scope of the excerpt.
  • Practical use: efficient algorithms for planarity testing use this characterization.

🔍 Example: the Petersen graph

  • The Petersen graph has 10 vertices and 15 edges, so it passes the edge-count test (15 ≤ 3·10 − 6 = 24).
  • Each vertex has degree 3, so finding a subgraph homeomorphic to K₅ (which has vertices of degree 4) is difficult.
  • Instead, look for a subgraph homeomorphic to K₃,₃:
    • K₃,₃ contains a 6-cycle with three additional edges connecting opposite vertices.
    • Identify a 6-cycle in the Petersen graph, draw it as a hexagon, and place the remaining 4 vertices inside.
    • Delete one vertex (the black vertex in the figure); the three white vertices now have degree 2.
    • Replace each white vertex and its two incident edges with a single edge → this yields K₃,₃.
  • Conclusion: the Petersen graph contains a subgraph homeomorphic to K₃,₃, so it is nonplanar.

🎨 The Four Color Theorem

🗺️ From map coloring to graph coloring

  • Original problem (1852): Francis Guthrie tried to color a map of English counties so that adjacent counties (sharing a boundary segment, not just a point) have different colors. He needed only 4 colors and couldn't find a map requiring 5.
  • Graph formulation: create a vertex for each region; connect two vertices with an edge if their regions share a boundary.
    • This produces a planar graph (edges can be drawn through the common boundary).
    • Conversely, any planar graph corresponds to a map.
  • Restated problem: Does every planar graph have chromatic number at most 4?

📜 History of the Four Color Problem

Year | Event
--- | ---
1852 | Francis Guthrie poses the problem; communicated by his brother Frederick to Augustus de Morgan.
1877 | Arthur Cayley asks the Royal Society if the problem has been resolved, renewing interest.
1878–79 | Alfred Bray Kempe publishes a "proof."
1880 | Peter Guthrie Tait announces a proof using hamiltonian cycles, but realizes around 1883 he cannot prove the required cycles exist.
1890 | Percy John Heawood finds a flaw in Kempe's proof but salvages enough to show every planar graph is 5-colorable.
1946 | A counterexample to Tait's conjecture is found.
1976 | Kenneth Appel and Wolfgang Haken announce a computer-assisted proof (1,482 configurations checked).
1989 | Appel and Haken publish a 741-page book correcting known flaws.
Early 1990s | Robertson, Sanders, Seymour, and Thomas provide a new computer-assisted proof (633 configurations, code available online).

🖥️ The computer-assisted proof

Four Color Theorem (Theorem 5.37): Every planar graph has chromatic number at most 4.

  • Appel and Haken's 1976 proof used computers to verify 1,482 "unavoidable configurations."
  • Many mathematicians were unsatisfied: how can you be certain the code has no logic errors?
  • Several mistakes were found but none were fatal.
  • The 1990s proof by Robertson et al. used fewer configurations (633) and made the code publicly available, gaining wider acceptance.
  • Ongoing question: will anyone find a proof that does not require a computer?

🤔 Why the controversy?

  • Traditional mathematical proofs are human-verifiable step-by-step arguments.
  • Computer-assisted proofs delegate case-checking to code, which may contain bugs.
  • The Four Color Theorem remains a landmark example of this tension: the result is generally accepted today, but many still hope for a purely human-readable proof.

5.6 Counting Labeled Trees

🧭 Overview

🧠 One-sentence thesis

The number of labeled trees on n vertices is exactly n raised to the power (n - 2), a result known as Cayley's Formula, which can be proven by establishing a bijection between labeled trees and strings of length (n - 2).

📌 Key points (3–5)

  • What we're counting: labeled trees on n vertices (trees where vertices are distinguished by labels from the set {1, 2, ..., n}), denoted T_n.
  • The formula: T_n = n^(n-2) for all n ≥ 1 (Cayley's Formula).
  • Proof strategy: construct a bijection between the set of labeled trees on n vertices and the set of strings of length (n - 2) whose symbols come from {1, 2, ..., n}.
  • The Prüfer code: a recursive algorithm that converts any labeled tree into a unique string, forming one direction of the bijection.
  • Common confusion: the Prüfer code records labels of non-leaf vertices (the neighbors of deleted leaves), not the leaves themselves; leaf labels are exactly those that do not appear in the code.

🔢 Building intuition through small cases

🌳 Trees on 1, 2, and 3 vertices

  • For n = 1: only one tree (a single vertex), so T_1 = 1; even this edge case matches the formula, since 1^(1-2) = 1^(-1) = 1.
  • For n = 2: only one tree, isomorphic to K_2 (two vertices connected by an edge), so T_2 = 1.
  • For n = 3: all trees on 3 vertices are isomorphic to P_3 (a path of three vertices).
    • There are T_3 = 3 labeled trees, corresponding to which vertex has degree 2 (the middle vertex).
    • This matches 3^(3-2) = 3^1 = 3.

🌲 Trees on 4 vertices

The excerpt identifies two nonisomorphic tree structures:

  • K_(1,3) (a star with one center of degree 3): there are 4 ways to choose which vertex is the center, giving 4 labelings.
  • P_4 (a path of four vertices):
    • Choose 2 labels for the degree-2 vertices: C(4,2) = 6 ways.
    • Choose which of the 2 remaining labels is adjacent to one of the degree-2 vertices: 2 ways.
    • Total: 6 × 2 = 12 labelings.
  • Sum: T_4 = 4 + 12 = 16 = 4^(4-2) = 4^2.

🌴 Trees on 5 vertices

The excerpt identifies three nonisomorphic tree structures:

  • K_(1,4): 5 ways to choose the vertex of degree 4, giving 5 labelings.
  • P_5:
    • 5 ways to choose the middle vertex.
    • C(4,2) = 6 ways to label the two degree-2 vertices.
    • 2 ways to label the two degree-1 vertices.
    • Total: 5 × 6 × 2 = 60 labelings.
  • The third tree (shown in Figure 5.38, not fully described but has one vertex of degree 3):
    • 5 ways to label the degree-3 vertex.
    • C(4,2) = 6 ways to label the two leaves adjacent to it.
    • 2 ways to label the remaining two vertices.
    • Total: 5 × 6 × 2 = 60 labelings.
  • Sum: T_5 = 5 + 60 + 60 = 125 = 5^(5-2) = 5^3.

Pattern observed: the formula T_n = n^(n-2) holds for small n, suggesting it holds in general.
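
The small cases can be checked by brute force: a graph on n vertices with n − 1 edges is a tree exactly when it is connected. A Python sketch (slow, but fine for n ≤ 6):

```python
from itertools import combinations

def count_labeled_trees(n):
    """Count labeled trees on n vertices by enumerating all (n-1)-edge
    graphs on {0, ..., n-1} and keeping the connected ones."""
    count = 0
    for edges in combinations(combinations(range(n), 2), n - 1):
        reached = {0}
        grew = True
        while grew:            # closure: absorb any edge touching `reached`
            grew = False
            for (u, v) in edges:
                if (u in reached) != (v in reached):
                    reached |= {u, v}
                    grew = True
        if len(reached) == n:
            count += 1
    return count
```

The counts come out to 3 = 3^1, 16 = 4^2, and 125 = 5^3, matching the case analysis above.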

📜 Cayley's Formula and its history

📐 The theorem statement

Theorem 5.39 (Cayley's Formula): The number T_n of labeled trees on n vertices is n^(n-2).

🕰️ Historical context

  • Equivalent results were proven earlier by:
    • James J. Sylvester (1857)
    • Carl W. Borchardt (1860)
  • Arthur Cayley (1889) was the first to state and prove it in graph-theoretic terminology, hence the name.
    • Note: Cayley only proved it rigorously for n ≤ 6 and claimed it could be extended; whether such extension is straightforward is debated.
  • The excerpt mentions that Cayley's Formula has many elegant proofs using different techniques.
  • The proof presented here is due to Prüfer (1918), who worked in the context of permutations and his own terminology, unaware of established graph theory language.

🔄 The Prüfer code algorithm

🧮 What the Prüfer code does

Prüfer code: a recursive algorithm that takes a tree T on k ≥ 2 vertices, labeled by the elements of a k-element set S of positive integers, and returns a string of length (k - 2) whose symbols are elements of S.

  • Notation: prüfer(T) denotes the Prüfer code of tree T.
  • If v is a leaf of T, T - v denotes the tree obtained by removing v (the subgraph induced by all other vertices).

🔧 The recursive procedure

The algorithm is defined as follows:

  1. Base case: If T = K_2 (two vertices, one edge), return the empty string.
  2. Recursive case:
    • Let v be the leaf of T with the smallest label.
    • Let u be its unique neighbor, with label i.
    • Return the string i · prüfer(T - v), i.e., the label i followed by the code of the smaller tree.

In plain language:

  • Find the leaf with the smallest label.
  • Record the label of its neighbor.
  • Remove the leaf and repeat on the smaller tree.
  • Continue until only two vertices remain (K_2), then stop.

🧪 Example: computing a Prüfer code

The excerpt walks through computing prüfer(T) for a 9-vertex tree T (Figure 5.41):

  • Step 1: smallest leaf is vertex 2, neighbor is vertex 6 → record 6, remove vertex 2.
  • Step 2: in T - {2}, smallest leaf is vertex 5, neighbor is vertex 6 → record 6, remove vertex 5.
  • Step 3: smallest leaf is vertex 6, neighbor is vertex 4 → record 4, remove vertex 6.
  • Step 4: smallest leaf is vertex 7, neighbor is vertex 3 → record 3, remove vertex 7.
  • Step 5: smallest leaf is vertex 8, neighbor is vertex 1 → record 1, remove vertex 8.
  • Step 6: smallest leaf is vertex 1, neighbor is vertex 4 → record 4, remove vertex 1.
  • Step 7: smallest leaf is vertex 4, neighbor is vertex 3 → record 3, remove vertex 4.
  • Final: only vertices 3 and 9 remain (K_2) → return empty string.
  • Result: prüfer(T) = 6643143 (a string of length 9 - 2 = 7).
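
The recursion runs just as well as a loop. A Python sketch of the encoder, replayed on the tree of Figure 5.41 (its edges read off from the deletion steps above, plus the final edge 3–9):

```python
def prufer(n, edges):
    """Prüfer code of a labeled tree on {1, ..., n}: repeatedly delete the
    smallest-labeled leaf, recording its neighbor's label."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    code = []
    for _ in range(n - 2):
        leaf = min(v for v, nbrs in adj.items() if len(nbrs) == 1)
        (neighbor,) = adj[leaf]        # a leaf has exactly one neighbor
        code.append(neighbor)
        adj[neighbor].remove(leaf)
        del adj[leaf]
    return code

tree = [(2, 6), (5, 6), (6, 4), (7, 3), (8, 1), (1, 4), (4, 3), (3, 9)]
code = prufer(9, tree)   # [6, 6, 4, 3, 1, 4, 3], i.e. 6643143
```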

🔍 Key property: what appears in the Prüfer code

Critical insight: The symbols that appear in prüfer(T) are exactly the labels of the non-leaf vertices of T.

  • Why: we always record the label of the neighbor of the leaf we delete.
  • A leaf's label cannot appear: a vertex's label is recorded only when a leaf adjacent to it is deleted, and the only vertex adjacent to a leaf v of T is its unique neighbor u. For u to be the deleted leaf, u itself must have degree 1, which happens only when the tree has shrunk to K₂, at which point the algorithm stops and returns the empty string without recording anything.
  • Consequence: if I ⊆ {1, 2, ..., n} is the set of symbols appearing in prüfer(T), then the labels of the leaves of T are precisely the elements of {1, 2, ..., n} \ I.

Don't confuse: the Prüfer code does not record which vertices are leaves; it records which vertices are not leaves (the internal vertices).

🔗 Proof of Cayley's Formula via bijection

🎯 Proof strategy

  • Goal: show there is a bijection between:
    • The set T_n of labeled trees on n vertices (with labels from {1, 2, ..., n}).
    • The set of strings of length (n - 2) whose symbols come from {1, 2, ..., n}.
  • The second set has size n^(n-2) (n choices for each of (n - 2) positions), so establishing a bijection proves T_n = n^(n-2).

➡️ Forward direction: tree to string

  • The Prüfer code algorithm provides the forward map: given a labeled tree T, compute prüfer(T).
  • This produces a string of length (n - 2) with symbols from {1, 2, ..., n}.

⬅️ Reverse direction: string to tree

The excerpt provides a recursive construction to recover a tree from its Prüfer code:

  • Input: a string s = s_1 s_2 ... s_(n-2) with symbols from a set S of n elements.
  • Base case (n = 2): the only string is the empty string; construct K_2 with labels 1 and 2.
  • Inductive step (n = m + 1, assuming the result holds for m ≥ 2):
    1. Let I be the set of symbols appearing in s.
    2. Let k be the least element of S \ I (the smallest label not in s).
    3. By the key property, k labels a leaf of T, and its unique neighbor has label s_1.
    4. Form the substring s' = s_2 s_3 ... s_(m-1) (length m - 2, symbols from S \ {k}).
    5. By induction, construct the unique tree T' with prüfer(T') = s'.
    6. Form T from T' by attaching a new leaf labeled k to the vertex of T' labeled s_1.

Why this works:

  • The smallest label not appearing in s must be a leaf (by the key property).
  • The first symbol s_1 in the Prüfer code is the label of the neighbor of the first deleted leaf (the smallest leaf).
  • Removing this leaf and its contribution to the code gives a smaller problem that can be solved by induction.

🧪 Example: reconstructing a tree from a Prüfer code

The excerpt reconstructs the tree with Prüfer code s = 75531 (7 vertices):

  • Initial: labels {1, 2, 3, 4, 5, 6, 7}, code 75531.
    • Symbols in code: {1, 3, 5, 7}.
    • Smallest missing label: 2.
    • Add edge 2–7 (attach leaf 2 to vertex 7).
  • Step 1: labels {1, 3, 4, 5, 6, 7}, code 5531.
    • Smallest missing: 4.
    • Add edge 4–5.
  • Step 2: labels {1, 3, 5, 6, 7}, code 531.
    • Smallest missing: 6.
    • Add edge 6–5.
  • Step 3: labels {1, 3, 5, 7}, code 31.
    • Smallest missing: 5.
    • Add edge 5–3.
  • Step 4: labels {1, 3, 7}, code 1.
    • Smallest missing: 3.
    • Add edge 3–1.
  • Step 5: labels {1, 7}, code (empty).
    • Base case: add edge 1–7 (K_2).
  • Result: the tree shown in Figure 5.45, with edges 1–7, 1–3, 3–5, 5–6, 5–4, 7–2.

Process summary:

  • At each step, remove the first symbol from the code and the smallest missing label from the label set.
  • Record the edge between the smallest missing label and the first symbol.
  • When the code is empty, the two remaining labels form K_2.
  • Build the tree by adding edges in the order recorded.
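
The reconstruction loop translates directly. A Python sketch of the decoder (function name ours), checked against the code 75531 from the example:

```python
def prufer_decode(n, code):
    """Rebuild the unique labeled tree on {1, ..., n} from its Prüfer code:
    join the smallest label absent from the remaining code to the code's
    first symbol, then shrink both."""
    labels = set(range(1, n + 1))
    rest = list(code)
    edges = []
    while rest:
        k = min(labels - set(rest))   # smallest missing label: a leaf
        edges.append((k, rest.pop(0)))
        labels.remove(k)
    edges.append(tuple(sorted(labels)))   # two labels remain: the final K2
    return edges

edges = prufer_decode(7, [7, 5, 5, 3, 1])
```

Composing this with the encoder returns the original code, which is exactly the bijection the proof needs.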

🧩 Why the bijection is complete

✅ Uniqueness and existence

  • Forward (tree → string): the Prüfer code algorithm is deterministic, so each tree produces exactly one string.
  • Reverse (string → tree): the inductive construction is deterministic, so each string produces exactly one tree.
  • Inverse relationship: the construction reverses the Prüfer code process:
    • The Prüfer code deletes the smallest leaf and records its neighbor.
    • The reconstruction attaches a leaf (the smallest missing label) to the vertex whose label is the first symbol.
  • This establishes a bijection, proving T_n = n^(n-2).

🔄 Summary of the proof

Direction | Input | Output | Method
--- | --- | --- | ---
Tree → String | Labeled tree T on n vertices | String of length (n − 2) | Prüfer code algorithm (recursive deletion of smallest leaf)
String → Tree | String s of length (n − 2) | Labeled tree T on n vertices | Inductive construction (attach smallest missing label to first symbol's vertex)

Conclusion: since there are n^(n-2) strings of length (n - 2) with symbols from {1, ..., n}, and the bijection shows there are exactly as many labeled trees, T_n = n^(n-2).


5.7 A Digression into Complexity Theory

🧭 Overview

🧠 One-sentence thesis

Some graph problems (connectivity, eulerian circuits) have efficient polynomial-time algorithms that can justify both positive and negative answers, while others (hamiltonian cycles) remain hard because no one knows how to efficiently justify a negative answer in the general case.

📌 Key points (3–5)

  • Polynomial-time solvable problems: connectivity and eulerian circuits have efficient algorithms that can produce certificates for both yes and no answers.
  • Certificates for answers: a positive answer needs evidence (e.g., a spanning tree for connectivity); a negative answer also needs justification (e.g., a partition showing disconnection).
  • Hamiltonian problem difficulty: determining if a graph is hamiltonian looks similar to the eulerian problem, but justifying a negative answer is not straightforward—no efficient method is known.
  • Common confusion: surface similarity vs actual difficulty—eulerian and hamiltonian problems both ask for vertex sequences with edges between consecutive vertices, but their computational complexity differs dramatically.
  • Open question: no one currently knows how to efficiently justify that a graph is not hamiltonian in the general case.

🟢 Problems with efficient algorithms

🔗 Graph connectivity

Connectivity problem: given a graph on n vertices, determine whether the graph is connected.

  • Positive answer certificate: provide a spanning tree.
  • Negative answer certificate: provide a partition of the vertex set V = V₁ ∪ V₂ where both V₁ and V₂ are non-empty and no edges connect a vertex in V₁ to a vertex in V₂.
  • Efficiency: Chapter 12 discusses two efficient algorithms that find spanning trees in connected graphs; these can be modified to produce the partition when the graph is disconnected.
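
A sketch of such a certificate-producing search in Python (names ours): a single traversal either builds a spanning tree or exposes the partition.

```python
def connectivity_certificate(vertices, edges):
    """Return ('connected', spanning-tree edges) or
    ('disconnected', (V1, V2)); either answer is cheap to verify."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = min(vertices)
    reached, stack, tree = {start}, [start], []
    while stack:                         # grow a tree from `start`
        u = stack.pop()
        for w in adj[u]:
            if w not in reached:
                reached.add(w)
                tree.append((u, w))
                stack.append(w)
    if len(reached) == len(vertices):
        return "connected", tree
    return "disconnected", (reached, set(vertices) - reached)

yes = connectivity_certificate({1, 2, 3}, [(1, 2), (2, 3)])
no = connectivity_certificate({1, 2, 3, 4}, [(1, 2), (3, 4)])
```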

🔄 Eulerian circuits

Eulerian problem: determine whether a connected graph is eulerian (has a circuit visiting every edge exactly once).

  • Positive answer certificate: produce the actual eulerian sequence of vertices.
    • The chapter provides an algorithm to construct this sequence.
  • Negative answer certificate: produce a vertex of odd degree.
    • The algorithm will identify such a vertex if it exists.
    • Depending on data structures, it may be most efficient to simply look for odd-degree vertices directly.
  • Key insight: both yes and no answers can be efficiently justified.
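
For a connected graph, the negative certificate is just a degree check. A Python sketch (names ours; connectivity is assumed, as in the excerpt's statement):

```python
def eulerian_certificate(vertices, edges):
    """Assuming the graph is connected: return ('eulerian', None) if all
    degrees are even, else ('not eulerian', v) for some odd-degree v."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sorted(v for v in vertices if degree[v] % 2 == 1)
    return ("eulerian", None) if not odd else ("not eulerian", odd[0])

# C4 has all degrees 2 (eulerian); in K4 every vertex has odd degree 3.
c4 = eulerian_certificate({1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4), (4, 1)])
k4 = eulerian_certificate({1, 2, 3, 4},
                          [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
```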

🔴 The hamiltonian problem difficulty

🔄 Surface similarity to eulerian problem

Both problems ask for a sequence of vertices where each pair of consecutive vertices is joined by an edge:

  • Eulerian: visit every edge exactly once.
  • Hamiltonian: visit every vertex exactly once.

The structural similarity is misleading—the computational difficulty is very different.

❌ The asymmetry in justification

Answer type | Justification difficulty
--- | ---
Positive (graph is hamiltonian) | Can be justified by producing the hamiltonian cycle
Negative (graph is not hamiltonian) | Not straightforward; no efficient method known

  • Limitation of existing theorems: Theorem 5.18 (mentioned in the excerpt) only gives a way to confirm that a graph is hamiltonian.
  • Problem: many nonhamiltonian graphs do not satisfy the theorem's hypothesis, so it cannot be used to prove a graph is not hamiltonian.

⚠️ Current state of knowledge

  • Open problem: at this time, no one knows how to efficiently justify a negative answer to the hamiltonian question—at least not in the general case.
  • This is a fundamental difference from the connectivity and eulerian problems, where both positive and negative answers can be efficiently certified.

🤔 Don't confuse

  • Eulerian vs hamiltonian: though both involve sequences of vertices connected by edges, the eulerian problem is efficiently solvable in both directions (yes/no), while the hamiltonian problem has no known efficient algorithm for proving a graph is not hamiltonian.
  • Having a certificate vs finding one: even if someone guarantees a hamiltonian cycle exists, it's unclear whether this extra information makes finding it significantly easier (this question is raised in the discussion section but not resolved in the excerpt).

5.8 Discussion

🧭 Overview

🧠 One-sentence thesis

The discussion reveals that the students recognize the practical applications of graph-theoretic problems (Eulerian cycles, Hamiltonian cycles, chromatic number) but disagree on whether knowing a solution exists makes it easier to find.

📌 Key points (3–5)

  • Practical relevance: Students agree that Eulerian and Hamiltonian cycle problems have real applications in network routing, integrity, and information exchange.
  • Debate about search difficulty: The group is divided on whether knowing a solution exists (e.g., knowing a graph has a Hamiltonian cycle or is 3-colorable) makes it easier to find that solution.
  • Two positions: Some students (Dave, Xing, Bob) believe extra knowledge that a solution exists should help; Carlos argues it doesn't help and the problems remain hard regardless.
  • Common confusion: Don't confuse "knowing something exists" with "being able to find it easily"—the excerpt shows this is an open question the students cannot resolve.
  • Outcome: The discussion ends without consensus, but graphs and their properties have captured the students' attention.

💬 Student perspectives on applications

🌐 Network and routing applications

  • Bob points out that Eulerian and Hamiltonian cycle problems "are certain to have applications in network routing problems."
  • Xing reinforces this: "There are important questions in network integrity and information exchange that are very much the same as these basic problems."
  • The students connect abstract graph problems to concrete real-world scenarios involving networks.

🎨 Chromatic number applications

  • Alice observes that "the notion of chromatic number clearly has practical applications."
  • The excerpt does not detail what these applications are, but the students accept that coloring problems are useful.
  • Example: An organization might need to assign resources (represented by colors) to tasks (vertices) such that conflicting tasks get different resources.

🤔 Zori's initial skepticism

  • Zori initially held an "indefensible" position (the excerpt does not specify what it was, but context suggests she doubted practical relevance).
  • After hearing the others' arguments, she reluctantly concedes with "Whatever."
  • This shows the group dynamic: peer discussion can shift perspectives even when someone is reluctant to admit it.

🔍 The central debate: does knowing help finding?

🧩 The question posed

  • Dave claims: "Finding a hamiltonian cycle can't be all that hard, if someone guarantees that there is one. This extra information must be of value in the search."
  • Xing agrees: "It seems natural that it should be easier to find something if you know it's there."
  • Alice extends the question to chromatic number: "If someone tells you that a graph is 3-colorable, does that help you to find a coloring using only three colors?"

⚖️ Two sides of the argument

| Position | Who holds it | Reasoning |
|---|---|---|
| Knowing helps | Dave, Xing (implicitly Bob) | It "seems natural" and "reasonable" that extra information about existence should make the search easier. |
| Knowing doesn't help | Carlos | "I don't think this extra knowledge is of any help. I think these problems are pretty hard, regardless." |
  • The excerpt does not provide technical justification for either side; the students are reasoning intuitively.
  • Don't confuse: "a solution exists" vs. "we can efficiently find the solution"—the students are debating whether the first statement helps with the second.

🤷 No resolution

  • "They went back and forth for a while, but in the end, the only thing that was completely clear is that graphs and their properties had captured their attention, at least for now."
  • The discussion ends without consensus, highlighting that this is a non-trivial question.
  • Example: Suppose someone tells you a maze has an exit—does that make it easier to find the exit, or do you still have to explore just as much?

🎯 What the discussion reveals

📚 Engagement with the material

  • The students are actively connecting theory to practice and debating open-ended questions.
  • The professor had "showed us examples back in our first class," and now "things are even clearer" as they discuss graphs in depth.
  • This shows that revisiting concepts with more context deepens understanding.

🧠 Computational thinking

  • The students are implicitly grappling with computational complexity: how hard is it to find a solution, and does extra information reduce that difficulty?
  • The excerpt does not use formal complexity terms, but the debate foreshadows topics like NP-completeness (where knowing a solution exists does not necessarily make finding it easy).

🗣️ Collaborative learning

  • Multiple students contribute different perspectives (applications, analogies, skepticism).
  • Even when they don't reach agreement, the discussion itself is valuable for exploring the problem space.

Graph Theory Exercises

5.9 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set consolidates graph theory concepts by asking students to apply definitions and theorems about subgraphs, trees, isomorphism, Eulerian and Hamiltonian properties, graph coloring, planarity, degree sequences, and Prüfer codes to concrete problems.

📌 Key points (3–5)

  • Scope of exercises: covers fundamental graph properties (trees, spanning subgraphs, induced subgraphs), special graph types (Eulerian, Hamiltonian), and advanced topics (chromatic number, planarity, Prüfer codes).
  • Proof and construction tasks: some exercises ask for proofs (e.g., trees have n−1 edges) while others require constructing examples (e.g., drawing planar graphs, finding colorings).
  • Applied modeling problems: real-world scenarios (chemical storage, class scheduling) require translating constraints into graph-theoretic models.
  • Common confusion: distinguishing Eulerian circuits (all vertices even degree, closed walk) from Eulerian trails (at most two odd-degree vertices, open walk) and understanding when degree sequences alone determine graph properties versus when specific structure is needed.
  • Algorithmic practice: several exercises involve the First Fit coloring algorithm and interpreting/constructing Prüfer codes for labeled trees.

🌳 Trees and basic structures

🌳 Tree properties

  • Exercise 6: Prove that every tree on n vertices has exactly n−1 edges.
    • This is a fundamental characterization of trees.
    • Trees are connected acyclic graphs; the edge count follows from these properties.

🔗 Subgraph types

  • Exercise 5 asks to draw specific subgraphs of a given graph G:
    • An induced subgraph on vertices {b, c, h, j, g}: includes all edges from G between those vertices.
    • A spanning subgraph with exactly 10 edges: uses all vertices but only 10 of the edges.
  • Don't confuse: induced subgraphs are determined entirely by the vertex set (all edges between chosen vertices must be included), while spanning subgraphs keep all vertices but may drop any edges.

🏷️ Labeled trees and Prüfer codes

  • Exercises 36–42 work with Prüfer codes, a bijection between labeled trees on n vertices and sequences of length n−2.
  • Tasks include:
    • Computing prüfer(T) for given trees (Exercises 37–39).
    • Reconstructing trees from Prüfer codes (Exercises 40–42).
  • Example: Exercise 36 asks to draw all 16 labeled trees on 4 vertices, illustrating that the count matches 4^(4−2) = 16.
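The encode/decode pair behind Exercises 36–42 can be sketched as follows (vertices labeled 1..n; the function names are illustrative, not from the text):

```python
# Sketch of the Prüfer bijection on vertex labels 1..n; the function
# names are illustrative, not from the text.
from collections import defaultdict

def prufer_encode(n, edges):
    """Repeatedly delete the lowest-labeled leaf, recording its
    neighbor; the recorded sequence of length n-2 is prufer(T)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    code = []
    for _ in range(n - 2):
        leaf = min(v for v in adj if len(adj[v]) == 1)
        neighbor = next(iter(adj[leaf]))
        code.append(neighbor)
        adj[neighbor].discard(leaf)
        del adj[leaf]
    return code

def prufer_decode(code):
    """Rebuild the unique labeled tree whose Prüfer code is `code`."""
    n = len(code) + 2
    degree = {v: 1 + code.count(v) for v in range(1, n + 1)}
    edges = []
    for v in code:
        leaf = min(u for u in degree if degree[u] == 1)
        edges.append((leaf, v))
        degree[v] -= 1
        del degree[leaf]
    edges.append(tuple(sorted(degree)))  # the last two remaining vertices
    return edges

print(prufer_encode(4, [(1, 2), (2, 3), (3, 4)]))  # [2, 3]
```

Decoding each of the 16 codes of length 2 over {1, 2, 3, 4} yields 16 distinct trees, matching the count in Exercise 36.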

🔄 Eulerian and Hamiltonian properties

🔄 Eulerian circuits and trails

Eulerian circuit: a closed walk that uses every edge exactly once, starting and ending at the same vertex.

Eulerian trail: a walk that uses every edge exactly once but may start and end at different vertices.

  • Exercise 8: Find an Eulerian circuit in a given graph or explain why none exists.
    • A connected graph has an Eulerian circuit if and only if every vertex has even degree.
  • Exercise 11: Prove that a graph has an Eulerian trail if and only if it is connected and has at most two vertices of odd degree.
    • This generalizes the circuit condition by allowing an open walk.
  • Exercise 10: Show that adding a single edge to a non-Eulerian graph can make it Eulerian.
    • This works when exactly two vertices have odd degree; adding an edge between them makes all degrees even.
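Since both conditions depend only on degree parity, the tests in Exercises 8, 10, and 11 reduce to counting odd-degree vertices. A small sketch (graph assumed connected; the function name is mine):

```python
# Sketch of the degree-parity test (graph assumed connected; the
# function name is mine).
from collections import Counter

def eulerian_status(edges):
    """Return 'circuit' if all degrees are even, 'trail' if exactly
    two vertices have odd degree, and 'neither' otherwise."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return {0: "circuit", 2: "trail"}.get(odd, "neither")

print(eulerian_status([(0, 1), (1, 2), (2, 0)]))  # circuit
print(eulerian_status([(0, 1), (1, 2)]))          # trail
```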

🔁 Hamiltonian cycles

  • Exercise 9: Determine if a graph is Hamiltonian (has a cycle visiting every vertex exactly once).
  • Exercise 12: Alice and Bob discuss a graph with 17 vertices and 129 edges; Bob claims it's Hamiltonian.
    • The question asks whether one must be correct based solely on these parameters.
    • This tests understanding of necessary versus sufficient conditions for Hamiltonian graphs.

🎨 Graph coloring

🎨 Chromatic number

Chromatic number χ(G): the minimum number of colors needed to color vertices so that no two adjacent vertices share the same color.

  • Exercises 13–14: Find the chromatic number of given graphs and provide an optimal coloring.
  • Exercise 17: All trees with more than one vertex have the same chromatic number—what is it and why?
    • Trees are bipartite (no odd cycles), so they can be 2-colored.

🧪 Applied coloring problems

  • Exercise 15 (chemical storage): 10 chemicals, some pairs cannot be stored together (given by a matrix).
    • Model: create a graph where vertices are chemicals, edges connect incompatible pairs.
    • Goal: find the chromatic number = minimum number of storage rooms needed.
  • Exercise 16 (class scheduling): schedule 7 courses for 6 students, each student takes a subset.
    • Model: vertices are courses, edges connect courses taken by the same student.
    • Goal: chromatic number = minimum number of time slots needed.

🎯 First Fit algorithm

  • Exercise 24: Apply the First Fit coloring algorithm using two different vertex orderings.
    • First Fit processes vertices in order, assigning each the smallest-numbered color that doesn't conflict with already-colored neighbors.
    • Different orderings can yield different numbers of colors.
  • Exercise 27: Prove that if every vertex has degree at most D, then χ(G) ≤ D + 1.
    • First Fit guarantees this bound: when coloring a vertex, at most D neighbors are already colored, so at least one of D + 1 colors is available.
    • Part (b) asks for a bipartite example with D = 1000 showing the bound is not tight (bipartite graphs have χ = 2 regardless of D).
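The algorithm described above fits in a few lines; this sketch (colors numbered from 0, ordering passed as a parameter) makes Exercise 24's two-orderings experiment easy to run:

```python
# Minimal First Fit sketch (colors are numbered 0, 1, 2, ...; the
# ordering parameter makes Exercise 24's experiment easy to run).

def first_fit(adj, order):
    """adj maps each vertex to its set of neighbors; vertices are
    colored in the given order with the smallest non-conflicting color."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# On the 4-cycle a-b-c-d the order (a, b, c, d) achieves the
# chromatic number 2; on other graphs a bad order can do much worse.
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(max(first_fit(adj, ["a", "b", "c", "d"]).values()) + 1)  # 2
```

The D + 1 bound of Exercise 27 is visible in the inner loop: at most D neighbors are colored when v is reached, so some color in {0, ..., D} is free.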

🌈 Advanced coloring constructions

  • Exercises 18–23 reference Mycielski's and Kelly-Kelly constructions for Proposition 5.25 (building triangle-free graphs with arbitrarily large chromatic number):
    • Exercise 18: Find a proper (t+1)-coloring of G_{t+1}.
    • Exercises 19–22: Count vertices in these recursive constructions.
    • Exercise 23: Find the girth (shortest cycle length) of G_t.

🗺️ Planarity

🗺️ Planar graphs

Planar graph: a graph that can be drawn in the plane with no edge crossings.

  • Exercises 28–29: Determine if given graphs are planar; if yes, draw without crossings; if no, explain why.
  • Exercise 30: Draw K₅ − e (complete graph on 5 vertices minus one edge) in the plane.
    • K₅ itself is not planar, but removing one edge makes it planar.
  • Exercise 31: Draw an Eulerian planar graph with 10 vertices and 21 edges.
  • Exercise 32: Prove every planar graph has a vertex incident to at most 5 edges.
    • This follows from Euler's formula and the average degree bound for planar graphs.
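The argument the bullet alludes to can be written out; this is the standard derivation from Euler's formula, not quoted from the exercise:

```latex
% Sketch of the standard argument (not quoted from the exercise).
% For a connected planar graph with $V \ge 3$ vertices, $E$ edges,
% and $F$ faces, Euler's formula $V - E + F = 2$ together with
% $2E \ge 3F$ (each face is bounded by at least 3 edges) gives
\[
  E \le 3V - 6.
\]
% If every vertex had degree at least 6, then
\[
  2E = \sum_{v} \deg(v) \ge 6V > 2(3V - 6) \ge 2E,
\]
% a contradiction, so some vertex has degree at most 5.
```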

🔢 Degree sequences and graph properties

🔢 Degree sequence definition

Degree sequence: the list of vertex degrees arranged in nonincreasing order.

  • Exercise 33: Given five sequences, identify which:
    • Cannot be a degree sequence of any graph.
    • Could be a planar graph's degree sequence.
    • Could be a tree's degree sequence.
    • Is an Eulerian graph's degree sequence.
    • Must be a Hamiltonian graph's degree sequence.

🔍 Distinguishing properties from degree sequences

  • Exercise 34: Three sequences of length 10; one is impossible, and for the other two, determine which properties (Hamiltonian, Eulerian, tree, planar) can be decided from the degree sequence alone.
    • Key insight: some properties (e.g., being a tree: connected with n−1 edges) can sometimes be determined from degree sequence, while others (e.g., Hamiltonian) generally cannot.
  • Exercise 35: For sequences where degree alone is insufficient, draw both a graph with the property and one without.
    • Example: two graphs with the same degree sequence, one Hamiltonian and one not.

✅ Checking validity

  • A sequence can be a degree sequence only if:
    • The sum of degrees is even (each edge contributes 2 to the total).
    • The sequence satisfies the Erdős–Gallai conditions (not explicitly stated but implied by "cannot be a degree sequence").
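Both conditions can be checked mechanically; a sketch of the Erdős–Gallai test (the function name is mine):

```python
# Sketch of the Erdős–Gallai test: a sequence is graphical iff its
# sum is even and, after sorting in nonincreasing order, every prefix
# satisfies the inequality below.

def is_graphical(seq):
    """Return True iff `seq` is the degree sequence of some graph."""
    d = sorted(seq, reverse=True)
    n = len(d)
    if sum(d) % 2 != 0:
        return False  # each edge contributes 2 to the degree sum
    for k in range(1, n + 1):
        if sum(d[:k]) > k * (k - 1) + sum(min(x, k) for x in d[k:]):
            return False
    return True

print(is_graphical([3, 3, 2, 2, 2]))  # True
print(is_graphical([4, 4, 4, 1, 1]))  # False
```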

🏆 Challenge problem

🏆 Brooks' Theorem

  • Exercise 43: Prove that if G is connected and Δ(G) = k (maximum degree k), then χ(G) ≤ k + 1, with equality only when:
    • k = 2 and G is an odd cycle, or
    • k ≥ 2 and G = K_{k+1}.
  • Hint provided: assume χ(G) = k + 1 but neither exception holds; use a spanning tree and careful vertex ordering with First Fit to show only k colors are needed, reaching a contradiction.

Dilworth's Chain Covering Theorem and its Dual

6.1 Basic Notation and Terminology

🧭 Overview

🧠 One-sentence thesis

Dilworth's theorem proves that any poset of width w can be partitioned into exactly w chains (and no fewer), while its dual shows that any poset of height h can be partitioned into exactly h antichains (and no fewer).

📌 Key points (3–5)

  • Dilworth's theorem: a poset of width w can be partitioned into w chains, and this is the minimum number needed.
  • Dual theorem: a poset of height h can be partitioned into h antichains, and this is the minimum number needed.
  • Key distinction: chains vs antichains—chains are totally ordered subsets (every pair comparable), antichains have no comparable pairs.
  • Common confusion: maximal vs maximum—a maximal chain/antichain cannot be extended, but a maximum chain/antichain has the largest possible size.
  • Algorithmic note: the dual theorem yields an efficient recursive algorithm by removing minimal (or maximal) elements layer by layer; Dilworth's theorem proof is less algorithmic.

🔗 Core definitions and terminology

🔗 Chains and antichains

Chain: a subset where every pair of elements is comparable (totally ordered).

Antichain: a subset where no two distinct elements are comparable.

  • These are fundamental building blocks for partitioning posets.
  • Example: in a poset, if elements x and y satisfy x < y, they can belong to the same chain; if x ∥ y (incomparable), they can belong to the same antichain.

📏 Width and height

Width of a poset P: the size of the largest antichain in P.

Height of a poset P: the size of the largest chain in P.

  • Width measures "how wide" the poset is (maximum incomparability).
  • Height measures "how tall" the poset is (maximum comparability).
  • Don't confuse: width counts elements in an antichain; height counts elements in a chain.

🔺 Maximal vs maximum

| Term | Meaning | Example implication |
|---|---|---|
| Maximal chain | Cannot be extended by adding more elements | May not be the longest chain |
| Maximum chain | Has the largest possible size | Always maximal, but maximal need not be maximum |
| Maximal antichain | Cannot be extended by adding more elements | May not be the largest antichain |
| Maximum antichain | Has the largest possible size | Always maximal, but maximal need not be maximum |
  • The excerpt emphasizes: "a maximum chain is maximal, but maximal chains need not be maximum."
  • Same distinction applies to antichains.

📍 Minimal and maximal points

Minimal point: a point x with height(x) = 1, meaning no element is below it in the poset.

Maximal point: a point with no element above it in the poset.

  • Notation: min(P) denotes the set of all minimal points; max(P) denotes the set of all maximal points.
  • These are used in the recursive algorithm for the dual theorem.

🧮 Down sets and up sets

For a point x in poset P = (X, ≤):

  • D(x) = {y ∈ X : y < x} (elements strictly below x)
  • D[x] = {y ∈ X : y ≤ x} (elements below or equal to x)
  • U(x) = {y ∈ X : y > x} (elements strictly above x)
  • U[x] = {y ∈ X : y ≥ x} (elements above or equal to x)
  • I(x) = {y ∈ X − {x} : x ∥ y} (elements incomparable to x)

For a subset S ⊆ X:

  • D(S) = {y ∈ X : y < x for some x ∈ S}
  • D[S] = S ∪ D(S)
  • U(S) and U[S] defined dually.

When A is a maximal antichain, the ground set partitions as: X = A ∪ D(A) ∪ U(A).

🪜 Dual of Dilworth's theorem (antichain partition)

🪜 Statement and intuition

Theorem (Dual of Dilworth): If P = (X, ≤) is a poset and height(P) = h, then there exists a partition X = A₁ ∪ A₂ ∪ ⋯ ∪ Aₕ where each Aᵢ is an antichain. Furthermore, no partition using fewer antichains is possible.

  • Intuition: the height h is the length of the longest chain, so by the Pigeon Hole Principle, any partition into fewer than h antichains would force two elements of a maximum chain into the same antichain—impossible, since chain elements are comparable.

🧩 Proof sketch

  1. For each x ∈ X, define height(x) as the largest integer t such that there exists a chain x₁ < x₂ < ⋯ < xₜ with x = xₜ.
  2. For each i = 1, 2, …, h, let Aᵢ = {x ∈ X : height(x) = i}.
  3. Each Aᵢ is an antichain: if x, y ∈ Aᵢ and x < y, then there would be a chain x₁ < ⋯ < xᵢ = x < xᵢ₊₁ = y, so height(y) ≥ i + 1, contradiction.
  4. Since height(P) = h, there exists a maximum chain C = {x₁, x₂, …, xₕ}. If we could partition P into t < h antichains, by Pigeon Hole, one antichain would contain two points from C—impossible.

🔄 Recursive algorithm

The proof yields an efficient algorithm:

  1. Set P₀ = P.
  2. If Pᵢ is defined and nonempty, let Aᵢ = min(Pᵢ) (the set of minimal points).
  3. Let Pᵢ₊₁ be the subposet remaining after removing Aᵢ from Pᵢ.
  4. Repeat until the poset is empty.
  • Each Aᵢ is an antichain (minimal points are pairwise incomparable within their layer).
  • Dually, we can recursively remove max(Pᵢ) (maximal points) to get another antichain partition.
  • Example: the excerpt illustrates this on a 17-point poset of height 5, with darkened points forming a chain of size 5.
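The recursion above is short enough to sketch directly, representing the poset by its strict comparability predicate on a finite ground set (the function name is mine):

```python
# Sketch of the recursive algorithm, with the poset given by its
# strict comparability predicate `less` on a finite ground set.

def antichain_partition(points, less):
    """Repeatedly strip min(P); the number of layers equals height(P)."""
    remaining = set(points)
    layers = []
    while remaining:
        minimal = {x for x in remaining
                   if not any(less(y, x) for y in remaining if y != x)}
        layers.append(minimal)
        remaining -= minimal
    return layers

# Divisibility order on {1,...,6}: the height is 3 (e.g. 1 < 2 < 4),
# so the algorithm produces 3 antichains.
layers = antichain_partition(range(1, 7), lambda a, b: b % a == 0 and a != b)
print(layers)  # [{1}, {2, 3, 5}, {4, 6}]
```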

🏔️ Dilworth's theorem (chain partition)

🏔️ Statement and intuition

Theorem (Dilworth): If P = (X, ≤) is a poset and width(P) = w, then there exists a partition X = C₁ ∪ C₂ ∪ ⋯ ∪ C_w where each Cᵢ is a chain. Furthermore, no chain partition into fewer chains is possible.

  • Intuition: the width w is the size of the largest antichain, so by Pigeon Hole, any partition into fewer than w chains would force two elements of a maximum antichain into the same chain—impossible, since antichain elements are incomparable.

🧩 Proof strategy (induction on |X|)

The proof uses induction on the size of the ground set:

  1. Base case: if |X| = 1, the result is trivial (one chain suffices).
  2. Inductive step: assume the theorem holds for all posets with |X| < k + 1. Suppose P = (X, ≤) has |X| = k + 1 and width(P) = w.
  3. Without loss of generality, assume w > 1 (otherwise X = C₁ is already a single chain).
  4. Key observation: if C is a nonempty chain in P and the subposet (X − C, ≤) has width w′ < w, then by induction we can partition X − C into w′ chains, so X = C ∪ (w′ chains) gives a partition into w′ + 1 ≤ w chains, which suffices.
  5. Therefore, we may assume that for any nonempty chain C, the subposet (X − C, ≤) also has width w.

🔧 Construction of the partition

  1. Choose a maximal point x and a minimal point y with y ≤ x in P. Let C = {x, y} (or just {x} if x = y).
  2. Let Y = X − C and Q = P(Y) (the subposet induced by Y).
  3. Let A = {a₁, a₂, …, a_w} be a w-element antichain in (Y, Q).
  4. Partition X as X = A ∪ D(A) ∪ U(A).
    • Since y is minimal and A is a maximal antichain, y ∈ D(A).
    • Since x is maximal and A is a maximal antichain, x ∈ U(A).
    • In particular, x and y are distinct.
  5. Note that U[A] ≠ X (since y ∉ U[A]) and D[A] ≠ X (since x ∉ D[A]).
  6. Apply the inductive hypothesis to the subposets determined by D[A] and U[A]:
    • U[A] = C₁ ∪ C₂ ∪ ⋯ ∪ C_w (partition into w chains)
    • D[A] = D₁ ∪ D₂ ∪ ⋯ ∪ D_w (partition into w chains)
  7. Without loss of generality, label so that aᵢ ∈ Cᵢ ∩ Dᵢ for each i = 1, 2, …, w.
  8. Then X = (C₁ ∪ D₁) ∪ (C₂ ∪ D₂) ∪ ⋯ ∪ (C_w ∪ D_w) is the desired partition into w chains.

📊 Example illustration

  • The excerpt illustrates Dilworth's theorem on the same poset from earlier figures.
  • Darkened points form a 7-element antichain (so width = 7).
  • The labels provide a partition into 7 chains.

🤔 Algorithmic concerns

  • The proof does not immediately yield an efficient algorithm for finding the width w or a partition into w chains.
  • The excerpt notes: "Bob has yet to figure out why listing all the subsets of X is a bad idea" (exponential complexity).
  • Carlos suggests "a skilled programmer can devise an algorithm from the proof," but the text defers this issue to later chapters.
  • Don't confuse: the dual theorem (antichain partition) has a simple recursive algorithm; Dilworth's theorem (chain partition) is more subtle algorithmically.

📐 Linear extensions (topological sort)

📐 Definition

Linear extension (also called topological sort): a linear order L on X such that x < y in L whenever x < y in P.

  • A linear extension "respects" the partial order: it extends P to a total order without contradicting any existing comparisons.
  • Example: the excerpt shows that the poset P₃ has 11 distinct linear extensions, displayed in a table.
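Alice's claim that every finite poset has a linear extension can be made concrete with a greedy sketch (my own, assuming a finite ground set): always output a minimal element of what remains.

```python
# Sketch: a linear extension can be built greedily by repeatedly
# removing a minimal element of what remains (this also shows every
# finite poset has at least one linear extension).

def linear_extension(points, less):
    """Order `points` so that x precedes y whenever less(x, y)."""
    remaining = set(points)
    order = []
    while remaining:
        # any minimal element of the remaining subposet may come next
        x = min(v for v in remaining
                if not any(less(u, v) for u in remaining if u != v))
        order.append(x)
        remaining.remove(x)
    return order

# Divisibility on {1,...,6}: every divisor precedes its multiples.
print(linear_extension(range(1, 7), lambda a, b: b % a == 0 and a != b))
```

Choosing different minimal elements at each step yields the different linear extensions counted in the P₃ example.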

🧐 Existence and subtleties

  • Bob questions whether every finite poset has a linear extension.
  • Alice claims it is easy to show they do (the excerpt does not provide the proof here).
  • Carlos notes: "there are subtleties to this question when the ground set X is infinite."
  • The excerpt mentions Szpilrajn's contribution to this issue (a web search topic).
  • Don't confuse: finite posets always have linear extensions; infinite posets require more care (e.g., axiom of choice considerations).

6.2 Additional Concepts for Posets

🧭 Overview

🧠 One-sentence thesis

Dilworth's theorem and its dual establish that any poset of width w can be partitioned into exactly w chains (and dually, any poset of height h can be partitioned into exactly h antichains), providing fundamental structural results for partially ordered sets.

📌 Key points (3–5)

  • Dilworth's theorem: a poset of width w can be partitioned into w chains, and no fewer chains suffice.
  • Dual theorem: a poset of height h can be partitioned into h antichains, and no fewer antichains suffice.
  • Proof strategy for dual: assign each element to an antichain based on its height (longest chain ending at that element); elements at the same height form an antichain.
  • Common confusion: maximal vs maximum—a maximum chain/antichain has the largest possible size, while a maximal chain/antichain cannot be extended but may not be the largest.
  • Algorithmic approach: the dual theorem yields a recursive algorithm that repeatedly removes minimal (or maximal) elements to build the antichain partition.

🎯 Dilworth's theorem and its dual

🎯 Dilworth's chain covering theorem

Theorem 6.17 (Dilworth's Theorem): If P = (X, ≤) is a poset and width(P) = w, then there exists a partition X = C₁ ∪ C₂ ∪ ... ∪ C_w, where each Cᵢ is a chain for i = 1, 2, ..., w. Furthermore, there is no chain partition into fewer chains.

  • What it says: you can cover all elements of a poset using exactly w chains, where w is the width (size of the largest antichain).
  • Why w is necessary: by the Pigeon Hole Principle, if you have an antichain of size w, you need at least w chains because no two elements of an antichain can belong to the same chain.
  • Why w is sufficient: the proof (by induction on the number of elements) shows that w chains are always enough.

🔄 Dual of Dilworth's theorem

Theorem 6.18 (Dual of Dilworth's Theorem): If P = (X, ≤) is a poset and height(P) = h, then there exists a partition X = A₁ ∪ A₂ ∪ ... ∪ Aₕ, where each Aᵢ is an antichain for i = 1, 2, ..., h. Furthermore, there is no partition using fewer antichains.

  • What it says: you can cover all elements using exactly h antichains, where h is the height (length of the longest chain).
  • Why h is necessary: if there is a maximum chain C with h elements, you need at least h antichains because no two elements of a chain can belong to the same antichain.
  • Why h is sufficient: the proof constructs the partition explicitly by grouping elements by their height.

🔨 Proof of the dual theorem

🔨 Height of an element

For each x in X, let height(x) be the largest integer t for which there exists a chain x₁ < x₂ < ... < xₜ with x = xₜ.

  • What it measures: the length of the longest chain ending at element x.
  • Key property: height(x) ≤ h for all x in X, where h = height(P).
  • Example: if the longest chain ending at x is a < b < c < x, then height(x) = 4.

🧱 Constructing the antichain partition

  • Step 1: For each i = 1, 2, ..., h, define Aᵢ = {x ∈ X : height(x) = i}.
  • Step 2: Show each Aᵢ is an antichain.
    • Suppose x, y ∈ Aᵢ and x < y.
    • Then there is a chain x₁ < x₂ < ... < xᵢ = x < xᵢ₊₁ = y.
    • This means height(y) ≥ i + 1, contradicting y ∈ Aᵢ.
  • Step 3: Show h antichains are necessary.
    • If height(P) = h, there exists a maximum chain C = {x₁, x₂, ..., xₕ}.
    • If we could partition P into t < h antichains, by the Pigeon Hole Principle one antichain would contain two points from C, which is impossible (chain elements are comparable).

🔍 Don't confuse: height of element vs height of poset

  • height(x): the longest chain ending at x (property of a single element).
  • height(P): the longest chain in the entire poset (global property).
  • The proof uses height(x) to assign elements to antichains, ensuring that elements at the same "level" are incomparable.

🤖 Algorithmic approach

🤖 Recursive algorithm for antichain partition

The proof of Theorem 6.18 yields an efficient recursive algorithm:

  1. Set P₀ = P.
  2. If Pᵢ has been defined and Pᵢ ≠ ∅:
    • Let Aᵢ = min(Pᵢ) (the set of minimal elements of Pᵢ).
    • Let Pᵢ₊₁ be the subposet remaining when Aᵢ is removed from Pᵢ.
  3. Repeat until all elements are assigned.
  • Why it works: minimal elements at each stage have no elements below them in the current subposet, so they form an antichain.
  • Dual approach: you can also recursively remove the set of maximal points to build the partition.
  • Example: Figure 6.19 illustrates this algorithm for a 17-point poset of height 5, with darkened points forming a chain of size 5.

🔎 Minimal and maximal points

A point x ∈ X with height(x) = 1 is called a minimal point of P. The set of all minimal points is denoted min(X, P) or min(P).

Dually, the set max(P) consists of maximal points of P.

  • Minimal point: no element in the poset is strictly less than it.
  • Maximal point: no element in the poset is strictly greater than it.
  • Discussion 6.20: Alice claims it is very easy to find the set of minimal elements of a poset—this is true because you only need to check which elements have no predecessors.

📐 Maximal vs maximum chains and antichains

📐 Definitions and distinctions

| Term | Definition | Key property |
|---|---|---|
| Maximal chain | A chain C such that no chain C' contains C as a proper subset | Cannot be extended, but may not be the longest |
| Maximum chain | A chain C such that no chain C' has more elements than C | Has the largest possible size (equals height of poset) |
| Maximal antichain | An antichain A such that no antichain A' contains A as a proper subset | Cannot be extended, but may not be the largest |
| Maximum antichain | An antichain A such that no antichain A' has more elements than A | Has the largest possible size (equals width of poset) |
  • Key distinction: maximum implies maximal, but maximal does not imply maximum.
  • Example: in a poset with multiple chains of different lengths, a short chain that cannot be extended is maximal but not maximum.
  • Thrust of Theorem 6.18: it is easy to find the height h of a poset as well as a maximum chain C consisting of h points, plus a handy partition into h antichains.

🧰 Notation for down sets and up sets

🧰 Down sets and up sets for elements

For a poset P = (X, ≤) and x ∈ X:

  • D(x) = {y ∈ X : y < x in P} (elements strictly below x)
  • D[x] = {y ∈ X : y ≤ x in P} (elements below or equal to x)
  • U(x) = {y ∈ X : y > x in P} (elements strictly above x)
  • U[x] = {y ∈ X : y ≥ x} (elements above or equal to x)
  • I(x) = {y ∈ X - {x} : x ∥ y in P} (elements incomparable to x)

🧰 Down sets and up sets for subsets

For a subset S ⊆ X:

  • D(S) = {y ∈ X : y < x in P, for some x ∈ S} (elements strictly below some element of S)
  • D[S] = S ∪ D(S) (elements below or in S)
  • U(S) and U[S] are defined dually.

🧰 Partition using a maximal antichain

When A is a maximal antichain in P, the ground set X can be partitioned into pairwise disjoint sets:

X = A ∪ D(A) ∪ U(A)

  • A: the antichain itself (incomparable elements).
  • D(A): elements strictly below some element of A.
  • U(A): elements strictly above some element of A.
  • This partition is used in the proof of Dilworth's theorem to simplify the induction argument.

🧮 Proof strategy for Dilworth's theorem

🧮 Induction setup

Let P = (X, ≤) be a poset and let w denote the width of P.

  • Base case: if |X| = 1, the result is trivial (one chain suffices).
  • Inductive hypothesis: assume the theorem holds for all posets with |X| < k.
  • Inductive step: suppose P = (X, ≤) has |X| = k + 1.

🧮 Key observations

  1. Without loss of generality, w > 1: otherwise, the trivial partition X = C₁ (the entire poset is a chain) satisfies the theorem.
  2. Removing a chain preserves width: if C is a nonempty chain in (X, ≤), we may assume the subposet (X - C, ≤) also has width w.
    • If width(X - C) = w' < w, then we can partition X - C into w' chains C₁, C₂, ..., C_{w'}, and adding C gives a partition of X into w' + 1 ≤ w chains.
    • The proof proceeds by showing that if width(X - C) = w, we can still construct a partition into w chains.

🧮 Don't confuse: width of poset vs width after removing a chain

  • width(P): size of the largest antichain in the entire poset.
  • width(X - C): size of the largest antichain in the subposet after removing chain C.
  • The proof relies on the fact that removing a chain can reduce the width, but if it doesn't, the inductive argument still applies.

6.3 Dilworth's Chain Covering Theorem and its Dual

🧭 Overview

🧠 One-sentence thesis

Dilworth's theorem guarantees that any finite poset of width w can be partitioned into exactly w chains, and its dual (Theorem 6.18) guarantees that a poset of height h can be partitioned into exactly h antichains.

📌 Key points (3–5)

  • Dual theorem (height and antichains): A poset of height h can be partitioned into h antichains by recursively removing minimal (or maximal) elements, and this process also finds a maximum chain.
  • Dilworth's theorem (width and chains): A poset of width w requires at least w chains in any chain partition (by the Pigeon Hole Principle), and exactly w chains suffice.
  • Maximal vs maximum: A maximal chain (antichain) cannot be extended by adding more elements, but a maximum chain (antichain) has the largest possible size; maximum implies maximal, but not vice versa.
  • Common confusion: The dual theorem provides an efficient recursive algorithm, but Dilworth's proof does not immediately yield an efficient algorithm for finding width or the chain partition.
  • Why it matters: These theorems connect the structural parameters (height, width) of a poset to concrete partitions, enabling both theoretical understanding and algorithmic approaches.

🔄 The dual theorem: height and antichain partitions

📏 What height measures and the dual result

Height of a poset: The length of a maximum chain (the largest number of elements in any chain).

  • Theorem 6.18 (the dual) states that a poset of height h can be partitioned into exactly h antichains.
  • The Pigeon Hole Principle shows that at least h antichains are needed (because a maximum chain of size h must have one element in each antichain).
  • The theorem proves that h antichains are also sufficient.

🔁 Recursive algorithm for antichain partition

The dual theorem's proof gives an efficient recursive algorithm:

  1. Start with P₀ = P.
  2. If Pᵢ is nonempty, let Aᵢ = min(Pᵢ) (the set of all minimal elements of Pᵢ).
  3. Remove Aᵢ from Pᵢ to get Pᵢ₊₁.
  4. Repeat until the poset is empty.
  • Each Aᵢ is an antichain (minimal elements are pairwise incomparable).
  • The number of iterations equals the height h.
  • Example: Figure 6.19 shows a 17-point poset of height 5 partitioned into 5 antichains; the darkened points form a chain of size 5.

🔼 Dually: removing maximal elements

  • You can also partition P into height(P) antichains by recursively removing the set max(P) of maximal points.
  • This dual approach works symmetrically: maximal elements at each step form an antichain.

🧩 Finding minimal and maximal elements

Minimal elements of a poset P: elements x such that no y in P satisfies y < x.
Maximal elements of a poset P: elements x such that no y in P satisfies y > x.

  • Discussion 6.20: Alice claims it is very easy to find the set of minimal elements; the excerpt implies this is straightforward (no algorithm details given, but the concept is clear).
  • The notation min(P) and max(P) denote these sets.

🔗 Dilworth's theorem: width and chain partitions

📐 What width measures and Dilworth's result

Width of a poset: The size of a maximum antichain (the largest number of pairwise incomparable elements).

  • Dilworth's theorem states that a poset of width w can be partitioned into exactly w chains.
  • The Pigeon Hole Principle shows that at least w chains are needed (an antichain of size w meets each chain in at most one point, so any chain partition must use at least w parts).
  • Dilworth's theorem proves that w chains are also sufficient.

🧱 Proof strategy: induction and careful choice

The proof proceeds by induction on the number of elements |X|:

  1. Base case: Trivial when |X| = 1.
  2. Inductive step: Assume the theorem holds for all posets with fewer than k + 1 elements; prove it for a poset P = (X, P) with |X| = k + 1 and width w > 1.
  3. Key observation: If C is a nonempty chain in P, we may assume the subposet (X − C, P(X − C)) also has width w. If it had smaller width w′ < w, we could partition X − C into w′ chains and add C to get a partition of X into w′ + 1 ≤ w chains, which suffices.
  4. Construction:
    • Choose a maximal point x and a minimal point y with y ≤ x in P.
    • Let C be the chain containing only x and y (one or two elements depending on whether they are distinct).
    • Let Y = X − C and Q = P(Y).
    • Let A be a w-element antichain in the subposet (Y, Q).
  5. Partition around the antichain: In the partition X = A ∪ D(A) ∪ U(A) (down set, antichain, up set):
    • y is minimal and A is a maximal antichain, so y ∈ D(A).
    • x is maximal, so x ∈ U(A).
    • This shows x and y are distinct.
  6. Apply induction: Label A = {a₁, a₂, ..., a_w}. Note that U[A] ≠ X (since y ∉ U[A]) and D[A] ≠ X (since x ∉ D[A]), so the inductive hypothesis applies to both subposets. Partition U[A] into w chains C₁, C₂, ..., C_w and D[A] into w chains D₁, D₂, ..., D_w, with aᵢ ∈ Cᵢ ∩ Dᵢ for each i.
  7. Combine: X = (C₁ ∪ D₁) ∪ (C₂ ∪ D₂) ∪ ... ∪ (C_w ∪ D_w) is the desired partition into w chains (each Cᵢ ∪ Dᵢ is a chain because aᵢ is the maximum of Dᵢ and the minimum of Cᵢ).
  • Example: Figure 6.21 shows a poset (from Figure 6.5) of width 7; the darkened points form a 7-element antichain, and the labels provide a partition into 7 chains.

🔍 Down sets and up sets notation

The proof uses the following notation for a poset P = (X, P) and element x ∈ X:

| Notation | Definition | Type |
| --- | --- | --- |
| D(x) | {y ∈ X : y < x in P} | Down set (strict) |
| D[x] | {y ∈ X : y ≤ x in P} | Down set (inclusive) |
| U(x) | {y ∈ X : y > x in P} | Up set (strict) |
| U[x] | {y ∈ X : y ≥ x in P} | Up set (inclusive) |
| I(x) | {y ∈ X − {x} : x ∥ y in P} | Incomparable set |
  • For a subset S ⊆ X: D(S) = {y ∈ X : y < x in P for some x ∈ S} and D[S] = S ∪ D(S); U(S) and U[S] are defined dually.
  • When A is a maximal antichain, the ground set partitions as X = A ∪ D(A) ∪ U(A) (pairwise disjoint).

🆚 Maximal vs maximum: key distinction

🔝 Definitions and relationship

Maximal chain: A chain C such that no chain C′ contains C as a proper subset.
Maximum chain: A chain C such that no chain C′ has |C| < |C′|.

  • The same definitions apply to antichains.
  • Relationship: A maximum chain (antichain) is always maximal, but a maximal chain (antichain) need not be maximum.
  • Don't confuse: "Maximal" means "cannot be extended" (local property); "maximum" means "largest possible size" (global property).
  • Example: In a poset, a short chain that cannot be extended (because its endpoints are incomparable to all other elements) is maximal but not maximum.

🎯 Thrust of Theorem 6.18

  • The dual theorem makes it easy to find the height h of a poset.
  • It also provides a maximum chain C of h points.
  • As a bonus, it gives a partition of the poset into h antichains.

⚙️ Algorithmic considerations

🚀 Efficient algorithm for the dual theorem

  • The recursive algorithm for Theorem 6.18 (repeatedly removing minimal or maximal elements) is efficient.
  • It directly constructs both the antichain partition and a maximum chain.

🤔 Dilworth's theorem and algorithmic challenges

  • Discussion 6.22: Alice notes that the proof of Dilworth's theorem does not seem to provide an efficient algorithm for finding the width w or a partition into w chains.
  • Bob suggests listing all subsets of X, which is a bad idea (exponential time).
  • Carlos says a skilled programmer can devise an algorithm from the proof, but the excerpt does not provide details.
  • The text promises to return to this issue later.

Don't confuse: The dual theorem (height and antichains) has an obvious efficient algorithm; Dilworth's theorem (width and chains) has a proof that is not immediately algorithmic, even though it is constructive.

🔗 Linear extensions and the subset lattice

📜 Linear extensions (topological sorts)

Linear extension (also topological sort): A linear order L on X such that x < y in L whenever x < y in P.

  • Every finite poset has at least one linear extension.
  • Example: Figure 6.23 shows a poset P₃ with 11 linear extensions, displayed as a table.
  • Discussion 6.24: Bob is not convinced every finite poset has a linear extension; Alice says it is easy to show they do. Carlos mentions subtleties for infinite sets and references Szpilrajn's contribution.

🔍 Sorting and linear extensions

  • The classical sorting problem: determine an unknown linear order L by asking questions of the form "Is x < y in L?"
  • Special case: Determine an unknown linear extension L of a poset P by asking questions of the form "Is x < y in L?"
  • Discussion 6.25: How should Alice decide which question to ask? How hard is it to count the number of linear extensions of a poset? Could you count them for a poset on 100,000 points? (The excerpt does not answer these questions.)

🧊 The subset lattice

Subset lattice: The family of all subsets of a finite set X, partially ordered by inclusion.

  • Notation: 2^t denotes the subset lattice of all subsets of {1, 2, ..., t} ordered by inclusion.
  • Example: Figure 6.26 shows the lattice of all subsets of {1, 2, 3, 4}, represented by bit strings (abbreviated without commas and parentheses).

Elementary properties of 2^t:

| Property | Description |
| --- | --- |
| Height | t + 1; all maximal chains have exactly t + 1 points |
| Size | 2^t elements |
| Ranks (antichains) | Partitioned into A₀, A₁, ..., Aₜ, where rank Aᵢ (the subsets of size i) has C(t, i) elements |
| Maximum rank size | Occurs in the middle, at s = floor(t/2) |
  • The excerpt cuts off mid-sentence describing the maximum binomial coefficient.

6.4 Linear Extensions of Partially Ordered Sets

🧭 Overview

🧠 One-sentence thesis

Every partially ordered set can be "straightened out" into one or more total orderings (linear extensions) that preserve all the original ordering relationships, and determining these extensions efficiently is a fundamental problem in computer science.

📌 Key points (3–5)

  • What a linear extension is: a total ordering of all elements that respects every ordering relationship already present in the poset.
  • Alternative name: linear extensions are also called "topological sorts."
  • Existence: every finite poset has at least one linear extension (possibly many).
  • Counting difficulty: determining how many linear extensions a poset has is computationally hard, even for moderately sized posets.
  • Connection to sorting: the classical sorting problem is a special case—finding an unknown linear extension by asking comparison questions.

🔄 What linear extensions are

📐 Definition and basic idea

Linear extension (also called topological sort): A linear order L on the ground set X such that whenever x < y in the poset P, then x < y in L.

  • A poset may have incomparable elements; a linear extension forces every pair into a definite order.
  • The extension must be faithful: it cannot contradict any ordering already in the poset.
  • Example: if the poset says "a < b" and "c is incomparable to both," a valid linear extension might order them as a, b, c or a, c, b or c, a, b—but never b, a, c (which would violate a < b).
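A small sketch makes the definition concrete (my own illustration, not from the text): one linear extension can always be produced by repeatedly removing a minimal element of what remains, which is exactly a topological sort.

```python
# Produce one linear extension of a finite poset by repeatedly taking a
# minimal element of the remaining subposet (topological sort).
# `less` is a set of pairs (x, y) meaning x < y in P.
def linear_extension(elements, less):
    remaining = set(elements)
    order = []
    while remaining:
        # Any minimal element may come next; break ties alphabetically.
        x = next(x for x in sorted(remaining)
                 if not any((y, x) in less for y in remaining))
        order.append(x)
        remaining.discard(x)
    return order

less = {("a", "b")}          # a < b; c incomparable to both
ext = linear_extension(["a", "b", "c"], less)
# With alphabetical tie-breaking this returns ["a", "b", "c"];
# other tie-breaks give the other valid extensions such as ["c", "a", "b"].
```

Since a finite nonempty poset always has a minimal element, this loop always terminates with a valid extension, which is one way to see that every finite poset has at least one.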

🔢 How many extensions exist

  • The excerpt shows that the example poset P₃ has exactly 11 distinct linear extensions (displayed in a table).
  • Different posets have different numbers of extensions; some may have very few, others very many.
  • The number depends on how much "freedom" the poset leaves—more incomparable pairs mean more possible orderings.

🤔 Existence and subtleties

✅ Finite posets always have linear extensions

  • Bob doubts whether every finite poset has a linear extension.
  • Alice claims it is easy to show they do (the excerpt does not give the proof, but asserts existence).
  • Don't confuse: the question is not "does a unique extension exist?" but "does at least one exist?"—the answer for finite posets is always yes.

♾️ Infinite posets and Szpilrajn's contribution

  • Carlos notes that when the ground set X is infinite, the question becomes subtle.
  • The excerpt mentions Szpilrajn (a mathematician) who contributed to resolving this issue for infinite posets.
  • For review purposes: finite posets are straightforward; infinite cases require more care.

🖥️ Connection to sorting algorithms

🔍 Classical sorting as a special case

  • Standard sorting algorithms (bubble sort, merge sort, quick sort) determine an unknown linear order L by asking questions of the form "Is x < y in L?"
  • Special case: determine an unknown linear extension L of a known poset P by asking the same type of questions.
  • The poset structure provides partial information; the algorithm must "fill in" the rest.

❓ Strategy for asking questions

  • Alice faces the problem: given a poset, which question should she ask first to efficiently discover the unknown linear extension?
  • The excerpt does not provide an answer but poses it as a discussion point.
  • This is a research-level question in algorithm design.

📊 Computational difficulty

🧮 Counting linear extensions is hard

  • The excerpt asks: "How hard is it to determine the number of linear extensions of a poset?"
  • Even for a poset on 100,000 points, counting all extensions is computationally infeasible with current methods.
  • This is not just a matter of writing code—it is a fundamentally difficult combinatorial problem.
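To see why counting is so much harder than finding, here is a brute-force counter (my own sketch, not a method from the text): it works only because it scans all n! permutations, which is hopeless beyond a handful of points.

```python
from itertools import permutations

# Brute-force count of linear extensions: check every permutation of the
# ground set against the poset relations. Exponential -- fine for 4 points,
# useless for 100,000.
def count_linear_extensions(elements, less):
    count = 0
    for perm in permutations(elements):
        pos = {x: i for i, x in enumerate(perm)}
        if all(pos[x] < pos[y] for (x, y) in less):
            count += 1
    return count

# Two disjoint 2-chains (the poset 2 + 2): a < b and c < d.
less = {("a", "b"), ("c", "d")}
n = count_linear_extensions(["a", "b", "c", "d"], less)
# n == 6: of the 24 permutations, those keeping a before b and c before d.
```

Even clever exact algorithms for this problem run in time exponential in the width; counting linear extensions is #P-complete in general.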

🗂️ Comparison: existence vs counting vs finding

| Task | Difficulty (for finite posets) | Notes from excerpt |
| --- | --- | --- |
| Does a linear extension exist? | Easy (always yes) | Alice says it's easy to show |
| Find one linear extension | Moderate | Algorithms exist, but efficiency varies |
| Count all linear extensions | Very hard | Infeasible for large posets (e.g., 100,000 elements) |
  • Don't confuse "finding one extension" with "counting all extensions"—the former is a standard algorithmic task; the latter is much harder.

6.5 The Subset Lattice

🧭 Overview

🧠 One-sentence thesis

The subset lattice—all subsets of a finite set ordered by inclusion—has width equal to the size of its largest rank, which occurs at the middle level, as proven by Sperner's Theorem.

📌 Key points (3–5)

  • What the subset lattice is: the poset formed by all subsets of a finite set, ordered by inclusion (denoted 2^t for subsets of {1, 2, ..., t}).
  • Structure properties: height is t + 1, total size is 2^t, and elements are partitioned into ranks (antichains) by subset size.
  • Sperner's Theorem: the width (maximum antichain size) equals the largest binomial coefficient, which occurs at the middle rank(s).
  • Common confusion: the width is not the total number of subsets, but the maximum size of a single rank—specifically the rank at floor(t/2).
  • Proof technique: counting maximal chains through antichain elements and showing their disjointness constrains the antichain size.

📐 Structure and Basic Properties

📐 Definition and notation

Subset lattice: the family of all subsets of a finite set X, partially ordered by inclusion.

  • For a positive integer t, 2^t denotes the subset lattice of all subsets of {1, 2, ..., t} ordered by inclusion.
  • The excerpt uses bit strings to represent sets (e.g., "1010" represents a subset).
  • Example: For X = {1, 2, 3, 4}, the subset lattice contains all 16 subsets from the empty set (0000) to the full set (1111).

📏 Height and chains

  • Height: t + 1 (the longest chain has t + 1 points).
  • All maximal chains have exactly t + 1 points.
  • This follows because you can build a chain from the empty set by adding one element at a time until you reach the full set.

🔢 Size and ranks

  • Total size: 2^t elements (all possible subsets).
  • Elements are partitioned into ranks A₀, A₁, ..., Aₜ where:
    • Rank Aᵢ contains all subsets of size i.
    • The size of rank Aᵢ is the binomial coefficient C(t, i) (t choose i).
  • Each rank forms an antichain (no two subsets of the same size can be related by inclusion).

🎯 Maximum rank size

  • The maximum rank size occurs in the middle of the lattice.
  • If s = floor(t/2), then C(t, s) is the largest binomial coefficient in the sequence C(t, 0), C(t, 1), ..., C(t, t).
  • When t is odd: two ranks of maximum size (the two middle ranks).
  • When t is even: only one rank of maximum size (the single middle rank).
  • Example: For t = 4, the middle rank is at size 2, with C(4, 2) = 6 subsets.
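These rank facts are easy to verify numerically; a quick check (my own, using only the standard library) for t = 4:

```python
from math import comb

# Rank sizes of the subset lattice 2^t: rank i holds the C(t, i) subsets
# of size i, and the largest rank sits at floor(t/2).
t = 4
ranks = [comb(t, i) for i in range(t + 1)]   # [1, 4, 6, 4, 1]
assert sum(ranks) == 2 ** t                  # 16 subsets in total
assert max(ranks) == comb(t, t // 2)         # middle rank, C(4, 2) = 6
```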

🏆 Sperner's Theorem

🏆 Statement of the theorem

Sperner's Theorem: For each t ≥ 1, the width of the subset lattice 2^t equals the maximum size of a rank, specifically width(2^t) = C(t, floor(t/2)).

  • Width means the maximum size of an antichain in the poset.
  • The theorem says the width is achieved by taking all subsets of a single middle size.
  • Don't confuse: width is not the number of ranks, but the size of the largest antichain.

🔍 Lower bound (easy direction)

  • The set of all floor(t/2)-element subsets of {1, 2, ..., t} forms an antichain.
  • This is because no subset of size floor(t/2) can be contained in another subset of the same size.
  • Therefore, width(2^t) is at least C(t, floor(t/2)).

🔐 Upper bound (proof technique)

The proof shows that width(2^t) is at most C(t, floor(t/2)) by counting maximal chains:

  1. Setup: Let w be the width and {S₁, S₂, ..., S_w} be an antichain of size w.

    • Each Sᵢ is a subset of {1, 2, ..., t}.
    • For i < j, neither Sᵢ ⊆ Sⱼ nor Sⱼ ⊆ Sᵢ (antichain property).
  2. Counting chains through each element:

    • For each Sᵢ with |Sᵢ| = kᵢ, let 𝒮ᵢ be the set of all maximal chains passing through Sᵢ.
    • The number of such chains is |𝒮ᵢ| = kᵢ! × (t - kᵢ)!.
    • Why: to build a maximal chain through Sᵢ, delete elements one at a time (kᵢ! ways) for the lower part, and add elements one at a time ((t - kᵢ)! ways) for the upper part.
  3. Disjointness of chain sets:

    • If i < j, then 𝒮ᵢ ∩ 𝒮ⱼ = ∅ (empty intersection).
    • Why: if a maximal chain belonged to both 𝒮ᵢ and 𝒮ⱼ, it would pass through both Sᵢ and Sⱼ, implying one is a subset of the other, contradicting the antichain property.
  4. Total chain count constraint:

    • There are exactly t! maximal chains in 2^t.
    • Therefore: sum from i=1 to w of [kᵢ! × (t - kᵢ)!] ≤ t!.
  5. Algebraic manipulation:

    • Dividing by t!: sum from i=1 to w of [kᵢ! × (t - kᵢ)! / t!] ≤ 1.
    • This simplifies to: sum from i=1 to w of [1 / C(t, kᵢ)] ≤ 1.
    • Since each term is at least 1 / C(t, floor(t/2)) (the largest binomial coefficient in the row), we get: w / C(t, floor(t/2)) ≤ 1.
    • Therefore: w ≤ C(t, floor(t/2)).
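The inequality in step 5 (the LYM inequality) can be checked numerically. In this sketch (my own, not from the text) `sizes` lists the cardinalities of the members of some antichain; for the full middle rank the sum is exactly 1:

```python
from math import comb

# LYM-style check: for an antichain in 2^t whose members have sizes
# k_1, ..., k_w, the sum of 1 / C(t, k_i) is at most 1.
def lym_sum(t, sizes):
    return sum(1 / comb(t, k) for k in sizes)

t = 4
middle_rank = [2] * comb(t, t // 2)          # all six 2-element subsets
assert abs(lym_sum(t, middle_rank) - 1.0) < 1e-9

# A smaller antichain, e.g. {1} and {2, 3, 4}: sum is 1/4 + 1/4 = 0.5.
assert lym_sum(t, [1, 3]) <= 1
```

Since each term is at least 1 / C(t, floor(t/2)), an antichain larger than the middle rank would push the sum past 1, which is exactly the upper-bound argument.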

🎓 Conclusion

  • Combining the lower and upper bounds proves that width(2^t) = C(t, floor(t/2)).
  • The maximum antichain is exactly the middle rank(s).

🔗 Interval Orders (Introduction)

🔗 Definition

Interval order: a poset P = (X, P) for which there exists a function I assigning to each element x in X a closed interval I(x) = [aₓ, bₓ] of the real line such that for all x, y in X, x < y in P if and only if bₓ < aᵧ in the real numbers.

  • Interval representation (or just representation): the function I that assigns intervals.
  • Notation: [aₓ, bₓ] denotes the closed interval for element x.
  • Length of interval: |I(x)| = bₓ - aₓ.

🎨 Properties of representations

  • Endpoints need not be distinct: different elements can share endpoints.
  • Intervals can be identical: distinct points x and y may satisfy I(x) = I(y).
  • Degenerate intervals allowed: intervals of the form [a, a] (zero length) are permitted.
  • Distinguishing representation: all intervals are non-degenerate and all endpoints are distinct; every interval order has such a representation.

🚫 Forbidden subposet characterization

Fishburn's Theorem: A poset P is an interval order if and only if it excludes 2 + 2 as a subposet.

Notation:

  • n: the chain with n points {0, 1, ..., n-1} where i < j in n if and only if i < j in the integers.
  • P + Q: the disjoint union of posets P and Q with no comparabilities between elements from different posets.
  • 2 + 2: a four-point poset {a, b, c, d} with a < b and c < d as the only relations (two disjoint chains of length 2).

🔍 Proof direction (interval order excludes 2 + 2)

The excerpt proves one direction: if P is an interval order, it cannot contain 2 + 2 as a subposet.

  • Setup: Suppose P contains four points {x, y, z, w} forming 2 + 2, with x < y and z < w in P.
  • Incomparability: This means x ∥ w (x and w are incomparable) and z ∥ y (z and y are incomparable).
  • Contradiction with intervals: If I is an interval representation, then:
    • x < y implies b_x < a_y, and z < w implies b_z < a_w.
    • x ∥ w forces a_w ≤ b_x (the intervals overlap), and z ∥ y forces a_y ≤ b_z.
    • Chaining: b_z < a_w ≤ b_x < a_y ≤ b_z, a contradiction.
  • Therefore, P cannot be an interval order.
  • Don't confuse: the proof in the other direction (if P excludes 2 + 2, then P is an interval order) is deferred to a later section.

6.6 Interval Orders

🧭 Overview

🧠 One-sentence thesis

Interval orders are a special class of posets that can be represented by intervals on the real line and are precisely those posets that exclude the forbidden subposet 2 + 2.

📌 Key points (3–5)

  • What interval orders are: posets where each element corresponds to a closed interval on the real line, with x < y in the poset if and only if the right endpoint of x's interval is less than the left endpoint of y's interval.
  • Fishburn's characterization: a poset is an interval order if and only if it does not contain 2 + 2 as a subposet (two disjoint 2-element chains).
  • Algorithmic solution: there exists a straightforward algorithm to either find an interval representation or detect the forbidden 2 + 2 subposet.
  • Common confusion: interval endpoints need not be distinct—different elements may share the same interval, and degenerate intervals [a, a] are allowed, though distinguishing representations (with all distinct, non-degenerate intervals) always exist.
  • Unique minimal representation: when representing an interval order with integer endpoints, there is a smallest n for which a representation exists, and that representation is unique.

📐 Definition and basic properties

📐 What is an interval order

Interval order: A poset P = (X, P) is an interval order if there exists a function I assigning to each element x in X a closed interval I(x) = [aₓ, bₓ] of the real line ℝ so that for all x, y in X, x < y in P if and only if bₓ < aᵧ in ℝ.

  • The function I is called an interval representation of P.
  • For brevity, I(x) is written as [aₓ, bₓ].
  • The length of an interval is |I(x)| = bₓ - aₓ.
  • Why this works: the condition bₓ < aᵧ means x's interval ends strictly before y's interval begins, capturing the "less than" relation.

Example: The poset P₃ shown in Figure 6.28 is an interval order, demonstrated by the representation where each element is assigned an interval on the real line.

🔧 Flexibility in representations

Interval representations have considerable flexibility:

  • Endpoints need not be distinct: multiple intervals can share the same endpoint values.
  • Intervals may coincide: distinct points x and y may satisfy I(x) = I(y).
  • Degenerate intervals allowed: intervals of the form [a, a] (zero length) are permitted.

Distinguishing representation: A representation where all intervals are non-degenerate and all endpoints are distinct.

  • Every interval order has a distinguishing representation (though the excerpt states this is "relatively easy to see" without providing the proof).

🚫 Forbidden subposet characterization

🚫 The 2 + 2 poset

Before stating the characterization, we need notation:

  • n: the chain with n points, specifically the ground set {0, 1, ..., n-1} with i < j in n if and only if i < j in the integers.
  • P + Q: for disjoint posets P = (X, P) and Q = (Y, Q), the poset R = (X ∪ Y, R) where z ≤ w in R if and only if:
    • (a) z, w ∈ X and z ≤ w in P, or
    • (b) z, w ∈ Y and z ≤ w in Q.

Thus 2 + 2 consists of two disjoint 2-element chains with no comparabilities between them.

Example: 2 + 2 can be viewed as a four-point poset with ground set {a, b, c, d} where a < b and c < d are the only relations (besides reflexivity).

🎯 Fishburn's Theorem

Fishburn's Theorem: Let P = (X, P) be a poset. Then P is an interval order if and only if it excludes 2 + 2.

Proof (one direction): An interval order cannot contain 2 + 2 as a subposet.

  • Suppose P = (X, P) is a poset and {x, y, z, w} ⊆ X form a subposet isomorphic to 2 + 2.
  • Without loss of generality, assume x < y and z < w in P.
  • This means x ∥ w (x and w are incomparable) and z ∥ y in P.
  • If I were an interval representation, then b_x < a_y and b_z < a_w in ℝ.
  • Since x ∥ w, the intervals I(x) and I(w) overlap, so a_w ≤ b_x; since z ∥ y, a_y ≤ b_z. Chaining these gives b_z < a_w ≤ b_x < a_y ≤ b_z, which is a contradiction.
  • Therefore, P cannot be an interval order.

Don't confuse: The theorem is an "if and only if" statement; the excerpt provides only the "interval order implies no 2 + 2" direction here, deferring the converse to the next section.

🔍 Algorithm for finding representations

🔍 Down sets and up sets notation

For a poset P = (X, P) and subset S ⊆ X:

  • D(S): the down set = {y ∈ X : there exists some x ∈ S with y < x in P}.
  • D[S]: D(S) ∪ S (the down set including S itself).
  • When S = {x}, write D(x) and D[x] instead of D({x}) and D[{x}].

Dually:

  • U(S): the up set = {y ∈ X : there exists some x ∈ S with y > x in P}.
  • U[S]: U(S) ∪ S.
  • When S = {x}, write U(x) for {y ∈ X : x < y in P}.

🛠️ The algorithm procedure

Step 1: Find the family D = {D(x) : x ∈ X}.

Step 2: Check two cases:

Case 1 (2 + 2 detected): There exist distinct elements x and y where D(x) ⊈ D(y) and D(y) ⊈ D(x).

  • Choose z ∈ D(x) \ D(y) and w ∈ D(y) \ D(x).
  • The four elements {x, y, z, w} form a subposet isomorphic to 2 + 2.
  • Conclusion: P is not an interval order.

Case 2 (interval order): For all x, y ∈ X, either D(x) ⊆ D(y) or D(y) ⊆ D(x).

  • Find the family U = {U(x) : x ∈ X}.
  • In this case, for all x, y ∈ X, either U(x) ⊆ U(y) or U(y) ⊆ U(x).
  • Let d = |D|. (The excerpt notes that |U| = |D|, proven in exercises.)

🏗️ Constructing the representation

When Case 2 holds:

Step 3: Label the sets in D and U as D₁, D₂, ..., D_d and U₁, U₂, ..., U_d so that:

  • ∅ = D₁ ⊆ D₂ ⊆ D₃ ⊆ ... ⊆ D_d
  • U₁ ⊇ U₂ ⊇ ... ⊇ U_{d-1} ⊇ U_d = ∅

Step 4: Form the interval representation I by the rule:

For each x ∈ X, set I(x) = [i, j], where D(x) = Dᵢ and U(x) = Uⱼ.

Potential issue: It might happen that j < i, making the rule illegal.

  • The excerpt states this never happens (proven in exercises).

📊 Properties of the algorithm

Theorem 6.30: If P is a poset excluding 2 + 2, then:

  1. The number of down sets equals the number of up sets: |D| = |U|.
  2. For each x ∈ X, if I(x) = [i, j], then i ≤ j in ℝ.
  3. (Statement incomplete in excerpt)

Why this matters:

  • The algorithm provides a constructive proof of the other direction of Fishburn's Theorem.
  • Either the algorithm finds an interval representation, or it finds a 2 + 2 subposet.
  • This gives both a characterization and a practical method for working with interval orders.
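Case 1 of the algorithm is straightforward to sketch in code (my own illustration; the data layout is hypothetical): compare down sets pairwise, and if two are incomparable under inclusion, extract the witnesses of a 2 + 2.

```python
# Detect a 2 + 2 subposet via incomparable down sets (Case 1 of the
# algorithm). `less` is a set of strict pairs (x, y) meaning x < y in P,
# assumed transitively closed. Returns two disjoint 2-chains, or None
# when every pair of down sets is comparable (so P is an interval order).
def find_two_plus_two(elements, less):
    down = {x: {y for y in elements if (y, x) in less} for x in elements}
    for x in elements:
        for y in elements:
            if not down[x] <= down[y] and not down[y] <= down[x]:
                z = next(iter(down[x] - down[y]))
                w = next(iter(down[y] - down[x]))
                return {(z, x), (w, y)}   # z < x and w < y form a 2 + 2
    return None

# The poset 2 + 2 itself: a < b and c < d.
less = {("a", "b"), ("c", "d")}
result = find_two_plus_two(["a", "b", "c", "d"], less)
# result == {("a", "b"), ("c", "d")}: the two disjoint 2-chains.
```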

🎯 Uniqueness of minimal representations

🎯 Integer endpoint representations

When P = (X, P) is an interval order and n is a positive integer:

  • There may be many ways to represent P using intervals with integer endpoints in [n] (the set {1, 2, ..., n}).
  • There is a least n for which a representation can be found.
  • For this minimal n, the representation is unique.

Don't confuse: This uniqueness applies specifically to the minimal integer representation, not to all possible representations (which can vary widely in endpoint choices).


6.7 Finding a Representation of an Interval Order

🧭 Overview

🧠 One-sentence thesis

An algorithm can determine whether a poset is an interval order by either constructing a unique minimal interval representation or discovering a forbidden 2+2 subposet, completing Fishburn's Theorem.

📌 Key points (3–5)

  • The algorithm's two outcomes: either it finds an interval representation using integer endpoints, or it detects a subposet isomorphic to 2+2 (proving the poset is not an interval order).
  • Uniqueness of minimal representation: for any interval order, there is a smallest integer d such that the poset can be represented using intervals with endpoints in {1, 2, ..., d}, and this representation is unique.
  • How the algorithm works: it constructs families of down sets D and up sets U, checks whether they form chains under inclusion, and assigns intervals based on these sets.
  • Common confusion: the algorithm applies to any poset, not just interval orders—it serves as both a constructor (for interval orders) and a detector (for non-interval orders).
  • Connection to graph coloring: for interval orders, finding width and a minimum chain partition reduces to coloring the incomparability graph, which is an interval graph.

🔍 The algorithm's decision procedure

🔍 Building the down-set family

For a poset P = (X, P) and element x, the down set D(x) = {y ∈ X : y < x in P}.

  • The algorithm first collects all down sets: D = {D(x) : x ∈ X}.
  • Notation: D[x] means D(x) ∪ {x} (the down set including x itself).
  • Dually, up sets are defined: U(x) = {y ∈ X : x < y in P}.

⚠️ Case 1: Detecting 2+2

The algorithm checks whether all down sets are comparable by inclusion.

When 2+2 is found:

  • If there exist distinct x and y where D(x) ⊈ D(y) and D(y) ⊈ D(x), then the down sets are not totally ordered by inclusion.
  • Choose z ∈ D(x) \ D(y) and w ∈ D(y) \ D(x).
  • The four elements {x, y, z, w} form a subposet isomorphic to 2+2.
  • This proves P is not an interval order.

Example: In the modified poset where the line joining c and d is erased, D(j) = {f, g} and D(d) = {c}. Since neither is a subset of the other, the points c, d, f, and j form a 2+2 configuration.

✅ Case 2: Building the representation

When all down sets are comparable:

  • Either D(x) ⊆ D(y) or D(y) ⊆ D(x) for all x, y ∈ X.
  • In this case, P is an interval order.
  • The up sets U also form a chain: either U(x) ⊆ U(y) or U(y) ⊆ U(x) for all pairs.

🏗️ Constructing the interval representation

🏗️ Labeling the chains

Let d = |D| (the number of distinct down sets).

Key structural fact (Theorem 6.30):

  • The number of down sets equals the number of up sets: |D| = |U|.
  • Label the sets so that:
    • ∅ = D₁ ⊆ D₂ ⊆ D₃ ⊆ ... ⊆ D_d
    • U₁ ⊇ U₂ ⊇ ... ⊇ U_{d-1} ⊇ U_d = ∅

📐 The assignment rule

For each element x ∈ X, assign the interval I(x) = [i, j] where:

  • D(x) = Dᵢ (x's down set matches the i-th down set)
  • U(x) = Uⱼ (x's up set matches the j-th up set)

Why this works (Theorem 6.30 guarantees):

  • For every x, the assigned interval satisfies i ≤ j (the interval is valid).
  • For any x, y: x < y in P if and only if j < k (where I(x) = [i, j] and I(y) = [k, l]).
  • The integer d is the minimum number needed for any integer-endpoint representation.
  • This representation is unique.
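Case 2 can also be sketched briefly (my own code, with a hypothetical data layout), assuming the 2 + 2 check has already passed so the down sets form a chain under inclusion:

```python
# Build the canonical interval representation of an interval order:
# I(x) = [i, j] where D(x) is the i-th down set and U(x) the j-th up set.
# `less` holds strict pairs (x, y) with x < y, transitively closed.
def interval_representation(elements, less):
    down = {x: frozenset(y for y in elements if (y, x) in less)
            for x in elements}
    up = {x: frozenset(y for y in elements if (x, y) in less)
          for x in elements}
    # The distinct down sets form a chain under inclusion, so distinct
    # sets have distinct sizes and sorting by size orders them correctly;
    # up sets are ordered from largest to smallest.
    D = sorted(set(down.values()), key=len)
    U = sorted(set(up.values()), key=len, reverse=True)
    d_index = {s: i + 1 for i, s in enumerate(D)}
    u_index = {s: j + 1 for j, s in enumerate(U)}
    return {x: (d_index[down[x]], u_index[up[x]]) for x in elements}

# A 3-chain a < b < c (relations transitively closed).
less = {("a", "b"), ("b", "c"), ("a", "c")}
rep = interval_representation(["a", "b", "c"], less)
# rep == {"a": (1, 1), "b": (2, 2), "c": (3, 3)}
```

On a chain every point gets a degenerate interval, and x < y holds exactly when x's right endpoint is below y's left endpoint, as the theorem guarantees.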

🧮 Worked example

For the 10-point poset in Figure 6.31, d = 5:

| Down sets | Up sets |
| --- | --- |
| D₁ = ∅ | U₁ = {a, b, d, e, h, i, j} |
| D₂ = {c} | U₂ = {a, b, e, h, i, j} |
| D₃ = {c, f, g} | U₃ = {b, e, i} |
| D₄ = {c, f, g, h} | U₄ = {e} |
| D₅ = {a, c, f, g, h, j} | U₅ = ∅ |

Resulting intervals:

  • I(a) = [3, 4], I(b) = [4, 5], I(c) = [1, 1]
  • I(d) = [2, 5], I(e) = [5, 5], I(f) = [1, 2]
  • I(g) = [1, 2], I(h) = [3, 3], I(i) = [4, 5], I(j) = [3, 4]

🎨 Connection to width and chain partitions

🎨 The incomparability graph

For an interval order P with interval representation {[aₓ, bₓ] : x ∈ X}:

  • Build the interval graph G where x and y are adjacent if and only if x and y are incomparable in P.
  • G is called the incomparability graph of P.

Key insight:

  • x and y are incomparable in P ⟺ their intervals overlap ⟺ xy is an edge in G.
  • x < y in P ⟺ bₓ < aᵧ ⟺ x and y are not adjacent in G.

🔗 Using graph coloring

Since interval graphs are perfect (chromatic number equals clique number):

  • A coloring of G corresponds to a partition of P into chains.
  • An optimal coloring (minimum number of colors) gives a minimum chain partition.
  • The width of P equals the chromatic number of G.

First Fit algorithm:

  • Order elements x₁, x₂, ..., xₙ so that i < j whenever D(xᵢ) is a proper subset of D(xⱼ).
  • Assign each xᵢ₊₁ to the first chain Cⱼ where xᵢ₊₁ is comparable to all elements already in Cⱼ.

📊 Example chain partition

For the 10-point poset, using the ordering g, f, c, d, h, a, j, b, i, e:

| Chain | Elements |
| --- | --- |
| C₁ | {g, h, b} |
| C₂ | {f, a, e} |
| C₃ | {c, d} |
| C₄ | {j} |
| C₅ | {i} |

Don't confuse: This First Fit approach works efficiently for interval orders because their incomparability graphs are interval graphs (perfect graphs), but it does not generalize to arbitrary posets.


6.8 Dilworth’s Theorem for Interval Orders

🧭 Overview

🧠 One-sentence thesis

For interval orders, there is a simple First Fit algorithm that efficiently finds both the width of the poset and an optimal partition into chains, unlike the general poset case where no efficient method is yet known.

📌 Key points (3–5)

  • Connection to interval graphs: An interval order's incomparability graph is an interval graph, and coloring that graph corresponds to partitioning the poset into chains.
  • First Fit works for interval orders: Applying First Fit to elements ordered by their down-sets (D(x)) produces an optimal chain partition for interval orders.
  • First Fit fails in general: For arbitrary posets, First Fit can produce arbitrarily bad partitions into chains or antichains, even when the width or height is small.
  • Common confusion: While some linear order always exists for which First Fit finds an optimal partition, searching for that order is impractical—specialized algorithms are needed for general posets.
  • Why interval orders are special: They inherit the "perfect" property from interval graphs, meaning chromatic number equals clique number, which translates to width equaling maximum antichain size.

🔗 Connection between interval orders and interval graphs

🔗 The incomparability graph

  • Given an interval order P with interval representation where x < y in P if and only if the right endpoint of x's interval is less than the left endpoint of y's interval.
  • Construct graph G where x and y are connected by an edge if and only if they are incomparable in P.
  • This graph G is called the incomparability graph of P.
  • Key insight: G is an interval graph (determined by the same intervals).

🎨 Perfect graphs and optimal coloring

  • Interval graphs are perfect: chromatic number χ(G) equals clique number ω(G).
  • An optimal coloring of G can be found by applying First Fit to vertices ordered by their left endpoints.
  • Each color class in the graph coloring corresponds to a chain in the poset P.
  • Why: vertices with the same color are not adjacent in G, meaning they are comparable in P, forming a chain.

🔧 The First Fit algorithm for interval orders

🔧 Ordering the elements

To apply First Fit to an interval order P:

  1. Order elements as x₁, x₂, ..., xₙ such that i < j whenever D(xᵢ) is a proper subset of D(xⱼ).
  2. D(x) denotes the down-set of x (all elements less than or equal to x in P).

🔧 Assigning to chains

  • Assign x₁ to chain C₁.
  • For each subsequent xᵢ₊₁, assign it to chain Cⱼ where j is the least positive integer such that xᵢ₊₁ is comparable to every element already assigned to Cⱼ.
  • This ensures each chain contains only comparable elements.
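The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text; the `comparable` predicate and the sample intervals are hypothetical, with x < y exactly when x's interval lies entirely to the left of y's.

```python
def first_fit_chains(elements, comparable):
    """Assign each element to the first chain in which it is
    comparable to every element already placed there."""
    chains = []
    for x in elements:
        for chain in chains:
            if all(comparable(x, y) for y in chain):
                chain.append(x)
                break
        else:  # no existing chain works: start a new one
            chains.append([x])
    return chains

# Hypothetical interval representation: x < y iff x's interval lies
# entirely to the left of y's, so "comparable" means disjoint intervals.
iv = {'a': (0, 1), 'b': (2, 3), 'c': (0, 3), 'd': (4, 5)}

def comparable(x, y):
    return iv[x][1] < iv[y][0] or iv[y][1] < iv[x][0]

# Processing elements by left endpoint (a valid down-set order for an
# interval order) yields an optimal partition: 2 chains here, matching
# the width (intervals 'a' and 'c' overlap, so the width is 2).
order = sorted(iv, key=lambda x: iv[x][0])
chains = first_fit_chains(order, comparable)
```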

📊 Example from the excerpt

The 10-point interval order from Figure 6.31 produces:

  • Ordering: x₁ = 1, x₂ = f, x₃ = c, x₄ = d, x₅ = h, x₆ = a, x₇ = j, x₈ = b, x₉ = i, x₁₀ = e
  • Resulting chains:
    • C₁ = {1, h, b}
    • C₂ = {f, a, e}
    • C₃ = {c, d}
    • C₄ = {j}
    • C₅ = {i}
  • This partition is optimal because the width of P is 5, and A = {a, b, d, i, j} is a 5-element antichain.

⚠️ Why First Fit fails for general posets

⚠️ Bad performance on arbitrary posets

The excerpt warns: "you should be very careful in applying First Fit to find optimal chain partitions of posets—just as one must be leery of using First Fit to find optimal colorings of graphs."

📉 Concrete counterexamples

The excerpt describes two posets in Figure 6.33:

| Poset type | Height/Width | First Fit result | Optimal result | Problem |
| --- | --- | --- | --- | --- |
| Height-2 poset (10 points) | Height = 2 | Uses 5 antichains | 2 antichains | Can be extended to force arbitrarily many antichains |
| Width-2 poset | Width = 2 | Uses 4 chains | 2 chains | Can be extended to force arbitrarily many chains |
  • The excerpt notes that forcing First Fit to use many chains while keeping width at 2 is "a bit harder" than the antichain case.

🔍 Existence vs practicality

"In general, there is always some linear order on the ground set of a poset for which First Fit will find an optimal partition into antichains. Also, there is a linear order (in general different from the first) on the ground set for which First Fit will find an optimal partition into chains."

  • Don't confuse: The existence of such an order doesn't help, because finding it is as hard as solving the original problem.
  • The excerpt concludes: "there is no advantage in searching for such orders, as the algorithms we develop for finding optimal antichain and chain partitions work quite well."

🎯 Why interval orders are tractable

🎯 Skipping interval representations

The excerpt offers an alternative approach that avoids constructing interval representations:

  • Order elements x₁, x₂, ..., xₙ so that i < j whenever D(xᵢ) is a proper subset of D(xⱼ).
  • Apply First Fit with respect to chains in this order.
  • This works because the down-set ordering respects the interval structure implicitly.

🎯 The general challenge

The excerpt foreshadows that finding the width of a general poset remains an open problem:

  • "We do not yet have an efficient process for determining the width of a poset and a minimum partition into chains."
  • One character (Yolanda) notes: "we still don't have a clue as to how to find the width of a poset in the general case. This might be very difficult—like the graph coloring problems discussed in the last chapter."
  • Another character (Dave) speculates there might be a fairly efficient process for all posets, but the tools aren't available yet.

🎯 Practical implications

  • Interval orders represent a special case where the structure (perfect interval graphs) makes the problem tractable.
  • For general posets, specialized algorithms beyond First Fit are necessary.
  • The excerpt suggests poset problems are often more complicated than their graph analogs, "sometimes a little bit and sometimes a very big bit."

6.9 Discussion

🧭 Overview

🧠 One-sentence thesis

First Fit algorithms can fail badly on posets—requiring arbitrarily many antichains or chains even when the height or width is small—but specialized algorithms for optimal partitions exist and work well.

📌 Key points (3–5)

  • First Fit can fail: applying First Fit to partition posets can use far more antichains/chains than optimal, even on simple posets.
  • Existence vs. practicality: there always exists some linear order that makes First Fit optimal, but finding it offers no advantage over direct algorithms.
  • Common confusion: First Fit works differently for antichains vs. chains—forcing many chains while keeping width 2 is harder than forcing many antichains while keeping height 2.
  • Concrete procedures vs. general solutions: interval orders have efficient algorithms, but the general poset width problem remains unsolved at this point.
  • Posets vs. graphs: many combinatorial problems have both graph and poset versions, with the poset version typically more complicated.

⚠️ Why First Fit fails

⚠️ The antichain partition problem

  • Example 6.32 shows a height-2 poset on 10 points where First Fit uses 5 antichains (when points are considered in label order).
  • The excerpt asks: "Do you see how to extend this poset to force First Fit to use arbitrarily many antichains, while keeping the height of the poset at 2?"
  • This demonstrates that First Fit can be arbitrarily bad even when the optimal number (the height) remains constant.

⚠️ The chain partition problem

  • The same example shows a width-2 poset where First Fit uses 4 chains.
  • The excerpt asks: "Do you see how to extend this poset to force First Fit to use arbitrarily many chains while keeping the width of the poset at 2?"
  • The excerpt notes: "Do you get a feeling for why the second problem is a bit harder than the first?"
  • This suggests that constructing adversarial examples for chain partitions is more subtle than for antichain partitions.

🔍 Theoretical existence vs. practical use

🔍 Optimal orders exist but are not useful

In general, there is always some linear order on the ground set of a poset for which First Fit will find an optimal partition into antichains.

  • Similarly, there exists a (generally different) linear order for which First Fit finds an optimal chain partition.
  • Why not search for these orders? The excerpt states: "there is no advantage in searching for such orders, as the algorithms we develop for finding optimal antichain and chain partitions work quite well."
  • The practical lesson: direct algorithms outperform trying to find the "right" order for First Fit.

💬 Team perspectives on the chapter

💬 Concrete vs. general solutions

The discussion section presents different viewpoints:

| Character | Perspective | Concern or observation |
| --- | --- | --- |
| Bob | Likes concrete procedures | "This material was full of cases of very concrete procedures for doing useful things." |
| Yolanda | Worried about limitations | The last procedure only works for interval orders; general poset width is still unsolved and "might be very difficult—like the graph coloring problems." |
| Dave | Optimistic about future | Believes "there's going to be a fairly efficient process that works for all posets" even if the tools aren't available yet. |
| Carlos | Comparative complexity | Poset problems have graph analogues, and "the poset version would be a bit more complicated, sometimes a little bit and sometimes a very big bit." |
| Zori | Practical applications | Thinking about real-world uses where linear orders are "impossible or impractical"—posets might have commercial value. |

💬 Open questions at this stage

  • The width of a general poset (not just interval orders) remains an open problem in the narrative.
  • The analogy to graph coloring suggests this may be computationally hard.
  • Don't confuse: having an algorithm for special cases (interval orders) vs. having one for all posets.

6.10 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set applies poset concepts (partial orders, diagrams, chains, antichains, interval orders, dimension) and inclusion-exclusion techniques (surjections, derangements, Euler φ-function) to concrete computational problems.

📌 Key points (3–5)

  • Poset exercises: verify partial order properties, draw Hasse diagrams for divisibility and subset relations, find maximal/minimal elements, chains, antichains, height, and width.
  • Interval order tasks: determine whether a poset is an interval order and construct interval representations or explain why none exists.
  • Inclusion-exclusion applications: count surjections, derangements (permutations with no fixed points), and compute Euler's φ-function for integers.
  • Common confusion: distinguishing maximal chains (cannot be extended) from non-maximal chains (can still add elements); recognizing when a poset can be represented by intervals.
  • Why it matters: these exercises build fluency in analyzing order structures and applying counting principles to real-world distribution and number-theoretic problems.

📐 Poset structure and properties

📐 Verifying partial orders

  • Task: given a relation P on a set X, check whether P is a partial order (reflexive, antisymmetric, transitive).
  • If not, either list the ordered pairs that must be added to make it a partial order, or explain why it cannot be made one by adding pairs.
  • Example: if a relation is missing reflexive pairs (x, x), you can add them; if it violates antisymmetry, adding pairs won't fix it.

🖼️ Drawing Hasse diagrams

  • Divisibility poset: X = {1, 2, 3, 5, 6, 10, 15, 30}, x ≤ y if x divides y.
    • Draw the diagram showing the covering relations (direct divisibility without intermediate divisors).
  • Subset poset: X is a collection of sets, P is the "is a subset of" relation.
    • Draw the diagram with each set as a node; draw an edge upward from A to B if A ⊂ B and no intermediate set exists.

🔗 Linear extensions

A linear extension of a poset P = (X, P) is a total order L on X such that if x ≤ y in P, then x ≤ y in L.

  • Task: for given posets, find a linear extension (a way to order all elements consistently with the poset).
  • Advanced: count the number of distinct linear extensions (without listing them all).
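For small posets, linear extensions can be enumerated by brute force. A minimal sketch (the three-element "V" poset used as input is hypothetical, chosen for illustration):

```python
from itertools import permutations

def linear_extensions(elements, less_than):
    """All total orders of `elements` that respect the strict order
    `less_than` (brute force: only feasible for small posets)."""
    exts = []
    for perm in permutations(elements):
        pos = {x: i for i, x in enumerate(perm)}
        if all(pos[x] < pos[y]
               for x in elements for y in elements if less_than(x, y)):
            exts.append(perm)
    return exts

# Hypothetical "V" poset: a < b and a < c, with b and c incomparable.
relation = {('a', 'b'), ('a', 'c')}
exts = linear_extensions(['a', 'b', 'c'], lambda x, y: (x, y) in relation)
# Exactly two orders keep a first: (a, b, c) and (a, c, b).
```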

🔄 Duality and isomorphism

  • Dual poset: P^d has the same elements but reversed order (x ≤ y in P^d ⟺ y ≤ x in P).
  • Task: if poset P has height h and width w, determine the height and width of P^d.
  • Key insight: reversing the order turns every chain of P into a chain of P^d and every antichain into an antichain, so height and width are unchanged under duality (height of P^d = h, width of P^d = w).
  • Example: if P has height 5 and width 3, then P^d also has height 5 and width 3—no need to redraw the diagram.

🔍 Chains, antichains, and decompositions

🔍 Maximal and minimal elements

  • Maximal element: no other element is strictly greater.
  • Minimal element: no other element is strictly less.
  • Task: list all maximal and minimal elements in a given poset diagram.
  • Don't confuse: a poset can have multiple maximal/minimal elements; they are not necessarily unique.

⛓️ Chains and maximal chains

A chain is a subset of elements that are pairwise comparable (totally ordered).

  • Maximal chain: cannot be extended by adding any other element.
  • Task: find a maximal chain with a specified number of points, or find a non-maximal chain and explain why it is not maximal.
  • Example: a chain {a, b} with a < b is not maximal if some third element c is comparable to both a and b (e.g., a < c < b, c < a, or b < c), since c could then be added to the chain.

🔗 Antichains and width

An antichain is a subset of elements that are pairwise incomparable.

  • Width of a poset: the size of a maximum antichain.
  • Task: find the width w, exhibit an antichain of size w, and partition the poset into w chains.
  • Algorithm: use the method from the chapter (e.g., greedy or Dilworth's theorem).

📏 Height and chain decomposition

  • Height of a poset: the size of a maximum chain.
  • Task: find the height h, exhibit a maximum chain, and partition the poset into h antichains.
  • Example: if the longest chain has 4 elements, height = 4; partition all elements into 4 levels (antichains) by "rank."
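For hand-checkable cases, both height and width can be found by brute force over subsets. A sketch under the definitions above (the sample poset is hypothetical):

```python
from itertools import combinations

def height_and_width(elements, less_than):
    """Longest chain size and largest antichain size, by checking
    every subset (exponential time: small posets only)."""
    def comparable(x, y):
        return less_than(x, y) or less_than(y, x)
    height = width = 0
    for r in range(1, len(elements) + 1):
        for sub in combinations(elements, r):
            pairs = list(combinations(sub, 2))
            if all(comparable(x, y) for x, y in pairs):
                height = max(height, r)        # sub is a chain
            if not any(comparable(x, y) for x, y in pairs):
                width = max(width, r)          # sub is an antichain
    return height, width

# Hypothetical poset: a < b < c, with d incomparable to everything.
relation = {('a', 'b'), ('b', 'c'), ('a', 'c')}
lt = lambda x, y: (x, y) in relation
h, w = height_and_width('abcd', lt)            # height 3, width 2
# The dual reverses every relation; chains stay chains and antichains
# stay antichains, so height and width come out the same.
hd, wd = height_and_width('abcd', lambda x, y: lt(y, x))
```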

🕰️ Interval orders and representations

🕰️ What is an interval order?

An interval order is a poset that can be represented by assigning each element an interval on the real line, where x < y in the poset ⟺ the interval for x lies entirely to the left of the interval for y.

  • Task: given a poset diagram, either find an interval representation or prove none exists.
  • Key test: interval orders cannot contain certain forbidden subposets (e.g., 2+2, a four-element poset with two incomparable pairs that "interleave").

🖼️ Drawing interval diagrams

  • Given: an interval representation (a figure showing intervals for each element).
  • Task: draw the Hasse diagram of the corresponding interval order.
  • Method: x < y in the poset ⟺ right endpoint of x's interval < left endpoint of y's interval.

🔍 Finding interval representations

  • Task: for a given poset, construct intervals or explain why it is not an interval order.
  • Strategy: check for forbidden subposets; if none, assign intervals by topological ordering and endpoint constraints.
  • Example: if the poset contains a 2+2 (two disjoint incomparable pairs {a, b} and {c, d} with a < c, b < d, a incomparable to d, b incomparable to c), it is not an interval order.

🧮 Width via First Fit algorithm

  • Task: use the First Fit algorithm (ordering intervals by left endpoints) to find the width w and a partition into w chains, plus an antichain of size w.
  • Method: process intervals left to right; assign each to the first available chain; the number of chains needed equals the maximum number of overlapping intervals at any point.
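The "maximum number of overlapping intervals" count can be computed with a standard endpoint sweep. A sketch with hypothetical intervals; open events sort before close events at equal coordinates, because in this representation intervals that merely touch are incomparable:

```python
def interval_width(intervals):
    """Maximum number of pairwise-overlapping intervals, which equals
    the width of the corresponding interval order."""
    events = []
    for left, right in intervals:
        events.append((left, 0))    # 0 = open
        events.append((right, 1))   # 1 = close
    events.sort()                   # opens before closes at a tie
    best = active = 0
    for _, kind in events:
        if kind == 0:
            active += 1
            best = max(best, active)
        else:
            active -= 1
    return best

# Three intervals that all contain the point 2, plus one far right:
w = interval_width([(0, 3), (1, 4), (2, 5), (6, 7)])  # width 3
```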

🔢 Inclusion-exclusion: surjections and derangements

🔢 Counting surjections

A surjection from [n] to [m] is a function where every element of [m] is the image of at least one element of [n].

  • Formula: S(n, m) = Σ (−1)^k C(m, k) (m − k)^n, summing over k = 0 to m.
  • Task: compute S(n, m) for given n and m.
  • Example: S(15, 4) counts the ways to distribute 15 distinct lottery tickets to 4 grandchildren so each gets at least one ticket.
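The alternating-sum formula translates directly into code. A brief sketch:

```python
from math import comb

def surjections(n, m):
    """Number of onto functions from an n-set to an m-set,
    by inclusion-exclusion over the missed elements of [m]."""
    return sum((-1) ** k * comb(m, k) * (m - k) ** n for k in range(m + 1))

# Sanity check: of the 2^3 functions from [3] to [2], only the two
# constant maps fail to be onto, leaving 8 - 2 = 6 surjections.
small = surjections(3, 2)
# The lottery-ticket distribution from the exercises:
tickets = surjections(15, 4)
```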

🎩 Derangements

A derangement is a permutation σ of [n] where σ(i) ≠ i for all i (no fixed points).

  • Formula: d_n = Σ (−1)^k C(n, k) (n − k)!, summing over k = 0 to n.
  • Asymptotic result: d_n / n! → 1/e as n → ∞.
  • Example: the Hat Check problem—if 100 men check hats and receive them back at random, the probability that no man gets his own hat is approximately 1/e ≈ 0.368.
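Both the formula and the 1/e limit are easy to check numerically. A short sketch:

```python
from math import comb, factorial

def derangements(n):
    """Permutations of [n] with no fixed point, by inclusion-exclusion
    over the sets of forced fixed points."""
    return sum((-1) ** k * comb(n, k) * factorial(n - k)
               for k in range(n + 1))

d = [derangements(n) for n in range(6)]   # 1, 0, 1, 2, 9, 44
# d_n / n! converges rapidly to 1/e = 0.367879...
ratio = derangements(10) / factorial(10)
```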

🔢 Partial derangements

  • Task: count permutations where exactly r elements are fixed (in their original positions).
  • Method: choose r positions to fix (C(n, r) ways), then derange the remaining n − r elements (d_(n−r) ways).
  • Example: distributing 100 hats so exactly 40 men get their own hats: C(100, 60) × d_60.
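The choose-then-derange method above can be verified with a consistency check: summing over all possible numbers of fixed points must account for every permutation. A sketch:

```python
from math import comb, factorial

def derangements(n):
    return sum((-1) ** k * comb(n, k) * factorial(n - k)
               for k in range(n + 1))

def exactly_r_fixed(n, r):
    """Permutations of [n] with exactly r fixed points: choose which
    r positions stay put, then derange the remaining n - r."""
    return comb(n, r) * derangements(n - r)

# Every permutation of [5] has some number of fixed points, so the
# counts over r = 0..5 must sum to 5! = 120.
total = sum(exactly_r_fixed(5, r) for r in range(6))
```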

🔢 Euler φ-function and number theory

🔢 Definition and computation

φ(n) = the number of positive integers m ≤ n such that gcd(m, n) = 1 (m is relatively prime to n).

  • Brute-force method: iterate m from 1 to n, check gcd(m, n) = 1, count.
  • Problem: inefficient for large n (e.g., n = 321974 is manageable, but n = 1369122257328767073 is not).

🧮 Inclusion-exclusion formula

  • Theorem: if n has distinct prime factors p₁, p₂, …, p_m, then φ(n) = n × ∏ (p_i − 1)/p_i.
  • Method: apply inclusion-exclusion to exclude multiples of each prime factor.
  • Example: if n = 12 = 2² × 3, then φ(12) = 12 × (1 − 1/2) × (1 − 1/3) = 12 × 1/2 × 2/3 = 4.
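The worked example for n = 12 can be checked both ways; the division-before-multiplication order below keeps everything in exact integer arithmetic. A sketch:

```python
from math import gcd

def phi_bruteforce(n):
    """Count m in 1..n with gcd(m, n) = 1 (slow for large n)."""
    return sum(1 for m in range(1, n + 1) if gcd(m, n) == 1)

def phi_from_primes(n, primes):
    """phi(n) = n * prod((p - 1) / p) over the distinct prime
    factors, evaluated in exact integer arithmetic."""
    result = n
    for p in primes:
        result = result // p * (p - 1)
    return result

# phi(12) = 12 * (1/2) * (2/3) = 4, matching the brute-force count.
v_formula = phi_from_primes(12, [2, 3])
v_brute = phi_bruteforce(12)
```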

🔑 Factorization and efficiency

  • Task: given n and its prime factorization, compute φ(n) efficiently.
  • Example: n = 1369122257328767073 = 3³ × 11 × 19⁴ × 31² × 6067², so φ(n) = n × (2/3) × (10/11) × (18/19) × (30/31) × (6066/6067).
  • Why it matters: public-key cryptography relies on the difficulty of factoring large integers; knowing the factorization makes computing φ(n) easy.
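With the factorization in hand, φ(n) is a handful of integer multiplications, using φ(pᵃ) = pᵃ⁻¹(p − 1) for each prime power. A sketch reproducing the excerpt's numbers:

```python
def phi_from_factorization(factors):
    """phi(n) from a prime factorization given as {prime: exponent},
    using phi(p^a) = p^(a-1) * (p - 1) and multiplicativity."""
    phi = 1
    for p, a in factors.items():
        phi *= p ** (a - 1) * (p - 1)
    return phi

# The factorization quoted in the excerpt:
factors = {3: 3, 11: 1, 19: 4, 31: 2, 6067: 2}
n = 1
for p, a in factors.items():
    n *= p ** a
# n reassembles to 1369122257328767073, and phi(n) comes out to
# 760615484618973600, the value SageMath reports.
value = phi_from_factorization(factors)
```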

🔐 Cryptographic relevance

  • Scenario: if n = p₁ × p₂ (product of two large primes) and you know p₁ and p₂, computing φ(n) = (p₁ − 1)(p₂ − 1) is trivial.
  • Without the factorization, computing φ(n) for large n is computationally hard.
  • Don't confuse: knowing n is a product of two primes (without knowing which primes) does not make the problem easy.

🧮 Dimension of posets

🧮 Definition

The dimension of a poset P, denoted dim(P), is the smallest number t such that P is the intersection of t linear orders on X.

  • Task: find the dimension of given posets.
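The definition can be made concrete by computing the intersection of linear orders directly. A sketch (the two sample orders are hypothetical):

```python
def intersection_order(linear_orders):
    """Strict partial order obtained by intersecting linear orders,
    each given as a list from least to greatest."""
    relation = None
    for order in linear_orders:
        pos = {x: i for i, x in enumerate(order)}
        pairs = {(x, y) for x in order for y in order if pos[x] < pos[y]}
        relation = pairs if relation is None else relation & pairs
    return relation

# x < y survives only if it holds in BOTH orders, so b and c become
# incomparable while a stays below both:
rel = intersection_order([['a', 'b', 'c'], ['a', 'c', 'b']])
# rel is {('a', 'b'), ('a', 'c')} -- the "V" poset has dimension 2.
```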

🔄 Properties of dimension

  • Duality: dim(P) = dim(P^d) (dimension is the same for a poset and its dual).
  • Subposets: if P is a subposet of Q, then dim(P) ≤ dim(Q).
  • Removal: removing a point can reduce dimension by at most 1.

📊 Bounds on dimension

  • Dilworth's theorem application: dim(P) ≤ width(P).
  • Example: for every n ≥ 2, there exists a poset P_n on 2n points with both width and dimension equal to n.
  • Task: use the example from the chapter to verify this construction.

7.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The Inclusion-Exclusion Principle provides a systematic formula for computing Euler's totient function φ(n), which counts integers relatively prime to n, and this computation becomes dramatically easier when the prime factorization of n is known—a fact central to modern cryptography.

📌 Key points (3–5)

  • What φ(n) measures: the count of positive integers ≤ n that are relatively prime to n (share no common prime factors).
  • How Inclusion-Exclusion computes φ(n): start with n, subtract multiples of each prime factor, add back over-subtracted intersections, subtract triple-overlaps, etc.
  • Why factorization matters: knowing the prime factors p₁, p₂, … makes computing φ(n) trivial; without them, the computation is computationally infeasible for large n.
  • Common confusion: φ(n) is not "how many primes divide n"—it counts how many numbers up to n are not divisible by any of n's prime factors.
  • Real-world significance: the difficulty of factoring large integers (and thus computing φ(n)) underpins public-key cryptography security.

🧮 The Inclusion-Exclusion formula for φ(n)

🧮 General formula for three prime factors

When n has prime factors p₁, p₂, p₃, the Inclusion-Exclusion Principle yields:
φ(n) = n − (n/p₁ + n/p₂ + n/p₃) + (n/(p₁p₂) + n/(p₁p₃) + n/(p₂p₃)) − n/(p₁p₂p₃)

  • Start with all n integers: n candidates.
  • Subtract multiples of each prime: remove n/p₁, n/p₂, n/p₃ (numbers divisible by each prime).
  • Add back double-counted intersections: restore n/(p₁p₂), n/(p₁p₃), n/(p₂p₃) (numbers divisible by pairs of primes were subtracted twice).
  • Subtract triple-overlap: remove n/(p₁p₂p₃) (numbers divisible by all three primes were added back once too many).
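The four steps above can be traced on a small case; the choice of n = 30 = 2 · 3 · 5 is mine, picked so each term is easy to check by hand:

```python
def phi_three_primes(n, p1, p2, p3):
    """phi(n) by the explicit three-prime inclusion-exclusion sum."""
    return (n
            - (n // p1 + n // p2 + n // p3)            # multiples of each prime
            + (n // (p1 * p2) + n // (p1 * p3)
               + n // (p2 * p3))                        # add back pair overlaps
            - n // (p1 * p2 * p3))                      # remove triple overlap

# 30 - (15 + 10 + 6) + (5 + 3 + 2) - 1 = 8, and indeed the eight
# integers 1, 7, 11, 13, 17, 19, 23, 29 are coprime to 30.
value = phi_three_primes(30, 2, 3, 5)
```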

🔢 Factored form

The formula can be rewritten as:

φ(n) = n · (1 − 1/p₁) · (1 − 1/p₂) · (1 − 1/p₃)

  • This multiplicative form makes computation straightforward once the primes are known.
  • Each factor (1 − 1/pᵢ) represents the fraction of numbers not divisible by pᵢ.

🔍 Worked examples

🔍 Example with a large integer (Example 7.16)

Given: n = 1369122257328767073 = 3³ · 11 · 19⁴ · 31² · 6067²

Computation:

  • The prime factors are p₁ = 3, p₂ = 11, p₃ = 19, p₄ = 31, p₅ = 6067.
  • Apply the multiplicative formula: φ(n) = n · (1 − 1/3) · (1 − 1/11) · (1 − 1/19) · (1 − 1/31) · (1 − 1/6067)
  • Simplify: φ(n) = n · (2/3) · (10/11) · (18/19) · (30/31) · (6066/6067)
  • Result: φ(1369122257328767073) = 760615484618973600.
  • Key point: SageMath (a computer algebra system) reports this "quickly" because the factorization is known.

🔐 The cryptographic challenge (Example 7.17)

Setup: Amanda and Bruce must compute φ(n) for a 150-digit integer n.

| Person | Information given | Task difficulty |
| --- | --- | --- |
| Amanda | n = p₁ · p₂ (two primes given explicitly) | Easy: φ(n) = (p₁ − 1)(p₂ − 1) |
| Bruce | n only (no factorization) | Computationally infeasible |
  • Why Amanda's job is easier: for n = p₁ · p₂, the formula simplifies to φ(n) = n · (1 − 1/p₁) · (1 − 1/p₂) = (p₁ − 1)(p₂ − 1).
  • Why Bruce's job is hard: without the factorization, he must either factor n (extremely slow for large n) or compute φ(n) by testing all integers up to n (also infeasible).
  • Leveling the playing field?: telling Bruce that n is the product of two primes does not help—he still cannot compute φ(n) without knowing the actual primes.
  • Don't confuse: knowing "n has two prime factors" ≠ knowing the factors themselves; the latter is what makes computation easy.

🔐 Cryptographic significance

🔐 Why large integer factorization matters

  • Core insight from the discussion (Section 7.6): "Large integers, and specifically integers which are the product of large primes, are central to public key cryptography."
  • If someone could quickly factor 150-digit integers, they could "unravel many important secrets"—implying that encrypted data would be compromised.
  • The security of many cryptographic systems relies on the assumption that factoring large n (and thus computing φ(n)) is computationally hard.

⚠️ The asymmetry of difficulty

  • With factorization: computing φ(n) is trivial (a few multiplications).
  • Without factorization: computing φ(n) is believed to be as hard as factoring n itself.
  • This asymmetry is the foundation of public-key cryptography: the public key involves n (easy to use), but breaking the system requires factoring n (hard without the private key, which is the factorization).

🎯 Real-world relevance

  • Zori's skepticism ("This won't help me to earn a living") is challenged by Xing's firm reply.
  • The excerpt emphasizes that skill in large integer arithmetic and factorization has real-world value—and danger.
  • Example scenario: An organization that can factor large integers could decrypt secure communications, access financial systems, or compromise national security infrastructure.

📚 Context and scope

📚 Chapter goals

  • Bob notes: "the professor indicated that the goal was just to provide some key examples."
  • The chapter is intentionally brief, focusing on concrete applications of Inclusion-Exclusion to φ(n) rather than exhaustive theory.
  • Hint at "more general notions of inversion" suggests broader mathematical frameworks exist but are not covered here.

📚 What φ(n) is not

  • Not: the number of prime factors of n.
  • Not: the number of divisors of n.
  • Is: the count of integers from 1 to n that share no prime factors with n (i.e., gcd(k, n) = 1 for each such k).
  • Example: for n = 12 = 2² · 3, the numbers 1, 5, 7, 11 are relatively prime to 12, so φ(12) = 4.

7.2 The Inclusion-Exclusion Formula

🧭 Overview

🧠 One-sentence thesis

The Inclusion-Exclusion Formula provides a systematic way to compute Euler's totient function phi(n) by subtracting and adding contributions from all prime factors, and this computation is dramatically easier when the prime factorization of n is known.

📌 Key points (3–5)

  • What the formula computes: phi(n), the count of integers up to n that are coprime to n, using the prime factorization of n.
  • How the formula works: start with n, subtract multiples of each prime, add back over-subtracted intersections, subtract triple-counted terms, etc.
  • Why factorization matters: knowing the prime factors makes computing phi(n) fast; without them, the problem becomes extremely difficult even for large computers.
  • Common confusion: the formula looks complex, but it systematically corrects for over-counting and under-counting at each step.
  • Real-world significance: difficulty of factoring large integers (products of large primes) is central to public key cryptography and security.

🧮 The Inclusion-Exclusion structure

🧮 The formula for three primes

The excerpt shows the formula for n with three prime factors p₁, p₂, p₃:

phi(n) = n · (1 - 1/p₁) · (1 - 1/p₂) · (1 - 1/p₃)

This can also be written as:

  • Start with n
  • Subtract n/p₁, n/p₂, n/p₃ (remove multiples of each prime)
  • Add back n/(p₁·p₂), n/(p₁·p₃), n/(p₂·p₃) (restore double-subtracted intersections)
  • Subtract n/(p₁·p₂·p₃) (remove triple-counted term)

🔄 Why the alternating signs

  • First subtraction: removes all numbers divisible by each prime.
  • Addition step: numbers divisible by two primes were subtracted twice, so add them back once.
  • Second subtraction: numbers divisible by all three primes were added back too many times, so subtract once more.
  • The pattern alternates: subtract, add, subtract, add, etc., to correct for over- and under-counting at each level.

💡 Worked examples

💡 Example with known factorization

The excerpt gives n = 1369122257328767073 with factorization:

  • n = 3³ · 11 · 19⁴ · 31² · 6067²

Using the formula:

  • phi(n) = n · (1 - 1/3) · (1 - 1/11) · (1 - 1/19) · (1 - 1/31) · (1 - 1/6067)
  • Equivalently: n · (2/3) · (10/11) · (18/19) · (30/31) · (6066/6067)
  • SageMath quickly computes phi(1369122257328767073) = 760615484618973600

Why this is fast: once you know the prime factors, the formula is straightforward multiplication and division.

💡 Example with two large primes (Amanda vs Bruce)

The excerpt presents a challenge problem:

  • n is a 150-digit integer
  • Amanda is told n = p₁ · p₂ where both primes are given explicitly
  • Bruce is only told n is the product of two primes, but not which ones

Amanda's advantage:

  • She can immediately apply phi(n) = n · (1 - 1/p₁) · (1 - 1/p₂)
  • The computation is trivial once the factors are known

Bruce's difficulty:

  • He must first factor n to find p₁ and p₂
  • Factoring large integers is computationally extremely hard
  • Even knowing n is a product of two primes doesn't help much without knowing which primes

Don't confuse: the formula itself is simple; the hard part is finding the prime factorization.

🔐 Cryptographic significance

🔐 Why large integer factorization matters

The excerpt includes a discussion emphasizing:

  • Large integers that are products of large primes are central to public key cryptography
  • If someone could quickly factor 150-digit integers, they could "unravel many important secrets"
  • The difficulty of factorization protects cryptographic systems

🔐 The asymmetry

| Situation | Difficulty | Implication |
| --- | --- | --- |
| Computing phi(n) with known factors | Easy (fast multiplication) | Amanda's task is trivial |
| Factoring n to find primes | Extremely hard for large n | Bruce's task is nearly impossible |
| Using this for security | Hard to break, easy to use | Foundation of public key cryptography |

Key insight: the same mathematical operation (computing phi) is easy in one direction (with factors) and hard in the other (without factors). This asymmetry is what makes the mathematics useful for security.

🔐 Real-world context

The excerpt notes:

  • Citizens highly skilled in large integer arithmetic who could quickly factor large integers would be able to break important secrets
  • Such ability would put one's life in danger
  • This is not theoretical—it's the basis of real cryptographic systems protecting sensitive information

Example: An organization uses a 150-digit number as part of its encryption key. If an attacker knew the two prime factors, they could decrypt messages easily. Without those factors, breaking the encryption requires solving an extremely difficult factoring problem that even powerful computers cannot do quickly.


7.3 Enumerating Surjections

🧭 Overview

🧠 One-sentence thesis

Euler's totient function, computed via inclusion-exclusion, counts integers coprime to a given number and plays a central role in large-integer arithmetic and public-key cryptography.

📌 Key points (3–5)

  • What the totient function counts: φ(n) measures how many positive integers up to n share no common factors with n (are coprime to n).
  • How to compute it: use the inclusion-exclusion principle on the prime factors of n to subtract out multiples and add back over-counted intersections.
  • Efficient formula: when n is factored into primes p₁, p₂, p₃, …, φ(n) equals n multiplied by (1 − 1/p₁)(1 − 1/p₂)(1 − 1/p₃)… for each distinct prime factor.
  • Common confusion: knowing the prime factorization makes computing φ(n) easy; without it, the problem is computationally hard even for very large n.
  • Why it matters: the difficulty of factoring large integers underpins public-key cryptography—factoring quickly would break many encryption schemes.

🔢 Euler's totient function and inclusion-exclusion

🔢 What φ(n) measures

Euler's totient function φ(n): the count of positive integers less than or equal to n that are coprime to n (share no prime factors with n).

  • It is not simply "how many numbers are less than n," but "how many of those numbers have no common divisor with n other than 1."
  • The excerpt applies inclusion-exclusion to derive φ(n) by subtracting multiples of each prime factor, adding back over-subtracted intersections, and so on.

🧮 The inclusion-exclusion formula

The excerpt shows the principle applied to three primes p₁, p₂, p₃:

  • Start with n.
  • Subtract multiples of each prime: n/p₁, n/p₂, n/p₃.
  • Add back pairwise intersections: n/(p₁p₂), n/(p₁p₃), n/(p₂p₃).
  • Subtract the triple intersection: n/(p₁p₂p₃).

This telescopes into the product formula:

φ(n) = n · (1 − 1/p₁) · (1 − 1/p₂) · (1 − 1/p₃) · …

for all distinct prime factors of n.

📊 Example with concrete numbers

Example 7.16 in the excerpt:

  • n = 1369122257328767073
  • Prime factorization: 3³ · 11 · 19⁴ · 31² · 6067²
  • Apply the formula by multiplying n by (1 − 1/3), (1 − 1/11), (1 − 1/19), (1 − 1/31), (1 − 1/6067).
  • SageMath computes φ(1369122257328767073) = 760615484618973600 quickly.

Don't confuse: the speed comes from knowing the prime factorization; without it, computing φ(n) for such a large n is extremely hard.

🔐 Cryptographic significance

🔐 The two-prime case (Example 7.17)

The excerpt presents a challenge:

  • Amanda and Bruce must find φ(n) for a very large n (over 150 digits).
  • Amanda is told that n = p₁ · p₂, the product of two large primes, and is given p₁ and p₂.
  • Bruce is told only that n is the product of two primes, but not which ones.

Why Amanda's job is easier:

  • With p₁ and p₂ known, φ(n) = n · (1 − 1/p₁) · (1 − 1/p₂) = (p₁ − 1)(p₂ − 1), which is straightforward arithmetic.
  • Without the factorization, Bruce faces the computationally hard problem of factoring n—even knowing it has exactly two prime factors does not make factoring feasible for such large numbers.

🛡️ Public-key cryptography and factoring

The discussion (Section 7.6) emphasizes:

  • Large integers that are products of large primes are central to public-key cryptography.
  • If someone could quickly factor 150-digit integers, they could break many encryption schemes and "unravel important secrets."
  • Xing warns that such skill would be both valuable and dangerous, underscoring the real-world stakes of integer factorization.

Common confusion: it's not about the size of n alone, but about the difficulty of finding its prime factors—knowing the factorization trivializes φ(n), but discovering it is the hard part.

🧩 Practical computation and relevance

🧩 Why factorization is the bottleneck

| Scenario | Information given | Difficulty of computing φ(n) |
| --- | --- | --- |
| Factorization known | Prime factors p₁, p₂, … | Easy: apply the product formula |
| Factorization unknown | Only n | Hard: must factor n first, which is computationally infeasible for large n |
  • The excerpt shows that SageMath can compute φ(n) "quickly" when the factorization is provided.
  • Without the factorization, even knowing n is a product of two primes does not help Bruce significantly.

🎓 Relevance to applied combinatorics

The discussion section addresses skepticism about large-integer arithmetic:

  • Zori initially doubts the practical value of "big integer stuff."
  • Xing firmly counters that large-integer arithmetic and factoring are foundational to cryptography and security.
  • The group realizes that the material has direct, high-stakes applications in protecting information.

Don't confuse: this is not abstract number theory for its own sake—it is the mathematical foundation of modern encryption and data security.


7.4 Derangements

🧭 Overview

🧠 One-sentence thesis

Euler's totient function φ(n) counts integers coprime to n and can be computed efficiently using inclusion-exclusion when the prime factorization is known, which is central to public key cryptography.

📌 Key points (3–5)

  • What φ(n) measures: the count of positive integers less than or equal to n that are coprime to n (share no common prime factors).
  • How to compute φ(n): using the inclusion-exclusion principle applied to the prime factors of n.
  • Formula for φ(n): φ(n) = n × (1 - 1/p₁) × (1 - 1/p₂) × ... × (1 - 1/pₖ) where p₁, p₂, ..., pₖ are the distinct prime factors of n.
  • Common confusion: knowing n versus knowing its prime factorization—factorization makes computing φ(n) easy, but factoring large n is computationally hard.
  • Why it matters: the difficulty of factoring large integers (products of large primes) underpins public key cryptography and security systems.

🔢 The totient function and its formula

🔢 What φ(n) counts

Euler's totient function φ(n): the number of positive integers less than or equal to n that are coprime to n.

  • Two numbers are coprime when they share no common prime factors (their greatest common divisor is 1).
  • The excerpt shows φ(n) is computed by excluding numbers that share prime factors with n.
  • Example: if n has prime factors p₁, p₂, p₃, we exclude multiples of these primes using inclusion-exclusion.

📐 The inclusion-exclusion formula

The excerpt derives φ(n) using the Principle of Inclusion-Exclusion:

  • Start with n total integers.
  • Subtract multiples of each prime: n/p₁, n/p₂, n/p₃.
  • Add back double-counted intersections: n/(p₁p₂), n/(p₁p₃), n/(p₂p₃).
  • Subtract triple-counted: n/(p₁p₂p₃).
  • This simplifies to: φ(n) = n × (1 - 1/p₁) × (1 - 1/p₂) × (1 - 1/p₃) × ...

The formula shows that φ(n) depends on multiplying n by a fraction for each distinct prime factor.
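As a sanity check (a sketch of my own, not from the excerpt), the term-by-term alternating sum, the collapsed product form, and the raw definition can all be compared on a small three-prime example, say n = 360 = 2³ · 3² · 5:

```python
from math import gcd

n = 360                 # 2^3 * 3^2 * 5, distinct primes 2, 3, 5
p1, p2, p3 = 2, 3, 5

# Inclusion-exclusion, term by term, following the derivation above.
by_inclusion_exclusion = (
    n
    - (n // p1 + n // p2 + n // p3)
    + (n // (p1 * p2) + n // (p1 * p3) + n // (p2 * p3))
    - n // (p1 * p2 * p3)
)

# Collapsed product form n * (1 - 1/p1)(1 - 1/p2)(1 - 1/p3), kept in integers.
by_product = n * (p1 - 1) * (p2 - 1) * (p3 - 1) // (p1 * p2 * p3)

# Direct count from the definition: integers in [n] coprime to n.
by_definition = sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

assert by_inclusion_exclusion == by_product == by_definition == 96
```

All three agree because the product form is just the fully expanded inclusion-exclusion sum, regrouped.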

🧮 Computing φ(n) with known factorization

Example from the excerpt: given n = 1369122257328767073 with factorization 3³ × 11 × 19⁴ × 31² × 6067²:

  • Apply the formula: φ(n) = n × (1 - 1/3) × (1 - 1/11) × (1 - 1/19) × (1 - 1/31) × (1 - 1/6067)
  • Expanded form: start from n, subtract n/3, n/11, n/19, n/31, n/6067, then add back and subtract the correction terms for products of primes, following the inclusion-exclusion alternation.
  • The result: φ(1369122257328767073) = 760615484618973600
  • Key point: SageMath computes this "quickly" because the factorization is known.

🔐 Cryptographic significance

🔐 The factorization problem

The excerpt presents a scenario (Example 7.17) with two students, Amanda and Bruce:

| Student | Information given | Task difficulty |
| --- | --- | --- |
| Amanda | n = p₁ × p₂ and both primes p₁, p₂ provided | Easy: can apply the φ formula directly |
| Bruce | Only n (a 150-digit number) | Hard: must factor n first |
  • The excerpt asks: "Is this information of any special value to Amanda? Does it really make her job any easier than Bruce's?"
  • Answer implied: Yes, dramatically easier. Knowing the prime factorization allows immediate computation of φ(n).
  • Even telling Bruce that n is a product of two primes does not help much—he still must find which two primes.

🔒 Why large primes matter for security

From the discussion section:

  • "Large integers, and specifically integers which are the product of large primes, are central to public key cryptography."
  • If someone "could quickly factor integers with, say 150 digits, then you would be able to unravel many important secrets."
  • The excerpt emphasizes that factoring large integers is computationally hard, and this difficulty protects cryptographic systems.
  • Don't confuse: multiplying two large primes is easy; factoring their product back into primes is hard. This asymmetry is the foundation of security.

⚠️ Real-world implications

The character Xing states:

  • Being "highly skilled in large integer arithmetic" and able to factor large numbers would allow breaking secrets.
  • "No doubt your life would be in danger"—suggesting that such ability would threaten powerful security systems.
  • The excerpt uses this to illustrate that the mathematics of φ(n) and prime factorization is not abstract but has direct real-world security applications.

🧩 Worked examples

🧩 Example with moderate-sized n

For n = 1369122257328767073:

  • Prime factorization: 3³ × 11 × 19⁴ × 31² × 6067²
  • Distinct primes: 3, 11, 19, 31, 6067
  • Apply formula: φ(n) = n × (1 - 1/3) × (1 - 1/11) × (1 - 1/19) × (1 - 1/31) × (1 - 1/6067)
  • Result: 760615484618973600
  • The excerpt notes this is computed "quickly" by software when factorization is known.

🧩 Example with very large n

For n (a 150-digit number) = p₁ × p₂:

  • p₁ = 470287785858076441566723507866751092927015824834881906763507
  • p₂ = 669483106578092405936560831017556154622901950048903016651289
  • With factorization known: φ(n) = (p₁ - 1) × (p₂ - 1) (since n is a product of two primes)
  • Without factorization: computing φ(n) requires first factoring n, which is computationally infeasible for numbers this large.
  • Key insight: the same mathematical problem has vastly different difficulty depending on what information is available.
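The two-prime case is easy to check directly; a minimal Python sketch of my own, using the primes quoted above (Python's arbitrary-precision integers handle this size natively):

```python
# The two primes given to Amanda in Example 7.17.
p1 = 470287785858076441566723507866751092927015824834881906763507
p2 = 669483106578092405936560831017556154622901950048903016651289

n = p1 * p2  # what Bruce receives

# With the factorization in hand, phi(n) is one multiplication:
phi_product = (p1 - 1) * (p2 - 1)

# Expanding (p1 - 1)(p2 - 1) shows this equals n - p1 - p2 + 1.
phi_expanded = n - p1 - p2 + 1
assert phi_product == phi_expanded
```

Amanda's arithmetic is instantaneous even at this size; Bruce's factoring step has no comparably fast counterpart.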

7.5 The Euler φ Function


🧭 Overview

🧠 One-sentence thesis

The Euler φ function counts integers relatively prime to n and can be computed efficiently using the prime factorization of n, which is central to public key cryptography.

📌 Key points (3–5)

  • What φ(n) computes: the count of positive integers less than or equal to n that are relatively prime to n (share no common prime factors).
  • How to calculate φ(n): use the Principle of Inclusion-Exclusion applied to the prime factorization of n.
  • The formula pattern: for n with prime factors p₁, p₂, p₃, φ(n) = n × (1 - 1/p₁) × (1 - 1/p₂) × (1 - 1/p₃).
  • Common confusion: knowing n versus knowing its prime factorization—factoring large n is computationally hard, but computing φ(n) is easy if you already know the prime factors.
  • Why it matters: large integers that are products of large primes underpin public key cryptography; factoring them quickly would compromise many secrets.

🔢 Computing φ using prime factorization

🔢 The Inclusion-Exclusion formula

The excerpt shows how the Principle of Inclusion-Exclusion yields the formula for φ(n) when n has prime factors p₁, p₂, p₃:

φ(n) = n - (n/p₁ + n/p₂ + n/p₃) + (n/(p₁p₂) + n/(p₁p₃) + n/(p₂p₃)) - n/(p₁p₂p₃)

  • Start with all n integers.
  • Subtract those divisible by each prime (single terms).
  • Add back those divisible by pairs of primes (to correct for over-subtraction).
  • Subtract those divisible by all three primes.

🧮 Simplified multiplicative form

The formula simplifies to:

φ(n) = n × (1 - 1/p₁) × (1 - 1/p₂) × (1 - 1/p₃)

  • This form is easier to compute once you know the prime factors.
  • Each factor (1 - 1/pᵢ) represents the fraction of integers not divisible by pᵢ.
  • Example: For n = 1369122257328767073 with prime factorization 3³ × 11 × 19⁴ × 31² × 6067², the excerpt shows φ(n) = n × (1 - 1/3) × (1 - 1/11) × (1 - 1/19) × (1 - 1/31) × (1 - 1/6067).

💻 Computational efficiency with known factorization

Example 7.16 demonstrates:

  • SageMath reports the prime factorization of 1369122257328767073.
  • Using the formula, SageMath quickly computes φ(1369122257328767073) = 760615484618973600.
  • Key point: computation is fast because the prime factorization is known.

🔐 The cryptographic significance

🔐 Amanda vs Bruce scenario (Example 7.17)

The excerpt presents a challenge:

  • Both Amanda and Bruce must find φ(n) for a very large n (a 150-digit integer).
  • Amanda is told that n = p₁ × p₂ (product of two large primes) and is given both primes.
  • Bruce is only told that n is the product of two primes, but not which ones.

Does Amanda's information help?

  • Yes, enormously. With the prime factors, Amanda can apply the formula directly: φ(n) = n × (1 - 1/p₁) × (1 - 1/p₂).
  • Bruce must first factor n to find p₁ and p₂, which is computationally very hard for large integers.
  • Even knowing that n is a product of two primes does not make Bruce's job significantly easier—factoring remains the bottleneck.

🛡️ Why large integer factorization matters

The discussion section (7.6) emphasizes the real-world stakes:

  • Zori dismisses "big integer stuff" as irrelevant to earning a living.
  • Xing firmly corrects her: large integers, especially products of large primes, are central to public key cryptography.
  • If someone could quickly factor 150-digit integers, they could "unravel many important secrets" and their "life would be in danger."
  • The group realizes Xing is absolutely certain—this is not theoretical; it has real security implications.

⚠️ Don't confuse: knowing n vs knowing its factors

| What you know | Computational difficulty | Implication |
| --- | --- | --- |
| n only (even if very large) | Factoring n is hard | Cannot easily compute φ(n) |
| n and its prime factorization | Easy to compute φ(n) | Formula applies directly |
| n is a product of two primes (but not which) | Still must factor n | Does not level the playing field |
  • The asymmetry between computing φ(n) with known factors and factoring n to find those factors is the foundation of cryptographic security.

📐 Worked example with large n

📐 Example 7.16 breakdown

Given:

  • n = 1369122257328767073
  • Prime factorization: 3³ × 11 × 19⁴ × 31² × 6067²

Steps:

  1. Identify the distinct prime factors: 3, 11, 19, 31, 6067.
  2. Apply the formula: φ(n) = n × (1 - 1/3) × (1 - 1/11) × (1 - 1/19) × (1 - 1/31) × (1 - 1/6067).
  3. Compute: φ(n) = 1369122257328767073 × (2/3) × (10/11) × (18/19) × (30/31) × (6066/6067).
  4. Result: φ(n) = 760615484618973600.

Why this is fast: the prime factorization is already known; the formula is a straightforward multiplication.
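Example 7.16 can also be reproduced without SageMath; a short Python check of my own, using the factorization quoted above:

```python
from math import prod

# Example 7.16's factorization as (prime, exponent) pairs.
factorization = [(3, 3), (11, 1), (19, 4), (31, 2), (6067, 2)]

n = prod(p ** a for p, a in factorization)
assert n == 1369122257328767073

# phi(n) = n * prod((p - 1)/p) over the distinct primes, in exact
# integer arithmetic (divide by p before multiplying by p - 1).
phi = n
for p, _ in factorization:
    phi = phi // p * (p - 1)

assert phi == 760615484618973600
```

Dividing by p before multiplying by p − 1 keeps every intermediate value an integer, since n is divisible by each of its primes.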

📐 Example 7.17 implications

  • n is approximately 150 digits long.
  • Amanda receives p₁ and p₂ (both primes, each around 60–70 digits).
  • Amanda's task: multiply (1 - 1/p₁) and (1 - 1/p₂) by n—straightforward arithmetic.
  • Bruce's task: factor n into p₁ and p₂—no known efficient algorithm for large n.
  • Conclusion: Amanda's information makes her job vastly easier; Bruce faces an intractable problem without the factorization.

🧠 Conceptual takeaways

🧠 The Euler φ function as a counting tool

  • φ(n) counts how many integers from 1 to n are relatively prime to n (i.e., share no prime factors with n).
  • It is not just a theoretical curiosity; it has practical applications in number theory and cryptography.

🧠 The power of prime factorization

  • Knowing the prime factorization of n transforms a hard problem (computing φ(n) from scratch) into an easy one (applying a formula).
  • The difficulty of factoring large integers is what makes certain cryptographic systems secure.

🧠 Real-world relevance

  • The discussion section underscores that this material is not abstract: public key cryptography relies on the computational hardness of factoring.
  • Zori's skepticism is addressed by Xing's firm reminder that skill in large integer arithmetic has real-world, high-stakes implications.

7.6 Discussion


🧭 Overview

🧠 One-sentence thesis

Large integer arithmetic, especially factoring integers that are products of large primes, is central to public key cryptography and has real-world security implications, not just theoretical interest.

📌 Key points (3–5)

  • The challenge posed: computing Euler's phi function φ(n) for very large integers can be easy or hard depending on what information you have.
  • The asymmetry: knowing the prime factorization of n makes computing φ(n) trivial, but without it the problem is extremely difficult.
  • Real-world relevance: skill in factoring large integers (e.g., 150-digit numbers) would allow breaking many cryptographic secrets.
  • Common confusion: students may dismiss large integer arithmetic as impractical, but it underpins modern security systems.
  • The stakes: cryptographic security depends on the difficulty of factoring; being able to factor quickly would be both valuable and dangerous.

🔐 The computational challenge

🔐 Amanda vs Bruce: information asymmetry

Example 7.17 presents two students with the same task: compute φ(n) for an extremely large integer n (over 150 digits).

Amanda's advantage:

  • She is told that n = p₁ · p₂ (product of two primes).
  • She is given both prime factors explicitly: p₁ and p₂ (each around 66 digits).
  • With the factorization, computing φ(n) is straightforward using the formula from earlier in the chapter.

Bruce's disadvantage:

  • He receives only the value of n itself.
  • Without the prime factorization, he must either factor n (computationally infeasible for numbers this large) or compute φ(n) by other means (also infeasible).

The leveling question:

  • Would telling Bruce that n is the product of two primes help?
  • The excerpt implies this information alone does not make the problem tractable—he still needs to find those two primes, which is the hard part.

🧮 Why factorization matters

The difficulty of computing φ(n) depends critically on whether you know the prime factorization of n.

  • With factorization: apply the formula directly (fast, even for huge numbers).
  • Without factorization: no known efficient method exists for very large n.
  • This asymmetry is not a mathematical curiosity—it is the foundation of cryptographic security.

🔒 Cryptographic implications

🔒 Public key cryptography

Xing's response emphasizes the real-world stakes:

  • Large integers that are products of large primes are "central to public key cryptography."
  • Modern encryption systems rely on the assumption that factoring such integers is computationally infeasible.
  • If someone could quickly factor 150-digit integers, they could "unravel many important secrets."

⚠️ The danger of skill

The excerpt notes a striking consequence:

  • High skill in large integer arithmetic and factoring would make you capable of breaking important cryptographic systems.
  • "No doubt your life would be in danger"—this is not hyperbole; the ability to break widely used encryption would have serious security and political implications.

Don't confuse:

  • Theoretical interest vs practical impact: the mathematics may seem abstract, but it directly protects real-world communications, financial transactions, and sensitive data.

💬 Student perspectives

💬 Initial reactions

The discussion captures a range of student attitudes:

| Student | Initial view | Reaction |
| --- | --- | --- |
| Yolanda | Chapter seemed short | Neutral observation |
| Bob | Hints at more general inversion concepts | Curious but uncertain |
| Zori | Frustrated; sees no practical value | "This won't help me earn a living" |
| Xing | Firm rebuttal | Insists the material is critically important |

💬 The turning point

Zori's skepticism about relevance is directly challenged:

  • Xing is "uncharacteristically firm" and "absolutely certain."
  • The group initially thinks Xing is "way out of bounds" but quickly realizes he is serious.
  • Zori becomes quiet and reflects that "maybe, just maybe, her skepticism over the relevance of the material in applied combinatorics was unjustified."

Key message:

  • Abstract mathematical topics (like Euler's phi function and large integer factorization) have direct, high-stakes applications.
  • Dismissing them as irrelevant overlooks their role in modern technology and security.

🎯 Broader context

🎯 The chapter's scope

Bob notes the chapter was intentionally brief:

  • The professor's goal was to "provide some key examples."
  • There are hints at "more general notions of inversion" beyond what was covered.
  • The discussion section serves to motivate the material by connecting it to real-world applications.

🎯 The pedagogical point

The excerpt uses the Amanda/Bruce example and the student dialogue to illustrate:

  • Computational complexity: some problems are easy in one direction (with extra information) but hard in the other.
  • Practical importance: the difficulty of certain mathematical problems is not a bug but a feature—it enables secure communication.
  • Motivation for study: understanding why these topics matter can change students' engagement with the material.

7.7 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set applies the inclusion-exclusion principle to count surjections, derangements, and arrangements with forbidden patterns, while also exploring Euler's totient function and recursive formulas for derangements.

📌 Key points (3–5)

  • Property P_i framework: exercises test whether functions, permutations, or integers satisfy specific indexed properties (e.g., f(i) = i, σ(i) = i, or i divides j).
  • Surjections and distribution problems: counting ways to map larger sets onto smaller sets or distribute distinct objects ensuring every recipient gets at least one.
  • Derangements: permutations where no element stays in its original position; can be counted via inclusion-exclusion or recursive formulas.
  • Common confusion: derangements (no element in its original position) vs. arrangements with no consecutive pairs preserved (different forbidden pattern).
  • Euler's totient function φ(n): counts integers up to n that are relatively prime to n; computed using inclusion-exclusion on prime divisors.

🔢 Property-based classification problems

🔢 Functions and property P_i

Property P_i for a function f: [8] → [7]: there is no j such that f(j) = i.

  • Exercise 11(a) asks which properties a given function satisfies by checking the table.
  • Example: if f maps some j to 4, then f does not satisfy property P_4.
  • Part (b) asks if a function [8] → [7] can satisfy no property P_i (i ≤ 7).
  • Part (c) extends to [8] → [9] and asks the same question.

🔄 Permutations and fixed points

Property P_i for a permutation σ: [n] → [n]: σ(i) = i (i is a fixed point).

  • Exercise 12(a) checks which properties a given permutation satisfies.
  • Example: if σ(2) = 1, then σ does not satisfy P_2 (because σ(2) ≠ 2).
  • Part (b) asks for a permutation satisfying exactly P_1, P_4, and P_8.
  • Part (c) asks for a permutation with no fixed points (a derangement).

🔢 Divisibility properties

Property P_i for j ∈ [n]: i is a divisor of j.

  • Exercise 13 sets m = n = 15 and asks which properties the integer 12 satisfies.
  • Example: 12 satisfies P_3 because 3 divides 12; does not satisfy P_5 because 5 does not divide 12.
  • Parts (b)–(d) ask for integers satisfying exactly two, four, or three properties, or explain why none exist.

🎯 Surjections and distribution problems

🎯 Counting surjections

  • Exercise 14: How many surjections from an 8-element set to a 6-element set?
  • A surjection ensures every element in the codomain is "hit" at least once.
  • Use inclusion-exclusion: total functions minus those missing at least one target element.

📚 Distribution with at least one each

  • Exercise 15: distribute 10 distinct books to 4 people (John, Paul, Ringo, George) so each gets at least one.
  • This is equivalent to counting surjections from a 10-element set onto a 4-element set.
  • Exercise 16: assign 9 tasks to 5 employees, each employee gets at least one task.
  • Exercise 17: assign 12 topics to 6 students, Katie must get the most challenging topic (and possibly others), each student gets at least one.
    • Fix Katie's assignment of the challenging topic first, then distribute the remaining 11 topics as surjections onto 6 students.

🔀 Derangements

🔀 What is a derangement

A derangement: a permutation σ of [n] where σ(i) ≠ i for all i (no fixed points).

  • Exercise 18: list all derangements of [4].
  • Example notation: write σ as a string σ(1)σ(2)σ(3)σ(4).
  • Exercise 19: count derangements of a 9-element set using the inclusion-exclusion formula.

⚽ Real-world derangement scenarios

  • Exercise 20: equipment manager hands uniforms to 6 players; count ways so no player gets his own uniform.
  • This is exactly the number of derangements of a 6-element set.
  • Exercise 21: payroll clerk places 7 paychecks into pre-labeled envelopes; count ways so exactly three employees receive the correct paycheck.
    • Choose which 3 employees get correct checks, then derange the remaining 4.

🔁 Recursive formulas for derangements

  • Exercise 22: two recursive formulas for d_n (number of derangements of [n]).
  • Initial values: d_1 = 0, d_2 = 1.

(a) First recursion: d_n = (n - 1)(d_{n-1} + d_{n-2}) for n ≥ 3.

  • Hint: for a derangement σ, consider k where σ(k) = 1; count choices for k and whether σ(1) = k or not.

(b) Second recursion: d_n = n · d_{n-1} + (-1)^n for n ≥ 2.

  • Hint: prove using the first recursion and mathematical induction.
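Both recursions can be checked against the inclusion-exclusion count; a small Python sketch of my own, which also answers Exercises 19 and 21 numerically:

```python
from math import comb, factorial

def derangements(n):
    """Inclusion-exclusion: d_n = sum_{k=0}^{n} (-1)^k * n! / k!."""
    return sum((-1) ** k * factorial(n) // factorial(k) for k in range(n + 1))

# Small values: d_1 = 0, d_2 = 1, d_3 = 2, d_4 = 9 (the list in Exercise 18).
assert [derangements(n) for n in range(1, 5)] == [0, 1, 2, 9]

# Exercise 19: derangements of a 9-element set.
assert derangements(9) == 133496

# The two recursions of Exercise 22.
for n in range(3, 13):
    assert derangements(n) == (n - 1) * (derangements(n - 1) + derangements(n - 2))
for n in range(2, 13):
    assert derangements(n) == n * derangements(n - 1) + (-1) ** n

# Exercise 21: exactly 3 of 7 paychecks correct: choose the lucky 3,
# then derange the remaining 4 envelopes.
assert comb(7, 3) * derangements(4) == 315
```

The division n!/k! is always exact, so the sum can stay in integer arithmetic throughout.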

🧮 Euler's totient function φ(n)

🧮 Definition and computation

φ(n): the number of integers in [n] that are relatively prime to n.

  • Exercise 23: compute φ(18) by listing the integers and by using the formula from Theorem 7.14.
  • Exercise 24: compute φ(756).
  • Exercise 25: given the prime factorization 1625190883965792 = 2^5 · 3^4 · 11^2 · 13 · 23^3 · 181^2, compute φ(1625190883965792).

📐 Formula from Theorem 7.14

  • The formula uses inclusion-exclusion on the prime divisors of n.
  • If n = p_1^{a_1} · p_2^{a_2} · ... · p_k^{a_k}, then φ(n) = n · (1 - 1/p_1) · (1 - 1/p_2) · ... · (1 - 1/p_k).
  • Exercise 26: prove Proposition 7.15 (not specified in the excerpt, but referenced).
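Exercises 23 and 24 can be verified both ways; a small Python check of mine (the values 6 and 216 follow from the formula):

```python
from math import gcd

def phi_by_listing(n):
    """Count k in [n] with gcd(k, n) = 1: the definition of phi."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def phi_by_formula(n, primes):
    """Theorem 7.14: phi(n) = n * prod(1 - 1/p) over the distinct
    primes dividing n, in exact integer arithmetic."""
    for p in primes:
        n = n // p * (p - 1)
    return n

# Exercise 23: 18 = 2 * 3^2; the coprime integers are 1, 5, 7, 11, 13, 17.
assert phi_by_listing(18) == phi_by_formula(18, [2, 3]) == 6

# Exercise 24: 756 = 2^2 * 3^3 * 7.
assert phi_by_listing(756) == phi_by_formula(756, [2, 3, 7]) == 216
```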

🚶 Line arrangements with forbidden patterns

🚶 No student follows the same classmate

  • Exercise 27: 9 students walk to lunch in order ABCDEFGHI; on the return trip, no student should walk immediately behind the same classmate.
  • Example: ACBDIHGFE is allowed; CEFGBADHI is not (because EF, FG, and HI are preserved).
  • Don't confuse: this is not a derangement (which forbids same positions); it forbids preserving consecutive pairs.

(a) Count the exact number of valid return orderings.

  • Use inclusion-exclusion on the 8 consecutive pairs (AB, BC, CD, DE, EF, FG, GH, HI).

(b) Compare this count to the number of derangements (no student in the same position).

(c) What fraction of all 9! possible line-ups meet the criterion?
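For part (a), the inclusion-exclusion count can be cross-checked by brute force over all 9! orderings; a sketch of my own (gluing k of the 8 pairs leaves 9 - k blocks to arrange, hence the (9 - k)! term):

```python
from itertools import permutations
from math import comb, factorial

ORIGINAL = "ABCDEFGHI"
# The 8 forbidden consecutive pairs AB, BC, ..., HI.
pairs = {(ORIGINAL[i], ORIGINAL[i + 1]) for i in range(8)}

def valid(order):
    """True if no original consecutive pair reappears consecutively."""
    return not any((order[i], order[i + 1]) in pairs for i in range(8))

assert valid("ACBDIHGFE")
assert not valid("CEFGBADHI")   # EF, FG, and HI all reappear

# Brute force over all 9! return orders.
brute = sum(1 for p in permutations(ORIGINAL) if valid(p))

# Inclusion-exclusion: preserving a chosen set of k pairs glues the line
# into 9 - k blocks, which can be ordered in (9 - k)! ways.
formula = sum((-1) ** k * comb(8, k) * factorial(9 - k) for k in range(9))

assert brute == formula
```

The key counting fact is that any set of k pairs taken from a single path of 9 students always merges the line into exactly 9 - k blocks.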

📖 Generating functions introduction

📖 Power series as enumerative tools

A generating function for a sequence {a_n : n ≥ 0}: F(x) = Σ (n=0 to ∞) a_n x^n.

  • Chapter 8 introduces generating functions as a way to encode sequences.
  • Power series can be manipulated like ordinary functions: added, subtracted, multiplied.
  • For combinatorial purposes, convergence is often not a concern.
  • Techniques from calculus (differentiation, integration term by term) can be applied when convenient.

8.1 Basic Notation and Terminology

🧭 Overview

🧠 One-sentence thesis

Generating functions encode sequences as formal power series that can be manipulated algebraically to solve counting problems, without necessarily worrying about convergence.

📌 Key points (3–5)

  • What a generating function is: a formal power series F(x) = sum of aₙxⁿ that encodes a sequence {aₙ}.
  • "Formal" means ignoring convergence: we manipulate these series like polynomials (add, multiply, differentiate, integrate) without caring whether they converge for specific x values.
  • How to extract information: the coefficient of xⁿ in F(x) tells you aₙ, the n-th term of the sequence.
  • Common confusion: generating functions are not about plugging in x and computing F(x); they are symbolic tools where x is a placeholder.
  • Why it matters: algebraic operations on generating functions correspond to combinatorial operations on sequences, making complex counting problems tractable.

🔢 Core definition and philosophy

🔢 What is a generating function?

Generating function: Given a sequence {aₙ : n ≥ 0} of real numbers, the generating function F(x) is defined by F(x) = sum from n=0 to infinity of aₙxⁿ.

  • The word "function" is in quotes because we do not necessarily substitute values for x.
  • We treat F(x) as a formal power series: a symbolic object that can be manipulated algebraically.
  • By convention, F(0) = a₀ (the constant term).

🎯 The formal power series perspective

  • Key idea: we frequently ignore issues of convergence.
  • Power series are manipulated "just like ordinary functions"—they can be added, subtracted, multiplied, differentiated, and integrated term by term.
  • Example: Even if a series converges only at x = 0 (radius of convergence 0), we can still talk about it as a generating function.
  • Don't confuse: this is different from calculus, where convergence is central. Here, the series is a formal encoding of the sequence, not a function to evaluate.

📐 Fundamental examples

📐 The infinite geometric series

Consider the constant sequence aₙ = 1 for all n ≥ 0. Its generating function is:

F(x) = 1 + x + x² + x³ + x⁴ + ...

Algebraic derivation (without calculus):

  • Multiply both sides by (1 - x):
    • (1 - x)(1 + x + x² + x³ + ...) = (1 + x + x² + ...) - x(1 + x + x² + ...) = 1
  • All terms cancel except 1.
  • Dividing by (1 - x) gives: 1/(1 - x) = sum from n=0 to infinity of xⁿ.

This is the Maclaurin series for 1/(1 - x), which converges when |x| < 1, but we use it formally without worrying about convergence.

📐 The finite geometric series

For a finite sum, the same method applies:

(1 - x)(1 + x + ... + xⁿ) = 1 - xⁿ⁺¹

Dividing by (1 - x):

1 + x + ... + xⁿ = (1 - xⁿ⁺¹)/(1 - x)

🔬 Differentiation and integration

Differentiation example:

  • Start with 1/(1 - x) = 1 + x + x² + x³ + ...
  • Differentiate term by term: 1/(1 - x)² = 1 + 2x + 3x² + 4x³ + ... = sum from n=1 to infinity of n·xⁿ⁻¹.

Integration example:

  • Start with 1/(1 + x) = 1/(1 - (-x)) = 1 - x + x² - x³ + ...
  • Integrate term by term: log(1 + x) = x - x²/2 + x³/3 - x⁴/4 + ... = sum from n=1 to infinity of (-1)ⁿ⁺¹·xⁿ/n.

Key point: These operations work for formal power series without concern about convergence tests from calculus.

🚫 A non-convergent example

Consider F(x) = sum from n=0 to infinity of n!·xⁿ.

  • This series has radius of convergence 0 (converges only at x = 0, where F(0) = 1).
  • Nevertheless, it makes sense as a formal power series.
  • It is the generating function for the sequence where a₀ = 1 and aₙ = the number of permutations of {1, 2, ..., n} for n ≥ 1.
  • This shows we can work with generating functions even when they don't converge.

🔗 Multiplying generating functions

🔗 Product formula

Proposition: Let A(x) = sum of aₙxⁿ and B(x) = sum of bₙxⁿ be generating functions. Then A(x)B(x) is the generating function of the sequence whose n-th term is:

a₀bₙ + a₁bₙ₋₁ + a₂bₙ₋₂ + ... + aₙb₀ = sum from k=0 to n of aₖbₙ₋ₖ

🔗 Why this matters

  • When you multiply two power series, the coefficient of xⁿ in the product is the sum of all products aₖbⱼ where k + j = n.
  • This corresponds to convolution of sequences.
  • Combinatorial interpretation: if A(x) counts one type of structure and B(x) counts another, their product often counts combined structures.

Example: If you want to count ways to distribute objects with constraints, multiplying generating functions for individual recipients gives the generating function for all recipients together.

Don't confuse: Multiplying generating functions is not the same as multiplying the sequences term by term; it's a convolution that mixes terms from different indices.
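The convolution can be made concrete in a few lines of Python (an illustrative sketch of mine, working with truncated coefficient lists):

```python
def convolve(a, b):
    """Coefficients of A(x) * B(x): c_n = sum_k a_k * b_{n-k}."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

# Squaring 1/(1 - x): the coefficient of x^n in 1/(1 - x)^2 is n + 1.
ones = [1] * 6                       # 1 + x + ... + x^5, truncated
assert convolve(ones, ones)[:6] == [1, 2, 3, 4, 5, 6]

# Contrast with the termwise product of the sequences, which would be
# [1, 1, 1, 1, 1, 1]: convolution mixes terms across indices.
```

This matches the differentiation result quoted earlier, since 1/(1 - x)² = Σ (n + 1)xⁿ.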


8.2 Another look at distributing apples or folders

🧭 Overview

🧠 One-sentence thesis

Generating functions encode distribution problems by multiplying power series—one per entity—so that the coefficient on x^n counts all the ways to distribute n objects under given restrictions.

📌 Key points (3–5)

  • Core technique: To count ways to distribute n objects to multiple entities with restrictions, build a generating function for each entity individually, then multiply them; the coefficient on x^n in the product gives the answer.
  • How multiplication works: The product of generating functions produces x^n for every combination of exponents that sum to n, so each term corresponds to one distribution scenario.
  • Handling restrictions: Different restrictions (at least one, at most k, multiples only, etc.) translate into different generating function factors (e.g., x/(1−x) for "at least one," 1+x+x²+x³ for "at most three").
  • Common confusion: Don't confuse the generating function itself with the answer—the generating function is a formal power series; the coefficient on x^n is the count you want.
  • Extracting coefficients: Use algebraic manipulation (partial fractions, known series expansions) or computational tools to find the coefficient on x^n, which may yield a closed-form formula or a specific numerical answer.

🍎 The basic distribution problem revisited

🍎 Distributing to one child

The generating function for distributing n apples to one child (each child gets at least one apple) is x + x² + x³ + ⋯ = x/(1−x).

  • For a single child who must receive at least one apple, there is exactly one way to distribute n apples: give all n to that child.
  • The sequence is {aₙ : n ≥ 1} with aₙ = 1 for all n ≥ 1.
  • The generating function is x(1 + x + x² + ⋯) = x/(1−x).

🧒 Scaling up to multiple children

When distributing n apples to 5 children (each gets at least one), multiply the single-child generating function five times:

(x + x² + ⋯)(x + x² + ⋯)(x + x² + ⋯)(x + x² + ⋯)(x + x² + ⋯)

  • Why this works: To get x^n in the expansion, pick x^(k₁) from the first factor, x^(k₂) from the second, etc., where k₁ + k₂ + k₃ + k₄ + k₅ = n.
  • Each such product corresponds to one way of distributing n apples so that child i gets kᵢ apples.
  • Since each kᵢ > 0, every child receives at least one apple.
  • Example: The coefficient on x⁶ is 5 = C(5, 4), because you need to choose which child gets 2 apples (the other four get 1 each).

🔢 Extracting the general coefficient

The generating function for 5 children is:

x⁵/(1−x)⁵

To find the coefficient on x^n:

  • Rewrite as x⁵ · (1/(1−x)⁵).
  • Use calculus (fourth derivative) or combinatorial reasoning to show that 1/(1−x)⁵ = Σ C(n+4, 4) x^n.
  • Shift the index: the coefficient on x^n in x⁵/(1−x)⁵ is C(n−1, 4).

Don't confuse: The generating function x⁵/(1−x)⁵ is not the answer; it is a compact encoding. The answer for a specific n is the coefficient on x^n.
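The coefficient claim is easy to spot-check by brute force (a sketch of mine): the coefficient of x^n should equal the number of ways to write n as an ordered sum of 5 positive parts.

```python
from itertools import product
from math import comb

# Coefficient of x^n in x^5 / (1 - x)^5 should be C(n - 1, 4): compare
# with a direct count of compositions of n into 5 positive parts.
for n in range(5, 12):
    brute = sum(
        1
        for parts in product(range(1, n + 1), repeat=5)
        if sum(parts) == n
    )
    assert brute == comb(n - 1, 4)

# In particular the coefficient of x^6 is C(5, 4) = 5, as noted above.
assert comb(6 - 1, 4) == 5
```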

🎁 Handling complex restrictions

🎁 Fruit basket example (Example 8.5)

Problem: A basket has 20 pieces of fruit (apples, pears, oranges, grapefruit). Restrictions:

  • At least one apple.
  • At most three pears.
  • Number of oranges must be a multiple of four.
  • Grapefruit unrestricted.

Solution approach:

  1. Build a generating function for each fruit type:
    • Apples (at least one): x/(1−x) = x + x² + x³ + ⋯
    • Pears (at most three): 1 + x + x² + x³
    • Oranges (multiples of four): 1/(1−x⁴) = 1 + x⁴ + x⁸ + ⋯
    • Grapefruit (unrestricted): 1/(1−x) = 1 + x + x² + ⋯
  2. Multiply them: [x/(1−x)] · (1 + x + x² + x³) · [1/(1−x⁴)] · [1/(1−x)]
  3. Simplify using the identity (1 + x + x² + x³) = (1−x⁴)/(1−x): x/(1−x)³
  4. Expand x/(1−x)³ = x · Σ C(n+2, 2) x^n = Σ C(n+2, 2) x^(n+1).
  5. The coefficient on x^n is C(n+1, 2).
  6. For n = 20, the answer is C(21, 2) = 210.

Why it works: Each factor contributes one fruit type; the product's x^n term counts all ways to pick fruits summing to n under the restrictions.
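Because the restrictions are small, Example 8.5 can also be settled by direct enumeration; a quick Python cross-check of my own:

```python
from math import comb

def baskets(n):
    """Count baskets with n fruits: a apples (a >= 1), p pears (p <= 3),
    o oranges (a multiple of 4), g grapefruit (unrestricted)."""
    count = 0
    for a in range(1, n + 1):
        for p in range(0, min(3, n - a) + 1):
            for o in range(0, n - a - p + 1, 4):
                count += 1            # g = n - a - p - o >= 0 is forced
    return count

# The coefficient of x^n in x/(1 - x)^3 is C(n + 1, 2); check many n.
for n in range(1, 30):
    assert baskets(n) == comb(n + 1, 2)

assert baskets(20) == 210             # the answer in Example 8.5
```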

🔢 Integer solutions with mixed constraints (Example 8.6)

Problem: Find the number of integer solutions to x₁ + x₂ + x₃ = n where:

  • x₁ ≥ 0 and even.
  • x₂ ≥ 0.
  • 0 ≤ x₃ ≤ 2.

Solution:

  1. Generating functions for each variable:
    • x₁ (even, nonnegative): 1/(1−x²) = 1 + x² + x⁴ + ⋯
    • x₂ (nonnegative): 1/(1−x) = 1 + x + x² + ⋯
    • x₃ (0, 1, or 2): 1 + x + x²
  2. Multiply: (1 + x + x²) / [(1−x)(1−x²)]
  3. Simplify the denominator: (1−x)(1−x²) = (1+x)(1−x)².
  4. Use partial fractions: (1 + x + x²) / [(1+x)(1−x)²] = A/(1+x) + B/(1−x) + C/(1−x)²
  5. Solve for A, B, C: A = 1/4, B = −3/4, C = 3/2.
  6. Expand each term as a power series:
    • 1/(1+x) = Σ (−1)^n x^n
    • 1/(1−x) = Σ x^n
    • 1/(1−x)² = Σ (n+1) x^n
  7. The coefficient on x^n is: (−1)^n / 4 − 3/4 + 3(n+1)/2

Common confusion: Partial fractions are used to break a rational function into simpler pieces whose power series expansions are known; this is analogous to integration techniques in calculus but applied to formal power series.
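
The closed form from step 7 can be verified against direct enumeration; a small plain-Python sketch (function names are illustrative):

```python
def count_solutions(n):
    """Brute-force: x1 >= 0 even, x2 >= 0, 0 <= x3 <= 2, x1 + x2 + x3 = n."""
    return sum(1 for x1 in range(0, n + 1, 2)
                 for x3 in range(3)
                 if n - x1 - x3 >= 0)  # x2 = n - x1 - x3 is then determined

def closed_form(n):
    # (-1)^n/4 - 3/4 + 3(n+1)/2, written over the common denominator 4
    return ((-1) ** n - 3 + 6 * (n + 1)) // 4

for n in range(25):
    assert count_solutions(n) == closed_form(n)
print(closed_form(10))  # 16
```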

🛠️ Computational tools and verification

🛠️ Using SageMath for series expansion

  • The series(x, degree) method expands a generating function up to a given degree.
  • Example: ((1+x+x^2)/((1+x)*(1-x)^2)).series(x, 31) gives all coefficients up to x³⁰.
  • To extract a single coefficient, use .list()[n] to index into the list of coefficients.

🛠️ Partial fractions in SageMath

  • The partial_fraction() method automatically decomposes a rational function.
  • Example: ((1+x+x^2)/((1+x)*(1-x)^2)).partial_fraction() returns 1/4/(x+1) + 3/4/(x−1) + 3/2/(x−1)².
  • Use pretty_print() for more readable output.

Why this matters: For specific n, computational tools quickly give numerical answers; for general n, partial fractions combined with known series help derive closed-form formulas.
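
If SageMath is not at hand, the decomposition from step 5 of Example 8.6 can be spot-checked in plain Python with exact rational arithmetic (a sketch; the function names are illustrative):

```python
from fractions import Fraction

def original(x):
    return (1 + x + x * x) / ((1 + x) * (1 - x) ** 2)

def decomposed(x):
    # A = 1/4, B = -3/4, C = 3/2 in A/(1+x) + B/(1-x) + C/(1-x)^2
    return (Fraction(1, 4) / (1 + x)
            - Fraction(3, 4) / (1 - x)
            + Fraction(3, 2) / (1 - x) ** 2)

# Two rational functions of this degree agreeing at many points must be equal.
for k in range(2, 12):
    x = Fraction(1, k)
    assert original(x) == decomposed(x)
print("decomposition verified")
```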

🔄 Connecting to known counting problems

🔄 The generating function 1/(1−x)^n (Example 8.7)

Question: What is the coefficient on x^k in 1/(1−x)^n?

Combinatorial interpretation:

  • 1/(1−x) = 1 + x + x² + ⋯ encodes distributing apples to one child (any nonnegative number).
  • Multiplying n copies of 1/(1−x) gives 1/(1−x)^n, which encodes distributing k apples to n children with no restrictions (each child can receive zero or more).
  • This is a "stars and bars" problem: distribute k apples to n children, which requires choosing n−1 dividers among k+n−1 positions.
  • The coefficient on x^k is C(k+n−1, n−1) = C(k+n−1, k).

Result:

1/(1−x)^n = Σ C(k+n−1, k) x^k

Don't confuse: This is the nonnegative case (each child can receive zero apples); if each child must receive at least one, the generating function is x^n/(1−x)^n and the coefficient on x^k is C(k−1, n−1).
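
The stars-and-bars count can be confirmed by brute force; a short plain-Python sketch (the helper name is illustrative):

```python
from itertools import product
from math import comb

def distributions(k, n):
    """Ways to hand k identical apples to n children, zero allowed."""
    return sum(1 for ks in product(range(k + 1), repeat=n) if sum(ks) == k)

for n in range(1, 5):
    for k in range(7):
        assert distributions(k, n) == comb(k + n - 1, k)
print(distributions(6, 4))  # C(9, 3) = 84
```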

🔄 Why combinatorial reasoning helps

  • Calculus-based derivations (repeated differentiation, factorials) are error-prone with many −1 adjustments.
  • Combinatorial reasoning directly connects the generating function to a counting problem, making the coefficient formula more intuitive and easier to verify.
  • Example: Recognizing 1/(1−x)^n as "distribute to n children, nonnegative amounts" immediately gives C(k+n−1, k) without calculus.

8.3 Newton’s Binomial Theorem

🧭 Overview

🧠 One-sentence thesis

Newton's Binomial Theorem extends the classical binomial theorem to all real exponents by generalizing the definition of binomial coefficients, enabling the expansion of expressions like (1 + x) raised to any real power p as an infinite power series.

📌 Key points (3–5)

  • Extension beyond integers: The classical binomial theorem works for positive integer exponents, but Newton's version applies to any real number p ≠ 0.
  • Generalized binomial coefficients: The key is redefining P(p; k) and C(p; k) so they make sense when p is any real number and k is a nonnegative integer.
  • Infinite series result: When p is not a positive integer, the expansion becomes an infinite sum rather than a finite one.
  • Common confusion: When p and k are integers with 0 ≤ p < k, the generalized coefficient still equals zero (matching the classical case), but for non-integer p, coefficients like P(−5; 4) or C(−7/2; 5) are non-zero and meaningful.
  • Practical application: The theorem enables finding generating functions for sequences like central binomial coefficients, which appear in counting problems.

🔧 Generalizing the building blocks

🔧 Extending P(p; k) to real p

The excerpt starts from the recursive definition used for integers:

  • P(p; 0) = 1 for all integers p ≥ 0
  • P(p; k) = p · P(p − 1; k − 1) when p ≥ k > 0

Definition 8.8 removes the restriction p ≥ k:

For all real numbers p and nonnegative integers k:

  1. P(p; 0) = 1 for all real p
  2. P(p; k) = p · P(p − 1; k − 1) for all real p and integers k > 0
  • The recursion still makes sense because k is always a nonnegative integer, even though p can be any real number.
  • Example: P(−5; 4) = (−5)(−6)(−7)(−8), which is a well-defined product of four terms.

🔢 Extending binomial coefficients C(p; k)

Definition 8.9 generalizes the binomial coefficient:

For all real numbers p and nonnegative integers k, C(p; k) = P(p; k) / k!

Written in binomial notation: (p choose k) = P(p; k) / k!

  • When p and k are integers with 0 ≤ p < k, this still gives zero (matching the classical definition).
  • For non-integer p, new coefficients emerge: (−7/2 choose 5) = [(−7/2)(−9/2)(−11/2)(−13/2)(−15/2)] / 5!
  • Don't confuse: the formula looks the same, but the domain has expanded from "p and k both nonnegative integers with p ≥ k" to "p any real, k any nonnegative integer."
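
Definitions 8.8 and 8.9 translate directly into code; a plain-Python sketch using exact rationals (the names mirror the text's P and C but are otherwise illustrative):

```python
from fractions import Fraction
from math import factorial

def P(p, k):
    """Generalized falling factorial p(p-1)...(p-k+1), valid for any rational p."""
    result = Fraction(1)
    for i in range(k):
        result *= p - i
    return result

def C(p, k):
    return P(p, k) / factorial(k)

assert P(-5, 4) == (-5) * (-6) * (-7) * (-8)           # = 1680
assert C(Fraction(-7, 2), 5) == Fraction(-9009, 256)   # the example above, evaluated
assert C(3, 5) == 0  # integers with 0 <= p < k still give zero
print(C(Fraction(-7, 2), 5))
```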

📐 Newton's Binomial Theorem statement

📐 The theorem

Theorem 8.10 (Newton's Binomial Theorem):

For all real p with p ≠ 0, (1 + x)^p = sum from n=0 to infinity of (p choose n) · x^n

  • The classical binomial theorem is the special case when p is a positive integer (the sum terminates after p+1 terms because coefficients become zero).
  • When p is not a positive integer, the series is infinite.
  • The excerpt notes that the proof can be found in advanced calculus books (not provided here).

🔗 Connection to generating functions

The excerpt emphasizes the generating-function perspective:

  • The classical binomial theorem says (1 + x)^p is the generating function for the number of n-element subsets of a p-element set (when p is a positive integer).
  • Newton's extension allows us to interpret (1 + x)^p as a generating function even when p is not a positive integer, opening the door to new counting interpretations.

🧮 Working with the generalized coefficients

🧮 Alternative recursive formula (Lemma 8.11)

The excerpt establishes a different recursion:

For each k ≥ 0, P(p; k+1) = P(p; k) · (p − k)

Proof sketch:

  • Base case k = 0: both sides equal p.
  • Inductive step: assume P(p; m+1) = P(p; m)(p − m). Then:
    • P(p; m+2) = p · P(p−1; m+1) (by definition)
    • = p · [P(p−1; m) · (p−1 − m)] (by induction hypothesis applied to p−1)
    • = [p · P(p−1; m)] · (p − (m+1)) (rearranging)
    • = P(p; m+1) · (p − (m+1)) (by definition of P(p; m+1))

This recursion is useful for deriving specific formulas.

🧮 Simplifying C(−1/2; k) (Lemma 8.12)

The excerpt proves a concrete formula:

For each k ≥ 0, (−1/2 choose k) = (−1)^k · (2k choose k) / 2^(2k)

Proof sketch:

  • Base case k = 0: both sides equal 1.
  • Inductive step: assume the formula holds for k = m. Then:
    • (−1/2 choose m+1) = P(−1/2; m+1) / (m+1)!
    • = P(−1/2; m) · (−1/2 − m) / [(m+1) · m!] (using the alternative recursion)
    • = [(−1/2 − m) / (m+1)] · (−1/2 choose m) (rearranging)
    • Substitute the induction hypothesis and simplify to get (−1)^(m+1) · (2(m+1) choose (m+1)) / 2^(2(m+1))

This formula is key to the next application.
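
Lemma 8.12 can also be checked numerically by computing C(−1/2; k) from the falling factorial and comparing with the closed form (a plain-Python sketch; the function name is illustrative):

```python
from fractions import Fraction
from math import comb, factorial

def C_neg_half(k):
    """C(-1/2; k) computed straight from the generalized falling factorial."""
    prod = Fraction(1)
    for i in range(k):
        prod *= Fraction(-1, 2) - i
    return prod / factorial(k)

# Lemma 8.12: C(-1/2; k) = (-1)^k * C(2k, k) / 2^(2k)
for k in range(12):
    assert C_neg_half(k) == Fraction((-1) ** k * comb(2 * k, k), 2 ** (2 * k))
print(C_neg_half(2))  # 3/8
```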

🎯 Application: generating function for central binomial coefficients

🎯 Theorem 8.13

The excerpt derives a generating function for the sequence of central binomial coefficients {(2n choose n) : n ≥ 0}:

Theorem 8.13:

The function f(x) = (1 − 4x)^(−1/2) is the generating function of the sequence {(2n choose n) : n ≥ 0}.

Proof:

  • Apply Newton's Binomial Theorem with p = −1/2 and replace x with −4x:
    • (1 − 4x)^(−1/2) = sum from n=0 to infinity of (−1/2 choose n) · (−4x)^n
  • Substitute the formula from Lemma 8.12:
    • = sum from n=0 to infinity of [(−1)^n · (2n choose n) / 2^(2n)] · (−4)^n · x^n
    • = sum from n=0 to infinity of [(−1)^n · (2n choose n) / 2^(2n)] · (−1)^n · 4^n · x^n
    • = sum from n=0 to infinity of (2n choose n) · x^n (the (−1)^n terms cancel and 4^n / 2^(2n) = 1)

The coefficient on x^n is exactly (2n choose n).
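
The cancellation in the proof can be verified term by term: the Newton coefficient C(−1/2; n)·(−4)^n should equal the central binomial coefficient (a plain-Python sketch; the function name is illustrative):

```python
from fractions import Fraction
from math import comb, factorial

def coeff(n):
    """Coefficient on x^n in (1 - 4x)^(-1/2), i.e. C(-1/2; n) * (-4)^n."""
    prod = Fraction(1)
    for i in range(n):
        prod *= Fraction(-1, 2) - i
    return prod / factorial(n) * (-4) ** n

for n in range(12):
    assert coeff(n) == comb(2 * n, n)
print(coeff(4))  # 70 = C(8, 4)
```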

🎯 Corollary 8.14: an identity from squaring

By squaring f(x) = (1 − 4x)^(−1/2), the excerpt derives:

Corollary 8.14:

For all n ≥ 0, 2^(2n) = sum from k=0 to n of (2k choose k) · (2(n−k) choose (n−k))

Why this works:

  • [f(x)]^2 = (1 − 4x)^(−1) = 1 / (1 − 4x) = sum from n=0 to infinity of 4^n · x^n (geometric series)
  • So the coefficient on x^n in [f(x)]^2 is 4^n = 2^(2n).
  • By Proposition 8.3 (product of generating functions), the coefficient on x^n in the product of two series is the sum of products of their coefficients:
    • Coefficient on x^n = sum from k=0 to n of [(2k choose k) · (2(n−k) choose (n−k))]
  • Equating the two expressions gives the identity.

Don't confuse: this is not a direct application of Newton's theorem, but rather a consequence of multiplying the generating function by itself and using the convolution formula for products.
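
Corollary 8.14 is easy to test directly with math.comb (a minimal sketch):

```python
from math import comb

# Corollary 8.14: sum_{k=0}^{n} C(2k, k) * C(2(n-k), n-k) = 4^n
for n in range(15):
    convolution = sum(comb(2 * k, k) * comb(2 * (n - k), n - k)
                      for k in range(n + 1))
    assert convolution == 4 ** n
print("identity holds for n = 0..14")
```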


8.4 An Application of the Binomial Theorem

🧭 Overview

🧠 One-sentence thesis

Newton's Binomial Theorem can be applied to derive a generating function for central binomial coefficients and to prove a summation identity involving products of binomial coefficients.

📌 Key points (3–5)

  • Main result: The function (1 − 4x)^(−1/2) is the generating function for the sequence of central binomial coefficients {(2n choose n)}.
  • Key technique: Use Newton's Binomial Theorem with exponent p = −1/2, combined with a simplified formula for binomial coefficients with that exponent.
  • Corollary identity: Squaring the generating function yields the identity 2^(2n) = sum from k=0 to n of [(2k choose k) × (2n−2k choose n−k)].
  • Common confusion: The binomial coefficients here use a negative exponent (−1/2), not a positive integer, so Newton's generalized theorem is required, not the standard binomial theorem.

🔧 Preparatory lemmas

🔧 Recursive formula for falling factorial

The excerpt establishes a different recursive formula than the definition:

Lemma 8.11: For each k ≥ 0, P(p; k+1) = P(p; k) × (p − k).

  • P(p; k) denotes the falling factorial: p × (p−1) × (p−2) × … × (p−k+1).
  • Why it matters: This recursion is needed to simplify binomial coefficients with non-integer exponents.
  • Proof sketch: Base case k=0 gives both sides equal to p; inductive step uses the definition of falling factorial and substitutes p−1 for the exponent.

🧮 Simplified binomial coefficient formula

The excerpt derives a closed form for binomial coefficients with exponent −1/2:

Lemma 8.12: For each k ≥ 0, (−1/2 choose k) = (−1)^k × (2k choose k) / 2^(2k).

  • What it does: Converts a binomial coefficient with fractional exponent into one with integer exponent (the central binomial coefficient (2k choose k)).
  • How: Uses induction on k; base case k=0 gives 1 on both sides; inductive step applies Lemma 8.11 and algebraic manipulation.
  • Don't confuse: This is not the standard binomial coefficient formula; it only holds for the specific exponent p = −1/2.

🎯 Main theorem: generating function for central binomial coefficients

🎯 Statement and proof

Theorem 8.13: The function f(x) = (1 − 4x)^(−1/2) is the generating function of the sequence {(2n choose n) : n ≥ 0}.

Proof strategy:

  1. Apply Newton's Binomial Theorem with p = −1/2 and replace x with −4x:
    • (1 − 4x)^(−1/2) = sum from n=0 to infinity of [(−1/2 choose n) × (−4x)^n]
  2. Substitute Lemma 8.12's formula for (−1/2 choose n):
    • = sum from n=0 to infinity of [(−1)^n / 2^(2n) × (2n choose n) × (−4)^n × x^n]
  3. Simplify: (−1)^n × (−4)^n = (−1)^n × (−1)^n × 4^n = 4^n, and 4^n / 2^(2n) = 1.
    • Result: sum from n=0 to infinity of [(2n choose n) × x^n].

Why it works: Newton's Binomial Theorem extends to all real exponents, not just positive integers, so we can use p = −1/2.

🔗 Connection to future applications

  • The excerpt notes this generating function will reappear in Section 9.7 for a counting problem "in disguise."
  • Example context: problems that seem new but reduce to central binomial coefficients.

📐 Corollary: a summation identity

📐 Squaring the generating function

Corollary 8.14: For all n ≥ 0, 2^(2n) = sum from k=0 to n of [(2k choose k) × (2n−2k choose n−k)].

How it's derived:

  • Square both sides of Theorem 8.13: [f(x)]^2 = (1 − 4x)^(−1).
  • Left side: (1 − 4x)^(−1) = sum from n=0 to infinity of [4^n × x^n] = sum from n=0 to infinity of [2^(2n) × x^n].
  • Right side: [sum (2k choose k) × x^k]^2.
  • By Proposition 8.3 (product of generating functions), the coefficient of x^n in the product is the sum from k=0 to n of [(2k choose k) × (2n−2k choose n−k)].
  • Equate coefficients of x^n on both sides.

Don't confuse: This identity comes from squaring the generating function, not from the binomial theorem directly; it uses the rule for multiplying two power series.

🔢 Introduction to integer partitions

🔢 Definition and notation

A partition P of an integer n is a collection of (not necessarily distinct) positive integers such that the sum of elements in P equals n.

  • Convention: write elements from largest to smallest.
  • Example: 2 + 2 + 1 is a partition of 5.
  • p_n: the number of partitions of integer n; by convention p_0 = 1.

🧩 Generating function for partitions

The excerpt constructs a generating function for p_n:

P(x) = (sum from m=0 to infinity of x^m) × (sum from m=0 to infinity of x^(2m)) × (sum from m=0 to infinity of x^(3m)) × … = product from m=1 to infinity of [1 / (1 − x^m)].

  • How it works: Each factor accounts for how many times a particular integer k appears in the partition.
    • The factor with x^(km) terms counts the number of k's.
  • Limitation: The form is elegant but not easy to use for computing p_n directly.
  • Example: The excerpt mentions that Hardy and Ramanujan solved the asymptotic estimate problem in 1918 (referenced in The Man who Knew Infinity).

🎲 Distinct vs. odd partitions

The excerpt gives an example for n = 8:

  • p_8 = 22 (total partitions of 8).
  • 6 partitions into distinct parts (all summands different).
  • 6 partitions into odd parts (all summands odd).

Key observation: The counts for distinct-part and odd-part partitions are equal for n=8; the excerpt notes this is "always the case" (Theorem 8.16, stated but not proved in this excerpt).

Don't confuse:

  • Distinct parts ≠ odd parts in definition, but their counts turn out equal.
  • The generating function P(x) counts all partitions, not just distinct or odd ones.

8.5 Partitions of an Integer

🧭 Overview

🧠 One-sentence thesis

The number of ways to partition an integer into distinct parts always equals the number of ways to partition it into odd parts, a surprising equivalence proven using generating functions.

📌 Key points (3–5)

  • What a partition is: a collection of positive integers that sum to n, written largest to smallest (e.g., 2+2+1 is a partition of 5).
  • Two special types: partitions into distinct parts (all different) vs. partitions into odd parts (only odd numbers allowed).
  • The surprising result: for every integer n, the count of distinct-part partitions equals the count of odd-part partitions.
  • Common confusion: this equality is not a coincidence—it holds for all n, proven by showing two generating functions are identical.
  • Why generating functions help: they encode partition-counting rules as infinite products, making algebraic manipulation reveal hidden equivalences.

🧩 What partitions are

🧩 Definition and notation

A partition P of an integer n is a collection of (not necessarily distinct) positive integers such that the sum of all integers in P equals n.

  • By convention, elements are written from largest to smallest.
  • Example: 2+2+1 is a partition of 5.
  • The number of partitions of n is denoted p_n (with p_0 = 1 by convention).

📊 Example: partitions of 8

  • The excerpt lists all 22 partitions of 8 (p_8 = 22).
  • Among these, exactly 6 use only distinct parts: 8, 7+1, 6+2, 5+3, 5+2+1, 4+3+1.
  • Also exactly 6 use only odd parts: 7+1, 5+3, 5+1+1+1, 3+3+1+1, 3+1+1+1+1+1, 1+1+1+1+1+1+1+1.
  • This 6=6 equality is not a coincidence—it holds for every n.
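
The n = 8 counts can be reproduced by enumerating partitions (a plain-Python sketch; the generator name is illustrative):

```python
def partitions(n, max_part=None):
    """Yield the partitions of n as nonincreasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

parts8 = list(partitions(8))
distinct8 = [p for p in parts8 if len(set(p)) == len(p)]
odd8 = [p for p in parts8 if all(x % 2 == 1 for x in p)]
print(len(parts8), len(distinct8), len(odd8))  # 22 6 6
```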

🔢 Generating function for all partitions

🔢 The general partition generating function

The generating function P(x) for p_n (the total number of partitions of n) is:

P(x) = (sum from m=0 to infinity of x^m) × (sum from m=0 to infinity of x^(2m)) × (sum from m=0 to infinity of x^(3m)) × ... = product from m=1 to infinity of 1/(1 - x^m)

  • Each factor accounts for how many times a particular integer k appears in the partition.
  • The factor with terms x^(km) counts the number of k's in the partition.
  • This form is elegant but not easy for computing p_n directly.

📝 Historical note

  • Providing an asymptotic estimate for p_n was notoriously difficult.
  • Hardy and Ramanujan solved it in 1918.
  • The excerpt mentions a popular account in Robert Kanigel's 1991 book The Man who Knew Infinity and the 2016 film.

🎯 The distinct-parts vs. odd-parts theorem

🎯 Statement of Theorem 8.16

Theorem 8.16: For each n ≥ 1, the number of partitions of n into distinct parts equals the number of partitions of n into odd parts.

  • "Distinct parts" means all integers in the partition are different (x_i ≠ x_j for i ≠ j).
  • "Odd parts" means every integer in the partition is odd.
  • Example from the excerpt: for n=8, both counts are 6.

🔍 Proof strategy

The proof compares two generating functions:

  • Distinct parts: D(x) = product from n=1 to infinity of (1 + x^n); each part appears 0 or 1 times.
  • Odd parts: O(x) = product from n=1 to infinity of 1/(1 − x^(2n−1)); odd parts can appear any number of times.

The proof shows D(x) = O(x) by algebraic manipulation.

🧮 Key algebraic step

The proof uses the identity:

1 - x^(2n) = (1 - x^n)(1 + x^n) for all n ≥ 1

Starting from D(x):

  • D(x) = product from n=1 to infinity of (1 + x^n)
  • Rewrite as: product of (1 - x^(2n))/(1 - x^n)
  • Split into: [product of (1 - x^(2n))] / [product of (1 - x^n)]
  • The denominator contains both even and odd terms.
  • The even terms in the denominator cancel with the numerator.
  • What remains is: product from n=1 to infinity of 1/(1 - x^(2n-1)) = O(x)

Don't confuse: The generating functions look very different at first (one is a product of sums, the other a product of fractions), but the algebraic identity reveals they are the same function, so their coefficients (the partition counts) must match term by term.
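
The cancellation argument can be mirrored computationally: building both infinite products as truncated power series shows their coefficients agree (a plain-Python sketch; the truncation degree N and helper names are choices made here, not from the text):

```python
N = 30  # work with series truncated at degree N

def poly_mul(a, b):
    """Multiply two coefficient lists, discarding terms above degree N."""
    c = [0] * (N + 1)
    for i, ai in enumerate(a):
        if ai:
            for j in range(min(len(b), N + 1 - i)):
                c[i + j] += ai * b[j]
    return c

# D(x) = prod (1 + x^n): each part used at most once
dist = [1] + [0] * N
for n in range(1, N + 1):
    factor = [0] * (N + 1)
    factor[0], factor[n] = 1, 1
    dist = poly_mul(dist, factor)

# O(x) = prod 1/(1 - x^(2n-1)): a truncated geometric series per odd part
odds = [1] + [0] * N
for n in range(1, N + 1, 2):
    geom = [1 if k % n == 0 else 0 for k in range(N + 1)]
    odds = poly_mul(odds, geom)

assert dist == odds   # coefficients agree through degree N
print(dist[8])  # 6
```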

🌟 Why this matters

  • The equality is not obvious from the definitions—you cannot easily construct a direct bijection between distinct-part and odd-part partitions just by looking at examples.
  • Generating functions encode the counting rules and allow algebraic proof of the equivalence.
  • This illustrates the power of generating functions to reveal hidden structure in combinatorial problems.

🔄 Connection to earlier material

🔄 Similarity to fruit basket problems

  • A partition counts how many 1's appear, how many 2's appear, etc.
  • This is similar to counting fruit baskets (how many apples, how many oranges, etc.) from earlier in the chapter.
  • The generating function structure reflects this: each factor corresponds to one "type" (one integer value).

🔄 Restricted generating functions

  • D(x) and O(x) are restricted versions of the general partition generating function P(x).
  • D(x) restricts each part to appear at most once (hence the factor (1 + x^n) instead of 1/(1 - x^n)).
  • O(x) restricts to odd parts only (hence the product over 2n-1 instead of all n).

8.6 Exponential generating functions

🧭 Overview

🧠 One-sentence thesis

Exponential generating functions, which use the power series form of e^x (sum of a_n times x^n divided by n factorial), are particularly useful for counting problems where order matters—such as enumerating strings with specific digit restrictions—unlike ordinary generating functions that work well when order does not matter.

📌 Key points (3–5)

  • What exponential generating functions are: power series of the form sum of a_n times x^n divided by n factorial, based on the exponential function e^x.
  • When to use them: problems where order matters (e.g., counting strings), as opposed to ordinary generating functions for problems where order does not matter (e.g., counting fruit baskets).
  • How to build them: multiply factors for each symbol or digit, using e^x for unrestricted counts, (e^x + e^(−x))/2 for even counts, e^x − 1 for "at least one," and finite sums for bounded counts.
  • Common confusion: exponential vs ordinary generating functions—exponential is for ordered arrangements (strings), ordinary is for unordered collections (partitions, combinations).
  • Why they matter: they solve counting problems that ordinary generating functions cannot handle, especially when restrictions involve parity, minimum occurrences, or maximum occurrences in ordered sequences.

🔤 Core concept and definition

🔤 What exponential generating functions are

Exponential generating function: for a sequence {a_n : n ≥ 0}, the exponential generating function is the sum over n of a_n times x^n divided by n factorial.

  • Ordinary generating functions have the form: sum of a_n times x^n.
  • Exponential generating functions have the form: sum of a_n times x^n divided by n factorial.
  • The name comes from the power series for e^x, which is sum of x^n divided by n factorial.

🌟 The fundamental example

The constant sequence 1, 1, 1, 1, ... has exponential generating function:

  • E(x) = sum from n=0 to infinity of x^n divided by n factorial.
  • From calculus, this is the power series for e^x.
  • Example: the exponential generating function for the number of binary strings of length n is e^(2x), because e^(2x) = sum of (2x)^n divided by n factorial = sum of 2^n times x^n divided by n factorial.

🆚 When to use exponential vs ordinary generating functions

🆚 Order matters vs order does not matter

The key distinction:

  • Ordinary generating function: use when order does not matter (unordered collections), as in counting fruit baskets with apples and oranges, or partitions of integers.
  • Exponential generating function: use when order matters (ordered arrangements), as in counting strings where position matters (e.g., "10001" ≠ "01100").

📦 Illustrative contrast

  • Unordered (ordinary): Two fruit baskets with two apples and three oranges are considered equivalent regardless of arrangement.
  • Ordered (exponential): The bit strings "10001" and "01100" both contain three zeros and two ones, but they are not the same string because position matters.

Don't confuse: the same counting problem may require different generating functions depending on whether the objects are distinguishable by position.

🧰 Building blocks for exponential generating functions

🧰 Factor for unrestricted count

  • If a symbol can appear any number of times, introduce a factor of e^x.
  • Example: for 1s and 2s with no restrictions in a ternary string, use e^x for each, giving e^x times e^x = e^(2x).

🧰 Factor for even count

To count an even number of a symbol:

  • Need: 1 + x^2 divided by 2! + x^4 divided by 4! + x^6 divided by 6! + ... = sum from n=0 to infinity of x^(2n) divided by (2n)!.
  • Technique: recall that e^(−x) = sum of (−1)^n times x^n divided by n!.
  • When you add e^x and e^(−x), all odd-power terms cancel, leaving 2 + 2 times x^2 divided by 2! + 2 times x^4 divided by 4! + ...
  • Therefore, the factor for an even number of a symbol is (e^x + e^(−x)) divided by 2.

🧰 Factor for at least one occurrence

To ensure a symbol appears at least once:

  • Need: x + x^2 divided by 2! + x^3 divided by 3! + ... = sum from n=1 to infinity of x^n divided by n!.
  • This is the series for e^x minus the first term (which is 1).
  • Therefore, the factor for at least one occurrence is e^x − 1.

🧰 Factor for at most k occurrences

To allow at most k occurrences of a symbol:

  • Use the finite sum: 1 + x + x^2 divided by 2! + x^3 divided by 3! + ... + x^k divided by k!.
  • Example: for at most three 2s, use 1 + x + x^2 divided by 2! + x^3 divided by 3!.

🧮 Worked examples

🧮 Ternary strings with even number of 0s

Problem: Find the number of ternary strings (using digits 0, 1, 2) in which the number of 0s is even, with no restrictions on 1s and 2s.

Solution steps:

  1. For 1s: factor of e^x (unrestricted).
  2. For 2s: factor of e^x (unrestricted).
  3. For 0s (even count): factor of (e^x + e^(−x)) divided by 2.
  4. Multiply: [(e^x + e^(−x)) divided by 2] times e^x times e^x = (e^(3x) + e^x) divided by 2.
  5. Expand: (1/2) times [sum of 3^n times x^n divided by n! + sum of x^n divided by n!].
  6. The coefficient on x^n divided by n! is (3^n + 1) divided by 2.

Answer: The number of ternary strings of length n with an even number of 0s is (3^n + 1) divided by 2.
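
The answer (3^n + 1)/2 can be checked against direct enumeration for small n (a plain-Python sketch; the function name is illustrative):

```python
from itertools import product

def ternary_even_zeros(n):
    """Count length-n strings over {0,1,2} containing an even number of 0s."""
    return sum(1 for s in product("012", repeat=n) if s.count("0") % 2 == 0)

for n in range(8):
    assert ternary_even_zeros(n) == (3 ** n + 1) // 2
print(ternary_even_zeros(5))  # 122
```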

🧮 Ternary strings with at least one 0 and at least one 1

Problem: How many ternary strings of length n have at least one 0 and at least one 1?

Solution steps:

  1. For 0s (at least one): factor of e^x − 1.
  2. For 1s (at least one): factor of e^x − 1.
  3. For 2s (unrestricted): factor of e^x.
  4. Multiply: (e^x − 1) times (e^x − 1) times e^x = e^(3x) − 2 times e^(2x) + e^x.
  5. Expand: sum of 3^n times x^n divided by n! − 2 times sum of 2^n times x^n divided by n! + sum of x^n divided by n!.
  6. The coefficient on x^n divided by n! is 3^n − 2 times 2^n + 1.

Answer: 3^n − 2 times 2^n + 1.

Alternative method (inclusion-exclusion):

  • Total ternary strings of length n: 3^n.
  • Subtract strings lacking a 0 (only 1s and 2s): 2^n.
  • Subtract strings lacking a 1 (only 0s and 2s): 2^n.
  • Add back strings lacking both 0 and 1 (only 2s): 1.
  • Result: 3^n − 2 times 2^n + 1 (same answer).
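
Both derivations can be confirmed by brute force for small n (a plain-Python sketch; the function name is illustrative):

```python
from itertools import product

def ternary_with_0_and_1(n):
    """Count length-n strings over {0,1,2} containing at least one 0 and one 1."""
    return sum(1 for s in product("012", repeat=n) if "0" in s and "1" in s)

for n in range(8):
    assert ternary_with_0_and_1(n) == 3 ** n - 2 * 2 ** n + 1
print(ternary_with_0_and_1(3))  # 12
```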

🧮 Eight-digit passcode with multiple restrictions

Problem: An eight-digit passcode must contain an even number of 0s, at least one 1, and at most three 2s. How many valid passcodes are there?

Solution steps:

  1. For the seven unrestricted digits (3, 4, 5, 6, 7, 8, 9): factor of e^(7x).
  2. For 0s (even count): factor of (e^x + e^(−x)) divided by 2.
  3. For 1s (at least one): factor of e^x − 1.
  4. For 2s (at most three): factor of 1 + x + x^2 divided by 2! + x^3 divided by 3!.
  5. Multiply all factors together.
  6. Use a computer algebra system (SageMath) to find the coefficient on x^8 divided by 8!.

Result: The coefficient on x^8 is 33847837/40320; multiplying by 8! = 40320 gives 33847837 valid passcodes.

Observation: This is only about 33.85% of the total number of eight-digit strings (which is 10^8 = 100000000), showing that the restrictions significantly reduce the number of possibilities.
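
The SageMath computation can be reproduced with truncated power series over exact rationals; a plain-Python sketch of the four factors and their product (helper names are choices made here, not from the text):

```python
from fractions import Fraction
from math import factorial

N = 8  # passcode length

def mul(a, b):
    """Multiply two coefficient lists, discarding terms above degree N."""
    c = [Fraction(0)] * (N + 1)
    for i in range(N + 1):
        for j in range(N + 1 - i):
            c[i + j] += a[i] * b[j]
    return c

def exp_series(a):
    """Truncated series for e^(a*x)."""
    return [Fraction(a ** n, factorial(n)) for n in range(N + 1)]

even_zeros = [(u + v) / 2 for u, v in zip(exp_series(1), exp_series(-1))]
at_least_one_1 = [c - (1 if n == 0 else 0) for n, c in enumerate(exp_series(1))]
at_most_three_2s = [Fraction(1, factorial(n)) if n <= 3 else Fraction(0)
                    for n in range(N + 1)]
others = exp_series(7)  # the seven unrestricted digits 3 through 9

egf = mul(mul(mul(even_zeros, at_least_one_1), at_most_three_2s), others)
print(egf[8] * factorial(8))  # 33847837
```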

🛠️ Practical techniques

🛠️ Reading off coefficients

  • Once you have the exponential generating function as a power series, the number you want is the coefficient on x^n divided by n!.
  • Example: if the series is sum of a_n times x^n divided by n!, then a_n is the count for length n.

🛠️ Using computer algebra systems

  • For complex products of exponential generating functions, hand calculation can be error-prone.
  • Computer algebra systems like SageMath can expand the series and extract the desired coefficient.
  • Example from the excerpt: use series(x, 9) to expand up to x^8, then .list()[8] to get the coefficient on x^8; since that coefficient is a_8/8!, multiply by 8! to recover the count a_8.

Don't confuse: the coefficient in the series is on x^n divided by n!, not just x^n, so you must account for the factorial when interpreting results.

🛠️ Algebraic tricks

  • To isolate even powers: add e^x and e^(−x) to cancel odd terms.
  • To get "at least one": subtract 1 from e^x to remove the n=0 term.
  • To get finite sums: write out the first few terms explicitly and recognize they form a truncated exponential series.

🌐 Beyond string counting

🌐 Other applications

The excerpt notes that exponential generating functions are useful in many situations beyond enumerating strings:

  • Example mentioned: counting the number of n-vertex, connected, labeled graphs.
  • However, such applications are beyond the scope of the excerpt.


8.7 Discussion

🧭 Overview

🧠 One-sentence thesis

The discussion illustrates how generating functions can prove equalities between quantities without explicitly computing them, leading students to recognize parallels with cryptographic communication where parties exchange information while maintaining privacy or verifiability.

📌 Key points (3–5)

  • Core insight: Generating functions can prove two quantities are equal without calculating what those quantities actually are.
  • Student reaction: The group recognizes this "proving without revealing" property as remarkable and potentially applicable beyond pure mathematics.
  • Conceptual leap: Students spontaneously connect the mathematical technique to real-world scenarios involving privacy, security, and communication verification.
  • Common confusion: Don't confuse "proving equality without computing values" with "hiding information"—the mathematical proof is complete and rigorous, but the connection to cryptography is metaphorical at this stage.
  • Broader significance: Privacy and security applications are identified as "big ticket items" worth exploring.

🔍 The mathematical insight

🔍 Proving without computing

  • The excerpt references a proof that "the number of partitions of an integer into odd parts is the same as the number of partitions of that integer into distinct parts."
  • Key property: The proof establishes equality without calculating the actual values of either quantity.
  • Yolanda's observation: "We showed that two quantities were equal without saying anything about what those quantities actually were."
  • This demonstrates that generating functions can establish relationships structurally rather than numerically.

🎯 Why this matters

  • Traditional counting might require enumerating all possibilities for both sides and comparing totals.
  • Generating functions allow symbolic manipulation to prove equality directly.
  • The method is elegant because it bypasses potentially difficult or tedious computation.

💭 From mathematics to applications

💭 Communication without full disclosure

Dave's contribution: "There might be other instances where you would want to be able to communicate fully, yet hold back on every last detail."

What this means:

  • Parties might need to exchange information while maintaining some level of privacy.
  • The mathematical technique suggests it's possible to prove or verify something without revealing everything.

💭 Detecting eavesdropping

Carlos adds another dimension: "Maybe they just want to be able to detect whether anyone else was listening."

Implications:

  • Beyond hiding content, parties might want to verify the integrity of their communication channel.
  • This touches on authentication and tamper-detection concepts.

💭 Plausible deniability

Alice raises: "Do you mean that parties may want to communicate, while maintaining that the conversation did not occur?"

What this suggests:

  • Communication might need to be verifiable to intended recipients but deniable to others.
  • This is a more sophisticated privacy requirement than simple encryption.

🔐 Security and privacy connections

🔐 Why Zori was "nearly happy"

"Privacy and security were big ticket items."

  • The discussion resonates because these are high-value application areas.
  • The mathematical property of "proving without revealing" maps conceptually onto cryptographic protocols.
  • Modern cryptography often requires proving knowledge of information without disclosing the information itself (zero-knowledge proofs, for example).

🔐 The conceptual bridge

Mathematical property → potential security analog:

  • Prove equality without computing values → verify credentials without revealing them
  • Establish relationships structurally → authenticate without exposing secrets
  • Manipulate symbolically rather than numerically → process encrypted data without decryption

Important caveat: The excerpt presents this as a student brainstorming session, not a formal connection—the students are recognizing an interesting parallel, not claiming the mathematical technique directly solves cryptographic problems.


Generating Functions: Exercises

8.8 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set develops skill in constructing and manipulating generating functions—both ordinary and exponential—to solve combinatorial counting problems involving distributions, partitions, and constrained selections.

📌 Key points (3–5)

  • Two types of generating functions: ordinary generating functions (OGFs) encode sequences as power series; exponential generating functions (EGFs) are used when order matters.
  • Core skills practiced: writing generating functions from sequences, extracting coefficients, finding closed forms, and modeling real-world constraints (e.g., "at least two," "even number," "multiple of k").
  • Applications covered: distributing items with restrictions, making change, forming partitions, and counting strings with character constraints.
  • Common confusion: OGFs vs. EGFs—use OGFs when items are indistinguishable within a type and order doesn't matter; use EGFs when order or arrangement matters (e.g., strings, permutations).
  • Tool usage: many exercises suggest computer algebra systems (e.g., SageMath) for extracting coefficients or partial fraction expansions, but hand calculation is encouraged for learning.

📝 Finite and infinite sequences

📝 Writing generating functions from sequences

  • Finite sequences (Exercises 1a–1f): directly write the polynomial by pairing each term with its power of x.
    • Example: the sequence 1, 4, 6, 4, 1 becomes 1 + 4x + 6x² + 4x³ + x⁴.
  • Infinite sequences (Exercises 2a–2l): find a closed form (not an infinite sum) using standard series formulas.
    • Example: 1, 1, 1, 1, ... starting from x¹ suggests x/(1 - x).
    • Patterns to recognize: geometric series, binomial expansions, shifts (multiply by x^k to delay the start).

🔍 Extracting terms from generating functions

  • From closed form to sequence (Exercise 3): expand or recognize the series to find the nth coefficient.
    • Example: 1/(1 - x⁴) expands to 1 + x⁴ + x⁸ + ..., so the coefficient of x^n is 1 if n is a multiple of 4, else 0.
  • Finding specific coefficients (Exercise 4): use algebraic manipulation, convolution, or series expansion to isolate the coefficient on x^10.

🎯 Modeling combinatorial problems

🎯 Distribution problems with constraints

A generating function encodes all possible ways to satisfy constraints by multiplying factors, one per type of item.

  • Structure: each item type contributes a factor; restrictions translate to which powers of x appear.
    • "At least one white balloon": factor is x + x² + x³ + ... = x/(1 - x).
    • "At most two blue balloons": factor is 1 + x + x².
    • "At least two cookies per volunteer": factor is x² + x³ + ... = x²/(1 - x).
  • Example (Exercise 5): balloons with "at least one white, at least one gold, at most two blue" → multiply (x/(1-x)) · (x/(1-x)) · (1 + x + x²), then find the coefficient on x^10.
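
As a quick sanity check (our own, not part of the exercise set), the product above can be expanded by truncated polynomial multiplication; a minimal Python sketch:

```python
N = 10  # we only care about coefficients up to x^N

def mul(a, b, n=N):
    """Multiply two coefficient lists, truncating at degree n."""
    c = [0] * (n + 1)
    for i, ai in enumerate(a[: n + 1]):
        for j, bj in enumerate(b[: n + 1 - i]):
            c[i + j] += ai * bj
    return c

white = [0] + [1] * N              # x/(1-x): at least one white
gold = [0] + [1] * N               # x/(1-x): at least one gold
blue = [1, 1, 1] + [0] * (N - 2)   # 1 + x + x^2: at most two blue

g = mul(mul(white, gold), blue)
print(g[10])  # 24
```

The coefficient on x^10 comes out to 24, matching the hand count 9 + 8 + 7 from expanding (1 + x + x²)·x²/(1 − x)².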

🧮 Inequalities and restricted sums

  • Exercises 7 and 27: count integer solutions to x₁ + x₂ + x₃ + x₄ ≤ n with constraints like "x₂ ≥ 2," "x₃ is a multiple of 4," "0 ≤ x₄ ≤ 3."
  • Method: write a generating function where each variable's factor reflects its constraint, then find a closed formula for the nth coefficient.
  • Don't confuse: ≤ n (inequality) vs. = n (equality)—for inequalities, multiply by 1/(1 - x) to sum over all valid n.

💰 Making change and spending problems

  • Exercise 11: count ways to make change for $100 using dollar coins and $1, $2, $5 bills.
    • Each denomination contributes a factor: (1 + x + x² + ...) for each type.
    • Use partial fractions to extract the coefficient on x^100.
  • Exercise 12: spending exactly €n at a chocolate shop with item prices and purchase constraints.
    • Each item type contributes a factor based on its price (as the exponent) and quantity restrictions.
    • Example: "at least two boxes at €8 each" → x^16 + x^24 + x^32 + ... = x^16/(1 - x^8).
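
The Exercise 11 setup can be sketched with the same truncated-series idea; each denomination contributes a factor 1/(1 − x^value), and (on the reading that dollar coins and $1 bills are distinct item types) the value 1 contributes two separate factors. A numeric expansion then replaces the partial-fraction step:

```python
N = 100

def geometric(step, n=N):
    """Coefficient list of 1/(1 - x^step) up to degree n."""
    return [1 if k % step == 0 else 0 for k in range(n + 1)]

def mul(a, b, n=N):
    """Truncated product of two coefficient lists."""
    c = [0] * (n + 1)
    for i, ai in enumerate(a):
        if ai:
            for j in range(n + 1 - i):
                c[i + j] += ai * b[j]
    return c

# dollar coins, $1 bills, $2 bills, $5 bills: coins and $1 bills
# are distinct types, so value 1 appears as two factors
f = geometric(1)
for value in (1, 2, 5):
    f = mul(f, geometric(value))

print(f[100])  # ways to make change for $100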

🧩 Partitions and special structures

🧩 Integer partitions

  • Partitions into odd parts (Exercise 17): use the generating function ∏(1/(1 - x^(2k+1))) for k ≥ 0.
  • Partitions into distinct parts (Exercise 16, 18): list partitions of 9 and mark which are distinct (D) or odd (O); verify the bijection discussed in Section 8.7.
  • Even parts (Exercise 19): analogous to odd parts, but use even exponents.
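
The odd-parts/distinct-parts identity from Section 8.7 is easy to spot-check numerically; this sketch (our own verification, not a method from the exercises) counts both kinds of partitions by dynamic programming:

```python
def count_partitions(n, parts):
    """Partitions of n using each allowed part any number of times."""
    ways = [1] + [0] * n
    for p in parts:
        for total in range(p, n + 1):
            ways[total] += ways[total - p]
    return ways[n]

def count_distinct(n):
    """Partitions of n into distinct parts (each part used at most once)."""
    ways = [1] + [0] * n
    for p in range(1, n + 1):
        for total in range(n, p - 1, -1):  # descending: each part used once
            ways[total] += ways[total - p]
    return ways[n]

for n in range(1, 30):
    assert count_partitions(n, range(1, n + 1, 2)) == count_distinct(n)

print(count_partitions(9, range(1, 10, 2)), count_distinct(9))  # 8 8
```

For n = 9 both counts are 8, consistent with the hand listing asked for in Exercise 16/18.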

🍬 Multi-type selection with flavor distinctions

  • Exercise 13: candy bags where fruit chews have four flavors and bags are distinguished by flavor, not just count.
    • "At most two fruit chews, flavors matter" → factor is 1 + 4x + ((4 choose 2) + 4)x² = 1 + 4x + 10x²: one term per flavor for a single chew, and for pairs, 6 mixed-flavor pairs plus 4 same-flavor pairs (combinations with replacement).
    • Don't confuse: when flavors matter, use combinations with replacement or multinomial counting, not just powers of x.

🔢 Exponential generating functions

🔢 When to use EGFs

Exponential generating functions (EGFs) encode sequences {aₙ} as ∑(aₙ xⁿ / n!); use them when order or arrangement matters (e.g., permutations, strings).

  • Exercises 20–22: write EGFs for sequences like aₙ = 5ⁿ, aₙ = n!, and extract coefficients.
    • Example: aₙ = 5ⁿ → EGF is e^(5x).
    • To find aₙ from an EGF, extract the coefficient on xⁿ/n! and multiply by n!.

🔤 Counting strings with character constraints

  • Exercises 23–25: count strings of length n from {a, b, c, d} with restrictions like "at least one a," "even number of c's," "odd number of b's."
  • Method: each character contributes a factor based on its constraint.
    • "At least one a": e^x - 1 (all counts except zero).
    • "Even number of c's": (e^x + e^(-x))/2 = cosh(x).
    • "Odd number of b's": (e^x - e^(-x))/2 = sinh(x).
  • Exercise 26: alphanumeric strings with vowel, digit parity, and letter count constraints—multiply many factors and extract the coefficient.
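
As a hedged check of these factors (our own combination, not stated in the exercise set): multiplying e^(3x) for the three unrestricted letters by cosh(x) for c gives (e^(4x) + e^(2x))/2, predicting (4ⁿ + 2ⁿ)/2 strings with an even number of c's. Brute force agrees for small n:

```python
from itertools import product

def brute_force(n):
    """Count length-n strings over {a,b,c,d} with an even number of c's."""
    return sum(s.count("c") % 2 == 0 for s in product("abcd", repeat=n))

def from_egf(n):
    # e^(3x) * cosh(x) = (e^(4x) + e^(2x)) / 2, so a_n = (4^n + 2^n) / 2
    return (4**n + 2**n) // 2

for n in range(1, 8):
    assert brute_force(n) == from_egf(n)

print(from_egf(4))  # 136
```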

⚠️ Common confusion: OGF vs. EGF

  • Use an OGF when distributing indistinguishable items or when order doesn't matter.
  • Use an EGF when counting arrangements/strings or permutations of labeled objects.
  • Exercise 15b hint: despite order mattering (coins inserted one-by-one), the problem asks for an OGF approach—check small cases by hand to verify your generating function.

🛠️ Techniques and tools

🛠️ Partial fractions and series expansion

  • Exercise 11 hint: use partial fractions to decompose rational generating functions, then expand each term as a power series.
  • Identity provided: p(x)(1 + x + x² + ... + x^k) = p(x)(1 - x^(k+1))/(1 - x), useful for finite geometric series.

💻 Computer algebra systems

  • The excerpt encourages hand calculation for learning but suggests CAS (e.g., SageMath) for:
    • Extracting coefficients from complex generating functions.
    • Finding partial fraction expansions.
    • Verifying large coefficients (e.g., Exercise 18: smallest integer with ≥1000 partitions).
  • SageMath snippets provided: f(x).series(x, n) for series expansion, partial_fraction() for decomposition.

🎓 Pedagogical notes

  • "Solve by hand unless suggested otherwise" to build intuition.
  • Check small cases manually (e.g., coefficients on x², x³) to verify your generating function before computing large coefficients.
  • Exercise 14: reverse-engineer a combinatorial problem from a given generating function—tests deep understanding of how constraints map to algebraic factors.

Linear Recurrence Equations: Introduction

9.1 Introduction

🧭 Overview

🧠 One-sentence thesis

This chapter presents a systematic treatment of linear recurrence equations with the goal of finding closed-form expressions for functions defined recursively, focusing on how to move from recursive definitions (which depend on earlier values) to explicit formulas depending only on n.

📌 Key points (3–5)

  • What recurrences are: equations that define a function's value at n in terms of earlier values of the function, rather than directly in terms of n alone.
  • The chapter's goal: to find closed-form (explicit) expressions for recursively defined functions whenever possible, focusing on linear recurrence equations.
  • Key examples: Fibonacci numbers, string-counting problems with restrictions, and geometric problems (lines dividing the plane into regions) all lead to linear recurrences.
  • Common confusion: homogeneous vs non-homogeneous—homogeneous recurrences have zero on the right-hand side; non-homogeneous ones include an additional function of n.
  • Two solution methods: the chapter will use both direct techniques for linear recurrences and generating functions (revisiting Chapter 8 material).

🐰 The Fibonacci sequence

🐰 The rabbit story and recurrence

The excerpt introduces the Fibonacci sequence through a rabbit population model:

  • A pair of newborn rabbits is placed on an island.
  • Rabbits cannot reproduce until their third month of life, but then produce one new pair each month.
  • Rabbits live indefinitely and have no predators.

Let f_n denote the number of pairs of rabbits in month n.

Fibonacci recurrence: f_n = f_(n−1) + f_(n−2) for n ≥ 3, with f_1 = f_2 = 1 (and f_0 = 0).

Why this recurrence works:

  • In month n, all pairs from month n−1 are still present: f_(n−1) pairs.
  • Only pairs born before month n−1 can reproduce in month n: that's f_(n−2) pairs.
  • Each reproducing pair adds one new pair.
  • Total: f_n = f_(n−1) + f_(n−2).

Example: The first 21 terms (f_0 through f_20) are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765.

📊 Ratios of consecutive terms

The excerpt shows that the ratios f_(n+1) / f_n for n = 1 to 18 appear to converge to a single number (around 1.618...).

Questions raised:

  • Can we determine this limiting number?
  • Does this number relate to an explicit formula for f_n?
  • How can we compute very large terms like f_1000 without recursion?
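
A short numeric experiment (not in the text) suggests the ratios settle on the golden ratio (1 + √5)/2 ≈ 1.618:

```python
phi = (1 + 5**0.5) / 2   # golden ratio, the apparent limit

a, b = 1, 1              # f_1, f_2
ratios = []
for _ in range(30):
    a, b = b, a + b      # advance the Fibonacci sequence
    ratios.append(b / a)

print(ratios[-1])        # already agrees with phi to many digits
assert abs(ratios[-1] - phi) < 1e-12
```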

🧱 Fibonacci in other contexts

The Fibonacci sequence appears in many problems beyond rabbits.

Example: Tiling a checkerboard

  • Let c_n count the number of ways to cover a 2×n checkerboard with 2×1 tiles.
  • c_1 = 1 and c_2 = 2.
  • Recurrence: c_(n+2) = c_(n+1) + c_n.
    • If the rightmost column contains a vertical tile, the rest can be tiled in c_(n+1) ways.
    • If the rightmost two columns contain two horizontal tiles, the rest can be tiled in c_n ways.
  • This is the same recurrence as Fibonacci, so c_n = f_(n+1).

🔤 Recurrences from string-counting problems

🔤 Binary strings with no consecutive 1's

Let a_n count binary strings of length n with no two consecutive 1's.

Base cases:

  • a_1 = 2 (both "0" and "1" are valid).
  • a_2 = 3 (only "11" is invalid among four strings).
  • a_3 = 5 (three bad strings: "110", "011", "111").

Recurrence: a_(n+2) = a_(n+1) + a_n.

Why:

  • Partition valid strings by their last bit.
  • If the last bit is 0, the first n+1 positions can be any valid string of length n+1: a_(n+1) strings.
  • If the last bit is 1, the preceding bit must be 0, so the first n positions can be any valid string of length n: a_n strings.
  • Total: a_(n+2) = a_(n+1) + a_n.

Result: This sequence is the Fibonacci sequence shifted by two positions: a_n = f_(n+2) (check: a_1 = 2 = f_3 and a_2 = 3 = f_4).

🔤 Ternary strings avoiding "20" substring

Let t_n count ternary strings of length n that never have "20" as a substring in consecutive positions.

Base cases:

  • t_1 = 3 (all three single-character strings are valid).
  • t_2 = 8 (only "20" is invalid among nine strings).

Recurrence: t_(n+2) = 3·t_(n+1) − t_n.

Why:

  • Partition valid strings by their last character.
  • If the last character is 2 or 1, the first n+1 characters can be any valid string of length n+1: 2·t_(n+1) strings.
  • If the last character is 0, the first n+1 characters must form a valid string that does not end in 2. The number of such strings is t_(n+1) − t_n.
  • Total: t_(n+2) = 2·t_(n+1) + (t_(n+1) − t_n) = 3·t_(n+1) − t_n.

Result: t_3 = 21.
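
The base cases, the value t₃ = 21, and the recurrence itself can all be confirmed by brute force for small n:

```python
from itertools import product

def t(n):
    """Ternary strings of length n with no '20' in consecutive positions."""
    return sum("20" not in "".join(s) for s in product("012", repeat=n))

assert (t(1), t(2), t(3)) == (3, 8, 21)
for n in range(1, 7):
    assert t(n + 2) == 3 * t(n + 1) - t(n)

print(t(4))  # 55
```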

📐 Lines and regions in the plane

📐 The geometric problem

A family of n lines in the plane satisfies:

  • Each pair of lines intersects.
  • No point belongs to more than two lines (no three lines meet at a point).

Question: How many regions do these n lines determine?

Let r_n denote the number of regions.

Base cases:

  • r_1 = 2.
  • r_2 = 4.
  • r_3 = 7.
  • r_4 = 11.

📐 The recurrence

Recurrence: r_(n+1) = r_n + (n + 1).

Why:

  • Consider n+1 lines. Choose one line and call it l.
  • Line l intersects each of the other n lines at distinct points (since no three lines meet).
  • Label these intersection points consecutively as x_1, x_2, ..., x_n.
  • These n points divide line l into n+1 segments (two of which are infinite).
  • Each segment partitions one of the r_n regions (determined by the other n lines) into two parts.
  • Total: r_n original regions plus n+1 new regions created by l.

Don't confuse: This recurrence is non-homogeneous (the right-hand side includes n+1, not just earlier terms).

🔧 Defining linear recurrence equations

🔧 General form

Linear recurrence equation: An equation of the form
c_0·a_(n+k) + c_1·a_(n+k−1) + c_2·a_(n+k−2) + ... + c_k·a_n = φ(n),
where k ≥ 1 is an integer, c_0, c_1, ..., c_k are constants with c_0 ≠ 0 and c_k ≠ 0, and φ is a function.

Key features:

  • The equation is linear because each term involves a at a single index, not products or powers of a.
  • The coefficients c_i are constants (not functions of n).
  • The function φ(n) on the right-hand side may depend on n but not on any a values.

🔧 Homogeneous vs non-homogeneous

  • Homogeneous: φ(n) = 0 (the right-hand side is zero). Example: Fibonacci, a_(n+2) − a_(n+1) − a_n = 0.
  • Non-homogeneous: φ(n) ≠ 0 (the right-hand side is a non-zero function of n). Example: lines and regions, r_(n+1) = r_n + (n+1).

Don't confuse: The term "homogeneous" refers only to whether the right-hand side is zero, not to whether the coefficients are constant.

🔧 Terminology note

The excerpt notes that "linear recurrence equation with constant coefficients" is more precise, but the chapter uses "linear recurrence equation" as shorthand, reserving "linear recurrence equation with nonconstant coefficients" for cases where the c_i depend on n.


Linear Recurrence Equations

9.2 Linear Recurrence Equations

🧭 Overview

🧠 One-sentence thesis

Linear recurrence equations allow us to compute the nth term of a sequence from previous values using a fixed formula with constant coefficients, and they can be solved systematically using advancement operator techniques analogous to solving differential equations.

📌 Key points (3–5)

  • What a linear recurrence is: an equation of the form c₀aₙ₊ₖ + c₁aₙ₊ₖ₋₁ + ... + cₖaₙ = g(n), where the coefficients are constants and c₀, cₖ ≠ 0.
  • Homogeneous vs nonhomogeneous: homogeneous means the right-hand side is zero; nonhomogeneous means it is some nonzero function g(n).
  • Common confusion: the advancement operator A shifts the index forward (Af(n) = f(n+1)), similar to how the differential operator D takes derivatives in continuous math.
  • Key technique: rewrite recurrence equations as operator-polynomial equations p(A)f = g, making them analogous to differential equations.
  • Why it matters: provides systematic methods to find explicit formulas for sequences defined recursively.

🔍 What makes a recurrence linear

📐 Definition and form

A recurrence equation is linear when it has the form c₀aₙ₊ₖ + c₁aₙ₊ₖ₋₁ + c₂aₙ₊ₖ₋₂ + ... + cₖaₙ = g(n), where k ≥ 1 is an integer, c₀, c₁, ..., cₖ are constants with c₀, cₖ ≠ 0, and g: ℕ → ℝ is a function.

  • The term "linear" means each term involves only a single sequence value (no products like aₙ·aₙ₊₁).
  • The coefficients c₀, c₁, ..., cₖ must be constants (not functions of n).
  • If coefficients depend on n, we call it a "linear recurrence with nonconstant coefficients."

🔀 The order parameter k

  • The integer k tells us how many previous terms we need to compute the next term.
  • Example: Fibonacci has k = 2 because aₙ₊₂ = aₙ₊₁ + aₙ (we need two previous values).
  • The restrictions c₀ ≠ 0 and cₖ ≠ 0 ensure the equation genuinely involves k steps.

🏠 Homogeneous vs nonhomogeneous equations

✅ Homogeneous equations

A linear equation is homogeneous if the function on the right-hand side is the zero function.

  • In other words, the equation equals zero: c₀aₙ₊ₖ + c₁aₙ₊ₖ₋₁ + ... + cₖaₙ = 0.
  • Example: Fibonacci satisfies aₙ₊₂ - aₙ₊₁ - aₙ = 0 (homogeneous, k = 2, c₀ = 1, cₖ = -1).
  • Example: The ternary string sequence satisfies tₙ₊₂ - 3tₙ₊₁ + tₙ = 0 (homogeneous, k = 2, c₀ = cₖ = 1).

❌ Nonhomogeneous equations

  • The right-hand side is some nonzero function of n.
  • Example: The lines-and-regions sequence satisfies rₙ₊₁ - rₙ = n + 1 (nonhomogeneous, k = 1, c₀ = 1, cₖ = -1).
  • Nonhomogeneous equations are "more tricky" and require "guess-and-test" methods beyond the systematic techniques for homogeneous equations.

🆚 Comparison table

  • Homogeneous: the right-hand side is zero, e.g., aₙ₊₂ - aₙ₊₁ - aₙ = 0 (Fibonacci); fully resolvable by systematic techniques.
  • Nonhomogeneous: the right-hand side is a nonzero function of n, e.g., rₙ₊₁ - rₙ = n + 1 (lines/regions); solved by guess-and-test methods.

🔧 The advancement operator

🎯 What the advancement operator does

The advancement operator A is a function A: V → V defined by Af(n) = f(n+1), where V is the vector space of functions from integers to complex numbers.

  • A "advances" the index by 1: it shifts from position n to position n+1.
  • More generally, A^p f(n) = f(n+p) when p is a positive integer.
  • This is analogous to the differential operator D in calculus, where Df represents the derivative of f.

🧮 Applying operator polynomials

  • We can form polynomials of A, like 3A² - 5A + 4.
  • To apply such a polynomial to f at n = 0: (3A² - 5A + 4)f(0) = 3f(2) - 5f(1) + 4f(0).
  • Example from excerpt: if f(n) = 7n - 9, then (3A² - 5A + 4)f(0) = 3(5) - 5(-2) + 4(-9) = -11.
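
The operator calculation can be mirrored in code by treating A as a function that shifts the argument; this reproduces the −11 above:

```python
def A(f):
    """Advancement operator: (Af)(n) = f(n + 1)."""
    return lambda n: f(n + 1)

def f(n):
    return 7 * n - 9

# (3A^2 - 5A + 4) f, evaluated pointwise
g = lambda n: 3 * A(A(f))(n) - 5 * A(f)(n) + 4 * f(n)
print(g(0))  # 3*f(2) - 5*f(1) + 4*f(0) = 15 + 10 - 36 = -11
```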

🔄 Rewriting recurrences with A

  • Any linear recurrence can be rewritten as p(A)f = g, where p(A) = c₀A^k + c₁A^(k-1) + ... + cₖ and g is the function on the right-hand side.
  • This operator notation makes the structure clearer and connects to differential equation techniques.
  • Don't confuse: A is not multiplication; it's an operator that shifts the argument.

🧪 Simple examples with the advancement operator

🔢 Solving a basic advancement equation

Example from excerpt: Suppose sₙ₊₁ = 2sₙ for n ≥ 0 and s₀ = 3. Find an explicit formula for sₙ.

In operator notation:

  • Define f(n) = sₙ for n ≥ 0.
  • The condition becomes Af(n) = 2f(n) with initial condition f(0) = 3.

Finding the pattern:

  • What function satisfies "advancing it gives twice its value"? Answer: 2^n.
  • For any constant c, the function c·2^n also satisfies Af(n) = 2f(n).
  • To satisfy f(0) = 3, we need c·2⁰ = c = 3.
  • Therefore, sₙ = f(n) = 3·2^n.

Verification:

  • Initial condition: f(0) = 3·2⁰ = 3 ✓
  • Advancement equation: Af(n) = 3·2^(n+1) = 3·2·2^n = 2·(3·2^n) = 2f(n) ✓

🔗 Analogy to differential equations

Example from excerpt: Solve Df = 3f if f(0) = 2.

Continuous case (differential equation):

  • Question: find a function whose derivative is three times itself.
  • Answer: e^(3x) has this property (D(e^(3x)) = 3e^(3x)).
  • General solution: f(x) = c·e^(3x) for any constant c.
  • With f(0) = 2, we get c = 2, so f(x) = 2e^(3x).

Discrete case (advancement equation):

  • Question: find a function that, when advanced, gives twice itself.
  • Answer: 2^n has this property (A(2^n) = 2^(n+1) = 2·2^n).
  • General solution: f(n) = c·2^n for any constant c.
  • With f(0) = 3, we get c = 3, so f(n) = 3·2^n.

Key parallel:

  • Both use operator notation (D for derivative, A for advancement).
  • Both find a "base" function with the desired property.
  • Both use a constant multiplier to satisfy initial conditions.
  • Don't confuse: D operates on continuous functions; A operates on sequences (functions from integers to complex numbers).

🎓 The broader framework

📚 Vector space perspective

  • Functions from integers to complex numbers form a vector space V.
  • The advancement operator A is a linear operator on V.
  • This parallels how infinitely-differentiable functions form a vector space with D as a linear operator.
  • The excerpt notes that no linear algebra background is required, but this perspective provides rigorous footing.

🎯 Goals of the approach

  • For homogeneous equations: develop techniques to "fully resolve" them systematically.
  • For nonhomogeneous equations: discuss "guess-and-test" methods, which are more challenging.
  • The operator notation makes the structure transparent and leverages analogies to differential equations.

Advancement Operators

9.3 Advancement Operators

🧭 Overview

🧠 One-sentence thesis

Advancement operators provide a systematic algebraic framework for solving recurrence equations by transforming them into polynomial equations whose roots and factors directly determine the solution structure.

📌 Key points (3–5)

  • What advancement operators do: Transform recurrence equations into operator polynomial equations of the form p(A)f = g, where Af(n) = f(n+1).
  • Core solving method: Factor the polynomial p(A) to find roots; each root r gives a solution term of the form c·rⁿ.
  • Homogeneous vs nonhomogeneous: Homogeneous equations (p(A)f = 0) have k-parameter families of solutions for degree-k polynomials; nonhomogeneous equations require finding one particular solution plus the general homogeneous solution.
  • Common confusion—repeated roots: When a root r appears with multiplicity m, the solution includes terms c₁rⁿ, c₂n·rⁿ, c₃n²·rⁿ, ..., up to cₘn^(m-1)·rⁿ, not just multiple copies of c·rⁿ.
  • Why it matters: This method converts recurrence problems into polynomial factoring problems, providing explicit formulas instead of recursive definitions.

🔧 The advancement operator mechanism

🔧 Definition and basic operation

Advancement operator A: A function A: V → V defined by Af(n) = f(n+1), where V is the vector space of functions from integers to complex numbers.

  • The operator "advances" the input by one step: it shifts n to n+1.
  • More generally, A^p f(n) = f(n+p) for positive integer p.
  • Example: If f(n) = 7n - 9, then (3A² - 5A + 4)f(0) = 3f(2) - 5f(1) + 4f(0) = 3(5) - 5(-2) + 4(-9) = -11.

📝 Rewriting recurrences as operator equations

A linear recurrence equation can be rewritten using a polynomial p(A) of the advancement operator:

p(A)f = (c₀A^k + c₁A^(k-1) + c₂A^(k-2) + ... + cₖ)f = g

  • Here k ≥ 1 is an integer, g is a fixed function, and c₀, c₁, ..., cₖ are constants with c₀, cₖ ≠ 0.
  • Since c₀ ≠ 0, we can divide both sides by c₀, so we may assume c₀ = 1 when convenient.
  • Example: The recurrence sₙ₊₁ = 2sₙ becomes Af(n) = 2f(n), or (A - 2)f = 0.

🎯 Why the constant term matters

The excerpt restricts attention to p(A) where the constant term is nonzero (i.e., 0 is not a root).

  • If 0 is a root of multiplicity m, then p(A) = A^m q(A) where q has nonzero constant term.
  • The equation p(A)f = g becomes A^m q(A)f = g.
  • Solutions to the original problem are translations of solutions to q(A)f = g: if h' solves q(A)f = g, then h(n) = h'(n+m) solves the original.
  • Special case: A^m f = g requires f(n+m) = g(n), so f is just a translation of g.

🏠 Homogeneous equations with distinct roots

🏠 The basic pattern

For homogeneous equations p(A)f = 0 where p has degree k with distinct roots r₁, r₂, ..., rₖ:

General solution: f(n) = c₁r₁ⁿ + c₂r₂ⁿ + c₃r₃ⁿ + ... + cₖrₖⁿ

  • Each root r of p(A) contributes a solution term c·rⁿ.
  • The general solution is a k-parameter family (the constants c₁, c₂, ..., cₖ).
  • Example: (A+3)(A-2)f = 0 has roots -3 and 2, giving f(n) = c₁(-3)ⁿ + c₂2ⁿ.

🔍 Why this works—factoring insight

When p(A) = (A - r₁)(A - r₂)...(A - rₖ):

  • If (A - r)f = 0, then f(n+1) = r·f(n), which is satisfied by f(n) = c·rⁿ.
  • Since the factors commute, (A+3)(A-2)f₁ = (A-2)(A+3)f₁, so f₁(n) = c₁2ⁿ solves the full equation.
  • Similarly, f₂(n) = c₂(-3)ⁿ solves it.
  • Combining them: f(n) = c₁2ⁿ + c₂(-3)ⁿ gives all solutions (a 2-parameter family for a degree-2 polynomial).

📊 Using initial conditions

Initial conditions determine the specific constants.

Example: For t_{n+2} = 3t_{n+1} - t_n with t₁ = 3, t₂ = 8:

  • Operator equation: (A² - 3A + 1)t = 0
  • Roots: (3 ± √5)/2
  • General solution: t(n) = c₁((3+√5)/2)ⁿ + c₂((3-√5)/2)ⁿ
  • Evaluating at n = 0 and n = 1 (with indices aligned so that t(0) = 3 and t(1) = 8) gives two equations in c₁, c₂
  • Solving yields c₁ = 7√5/10 + 3/2 and c₂ = -7√5/10 + 3/2

Don't confuse: Even though the solution contains square roots and fractions, it always evaluates to an integer when counting objects (like strings).
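
A floating-point check makes this concrete; the stated constants satisfy t(0) = 3 and t(1) = 8 (indices shifted by one relative to t₁, t₂), and the closed form rounds to the integer sequence:

```python
from math import sqrt

r1, r2 = (3 + sqrt(5)) / 2, (3 - sqrt(5)) / 2
c1 = 7 * sqrt(5) / 10 + 3 / 2
c2 = -7 * sqrt(5) / 10 + 3 / 2

def closed(n):
    return c1 * r1**n + c2 * r2**n

# recurrence values, aligned so t(0) = 3 and t(1) = 8
seq = [3, 8]
for _ in range(10):
    seq.append(3 * seq[-1] - seq[-2])

assert all(round(closed(n)) == v for n, v in enumerate(seq))
print([round(closed(n)) for n in range(5)])  # [3, 8, 21, 55, 144]
```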

🔁 Repeated roots

🔁 The multiplicity pattern

When a root r appears with multiplicity m in p(A):

General solution for (A - r)^m f = 0: f(n) = c₁rⁿ + c₂n·rⁿ + c₃n²·rⁿ + ... + cₘn^(m-1)·rⁿ

  • Each power of n from 0 to m-1 appears as a coefficient multiplying rⁿ.
  • Example: (A-2)²f = 0 has solution f(n) = c₁2ⁿ + c₂n·2ⁿ.

🧪 Verification example

For (A-2)²f₂ where f₂(n) = c₂n·2ⁿ:

  1. Apply (A-2) once: (A-2)f₂(n) = c₂(n+1)2^(n+1) - 2c₂n·2ⁿ = c₂2^(n+1)
  2. Apply (A-2) again: (A-2)(c₂2^(n+1)) = c₂2^(n+2) - 2c₂2^(n+1) = 0 ✓

🎯 Combined example

For p(A) = (A+5)(A-1)³:

  • Root -5 (multiplicity 1): contributes c₁(-5)ⁿ
  • Root 1 (multiplicity 3): contributes c₂(1)ⁿ + c₃n(1)ⁿ + c₄n²(1)ⁿ = c₂ + c₃n + c₄n²
  • General solution: f(n) = c₁(-5)ⁿ + c₂ + c₃n + c₄n²

Don't confuse: You cannot use c₂(1)ⁿ + c₃(1)ⁿ + c₄(1)ⁿ = (c₂+c₃+c₄)(1)ⁿ—that's only one parameter, not three.

🎪 Nonhomogeneous equations

🎪 Two-step solution method

For nonhomogeneous equations p(A)f = g(n) where g(n) ≠ 0:

  1. Find the general solution f₁ to the homogeneous equation p(A)f = 0
  2. Find any particular solution f₀ to the nonhomogeneous equation p(A)f = g(n)
  3. Combine them: General solution is f = f₀ + f₁

Why this works (Lemma 9.20): If f₀ is a particular solution and f is any solution, then f₁ = f - f₀ satisfies p(A)f₁ = p(A)f - p(A)f₀ = g(n) - g(n) = 0.

🎯 Finding a particular solution—first guess

The best starting point: try something that looks like the right-hand side g(n), with unknown constants.

Example: For (A+2)(A-6)f = 3ⁿ:

  • Try f₀(n) = d·3ⁿ
  • Apply the operator: (A+2)(A-6)(d·3ⁿ) = ... = -5d·3^(n+1)
  • Set equal to 3ⁿ: -5d·3^(n+1) = 3ⁿ gives d = -1/15
  • Particular solution: f₀(n) = (-1/15)·3ⁿ
  • General solution: f(n) = (-1/15)·3ⁿ + c₁(-2)ⁿ + c₂6ⁿ

⚠️ When the first guess fails

If your guess is a solution to the homogeneous equation, it gets "annihilated" by p(A) and gives zero.

Example: For (A+2)(A-6)f = 6ⁿ:

  • Trying f₀(n) = d·6ⁿ fails because (A-6)(d·6ⁿ) = 0 (since 6 is a root of p(A))
  • Fix: Multiply by n to get f₀(n) = d·n·6ⁿ
  • This works: (A+2)(A-6)(dn·6ⁿ) = 48d·6ⁿ
  • Setting 48d·6ⁿ = 6ⁿ gives d = 1/48
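
Both particular solutions can be verified by applying the expanded operator (A+2)(A−6) = A² − 4A − 12 pointwise:

```python
def apply_p(f, n):
    """Apply (A + 2)(A - 6) = A^2 - 4A - 12 to f at n."""
    return f(n + 2) - 4 * f(n + 1) - 12 * f(n)

f_a = lambda n: (-1 / 15) * 3**n     # guessed solution for right-hand side 3^n
f_b = lambda n: (1 / 48) * n * 6**n  # adjusted guess for right-hand side 6^n

for n in range(8):
    assert abs(apply_p(f_a, n) - 3**n) < 1e-9 * 3**n
    assert abs(apply_p(f_b, n) - 6**n) < 1e-9 * 6**n
print("both particular solutions verified")
```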

📐 Mixed-type right-hand sides

For equations like (A-2)²f = 3ⁿ + 2n:

  • Try f₀(n) = d₁·3ⁿ + d₂n + d₃ (include lower powers of n)
  • If a term is annihilated by p(A), increase its power (e.g., constant → n, n → n²)
  • Match coefficients on both sides to solve for the d values

Example: For (A-1)r = n+1 (from the lines-in-plane problem):

  • Homogeneous solution: f₁(n) = c₁
  • Try f₀(n) = d₁n + d₂, but d₂ is annihilated (constant is a homogeneous solution)
  • Increase powers: try f₀(n) = d₁n² + d₂n
  • Solving gives d₁ = 1/2, d₂ = 1/2
  • General solution: f(n) = c₁ + n²/2 + n/2
  • With initial condition f(1) = 2: f(n) = 1 + (n²+n)/2 = (n+1 choose 2) + 1
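
The closed form is easy to test against the recurrence and initial condition:

```python
def regions(n):
    """Closed form: r(n) = 1 + n(n+1)/2, i.e. C(n+1, 2) + 1."""
    return 1 + n * (n + 1) // 2

assert regions(1) == 2
for n in range(1, 50):
    assert regions(n + 1) == regions(n) + (n + 1)

print([regions(n) for n in range(1, 5)])  # [2, 4, 7, 11]
```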

📚 Theoretical foundation

📚 Vector space structure

The set V of functions from integers to complex numbers forms a vector space:

  • Addition: (f + g)(n) = f(n) + g(n)
  • Scalar multiplication: (αf)(n) = α·f(n)
  • The advancement operator A is a linear operator: A(f+g) = Af + Ag and A(αf) = α(Af)

🎓 The principal theorem (Theorem 9.18)

For p(A) = c₀A^k + c₁A^(k-1) + ... + cₖ with c₀, cₖ ≠ 0:

The set W of all solutions to p(A)f = 0 is a k-dimensional subspace of V.

This means:

  • Every solution is a linear combination of k basis solutions
  • To solve p(A)f = 0, find k linearly independent solutions (a basis for W)
  • The general solution is the span of these basis solutions

🔢 Reading off solutions from factored polynomials

Once p(A) is factored, the general solution can be written immediately:

Each factor of p(A) contributes to the general solution as follows:

  • (A − r): c·rⁿ
  • (A − r)²: c₁rⁿ + c₂n·rⁿ
  • (A − r)³: c₁rⁿ + c₂n·rⁿ + c₃n²·rⁿ
  • (A − r)^m: c₁rⁿ + c₂n·rⁿ + ... + cₘn^(m-1)·rⁿ

Example: For (A-1)⁵(A+1)³(A-3)²(A+8)(A-9)⁴f = 0, the general solution has 15 terms (5+3+2+1+4 parameters).

🔄 Alternative method: Generating functions

🔄 The generating function approach

Instead of using advancement operators, encode the sequence as a generating function f(x) = Σrₙxⁿ.

Key insight: In xf(x), the coefficient on xⁿ is rₙ₋₁; in x²f(x), it's rₙ₋₂.

Example: For rₙ + rₙ₋₁ - 6rₙ₋₂ = 0 with r₀=1, r₁=3:

  • Multiply the recurrence by xⁿ and sum
  • Align terms: f(x) + xf(x) - 6x²f(x) = (terms from initial conditions)
  • Solve for f(x): f(x) = (1+4x)/(1+x-6x²)
  • Use partial fractions to expand
  • Read off rₙ as the coefficient of xⁿ
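
Carrying out the partial-fraction step (our own computation; the excerpt stops at "expand") gives rₙ = (6·2ⁿ − (−3)ⁿ)/5, which matches the recurrence:

```python
def r_closed(n):
    # partial fractions of (1 + 4x)/((1 + 3x)(1 - 2x)) give
    # r_n = (6 * 2^n - (-3)^n) / 5   (this decomposition is our own step)
    return (6 * 2**n - (-3)**n) // 5

seq = [1, 3]  # r_0 = 1, r_1 = 3
for _ in range(18):
    seq.append(-seq[-1] + 6 * seq[-2])  # r_n = -r_{n-1} + 6 r_{n-2}

assert seq == [r_closed(n) for n in range(20)]
print(seq[:5])  # [1, 3, 3, 15, 3]
```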

⚖️ Comparison of methods

  • Homogeneous equations: straightforward with advancement operators once p(A) is factored; generating functions require partial fractions.
  • Nonhomogeneous equations: advancement operators need a guessed particular-solution form; generating functions handle the right-hand side automatically.
  • Nonlinear recurrences: advancement operators do not apply; generating functions can sometimes work.
  • Computational efficiency: advancement operators give an explicit formula; generating functions may require symbolic computation.

Don't confuse: Both methods give the same answer, but one may be easier depending on the problem structure.


Solving Advancement Operator Equations

9.4 Solving advancement operator equations

🧭 Overview

🧠 One-sentence thesis

Advancement operator equations can be solved systematically by factoring the polynomial operator, using roots to build solutions for homogeneous equations, and then adding a particular solution for nonhomogeneous cases.

📌 Key points (3–5)

  • Homogeneous equations: When the right-hand side is zero, the general solution is a k-parameter family built from the roots of the polynomial p(A), where k is the degree of p.
  • Repeated roots require special handling: A root of multiplicity m contributes terms like c₁ rⁿ, c₂ n rⁿ, c₃ n² rⁿ, ..., up to n^(m-1) rⁿ.
  • Nonhomogeneous equations: Solve in two steps—find the general solution to the corresponding homogeneous equation, then find any particular solution to the nonhomogeneous equation; the general solution is their sum.
  • Common confusion: When guessing a particular solution, avoid terms that solve the homogeneous equation (they get "annihilated" by the operator); multiply by n or increase powers if necessary.
  • Why it matters: These techniques give explicit closed-form formulas for recurrence relations that arise in counting problems, algorithm analysis, and other applications.

🏠 Homogeneous equations: building solutions from roots

🔍 What is a homogeneous equation?

A homogeneous advancement operator equation has the form p(A) f = 0, where p(A) is a polynomial in the advancement operator A and the constant term of p is nonzero.

  • The right-hand side is zero (no "forcing term").
  • The polynomial p(A) = c₀ + c₁ A + c₂ A² + ... + cₖ Aᵏ with c₀ ≠ 0 and cₖ ≠ 0.
  • The degree k tells us how many parameters the general solution will have.

🌱 Simple roots: one parameter per root

When p(A) factors into distinct linear factors, each root r contributes one exponential solution.

Example from the excerpt:

  • Equation: (A² + A − 6) f = 0
  • Factor: p(A) = (A + 3)(A − 2)
  • Roots: r₁ = 2, r₂ = −3
  • General solution: f(n) = c₁ 2ⁿ + c₂ (−3)ⁿ

Why this works:

  • If (A − r) f = 0, then f(n+1) = r f(n), which is satisfied by f(n) = c rⁿ.
  • When p(A) = (A − r₁)(A − r₂), both f₁(n) = c₁ r₁ⁿ and f₂(n) = c₂ r₂ⁿ are solutions.
  • Because the factors (A − r₁) and (A − r₂) commute, each fᵢ is annihilated by the full product; by linearity, the sum f₁ + f₂ is also a solution.
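A minimal Python check of the distinct-roots example (the constants c₁, c₂ below are arbitrary illustrative choices, not from the text):

```python
# Verify that f(n) = c1*2**n + c2*(-3)**n satisfies (A^2 + A - 6) f = 0,
# i.e. f(n+2) + f(n+1) - 6*f(n) = 0 for every n.

def f(n, c1=7, c2=-4):  # any c1, c2 work; these are arbitrary
    return c1 * 2**n + c2 * (-3)**n

for n in range(20):
    assert f(n + 2) + f(n + 1) - 6 * f(n) == 0
```

The assertion passes for every choice of c₁ and c₂, illustrating the two-parameter family of solutions.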

🔁 Repeated roots: multiply by powers of n

When a root r appears with multiplicity m, you need m linearly independent solutions.

For a root r of multiplicity m, the solution terms are: c₁ rⁿ, c₂ n rⁿ, c₃ n² rⁿ, ..., c_m n^(m−1) rⁿ.

Example from the excerpt:

  • Equation: (A − 2)² f = 0
  • Root: r = 2 with multiplicity 2
  • General solution: f(n) = c₁ 2ⁿ + c₂ n 2ⁿ

Verification sketch:

  • Try f₂(n) = c₂ n 2ⁿ.
  • Apply (A − 2)²: first application gives c₂ 2^(n+1), second application yields 0.
  • Don't confuse: you cannot use c₁ 2ⁿ + c₂ 2ⁿ = (c₁ + c₂) 2ⁿ, which is still only one parameter.
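The verification sketch above can be run directly; a small helper applies (A − 2) as a function transformer:

```python
# Sketch: apply the operator (A - 2) twice to g(n) = n * 2**n and confirm
# the result is identically zero, i.e. (A - 2)^2 annihilates n * 2**n.

def g(n):
    return n * 2**n

def apply_A_minus_2(h):
    # (A - 2) h is the function n -> h(n+1) - 2*h(n)
    return lambda n: h(n + 1) - 2 * h(n)

once = apply_A_minus_2(g)       # equals 2**(n+1), not yet zero
twice = apply_A_minus_2(once)   # zero for every n

for n in range(15):
    assert once(n) == 2 ** (n + 1)
    assert twice(n) == 0
```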

📐 Combining roots: the general pattern

Situation | Root structure | Solution form
Distinct roots r₁, r₂, ..., rₖ | Each root simple | c₁ r₁ⁿ + c₂ r₂ⁿ + ... + cₖ rₖⁿ
Root r with multiplicity m | Repeated root | c₁ rⁿ + c₂ n rⁿ + ... + c_m n^(m−1) rⁿ
Mixed | Some simple, some repeated | Combine both patterns

Example from the excerpt:

  • Equation: (A + 5)(A − 1)³ f = 0
  • Roots: r₁ = −5 (simple), r₂ = 1 (multiplicity 3)
  • General solution: f(n) = c₁ (−5)ⁿ + c₂ + c₃ n + c₄ n²

🎯 Nonhomogeneous equations: adding a particular solution

🧩 The two-step method

To solve p(A) f = g(n) (nonhomogeneous), find:

  1. The general solution f₁ to the corresponding homogeneous equation p(A) f = 0.
  2. Any particular solution f₀ to the nonhomogeneous equation.

Then the general solution is f = f₀ + f₁.

Why this works:

  • If p(A) f₀ = g(n) and p(A) f₁ = 0, then p(A)(f₀ + f₁) = g(n) + 0 = g(n).
  • The parameters in f₁ allow you to satisfy any initial conditions.

🔮 Guessing a particular solution

The key heuristic: try something that looks like the right-hand side g(n), introducing unknown constants.

Example from the excerpt:

  • Equation: (A + 2)(A − 6) f = 3ⁿ
  • Guess: f₀(n) = d 3ⁿ
  • Apply the operator and solve for d: you get d = −1/15.
  • General solution: f(n) = −(1/15) 3ⁿ + c₁ (−2)ⁿ + c₂ 6ⁿ
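The value d = −1/15 can be recovered and double-checked numerically; the constants c₁, c₂ below are arbitrary:

```python
from fractions import Fraction

# Substituting the guess f0(n) = d*3**n into (A + 2)(A - 6) f = 3**n:
# acting on 3**n, the operator multiplies by (3 + 2)(3 - 6) = -15, so d = -1/15.

r = 3                       # base of the right-hand side 3**n
factor = (r + 2) * (r - 6)  # what (A + 2)(A - 6) multiplies r**n by
d = Fraction(1, factor)
assert d == Fraction(-1, 15)

# Check that f(n) = d*3^n + c1*(-2)^n + c2*6^n solves the full equation,
# using the expanded operator (A + 2)(A - 6) = A^2 - 4A - 12.
def f(n, c1=2, c2=5):       # arbitrary illustrative constants
    return d * 3**n + c1 * (-2)**n + c2 * 6**n

for n in range(12):
    assert f(n + 2) - 4 * f(n + 1) - 12 * f(n) == 3**n
```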

⚠️ Avoiding annihilation: when your guess fails

Common pitfall: If your guess is already a solution to the homogeneous equation, the operator will "annihilate" it (make it zero), and you won't be able to match the right-hand side.

Example from the excerpt:

  • Equation: (A + 2)(A − 6) f = 6ⁿ
  • Homogeneous solution includes c₂ 6ⁿ.
  • Naive guess f₀(n) = d 6ⁿ fails: (A − 6)(d 6ⁿ) = 0.
  • Fix: Multiply by n. Try f₀(n) = d n 6ⁿ.
  • Now (A − 6)(d n 6ⁿ) = d 6^(n+1) ≠ 0, and you can solve for d.

Rule of thumb:

  • If g(n) contains a term that solves the homogeneous equation, multiply your guess by n (or n², n³, etc., if needed).
  • This is analogous to the repeated-root case.
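A sketch of the "multiply by n" fix in code. The excerpt stops before computing d for this example; the value d = 1/48 below is computed here, not quoted from the text:

```python
from fractions import Fraction

# For (A + 2)(A - 6) f = 6**n the naive guess d*6**n is annihilated,
# so try f0(n) = d*n*6**n instead. Expanded operator: A^2 - 4A - 12.

def op(h):
    return lambda n: h(n + 2) - 4 * h(n + 1) - 12 * h(n)

naive = op(lambda n: 6**n)
assert naive(0) == 0 and naive(7) == 0          # annihilated, as predicted

d = Fraction(1, 48)                              # computed: operator sends n*6^n to 48*6^n
applied = op(lambda n: d * n * 6**n)
for n in range(12):
    assert applied(n) == 6**n                    # the corrected guess works
```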

🧮 Polynomial right-hand sides

When g(n) is a polynomial (e.g., g(n) = n + 1), guess a polynomial of the same or higher degree, including all lower-degree terms.

Example from the excerpt:

  • Equation: (A − 1) r = n + 1
  • Homogeneous solution: f₁(n) = c₁ (constant).
  • Naive guess: f₀(n) = d₁ n + d₂ fails because the constant d₂ is annihilated.
  • Correct guess: Increase all powers by 1: f₀(n) = d₁ n² + d₂ n.
  • Solve: d₁ = 1/2, d₂ = 1/2.
  • General solution: f(n) = c₁ + (n² + n)/2.

Another example (mixed type):

  • Equation: (A − 2)² f = 3ⁿ + 2n
  • Guess: f₀(n) = d₁ 3ⁿ + d₂ n + d₃ (include the constant d₃ because you have a linear term).
  • Solve for d₁, d₂, d₃ by matching coefficients.

🔢 Working with initial conditions

🎲 From general to specific

Once you have the general solution with parameters c₁, c₂, ..., cₖ, use the given initial values f(0), f(1), ..., f(k−1) to set up a system of linear equations.

Example from the excerpt:

  • General solution: f(n) = c₁ (−5)ⁿ + c₂ + c₃ n + c₄ n²
  • Initial conditions: f(0) = 1, f(1) = 2, f(2) = 4, f(3) = 4
  • Substitute n = 0, 1, 2, 3 to get four equations in four unknowns.
  • Solve the system to find the specific values of c₁, c₂, c₃, c₄.
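The four-equation system can be solved exactly with a small stdlib-only Gaussian elimination (a generic sketch, not the book's code):

```python
from fractions import Fraction

# Determine c1..c4 in f(n) = c1*(-5)**n + c2 + c3*n + c4*n**2 from
# f(0)=1, f(1)=2, f(2)=4, f(3)=4, via Gauss-Jordan over exact fractions.

def solve(aug):
    n = len(aug)
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                aug[r] = [a - aug[r][col] * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

basis = lambda n: [Fraction((-5) ** n), Fraction(1), Fraction(n), Fraction(n * n)]
targets = [1, 2, 4, 4]
aug = [basis(n) + [Fraction(targets[n])] for n in range(4)]
c = solve(aug)

def f(n):
    return sum(ci * bi for ci, bi in zip(c, basis(n)))

assert [f(n) for n in range(4)] == targets
# f also satisfies the expanded recurrence (A + 5)(A - 1)^3 f = 0:
for n in range(10):
    assert f(n + 4) + 2 * f(n + 3) - 12 * f(n + 2) + 14 * f(n + 1) - 5 * f(n) == 0
```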

📊 Real-world example: ternary strings

Problem: Count ternary strings of length n that avoid the substring (2, 0), i.e., never have a 2 immediately followed by a 0. (The initial values t(1) = 3 and t(2) = 8 below confirm this reading: all 3 strings of length 1 qualify, and 8 of the 9 strings of length 2 do.)

  • Recurrence: t_{n+2} = 3 t_{n+1} − t_n
  • Advancement operator form: (A² − 3A + 1) t = 0
  • Roots: r = (3 ± √5)/2
  • General solution: t(n) = c₁ ((3 + √5)/2)ⁿ + c₂ ((3 − √5)/2)ⁿ
  • Initial conditions t(1) = 3, t(2) = 8 determine c₁ and c₂.
  • Don't worry: Even though the formula has square roots and fractions, the binomial expansion ensures t(n) is always an integer.
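A brute-force sanity check of the recurrence against a direct count (a sketch; the string encoding "012" is an implementation convenience):

```python
from itertools import product

# Compare the recurrence t(n+2) = 3*t(n+1) - t(n), t(1)=3, t(2)=8,
# against an exhaustive count of ternary strings with no "20" substring.

def count_avoiding(n):
    return sum(1 for s in product("012", repeat=n) if "20" not in "".join(s))

t = {1: 3, 2: 8}
for n in range(3, 10):
    t[n] = 3 * t[n - 1] - t[n - 2]

for n in range(1, 10):
    assert count_avoiding(n) == t[n]

print([t[n] for n in range(1, 7)])   # → [3, 8, 21, 55, 144, 377]
```

Despite the irrational roots (3 ± √5)/2, every value is an integer, as the printed sequence shows.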

🛠️ Special cases and edge conditions

🔄 What if the constant term of p is zero?

If p(A) = Aᵐ q(A) where q has nonzero constant term:

  • The equation p(A) f = 1 can be rewritten as Aᵐ q(A) f = 1.
  • Solve the simpler problem q(A) f = 1 first.
  • Solutions to the original problem are translations: if h′ solves the simpler problem, then h(n) = h′(n + m) solves the original.

Special case: Aᵐ f = 1 means f(n + m) = 1 for all n ≥ 0, so f(n) = 1 for every n ≥ m while f(0), …, f(m−1) are unconstrained; f agrees with the constant function 1 after a shift of m.

📐 Lines in the plane revisited

Problem: How many regions do n lines in general position divide the plane into?

  • Recurrence: r_{n+1} = r_n + n + 1
  • Advancement operator form: (A − 1) r = n + 1
  • Homogeneous solution: f₁(n) = c₁
  • Particular solution guess: f₀(n) = d₁ n² + d₂ n (increase powers because constants are annihilated)
  • Solve: d₁ = 1/2, d₂ = 1/2
  • General solution: f(n) = c₁ + (n² + n)/2
  • Initial condition: r(1) = 2 ⇒ c₁ = 1
  • Answer: r(n) = 1 + (n² + n)/2 = (n+1 choose 2) + 1
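The closed form is easy to verify against the recurrence:

```python
from math import comb

# The closed form r(n) = 1 + (n^2 + n)/2 = C(n+1, 2) + 1 satisfies
# r(n+1) = r(n) + n + 1 with r(1) = 2.

def r(n):
    return comb(n + 1, 2) + 1

assert r(1) == 2
for n in range(1, 50):
    assert r(n + 1) == r(n) + n + 1
```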

🧠 Why this method works: vector space foundations

🏗️ The structure behind solutions

The excerpt begins to formalize the approach using the language of vector spaces:

A vector space V consists of vectors with addition (x + y) and scalar multiplication (α x) satisfying properties like commutativity, associativity, existence of zero, and additive inverses.

Key insight:

  • The set of all functions f : ℕ → ℝ forms a vector space.
  • Solutions to p(A) f = 0 form a subspace (closed under addition and scalar multiplication).
  • If f₁ and f₂ solve p(A) f = 0, then so does c₁ f₁ + c₂ f₂.
  • The dimension of this subspace equals the degree k of p(A), which is why the general solution has k parameters.

For nonhomogeneous equations:

  • If f₀ is any particular solution to p(A) f = g(n), and f₁ is the general solution to p(A) f = 0, then f₀ + f₁ is the general solution to the nonhomogeneous equation.
  • This is analogous to how solutions to linear differential equations behave.

🔗 Connection to differential equations

The excerpt notes that readers familiar with differential equations will see many parallels:

  • Linear differential equations with constant coefficients are solved similarly.
  • Characteristic polynomials, repeated roots, and particular solutions all have direct analogues.
  • The advancement operator A plays the role of the differentiation operator D.

9.5 Formalizing our approach to recurrence equations

🧭 Overview

🧠 One-sentence thesis

The general solution to a homogeneous linear recurrence equation can be systematically found by treating the problem as a vector space and factoring the corresponding advancement operator polynomial, with each root contributing a basis function to the solution space.

📌 Key points (3–5)

  • Vector space foundation: Recurrence equation solutions form a k-dimensional vector space W, where k is the order of the equation.
  • Structure of general solutions: Every solution to a nonhomogeneous equation equals one particular solution plus any solution to the corresponding homogeneous equation.
  • Distinct roots case: When the characteristic polynomial factors into distinct roots r₁, r₂, …, rₖ, the general solution is f(n) = c₁r₁ⁿ + c₂r₂ⁿ + … + cₖrₖⁿ.
  • Repeated roots case: A root r with multiplicity k contributes terms c₁rⁿ + c₂nrⁿ + c₃n²rⁿ + … + cₖnᵏ⁻¹rⁿ to the solution.
  • Common confusion: Don't confuse the order k of the equation with the number of distinct roots—repeated roots require polynomial multipliers (n, n², etc.) in the solution.

🏗️ Vector space framework

🏗️ What is the vector space V

Vector space V: the set of all functions f : ℕ → ℝ (functions from non-negative integers to real numbers), with addition defined by (f + g)(n) = f(n) + g(n) and scalar multiplication by (αf)(n) = α·f(n).

  • This is the setting for all recurrence equation solutions in this chapter.
  • The eight vector space properties (commutativity, associativity, zero element, additive inverse, etc.) ensure that standard algebraic manipulations work.
  • Example: If f(n) = 2ⁿ and g(n) = 3ⁿ, then (f + g)(n) = 2ⁿ + 3ⁿ.

🔧 Linear operators on V

Linear operator ϕ : V → V: a function satisfying ϕ(x + y) = ϕ(x) + ϕ(y) and ϕ(αx) = α·ϕ(x).

  • The advancement operator A is a key example: it shifts indices, so (Af)(n) = f(n+1).
  • Operators can be added and scaled: (ϕ + ψ)x = ϕx + ψx and (αϕ)x = α(ϕx).
  • Polynomials in A, like c₀Aᵏ + c₁Aᵏ⁻¹ + … + cₖ, are also operators.

📐 The principal theorem (Theorem 9.18)

📐 Statement of the theorem

Theorem 9.18: For a homogeneous linear equation (c₀Aᵏ + c₁Aᵏ⁻¹ + … + cₖ)f = 0 with c₀, cₖ ≠ 0, the set W of all solutions is a k-dimensional subspace of V.

  • Why it matters: Once we know W is k-dimensional, we only need to find k basis functions; every solution is a linear combination of those basis functions.
  • The proof that W is a subspace is immediate from linearity: p(A)(αf + βg) = αp(A)f + βp(A)g.
  • The harder part is proving the dimension is exactly k (not shown in full detail in the excerpt).

🔗 Nonhomogeneous equations (Lemma 9.20)

Lemma 9.20: If f₀ is any particular solution to the nonhomogeneous equation p(A)f = g, then every solution f has the form f = f₀ + f₁, where f₁ is a solution to the corresponding homogeneous equation p(A)f = 0.

  • How it works: Subtract the particular solution from any solution; the difference satisfies the homogeneous equation.
  • Proof sketch: p(A)(f − f₀) = p(A)f − p(A)f₀ = g − g = 0.
  • Example: If you find one solution to (A − 2)f = 3ⁿ, add any multiple of 2ⁿ (the homogeneous solution) to get all solutions.
  • Don't confuse: The particular solution f₀ is not unique, but the form f₀ + f₁ captures all solutions.

🌱 The base case: k = 1 (Lemma 9.19)

🌱 Solving (A − r)f = 0

Lemma 9.19: For r ≠ 0, every solution to (A − r)f = 0 has the form f(n) = c·rⁿ, where c = f(0).

  • Why this works: The equation (A − r)f = 0 means f(n+1) − r·f(n) = 0, so f(n+1) = r·f(n).
  • By induction: f(0) = c, f(1) = r·c, f(2) = r·f(1) = r²c, and so on.
  • Example: If (A − 3)f = 0 and f(0) = 5, then f(n) = 5·3ⁿ.
  • This is the simplest case and forms the building block for more complex equations.

🌳 Distinct roots case (Theorem 9.21)

🌳 Factored form with distinct roots

Consider the equation:

p(A)f = (A − r₁)(A − r₂)⋯(A − rₖ)f = 0, with r₁, r₂, …, rₖ distinct and non-zero.

Then every solution has the form:

f(n) = c₁r₁ⁿ + c₂r₂ⁿ + c₃r₃ⁿ + … + cₖrₖⁿ.

🔍 Inductive proof strategy

  • Base case k = 1: Already established by Lemma 9.19.
  • Inductive step: Assume the result holds for m roots; prove it for m+1 roots.
  • Rewrite (A − r₁)⋯(A − rₘ₊₁)f = 0 as (A − r₁)⋯(A − rₘ)[(A − rₘ₊₁)f] = 0.
  • By the inductive hypothesis, (A − rₘ₊₁)f must equal d₁r₁ⁿ + d₂r₂ⁿ + … + dₘrₘⁿ (a nonhomogeneous equation).
  • Find a particular solution f₀(n) = c₁r₁ⁿ + … + cₘrₘⁿ by choosing cᵢ so that cᵢ(rᵢ − rₘ₊₁) = dᵢ (possible because rᵢ ≠ rₘ₊₁).
  • The homogeneous part (A − rₘ₊₁)f = 0 contributes cₘ₊₁rₘ₊₁ⁿ.
  • Combine: f(n) = f₀(n) + cₘ₊₁rₘ₊₁ⁿ = c₁r₁ⁿ + … + cₘrₘⁿ + cₘ₊₁rₘ₊₁ⁿ.

🧮 Why distinct roots matter

  • Each root rᵢ is different, so the terms rᵢⁿ are linearly independent.
  • This ensures we can solve for the constants cᵢ uniquely given initial conditions.
  • Example: If p(A) = (A − 2)(A − 3)(A − 5), then f(n) = c₁·2ⁿ + c₂·3ⁿ + c₃·5ⁿ.

🔁 Repeated roots case (Lemma 9.22)

🔁 When a root appears multiple times

Lemma 9.22: For the equation (A − r)ᵏf = 0, the general solution is f(n) = c₁rⁿ + c₂n·rⁿ + c₃n²·rⁿ + … + cₖnᵏ⁻¹·rⁿ.

  • Why polynomial multipliers: A single root r of multiplicity k contributes k linearly independent solutions: rⁿ, n·rⁿ, n²·rⁿ, …, nᵏ⁻¹·rⁿ.
  • The excerpt states the result but leaves the proof as an exercise.
  • Example: If (A − 2)³f = 0, then f(n) = c₁·2ⁿ + c₂·n·2ⁿ + c₃·n²·2ⁿ.

🔄 Combining distinct and repeated roots

  • Factor the polynomial p(A) completely, identifying each root and its multiplicity.
  • For each distinct root r with multiplicity m, include terms c₁rⁿ + c₂n·rⁿ + … + cₘnᵐ⁻¹·rⁿ.
  • Don't confuse: The exponent on n goes up to (multiplicity − 1), not multiplicity.

🎯 Reading off solutions from factored form (Example 9.23)

🎯 A comprehensive example

Given the equation:

(A − 1)⁵(A + 1)³(A − 3)²(A + 8)(A − 9)⁴f = 0

The general solution is:

Root | Multiplicity | Contribution to f(n)
r = 1 | 5 | c₁ + c₂n + c₃n² + c₄n³ + c₅n⁴
r = −1 | 3 | c₆(−1)ⁿ + c₇n(−1)ⁿ + c₈n²(−1)ⁿ
r = 3 | 2 | c₉·3ⁿ + c₁₀·n·3ⁿ
r = −8 | 1 | c₁₁(−8)ⁿ
r = 9 | 4 | c₁₂·9ⁿ + c₁₃·n·9ⁿ + c₁₄·n²·9ⁿ + c₁₅·n³·9ⁿ

  • How to read it: Each factor (A − r)ᵐ tells you to include m terms: rⁿ, n·rⁿ, …, nᵐ⁻¹·rⁿ.
  • The total number of constants (c₁ through c₁₅) equals the sum of multiplicities: 5 + 3 + 2 + 1 + 4 = 15.
  • This matches the order of the original equation (the highest power of A).
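The read-off procedure is mechanical enough to script. A small hypothetical helper (names and output format are my own, not the book's) builds the solution terms from (root, multiplicity) pairs:

```python
# Given (root, multiplicity) pairs from a factored operator, emit the
# general-solution terms: for each root r of multiplicity m, the terms
# r^n, n*r^n, ..., n^(m-1)*r^n, each with a fresh constant.

def general_solution_terms(roots):
    terms, i = [], 1
    for r, m in roots:
        for power in range(m):
            n_part = "" if power == 0 else ("*n" if power == 1 else f"*n^{power}")
            terms.append(f"c{i}{n_part}*({r})^n")
            i += 1
    return terms

# Example 9.23: (A - 1)^5 (A + 1)^3 (A - 3)^2 (A + 8) (A - 9)^4
terms = general_solution_terms([(1, 5), (-1, 3), (3, 2), (-8, 1), (9, 4)])
assert len(terms) == 15          # sum of multiplicities 5 + 3 + 2 + 1 + 4
assert terms[0] == "c1*(1)^n"
assert terms[1] == "c2*n*(1)^n"
```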

🌐 Complex roots

  • The excerpt notes that factoring may require complex numbers (roots with non-zero imaginary parts).
  • The same rules apply: each complex root r contributes terms involving rⁿ.
  • Example: If r = 2 + 3i is a root, the solution includes terms like c·(2 + 3i)ⁿ.

🔄 Alternative approach: generating functions (Section 9.6 preview)

🔄 Why generating functions

  • The advancement operator method works well for linear recurrences with constant coefficients.
  • Generating functions offer another tool, applicable to both linear and some nonlinear recurrences.
  • The excerpt begins Example 9.24 to illustrate the generating function technique.

🧪 Setup of Example 9.24

  • Recurrence: rₙ + rₙ₋₁ − 6rₙ₋₂ = 0, with r₀ = 1 and r₁ = 3.
  • Define the generating function: f(x) = Σ(n=0 to ∞) rₙxⁿ = r₀ + r₁x + r₂x² + r₃x³ + …
  • Key insight: In xf(x), the coefficient of xⁿ is rₙ₋₁; in −6x²f(x), the coefficient of xⁿ is −6rₙ₋₂.
  • Adding f(x) + xf(x) − 6x²f(x) makes the coefficient of xⁿ equal to rₙ + rₙ₋₁ − 6rₙ₋₂ = 0 (by the recurrence).
  • This allows solving for f(x) algebraically, then extracting rₙ from the series expansion.
  • Don't confuse: The generating function method manipulates power series, not the sequence directly; it's a different perspective on the same problem.

9.6 Using generating functions to solve recurrences

🧭 Overview

🧠 One-sentence thesis

Generating functions provide an alternative method for solving both linear and nonlinear recurrence equations by converting recurrence relations into algebraic equations that can be manipulated and expanded to extract closed-form solutions.

📌 Key points (3–5)

  • Alternative approach: Generating functions work for linear recurrences (especially with constant coefficients) and can also handle nonlinear recurrences that other methods cannot solve easily.
  • Core technique: Multiply the recurrence equation by powers of x, sum over all terms, and recognize that the resulting sums correspond to the generating function and its multiples.
  • Key advantage for nonhomogeneous equations: Does not require guessing the form of a particular solution, unlike the advancement operator method.
  • Common confusion: It is the coefficient on xⁿ in the combined expression f(x)·(1 + x − 6x²), not in f(x) itself, that becomes zero for n ≥ 2, because the recurrence equation makes those coefficient sums vanish—this is the key algebraic trick.
  • Trade-off: While avoiding particular solution guessing, this method often requires partial fraction expansion, so neither approach is universally superior.

🔧 The basic technique for homogeneous recurrences

🔧 Setting up the generating function equation

  • Start with a recurrence like r_n + r_(n-1) - 6·r_(n-2) = 0 with initial conditions r_0 = 1 and r_1 = 3.
  • Define the generating function f(x) = sum from n=0 to infinity of r_n·x^n.
  • The key insight: construct related functions like x·f(x) and -6·x²·f(x) whose coefficients align with the recurrence terms.

🎯 Exploiting the recurrence relation

When you form:

  • f(x) has r_n as the coefficient on x^n
  • x·f(x) has r_(n-1) as the coefficient on x^n
  • -6·x²·f(x) has -6·r_(n-2) as the coefficient on x^n

Why this matters: Adding these three functions together, the coefficient on x^n (for n ≥ 2) becomes r_n + r_(n-1) - 6·r_(n-2), which equals zero by the recurrence equation.

Example from the excerpt:

  • Left side becomes f(x)·(1 + x - 6·x²)
  • Right side has only the first few terms that don't cancel: r_0 + (r_0 + r_1)·x = 1 + 4·x
  • Result: f(x) = (1 + 4·x)/(1 + x - 6·x²)

🧮 Extracting the solution

  • Use partial fractions to expand the rational function into simpler geometric series.
  • The excerpt shows: f(x) = (6/5)·1/(1 - 2·x) - (1/5)·1/(1 + 3·x)
  • Recognize each term as a geometric series: 1/(1 - 2·x) = sum of 2^n·x^n, etc.
  • Read off the coefficient on x^n to get r_n = (6/5)·2^n - (1/5)·(-3)^n.
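The extracted formula can be checked exactly against the recurrence:

```python
from fractions import Fraction

# Check that r_n = (6/5)*2**n - (1/5)*(-3)**n, read off from the partial
# fractions, satisfies r_n + r_{n-1} - 6*r_{n-2} = 0 with r_0 = 1, r_1 = 3.

def r_closed(n):
    return Fraction(6, 5) * 2**n - Fraction(1, 5) * (-3)**n

r = [1, 3]
for n in range(2, 15):
    r.append(6 * r[n - 2] - r[n - 1])   # rearranged: r_n = 6 r_{n-2} - r_{n-1}

assert all(r_closed(n) == r[n] for n in range(15))
```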

🔄 Extending to nonhomogeneous recurrences

🔄 Modified setup procedure

For a nonhomogeneous recurrence like r_n - r_(n-1) - 2·r_(n-2) = 2^n with r_0 = 2 and r_1 = 1:

  • Multiply both sides by x^n first, then sum over all n ≥ 2.
  • The right-hand side becomes sum of 2^n·x^n, which is a geometric series.
  • The left-hand side requires careful accounting of missing initial terms.

📐 Handling the algebra

The excerpt shows:

  • First sum: sum from n=2 of r_n·x^n = R(x) - (2 + x), where R(x) is the full generating function
  • Second sum: sum from n=2 of r_(n-1)·x^n = x·R(x) - 2·x (missing the first term)
  • Third sum: sum from n=2 of r_(n-2)·x^n = x²·R(x) (no missing terms)
  • Right side: sum from n=2 of 2^n·x^n = 1/(1 - 2·x) - 1 - 2·x

After algebraic manipulation: R(x) = (6·x² - 5·x + 2)/((1 - 2·x)·(1 - x - 2·x²))

🎲 Reading the final answer

  • Partial fractions yield: R(x) = -1/(9·(1 - 2·x)) + 2/(3·(1 - 2·x)²) + 13/(9·(1 + x))
  • Each term expands to a series; the middle term uses the derivative rule for geometric series.
  • Final result: r_n = (5/9)·2^n + (2/3)·n·2^n + (13/9)·(-1)^n

Don't confuse: The term (n+1)·2^n in the expansion comes from the squared denominator (1 - 2·x)², not from the original recurrence directly.
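An exact check of the final answer against the original nonhomogeneous recurrence:

```python
from fractions import Fraction

# Confirm the partial-fraction result r_n = (5/9)*2^n + (2/3)*n*2^n + (13/9)*(-1)^n
# against r_n - r_{n-1} - 2*r_{n-2} = 2**n with r_0 = 2, r_1 = 1.

def s(n):
    return (Fraction(5, 9) * 2**n + Fraction(2, 3) * n * 2**n
            + Fraction(13, 9) * (-1)**n)

r = [2, 1]
for n in range(2, 20):
    r.append(r[n - 1] + 2 * r[n - 2] + 2**n)

assert all(s(n) == r[n] for n in range(20))
```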

⚖️ Comparing methods

⚖️ Advantages of generating functions

Aspect | Generating function method | Traditional advancement operator
Particular solution | Not needed—automatically incorporated | Must guess appropriate form
Nonlinear recurrences | Can handle (see next section) | Generally cannot solve
Algebraic complexity | Requires partial fractions | Requires solving characteristic equations

⚖️ When to use which approach

The excerpt advises:

  • Both methods have "positives and negatives."
  • Unless instructed otherwise, choose whichever seems most appropriate for the situation.
  • For nonhomogeneous equations, generating functions avoid the guessing step but require partial fraction skills.
  • For nonlinear recurrences (mentioned for the next section), generating functions may be the only viable approach.

🌳 Preview: nonlinear recurrences

🌳 Introduction to RUBOTs

The excerpt introduces rooted, unlabeled, binary, ordered trees (RUBOTs):

A tree is rooted if we have designated a special vertex called its root.

An unlabeled tree is one in which we do not make distinctions based upon names given to the vertices.

A binary tree is one in which each vertex has 0 or 2 children.

An ordered tree is one in which the children of a vertex have some ordering (first, second, third, etc.).

  • RUBOTs have a left and right child for vertices with children.
  • Let c_n be the number of RUBOTs with n leaves, with c_0 = 0 for convenience.
  • The generating function C(x) = sum of c_n·x^n begins: C(x) = x + x² + 2·x³ + 5·x⁴ + ...

🌳 The recursive structure

  • The excerpt hints at breaking a RUBOT with n ≥ 2 leaves into two smaller RUBOTs.
  • This decomposition will lead to expressing c_n in terms of c_k for k < n.
  • Why this is nonlinear: The combination of two smaller trees suggests a product or convolution, making the recurrence nonlinear (unlike the linear recurrences earlier).

Note: The excerpt ends before completing the nonlinear recurrence solution, but establishes that generating functions are particularly suited to this problem.


9.7 Solving a nonlinear recurrence

🧭 Overview

🧠 One-sentence thesis

Generating functions can solve nonlinear recurrence equations by transforming a recursive counting problem into an algebraic equation, revealing that the number of certain binary trees follows the Catalan sequence.

📌 Key points (3–5)

  • What this section solves: a nonlinear recurrence for counting rooted, unlabeled, binary, ordered trees (RUBOTs) with n leaves.
  • How generating functions help: the recurrence c_n = sum of (c_k × c_(n−k)) becomes a quadratic equation C(x) = x + C²(x) in the generating function.
  • Key result: solving the quadratic and expanding with Newton's Binomial Theorem shows c_n is the Catalan number (2n−2 choose n−1) / n.
  • Common confusion: the recurrence is nonlinear because c_n depends on products of earlier terms (c_k × c_(n−k)), not just linear combinations.
  • Connection to earlier material: these Catalan numbers appeared in Chapter 2 when counting lattice paths that do not cross the diagonal y = x.

🌳 The counting problem: RUBOTs

🌳 What is a RUBOT?

A RUBOT is a tree with four defining properties:

  • Rooted: one vertex is designated as the root (drawn at the top).
  • Unlabeled: vertices have no distinguishing names.
  • Binary: every vertex has either 0 or 2 children (no vertex has exactly 1 child).
  • Ordered: the two children of any vertex are distinguished as left and right.

The excerpt counts RUBOTs by the number of leaves (vertices with 0 children).

📊 Small cases

Figure 9.26 shows the RUBOTs for n ≤ 4 leaves:

  • n = 1: 1 tree (just the root)
  • n = 2: 1 tree
  • n = 3: 2 trees
  • n = 4: 5 trees

So the generating function starts as C(x) = x + x² + 2x³ + 5x⁴ + …

The goal is to find a formula for all coefficients c_n.

🔁 Building the recurrence

🔁 Breaking down a RUBOT

For n ≥ 2 leaves, the root must have two children (by the binary property).

  • The left child is the root of a smaller RUBOT with k leaves.
  • The right child is the root of a smaller RUBOT with (n − k) leaves.
  • There are c_k choices for the left sub-RUBOT and c_(n−k) choices for the right sub-RUBOT.
  • Total RUBOTs with the left child having k leaves: c_k × c_(n−k).

Summing over all possible splits k = 1, 2, …, n−1 gives:

c_n = sum from k=1 to n−1 of (c_k × c_(n−k))

Since c_0 = 0 (by convention), this can be rewritten as:

c_n = sum from k=0 to n of (c_k × c_(n−k))

🔍 Why this is nonlinear

  • The recurrence involves products c_k × c_(n−k), not just sums of earlier terms.
  • This makes it nonlinear and harder to solve with standard linear recurrence techniques.
  • Generating functions turn this product structure into something manageable.

🧮 Solving with generating functions

🧮 From recurrence to quadratic equation

Let C(x) = sum from n=0 to ∞ of (c_n x^n).

By Proposition 8.3 (convolution of generating functions), the square C²(x) has coefficient on x^n equal to:

sum from k=0 to n of (c_k × c_(n−k))

But this is exactly c_n for n ≥ 2 (from the recurrence).

The only mismatch is the x¹ term:

  • C²(x) starts with 0 + 0x + (c_0c_2 + c_1c_1 + c_2c_0)x² + …
  • C(x) starts with 0 + x + c_2x² + …

Adding x to C²(x) fixes this, giving:

C(x) = x + C²(x)

This is a quadratic equation in C(x).

🧮 Solving the quadratic

Rearranging: C²(x) − C(x) + x = 0.

Using the quadratic formula:

C(x) = (1 ± √(1 − 4x)) / 2

Since C(0) = c₀ = 0, the plus sign is impossible (it would give C(0) = 1), so the excerpt chooses the minus sign:

C(x) = (1 − √(1 − 4x)) / 2

🧮 Expanding with Newton's Binomial Theorem

Rewrite:

C(x) = 1/2 − (1/2) × (1 − 4x)^(1/2)

Newton's Binomial Theorem expands (1 − 4x)^(1/2) as:

1 + sum from n=1 to ∞ of ((1/2 choose n) × (−4)^n × x^n)

Lemma 9.27 gives:

(1/2 choose k) = (−1)^(k−1) / k × (2k−2 choose k−1) / 2^(2k−1)

Substituting and simplifying:

C(x) = sum from n=1 to ∞ of (1/n × (2n−2 choose n−1) × x^n)

🎯 The result and connection

🎯 Theorem 9.28

The generating function for the number c_n of RUBOTs with n leaves is:

C(x) = (1 − √(1 − 4x)) / 2 = sum from n=1 to ∞ of (1/n × (2n−2 choose n−1) × x^n)

Therefore:

c_n = (1/n) × (2n−2 choose n−1)
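Iterating the convolution recurrence and comparing against this closed form confirms the result for small n:

```python
from math import comb

# Iterate c_n = sum_{k=1}^{n-1} c_k * c_{n-k} (the k=0 and k=n terms vanish
# since c_0 = 0) and compare with c_n = C(2n-2, n-1) / n from Theorem 9.28.

c = [0, 1]                      # c_0 = 0 by convention, c_1 = 1
for n in range(2, 12):
    c.append(sum(c[k] * c[n - k] for k in range(1, n)))

for n in range(1, 12):
    assert c[n] == comb(2 * n - 2, n - 1) // n

print(c[1:6])   # → [1, 1, 2, 5, 14]
```

The printed values match Figure 9.26's counts (1, 1, 2, 5 RUBOTs for n = 1, …, 4) and continue the Catalan sequence.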

🔗 Catalan numbers

The excerpt notes that c_n is a Catalan number.

  • In Chapter 2, Catalan numbers counted lattice paths from (0,0) to (n,n) that do not cross the diagonal y = x.
  • The coefficient c_n corresponds to the Catalan number called C(n−1) in Chapter 2.
  • This shows the same sequence appears in multiple combinatorial contexts.

🔍 Don't confuse

  • The recurrence is nonlinear because of the product structure, not because it involves powers of n.
  • The generating function method works here precisely because squaring C(x) captures the convolution sum in the recurrence.

💬 Discussion highlights

💬 Student reflections

  • Yolanda: appreciates how vector space concepts (bases, dimension) connect to recurrence solutions; sees the importance of factoring across calculus, differential equations, and this chapter.
  • Bob: understood the chapter except for "the detail about zero as a root of an advancement operator polynomial."
  • Xing: notes the approach depends on factoring and mentions recent breakthroughs in factoring algorithms.
  • Alice: was not having a good day (no technical comment).

The discussion emphasizes how techniques (factoring, generating functions) unify across different mathematical topics.


9.8 Discussion

🧭 Overview

🧠 One-sentence thesis

Solving recurrence equations explicitly (closed-form) is vastly more efficient than computing values recursively, even though recursive programming is easy to implement.

📌 Key points (3–5)

  • The factoring challenge: High-degree polynomials in advancement operators require many initial conditions, making factoring the hardest step in solving recurrences.
  • Recursive vs explicit computation: Recursive functions are easy to code but can be exponentially slower than closed-form solutions as input grows.
  • Experimental evidence: Timing tests show recursive evaluation time grows exponentially (≈10× per +5 in n), while explicit formulas run in constant time.
  • Common confusion: Just because recursion is easy to program doesn't mean it's efficient—closed-form solutions avoid redundant recomputation.
  • Practical implication: For large inputs, explicit solutions are essential; recursive computation becomes impractically slow.

🧮 The mathematical challenge

🧮 Factoring as the bottleneck

  • Dave identifies the core difficulty: when the advancement operator equation has a large-degree polynomial, you must factor it and handle many initial conditions.
  • The excerpt contrasts two stages:
    • Factoring: hard, especially for high-degree polynomials.
    • Solving linear systems: relatively easy once factoring is done.
  • Example: A recurrence with a degree-5 polynomial requires factoring into roots and tracking 5 initial conditions.

🤔 Why not just use recursion?

  • Bob's question: "Defining a recursive function is easy in almost all programming languages, so why not just use a computer to calculate the values you need?"
  • The group's response (via experiment): ease of coding ≠ efficiency of execution.
  • Xing hints that closed-form solutions help understand growth rates (big-O notation from Chapter 4), but the group decides to test empirically.

🧪 The programming experiment

🧪 The test recurrence

The group examines the recurrence from Example 9.25:

Recursive definition:
r(n) = 2ⁿ + r(n−1) + 2·r(n−2) for n ≥ 2
r(1) = 1
r(0) = 2

  • Alice implements this as a recursive function r(n) in SageMath.
  • She also implements the explicit solution s(n) from Example 9.25:
    s(n) = (5/9)·2ⁿ + (2/3)·n·2ⁿ + (13/9)·(−1)ⁿ
  • Both functions produce the same values (e.g., r(1) = s(1) = 1, r(4) = s(4) = 53, r(10) = s(10) = 7397).
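A plain-Python version of the experiment (the excerpt uses SageMath; this sketch counts recursive calls instead of wall-clock time, which makes the exponential blowup visible deterministically):

```python
from fractions import Fraction

calls = 0

def r(n):
    # Naive recursive implementation of r(n) = 2^n + r(n-1) + 2*r(n-2).
    global calls
    calls += 1
    if n == 0:
        return 2
    if n == 1:
        return 1
    return 2**n + r(n - 1) + 2 * r(n - 2)

def s(n):
    # Closed-form solution from Example 9.25.
    return (Fraction(5, 9) * 2**n + Fraction(2, 3) * n * 2**n
            + Fraction(13, 9) * (-1)**n)

assert r(4) == s(4) == 53
assert r(10) == s(10) == 7397

calls = 0
r(20)
print("calls for r(20):", calls)   # tens of thousands of calls; s(20) is one evaluation
```

The call count roughly follows a Fibonacci-like growth, which is why each increase of 5 in n multiplies the running time by about the same factor.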

⏱️ Timing comparison

Dave uses the timeit command to measure execution time for n = 0, 5, 10, 15, 20, 25, 30:

n | r(n) time (recursive) | s(n) time (explicit)
0 | 238 ns | 44 μs
5 | 11.4 μs | 50.6 μs
10 | 127 μs | 47.6 μs
15 | 1.42 ms | 50 μs
20 | 15.7 ms | 49 μs
25 | 133 ms | 50.4 μs
30 | 1.49 s | 48.8 μs

  • Key observation: s(n) runs in roughly constant time (~50 μs) regardless of n.
  • Key observation: r(n) time grows exponentially—approximately 10× slower for each increase of 5 in n.

🚀 Scaling to large inputs

  • The group tests s(100): it computes "almost instantly."
  • In contrast, r(40) would take so long they joke about getting coffee refills while waiting.
  • Don't confuse: Small test values (n ≤ 10) may hide the performance gap; the difference becomes dramatic as n grows.

📊 Why explicit solutions matter

📊 Efficiency gains

  • Recursive functions recompute the same subproblems many times (e.g., r(n−2) is recalculated in both r(n) and r(n−1)).
  • Explicit formulas compute the answer directly in one step, avoiding all redundant work.
  • Example: For n = 30, the recursive approach takes 1.49 seconds; the explicit formula takes ~50 microseconds—a speedup of roughly 30,000×.

📊 Understanding growth rates

  • Xing's remark: closed-form solutions connect to big-O analysis (Chapter 4).
  • Knowing the explicit form reveals the dominant term (e.g., (2/3)·n·2ⁿ grows faster than (5/9)·2ⁿ or (13/9)·(−1)ⁿ).
  • This helps predict behavior for large n without running code.

📊 Practical takeaway

  • Even though recursion is easy to code, it is not always practical.
  • For recurrences that will be evaluated at large n or many times, investing effort in finding the closed-form solution pays off enormously.
  • The "machinery" of this chapter (advancement operators, factoring, solving linear systems) is justified by the dramatic performance improvement.

9.9 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set demonstrates that closed-form solutions to recurrence equations (like s(n)) compute results almost instantly even for large n, while naive recursive implementations (like r(n)) grow exponentially slower, making algorithmic efficiency critical in practice.

📌 Key points (3–5)

  • Performance contrast: the closed-form solution s(n) runs in constant time (~50 microseconds regardless of n), while the recursive r(n) takes exponentially longer (about 10× slower for every increase of 5 in n).
  • Practical impact: s(100) computes instantly, but r(40) takes so long the excerpt jokes about getting coffee while waiting.
  • What the exercises cover: converting recurrence equations to advancement operator form, solving homogeneous and nonhomogeneous equations, finding explicit formulas (including Fibonacci), and applying recurrences to combinatorial problems (strings, tiles, trees, lattice paths).
  • Common confusion: both r and s give the same numerical answers for small n, so the dramatic efficiency difference only becomes visible through timing experiments or larger inputs.
  • Why it matters: understanding how to solve recurrences and obtain closed formulas is essential for writing efficient algorithms.

⏱️ Timing experiment and efficiency

⏱️ The timeit comparison

  • Dave uses SageMath's timeit command to measure execution time for both functions across n = 0, 5, 10, 15, 20, 25, 30.
  • Key observation:
    • s(n) stays roughly constant (around 44–50 microseconds per loop).
    • r(n) grows dramatically: 238 nanoseconds at n=0, 11.4 μs at n=5, 127 μs at n=10, 1.42 ms at n=15, 15.7 ms at n=20, 133 ms at n=25, and 1.49 seconds at n=30.
  • The excerpt notes that r takes "about 10 times as long to run for each increase of 5 in the value of n," indicating exponential growth.

🚀 Practical demonstration

  • The final test computes s(100) "almost instantly" and prints a large integer result: 85214290348675420878389493250277.
  • In contrast, the excerpt jokes that "getting a refill on their coffee would be a good way to pass the time waiting on r(40)."
  • Don't confuse: both functions are mathematically correct and produce the same answers; the difference is purely computational efficiency.

Example: for small n like 1, 4, or 10, both r and s seem equally fast to a human observer, but the timing data reveals the hidden exponential cost of the recursive approach.
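The specific r and s from the text are not reproduced in this excerpt, so here is a minimal stand-in sketch. The recurrence below is an assumption chosen so that its closed form has exactly the terms quoted in the previous section ((2/3)·n·2ⁿ, (5/9)·2ⁿ, (13/9)·(−1)ⁿ); the initial values 2, 1, 9 follow from that closed form.

```python
from fractions import Fraction

def r(n):
    # naive recursion: assumed recurrence r(n) = 3r(n-1) - 4r(n-3),
    # whose characteristic polynomial is (x - 2)^2 (x + 1)
    if n < 3:
        return (2, 1, 9)[n]          # initial conditions implied by the closed form
    return 3 * r(n - 1) - 4 * r(n - 3)

def s(n):
    # closed form (2/3)n·2^n + (5/9)·2^n + (13/9)·(-1)^n, kept exact with Fraction
    return Fraction(2, 3) * n * 2**n + Fraction(5, 9) * 2**n + Fraction(13, 9) * (-1)**n

# both agree on every n, but r makes two recursive calls per step (exponential time)
# while s is a constant number of arithmetic operations
assert all(r(n) == s(n) for n in range(15))
```

Timing these two with Sage's timeit (or Python's timeit module) reproduces the pattern in the table: s stays flat while r's running time grows exponentially.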

📝 Exercise categories

📝 Converting to advancement operator form (Exercise 1)

  • Task: rewrite recurrence equations using the advancement operator notation.
  • The excerpt lists six sub-problems with various recurrence forms, including:
    • Simple linear recurrences (e.g., r<sub>n+2</sub> = r<sub>n+1</sub> + 2r<sub>n</sub>).
    • Higher-order recurrences (e.g., a fourth-order equation relating r<sub>n+4</sub>, r<sub>n+3</sub>, r<sub>n+2</sub>, and r<sub>n</sub>).
    • Recurrences with non-constant terms (e.g., involving 3n or (−1)<sup>n</sup>).

🔢 Solving specific recurrences (Exercises 2–5, 7)

  • Exercises 2, 4, 7: solve recurrences with given initial conditions.
    • Example (Exercise 2): solve r<sub>n+2</sub> = r<sub>n+1</sub> + 2r<sub>n</sub> given r<sub>0</sub> = 1 and r<sub>2</sub> = 3 (note: r<sub>1</sub> is not specified).
    • Example (Exercise 4): solve h<sub>n+3</sub> = 6h<sub>n+2</sub> − 11h<sub>n+1</sub> + 6h<sub>n</sub> with three initial conditions.
  • Exercises 3, 5: find general solutions (no specific initial conditions).
    • Exercise 5 specifically asks for an explicit formula for the nth Fibonacci number f<sub>n</sub>.

⚙️ Advancement operator equations (Exercises 6, 8)

  • Exercise 6: give general solutions to six advancement operator equations in factored or polynomial form.
    • Examples include (A − 2)(A + 10)f = 0, (A² − 36)f = 0, and (A³ + A² − 5A + 3)f = 0.
  • Exercise 8: solve three advancement operator equations with repeated roots and multiple factors.
    • Example: (A − 4)³(A + 1)(A − 7)⁴(A − 1)²f = 0.

🔧 Nonhomogeneous equations (Exercise 9)

  • Task: find general solutions to ten nonhomogeneous advancement operator equations.
  • The right-hand side (forcing term) varies:
    • Exponential: 3<sup>n</sup>, 2<sup>n</sup>, (−1)<sup>n</sup>, 5<sup>n</sup>.
    • Polynomial: 3n² + 9n.
    • Mixed: 5<sup>n</sup>2<sup>n</sup>, 3<sup>n</sup>2<sup>n</sup> + 2<sup>n</sup>, 3<sup>n</sup> + 2n².
  • Example (9a): (A − 5)(A + 2)f = 3<sup>n</sup>.
  • Example (9c): (A − 3)³f = 3<sup>n</sup> + 1 (note the repeated root and the forcing term involving the same base).

🧩 Combinatorial applications

🧩 Ternary strings (Exercise 10)

  • Task: find and solve a recurrence for l<sub>n</sub>, the number of ternary strings of length n that do not contain "102" as a substring.
  • This requires identifying how valid strings of length n can be built from shorter valid strings.
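A brute-force count is a handy way to generate data before (or after) guessing a recurrence. The recurrence asserted at the end of this sketch, lₙ = 3lₙ₋₁ − lₙ₋₃, is my reconstruction (it relies on the pattern "102" not overlapping itself), not quoted from the text.

```python
from itertools import product

def l_brute(n):
    # count ternary strings of length n avoiding the substring "102"
    return sum("102" not in "".join(s) for s in product("012", repeat=n))

values = [l_brute(n) for n in range(9)]   # starts 1, 3, 9, 26, 75, ...

# conjectured recurrence (an assumption, not from the text): l(n) = 3l(n-1) - l(n-3)
assert all(values[n] == 3 * values[n - 1] - values[n - 3] for n in range(3, 9))
```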

🗼 Towers of Hanoi (Exercise 11)

The Towers of Hanoi puzzle: three pegs and n circular discs of different sizes, starting stacked largest-to-smallest on the leftmost peg; goal is to move them to the rightmost peg, one disc at a time, never placing a larger disc on a smaller one.

  • Let t<sub>n</sub> denote the fewest moves needed.
  • Task: determine an explicit formula for t<sub>n</sub>.
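The classical analysis gives tₙ = 2tₙ₋₁ + 1 (move the top n − 1 discs to the spare peg, move the largest disc, then re-stack the n − 1 discs), which solves to tₙ = 2ⁿ − 1. A quick sketch of that recurrence:

```python
def hanoi_moves(n):
    # t(n) = 2 t(n-1) + 1: clear n-1 discs to the spare peg,
    # move the largest disc across, then re-stack the n-1 discs on top of it
    return 0 if n == 0 else 2 * hanoi_moves(n - 1) + 1

# matches the closed form 2^n - 1
assert all(hanoi_moves(n) == 2**n - 1 for n in range(20))
```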

🆔 Database identifiers (Exercise 12)

  • A valid identifier of length n can be constructed in three ways:
    1. "A" followed by any valid identifier of length n − 1.
    2. One of six two-character strings ("1A", "1B", "1C", "1D", "1E", "1F") followed by any valid identifier of length n − 2.
    3. "0" followed by any ternary string (from {0, 1, 2}) of length n − 1.
  • Task: find a recurrence for l(n) (the number of identifiers of length n) and solve it for an explicit formula.
  • The excerpt notes that the empty string (length 0) counts as valid, so l(0) = 1.

🧱 Tiling with L-tiles (Exercise 13)

An L-tile is a 2×2 tile with the upper-right 1×1 square deleted; it may be rotated so the missing square is in any of the four positions.

  • Task: find a recursive formula for t<sub>n</sub>, the number of ways to tile a 2×n rectangle using 1×1 tiles and L-tiles, along with enough initial conditions; then find a closed formula.

🌳 RUBOTs and Catalan numbers (Exercises 19–20)

  • Exercise 19: count and draw rooted, unlabeled, binary, ordered trees (RUBOTs) with 6 leaves.
  • Exercise 20: develop a recurrence for the number l<sub>n</sub> of lattice paths from (0,0) to (n, n) that do not cross the diagonal y = x, similar to the RUBOT recurrence c<sub>n</sub> = sum from k=0 to n of c<sub>k</sub>c<sub>n−k</sub> for n ≥ 2.
  • The excerpt references Chapter 2, where Catalan numbers were first encountered counting lattice paths.
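A small dynamic program makes the lattice-path count concrete for checking small cases. Interpreting "do not cross the diagonal" as staying weakly on one side of y = x (an assumption about the exercise's convention), the counts are the Catalan numbers C(2n, n)/(n + 1):

```python
from math import comb

def lattice_paths(n):
    # ways[x][y]: paths from (0,0) to (x,y) using unit right/up steps, kept with y <= x
    ways = [[0] * (n + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for x in range(n + 1):
        for y in range(x + 1):               # only points on or below the diagonal
            if x == 0 and y == 0:
                continue
            from_left = ways[x - 1][y] if x > 0 and y <= x - 1 else 0
            from_below = ways[x][y - 1] if y > 0 else 0
            ways[x][y] = from_left + from_below
    return ways[n][n]

# agrees with the Catalan numbers 1, 1, 2, 5, 14, 42, ...
assert [lattice_paths(n) for n in range(7)] == [comb(2 * n, n) // (n + 1) for n in range(7)]
```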

🔬 Advanced techniques

🔬 Proofs and lemmas (Exercise 14)

  • Task: prove Lemma 9.22 about advancement operator equations with repeated roots.
  • This is a theoretical exercise requiring rigorous justification of the solution form when characteristic roots have multiplicity greater than 1.

🌀 Generating functions (Exercises 15–18)

  • Exercises 15–17: use generating functions to solve specific recurrences with given initial conditions.
    • Example (Exercise 15): solve r<sub>n</sub> = r<sub>n−1</sub> + 6r<sub>n−2</sub> for n ≥ 2 with r<sub>0</sub> = 1, r<sub>1</sub> = 3.
    • Example (Exercise 16): solve a<sub>n+3</sub> = 5a<sub>n+2</sub> − 7a<sub>n+1</sub> + 3a<sub>n</sub> + 2<sup>n</sup> for n ≥ 0 with a<sub>0</sub> = 0, a<sub>1</sub> = 2, a<sub>2</sub> = 5.
  • Exercise 18: use generating functions to find a closed formula for the Fibonacci numbers f<sub>n</sub>.
  • Why generating functions: they provide an alternative method to solve recurrences, especially useful for nonhomogeneous equations and for deriving closed formulas.

An Introduction to Probability

10.1 An Introduction to Probability

🧭 Overview

🧠 One-sentence thesis

Probability spaces provide a formal framework for quantifying the likelihood of outcomes and events, enabling precise reasoning about games of chance, sampling, and conditional scenarios.

📌 Key points (3–5)

  • What a probability space is: a finite set of outcomes paired with a probability measure function that assigns likelihoods to events (subsets).
  • How probabilities combine: if two events cannot happen together (disjoint), their combined probability is the sum of their individual probabilities.
  • Outcomes vs events: outcomes are individual elements; events are subsets; the probability of an event is the sum of probabilities of its outcomes.
  • Common confusion: distinguishing between "equally likely" outcomes (like a fair die) and "unequally likely" outcomes (like sums of two dice, where different sums have different probabilities).
  • Conditional probability: the probability of one event given that another has occurred, calculated by dividing the probability of both events by the probability of the given event.

🎲 Probability spaces and their structure

🎲 What a probability space is

A probability space is a pair (S, P) where S is a finite set and P is a function whose domain is the family of all subsets of S and whose values lie in the interval [0, 1] of real numbers.

  • S is the set of all possible outcomes.
  • P assigns a number between 0 and 1 to every subset of S.
  • Two key properties must hold:
    • P(empty set) = 0 and P(S) = 1 (the probability of nothing happening is zero; the probability of something happening is one).
    • If A and B are disjoint subsets (they share no elements), then P(A or B) = P(A) + P(B).

📊 Terminology

| Term | Definition | Notes |
| --- | --- | --- |
| Probability measure | The function P | Maps events to [0, 1] |
| Event | Any subset of S | Can contain one or many outcomes |
| Outcome (or elementary outcome) | An individual element x in S | The basic building blocks |
| Probability of an outcome | P(x) = P({x}) | The likelihood of that single outcome |
| Probability of an event E | P(E) = sum of P(x) for all x in E | Add up probabilities of all outcomes in the event |

🔑 Key insight: from outcomes to events

  • If you know P(x) for every outcome x in S, you can calculate P(E) for any event E.
  • Why: by the addition property, P(E) equals the sum of P(x) for all x in E.
  • Example: If S = {1, 2, 3, 4} with P(1) = 1/8, P(2) = 2/8, P(3) = 1/8, P(4) = 4/8, then P({2, 3}) = 2/8 + 1/8 = 3/8.

🎯 Examples of probability spaces

🎯 Spinner with unequal regions

  • The excerpt describes a spinner with five regions.
  • S = {1, 2, 3, 4, 5}.
  • Probabilities: P(1) = P(3) = P(4) = 1/8, P(2) = 2/8 = 1/4, P(5) = 3/8.
  • Observers note: "odds of landing in region 1 are the same as region 3" (both 1/8); "twice as likely to land in region 2 as in region 4" (2/8 vs 1/8).
  • This illustrates that outcomes need not be equally likely.

⚖️ Equally likely outcomes

  • When all outcomes have the same probability, the setup is simpler.
  • If S has n elements and all are equally likely, then P(x) = 1/n for each x.
  • For any event E, P(E) = |E| / n (the size of E divided by the size of S).
  • Example: A fair six-sided die has S = {1, 2, 3, 4, 5, 6} with P(i) = 1/6 for each i.

🎲 One die vs two dice

Single die:

  • S = {1, 2, 3, 4, 5, 6}, each outcome has probability 1/6.

Two dice (sum of dots):

  • S = {2, 3, 4, ..., 11, 12}.
  • Outcomes are not equally likely.
  • The excerpt explains: treat the two dice as distinguished (e.g., one red, one blue).
  • There are 36 equally likely pairs (i, j) with 1 ≤ i, j ≤ 6, each with probability 1/36.
  • Different sums correspond to different numbers of pairs:
    • Sum = 2: only (1,1) → P(2) = 1/36.
    • Sum = 4: pairs (3,1), (2,2), (1,3) → P(4) = 3/36 = 1/12.
    • Sum = 7: six pairs → P(7) = 6/36.
  • Don't confuse: the sum outcomes are not equally likely even though the underlying pairs are.

🎰 Alice's game: differences of two dice

  • S = {0, 1, 2, 3, 4, 5} (possible differences).
  • Probabilities: P(0) = 6/36, P(1) = 10/36, P(2) = 8/36, P(3) = 6/36, P(4) = 4/36, P(5) = 2/36.
  • Compact notation: P(0) = 1/6 and P(d) = 2(6 - d)/36 when d > 0.
  • This shows how combinatorics (counting pairs) determines probabilities.
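Both the sum and the difference distributions fall out of the same enumeration of the 36 equally likely ordered pairs:

```python
from collections import Counter
from fractions import Fraction

pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # (red die, blue die)
sums = Counter(i + j for i, j in pairs)
diffs = Counter(abs(i - j) for i, j in pairs)

prob = lambda counts, v: Fraction(counts[v], 36)

assert prob(sums, 2) == Fraction(1, 36)      # only (1,1)
assert prob(sums, 4) == Fraction(1, 12)      # (3,1), (2,2), (1,3)
assert prob(sums, 7) == Fraction(1, 6)       # six pairs
# Alice's game: P(0) = 1/6 and P(d) = 2(6 - d)/36 for d > 0
assert prob(diffs, 0) == Fraction(1, 6)
assert all(prob(diffs, d) == Fraction(2 * (6 - d), 36) for d in range(1, 6))
```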

🔵 Marbles: sampling without replacement

  • A jar has 20 marbles: 6 red, 9 blue, 5 green.
  • Three marbles are selected at random (without replacement).
  • Let X, with possible values {0, 1, 2, 3, 4, 5}, be the number of blue marbles among the three selected.
  • Probability: P(i) = C(9, i) × C(11, 3 - i) / C(20, 3) for i = 0, 1, 2, 3.
  • P(4) = P(5) = 0 (impossible to draw 4 or 5 blue marbles when only 3 are selected).
  • Discussion: Bob questions outcomes with probability zero; Carlos says they make sense (they represent impossible events within the model).
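This is a hypergeometric distribution; a short check with exact arithmetic (a sketch, with the helper name p_blue my own):

```python
from math import comb
from fractions import Fraction

def p_blue(i):
    # probability of exactly i blue marbles when 3 are drawn
    # from 9 blue and 11 non-blue marbles (20 total)
    if i < 0 or i > 3:
        return Fraction(0)       # e.g. P(4) = P(5) = 0: impossible outcomes
    return Fraction(comb(9, i) * comb(11, 3 - i), comb(20, 3))

# the probabilities over all listed outcomes sum to 1
assert sum(p_blue(i) for i in range(6)) == 1
assert p_blue(4) == 0 and p_blue(5) == 0
```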

🃏 Full house in poker

  • Five cards drawn from a standard 52-card deck (four suits, 13 values each).
  • A full house: three cards of one value and two cards of another value (e.g., three kings and two eights).
  • Probability = (13 choose 1) × (12 choose 1) × (4 choose 3) × (4 choose 2) / (52 choose 5) ≈ 0.00144.
  • This illustrates how combinatorial counting (choosing values and suits) determines probabilities in card games.
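The count in the numerator can be verified directly:

```python
from math import comb
from fractions import Fraction

# choose the triple's value, the pair's value, then the suits for each
full_houses = comb(13, 1) * comb(12, 1) * comb(4, 3) * comb(4, 2)
p = Fraction(full_houses, comb(52, 5))

assert full_houses == 3744
assert abs(float(p) - 0.00144) < 1e-5     # about 1 hand in 694
```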

🔀 Conditional probability

🔀 What conditional probability measures

The probability of A, given B, denoted P(A | B), is defined by P(A | B) = P(A ∩ B) / P(B), where P(B) > 0.

  • This answers: "If we know event B has occurred, what is the probability that event A also occurred?"
  • The formula adjusts the probability by focusing only on the subset of outcomes where B is true.

🔴 Marble example: updating beliefs

  • Scenario: A jar has 20 marbles (6 red, 9 blue, 5 green). Xing blindly selects two marbles, one for each pocket.
  • Before looking: P(left pocket is red) = 6/20.
  • After discovering the right pocket is blue: intuition says the probability of left being red should be slightly higher.
  • Formal calculation:
    • Let A = "left pocket is red," B = "right pocket is blue."
    • P(B) = 9/20.
    • P(A ∩ B) = (9 × 6) / 380 (9 ways to pick blue for right, 6 ways to pick red for left, out of 20 × 19 ordered pairs).
    • P(A | B) = (54/380) / (9/20) = 54/380 × 20/9 = 6/19.
    • 6/19 is indeed slightly larger than 6/20.
  • Why it changes: knowing one marble is blue removes one blue marble from the pool, leaving 19 marbles with 6 still red, so the proportion increases.
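The 6/19 answer can also be confirmed by brute-force enumeration of the 20 × 19 ordered (left, right) outcomes:

```python
from fractions import Fraction

colors = ["R"] * 6 + ["B"] * 9 + ["G"] * 5           # the 20 marbles
# ordered outcomes: (left pocket, right pocket), two distinct marbles
outcomes = [(l, r) for l in range(20) for r in range(20) if l != r]

B = [o for o in outcomes if colors[o[1]] == "B"]     # right pocket is blue
AB = [o for o in B if colors[o[0]] == "R"]           # ...and left pocket is red

assert Fraction(len(B), len(outcomes)) == Fraction(9, 20)     # P(B)
assert Fraction(len(AB), len(B)) == Fraction(6, 19)           # P(A | B) = P(A ∩ B)/P(B)
```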

⚠️ Don't confuse: unconditional vs conditional

  • Unconditional probability P(A) considers all possible outcomes.
  • Conditional probability P(A | B) restricts attention to outcomes where B is true.
  • Example: P(left red) = 6/20 overall, but P(left red | right blue) = 6/19 after conditioning.



10.2 Conditional Probability and Independent Events

🧭 Overview

🧠 One-sentence thesis

Conditional probability measures how the probability of one event changes when we know another event has occurred, and two events are independent if knowing one does not change the probability of the other.

📌 Key points (3–5)

  • What conditional probability measures: the probability of event A given that event B has already occurred, calculated as P(A ∩ B) / P(B).
  • How to interpret: when new information (event B) is revealed, the probability of A may shift from its original value P(A) to the conditional value P(A | B).
  • Independence definition: events A and B are independent if P(A ∩ B) = P(A) × P(B), meaning knowing B does not change the probability of A.
  • Common confusion: independence vs dependence—if P(A | B) ≠ P(A), the events are dependent; if they are equal, the events are independent.
  • Why it matters: conditional probability helps solve real-world problems where partial information is revealed step-by-step.

🎲 Understanding conditional probability

🎲 The basic formula

Probability of A given B, denoted P(A | B): P(A | B) = P(A ∩ B) / P(B), where P(B) > 0.

  • This formula adjusts the probability of A by restricting the sample space to only those outcomes where B has occurred.
  • The numerator P(A ∩ B) is the probability that both A and B happen.
  • The denominator P(B) normalizes by the probability that B happens at all.

🧪 The marble example (Xing's pockets)

  • Setup: A jar contains 20 marbles: 6 red, 9 blue, 5 green. Xing draws two marbles without replacement, one for each pocket.
  • Before new information: The probability that the left pocket marble is red is 6/20.
  • After new information: Xing checks his right pocket and finds a blue marble. Now the probability that the left pocket marble is red is P(A | B).
    • Let A = left pocket is red, B = right pocket is blue.
    • P(B) = 9/20.
    • P(A ∩ B) = (9 × 6) / 380 (the probability of drawing one blue and one red in that order).
    • P(A | B) = (54/380) / (9/20) = 6/19, which is slightly larger than 6/20.
  • Why it increases: Removing one blue marble from the jar leaves proportionally more red marbles available for the left pocket.

🏺 Two-jar example (partitioning the sample space)

  • Setup: Jar 1 has 20 marbles (6 red, 9 blue, 5 green); Jar 2 has 18 marbles (9 red, 5 blue, 4 green). A jar is chosen at random, then two marbles are drawn.
  • Question: What is the probability that both marbles are green?
  • Solution using conditional probability:
    • Let G = both marbles are green, J₁ = marbles from Jar 1, J₂ = marbles from Jar 2.
    • G = (G ∩ J₁) ∪ (G ∩ J₂), and these two events are disjoint.
    • P(G | J₁) = (5 choose 2) / (20 choose 2), P(G | J₂) = (4 choose 2) / (18 choose 2).
    • P(J₁) = P(J₂) = 1/2.
    • P(G ∩ Jᵢ) = P(Jᵢ) × P(G | Jᵢ) for each i.
    • P(G) = (1/2) × (20/380) + (1/2) × (12/306) ≈ 4.6%.
  • Key insight: Conditional probability allows us to break down a complex problem by conditioning on which jar was chosen.
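The conditioning computation is easy to reproduce exactly (variable names here are my own):

```python
from math import comb
from fractions import Fraction

half = Fraction(1, 2)
p_g_given_j1 = Fraction(comb(5, 2), comb(20, 2))    # both green, given jar 1
p_g_given_j2 = Fraction(comb(4, 2), comb(18, 2))    # both green, given jar 2

# law of total probability: P(G) = P(J1) P(G | J1) + P(J2) P(G | J2)
p_G = half * p_g_given_j1 + half * p_g_given_j2

assert p_G == Fraction(89, 1938)
assert round(float(p_G), 3) == 0.046                # about 4.6%
```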

🔗 Independent events

🔗 Definition and interpretation

Events A and B are independent if P(A ∩ B) = P(A) × P(B).

  • Equivalent condition: When P(B) ≠ 0, A and B are independent if and only if P(A) = P(A | B).
  • Meaning: Knowing that B occurred does not change the probability of A.
  • Dependent events: Events that are not independent are called dependent.

🔴 Example: Xing's marbles are dependent

  • A = left pocket is red, B = right pocket is blue.
  • We calculated P(A | B) = 6/19, but P(A) = 6/20.
  • Since P(A | B) ≠ P(A), the two events are dependent.
  • Intuition: Drawing without replacement means the composition of remaining marbles changes, so the two draws affect each other.

🟢 Example: Two jars, one marble (dependent)

  • Setup: One jar is chosen at random, one marble is drawn. Let A = second jar is chosen, B = marble is green.
  • P(A) = 1/2.
  • P(B) = (1/2) × (5/20) + (1/2) × (4/18).
  • P(A ∩ B) = (1/2) × (4/18).
  • P(A ∩ B) ≠ P(A) × P(B), so A and B are not independent.
  • Intuition: Once you know the marble is green, it is more likely you chose the first jar (which has a higher proportion of green marbles).

🎲 Example: Dice rolls (independent)

  • Setup: Roll one red die and one blue die. Let A = red die shows 3 or 5, B = doubles (both dice show the same number).
  • P(A) = 2/6.
  • P(B) = 6/36 (there are 6 doubles: (1,1), (2,2), ..., (6,6)).
  • P(A ∩ B) = 2/36 (the doubles (3,3) and (5,5) satisfy A).
  • P(A ∩ B) = (2/36) = (2/6) × (6/36) = P(A) × P(B), so A and B are independent.
  • Intuition: The outcome of one die does not affect the outcome of the other die.
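The independence identity can be checked by listing all 36 outcomes:

```python
from fractions import Fraction

rolls = [(r, b) for r in range(1, 7) for b in range(1, 7)]
A = {o for o in rolls if o[0] in (3, 5)}        # red die shows 3 or 5
B = {o for o in rolls if o[0] == o[1]}          # doubles

pr = lambda E: Fraction(len(E), len(rolls))

assert pr(A & B) == Fraction(2, 36)             # (3,3) and (5,5)
assert pr(A & B) == pr(A) * pr(B)               # A and B are independent
```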

🧮 How to distinguish independence

| Condition | What it means | Example |
| --- | --- | --- |
| P(A ∩ B) = P(A) × P(B) | A and B are independent | Dice rolls: outcome of one die does not affect the other |
| P(A \| B) = P(A) | Knowing B does not change A's probability | Same as above |
| P(A ∩ B) ≠ P(A) × P(B) | A and B are dependent | Drawing without replacement: first draw affects second |
| P(A \| B) ≠ P(A) | Knowing B changes A's probability | Xing's marbles: knowing right pocket is blue increases left pocket red probability |

🚫 Don't confuse

  • Disjoint vs dependent: Disjoint events (A ∩ B = ∅) are usually dependent, because if A occurs, B cannot occur, so P(B | A) = 0 ≠ P(B) (unless P(B) = 0).
  • Independence is symmetric: If A is independent of B, then B is independent of A.
  • Conditional probability is not symmetric: P(A | B) is generally not equal to P(B | A).

Bernoulli Trials

10.3 Bernoulli Trials

🧭 Overview

🧠 One-sentence thesis

Bernoulli trials model repeated experiments where each trial has exactly two outcomes with constant success probability, allowing us to calculate the probability of any specific sequence of successes and failures.

📌 Key points (3–5)

  • What Bernoulli trials are: repeated experiments with only two outcomes (success/failure) where the probability of success remains constant across all trials.
  • Key requirement: the probability of success on any individual trial is exactly p, no matter how many times the experiment is repeated.
  • How to calculate probabilities: for n trials with i successes, the probability is (n choose i) × p^i × (1 − p)^(n − i).
  • Common confusion: even with a fair coin (p = 1/2), getting exactly 50 heads in 100 tosses is not very likely (~7.96%); outcomes cluster near the expected value but rarely hit it exactly.
  • Replacement matters: the marble example shows that putting the marble back (with replacement) keeps probabilities constant, which is essential for Bernoulli trials.

🎲 The setup: constant probability across trials

🎲 The marble jar example

  • A jar contains 7 marbles: 4 red and 3 blue.
  • When you draw one marble at random:
    • Probability of red = 4/7 = p
    • Probability of blue = 3/7 = 1 − p
  • Crucial step: the marble is put back in the jar and the marbles are stirred before the next draw.
  • Because of replacement, the probability of getting red on the second trial is again 4/7, and this pattern holds no matter how many times you repeat the experiment.

📋 Formal definition

Bernoulli trials: an experiment with only two outcomes—success and failure—where the probability of success is p and the probability of failure is 1 − p. Most importantly, when the experiment is repeated, the probability of success on any individual test is exactly p.

  • The two outcomes can be labeled in many ways: success/failure, heads/tails, up/down, good/bad, forwards/backwards, red/blue, etc.
  • The key is that the probability does not change from trial to trial.

🧮 Calculating probabilities for n trials

🧮 The general formula

  • Fix a positive integer n and repeat the experiment n times.
  • The outcomes are binary strings of length n from the alphabet {S, F} (success and failure).
  • If a particular string x has i successes and n − i failures, then P(x) = p^i × (1 − p)^(n − i).
  • The probability of getting exactly i successes, in any order, is therefore (n choose i) × p^i × (1 − p)^(n − i).
  • The binomial coefficient (n choose i) accounts for all the different orders in which i successes can occur among n trials.

🎯 Die rolling example

Example: Roll a die and call it a success if the result is a 2 or a 5.

  • Probability of success p = 2/6 = 1/3
  • Probability of failure = 2/3
  • If the die is rolled 10 times, the probability of getting exactly 4 successes is:
    • C(10, 4) × (1/3)^4 × (2/3)^6
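The binomial formula is one line of code (the helper name bernoulli is my own):

```python
from math import comb
from fractions import Fraction

def bernoulli(n, i, p):
    # probability of exactly i successes in n trials with success probability p
    return comb(n, i) * p**i * (1 - p)**(n - i)

p = Fraction(1, 3)                 # success: the die shows a 2 or a 5
p4 = bernoulli(10, 4, p)           # C(10,4) (1/3)^4 (2/3)^6

assert p4 == Fraction(comb(10, 4) * 2**6, 3**10)
assert sum(bernoulli(10, i, p) for i in range(11)) == 1
```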

🪙 Fair coin example

Example: A fair coin is tossed 100 times.

  • Probability of getting heads 40 times and tails 60 times is:
    • (100 choose 40) × (1/2)^40 × (1/2)^60 = (100 choose 40) / 2^100

🤔 Common misconceptions about "expected" outcomes

🤔 The 50-50 fallacy

Scenario: Bob says that if a fair coin is tossed 100 times, it is fairly likely you will get exactly 50 heads and 50 tails.

  • Dave is skeptical.
  • Carlos calculates that the probability of getting exactly 50 heads in 100 tosses is approximately 0.079589 (about 7.96%).
  • Don't confuse: "expected value" does not mean "most likely exact outcome." Even though the expected number of heads is 50, the probability of getting exactly 50 is not very high.
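Carlos's number is easy to reproduce:

```python
from math import comb

# probability of exactly 50 heads in 100 tosses of a fair coin
p_50 = comb(100, 50) / 2**100

assert round(p_50, 6) == 0.079589     # about 7.96%, as Carlos reports
```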

📊 Clustering near the expected value

| Observation | What it means |
| --- | --- |
| Exact match is unlikely | Getting exactly 50 heads in 100 tosses has only ~8% probability |
| Range is more likely | There is a 99% chance the number of heads is between 20 and 80 |
| Large n behavior | When n is very large, it becomes increasingly certain that the number of heads will be close to n/2 |

  • Xing reports that you have a 99% chance that the number of heads is at least 20 and at most 80 (out of 100 tosses).
  • Carlos adds that when n is very large, the number of heads in n tosses will be close to n/2.
  • Dave asks: "What do you mean by close, and what do you mean by very large?" (The excerpt does not answer this, but it highlights that these terms need careful definition.)

🎯 Key takeaway

  • Outcomes cluster around the expected value but rarely hit it exactly.
  • The larger the number of trials, the more concentrated the distribution becomes around the expected proportion, but individual exact outcomes remain relatively rare.

Discrete Random Variables

10.4 Discrete Random Variables

🧭 Overview

🧠 One-sentence thesis

Discrete random variables map outcomes to real numbers and their expectation (mean) provides a formal measure of what average behavior to expect from repeated trials, enabling us to determine whether games are fair or biased.

📌 Key points (3–5)

  • What a random variable is: a function that maps outcomes in a probability space to real numbers (positive, negative, or zero).
  • What expectation measures: the expected value (mean) is the weighted average of all possible outcomes, calculated by summing each value times its probability.
  • Linearity of expectation: the expectation of a sum of random variables equals the sum of their individual expectations.
  • Common confusion: expectation does not have to be an integer or even a possible outcome—it represents the long-run average, not a single trial result.
  • Why it matters for fairness: a game is fair if the cost to play equals the expected value; otherwise, one side has an advantage.

🎲 What random variables are

🎲 Definition and structure

Random variable: any function X that maps outcomes in a probability space (S, P) to real numbers (all values allowed: positive, negative, and zero).

  • The excerpt uses capital letters (X, Y) for historical reasons, though they are just functions.
  • The function takes each outcome in the sample space S and assigns it a numerical value.
  • Example: for a spinner with numbered regions, X(i) = i squared assigns the square of the region number to each outcome.

📊 How expectation is calculated

Expectation (also called mean or expected value), denoted E(X): the quantity sum over all x in S of X(x) times P(x).

  • Because S is finite, the formula can be rewritten as: sum over all values y of y times the probability that X(x) equals y.
  • The expectation is a weighted average: each possible value is multiplied by its probability, then all products are summed.
  • Example: for the spinner where X(i) = i squared, E(X) = 1 squared times 1/8 + 2 squared times 2/8 + 3 squared times 1/8 + 4 squared times 1/8 + 5 squared times 3/8 = 109/8 = 13.625.
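The weighted average is a one-line sum with exact fractions:

```python
from fractions import Fraction

# spinner probabilities for regions 1..5
P = {1: Fraction(1, 8), 2: Fraction(2, 8), 3: Fraction(1, 8),
     4: Fraction(1, 8), 5: Fraction(3, 8)}

# X(i) = i^2, so E(X) = sum of i^2 * P(i)
EX = sum(i**2 * p for i, p in P.items())

assert EX == Fraction(109, 8)
assert float(EX) == 13.625
```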

🎯 What expectation means

🎯 Long-run average interpretation

  • If you repeat the experiment n times and record outcomes (i₁, i₂, ..., iₙ), the total value should be close to n times E(X).
  • Example: if the spinner is spun n times and a prize worth i squared is awarded each time, the total prize should be approximately 13.625n.
  • Don't confuse: the expected value 13.625 is not a possible outcome of a single trial (all outcomes are perfect squares), but it represents the average over many trials.

⚖️ Fairness of games

  • A game is fair if the cost to play equals the expected value of the winnings.
  • If the cost is less than the expected value, the player has an unfair advantage.
  • If the cost is more than the expected value, the game is biased against the player.
  • Example: paying 13.625 cents to play the spinner game (where you win i squared pennies) makes it fair; paying less favors the player, paying more favors the house.

🎰 Real-world applications

| Context | Expected return per dollar | Implication |
| --- | --- | --- |
| State lotteries | Approximately 50 cents | Far from fair; only play if you want to support the cause |
| Casino games | Slightly less than 90 cents | Still a loss, but closer to fair; you pay for the entertainment experience |

  • The excerpt notes that lotteries finance scholarships or public enterprises but have poor expected value.
  • Casinos use the edge (expected loss per dollar) to fund their buildings and operations.

🔢 Key properties and examples

➕ Linearity of expectation

Proposition: If X₁, X₂, ..., Xₙ are random variables on a probability space (S, P), then E(X₁ + X₂ + ... + Xₙ) = E(X₁) + E(X₂) + ... + E(Xₙ).

  • This property is an immediate consequence of the definition.
  • It is fundamental for calculations involving multiple random variables.
  • The excerpt emphasizes its importance for discussions that follow.

🎲 Bernoulli trials expectation

  • Consider n Bernoulli trials with probability p of success, and let X count the number of successes.
  • The expectation is: E(X) = sum from i=0 to n of i times (n choose i) times p to the i times (1 - p) to the (n - i) = np.
  • The excerpt derives this using calculus: expand f(x) = [px + (1 - p)] to the n using the binomial theorem, take the derivative, and evaluate at x = 1.
  • Example interpretation: if you flip a fair coin 100 times, the expected number of heads is 100 times 1/2 = 50 (though getting exactly 50 heads has probability only about 0.08, not very likely at all).
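Rather than calculus, the sum can simply be evaluated exactly for particular n and p to confirm E(X) = np (a numerical check, not a proof):

```python
from math import comb
from fractions import Fraction

def expected_successes(n, p):
    # E(X) = sum over i of i * C(n, i) * p^i * (1 - p)^(n - i)
    return sum(i * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))

assert expected_successes(100, Fraction(1, 2)) == 50               # fair coin, 100 tosses
assert expected_successes(10, Fraction(1, 3)) == Fraction(10, 3)   # equals np
```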

🎲 Alice and Bob's game

  • Sample space S = {0, 1, 2, 3, 4, 5} represents the difference when two dice are rolled.
  • Define X(d) = d − 2, which records Bob's winnings (negative means Alice wins).
  • E(X) = sum from d=0 to 5 of X(d) times p(d) = -2 times 1/6 + (-1) times 10/36 + 0 times 8/36 + 1 times 6/36 + 2 times 4/36 + 3 times 2/36 = -2/36.
  • Bob should expect to lose slightly more than a nickel (about 5.56 cents) each time the game is played.
  • The excerpt concludes: Alice likes to play this game repeatedly; Bob should refuse.
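Enumerating the 36 rolls confirms the total. Here Bob's winnings on difference d are taken to be d − 2 dollars (an assumption consistent with the −2/36 expectation and with Bob losing when the difference is small):

```python
from fractions import Fraction

rolls = [(a, b) for a in range(1, 7) for b in range(1, 7)]
# Bob's winnings when the difference is d, taken here as d - 2 dollars
E = Fraction(sum(abs(a - b) - 2 for a, b in rolls), 36)

assert E == Fraction(-2, 36)      # about -5.56 cents per game for Bob
```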

🧮 Precision and probability

🧮 What "close" and "large" mean

  • The excerpt discusses a fair coin tossed 100 times: the probability of exactly 50 heads is about 0.08 (not very likely).
  • However, there is a 99% chance that the number of heads is between 20 and 80.
  • For every epsilon greater than 0, there exists some n₀ (depending on epsilon) such that if n is greater than n₀, the probability that the average winnings per trial is within epsilon of the expected value is at least 1 - epsilon.
  • This statement gives a precise meaning to "close" and "large" in the context of repeated trials.
  • Don't confuse: a single trial can deviate significantly from the expected value, but the average over many trials converges to it.

Central Tendency

10.5 Central Tendency

🧭 Overview

🧠 One-sentence thesis

Central tendency measures help us understand how likely a random variable is to stray from its expected value, distinguishing situations where outcomes far from the mean are surprising from those where they are not.

📌 Key points (3–5)

  • What central tendency addresses: two situations can have the same expected value but very different likelihoods of deviating far from that expectation.
  • Variance and standard deviation: variance measures the expected squared distance from the mean; standard deviation is its square root and provides a natural scale for measuring deviation.
  • Markov's inequality: a basic bound showing that the probability of a random variable being at least k is at most the expected value divided by k.
  • Chebyshev's inequality: a stronger result that uses variance to bound how far a random variable can stray from its expected value in terms of standard deviations.
  • Common confusion: expected value alone does not tell you how concentrated outcomes are—variance and standard deviation capture that additional information.

🎲 The motivating contrast

🎟️ Lottery vs coin tosses

The excerpt presents two situations, both with sample space {0, 1, 2, ..., 10,000} and expected value 5,000:

| Situation | Description | Reaction to outcome ≥ 7,500 |
| --- | --- | --- |
| Lottery | 10,001 tickets sold; mayor draws one winning number | Not very surprising; citizens are curious but not alarmed |
| Coin tosses | Fair coin tossed 10,000 times; count heads | "What? Are you out of your mind?"—seems extraordinary |

  • Both have the same expected value (5,000), yet one outcome far from the mean seems normal while the other seems impossible.
  • This intuitive difference is what central tendency captures mathematically.

🎯 Why expected value is not enough

  • Expected value tells you the "center" but not how tightly outcomes cluster around it.
  • The lottery has uniform probability—every outcome is equally likely—so large deviations are not unusual.
  • Coin tosses follow a binomial distribution—outcomes concentrate strongly near the mean, so large deviations are extremely rare.

📏 Markov's Inequality

📐 The basic bound

Markov's Inequality: For any random variable X in a probability space (S, P) and any k > 0, the probability that the absolute value of X is at least k is at most the expected value of the absolute value of X divided by k.

In symbols: P(|X| ≥ k) ≤ E(|X|) / k.

  • This is a "trivial" result (the excerpt's term) but provides a starting point.
  • It uses only the expected value, not variance.

🧮 Application to the two situations

  • For both the lottery and coin tosses, E(X) = 5,000.
  • Applying Markov's inequality with k = 7,500: P(X ≥ 7,500) ≤ 5000/7500 = 2/3.
  • So in both cases, the probability is at most 2/3—"nothing alarming here in either case."
  • Don't confuse: Markov's inequality gives the same bound for both situations, so it cannot explain why one outcome seems extraordinary and the other does not. A more subtle measure is needed.
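
These numbers are easy to check directly. The sketch below (not from the excerpt) computes the exact expectation, tail probability, and Markov bound for the uniform lottery using exact fractions:

```python
from fractions import Fraction

# Uniform lottery over {0, 1, ..., 10000}: each outcome has probability 1/10001.
outcomes = range(10001)
p = Fraction(1, 10001)

expected = sum(x * p for x in outcomes)        # E(X) = 5000
tail = sum(p for x in outcomes if x >= 7500)   # exact P(X >= 7500)
markov_bound = expected / 7500                 # E(X)/k with k = 7500

print(expected, tail, markov_bound)
```

The exact tail probability (2501/10001, roughly 1/4) sits comfortably under the Markov bound of 2/3, illustrating how loose the bound can be.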

📊 Variance and Standard Deviation

📦 Definitions

Variance of a random variable X: var(X) = E((X − E(X))²), the expected value of the squared distance from the mean.

Standard deviation of X: σ_X = √var(X), the square root of the variance.

  • Variance is always non-negative (it's an expected squared quantity).
  • Standard deviation has the same units as X, making it easier to interpret.
  • These measure how spread out the distribution is around the expected value.

🎡 Spinner example

The excerpt revisits a spinner with regions 1, 2, 3, 4, 5 and probabilities 1/8, 1/4, 1/8, 1/8, 3/8. Let X(i) = i² when the pointer stops in region i.

  • E(X) = 109/8 (calculated earlier in the text).
  • Variance calculation:
    • var(X) = (1² − 109/8)² · 1/8 + (2² − 109/8)² · 1/4 + (3² − 109/8)² · 1/8 + (4² − 109/8)² · 1/8 + (5² − 109/8)² · 3/8
    • This simplifies to 48632/512 = 6079/64 ≈ 94.98.
  • Standard deviation: σ_X = √(6079/64) ≈ 9.746.

🎲 Bernoulli trials

For n Bernoulli trials with success probability p, let X count the number of successes.

  • E(X) = np (established earlier).
  • Variance formula: var(X) = np(1 − p).
  • The excerpt mentions two approaches to deriving this:
    1. Direct calculation using the definition and second derivatives.
    2. Using independence: if X₁, X₂, ..., Xₙ are independent random variables, then var(X₁ + X₂ + ... + Xₙ) = var(X₁) + var(X₂) + ... + var(Xₙ).

🔧 Computational shortcut

Proposition: For any random variable X, var(X) = E(X²) − E²(X).

In words: variance equals the expected value of X squared minus the square of the expected value.

  • This is often easier to compute than the definition.
  • Proof outline: expand (r − μ)² where μ = E(X), sum over all outcomes weighted by probabilities, and simplify using linearity of expectation.
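
The spinner example gives a concrete way to test the shortcut. A minimal check in Python (not from the excerpt), computing the variance both from the definition and from E(X²) − E²(X) with exact fractions:

```python
from fractions import Fraction

# Spinner: region i has probability probs[i]; the payoff is X = i**2.
probs = {1: Fraction(1, 8), 2: Fraction(1, 4), 3: Fraction(1, 8),
         4: Fraction(1, 8), 5: Fraction(3, 8)}

mu = sum(i**2 * p for i, p in probs.items())                 # E(X)
var_def = sum((i**2 - mu)**2 * p for i, p in probs.items())  # E((X - E(X))^2)
ex2 = sum((i**2)**2 * p for i, p in probs.items())           # E(X^2)
var_short = ex2 - mu**2                                      # E(X^2) - E(X)^2

print(mu, var_def, var_short, float(var_def) ** 0.5)
```

Both routes produce the same fraction, and the square root of the variance gives the standard deviation.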

🎯 Chebyshev's Inequality

🎯 The stronger bound

Chebyshev's Inequality: Let X be a random variable with expectation μ = E(X) and standard deviation σ_X. For any k > 0, the probability that X differs from μ by at least k standard deviations is at most 1/k².

In symbols: prob(|X − E(X)| ≥ k·σ_X) ≤ 1/k², or equivalently, prob(|X − E(X)| < k·σ_X) ≥ 1 − 1/k².

  • This uses variance/standard deviation, not just expected value.
  • It gives a much tighter bound when the standard deviation is small relative to the mean.

🔍 Proof idea

  • Let A = {r : |r − μ| ≥ k·σ_X}, the set of outcomes far from the mean.
  • The variance is at least the contribution from outcomes in A: var(X) ≥ Σ_{r in A} (r − μ)² · prob(X = r).
  • For r in A, (r − μ)² ≥ k²·σ_X², so var(X) ≥ k²·σ_X² · prob(X in A).
  • Since var(X) = σ_X², we get 1 ≥ k² · prob(X in A), hence prob(X in A) ≤ 1/k².

🪙 Coin toss application

For n = 10,000 fair coin tosses counting heads:

  • μ = E(X) = 5,000.
  • var(X) = n/4 = 2,500, so σ_X = √2,500 = 50.
  • Set k = 50 so that k·σ_X = 2,500.
  • Chebyshev's inequality: prob(|X − 5000| < 2500) ≥ 1 − 1/50² = 1 − 1/2500 = 0.9996.
  • Interpretation: The number of heads is within 2,500 of 5,000 with probability at least 99.96%, so getting at least 7,500 heads is "very unlikely indeed."

🎟️ Lottery comparison

For the lottery with uniform distribution over {0, 1, ..., 10,000}:

  • The probability that the winning number is at least 7,500 is exactly 2501/10001 ≈ 1/4.
  • This is much larger than the coin-toss case, explaining the different reactions.
  • Don't confuse: Chebyshev's inequality gives a bound, not an exact probability. For the coin tosses, the actual probability of ≥ 7,500 heads is far smaller than the bound suggests.

🖥️ Exact calculation

The excerpt notes that for Bernoulli trials, the exact probability can be computed:

  • For 10,000 coin tosses, P(X ≥ 7,500) = Σ_{i=7,500}^{10,000} (10,000 choose i) / 2^{10,000}.
  • A computer algebra system can calculate this exactly, and "you are encouraged to check it out just to see how truly small this quantity actually is."
  • This confirms that the outcome is extraordinarily unlikely, far beyond what Chebyshev's inequality alone tells us.
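
The excerpt's suggested computation is feasible with exact integer arithmetic; a sketch (the iterative binomial update is an implementation choice, not from the excerpt):

```python
from math import comb

# Exact P(X >= 7500) for X ~ Binomial(10000, 1/2), as an integer ratio.
n, k = 10000, 7500
total = 2 ** n

c = comb(n, k)                       # first term of the tail
tail = 0
for i in range(k, n + 1):
    tail += c
    c = c * (n - i) // (i + 1)       # comb(n, i+1) from comb(n, i)

# The tail probability is tail / total; estimate its order of magnitude.
digits = len(str(total)) - len(str(tail))
print(f"P(X >= 7500) is roughly 10**-{digits}")
```

The result is far below 10⁻⁵⁰⁰, many orders of magnitude smaller than anything Chebyshev's inequality alone can certify.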

🔗 Independence and families of random variables

👨‍👩‍👧‍👦 Independent families

A family F = {X₁, X₂, ..., Xₙ} of random variables is independent if for each pair i, j with i < j, and for each pair of real numbers a, b, the events {x : Xᵢ(x) ≥ a} and {x : Xⱼ(x) ≥ b} are independent.

  • This extends the notion of independent events to random variables.
  • In Bernoulli trials, the outcome of each trial is independent of the others.

➕ Variance of sums

When X₁, X₂, ..., Xₙ are independent random variables:

  • var(X₁ + X₂ + ... + Xₙ) = var(X₁) + var(X₂) + ... + var(Xₙ).
  • This makes calculating variance for Bernoulli trials "a trivial calculation."
  • Example: for n coin tosses, each individual toss has variance p(1 − p), so the total variance is n·p(1 − p).
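
The formula var(X) = np(1 − p) can be cross-checked against the definition by computing the variance directly from the binomial probabilities; a sketch (the sample values of n and p are illustrative, not from the excerpt):

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Variance of Binomial(n, p), computed directly from the pmf."""
    probs = [comb(n, k) * p**k + 0 for k in range(0)]  # placeholder removed below
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mu = sum(k * q for k, q in zip(range(n + 1), probs))
    return sum((k - mu)**2 * q for k, q in zip(range(n + 1), probs))

print(binomial_variance(12, Fraction(1, 3)), 12 * Fraction(1, 3) * Fraction(2, 3))
print(binomial_variance(10, Fraction(1, 2)), 10 * Fraction(1, 2) * Fraction(1, 2))
```

Both lines print a matching pair of fractions, confirming np(1 − p) without invoking independence.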

📚 Further study

The excerpt notes that the treatment here is "just a small part of a more complex subject which can be treated more elegantly and ultimately much more compactly—provided you first develop additional background material on families of random variables." It refers readers to probability and statistics texts for deeper coverage.

10.6 Probability Spaces with Infinitely Many Outcomes

🧭 Overview

🧠 One-sentence thesis

Probability spaces can be extended beyond finite sets to countably infinite or uncountable sample spaces, where probabilities are computed using infinite sums or defined on families of subsets rather than individual outcomes.

📌 Key points (3–5)

  • Extension beyond finite sets: probability spaces can have countably infinite or uncountable sample spaces, not just finite ones.
  • Countably infinite case: probabilities are still defined on individual outcomes, but the sum over all outcomes is an infinite series that converges to 1.
  • Uncountable case: the probability function is not defined on individual outcomes but on families of subsets instead.
  • Common confusion: in countably infinite spaces, we still sum probabilities over outcomes (infinite sum), but in uncountable spaces, we cannot assign probabilities to individual points at all.
  • Practical application: games with indefinite rounds (like repeated die rolls until a condition is met) naturally lead to countably infinite sample spaces.

🔢 Types of infinite probability spaces

🔢 Countably infinite sample spaces

When S is countably infinite, we can still define P on the members of S, and now the sum over x in S of P(x) is an infinite sum which converges absolutely (since all terms are non-negative) to 1.

  • The sample space S has infinitely many outcomes, but they can be listed (like the natural numbers).
  • Each outcome still has a probability assigned to it.
  • The key difference from finite spaces: summing all probabilities is now an infinite series.
  • The series must converge to exactly 1 (absolute convergence is guaranteed because probabilities are non-negative).

♾️ Uncountable sample spaces

When S is uncountable, P is not defined on S. Instead, the probability function is defined on a family of subsets of S.

  • The sample space is so large that outcomes cannot be listed (like all real numbers in an interval).
  • Individual outcomes do not have probabilities assigned to them.
  • Probabilities are assigned to collections (subsets) of outcomes instead.
  • Don't confuse: this is fundamentally different from countable infinity—you cannot talk about "the probability of outcome x" at all.

🎲 Game with repeated trials

🎲 Nancy's die game

The excerpt describes a game to illustrate countably infinite probability spaces:

Rules:

  • Nancy rolls a die.
  • If she rolls a 6 on the first roll, she wins immediately.
  • If she rolls any other number (say, k), she keeps rolling until either:
    1. She rolls a 6 → she loses.
    2. She rolls k again → she wins.

Sample sequences:

  • (4, 2, 3, 5, 1, 1, 1, 4): Nancy wins (rolled 4 again before rolling 6).
  • (6): Nancy wins (6 on first roll).
  • (5, 2, 3, 2, 1, 6): Nancy loses (rolled 6 before rolling 5 again).

🧮 Computing the win probability

The probability that Nancy wins involves an infinite sum because the game can go on indefinitely:

  • Win on first roll: probability is 1/6 (rolling a 6 immediately).
  • Win on round n (where n ≥ 2):
    • Probability 5/6 of not rolling 6 on the first roll.
    • Probability (4/6) raised to the power (n - 2) of avoiding both a win and a loss on rolls 2 through n - 1.
    • Probability 1/6 of rolling the matching number on round n.

The total win probability is:

  • 1/6 plus the sum over n from 2 to infinity of (5/6) times (4/6) raised to (n - 2) times (1/6).
  • This infinite sum evaluates to 7/12.

Example: The game might last 2 rounds, 3 rounds, 4 rounds, etc., each with decreasing probability, and we sum all these possibilities.
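
The series above can be summed numerically to confirm the value 7/12; a sketch with exact fractions (not from the excerpt):

```python
from fractions import Fraction

# Partial sums of: 1/6 + sum over n >= 2 of (5/6) * (4/6)**(n-2) * (1/6)
win = Fraction(1, 6)
for n in range(2, 200):
    win += Fraction(5, 6) * Fraction(4, 6) ** (n - 2) * Fraction(1, 6)

print(float(win))   # approaches 7/12
```

After a couple hundred terms the partial sum agrees with 7/12 to dozens of decimal places, since the leftover tail shrinks geometrically.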

🎯 General pattern for repeated sampling

🎯 Two disjoint events A and B

The excerpt generalizes Nancy's game to any probability space with two disjoint events A and B:

Setup:

  • Events A and B are disjoint (cannot both happen).
  • P(A) + P(B) is less than 1 (there is a positive probability of neither happening).
  • Repeated independent samples are taken from the space.
  • Call it a "win" if event A occurs, a "loss" if event B occurs, and a "tie" (sample again) otherwise.

Win probability formula:

  • The probability of a win is P(A) divided by (P(A) + P(B)).
  • This can also be written as P(A) plus P(A) times the sum over n from 1 to infinity of (1 - P(A) - P(B)) raised to the power n.

🔍 Why this formula works

  • On the first sample, A might occur with probability P(A) → immediate win.
  • If neither A nor B occurs (probability 1 - P(A) - P(B)), we sample again.
  • The process repeats indefinitely, creating a geometric series.
  • The formula sums all these infinite possibilities.

Don't confuse: the formula P(A) / (P(A) + P(B)) is the probability of A winning eventually, not the probability of A on a single trial.
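
The closed form and the geometric series can be checked against each other; a sketch (the sample values of P(A) and P(B) are illustrative, not from the excerpt):

```python
from fractions import Fraction

def eventual_win_prob(pA, pB):
    """P(A occurs before B) under repeated independent sampling."""
    return pA / (pA + pB)

# Cross-check against the series P(A) + P(A) * sum over n >= 1 of (1 - P(A) - P(B))**n
pA, pB = Fraction(1, 6), Fraction(1, 4)
q = 1 - pA - pB                       # probability of a "tie" on one sample
series = pA + pA * sum(q**n for n in range(1, 500))

print(eventual_win_prob(pA, pB), float(series))
```

Both computations converge on the same value, here 2/5, matching the pattern that only the relative sizes of P(A) and P(B) matter.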

📚 Scope and further study

📚 Focus of this text

  • The text emphasizes finite sets and combinatorics.
  • Countably infinite probability spaces are discussed briefly (as in the examples above).
  • Uncountable probability spaces are mentioned but not covered in detail.

📖 Where to learn more

  • For general concepts from probability and statistics, especially uncountable sample spaces, students are referred to specialized texts.
  • The excerpt does not provide specific references but acknowledges that deeper treatment requires going beyond combinatorics-focused material.

10.7 Discussion

🧭 Overview

🧠 One-sentence thesis

A person's preference between options can change depending on which alternatives are available, a phenomenon explained by conditional probability where preferences are conditioned on the full set of choices.

📌 Key points (3–5)

  • The dessert paradox: A diner patron consistently chooses pecan pie over cherry and apple, but switches to cherry when apple is unavailable—seemingly contradicting his earlier preference.
  • Why it seems confusing: If someone prefers pecan to cherry (when all three are available), it seems illogical to then prefer cherry to pecan when only two choices remain.
  • The explanation: Preferences can be conditional—the patron's preference for pecan was conditioned on three choices being available; when the choice set changes, preferences may shift.
  • Common confusion: Don't assume preferences are absolute rankings independent of context; the set of available options can change how people evaluate each option.
  • Real-world parallels: The same pattern appears in politics (voters shift preferences when a candidate drops out) and personal relationships (preferences change as the pool of options changes).

🍰 The dessert puzzle

🍰 What happens

  • A man eats lunch at the same diner every day for six months.
  • The waiter offers three dessert choices: apple pie, cherry pie, and pecan pie.
  • The man consistently chooses pecan pie.
  • One day, the waiter says there is no apple pie—only cherry and pecan are available.
  • The man now chooses cherry pie instead of pecan.

❓ Why it seems paradoxical

  • Over six months, the man demonstrated a preference: pecan > cherry (and pecan > apple).
  • When apple is removed, the choice should be simpler: just pecan vs. cherry.
  • Yet he reverses his preference and picks cherry over pecan.
  • Bob's confusion: "Why would the guy ask for cherry pie in preference to pecan pie when he consistently takes pecan pie over both cherry pie and apple pie?"

🎲 The conditional probability explanation

🎲 Preferences depend on the choice set

Carlos identifies the key insight:

"The patron's preference for pecan pie was conditioned on the fact that there were three choices. When there were only two choices, his preferences changed."

  • The man's preference for pecan was not an absolute ranking but a preference given the context of three options.
  • When the set of available options changes (from three to two), the conditional probabilities and evaluations shift.
  • Example: The presence of apple pie might have made pecan seem more attractive relative to cherry; without apple, the comparison changes.

🔄 Don't confuse absolute vs. conditional preferences

  • Absolute preference would mean: pecan > cherry always, regardless of what else is available.
  • Conditional preference means: pecan > cherry when apple is also an option, but cherry > pecan when only those two are available.
  • The excerpt emphasizes that preferences can be conditioned on the full menu, not just pairwise comparisons.

🌍 Real-world parallels

🗳️ Presidential politics

Yolanda observes:

"Doesn't this happen all the time in presidential politics? People prefer candidate A when A, B and C are running, but when candidate C drops out, they shift their preference to candidate B."

  • Voters may prefer A over B when C is in the race.
  • When C drops out, some voters switch to B instead of A.
  • The presence or absence of C changes how voters evaluate A vs. B.

💞 Personal relationships

Alice notes:

"You could say the same thing about close personal relationships."

  • The pool of available partners affects how each individual is evaluated.
  • Preferences can shift as the set of options changes.
  • (Alice privately thinks the principle applies to her own situation with Bob, though she doesn't say so aloud.)

🧩 Group reactions

🧩 Initial skepticism

  • Zori: Dismisses the dessert problem as irrelevant to real-world applications, saying "this conversation about dessert in some stupid diner is too much."
  • Dave: Makes a joke about pecan pie being a great dessert, missing the conceptual point.
  • Alice: Not amused by the tangent.

🧩 Gradual recognition

  • Xing: Hesitant but senses "there's something here."
  • Carlos: First to articulate the conditional probability explanation.
  • Yolanda: Extends the idea to politics, showing broader applicability.
  • Alice: Extends it to personal relationships, showing even wider relevance.

10.8 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply probability concepts—including random selection, conditional probability, independence, and expected value—to combinatorial scenarios involving dice, spinners, permutations, graphs, and sampling strategies.

📌 Key points (3–5)

  • Random selection from groups: calculating probabilities when choosing subsets from a larger population (e.g., students from a class).
  • Repeated trials and "at least one" events: finding probabilities of getting a particular outcome in multiple independent trials (dice rolls, spinner spins).
  • Conditional probability and independence: distinguishing when events are independent versus when conditioning changes probabilities (graph subsets, sequential sampling).
  • Common confusion: sampling with replacement vs. without replacement—probabilities change dramatically when items are not returned, and later rounds can become predictable.
  • Expected value and fairness: determining fair payoffs by comparing expected winnings to the cost of playing.

🎲 Random selection and combinatorics

🎓 Choosing students from a class

Exercise 1 asks about selecting 3 students randomly from a class of 35, where 7 belong to a particular group.

  • (a) Probability that one specific person (Yolanda) is chosen.
  • (b) Probability that Yolanda is chosen but Zori is not—requires conditioning on Yolanda being selected and Zori not being among the other two.
  • (c) Probability that exactly two of the seven club members are chosen—combines counting favorable outcomes (choosing 2 from 7 and 1 from the remaining 28) with total possible selections.
  • (d) Probability that none of the seven are chosen—all three must come from the 28 non-members.

Key idea: Use combinatorial counting (combinations) to find favorable outcomes divided by total outcomes.
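
This counting approach can be sketched in code; the split into 2-from-7 and 1-from-28 follows the exercise description for part (c) (this is one way to organize the computation, not a worked solution from the excerpt):

```python
from fractions import Fraction
from math import comb

total = comb(35, 3)                      # all ways to choose 3 of 35 students

# (c) exactly two of the seven club members are chosen
favorable = comb(7, 2) * comb(28, 1)
print(Fraction(favorable, total))

# (d) none of the seven are chosen: all three come from the other 28
print(Fraction(comb(28, 3), total))
```

The same favorable-over-total pattern handles parts (a) and (b) once the favorable selections are described.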

🎲 Dice rolls and "at least one" outcomes

Exercise 2 examines Bob's claim about dice probabilities.

  • Getting at least one '7' in three rolls of a pair of dice has probability slightly less than 1/2.
  • Getting at least one '5' in six rolls has probability just over 1/2.
  • Method: Calculate the complement—probability of not getting the target outcome in any roll, then subtract from 1.
  • Example: If the probability of rolling a '7' on one roll is p, then the probability of no '7' in three rolls is (1 − p)³, so at least one '7' is 1 − (1 − p)³.

Don't confuse: The number of rolls and the target probability interact nonlinearly; more rolls increase the chance but not proportionally.
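
The complement method can be sketched directly with exact fractions (not from the excerpt; the single-roll probabilities 6/36 for a '7' and 4/36 for a '5' are the standard two-dice counts):

```python
from fractions import Fraction

p7 = Fraction(6, 36)   # probability a single roll of two dice sums to 7
p5 = Fraction(4, 36)   # probability a single roll sums to 5

at_least_one_7_in_3 = 1 - (1 - p7) ** 3
at_least_one_5_in_6 = 1 - (1 - p5) ** 6

print(at_least_one_7_in_3, float(at_least_one_7_in_3))  # just under 1/2
print(at_least_one_5_in_6, float(at_least_one_5_in_6))  # just over 1/2
```

The two results (91/216 ≈ 0.421 and ≈ 0.507) straddle 1/2, matching Bob's claim.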

🎡 Spinner problems and expected value

🎯 Spinner probabilities

Exercise 3 uses the spinner from Figure 10.1 (beginning of chapter).

  • (a) & (b) Probability of getting at least one specific number (5 or 3) in three spins—again use the complement approach.
  • (c) Conditional probability: if you spin until you get either a '2' or a '5', what is the probability that '2' comes first?
    • Only spins that land on '2' or '5' matter; ignore other outcomes.
    • The probability is the ratio of the '2' region's probability to the combined probability of '2' and '5'.

💰 Expected value and fairness

(d) If you receive i dollars when the spinner lands in region i, what is the expected value?

Expected value: the sum of each outcome's value multiplied by its probability.

  • The question asks whether paying three dollars to play is reasonable, noting that three is "right in the middle" of possible outcomes.
  • Key insight: The expected value depends on both the values and their probabilities, not just the middle value—if higher or lower numbers are more or less likely, the expected value shifts.

🃏 Derangements and fair games

🎰 Bob's derangement game

Exercise 4 describes a game where Bob wins if a random permutation of 1, 2, ..., 50 is a derangement.

Derangement: a permutation σ where σ(i) ≠ i for all i.

  • Bob pays $1 to play and wins $2.50 if the permutation is a derangement.
  • Fairness question: Is the expected payout equal to the cost?
  • To determine fairness, calculate the probability of a derangement (a known combinatorial result) and multiply by the payout.
  • If the expected value is less than $1, the game favors Alice; if more, it favors Bob.
  • Adjustment: The payoff should be set so that (probability of derangement) × (payoff) = $1.
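
A sketch of the fairness check (not from the excerpt; the derangement recurrence D(n) = (n − 1)(D(n − 1) + D(n − 2)) is a standard identity rather than one derived here):

```python
from fractions import Fraction
from math import factorial

def derangements(n):
    """D(n) via the recurrence D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    d = [1, 0]                      # D(0) = 1, D(1) = 0
    for k in range(2, n + 1):
        d.append((k - 1) * (d[-1] + d[-2]))
    return d[n]

p = Fraction(derangements(50), factorial(50))   # chance the permutation is a derangement
payout = Fraction(5, 2)                         # Bob wins $2.50

print(float(p), float(p * payout))              # expected winnings per $1 paid
```

Since the derangement probability is close to 1/e ≈ 0.368, the expected payout is about $0.92 per dollar, so the $2.50 payoff favors Alice; a fair payoff would be 1/p dollars, about $2.72.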

🕸️ Random graphs and independence

🔗 Random graph construction

Exercise 5 builds a random graph on vertices {1, 2, ..., 10} by tossing a fair coin for each possible edge.

  • For each pair {i, j}, include the edge if the coin shows heads.
  • Let E_S be the event that a 3-element subset S forms a complete subgraph (triangle).

📐 Probability and independence of triangles

(a) Why is P(E_S) = 1/8 for each 3-element subset S?

  • A triangle on three vertices requires three edges.
  • Each edge is included with probability 1/2 (fair coin).
  • All three edges must be present: (1/2)³ = 1/8.

(b) Why are E_S and E_T independent when |S ∩ T| = 1?

  • If S and T share exactly one vertex, they share no edges.
  • The edges defining E_S and E_T are determined by different coin tosses.
  • Therefore, the events are independent.
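
Parts (a) and (b) are small enough to verify by exhaustive enumeration; the sketch below (not from the excerpt) uses the illustrative pair S = {1,2,3}, T = {3,4,5}, which shares exactly one vertex:

```python
from itertools import combinations, product
from fractions import Fraction

vertices = range(1, 6)
edges = list(combinations(vertices, 2))          # the 10 possible edges

def triangle(present, S):
    """Does edge-set `present` contain all three edges on the 3-set S?"""
    return all(e in present for e in combinations(S, 2))

S, T = (1, 2, 3), (3, 4, 5)                      # |S intersect T| = 1: no shared edge
count_S = count_T = count_both = total = 0
for bits in product([0, 1], repeat=len(edges)):  # every graph on 5 vertices
    present = {e for e, b in zip(edges, bits) if b}
    total += 1
    s, t = triangle(present, S), triangle(present, T)
    count_S += s
    count_T += t
    count_both += s and t

print(Fraction(count_S, total))                           # P(E_S) = 1/8
print(Fraction(count_both, total) ==
      Fraction(count_S, total) * Fraction(count_T, total))  # independence holds
```

The enumeration confirms both claims: each triangle event has probability 1/8, and P(E_S ∩ E_T) factors as the product when the two triples share no edge.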

(c) Show that P(E_S | E_T) ≠ P(E_S | E_T ∩ E_U) for S = {1,2,3}, T = {2,3,4}, U = {3,4,5}.

  • S and T share edge {2,3}; T and U share edge {3,4}.
  • Conditioning on both E_T and E_U provides more information about shared edges than conditioning on E_T alone.
  • This demonstrates that independence can break down when events share structure.

Don't confuse: Pairwise independence (E_S and E_T independent) does not imply mutual independence (E_S, E_T, E_U all independent together).

🎱 Sampling strategies and predictability

🔄 Sampling with replacement

Exercise 6 first describes drawing two marbles from ten (labeled 1, 2, ..., 10), observing the sum, then returning them and repeating.

  • There are C(10, 2) = 45 possible pairs.
  • Exactly 5 pairs sum to 11: {1,10}, {2,9}, {3,8}, {4,7}, {5,6}.
  • Probability of sum = 11 is 5/45 = 1/9.
  • Fair payoff: If you pay $1, a fair payout for an "11" is $9 (expected value = 1/9 × $9 = $1).
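
The pair count is easy to verify by enumeration (a quick check, not from the excerpt):

```python
from itertools import combinations
from fractions import Fraction

pairs = list(combinations(range(1, 11), 2))      # all 2-marble draws
elevens = [p for p in pairs if sum(p) == 11]

print(len(pairs), len(elevens))                  # 45 pairs, 5 sum to 11
print(Fraction(len(elevens), len(pairs)))        # 1/9
```

The five qualifying pairs partition the ten marbles, which is exactly what makes the without-replacement variant below interesting.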

🚫 Sampling without replacement

Now marbles are drawn in pairs and set aside; five rounds total (all ten marbles used).

  • Round 5 observation: after four pairs are drawn, only two marbles remain, so the fifth pair is completely determined—no randomness remains.
  • Why everyone or no one should wager on round 5: every player knows which two marbles are left, so either the remaining pair sums to 11 (betting is a guaranteed win) or it does not (betting is a guaranteed loss).
  • Therefore, round 5 is skipped.

💡 Exploiting information in sequential sampling

Observant player strategy: By watching earlier rounds, a player learns which marbles remain.

  • If several "11-summing" pairs are still possible, the probability of drawing one increases.
  • If all such pairs are already eliminated, the probability drops to zero.
  • With a fixed 9-to-1 payout, a player can bet only when the conditional probability exceeds 1/9, guaranteeing profit over many games.

Challenge: What is the minimum payout ratio above which a player has a winning strategy?

  • The player must find the maximum conditional probability that can occur during the game.
  • If this maximum exceeds the reciprocal of the payout ratio, the player can wait for favorable conditions and bet only then.
  • The minimum fair payout ratio is the reciprocal of this maximum conditional probability.

Don't confuse: With replacement, each round is independent and the probability is always 1/9; without replacement, probabilities shift each round, and information accumulates.


11.1 A First Taste of Ramsey Theory

🧭 Overview

🧠 One-sentence thesis

Ramsey Theory proves that in any sufficiently large graph, a "boring" substructure—either a complete subgraph or an independent set of a specified size—must inevitably appear.

📌 Key points (3–5)

  • Core claim: Any graph with at least R(m, n) vertices must contain either a complete subgraph on m vertices or an independent set of size n.
  • Boredom is inevitable: A subgraph is "boring" if every pair of vertices behaves identically (all connected or all disconnected); large enough graphs always contain such boring substructures.
  • Sharp bounds matter: The smallest graph that avoids both structures defines the threshold; for example, a 5-cycle avoids both a 3-clique and a 3-independent-set, so R(3,3) = 6.
  • Common confusion: The Ramsey number R(m, n) is the minimum size that guarantees the structure exists, not the size where it first appears.
  • Difficulty: Exact Ramsey numbers are notoriously hard to compute; only a handful are known precisely (e.g., R(3,3) = 6, R(4,4) = 18, but R(5,5) is only bounded between 43 and 49).

🎯 The inevitability of boring subgraphs

🎯 What makes a subgraph "boring"

A subgraph H of a graph G is "boring" if it is either a complete subgraph (every pair of vertices is connected) or an independent set (no pair of vertices is connected).

  • In both cases, every pair of vertices in H behaves in exactly the same way.
  • Complete subgraph: all pairs are edges.
  • Independent set: no pairs are edges.
  • The excerpt frames this as "boredom" because there is no variation in the edge pattern.

🔍 Why boredom is inevitable

  • The excerpt answers the question "Is boredom inevitable?" with "yes—at least in a relative sense."
  • If a graph is large enough, it must contain a boring subgraph of a given size.
  • This is an extension of the Pigeon Hole Principle: just as pigeons must cluster in holes, vertices must cluster into uniform edge patterns.

🧩 The base case: graphs on six vertices

🧩 Lemma 11.1 statement

Lemma 11.1: Let G be any graph with six or more vertices. Then either G contains a complete subgraph of size 3 or an independent set of size 3.

  • This is the simplest non-trivial case of Ramsey Theory.
  • The bound of six is sharp: a 5-cycle avoids both structures, so six is the minimum guarantee.

🛠️ Proof mechanism

  • Pick any vertex x.
  • Partition the remaining vertices into two sets:
    • S₁: neighbors of x (vertices connected to x).
    • S₂: non-neighbors of x (vertices not connected to x).
  • Since there are at least five other vertices, either |S₁| ≥ 3 or |S₂| ≥ 3.

Case 1: |S₁| ≥ 3

  • Take three distinct vertices y₁, y₂, y₃ from S₁.
  • If any pair yᵢyⱼ is an edge, then {x, yᵢ, yⱼ} forms a complete subgraph of size 3.
  • If no edges exist among {y₁, y₂, y₃}, then they form an independent set of size 3.

Case 2: |S₂| ≥ 3

  • The argument is dual: examine edges among three non-neighbors of x.

⚠️ Sharpness of the bound

  • A cycle on five vertices (5-cycle) contains neither a complete subgraph of size 3 nor an independent set of size 3.
  • Therefore, six is the smallest number that guarantees the structure.
  • Don't confuse: "sharp" means the bound cannot be lowered, not that every graph of that size contains the structure.
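
Both halves of the lemma, the guarantee on six vertices and the sharpness of the 5-cycle, are small enough to verify by brute force; a sketch (not from the excerpt):

```python
from itertools import combinations, product

def has_boring_triple(vs, present):
    """Does the graph (vs, present) contain 3 mutually adjacent
    or 3 mutually non-adjacent vertices?"""
    for triple in combinations(vs, 3):
        pairs = list(combinations(triple, 2))
        if all(e in present for e in pairs) or not any(e in present for e in pairs):
            return True
    return False

edges6 = list(combinations(range(6), 2))     # the 15 possible edges on 6 vertices
all_boring = all(
    has_boring_triple(range(6), {e for e, b in zip(edges6, bits) if b})
    for bits in product([0, 1], repeat=15))

five_cycle = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)}
print(all_boring, has_boring_triple(range(5), five_cycle))   # True False
```

All 2¹⁵ = 32,768 graphs on six vertices contain a boring triple, while the 5-cycle contains none, confirming both the lemma and the sharpness of the bound.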

🏛️ Ramsey's Theorem for Graphs

🏛️ Theorem 11.2 statement

Ramsey's Theorem for Graphs: If m and n are positive integers, then there exists a least positive integer R(m, n) so that if G is a graph with at least R(m, n) vertices, then either G contains a complete subgraph on m vertices, or G contains an independent set of size n.

  • R(m, n) is called the Ramsey number.
  • It is the minimum number of vertices that guarantees one of the two structures.
  • The theorem asserts that such a number always exists.

🔢 Upper bound on R(m, n)

  • The proof shows that R(m, n) ≤ (m + n - 2 choose m - 1).
  • The claim is trivial when either m ≤ 2 or n ≤ 2 (small cases are easy).
  • For m, n ≥ 3, the proof uses induction on t = m + n, assuming the result holds when t ≤ 5.

🔧 Proof strategy

  • Pick any vertex x in a graph G with at least (m + n - 2 choose m - 1) vertices.
  • Partition the remaining vertices into S₁ (neighbors of x) and S₂ (non-neighbors of x).
  • Use the binomial identity:
    • (m + n - 2 choose m - 1) = (m + n - 3 choose m - 2) + (m + n - 3 choose m - 1)
    • Which also equals (m + n - 3 choose m - 2) + (m + n - 3 choose n - 2).
  • Therefore, either |S₁| ≥ (m + n - 3 choose m - 2) or |S₂| ≥ (m + n - 3 choose n - 2).

Case 1: |S₁| is large

  • If S₁ does not have an independent set of size n, then by induction it contains a complete subgraph of size m - 1.
  • Add x to this set to obtain a complete subgraph of size m in G.

Case 2: |S₂| is large

  • If S₂ does not contain a complete subgraph of size m, then by induction it contains an independent set of size n - 1.
  • Add x to this set to obtain an independent set of size n in G.
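
Plugging small values into the bound from the proof shows how quickly it loosens (a quick check, not from the excerpt):

```python
from math import comb

def ramsey_upper_bound(m, n):
    """The bound R(m, n) <= C(m + n - 2, m - 1) from the inductive proof."""
    return comb(m + n - 2, m - 1)

# Exact for (3, 3), but already loose for (4, 4): R(4, 4) = 18 < 20,
# and far from tight for (5, 5): R(5, 5) <= 49 < 70.
print(ramsey_upper_bound(3, 3), ramsey_upper_bound(4, 4), ramsey_upper_bound(5, 5))
```

This gap between the provable bound and the true value is one reason exact Ramsey numbers are so hard to pin down.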

📊 Known Ramsey numbers and their difficulty

📊 Small known values

| m, n | R(m, n) | Status |
|---|---|---|
| 3, 3 | 6 | Exact |
| 4, 4 | 18 | Exact |
| 5, 5 | 43–49 | Bounds only |
  • Only a handful of Ramsey numbers are known precisely.
  • Even R(5, 5) is unknown; it lies between 43 and 49.

🌍 Erdős's remarks on difficulty

  • Paul Erdős said that determining R(5, 5) exactly might be possible if all the world's mathematical talent were focused on the problem.
  • He also said that finding the exact value of R(6, 6) might be beyond our collective abilities.
  • This illustrates the notorious difficulty of computing Ramsey numbers.

📋 Table of small Ramsey numbers

The excerpt provides a table for R(m, n) when 3 ≤ m, n ≤ 9:

  • Single numbers indicate exact values.
  • Pairs of numbers indicate lower and upper bounds.
  • Example: R(3, 4) = 9 (exact), while R(5, 7) is only known to lie between 80 and 143.
  • The table was last updated using a 2014 survey article by Stanisław Radziszowski in the Electronic Journal of Combinatorics.

🔍 Don't confuse

  • R(m, n) is the minimum size that guarantees the structure, not the size where it might appear.
  • A graph with fewer than R(m, n) vertices can still contain the structure; it's just not guaranteed.
  • The difficulty is not in proving existence (Theorem 11.2 does that), but in finding the exact threshold.

11.2 Small Ramsey Numbers

🧭 Overview

🧠 One-sentence thesis

Determining exact Ramsey numbers R(m, n) is notoriously difficult, with only a handful of small values known precisely despite their theoretical existence being guaranteed by Ramsey's Theorem.

📌 Key points (3–5)

  • Ramsey numbers exist but are hard to compute: Theorem 11.2 proves R(m, n) exists for any positive integers m and n, but finding exact values is extremely difficult.
  • Known small values are rare: only R(3,3) = 6 and R(4,4) = 18 are known exactly among small cases; R(5,5) is only bounded between 43 and 49.
  • Upper bound from proof: the proof shows R(m, n) is at most the binomial coefficient "m plus n minus 2 choose m minus 1."
  • Common confusion: existence vs computation—Ramsey's Theorem guarantees the number exists, but that doesn't mean we can calculate it.
  • Difficulty grows rapidly: even the world's mathematical talent might determine R(5,5), but R(6,6) may be beyond collective human ability according to Paul Erdős.

📐 Ramsey's Theorem and existence

📐 What Ramsey's Theorem guarantees

Ramsey's Theorem for Graphs (Theorem 11.2): If m and n are positive integers, then there exists a least positive integer R(m, n) so that if G is a graph with at least R(m, n) vertices, then either G contains a complete subgraph on m vertices, or G contains an independent set of size n.

  • A complete subgraph means every pair of vertices is connected by an edge.
  • An independent set means no pair of vertices is connected by an edge.
  • The theorem says: make the graph big enough, and you must find either a large clique or a large independent set.

🔢 Upper bound from the proof

  • The proof establishes that R(m, n) is at most the binomial coefficient (m + n - 2 choose m - 1).
  • The proof uses induction on t = m + n, starting from the base case when either m ≤ 2 or n ≤ 2 (which is trivial).
  • The argument partitions vertices into those adjacent to a chosen vertex x and those not adjacent, then applies the binomial coefficient identity to show at least one partition is large enough to apply the induction hypothesis.

🧮 Example from the proof

  • The excerpt mentions a cycle on five vertices as a sharp example: it contains neither a complete subgraph of size 3 nor an independent set of size 3, showing that R(3,3) = 6 is the smallest possible.
  • This demonstrates that the bound cannot be improved in general.

🔍 Known values and bounds

🔍 Exact small values

The excerpt provides these exact Ramsey numbers:

  • R(3, 3) = 6
  • R(4, 4) = 18

📊 Bounds for other small cases

The excerpt gives ranges for cases where exact values are unknown:

m   n   R(m, n) (exact value or bounds)
3   5   14
3   6   18
3   7   23
3   8   36
3   9   39
4   5   25
4   6   between 36 and 41
4   7   between 49 and 61
5   5   between 43 and 49
  • When a cell contains a single number, that is the precise answer.
  • When there are two numbers, they represent lower and upper bounds.
  • The table extends up to m = 9 and n = 9, with bounds becoming wider for larger values.

📚 Reference for current data

The excerpt notes that Table 11.3 was last updated using the 12 January 2014 version of Dynamic Survey #DS1: "Small Ramsey Numbers" by Stanisław Radziszowski in the Electronic Journal of Combinatorics, indicating this is an active research area with ongoing updates.

🚧 The difficulty of computation

🚧 Paul Erdős's assessment

The excerpt quotes the distinguished Hungarian mathematician Paul Erdős on the difficulty:

  • R(5, 5): might be possible to determine exactly if all the world's mathematical talent were focused on the problem.
  • R(6, 6): finding the exact value might be beyond our collective abilities.

This illustrates how rapidly the difficulty grows even for small values.

🚧 Why existence doesn't mean we can compute

  • Don't confuse: Ramsey's Theorem proves R(m, n) exists as a definite integer, but the proof method (induction with binomial bounds) doesn't give us an efficient way to calculate the exact value.
  • The upper bound from the proof is often much larger than the actual value, so it doesn't narrow down the answer sufficiently.
  • Example: the proof gives an upper bound, but for R(5, 5) we only know it's between 43 and 49—a range of 7 possible values despite decades of effort.

🔬 Estimating Ramsey numbers

🔬 Stirling's approximation

The excerpt introduces Stirling's approximation for factorials:

  • Full form: n! ≈ square root of (2πn) times (n/e) to the power n, times (1 + 1/(12n) + 1/(288n²) - 139/(51840n³) + higher order terms).
  • Simplified form commonly used: n! ≈ square root of (2πn) times (n/e) to the power n.
  • This approximation can be found in almost any advanced calculus book.

🔬 Upper bound for R(n, n)

Using Stirling's approximation and the binomial coefficients from the proof of Ramsey's Theorem:

  • R(n, n) ≤ (2n - 2 choose n - 1) ≈ 2 to the power (2n) divided by (4 times square root of πn).
  • This gives an exponential upper bound in n.
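This asymptotic claim is easy to sanity-check numerically: Stirling's formula turns the near-central binomial coefficient into a clean exponential, C(2n − 2, n − 1) ≈ 4^n / (4·sqrt(πn)), and the ratio of the two sides tends to 1. A sketch at a single moderate value of n:

```python
from math import comb, pi, sqrt

# Compare the exact binomial coefficient with its Stirling-based approximation.
n = 200
exact = comb(2 * n - 2, n - 1)
approx = 4 ** n / (4 * sqrt(pi * n))
ratio = exact / approx
assert abs(ratio - 1) < 0.01  # already within 1% at n = 200
```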

🔬 Lower bound from Erdős

Theorem 11.4 (due to P. Erdős) provides a lower bound:

  • R(n, n) > (n divided by (e times square root of 2)) times 2 to the power (n/2).
  • The proof uses a probabilistic argument: it counts the total number of labeled graphs on t vertices (which is 2 to the power C(t,2)), then counts graphs containing a complete subgraph of size n (denoted F₁) and graphs containing an independent set of size n (denoted F₂).
  • The key inequality: if 2 times (t choose n) times 2 to the power (n(t-n)) times 2 to the power C(t-n, 2) is less than 2 to the power C(t,2), then there exists a graph that avoids both a complete subgraph of size n and an independent set of size n.
  • This shows R(n, n) must be larger than such a t, establishing the lower bound.
  • The excerpt notes this was a "true classic" and was "subsequently recast" (though the recast method is not described in this excerpt).
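The key inequality above can be checked with exact integer arithmetic for small n. A sketch for n = 5 (the search cutoff of 30 is an arbitrary assumption of ours):

```python
from math import comb

def counting_inequality_holds(t, n):
    """2 * C(t,n) * 2^(n(t-n)) * 2^C(t-n,2) < 2^C(t,2), evaluated exactly."""
    lhs = 2 * comb(t, n) * 2 ** (n * (t - n)) * 2 ** comb(t - n, 2)
    rhs = 2 ** comb(t, 2)
    return lhs < rhs

n = 5
best = max(t for t in range(n, 30) if counting_inequality_holds(t, n))
assert best == 11   # a K5/I5-free graph on 11 vertices exists, so R(5,5) > 11
```

For small n the argument is far from the truth (R(5,5) is at least 43), but the guaranteed value grows exponentially with n.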

11.3 Estimating Ramsey Numbers

🧭 Overview

🧠 One-sentence thesis

Estimating Ramsey numbers R(n, n) is extremely difficult, but Stirling's approximation provides an upper bound and probabilistic arguments provide a lower bound that have remained essentially unimproved for over fifty years.

📌 Key points (3–5)

  • Known values are rare: only a handful of exact Ramsey numbers are known; even R(5,5) is uncertain (between 43 and 49), and R(6,6) may be beyond human ability to determine.
  • Upper bound via Stirling: using Stirling's approximation for factorials, R(n, n) is bounded above by roughly 4^n / sqrt(n).
  • Lower bound via probability: Erdős used a probabilistic argument to show R(n, n) grows at least exponentially, roughly (n / (e · sqrt(2))) × 2^(n/2).
  • Common confusion: the probabilistic method does not construct a specific graph; it shows one must exist by proving the expected count is less than 1.
  • Why it matters: despite decades of effort, no one has significantly improved these bounds or determined whether the true growth rate is closer to the upper or lower bound.

📊 The difficulty of exact values

📊 Small known Ramsey numbers

The excerpt provides a table of known values and bounds for R(m, n) when m and n are between 3 and 9.

Exact values mentioned:

  • R(3,3) = 6
  • R(4,4) = 18
  • R(3,4) = 9
  • R(3,5) = 14
  • R(3,6) = 18
  • R(3,7) = 23
  • R(3,8) = 36
  • R(3,9) = 39

Uncertain values:

  • R(5,5) is between 43 and 49
  • R(4,5) is 25
  • Most larger values have only lower and upper bounds, often with wide gaps

🚫 Erdős's assessment

Paul Erdős, a distinguished Hungarian mathematician, made two famous statements:

  • Finding R(5,5) exactly might be possible if all the world's mathematical talent focused on it.
  • Finding R(6,6) exactly might be beyond our collective abilities.

This illustrates how rapidly the difficulty grows even for small values.

🔢 Upper bound using Stirling's approximation

🔢 Stirling's formula

Stirling's approximation: n! ≈ sqrt(2π n) × (n/e)^n

The excerpt notes that "we will normally be satisfied with the first term" and that proofs can be found in advanced calculus books.

The full version includes correction terms:

  • n! = sqrt(2π n) × (n/e)^n × (1 + 1/(12n) + 1/(288n²) - 139/(51840n³) + O(1/n⁴))
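The correction terms matter in practice; a sketch comparing both versions against the exact factorial at n = 10 (the helper `stirling` is our own):

```python
from math import e, factorial, pi, sqrt

def stirling(n, corrections=True):
    """Stirling's approximation to n!, optionally with the first three
    correction terms quoted in the text."""
    base = sqrt(2 * pi * n) * (n / e) ** n
    if not corrections:
        return base
    return base * (1 + 1 / (12 * n) + 1 / (288 * n**2) - 139 / (51840 * n**3))

n = 10
assert abs(stirling(n, corrections=False) / factorial(n) - 1) < 0.01  # ~0.8% off
assert abs(stirling(n) / factorial(n) - 1) < 1e-6                     # far sharper
```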

📐 Deriving the upper bound

Using Stirling's approximation and binomial coefficients from Ramsey's theorem proof:

Result: R(n, n) ≤ (2n - 2 choose n - 1) ≈ 4^n / (4 · sqrt(π n))

This shows the Ramsey number grows at most exponentially with base roughly 4.

Why this works:

  • The proof of Ramsey's theorem counts graphs and subgraphs
  • Binomial coefficients appear naturally in these counts
  • Stirling's approximation simplifies the factorial expressions in binomial coefficients

🎲 Lower bound using probability

🎲 Erdős's probabilistic argument

The excerpt presents "a true classic" theorem by Erdős:

Theorem: If n is a positive integer, then R(n, n) > n / (e × sqrt(2)) × 2^(n/2)

This shows the Ramsey number grows at least exponentially with base roughly sqrt(2) ≈ 1.414.

🎯 The counting argument (original version)

The proof counts graphs with vertex set {1, 2, ..., t}:

Setup:

  • Total graphs F: there are 2^C(t,2) labeled graphs on t vertices
  • Graphs F₁ containing a complete subgraph of size n: |F₁| ≤ (t choose n) × 2^(n(t-n)) × 2^C(t-n, 2) (an upper bound, since graphs with several complete subgraphs are counted more than once)
  • Graphs F₂ containing an independent set of size n: by symmetry, |F₂| satisfies the same bound

Key inequality: If 2 × (t choose n) × 2^(n(t-n)) × 2^C(t-n, 2) < 2^C(t,2), then there exists a graph G without a complete subgraph of size n or an independent set of size n.

Finding the bound:

  • Use the inequality (t choose n) ≤ t^n / n!
  • Apply Stirling's approximation to n!
  • After algebra and taking the nth root, we need t < (n / (e × sqrt(2))) × 2^(n/2)

Conclusion: Since R(n, n) is the smallest value where every graph must have either a complete subgraph or independent set of size n, and we've shown graphs exist up to size t without this property, R(n, n) must be larger than t.

🎲 The probabilistic reinterpretation

The excerpt notes the proof was "subsequently recast" using probability:

Probability space:

  • Outcomes: graphs with vertex set {1, 2, ..., t}
  • Each edge ij (for i < j) is present with probability 1/2
  • Events for distinct pairs are independent

Random variables:

  • X₁: counts n-element subsets where all pairs are edges (complete subgraphs)
  • X₂: counts n-element subsets where no pairs are edges (independent sets)
  • X = X₁ + X₂

Expected values: By linearity of expectation, E(X) = E(X₁) + E(X₂), where:

  • E(X₁) = E(X₂) = (t choose n) × (1/2)^C(n,2)

The key insight: If E(X) < 1, then there must exist a graph with no complete subgraph or independent set of size n.

Don't confuse: This does not construct a specific graph. It proves existence by showing the average is less than 1, so at least one outcome must have X = 0.

Example: If the expected number of "bad" graphs is 0.7, then at least 30% of graphs must be "good" (have X = 0).
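The expectation computation is concrete enough to run for small n. With p = 1/2, E(X) = 2 · C(t, n) · (1/2)^C(n,2); a sketch finding the largest t with E(X) < 1 when n = 5 (the search cutoff of 40 is our own assumption):

```python
from fractions import Fraction
from math import comb

def expected_bad(t, n):
    """E(X) = E(X1) + E(X2) = 2 * C(t, n) * (1/2)**C(n, 2), computed exactly."""
    return 2 * comb(t, n) * Fraction(1, 2 ** comb(n, 2))

n = 5
best = max(t for t in range(n, 40) if expected_bad(t, n) < 1)
assert best == 11                    # E(X) < 1 holds up to t = 11 ...
assert expected_bad(12, n) >= 1      # ... and fails at t = 12, so R(5,5) > 11
```

The threshold matches the counting version exactly, as it must: the two arguments are the same inequality in different clothing.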

🔍 The persistent gap

🔍 Fifty years without progress

The excerpt emphasizes that "after more than fifty years and the efforts of many very bright researchers, only marginal improvements have been made."

Two open questions:

  • Upper bound improvement: Is there a constant c < 2 and an integer n₀ such that R(n,n) < 2^(cn) for n > n₀? In other words, can the coefficient 2 in the exponent be lowered?
  • Lower bound improvement: Is there a constant d > 1/2 and an integer n₁ such that R(n,n) > 2^(dn) for n > n₁? In other words, can the coefficient 1/2 in the exponent be raised?

Current situation:

  • Upper bound: R(n,n) ≤ roughly 4^n / sqrt(n), i.e., 2^(cn) with c approximately 2
  • Lower bound: R(n,n) ≥ roughly (n / (e · sqrt(2))) × 2^(n/2), i.e., 2^(dn) with d approximately 1/2
  • The true value lies somewhere between these
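The size of this gap is easy to make concrete; a sketch tabulating both bounds for a few n (helper names are ours):

```python
from math import comb, e, sqrt

def lower(n):
    """Erdős probabilistic lower bound: (n / (e * sqrt(2))) * 2**(n / 2)."""
    return (n / (e * sqrt(2))) * 2 ** (n / 2)

def upper(n):
    """Binomial upper bound from Ramsey's theorem: C(2n - 2, n - 1)."""
    return comb(2 * n - 2, n - 1)

for n in (10, 20, 30):
    lo, hi = lower(n), upper(n)
    assert lo < hi
    print(f"n={n}: {lo:.3e} <= R(n,n) <= {hi:.3e}")
```

Already at n = 10 the bounds differ by a factor of several hundred, and the ratio itself grows exponentially.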

🏆 The challenge

The excerpt concludes with humor: "We would certainly give you an A for this course if you managed to do [solve either question]."

This underscores the extraordinary difficulty—settling either question would be a major breakthrough in combinatorics.


11.4 Applying Probability to Ramsey Theory

🧭 Overview

🧠 One-sentence thesis

The probabilistic method proves that Ramsey numbers grow exponentially by showing that random graphs with high probability avoid both large complete subgraphs and large independent sets, a result that has remained essentially unimproved for over fifty years.

📌 Key points (3–5)

  • Erdős's lower bound: Using probability (or counting), we can prove R(n,n) is at least roughly (n / (e · sqrt(2))) × 2^(n/2), an exponential lower bound.
  • Two proof perspectives: The same result can be shown by counting graphs or by analyzing random graphs with independent edge probabilities; the probabilistic view has proven more powerful.
  • Common confusion: The probabilistic method does not construct an explicit graph—it only proves existence by showing the expected number of "bad" graphs is less than the total.
  • Stubborn open problem: Despite fifty years of effort, no one has significantly improved the exponential bounds (upper from Ramsey's theorem, lower from Erdős), and constructive methods have fared even worse.
  • Power of the method: The probabilistic approach extends to prove existence of graphs with both large girth (no small cycles) and large chromatic number, properties that seem contradictory.

🎲 The probabilistic lower bound for Ramsey numbers

🎲 What Erdős proved

Theorem 11.4: If n is a positive integer, then R(n,n) is at least (n / (e · sqrt(2))) × 2^(n/2).

  • R(n,n) is the smallest number such that any graph on that many vertices must contain either a complete subgraph of size n or an independent set of size n.
  • The theorem gives a lower bound: it shows R(n,n) cannot be too small.
  • This bound is exponential in n, which is much stronger than polynomial growth.

🔢 The counting argument (original proof)

The original proof counts graphs directly:

  • Let F be the family of all labeled graphs on t vertices; there are 2 to the power C(t,2) such graphs.
  • Let F₁ be graphs containing a complete subgraph of size n; let F₂ be graphs containing an independent set of size n.
  • The sizes of F₁ and F₂ can be bounded using binomial coefficients and Stirling's approximation.
  • If we can choose t large enough so that |F₁| + |F₂| < |F|, then there must exist a graph avoiding both properties.
  • The calculation shows t can be as large as roughly (n / (e · sqrt(2))) × 2^(n/2) while maintaining this inequality.

Don't confuse: This is not constructing a specific graph; it is proving one must exist by showing there are more total graphs than "bad" graphs.

🎰 The probabilistic reinterpretation

The same proof can be recast using probability:

  • Consider a random graph on t vertices where each edge appears independently with probability 1/2.
  • Let X₁ count the number of n-element complete subgraphs; let X₂ count n-element independent sets.
  • Set X = X₁ + X₂.
  • By linearity of expectation: E(X) = E(X₁) + E(X₂).
  • Each expectation equals C(t,n) times (1/2)^C(n,2).
  • If E(X) < 1, then there must exist a graph with X = 0 (i.e., no large complete subgraph or independent set).
  • The calculation for how large t can be is identical to the counting version.

Why this perspective matters: The probabilistic view has proven to be "the right one" and has led to many other powerful results.

🚧 The stubborn gap in our knowledge

🚧 What remains unknown after fifty years

Despite efforts by many bright researchers over more than half a century, only marginal improvements have been made on the bounds:

  • Upper bound: Is R(n,n) < 2^(cn) for some constant c < 2? Status: unknown.
  • Lower bound: Is R(n,n) > 2^(dn) for some constant d > 1/2? Status: unknown.
  • The excerpt notes: "We would certainly give you an A for this course if you managed to do either."
  • This illustrates how difficult the problem is, even though the probabilistic method easily gives an exponential lower bound.

🔨 Constructive methods fare even worse

Carlos's struggle highlights a deeper issue:

  • Constructive methods: explicitly building a graph with the desired properties (no random techniques).
  • Carlos could only show R(n,n) grows like n^c (polynomial), far weaker than the exponential bound from probability.
  • The state of the art: No one has shown R(n,n) > c^n for any constant c > 1 using only constructive methods.
  • Alice's reassurance: "Maybe saying you are unable to do something that lots of other famous people seem also unable to do is not so bad."

Don't confuse: Existence proofs (probabilistic) vs. constructive proofs (explicit examples)—the former can be much more powerful but don't tell you how to build the object.

🌀 Extending the method: large girth and large chromatic number

🌀 A seemingly contradictory result

Theorem 11.7 (Erdős): For every pair of integers g ≥ 3 and t, there exists a graph G with chromatic number χ(G) > t and girth greater than g.

  • Girth: the length of the smallest cycle in the graph.
  • Chromatic number: the minimum number of colors needed to color vertices so no edge connects same-colored vertices.
  • This seems contradictory: large chromatic number usually comes from having many triangles or small cycles, but large girth means no small cycles.

🎲 The probabilistic construction strategy

The proof uses a two-stage randomization approach:

  1. Choose parameters: integers n (number of vertices), s (independent set size), and edge probability p.
  2. First goal: ensure with high probability there is no independent set of size s.
    • Let X₁ count s-element independent sets.
    • Want E(X₁) < 1/4.
    • Set s = (2 ln n)/p to achieve this.
  3. Second goal (modified): don't try to avoid all small cycles—just ensure there are relatively few (fewer than n/2).
    • Let X₂ count cycles of size at most g.
    • Want E(X₂) < n/4.
    • Set p = n^((1/g) - 1) / 10 to achieve this.
  4. Remove bad vertices: delete one vertex from each small cycle, leaving at least n/2 vertices.
  5. Final graph H: has no small cycles, no independent set of size s, so chromatic number at least (n/2)/s.
  6. Parameter choice: require n > 2st, which needs n^(1/g) / (40 ln n) > t.

Example scenario: To get girth > 5 and chromatic number > 100, choose n large enough that n^(1/5) / (40 ln n) > 100, then construct the random graph and remove a vertex from each small cycle.
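Step 4 of the strategy, deleting one vertex from each short cycle, can be illustrated on a small graph. A minimal sketch (the `girth` helper is our own, computing the shortest cycle by BFS from each edge):

```python
from collections import deque
from itertools import combinations

def girth(vertices, edges):
    """Length of the shortest cycle, or None if the graph is acyclic.
    For each edge (u, v): remove it, BFS from u to v; the shortest such
    path plus the edge itself is a candidate shortest cycle."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = None
    for u, v in edges:
        dist = {u: 0}
        q = deque([u])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if (x, y) in ((u, v), (v, u)):
                    continue                  # skip the temporarily removed edge
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)
        if v in dist and (best is None or dist[v] + 1 < best):
            best = dist[v] + 1
    return best

# A 6-cycle plus the chord (0, 2): the chord creates the triangle 0-1-2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 2)]
assert girth(range(6), edges) == 3

# Remove one vertex (0) from the unique short cycle; what remains is acyclic,
# so its girth exceeds any target g.
remaining = [e for e in edges if 0 not in e]
assert girth(range(1, 6), remaining) is None
```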

🧠 Gaining intuition with the method

Experienced researchers simplify by focusing on essential steps:

  • Want E(X₁) small: set n^s · e^(-ps²) ≈ 1, get s ≈ (ln n)/p.
  • Want at most about n small cycles: set (np)^g ≈ n, get p ≈ n^((1/g) - 1).
  • Want n ≥ st: requires roughly n^(1/g) ≥ t (ignoring logarithmic factors).
  • The rest is "just paying attention to details."

Don't confuse: The probabilistic method proves existence but doesn't construct the graph explicitly—you know it's out there but may not know how to find it efficiently.

🤔 Philosophical implications

🤔 Trust and randomness

The discussion raises questions about randomized algorithms:

  • Zori: "Who in their right mind would trust their lives to an algorithm that used random methods?"
  • Xing's response: We already trust probabilistic reasoning daily (crossing streets, flying on planes); probability less than 10^(-20) is acceptable.
  • This reflects broader acceptance of probabilistic guarantees in practice.

🧩 Existence without construction

Dave's observation captures a deep puzzle:

  • "You have to be struck by the statements that it appears difficult to construct objects which you can prove exist in abundance."
  • Carlos wonders: "Maybe one could prove that there are easily stated theorems which only have long proofs."
  • This foreshadows complexity theory and questions about the gap between existence and explicit construction.

Zori's takeaway: There may be problems that are easy to state but fundamentally difficult to solve—potentially valuable for clients willing to pay for better-than-competition solutions.


11.5 Ramsey’s Theorem

🧭 Overview

🧠 One-sentence thesis

Ramsey's theorem guarantees that for any fixed number of categories and subset size, there exists a sufficiently large set that must contain a large uniform subset where all subsets of the fixed size fall into the same category.

📌 Key points

  • The probabilistic method's power: Using random graphs and expected values can prove existence of graphs without certain structures, yielding exponential lower bounds for Ramsey numbers.
  • The gap between bounds: After fifty years, the bounds on R(n,n) remain far apart—no one knows if it grows like 2^(cn) for c < 2 or 2^(dn) for d > 1/2.
  • Constructive vs probabilistic methods: Constructive methods have only achieved polynomial bounds (n^c), while probabilistic methods easily give exponential bounds; no constructive proof has shown R(n,n) > c^n for any constant c > 1.
  • Common confusion: The Ramsey number R(n,n) is not about counting edges or vertices directly; it's about the minimum size needed to guarantee either a complete subgraph K_n or an independent set I_n.
  • General Ramsey theorem: The result extends to r categories (colors), subset size s, and target sizes h₁, h₂, ..., h_r, guaranteeing a uniform large subset.

🎲 The probabilistic method for lower bounds

🎲 Random graph construction

The proof uses a probability space where outcomes are graphs with vertex set {1, 2, ..., t}.

  • Each edge ij (where i < j) appears independently with probability 1/2.
  • This randomness allows counting "bad" configurations via expected value.

📊 Counting via random variables

The proof defines:

  • X₁: counts n-element subsets where all (n choose 2) pairs are edges (complete subgraphs K_n).
  • X₂: counts n-element independent subsets (no edges, denoted I_n).
  • X = X₁ + X₂: total count of "bad" configurations.

By linearity of expectation:

  • E(X) = E(X₁) + E(X₂)
  • E(X₁) = E(X₂) = (t choose n) × (1/2)^(n choose 2)

🔑 The existence argument

If E(X) < 1, then there must exist a graph with vertex set {1, 2, ..., t} without a K_n or an I_n.

  • The key insight: if the average number of bad configurations is less than 1, at least one graph must have zero bad configurations.
  • This proves R(n,n) > t for that value of t.
  • Example: If we can make the expected count of both complete and independent n-subsets sum to less than 1, some graph avoids both structures.

🧮 The calculation

The question becomes: how large can t be while maintaining E(X) < 1?

  • Use the inequality (t choose n) ≤ t^n / n! together with Stirling's approximation for n!.
  • After algebra and taking the nth root, the requirement E(X) < 1 is met whenever t < (n / (e × sqrt(2))) × 2^(n/2).
  • This yields an exponential lower bound on R(n,n).

🔍 The unsolved gap in Ramsey numbers

📏 Known bounds from Theorems 11.2 and 11.4

The excerpt states that Theorem 11.2 gives an upper bound and Theorem 11.4 gives a lower bound on R(n,n).

  • After more than fifty years and efforts by many researchers, only marginal improvements have been made.
  • The gap between upper and lower bounds remains enormous.

❓ Open question: upper bound

Can we improve the upper bound?

  • Unknown: Is there a constant c < 2 and an integer n₀ such that R(n,n) < 2^(cn) when n > n₀?
  • The excerpt notes no one has settled this question.

❓ Open question: lower bound

Can we improve the lower bound?

  • Unknown: Is there a constant d > 1/2 and an integer n₁ such that R(n,n) > 2^(dn) when n > n₁?
  • The excerpt notes no one has been able to answer this either.
  • The excerpt humorously offers: "We would certainly give you an A for this course if you managed to do either."

🔨 Constructive vs probabilistic methods

🔨 The constructive limitation

Carlos tried to prove a good lower bound on R(n,n) using only constructive methods (no random techniques).

  • Problem: Constructive approaches seem only to show R(n,n) ≥ n^c for some constant c.
  • This is polynomial, which is "so weak compared to the exponential bound which the probabilistic method gives easily."

🚧 The constructive barrier

Alice points out a fundamental limitation:

Nobody has been able to show that there is a constant c > 1 and an integer n₀ so that R(n,n) > c^n when n > n₀, provided that only constructive methods are allowed.

  • Don't confuse: This is not about R(n,n) itself being small; it's about what we can prove without probabilistic arguments.
  • The probabilistic method proves exponential bounds exist, but no one can explicitly construct the graphs that achieve them.
  • Example: We know "most" random graphs avoid K_n and I_n for appropriate t, but we cannot write down a specific graph family that does so.

💡 The consolation

Alice's observation: "saying that you are unable to do something that lots of other famous people seem also unable to do is not so bad."

  • This highlights the depth of the problem: the gap between existence proofs and explicit constructions is a major open question in combinatorics.

🌐 The general Ramsey theorem

📐 Formal statement (Theorem 11.6)

The theorem generalizes to:

  • r: number of bins or colors (categories).
  • s: fixed subset size.
  • h = (h₁, h₂, ..., h_r): a string of integers with hᵢ ≥ s for each i.

There exists a least positive integer R(s : h₁, h₂, ..., h_r) so that if n ≥ R(s : h₁, h₂, ..., h_r) and φ : C([n], s) → [r] is any function, then there exists an integer i ∈ [r] and a subset H ⊆ [n] with |H| ≥ hᵢ so that φ(S) = i for every S ∈ C(H, s), that is, for every s-element subset S of H.

🧩 What this means

  • Setup: You have a set of size n, and you assign each s-element subset to one of r categories via function φ.
  • Conclusion: If n is large enough (at least R(s : h₁, h₂, ..., h_r)), there must exist:
    • Some category i.
    • A large subset H of size at least hᵢ.
    • All s-element subsets of H are assigned to category i (uniform treatment).

🔗 Connection to earlier results

  • The graph coloring version (edges present or absent) is the special case where s = 2 and r = 2.
  • R(n,n) corresponds to h₁ = h₂ = n, asking for either n vertices all connected or n vertices with no edges.
  • The general theorem shows the phenomenon is not limited to graphs: any partition of fixed-size subsets into finitely many categories must produce large uniform structures.
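The s = 1 case is not discussed in the excerpt, but it reduces the general theorem to the pigeonhole principle, where the Ramsey number has a closed form: R(1 : h₁, ..., h_r) = h₁ + ... + h_r − r + 1. A brute-force sketch checking this for tiny parameters (the helper name is ours):

```python
from itertools import product

def pigeonhole_ramsey(h):
    """Smallest n such that EVERY r-coloring of [n] puts at least h[i]
    elements in some color class i (checked by brute force)."""
    r = len(h)
    n = 1
    while True:
        if all(any(coloring.count(i) >= h[i] for i in range(r))
               for coloring in product(range(r), repeat=n)):
            return n
        n += 1

assert pigeonhole_ramsey((2, 3)) == 2 + 3 - 2 + 1         # = 4
assert pigeonhole_ramsey((2, 2, 2)) == 2 + 2 + 2 - 3 + 1  # = 4
```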

11.6 The Probabilistic Method

🧭 Overview

🧠 One-sentence thesis

The probabilistic method, pioneered by Erdős, proves the existence of graphs with seemingly contradictory properties—large girth (no small cycles) and large chromatic number (no small independent sets)—by showing that a random graph has positive probability of satisfying both conditions.

📌 Key points (3–5)

  • Core idea: Instead of constructing a graph directly, prove one exists by showing a random graph satisfies the desired properties with positive probability.
  • Two-stage strategy: First ensure the random graph has few small cycles and no large independent sets, then remove vertices from small cycles to get the final graph.
  • Key technique: Use expected value and Markov's Inequality to bound the probability that undesirable features (too many small cycles or large independent sets) occur.
  • Common confusion: The goal is not to eliminate all small cycles in the random graph; instead, ensure there are few enough (less than n/2) so removal still leaves a large graph.
  • Why it matters: This method proves existence without explicit construction—something that constructive methods have failed to achieve for many problems (e.g., exponential lower bounds for Ramsey numbers).

🎯 The power of non-constructive proof

🎯 Constructive vs probabilistic methods

The excerpt contrasts two approaches through Carlos's struggle:

  • Constructive methods: Build an explicit example; Carlos could only prove R(n, n) ≥ n^c for some constant c.
  • Probabilistic method: Prove existence by showing positive probability; easily gives exponential bounds for R(n, n).

Alice's observation: No one has been able to show R(n, n) > c^n for any constant c > 1 using only constructive methods when n > n₀.

  • This highlights a fundamental gap: the probabilistic method can prove things exist that we cannot explicitly construct.
  • Don't confuse: "Unable to construct" does not mean "does not exist"—the probabilistic method bridges this gap.

🔧 The two-stage strategy

🔧 Stage 1: Random graph with controlled properties

The proof starts by considering a random graph on n vertices where each pair of vertices becomes an edge independently with probability p.

Two goals (both must hold simultaneously):

  1. No large independent set: Ensure the graph has no independent set of size s (this will force large chromatic number).
  2. Few small cycles: Ensure the number of cycles of length at most g (the girth parameter) is less than n/2.

Why "few" instead of "none": Requiring zero small cycles is too restrictive; having fewer than n/2 is sufficient because we can remove one vertex from each small cycle and still have at least n/2 vertices remaining.

🎲 Stage 2: Cleanup by removal

After obtaining a random graph with the properties above:

  • Remove one vertex from each small cycle (there are fewer than n/2 such cycles).
  • The resulting graph H has:
    • At least n/2 vertices remaining.
    • Girth greater than g (no cycles of length ≤ g remain).
    • No independent set of size s (removal doesn't create new independent sets).
    • Chromatic number at least (n/2)/s.

To make chromatic number exceed t, choose n > 2st.

📊 Controlling random variables with expectation

📊 Random variable X₁: Counting independent sets

X₁ = the number of s-element independent sets in the random graph.

Expected value calculation:

  • E(X₁) = (number of s-element subsets) × (probability each is independent)
  • E(X₁) = C(n, s) × (1 - p)^(C(s,2))
  • Using approximations: C(n, s) ≤ n^s = e^(s ln n) and (1 - p)^(C(s,2)) ≈ e^(-ps²/2)
  • Set s = (2 ln n)/p to make E(X₁) < 1/4.

Markov's Inequality application: If E(X₁) < 1/4, then the probability that X₁ exceeds 1/2 (which is 2E(X₁)) is less than 1/2.

  • This means with probability > 1/2, the random graph has X₁ = 0 (no s-element independent sets).
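Markov's inequality P(X ≥ a) ≤ E(X)/a can be checked exactly on a toy case: in a random graph on 4 vertices with p = 1/2, let X count 2-element independent sets (i.e., non-edges), so E(X) = C(4,2) · 1/2 = 3. A sketch enumerating the whole probability space (our own construction, not from the text):

```python
from fractions import Fraction
from itertools import combinations

# All graphs on 4 vertices; each of the 2^6 edge subsets is equally likely.
pairs = list(combinations(range(4), 2))          # 6 vertex pairs
outcomes = []
for mask in range(1 << 6):
    edges = {pairs[i] for i in range(6) if mask >> i & 1}
    outcomes.append(6 - len(edges))              # X = number of non-edges

ex = Fraction(sum(outcomes), len(outcomes))
assert ex == 3                                   # E(X) = 6 * (1/2) = 3

for a in range(1, 7):
    tail = Fraction(sum(1 for x in outcomes if x >= a), len(outcomes))
    assert tail <= ex / a                        # Markov: P(X >= a) <= E(X)/a
```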

📊 Random variable X₂: Counting small cycles

X₂ = the number of cycles of length at most g in the random graph.

Expected value calculation:

  • Sum over cycle lengths i from 3 to g.
  • For each length i: (number of ways to choose i vertices in order) × p^i
  • E(X₂) ≤ sum from i=3 to g of [n(n-1)(n-2)...(n-i+1) × p^i] ≤ g(pn)^g.

Target: Want E(X₂) < n/4.

  • Set p = n^((1/g) - 1) / 10; then np = n^(1/g) / 10, so g(np)^g = g · n / 10^g ≤ n/4.

Markov's Inequality application: If E(X₂) < n/4, then the probability that X₂ exceeds n/2 (which is 2E(X₂)) is less than 1/2.

  • This means with probability > 1/2, the random graph has fewer than n/2 small cycles.
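The displayed bound on E(X₂) can be compared with the exact expectation, which counts each cycle once: E(X₂) = Σ [n! / ((n − i)! · 2i)] · p^i, since an i-cycle can be listed in 2i equivalent rotations and reflections (the text's sum drops the 1/(2i) and is therefore an upper bound). With p = 1/2 every graph on 5 vertices is equally likely, so a sketch can verify the exact formula by enumeration:

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import perm

n, g = 5, 5
pairs = list(combinations(range(n), 2))  # the 10 possible edges

def count_short_cycles(edges):
    """Count each cycle of length 3..g exactly once."""
    count = 0
    for i in range(3, g + 1):
        for verts in combinations(range(n), i):
            for order in permutations(verts):
                # canonical ordering kills rotations and reflections
                if order[0] == min(order) and order[1] < order[-1]:
                    cyc = [tuple(sorted((order[j], order[(j + 1) % i])))
                           for j in range(i)]
                    if all(e in edges for e in cyc):
                        count += 1
    return count

# Average the cycle count over all 2^10 equally likely graphs (p = 1/2).
total = sum(count_short_cycles({pairs[i] for i in range(10) if m >> i & 1})
            for m in range(1 << 10))
exact = Fraction(total, 1 << 10)

# Exact E(X2) = sum over i of [n! / ((n - i)! * 2i)] * (1/2)^i terms:
formula = sum(Fraction(perm(n, i), 2 * i) * Fraction(1, 2 ** i)
              for i in range(3, g + 1))
assert exact == formula  # both equal 41/16
```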

🎯 Combining the bounds

Since both bad events (X₁ > 0 and X₂ ≥ n/2) each have probability < 1/2:

  • The probability that at least one bad event occurs is < 1/2 + 1/2 = 1.
  • Therefore, there exists a graph where both X₁ = 0 and X₂ < n/2.

Key insight: We don't need to compute the exact probability distribution; bounding the expected value via Markov's Inequality is sufficient to guarantee existence.

🏆 The main result: Girth vs chromatic number

🏆 Erdős' theorem statement

Theorem 11.7 (Erdős): For every pair g, t of integers with g ≥ 3, there exists a graph G with χ(G) > t and girth of G greater than g.

What it means:

  • Girth > g: No cycles of length ≤ g (the graph is "locally sparse").
  • Chromatic number > t: Cannot be colored with t colors (requires many colors, so "globally dense" in some sense).

Why surprising: Intuitively, graphs with no small cycles seem "sparse," yet they can require arbitrarily many colors—properties that seem contradictory.

🏆 Parameter choices

To prove the theorem for given g and t:

  1. Choose n > 2st (ensures chromatic number will exceed t).
  2. Set p = n^((1/g) - 1) / 10 (controls small cycle count).
  3. Set s = (2 ln n)/p (controls independent set size).

The resulting graph H (after vertex removal) satisfies:

  • Girth > g (by construction, all small cycles removed).
  • No independent set of size s.
  • Chromatic number ≥ (n/2)/s > (2st/2)/s = t.

Example: To get a graph with girth > 5 and chromatic number > 100, choose n large enough (n > 200s), set p and s accordingly, and the probabilistic method guarantees such a graph exists.

🔍 Contrast with constructive examples

The excerpt notes that triangle-free graphs with large chromatic number were constructed in Chapter 5, but:

  • Those constructions all have girth exactly 4 (they contain 4-cycles).
  • The probabilistic method proves graphs with arbitrarily large girth and chromatic number exist, even though no explicit construction is known.

Don't confuse: Girth 4 means "triangle-free but has 4-cycles"; girth > g means "no cycles of length ≤ g at all."


11.7 Discussion

🧭 Overview

🧠 One-sentence thesis

The probabilistic method reveals that many combinatorial objects provably exist in abundance yet are surprisingly difficult to construct explicitly, a gap that raises fundamental questions about proof complexity and computational difficulty.

📌 Key points (3–5)

  • Practical acceptance of randomness: People routinely trust probabilistic reasoning in everyday life (e.g., airline safety with disaster probability less than 10^-20), even without formal training.
  • Existence vs construction gap: Objects proven to exist abundantly via probabilistic arguments often appear difficult to construct explicitly—a striking asymmetry.
  • Fundamental question raised: There may be easily stated theorems that only have long proofs, suggesting inherent complexity barriers.
  • Common confusion: "Random" does not mean "unreliable"—probabilistic guarantees can be extremely strong (e.g., probabilities like 10^-20).
  • Practical implications: Problems that are easy to state but hard to solve may represent opportunities (or challenges) beyond known complexity classes like NP.

🎲 Trusting randomness in practice

🎲 Everyday probabilistic reasoning

  • Xing argues that everyone already trusts probabilistic methods: crossing streets (risk of being hit by a bus) and other daily activities involve implicit probability judgments.
  • The general public is comfortable with probability concepts even without knowing the formal definition of a probability space.
  • Example: Xing is "completely comfortable taking an airline flight" if the disaster probability is less than 10^-20.

🔍 Don't confuse probabilistic with unreliable

  • Zori's initial skepticism ("Who in their right mind would trust their lives to an algorithm that used random methods?") reflects a common misunderstanding.
  • Probabilistic guarantees can be far stronger than many deterministic everyday risks.
  • The key is the magnitude of the probability: extremely low failure rates (like 10^-20) are effectively negligible.

🧩 The existence-construction paradox

🧩 What the paradox is

  • Dave observes: "You have to be struck by the statements that it appears difficult to construct objects which you can prove exist in abundance."
  • The probabilistic method shows that certain combinatorial structures (e.g., graphs with specific properties) exist with high probability.
  • Yet explicitly constructing such objects—actually writing down a specific example—often appears difficult.

🤔 Why this matters

  • This gap between proving existence and finding an instance is not just a technical curiosity.
  • It suggests something fundamental about the nature of mathematical proof and computation.
  • Carlos raises the possibility: "Maybe one could prove that there are easily stated theorems which only have long proofs."

📐 Implications for proof complexity

  • If some theorems can only be proven with very long arguments, this would be a deep result about the limits of mathematical reasoning.
  • Bob's reaction ("That doesn't make any sense") shows this idea is counterintuitive—but the existence-construction gap hints it may be true.
  • The excerpt does not resolve this question but identifies it as "something fundamental."

💼 Practical and theoretical frontiers

💼 Beyond known complexity classes

  • Zori sees commercial opportunity: problems that are "readily understood but somehow difficult in the end" could command high fees.
  • She knows about the complexity class NP (problems where solutions can be verified quickly).
  • The excerpt suggests there may be "even bigger challenges (and bigger paychecks)" beyond NP—problems where even the existence-construction gap is more severe.

🧠 What this means for problem-solving

  • Easy-to-state problems can hide deep difficulty.
  • The probabilistic method reveals this gap: proving something exists (via probability) is often much easier than building it.
  • This asymmetry may point to fundamental limits in computation and proof, not just current ignorance.

11.8 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply probabilistic methods and threshold behavior to random graphs and tournaments, demonstrating that certain structural properties emerge with high probability when graph parameters exceed critical thresholds.

📌 Key points (3–5)

  • Random graph cliques and independent sets: For edge probability p = 1/2, the expected number of large cliques plus large independent sets can be less than 1, showing both structures are rare.
  • Threshold behavior: Random graphs exhibit sharp transitions—properties like "no isolated vertices" switch from almost certainly false to almost certainly true as edge probability crosses a critical value.
  • Tournament reachability: In random tournaments, with high probability every small set of vertices has a common "dominator" vertex pointing to all of them.
  • Common confusion: Threshold behavior is not gradual—the transition happens in a narrow range around a specific probability formula (e.g., log n / n).
  • Chromatic number vs clique number: Random graphs can have chromatic number much larger than clique number, showing the basic inequality is far from tight.

🎲 Random graph structure problems

🎲 Cliques and independent sets (Exercise 1)

The exercise asks about random graphs with n vertices and edge probability p = 1/2.

Part (a): Expected value bound

  • Let X = number of complete subgraphs (cliques) of size t = 2 log n
  • Let Y = number of independent sets of size t = 2 log n
  • Goal: show E(X + Y) < 1 when n is sufficiently large
  • This uses the first moment method: since P(X + Y ≥ 1) ≤ E(X + Y) < 1, with positive probability a random graph contains neither structure, so such a graph exists
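Taking log to be base 2 (the natural choice when p = 1/2), the expectation in part (a) can be checked numerically. A minimal sketch; the value n = 1024 is an illustration, not from the source:

```python
from math import comb

def expected_cliques(n: int, t: int) -> float:
    # E[X] = C(n, t) * 2^(-C(t, 2)): a fixed t-set is a clique exactly
    # when all C(t, 2) of its pairs are edges, each present
    # independently with probability 1/2.
    return comb(n, t) / 2 ** comb(t, 2)

n = 1024                              # chosen so t = 2 * log2(n) = 20 is an integer
t = 20
total = 2 * expected_cliques(n, t)    # E[Y] = E[X] by symmetry
print(total)                          # comfortably below 1
```

Since E(X + Y) < 1, a random graph on 1024 vertices has, with positive probability, no clique and no independent set of size 20.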

Part (b): Chromatic number gap

  • Use part (a) to prove two statements hold with high probability:
    • The clique number ω(G) is less than 2 log n
    • The chromatic number is at least n / (2 log n)
  • Key insight: This shows the inequality χ(G) ≥ ω(G) can be very loose—chromatic number can be much larger than clique number
  • Example: A graph might need many colors to properly color even though it has no large cliques

🏆 Tournament domination (Exercise 2)

Random tournament: Start with a complete graph on n vertices; for each pair i, j with i < j, flip a fair coin to decide edge orientation—heads gives (i,j), tails gives (j,i).

The domination property

  • For every set S of size log n / 10, there exists a vertex x such that (x,y) is in T for every y in S
  • In other words: every small set has a "dominator" that beats all its members
  • This holds with high probability when n is large
  • Don't confuse: This is not saying one vertex dominates everything, only that each small set has some dominator
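The union bound behind this claim can be evaluated directly: a fixed vertex x outside S points to all k members of S with probability 2^(−k), so S lacks a dominator with probability (1 − 2^(−k))^(n−k), and summing over all C(n, k) sets bounds the failure probability. A sketch computed in log-space to avoid underflow; the value of n is illustrative, not from the source:

```python
from math import comb, log

def log_failure_bound(n: int, k: int) -> float:
    # log of C(n, k) * (1 - 2^-k)^(n - k), the union bound on the
    # probability that some k-set has no dominator.
    return log(comb(n, k)) + (n - k) * log(1 - 2 ** -k)

# Illustrative: n = 2**20 vertices, so k = log2(n) / 10 = 2.
print(log_failure_bound(2 ** 20, 2))   # hugely negative: failure is negligible
```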

🔗 Two-step reachability (Exercise 3)

For a random tournament T on n vertices, show with high probability:

The property For every pair of distinct vertices x, y, at least one of these holds:

  1. There is a direct edge (x,y) in T, OR
  2. There is a vertex z such that both (x,z) and (z,y) are in T

Interpretation

  • Every pair is either directly connected or connected through a common intermediate vertex
  • This is a diameter property: you can reach any vertex from any other in at most 2 steps
  • Example: If x doesn't beat y directly, there's some z that x beats and z beats y
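The diameter-2 property is easy to verify by brute force on any concrete tournament. A small sketch; the helper name and the 3-cycle tournament below are made-up illustrations:

```python
def has_diameter_two(out_neighbors: dict) -> bool:
    """Check: for every ordered pair (x, y), either x -> y directly,
    or x -> z -> y for some intermediate vertex z."""
    for x in out_neighbors:
        for y in out_neighbors:
            if x == y or y in out_neighbors[x]:
                continue
            if not any(y in out_neighbors[z] for z in out_neighbors[x]):
                return False
    return True

# The cyclic tournament a -> b -> c -> a reaches every vertex in <= 2 steps.
t3 = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
print(has_diameter_two(t3))   # True
```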

🌡️ Threshold behavior

🌡️ What threshold behavior means

Threshold behavior: A graph property switches from "almost certainly false" to "almost certainly true" as a parameter (like edge probability) crosses a critical value.

The excerpt emphasizes this is not gradual—there is a sharp transition around a specific threshold probability.

🔍 Isolated vertices threshold (Exercise 4)

The exercise demonstrates threshold behavior for the property "no isolated vertices."

Two edge probabilities to compare

| Edge probability | Result |
| --- | --- |
| p = 10 log n / n | Almost certainly no isolated vertices |
| p = log n / (10n) | Almost certainly at least one isolated vertex |

Why this shows a threshold

  • The two probabilities differ only by a constant factor (100×)
  • Below the threshold: isolation is common
  • Above the threshold: isolation disappears
  • The critical threshold is around (log n) / n

Don't confuse: "Almost certainly" means "with probability approaching 1 as n grows," not "definitely always."
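The contrast shows up already in the expected number of isolated vertices, n(1 − p)^(n−1), since a vertex is isolated exactly when all n − 1 potential edges at it are absent. A numeric sketch, taking log as the natural log and an illustrative n; a tiny expectation implies (via Markov's inequality) that isolated vertices are almost certainly absent, while the large-expectation side needs a second-moment argument not shown here:

```python
from math import log

def expected_isolated(n: int, p: float) -> float:
    # A vertex is isolated iff all n - 1 potential edges at it are absent.
    return n * (1 - p) ** (n - 1)

n = 100_000
high = 10 * log(n) / n     # above the threshold log(n) / n
low = log(n) / (10 * n)    # below the threshold
print(expected_isolated(n, high))   # essentially zero
print(expected_isolated(n, low))    # large: isolated vertices abound
```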

🌐 Connectivity threshold (Exercise 5)

The exercise asks to determine the threshold probability for a random graph to be connected (not just having no isolated vertices).

Connection to previous exercise

  • Being connected is a stronger property than having no isolated vertices
  • The threshold for connectivity should be related to but potentially different from the isolated-vertex threshold
  • The exercise asks you to find the precise threshold formula in the same sense as Exercise 4

📖 Context note

📖 Transition to algorithms

The excerpt ends with the start of Chapter 12 on Graph Algorithms, which shifts focus from probabilistic graph theory to:

  • Minimum weight spanning trees: finding a spanning tree with smallest total edge weight
  • Shortest paths: finding minimum-weight paths from a root to all other vertices
  • The weight function w: E → ℝ≥0 assigns non-negative weights to edges
  • Example application: vertices are network nodes, edge weights are costs (in thousands of dollars) of building connections; a spanning tree ensures connectivity with minimum total cost

12.1 Minimum Weight Spanning Trees

🧭 Overview

🧠 One-sentence thesis

Two efficient algorithms—Kruskal's and Prim's—can find a spanning tree of minimum total edge weight by systematically selecting edges according to different strategies, both guaranteed to produce an optimal solution.

📌 Key points (3–5)

  • The problem: given a connected graph where each edge has a weight (e.g., cost or distance), find a spanning tree whose total weight is minimal.
  • Why it matters: real-world applications like building network infrastructure require connecting all nodes at minimum cost without redundancy.
  • Two algorithms, different approaches: Kruskal's sorts all edges and adds them while avoiding cycles; Prim's grows a tree from a root by always adding the lightest edge connecting the tree to the rest of the graph.
  • Common confusion: both algorithms are greedy (always pick the lightest available edge under certain rules), but they inspect edges in different orders and may produce different trees with the same minimum weight.
  • Key guarantee: Lemma 12.6 ensures that if an edge is the lightest connecting a component to the rest of the graph, some minimum-weight spanning tree must contain it.

🏗️ Problem setup and motivation

🏗️ Weighted graphs and spanning trees

A weighted graph is a pair (G, w) where G = (V, E) is a connected graph and w: E → [0, ∞) assigns a weight w(e) to each edge e.

  • The weight of a set S of edges is w(S) = sum of w(e) for all e in S.
  • A spanning tree T connects all vertices with no cycles and has exactly n − 1 edges (where n is the number of vertices).
  • The weight of a spanning tree is the sum of the weights of its edges.

💡 Real-world scenario

  • Example: vertices represent network nodes; edges represent possible direct physical connections; weights represent construction costs (in thousands of dollars).
  • Goal: connect all nodes so data can flow between any pair, with no redundant links, at minimum total cost.
  • A minimum-weight spanning tree achieves this: it is connected (so every node can reach every other node) and acyclic (so there is no redundancy), and it has the smallest possible total weight.

🔑 Key preliminaries

🔑 Spanning forests and component counts

Proposition 12.3: Let G = (V, E) be a graph on n vertices, and let H = (V, S) be a spanning forest. Then 0 ≤ |S| ≤ n − 1. If |S| = n − k, then H has k components. In particular, H is a spanning tree if and only if it contains n − 1 edges.

  • A spanning forest is a subgraph that includes all vertices and is acyclic (a forest of trees).
  • The number of edges determines the number of components: fewer edges → more components.
  • A spanning tree is a spanning forest with exactly one component (i.e., connected).

🔄 Exchange principle

Proposition 12.4 (Exchange Principle): Let T = (V, S) be a spanning tree in graph G, and let e = xy be an edge of G not in T. Then:

  1. There is a unique path P from x to y in T.
  2. For each edge f on this path, exchanging f for e (remove f, add e) produces a new spanning tree.

Why this matters:

  • Adding edge e to tree T creates exactly one cycle (because T already had a unique path from x to y).
  • Removing any edge f on that path breaks the cycle and restores the tree property.
  • This shows that spanning trees are "interchangeable" by swapping edges, which is crucial for proving algorithm correctness.

Example: Suppose T connects x to y via path x–a–b–y, and e = xy is not in T. Adding e creates cycle x–a–b–y–x. Removing any edge on the path (e.g., edge a–b) breaks the cycle and yields a new spanning tree.
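The exchange can be carried out mechanically: find the unique tree path between the endpoints of e, drop one of its edges, and add e. A minimal sketch using the same x–a–b–y example, with a DFS path finder as an implementation convenience:

```python
def tree_path(adj, x, y):
    """Unique path from x to y in a tree given as an adjacency dict."""
    stack, parent = [x], {x: None}
    while stack:
        u = stack.pop()
        if u == y:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    path, u = [], y
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]

# Tree on {x, a, b, y} with path x - a - b - y; add e = xy, drop ab.
tree = {"x": {"a"}, "a": {"x", "b"}, "b": {"a", "y"}, "y": {"b"}}
path = tree_path(tree, "x", "y")
print(path)                            # the unique x-to-y path
tree["a"].discard("b"); tree["b"].discard("a")   # remove edge ab
tree["x"].add("y"); tree["y"].add("x")           # add edge xy
```

After the swap the graph still has 4 vertices and 3 edges and remains connected, so it is again a spanning tree.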

🧩 The optimality lemma

Lemma 12.6: Let F be a spanning forest of G and let C be a component of F. Let e = xy be an edge of minimum weight among all edges with one endpoint in C and the other not in C. Then among all spanning trees of G that contain the forest F, there is one of minimum weight that contains edge e.

Plain language:

  • If you have a partial forest F and you look at all edges leaving some component C, the lightest such edge e is "safe" to include: some optimal spanning tree extending F will use e.

Why this works (from the proof):

  • Suppose some minimum-weight spanning tree T (extending F) does not use e.
  • T must use some other edge f to connect C to the rest of the graph.
  • By the exchange principle, we can swap f for e to get a new tree T′.
  • Since e is the lightest edge leaving C, w(e) ≤ w(f), so w(T′) ≤ w(T).
  • Therefore T′ is also a minimum-weight spanning tree, and it contains e.

Don't confuse: This lemma does not say e must be in every minimum-weight spanning tree—only that there exists at least one optimal tree containing e.

🌲 Kruskal's algorithm (Avoid Cycles)

🌲 How Kruskal's algorithm works

Algorithm 12.8 (Kruskal's Algorithm):

  1. Sort all m edges by weight: e₁, e₂, …, eₘ so that w(e₁) ≤ w(e₂) ≤ … ≤ w(eₘ).
  2. Initialization: Set S = ∅ (the edge set of the spanning tree) and i = 0.
  3. Inductive step: While |S| < n − 1:
    • Find the smallest j > i such that adding edge eⱼ to S does not create a cycle.
    • Set i = j and S = S ∪ {eⱼ}.

Plain language:

  • Inspect edges in order from lightest to heaviest.
  • Add each edge to the tree unless it would create a cycle.
  • Stop when you have n − 1 edges (a spanning tree).
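The three steps translate directly into code. A sketch; the union-find structure is one standard way to implement the cycle check (the excerpt does not prescribe a data structure), and the example graph is made up:

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) with vertices 0..n-1.
    Returns (total_weight, chosen_edges)."""
    parent = list(range(n))

    def find(x):                          # root of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]] # path halving
            x = parent[x]
        return x

    total, chosen = 0, []
    for w, u, v in sorted(edges):         # lightest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                      # different components: no cycle
            parent[ru] = rv
            total += w
            chosen.append((u, v))
        if len(chosen) == n - 1:
            break
    return total, chosen

# Tiny made-up example: a square with one diagonal.
edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
print(kruskal(4, edges))
```

An edge is rejected exactly when its endpoints already share a root, which is the "would create a cycle" test.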

✅ Why Kruskal's algorithm is correct

  • At each step, S forms a spanning forest F with some number of components.
  • The next edge e added by Kruskal's is the lightest edge overall that does not create a cycle, which means it connects two different components.
  • By Lemma 12.6, there exists a minimum-weight spanning tree containing all edges in S and also containing e.
  • By induction, when the algorithm terminates (|S| = n − 1), S is a minimum-weight spanning tree.

📋 Example: Kruskal's on the weighted graph

The excerpt provides a 13-vertex graph (Figure 12.1) with labeled edges and weights.

Edges added in order (with weights):

  1. ck (23)
  2. ag (25)
  3. fg (26)
  4. fi (29)
  5. fj (30)
  6. bj (34)
  7. bc (39)
  8. em (49)
  9. dl (55)
  10. dj (56) — note: al also has weight 56 but was not chosen because dj was inspected first
  11. ek (59)
  12. ch (79)

Total weight: 504

Key observations:

  • Edge fb (weight 38) was skipped because, together with the already-chosen edges fj and bj, it would create the cycle f–j–b–f.
  • Edge ai was skipped because it would create a cycle.
  • Edges al, dk, km, dm were skipped for the same reason.
  • The algorithm stops after adding 12 edges (13 vertices → 13 − 1 = 12 edges).

🌳 Prim's algorithm (Build Tree)

🌳 How Prim's algorithm works

Algorithm 12.10 (Prim's Algorithm):

  1. Choose a root vertex r.
  2. Initialization: Set W = {r} (the set of vertices in the tree so far) and S = ∅ (the edge set).
  3. Inductive step: While |W| < n:
    • Find an edge e of minimum weight with one endpoint in W and the other not in W.
    • If e = xy with x ∈ W and y ∉ W, update W = W ∪ {y} and S = S ∪ {e}.

Plain language:

  • Start with a single root vertex.
  • Repeatedly add the lightest edge that connects the current tree to a new vertex outside the tree.
  • Stop when all vertices are included.
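The loop translates into code almost verbatim once the "lightest edge leaving W" lookup is delegated to a priority queue. A sketch; the heap is an implementation choice, and the example graph is made up:

```python
import heapq

def prim(adj, root):
    """adj: {vertex: [(weight, neighbor), ...]} for an undirected graph.
    Returns (total_weight, chosen_edges)."""
    in_tree = {root}
    heap = [(w, root, y) for w, y in adj[root]]
    heapq.heapify(heap)
    total, chosen = 0, []
    while heap and len(in_tree) < len(adj):
        w, x, y = heapq.heappop(heap)   # lightest edge leaving the tree
        if y in in_tree:
            continue                    # stale entry: y was absorbed already
        in_tree.add(y)
        total += w
        chosen.append((x, y))
        for w2, z in adj[y]:
            if z not in in_tree:
                heapq.heappush(heap, (w2, y, z))
    return total, chosen

# Tiny made-up example: a square with one diagonal.
adj = {
    0: [(1, 1), (4, 3), (5, 2)],
    1: [(1, 0), (2, 2)],
    2: [(2, 1), (3, 3), (5, 0)],
    3: [(3, 2), (4, 0)],
}
print(prim(adj, 0))
```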

✅ Why Prim's algorithm is correct

  • At each step, the vertices in W form a single connected component (a tree).
  • The edge e added is the lightest edge leaving this component.
  • By Lemma 12.6, there exists a minimum-weight spanning tree containing all edges in S and also containing e.
  • By induction, when the algorithm terminates (|W| = n), S is a minimum-weight spanning tree.

📋 Example: Prim's on the weighted graph

Starting from vertex a:

Edges added in order (with weights):

  1. ag (25)
  2. fg (26)
  3. fi (29)
  4. fj (30)
  5. bj (34)
  6. bc (39)
  7. ck (23) — added later than in Kruskal's run because c did not join the tree until step 6
  8. al (56) — arbitrary choice between al and jd, both weight 56
  9. dl (55)
  10. ek (59)
  11. em (49)
  12. ch (79)

Total weight: 504 (same as Kruskal's)

Key observations:

  • The spanning tree is different from Kruskal's: this one contains al instead of dj.
  • Both trees have the same total weight (504), which is the minimum.
  • The order in which edges are considered is very different from Kruskal's.

🔀 Comparing the two algorithms

| Aspect | Kruskal's (Avoid Cycles) | Prim's (Build Tree) |
| --- | --- | --- |
| Strategy | Sort all edges; add lightest edges that don't create cycles | Grow a tree from a root; always add lightest edge leaving the tree |
| Edge inspection order | Global: lightest to heaviest across the entire graph | Local: lightest edge connecting the current tree to the rest |
| Data structure | Needs efficient cycle detection (track components) | Needs efficient minimum-edge lookup (heap) |
| When edges are added | May add edges far apart in the graph early on | Always extends a single connected component |
| Result | May produce a different tree than Prim's (but same weight) | May produce a different tree than Kruskal's (but same weight) |

Common confusion:

  • Both algorithms are greedy (always pick the lightest available edge under their rules).
  • Both are guaranteed optimal (by Lemma 12.6).
  • They may produce different spanning trees with the same minimum weight (as seen in the examples: Kruskal's used dj, Prim's used al).

⚙️ Efficiency considerations

⚙️ Kruskal's complexity

  • Sorting step: requires m log m operations (where m is the number of edges).
  • Building the tree: n − 1 steps, but each step requires checking for cycles (tracking components).
  • Overall: at most O(n² log n) operations (since m ≤ n² for a simple graph).

⚙️ Prim's complexity

  • No sorting required upfront.
  • Each step: must find the lightest edge among all edges leaving the current tree.
  • Data structure: a heap (mentioned in the excerpt) allows efficient tracking of candidate edges and quick identification of the minimum-weight edge.
  • The excerpt does not give a detailed complexity analysis for Prim's, but notes that efficient implementation depends on the heap data structure.

🤔 Why not just enumerate all spanning trees?

Discussion 12.7 addresses this:

  • A graph on n vertices can have as many as n^(n−2) spanning trees (a result from Section 5.6, not detailed here).
  • For n = 20, this is already over 10²³ trees—completely impractical to enumerate.
  • Greedy algorithms like Kruskal's and Prim's avoid this explosion by making locally optimal choices guaranteed (by Lemma 12.6) to lead to a globally optimal solution.

Don't confuse: Greedy algorithms do not always work (the excerpt mentions graph coloring as a counterexample), but for minimum-weight spanning trees, the greedy approach is both correct and efficient.


12.2 Digraphs

🧭 Overview

🧠 One-sentence thesis

Digraphs extend graphs by allowing edges to have direction, enabling models where connections between vertices operate in only one direction rather than both ways.

📌 Key points (3–5)

  • Why digraphs are needed: graphs model two-way connections, but some real-world relationships (e.g., one-way flights) require directional edges.
  • What a digraph is: a structure with vertices and directed edges, where (x, y) and (y, x) are distinct ordered pairs.
  • Directed paths and cycles: sequences of vertices following edge directions; cycles return to the starting vertex.
  • Common confusion: in graphs, {x, y} and {y, x} are the same edge; in digraphs, (x, y) and (y, x) are two different directed edges, and a digraph may contain both, exactly one, or neither.
  • Weighted digraphs: edges can have lengths/distances, enabling shortest-path problems.

🔄 From graphs to digraphs

🔄 Why direction matters

  • In a graph, an edge xy represents a connection that works in both directions.
  • Problem: some real-world scenarios are asymmetric.
    • Example: a flight route from Atlanta to Fargo exists, but no direct flight from Fargo to Atlanta.
    • A graph edge between Atlanta and Fargo would lose the one-way information.
  • Solution: digraphs allow modeling directional relationships.

📐 Formal definition

Digraph: a pair (V, E) where V is a vertex set and E ⊆ V × V with x ≠ y for every (x, y) ∈ E. The pair (x, y) is a directed edge from x to y.

  • Key difference from graphs:
    • In graphs, edges are sets: {x, y} = {y, x}.
    • In digraphs, edges are ordered pairs: (x, y) ≠ (y, x).
  • For distinct vertices x and y, a digraph may contain:
    • Only (x, y),
    • Only (y, x),
    • Both (x, y) and (y, x), or
    • Neither.

🎨 Visual representation

  • Diagrams of digraphs use arrowheads to show direction.
  • Example from Figure 12.12:
    • Contains edge (a, f) but not (f, a).
    • Contains both (c, d) and (d, c).

🛤️ Paths and cycles in digraphs

🛤️ Directed path

Directed path from r to x: a sequence P = (r = u₀, u₁, …, uₜ = x) of distinct vertices where (uᵢ, uᵢ₊₁) is a directed edge in G for every i = 0, 1, …, t − 1.

  • Must follow edge directions: you can only traverse from uᵢ to uᵢ₊₁ if the directed edge (uᵢ, uᵢ₊₁) exists.
  • All vertices in the path must be distinct.

🔁 Directed cycle

Directed cycle: a directed path C = (r = u₀, u₁, …, uₜ = x) where (uₜ, u₀) is also a directed edge in G.

  • The path returns to its starting vertex via a directed edge.
  • Example: if you have a directed path from r to x, and there's also an edge (x, r), then you have a directed cycle.

📏 Weighted digraphs and distance

📏 Edge lengths

  • A weighted digraph is a pair (G, w) where:
    • G = (V, E) is a digraph.
    • w: E → [0, ∞) assigns a non-negative weight to each directed edge.
  • In this context, weight is interpreted as distance or length.
  • w(x, y) is the length of the edge (x, y).

📏 Path length and distance

Length of a path P = (r = u₀, u₁, …, uₜ = x): the sum of the lengths of the edges in the path, ∑ᵢ₌₀ᵗ⁻¹ w(uᵢ, uᵢ₊₁).

Distance from r to x: the minimum length of a directed path from r to x.

  • Path length: add up all edge weights along the path.
  • Distance: the shortest possible path length among all directed paths from r to x.
  • Don't confuse: path length is for a specific path; distance is the minimum over all possible paths.

🎯 The shortest-path problem

  • Goal: for each vertex x, find the distance from a given starting vertex r to x.
  • This problem has many applications (e.g., routing, navigation).
  • The excerpt introduces this as Problem 12.13, setting up for Dijkstra's algorithm (mentioned in section 12.3).

12.3 Dijkstra's Algorithm for Shortest Paths

🧭 Overview

🧠 One-sentence thesis

Dijkstra's algorithm systematically finds the shortest path from a root vertex to every other vertex in a weighted digraph by iteratively making vertices "permanent" in order of their distance from the root.

📌 Key points (3–5)

  • What the algorithm solves: finds both the minimum distance from a root vertex to every other vertex and an actual shortest path to each vertex.
  • How it works: maintains "permanent" and "temporary" vertices, repeatedly selecting the closest temporary vertex and updating distances through it.
  • Key mechanism: at each step, scans from the newly permanent vertex to see if routing through it shortens paths to temporary vertices.
  • Common confusion: the algorithm may find multiple paths of the same length; the path stored P(x) need not be unique, but its length δ(x) will equal the true distance.
  • Correctness relies on: any prefix of a shortest path is itself a shortest path, and permanent vertices are processed in non-decreasing order of distance.

🏗️ Problem setup and definitions

🏗️ Weighted digraphs and distance

A digraph (directed graph) G = (V, E) has directed edges (x, y) that are ordered pairs, not sets; (x, y) and (y, x) are distinct possible edges.

  • The excerpt extends graphs to digraphs where edges have direction, shown with arrowheads.
  • A weight function w assigns each directed edge (x, y) a non-negative weight w(x, y).
  • In this section, weight is interpreted as distance or length.

📏 Path length and distance

The length of a directed path P = (r = u₀, u₁, …, uₜ = x) is the sum of the lengths of its edges: Σ w(uᵢ, uᵢ₊₁) for i from 0 to t−1.

The distance from r to x is the minimum length of any directed path from r to x.

  • A directed path is a sequence of distinct vertices where each consecutive pair (uᵢ, uᵢ₊₁) is a directed edge.
  • The problem (12.13) asks: for each vertex x, find the distance from r to x and find a shortest path from r to x.

🔧 Extended weight function

  • To simplify the algorithm description, the excerpt extends w by setting w(x, y) = ∞ when x ≠ y and (x, y) is not a directed edge.
  • Infinity is treated "as if it were a number" (though it is not); in implementation, a large finite value (e.g., number of vertices × maximum edge weight) simulates infinity.

🔄 How Dijkstra's algorithm works

🔄 Two-phase structure: permanent and temporary

At each step i (where 1 ≤ i ≤ n, n = |V|), the algorithm maintains:

  1. A sequence σ = (v₁, v₂, …, vᵢ) of distinct vertices starting with r = v₁, called permanent vertices; the rest are temporary vertices.
  2. For each vertex x, a number δ(x) and a path P(x) from r to x of length δ(x).
  • Permanent vertices have their final shortest distance and path determined.
  • Temporary vertices have tentative distances that may be updated.

🚀 Initialization (Step 1)

  • Set i = 1, σ = (r).
  • Set δ(r) = 0 and P(r) = (r) (the trivial one-point path).
  • For each x ≠ r, set δ(x) = w(r, x) and P(x) = (r, x).
    • If (r, x) is not an edge, δ(x) = ∞.
  • Choose a temporary vertex x with minimum δ(x), set v₂ = x, append v₂ to σ, and increment i.

Example from the excerpt: with root a, after initialization, δ(f) = 24 is minimum among temporary vertices, so f becomes v₂ (permanent).

🔁 Inductive step (Step i, i > 1)

If i < n:

  • For each temporary vertex x, compute:
    • δ(x) ← min { δ(x), δ(vᵢ) + w(vᵢ, x) }
  • This is called "scanning from vertex vᵢ."
  • If δ(x) decreases, update P(x) by appending x to the end of P(vᵢ).
  • Choose a temporary vertex x with minimum δ(x), set vᵢ₊₁ = x, append it to σ, and increment i.

Don't confuse: even if δ(vᵢ) + w(vᵢ, x) equals δ(x), the path is not updated unless δ(x) is strictly decreased. (The excerpt notes in Step 3: "we do not change P(e) since δ(e) is not decreased by routing P(e) through c.")

🔍 Scanning mechanism

  • At each step, the algorithm "scans" from the newly permanent vertex vᵢ.
  • It checks whether routing through vᵢ gives a shorter path to any temporary vertex x.
  • The comparison is: is δ(vᵢ) + w(vᵢ, x) < δ(x)?
  • If yes, update δ(x) and P(x); if no, leave them unchanged.

Example: In Step 2, scanning from f updates d because δ(f) + w(f, d) = 24 + 120 = 144 < ∞ = δ(d).
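The permanent/temporary bookkeeping maps onto code directly; here a heap replaces the explicit "choose a temporary vertex with minimum δ" scan, and stale heap entries play the role of superseded δ values. A sketch on a made-up digraph, not the excerpt's figure:

```python
import heapq

def dijkstra(adj, root):
    """adj: {vertex: [(weight, neighbor), ...]} for a digraph.
    Returns ({vertex: distance}, {vertex: shortest path from root}).
    Missing edges play the role of weight infinity."""
    dist = {root: 0}
    path = {root: [root]}
    permanent = set()
    heap = [(0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in permanent:
            continue                    # stale heap entry
        permanent.add(v)                # v's distance is now final
        for w, x in adj[v]:             # "scan from v"
            if x not in permanent and d + w < dist.get(x, float("inf")):
                dist[x] = d + w         # strictly shorter: update
                path[x] = path[v] + [x]
                heapq.heappush(heap, (dist[x], x))
    return dist, path

adj = {
    "r": [(2, "a"), (7, "b")],
    "a": [(3, "b"), (8, "c")],
    "b": [(1, "c")],
    "c": [],
}
dist, path = dijkstra(adj, "r")
print(dist["c"], path["c"])   # shortest route goes r -> a -> b -> c
```

Note the strict inequality in the update, matching the rule that P(x) changes only when δ(x) actually decreases.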

📖 Worked example walkthrough

📖 The digraph and initialization

  • The example uses an oriented graph (at most one of (x, y) or (y, x) for each pair) with 8 vertices {a, b, c, d, e, f, g, h} and root a.
  • After Step 1:
    • δ(a) = 0, δ(f) = 24, δ(c) = 47, δ(e) = 70, all others ∞.
    • Vertex f is closest, so σ = (a, f).

📖 Steps 2–7: iterative scanning

| Step | Scan from | New permanent | Key updates |
| --- | --- | --- | --- |
| 2 | f | c | δ(d) updated to 144 via f |
| 3 | c | e | δ(b) = 102, δ(d) = 135, δ(g) = 113 via c |
| 4 | e | b | δ(b) = 101, δ(g) = 112 via e |
| 5 | b | g | δ(d) = 132, δ(h) = 180 via b |
| 6 | g | d | δ(h) = 178 via g |
| 7 | d | h | δ(h) = 161 via d |
  • The final sequence σ = (a, f, c, e, b, g, d, h).
  • Final distances: δ(a)=0, δ(f)=24, δ(c)=47, δ(e)=70, δ(b)=101, δ(g)=112, δ(d)=132, δ(h)=161.
  • Final paths: e.g., P(h) = (a, e, b, d, h).

📖 Observation: order matters

  • The excerpt shows that δ(h) is updated multiple times (180 → 178 → 161) as better routes are discovered.
  • The algorithm does not revisit permanent vertices; once a vertex is permanent, its distance and path are final.

✅ Why the algorithm is correct

✅ Two key propositions

Proposition 12.16 (Subpath optimality):

If P = (r = u₀, u₁, …, uₜ = x) is a shortest path from r to x, then for every 0 < j < t, the prefix (u₀, u₁, …, uⱼ) is a shortest path from r to uⱼ and the suffix (uⱼ, uⱼ₊₁, …, uₜ) is a shortest path from uⱼ to uₜ.

  • Any segment of a shortest path is itself a shortest path.
  • This is fundamental: if a subpath were not shortest, you could replace it and shorten the whole path, contradicting optimality.

Proposition 12.17 (Non-decreasing order):

When the algorithm halts, σ = (v₁, v₂, …, vₙ), and δ(v₁) ≤ δ(v₂) ≤ … ≤ δ(vₙ).

  • Permanent vertices are added in non-decreasing order of their distance from r.
  • This ensures that when a vertex becomes permanent, no later vertex can provide a shorter route to it.

✅ Proof sketch (Theorem 12.18)

The theorem states: when Dijkstra's algorithm terminates, for each x ∈ V, δ(x) is the true distance from r to x and P(x) is a shortest path.

Proof by induction on k, the minimum number of edges in a shortest path from r to x:

  • Base case (k = 1): the edge (r, x) is a shortest path. At Step 1, δ(x) = w(r, x) and P(x) = (r, x), which is correct.
  • Inductive step: assume correctness for all vertices reachable in ≤ k edges. Let x be reachable in k+1 edges via shortest path P = (u₀, u₁, …, uₖ, uₖ₊₁ = x).
    • By Proposition 12.16, Q = (u₀, …, uₖ) is a shortest path to uₖ.
    • By the inductive hypothesis, δ(uₖ) is correct and P(uₖ) is a shortest path to uₖ.
    • The true distance to x is δ(uₖ) + w(uₖ, x).
    • Let uₖ = vᵢ and x = vⱼ in the sequence σ.
      • If j < i (x became permanent before uₖ): then δ(x) ≤ δ(vᵢ) = δ(uₖ) ≤ δ(uₖ) + w(uₖ, x), so δ(x) is at most the true distance, hence equal (since it cannot be less).
      • If j > i (x becomes permanent after uₖ): at Step i, the algorithm sets δ(x) ≤ δ(vᵢ) + w(vᵢ, x) = δ(uₖ) + w(uₖ, x), the true distance. Again, δ(x) equals the true distance.
    • In both cases, δ(x) is the true distance and P(x) is a shortest path.

Don't confuse: P(uₖ) (the path found by the algorithm) need not be identical to Q (the shortest path in the proof), but they have the same length δ(uₖ). The algorithm guarantees correctness of distance and existence of a shortest path, not uniqueness of the path.

📚 Historical context

📚 Kruskal's and Prim's algorithms

  • Joseph B. Kruskal published his algorithm in 1956 in Proceedings of the American Mathematical Society (a three-page paper).
  • Robert C. Prim published his algorithm in 1957 in The Bell System Technical Journal, focusing on telephone network applications.
  • Prim and Kruskal were colleagues at Bell Laboratories; Prim was aware of Kruskal's prior work.

📚 Earlier discovery and naming

  • Czech mathematician Vojtěch Jarník discovered Prim's algorithm in 1929, so some call it Jarník's algorithm.
  • Dijkstra also later rediscovered it, leading to the name "Dijkstra-Jarník-Prim algorithm" in some sources.
  • Edsger Dijkstra published his shortest-path algorithm (the subject of this section), though the excerpt cuts off before completing the historical note.

Note: The excerpt does not provide the publication year or journal for Dijkstra's algorithm; it only begins to mention that "Edsger Dijkstra published his algorithm for finding shortest…" before ending.


12.5 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set provides practice applying Kruskal's and Prim's algorithms to find minimum weight spanning trees, using Dijkstra's algorithm to find shortest paths in directed graphs, and exploring theoretical properties and edge cases of these graph algorithms.

📌 Key points (3–5)

  • Two spanning tree algorithms: Kruskal's "avoid cycles" approach and Prim's "build tree" approach both find minimum weight spanning trees but in different ways.
  • Dijkstra's algorithm application: exercises require finding shortest distances from a source vertex to all other vertices in directed graphs (digraphs).
  • Practical constraints: one problem introduces mandatory edges that must be included even when minimizing total cost.
  • Common confusion: Kruskal's algorithm requires listing which edges are rejected (would create cycles), while Prim's algorithm requires listing edges in selection order.
  • Theoretical extensions: exercises explore algorithm behavior on disconnected graphs, uniqueness proofs, and failure cases with negative edge weights.

🌲 Spanning tree algorithm exercises

🔄 Kruskal's algorithm tasks

  • Exercises 1, 3, and 5 ask students to apply Kruskal's algorithm to three different weighted graphs.
  • Required output format:
    • Complete list of all edges considered
    • Indicate which edges are taken for the tree
    • Indicate which edges (if any) are rejected because they would create cycles
  • The "avoid cycles" nickname captures the algorithm's core mechanism: edges are considered in weight order but skipped if they would form a cycle.
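The "avoid cycles" mechanism can be sketched with a union-find structure. The small weighted graph below is invented for illustration; it is not one of the exercise graphs.

```python
# Sketch of Kruskal's "avoid cycles" algorithm with a union-find
# structure.  The weighted graph below is invented for illustration;
# it is not one of the exercise graphs from the text.
def kruskal(vertices, edges):
    """edges: list of (weight, u, v).  Returns (taken, rejected)."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    taken, rejected = [], []
    for w, u, v in sorted(edges):       # consider edges in weight order
        ru, rv = find(u), find(v)
        if ru == rv:
            rejected.append((w, u, v))  # would create a cycle
        else:
            parent[ru] = rv             # merge the two components
            taken.append((w, u, v))
    return taken, rejected

edges = [(1, "a", "b"), (2, "b", "c"), (3, "a", "c"), (4, "c", "d")]
taken, rejected = kruskal(["a", "b", "c", "d"], edges)
print(taken)     # [(1, 'a', 'b'), (2, 'b', 'c'), (4, 'c', 'd')]
print(rejected)  # [(3, 'a', 'c')]
```

Printing both lists matches the required output format: the edges taken for the tree plus the edges rejected because they would create cycles.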

🏗️ Prim's algorithm tasks

  • Exercises 2, 4, and 6 apply Prim's algorithm to the same three graphs used for Kruskal's exercises.
  • Required output format:
    • List edges selected by the algorithm
    • In the order they were selected (sequence matters for Prim's)
  • The "build tree" nickname reflects how Prim's grows a single connected component from a starting vertex.
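A corresponding sketch of Prim's "build tree" approach, on the same made-up graph, records edges in the order they are selected, as the exercises ask.

```python
import heapq

# Sketch of Prim's "build tree" algorithm on an invented graph;
# edges are returned in the order they are selected.
def prim(adj, start):
    """adj: {v: [(weight, neighbor), ...]}.  Returns edges in selection order."""
    in_tree = {start}
    heap = []
    for w, v in adj[start]:
        heapq.heappush(heap, (w, start, v))
    order = []
    while heap and len(in_tree) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue                    # endpoint already in the tree
        in_tree.add(v)
        order.append((u, v, w))         # record selection order
        for w2, x in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (w2, v, x))
    return order

adj = {
    "a": [(1, "b"), (3, "c")],
    "b": [(1, "a"), (2, "c")],
    "c": [(3, "a"), (2, "b"), (4, "d")],
    "d": [(4, "c")],
}
print(prim(adj, "a"))  # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 4)]
```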

🏦 Constrained optimization problem

Exercise 7 presents a bank network scenario:

  • Must connect headquarters, branches, ATMs, and Federal Reserve
  • Goal: minimize total cost
  • Constraint: two specific connections are mandatory (h to f, and b₂ to a₃)
  • Students must explain their selection process and justify why the solution is minimal
  • Example: this tests understanding that real-world problems often have non-negotiable requirements that modify the pure optimization.

🧭 Shortest path algorithm exercises

📍 Dijkstra's algorithm on graphs

  • Exercises 11 and 13 provide directed graphs as figures
  • Task: find distance from vertex a to each other vertex, plus a directed path achieving that distance
  • The algorithm builds shortest paths incrementally from the source vertex

📊 Dijkstra's algorithm on tabular data

  • Exercises 12 and 14 represent digraphs as tables
  • Format: row x, column y intersection gives the weight of directed edge (x, y)
  • Note: the weight from x to y may differ from y to x (directed edges)
  • Example given: w(b, d) = 21 but w(d, b) = 10 in one table
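A minimal sketch of running Dijkstra's algorithm on such a tabular (dict-of-dicts) representation. Only w(b, d) = 21 and w(d, b) = 10 come from the text; the remaining entries are invented to make the example self-contained.

```python
import heapq

# Sketch of Dijkstra's algorithm on a digraph given as a table:
# table[x][y] is the weight of directed edge (x, y).  Only w(b, d) = 21
# and w(d, b) = 10 come from the text; the rest are invented.
table = {
    "a": {"b": 4, "d": 30},
    "b": {"d": 21},
    "d": {"b": 10},
}

def dijkstra(table, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in table.get(x, {}).items():
            if d + w < dist.get(y, float("inf")):
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return dist

print(dijkstra(table, "a"))  # {'a': 0, 'b': 4, 'd': 25}
```

Note how the asymmetric weights matter: the shortest route to d goes through b (4 + 21 = 25), not along the direct edge of weight 30.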

🔍 Theoretical and edge-case problems

🌲 Disconnected graphs

Exercise 8:

A disconnected weighted graph obviously has no spanning trees. However, it is possible to find a spanning forest of minimum weight in such a graph.

  • Task: explain how to modify both Kruskal's and Prim's algorithms for this case
  • A spanning forest is a collection of spanning trees, one per connected component

🎯 Uniqueness proof

Exercise 10:

  • Kruskal's original paper used the algorithm to prove a uniqueness result
  • Claim: if a connected weighted graph has no two edges with the same weight, then there is a unique minimum weight spanning tree
  • Students must prove this using Kruskal's algorithm

⚠️ Algorithm failure cases

Exercise 15:

  • Find a digraph with undirected paths between all vertex pairs, but where Dijkstra's algorithm from root r cannot find finite-length paths to some vertex x
  • Tests understanding of when Dijkstra's algorithm breaks down (e.g., lack of directed connectivity)

➖ Negative edge weights

Exercise 16 explores why Dijkstra's algorithm requires non-negative weights:

| Part | Question | Concept tested |
| --- | --- | --- |
| (a) | Show Dijkstra's fails with negative weights | Algorithm correctness boundaries |
| (b) | Bob suggests adding the absolute value of the most negative weight to all edges; show this fails | Why naive fixes don't work |

  • Context provided: negative weights might represent profit rather than cost
  • Don't confuse: adding a constant to all edges changes which path is optimal (different paths have different numbers of edges)
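A tiny invented digraph makes this concrete: shifting every weight by a constant penalizes paths with more edges, so the optimal path can change.

```python
# Tiny made-up digraph showing why Bob's fix fails.  The two-edge path
# a -> b -> c costs 2 + (-2) = 0, beating the direct edge a -> c of cost 1.
# Adding |most negative weight| = 2 to every edge makes the direct edge
# cost 3 but the two-edge path 4 + 0 = 4, so the optimum flips.
w = {("a", "b"): 2, ("b", "c"): -2, ("a", "c"): 1}
shift = abs(min(w.values()))
w_shifted = {e: wt + shift for e, wt in w.items()}

original_two_edge = w[("a", "b")] + w[("b", "c")]                 # 0, optimal
original_direct = w[("a", "c")]                                   # 1
shifted_two_edge = w_shifted[("a", "b")] + w_shifted[("b", "c")]  # 4
shifted_direct = w_shifted[("a", "c")]                            # 3, now "optimal"
print(original_two_edge < original_direct)  # True
print(shifted_direct < shifted_two_edge)    # True: the optimal path changed
```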

📚 Historical context

📖 Algorithm origins

The excerpt mentions:

  • Dijkstra published his shortest-path algorithm in 1959 in a three-page paper in Numerische Mathematik
  • Edward F. Moore discovered an equivalent form two years earlier (1957) in Proceedings of an International Symposium on the Theory of Switching
  • The same 1959 paper also contained the third publication of Prim's algorithm
  • Dijkstra was aware of Kruskal's prior work but argued his algorithm was preferable because it required less graph information stored in memory at each step

Basic Notation and Terminology for Network Flows

13.1 Basic Notation and Terminology

🧭 Overview

🧠 One-sentence thesis

Network flow problems model moving something from a source to a sink through a graph with capacity constraints, and the goal is to maximize the total flow while respecting conservation laws at every intermediate vertex.

📌 Key points (3–5)

  • What a network is: an oriented graph with a source S (where flow starts), a sink T (where flow ends), and edges with non-negative capacities.
  • What a flow is: an assignment of non-negative values to edges that respects capacity limits and conservation laws (flow in = flow out at every non-source/sink vertex).
  • Conservation laws: the amount leaving the source equals the amount arriving at the sink (the flow's value), and at every intermediate vertex, inflow equals outflow.
  • Common confusion: when computing cut capacity, only edges from L to U count—edges from U to L are not included.
  • Why trivial solutions matter: assigning zero flow to every edge is always feasible, which distinguishes network flow from harder optimization problems where finding any solution can be as hard as finding the optimal one.

🌐 Network structure

🌐 Oriented graph

An oriented graph: a directed graph in which for each pair of vertices x, y at most one of the directed edges (x, y) and (y, x) is present.

  • This means you cannot have both directions of an edge between the same two vertices.
  • A network is an oriented graph used for flow problems.

🔌 Source and sink

  • Source S: the starting point; all edges incident with S point away from it.
  • Sink T (terminus): the destination; all edges incident with T point toward it.
  • Example: In a water distribution system, S is the reservoir and T is the final delivery point.

📏 Capacity

Capacity c(e) or c(x, y): a non-negative constraint on how much can be transmitted via edge e = (x, y).

  • Each edge has a capacity that limits the flow it can carry.
  • Example: In Figure 13.1, edge (E, B) has capacity 24, and edge (A, T) has capacity 56.

🌊 Flow definition and conservation

🌊 What a flow is

A flow φ in a network: a function that assigns to each directed edge e = (x, y) a non-negative value φ(e) = φ(x, y) ≤ c(x, y) such that conservation laws hold.

  • Flow values must be non-negative and cannot exceed the edge's capacity.
  • Example: In Figure 13.2, edge (E, D) has capacity 20 and carries flow 8.

⚖️ Conservation law 1: Source and sink balance

  • The total flow leaving the source S equals the total flow arriving at the sink T.
  • Formally: sum over all x of φ(S, x) = sum over all x of φ(x, T).
  • This quantity is called the value of the flow φ.
  • Example: In Figure 13.2, the flow value is 30 = φ(S, F) + φ(S, B) + φ(S, E) = φ(A, T) + φ(C, T).

⚖️ Conservation law 2: Intermediate vertex balance

  • At every vertex y that is neither the source nor the sink, inflow equals outflow.
  • Formally: sum over all x of φ(x, y) = sum over all x of φ(y, x).
  • Example: At vertex B in Figure 13.2, inflow is φ(S, B) + φ(E, B) + φ(D, B) = 20, and outflow is φ(B, F) + φ(B, A) + φ(B, C) = 20.

🎯 Why zero flow is always feasible

  • Assigning φ(e) = 0 for every edge e is always a valid flow (though not useful).
  • Don't underestimate this: in general linear programming problems, finding any feasible solution can be as hard as finding the optimal one.
  • Network flow problems are easier because a trivial feasible solution always exists.
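The definitions above can be sketched as a validity check; the tiny network and flow values below are made up, not Figure 13.2.

```python
# Sketch of checking a candidate flow against the capacity constraint and
# both conservation laws.  The two-edge network is made up, not Figure 13.2.
def is_valid_flow(capacity, flow, S, T):
    """capacity, flow: dicts keyed by directed edge (x, y)."""
    if any(not (0 <= flow[e] <= capacity[e]) for e in capacity):
        return False                       # capacity constraint violated
    vertices = {v for e in capacity for v in e}
    for y in vertices - {S, T}:
        inflow = sum(f for (x, z), f in flow.items() if z == y)
        outflow = sum(f for (z, x), f in flow.items() if z == y)
        if inflow != outflow:
            return False                   # conservation law 2 violated
    out_S = sum(f for (x, _), f in flow.items() if x == S)
    in_T = sum(f for (_, y), f in flow.items() if y == T)
    return out_S == in_T                   # conservation law 1

cap = {("S", "A"): 10, ("A", "T"): 10}
print(is_valid_flow(cap, {("S", "A"): 7, ("A", "T"): 7}, "S", "T"))  # True
print(is_valid_flow(cap, {e: 0 for e in cap}, "S", "T"))  # True: zero flow is feasible
print(is_valid_flow(cap, {("S", "A"): 7, ("A", "T"): 5}, "S", "T"))  # False
```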

✂️ Cuts and their capacity

✂️ What a cut is

A cut V = L ∪ U: a partition of the vertex set V into two parts L and U, where S ∈ L and T ∈ U.

  • The cut separates the source from the sink.
  • Example: One possible cut puts S and some intermediate vertices in L, and T and the remaining vertices in U.

📊 Capacity of a cut

The capacity of a cut V = L ∪ U, denoted c(L, U): the sum of capacities of all edges from L to U.

  • Formally: c(L, U) = sum over x in L and y in U of c(x, y).
  • Important: only edges from L to U are counted; edges from U to L are not included.
  • Don't confuse: the direction matters—this is not the total capacity of all edges crossing the cut in either direction.
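A one-line sketch of this definition, on a hypothetical three-edge network, makes the direction rule explicit.

```python
# Sketch of cut capacity: only edges directed from L to U are summed.
# The three-edge network is hypothetical.
def cut_capacity(capacity, L, U):
    return sum(c for (x, y), c in capacity.items() if x in L and y in U)

cap = {("S", "A"): 5, ("A", "T"): 7, ("T", "A"): 3}
print(cut_capacity(cap, {"S", "A"}, {"T"}))  # 7: the U-to-L edge (T, A) is ignored
```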

🔍 Why cuts matter

  • Cuts provide a way to verify the optimality of a flow.
  • The excerpt mentions that the algorithm will find both a maximum flow and a certificate (using cuts) that proves optimality.
  • This leads to the Max Flow-Min Cut Theorem (mentioned but not detailed in this section).

Flows and Cuts

13.2 Flows and Cuts

🧭 Overview

🧠 One-sentence thesis

The value of any flow in a network is always at most the capacity of any cut, establishing a fundamental upper bound that will lead to an algorithm for finding maximum flows.

📌 Key points (3–5)

  • What a cut is: a partition of vertices into two sets L and U, where the source S is in L and the sink T is in U.
  • How cut capacity is calculated: sum only the capacities of edges going from L to U (not the reverse direction).
  • The fundamental inequality: for any flow and any cut, the flow's value ≤ the cut's capacity.
  • Common confusion: when computing cut capacity, edges from U back to L are not included—only the forward direction from L to U counts.
  • Why it matters: this theorem provides a certificate for optimality and hints that maximum flow might equal minimum cut capacity.

🔪 What is a cut?

🔪 Definition and structure

A cut V = L ∪ U is a partition of the vertex set V into two parts, where S ∈ L and T ∈ U.

  • The network's vertices are split into two groups: L (containing the source) and U (containing the sink).
  • Every vertex must be in exactly one of L or U.
  • The names L and U are chosen for reasons that will become clear later in the chapter.

📏 Cut capacity

The capacity of a cut V = L ∪ U, denoted c(L, U), is the sum of capacities of all edges from L to U.

Formula in words: c(L, U) = sum over all x in L and y in U of c(x, y).

Critical detail: Only edges going from L to U are counted. Edges from U to L are excluded from this sum.

Example: In the network from Figure 13.2, consider the cut with L₂ = {S, F, B, E} and U₂ = {A, D, C, T}.

  • The capacity is c(F, A) + c(B, A) + c(B, C) + c(E, D) = 24 + 15 + 20 + 20 = 79.
  • Notice that c(D, B) is not included because edge (D, B) goes from U₂ to L₂, the wrong direction.

⚠️ Common mistake

Don't confuse: When computing cut capacity, you might think "all edges crossing the partition" should count. But only the L→U direction matters; U→L edges are ignored.

🔗 The fundamental theorem

🔗 The inequality

Theorem 13.4: Let G = (V, E) be a network, let φ be a flow in G, and let V = L ∪ U be a cut. The value of the flow is at most as large as the capacity of the cut.

In symbols: v₀ ≤ c(L, U), where v₀ is the flow value.

What this means: No matter which flow you pick and which cut you pick, the flow value can never exceed the cut capacity.

🧮 Why the theorem is true

The proof uses flow conservation and careful bookkeeping:

  1. Start with the flow value: v₀ = sum of φ(S, y) over all y minus sum of φ(z, S) over all z. The second sum is 0 because nothing flows into the source.

  2. Add conservation equations: For every vertex x in L except S, the flow in equals the flow out, so sum of φ(x, y) minus sum of φ(z, x) = 0.

  3. Combine all vertices in L: When you sum over all x in L (including S), you get:

    • v₀ = sum over x in L of [sum of φ(x, y) - sum of φ(z, x)]
  4. Cancellation trick: If edge (a, b) has both endpoints in L, it contributes +φ(a, b) when x = a and -φ(a, b) when x = b, so it cancels out completely.

  5. What remains: Only edges crossing from L to U contribute positively, and edges from U to L contribute negatively:

    • v₀ = sum over x in L, y in U of φ(x, y) - sum over x in L, z in U of φ(z, x)
  6. Apply capacity constraint: Each φ(x, y) ≤ c(x, y), and the second sum is non-negative, so:

    • v₀ ≤ sum over x in L, y in U of c(x, y) = c(L, U)

🎯 What this gives us

  • An upper bound: Every cut provides an upper limit on how large any flow can be.
  • A certificate of optimality: If you find a flow whose value equals some cut's capacity, you know the flow is maximum (it can't be any larger).
  • A hint: The discussion at the end suggests the question: does maximum flow value always equal minimum cut capacity? (This will be explored further in the chapter.)

🔍 Worked example

🔍 First cut: L₁ and U₁

Consider the cut with L₁ = {S, F, B, E, D} and U₁ = {A, C, T}.

Edges from L₁ to U₁:

  • (F, A) with capacity 24
  • (B, A) with capacity 15
  • (B, C) with capacity 20
  • (D, C) with capacity 42

Cut capacity: c(L₁, U₁) = 24 + 15 + 20 + 42 = 101.

🔍 Second cut: L₂ and U₂

Consider the cut with L₂ = {S, F, B, E} and U₂ = {A, D, C, T}.

Edges from L₂ to U₂:

  • (F, A) with capacity 24
  • (B, A) with capacity 15
  • (B, C) with capacity 20
  • (E, D) with capacity 20

Cut capacity: c(L₂, U₂) = 24 + 15 + 20 + 20 = 79.

Key observation: Edge (D, B) exists in the network, but D is in U₂ and B is in L₂, so this edge goes from U₂ to L₂. It is not included in the capacity calculation.

📊 Comparison

| Cut | L set | U set | Capacity | Note |
| --- | --- | --- | --- | --- |
| L₁, U₁ | S, F, B, E, D | A, C, T | 101 | Includes edge (D, C) |
| L₂, U₂ | S, F, B, E | A, D, C, T | 79 | Excludes edge (D, B) going backward |

Different cuts give different upper bounds on the maximum flow value.
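The two cut capacities above can be recomputed mechanically. Only the crossing edges and capacities stated in the text are listed; the value given for (D, B) is a placeholder, since its capacity is not stated and a U-to-L edge never enters the sum anyway.

```python
# Recomputing the two worked-example cuts.  Capacities for (F, A), (B, A),
# (B, C), (D, C), and (E, D) are from the text; 10 for (D, B) is a
# placeholder, since a U-to-L edge is never counted.
cap = {("F", "A"): 24, ("B", "A"): 15, ("B", "C"): 20,
       ("D", "C"): 42, ("E", "D"): 20, ("D", "B"): 10}

def cut_capacity(cap, L, U):
    return sum(c for (x, y), c in cap.items() if x in L and y in U)

L1, U1 = {"S", "F", "B", "E", "D"}, {"A", "C", "T"}
L2, U2 = {"S", "F", "B", "E"}, {"A", "D", "C", "T"}
print(cut_capacity(cap, L1, U1))  # 101
print(cut_capacity(cap, L2, U2))  # 79: (D, B) goes from U2 to L2, so it is skipped
```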

💭 Looking ahead

💭 The debate

The discussion at the end raises a natural question inspired by earlier chapters:

  • Chapter 5 analogy: Maximum clique size ≤ chromatic number, but they are not always equal (graphs can have no large cliques but need many colors).
  • Chapter 6 analogy: Maximum antichain size = minimum chain partition size (Dilworth's theorem), and this equality also holds when "chain" and "antichain" are swapped.

The question: Does maximum flow value always equal minimum cut capacity, or is it just an inequality like the clique/coloring case?

💭 What the excerpt promises

  • The chapter will develop an efficient algorithm that both finds a maximum flow and provides a certificate verifying optimality.
  • This certificate will use the concept of cuts.
  • The hint is that the answer to the debate will be found by continuing to read.

Augmenting Paths

13.3 Augmenting Paths

🧭 Overview

🧠 One-sentence thesis

Augmenting paths provide a systematic way to increase the value of a network flow by identifying sequences of vertices where flow can be pushed forward along edges with spare capacity or pulled back from edges that are already used.

📌 Key points (3–5)

  • What an augmenting path is: a sequence of distinct vertices from source S to sink T where each step either moves forward along an edge with spare capacity or backward against a used edge.
  • How augmenting paths increase flow: by adding a positive amount ε to forward edges and subtracting ε from backward edges, the flow value increases by ε while preserving conservation laws.
  • Forward vs backward edges: forward edges have spare capacity and can carry more flow; backward edges are already used and can have their flow reduced to "redirect" capacity.
  • Common confusion: not all augmenting paths are equally efficient—some choices can require many updates (up to 2M for a four-vertex network with capacity M), while others reach maximum flow quickly.
  • Why it matters: augmenting paths are the key tool in the Ford-Fulkerson labeling algorithm for finding maximum flows, but path selection affects algorithm efficiency.

🔍 Core terminology

🔍 Flow states of edges

The excerpt defines several states for edges in a network with current flow ϕ:

| Term | Definition | Meaning |
| --- | --- | --- |
| Used | ϕ(x, y) > 0 | The edge carries some flow |
| Full | ϕ(x, y) = c(x, y) > 0 | The edge is at capacity |
| Spare capacity | ϕ(x, y) < c(x, y) | The edge can carry more flow |
| Empty | 0 = ϕ(x, y) < c(x, y) | The edge carries no flow but has capacity |

  • Edges with zero capacity are simply ignored.
  • These states determine whether an edge can participate in an augmenting path.

🛤️ What is an augmenting path

An augmenting path is a sequence P = (x₀, x₁, ..., xₘ) of distinct vertices in the network such that x₀ = S, xₘ = T, and for each i = 1, 2, ..., m, either (a) (xᵢ₋₁, xᵢ) has spare capacity or (b) (xᵢ, xᵢ₋₁) is used.

  • The path starts at the source S and ends at the sink T.
  • All vertices in the path must be distinct (no cycles).
  • The path is not necessarily a directed path—it can move "against" the direction of some edges.

➡️ Forward edges

  • When condition (a) holds: edge (xᵢ₋₁, xᵢ) has spare capacity.
  • The path moves in the same direction as the edge.
  • Flow can be increased along forward edges.
  • Example from the excerpt: In path (S, F, A, T), all edges are forward because each has spare capacity.

⬅️ Backward edges

  • When condition (b) holds: edge (xᵢ, xᵢ₋₁) is used (the directed edge goes the opposite way).
  • The path moves from xᵢ₋₁ to xᵢ, which is opposite the direction of the edge.
  • Flow can be decreased along backward edges to "free up" capacity.
  • Example from the excerpt: In path (S, E, D, C, B, A, T), edge (C, B) is backward because the actual directed edge is (B, C).
  • Don't confuse: for a backward edge, the pair (xᵢ₋₁, xᵢ) describes the direction of travel along the path; the actual directed edge in the network is (xᵢ, xᵢ₋₁).

🔧 How augmenting paths modify flow

🔧 Calculating the increment ε

The excerpt describes how to determine the positive amount ε by which to modify the flow:

Step 1: Calculate ε₁ (forward edge limit)

  • ε₁ = minimum of {c(xᵢ₋₁, xᵢ) - ϕ(xᵢ₋₁, xᵢ) : (xᵢ₋₁, xᵢ) is a forward edge of P}
  • This is the spare capacity: how much more flow each forward edge can carry.
  • ε₁ is always defined and positive because edges (x₀, x₁) and (xₘ₋₁, xₘ) are always forward edges.

Step 2: Calculate ε₂ (backward edge limit, if needed)

  • If P has no backward edges, set ε = ε₁.
  • If P has one or more backward edges:
    • ε₂ = minimum of {ϕ(xᵢ, xᵢ₋₁) : (xᵢ₋₁, xᵢ) is a backward edge of P}
    • This is the current flow on backward edges: how much flow can be reduced.
    • ε₂ > 0 because every backward edge is used.
    • Set ε = minimum of {ε₁, ε₂}.

🔧 Updating the flow

Proposition 13.7 states the update rule:

  1. Increase the flow along forward edges of P by ε.
  2. Decrease the flow along backward edges of P by ε.

Result:

  • The resulting function ϕ' is still a valid flow (satisfies conservation laws).
  • The new flow has value v + ε (increased by ε).

Example from the excerpt: For path P₁ = (S, F, A, T) with ε = 12, all edges are forward, so increase flow on (S, F), (F, A), and (A, T) by 12, resulting in a flow value increase of 12.

🔧 Why this works

  • Forward edges: adding ε uses up spare capacity without violating capacity constraints.
  • Backward edges: subtracting ε "redirects" flow, freeing up capacity elsewhere.
  • Conservation laws are preserved at intermediate vertices because flow in and flow out change by the same amount.
  • The net effect is that ε more flow reaches the sink T from the source S.
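Proposition 13.7 can be sketched as a small update routine; the network, flow, and path below are invented for illustration.

```python
# Sketch of Proposition 13.7: compute epsilon along an augmenting path and
# update the flow.  The network, flow, and path are invented for illustration.
def augment(capacity, flow, path):
    """path: list of (edge, direction); '+' marks a forward edge,
    '-' a backward edge.  Mutates flow in place and returns epsilon."""
    eps = min(capacity[e] - flow[e] if d == "+" else flow[e]
              for e, d in path)
    for e, d in path:
        flow[e] += eps if d == "+" else -eps
    return eps

cap = {("S", "A"): 10, ("S", "B"): 5, ("B", "A"): 4,
       ("A", "T"): 9, ("B", "T"): 8}
flo = {("S", "A"): 3, ("S", "B"): 3, ("B", "A"): 3,
       ("A", "T"): 6, ("B", "T"): 0}  # a valid flow of value 6
# Path S -> A -> B -> T: the step A -> B runs against the used edge (B, A).
path = [(("S", "A"), "+"), (("B", "A"), "-"), (("B", "T"), "+")]
eps = augment(cap, flo, path)
print(eps)              # 3 = min(10 - 3, 3, 8 - 0)
print(flo[("B", "A")])  # 0: the backward edge's flow was decreased
```

After the update the flow value rises from 6 to 9, and conservation still holds at A and B because each gained and lost the same amount.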

📋 Examples from the excerpt

📋 Four augmenting paths in Figure 13.2

The excerpt lists four augmenting paths with their ε values:

  1. P₁ = (S, F, A, T) with ε = 12

    • All edges are forward.
  2. P₂ = (S, B, A, T) with ε = 8

    • All edges are forward.
  3. P₃ = (S, E, D, C, B, A, T) with ε = 9

    • All edges are forward except (C, B), which is backward.
  4. P₄ = (S, B, E, D, C, A, T) with ε = 2

    • All edges are forward except (B, E) and (C, A), which are backward.
  • The excerpt notes that readers should understand why each path is an augmenting path and how ε is determined.
  • Exercise 13.7.7 asks to update the flow for each path individually.

⚠️ Caution: not all augmenting paths are equally good

⚠️ The problem with arbitrary path selection

The excerpt includes a cautionary example showing that using "just any old augmenting path" can be inefficient:

Setup:

  • A four-vertex network (S, A, B, T) with edges:
    • (S, A), (A, T), (S, B), (B, T) each with capacity M
    • (A, B) with capacity 1
  • Maximum flow value is 2M.

Inefficient approach:

  • Start with zero flow.
  • Alternate between two augmenting paths:
    1. (S, A, B, T) with ε = 1 (forward edges)
    2. (S, B, A, T) with ε = 1 ((B, A) is backward)
  • Each update increases flow by only 1.
  • Requires 2M updates to reach maximum flow.

Efficient approach:

  • Use augmenting paths (S, A, T) and (S, B, T), each with ε = M.
  • Only 2 updates needed to reach maximum flow.

⚠️ Why the difference matters

  • The inefficient paths each use three edges, while the efficient paths use only two.
  • The excerpt mentions that "Dave wanders by and mumbles something about the better augmenting paths using only two edges."
  • The number of updates should be small in terms of the number of vertices, not dependent on the size of capacities M.
  • This motivates the need for a systematic algorithm (Ford-Fulkerson) rather than arbitrary path selection.

⚠️ Key insight

  • Not all augmenting paths are created equal.
  • Path selection strategy affects algorithm efficiency dramatically.
  • The Ford-Fulkerson labeling algorithm (introduced in the next section) provides a systematic approach to choosing augmenting paths.

The Ford-Fulkerson Labeling Algorithm

13.4 The Ford-Fulkerson Labeling Algorithm

🧭 Overview

🧠 One-sentence thesis

The Ford-Fulkerson labeling algorithm systematically labels vertices to find augmenting paths and terminates when it discovers both a maximum flow and a minimum cut of equal value, proving that the maximum flow equals the minimum cut.

📌 Key points (3–5)

  • Core mechanism: the algorithm labels vertices starting from the source, scans labeled vertices in order, and either finds an augmenting path (when the sink is labeled) or discovers a minimum cut (when labeling halts without reaching the sink).
  • Why naive augmenting paths fail: choosing arbitrary augmenting paths can require a number of updates that grows with the edge capacities rather than the network size (e.g., 2M updates for a flow of value 2M), while better paths (using fewer edges) reach the maximum flow much faster.
  • Breadth-first search strategy: vertices are scanned in the order they are labeled ("first labeled, first scanned"), not in pseudo-alphabetic order, ensuring systematic exploration.
  • Common confusion: forward edges vs backward edges—forward edges (u → v) allow increasing flow if not full; backward edges (v → u) allow decreasing flow if used.
  • Optimality certificate: when labeling halts without reaching the sink, the partition into labeled (L) and unlabeled (U) vertices forms a cut whose capacity exactly equals the current flow value, proving both are optimal.

🚫 The problem with arbitrary augmenting paths

🚫 Why naive selection is inefficient

The excerpt opens with a cautionary story about Carlos and Bob:

  • Carlos repeatedly suggests augmenting paths that alternate between two choices, each increasing flow by only 1.
  • With a large parameter M, this approach requires 2M updates to reach a maximum flow of value 2M, even though the network has only four vertices.
  • Bob realizes that "using any old augmenting path is definitely not a good idea."

✅ Better paths exist

  • Starting from zero flow, only two augmenting paths are needed: (S, A, T) and (S, B, T), each with ε = M.
  • These paths quickly reach the maximum flow.
  • Dave's observation: the better paths use only two edges, while the inefficient paths each use three edges.
  • This motivates the need for a systematic algorithm that prefers shorter or more efficient augmenting paths.

🏗️ Algorithm structure and setup

🏗️ Vertex precedence and pseudo-alphabetic order

Pseudo-alphabetic order: a linear order on vertices, typically (S, T, A, B, C, D, E, F, G, ...), where the source S is first and the sink T is second.

  • This order establishes a notion of precedence for the algorithm.
  • The convention works for networks with at most 26 vertices; for larger real-world problems, computers use integer keys.

🏷️ Labeled vs unlabeled vertices

  • At the start, only the source is labeled; all other vertices are unlabeled.
  • The algorithm systematically considers unlabeled vertices and determines which should be labeled.
  • Labels are never changed: once a vertex is labeled, its label remains until the flow is updated and all labels (except the source's) are discarded.

🔄 The labeling cycle

  1. Start with only the source labeled.
  2. Scan labeled vertices to label new vertices.
  3. If the sink is labeled: an augmenting path is found; update the flow and restart with only the source labeled.
  4. If labeling halts without reaching the sink: the current flow is maximum, and the partition into labeled and unlabeled vertices is a minimum cut.

🔍 How vertices are labeled

🔍 Label structure

Each labeled vertex u receives a triple: (predecessor, sign, potential).

  • First coordinate: the vertex that caused u to be labeled (or ∅ for the source).
  • Second coordinate: + (forward edge) or − (backward edge).
  • Third coordinate: the potential p(u), a positive real number (or ∞ for the source) representing how much the flow can be increased along the path to u.

Example: The source is labeled (∅, +, ∞).

➡️ Forward edges (u → v)

When scanning from labeled vertex u, consider unlabeled neighbor v where edge e = (u, v) is directed from u to v.

  • Condition: edge e is not full, i.e., current flow φ(e) < capacity c(e).
  • Label v with: (u, +, p(v)), where p(v) = min{p(u), c(e) − φ(e)}.
  • Why this formula: the flow increase is limited by both the prior potential p(u) and the spare capacity c(e) − φ(e).
  • The potential p(v) is positive because it is the minimum of two positive numbers.

⬅️ Backward edges (v → u)

When scanning from labeled vertex u, consider unlabeled neighbor v where edge e = (v, u) is directed from v to u.

  • Condition: edge e is used, i.e., current flow φ(e) > 0.
  • Label v with: (u, −, p(v)), where p(v) = min{p(u), φ(e)}.
  • Why this formula: the flow on e can be decreased by at most φ(e) or p(u).
  • Again, p(v) is positive.

Don't confuse: forward edges allow increasing flow if not full; backward edges allow decreasing flow if used.

🔄 Breadth-first search: "first labeled, first scanned"

  • Vertices are scanned in the order they are labeled, not in pseudo-alphabetic order.
  • Example from the excerpt:
    • Scan from source S labels D, G, M (in that order).
    • Next scan from D (first labeled after S), which labels B, F, G, Q.
    • Next scan from G (labeled before B), even though B precedes G in pseudo-alphabetic order.
  • This ensures a breadth-first search for augmenting paths.

✅ Termination: two outcomes

✅ Outcome 1: Sink is labeled (augmenting path found)

When the sink T is labeled with (u, +, a):

  • The second coordinate must be + because all edges incident with T are oriented toward T.
  • Backtrack to find the augmenting path:
    • T got its label from u₁.
    • u₁ got its label from u₂, and so on.
    • Eventually reach a vertex uₘ labeled by the source S.
    • The augmenting path is P = (S, uₘ, uₘ₋₁, ..., u₁, T).
  • The flow increase δ for this path is p(T), the potential on the sink.
  • The algorithm ensures p(uₘ) ≥ p(uₘ₋₁) ≥ ... ≥ p(u₁) ≥ p(T), so δ = p(T) is valid.
  • After updating the flow, discard all labels except the source's and restart.

✅ Outcome 2: Labeling halts without reaching the sink (optimality)

If every labeled vertex has been scanned and the sink remains unlabeled:

  • Let L = set of labeled vertices, U = set of unlabeled vertices.
  • The partition V = L ∪ U is a cut.
  • Key observation:
    • Every edge e = (x, y) with x ∈ L and y ∈ U is full: φ(e) = c(e). (Otherwise, y would have been labeled.)
    • Every edge e = (y, x) with x ∈ L and y ∈ U has zero flow: φ(e) = 0. (Otherwise, y would have been labeled via a backward edge.)
  • The capacity of the cut L ∪ U equals the value of the current flow.
  • This proves the current flow is maximum and the cut is minimum, providing a certificate of optimality.
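Under these rules, the whole labeling algorithm can be sketched as follows; the five-edge network at the bottom is invented, not Figure 13.2.

```python
from collections import deque

# Sketch of the Ford-Fulkerson labeling algorithm with breadth-first
# scanning ("first labeled, first scanned").  The five-edge network
# below is invented; it is not Figure 13.2.
def max_flow(capacity, S, T):
    flow = {e: 0 for e in capacity}
    while True:
        # label[v] = (predecessor, sign, potential); the source gets (None, '+', inf)
        label = {S: (None, "+", float("inf"))}
        queue = deque([S])
        while queue and T not in label:
            u = queue.popleft()
            p_u = label[u][2]
            for (x, y), c in capacity.items():
                if x == u and y not in label and flow[(x, y)] < c:
                    label[y] = (u, "+", min(p_u, c - flow[(x, y)]))  # forward edge
                    queue.append(y)
                elif y == u and x not in label and flow[(x, y)] > 0:
                    label[x] = (u, "-", min(p_u, flow[(x, y)]))      # backward edge
                    queue.append(x)
        if T not in label:
            # Labeling halted: labeled vertices L and unlabeled U form a minimum cut.
            value = sum(flow[(x, y)] for (x, y) in capacity if x == S)
            return value, set(label)
        # Sink labeled: backtrack and push delta = p(T) along the path.
        delta = label[T][2]
        v = T
        while v != S:
            u, sign, _ = label[v]
            if sign == "+":
                flow[(u, v)] += delta
            else:
                flow[(v, u)] -= delta
            v = u

cap = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1,
       ("A", "T"): 2, ("B", "T"): 3}
value, labeled = max_flow(cap, "S", "T")
print(value)    # 5
print(labeled)  # {'S'}: both edges out of S end up full
```

On this network the final labeling reaches only S, so L = {S} and c(L, U) = 3 + 2 = 5 equals the flow value, exactly the certificate described above.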

🏆 The Max Flow–Min Cut Theorem

🏆 Statement of the theorem

Theorem 13.10 (Max Flow–Min Cut Theorem): Let G = (V, E) be a network. If v₀ is the maximum value of a flow and c₀ is the minimum capacity of a cut, then v₀ = c₀.

  • The labeling algorithm provides a constructive proof of this theorem.
  • When the algorithm halts without labeling the sink, it simultaneously produces:
    • A flow of value v₀.
    • A cut of capacity c₀ = v₀.
  • This resolves the debate mentioned earlier in the chapter: the maximum flow/minimum cut question is analogous to "antichains and partitioning into chains" (where max equals min), not to "clique number and chromatic number" (where they can differ).

📝 Concrete example walkthrough

📝 Initial labeling from the source

The excerpt begins applying the algorithm to a network (Figure 13.2, not shown in full):

  • Source S: labeled (∅, +, ∞).
  • Scan from S in pseudo-alphabetic order of neighbors.

📝 First scan results

From S, consider neighbors B, E, F:

  • B: edge (S, B) is not full → label B: (S, +, 8).
  • E: edge (S, E) is not full → label E: (S, +, 28).
  • F: edge (S, F) is not full → label F: (S, +, 15).
  • Scan from S is complete.

📝 Second scan from B

Next, scan from B (first labeled after S). Unlabeled neighbors in order: A, C, D.

  • A: label A: (B, +, 8).
  • C: label C: (B, +, 8).
  • D: label D: (B, −, 6) (note the − sign, indicating a backward edge).

📝 Third scan from E

Next, scan from E. E has no unlabeled neighbors, so move to the next labeled vertex.

Don't confuse: the scan order follows the labeling order (B, E, F, A, C, D, ...), not the pseudo-alphabetic order (A, B, C, D, E, F, ...).


A Concrete Example

13.5 A Concrete Example

🧭 Overview

🧠 One-sentence thesis

The Labeling Algorithm finds maximum flow by repeatedly discovering augmenting paths through vertex labeling and scanning until no path to the sink exists, at which point the labeled/unlabeled vertex partition forms a minimum cut that certifies optimality.

📌 Key points (3–5)

  • How the algorithm proceeds: label the source, scan labeled vertices in order, label their neighbors based on edge capacity, and backtrack from the sink to find an augmenting path.
  • What happens when the sink is labeled: an augmenting path exists; backtrack to find it, then increase flow along that path by the minimum available capacity (δ).
  • What happens when the sink is NOT labeled: the algorithm halts with a cut whose capacity equals the current flow value, proving both maximum flow and minimum cut.
  • Common confusion: the algorithm alternates between two phases—labeling/scanning to find paths, then updating flow—and repeats until no augmenting path exists.
  • Certificate of optimality: when labeled vertices L and unlabeled vertices U form a cut with capacity exactly equal to flow value, the flow is proven optimal.

🔄 The labeling and scanning process

🏷️ How vertices get labeled

  • Start by labeling the source S with a special initial label (empty first coordinate, +, infinity).
  • Scanning means examining all neighbors of a labeled vertex in pseudo-alphabetic order.
  • A neighbor qualifies for a label if:
    • It is unlabeled, AND
    • The edge to it is not full (for forward edges), OR
    • The edge from it carries positive flow (for backward edges).
  • Each label has three parts: (predecessor vertex, direction sign, available capacity).

🔍 Example: First iteration

The excerpt walks through labeling starting from source S:

  • S is labeled first: S : (∅, +, ∞)
  • Scan from S finds neighbors B, E, F (edges not full):
    • B : (S, +, 8)
    • E : (S, +, 28)
    • F : (S, +, 15)
  • Scan from B finds unlabeled neighbors A, C, D:
    • A : (B, +, 8)
    • C : (B, +, 8)
    • D : (B, −, 6) (note the minus sign indicates a backward edge)
  • Scan from E and F finds no unlabeled neighbors.
  • Scan from A finds the sink T:
    • T : (A, +, 8)

🔙 Backtracking to find the augmenting path

Once the sink T is labeled:

  • Work backwards using the first coordinate of each label.
  • T got its label from A, A from B, B from S.
  • The augmenting path is P = (S, B, A, T) with delta = 8.
  • All edges on this path are forward (indicated by + signs).
  • Increase flow on every edge of P by 8.
  • The flow value increases from 30 to 38.
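Backtracking and updating can be sketched the same way. `augment` is a hypothetical helper, not the book's code; the labels below are the ones reached in the first iteration above:

```python
# Sketch of backtracking from the sink and pushing delta units of flow along
# the augmenting path, using labels of the form (predecessor, sign, avail).

def augment(sink, labels, flow):
    delta = labels[sink][2]          # bottleneck found when the sink was labeled
    path, v = [sink], sink
    while labels[v][0] is not None:  # walk back via first coordinates
        pred, sign, _ = labels[v]
        if sign == '+':
            flow[(pred, v)] += delta   # forward edge: increase flow
        else:
            flow[(v, pred)] -= delta   # backward edge: decrease flow
        path.append(pred)
        v = pred
    return list(reversed(path)), delta

# Labels from the first iteration in the text: T from A, A from B, B from S.
labels = {'S': (None, '+', float('inf')), 'B': ('S', '+', 8),
          'A': ('B', '+', 8), 'T': ('A', '+', 8)}
flow = {('S', 'B'): 0, ('B', 'A'): 0, ('A', 'T'): 0}
path, delta = augment('T', labels, flow)
# path == ['S', 'B', 'A', 'T'], delta == 8
```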

🔁 Repeating the algorithm

🔄 Second iteration

After updating the flow (Figure 13.12 shows the new flow):

  • Restart the labeling process from scratch.
  • Now edge (S, B) is full, so B will not be labeled directly from S.
  • The new labeling sequence (reading down columns):
    • S : (∅, +, ∞)
    • E : (S, +, 28)
    • F : (S, +, 15)
    • B : (E, +, 19) (now labeled from E instead of S)
    • D : (E, +, 12)
    • A : (F, +, 12)
    • C : (B, +, 10)
    • T : (A, +, 12)
  • Backtracking gives augmenting path P = (S, F, A, T) with delta = 12.
  • Flow value increases to 50 (38 + 12).

♾️ Iteration continues

The excerpt states:

"We start the labeling process over again and repeat until we reach a stage where some vertices (including the source) are labeled and some vertices (including the sink) are unlabeled."

This is the termination condition.

🛑 How the algorithm halts

🚫 When the sink remains unlabeled

The excerpt provides a second network (Figure 13.13) with current flow value 172.

  • First labeling finds an augmenting path P = (S, C, H, I, E, L, T) with delta = 3.
  • Flow increases to 175.
  • Second labeling halts with only these vertices labeled:
    • L = {S, C, F, H, I}
  • Unlabeled vertices:
    • U = {T, A, B, D, E, G, J, K}
  • The sink T is in U, so no augmenting path exists.

✅ Certificate of optimality

When the algorithm halts with the sink unlabeled:

  • Every edge from L to U must be full: flow equals capacity.
    • Otherwise, the endpoint in U would have been labeled.
  • Every edge from U to L must carry zero flow.
    • Otherwise, the endpoint in U would have been labeled with a backward edge.
  • The capacity of the cut V = L ∪ U is:
    • Sum of capacities of all edges from L to U.
  • In the example:
    • Capacity = 41 + 8 + 23 + 8 + 13 + 29 + 28 + 25 = 175.
    • This exactly equals the current flow value.
  • Conclusion: the flow is maximum and the cut is minimum, certifying optimality.
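A minimal sketch of this certificate check on an invented toy network (the function names are assumptions): the capacity of the cut equals the net flow leaving the labeled side.

```python
# Sketch: the capacity of the cut (L, U) is the total capacity of edges from
# L to U; at termination every such edge is full and every U-to-L edge is
# empty, so the cut capacity equals the flow value.

def cut_capacity(L, capacity):
    return sum(c for (x, y), c in capacity.items() if x in L and y not in L)

def flow_across(L, flow):
    out_flow = sum(f for (x, y), f in flow.items() if x in L and y not in L)
    in_flow = sum(f for (x, y), f in flow.items() if x not in L and y in L)
    return out_flow - in_flow

capacity = {('S', 'A'): 3, ('S', 'B'): 2, ('A', 'T'): 2,
            ('B', 'T'): 3, ('A', 'B'): 1}
flow     = {('S', 'A'): 3, ('S', 'B'): 2, ('A', 'T'): 2,
            ('B', 'T'): 3, ('A', 'B'): 1}
L = {'S', 'A', 'B'}   # labeled side; U = {'T'}
# cut capacity 2 + 3 = 5 equals the net flow out of L, certifying optimality
```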

🔐 Why this proves optimality

"If L is the set of labeled vertices, and U is the set of unlabeled vertices, then every edge e = (x, y) with x in L and y in U is full, i.e., flow(e) = capacity(e)."

  • The capacity of any cut is an upper bound on the maximum flow.
  • When a cut's capacity equals the current flow, the flow cannot be increased further.
  • This provides a certificate: both a maximum flow and a minimum cut are found simultaneously.

🔢 Transition to linear programming

🧮 What is a linear programming problem

The excerpt introduces the general form:

A linear programming problem is an optimization problem of the following form: find the maximum value of a linear function c₁x₁ + c₂x₂ + c₃x₃ + ⋯ + cₙxₙ subject to m constraints C₁, C₂, …, Cₘ, where each constraint Cᵢ is a linear inequality of the form Cᵢ : aᵢ₁x₁ + aᵢ₂x₂ + aᵢ₃x₃ + ⋯ + aᵢₙxₙ ≤ bᵢ, and all coefficients and constants are real numbers.

  • Objective function: a linear combination of variables to maximize.
  • Constraints: linear inequalities that the variables must satisfy.
  • All coefficients and constants are real numbers.

🔗 Connection to network flows

The section title "Integer Solutions of Linear Programming Problems" suggests:

  • Network flow problems can be formulated as linear programs.
  • The Labeling Algorithm's termination guarantees integer solutions when capacities are integers.
  • (The excerpt cuts off before elaborating further on this connection.)

13.6 Integer Solutions of Linear Programming Problems

🧭 Overview

🧠 One-sentence thesis

Network flow problems are a special case of linear programming where integer capacities guarantee integer-valued optimal solutions, unlike general linear programming problems that may have only rational solutions even when posed with integer data.

📌 Key points (3–5)

  • What linear programming is: optimization of a linear function subject to linear constraints, widely used in engineering, science, and industry.
  • Integer vs rational solutions: LP problems with rational coefficients have rational optimal solutions (if any exist), but integer coefficients do not guarantee integer solutions.
  • Network flows as special LP problems: maximum flow problems are a special case of linear programming where integer capacities always yield integer flows.
  • Common confusion: general LP algorithms vs specialized algorithms—network problems use special-purpose algorithms (like Ford-Fulkerson) because they are more efficient in practice than general LP methods.
  • Why integer solutions matter: determining when an integer-posed LP problem has integer solutions is a subtle, difficult, and very important theme in operations research.

🔢 Linear programming fundamentals

🎯 Problem structure

A linear programming problem: find the maximum value of a linear function c₁x₁ + c₂x₂ + c₃x₃ + ⋯ + cₙxₙ subject to m constraints C₁, C₂, …, Cₘ, where each constraint Cᵢ is a linear inequality of the form aᵢ₁x₁ + aᵢ₂x₂ + aᵢ₃x₃ + ⋯ + aᵢₙxₙ ≤ bᵢ, with all coefficients and constants being real numbers.

  • The objective is to maximize (or minimize) a linear combination of variables.
  • All constraints are linear inequalities or equations.
  • Example: maximize profit given resource constraints where each product uses a linear combination of resources.
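A minimal sketch of such a problem with invented data: since a linear objective attains its maximum at a vertex of the feasible region, a tiny two-variable instance can be solved by enumerating the vertices where pairs of constraint boundaries meet.

```python
# Sketch: maximize c.x subject to A x <= b for a tiny 2-variable instance,
# by evaluating the objective at the vertices of the feasible polygon
# (an LP optimum always occurs at a vertex). The data here is made up.
from itertools import combinations

c = [3, 2]                               # maximize 3x + 2y
A = [[1, 1], [1, 0], [-1, 0], [0, -1]]   # x + y <= 4, x <= 2, x >= 0, y >= 0
b = [4, 2, 0, 0]

def feasible(p):
    return all(sum(a * v for a, v in zip(row, p)) <= bi + 1e-9
               for row, bi in zip(A, b))

vertices = []
for (r1, b1), (r2, b2) in combinations(zip(A, b), 2):
    det = r1[0] * r2[1] - r1[1] * r2[0]
    if abs(det) > 1e-9:                  # boundary lines meet in one point
        x = (b1 * r2[1] - b2 * r1[1]) / det
        y = (r1[0] * b2 - r2[0] * b1) / det
        if feasible((x, y)):
            vertices.append((x, y))

best = max(sum(ci * v for ci, v in zip(c, p)) for p in vertices)
# best == 10.0, attained at the vertex (2, 2)
```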

🏭 Importance and solvability

  • Very important class: LP problems have many applications in engineering, science, and industrial settings.
  • Relatively efficient algorithms exist: there are practical methods for finding solutions to linear programming problems.
  • The excerpt emphasizes both the broad applicability and computational tractability of LP problems.

🔀 The integer solution challenge

📐 Rational coefficients guarantee rational solutions

  • When an LP problem is posed with rational coefficients and constants, it has an optimal solution with rational values—if it has an optimal solution at all.
  • This is a theoretical guarantee: the solution type matches the input type (for rationals).

🧱 Integer coefficients do NOT guarantee integer solutions

  • A linear programming problem posed with integer coefficients and constants need not have an optimal solution with integer values.
  • This holds even when the problem has an optimal solution with rational values.
  • Don't confuse: integer inputs ≠ integer outputs in general LP problems.
  • This creates a fundamental challenge: many real-world problems require integer solutions (e.g., you cannot produce 3.7 cars), but the mathematical structure does not automatically provide them.
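A minimal concrete instance of this phenomenon (invented data): an LP with integer coefficients whose optimum is attained only at a fractional point.

```python
# Sketch: maximize x + y subject to 2x <= 1, 2y <= 1, x, y >= 0.
# The LP optimum is 1.0 at (0.5, 0.5), but no integer point comes close.
lp_optimum = 0.5 + 0.5   # attained at x = y = 1/2

# Brute-force the integer points in the feasible box: 2x <= 1 forces x = 0,
# and likewise y = 0, so the best integer-feasible value is 0.
integer_best = max(x + y
                   for x in range(0, 2) for y in range(0, 2)
                   if 2 * x <= 1 and 2 * y <= 1)
# integer_best == 0: integer inputs did not yield an integer optimum
```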

🔍 The operations research theme

  • A very important theme in operations research is to determine when a linear programming problem posed in integers has an optimal solution with integer values.
  • The excerpt emphasizes this is a subtle and often very difficult problem.
  • This is not a simple classification—it requires deep analysis of problem structure.

🌊 Network flows as a special case

🎁 Network flows are special LP problems

  • The problem of finding a maximum flow in a network is a special case of a linear programming problem.
  • This means network flow problems inherit the general LP structure but have additional properties.

✅ Integer capacities guarantee integer flows

  • A network flow problem in which all capacities are integers has a maximum flow in which the flow on every edge is an integer.
  • The Ford-Fulkerson labeling algorithm guarantees this.
  • This is a key distinction: network flows with integer capacities always have integer solutions, unlike general integer LP problems.
  • Example: if all edge capacities are whole numbers, the optimal flow will also be whole numbers on every edge—no fractional flows appear.

⚡ Specialized algorithms are more efficient

  • In general, linear programming algorithms are not used on networks.
  • Instead, special-purpose algorithms (such as Ford-Fulkerson) have proven to be more efficient in practice.
  • Why: the special structure of network problems allows algorithms tailored to that structure to outperform general-purpose LP solvers.
  • Don't confuse: just because network flows are LP problems doesn't mean you should use general LP methods—specialized algorithms exploit the network structure.

📊 Summary comparison

| Problem type | Coefficients | Optimal solution guarantee | Algorithm choice |
| --- | --- | --- | --- |
| General LP | Rational | Rational values (if a solution exists) | General LP algorithms |
| General LP | Integer | May be only rational, not integer | General LP algorithms; integer solutions are hard |
| Network flow | Integer capacities | Integer flows on all edges | Special-purpose (e.g., Ford-Fulkerson) |

🔑 Key takeaway

Network flow problems are a fortunate special case where integer inputs guarantee integer outputs, and specialized algorithms exploit this structure for efficiency—unlike the general LP case where integer solutions remain a difficult research problem.

13.7 Exercises

🧭 Overview

🧠 One-sentence thesis

These exercises apply the Ford-Fulkerson labeling algorithm and the max-flow min-cut theorem to find maximum flows, minimum cuts, and augmenting paths in various networks, reinforcing the core techniques of network flow analysis.

📌 Key points (3–5)

  • Core tasks: verifying flow values, finding augmenting paths (including those with backward edges), computing cut capacities, and running the Ford-Fulkerson algorithm to completion.
  • Proof requirement: one exercise asks to prove that flow conservation holds at vertices along an augmenting path, covering four cases based on forward/backward edge status.
  • Common confusion: distinguishing forward vs backward edges on augmenting paths and understanding how the labeling algorithm determines flow value from labels.
  • Algorithm practice: multiple exercises require starting from a given flow (or zero flow) and iterating the Ford-Fulkerson algorithm until no augmenting path exists.
  • Max-flow min-cut connection: exercises link finding maximum flows with identifying minimum-capacity cuts.

🔍 Verifying and computing flows

🔍 Checking claimed flow values (Exercise 2)

  • Alice claims a flow of value 20 in a network; Bob says no flow exceeds 18.
  • The task is to determine who is correct and explain why.
  • This requires either:
    • Finding a cut with capacity ≤ 18 (which would prove Bob right by the max-flow min-cut theorem), or
    • Constructing a valid flow of value 20 (which would prove Alice right).
  • Don't confuse: the value of a flow is not the sum of all edge flows; it is the net flow out of the source (equivalently, into the sink).

🔄 Finding augmenting paths with backward edges (Exercise 3)

  • Given a network with an existing flow ϕ (shown in Figure 13.16 with edge labels "capacity, current flow").
  • Find an augmenting path P that includes at least one backward edge.
  • Compute the bottleneck value (the maximum amount by which flow can be increased along P).
  • Update ϕ using P to obtain a new flow ϕ′ and determine its value.
  • Example: a backward edge on P means the path uses an edge in the reverse direction, reducing flow on that edge to "push back" flow and allow more flow elsewhere.

🧮 Cut capacity calculations

🧮 Computing cut capacities (Exercises 5 & 6)

The capacity of a cut (L, U) is the sum of capacities of all edges going from L to U.

  • Exercise 5: compute capacity for L = {S, F, H, C, B, G, I} and U = {A, D, E, T}.
  • Exercise 6: compute capacity for L = {S, F, D, B, A} and U = {H, C, I, G, E, T}.
  • Both refer to the network in Figure 13.16.
  • Key: only count edges directed from L to U; edges from U to L do not contribute to cut capacity.
  • These exercises reinforce the relationship between minimum cuts and maximum flows.

🔁 Running the Ford-Fulkerson algorithm

🔁 Updating flows with given augmenting paths (Exercise 7)

  • Example 13.8 provides four augmenting paths P₁, P₂, P₃, P₄ for the flow in Figure 13.2.
  • For each path separately, update the flow and produce four distinct network flows.
  • Important: do not apply the paths in sequence; each update starts from the original flow in Figure 13.2.
  • This tests understanding of how a single augmenting path modifies flow values on its edges.

🔁 Continuing the algorithm to termination (Exercise 8)

  • Start from the network flow in Figure 13.12.
  • Run the Ford-Fulkerson labeling algorithm until it halts without labeling the sink T.
  • When the algorithm halts, the current flow is maximum.
  • Find:
    • The value of the maximum flow.
    • A cut of minimum capacity (the set of labeled vertices L and unlabeled vertices U form the cut).
  • This exercise demonstrates the algorithm's stopping condition and the max-flow min-cut theorem in action.

🔁 Starting from a given flow (Exercise 9)

  • Figure 13.17 shows a network with a current flow (edges labeled "capacity, flow").
  • Use the Ford-Fulkerson algorithm starting from this flow to find:
    • A maximum flow.
    • A minimum cut.
  • The algorithm iteratively finds augmenting paths and updates the flow until no path from S to T exists in the residual network.

🔁 Starting from zero flow (Exercise 10)

  • Figure 13.18 shows a network with only capacities (no initial flow).
  • Start from the zero flow: ϕ(e) = 0 for every directed edge e.
  • Apply the Ford-Fulkerson labeling algorithm to find:
    • A maximum flow.
    • A minimum cut.
  • This is the "from scratch" scenario, building up flow incrementally.

🧩 Theoretical and inference exercises

🧩 Proving flow conservation (Exercise 4)

  • Prove Proposition 13.7: verify that flow conservation laws hold at each vertex along an augmenting path (except S and T).
  • There are four cases depending on whether the two edges incident to a vertex on the path are forward or backward:
    1. Both forward.
    2. Both backward.
    3. One forward, one backward (incoming forward, outgoing backward).
    4. One forward, one backward (incoming backward, outgoing forward).
  • For each case, show that the net flow into the vertex equals the net flow out after the update.
  • This exercise deepens understanding of how augmenting path updates preserve flow conservation.

🧩 Inferring flow value from labels (Exercise 11)

  • A network has source S with three neighbors: B, E, F.
  • Edge capacities: c(S, B) = 30, c(S, E) = 20, c(S, F) = 25.
  • An unknown flow ϕ exists on the network.
  • When the Ford-Fulkerson algorithm runs on ϕ, the first two labels are:
    • S with label (∅, +, ∞) (the source is always labeled first, with empty predecessor and infinite available capacity).
    • F with label (S, +, 15).
  • The label (S, +, 15) means:
    • F was labeled from S.
    • The edge S→F is a forward edge (+).
    • The bottleneck value is 15, meaning the residual capacity from S to F is 15.
  • Since c(S, F) = 25 and residual capacity = 25 − ϕ(S, F) = 15, we have ϕ(S, F) = 10.
  • The flow value equals the total flow out of S: ϕ(S, B) + ϕ(S, E) + ϕ(S, F).
  • Because F is labeled second (immediately after S, and before B and E, which precede F in the scanning order), scanning from S could not label B or E, so both of those edges must be full:
    • ϕ(S, B) = 30 (edge saturated).
    • ϕ(S, E) = 20 (edge saturated).
  • Hence the flow value is 30 + 20 + 10 = 60.
  • The problem asks to determine the flow value and explain the reasoning.
  • Key insight: the labeling order and label values reveal information about current flow and residual capacities.
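The residual-capacity arithmetic can be sketched directly (this presumes the line of reasoning above is what the exercise intends; the variable names are invented):

```python
# Sketch of the inference in Exercise 11: a label (S, +, 15) on F means the
# residual capacity of edge (S, F) is 15, so its flow is 25 - 15 = 10;
# since F was labeled before B and E, those two edges must be saturated.
cap = {'B': 30, 'E': 20, 'F': 25}
residual_SF = 15
flow_SF = cap['F'] - residual_SF             # 25 - 15 = 10
flow_value = cap['B'] + cap['E'] + flow_SF   # saturated edges contribute fully
# flow_value == 60
```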

| Exercise | Task | Key concept tested |
| --- | --- | --- |
| 2 | Verify claimed flow value | Max-flow min-cut theorem |
| 3 | Find augmenting path with backward edge | Residual network, backward edges |
| 4 | Prove flow conservation along path | Flow update mechanics |
| 5, 6 | Compute cut capacity | Cut definition, capacity calculation |
| 7 | Update flow with given paths | Single augmenting path update |
| 8, 9, 10 | Run Ford-Fulkerson to completion | Full algorithm execution |
| 11 | Infer flow from labels | Label interpretation, residual capacity |

14.1 Introduction

🧭 Overview

🧠 One-sentence thesis

When network flow problems have integer capacities, there always exists a maximum flow in which every edge carries an integer amount of flow, enabling combinatorial interpretations of flows as "selected" edges.

📌 Key points (3–5)

  • Integer flow theorem: In networks with integer capacities, at least one maximum flow assigns integer flow to every edge.
  • Special case—capacity 1: When all capacities are 1, maximum flows use only 0 or 1 on each edge, giving a combinatorial "take this edge or not" interpretation.
  • Common confusion: The theorem does not guarantee that every maximum flow is integer—only that at least one integer maximum flow exists.
  • Why it matters: Integer flows allow network flow algorithms to solve discrete combinatorial problems like matchings and chain partitions.
  • Broader context: Network flows are a special case of linear programming; integer constraints guarantee rational solutions, but flows achieve the stronger property of integer solutions.

🔢 The integer flow theorem

🔢 What the theorem guarantees

Theorem 14.1: In a network flow problem in which every edge has integer capacity, there is a maximum flow in which every edge carries an integer amount of flow.

  • Not obvious at first: You might worry that a maximum flow could require fractional values like 23/3 or even irrational values like √21.
  • What the theorem rules out: If capacities are integers, we can always find a maximum flow that is also all integers.
  • Important limitation: The theorem does not say that all maximum flows are integer—only that at least one integer maximum flow exists.

🧮 Why not worse than rational?

  • Network flow problems belong to a larger class called linear programming problems.
  • A major theorem in linear programming: if all constraints (capacities) are integers, the solution must be a rational number.
  • Network flows achieve something stronger: not just rational, but fully integer solutions are possible.

🎯 The capacity-1 special case

🎯 Binary flows

  • When every edge has capacity 1, the integer flow theorem means every edge in a maximum flow carries either 0 or 1.
  • This creates a simple binary choice: "use this edge" (flow = 1) or "don't use this edge" (flow = 0).

🧩 Combinatorial interpretation

  • Flow as selection: Edges with flow 1 can be interpreted as edges we "take" or "select" in some combinatorial structure.
  • Why this matters: This binary interpretation allows network flow algorithms to solve discrete problems like:
    • Finding maximum matchings in bipartite graphs.
    • Finding the width of a poset and a minimal chain partition.
  • Example: If a flow of 1 means "assign this worker to this job," then a maximum flow finds the largest possible assignment.

🔍 Don't confuse

  • Not about the flow value: The theorem is not about the total flow through the network being integer; it's about the flow on each individual edge being integer.
  • Not all maximum flows: There may be other maximum flows with fractional values on edges; the theorem only promises that at least one integer maximum flow exists.

🌉 Bridge to combinatorial problems

🌉 Why this chapter uses capacity-1 networks

  • The chapter focuses on restricted network flows where every edge has capacity 1.
  • This restriction unlocks the binary (0 or 1) flow property, which maps naturally to combinatorial "yes/no" decisions.

🎯 Two main applications

| Problem | What it solves | How capacity-1 flows help |
| --- | --- | --- |
| Maximum matchings in bipartite graphs | Pair vertices from two sets with no vertex paired twice | Flow = 1 means "include this pairing" |
| Width of a poset and minimal chain partition | Find the largest antichain and partition into chains | Flow = 1 means "include this element in a chain" |

  • Both problems are combinatorial: they involve selecting discrete structures (edges, chains) rather than continuous quantities.
  • The integer flow theorem ensures that network flow algorithms, which might otherwise produce fractional answers, will yield combinatorial solutions.

14.2 Matchings in Bipartite Graphs

🧭 Overview

🧠 One-sentence thesis

Network flow algorithms can efficiently find maximum matchings in bipartite graphs and determine when a matching that saturates all vertices of one set is possible.

📌 Key points (3–5)

  • What a matching is: a set of edges where no two edges share an endpoint, allowing us to pair vertices from two distinct sets without conflicts.
  • The maximum matching problem: finding the largest possible matching, such as assigning the most workers to jobs they're qualified for.
  • Network flow method: convert a bipartite graph into a network with capacity-1 edges, then use flow algorithms to find maximum matchings.
  • Hall's Theorem: provides the exact condition for when a matching saturating all vertices of one set exists—every subset A must have at least as many neighbors as elements in A.
  • Common confusion: just because some workers are idle doesn't mean a better assignment exists; the structure of the graph may prevent full saturation.

🎨 Bipartite graphs and matchings

🎨 What is a bipartite graph

A bipartite graph G = (V, E) is one in which the vertices can be properly colored using only two colors.

  • This coloring partitions V into two independent sets V₁ and V₂.
  • All edges run between V₁ and V₂ (no edges within V₁ or within V₂).
  • Useful when modeling relationships between two distinct types of objects.

👷 Workers and jobs example

  • V₁ represents workers; V₂ represents jobs.
  • An edge connects worker w to job j if and only if w is qualified to do j.
  • This models real-world assignment problems where qualifications matter.

🔗 What is a matching

A matching M ⊆ E is a set of edges where no two edges share an endpoint.

  • If vertex v is the endpoint of an edge in M, we say M saturates v.
  • In bipartite graphs, a matching pairs vertices from V₁ with vertices from V₂ so that no vertex is paired with more than one other vertex.
  • Example: assigning workers to jobs such that each worker gets at most one job and each job gets at most one worker.

🏆 Maximum matching

  • A maximum matching contains the largest number of edges possible.
  • In the workers-and-jobs scenario, this means:
    • Each worker is assigned to a job for which they are qualified (an edge exists).
    • Each worker is assigned to at most one job.
    • Each job is assigned at most one worker.
  • Don't confuse: a matching that leaves some workers idle may still be maximum if the graph structure prevents better assignments.

🌊 Network flow approach to maximum matchings

🌊 Constructing the network

To find a maximum matching using network flows:

  • Start with bipartite graph G with sets V₁ and V₂.
  • Add a source S and a sink T.
  • Add an edge from S to each vertex in V₁.
  • Add an edge from each vertex in V₂ to T.
  • Orient all edges between V₁ and V₂ from V₁ to V₂.
  • Give every edge capacity 1.

🔄 Correspondence between matchings and flows

Matching to flow:

  • Place one unit of flow on each edge in the matching M.
  • Place one unit of flow on edges from S to vertices of V₁ saturated by M.
  • Place one unit of flow from vertices of V₂ saturated by M to T.
  • Conservation laws are satisfied because each saturated vertex has exactly one matching edge.

Flow to matching:

  • The full edges (carrying flow 1) from V₁ to V₂ in an integer-valued flow form a matching.
  • Simply extract these edges to get the matching.

⚙️ Finding maximum matchings

  • Run the labeling algorithm (Ford-Fulkerson) on the network to find a maximum flow.
  • The maximum flow corresponds to a maximum matching.
  • Example from the excerpt: starting with a 4-edge matching, the algorithm finds an augmenting path and improves it to a 5-edge matching.
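Since every capacity is 1, Ford-Fulkerson on this network reduces to the classical augmenting-path matching algorithm; below is a sketch with invented worker/job data (the names `max_matching` and `try_augment` are assumptions):

```python
# Sketch: maximum bipartite matching by repeated augmenting-path search.
# With all capacities 1, Ford-Fulkerson on the matching network reduces to
# this: each successful search re-routes earlier assignments if needed.

def max_matching(graph, jobs):
    match = {j: None for j in jobs}          # job -> assigned worker

    def try_augment(w, seen):
        for j in graph[w]:
            if j not in seen:
                seen.add(j)
                # take job j if it is free, or if its current worker
                # can be re-routed to another job
                if match[j] is None or try_augment(match[j], seen):
                    match[j] = w
                    return True
        return False

    size = sum(try_augment(w, set()) for w in graph)
    return size, match

graph = {'w1': ['j1', 'j2'], 'w2': ['j1'], 'w3': ['j2']}
size, match = max_matching(graph, ['j1', 'j2'])
# size == 2: one of the three workers must remain idle
```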

🚫 When full saturation is impossible

🚫 Recognizing impossibility

When the labeling algorithm halts without labeling the sink:

  • Some vertices in V₁ remain labeled, some in V₂ remain unlabeled.
  • The labeled vertices from V₁ can only connect to the labeled vertices from V₂.
  • If there are more labeled vertices in V₁ than in V₂, at least one vertex in V₁ must remain unsaturated.

🧩 Example from the excerpt

  • Three vertices (x₃, x₄, x₅) from V₁ were labeled.
  • Only two vertices (y₄, y₅) from V₂ were labeled.
  • y₄ and y₅ are the only neighbors of x₃, x₄, or x₅ in G.
  • No matter how the matching edges are chosen from {x₃, x₄, x₅}, one vertex will be left unsaturated.
  • Conclusion: one worker must go without a job assignment.

🎯 Hall's Theorem

Hall's Theorem: Let G = (V, E) be a bipartite graph with V = V₁ ∪ V₂. There is a matching which saturates all vertices of V₁ if and only if for every subset A ⊆ V₁, the set N of neighbors of the vertices in A satisfies |N| ≥ |A|.

What it means:

  • For every subset A of V₁, the number of neighbors must be at least as large as the size of A.
  • If any subset A has fewer neighbors than elements, no matching can saturate all of V₁.
  • This provides both a necessary and sufficient condition.

Why it matters:

  • Gives a precise criterion to determine when full saturation is possible.
  • Explains to "your boss" why no better assignment exists: the structure of the graph (qualifications) prevents it.
  • Don't confuse: this is about all subsets, not just individual vertices; even if every single vertex has neighbors, a group of vertices might share too few neighbors collectively.
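Hall's condition can be checked by brute force over subsets (exponential, so illustrative only); the invented graph below mirrors the excerpt's situation where x₃, x₄, x₅ share only two neighbors:

```python
# Sketch: check Hall's condition |N(A)| >= |A| for every subset A of V1.
from itertools import combinations

def halls_condition(neighbors):
    v1 = list(neighbors)
    for k in range(1, len(v1) + 1):
        for subset in combinations(v1, k):
            joint = set().union(*(neighbors[x] for x in subset))
            if len(joint) < len(subset):
                return False, subset   # this subset violates the condition
    return True, None

neighbors = {'x1': {'y1', 'y2'}, 'x2': {'y1', 'y3'},
             'x3': {'y4', 'y5'}, 'x4': {'y4', 'y5'}, 'x5': {'y4', 'y5'}}
ok, witness = halls_condition(neighbors)
# ok is False: {x3, x4, x5} has only two neighbors, so no matching saturates V1
```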

🔍 Chain partitioning in posets

🔍 The chain partitioning problem

  • Recall: Dilworth's Theorem states that a poset P of width w can be partitioned into w chains (but no fewer).
  • Previously, an algorithm existed only for interval orders.
  • Network flows now provide an efficient algorithm for all posets.

🏗️ Constructing the network for posets

Given poset P with points {x₁, x₂, ..., xₙ}:

  • Create source S and sink T.
  • For each point xᵢ, create two vertices: x′ᵢ and x″ᵢ.
  • Add edges from S to each x′ᵢ (capacity 1).
  • Add edges from each x″ᵢ to T (capacity 1).
  • Add a directed edge from x′ᵢ to x″ⱼ if and only if xᵢ < xⱼ in P.
  • All edges have capacity 1.

🔗 Flow to chain partition

Building chains from flow:

  • If there is flow on edge (x′ᵢ, x″ⱼ), place xᵢ and xⱼ in the same chain.
  • Follow the flow: if x′ⱼ has outgoing flow to x″ₖ, add xₖ to the chain (since xᵢ < xⱼ < xₖ).
  • Continue until reaching a vertex with no outgoing flow.
  • Then check if x″ᵢ has incoming flow; if so, add those elements below xᵢ to the chain.
  • Repeat for all unassigned points.

Example from the excerpt:

  • Flow on (x′₁, x″₃) places x₁ and x₃ in chain C₁.
  • x″₂ has flow into x′₁, so add x₂; x″₇ has flow into x′₂, so add x₇.
  • Result: C₁ = {x₁, x₂, x₃, x₇}.

🎯 Finding the maximum antichain

Using labeled/unlabeled vertices:

  • After the final run of the labeling algorithm, track which vertices are labeled.
  • For each chain C = {x₁ < x₂ < ... < xₖ}:
    • The minimal element x₁ has x″₁ unlabeled (no flow into it, and T is unlabeled).
    • The maximal element xₖ has x′ₖ labeled (no flow out of it).
  • Along the sequence x′ₖ, x″ₖ, x′ₖ₋₁, x″ₖ₋₁, ..., x′₁, x″₁, there is a switch from labeled to unlabeled.
  • This switch must occur with x′ᵢ labeled and x″ᵢ unlabeled.
  • Collect one such element from each chain to form antichain A.

Why A is an antichain:

  • If yᵢ < yⱼ, then (y′ᵢ, y″ⱼ) is an edge in the network.
  • Scanning from y′ᵢ would label y″ⱼ, contradicting the construction.
  • Therefore, no two elements in A are comparable.

Example: maximum antichain is {x₁, x₅, x₈} with three elements, matching the three chains.


14.3 Chain Partitioning

🧭 Overview

🧠 One-sentence thesis

Network flows enable an efficient algorithm to find a minimum chain partition for any poset by constructing a special network where maximum flow corresponds to the minimum number of chains needed.

📌 Key points (3–5)

  • What the algorithm does: uses network flows to partition any poset into the minimum number of chains, extending beyond the interval-order special case from Chapter 6.
  • How the network is built: for each poset element xᵢ, create two vertices x′ᵢ and x′′ᵢ; add a directed edge from x′ᵢ to x′′ⱼ if and only if xᵢ < xⱼ in the poset.
  • How to extract the chain partition: follow flow edges from x′ᵢ to x′′ⱼ to build chains, tracing both forward (greater elements) and backward (smaller elements).
  • How to find the maximum antichain: identify the transition points where x′ᵢ is labeled but x′′ᵢ is unlabeled in the final labeling run; these elements form a maximum antichain.
  • Common confusion: the network has two vertices per poset element (single-prime and double-prime), not one; edges between them encode the poset order relation.

🏗️ Network construction from poset

🏗️ Vertices and basic edges

The network is built from a poset P with points {x₁, x₂, …, xₙ}:

  • Add a source S and sink T.
  • For each poset element xᵢ, create two vertices: x′ᵢ and x′′ᵢ.
  • Add edges S → x′ᵢ for all i (capacity 1).
  • Add edges x′′ᵢ → T for all i (capacity 1).
  • All edges have capacity 1.

🔗 Encoding the poset order

An edge is directed from x′ᵢ to x′′ⱼ if and only if xᵢ < xⱼ in P.

  • This is the key step: the poset's order relation becomes the "middle layer" of edges.
  • Example: if x₁ < x₃ in the poset, the network includes edge (x′₁, x′′₃).
  • If x₄ is less than x₃, x₅, and x₉, then three directed edges leave x′₄.
  • If x₉ is maximal in the poset, no directed edges leave x′₉.

Don't confuse: the edge goes from the prime version of the smaller element to the double-prime version of the larger element, not between the same vertex types.

🔄 Finding maximum flow

🔄 Running the labeling algorithm

The excerpt demonstrates the Ford-Fulkerson labeling algorithm:

  1. Start with an initial flow (may be zero or an eyeballed flow).
  2. Run the labeling algorithm with a priority order (e.g., S, T, x′₁, …, x′₁₀, x′′₁, …, x′′₁₀).
  3. If an augmenting path is found, update the flow and repeat.
  4. When the sink T remains unlabeled, the flow is maximum.

📊 Example trace

In the running example (a 10-element poset):

  • An initial flow is shown in Figure 14.9.
  • The labeling algorithm finds an augmenting path (S, x′₆, x′′₄, x′₇, x′′₂, T).
  • After augmentation (Figure 14.10), a second run leaves T unlabeled, confirming maximum flow.
  • The final labeled vertices are: S, x′₃, x′₅, x′₉, x′′₃, x′′₉, x′₁, x′₈.

🔗 Extracting the chain partition

🔗 Following flow edges to build chains

If there is a unit of flow on edge (x′ᵢ, x′′ⱼ), place xᵢ and xⱼ in the same chain.

The process:

  1. Start with an unused vertex x′ᵢ.
  2. If (x′ᵢ, x′′ⱼ) is full, add xⱼ to the chain (since xᵢ < xⱼ).
  3. Look forward: check if x′ⱼ has outgoing flow to x′′ₖ; if so, add xₖ (since xᵢ < xⱼ < xₖ). Repeat until no outgoing flow.
  4. Look backward: check if x′′ᵢ has incoming flow from x′ₘ; if so, add xₘ (since xₘ < xᵢ < xⱼ). Repeat until no incoming flow.
  5. Move to the next unused vertex and repeat.
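The middle layer of this capacity-1 network is in effect a bipartite matching on the comparability pairs, so the whole construction can be sketched as an augmenting-path matching followed by the chain-following step above. The 4-element poset and function names below are invented for illustration:

```python
# Sketch: minimum chain partition via the split-vertex construction.
# A matched pair (i, j) with x_i < x_j plays the role of a full edge
# (x'_i, x''_j); following matched edges strings elements into chains.

def chain_partition(elements, less_than):
    succ = {i: [j for j in elements if (i, j) in less_than] for i in elements}
    match = {j: None for j in elements}      # j -> its chain predecessor

    def try_augment(i, seen):
        for j in succ[i]:
            if j not in seen:
                seen.add(j)
                if match[j] is None or try_augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    for i in elements:                       # maximize the middle-layer matching
        try_augment(i, set())

    nxt = {i: j for j, i in match.items() if i is not None}  # chain successor
    starts = [i for i in elements if match[i] is None]       # chain minima
    chains = []
    for s in starts:
        chain, i = [s], s
        while i in nxt:                      # follow the flow forward
            i = nxt[i]
            chain.append(i)
        chains.append(chain)
    return chains

# Invented poset: 1 < 3, 1 < 4, 2 < 3, 2 < 4 (width 2).
chains = chain_partition([1, 2, 3, 4], {(1, 3), (1, 4), (2, 3), (2, 4)})
# two chains covering all four elements, matching the width
```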

🧩 Example chain construction

From the flow in Figure 14.10:

  • Chain C₁: Start at x′₁ → (x′₁, x′′₃) full → add x₁, x₃. No flow out of x′₃. Flow into x′′₁ from x′₂ → add x₂. Flow into x′′₂ from x′₇ → add x₇. No flow into x′′₇. Result: C₁ = {x₁, x₂, x₃, x₇}.
  • Chain C₂: Start at x′₄ → (x′₄, x′′₅) full → add x₄, x₅. No flow out of x′₅. Flow into x′′₄ from x′₆ → add x₆. No flow into x′′₆. Result: C₂ = {x₄, x₅, x₆}.
  • Chain C₃: Start at x′₈ → (x′₈, x′′₉) full → add x₈, x₉. No flow out of x′₉. Flow into x′′₈ from x′₁₀ → add x₁₀. Result: C₃ = {x₈, x₉, x₁₀}.

Why this works: the flow structure ensures that elements added to the same chain are comparable (form a chain in the poset).

🎯 Finding the maximum antichain

🎯 Using labeled/unlabeled vertices

To prove the chain partition is minimum, find an antichain with as many elements as there are chains.

Key observation: For each chain C = {x₁ < x₂ < … < xₖ}:

  • The minimal element x₁ has no flow into x′′₁, so x′′₁ is unlabeled (since T is unlabeled).
  • The maximal element xₖ has no flow out of x′ₖ, so x′ₖ is labeled.
  • Along the sequence x′ₖ, x′′ₖ, x′ₖ₋₁, x′′ₖ₋₁, …, x′₂, x′′₂, x′₁, x′′₁, there must be a transition from labeled to unlabeled.

🔍 Transition point logic

The transition must occur at some xᵢ where:

  • x′ᵢ is labeled.
  • x′′ᵢ is unlabeled.

Why not the reverse? Suppose x′ᵢ and x′′ᵢ are both unlabeled while x′ᵢ₊₁ and x′′ᵢ₊₁ are both labeled. Since xᵢ and xᵢ₊₁ are consecutive in C, there is flow on (x′ᵢ, x′′ᵢ₊₁). When scanning from x′′ᵢ₊₁, the vertex x′ᵢ would be labeled—contradiction.

🧩 Constructing the antichain

For each chain, take the first element yᵢ where y′ᵢ is labeled and y′′ᵢ is unlabeled. Form A = {y₁, …, yᵥ}.

Why A is an antichain: If yᵢ < yⱼ, then (y′ᵢ, y′′ⱼ) is an edge in the network. Therefore, scanning from y′ᵢ would label y′′ⱼ—but y′′ⱼ is unlabeled by construction. So no two elements in A are comparable.

🧩 Example antichain

In the running example with three chains:

  • The maximum antichain is {x₁, x₅, x₈}.
  • This has three elements, matching the three chains, confirming the partition is minimum.
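
The selection rule can be sketched directly, using the chain partition and the labeled vertices reported for the final labeling pass (S, x′₃, x′₅, x′₉, x′′₃, x′′₉, x′₁, x′₈):

```python
# Chain partition (bottom-to-top) and labeled vertices from the text's
# final, failed labeling pass.
chains = [[7, 2, 1, 3], [6, 4, 5], [10, 8, 9]]
labeled_prime = {1, 3, 5, 8, 9}     # elements whose x'_i is labeled
labeled_dprime = {3, 9}             # elements whose x''_i is labeled

def max_antichain(chains, labeled_prime, labeled_dprime):
    """Pick from each chain a transition element y: y' labeled, y'' unlabeled.
    The argument above shows such an element exists in every chain."""
    antichain = []
    for chain in chains:
        for y in chain:
            if y in labeled_prime and y not in labeled_dprime:
                antichain.append(y)
                break
    return antichain

antichain = max_antichain(chains, labeled_prime, labeled_dprime)
```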

🔗 Connection to Dilworth's Theorem

🔗 Extending earlier results

  • Chapter 6 discussed Dilworth's Theorem: any poset of width w can be partitioned into w chains (but no fewer).
  • Previously, an algorithm was only available for the special case of interval orders.
  • This section's contribution: through network flows, an efficient algorithm now works for all posets.

Don't confuse: the width w equals both the size of a maximum antichain and the minimum number of chains needed; the network flow algorithm finds both simultaneously.


14.4 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set applies network flow techniques to solve matching problems, poset chain-partition problems, and width-finding problems by constructing appropriate flow networks and interpreting the results.

📌 Key points (3–5)

  • Matching problems: Use network flows to find maximum matchings between two vertex sets and determine when all vertices in one set can be saturated.
  • Poset network representation: Posets are encoded as flow networks where each element x becomes two vertices (x′ and x′′) connected by edges representing the order relation.
  • Chain partition from flow: A maximum flow in the poset network yields a minimum chain partition, where full edges indicate elements in the same chain.
  • Width via antichain: The width of a poset equals the size of a maximum antichain, found by identifying vertices where x′ is labeled but x′′ is unlabeled during flow construction.
  • Common confusion: Don't confuse the network representation (with split vertices x′ and x′′) with the original poset structure—the network is a tool for finding chains and antichains, not the poset itself.

🔗 Matching problems

🔗 Maximum matching tasks

The exercises ask you to find maximum matchings from vertex set V₁ to vertex set V₂ in bipartite graphs.

What you need to do:

  • Apply network flow techniques (from earlier in the chapter) to construct a flow network
  • Find the maximum flow, which corresponds to the maximum matching
  • If the matching does not saturate all vertices in V₁, explain why using the structure of the graph
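
A common way to carry out these exercises without drawing the full network is the augmenting-path method specialized to unit capacities. A sketch, with hypothetical preference lists (not the actual Exercise 3 data):

```python
def bipartite_matching(adj):
    """Maximum matching from V1 to V2 by augmenting paths -- the
    network-flow construction specialized to unit capacities."""
    match = {}                          # V2 vertex -> matched V1 vertex

    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # v is free, or its current partner can be re-matched
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in adj:
        augment(u, set())
    return match

# Hypothetical preference lists (NOT the text's Exercise 3 data).
prefs = {
    "Alice":   ["posets", "induction"],
    "Bob":     ["posets"],
    "Carlos":  ["graph algorithms", "generating functions"],
    "Dave":    ["induction", "graph theory"],
    "Yolanda": ["graph theory", "generating functions"],
}
assignment = bipartite_matching(prefs)  # topic -> student
```

If the returned matching is smaller than |V₁|, Hall's condition fails somewhere, and the unmatched students' combined preference lists witness why.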

🎓 Student-project assignment (Exercise 3)

A concrete matching scenario:

  • Five students: Alice, Bob, Carlos, Dave, Yolanda
  • Five topics: graph algorithms, posets, induction, graph theory, generating functions
  • Each student lists acceptable topics
  • Goal: assign each student a different topic from their list

Constraints:

  • No two students can work on the same topic
  • Each student must receive a topic from their preference list

If impossible: Find an assignment maximizing the number of satisfied students and explain why all cannot be satisfied.

🏈 College recruitment (Exercise 4)

Another matching application:

  • Seven colleges competing for six football players
  • Each school can sign only one player
  • Each player can commit to only one school
  • Table shows which players are interested in and admitted to each school

Task: Maximize the number of schools that successfully sign a player.

📊 Poset network interpretation

📊 Reading the network without drawing the poset (Exercise 5)

Given a network diagram corresponding to a poset P, you can answer questions about the poset structure directly:

| Question type | How to answer from the network |
| --- | --- |
| Which elements are greater than xᵢ? | Look at which x′′ vertices are reachable from x′ᵢ via directed paths |
| Which elements are less than xⱼ? | Look at which x′ vertices can reach x′′ⱼ via directed paths |
| Which elements are comparable with xₖ? | Find all elements either greater than or less than xₖ |
| Maximal elements | Elements whose x′ vertices have no outgoing edges to any x′′ vertex |
| Minimal elements | Elements whose x′′ vertices have no incoming edges from any x′ vertex |

🎨 Network-to-poset conversion (Exercise 6)

Task: Draw the actual poset diagram from the network representation.

Key principle:

The network uses split vertices (x′ and x′′ for each element x) and directed edges to encode the order relation; the poset diagram shows elements as single nodes with covering relations.

🔍 Finding width and chain partitions

🔍 Width-finding process (Exercises 7, 8, 9)

The width w of a poset is the size of a maximum antichain, which equals the minimum number of chains needed to partition the poset.

Three-part task:

  1. Find the width w
  2. Find an antichain of size w
  3. Find a partition into w chains

🧩 Chain partition from flow

How to build chains from the network flow:

  • Start with an unprocessed vertex xᵢ
  • If edge [x′ᵢ, x′′ⱼ] is full (carries flow of 1), place xᵢ and xⱼ in the same chain
  • From x′ⱼ, look for outgoing full edges to continue building the chain upward
  • From x′′ᵢ, look for incoming flow to add smaller elements to the chain
  • Repeat until no more elements can be added to the current chain
  • Start a new chain with the next unprocessed vertex

Example from the text:

  • Chain C₁ = {x₁, x₂, x₃, x₇} built by following full edges and flow directions
  • Chain C₂ = {x₄, x₅, x₆}
  • Chain C₃ = {x₈, x₉, x₁₀}

🎯 Antichain from labeled vertices

How to find the maximum antichain:

  • Track which vertices are labeled during the flow-finding process
  • For each chain C = {x₁ < x₂ < ... < xₖ}:
    • The minimal element x₁ has x′′₁ unlabeled (no flow in)
    • The maximal element xₖ has x′ₖ labeled (no flow out)
    • Find the transition point where x′ᵢ is labeled but x′′ᵢ is unlabeled
  • Collect one such element from each chain to form antichain A

Why this works:

  • If yᵢ < yⱼ in the poset, then edge [y′ᵢ, y′′ⱼ] exists in the network
  • Scanning from y′ᵢ would label y′′ⱼ
  • So if y′ᵢ is labeled but y′′ⱼ is unlabeled, then yᵢ and yⱼ cannot be comparable
  • Therefore A is indeed an antichain

Example from the text: Maximum antichain = {x₁, x₅, x₈} for the three-chain partition.

🔧 Exercise variations

🔧 Different input formats

The exercises present poset problems in multiple ways:

| Format | What you're given | What to do |
| --- | --- | --- |
| Network with flow | Network diagram with bold edges showing current flow | Extract chain partition and antichain directly |
| Network without flow | Network diagram, no flow marked | Find maximum flow first, then proceed |
| Poset diagram | Traditional Hasse diagram | Construct the network, find flow, determine width |

🔧 Poset-and-network combined (Exercise 8)

Some exercises provide both:

  • The poset diagram
  • The corresponding network with a flow already marked

Task: Use the network to find width, chain partition, and antichain—verify your understanding by checking consistency with the poset diagram.

Don't confuse: The network is a computational tool; the poset is the mathematical object you're analyzing. The network's structure (split vertices, directed edges) is designed to make flow algorithms work, not to directly represent the poset's order relation.

15.1 Coloring the Vertices of a Square

🧭 Overview

🧠 One-sentence thesis

When symmetries (rotations and flips) make certain vertex colorings of a square equivalent, the number of truly distinct colorings can be found by counting how many colorings each transformation leaves unchanged and dividing the total by the number of transformations.

📌 Key points (3–5)

  • The equivalence problem: fixing a square in the plane gives 16 colorings with two colors, but rotations and flips make many of them equivalent (e.g., rotating C₂ clockwise 90° yields C₃).
  • Transformations as permutations: the square has eight transformations—identity, three rotations (90°, 180°, 270° clockwise), and four flips (vertical, horizontal, positive-slope diagonal, negative-slope diagonal)—that rearrange vertices.
  • Fixed colorings: each transformation leaves certain colorings unchanged; for example, the 90° rotation fixes only the all-white and all-gold colorings because any mixed coloring would move a vertex of one color to a position of another color.
  • Common confusion: don't confuse "which colorings a transformation fixes" with "which colorings a transformation changes into each other"—the fixed-coloring count is what matters for the enumeration formula.
  • The counting connection: summing the number of colorings fixed by all transformations and dividing by the number of transformations gives the number of nonequivalent colorings (equivalence classes).

🎨 The coloring problem and equivalence

🎨 Fixed vs. rotatable squares

  • If the square's position is fixed in the plane, there are 2⁴ = 16 different two-color (white and gold) vertex colorings.
  • If the square is a metal frame with beads that can be rotated and flipped, many colorings become equivalent.
    • Example: flipping C₇ about the vertical axis yields C₉; rotating C₂ clockwise 90° yields C₃.
  • The goal is to count how many colorings are not equivalent under these transformations.

🔄 What equivalence means

  • Two colorings are equivalent if one can be transformed into the other by rotation or flipping.
  • This equivalence relation partitions the 16 colorings into equivalence classes.
  • In the example, the 16 colorings split into six equivalence classes:
    • Two classes with one coloring each (all white, all gold).
    • One class with two colorings (C₁₀ and C₁₁).
    • Three classes with four colorings each (one gold vertex; one white vertex; two of each color).

🔀 Transformations of the square

🔀 Labeling vertices

  • Label the vertices: upper-left = 1, upper-right = 2, lower-right = 3, lower-left = 4.
  • Each transformation rearranges these positions.

↻ Rotations

| Transformation | Description | Effect on vertices |
| --- | --- | --- |
| r₁ | Clockwise 90° | 1→2, 2→3, 3→4, 4→1 |
| r₂ | Clockwise 180° | 1→3, 2→4, 3→1, 4→2 |
| r₃ | Clockwise 270° | 1→4, 2→1, 3→2, 4→3 |
  • r₂ can be achieved by doing r₁ twice; r₃ by doing r₁ three times.
  • Counterclockwise rotations are unnecessary: each one has the same effect as some clockwise rotation (counterclockwise 90° is clockwise 270°, for instance).

↔ Flips

| Transformation | Axis | Effect on vertices |
| --- | --- | --- |
| v | Vertical | 1→2, 2→1, 3→4, 4→3 |
| h | Horizontal | 1→4, 2→3, 3→2, 4→1 |
| p | Positive-slope diagonal | 1→3, 2→2, 3→1, 4→4 |
| n | Negative-slope diagonal | 1→1, 2→4, 3→3, 4→2 |

🆔 Identity transformation

Identity transformation (ε): the transformation that does nothing to the square.

  • ε(1) = 1, ε(2) = 2, ε(3) = 3, ε(4) = 4.
  • Every coloring is unchanged by ε.

🔒 Fixed colorings under transformations

🔒 What "fixed" means

  • A coloring is fixed by a transformation if applying that transformation leaves the coloring unchanged.
  • Example: r₁ (90° rotation) moves vertices cyclically, so only C₁ (all white) and C₁₆ (all gold) remain unchanged—any coloring with more than one color would have a vertex of one color moved to a position of the other color.

📊 Fixed colorings table

| Transformation | Number fixed | Which colorings |
| --- | --- | --- |
| ε (identity) | 16 | All 16 |
| r₁ (90°) | 2 | C₁, C₁₆ |
| r₂ (180°) | 4 | C₁, C₁₀, C₁₁, C₁₆ |
| r₃ (270°) | 2 | C₁, C₁₆ |
| v (vertical flip) | 4 | C₁, C₆, C₈, C₁₆ |
| h (horizontal flip) | 4 | C₁, C₇, C₉, C₁₆ |
| p (positive diagonal) | 8 | C₁, C₃, C₅, C₁₀, C₁₁, C₁₃, C₁₅, C₁₆ |
| n (negative diagonal) | 8 | C₁, C₂, C₄, C₁₀, C₁₁, C₁₂, C₁₄, C₁₆ |

🧮 Why v fixes four colorings

  • For v (vertical flip) to leave a coloring unchanged, position 1 must have the same color as position 2, and position 3 must have the same color as position 4.
  • This gives 2 × 2 = 4 possibilities: both pairs white, both pairs gold, top pair one color and bottom pair the other.
  • These correspond to C₁, C₆, C₈, and C₁₆.

🔗 The equivalence relation and counting formula

🔗 How transformations create equivalence

  • Applying a transformation to a colored square produces another colored square.
  • Notation: τ★(Cᵢ) = Cⱼ means transformation τ changes coloring Cᵢ into coloring Cⱼ.
  • Example: r₁★(C₁₂) = C₁₃; v★(C₁₀) = C₁₁.

✓ Verifying equivalence is an equivalence relation

Equivalence relation: Cᵢ ~ Cⱼ if there exists a transformation τ with τ★(Cᵢ) = Cⱼ.

  • Reflexive: ε★(C) = C for all colorings C.
  • Transitive: if τ₁★(Cᵢ) = Cⱼ and τ₂★(Cⱼ) = Cₖ, then τ₂★(τ₁★(Cᵢ)) = Cₖ.
  • Symmetric: every transformation τ has an inverse τ⁻¹ that undoes it; if τ★(Cᵢ) = Cⱼ, then τ⁻¹★(Cⱼ) = Cᵢ.
    • Example: the inverse of r₁ is r₃ (counterclockwise 90° = clockwise 270°).

🎯 The counting formula

  • Sum the number of colorings fixed by each transformation: 16 + 2 + 4 + 2 + 4 + 4 + 8 + 8 = 48.
  • Divide by the number of transformations: 48 ÷ 8 = 6.
  • Result: 6 equivalence classes (nonequivalent colorings).
  • This is not a coincidence—the formula generalizes to other symmetry groups.
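
The fixed-point counts and the class count can both be checked by brute force over all 16 colorings (a verification sketch; positions 0–3 stand in for the text's vertex labels 1–4):

```python
from itertools import product

# D8 as position permutations (0=upper left, 1=upper right,
# 2=lower right, 3=lower left).
D8 = {
    "e":  (0, 1, 2, 3), "r1": (1, 2, 3, 0),
    "r2": (2, 3, 0, 1), "r3": (3, 0, 1, 2),
    "v":  (1, 0, 3, 2), "h":  (3, 2, 1, 0),
    "p":  (2, 1, 0, 3), "n":  (0, 3, 2, 1),
}

def act(perm, col):
    """Move the color at vertex i to position perm[i]."""
    out = [None] * 4
    for i, c in enumerate(col):
        out[perm[i]] = c
    return tuple(out)

colorings = list(product("wg", repeat=4))     # all 16 two-colorings
fixed = {name: sum(act(p, c) == c for c in colorings)
         for name, p in D8.items()}
classes = len({frozenset(act(p, c) for p in D8.values()) for c in colorings})
```

The `fixed` counts reproduce the table above (sum 48), and `classes` comes out to 6.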

🧩 Permutation groups (introduction)

🧩 What a permutation group is

Permutation: a bijection from a set X to itself.

Permutation group P: a set of permutations of X satisfying three properties:

  1. The identity permutation ε is in P.
  2. If τ₁, τ₂ are in P, then τ₂ ∘ τ₁ is in P (closure under composition).
  3. If τ₁ is in P, then τ₁⁻¹ is in P (closure under inverses).
  • For Pólya's theorem, X is finite, usually X = {1, 2, ..., n}.

🔷 The dihedral group D₈

  • The eight transformations of the square form a permutation group called the dihedral group of the square, denoted D₈.
  • More generally, D₂ₙ is the group of transformations for a regular n-gon (2n transformations: n rotations and n flips).

🔗 Composition examples

  • r₂ ∘ r₁ = r₃: rotating 90° then 180° clockwise = rotating 270° clockwise.
  • v ∘ r₁: composing the 90° rotation with the vertical flip (rotate first, then flip) gives v∘r₁(1)=1, v∘r₁(2)=4, v∘r₁(3)=3, v∘r₁(4)=2, which equals n (negative diagonal flip).

🔄 Inverses in D₈

  • r₁⁻¹ = r₃ (the inverse of 90° clockwise is 270° clockwise).
  • v⁻¹ = v (flipping twice about the same axis returns to the original).
  • In general, the inverse of any flip is that same flip.


15.2 Permutation Groups

🧭 Overview

🧠 One-sentence thesis

Permutation groups provide the mathematical structure needed to formalize how transformations act on sets and partition them into equivalence classes, which is essential for counting non-equivalent colorings.

📌 Key points (3–5)

  • What a permutation group is: a set of permutations (bijections from a set to itself) that includes the identity, is closed under composition, and contains inverses.
  • How transformations create equivalence: when transformations act on a set (e.g., colorings), they partition it into equivalence classes; two elements are equivalent if some transformation maps one to the other.
  • Cycle notation: a compact way to represent permutations by writing disjoint cycles, which is more useful than matrix notation for understanding structure.
  • Common confusion: permutation multiplication is function composition, so order matters—σ₂σ₁ means "apply σ₁ first, then σ₂" (right to left), and generally σ₁σ₂ ≠ σ₂σ₁.
  • Key relationship: the sum of stabilizer sizes over an equivalence class equals the group size, connecting fixed points to group structure.

🔄 Group actions and equivalence

🔄 How transformations act on sets

  • When a group of transformations acts on a set C, each transformation σ maps elements of C to other elements.
  • Notation: σ★(Cᵢ) = Cⱼ means transformation σ sends coloring Cᵢ to coloring Cⱼ.
  • Example: r₁★(C₁₂) = C₁₃ and v★(C₁₀) = C₁₁ for the square transformations.

≈ Equivalence relation from actions

Two colorings Cᵢ and Cⱼ are equivalent (written Cᵢ ≈ Cⱼ) if there exists a transformation σ with σ(Cᵢ) = Cⱼ.

This relation is:

  • Reflexive: ε(C) = C for all C (identity transformation).
  • Transitive: if σ₁(Cᵢ) = Cⱼ and σ₂(Cⱼ) = Cₖ, then σ₂(σ₁(Cᵢ)) = Cₖ.
  • Symmetric: if σ(Cᵢ) = Cⱼ, then σ⁻¹(Cⱼ) = Cᵢ (using the inverse transformation).

🧩 Partitioning into equivalence classes

  • The equivalence relation partitions C into disjoint equivalence classes.
  • Example from the square colorings: 6 equivalence classes total—two with one coloring each (all white, all gold), one with two colorings, and three with four colorings each.
  • The number of equivalence classes is the number of non-equivalent colorings.

🔍 Fixed points and stabilizers

Two complementary perspectives:

| Concept | Definition | Meaning |
| --- | --- | --- |
| fix_C(σ) | {C ∈ C : σ★(C) = C} | Which elements are fixed by transformation σ |
| stab_G(C) | {σ ∈ G : σ★(C) = C} | Which transformations fix element C |
  • Example: fix_C(r₂) = {C₁, C₁₀, C₁₁, C₁₆} (colorings unchanged by 180° rotation).
  • Example: stab_D₈(C₇) = {ε, h} (only identity and one flip preserve C₇).

🎯 Permutation group structure

🎯 Definition of a permutation group

A permutation group P is a set of permutations of a set X satisfying three properties:

  1. The identity permutation ε is in P.
  2. If σ₁, σ₂ ∈ P, then σ₂ ∘ σ₁ ∈ P (closed under composition).
  3. If σ₁ ∈ P, then σ₁⁻¹ ∈ P (contains inverses).

🔷 The dihedral group D₈

  • The transformations of the square form a permutation group called the dihedral group of the square, denoted D₈.
  • It contains 8 permutations: 4 rotations and 4 flips.
  • More generally, D₂ₙ is the dihedral group for a regular n-gon, containing 2n permutations.
  • Example compositions: r₂ ∘ r₁ = r₃ (rotate 90° then 180° = rotate 270°).
  • Example inverses: r₁⁻¹ = r₃ (counterclockwise 90° undoes clockwise 90°); v⁻¹ = v (any flip is its own inverse).

🌐 The symmetric group Sₙ

  • Sₙ: the set of all permutations of {1, 2, ..., n}.
  • Every finite permutation group is a subgroup of Sₙ for some n.
  • This is the "universal" permutation group containing all possible rearrangements.

📝 Representing and manipulating permutations

📝 Matrix notation (basic)

  • Write a permutation σ as a 2×n matrix:
    • First row: domain {1, 2, ..., n}
    • Second row: σ(i) in position i
  • Example: σ = (1 2 3 4 5 / 2 4 3 5 1) means σ(1)=2, σ(2)=4, σ(3)=3, σ(4)=5, σ(5)=1.
  • Drawback: awkward and doesn't reveal structure.

🔁 Cycle notation (preferred)

  • Construct a digraph: vertices are {1, ..., n}, directed edge from i to j if σ(i) = j.
  • Every component is a directed cycle (since σ is a bijection).
  • Write each cycle starting from its smallest vertex, following edges in order, enclosed in parentheses.
  • Example: σ = (1245)(3) means σ(1)=2, σ(2)=4, σ(4)=5, σ(5)=1, σ(3)=3.
  • Convention: list disjoint cycles with first entries in increasing order.

🔢 Cycle structure terminology

  • A k-cycle is a cycle of length k.
  • Example: σ = (1483)(27)(56) has two 2-cycles and one 4-cycle.
  • Example: σ' = (13)(2)(478)(56) has one 1-cycle, two 2-cycles, and one 3-cycle.

⚙️ Multiplying (composing) permutations

  • Important: σ₂σ₁ means apply σ₁ first, then σ₂ (right-to-left composition).
  • Permutation multiplication is generally not commutative: σ₁σ₂ ≠ σ₂σ₁.

Example (from D₈):

  • Let σ₁ = (1234) and σ₂ = (12)(34).
  • σ₃ = σ₂σ₁: σ₃(1) = σ₂(σ₁(1)) = σ₂(2) = 1, σ₃(2) = σ₂(3) = 4, σ₃(3) = σ₂(4) = 3, σ₃(4) = σ₂(1) = 2. So σ₃ = (1)(24)(3).
  • σ₄ = σ₁σ₂: σ₄(1) = 3, σ₄(2) = 2, σ₄(3) = 1, σ₄(4) = 4. So σ₄ = (13)(2)(4).
  • Notice σ₃ ≠ σ₄.
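
The σ₃/σ₄ computation can be replayed with permutations as Python dicts (a sketch; `compose` is an illustrative helper name):

```python
def compose(s2, s1):
    """Right-to-left product: apply s1 first, then s2."""
    return {x: s2[s1[x]] for x in s1}

s1 = {1: 2, 2: 3, 3: 4, 4: 1}   # (1234)
s2 = {1: 2, 2: 1, 3: 4, 4: 3}   # (12)(34)
s3 = compose(s2, s1)            # sigma2 . sigma1 = (1)(24)(3)
s4 = compose(s1, s2)            # sigma1 . sigma2 = (13)(2)(4)
```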

Physical example:

  • Book with cover up, spine left.
  • Flip left-to-right, then rotate 90° clockwise → spine in one position.
  • Rotate 90° clockwise, then flip left-to-right → spine in a different position.

🛠️ Building cycle notation efficiently

Instead of computing where each element goes separately, trace cycles as you compute:

Example:

  • σ₁ = (123)(487)(5)(6), σ₂ = (18765)(234), find σ₃ = σ₂σ₁.
  • Start with 1: σ₁(1)=2, σ₂(2)=3, so σ₃(1)=3. Write (13...
  • Continue from 3: σ₁(3)=1, σ₂(1)=8, so σ₃(3)=8. Write (138...
  • Continue from 8: σ₁(8)=7, σ₂(7)=6, so σ₃(8)=6. Write (1386...
  • Continue from 6: σ₁(6)=6, σ₂(6)=5, so σ₃(6)=5. Write (13865...
  • Continue from 5: σ₁(5)=5, σ₂(5)=1, so σ₃(5)=1. Close cycle: (13865).
  • Start next cycle with 2: trace to get (247).
  • Result: σ₃ = (13865)(247).
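
The tracing procedure can be automated with two small helpers (a sketch; `from_cycles` and `to_cycles` are illustrative names, not from the text):

```python
def from_cycles(cycles, n):
    """Build the mapping {i: sigma(i)} on 1..n from a list of cycles;
    elements omitted from every cycle are fixed points."""
    sigma = {i: i for i in range(1, n + 1)}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[:1]):
            sigma[a] = b
    return sigma

def to_cycles(sigma):
    """Cycle notation: each cycle starts at its smallest element,
    cycles listed with first entries increasing (the text's convention)."""
    seen, cycles = set(), []
    for start in sorted(sigma):
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = sigma[x]
        cycles.append(tuple(cyc))
    return cycles

s1 = from_cycles([(1, 2, 3), (4, 8, 7)], 8)        # (123)(487)(5)(6)
s2 = from_cycles([(1, 8, 7, 6, 5), (2, 3, 4)], 8)  # (18765)(234)
s3 = {x: s2[s1[x]] for x in s1}                    # sigma2 after sigma1
```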

Inverse example:

  • [(123456)][(165432)] = ?
  • Trace 1: right sends 1→6, left sends 6→1, so product sends 1→1.
  • Similarly, every element maps to itself.
  • Result: (1)(2)(3)(4)(5)(6) = identity permutation.
  • Therefore (123456) and (165432) are inverses.

🔗 Connecting group size to equivalence classes

🔗 Proposition 15.8

For a group G acting on a finite set C, for all C ∈ C: the sum of |stab_G(C')| over all C' in the equivalence class ⟨C⟩ equals |G|.

In symbols: Σ_{C' ∈ ⟨C⟩} |stab_G(C')| = |G|.

🧮 Why this matters

  • This proposition connects the size of the group to the sizes of stabilizers across an equivalence class.
  • It is a key step toward Burnside's lemma (mentioned in the excerpt but proved in the next section).
  • The relationship between fixed points (fix_C(σ)) and stabilizers (stab_G(C)) underlies the counting of non-equivalent colorings.

💡 Proof sketch

  • Define T(C, C') = {σ ∈ G : σ(C) = C'}.
  • For any σ ∈ T(C, C'), composing with elements of stab_G(C) gives distinct elements of T(C, C').
  • This establishes a bijection showing |T(C, C')| = |stab_G(C)|.
  • Summing over all C' in ⟨C⟩ accounts for all elements of G exactly once, yielding |G|.
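
Proposition 15.8 can be spot-checked on the square colorings: in the orbit of a one-gold-corner coloring, four colorings each have a stabilizer of size 2, summing to |D₈| = 8 (a verification sketch, not the text's proof):

```python
# D8 as position permutations (positions 0..3 standing in for vertices 1..4).
D8 = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2),
      (1, 0, 3, 2), (3, 2, 1, 0), (2, 1, 0, 3), (0, 3, 2, 1)]

def act(perm, col):
    out = [None] * 4
    for i, c in enumerate(col):
        out[perm[i]] = c
    return tuple(out)

c = ("g", "w", "w", "w")                      # one gold corner
orbit = {act(p, c) for p in D8}               # the equivalence class <c>
stab_sizes = [sum(act(p, d) == d for p in D8) for d in orbit]
```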

15.3 Burnside’s Lemma

🧭 Overview

🧠 One-sentence thesis

Burnside's lemma connects the number of distinct equivalence classes under a group action to the average number of elements fixed by each group element, providing a powerful counting tool for symmetry problems.

📌 Key points (3–5)

  • What the lemma states: The number of equivalence classes equals the average number of fixed elements across all group elements, computed as (1 / |G|) × sum of |fix(g)| over all g in G.
  • Key setup concepts: The lemma requires understanding the stabilizer of an element (group elements that fix it) and the fixed set of a group element (elements it leaves unchanged).
  • Foundation proposition: For any element and its equivalence class, the sum of stabilizer sizes across the class equals the group size.
  • Common confusion: Don't confuse fix(g) (elements fixed by g) with stab(c) (group elements fixing c)—they are dual concepts.
  • Why it matters: The lemma validates symmetry-based counting without exhaustively listing all equivalence classes, as demonstrated by the square coloring example.

🔧 Core notation and definitions

🔧 The action and equivalence relation

  • When a group G acts on a finite set C, it induces an equivalence relation denoted by ~.
  • The action of group element g on element c is written as g(c).
  • The equivalence class containing c is denoted ⟨c⟩.

🎯 Fixed set of a group element

fix_C(g) = {c in C : g(c) = c}, the set of colorings fixed by g.

  • This is the collection of all elements in C that remain unchanged when g acts on them.
  • Example from the excerpt: fix_C(r₂) = {C₁, C₁₀, C₁₁, C₁₆} for the 180° rotation r₂ acting on colorings.

🔒 Stabilizer of an element

stab_G(c) = {g in G : g(c) = c}, the stabilizer of c in G, the permutations in G that fix c.

  • This is the collection of all group elements that leave c unchanged.
  • Example from the excerpt: stab_D₈(C₇) = {ε, h} and stab_D₈(C₁₁) = {ε, r₂, p, n}.
  • Don't confuse: stab(c) asks "which group elements fix this c?" while fix(g) asks "which elements does this g fix?"

🧮 The foundation: Proposition 15.8

🧮 Statement of the proposition

Let a group G act on a finite set C. Then for all c in C, the sum of |stab_G(c')| over all c' in ⟨c⟩ equals |G|.

  • In words: if you sum up the stabilizer sizes for all elements in an equivalence class, you always get the group size.
  • This is the key technical result needed to prove Burnside's lemma.

🔍 How the proof works

The proof introduces T(c, c') = {g in G : g(c) = c'}, the set of group elements sending c to c'.

Key observations:

  • T(c, c) = stab_G(c) (elements sending c to itself are exactly its stabilizer).
  • If g is in T(c, c'), then g ∘ σᵢ is also in T(c, c') for each σᵢ in stab_G(c).
  • This shows T(c, c') = {g ∘ σ₁, ..., g ∘ σₖ} where stab_G(c) = {σ₁, ..., σₖ}.
  • Therefore |T(c, c')| = |stab_G(c)| for any c' in the equivalence class of c.

Counting argument:

  • For all c' in ⟨c⟩, we have |stab_G(c')| = |T(c', c')| = |T(c', c)| = |T(c, c')| = |stab_G(c)|.
  • Summing over all c' in ⟨c⟩ gives: sum of |stab_G(c')| = sum of |T(c, c')|.
  • Each element of G appears in exactly one T(c, c') for precisely one c' in ⟨c⟩.
  • Therefore the sum equals |G|.

🎯 Burnside's Lemma

🎯 Statement of the lemma

Let a group G act on a finite set C. If N is the number of equivalence classes of C induced by this action, then N = (1 / |G|) × sum of |fix_C(g)| over all g in G.

  • In words: the number of distinct patterns (equivalence classes) equals the average number of elements fixed by each group element.
  • The excerpt notes this calculation matches exactly what was performed for 2-coloring the vertices of a square in Section 15.1.

🔬 Proof mechanism

The proof uses a clever double-counting argument on the set X = {(g, c) in G × C : g(c) = c}.

Two ways to count X:

  1. Sum over g in G: each term counts pairs with g in the first coordinate → sum of |fix_C(g)| = |X|.
  2. Sum over c in C: each term counts pairs with c in the second coordinate → sum of |stab_G(c)| = |X|.

Therefore: sum of |fix_C(g)| over g in G = sum of |stab_G(c)| over c in C.

Regrouping by equivalence classes:

  • The right-hand sum can be rewritten as: sum over equivalence classes ⟨c⟩ of (sum of |stab_G(c')| over c' in ⟨c⟩).
  • By Proposition 15.8, each inner sum equals |G|.
  • There are N equivalence classes, so the total is N × |G|.

Conclusion: sum of |fix_C(g)| = N × |G|, so N = (1 / |G|) × sum of |fix_C(g)|.

💡 Practical implications

💡 Why the lemma is powerful

  • The excerpt emphasizes the computational advantage: for a hexagon with 4 colors, there would be 4⁶ = 4096 different colorings and the dihedral group has 12 elements.
  • Assembling a complete table (analogous to Table 15.2) would be "a nightmare."
  • Burnside's lemma lets you count equivalence classes without exhaustively listing them all.
  • You only need to compute |fix_C(g)| for each of the 12 group elements, not examine all 4096 colorings.
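
The hexagon computation the excerpt calls "a nightmare" by hand takes a few lines with Burnside's lemma; counting cycles of each group element avoids examining the 4096 colorings individually (a sketch, assuming the standard presentation of the dihedral group as rotations i↦i+k and reflections i↦k−i):

```python
n, m = 6, 4                        # hexagon vertices, four colors
rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
group = rotations + reflections    # the 12 elements of the dihedral group

def fixed(perm):
    """m ** (number of cycles): a fixed coloring is constant on each cycle."""
    seen, cycles = set(), 0
    for s in range(n):
        if s not in seen:
            cycles += 1
            x = s
            while x not in seen:
                seen.add(x)
                x = perm[x]
    return m ** cycles

N = sum(fixed(p) for p in group) // len(group)
```

The identity contributes 4⁶ = 4096 fixed colorings, and the average over all 12 elements gives the class count N.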

💡 Connection to Pólya's approach

  • The excerpt states that Burnside's lemma "helpfully validates the computations" from the previous section.
  • However, for larger problems, even applying Burnside's lemma directly can be tedious.
  • The next section (15.4) introduces Pólya's theorem, which brings "genius" to the approach by using generating functions.
  • This sets up the transition to the cycle index, a generating function with multiple variables based on cycle structure.

🔄 Cycle notation context

🔄 Building cycle notation efficiently

The excerpt begins with examples of composing permutations using cycle notation.

Example 15.6:

  • Given σ₁ = (123)(487)(5)(6) and σ₂ = (18765)(234), compute σ₃ = σ₂ ∘ σ₁.
  • Work from right to left: σ₁ sends 1 to 2, σ₂ sends 2 to 3, so σ₃ sends 1 to 3.
  • Continue: 3 → 8 → 6 → 5 → 1, completing the first cycle (13865).
  • Start with 2: 2 → 4 → 7 → 2, giving cycle (247).
  • Result: σ₃ = (13865)(247).

Example 15.7:

  • Compute [(123456)][(165432)] (working right to left).
  • First permutation sends 1 to 6, second sends 6 to 1, so the product sends 1 to 1.
  • Similarly, 2 → 2 and i → i for every i ≤ 6.
  • Result: (1)(2)(3)(4)(5)(6), the identity permutation.
  • Therefore (123456) and (165432) are inverses.

🔄 Transition to cycle index

The excerpt ends by introducing the cycle index, a generating function approach.

Key idea:

  • Associate a monomial to each permutation based on its cycle structure.
  • If σ has jₖ cycles of length k for 1 ≤ k ≤ n, its monomial is x₁^(j₁) × x₂^(j₂) × ... × xₙ^(jₙ).
  • Note: j₁ + 2j₂ + 3j₃ + … + n·jₙ = n (total elements).

Examples from D₈:

  • r₁ = (1234): one cycle of length 4 → monomial x₄¹.
  • r₂ = (13)(24): two cycles of length 2 → monomial x₂².
  • p = (13)(2)(4): two 1-cycles and one 2-cycle → monomial x₁² × x₂¹.
  • ε = (1)(2)(3)(4): four 1-cycles → monomial x₁⁴, with 16 fixed colorings.

15.4 Pólya’s Theorem

🧭 Overview

🧠 One-sentence thesis

Pólya's Enumeration Theorem extends Burnside's Lemma by using a cycle index polynomial to count not only the total number of distinct colorings under group symmetries but also to track how many colorings have specific color distributions.

📌 Key points (3–5)

  • Cycle index: a polynomial built from the cycle structure of each permutation in a group, averaged over all permutations, which encodes symmetry information.
  • Substitution for counting: substituting the number of colors m for each variable gives the total number of distinct colorings; substituting color variables (like w + 1) gives a generating function tracking color distributions.
  • Pattern inventory: the generating function resulting from Pólya's theorem, where coefficients count nonequivalent colorings with specific numbers of each color.
  • Common confusion: Burnside's Lemma only counts total distinct colorings, while Pólya's theorem also distinguishes how many of each color appear in each distinct coloring.
  • Power over manual counting: avoids enumerating all colorings and checking each permutation's fixed points, especially for large problems (e.g., 500-bead necklaces or hexagons with 4 colors).

🔢 The cycle index polynomial

🔢 What the cycle index is

Cycle index P_G(x₁, x₂, …, xₙ): the average of the monomials associated with each permutation in group G, where each monomial encodes the cycle structure of that permutation.

  • For a permutation with j₁ cycles of length 1, j₂ cycles of length 2, etc., the associated monomial is x₁^j₁ · x₂^j₂ · … · xₙ^jₙ.
  • The cycle index is the sum of all such monomials divided by the size of the group.
  • Example: for the dihedral group D₈ of the square, the cycle index is
    P_D₈(x₁, x₂, x₃, x₄) = (1/8)(x₁⁴ + 2x₁²x₂ + 3x₂² + 2x₄).

🔄 How cycle structure relates to fixed colorings

  • If a permutation fixes a coloring, all vertices in the same cycle must have the same color.
  • Each cycle can be colored uniformly in m ways (where m is the number of colors).
  • Example: the permutation v = (12)(34) has two 2-cycles; for 2-coloring, each cycle has 2 choices, so v fixes 2 × 2 = 4 colorings.
  • Substituting xᵢ = m for all i in the cycle index gives the total number of distinct m-colorings (this recovers Burnside's Lemma).

📊 Table: Cycle index for D₈

| Transformation | Cycle notation | Monomial | Fixed 2-colorings |
| --- | --- | --- | --- |
| identity ε | (1)(2)(3)(4) | x₁⁴ | 16 |
| rotation r₁ | (1234) | x₄ | 2 |
| rotation r₂ | (13)(24) | x₂² | 4 |
| rotation r₃ | (1432) | x₄ | 2 |
| vertical flip v | (12)(34) | x₂² | 4 |
| horizontal flip h | (14)(23) | x₂² | 4 |
| diagonal flip p | (13)(2)(4) | x₁²x₂ | 8 |
| diagonal flip n | (1)(24)(3) | x₁²x₂ | 8 |
  • Don't confuse: the monomial is not the number of fixed colorings; it is a symbolic representation of cycle structure. Substituting m for each variable gives the count.
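The table's last column follows mechanically from the cycle types. A minimal Python sketch (the dictionary of cycle counts is our own encoding, not the book's notation):

```python
# Each D8 element is recorded by its cycle counts (j1, j2, j3, j4) on the four
# vertices; substituting m for every variable in the monomial x1^j1 x2^j2 x3^j3 x4^j4
# gives m**(j1+j2+j3+j4), the number of m-colorings that permutation fixes.
d8_cycle_types = {
    "identity":        (4, 0, 0, 0),   # x1^4
    "rotation 90":     (0, 0, 0, 1),   # x4
    "rotation 180":    (0, 2, 0, 0),   # x2^2
    "rotation 270":    (0, 0, 0, 1),   # x4
    "vertical flip":   (0, 2, 0, 0),   # x2^2
    "horizontal flip": (0, 2, 0, 0),   # x2^2
    "diagonal flip p": (2, 1, 0, 0),   # x1^2 x2
    "diagonal flip n": (2, 1, 0, 0),   # x1^2 x2
}

m = 2  # two colors
fixed = {name: m ** sum(jt) for name, jt in d8_cycle_types.items()}
print(fixed["identity"], fixed["diagonal flip p"])   # 16 8, matching the table
print(sum(fixed.values()) // len(d8_cycle_types))    # Burnside average: 6
```

Setting m = 3 instead counts distinct 3-colorings the same way.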

🎨 Substitution for color tracking

🎨 Substituting color sums

  • To track how many vertices receive each color, substitute (c₁ + c₂ + … + cₘ) for xᵢ in the cycle index.
  • For 2-coloring with white (w) and gold (g), substitute x₁ = w + g, x₂ = w² + g², x₃ = w³ + g³, etc.
  • A cycle of length i contributes i vertices of the same color, so xᵢ becomes wⁱ + gⁱ.

🧮 Example: 2-coloring the square

  • Substituting into P_D₈ gives:
    P_D₈(w+g, w²+g², w³+g³, w⁴+g⁴) = g⁴ + g³w + 2g²w² + gw³ + w⁴.
  • Coefficients: 1 coloring with 4 gold, 1 with 3 gold + 1 white, 2 with 2 gold + 2 white, 1 with 1 gold + 3 white, 1 with 4 white.
  • This matches the manual count from earlier sections.
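The coefficient list 1, 1, 2, 1, 1 can also be verified by brute force, without the cycle index at all. A sketch, assuming vertices 0–3 labeled clockwise:

```python
from itertools import product
from collections import Counter

# The eight symmetries of the square as permutations of vertex positions 0..3.
D8 = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2),   # rotations
      (1, 0, 3, 2), (3, 2, 1, 0), (0, 3, 2, 1), (2, 1, 0, 3)]   # flips

def canonical(c):
    # lexicographically least image of c: one representative per orbit
    return min(tuple(c[p[i]] for i in range(4)) for p in D8)

orbits = {canonical(c) for c in product("gw", repeat=4)}
by_gold = Counter(rep.count("g") for rep in orbits)
print(sorted(by_gold.items()))   # [(0, 1), (1, 1), (2, 2), (3, 1), (4, 1)]
```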

🌈 Example: 3-coloring the square

  • Allowing blue (b) as well, substitute x₁ = w + g + b, x₂ = w² + g² + b², etc.
  • The resulting polynomial has terms like 2b²gw, meaning 2 distinct colorings with 2 blue, 1 gold, 1 white vertex.
  • Don't confuse: the coefficient counts distinct (nonequivalent) colorings, not all colorings with that color distribution.

🎼 Pólya's Enumeration Theorem

🎼 The full theorem statement

Pólya's Enumeration Theorem: Let S be a set with |S| = r and C the set of colorings of S using colors c₁, …, cₘ. If a permutation group G acts on S to induce an equivalence relation on C, then
P_G(Σcᵢ, Σcᵢ², …, Σcᵢʳ)
is the generating function for the number of nonequivalent colorings of S in C.

  • The substitution Σcᵢ means summing over all color variables.
  • The resulting polynomial is called the pattern inventory.
  • Each coefficient on a monomial like c₁^a · c₂^b · … · cₘ^z counts how many distinct colorings use a of color 1, b of color 2, etc.

🚀 Why this is powerful

  • Avoids manual enumeration: no need to list all colorings or check each permutation's fixed points.
  • Scales to large problems: the excerpt gives an example of 500-bead necklaces with 3 colors, yielding approximately 3.6 × 10²³⁵ total distinct necklaces and 2.5 × 10²⁰⁰ with a specific color distribution (225 white, 225 gold, 50 blue).
  • Computer algebra systems can compute cycle indices and pattern inventories for large groups.

🎵 Application: Counting musical scales

🎵 Modeling scales as colorings

  • Western music uses 12 equally-spaced notes, numbered 0 through 11, arranged cyclically (modulo 12).
  • A scale is a subset of {0, 1, …, 11} arranged in increasing order.
  • A transposition replaces each note x by x + a (mod 12); musicians consider two scales equivalent if one is a transposition of the other.

🎵 Coloring interpretation

  • Arrange the 12 notes clockwise around a circle.
  • Color selected notes black and unselected notes white.
  • Transposition corresponds to rotation; the group acting is the cyclic group of rotations (not the full dihedral group, since reflection is not allowed).
  • Example: three 5-note scales S₁, S₂, S₃ are shown; S₁ and S₂ are equivalent by rotation (transposition by 7), but S₃ is not equivalent to them.

🎵 Applying Pólya's theorem

  • The question "How many nonequivalent scales with exactly k notes?" becomes a coloring problem on a 12-vertex cycle.
  • The cycle index for the cyclic group of order 12 can be computed, then substituted with color variables to count scales by size.
  • Don't confuse: flipping the circle would allow more equivalences (e.g., S₃ would become equivalent to S₁), but musical transposition only allows rotation, not reflection.

15.5 Applications of Pólya’s Enumeration Formula

🧭 Overview

🧠 One-sentence thesis

Pólya's Enumeration Formula can be applied to count nonequivalent patterns in diverse domains—from musical scales and chemical isomers to unlabeled graphs—by substituting appropriate variables into the cycle index of the relevant symmetry group.

📌 Key points (3–5)

  • Pattern inventory: The generating function from Pólya's Enumeration Theorem records the number of nonequivalent patterns by substituting color variables into the cycle index.
  • Shortcut for single-type counting: When counting objects with a fixed total (e.g., k notes in a scale), you can replace one variable with 1 to simplify the generating function.
  • Pair groups for edge problems: Counting nonisomorphic graphs requires using the pair group S^(2)_n instead of S_n, because permutations must preserve edge/non-edge relationships, not just vertices.
  • Common confusion: Don't confuse labeled graphs (where vertices have fixed identities) with nonisomorphic/unlabeled graphs (where only structure matters)—the latter requires accounting for symmetries via Pólya's theorem.
  • Computational power: For large problems (e.g., 500-bead necklaces or 30-vertex graphs), computer algebra systems can compute the pattern inventory, yielding astronomically large counts.

🎵 Counting musical scales

🎵 The musical scale problem

  • Western music uses 12 equally-spaced notes, numbered 0 through 11, arranged in octaves (cyclic: after 11 comes 0 again).
  • A scale is a subset of {0, 1, ..., 11} arranged in increasing order.
  • A transposition replaces each note x by x + a (mod 12) for some constant a.
  • Musicians consider two scales equivalent if one is a transposition of the other.
  • Question: How many nonequivalent scales are there with exactly k notes?

🔄 Modeling as a coloring problem

  • Arrange the 12 notes clockwise around a circle.
  • Color selected notes black and unselected notes white.
  • Transposition corresponds to rotation of the circle.
  • Example: Three 5-note scales S₁, S₂, S₃ are shown; S₂ can be obtained from S₁ by rotating forward 7 positions (transposition by adding 7), so they are equivalent; S₃ cannot be obtained by rotation, so it is inequivalent to S₁ and S₂.
  • Don't confuse: Flips are not allowed here (only transpositions/rotations), so scales that differ by reflection are not equivalent in this model.

🔢 The cyclic group and cycle index

  • The group acting on the colorings is the cyclic group of order 12, C₁₂ = {ε, ρ, ρ², ..., ρ¹¹}, where ρ = (0 1 2 3 4 5 6 7 8 9 10 11) is transposition up by one note.
  • Every element is a power of ρ (only rotations allowed).
  • The cycle index is:

    P_C₁₂(x₁, ..., x₁₂) = (x₁¹²)/12 + (x₂⁶)/12 + (x₃⁴)/6 + (x₄³)/6 + (x₆²)/6 + x₁₂/3

🎹 Simplification and result

  • Shortcut: Since the number of white notes can be deduced from the number of black notes, replace w by 1 in the substitution x_i = b^i + w^i.
  • Substitute x_i = 1 + b^i into the cycle index:

    P_C₁₂(1 + b, 1 + b², ..., 1 + b¹²) = b¹² + b¹¹ + 6b¹⁰ + 19b⁹ + 43b⁸ + 66b⁷ + 80b⁶ + 66b⁵ + 43b⁴ + 19b³ + 6b² + b + 1

  • The coefficient on b^k gives the number of nonequivalent k-note scales.
  • Answer: There are 80 nonequivalent 6-note scales.
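The whole computation fits in a few lines: build the cycle index of C₁₂ from the gcd rule, substitute xᵢ = 1 + bⁱ, and expand. A sketch (polynomials stored as degree-to-coefficient maps):

```python
from math import gcd
from collections import Counter

def poly_mul(a, b):
    # multiply two one-variable polynomials stored as {degree: coefficient}
    out = Counter()
    for da, ca in a.items():
        for db, cb in b.items():
            out[da + db] += ca * cb
    return out

n = 12
total = Counter()
for k in range(n):                 # rho^k has gcd(k, n) cycles of length n/gcd(k, n)
    cycles = gcd(k, n)
    length = n // cycles
    term = Counter({0: 1})
    for _ in range(cycles):        # substitute x_i = 1 + b^i: (1 + b^length)^cycles
        term = poly_mul(term, Counter({0: 1, length: 1}))
    total += term

scales = {deg: coeff // n for deg, coeff in sorted(total.items())}
print(scales[6], sum(scales.values()))   # 80 nonequivalent 6-note scales; 352 total
```

The same loop with a larger n and a three-term factor handles the big necklace examples, though the exact integers get very long.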

🧪 Enumerating chemical isomers

🧪 Benzene and aromatic hydrocarbons

  • Benzene (C₆H₆): six carbon atoms form a hexagonal ring with alternating single/double bonds; one hydrogen atom is bonded to each carbon (outside the ring).
  • Aromatic hydrocarbons: formed by replacing one or more hydrogen atoms with other atoms or functional groups (e.g., CH₃ methyl group, OH hydroxyl group).
  • Isomers: molecules with the same chemical formula but different structures (different choices of which hydrogens to replace).

🔷 The dihedral group D₁₂

  • The symmetry group for the benzene ring is the dihedral group of the hexagon, D₁₂, which includes rotations and flips.
  • Number the six carbon atoms 1, 2, ..., 6 clockwise.
  • Clockwise rotation by 60° is r = (123456); other rotations are r², r³, r⁴, r⁵.
  • Flip across the vertical axis is f = (16)(25)(34); other flips are fr, fr², fr³, fr⁴, fr⁵.
  • The cycle index is:

    P_D₁₂(x₁, ..., x₆) = (1/12)(x₁⁶ + 2x₆ + 2x₃² + 4x₂³ + 3x₁²x₂²)

🧬 Counting xylenol isomers

  • Xylenol (dimethylphenol): has three hydrogen atoms, two methyl groups (m), and one hydroxyl group (h) attached to the carbon ring.
  • Substitute x_i = 1 + m^i + h^i into the cycle index (1 accounts for default hydrogen).
  • The resulting generating function includes many terms; the coefficient on h¹m² gives the number of isomers with one hydroxyl and two methyl groups.
  • Answer: There are 6 isomers of xylenol.
  • Pólya originally used these techniques to enumerate alkane isomers (C_n H_{2n+2}), which are special types of trees.
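The answer 6 is small enough to confirm by brute force over the 60 distinct placements. A sketch (ring position labels are ours):

```python
from itertools import permutations

# Ring positions 0..5; the dihedral group D12 as rotations i -> i+r and
# reflections i -> r-i (mod 6).
D12 = [tuple((i + r) % 6 for i in range(6)) for r in range(6)] + \
      [tuple((r - i) % 6 for i in range(6)) for r in range(6)]

def canonical(c):
    return min(tuple(c[p[i]] for i in range(6)) for p in D12)

# all placements of 2 methyl (m), 1 hydroxyl (h), 3 hydrogen (H) on the ring
placements = set(permutations("mmhHHH"))
isomers = {canonical(c) for c in placements}
print(len(isomers))   # 6 isomers of xylenol
```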

📊 Counting nonisomorphic graphs

📊 Labeled vs nonisomorphic graphs

  • Labeled graphs: vertices have fixed identities (e.g., vertex set {1, 2, 3, 4}).
    • There are C(n, 2) possible edges, so 2^(C(n,2)) labeled graphs on n vertices.
    • To count labeled graphs with exactly k edges, choose a k-element subset of all possible edges: C(C(n,2), k) graphs.
  • Nonisomorphic (unlabeled) graphs: only the structure matters, not vertex labels.
    • Example: Four labeled graphs on four vertices are shown; the first three are isomorphic to each other, so only two nonisomorphic graphs are illustrated.
    • Don't confuse: Isomorphic graphs have the same structure but different labels; nonisomorphic graphs have genuinely different structures.

🔀 The pair group S^(2)_n

  • To count nonisomorphic graphs, the symmetric group S_n acts on vertices, but we need to track edges (2-element subsets).
  • The pair group S^(2)_n permutes the 2-element subsets e_ij = {i, j}.
  • A permutation σ in S_n induces a permutation in S^(2)_n: e_ij is sent to e_{σ(i)σ(j)}.
  • Why this matters: For a permutation to fix a graph, every edge must be sent to an edge and every non-edge to a non-edge.

🔢 Finding the cycle index of S^(2)_4

  • For each permutation in S₄, determine the corresponding permutation in S^(2)_4 by tracking how 2-element subsets are permuted.
  • Examples:
    • Identity (1)(2)(3)(4) → (e₁₂)(e₁₃)(e₁₄)(e₂₃)(e₂₄)(e₃₄)
    • (12)(3)(4) → (e₁₂)(e₁₃ e₂₃)(e₁₄ e₂₄)(e₃₄) (two 2-cycles, two 1-cycles)
    • (123)(4) → (e₁₂ e₂₃ e₁₃)(e₁₄ e₂₄ e₃₄) (two 3-cycles)
    • (1234) → (one 4-cycle and one 2-cycle)
    • (12)(34) → (two 2-cycles and two 1-cycles)
  • Count permutations of each cycle type in S₄:
    • 1 identity (four 1-cycles)
    • 6 single 2-cycles (two 1-cycles, one 2-cycle)
    • 3 two 2-cycles
    • 8 one 3-cycle and one 1-cycle
    • 6 single 4-cycles
  • The cycle index of S^(2)_4 is:

    P_{S^(2)_4}(x₁, ..., x₆) = (1/24)(x₁⁶ + 9x₁²x₂² + 8x₃² + 6x₂x₄)

📈 Generating function for 4-vertex graphs

  • Substitute x_i = 1 + x^i to account for edge present (x^i) or absent (1).
  • Result:

    P_{S^(2)_4}(1 + x, ..., 1 + x⁶) = 1 + x + 2x² + 3x³ + 2x⁴ + x⁵ + x⁶

  • The coefficient on x^m gives the number of nonisomorphic graphs with m edges.
  • Substituting x = 1 gives the total: 11 nonisomorphic graphs on four vertices (vs. 64 labeled graphs).
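Both numbers (11 total, and the coefficient list 1, 1, 2, 3, 2, 1, 1) can be cross-checked with canonical forms instead of the cycle index. A sketch:

```python
from itertools import combinations, permutations
from collections import Counter

n = 4
possible_edges = list(combinations(range(n), 2))   # C(4,2) = 6

def canonical(edges):
    # least relabeling of the edge set over all 4! vertex permutations
    return min(tuple(sorted(tuple(sorted((s[u], s[v]))) for u, v in edges))
               for s in permutations(range(n)))

reps = set()
for k in range(len(possible_edges) + 1):
    for edges in combinations(possible_edges, k):
        reps.add(canonical(edges))

by_edges = Counter(len(r) for r in reps)
print(len(reps), [by_edges[m] for m in range(7)])   # 11 [1, 1, 2, 3, 2, 1, 1]
```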

🚀 Large-scale computations

  • For 30 vertices: approximately 3.3 × 10⁹⁸ nonisomorphic graphs (vs. 8.9 × 10¹³⁰ labeled graphs).
  • For 30-vertex graphs with exactly 200 edges: approximately 3.1 × 10⁹⁶ nonisomorphic graphs.
  • Limitation: Enumerating graphs where every vertex has degree r cannot be approached with these techniques due to increased dependency between vertex mappings.

15.6 Exercises

🧭 Overview

🧠 One-sentence thesis

This exercise set applies Pólya's enumeration theorem to count nonequivalent colorings and structures under symmetry groups, including permutations in cycle notation, dihedral groups acting on polygons, graph isomorphism counting, and real-world applications like musical scales and molecular isomers.

📌 Key points (3–5)

  • Core skill: writing permutations in cycle notation and computing compositions of permutations.
  • Stabilizer and orbit: finding which group elements fix a particular coloring (stabilizer) is central to applying Pólya's theorem.
  • Cycle index application: substituting variables in the cycle index generates counts of nonequivalent colorings with specified color distributions.
  • Common confusion: the group action depends on what is being permuted—vertices of a polygon vs. squares of a grid vs. edges in a graph—so the same abstract group (e.g., D₈) has different cycle representations in different problems.
  • Real applications: the exercises cover musical scales, chemical isomers, graph enumeration, and game boards, showing how symmetry reduces counting complexity.

🔄 Permutation mechanics

🔄 Cycle notation and composition

  • Exercises 1–2 ask you to convert two-row permutation notation into cycle notation and then compute products of permutations.
  • Cycle notation groups elements that map to each other in a cycle: (1 2 3 4 5 6 → 4 2 5 6 3 1) becomes cycles showing 1→4→6→1, etc.
  • Composition σ₁σ₂ means "apply σ₂ first, then σ₁"; order matters and the exercises verify this by computing both σ₁σ₂ and σ₂σ₁.
  • Example: if σ₁ = (1 4 6) and σ₂ = (2 5), then σ₁σ₂ ≠ σ₂σ₁ in general.

🧮 Pair groups and edge permutations

  • Exercise 8 connects permutations in S₄ (acting on four vertices) to permutations in S⁽²⁾₄ (acting on the six edges of the complete graph K₄).
  • A 4-cycle in S₄ induces a 4-cycle and a 2-cycle in S⁽²⁾₄; a product of two 2-cycles in S₄ induces two 2-cycles and two 1-cycles in S⁽²⁾₄.
  • This technique is essential for counting nonisomorphic graphs, because graph isomorphism depends on how edges (not just vertices) are rearranged.
  • Don't confuse: the same abstract permutation has different cycle structures depending on whether it acts on vertices or on pairs of vertices.
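A short sketch makes the induced cycle types concrete (dict-based permutations are our own encoding):

```python
from itertools import combinations

def pair_action(sigma):
    # the permutation that sigma (a dict on vertices) induces on 2-element subsets
    return {frozenset(p): frozenset(sigma[x] for x in p)
            for p in combinations(sigma, 2)}

def cycle_type(perm):
    seen, lengths = set(), []
    for start in perm:
        if start in seen:
            continue
        x, n = start, 0
        while x not in seen:
            seen.add(x)
            x, n = perm[x], n + 1
        lengths.append(n)
    return sorted(lengths)

four_cycle = {1: 2, 2: 3, 3: 4, 4: 1}     # (1 2 3 4)
two_swaps  = {1: 2, 2: 1, 3: 4, 4: 3}     # (1 2)(3 4)
print(cycle_type(pair_action(four_cycle)))  # [2, 4]: one 2-cycle, one 4-cycle
print(cycle_type(pair_action(two_swaps)))   # [1, 1, 2, 2]
```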

🔷 Dihedral groups and polygon colorings

🔷 The dihedral group D₁₀ (pentagon)

  • Exercise 4 works through the dihedral group of the regular pentagon, which has 10 elements: 5 rotations and 5 reflections.
  • Rotations: r₁ = (1 2 3 4 5) is the 72° clockwise rotation; r₂, r₃, r₄ are successive powers.
  • Reflections: f₁ = (1)(2 5)(3 4) flips about the line through vertex 1; similarly for f₂ through f₅.
  • The exercise asks you to write all 10 elements in cycle notation using the vertex labeling.

🎨 Fixed colorings and stabilizers

  • Exercise 4(b): draw the black-white vertex colorings fixed by r₁ (rotation) and by f₁ (reflection).
    • A coloring is fixed by r₁ if rotating it 72° yields the same pattern; this forces all five vertices to have the same color (only two such colorings: all black or all white).
    • A coloring is fixed by f₁ if flipping about the axis through vertex 1 leaves it unchanged; vertex 1 is free, but vertices 2 and 5 must match, and vertices 3 and 4 must match.
  • Exercise 4(c): find the stabilizer of a specific coloring C (vertices 1, 2, 5 black; 3, 4 white).
    • The stabilizer is the set of group elements that leave C unchanged.
    • Check each of the 10 elements; only those that map black vertices to black and white to white belong to stab(C).

📐 Cycle index and counting

  • Exercise 4(d): compute the cycle index of D₁₀ by averaging the cycle structures of all 10 permutations.
  • Exercise 4(e): substitute xᵢ = 2 for every variable in the cycle index (two colors) to count all nonequivalent black-white colorings of the pentagon.
  • Exercise 4(f): substitute xᵢ = Xⁱ + Wⁱ to get a generating function, then extract the coefficient of X²W³ to count colorings with exactly two black and three white vertices; draw these colorings.
  • Don't confuse: the cycle index counts orbits (nonequivalent colorings), not individual colorings.
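For parts (e) and (f), brute force over the 32 colorings gives the expected answers. A sketch with positions 0–4:

```python
from itertools import product

# D10 on pentagon positions 0..4: rotations i -> i+r and reflections i -> r-i (mod 5)
D10 = [tuple((i + r) % 5 for i in range(5)) for r in range(5)] + \
      [tuple((r - i) % 5 for i in range(5)) for r in range(5)]

def canonical(c):
    return min(tuple(c[p[i]] for i in range(5)) for p in D10)

orbits = {canonical(c) for c in product("BW", repeat=5)}
print(len(orbits))                              # 8 nonequivalent colorings
print(sum(o.count("B") == 2 for o in orbits))   # 2 with exactly two black vertices
```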

🎵 Applications to scales, molecules, and graphs

🎵 Musical scales (Exercise 6)

  • Classical Thai music uses a 7-note octave (equally spaced).
  • A scale is a subset of these 7 notes; two scales are equivalent if one is a transposition (cyclic rotation) of the other.
  • The acting group is C₇ (cyclic group of order 7).
  • The exercise asks for the number of nonequivalent k-note scales for each k from 1 to 7.
  • Method: use the cycle index of C₇ and substitute appropriately to count subsets of size k up to rotation.
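A sketch of the count by brute-force rotation orbits (the answers agree with the cycle-index method the exercise intends):

```python
from itertools import product
from collections import Counter

n = 7   # notes in the Thai octave; the group is C7 (rotations only)

def canonical(c):
    return min(tuple(c[(i + r) % n] for i in range(n)) for r in range(n))

orbits = {canonical(c) for c in product((0, 1), repeat=n)}
by_size = Counter(sum(o) for o in orbits)
print([by_size[k] for k in range(n + 1)])   # [1, 1, 3, 5, 5, 3, 1, 1]
```

Because 7 is prime, every k with 0 < k < 7 gives exactly C(7, k)/7 scales.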

🧪 Chemical isomers (Exercise 7)

  • Xylene: a benzene ring (hexagon of carbons) with two methyl groups and four hydrogens attached.
  • Two arrangements are isomers if they are not equivalent under the symmetries of the hexagon (dihedral group D₁₂).
  • The exercise asks: how many distinct isomers exist?
  • This is a coloring problem: color two positions "methyl" and four positions "hydrogen," then count orbits under D₁₂.

🔗 Graph isomorphism (Exercises 9–10)

  • Exercise 9: draw all nonisomorphic graphs on 4 vertices with 3 edges, and with 4 edges.
    • Nonisomorphic means not equivalent under any relabeling of vertices.
  • Exercise 10: use the pair group S⁽²⁾₅ (acting on the 10 edges of K₅) to find the cycle index, then count nonisomorphic graphs on 5 vertices.
    • Substitute xᵢ = 1 + eⁱ (edge absent or present) to get a generating function in e.
    • The coefficient of e⁶ gives the number of nonisomorphic 5-vertex graphs with exactly 6 edges.
  • The excerpt's opening remark (page 96) warns that counting graphs where every vertex has degree r is harder, because vertex-degree constraints create dependencies that Pólya's theorem does not handle directly.

🎲 Tic-tac-toe boards and cube painting

🎲 Tic-tac-toe (Exercise 11)

  • The 3×3 grid has 9 squares; two boards are equivalent if related by rotation or reflection (dihedral group D₈ acting on a square).
  • Exercise 11(a): represent each of the 8 elements of D₈ as a permutation of the 9 squares (numbered as in Figure 15.16).
    • Example: 90° clockwise rotation r₁ = (1 3 9 7)(2 6 8 4)(5) in cycle notation.
  • Exercise 11(b): find the cycle index of D₈ in terms of these 9-square permutations.
  • Exercise 11(c): substitute xᵢ = Xⁱ + Oⁱ + Bⁱ (three symbols: X, O, blank) to get a generating function t(X, O) where the coefficient of XⁱOʲ counts nonequivalent boards with i X's and j O's.
  • Exercise 11(d): total number of nonequivalent boards (sum all coefficients).
  • Exercise 11(e): boards with three X's and three O's (extract coefficient of X³O³).
  • Exercise 11(f): in an actual game, players alternate, so valid boards have equal X's and O's or one more X; use t(X, O) to count only these boards.
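Parts (a), (b), and (d) can be checked with Burnside's Lemma alone. A sketch with squares indexed 0–8 row by row (our indexing, not necessarily Figure 15.16's):

```python
def rot(i):                      # 90-degree rotation of the 3x3 board
    r, c = divmod(i, 3)
    return c * 3 + (2 - r)

def flip(i):                     # left-right mirror
    r, c = divmod(i, 3)
    return r * 3 + (2 - c)

perms, p = [], tuple(range(9))   # build the 8 symmetries: rotations and reflections
for _ in range(4):
    perms.append(p)
    perms.append(tuple(flip(i) for i in p))
    p = tuple(rot(i) for i in p)

def cycle_count(perm):
    seen, count = set(), 0
    for s in range(9):
        if s not in seen:
            count += 1
            while s not in seen:
                seen.add(s)
                s = perm[s]
    return count

# Burnside with three symbols (X, O, blank): average of 3**(number of cycles)
boards = sum(3 ** cycle_count(q) for q in perms) // len(perms)
print(boards)   # 2862 nonequivalent boards
```

Note this counts all symbol assignments; part (f)'s game constraint (equal X's and O's, or one extra X) still has to be filtered through t(X, O).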

🎨 Cube face painting (Exercise 12)

  • Paint the 6 faces of a cube with white, gold, or blue; two cubes are equivalent if one can be rotated to match the other.
  • The rotation group of the cube has 24 elements (not the full symmetry group, which has 48 if reflections are included).
  • Hint: label faces U, D, F, B, L, R and work with a physical model to identify the 24 rotations and their cycle structures on the 6 faces.
  • Find the cycle index, substitute xᵢ = Wⁱ + Gⁱ + Bⁱ, and extract:
    • Total nonequivalent paintings.
    • Paintings with exactly two faces of each color (coefficient of W²G²B²).
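A sketch that generates the 24 rotations from two quarter turns and counts orbits by brute force (the face labels and chosen axes are our assumptions):

```python
from itertools import product

# faces U=0, D=1, F=2, B=3, L=4, R=5; two generating quarter turns
a = (0, 1, 5, 4, 2, 3)   # about the U-D axis: F->R->B->L->F
b = (5, 4, 2, 3, 0, 1)   # about the F-B axis: U->R->D->L->U

def compose(p, q):
    return tuple(p[q[i]] for i in range(6))

group = {tuple(range(6))}            # closure of {a, b} under composition
frontier = [tuple(range(6))]
while frontier:
    g = frontier.pop()
    for h in (a, b):
        new = compose(h, g)
        if new not in group:
            group.add(new)
            frontier.append(new)

def canonical(c):
    return min(tuple(c[p[i]] for i in range(6)) for p in group)

orbits = {canonical(c) for c in product("WGB", repeat=6)}
two_each = sum(sorted(o) == list("BBGGWW") for o in orbits)
print(len(group), len(orbits), two_each)   # 24 rotations, 57 paintings, 6 with two of each
```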

🔢 Cyclic groups and generating functions

🔢 Cyclic group C₁₂ (Exercise 5)

  • Write all 12 permutations in the cyclic group of order 12 in cycle notation.
  • C₁₂ = {identity, generator g, g², …, g¹¹} where g is a 12-cycle.
  • Each element gᵏ has cycle structure determined by gcd(k, 12): gᵏ consists of gcd(k,12) cycles each of length 12/gcd(k,12).
  • Example: g⁴ has gcd(4,12)=4 cycles, each of length 3.

📊 Stabilizers from tables (Exercise 3)

  • Exercise 3 refers to Table 15.2 (colorings of square vertices under D₈) and asks for the stabilizer of two specific colorings C₃ and C₁₆.
  • Method: look up which group elements fix each coloring in the table.
  • The stabilizer is a subgroup; its size (by the orbit-stabilizer theorem) determines the orbit size.

🧩 Summary of techniques

Exercise typeGroupWhat is permutedKey technique
Permutation basics (1–2)ArbitraryElementsCycle notation, composition
Pentagon colorings (4)D₁₀5 verticesCycle index, substitution
Thai scales (6)C₇7 notesCyclic symmetry, subset counting
Xylene isomers (7)D₁₂6 ring positionsDihedral symmetry, 2+4 coloring
Graph isomorphism (9–10)S⁽²⁾₅10 edgesPair group, generating function
Tic-tac-toe (11)D₈9 squaresMulti-symbol coloring, game constraints
Cube painting (12)Rotation group6 faces3D symmetry, three-color problem
  • All exercises rely on the same core idea: identify the symmetry group, write its elements in cycle notation (acting on the appropriate set), compute the cycle index, and substitute to count orbits.
  • Don't confuse: the "same" group (e.g., D₈) acts differently depending on the object (square vertices vs. tic-tac-toe squares), so cycle structures differ.

16.1 On-Line algorithms

🧭 Overview

🧠 One-sentence thesis

On-line algorithms face fundamental limitations because decisions must be made without full information, yet for some problems like poset partitioning, simple strategies can achieve performance close to the optimal off-line result.

📌 Key points (3–5)

  • What on-line means: decisions must be made as information arrives, round by round, without knowing future inputs—unlike off-line problems where all data is available upfront.
  • Graph coloring example: even for forests (the easiest graphs to color off-line), an adversarial Builder can force Assigner to use logarithmically many colors in an on-line setting.
  • Poset partitioning example: Assigner can partition any poset of height h into at most (h+1 choose 2) antichains on-line, and this bound is tight—Builder can force exactly that many.
  • Common confusion: on-line vs off-line—forests are trivial to 2-color off-line (they are bipartite), but on-line they require up to log(n) colors; the lack of future information fundamentally changes the problem.
  • Why it matters: many real-world decisions (construction projects, investments, emergency actions) are irreversible and must be made on-line, so understanding on-line algorithm limits is practical.

🎮 The on-line game framework

🎮 Two-player structure

On-line algorithm setting: a game between Builder (adversary) and Assigner (algorithm), played in rounds where Builder reveals information incrementally and Assigner must make irrevocable decisions.

  • Builder controls the input, presenting it piece by piece.
  • Assigner must respond immediately to each new piece without knowing what comes next.
  • The goal is to measure how much worse Assigner performs compared to an off-line algorithm that sees all input at once.

🔄 Round-by-round play

  • Round 1: Builder presents initial data; Assigner makes a decision.
  • Subsequent rounds: Builder adds new data and specifies its relationship to previous data; Assigner extends her solution.
  • Assigner's past decisions cannot be changed—this is the core constraint.

Example: In graph coloring, Builder adds one vertex per round and declares which previous vertices are adjacent to it; Assigner must immediately assign a color different from all its neighbors' colors.

🌲 Graph coloring: forests are hard on-line

🌲 The forest coloring game

  • Builder and Assigner agree on a class C of graphs (e.g., forests).
  • Each round: Builder presents a new vertex and declares which previous vertices are adjacent to it.
  • Assigner must color the new vertex differently from its neighbors.

Off-line baseline: Forests (including trees and paths) are bipartite, so they can always be 2-colored off-line.

🎯 Forcing three colors on a 4-vertex path

The excerpt gives a detailed example showing Builder can force 3 colors even on a path with 4 vertices:

  • Round 1: Builder presents vertex x; Assigner colors it (say, color 1).
  • Round 2: Builder presents vertex y, not adjacent to x.
    • If Assigner uses a new color (color 2) on y:
      • Round 3: Builder presents z adjacent to both x and y → Assigner forced to use color 3.
      • Round 4: Builder presents w adjacent to y but not x or z → 3 colors already used.
    • If Assigner reuses color 1 on y:
      • Round 3: Builder presents z adjacent to x but not y → Assigner uses color 2.
      • Round 4: Builder presents w adjacent to both y and z → Assigner forced to use color 3.

Key insight: No matter what Assigner does, Builder can force 3 colors on a 4-vertex path.

📈 General result: exponential colors for forests

Theorem 16.2: Builder has a strategy to construct a forest with at most 2^(n−1) vertices while forcing Assigner to use n colors.

Proof sketch (induction):

  • Base cases: n=1 (1 vertex, 1 color), n=2 (2 adjacent vertices, 2 colors), n=3 (4-vertex path, 3 colors as above).
  • Inductive step: Assume Builder can force i colors on forests F₁, F₂, ..., Fₖ (each Fᵢ forces i colors and has ≤ 2^(i−1) vertices).
    • Builder constructs all these forests with no edges between them.
    • From each Fᵢ, Builder picks a vertex yᵢ such that all colors φ(y₁), φ(y₂), ..., φ(yₖ) are distinct (possible because Fᵢ uses at least i colors).
    • Builder presents a new vertex x adjacent to all of {y₁, y₂, ..., yₖ} and nothing else.
    • Assigner must use a (k+1)-th color on x.
    • Total vertices: (1 + 2 + 4 + ... + 2^(k−1)) + 1 = (2^k − 1) + 1 = 2^k, so k+1 colors are forced on at most 2^k vertices.

Don't confuse: This is not saying forests need n colors in general—off-line they need only 2. The point is that without knowing the future structure, Assigner cannot avoid using many colors.
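The recursive strategy can be played out against a concrete Assigner. A sketch using First-Fit as the Assigner (the proof works against any Assigner; First-Fit just makes the run deterministic, and conveniently leaves each sub-forest's returned vertex with a distinct color):

```python
# Builder presents vertices one at a time; the Assigner colors each on arrival
# with the least color not used on its declared neighbors (First-Fit).
colors = []                      # colors[v] = color given to vertex v on arrival

def present(neighbors):
    used = {colors[u] for u in neighbors}
    c = 1
    while c in used:
        c += 1
    colors.append(c)
    return len(colors) - 1

def force(n):
    """Build a forest on which First-Fit is forced to use an nth color."""
    if n == 1:
        return present([])
    # sub-forests forcing 1, ..., n-1 colors, with no edges between them
    roots = [force(i) for i in range(1, n)]
    return present(roots)        # new vertex adjacent to one vertex of each color

v = force(5)
print(colors[v], len(colors))    # color 5 forced, on 2**4 = 16 vertices
```

The graph is a forest: each new vertex attaches to at most one vertex in each previously separate component, so no cycle can close.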

📊 Poset partitioning: doing well on-line

📊 The poset partitioning game

  • Builder constructs a poset P one point at a time.
  • Each round: Builder presents a new point x and declares which previous points are less than x, greater than x, or incomparable with x.
  • Assigner must assign x to an antichain (a set of mutually incomparable elements).

Off-line baseline: A poset of height h can be partitioned into exactly h antichains by recursively removing minimal elements (Mirsky's theorem, the dual of Dilworth's theorem).

✅ Assigner's strategy: (r, s)-labeling

Theorem 16.4: Assigner can partition any poset of height ≤ h into at most (h+1 choose 2) antichains on-line.

Assigner's strategy when point xₙ arrives:

  1. Compute r = length of longest chain in {x₁, ..., xₙ} with xₙ as its least element.
  2. Compute s = length of longest chain in {x₁, ..., xₙ} with xₙ as its greatest element.
  3. Place xₙ in antichain A(r, s).

Why this works (proof that A(r, s) is indeed an antichain):

  • Suppose y is already in A(r, s) and y is comparable to the new point x.
  • When y was added, there existed a chain C′ of r points with y as minimum and a chain D′ of s points with y as maximum.
  • If y > x: add x below C′ → chain of r+1 points with x as minimum → x should be in A(r+1, s), not A(r, s). Contradiction.
  • If y < x: add x above D′ → chain of s+1 points with x as maximum → x should be in A(r, s+1), not A(r, s). Contradiction.
  • Therefore y and x must be incomparable.

Counting antichains used:

  • Each antichain corresponds to a pair (i, j) of positive integers with i + j − 1 ≤ h (since any chain through a point has total length r + s − 1).
  • Equivalently, i + j ≤ h + 1.
  • Number of such pairs = (h+1 choose 2) = (h+1)·h / 2.

Example: For h=3, Assigner uses at most (4 choose 2) = 6 antichains, compared to the off-line optimum of 3.
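The (r, s)-strategy is short enough to implement directly. A sketch that feeds it random test posets (componentwise order on integer vectors is our choice of generator, not the book's):

```python
import random
from itertools import combinations

random.seed(42)
pts = [tuple(random.randrange(6) for _ in range(3)) for _ in range(40)]

def less(a, b):                  # a < b in the poset: strict componentwise dominance
    return a != b and all(x <= y for x, y in zip(a, b))

def longest(x, arrived, up):
    # longest chain in arrived + {x} with x as least (up) / greatest element
    best = 1
    for y in arrived:
        if (less(x, y) if up else less(y, x)):
            best = max(best, 1 + longest(y, arrived, up))
    return best

classes = {}                     # (r, s) -> points placed in antichain A(r, s)
arrived = []
for x in pts:
    r, s = longest(x, arrived, True), longest(x, arrived, False)
    classes.setdefault((r, s), []).append(x)
    arrived.append(x)

h = max(longest(x, pts, True) for x in pts)          # height of the final poset
ok = all(not less(a, b) and not less(b, a)
         for cls in classes.values() for a, b in combinations(cls, 2))
print(ok, len(classes) <= h * (h + 1) // 2)          # each class is an antichain,
                                                     # and at most C(h+1, 2) are used
```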

🔒 Tightness: Builder can force (h+1 choose 2) antichains

Theorem 16.5 (Szemerédi): For every h, Builder has a strategy to build a poset of height h forcing Assigner to use at least (h+1 choose 2) antichains, with at least h antichains used on maximal elements alone.

Proof sketch (induction):

  • Base: h=1, present 1 point → 1 antichain (and (2 choose 2) = 1).
  • Inductive step: Assume strategy Sₕ works for height h.
    • Builder follows Sₕ twice to build disjoint posets P₁ and P₂ (all points in P₁ incomparable to all in P₂).
    • Case 1: If Assigner uses h+1 or more antichains on the maximal elements of P₁ ∪ P₂:
      • Builder follows Sₕ a third time to build P₃ with all points of P₃ less than all maximal elements of P₁ ∪ P₂.
      • Height is h+1; antichains used: (h+1) + (h+1 choose 2) = (h+2 choose 2).
    • Case 2: If Assigner uses exactly the same h antichains W on maximal elements of both P₁ and P₂:
      • Builder presents new point x greater than all of P₁ and incomparable with all of P₂.
      • Assigner must use a new antichain (not in W) for x.
      • Builder follows Sₕ a third time to build P₃ with all points less than x and the maximal elements of P₂.
      • Again, h+1 antichains on maximal elements and (h+2 choose 2) total.

Key takeaway: The simple (r, s)-strategy is optimal—no cleverer on-line algorithm can do better.

🔍 Comparing on-line and off-line performance

Problem | Off-line optimum | On-line performance | Gap
Forest coloring (n vertices) | 2 colors (bipartite) | ≈ log₂ n colors forced | Exponential
Poset partitioning (height h) | h antichains (Mirsky) | (h+1 choose 2) antichains | Quadratic (but polynomial)

Don't confuse:

  • "Hard on-line" does not mean the problem is hard off-line.
  • Forests are trivial off-line but adversarially hard on-line.
  • Poset partitioning is harder on-line but only by a polynomial factor, and the on-line algorithm is extremely simple.

🌍 Real-world relevance

The excerpt opens with practical examples of irreversible on-line decisions:

  • Construction projects: commit years before ground is broken.
  • Investment decisions: made with today's information, may look unwise with tomorrow's news.
  • Emergency actions: "deciding to exit a plane with a parachute is rarely reversible."

These illustrate why understanding on-line algorithm limits matters beyond pure theory.


16.2 Extremal Set Theory

🧭 Overview

🧠 One-sentence thesis

Extremal set theory determines the maximum size of a family of subsets satisfying specific constraints, such as pairwise intersection or size restrictions, with results ranging from simple bounds to deep theorems like Erdős–Ko–Rado.

📌 Key points (3–5)

  • Core question: What is the maximum size of a family of subsets of {1, 2, ..., n} when the family must satisfy certain properties?
  • Two basic examples: families where all pairs intersect (max size 2^(n−1)), and Sperner families where no set contains another (max size "n choose floor(n/2)").
  • Erdős–Ko–Rado theorem: for k-element subsets that all pairwise intersect, the maximum family size is "(n−1) choose (k−1)" when n ≥ 2k.
  • Common confusion: extremal families can be unique (Sperner) or highly non-unique (pairwise intersection)—the number of optimal configurations varies by problem.
  • Proof techniques: counting arguments (complementary pairs) and clever combinatorial constructions (circular arrangements).

🎯 The fundamental question

🎯 What extremal set theory asks

Given a positive integer n and the set [n] = {1, 2, ..., n}, what is the maximum size of a family F of subsets of [n] when F must satisfy certain properties?

  • The "family" is a collection of subsets.
  • "Maximum size" means the largest number of subsets you can include while obeying the constraints.
  • Different constraints lead to different maximum sizes and different extremal families.

🔍 Why "extremal"

  • The term refers to finding extreme values (maxima or minima) under constraints.
  • The excerpt focuses on maximization problems: how large can F be?

📐 Example: Pairwise intersection constraint

📐 The constraint and bound

Problem: Find the maximum size of F where every pair of sets A, B in F has non-empty intersection (A ∩ B ≠ ∅).

Answer: 2^(n−1)

🔽 Lower bound construction

  • Consider the family F of all subsets of [n] that contain the element 1.
  • This family has 2^(n−1) elements (half of all 2^n subsets).
  • Any two sets in this family both contain 1, so their intersection is non-empty.
  • Example: For n=3, F = {{1}, {1,2}, {1,3}, {1,2,3}} has 4 = 2^2 sets, all containing 1.

🔼 Upper bound proof

  • The key insight: whenever a subset S is in F, its complement S′ cannot be in F (because S ∩ S′ = ∅).
  • The 2^n subsets of [n] can be organized into 2^(n−1) complementary pairs.
  • At most one set from each pair can belong to F.
  • Therefore |F| ≤ 2^(n−1).

🌳 Many extremal families

  • For every complementary pair, you can choose either member.
  • This gives many different families that achieve the maximum size.
  • Don't confuse with: problems where the extremal family is essentially unique (see Sperner below).

📏 Example: Sperner's theorem

📏 The constraint and bound

Problem: Find the maximum size of F where no set in F is a proper subset of another (an "antichain" property).

Answer: "n choose floor(n/2)"

🎯 The extremal family is nearly unique

  • When F achieves the maximum size, F consists of either:
    • All subsets of size floor(n/2), or
    • All subsets of size ceiling(n/2).
  • When n is even, these are exactly the same family.
  • When n is odd, there are two extremal families (one for each middle size).
  • This is a very small number of extremal configurations, contrasting sharply with the pairwise intersection case.
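Sperner's bound can likewise be checked by brute force for tiny n (a sketch with our own helper name; frozenset's `<` operator tests proper containment):

```python
from itertools import combinations

def max_antichain(n):
    """Size of the largest antichain of subsets of [n] (no member a proper
    subset of another), by exhaustive search over all families."""
    subs = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(1, n + 1), r)]
    best = 0
    for mask in range(1 << len(subs)):
        fam = [s for i, s in enumerate(subs) if mask >> i & 1]
        if all(not (a < b or b < a) for a, b in combinations(fam, 2)):
            best = max(best, len(fam))
    return best
```

For n = 3 the search returns 3 = "3 choose 1", the size of a middle layer.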

🏆 Erdős–Ko–Rado theorem

🏆 The setup and result

Theorem (Erdős, Ko, Rado): Let n and k be positive integers with n ≥ 2k. The maximum size of a family F of subsets of [n] satisfying:

  1. A ∩ B ≠ ∅ for all A, B in F, and
  2. |A| = k for all A in F,

is "(n−1) choose (k−1)".

  • This combines two constraints: pairwise intersection and fixed size k.
  • The condition n ≥ 2k is what makes the problem interesting: when n < 2k, any two k-element sets automatically intersect, so the family of all "(n) choose (k)" k-sets trivially qualifies.

🔽 Lower bound: the standard construction

  • Consider the family F of all k-element subsets of [n] that contain the element 1.
  • This family has size "(n−1) choose (k−1)" (choose the remaining k−1 elements from the n−1 elements {2, 3, ..., n}).
  • Any two sets in F both contain 1, so they intersect.

🔼 Upper bound: circular arrangement proof

🎡 The circular setup

  • Place n points p₁, p₂, ..., pₙ equally spaced around a circle.
  • For each permutation σ of [n], place the integers 1 through n at these points in the order given by σ.
  • There are n! such arrangements.

🧮 Counting consecutive blocks

For each permutation σ, define F(σ) as the subfamily of F consisting of sets whose k elements appear in a consecutive block around the circle.

Let t = sum over all σ of |F(σ)|.

Claim 1: t ≤ k · n!

  • For a fixed σ, if |F(σ)| = s ≥ 1, the union of the sets in F(σ) forms a consecutive block of positions.
  • Since n ≥ 2k, this block does not wrap around the entire circle.
  • There is a "first" set S in F(σ): the one whose k-element block begins earliest in clockwise order.
  • Every other set in F(σ) must begin at one of the k − 1 positions strictly inside S's block (it starts no earlier than S yet must intersect S), and at most one set of F(σ) can begin at each position.
  • Therefore |F(σ)| ≤ 1 + (k − 1) = k.
  • Summing over all n! permutations: t ≤ k · n!.

Claim 2: Each set S in F belongs to F(σ) for exactly n · k! · (n−k)! permutations σ.

  • There are n positions around the circle where the block of k consecutive positions containing S can start.
  • There are k! ways to order the k elements of S within the block.
  • There are (n−k)! ways to order the remaining n−k elements.
  • Total: n · k! · (n−k)! permutations.

🔗 Combining the counts

  • Counting t set by set, Claim 2 gives: t = |F| · n · k! · (n−k)!
  • From Claim 1: |F| · n · k! · (n−k)! ≤ k · n!
  • Solving: |F| ≤ k · n! / (n · k! · (n−k)!) = (n−1)! / ((k−1)!(n−k)!) = "(n−1) choose (k−1)".

🧩 Why the circular argument works

  • The circular arrangement "spreads out" the counting problem.
  • By considering all permutations, we avoid directly analyzing the complex intersection structure.
  • The constraint n ≥ 2k ensures that consecutive blocks of size k cannot cover the entire circle, which is crucial for the upper bound argument.
  • Example scenario: For n=5, k=2, a 2-element set like {1,3} appears as consecutive in some circular arrangements (e.g., ...1,3,...) but not in others.
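The EKR bound itself can be verified exhaustively for tiny parameters (a sketch; the helper name is ours, and the search is exponential):

```python
from itertools import combinations

def max_intersecting_k_family(n, k):
    """Largest pairwise-intersecting family of k-element subsets of [n],
    by exhaustive search over all families (tiny n only)."""
    subs = [frozenset(c) for c in combinations(range(1, n + 1), k)]
    best = 0
    for mask in range(1 << len(subs)):
        fam = [s for i, s in enumerate(subs) if mask >> i & 1]
        if all(a & b for a, b in combinations(fam, 2)):
            best = max(best, len(fam))
    return best
```

For n = 5, k = 2 the search returns 4 = "4 choose 1", matching the theorem.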

🔄 Comparison of extremal families

Problem | Constraint | Maximum size | Number of extremal families
Pairwise intersection | A ∩ B ≠ ∅ for all A, B | 2^(n−1) | Many (one choice per complementary pair)
Sperner (antichain) | No set contains another | "n choose floor(n/2)" | One or two (middle layer(s))
Erdős–Ko–Rado | Pairwise intersection + all sets size k | "(n−1) choose (k−1)" | Not specified in excerpt

Common confusion: The number of extremal configurations is problem-dependent; some constraints force a unique structure, others allow many optimal solutions.


16.3 Markov Chains

🧭 Overview

🧠 One-sentence thesis

Markov chains model systems where transitions between states depend only on the current state with fixed probabilities, and under certain conditions these systems converge to stable long-term probability distributions.

📌 Key points (3–5)

  • What a Markov chain is: a finite set of states with fixed transition probabilities that do not depend on time or history—only on the current state.
  • Regular vs absorbing chains: regular chains converge to a stable distribution across all states; absorbing chains have "trap" states that, once entered, cannot be left.
  • Transition matrix: an n×n stochastic matrix where entry (j,k) gives the probability of moving from state Sⱼ to state Sₖ; all entries are non-negative and each row sums to 1.
  • Common confusion: the probability of being in a state after m moves vs the limiting probability as m approaches infinity—regular chains converge to a fixed distribution W regardless of starting state.
  • Key questions: how fast does convergence happen, what is the probability of reaching or being absorbed in a particular state, and how long until absorption occurs.

🎲 Core structure of Markov chains

🎲 States and transitions

A Markov chain consists of:

  • A finite set of states S₁, S₂, …, Sₙ
  • At each time step i, the system is in exactly one state
  • Transition probabilities p(j,k) that are fixed and do not depend on time i

The defining property: if you are in state Sⱼ at time i, there is a fixed probability p(j,k) that you will be in state Sₖ at time i+1, independent of how you arrived at Sⱼ or what time it is.

Example: The motivational example uses a connected graph with six vertices. At each step, if you are at vertex x with d neighbors, you move to each neighbor with probability 1/d. The vertices are the states; the movement rule defines the transition probabilities.
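The movement rule translates directly into a transition matrix. A minimal sketch (the six-vertex graph from the text is not reproduced in the excerpt, so the small graph below is our own example):

```python
def walk_matrix(adj):
    """Transition matrix of the simple random walk on a graph: from vertex x
    with d neighbours, move to each neighbour with probability 1/d.

    adj maps each vertex to its list of neighbours; the result is a
    dict-of-dicts with P[x][y] = 1/deg(x) for each edge xy."""
    return {x: {y: 1.0 / len(nbrs) for y in nbrs} for x, nbrs in adj.items()}

# Illustrative four-vertex graph: a triangle 1-2-3 with a pendant vertex 4.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
P = walk_matrix(adj)
```

Each row of P sums to 1, so P is stochastic, as the next subsection requires.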

📊 Transition matrix

Transition matrix P: an n×n matrix whose (j,k) entry is the probability p(j,k) of moving from state Sⱼ to state Sₖ.

Properties:

  • P is a stochastic matrix: all entries are non-negative and all row sums equal 1
  • Conversely, every square stochastic matrix can be viewed as the transition matrix of some Markov chain
  • The matrix encodes the entire probabilistic behavior of the chain

The excerpt provides a 6×6 example transition matrix for the graph, showing probabilities like 1/4, 1/2, 1/3 in various positions.

🔄 Regular Markov chains and convergence

🔄 What makes a chain regular

A transition matrix P is regular if there exists some integer m such that the matrix Pᵐ (P multiplied by itself m times) has only positive entries.

  • "Positive entries" means every state can reach every other state in exactly m steps with non-zero probability
  • The example matrix is regular because all entries of P³ are positive

🎯 The fundamental convergence theorem

Theorem 16.9 states that for a regular n×n transition matrix P:

  • There exists a row vector W = (w₁, w₂, …, wₙ) of positive real numbers summing to 1
  • As m approaches infinity, each row of Pᵐ converges to W
  • W satisfies WP = W (W is a fixed point under the transition)
  • For each state Sᵢ, the value wᵢ is the limiting probability of being in state Sᵢ

What this means:

  • No matter which state you start in, after many steps the probability distribution over states approaches the same vector W
  • The long-run behavior is independent of the initial state
  • Example: for the 6×6 matrix, W = (5/13, 3/13, 2/13, 2/13, 1/13, 1/13)

Don't confuse: the probability pₓ,ₘ of being at vertex x after m moves vs the limiting probability wₓ as m→∞. The sequence pₓ,ₘ converges to wₓ, but how fast convergence happens is a more subtle question.

🔍 Open questions for regular chains

Even with Theorem 16.9, important practical questions remain:

  • Convergence speed: how fast does Pᵐ approach W?
  • Coverage: how many moves are needed to ensure you have visited every edge (or made every transition) with high probability (e.g., ≥0.999)?

The excerpt notes these are "more subtle" and not fully answered by the basic theorem.

🕳️ Absorbing Markov chains

🕳️ What is an absorbing state

A state Sᵢ is absorbing if pᵢ,ᵢ = 1 and pᵢ,ⱼ = 0 for all j ≠ i.

  • Once you enter an absorbing state, you remain there forever
  • The excerpt colorfully describes this as "like the infamous Hotel California, once you are in state Sᵢ, 'you can never leave.'"

Example: The excerpt modifies the 6×6 matrix by making states 4 and 5 absorbing. State 4 might represent a "safe harbor" (escape point); state 5 might represent meeting a "hungry tiger" (an unpleasant absorption).

🕳️ What makes a chain absorbing

A Markov chain is absorbing if:

  1. There is at least one absorbing state, and
  2. For each non-absorbing state Sⱼ, it is possible to reach an absorbing state (possibly in many steps).
  • Not every state needs to be absorbing; there must be a mix
  • From any transient (non-absorbing) state, eventual absorption is guaranteed

❓ Key questions for absorbing chains

The excerpt lists three natural questions:

  1. Forward absorption probability: If we start in non-absorbing state Sᵢ, what is the probability of being absorbed in absorbing state Sⱼ?
  2. Backward inference: If we are absorbed in state Sⱼ, what is the probability that we started in non-absorbing state Sᵢ?
  3. Expected absorption time: If we start in non-absorbing state Sᵢ, what is the expected number of steps before absorption?

Don't confuse: absorbing chains with regular chains. Regular chains spread probability across all states in the long run; absorbing chains concentrate all probability in absorbing states eventually, and the interesting questions are about which absorbing state and how long it takes.

🧮 Computation and techniques

🧮 Computing the limiting vector W

  • Given Theorem 16.9, computing W can be done using eigenvalue techniques from undergraduate linear algebra
  • W satisfies WP = W, so W is a left eigenvector of P with eigenvalue 1
  • The example gives W = (5/13, 3/13, 2/13, 2/13, 1/13, 1/13) for the original 6×6 matrix
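Theorem 16.9 can be watched in action with a few lines of iteration. The 3×3 matrix below is our own illustrative regular chain, not the 6×6 matrix from the text; its limiting vector works out to (1/4, 1/2, 1/4):

```python
def step(w, P):
    """One step of w <- wP (row vector times stochastic matrix)."""
    n = len(P)
    return [sum(w[i] * P[i][j] for i in range(n)) for j in range(n)]

# A small regular transition matrix (illustrative only).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

w = [1.0, 0.0, 0.0]        # start in state S1 with certainty
for _ in range(100):       # the rows of P^m converge, so w converges to W
    w = step(w, P)
```

Starting from [0, 0, 1] instead gives the same limit, illustrating that the long-run behaviour is independent of the initial state; the limit also satisfies WP = W.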

⚠️ Subtleties beyond the basic theorem

The excerpt emphasizes that while Theorem 16.9 guarantees convergence:

  • The rate of convergence (how fast Pᵐ approaches W) is a more complex question
  • Questions about coverage (visiting all edges/transitions) require additional analysis
  • The proof of Theorem 16.9 itself is "a bit too complex to prove given our space constraints"

16.4 The Stable Matching Theorem

🧭 Overview

🧠 One-sentence thesis

The Stable Matching Theorem proves that for any set of preference orderings among n males and n females, a stable matching always exists and can be found through an efficient algorithm where males propose in order of preference and females hold onto their best current option.

📌 Key points (3–5)

  • What stable matching means: a one-to-one pairing where no two people would both prefer to leave their assigned partners for each other.
  • The core result: regardless of individual preferences, a stable matching can always be generated.
  • The algorithm mechanism: males propose sequentially down their preference lists; females hold the best suitor so far and reject others.
  • Common confusion: the algorithm is asymmetric—females' prospects improve over time (they keep upgrading), while males' prospects deteriorate (they move down their lists after rejections).
  • Why it matters: the method is both constructive (produces an actual matching) and efficient (terminates with each female holding exactly one male).

💑 The stable matching problem

💑 Setup and preferences

  • There are n eligible males (b₁, b₂, …, bₙ) and n eligible females (g₁, g₂, …, gₙ).
  • Goal: arrange n marriages, each involving one male and one female.
  • Each female linearly orders all males by preference: for female i, there is a permutation σᵢ so that if she prefers male bⱼ to bₖ, then σᵢ(j) > σᵢ(k).
  • Each male linearly orders all females by preference: for male i, there is a permutation τᵢ so that if he prefers female gⱼ to gₖ, then τᵢ(j) > τᵢ(k).
  • Different people may have completely different preference orders.

🔒 What "stable" means

A one-to-one matching of the n males to the n females is stable if there do not exist two males b and b′ and two females g and g′ so that:

  1. b is matched to g and b′ is matched to g′;
  2. b prefers g′ to g; and
  3. g′ prefers b to b′.
  • In plain language: no pair of people (one male, one female) would both prefer each other over their assigned partners.
  • If such a pair existed, they would be "mutually inclined to dissolve their relationship and initiate dalliances with other partners."
  • The question: can we always generate a stable matching, no matter what the preferences are?

🚪 The algorithm

🚪 How the proposal process works

The algorithm proceeds in stages:

Stage 1:

  • All males go knock on the door of the female who is tops on their list.
  • Some females may have more than one caller; others may have none.
  • If a female has one or more males at her door, she grabs the one she prefers most by the collar and tells the others to go away.

Subsequent stages:

  • Any male rejected at a step proceeds to the door of the female who is next on his list.
  • Again, each female with one or more suitors chooses the best among them and sends the others away.
  • This continues until eventually each female is holding onto exactly one male.

Example: Male b₁ starts at his top choice. If rejected, he moves to his second choice, then third, and so on, until some female accepts him (holds onto him).
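The proposal process sketched above is straightforward to implement (a sketch; the names and data layout are ours):

```python
def stable_matching(male_prefs, female_prefs):
    """Male-proposal deferred acceptance: males work down their lists,
    females hold the best suitor seen so far and reject the rest.

    Preference lists run from most to least preferred.
    Returns a dict mapping each female to the male she ends up holding."""
    rank = {g: {b: i for i, b in enumerate(prefs)}
            for g, prefs in female_prefs.items()}
    next_choice = {b: 0 for b in male_prefs}   # pointer into each male's list
    holding = {}                               # female -> male she is holding
    free = list(male_prefs)
    while free:
        b = free.pop()
        g = male_prefs[b][next_choice[b]]      # knock on the next door
        next_choice[b] += 1
        if g not in holding:
            holding[g] = b                     # no suitor yet: she holds b
        elif rank[g][b] < rank[g][holding[g]]:
            free.append(holding[g])            # she upgrades to b
            holding[g] = b
        else:
            free.append(b)                     # rejected; b tries his next choice
    return holding
```

For instance, if both males rank g1 first and both females rank b2 first, the algorithm matches b2 with g1 and b1 with g2, which is stable.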

⚖️ Asymmetry in the process

Participant | How prospects change over time | Why
Females | Prospects improve | Once a female has a suitor, she only upgrades: each suitor she holds is at least as good as the previous one
Males | Prospects deteriorate | Males start at the top of their lists and work downward after each rejection
  • Don't confuse: both sides end up matched, but the algorithm favors females in terms of outcome quality.

✅ Why the matching is stable

✅ Proof of stability

The excerpt asserts that the resulting matching is stable and provides a proof by contradiction:

Suppose the matching is unstable:

  • Then there exist males b and b′, females g and g′ such that:
    • b is matched to g,
    • b prefers g′ to g, and
    • g′ prefers b to b′.

Why this leads to a contradiction:

  1. The algorithm requires male b to start at the top of his list and work his way down.
  2. Since b eventually lands on g's doorstep, and he prefers g′ to g, it means he must have visited g′'s door earlier (before settling for g).
  3. When b was at g′'s door, she sent him away—meaning she had a male in hand that she preferred to b at that exact moment.
  4. Since female g′'s holdings only improve with time, when the matching is finalized, she has a mate that she prefers to b.
  5. But this contradicts the assumption that g′ prefers b to her final match.
  • In plain language: if b prefers g′ but ended up with g, he must have been rejected by g′ earlier because she had someone better, and she never downgrades.

🎯 Key insight

  • The algorithm's structure ensures that any potential instability is impossible: if a male prefers someone else, that someone else already rejected him for someone she prefers more.
  • The proof relies on the monotonicity of female preferences during the algorithm (they only improve) and the sequential nature of male proposals (they move down their lists).

16.5 Zero–One Matrices

🧭 Overview

🧠 One-sentence thesis

The Gale-Ryser theorem establishes that a zero–one matrix with specified row and column sums exists if and only if the dual of the row sum partition is less than or equal to the column sum partition in a natural partial order on partitions.

📌 Key points (3–5)

  • What the problem asks: whether there exists an m×n matrix of only 0s and 1s with prescribed row sums and column sums.
  • Key transformation: row and column sum strings can be viewed as partitions of the same integer t, and we can assume both are non-increasing.
  • The dual partition: for any partition V, the dual V_d is constructed by counting how many entries in V are at least a certain threshold; the dual of the dual returns the original.
  • Common confusion: the necessary condition (R_d ≤ C in the poset) is not just a counting check—it compares cumulative partial sums term by term.
  • Why it matters: the Gale-Ryser theorem provides both a necessary and sufficient condition, and the proof is constructive, showing how to build the matrix step by step.

🔢 Problem setup and assumptions

🔢 What is a zero–one matrix with specified sums

When M is an m×n zero–one matrix, the row sum string R = (r₁, r₂, …, rₘ) is defined by rᵢ = sum of entries in row i; the column sum string C = (c₁, c₂, …, cₙ) is defined analogously.

  • The question: given non-negative integer strings R and C, does there exist an m×n matrix of 0s and 1s with those exact row and column sums?
  • Example: if R = (3, 2, 1) and C = (2, 2, 1, 1), we ask whether a 3×4 zero–one matrix exists with those row and column totals.

🧹 Simplifying assumptions

Without loss of generality, we may assume:

  1. Equal totals: There is a positive integer t such that the sum of all row sums equals the sum of all column sums equals t (otherwise no matrix can exist).
  2. Non-increasing order: Both R and C are non-increasing strings (r₁ ≥ r₂ ≥ … ≥ rₘ and c₁ ≥ c₂ ≥ … ≥ cₙ), because exchanging rows or columns does not change the problem.
  3. Positive entries only: All entries in R and C are positive integers, since zeroes correspond to rows or columns of all zeroes, which can be ignored.
  • After these assumptions, both R and C can be viewed as partitions of the integer t.

🔗 Partitions and the partial order

🔗 Partitions of an integer

A partition of a positive integer t is a non-increasing sequence of positive integers that sum to t.

  • Notation: P(t) denotes the family of all partitions of t.
  • Example: (5, 4, 3) and (5, 3, 3, 1) are both partitions of 12.

⚖️ The partial order on partitions

For partitions V = (v₁, v₂, …, vₘ) and W = (w₁, w₂, …, wₙ), we write V ≥ W if and only if m ≤ n and the cumulative partial sums of V are at least as large as those of W term by term: ∑(i=1 to j) vᵢ ≥ ∑(i=1 to j) wᵢ for each j = 1, 2, …, m.

  • This is a partial order on P(t).
  • It compares the "front-loaded" nature of partitions: V ≥ W means V accumulates its total faster than W.
  • Example: (5, 4, 3) > (5, 3, 3, 1) in P(12), because the partial sums (5, 9, 12) dominate (5, 8, 11) term by term; in fact this is a covering step (one unit moves from position 2 to a new part of size 1).

📐 Covering relation in the poset

Proposition 16.11 characterizes when one partition covers another (i.e., V is immediately above W with no partition in between):

  • If V covers W in P(t), then either n = m or n = m + 1 (W has the same number of parts as V, or one more).

  • There exist positions i and j with 1 ≤ i < j ≤ n such that:

    • Parts before position i are unchanged: v_k = w_k for k < i.
    • Parts after position j are unchanged: v_k = w_k for k > j.
    • At position i, vᵢ = wᵢ + 1 (V has one more unit).
    • At position j, either wⱼ = vⱼ + 1 (W has one more unit) or wⱼ = 1 (a new part of size 1 appears).
    • Between i and j, parts are equal: w_k = v_k = vᵢ - 1.
  • Don't confuse: covering is a very specific one-step change, not just any V > W.

🔄 The dual partition

🔄 Definition of the dual

Given a partition V = (v₁, v₂, …, vₘ) from P(t), the dual partition W = V_d is defined as follows: W has n = v₁ parts (the largest part of V), and for each j = 1, …, n, wⱼ is the number of entries of V that are at least j.

  • In other words, wⱼ counts how many parts of V are ≥ j; since V is non-increasing, so is its dual.
  • Example: the dual of V = (8, 6, 6, 6, 5, 5, 3, 1, 1, 1) is (10, 7, 7, 6, 6, 4, 1, 1).
  • Both V and V_d are partitions of the same integer (42 in the example).

🔁 Duality is an involution

  • If W = V_d, then V = W_d: the dual of the dual is the original partition.
  • This symmetry is key to the Gale-Ryser theorem.
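The dual is simple to compute: the j-th part counts how many parts of V are at least j. A sketch (`dual` is our own name):

```python
def dual(v):
    """Dual (conjugate) partition of a non-increasing partition v:
    part j of the dual counts the entries of v that are at least j."""
    return [sum(1 for x in v if x >= j) for j in range(1, v[0] + 1)]

V = [8, 6, 6, 6, 5, 5, 3, 1, 1, 1]
```

Both V and dual(V) are partitions of 42, and applying `dual` twice returns V, illustrating the involution.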

✅ The Gale-Ryser theorem

✅ Statement of the theorem

Theorem 16.12 (Gale-Ryser): Let R and C be partitions of a positive integer t. Then there exists a zero–one matrix with row sum string R and column sum string C if and only if R_d ≥ C in the poset P(t).

  • The condition R_d ≥ C is both necessary and sufficient.

🧪 Necessity: why the condition must hold

Suppose M is an m×n zero–one matrix with row sum string R and column sum string C.

  1. Construct a modified matrix M': for each row i, push the rᵢ ones as far left as possible (so m'ᵢ,ⱼ = 1 if and only if 1 ≤ j ≤ rᵢ).
  2. M and M' have the same row sum string R.
  3. The column sum string C′ of M′ is non-increasing, and the positive part C″ of C′ is exactly R_d (the dual of R).
  4. Shifting ones left can only increase partial sums, so for each j, the cumulative sum ∑(i=1 to j) c''ᵢ ≥ ∑(i=1 to j) cᵢ.
  5. Therefore R_d ≥ C in the poset P(t).
  • Don't confuse: this is not just checking total sums (which are equal by assumption), but comparing cumulative partial sums.
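The full Gale-Ryser test is a few lines: compute R_d and compare cumulative partial sums. A sketch (the function name is ours; r and c are assumed non-increasing with positive entries):

```python
def gale_ryser_feasible(r, c):
    """True iff a zero-one matrix with row sum string r and column sum
    string c exists, via the condition R_d >= C of Theorem 16.12."""
    if sum(r) != sum(c):
        return False                   # totals must agree
    # dual of r: part j counts the rows with at least j ones
    rd = [sum(1 for x in r if x >= j) for j in range(1, max(r) + 1)]
    if len(rd) > len(c):
        return False                   # R_d >= C requires R_d to have no more parts
    pr = pc = 0
    for j in range(len(rd)):           # compare cumulative partial sums
        pr += rd[j]
        pc += c[j]
        if pr < pc:
            return False
    return True
```

For example, R = (3, 2, 1) and C = (2, 2, 1, 1) pass the test, while R = (4, 1) with C = (2, 2, 1) fails (a row sum of 4 needs four columns).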

🔨 Sufficiency: constructing the matrix

The proof is constructive:

  1. In the poset P(t), find a chain W₀ > W₁ > … > Wₛ such that W₀ = R_d and Wₛ = C, where each Wₚ covers Wₚ₊₁.
  2. Start with a zero–one matrix M₀ having row sum string R and column sum string W₀ = R_d (this is the "left-pushed" matrix described above).
  3. For each step p from 0 to s-1, given a matrix Mₚ with row sum string R and column sum string Wₚ, use the covering relation (Proposition 16.11) to identify positions i and j.
  4. Find a row q where the (q, i) entry is 1 and the (q, j) entry is 0, and exchange these two entries to form Mₚ₊₁.
  5. This exchange changes the column sum string from Wₚ to Wₚ₊₁ while preserving the row sum string R.
  6. After s steps, we obtain a matrix Mₛ with row sum string R and column sum string C.
  • Example scenario: if R = (8, 4, 3, 1, 1, 1) and C is some partition of 18 with R_d ≥ C, the algorithm starts with the matrix where row 1 has eight 1s in the first eight columns, row 2 has four 1s in the first four columns, etc., then repeatedly swaps entries to transform the column sums step by step until they match C.
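The proof above walks a chain of covers. An equivalent and shorter route for actually producing a matrix is the standard greedy (Havel-Hakimi style) construction sketched below; note this is our substitute for the chain-walking procedure, not the proof's own algorithm, and it assumes a matrix exists:

```python
def build_matrix(r, c):
    """Greedily build a zero-one matrix with row sums r and column sums c,
    assuming one exists: fill columns from largest sum to smallest, always
    placing ones in the rows that still owe the most ones."""
    m, n = len(r), len(c)
    remaining = list(r)                # ones still owed by each row
    M = [[0] * n for _ in range(m)]
    for j in sorted(range(n), key=lambda j: -c[j]):
        for i in sorted(range(m), key=lambda i: -remaining[i])[:c[j]]:
            M[i][j] = 1
            remaining[i] -= 1
    return M
```

For R = (3, 2, 1) and C = (2, 2, 1, 1) this produces a valid 3×4 matrix with the prescribed sums.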

🔍 Key takeaways

🔍 Why the dual matters

  • The dual partition R_d captures the "column perspective" of the left-pushed matrix.
  • The condition R_d ≥ C ensures that the column sums can be "redistributed" from the left-pushed configuration to the desired configuration C without violating the row sums.

🔍 Constructive nature

  • The Gale-Ryser theorem is not just an existence result; it provides an explicit algorithm to build the matrix by a sequence of single-entry swaps.
  • Each swap corresponds to moving down one step in the poset chain from R_d to C.

16.6 Arithmetic Combinatorics

🧭 Overview

🧠 One-sentence thesis

Arithmetic combinatorics focuses on finding patterns such as arithmetic progressions within sets of integers, guaranteeing that sufficiently large sets colored in finitely many ways must contain monochromatic arithmetic progressions of any desired length.

📌 Key points (3–5)

  • What arithmetic progressions are: increasing sequences of integers with constant differences between consecutive terms.
  • Core guarantee (Theorem 16.13): for any desired progression length t and number of colors r, there exists a threshold size n₀ such that any coloring of {1, 2, ..., n} (n ≥ n₀) with r colors must contain a monochromatic t-term arithmetic progression.
  • Connections: arithmetic combinatorics aligns closely with Ramsey theory and number theory, but recent work also connects to real and complex analysis.
  • Common confusion: this is not about finding any arithmetic progression in a set—it guarantees progressions that are monochromatic (all elements assigned the same color/value).
  • Historical roots: the area has deep historical foundations but has seen recent rapid development with many new discoveries.

🔢 Fundamental definitions

🔢 Arithmetic progression

An arithmetic progression is an increasing sequence a₁ < a₂ < a₃ < ... < aₜ of integers where there exists a positive integer d such that aᵢ₊₁ − aᵢ = d for all i = 1, 2, ..., t − 1.

  • The key property: the difference between consecutive terms is constant.
  • The integer d is the common difference.
  • The integer t is called the length of the arithmetic progression (how many terms it contains).
  • Example: The sequence 3, 7, 11, 15, 19 is an arithmetic progression of length 5 with common difference d = 4.

🎨 Coloring interpretation

  • The excerpt describes a function φ : {1, 2, ..., n} → {1, 2, ..., r}.
  • This represents assigning one of r "colors" (values) to each integer from 1 to n.
  • A monochromatic arithmetic progression means all terms aᵢ in the progression satisfy φ(aᵢ) = c for the same color c.

🎯 The main theorem

🎯 Theorem 16.13 statement

For any pair (r, t) of positive integers, there exists a threshold integer n₀ such that:

  • If n ≥ n₀, and
  • φ : {1, 2, ..., n} → {1, 2, ..., r} is any function (any coloring with r colors),
  • Then there exists a t-term arithmetic progression 1 ≤ a₁ < a₂ < ... < aₜ ≤ n and an element c ∈ {1, 2, ..., r} such that φ(aᵢ) = c for each i = 1, 2, ..., t.

🔍 What the theorem guarantees

  • Universality: no matter how you color the integers, the pattern must appear.
  • Parameters you control:
    • r = how many colors you use
    • t = how long an arithmetic progression you want to find
  • What you get: a threshold size n₀ (which depends on r and t).
  • Implication: once your set is large enough (n ≥ n₀), you cannot avoid having a monochromatic arithmetic progression of length t.

⚠️ Don't confuse

  • This is not saying "every large set contains an arithmetic progression"—that's trivially true.
  • It says: even if you try to "break up" patterns by coloring integers differently, you cannot prevent monochromatic arithmetic progressions from appearing once the set is large enough.
  • The theorem is a Ramsey-type result: structure (arithmetic progression) emerges from sufficient size, regardless of how you partition (color) the elements.
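For very small r and t the threshold n₀ can be found by sheer enumeration. The sketch below (our own helpers) checks every r-coloring of {1, ..., n}; a coloring is a tuple whose position x−1 holds the color of x:

```python
from itertools import product

def has_mono_ap(coloring, t):
    """Does this coloring of {1,...,n} contain a monochromatic
    t-term arithmetic progression?"""
    n = len(coloring)
    for a in range(1, n + 1):
        for d in range(1, (n - a) // (t - 1) + 1):
            if len({coloring[a + i * d - 1] for i in range(t)}) == 1:
                return True
    return False

def threshold(r, t):
    """Smallest n such that every r-coloring of {1,...,n} contains a
    monochromatic t-term AP (brute force; tiny r and t only)."""
    n = t
    while not all(has_mono_ap(col, t) for col in product(range(r), repeat=n)):
        n += 1
    return n
```

For r = 2 and t = 3 this returns 9, the classical van der Waerden number W(2, 3); the coloring 0,0,1,1,0,0,1,1 shows that n = 8 is not enough.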

🌐 Broader context

🌐 Connections to other areas

The excerpt states that arithmetic combinatorics:

  • Is closely aligned with Ramsey theory and number theory.
  • Shows connections with real and complex analysis in recent work.
  • Has roots going back many years, but is a rapidly changing area with many recent deep and exciting discoveries.

🧩 Relationship to Ramsey theory

  • The excerpt mentions that "in some sense, this area is closely aligned with Ramsey theory."
  • Ramsey theory studies unavoidable patterns in sufficiently large structures.
  • Theorem 16.13 is a Ramsey-type result: it guarantees unavoidable monochromatic arithmetic structure in large colored sets.

16.7 The Lovász Local Lemma

🧭 Overview

🧠 One-sentence thesis

The Lovász Local Lemma is a powerful probabilistic technique that can prove the existence of rare configurations by showing that the probability of avoiding all "bad" events is positive, even when each bad event has relatively high probability.

📌 Key points (3–5)

  • What the lemma does: proves that certain rare configurations exist by showing that all events in a family can simultaneously fail (i.e., their complements all occur).
  • When it applies: when events have limited dependence—each event is independent of most others—and probabilities satisfy specific inequalities.
  • Two versions: the asymmetric form allows different probabilities and dependency neighborhoods for each event; the symmetric form assumes uniform bounds.
  • Common confusion: probabilistic methods usually find abundant objects, but the Local Lemma can find exceedingly rare ones by carefully managing dependencies.
  • Why it matters: provides existence proofs for combinatorial structures (like Ramsey-type bounds) that are hard to construct explicitly.

🎯 The core idea and notation

🎯 Probabilistic methods: abundant vs rare

  • Probabilistic techniques typically find objects that exist in abundance.
    • Example: random graphs almost certainly have modest independence numbers and few small cycles (as in the girth/chromatic number theorem).
  • The Lovász Local Lemma is unusual: it can prove existence of configurations that are exceedingly rare.
  • The key insight: if "bad" events have limited overlap in their dependencies, even if each is likely, we can show all of them fail simultaneously with positive probability.

🔤 Notation conventions

The excerpt introduces shorthand to simplify probability expressions:

  • Complement: E-bar denotes the complement of event E (i.e., "E does not occur").
  • Intersection as product: the product symbol over a family F of events E₁, E₂, ..., Eₖ denotes their intersection E₁ ∩ E₂ ∩ ... ∩ Eₖ.
  • Mixed notation: E₁ E₂-bar E₃ represents E₁ ∩ E₂-bar ∩ E₃, mixing complements and intersections.

🔗 Independence and dependency neighborhoods

For each event E in family F, let N(E) denote a subfamily of events from F excluding E, such that E is independent of any event not in N(E).

  • N(E) is the "dependency neighborhood" of E: events that E depends on.
  • Independence condition: P(E | product over F in G of F) = P(E) whenever the subfamily G is disjoint from N(E).
  • This limited dependence is the key to the lemma's power.

🔢 The asymmetric form

🔢 Statement of the asymmetric lemma

Lemma 16.14 (Lovász Local Lemma, Asymmetric): Let F be a finite family of events in a probability space. For each event E in F, let N(E) denote a subfamily of events from F excluding E so that E is independent of any event not in N(E). Suppose that for each event E in F, there is a real number x(E) with 0 < x(E) < 1 such that P(E) ≤ x(E) times the product over F in N(E) of (1 - x(F)). Then for every non-empty subfamily G of F, the probability of the product over E in G of E-bar is at least the product over E in G of (1 - x(E)). In particular, the probability that all events in F fail is positive.

  • The lemma assigns a "weight" x(E) to each event E.
  • The condition P(E) ≤ x(E) · product of (1 - x(F)) over F in N(E) relates the event's probability to its dependency neighborhood.
  • Conclusion: the probability that all events fail (all complements occur) is at least the product of (1 - x(E)) over all E, which is positive.

🧮 How the proof works

The proof uses induction on the size of subfamily G:

  1. Base case (|G| = 1): If G = {E}, then P(E-bar) = 1 - P(E) ≥ 1 - x(E), which follows from the hypothesis.
  2. Inductive step (|G| = k ≥ 2): Write G = {E₁, E₂, ..., Eₖ}. The probability of the intersection of all E-bars factors as a product of conditional probabilities:
    • P(E₁-bar | E₂-bar, ..., Eₖ-bar) · P(E₂-bar | E₃-bar, ..., Eₖ-bar) · ...
  3. Key inequality: for each conditional term, where F_E denotes the subfamily conditioned on, show P(E | product over F in F_E of F-bar) ≤ x(E).
    • If F_E and N(E) are disjoint, then P(E | product over F in F_E of F-bar) = P(E) ≤ x(E) by independence.
    • If F_E intersects N(E), split F_E into the events in N(E) and those outside N(E); apply the hypothesis to the numerator and the inductive hypothesis to the denominator to show the ratio is at most x(E).
  4. Combining these inequalities completes the induction.

🧩 Why the asymmetric form is flexible

  • Different events can have different probabilities and different-sized dependency neighborhoods.
  • The weights x(E) can be tailored to each event's specific situation.
  • This flexibility is useful when events have heterogeneous structure.

⚖️ The symmetric form

⚖️ Statement of the symmetric lemma

Lemma 16.15 (Lovász Local Lemma, Symmetric): Let p and d be numbers with 0 < p < 1 and d ≥ 1. Let F be a finite family of events in a probability space. For each event E in F, let N(E) denote the subfamily of events from F excluding E so that E is independent of any event not in N(E). Suppose that P(E) ≤ p and |N(E)| ≤ d for every event E in F, and that e · p · (d + 1) < 1, where e ≈ 2.71828 is the base for natural logarithms. Then the probability that all events in F fail is positive.

  • Uniform bounds: every event has probability at most p and dependency neighborhood size at most d.
  • Condition: e · p · (d + 1) < 1 ensures the lemma applies.
  • Simpler to check: only two parameters (p and d) instead of individual weights x(E).

🔧 Proof of the symmetric form

  • Set x(E) = 1 / (d + 1) for every event E in F.
  • Verify the asymmetric lemma's condition:
    • P(E) ≤ p < 1 / (e · (d + 1)) ≤ x(E) · product over F in N(E) of (1 - 1/(d + 1)).
    • The product over at most d terms of (1 - 1/(d + 1)) is at least (1 - 1/(d + 1))^d = 1 / (1 + 1/d)^d ≥ 1/e, since (1 + 1/d)^d < e for every d ≥ 1.
  • Apply the asymmetric lemma with these uniform weights.
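The verification step can be checked numerically; the sketch below (illustrative values, not from the excerpt) confirms that the uniform weight x = 1/(d + 1) satisfies the asymmetric condition whenever e·p·(d + 1) < 1:

```python
import math

# Check that x = 1/(d + 1) satisfies the asymmetric condition p <= x * (1 - x)^d.
def uniform_weight_ok(d, p):
    x = 1.0 / (d + 1)
    return p <= x * (1 - x) ** d

for d in range(1, 101):
    p = 0.999 / (math.e * (d + 1))   # any p with e * p * (d + 1) < 1
    assert uniform_weight_ok(d, p)
```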

📐 Alternative condition: 4pd < 1

  • The excerpt notes that many applications use the condition 4pd < 1 instead of e · p · (d + 1) < 1.
  • This is a "trivial modification" of the argument—slightly different constants but the same structure.
  • Don't confuse: both conditions serve the same purpose; the choice depends on convenience in specific applications.
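Neither condition strictly dominates the other. A quick comparison of the largest admissible p under each condition (illustrative ranges, not from the excerpt):

```python
import math

# Compare the supremum of p allowed by each condition for a given d.
for d in range(3, 200):
    p_e = 1.0 / (math.e * (d + 1))   # supremum of p with e * p * (d + 1) < 1
    p_4 = 1.0 / (4 * d)              # supremum of p with 4 * p * d < 1
    assert p_e > p_4                 # for d >= 3 the e-form permits larger p
```

For d = 1 or d = 2 the inequality reverses, so 4pd < 1 is the more permissive condition there.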

🎲 Application to Ramsey theory

🎲 Estimating R(3, n)

The excerpt begins to apply the Local Lemma to estimate the Ramsey number R(3, n): the smallest number of vertices t such that every graph on t vertices contains either a triangle or an independent set of size n.

  • Known upper bound: R(3, n) ≤ (n + 1 choose 2) from Theorem 11.2.
  • Goal: find good lower bounds using probabilistic methods.

🧪 Challenges with direct probabilistic arguments

The excerpt sketches a naive approach and its difficulties:

  1. Random graph setup: consider a random graph on t vertices with edge probability p.
  2. Avoid triangles: want no triangles, which requires (t choose 3) · p³ ≤ 1, so p ≈ 1/t.
  3. Avoid large independent sets: want no independent set of size n, which requires (t choose n) · (1 - p)^(n choose 2) ≤ 1.
    • This gives t · ln(n) ≤ p · (n choose 2), so t ≤ p · n² / ln(n).
  4. Tension: the two conditions pull in opposite directions—avoiding triangles wants small p, but avoiding large independent sets wants large p relative to t.

The excerpt stops here, but the implication is that the Local Lemma will resolve this tension by carefully managing dependencies among "bad" events (triangles and large independent sets).

🔍 Why the Local Lemma helps

  • Limited dependence: a triangle on three vertices depends only on edges involving those vertices; similarly, an independent set depends on edges within that set.
  • Rare but manageable: each bad event may have relatively high probability, but the dependency neighborhoods are small compared to the total number of events.
  • The Local Lemma shows that with the right choice of p and t, the probability that all bad events fail is positive, proving the existence of a graph with no triangles and no large independent sets.

🛠️ Algorithmic and broader context

🛠️ Constructive applications

  • The excerpt mentions growing interest in applying the Local Lemma algorithmically, i.e., in a constructive setting.
  • The lemma traditionally proves existence non-constructively (it shows positive probability but does not build the object).
  • Recent work seeks algorithms that efficiently find the rare configurations guaranteed by the lemma.

📚 Historical and ongoing development

  • The Local Lemma is described as "elegant but elementary" yet "very, very powerful."
  • The list of applications has been "growing steadily," indicating active research.
  • Early applications include Ramsey theory (as illustrated), but the lemma has since been applied to many areas of combinatorics.

16.8 Applying the Local Lemma

🧭 Overview

🧠 One-sentence thesis

The Lovász Local Lemma can be applied to Ramsey theory to directly prove that the Ramsey number R(3, n) grows at least as fast as n² / ln² n, improving upon naive probabilistic arguments.

📌 Key points (3–5)

  • What the Local Lemma provides: a way to show that the probability of avoiding all "bad" events in a family is positive, even when events are not fully independent, as long as each event has low probability and limited dependencies.
  • The Ramsey application challenge: naive probabilistic methods fail to give good lower bounds for R(3, n) because avoiding triangles and avoiding large independent sets impose conflicting requirements on edge probability.
  • How the Local Lemma solves it: by carefully choosing parameters x and y for triangles and independent sets, and verifying the lemma's neighborhood conditions, we can prove the existence of a graph with no triangles and no n-element independent set on roughly n² / ln² n vertices.
  • Common confusion: the Local Lemma does not require full independence—it only requires that each event's "neighborhood" (events it depends on) is small enough and that probabilities are low enough to satisfy the lemma's inequalities.
  • Why it matters: this technique yields the correct exponent (2 on n) for the Ramsey bound, matching earlier clever constructions but in a more direct manner.

🔍 The symmetric Local Lemma statement

🔍 What the lemma says

Let F be a finite family of events. For each event E in F, let N(E) denote the subfamily of events from F \ {E} such that E is independent of any event not in N(E). Suppose P(E) ≤ p, |N(E)| ≤ d for every event E in F, and that e·p·(d + 1) < 1, where e ≈ 2.71828. Then the probability that none of the events in F occur is positive.

  • The lemma guarantees that all events can be avoided simultaneously.
  • Key condition: e·p·(d + 1) < 1, where p is an upper bound on event probabilities and d is an upper bound on neighborhood sizes.
  • The proof sets x(E) = 1 / (d + 1) for every event E and verifies the required inequalities.

🔄 Alternate form

  • A common variant uses the condition 4·p·d < 1 instead of e·p·(d + 1) < 1.
  • The proof is a trivial modification of the argument presented.
  • Both forms capture the same idea: low probability and limited dependency allow simultaneous avoidance.

🎯 The Ramsey number challenge

🎯 What we want to bound

  • Goal: estimate R(3, n), the smallest number of vertices t such that every graph on t vertices contains either a triangle (3-clique) or an independent set of size n.
  • Upper bound from earlier: R(3, n) ≤ (n + 1 choose 2) from Theorem 11.2.
  • Challenge: find a good lower bound by constructing a graph with no triangle and no independent set of size n.

🚫 Why naive probabilistic methods fail

  • First attempt: use a random graph on t vertices with edge probability p.

    • To avoid triangles: need (t choose 3)·p³ ≤ 1, so p ≈ 1/t.
    • To avoid independent sets of size n: need (t choose n)·(1 - p)^(n choose 2) ≤ 1, which requires t·ln n ≈ p·n²/2.
    • Substituting p ≈ 1/t gives t·ln n ≈ n²/(2t), so t² ≈ n²/(2 ln n), meaning t ≈ n / √(2 ln n).
    • Problem: this does not even make t larger than n, which is not helpful.
  • Second attempt: allow some triangles and remove one vertex from each.

    • Set (t choose 3)·p³ ≈ t, so p ≈ t^(-2/3).
    • This yields R(3, n) ≥ n^(6/5) / ln^(3/5) n.
    • Problem: the exponent on n is 6/5, far from the upper bound's exponent.
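Comparing the two lower bounds numerically (a sketch with illustrative values of n, not from the excerpt) shows how far the deletion-method exponent 6/5 falls short of 2:

```python
import math

# The two lower bounds, up to the multiplicative constants the notes drop.
def deletion_bound(n):   # second attempt: n^(6/5) / (ln n)^(3/5)
    return n ** 1.2 / math.log(n) ** 0.6

def erdos_bound(n):      # correct growth rate: n^2 / (ln n)^2
    return n ** 2 / math.log(n) ** 2

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    assert erdos_bound(n) > deletion_bound(n)
```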

🏆 The correct bound

  • Erdős (1961) used a clever probabilistic argument to show R(3, n) ≥ n² / ln² n.
  • This shows the exponent 2 on n is correct.
  • The Local Lemma provides a more direct proof of the same bound.

🛠️ Applying the Local Lemma to Ramsey numbers

🛠️ Setting up the events

  • Random graph: consider a random graph on t vertices with edge probability p.
  • Triangle events: for each 3-element subset S, event E_S is true when S forms a triangle.
    • Probability: P(S) = p³.
  • Independent set events: for each n-element subset T, event E_T is true when T is an independent set.
    • Probability: P(T) = (1 - p)^(n choose 2) ≈ e^(-p·n²/2).
  • Notation abuse: the excerpt refers to events E_S and E_T simply as S and T.
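The approximation P(T) ≈ e^(-p·n²/2) can be sanity-checked numerically; the values p = 0.001 and n = 100 below are illustrative choices, not from the excerpt:

```python
import math

# For small p, (1 - p)^C(n, 2) is well approximated by e^(-p * n^2 / 2).
p, n = 0.001, 100
exact = (1 - p) ** math.comb(n, 2)
approx = math.exp(-p * n ** 2 / 2)
rel_err = abs(exact - approx) / exact
```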

🔗 Defining neighborhoods

  • Neighborhood: consists of all sets in the family that have two or more elements in common with the given set.
  • For a 3-element set S:
    • Other 3-element sets: 3·(t - 3) sets (sharing at least 2 elements).
    • n-element sets: (t-3 choose n-3) + 3·(t-3 choose n-2) sets.
  • For an n-element set T:
    • 3-element sets: (n choose 3) + (t - n)·(n choose 2) sets.
    • Other n-element sets: sum over i from 2 to n-1 of (n choose i)·(t-n choose n-i) sets.
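These counts can be verified by brute force on small parameters (t = 10 and n = 4 are illustrative choices, not from the excerpt):

```python
from itertools import combinations
from math import comb

t, n = 10, 4
V = range(t)
S = frozenset(range(3))   # a fixed 3-element set
T = frozenset(range(n))   # a fixed n-element set

def num_overlapping(base, size):
    """Count size-element subsets of V, other than base, sharing >= 2 elements with base."""
    return sum(1 for c in combinations(V, size)
               if frozenset(c) != base and len(base & frozenset(c)) >= 2)

# Neighborhood of the 3-element set S
assert num_overlapping(S, 3) == 3 * (t - 3)
assert num_overlapping(S, n) == comb(t - 3, n - 3) + 3 * comb(t - 3, n - 2)

# Neighborhood of the n-element set T
assert num_overlapping(T, 3) == comb(n, 3) + (t - n) * comb(n, 2)
assert num_overlapping(T, n) == sum(comb(n, i) * comb(t - n, n - i) for i in range(2, n))
```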

🧮 Choosing parameters

  • For 3-element sets: set x = x(S) = e² · p³ for each S.
  • For n-element sets: set y = y(T) = q^(1/2) ≈ e^(-p·n²/4) for each T, where q = (1 - p)^(n choose 2).
  • The excerpt notes "it will be clear in a moment where we got those values."

📐 The key inequalities

The Local Lemma requires:

  • p³ ≤ x · (1 - x)^(3(t-3)) · (1 - y)^((t-3 choose n-3) + 3(t-3 choose n-2))
  • q ≤ y · (1 - x)^((n choose 3) + (t-n)(n choose 2)) · (1 - y)^(sum of neighborhood terms)

Simplification: assume n^(3/2) < t < n² and ignore lower-order terms and constants:

  • p³ ≤ x · (1 - x)^t · (1 - y)^(t/n)
  • q ≤ y · (1 - x)^(t·n²) · (1 - y)^(t/n)

🎛️ Balancing the constraints

  • Keep (1 - y) terms large: want (1 - y)^(t/n) ≥ 1/e.
    • This holds if t/n ≤ 1/y, i.e., n·ln t ≤ p·n²/2, or ln t ≤ p·n.
  • Keep (1 - x) terms large: want (1 - x)^t ≥ 1/e.
    • This holds if t ≤ 1/x, i.e., t ≤ 1/p³.
  • Balance x and y: want (1 - x)^(t·n²) ≥ e^(-x·t·n²) ≥ y.
    • This requires x·t·n² ≤ p·n²/4, i.e., x·t ≤ p up to constants; since x ≈ p³, this gives p²·t ≤ 1, or p ≤ t^(-1/2).
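Choosing p = t^(-1/2) and n = t^(1/2)·ln t (so that ln t = p·n) satisfies all three constraints at once; a numeric sketch with illustrative values of t, not from the excerpt:

```python
import math

# With p = t^(-1/2) and n = t^(1/2) * ln t, all three constraints hold.
for t in (1e4, 1e6, 1e8):
    p = t ** -0.5
    n = t ** 0.5 * math.log(t)
    assert math.log(t) <= p * n + 1e-9   # ln t <= p * n (holds with equality)
    assert t <= 1 / p ** 3               # t <= 1/p^3
    assert p <= t ** -0.5 + 1e-12        # p <= t^(-1/2)
```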

🎯 Final calculation

  • Set the parameters: ln t = p·n and p = t^(-1/2).
  • Substitute: ln t = t^(-1/2)·n, so t^(1/2)·ln t = n.
  • Squaring: t·(ln t)² = n².
  • Since ln t ≈ ln n (within the approximations used), we get t ≈ n² / ln² n.
  • Conclusion: R(3, n) ≥ n² / ln² n, confirming the correct exponent of 2 on n.
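Solving t·(ln t)² = n² numerically (a sketch; n = 10⁶ is an illustrative value, not from the excerpt) shows that t matches n²/ln²n up to a bounded multiplicative factor; in fact ln t approaches 2·ln n, a constant the notes' approximations absorb:

```python
import math

def solve_t(n):
    """Binary search for t satisfying t * (ln t)^2 = n^2."""
    lo, hi = 2.0, float(n) ** 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) ** 2 < n ** 2:
            lo = mid
        else:
            hi = mid
    return lo

n = 10 ** 6
t = solve_t(n)
approx = n ** 2 / math.log(n) ** 2
ratio = t / approx   # bounded away from 0 and above by 1
```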

🔑 Key insights

🔑 Why the Local Lemma succeeds where naive methods fail

  • Naive methods treat all dependencies equally and require very restrictive conditions.
  • The Local Lemma exploits the fact that each event depends on only a small fraction of the total family.
  • By carefully tuning x and y, we balance the conflicting requirements of avoiding triangles and avoiding large independent sets.

🔑 The role of approximations

  • The excerpt emphasizes "ignoring smaller order terms and multiplicative constants."
  • This allows focusing on the dominant behavior: the exponent on n.
  • The final bound t ≈ n² / ln² n is understood within these approximations.
  • Don't confuse: the exact constants matter for precise bounds, but the Local Lemma argument captures the correct growth rate.