Numerical Methods for Ordinary Differential Equations


1.1 Some historical remarks

🧭 Overview

🧠 One-sentence thesis

The introduction of digital computers transformed numerical mathematics from a niche analytical tool into an essential, ubiquitous method for solving realistic mathematical models that cannot be solved analytically.

📌 Key points (3–5)

  • Historical roots: numerical methods date back to the 17th–18th centuries (Newton, Euler, Gauss), but the term "numerical mathematics" did not exist then.
  • The analytical bottleneck: fundamental physical laws were formulated as equations, but most could only be solved analytically in special cases, limiting technological development.
  • Computer revolution: digital computers enabled quantitative information from detailed, realistic models, making numerical methods ubiquitous.
  • Common confusion: numerical mathematics is not just "doing math on computers"—it is specifically about approximating solutions through finite computational processes on finite subsets of rational numbers.
  • Why it matters today: 70% of engineering journal papers use non-trivial mathematical models and methods; computations are often cheaper, safer, and more scalable than experiments.

📜 Origins and early development

📜 When numerical methods began

  • Modern applied mathematics started in the 17th and 18th centuries with scholars like Stevin, Descartes, Newton, and Euler.
  • Numerical aspects were naturally part of analysis, but the term "numerical mathematics" did not exist at that time.
  • Example: methods invented by Newton, Euler, and later Gauss are still important today.

🧱 The fundamental problem: unsolvable equations

  • In the 17th and 18th centuries, fundamental laws for physics (mechanics, hydrodynamics) were formulated as mathematical equations.
  • These equations looked simple but could be solved analytically only in a few special cases.
  • Result: technological development remained loosely connected with mathematics because most equations could not be solved in practice.

💻 The computer revolution

💻 How computers changed everything

  • The introduction and availability of digital computers transformed the field.
  • Computers made it possible to gain quantitative information from detailed and realistic mathematical models using numerical methods.
  • This applies to a multitude of phenomena and processes in physics and technology.
  • Application of computers and numerical methods has become ubiquitous (everywhere).

📊 Evidence of impact today

  • Statistical analysis shows that 70% of papers in professional engineering journals use non-trivial mathematical models and methods.
  • Computations are often cheaper than experiments.
  • Experiments can be:
    • Expensive
    • Dangerous
    • Downright impossible
  • Real-life experiments can often be performed only on a small scale, making their results less reliable.

🔍 What is numerical mathematics?

🔍 Core definition

Numerical mathematics: a collection of methods to approximate solutions to mathematical equations numerically by means of finite computational processes.

  • It is not just "mathematics using numbers" or "calculations."
  • The key concepts are:
    • Mappings and sets (like most of mathematics)
    • Computability (specific to numerical mathematics)

⚙️ What computability means

Computability has two requirements:

  1. Finite number of operations: the result can be obtained in a finite number of steps, so computation time will be finite.
  2. Finite subset of rational numbers: a computer has only finite memory, so it works on a finite subset of rational numbers.

🧩 Why results are approximations

  • In general, the result will be an approximation of the solution, not an exact solution.
  • Reasons:
    • Most mathematical equations contain operators based on infinite processes (integrals, derivatives).
    • Solutions are functions whose domain and image may contain irrational numbers.
    • Numerical methods can only obtain approximate solutions because they must use finite processes and finite memory.

🚫 Don't confuse: finite vs infinite

  • Traditional mathematics often deals with infinite processes (limits, infinite series, continuous functions over all real numbers).
  • Numerical mathematics must translate these into finite processes that a computer can execute.
  • Example: an integral is an infinite sum of infinitesimal parts; a numerical method approximates it with a finite sum of finite parts.
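
That translation from infinite to finite can be sketched in a few lines of Python (the integrand x² and the interval [0, 1] are illustrative choices, not from the excerpt):

```python
# Approximate the integral of x² over [0, 1] (exact value 1/3)
# by a finite midpoint sum: n finite parts instead of
# infinitely many infinitesimal ones.
def midpoint_sum(f, a, b, n):
    h = (b - a) / n  # width of each finite part
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

approx = midpoint_sum(lambda x: x * x, 0.0, 1.0, 1000)
print(approx)  # close to 1/3
```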

1.2 What is numerical mathematics?

🧭 Overview

🧠 One-sentence thesis

Numerical mathematics provides finite computational methods to approximate solutions to mathematical problems that often cannot be solved analytically, making it essential for practical applications in science and engineering.

📌 Key points (3–5)

  • Core purpose: approximate solutions to mathematical equations through finite computational processes, since most equations cannot be solved analytically.
  • Key constraint—computability: results must be obtainable in finite operations on a finite subset of rational numbers (due to computer memory and time limits).
  • Why approximation is acceptable: numerical methods apply best to stable problems—those insensitive to small perturbations.
  • Common confusion: computers cannot distinguish polynomials of sufficiently high degree, so methods relying on exact algebraic properties (like the fundamental theorem of algebra) cannot be trusted in practice.
  • Efficiency matters: unlike pure mathematics, numerical mathematics treats reducing operations and storage as essential improvements, not optional optimizations.

📜 Historical context and motivation

📜 Early development and limitations

  • Numerical methods were invented by Newton, Euler, and Gauss in the 17th–18th centuries; many still play important roles today.
  • Fundamental physics laws (mechanics, hydrodynamics) were formulated as mathematical equations, but these could be solved analytically only in a few special cases.
  • This limitation meant technological development was only loosely connected with mathematics until computers arrived.

💻 The computer revolution

  • Digital computers changed everything: they enable quantitative information from detailed, realistic mathematical models.
  • Application of computers and numerical methods has become ubiquitous.
  • Statistical analysis shows non-trivial mathematical models and methods appear in 70% of engineering science journal papers.

🧪 Why computation beats experiments

  • Computations are often cheaper than experiments.
  • Experiments can be expensive, dangerous, or downright impossible.
  • Real-life experiments can often be performed only on a small scale, making their results less reliable.

🔢 Definition and core concepts

🔢 What numerical mathematics is

Numerical mathematics: a collection of methods to approximate solutions to mathematical equations numerically by means of finite computational processes.

  • In most mathematics, the key concepts are mappings and sets.
  • In numerical mathematics, computability must be added.

🖥️ Computability

Computability: the result can be obtained in a finite number of operations (finite computation time) on a finite subset of the rational numbers (because a computer has only finite memory).

  • This is a hard constraint imposed by physical computers.
  • Example: you cannot represent all real numbers or perform infinitely many steps.

📐 Why results are approximations

  • Most mathematical equations contain operators based on infinite processes (integrals, derivatives).
  • Solutions are functions whose domain and image may (and usually do) contain irrational numbers.
  • Computers can only work with finite subsets of rational numbers, so approximation is unavoidable.

🛡️ Stability and error analysis

🛡️ When to apply numerical methods

  • Because numerical methods produce only approximate solutions, they make sense only for stable problems.
  • Stable problems are insensitive to small perturbations.
  • Don't confuse: stability is a concept in both numerical and classical mathematics.

🔍 Tools for studying stability and error

  • Functional analysis is an important instrument for studying stability.
  • Functional analysis also plays an important role in error analysis: investigating the difference between the numerical approximation and the true solution.

⚠️ Rounding errors

Rounding errors: errors that follow from the use of finitely many digits.

  • Using only a finite subset of rational numbers has many consequences.
  • Example: a computer cannot distinguish between two polynomials of sufficiently high degree.
  • Consequently, methods based on the fundamental theorem of algebra (an nth degree polynomial has exactly n complex zeros) cannot be trusted in practice.

⚡ Efficiency as a core value

⚡ Why efficiency is essential

  • An important aspect of numerical mathematics is the emphasis on efficiency.
  • Contrary to ordinary mathematics, numerical mathematics considers an increase in efficiency (a decrease in the number of operations and/or amount of storage required) as an essential improvement, not just a nice-to-have.

🚀 Practical importance and ongoing challenges

  • Progress in efficiency is of great practical importance.
  • The end of this development has not been reached yet; the creative mind will meet many challenges.
  • Revolutions in computer architecture will overturn much conventional wisdom.

🎯 Advantages of numerical mathematics

🎯 Solving problems without closed-form solutions

  • A big advantage: numerical mathematics can provide answers to problems that do not admit closed-form solutions.
  • Example: ∫₀^π √(1 + cos²x) dx represents the arc length of one arc of the curve y(x) = sin x, which does not have a solution in closed form.
  • A numerical method can approximate this integral in a very simple way.

🔧 Perfect match with computers

  • An additional advantage: a numerical method uses only standard function evaluations and the operations addition, subtraction, multiplication, and division.
  • These are exactly the operations a computer can perform.
  • Numerical mathematics and computers form a perfect combination.

📝 Trade-off with analytical methods

  • Analytical methods have the advantage that the solution is given by a mathematical formula.
  • However, most problems do not have analytical solutions, making numerical methods indispensable.

1.3 Why numerical mathematics?

🧭 Overview

🧠 One-sentence thesis

Numerical mathematics provides practical solutions to problems that lack closed-form answers by using only standard operations computers can perform, though it trades analytical insight for computational feasibility and must manage rounding errors from finite-precision arithmetic.

📌 Key points (3–5)

  • Main advantage: numerical methods can solve problems that have no closed-form solutions, such as certain integrals.
  • Perfect match with computers: numerical methods use only addition, subtraction, multiplication, and division—exactly what computers can do.
  • Trade-off: analytical methods give formulas that reveal solution behavior; numerical methods produce approximations that require visualization tools for insight.
  • Rounding errors are unavoidable: computers store only a finite subset of rational numbers using floating-point representation, introducing errors in every calculation.
  • Common confusion: don't assume numerical methods are less rigorous—they are essential when analytical solutions don't exist, and efficiency (fewer operations, less storage) is a core mathematical concern.

🎯 What numerical mathematics offers

🔓 Solving unsolvable problems

  • Many problems do not admit closed-form solutions.
  • Example from the excerpt: ∫₀^π √(1 + cos²x) dx represents the arc length of one arc of the sine curve.
    • This integral has no solution in closed form.
    • A numerical method can approximate it in a very simple way.
  • Numerical mathematics provides answers where classical methods cannot.
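
A minimal sketch of how simply such an integral can be approximated, using the midpoint rule (the rule and n = 1000 are illustrative choices, not from the excerpt):

```python
import math

# Arc length of one arc of y = sin x: ∫₀^π √(1 + cos²x) dx,
# which has no closed-form antiderivative.
def arc_length_sine(n):
    a, b = 0.0, math.pi
    h = (b - a) / n
    # midpoint rule: a finite sum of standard function evaluations
    return h * sum(math.sqrt(1.0 + math.cos(a + (i + 0.5) * h) ** 2)
                   for i in range(n))

print(arc_length_sine(1000))  # ≈ 3.8202
```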

🤝 Perfect partnership with computers

  • Numerical methods rely exclusively on:
    • Standard function evaluations
    • Addition, subtraction, multiplication, division
  • These are exactly the operations a computer can perform.
  • This alignment makes numerical mathematics and computers "a perfect combination."

⚖️ Analytical vs numerical approaches

📐 What analytical methods provide

  • The solution is given by a mathematical formula.
  • From the formula, you can gain insight into:
    • The behavior of the solution
    • The properties of the solution
  • This direct insight is a key advantage.

🖥️ What numerical methods provide

  • Numerical approximations do not come with formulas.
  • Insight must be gained through visualization tools.
  • Drawing a graph of a function using a numerical method is usually more useful than evaluating the solution at many individual points.
  • Don't confuse: lack of a formula does not mean lack of rigor—it means a different path to understanding.

🔢 Rounding errors and finite precision

🧮 Floating-point representation

Floating-point number: a number stored in a computer in the form ± 0.d₁d₂...dₙ · βᵉ, where d₁ > 0 and 0 ≤ dᵢ < β.

  • Mantissa: 0.d₁d₂...dₙ (the significant digits)
  • Base (β): often 2 (binary representation)
  • Exponent (e): an integer where L < e < U
  • Normalization (requiring d₁ > 0) prevents wasting digits and makes the representation unambiguous.

🎚️ Precision levels

Precision | Number of digits (n) | Standard
--------- | -------------------- | --------
Single    | 24                   | IEEE-754
Double    | 53                   | IEEE-754
  • Most computers and software packages (e.g., Matlab) satisfy the IEEE-754 standard.
  • Characteristic values for |L| and U are in the range [100, 1000].

❌ What rounding errors are

Rounding error: the error caused when a real number x is replaced with the floating-point number closest to it, called fl(x).

  • The relationship is written as: fl(x) = x(1 + ε)
  • Absolute error: |x − fl(x)| = |εx|
  • Relative error: |x − fl(x)| / |x| = |ε|
  • The difference between the two floating-point numbers enclosing x is βᵉ⁻ⁿ.
  • Rounding yields: |x − fl(x)| ≤ βᵉ⁻ⁿ / 2

🚫 Consequences of finite representation

  • A computer cannot distinguish between two polynomials of sufficiently high degree.
  • Methods based on the fundamental theorem of algebra (an nth-degree polynomial has exactly n complex zeros) cannot be trusted in practice.
  • Errors follow from using finitely many digits in every calculation.

🚀 Efficiency as a core concern

⚡ Why efficiency matters

  • Numerical mathematics places strong emphasis on efficiency.
  • Efficiency means:
    • Decreasing the number of operations required
    • Reducing the amount of storage required
  • An increase in efficiency is considered an essential improvement, not just a practical bonus.

🔬 Ongoing development

  • Progress in efficiency is of great practical importance.
  • The end of this development has not been reached yet.
  • The creative mind will meet many challenges in this area.
  • Revolutions in computer architecture will overturn much conventional wisdom.
  • Don't confuse: efficiency is not a secondary concern—it is a fundamental mathematical goal in numerical methods.

1.4 Rounding errors

🧭 Overview

🧠 One-sentence thesis

Rounding errors arise because computers represent real numbers with finite precision, and these errors propagate through arithmetic operations in predictable ways that can severely degrade numerical results when significant digits are lost.

📌 Key points (3–5)

  • Floating point representation: computers store numbers in a finite form ± 0.d₁d₂...dₙ · βᵉ, where the mantissa has n digits, β is the base, and e is the exponent.
  • Relative precision bound: the relative error from rounding any real number x to fl(x) is bounded by eps = ½ β¹⁻ⁿ; for double precision (β=2, n=53), eps ≈ 10⁻¹⁶.
  • Loss of significant digits: subtracting nearly equal numbers produces a small result with large relative error, because only the leading significant digit(s) remain and trailing zeros carry no information.
  • Common confusion: absolute vs relative error—adding a small number to a large one gives large absolute error but small relative error; subtracting nearly equal numbers gives small absolute error but large relative error.
  • Error propagation rules: absolute errors add for addition/subtraction; relative errors add (approximately) for multiplication/division.

🔢 Floating point representation

🔢 Structure of floating point numbers

Floating point number: a number stored in the form ± 0.d₁d₂...dₙ · βᵉ, where d₁ > 0, 0 ≤ dᵢ < β, and L < e < U.

  • Mantissa: the fractional part 0.d₁d₂...dₙ.
  • Base β: typically 2 (binary).
  • Exponent e: an integer between bounds L and U (characteristic values |L| and U are in [100, 1000]).
  • Precision: n = 24 for single precision, n = 53 for double precision.
  • The normalization d₁ > 0 prevents wasting digits and makes the representation unique.
  • Most systems follow the IEEE-754 standard.

📐 Distribution and special cases

  • Floating point numbers are not uniformly distributed on the real line.
  • There is a gap around zero.
  • Underflow: a computational result falls within the gap near zero; machines typically warn, replace with 0, and continue.
  • Overflow: a result exceeds the largest representable floating point number; machines warn and may continue with Inf (infinity).
  • Example: for β=2 and three mantissa digits, the numbers ±0.1d₂d₃ · 2ᵉ (e = -1, 0, 1, 2) cluster more densely near zero and spread out farther from zero.
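
Both cases are easy to provoke in standard double precision (a quick sketch; the constants are illustrative):

```python
# Overflow: exceeding the largest representable double (~1.8×10³⁰⁸)
# gives inf, and the computation continues.
print(1e308 * 10)  # inf

# Underflow: the smallest positive (subnormal) double is about 5×10⁻³²⁴;
# halving it falls into the gap around zero and is replaced by 0.
print(5e-324 / 2)  # 0.0
```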

⚠️ Rounding and error bounds

⚠️ Rounding process

Rounding: replacing a real number x with the closest floating point number fl(x).

  • The rounded value is written as fl(x) = x(1 + ε).
  • Absolute error: |x - fl(x)| = |εx|.
  • Relative error: |x - fl(x)| / |x| = |ε|.

📏 Error bounds

  • The gap between consecutive floating point numbers enclosing x is βᵉ⁻ⁿ.

  • Rounding gives absolute error |x - fl(x)| ≤ ½ βᵉ⁻ⁿ.

  • Because |x| ≥ βᵉ⁻¹ (since d₁ > 0), the relative error satisfies |ε| ≤ eps, where:

    eps = ½ β¹⁻ⁿ (the computer's relative precision).

  • For β=2 and n=53 (double precision), eps ≈ 10⁻¹⁶, meaning approximately 16 decimal digits are used.

  • Deterministic nature: rounding errors are not random; repeating the same algorithm yields the same results.
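
For double precision the bound eps = ½β¹⁻ⁿ = 2⁻⁵³ can be observed directly (a sketch):

```python
eps = 2.0 ** -53  # relative precision: ½ · 2^(1−53)
print(eps)        # ≈ 1.1×10⁻¹⁶

# A relative perturbation below eps is rounded away: 1 + eps rounds
# back to 1, while 1 + 2·eps is the next representable number above 1.
assert 1.0 + eps == 1.0
assert 1.0 + 2.0 * eps > 1.0
```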

🧮 Arithmetic operations in floating point

🧮 Model for floating point arithmetic

  • Let ◦ denote an arithmetic operation (+, -, ×, or /).

  • For floating point numbers x and y (i.e., x = fl(x), y = fl(y)), the machine result is:

    z = fl(x ◦ y).

  • The exact result x ◦ y is generally not a floating point number, so an error occurs.

  • Provided z ≠ 0, the machine result satisfies fl(x ◦ y) = (x ◦ y)(1 + ε) with |ε| ≤ eps.

  • This describes the error from converting an exact calculation result to floating point form.

🔀 Two sources of error

When x and y are not floating point numbers, the total error has two parts:

|x ◦ y - fl(fl(x) ◦ fl(y))| ≤ |x ◦ y - fl(x) ◦ fl(y)| + |fl(x) ◦ fl(y) - fl(fl(x) ◦ fl(y))|.

  • First term: error in the data (from rounding x and y to fl(x) and fl(y)).
  • Second term: error from converting the result of an exact calculation to floating point form.

📊 Example: basic operations

Using β=10, n=5, x=5/7, y=1/3, so fl(x)=0.71429×10⁰ and fl(y)=0.33333×10⁰:

Operation | Result      | Exact value | Absolute error | Relative error
--------- | ----------- | ----------- | -------------- | --------------
x + y     | 0.10476×10¹ | 22/21       | 0.190×10⁻⁴     | 0.182×10⁻⁴
x − y     | 0.38096×10⁰ | 8/21        | 0.762×10⁻⁵     | 0.200×10⁻⁴
x × y     | 0.23809×10⁰ | 5/21        | 0.524×10⁻⁵     | 0.220×10⁻⁴
x ÷ y     | 0.21429×10¹ | 15/7        | 0.429×10⁻⁴     | 0.200×10⁻⁴
  • For addition: fl(x) + fl(y) = (0.71429 + 0.33333)×10⁰ = 0.1047620000...×10¹, rounded to 0.10476×10¹.
  • Exact value x + y = 22/21 = 1.0476190476..., so absolute error ≈ 0.190×10⁻⁴ and relative error ≈ 0.182×10⁻⁴.
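
The β = 10, n = 5 system can be emulated with Python's decimal module by limiting the context to 5 significant digits (a sketch reproducing the table):

```python
from decimal import Decimal, getcontext

getcontext().prec = 5         # emulate β = 10, n = 5 arithmetic
x = Decimal(5) / Decimal(7)   # fl(5/7) = 0.71429
y = Decimal(1) / Decimal(3)   # fl(1/3) = 0.33333

print(x + y)  # 1.0476  (= 0.10476×10¹)
print(x - y)  # 0.38096
print(x * y)  # 0.23809
print(x / y)  # 2.1429  (= 0.21429×10¹)
```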

🚨 Loss of significant digits

🚨 What is loss of significant digits

Loss of significant digits: a phenomenon where subtracting nearly equal numbers leaves only a few significant digits in the result, with trailing zeros representing no information.

  • The large relative error is not due to floating point system limitations but to the finite precision of the data representation.
  • Example: with β=10, n=5, x=5/7, u=0.714251, so fl(x)=0.71429 and fl(u)=0.71425.
  • Subtraction: fl(x) - fl(u) = 0.71429 - 0.71425 = 0.0000400000×10⁰ = 0.40000×10⁻⁴.
  • Exact value: x - u = 0.3471428571...×10⁻⁴.
  • Absolute error ≈ 0.528×10⁻⁵, but relative error ≈ 0.152 (15%!).
  • The zeros in 0.40000 have no significance; only the digit 4 is meaningful.
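
The same 5-digit decimal system reproduces this cancellation (a sketch; exact values are tracked with fractions):

```python
from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 5
fx = Decimal(5) / Decimal(7)   # fl(x) = 0.71429
fu = +Decimal("0.714251")      # unary + rounds to 5 digits: fl(u) = 0.71425
diff = fx - fu                 # 0.00004: a single significant digit survives

exact = Fraction(5, 7) - Fraction("0.714251")     # = 0.34714...×10⁻⁴
rel_err = abs(Fraction(str(diff)) - exact) / exact
print(float(rel_err))          # ≈ 0.152, i.e. a 15% relative error
```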

🔁 Propagation of large relative error

  • A large relative error becomes a large absolute error when the result is multiplied by a large number or divided by a small number.
  • Example (continued): (x - u) × v, where v = 98765.1, fl(v) = 0.98765×10⁵.
    • Computed: fl(fl(x) - fl(u)) × fl(v) = 0.4×10⁻⁴ × 0.98765×10⁵ = 0.39506×10¹.
    • Exact: (x - u) × v = 3.42855990000460...
    • Absolute error ≈ 0.522, relative error ≈ 0.152 (still 15%).
  • If y² is added to (x - u) × v, the result is indistinguishable due to the large absolute error—the operation has no effect on the computed result, indicating something is fundamentally wrong.

🧪 Ill-conditioned vs unstable

  • Ill-conditioned: a set of input data for which a numerical process exhibits loss of significant digits.
  • Unstable: a numerical process that exhibits loss of significant digits for all possible input data; such processes are useless or need improvement.
  • One objective of numerical analysis is to identify and classify unstable processes.

⚖️ Floating point arithmetic does not obey standard laws

  • Standard mathematical laws (commutative, associative, distributive) do not necessarily hold for floating point arithmetic.

📐 Error propagation rules

📐 General principle

  • In a multi-step numerical process, the accumulated error from all previous steps should be treated as a perturbation of the original data.
  • After many steps, error propagation typically dominates over individual floating point errors.
  • The rules for numerical error propagation are the same as for measurement error propagation in physical experiments.

➕ Addition and subtraction

Let δx = x - x̃ and δy = y - ỹ be the absolute perturbations.

  • For addition: (x + y) - (x̃ + ỹ) = (x - x̃) + (y - ỹ) = δx + δy.
  • For subtraction: (x - y) - (x̃ - ỹ) = δx - δy.
  • Error estimate: |(x ± y) - (x̃ ± ỹ)| ≤ |δx| + |δy|.
  • In words: the absolute error in the sum (or difference) of two perturbed terms is (bounded by) the sum of the absolute perturbations.

✖️ Multiplication and division

Define relative perturbations: x̃ = x(1 + εₓ) and ỹ = y(1 + εᵧ).

  • For multiplication:

    (xy - x̃ỹ) / (xy) = -(εₓ + εᵧ + εₓεᵧ) ≈ -(εₓ + εᵧ),

    assuming εₓ and εᵧ are negligible compared to 1.

  • Error estimate: |xy - x̃ỹ| / |xy| ≤ |εₓ| + |εᵧ|.

  • In words: the relative error in a product (or quotient) of two perturbed factors is approximately the sum of the two relative perturbations.

  • A similar rule holds for division.

  • These two rules explain many phenomena in floating point computations when x̃ = fl(x) and ỹ = fl(y).
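
A quick numerical check of the multiplication rule (the values and perturbation sizes are illustrative):

```python
x, y = 3.0, 7.0
ex, ey = 1e-5, 2e-5       # relative perturbations εx, εy (illustrative)
xt = x * (1 + ex)         # x̃ = x(1 + εx)
yt = y * (1 + ey)         # ỹ = y(1 + εy)

rel = (x * y - xt * yt) / (x * y)
# rel = −(εx + εy + εx·εy) ≈ −(εx + εy); the cross term is only 2×10⁻¹⁰
print(rel, -(ex + ey))
```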

🔍 Don't confuse absolute and relative error contexts

Operation                 | Error type that adds | Example scenario
------------------------- | -------------------- | ----------------
Addition / Subtraction    | Absolute errors add  | Subtracting nearly equal numbers: small absolute error, large relative error
Multiplication / Division | Relative errors add  | Multiplying by a large number: small relative error can become large absolute error
  • Adding a small number to a large number: large absolute error, small relative error.
  • Subtracting nearly equal numbers: small absolute error, large relative error.
  • Multiplying a number with large relative error by a large number: both large absolute and large relative error.

1.5 Landau’s O-Symbol

🧭 Overview

🧠 One-sentence thesis

Landau's O-symbol provides a practical tool for comparing the growth rate of errors in numerical methods near a point, which is often more important than computing exact error values.

📌 Key points (3–5)

  • Why O-symbol matters: In numerical analysis, knowing how fast an error grows is often more important—and more feasible—than computing its exact value.
  • What O-symbol captures: It compares how one function grows relative to another function in the neighborhood of a point (here, near x = 0).
  • Computational rules: Four rules govern how O-terms combine under scaling, addition, multiplication, and division.
  • Common confusion: A function that is O(x²) is also O(x), but the reverse is not true—O-notation gives an upper bound on growth rate, not an exact rate.

🎯 The core idea

🎯 Why growth rate, not exact value

  • In many practical numerical problems, computing the explicit error value is impossible.
  • What we can determine is how the error behaves as we approach a point—does it shrink like x, like x², or faster?
  • The excerpt emphasizes: "it is often more important to know the growth rate of the error rather than its explicit value."

📍 Focus on x → 0

  • The general Landau O-symbol can be defined for any point a.
  • This book simplifies by always taking a = 0, so all comparisons are "for x → 0."

📐 Definition and meaning

📐 The formal definition

O-symbol: Let f and g be given functions. Then f(x) = O(g(x)) for x → 0 if there exist positive r and finite M such that |f(x)| ≤ M |g(x)| for all x in [−r, r].

  • In plain language: f(x) = O(g(x)) means "f does not grow faster than g, up to a constant factor M, in a small neighborhood around 0."
  • The constant M and the radius r can be any finite positive numbers; the key is that the inequality holds for all x close enough to 0.

🔍 What it tells you

  • If f(x) = O(x²), then near x = 0, |f(x)| is bounded by some constant times x².
  • This does not mean f(x) equals x²; it means f grows no faster than x².
  • Example: If f(x) = x², then f(x) = O(x²), but also f(x) = O(x), because |x²| ≤ |x| for |x| ≤ 1 (near 0, x² shrinks faster than x).

🧮 Computational rules

🧮 Four key rules

The excerpt provides four rules for manipulating O-terms. Assume f(x) = O(x^p) and g(x) = O(x^q) for x → 0, with p ≥ 0 and q ≥ 0.

Rule               | Statement                                        | Meaning
------------------ | ------------------------------------------------ | -------
(a) Scaling down   | f(x) = O(x^s) for all s with 0 ≤ s ≤ p           | A function that is O(x^p) is also O(x^s) for any smaller exponent s
(b) Addition       | α f(x) + β g(x) = O(x^min{p,q}) for all α, β ∈ ℝ | The sum is dominated by the term with the smallest exponent (slowest decay)
(c) Multiplication | f(x) g(x) = O(x^(p+q))                           | Exponents add when multiplying
(d) Division       | f(x)/x = O(x^(p−1)) for p ≥ 1                    | Dividing by x lowers the exponent by one

⚠️ Don't confuse: upper bound vs exact rate

  • Rule (a) shows that O-notation gives an upper bound on growth rate.
  • If f(x) = x², then f(x) = O(x²) and f(x) = O(x), because |x²| ≤ |x| near 0.
  • But if g(x) = x, then g(x) = O(x) but not O(x²): g(x)/x² = 1/x is unbounded as x → 0, so no finite M can give |g(x)| ≤ M x² near 0.

🧪 Worked example

🧪 Example from the excerpt

Let f(x) = x² and g(x) = x. The excerpt demonstrates:

  • (a) f(x) = O(x²) for x → 0, but also f(x) = O(x) for x → 0.
    • This illustrates rule (a): a function with exponent 2 is also O of any smaller exponent.
  • (b) h(x) = x² + 3x is O(x) for x → 0 but not O(x²) for x → 0.
    • The term 3x dominates near 0 (since x² becomes negligible compared to x).
    • Rule (b) says the sum is O(x^min{2,1}) = O(x).
    • h is not O(x²) because h(x)/x² = 1 + 3/x is unbounded as x → 0.
  • (c) f(x) · g(x) = x² · x = x³ is O(x³) for x → 0.
    • Rule (c): exponents add, so O(x²) · O(x) = O(x^(2+1)) = O(x³).
  • (d) f(x) / g(x) = x² / x = x is O(x) for x → 0.
    • Rule (d): dividing by x (exponent 1) subtracts 1 from the exponent, so O(x²) / x = O(x^(2−1)) = O(x).
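
A numeric check of example (b) (the sample points are illustrative): h(x)/x stays bounded as x → 0, while h(x)/x² blows up.

```python
h = lambda x: x**2 + 3*x

for x in (1e-1, 1e-3, 1e-6):
    # h(x)/x → 3 (bounded near 0, so h = O(x));
    # h(x)/x² = 1 + 3/x → ∞ (unbounded, so h is not O(x²))
    print(x, h(x) / x, h(x) / x**2)
```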

💡 Key takeaway from the example

  • The example shows that the dominant term (the one with the smallest exponent) determines the O-behavior of a sum.
  • It also reinforces that O-notation is about upper bounds: x² is O(x), even though x² is "smaller" than x near 0.

1.6 Some important concepts and theorems from analysis

🧭 Overview

🧠 One-sentence thesis

This section establishes foundational concepts from analysis—including well-posedness, conditioning, and key theorems like Taylor polynomials—that underpin the reliability and accuracy of numerical methods for differential equations.

📌 Key points (3–5)

  • Well-posed vs ill-posed models: a model is well-posed if a solution exists, is unique, and changes continuously with data; otherwise it is ill-posed.
  • Well-conditioned vs ill-conditioned: well-conditioned models produce small output errors from small input errors; ill-conditioned models amplify errors.
  • Common confusion: well-posedness (mathematical existence and uniqueness) is distinct from conditioning (sensitivity to input errors)—a model can be well-posed but ill-conditioned.
  • Taylor polynomial machinery: the Taylor polynomial approximates a function near a point, and the remainder term quantifies the error using a higher derivative at an unknown intermediate point.
  • Why these matter: numerical analysis relies on these concepts to ensure solutions exist, are computable, and remain accurate despite rounding and approximation errors.

🏗️ Foundational model properties

🏗️ Well-posedness

Well posed: Mathematical models of physical phenomena are called well posed if:

  • A solution exists;
  • The solution is unique;
  • The solution's behavior changes continuously with the data.

Models that do not satisfy these conditions are called ill posed.

  • What it means: a well-posed problem has a solution, only one solution, and small changes in input produce small changes in output.
  • Why it matters: if a model is ill-posed, numerical methods may fail to find a solution, find multiple conflicting solutions, or produce wildly different results from tiny input changes.
  • Example: a differential equation that has no solution or infinitely many solutions is ill-posed.

🎯 Conditioning

Well conditioned: Mathematical models are called well conditioned if small errors in the initial data lead to small errors in the results. Models that do not satisfy this condition are called ill conditioned.

  • What it means: conditioning measures error amplification—how much small input errors grow in the output.
  • Don't confuse: well-posedness is about whether a solution exists and is unique; conditioning is about whether the solution is stable under perturbations.
  • Example: a well-posed problem can still be ill-conditioned if rounding errors in input data cause large errors in computed results.

🧮 Basic mathematical tools

🧮 Modulus of a complex number

The modulus of a complex number a + ib is equal to √(a² + b²).

  • Also holds for a − ib: same formula.
  • This is the distance from the origin in the complex plane.

🧮 Kronecker delta

The Kronecker delta is defined as:

  • δ_ij = 1 if i = j,
  • δ_ij = 0 if i ≠ j.
  • A function of two integer variables.
  • Used to pick out diagonal elements or to express identity-matrix entries.
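
As a two-line sketch in Python:

```python
# Kronecker delta: a function of two integer variables
def delta(i, j):
    return 1 if i == j else 0

# Its values are exactly the entries of the identity matrix
identity = [[delta(i, j) for j in range(3)] for i in range(3)]
print(identity)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```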

🧮 Eigenvalues

Let A be an m × m matrix, and I be the m × m identity matrix. The eigenvalues λ₁, …, λₘ of A satisfy det(A − λI) = 0.

  • Eigenvalues are the roots of the characteristic polynomial.
  • They describe how the matrix scales vectors along certain directions.
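
For a 2×2 matrix, det(A − λI) = 0 is a quadratic that can be solved by hand (the symmetric matrix below is an illustrative choice with real eigenvalues):

```python
import math

# A = [[a, b], [c, d]]; det(A − λI) = λ² − (a+d)λ + (ad − bc) = 0
a, b, c, d = 2.0, 1.0, 1.0, 2.0
trace, det = a + d, a * d - b * c
disc = math.sqrt(trace**2 - 4 * det)  # real for this symmetric example
lam1, lam2 = (trace - disc) / 2, (trace + disc) / 2
print(lam1, lam2)  # 1.0 3.0
```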

📐 Key theorems from calculus

📐 Triangle inequality

Triangle inequality with absolute value: For any x, y ∈ ℝ, |x + y| ≤ |x| + |y|.

Furthermore, for any function f : (a, b) → ℝ, |∫ₐᵇ f(x) dx| ≤ ∫ₐᵇ |f(x)| dx.

  • The first part bounds the size of a sum by the sum of sizes.
  • The second part extends this to integrals: integrating the absolute value gives an upper bound on the absolute value of the integral.

📐 Intermediate-value theorem

Assume f ∈ C[a, b]. Let f(a) ≠ f(b) and let F be a number between f(a) and f(b). Then there exists a number c ∈ (a, b) such that f(c) = F.

  • What it says: a continuous function on a closed interval takes every value between its endpoints.
  • Why it matters: guarantees that roots exist when the function changes sign.

📐 Rolle's theorem

Assume f ∈ C[a, b] and f differentiable on (a, b). If f(a) = f(b), then there exists a number c ∈ (a, b) such that f′(c) = 0.

  • What it says: if a differentiable function has the same value at two points, its derivative must be zero somewhere in between.
  • Used in the proof of the mean-value theorem and Taylor remainder.

📐 Mean-value theorem

Assume f ∈ C[a, b] and f differentiable on (a, b), then there exists a number c ∈ (a, b) such that f′(c) = (f(b) − f(a)) / (b − a).

  • What it says: the instantaneous rate of change (derivative) at some point equals the average rate of change over the interval.
  • Generalizes Rolle's theorem (when f(a) = f(b), the average rate is zero).

📐 Mean-value theorem for integration

Assume G ∈ C[a, b] and φ is an integrable function that does not change sign on [a, b]. Then there exists a number ξ ∈ (a, b) such that the integral from a to b of G(t)φ(t) dt = G(ξ) times the integral from a to b of φ(t) dt.

  • What it says: the weighted integral of G can be expressed as G evaluated at some intermediate point, times the integral of the weight function.
  • Used in error analysis for quadrature and interpolation.

📐 Weierstrass approximation theorem

Suppose f ∈ C[a, b] is given. For every ε > 0, there exists a polynomial function p such that for all x in [a, b], |f(x) − p(x)| < ε.

  • What it says: any continuous function on a closed interval can be approximated arbitrarily closely by a polynomial.
  • Why it matters: justifies polynomial interpolation and approximation methods in numerical analysis.

🔬 Taylor polynomial and remainder

🔬 Taylor polynomial (one variable)

Assume f ∈ C^(n+1)(a, b) is given. Then for all c, x ∈ (a, b) there exists a number ξ between c and x such that f(x) = P_n(x) + R_n(x),

where the Taylor polynomial P_n(x) is given by: P_n(x) = f(c) + (x − c)f′(c) + (x − c)² / 2! f′′(c) + … + (x − c)ⁿ / n! f^(n)(c),

and the remainder term R_n(x) is: R_n(x) = (x − c)^(n+1) / (n + 1)! f^(n+1)(ξ).

  • What it does: approximates f(x) near c using a polynomial of degree n.
  • Remainder term: the error is expressed using the (n+1)-th derivative at an unknown point ξ between c and x.
  • Why it matters: allows quantifying approximation error in terms of higher derivatives and distance from the expansion point.
  • Example: for f(x) = x², c = 0, n = 1: P₁(x) = f(0) + x·f′(0) = 0, and the remainder R₁(x) = (x²/2!)·f′′(ξ) = x² accounts for the whole function, so f(x) − P₁(x) = O(x²) for x → 0 (as shown in Example 1.5.1).
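The theorem is easy to exercise numerically; a minimal sketch (the helper `taylor_poly` is hypothetical, not from the text) for f = exp about c = 0, where every derivative value at c equals 1:

```python
import math

def taylor_poly(derivs, c, x):
    """Evaluate P_n(x) = sum_k derivs[k] * (x - c)^k / k!,
    where derivs = [f(c), f'(c), ..., f^(n)(c)]."""
    return sum(d * (x - c)**k / math.factorial(k) for k, d in enumerate(derivs))

n, c, x = 4, 0.0, 0.5
p = taylor_poly([1.0] * (n + 1), c, x)    # P_4(0.5) for f = exp
error = abs(math.exp(x) - p)
# |R_n(x)| <= max|f^(n+1)| * |x - c|^(n+1) / (n+1)!; max of exp on [0, 0.5] is exp(0.5)
bound = math.exp(0.5) * abs(x - c)**(n + 1) / math.factorial(n + 1)
```

The computed error stays below the remainder bound, as the theorem guarantees.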

🔬 Proof sketch of Taylor remainder

The excerpt provides a proof:

  1. Define K so that f(x) equals the Taylor polynomial plus K(x − c)^(n+1).
  2. Construct an auxiliary function F(t) that equals zero at both t = c and t = x.
  3. Apply Rolle's theorem: F′(ξ) = 0 for some ξ between c and x.
  4. Differentiate F and simplify (most terms cancel telescopically).
  5. Solve for K: K = f^(n+1)(ξ) / (n + 1)!.
  • This shows the remainder term is exact for some unknown ξ.

🔬 Taylor polynomial (two variables)

Let f : D ⊂ ℝ² → ℝ be continuous with continuous partial derivatives up to and including order n + 1 in a ball B ⊂ D with center c = (c₁, c₂) and radius ρ. Then for each x = (x₁, x₂) ∈ B there exists a θ ∈ (0, 1), such that f(x) = P_n(x) + R_n(x).

  • Taylor polynomial P_n(x): sum over all partial derivatives up to order n, weighted by products of (x_i − c_i).
  • Remainder R_n(x): involves (n+1)-th order partial derivatives evaluated at c + θ(x − c).
  • Proof idea: fix x, define h = x − c, and consider the one-variable function F(s) = f(c + sh); apply the one-variable Taylor theorem to F.
  • Example (n = 1): P₁(x) = f(c₁, c₂) + (x₁ − c₁) ∂f/∂x₁(c₁, c₂) + (x₂ − c₂) ∂f/∂x₂(c₁, c₂), and R₁(x) is O(‖x − c‖²).

📊 Power series expansions

📊 Geometric series

Power series of 1/(1 − x): Let x ∈ ℝ with |x| < 1. Then: 1/(1 − x) = 1 + x + x² + x³ + … = sum from k=0 to ∞ of x^k.

  • Valid only when |x| < 1 (convergence condition).
  • Foundation for many approximation techniques.

📊 Exponential function

Power series of e^x: Let x ∈ ℝ. Then: e^x = 1 + x + x²/2! + x³/3! + … = sum from k=0 to ∞ of x^k / k!.

  • Converges for all real x.
  • Used in numerical computation of exponential and related functions.
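Both series can be checked by summing partial sums (a sketch; the term counts are arbitrary cut-offs):

```python
import math

def exp_series(x, terms=30):
    # Partial sum of e^x = sum x^k / k!
    return sum(x**k / math.factorial(k) for k in range(terms))

def geometric_series(x, terms=60):
    # Partial sum of 1/(1 - x); valid only for |x| < 1
    return sum(x**k for k in range(terms))
```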

🔗 How these concepts connect

| Concept | Role in numerical analysis |
| --- | --- |
| Well-posedness | Ensures the problem has a unique solution that depends continuously on data |
| Conditioning | Measures how errors propagate from input to output |
| Taylor polynomial | Provides local polynomial approximations with quantified error bounds |
| Mean-value theorems | Justify error estimates by relating function values to derivatives |
| Weierstrass theorem | Guarantees polynomial approximation is theoretically possible |
| Power series | Give explicit formulas for approximating standard functions |

  • Don't confuse: well-posedness is a property of the mathematical model; conditioning is a property of the problem instance and can vary with data.
  • Common pattern: numerical methods approximate functions (via Taylor or other polynomials), then use remainder terms and mean-value theorems to bound errors.

Interpolation Introduction

2.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Interpolation methods allow us to estimate unknown intermediate values from a limited set of measurement data, with applications ranging from predicting car ownership trends to rendering images and computing trigonometric functions efficiently.

📌 Key points (3–5)

  • What interpolation does: estimates values between known data points (interpolation) or predicts values outside the measured range (extrapolation).
  • Why it matters: useful when only limited measurement data is available; saves memory and computation time in practical applications.
  • Linear interpolation basics: the simplest useful method connects two known points with a straight line to approximate intermediate values.
  • Common confusion: interpolation (within data range) vs extrapolation (outside data range)—both use similar techniques but extrapolation predicts beyond measured points.
  • Real-world applications: estimating statistics between survey years, image visualization from sparse pixels, and fast computation of function values.

🎯 The interpolation problem

🎯 What interpolation solves

Interpolation: determining intermediate values from a limited amount of measurement data.

Extrapolation: predicting values outside the range of measurements.

  • In practice, we often have only a few data points but need to estimate values at other points.
  • The core challenge: how to use known values to approximate unknown ones.

📊 Motivating example: car ownership data

The excerpt provides a concrete scenario:

  • Data: Number of cars per 1000 citizens in The Netherlands, measured every 5 years from 1990 to 2010.
  • Question: How to estimate the number in 2008 (interpolation) or predict the number in 2020 (extrapolation)?
| Year | 1990 | 1995 | 2000 | 2005 | 2010 |
| --- | --- | --- | --- | --- | --- |
| Cars per 1000 | 344 | 362 | 400 | 429 | 460 |
  • Only five data points are known, but we want estimates for years in between or beyond.

🛠️ Why interpolation is useful

💾 Memory and computation savings

  • Image visualization: Store only a limited number of pixels; use interpolation to render realistic images on screen, saving memory.
  • Function computation: Computing trigonometric functions is time-consuming; store precalculated values and interpolate for intermediate points cheaply.

🔄 General principle

  • Instead of measuring/computing everything, measure/store a few values and fill in the gaps with interpolation.
  • Trade-off: less data storage and computation vs. some approximation error.

📏 Linear interpolation fundamentals

📏 From zeroth to first degree

The excerpt contrasts two simple approaches:

| Method | Description | Example |
| --- | --- | --- |
| Zeroth degree (constant) | Use the known value at one point as the approximation nearby | "Tomorrow's weather = today's weather" (correct 80% of the time) |
| Linear (first degree) | Draw a straight line between two known points | Connect two data points with a line |

  • Zeroth degree is the simplest but ignores trends.
  • Linear interpolation is "a better way"—it accounts for change between two points.

📐 The linear interpolation formula

Suppose we know function values at two points: f(x₀) and f(x₁), where x₀ < x₁.

Linear interpolation polynomial: a straight line through the two known points (x₀, f(x₀)) and (x₁, f(x₁)) used to approximate f(x) for x in [x₀, x₁].

The formula is:

  • L₁(x) = f(x₀) + [(x − x₀) / (x₁ − x₀)] × [f(x₁) − f(x₀)]

How it works:

  • The term (x − x₀) / (x₁ − x₀) measures how far x is between x₀ and x₁ (0 at x₀, 1 at x₁).
  • Multiply this fraction by the change in function values [f(x₁) − f(x₀)].
  • Add the starting value f(x₀).

Example: To approximate f(x) = 1/x at x = 0.95 using x₀ = 0.6 and x₁ = 1.3, the linear polynomial L₁ connects the two known points with a straight line.
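Formula (2.1) as code, applied to the 1/x example above (the node values 1/0.6 and 1/1.3 are computed rather than tabulated):

```python
def linear_interp(x0, f0, x1, f1, x):
    """L1(x) = f(x0) + (x - x0)/(x1 - x0) * (f(x1) - f(x0))."""
    return f0 + (x - x0) / (x1 - x0) * (f1 - f0)

# Approximate f(x) = 1/x at x = 0.95 from nodes x0 = 0.6, x1 = 1.3.
approx = linear_interp(0.6, 1 / 0.6, 1.3, 1 / 1.3, 0.95)
exact = 1 / 0.95
```

At the nodes the line reproduces f exactly; in between it only approximates (1/x is convex, so the line overshoots).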

🔑 Key properties of linear interpolation

The excerpt states two defining properties:

  1. Linearity: L₁(x) can be written as c₀ + c₁x (a straight line).
  2. Interpolation property: L₁(x₀) = f(x₀) and L₁(x₁) = f(x₁)—the line passes exactly through the two known points.

Don't confuse: The interpolation polynomial equals the function at the known points but only approximates it at intermediate points; the approximation quality depends on how smooth the original function is.

🧮 Mathematical derivation approach

The excerpt outlines how to derive the formula:

  • Start with the general linear form L₁(x) = c₀ + c₁x.
  • Apply the interpolation property: plug in x₀ and x₁, set equal to f(x₀) and f(x₁).
  • This produces a system of two linear equations in two unknowns (c₀, c₁).
  • Solving this system yields the coefficients, leading to formula (2.1).

Linear interpolation

2.2 Linear interpolation

🧭 Overview

🧠 One-sentence thesis

Linear interpolation approximates unknown function values by drawing a straight line between two known points, offering a simple and effective method that is accurate within the interval but becomes unreliable when extrapolating beyond it.

📌 Key points (3–5)

  • What linear interpolation does: uses a straight line through two known function values to approximate values at intermediate points.
  • How it compares to simpler methods: zeroth-degree (constant) interpolation uses a single known value for the neighborhood; linear interpolation improves on this by using two points and a line.
  • Lagrange form advantage: avoids solving a system of equations by using basis polynomials that satisfy specific interpolation properties.
  • Common confusion—interpolation vs extrapolation: inside the interval between the two nodes, errors are bounded and small; outside the interval, both truncation and measurement errors grow arbitrarily large.
  • Error behavior: the interpolation error depends on the second derivative of the function and the distance between nodes; measurement errors are magnified during extrapolation.

🔍 Comparison with simpler interpolation

🔍 Zeroth-degree (constant) interpolation

Zeroth-degree interpolation: approximation in the neighborhood of a known point is set equal to that known value.

  • Uses only one known function value.
  • The approximation is constant (flat) around that point.
  • Example: predicting tomorrow's weather will be the same as today's—correct in 80% of cases (higher in stable climates like the Sahara).
  • This is the simplest form but ignores trends between points.

📈 Linear interpolation improvement

  • Linear interpolation uses two known points and draws a straight line between them.
  • More plausible for approximating values between the two points because it captures the trend.
  • The excerpt calls this "a better way of interpolation."

🧮 The linear interpolation formula

🧮 Basic formula

Linear interpolation polynomial: L₁(x) = f(x₀) + [(x − x₀) / (x₁ − x₀)] · [f(x₁) − f(x₀)]

  • Given two nodes x₀ and x₁ (where x₀ < x₁) and their function values f(x₀) and f(x₁).
  • The formula produces a straight line through the points (x₀, f(x₀)) and (x₁, f(x₁)).
  • For any x in the interval [x₀, x₁], L₁(x) approximates f(x).

🔧 How the formula is derived

The excerpt explains two key properties used in the derivation:

  1. Linearity: L₁ is linear, so it can be written as L₁(x) = c₀ + c₁·x.
  2. Interpolation property: L₁(x₀) = f(x₀) and L₁(x₁) = f(x₁).

These properties lead to a system of two linear equations in two unknowns (c₀ and c₁), which can be solved to find the coefficients.

🎯 Lagrange basis polynomials

Lagrange basis polynomials L⁰₁(x) and L¹₁(x): linear polynomials defined so that L⁰₁(x₀) = 1, L⁰₁(x₁) = 0, L¹₁(x₀) = 0, L¹₁(x₁) = 1.

  • These basis polynomials are:
    • L⁰₁(x) = (x − x₁) / (x₀ − x₁)
    • L¹₁(x) = (x − x₀) / (x₁ − x₀)
  • The linear interpolation polynomial can then be written as:
    L₁(x) = L⁰₁(x)·f(x₀) + L¹₁(x)·f(x₁)
  • This is called the Lagrange form or Lagrange interpolation polynomial.
  • Advantage: avoids the need to solve the system of equations for c₀ and c₁.
  • The interpolation points x₀ and x₁ are often called nodes.

📏 Interpolation error and bounds

📏 Theorem on interpolation error

The excerpt provides Theorem 2.2.1:

  • Conditions: x₀ and x₁ are distinct nodes in [a, b]; the function f is at least twice continuously differentiable (f ∈ C²[a, b]); L₁ is the linear interpolation polynomial.
  • Error formula: for each x in [a, b], there exists a ξ in (a, b) such that
    f(x) − L₁(x) = (1/2)·(x − x₀)·(x − x₁)·f′′(ξ)
  • What this means: the error depends on the product (x − x₀)(x − x₁) and the second derivative of f at some unknown point ξ.

📐 Upper bound for the error

From the theorem, an upper bound follows:

  • For x ∈ [x₀, x₁]: |f(x) − L₁(x)| ≤ (1/8)·(x₁ − x₀)²·max|f′′(ξ)|
  • The error is proportional to the square of the distance between the nodes.
  • The maximum second derivative over the interval controls the error magnitude.

🧪 Example: sine function

The excerpt gives Example 2.2.1:

  • Known values: sin(36°) = 0.58778525, sin(38°) = 0.61566148.
  • Linear interpolation approximation for sin(37°): 0.601723.
  • Exact value: 0.60181502.
  • Difference: only 0.9 × 10⁻⁴ (very small).
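Example 2.2.1 reproduced in code (nodes kept in degrees; radians only for evaluating sin):

```python
import math

def linear_interp(x0, f0, x1, f1, x):
    return f0 + (x - x0) / (x1 - x0) * (f1 - f0)

# Interpolate sin(37 deg) between sin(36 deg) and sin(38 deg).
approx = linear_interp(36.0, math.sin(math.radians(36.0)),
                       38.0, math.sin(math.radians(38.0)), 37.0)
exact = math.sin(math.radians(37.0))
difference = abs(approx - exact)   # about 0.9e-4, as in the text
```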

⚠️ Measurement errors and extrapolation dangers

⚠️ Impact of measurement errors

In practice, function values come from measurements or calculations with errors.

  • Assume measurement error is bounded by ε: |f(x₀) − f̂(x₀)| ≤ ε and |f(x₁) − f̂(x₁)| ≤ ε, where f̂ denotes the measured (available) data.
  • The difference between the exact interpolation polynomial L₁ and the perturbed polynomial L̂₁ is bounded by:
    |L₁(x) − L̂₁(x)| ≤ [(|x₁ − x| + |x − x₀|) / (x₁ − x₀)]·ε

🔒 Interpolation (inside the interval)

  • For x in [x₀, x₁], the error from measurement uncertainties is bounded by ε.
  • The error stays controlled and does not grow.

🚨 Extrapolation (outside the interval)

  • For x outside [x₀, x₁], the error can grow arbitrarily large as x moves away from the interval.
  • If x ≥ x₁, the additional inaccuracy is bounded by:
    |L₁(x) − L̂₁(x)| ≤ [1 + 2·(x − x₁)/(x₁ − x₀)]·ε
  • Don't confuse: the same formula (2.1) can be used for extrapolation, and the error formula (2.3) still applies, but both truncation error and measurement error become very large outside [x₀, x₁].
  • The excerpt warns: "the danger of extrapolation, where the region of uncertainty, and hence, the errors due to uncertainties in the data, make the extrapolation error large."

📊 Visualization of errors

Figure 2.2 in the excerpt illustrates:

  • The region of uncertainty is filled (shaded).
  • Inside the interval [x₀, x₁], the uncertainty is narrow.
  • Outside the interval, the uncertainty region widens dramatically.
  • Example parameters: x₀ = 0, x₁ = 1, f(x₀) = 1, f(x₁) = 2, ε = 1/2.

🧾 Total error

  • The total error is the sum of:
    1. Truncation error (interpolation/extrapolation error from the theorem).
    2. Measurement error (from uncertainties in the data).
  • Both components grow large outside the interval [x₀, x₁].

🔗 Context and motivation

🔗 Why interpolation is needed

  • The excerpt begins by noting that if many precalculated function values are stored in memory, values at intermediate points can be determined "in a cheap way."
  • Linear interpolation provides this cheap approximation method.

🔗 Relation to higher-order methods

  • The excerpt briefly mentions that Section 2.3 covers higher-order Lagrange interpolation.
  • When n+1 function values are available, a polynomial of degree at most n can be used instead of a straight line.
  • Linear interpolation is the special case where n = 1 (two points, degree-1 polynomial).
| Method | Number of nodes | Polynomial degree | Complexity |
| --- | --- | --- | --- |
| Zeroth-degree | 1 | 0 (constant) | Simplest |
| Linear interpolation | 2 | 1 (line) | Simple, effective |
| Higher-order Lagrange | n+1 | n | More accurate, more complex |

Higher-Order Lagrange Interpolation

2.3 Higher-Order Lagrange interpolation

🧭 Overview

🧠 One-sentence thesis

Higher-order Lagrange interpolation extends linear interpolation to use n+1 data points to construct an nth-degree polynomial, but increasing the polynomial degree does not always improve accuracy—especially near boundaries with equidistant nodes—and Chebyshev nodes can avoid these oscillation problems.

📌 Key points (3–5)

  • What higher-order Lagrange interpolation does: constructs a polynomial of degree at most n that matches the function at n+1 distinct points.
  • Two ways to build it: solve a Vandermonde matrix system for coefficients, or use Lagrange basis polynomials to avoid the linear system.
  • Common confusion—more points ≠ better accuracy: increasing the polynomial degree can cause large oscillations near boundaries (Runge's phenomenon) when nodes are equidistant.
  • How to avoid Runge's phenomenon: use Chebyshev nodes instead of equidistant nodes to minimize the maximum interpolation error.
  • Extrapolation warning: errors grow rapidly outside the interpolation interval, and higher-degree polynomials often perform worse in extrapolation.

🔧 Construction methods

🔧 The interpolation problem setup

nth degree Lagrange interpolation polynomial Lₙ(x): a polynomial of degree at most n such that the values of f and Lₙ at the n+1 different points x₀, …, xₙ coincide.

  • You are given n+1 function values at distinct points.
  • The goal is to find a polynomial Lₙ(x) that passes through all these points.
  • The polynomial has degree at most n (not necessarily exactly n).

🧮 Method 1: Vandermonde matrix

The polynomial Lₙ(x) can be written as a sum: Lₙ(x) = c₀ + c₁x + c₂x² + … + cₙxⁿ.

How it works:

  • Require that Lₙ(xⱼ) = f(xⱼ) for each interpolation point j = 0, …, n.
  • This creates a linear system where the matrix is called a Vandermonde matrix.
  • Each row of the matrix consists of terms of a geometric series: [1, x₀, x₀², …, x₀ⁿ] for the first row, and so on.
  • Solve the system to find the unknown coefficients c₀, …, cₙ.

Drawback: Requires solving a linear system.

🎯 Method 2: Lagrange basis polynomials

The polynomial is written as: Lₙ(x) = f(x₀)L₀ₙ(x) + f(x₁)L₁ₙ(x) + … + f(xₙ)Lₙₙ(x)

How the basis polynomials work:

  • Each Lagrange basis polynomial Lₖₙ(x) satisfies: Lₖₙ(xⱼ) = 1 if j = k, and 0 otherwise.
  • Explicit formula: Lₖₙ(x) = product over all i ≠ k of (x − xᵢ) / (xₖ − xᵢ); every node contributes a factor except xₖ itself.
  • This can also be written as: Lₖₙ(x) = ω(x) / [(x - xₖ)ω'(xₖ)], where ω(x) is the product of all (x - xᵢ) terms.

Advantage: Avoids solving a linear system; you can directly compute the interpolation polynomial.
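Both methods can be sketched and cross-checked on invented data (the three points below are illustrative only):

```python
import numpy as np

def lagrange_interp(xs, fs, x):
    """Evaluate L_n(x) = sum_k f(x_k) * L_kn(x) via the Lagrange basis."""
    total = 0.0
    for k, xk in enumerate(xs):
        basis = 1.0   # L_kn(x): 1 at x_k, 0 at every other node
        for i, xi in enumerate(xs):
            if i != k:
                basis *= (x - xi) / (xk - xi)
        total += fs[k] * basis
    return total

xs = [0.0, 1.0, 2.0]
fs = [1.0, 3.0, 11.0]

# Method 1: solve the Vandermonde system for the coefficients c0, c1, c2.
V = np.vander(xs, increasing=True)       # rows [1, x_j, x_j^2]
c = np.linalg.solve(V, fs)

x = 1.5
vandermonde_value = sum(ck * x**k for k, ck in enumerate(c))
lagrange_value = lagrange_interp(xs, fs, x)
```

The two values agree, and the interpolation property holds at the nodes.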

📏 Error analysis

📏 Interpolation error formula

Theorem 2.3.1: For distinct nodes x₀ < x₁ < … < xₙ in [a, b], if f is (n+1)-times continuously differentiable and Lₙ is the Lagrange polynomial matching f at these nodes, then for each x in [a, b] there exists ξ in (a, b) such that:
f(x) - Lₙ(x) = [(x - x₀)(x - x₁)⋯(x - xₙ) · f⁽ⁿ⁺¹⁾(ξ)] / (n+1)!

What this means:

  • The error has two factors: a polynomial factor (x - x₀)⋯(x - xₙ) and a derivative factor f⁽ⁿ⁺¹⁾(ξ).
  • The polynomial factor depends on how x relates to the interpolation nodes.
  • The derivative factor depends on the function's smoothness.

Practical implication:

  • Best results come from choosing nodes so that x is in the innermost interval (not near the boundaries).

📊 Measurement error propagation

If the function values have measurement errors with |f(x) - f̂(x)| ≤ ε, then the error in the perturbed interpolation polynomial is at most: |Lₙ(x) - L̂ₙ(x)| ≤ ε · [sum of |Lₖₙ(x)| for k = 0 to n]

For equidistant nodes (xₖ = x₀ + kh with h = (xₙ - x₀)/n):

  • The sum of |Lₖₙ(x)| increases slowly with n.
  • The excerpt provides a table showing upper bounds for different intervals and degrees.
  • Example: for n=2 and x in [x₀, x₁] or [x₁, x₂], the bound is 1.25; for n=5 and x in [x₄, x₅], the bound is 3.1.
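The n = 2 entry of the table can be reproduced: with equidistant nodes the sum Σ|Lₖₙ(x)| depends only on the relative position of x between the nodes, and its maximum over [x₀, x₂] is 1.25 (a sketch with nodes 0, 1, 2):

```python
def basis_abs_sum(xs, x):
    """Sum of |L_kn(x)| over the Lagrange basis polynomials."""
    total = 0.0
    for k, xk in enumerate(xs):
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                basis *= (x - xi) / (xk - xi)
        total += abs(basis)
    return total

xs = [0.0, 1.0, 2.0]                      # equidistant, n = 2
grid = [2.0 * j / 1000 for j in range(1001)]
max_sum = max(basis_abs_sum(xs, x) for x in grid)   # attained at x = 0.5 and 1.5
```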

⚠️ Runge's phenomenon and its solution

⚠️ What is Runge's phenomenon

Runge's phenomenon: large oscillations of the interpolation polynomial near the boundary of the interval when using equidistant nodes.

Example from the excerpt:

  • Function: 1/(1 + x²) on interval [-5, 5]
  • Equidistant nodes: xₖ = -5 + 10k/n for k = 0, …, n
  • Observation: the 14th-degree polynomial approximates better than the 6th-degree on [-3, 3], but exhibits large oscillations near the endpoints ±5.

Key insight:

  • Increasing the polynomial degree does not guarantee better approximation everywhere.
  • Near boundaries, higher-degree polynomials can perform worse with equidistant nodes.

Don't confuse: This is not about measurement error or the function being non-smooth; it is about the choice of interpolation nodes.

🎯 Chebyshev nodes as the solution

The excerpt explains that the polynomial factor (x - x₀)⋯(x - xₙ) in the error formula can be minimized by choosing better interpolation points.

Chebyshev nodes: the optimal interpolation points that minimize max |(x - x₀)⋯(x - xₙ)| over all x in [a, b].

Formula for Chebyshev nodes: xₖ = [(a + b)/2] + [(b - a)/2] · cos[(2k + 1)π / (2(n + 1))] for k = 0, …, n

Why they work:

  • Chebyshev nodes are not equidistant; they cluster more densely near the boundaries.
  • With Chebyshev nodes, a continuously differentiable function can be approximated with arbitrarily small error by increasing n.
  • The excerpt shows that interpolation polynomials using Chebyshev nodes provide much better approximations compared to equidistant nodes (Figure 2.3(b) vs 2.3(a)).
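The comparison behind Figure 2.3 can be reproduced with a short script: interpolate 1/(1 + x²) on [−5, 5] with n = 14 using equidistant and Chebyshev nodes (formulas as given above) and compare worst-case errors on a sampling grid (the grid resolution is an arbitrary choice):

```python
import math

def lagrange_interp(xs, fs, x):
    total = 0.0
    for k, xk in enumerate(xs):
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                basis *= (x - xi) / (xk - xi)
        total += fs[k] * basis
    return total

def f(x):
    return 1.0 / (1.0 + x * x)   # Runge's example

a, b, n = -5.0, 5.0, 14
equi = [a + (b - a) * k / n for k in range(n + 1)]
cheb = [(a + b) / 2 + (b - a) / 2 * math.cos((2 * k + 1) * math.pi / (2 * (n + 1)))
        for k in range(n + 1)]

grid = [a + (b - a) * j / 400 for j in range(401)]

def worst_error(nodes):
    fs = [f(t) for t in nodes]
    return max(abs(f(x) - lagrange_interp(nodes, fs, x)) for x in grid)

max_err_equi = worst_error(equi)   # large oscillations near the endpoints
max_err_cheb = worst_error(cheb)   # much smaller maximum error
```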

🚫 Extrapolation pitfalls

🚫 Why extrapolation is dangerous

Example from the excerpt:

  • Function: f(x) = 1/x
  • Nodes: xₖ = 0.5 + k/n for k = 0, …, n
  • Observation: on the interpolation interval [0.5, 1.5], the 6th and 10th-degree polynomials are indistinguishable from the function.
  • For extrapolation (x ≥ 1.5), large errors occur, and the highest-degree polynomial has the largest error.

Key lesson:

  • Interpolation (within the node range) can be accurate.
  • Extrapolation (outside the node range) often produces large errors.
  • Higher-degree polynomials do not help—they often make extrapolation worse.

Practical advice from the excerpt:

  • When using tabulated values, choose nodes so that the point of interest x is in the innermost interval, not at the edges.

🔗 Connection to Taylor polynomials

🔗 Taylor polynomial as a special case

The excerpt mentions that interpolation can be generalized to match not only function values but also derivatives up to order mₖ at each node xₖ.

Special cases:

  • If n = 0 (only one node), the interpolation polynomial is the Taylor polynomial of degree m₀ about x₀.
  • If all mₖ = 0 (only function values, no derivatives), the polynomial is the nth-degree Lagrange polynomial.

Taylor polynomial limitation example:

  • Function: f(x) = 1/x approximated at x = 3 using Taylor polynomial about x = 1.
  • The nth-degree Taylor polynomial is: pₙ(x) = sum of (-1)ᵏ(x - 1)ᵏ for k = 0 to n.
  • Table 2.4 shows that pₙ(3) diverges as n increases: 1, -1, 3, -5, 11, -21, 43, -85, …
  • The approximation becomes less accurate with increasing polynomial degree because x = 3 is outside the region of convergence of the Taylor series.

Don't confuse: Taylor polynomials are useful for analysis but not always for approximation, especially far from the expansion point.
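The diverging column of Table 2.4 follows directly from the partial sums:

```python
def taylor_1_over_x(x, n):
    """Degree-n Taylor polynomial of f(x) = 1/x about c = 1:
    p_n(x) = sum_{k=0}^{n} (-1)^k (x - 1)^k."""
    return sum((-1)**k * (x - 1)**k for k in range(n + 1))

# At x = 3 we have |x - 1| = 2 > 1, outside the region of convergence,
# so the partial sums diverge with alternating sign.
values = [taylor_1_over_x(3.0, n) for n in range(8)]
```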


Interpolation with function values and derivatives

2.4 Interpolation with function values and derivatives *

🧭 Overview

🧠 One-sentence thesis

Hermite interpolation constructs higher-degree polynomials by matching both function values and derivatives at nodes, producing smoother approximations than Lagrange interpolation, which uses only function values.

📌 Key points (3–5)

  • What Hermite interpolation adds: uses both function values and derivative values at nodes to build a polynomial, unlike Lagrange interpolation which uses only function values.
  • Degree of the polynomial: with n+1 nodes and matching derivatives up to order m_k at each node, the polynomial degree is at most M = sum(m_k) + n.
  • Special cases: when all m_k = 0, it reduces to Lagrange interpolation; when n = 0, it becomes a Taylor polynomial.
  • Common confusion: Lagrange vs Hermite—Lagrange matches only function values (lower degree), Hermite matches values and derivatives (higher degree, smoother).
  • Why it matters: Hermite interpolation avoids derivative discontinuities at nodes, crucial for solving differential equations and smooth visualization.

🔧 General interpolation framework

🔧 Matching values and derivatives at nodes

General interpolation: given a function f and nodes x₀, …, xₙ, construct a polynomial p such that the derivatives of p and f coincide up to and including order m_k at each node x_k.

  • The polynomial p must satisfy: d^j p / dx^j (x_k) = d^j f / dx^j (x_k) for each k = 0, …, n and j = 0, …, m_k.
  • This is the polynomial of lowest degree meeting these conditions.
  • The excerpt assumes f is in C^m[a, b] where m = max(m_k).

📐 Degree of the resulting polynomial

  • The polynomial is at most of degree M = (m₀ + m₁ + … + mₙ) + n.
  • Example: if you have 2 nodes (n=1) and match up to first derivatives at each (m₀ = m₁ = 1), then M = (1 + 1) + 1 = 3 (a cubic polynomial).

🔄 Special cases

| Case | Condition | Result |
| --- | --- | --- |
| Taylor polynomial | n = 0 (single node) | Polynomial of degree m₀ about x₀ |
| Lagrange polynomial | m₀ = … = mₙ = 0 | nth degree Lagrange polynomial (values only) |
| Hermite polynomial | m₀ = … = mₙ = 1 | Matches function and first derivative at all nodes |
  • Don't confuse: Taylor uses one node with many derivatives; Lagrange uses many nodes with no derivatives; Hermite uses many nodes with derivatives.

🎯 Hermite interpolation polynomials

🎯 What Hermite interpolation does

  • Hermite interpolation is the case where m₀ = … = mₙ = 1: both function values and first derivatives are known at each node.
  • With n+1 nodes, you have 2(n+1) conditions (value + derivative at each node), so you can construct a polynomial of degree 2n+1.

🧮 Constructing a cubic Hermite polynomial (two nodes)

  • Suppose you know (x₀, f(x₀)), (x₁, f(x₁)), (x₀, f'(x₀)), (x₁, f'(x₁)).
  • These four independent values determine a cubic polynomial H₃(x) = c₀ + c₁x + c₂x² + c₃x³.
  • The excerpt shows this leads to a 4×4 linear system (equation 2.9) for the coefficients c₀, c₁, c₂, c₃.
  • However, solving this system directly is not necessary; a basis-function approach (analogous to Lagrange) avoids the system.

🧩 General Hermite formula

The Hermite interpolation polynomial of degree 2n+1 is:

H₂ₙ₊₁(x) = sum(k=0 to n) [ f(x_k) H_kn(x) + f'(x_k) Ĥ_kn(x) ]

where:

  • H_kn(x) = [1 - 2(x - x_k) L'_kn(x_k)] L²_kn(x)
  • Ĥ_kn(x) = (x - x_k) L²_kn(x)
  • L_kn(x) are the Lagrange basis polynomials (from equation 2.8).

Why this works:

  • H_kn(x_j) = δ_jk (1 if j=k, 0 otherwise) and H'_kn(x_j) = 0 for all j.
  • Ĥ_kn(x_j) = 0 for all j, and Ĥ'_kn(x_j) = δ_jk.
  • Therefore H₂ₙ₊₁(x_j) = f(x_j) and H'₂ₙ₊₁(x_j) = f'(x_j) at every node.
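For n = 1 these basis functions produce the cubic H₃; a sketch that reconstructs f(x) = x³ exactly, since f⁽⁴⁾ = 0 makes the error term of Theorem 2.4.1 vanish:

```python
def hermite_cubic(x0, f0, df0, x1, f1, df1, x):
    """H3(x) for n = 1, built from the basis functions above."""
    L0 = (x - x1) / (x0 - x1)            # Lagrange basis L_0 (1 at x0, 0 at x1)
    L1 = (x - x0) / (x1 - x0)            # Lagrange basis L_1 (0 at x0, 1 at x1)
    dL0 = 1.0 / (x0 - x1)                # L_0'(x0)
    dL1 = 1.0 / (x1 - x0)                # L_1'(x1)
    H0 = (1 - 2 * (x - x0) * dL0) * L0**2
    H1 = (1 - 2 * (x - x1) * dL1) * L1**2
    Hhat0 = (x - x0) * L0**2
    Hhat1 = (x - x1) * L1**2
    return f0 * H0 + f1 * H1 + df0 * Hhat0 + df1 * Hhat1

# f(x) = x^3 on [0, 1]: f(0) = 0, f'(0) = 0, f(1) = 1, f'(1) = 3.
value = hermite_cubic(0.0, 0.0, 0.0, 1.0, 1.0, 3.0, 0.5)   # x^3 at 0.5 = 0.125
```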

📏 Error bound for Hermite interpolation

Theorem 2.4.1 states:

  • Let x₀ < x₁ < … < xₙ be distinct nodes in [a, b].
  • Let f be in C^(2n+2)[a, b] and H₂ₙ₊₁ be the Hermite interpolation polynomial matching f and f' at these nodes.
  • Then for each x in [a, b], there exists a ξ in (a, b) such that:

f(x) - H₂ₙ₊₁(x) = [(x - x₀)² · … · (x - xₙ)²] / (2n + 2)! · f^(2n+2)(ξ)

  • The error depends on the (2n+2)th derivative of f and the squared distance from each node.
  • The proof is analogous to the Lagrange error theorem (Theorem 2.2.1), using an auxiliary function ϕ(t) and showing ϕ'(t) has 2n+2 zeros in (a, b).

🌍 Why choose Hermite over Lagrange

🌍 Smoothness at nodes

  • Piecewise linear (Lagrange with degree 1 in each interval) interpolation is not smooth at nodes: the derivative does not exist or jumps.
  • Hermite interpolation ensures the first derivative exists and is continuous at nodes, producing a smoother curve.

🌊 Example: Seismic wave propagation (Example 2.4.2)

  • A system of ODEs for seismic waves includes the term dc/dz (derivative of propagation speed with respect to vertical position).
  • If c(z) is known only at a finite number of positions, an interpolation is needed.
  • Piecewise linear interpolation: c'(z) does not exist at nodes → large errors in the ODE solution.
  • Hermite interpolation: if both c and c' are known at nodes, a cubic Hermite polynomial in each interval ensures c'(z) exists everywhere, including at nodes.

🖼️ Example: Smooth visualization (Example 2.4.3)

  • Goal: draw a smooth curve through a finite set of points (e.g., visualizing a hill from f(x) = 1/(1 + x³) on [0, 4]).
  • Piecewise linear interpolation: the resulting curve has sharp corners at nodes, does not resemble a smooth hill.
  • Hermite interpolation: by matching derivatives, the curve is smooth at nodes, producing a realistic representation.
  • Don't confuse: "smooth" here means the derivative is continuous; Lagrange polynomials of higher degree are smooth within each piece but may have derivative jumps if used piecewise.

Interpolation with splines

2.5 Interpolation with splines

🧭 Overview

🧠 One-sentence thesis

Splines solve the smoothness problem of piecewise polynomial interpolation by ensuring that the polynomial pieces connect smoothly at the nodes without requiring prior knowledge of derivatives.

📌 Key points (3–5)

  • The problem splines solve: piecewise polynomial interpolation lacks smoothness at subinterval interfaces; Hermite interpolation fixes this but requires knowing derivatives, which are often unavailable from measurements.
  • What a spline is: a piecewise polynomial that connects smoothly in the nodes, with continuous derivatives up to a certain order.
  • Linear vs cubic splines: linear splines (degree 1) are just piecewise linear interpolation; cubic splines (degree 3) ensure continuity of the function, first derivative, and second derivative.
  • Common confusion: Hermite interpolation vs splines—Hermite requires derivative values at nodes; splines compute smoothness automatically without needing derivative data.
  • Natural boundary conditions: cubic splines need two extra conditions to be uniquely determined; natural splines set the second derivative to zero at the endpoints (curvature vanishes at boundaries).

🧩 Why splines are needed

🧩 The smoothness problem

  • Dividing the interpolation interval into subintervals and constructing a polynomial on each subinterval often works better than a single high-degree polynomial.
  • The issue: lack of smoothness at the interface between two subintervals—the piecewise polynomial may have "corners" or jumps in derivatives.
  • Example: piecewise linear interpolation (degree 1 spline) does not have a continuous first derivative at the nodes, leading to unrealistic representations (e.g., a hill looks jagged).

🔧 Hermite interpolation as a partial solution

  • Hermite interpolation ensures smoothness by matching both function values and derivatives at the nodes.
  • Limitation: you must know the derivative at each node.
  • If data come from measurements, derivatives are often unknown, making Hermite interpolation impractical.
  • Don't confuse: Hermite interpolation uses twice as many input values (function + derivative) compared to standard interpolation; splines achieve smoothness without requiring derivative data.

🛠️ Splines as the remedy

A spline is a piecewise polynomial that is connected smoothly in the nodes.

  • Splines ensure smoothness without needing to know derivatives in advance.
  • The smoothness is built into the construction: derivatives up to a certain order are continuous at the nodes.

📐 Spline definitions and properties

📐 General spline structure

  • Nodes: partition the interval [a, b] into subintervals: a = x₀ < x₁ < ⋯ < xₙ = b.
  • A spline of degree p is a piecewise function where each piece sₖ(x) on [xₖ, xₖ₊₁] is a polynomial of degree p.
  • Smoothness requirement: the pieces sₖ and their derivatives s′ₖ, …, s⁽ᵖ⁻¹⁾ₖ connect smoothly at the nodes.

📏 Linear spline (degree 1)

An interpolation spline of degree 1 is a function s ∈ C[a, b] such that on each subinterval [xₖ, xₖ₊₁], s is linear, and s(xₖ) = f(xₖ) for all nodes.

  • Linear splines are simply piecewise linear interpolation polynomials.
  • They are continuous but not smooth (first derivative is not continuous at nodes).

📐 Cubic spline (degree 3)

A cubic spline consists of third-degree polynomials on each subinterval, with the function value equal to f at the nodes, and the first and second derivatives continuous at the nodes.

  • Properties:
    • On each subinterval [xₖ, xₖ₊₁], s is a third-degree polynomial sₖ.
    • s(xₖ) = f(xₖ) for all nodes.
    • At each internal node xₖ₊₁: sₖ(xₖ₊₁) = sₖ₊₁(xₖ₊₁), s′ₖ(xₖ₊₁) = s′ₖ₊₁(xₖ₊₁), s″ₖ(xₖ₊₁) = s″ₖ₊₁(xₖ₊₁).
  • The result is s ∈ C²[a, b] (twice continuously differentiable).

🌊 Natural boundary conditions

  • Cubic splines have 4n unknowns (four coefficients per subinterval) but only 4n − 2 conditions from interpolation and smoothness.
  • Two extra conditions are needed to uniquely determine the spline.
  • Natural boundary conditions: s″₀(x₀) = s″ₙ₋₁(xₙ) = 0.
    • This means the curvature vanishes at the endpoints.
    • These conditions come from the original shipbuilding application (the physical spline beam has no bending moment at the ends).

🏗️ Constructing a cubic spline

🏗️ Polynomial representation

  • Each piece sₖ(x) is written as:
    sₖ(x) = aₖ(x − xₖ)³ + bₖ(x − xₖ)² + cₖ(x − xₖ) + dₖ, for k = 0, …, n − 1.
  • Define hₖ = xₖ₊₁ − xₖ (the width of each subinterval) and fₖ = f(xₖ).

🔢 Determining the coefficients

  1. From interpolation condition: sₖ(xₖ) = fₖ implies dₖ = fₖ.
  2. From second derivative continuity: s″ₖ(xₖ₊₁) = s″ₖ₊₁(xₖ₊₁) leads to
    aₖ = (bₖ₊₁ − bₖ) / (3hₖ).
  3. From function value continuity: sₖ(xₖ₊₁) = sₖ₊₁(xₖ₊₁) leads to
    cₖ = (fₖ₊₁ − fₖ) / hₖ − hₖ(2bₖ + bₖ₊₁) / 3.
  4. From first derivative continuity: s′ₖ(xₖ₊₁) = s′ₖ₊₁(xₖ₊₁) simplifies to
    hₖbₖ + 2(hₖ + hₖ₊₁)bₖ₊₁ + hₖ₊₁bₖ₊₂ = 3[(fₖ₊₂ − fₖ₊₁)/hₖ₊₁ − (fₖ₊₁ − fₖ)/hₖ].

🧮 Solving for the unknowns

  • Natural boundary conditions give b₀ = 0 and bₙ = 0.
  • The system becomes n − 1 equations for n − 1 unknowns b₁, …, bₙ₋₁.
  • The system is tridiagonal (each equation involves only three consecutive bₖ values), which is efficient to solve.
  • Once the bₖ are found, aₖ and cₖ follow from the formulas above.
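The steps above can be sketched in code. A minimal illustration (our own helper names; pure Python, assuming strictly increasing nodes) that assembles and solves the tridiagonal system for the bₖ and then evaluates the spline:

```python
def natural_cubic_spline(x, f):
    """Natural cubic spline coefficients: s_k(u) = a_k u^3 + b_k u^2 + c_k u + d_k,
    u = t - x_k, on [x_k, x_{k+1}]. A sketch following the formulas above."""
    n = len(x) - 1
    h = [x[k + 1] - x[k] for k in range(n)]
    # Tridiagonal system for the interior b_k (natural BCs give b_0 = b_n = 0):
    #   h_{k-1} b_{k-1} + 2(h_{k-1} + h_k) b_k + h_k b_{k+1} = rhs_k,  k = 1..n-1
    sub = [h[i] for i in range(n - 1)]
    diag = [2.0 * (h[i] + h[i + 1]) for i in range(n - 1)]
    sup = [h[i + 1] for i in range(n - 1)]
    rhs = [3.0 * ((f[i + 2] - f[i + 1]) / h[i + 1] - (f[i + 1] - f[i]) / h[i])
           for i in range(n - 1)]
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n - 1):
        m = sub[i] / diag[i - 1]
        diag[i] -= m * sup[i - 1]
        rhs[i] -= m * rhs[i - 1]
    b = [0.0] * (n + 1)                      # b[0] = b[n] = 0 (natural BCs)
    for i in range(n - 2, -1, -1):
        b[i + 1] = (rhs[i] - sup[i] * b[i + 2]) / diag[i]
    a = [(b[k + 1] - b[k]) / (3.0 * h[k]) for k in range(n)]
    c = [(f[k + 1] - f[k]) / h[k] - h[k] * (2.0 * b[k] + b[k + 1]) / 3.0
         for k in range(n)]
    d = [f[k] for k in range(n)]
    return a, b[:n], c, d

def spline_eval(x, coeffs, t):
    """Evaluate the spline at t by locating its subinterval."""
    a, b, c, d = coeffs
    k = 0
    while k < len(a) - 1 and t > x[k + 1]:
        k += 1
    u = t - x[k]
    return ((a[k] * u + b[k]) * u + c[k]) * u + d[k]
```

For linear data the right-hand side vanishes, so all bₖ = 0 and the spline reduces to piecewise linear interpolation, which makes a quick sanity check.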

🎨 Applications and advantages

🎨 Visualization example

  • Problem: drawing a smooth curve through a finite number of points (e.g., visualizing a symmetric hill using f(x) = 1/(1 + x³) on [0, 4]).
  • Piecewise linear interpolation: produces a jagged, unrealistic curve (does not resemble a hill).
  • Hermite interpolation: much better, but requires knowing derivatives at the nodes (uses twice as many input values).
  • Cubic spline: produces a high-quality smooth curve without needing derivative data.
  • Example: a cubic spline on six subintervals gives a better result than Hermite or Lagrange interpolation on the same nodes.

🌊 Seismic wave example

  • Context: seismic waves are used to explore underground for oil and gas reservoirs.
  • A model for wave propagation involves ordinary differential equations that require the derivative dc/dz (propagation speed c as a function of vertical position z).
  • Problem: c is known only at a finite number of positions; piecewise linear interpolation makes c′(z) undefined at the nodes, leading to large errors.
  • Solution: if both c and c′ are known at the nodes, third-degree Hermite interpolation can be used; if derivatives are unknown, cubic splines provide a smooth approximation with continuous first derivative at the nodes.

🛠️ Historical origin

  • Splines were originally developed for shipbuilding before computer modeling.
  • Naval architects placed metal weights (called knots or ducks) at control points and bent a thin metal or wooden beam (called a spline) through the weights.
  • The physics of the bending beam meant that the influence of each weight was largest at the point of contact and decreased smoothly along the spline.
  • To control a region more precisely, more weights were added.

🔄 Modern uses

  • Splines are used in CAD (computer-aided design).
  • They are experiencing a revival in modern numerical methods for partial differential equations.

🔍 Comparison of interpolation methods

| Method | Smoothness | Requires derivative data? | Typical use case |
| --- | --- | --- | --- |
| Piecewise linear (degree 1 spline) | Continuous, not smooth | No | Simple, low-quality approximation |
| Lagrange (higher degree) | Smooth within each interval, may have jumps at interfaces | No | Single polynomial over entire interval |
| Hermite (degree 3) | Smooth at nodes | Yes (function + derivative) | When derivatives are known or can be estimated |
| Cubic spline (degree 3) | C² (twice continuously differentiable) | No | Smooth approximation without derivative data |

Don't confuse: Hermite interpolation and cubic splines both produce smooth curves, but Hermite requires derivative values at every node, while cubic splines compute the necessary derivatives automatically to ensure smoothness.


Numerical Differentiation: Introduction and Simple Difference Formulae

3.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Numerical differentiation approximates derivatives from discrete data points, but the accuracy is limited by both truncation error (which decreases as step size shrinks) and rounding error (which increases as step size shrinks).

📌 Key points (3–5)

  • Why numerical derivatives are needed: when no explicit formula exists, derivatives must be estimated from discrete measurements (e.g., estimating vehicle speed from position measurements at different times).
  • Three basic difference formulae: forward, backward, and central differences approximate the first derivative with different accuracy levels.
  • Truncation error behavior: central difference converges much faster (order h²) than forward/backward differences (order h) as step size h decreases.
  • Common confusion: smaller step size does not always mean better accuracy—rounding errors grow as h shrinks, while truncation errors shrink.
  • Trade-off: the total error has competing components that behave oppositely as step size changes.

🚗 Motivation: speed detection

🚗 Real-world application

The excerpt uses Dutch police speed enforcement as a motivating example:

  • Police measure vehicle position at consecutive discrete times.
  • Velocity = first derivative of position with respect to time.
  • Acceleration = second derivative of position with respect to time.
  • Because measurements are discrete (not continuous), derivatives can only be approximated.

📏 Why approximation is necessary

No explicit formula for position is available, so velocity can only be estimated using an approximation based on several discrete vehicle positions at discrete times.

  • The time interval between recordings is nonzero, so exact instantaneous velocity cannot be determined.
  • Example: if police want to know whether a driver was braking or accelerating, they need numerical approximations of the second derivative.

🔢 Simple difference formulae

➡️ Forward difference

Forward difference: Q_f(h) = [f(x + h) − f(x)] / h, where h > 0 is the step size.

  • How it works: compare the function value at x with the value one step ahead at x + h.
  • Convergence: as h approaches zero, the forward difference converges to the exact derivative f′(x).
  • Truncation error: R_f(h) = f′(x) − Q_f(h) = −(h/2) f″(ξ) for some ξ between x and x + h (proven using Taylor expansion).
  • Order of accuracy: the error is proportional to h (first order).

Example: For f(x) = −x³ + x² + 2x at x = 1 with h = 0.5, the forward difference gives a rough approximation (see Figure 3.1 in the excerpt).

⬅️ Backward difference

Backward difference: Q_b(h) = [f(x) − f(x − h)] / h, where h > 0.

  • How it works: compare the function value at x with the value one step behind at x − h.
  • Truncation error: R_b(h) = (h/2) f″(η) for some η between x − h and x.
  • Comparison with forward: accuracy is comparable to forward difference, but the error has opposite sign.

↔️ Central difference

Central difference: Q_c(h) = [f(x + h) − f(x − h)] / (2h).

  • How it works: average the forward and backward differences; compare values on both sides of x.
  • Why it's better: because forward and backward errors have opposite signs and nearly cancel when averaged.
  • Truncation error: R_c(h) = −(h²/6) f‴(ξ) for some ξ between x − h and x + h (requires third derivative to exist and be continuous).
  • Order of accuracy: the error is proportional to h² (second order)—converges much faster than forward or backward as h decreases.

Example: For the same function at x = 1 with h = 0.5, the central difference gives a noticeably better approximation than the forward difference.

Don't confuse: "better in the limit" vs "better for a given h"—as h → 0, central difference is always more accurate, but for a specific fixed h, forward or backward could still have smaller error in some cases.

📉 Comparison of difference formulae

| Formula | Definition | Truncation error | Order | Notes |
| --- | --- | --- | --- | --- |
| Forward | [f(x+h) − f(x)] / h | −(h/2) f″(ξ) | O(h) | Uses one point ahead |
| Backward | [f(x) − f(x−h)] / h | +(h/2) f″(η) | O(h) | Uses one point behind; opposite sign error |
| Central | [f(x+h) − f(x−h)] / (2h) | −(h²/6) f‴(ξ) | O(h²) | Averages forward/backward; much faster convergence |
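The three formulae are easy to compare in code. A small sketch (our own function names, using the example function from the text, for which f′(1) = 1 and f‴ is the constant −6, so the central truncation error is exactly h²):

```python
def forward(f, x, h):
    return (f(x + h) - f(x)) / h

def backward(f, x, h):
    return (f(x) - f(x - h)) / h

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# Example from the text: f(x) = -x^3 + x^2 + 2x, so f'(1) = 1 exactly.
f = lambda t: -t**3 + t**2 + 2 * t
for h in (0.5, 0.25, 0.125):
    print(h, 1.0 - forward(f, 1.0, h), 1.0 - central(f, 1.0, h))
```

Halving h roughly halves the forward-difference error but quarters the central-difference error, matching the O(h) and O(h²) entries in the table.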

⚠️ Rounding errors

⚠️ What rounding errors are

  • Source: measurement equipment, parallax, human performance (e.g., in car-videoing and laser control).
  • Impact: function values contain errors, so computed differences also contain errors.
  • Suppose measured values have error at most ε: |f(x) − f̂(x)| ≤ ε.

📈 Rounding error in central difference

The excerpt analyzes the central-difference formula:

  • If both f(x − h) and f(x + h) have rounding errors ≤ ε, then the rounding error in Q_c(h) is bounded by:
    • S_c(h) ≤ ε / h.
  • Key observation: the rounding-error bound increases as h decreases.

⚖️ Trade-off between truncation and rounding

  • Truncation error R_c(h) ∝ h² → decreases as h shrinks.
  • Rounding error S_c(h) ≤ ε/h → increases as h shrinks.
  • Implication: there is an optimal step size that balances these two competing errors; making h arbitrarily small does not guarantee better accuracy.

Don't confuse: the behavior of truncation error alone (smaller h is better) with the total error (which includes rounding and has a minimum at some intermediate h).

🔑 Key definitions

Numerical derivative: an approximation of a derivative based on discrete function values, used when no explicit formula is available.

Step size (h): the spacing between discrete points used in the difference formula.

Truncation error: the error R_f(h) = f′(x) − Q(h) resulting from approximating the derivative with a finite step size (even with exact function values).

Rounding error: additional error introduced by measurement or computational inaccuracies in the function values themselves.


Simple difference formulae for the first derivative

3.2 Simple difference formulae for the first derivative

🧭 Overview

🧠 One-sentence thesis

Simple difference formulae approximate derivatives by computing slopes over small intervals, but their accuracy depends on balancing truncation error (which shrinks as step size decreases) against rounding error (which grows as step size decreases).

📌 Key points (3–5)

  • Forward, backward, and central differences: three basic formulae that approximate the derivative by computing slopes between function values at nearby points.
  • Truncation error behavior: forward and backward differences have error proportional to h (step size), while central difference has error proportional to h squared, making it converge faster.
  • Common confusion: smaller step size does not always mean better accuracy—rounding errors grow as h shrinks, so there is an optimal step size that balances both error types.
  • Rounding error impact: when function values contain measurement or rounding errors of size ε, the rounding contribution to total error is roughly ε divided by h, which increases as h decreases.
  • Practical trade-off: the total error has a minimum at an optimal step size that depends on both the function's smoothness and the size of rounding errors.

📐 The three basic difference formulae

📐 Forward difference

Forward difference: Q_f(h) = [f(x + h) − f(x)] / h, where h > 0 is the step size.

  • This formula approximates the derivative by looking "forward" from x to x + h.
  • By definition of the derivative, the forward difference converges to f′(x) as h approaches zero.
  • Truncation error: R_f(h) = f′(x) − Q_f(h) = −(h/2) · f′′(ξ) for some ξ between x and x + h (assuming f has a continuous second derivative).
  • The error is proportional to h, meaning if you halve the step size, the truncation error roughly halves.
  • Example: For f(x) = −x³ + x² + 2x at x = 1 with h = 0.5, the forward difference gives a visible approximation but is "not yet very accurate" (as shown in Figure 3.1 of the excerpt).

📐 Backward difference

Backward difference: Q_b(h) = [f(x) − f(x − h)] / h, where h > 0.

  • This formula looks "backward" from x to x − h.
  • Truncation error: R_b(h) = (h/2) · f′′(η) for some η between x − h and x.
  • The accuracy is comparable to the forward difference, but the error has the opposite sign.
  • The points ξ and η (from forward and backward errors) differ, but for small h this difference is very small.

📐 Central difference

Central difference: Q_c(h) = [f(x + h) − f(x − h)] / (2h).

  • This formula averages the forward and backward approaches, using points on both sides of x.
  • Because the forward and backward errors have opposite signs and nearly cancel, the central difference is more accurate.
  • Truncation error: R_c(h) = −(h²/6) · f′′′(ξ) for some ξ between x − h and x + h (assuming f has a continuous third derivative).
  • The error is proportional to h squared, so halving h reduces the error by roughly a factor of four—much faster convergence than forward or backward differences.
  • Example: For the same function f(x) = −x³ + x² + 2x at x = 1, the central difference gives a better approximation than the forward difference (Figure 3.1).
  • Don't confuse: "better in the limit h → 0" does not guarantee better accuracy for any given h; for a specific step size, forward or backward differences could still have smaller error.

📊 Comparison of the three formulae

| Formula | Definition | Truncation error | Error order | Notes |
| --- | --- | --- | --- | --- |
| Forward | [f(x+h) − f(x)] / h | −(h/2)·f′′(ξ) | Proportional to h | Uses one forward step |
| Backward | [f(x) − f(x−h)] / h | +(h/2)·f′′(η) | Proportional to h | Uses one backward step; opposite sign error |
| Central | [f(x+h) − f(x−h)] / (2h) | −(h²/6)·f′′′(ξ) | Proportional to h² | Averages forward and backward; faster convergence |

⚖️ The trade-off between truncation and rounding errors

⚖️ Why smaller h is not always better

  • Truncation error decreases as h shrinks: for central difference, R_c(h) ≈ (h²/6)·m, where m bounds the third derivative.
  • Rounding error increases as h shrinks: if function values contain errors up to ε, the rounding contribution S_c(h) is bounded by ε/h.
  • As h decreases, truncation error falls but rounding error rises, so there is a sweet spot.

⚖️ Total error and optimal step size

  • The total error E_c(h) satisfies |E_c(h)| ≤ |R_c(h)| + S_c(h) ≤ (h²/6)·m + ε/h.
  • This upper bound φ(h) = (h²/6)·m + ε/h has a minimum at h_opt = cube root of (3ε/m).
  • At this optimal step size, the minimum total error is approximately cube root of (9ε²m/8).
  • Don't confuse: the optimal h depends on both the smoothness of f (via m) and the size of rounding errors (ε); it is not a fixed universal value.

⚖️ Practical example with rounding errors

  • Example 3.3.1 approximates the derivative of f(x) = e^x at x = 1 using six-digit precision (so ε ≈ 5·10⁻⁶).
  • Results show:
    • h = 0.2: error ≈ −0.01817
    • h = 0.1: error ≈ −0.00457
    • h = 0.01: error ≈ −0.00022 (best)
    • h = 0.001: error ≈ −0.00172 (worse)
    • h = 0.0001: error ≈ 0.01828 (much worse)
  • For h ≤ 0.1, the excerpt notes that m = max |f′′′(ξ)| can be bounded, allowing calculation of the optimal step size.
  • This demonstrates that very small h leads to deterioration due to rounding errors dominating the total error.
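Example 3.3.1 is easy to reproduce: round e^x to six significant digits before differencing (a sketch with our own helper names; Python's `%.6g` formatting stands in for a six-digit table):

```python
import math

def round6(v):
    # Mimic a six-digit table: keep six significant digits.
    return float(f"{v:.6g}")

def qc_rounded(x, h):
    # Central difference computed from the rounded values of f(x) = e^x.
    return (round6(math.exp(x + h)) - round6(math.exp(x - h))) / (2 * h)

exact = math.exp(1.0)
for h in (0.2, 0.1, 0.01, 0.001, 0.0001):
    q = qc_rounded(1.0, h)
    print(f"h = {h:<7} Q = {q:.5f}  error = {exact - q:+.5f}")
```

The printed errors reproduce the pattern above: they shrink down to h = 0.01 and then grow again as the rounding error takes over.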

🔗 Connection to measurement errors

🔗 Real-world context

  • The excerpt opens by mentioning that measurement errors (e.g., from car video recording and laser control) provide "an additional deterioration of the approximation of the speed and acceleration."
  • Numerical differentiation is sensitive to such errors because derivatives amplify noise: small errors in function values become large errors in the computed slope.
  • Section 3.3 (covered here) treats the impact of measurement errors on derivative approximations, showing that the same ε/h behavior applies whether errors come from rounding or measurement.

Rounding errors in numerical differentiation

3.3 Rounding errors

🧭 Overview

🧠 One-sentence thesis

Rounding errors in function values grow as the step size h decreases, creating a trade-off with truncation error that determines an optimal step size for numerical differentiation.

📌 Key points (3–5)

  • Rounding error behavior: rounding errors in function values propagate into the derivative approximation and grow as h becomes smaller.
  • Truncation vs rounding trade-off: truncation error decreases with smaller h, but rounding error increases, so total error has a minimum at an optimal h.
  • Optimal step size: for central differences, h_opt equals the cube root of (3ε/m), where ε is the rounding error bound and m bounds the third derivative.
  • Common confusion: smaller h does not always mean better accuracy—below h_opt, rounding errors dominate and the approximation deteriorates.
  • Practical implications: improving accuracy requires either reducing rounding errors in function values or using higher-order formulas, not just shrinking h.

🔢 How rounding errors propagate

🔢 Rounding error in function values

  • Suppose each function value contains a rounding error of at most ε:
    • |f(x - h) - f̂(x - h)| ≤ ε
    • |f(x + h) - f̂(x + h)| ≤ ε
  • Here f̂ denotes the computed (rounded) function value, and f is the exact value.
  • This error comes from limited precision (e.g., six digits in the example).

📊 Rounding error bound for central differences

The rounding error S_c(h) in the central-difference approximation is bounded by ε/h.

  • The central-difference formula is Q_c(h) = [f(x + h) - f(x - h)] / (2h).
  • When computed values f̂ are used instead of exact f, the rounding error becomes:
    • S_c(h) = |Q_c(h) - Q̂_c(h)| ≤ ε/h
  • Key observation: the bound ε/h increases as h decreases.
  • Example: if ε = 5·10⁻⁶ and h = 0.0001, then S_c(h) ≤ 0.05, which is large.

⚠️ Why smaller h amplifies rounding error

  • The division by h in the difference formula magnifies the absolute error in the numerator.
  • As h shrinks, the numerator (difference of two nearly equal numbers) becomes small, but the rounding error ε stays constant.
  • The ratio ε/h therefore explodes as h → 0.
  • Don't confuse: this is different from truncation error, which improves with smaller h.

⚖️ The trade-off: total error

⚖️ Total error composition

  • The total error E_c(h) in the approximation f'(x) ≈ Q̂_c(h) has two parts:
    • Truncation error R_c(h) = f'(x) - Q_c(h) = -(h²/6)f'''(ξ), which decreases as h² when h shrinks.
    • Rounding error S_c(h) ≤ ε/h, which increases as 1/h when h shrinks.
  • The total error satisfies:
    • |E_c(h)| ≤ |R_c(h)| + S_c(h) ≤ (h²/6)m + ε/h = φ(h)
    • where m is an upper bound for |f'''(x)| near x.

📉 Behavior of the upper bound φ(h)

| Component | Behavior as h decreases | Dominant when |
| --- | --- | --- |
| Truncation error bound (h²m/6) | Decreases | h is large |
| Rounding error bound (ε/h) | Increases | h is small |
| Total error bound φ(h) | U-shaped curve | Minimum at h_opt |

  • For large h: truncation error dominates, so reducing h improves accuracy.
  • For small h: rounding error dominates, so reducing h worsens accuracy.
  • The minimum of φ(h) occurs at h_opt = cube root of (3ε/m).
  • At h_opt, φ(h_opt) = cube root of (9ε²m/8).

🎯 Optimal step size

  • The optimal step size balances the two competing errors.
  • For the example f(x) = eˣ at x = 1 with ε = 5·10⁻⁶ and m = e^1.1:
    • h_opt ≈ 0.0171
    • φ(h_opt) ≈ 0.00044
  • The excerpt notes that φ(h) is almost constant near h_opt (especially for h > h_opt), so any step size close to h_opt gives comparable accuracy.
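The numbers quoted here follow directly from the formulas. A short check (using the bound m = e^1.1 for |f‴| near x = 1, as in the example):

```python
import math

eps = 5e-6                          # rounding bound for six-digit function values
m = math.exp(1.1)                   # bound on |f'''| = e^x on [0.9, 1.1]
h_opt = (3 * eps / m) ** (1 / 3)    # optimal step size
phi = lambda h: h**2 * m / 6 + eps / h   # total error bound
print(h_opt, phi(h_opt))            # ≈ 0.0171 and ≈ 0.00044
```

Evaluating φ at 2·h_opt gives less than twice the minimum, illustrating how flat the bound is to the right of the optimum.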

📋 Numerical example

📋 Example 3.3.1: derivative of eˣ at x = 1

  • Function values are computed with six digits, so ε ≈ 5·10⁻⁶.
  • The excerpt provides a table of results for various h:
| h | Q̂_c(h) | Error f′(1) − Q̂_c(h) |
| --- | --- | --- |
| 0.2 | 2.73645 | −0.01817 |
| 0.1 | 2.72285 | −0.00457 |
| 0.01 | 2.71850 | −0.00022 |
| 0.001 | 2.72000 | −0.00172 |
| 0.0001 | 2.70000 | 0.01828 |

🔍 Observations from the table

  • For h = 0.2 down to h = 0.01: error decreases (truncation error dominates).
  • For h = 0.01 to h = 0.001: error increases (rounding error starts to dominate).
  • For h = 0.0001: error is large (rounding error dominates completely).
  • The best accuracy occurs around h = 0.01, which is near h_opt = 0.0171.
  • This confirms the U-shaped error curve predicted by the theory.

📈 Figure 3.2 interpretation

  • The excerpt describes a figure showing three curves:
    • Truncation error bound h²m/6: decreases with h.
    • Rounding error bound ε/h: increases with h.
    • Total error bound φ(h): U-shaped, minimum at h_opt.
  • The figure visually confirms that for h < h_opt, rounding error dominates and can "explode" as h → 0.

💡 Practical implications

💡 How to improve accuracy

The excerpt lists two approaches when the derivative approximation is insufficiently accurate:

  1. Reduce rounding/measurement error in the function values (decrease ε).

    • This shifts the optimal h_opt to a smaller value and reduces the minimum error.
    • Example: use higher precision arithmetic or more accurate measurements.
  2. Use a higher-order difference formula.

    • Higher-order formulas have smaller truncation error R(h) for the same h.
    • Caution: h should not be too small, or rounding error will "annihilate the improvement."
    • Don't confuse: higher order helps with truncation, but does not eliminate the rounding error problem.

⚠️ When rounding errors are especially problematic

  • Large error in function values: inherently imprecise measurements (e.g., experimental data) have large ε.
  • Small h: if h is much smaller than h_opt, rounding error dominates.
  • The excerpt warns: "For h < h_opt, the measurement or rounding error may even explode when h → 0."
  • Computer-generated tables have smaller ε than measurement tables, but rounding is still noticeable for small h.

🔧 Context: solving differential equations

  • The excerpt notes that difference formulas are often used to solve differential equations (Chapter 7).
  • In that context, the goal is to determine the solution accurately, not the derivatives themselves.
  • "The behavior with respect to rounding errors is often much better in that case."
  • This suggests that the rounding error problem is more severe for direct differentiation than for solving ODEs.

🧠 Key takeaway

  • Do not blindly decrease h: there is an optimal step size that balances truncation and rounding errors.
  • Below h_opt, smaller h makes the approximation worse, not better.
  • The optimal h depends on both the rounding error ε and the smoothness of the function (measured by m).

General difference formulae for the first derivative

3.4 General difference formulae for the first derivative

🧭 Overview

🧠 One-sentence thesis

A general method exists to derive difference formulae for approximating derivatives by choosing node coefficients through Taylor expansion to maximize the order of the truncation error.

📌 Key points (3–5)

  • The general approach: use Taylor expansions around nodes to determine coefficients that maximize accuracy (maximize the order of the truncation error).
  • Rule of thumb for scaling: divide coefficients by h^k when approximating the k-th derivative, motivated by the limit definition of derivatives.
  • Central vs one-sided formulae: central differences use symmetric nodes (x-h, x, x+h) and achieve higher accuracy; one-sided formulae use nodes on one side only (useful at boundaries).
  • Common confusion: the number of nodes and their spacing—more nodes or specific spacing can increase accuracy, but each choice requires solving a different system of equations.
  • Connection to interpolation: difference formulae can also be derived by differentiating Lagrange interpolation polynomials, avoiding the need to solve coefficient systems directly.

🔧 The general construction method

🔧 General form of the difference formula

The general form of Q(h) for approximating f'(x) is: Q(h) = (1/h) × sum over i of (alpha_i × f_i), where f_i = f(x_i) and x_i = x + i×h.

  • The nodes x_i are spaced by step size h (though non-equidistant nodes are also possible).
  • The coefficients alpha_i are unknowns to be determined.
  • The division by h comes from the limit definition of the derivative: f'(x) = limit as h→0 of [f(x+h) - f(x)]/h.

📏 Rule of thumb for higher derivatives

Rule of thumb 1: In a difference formula for the k-th derivative, the coefficients must be divided by h^k.

  • For the first derivative: divide by h.
  • For the second derivative: divide by h².
  • This pattern follows from repeatedly applying the limit definition: f''(x) = limit as h→0 of [f'(x+h) - f'(x)]/h, which introduces one more division by h.

🎯 Determining coefficients via Taylor expansion

The method works as follows:

  1. Write Taylor expansions of f(x_i) about x in terms of h.
  2. Substitute these expansions into Q(h).
  3. Collect terms by powers of h and derivatives of f.
  4. Set up conditions so that:
    • The coefficient of f(x) equals 0.
    • The coefficient of f'(x) equals 1 (since we want to approximate f'(x)).
    • Coefficients of higher derivatives (f'', f''', etc.) equal 0 as far as possible.
  5. Solve the resulting system of equations for the alpha_i.

The more conditions you can satisfy, the higher the order of the truncation error (the better the accuracy).

📐 Central difference formula

📐 Setup and nodes

Example: Central difference using three nodes: x_{-1} = x - h, x_0 = x, x_1 = x + h.

The formula is: Q(h) = [alpha_{-1} × f(x-h) + alpha_0 × f(x) + alpha_1 × f(x+h)] / h.

📐 Taylor expansions

Expand each function value:

  • f(x - h) = f(x) - h×f'(x) + (h²/2)×f''(x) - (h³/6)×f'''(x) + O(h⁴)
  • f(x) = f(x)
  • f(x + h) = f(x) + h×f'(x) + (h²/2)×f''(x) + (h³/6)×f'''(x) + O(h⁴)

Substitute into Q(h) and collect terms by derivative order.

📐 Conditions and solution

The conditions are:

  • Coefficient of f(x): (alpha_{-1} + alpha_0 + alpha_1) / h = 0
  • Coefficient of f'(x): -alpha_{-1} + alpha_1 = 1
  • Coefficient of f''(x): (h/2)×alpha_{-1} + (h/2)×alpha_1 = 0

Solving gives: alpha_{-1} = -1/2, alpha_0 = 0, alpha_1 = 1/2.

Result: Q(h) = [f(x+h) - f(x-h)] / (2h), with truncation error O(h²).

  • This is the familiar central-difference formula.
  • The error is O(h²), meaning it is second-order accurate.
  • Don't confuse: this uses three nodes but only two appear in the final formula (alpha_0 = 0).

🔀 One-sided difference formula

🔀 When and why one-sided

One-sided formulae are needed at boundaries of an interval where you cannot use nodes on both sides of x.

Example: Approximating the derivative at the left boundary with O(h²) accuracy.

🔀 Setup with three forward nodes

Choose nodes: x_0 = x, x_1 = x + h, x_2 = x + 2h.

The formula is: Q(h) = [alpha_0 × f(x) + alpha_1 × f(x+h) + alpha_2 × f(x+2h)] / h.

🔀 Taylor expansions and conditions

Expand:

  • f(x) = f(x)
  • f(x + h) = f(x) + h×f'(x) + (h²/2)×f''(x) + O(h³)
  • f(x + 2h) = f(x) + 2h×f'(x) + 2h²×f''(x) + O(h³)

Conditions:

  • Coefficient of f(x): (alpha_0 + alpha_1 + alpha_2) / h = 0
  • Coefficient of f'(x): alpha_1 + 2×alpha_2 = 1
  • Coefficient of f''(x): (h/2)×alpha_1 + 2h×alpha_2 = 0

Solving gives: alpha_0 = -3/2, alpha_1 = 2, alpha_2 = -1/2.

Result: Q(h) = [-3×f(x) + 4×f(x+h) - f(x+2h)] / (2h), with truncation error O(h²).

  • This achieves the same accuracy as the central formula but uses only forward nodes.
  • Trade-off: one-sided formulae typically require more nodes for the same accuracy.
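The coefficient systems in this section all have the same Vandermonde-like structure, so they can be solved generically: for node offsets sᵢ and the k-th derivative, the conditions are Σᵢ αᵢ sᵢʲ = k!·δⱼₖ for j = 0, …, n−1. A sketch (our own `fd_weights` helper, plain Gaussian elimination):

```python
import math

def fd_weights(offsets, k):
    """Coefficients alpha_i so that (1/h^k) * sum_i alpha_i f(x + s_i h)
    approximates f^(k)(x). Assumes distinct offsets; an illustrative sketch."""
    n = len(offsets)
    A = [[s**j for s in offsets] for j in range(n)]   # row j: sum alpha_i s_i^j
    rhs = [float(math.factorial(k)) if j == k else 0.0 for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            mfac = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= mfac * A[col][c]
            rhs[r] -= mfac * rhs[col]
    alpha = [0.0] * n
    for r in range(n - 1, -1, -1):
        alpha[r] = (rhs[r] - sum(A[r][c] * alpha[c] for c in range(r + 1, n))) / A[r][r]
    return alpha

print(fd_weights([-1, 0, 1], 1))   # central: [-0.5, 0.0, 0.5]
print(fd_weights([0, 1, 2], 1))    # one-sided: [-1.5, 2.0, -0.5]
print(fd_weights([-1, 0, 1], 2))   # second derivative: [1.0, -2.0, 1.0]
```

The three printed weight sets reproduce the central, one-sided, and second-derivative formulas derived in this section.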

🔗 Connection to interpolation

🔗 Alternative derivation method

Instead of solving coefficient systems, you can:

  1. Build a Lagrange interpolation polynomial through the nodes.
  2. Differentiate the polynomial to approximate the derivative.

If x_0, x_1, ..., x_n do not coincide and f is sufficiently smooth, then f(x) = sum over k of [f(x_k) × L_{kn}(x)] + error term, where L_{kn} are Lagrange basis polynomials.

Differentiating both sides gives: f'(x) = sum over k of [f(x_k) × L'_{kn}(x)] + derivative of error term.

🔗 Evaluating at a node

When x = x_j (one of the interpolation nodes), the last term in the differentiated formula simplifies because (x - x_j) becomes zero.

The result is: f'(x_j) = sum over k of [f(x_k) × L'_{kn}(x_j)] + simplified error term.

🔗 Example: forward difference via interpolation

Example: For n=1, x = x_0, x_1 = x_0 + h:

  • L_{01}(x) = (x - x_1)/(x_0 - x_1), so L'_{01}(x) = -1/h
  • L_{11}(x) = (x - x_0)/(x_1 - x_0), so L'_{11}(x) = 1/h

This gives: f'(x_0) = (-1/h)×f(x_0) + (1/h)×f(x_0 + h) + (h/2)×f''(ξ).

This is exactly the forward-difference formula.

Advantage: This approach avoids solving a system of equations; you only need to differentiate the Lagrange basis polynomials.

📊 Higher-order derivatives

📊 Second derivative formula

To approximate f''(x) using central differences with nodes x-h, x, x+h:

The general form is: Q(h) = [alpha_{-1}×f(x-h) + alpha_0×f(x) + alpha_1×f(x+h)] / h².

Note the division by h² (following Rule of thumb 1).

📊 Conditions for the second derivative

Using the same Taylor expansions as before, the conditions become:

  • Coefficient of f(x): (alpha_{-1} + alpha_0 + alpha_1) / h² = 0
  • Coefficient of f'(x): (-alpha_{-1} + alpha_1) / h = 0
  • Coefficient of f''(x): (1/2)×alpha_{-1} + (1/2)×alpha_1 = 1

Solving gives: alpha_{-1} = 1, alpha_0 = -2, alpha_1 = 1.

Result: Q(h) = [f(x+h) - 2×f(x) + f(x-h)] / h², with truncation error O(h²).
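A quick numerical check of this formula and its O(h²) error (our own sketch, using f = sin so that f″ = −sin):

```python
import math

def second_central(f, x, h):
    # [f(x+h) - 2 f(x) + f(x-h)] / h^2, the formula derived above
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

x = 1.0
exact = -math.sin(x)                 # (sin)'' = -sin
errors = [abs(exact - second_central(math.sin, x, h)) for h in (0.1, 0.05)]
print(errors, errors[0] / errors[1])  # ratio ≈ 4, confirming O(h^2)
```

Halving h reduces the error by a factor of about four, as expected for a second-order formula.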

📊 Rule of thumb for error order

Rule of thumb 2: To derive a numerical method for the k-th derivative with truncation error O(h^p), the Taylor expansions should have a remainder term of order O(h^{p+k}).

  • For a first derivative with O(h²) error, you need O(h³) remainder terms.
  • For a second derivative with O(h²) error, you need O(h⁴) remainder terms.

📊 Repeated application approach

Another way to compute the second derivative: apply the first-derivative formula twice.

  • Approximate f''(x) by differencing first derivatives at the auxiliary midpoints x ± h/2: f''(x) ≈ [f'(x + h/2) - f'(x - h/2)] / h.
  • Each first derivative is itself approximated using a difference formula.
  • This requires at least three function values: f(x-h), f(x), f(x+h).
  • Example: using central differences for each step naturally leads to the same formula as the direct approach.

Relation between difference formulae and interpolation

3.5 Relation between difference formulae and interpolation *

🧭 Overview

🧠 One-sentence thesis

Derivatives of Lagrange interpolation polynomials provide an alternative, systematic way to derive difference formulae without solving large systems of equations.

📌 Key points (3–5)

  • Alternative derivation method: Instead of solving systems of equations (Section 3.4 approach), differentiate Lagrange interpolation polynomials to obtain difference formulae.
  • Core idea: The derivative of the interpolation polynomial approximates the derivative of the original function.
  • How it works: Start with the Lagrange interpolation formula, differentiate it, then evaluate at one of the interpolation nodes to eliminate problematic terms.
  • Common confusion: The differentiated interpolation formula contains a term that "generally cannot be evaluated," but evaluating at a node x_j makes that term vanish or simplify.
  • Practical outcome: This method reproduces known formulae (e.g., forward difference) and provides truncation error estimates automatically.

🔄 Why use interpolation instead of equation systems

🔄 Disadvantage of the Section 3.4 approach

  • The earlier method (Section 3.4) requires solving a system of equations.
  • As the desired order of accuracy increases, the number of unknowns grows, making the system larger and more cumbersome.

🔄 Interpolation-based alternative

  • Use Lagrange interpolation polynomials (introduced in Section 2.3) to approximate the function.
  • Then take the derivative of the interpolation polynomial as an approximation of the function's derivative.
  • This approach is "rather obvious" conceptually and avoids repeatedly setting up and solving equation systems.

🧮 Deriving difference formulae via differentiation

🧮 Starting point: Lagrange interpolation formula

If x₀, x₁, ..., xₙ do not coincide and f is sufficiently smooth (f ∈ C^(n+1)[a, b]), the Lagrange interpolation formula is:

f(x) = sum from k=0 to n of [f(x_k) L_kn(x)] + (x − x₀)···(x − xₙ) f^(n+1)(ξ(x)) / (n+1)!

  • L_kn(x) are the Lagrange basis polynomials.
  • The second term is the interpolation error (remainder term), where ξ(x) lies in the interval (a, b).

🧮 Differentiate the interpolation formula

Differentiate both sides with respect to x:

f'(x) = sum from k=0 to n of [f(x_k) L'_kn(x)] + d/dx[(x − x₀)···(x − xₙ)] f^(n+1)(ξ(x)) / (n+1)! + (x − x₀)···(x − xₙ) d/dx[f^(n+1)(ξ(x)) / (n+1)!]

  • The first term is a weighted sum of function values with derivatives of the basis polynomials as weights.
  • The last term "generally cannot be evaluated" because it involves the derivative of f^(n+1)(ξ(x)), where ξ(x) is unknown.

🧮 Evaluate at an interpolation node

Key trick: Set x = x_j (one of the interpolation nodes).

  • The product (x − x₀)···(x − xₙ) contains the factor (x_j − x_j) = 0, so the problematic last term vanishes.
  • The formula simplifies to:

f'(x_j) = sum from k=0 to n of [f(x_k) L'_kn(x_j)] + product over k≠j of [(x_j − x_k)] f^(n+1)(ξ(x_j)) / (n+1)!

  • The second term is now the truncation error, which can be estimated (though ξ(x_j) is still unknown, the order of the error is clear).

Don't confuse: The last term in the general differentiated formula cannot be evaluated for arbitrary x, but it simplifies when x equals one of the nodes.

📐 Example: Forward difference from interpolation

📐 Setup

  • Use n = 1 (two points): x₀ and x₁ = x₀ + h.
  • Evaluate the derivative at x = x₀.

📐 Lagrange basis polynomials and their derivatives

For n = 1:

| Basis polynomial | Expression | Derivative |
| --- | --- | --- |
| L₀₁(x) | (x − x₁) / (x₀ − x₁) | −1/h |
| L₁₁(x) | (x − x₀) / (x₁ − x₀) | 1/h |

📐 Resulting formula

Substitute into the differentiated interpolation formula:

f'(x₀) = (−1/h) f(x₀) + (1/h) f(x₀ + h) − (h/2) f''(ξ)

Simplify:

f'(x₀) = [f(x₀ + h) − f(x₀)] / h − (h/2) f''(ξ)

  • This is exactly the forward-difference formula derived earlier.
  • The truncation error is O(h); the minus sign comes from the error factor (x₀ − x₁)/2! = −h/2 in the node formula above.

Example: To approximate the derivative at x₀ using function values at x₀ and x₀ + h, the interpolation method automatically produces the forward-difference formula and its error term.
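A small check of this result, assuming f(x) = eˣ at x₀ = 1 (our choice, so f'' = f' = f): the measured error f'(x₀) − Q_f(h) should match the leading truncation term, which here is negative because the error factor (x₀ − x₁) = −h is negative.

```python
import math

x0, h = 1.0, 0.025
f = math.exp                       # f = f' = f'' = exp
q = (f(x0 + h) - f(x0)) / h        # forward difference Q_f(h)
measured = f(x0) - q               # actual error f'(x0) - Q_f(h); here f'(x0) = e
# Leading term: (x0 - x1) * f''(xi) / 2! = -(h/2) * f''(xi), with xi near x0
predicted = -(h / 2.0) * f(x0)
```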

🔗 Connection to higher-order derivatives

🔗 Second derivatives via interpolation

The excerpt briefly mentions that higher-order derivatives (Section 3.6) can also be approximated using similar approaches.

  • For the second derivative f''(x), one can:
    • Use more interpolation nodes (e.g., x − h, x, x + h).
    • Differentiate the interpolation polynomial twice.
    • Evaluate at a node to obtain a difference formula.

🔗 Repeated application of first-derivative formulae

An alternative to direct interpolation:

  • Approximate f''(x) by taking the difference of first derivatives at auxiliary points x ± h/2.
  • Apply central-difference schemes twice:

f''(x) ≈ [f'(x + h/2) − f'(x − h/2)] / h

  • Then approximate each first derivative using central differences:

f''(x) ≈ (1/h) [(f(x + h) − f(x))/h − (f(x) − f(x − h))/h]

  • Simplify to:

Q(h) = [f(x + h) − 2f(x) + f(x − h)] / h²

  • The truncation error is O(h²), specifically: f''(x) − Q(h) = −(h²/12) f⁽⁴⁾(ξ).

Don't confuse: Direct interpolation and repeated application of first-derivative formulae both yield the same central-difference formula for the second derivative, but the derivation paths differ.

🔗 Rounding error caution

  • For second derivatives, rounding error is "even more serious."
  • An upper bound for rounding error is: S(h) ≤ 4ε / h².
  • As h decreases, rounding error grows faster (inversely proportional to h²), so very small h can degrade accuracy.

Difference formulae of higher-order derivatives

3.6 Difference formulae of higher-order derivatives

🧭 Overview

🧠 One-sentence thesis

Higher-order derivatives can be approximated numerically using difference formulae built from Taylor expansions, and Richardson's extrapolation provides both practical error estimates and methods to increase accuracy.

📌 Key points (3–5)

  • Second derivatives from Taylor expansion: Central-difference formulae for the second derivative require at least three function values and have truncation error O(h²).
  • Repeated application approach: The second derivative can also be computed by taking differences of first derivatives at auxiliary midpoints, yielding the same formula.
  • Richardson's extrapolation for error estimation: By computing Q(h), Q(2h), and Q(4h), the order p and error constant can be estimated numerically without knowing higher derivatives.
  • Common confusion: Rounding errors grow faster for higher derivatives (proportional to 1/h² for second derivatives vs 1/h for first derivatives), making optimal step size selection critical.
  • Richardson's extrapolation for higher accuracy: When the error order p is known, combining Q(h) and Q(2h) produces a new formula with error O(h^(p+1)) instead of O(h^p).

🧮 Deriving the second-derivative formula

🧮 Central-difference formula via Taylor expansion

The goal is to approximate f''(x) using three points: x - h, x, and x + h.

Setup:

  • Write Q(h) = (α₋₁ f(x - h) + α₀ f(x) + α₁ f(x + h)) / h²
  • The division by h² is motivated by "Rule of thumb 1" (scaling for the kth derivative)

Taylor expansions about x:

  • f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + (h³/6) f'''(x) + O(h⁴)
  • f(x) = f(x)
  • f(x - h) = f(x) - h f'(x) + (h²/2) f''(x) - (h³/6) f'''(x) + O(h⁴)

Matching coefficients: To make Q(h) approximate f''(x), set up conditions:

  • Coefficient of f(x): (α₋₁ + α₀ + α₁) / h² = 0
  • Coefficient of f'(x): (-α₋₁ + α₁) / h = 0
  • Coefficient of f''(x): (α₋₁ + α₁) / 2 = 1

Solution: α₋₁ = 1, α₀ = -2, α₁ = 1

Central-difference formula for second derivative:
Q(h) = [f(x + h) - 2f(x) + f(x - h)] / h²

  • Truncation error: O(h²)
  • Rounding error bound: S(h) ≤ 4ε / h²
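The matching conditions above form a 3×3 linear system. It is easy to solve by hand, but the same procedure for higher orders produces larger systems, so as a sketch of the general approach (our code, plain Python) a tiny Gaussian elimination recovers α₋₁ = 1, α₀ = −2, α₁ = 1:

```python
# Augmented rows [coef of a_{-1}, a_0, a_1 | rhs] for the three conditions.
A = [[1.0, 1.0, 1.0, 0.0],    # f:   a_{-1} + a_0 + a_1 = 0
     [-1.0, 0.0, 1.0, 0.0],   # f':  -a_{-1} + a_1 = 0
     [0.5, 0.0, 0.5, 1.0]]    # f'': (a_{-1} + a_1)/2 = 1

n = 3
for col in range(n):                                   # forward elimination
    piv = max(range(col, n), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]                    # partial pivoting
    for r in range(col + 1, n):
        m = A[r][col] / A[col][col]
        A[r] = [a - m * b for a, b in zip(A[r], A[col])]

alpha = [0.0] * n
for r in range(n - 1, -1, -1):                         # back substitution
    s = sum(A[r][c] * alpha[c] for c in range(r + 1, n))
    alpha[r] = (A[r][n] - s) / A[r][r]
# alpha is [1.0, -2.0, 1.0], i.e. Q(h) = [f(x+h) - 2 f(x) + f(x-h)] / h^2
```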

🔁 Repeated application of first-derivative formula

An alternative derivation uses the first derivative twice.

Approach:

  • Approximate f''(x) ≈ [f'(x + h/2) - f'(x - h/2)] / h
  • Introduce auxiliary points at x ± h/2 to avoid needing function values at x ± 2h
  • Apply central differences to each first derivative:
    • f'(x + h/2) ≈ [f(x + h) - f(x)] / h
    • f'(x - h/2) ≈ [f(x) - f(x - h)] / h

Result:

  • f''(x) ≈ (1/h) × {[f(x + h) - f(x)] / h - [f(x) - f(x - h)] / h}
  • Simplifies to Q(h) = [f(x + h) - 2f(x) + f(x - h)] / h²

Truncation error from Taylor polynomials:

  • f''(x) - Q(h) = -(h²/12) f⁽⁴⁾(ξ) for some ξ in (x - h, x + h)

Don't confuse: This is the same formula as the Taylor-expansion method, but derived by composition rather than direct coefficient matching.

🔧 Generalized form for (vf')'

For certain differential equations, the derivative (vf')' must be approximated, where v is a given function.

Formula:

  • [(vf')(x + h/2) − (vf')(x − h/2)] / h
  • Applying central differences:
    [v(x + h/2)(f(x + h) - f(x)) - v(x - h/2)(f(x) - f(x - h))] / h²
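In code, the scheme reads as follows (our sketch; the test case v(x) = x, f(x) = x², with (vf')' = (2x²)' = 4x, is our own choice and is reproduced essentially exactly because the degrees are low):

```python
def vfp_diff(f, v, x, h):
    """Central approximation of (v f')'(x): uses f at x-h, x, x+h and
    v at the half-points x - h/2 and x + h/2."""
    return (v(x + h / 2.0) * (f(x + h) - f(x))
            - v(x - h / 2.0) * (f(x) - f(x - h))) / h**2

# Test case: v(x) = x, f(x) = x^2, so (v f')'(x) = 4 x.
approx = vfp_diff(lambda t: t * t, lambda t: t, 1.0, 0.1)
```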

📏 Rule of thumb for higher derivatives

📏 General principle

Rule of thumb 2:
To derive a numerical method for the kth derivative with truncation error O(h^p), the Taylor expansions should have a remainder term of order O(h^(p+k)).

Why this matters:

  • For the second derivative (k=2) with error O(h²) (p=2), Taylor expansions need O(h⁴) terms.
  • This guides how many terms to include when matching coefficients.

🔍 Richardson's extrapolation: practical error estimation

🔍 The error model

Richardson's extrapolation assumes the error has the form:

  • M - Q(h) = c_p h^p + O(h^(p+1))
  • M is the unknown true value
  • c_p ≠ 0 and p ∈ ℕ (the error order)

Applicability:

  • Works for difference formulae, interpolation, numerical integration, etc.
  • Requires the error to be expressible as a Taylor series

🔍 Estimating p and the error

For sufficiently small h, approximate M - Q(h) ≈ c_p h^p.

Three-point method: Compute Q(h), Q(2h), and Q(4h) to get:

  • M - Q(4h) = c_p (4h)^p
  • M - Q(2h) = c_p (2h)^p
  • M - Q(h) = c_p h^p

Eliminate M by subtraction:

  • Q(2h) - Q(4h) = c_p (2h)^p (2^p - 1)
  • Q(h) - Q(2h) = c_p h^p (2^p - 1)

Eliminate c_p and h by division:

  • [Q(2h) - Q(4h)] / [Q(h) - Q(2h)] = 2^p
  • Solve for p (should be close to the theoretical value)

Recover the error:

  • From Q(h) - Q(2h) = c_p h^p (2^p - 1), find c_p h^p
  • Error estimate: M - Q(h) = [Q(h) - Q(2h)] / (2^p - 1)

Example (forward difference for f(x) = e^x at x=1):

  • h = 0.025, Q(h) = 2.7525, Q(2h) = 2.7873, Q(4h) = 2.8588
  • [Q(2h) - Q(4h)] / [Q(h) - Q(2h)] = (-0.0714) / (-0.0348) = 2.0509 ≈ 2^p → p ≈ 1
  • Error estimate: M - Q(h) = -0.0348
  • Exact error: e - 2.7525 = -0.0342 (very close)
  • Improved approximation: Q(h) + c_p h^p = 2.7525 - 0.0348 = 2.7177
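The whole walkthrough is easy to reproduce (our script; same setup, f(x) = eˣ at x = 1 with h = 0.025):

```python
import math

def q(h):
    # Forward difference for f(x) = e^x at x = 1
    return (math.exp(1.0 + h) - math.exp(1.0)) / h

h = 0.025
ratio = (q(2 * h) - q(4 * h)) / (q(h) - q(2 * h))  # ~ 2^p
p = round(math.log2(ratio))                        # p = 1 here
est_error = (q(h) - q(2 * h)) / (2**p - 1)         # estimate of M - Q(h)
true_error = math.e - q(h)
improved = q(h) + est_error                        # ~ 2.7177
```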

🔍 Practical complications

Why compute p numerically even when theory gives it?

  • Higher derivatives may not exist or be bounded
  • Combined approximation methods may have unclear p
  • Implementation errors in code

Good practice: Verify that computed p matches theoretical p to catch these issues.

🚀 Richardson's extrapolation: increasing accuracy

🚀 Deriving higher-order formulae

When p is known, Richardson's extrapolation can create a more accurate formula.

Starting from:

  • M - Q(h) = c_p h^p + O(h^(p+1))
  • M - Q(2h) = c_p (2h)^p + O(h^(p+1))

Eliminate c_p h^p:

  • Multiply the first by 2^p and subtract the second:
  • 2^p (M - Q(h)) - (M - Q(2h)) = O(h^(p+1))
  • (2^p - 1) M = 2^p Q(h) - Q(2h) + O(h^(p+1))

Richardson's higher-accuracy formula:
M = [2^p Q(h) - Q(2h)] / (2^p - 1) + O(h^(p+1))

The combination [2^p Q(h) - Q(2h)] / (2^p - 1) is one order more accurate than Q(h).

🚀 Example: forward difference of higher accuracy

Forward difference has error:

  • f'(x) - Q_f(h) = c₁ h + O(h²) (p = 1)
  • f'(x) - Q_f(2h) = c₁ 2h + O(h²)

Apply Richardson's extrapolation:

  • 2 × [f'(x) - Q_f(h)] - [f'(x) - Q_f(2h)] = O(h²)
  • f'(x) - [2 Q_f(h) - Q_f(2h)] = O(h²)

Expand:

  • 2 Q_f(h) - Q_f(2h) = 2 [f(x + h) - f(x)] / h - [f(x + 2h) - f(x)] / (2h)
  • = [-3f(x) + 4f(x + h) - f(x + 2h)] / (2h)

This is the same as the three-point forward formula with O(h²) error.
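Both facts, the algebraic identity and the O(h²) behaviour, can be checked directly (our test, using f(x) = eˣ at x = 1):

```python
import math

def qf(f, x, h):   # forward difference, O(h)
    return (f(x + h) - f(x)) / h

def q3(f, x, h):   # three-point forward formula, O(h^2)
    return (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)

x, h = 1.0, 0.1
extrap = 2 * qf(math.exp, x, h) - qf(math.exp, x, 2 * h)
gap = abs(extrap - q3(math.exp, x, h))     # algebraic identity: ~0
e1 = abs(q3(math.exp, x, h) - math.e)      # f'(1) = e
e2 = abs(q3(math.exp, x, h / 2) - math.e)
ratio = e1 / e2                            # ~4: error is O(h^2)
```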

⚠️ Rounding errors and optimal step size

⚠️ Rounding error growth

For second derivatives:

  • Rounding error bound: S(h) ≤ 4ε / h²
  • ε is the relative machine precision (e.g., 10⁻¹⁶)

Comparison with first derivatives:

| Derivative | Truncation error | Rounding error bound |
| --- | --- | --- |
| First | O(h^p) | ~ε / h |
| Second | O(h^p) | ~ε / h² |

Implication:

  • Rounding errors are "even more serious" for higher derivatives.
  • As h decreases, truncation error shrinks but rounding error explodes faster.

⚠️ Finding optimal h

Total error = truncation + rounding:

  • For second derivative with O(h²) truncation: Total ≤ C h² + 4ε / h²
  • Minimize by taking derivative with respect to h and setting to zero
  • Optimal h balances the two error sources

Example scenario (Exercise 3):

  • f(x) = sin x at x = 1, central difference for f''(x)
  • Theoretical optimal h can be computed from the error bound
  • Practical experiment: start with h = 1, divide by 10 repeatedly, observe where error is minimal
  • The computed optimal h should agree with theory

Don't confuse: Smaller h is not always better—beyond the optimal point, rounding errors dominate and total error increases.
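A sketch of that experiment in Python (our code; the tolerances in the checks are our own choices):

```python
import math

def second_diff(f, x, h):
    # Central difference for f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

x = 1.0
exact = -math.sin(x)                         # f'' for f = sin
hs = [10.0**(-k) for k in range(9)]          # h = 1, 0.1, ..., 1e-8
errs = [abs(second_diff(math.sin, x, h) - exact) for h in hs]
best = hs[errs.index(min(errs))]
# Expect the minimum near h ~ (48*eps)^(1/4), a few times 1e-4: truncation
# ~ C*h^2 dominates for large h, rounding ~ 4*eps/h^2 dominates for small h.
```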


Richardson's extrapolation

3.7 Richardson’s extrapolation

🧭 Overview

🧠 One-sentence thesis

Richardson's extrapolation uses multiple approximations at different step sizes to either estimate the numerical error practically or to construct higher-accuracy formulas from lower-order methods.

📌 Key points (3–5)

  • Core assumption: the method requires that the error has the form M − Q(h) = c_p h^p + O(h^(p+1)), where M is the true value, Q(h) is the approximation, c_p is a constant, and p is the error order.
  • Two main uses: (1) practical error estimation when the error order p is unknown, and (2) constructing higher-accuracy formulas when p is known.
  • Practical error estimate: by computing Q(h), Q(2h), and Q(4h), you can numerically determine p and estimate the error M − Q(h) without knowing higher-order derivatives.
  • Higher-accuracy formulas: when p is known, combining Q(h) and Q(2h) via the formula (2^p Q(h) − Q(2h))/(2^p − 1) yields a new approximation with error order O(h^(p+1)).
  • Common confusion: theoretical truncation error estimates often contain unknown higher-order derivatives, making them "useless in practice"; Richardson's extrapolation solves this by using only computable quantities.

🔍 The fundamental error assumption

🔍 Error form requirement

Richardson's extrapolation assumes the error has the form: M − Q(h) = c_p h^p + O(h^(p+1)), where c_p ≠ 0 and p ∈ ℕ.

  • M: the true (unknown) value you want to approximate.
  • Q(h): the numerical approximation using step size h.
  • c_p: a constant coefficient (unknown in practice).
  • p: the order of the error (a positive integer).
  • Why this form: the error can be expressed as a Taylor series, and p is the leading power of h in the error term.

🧩 Applicability beyond differentiation

  • The excerpt states that Richardson's extrapolation applies to difference formulas, but also to "other types of approximations (interpolation, numerical integration etc.)."
  • The only requirement: the error must have the form shown in equation (3.10).

🧮 Practical error estimation (when p is unknown)

🧮 The three-point method

When you don't know p or c_p, compute three approximations:

  • Q(4h), Q(2h), and Q(h).

This gives three equations:

  • M − Q(4h) = c_p (4h)^p
  • M − Q(2h) = c_p (2h)^p
  • M − Q(h) = c_p h^p

🔢 Eliminating unknowns step-by-step

  1. Subtract equations to eliminate M:

    • Q(2h) − Q(4h) = c_p (2h)^p (2^p − 1)
    • Q(h) − Q(2h) = c_p h^p (2^p − 1)
  2. Divide these two results to eliminate c_p and h:

    • [Q(2h) − Q(4h)] / [Q(h) − Q(2h)] = 2^p
  3. Solve for p: take the logarithm or recognize the power of 2.

  4. Substitute p back into Q(h) − Q(2h) = c_p h^p (2^p − 1) to find c_p h^p.

  5. Error estimate: M − Q(h) ≈ c_p h^p = [Q(h) − Q(2h)] / (2^p − 1).

📊 Example walkthrough (forward difference for e^x at x=1)

| Grid size | Approximation | Difference |
| --- | --- | --- |
| 4h = 0.1 | Q(4h) = 2.8588... | |
| 2h = 0.05 | Q(2h) = 2.7873... | Q(2h) − Q(4h) = −0.0714... |
| h = 0.025 | Q(h) = 2.7525... | Q(h) − Q(2h) = −0.0348... |
  • Compute ratio: (−0.0714...) / (−0.0348...) = 2.0509... ≈ 2^p, so p ≈ 1.
  • Since p should be an integer, take p = 1.
  • Error estimate: M − Q(h) = [Q(h) − Q(2h)] / (2^1 − 1) = −0.0348...
  • Exact error: e − 2.7525... = −0.0342..., so the estimate is very reliable.
  • Improved approximation: Q(h) + c_p h^p = 2.7525... − 0.0348... = 2.7177...

⚠️ Practical complications and verification

The excerpt lists three complications:

  1. Unknown smoothness: you may not know if higher-order derivatives exist or are bounded.
  2. Combined methods: the final result may combine various approximations, making the effective p unclear.
  3. Implementation errors: bugs in code can distort p.

Best practice: always verify that the computed p is close to the theoretical p (if known) to catch these issues.

🚀 Higher-accuracy formulas (when p is known)

🚀 Deriving the improved formula

If you already know p (e.g., from theory), you can construct a more accurate approximation:

  • Start with the two error equations:

    • M − Q(h) = c_p h^p + O(h^(p+1))
    • M − Q(2h) = c_p (2h)^p + O(h^(p+1))
  • Multiply the first by 2^p and subtract the second:

    • 2^p (M − Q(h)) − (M − Q(2h)) = 2^p c_p h^p − c_p (2h)^p + O(h^(p+1))
    • The leading error terms cancel, leaving O(h^(p+1)).
  • Rearrange:

    • (2^p − 1) M = 2^p Q(h) − Q(2h) + O(h^(p+1))
    • M = [2^p Q(h) − Q(2h)] / (2^p − 1) + O(h^(p+1))

The new approximation [2^p Q(h) − Q(2h)] / (2^p − 1) has accuracy one order higher than Q(h).
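Packaged as a reusable helper (a sketch, valid only when the error model M − Q(h) = c_p h^p + O(h^(p+1)) holds; the demo reuses the forward difference for f(x) = eˣ at x = 1):

```python
import math

def richardson(q, h, p):
    """One Richardson step: combine Q(h) and Q(2h) into an approximation
    with error O(h^(p+1)), assuming M - Q(h) = c_p h^p + O(h^(p+1))."""
    return (2**p * q(h) - q(2 * h)) / (2**p - 1)

# Demo: forward difference for f(x) = e^x at x = 1 has p = 1.
q_fd = lambda h: (math.exp(1.0 + h) - math.exp(1.0)) / h
improved = richardson(q_fd, 0.025, p=1)
```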

📐 Example: forward difference upgraded

  • Original forward difference: f'(x) − Q_f(h) = c_1 h + O(h²), so p = 1.

  • Apply Richardson's extrapolation with p = 1:

    • New formula = [2 Q_f(h) − Q_f(2h)] / (2 − 1) = 2 Q_f(h) − Q_f(2h).
  • Expand:

    • 2 [f(x+h) − f(x)] / h − [f(x+2h) − f(x)] / (2h)
    • = [−3 f(x) + 4 f(x+h) − f(x+2h)] / (2h)
  • Result: this formula has truncation error O(h²), matching the asymmetric three-point formula (3.6) from earlier in the chapter.

🔄 Don't confuse: error estimation vs. accuracy improvement

| Purpose | When to use | What you get |
| --- | --- | --- |
| Practical error estimate | p unknown; want to know how accurate Q(h) is | Numerical estimate of M − Q(h); can add it to Q(h) for a better approximation |
| Higher-accuracy formula | p known; want a better method | A new formula with error order O(h^(p+1)) instead of O(h^p) |

Both use the same underlying algebra, but the goals differ.

🧪 Context: why Richardson's extrapolation matters

🧪 The problem with theoretical error bounds

  • Earlier sections presented truncation error estimates like f''(x) − Q(h) = −(h²/12) f^(4)(ξ).
  • The catch: if you don't know f' (the first derivative), you certainly can't compute f^(4) (the fourth derivative).
  • The excerpt states: "these estimates are often of theoretical importance, but useless in practice."

🛠️ Richardson's extrapolation as a practical solution

  • No higher derivatives needed: you only compute Q(h), Q(2h), Q(4h)—all computable quantities.
  • Two benefits:
    1. Estimate the error numerically (Section 3.7.2).
    2. Build higher-order methods from low-order ones (Section 3.7.3).

🔗 Connection to second derivatives

The excerpt begins with a formula for the second derivative:

  • Q(h) = [f(x+h) − 2f(x) + f(x−h)] / h²
  • Truncation error: f''(x) − Q(h) = −(h²/12) f^(4)(ξ)
  • Rounding error bound: S(h) ≤ 4ε / h², where ε is machine precision.
  • The excerpt notes that "the effect of rounding errors is even more serious here" (because the denominator is h², not h).

This context motivates Richardson's extrapolation: when rounding and truncation errors compete, you need a practical way to balance them—Richardson's method provides that.


Nonlinear Equations: Introduction

4.1 Introduction

🧭 Overview

🧠 One-sentence thesis

This chapter introduces numerical iterative methods for solving nonlinear equations of the form f(p) = 0, focusing on convergence behavior and practical stopping criteria for approximating roots.

📌 Key points (3–5)

  • Core problem: finding zeros (roots) of nonlinear equations using iterative sequences that converge to the solution p.
  • Convergence order: higher-order methods (e.g., quadratic α=2) converge faster than lower-order methods (e.g., linear α=1); the asymptotic constant λ is less important than the order α.
  • Stopping criteria trade-offs: ideal criterion |p − pₙ| < ε is impractical because p is unknown; practical alternatives like |pₙ − pₙ₋₁| < ε can fail if successive approximations are close but far from the true solution.
  • Common confusion: small difference between successive approximations (|pₙ − pₙ₋₁| < ε) does not guarantee the approximation is close to the true solution (|p − pₙ| may still be large).
  • Real-world motivation: determining the friction coefficient w in turbulent fluid flow requires solving a nonlinear equation involving Reynolds number and experimental parameters.

🌊 Motivating application: turbulent flow

🌊 The friction coefficient problem

  • Context: pressure drop in turbulent pipe flow (Reynolds number Re > 3000) depends on a friction coefficient w.
  • The friction coefficient satisfies a nonlinear equation:
    • 1/√w = ln(Re√w) + 14 − 5.6/k
    • where Re is the Reynolds number and k is an experimentally known parameter.
  • Why numerical methods are needed: this equation cannot be solved algebraically for w; iterative methods are required when Re and k are given.

🔢 Reynolds number and flow regimes

| Flow type | Reynolds number Re | Characteristics |
| --- | --- | --- |
| Laminar | Re < 2100 | Low flow velocity |
| Transitional | 2100 ≤ Re ≤ 3000 | Neither laminar nor turbulent |
| Turbulent | Re > 3000 | Higher velocity; friction coefficient w needed |
  • Reynolds number formula: Re = Dv/ν
    • D = pipe diameter (meters)
    • v = average flow velocity (m/s)
    • ν = fluid viscosity (m²/s)

🎯 Core definitions

🎯 Zeros and roots

Zero of a function f: a point p where f(p) = 0. Root of an equation: the same point p, viewed as the solution to f(x) = 0.

  • These terms are interchangeable.
  • The goal of this chapter: find p numerically when algebraic solutions are unavailable.

🔁 Iterative sequences

  • Numerical methods generate a sequence {pₙ} = p₀, p₁, p₂, ...
  • Desired behavior: lim (n→∞) pₙ = p (the sequence converges to the true solution).
  • Each method starts from an initial guess p₀ and refines it step by step.

📈 Convergence behavior

📈 Order of convergence

Convergence with order α and asymptotic constant λ: if lim (n→∞) |p − pₙ₊₁| / |p − pₙ|^α = λ, where λ and α are positive constants.

  • What it means: the error at step n+1 shrinks proportionally to the α-th power of the error at step n.
  • Higher order = faster convergence: a method with α=2 reduces error much faster than α=1.
  • Asymptotic constant λ is less important: the order α dominates the convergence speed.

🚀 Linear vs quadratic convergence

| Order α | Name | Behavior | Importance |
| --- | --- | --- | --- |
| α = 1 | Linear convergence | Error shrinks by a constant factor each step | λ is called the asymptotic convergence factor |
| α = 2 | Quadratic convergence | Error shrinks proportionally to its square | Much faster; λ matters less |
  • Example: if α=2, an error of 0.01 becomes roughly 0.0001 in the next step (squared), whereas α=1 only multiplies by a constant factor.
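The definition can be tested on a concrete sequence. Our sketch uses the iteration pₙ₊₁ = (pₙ + 2/pₙ)/2 (Newton's method for x² − 2 = 0), which converges to √2 with order α = 2 and asymptotic constant λ = 1/(2√2) ≈ 0.354:

```python
import math

p_exact = math.sqrt(2.0)
p = 1.0
errors = [abs(p_exact - p)]
for _ in range(4):
    p = (p + 2.0 / p) / 2.0        # Newton step for x^2 - 2 = 0
    errors.append(abs(p_exact - p))

# |p - p_{n+1}| / |p - p_n|^2 should approach lambda = 1/(2*sqrt(2))
ratios = [errors[n + 1] / errors[n]**2 for n in range(len(errors) - 1)]
```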

🔒 Convergence guarantee (Theorem 4.2.1)

  • Condition: if |p − pₙ| ≤ k|p − pₙ₋₁| for all n, where 0 ≤ k < 1, then the sequence converges to p.
  • Why it works: by induction, |p − pₙ| ≤ kⁿ|p − p₀|; since k < 1, kⁿ → 0 as n → ∞.
  • Key requirement: the contraction factor k must be strictly less than 1.

🛑 Stopping criteria

🛑 Why stopping criteria are needed

  • Iterative methods run indefinitely in theory; practical computation must stop at some finite step n.
  • The challenge: decide when pₙ is "close enough" to the unknown true solution p.

✅ Ideal but impractical criterion

  • Criterion 1: |p − pₙ| < ε (error below tolerance ε).
  • Problem: p is not known in practice, so this cannot be computed.
  • This criterion is mentioned to clarify what we wish we could check.

⚠️ Successive-approximation criterion

  • Criterion 2: |pₙ − pₙ₋₁| < ε (two successive approximations are close).
  • Advantage: computable without knowing p.
  • Danger: for some methods, |pₙ − pₙ₋₁| can be very small while |p − pₙ| is still large (the method stops at a wrong approximation).
  • Don't confuse: closeness of successive steps does not guarantee closeness to the true solution.

📏 Relative-error criterion

  • Criterion 3: |pₙ − pₙ₋₁| / |pₙ| < ε (if p ≠ 0).
  • Why relative: when p is very large or very small, absolute error is misleading; relative error scales appropriately.
  • Example: an absolute error of 0.01 is large if p ≈ 0.1 but small if p ≈ 1000.

🎯 Function-value criterion

  • Criterion 4: |f(pₙ)| < ε (the function value at pₙ is close to zero).
  • Rationale: if f(pₙ) ≈ 0, then pₙ is approximately a root.
  • Condition for usefulness: the first derivative of f must exist and be continuous near p (this allows bounding the error; details not provided in this excerpt).
  • This criterion checks the residual rather than the approximation directly.


4.2 Definitions

🧭 Overview

🧠 One-sentence thesis

Iterative methods for solving nonlinear equations generate sequences that converge to a root at different rates, and practical stopping criteria must balance accuracy with the inherent uncertainty in computed function values.

📌 Key points (3–5)

  • What the section defines: zeros/roots of nonlinear equations, convergence order and asymptotic constants, stopping criteria, and uncertainty intervals.
  • Convergence order: higher-order methods (e.g., quadratic, α = 2) converge faster than lower-order methods (e.g., linear, α = 1); the asymptotic constant λ is less important than the order α.
  • Common confusion: a small difference between successive approximations (|pₙ − pₙ₋₁| < ε) does not guarantee that pₙ is close to the true root p; some methods can satisfy this criterion while |p − pₙ| ≫ ε.
  • Uncertainty intervals: when function values are only known approximately (within ε̄), the true root cannot be pinpointed exactly; the uncertainty interval width depends on |f′(p)|, making root-finding ill-posed when the derivative is near zero.
  • Why it matters: understanding convergence rates and stopping criteria is essential for choosing and implementing iterative methods to solve f(p) = 0.

🎯 Problem setup and terminology

🎯 Zeros and roots

A zero of the function f is a point p such that f(p) = 0; equivalently, p is a root of the equation f(x) = 0.

  • The chapter discusses iterative methods to find such points numerically.
  • Example: the friction coefficient w in the turbulent-flow equation satisfies a nonlinear equation; finding w means finding the root of that equation.

🔄 Iterative sequences

  • Each numerical method generates a sequence {pₙ} = p₀, p₁, p₂, … intended to converge to the true root p.
  • The initial guess is p₀; subsequent terms are computed by the method's rule.
  • Goal: lim (n → ∞) pₙ = p.

📈 Convergence: order and rate

📈 Definition of convergence order

A sequence {pₙ} converges to p with order α and asymptotic constant λ if there exist positive constants λ and α such that
lim (n → ∞) [|p − pₙ₊₁| / |p − pₙ|^α] = λ,
assuming pₙ ≠ p for all n.

  • Order α measures how fast the error shrinks: higher α means faster convergence.
  • Asymptotic constant λ is a proportionality factor; the excerpt notes that λ is "less important" than α.

🔢 Two important cases

| Order α | Name | Meaning |
| --- | --- | --- |
| α = 1 | Linear convergence | Error shrinks by a constant factor each step; λ is called the asymptotic convergence factor |
| α = 2 | Quadratic convergence | Error shrinks quadratically; much faster than linear |
  • In general, a higher-order method converges faster than a lower-order method.
  • Example: quadratic convergence (α = 2) means that if the error is 0.01 at step n, it might be roughly 0.0001 at step n+1, whereas linear convergence would only reduce it to, say, 0.001.

✅ Theorem 4.2.1: sufficient condition for convergence

Theorem 4.2.1: Suppose a sequence {pₙ} satisfies
|p − pₙ| ≤ k |p − pₙ₋₁|, for n = 1, 2, …, where 0 ≤ k < 1.
Then lim (n → ∞) pₙ = p: the sequence is convergent.

  • Why it works: by induction, |p − pₙ| ≤ kⁿ |p − p₀|; since k < 1, kⁿ → 0 as n → ∞.
  • This theorem guarantees convergence when the error contracts by a factor k < 1 at each step.
  • Don't confuse: this is a sufficient condition (if it holds, convergence is guaranteed), not a necessary one (convergence can occur even if this specific inequality does not hold).

🛑 Stopping criteria

🛑 Why stopping criteria are needed

  • Iterative methods run indefinitely in theory; in practice, we must decide when to stop.
  • The excerpt lists four common stopping criteria, each with trade-offs.

🎯 Criterion 1: |p − pₙ| < ε

  • What it is: stop when the approximation pₙ is within ε of the true root p.
  • Why it's ideal: directly measures the error we care about.
  • Why it's impractical: the true root p is not known in general, so this criterion cannot be used in practice.

🔁 Criterion 2: |pₙ − pₙ₋₁| < ε

  • What it is: stop when two successive approximations are close.
  • Pitfall: for some methods, |pₙ − pₙ₋₁| can be small while |p − pₙ| is still large (≫ ε).
  • Common confusion: small step size does not guarantee closeness to the true root; the method may stop at a wrong approximation.
  • Example: a slowly converging sequence might have pₙ and pₙ₋₁ very close to each other but both far from p.

📊 Criterion 3: relative error |pₙ − pₙ₋₁| / |pₙ| < ε

  • What it is: use relative difference instead of absolute difference.
  • Why it's better: accounts for the scale of p; meaningful whether p is very large or very small.
  • Assumption: p ≠ 0 (otherwise division by zero or near-zero).

📉 Criterion 4: |f(pₙ)| < ε

  • What it is: stop when the function value at pₙ is close to zero.
  • Why it's useful: if f(pₙ) is small, pₙ is "almost" a root.
  • Error bound: if f is continuously differentiable near p, the mean value theorem gives
    |f(p) − f(pₙ)| = |f′(ξ)| · |p − pₙ| for some ξ between p and pₙ.
    Since f(p) = 0 and |f(pₙ)| < ε, we have |f′(ξ)| · |p − pₙ| < ε.
    Defining m = min{|f′(ξ)|, ξ between p and pₙ}, it follows that
    |p − pₙ| < ε / m, if m ≠ 0 (equation 4.2).
  • Interpretation: the actual error in pₙ is bounded by ε divided by the minimum slope magnitude; a steep function (large |f′|) means small error, a flat function (small |f′|) means large error.
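A sketch of how criteria 3 and 4 fit into an iteration loop (the helper, its name, and its defaults are our own, purely for illustration):

```python
def iterate(g, f, p0, eps=1e-10, max_iter=100):
    """Fixed-point iteration p_{n+1} = g(p_n), stopping on the relative
    criterion |p_n - p_{n-1}| / |p_n| < eps or the residual |f(p_n)| < eps."""
    p_prev = p0
    for n in range(1, max_iter + 1):
        p = g(p_prev)
        if abs(p - p_prev) / abs(p) < eps:   # criterion 3 (assumes p != 0)
            return p, n
        if abs(f(p)) < eps:                  # criterion 4 (residual)
            return p, n
        p_prev = p
    return p_prev, max_iter

# Demo: g is the Newton map for f(x) = x^2 - 2; the root is sqrt(2).
root, steps = iterate(lambda x: (x + 2.0 / x) / 2.0,
                      lambda x: x * x - 2.0, 1.0)
```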

🌫️ Uncertainty intervals

🌫️ The problem: approximate function values

  • In numerical simulations, exact function values f(x) are often unknown.
  • Instead, we have an approximation f̂(x) satisfying |f(x) − f̂(x)| ≤ ε̄.
  • Consequence: using the perturbed function f̂, we cannot determine the zero of f exactly.

📏 Definition of the uncertainty interval

The uncertainty interval I is the set of all points that could be a solution:
I = {x ∈ [a, b] | |f(x)| < ε̄}.

  • Any point in I is indistinguishable from a true root given the measurement error ε̄.
  • The width of I depends on how steeply f changes near the root.

📐 Computing the interval width

  • Let p be the true zero of f, so f(p) = 0.
  • Consider the maximally perturbed function f̂(x) = f(x) + ε̄; its zero is denoted x⁺.
  • Linearizing f(x⁺) about p:
    0 = f̂(x⁺) = f(x⁺) + ε̄ ≈ f(p) + (x⁺ − p) f′(p) + ε̄ = (x⁺ − p) f′(p) + ε̄.
    Solving: x⁺ ≈ p − ε̄ / f′(p).
  • Similarly, for f̂(x) = f(x) − ε̄, the zero is x⁻ ≈ p + ε̄ / f′(p).
  • Since f′(p) can be positive or negative, the general form is:
    I ≈ [p − ε̄ / |f′(p)|, p + ε̄ / |f′(p)|], if f′(p) ≠ 0.
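The half-width formula can be illustrated numerically. A minimal sketch under assumed data (f(x) = x² − 2, root p = √2, noise level ε̄ = 10⁻³; all values are ours):

```python
import math

def f(x):
    return x * x - 2.0

p = math.sqrt(2.0)                      # true root
eps_bar = 1e-3                          # assumed noise level
half_width = eps_bar / abs(2.0 * p)     # eps_bar / |f'(p)|, with f'(x) = 2x
# a point well inside the uncertainty interval has |f(x)| below the
# noise level, so it is indistinguishable from the true root
x = p + 0.9 * half_width
assert abs(f(x)) < eps_bar
```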

⚠️ Ill-posed problems

  • If |f′(p)| is close to 0, the uncertainty interval becomes very wide.
  • Interpretation: finding the root of f(p) = 0 is an ill-posed problem when the derivative is near zero; small errors in function values lead to large errors in the computed root.
  • Practical implication: if ε < ε̄ (the stopping tolerance is smaller than the measurement error), the stopping criterion |f(pₙ)| < ε is useless; the algorithm may continue iterating even though pₙ is already within the uncertainty interval I.
  • Don't confuse: the problem is not with the iterative method itself, but with the inherent limitation imposed by noisy function evaluations.

A Simple Root Finder: The Bisection Method

4.3 A simple root finder: the Bisection method

🧭 Overview

🧠 One-sentence thesis

The Bisection method guarantees convergence to a root by repeatedly halving an interval where the function changes sign, making it a reliable (though slow) starting-point generator for more efficient root-finding algorithms.

📌 Key points (3–5)

  • Core mechanism: repeatedly divide the interval in half and keep the subinterval where the function changes sign.
  • Guaranteed convergence: always converges to a solution (unconditional convergence), unlike faster methods that may fail.
  • Convergence speed: generally slow; the error bound shrinks by half each iteration.
  • Common confusion: the method may not improve monotonically—sometimes the distance to the true root can increase from one step to the next.
  • Practical role: useful for computing a starting estimate for more efficient methods discussed later.

🔍 Foundation: the intermediate-value theorem

🔍 What the theorem guarantees

Intermediate-value theorem application: If a continuous function f on interval [a, b] has f(a) and f(b) with opposite signs (f(a) · f(b) < 0), then there exists a number p in (a, b) where f(p) = 0.

  • The method relies on this theorem to ensure a root exists in the chosen interval.
  • Opposite signs mean the function crosses zero somewhere between a and b.
  • Example: if f(a) is negative and f(b) is positive, the continuous function must pass through zero.

🎯 How the method uses the theorem

  • Start with an interval [a, b] where the function has opposite signs at the endpoints.
  • At each step, check which half-interval still has opposite signs at its endpoints.
  • That half-interval is guaranteed to contain the root, so continue with it.

🔄 The algorithm step-by-step

🔄 Initialization

  • Let a₀ = a and b₀ = b (the initial interval bounds).
  • Verify that f(a₀) · f(b₀) < 0 (opposite signs).

🔄 Iteration formula

At each step n:

  1. Compute the midpoint: p_n = (a_n + b_n) / 2
  2. Check the stopping criterion (if satisfied, p_n is the approximate root)
  3. If not done, construct the new interval [a_{n+1}, b_{n+1}]:
    • If f(a_n) · f(p_n) < 0, then a_{n+1} = a_n and b_{n+1} = p_n (root is in the left half)
    • Otherwise, a_{n+1} = p_n and b_{n+1} = b_n (root is in the right half)
  • The chosen subinterval always contains the root p.
  • Each iteration halves the interval width: b_n - a_n = (b - a) / 2^n.

🔄 Stopping

  • The excerpt mentions "if the chosen stopping criterion is satisfied, then the zero of f is found."
  • The method continues until the criterion is met (e.g., interval small enough or function value close enough to zero).
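The steps above can be sketched in a few lines of Python (a minimal sketch; the function name, tolerance, and iteration cap are our own choices):

```python
def bisection(f, a, b, tol=1e-8, max_iter=100):
    """Bisection sketch: halve [a, b] while keeping the sign change."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        p = (a + b) / 2.0                 # midpoint p_n
        if f(p) == 0.0 or (b - a) / 2.0 < tol:
            return p                      # stopping criterion satisfied
        if f(a) * f(p) < 0:
            b = p                         # root lies in the left half
        else:
            a = p                         # root lies in the right half
    return (a + b) / 2.0

# example: the root of x^2 - 2 on [1, 2] is sqrt(2)
root = bisection(lambda x: x * x - 2.0, 1.0, 2.0)
```

The stopping test on the half-width matches the error bound of Theorem 4.3.1: when the half-interval is below tol, so is the distance from the midpoint to the root.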

📏 Convergence properties

📏 Unconditional convergence

  • Always converges: the sequence {p_n} converges to a root p as n → ∞.
  • No special conditions on the function beyond continuity and opposite signs at the endpoints.
  • This makes the method robust and reliable.

📏 Error bound (Theorem 4.3.1)

Theorem 4.3.1: Assume f is continuous on [a, b] and f(a) · f(b) < 0. Then the Bisection method generates a sequence {p_n} converging to a zero p of f, with error bound |p - p_n| ≤ (b - a) / 2^(n+1) for n ≥ 0.

Proof sketch:

  • At each step n, the interval width is b_n - a_n = (b - a) / 2^n.
  • The root p lies in (a_n, b_n), and the midpoint p_n = (a_n + b_n) / 2.
  • The maximum distance from p_n to any point in the interval is half the interval width: |p - p_n| ≤ (b_n - a_n) / 2 = (b - a) / 2^(n+1).

What this means:

  • Each iteration reduces the maximum possible error by a factor of 2.
  • After n iterations, the error is at most the original interval width divided by 2^(n+1).
  • Example: if the initial interval has width 1, after 10 iterations the error is at most 1/2048 ≈ 0.0005.

📏 Slow convergence and non-monotonic behavior

  • Generally slow: compared to more efficient methods discussed later in the chapter.
  • Non-monotonic: it may happen that |p - p_{n-1}| ≪ |p - p_n| (the approximation can get worse from one step to the next).
  • Don't confuse: "always converges" does not mean "always improves at every single step."

🛠️ Practical use

🛠️ Role as a starting-point generator

  • The excerpt emphasizes: "The unconditional convergence makes the Bisection method a useful method for computing a starting estimate for the more efficient methods that will be discussed later in this chapter."
  • More efficient methods (discussed later) may fail if started far from the root.
  • Bisection provides a reliable way to get close enough to the root before switching to a faster method.

🛠️ When to use

  Situation                            Bisection method suitability
  Need guaranteed convergence          Excellent: always works if the initial interval has opposite signs
  Need fast convergence                Poor: slow compared to other methods
  Need a rough starting estimate       Excellent: run a few iterations to narrow the interval
  Function is expensive to evaluate    Poor: requires many evaluations

🧮 Connection to uncertainty intervals

🧮 Uncertainty interval concept

The excerpt introduces the uncertainty interval I in the context of perturbed function values:

  • In numerical simulations, exact function values f(x) are often unknown.
  • Instead, an approximation f̂(x) is used, where |f(x) - f̂(x)| ≤ ε̄ (bounded error).
  • The set I = {x ∈ [a, b] | |f(x)| < ε̄} represents all points that could be solutions given the perturbation.

🧮 Width of the uncertainty interval

  • Assuming p is the true zero of f (so f(p) = 0).
  • The zero of the maximally perturbed function f̂(x) = f(x) + ε̄ is approximately x⁺ ≈ p - ε̄ / f'(p).
  • The zero of the minimally perturbed function f̂(x) = f(x) - ε̄ is approximately x⁻ ≈ p + ε̄ / f'(p).
  • The uncertainty interval is approximately I ≈ [p - ε̄ / |f'(p)|, p + ε̄ / |f'(p)|] if f'(p) ≠ 0.

🧮 Ill-posed problems

  • If |f'(p)| is close to 0, finding the root is an ill-posed problem (the uncertainty interval becomes very wide).
  • If the tolerance ε < ε̄, then a stopping criterion |f(p_n)| < ε is useless: the algorithm might continue even though p_n is already in the uncertainty interval I.
  • Don't confuse: a small function value |f(p_n)| does not guarantee p_n is close to the true root if the function is nearly flat (small derivative).

Fixed-Point Iteration (Picard Iteration)

4.4 Fixed-Point iteration (Picard iteration)

🧭 Overview

🧠 One-sentence thesis

Fixed-point iteration approximates the solution of a nonlinear equation by repeatedly applying a function g until the sequence converges to a point p where g(p) = p, and this method always converges when the derivative of g is bounded below 1 in absolute value.

📌 Key points (3–5)

  • Connection between zeros and fixed points: any zero-finding problem can be converted into a fixed-point problem by defining g(x) = x - f(x) or similar transformations, and conversely any fixed point of g corresponds to a zero of f(x) = x - g(x).
  • Existence and uniqueness conditions: Brouwer's theorem guarantees a fixed point exists when g maps an interval into itself; Banach's theorem adds that if the absolute value of g' is bounded by k < 1, the fixed point is unique.
  • Iteration formula and convergence: starting from p₀, compute p_n = g(p_{n-1}); if the sequence converges and g is continuous, the limit is a fixed point.
  • Convergence speed depends on g'(p): the error decreases approximately as |p - p_{n+1}| ≈ |g'(p)| |p - p_n|; small |g'(p)| means fast convergence, while |g'(p)| ≥ 1 means no convergence.
  • Common confusion—linear vs higher-order convergence: if g'(p) ≠ 0, the process is linearly convergent with rate |g'(p)|; if g'(p) = 0, convergence is higher-order (faster).

🔗 Relationship Between Zeros and Fixed Points

🔗 Converting zero problems to fixed-point problems

A fixed point of a function g is a number p such that g(p) = p.

  • If f has a zero at p (meaning f(p) = 0), you can always define a function g so that p becomes a fixed point of g.
  • Example construction: g(x) = x - f(x). Then g(p) = p - f(p) = p - 0 = p, so p is a fixed point of g.
  • The function g is not unique: g(x) = x - c·f(x) for any c ≠ 0 also has p as a fixed point.

🔄 Converting fixed-point problems to zero problems

  • Conversely, if g(p) = p, then define f(x) = x - g(x).
  • At p, f(p) = p - g(p) = p - p = 0, so p is a zero of f.
  • This two-way relationship means zero-finding and fixed-point methods are interchangeable frameworks.

📐 Existence and Uniqueness Theorems

📐 Brouwer's fixed-point theorem (existence)

Conditions: g is continuous on [a, b] and g(x) ∈ [a, b] for all x ∈ [a, b].

Conclusion: g has at least one fixed point in [a, b].

Why it works:

  • If g(a) = a or g(b) = b, the fixed point is at an endpoint.
  • Otherwise, g(a) > a and g(b) < b (since g maps into [a, b]).
  • Define h(x) = g(x) - x. Then h(a) > 0 and h(b) < 0.
  • By the intermediate-value theorem, there exists p where h(p) = 0, meaning g(p) = p.

🔒 Banach's fixed-point theorem (uniqueness)

Additional condition: g' exists on [a, b] and there is a constant k < 1 such that |g'(x)| ≤ k for all x ∈ [a, b].

Conclusion: the fixed point in [a, b] is unique.

Why it works:

  • Suppose there are two fixed points p and q with p < q.
  • By the mean-value theorem, there exists ξ ∈ [p, q] such that (g(p) - g(q))/(p - q) = g'(ξ).
  • Since g(p) = p and g(q) = q, this gives |p - q| = |g'(ξ)| |p - q|.
  • But |g'(ξ)| ≤ k < 1, so |p - q| < |p - q|, a contradiction.
  • Therefore p = q, and the fixed point is unique.

🔁 The Iteration Process

🔁 How to iterate

Starting point: choose an initial guess p₀.

Iteration formula: for n ≥ 1, compute p_n = g(p_{n-1}).

Convergence argument:

  • If the sequence {p_n} converges to p and g is continuous, then:
    • p = lim (n→∞) p_n = lim (n→∞) g(p_{n-1}) = g(lim (n→∞) p_{n-1}) = g(p).
  • So p is a fixed point of g.

✅ Guaranteed convergence theorem

Theorem 4.4.2: If g ∈ C[a, b], g(x) ∈ [a, b] for x ∈ [a, b], and |g'(x)| ≤ k < 1 for x ∈ [a, b], then the fixed-point iteration converges to p for any starting value p₀ ∈ [a, b].

Proof sketch:

  • Under these conditions, g has a unique fixed point p (by Banach's theorem).
  • By the mean-value theorem, |p - p_n| = |g(p) - g(p_{n-1})| = |g'(ξ)| |p - p_{n-1}| ≤ k |p - p_{n-1}|.
  • Since k < 1, the error shrinks at each step, so p_n → p.

⚡ Convergence Speed and Error Estimates

⚡ How fast does it converge?

Error reduction near the solution:

  • Close to p, the error behaves as |p - p_{n+1}| ≈ |g'(p)| |p - p_n|.
  • If |g'(p)| is small, convergence is fast.
  • If |g'(p)| is only slightly less than 1, convergence is slow.
  • If |g'(p)| ≥ 1, there is no convergence.

Linear vs higher-order convergence:

  • If g'(p) ≠ 0, the process is linearly convergent with asymptotic convergence factor |g'(p)|.
  • If g'(p) = 0, the process is higher-order convergent (faster than linear).

📏 Practical stopping criterion

When the conditions of Banach's theorem hold and both p₀ and p are in [a, b], the following error bound is available:

|p - p_n| ≤ (k / (1 - k)) |p_n - p_{n-1}|

Derivation:

  • For m > n, |p_m - p_n| ≤ |p_m - p_{m-1}| + |p_{m-1} - p_{m-2}| + ... + |p_{n+1} - p_n|.
  • By the mean-value theorem and |g'(x)| ≤ k, we have |p_{n+1} - p_n| ≤ k |p_n - p_{n-1}|.
  • Applying recursively: |p_m - p_n| ≤ (k + k² + ... + k^(m-n)) |p_n - p_{n-1}|.
  • Taking the limit as m → ∞ and summing the geometric series: |p - p_n| ≤ (k / (1 - k)) |p_n - p_{n-1}|.

Stopping rule: if |p_n - p_{n-1}| ≤ ((1 - k) / k) ε, then |p - p_n| ≤ ε.

Note: you can choose k = max_{x ∈ [a, b]} |g'(x)|.

🧮 Worked Example

🧮 Finding the zero of f(x) = x³ + 3x - 4

Goal: determine the real zero of f(x) = x³ + 3x - 4 (which is p = 1).

Constructing g:

  • Start from f(p) = 0 ⇔ p³ + 3p - 4 = 0.
  • Rearrange: p³ + 3p = 4 ⇔ p(p² + 3) = 4 ⇔ p = 4 / (p² + 3).
  • Define g(x) = 4 / (x² + 3).

Iteration:

  • Starting with p₀ = 0, compute p_n = g(p_{n-1}).
  • The table (rounded to four decimals) shows:

    n    pₙ        |p − pₙ|   |pₙ − pₙ₋₁|   |p − pₙ| / |p − pₙ₋₁|
    0    0         1          –             –
    1    1.3333    0.3333     1.3333        0.3333
    2    0.8372    0.1628     0.4961        0.4884
    3    1.0808    0.0808     0.2436        0.4964

Verification:

  • g'(x) = -8x / (x² + 3)², so |g'(x)| ≤ 1/2 = k.
  • The third and fourth columns confirm |p - p_n| ≤ (k / (1 - k)) |p_n - p_{n-1}|.
  • The fifth column shows |p - p_{n+1}| / |p - p_n| ≈ |g'(p)| = 1/2, confirming linear convergence.
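The iteration and the stopping rule derived above can be sketched in code and run on this same example (a minimal sketch; names and defaults are illustrative):

```python
def fixed_point(g, p0, k, eps=1e-8, max_iter=200):
    """Fixed-point (Picard) iteration sketch.

    Stops when |p_n - p_{n-1}| <= ((1 - k) / k) * eps, which guarantees
    |p - p_n| <= eps whenever |g'(x)| <= k < 1 on the interval."""
    p_prev = p0
    p = p0
    for _ in range(max_iter):
        p = g(p_prev)                     # p_n = g(p_{n-1})
        if abs(p - p_prev) <= (1.0 - k) / k * eps:
            return p
        p_prev = p
    return p

# the worked example: g(x) = 4 / (x^2 + 3), |g'| <= 1/2, fixed point p = 1
root = fixed_point(lambda x: 4.0 / (x * x + 3.0), 0.0, 0.5)
```

With k = 1/2 the factor (1 − k)/k equals 1, so the loop stops as soon as successive iterates are within eps of each other.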

🖼️ Graphical interpretation

  • The iteration is visualized by plotting y = g(x) and y = x.
  • Starting from p₀ on the x-axis, move vertically to the curve y = g(x) to get g(p₀).
  • Then move horizontally to the line y = x to transfer this y-value to the x-axis as p₁.
  • Repeat: the bold line in the figure traces this staircase or cobweb pattern converging to the intersection point (p, p).

The Newton-Raphson Method

4.5 The Newton-Raphson method

🧭 Overview

🧠 One-sentence thesis

The Newton-Raphson method is a powerful numerical technique that uses tangent-line approximations to iteratively solve nonlinear equations, achieving quadratic convergence when the initial guess is sufficiently close to the true root.

📌 Key points (3–5)

  • Core idea: approximate the root by finding where the tangent line at the current guess crosses the x-axis, then repeat.
  • Mathematical foundation: derived from a first-degree Taylor polynomial by neglecting the small squared-error term.
  • Convergence speed: the method converges quadratically (error squared at each step) when conditions are met.
  • Common confusion: Newton-Raphson requires computing the derivative at each step; variants like the Secant and Quasi-Newton methods approximate the derivative to avoid this cost.
  • Key requirement: the initial approximation must be close enough to the root, within a certain interval around it.

🔍 Core mechanism and derivation

🔍 Taylor polynomial foundation

The method starts with a Taylor expansion around the current approximation x̄:

f(x) = f(x̄) + (x − x̄) · f′(x̄) + (x − x̄)² / 2 · f″(ξ(x))

  • Here ξ(x) is some value between x and x̄.
  • The excerpt assumes f is twice continuously differentiable on [a, b] and that f′(x̄) ≠ 0.
  • Because the distance |p − x̄| is small, the squared term (p − x̄)² can be neglected.

🧮 Deriving the iteration formula

After dropping the squared term and setting f(p) = 0 (since p is the root):

  • 0 ≈ f(x̄) + (p − x̄) · f′(x̄)
  • The right-hand side is the formula for the tangent line at (x̄, f(x̄)).
  • Solving for p gives: p ≈ x̄ − f(x̄) / f′(x̄)

This motivates the iterative formula:

p_n = p_(n−1) − f(p_(n−1)) / f'(p_(n−1)), for n ≥ 1

  • Start with an initial approximation p_0.
  • Each new approximation p_n is the zero of the tangent line at (p_(n−1), f(p_(n−1))).

📐 Geometric interpretation

  • Graphically, p_n is where the tangent at the previous point crosses the horizontal axis.
  • The excerpt includes Figure 4.2 showing this process for f(x) = x² − 2.
  • Example: starting at p_0 = 1.00000, the tangent at (1, −1) crosses the x-axis at p_1 = 1.50000; repeating yields p_2 = 1.41666 and p_3 = 1.41421, matching √2 to five decimal places in only three steps.
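The iteration formula can be sketched directly (a minimal sketch; the function name and tolerances are our own choices), reproducing the √2 example from the figure:

```python
def newton(f, f_prime, p0, tol=1e-12, max_iter=50):
    """Newton-Raphson sketch: p_n = p_{n-1} - f(p_{n-1}) / f'(p_{n-1})."""
    p = p0
    for _ in range(max_iter):
        p_next = p - f(p) / f_prime(p)    # zero of the tangent line at p
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

# the example from Figure 4.2: f(x) = x^2 - 2 starting at p_0 = 1
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```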

🎯 Convergence theory

🎯 Fixed-point interpretation (Theorem 4.5.1)

The excerpt proves convergence by viewing Newton-Raphson as a fixed-point method p_n = g(p_(n−1)) with:

g(x) = x − f(x) / f'(x)

Conditions for convergence:

  • f is twice continuously differentiable on [a, b].
  • p is a root in [a, b] with f(p) = 0 and f'(p) ≠ 0.
  • Then there exists a δ > 0 such that the sequence {p_n} converges to p for any starting point p_0 in [p − δ, p + δ].

🔧 Proof sketch (three key steps)

The proof verifies the conditions of the fixed-point theorem:

  1. Continuity of g: Since f'(p) ≠ 0 and f' is continuous, there exists δ_1 > 0 such that f'(x) ≠ 0 for all x in [p − δ_1, p + δ_1], so g is well-defined and continuous there.
  2. Derivative bound: The derivative g'(x) = f(x) · f''(x) / (f'(x))². At the root, g'(p) = 0 because f(p) = 0. By continuity of g', there exists δ < δ_1 such that |g'(x)| ≤ k < 1 for all x in [p − δ, p + δ].
  3. Domain-range property: Using the mean-value theorem, |g(p) − g(x)| = |g'(ξ)| · |p − x| for some ξ between x and p. Since |g'(ξ)| < 1 and |p − x| < δ, it follows that |g(p) − g(x)| < δ, so g maps [p − δ, p + δ] into itself.

Don't confuse: The method only guarantees convergence if you start close enough (within δ of the root); a poor initial guess may diverge or converge to a different root.

⚡ Quadratic convergence

The excerpt derives the convergence rate by comparing two Taylor expansions:

  • For the true root: 0 = f(p) = f(p_n) + (p − p_n) · f'(p_n) + (p − p_n)² / 2 · f''(ξ_n), where ξ_n is between p_n and p.
  • For the Newton-Raphson step: 0 = f(p_n) + (p_(n+1) − p_n) · f'(p_n).

Subtracting the second from the first:

  • (p − p_(n+1)) · f'(p_n) + (p − p_n)² / 2 · f''(ξ_n) = 0
  • Rearranging: |p − p_(n+1)| / |p − p_n|² = |f''(ξ_n) / (2 · f'(p_n))|

This shows quadratic convergence with:

  • α = 2 (the error is squared at each step)
  • λ = limit as n → ∞ of |f''(ξ_n) / (2 · f'(p_n))| = |f''(p) / (2 · f'(p))|

Why it matters: Quadratic convergence means the number of correct digits roughly doubles with each iteration, explaining the rapid convergence in Example 4.5.1.

🔄 Variants avoiding derivative computation

🔄 Common form of variants

All variants share the structure:

p_n = p_(n−1) − f(p_(n−1)) / K

where K is an approximation of f'(p_(n−1)).

Motivation: Computing the derivative f'(p_(n−1)) at each step can be expensive or inconvenient; these methods approximate it instead.

🧪 Quasi-Newton method

Approximates the derivative using a finite difference:

K = (f(p_(n−1) + h) − f(p_(n−1))) / h, where h > 0

  • This is a forward-difference approximation of f'(p_(n−1)).
  • Only requires function evaluations, not the derivative formula.

📏 Secant method

Uses the slope of the secant line through the two most recent points:

K = (f(p_(n−1)) − f(p_(n−2))) / (p_(n−1) − p_(n−2))

  • Requires two initial guesses (p_0 and p_1).
  • The excerpt notes this method is derived in exercise 3.
  • Don't confuse: Unlike Newton-Raphson, the Secant method uses information from two previous points, not just one.
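A minimal sketch of the Secant update (names and tolerances are illustrative; the guard against a flat secant is our own defensive addition):

```python
def secant(f, p0, p1, tol=1e-12, max_iter=50):
    """Secant sketch: the derivative in Newton's formula is replaced by
    the slope of the secant line through the two latest iterates."""
    for _ in range(max_iter):
        denom = f(p1) - f(p0)
        if denom == 0.0:                  # flat secant: cannot improve
            return p1
        # p2 = p1 - f(p1) / K with K = (f(p1) - f(p0)) / (p1 - p0)
        p2 = p1 - f(p1) * (p1 - p0) / denom
        if abs(p2 - p1) < tol:
            return p2
        p0, p1 = p1, p2                   # keep the two most recent points
    return p1

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)   # converges toward sqrt(2)
```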

🎯 Regula-Falsi method

Combines Newton-Raphson with the Bisection algorithm to guarantee the root stays bracketed:

Setup: Start with an interval [a_(n−1), b_(n−1)] where f(a_(n−1)) · f(b_(n−1)) < 0 (opposite signs, so a root exists between them).

Iteration:

  1. Compute p_n = a_(n−1) − f(a_(n−1)) · (b_(n−1) − a_(n−1)) / (f(b_(n−1)) − f(a_(n−1))).
  2. If the stopping criterion is met, stop.
  3. Otherwise, construct a new interval [a_n, b_n]:
    • If f(a_(n−1)) · f(p_n) < 0, set a_n = a_(n−1) and b_n = p_n.
    • Otherwise, set a_n = p_n and b_n = b_(n−1).

Advantage: Always maintains a bracketing interval, so the root cannot be lost.
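The Regula-Falsi steps above can be sketched as follows (a minimal sketch; the stopping test on |f(pₙ)| is one choice among the criteria of Section 4.2):

```python
def regula_falsi(f, a, b, tol=1e-10, max_iter=200):
    """Regula-Falsi sketch: secant-style point p_n combined with the
    sign test of the Bisection method, so the root stays bracketed."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    p = a
    for _ in range(max_iter):
        p = a - f(a) * (b - a) / (f(b) - f(a))
        if abs(f(p)) < tol:               # stopping criterion on |f(p_n)|
            return p
        if f(a) * f(p) < 0:
            b = p                         # keep the sign change in [a, p]
        else:
            a = p                         # keep the sign change in [p, b]
    return p

root = regula_falsi(lambda x: x * x - 2.0, 1.0, 2.0)
```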

🌐 Extension to systems of equations

🌐 System formulations

The excerpt extends the theory to systems of m nonlinear equations in m unknowns.

Fixed-point form:

  • g_1(p_1, ..., p_m) = p_1
  • g_2(p_1, ..., p_m) = p_2
  • ...
  • g_m(p_1, ..., p_m) = p_m

General form:

  • f_1(p_1, ..., p_m) = 0
  • f_2(p_1, ..., p_m) = 0
  • ...
  • f_m(p_1, ..., p_m) = 0

🧮 Matrix-vector notation

In compact form:

g(p) = p (fixed-point form)
f(p) = 0 (general form)

where:

  • p = (p_1, ..., p_m)^T is the solution vector
  • g(p) = (g_1(p_1, ..., p_m), ..., g_m(p_1, ..., p_m))^T
  • f(p) = (f_1(p_1, ..., p_m), ..., f_m(p_1, ..., p_m))^T

🔁 Iterative approach

  • Start with an initial estimate p^(0).
  • Construct a sequence of successive approximations {p^(n)} until a desired tolerance is reached.
  • A possible stopping criterion is the norm ||p^(n) − p^(n−1)|| (the excerpt text cuts off here).

Note: The excerpt does not provide the full Newton-Raphson formula for systems (which would involve the Jacobian matrix), only the setup and notation.


Systems of nonlinear equations

4.6 Systems of nonlinear equations

🧭 Overview

🧠 One-sentence thesis

Iterative methods for single nonlinear equations extend naturally to systems of multiple nonlinear equations by using vector notation and replacing derivatives with Jacobian matrices.

📌 Key points (3–5)

  • Two equivalent forms: systems can be written in fixed-point form g(p) = p or general form f(p) = 0, where p is a vector of unknowns.
  • Iterative approximation: start with an initial estimate and construct a sequence of successive approximations until a stopping criterion (e.g., Euclidean norm of the difference) is satisfied.
  • Fixed-point (Picard) iteration: directly applies the vector version p^(n) = g(p^(n-1)), often requiring solving a linear system at each step.
  • Newton-Raphson for systems: linearizes the function f around the previous iterate using the Jacobian matrix, then solves a linear system to find the next approximation.
  • Common confusion: the Jacobian matrix replaces the scalar derivative; when partial derivatives are unavailable, finite-difference approximations (Quasi-Newton) can be used instead.

🧩 Core concepts and notation

🧩 Two forms of nonlinear systems

Systems of m nonlinear equations in m unknowns can be written in two equivalent ways:

Fixed-point form:

  • g₁(p₁, ..., pₘ) = p₁
  • g₂(p₁, ..., pₘ) = p₂
  • ...
  • gₘ(p₁, ..., pₘ) = pₘ

General form:

  • f₁(p₁, ..., pₘ) = 0
  • f₂(p₁, ..., pₘ) = 0
  • ...
  • fₘ(p₁, ..., pₘ) = 0

In vector notation: g(p) = p and f(p) = 0, respectively.

📐 Vector notation

p = (p₁, ..., pₘ)ᵀ is the vector of unknowns.

g(p) = (g₁(p₁, ..., pₘ), ..., gₘ(p₁, ..., pₘ))ᵀ is the vector of fixed-point functions.

f(p) = (f₁(p₁, ..., pₘ), ..., fₘ(p₁, ..., pₘ))ᵀ is the vector of general-form functions.

  • This notation mirrors the scalar case but uses vectors instead of single values.
  • Each component function depends on all m unknowns.

🛑 Stopping criterion

A typical stopping rule checks the Euclidean norm of the difference between successive approximations:

||p^(n) - p^(n-1)|| < ε

where ||x|| = square root of (x₁² + ... + xₘ²) is the Euclidean norm.

  • This measures how much the solution estimate changes from one iteration to the next.
  • When the change is smaller than tolerance ε, the iteration stops.

🔄 Fixed-point iteration (Picard iteration)

🔄 The iteration formula

Similar to the scalar case, the fixed-point iteration for systems uses:

p^(n) = g(p^(n-1))

  • Start with an initial estimate p^(0).
  • Compute the next approximation by evaluating the vector function g at the previous approximation.
  • Repeat until the stopping criterion is met.

🧮 Example: solving a 2×2 system

The excerpt solves the system:

  • 2p₁ - p₂ + p₁² = 1/9
  • -p₁ + 2p₂ + p₂² = 13/9

The exact solution is (p₁, p₂)ᵀ = (1/3, 2/3)ᵀ.

Picard iteration setup:

  • Rearrange to: 2p₁^(n) - p₂^(n) + p₁^(n-1)·p₁^(n) = 1/9 and -p₁^(n) + 2p₂^(n) + p₂^(n-1)·p₂^(n) = 13/9.
  • This can be written as a linear system A(p^(n-1))p^(n) = b, where the matrix A depends on the previous iterate.
  • Solve for p^(n) = A⁻¹(p^(n-1))b at each step.

Example iteration:

  • Initial estimate: p^(0) = (0, 0)ᵀ
  • First approximation: p^(1) = (5/9, 1)ᵀ

Key insight: The system can be written as g(p) = p with g(p) = A(p)⁻¹b, showing the fixed-point structure.
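The Picard setup above can be sketched in code. This is a minimal sketch of the 2×2 example, solving the linear system A(p^(n−1))p^(n) = b at each step with Cramer's rule (names and tolerances are our own choices):

```python
def picard_2x2(p, tol=1e-10, max_iter=200):
    """Picard sketch for the example system: solve A(p_prev) p_new = b."""
    b1, b2 = 1.0 / 9.0, 13.0 / 9.0
    for _ in range(max_iter):
        # A depends on the previous iterate:
        # row 1: (2 + p1) p1_new -       p2_new = 1/9
        # row 2:     -p1_new + (2 + p2) p2_new = 13/9
        a11, a12 = 2.0 + p[0], -1.0
        a21, a22 = -1.0, 2.0 + p[1]
        det = a11 * a22 - a12 * a21
        p_new = ((b1 * a22 - b2 * a12) / det,     # Cramer's rule
                 (a11 * b2 - a21 * b1) / det)
        if max(abs(p_new[0] - p[0]), abs(p_new[1] - p[1])) < tol:
            return p_new
        p = p_new
    return p

sol = picard_2x2((0.0, 0.0))    # first step gives (5/9, 1), as in the text
```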

🚀 Newton-Raphson method for systems

🚀 Linearization approach

The Newton-Raphson method for systems approximates the solution to f(p) = 0 by linearizing f around the previous iterate p^(n-1):

f(p) ≈ f(p^(n-1)) + J(p^(n-1))(p - p^(n-1))

where J is the Jacobian matrix of f.

  • This is the multivariable Taylor expansion to first order.
  • Each component fⱼ is approximated by its value at p^(n-1) plus the sum of partial derivatives times the displacement in each variable.

📊 The Jacobian matrix

Jacobian matrix J(x): the m×m matrix of all first-order partial derivatives of f.

The (j, i)-entry is ∂fⱼ/∂xᵢ evaluated at x.

  • The Jacobian replaces the scalar derivative f' in the single-equation Newton-Raphson method.
  • It captures how each component function changes with respect to each variable.

🔧 Computing the next iterate

Set the linearization equal to zero to find p^(n):

f(p^(n-1)) + J(p^(n-1))(p^(n) - p^(n-1)) = 0

This can be rewritten as a linear system:

J(p^(n-1))s^(n) = -f(p^(n-1))

where s^(n) = p^(n) - p^(n-1) is the step.

Two equivalent formulas:

  • Solve the linear system for s^(n), then p^(n) = p^(n-1) + s^(n).
  • Or directly: p^(n) = p^(n-1) - J⁻¹(p^(n-1))f(p^(n-1)).

Don't confuse: You do not need to explicitly compute the inverse J⁻¹; instead, solve the linear system for s^(n) at each iteration.

🧮 Example: Newton-Raphson on a 2×2 system

The excerpt solves:

  • 18p₁ - 9p₂ + p₁² = 0
  • -9p₁ + 18p₂ + p₂² = 9

with initial estimate p^(0) = (0, 0)ᵀ.

Setup:

  • f₁(p₁, p₂) = 18p₁ - 9p₂ + p₁²
  • f₂(p₁, p₂) = -9p₁ + 18p₂ + p₂² - 9
  • Jacobian: J(x) = [[18 + 2x₁, -9], [-9, 18 + 2x₂]]

First iteration:

  • At p^(0) = (0, 0)ᵀ: J(p^(0)) = [[18, -9], [-9, 18]] and f(p^(0)) = (0, -9)ᵀ.
  • Solve J(p^(0))s^(1) = -f(p^(0)): 18s₁^(1) - 9s₂^(1) = 0 and -9s₁^(1) + 18s₂^(1) = 9.
  • Solution: s^(1) = (1/3, 2/3)ᵀ.
  • New approximation: p^(1) = p^(0) + s^(1) = (1/3, 2/3)ᵀ.

Key observation: a single Newton step already produces p^(1) = (1/3, 2/3)ᵀ, a close approximation of the solution, and subsequent steps refine it rapidly, demonstrating the fast convergence of Newton-Raphson.
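The same computation can be sketched in code (a minimal sketch; the 2×2 linear system J s = −f is solved with Cramer's rule instead of forming J⁻¹, and the names are our own):

```python
def newton_2x2(f, jac, p, tol=1e-12, max_iter=20):
    """Newton-Raphson sketch for a 2x2 system: solve J s = -f, update p."""
    for _ in range(max_iter):
        f1, f2 = f(p)
        (a, b), (c, d) = jac(p)
        det = a * d - b * c
        s1 = (-f1 * d + f2 * b) / det     # Cramer's rule for J s = -f
        s2 = (-f2 * a + f1 * c) / det
        p = (p[0] + s1, p[1] + s2)
        if max(abs(s1), abs(s2)) < tol:   # stop when the step is tiny
            return p
    return p

# the example system and its Jacobian
f = lambda p: (18 * p[0] - 9 * p[1] + p[0] ** 2,
               -9 * p[0] + 18 * p[1] + p[1] ** 2 - 9)
jac = lambda p: ((18 + 2 * p[0], -9.0), (-9.0, 18 + 2 * p[1]))
sol = newton_2x2(f, jac, (0.0, 0.0))      # first step: s = (1/3, 2/3)
```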

🔍 Quasi-Newton method for systems

🔍 Approximating the Jacobian

When partial derivatives cannot be computed analytically, approximate them using finite differences:

∂fⱼ/∂xᵢ(x) ≈ [fⱼ(x + eᵢh) - fⱼ(x)] / h

where eᵢ is the i-th unit vector (all zeros except 1 in position i) and h is a small step size.

  • This is the forward divided difference approximation.
  • Central differences can also be used for better accuracy.

🛠️ Why use Quasi-Newton

  • When the functions f are complicated or given only as black-box procedures, computing symbolic derivatives may be impractical or impossible.
  • Finite-difference approximations allow the Newton-Raphson framework to be applied without explicit derivative formulas.
  • Trade-off: slightly slower convergence and additional function evaluations, but greater flexibility.

Don't confuse: Quasi-Newton for systems is analogous to the Secant method for scalar equations—both replace exact derivatives with approximations based on function values.
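The forward-difference Jacobian can be sketched as follows (a minimal sketch; the function name, step size, and test function are our own illustrations):

```python
def fd_jacobian(f, x, h=1e-7):
    """Forward-difference sketch of the Jacobian: entry (j, i) is
    approximately (f_j(x + h e_i) - f_j(x)) / h."""
    fx = f(x)
    m = len(fx)
    J = [[0.0] * m for _ in range(m)]
    for i in range(m):
        x_shift = list(x)
        x_shift[i] += h                   # x + h * e_i
        fxh = f(x_shift)
        for j in range(m):
            J[j][i] = (fxh[j] - fx[j]) / h
    return J

# illustration on f(x) = (x1^2, x1 * x2); the exact Jacobian at (1, 2)
# is [[2, 0], [2, 1]]
J = fd_jacobian(lambda x: (x[0] ** 2, x[0] * x[1]), [1.0, 2.0])
```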


Numerical Integration

5.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Numerical integration provides methods to approximate integrals when analytic evaluation is not possible, enabling the calculation of physical quantities such as arc length, volume, and mass.

📌 Key points (3–5)

  • Why numerical integration is needed: many integrals cannot be evaluated analytically, so numerical approximation (quadrature) is required.
  • Structure of numerical methods: similar to Riemann sums but more practical—sum weighted function values at chosen integration points.
  • Rectangle rule basics: the simplest quadrature rule approximates the integral by multiplying interval width by the function value at one endpoint.
  • Error behavior: the Rectangle rule error is bounded by the maximum first derivative and grows with the square of the interval width.
  • Common confusion: Riemann sums vs. quadrature rules—Riemann sums are theoretical tools for defining integrals; quadrature rules are practical numerical methods with the same structure but optimized weights and points.

🎯 Motivation and context

🎯 Real-world problem: the spoiler example

  • A truck spoiler shaped by a sine function needs to be manufactured from a flat aluminum plate by extrusion.
  • The manufacturer must determine the width of the flat plate so that the horizontal dimension of the spoiler is 80 cm.
  • This requires computing the arc length of the curve:
    • x(t) = t, y(t) = sin(t), for 0 ≤ t ≤ 0.8.
  • The arc length formula is:

    l = ∫₀^0.8 √(1 + (dy/dt)²) dt = ∫₀^0.8 √(1 + cos²(t)) dt.

  • This integral cannot be evaluated in a simple analytic way, so numerical methods are necessary.

🔍 Typical applications

  • Determining physical quantities: volume, mass, length.
  • Any situation where the integrand has no closed-form antiderivative.

🧮 Theoretical foundation: Riemann sums

📐 Partition and mesh

Partition P_n of [a, b]: a finite set of distinct points x_k such that a = x_0 < x_1 < ... < x_n = b.

  • Intermediate points T_n: a set of points t_k where x_(k-1) ≤ t_k ≤ x_k.
  • Interval length h_k: x_k − x_(k-1).
  • Mesh width m(P_n): the maximum of all h_k for 1 ≤ k ≤ n.

📊 Riemann sum definition

Riemann sum: R(f, P_n, T_n) = sum from k=1 to n of h_k · f(t_k).

  • A function f in C[a, b] is Riemann integrable over [a, b] if, for a sequence of partitions P_1, P_2, ... with corresponding T_1, T_2, ... such that the limit as n → ∞ of m(P_n) = 0, the Riemann sum R(f, P_n, T_n) converges to a limit I = integral from a to b of f(x) dx.
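The definition translates directly into code. A minimal sketch with a uniform partition and midpoints as the intermediate points tₖ (the test function x² on [0, 1] is our own illustration):

```python
def riemann_sum(f, xs, ts):
    """Riemann sum sketch: sum of h_k * f(t_k) over the partition xs."""
    return sum((xs[k] - xs[k - 1]) * f(ts[k - 1])
               for k in range(1, len(xs)))

n = 1000
xs = [k / n for k in range(n + 1)]                 # partition of [0, 1]
ts = [(xs[k] + xs[k + 1]) / 2 for k in range(n)]   # midpoints as t_k
approx = riemann_sum(lambda x: x * x, xs, ts)      # approximates 1/3
```

As the mesh width 1/n goes to zero, the sum converges to the integral, here ∫₀¹ x² dx = 1/3.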

⚠️ Practical limitation

  • Riemann sums are useful for studying integrability theoretically.
  • They are not very useful in practice for numerical computation.

🔧 Quadrature rules: practical structure

🔧 General form

Numerical integration rule (quadrature rule): I = sum from k=0 to n of w_k · f(t_k).

  • Integration points t_k: the locations where the function is evaluated.
  • Weights w_k: coefficients corresponding to each integration point.
  • This structure is similar to Riemann sums but optimized for practical computation.

🆚 Riemann sums vs. quadrature rules

  Aspect       Riemann sums                          Quadrature rules
  Purpose      Theoretical definition of integrals   Practical numerical approximation
  Structure    Sum of h_k · f(t_k)                   Sum of w_k · f(t_k)
  Weights      Interval lengths h_k                  Optimized weights w_k
  Usage        Studying integrability                Computing integrals numerically

📏 Rectangle rule: the simplest method

📏 Two versions

The Rectangle rule integrates a function f over a single interval [x_L, x_R]:

  1. Left Rectangle rule:

    • Integral from x_L to x_R of f(x) dx ≈ (x_R − x_L) · f(x_L).
    • Uses the function value at the left endpoint.
  2. Right Rectangle rule:

    • Integral from x_L to x_R of f(x) dx ≈ (x_R − x_L) · f(x_R).
    • Uses the function value at the right endpoint.

📐 Geometric interpretation

  • The approximation is the area of a rectangle with width (x_R − x_L) and height equal to the function value at one endpoint.
  • Example: if the interval is [0, 1] and f(0) = 2, the left Rectangle rule gives 1 · 2 = 2 as the approximate integral.

🎯 Error bound (Theorem 5.3.1)

Assumptions: f is in C¹[x_L, x_R] (continuously differentiable), and m_1 = max over x in [x_L, x_R] of |f'(x)|.

Error bounds:

  • Left Rectangle rule: |integral − (x_R − x_L) · f(x_L)| ≤ (1/2) · m_1 · (x_R − x_L)².
  • Right Rectangle rule: |integral − (x_R − x_L) · f(x_R)| ≤ (1/2) · m_1 · (x_R − x_L)².

Key observations:

  • The error is proportional to the square of the interval width: halving the interval reduces the error by a factor of 4.
  • The error depends on the maximum first derivative m_1: functions with larger slopes have larger errors.
  • Both left and right versions have the same error bound structure.

🔬 Proof sketch for the left Rectangle rule

  • Use Taylor expansion: f(x) = f(x_L) + (x − x_L) · f'(ξ(x)), where ξ(x) is in (x_L, x_R).
  • Integrate both sides over [x_L, x_R]:
    • Integral of f(x) = integral of f(x_L) + integral of (x − x_L) · f'(ξ(x)).
  • The first term gives (x_R − x_L) · f(x_L), which is the Rectangle rule approximation.
  • The second term is the error, bounded by the maximum derivative and the interval length squared.

⚠️ Don't confuse

  • Rectangle rule vs. Riemann sum: the Rectangle rule is a specific quadrature rule with weight w = (x_R − x_L) and one integration point; a Riemann sum is a general theoretical construction with arbitrary partitions and intermediate points.
  • Left vs. right version: both have the same error bound, but they may give different approximations for the same function; neither is universally better.

Riemann sums

5.2 Riemann sums

🧭 Overview

🧠 One-sentence thesis

Riemann sums provide the theoretical foundation for defining integrals, and numerical integration rules adopt a similar structure—summing weighted function values at chosen points—to approximate integrals in practice.

📌 Key points (3–5)

  • What Riemann sums are: a sum of function values at intermediate points, each multiplied by the length of a subinterval, used to define integrability.
  • How they work: partition the interval into subintervals, pick a point in each subinterval, evaluate the function there, and sum the products of function values and interval lengths.
  • Practical limitation: Riemann sums are mainly a theoretical tool; numerical integration rules (quadrature rules) share the same structure but are more useful for computation.
  • Key structure: both Riemann sums and quadrature rules sum weighted function evaluations, but quadrature rules use specific weights and integration points optimized for accuracy.
  • Common confusion: Riemann sums require the mesh width (largest subinterval) to shrink to zero for convergence; numerical rules use fixed points and weights without taking a limit.

📐 Partition and Riemann sum structure

📐 Partition of an interval

A partition P_n of [a, b] is a finite number of distinct points x_k such that a = x_0 < x_1 < ... < x_n = b.

  • The interval [a, b] is divided into n subintervals.
  • Each subinterval is [x_{k-1}, x_k].
  • The length of the k-th subinterval is h_k = x_k - x_{k-1}.
  • The mesh width m(P_n) is the maximum of all h_k; it measures the largest subinterval.

🎯 Intermediate points and the Riemann sum

  • For each subinterval, choose an intermediate point t_k such that x_{k-1} ≤ t_k ≤ x_k.
  • The set of all intermediate points is denoted T_n.
  • The Riemann sum for a continuous function f on [a, b] is:
    • R(f, P_n, T_n) = sum from k=1 to n of h_k · f(t_k).
  • This sum approximates the integral by treating each subinterval as a rectangle with height f(t_k) and width h_k.

Example: If [a, b] = [0, 1] and you partition it into two subintervals [0, 0.5] and [0.5, 1], you might pick t_1 = 0.2 in the first and t_2 = 0.7 in the second, then compute 0.5 · f(0.2) + 0.5 · f(0.7).
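The example above can be written out directly (a sketch; `riemann_sum` is an illustrative helper name):

```python
# Riemann sum R(f, P_n, T_n) = sum over k of h_k * f(t_k) for a given
# partition and one intermediate point per subinterval.

def riemann_sum(f, partition, points):
    """partition: x_0 < x_1 < ... < x_n; points: one t_k per subinterval."""
    return sum((partition[k + 1] - partition[k]) * f(t)
               for k, t in enumerate(points))

f = lambda x: x * x
approx = riemann_sum(f, [0.0, 0.5, 1.0], [0.2, 0.7])
print(approx)  # 0.5 * f(0.2) + 0.5 * f(0.7) = 0.265; the exact integral is 1/3
```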

🔁 Convergence and integrability

🔁 When a function is Riemann integrable

  • Consider a sequence of partitions P_1, P_2, ... with corresponding intermediate points T_1, T_2, ...
  • Require that the mesh width shrinks: lim (n → ∞) m(P_n) = 0.
  • A function f is Riemann integrable over [a, b] if the Riemann sums R(f, P_n, T_n) converge to a limit I.
  • That limit I is defined as the integral: I = integral from a to b of f(x) dx.

⚠️ Don't confuse: theoretical vs practical

  • Riemann sums are "usually used to study integrability theoretically, but they are not very useful in practice."
  • The excerpt emphasizes that while Riemann sums define what an integral is, they are not efficient for computation.

🧮 Numerical integration rules (quadrature rules)

🧮 Structure of quadrature rules

  • Numerical integration rules have a similar structure to Riemann sums:
    • I ≈ sum from k=0 to n of w_k · f(t_k).
  • Here, t_k are called integration points and w_k are the corresponding weights.
  • These rules are also called quadrature rules.

🔍 How quadrature rules differ from Riemann sums

Aspect | Riemann sums | Quadrature rules
Purpose | Theoretical definition of integrals | Practical computation of integrals
Convergence | Require mesh width → 0 | Use fixed points and weights
Flexibility | Intermediate points can be arbitrary | Points and weights are chosen for accuracy
Usage | Study integrability | Approximate integrals numerically
  • Quadrature rules do not take a limit; they provide a direct approximation formula.
  • The choice of weights and integration points is designed to minimize error for certain classes of functions.

🛠️ Context: why numerical integration matters

🛠️ Motivation from the excerpt

  • Many integrals "cannot be evaluated in a simple way" analytically.
  • The excerpt gives the example of computing arc length for a spoiler shape, which involves the integral of sqrt(1 + (cos t)^2) from 0 to 0.8.
  • "In such cases one has to resort to numerical quadratures."
  • Numerical integration methods (like the rectangle rule mentioned later) are practical tools for these problems.

📌 Transition to simple integration rules

  • The excerpt states that "simple integration rules will be presented" in the next section (5.3), starting with the rectangle rule.
  • These rules apply the quadrature structure to approximate integrals over a single interval [x_L, x_R].

Simple integration rules

5.3 Simple integration rules

🧭 Overview

🧠 One-sentence thesis

Simple numerical integration rules approximate definite integrals by evaluating the function at specific points with corresponding weights, and their accuracy depends on the smoothness of the function and the interval width.

📌 Key points (3–5)

  • Structure of numerical rules: all have the form "sum of weights times function values at integration points," similar to Riemann sums but designed for practical computation.
  • Four basic rules: Rectangle (left/right endpoint), Midpoint (center), Trapezoidal (linear interpolation), and Simpson's (quadratic interpolation).
  • Error behavior: error bounds depend on the interval width raised to a power (squared for Rectangle, cubed for Midpoint and Trapezoidal, fifth power for Simpson's) and on derivatives of the function.
  • Common confusion: Midpoint rule is more accurate than Rectangle rule even though both use a single point—Midpoint's error involves the second derivative and a cubic term, while Rectangle's error involves the first derivative and a quadratic term.
  • Composite approach: applying simple rules over many small subintervals improves accuracy when a single interval produces too much error.

📐 Numerical integration framework

📐 What numerical integration rules are

Numerical integration rules (also called quadrature rules): formulas of the form I = sum from k=0 to n of (w_k times f(t_k)), where t_k are integration points and w_k are weights.

  • They approximate the definite integral of a function over an interval.
  • The structure resembles Riemann sums but is optimized for practical calculation rather than theoretical study of integrability.
  • Each rule chooses specific points and weights to balance simplicity and accuracy.

🎯 Focus of this section

  • Integration over a single interval [x_L, x_R].
  • Presentation of several simple rules with their approximation errors.
  • Consideration of rounding errors (mentioned but not detailed in the excerpt).

🔲 Rectangle rule

🔲 Two versions

The Rectangle rule is the simplest integration rule and comes in two forms:

Version | Formula | Integration point
Left | (x_R − x_L) · f(x_L) | Left endpoint
Right | (x_R − x_L) · f(x_R) | Right endpoint
  • Both multiply the interval width by the function value at one endpoint.
  • Geometrically, this approximates the area under the curve by a rectangle.

📏 Error bound for Rectangle rule

Theorem 5.3.1: If f has a continuous first derivative on [x_L, x_R], and m_1 is the maximum of the absolute value of f'(x) over the interval, then:

  • Error for left Rectangle rule ≤ (1/2) times m_1 times (x_R - x_L) squared
  • Error for right Rectangle rule ≤ (1/2) times m_1 times (x_R - x_L) squared

Key observations:

  • Error depends on the first derivative of f.
  • Error grows with the square of the interval width.
  • Proof uses Taylor expansion: f(x) = f(x_L) + (x - x_L) times f'(ξ(x)) for some ξ(x) in the interval.
  • The mean-value theorem for integration is applied because (x - x_L) ≥ 0.

Example: If you halve the interval width, the error bound becomes one-quarter as large.

🎯 Midpoint rule

🎯 Definition

Midpoint rule: uses the integration point x_M = (x_L + x_R)/2, giving the approximation (x_R - x_L) times f(x_M).

  • Evaluates the function at the center of the interval.
  • Multiplies by the interval width, like Rectangle rule.
  • The excerpt notes this rule is "more accurate than expected at first sight."

📏 Error bound for Midpoint rule

Theorem 5.3.2: If f has a continuous second derivative on [x_L, x_R], and m_2 is the maximum of the absolute value of f''(x) over the interval, then:

  • Error ≤ (1/24) times m_2 times (x_R - x_L) cubed

Why Midpoint is better than Rectangle:

  • Error depends on the second derivative (not first).
  • Error grows with the cube of the interval width (not square).
  • The coefficient 1/24 is also favorable.
  • Proof uses Taylor series centered at x_M with a second-order remainder term.
  • The symmetry around x_M causes the first-derivative term to vanish when integrated.

Don't confuse: Both Midpoint and Rectangle use a single function evaluation, but Midpoint achieves higher accuracy by centering the evaluation point.
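The accuracy gap can be seen with one smooth test function (a sketch; the helper names are illustrative and f = e^x is my choice of example):

```python
# Midpoint vs left Rectangle: the same single evaluation per interval,
# but evaluating at the centre pays off.

import math

def midpoint(f, x_L, x_R):
    return (x_R - x_L) * f((x_L + x_R) / 2.0)

def left_rectangle(f, x_L, x_R):
    return (x_R - x_L) * f(x_L)

exact = math.e - 1.0  # integral of e^x over [0, 1]
err_mid = abs(midpoint(math.exp, 0.0, 1.0) - exact)
err_rect = abs(left_rectangle(math.exp, 0.0, 1.0) - exact)
print(err_mid, err_rect)  # the midpoint error is much smaller
```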

📊 Trapezoidal rule

📊 Definition and geometric meaning

Trapezoidal rule: approximates the integral by (x_R - x_L)/2 times (f(x_L) + f(x_R)).

  • Uses linear interpolation between the two endpoints.
  • The formula equals the area of a trapezium with vertices (x_L, 0), (x_R, 0), (x_R, f(x_R)), and (x_L, f(x_L)).
  • The linear interpolation polynomial is L_1(x) = [(x_R - x)/(x_R - x_L)] times f(x_L) + [(x - x_L)/(x_R - x_L)] times f(x_R).

📏 Error bound for Trapezoidal rule

Theorem 5.3.3: If f has a continuous second derivative on [x_L, x_R], and m_2 is the maximum of the absolute value of f''(x) over the interval, then:

  • Error ≤ (1/12) times m_2 times (x_R - x_L) cubed

Key observations:

  • Error depends on the second derivative, like Midpoint rule.
  • Error grows with the cube of the interval width.
  • The coefficient 1/12 is twice as large as Midpoint's 1/24, so Midpoint is more accurate.
  • Proof uses the truncation error from linear interpolation: f(x) - L_1(x) = (1/2) times (x - x_L)(x - x_R) times f''(ξ(x)).
  • The mean-value theorem applies because (x - x_L)(x - x_R) ≤ 0 on the interval.

Comparison with Midpoint:

Rule | Points used | Error coefficient | Error order
Midpoint | 1 (center) | 1/24 | Cubic in interval width
Trapezoidal | 2 (endpoints) | 1/12 | Cubic in interval width
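The factor-of-two difference between the Trapezoidal and Midpoint error bounds can be observed numerically (a sketch; f = e^x on [0, 1] is my illustrative choice):

```python
# Trapezoidal rule (integral of the linear interpolant) vs Midpoint rule.

import math

def trapezoidal(f, x_L, x_R):
    return (x_R - x_L) / 2.0 * (f(x_L) + f(x_R))

def midpoint(f, x_L, x_R):
    return (x_R - x_L) * f((x_L + x_R) / 2.0)

exact = math.e - 1.0
err_trap = abs(trapezoidal(math.exp, 0.0, 1.0) - exact)
err_mid = abs(midpoint(math.exp, 0.0, 1.0) - exact)
print(err_trap, err_mid)  # trapezoidal error is roughly twice the midpoint error
```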

🎪 Simpson's rule

🎪 Definition

Simpson's rule: uses quadratic interpolation with nodes x_L, x_M = (x_L + x_R)/2, and x_R, giving the approximation (x_R - x_L)/6 times (f(x_L) + 4 times f(x_M) + f(x_R)).

  • Based on a quadratic interpolation polynomial through three points.
  • Weights are 1, 4, and 1 (scaled by the interval width divided by 6).
  • More sophisticated than the previous rules.

📏 Error bound for Simpson's rule

Theorem 5.3.4 (stated without proof): If f has a continuous fourth derivative on [x_L, x_R], and m_4 is the maximum of the absolute value of f^(4)(x) over the interval, then:

  • Error ≤ (1/2880) times m_4 times (x_R - x_L) to the fifth power

Why Simpson's rule is highly accurate:

  • Error depends on the fourth derivative.
  • Error grows with the fifth power of the interval width.
  • Much smaller error than Rectangle, Midpoint, or Trapezoidal for smooth functions.
  • The coefficient 1/2880 is very small.

Example: If you halve the interval width, Simpson's error bound becomes 1/32 as large, while Trapezoidal becomes 1/8 as large.
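Since Simpson's rule has fourth-derivative error, it integrates cubics exactly, which makes for an easy check (a sketch; `simpson` is an illustrative helper name):

```python
# Simpson's rule on a single interval: weights 1, 4, 1 scaled by (x_R - x_L)/6.

def simpson(f, x_L, x_R):
    x_M = (x_L + x_R) / 2.0
    return (x_R - x_L) / 6.0 * (f(x_L) + 4.0 * f(x_M) + f(x_R))

# Exact for cubics: the integral of x^3 over [0, 2] is 4.
print(simpson(lambda x: x**3, 0.0, 2.0))  # 4.0 up to rounding
```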

🧱 Composite rules

🧱 Why composite rules are needed

  • Single-interval rules from Section 5.3 often produce errors larger than required accuracy.
  • Composite rules apply a simple rule repeatedly over many small subintervals.
  • The integral over [a, b] is split: integral from a to b = integral from x_0 to x_1 + integral from x_1 to x_2 + ... + integral from x_(n-1) to x_n.

🧱 Structure of composite approach

Subdivision:

  • Divide [a, b] into n subintervals: a = x_0 < x_1 < ... < x_n = b.
  • Uniform spacing: x_k = a + k times h, where h = (b - a)/n.
  • On each subinterval [x_(k-1), x_k], apply a simple rule to get approximation I_k.

Total approximation:

  • Sum the subinterval approximations: I = sum from k=1 to n of I_k.
  • Each I_k approximates the integral from x_(k-1) to x_k.
  • For each subinterval, x_L = x_(k-1), x_R = x_k, and x_M = (x_(k-1) + x_k)/2.

Don't confuse: Composite rules are not new formulas; they are systematic applications of the simple rules over smaller intervals to achieve better overall accuracy.


Composite rules

5.4 Composite rules

🧭 Overview

🧠 One-sentence thesis

Composite integration rules achieve higher accuracy than single-interval rules by subdividing the integration domain and summing approximations over each subinterval, with error bounds that decrease predictably as the step size shrinks.

📌 Key points (3–5)

  • Why composite rules: Single-interval rules (Section 5.3) often produce errors larger than required, so composite rules subdivide the domain to improve accuracy.
  • How they work: Split the interval [a, b] into n subintervals of width h = (b − a)/n, apply a simple rule on each piece, then sum the results.
  • Error behavior: The composite rule's total error is bounded by c(b − a)h^p, where p depends on the underlying rule (Rectangle: p=1, Midpoint/Trapezoidal: p=2, Simpson's: p=4).
  • Common confusion: Halving h does not halve the error uniformly—Rectangle errors halve (O(h)), Midpoint/Trapezoidal errors drop by 1/4 (O(h²)), and Simpson's errors drop by 1/16 (O(h⁴)).
  • Practical trade-off: Midpoint and Trapezoidal rules require the same work as Rectangle but converge faster; Simpson's needs twice as many function evaluations but converges much faster.

🧩 Core mechanism

🧩 Subdivision strategy

The excerpt subdivides [a, b] into n equal subintervals:

  • Nodes: a = x₀ < x₁ < … < xₙ = b, where xₖ = a + kh and h = (b − a)/n.
  • Each subinterval [xₖ₋₁, xₖ] has width h.

The integral over [a, b] is the sum of integrals over each subinterval: ∫ₐᵇ f(x)dx = ∫ₓ₀ˣ¹ f(x)dx + ∫ₓ₁ˣ² f(x)dx + … + ∫ₓₙ₋₁ˣₙ f(x)dx.

🔧 Applying a simple rule on each piece

  • On each subinterval [xₖ₋₁, xₖ], apply one of the rules from Section 5.3 (Rectangle, Midpoint, Trapezoidal, or Simpson's) to get an approximation Iₖ.
  • Sum all Iₖ to approximate the full integral: I ≈ ∑ₖ₌₁ⁿ Iₖ.

Example: For the composite left Rectangle rule, evaluate f at the left endpoint of each subinterval and multiply by h, then sum: I_L = h(f(a) + f(a+h) + … + f(b−h)).

📐 General error bound (Theorem 5.4.1)

The excerpt proves that if each subinterval's remainder term is bounded by cₖ · h^(p+1), and c = max{c₁, …, cₙ}, then:

|∫ₐᵇ f(x)dx − I| ≤ c(b − a)h^p.

Why this matters:

  • The total error depends on h^p, not h^(p+1), because n · h = b − a.
  • Smaller h → smaller error, and the rate of decrease is determined by p.

Don't confuse: The single-interval error is O(h^(p+1)), but the composite error is O(h^p) because you sum over n = (b−a)/h intervals.

🛠️ Four composite rules

🛠️ Composite Rectangle rules (left and right)

Left Rectangle:

  • I_L = h(f(a) + f(a+h) + … + f(b−h))
  • Error bound: |∫ₐᵇ f(x)dx − I_L| ≤ (1/2)M₁(b−a)h, where M₁ = max |f′(x)| on [a,b].

Right Rectangle:

  • I_R = h(f(a+h) + f(a+2h) + … + f(b))
  • Same error bound: ≤ (1/2)M₁(b−a)h.

Key property: Both are O(h), so halving h roughly halves the error.

🛠️ Composite Midpoint rule

  • I_M = h(f(a + h/2) + f(a + 3h/2) + … + f(b − h/2))
  • Error bound: |∫ₐᵇ f(x)dx − I_M| ≤ (1/24)M₂(b−a)h², where M₂ = max |f′′(x)| on [a,b].

Key property: O(h²), so halving h reduces error by a factor of 4.

🛠️ Composite Trapezoidal rule

  • I_T = h((1/2)f(a) + f(a+h) + … + f(b−h) + (1/2)f(b))
  • Error bound: |∫ₐᵇ f(x)dx − I_T| ≤ (1/12)M₂(b−a)h².

Key property: Also O(h²), but the error bound is twice as large as the Midpoint rule's.

Alternative interpretation: The composite Trapezoidal rule computes the integral of the linear spline that interpolates f at the nodes.

🛠️ Composite Simpson's rule

  • I_S = (h/6)(f(a) + 4f(a+h/2) + 2f(a+h) + 4f(a+3h/2) + … + 4f(b−h/2) + f(b))
  • Error bound: |∫ₐᵇ f(x)dx − I_S| ≤ (1/2880)M₄(b−a)h⁴, where M₄ = max |f⁽⁴⁾(x)| on [a,b].

Key property: O(h⁴), so halving h reduces error by a factor of 16.
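The three main composite formulas above can be sketched directly; doubling n should cut the Trapezoidal and Midpoint errors by roughly 4 and the Simpson error by roughly 16 (the helper names and the test function e^x are mine):

```python
# Composite rules: split [a, b] into n subintervals of width h = (b - a)/n
# and sum the simple-rule approximations over the pieces.

import math

def composite_trapezoidal(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

def composite_midpoint(f, a, b, n):
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def composite_simpson(f, a, b, n):
    h = (b - a) / n
    return sum((h / 6.0) * (f(a + k * h) + 4.0 * f(a + (k + 0.5) * h)
                            + f(a + (k + 1) * h)) for k in range(n))

exact = math.e - 1.0  # integral of e^x over [0, 1]
for n in (4, 8):
    print(n,
          abs(composite_trapezoidal(math.exp, 0, 1, n) - exact),
          abs(composite_midpoint(math.exp, 0, 1, n) - exact),
          abs(composite_simpson(math.exp, 0, 1, n) - exact))
# Doubling n: Trapezoidal/Midpoint errors drop by ~4, Simpson's by ~16.
```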

📊 Comparison and practical guidance

📊 Accuracy vs cost

Rule | Error order | Function evaluations per interval | When exact
Rectangle (left/right) | O(h) | 1 | f is constant
Midpoint | O(h²) | 1 | f is linear
Trapezoidal | O(h²) | 1 (shared endpoints) | f is linear
Simpson's | O(h⁴) | 2 (includes midpoint) | f is cubic polynomial

Practical advice from the excerpt:

  • Rectangle, Midpoint, and Trapezoidal all require the same amount of work (one function evaluation per subinterval, with endpoint sharing).
  • Among these, Midpoint and Trapezoidal are "clearly preferred" because they are O(h²) instead of O(h).
  • Simpson's needs twice as many function evaluations but converges much faster (O(h⁴)).

Don't confuse: "Same amount of work" means roughly one function evaluation per subinterval; because the Trapezoidal rule shares endpoints between neighbouring subintervals, it needs n+1 evaluations for n subintervals, only one more than the n midpoint evaluations of the Midpoint rule.

📊 Midpoint vs Trapezoidal

  • Both are O(h²), but the Midpoint rule's error bound is half that of the Trapezoidal rule: (1/24) vs (1/12).
  • The excerpt notes: "the upper bound for the error of the Trapezoidal rule is twice as large as the bound for the Midpoint rule."

📊 Example 5.4.1 (flat plate length)

The excerpt computes ∫₀⁰·⁸ √(1 + (cos t)²) dt with an error tolerance of 0.01 m (1 cm).

Observed error reduction when halving h:

h | Rectangle error | Midpoint error | Trapezoidal error | Simpson's error
0.8 | 5.54×10⁻² | 1.17×10⁻² | 2.27×10⁻² | 2.18×10⁻⁴
0.4 | 3.36×10⁻² | 2.78×10⁻³ | 5.52×10⁻³ | 1.36×10⁻⁵
0.2 | 1.82×10⁻² | 6.86×10⁻⁴ | 1.37×10⁻³ | 8.49×10⁻⁷
0.1 | 9.43×10⁻³ | 1.71×10⁻⁴ | 3.42×10⁻⁴ | 5.30×10⁻⁸
0.05 | 4.80×10⁻³ | 4.27×10⁻⁵ | 8.54×10⁻⁵ | 3.31×10⁻⁹

Verification of convergence rates:

  • Rectangle: errors decrease by factor ≈2 (O(h)).
  • Midpoint/Trapezoidal: errors decrease by factor ≈4 (O(h²)).
  • Simpson's: errors decrease by factor ≈16 (O(h⁴)).

The excerpt states: "The results are in agreement with these expectations."
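The O(h²) factor-of-4 behaviour of the Midpoint rule on this very integral can be reproduced; since the integral has no simple closed form, the sketch below takes its reference value from a very fine composite Simpson computation (an assumption of mine, not the text's procedure):

```python
# Convergence check for the composite Midpoint rule applied to
# integral of sqrt(1 + cos(t)^2) over [0, 0.8], as in Example 5.4.1.

import math

def f(t):
    return math.sqrt(1.0 + math.cos(t) ** 2)

def composite_midpoint(g, a, b, n):
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def composite_simpson(g, a, b, n):
    h = (b - a) / n
    return sum((h / 6.0) * (g(a + k * h) + 4.0 * g(a + (k + 0.5) * h)
                            + g(a + (k + 1) * h)) for k in range(n))

reference = composite_simpson(f, 0.0, 0.8, 4096)  # far more accurate than needed
errors = [abs(composite_midpoint(f, 0.0, 0.8, n) - reference)
          for n in (1, 2, 4, 8, 16)]  # h = 0.8, 0.4, 0.2, 0.1, 0.05
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(ratios)  # each ratio close to 4, confirming O(h^2)
```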

⚠️ Measurement and rounding errors

⚠️ Perturbed function values

The excerpt considers what happens when function values are not exact but perturbed by an error ε(x), so |f(x) − f̂(x)| = ε(x), with εₘₐₓ = max ε(x) on [a,b].

Impact on the integral:

|∫ₐᵇ (f(x) − f̂(x))dx| ≤ εₘₐₓ(b − a).

⚠️ Total error with perturbations

For the composite left Rectangle rule using perturbed values f̂, the total error is:

|I − Î_L| ≤ |I − I_L| + |I_L − Î_L|

The excerpt derives:

|∫ₐᵇ f(x)dx − Î_L| ≤ (1/2)M₁(b−a)h + h·∑ₖ₌₀ⁿ⁻¹ εₘₐₓ = (1/2)M₁(b−a)h + εₘₐₓ(b−a), using n·h = b − a.

Interpretation:

  • First term: discretization error (decreases as h → 0).
  • Second term: accumulated measurement error (increases as h → 0, because more function evaluations are needed).

Don't confuse: Making h smaller does not always improve accuracy when measurement errors are present—there is a trade-off between discretization error and rounding/measurement error accumulation.


Measurement and rounding errors

5.5 Measurement and rounding errors

🧭 Overview

🧠 One-sentence thesis

Measurement and rounding errors in function values limit the accuracy of numerical integration, making it pointless to reduce step size below a threshold where measurement error dominates, and some integrals are inherently ill-conditioned when small relative errors in input produce large relative errors in output.

📌 Key points (3–5)

  • What measurement error does: perturbs function values by an error epsilon, which propagates into the integral approximation and adds to the discretization error.
  • The step-size threshold: there is a minimum useful step size (related to epsilon-max divided by the derivative bound) below which reducing h does not improve total error because measurement error dominates.
  • Condition number of an integral: the ratio of the integral of absolute-value-of-f to the absolute-value-of-the-integral-of-f; when this ratio times the relative input error is order 1, the problem is ill-conditioned.
  • Common confusion: absolute vs relative error—the absolute error bound epsilon-max limits the difference between the true and perturbed function values; the relative error bound (denoted by a different symbol in the text) limits that difference as a fraction of the true value.
  • Why it matters: even with arbitrarily small step sizes, measurement errors can prevent accurate integration, and some integrals (like nearly-canceling positive and negative contributions) are fundamentally hard to compute accurately.

📏 How measurement errors propagate

📏 Perturbed function values

Measurement and rounding errors perturb function values such that the absolute difference between the true function f(x) and the perturbed function f-hat(x) equals epsilon(x).

  • Define epsilon-max = maximum of epsilon(x) over the interval [a, b].
  • The error in the exact integral (before any numerical approximation) is bounded by epsilon-max times (b minus a).
  • This is the baseline error introduced by measurement alone, independent of the integration rule.

🧮 Total error with a numerical rule

The excerpt derives the total error for the composite left Rectangle rule:

  • The difference between the exact integral I and the perturbed approximation I-hat-L is split into two parts:
    • The discretization error: absolute-value(I minus I-L), which depends on step size h.
    • The measurement error: absolute-value(I-L minus I-hat-L), which depends on epsilon-max.
  • Using the triangle inequality and summing over n subintervals, the total error is bounded by:
    • (one-half times M-1 times h plus epsilon-max) times (b minus a),
    • where M-1 = maximum of absolute-value(f-prime(x)) over [a, b].

🚫 The useless step-size regime

  • Key conclusion: it is useless to take h smaller than 2 epsilon-max divided by M-1.
  • Why: when h is very small, the discretization error (proportional to h) becomes smaller than the measurement error (proportional to epsilon-max), so the measurement error dominates and further reduction in h does not improve total accuracy.
  • Example from the excerpt: with epsilon(x) = 10^-3, the total error remains larger than 0.8 times 10^-3, and there is no benefit to taking step size smaller than 4 epsilon-max = 4 times 10^-3.
  • Don't confuse: this threshold is specific to the composite left Rectangle rule; similar (but different) thresholds exist for Midpoint, Trapezoidal, and Simpson's rules.
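The stalling behaviour can be simulated: below, f = e^x on [0, 1] is sampled with a worst-case systematic perturbation of size eps_max = 10⁻³ (both the test function and the perturbation model are illustrative assumptions, not the text's data):

```python
# Discretization error shrinks with h = 1/n, but the total error of the
# composite left Rectangle rule stalls near the floor eps_max * (b - a).

import math

eps_max = 1e-3

def perturbed_f(x):
    # worst-case perturbation: a systematic offset of size eps_max
    return math.exp(x) + eps_max

def composite_left_rectangle(g, a, b, n):
    h = (b - a) / n
    return h * sum(g(a + k * h) for k in range(n))

exact = math.e - 1.0
errs = [abs(composite_left_rectangle(perturbed_f, 0.0, 1.0, n) - exact)
        for n in (10, 100, 10_000, 100_000)]
print(errs)
# The error decreases with h at first, then stalls near
# eps_max * (b - a) = 1e-3: shrinking h further buys nothing.
```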

🧪 Conditioning of the integral

🧪 Well-conditioned vs ill-conditioned

The excerpt distinguishes between the absolute error bound epsilon-max and a relative error bound (denoted by a different Greek letter in the text):

  • Absolute error: bounded by epsilon-max (a fixed number).
  • Relative error: bounded by absolute-value(f(x)) times epsilon-max-relative.
  • The relative error in the integral is bounded by:
    • (integral of absolute-value-of-f divided by absolute-value-of-integral-of-f) times epsilon-max-relative.

Condition number of the integral: K-I = (integral from a to b of absolute-value-of-f(x) dx) divided by absolute-value(integral from a to b of f(x) dx).

  • If K-I times epsilon-max-relative is of order 1, the problem is ill-conditioned: the relative error in the computed integral is of order 1, meaning no digits may be correct in the worst case.
  • Why this happens: when the integral of f is small (because positive and negative parts nearly cancel) but the integral of absolute-value-of-f is large, small relative errors in function values can produce large relative errors in the result.

💼 Example: car manufacturer profits

The excerpt gives a concrete scenario:

  • Spring profits: w-spring(t) = 0.01 plus sin(pi t minus pi/2) for t in [0, 1].
  • Fall profits: w-fall(t) = 0.01 plus sin(pi t plus pi/2) for t in [0, 1].
  • Both integrals W-spring and W-fall equal 0.01 (10 million dollars).
  • The integral of absolute-value-of-w-spring is approximately 0.64.
  • Condition number: K-I = 0.64 divided by 0.01 = 64.
  • If function values have only two-digit accuracy, epsilon-max-relative = 0.005, so K-I times epsilon-max-relative = 0.32.
  • Conclusion: the determination of the integral is ill-conditioned; in the worst case, no digit of the approximation is correct.
  • The composite left Rectangle rule with h = 0.02 approximates W-spring as minus 0.01 (wrong sign!) and W-fall as 0.03 (three times too large), illustrating the ill-conditioning.
  • Note: the Midpoint and Trapezoidal rules obtain exact results for h = 0.02 in this example, showing that rule choice matters for specific functions.
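The condition number K_I ≈ 64 for the spring-profit curve can be checked numerically; the fine-grid midpoint sums below are my illustrative computation, not the text's:

```python
# Condition number K_I = (integral of |w|) / |integral of w| for
# w_spring(t) = 0.01 + sin(pi t - pi/2) on [0, 1].

import math

def w_spring(t):
    return 0.01 + math.sin(math.pi * t - math.pi / 2.0)

n = 100_000
h = 1.0 / n
I = h * sum(w_spring((k + 0.5) * h) for k in range(n))          # ~0.01
I_abs = h * sum(abs(w_spring((k + 0.5) * h)) for k in range(n))  # ~0.64
print(I, I_abs, I_abs / I)  # condition number close to 64
```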

📊 Summary table

Concept | Definition / bound | Implication
Absolute error epsilon-max | max of epsilon(x) over [a, b] | Baseline error in function values
Total error (left Rectangle) | (one-half M-1 h plus epsilon-max)(b minus a) | Sum of discretization and measurement errors
Minimum useful h | approximately 2 epsilon-max / M-1 | Below this, measurement error dominates
Condition number K-I | (integral of abs(f)) / abs(integral of f) | Measures sensitivity to relative input errors
Ill-conditioned | K-I times epsilon-max-relative = O(1) | Relative error in output is order 1

🔍 How to distinguish absolute vs relative error

  • Absolute error: a fixed bound on the difference, independent of the size of f(x).
  • Relative error: a bound on the difference as a fraction of f(x), so the absolute difference grows with the magnitude of f.
  • The excerpt uses epsilon-max for absolute error and a different symbol (epsilon-max-relative) for relative error; the condition number analysis uses relative error because it compares the relative error in the integral to the relative error in function values.

Interpolatory quadrature rules

5.6 Interpolatory quadrature rules *

🧭 Overview

🧠 One-sentence thesis

Interpolatory quadrature rules approximate integrals by integrating the Lagrange interpolation polynomial through chosen nodes, and they exactly integrate polynomials up to degree N (or N+1 when N is even).

📌 Key points (3–5)

  • Core idea: replace the integrand with its Lagrange interpolation polynomial at N+1 nodes, then integrate the polynomial instead.
  • Weight formula: weights w_ℓ are computed by integrating the Lagrange basis polynomials and do not depend on the function f.
  • Exactness guarantee: every (N+1)-point interpolatory rule is exact for all polynomials of degree at most N.
  • Common confusion: the degree of exactness depends on whether N is even or odd—even N gives exactness up to degree N+1, odd N only up to degree N.
  • Newton-Cotes as special case: when nodes are equidistant and include both endpoints, the rule is called Newton-Cotes (e.g., Trapezoidal rule for N=1, Simpson's rule for N=2).

🧮 How interpolatory quadrature works

🧮 The construction method

The method replaces the integral of f with the integral of its Lagrange interpolation polynomial L_N:

Integral of f(x) from x_L to x_R ≈ integral of L_N(x) from x_L to x_R = sum from ℓ=0 to N of w_ℓ · f(x_ℓ)

  • Step 1: choose N+1 nodes x_0, ..., x_N in the interval [x_L, x_R].
  • Step 2: build the Lagrange interpolation polynomial L_N that passes through f at those nodes.
  • Step 3: integrate L_N instead of f; this yields a weighted sum of function values.

🔢 Weight computation

The weights are defined by:

w_ℓ = integral from x_L to x_R of L_ℓ^N(x) dx

  • L_ℓ^N(x) is the ℓ-th Lagrange basis polynomial.
  • Key property: weights depend only on the nodes, not on the function f.
  • Once nodes are chosen, weights can be precomputed and reused for any function.
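The weights can be computed exactly by integrating each Lagrange basis polynomial; the sketch below does this with exact rational arithmetic from the standard library (`lagrange_weights` is an illustrative helper, not from the text):

```python
# Interpolatory weights w_l = integral over [x_L, x_R] of the l-th
# Lagrange basis polynomial, via polynomial arithmetic on coefficient lists.

from fractions import Fraction

def lagrange_weights(nodes, x_L, x_R):
    nodes = [Fraction(x) for x in nodes]
    weights = []
    for l, x_l in enumerate(nodes):
        # Coefficients of L_l^N(x) = prod over j != l of (x - x_j)/(x_l - x_j),
        # stored lowest degree first.
        coeffs = [Fraction(1)]
        for j, x_j in enumerate(nodes):
            if j == l:
                continue
            denom = x_l - x_j
            new = [Fraction(0)] * (len(coeffs) + 1)
            for k, c in enumerate(coeffs):     # multiply by (x - x_j)/denom
                new[k + 1] += c / denom
                new[k] += -x_j * c / denom
            coeffs = new
        a, b = Fraction(x_L), Fraction(x_R)    # integrate termwise
        weights.append(sum(c * (b ** (k + 1) - a ** (k + 1)) / (k + 1)
                           for k, c in enumerate(coeffs)))
    return weights

# Newton-Cotes with N = 2 on [0, 1] recovers Simpson's weights 1/6, 4/6, 1/6.
print(lagrange_weights([0, Fraction(1, 2), 1], 0, 1))
```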

🎯 Exactness and error behavior

🎯 Exactness for polynomials (Theorem 5.6.1)

If f is a polynomial of degree at most N, then every (N+1)-point interpolatory quadrature rule is exact: the integral equals the weighted sum exactly.

  • Why: the Lagrange interpolation polynomial of a polynomial of degree ≤ N is the polynomial itself (from Theorem 2.3.1).
  • Example: if you use 3 nodes (N=2) and f is a quadratic polynomial, the rule gives the exact integral.

🎯 Converse characterization (Theorem 5.6.2)

If a quadrature rule with nodes x_0, ..., x_N and weights w_0, ..., w_N is exact for all polynomials of degree at most N, then it must be the interpolatory rule.

  • This means interpolatory rules are the unique rules that achieve exactness for degree-N polynomials with (N+1) nodes.
  • The weights are uniquely determined by equation (5.5).

📏 Remainder (error) terms (Theorem 5.6.3)

The error depends on whether N is even or odd:

Case | Smoothness required | Error bound | Degree of exactness
N even | f ∈ C^(N+2)[x_L, x_R] | C_N · ((x_R − x_L)/N)^(N+3) · f^(N+2)(ξ) | Up to degree N+1
N odd | f ∈ C^(N+1)[x_L, x_R] | D_N · ((x_R − x_L)/N)^(N+2) · f^(N+1)(ξ) | Up to degree N
  • Even N: the rule integrates polynomials up to degree N+1 exactly.
  • Odd N: the rule integrates polynomials only up to degree N exactly.
  • Don't confuse: the number of nodes is N+1, but the degree of exactness is N or N+1 depending on parity.

🏛️ Newton-Cotes quadrature rules

🏛️ Definition

Newton-Cotes quadrature rule: an interpolatory rule where x_L = x_0 < x_1 < ... < x_N = x_R and the nodes are equidistantly distributed.

  • Equidistant: the spacing between consecutive nodes is constant.
  • Endpoints included: both x_L and x_R are nodes.

📐 Familiar examples

📐 Trapezoidal rule (N=1)

  • Uses linear interpolation at the two endpoints x_L and x_R.
  • This is a Newton-Cotes rule with N=1.
  • The error term matches Theorem 5.6.3 for N=1 (odd).

📐 Simpson's rule (N=2)

  • Uses quadratic interpolation at three nodes: x_L, (x_L + x_R)/2, and x_R.
  • This is a Newton-Cotes rule with N=2.
  • The error term matches Theorem 5.6.3 for N=2 (even), so it integrates cubics exactly.

⚠️ Higher-order Newton-Cotes

  • Newton-Cotes rules of higher order (N > 2) are rarely used in practice.
  • Reason: negative weights can occur, which is less desirable (can amplify rounding errors).

🔄 Composite rules and practical use

🔄 Composite interpolatory rules

  • How: repeat the interpolation on subintervals, as explained in Section 5.4.
  • Example: divide [x_L, x_R] into many small intervals, apply the same interpolatory rule on each, then sum the results.
  • This improves accuracy by reducing the interval length, which shrinks the error terms.

🔄 Why composite matters

  • Single-interval high-order rules can have large errors if the interval is wide.
  • Composite rules with lower-order interpolation (e.g., composite Trapezoidal or Simpson) are often more stable and easier to implement.

Gauss quadrature rules

5.7 Gauss quadrature rules *

🧭 Overview

🧠 One-sentence thesis

Gauss quadrature rules achieve higher accuracy than Newton-Cotes rules by choosing both weights and nodes optimally, so that polynomials up to degree 2N+1 are integrated exactly with N+1 points.

📌 Key points (3–5)

  • Core difference from Newton-Cotes: Newton-Cotes fixes nodes first (equidistant), then computes weights; Gauss determines both nodes and weights together to maximize polynomial exactness.
  • Exactness gain: An (N+1)-point Gauss rule integrates polynomials up to degree 2N+1 exactly, whereas Newton-Cotes with the same number of points integrates only up to degree N or N+1.
  • Standard interval: Gauss rules are defined on [-1, 1]; any other interval [x_L, x_R] must be transformed via a linear change of variable.
  • Common confusion: The nodes are not equidistant and not chosen arbitrarily—they are the unique solution to a system of 2N+2 nonlinear equations.
  • Practical use: Tables provide precomputed nodes and weights; composite Gauss rules apply the formula on subintervals for higher accuracy.

🔄 How Gauss quadrature differs from Newton-Cotes

🔄 Newton-Cotes approach (recap)

  • Nodes chosen first: equidistant points between x_L and x_R.
  • Weights computed second: by integrating the interpolating polynomial through those fixed nodes.
  • Exactness: polynomials up to degree N+1 (if N even) or N (if N odd).
  • Example: Trapezoidal rule (N=1) and Simpson's rule (N=2) are Newton-Cotes rules.

🎯 Gauss approach

  • Both nodes and weights determined together: the 2N+2 unknowns (N+1 weights w_ℓ and N+1 nodes x_ℓ) are chosen to satisfy 2N+2 equations.
  • Goal: integrate polynomials 1, x, x², ..., x^(2N+1) exactly.
  • Result: a unique solution exists, yielding the Gauss quadrature rule.
  • Don't confuse: Gauss nodes are not equidistant; they are the roots of orthogonal polynomials (though the excerpt does not name them).

🧮 Constructing a Gauss rule: the two-point example

🧮 Setup for N=1 (two points)

  • Unknowns: w₀, w₁, x₀, x₁.
  • Requirement: the formula
    integral from -1 to 1 of f(x) dx ≈ w₀ f(x₀) + w₁ f(x₁)
    must be exact when f is any polynomial of degree at most 3.

📐 The four equations

Substitute f(x) = 1, x, x², x³ and equate the approximation to the exact integral:

  • f(x) = 1: exact integral 2 → equation w₀ + w₁ = 2
  • f(x) = x: exact integral 0 → equation w₀x₀ + w₁x₁ = 0
  • f(x) = x²: exact integral 2/3 → equation w₀x₀² + w₁x₁² = 2/3
  • f(x) = x³: exact integral 0 → equation w₀x₀³ + w₁x₁³ = 0

🔍 Solution by symmetry

  • The equations are symmetric, so w₀ = w₁.
  • From the first equation: w₀ = w₁ = 1.
  • Solving the remaining equations yields:
    x₀ = -√3/3, x₁ = √3/3.
  • Final formula:
    integral from -1 to 1 of f(x) dx ≈ f(-√3/3) + f(√3/3).
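A brute-force check of this derivation fits in a few lines. The sketch below (function name `gauss2` is mine) verifies that the two-node rule reproduces the integrals of 1, x, x², x³ exactly, and shows that exactness stops at degree 4:

```python
import math

# Two-point Gauss rule on [-1, 1]: nodes ±sqrt(3)/3, weights 1.
x0, x1 = -math.sqrt(3) / 3, math.sqrt(3) / 3

def gauss2(f):
    return f(x0) + f(x1)

# Exact for monomials up to degree 3 ...
for j in range(4):
    exact = 2.0 / (j + 1) if j % 2 == 0 else 0.0  # integral of x^j on [-1, 1]
    assert abs(gauss2(lambda x: x**j) - exact) < 1e-14

# ... but not for degree 4: the rule gives 2/9, while the exact value is 2/5.
print(gauss2(lambda x: x**4))
```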

📊 General Gauss quadrature formula

📊 Standard form on [-1, 1]

Gauss quadrature rule: integral from -1 to 1 of f(x) dx ≈ sum from ℓ=0 to N of w_ℓ f(x_ℓ), where the 2N+2 parameters x_ℓ and w_ℓ are chosen so that integral from -1 to 1 of x^j dx = sum from ℓ=0 to N of w_ℓ x_ℓ^j for j = 0, ..., 2N+1.

  • Exactness: polynomials up to degree 2N+1 are integrated exactly.
  • Precomputed values: Table 5.2 lists nodes and weights for N=0, 1, 2, 3.
  • N=0 case: corresponds to the Midpoint rule (one point at x=0, weight=2).

🔢 Table of nodes and weights (excerpt summary)

  • N=0: x₀=0, w₀=2 (Midpoint rule).
  • N=1: x₀=-√3/3, x₁=√3/3, w₀=w₁=1.
  • N=2: three points, including x₁=0 with weight 8/9.
  • N=3: four points with more complex irrational values.

🔀 Transforming to an arbitrary interval [x_L, x_R]

🔀 Linear transformation

  • Why needed: Gauss rules are defined on [-1, 1]; real integrals often have different limits.
  • Change of variable:
    integral from x_L to x_R of f(y) dy = ((x_R - x_L)/2) × integral from -1 to 1 of f((x_R - x_L)/2 × x + (x_L + x_R)/2) dx.
  • Interpretation: the new variable x runs from -1 to 1, and y is a linear function of x that runs from x_L to x_R.

🧩 Gauss formula on [x_L, x_R]

Applying the Gauss rule to the transformed integral:

integral from x_L to x_R of f(y) dy ≈ ((x_R - x_L)/2) × sum from ℓ=0 to N of w_ℓ f((x_R - x_L)/2 × x_ℓ + (x_L + x_R)/2).

  • x_ℓ: the tabulated Gauss nodes on [-1, 1].
  • Evaluation points: transform each x_ℓ to the interval [x_L, x_R] via the linear map.

📏 Error term

If f is 2N+2 times continuously differentiable on [x_L, x_R], the error is:

Error = ((x_R - x_L)^(2N+3) × ((N+1)!)^4 / ((2N+3) × ((2N+2)!)^3)) × f^(2N+2)(ξ),

where ξ is some point in (x_L, x_R).

  • Implication: the error involves the (2N+2)-th derivative, confirming exactness for polynomials up to degree 2N+1.

🧪 Example: integrating y⁷ on [1, 1.5]

🧪 Problem setup

  • Exact value: integral from 1 to 1.5 of y⁷ dy = 3.07861328125.
  • Transformation: x_L=1, x_R=1.5, so (x_R - x_L)/2 = 0.25 and (x_L + x_R)/2 = 1.25.
  • Gauss approximation:
    integral ≈ 0.25 × sum from ℓ=0 to N of w_ℓ f(0.25 x_ℓ + 1.25), where f(y) = y⁷.

📉 Results (Table 5.3 summary)

  • N=0 (1 point): absolute error 6.9443 × 10⁻¹
  • N=1 (2 points): absolute error 1.1981 × 10⁻²
  • N=2 (3 points): absolute error 2.4414 × 10⁻⁵
  • N=3 (4 points): absolute error 0 (exact)
  • Why N=3 is exact: y⁷ is degree 7, and the 4-point Gauss rule integrates polynomials up to degree 2×3+1=7 exactly.
  • Accuracy for smaller N: even with fewer points, Gauss rules are very accurate compared to Newton-Cotes of the same order.
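The worked example can be reproduced with a short sketch. The node/weight values for N=0 and N=1 appear in the text; the N=2 and N=3 entries below are the standard Gauss-Legendre values (Table 5.2 itself is only summarized here), and the function name `gauss` is an illustrative choice:

```python
import math

# Gauss nodes and weights on [-1, 1] for N = 0..3 (standard Gauss-Legendre values).
GAUSS = {
    0: ([0.0], [2.0]),
    1: ([-math.sqrt(3) / 3, math.sqrt(3) / 3], [1.0, 1.0]),
    2: ([-math.sqrt(0.6), 0.0, math.sqrt(0.6)], [5 / 9, 8 / 9, 5 / 9]),
    3: ([-0.8611363116, -0.3399810436, 0.3399810436, 0.8611363116],
        [0.3478548451, 0.6521451549, 0.6521451549, 0.3478548451]),
}

def gauss(f, x_L, x_R, N):
    """(N+1)-point Gauss rule, linearly transformed from [-1, 1] to [x_L, x_R]."""
    c, d = (x_R - x_L) / 2, (x_L + x_R) / 2
    nodes, weights = GAUSS[N]
    return c * sum(w * f(c * x + d) for x, w in zip(nodes, weights))

exact = 3.07861328125                 # integral of y^7 on [1, 1.5]
for N in range(4):
    err = abs(gauss(lambda y: y**7, 1.0, 1.5, N) - exact)
    print(N, err)                     # reproduces the Table 5.3 errors; N = 3 is exact
```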

🔁 Composite Gauss quadrature

🔁 Subdivision strategy

  • Composite rule: divide [x_L, x_R] into subintervals and apply the Gauss rule on each.
  • Same principle as composite Newton-Cotes: explained in Section 5.4 (not detailed in this excerpt).
  • Benefit: further reduces error by using smaller intervals, where the high-order derivative in the error term is smaller.

🛠️ Practical note

  • Gauss rules are tabulated and ready to use; composite versions are straightforward to implement.
  • Don't confuse: composite Gauss is not the same as increasing N—it means repeating the same N-point rule on multiple subintervals.


Numerical Time Integration of Initial-Value Problems

6.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Initial-value problems describe time-dependent processes through differential equations with conditions given at the starting time, and numerical methods are needed when analytic solutions cannot be obtained.

📌 Key points (3–5)

  • What an initial-value problem is: a differential equation describing a time-dependent process with conditions specified at the starting time t₀ (initial conditions).
  • When numerical methods are needed: when the problem cannot be solved analytically (in closed form), numerical approximation is required.
  • Well-posedness requirements: before applying numerical methods, verify that the problem has a unique solution that changes continuously with small perturbations.
  • Common confusion: Lipschitz continuity vs differentiability—a function can be differentiable and still not satisfy the Lipschitz condition everywhere; the bounded derivative condition is sufficient but not necessary.
  • Scope limitation: most numerical methods target first-order problems; higher-order problems require special treatment (not covered in this excerpt).

🌊 Motivating example: water discharge

🌊 The physical scenario

The excerpt presents a storage reservoir draining through a pipe:

  • Water is initially at rest.
  • At t₀ = 0, a latch opens instantaneously.
  • Water flow develops gradually due to inertia.
  • Practical question: How long until turbines reach full power?

📐 The mathematical model

The process is described by the nonlinear initial-value problem:

dq/dt = p(t) − aq², t > 0, with q(0) = 0.

  • q (m³/s): flow rate of the water.
  • p: driving force (force/length divided by density, measured in m³/s²), depends on water level and other factors.
  • aq²: friction term.

🔍 When closed-form solutions exist vs when they don't

  • Special case: If p(t) is constant (p(t) = p₀), an analytic solution exists: q(t) = √(p₀/a) tanh(t√(ap₀)).
  • General case: For general functions p, no analytic solution can be obtained → numerical methods are necessary.

Example: If the water level changes over time (making p non-constant), the problem cannot be solved by hand and requires numerical approximation.

🧩 Core concepts: well-posedness

🧩 What well-posedness means

Well-posed initial-value problem: A problem of the form dy/dt = f(t, y), a ≤ t ≤ b, y(a) = yₐ, is well posed if it has a unique solution that depends continuously on the data (a, yₐ, and f).

Three requirements:

  1. Existence: The problem has a solution.
  2. Uniqueness: The solution is unique.
  3. Continuous dependence: Small changes in parameters (differential equation or initial conditions) lead to small changes in the solution.

Why it matters: Before applying numerical methods, you must ensure the problem is well-posed; otherwise, numerical approximations may be meaningless or unstable.

🔧 Lipschitz continuity

Lipschitz continuous in y: A function f(t, y) is Lipschitz continuous in y on a set D ⊂ ℝ² if there exists a constant L > 0 such that |f(t, y₁) − f(t, y₂)| ≤ L|y₁ − y₂| for all (t, y₁), (t, y₂) ∈ D.

  • L is called the Lipschitz constant.
  • Plain language: The function's rate of change in y is bounded—it cannot "jump" too fast.
  • Example: If f changes by at most L units for every 1-unit change in y, it is Lipschitz continuous with constant L.

🔗 Sufficient condition for Lipschitz continuity

Theorem 6.2.1: If f is differentiable with respect to y and |∂f/∂y(t, y)| ≤ L for all (t, y) ∈ D, then f is Lipschitz continuous.

  • In words: If the partial derivative of f with respect to y is bounded, Lipschitz continuity is guaranteed.
  • Don't confuse: This is a sufficient condition, not necessary—some functions are Lipschitz continuous without having a bounded derivative everywhere.

✅ Sufficient condition for well-posedness

Theorem 6.2.2: Suppose D = {(t, y) | a ≤ t ≤ b, −∞ < y < ∞} and f(t, y) is continuous on D. If f is Lipschitz continuous in y on D, then the initial-value problem is well posed.

  • Summary: Continuity + Lipschitz continuity in y → existence, uniqueness, and continuous dependence.
  • Practical implication: Check these conditions before applying numerical methods.

🛠️ Numerical methods overview

🛠️ Why numerical methods are needed

  • Many initial-value problems cannot be solved explicitly (no closed-form solution).
  • Qualitative techniques (e.g., phase portraits) exist but provide insufficient quantitative information.
  • Numerical methods approximate the solution to provide concrete values at specific times.

🎯 Design focus

  • Most methods target first-order problems: dy/dt = f(t, y) with y(a) = yₐ.
  • Higher-order problems: Methods exist that can be applied directly to higher-order initial-value problems, but the excerpt does not cover them.

Don't confuse: A first-order problem is not necessarily "simple"—it can still be nonlinear and require numerical approximation (as in the water discharge example).

📋 Summary of topics covered (from Chapter 5 context)

The excerpt references prior material on numerical integration (Riemann sums, Rectangle/Midpoint/Trapezoidal/Simpson's rules, composite rules, interpolatory and Newton-Cotes quadrature, Gauss quadrature), which are foundational for understanding numerical time integration.



Theory of Initial-Value Problems

6.2 Theory of Initial-Value problems

🧭 Overview

🧠 One-sentence thesis

Before applying numerical methods to solve an initial-value problem, we must ensure the problem is well-posed—meaning it has a unique solution that depends continuously on the data—which is guaranteed when the function is Lipschitz continuous.

📌 Key points (3–5)

  • Three requirements before numerical solving: existence of a solution, uniqueness of the solution, and continuous dependence on parameters/initial conditions.
  • Well-posedness definition: an initial-value problem is well-posed if it has a unique solution that depends continuously on the data (starting point, initial value, and function).
  • Lipschitz continuity as the key condition: if a function satisfies the Lipschitz condition (bounded rate of change in y), the initial-value problem is well-posed.
  • Common confusion: Lipschitz continuity vs differentiability—being differentiable with a bounded partial derivative is sufficient for Lipschitz continuity, but Lipschitz continuity is the weaker condition that guarantees well-posedness.
  • Why it matters: these theoretical checks ensure that numerical approximation makes sense and that small errors in data won't cause wild solution changes.

🔍 Core definitions

🔍 Well-posed initial-value problem

Well-posed: The initial-value problem dy/dt = f(t, y) for a ≤ t ≤ b with y(a) = y_a is called well-posed if the problem has a unique solution that depends continuously on the data (a, y_a, and f).

  • "Depends continuously on the data" means small changes in the starting time, initial value, or the function f produce only small changes in the solution.
  • This is not just about having a solution; it's about having exactly one solution that behaves predictably.
  • Example: if you slightly change the initial water flow or the driving force in the reservoir problem, the solution should change only slightly, not jump to a completely different behavior.

🔍 Lipschitz continuity

Lipschitz continuous: A function f(t, y) is Lipschitz continuous in y on a set D ⊂ R² if there exists a constant L > 0 such that |f(t, y₁) - f(t, y₂)| ≤ L|y₁ - y₂| for all (t, y₁), (t, y₂) ∈ D.

  • The constant L is called the Lipschitz constant.
  • This condition says: the function's change in output is bounded by a constant times the change in input (in the y variable).
  • In plain language: the function can't change arbitrarily fast as y changes; its "steepness" in the y direction is limited.
  • Example: if two solution curves start at slightly different y values at the same time t, Lipschitz continuity ensures they won't diverge faster than L times their initial separation.

🧪 Checking Lipschitz continuity

🧪 Sufficient condition via differentiability

Theorem 6.2.1: If the function f is differentiable with respect to y, then Lipschitz continuity is implied if the absolute value of the partial derivative ∂f/∂y is bounded by some L > 0 for all (t, y) in D.

  • This gives a practical test: compute the partial derivative of f with respect to y and check if it stays bounded.
  • Don't confuse: differentiability alone is not enough; you need the derivative to be bounded.
  • Example: if ∂f/∂y is always between -10 and 10, then f is Lipschitz continuous with L = 10.

🧪 Why this matters for well-posedness

Theorem 6.2.2: Suppose D = {(t, y) | a ≤ t ≤ b, -∞ < y < ∞} and f(t, y) is continuous on D. If f is Lipschitz continuous in y on D, then the initial-value problem is well-posed.

  • This is the main result: Lipschitz continuity (plus continuity) is a sufficient condition for well-posedness.
  • It guarantees all three requirements: existence, uniqueness, and continuous dependence.
  • The proof is not given in the excerpt but is referenced in textbooks (Boyce and DiPrima).

🛠️ Pre-numerical checklist

🛠️ What to verify before applying numerical methods

The excerpt emphasizes that before using a numerical method, you must ensure:

  1. Existence: The initial-value problem has a solution.
  2. Uniqueness: The solution is the only one (no multiple solutions).
  3. Continuous dependence: Small perturbations in the differential equation parameters or initial conditions cause only small changes in the solution.
  • These are not just theoretical niceties; they ensure numerical approximation is meaningful.
  • If uniqueness fails, a numerical method might converge to any of several solutions unpredictably.
  • If continuous dependence fails, small rounding errors could lead to wildly different numerical results.

🛠️ How the theory connects to practice

  • Well-posedness: ensures the problem is suitable for numerical approximation.
  • Lipschitz continuity: provides a checkable condition (via a bounded derivative).
  • Continuous dependence: guarantees numerical errors won't explode unpredictably.
  • Example scenario: In the water reservoir problem (6.1), before choosing a numerical method to approximate q(t) for general p(t), you would check if the function on the right-hand side (p(t) - aq²) is Lipschitz continuous in q.
  • For the friction term -aq², the partial derivative with respect to q is -2aq, which is bounded on any finite interval, so Lipschitz continuity holds locally.
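Theorem 6.2.1's test can be carried out numerically for the friction term. The sketch below samples |∂f/∂q| = |−2aq| over a bounded strip to estimate a local Lipschitz constant; the parameter values (a = 10, p₀ = 20, q ∈ [0, √2], matching the later stability example) are illustrative:

```python
# Sketch: estimate the Lipschitz constant of f(q) = p0 - a*q^2 on a bounded
# q-interval by sampling |df/dq| = |2*a*q| (parameter values are illustrative).
p0, a = 20.0, 10.0          # matches the later example y' = -10*y^2 + 20
q_max = 2.0 ** 0.5          # solutions stay in (0, sqrt(2))

samples = [i * q_max / 1000 for i in range(1001)]
L = max(abs(-2 * a * q) for q in samples)
print(L)                    # ~ 2*a*sqrt(2) ≈ 28.28: f is Lipschitz on this strip
```

Because |∂f/∂q| grows with |q|, the constant holds only on the bounded strip sampled, which is exactly the "locally Lipschitz" observation made above.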

🔗 Context from the excerpt

🔗 The motivating example

  • The excerpt introduces a water discharge problem: dq/dt = p(t) - aq², q(0) = 0.
  • q is the flow rate (cubic meters per second), p is the driving force, and aq² represents friction.
  • When p(t) is constant, an analytic solution exists (involving hyperbolic tangent).
  • For general p(t), no closed-form solution is available, motivating numerical methods.
  • This example illustrates why theory matters: before numerically solving for general p, we need to know the problem is well-posed.

🔗 What comes next

  • The excerpt mentions that Section 6.3 will present elementary single-step methods for numerical approximation.
  • The theory in Section 6.2 serves as the foundation: it tells us when numerical methods are appropriate and what guarantees we have about the solution's behavior.


Elementary Single-Step Methods for Initial-Value Problems

6.3 Elementary Single-Step methods

🧭 Overview

🧠 One-sentence thesis

Elementary single-step methods approximate solutions to initial-value problems by using different numerical quadrature rules to approximate the integral form of the differential equation over one time interval, with each method trading off between computational simplicity, accuracy, and stability.

📌 Key points (3–5)

  • What single-step methods do: approximate the solution by integrating the differential equation over one time interval [tₙ, tₙ₊₁], using only information from the current time step.
  • Four basic methods: Forward Euler (explicit, simplest), Backward Euler (implicit, unconditionally stable), Trapezoidal (implicit, second-order accurate), and Modified Euler (explicit predictor-corrector, second-order).
  • Explicit vs implicit distinction: explicit methods (Forward Euler, Modified Euler) compute wₙ₊₁ directly from known values; implicit methods (Backward Euler, Trapezoidal) require solving an equation where wₙ₊₁ appears on both sides.
  • Common confusion—stability vs accuracy: a method may require a very small time step for stability even when accuracy would allow larger steps; implicit methods avoid this by being unconditionally stable but cost more per step.
  • Why it matters: these methods form the foundation for numerical time integration, and understanding their accuracy (local truncation error) and stability properties determines which method to use for a given problem.

🔧 Core framework: from differential to integral equation

🔧 The integral equation starting point

The excerpt emphasizes that although the differential equation y′ = f(t, y) is hard to solve directly, it can be rewritten as an integral equation:

y(t) = y(t₀) + ∫[t₀ to t] f(τ, y(τ)) dτ

  • This form is called an integral equation because the unknown function y appears under the integral sign.
  • It is equally difficult to solve as the original problem, but it provides a better starting point for numerical approximation.
  • Why this helps: numerical quadrature rules (Chapter 5) can approximate the integral, leading to different single-step methods.

🔧 Discretization and notation

  • The time axis is divided into discrete points: tₙ = t₀ + n·Δt, n = 0, 1, 2, ...

  • The exact solution at tₙ is denoted yₙ = y(tₙ).

  • The numerical approximation at tₙ is denoted wₙ, with w₀ = y₀ (the initial condition).

  • From tₙ to tₙ₊₁, the integral equation becomes:

    yₙ₊₁ = yₙ + ∫[tₙ to tₙ₊₁] f(t, y(t)) dt

  • Different methods arise from different ways to approximate this integral.

🔧 What "single-step" means

Because integral equation (6.3) only considers one time interval [tₙ, tₙ₊₁], these numerical methods are called single-step methods.

  • Only information from the current step tₙ is used to compute wₙ₊₁.
  • Multi-step methods (Section 6.11) also use earlier steps t₀, ..., tₙ₋₁.

🧮 The four elementary methods

🧮 Forward Euler method (explicit)

The Forward Euler method is the most simple, earliest, and best-known time-integration method.

How it works:

  • Approximates the integral using the left Rectangle rule: the integrand is evaluated at the left endpoint tₙ.
  • Formula: wₙ₊₁ = wₙ + Δt·f(tₙ, wₙ)
  • Geometrically: f(tₙ, wₙ) is the slope of the tangent at (tₙ, wₙ); the method follows this tangent for one time step.
  • The sequence {tₙ, wₙ} forms a piecewise-linear approximation called the Euler polygon.

Why it's explicit:

  • wₙ₊₁ can be computed directly from the formula; no equation-solving is needed.

Example scenario: If you know the current state wₙ and the slope f(tₙ, wₙ), you simply step forward along that slope for time Δt.
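A minimal Forward Euler sketch, applied to the water-discharge model from Section 6.1 with constant driving force (taking p₀ = a = 1 so the exact solution is q(t) = tanh(t); these parameter values and the function name are illustrative):

```python
import math

def forward_euler(f, t0, w0, dt, n_steps):
    """Forward Euler: step along the tangent slope f(t_n, w_n)."""
    t, w = t0, w0
    for _ in range(n_steps):
        w = w + dt * f(t, w)
        t = t + dt
    return w

# q' = p0 - a*q^2 with p0 = a = 1 and q(0) = 0 has exact solution q(t) = tanh(t).
f = lambda t, q: 1.0 - q * q
w = forward_euler(f, 0.0, 0.0, 0.001, 1000)   # integrate to t = 1
print(abs(w - math.tanh(1.0)))                # small first-order (O(dt)) error
```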

🧮 Backward Euler method (implicit)

How it works:

  • Approximates the integral using the right Rectangle rule: the integrand is evaluated at the right endpoint tₙ₊₁.
  • Formula: wₙ₊₁ = wₙ + Δt·f(tₙ₊₁, wₙ₊₁)
  • The unknown wₙ₊₁ appears on both sides of the equation.

Why it's implicit:

  • If f depends linearly on y, the equation can be solved easily.
  • For nonlinear problems, a numerical nonlinear solver (Chapter 4) is required at each time step.
  • This costs more computation time per step than Forward Euler.

Advantage: The Backward Euler method has better stability properties (discussed later), which can outweigh the extra cost per step.
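For a nonlinear f, each Backward Euler step requires a root-finder. A sketch of one step for the same model q' = p₀ − aq², solved with a few Newton iterations (the helper name, iteration count, and tolerance are illustrative choices):

```python
import math

def backward_euler_step(q_n, dt, p0=1.0, a=1.0):
    """Solve w = q_n + dt*(p0 - a*w^2) for w with Newton's method (Chapter 4)."""
    w = q_n                                  # initial guess: previous value
    for _ in range(20):
        g = w - q_n - dt * (p0 - a * w * w)  # residual of the implicit equation
        dg = 1.0 + 2.0 * a * dt * w          # g'(w)
        w_new = w - g / dg
        if abs(w_new - w) < 1e-14:
            return w_new
        w = w_new
    return w

w, dt = 0.0, 0.001
for _ in range(1000):                        # integrate to t = 1
    w = backward_euler_step(w, dt)
print(abs(w - math.tanh(1.0)))               # first-order accurate, like Forward Euler
```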

🧮 Trapezoidal method (implicit)

How it works:

  • Approximates the integral using the Trapezoidal rule: averages the integrand at both endpoints.
  • Formula: wₙ₊₁ = wₙ + (Δt/2)·(f(tₙ, wₙ) + f(tₙ₊₁, wₙ₊₁))
  • Also implicit: wₙ₊₁ appears on the right-hand side.

Why it's better: The Trapezoidal method is more accurate than Forward or Backward Euler (second-order vs first-order local truncation error).

🧮 Modified Euler method (explicit predictor-corrector)

How it works:

  • An explicit variant of the Trapezoidal method.
  • Predictor step: use Forward Euler to predict w̄ₙ₊₁ = wₙ + Δt·f(tₙ, wₙ)
  • Corrector step: use the predicted value in the Trapezoidal formula: wₙ₊₁ = wₙ + (Δt/2)·(f(tₙ, wₙ) + f(tₙ₊₁, w̄ₙ₊₁))
  • Because the predictor is computed first, the corrector becomes an explicit formula.

Why it's useful: achieves second-order accuracy (like Trapezoidal) while remaining explicit (like Forward Euler).

Don't confuse: The Modified Euler method is explicit because the predictor removes the implicit dependence; the Trapezoidal method is implicit because wₙ₊₁ appears on both sides without a predictor.
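The predictor-corrector structure can be sketched directly, again on q' = 1 − q² with exact solution tanh(t) (an illustrative test problem). Halving Δt should divide the error by about 2² = 4, confirming second order:

```python
import math

def modified_euler(f, t0, w0, dt, n_steps):
    t, w = t0, w0
    for _ in range(n_steps):
        pred = w + dt * f(t, w)                         # Forward Euler predictor
        w = w + 0.5 * dt * (f(t, w) + f(t + dt, pred))  # Trapezoidal corrector
        t += dt
    return w

f = lambda t, y: 1.0 - y * y        # exact solution y(t) = tanh(t)
err = lambda dt: abs(modified_euler(f, 0.0, 0.0, dt, round(1.0 / dt))
                     - math.tanh(1.0))
print(err(0.02) / err(0.01))        # ~4: halving dt divides the error by ~2^2
```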

📊 Comparison table

  • Forward Euler (explicit): wₙ₊₁ = wₙ + Δt·f(tₙ, wₙ); 1 function evaluation per step; simplest, conditionally stable.
  • Backward Euler (implicit): wₙ₊₁ = wₙ + Δt·f(tₙ₊₁, wₙ₊₁); 1 evaluation plus a solver per step; unconditionally stable.
  • Trapezoidal (implicit): wₙ₊₁ = wₙ + (Δt/2)·(f(tₙ, wₙ) + f(tₙ₊₁, wₙ₊₁)); 1 evaluation plus a solver per step; second-order, unconditionally stable.
  • Modified Euler (explicit): predictor + corrector; 2 evaluations per step; second-order, conditionally stable.

🎯 Key concepts for analysis

🎯 Local truncation error

The local truncation error at time step n+1, τₙ₊₁, is defined as τₙ₊₁ = (yₙ₊₁ - zₙ₊₁)/Δt

  • yₙ₊₁ is the exact solution at time step n+1.
  • zₙ₊₁ is the numerical approximation obtained by applying the method with starting point yₙ (the exact solution at step n).
  • This measures the new error introduced in one step, assuming the previous step was exact.

Orders of accuracy (from the excerpt):

  • Forward Euler: τₙ₊₁ = O(Δt) — first-order
  • Backward Euler: τₙ₊₁ = O(Δt) — first-order
  • Modified Euler: τₙ₊₁ = O(Δt²) — second-order
  • Trapezoidal: τₙ₊₁ = O(Δt²) — second-order

Example: For Forward Euler, using Taylor expansion shows τₙ₊₁ = (Δt/2)·y″(ξ) for some ξ in (tₙ, tₙ₊₁), so the error is proportional to Δt.

🎯 Global truncation error

The global truncation error eₙ₊₁ at time tₙ₊₁ is defined as eₙ₊₁ = yₙ₊₁ - wₙ₊₁

  • This is the difference between the exact solution and the numerical approximation.
  • It accumulates over all previous steps.
  • Key theorem (Lax's equivalence theorem): If a method is stable and consistent, then the global truncation error has the same order as the local truncation error.

Don't confuse: Local error measures one step assuming you start from the exact solution; global error measures the cumulative effect of all errors over many steps.

🎯 Consistency and convergence

A method is consistent if lim[Δt→0] τₙ₊₁(Δt) = 0, where (n+1)·Δt = T (fixed time).

A method is convergent if lim[Δt→0] eₙ₊₁ = 0, where (n+1)·Δt = T.

  • Consistency means the local error vanishes as the time step shrinks.
  • Convergence means the global error vanishes as the time step shrinks.
  • All four elementary methods are consistent.

🔒 Stability concepts

🔒 The test equation

To analyze stability, the excerpt uses a simplified linear problem:

y′ = λy + g(t), t > t₀, y(t₀) = y₀

  • The solution to the perturbed problem (with initial error ε₀) satisfies ε(t) = ε₀·exp(λ(t - t₀)).
  • The problem is stable if |ε(t)| < ∞ for all t > t₀.
  • It is absolutely stable if lim[t→∞] |ε(t)| = 0.
  • This requires λ ≤ 0 for stability, λ < 0 for absolute stability.

🔒 Amplification factor

For each method applied to the test equation, the error propagates as:

ε̄ₙ₊₁ = Q(λΔt)·ε̄ₙ

  • Q(λΔt) is the amplification factor: it determines how much an existing perturbation is amplified.
  • Numerical stability requires |Q(λΔt)| ≤ 1.

Amplification factors (from the excerpt):

  • Forward Euler: Q(λΔt) = 1 + λΔt
  • Backward Euler: Q(λΔt) = 1/(1 - λΔt)
  • Trapezoidal: Q(λΔt) = (1 + ½λΔt)/(1 - ½λΔt)
  • Modified Euler: Q(λΔt) = 1 + λΔt + ½(λΔt)²
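The four factors are easy to tabulate. A sketch evaluating each Q at λΔt = −3 (a value past Forward Euler's stability limit; the function names are mine):

```python
# Sketch: evaluate each method's amplification factor at a given z = lambda*dt.
def Q_forward_euler(z):  return 1 + z
def Q_backward_euler(z): return 1 / (1 - z)
def Q_trapezoidal(z):    return (1 + z / 2) / (1 - z / 2)
def Q_modified_euler(z): return 1 + z + z * z / 2

# At lambda*dt = -3 the explicit methods amplify errors (|Q| > 1),
# while the implicit methods damp them:
z = -3.0
for Q in (Q_forward_euler, Q_backward_euler, Q_trapezoidal, Q_modified_euler):
    print(Q.__name__, abs(Q(z)))
```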

🔒 Stability conditions for λ < 0

Forward Euler:

  • Requires -1 ≤ 1 + λΔt ≤ 1.
  • Since λ < 0, this gives Δt ≤ -2/λ.
  • Conditionally stable: time step must be small enough.

Backward Euler:

  • Requires |1/(1 - λΔt)| ≤ 1, which is satisfied for all Δt ≥ 0 when λ ≤ 0.
  • Unconditionally stable: no restriction on time step size.

Modified Euler:

  • Similar to Forward Euler: Δt ≤ -2/λ.
  • Conditionally stable.

Trapezoidal:

  • Unconditionally stable for λ ≤ 0.

Don't confuse: Unconditional stability means you can take any Δt for stability; it does not mean you can take arbitrarily large Δt for accuracy—accuracy still requires small enough Δt.

🔒 Stability for nonlinear problems

For a general nonlinear problem y′ = f(t, y), linearize about a point (t̂, ŷ):

λ = ∂f/∂y|(t̂,ŷ)

  • Use this λ in the stability condition |Q(λΔt)| ≤ 1.
  • The stability condition depends on the current approximation and time.

Example from excerpt: For y′ = -10y² + 20, the linearization gives λ = -20ŷ. Since y(t) ∈ (0, √2), the Forward Euler method is stable if Δt ≤ 1/(10√2).

🚀 Higher-order methods: RK4

🚀 The fourth-order Runge-Kutta method

The excerpt introduces the RK4 method as a widely-used higher-order explicit method.

Formula:

  • wₙ₊₁ = wₙ + (1/6)·(k₁ + 2k₂ + 2k₃ + k₄)

where the predictors are:

  • k₁ = Δt·f(tₙ, wₙ)
  • k₂ = Δt·f(tₙ + ½Δt, wₙ + ½k₁)
  • k₃ = Δt·f(tₙ + ½Δt, wₙ + ½k₂)
  • k₄ = Δt·f(tₙ + Δt, wₙ + k₃)

Why it's better:

  • Based on Simpson's rule for numerical integration.
  • Local truncation error is O(Δt⁴) — fourth-order accuracy.
  • Requires 4 function evaluations per step, but the higher accuracy often allows much larger time steps.

Stability:

  • Amplification factor: Q(λΔt) = 1 + λΔt + ½(λΔt)² + (1/6)(λΔt)³ + (1/24)(λΔt)⁴
  • For λ < 0, stable if Δt ≤ -2.8/λ.
  • Conditionally stable, but with a larger stability region than Forward or Modified Euler.

Example from excerpt (Table 6.2): To achieve |eₙ| ≤ 10⁻⁴ integrating from t₀ = 0 to T = 1:

  • Forward Euler needs 10,000 function evaluations.
  • Modified Euler needs 200 function evaluations.
  • RK4 needs only 40 function evaluations.
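An RK4 step follows the four-stage formula above directly. The sketch integrates the illustrative test problem q' = 1 − q² (exact solution tanh(t)) to T = 1 with only 20 steps; the fourth-order error is already tiny:

```python
import math

def rk4_step(f, t, w, dt):
    """One classical RK4 step (4 function evaluations)."""
    k1 = dt * f(t, w)
    k2 = dt * f(t + dt / 2, w + k1 / 2)
    k3 = dt * f(t + dt / 2, w + k2 / 2)
    k4 = dt * f(t + dt, w + k3)
    return w + (k1 + 2 * k2 + 2 * k3 + k4) / 6

f = lambda t, y: 1.0 - y * y      # exact solution y(t) = tanh(t)
w, dt = 0.0, 0.05
for n in range(20):               # integrate to T = 1 with only 20 steps
    w = rk4_step(f, n * dt, w, dt)
print(abs(w - math.tanh(1.0)))    # fourth-order: tiny error despite few steps
```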

🚀 Why higher order matters

The excerpt emphasizes that higher-order methods are preferable when high accuracy is required, as long as the solution is sufficiently smooth.

  • Fewer time steps are needed for a given accuracy.
  • Total work depends on both the number of steps and the cost per step.
  • Example: A second-order method with 2 evaluations per step can be more efficient than a first-order method with 1 evaluation per step.

🔁 Richardson extrapolation for error estimation

🔁 Estimating the global error when order p is known

Assume the global error behaves as e(t, Δt) = cₚ(t)·Δtᵖ for small Δt.

  • Compute two approximations: w^(Δt)_N using N steps of size Δt, and w^(2Δt)_(N/2) using N/2 steps of size 2Δt.

  • The difference between them estimates the error:

    y(t) - w^(Δt)_N ≈ (w^(Δt)_N - w^(2Δt)_(N/2))/(2ᵖ - 1)

  • The factor 1/(2ᵖ - 1) depends on the order p of the method.

Example from excerpt (Table 6.4): For the Forward Euler method (p = 1) applied to a water-discharge problem, the error estimate is approximately doubled when the time step is doubled, confirming first-order behavior.

🔁 Estimating the order p when unknown

If the order is unknown, compute three approximations with Δt, 2Δt, and 4Δt.

  • The ratio (w^(2Δt) - w^(4Δt))/(w^(Δt) - w^(2Δt)) ≈ 2ᵖ
  • If this ratio is close to the expected 2ᵖ, then Δt is small enough for accurate error estimation.

Practical use: This approach enables adaptive time-stepping: adjust Δt during integration to keep the estimated error below a threshold.
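Both uses of Richardson extrapolation fit in one sketch: three Forward Euler runs on the illustrative test problem q' = 1 − q² recover the order p ≈ 1 from the ratio, and the two finest runs yield an error estimate for the finest approximation:

```python
import math

def forward_euler(f, w0, dt, n):
    w = w0
    for i in range(n):
        w = w + dt * f(i * dt, w)
    return w

f = lambda t, y: 1.0 - y * y                 # exact solution y(t) = tanh(t)
dt = 0.001
w1 = forward_euler(f, 0.0, dt, 1000)         # step dt
w2 = forward_euler(f, 0.0, 2 * dt, 500)      # step 2*dt
w4 = forward_euler(f, 0.0, 4 * dt, 250)      # step 4*dt

print((w2 - w4) / (w1 - w2))                 # ~2^p = 2 for Forward Euler (p = 1)
est = (w1 - w2) / (2**1 - 1)                 # Richardson error estimate for w1
print(abs(est - (math.tanh(1.0) - w1)))      # estimate agrees with the true error
```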

🌐 Extension to systems of equations

🌐 Vector notation

For a system of m differential equations:

y′ⱼ = fⱼ(t, y₁, ..., yₘ), j = 1, ..., m

  • Write in vector form: y′ = f(t, y), where y and f are vectors of length m.
  • All single-step methods generalize directly to vector form.

Example: Forward Euler becomes wₙ₊₁ = wₙ + Δt·f(tₙ, wₙ).

🌐 Higher-order problems

A single mth-order differential equation can be converted to a first-order system by defining:

y₁ = x, y₂ = x′, ..., yₘ = x^(m-1)

Example from excerpt (mathematical pendulum):

  • Original: ψ″ + sin(ψ) = 0
  • System: y₁′ = y₂, y₂′ = -sin(y₁)
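The pendulum system can be integrated with any single-step method applied componentwise. A sketch using RK4 on the first-order system (the initial angle, step size, and helper names are illustrative); the pendulum's energy E = y₂²/2 − cos(y₁) is conserved exactly by the true solution, so its drift is a convenient accuracy check:

```python
import math

def f(t, y):
    y1, y2 = y
    return (y2, -math.sin(y1))           # y1' = y2, y2' = -sin(y1)

def rk4_step(f, t, y, dt):
    add = lambda y, k, c: tuple(yi + c * ki for yi, ki in zip(y, k))
    k1 = f(t, y)
    k2 = f(t + dt / 2, add(y, k1, dt / 2))
    k3 = f(t + dt / 2, add(y, k2, dt / 2))
    k4 = f(t + dt, add(y, k3, dt))
    return tuple(yi + dt / 6 * (a + 2 * b + 2 * c + d)
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

energy = lambda y: 0.5 * y[1] ** 2 - math.cos(y[0])
y, dt = (0.5, 0.0), 0.01                 # start at angle 0.5 rad, at rest
e0 = energy(y)
for n in range(1000):                    # integrate to t = 10
    y = rk4_step(f, n * dt, y, dt)
print(abs(energy(y) - e0))               # nearly conserved by the 4th-order method
```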

🌐 Stability for systems

For the linear test system y′ = Ay + g(t):

  • The eigenvalues λⱼ of matrix A play the role of λ in the scalar test equation.
  • Numerical stability requires |Q(λⱼΔt)| ≤ 1 for all eigenvalues λⱼ.
  • If any eigenvalue has Re(λⱼ) > 0, the system is unstable.

Example from excerpt (second-order IVP): For ax″ + bx′ + cx = g(t), the eigenvalues are λ = (-b ± √(b² - 4ac))/(2a). Stability depends on the signs of the real parts.

Don't confuse: For systems, you must check the stability condition for every eigenvalue; a single eigenvalue outside the stability region makes the entire system unstable.

⚠️ Stiff differential equations

⚠️ What stiffness means

Stiff differential equations describe problems that exhibit transients. Their solution is the sum of a rapidly decaying part (the transient) and a slowly-varying part.

Characteristics:

  • The transient decays on a fast timescale (determined by strongly negative eigenvalues).
  • The quasi-stationary solution varies on a slow timescale.
  • After a short time, only the slowly-varying part remains visible.

Why it's a problem for explicit methods:

  • Stability requires Δt small enough to handle the fast transient.
  • But accuracy for the slow quasi-stationary solution would allow much larger Δt.
  • Result: you're forced to take tiny steps long after the transient has vanished.

⚠️ Implicit methods for stiff problems

Backward Euler and Trapezoidal methods are unconditionally stable, so they can use larger time steps.

Superstability:

A method is superstable if lim[λΔt→-∞] |Q(λΔt)| < 1.

  • Backward Euler is superstable: initial perturbations in fast components decay quickly.
  • Trapezoidal is not superstable: |Q(λΔt)| → 1 as λΔt → -∞, so initial errors decay very slowly.

Example from excerpt (Figure 6.5): For a stiff problem with λ = -100 and Δt = 0.2:

  • Backward Euler: |Q(λΔt)| ≈ 0.048 — fast error decay.
  • Trapezoidal: |Q(λΔt)| ≈ 0.82 — slow error decay.

Trade-off: Implicit methods require solving a (possibly nonlinear) system at each step, increasing computational cost. The choice depends on the problem.
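The stiffness trade-off is visible on the simplest stiff model y' = −100y. With Δt = 0.1, Forward Euler violates its condition Δt ≤ −2/λ = 0.02 and diverges, while Backward Euler damps the solution as it should (the step count and step size are illustrative choices):

```python
# Sketch: y' = -100*y with dt = 0.1 violates Forward Euler's stability
# condition dt <= -2/lambda = 0.02, while Backward Euler remains stable.
lam, dt, n = -100.0, 0.1, 50

w_fe, w_be = 1.0, 1.0
for _ in range(n):
    w_fe = w_fe + dt * lam * w_fe    # Q = 1 + lam*dt = -9: amplifies each step
    w_be = w_be / (1 - lam * dt)     # Q = 1/(1 - lam*dt) = 1/11: damps each step

print(abs(w_fe))   # explodes: |Q|^50 = 9^50
print(abs(w_be))   # decays toward 0, like the exact solution e^(-100 t)
```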


Summary: Elementary single-step methods provide a foundation for numerical time integration. Explicit methods (Forward Euler, Modified Euler) are simple but conditionally stable; implicit methods (Backward Euler, Trapezoidal) are unconditionally stable but require equation-solving. Higher-order methods (RK4) achieve better accuracy with fewer steps. Understanding local vs global errors, stability conditions, and the role of the amplification factor is essential for choosing and applying these methods effectively.


6.4 Analysis of numerical time-integration methods

🧭 Overview

🧠 One-sentence thesis

Numerical methods for solving initial-value problems converge to the true solution when they are both stable (perturbations remain bounded) and consistent (local errors vanish as the time step shrinks), with the global error having the same order as the local truncation error.

📌 Key points (3–5)

  • Stability vs consistency: stability controls how perturbations propagate; consistency measures how well the method approximates the differential equation locally—both are needed for convergence.
  • Amplification factor Q determines stability: a method is stable if |Q(λΔt)| ≤ 1, which may impose restrictions on the time step Δt (conditional stability) or hold for all Δt (unconditional stability).
  • Local vs global truncation error: the local truncation error τ is the new error introduced in one step; the global error e accumulates over all steps, but Lax's theorem guarantees they have the same order when the method is stable and consistent.
  • Common confusion—conditional vs unconditional stability: Forward Euler requires Δt ≤ −2/λ (conditional), whereas Backward Euler and Trapezoidal methods are stable for any Δt > 0 (unconditional) when λ ≤ 0.
  • Why it matters: higher-order methods (e.g., Modified Euler with O(Δt²) error) achieve better accuracy than first-order methods (Forward/Backward Euler with O(Δt) error) for the same time step, reducing computational cost for a target accuracy.

🔍 Stability concepts

🔍 Analytical stability of the initial-value problem

An initial-value problem is called stable if |ε(t)| < ∞ for all t > t₀, and absolutely stable if it is stable and lim(t→∞) |ε(t)| = 0, where ε(t) = ỹ(t) − y(t) is the difference between the perturbed solution ỹ (with initial condition y₀ + ε₀) and the unperturbed solution y.

  • Stability means a small change in the initial condition does not blow up over time.
  • Absolute stability means the perturbation eventually dies out.
  • If |ε(t)| is unbounded, the problem is unstable.
  • Important: stability of the continuous problem does not guarantee stability of the numerical approximation—that depends on the method and time step.

🧪 The test equation

The excerpt uses the linear test equation

y′ = λy + g(t),  y(t₀) = y₀

to analyze stability.

  • The perturbed problem has initial condition y₀ + ε₀.
  • Subtracting the two problems gives the error equation: ε′ = λε, ε(t₀) = ε₀.
  • Solution: ε(t) = ε₀ exp(λ(t − t₀)).
  • The test equation is stable if λ ≤ 0 and absolutely stable if λ < 0.
  • This simple model reveals how perturbations grow or decay, guiding the analysis of numerical methods.

⚙️ Numerical stability and the amplification factor

When a numerical method is applied to the test equation, the error propagates as

ε̄ₙ₊₁ = Q(λΔt) ε̄ₙ,

where Q(λΔt) is the amplification factor.

  • Q tells you by what factor an existing perturbation is multiplied at each step.
  • Recursively, ε̄ₙ = (Q(λΔt))ⁿ ε̄₀.
  • Stability condition: |Q(λΔt)| ≤ 1 (perturbations do not grow without bound).
  • Absolute stability: |Q(λΔt)| < 1 (perturbations decay).
  • Instability: |Q(λΔt)| > 1 (perturbations blow up).
| Method | Amplification factor Q(λΔt) |
| --- | --- |
| Forward Euler | 1 + λΔt |
| Backward Euler | 1 / (1 − λΔt) |
| Trapezoidal | (1 + ½λΔt) / (1 − ½λΔt) |
| Modified Euler | 1 + λΔt + ½(λΔt)² |
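The four factors can be written as one-line functions and tested against the stability condition |Q(λΔt)| ≤ 1; a sketch with an illustrative λ = −5 and two step sizes:

```python
Q_forward = lambda x: 1 + x                       # Forward Euler
Q_backward = lambda x: 1 / (1 - x)                # Backward Euler
Q_trapezoidal = lambda x: (1 + x / 2) / (1 - x / 2)  # Trapezoidal
Q_modified = lambda x: 1 + x + x**2 / 2           # Modified Euler

# Illustrative stability check for lambda = -5:
for dt in (0.1, 1.0):
    x = -5 * dt
    stable = {name: abs(Q(x)) <= 1 for name, Q in
              [("FE", Q_forward), ("BE", Q_backward),
               ("TR", Q_trapezoidal), ("ME", Q_modified)]}
    print(dt, stable)  # dt = 1.0 exceeds -2/lambda: FE and ME go unstable
```

The run illustrates conditional vs unconditional stability: for Δt = 1.0 > −2/λ = 0.4, only the implicit methods remain stable.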

🔒 Conditional vs unconditional stability

Forward Euler (λ < 0):

  • Stability requires −1 ≤ 1 + λΔt ≤ 1, which gives −2 ≤ λΔt ≤ 0.
  • Since Δt > 0 and λ ≤ 0, the right inequality is automatic.
  • The left inequality gives Δt ≤ −2/λ (strict inequality for absolute stability).
  • This is conditional stability: the time step must be small enough.

Backward Euler (λ ≤ 0):

  • Stability requires −1 ≤ 1/(1 − λΔt) ≤ 1.
  • Because λ ≤ 0, the denominator 1 − λΔt ≥ 1, so the condition is satisfied for all Δt ≥ 0.
  • This is unconditional stability: no restriction on Δt.

Modified Euler (λ ≤ 0):

  • Analysis (not shown in detail) gives Δt ≤ −2/λ, same as Forward Euler—conditional stability.

Trapezoidal method (λ ≤ 0):

  • Unconditionally stable for all Δt ≥ 0.

Don't confuse: "unconditional" does not mean "always accurate"—it means stable for any Δt; accuracy still improves as Δt shrinks.

🌐 Stability of nonlinear problems

For a general nonlinear problem y′ = f(t, y), the excerpt linearizes around a point (t̂, ŷ):

f(t, y) ≈ f(t̂, ŷ) + (y − ŷ) ∂f/∂y(t̂, ŷ) + (t − t̂) ∂f/∂t(t̂, ŷ).

This gives an approximate linear problem y′ = λy + g(t), where

λ = ∂f/∂y(t̂, ŷ).

  • The stability condition |Q(λΔt)| ≤ 1 now depends on the local value of ∂f/∂y.
  • Example (water discharge): y′ = −10y² + 20, y(0) = 0. Here λ = ∂f/∂y = −20ŷ. Since y(t) ∈ (0, √2), λ < 0 (absolutely stable analytically). For Forward Euler the bound Δt ≤ −2/λ = 1/(10ŷ) is tightest at ŷ = √2, so choosing Δt ≤ 1/(10√2) guarantees numerical stability over the whole computation.
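This can be tried numerically: with Δt = 0.05, below the bound 1/(10√2) ≈ 0.071, Forward Euler settles on the steady state √2. A sketch (the step count is illustrative):

```python
import math

def forward_euler(f, y0, dt, n):
    """Apply n Forward Euler steps to the autonomous equation y' = f(y)."""
    y = y0
    for _ in range(n):
        y = y + dt * f(y)
    return y

f = lambda y: -10 * y**2 + 20   # water-discharge right-hand side
dt = 0.05                        # below the stability bound 1/(10*sqrt(2))
y_end = forward_euler(f, 0.0, dt, 200)
print(y_end)                     # close to the steady state sqrt(2) ~ 1.4142
```

With a step size above the bound, the same iteration oscillates with growing amplitude instead of converging.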

📏 Local truncation error

📏 Definition and meaning

The local truncation error at time step n+1, τₙ₊₁, is defined as

τₙ₊₁ = (yₙ₊₁ − zₙ₊₁) / Δt,

where yₙ₊₁ is the true solution at time step n+1 and zₙ₊₁ is the numerical approximation obtained by applying the method with starting point yₙ (the exact solution at step n).

  • τ measures the error introduced in one step, assuming the previous value was exact.
  • It isolates the discretization error of the method itself.
  • The factor 1/Δt normalizes the error per unit time.

🔢 Forward Euler local error

Starting from yₙ, Forward Euler gives zₙ₊₁ = yₙ + Δt f(tₙ, yₙ).

Taylor expansion of the true solution:

yₙ₊₁ = yₙ + Δt y′ₙ + (Δt²/2) y″(ξ),  ξ ∈ (tₙ, tₙ₊₁).

Since y′ = f(t, y), we have y′ₙ = f(tₙ, yₙ). Substituting:

τₙ₊₁ = [yₙ + Δt y′ₙ + (Δt²/2) y″(ξ) − (yₙ + Δt y′ₙ)] / Δt = (Δt/2) y″(ξ).

  • Order: τₙ₊₁ = O(Δt), so Forward Euler is a first-order method.
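The formula τₙ₊₁ = (Δt/2) y″(ξ) can be checked numerically on y′ = y, y(0) = 1, whose exact solution is eᵗ (so y″(0) = 1). A sketch:

```python
import math

def local_error_fe(t, dt):
    """tau = (y(t+dt) - z) / dt for y' = y, starting the step from the exact value."""
    y_n = math.exp(t)               # exact solution y = exp(t)
    z = y_n + dt * y_n              # one Forward Euler step from y_n
    return (math.exp(t + dt) - z) / dt

tau1 = local_error_fe(0.0, 0.01)
tau2 = local_error_fe(0.0, 0.005)
print(tau1 / tau2)      # ~2: halving dt halves tau, confirming first order
print(2 * tau1 / 0.01)  # ~1 = y''(0): matches tau ~ (dt/2) y''(xi)
```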

🔢 Modified Euler local error

Modified Euler is a predictor–corrector method:

z*ₙ₊₁ = yₙ + Δt f(tₙ, yₙ),
zₙ₊₁ = yₙ + (Δt/2)[f(tₙ, yₙ) + f(tₙ₊₁, z*ₙ₊₁)].

Using a two-dimensional Taylor expansion of f(tₙ₊₁, z*ₙ₊₁) around (tₙ, yₙ) and simplifying (details in the excerpt), the result is

zₙ₊₁ = yₙ + Δt y′ₙ + (Δt²/2) y″ₙ + O(Δt³).

The Taylor expansion of the true solution is

yₙ₊₁ = yₙ + Δt y′ₙ + (Δt²/2) y″ₙ + O(Δt³).

The first three terms match, so

τₙ₊₁ = (yₙ₊₁ − zₙ₊₁) / Δt = O(Δt²).

  • Order: Modified Euler is a second-order method, more accurate than Forward Euler.

🧪 Local error for the test equation

For the test equation y′ = λy, the solution is y(t) = y₀ exp(λt), so

yₙ₊₁ = yₙ exp(λΔt).

Applying the method gives zₙ₊₁ = Q(λΔt) yₙ. Thus

τₙ₊₁ = [exp(λΔt) − Q(λΔt)] / Δt · yₙ.

  • The size of τ is determined by how well Q approximates the exponential.
  • Taylor expansion of exp(λΔt) = 1 + λΔt + ½(λΔt)² + (1/6)(λΔt)³ + …
  • Forward Euler: Q = 1 + λΔt → τ = O(Δt).
  • Modified Euler: Q = 1 + λΔt + ½(λΔt)² → τ = O(Δt²).
  • Backward Euler: Q = 1/(1 − λΔt) = 1 + λΔt + (λΔt)² + … → τ = O(Δt).
  • Trapezoidal: τ = O(Δt²) (exercise 3 in the excerpt).
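These orders can be confirmed numerically: since exp(λΔt) − Q(λΔt) = O(Δt^(p+1)), halving Δt should shrink the difference by about 2^(p+1), i.e. a factor 4 for Forward Euler and 8 for Modified Euler. A sketch:

```python
import math

lam = -1.0
Q_fe = lambda x: 1 + x               # Forward Euler
Q_me = lambda x: 1 + x + x**2 / 2    # Modified Euler

def diff(Q, dt):
    """Gap between the exact and numerical amplification over one step."""
    return abs(math.exp(lam * dt) - Q(lam * dt))

r_fe = diff(Q_fe, 0.01) / diff(Q_fe, 0.005)
r_me = diff(Q_me, 0.01) / diff(Q_me, 0.005)
print(r_fe)  # ~4 = 2**2, so exp - Q = O(dt**2) and tau = O(dt)
print(r_me)  # ~8 = 2**3, so exp - Q = O(dt**3) and tau = O(dt**2)
```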

🌍 Global truncation error and convergence

🌍 Global error definition

The global truncation error eₙ₊₁ at time tₙ₊₁ is defined as

eₙ₊₁ = yₙ₊₁ − wₙ₊₁,

where yₙ₊₁ is the true solution and wₙ₊₁ is the numerical approximation (computed from the numerical approximation wₙ at the previous step, not the exact yₙ).

  • e measures the cumulative error after n+1 steps.
  • It includes both the propagation of previous errors and the new local error at each step.

🔗 Consistency and convergence

Consistency: A method is consistent if lim(Δt→0) τₙ₊₁(Δt) = 0, where (n+1)Δt = T (fixed final time). A method is consistent of order p if τₙ₊₁ = O(Δtᵖ).

Convergence: A method is convergent if lim(Δt→0) eₙ₊₁ = 0, where (n+1)Δt = T. A method is convergent of order p if eₙ₊₁ = O(Δtᵖ).

  • Consistency means the method approximates the differential equation well in one step (as Δt→0).
  • Convergence means the numerical solution approaches the true solution (as Δt→0).
  • Don't confuse: consistency is a local property (one step); convergence is a global property (many steps).

🏆 Lax's equivalence theorem

Theorem (Lax): If a numerical method is stable and consistent, then the numerical approximation converges to the solution for Δt → 0. Moreover, the global truncation error and the local truncation error are of the same order.

Proof sketch (for the test equation y′ = λy):

  • Start with eₙ₊₁ = yₙ₊₁ − wₙ₊₁.
  • Since wₙ₊₁ = Q(λΔt) wₙ and wₙ = yₙ − eₙ, we have
    eₙ₊₁ = yₙ₊₁ − Q(λΔt) yₙ + Q(λΔt) eₙ.
    
  • Using the definition of τₙ₊₁, yₙ₊₁ − Q(λΔt) yₙ = Δt τₙ₊₁, so
    eₙ₊₁ = Δt τₙ₊₁ + Q(λΔt) eₙ.
    
  • Repeating this recursively:
    eₙ₊₁ = Σ(ℓ=0 to n) [Q(λΔt)]^ℓ Δt τₙ₊₁₋ℓ.
    
  • Stability (|Q| ≤ 1) and (n+1)Δt = T give
    |eₙ₊₁| ≤ Σ(ℓ=0 to n) Δt |τₙ₊₁₋ℓ| ≤ T · max|τℓ|.
    
  • Since the method is consistent, max|τℓ| → 0 as Δt → 0, so eₙ₊₁ → 0 (convergence).
  • The bound shows that the order of e equals the order of τ.

Key insight: stability prevents errors from growing exponentially; consistency ensures each step's error is small; together they guarantee global convergence with the same order as the local error.
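Lax's conclusion can be observed on the test equation with λ = −1: the global error at T = 1 halves when Δt is halved, matching Forward Euler's first-order local error. A sketch:

```python
import math

def fe_solve(lam, y0, T, n):
    """n Forward Euler steps for y' = lam*y on [0, T]."""
    dt, w = T / n, y0
    for _ in range(n):
        w = w + dt * lam * w
    return w

exact = math.exp(-1.0)                             # y(1) for y' = -y, y(0) = 1
e1 = abs(exact - fe_solve(-1.0, 1.0, 1.0, 100))    # dt = 0.01
e2 = abs(exact - fe_solve(-1.0, 1.0, 1.0, 200))    # dt = 0.005
print(e1 / e2)  # ~2: the global error is O(dt), same order as the local error
```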

📊 Summary table

| Method | Amplification Q | Stability (λ ≤ 0) | Local error order | Global error order (if stable) |
| --- | --- | --- | --- | --- |
| Forward Euler | 1 + λΔt | Conditional: Δt ≤ −2/λ | O(Δt) | O(Δt) |
| Backward Euler | 1/(1 − λΔt) | Unconditional | O(Δt) | O(Δt) |
| Modified Euler | 1 + λΔt + ½(λΔt)² | Conditional: Δt ≤ −2/λ | O(Δt²) | O(Δt²) |
| Trapezoidal | (1 + ½λΔt)/(1 − ½λΔt) | Unconditional | O(Δt²) | O(Δt²) |

🚀 Practical implications

🚀 Why higher-order methods matter

  • Accuracy vs cost: a second-order method (Modified Euler, Trapezoidal) achieves O(Δt²) error, so halving Δt reduces error by a factor of 4, whereas a first-order method (Forward/Backward Euler) only reduces error by a factor of 2.
  • For a target accuracy ε, a second-order method can use a larger Δt (fewer steps), reducing computational cost.
  • Example: to achieve error ≈ 10⁻⁴, Forward Euler might need Δt ≈ 10⁻⁴ (10,000 steps over T=1), while Modified Euler might need Δt ≈ 10⁻² (100 steps).

🚀 Trade-offs

  • Explicit vs implicit: Forward and Modified Euler are explicit (easy to compute wₙ₊₁ directly); Backward Euler and Trapezoidal are implicit (require solving an equation for wₙ₊₁ at each step, more expensive per step but often more stable).
  • Conditional vs unconditional stability: unconditional methods (Backward Euler, Trapezoidal) allow larger Δt for stiff problems (where λ is very negative), avoiding tiny time steps imposed by stability constraints.

Don't confuse: a method can be high-order and conditionally stable (Modified Euler) or low-order and unconditionally stable (Backward Euler)—order and stability are independent properties.


6.5 Higher-order methods

🧭 Overview

🧠 One-sentence thesis

Higher-order numerical methods achieve the same accuracy with fewer function evaluations than lower-order methods, making them more efficient when high accuracy is required for sufficiently smooth solutions.

📌 Key points (3–5)

  • Why higher order matters: Higher-order methods allow larger time steps for the same accuracy, reducing the total number of function evaluations needed.
  • Trade-off per step vs total work: Although higher-order methods need more function evaluations per time step (e.g., RK4 needs 4, Forward Euler needs 1), the total work is often less for a given accuracy.
  • The RK4 method: A fourth-order explicit method based on Simpson's rule that combines attractive stability properties with high accuracy.
  • Common confusion: Higher order is advantageous only when high accuracy is required and the solution is sufficiently smooth; the benefit grows dramatically as accuracy requirements increase.
  • Stability constraint: Even explicit higher-order methods like RK4 have stability limits (Δt ≤ −2.8/λ for λ < 0), though RK4's stability region is more favorable than simpler explicit methods.

📊 Efficiency gains from higher order

📊 Function evaluation comparison

The excerpt illustrates efficiency through a thought experiment comparing Forward Euler (order 1), Modified Euler (order 2), and RK4 (order 4) when integrating from t₀ = 0 to T = 1.

Assumptions:

  • Forward Euler: global error |eₙ| ≤ Δt, 1 function evaluation per step
  • Modified Euler: global error |eₙ| ≤ (Δt)², 2 evaluations per step
  • RK4: global error |eₙ| ≤ (Δt)⁴, 4 evaluations per step
| Desired accuracy | Forward Euler Δt | FE # evaluations | Modified Euler Δt | ME # evaluations | RK4 Δt | RK4 # evaluations |
| --- | --- | --- | --- | --- | --- | --- |
| 10⁻¹ | 10⁻¹ | 10 | 0.3 | 6 | ~0.56 | ~8 |
| 10⁻² | 10⁻² | 100 | 10⁻¹ | 20 | ~0.32 | ~16 |
| 10⁻⁴ | 10⁻⁴ | 10,000 | 10⁻² | 200 | 10⁻¹ | 40 |

🔍 Why the advantage grows

  • For accuracy 10⁻⁴, Forward Euler needs 10,000 evaluations vs. only 200 for Modified Euler—a 50× reduction.
  • The gap widens further for RK4 at very high accuracies because the time step can be much larger (proportional to the fourth root of the error tolerance).
  • Key insight: The higher the order p, the larger the admissible Δt for a fixed accuracy, so fewer steps are needed overall.

Don't confuse: "More evaluations per step" with "more total work." Higher-order methods do more work per step but take far fewer steps, so total work decreases when accuracy demands are high.

🚀 The fourth-order Runge-Kutta (RK4) method

🧮 The RK4 formula

The RK4 method updates the solution from wₙ to wₙ₊₁ by:

wₙ₊₁ = wₙ + (1/6)(k₁ + 2k₂ + 2k₃ + k₄)

where the four predictors are:

  • k₁ = Δt · f(tₙ, wₙ)
  • k₂ = Δt · f(tₙ + Δt/2, wₙ + k₁/2)
  • k₃ = Δt · f(tₙ + Δt/2, wₙ + k₂/2)
  • k₄ = Δt · f(tₙ + Δt, wₙ + k₃)

How it works:

  • k₁ evaluates the slope at the start of the interval.
  • k₂ and k₃ evaluate slopes at the midpoint using two different predictions.
  • k₄ evaluates the slope at the end using the k₃ prediction.
  • The weighted average (1/6)(k₁ + 2k₂ + 2k₃ + k₄) combines these slopes.
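A minimal sketch of one RK4 step (the function name is illustrative):

```python
def rk4_step(f, t, w, dt):
    """One classical RK4 step for y' = f(t, y)."""
    k1 = dt * f(t, w)                       # slope at the start
    k2 = dt * f(t + dt / 2, w + k1 / 2)     # first midpoint prediction
    k3 = dt * f(t + dt / 2, w + k2 / 2)     # second midpoint prediction
    k4 = dt * f(t + dt, w + k3)             # slope at the end
    return w + (k1 + 2 * k2 + 2 * k3 + k4) / 6

# y' = -y, y(0) = 1: one step of size 0.1 vs the exact value exp(-0.1)
w1 = rk4_step(lambda t, y: -y, 0.0, 1.0, 0.1)
print(w1)  # 0.9048375, exact is 0.904837418...
```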

📐 Connection to Simpson's rule

The corrector formula is based on Simpson's rule for numerical integration:

y(tₙ₊₁) = y(tₙ) + ∫[tₙ to tₙ₊₁] f(t, y(t)) dt
≈ y(tₙ) + (Δt/6)[f(tₙ, y(tₙ)) + 4f(tₙ + Δt/2, y(tₙ + Δt/2)) + f(tₙ + Δt, y(tₙ + Δt))]

  • The RK4 method approximates y(tₙ) by wₙ.
  • It predicts y(tₙ + Δt/2) using both wₙ + k₁/2 and wₙ + k₂/2.
  • It predicts y(tₙ + Δt) using wₙ + k₃.

Example: To integrate y' = f(t, y) over one time step, RK4 samples the derivative at the start, twice at the midpoint (with different predictions), and once at the end, then combines them with Simpson's weights.

🎯 Accuracy and truncation error

🎯 Local truncation error of RK4

For the test equation y' = λy, the amplification factor is:

Q(λΔt) = 1 + λΔt + (1/2)(λΔt)² + (1/6)(λΔt)³ + (1/24)(λΔt)⁴

This is exactly the first five terms of the Taylor series for e^(λΔt).

Computing the local truncation error:

  • Define zₙ₊₁ by replacing wₙ with yₙ in the RK4 formula: zₙ₊₁ = Q(λΔt)yₙ.
  • The exact solution satisfies yₙ₊₁ = e^(λΔt)yₙ.
  • The difference is yₙ₊₁ − zₙ₊₁ = (e^(λΔt) − Q(λΔt))yₙ.
  • Because Q matches the first five terms of the exponential's Taylor series, only the 5th and higher powers of Δt remain.
  • Dividing by Δt shows the local truncation error is O((Δt)⁴).

General case: For general initial-value problems (not just the test equation), the RK4 method is also fourth order.
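That the remainder begins with the fifth power can be checked numerically: (e^x − Q(x))/x⁵ should be close to the next Taylor coefficient 1/120 for small x. A sketch:

```python
import math

def Q_rk4(x):
    """RK4 amplification factor: first five Taylor terms of exp(x)."""
    return 1 + x + x**2 / 2 + x**3 / 6 + x**4 / 24

r = (math.exp(0.1) - Q_rk4(0.1)) / 0.1**5
print(r)  # ~0.0085, close to 1/120 ~ 0.00833, the x**5 Taylor coefficient
```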

📏 Order of convergence

From Lax's equivalence theorem (section 6.4), if a method is stable and consistent, then:

  • The global truncation error and local truncation error are of the same order.
  • For RK4, both are O((Δt)⁴).

Don't confuse: Local truncation error (error in one step assuming the previous value is exact) with global truncation error (accumulated error after many steps). The theorem guarantees they have the same order for stable, consistent methods.

🛡️ Stability properties of RK4

🛡️ Stability analysis

For the test equation y' = λy, stability requires |Q(λΔt)| ≤ 1.

Case λ > 0 (exponential growth):

  • For any Δt > 0, Q(λΔt) > 1 (because all terms are positive).
  • Exponential error growth is inevitable; the method is unstable for growing solutions.

Case λ < 0 (exponential decay):

  • Stability requires: −1 ≤ 1 + x + (1/2)x² + (1/6)x³ + (1/24)x⁴ ≤ 1, where x = λΔt.

🔒 Deriving the stability bound

Left inequality (checking if Q ≥ −1):

  • Rearrange to: 2 + x + (1/2)x² + (1/6)x³ + (1/24)x⁴ ≥ 0.
  • Call this polynomial P(x). To check if P(x) ≥ 0 for all x, find its extremes.
  • Extremes occur where P'(x) = 1 + x + (1/2)x² + (1/6)x³ = 0.
  • At any extreme x̃, P(x̃) = 2 + x̃ + (1/2)x̃² + (1/6)x̃³ + (1/24)x̃⁴ = 1 + P'(x̃) + (1/24)x̃⁴ = 1 + (1/24)x̃⁴ > 0.
  • Since all extremes are positive, the minimum is positive, so P(x) > 0 for all x.
  • Conclusion: The left inequality is always satisfied; it does not restrict Δt.

Right inequality (checking if Q ≤ 1):

  • Rearrange to: x(1 + (1/2)x + (1/6)x² + (1/24)x³) ≤ 0.
  • Since x = λΔt < 0 (λ < 0, Δt > 0), this is equivalent to: 1 + (1/2)x + (1/6)x² + (1/24)x³ ≥ 0.
  • This polynomial has only one zero at x ≈ −2.8.
  • It is negative for x < −2.8 and positive for x > −2.8.
  • For stability, need x = λΔt ≥ −2.8.

Stability condition:

Δt ≤ −2.8/λ (for λ < 0)

Example: If λ = −1, then Δt ≤ 2.8. If λ = −10, then Δt ≤ 0.28. The stiffer the problem (larger |λ|), the smaller the required time step.
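The zero near −2.8 can be located by bisection on this cubic factor; a sketch:

```python
def p(x):
    # Cubic factor from the right inequality: 1 + x/2 + x**2/6 + x**3/24
    return 1 + x / 2 + x**2 / 6 + x**3 / 24

# p(-3) < 0 < p(-2), so the zero lies in (-3, -2); bisect to locate it.
a, b = -3.0, -2.0
for _ in range(60):
    m = (a + b) / 2
    if p(a) * p(m) <= 0:
        b = m
    else:
        a = m
root = (a + b) / 2
print(root)  # ~ -2.785, consistent with the quoted bound dt <= -2.8/lambda
```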

⚖️ Attractive stability despite being explicit

The excerpt notes that RK4 has "attractive stability properties despite its explicit nature."

  • Explicit methods generally have more restrictive stability limits than implicit methods.
  • RK4's stability bound (Δt ≤ 2.8/|λ|) is more generous than simpler explicit methods like Forward Euler (Δt ≤ 2/|λ|).
  • Combined with fourth-order accuracy, this makes RK4 a practical choice for many problems.

📋 Summary table of methods

The excerpt references Table 6.3 (not fully shown) that presents an overview of stability conditions and local truncation errors for all discussed methods:

| Method | Order | Evaluations/step | Local truncation error | Stability (λ < 0) |
| --- | --- | --- | --- | --- |
| Forward Euler | 1 | 1 | O(Δt) | Conditional: Δt ≤ −2/λ |
| Backward Euler | 1 | 1 (implicit) | O(Δt) | Unconditionally stable |
| Trapezoidal | 2 | 1 (implicit) | O((Δt)²) | Unconditionally stable |
| Modified Euler | 2 | 2 | O((Δt)²) | Conditional: Δt ≤ −2/λ (see §6.4) |
| RK4 | 4 | 4 | O((Δt)⁴) | Conditional: Δt ≤ −2.8/λ |

Key takeaway: Higher-order methods are preferred when high accuracy is required, provided the solution is sufficiently smooth.


6.6 Global truncation error and Richardson error estimates

🧭 Overview

🧠 One-sentence thesis

Richardson extrapolation enables numerical estimation of the global truncation error by comparing solutions computed with different time steps, allowing adaptive time-stepping to achieve desired accuracy without knowing the exact solution.

📌 Key points (3–5)

  • Global truncation error scales with time step: for sufficiently small time steps, the error behaves as c_p(t) times (delta t)^p, where p is the method's order.
  • Richardson extrapolation estimates error: by computing solutions at two different step sizes (delta t and 2 delta t), the global truncation error can be estimated without knowing the exact solution.
  • Verification when order p is unknown: computing a third solution at step size 4 delta t allows checking whether delta t is small enough for the error estimate to be accurate.
  • Common confusion: the global truncation error estimate only works when delta t is "sufficiently small"—higher-order terms must be negligible; the ratio test (equation 6.48) reveals whether this condition holds.
  • Adaptive time-stepping: repeatedly halving delta t until the error estimate satisfies a threshold enables automatic control of accuracy in practical problems.

📐 Global truncation error structure

📐 Order-based error formula

For sufficiently small values of delta t, the global truncation error can be approximated by e(t, delta t) = c_p(t) (delta t)^p.

  • p is the method's order: p = 1 for Forward Euler and Backward Euler; p = 2 for Trapezoidal and Modified Euler; p = 4 for RK4.
  • c_p(t) is an unknown coefficient that depends on time t but not on delta t.
  • The formula assumes higher-order terms in delta t are negligible—this is only true when delta t is small enough.
  • Example: if you halve delta t in a second-order method (p = 2), the error should drop by a factor of 4.

🔍 Why "sufficiently small" matters

  • The excerpt repeatedly emphasizes that the approximation e(t, delta t) = c_p(t) (delta t)^p holds only for sufficiently small delta t.
  • If delta t is too large, higher-order terms (like (delta t)^(p+1), (delta t)^(p+2), etc.) are not negligible, and the simple power-law breaks down.
  • The practical consequence: Richardson extrapolation will give inaccurate error estimates if delta t is not small enough.
  • Don't confuse: "small delta t" is relative to the problem—what counts as "small enough" depends on the solution's behavior and must be verified numerically.

🧮 Richardson extrapolation when p is known

🧮 Two-step-size error estimate

The method uses two numerical approximations:

  • w^(delta t)_N: solution at time t using N steps of length delta t.
  • w^(2 delta t)_(N/2): solution at the same time t using N/2 steps of length 2 delta t.

From the error formula:

  • e(t, 2 delta t) = y(t) - w^(2 delta t)_(N/2) = c_p(t) (2 delta t)^p
  • e(t, delta t) = y(t) - w^(delta t)_N = c_p(t) (delta t)^p

Subtracting these two equations gives:

  • w^(delta t)_N - w^(2 delta t)_(N/2) = c_p(t) (delta t)^p (2^p - 1)

Solving for c_p(t):

  • c_p(t) = [w^(delta t)_N - w^(2 delta t)_(N/2)] / [(delta t)^p (2^p - 1)]

📊 Final error estimate formula

Substituting c_p(t) back into the error formula yields:

  • y(t) - w^(delta t)_N = [w^(delta t)_N - w^(2 delta t)_(N/2)] / (2^p - 1)

Key insight: the global truncation error for the finer-step solution is proportional to the difference between the two solutions, scaled by 1/(2^p - 1).

| Method order p | Factor 1/(2^p - 1) | Interpretation |
| --- | --- | --- |
| p = 1 | 1/1 = 1 | Error equals the difference |
| p = 2 | 1/3 ≈ 0.33 | Error is one-third the difference |
| p = 4 | 1/15 ≈ 0.067 | Error is one-fifteenth the difference |

Example: if Forward Euler (p=1) gives w^(0.01)_20 = 4.543 and w^(0.02)_10 = 4.560, the estimated error is (4.543 - 4.560)/1 = -0.017.

🔬 Richardson extrapolation when p is unknown

🔬 Three-step-size verification

When the method's order p is unknown or when you want to verify that delta t is small enough, compute a third approximation:

  • w^(4 delta t)_(N/4): solution using N/4 steps of length 4 delta t.

The ratio test (equation 6.48):

  • [w^(2 delta t)_(N/2) - w^(4 delta t)_(N/4)] / [w^(delta t)_N - w^(2 delta t)_(N/2)] = 2^p

✅ Interpreting the ratio

  • If the computed ratio is close to the expected 2^p (e.g., near 2 for p=1, near 4 for p=2, near 16 for p=4), then delta t is small enough and the error estimate is accurate.
  • If the ratio deviates significantly from 2^p, delta t is too large—higher-order terms are not negligible—and you must halve delta t and recompute.
  • Example from Table 6.4: for delta t = 0.0025, the ratio is 1.924 (close to 2^1 = 2), but for delta t = 0.01, no ratio is available; by delta t = 0.0000195, the ratio is 1.9995 (very close to 2).

🔄 Adaptive time-stepping workflow

  1. Choose an initial delta t based on intuition (e.g., 10–20 points to visualize the solution).
  2. Compute solutions at delta t, 2 delta t, and 4 delta t.
  3. Check the ratio (equation 6.48): is it close to 2^p?
    • If no: halve delta t and repeat (reuse earlier approximations to save computation).
    • If yes: proceed to step 4.
  4. Compute the error estimate (equation 6.47): is it below the threshold epsilon?
    • If no: halve delta t and repeat.
    • If yes: accept the solution w^(delta t)_N.

Don't confuse: halving delta t means you can reuse two of the three previous solutions (the old delta t becomes the new 2 delta t, and the old 2 delta t becomes the new 4 delta t), so you only need one new integration per iteration.

📋 Example: water discharge problem

📋 Problem setup

The excerpt presents a modified water-discharge problem:

  • y' = 50 - 2 y^2.1, t > 0
  • y(0) = 0
  • No explicit solution is known.

Goal: approximate y(0.2) using Forward Euler and estimate the global truncation error via Richardson extrapolation.

📋 Numerical results (Table 6.4)

| delta t | w^(delta t)_N | Estimated error y(t) - w^(delta t)_N | Estimated 2^p |
| --- | --- | --- | --- |
| 0.01 | 4.5599 | — | — |
| 0.005 | 4.5431 | -0.0168 | — |
| 0.0025 | 4.5344 | -0.0087 | 1.924 |
| 0.00125 | 4.5299 | -0.0044 | 1.966 |
| 0.000625 | 4.5277 | -0.0022 | 1.984 |
| … | … | … | … |
| 0.0000195 | 4.5256 | -0.00007 | 1.9995 |

Observations:

  • As delta t decreases, the estimated 2^p converges to 2 (expected for p=1).
  • From delta t = 0.00125 onwards, the ratio is close to 2, indicating the linear error model holds.
  • The error estimate is approximately halved each time delta t is halved, confirming first-order behavior.
  • Example interpretation: at delta t = 0.00125, the estimated error is -0.0044, meaning the true solution is approximately 4.5299 + 0.0044 = 4.5343.

📋 Practical guidance

  • To visualize a solution like sine on (0, pi), about 10 to 20 points are required—this gives an initial guess for delta t.
  • Perform three integrations (delta t, 2 delta t, 4 delta t) and check equation 6.48.
  • If 2^p is not accurate enough, halve delta t (reusing two previous solutions).
  • Once 2^p is accurate, compute the error estimate (equation 6.47) and check against the threshold epsilon.
  • Repeat halving until the error estimate is acceptable—this is adaptive time-stepping.
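The guidance above can be sketched end-to-end for the modified discharge problem y' = 50 - 2 y^2.1 with Forward Euler (p = 1); the step sizes follow Table 6.4:

```python
def fe_solve(f, y0, dt, n):
    """n Forward Euler steps for the autonomous equation y' = f(y)."""
    y = y0
    for _ in range(n):
        y = y + dt * f(y)
    return y

f = lambda y: 50 - 2 * y**2.1   # modified water-discharge right-hand side
T, dt = 0.2, 0.00125

w1 = fe_solve(f, 0.0, dt, round(T / dt))             # step dt
w2 = fe_solve(f, 0.0, 2 * dt, round(T / (2 * dt)))   # step 2*dt
w4 = fe_solve(f, 0.0, 4 * dt, round(T / (4 * dt)))   # step 4*dt

ratio = (w2 - w4) / (w1 - w2)      # equation 6.48: should be close to 2**p = 2
estimate = (w1 - w2) / (2**1 - 1)  # equation 6.47: Richardson error estimate
print(ratio, estimate)             # ratio near 2, estimate near -0.004
```

If the ratio were far from 2, the sketch would halve dt and reuse w1 and w2 as the new coarse solutions.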

6.7 Numerical methods for systems of differential equations

🧭 Overview

🧠 One-sentence thesis

Numerical time-integration methods for systems of differential equations are straightforward vector-valued generalizations of their scalar counterparts, where each method applies the same structure to all components simultaneously.

📌 Key points (3–5)

  • From scalar to vector: Methods like Forward Euler, Backward Euler, Trapezoidal, Modified Euler, and RK4 extend naturally from single equations to systems by applying the same formula to all components.
  • System notation: A system of m unknown functions y₁, ..., yₘ can be written compactly in vector form as y' = f(t, y), where y and f are vectors.
  • Higher-order problems: A single mth-order differential equation can be transformed into a system of m first-order equations by defining new variables for each derivative.
  • Common confusion: The vector-valued methods look identical to scalar methods in notation, but each variable (w_j,n or component of w_n) represents a different unknown function, not a scalar value.
  • Implicit methods for systems: Backward Euler and Trapezoidal methods require solving a nonlinear system at each time step when applied to systems.

🔄 From scalar methods to systems

🔄 Vector notation for systems

A system of m differential equations: y'_j = f_j(t, y₁, ..., yₘ) for j = 1, ..., m, with initial conditions y_j(t₀) = y_j,0.

  • Instead of writing m separate equations, use vector notation: y = (y₁, ..., yₘ) and f = (f₁, ..., fₘ).
  • The vector-valued initial-value problem becomes: y' = f(t, y) with y(t₀) = y₀.
  • This compact notation makes the structure clearer and matches the scalar case visually.

🔄 Forward Euler for systems

The Forward Euler method applies to each component separately:

  • Component form: w_j,n+1 = w_j,n + Δt f_j(t_n, w₁,n, ..., wₘ,n) for j = 1, ..., m.
  • Vector form: w_n+1 = w_n + Δt f(t_n, w_n).
  • The excerpt emphasizes this is a "straightforward vector-valued generalization" of the scalar method.
  • Example: If you have 3 unknown functions, you update all 3 at once using the same time step and the same formula structure.

🧮 Other methods generalized to systems

🧮 Backward Euler method

  • Vector form: w_n+1 = w_n + Δt f(t_n+1, w_n+1).
  • This is an implicit method: the new value w_n+1 appears on both sides.
  • At each time step, you must solve a nonlinear system to find w_n+1.

🧮 Trapezoidal method

  • Vector form: w_n+1 = w_n + (Δt/2)(f(t_n, w_n) + f(t_n+1, w_n+1)).
  • Also implicit: requires solving for w_n+1 at each step.
  • Uses the average of the slopes at the current and next time points.

🧮 Modified Euler method

  • Two-stage explicit method:
    1. Predictor: w̄_n+1 = w_n + Δt f(t_n, w_n).
    2. Corrector: w_n+1 = w_n + (Δt/2)(f(t_n, w_n) + f(t_n+1, w̄_n+1)).
  • First compute a tentative next value, then refine it using the average slope.

🧮 RK4 method

The fourth-order Runge-Kutta method uses four stages:

  • k₁ = Δt f(t_n, w_n)
  • k₂ = Δt f(t_n + Δt/2, w_n + k₁/2)
  • k₃ = Δt f(t_n + Δt/2, w_n + k₂/2)
  • k₄ = Δt f(t_n + Δt, w_n + k₃)
  • Final update: w_n+1 = w_n + (k₁ + 2k₂ + 2k₃ + k₄)/6
  • All k values are now vectors, but the formula structure is identical to the scalar case.

🧮 Solving implicit methods

  • For Backward Euler and Trapezoidal methods, each time step requires solving a nonlinear system.
  • The excerpt refers to methods from Section 4.6 for finding these solutions.
  • Don't confuse: "implicit" means the unknown appears on both sides of the equation, not that the method is hidden or unclear.

🔺 Higher-order initial-value problems

🔺 What is a higher-order problem

A higher-order initial-value problem: relates the mth derivative of a function to its lower-order derivatives, written as x^(m) = f(t, x, x^(1), ..., x^(m-1)) with initial conditions for x and all derivatives up to order m-1.

  • x^(j) denotes the jth derivative of x with respect to t.
  • You have one equation involving the highest derivative, plus m initial conditions.
  • Example: A second-order equation involves x'', x', and x, with initial values for x(t₀) and x'(t₀).

🔺 Transformation to first-order systems

The key technique: define new variables for each derivative.

  • Let y₁ = x, y₂ = x^(1), ..., yₘ = x^(m-1).
  • Then y'₁ = y₂, y'₂ = y₃, ..., y'_{m-1} = yₘ.
  • The highest derivative becomes: y'ₘ = f(t, y₁, ..., yₘ).
  • Initial conditions: y₁(t₀) = x₀, ..., yₘ(t₀) = x^(m-1)₀.
  • This transforms a single mth-order equation into a system of m first-order equations.

🔺 Mathematical pendulum example

The angular displacement ψ satisfies: ψ'' + sin ψ = 0 with ψ(0) = ψ₀ and ψ'(0) = 0.

Transformation:

  • Define y₁ = ψ and y₂ = ψ'.
  • Then y'₁ = y₂ and y'₂ = -sin y₁.
  • Initial conditions: y₁(0) = ψ₀ and y₂(0) = 0.

Applying Forward Euler:

  • w₁,n+1 = w₁,n + Δt w₂,n
  • w₂,n+1 = w₂,n - Δt sin w₁,n
  • Vector form: w_n+1 = w_n + Δt (w₂,n, -sin w₁,n).
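A minimal sketch of this componentwise update, run in the small-angle regime where ψ(t) ≈ ψ₀ cos t (the step size and end time are illustrative):

```python
import math

def pendulum_fe(psi0, dt, n):
    """Forward Euler for psi'' + sin(psi) = 0 as a first-order system."""
    w1, w2 = psi0, 0.0   # w1 = psi, w2 = psi'
    for _ in range(n):
        # Simultaneous update: both right-hand sides use the old values.
        w1, w2 = w1 + dt * w2, w2 - dt * math.sin(w1)
    return w1

psi = pendulum_fe(0.1, 0.001, 1000)   # integrate to t = 1
print(psi, 0.1 * math.cos(1.0))       # close to the small-angle solution
```

The tuple assignment keeps the update explicit: using the already-updated w1 inside sin would change the scheme.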

🔺 Second-order linear example

Consider: a x'' + b x' + c x = g(t) with initial conditions x(t₀) = x₀ and x'(t₀) = x^(1)₀.

Transformation:

  • Define y₁ = x and y₂ = x'.
  • Then y'₁ = y₂ and y'₂ = -(b/a)y₂ - (c/a)y₁ + (1/a)g(t).

Matrix-vector form:

  • Define matrix A = [[0, 1], [-c/a, -b/a]], vector y = (y₁, y₂), and vector g(t) = (0, (1/a)g(t)).
  • The system becomes: y' = A y + g(t).
  • This compact form is useful for analysis and implementation.

📊 Comparison of methods for systems

| Method | Type | Vector form | Key feature |
| --- | --- | --- | --- |
| Forward Euler | Explicit | w_n+1 = w_n + Δt f(t_n, w_n) | Simplest; direct computation |
| Backward Euler | Implicit | w_n+1 = w_n + Δt f(t_n+1, w_n+1) | Requires solving a nonlinear system |
| Trapezoidal | Implicit | w_n+1 = w_n + (Δt/2)(f(t_n, w_n) + f(t_n+1, w_n+1)) | Average of slopes; implicit |
| Modified Euler | Explicit | Two-stage predictor-corrector | Refines estimate with average slope |
| RK4 | Explicit | Four-stage weighted average | Higher accuracy; more computation per step |

🔍 Stability for systems

🔍 Linear systems with constant coefficients

The excerpt introduces the simplified problem for stability analysis:

y' = A y + g(t) with y(t₀) = y₀, where y and g(t) are vectors of length m and A is the m × m coefficient matrix.

  • This is the direct generalization of the scalar case y' = λ y + g(t).
  • The coefficient matrix A replaces the scalar λ.
  • Stability analysis for systems relates to the scalar case through the properties of A.
  • The excerpt mentions "assuming perturbations only in the initial values" but does not elaborate further in the provided text.

6.8 Analytical and numerical stability for systems

🧭 Overview

🧠 One-sentence thesis

For systems of differential equations, both analytical and numerical stability depend on the eigenvalues of the coefficient matrix, with eigenvalues playing the same role as the scalar parameter λ in single-equation stability analysis.

📌 Key points (3–5)

  • Eigenvalues determine analytical stability: a system is analytically stable if and only if all eigenvalues of the coefficient matrix A have non-positive real parts (Re(λⱼ) ≤ 0).
  • Numerical stability uses scalar amplification factors: for systems, check |Q(λⱼΔt)| ≤ 1 for each eigenvalue λⱼ, where Q is the scalar amplification factor.
  • Complex eigenvalues require special handling: eigenvalues can be complex (λⱼ = μⱼ + iνⱼ), and the modulus (absolute value) must be computed for stability checks.
  • Common confusion—imaginary vs. negative real parts: purely imaginary eigenvalues (μⱼ = 0) cause Forward Euler to be always unstable, but the Trapezoidal method remains stable; don't confuse "stable" (Re ≤ 0) with "absolutely stable" (Re < 0).
  • Implicit methods are often unconditionally stable: Backward Euler and Trapezoidal methods are stable for any Δt when all eigenvalues have non-positive real parts, unlike explicit methods which impose strict time-step limits.

🔍 Analytical stability of systems

🔍 The test system and perturbations

The excerpt starts with a linear system with constant coefficient matrix:

Problem: y′ = Ay + g(t), t > t₀, y(t₀) = y₀

  • Here y and g(t) are vectors of length m, and A is an m×m coefficient matrix.
  • To study stability, the excerpt considers perturbations only in initial values: the perturbed problem is ỹ′ = Aỹ + g(t), ỹ(t₀) = y₀ + ε₀.
  • The difference ε = ỹ − y satisfies the homogeneous test system: ε′ = Aε, ε(t₀) = ε₀.

🧮 Eigenvalues as the key to stability

The excerpt explains that if A is diagonalizable, then A = SΛS⁻¹, where:

  • Λ = diag(λ₁, …, λₘ) contains the eigenvalues of A (which can be complex).
  • S is the eigenvector matrix (columns are right eigenvectors vⱼ of A).

By transforming ε = Sη, the system decouples into independent scalar equations:

  • ηⱼ = ηⱼ,₀ exp(λⱼ(t − t₀)), j = 1, …, m.
  • Writing λⱼ = μⱼ + iνⱼ (real and imaginary parts), we have |ηⱼ| = |ηⱼ,₀| exp(μⱼ(t − t₀)).
  • Because |exp(iνⱼ(t − t₀))| = 1, the growth or decay depends only on the real part μⱼ.

Analytical stability characterization:

  • All λⱼ with Re(λⱼ) < 0 → absolutely stable (solutions decay to zero).
  • All λⱼ with Re(λⱼ) ≤ 0 → stable (solutions remain bounded).
  • At least one λⱼ with Re(λⱼ) > 0 → unstable (solutions grow unbounded).

Example: For the second-order equation a x″ + b x′ + c x = 0, the characteristic equation aλ² + bλ + c = 0 gives eigenvalues λ₁,₂ = (−b ± √(b² − 4ac)) / (2a). The excerpt shows that if b² − 4ac < 0, the eigenvalues are complex.

Don't confuse: "Stable" (Re ≤ 0) allows eigenvalues on the imaginary axis (oscillatory but bounded), while "absolutely stable" (Re < 0) requires all solutions to decay.
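
The eigenvalue criterion can be checked numerically. A minimal sketch (the tolerance handling is my assumption, not from the excerpt):

```python
import numpy as np

def classify_stability(A, tol=1e-12):
    """Classify analytical stability of y' = A y from the real parts of the
    eigenvalues of A (a sketch; assumes A is diagonalizable)."""
    re = np.linalg.eigvals(np.asarray(A, dtype=float)).real
    if np.all(re < -tol):
        return "absolutely stable"
    if np.all(re <= tol):
        return "stable"
    return "unstable"
```

For instance, A = [[0, 1], [−1, 0]] has eigenvalues ±i (purely imaginary), so it is stable but not absolutely stable.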

🧮 Numerical stability for systems

🧮 Amplification matrix

Each numerical method applied to the test system ε′ = Aε yields an amplification matrix Q(ΔtA):

  • Forward Euler: Q(ΔtA) = I + ΔtA
  • Backward Euler: Q(ΔtA) = (I − ΔtA)⁻¹
  • Trapezoidal: Q(ΔtA) = (I − ½ΔtA)⁻¹ (I + ½ΔtA)
  • Modified Euler: Q(ΔtA) = I + ΔtA + (Δt²/2)A²
  • RK4: Q(ΔtA) = I + ΔtA + (Δt²/2)A² + (Δt³/6)A³ + (Δt⁴/24)A⁴

The approximation at time tₙ₊₁ is εₙ₊₁ = Q(ΔtA)εₙ.

🔑 Key theorem: eigenvalues of the amplification matrix

Theorem 6.8.1: If A is diagonalizable with eigenvalues λ₁, …, λₘ, then the amplification matrix Q(ΔtA) is also diagonalizable, and its eigenvalues are Q(λ₁Δt), …, Q(λₘΔt).

  • Proof sketch: Any polynomial P(ΔtA) = SP(ΔtΛ)S⁻¹, where P(ΔtΛ) = diag(P(λ₁Δt), …, P(λₘΔt)).
  • Each amplification matrix can be written as Q(ΔtA) = R(ΔtA)⁻¹P(ΔtA), where R and P are polynomials.
  • Therefore Q(ΔtA) = SQ(ΔtΛ)S⁻¹, where Q(ΔtΛ) = diag(Q(λ₁Δt), …, Q(λₘΔt)).

Numerical stability characterization:

  • All λⱼ with |Q(λⱼΔt)| < 1 → absolutely stable.
  • All λⱼ with |Q(λⱼΔt)| ≤ 1 → stable.
  • At least one λⱼ with |Q(λⱼΔt)| > 1 → unstable.

Key insight: For systems, use the scalar amplification factor Q(λΔt) evaluated at each eigenvalue λⱼ. The system is stable if and only if every eigenvalue satisfies the scalar stability condition.
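
This reduction to scalar amplification factors is easy to automate. A sketch for Forward Euler (the function names are illustrative; other methods only change the scalar function Q):

```python
import numpy as np

def forward_euler_Q(z):
    """Scalar amplification factor of Forward Euler: Q(z) = 1 + z."""
    return 1 + z

def system_abs_stable(A, dt, Q=forward_euler_Q):
    """Absolute stability check for a system: |Q(lambda_j * dt)| < 1 must
    hold for every eigenvalue lambda_j of A (assumes A diagonalizable)."""
    lam = np.linalg.eigvals(np.asarray(A, dtype=float))
    return bool(np.all(np.abs(Q(lam * dt)) < 1))
```

For A = diag(−2, −0.5), Forward Euler is absolutely stable with Δt = 0.5 but not with Δt = 1.5, since Q(−3) = −2.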

🔢 Computing modulus for complex eigenvalues

For complex λⱼ = μⱼ + iνⱼ, the modulus is computed as:

  • |1 + λⱼΔt| = √((1 + μⱼΔt)² + (νⱼΔt)²)

Example: Forward Euler is absolutely stable when (1 + μⱼΔt)² + (νⱼΔt)² < 1 for all j.

🎯 Stability conditions for specific methods

🎯 Forward Euler: conditional stability

For Forward Euler, absolute stability requires |1 + λⱼΔt| < 1 for all eigenvalues.

For complex eigenvalues λⱼ = μⱼ + iνⱼ:

  • The condition (1 + μⱼΔt)² + (νⱼΔt)² < 1 must hold.
  • This cannot be satisfied if μⱼ > 0 (positive real part).
  • The stability bound for Δt is: Δt < −2μⱼ / (μⱼ² + νⱼ²) for each eigenvalue.
  • Overall: Δt < min{−2μⱼ / (μⱼ² + νⱼ²)} over all j.

Critical limitation: For imaginary eigenvalues (μⱼ = 0), Forward Euler is always unstable, regardless of Δt.

Example: A system with oscillatory solutions (imaginary eigenvalues) cannot be integrated stably with Forward Euler.
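
The step-size bound above can be evaluated directly from the eigenvalues. A minimal sketch:

```python
def fe_step_bound(eigs):
    """Largest absolutely stable Forward Euler step for eigenvalues
    lambda_j = mu_j + i*nu_j: dt < min_j -2*mu_j / (mu_j**2 + nu_j**2).
    Returns 0.0 when some mu_j >= 0 (no absolutely stable step exists)."""
    bounds = []
    for lam in eigs:
        mu, nu = lam.real, lam.imag
        if mu >= 0:
            return 0.0  # purely imaginary or growing mode
        bounds.append(-2 * mu / (mu ** 2 + nu ** 2))
    return min(bounds)
```

Any eigenvalue on the imaginary axis makes the function return 0.0, reflecting the "always unstable" case noted above.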

🛡️ Backward Euler: unconditional stability

For Backward Euler, stability requires |Q(λⱼΔt)| = 1/|1 − λⱼΔt| ≤ 1, equivalent to |1 − λⱼΔt| ≥ 1.

For complex eigenvalues λⱼ = μⱼ + iνⱼ:

  • The condition is (1 − μⱼΔt)² + (νⱼΔt)² ≥ 1.
  • If all eigenvalues have non-positive real part (μⱼ ≤ 0), this is automatically satisfied for any Δt.
  • Therefore, Backward Euler is unconditionally stable for analytically stable systems.

Special case: For purely imaginary eigenvalues (μⱼ = 0), Backward Euler is absolutely stable since |Q(λⱼΔt)| < 1.

⚖️ Trapezoidal method: unconditional stability

For the Trapezoidal method:

  • |Q(λⱼΔt)| = |1 + ½λⱼΔt| / |1 − ½λⱼΔt|
  • = √((1 + ½μⱼΔt)² + (½νⱼΔt)²) / √((1 − ½μⱼΔt)² + (½νⱼΔt)²)

For eigenvalues with non-positive real part (μⱼ ≤ 0):

  • The numerator is always ≤ the denominator, so |Q(λⱼΔt)| ≤ 1.
  • The method is unconditionally stable for any Δt.

Special case: For purely imaginary eigenvalues (μⱼ = 0), |Q(λⱼΔt)| = 1, so the method is stable (but not absolutely stable).

Don't confuse: Backward Euler damps imaginary eigenvalues (|Q| < 1), while Trapezoidal preserves their amplitude (|Q| = 1). For oscillatory systems without damping, Trapezoidal is more physically accurate.

📊 Stability regions

📊 Graphical representation

The stability region of a method is the set of complex values λΔt for which |Q(λΔt)| ≤ 1.

Forward Euler stability region:

  • S_FE = {λΔt ∈ ℂ : |1 + λΔt| ≤ 1}
  • This is the disk with center (−1, 0) and radius 1 in the complex plane (the circle |1 + λΔt| = 1 is its boundary).
  • The region lies completely to the left of the imaginary axis and is tangent to it.

How to use stability regions graphically:

  1. Mark each eigenvalue λⱼ in the complex plane (assuming Δt = 1).
  2. If any marked point lies outside the stability region, reduce Δt so that λⱼΔt falls inside.
  3. Graphically, this means moving from the marked point toward the origin along a straight line.
  4. The smallest Δt needed for all eigenvalues determines the stability bound.

Observations from the excerpt:

  • Backward Euler and Trapezoidal: unconditionally stable when all eigenvalues have non-positive real parts.
  • Modified Euler: very similar stability region to Forward Euler; both cannot stably integrate systems with imaginary eigenvalues.
  • RK4: the only explicit method discussed that can be stable for imaginary eigenvalues, provided |λΔt| ≤ 2.8.

🔄 Comparison of methods

  • Forward Euler: always unstable for imaginary eigenvalues; conditionally stable (strict Δt limit).
  • Backward Euler: absolutely stable (damps); unconditionally stable.
  • Trapezoidal: stable (preserves amplitude); unconditionally stable.
  • Modified Euler: always unstable for imaginary eigenvalues; conditionally stable (similar to Forward Euler).
  • RK4: stable for imaginary eigenvalues if |λΔt| ≤ 2.8; conditionally stable (but more tolerant).

🌐 General nonlinear systems

🌐 Linearization via the Jacobian

For general nonlinear systems y′ = f(t, y), stability is determined by linearization about a point (t̂, ŷ).

Jacobian matrix: J|(t̂,ŷ) is the m×m matrix of partial derivatives ∂fᵢ/∂yⱼ evaluated at (t̂, ŷ).

  • The Jacobian plays the role of the coefficient matrix A in the linear case.
  • Eigenvalues of the Jacobian, λ₁(t̂, ŷ), …, λₘ(t̂, ŷ), depend on time and the approximation.
  • Stability properties therefore vary with time and the current approximation.

🧪 Example: Competing species

The excerpt gives a model for two bacterial species:

  • y₁′ = y₁(1 − y₁ − y₂)
  • y₂′ = y₂(0.5 − 0.75y₁ − 0.25y₂)

At approximation wₙ = (1.5, 0)ᵀ, the Jacobian is:

  • J = [−2, −1.5; 0, −0.625]
  • Eigenvalues: λ₁ = −2, λ₂ = −0.625 (both real and negative).

For Forward Euler, the stability condition is Δt ≤ min{−2/λⱼ} = 1.

Important: This procedure must be repeated at each time step, since eigenvalues change with the approximation.
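
The Jacobian evaluation at a given approximation can be sketched as follows (the partial derivatives are taken by hand from the right-hand side above):

```python
import numpy as np

def competing_species_jacobian(y):
    """Jacobian of f1 = y1*(1 - y1 - y2), f2 = y2*(0.5 - 0.75*y1 - 0.25*y2)."""
    y1, y2 = y
    return np.array([[1 - 2 * y1 - y2, -y1],
                     [-0.75 * y2, 0.5 - 0.75 * y1 - 0.5 * y2]])

# At w_n = (1.5, 0): J = [[-2, -1.5], [0, -0.625]], eigenvalues -2 and -0.625,
# so Forward Euler needs dt <= min(-2/lambda_j) = 1 at this time step.
J = competing_species_jacobian((1.5, 0.0))
eigs = np.sort(np.linalg.eigvals(J).real)
dt_bound = min(-2 / lam for lam in eigs)
```

In a time loop this evaluation would be repeated at every step with the current approximation.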

🔁 Example: Mathematical pendulum

For the pendulum system:

  • y₁′ = y₂
  • y₂′ = −sin y₁

The Jacobian at (t̂, ŷ) is:

  • J = [0, 1; −cos ŷ₁, 0]

If −π/2 < ŷ₁ < π/2, eigenvalues are purely imaginary: λ₁,₂ = ±i√(cos ŷ₁).

Method comparison for imaginary eigenvalues:

  • Forward Euler and Modified Euler: unstable for every Δt.
  • Backward Euler: unconditionally stable, but |Q| < 1 causes artificial damping (unphysical for an undamped pendulum).
  • Trapezoidal: unconditionally stable with |Q| = 1, preserving oscillation amplitude (more accurate).
  • RK4: stable if Δt ≤ 2.8/√(cos ŷ₁); gives damped oscillation if Δt is strictly smaller than the bound.

Don't confuse: Imaginary eigenvalues represent undamped oscillations. Methods with |Q| < 1 introduce artificial damping; methods with |Q| = 1 preserve energy.
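
The damping-versus-preservation contrast is easy to verify with the scalar amplification factors (a sketch; z stands for λΔt):

```python
def Q_backward_euler(z):
    """Amplification factor of Backward Euler: 1/(1 - z)."""
    return 1 / (1 - z)

def Q_trapezoidal(z):
    """Amplification factor of the Trapezoidal method: (1 + z/2)/(1 - z/2)."""
    return (1 + z / 2) / (1 - z / 2)
```

For purely imaginary z the Backward Euler factor has modulus below 1 (artificial damping), while the Trapezoidal factor has modulus exactly 1 (amplitude preserved).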

📐 Global truncation error

📐 Error definitions for systems

The excerpt briefly mentions that error analysis extends to systems:

Local truncation-error vector: τₙ₊₁ = (yₙ₊₁ − zₙ₊₁) / Δt, where yₙ₊₁ is the exact solution at tₙ₊₁ and zₙ₊₁ is the one-step approximation starting from yₙ.

Global truncation-error vector: eₙ₊₁ = yₙ₊₁ − wₙ₊₁, where wₙ₊₁ is the numerical approximation.

The excerpt states that the orders of these errors are equal to the corresponding orders in the scalar case, but does not provide further details.


Global truncation error for systems

6.9 Global truncation error for systems

🧭 Overview

🧠 One-sentence thesis

The global truncation error for systems of ODEs has the same order as the corresponding scalar case when the method is stable and consistent, and the analysis extends naturally from scalar to vector equations through eigenvalue decomposition.

📌 Key points (3–5)

  • Core analogy: Local and global truncation errors for systems are defined analogously to the scalar case, with the same order relationships.
  • Key definitions: Local truncation error is (y_{n+1} - z_{n+1})/Δt, global truncation error is y_{n+1} - w_{n+1}, where y is exact, z is one-step approximation, w is numerical solution.
  • Order preservation: Methods like Forward Euler retain their order properties (local O(Δt), global O(Δt) if stable and consistent) when extended to systems.
  • Common confusion: The vector case might seem more complex, but decoupling via eigenvector transformation reduces it to m independent scalar problems, one per eigenvalue.
  • Practical tool: Richardson's extrapolation must be applied componentwise for systems to estimate global truncation errors.

📐 Error definitions for systems

📏 Local truncation-error vector

Local truncation-error vector: τ_{n+1} = (y_{n+1} - z_{n+1})/Δt

  • y_{n+1}: the exact solution at time t_{n+1}
  • z_{n+1}: the approximation at time t_{n+1} after applying one step of the numerical method starting from the exact value y_n
  • This measures the error introduced in a single step, normalized by the step size.
  • Example: If you start at the exact solution and take one Forward Euler step, z_{n+1} is where you land; the difference from the true y_{n+1} (divided by Δt) is the local error.

🌍 Global truncation-error vector

Global truncation-error vector: e_{n+1} = y_{n+1} - w_{n+1}

  • w_{n+1}: the numerical approximation at time t_{n+1} after many accumulated steps
  • This measures the total accumulated error from all previous steps.
  • Don't confuse: Local error assumes you start from the exact solution; global error reflects the reality that each step starts from an already-approximate value.

🔍 Order relationships and stability

📊 Order preservation from scalar to vector

The excerpt states that "the orders of these errors are equal to the corresponding orders in the scalar case."

  • Forward Euler for systems: local truncation error of order O(Δt); global truncation error also O(Δt) if the method is stable and consistent.
  • Any stable and consistent method: local and global error orders are the same as in the scalar case.
  • This mirrors the scalar case exactly.

🛠️ Richardson's extrapolation for systems

  • The excerpt notes that "Richardson's extrapolation should be performed componentwise to obtain global truncation-error estimates."
  • This means: apply the extrapolation technique separately to each component of the vector solution.
  • Why componentwise: each component may have different magnitudes or behavior, so treating them independently gives accurate error estimates for the entire system.
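
With vector-valued approximations, the componentwise extrapolation is one array expression. A sketch under the usual assumption that the leading error term behaves like C·hᵖ componentwise:

```python
import numpy as np

def richardson_estimate(w_h, w_h2, p):
    """Componentwise Richardson estimate of the global error in w_h2 (the
    approximation computed with step h/2) for a method of order p:
    e(h/2) ~ (w_h - w_h2) / (2**p - 1)."""
    return (np.asarray(w_h) - np.asarray(w_h2)) / (2 ** p - 1)
```

Because NumPy arithmetic is elementwise, each solution component gets its own error estimate, as the excerpt requires.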

🧮 Mathematical foundation: test system analysis

🧪 Test system setup

The excerpt motivates the result using the test system y' = Ay (a linear system with constant matrix A).

  • Numerical approximation: w_{n+1} = Q(Δt A) w_n, where Q is the amplification factor function for the method.
  • One-step from exact: z_{n+1} = Q(Δt A) y_n.
  • Local error formula: τ_{n+1} = (y_{n+1} - Q(Δt A) y_n)/Δt.
  • Global error formula: e_{n+1} = y_{n+1} - Q(Δt A) w_n.

🔗 Recursive relationship for global error

Because w_n = y_n - e_n, the global error satisfies:

e_{n+1} = y_{n+1} - Q(Δt A) y_n + Q(Δt A) e_n = Δt τ_{n+1} + Q(Δt A) e_n

  • This is a recurrence relation: the new global error equals the current local error (scaled by Δt) plus the amplified previous global error.
  • This structure is analogous to the scalar case and is the foundation for proving order preservation.

🎯 Decoupling via eigenvector transformation

🔀 Transformation to diagonal form

The excerpt decouples the error equation using the transformation e_{n+1} = S η_{n+1}, where:

  • S: matrix whose columns are the eigenvectors v₁, ..., v_m of A.
  • Λ: diagonal matrix with eigenvalues λ₁, ..., λ_m of A on the diagonal.
  • Result: S⁻¹ Q(Δt A) S = Q(Δt Λ), a diagonal matrix (by Theorem 6.8.1).

After transformation, the decoupled global error becomes:

η_{n+1} = Δt S⁻¹ τ_{n+1} + Q(Δt Λ) η_n

🧩 Decomposition of local error

The local truncation-error vector is decomposed along the eigenvectors:

τ_{n+1} = Σ_{j=1}^m α_{j,n+1} v_j = S α_{n+1}

  • α_{j,n+1}: components of the local error with respect to the eigenvector basis.
  • Substituting into the decoupled equation: η_{n+1} = Δt α_{n+1} + Q(Δt Λ) η_n.

📉 Component-wise scalar analysis

Because Q(Δt Λ) is diagonal with entries Q(λ_j Δt), the components decouple completely:

η_{j,n+1} = Σ_{ℓ=0}^n (Q(λ_j Δt))^ℓ Δt α_{j,n+1-ℓ} (for j = 1, ..., m)

  • Key insight: "This is a decoupled system: the scalar properties of local and global truncation errors can be applied to each component."
  • Each component behaves like a scalar problem with eigenvalue λ_j.
  • Example: For Forward Euler, each decoupled local error component α_{j,n+1} is of order O(Δt), just as in the scalar case.

✅ Order preservation conclusion

  • If the method is stable and consistent, each decoupled global truncation error component is of the same order as the decoupled local error (by Theorem 6.4.1).
  • Because the transformation matrix S does not depend on Δt, the original (coupled) local and global truncation errors are also of the same order.
  • Don't confuse: The vector structure adds no extra order of error; the eigenvector decomposition shows that the system is just m independent scalar problems in disguise.
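
The diagonalization argument can be checked numerically for Forward Euler; the 2×2 matrix below is an illustrative choice, not from the excerpt:

```python
import numpy as np

# Theorem 6.8.1 for Forward Euler: the eigenvalues of the amplification
# matrix Q(dt*A) = I + dt*A equal the scalar factors Q(lambda_j*dt) = 1 + lambda_j*dt.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # eigenvalues -1 and -2
dt = 0.1
matrix_eigs = np.sort(np.linalg.eigvals(np.eye(2) + dt * A))
scalar_factors = np.sort(1 + dt * np.linalg.eigvals(A))
```

The two sorted arrays agree, which is exactly the decoupling used in the error analysis above.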

Stiff Differential Equations

6.10 Stiff differential equations

🧭 Overview

🧠 One-sentence thesis

Stiff differential equations require implicit time-integration methods because their rapidly decaying transients force explicit methods to use impractically small time steps even though the slowly-varying quasi-stationary solution could be approximated accurately with much larger steps.

📌 Key points (3–5)

  • What stiffness means: problems with two timescales—a fast transient that decays rapidly and a slowly-varying quasi-stationary solution.
  • Why explicit methods fail: stability conditions force very small time steps based on the transient, even after it has vanished and only the slow part remains.
  • Why implicit methods work: unconditional stability allows larger time steps matched to the accuracy needs of the quasi-stationary solution, not the transient.
  • Common confusion: not all implicit methods behave the same—superstable methods (like Backward Euler) damp initial errors much faster than methods like Trapezoidal.
  • Systems criterion: stiffness occurs when eigenvalues have strongly negative real parts alongside near-zero real parts, or when the particular solution varies much more slowly than the homogeneous solution.

🔍 What makes a problem stiff

🔍 The two-timescale structure

Stiff differential equations (stiff systems) describe problems that exhibit transients. Their solution is the sum of a rapidly decaying part, the transient, and a slowly-varying part.

  • After a short time, the transient becomes invisible; only the quasi-stationary solution remains.
  • The excerpt uses a model problem: y′ = λ(y − F(t)) + F′(t), with solution y(t) = (y₀ − F(t₀)) exp(λ(t − t₀)) + F(t).
  • Stiffness condition: λ is strongly negative and F varies on a large (slow) timescale.
    • Transient: (y₀ − F(t₀)) exp(λ(t − t₀)) decays rapidly.
    • Quasi-stationary: F(t) varies slowly.

🔍 Why this creates a problem

  • The transient determines the stability condition for explicit methods.
  • Since λ is strongly negative, the condition |Q(λΔt)| ≤ 1 forces Δt to be very small.
  • Key inefficiency: this restriction is only due to the transient; the quasi-stationary solution could be approximated accurately with much larger steps.
  • The stability condition restricts the time step more than the accuracy requirement does.

⚙️ Error behavior in stiff problems

⚙️ Local truncation errors

  • Local truncation errors are relatively large at the beginning because the transient decays very rapidly.
  • Since the quasi-stationary solution is slowly varying, local errors become smaller at later times.

⚙️ Global truncation error decay

The global truncation error adapts to local errors and decreases after passing the transient phase.

From the formula: e_{n+1} = (Q(λΔt))^n Δt τ₁ + (Q(λΔt))^{n−1} Δt τ₂ + ... + Q(λΔt) Δt τ_n + Δt τ_{n+1}

  • If the method is stable and the amplification factor is small enough (e.g., |Q(λΔt)| ≤ 0.5), initial local truncation errors have much less influence than recent errors.
  • Due to exponential decay, past errors are damped out.
  • The smaller |Q(λΔt)| is, the faster past local truncation errors decay.
  • Potential inefficiency: if Δt is chosen very small relative to the timescale of F, the global truncation error could be unnecessarily small—not efficient if only long-time approximation is needed.

🛠️ Implicit methods for stiff problems

🛠️ Why implicit methods are preferable

  • Explicit methods require very small time steps due to stability conditions.
  • Implicit methods (Backward Euler, Trapezoidal) are unconditionally stable.
  • Theoretical advantage: time step can be taken arbitrarily large.
  • Practical constraint: Δt must still be chosen to obtain sufficiently accurate approximations of the quasi-stationary solution.
  • Trade-off: at each time step, a system of algebraic equations must be solved, increasing computational cost.

🛠️ Backward Euler vs Trapezoidal: a key difference

Both methods are unconditionally stable, but they exhibit significant differences in behavior.

Example from the excerpt: scalar stiff problem y′ = −100(y − cos t) − sin t, t > 0, y(0) = 0, with solution y(t) = −exp(−100t) + cos t (λ = −100).

  • Time step size: 0.2
  • Transient region size: order 0.01 (first time step already exceeds this region)
  • First local truncation error is large.

Amplification factors:

  • Backward Euler: Q(λΔt) = 1/(1 − λΔt); for λ = −100 and Δt = 0.2 this gives 1/21 ≈ 0.048.
  • Trapezoidal: Q(λΔt) = (1 + ½λΔt)/(1 − ½λΔt); for λ = −100 and Δt = 0.2 this gives −9/11, i.e. |Q(λΔt)| = 9/11 ≈ 0.82.
  • Backward Euler: after four time steps, the solution curve is almost reached; initial local truncation error is damped out much faster.
  • Trapezoidal: needs more time steps to let the large initial local truncation error decay enough.
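
Both implicit methods can be applied to this scalar example in closed form, since the implicit equation is linear in the new value. A sketch (step counts and comparison point are illustrative):

```python
import math

LAM = -100.0  # stiffness parameter from the example

def backward_euler_stiff(dt, n_steps):
    """Backward Euler for y' = LAM*(y - cos t) - sin t, y(0) = 0;
    exact solution y(t) = -exp(LAM*t) + cos t."""
    t, w = 0.0, 0.0
    for _ in range(n_steps):
        t += dt
        # Solve w_new = w + dt*(LAM*(w_new - cos t) - sin t) for w_new.
        w = (w - dt * (LAM * math.cos(t) + math.sin(t))) / (1 - LAM * dt)
    return w

def trapezoidal_stiff(dt, n_steps):
    """Trapezoidal method for the same problem."""
    t, w = 0.0, 0.0
    for _ in range(n_steps):
        t1 = t + dt
        rhs = w + 0.5 * dt * (LAM * (w - math.cos(t)) - math.sin(t)
                              - LAM * math.cos(t1) - math.sin(t1))
        w = rhs / (1 - 0.5 * LAM * dt)
        t = t1
    return w
```

With Δt = 0.2, the Backward Euler error at t = 0.8 is already small, while the Trapezoidal approximation still carries a large remnant of the initial error (damped only by a factor 9/11 per step).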

🛠️ Superstability

Definition 6.10.1 (Superstability): A numerical method is called superstable if it is stable and lim_{λΔt → −∞} |Q(λΔt)| < 1.

  • Backward Euler is superstable.
  • Trapezoidal method: lim_{λΔt → −∞} |Q(λΔt)| = 1.
  • Implication: initial perturbations in fast components do not decay, or decay very slowly, when using the Trapezoidal method.
  • Don't confuse: both methods are unconditionally stable, but only superstable methods rapidly damp initial errors from the transient.

🧮 Stiffness in systems

🧮 Criteria for stiff systems

For systems of the form y′ = Ay + f, the solution is the sum of homogeneous and particular solutions.

A system is stiff if at least one of the following holds:

  • Some real parts of eigenvalues are strongly negative, whereas other eigenvalues have real parts close to zero.
  • The particular solution varies much more slowly than the homogeneous solution.

Essential feature: the solution contains two timescales—a fast transient (which determines numerical stability of explicit methods) and a slowly-varying component.

🧮 Example: two-component system

Suppose y′ = Ay, where A is a 2×2 matrix with eigenvalues λ₁ = −1 and λ₂ = −10000.

  • Solution: y(t) = c₁v₁ exp(−t) + c₂v₂ exp(−10000t), where v₁ and v₂ are eigenvectors and c₁, c₂ are integration constants determined by initial conditions.
  • Transient: term proportional to exp(−10000t) vanishes much sooner.
  • Quasi-stationary: term containing exp(−t) is slowly-varying compared to the transient.
  • Stability problem: the transient still determines the stability condition (Forward Euler: Δt ≤ 2/10000) over the whole domain.
  • Inhibition: this prevents adaptation of the time step to the relatively slowly-varying quasi-stationary part containing exp(−t).

🧮 General complications

  • Stiffness may occur in more complicated systems beyond the simple examples shown.
  • Implicit methods are recommended, but nonlinear systems of equations must be solved at each time step, increasing computational cost considerably.
  • Practical consideration: the choice of method is often a matter of careful consideration; explicit methods cannot be ruled out beforehand.

Multi-Step Methods

6.11 Multi-Step methods *

🧭 Overview

🧠 One-sentence thesis

Multi-step methods achieve higher efficiency than single-step methods by reusing information from multiple previous time points, though they introduce complications like spurious roots and starting-value requirements.

📌 Key points (3–5)

  • Core idea: Multi-step methods use approximations from several previous time steps (t₀, t₁, ..., tₙ) rather than only the most recent point, potentially creating higher-order approximations with less work per step.
  • Starting problem: ℓ-step methods require a single-step method of the same order to generate the first ℓ−1 values before the multi-step algorithm can begin.
  • Spurious roots: Multi-step methods produce multiple amplification factors; only the principal root corresponds to the differential equation, while spurious roots are artifacts of the numerical method that must be tracked for stability.
  • Common confusion: Multi-step vs single-step—single-step methods depend only on information at tₙ (though they may evaluate the function at intermediate points within [tₙ, tₙ₊₁]), whereas multi-step methods explicitly use stored values from earlier time steps.
  • Efficiency trade-off: The Adams-Bashforth method requires only one function evaluation per step (reusing previous evaluations) and can be more efficient than comparable single-step methods, but adaptive time-stepping and spurious-root tracking are more difficult.

🔄 What makes a method multi-step

🔄 Single-step vs multi-step distinction

Single-step methods: the approximation at tₙ₊₁ depends solely on information from the previous point tₙ.

  • Although single-step methods (like Runge-Kutta) may evaluate the function at intermediate points, this information is only obtained and used inside the interval [tₙ, tₙ₊₁].
  • Multi-step methods explicitly store and reuse approximations from multiple earlier time steps: t₀, t₁, ..., tₙ.
  • Example: A two-step method uses both wₙ and wₙ₋₁ to compute wₙ₊₁.

🎯 Motivation for multi-step methods

  • Since approximations at many previous times are already available, it seems reasonable to use this information to design higher-order approximations.
  • The goal is to achieve better accuracy or efficiency by leveraging historical data rather than discarding it.

🏁 Starting-value requirement

  • An ℓ-step method requires ℓ initial values: w₀, w₁, ..., w_{ℓ−1}.
  • Only w₀ comes from the initial condition; the remaining values must be generated using a single-step method, ideally of the same order as the multi-step method.
  • Example: For the two-step Adams-Bashforth method, w₀ = y₀ (initial condition), and w₁ is computed using a single-step method like the Trapezoidal or Modified Euler method; then for n ≥ 2, the multi-step algorithm is applied.

🧮 Adams-Bashforth method example

🧮 Derivation from extrapolation

The Adams-Bashforth method is a two-step explicit method derived by modifying the Trapezoidal method:

  • Start with the Trapezoidal method: wₙ₊₁ = wₙ + (Δt/2)(f(tₙ, wₙ) + f(tₙ₊₁, wₙ₊₁)).
  • To make it explicit, extrapolate f(tₙ₊₁, wₙ₊₁) using a linear interpolation polynomial based on tₙ₋₁ and tₙ.
  • The linear interpolation gives: L₁(tₙ₊₁) = 2f(tₙ, wₙ) − f(tₙ₋₁, wₙ₋₁).
  • Substituting this into the Trapezoidal formula yields the Adams-Bashforth method: wₙ₊₁ = wₙ + (3/2)Δt f(tₙ, wₙ) − (1/2)Δt f(tₙ₋₁, wₙ₋₁).

⚡ Efficiency advantage

  • Only one function evaluation is required per time step, because f(tₙ₋₁, wₙ₋₁) was already computed during the previous time step.
  • This reuse of previous evaluations reduces computational cost compared to methods that evaluate the function multiple times per step.
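
The two-step scheme, including the starting step, can be sketched for a scalar ODE (using Modified Euler as the starter, as the excerpt suggests; the function layout is an illustrative choice):

```python
import math

def adams_bashforth2(f, t0, y0, dt, n_steps):
    """Two-step Adams-Bashforth: w_{n+1} = w_n + dt*(1.5*f_n - 0.5*f_{n-1}),
    with w_1 generated by one Modified Euler (predictor-corrector) step."""
    ts, ws = [t0], [y0]
    k1 = f(t0, y0)
    pred = y0 + dt * k1                       # predictor (Forward Euler)
    w1 = y0 + 0.5 * dt * (k1 + f(t0 + dt, pred))  # corrector
    ts.append(t0 + dt)
    ws.append(w1)
    f_prev = k1                               # f(t_0, w_0), reused below
    for n in range(1, n_steps):
        f_n = f(ts[-1], ws[-1])               # only one new evaluation per step
        ws.append(ws[-1] + dt * (1.5 * f_n - 0.5 * f_prev))
        ts.append(ts[-1] + dt)
        f_prev = f_n
    return ts, ws
```

Note how `f_prev` carries the previous evaluation forward, which is the source of the method's efficiency.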

📊 Stability analysis of Adams-Bashforth

📊 Amplification factors

For the test equation y′ = λy, the Adams-Bashforth method yields: wₙ₊₁ = (1 + (3/2)λΔt)wₙ − (1/2)λΔt wₙ₋₁.

To find the amplification factor Q(λΔt), assume wₙ = Q(λΔt)wₙ₋₁ and wₙ₊₁ = Q(λΔt)²wₙ₋₁. This leads to a quadratic equation:

Q(λΔt)² − (1 + (3/2)λΔt)Q(λΔt) + (1/2)λΔt = 0.

The two roots are:

  • Q₁(λΔt) = [1 + (3/2)λΔt + √D] / 2
  • Q₂(λΔt) = [1 + (3/2)λΔt − √D] / 2

where D = 1 + λΔt + (9/4)(λΔt)².

🔒 Stability bounds

The method is stable if |Q₁,₂(λΔt)| ≤ 1.

  • For Q₁(λΔt), the stability bound is satisfied for all values.
  • For Q₂(λΔt), the bound requires λΔt ≥ −1.
  • Therefore, the time step must satisfy: Δt ≤ −1/λ.
  • Don't confuse: This is a stricter stability constraint than some single-step methods; the time step must be chosen twice as small as in the Modified Euler method.
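
The quadratic for the amplification factors can be solved directly, which makes the bound λΔt ≥ −1 easy to check (a sketch; z stands for λΔt):

```python
import cmath

def ab2_amplification_roots(z):
    """Both roots of Q**2 - (1 + 1.5*z)*Q + 0.5*z = 0 for the two-step
    Adams-Bashforth method applied to y' = lambda*y."""
    sqrt_d = cmath.sqrt(1 + z + 2.25 * z * z)
    return ((1 + 1.5 * z + sqrt_d) / 2, (1 + 1.5 * z - sqrt_d) / 2)
```

At the boundary z = −1 the roots are 0.5 and −1 (the spurious root reaches modulus 1); for z below −1 the spurious root leaves the unit disk.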

🎭 Principal vs spurious roots

🎭 Two types of roots

Taylor expansion of the amplification factors reveals:

  • Q₁(λΔt) = 1 + λΔt + (1/2)(λΔt)² − (1/4)(λΔt)³ + O(Δt⁴)
  • Q₂(λΔt) = (1/2)λΔt + O(Δt²)

👑 Principal root

Principal root: the amplification factor that corresponds to the differential equation and has the correct order of accuracy.

  • Q₁(λΔt) is the principal root.
  • Its local truncation error is: τₙ₊₁ = (5/12)λ³Δt² yₙ + O(Δt³) = O(Δt²).
  • This matches the expected second-order accuracy.

👻 Spurious root

Spurious root: an amplification factor that does not belong to the differential equation but is a consequence of the chosen numerical method.

  • Q₂(λΔt) is the spurious root.
  • Its local truncation error is O(1), meaning it has no accuracy.
  • This root is an artifact of the multi-step formulation and must be tracked to ensure stability.
  • Keeping track of spurious roots is a difficult matter in multi-step methods.

⚖️ Efficiency comparison with Modified Euler

⚖️ Work per accuracy

To achieve local truncation error less than ε:

  • Adams-Bashforth: ΔtₐB = √(−12ε / (5λ³))
  • Modified Euler: ΔtₘE = √(−6ε / λ³)
  • Ratio: ΔtₐB / ΔtₘE = √(2/5) ≈ 0.63, so ΔtₐB ≈ 0.63 ΔtₘE.

⚖️ Overall efficiency

  • Two steps of Adams-Bashforth require the same work as one step of Modified Euler (because Adams-Bashforth reuses previous evaluations).
  • In the time it takes Modified Euler to advance by ΔtₘE, Adams-Bashforth advances by 2ΔtₐB ≈ 1.26 ΔtₘE.
  • Therefore, the Adams-Bashforth method requires less work than the Modified Euler method to reach the same final time with comparable accuracy.

⚠️ Practical limitations

⚠️ Why multi-step methods are less popular

The excerpt notes that multi-step methods are less popular than Runge-Kutta methods for several reasons:

  • Starting problems: Approximations must be known at several time steps before the multi-step algorithm can begin; this requires a separate single-step method.
  • Spurious roots: Tracking spurious roots is difficult and adds complexity to stability analysis.
  • Adaptive time-stepping: Changing the time step is more difficult because multi-step methods rely on a history of equally-spaced previous values; adjusting Δt disrupts this structure.

⚠️ When to consider multi-step methods

Despite these limitations, multi-step methods can be more efficient when:

  • The time step can remain constant or change infrequently.
  • The cost of function evaluations is high, making reuse of previous evaluations valuable.
  • The problem does not require frequent restarts or adaptive stepping.

The Finite-Difference Method for Boundary-Value Problems

7.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The finite-difference method solves boundary-value problems by replacing derivatives with difference formulae at discrete points, converting a differential equation into a system of algebraic equations.

📌 Key points (3–5)

  • What a boundary-value problem is: a differential equation on a line segment where the function and/or its derivatives are specified at both boundary points (not just one).
  • Core idea of the method: replace all derivatives in the differential equation with difference formulae (from Chapter 3) and neglect truncation errors to obtain a discrete approximation.
  • Three types of boundary conditions: Dirichlet (function value given), Neumann (derivative given), and Robin (combination of both).
  • Common confusion: boundary-value problems differ from initial-value problems—values are given at both ends of the interval, not just at the starting point.
  • Application context: many steady-state physical problems (e.g., stationary heat conduction) lead to boundary-value problems.

🔥 Motivating example: stationary heat conduction

🔥 Physical setup

  • A bar of length L and cross-sectional area A has temperature T(x) along its length.
  • Temperature is known at both ends: T(0) = T_l and T(L) = T_r.
  • Heat is generated inside the bar at rate Q(x) (measured in J/(m³s)).
  • The goal is to find the steady-state (long-time) temperature distribution.

⚖️ Energy balance derivation

  • Consider a small control volume between x and x + Δx.
  • Heat flows by conduction; Fourier's law gives the heat flow density:

    Heat flow density: q(x) = −λ dT/dx(x), where λ (J/(msK)) is the heat-conduction coefficient.

  • Energy balance: total heat outflow at x + Δx minus total heat inflow at x equals heat produced in the segment.
  • This yields:
    • −λA dT/dx(x + Δx) + λA dT/dx(x) = AQ(x)Δx
  • Dividing by AΔx and letting Δx → 0 gives the differential equation:
    • −λ d²T/dx²(x) = Q(x) for 0 < x < L
  • Boundary conditions: T(0) = T_l and T(L) = T_r.

🔄 Alternative boundary condition

  • Sometimes the heat flux (not temperature) is known at an endpoint.
  • At x = L, Fourier's law gives:
    • −λ dT/dx(L) = q_L, where q_L (J/(m²s)) is the known heat flux.
  • This is a different type of boundary condition (Neumann instead of Dirichlet).

📐 General form and boundary condition types

📐 General linear second-order boundary-value problem

The excerpt describes the standard form:

  • Differential equation: −(p(x)y′(x))′ + r(x)y′(x) + q(x)y(x) = f(x) for 0 < x < L
  • Assumptions: p(x) > 0 and q(x) ≥ 0 for all x in [0, L]
  • Boundary conditions at both ends:
    • At x = 0: a₀y(0) + b₀y′(0) = c₀
    • At x = L: a_Ly(L) + b_Ly′(L) = c_L

🏷️ Three types of boundary conditions

The excerpt defines three standard types (illustrated at x = 0):

| Type | Definition | Coefficients |
|---|---|---|
| Dirichlet | Function value is given: a₀y(0) = c₀ | b₀ = 0 |
| Neumann | Derivative is given: b₀y′(0) = c₀ | a₀ = 0 |
| Robin | Mixed: both function and derivative appear | a₀ ≠ 0 and b₀ ≠ 0 |

  • Example (Dirichlet): y(0) = 5 specifies the function value directly.
  • Example (Neumann): y′(0) = 3 specifies the slope at the boundary.
  • Example (Robin): 2y(0) + 3y′(0) = 7 combines both.

⚠️ Uniqueness warning

  • The excerpt notes that if a₀ = a_L = 0 (both boundaries are Neumann) and q(x) = 0 throughout, the problem may not have a unique solution (or may have no solution at all).
  • Don't confuse: this is a special degenerate case; most well-posed boundary-value problems have unique solutions.

🔢 The finite-difference method

🔢 Core principle

Finite-difference method: replace all derivatives in the differential equation by difference formulae (from Chapter 3) and neglect truncation errors to obtain a discrete approximation w for the solution y.

  • The method converts a continuous differential equation into a discrete system of algebraic equations.
  • Each derivative is approximated using values at nearby grid points.
  • Truncation errors (the difference between the true derivative and the finite-difference approximation) are ignored in the discrete system.

🧮 Example: homogeneous Dirichlet problem

The excerpt illustrates the method with:

  • Differential equation: −y′′(x) + q(x)y(x) = f(x) for 0 < x < 1
  • Boundary conditions: y(0) = 0 and y(1) = 0 (both Dirichlet, both zero—hence "homogeneous")

Discretization steps:

  1. Divide the interval [0, 1] into n + 1 equal subintervals of length Δx = 1/(n + 1).
  2. Define nodes x_j = jΔx for j = 0, 1, 2, ..., n + 1.
  3. Replace the second derivative y′′(x_j) with a finite-difference formula (e.g., central difference).
  4. Evaluate q(x) and f(x) at each node.
  5. Write one algebraic equation per interior node (j = 1, 2, ..., n).
  6. Use the boundary conditions to set w₀ = 0 and w_{n+1} = 0.
  7. Solve the resulting system of n linear equations for the unknowns w₁, w₂, ..., w_n.
  • The discrete approximation w_j approximates the true solution y(x_j) at each node.
  • Example: with n = 3, there are 4 subintervals and 3 interior nodes; the method produces 3 equations for w₁, w₂, w₃.

🔍 Key distinction: boundary-value vs initial-value

  • Boundary-value problem: conditions are given at both ends of the interval.
  • Initial-value problem (Chapter 6): conditions are given at only the starting point; the solution evolves forward in time.
  • Don't confuse: the finite-difference method for boundary-value problems produces a system of simultaneous equations (all unknowns are coupled), whereas time-stepping methods for initial-value problems compute the solution sequentially.

7.2 The Finite-Difference method

🧭 Overview

🧠 One-sentence thesis

The finite-difference method solves boundary-value problems by replacing derivatives with difference formulae at discrete nodes, transforming the differential equation into a linear system that can be solved numerically.

📌 Key points (3–5)

  • Core principle: replace all derivatives in the differential equation with difference formulae (from Chapter 3) and neglect truncation errors to obtain a discrete approximation.
  • Domain discretization: divide the interval into equidistant subintervals with nodes, then approximate the solution only at these nodes.
  • Three boundary condition types: Dirichlet (specifies function value), Neumann (specifies derivative), and Robin (mixes both).
  • Common confusion: homogeneous vs nonhomogeneous boundary conditions—nonhomogeneous conditions (nonzero values) require adjusting the right-hand side vector in the linear system.
  • Output: the method converts the continuous boundary-value problem into a matrix-vector equation Aw = f that can be solved for the approximate solution w.

🔧 General boundary-value problem setup

🔧 Standard form

The general linear second-order boundary-value problem in one dimension is:

−(p(x) y′(x))′ + r(x) y′(x) + q(x) y(x) = f(x), for 0 < x < L

with boundary conditions at both ends:

  • At x = 0: a₀ y(0) + b₀ y′(0) = c₀
  • At x = L: aₗ y(L) + bₗ y′(L) = cₗ

Assumptions:

  • p(x) > 0 for all x in [0, L]
  • q(x) ≥ 0 for all x in [0, L]

Uniqueness caveat: The problem does not have a unique solution when a₀ = aₗ = 0 and q(x) = 0 throughout the interval (if solutions exist at all).

🏷️ Three types of boundary conditions

The excerpt defines three categories based on the coefficients at x = 0 (similar definitions apply at x = L):

| Type | Definition | Coefficients | What it specifies |
|---|---|---|---|
| Dirichlet | a₀ y(0) = c₀ | b₀ = 0 | Function value directly |
| Neumann | b₀ y′(0) = c₀ | a₀ = 0 | Derivative (slope/flux) |
| Robin | a₀ y(0) + b₀ y′(0) = c₀ | a₀ ≠ 0 and b₀ ≠ 0 | Mix of value and derivative |

Example from heat conduction: If heat flux is known at x = L, Fourier's law gives −λ dT/dx(L) = qₗ, which is a Neumann boundary condition.

🔢 Discretization and difference scheme

🔢 Dividing the domain

  1. Split the interval [0, 1] into n + 1 equidistant subintervals.
  2. Length of each subinterval: Δx = 1/(n + 1).
  3. Nodes are located at xⱼ = j Δx for j = 0, 1, ..., n, n+1.
  4. The solution y(x) is approximated only at these discrete nodes.

Notation:

  • yⱼ = exact solution at node xⱼ
  • wⱼ = numerical approximation of yⱼ

🔄 Replacing derivatives with differences

The key step is to replace the second derivative y″(x) with a central-difference formula (from Chapter 3, equation 3.8):

y″(xⱼ) ≈ (yⱼ₋₁ − 2yⱼ + yⱼ₊₁) / Δx²

For the differential equation −y″(x) + q(x) y(x) = f(x), this becomes:

−(wⱼ₋₁ − 2wⱼ + wⱼ₊₁) / Δx² + qⱼ wⱼ = fⱼ for j = 1, ..., n

Why only j = 1 to n? The values at j = 0 and j = n+1 are determined by the boundary conditions, not by the differential equation.
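A quick numerical check of the central-difference formula's second-order accuracy (the test function y = exp and the evaluation point x₀ = 0.3 are arbitrary choices for illustration):

```python
import numpy as np

def second_diff(y, x, dx):
    """Central-difference approximation of y''(x)."""
    return (y(x - dx) - 2.0 * y(x) + y(x + dx)) / dx**2

# Assumed test function: y = exp, so y''(x) = exp(x) is known exactly.
x0 = 0.3
errors = [abs(second_diff(np.exp, x0, dx) - np.exp(x0)) for dx in (0.1, 0.05, 0.025)]

# Halving dx should shrink the error by about a factor of 4 (second order).
ratio = errors[0] / errors[1]
```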

🧮 Matrix formulation for homogeneous Dirichlet conditions

🧮 Example 7.2.1: zero boundary values

Consider the problem:

  • −y″(x) + q(x) y(x) = f(x) for 0 < x < 1
  • y(0) = 0 and y(1) = 0 (homogeneous Dirichlet conditions)

The finite-difference scheme yields n equations for the n unknowns w₁, ..., wₙ (since w₀ = wₙ₊₁ = 0 from boundary conditions).

Matrix-vector form: Aw = f

where A = K + M:

  • K is the (n × n) tridiagonal matrix from the second derivative:

    K = (1/Δx²) × [  2  -1   0  ...  0 ]
                   [ -1   2  -1  ...  0 ]
                   [  0  -1   2  ...  0 ]
                   [ ...            ... ]
                   [  0  ...  -1   2   ]
    
  • M is the diagonal matrix from the q(x) term:

    M = [ q₁   0  ...  0  ]
        [  0  q₂  ...  0  ]
        [ ...         ... ]
        [  0  ...  0  qₙ ]
    
  • w = (w₁, ..., wₙ)ᵀ (the unknowns)

  • f = (f₁, ..., fₙ)ᵀ (right-hand side values at nodes)

🔓 Don't confuse: homogeneous vs nonhomogeneous

"Homogeneous" means the boundary values are zero; "nonhomogeneous" means they are nonzero. This distinction affects how the linear system is set up.

🔀 Handling nonhomogeneous Dirichlet conditions

🔀 Example 7.2.2: nonzero boundary values

Now consider:

  • Same differential equation −y″(x) + q(x) y(x) = f(x)
  • But y(0) = α and y(1) = β (nonzero boundary values)

Key difference: The boundary values α and β are known constants, but they affect the equations at the interior nodes adjacent to the boundaries.

🔀 Adjusting the equations

  • At j = 1 (next to the left boundary):

    • The difference formula involves w₀, but w₀ = α (known).
    • Original: −(w₀ − 2w₁ + w₂) / Δx² + q₁ w₁ = f₁
    • Rearranged: −(−2w₁ + w₂) / Δx² + q₁ w₁ = f₁ + α/Δx²
  • At j = n (next to the right boundary):

    • The difference formula involves wₙ₊₁, but wₙ₊₁ = β (known).
    • Rearranged: −(wₙ₋₁ − 2wₙ) / Δx² + qₙ wₙ = fₙ + β/Δx²

Matrix-vector form: Aw = f + r

where:

  • A, w, and f are defined exactly as in Example 7.2.1
  • r = (1/Δx²) × (α, 0, ..., 0, β)ᵀ is the adjustment vector that incorporates the nonzero boundary values

Why this works: The known boundary values are moved to the right-hand side, so the system still solves for the n unknown interior values w₁, ..., wₙ.

📐 Linear algebra concepts for error analysis

📐 Scaled Euclidean norm

‖w‖ = sqrt((1/n) × sum of wᵢ² for i = 1 to n)

This is the "size" of a vector w in n-dimensional space, scaled by dividing by n before taking the square root.

📐 Subordinate matrix norm

‖A‖ = max ‖Aw‖ over all vectors w with ‖w‖ = 1

This measures the maximum "stretching" that the matrix A can do to a unit vector.

Key inequality: For any vector y, ‖Ay‖ ≤ ‖A‖ · ‖y‖

This inequality is used to bound how errors propagate through the linear system.

📐 Condition number

κ(A) = ‖A‖ · ‖A⁻¹‖

What it measures: sensitivity of the solution to perturbations in the right-hand side.

Error propagation: If the right-hand side f is perturbed by Δf, the solution w is perturbed by Δw, and:

‖Δw‖ / ‖w‖ ≤ κ(A) × (‖Δf‖ / ‖f‖)

Interpretation:

  • Large condition number → small relative error in f may cause large relative error in w (ill-conditioned problem).
  • The condition number amplifies the relative error from input to output.

Note: For symmetric matrices A, the eigenvalues λ₁, ..., λₙ are real valued (the excerpt mentions this but does not elaborate further).


7.3 Some concepts from Linear Algebra

🧭 Overview

🧠 One-sentence thesis

Understanding vector and matrix norms, along with the condition number, is essential for analyzing how errors in the input propagate to errors in the numerical solution of boundary-value problems.

📌 Key points (3–5)

  • Scaled Euclidean norm: measures the "size" of a vector by averaging the squares of its components.
  • Subordinate matrix norm: measures the maximum stretching effect a matrix has on unit vectors.
  • Condition number: quantifies how sensitive the solution is to perturbations in the input—a large condition number means small input errors can cause large output errors.
  • Common confusion: the condition number is not about the matrix itself being "bad," but about how errors amplify when solving Aw = f.
  • Gershgorin circle theorem: provides a practical way to estimate eigenvalues (and thus the condition number for symmetric matrices) without computing them exactly.

📏 Vector and matrix norms

📏 Scaled Euclidean norm of a vector

The scaled Euclidean norm of a vector w in R^n is defined as the square root of (1/n times the sum of the squares of all components).

  • In words: take each component w_i, square it, add all squares together, divide by n, then take the square root.
  • The scaling by 1/n keeps the norm comparable across different grid resolutions n, so it measures the average magnitude of a component rather than growing with the number of components.
  • Example: for a vector with n components, this norm gives a sense of the typical size of each component.

📐 Subordinate matrix norm

The natural, or subordinate, matrix norm related to the vector norm is defined as the maximum of ||Aw|| over all vectors w with ||w|| = 1.

  • In words: the matrix norm is the largest amount by which the matrix A can stretch a unit vector.
  • The excerpt shows that for any vector y in R^n (not zero), we have ||Ay|| ≤ ||A|| · ||y||.
  • This inequality (7.7) is key: it bounds how much a matrix can magnify any vector.
  • Don't confuse: the matrix norm is not computed component-wise; it is derived from the vector norm and measures the worst-case stretching.

🔢 Condition number and error propagation

🔢 What the condition number measures

The quantity κ(A) = ||A|| · ||A^(−1)|| is called the condition number of the matrix A.

  • Context: we want to solve Aw = f. If the right-hand side f is perturbed by an error Δf, the solution w will contain an error Δw.
  • The excerpt derives that the relative error in w is bounded by: (||Δw|| / ||w||) ≤ κ(A) · (||Δf|| / ||f||).
  • In words: the condition number is the amplification factor—it tells you how much a relative error in f can grow into a relative error in w.
  • A large condition number implies that even a small relative error in the input may result in a large relative error in the solution.

⚠️ Why condition number matters

  • The inequality (7.8) shows that the condition number directly controls error sensitivity.
  • If κ(A) is large, the problem is "ill-conditioned": small input perturbations (e.g., rounding errors, measurement noise) can lead to large output errors.
  • Example: if κ(A) = 1000 and the input has a 0.1% relative error, the output could have up to a 100% relative error.
  • Don't confuse: a large condition number does not mean the matrix is "wrong," but that the problem itself is sensitive to input changes.
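A small numerical illustration of the bound ‖Δw‖/‖w‖ ≤ κ(A)·‖Δf‖/‖f‖; the nearly singular 2×2 matrix and the perturbation are arbitrary assumptions:

```python
import numpy as np

# Nearly singular 2x2 matrix (an arbitrary illustration) -> large condition number.
A = np.array([[1.0, 0.99], [0.99, 0.98]])
kappa = np.linalg.cond(A, 2)

f = np.array([1.0, 1.0])
w = np.linalg.solve(A, f)

df = 1e-6 * np.array([1.0, -1.0])          # small perturbation of the right-hand side
dw = np.linalg.solve(A, f + df) - w

rel_in = np.linalg.norm(df) / np.linalg.norm(f)
rel_out = np.linalg.norm(dw) / np.linalg.norm(w)
# rel_out is far larger than rel_in, yet still below the bound kappa * rel_in;
# the amplification can approach kappa in the worst case.
```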

🎯 Computing the condition number for symmetric matrices

🎯 Eigenvalue formula for symmetric matrices

  • For a symmetric matrix A, the eigenvalues λ₁, ..., λₙ are real-valued.
  • The excerpt states (7.9): ||A|| = |λ|_max (the largest absolute eigenvalue) and ||A^(−1)|| = 1 / |λ|_min (the reciprocal of the smallest absolute eigenvalue).
  • Therefore, κ(A) = |λ|_max / |λ|_min.
  • In words: the condition number is the ratio of the largest to the smallest absolute eigenvalue.
  • Practical implication: you only need to know (or estimate) the two extremal eigenvalues to compute the condition number.

🔍 Gershgorin circle theorem

Theorem 7.3.1 (Gershgorin circle theorem): The eigenvalues of a general n × n matrix A are located in the complex plane in the union of circles |z − a_ii| ≤ sum of |a_ij| for j ≠ i, where z is in C.

  • In words: each eigenvalue lies within at least one circle centered at a diagonal entry a_ii, with radius equal to the sum of the absolute values of the off-diagonal entries in that row.
  • The proof assumes Av = λv and picks the largest component v_i in modulus, then uses the triangle inequality to bound |λ − a_ii|.
  • Example: for the matrix A = [[2, −1], [−1, 2]], the theorem says eigenvalues satisfy |λ − 2| ≤ 1, so λ is in [1, 3]. The actual eigenvalues are λ₁ = 1 and λ₂ = 3, which indeed satisfy this condition.
  • Why it's useful: the theorem provides bounds on eigenvalues without computing them exactly, which helps estimate the condition number.

🧮 Example: applying Gershgorin to a 2×2 matrix

  • Matrix: A = [[2, −1], [−1, 2]].
  • For row 1: center at 2, radius = |−1| = 1 → circle |λ − 2| ≤ 1.
  • For row 2: center at 2, radius = |−1| = 1 → same circle.
  • Union of circles: all eigenvalues lie in |λ − 2| ≤ 1, i.e., λ ∈ [1, 3].
  • Actual eigenvalues: λ₁ = 1, λ₂ = 3, which are indeed in [1, 3].
  • Don't confuse: the theorem gives a region, not the exact eigenvalues; it is an estimation tool.
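The disc computation can be sketched as follows (the helper name `gershgorin_discs` is a hypothetical choice, not from the text):

```python
import numpy as np

def gershgorin_discs(A):
    """One (center, radius) pair per row: center a_ii, radius = sum of |a_ij|, j != i."""
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return list(zip(centers, radii))

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
discs = gershgorin_discs(A)                  # both rows give |z - 2| <= 1
eigs = np.linalg.eigvals(A)                  # actual eigenvalues: 1 and 3

# Every eigenvalue must lie in the union of the discs.
in_union = all(any(abs(lam - c) <= r + 1e-12 for c, r in discs) for lam in eigs)
```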

🔗 Connection to numerical methods

🔗 Why these concepts matter for boundary-value problems

  • The excerpt begins by noting that to estimate the global accuracy of a numerical approximation, it is necessary to analyze the difference between the numerical approximation w_j and the solution y_j.
  • The matrix-vector form Aw = f + r arises from discretizing a boundary-value problem.
  • Norms and the condition number allow us to bound how errors in the right-hand side (e.g., from discretization or rounding) affect the computed solution.
  • Example: if the discretization introduces an error Δf in f, the condition number tells you how much error Δw to expect in w.

🧩 Preparing for consistency, stability, and convergence

  • The excerpt ends by noting that the next section (7.4) will define the local truncation error and prove that the difference between the numerical approximation and the true solution tends to zero as the step size Δx tends to zero.
  • The tools introduced here (norms, condition number, Gershgorin theorem) are the foundation for those proofs.

7.4 Consistency, stability and convergence

🧭 Overview

🧠 One-sentence thesis

A finite-difference scheme converges to the true solution if and only if it is both consistent (local error vanishes as step size shrinks) and stable (the inverse matrix norm remains bounded).

📌 Key points (3–5)

  • Three core concepts: consistency (local truncation error → 0), stability (inverse matrix norm bounded), and convergence (global error → 0).
  • The fundamental theorem: stability + consistency together guarantee convergence.
  • Local vs global error: local truncation error measures how well the scheme approximates the differential equation at each node; global truncation error measures the difference between numerical and exact solutions.
  • Common confusion: consistency alone is insufficient—a scheme can satisfy the differential equation locally yet still fail to converge without stability.
  • Order preservation: the global truncation error has the same order as the local truncation error.

🔍 Error concepts

🔍 Local truncation error

Local truncation error ε: the difference between applying the finite-difference operator to the exact solution and to the numerical approximation: ε = Ay - Aw = Ay - f, where y_j = y(x_j) are exact values at nodes and w is the numerical approximation.

  • Measures how well the scheme approximates the differential equation at each grid point.
  • For the central-difference discretization of the second derivative, the local truncation error is O(Δx²).
  • Example: For the scheme in equation (7.4), row j gives ε_j = O(Δx²) because the central-difference formula has second-order accuracy.

🌐 Global truncation error

Global truncation error e: the difference between the exact solution and the numerical approximation: e = y - w.

  • This is what we ultimately care about: how far is our computed answer from the true solution?
  • Related to local error by: Ae = A(y - w) = ε, so e = A⁻¹ε.
  • Don't confuse: local error measures equation approximation; global error measures solution approximation.

🎯 The three properties

✅ Consistency

Consistency: a finite-difference scheme is consistent if the limit as Δx → 0 of the norm of the local truncation error equals zero: lim(Δx→0) ‖ε‖ = 0.

  • Means the scheme approximates the differential equation better and better as the grid is refined.
  • For the boundary-value problem example, ‖ε‖ = O(Δx²), so the system is consistent.
  • Consistency alone does not guarantee convergence—stability is also required.

🛡️ Stability

Stability: a finite-difference scheme is stable if A⁻¹ exists and there exists a constant C, independent of Δx, such that ‖A⁻¹‖ ≤ C as Δx → 0.

  • Ensures that perturbations in the right-hand side do not grow unboundedly.
  • For symmetric matrices, ‖A⁻¹‖ = 1/|λ|_min, so stability requires the smallest eigenvalue to stay bounded away from zero.

Two cases for the boundary-value problem:

| Case | Condition | Eigenvalue bounds | Stability result |
|---|---|---|---|
| q(x) > 0 | q_min ≤ q(x) ≤ q_max | q_min ≤ λ_j ≤ q_max + 4/Δx² | ‖A⁻¹‖ ≤ 1/q_min, stable |
| q(x) = 0 | q zero everywhere | λ_min ≈ π² (from formula 7.11) | Stable |

🎓 Convergence

Convergence: a scheme is convergent if the global truncation error satisfies lim(Δx→0) ‖e‖ = 0.

  • This is the ultimate goal: the numerical solution approaches the exact solution as the grid is refined.
  • Theorem 7.4.1: If a scheme is stable and consistent, then it is convergent.
  • Proof sketch: Since e = A⁻¹ε, taking norms gives ‖e‖ ≤ ‖A⁻¹‖ ‖ε‖. Stability bounds ‖A⁻¹‖ and consistency makes ‖ε‖ → 0, so ‖e‖ → 0.
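A sketch verifying second-order convergence numerically, assuming the model problem −y″ = π² sin(πx) with homogeneous Dirichlet conditions (exact solution sin(πx)); halving Δx should reduce the global error by about a factor of four:

```python
import numpy as np

def global_error(n):
    """Max nodal error for -y'' = pi^2 sin(pi x), y(0) = y(1) = 0 (exact: sin(pi x))."""
    dx = 1.0 / (n + 1)
    xi = np.linspace(dx, 1.0 - dx, n)
    K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / dx**2
    w = np.linalg.solve(K, np.pi**2 * np.sin(np.pi * xi))
    return np.max(np.abs(w - np.sin(np.pi * xi)))

# Halving dx (n + 1 goes from 16 to 32) should divide the error by about 4,
# i.e. the observed convergence order should be close to 2.
e_coarse, e_fine = global_error(15), global_error(31)
order = np.log2(e_coarse / e_fine)
```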

🧮 Eigenvalue estimation

🔵 Gershgorin circle theorem

Gershgorin circle theorem: The eigenvalues of an n×n matrix A are located in the complex plane in the union of circles |z - a_ii| ≤ sum(|a_ij|) for j ≠ i, where z ∈ C.

  • Provides bounds on eigenvalues without computing them exactly.
  • For each row i, draw a circle centered at the diagonal element a_ii with radius equal to the sum of absolute values of off-diagonal elements in that row.
  • All eigenvalues lie somewhere in the union of these circles.
  • Example: For the 2×2 matrix with diagonal 2 and off-diagonal -1, the theorem gives |λ - 2| ≤ 1, which correctly bounds the actual eigenvalues λ₁ = 1 and λ₂ = 3.

🔢 Condition number for symmetric matrices

For symmetric matrices, the condition number simplifies:

  • κ(A) = |λ|_max / |λ|_min (ratio of largest to smallest absolute eigenvalue).
  • Only need to estimate the extremal eigenvalues to compute or estimate the condition number.
  • The Gershgorin theorem helps estimate these extremal values.

📊 Practical example

🌡️ Heat transport in a bar

The excerpt presents two related problems:

Without dissipation (7.12):

  • Equation: -y'' = 25e^(5x), with y(0) = y(1) = 0.
  • The numerical approximation converges rapidly to the solution.
  • Step size Δx = 1/16 provides sufficient accuracy for practical purposes.

With dissipation (7.13):

  • Equation: -y'' + 9y = 25e^(5x), with y(0) = y(1) = 0.
  • The term 9y describes heat dissipation proportional to temperature.
  • Maximum temperature is lower than without dissipation.
  • Convergence behavior is similar to the case without dissipation.

Don't confuse: Both problems converge, but the physical solution differs—dissipation reduces peak temperature.
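Both problems can be solved with the same assembly; this sketch (the function name and the grid size Δx = 1/16 mirror the excerpt, the rest are assumed details) compares the peak values with and without the dissipation term:

```python
import numpy as np

def solve_heat(n, qconst):
    """-y'' + qconst * y = 25 e^{5x} on (0, 1) with y(0) = y(1) = 0."""
    dx = 1.0 / (n + 1)
    xi = np.linspace(dx, 1.0 - dx, n)
    K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / dx**2
    return np.linalg.solve(K + qconst * np.eye(n), 25.0 * np.exp(5.0 * xi))

w_no_dissipation = solve_heat(15, 0.0)   # problem (7.12), step size dx = 1/16
w_dissipation = solve_heat(15, 9.0)      # problem (7.13), with the 9y dissipation term
peak_drop = w_no_dissipation.max() - w_dissipation.max()
```

As the excerpt states, the dissipation term lowers the maximum temperature, so `peak_drop` is positive.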


7.5 Conditioning of the discretization matrix *

🧭 Overview

🧠 One-sentence thesis

The effective condition number of the discretization matrix remains bounded as the step size decreases, making the finite-difference method more stable than the standard condition number suggests.

📌 Key points (3–5)

  • Standard condition number grows unboundedly: when step size Δx tends to zero, the condition number κ(A) ≈ 4/(π Δx)² becomes unbounded, suggesting arbitrarily large errors.
  • Effective condition number is bounded: a more realistic estimate using the effective condition number κ_eff(A) shows that the method is actually stable in many applications.
  • Key difference: the effective condition number depends on the ratio ||f|| / ||w|| (norm of the right-hand side divided by norm of the solution), not just the eigenvalues.
  • Common confusion: the standard condition number κ(A) is too pessimistic; it overestimates the sensitivity to perturbations in the right-hand side.
  • Practical implication: although harder to compute, the effective condition number gives a more accurate picture of numerical stability.

📐 The standard condition number problem

📐 Eigenvalue approximations

For the boundary-value problem with q(x) = 0 on the interval 0 ≤ x ≤ 1, the discretization matrix A has:

  • Minimum eigenvalue: λ_min ≈ π²
  • Maximum eigenvalue: λ_max ≈ 4 / Δx²
  • Condition number: κ(A) = λ_max / λ_min ≈ 4/(π Δx)²

⚠️ Unbounded growth

  • As the step size Δx tends to zero (finer discretization), the condition number κ(A) becomes unbounded.
  • This suggests that perturbations in the right-hand side f could lead to arbitrarily large errors in the approximation.
  • The excerpt notes this is "not desirable" because it would imply the method becomes unstable as we refine the grid.

🔍 The effective condition number

🔍 Improved error estimate

The original error estimate (equation 7.8) was "too pessimistic." A more realistic estimate starts from:

  • ||Δw|| = ||A⁻¹ Δf|| ≤ (1 / |λ|_min) ||Δf||

This leads to a relative error bound:

  • ||Δw|| / ||w|| ≤ (1 / λ_min) · (||f|| / ||w||) · (||Δf|| / ||f||)

📊 Definition and behavior

Effective condition number: κ_eff(A) = (1 / λ_min) · (||f|| / ||w||)

The effective condition number replaces the role of κ(A) in the error estimate.

Why it stays bounded:

  • λ_min ≈ π² (does not depend on Δx)
  • In many applications, the ratio ||f|| / ||w|| remains bounded as Δx tends to zero
  • Therefore, κ_eff(A) is bounded, unlike the standard condition number

⚖️ Trade-off: accuracy vs. computation

| Aspect | Standard κ(A) | Effective κ_eff(A) |
|---|---|---|
| Accuracy of estimate | Too pessimistic | More accurate |
| Ease of computation | Easier (only eigenvalues) | Harder (requires estimate of w) |
| Behavior as Δx → 0 | Unbounded | Bounded (in many applications) |

Don't confuse: The effective condition number is not always easy to use in practice because it requires an estimate for w (the solution), which is what we are trying to compute in the first place.

🔗 Connection to the boundary-value problem

🔗 Context from Example 7.2.1

The analysis revisits a boundary-value problem of the form:

  • −y'' + q(x)y = f(x), for 0 < x < 1
  • With boundary conditions y(0) = 0, y(1) = 0

The discretization creates a matrix A whose conditioning affects how errors in the right-hand side f propagate to errors in the numerical solution.

🔗 Practical significance

  • The bounded effective condition number means the finite-difference method is more robust than the standard condition number suggests.
  • Refining the grid (smaller Δx) does not necessarily make the problem ill-conditioned in practice.
  • Example: Even though κ(A) grows without bound, the actual error amplification (measured by κ_eff(A)) remains controlled in typical applications.

7.6 Neumann boundary condition

🧭 Overview

🧠 One-sentence thesis

The Neumann boundary condition can be discretized using a virtual point outside the domain, and although this introduces a local truncation error of order O(Δx) at the boundary, the global truncation error remains O(Δx²).

📌 Key points (3–5)

  • What changes with Neumann: unlike Dirichlet conditions where boundary values are known, Neumann conditions specify the derivative at the boundary, requiring an extra equation for the boundary node.
  • Virtual point technique: a fictitious point outside the domain is introduced to discretize the Neumann condition using central differences.
  • Local vs global error: the Neumann boundary discretization has local truncation error O(Δx), worse than the interior O(Δx²), but the global error still achieves O(Δx²).
  • Common confusion: don't assume that a lower-order local error at one point ruins the overall accuracy—the global error can still be second-order.

🔧 Setting up the Neumann problem

🔧 The modified boundary-value problem

The example modifies a standard problem to include a Neumann condition on the right boundary:

  • Differential equation: −y″ + q(x)y = f(x) for 0 < x < 1
  • Left boundary (Dirichlet): y(0) = 0
  • Right boundary (Neumann): y′(1) = 0

Key difference from Dirichlet: the value w_{n+1} at the right boundary is no longer known; instead, the derivative at that point is specified.

📐 System size adjustment

  • The system must now include an equation for j = n + 1 (the right boundary node).
  • Result: n + 1 equations with n + 1 unknowns (w₁, …, w_{n+1}).
  • Without this extra equation, the system would be underdetermined.

🎭 The virtual point method

🎭 Introducing the fictitious node

To handle the unknown w_{n+2} that appears in the finite-difference equation at j = n + 1:

  • Define a virtual point x_{n+2} = 1 + Δx, which lies outside the physical domain [0, 1].
  • This point does not represent a real solution value but serves as a mathematical tool.

Example: if the domain is [0, 1] and Δx = 0.1, then x_{n+1} = 1 is the right boundary and x_{n+2} = 1.1 is the virtual point.

🔗 Discretizing the Neumann condition

The Neumann boundary condition y′(1) = 0 is approximated using central differences:

  • Central difference formula: (w_{n+2} − w_n) / (2Δx) = 0
  • This implies: w_{n+2} = w_n

Why this works: the central difference uses points symmetrically around x_{n+1}, and setting the derivative to zero means the solution is locally flat.

🧮 Substituting into the boundary equation

The finite-difference equation at j = n + 1 is:

(−w_n + 2w_{n+1} − w_{n+2}) / Δx² + q_{n+1} w_{n+1} = f_{n+1}

After substituting w_{n+2} = w_n and dividing by 2 (to preserve matrix symmetry):

(−w_n + w_{n+1}) / Δx² + (1/2) q_{n+1} w_{n+1} = (1/2) f_{n+1}

Note the factor of 1/2: this adjustment keeps the coefficient matrix symmetric, which is important for numerical stability and efficiency.

📊 Matrix structure and system

📊 The resulting linear system

The system Aw = f has a symmetric matrix A = K + M, where:

| Component | Structure | Notes |
|---|---|---|
| K | Tridiagonal with 2 on diagonal, −1 on off-diagonals; bottom-right entry is 1 (not 2) | Scaled by 1/Δx² |
| M | Diagonal with q₁, …, qₙ, q_{n+1}/2 | Last entry is halved |
| f | Right-hand side with f₁, …, fₙ, f_{n+1}/2 | Last entry is halved |

Key observation: the last row and last entry of f are both scaled by 1/2 due to the Neumann discretization.

🔍 Error analysis

🔍 Local truncation error at the boundary

Using Taylor expansions of y_n and y_{n+2} about x_{n+1}:

  • y_n = y_{n+1} − Δx·y′_{n+1} + (Δx²/2)·y″_{n+1} − (Δx³/6)·y‴_{n+1} + O(Δx⁴)
  • y_{n+2} = y_{n+1} + Δx·y′_{n+1} + (Δx²/2)·y″_{n+1} + (Δx³/6)·y‴_{n+1} + O(Δx⁴)

Since y′_{n+1} = 0 (the Neumann condition), these simplify to:

y_{n+2} = y_n + O(Δx³)

Implication: replacing y_{n+2} by y_n introduces an O(Δx³) error in the numerator of the difference formula; after division by Δx², the local truncation error at the boundary is of order O(Δx).

🔍 Comparison with interior error

LocationLocal truncation errorSource
Interior nodes (j = 1, …, n)O(Δx²)Standard central difference
Neumann boundary (j = n+1)O(Δx)Virtual point approximation

Don't confuse: the boundary has a worse local error, but this does not determine the global error.

🎯 Global truncation error remains O(Δx²)

The total local truncation error can be written as:

ε = Ay − f = Δx²·u + Δx·v

where:

  • Δx²·u comes from interior nodes
  • Δx·v comes from the Neumann boundary, with v = (0, …, 0, v_{n+1})ᵀ and v_{n+1} = O(1)

The global error e = A⁻¹ε splits into two parts:

  1. e⁽¹⁾ = A⁻¹(Δx²·u): by construction, ‖e⁽¹⁾‖ = O(Δx²)
  2. e⁽²⁾ = A⁻¹(Δx·v): can be shown to satisfy e⁽²⁾_j = Δx³·v_{n+1}·j for j = 1, …, n+1

🧮 Why e⁽²⁾ is also O(Δx²)

The excerpt verifies that Ae⁽²⁾ = Δx·v by checking each component:

  • For j = 2, …, n: (Ae⁽²⁾)_j = 0 (interior rows cancel)
  • For j = 1: (Ae⁽²⁾)₁ = 0
  • For j = n+1: (Ae⁽²⁾)_{n+1} = Δx·v_{n+1} (boundary row)

Since jΔx ≤ 1 (all nodes lie in the unit interval), we have:

|e⁽²⁾_j| = Δx²·(jΔx)·|v_{n+1}| ≤ Δx²·|v_{n+1}|

Therefore ‖e⁽²⁾‖ = O(Δx²), and the global truncation error ‖e‖ = O(Δx²).

Key insight: even though the local error at the boundary is only first-order, its contribution to the global error is second-order because the error propagates through the inverse of the matrix, which effectively integrates the local error and improves its order.


7.7 The general problem *

🧭 Overview

🧠 One-sentence thesis

A general second-order boundary-value problem with variable coefficients and mixed boundary conditions can be discretized using finite differences by introducing a virtual grid point to handle the Neumann boundary condition, yielding a linear system that may be non-symmetric when a first-order derivative term is present.

📌 Key points (3–5)

  • The general problem form: a second-order ODE with variable coefficients p(x), r(x), q(x), a Dirichlet condition at x=0, and a Neumann-type condition at x=1.
  • Interior discretization: uses central differences for both the second-order term (with variable p) and the first-order term (with coefficient r).
  • Handling the Neumann boundary: a virtual grid point x_{n+2} is introduced, and the boundary condition p(1)y'(1)=β is discretized by averaging two derivative approximations to eliminate the virtual point.
  • Common confusion: when r(x) ≠ 0 (first-order derivative present), the discretization matrix is not symmetric, unlike the pure second-order case.
  • Why it matters: this framework covers a broad class of problems, but the presence of first-order derivatives (convection terms) can lead to physically unacceptable approximations if not handled carefully.

🧮 Problem statement and assumptions

🧮 The general boundary-value problem

The section considers a boundary-value problem on the interval [0, 1]:

General BVP:
− (p(x) y'(x))' + r(x) y'(x) + q(x) y(x) = f(x), 0 < x < 1,
y(0) = α,
p(1) y'(1) = β.

  • Variable coefficients: p(x), r(x), q(x) are functions of x.
  • Key assumption: p(x) > 0 on [0, 1] (ensures well-posedness).
  • Boundary conditions: Dirichlet at the left endpoint (y(0) = α) and a Neumann-type condition at the right endpoint involving p(1) and the derivative.

🔍 Structure of the equation

  • The term − (p(x) y'(x))' is a variable-coefficient second-order term (generalization of −y'').
  • The term r(x) y'(x) is a first-order derivative term (often called a convection or transport term).
  • The term q(x) y(x) is a zeroth-order (reaction) term.
  • Example: when p(x)=1, r(x)=0, q(x)=constant, this reduces to the simpler problems studied earlier.

🔢 Discretization in the interior

🔢 Interior node discretization

At an interior grid point x_j (j = 1, ..., n+1), the discretization reads:

Interior scheme (7.17):
[− p_{j+1/2} (w_{j+1} − w_j) + p_{j−1/2} (w_j − w_{j−1})] / Δx² + r_j (w_{j+1} − w_{j−1}) / (2Δx) + q_j w_j = f_j.

  • Second-order term: uses a variable-coefficient centered difference with p evaluated at half-integer points (p_{j±1/2}).
    • This approximates − (p y')' by differencing the flux p y' at cell interfaces.
  • First-order term: uses a standard central difference (w_{j+1} − w_{j−1}) / (2Δx) for y'.
  • Zeroth-order term: simply q_j w_j.

⚠️ Non-symmetry when r ≠ 0

  • The excerpt explicitly notes: if r_j ≠ 0, the discretization matrix is not symmetric.
  • Why: the central difference for the first-order term couples w_{j+1} and w_{j−1} asymmetrically (with opposite signs), breaking the symmetry that arises when r=0.
  • Don't confuse: the pure second-order problem (r=0) yields a symmetric matrix; adding a first-order term destroys this property.

🎯 Handling boundary conditions

🎯 Dirichlet boundary at x=0

  • The condition y(0) = α translates directly to w_0 = α.
  • This value is substituted directly into the interior discretization (7.17) for j=1.
  • No special treatment needed; it is a simple known value.

🎯 Neumann boundary at x=1: the virtual grid point

At j = n+1 (the right boundary node), the discretization initially involves a virtual grid point x_{n+2} outside the domain:

Initial discretization at j=n+1 (7.18):
[− p_{n+3/2} (w_{n+2} − w_{n+1}) + p_{n+1/2} (w_{n+1} − w_n)] / Δx² + r_{n+1} (w_{n+2} − w_n) / (2Δx) + q_{n+1} w_{n+1} = f_{n+1}.

  • The term − p_{n+3/2} (w_{n+2} − w_{n+1}) involves the unknown w_{n+2}.
  • Goal: eliminate w_{n+2} using the boundary condition p(1) y'(1) = β.

🔧 Averaging the derivative approximations

To remove w_{n+2}, the boundary condition is discretized by averaging two derivative approximations:

  • Approximate p_{n+1/2} y'(1 − Δx/2) ≈ p_{n+1/2} (w_{n+1} − w_n) / Δx.
  • Approximate p_{n+3/2} y'(1 + Δx/2) ≈ p_{n+3/2} (w_{n+2} − w_{n+1}) / Δx.
  • Average them:
    (1/2) [p_{n+1/2} (w_{n+1} − w_n)/Δx + p_{n+3/2} (w_{n+2} − w_{n+1})/Δx] = β.

Rearranging gives:

p_{n+3/2} (w_{n+2} − w_{n+1}) = − p_{n+1/2} (w_{n+1} − w_n) + 2Δx β.

🔧 Final discretization at the boundary

Substitute this expression into (7.18) and use (w_{n+2} − w_n) / (2Δx) = β / p(1) (from the boundary condition) to obtain:

Final boundary equation:
2 p_{n+1/2} (w_{n+1} − w_n) / Δx² + r_{n+1} β / p(1) + q_{n+1} w_{n+1} = f_{n+1} + 2β / Δx.

  • This equation involves only interior unknowns (w_n, w_{n+1}) and known boundary data (β, α).
  • The virtual point w_{n+2} has been eliminated.
  • Example: if β=0 (homogeneous Neumann), the term simplifies further.

🧪 Implications and warnings

🧪 When the method works well

  • The discretization is consistent and yields a linear system for the unknowns w_1, ..., w_{n+1}.
  • For problems with r(x)=0 (no first-order term), the matrix is symmetric and often well-conditioned.
  • The framework is general and applies to many physical problems (heat conduction with variable conductivity, reaction-diffusion, etc.).

⚠️ Warning: first-order derivatives and unacceptable approximations

The excerpt ends with a preview of Section 7.8:

  • When the problem contains a first-order derivative (r(x) ≠ 0), applying a central-difference scheme may lead to physically unacceptable approximations.
  • This is especially problematic for convection-diffusion equations (where the first-order term represents transport/convection).
  • The issue will be illustrated both analytically and numerically in the next section.
  • Don't confuse: central differences are standard and accurate for many problems, but they can fail (produce oscillations or non-physical solutions) when convection dominates.

| Aspect | Pure second-order (r=0) | With first-order term (r≠0) |
| --- | --- | --- |
| Matrix symmetry | Symmetric | Not symmetric |
| Central difference | Usually stable and accurate | May produce unacceptable approximations |
| Physical context | Diffusion, elasticity | Convection-diffusion, transport |

7.8 Convection-Diffusion equation

🧭 Overview

🧠 One-sentence thesis

Careless discretization of convection-diffusion equations can produce spurious oscillations, but upwind schemes eliminate these oscillations at the cost of lower accuracy.

📌 Key points (3–5)

  • The problem: central-difference discretization of convection-diffusion equations can produce physically unacceptable oscillatory approximations.
  • When oscillations occur: the numerical solution becomes oscillatory when the product of velocity magnitude and step size exceeds 2 (|v|Δx > 2).
  • The remedy: upwind discretization uses one-sided differences and guarantees non-oscillatory solutions for all step sizes.
  • Common confusion: upwind methods are physically correct but have lower accuracy—global truncation error is O(Δx) instead of O(Δx²).
  • Why it matters: the exact solution is strictly monotone, so any numerical oscillations are spurious and must be rejected.

🔥 The convection-diffusion model

🔥 What the equation represents

Convection-diffusion equation: a boundary-value problem combining heat conduction (diffusion, the term −y″) and transport by a flowing medium (convection, the term vy′).

  • The excerpt uses the example: −y″(x) + vy′(x) = 0 on the interval 0 < x < 1, with boundary conditions y(0) = 0 and y(1) = 1.
  • The parameter v is the velocity (a real number).
  • The exact solution is y(x) = (e^(vx) − 1) / (e^v − 1).
  • This solution increases strictly over [0, 1], meaning it never decreases and has no oscillations.

🚫 Why oscillations are unacceptable

  • The exact solution is strictly increasing (monotone).
  • Any numerical approximation that contains oscillations is physically incorrect and must be rejected.
  • The excerpt emphasizes that "spurious oscillations" arise from careless discretization.

⚠️ Central-difference discretization and its failure

⚠️ How central differences work

  • At interior nodes x_j, the central-difference scheme approximates:
    • Second derivative: −(w_(j−1) − 2w_j + w_(j+1)) / Δx²
    • First derivative: v(w_(j+1) − w_(j−1)) / (2Δx)
  • This leads to a linear system Aw = f.

📉 When central differences fail

  • The discrete solution has the form w_j = ar₁^j + br₂^j, where r₁ = 1 and r₂ = (1 + vΔx/2) / (1 − vΔx/2).
  • Oscillations occur when r₂ < 0, which happens when |v|Δx > 2.
  • If r₂ is negative, then r₂^j alternates sign: positive for even j, negative for odd j.
  • Example: with v = 30 and Δx = 0.1, we have |v|Δx = 3 > 2, so the central-difference approximation oscillates (see Figure 7.5).

🔍 The stability condition

  • To obtain monotone (non-oscillatory) solutions, we require |v|Δx < 2.
  • In practice, velocity v is given, so this may force impractically small step sizes: Δx < 2/|v|.
  • Don't confuse: this is not a convergence condition but a condition to avoid spurious oscillations.

🌬️ Upwind discretization as the remedy

🌬️ How upwind differences work

Upwind discretization: approximate vy′_j using one-sided differences based on the sign of v.

  • If v ≥ 0 (flow to the right): use the upwind difference v(w_j − w_(j−1)) / Δx.
  • If v < 0 (flow to the left): use the downwind difference v(w_(j+1) − w_j) / Δx.
  • The idea is to use information from the "upstream" direction.

✅ Why upwind differences succeed

  • For v ≥ 0, the discretization becomes: −(w_(j−1) − 2w_j + w_(j+1)) / Δx² + v(w_j − w_(j−1)) / Δx = 0.
  • Solving for the roots: r₁ = 1 and r₂ = 1 + vΔx.
  • Since r₂ = 1 + vΔx > 0 for all Δx > 0, the solution is non-oscillatory for all step sizes.
  • Example: with v = 30 and Δx = 0.1, the upwind approximation is monotone (Figure 7.5).

⚖️ The trade-off: accuracy vs. stability

| Scheme | Oscillations? | Global truncation error | When to use |
| --- | --- | --- | --- |
| Central-difference | Yes, if \|v\|Δx > 2 | O(Δx²) | Only when \|v\|Δx < 2 and higher accuracy is needed |
| Upwind | No, for all Δx > 0 | O(Δx) | When \|v\|Δx ≥ 2 or physical correctness is critical |
  • The price to pay: upwind methods guarantee physically meaningful approximations but have lower order of accuracy.
  • Don't confuse: upwind is not "better" in all cases—it sacrifices accuracy for stability.
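The contrast is easy to reproduce. A minimal numpy sketch (our own code, not from the text) solves the model problem with both schemes for v = 30, Δx = 0.1 and tests monotonicity of the result:

```python
import numpy as np

def solve_conv_diff(v, n, scheme):
    """Solve -y'' + v y' = 0, y(0)=0, y(1)=1 with central or upwind
    differences for the convection term (v >= 0 assumed for 'upwind')."""
    h = 1.0 / (n + 1)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for j in range(n):
        if scheme == "central":
            lo, di, up = -1/h**2 - v/(2*h), 2/h**2, -1/h**2 + v/(2*h)
        else:  # upwind, v >= 0
            lo, di, up = -1/h**2 - v/h, 2/h**2 + v/h, -1/h**2
        if j > 0:
            A[j, j - 1] = lo
        A[j, j] = di
        if j < n - 1:
            A[j, j + 1] = up
        else:
            b[j] -= up * 1.0          # boundary value y(1) = 1
    return np.linalg.solve(A, b)

w_c = solve_conv_diff(30.0, 9, "central")   # vΔx = 3 > 2
w_u = solve_conv_diff(30.0, 9, "upwind")
# central: False (oscillates), upwind: True (monotone)
print(np.all(np.diff(w_c) > 0), np.all(np.diff(w_u) > 0))
```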

📊 Numerical examples

📊 Example 7.8.1: Monotone exact solution

  • Problem: −y″ + vy′ = 0 with y(0) = 0, y(1) = 1.
  • Exact solution: y(x) = (e^(vx) − 1) / (e^v − 1), which is strictly increasing.
  • With v = 30 and Δx = 0.1 (so vΔx = 3 > 2):
    • Central-difference: produces oscillations (physically unacceptable).
    • Upwind: produces a monotone approximation (physically correct).

📊 Example 7.8.2: Non-monotone exact solution

  • Problem: −y″ + vy′ = 1 with y(0) = 0, y(1) = 0.
  • Exact solution: y(x) = (1/v)(x − (1 − e^(vx)) / (1 − e^v)).
  • This solution is not monotone, but similar behavior is observed:

| Velocity v | vΔx | Central-difference behavior | Upwind behavior |
| --- | --- | --- | --- |
| v = 10 | 1 (< 2) | No oscillations, more accurate (O(Δx²)) | Less accurate (O(Δx)) but stable |
| v = 20 | 2 | Boundary condition at x=1 has no effect on the rest; still accurate | Stable |
| v = 100 | 10 (> 2) | Oscillations, large errors | More accurate than central |

🔬 Special case: vΔx = 2

  • When vΔx = 2, the discretization at node x_n simplifies: the term with w_(n+1) cancels out.
  • This means the boundary condition at x = 1 does not affect the rest of the approximation.
  • The excerpt notes that for Example 7.8.2 with v = 20, the result is "still quite accurate" despite this decoupling.

🔧 Practical considerations

🔧 When to use each method

  • Use central-difference when:
    • |v|Δx < 2 (stability condition is satisfied).
    • Higher accuracy (O(Δx²)) is needed.
  • Use upwind when:
    • |v|Δx ≥ 2 (central-difference would oscillate).
    • Physical correctness (no spurious oscillations) is more important than accuracy.
    • Reasonable step sizes are required (avoiding impractically small Δx).

🔧 Advanced methods

  • The excerpt states: "in practice more advanced methods are used to solve convection-diffusion equations."
  • No further details are provided, but this suggests that neither central nor upwind is the final solution in real applications.

7.9 Nonlinear Boundary-Value Problems

🧭 Overview

🧠 One-sentence thesis

Nonlinear boundary-value problems are solved by iterative methods that generate a sequence of linear problems, each of which can be solved using finite-difference schemes until the iterates converge to the true solution.

📌 Key points (3–5)

  • The core problem: a nonlinear differential equation with boundary conditions that cannot be solved directly by linear methods.
  • The iterative strategy: generate a sequence of approximations y⁽⁰⁾, y⁽¹⁾, y⁽²⁾, … where each iterate solves a linear problem derived from the previous iterate.
  • Two main methods: Picard iteration (substitutes previous iterate into the nonlinear term) and Newton-Raphson (linearizes the nonlinear term around the previous iterate).
  • Common confusion: both methods produce linear boundary-value problems at each step, but Newton-Raphson uses derivative information (partial derivatives of g) while Picard simply evaluates g at the previous iterate.
  • Why it matters: this approach lets us reuse finite-difference schemes for linear problems to tackle nonlinear ones.

🎯 The nonlinear boundary-value problem

🎯 Problem structure

The excerpt presents the general form:

− (py′)′ + g(y′, y, x) = 0, 0 < x < 1,
y(0) = 0, y(1) = 0.

  • The nonlinearity comes from the function g, which depends on the derivative y′, the function y itself, and the position x.
  • The boundary conditions are both zero at x = 0 and x = 1 (Dirichlet conditions).
  • Direct solution is not possible because g is nonlinear; we cannot write the problem as a simple matrix equation.

🔄 The iterative approach

  • The general strategy is to build a sequence of iterates: y⁽⁰⁾, y⁽¹⁾, y⁽²⁾, …
  • Each iterate is computed by solving a linear boundary-value problem.
  • As the iteration number m increases, y⁽ᵐ⁾ converges to the true solution y: y⁽ᵐ⁾ → y as m → ∞.
  • The derived linear problems are solved using the finite-difference schemes explained in earlier sections of the chapter.

🔁 Picard iteration

🔁 How Picard works

The Picard method approximates the solution by determining y⁽ᵐ⁾ from:

− (py⁽ᵐ⁾′)′ = − g(y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾, x),
y⁽ᵐ⁾(0) = 0, y⁽ᵐ⁾(1) = 0.

  • At iteration m, the nonlinear term g is evaluated using the previous iterate y⁽ᵐ⁻¹⁾.
  • This turns the right-hand side into a known function (no longer dependent on y⁽ᵐ⁾), making the equation linear in y⁽ᵐ⁾.
  • The finite-difference approach is used to approximate derivatives of both the (m−1)th and mth iterates.

🧩 Key idea

  • Substitute and solve: plug the old iterate into the nonlinear term, then solve a linear problem for the new iterate.
  • Example: if g(y′, y, x) = y², then at step m you solve − (py⁽ᵐ⁾′)′ = − (y⁽ᵐ⁻¹⁾)², which is linear in y⁽ᵐ⁾.
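A compact sketch of Picard iteration for exactly this kind of nonlinearity (g(y′, y, x) = y² − f(x) with zero Dirichlet data); the function name and the manufactured solution are ours, not from the text:

```python
import numpy as np

def picard_nonlinear_bvp(f, n, iters=30):
    """Picard iteration (minimal sketch) for -y'' + y**2 = f(x),
    y(0)=y(1)=0: each step solves the linear system T w = f - w_old**2."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    # Standard tridiagonal matrix for -y'' with Dirichlet conditions
    T = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    w = np.zeros(n)                        # initial iterate y^(0) = 0
    for _ in range(iters):
        w = np.linalg.solve(T, f(x) - w**2)
    return x, w

# Manufactured solution y = x(1-x): f = -y'' + y^2 = 2 + (x(1-x))^2
x, w = picard_nonlinear_bvp(lambda s: 2 + (s * (1 - s)) ** 2, 50)
print(np.max(np.abs(w - x * (1 - x))))
```

For this quadratic solution the central difference is exact at the nodes, so the iterates converge to the exact nodal values; in general a discretization error O(Δx²) remains.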

🔬 Newton-Raphson method

🔬 Linearization around the previous iterate

The Newton-Raphson method linearizes g about (y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾):

g(y′, y, x) ≈ g(y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾, x) + (y′ − y⁽ᵐ⁻¹⁾′) r + (y − y⁽ᵐ⁻¹⁾) q

where:

  • r = ∂g/∂y′ evaluated at (y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾, x)
  • q = ∂g/∂y evaluated at (y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾, x)

🔬 The iterative equation

The corresponding iterative method determines y⁽ᵐ⁾ from:

− (py⁽ᵐ⁾′)′ + r y⁽ᵐ⁾′ + q y⁽ᵐ⁾ = y⁽ᵐ⁻¹⁾′ r + y⁽ᵐ⁻¹⁾ q − g(y⁽ᵐ⁻¹⁾′, y⁽ᵐ⁻¹⁾, x),
y⁽ᵐ⁾(0) = 0, y⁽ᵐ⁾(1) = 0.

  • The left-hand side is linear in y⁽ᵐ⁾ and its derivative.
  • The right-hand side is known (computed from the previous iterate).
  • This is a linear boundary-value problem that can be solved with finite-difference methods.

🧩 Key idea

  • Use derivative information: Newton-Raphson computes the partial derivatives r and q to build a better linear approximation.
  • This typically leads to faster convergence than Picard, but requires computing and evaluating partial derivatives.
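The Newton-Raphson step for the same sample nonlinearity g = y² − f(x) (so r = ∂g/∂y′ = 0 and q = ∂g/∂y = 2y) can be sketched as follows; names and test problem are ours, not from the text:

```python
import numpy as np

def newton_nonlinear_bvp(f, n, iters=6):
    """Newton-Raphson sketch for -y'' + y**2 = f(x), y(0)=y(1)=0.
    Here g(y', y, x) = y**2 - f(x), so r = dg/dy' = 0 and q = dg/dy = 2y."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    T = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    w = np.zeros(n)
    for _ in range(iters):
        q = 2 * w                          # dg/dy at the previous iterate
        # (T + diag(q)) w_new = q*w_old - g(w_old) = w_old**2 + f(x)
        w = np.linalg.solve(T + np.diag(q), w**2 + f(x))
    return x, w

x, w = newton_nonlinear_bvp(lambda s: 2 + (s * (1 - s)) ** 2, 50)
print(np.max(np.abs(w - x * (1 - x))))
```

Note how few iterations are needed compared with Picard: quadratic convergence makes a handful of Newton steps sufficient.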

🔀 Comparing the two methods

| Method | What it does | Complexity | Typical convergence |
| --- | --- | --- | --- |
| Picard | Evaluates g at previous iterate | Simpler (no derivatives needed) | Slower (linear convergence) |
| Newton-Raphson | Linearizes g using partial derivatives | More complex (requires ∂g/∂y′ and ∂g/∂y) | Faster (quadratic convergence near solution) |

🔀 Don't confuse

  • Both methods solve a linear problem at each step, but the linear problem is different.
  • Picard: the nonlinear term is simply "frozen" at the old iterate.
  • Newton-Raphson: the nonlinear term is replaced by its first-order Taylor expansion around the old iterate.

The Instationary Heat Equation

8.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The time-dependent temperature distribution in a bar is governed by a parabolic partial differential equation that combines spatial heat conduction with temporal change, requiring both boundary conditions (temperatures at the ends) and an initial condition (starting temperature distribution).

📌 Key points (3–5)

  • What the problem describes: temperature evolution over time in a one-dimensional bar, combining methods from initial-value problems (Chapter 6) and boundary-value problems (Chapter 7).
  • Physical law foundation: the heat equation derives from energy conservation applied to a control volume, using Fourier's law for heat flow.
  • Three types of conditions needed: boundary conditions at both ends of the bar, plus an initial temperature distribution at time zero.
  • Key physical insight: heat flow density is proportional to the negative spatial temperature gradient (Fourier's law), meaning heat flows from hot to cold.
  • Common confusion: this is an initial-boundary-value problem, not just a boundary-value problem—time evolution requires knowing the starting state.

🔥 Physical setup and modeling goal

🔥 The bar and what we want to find

  • A bar of length L with cross-sectional area A.
  • We want to determine T(x, t): temperature as a function of position x along the bar and time t (measured in Kelvin).
  • The problem combines spatial variation (along the bar) with temporal evolution (how temperature changes over time).

🌡️ Known information (conditions)

The problem requires three pieces of information:

| Condition type | What is specified | Notation |
| --- | --- | --- |
| Left boundary | Temperature at x = 0 for all time | T(0, t) = T_l(t) |
| Right boundary | Temperature at x = L for all time | T(L, t) = T_r(t) |
| Initial condition | Temperature everywhere at t = 0 | T(x, 0) = T₀(x) |

  • Don't confuse: boundary conditions fix temperature at the ends for all time; the initial condition fixes temperature everywhere at the starting moment.

🔧 Heat production

  • Q(x, t) denotes internal heat production within the bar, measured in joules per cubic meter per second (J/(m³s)).
  • This represents heat generated inside the material (e.g., from chemical reactions or electrical resistance).

⚖️ Deriving the heat equation from energy conservation

⚖️ Control volume approach

The derivation applies the energy conservation law to a small slice of the bar:

  • Spatial slice: between positions x and x + Δx
  • Time interval: between times t and t + Δt

This control volume method tracks how energy enters, leaves, and accumulates in the slice.

🌊 Fourier's law of heat conduction

Fourier's law: The heat flow density is given by q(x, t) = -λ ∂T/∂x(x, t), where λ (measured in J/(msK)) is the heat-conduction coefficient.

  • Heat flow density q is the rate of heat transfer per unit area.
  • The negative sign is crucial: heat flows opposite to the temperature gradient (from hot to cold).
  • Example: if temperature increases in the positive x direction (∂T/∂x > 0), then q is negative, meaning heat flows in the negative x direction (toward lower temperature).

📐 Energy balance equation

The energy balance states:

Energy at time t + Δt = Energy at time t + Heat flowing in at x − Heat flowing out at x + Δx + Heat produced

Written mathematically:

  • ρcT(x, t + Δt)AΔx = ρcT(x, t)AΔx − λ(∂T/∂x)(x, t)AΔt + λ(∂T/∂x)(x + Δx, t)AΔt + Q(x, t)AΔxΔt

Where:

  • ρ (kg/m³) is mass density
  • c (J/(kgK)) is specific heat (energy needed to raise temperature)
  • ρcT represents thermal energy per unit volume

🔄 From discrete to continuous

After dividing by AΔxΔt and rearranging:

  • ρc [T(x, t + Δt) − T(x, t)]/Δt = λ [(∂T/∂x)(x + Δx, t) − (∂T/∂x)(x, t)]/Δx + Q

Taking limits as Δx → 0 and Δt → 0:

  • The left side becomes the time derivative ∂T/∂t
  • The right side becomes the second spatial derivative ∂²T/∂x²
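As a tiny numerical illustration of this limit (our own check, using the smooth test field T(x, t) = e⁻ᵗ sin x with ρ = c = λ = 1 and Q = 0), the difference quotients in the balance approach ∂T/∂t and ∂²T/∂x² as the steps shrink:

```python
import numpy as np

# Test field T(x,t) = exp(-t) sin(x): satisfies the heat equation with Q = 0.
def time_quotient(dt, x0=0.5, t0=0.3):
    T = lambda x, t: np.exp(-t) * np.sin(x)
    return (T(x0, t0 + dt) - T(x0, t0)) / dt        # -> dT/dt as dt -> 0

def flux_quotient(dx, x0=0.5, t0=0.3):
    Tx = lambda x, t: np.exp(-t) * np.cos(x)        # exact dT/dx
    return (Tx(x0 + dx, t0) - Tx(x0, t0)) / dx      # -> d2T/dx2 as dx -> 0

exact = -np.exp(-0.3) * np.sin(0.5)     # dT/dt = d2T/dx2 for this field
print(abs(time_quotient(1e-4) - exact) < 1e-3)      # True
print(abs(flux_quotient(1e-4) - exact) < 1e-3)      # True
```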

📋 The complete initial-boundary-value problem

📋 The parabolic PDE system

The final problem consists of three parts:

Governing equation (heat equation):

  • ρc ∂T/∂t = λ ∂²T/∂x² + Q for 0 < x < L and 0 < t

Boundary conditions:

  • T(0, t) = T_l(t) and T(L, t) = T_r(t) for 0 ≤ t

Initial condition:

  • T(x, 0) = T₀(x) for 0 ≤ x ≤ L

🔍 Why "parabolic"

  • The excerpt classifies this as a parabolic partial differential equation.
  • This classification comes from the form: first-order in time, second-order in space.
  • The problem type determines which numerical methods are appropriate.

🧩 Combining previous methods

  • The excerpt states that solving this problem will combine methods from Chapters 6 and 7:
    • Chapter 6 methods handle time evolution (initial-value problems for ODEs)
    • Chapter 7 methods handle spatial discretization (boundary-value problems)
  • This combination is necessary because the problem has both temporal dynamics and spatial structure.

8.2 Semi-Discretization

🧭 Overview

🧠 One-sentence thesis

Semi-discretization solves parabolic partial differential equations by first discretizing only the spatial direction to produce a system of ordinary differential equations in time, which can then be integrated using time-stepping methods.

📌 Key points (3–5)

  • What semi-discretization is: discretizing only the spatial variable (x-direction) while leaving time continuous, producing a system of ODEs.
  • How it works: apply finite-difference methods to the spatial derivatives, converting a PDE into a system of first-order ODEs in time.
  • The two-stage process: space discretization (Chapter 7 methods) followed by time discretization (Chapter 6 methods).
  • Common confusion: semi-discretization is not full discretization—time remains continuous until a second discretization step is applied.
  • Stability insight: the eigenvalues of the spatial discretization matrix determine stability of the semi-discrete system.

🔥 The heat-conduction problem setup

🔥 Physical model

The excerpt derives the heat equation from energy conservation in a bar:

  • A bar of length L with cross-sectional area A.
  • Temperature T(x, t) varies in space (x) and time (t).
  • Boundary conditions: temperature known at both ends, T(0, t) = T_l and T(L, t) = T_r.
  • Initial condition: T(x, 0) = T_0(x).
  • Heat production Q(x, t) inside the bar.

🌡️ Fourier's law and energy balance

Fourier's law: the heat flow density is q(x, t) = −λ ∂T/∂x(x, t), where λ is the heat-conduction coefficient.

  • Energy balance on a control volume between x and x + Δx over time interval Δt accounts for:
    • Energy stored (ρ c T, where ρ is mass density and c is specific heat).
    • Heat flow in and out at boundaries.
    • Internal heat production Q.
  • Taking limits as Δx → 0 and Δt → 0 yields the heat equation: ρ c ∂T/∂t = λ ∂²T/∂x² + Q.

📐 The initial-boundary-value problem

The full problem to solve is:

  • PDE: ρ c ∂T/∂t = λ ∂²T/∂x² + Q for 0 < x < L, 0 < t.
  • Boundary conditions: T(0, t) = T_l(t), T(L, t) = T_r(t) for 0 ≤ t.
  • Initial condition: T(x, 0) = T_0(x) for 0 ≤ x ≤ L.

This is a parabolic PDE requiring methods from both spatial (Chapter 7) and temporal (Chapter 6) discretization.

🧩 What semi-discretization means

🧩 Definition and purpose

Semi-discretization (or method of lines): discretization applied to the spatial x-direction but not to the temporal t-direction.

  • The goal is to convert a PDE (which depends on both x and t) into a system of ODEs (which depend only on t).
  • "Semi" means halfway: only one dimension is discretized at this stage.
  • Time remains a continuous variable until a second discretization step is applied.

🔢 The discretization process

Spatial discretization:

  • Divide the spatial interval [0, 1] into n + 1 equal parts with length Δx.
  • Nodes are x_i = i Δx for i = 0, ..., n + 1.
  • Approximate y(x_i, t) by u_i(t).
  • The vector u(t) = (u_1(t), ..., u_n(t))^T contains the unknowns (boundary values omitted because they are known).

Applying finite-difference methods (from Chapter 7):

  • Replace the spatial derivative ∂²y/∂x² with a finite-difference approximation.
  • This produces a system of first-order ordinary differential equations:
    • du/dt = K u + r for 0 < t ≤ T,
    • u(0) = y_0 (initial condition).

📊 The resulting ODE system

The matrix K and vector r are:

  • K = (1/Δx²) times a tridiagonal matrix with −2 on the diagonal and 1 on the super- and sub-diagonals.
  • r = (1/Δx²) (y_L(t), 0, ..., 0, y_R(t))^T, incorporating the boundary conditions.

Why this form:

  • The matrix K comes from the finite-difference approximation of the second spatial derivative.
  • The sign of K differs from the boundary-value problem matrix in Chapter 7 because here we consider ∂²y/∂x² directly, not −y''.

🛡️ Stability of the semi-discrete system

🛡️ Eigenvalue condition

The excerpt states:

  • The eigenvalues λ_j of K satisfy λ_j ≤ 0 for all j = 1, ..., n.
  • This can be verified using the Gershgorin circle theorem.

Why it matters:

  • Non-positive eigenvalues imply stability of the semi-discrete system.
  • The system will not grow unboundedly in time due to the spatial discretization alone.

Don't confuse:

  • This is stability of the semi-discrete system (after spatial discretization only).
  • Full stability also depends on the time integration method chosen in the next step.
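A quick numerical confirmation (our own sketch, not from the text): build K for a modest n and check that all eigenvalues are non-positive and lie within the Gershgorin bound −4/Δx² ≤ λ ≤ 0:

```python
import numpy as np

n = 20
dx = 1.0 / (n + 1)
# Tridiagonal K from semi-discretizing d2y/dx2 with Dirichlet boundaries
K = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / dx**2
eigs = np.linalg.eigvalsh(K)
print(np.all(eigs <= 0))            # True: semi-discrete system is stable
print(eigs.min() >= -4 / dx**2)     # True: Gershgorin bound holds
```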

⏱️ Time integration (the second stage)

⏱️ Completing the discretization

After semi-discretization, the system du/dt = K u + r is a system of ODEs in time.

  • This system can be integrated using methods from Chapter 6 (e.g., Forward Euler, Backward Euler, Runge-Kutta).
  • The solution is approximated at discrete times t_j = j Δt, using m time steps of length Δt = T/m.

🔄 Explicit vs implicit methods

The excerpt highlights a clear difference:

  • Explicit methods (e.g., Forward Euler): compute the next time step using only current values; simpler but may require smaller time steps for stability.
  • Implicit methods (e.g., Backward Euler): involve solving a system at each time step; more stable but computationally more expensive.

📍 Notation for the fully discrete solution

  • The numerical approximation at position i Δx and time j Δt is denoted w_i^j.
  • This approximates u_i(t_j), which in turn approximates y(x_i, t_j).

Example: To solve the heat-conduction problem in Example 8.2.1:

  1. Apply finite differences in space → obtain du/dt = K u + r.
  2. Apply a time-stepping method (e.g., Forward Euler) → compute w_i^j for each time step j and spatial node i.

8.3 Time integration

🧭 Overview

🧠 One-sentence thesis

Time integration methods (Forward Euler, Backward Euler, and Crank-Nicolson) convert the semi-discretized heat equation into fully discrete approximations, but they differ critically in stability requirements and computational cost.

📌 Key points (3–5)

  • Semi-discretization produces an ODE system: spatial discretization (finite differences in x) yields a system of first-order ODEs in time, which then needs time integration.
  • Three methods compared: Forward Euler (explicit, conditionally stable), Backward Euler (implicit, unconditionally stable), and Trapezoidal/Crank-Nicolson (implicit, unconditionally stable, higher accuracy).
  • Stability vs cost trade-off: Forward Euler is cheap per step but requires very small time steps (Δt ≤ Δx²/2); implicit methods allow larger steps but require solving linear systems.
  • Common confusion—halving spatial step: when Δx is halved, Forward Euler requires Δt to be divided by four (not two) to maintain stability, because the stability bound is Δt ≤ Δx²/2.
  • Truncation error balance: choosing Δt and Δx so that O(Δt) and O(Δx²) errors have similar magnitude guides step-size selection.

🔥 The heat-conduction problem and semi-discretization

🔥 The PDE and boundary/initial conditions

The excerpt considers the heat-conduction equation:

  • Partial differential equation: ∂y/∂t = ∂²y/∂x² for 0 < x < 1, 0 < t ≤ T.
  • Boundary conditions: y(0, t) = y_L(t) and y(1, t) = y_R(t).
  • Initial condition: y(x, 0) = y₀(x).

This is a parabolic PDE describing how temperature evolves in space and time.

🧮 Spatial discretization (finite differences)

  • The interval [0, 1] is divided into n+1 equal parts with step size Δx.
  • Nodes are x_i = i Δx, i = 0, …, n+1.
  • The numerical approximation at (x_i, t) is denoted u_i(t).
  • The vector u(t) = (u₁(t), …, uₙ(t))ᵀ omits boundary values (which are known).

Using finite-difference techniques from Chapter 7, the second derivative in x is replaced by a difference formula, yielding a system of ODEs.

📐 The semi-discrete system

Semi-discretization (method of lines): discretization applied to the spatial direction but not yet to time.

The result is:

  • du/dt = Ku + r, 0 < t ≤ T
  • u(0) = y₀

where:

  • K is an n×n tridiagonal matrix with entries (1/Δx²) times [−2 on diagonal, 1 on super- and sub-diagonals].
  • r = (1/Δx²)(y_L(t), 0, …, 0, y_R(t))ᵀ incorporates boundary conditions.

Key property: all eigenvalues λ_j of K satisfy λ_j ≤ 0, which implies the semi-discrete system is analytically stable.

Sign note: because the original PDE has +∂²y/∂x², the matrix K differs in sign from the boundary-value problem matrix in Chapter 7 (which had −y'').

⏩ Forward Euler method

⏩ The scheme

From Chapter 6, the Forward Euler approximation at time t_{j+1} is:

  • w^{j+1} = w^j + Δt(Kw^j + r^j)

This is an explicit method: w^{j+1} is computed directly from known values at time t_j.

Dependency: w^{j+1}_i depends on approximations at (x_{i−1}, t_j), (x_i, t_j), and (x_{i+1}, t_j) (see Figure 8.2(a) in the excerpt).

📏 Local truncation error

The local truncation error τ^{j+1} is defined as:

  • τ^{j+1} = (y^{j+1} − z^{j+1}) / Δt

where z^{j+1} is the Forward Euler step applied to the exact solution y^j.

Expanding:

  • τ^{j+1} = (y^{j+1} − y^j)/Δt − (Ky^j + r^j)
  • = ∂y^j/∂t + ε^j − ∂²y^j/∂x² + μ^j

The errors ε^j and μ^j come from:

  • Time discretization: ε^j_i = (Δt/2) ∂²y/∂t²(x_i, ζ_j), ζ_j ∈ (t_j, t_{j+1})
  • Space discretization: μ^j_i = −(Δx²/12) ∂⁴y/∂x⁴(ξ_i, t_j), ξ_i ∈ (x_{i−1}, x_{i+1})

Using the heat equation (∂y/∂t = ∂²y/∂x²), the leading terms cancel, leaving:

  • τ^{j+1} = ε^j + μ^j

Order of accuracy: O(Δt + Δx²).

Balancing errors: to make both components similar in size, choose Δt ≈ Δx². This suggests that halving Δx requires dividing Δt by four.

⚠️ Stability condition

The Forward Euler method is conditionally stable.

From the Gershgorin circle theorem:

  • Eigenvalues of K satisfy: −4/Δx² ≤ λ_i ≤ 0 for i = 1, …, n.

For real eigenvalues, Forward Euler is stable if:

  • Δt ≤ −2/λ_i for all i

Since |λ|_max = 4/Δx², the stability condition becomes:

  • Δt ≤ Δx²/2

Critical implication: if Δx is halved, Δt must be divided by four to maintain stability.
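The stability bound is easy to observe numerically. A minimal sketch (our own code, not from the text) integrates du/dt = Ku with Forward Euler from an initial profile containing a tiny high-frequency component; just below the bound the solution decays, just above it the high-frequency mode is amplified and the solution blows up:

```python
import numpy as np

def forward_euler_heat(n, dt, steps):
    """Forward Euler for du/dt = K u (heat equation, zero boundary values);
    returns max |w| after `steps` steps."""
    dx = 1.0 / (n + 1)
    K = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1)) / dx**2
    x = np.linspace(dx, 1 - dx, n)
    # Smooth profile plus a tiny component of the highest spatial mode
    w = np.sin(np.pi * x) + 1e-3 * np.sin(n * np.pi * x)
    for _ in range(steps):
        w = w + dt * (K @ w)
    return np.max(np.abs(w))

dx = 1.0 / 20
print(forward_euler_heat(19, 0.4 * dx**2, 400))   # Δt below Δx²/2: decays
print(forward_euler_heat(19, 0.6 * dx**2, 400))   # Δt above Δx²/2: blows up
```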

🧊 Stiffness

The condition number of K is:

  • κ(K) = |λ|_max / |λ|_min ≈ 4/(πΔx)²

For small Δx, the system is stiff (large difference between |λ|_max and |λ|_min). Explicit methods face severe stability restrictions in stiff systems, making implicit methods more attractive.

⏪ Backward Euler method

⏪ The scheme

The Backward Euler approximation is:

  • w^{j+1} = w^j + Δt(Kw^{j+1} + r^{j+1})

Rearranging:

  • (I − ΔtK)w^{j+1} = w^j + Δt r^{j+1}

This is an implicit method: w^{j+1} appears on both sides and must be solved from a linear system.

Dependency: see Figure 8.2(b) in the excerpt.

📏 Accuracy and stability

  • Local truncation error: also O(Δt + Δx²), same order as Forward Euler.
  • Stability: the Backward Euler method is unconditionally stable.

Key advantage: the time step Δt may be chosen based on accuracy requirements alone, without stability constraints. Stiffness does not impair numerical stability.

💰 Computational cost

  • Solving the linear system (I − ΔtK)w^{j+1} = w^j + Δt r^{j+1} is more expensive per step than Forward Euler.
  • However, K is sparse (tridiagonal), so efficient solvers exist.

Trade-off: cheap per-step cost with many small steps (Forward Euler) versus expensive per-step cost with fewer large steps (Backward Euler).

Example: if accuracy allows Δt = 0.01 but Forward Euler stability requires Δt ≤ 0.0001, Backward Euler may be faster overall despite higher per-step cost.
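The trade-off can be sketched as follows (grid size, Δt, and end time are illustrative; a dense solve stands in for the O(n) tridiagonal solver one would use in practice):

```python
# Sketch: Backward Euler steps (I - dt*K) w^{j+1} = w^j + dt*r^{j+1} for the
# heat equation with r = 0. Here dt = 0.01 far exceeds the Forward Euler
# bound dx^2/2 = 0.0002, yet the method stays stable.
import numpy as np

n = 49
dx = 1.0 / (n + 1)
dt = 0.01                          # chosen for accuracy, not stability
x = np.linspace(dx, 1 - dx, n)

K = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / dx**2
A = np.eye(n) - dt * K             # same matrix every step: factor it once

w = np.sin(np.pi * x)              # exact solution: exp(-pi^2 t) sin(pi x)
for _ in range(40):                # advance to t = 0.4 in 40 steps
    w = np.linalg.solve(A, w)      # dense solve for brevity
print(np.max(np.abs(w)))
```

Forward Euler would need 2000 steps of Δt = 0.0002 to reach the same time; here 40 steps suffice, so the higher per-step cost pays off.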

🌉 Trapezoidal (Crank-Nicolson) method

🌉 The scheme

The Trapezoidal method averages the right-hand side at t_j and t_{j+1}:

  • w^{j+1} = w^j + (Δt/2)(Kw^j + r^j + Kw^{j+1} + r^{j+1})

Crank-Nicolson method: the Trapezoidal method applied to the heat equation.

This is also an implicit method (w^{j+1} appears on both sides).

Dependency: see Figure 8.2(c) in the excerpt.

📏 Properties

  • Accuracy: the Trapezoidal method is second-order in time (O(Δt²)), higher than Forward or Backward Euler (O(Δt)).
  • Combined with spatial discretization: overall error is O(Δt² + Δx²).
  • Stability: unconditionally stable (like Backward Euler).

Advantage over Backward Euler: better accuracy for the same time step, or larger time steps for the same accuracy.
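With r = 0 the scheme rearranges to (I − (Δt/2)K)w^{j+1} = (I + (Δt/2)K)w^j, which the following sketch (sizes illustrative) compares against the exact solution e^{−π²t} sin(πx):

```python
# Sketch: Crank-Nicolson for the heat equation with zero source,
# (I - dt/2 K) w^{j+1} = (I + dt/2 K) w^j, checked against the
# exact solution exp(-pi^2 t) sin(pi x).
import numpy as np

n = 49
dx = 1.0 / (n + 1)
dt = 0.01
x = np.linspace(dx, 1 - dx, n)

K = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / dx**2
A = np.eye(n) - 0.5 * dt * K
B = np.eye(n) + 0.5 * dt * K

w = np.sin(np.pi * x)
for _ in range(40):                # advance to t = 0.4
    w = np.linalg.solve(A, B @ w)

err = np.max(np.abs(w - np.exp(-np.pi**2 * 0.4) * np.sin(np.pi * x)))
print(err)
```

The O(Δt²) time error makes the result noticeably closer to the exact solution than Backward Euler at the same Δt.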

🔀 Comparison of methods

| Method | Type | Stability | Accuracy (time) | Cost per step | When to use |
|---|---|---|---|---|---|
| Forward Euler | Explicit | Conditional: Δt ≤ Δx²/2 | O(Δt) | Low | Non-stiff problems; small Δx not required |
| Backward Euler | Implicit | Unconditional | O(Δt) | Medium | Stiff systems; stability critical |
| Crank-Nicolson | Implicit | Unconditional | O(Δt²) | Medium | High accuracy needed; stiff OK |

Don't confuse: "unconditionally stable" does not mean "arbitrarily accurate"—it means stability does not restrict Δt, but accuracy still does.

🔄 Halving the spatial step

  • Forward Euler: Δt must be divided by four (because Δt ∝ Δx²).
  • Backward Euler / Crank-Nicolson: Δt can be chosen independently of Δx for stability; accuracy considerations may still suggest smaller Δt.

Example: if Δx changes from 0.1 to 0.05, Forward Euler's stability bound changes from Δt ≤ 0.005 to Δt ≤ 0.00125.
