Adjustment theory: an introduction

1

Introduction to Adjustment Theory

Introduction

🧭 Overview

🧠 One-sentence thesis

Measurements are inherently uncertain due to fundamental and practical limitations, so the calculus of observations provides methods to describe, analyze, and adjust measurement data to obtain consistent and sufficiently accurate results for a given purpose.

📌 Key points (3–5)

  • Why measurements are uncertain: mathematical models are abstractions; real-world conditions are uncontrollable; improved instruments only shift uncertainty to another level; and absolute certainty is often unnecessary and too costly.
  • What the calculus of observations does: describes and analyzes measurement processes, provides computational methods to handle uncertainty, describes quality of results, and guides optimal measurement design.
  • Two essential models: the functional model (mathematical rules describing relationships, e.g., Euclidean geometry) and the stochastic model (statistical description of measurement variability, e.g., normal distribution).
  • Common confusion: measurements vs. mathematical certainty—measurements deal with physical reality and always have uncertainty, whereas mathematical certainty applies only to abstract logical statements.
  • Why adjustment is needed: redundant observations usually violate mathematical rules (e.g., triangle angles not summing to π), so adjustment produces consistent results.

🔍 Why measurements are always uncertain

🔍 Mathematical certainty vs. physical reality

Mathematical certainty pertains to statements like: if a=b and b=c and c=d, then a=d.

  • Mathematical certainty applies to abstractions and logical statements, not to physical measurements.
  • Example: If a, b, c, and d are measured terrain distances, a surveyor will compare d with a before stating a=d, because measurements involve humans, instruments, and material conditions that do not always follow logic.
  • Don't confuse: a mathematical model can usefully describe the real world, but there is always an essential difference between the model and reality.

🌍 Four fundamental reasons for uncertainty

  1. Abstraction gap: mathematical models describe abstractions, not physical/technical/social matters directly.
  2. Uncontrollable conditions: laboratory conditions can be partly controlled, but natural conditions are uncontrollable and only partly describable.
  3. Persistent uncertainty: improved methods and instruments reduce uncertainty but do not eliminate it—uncertainty simply shifts to another level.
  4. Purpose-driven tolerance: measurements serve a purpose; some uncertainty is acceptable, and better instruments cost more, so one must balance accuracy against cost.

💡 Practical implication

  • Absolute certainty from measurements is fundamentally and practically impossible.
  • It is also often not needed: measurements aim to provide information that is accurate enough and cheap enough for a specific purpose.

📐 What the calculus of observations covers

📐 Four main tasks

The calculus of observations (waarnemingsrekening) is the part of mathematical statistics dealing with measurement results. It addresses:

  1. Description and analysis of measurement processes.
  2. Computational methods that account for uncertainty in measurements.
  3. Quality description of measurements and derived results.
  4. Design guidelines for measurement setups to achieve optimal procedures.

🔧 Typical surveying problems

The excerpt illustrates with surveying examples:

| Problem | Question |
| --- | --- |
| Repeatability | Repeating a measurement usually gives different answers—how to describe this? |
| Error detection | Results can be corrupted by systematic or gross errors—can these be traced? |
| Redundancy | Geometric figures are measured with redundant observations (e.g., all three triangle angles instead of two)—how to adjust them to obey mathematical rules? |
| Quality propagation | How does measurement variability affect the final result? Can non-detected gross errors remain? |

  • Once these questions are answered, one can determine the required measurement setup from the desired quality of the end product.
  • This book focuses mainly on points 1 (description of variability) and 3 (adjustment), the elementary aspects of adjustment theory.

🧩 The functional model

🧩 What a functional model is

The functional model: notions and rules from a mathematical theory used to describe relationships in the problem.

  • A mathematical model strips away unnecessary details and describes essential aspects using applicable mathematical theory.
  • The model should properly describe the situation without being needlessly complicated.
  • Formulating a good model is the "art" of the discipline, involving experience, experiments, intuition, and creativity.

📏 Example: two-dimensional Euclidean geometry

  • Scenario: determining the relative position of terrain points (ignoring height differences) over not-too-large distances.
  • Functional model choice: project points onto a level surface and treat it as a plane; use two-dimensional Euclidean geometry (e.g., sine and cosine rules).
  • Why it works: experience shows this abstraction is valid; terrain points are marked sharply so they can be treated as mathematical points.
  • Caution: gratuitous use of a "standard" model can be risky—like laws of nature, a model can only be declared valid on the basis of experiments.

🔗 Link to measurements

  • When measuring geometric elements (e.g., an angle α), one assigns a number (the observation) according to certain rules (the measurement procedure) to a mathematical notion.
  • The functional model provides the mathematical rules that the measurements should obey (e.g., triangle angles summing to π).

📊 The stochastic model

📊 What a stochastic model is

The stochastic model: notions from mathematical statistics used to describe the inherent variability of measurement processes.

  • Repeating a measurement of the same quantity usually gives different answers.
  • To describe this phenomenon, one could repeat measurements many times (e.g., 100 times) and repeat the experiment itself.
  • If the histograms of each experiment are sufficiently alike (similar shapes, locations not differing much), the measurement variability can be described by a stochastic variable.

🎲 Typical stochastic model in surveying

  • Assumption: observations from a certain measurement method can be described as an independent sample drawn from a normal (Gaussian) distribution with a certain standard deviation.
  • Mathematical expectation: the expectation of the normally distributed stochastic variable (the observable) equals the unknown quantity (e.g., the true angle).
  • Example: If three angles α, β, and γ of a triangle are measured, the expectations of the three angular stochastic variables obey the functional model relation (sum to π), but the individual observations themselves usually do not.

📝 Registration and past experience

  • Registration: the measurement result as a number plus additional information (instrument type, procedure, observer, weather, etc.).
  • Purpose: establish a link with past experience or experiments to justify the choice of stochastic model.
  • In practice, not every measurement is repeated; one relies on extrapolation of past experience.

🔄 Functional and stochastic models together

🔄 The mathematical model

  • The functional and stochastic model together form the mathematical model.
  • Functional model: describes the mathematical relationships that should hold (e.g., geometric rules).
  • Stochastic model: describes the variability and uncertainty in the observations.

🗺️ Diagram of fundamental relations

The excerpt provides a conceptual flow (figure 0.1):

  1. Terrain situation (real world)
  2. Functional model (mathematical abstraction)
  3. Measurement (assigning numbers to quantities)
  4. Experiment (repeating measurements to observe variability)
  5. Stochastic model (statistical description of variability)
  • This diagram shows how the real-world situation, mathematical rules, and measurement uncertainty are interconnected in adjustment theory.

⚠️ Why adjustment is necessary

  • Redundant observations (e.g., measuring all three angles of a triangle instead of two) usually do not obey the functional model's mathematical rules (e.g., the three measured angles do not sum to π).
  • Adjustment (vereffening) is needed to obtain consistent results that satisfy the functional model.
  • The excerpt asks: what is the best way to perform such an adjustment? This is a central question in adjustment theory.
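
A minimal numerical sketch of this idea, with made-up angle values (not from the excerpt): spreading the misclosure equally over the three measured angles makes them consistent with the functional model, and this equal distribution is what least-squares adjustment gives for equally accurate, uncorrelated angle observations.

```python
import math

# hypothetical measured angles of a triangle [rad]; they do not sum to pi exactly
alpha, beta, gamma = 0.8712, 1.2351, 1.0361

misclosure = (alpha + beta + gamma) - math.pi     # violation of the functional model

# distribute the misclosure equally over the three angles
adjusted = [angle - misclosure / 3.0 for angle in (alpha, beta, gamma)]

print(sum(adjusted) - math.pi)                    # ~0: the adjusted angles are consistent
```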
2

Introduction to Linear Estimation Theory

1.1 Introduction

🧭 Overview

🧠 One-sentence thesis

When an unknown quantity is measured multiple times with differing results due to measurement errors, least-squares estimation provides a systematic way to find the best estimate by minimizing the sum of squared differences between observations and the estimated value.

📌 Key points (3–5)

  • The core problem: multiple measurements of the same unknown quantity differ from each other and from the true value due to measurement errors; we need a method to estimate the unknown from these imperfect observations.
  • Least-squares principle: estimate the unknown by choosing the value that makes the sum of squared measurement errors as small as possible.
  • Geometric interpretation: the least-squares estimate corresponds to an orthogonal projection—the estimate is chosen so that the error vector is perpendicular to the measurement direction.
  • Common confusion: the observations y₁ and y₂ are not equal to the true unknown x; the differences e₁ and e₂ (measurement errors) capture this discrepancy.
  • Two approaches: the chapter introduces both a deterministic approach (least-squares) and later probabilistic approaches (best linear unbiased estimators and maximum likelihood estimators).

📐 The fundamental problem

📏 Measuring an unknown distance

  • Suppose an unknown distance x is measured twice, yielding two measurements y₁ and y₂.
  • Due to various influences (instrumental effects, human effects, etc.), the two measurements will generally differ: y₁ ≠ y₂.
  • One or both measurements will also differ from the true unknown distance x: y₁ ≠ x and y₂ ≠ x.

⚠️ Measurement errors

Measurement error: the difference between an observation and the true unknown value.

  • The differences y₁ − x and y₂ − x are called measurement errors, denoted e₁ and e₂ respectively.
  • This gives the relations:
    • y₁ = x + e₁
    • y₂ = x + e₂
  • In vector notation: y = a x + e, where y is the observation vector, a is a known vector, x is the unknown scalar, and e is the error vector.
  • Example: if you measure a distance twice and get 10.2 m and 10.5 m, the true distance x is unknown, and the errors e₁ and e₂ capture how far each measurement is from the truth.

🎯 The estimation question

  • Problem: How do we estimate the unknown x from the given observation vector y?
  • Both x and e are unknown; only y and a are given.
  • The chapter restricts itself initially to the case of two observations and one unknown.

🔧 Least-squares estimation

🧮 The least-squares principle

  • The equation y = a x + e can be visualized geometrically: y and a are given vectors, x and e are unknown.
  • Intuitive idea: estimate x as x̂ such that ax̂ is as close as possible to the observation vector y.
  • This means choosing x̂ so that the error estimate ê = y − ax̂ has minimum distance to zero.

⊥ Orthogonality condition

  • From geometry, ax̂ has minimum distance to y if the error estimate ê is orthogonal (perpendicular) to a.
  • Orthogonality condition: aᵀê = 0
  • Substituting ê = y − ax̂ gives: aᵀ(y − ax̂) = 0
  • This is called the normal equation.

📊 The least-squares estimate

  • Solving the normal equation yields:
    • LSQ-estimate of x: x̂ = (aᵀa)⁻¹ aᵀy
    • LSQ-estimate of observations: ŷ = ax̂
    • LSQ-estimate of measurement errors: ê = y − ax̂
  • The name "Least-Squares" (LSQ) comes from the fact that x̂ minimizes the sum of squares (y − ax)ᵀ(y − ax); a numerical sketch follows below.

🔍 Why it minimizes the sum of squares

The excerpt provides two proofs:

| Proof method | Key idea |
| --- | --- |
| Algebra | Expand the sum of squares expression and show that the orthogonality condition leads to the minimum. |
| Calculus | Take the derivative of the sum of squares with respect to x, set it to zero, and solve for x̂. |

  • Both approaches confirm that the orthogonality condition (normal equation) yields the least-squares estimate.
  • Don't confuse: "least-squares" refers to minimizing the sum of squared errors, not the sum of absolute errors.

🌐 Context and scope

📚 Broader framework

  • The excerpt mentions that this problem sits within adjustment theory, which deals with the relation between terrain situation, functional model, measurement, experiment, and stochastic model.
  • The functional model describes the mathematical relationship between unknowns and observations (e.g., the sum of angles in a triangle).
  • The stochastic model describes the probabilistic nature of observations (e.g., assuming observations are drawn from a normal distribution with a certain standard deviation).
  • Together, the functional and stochastic models form the mathematical model.

🔜 What comes next

  • This introductory section uses a deterministic approach (least-squares).
  • Later sections will introduce probabilistic assumptions about observations and measurement errors.
  • This leads to more advanced estimators: best linear unbiased estimators and maximum likelihood estimators.
  • Example: in surveying and geodesy, observations are often assumed to be an independent sample from a normal distribution, with the mathematical expectation equal to the unknown quantity.
3

Least-Squares Estimation (Orthogonal Projection)

1.2 Least-Squares estimation (orthogonal projection)

🧭 Overview

🧠 One-sentence thesis

Least-squares estimation finds the best estimate of an unknown parameter by minimizing the squared distance between observations and the model, which geometrically corresponds to an orthogonal projection.

📌 Key points (3–5)

  • Core principle: the least-squares estimate minimizes the sum of squared errors between observations and the model.
  • Geometric interpretation: the estimate is found by projecting the observation vector orthogonally onto the line spanned by the model vector.
  • Two equivalent formulations: the problem can be solved either as y = ax + e (parametric form) or as b*y = b*e (implicit form using the normal vector).
  • Common confusion: orthogonal projectors P_a and P_a⊥ project onto different spaces—one onto the line spanned by a, the other onto the line orthogonal to a.
  • Extension to weighted case: when observations have different accuracy, a weight matrix W can emphasize more trustworthy observations.

📐 The basic estimation problem

📐 Problem setup

The excerpt considers the equation y = ax + e, where:

  • y is the given observation vector (2×1)
  • a is the given model vector (2×1)
  • x is the unknown parameter (1×1) to be estimated
  • e is the unknown measurement error vector (2×1)

The central question: how to estimate the unknown x from the given observation vector y.

🎯 Intuitive geometric approach

From the geometry, it is intuitively appealing to estimate x as x̂ such that ax̂ is as close as possible to the given observation vector y.

The key insight: ax̂ has minimum distance to y if the error vector ê = y - ax̂ is orthogonal to a.

🔍 The least-squares solution

🔍 Orthogonality condition leads to the normal equation

Orthogonality condition: a* ê = 0, where ê = y - ax̂

This orthogonality condition leads to the normal equation:

  • a* a x̂ = a* y

From this, the least-squares estimate follows as:

  • x̂ = (a* a)^(-1) a* y

📊 Complete least-squares estimates

Once x̂ is found, the other estimates follow:

  • LSQ-estimate of observations: ŷ = ax̂
  • LSQ-estimate of measurement errors: ê = y - ŷ
  • Sum of squares: ê* ê (the squared length of the error vector)

✅ Why it's called "least-squares"

The name comes from the fact that x̂ minimizes the sum of squares (y - ax)*(y - ax).

The excerpt provides two proofs:

  1. Algebraic proof: directly shows that the sum of squares equals a constant plus a squared term that vanishes at x̂.
  2. Calculus proof: uses the derivative condition—the minimum occurs where the derivative with respect to x equals zero, leading to the normal equation.

💡 Concrete example: arithmetic average

Example: When a = (1,1)*, the normal equation becomes 2x̂ = y₁ + y₂.

Result: x̂ = (y₁ + y₂)/2

This shows that the least-squares estimate equals the arithmetic average when both observations measure the same quantity with equal weight.

🎭 Orthogonal projectors

🎭 Definition and notation

The excerpt introduces two key matrices:

  • P_a = a(a* a)^(-1)a*
  • P_a⊥ = I - a(a* a)^(-1)a*

Orthogonal projectors: matrices that project vectors onto specific subspaces along orthogonal directions.

🔄 What each projector does

| Projector | Projects onto | Projects along |
| --- | --- | --- |
| P_a | line spanned by a | line orthogonal to a |
| P_a⊥ | line orthogonal to a | line spanned by a |

Using the cosine rule, P_a y can be written as the projection of y onto the direction of a.

🔀 Alternative representation using the normal vector

If b is the vector orthogonal to a (so b* a = 0), the projectors can be alternatively represented as:

  • P_a = P_b⊥ = I - b(b* b)^(-1)b*
  • P_a⊥ = P_b = b(b* b)^(-1)b*

Don't confuse: P_a projects onto a, but it equals P_b⊥ (projection orthogonal to b); similarly, P_a⊥ equals P_b.
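
These identities are easy to check numerically. A minimal sketch (the vectors a = (1, 1)* and b = (1, −1)* are the ones used in this chapter's example, so b*a = 0):

```python
import numpy as np

a = np.array([[1.0], [1.0]])                          # column vector a
b = np.array([[1.0], [-1.0]])                         # normal vector, b*a = 0

P_a      = a @ np.linalg.inv(a.T @ a) @ a.T           # a (a*a)^(-1) a*
P_a_perp = np.eye(2) - P_a
P_b      = b @ np.linalg.inv(b.T @ b) @ b.T
P_b_perp = np.eye(2) - P_b

print(np.allclose(P_a, P_b_perp))                     # True: P_a = P_b⊥
print(np.allclose(P_a_perp, P_b))                     # True: P_a⊥ = P_b
print(np.allclose(P_a @ P_a, P_a))                    # True: projectors are idempotent
print(np.allclose(P_a @ P_a_perp, np.zeros((2, 2))))  # True: the projectors are complementary
```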

🔄 Two equivalent formulations

🔄 Parametric vs implicit form

The least-squares problem can be formulated in two ways:

Parametric form (equation 1): y = ax + e

  • Represents the line as z = ax (through the origin with direction vector a)
  • Solved by minimizing ||y - ax||²

Implicit form (equation 5): b* y = b* e

  • Represents the same line using the normal vector b (where b* a = 0)
  • Obtained by pre-multiplying equation (1) with b* and noting that b* a = 0
  • Solved by minimizing ||e||² subject to b* z = 0

🎯 Geometric equivalence

Both formulations lead to the same geometric solution:

  • Parametric form: project y onto the line spanned by a
  • Implicit form: project y onto the vector b (orthogonal to a)

The least-squares estimates for the implicit form are:

  • ê = P_b y = b(b* b)^(-1)b* y
  • ŷ = P_b⊥ y = [I - b(b* b)^(-1)b*] y

Example: For the problem with a = (1,1), the implicit form becomes (y₁ - y₂) = (e₁ - e₂), expressing that the difference between two observed distances is generally nonzero because of measurement errors. The vector b = (1, -1) is the normal vector, and both formulations yield identical results.
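
A short numerical check of this equivalence, using the 10.2 m / 10.5 m sample observations from section 1.1 (any other values would do):

```python
import numpy as np

y = np.array([[10.2], [10.5]])
a = np.array([[1.0], [1.0]])
b = np.array([[1.0], [-1.0]])                    # normal vector, b*a = 0

# parametric form: solve the normal equation, then ŷ = a x̂
x_hat = np.linalg.inv(a.T @ a) @ a.T @ y
y_hat_parametric = a @ x_hat

# implicit form: ŷ = P_b⊥ y and ê = P_b y
P_b = b @ np.linalg.inv(b.T @ b) @ b.T
y_hat_implicit = (np.eye(2) - P_b) @ y
e_hat_implicit = P_b @ y

print(np.allclose(y_hat_parametric, y_hat_implicit))        # True: both formulations agree
print(np.allclose(y - y_hat_parametric, e_hat_implicit))    # True: same residuals as well
```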

⚖️ Weighted least-squares

⚖️ Motivation for weighting

Up to now, the two observations y₁ and y₂ have been regarded as equally accurate.

When observations have different accuracy levels, it is reasonable to give more weight to more trustworthy observations.

⚖️ Introducing the weight matrix

Weight matrix W: a symmetric, positive-definite matrix that emphasizes or de-emphasizes the influence of specific observations.

The weighted least-squares principle minimizes:

  • E(x) = (y - ax)* W (y - ax)

instead of the unweighted sum of squares.

📏 How weighting works

  • If w₁₁ > w₂₂, more importance is attached to the first observation
  • The minimization process tries harder to make (y₁ - a₁x)² small
  • This is reasonable if observation y₁ is obtained with higher accuracy or is more trustworthy than y₂

Example scenario: If one observation comes from a more precise instrument, assigning it a higher weight will make the estimate rely more heavily on that measurement.

4

Weighted Least-Squares Estimation

1.3 Weighted Least-Squares estimation

🧭 Overview

🧠 One-sentence thesis

Weighted least-squares estimation generalizes ordinary least-squares by introducing a weight matrix that allows more accurate observations to have greater influence on the estimate, and this weighted approach can still be interpreted geometrically as a form of projection.

📌 Key points (3–5)

  • What weighting does: a symmetric, positive-definite weight matrix W allows different observations to have different influence on the estimate based on their trustworthiness.
  • How weights affect the estimate: larger weight on an observation (e.g., w₁₁ > w₂₂) makes the minimization process try harder to fit that observation, producing a weighted average closer to the more trusted data.
  • Geometric interpretation: weighted least-squares corresponds to oblique projection, where the estimate lies at the point of tangency between a constraint line and an ellipse defined by the weight matrix.
  • Common confusion: weighted vs unweighted—when W = I (identity matrix), the ellipse becomes a circle and weighted least-squares reduces to ordinary orthogonal projection; otherwise the projection is oblique.
  • Key result: the weighted estimate is given by x̂ = (a*Wa)⁻¹a*Wy, which replaces the simple arithmetic average with a weighted average favoring more accurate observations.

🎯 The weighting principle

🎯 Why introduce weights

  • Up to this point, all observations y₁ and y₂ were treated as equally accurate.
  • In practice, observations may come from instruments of different accuracy.
  • The weighted approach allows emphasizing or de-emphasizing specific observations based on their trustworthiness.

⚖️ The weight matrix W

Weight matrix W: a symmetric, positive-definite matrix whose elements control the influence of each observation on the estimate.

  • Symmetric: W has the same values above and below the diagonal.
  • Positive-definite: for any non-zero vector z, z*Wz > 0 (this ensures the matrix is invertible and the minimization has a unique solution).
  • Example: if w₁₁ > w₂₂, more importance is attached to the first observation, so minimizing E(x) tries harder to make (y₁ - a₁x)² small.

📐 The weighted minimization problem

Instead of minimizing the unweighted sum of squares (y - ax)*(y - ax), weighted least-squares minimizes:

  • Weighted sum of squares: (y - ax)*W(y - ax)

The solution x̂ that minimizes this expression is called the weighted least-squares (WLSQ) estimate of x.

🧮 Solving the weighted problem

🧮 The normal equation

The solution is obtained by solving:

  • Normal equation: (a*Wa)x̂ = a*Wy

From this follows:

  • WLSQ-estimate of x: x̂ = (a*Wa)⁻¹a*Wy
  • WLSQ-estimate of y: ŷ = ax̂
  • WLSQ-estimate of e: ê = y - ŷ
  • Weighted sum of squares: ê*Wê

The denominator (a*Wa) is always positive because W is positive-definite, ensuring the solution exists.

📊 Example with weighted average

Consider the problem from section 1.1 with weight matrix:

| Element | Meaning |
| --- | --- |
| w₁₁ | weight on first observation |
| w₂₂ | weight on second observation |
| w₁₂ | cross-weight (may be zero) |

The weighted estimate becomes:

  • x̂ = (w₁₁y₁ + w₂₂y₂) / (w₁₁ + w₂₂), assuming a = (1,1)* and w₁₂ = 0

This is a weighted average of the data, not a simple arithmetic average.

  • If w₁₁ > w₂₂, the average is closer to y₁ than to y₂.
  • The residual ê₂ > ê₁ when w₁₁ > w₂₂ (the less-trusted observation gets a larger residual).

Don't confuse: When W = I (identity matrix), this reduces to the unweighted case with arithmetic average (y₁ + y₂)/2.
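
A minimal sketch of this weighted average; the weight values are my own choice, picked so that w₁₁ > w₂₂ and w₁₂ = 0, and the observations are again the 10.2 m / 10.5 m sample values.

```python
import numpy as np

y = np.array([[10.2], [10.5]])
a = np.array([[1.0], [1.0]])
W = np.diag([4.0, 1.0])          # w11 > w22: the first observation is trusted more

x_hat = np.linalg.inv(a.T @ W @ a) @ a.T @ W @ y      # x̂ = (a*Wa)⁻¹ a*W y
e_hat = y - a @ x_hat

print(x_hat.item())              # 10.26: the weighted average, closer to y1 than 10.35
print(e_hat.ravel())             # [-0.06, 0.24]: the less-trusted y2 gets the larger residual
```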

🔷 Geometric interpretation: ellipses and tangency

🔷 From circles to ellipses

The equation z*Wz = c describes an ellipse in general:

  • If w₁₁ = w₂₂ and w₁₂ = 0, the ellipse is a circle.
  • If w₁₁ ≠ w₂₂ and w₁₂ = 0, the ellipse has major and minor axes parallel to the coordinate axes.
  • If w₁₂ ≠ 0, the ellipse may have arbitrary orientation.

📏 The gradient and normal vectors

For the function F(z) = z*Wz, the gradient at z₀ is:

  • ∇F(z₀) = 2Wz₀

This gradient vector is normal (orthogonal) to the ellipse at z₀.

The tangent line of the ellipse z*Wz = c at z₀ is therefore given by:

  • (Wz₀)*(z - z₀) = 0

🎯 The tangency condition

Comparing the tangent equation with the normal equation shows:

  • The WLSQ-estimate x̂ is that value for which (y - ax̂) is orthogonal to the normal Wa at a of the ellipse z*Wz = a*Wa.
  • Equivalently, (y - ax̂) is parallel to the tangent line of the ellipse z*Wz = a*Wa at a.

Key insight: The weighted least-squares estimate is found at the point where the constraint line touches (is tangent to) an ellipse defined by the weight matrix.

Don't confuse: When W = I, the ellipse becomes a circle and the weighted estimate reduces to ordinary orthogonal projection (see figures 1.8 and 1.9 in the excerpt).

🔀 Oblique projectors

🔀 Definition of projectors

The matrices that appear in the WLSQ solution are called oblique projectors:

  • Projector onto a: P_{a,(Wa)⊥} = a(a*Wa)⁻¹a*W
  • Projector onto (Wa)⊥: P_{(Wa)⊥,a} = I - a(a*Wa)⁻¹a*W

These can be written alternatively using the vector b (orthogonal to a) and W⁻¹:

  • P_{a,(Wa)⊥} = a(a*Wa)⁻¹a*W = I - W⁻¹b(b*W⁻¹b)⁻¹b*
  • P_{(Wa)⊥,a} = I - a(a*Wa)⁻¹a*W = W⁻¹b(b*W⁻¹b)⁻¹b*
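
These alternative representations can be verified numerically. A sketch with a weight matrix of my own choosing (any symmetric positive-definite W works) and b orthogonal to a:

```python
import numpy as np

a = np.array([[1.0], [1.0]])
b = np.array([[1.0], [-1.0]])                 # b*a = 0
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])                    # symmetric, positive-definite weight matrix
W_inv = np.linalg.inv(W)

P_left  = a @ np.linalg.inv(a.T @ W @ a) @ a.T @ W                      # a(a*Wa)⁻¹a*W
P_right = np.eye(2) - W_inv @ b @ np.linalg.inv(b.T @ W_inv @ b) @ b.T  # I − W⁻¹b(b*W⁻¹b)⁻¹b*

print(np.allclose(P_left, P_right))           # True: the two representations coincide
print(np.allclose(P_left @ P_left, P_left))   # True: it is a projector (idempotent)
```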

📐 What "oblique" means

  • Oblique projection onto a line spanned by a: the projection is along a line orthogonal to Wa (not orthogonal to a itself).
  • Oblique projection onto a line orthogonal to Wa: the projection is along a line spanned by a.

The term "oblique" indicates the projection direction is not perpendicular to the target line (unless W = I).

🔄 Connection to the implicit form

The excerpt shows that the weighted least-squares problem can also be solved using the implicit form:

  • Minimize e*We subject to b*e = b*y

The solution ê is the point on the line b*e = b*y for which e*We is minimal.

  • Different values of c in e*We = c give different ellipses; larger c means larger ellipse.
  • The solution ê is the point of tangency of the line b*y = b*e with one of the ellipses e*We = c.
  • At tangency, the normal of the ellipse is parallel to the normal of the line (which is b).
  • Therefore, ê must be parallel to W⁻¹b.

This gives:

  • ê = W⁻¹b·α for some scalar α
  • Solving for α using b*ê = b*y yields the weighted least-squares estimates.

Don't confuse: The two forms (parametric y = ax + e and implicit b*e = b*y) are equivalent and lead to the same weighted estimates, just derived from different geometric perspectives.

🔁 Weighted least-squares as orthogonal projection

🔁 Dependence on base vectors

The geometric interpretation (orthogonal vs oblique) depends on the choice of base vectors used to construct the vectors y, a, and e.

  • So far, the implicit assumption was orthogonal base vectors (1,0)* and (0,1)*.
  • With these orthonormal bases, unweighted least-squares appears as orthogonal projection and weighted least-squares as oblique projection.

🔄 Changing the base vectors

If instead of (1,0)* and (0,1)*, a different set of non-orthonormal base vectors d₁ and d₂ is chosen:

  • The vectors y, a, and e are constructed as linear combinations of d₁ and d₂ using the same scalars y₁, y₂, a₁, a₂, e₁, e₂.
  • The equation y = ax + e still holds.

The excerpt hints (but does not complete the argument in the provided text) that with a suitable choice of non-orthonormal base vectors, weighted least-squares can also be interpreted as orthogonal projection in the transformed coordinate system.

Key takeaway: The distinction between "orthogonal" and "oblique" projection is not absolute—it depends on the coordinate system. Weighted least-squares is oblique in the standard orthonormal basis but can be viewed as orthogonal in a suitably chosen non-orthonormal basis.

5

Weighted Least-Squares (also orthogonal projection)

1.4 Weighted Least-Squares (also orthogonal projection)

🧭 Overview

🧠 One-sentence thesis

Every weighted least-squares problem can be interpreted as an orthogonal projection when the weight matrix is factored as W = D*D, meaning the "oblique" appearance depends only on the choice of base vectors.

📌 Key points (3–5)

  • Core insight: weighted least-squares estimation, which looks like oblique projection with standard orthonormal bases, is actually orthogonal projection when viewed with non-orthonormal base vectors.
  • Weight matrix factorization: every symmetric positive-definite weight matrix W can be written as W = D*D (proven via Cholesky decomposition).
  • Inner product interpretation: the weight matrix W = D*D defines an inner product that changes how "length" and "orthogonality" are measured.
  • Common confusion: the geometric interpretation (orthogonal vs oblique) depends on the choice of base vectors—the same projection can look orthogonal or oblique depending on the coordinate system.
  • Estimators vs estimates: when the observation vector y is treated as a random variable, the resulting least-squares solutions become random variables called estimators; with a sample value y, they are called estimates.

🔄 From oblique to orthogonal projection

🔄 How base vectors change the picture

  • In earlier sections, unweighted least-squares was orthogonal projection and weighted least-squares was oblique projection—both used the standard orthonormal base vectors (1,0)* and (0,1)*.
  • The excerpt shows that if you switch from orthonormal bases to non-orthonormal base vectors d₁ and d₂, the same weighted least-squares problem becomes an orthogonal projection.
  • The vectors y, a, and e are constructed from scalars y₁, y₂, a₁, a₂, e₁, e₂ using the new bases:
    • y = y₁ d₁ + y₂ d₂
    • a = a₁ d₁ + a₂ d₂
    • e = e₁ d₁ + e₂ d₂
  • The equation y = ax + e still holds, but the geometric appearance changes.

🎯 Orthogonality condition with new bases

  • To estimate x, require that the residual vector y - ax̂ is orthogonal to a:
    • a* (y - ax̂) = 0
  • Substituting the new base-vector representations gives:
    • (a₁ d₁ + a₂ d₂)* (y - ax̂) = 0
  • After algebra, this leads to:
    • x̂ = (a*D*Da)⁻¹ a*D*Dy
  • The matrix D*D is symmetric and positive-definite, so it can be interpreted as a weight matrix W.
  • Key result: the solution to the orthogonality condition is identical to the weighted least-squares solution with weight matrix W = D*D.

🔢 Every weight matrix is a product

🔢 The crucial question

Can every weight matrix W (i.e., every symmetric positive-definite matrix) be written as W = D*D?

  • Answer: Yes, every such W can be factored this way.
  • This means every weighted least-squares problem can be interpreted as orthogonal projection in an appropriately chosen coordinate system.

🧮 Cholesky decomposition (proof)

  • The excerpt provides a proof by induction:
    • For n = 1, the result is trivial.
    • Assume true for (n-1) × (n-1) matrices.
    • For an n × n symmetric positive-definite matrix W, partition it into blocks involving an (n-1) × (n-1) matrix A (also symmetric positive-definite).
    • By the induction hypothesis, A = L_{n-1} L*_{n-1} for some lower triangular L_{n-1}.
    • Construct L from L_{n-1} and solve for the additional elements (a column vector l and a diagonal entry) such that LL* = W.
    • The positivity condition b_{nn} - l*l > 0 is guaranteed by W being positive-definite.
  • Conclusion: there exists a nonsingular lower triangular matrix L with positive diagonal elements such that W = LL*.

Example (from the excerpt):

  • A simple Cholesky decomposition is shown (the excerpt mentions it but does not give the full matrix; it states "a simple example" is provided).
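
A minimal sketch of such a factorization with NumPy (the matrix W below is my own example, not the one from the excerpt); the last line also checks the weighted-norm identity used in the next subsection.

```python
import numpy as np

W = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # symmetric positive-definite weight matrix

L = np.linalg.cholesky(W)                # lower triangular L with W = L L*
D = L.T                                  # so that W = D*D
print(np.allclose(D.T @ D, W))           # True: W = D*D

u = np.array([1.0, -2.0])
print(np.isclose(u @ W @ u, (D @ u) @ (D @ u)))   # True: ||u||²_W = ||Du||²
```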

📏 Inner products and metrics

📏 Standard vs weighted inner product

  • In the unweighted case (W = I), the inner product of vectors u and v is:
    • (u, v) = u* v = u₁ v₁ + u₂ v₂
  • The square of the length of u is:
    • ||u||² = u* u = u₁² + u₂²
  • In the weighted case (W = D*D), the inner product is:
    • (u, v)_W = u*Wv = u*D*Dv = (Du)*(Dv)
  • The square of the length of u is:
    • ||u||²_W = u* W u = (Du)* (Du)

🎨 Visualizing the metric

  • The excerpt describes two equivalent ways to visualize the weighted inner product:
    • Figure 1.17a: use orthonormal base vectors and visualize the metric by an ellipse (the set of points with ||u||²_W = constant).
    • Figure 1.17b: use non-orthonormal base vectors d₁ and d₂ (the columns of D) and visualize the metric by a circle; the non-orthonormal bases "absorb" the weight matrix.
  • Don't confuse: the same weighted least-squares problem can look like an ellipse with orthonormal bases or a circle with non-orthonormal bases—the geometry is the same, only the coordinate system changes.

🔀 Projectors as orthogonal operators

🔀 Reinterpreting oblique projectors

  • Earlier, the matrix P_{a,(Wa)⊥} from equation (8) was an oblique projector.
  • The excerpt shows it can be interpreted as an orthogonal projector onto the line spanned by a, where orthogonality is measured with respect to the inner product defined by W.
  • From the geometry (figure 1.18):
    • ŷ = a x̂
    • a* W (y - ŷ) = 0
    • Solving gives x̂ = (a* W a)⁻¹ a* W y
    • Therefore ŷ = a (a* W a)⁻¹ a* W y
  • The matrix a (a* W a)⁻¹ a* W is recognized as P_{a,(Wa)⊥} from equation (8).
  • Conclusion: the "oblique" projector is actually an orthogonal projector in the W-inner-product sense.

🔀 Notation simplification

  • From now on, the excerpt assumes the inner product is clear from context and denotes the projector simply as P_a (dropping the explicit W notation).

🎲 Estimators vs estimates

🎲 Random variables and least-squares

  • So far, no probabilistic assumptions were made about observations or errors.
  • The least-squares estimates follow from solving:
    • minimize (y - ax)* W (y - ax)
  • The solutions are:
    • x̂, ŷ, ê (estimates)
  • Vector ê contains the least-squares residuals.

🎲 From estimates to estimators

  • If the observation vector y is treated as a random vector (denoted with underscore: y_), then substituting y_ into the least-squares formulas produces random variables:
    • x̂_, ŷ_, ê_ (estimators)
  • Distinction:
    • Estimators: random variables obtained by substituting the random observation vector y_ into the least-squares formulas.
    • Estimates: the numerical values obtained by substituting a sample value y.
  • The probability density functions of the estimators can be derived from the probability density function of y_.

| Term | Input | Output | Nature |
| --- | --- | --- | --- |
| Estimate | Sample value y | Numerical x̂, ŷ, ê | Fixed numbers |
| Estimator | Random vector y_ | Random x̂_, ŷ_, ê_ | Random variables |
6

The mean and covariance matrix of Least-Squares estimators

1.5 The mean and covariance matrix of Least-Squares estimators

🧭 Overview

🧠 One-sentence thesis

Least-squares estimators are unbiased regardless of the weight matrix choice, and their variance is minimized when the weight matrix is set to the inverse of the observation covariance matrix.

📌 Key points (3–5)

  • Estimators vs estimates: estimators are random variables (when observation vector is random); estimates are the numerical values obtained from sample data.
  • Unbiasedness property: if observations equal a deterministic component plus zero-mean random errors, least-squares estimators are unbiased for any weight matrix.
  • Variance minimization: the variance of the least-squares estimator is smallest when the weight matrix equals the inverse of the observation covariance matrix.
  • Common confusion: unbiasedness does not depend on the weight matrix choice, but minimum variance does—choosing the right weight matrix makes the estimator "best."
  • BLUE property: least-squares estimators become Best Linear Unbiased Estimators (minimum variance among all linear unbiased estimators) with the optimal weight matrix.

📐 Estimators as random variables

📐 What estimators are

Estimator: a random variable obtained by substituting a random observation vector into the least-squares formula; estimate: the numerical value obtained when the observation vector is replaced by its sample value.

  • The observation vector y (underscored) is treated as a random vector.
  • When y is substituted into the least-squares solution formulas, the resulting x̂, ŷ, and ê are also random variables (estimators).
  • If you plug in actual measured data (sample values), you get estimates (fixed numbers).

🔄 Linear functions and distributions

  • Because x̂, ŷ, and ê are linear functions of y, their probability distributions inherit properties from y.
  • If y is normally (Gaussian) distributed, then x̂, ŷ, and ê are also normally distributed.
  • The excerpt emphasizes that computing means and covariances is straightforward for linear functions.

🎯 Unbiasedness of least-squares estimators

🎯 The model assumption

The excerpt assumes:

  • The observation vector can be written as: y = Ax + e
    • Ax is a deterministic (non-random) component.
    • e is a random measurement error component.
  • The mean of the error is zero: E{e} = 0 (errors are zero "on the average").

✅ Why estimators are unbiased

  • Applying the propagation law of means to the least-squares formulas gives:
    • E{x̂} = x (the expected value of the estimator equals the true unknown parameter).
    • E{ŷ} = Ax and E{ê} = 0.
  • This result holds irrespective of the probability distribution type and independent of the weight matrix W.
  • Example: if you repeatedly sample observations and compute estimates, the average of those estimates will converge to the true parameter value.

🔍 Don't confuse: unbiasedness vs minimum variance

  • Unbiasedness means the estimator is "correct on average" and does not depend on the weight matrix.
  • Minimum variance (small spread around the mean) does depend on the weight matrix choice.
  • An unbiased estimator can still have large variance (high variability), which is undesirable.

📊 Covariance matrices and variance minimization

📊 Propagation of covariances

  • The covariance matrix of the observation vector is denoted Q_y.
  • Applying the propagation law of covariances to the least-squares formulas gives the covariance matrices of x̂, ŷ, and ê.
  • The variance of x̂ (denoted σ²_x̂) measures the spread of the estimator's distribution around its mean.

🎯 Optimal weight matrix

  • The excerpt derives that the variance σ²_x̂ is minimized when the two vectors DWa and (D*)⁻¹a are parallel (angle between them is zero).
  • This occurs when the weight matrix is chosen as:
    • W = Q_y⁻¹ (the inverse of the observation covariance matrix).
  • With this choice, the variance becomes minimal.
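
A Monte Carlo sketch of this claim (the covariance matrix, sample size, and true value are my own choices): both weight choices give an unbiased estimator, but W = Q_y⁻¹ gives the smaller spread.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 1.0])
x_true = 10.0
Q_y = np.array([[0.04, 0.0],
                [0.0,  0.25]])                       # y1 is much more precise than y2

def lsq_estimate(y, W):
    # x̂ = (a*Wa)⁻¹ a*W y for a single unknown
    return (a @ W @ y) / (a @ W @ a)

samples = rng.multivariate_normal(a * x_true, Q_y, size=20000)
est_identity = np.array([lsq_estimate(y, np.eye(2)) for y in samples])
est_optimal  = np.array([lsq_estimate(y, np.linalg.inv(Q_y)) for y in samples])

print(est_identity.mean(), est_optimal.mean())       # both ≈ 10: unbiased for any W
print(est_identity.var(),  est_optimal.var())        # ≈ 0.0725 vs ≈ 0.0345: W = Q_y⁻¹ wins
```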

📉 Why small variance matters

  • Variance measures the variability expected when drawing independent samples from the distribution.
  • A small variance combined with unbiasedness means there is high probability that a sample estimate will be close to the true unknown value.
  • Example: if you measure repeatedly, estimates will cluster tightly around the true parameter.

🏆 Best Linear Unbiased Estimators (BLUEs)

🏆 Definition and properties

BLUE (Best Linear Unbiased Estimator): an estimator that is linear (a linear function of observations), unbiased (expected value equals the true parameter), and has minimum variance among all linear unbiased estimators.

  • "Best" means minimum variance in this context.
  • Least-squares estimators are BLUEs when the weight matrix is set to W = Q_y⁻¹.

🔗 Summary of optimality

| Property | Condition | Implication |
| --- | --- | --- |
| Unbiased | Observations = deterministic + zero-mean error | Holds for any weight matrix |
| Minimum variance | Weight matrix = inverse of observation covariance | Variance is smallest possible |
| Linear | Estimator is a linear function of observations | Straightforward computation |
| BLUE | All three properties combined | Optimal among linear unbiased estimators |

⚠️ Don't confuse: deterministic derivation vs probabilistic properties

  • Least-squares estimates were originally derived using deterministic principles (orthogonality, minimum distance) without probabilistic assumptions.
  • Nevertheless, when observations are treated as random, the estimators have strong probabilistic optimality properties (unbiasedness, minimum variance).
  • The excerpt emphasizes that these "optimal properties" emerge even though the derivation did not initially rely on probability theory.
7

Best Linear Unbiased Estimators (BLUE's)

1.6 Best Linear Unbiased Estimators (BLUE’s)

🧭 Overview

🧠 One-sentence thesis

Best linear unbiased estimators (BLUE's) are unique and coincide with least-squares estimators when the weight matrix equals the inverse of the observation covariance matrix, making them optimal in the sense of minimum variance among all linear unbiased estimators.

📌 Key points (3–5)

  • What BLUE means: "Best" = minimum variance; "Linear" = linear function of observations; "Unbiased" = expected value equals the true parameter.
  • Uniqueness result: there is only one BLUE for a given problem, and it equals the least-squares estimator when the weight matrix is chosen as the inverse covariance matrix of observations.
  • Three properties to satisfy: L-property (linearity), U-property (unbiasedness), and B-property (minimum variance in the class of linear unbiased estimators).
  • Common confusion: least-squares estimation starts from deterministic principles (orthogonality, minimum distance) with no probabilistic assumptions, yet it produces estimators with optimal probabilistic properties when the weight matrix is chosen correctly.
  • Optimal weight matrix: choosing W as the inverse of the observation covariance matrix minimizes the variance of the least-squares estimator.

🎯 From least-squares to BLUE

🎯 Least-squares as a deterministic principle

  • Least-squares estimates are derived from deterministic principles: "orthogonality" and "minimum distance."
  • No probabilistic considerations are used in the derivation itself.
  • Despite this, least-squares estimators turn out to have desirable probabilistic properties.

📊 Optimal properties of least-squares estimators

Least-squares estimators possess three key properties:

| Property | Description |
| --- | --- |
| Unbiased | Independent of the choice of weight matrix W |
| Minimum variance | Achieved by choosing W as the inverse of the observation covariance matrix |
| Linear | Linear functions of the observations |

  • When all three properties hold, the estimator is called a best linear unbiased estimator (BLUE).
  • "Best" specifically means minimum variance in this context.

🔑 Optimal weight matrix choice

  • The variance of the least-squares estimator can be minimized by a specific choice of weight matrix.
  • Using the decomposition Q_y = D* D, the variance is minimal when the angle between two vectors (DWa and (D*)^(-1)a) is zero—that is, when they are parallel.
  • This occurs when W = Q_y^(-1) (the inverse of the covariance matrix of observations).
  • With this choice, least-squares estimators become BLUE's.

🧩 The three BLUE properties

🧩 L-property: Linearity

The estimator x-hat must be a linear function of the observations y.

  • The estimator must have the form: x-hat = l* y, where l is a vector.
  • This defines the class of linear estimators, which is infinite.
  • Example: any linear combination of observations qualifies, so there are infinitely many possible linear estimators.

🧩 U-property: Unbiasedness

The estimator x-hat must be unbiased for all values the true parameter x may take.

  • Unbiasedness means: the expected value of x-hat equals x for all x.
  • Mathematically: E(x-hat) = x for all x.
  • From the linearity condition and the model E(y) = ax, unbiasedness requires: l*a = 1 (where a is the known vector from the observation model).
  • This constraint narrows the class of estimators to linear and unbiased (LU) estimators, but this class is still infinite.
  • Example: any vector l of the form l_o + c·b (where c is arbitrary and b is orthogonal to a) satisfies the unbiasedness condition.

🧩 B-property: Best (minimum variance)

Among all linear unbiased estimators, find the one with minimum variance.

  • The variance of x-hat follows from the "propagation law of variances": variance of x-hat = l* Q_y l, where Q_y is the covariance matrix of observations.
  • The B-property requires finding vectors l (denoted l-hat) that minimize this variance while satisfying the unbiasedness constraint l* a = 1.
  • Geometrically: find the shortest vector whose endpoint lies on the constraint line.
  • Result: there is only one such vector, making the BLUE unique.

🔍 Deriving the BLUE class

🔍 The minimization problem

The problem is to minimize the variance subject to the unbiasedness constraint:

  • Minimize: variance of x-hat = l* Q_y l
  • Subject to: l* a = 1

Using the decomposition Q_y = D* D, define l-bar = D l. The problem becomes:

  • Minimize: length of l-bar
  • Subject to: l-bar* (D*)^(-1) a = 1

📐 Geometric interpretation

  • The constraint l-bar* (D*)^(-1) a = 1 defines a line.
  • The minimization seeks the point on this line closest to the origin.
  • From the geometry, there is only one such vector l-hat.
  • This vector lies along (D*)^(-1) a and has length equal to the inverse of the norm of (D*)^(-1) a.

✅ The unique solution

The unique BLUE is:

  • x-hat = l-hat* y
  • Where l-hat is determined by the minimization.
  • Key conclusion: the best linear unbiased estimator is unique and equals the least-squares estimator when W = Q_y^(-1).

Don't confuse: the class of linear estimators is infinite, and the class of linear unbiased estimators is also infinite, but the class of best linear unbiased estimators contains exactly one estimator.
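
A minimal numerical check (with a covariance matrix of my own choosing) that the coefficient vector l-hat = Q_y⁻¹a / (a* Q_y⁻¹ a) — the closed form the derivation above leads to — satisfies the unbiasedness constraint and reproduces the least-squares estimate with W = Q_y⁻¹:

```python
import numpy as np

a = np.array([1.0, 1.0])
Q_y = np.array([[0.04, 0.01],
                [0.01, 0.25]])
W = np.linalg.inv(Q_y)

l_hat = W @ a / (a @ W @ a)                  # the unique BLUE coefficient vector
print(np.isclose(l_hat @ a, 1.0))            # True: the U-property l-hat* a = 1 holds

y = np.array([10.2, 10.5])                   # sample observations (same values as before)
x_blue = l_hat @ y                           # BLUE: x-hat = l-hat* y
x_wlsq = (a @ W @ y) / (a @ W @ a)           # LSQ with W = Q_y⁻¹
print(np.isclose(x_blue, x_wlsq))            # True: BLUE = LSQ with the optimal weight matrix
```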

🔄 Alternative derivation

🔄 Parametric representation of the constraint

  • The constraint a* l = 1 defines a line with direction vector orthogonal to a.
  • This line can be represented parametrically as: l = l_o + c·b, where l_o is any particular solution, b is orthogonal to a, and c is an arbitrary scalar.
  • This replaces the constraint with a parametric form, making the minimization easier.

🔄 Solving via parametrization

  • The minimization problem becomes: minimize (l_o + c·b)* Q_y (l_o + c·b) over the scalar c.
  • The solution for the optimal c is derived algebraically.
  • Substituting back gives the same result: x-hat = l-hat* y, confirming the uniqueness and form of the BLUE.

🔄 Connection to least-squares

  • Using known results from earlier sections, the alternative derivation shows that the BLUE formula simplifies to the least-squares formula when W = Q_y^(-1).
  • This confirms that least-squares with optimal weight matrix = BLUE.

🆚 Least-squares vs. BLUE perspective

🆚 Two starting points, same destination

| Approach | Starting principle | Assumptions | Result |
| --- | --- | --- | --- |
| Least-squares | Deterministic: orthogonality and minimum distance | No probabilistic assumptions initially | Produces estimators with optimal probabilistic properties when W = Q_y^(-1) |
| BLUE | Probabilistic: linearity, unbiasedness, minimum variance | Explicitly seeks optimal probabilistic properties | Unique estimator that equals least-squares when W = Q_y^(-1) |

  • Don't confuse: least-squares is derived without probability, but it turns out to be optimal in a probabilistic sense when the weight matrix is chosen correctly.
  • The BLUE approach starts from probabilistic optimality criteria and arrives at the same estimator.
  • Both perspectives converge: least-squares with W = Q_y^(-1) is the unique BLUE.
8

The Maximum Likelihood (ML) method

1.7 The Maximum Likelihood (ML) method

🧭 Overview

🧠 One-sentence thesis

The Maximum Likelihood method estimates parameters by choosing the value that makes the observed data most probable, and for normally distributed observations it produces the same result as least-squares estimation with the weight matrix equal to the inverse covariance matrix.

📌 Key points (3–5)

  • What ML does: selects the parameter value that maximizes the probability density (likelihood) of the observed data.
  • Information requirement: ML needs the complete probability distribution of the observations, not just the mean and covariance matrix.
  • Key result for normal distributions: when observations are normally distributed, ML estimators are identical to least-squares estimators with weight matrix W = Q_y^(-1).
  • Common confusion: likelihood is the probability density function viewed as a function of the parameter (for fixed observed data), not as a function of the data itself.
  • Comparison with other methods: unlike least-squares (deterministic) and BLUE (needs only mean and covariance), ML requires full distributional assumptions.

🎯 The likelihood principle

🎯 What likelihood means

Likelihood of x: the probability density function p(y; x) considered for fixed observed y as a function of the parameter x.

  • For a fixed parameter x, p(y; x) describes how probable different observation values y are.
  • For a fixed observation y, p(y; x) describes how "likely" different parameter values x are to have produced that observation.
  • The higher the probability density at the observed y for a given x, the more likely that x is the true parameter value.

📐 The ML estimation principle

The method maximizes the likelihood:

  • Choose the estimate x-hat that maximizes p(y; x) for the given observed y.
  • In words: find the parameter value that would make the observed data most probable.
  • Example: if two parameter values x₁ and x₂ produce different probability densities at the observed y, the one with higher density is more likely to be correct.

🔍 Why complete distribution is needed

  • The ML method requires knowing the full mathematical expression for p(y; x).
  • This contrasts with BLUE, which only needs the mean E{y} and covariance matrix Q_y.
  • Without the complete distribution, you cannot evaluate which parameter value maximizes the probability of the observed data.

📊 ML for normal distributions

📊 The normal case setup

When y is normally distributed with mean E{y} = ax and covariance matrix Q_y, the probability density function (for two-dimensional y) is p(y; x) = (2π)⁻¹ (det Q_y)^(-1/2) exp[ -½ (y - ax)* Q_y^(-1) (y - ax) ].

🔗 Connection to least-squares

  • It is often more convenient to maximize ln p(y; x) instead of p(y; x) itself, because both have their maxima at the same parameter values.
  • Taking the logarithm of the normal density shows that the first two terms are constant (independent of x).
  • Therefore, maximizing ln p(y; x) amounts to maximizing -1/2 (y - ax)* Q_y^(-1) (y - ax).
  • This is equivalent to minimizing (y - ax)* Q_y^(-1) (y - ax).

Key conclusion:

For normally distributed observations, maximum likelihood estimators are identical to least-squares estimators with weight matrix W = Q_y^(-1).
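
A small numerical illustration (the covariance matrix and observation values are my own choices): maximizing the log-likelihood by brute force over a grid of x values lands on the same estimate as the weighted least-squares formula with W = Q_y⁻¹.

```python
import numpy as np

a = np.array([1.0, 1.0])
Q_y = np.array([[0.04, 0.01],
                [0.01, 0.25]])
W = np.linalg.inv(Q_y)
y = np.array([10.2, 10.5])

# the x-dependent part of -ln p(y; x) is ½ (y - ax)* Q_y⁻¹ (y - ax)
xs = np.linspace(10.0, 10.6, 60001)
neg_log_lik = np.array([0.5 * (y - a * x) @ W @ (y - a * x) for x in xs])
x_ml = xs[np.argmin(neg_log_lik)]

x_wlsq = (a @ W @ y) / (a @ W @ a)           # LSQ estimate with W = Q_y⁻¹
print(x_ml, x_wlsq)                          # agree to within the grid spacing
```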

🎨 Geometric interpretation

The excerpt provides a geometric picture:

  • Contours of constant probability density are ellipses centered at ax.
  • Smaller probability density corresponds to larger ellipses (larger c values).
  • By varying x, the family of ellipses translates along the line with direction vector a.
  • The likelihood of x is maximized where the observation point y is a point of tangency of one of the ellipses.
  • At the tangency point, the normal to the ellipse is Q_y^(-1) (y - ax-hat), which is orthogonal to the tangent line.
  • The tangent line is parallel to the direction vector a, so a* Q_y^(-1) (y - ax-hat) = 0.
  • This geometric condition confirms that ML-estimators and LSQ-estimators with W = Q_y^(-1) are identical for normal distributions.

Example: Imagine translating a family of concentric ellipses along a line until the observed data point y touches one ellipse; the center of that ellipse gives the ML estimate.

🔄 Comparison of estimation methods

🔄 Three methods overview

The excerpt summarizes three methods discussed in the chapter:

| Method | Assumptions | What is needed | Key feature |
| --- | --- | --- | --- |
| LSQ (Least-Squares) | Deterministic; no probability assumptions | Model y = ax + e; weight matrix W | Minimizes (y - ax)* W (y - ax) |
| BLUE (Best Linear Unbiased) | Random observations; E{e} = 0, E{ee*} = Q_y | Mean and covariance matrix of y | Uses only first two moments |
| ML (Maximum Likelihood) | Full probability distribution of y | Complete p(y; x) | Maximizes likelihood of observed data |

🧩 Model progression

The excerpt shows how the model evolves with increasing assumptions:

  • Deterministic model: y = ax + e (measurement errors e are fixed).
  • Random model: y = ax + e (measurement errors e are random, so y is random).
  • BLUE model: E{y} = ax, Q_y = E{ee*} (mean and covariance specified).
  • ML model: y ~ Normal(ax, Q_y) (full normal distribution specified).

⚖️ When methods coincide

  • For the normal distribution model, the ML-estimator is identical to the BLU-estimator and the LSQ-estimator if W = Q_y^(-1).
  • Don't confuse: this identity holds specifically for normal distributions; for other distributions, ML may differ from LSQ/BLUE.
  • The choice of weight matrix W = Q_y^(-1) is what makes the three methods produce the same estimate in the normal case.
9

Summary

1.8 Summary

🧭 Overview

🧠 One-sentence thesis

Three estimation methods—LSQ, BLUE, and ML—converge to the same solution when measurements are normally distributed and the weight matrix equals the inverse of the covariance matrix.

📌 Key points (3–5)

  • Three estimation methods: LSQ (deterministic), BLUE (random errors with mean zero and known covariance), and ML (adds normality assumption).
  • When they coincide: ML-estimator, BLU-estimator, and LSQ-estimator are identical if the weight matrix W equals the inverse covariance matrix Q_y^(-1).
  • Model progression: the model evolves from deterministic (y = ax + e) to stochastic (random y with E{e}=0) to fully probabilistic (normally distributed y).
  • Common confusion: LSQ is purely geometric/deterministic; BLUE adds statistical assumptions about error structure; ML further assumes a specific distribution (normal).
  • Generalization ahead: the chapter introduces concepts for two observations and one unknown, which will extend to m observations and n unknowns in the next chapter.

📐 The three estimation methods

📐 Least Squares (LSQ)

LSQ-method: a deterministic method applied to the model y = ax + e.

  • Treats measurement errors e as fixed (non-random) quantities.
  • Minimizes the sum of squared residuals weighted by a matrix W.
  • No probabilistic assumptions about the observations.
  • Example: Given two measurements y₁ and y₂, find x that minimizes the weighted squared deviations.

📊 Best Linear Unbiased Estimation (BLUE)

BLUE-method: applied when measurement errors e are random with E{e} = 0 and E{ee*} = Q_y.

  • The model becomes: random y with known mean structure and covariance.
  • Assumptions:
    • Expected value of errors is zero: E{e} = 0
    • Covariance matrix of observations is known: E{ee*} = Q_y
  • Produces the best (minimum variance) linear unbiased estimator under these assumptions.
  • Don't confuse: BLUE does not assume a specific distribution shape, only mean and covariance.

🎯 Maximum Likelihood (ML)

ML-method: adds the assumption that y is normally distributed to the BLUE model.

  • Full model: y is normally distributed with mean ax and covariance Q_y.
  • Estimation principle: find x that maximizes the likelihood (probability density) of observing y.
  • Geometric interpretation from the excerpt:
    • Translate a family of ellipses (constant probability contours) along direction vector a.
    • The ML-estimate x̂ corresponds to the center ax̂ when y becomes a point of tangency.
    • At tangency, the normal to the ellipse Q_y^(-1)(y - ax̂) is orthogonal to the direction a.
  • Key result: a* Q_y^(-1)(y - ax̂) = 0, which shows ML and LSQ with W = Q_y^(-1) are identical.

🔗 When the methods converge

🔗 Identical solutions under normality

| Method | Model assumptions | When identical to others |
| --- | --- | --- |
| LSQ | Deterministic: y = ax + e | When W = Q_y^(-1) |
| BLUE | Random e, E{e}=0, E{ee*}=Q_y | Always best linear unbiased |
| ML | Normal distribution added | Identical to BLUE and LSQ (with W=Q_y^(-1)) |

  • The excerpt emphasizes: "in case of the normal distribution, ML-estimators and LSQ-estimators with W = Q_y^(-1) are identical!"
  • This convergence happens because:
    • The normal distribution's geometry (elliptical contours) aligns with the weighted least squares criterion.
    • The orthogonality condition from ML (a* Q_y^(-1)(y - ax̂) = 0) matches the LSQ normal equations when W = Q_y^(-1).

🧮 The tangency condition

  • At the ML-estimate, the normal to the probability ellipse Q_y^(-1)(y - ax̂) is perpendicular to the model direction a.
  • This geometric condition translates to: a* Q_y^(-1)(y - ax̂) = 0.
  • Example: Imagine sliding a family of concentric ellipses along a line until one just touches the observation point y; the center of that ellipse gives the ML-estimate.

🚀 Generalization to higher dimensions

🚀 From simple to general

  • Current scope: two observations (y₁, y₂) and one unknown (x).
  • Next chapter: m observations (y₁, y₂, ..., y_m) and n unknowns (x₁, x₂, ..., x_n).
  • The general model becomes: y (m×1 vector) is normally distributed with mean Ax (where A is m×n matrix) and covariance Q_y.

📋 Model evolution summary

The excerpt shows three stages of model refinement:

  1. Deterministic (LSQ): y = ax + e (equation 42)
  2. Stochastic (BLUE): random y, E{e}=0, E{ee*}=Q_y (equation 44)
  3. Probabilistic (ML): y ~ Normal(ax, Q_y) (equation 45)
  4. High-dimensional: y ~ Normal(Ax, Q_y) where A is m×n (equation 46)
  • Don't confuse: each stage adds assumptions; LSQ requires the least, ML requires the most (but yields the strongest conclusions under normality).
10

Levelling Networks and Observation Equations

2.1 Introduction

🧭 Overview

🧠 One-sentence thesis

When a surveying network has more measurements than strictly necessary unknowns, the resulting inconsistencies from measurement errors must be handled through least-squares estimation by finding the solution that brings the computed values as close as possible to all observations.

📌 Key points (3–5)

  • Redundancy is deliberate: more measurements than needed (9 height differences for 5 unknowns) allow detection of blunders and outliers.
  • Different measurement sets give different answers: using different subsets of measurements yields different computed heights because each measurement contains errors.
  • Observation equations: the linear system y = Ax + e relates m observations to n unknowns plus random errors.
  • Common confusion: "too many measurements" is not waste—redundancy is essential for quality control and blunder detection.
  • Least-squares solution: estimate x as x̂ such that the residual (y - Ax̂) is orthogonal to the column space of A, yielding x̂ = (A*A)⁻¹A*y when A*A is invertible.

📐 The levelling network example

📐 Setup and objective

  • A levelling network connects points 0, 1, 2, 3, 4, and 5 with measured height differences.
  • Point 0 has a known height in the NAP reference system (Normaal Amsterdams Peil).
  • Total measurements: 9 levelled height differences.
  • Unknowns: heights of the remaining 5 points.
  • Goal: determine the 5 unknown heights from the 9 measured height differences.

🛤️ Two alternative measurement paths

The excerpt shows two different ways to compute the heights:

| Configuration | Measurements used | Path |
|---|---|---|
| Figure 2.2 | y₁, y₆, y₈, y₉, y₄ (5 out of 9) | Start at 0 → determine 4, 5, 3, 2, 1 |
| Figure 2.3 | Different set of 5 out of 9 | Start at 0 → determine 1, 2, 5, 3, 4 |
  • If measurements were perfect, both paths would yield identical heights.
  • Because perfect measurements do not exist, the two solutions will differ due to different measurement errors in each subset.

⚠️ Why redundancy matters

Strictly speaking, only 5 out of the 9 available height differences are needed to determine the 5 unknown heights.

  • With exactly 5 measurements, you can never detect blunders or outliers.
  • Redundant measurements (more than the minimum) enable quality control.
  • Network design courses teach how many measurements are needed and how to distribute them to achieve a certain quality.
  • Don't confuse: "too many measurements" with waste—redundancy is a feature, not a bug.

🧮 From example to general model

🧮 Writing the observation equations

For the 9 height differences and 5 unknown heights (denoted x₁, x₂, …, x₅) plus known height x₀ of point 0, the excerpt writes 9 equations relating measurements to unknowns.

In matrix notation:

  • The system becomes y = Ax + e (assuming x₀ = 0 for simplicity).
  • m = 9 (number of observations).
  • n = 5 (number of unknowns).
  • Matrix A has 9 rows and 5 columns.
  • e represents the unknown random measurement errors.

Observation equations: the linear system y = Ax + e.

  • If x₀ ≠ 0, the known vector can be moved to the left-hand side to maintain the form y = Ax + e.
  • This is the multidimensional generalization of the two-observation, one-unknown case from chapter 1.
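
A short sketch of how such a system can be assembled (the text does not list which point pairs the 9 height differences connect, so the connectivity below is hypothetical): each measured height difference contributes one row of A with a +1 for the "to" point and a -1 for the "from" point, and the known point 0 drops out because x₀ = 0 was moved to the left-hand side.

```python
import numpy as np

# Hypothetical connectivity of the levelling network (points 0..5, 9 measurements).
height_differences = [   # (from_point, to_point) for each observed difference
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
    (0, 5), (1, 3), (2, 5), (0, 4),
]
n = 5                               # unknown heights x_1 ... x_5
m = len(height_differences)         # 9 observations

A = np.zeros((m, n))
for k, (i, j) in enumerate(height_differences):
    if j != 0:
        A[k, j - 1] = +1.0          # + height of the "to" point
    if i != 0:
        A[k, i - 1] = -1.0          # - height of the "from" point
    # point 0 contributes nothing: its known height sits on the left-hand side

print(A.shape)                       # (9, 5): m = 9, n = 5
print(np.linalg.matrix_rank(A))      # 5 -> full column rank, m - n = 4 redundant
```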

📏 Restriction on dimensions

  • The number of observations m must be ≥ the number of unknowns n (i.e., n ≤ m).
  • This ensures the system is not underdetermined.

🎯 Least-squares estimation for multiple unknowns

🎯 The problem setup

Given an observation vector y, solve for unknown vectors x and e in:

  • y = Ax + e

Writing A as column vectors a₁, a₂, …, aₙ:

  • y = x₁a₁ + x₂a₂ + … + xₙaₙ + e

🧭 Geometric intuition (m=3, n=2 example)

The excerpt visualizes the problem:

  • The observation vector y is given.
  • The estimate x̂ should make Ax̂ as close as possible to y.
  • This happens when the residual ê = y - Ax̂ is orthogonal to the "plane" spanned by the column vectors of A.

It is intuitively appealing to estimate x as x̂ such that Ax̂ is as close as possible to the given observation vector y.

🔧 Deriving the normal equations

Orthogonality condition:

  • The residual (y - Σaᵢx̂ᵢ) must be orthogonal to each column vector aⱼ.
  • This means: aⱼ* (y - Ax̂) = 0 for each j.
  • Stacking these n conditions: A* (y - Ax̂) = 0.
  • Rearranging: A*Ax̂ = A*y.

This is a system of n linear equations with n unknowns.

✅ The least-squares solution

If the matrix A*A is invertible, the solution is unique:

x̂ = (A*A)⁻¹A*y

  • Compare with the one-unknown case in chapter 1, section 1.2.
  • The n×n matrix A*A is invertible if and only if rank(A*A) = n.
  • Since rank(A*A) = rank(A), invertibility depends on A having full column rank.
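
A small numerical sketch of this solution (the 3×2 design matrix and the simulated errors are invented for illustration): solve the normal equations and check that the residual is orthogonal to the columns of A.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy overdetermined system: m = 3 observations, n = 2 unknowns (values invented).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([2.0, 3.0])
y = A @ x_true + rng.normal(0.0, 0.01, size=3)   # add small measurement errors

# Normal equations: A*A x_hat = A*y  ->  x_hat = (A*A)^(-1) A*y
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Orthogonality check: the residual is perpendicular to the column space of A
e_hat = y - A @ x_hat
print(x_hat)
print(A.T @ e_hat)                                # numerically ~ [0, 0]
```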

🔍 Don't confuse

  • The least-squares estimate x̂ is not "the true value"—it is the value that minimizes the sum of squared residuals given the observations.
  • Orthogonality is the key geometric condition: the residual must be perpendicular to the column space of A.
11

Least-Squares Estimation

2.2 Least-Squares estimation

🧭 Overview

🧠 One-sentence thesis

Least-squares estimation solves overdetermined systems by finding the parameter vector that minimizes the weighted sum of squared residuals, yielding unbiased estimates when the observation errors have zero mean.

📌 Key points (3–5)

  • Why we need least-squares: When more observations than unknowns exist (e.g., 9 height differences for 5 unknown heights), inconsistencies arise, and least-squares resolves them while enabling blunder detection.
  • The geometric intuition: The estimate makes the residual vector orthogonal to the column space of the design matrix, which is equivalent to minimizing the sum of squared errors.
  • Weighted vs unweighted: The unweighted version minimizes (y - Ax)*(y - Ax); the weighted version uses a positive definite weight matrix W to account for different observation qualities.
  • Unbiasedness property: If observation errors have zero mean, the least-squares estimator is unbiased (its expected value equals the true parameter) regardless of the weight matrix choice.
  • Common confusion: Rank matters—the matrix A must have full column rank (rank A = n) for a unique solution; rank defects occur when absolute references are missing (e.g., unknown reference height).

📐 The overdetermined system problem

📏 Why measure more than needed

  • Strictly speaking, only n observations are needed to determine n unknowns.
  • However, with exactly n measurements, you can never detect blunders or outliers.
  • Example: In the levelling network with 5 unknown heights, 9 height differences are measured instead of the minimum 5.
  • The trade-off: redundant measurements create inconsistencies but enable quality control.

🔢 The observation equation model

Observation equations: a system of the form y = Ax + e, where y is the m×1 observation vector, A is the m×n design matrix, x is the n×1 unknown parameter vector, and e is the m×1 random error vector.

  • The system has m observations and n unknowns, with m ≥ n.
  • Each equation relates measurements to unknowns plus an unknown random error.
  • Matrix form: the 9 levelling equations become y = Ax + e with m=9 rows and n=5 columns.

⚠️ The rank requirement

  • The matrix A must have full rank n (rank A = n) for a unique solution.
  • This ensures that A*A (the normal matrix) is invertible.
  • Don't confuse: rank A = rank(A*A) always holds (proven in the excerpt via nullspace equality).
  • Example of rank defect: if the reference point height is also unknown, the system has rank defect of 1 because absolute heights cannot be determined from height differences alone.

🎯 Deriving the least-squares solution

🔺 Geometric interpretation (orthogonality principle)

  • Visualize for m=3, n=2: the observation vector y, the column space spanned by columns of A, and the estimate Ax-hat.
  • The key insight: estimate x as x-hat such that Ax-hat is as close as possible to y.
  • This happens when the residual vector e-hat = y - Ax-hat is orthogonal to the column space of A.
  • Mathematically: the residual must be perpendicular to every column vector of A.

🧮 The normal equations

The orthogonality condition leads to:

  • A*(y - Ax-hat) = 0
  • Expanding: A*y - A*Ax-hat = 0
  • Rearranging: A*Ax-hat = A*y

Normal equations: the system A*Ax = A*y, which has n equations with n unknowns.

If A*A is invertible (i.e., rank A = n), the unique solution is:

  • x-hat = (A*A)^(-1) A*y

⚖️ Weighted least-squares

The unweighted version minimizes: (y - Ax)*(y - Ax)

The weighted version minimizes: (y - Ax)*W(y - Ax), where W is an m×m symmetric positive definite weight matrix.

The weighted solution becomes:

  • x-hat = (A*WA)^(-1) A*Wy
  • e-hat = y - Ax-hat = y - A(A*WA)^(-1)A*Wy

The weight matrix W allows different observations to have different influence based on their quality or precision.
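
A minimal sketch of the weighted version (the design matrix and the per-observation standard deviations are invented): the weighted normal equations A*WAx̂ = A*Wy are solved, and the weighted orthogonality condition A*W(y - Ax̂) = 0 replaces the unweighted one.

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # illustrative design matrix (m=3, n=2)
x_true = np.array([2.0, 3.0])
sigmas = np.array([0.01, 0.02, 0.05])   # third observation is the least precise
y = A @ x_true + rng.normal(0.0, sigmas)

W = np.diag(1.0 / sigmas**2)            # weight precise observations more heavily

# Weighted least-squares: x_hat = (A*WA)^(-1) A*W y
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
e_hat = y - A @ x_hat

print(x_hat)
print(A.T @ W @ e_hat)                  # weighted orthogonality condition, ~ [0, 0]
```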

🧪 Proof that weighted solution minimizes

The excerpt provides an algebraic proof:

  • Expand (y - Ax)*W(y - Ax) around x-hat
  • Show it equals (y - Ax-hat)*W(y - Ax-hat) plus a positive term (x - x-hat)*A*WA(x - x-hat)
  • The second term vanishes only when x = x-hat, proving minimality

📊 Statistical properties of estimators

📈 Propagation laws for linear functions

When random vectors u and v are linearly related as v = Mu + m (M is a constant matrix, m is a constant vector):

Propagation law of means:

  • E{v} = M·E{u} + m
  • The mean of v can be computed from the mean of u alone.

Propagation law of variances:

  • Q_v = M·Q_u·M*
  • Where Q denotes the covariance matrix.
  • The covariance matrix of v can be computed from the covariance matrix of u alone.

Propagation law of covariances:

  • Q_uv = Q_u·M*
  • This gives the covariance between u and v.

Don't confuse: these laws apply only to linear functions; nonlinear functions generally require the full probability density function.
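
A quick Monte Carlo sketch of these propagation laws for an invented linear relation v = Mu + m: the sample mean and covariance of v match M·E{u} + m and M·Q_u·M* (a normal distribution is used here only to generate samples; the laws themselves need no distributional assumption).

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented linear relation v = M u + m_vec
M = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])
m_vec = np.array([1.0, 0.0, -2.0])
E_u = np.array([4.0, 1.0])
Q_u = np.array([[0.25, 0.05],
                [0.05, 0.16]])

# Propagation laws: only E{u} and Q_u are needed
E_v = M @ E_u + m_vec
Q_v = M @ Q_u @ M.T
Q_uv = Q_u @ M.T

# Monte Carlo check of the same quantities
u = rng.multivariate_normal(E_u, Q_u, size=200_000)
v = u @ M.T + m_vec
print(np.abs(v.mean(axis=0) - E_v).max())   # small sampling error
print(np.abs(np.cov(v.T) - Q_v).max())      # small sampling error
```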

✅ Unbiasedness of least-squares estimators

Assume the model y = Ax + e with:

  • E{e} = 0 (errors have zero mean)
  • E{ee*} = Q_y (covariance matrix of observations)

This implies E{y} = Ax (the expected observation equals the deterministic model).

Applying the propagation law of means to x-hat = (A*WA)^(-1)A*Wy:

  • E{x-hat} = (A*WA)^(-1)A*W·E{y}
  • E{x-hat} = (A*WA)^(-1)A*W·Ax
  • E{x-hat} = x

Unbiased estimator: an estimator whose expected value equals the true parameter value.

Key insight: unbiasedness holds regardless of the choice of weight matrix W.

📉 Covariance matrices of estimators

Applying the propagation law of variances to x-hat:

  • Q_x-hat = (A*WA)^(-1)A*W·Q_y·WA(A*WA)^(-1)

Applying the propagation law of covariances between y and x-hat:

  • Q_y,x-hat = Q_y·WA(A*WA)^(-1)

These formulas show how uncertainty in observations propagates to uncertainty in parameter estimates through the linear transformation defined by the least-squares solution.

12

The Mean and Covariance Matrix of Least-Squares Estimators

2.3 The mean and covariance matrix of Least-Squares estimators

🧭 Overview

🧠 One-sentence thesis

When observation vectors and estimators are linearly related, their means and covariance matrices can be computed using propagation laws without needing the full probability distribution, and least-squares estimators are unbiased if the observation errors have zero mean.

📌 Key points (3–5)

  • Propagation laws for linear functions: when two random vectors are linearly related, the mean and covariance matrix of one can be computed from the mean and covariance matrix of the other, without knowing the complete probability density function.
  • Unbiasedness of least-squares estimators: if the observation errors have zero mean, the least-squares estimators are unbiased regardless of the choice of weight matrix W.
  • Covariance matrices of estimators: the covariance matrices of the parameter estimate, residual estimate, and observation estimate can all be derived by applying the propagation laws to the least-squares formulas.
  • Common confusion: the quadratic form (residual-transpose times weight times residual) is not a linear function of observations, so its mean requires a different approach using the trace of a matrix product.
  • Distribution-free results: all mean and covariance results hold without assuming any specific probability distribution for the observations; only normality assumptions are needed for the variance of the quadratic form.

📐 Propagation Laws for Means and Covariances

📊 Definition of mean for random vectors

The mean of a random vector v, denoted E{v}, is defined by integrating the vector function F(u) weighted by the probability density function of u over all possible values.

  • This definition shows that computing the mean requires knowing both the functional relationship F(.) and the probability density function of u.
  • For nonlinear functions, this integral can be complex or impossible to solve analytically.

➕ Propagation law of means (linear case)

When two random vectors u and v are related linearly as v = M u + m (where M is a constant matrix and m is a constant vector):

E{v} = M E{u} + m

  • The mean of v can be computed once the mean of u is known.
  • The key insight: for linear functions, you do not need the full probability density function—only the mean of u.
  • Example: if u represents raw measurements with known mean, and v represents transformed measurements (e.g., scaled or shifted), the mean of v follows directly from the linear transformation.
  • Don't confuse: this result holds only for linear functions; nonlinear transformations generally require the full probability density function.

📈 Propagation law of variances and covariances (linear case)

When u and v are linearly related as v = M u + m:

The covariance matrix of v is Q_v = M Q_u M-transpose

where Q_u is the covariance matrix of u.

  • The covariance matrix of v can be computed once the covariance matrix of u is known.
  • Again, the complete probability density function is not needed—only the covariance matrix of u.
  • The excerpt also states a similar result for the covariance between u and v: Q_uv = Q_u M-transpose.

🔍 Why linearity matters

PropertyLinear functionsNonlinear functions
Mean propagationRequires only E{u}Generally requires full probability density function
Covariance propagationRequires only Q_uGenerally requires full probability density function
SimplicityDirect matrix multiplicationComplex integration
  • The excerpt emphasizes that these simplifications apply "in case u and v are linearly related."
  • This is the foundation for deriving properties of least-squares estimators, which are linear functions of the observations.

🎯 Mean and Covariance of Least-Squares Estimators

🧮 The observation model

The model used is:

y = A x + e

where:

  • y is the observation vector
  • A is the design matrix
  • x is the parameter vector
  • e is the error vector

The assumptions are:

  • E{e} = 0 (errors have zero mean)
  • E{e e-transpose} = Q_y (covariance matrix of errors equals Q_y)

From these assumptions, it follows that E{y} = A x and the covariance matrix of y is Q_y.

✅ Unbiasedness of least-squares estimators

Applying the propagation law for means to the least-squares estimators gives:

E{x-hat} = x, E{e-hat} = 0, E{y-hat} = A x

  • This shows that the least-squares estimators are unbiased estimators.
  • The excerpt emphasizes: "this property of unbiasedness is independent of the choice for W" (the weight matrix).
  • Example: if the true parameter is x, the expected value of the estimate x-hat equals x, meaning on average the estimator hits the true value.
  • Don't confuse: unbiasedness does not mean the estimate is always correct; it means the average over many repetitions equals the true value.

📊 Covariance matrices of the estimators

Applying the propagation law for variances to the least-squares estimators gives:

  • Covariance matrix of x-hat: Q_x-hat = (A*WA)^(-1)A*W Q_y WA(A*WA)^(-1)
  • Covariance matrix of e-hat: Q_e-hat = (I - A(A*WA)^(-1)A*W) Q_y (I - WA(A*WA)^(-1)A*)
  • Covariance matrix of y-hat: Q_y-hat = A(A*WA)^(-1)A*W Q_y WA(A*WA)^(-1)A*

where:

  • W is the weight matrix
  • I is the identity matrix
  • The superscript ^(-1) denotes the matrix inverse
  • The superscript * denotes the matrix transpose

Applying the propagation law of covariances gives:

  • Covariance between x-hat and e-hat: Q_x-hat,e-hat = (A*WA)^(-1)A*W Q_y (I - WA(A*WA)^(-1)A*)

🔧 Practical interpretation

  • These formulas allow computing the uncertainty (covariance) of the estimates from the uncertainty of the observations.
  • The weight matrix W influences the covariance matrices but not the unbiasedness.
  • All these results are derived without assuming any specific probability distribution for y.

🧩 The Mean of the Quadratic Form

⚠️ Why the quadratic form is special

The quadratic form e-hat-transpose W e-hat (residual-transpose times weight times residual) is not a linear function of y.

  • Therefore, the propagation law for means (equation 21) cannot be applied directly.
  • The excerpt shows how to work around this by expressing the quadratic form as a linear combination of products of elements.

🔢 Expressing as a linear combination

If e-hat_i is the i-th element of e-hat and (W)_ij is the (i,j)-th element of W, then:

e-hat-transpose W e-hat = sum over all i and j of (W)_ij e-hat_i e-hat_j

  • This shows the quadratic form is a linear combination of products e-hat_i e-hat_j.
  • The expectation of each product E{e-hat_i e-hat_j} is the covariance between e-hat_i and e-hat_j, which is the (i,j)-th element of Q_e-hat.

📌 The trace formula

Through algebraic manipulation, the excerpt derives:

E{e-hat-transpose W e-hat} = trace(W Q_e-hat)

where trace is the sum of the diagonal elements of the matrix product W Q_e-hat.

  • This result allows computing the mean of the quadratic form from the weight matrix and the covariance matrix of the residuals.
  • The derivation uses the symmetry of the covariance matrix: (Q_e-hat)_ij = (Q_e-hat)_ji.
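
A numerical sketch of the trace formula (model, covariances, and weight choice invented for illustration): with W = Q_y^(-1), trace(W Q_e-hat) equals m - n, and a simulation of E{e-hat* W e-hat} reproduces that value.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented model y = A x + e, with the weight matrix chosen as W = Q_y^(-1).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
m, n = A.shape
Q_y = np.diag([0.04, 0.09, 0.04, 0.16])
W = np.linalg.inv(Q_y)
x_true = np.array([1.0, 2.0])

# y_hat = P y with P = A (A*WA)^(-1) A*W, and Q_e_hat by variance propagation
P = A @ np.linalg.solve(A.T @ W @ A, A.T @ W)
Q_e_hat = (np.eye(m) - P) @ Q_y @ (np.eye(m) - P).T

print(np.trace(W @ Q_e_hat))         # with W = Q_y^(-1): equals m - n = 2

# Monte Carlo check of the mean of the quadratic form e_hat* W e_hat
e = rng.multivariate_normal(np.zeros(m), Q_y, size=50_000)
Y = x_true @ A.T + e                 # each row is one simulated observation vector
E_hat = Y - Y @ P.T                  # row-wise residuals e_hat = (I - P) y
vals = np.einsum('ij,jk,ik->i', E_hat, W, E_hat)
print(vals.mean())                   # close to trace(W Q_e_hat)
```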

📉 Variance of the quadratic form

The excerpt states without proof that if y is normally distributed with mean and covariance as given, then:

variance of e-hat-transpose W e-hat = 2 trace(W Q_e-hat W Q_e-hat)

  • This result requires the normality assumption, unlike the mean and covariance results.
  • The excerpt notes that deriving the variance is complex even under normality, and even more complicated without distributional assumptions.
  • Don't confuse: the mean of the quadratic form (trace formula) does not require normality, but the variance formula does.

🏆 Best Linear Unbiased Estimation (BLUE)

🎯 The estimation problem

The goal is to estimate a parameter q that is a linear function of x:

q = f-transpose x

where f is a known vector.

  • Example: in a leveling network, q might represent the height difference between two points, and f would select the appropriate combination of parameters.
  • The estimator q-hat should be a linear function of the observations y: q-hat = l-transpose y, where l is a vector to be determined.

📋 The three criteria (L, U, B properties)

The estimator q-hat must satisfy:

  1. L-property (Linear): q-hat = l-transpose y (linear function of observations)
  2. U-property (Unbiased): E{q-hat} = q (expected value equals true value)
  3. B-property (Best): variance of q-hat is minimized

🔍 The unbiasedness condition

Substituting q-hat = l-transpose y into the unbiasedness condition E{q-hat} = q and using the model y = A x + e with E{e} = 0 gives:

A-transpose l = f

  • This is a system of n linear equations with m unknowns (the elements of l).
  • Every vector l that satisfies this equation gives a linear unbiased estimator of q.
  • The excerpt shows that l = C (A-transpose C)-inverse f is a solution, where C is any m×n matrix such that (A-transpose C)-inverse exists.

🧮 General solution structure

The general solution of the inhomogeneous system A-transpose l = f is:

l = particular solution + any vector in the nullspace of A-transpose

  • The nullspace N(A-transpose) consists of all vectors z such that A-transpose z = 0.
  • The dimension of N(A-transpose) equals m - rank(A).
  • If rank(A) = n (full column rank), then dim N(A-transpose) = m - n.

✨ Connection to weighted least-squares

One choice for the matrix C is C = WA (with W the symmetric weight matrix), which gives:

q-hat = f*(A*WA)^(-1)A*Wy

  • This is the weighted least-squares estimator.
  • The excerpt notes: "This shows once again that the weighted least-squares estimator is an unbiased estimator."
  • The "best" part (B-property) would involve choosing W to minimize the variance, but the excerpt does not complete this derivation in the provided text.
13

Best Linear Unbiased Estimation

2.4 Best Linear Unbiased Estimation

🧭 Overview

🧠 One-sentence thesis

The best linear unbiased estimator (BLUE) of a parameter is identical to the weighted least-squares estimator when the weight matrix equals the inverse of the observation covariance matrix, and it minimizes variance among all linear unbiased estimators.

📌 Key points (3–5)

  • What BLUE means: an estimator that is (L) a linear function of observations, (U) unbiased (expected value equals true parameter), and (B) has minimum variance among all linear unbiased estimators.
  • Key identity: the BLUE of the parameter vector x is identical to the weighted least-squares estimator when the weight matrix W equals the inverse of the observation covariance matrix Q_y.
  • How to find BLUE: solve a constrained minimization problem—minimize variance subject to the unbiasedness constraint—which leads to a specific formula involving the design matrix A and covariance matrix Q_y.
  • Common confusion: not all linear unbiased estimators are "best"—the class of linear unbiased estimators forms a linear manifold, and BLUE is the unique member with minimum variance.
  • Special case: when the number of observations m equals the number of unknowns n (no redundancy), the estimate is simply the unique solution of the system, and you cannot improve the estimate of the expected observations beyond the observations themselves.

🎯 The three BLUE criteria

📐 L-property: linearity

The estimator q-hat must be a linear function of the observations y.

  • Written as: q-hat = l*y, where l is a vector to be determined.
  • This restricts the search to estimators that are linear combinations of the observed data.
  • Example: if you have three observations, the estimator must be of the form: (some coefficient) times observation-1 plus (another coefficient) times observation-2 plus (another coefficient) times observation-3.

🎯 U-property: unbiasedness

The expected value of the estimator must equal the true parameter: E{q-hat} = q.

  • This constraint ensures no systematic error.
  • Substituting the linear form into this condition and using the model y = Ax gives: A*l = f, where q = f*x.
  • This is a system of n linear equations with m unknowns (the elements of vector l).
  • Every vector l satisfying this equation gives a linear unbiased estimator, but not necessarily the best one.

📊 B-property: minimum variance

Among all linear unbiased estimators, choose the one with minimum variance.

  • The variance of q-hat is l*Q_y l, where Q_y is the covariance matrix of observations.
  • The minimization problem: minimize l*Q_y l subject to A*l = f.
  • This constrained optimization problem has a unique solution that defines the BLUE.

🔍 The class of linear unbiased estimators

🧩 General solution structure

  • The constraint A*l = f is an inhomogeneous system.
  • From linear algebra, the general solution is l = l_o + Bz, where:
    • l_o is any particular solution of A*l = f
    • B is an m×(m−n) matrix satisfying A*B = 0 (its columns span the nullspace of A*)
    • z is an arbitrary (m−n)-dimensional vector
  • All vectors l satisfying the unbiasedness constraint lie on a linear manifold parallel to the range space of matrix B.

📏 Dimension of the solution space

  • The nullspace of A-transpose has dimension m - n.
  • This follows from the dimension theorem: dimension of nullspace plus dimension of range space equals m.
  • Since rank of A-transpose equals rank of A equals n (assumed full rank), the nullspace has dimension m - n.
  • There exist m - n linearly independent vectors that form a basis of this nullspace, collected into matrix B.

🏆 Finding the BLUE

🎲 The minimization problem

  • Minimize: l*Q_y l
  • Subject to: l = l_o + Bz (the general form of all unbiased estimators)
  • This can be rewritten as: minimize z*(B*Q_yB)z + 2z*(B*Q_y l_o) + constant

✅ The solution

  • The optimal z is: z = −(B*Q_yB)⁻¹B*Q_y l_o
  • Substituting back gives: l = l_o − B(B*Q_yB)⁻¹B*Q_y l_o
  • The BLUE of q is then: q-hat = l*y

🔑 Key identity for simplification

  • The excerpt proves the identity: I = A(A*Q_y⁻¹A)⁻¹A*Q_y⁻¹ + Q_yB(B*Q_yB)⁻¹B*
  • This identity allows expressing the solution in terms of matrix A (from the original model) rather than matrix B.
  • Using this identity, the BLUE simplifies to: q-hat = f*(A*Q_y⁻¹A)⁻¹A*Q_y⁻¹y

🔗 Connection to weighted least-squares

🎯 The main result

The BLUE of the parameter vector x is: x-hat = (A*Q_y⁻¹A)⁻¹A*Q_y⁻¹y

  • This is identical to the weighted least-squares estimator when the weight matrix W equals Q_y⁻¹.
  • The BLUE of any linear function q = f*x is q-hat = f*x-hat, where x-hat is the BLUE of x.

📊 Covariance properties

When W = Q_y-inverse (the BLUE case):

| Quantity | Covariance matrix |
|---|---|
| x-hat | Q_x-hat = (A*Q_y⁻¹A)⁻¹ |
| y-hat | Q_y-hat = A(A*Q_y⁻¹A)⁻¹A* |
| e-hat (residuals) | Q_e-hat = Q_y − A(A*Q_y⁻¹A)⁻¹A* |

  • Important property: the estimator e-hat is not correlated with either x-hat or y-hat in the BLUE case.
  • The expected value of e-hat*Q_y⁻¹e-hat equals m − n (the degrees of freedom).
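
A small numerical check of these relations (the design matrix and Q_y are invented): compute Q_x-hat, Q_y-hat, and Q_e-hat for W = Q_y⁻¹, verify that the residual estimator is uncorrelated with x-hat, and that trace(Q_y⁻¹Q_e-hat) = m − n.

```python
import numpy as np

# Invented A and Q_y, only to illustrate the BLUE covariance relations.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
m, n = A.shape
Q_y = np.diag([0.04, 0.09, 0.04, 0.16])
Qi = np.linalg.inv(Q_y)                      # W = Q_y^(-1)

Q_x_hat = np.linalg.inv(A.T @ Qi @ A)
Q_y_hat = A @ Q_x_hat @ A.T
Q_e_hat = Q_y - Q_y_hat

# x_hat = L_x y and e_hat = L_e y; cross-covariance via the propagation laws
L_x = Q_x_hat @ A.T @ Qi
L_e = np.eye(m) - A @ L_x
Q_xe = L_x @ Q_y @ L_e.T

print(np.abs(Q_xe).max())                    # ~ 0: e_hat uncorrelated with x_hat
print(np.trace(Qi @ Q_e_hat))                # m - n = 2 (degrees of freedom)
```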

⚠️ Special case: no redundancy

🔢 When m equals n

  • If the number of observations m equals the number of unknowns n, matrix A is square.
  • With full rank assumption, A is invertible (A-inverse exists).
  • The formula simplifies dramatically: (A*Q_y⁻¹A)⁻¹ = A⁻¹Q_y(A*)⁻¹

📉 Implications of no redundancy

  • The estimate becomes: x-hat = A⁻¹y (simply the unique solution of y = Ax)
  • The predicted observations: y-hat = y (the observations themselves)
  • The residuals: e-hat = 0 (no residual error)
  • Interpretation: you have just enough observations to determine x uniquely, but not enough to improve the estimate of E{y} = Ax beyond the raw observations y.
  • Don't confuse: this is not "better estimation"—it's actually worse because you cannot check consistency or reduce uncertainty through redundancy.

🧮 Mathematical foundations

🔧 Proof technique for the key identity

  • The identity I = A(A*Q_y⁻¹A)⁻¹A*Q_y⁻¹ + Q_yB(B*Q_yB)⁻¹B* is proved by constructing a partitioned matrix R.
  • Define R as the m×m matrix R = [A  Q_yB].
  • Show that R is invertible by proving that the only solution of R(a*, b*)* = 0 is the trivial solution (both a and b are zero vectors).
  • Use the fact that A*B = 0 (from the nullspace property) and that both A and B have full rank.
  • The inverse of R can be written explicitly, confirming the identity.

📐 Why the matrix is invertible

  • Matrix R = [A, Q_y times B] is m-by-m and square.
  • It is invertible if and only if it has full rank m.
  • Full rank is equivalent to: the only solution of R times vector = 0 is the zero vector.
  • The proof uses:
    • B has full rank m-n, so B times b = 0 implies b = 0
    • A has full rank n, so A times a = 0 implies a = 0
    • A-transpose times B = 0 (orthogonality condition)
  • Therefore R is invertible, and the identity holds.

🔄 Orthogonal projection interpretation

🎯 The projector matrix

  • The matrix P_A = A(A*Q_y⁻¹A)⁻¹A*Q_y⁻¹ is an orthogonal projector.
  • It projects onto the range space of A and along its orthogonal complement.
  • The inner product used is ⟨u, v⟩ = u*Q_y⁻¹v, where Q_y⁻¹ is the matrix representation of the inner product.

📊 Range space and orthogonal complement

Range space R(A): the set of all vectors of the form Ax for some x.

Orthogonal complement R(A)^⊥: the set of all vectors orthogonal to R(A) under the inner product defined by Q_y⁻¹.

  • Dimension of R(A) = n (the rank of A)
  • Dimension of R(A)^⊥ = m − n
  • The space of all m-dimensional vectors is the direct sum: R(A) ⊕ R(A)^⊥ (they only share the zero vector and together span the whole space)
  • Matrix B (with columns forming a basis of the nullspace of A*) represents R(A)^⊥ as: R(A)^⊥ = {Q_yBz : z arbitrary}

🔍 Unique decomposition

  • Any vector z in m-dimensional space can be uniquely decomposed as: z = u + v, where u is in R(A) and v is in R(A)^.
  • The vector u is the orthogonal projection of z onto R(A).
  • The projector matrix P_A maps z to u.
  • This geometric interpretation underlies the BLUE: y-hat = P_A times y is the orthogonal projection of the observations onto the range space of the design matrix A.
14

The Orthogonal Projector

2.5 The orthogonal projector

🧭 Overview

🧠 One-sentence thesis

The matrix A(A*Q_y^(-1)A)^(-1)A*Q_y^(-1) is an orthogonal projector that projects any vector onto the range space of A along its orthogonal complement, and its trace equals the rank of A.

📌 Key points (3–5)

  • What an orthogonal projector does: it decomposes any vector uniquely into a component in the range space R(A) and a component in the orthogonal complement R(A)^⊥.
  • The projector matrix: P_A = A(A*Q_y^(-1)A)^(-1)A*Q_y^(-1) projects onto R(A); I - P_A projects onto R(A)^⊥.
  • Key properties: P_A is idempotent (P_A P_A = P_A) and self-adjoint under the Q_y^(-1) inner product (P_A* = P_A); its eigenvalues are only 0 or 1.
  • Common confusion: the projector depends on the inner product defined by Q_y^(-1); different inner products yield different projectors.
  • Why trace matters: trace(P_A) = n (the rank of A), because P_A has exactly n eigenvalues equal to 1 and the rest equal to 0.

📐 Inner product and vector spaces

📐 The inner product structure

The m-dimensional linear vector space ℝ^m has the inner product: ⟨u, v⟩ = u*Q_y^(-1)v.

  • Matrix Q_y^(-1) is the matrix representation of this inner product.
  • This is not the standard dot product; it is weighted by Q_y^(-1).
  • The choice of inner product affects what "orthogonal" means in this context.

🗂️ Range space and orthogonal complement

The range space of the m×n matrix A is defined as: R(A) = {Az : z ∈ ℝ^n}.

  • R(A) is a linear subspace of ℝ^m with dimension equal to rank A, which is n.
  • The orthogonal complement R(A)^⊥ is the set of all vectors orthogonal to R(A) under the inner product defined by Q_y^(-1).
  • Dimension formula: dim R(A) + dim R(A)^⊥ = m, so dim R(A)^⊥ = m - n.

🔗 Direct sum decomposition

The space ℝ^m is the direct sum of R(A) and R(A)^⊥: ℝ^m = R(A) ⊕ R(A)^⊥.

  • This means R(A) ∩ R(A)^⊥ = {0} (only the null vector is in both).
  • A basis of R(A) together with a basis of R(A)^⊥ forms a basis of ℝ^m.
  • The column vectors of A form a basis of R(A); the column vectors of Q_y B form a basis of R(A)^⊥, where B is the m×(m-n) full-rank matrix satisfying A*B = 0.

🎯 The orthogonal projection

🎯 What orthogonal projection means

If z ∈ ℝ^m can be decomposed as z = u + v with u ∈ R(A) and v ∈ R(A)^⊥, then u is called the orthogonal projection of z on R(A).

  • Every vector z in ℝ^m can be decomposed in only one way along R(A) and R(A)^⊥.
  • The component u lies in R(A); the component v lies in R(A)^⊥.
  • Example: given any observation vector, the projector splits it into a part that can be explained by the model (in R(A)) and a residual part (in R(A)^⊥).

🧮 Matrix representation of P_A

The orthogonal projector on R(A) is defined as: P_A = A(A*Q_y^(-1)A)^(-1)A*Q_y^(-1).

  • This matrix takes any z ∈ ℝ^m and returns its projection onto R(A).
  • Derivation: since every z can be written as z = Aa + Q_y B b (using the basis formed by columns of A and Q_y B), the component along R(A) is u = Aa.
  • Solving for a and substituting gives the formula for P_A.

🔄 Properties of P_A

| Property | Equation | Meaning |
|---|---|---|
| Idempotent | P_A P_A = P_A | Projecting twice is the same as projecting once |
| Symmetric (self-adjoint) | P_A* = P_A | The projector is self-adjoint under the Q_y^(-1) inner product |
| Range | R(P_A) = R(A) | The image of P_A is exactly R(A) |
| Nullspace | N(P_A) = R(A)^⊥ | Vectors in R(A)^⊥ are mapped to zero |
  • Don't confuse: idempotence means applying P_A repeatedly has no additional effect after the first application.
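
A quick numerical check of these properties, using an invented A and Q_y: idempotence, self-adjointness with respect to the Q_y^(-1) inner product (P_A* Q_y^(-1) = Q_y^(-1) P_A), trace equal to n, and eigenvalues that are only 0 or 1.

```python
import numpy as np

# Invented design matrix and covariance, only to exercise the projector formula.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
m, n = A.shape
Q_y = np.diag([0.04, 0.09, 0.04, 0.16])
Qi = np.linalg.inv(Q_y)

P_A = A @ np.linalg.inv(A.T @ Qi @ A) @ A.T @ Qi

print(np.allclose(P_A @ P_A, P_A))           # idempotent
print(np.allclose(P_A.T @ Qi, Qi @ P_A))     # self-adjoint w.r.t. the Q_y^(-1) inner product
print(round(np.trace(P_A), 10))              # trace = n = 2
print(np.sort(np.linalg.eigvals(P_A).real.round(10)))   # eigenvalues 0 or 1
```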

🔢 The complementary projector

🔢 I - P_A projects onto R(A)^⊥

I - P_A is also an orthogonal projector; it projects onto R(A)^⊥ and along R(A).

  • Notation: sometimes written as P_A^⊥.
  • Properties parallel to P_A: (I - P_A)(I - P_A) = I - P_A, (I - P_A)* = I - P_A.
  • Range and nullspace are swapped: R(I - P_A) = R(A)^⊥ and N(I - P_A) = R(A).

📊 Connection to estimation

  • From equation (49) in the previous section: the residual satisfies y - ŷ = (I - P_A)y.
  • This shows that the residual (y - ŷ) is the projection of y onto R(A)^⊥.
  • The estimate ŷ = P_A y is the projection of y onto R(A).

🧮 Eigenvalues and trace

🧮 Eigenvalues are 0 or 1

  • The eigenvalue problem for P_A is: P_A z = λz.
  • Using idempotence: P_A P_A z = λ P_A z = λ² z, but also P_A P_A z = P_A z = λz.
  • Therefore λ² z = λz, which implies λ² = λ (for non-zero z), so λ = 0 or λ = 1.

📏 Trace equals rank

trace(P_A) = n, where n is the rank of A.

  • The trace of a matrix is the sum of its eigenvalues.
  • To prove trace(P_A) = n, we need to show P_A has exactly n eigenvalues equal to 1.
  • For λ = 1, the solution space of (I - P_A)z = 0 is the nullspace of I - P_A, which is R(A).
  • Since dim R(A) = rank A = n, there are n eigenvalues equal to 1.
  • The remaining m - n eigenvalues are 0.

🔗 Application to Q_ŷ

  • From equation (61): Q_ŷ = A(A*Q_y^(-1)A)^(-1)A* = P_A Q_y.
  • Therefore: trace(Q_y^(-1) Q_ŷ) = trace(Q_y^(-1) P_A Q_y) = trace(P_A) = n.
  • This result was used in section 2.4 to derive variance formulas.

🔍 Special case: no redundancy

🔍 When m = n

  • If the number of observations m equals the number of unknowns n, there is no redundancy.
  • Matrix A is square and invertible (since rank A = n).
  • The projector simplifies: (A*Q_y^(-1)A)^(-1) = A^(-1)Q_y(A*)^(-1).

📌 Consequences for estimation

  • Substituting into the estimator formula gives: x̂ is simply the unique solution of y = Ax.
  • The best linear unbiased estimator of E{y} = Ax is ŷ = y (the observations themselves).
  • Interpretation: with m = n, there are just enough observations to determine x uniquely, but not enough to improve the estimate of E{y} beyond the raw observations.
  • Don't confuse: having enough observations to solve for x does not mean having redundancy to check or improve the model fit.
15

2.6 Summary

2.6 Summary

🧭 Overview

🧠 One-sentence thesis

The section concludes a treatment of Best Linear Unbiased Estimation by summarizing key results in a table and introducing the transition to condition equations as an alternative model formulation.

📌 Key points (3–5)

  • What this section does: provides a summary table of Best Linear Unbiased Estimation results (model, estimators, variances/covariances, orthogonal projectors).
  • Transition to a new model: introduces the condition-equation formulation as an alternative to observation equations.
  • Key relationship: the number of redundant observations (m − n) equals the number of condition equations (b).
  • Common confusion: observation equations express observations as functions of unknowns, while condition equations express constraints that observations must satisfy (e.g., loop misclosures summing to zero).

📋 Summary table structure

📋 What the table covers

The excerpt states:

In the table on the next page an overview is given of the results of Best Linear Unbiased Estimation.

  • The table organizes results under the following headings:
    • Model
    • Estimators
    • Mean
    • Variances and covariances
    • Orthogonal projectors
  • The actual content of the table is not provided in the excerpt.

🔄 Transition to condition equations

🔄 Two ways to model the same network

The excerpt introduces Chapter 3 by contrasting two formulations for a levelling network:

| Formulation | What it does | Example from excerpt |
|---|---|---|
| Observation equations | Express observed values as functions of unknown parameters | Height differences written as functions of unknown heights (equation 1, matrix form equation 2) |
| Condition equations | Express constraints that observations must satisfy | Loop misclosures must sum to zero (equation 3, matrix form equation 4) |
  • Both describe the same physical network but organize the mathematics differently.

🔗 Key relationship: redundancy equals conditions

The excerpt establishes:

Note that in the above example of the levelling network the number of redundant observations, m − n = 4, equals the number of condition equations, b = 4. This is generally true.

  • Why this is true: each observation beyond the minimum n needed to determine the unknowns gives rise to one extra condition equation.
  • Formally: m − n = b (equation 6).
  • Example: In the levelling network, m = 9 observations, n = 5 unknown heights, so 4 redundant observations → 4 condition equations (one per loop).

🔁 Loop misclosures

The excerpt uses a levelling network with four loops to illustrate condition equations:

  • Misclosure (tegenspraak): the amount by which a loop fails to close (sum of height differences around a loop ≠ 0 due to measurement errors).
  • With perfect measurements, each loop would sum to zero; with errors, misclosures t₁, t₂, t₃, t₄ exist.
  • Each loop gives one condition equation.

Don't confuse:

  • Observation equations model what you measure as a function of what you want to find.
  • Condition equations model constraints that measurements should satisfy (e.g., geometric closure).

🧮 Matrix formulations

🧮 Observation-equation form

  • Matrix equation: y = A x + e (equation 2)
  • m observations, n unknowns, rank(A) = n.
  • Number of redundant observations: m − n.

🧮 Condition-equation form

  • Matrix equation: B^T y = t (equations 4, 5)
  • B is m × b, where b is the number of condition equations.
  • t is the vector of misclosures.
  • Pre-multiplying the observation-equation matrix A by the condition-equation matrix eliminates the unknowns (B^T A = 0), which relates the two formulations; chapter 3 develops this in full.
16

The Model with Condition Equations

3.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The condition-equation model provides an implicit representation of measurements that eliminates unknown parameters by expressing constraints the observations must satisfy, and it leads to best linear unbiased estimators through minimization of variance subject to the unbiasedness constraint.

📌 Key points (3–5)

  • Two equivalent representations: observation equations (parametric, y = Ax + e) and condition equations (implicit, B*y = t) describe the same model; the number of condition equations equals the number of redundant observations (b = m - n).
  • Key relationship: the matrices A and B satisfy B*A = 0, meaning condition equations eliminate the unknown parameter vector x.
  • Unbiasedness constraint: any estimator l that can be written as l = g + Ba (for some vector a) produces an unbiased estimate of the linear function j = g*E{y}.
  • Common confusion: condition equations only exist when there are redundant observations (m > n); if m = n, no matrix B exists and no condition equations can be formed.
  • Best estimator: the optimal vector a that minimizes variance is a = −(B*Q_yB)^(-1)B*Q_y g, leading to the best linear unbiased estimator.

🔄 From observation equations to condition equations

🔄 The parametric representation (observation equations)

Observation equations: y = Ax + e, where y is the m×1 observation vector, A is m×n with rank n, x is the n×1 unknown parameter vector, and e is the measurement error vector.

  • This is the form developed in chapter 2.
  • Example: in the levelling network, height differences are written as functions of unknown heights.
  • The number of redundant observations is m - n (observations beyond the minimum needed to determine x uniquely).

🔄 The implicit representation (condition equations)

Condition equations: B*y = t, where B is an m×b matrix with rank b (so B* is b×m), y is the observation vector, and t is the misclosure vector.

  • Instead of expressing observations as functions of unknowns, these equations express constraints the observations must satisfy.
  • Example: in a levelling network with loops, the height differences around each closed loop should sum to zero; deviations from zero are misclosures t₁, t₂, t₃, t₄.
  • The misclosure vector t arises from measurement errors: t = B*e.

🔗 The fundamental relationship: B*A = 0

  • Pre-multiplying the observation equation matrix A by the condition equation matrix B yields the zero matrix.
  • This eliminates the unknown parameter vector x from the equations.
  • Why this works: the column vectors of B lie in the nullspace of A* (the transpose of A).
  • From linear algebra: dim(N(A*)) = m - n, so exactly m - n linearly independent vectors b_i exist satisfying B*A = 0.

📊 Number of condition equations equals redundancy

| Concept | Value | Meaning |
|---|---|---|
| Redundant observations | m − n | Extra observations beyond the minimum needed |
| Number of condition equations | b | Number of constraints |
| Key equality | b = m − n | Each redundant observation creates one condition equation |
  • Don't confuse: if m = n (no redundancy), no matrix B exists and no condition equations can be formed; condition equations require redundant observations.
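
A sketch of how a condition matrix B can be obtained numerically for an invented design matrix A (the SVD-based construction is one convenient choice, not necessarily the textbook's): the columns of B span the nullspace of A*, so B*A = 0 and the number of columns equals m − n.

```python
import numpy as np

# Invented design matrix A (m = 4 observations, n = 2 unknowns).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
m, n = A.shape

# Columns of U beyond rank(A) in the full SVD of A span the nullspace of A*.
U, _, _ = np.linalg.svd(A)
B = U[:, n:]                        # m x (m - n) matrix

print(B.shape[1], m - n)            # number of condition equations: b = m - n = 2
print(np.abs(B.T @ A).max())        # ~ 0, i.e. B*A = 0
```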

🎯 Best linear unbiased estimation with condition equations

🎯 The estimation problem setup

  • Goal: estimate a parameter j that is a linear function of the expected observations: j = g*E{y}.
  • Example: in the levelling network, j might represent the height difference between two points; then g is a vector selecting the appropriate observations.
  • Estimator form: j_hat = l*y (linear in observations).
  • Assumptions: E{e} = 0 and D{e} = σ²Q_y (same as chapter 2).
  • From the condition equations: E{y} ⊥ R(B) (the mean is orthogonal to the range space of B).

🎯 The three criteria (L, U, B properties)

  1. L-property (linearity): j_hat = l*y for some vector l.
  2. U-property (unbiasedness): E{j_hat} = j.
  3. B-property (best = minimum variance): minimize var(j_hat) = σ²l*Q_yl.

🔍 Deriving the unbiasedness constraint

  • Substitute j_hat = l*y into the unbiasedness condition: E{l*y} = g*E{y}.
  • Rearranging: (l - g)*E{y} = 0.
  • This means l - g is orthogonal to E{y}.
  • Since E{y} ⊥ R(B), the vector l - g must lie in R(B).
  • Key result: there exists a b×1 vector a such that l = g + Ba.

🔍 Why l = g + Ba ensures unbiasedness

  • Substitute into the estimator: j_hat = (g + Ba)*y = g*y + a*B*y.
  • Take expectation: E{j_hat} = g*E{y} + a*B*E{y}.
  • From the condition equations (15): B*E{y} = 0.
  • Therefore: E{j_hat} = g*E{y} = j ✓

Interpretation: any vector l of the form g + B*a produces an unbiased estimator; the challenge is to find the specific a that minimizes variance.

📉 Minimizing variance to find the best estimator

📉 The variance of the estimator

  • Apply the propagation law of variances to j_hat = l*y:
    • var(j_hat) = l*D{y}l = σ²l*Q_yl.
  • Substitute l = g + B*a:
    • var(j_hat) = σ²(g + Ba)*Q_y(g + Ba).

📉 The minimization problem

Optimization problem: minimize σ²(g + Ba)*Q_y(g + Ba) with respect to the vector a.

  • This is an unconstrained minimization problem in the variable a.
  • The solution is found by setting the derivative with respect to a equal to zero.

📉 The optimal solution

  • Result: a = −(B*Q_yB)^(-1)B*Q_y g.
  • Substitution into l = g + B*a gives the optimal vector l.
  • This l, when used in j_hat = l*y, produces the best linear unbiased estimator.

Verification: the excerpt states this can be verified using an algebraic proof similar to the one in section 1.2 (not shown in the excerpt).

🧮 Key mathematical tools and properties

🧮 Orthogonal projectors (from section 2.5)

Orthogonal projector P_A: projects vectors onto R(A) along R(A)^⊥.

  • Properties of P_A:
    • P_A = A(A*Q_y^(-1)A)^(-1)A*Q_y^(-1) (from equation 61).
    • P_A is symmetric and idempotent (P_A² = P_A).
    • Eigenvalues are 1 or 0.
    • Trace of P_A equals rank(A) = n.
  • The complementary projector I - P_A (also written P_A^⊥) projects onto R(A)^⊥ along R(A).

🧮 Eigenvalue analysis

  • The eigenvalue problem for P_A: P_A z = λz.
  • From P_A² = P_A, it follows that λ² = λ, so λ = 0 or λ = 1.
  • The number of eigenvalues equal to 1 is n (the rank of A).
  • Proof sketch: for λ = 1, the solution space is N(I - P_A) = R(A), which has dimension n.

🧮 Nullspace and rangespace relationships

  • From linear algebra: dim(N(A*)) + rank(A*) = m.
  • Since rank(A) = n, we have dim(N(A*)) = m - n.
  • This dimension equals the number of linearly independent condition equations.

Don't confuse: the nullspace of A* (which gives the columns of B) is different from the nullspace of A; the former has dimension m - n, the latter has dimension 0 when A has full column rank.

17

Best Linear Unbiased Estimation

3.2 Best Linear Unbiased Estimation

🧭 Overview

🧠 One-sentence thesis

The Best Linear Unbiased Estimation (BLUE) method for the condition-equations model produces estimators identical to those from the observation-equations model, differing only in representation through matrix B instead of matrix A.

📌 Key points (3–5)

  • Two equivalent model representations: observation equations (parametric, using matrix A) and condition equations (implicit, using matrix B) describe the same underlying model.
  • BLUE criteria: the estimator must be Linear (a linear function of observations), Unbiased (expected value equals the true parameter), and Best (minimum variance).
  • Key result: the estimator adjusts observations by subtracting a term that depends on the misclosure vector, ensuring the adjusted observations satisfy the condition equations.
  • Common confusion: the two models look different but yield identical estimates because they use the same estimation principle (BLUE) and represent the same underlying relationships.
  • Extension for non-zero b₀: when condition equations include a non-zero constant vector, an additional term appears in the estimators but variance matrices remain unchanged.

🔄 From observation equations to condition equations

🔄 Two representations of the same model

The excerpt establishes that a model can be written in two equivalent ways:

  • Observation equations (parametric representation): equation (2) in the form involving matrix A and unknown parameter vector x
  • Condition equations (implicit representation): equation (5) in the form B* y = 0, where B* is a b×m matrix with rank b = m - n

The transformation works because:

  • Pre-multiplying observation equations by matrix B* eliminates the parameter vector x
  • The nullspace of A* has dimension m - n, meaning m - n linearly independent vectors bᵢ exist that satisfy the orthogonality condition
  • When m = n (no redundant observations), no such matrix B exists and no condition equations can be formed

📐 The misclosure vector

The misclosure vector t arises from random measurement errors in the observations.

From the condition equations B* E{y} = 0 and the relation B* A = 0, it follows that:

  • t = B* y represents the misclosure (how much the observations fail to satisfy the conditions)
  • t = B* e, showing the misclosure is entirely due to measurement error e
  • Example: in a levelling loop, if observed height differences sum to a small non-zero value instead of zero, that sum is the misclosure

🎯 The BLUE estimation problem

🎯 Model assumptions

The starting point for estimation is:

  • B* E{y} = 0 (condition equations hold in expectation)
  • E{e} = 0 (errors have zero mean)
  • D{y} = σ²Q_y (variance-covariance structure of the observations)

These lead to:

  • E{t} = 0
  • Q_t = B* Q_y B

🎯 Estimating a linear function

Suppose we want to estimate a parameter j that is a linear function of E{y}:

  • j = g* E{y} for some known vector g
  • Example: in a levelling network between points 1 and 5, j represents the height difference and g is a vector selecting the appropriate combination of observations

The estimator ĵ must satisfy three properties:

| Property | Symbol | Meaning |
|---|---|---|
| Linear | L-property | ĵ = l* y (linear function of observations) |
| Unbiased | U-property | E{ĵ} = j (expected value equals true parameter) |
| Best | B-property | Minimum variance among all linear unbiased estimators |

🔍 Finding the unbiased estimators

Substituting ĵ = l* y into the unbiasedness condition gives:

  • l* E{y} = g* E{y} for all possible E{y}
  • This implies (l - g) is orthogonal to E{y}

Since the subspace orthogonal to E{y} is spanned by the columns of B, the vector (l - g) must lie in the range space of B:

  • l - g = B a for some b×1 vector a
  • Therefore l = g + B a

Key insight: Every vector l of this form produces an unbiased estimator. The excerpt verifies this by showing E{ĵ} = j when l has this structure.

⚖️ Minimizing variance

The variance of the estimator is:

  • Var(ĵ) = l* Q_y l = (g + B a)* Q_y (g + B a)

The minimization problem becomes:

  • Minimize a* (B* Q_y B) a + 2 a* B* Q_y g + constant terms

The solution is:

  • a = -(B* Q_y B)⁻¹ B* Q_y g

Substituting back:

  • l = g - B (B* Q_y B)⁻¹ B* Q_y g

The BLUE estimator is therefore:

  • ĵ = l* y = g* y - g* Q_y B (B* Q_y B)⁻¹ B* y

Don't confuse: the vector a is an auxiliary variable introduced to parameterize all unbiased estimators; it is not the same as the matrix A from observation equations.

📊 Estimating the observation vector

📊 The BLUE of E{y}

By choosing g to be unit vectors (1,0,...,0), (0,1,0,...,0), etc., the BLUE of E{y} itself is:

ŷ = y - Q_y B (B* Q_y B)⁻¹ B* y

This can also be written as:

  • ŷ = y - Q_y B (Q_t)⁻¹ t

where t = B* y is the misclosure vector and Q_t = B* Q_y B.

Interpretation: The estimator adjusts the observations y by subtracting a correction term that depends on the misclosure vector t, ensuring the adjusted observations satisfy the condition equations.

✅ Verification of condition equations

Pre-multiplying ŷ by B* gives:

  • B* ŷ = B* y - B* Q_y B (B* Q_y B)⁻¹ B* y = B* y - B* y = 0

This confirms that the estimator ŷ satisfies the condition equations exactly (zero misclosure).

🔗 Equivalence with observation-equations model

The excerpt states that the estimator from equation (27) is identical to the estimator derived in the previous chapter using observation equations. This is because:

  • The same estimation principle (BLUE) is applied
  • The models are equivalent, differing only in representation
  • The relation B* A = 0 connects the two matrices
  • Results differ only in appearance: chapter 2 expressed everything in terms of A, while this chapter uses B

Example: If you estimate a height difference using either the observation-equations approach (with design matrix A and parameters) or the condition-equations approach (with constraint matrix B), you get numerically the same answer.
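
A numerical illustration of this equivalence, with invented A, Q_y, and observations: the condition-equation estimator ŷ = y - Q_yB(B*Q_yB)⁻¹B*y coincides with the observation-equation BLUE, and the adjusted observations satisfy B*ŷ = 0.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented model: m = 4 observations, n = 2 unknowns, b = m - n = 2 conditions.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
m, n = A.shape
Q_y = np.diag([0.04, 0.09, 0.04, 0.16])
Qi = np.linalg.inv(Q_y)

U, _, _ = np.linalg.svd(A)
B = U[:, n:]                                   # columns span N(A*), so B*A = 0

y = A @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(m), Q_y)

# Observation-equations BLUE: y_hat = A (A* Q_y^(-1) A)^(-1) A* Q_y^(-1) y
y_hat_obs = A @ np.linalg.solve(A.T @ Qi @ A, A.T @ Qi @ y)

# Condition-equations BLUE: y_hat = y - Q_y B (B* Q_y B)^(-1) t with t = B* y
t = B.T @ y
y_hat_cond = y - Q_y @ B @ np.linalg.solve(B.T @ Q_y @ B, t)

print(np.allclose(y_hat_obs, y_hat_cond))      # True: same adjusted observations
print(np.abs(B.T @ y_hat_cond).max())          # ~ 0: conditions satisfied exactly
```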

🔧 Extension to non-zero constant vector

🔧 The b₀ ≠ 0 case

The basic condition-equations model assumes B* E{y} = 0. Sometimes condition equations include a non-zero constant:

  • B* E{y} = b₀

Example from the excerpt: In a triangulation network with two triangles and six angle observations, each triangle's angles should sum to π (not zero). The condition equations become:

  • B* y = b₀ where b₀ contains the values π for each triangle

🔧 Modified estimators

The misclosure vector and its variance become:

  • t = B* y - b₀
  • Q_t = B* Q_y B (unchanged)

The estimators are modified by adding a term involving b₀:

  • ŷ = y - Q_y B (B* Q_y B)⁻¹ (B* y - b₀)
  • ê = y − ŷ = Q_y B (B* Q_y B)⁻¹ (B* y − b₀)

Important: The non-zero b₀ vector is a vector of constants, so it does not affect the variance matrices Q_t, Q_ŷ, or Q_ê. Only the point estimates change, not their precision.

Don't confuse: Redefining observables to force b₀ = 0 is possible in simple cases but generally difficult, so the modified formulae accommodate b₀ directly.

📋 Summary comparison

📋 Main results for condition equations

The excerpt provides a summary translating results from observation equations (chapter 2) into condition-equations form:

Model: B* E{y} = 0, E{e} = 0, Q_y given

Estimators:

  • ŷ = y - Q_y B (B* Q_y B)⁻¹ B* y
  • ê = y − ŷ = Q_y B (B* Q_y B)⁻¹ B* y

Variances and covariances: Expressed in terms of B and Q_y

Orthogonal projectors: Defined using B instead of A

📋 Side-by-side comparison

The excerpt includes a comparison table showing:

  • Observation equations (left column): formulas involving matrix A
  • Condition equations (right column): formulas involving matrix B
  • The two columns represent the same estimation results in different notation

This reinforces that the choice between observation equations and condition equations is a matter of representation, not a fundamental difference in the model or results.

18

The case b₀ ≠ 0

3.3 The case b₀ ≠ 0

🧭 Overview

🧠 One-sentence thesis

When condition equations include a non-zero constant vector b₀, the least-squares estimators require an additional term compared to the zero-misclosure case, but the variance matrices remain unchanged.

📌 Key points (3–5)

  • The triangulation example: a small triangulation network with six angle observables must satisfy two conditions (each triangle's angles sum to π), introducing a non-zero b₀ vector.
  • Why b₀ ≠ 0 matters: the condition equations no longer sum to zero in expectation, requiring adaptation of the estimation formulae from the previous section.
  • How to adapt the estimators: the non-zero b₀ adds an extra term to the least-squares estimators ŷ and ê, but does not change the variance matrices.
  • Common confusion: although the model looks different from equation (5), it is still solvable without redefining observables—just extend the formulae directly.

📐 The triangulation network example

📐 Setup and condition equations

  • The excerpt presents a small triangulation network (figure 3.2) with six angle observables.
  • Two conditions must hold:
    • In each triangle, the expectation of the three observed angles sums to π.
  • Measurement errors cause small misclosures, so the condition equations do not hold exactly.

The two condition equations in matrix notation:
B* y = b₀

  • Here b = 2 (two conditions), m = 6 (six observables).
  • The matrix B and vector b₀ are given explicitly in equation (33).

🔄 Difference from the levelling-loop case

  • The levelling-loop model (equation (5)) had condition equations that summed to zero (b₀ = 0).
  • The triangulation model has b₀ ≠ 0 because each triangle's angles sum to π, not zero.
  • At first glance, one might try to redefine observables to force the form of equation (5), but the excerpt notes this is "difficult in general."

🔧 Adapting the estimation formulae

🔧 Misclosure vector with b₀

  • From equation (33), the misclosure vector t and its variance matrix are:
    • t = B* y − b₀
    • Qₜ = B* Qᵧ B
  • The misclosure now includes the constant offset b₀.

🔧 Modified estimators

The least-squares estimators (equation (35)) become:

  • ŷ = y − Qᵧ B (B* Qᵧ B)⁻¹ (B* y − b₀)
  • ê = y − ŷ

Key observation:

  • Comparing with equation (27) (the b₀ = 0 case), the non-zero b₀ adds an extra term.
  • The term (B* y − b₀) replaces the simpler B* y in the original formula.
  • Example: if b₀ were zero, the formula would reduce to the earlier result; the non-zero constant shifts the projection.

📊 Variance matrices unchanged

  • The excerpt emphasizes: "Since b₀ is a vector of constants, the extension of the model with a non-zero b₀-vector does not affect the variance matrices Qₜ, Qŷ, and Qê."
  • Why: b₀ is deterministic (no randomness), so it does not contribute to variance.
  • Don't confuse: the estimators themselves change (extra term), but their precision (variance) does not.

🧩 Relationship to earlier models

🧩 Equivalence in principle

  • The excerpt notes that model (15) (condition equations with b₀ = 0) is equivalent to the observation-equations model from chapter 2.
  • Both use the same estimation principle (BLUE) and the same underlying model, differing only in representation:
    • Chapter 2: everything expressed in terms of matrix A.
    • Chapter 3: everything expressed in terms of matrix B.
  • Results are identical, just presented differently.

🧩 Why not redefine observables?

  • One could try to rewrite the triangulation equations to force b₀ = 0 (e.g., by redefining angle observables).
  • The excerpt states this is "difficult in general," so it is more practical to adapt the formulae directly to handle b₀ ≠ 0.

📋 Summary of the adaptation

Aspect | b₀ = 0 case | b₀ ≠ 0 case
Misclosure vector | t = B* y | t = B* y − b₀
Estimator ŷ | y − Qᵧ B (B* Qᵧ B)⁻¹ B* y | y − Qᵧ B (B* Qᵧ B)⁻¹ (B* y − b₀)
Variance matrices Qₜ, Qŷ, Qê | as in the b₀ = 0 formulas | unchanged (b₀ is a constant)
Extra term in ŷ | none | + Qᵧ B (B* Qᵧ B)⁻¹ b₀
  • The non-zero b₀ shifts the estimator by a deterministic amount but does not alter the stochastic properties (variances).

3.4 Summary

🧭 Overview

🧠 One-sentence thesis

The condition-equations model (model 15) produces identical estimation results to the observation-equations model from the previous chapter, differing only in representation through matrix B instead of matrix A.

📌 Key points (3–5)

  • Equivalence of models: Model (15) with condition equations yields the same BLUE estimators as the observation-equations model; only the mathematical representation differs (matrix B vs. matrix A).
  • Extension for non-zero b₀: When condition equations include a non-zero constant vector b₀, estimators gain an additional constant term, but variance matrices remain unchanged.
  • Free variates: Observables that do not appear in condition equations are called free variates; however, their estimators are not simply equal to the observed values unless covariances are zero.
  • Common confusion: A free variate (e.g., y₁ not appearing in the condition equation) does not automatically mean its residual ê₁ is zero—covariances between observables determine whether the residual is zero.
  • Summary structure: The chapter summarizes results by translating the previous chapter's formulas from matrix A notation to matrix B notation.

🔄 Model equivalence and representation

🔄 Same principle, different notation

  • The excerpt states that model (15) is "equivalent to the model in section 2.6" and uses "the same principle of estimation, namely BLUE, and the same model."
  • The two models "only differ in their representation":
    • Chapter 2 expressed everything in terms of matrix A.
    • Chapter 3 expresses everything in terms of matrix B.
  • Because the underlying model and estimation principle are identical, "the results of estimation for model (15) are identical to the results of the previous chapter."

📋 Summary tables

  • The excerpt provides summary tables on pages 69–70 that restate the results from section 2.6 using matrix B notation.
  • These tables include:
    • Model specification
    • Estimators
    • Mean values
    • Variances and covariances
    • Orthogonal projectors
  • A comparison table contrasts "observation equations" and "condition equations" side by side.

🧮 Handling non-zero constant vectors

🧮 The case of b₀ ≠ 0

  • The basic condition-equations model (equation 5) assumes the form B* E{y} = 0.
  • When condition equations include a non-zero constant vector b₀, the model becomes B* E{y} = b₀ (equation 33).
  • Example: In a triangulation network (figure 3.2), six angle observables must satisfy two conditions—each triangle's three angles sum to π (a non-zero constant).

➕ Additional term in estimators

  • The excerpt derives that with a non-zero b₀, the least-squares estimators (equation 35) gain "an additional term" compared to the b₀ = 0 case (equation 27).
  • However, "since b₀ is a vector of constants," this extension "does not affect the variance matrices Q_t, Q_ŷ, and Q_ê."
  • In other words:
    • Estimators shift by a constant.
    • Uncertainty (variance) remains the same.

🆓 Free variates and covariance effects

🆓 What are free variates

Free variates: observables that do not appear in the condition equations.

  • The excerpt introduces this concept through a simplified example where the second component a₂ of vector a is zero and the first component b₁ of vector b is zero (equations 3 and 4).
  • In this case, the condition equation (equation 4) shows "the coefficient of E{y₁} equals zero," meaning y₁ does not appear in the condition equation.
  • Therefore, y₁ is a free variate.

⚠️ Common misconception about free variates

  • One might think: "Since E{y₁} does not appear in the condition equation (5), the estimator ŷ₁ of E{y₁} equals y₁, and thus the residual ê₁ = y₁ - ŷ₁ is zero."
  • The excerpt emphasizes: "This however is not true."
  • Figure 4.2 shows that "both ê₁ and ê₂ are unequal to zero."

🔗 Role of covariance

  • The excerpt explains: "It thus seems that the covariance, s₁₂, between y₁ and y₂ determines to what extent ê₁ = y₁ - ŷ₁ differs from zero."
  • Two cases:
    • If s₁₂ = 0 (covariance is zero, Q_y is diagonal): then ê₁ = 0 (figure 4.3).
    • If s₁₂ ≠ 0 (non-zero covariance): then also ê₁ ≠ 0.
  • Don't confuse: "free variate" means the observable does not appear in the condition equation, but it does not mean the observable is independent or that its residual vanishes—covariance with other observables matters.
Condition | Residual of free variate y₁
s₁₂ = 0 (diagonal Q_y) | ê₁ = 0
s₁₂ ≠ 0 (non-diagonal Q_y) | ê₁ ≠ 0

🖼️ Geometric interpretation

  • Figures 4.1–4.3 illustrate oblique projection of the observation vector y.
  • When the covariance matrix Q_y is diagonal (figure 4.3), the ellipse axes align with the coordinate axes, and the free variate's residual is zero.
  • When Q_y is non-diagonal (figure 4.2), the ellipse is tilted, and the projection affects all components, including the free variate.

y_R-Variates

4.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Observables that are stochastically or functionally related to another set of observables—called y_R-variates—require special estimation formulae because their estimators depend on correlations and functional relationships, not just their own measurements.

📌 Key points (3–5)

  • What y_R-variates are: observables that are either stochastically or functionally related to another set of observables y.
  • Three types exist: free variates (correlate with y), derived variates (functions of y), and constituent variates (y are functions of them).
  • Key insight for free variates: even when an observable does not appear in a condition equation, its estimator is not simply equal to its measured value if it correlates with other observables.
  • Common confusion: if y₁ does not appear in a condition equation, one might think ê₁ = 0, but this is not true when covariances are non-zero; the covariance s₁₂ determines whether ê₁ differs from zero.
  • Why it matters: the formula generalizes to multi-dimensional cases and applies to all three types of y_R-variates, enabling correct estimation when adding new measurements to existing networks or computing derived quantities.

🔍 The free-variate paradox

🔍 What happens when an observable "disappears" from the condition equation

The excerpt starts with a simple case (m=2, n=1). The model with observation equations is:

  • E{y} = a·x (observation equations)

The corresponding condition equation model is:

  • b* · E{y} = 0 (condition equations)

When the direction vector a is rotated to be parallel to the 1-axis, the second component a₂ becomes zero and the first component b₁ becomes zero. The condition equation then becomes:

  • b₂ · E{y₂} = 0

This shows that E{y₁} does not appear in the condition equation.

Free variates: observables that do not appear in the condition equations.

❌ The wrong intuition

One might conclude: "Since E{y₁} does not appear in the equation, the estimator ŷ₁ of E{y₁} equals y₁, so ê₁ = y₁ - ŷ₁ is zero."

This is not true. The figures in the excerpt show clearly that both ê₁ and ê₂ are unequal to zero.

🔑 The key: covariance determines the correction

The reason ê₁ ≠ 0 is that the major and minor axes of the error ellipse make a non-zero angle with the coordinate axes.

  • If the covariance matrix Q_y is diagonal (s₁₂ = 0): then ê₁ = 0.
  • If s₁₂ ≠ 0: then also ê₁ ≠ 0.

The covariance s₁₂ between y₁ and y₂ determines to what extent ê₁ differs from zero.

📐 The geometric relationship

The excerpt derives the relation from figure 4.4: the residuals ê₁ and ê₂ fix the direction of the (oblique) projection, with

  • tan(β) = ê₁ / ê₂, where β is the angle shown in figure 4.4.

The excerpt then expresses this angle in the elements of the inverse of Q_y. Written in the elements of Q_y itself, combining the two relations gives:

  • ê₁ = (s₁₂ / s₂₂) · ê₂

This formula shows how ê₁ (and thus the estimator ŷ₁ = y₁ - ê₁) of the free variate follows from ê₂ alone; it is the one-dimensional special case of the general formula (14) introduced later.

Don't confuse: "not appearing in the equation" does not mean "no correction needed"; correlation propagates the adjustment.

📚 Three types of y_R-variates

📚 Definition and classification

y_R-variates: observables that are either stochastically or functionally Related to another set of observables y.

Type | Name | Relationship | Example from excerpt
1 | Free variates | Correlate with the y-variates | Heights x̂₂, x̂₃, x̂₄ that correlate with x̂₁, x̂₅ when a new height difference y₁₀ between points 1 and 5 is added
2 | Derived variates | Functions of the y-variates | Height differences not directly measured but computed as functions of estimated heights
3 | Constituent variates | The y-variates are functions of them | Individual measured height differences whose sum y₁ is used in the condition equation

🗺️ Type 1: Free variates (example)

Consider a levelling network where five heights have been estimated using nine levelled height differences, giving estimators x̂₁, x̂₂, x̂₃, x̂₄, x̂₅.

Now a new height difference y₁₀ between points 1 and 5 is measured. The condition equation is:

  • x̂₁ - x̂₅ - y₁₀ = 0

On the basis of this equation, we can compute ŷ₁₀ and improved estimators for the heights of points 1 and 5.

Key point: Since the estimators x̂₂, x̂₃, x̂₄ are correlated with x̂₁ and x̂₅, these estimators can also be improved, even though they do not appear in the condition equation. The variates x̂₂, x̂₃, x̂₄ are the free variates in this case.

📊 Type 2: Derived variates

Height differences that are not directly measured but can be computed as functions of the estimated heights.

🧩 Type 3: Constituent variates

Assume a height difference between points 1 and 2 has not been measured directly, but equals the sum of a number of measured height differences.

You can include either:

  • The individual measured height differences in the condition equation, or
  • Their sum y₁.

In the latter case, the individual measured height differences are the constituent variates.

🧮 Multi-dimensional generalization

🧮 The general formula for free variates

The excerpt generalizes the one-dimensional relation ê₁ = (s₁₂ / s₂₂) · ê₂ from section 4.1 to the multi-dimensional case.

Starting from the model with condition equations:

  • B* · E{y} = 0
  • D{y} = σ² Q_y

The solution is:

  • ê = Q_y · B · (B* Q_y B)⁻¹ · (B* y)

Now assume instead we have:

  • (B* 0) · E{(y, y_R)*} = 0
  • Covariance matrix is block form with Q_y, Q_y,y_R, and Q_y_R

Here y_R is a vector of free variates (coefficients are zero).

🔢 Derivation from condition equations

Replacing B* with (B* 0) and Q_y with the block covariance matrix in the solution formula gives:

  • ê = Q_y · B · (B* Q_y B)⁻¹ · (B* y)
  • ê_R = Q_y_R,y · B · (B* Q_y B)⁻¹ · (B* y)

From the first equation (pre-multiplying by Q_y⁻¹):

  • B · (B* Q_y B)⁻¹ · (B* y) = Q_y⁻¹ · ê

Substituting into the second equation:

  • ê_R = Q_y_R,y · Q_y⁻¹ · ê

This is the multi-dimensional generalization of equation (8).
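
This relation can be verified numerically. The sketch below uses arbitrary illustrative matrices (none of them from the excerpt): it adjusts y under B* E{y} = 0, applies ê_R = Q_y_R,y Q_y⁻¹ ê, and compares the result with the residuals obtained from the stacked adjustment that uses the condition matrix (B* 0):

  import numpy as np

  rng = np.random.default_rng(2)
  m, r, b = 5, 3, 2

  B = rng.normal(size=(m, b))
  y = rng.normal(size=m)
  y_R = rng.normal(size=r)

  # A positive-definite joint covariance matrix of (y, y_R).
  M = rng.normal(size=(m + r, m + r))
  Q = M @ M.T + (m + r) * np.eye(m + r)
  Qy, QRy = Q[:m, :m], Q[m:, :m]

  def residuals(Bmat, Qmat, obs):
      # e_hat = Q B (B* Q B)^-1 B* obs  for the model B* E{obs} = 0
      return Qmat @ Bmat @ np.linalg.solve(Bmat.T @ Qmat @ Bmat, Bmat.T @ obs)

  e_hat = residuals(B, Qy, y)
  e_R_formula = QRy @ np.linalg.solve(Qy, e_hat)

  # The same thing via the stacked model with condition matrix (B* 0).
  B_stacked = np.vstack([B, np.zeros((r, b))])
  e_stacked = residuals(B_stacked, Q, np.concatenate([y, y_R]))

  print(np.allclose(e_R_formula, e_stacked[m:]))   # True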

🔄 Derivation from observation equations

The excerpt also mentions deriving the same formula using the model with observation equations:

  • E{(y, y_R)} = (A, A_R*)* · x
  • Note that A_R = 0 holds (free variates do not depend on the parameters).

The excerpt states "If we denote the inverse of the covariance matrix..." but the text cuts off.

✅ Universal applicability

The excerpt states that formula (14) holds for all three types of y_R-variates, not just free variates.

Don't confuse: The same mathematical framework applies whether the relationship is stochastic (free variates), functional forward (derived variates), or functional backward (constituent variates).


4.2 Free variates

🧭 Overview

🧠 One-sentence thesis

Formula (14) generalizes the adjustment of free variates to the multi-dimensional case and applies equally to free, derived, and constituent variates in both condition-equation and observation-equation models.

📌 Key points (3–5)

  • Free variates are those that do not appear in the condition equations (their coefficients are zero), so they are not directly constrained by the model.
  • Formula (14) is the multi-dimensional generalization that can be derived from either the condition-equation model or the observation-equation model.
  • Three types of y_R-variates: free (not in condition equations), derived (functions of observables), and constituent (observables are functions of them)—all satisfy the same formula (14).
  • Common confusion: weight matrices vs covariance matrices—formula (14) is expressed in covariance terms, but the observation-equation derivation uses weight matrices and requires conversion via the inverse relationship.
  • Why it matters: the unified formula (14) means the same adjustment machinery works for all three types of variates, simplifying the theory and computation.

🔧 What free variates are

🔧 Definition and role

Free variates: variates y_R whose coefficients in the condition equations are zero; they do not appear in the condition equations.

  • In the condition-equation model (10), some variates may be constrained by equations; free variates are not.
  • Example: In a levelling network, estimated heights x̂₂, x̂₃, x̂₄ are free variates if they do not appear in a particular condition equation (9).
  • Free variates are still part of the adjustment problem but are not directly restricted by the conditions.

🔄 Model structure with free variates

  • The original condition-equation model (10) has the form with coefficient matrix B* and observables y.
  • When free variates y_R are included, the model (12) replaces B* with (B* 0), meaning the y_R block has zero coefficients.
  • The covariance matrix of (10) is Q_y; for model (12) it becomes a block matrix with Q_y and Q_R on the diagonal and Q_yR off-diagonal.

🧮 Deriving formula (14) from condition equations

🧮 Starting from the condition-equation solution

  • The solution of the original model (10) is given by equation (11).
  • For model (12) with free variates, substitute (B* 0) for B* and the block covariance matrix for Q_y.
  • The second equation of (11) becomes the first equation of (13).

🧮 Solving for the adjusted free variates

  • From the first equation of (13), solve for the Lagrange multipliers.
  • Substitute back into the second equation of (13) to eliminate the multipliers.
  • The result is formula (14):
    • ŷ_R = y_R - Q_Ry Q_y⁻¹ (y - ŷ)
    • This is the multi-dimensional generalization of equation (8) from section 4.1.

🔍 Deriving formula (14) from observation equations

🔍 Observation-equation model for free variates

  • The equivalent of model (12) in observation-equation form is model (15).
  • The key relationship is that the block of the design matrix linking y_R to the parameters x is zero (the free variates do not depend on x).
  • The inverse of the covariance matrix of observables is denoted by equation (16), which defines weight matrices G_y, G_R, and G_Ry.

🔍 Normal equations and weight matrices

  • The normal equations for model (15) are given by (17).
  • From the last equation of (17), solve for the adjusted free variates.
  • The result (18) looks similar to (14) but is expressed in terms of weight matrices (G) rather than covariance matrices (Q).

🔍 Converting weights to covariances

  • Formula (14) uses covariance matrices; equation (18) uses weight matrices.
  • The relationship is given by the inverse (see equation (16)): the weight matrix is the inverse of the covariance matrix.
  • From this inverse relationship, derive equation (19): Q_Ry = - Q_R G_Ry G_y⁻¹ (the block-inverse step behind (19) is sketched after this list).
  • Substituting (19) into (18) proves formula (14).
  • Don't confuse: weight matrices (inverses of covariances) are used in the observation-equation derivation, but the final formula (14) is in covariance form for consistency.
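
For reference, equation (19) is just the lower-left block of the identity "covariance matrix times its inverse equals the unit matrix", written here in LaTeX with the notation of the list above:

  \begin{pmatrix} Q_y & Q_{yR} \\ Q_{Ry} & Q_R \end{pmatrix}
  \begin{pmatrix} G_y & G_{yR} \\ G_{Ry} & G_R \end{pmatrix} = I
  \quad\Longrightarrow\quad
  Q_{Ry} G_y + Q_R G_{Ry} = 0
  \quad\Longrightarrow\quad
  Q_{Ry} = -\, Q_R \, G_{Ry} \, G_y^{-1}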

🌳 Extension to derived and constituent variates

🌳 Derived variates

Derived variates: y_R-variates that are functions of the observables y.

  • The functional relationship is given by equation (20): y_R = some function of y.
  • The adjusted derived variates follow the same relationship (21): ŷ_R = function of ŷ.
  • Subtracting (21) from (20) gives equation (22), which relates the residuals.
  • Apply the propagation law of covariances to (20) to show that the coefficient matrix L equals Q_Ry Q_y⁻¹.
  • This proves that formula (14) also holds for derived variates.
  • Example: In a levelling network, height differences not directly measured but computed from estimated heights are derived variates.

🌳 Constituent variates

Constituent variates: y_R-variates of which the y-variates are functions (the reverse of derived variates).

  • The functional relationship is equation (23): y = function of y_R (observables are functions of constituent variates).
  • Example: In a levelling network, if the sum y₁ of individual measured height differences is used in the condition equation, the individual differences are constituent variates.
  • Apply the propagation law of variances to get equation (24) and the propagation law of covariances to get equation (25).
  • The solution of the original model (26) is (27): ê = y - ŷ.
  • Reformulate the model with condition equations as (28), replacing B with L*B, y with y_R, and Q_y with Q_R.
  • The solution (29) for model (28), combined with (23), (24), and (25), reduces to formula (14).
  • This proves that formula (14) holds for constituent variates as well.

📊 Summary: unified formula for all three types

Type of y_R-variate | Relationship to observables y | Key step in proof
Free | Not in condition equations (coefficients zero) | Substitute block matrices into the condition-equation solution, or use the observation-equation normal equations
Derived | y_R = function of y | Apply the propagation law to show L = Q_Ry Q_y⁻¹
Constituent | y = function of y_R | Apply the propagation laws and reformulate the model with y_R as primary variates
  • All three types satisfy the same adjustment formula (14): ŷ_R = y_R - Q_Ry Q_y⁻¹ (y - ŷ).
  • This unification simplifies adjustment theory: the same machinery applies regardless of the type of y_R-variate.
  • The excerpt states "it will be shown that formula (8) holds for all three types of y_R-variates," and sections 4.2, 4.3, and 4.4 provide the proofs for free, derived, and constituent variates respectively.

4.3 Derived variates

🧭 Overview

🧠 One-sentence thesis

The adjustment formula for estimated residuals applies not only to directly measured observables but also to three special types of variates—free variates, derived variates, and constituent variates—each of which relates to the observables in a different way.

📌 Key points (3–5)

  • Core formula (14): A single residual formula holds for all three types of y_R-variates (free, derived, and constituent).
  • Free variates: observables that do not appear in the condition equations; their coefficients are zero.
  • Derived variates: quantities computed as functions of the observables (e.g., height differences not directly measured but calculated from estimated heights).
  • Constituent variates: the reverse relationship—the observables are functions of these variates (e.g., individual measured height differences that sum to a total).
  • Common confusion: the three types differ in their functional relationship to the observables: free variates are independent of condition equations, derived variates are outputs computed from observables, and constituent variates are inputs from which observables are built.

🔓 Free variates

🔓 What free variates are

Free variates: observables whose coefficients in the condition equations are zero; they do not appear in the condition equations.

  • In the model with condition equations, some variates may not be constrained by the equations.
  • The excerpt shows that when the coefficient matrix B* is replaced by (B* 0) to include free variates, the covariance structure changes accordingly.
  • Example: In a levelling network, if certain height differences are not involved in any constraint equation, they are free variates.

🧮 Derivation of formula (14) for free variates

The excerpt provides two derivations:

  1. Using condition equations:

    • Start with the standard condition-equation model and modify it so that y_R has zero coefficients.
    • The solution for the residuals becomes: the residual of y_R (y_R minus its estimator) equals the covariance of y_R with y, times the inverse covariance of y, times the residual of y.
    • This is the multi-dimensional generalization of the earlier one-dimensional formula (8).
  2. Using observation equations:

    • The equivalent model in observation-equation form is written with parameters x and free variates y_R.
    • The normal equations yield a similar result expressed in terms of weight matrices.
    • By converting weight matrices to covariance matrices using the propagation law, the same formula (14) is obtained.

Don't confuse: The two derivations use different model representations (condition equations vs observation equations) but arrive at the same result; they are equivalent approaches, not competing methods.

🧩 Derived variates

🧩 What derived variates are

Derived variates: y_R-variates that are functions of the observables y.

  • These are quantities not directly measured but computed from the observables.
  • The functional relationship is written as: y_R equals some matrix L times y.
  • Example: In a levelling network, a height difference between two points that was not directly measured but is computed from estimated heights is a derived variate.

🧮 Proof that formula (14) holds for derived variates

  • Start with the functional relationship y_R = L y.
  • The same relationship holds for the estimated values: estimated y_R = L times estimated y.
  • Subtract to get the residuals: estimated y_R minus y_R = L times (estimated y minus y).
  • Apply the propagation law of covariances to the functional relationship: covariance of y_R with y equals L times the covariance of y.
  • This shows that L equals the covariance of y_R with y times the inverse covariance of y, which is exactly the coefficient in formula (14).
  • Therefore, formula (14) also holds for derived variates.

🏗️ Constituent variates

🏗️ What constituent variates are

Constituent variates: y_R-variates of which the y-variates are functions (the reverse of derived variates).

  • Here the observables y are built from the constituent variates y_R.
  • The functional relationship is: y = L* times y_R.
  • Example: In a levelling network, if individual measured height differences are summed to form a total, the individual differences are constituent variates and the sum is the observable.

🧮 Proof that formula (14) holds for constituent variates

  • Start with the functional relationship y = L* y_R.
  • Apply the propagation law of variances: covariance of y = L* times covariance of y_R times transpose of L*.
  • Apply the propagation law of covariances: covariance of y with y_R = L* times covariance of y_R.
  • The residuals for the original model with condition equations are given by a known formula.
  • Reformulate the model by substituting y = L* y_R into the condition equations, so the coefficient matrix B is replaced by L* B, y is replaced by y_R, and the covariance matrix Q_y is replaced by Q_R.
  • The residual formula for this reformulated model, combined with the propagation laws, yields formula (14) again.

🔄 Example: rederiving the condition-equation solution

The excerpt illustrates the constituent-variate approach by "rederiving" the solution of the model with condition equations:

  • Define the misclosure vector as w = B* y − c, where c is the constant vector of the condition equations B* E{y} = c.
  • Rewrite the condition equations as I* E{w} = 0 (where I is the identity matrix).
  • Solve this reformulated model to get the residuals of w.
  • Interpret y as a constituent variate (y_R) and w as the observable (y in the formula).
  • Apply formula (14), using the propagation laws to relate covariances of y and w.
  • Substitute and simplify to recover the familiar result for the residuals of y.

Don't confuse: This example does not introduce a new method; it shows that the constituent-variate framework is consistent with the standard condition-equation solution.

📊 Summary of the three types

Type | Relationship to observables | Key characteristic | Example from excerpt
Free variates | Not constrained by condition equations | Coefficients in condition equations are zero | Height differences not involved in any constraint
Derived variates | Functions of the observables: y_R = L y | Computed from measurements | Height difference computed from estimated heights
Constituent variates | Observables are functions of them: y = L* y_R | Measurements built from these variates | Individual height differences that sum to a total
  • All three types satisfy the same residual formula (14).
  • The proofs rely on the propagation law of covariances and the structure of the adjustment models.
  • The excerpt emphasizes that formula (14) is general and applies regardless of how y_R relates to y.

4.4 Constituent variates

🧭 Overview

🧠 One-sentence thesis

Formula (14) applies to constituent variates—observables that are themselves functions of the variates being estimated—by transforming the model through the propagation law and showing that the adjustment formula remains valid.

📌 Key points (3–5)

  • What constituent variates are: y_R-variates of which the y-variates are functions (the reverse of derived variates).
  • The functional relationship: the observables y are expressed as functions of y_R through a transformation matrix L*.
  • How formula (14) is proven: by applying the propagation law to both variances and covariances, then reformulating the condition-equation model in terms of y_R.
  • Common confusion: constituent variates reverse the direction—y is a function of y_R, whereas derived variates have y_R as a function of y.
  • Practical example: the misclosure vector model demonstrates how the general result recovers the familiar condition-equation solution.

🔄 The constituent-variate relationship

🔄 Definition and setup

Constituent variates are y_R-variates of which the y-variates are functions.

  • The functional relationship is assumed to take the form: y = L* times y_R (equation 23).
  • This is the reverse direction compared to derived variates (section 4.3), where y_R was a function of y.
  • Example: if y_R represents underlying parameters and y represents measurements derived from them, y_R are constituent variates.

📐 Propagation laws applied

The excerpt applies two propagation laws to equation (23):

Propagation law | Result | Equation
Variances | Q_y = L* Q_R (L*)^T | (24)
Covariances | Q_{y,R} = L* Q_R | (25)
  • These relationships transform the covariance structure from y_R space to y space.
  • They are essential for rewriting the adjustment formula in terms of constituent variates.

🔧 Model transformation

🔧 Reformulating the condition-equation model

The original model with condition equations is the familiar one:

  • B E{y} = 0 (equation 26)
  • Solution: ê = y - ŷ = Q_y B^T (B Q_y B^T)⁻¹ B y (equation 27)

When y is expressed in terms of the constituent variates y_R (equation 23), the conditions become condition equations in the y_R-variates:

  • the same conditions, rewritten with y_R as the observables (equation 28)

🔀 Substitution pattern

Comparing models (26) and (28), the excerpt identifies three substitutions:

Original (26) | Replaced by in (28)
B | B L*
y | y_R
Q_y | Q_R
  • Applying these substitutions to solution (27) gives the solution for model (28) as equation (29).
  • Using equations (23), (24), and (25), equation (29) can be rewritten to match formula (14).
  • Don't confuse: the substitution is systematic—every occurrence of the original variable and its covariance matrix is replaced consistently.

📝 Worked example: misclosure vector

📝 Setting up the misclosure model

The excerpt "rederives" the condition-equation solution using the constituent-variate framework.

Starting model with condition equations:

  • B E{y} = 0 (equation 30)

Define the misclosure vector:

  • t = B y (equation 31)

Rewrite the model in terms of t:

  • I E{t} = 0, which is again in condition-equation form (equation 32)
  • The coefficient matrix is now the identity matrix I.

🧮 Applying formula (14)

To find ê = y - ŷ, interpret:

  • y as the constituent y_R-variate (the observable t of the reformulated model is a function of y)
  • t (from equation 31) as the y-variate of the reformulated model (32)

Solving model (32) gives:

  • ê_t = t - t̂ = t (equation 33), since the condition E{t} = 0 forces t̂ = 0

Apply the propagation laws to equation (31):

  • Variances: Q_t = B Q_y B^T (equation 35)
  • Covariances: Q_{y,t} = Q_y B^T (equation 36)

Substitute equations (33), (35), and (36) into formula (14):

  • The result is the familiar condition-equation solution ê = Q_y B^T (B Q_y B^T)⁻¹ B y.
  • Example: this demonstrates that the constituent-variate framework is not a new method but a unifying perspective that recovers known results (the chain of substitutions is spelled out below).
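
Written out, the chain of substitutions is (a sketch in the notation of this worked example, with t = B y):

  \hat{e} \;=\; Q_{y,t}\, Q_t^{-1}\, \hat{e}_t
          \;=\; (Q_y B^{T})\, (B\, Q_y B^{T})^{-1}\, (B\, y)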

✅ Verification

  • The final result matches the well-known solution for the model with condition equations.
  • This confirms that formula (14) holds for constituent variates.
  • Don't confuse: the example uses the misclosure vector as a teaching tool; the general proof (equations 23–29) applies to any constituent-variate relationship.

Mixed Model Representations

5.1 Introduction

🧭 Overview

🧠 One-sentence thesis

The mixed model representation combines observation equations and condition equations into a unified framework that generalizes both earlier models and can be transformed into either pure form for solution.

📌 Key points (3–5)

  • Two new model types introduced: the mixed model B*E{y} = Ax (combining observation and condition equations) and the constrained observation model (observation equations with parameter constraints).
  • How the mixed model generalizes earlier forms: when B equals the identity matrix, it reduces to observation equations; when A equals the zero matrix, it reduces to condition equations.
  • Two solution strategies: transform the mixed model into observation-equations-only form or into condition-equations-only form.
  • Common confusion: the mixed model is not a third independent type—it is a generalization that contains the earlier two models as special cases.
  • Key requirement: both matrices A and B must have full column rank, and the number of unknown parameters must not exceed the number of conditions on observations (n ≤ b).

📚 Background: The two earlier models

📐 Model with observation equations

The excerpt reviews the model from chapter 2:

  • Form: y = Ax + e (equation 1)
  • The estimator x̂ of x follows from solving the normal equations (equation 2).
  • The estimators ŷ and ê follow as ŷ = Ax̂ and ê = y - ŷ (equations 4 and 5).

🔗 Model with condition equations

The excerpt reviews the model from chapter 3:

  • Form: B*E{y} = 0 (equation 6)
  • The estimators ŷ and ê can be represented in terms of matrix B (equations 7 and 8).

🔄 Relationship between the two

  • The observation-equations model uses an n×1 vector of unknown parameters x and an m×1 vector of observables y.
  • The condition-equations model imposes constraints directly on the expected values of observations.
  • The excerpt states that the new mixed model is "a mixture of (1) and (6)."

🧩 The first new model: B*E{y} = Ax

🎯 Definition and structure

Mixed model representation: B*E{y} = Ax (equation 9)

  • x is an n×1 vector of unknown parameters.
  • y is an m×1 vector of observables.
  • This form combines observation equations (linking y to x via A) and condition equations (imposing constraints via B).

🔀 How it generalizes earlier models

The excerpt explains two special cases:

Condition | Result | Reduces to
B = identity matrix | Constraints become trivial | Observation equations (1)
A = zero matrix | No parameter linkage | Condition equations (6)
  • Example: if B is the identity, the "condition" B*E{y} = Ax becomes simply E{y} = Ax, which is the observation-equations model.
  • Example: if A is zero, the model becomes B*E{y} = 0, which is the condition-equations model.

⚙️ Full-rank requirement

The excerpt states two assumptions:

  • Both matrices A and B must have full column rank (equation 12).
  • This implies n ≤ b: "the number of unknown parameters should not exceed the number of conditions on the observations."

Don't confuse: full column rank for A and B does not mean they are square or invertible; it means their columns are linearly independent.

🛠️ Two solution strategies for the mixed model

🔧 Strategy 1: Transform into observation-equations-only form

The excerpt outlines this approach:

  1. Treat B*E{y} = Ax as an inhomogeneous system with unknown vector E{y} (equation 13).
  2. Write the general solution as the sum of:
    • A particular solution (equation 16): E{y} = B (B* B)⁻¹ A x
    • A homogeneous solution (equation 15): involving B̂, the matrix whose columns are orthogonal to the columns of B.
  3. The combined solution (equation 17) is in the form of observation equations only.
  4. Apply the observation-equations solution method from chapter 2 to this transformed representation.

Why this works: equation 17 is "completely equivalent to the first equation of (11)," so solving the transformed system gives the same estimators x̂, ŷ, and ê.
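
Concretely, the combined solution (17) can be written as follows (a sketch; B̂ is the matrix with B* B̂ = 0 from the homogeneous solution, and λ the corresponding vector of free parameters):

  E\{y\} \;=\; B\,(B^{*}B)^{-1}A\,x \;+\; \hat{B}\,\lambda
         \;=\; \begin{pmatrix} B(B^{*}B)^{-1}A & \hat{B} \end{pmatrix}
               \begin{pmatrix} x \\ \lambda \end{pmatrix},
  \qquad B^{*}\hat{B} = 0

This is an observation-equation model in the unknowns (x, λ), to which the machinery of chapter 2 applies directly.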

🔩 Strategy 2: Transform into condition-equations-only form

The excerpt begins to outline this approach:

  1. Consider the same system (13): B*E{y} = Ax.
  2. Introduce Â, the b×(b - n) matrix whose columns are orthogonal to the columns of A.
  3. Pre-multiply equation (13) by Â^T to eliminate the parameter vector x (the excerpt is cut off here).

The idea: by projecting onto the space orthogonal to A, the term involving x vanishes, leaving only conditions on y—a pure condition-equations form.
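
In symbols, the elimination step is simply (a sketch):

  \hat{A}^{*}A = 0
  \quad\Longrightarrow\quad
  \hat{A}^{*}B^{*}\,E\{y\} \;=\; \hat{A}^{*}A\,x \;=\; 0

which leaves b − n condition equations in E{y} alone.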

🆕 The second new model: Observation equations with parameter constraints

📝 Definition

Constrained observation model: y = Ax + e with B*x = c (equation 10, paraphrased)

  • This is "observation equations with conditions on the parameter vector x."
  • If B = 0 (no constraints), it reduces to the simple observation-equations model (1).

🔜 Solution deferred

The excerpt states: "The solution of (10) will be derived in section 5.3."

Don't confuse: this second model constrains the parameters x, whereas the first mixed model (9) constrains the expected observations E{y}.


5.2 The model representation B*E{y} = Ax

🧭 Overview

🧠 One-sentence thesis

The mixed model B*E{y} = Ax combines observation equations and condition equations into a single framework, and its solution can be derived by transforming the observations into a new variate t = B*y and applying standard observation-equation methods.

📌 Key points (3–5)

  • What the model is: a mixture of observation equations (E{y} = Ax) and condition equations (B*E{y} = 0), where B*E{y} = Ax generalizes both forms.
  • Three solution approaches: transform to observation-equations-only, transform to condition-equations-only, or use the y_R-variate method (the excerpt follows the third).
  • The y_R-variate method: define a new vector t = B*y, solve for x̂ using standard observation-equation formulas, then recover ê_y and ŷ using propagation laws.
  • Common confusion: this model reduces to pure observation equations when B = identity matrix, and to pure condition equations when A = zero matrix; it is a unified representation, not a separate theory.
  • Full-rank assumption: both A and B must have full column rank, which implies n ≤ b (the number of unknown parameters cannot exceed the number of conditions on observations).

🔄 Three approaches to solving the mixed model

🔄 Overview of solution strategies

The excerpt presents three possible ways to solve model (11): B*E{y} = Ax.

  1. Transform to observation equations only: express E{y} as a sum of particular and homogeneous solutions.
  2. Transform to condition equations only: pre-multiply by a matrix orthogonal to A to eliminate x.
  3. Use y_R-variates: define t = B*y and treat the problem as observation equations in t.

The excerpt follows the third approach in detail.

🧩 Why the y_R-variate method

  • The y_R-variate method leverages results from chapter 4 (formula 14) to handle constituent variates.
  • It avoids explicitly constructing orthogonal complements for both A and B.
  • The key insight: by defining t = B*y, the mixed model (11) becomes a standard observation-equation model in t.

🛠️ The y_R-variate solution method

🛠️ Step 1: Define the new variate t

Define a b×1 vector t as: t = B*y (equation 19).

  • Substitution into the original model (11) gives a representation in the form of observation equations only (equation 20).
  • This transformed model is: E{t} = Ax, with observations t and the same unknown parameter vector x.

📐 Step 2: Solve for x̂ using observation-equation formulas

The estimator x̂ follows from the standard normal-equation solution for observation equations (equation 21):

  • x̂ depends on (A* Q_t^(-1) A)^(-1) A* Q_t^(-1) t.
  • The variance-covariance matrix Q_t is obtained by applying the propagation law to t = B*y (equation 22): Q_t = B* Q_y B.
  • Substituting (19) and (22) into (21) gives the final estimator x̂ in terms of the original observations y (equation 23).

🔍 Step 3: Compute t̂ and ê_t

The estimators t̂ and ê_t = t - t̂ follow simply from the observation-equation solution (equation 24):

  • t̂ = Ax̂
  • ê_t = t - t̂

These are intermediate results; the goal is to recover ê_y and ŷ.

🔗 Step 4: Recover ê_y using the y_R-variate formula

To obtain ê_y = y - ŷ, the excerpt interprets y of (19) as a constituent y_R-variate.

  • Application of formula (14) from chapter 4 gives ê_y in terms of ê_t, Q_yt, and Q_t (equation 25).
  • The covariance matrix Q_yt follows from the propagation law of covariances applied to (19): Q_yt = Q_y B (equation 26).

✅ Step 5: Final estimators for ê_y and ŷ

Combining equations (22), (23), (24), (25), and (26), the excerpt derives the final estimators (equation 27):

  • ê_y = Q_y B (B* Q_y B)⁻¹ (B* y - Ax̂)
  • ŷ = y - ê_y

These are the adjusted observations and residuals for the original mixed model.
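
The five steps fit into a short numpy sketch (illustrative matrices only; none of the numbers come from the excerpt):

  import numpy as np

  rng = np.random.default_rng(3)
  m, b, n = 7, 4, 2                       # n <= b <= m

  B = rng.normal(size=(m, b))
  A = rng.normal(size=(b, n))
  M = rng.normal(size=(m, m))
  Qy = M @ M.T + m * np.eye(m)            # positive-definite Q_y
  y = rng.normal(size=m)

  # Steps 1-3: observation equations in t = B* y.
  t = B.T @ y
  Qt = B.T @ Qy @ B
  x_hat = np.linalg.solve(A.T @ np.linalg.solve(Qt, A),
                          A.T @ np.linalg.solve(Qt, t))
  e_t = t - A @ x_hat

  # Steps 4-5: y as constituent variate of t, formula (14) of chapter 4.
  Qyt = Qy @ B
  e_y = Qyt @ np.linalg.solve(Qt, e_t)
  y_hat = y - e_y

  print(np.allclose(B.T @ y_hat, A @ x_hat))   # adjusted observations satisfy the model

The final check confirms that the adjusted observations satisfy the mixed model: B* ŷ = A x̂.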

🧷 Full-rank assumptions and model reduction

🧷 Full column rank for A and B

The excerpt assumes both matrices A and B have full column rank (equation 12):

  • rank(A) = n
  • rank(B) = b

This ensures that the normal equations are solvable and that the conditions are independent.

📏 Constraint on the number of parameters

Matrix A can only have full column rank if n ≤ b: the number of unknown parameters should not exceed the number of conditions on the observations.

  • If n > b, the system is underdetermined and A cannot have full column rank.
  • This constraint is specific to the mixed model; it does not apply to pure observation-equation models.

🔄 Special cases: reduction to simpler models

The mixed model (9) generalizes both observation equations and condition equations:

Condition | Resulting model | Explanation
B = identity matrix | Observation equations (1) | B*E{y} = Ax becomes E{y} = Ax
A = zero matrix | Condition equations (6) | B*E{y} = Ax becomes B*E{y} = 0

Don't confuse: the mixed model is not a new theory; it is a unified representation that includes both simpler models as special cases.

🔢 Alternative derivations (mentioned but not detailed)

🔢 Transform to observation equations only

The excerpt outlines (but does not fully detail) a first approach:

  • Consider the system B*E{y} = Ax as an inhomogeneous system of linear equations with unknown E{y}.
  • Write E{y} as the sum of a particular solution (equation 16) and a homogeneous solution (equation 15).
  • The homogeneous solution uses a matrix B^ whose columns are orthogonal to the columns of B (equation 14).
  • The resulting representation (equation 17) is in terms of observation equations only, and standard methods apply.

🔢 Transform to condition equations only

The excerpt also outlines a second approach:

  • Pre-multiply the system (13) by Â*, where Â is a matrix whose columns are orthogonal to the columns of A.
  • This gives a representation in terms of condition equations only (equation 18).
  • Standard condition-equation methods then apply.

Don't confuse: these alternative approaches are mathematically equivalent to the y_R-variate method; the excerpt chooses the third approach for its conceptual clarity and use of earlier results.


5.3 The model representation E{y} = Ax, B*x = 0

🧭 Overview

🧠 One-sentence thesis

The model representation E{y} = Ax with B*x = 0 combines observation equations with parameter constraints and can be solved efficiently in two steps: first ignoring the constraints to get an intermediate estimator, then applying the constraints as condition equations.

📌 Key points (3–5)

  • What this model is: observation equations E{y} = Ax with additional conditions B*x = 0 imposed on the parameter vector x.
  • How to solve it: transform the constrained parameters into an unconstrained form using an orthogonal complement matrix, then apply standard observation-equation methods.
  • Two-step interpretation: solve the unconstrained observation model first (get x̂_A), then apply the parameter conditions as a separate condition-equation model to get the final x̂.
  • Common confusion: this model reduces to standard observation equations when B = 0 (no parameter constraints); the constraints are on parameters, not on observations.
  • Why it matters: the solution shows that constrained estimation can be decomposed into unconstrained estimation plus a correction step, and the residual norm splits into two corresponding parts.

🔧 Model structure and assumptions

🔧 What the model represents

Model representation: E{y} = Ax, B*x = 0

  • E{y} = Ax: standard observation equations linking m×1 observable vector y to n×1 unknown parameter vector x through design matrix A.
  • B*x = 0: b additional homogeneous conditions imposed directly on the parameter vector x.
  • The excerpt emphasizes this is "observation equations with conditions on the parameter vector x."

📐 Rank assumptions

  • Both matrices A and B must have full column rank.
  • Specifically: rank A = n and rank B = b.
  • Constraint: matrix B can only have full column rank if b ≤ n—you cannot impose more than n independent conditions on n unknown parameters.
  • Don't confuse: the conditions are on the parameters themselves, not on the observations (unlike the condition-equation model from chapter 3).

🔄 Transformation to unconstrained form

🔄 Finding the parametric representation

The excerpt solves the homogeneous system B*x = 0 by introducing an orthogonal complement:

  • Let B̂ be the n×(n−b) matrix whose column vectors are orthogonal to the column vectors of matrix B.
  • Then B*B̂ = 0 (the columns of B̂ span the null space of B*).
  • Parametric representation: any solution x satisfying B*x = 0 can be written as x = B̂λ, where λ is an (n−b)×1 vector of new unconstrained parameters.
  • This representation (32) is completely equivalent to the constraint B*x = 0.

🔄 Substitution into observation equations

Substitute x = B̂λ into the first equation E{y} = Ax:

  • Result: E{y} = AB̂λ (equation 33).
  • This representation is "completely equivalent to (29) and it is in the form of observation equations only."
  • Now the problem is in standard observation-equation form, so standard methods apply to estimate λ.

🧮 Deriving the estimators

🧮 Estimator for the unconstrained parameter λ

From the transformed observation equations E{y} = AB̂λ, the estimator λ̂ follows the standard formula:

  • λ̂ = [(AB̂)*Q_y^(−1)(AB̂)]^(−1) (AB̂)*Q_y^(−1) y (equation 34).
  • This is the standard least-squares solution for observation equations.

🧮 Relating back to the original unconstrained estimator x̂_A

The excerpt introduces x̂_A as the solution of the model E{y} = Ax without the parameter constraints:

  • x̂_A satisfies the normal equations A*Q_y^(−1)A x̂_A = A*Q_y^(−1)y (equation 35).
  • The subscript A emphasizes that x̂_A is not the solution of the constrained model (29), but of the unconstrained model (36).
  • Its covariance matrix is Q_x̂_A = [A*Q_y^(−1)A]^(−1) (equation 37).

Using these definitions, the estimator λ̂ can be rewritten as:

  • λ̂ = [B̂*Q_x̂_A^(−1)B̂]^(−1) B̂*Q_x̂_A^(−1) x̂_A (equation 38).
  • This form shows λ̂ depends on the unconstrained estimator x̂_A and its covariance.

🧮 Final estimator for x

Combining x = B̂λ with the expression for λ̂:

  • x̂ = B̂λ̂ = B̂[B̂*Q_x̂_A^(−1)B̂]^(−1) B̂*Q_x̂_A^(−1) x̂_A (equation 39).
  • The excerpt notes this takes "a very familiar form"—it is the solution of a model with observation equations where x̂_A is treated as the observable and Q_x̂_A as its covariance matrix (equation 40).

🧮 Alternative condition-equation form

The equivalent representation in terms of condition equations (using B*B̂ = 0):

  • B* E{x̂_A} = 0, with "observable" x̂_A and covariance matrix Q_x̂_A (equation 41).
  • The solution is x̂ = x̂_A − Q_x̂_A B[B*Q_x̂_A B]^(−1) B*x̂_A (equation 42).
  • Since (41) is equivalent to (40), equation (42) must be identical to (39).

Final compact form (equation 43):

  • x̂ = x̂_A − Q_x̂_A B[B*Q_x̂_A B]^(−1) B*x̂_A
  • This expresses the constrained estimator x̂ directly in terms of the matrices A and B from the original model (29).

🪜 Two-step solution procedure

🪜 Step 1: Solve the unconstrained observation model

First, consider model (29) without the parameter conditions:

  • Solve E{y} = Ax (the observation-equation model 36).
  • This gives the unconstrained estimator x̂_A and its covariance Q_x̂_A.

🪜 Step 2: Apply the parameter conditions

Second, include the conditions on the parameters:

  • Solve the condition-equation model B* E{x̂_A} = 0 with "observable" x̂_A (model 41).
  • This gives the final constrained estimator x̂.

The excerpt states: "the solution of (29) can be obtained in two steps" and emphasizes the first step uses observation equations while the second step uses condition equations.

🪜 Complete solution formulas

Combining the two steps (equation 44):

  • x̂ = x̂_A − Q_x̂_A B[B*Q_x̂_A B]^(−1) B*x̂_A
  • ŷ = Ax̂
  • ê = y − ŷ

Example: An organization first estimates parameters ignoring regulatory constraints (step 1), then adjusts the estimates to satisfy the constraints (step 2).
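
A compact numpy sketch of the two steps (illustrative matrices; the cross-check at the end uses the reparametrization x = B̂λ, with B̂ computed here from an SVD):

  import numpy as np

  rng = np.random.default_rng(4)
  m, n, b = 8, 4, 2                       # b <= n <= m

  A = rng.normal(size=(m, n))
  B = rng.normal(size=(n, b))
  Qy = np.diag(rng.uniform(0.5, 2.0, m))
  y = rng.normal(size=m)
  W = np.linalg.inv(Qy)

  # Step 1: unconstrained solution x_hat_A and its covariance matrix.
  N = A.T @ W @ A
  Qxa = np.linalg.inv(N)
  x_hat_A = Qxa @ (A.T @ W @ y)

  # Step 2: impose B* x = 0, treating x_hat_A as the observable (cf. equation 43).
  x_hat = x_hat_A - Qxa @ B @ np.linalg.solve(B.T @ Qxa @ B, B.T @ x_hat_A)
  print(np.allclose(B.T @ x_hat, 0))      # constraints satisfied

  # Cross-check via x = Bhat @ lam, where the columns of Bhat span the null space of B*.
  _, _, Vt = np.linalg.svd(B.T)
  Bhat = Vt[b:].T                         # n x (n - b), with B.T @ Bhat ~ 0
  ABh = A @ Bhat
  lam = np.linalg.solve(ABh.T @ W @ ABh, ABh.T @ W @ y)
  print(np.allclose(x_hat, Bhat @ lam))   # True: the two forms of the solution agree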

📏 Residual norm decomposition

📏 Splitting the squared residual norm

The excerpt derives that the squared norm of the residual vector can be split into two parts (equation 45):

  • ê*Q_y^(−1)ê = (first term) + (second term)
  • First term: corresponds to the squared norm of the least-squares residual vector of the unconstrained model E{y} = Ax.
  • Second term: corresponds to the squared norm of the least-squares residual vector of the condition model B*E{x̂_A} = 0.

📏 Interpretation

  • The total residual reflects both the misfit of observations to the model and the violation of parameter constraints by the unconstrained estimator.
  • Don't confuse: the first term measures observation misfit; the second term measures how much the unconstrained estimator violates the parameter constraints.

Partitioned Model Representations

6.1 Introduction

🧭 Overview

🧠 One-sentence thesis

Partitioning the design matrices in linear estimation models allows efficient computation of parameter subsets and reveals how adding parameters affects estimator precision.

📌 Key points (3–5)

  • What partitioning means: splitting the design matrices A and B columnwise and rowwise to separate parameter vectors into subgroups.
  • Why partition: when only a subset of parameters is needed, partitioning avoids solving the complete system of normal equations, making computation more efficient.
  • Four cases covered: the chapter considers partitioning for both observation-equation models and condition-equation models, in different configurations.
  • Common confusion: adding more parameters to a model does not improve precision—it generally makes estimators less precise unless the new parameters are orthogonal to the original ones.
  • Key result: the estimator of the original parameters remains unchanged after adding new parameters if and only if the range spaces are orthogonal.

📐 The two foundational models

📐 Observation-equation model

The excerpt recalls from chapter 2:

Model: (observation equations)
Estimators given as: (equation 2)

  • This is the standard least-squares setup where observations are modeled as functions of unknown parameters.
  • The estimators and their variance matrices are derived from normal equations.

📐 Condition-equation model

The excerpt recalls from chapter 3:

Model: (condition equations)
Estimators given as: (equation 4)

  • This model imposes constraints directly on the expected values of the observations.
  • The chapter will partition matrix B rowwise for this case.

🔀 What partitioning does

🔀 Splitting the parameter vector

  • The design matrix A (or B) is partitioned to separate the unknown parameter vector x into subvectors x₁ and x₂.
  • Columnwise partitioning of A corresponds to splitting x into [x₁, x₂].
  • Rowwise partitioning of B corresponds to splitting constraint equations into groups.

🎯 The four cases

The chapter announces it will cover:

  1. Case 1: (observation model, specific partitioning)
  2. Case 2: (observation model, another partitioning)
  3. Case 3: (condition model, partitioning)
  4. Case 4: (condition model, another partitioning)
  • Case 2 is noted as "not dealt with explicitly in this chapter."
  • The goal is to show how partitioning A and B carries through to the estimators in equations (2) and (4).

🧮 Efficient computation via elimination

🧮 The partitioned normal equations

The excerpt presents the partitioned model (equation 5) and its normal equations (equation 6):

  • The normal equations are a system involving both x̂₁ and x̂₂.
  • If only x̂₁ is needed, there are two approaches:
    1. Solve the complete system and extract the first n₁ elements.
    2. Eliminate x̂₂ from the first n₁ equations and solve only for x̂₁.

⚡ Why elimination is more efficient

  • The second approach avoids solving the full system.
  • Elimination is achieved by premultiplying the normal equations with a specially constructed square, full-rank matrix (equation 7).
  • This produces a "reduced normal matrix" (equation 12), which is smaller and faster to invert.

🔧 The elimination matrix

The matrix used for elimination (equation 7) has the form:

  • Diagonal blocks: identity matrices of size n₁ and n₂.
  • Off-diagonal block in the first block-row: the term that cancels x̂₂ from the first n₁ equations (see the sketch below).

After premultiplication, the first n₁ equations contain only x̂₁, and the last n₂ equations express x̂₂ in terms of x̂₁.
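
With the shorthand N_ij = A_i* Q_y⁻¹ A_j and l_i = A_i* Q_y⁻¹ y (abbreviations introduced here, not necessarily the excerpt's), the premultiplication and the resulting reduced normal equations read, in LaTeX:

  \begin{pmatrix} I & -N_{12}N_{22}^{-1} \\ 0 & I \end{pmatrix}
  \begin{pmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{pmatrix}
  \begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \end{pmatrix}
  =
  \begin{pmatrix} I & -N_{12}N_{22}^{-1} \\ 0 & I \end{pmatrix}
  \begin{pmatrix} l_1 \\ l_2 \end{pmatrix}
  \quad\Longrightarrow\quad
  (N_{11} - N_{12}N_{22}^{-1}N_{21})\,\hat{x}_1 \;=\; l_1 - N_{12}N_{22}^{-1}\,l_2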

🔍 Orthogonal projectors and reduced systems

🔍 The orthogonal projector

Equation (9) introduces:

Orthogonal projector: P_A₁^⊥ (projector onto the orthogonal complement of the range space of A₁)

  • This projector has the properties shown in equation (10).
  • It is used to rewrite the reduced normal equations compactly (equation 12).

📝 The reduced normal matrix

Equation (11) defines an abbreviation for the reduced normal matrix:

  • It incorporates the orthogonal projector applied to A₂.
  • The solution for x̂₁ (equation 13) and its variance matrix (equation 14) are expressed using this reduced matrix.

🔗 Finding x̂₂ after x̂₁

  • Once x̂₁ is known, x̂₂ can be found from the last n₂ equations of the reduced system (equation 15).
  • The variance matrix of x̂₂ is derived using the propagation law of variances (equation 16).
  • Summary result: equation (17) gives both estimators and their variance matrices in terms of the partitioned structure.

🔄 Symmetric elimination: swapping roles

🔄 Eliminating x̂₁ instead

  • Instead of eliminating x̂₂, one can eliminate x̂₁ from the last n₂ equations.
  • This is done by premultiplying with a different matrix (equation 18).
  • The resulting system (equation 19) is analogous to equation (12), with roles reversed.

🔁 The symmetric result

Equation (21) gives the estimators and variance matrices when x̂₁ is eliminated:

  • Compare with equation (17): equation (21) follows by interchanging x₁ and x₂.
  • This symmetry shows that either parameter subset can be treated as the "primary" one.

📊 Adding parameters: impact on precision

📊 The original model

Suppose the original model (equation 22) has only parameter vector x₁:

  • Estimator: x̂₁⁰
  • Variance matrix: Q_x̂₁⁰

Equation (23) gives the standard least-squares formulas for this model.

➕ Enlarging the model

Now enlarge the model (equation 24) by adding parameter vector x₂:

  • The new estimator for x₁ is x̂₁ (equation 25).
  • The new variance matrix is Q_x̂₁.

🔑 When does the estimator stay the same?

Equation (26) states:

The estimators x̂₁⁰ and x̂₁ are identical if and only if: A₁* Q_y⁻¹ A₂ = 0

  • This condition means the range space of A₁ is orthogonal to the range space of A₂: R(A₁) ⊥ R(A₂).
  • In this case, the partitioned normal matrix becomes block-diagonal.
  • Conclusion: the estimator of x₁ is unaffected by adding x₂ if and only if the two parameter sets are orthogonal in the observation space.

📉 Precision generally gets worse

Equation (27) rewrites the variance matrix of x̂₁ using the orthogonal projector:

  • For any linear function a* x₁, the variance with the enlarged model is compared to the variance with the original model.

  • Equation (28) shows:

    Variance with enlarged model ≥ Variance with original model

  • The second term in equation (27) is always non-negative.

  • Important result: adding more parameters to a linear model generally makes the precision of the original estimators poorer.

  • Example: An organization adds new variables to a regression model; even if the new variables are relevant, the uncertainty in the original coefficients typically increases unless the new variables are orthogonal to the old ones.

🚫 When added parameters are not estimable

🚫 Variance matrix of the added parameter

Equation (29) expresses the variance matrix Q_x̂₂ of the added parameter x₂:

  • It is the inverse of a matrix involving the orthogonal projector applied to A₂.
  • The reduced normal matrix A₂* Q_y⁻¹ P_A₁^⊥ A₂ must be invertible for Q_x̂₂ to exist.

⚠️ Non-estimability condition

The excerpt states:

Q_x̂₂ does not exist and x₂ is not estimable if R(A₂) ⊆ R(A₁)

  • If the column vectors of A₂ are linear combinations of the column vectors of A₁, then the added parameters are redundant.
  • In this case, the new parameters cannot be uniquely estimated.
  • Don't confuse: orthogonality (R(A₁) ⊥ R(A₂)) means the estimator of x₁ is unchanged; containment (R(A₂) ⊆ R(A₁)) means x₂ is not estimable at all.
Relationship | Effect on x̂₁ | Effect on x̂₂
R(A₁) ⊥ R(A₂) | Unchanged | Estimable, independent
R(A₂) ⊆ R(A₁) | Changed | Not estimable (redundant)
General case | Changed (less precise) | Estimable, correlated
28

The Partitioned Model

6.2 The partitioned model

🧭 Overview

🧠 One-sentence thesis

Partitioning the design matrix into submatrices allows efficient computation of parameter estimators by eliminating unwanted parameters from the normal equations, and reveals how adding parameters affects precision and residual magnitudes.

📌 Key points (3–5)

  • Efficient estimation: When only interested in a subset of parameters, eliminate the others from the normal equations instead of solving the full system—this avoids unnecessary computation.
  • Precision trade-off: Adding more parameters to a model generally makes the estimators of the original parameters less precise (larger variance), unless the new parameters are orthogonal to the original ones.
  • Residual reduction: Adding parameters generally makes the least-squares residual vector shorter (better fit), reaching zero when the number of parameters equals the number of observations.
  • Common confusion: The predicted residual (difference between new observations and predictions from the old model) is not the same as the least-squares residual (difference between observations and the final fitted values).
  • Recursive updating: New observations can be incorporated without re-processing old data—only the previous estimator and the new observations are needed.

🔧 Partitioned normal equations and elimination

🔧 The partitioned model setup

The model is written with the design matrix split columnwise:

Partitioned observation model: E{y} = A₁x₁ + A₂x₂, where A₁ and A₂ are submatrices corresponding to parameter vectors x₁ and x₂.

  • The normal equations become a block system involving both x̂₁ and x̂₂.
  • If you only need x̂₁, you can avoid solving for x̂₂ explicitly.

✂️ Eliminating unwanted parameters

Two approaches to find x̂₁:

  1. Solve the complete normal equations and extract the first n₁ elements.
  2. Eliminate x̂₂ from the first n₁ equations and solve only for x̂₁—this is more efficient.

How elimination works:

  • Premultiply the normal equations by a special full-rank matrix that zeros out x̂₂ in the first block.
  • The resulting system involves an orthogonal projector onto the orthogonal complement of A₂'s range space: it removes the A₂-component from A₁ and y.
  • The resulting normal matrix is called a reduced normal matrix (reduced for x₂).

Example: If you have a model with 10 parameters but only care about the first 3, elimination lets you solve a 3×3 system instead of a 10×10 system.
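
A quick numerical sketch of this claim (invented numbers, unit weights): solving the reduced normal equations for x̂₁ gives exactly the first block of the full partitioned solution.

```python
# Sketch (invented numbers, W = Qy^-1 = I): the reduced normal equations,
# formed with the projector onto the orthogonal complement of R(A2),
# reproduce the x1 part of the full normal-equation solution.
import numpy as np

rng = np.random.default_rng(2)
m, n1, n2 = 8, 3, 2
A1 = rng.standard_normal((m, n1))
A2 = rng.standard_normal((m, n2))
y  = rng.standard_normal(m)

# full normal equations
A = np.hstack([A1, A2])
x_full = np.linalg.solve(A.T @ A, A.T @ y)

# reduced normal equations: eliminate x2
P_A2      = A2 @ np.linalg.inv(A2.T @ A2) @ A2.T
P_A2_perp = np.eye(m) - P_A2
N_red  = A1.T @ P_A2_perp @ A1              # reduced normal matrix (x2 eliminated)
x1_red = np.linalg.solve(N_red, A1.T @ P_A2_perp @ y)

print(np.allclose(x1_red, x_full[:n1]))     # True
```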

🔄 Symmetric elimination

  • You can also eliminate x̂₁ from the last n₂ equations to solve for x̂₂ first.
  • The formulas are symmetric: swapping the roles of x₁ and x₂ gives analogous results.

📉 Effect of adding parameters on precision

📉 When does adding parameters change the original estimator?

Suppose the original model is E{y} = A₁x₁, giving estimator x̂₁⁽⁰⁾.
Now enlarge it to E{y} = A₁x₁ + A₂x₂, giving estimator x̂₁.

  • The two estimators are identical if and only if the range space of A₁ is orthogonal to the range space of A₂, written R(A₁) ⊥ R(A₂) or A₁*Qy⁻¹A₂ = 0.
  • In this case, the partitioned normal matrix becomes block-diagonal.
  • Don't confuse: Orthogonality here is in the weighted inner product defined by Qy⁻¹, not ordinary Euclidean orthogonality.

📊 Precision always gets worse (or stays the same)

For any linear function c*x₁, the variance of its estimator satisfies:

Variance inequality: Var(c*x̂₁) ≥ Var(c*x̂₁⁽⁰⁾)

  • The second term in the variance formula is always non-negative.
  • Implication: Adding parameters generally makes the estimators of the original parameters less precise.
  • Only when R(A₁) ⊥ R(A₂) does equality hold.

Example: An organization fits a straight line through the origin to data. Later, they add an intercept parameter. The slope estimator will have larger variance in the enlarged model (unless the data points are symmetric about zero in a specific weighted sense).

🔍 Degree of dependence and estimability

For the added parameter x₂ (assumed scalar for simplicity):

  • The variance of x̂₂ depends on the angle α between A₂ and its projection onto R(A₁).
  • Degree of dependence: The angle α measures how much A₂ depends on A₁.

| Angle α | Interpretation | Variance of x̂₂ |
|---|---|---|
| α = π/2 | A₂ orthogonal to R(A₁) | Minimum: (A₂*Qy⁻¹A₂)⁻¹ |
| α = 0 | A₂ lies in R(A₁) | Infinite (not estimable) |
| α small | High dependence | Very large (poorly estimable) |

  • The ratio sin²α can be computed from diagonal elements of the normal matrix before and after reduction.
  • Software typically computes sin²α to diagnose singularities or rank defects.

Don't confuse: "Not estimable" (α = 0) means the column vectors of A₂ are linearly dependent on those of A₁, so the combined matrix (A₁ A₂) is rank-deficient.
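
A rough numerical sketch of the diagnostic (the design matrix and the added column are invented; Q_y = I): sin²α is taken here as the ratio of the reduced to the unreduced diagonal element of the normal matrix, and the variance of x̂₂ grows without bound as α → 0.

```python
# Degree-of-dependence sketch (invented numbers, Qy = I, scalar x2):
# sin^2(alpha) = reduced / unreduced diagonal element of the normal matrix,
# and var(x2-hat) = 1 / (reduced diagonal element).
import numpy as np

A1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
P_A1      = A1 @ np.linalg.inv(A1.T @ A1) @ A1.T
P_A1_perp = np.eye(3) - P_A1

for t in (1.0, 0.3, 0.05, 0.0):                      # t = 0 means a2 lies in R(A1)
    a2 = A1 @ np.array([1.0, 1.0]) + t * np.array([1.0, 1.0, -1.0])   # hypothetical column
    n22_full    = a2 @ a2                            # diagonal element before reduction
    n22_reduced = a2 @ P_A1_perp @ a2                # diagonal element after reduction
    sin2_alpha  = n22_reduced / n22_full
    var_x2 = np.inf if np.isclose(n22_reduced, 0.0) else 1.0 / n22_reduced
    print(f"sin^2(alpha) = {sin2_alpha:.4f}   var(x2-hat) = {var_x2:.4f}")
```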

📏 Effect on residuals and fitted values

📏 Orthogonal decomposition of the projector

The fitted value ŷ = A₁x̂₁ + A₂x̂₂ can be decomposed:

  • The orthogonal projector PA (projecting onto the range of (A₁ A₂)) splits into two orthogonal pieces.
  • One piece projects onto R(A₁), the other onto the orthogonal complement of R(A₁) within R(A).

📉 Residuals get smaller when adding parameters

Let ê₁ be the residual vector from the original model E{y} = A₁x₁, and ê₂ the residual from the enlarged model.

Residual norm inequality: ||ê₂||² ≤ ||ê₁||²

  • The squared length of the residual vector generally decreases when more parameters are added.
  • Extreme case: If the number of parameters equals the number of observations (m = n₁ + n₂), the matrix A becomes square and regular, PA becomes the identity matrix, and ê₂ = 0 (perfect fit, zero redundancy).

Example: Fitting a straight line through the origin leaves some residuals. Adding an intercept parameter allows the line to shift vertically, reducing the sum of squared residuals.

🔁 Recursive estimation with new observations

🔁 Two-step procedure

Consider a partitioned model where observations are split into y₁ and y₂ (assumed uncorrelated):

E{y₁} = A₁x, E{y₂} = A₂x

Step 1: Solve the partial model E{y₁} = A₁x to get x̂⁽¹⁾ and its variance Qx̂⁽¹⁾.
Step 2: Use x̂⁽¹⁾ and the new observations y₂ to compute the updated estimator x̂⁽²⁾.

  • Key advantage: You do not need to save the old observations y₁ to compute x̂⁽²⁾—only x̂⁽¹⁾, Qx̂⁽¹⁾, and y₂ are required.
  • This allows recursive updating: x̂⁽ᵏ⁾ can be computed from x̂⁽ᵏ⁻¹⁾ and yₖ.

🔄 Measurement update equations

Two equivalent formulas for the updated estimator:

  1. Invert an n×n matrix:
    x̂⁽²⁾ = x̂⁽¹⁾ + (Qx̂⁽¹⁾⁻¹ + A₂*Q₂⁻¹A₂)⁻¹ A₂*Q₂⁻¹ (y₂ - A₂x̂⁽¹⁾)

  2. Invert an m₂×m₂ matrix (more efficient when m₂ is small):
    x̂⁽²⁾ = x̂⁽¹⁾ + Qx̂⁽¹⁾A₂*(A₂Qx̂⁽¹⁾A₂* + Q₂)⁻¹(y₂ - A₂x̂⁽¹⁾)

  • The second formula is advantageous when m₂ (the number of new observations) is small compared to n (the number of parameters).
  • When m₂ = 1 (one new observation), only a scalar needs to be inverted.

📐 Predicted residual

Predicted residual: y₂ - A₂x̂⁽¹⁾, the difference between the new observations and the prediction based on the old estimator.

  • The correction to x̂⁽¹⁾ is proportional to the predicted residual.
  • Small predicted residual → small correction (the old estimator already predicts the new data well).
  • Large predicted residual → large correction (the old estimator does not match the new data).
  • The correction is also smaller if the variance of x̂⁽¹⁾ is small (more confidence in the old estimator).

Don't confuse: The predicted residual y₂ - A₂x̂⁽¹⁾ is not the least-squares residual y₂ - ŷ₂ = y₂ - A₂x̂⁽²⁾. The predicted residual uses the old estimator; the least-squares residual uses the final estimator.

📐 Worked example: straight line models

📐 Model 1: Line through the origin

Model: E{yᵢ} = aᵢx₁₁ (a straight line through the origin with slope x₁₁).

  • The least-squares estimator minimizes the sum of squared vertical distances from the points (aᵢ, yᵢ) to the line.
  • Estimator: x̂₁₁ = (sum of aᵢ²)⁻¹(sum of aᵢyᵢ)
  • Variance: σ²x̂₁₁ = σ²/(sum of aᵢ²)

📐 Model 2: Line with intercept

Enlarged model: E{yᵢ} = aᵢx₁ + x₂ (slope x₁, intercept x₂).

  • Define Ā = (1/m)sum of aᵢ (mean of the aᵢ).
  • The estimator x̂₁ involves the centered values (aᵢ - Ā).
  • Variance comparison: The variance of x̂₁ in Model 2 is larger than the variance of x̂₁₁ in Model 1, confirming the general principle that adding parameters increases variance.
  • The inequality σ²x̂₁ ≥ σ²x̂₁₁ can be verified algebraically by expanding the expressions.

Example interpretation: Fitting a line through the origin gives a precise slope estimate if the intercept is truly zero. Allowing a free intercept makes the slope estimate less precise because the model must now estimate both parameters from the same data.
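
A small numerical check of the two models (the aᵢ and yᵢ values below are invented; σ² = 1): the slope variance grows when the intercept is added, while the residual norm shrinks, in line with the two general results of this chapter.

```python
# Sketch (invented data, sigma^2 = 1): line through the origin vs. line with
# intercept. The slope loses precision, but the fit improves.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])            # abscissa values (assumed)
y = np.array([1.2, 1.9, 3.2, 3.9])            # observations (assumed)

# Model 1: line through the origin, E{y_i} = a_i * x11
x11     = (a @ y) / (a @ a)
var_x11 = 1.0 / (a @ a)
res1    = y - a * x11

# Model 2: line with intercept, E{y_i} = a_i * x1 + x2
A      = np.column_stack([a, np.ones_like(a)])
Qxhat  = np.linalg.inv(A.T @ A)
x1, x2 = Qxhat @ A.T @ y
var_x1 = Qxhat[0, 0]                          # equals 1 / sum((a_i - a_bar)^2)
res2   = y - A @ np.array([x1, x2])

print(var_x1 >= var_x11)                      # True: adding the intercept costs precision
print(res2 @ res2 <= res1 @ res1)             # True: but the residual norm shrinks
```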

29

The Partitioned Model

6.3 The partitioned model

🧭 Overview

🧠 One-sentence thesis

The partitioned model allows least-squares estimation to be performed recursively or in blocks, so that when new observations become available or when different groups measure separate parts of a network, the solution can be updated without re-processing all original data—provided the observations are uncorrelated.

📌 Key points (3–5)

  • Recursive estimation: when new observations y₂ arrive, the estimator can be updated from the previous solution x̂⁽¹⁾ and the new data, without needing to store or re-process the old observations y₁.
  • Uncorrelated observations assumption: the recursive schemes only have practical value when y₁ and y₂ are uncorrelated; if they are correlated, de-correlation transforms are needed, which require keeping the old data.
  • Two update formulas: the measurement update can be computed by inverting an n-order matrix (equation 59) or an m₂-order matrix (equation 62); the latter is more efficient when m₂ is small.
  • Common confusion: the predicted residual y₂ − A₂x̂⁽¹⁾ is not the same as the least-squares residual y₂ − A₂x̂⁽²⁾; the predicted residual measures how well the first-step estimate predicts the new observations.
  • Block estimation: when different groups measure separate parts of a network (e.g., two countries), the normal equations can be assembled by summing the normal matrices and right-hand sides from each partial model, then solved by eliminating shared parameters.

🔄 Recursive estimation fundamentals

🔄 The two-step procedure

Consider the partitioned model:

  • Step 1: y₁ = A₁x + e₁
  • Step 2: y₂ = A₂x + e₂

where y₁ and y₂ are uncorrelated.

The full model solution x̂⁽²⁾ can be found in two steps:

  1. Solve the partial model with y₁ alone to get x̂⁽¹⁾ and its variance Q_x̂⁽¹⁾.
  2. Use x̂⁽¹⁾ together with the new observations y₂ to compute x̂⁽²⁾ via the update model.

Why this matters: You do not need to save the old observables y₁ to compute x̂⁽²⁾. You can compute x̂⁽²⁾ from the solution x̂⁽¹⁾ and the new observables y₂ alone.

📐 Measurement update equations

Two equivalent formulas are given:

| Formula | Matrix inversion size | When to use |
|---|---|---|
| Equation (59) | Order n | General case |
| Equation (62) | Order m₂ | When m₂ is small compared to n |

Equation (62) (more efficient when m₂ is small):

x̂⁽²⁾ = x̂⁽¹⁾ + Q_x̂⁽¹⁾ A₂ᵀ (Q₂ + A₂ Q_x̂⁽¹⁾ A₂ᵀ)⁻¹ (y₂ − A₂x̂⁽¹⁾)

  • The correction to x̂⁽¹⁾ depends on the predicted residual y₂ − A₂x̂⁽¹⁾.
  • If m₂ = 1 (only one new observation), only a scalar needs to be inverted in equation (62), whereas equation (59) requires inverting three matrices.

Example: If you have an initial estimate from 100 observations and then receive 2 new observations, equation (62) inverts a 2×2 matrix instead of a 100×100 matrix.
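
A sketch of the measurement update in the m₂-inversion form shown above, checked against the one-shot batch solution (all data invented; y₁ and y₂ uncorrelated with identity covariance matrices).

```python
# Recursive measurement-update sketch (invented data, Q1 = I, Q2 = I):
# step 1 solves the partial model, step 2 updates with the new observations,
# and the result matches the batch solution of the full model.
import numpy as np

rng = np.random.default_rng(3)
n, m1, m2 = 3, 6, 2
A1, A2 = rng.standard_normal((m1, n)), rng.standard_normal((m2, n))
y1, y2 = rng.standard_normal(m1), rng.standard_normal(m2)
Q2 = np.eye(m2)

# step 1: solve the partial model E{y1} = A1 x
Qx_hat1 = np.linalg.inv(A1.T @ A1)
x_hat1  = Qx_hat1 @ A1.T @ y1

# step 2: measurement update with y2 (invert an m2 x m2 matrix only)
predicted_residual = y2 - A2 @ x_hat1
K = Qx_hat1 @ A2.T @ np.linalg.inv(Q2 + A2 @ Qx_hat1 @ A2.T)   # gain matrix
x_hat2  = x_hat1 + K @ predicted_residual
Qx_hat2 = Qx_hat1 - K @ A2 @ Qx_hat1                           # variance decreases

# batch check: process y1 and y2 together
A, y = np.vstack([A1, A2]), np.concatenate([y1, y2])
print(np.allclose(x_hat2, np.linalg.solve(A.T @ A, A.T @ y)))  # True
print(np.allclose(Qx_hat2, np.linalg.inv(A.T @ A)))            # True
```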

🔍 Predicted residual vs least-squares residual

Predicted residual: y₂ − A₂x̂⁽¹⁾, which measures how well the first-step estimate x̂⁽¹⁾ predicts the expected value E{y₂}.

Least-squares residual: y₂ − A₂x̂⁽²⁾, which measures the misfit after the full model is solved.

Don't confuse: The predicted residual uses the old estimate x̂⁽¹⁾; the least-squares residual uses the updated estimate x̂⁽²⁾.

Interpretation:

  • If the predicted residual is small, the correction to x̂⁽¹⁾ is small (the new data agrees with the old estimate).
  • If the predicted residual is large, the correction to x̂⁽¹⁾ is large (the new data disagrees with the old estimate).
  • If the variance of x̂⁽¹⁾ is small, the correction is small (we have more confidence in x̂⁽¹⁾ and give it more weight than y₂).

📊 Variance and precision updates

📊 How variance improves

The variance update formula (equation 63) shows:

Q_x̂⁽²⁾ = Q_x̂⁽¹⁾ − Q_x̂⁽¹⁾ A₂ᵀ (Q₂ + A₂ Q_x̂⁽¹⁾ A₂ᵀ)⁻¹ A₂ Q_x̂⁽¹⁾

  • The minus sign means the precision of the estimator gets better (variance decreases).
  • This is understandable: by including the additional observable y₂, more information is available to estimate x.

📈 Recursive update of the residual norm

The norm of the least-squares residual vector (êᵀ Q_y⁻¹ ê) can also be updated recursively:

ê⁽²⁾ᵀ Q_y⁻¹ ê⁽²⁾ = ê⁽¹⁾ᵀ Q₁⁻¹ ê⁽¹⁾ + (y₂ − A₂x̂⁽¹⁾)ᵀ (Q₂ + A₂ Q_x̂⁽¹⁾ A₂ᵀ)⁻¹ (y₂ − A₂x̂⁽¹⁾)

  • The update depends on the predicted residual.
  • This can be generalized to more steps (Table 6.3).

⚠️ Correlated observations case

⚠️ When observations are correlated

If y₁ and y₂ are correlated (Q₁₂ ≠ 0, Q₂₁ ≠ 0), the recursive schemes cannot be applied directly.

Solution: De-correlate the observables first by transforming the model.

Two de-correlation transforms are given:

  1. Define new observables: ȳ₁ = y₁, ȳ₂ = y₂ − Q₂₁Q₁⁻¹y₁
  2. Define new observables: ȳ₁ = y₁ − Q₁₂Q₂⁻¹y₂, ȳ₂ = y₂

After transformation, the new observables are uncorrelated and the recursive schemes can be applied.
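
A toy check of the first de-correlation transform (the covariance values are invented): after the transform the cross-covariance between the new observables vanishes, by the propagation law of variances.

```python
# De-correlation sketch (invented covariances): y2_bar = y2 - Q21 Q1^{-1} y1
# is uncorrelated with y1_bar = y1.
import numpy as np

Q1  = np.array([[2.0, 0.5], [0.5, 1.0]])
Q2  = np.array([[1.5]])
Q21 = np.array([[0.3, -0.2]])                  # cross-covariance of y2 with y1 (assumed)
Q   = np.block([[Q1, Q21.T], [Q21, Q2]])       # joint covariance of (y1, y2)

# transform matrix T: y_bar = T y
T = np.block([[np.eye(2),                 np.zeros((2, 1))],
              [-Q21 @ np.linalg.inv(Q1),  np.eye(1)]])

Q_bar = T @ Q @ T.T                            # propagation law of variances
print(np.allclose(Q_bar[2:, :2], 0.0))         # True: cross-covariance is zero
```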

⚠️ Loss of practical advantage

Important limitation: The transformed observable ȳ₂ still depends on the old observable y₁.

  • You still need to save the old observable y₁ to compute x̂⁽²⁾ from x̂⁽¹⁾ and y₂.
  • The recursive schemes lose their practical implication in the correlated case.

Conclusion: The recursive schemes are only of practical value if the observables are uncorrelated. Fortunately, this is the case in most practical applications.

Connection: The theory plays an important role in the estimation theory of dynamic systems and in Kalman filtering.

🧱 Block estimation

🧱 Summing normal equations from partial models

Consider a partitioned model with multiple blocks:

  • y₁ = A₁x₁ + e₁
  • y₂ = A₂₁x₁ + A₂₂x₂ + e₂
  • y₃ = A₃₂x₂ + A₃₃x₃ + e₃

The normal matrix of the full model can be found by summing the normal matrices of the partial models.

The right-hand side of the full normal equations can be found by summing the right-hand sides of the partial normal equations.

Don't confuse: The x̂⁽²⁾ in the block estimation section (section 6.4) is different from the x̂⁽²⁾ in the recursive estimation section (section 6.3).

🗺️ Practical example: geodetic networks

Example scenario: Two countries (Netherlands and Germany) are covered with a large geodetic network.

  • The Dutch Triangulation Department measures the Dutch part (full lines).
  • The German Triangulation Department measures the German part (dashed lines).
  • x₁ = unknown coordinates of network points in the Netherlands.
  • x₂ = unknown coordinates of network points on the border (shared).
  • x₃ = unknown coordinates of network points in Germany.

How block estimation works:

  1. Each country computes its own normal equations from its measurements.
  2. The full normal equations are assembled by summing the two sets of normal equations.
  3. The shared parameters (x₂, border points) are eliminated by pre-multiplying with transformation matrices.
  4. The solution is found by first solving for x₂, then substituting back to solve for x₁ and x₃.

Advantage: Each country can process its own data independently; only the normal equations (much smaller than the raw data) need to be shared and combined.
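
A brief sketch of the assembly step (dimensions and matrices invented): stacking the two partial models and summing their zero-padded normal matrices and right-hand sides give the same full normal equations.

```python
# Assembly sketch (invented setup): the full normal matrix and right-hand side
# are the sums of those of the two partial models (zero-padded to the full
# parameter vector x1, x2, x3).
import numpy as np

rng = np.random.default_rng(4)
n1, n2, n3 = 2, 1, 2                     # "Dutch", border, "German" parameters
m1, m2 = 6, 6
A_nl = np.hstack([rng.standard_normal((m1, n1)), rng.standard_normal((m1, n2)), np.zeros((m1, n3))])
A_de = np.hstack([np.zeros((m2, n1)), rng.standard_normal((m2, n2)), rng.standard_normal((m2, n3))])
y_nl, y_de = rng.standard_normal(m1), rng.standard_normal(m2)

# each group forms its own normal matrix and right-hand side
N_nl, b_nl = A_nl.T @ A_nl, A_nl.T @ y_nl
N_de, b_de = A_de.T @ A_de, A_de.T @ y_de

# the full model is just their sum
A, y = np.vstack([A_nl, A_de]), np.concatenate([y_nl, y_de])
print(np.allclose(A.T @ A, N_nl + N_de), np.allclose(A.T @ y, b_nl + b_de))   # True True
```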

30

The Partitioned Model

6.4 The partitioned model

🧭 Overview

🧠 One-sentence thesis

The partitioned model allows large estimation problems to be solved by breaking them into smaller partial models, computing reduced normal equations separately, and then combining results—reducing computational load and enabling distributed computation while preserving data secrecy.

📌 Key points (3–5)

  • Core mechanism: The normal matrix of a partitioned model equals the sum of the normal matrices from its partial models; the same holds for the right-hand sides.
  • Helmert's Block method: Partial models are reduced locally (eliminating internal parameters), only reduced blocks are sent to a central computing center, which solves for shared parameters and sends results back.
  • Practical advantages: Distributes computational load across organizations, and the computing center never sees original measured data—useful when secrecy is required.
  • Common confusion: Do not confuse the estimator notation x̂⁽²⁾ in the partial-model context (section 6.4) with x̂⁽²⁾ from the recursive schemes (previous section)—they refer to different partitioning schemes.
  • Two solution paths: Block estimation can proceed either by merging partial models first or by reducing them separately and combining only the reduced blocks.

🧩 Core mechanism: summing partial normal equations

🧩 How partitioned models combine

Consider the partitioned model:

A partitioned model splits observations and design matrices into blocks corresponding to different parameter groups.

The excerpt shows model (86) and its normal equations (87), then two partial models (88) with their normal equations (89).

Key result:

  • The normal matrix of the full model = sum of the normal matrices of the two partial models.
  • The right-hand side of the full model = sum of the right-hand sides of the partial models.

Why this matters:

  • Each partial model can be processed independently.
  • Results are combined by simple addition, not by re-processing raw data.

🔢 Generalization to multiple blocks

The excerpt generalizes to model (90) with three parameter vectors x₁, x₂, x₃.

The partitioned normal equations (91) can be reduced step-by-step:

  1. Premultiply by matrix (92) to eliminate x̂₁ from the second block → system (93).
  2. Premultiply by matrix (95) to eliminate x̂₃ from the second block → system (96).

The solution (98) proceeds in stages:

  • Solve the second block for x̂₂.
  • Substitute x̂₂ into the first and third blocks to solve for x̂₁ and x̂₃.

Don't confuse: The notation x̂⁽²⁾ in section 6.4 refers to parameters in the second partition of the current model, not the recursive estimator x̂⁽²⁾ from the previous section (the excerpt explicitly warns about this in footnote 4).

🌍 Practical application: Helmert's Block method

🌍 The geodetic network scenario

The excerpt uses a concrete example (figure 6.5):

  • The Netherlands and Germany each measure part of a large geodetic network.
  • x₁ = coordinates of Dutch network points.
  • x₂ = coordinates of border points (shared).
  • x₃ = coordinates of German network points.

Partial models:

  • Dutch partial model (99): relates Dutch observations to x₁ and x₂.
  • German partial model (100): relates German observations to x₂ and x₃.

🖥️ Method 1: Central merging

  1. Both countries send their partial models (99) and (100) plus measured data to the Computing Centre.
  2. The Computing Centre merges them into the complete model (90).
  3. The Computing Centre solves for x̂₁, x̂₂, x̂₃.
  4. Results are sent back: (x̂₁, x̂₂) to the Dutch, (x̂₂, x̂₃) to the Germans.

Drawback: The Computing Centre handles the full computational load and has access to all raw data.

🔐 Method 2: Helmert's Block method

  1. The Dutch form their partial normal equations and reduce for x₁, producing the reduced block (101).
  2. The Germans form their partial normal equations and reduce for x₃, producing the reduced block (102).
  3. Each country sends only the reduced normal block and right-hand side to the Computing Centre (not the raw data).
  4. The Computing Centre adds the reduced blocks to form system (103) and solves for x̂₂.
  5. x̂₂ is sent back to both countries.
  6. The Dutch use x̂₂ and (101) to form system (104) and solve for x̂₁ locally.
  7. The Germans use x̂₂ and (102) to form system (105) and solve for x̂₃ locally.

Advantages:

| Aspect | Benefit |
|---|---|
| Computational load | Distributed: each country does part of the work; the Computing Centre solves only for the shared parameters x₂ (a much smaller system). |
| Data secrecy | The Computing Centre never sees the original measured data yᵢ, only the reduced quantities A₁₂Q₁⁻¹y₁ and A₂₂Q₂⁻¹y₂. It cannot reconstruct the complete network. |
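
A numerical sketch of the reduce-and-combine steps above (the matrix names B_nl, C_nl, etc. and all numbers are invented): each "country" reduces its own normal equations for its internal parameters, only the reduced border blocks are summed at the centre, and back-substitution reproduces the direct solution of the merged model.

```python
# Helmert-blocking sketch (invented two-country setup, unit weights).
import numpy as np

rng = np.random.default_rng(4)
n1, n2, n3 = 2, 1, 2                     # Dutch, border, German parameter counts
m1, m2 = 6, 6
B_nl = rng.standard_normal((m1, n1)); C_nl = rng.standard_normal((m1, n2))   # Dutch design blocks
C_de = rng.standard_normal((m2, n2)); B_de = rng.standard_normal((m2, n3))   # German design blocks
y_nl, y_de = rng.standard_normal(m1), rng.standard_normal(m2)

# --- local processing (Netherlands): parameters (x1, x2), reduce for x1 ---
N11, N12, N22_nl = B_nl.T @ B_nl, B_nl.T @ C_nl, C_nl.T @ C_nl
b1, b2_nl = B_nl.T @ y_nl, C_nl.T @ y_nl
R_nl = N22_nl - N12.T @ np.linalg.solve(N11, N12)
r_nl = b2_nl - N12.T @ np.linalg.solve(N11, b1)

# --- local processing (Germany): parameters (x2, x3), reduce for x3 ---
N33, N32, N22_de = B_de.T @ B_de, B_de.T @ C_de, C_de.T @ C_de
b3, b2_de = B_de.T @ y_de, C_de.T @ y_de
R_de = N22_de - N32.T @ np.linalg.solve(N33, N32)
r_de = b2_de - N32.T @ np.linalg.solve(N33, b3)

# --- computing centre: only the reduced border blocks are combined ---
x2 = np.linalg.solve(R_nl + R_de, r_nl + r_de)

# --- back-substitution at home ---
x1 = np.linalg.solve(N11, b1 - N12 @ x2)
x3 = np.linalg.solve(N33, b3 - N32 @ x2)

# --- check against the direct solution of the merged model ---
A = np.block([[B_nl, C_nl, np.zeros((m1, n3))],
              [np.zeros((m2, n1)), C_de, B_de]])
y = np.concatenate([y_nl, y_de])
x_direct = np.linalg.solve(A.T @ A, A.T @ y)
print(np.allclose(np.concatenate([x1, x2, x3]), x_direct))   # True
```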

Historical note: The method was introduced by the German geodesist F.R. Helmert (1843–1917) and successfully used for the readjustment of the European triangulation network (RETrig).

🔄 Block estimation via condition equations

🔄 Alternative solution path

Section 6.5 presents a second block-estimation approach using condition equations instead of direct reduction.

Three-step procedure (table 6.4 and 6.5):

  1. Solve the first partial model to get x̂⁽¹⁾ and its covariance Q_x̂⁽¹⁾.
  2. Solve the second partial model to get x̂⁽²⁾ and its covariance Q_x̂⁽²⁾.
  3. Formulate a model of condition equations (112) or (122) that treats x̂⁽¹⁾ and x̂⁽²⁾ as free random variates and enforces consistency.

Key result: The solution of the condition-equation model (113) or (123) is identical to the solution of the original partitioned model (107) or (117).

Two equivalent expressions for the solution (114) and (115):

  • They differ only by interchanging the roles of x̂⁽¹⁾ and x̂⁽²⁾.
  • Both yield the same final estimators.

📐 Levelling network example

Example 3 (figure 6.6) applies block estimation to a levelling network:

  • Point 0 is the reference (x₀ = 0).
  • The full model (124) has normal equations (125).
  • Direct solution: eliminate x̂₁ from the second equation via premultiplication (126), solve for x̂₂ and x̂₃ (127), back-substitute for x̂₁ (128).

Block-estimation approach:

  • Partial model (129) for one part of the network → normal equations (131) → solution (132).
  • Partial model (130) for another part → normal equations (133) → solution (134).
  • The excerpt notes that (132) is the solution of the partial network in figure 6.7a, and (134) is the solution of another partial network (the excerpt ends mid-sentence).

Don't confuse: The partial solutions (132) and (134) are intermediate results; they must be combined via the condition-equation model to obtain the final solution consistent with the full network.

🧷 When recursive schemes lose practical value

🧷 The de-correlation requirement

The excerpt begins by noting that recursive schemes (from earlier sections) require observables y₁ and y₂ to be uncorrelated.

If observables are correlated:

  • They can be transformed to uncorrelated observables (75) or (78).
  • The transformed models (76) or (79) allow recursive schemes to be applied mathematically.

But: The transformed observable (e.g., y₂ in (76)) still depends on the old observable y₁, so one must save y₁ to compute the recursive estimates.

Conclusion: Recursive schemes are only of practical value when observables are already uncorrelated—fortunately, this is the case in most practical applications.

Importance: The theory developed in this section plays an important role in estimation theory of dynamic systems and Kalman filtering.

31

The Partitioned Model

6.5 The partitioned model

🧭 Overview

🧠 One-sentence thesis

The solution of a partitioned adjustment model can be computed in separate phases by first solving partial models independently and then combining their results through a connecting model, reducing computational load and enabling distributed processing.

📌 Key points (3–5)

  • Block estimation method: a partitioned model can be solved in three steps—solve two partial models separately, then solve a connecting condition-equations model using the partial results.
  • Computational advantage: the Computing Centre handles a smaller system (only dimension of x₂) while partial computations stay at local departments, reducing central computational load.
  • Secrecy benefit: the Computing Centre never receives original measured data, only processed results from partial models, preserving data confidentiality.
  • Common confusion: the final solution obtained through block estimation is mathematically identical to solving the full partitioned model directly—the method changes the computational path, not the result.
  • Estimation in phases: for partitioned condition-equations models, solve the first partial model, then use its solution and variance to formulate and solve a second model.

🧩 Block Estimation Method (Helmert's Approach)

🧩 The partitioned model structure

The excerpt starts with a partitioned model (equation 106) that can be written in terms of two parameter vectors (x₁ and x₂) and corresponding observation groups.

  • The direct solution (equation 107) solves the entire system at once.
  • The block estimation approach breaks this into manageable pieces.

🔧 Three-step solution process

The block estimation scheme (Table 6.4) shows:

  1. Step 1: Solve the first partial model (equation 108, first part) to get estimates and variance for the first parameter set.
  2. Step 2: Solve the second partial model (equation 108, second part) to get estimates and variance for the second parameter set.
  3. Step 3: Formulate a model of condition equations (equation 112) using the results from steps 1 and 2, then solve it.

The final estimators from step 3 (equations 114 and 115) are identical to the direct solution (equation 107).

🔄 Symmetry and interchangeability

  • Equation 115 follows from equation 114 by interchanging the roles of the two parameter sets.
  • This symmetry shows the method is flexible: either parameter set can be treated as "primary."

🏗️ Generalized Block Estimation

🏗️ Extended partitioned model

The excerpt generalizes the method to a more complex partitioned model (equation 116) with multiple observation groups.

  • The solution (equation 117) is the direct approach.
  • Two partial models (equation 118) are solved separately, yielding solutions (equation 119).

🔗 Connecting model formulation

Using notation from the partial solutions, the full solution (equation 120) can also be obtained by solving:

  • An observation-equations model (equation 121), or equivalently
  • A condition-equations model (equation 122) where the partial estimates are treated as free random variates.

The solution (equation 123) generalizes the earlier result (equation 113).

Table 6.5 summarizes the generalized block estimation:

| Step | Action |
|---|---|
| (1)–(2) | Solve each partial model separately |
| (3)–(4) | Use partial results to formulate a condition-equations model |
| (5)–(6) | Solve the connecting model; this yields the solution of the original partitioned model |

📐 Levelling network example

The excerpt provides a concrete levelling network (Figure 6.6) with reference point x₀ = 0.

  • The full model (equation 124) has normal equations (equation 125).
  • Direct solution: eliminate x̂₁ from the second normal equation by premultiplying with a transformation matrix, yielding equations 126–128.

Block estimation approach:

  • Partial model (129): corresponds to the sub-network in Figure 6.7a, solution (132).
  • Partial model (130): corresponds to the sub-network in Figure 6.7b, solution (134).
  • Connecting model (135): one condition equation linking the two partial networks.
  • Final solution (136): verified to be identical to the direct solution (equations 127–128).

Redundancy check:

  • Partial model (129): redundancy = 1
  • Partial model (130): redundancy = 0
  • Connecting model (135): redundancy = 1
  • Total redundancy = 1 + 0 + 1 = 2, matching the full network (Figure 6.6).

Don't confuse: the block method does not change the total redundancy or the final estimates; it redistributes the computation.

💼 Practical Advantages

💼 Reduced computational load at the centre

  • With the block method, the Computing Centre solves a system of dimension equal to x₂ only (equation 103 reference).
  • Partial computations remain at the two national Triangulation Departments.
  • This distributes the workload and reduces the size of the central system.

🔒 Data secrecy

  • The Computing Centre receives processed quantities (e.g., weighted observations from partial models), not the original measured data yᵢ.
  • The Centre cannot reconstruct the complete network coordinates from these processed inputs.
  • This is advantageous when data and coordinates must remain confidential.

🌍 Historical application

  • The method is known as Helmert's Block method, introduced by German geodesist F.R. Helmert (1843–1917).
  • It was successfully used for the readjustment of the European triangulation network (RETrig: Réseau Européen Trigonometrique).

🔄 Estimation in Phases (Tienstra's Method)

🔄 Partitioned condition-equations model

The excerpt introduces a partitioned model of condition equations (equation 141) with constraint matrices B₁ and B₂.

  • The direct solution is given by equation 144.
  • The method of estimation in phases solves this in two stages.

📝 Two-phase solution process

Phase 1: Solve the partial model (equation 142) to obtain:

  • Estimate ŷ_B₁ (equation 153)
  • Variance matrix Q_ŷ_B₁ (equation 154)

Phase 2: Using the results from Phase 1, solve the model (equation 143) to obtain the final estimate ŷ_B (equation 155).

The final solution (155) is identical to the direct solution of the full partitioned model (141).

🧮 Derivation outline

  • The solution of (141) involves normal equations (146) in terms of auxiliary variable ẑ.
  • Eliminate ẑ₁ by premultiplying with a transformation matrix, yielding reduced normal equations (150).
  • The reduced system solution (151) leads to the two-phase formulation.
  • The orthogonal projector (148) and abbreviation (149) simplify the algebra.

📊 Residual vector decomposition

The least-squares residual vector ê_B = y - ŷ_B can be decomposed (equation 157):

ê_B = ê_B₁ + ê_B₂

where ê_B₁ and ê_B₂ are the residual vectors from the two separate models (142) and (143).

Squared norm of residuals (equation 159):

  • The squared norm ê_Bᵀ Q_y⁻¹ ê_B equals the sum of the squared norms from the two phases.
  • This shows that the total residual sum of squares can be partitioned across the phases.

Don't confuse: the residual vectors add, but the squared norms also add because the cross-term vanishes (equation 158).

🔗 Connection to mixed model

The block estimation scheme of Table 6.5 can also be derived from the mixed model representation (section 5.3 reference).

  • The partitioned model (116) is equivalent to a mixed model form (138)–(139).
  • Applying mixed-model results to (138) yields the block estimation scheme.
  • Equation 140 shows that the squared norm of residuals of the partitioned model equals the sum of squared norms from the three constituent models (the two partial models and the connecting model).

🏛️ Historical note

The method of estimation in phases was introduced by Dutch geodesist J.M. Tienstra (1895–1951), professor of Mathematical Geodesy at Delft University of Technology (appointed 1935).

32

The Partitioned Model

6.6 The partitioned model

🧭 Overview

🧠 One-sentence thesis

The partitioned model of condition equations can be solved efficiently by estimation in phases—first solving a partial model for a subset of observations, then solving a second model that incorporates the remaining observations using the first-phase results.

📌 Key points (3–5)

  • What the partitioned model is: a model of condition equations split into two parts, B₁ and B₂, allowing solution in two separate phases rather than all at once.
  • How estimation in phases works: solve the partial model with B₁ first to get ŷ_B₁ and its variance matrix, then solve the second model using those results to get the final solution ŷ_B.
  • Key advantage: the second phase does not require the full variance matrix Q_y; only the misclosures and partial results are needed, making the method computationally efficient.
  • Common confusion: the least-squares residual vector of the partitioned model is the sum of the residual vectors from the two separate phases, not computed independently.
  • Connection to earlier material: the partitioned condition equations model corresponds to the partitioned observation equations model from section 6.2, with equivalent orthogonal decompositions and projectors.

📐 The partitioned model structure

📐 Model definition

Partitioned model of condition equations: B₁ᵀy = 0 and B₂ᵀy = 0, written together as Bᵀy = 0 where B = (B₁ B₂).

  • The full model (141) is split into two condition equation sets.
  • The solution ŷ_B can be found without solving the full system at once.
  • This is the foundation for the method of estimation in phases introduced by Dutch geodesist J.M. Tienstra (1895-1951).

🔢 Two-phase solution approach

Phase 1: Solve the partial model B₁ᵀy = 0 to obtain:

  • The estimate ŷ_B₁
  • Its variance matrix Q_ŷ_B₁

Phase 2: Solve the model B₂ᵀ(y - ŷ_B₁) = 0 using the first-phase results.

  • The final solution ŷ_B is constructed from both phases.
  • The excerpt shows mathematically (equations 144–155) that this two-phase approach yields the same result as solving the full partitioned model directly.
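
A sketch of estimation in phases for a condition-equation model, loosely in the spirit of the two-loop levelling example (the loop orientations, weights, and observations below are assumptions, not taken from the figures). The second phase here treats ŷ_B₁, with its variance matrix, as the input and enforces the B₂ condition; the result matches the direct adjustment with B = (B₁ B₂).

```python
# Two-phase conditional adjustment sketch (invented network and weights).
import numpy as np

def conditional_adjustment(y, Q, B):
    """Least-squares solution of the condition model B^T E{y} = 0."""
    G = Q @ B @ np.linalg.inv(B.T @ Q @ B) @ B.T
    return y - G @ y, Q - G @ Q                      # adjusted y and its variance matrix

Q  = np.eye(5)                                       # assumed unit weights
y  = np.array([1.02, 2.03, 3.01, 0.98, 4.03])        # invented misclosing height differences
B1 = np.array([[1.0, 1.0, -1.0, 0.0, 0.0]]).T        # loop 1: y1 + y2 - y3 = 0 (assumed)
B2 = np.array([[0.0, 0.0, 1.0, 1.0, -1.0]]).T        # loop 2: y3 + y4 - y5 = 0 (assumed)

# phase 1: adjust with B1 only
y_B1, Q_B1 = conditional_adjustment(y, Q, B1)
# phase 2: use the first-phase results and enforce the B2 condition
y_B, _ = conditional_adjustment(y_B1, Q_B1, B2)

# direct solution with all conditions at once
y_direct, _ = conditional_adjustment(y, Q, np.hstack([B1, B2]))
print(np.allclose(y_B, y_direct))                    # True
```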

🧮 Mathematical derivation

🧮 Normal equations and elimination

The solution of the full model (144) involves a vector random variable ẑ defined in (145), which satisfies the partitioned system of normal equations (146).

  • The estimator ẑ₁ can be eliminated from the second set of normal equations by premultiplying with a specific full-rank matrix.
  • This elimination process (147) introduces an orthogonal projector and leads to a reduced system of normal equations (150).
  • The solution of this reduced system (151) can be rewritten to show the two-phase structure (152–155).

🔗 Connecting the phases

The excerpt demonstrates (equations 152–155) that:

  • Equation (153) is exactly the solution of the first partial model (142).
  • Equation (154) gives the variance matrix of ŷ_B₁.
  • Substituting these into (152) yields (155), which is the solution of the second model (143).

Don't confuse: The final solution ŷ_B is not simply ŷ_B₁; it incorporates corrections from the second phase based on the remaining condition equations B₂.

📊 Residuals and redundancy

📊 Residual vector decomposition

The least-squares residual vector ê_B = y - ŷ_B can be expressed as the sum of residuals from the two phases (157):

ê_B = ê_B₁ + ê_B₂

where:

  • ê_B₁ is the residual from the first phase (model 142)
  • ê_B₂ is the residual from the second phase (model 143)

📏 Squared norm of residuals

The squared norm of the residual vector (the sum of squared residuals) follows from (159) and simplifies to (164):

ê_Bᵀ Q_y⁻¹ ê_B = ê_B₁ᵀ Q_y⁻¹ ê_B₁ + ê_B₂ᵀ Q_ŷ_B₁⁻¹ ê_B₂

  • The excerpt emphasizes (after equation 159) that computing the second term does not require the full variance matrix Q_y.
  • Instead, the misclosures (defined in 160) can be used: w₁ = B₁ᵀy and w₂ = B₂ᵀy.
  • Equations (161–163) show how to compute the needed quantities using only these misclosures and the variance matrix from the first phase.

Why this matters: This property preserves the efficiency of the two-phase method—the second phase does not need to access the original full variance matrix.

🏗️ Worked example: levelling networks

🏗️ One-loop network (first phase)

Example 4 presents a one-loop levelling network (figure 6.8) with model (165).

  • The solution (166) and variance matrix (167) are computed for three height-difference observables.
  • This serves as the baseline for the first phase when the network is enlarged.

🏗️ Two-loop network (full partitioned model)

The network is enlarged with two additional observables (figure 6.9), giving model (168).

First phase (169–171):

  • Solve the partial model for the first three observables.
  • The solution (170) is identical to the original one-loop solution (166).
  • The last two observables (y₄ and y₅) receive no correction in this phase because they are uncorrelated with y₁, y₂, and y₃.

Second phase (172–173):

  • Solve the partial model B₂ᵀ(y - ŷ_B₁) = 0 using the first-phase results.
  • The final solution (173) incorporates corrections from both phases.

Don't confuse: The assumption that y₄ and y₅ are uncorrelated with the first three observables is crucial; this is typical in practical applications but not universal.

🔄 Correspondence with observation equations

🔄 Parallel model structures

The excerpt (section starting "There exists a close correspondence...") establishes a correspondence between:

| Observation equations (section 6.2) | Condition equations (section 6.6) |
|---|---|
| Partitioned model (174): E{y} = A₁x₁ + A₂x₂ | Partitioned model (176): B₁ᵀy = 0, B₂ᵀy = 0 |
| Partial model (175): E{y} = A₁x₁ (x₂ = 0) | Partial model (177): B₁ᵀy = 0 |

  • Setting x₂ = 0 in the observation equations model increases redundancy (more condition equations can be formed).
  • The solution of (176) is identical to the solution of (175); the solution of (177) is identical to the solution of (174).
  • This equivalence (178) links the two model types through the relationship A = (A₁ A₂) and B = (B₁ B₂).

🔄 Orthogonal decompositions

Both model pairs have orthogonal decompositions:

  • For observation equations (179): P_A = P_A₁ + P_A₂
  • For condition equations (180): P_B^⊥ = P_B₁^⊥ + P_B₂^⊥

The excerpt uses these decompositions (181–182) to relate the orthogonal projectors between the two model types.

🔄 Alternative proof via correspondence

Table 6.7 summarizes the correspondence between the four models, allowing an alternative proof of estimation in phases (equations 183–186).

  • Starting from the orthogonal relation (183), the excerpt derives (184): P_B^⊥ Q_y = P_B₂^⊥ Q_y + P_B₁^⊥ Q_y.
  • Substituting the decomposition (180) and using properties of projectors leads to (185).
  • Recognizing the variance matrix structure yields (186), confirming the two-phase method.

Geometric interpretation: Figure 6.10 illustrates the relationship P_B^⊥ Q_y = P_B₂^⊥ Q_y + P_B₁^⊥ Q_y as an orthogonal decomposition in the observation space.

📋 Summary table

The excerpt provides Table 6.6 summarizing the estimation-in-phases procedure:

  1. First phase: Solve B₁ᵀy = 0 to obtain ŷ_B₁ and Q_ŷ_B₁.
  2. Second phase: Solve B₂ᵀ(y - ŷ_B₁) = 0 to obtain the final solution ŷ_B.

The table consolidates the key formulas for practitioners applying the method.

33

The Nonlinear A-Model

7.1 The nonlinear A-Model

🧭 Overview

🧠 One-sentence thesis

The nonlinear A-model, which describes most geodetic applications where observations are nonlinearly related to unknown parameters, can be solved through linearization using Taylor's theorem and an iterative least-squares process that progressively refines the solution.

📌 Key points (3–5)

  • When nonlinearity matters: In most geodetic applications (unlike levelling), the observation vector E{y} is nonlinearly related to the parameter vector x, requiring a nonlinear A-model instead of the linear version.
  • How linearization works: Taylor's theorem allows approximation of a nonlinear function by a linear function at a point x₀, provided x₀ is close enough to the true value that the second-order remainder can be neglected.
  • Iteration process: When the initial approximation x₀ is not close enough, least-squares iteration repeatedly re-linearizes around successively better approximations until convergence.
  • Common confusion: Linear vs. nonlinear vector functions—a function A(·) is linear only if it satisfies A(αx + βy) = αA(x) + βA(y) for all scalars α, β and vectors x, y; otherwise it is nonlinear.
  • Practical workflow: Approximate parameter values are usually computed directly from observations, then used to linearize the model, solve for corrections, update the approximation, and repeat until changes become negligible.

🔍 Distinguishing linear from nonlinear models

🔍 Definition of linearity

A vector function A(·): ℝⁿ → ℝᵐ is linear if and only if A(αx + βy) = αA(x) + βA(y) for all scalars α, β and all vectors x, y in ℝⁿ.

  • If the function does not satisfy this property, it is nonlinear.
  • This is a formal test; in practice, look for products, powers, trigonometric functions, or square roots of the unknowns.

📐 Examples from geodesy

| Application | Model structure | Linear or nonlinear? | Why |
|---|---|---|---|
| Levelling network | Height differences = linear combinations of heights | Linear | Observation equations are sums/differences of unknowns |
| Distance resection | Observed distances = square root of sum of squared coordinate differences | Nonlinear | Square root and squared terms of unknowns |
| Parabola fitting | y = ax² + b | Nonlinear | The unknown x-coordinates appear squared |
| Equilateral triangle | Coordinates related through trigonometric and distance functions | Nonlinear | Trigonometric functions and square roots of unknowns |

Don't confuse: A model with many equations is not automatically nonlinear; linearity depends on the functional form, not the number of equations or unknowns.

📏 Taylor's theorem for linearization

📏 Single-variable case

For a function f: ℝ → ℝ with continuous q-th order derivatives, there exists a scalar ξ between x and x₀ such that:

f(x) = f(x₀) + (df/dx)(x₀)·Δx + (1/2!)(d²f/dx²)(x₀)·(Δx)² + ... + remainder

where Δx = x - x₀.

  • The zero-order term f(x₀) depends only on x₀.
  • The first-order term (df/dx)(x₀)·Δx is linear in x.
  • The second-order remainder can be made arbitrarily small by choosing x₀ close enough to x.

📏 Multi-variable case

For a function f: ℝⁿ → ℝ with continuous q-th order partial derivatives:

f(x) = f(x₀) + [gradient at x₀]ᵀ·Δx + (1/2)·Δxᵀ·[Hessian at x₀]·Δx + remainder

where:

  • Gradient vector: ∇f(x₀) = [∂f/∂x₁, ..., ∂f/∂xₙ]ᵀ evaluated at x₀
  • Hessian matrix: matrix of second partial derivatives ∂²f/(∂xₐ∂xᵦ) evaluated at x₀

✂️ Linearization approximation

Linearization of f(x) at x₀: f(x) ≈ f(x₀) + [∇f(x₀)]ᵀ·Δx

  • This approximation neglects the second-order and higher terms.
  • It is valid when x₀ is sufficiently close to x that the remainder is negligible.
  • Geometrically (1D case): the linearization is the tangent line to the curve at x₀.

Example: Linearization of f(x) = tan(x) at x₀ = 0 gives f(x) ≈ 0 + (1/cos²(0))·(x - 0) = x.

Example: Linearization of f(x₁, x₂) = √(x₁² + x₂²) at (x₁⁰, x₂⁰) gives: f(x₁, x₂) ≈ √((x₁⁰)² + (x₂⁰)²) + [x₁⁰/√((x₁⁰)² + (x₂⁰)²)]·(x₁ - x₁⁰) + [x₂⁰/√((x₁⁰)² + (x₂⁰)²)]·(x₂ - x₂⁰)
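
A quick numerical check of the second example (expansion point and increment invented): near the expansion point the linearization differs from the exact value only by a second-order term.

```python
# Linearization check for f(x1, x2) = sqrt(x1^2 + x2^2) (invented point/increment).
import numpy as np

def f(x):
    return np.sqrt(x[0]**2 + x[1]**2)

x0   = np.array([3.0, 4.0])                 # expansion point, f(x0) = 5
grad = x0 / f(x0)                           # gradient [x1/f, x2/f] at x0
dx   = np.array([0.05, -0.02])              # small increment

exact  = f(x0 + dx)
linear = f(x0) + grad @ dx
print(exact, linear, abs(exact - linear))   # the difference is second order in dx
```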

🔧 Constructing the linearized A-model

🔧 From nonlinear to linearized form

The nonlinear A-model is: E{y} = A(x)

where A(·): ℝⁿ → ℝᵐ is a nonlinear vector function.

Each component aᵢ(x) of A(x) can be linearized: aᵢ(x) ≈ aᵢ(x₀) + [∇aᵢ(x₀)]ᵀ·Δx

Stacking all m components gives: A(x) ≈ A(x₀) + [∂A/∂x evaluated at x₀]·Δx

🔧 The linearized model

Define:

  • Δy = y - A(x₀) (observed minus function value at approximation)
  • Δx = x - x₀ (correction to the approximation)
  • Design matrix: ∂A/∂x evaluated at x₀ (m × n matrix of partial derivatives)

Then the linearized A-model is: E{Δy} = [∂A/∂x at x₀]·Δx

Key insight: This is now a linear model in the unknowns Δx, so standard least-squares formulas apply.

🔧 Least-squares solution

The least-squares estimator for the correction is: Δx̂ = [(∂A/∂x)ᵀ·P·(∂A/∂x)]⁻¹·(∂A/∂x)ᵀ·P·Δy

where P is the weight matrix of the observations.

The updated parameter estimate is: x̂ = x₀ + Δx̂

The variance matrix is: Var(x̂) ≈ σ²·[(∂A/∂x)ᵀ·P·(∂A/∂x)]⁻¹

Important caveat: These results are approximate because the second-order remainder was neglected; they are accurate only if x₀ is close enough to the true value.

📝 Worked example: Distance resection

Consider three known points 1, 2, 3 and one unknown point 4. Observe three distances l₁₄, l₂₄, l₃₄. The unknowns are the two coordinates (x₄, y₄).

Nonlinear model:
E{l₁₄} = √[(x₄ - x₁)² + (y₄ - y₁)²]
E{l₂₄} = √[(x₄ - x₂)² + (y₄ - y₂)²]
E{l₃₄} = √[(x₄ - x₃)² + (y₄ - y₃)²]

Linearized model: Δl₁₄ ≈ [(x₄⁰ - x₁)/distance₁₄⁰]·Δx₄ + [(y₄⁰ - y₁)/distance₁₄⁰]·Δy₄ (and similarly for the other two distances)

where distance₁₄⁰ = √[(x₄⁰ - x₁)² + (y₄⁰ - y₁)²] computed from the approximate values.

Computing approximate values: Use the cosine rule with two observed distances and two known points to compute an approximate angle, then compute (x₄⁰, y₄⁰) from that angle.
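
A sketch of the linearize-and-iterate procedure for the distance resection (the known points, the observed distances, and the starting values are invented; unit weights assumed).

```python
# Gauss-Newton sketch for the distance resection (invented data, unit weights).
import numpy as np

known = np.array([[0.0, 0.0],
                  [100.0, 0.0],
                  [0.0, 100.0]])               # points 1, 2, 3 (assumed coordinates)
l_obs = np.array([70.80, 73.50, 67.90])        # observed distances to point 4 (invented)

x4 = np.array([40.0, 40.0])                    # approximate values (x4^0, y4^0)
for _ in range(10):
    diff = x4 - known                          # rows: (x4 - xi, y4 - yi)
    dist = np.linalg.norm(diff, axis=1)        # distances computed from the approximation
    J    = diff / dist[:, None]                # design matrix of partial derivatives
    dl   = l_obs - dist                        # reduced observations: l - l(x0)
    dx   = np.linalg.solve(J.T @ J, J.T @ dl)  # least-squares correction
    x4   = x4 + dx
    if np.linalg.norm(dx) < 1e-8:              # convergence test
        break

print(x4)                                      # adjusted coordinates of point 4
```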

🔄 Least-squares iteration process

🔄 Why iteration is needed

  • If the initial approximation x₀ is not close enough to the true value, the second-order remainder is not negligible.
  • The computed x̂ = x₀ + Δx̂ is then not the true least-squares estimator.
  • Solution: Use x̂ as a new, better approximation and repeat the linearization and solution process.

🔄 Iteration algorithm

  1. Initialize: Choose an initial approximation x⁰ (often computed directly from observations).
  2. Linearize: Compute the design matrix ∂A/∂x at xⁱ and form Δy = y - A(xⁱ).
  3. Solve: Compute the correction Δx̂ⁱ using least-squares.
  4. Update: Set xⁱ⁺¹ = xⁱ + Δx̂ⁱ.
  5. Check convergence: If ||xⁱ⁺¹ - xⁱ|| is smaller than a chosen tolerance, stop; otherwise return to step 2.

Convergence criterion: The iteration is terminated when the difference between successive solutions is negligible (e.g., changes in the 5th or 6th significant digit).

🔄 Iteration flow diagram summary

The excerpt provides a detailed flow diagram (Table 7.2) with these steps:

  • Start with approximate values x₀
  • Loop: compute design matrix, form reduced observations, solve normal equations, update parameters
  • Test: if change is small enough, stop; otherwise continue loop

📊 Numerical example

Consider the nonlinear model:
E{y₁} = x²
E{y₂} = x

with observations y₁ = 4.1, y₂ = 1.9.

Iteration results:

| Iteration i | Solution xⁱ |
|---|---|
| 0 | 1.9 |
| 1 | 2.0205959 |
| 2 | 2.0176464 |
| 3 | 2.0176324 |
| 4 | 2.0176344 |
| 5 | 2.0176344 |

  • After iteration 2, changes are small.
  • Depending on required precision, x̂ ≈ 2.01763 (5 significant digits).

Don't confuse: The iteration process with the linearization itself—linearization is a single approximation step; iteration is the repeated application of linearization to improve the approximation.
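
A short sketch reproducing the iteration (unit weights assumed, starting from x⁰ = y₂ = 1.9); the printed iterates agree with the table above up to rounding in the last digits.

```python
# Iteration sketch for E{y1} = x^2, E{y2} = x with y = (4.1, 1.9), unit weights.
import numpy as np

y = np.array([4.1, 1.9])
x = 1.9                                        # initial approximation x^0
for i in range(6):
    a  = np.array([x**2, x])                   # A(x0)
    J  = np.array([2 * x, 1.0])                # design matrix dA/dx at x0
    dy = y - a                                 # reduced observations
    dx = (J @ dy) / (J @ J)                    # least-squares correction (scalar unknown)
    x += dx
    print(i + 1, x)
```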

🔗 Relationship to the nonlinear B-model

🔗 Nonlinear B-model definition

The nonlinear B-model (model of condition equations) is: Bᵀ(y) = 0

where Bᵀ(·): ℝᵐ → ℝᵐ⁻ⁿ is a nonlinear vector function expressing (m - n) independent conditions on the m observations.

🔗 Connection to the A-model

The nonlinear B-model and A-model are related by: Bᵀ(A(x)) = 0 for all x

Taking partial derivatives with respect to x and applying the chain rule: [∂Bᵀ/∂y]·[∂A/∂x] = 0

This is the linearized version of the relationship.
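
A toy check of this relationship on an invented model (not from the text): A(x) maps (x₁, x₂) to (x₁², x₂², x₁x₂) and the single condition is B(y) = y₁y₂ − y₃² = 0, so B(A(x)) = 0 for all x and the two Jacobians multiply to zero.

```python
# Chain-rule check on a hypothetical A-model / B-model pair.
import numpy as np

def A(x):                       # nonlinear A-model (hypothetical)
    return np.array([x[0]**2, x[1]**2, x[0]*x[1]])

def dA(x):                      # Jacobian dA/dx
    return np.array([[2*x[0], 0.0],
                     [0.0, 2*x[1]],
                     [x[1], x[0]]])

def dB(y):                      # Jacobian dB/dy of B(y) = y1*y2 - y3^2
    return np.array([y[1], y[0], -2*y[2]])

x0 = np.array([1.3, -0.7])      # arbitrary approximate values
print(np.allclose(dB(A(x0)) @ dA(x0), 0.0))   # True: the linearized relation holds
```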

🔗 Linearized B-model

Pre-multiplying the linearized A-model by ∂Bᵀ/∂y evaluated at y₀ gives: [∂Bᵀ/∂y at y₀]·Δy = 0

This is the linearized B-model.

Practical note: The iteration process for the nonlinear B-model is more involved than for the A-model and is not detailed in this excerpt.

📝 Example: Parabola fitting

Nonlinear A-model: Observe (xᵢ, yᵢ) coordinates of n points on a parabola y = ax² + b. The model has 2n observation equations in (n + 2) unknowns (x₁, ..., xₙ, a, b).

Redundancy: 2n - (n + 2) = n - 2, so (n - 2) independent condition equations exist.

Nonlinear B-model (condition equations): yᵢ - a·xᵢ² - b = yⱼ - a·xⱼ² - b for all pairs i, j

Or equivalently: (yᵢ - yⱼ) - a·(xᵢ² - xⱼ²) = 0 for i = 3, ..., n (with j = 1 or 2 as reference).

Linearized B-model: Apply Taylor expansion to these conditions around approximate values (y₀ᵢ, a⁰, b⁰).

📝 Example: Equilateral triangle

Nonlinear A-model: Observe (xᵢ, yᵢ) for i = 1, 2, 3 (vertices of an equilateral triangle). Unknowns: position (x₁, y₁), orientation α, and side length a. The model has 6 observation equations in 4 unknowns.

Redundancy: 6 - 4 = 2, so 2 independent condition equations.

Nonlinear B-model (condition equations): The three sides must be equal:
√[(x₂ - x₁)² + (y₂ - y₁)²] = √[(x₃ - x₂)² + (y₃ - y₂)²]
√[(x₂ - x₁)² + (y₂ - y₁)²] = √[(x₁ - x₃)² + (y₁ - y₃)²]

Linearized B-model: Linearize these conditions around approximate values computed from the observations (e.g., using the observed coordinates to estimate position, orientation, and side length).

Verification note: The excerpt states that ∂Bᵀ/∂y · ∂A/∂x = 0 should hold if the approximate values satisfy the nonlinear condition equations.

34

7.2 The Nonlinear B-Model and Its Linearization

7.2 The nonlinear B-Model and its linearization

🧭 Overview

🧠 One-sentence thesis

The nonlinear B-model (condition equations) can be linearized through partial derivatives and the chain rule, allowing standard least-squares estimation to be applied iteratively, just as with the nonlinear A-model.

📌 Key points (3–5)

  • Nonlinearity is typical: most geodetic applications use nonlinear condition equations, not linear ones.
  • Relationship to A-model: the nonlinear B-model is derived from the nonlinear A-model through a nonlinear generalization; linearization uses the chain rule.
  • Linearization enables estimation: once linearized, the B-model can be solved using standard least-squares formulas.
  • Iteration is required: solving the nonlinear B-model needs an iterative process (though the excerpt does not detail it).
  • Common confusion: the linearized B-model is not the same as the original nonlinear B-model; it is an approximation obtained by taking partial derivatives around approximate values.

🔗 The nonlinear B-model and its connection to the A-model

🔗 Nonlinear B-model definition

The nonlinear B-model: B*(y) = 0, where B*(·) is a nonlinear vector function from ℝᵐ into ℝᵐ⁻ⁿ.

  • This is the model of condition equations in nonlinear form.
  • The excerpt emphasizes that "there are very few geodetic applications for which the model of condition equations is linear."
  • In most practical cases, the condition equations are nonlinear.

🔄 Relationship to the nonlinear A-model

  • The nonlinear B-model is related to the nonlinear A-model by:
    • B*(y) = 0 is the nonlinear generalization of the linear relation (equation 7 from section 3.1).
  • The A-model describes observation equations; the B-model describes condition equations.
  • The two models are connected through a transformation that generalizes the linear case.

🧮 Linearization process

🧮 Applying the chain rule

  • To linearize, take the partial derivative with respect to x of the relation between B-model and A-model.
  • Apply the chain rule (kettingregel in Dutch).
  • The result is the linearized version of the nonlinear relation.
  • This linearized relation is then used to construct the linearized B-model from the linearized A-model.

📐 Constructing the linearized B-model

  • Start with the linearized A-model (equation 20 from earlier sections).
  • Pre-multiply the linearized A-model by a specific matrix.
  • Combine with the linearized relation (equation 38) to obtain:
    • x - x₀ = ... B(y₀) (equation 39).
  • This is the linearized B-model.

🔁 Iteration requirement

  • Once the linearized B-model is obtained, standard least-squares estimation formulas can be applied.
  • However, solving the nonlinear B-model requires an iteration process, similar to the nonlinear A-model.
  • The excerpt notes that "the iteration process needed for solving the nonlinear B-model is quite involved" and is not discussed in detail.

📚 Worked examples

📚 Example 14: Nonlinear A-model with redundancy

  • Setup: A nonlinear A-model with 2n observation equations in (n + 2) unknowns.
  • Redundancy: 2n - (n + 2) = n - 2.
  • Therefore, (n - 2) independent condition equations can be formulated.

Nonlinear condition equations:

  • The excerpt gives specific nonlinear conditions (equation 41).
  • These are the nonlinear B-model conditions.

Linearization:

  • Linearize the nonlinear condition equations (41) to obtain equation (42).
  • This is the linearized B-model.
  • The linearized A-model (equation 43) is also provided.
  • Verification step: The excerpt asks the reader to verify that a specific relationship holds between the linearized models.

📚 Example 15: Equilateral triangle problem

  • Setup: A nonlinear A-model (from example 12) with 6 observation equations in 4 unknown parameters (x₁, y₁, a, α).
  • Redundancy: 6 - 4 = 2.
  • Therefore, two independent condition equations can be formulated.

Nonlinear condition equations (equation 45):

  • These conditions state that the sides of the equilateral triangle should be equal.
  • This is a geometric constraint expressed as condition equations.

Linearization (equation 46):

  • Linearize the nonlinear conditions (45).
  • Important note: The excerpt states that a specific relationship holds "if the approximate values used in (46) satisfy the nonlinear condition equations."
  • This means the linearization is valid around approximate values that already satisfy the constraints.

Don't confuse:

  • The condition equations express geometric constraints (equal sides), not observation equations.
  • The linearized version is an approximation valid near the approximate values, not an exact representation.

🔍 Key distinctions

| Aspect | Nonlinear B-model | Linearized B-model |
|---|---|---|
| Form | B*(y) = 0 (nonlinear vector function) | Linear approximation using partial derivatives |
| Solution method | Requires iteration (complex, not detailed) | Standard least-squares formulas apply |
| Validity | Exact representation of constraints | Approximation valid near approximate values |
| Relationship to A-model | Nonlinear generalization of the linear relation | Derived by pre-multiplying the linearized A-model |

Common confusion: The linearized B-model is not simply a "linear B-model"; it is the result of linearizing a nonlinear model around approximate values, and it requires iteration to converge to the true solution.