Linear Regression Using R An Introduction to Data Modeling


1.1 What is a Linear Regression Model?

🧭 Overview

🧠 One-sentence thesis

Linear regression modeling uses measured input parameters from multiple systems to build a mathematical function that predicts output performance, revealing which inputs matter most and enabling predictions for unmeasured configurations.

📌 Key points (3–5)

  • What regression modeling does: finds a mathematical function that describes the relationship between input parameters (independent variables) and output (dependent variable/response).
  • Linear combination constraint: the model is restricted to a linear combination of input parameters, though the parameters themselves need not be linear.
  • Discovery of importance: the modeling process reveals which inputs heavily influence the output and which have little or no impact.
  • Prediction capability: once developed, the model can predict performance for new systems with input values not present in the original measurements.
  • Common confusion: the model is not the real system—the real system always produces correct results regardless of what the model predicts; a model is a useful tool, not reality.

📊 Data structure and terminology

📊 Organizing measurements into observations

  • Performance measurements from multiple systems are organized into a table with n rows (one per system) and k columns (for input parameters).
  • Each row is called a single observation.
  • Example structure from the excerpt:
    • System index (1 to n)
    • Input parameters: Clock (MHz), Cache (kB), Transistors (M)
    • Output: Performance

🔤 Key terminology

Independent variables: the input parameters whose values are set by the experimenter or determined by system configuration.

Dependent variable or response: the output value (performance) measured for the system.

  • The goal is to use measurements of the k independent variables to determine a function f() that relates inputs to output.
  • Example from the excerpt: performance = f(Clock, Cache, Transistors)
  • Don't confuse: "independent" means the experimenter controls these values; "dependent" means this value depends on (responds to) the independent variables.

🎯 What the model reveals

🎯 Discovering input importance

  • The modeling process shows how important each input is in determining the output.
  • Example scenario from the excerpt: performance might be heavily dependent on clock frequency, while cache size and transistor count are much less important.
  • Some inputs may have essentially no impact on the output, making them unnecessary to include in the model.

🔮 Predicting unmeasured systems

  • Once the model is developed, it can predict performance for systems with input values that did not exist in the original measured set.
  • The excerpt shows three new systems (n+1, n+2, n+3) with different input combinations where performance is unknown (marked with "?").
  • The regression model fills in these question marks by applying the function learned from the original n systems.

🧮 Linear combination approach

🧮 What "linear combination" means

  • The function is a linear combination of the input parameters.
  • This is a common and powerful approach in regression modeling, sufficient for most systems likely to be encountered.
  • Important clarification from the excerpt: while the function is a linear combination of the input parameters, the parameters themselves do not need to be linear.
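
In equation form, a linear combination of k input parameters can be written (using the notation introduced later for the single-predictor model) as ŷ = a₀ + a₁x₁ + a₂x₂ + … + aₖxₖ, where each aᵢ is a coefficient determined during model development. The clarification above means an individual xᵢ could itself be a nonlinear transformation of a measured input (for example, its square or logarithm) while the model remains a linear combination of terms.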

⚖️ Automatic scaling

  • Because the model is a linear combination, the values of model parameters are automatically scaled during development.
  • Consequence: the units used for inputs and output are arbitrary.
  • You can rescale input and output values before modeling and still produce an equivalent model.

⚠️ Model vs. reality

⚠️ The fundamental distinction

  • What you develop is just a model, not the real system.
  • The model is hopefully useful for:
    • Understanding the system
    • Predicting future results
  • Critical reminder from the excerpt: do not confuse a model with the real system.
  • The real system will always produce the correct results, regardless of what the model may say the results should be.
  • Example implication: if the model predicts one value but the real system produces another, the real system is correct—the model is an approximation.

1.2 What is R?

🧭 Overview

🧠 One-sentence thesis

R provides a complete interactive environment for statistical computing that combines built-in functions, programming capabilities, and excellent graphical tools—all available as free, open-source software.

📌 Key points (3–5)

  • What R is: a computer language developed specifically for statistical computing, offering both direct function use and custom programming.
  • Key technical feature: object-oriented language using vectors and matrices as basic operands, enabling large-scale data operations with minimal code.
  • Graphical strength: provides excellent tools for producing complex plots relatively easily.
  • Common confusion: R is not just a programming language—it is a complete interactive environment where you can process data directly without writing full programs.
  • Accessibility: free and open-source, developed by volunteers through the R Project.

💻 R as an interactive environment

💻 More than a language

  • R is described as "more than" just a computer language—it is a complete environment for interacting with data.
  • You can use built-in functions directly without writing a complete program.
  • You can also write custom programs for tasks without built-in functions or for repeating operations multiple times.
  • Don't confuse: R is not only for programmers; it supports both interactive use (calling functions one at a time) and traditional programming (writing scripts).

🎯 Direct interaction example

The excerpt shows a simple R session:

  • x <- c(2,4,6,8,10,12,14,16) concatenates values into a vector and assigns it to variable x.
  • Typing x alone prints the vector contents.
  • mean(x) computes the arithmetic mean (result: 9).
  • var(x) computes the variance (result: 24).
  • The > character indicates R is waiting for input.
  • The [1] notation shows the first row of the matrix (R treats vectors as single-row matrices).
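
Put together, the session described above looks like this at the R console (the > prompt and the [1] prefixes are printed by R):

> x <- c(2, 4, 6, 8, 10, 12, 14, 16)
> x
[1]  2  4  6  8 10 12 14 16
> mean(x)
[1] 9
> var(x)
[1] 24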

🧮 Technical architecture

🧮 Object-oriented with vectors and matrices

R is an object-oriented language that uses vectors and matrices as its basic operands.

  • This design makes R "quite useful for working on large sets of data using only a few lines of code."
  • The vector/matrix foundation means operations naturally scale to large datasets.
  • Example: in the interaction shown, a single mean(x) call processes an entire vector without loops.

📊 Graphical capabilities

  • R provides "excellent graphical tools for producing complex plots relatively easily."
  • The excerpt emphasizes both complexity (what you can create) and ease (how simply you can create it).

🌐 Availability and learning approach

🌐 Open-source and free

  • R is free software.
  • It is an open-source project developed by many volunteers.
  • You can learn about its history and download it from the R Project website.

📖 How this book teaches R

  • The book does not aim to make you an expert R programmer.
  • Focus: using R as a computing environment for statistical analysis, not teaching language syntax/semantics directly.
  • Approach: introduce R features as needed for specific regression modeling steps.
  • Assumption: readers already have some programming expertise to follow examples, but do not need to be expert programmers.
  • Rationale: developing good regression models is an interactive process requiring you to "dig in and play around" with data and models.

1.3 What’s Next?

🧭 Overview

🧠 One-sentence thesis

The book follows a structured path from understanding data quality through building simple and multiple regression models to making predictions and reading data files.

📌 Key points (3–5)

  • Chapter sequence: the tutorial progresses from data understanding (Ch. 2) → simple regression (Ch. 3) → multiple regression (Ch. 4) → prediction (Ch. 5) → data import details (Ch. 6) → summary and further reading (Ch. 7) → hands-on experiments (Ch. 8).
  • Why data quality comes first: flawed data produces flawed models ("garbage in, garbage out"), so Chapter 2 focuses on understanding and verifying data before modeling begins.
  • Missing values are normal: large datasets often have incomplete cells (marked NA in R); R functions can handle them, but sometimes you must explicitly tell functions to ignore NA values.
  • Common confusion: missing values may result from errors (forgotten entries) or legitimate reasons (e.g., a processor without an L2 cache parameter).

📚 Book roadmap

📖 Core chapters and their roles

| Chapter | Focus | Purpose |
| --- | --- | --- |
| 2 | Understand Your Data | Introduces sample data and how to read it into R; emphasizes data quality verification |
| 3 | Simple regression | Develops the simplest regression model with a single independent variable |
| 4 | Multiple regression | Explains building a more complex model with multiple independent input variables |
| 5 | Prediction | Shows how to use the MLR model to predict system response from new input data |
| 6 | Data import details | Explains in more detail the routines for reading data files into R |
| 7 | Summary + further reading | Summarizes the MLR modeling process and suggests additional resources |
| 8 | Experiments | Provides hands-on experiments to expand understanding of the modeling process |

🎯 The interactive learning approach

  • The book does not aim to make you an expert R programmer.
  • Instead, it treats R as a computing environment for statistical analysis rather than a programming language to master.
  • The tutorial introduces R concepts as you need them for specific modeling steps.
  • Assumption: you already have some programming expertise to follow examples, but you do not need to be an expert.

🗑️ Data quality fundamentals

🗑️ Garbage in, garbage out

Good data is the basis of any regression model, because we use this data to actually construct the model. If the data is flawed, the model will be flawed.

  • The first step in regression modeling is ensuring your data is reliable.
  • No universal approach exists for verifying data quality; the method depends on data provenance.
  • If you collect data yourself: you know its origin and can verify collection methods.
  • If you obtain data from elsewhere: you depend on the source's reliability and must verify correctness as much as possible.

❓ Missing values (NA)

Missing values: cells without values in your data table, indicated by NA (Not Available) in R.

Why values go missing:

  • Errors: the experimenter forgot to fill in a particular entry.
  • Legitimate absence: the system configuration did not have that parameter available.
    • Example: not every processor tested in the sample data had an L2 cache, so those cells are legitimately missing.

How R handles missing values:

  • R is designed to gracefully handle NA values.
  • Most R functions appropriately ignore NA values and still compute the desired result.
  • Don't confuse: sometimes you must explicitly tell a function to ignore NA values (e.g., the mean() function with an input vector containing NA requires an explicit instruction).


2.1 Missing Values

🧭 Overview

🧠 One-sentence thesis

R is designed to handle missing data gracefully by marking absent values as NA, which most functions can ignore automatically or with explicit instructions.

📌 Key points (3–5)

  • Why values go missing: errors (e.g., forgetting to fill in an entry) or legitimate absence (e.g., a system configuration lacking a particular parameter).
  • How R represents missing data: uses the notation NA (Not Available) to mark cells without values.
  • How R functions handle NA: most functions ignore NA values automatically, but some require an explicit instruction to do so.
  • Common confusion: calling mean() on a vector with NA returns NA by default; you must use na.rm=TRUE to compute the mean while ignoring missing values.

🗂️ Why data has missing values

🛠️ Errors during data collection

  • An experimenter may simply forget to fill in a particular entry.
  • This is a human mistake, not a reflection of the system being measured.

🔧 Legitimate absence of parameters

  • Some system configurations may not have certain parameters available.
  • Example: Not every processor tested in the example data had an L2 cache, so those cells would naturally be empty.
  • Don't confuse: missing because of an error vs. missing because the parameter does not exist for that configuration.

🔤 How R marks and handles missing values

🔤 The NA notation

NA: R's notation to indicate that the corresponding value is Not Available.

  • R uses NA to explicitly mark cells without values in a data table.
  • This is not the same as zero or an empty string; it is a special marker for "no data here."

⚙️ Automatic handling by R functions

  • Most functions in R are written to appropriately ignore NA values and still compute the desired result.
  • This means R is designed to work with incomplete data without crashing or producing nonsensical results.

🛑 When explicit instructions are needed

  • Sometimes you must explicitly tell a function to ignore NA values.
  • Example: Calling mean() on an input vector that contains NA values causes it to return NA as the result.
  • To compute the mean while ignoring the NA values, you must use mean(x, na.rm=TRUE).
    • na.rm=TRUE is the explicit instruction to remove NA values before computing.
  • Don't confuse: the default behavior (return NA if any NA is present) vs. the explicit removal behavior (ignore NA and compute from the remaining values).
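
A small illustration with made-up values:

x <- c(2, 4, NA, 8)
mean(x)              # returns NA because the vector contains a missing value
mean(x, na.rm=TRUE)  # ignores the NA and returns 4.666667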

🧪 Practical implications

🧪 Working with incomplete data

  • Any large collection of data is probably incomplete—missing values are expected, not exceptional.
  • R's design allows you to proceed with analysis even when some cells are empty.
  • You do not need to manually fill in or delete every missing value before starting your work; R can handle them during computation.

📋 Checking function behavior

  • When using a function on data with NA values, check whether it returns NA or ignores the missing values.
  • If the result is NA and you want to ignore missing values, look for a parameter like na.rm=TRUE in the function's documentation.

2.2 Sanity Checking and Data Cleaning

🧭 Overview

🧠 One-sentence thesis

Sanity checking and data cleaning are essential steps to verify data validity and prepare it for modeling, but analysts must balance removing obvious errors with preserving potentially interesting anomalies.

📌 Key points (3–5)

  • What sanity checking involves: examining minimum and maximum values of key parameters and using visual plots to detect obvious flaws.
  • When to remove data: delete results only when errors are obvious (e.g., values hundreds of times larger than others, likely transcription errors), but avoid discarding strange-looking data without good justification.
  • Common confusion: strange data vs. flawed data—sometimes the most interesting conclusions come from data that initially appears flawed but actually reveals an unsuspected phenomenon.
  • Using domain knowledge: apply your understanding of the system to inform model building (e.g., expecting clock rate to be key in computer performance models).
  • Healthy skepticism: it is impossible to prove data is flawless, so always evaluate regression results critically and trust your intuition if something feels wrong.

🔍 Initial verification steps

🔍 Checking key parameters

  • Examine the minimum and maximum values of key input parameters (columns) in your data.
  • Look for anything that appears obviously wrong.
  • This is a quick first pass to catch drastic flaws.
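
For example, a quick numerical check on one column might look like this (a sketch using the int00.dat data frame introduced later; na.rm=TRUE guards against missing values):

min(int00.dat$clock, na.rm=TRUE)   # smallest clock frequency in the data
max(int00.dat$clock, na.rm=TRUE)   # largest clock frequency in the data
summary(int00.dat$clock)           # min, quartiles, mean, and max in one call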

📊 Visual inspection

  • R provides plotting functions to quickly obtain a visual indication of key relationships in your data set.
  • Visual plots help identify patterns and outliers that numeric summaries might miss.
  • The excerpt notes that examples of these functions appear in Section 3.1.

🧹 Deciding what to clean

🧹 When to delete data

Scenario for deletion:

  • You find that performance reported for a few system configurations is hundreds of times larger than all other systems tested.
  • Although the data could be correct, it seems more likely to be a transcription error.
  • In such cases, you may decide to delete those results.

Important caution:

It is important not to throw out data that looks strange without good justification.

  • Sometimes the most interesting conclusions come from data that on first glance appeared flawed.
  • Strange-looking data may actually be hiding an interesting and unsuspected phenomenon.
  • Don't confuse: obviously erroneous data (e.g., clear transcription mistakes) vs. unexpected but valid data that reveals new insights.

📝 What data cleaning means

Data cleaning: the process of checking your data and putting it into the proper format.

  • This encompasses both verification and transformation steps.
  • It prepares the data for modeling by ensuring quality and consistency.

🧠 Applying domain knowledge

🧠 Using system understanding

  • Always use your knowledge of the system and the relationships between inputs and outputs to inform model building.
  • Domain expertise helps you set expectations and validate model results.

⚙️ Example: computer performance modeling

What domain knowledge suggests:

  • From experience, clock rate is expected to be a key parameter in any regression model of computer systems performance.
  • Analysts should make sure models include the clock parameter.

What to do if the model disagrees:

  • If the modeling methodology suggests that clock is not important, carefully consider whether the model used is appropriate.
  • The mismatch between domain knowledge and model results is a red flag.
  • Example: deeper insights about cache effects on system performance should inform how you develop more complex models (covered in Chapter 4).

✅ Building confidence through sanity checks

  • These types of checks help you feel more comfortable that your data is valid.
  • They provide a reality check against your understanding of the system.

⚠️ Maintaining healthy skepticism

⚠️ Limits of verification

Key limitation:

It is impossible to prove that your data is flawless.

  • No amount of checking can guarantee perfect data.
  • Always maintain awareness of this fundamental uncertainty.

🤔 Trusting your intuition

How to evaluate results:

  • Look at any regression modeling results with a healthy dose of skepticism.
  • Think carefully about whether the results make sense.
  • Trust your intuition: if the results don't feel right, there is quite possibly a problem lurking somewhere in the data or in your analysis.

Don't confuse:

  • Statistical significance vs. practical sense—even if a model is statistically valid, it must also align with domain understanding.

📝 Note on terminology

The excerpt includes a note that the word "significant" will be avoided in the tutorial because:

  • It is overused in statistics.
  • It has both too much weight and not enough information to be useful.

📂 Context: the example data

📂 Data source

  • The input data comes from the publicly available CPU DB database.
  • Contains design characteristics and measured performance results for a large collection of commercial processors.
  • Data was collected over many years using a common format and standardized parameters.

📊 Data scope

  • The particular version used contains information on 1,525 processors.
  • Many parameters (columns) are useful for understanding and comparing processor performance.
  • Not all parameters will be useful as predictors in regression models.
  • Example: some parameters like "Instruction set width" are mentioned as potentially not useful (the excerpt cuts off here).

2.3 The Example Data

🧭 Overview

🧠 One-sentence thesis

The CPU DB database provides processor design characteristics and performance measurements that must be carefully filtered and validated to select useful predictors for regression modeling.

📌 Key points (3–5)

  • Data source: the CPU DB database contains 1,525 commercial processors with design parameters and performance measurements collected over many years.
  • Parameter selection: not all database columns are useful—some lack data for many processors, others don't distinguish among processors, so elimination is necessary.
  • Key predictor candidates: clock frequency, parallelism parameters (threads/cores), technology parameters (transistors, die size, gate delays), and memory parameters (cache sizes) are likely important based on processor design knowledge.
  • Performance metric: SPEC CPU integer and floating-point benchmark scores from 1992, 1995, 2000, and 2006 serve as the regression model output.
  • Common confusion: although the database has 1,500+ processor configurations, each individual benchmark has far fewer results because processors were only tested with benchmarks current at their introduction.

📊 The CPU DB database structure

📊 What the database contains

The CPU DB database: a publicly available collection of design characteristics and measured performance results for commercial processors, organized in a common format with standardized parameters.

  • Contains information on 1,525 processors.
  • Data collected over many years.
  • Many parameters (columns) describe different aspects of processor design and performance.

🗂️ Parameter categories in the database

The database organizes processor information into several types of parameters:

| Parameter category | Examples | Purpose |
| --- | --- | --- |
| Design characteristics | Instruction set width, Processor family | Describe processor architecture |
| Parallelism | Number of threads, number of cores | Indicate parallel processing capability |
| Technology | Transistors, die size, feature size, channel length, FO4 delay | Reflect fabrication technology and logic complexity |
| Memory | L1 instruction cache, L1 data cache, L2 cache, L3 cache | Describe memory hierarchy |
| Performance | SPEC CPU benchmark scores | Measured performance results |

🔍 Selecting useful predictors

❌ Parameters to eliminate

Not all database columns are suitable for regression modeling:

  • Missing data: some parameters (e.g., Instruction set width) are not available for many processors.
  • Lack of distinction: some parameters (e.g., Processor family) are common among several processors and don't provide useful information for distinguishing among them.
  • These columns should be eliminated as possible predictors.

✅ Parameters to keep

⚡ Clock frequency

  • Has a large influence on performance based on knowledge of processor design.
  • Should definitely be retained as a predictor candidate.

🔀 Parallelism parameters

  • Number of threads and cores.
  • Could be important drivers of performance.
  • Should be kept available for possible inclusion in the regression model.

🔬 Technology parameters

Technology-related parameters: those directly determined by the particular fabrication technology used to build the processor.

  • Size and complexity indicators: number of transistors and die size are rough indicators of the processor's logic size and complexity.
  • Gate delay indicators: feature size, channel length, and FO4 (fanout-of-four) delay are related to gate delays in the processor's logic.
  • Why they matter: these parameters have a direct effect on how much processing can be done per clock cycle and affect the critical path delays.
  • At least some of these parameters could be important in a regression model describing performance.

💾 Memory parameters

  • Separate L1 instruction and data cache sizes.
  • Unified L2 and L3 cache sizes.
  • Why they matter: memory delays are critical to a processor's performance.
  • All memory-related parameters have the potential for being important in the regression models.

📈 Performance metric

🎯 SPEC CPU benchmark scores

The performance metric: the score obtained from the SPEC CPU integer and floating-point benchmark programs from 1992, 1995, 2000, and 2006.

  • This performance result will be the regression model's output (the thing being predicted).
  • Two types: integer benchmark scores and floating-point benchmark scores.
  • Four generations: 1992, 1995, 2000, and 2006 benchmark sets.

⚠️ Data availability limitation

  • Performance results are not available for every processor running every benchmark.
  • Most processors have performance results only for those benchmark sets that were current when the processor was introduced into the market.
  • Don't confuse: although there are more than 1,500 lines in the database (representing more than 1,500 unique processor configurations), a much smaller number of results are reported for each individual benchmark.
  • Example: a processor introduced in 1995 likely has only 1995 benchmark results, not 2000 or 2006 results.

🗄️ Data frames in R

🗄️ What a data frame is

Data frame: the fundamental object used for storing tables of data in R.

  • Think of it as a large table with:
    • A row for each system measured.
    • A column for each parameter.
  • Key feature: all columns do not need to be the same data type—some may be numerical, others textual.
  • This feature is useful when manipulating large, heterogeneous data files.

📥 Reading the CPU DB data

  • R has built-in functions for reading data from CSV (comma separated values) format files and organizing the data into data frames.
  • The excerpt mentions a custom function extract_data() written specifically for reading the CPU DB file.
  • After reading, the data is organized into eight data frames:
    • int92.dat, fp92.dat (1992 integer and floating-point benchmarks)
    • int95.dat, fp95.dat (1995 benchmarks)
    • int00.dat, fp00.dat (2000 benchmarks)
    • int06.dat, fp06.dat (2006 benchmarks)
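
For reference, R's generic CSV reader looks roughly like this (the file name here is hypothetical; the book's extract_data() function wraps this kind of call with CPU DB-specific processing):

raw <- read.csv("cpu-db.csv", header=TRUE)   # read the file into a data frame
head(raw)                                    # quick look at the first few rows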

2.4 Data Frames

🧭 Overview

🧠 One-sentence thesis

Data frames in R organize heterogeneous tabular data—rows for observations and columns for parameters—and provide flexible access methods that make working with large, mixed-type datasets practical.

📌 Key points (3–5)

  • What a data frame is: R's fundamental object for storing tables of data, with rows for individual observations (e.g., systems measured) and columns for parameters.
  • Heterogeneous columns: columns in a data frame can have different data types (numerical, textual, etc.), which is useful for real-world datasets.
  • Accessing data: square brackets [row, column] let you retrieve specific cells, entire rows, or entire columns; you can use numeric indices or quoted names.
  • Common confusion: row name vs row position—sorting changes position but not the name label, so int92.dat["71",...] always refers to the processor labeled "71" regardless of its position.
  • Practical workflow: use head() and tail() to preview large data frames instead of printing the entire table, which can overwhelm the console.

📦 What data frames are

📦 Definition and structure

Data frame: the fundamental object in R for storing tables of data.

  • Think of it as a large table:
    • Each row represents one observation (e.g., one processor configuration).
    • Each column represents one parameter (e.g., clock speed, performance score).
  • The excerpt emphasizes that data frames organize data into this row-by-column layout.

🎨 Heterogeneous data types

  • An "interesting and useful feature" is that columns do not need to be the same data type.
  • Some columns may be numerical (e.g., clock frequency), others textual (e.g., processor names).
  • This flexibility is "quite useful when manipulating large, heterogeneous data files."
  • Example: a dataset with numeric performance scores, text labels for processor models, and categorical fabrication sizes can all coexist in one data frame.
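
A tiny hypothetical data frame illustrating mixed column types:

df <- data.frame(model=c("cpuA", "cpuB"),   # text column
                 clock=c(200, 300),         # numeric column (MHz)
                 perf=c(90.5, 135.2))       # numeric column
str(df)   # shows the data type of each column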

📥 Loading data into R

📥 Reading CSV files

  • R has built-in functions for reading csv (comma separated values) files and organizing them into data frames.
  • The excerpt notes that "the specifics of this reading process can get a little messy, depending on how the data is organized in the file."
  • For the CPU DB dataset, a custom function extract_data() is provided to handle the specifics (details deferred to Chapter 6).

📥 The CPU DB example

  • After loading the read-data.R file, eight new data frames appear in the R workspace:
    • int92.dat, fp92.dat, int95.dat, fp95.dat, int00.dat, fp00.dat, int06.dat, fp06.dat.
  • Each data frame contains processor data for a specific benchmark:
    • int92.dat: processors with SPEC Integer 1992 (Int1992) benchmark results.
    • fp92.dat: processors with Floating-Point 1992 (Fp1992) benchmark results, and so on.
  • The .dat suffix indicates the variable is a data frame.
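
A minimal sketch of loading that script (assuming read-data.R is in the current working directory):

source("read-data.R")   # runs the script, which builds the eight data frames
ls()                    # the workspace should now list int92.dat, fp92.dat, ..., int06.dat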

📥 Columns in the CPU DB data frames

The excerpt provides a table of 17 columns (Table 2.1). Key columns include:

| Column number | Column name | Definition |
| --- | --- | --- |
| 1 | (blank) | Processor index number |
| 2 | nperf | Normalized performance |
| 3 | perf | SPEC performance |
| 4 | clock | Clock frequency (MHz) |
| 5 | threads | Number of hardware threads |
| 6 | cores | Number of hardware cores |
| 7 | TDP | Thermal design power |
| 8 | transistors | Number of transistors (M) |
| 9 | dieSize | Chip size |
| 10 | voltage | Nominal operating voltage |
| 11 | featureSize | Fabrication feature size |
| 12 | channel | Fabrication channel size |
| 13 | FO4delay | Fan-out-four delay |
| 14–17 | L1icache, L1dcache, L2cache, L3cache | Cache sizes at different levels |
  • The first column (processor index) is used for display and identification but is not part of the data frame before import.

🔍 Accessing data frame elements

🔍 Cell access by position

  • Use square brackets [row, column] to identify a specific cell.
  • Example: int92.dat[15,12] returns the value in row 15, column 12 (which is 180 in the excerpt's example).

🔍 Cell access by name

  • You can also use quoted names for rows and columns.
  • Example: int92.dat["71","perf"] returns the perf value for the row labeled "71" (which is 105.1).
  • Don't confuse: this is the row labeled "71," not necessarily the 71st row in the table.
    • If the data is sorted, the position changes but the row name stays the same.
    • Using names makes your code robust to reordering.

🔍 Accessing entire rows or columns

  • Entire column: leave the row parameter empty.
    • Example: int92.dat[,"clock"] prints all clock frequency values.
  • Entire row: leave the column parameter empty.
    • Example: int92.dat[36,] prints all parameter values for the processor in row 36.

🔍 Utility functions

  • nrow(): returns the number of rows in the data frame.
  • ncol(): returns the number of columns.
  • These help you understand the size of your dataset programmatically.

🖥️ Viewing data frames safely

🖥️ The problem with printing large tables

  • Simply typing the data frame name (e.g., int92.dat) prints the entire table to the console.
  • Caution: for large datasets, this can overwhelm the console and is generally not recommended.
  • The excerpt shows a truncated example of what happens when you print int92.dat:
    • First row is the header with column names.
    • Each subsequent row is one processor's data.
    • The output is cut off to fit the page.

🖥️ Recommended preview functions

  • head(int92.dat): prints the header and the first few rows.
    • Gives you a quick glance at the structure and initial data.
  • tail(int92.dat): prints the header and the last few rows.
    • Useful for checking the end of the dataset.
  • These functions are "highly recommended" for interactive data exploration.

2.5 Accessing a Data Frame

🧭 Overview

🧠 One-sentence thesis

R provides multiple notations—square brackets with row/column indices or names, and the dollar-sign shorthand—to access individual cells, entire rows, entire columns, and to apply functions across data frame columns.

📌 Key points (3–5)

  • Square-bracket notation: use [row, column] to access specific cells by numeric index or by quoted name; leave one parameter empty to get entire rows or columns.
  • Row names vs row numbers: a row's name (label) stays with the data even if the data frame is sorted, but the numeric row position may change.
  • Dollar-sign shorthand: dataframe$columnname is a simpler way to access a column when doing interactive computation.
  • Common confusion: int92.dat["71","perf"] accesses the row labeled "71", not the 71st row by position.
  • Built-in functions on columns: functions like min(), max(), mean(), and sd() can operate directly on a column vector extracted from a data frame.

🔢 Square-bracket indexing

🔢 Accessing a single cell by position

  • Syntax: dataframe[row_number, column_number]
  • Example: int92.dat[15,12] returns the value 180 from row 15, column 12.
  • Both row and column are specified numerically (1-indexed).

🏷️ Accessing a cell by name

  • Syntax: dataframe["row_name", "column_name"]
  • Example: int92.dat["71","perf"] returns 105.1 from the row labeled "71" in the column labeled "perf".
  • Important distinction: the row name is a label attached to the observation, not the numeric position.
    • If the data frame is sorted, the row name travels with the data, but the 71st row by position might contain different data.
    • Don't confuse: ["71",...] is not the same as [71,...].

📋 Accessing entire rows or columns

  • Entire column: leave the row parameter empty: dataframe[, "column_name"]
    • Example: int92.dat[,"clock"] prints all values in the "clock" column: 100 125 166 175 190 ...
  • Entire row: leave the column parameter empty: dataframe[row_number, ]
    • Example: int92.dat[36,] prints all parameter values for row 36; the output starts with the column headers (nperf, perf, clock, threads, cores, ...) followed by the row label 36 and its values (13.07378, 79.86399, 80, 1, 1, ...).

📏 Counting rows and columns

  • nrow(dataframe) returns the number of rows.
    • Example: nrow(int92.dat) returns 78.
  • ncol(dataframe) returns the number of columns.
    • Example: ncol(int92.dat) returns 16.

🧮 Applying functions to columns

🧮 Built-in functions on vectors

  • R functions can operate on a vector of any length, so you can pass an entire column to a function.
  • The excerpt demonstrates four summary functions applied to the "perf" column:
| Function | Syntax | Result |
| --- | --- | --- |
| Minimum | min(int92.dat[,"perf"]) | 36.7 |
| Maximum | max(int92.dat[,"perf"]) | 366.857 |
| Mean | mean(int92.dat[,"perf"]) | 124.2859 |
| Standard deviation | sd(int92.dat[,"perf"]) | 78.0974 |
  • These functions compute useful summary statistics quickly because they accept the entire column vector as input.

💲 Dollar-sign shorthand

💲 Simpler column access

  • Syntax: dataframe$columnname
  • This notation is easier to type and read when doing interactive computation in R.
  • Example: int92.dat$perf accesses the "perf" column without needing square brackets or quotes.

💲 Equivalent expressions

  • The dollar-sign notation produces the same results as square-bracket notation:
    • min(int92.dat$perf) returns 36.7 (same as min(int92.dat[,"perf"])).
    • max(int92.dat$perf) returns 366.857.
    • mean(int92.dat$perf) returns 124.2859.
    • sd(int92.dat$perf) returns 78.0974.
  • The excerpt notes that square-bracket notation "can become cumbersome" for substantial interactive work, so the dollar-sign alternative is provided.

💲 When to use which notation

  • Square brackets: more flexible (can access by row and column, by name or number).
  • Dollar sign: more convenient for quickly accessing a single column by name during interactive analysis.
  • Don't confuse: both notations access the same underlying data; the choice is about convenience and readability.
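
One way to confirm that the two notations return the same underlying column (a sketch):

identical(int92.dat[,"perf"], int92.dat$perf)   # TRUE: both extract the same vector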

3.1 Visualize the Data

🧭 Overview

🧠 One-sentence thesis

Before building a regression model, you must visually inspect whether a roughly linear relationship exists between the predictor variable and the output.

📌 Key points (3–5)

  • First step in modeling: determine whether a linear relationship looks present between predictor and output.
  • Use domain knowledge: understanding the subject matter (e.g., computer system design) guides which predictor to examine.
  • Scatter plots reveal patterns: plotting the data shows whether performance increases with the predictor and whether the relationship is roughly linear.
  • Common confusion: a linear relationship does not need to be perfectly linear—some spread or variation around a straight line is expected.
  • What to look for: whether superimposing a straight line on the scatter plot reasonably fits the data points.

📊 Why visualization comes first

📊 The role of visual inspection

  • The excerpt emphasizes that the first step in single-predictor modeling is to check whether a linear relationship "looks" present.
  • You are not yet computing coefficients or formulas; you are using your eyes to assess the pattern.
  • This step prevents building a linear model when the data do not support one.

🧠 Domain-specific knowledge guides choice

  • The excerpt uses computer system performance as an example: domain knowledge tells us that clock frequency strongly influences performance.
  • Consequently, the analyst looks for a roughly linear relationship between clock frequency (predictor) and performance (output).
  • Don't confuse: domain knowledge is not a substitute for visualization—it tells you which predictor to plot, but you still need to see the relationship.

🖼️ Creating the scatter plot in R

🖼️ The plot function

The excerpt shows this R function call:

plot(int00.dat[,"clock"], int00.dat[,"perf"],
     main="Int2000", xlab="Clock", ylab="Performance")
  • First parameter: the x-axis variable (independent variable)—here, the clock column from the int00.dat data frame.
  • Second parameter: the y-axis variable (dependent variable)—here, the perf column.
  • main="Int2000": provides a title for the plot.
  • xlab="Clock" and ylab="Performance": label the x- and y-axes.

📈 What the plot reveals

The resulting scatter plot (Figure 3.1 in the excerpt) shows:

  • Performance tends to increase as clock frequency increases, as expected from domain knowledge.
  • When you superimpose a straight line on the scatter plot, the relationship between predictor (clock frequency) and output (performance) is roughly linear.
  • It is not perfectly linear: as clock frequency increases, there is a larger spread in performance values.

Example: If you plotted clock frequency on the x-axis and performance on the y-axis, you would see data points clustering around an upward-sloping line, but with some scatter—especially at higher clock frequencies.

🔍 Interpreting "roughly linear"

🔍 What "roughly linear" means

A roughly linear relationship: one where a straight line superimposed on the scatter plot reasonably fits the data, even if individual points do not fall exactly on the line.

  • The excerpt explicitly states the relationship is "not perfectly linear."
  • Some variation or spread around the line is normal and expected.
  • The key question is whether a straight line is a reasonable summary of the trend.

⚠️ Observing spread and variation

  • The excerpt notes that "as the clock frequency increases, we see a larger spread in performance values."
  • This means the data points are more scattered (further from a hypothetical straight line) at higher clock frequencies.
  • Don't confuse: increased spread does not automatically disqualify a linear model—it signals that the model may not capture all variation, but a linear trend can still be present.

🚀 Next step after visualization

🚀 Moving to quantification

  • Once the scatter plot shows a roughly linear relationship, the next step is to develop a regression model.
  • The excerpt states: "Our next step is to develop a regression model that will help us quantify the degree of linearity in the relationship between the output and the predictor."
  • Visualization is qualitative ("does it look linear?"); the regression model provides quantitative measures (slope, intercept, fit quality).

3.2 The Linear Model Function

🧭 Overview

🧠 One-sentence thesis

The linear model function in R (lm()) generates a regression model that predicts system behavior by fitting a straight line to measured data using the method of least squares.

📌 Key points (3–5)

  • Purpose of regression models: predict a system's behavior by extrapolating from previously measured output values with known input parameters.
  • How lm() works: computes the y-intercept and slope of a line that most closely fits measured data by minimizing distances between the line and individual data points.
  • Model formula syntax: perf ~ clock means "model performance by clock speed"; the tilde (~) indicates the relationship between predictor and output.
  • Predicted vs actual values: the hat symbol (^) on ŷ indicates a predicted or estimated value, not the actual observed value.
  • Common confusion: the coefficients alone don't tell you model quality—you need additional evaluation (covered in the next section).

📐 The simple linear regression equation

📐 Mathematical form

The simplest regression model is a straight line with the form: ŷ = a₀ + a₁x₁, where x₁ is the input to the system, a₀ is the y-intercept of the line, a₁ is the slope, and ŷ is the output value the model predicts.

  • a₀ (intercept): where the line crosses the y-axis
  • a₁ (slope): how much the output changes for each unit increase in the input
  • x₁: the predictor variable (input)
  • ŷ: the predicted output (not the actual measured value)

🎯 Why the hat matters

  • The hat symbol (^) distinguishes predicted values from actual observed values.
  • The model gives you an estimate; real measurements may differ.
  • Example: if the model predicts ŷ = 500 but you actually measured 520, the difference (20) is the residual.

🔧 Using the lm() function in R

🔧 Basic syntax

The excerpt shows this function call:

int00.lm <- lm(perf ~ clock, data=int00.dat)

What each part means:

  • lm(...): the linear model function
  • perf ~ clock: "model performance by clock speed"—the tilde (~) reads as "by"
  • data=int00.dat: specifies which data frame contains the variables
  • int00.lm: the variable that stores the resulting linear model object

📊 The tilde operator (~)

  • The tilde indicates the relationship between two variables.
  • Read it as "by" or "explained by."
  • Format: output ~ predictor
  • In the example: performance is explained by clock speed.

💾 The linear model object

  • The result is a "linear model object" stored in a variable (here, int00.lm).
  • The excerpt uses the suffix .lm to emphasize that this variable contains a linear model.
  • Typing the variable name by itself prints the function call and the computed coefficients.

🧮 Method of least squares

🧮 How R computes the line

The method of least squares finds the line that most closely fits the measured data by minimizing the distances between the line and the individual data points.

  • R automatically uses this method when you call lm().
  • "Minimizing distances" means finding the line that makes the sum of all residuals as small as possible.
  • You don't need to compute a₀ and a₁ manually; R does it for you.

📏 What "best fit" means

  • Not every data point will lie exactly on the line.
  • The least squares method finds the line that is closest to all points overall.
  • Example: if some points are above the line and some below, the method balances these distances to find the optimal line.

🔍 Interpreting the output

🔍 Reading the coefficients

When you type int00.lm, R prints:

Coefficients:
(Intercept)   clock
51.7871       0.5863
  • (Intercept): a₀ = 51.7871 (the y-intercept)
  • clock: a₁ = 0.5863 (the slope)
  • The final regression model is: predicted performance = 51.7871 + 0.5863 × clock

📈 What the model tells you

  • For every 1-unit increase in clock frequency, performance increases by approximately 0.5863 units.
  • When clock frequency is zero, the model predicts a baseline performance of about 51.79 (though this may not be physically meaningful).
  • Example: if clock = 1000, predicted performance = 51.7871 + 0.5863 × 1000 = 638.0871.
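
The same arithmetic can be done in R: coef() extracts the fitted coefficients, and predict() evaluates the model at new input values (a sketch with a hypothetical clock value; prediction is covered in more detail later in the book):

coef(int00.lm)                                      # fitted intercept and slope
predict(int00.lm, newdata=data.frame(clock=1000))   # roughly 638.09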

⚠️ Limitation: coefficients alone are not enough

  • The excerpt explicitly states: "The information we obtain by typing int00.lm shows us the regression model's basic values, but does not tell us anything about the model's quality."
  • You need additional evaluation (like summary()) to determine how well the data fit the model.
  • Don't confuse: having a line does not mean the line is a good predictor.

📊 Visualizing the model

📊 Plotting the fitted line

The excerpt shows how to overlay the regression line on the scatter plot:

plot(perf ~ clock, data=int00.dat)
abline(int00.lm)
  • plot(perf ~ clock, data=int00.dat): creates a scatter plot of the data
  • abline(int00.lm): adds the regression line using the slope and intercept from the model
  • abline() is short for "(a,b)-line"—it plots a line using the model's coefficients

👁️ What to look for

  • The line should roughly follow the trend of the data points.
  • The excerpt notes: "If we superimpose a straight line on this scatter plot, we see that the relationship between the predictor (the clock frequency) and the output (the performance) is roughly linear."
  • However, it also warns: "It is not perfectly linear... As the clock frequency increases, we see a larger spread in performance values."
  • This visual check is a first step; formal evaluation comes next.

3.3 Evaluating the Quality of the Model

🧭 Overview

🧠 One-sentence thesis

A regression model's quality is assessed by examining residuals, coefficient statistics, and goodness-of-fit measures to determine how well the fitted line describes the data and whether the assumptions of linear regression hold.

📌 Key points (3–5)

  • What quality evaluation provides: the summary() function extracts residuals, coefficient estimates with standard errors, p-values, and R-squared values to judge model fit.
  • Residuals as a diagnostic tool: differences between actual and predicted values should be normally distributed around zero if the model fits well.
  • Coefficient significance: the ratio of estimate to standard error (t-value) and the p-value tell whether there is strong evidence of a linear relationship.
  • Common confusion: a high R-squared is not always necessary—a model can predict well even with a modest R-squared; also, the intercept p-value is often not important unless theory requires the intercept to be zero.
  • Residual analysis goes deeper: plotting residuals against fitted values reveals patterns that summary statistics alone may miss, such as non-uniform scatter or increasing spread.

📊 Understanding residuals

📏 What residuals are

Residuals: the differences between the actual measured values and the corresponding values on the fitted regression line.

  • Each data point's residual is the vertical distance from the point to the regression line.
  • Positive residual: the point is above the line (model under-predicted).
  • Negative residual: the point is below the line (model over-predicted).
  • Example: if the actual performance is 1500 and the model predicts 1400, the residual is +100.

📐 Residual summary statistics

The summary() output reports five numbers for residuals:

| Statistic | Meaning |
| --- | --- |
| Min | Distance from the line to the point furthest below |
| 1Q | First quartile of all sorted residuals |
| Median | Median residual value |
| 3Q | Third quartile of all sorted residuals |
| Max | Distance from the line to the point furthest above |

What to look for in a good model:

  • Median near zero (residuals balanced around zero).
  • Min and Max roughly the same magnitude (symmetric spread).
  • 1Q and 3Q roughly the same magnitude (symmetric quartiles).
  • The excerpt notes that "a good model's residuals should be roughly balanced around and not too far away from the mean of zero."
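
These statistics come from R's summary() function applied to the model object:

summary(int00.lm)   # residual quartiles, coefficient table, R-squared, and F-statistic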

🔔 Normal distribution expectation

  • If the line is a good fit, residuals should be normally distributed (Gaussian, "bell curve") around a mean of zero.
  • This implies decreasing probability of finding residuals as you move further from zero.
  • Don't confuse: the residual summary alone is not enough—visual tests (residual plots) are needed to confirm normality.

🧮 Coefficient estimates and significance

🔢 Estimate and standard error

The Coefficients table shows:

  • Estimate: the fitted coefficient values (intercept and slope).
  • Std. Error: the statistical standard error for each coefficient.

Rule of thumb for a good model:

  • Standard error should be at least 5 to 10 times smaller than the coefficient.
  • Example from the excerpt: the slope estimate is 0.58635 with standard error 0.02697, giving a ratio of 21.7 (0.58635 ÷ 0.02697), which means "relatively little variability in the slope estimate."
  • The intercept standard error (53.31513) is roughly the same as the estimate (51.78709), suggesting "more uncertainty in the estimate of this coefficient"—but this is "not typically something to worry about for the y-intercept."

📈 t-value and p-value

  • t-value: the ratio of the estimate to the standard error (also called the test statistic).
  • Pr(>|t|) (p-value): the probability of observing a t-value as extreme or more extreme, assuming there is no linear relationship between predictor and response.

Interpreting the p-value:

  • A tiny p-value (e.g., less than 2×10⁻¹⁶) means "strong evidence of a linear relationship."
  • Example: the slope p-value in the excerpt is <2e-16, so "the probability of observing a t value of 21.741 or more extreme, assuming there is no linear relationship between clock speed and performance, is less than 2e-16."
  • A large p-value (e.g., 0.332 for the intercept) means "little evidence that the true intercept is not zero"—but this is "typically not something very interesting" unless theory requires the intercept to be zero.

⭐ Significance codes

  • The asterisks, periods, or spaces next to coefficients are quick visual indicators:
    • *** means 0 < p ≤ 0.001
    • ** means 0.001 < p ≤ 0.01
    • * means 0.01 < p ≤ 0.05
    • . means 0.05 < p ≤ 0.1
    • (space) means p > 0.1

📏 Goodness-of-fit measures

📉 Residual standard error

Residual standard error: a measure of the total variation in the residual values.

  • If residuals are normally distributed, the first and third quartiles should be about 1.5 times this standard error.
  • Degrees of freedom = total observations minus number of coefficients.
  • Example: 256 observations, 2 coefficients (slope and intercept) → 254 degrees of freedom.

🎯 Multiple R-squared

Multiple R-squared: a number between 0 and 1 that measures how well the model describes the measured data.

  • Computed by dividing the variation the model explains by the data's total variation.
  • Multiply by 100 to interpret as a percentage.
  • Example: R² = 0.6505 means "65.05% of the variability in performance is explained by the variation in clock speed."
  • Important clarification: "Random chance and measurement errors creep in, so the model will never explain all data variation. Consequently, you should not ever expect an R² value of exactly one."
  • Don't confuse: "A good model does not necessarily require a large R² value. It may still accurately predict future observations, even with a small R² value."
  • In general, values closer to 1 indicate a better-fitting model.

🔧 Adjusted R-squared

Adjusted R-squared: the R² value modified to take into account the number of predictors used in the model.

  • Always smaller than the R² value.
  • More relevant for models with multiple predictors (discussed in Chapter 4).

🧪 F-statistic

  • Compares the current model to a model with only the intercept parameter.
  • For simple linear regression (one predictor), the F-statistic gives the same information as the slope t-test: F-statistic = (t-value)² and has the same p-value.
  • Also called the "overall F test."
  • For models with multiple predictors, it compares the full model to the intercept-only model (more detail in Chapter 4).

🔍 Residual analysis for deeper diagnostics

🖼️ Why residual plots matter

Residual analysis: examines residual values to see what they can tell us about the model's quality.

  • The summary() function provides substantial information, but residual analysis "digs deeper."
  • A model that fits well would "over-predict as often as it under-predicts."
  • Plotting residuals reveals patterns that summary statistics alone may miss.

📊 How to interpret a residuals plot

  • Plot fitted values (x-axis) against residuals (y-axis).
  • What to expect for a well-fitted model: residuals distributed normally around zero, uniformly scattered above and below zero.
  • Warning signs:
    • Residuals increase as fitted values increase (non-constant variance).
    • Residuals not uniformly scattered above and below zero (systematic bias).
  • Example from the excerpt: "In this plot, we see that the residuals tend to increase as we move to the right. Additionally, the residuals are not uniformly scattered above and below zero."—this suggests the model may not fully capture the data structure.

🛠️ Generating the plot

  • Use plot(fitted(model), resid(model)) to create the residuals plot.
  • The fitted() function extracts predicted values; resid() extracts residuals.

3.4 Residual Analysis

🧭 Overview

🧠 One-sentence thesis

Residual analysis examines the differences between actual and predicted values to reveal whether a regression model adequately explains the data, with well-fitted models showing residuals that are normally distributed around zero without clear patterns.

📌 Key points (3–5)

  • What residuals measure: the difference between actual measured values and values predicted by the regression model; positive residuals mean under-prediction, negative mean over-prediction.
  • Pattern-free residuals signal good fit: a well-fitted model over-predicts as often as it under-predicts, so residuals should scatter uniformly around zero with no clear trend.
  • Q-Q plot tests normality: if residuals are normally distributed (as expected for a good model), points on the Q-Q plot should follow a straight line; deviations indicate problems.
  • Common confusion: residuals increasing or showing patterns does not mean the model is useless—it only means a better model may be possible.
  • Diagnostic plots reveal model quality: residual plots and Q-Q plots together help identify whether the current predictors sufficiently explain the data.

📊 What residuals tell us

📊 Definition and meaning

Residual value: the difference between the actual measured value stored in the data frame and the value that the fitted regression line predicts for that corresponding data point.

  • Residual = Actual value − Predicted value.
  • Positive residual (greater than zero): the model predicted too small a value (under-predicted).
  • Negative residual (less than zero): the model predicted too large a value (over-predicted).
  • Residuals capture what the model cannot explain.
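
In R, the residuals can be pulled directly from the model object or computed from the definition (a sketch; it assumes the perf and clock columns used to fit the model contain no missing values, so the vectors line up):

resid(int00.lm)                      # residuals stored with the fitted model
int00.dat$perf - fitted(int00.lm)    # actual minus predicted: the same values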

🎯 What good residuals look like

A model that fits the data well should:

  • Over-predict as often as it under-predicts.
  • Show residuals distributed normally around zero.
  • Display no clear trend or pattern when plotted.

Don't confuse: "no pattern" does not mean "all residuals are exactly zero"—random variation and measurement error always exist, so perfect prediction (R² = 1) is not expected.

🔍 Residual plot analysis

🔍 How to read the residual plot

The excerpt describes plotting residuals (vertical axis) against fitted values (horizontal axis):

plot(fitted(int00.lm), resid(int00.lm))

What to look for:

  • Residuals should scatter uniformly above and below zero.
  • No trend as you move left to right.
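
A small sketch of this plot, assuming the int00.lm model object from the excerpt; the dashed zero line is added here to make the expected scatter easier to judge:

plot(fitted(int00.lm), resid(int00.lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter evenly above and below this line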

⚠️ Signs of a problem

The example in the excerpt shows:

  • Residuals tend to increase as fitted values increase (moving to the right).
  • Residuals are not uniformly scattered above and below zero.

Interpretation:

  • Using clock speed as the sole predictor does not sufficiently or fully explain the data.
  • Any clear trend or pattern in residuals suggests you probably need a better model.

Important clarification: This does not mean the simple linear regression model is useless—it only means you may be able to construct a model that produces tighter residual values and better predictions.

📈 Q-Q plot for normality

📈 What the Q-Q plot tests

Quantile-versus-quantile (Q-Q) plot: a visual test of whether residuals from the model are normally (Gaussian) distributed around a mean of zero.

  • If the model fits well, residuals should be normally distributed.
  • The Q-Q plot provides a nice visual indication of this assumption.

📈 How to interpret the Q-Q plot

Generated with:

qqnorm(resid(int00.lm))
qqline(resid(int00.lm))

Expected pattern for good fit:

  • Points should follow a straight line.

What the example shows:

  • The two ends (tails) diverge considerably from the reference line.
  • This indicates residuals are not normally distributed.

🔄 Tail behavior and distribution shape

The excerpt explains how tail deviations reveal distribution problems:

  • Right tail diverges upward → the distribution's right tail is "heavier" than expected from a normal distribution.
  • Left tail diverges downward → the distribution's left tail is "lighter" than expected.
  • Both patterns together → indicative of a right-skewed distribution.

Conclusion from the example: This test further confirms that using only the clock as a predictor in the model is insufficient to explain the data.

🛠️ Additional diagnostic tools

🛠️ Four default diagnostic plots

The excerpt mentions that four diagnostic plots can be obtained automatically:

par(mfrow=c(2,2))
plot(int00.lm)

This displays plots in a 2-by-2 grid. The top two are the residual plots created manually above.

🛠️ Scale-Location plot

  • An alternate way of visualizing residuals versus fitted values.
  • Residuals are standardized and then transformed by square-root.
  • This "folds" the residuals and can aid in finding patterns.

🛠️ Residuals vs Leverage plot

  • Used to identify possible outliers.
  • The excerpt notes: "in this plot, there are no outliers."
  • Not discussed in detail in this section.

🚀 Next steps

🚀 Moving toward better models

The excerpt concludes:

  • The diagnostic plots reveal that the current simple linear regression model (using only clock speed) does not fully explain the data.
  • The next step is to learn to develop regression models with multiple input variables.
  • Perhaps a more complex model will be better able to explain the data.

Key takeaway: Residual analysis does not just reject models—it guides you toward improvements by showing where and how the current model falls short.

13

Visualizing the Relationships in the Data

4.1 Visualizing the Relationships in the Data

🧭 Overview

🧠 One-sentence thesis

Before building a multiple linear regression model, visualizing pairwise relationships in the data helps identify linear patterns and potential predictors while avoiding redundant or unnecessary variables.

📌 Key points (3–5)

  • Visual exploration first: Use pairwise comparison plots to see relationships between all variables before model development.
  • What to look for: Obvious linear relationships (proportional or inverse) between performance and input variables.
  • Common confusion: More predictors ≠ better model—too many predictors can over-fit to noise rather than capture true system behavior.
  • Balance needed: Too few predictors cause bias; too many model random noise and create numerical instability.
  • Adjusted R-squared helps: Unlike regular R-squared (which always increases with more predictors), adjusted R-squared accounts for the number of predictors to reveal whether adding a variable truly improves the model.

📊 Understanding multiple linear regression models

📐 What MLR models look like

Multiple linear regression model: a generalization of simple linear regression with k variables of the form: predicted y = a₀ + a₁x₁ + a₂x₂ + ... + aₖxₖ

  • The x values are the system inputs
  • The a coefficients are model parameters computed from measured data
  • Predicted y is the output value the model estimates
  • Everything from single-variable models applies to multiple-variable models

🔍 Creating pairwise visualizations

🖼️ How to generate comparison plots

The excerpt demonstrates using the pairs() function to create a matrix of scatter plots showing all pairwise comparisons in the data frame.

  • The gap parameter controls spacing between individual plots (set to zero to eliminate space)
  • Each box shows one variable plotted against another
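
A minimal sketch of the call described above, assuming the int00.dat data frame from the excerpt is loaded:

pairs(int00.dat, gap = 0)   # one scatter plot for every pair of columns; gap = 0 removes the space between panels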

📖 How to read the plot matrix

Example from the excerpt:

  • Locate a variable label (e.g., "perf" for performance)
  • The box immediately to the right shows a scatter plot with that variable on the vertical axis and the next variable on the horizontal axis
  • Scan through plots to identify obviously linear relationships

🔎 What patterns to notice

The excerpt identifies several relationship types:

  • Proportional relationship: performance and clock show a somewhat proportional pattern
  • Inverse relationship: performance and feature size show a weakly inverse relationship
  • Perfect correlation: performance and normalized performance show perfect linear correlation (because one is a rescaling of the other)

⚖️ The predictor selection challenge

❌ Why not include everything

The excerpt warns against the novice assumption that "more information is better":

  • A good regression model explains relationships as simply as possible
  • Use the smallest number of predictors necessary for good predictions
  • Too many predictors build random noise into the model

🎯 The over-fitting problem

Over-fitted model: a model that is very good at predicting outputs from the specific input data set used to train it, but does not accurately model the overall system's response for a broader range of inputs.

Why over-fitting happens:

  • Using too many or redundant predictors
  • The model learns the noise patterns in the training data instead of the true underlying relationships
  • Redundant predictors can also cause numerical instabilities when computing coefficients

⚖️ Finding the right balance

  • Too few predictors: produces biased predictions, misses important relationships, and underfits the system.
  • Too many predictors: models random noise, over-fits to the training data, and creates numerical instability.

📈 Using adjusted R-squared to guide selection

📊 The problem with regular R-squared

The excerpt explains a key confusion point:

  • Adding more predictors always causes R-squared to increase
  • This can mislead you into thinking additional predictors generated a better model
  • Sometimes the increase reflects better modeling of random noise, not true improvement

🔧 How adjusted R-squared helps

Adjusted R-squared: a modification of R-squared that changes the value according to the number of estimated parameters in the model.

The formula (in words): Adjusted R-squared = 1 minus [(n minus 1) divided by (n minus m)] times (1 minus R-squared), where n is the number of observations and m is the number of estimated parameters (k + 1, where k is the number of predictors).

How to interpret it:

  • If adding a predictor increases R-squared by more than expected from random fluctuations → adjusted R-squared increases
  • If the increase is just noise → adjusted R-squared decreases
  • This helps determine whether a predictor truly improves the model fit

🎯 The goal

Find predictors that genuinely explain the system's behavior, not just the noise in the training data.

14

Identifying Potential Predictors

4.2 Identifying Potential Predictors

🧭 Overview

🧠 One-sentence thesis

The key to building a good multiple linear regression model is selecting the smallest number of predictors necessary to explain the relationship between inputs and outputs without over-fitting to noise or under-fitting due to missing important variables.

📌 Key points (3–5)

  • The balance problem: too many predictors cause over-fitting (modeling noise instead of true relationships); too few cause biased predictions.
  • R² always increases with more predictors, but adjusted R² compensates by penalizing models with unnecessary parameters, helping distinguish real improvement from noise modeling.
  • Domain knowledge guides selection: exclude obviously irrelevant variables (e.g., TDP in this case) and include non-linear terms (e.g., square roots of cache sizes) when physical reasoning suggests non-linear relationships.
  • Common confusion: more information is not always better—redundant or unnecessary predictors build random noise into the model, making it good at predicting training data but poor at generalizing to new inputs.
  • Start broad, then eliminate: begin with all plausible predictors and use backward elimination to systematically remove those that don't contribute meaningfully.

🎯 The core trade-off

⚖️ Too few vs too many predictors

  • Too few predictors → biased predictions that miss important relationships.
  • Too many predictors → over-fitted model that follows random noise in the training data.

Over-fitted model: a model that is very good at predicting outputs from the specific input data set used to train it, but does not accurately model the overall system's response and will not appropriately predict outputs for a broader range of inputs.

  • An over-fitted model learns the training data's quirks instead of the underlying system behavior.
  • Redundant or unnecessary predictors also cause numerical instabilities when computing coefficients.
  • Example: if you include every available variable without thought, the model may perfectly predict your training data but fail on new systems.

📈 Why R² is misleading

  • Adding any predictor will always increase R², even if that predictor is just noise.
  • This can trick you into thinking the model improved when it actually just got better at modeling randomness.
  • Don't confuse: R² going up ≠ model getting better at capturing true relationships.

📐 Adjusted R² as a solution

🧮 How adjusted R² works

The formula is:

Adjusted R² = 1 − [(n − 1) / (n − m)] × (1 − R²)

where:

  • n = number of observations
  • m = number of estimated parameters (k + 1, where k is the number of predictors and +1 is for the intercept)
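
A small sketch of this formula as an R function; the numbers passed in below are made up for illustration, not taken from the excerpt:

adjusted_r_squared <- function(r2, n, m) {
  1 - ((n - 1) / (n - m)) * (1 - r2)   # n observations, m estimated parameters
}
adjusted_r_squared(r2 = 0.96, n = 80, m = 11)   # hypothetical values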

🔍 What adjusted R² tells you

  • If adding a new predictor increases R² by more than expected from random fluctuations, adjusted R² will increase → the predictor is useful.
  • If removing a predictor decreases R² by more than expected from random variations, adjusted R² will decrease → the predictor was contributing meaningfully.
  • The adjustment penalizes models for having more parameters, helping you avoid over-fitting.
  • Goal: use as few predictors as possible while still explaining the data well.

🧹 Excluding obvious non-predictors

🚫 The TDP example

  • TDP (thermal design power) is defined as "the average amount of power in watts that a cooling system must dissipate."
  • It is a specification provided by the chip manufacturer to ensure adequate cooling, not a parameter that directly affects performance.
  • Conclusion: exclude TDP from potential predictors because it doesn't drive the output variable (performance).

🔬 Using domain knowledge

  • Before throwing all columns into the model, step back and consider what you know about the underlying system.
  • Exclude variables that are obviously unhelpful based on their meaning.
  • This prevents wasting computational effort and reduces the risk of including noise.

🔧 Choosing and transforming predictors

📊 Handling multiple output columns

  • The data frame has two output columns: perf and nperf.
  • A regression model can have only one output.
  • nperf is a linear transformation of perf that rescales values to the range [0, 100] for easier interpretation.
  • Decision: use nperf as the model output and ignore perf.
  • Note: this rescaling has no effect on the models because it is a linear transformation.

🧮 Including non-linear terms

  • Regression models add terms linearly, but individual terms can be non-linear (e.g., a coefficient times x raised to some power p).
  • Include non-linear terms only if you have physical reasons to suspect a non-linear relationship.
  • Example from the excerpt: empirical studies suggest cache miss rates are roughly proportional to the square root of cache size.
  • Therefore, include both:
    • First-degree terms (p = 1) for each cache size
    • Square root terms (p = 1/2) for each cache size
  • Don't confuse: you must include the linear term even when adding a non-linear term; both are potential predictors.

🗑️ Handling missing data

  • The excerpt notes that only a few entries include values for L3 cache.
  • If L3 is kept as a predictor, only observations with non-missing L3 values will be included in the model.
  • This would reduce the data set from 256 systems to only 10 systems.
  • Decision: exclude L3 cache size as a potential predictor to preserve the larger data set.
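
A quick way to check this kind of sparsity, assuming the column is named L3cache in int00.dat (the excerpt does not show the exact column name):

sum(!is.na(int00.dat$L3cache))   # the excerpt reports only 10 systems with an L3 value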

📋 Final predictor list

📝 The starting set

The excerpt provides a table of potential predictors to be used in model development:

  • Processor specs: clock, threads, cores, transistors, dieSize, voltage, featureSize, channel, FO4delay
  • Cache sizes (linear): L1icache, L1dcache, L2cache
  • Cache sizes (square root): sqrt(L1icache), sqrt(L1dcache), sqrt(L2cache)
  • This list reflects both the available data and domain-specific knowledge.
  • Excluded: perf (using nperf instead), TDP (not a direct performance driver), L3cache (too much missing data).
  • Included: non-linear (square root) cache terms based on empirical performance modeling research.

🎓 The principle

  • Exploiting domain-specific knowledge when selecting predictors ultimately helps produce better models than blindly applying the model development process.
  • Start with all plausible predictors, informed by understanding of the system.
  • Then use a systematic process (backward elimination, introduced in the next section) to refine the list.
15

The Backward Elimination Process

4.3 The Backward Elimination Process

🧭 Overview

🧠 One-sentence thesis

Backward elimination helps find the right balance of predictors by starting with all candidates and iteratively removing those with large p-values until only meaningful predictors remain.

📌 Key points (3–5)

  • The core trade-off: too many predictors make the model follow noise; too few reduce prediction accuracy.
  • How backward elimination works: start with all predictors, compute the model, remove predictors with large p-values one at a time, and repeat until all remaining predictors are below the threshold.
  • What p-value means: a large p-value indicates that observing the t-statistic (or larger) is fairly likely by random chance if the slope is actually zero, so the predictor is not contributing meaningfully.
  • Common confusion: backward elimination vs forward selection—backward starts with all predictors and removes; forward starts empty and adds; backward is better at keeping groups of predictors that work well together.
  • Why manual judgment matters: automated methods lack intuition about whether the model "makes sense" physically; continually ask if including or excluding each parameter is reasonable.

🎯 The predictor selection challenge

⚖️ Balancing model complexity

  • Too many predictors: the model will train to follow random variations (noise) in the data too closely.
  • Too few predictors: the model may not be as accurate at predicting future values as a model with more predictors.
  • The backward elimination process helps navigate this trade-off systematically.

🔧 Domain knowledge in predictor choice

The excerpt illustrates several decisions made before starting backward elimination:

  • TDP excluded: specified by the manufacturer to ensure cooling capability, so it is not a useful predictor for the model.
  • Nonlinear terms included: cache miss rates are roughly proportional to the square root of cache size (from prior empirical studies), so both first-degree (p = 1) and square-root (p = 1/2) terms for each cache size are included.
  • L3 cache excluded: only a few entries (10 out of 256 systems) have L3 cache values; keeping it would reduce the dataset size drastically.

Exploiting domain-specific knowledge when selecting predictors can help produce better models than blindly applying the model development process.

🔄 How backward elimination works

🔄 The iterative process

  1. Start with all possible predictors: use the lm() function to compute the model with every candidate predictor.
  2. Examine p-values: use the summary() function to find each predictor's p-value.
  3. Identify candidates for removal: for predictors whose contribution (slope) is close to zero—based on the estimate and standard error—the p-value will be large.
  4. Remove one predictor: if a p-value is larger than the predetermined threshold, remove that predictor from the model.
  5. Refit the model: compute a new model excluding the removed parameter.
  6. Repeat: continue until all remaining predictors have p-values below the threshold.

📊 Interpreting p-values

A large p-value means that, if the true slope were actually zero, a t-statistic (the ratio of the estimate to its standard error) at least as large as the one observed would be fairly likely to occur by random chance alone; such a parameter is not contributing meaningfully to the fit of the model.

  • Typical threshold: p = 0.05 (95 percent confidence that the predictor is meaningful).
  • Alternative threshold: p = 0.10 is also not unusual.
  • Flexibility: there may be reasons to keep a term in the model if its p-value is only slightly larger than the threshold.

Example: If a predictor has p = 0.07 and the threshold is 0.05, you might still keep it if there is a strong physical reason to believe it matters.

🆚 Comparing model-building approaches

🆚 Backward elimination vs forward selection

  • Backward elimination
    • How it works: start with all predictors; remove one at a time if its p-value is large.
    • Advantage: straightforward to determine which parameter to drop at each step; more likely to keep groups of predictors that work well together.
    • Disadvantage: requires computing a full model initially.
  • Forward selection
    • How it works: start with no predictors; add one at a time if its p-value stays below the threshold.
    • Advantage: builds the model incrementally.
    • Disadvantage: more difficult to determine which parameter to try at each step; may miss groups of predictors that work well together.

🤖 Automated vs manual approaches

The excerpt mentions other approaches:

  • Step-wise regression
  • All possible regressions
  • Automated selection

Why automated methods have strong appeal:

  • As technologically savvy individuals, we tend to believe that an automated process can test a far broader range of predictor combinations than we ever could by hand.

Why the excerpt cautions against them:

  • Automated procedures lack intuitive insights into the underlying physical nature of the system being modeled.
  • Intuition helps answer whether a model is reasonable to construct in the first place.
  • Automated methods make it too easy to forget to think about whether each step makes sense.

🧠 The "does it make sense?" test

Continually ask yourself:

  • Does it make sense that parameter i is included but parameter j is excluded?
  • Is there a physical explanation to support the inclusion or exclusion of any potential parameter?

Example: If voltage is excluded but clock speed is included, does that align with what you know about processor performance?

🛠️ Example: starting the backward elimination process

🛠️ The full model setup

The excerpt shows the initial model with all potential predictors from Table 4.1:

  • Predictors: clock, threads, cores, transistors, dieSize, voltage, featureSize, channel, FO4delay, L1icache, sqrt(L1icache), L1dcache, sqrt(L1dcache), L2cache, sqrt(L2cache)
  • Output variable: nperf (normalized performance)
  • Data frame: int00.dat

The function call:

int00.lm.full <- lm(nperf ~ clock + threads + cores + transistors + dieSize + voltage + featureSize + channel + FO4delay + L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data=int00.dat)

📋 Naming conventions

  • Variable name: int00.lm.full
    • .lm reminds us this is a linear model
    • .full indicates the model includes all possible predictors
  • Explicit data argument: data=int00.dat avoids confusion when manipulating multiple models simultaneously

📊 Reading the summary output

The summary() function provides:

  • Call: the formula used to create the model
  • Residuals: distribution of errors (Min, 1Q, Median, 3Q, Max)
  • Coefficients table: for each predictor, shows Estimate, Standard Error, t value, and p-value (Pr(>|t|))

Example from the excerpt:

  • clock: Estimate = 0.02605, p-value < 2e-16 (very small, highly significant)
  • threads: Estimate = -2.346, p-value = 0.26596 (large, candidate for removal)
  • cores: p-value = 0.21235 (large, candidate for removal)
  • transistors: p-value = 0.68897 (very large, strong candidate for removal)

The next step (not fully shown) would be to remove the predictor with the largest p-value, refit the model, and repeat.

16

An Example of the Backward Elimination Process

4.4 An Example of the Backward Elimination Process

🧭 Overview

🧠 One-sentence thesis

Backward elimination systematically removes predictors with the highest p-values above a threshold (e.g., 0.05) one at a time until all remaining predictors are statistically significant, producing a final model that balances explanatory power with parsimony.

📌 Key points (3–5)

  • The backward elimination procedure: start with all potential predictors, then iteratively remove the predictor with the largest p-value above the threshold until all remaining p-values are below the threshold.
  • Why backward elimination is preferred: it is straightforward to decide which parameter to drop at each step, and it is more likely to retain groups of predictors that work well together.
  • Common confusion—automated vs manual modeling: automated procedures test many combinations but lack intuitive insight into whether the model "makes sense" physically or logically; always ask if each step is reasonable.
  • Handling missing data: R automatically removes rows with NA values when computing models, which can change the number of observations (degrees of freedom) between models and affect comparability.
  • Stopping criterion: the process stops when all predictors have p-values below the predetermined threshold, even if the final model contains many predictors.

🔄 The backward elimination workflow

🔄 Starting with the full model

  • Begin by including all potential predictors in the initial model using the lm() function.
  • Example from the excerpt: the full model for int00.dat included clock, threads, cores, transistors, dieSize, voltage, featureSize, channel, FO4delay, L1icache, sqrt(L1icache), L1dcache, sqrt(L1dcache), L2cache, and sqrt(L2cache).
  • The summary() function provides detailed information: residuals, coefficient estimates, standard errors, t-values, and p-values for each predictor.

🔄 Iterative removal of predictors

  • At each step, identify the predictor with the largest p-value.
  • If that p-value exceeds the predetermined threshold (e.g., p = 0.05), remove that predictor and recompute the model.
  • Use the update() function with the notation .~. - <predictor> to remove a predictor and recompute in one step.
  • Example: FO4delay had p = 0.99123 (the largest), so it was removed first; then featureSize (p ≈ 0.78), transistors (p ≈ 0.68), threads (p ≈ 0.37), and dieSize (p ≈ 0.14) were removed in sequence.
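
A sketch of one elimination step using update(); the model and predictor names follow the excerpt, and int00.lm.2 is simply the name chosen here for the reduced model:

int00.lm.2 <- update(int00.lm.full, . ~ . - FO4delay)   # drop the predictor with the largest p-value
summary(int00.lm.2)                                      # re-check the p-values, then repeat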

🛑 Stopping the process

  • Stop when all remaining predictors have p-values below the threshold.
  • In the excerpt, the process stopped at model int00.lm.6 with ten predictors, all with p-values < 0.02.
  • Even if the final model contains many predictors, if all are statistically significant, there is no statistical reason to exclude any.

🆚 Comparing model selection approaches

🆚 Backward elimination vs forward selection

  • Backward elimination
    • How it works: start with all predictors; remove one at a time.
    • Advantages: straightforward to decide which predictor to drop; more likely to retain groups of predictors that work well together.
    • Disadvantages: requires starting with a full model.
  • Forward selection
    • How it works: start with no predictors; add one at a time.
    • Advantages: builds up from the simplest model.
    • Disadvantages: harder to decide which predictor to add at each step; may miss groups of predictors that work well together.

🆚 Manual vs automated procedures

  • Automated methods (stepwise regression, all possible regressions, automated selection): can test a broader range of predictor combinations.
  • Disadvantage of automation: lack intuitive insight into the underlying physical or logical nature of the system being modeled.
  • Why intuition matters: it helps answer whether the model "makes sense"—e.g., does it make sense that parameter i is included but parameter j is excluded? Is there a physical or logical explanation?
  • Don't confuse: automation is convenient, but it can make you forget to think critically about each modeling step.

📊 Interpreting model diagnostics during elimination

📊 Tracking R² and adjusted R²

  • R² values stayed very close to 0.965 throughout the elimination process in the example.
  • Adjusted R² values tended to increase slightly with each dropped predictor, indicating that models with fewer predictors and more degrees of freedom explained the data slightly better.
  • Don't over-interpret small changes: these changes may be due to random data fluctuations, but it is reassuring when they behave as expected.

📊 Degrees of freedom and missing data

  • The number of degrees of freedom (DF) equals the number of observations used minus the number of estimated coefficients (the predictors plus the intercept).
  • Missing data (NA values): R automatically removes rows with missing values for any predictor in the model.
  • Example: the full model had 61 DF with 179 observations deleted due to missingness; after removing transistors, DF jumped from 63 to 67 because, in addition to the one DF gained by dropping a predictor, fewer rows were deleted for missing data (176 instead of 179).
  • Comparability issue: adjusted R² assumes the same observations are used in both models being compared; if different rows are dropped, models are not directly comparable.

📊 F-test interpretation

The F-test roughly compares the current model to a model with one fewer predictor; if the current model is better, the p-value will be small.

  • In all models in the example, the F-test p-value was very small and consistent.
  • This consistency means the F-test did not help discriminate between potential models in this case.

🧪 Residual analysis and model validation

🧪 Checking model assumptions

  • After building the final model, apply residual analysis techniques (same as for simple linear regression).
  • The plot() function on a linear model object produces a panel of diagnostic plots.

🧪 Key diagnostic plots

  • Residuals vs Fitted (top left): residuals should be uniformly scattered around zero with no obvious patterns.
    • Example: the excerpt notes residuals appeared "somewhat uniformly scattered about zero" with no obvious patterns, giving no reason to believe the model is poor.
  • Q-Q plot (top right): residuals should roughly follow the indicated line if they are normally distributed.
    • Example: residuals roughly followed the line, but with some nonlinearities and slight deviations; one or two points deviating slightly should not be overly concerning, but serve as a reminder that all models are imperfect.
  • Don't confuse: even Q-Q plots from models drawn from normal distributions will deviate slightly due to random chance; do not reject a model based solely on minor deviations.

⚠️ When backward elimination fails

⚠️ Singularities and undefined coefficients

  • Sometimes the backward elimination process produces results that do not make sense.
  • Example: when trying to model Int1992 data with all potential predictors, the output showed:
    • Only 6 residuals (observations 14–19).
    • (14 not defined because of singularities).
    • Most predictors had NA for Estimate, Std. Error, t value, and Pr(>|t|).
  • What this means: R could not compute coefficients for most predictors because of singularities (e.g., perfect multicollinearity or insufficient data).

⚠️ Diagnosing the problem

  • Check the number of observations: in the Int1992 example, only 6 observations were used.
  • Check for missing data: many rows may have been dropped due to NA values.
  • Check for multicollinearity: predictors may be perfectly correlated or linearly dependent.
  • Takeaway: backward elimination assumes you have enough data and that predictors are not perfectly collinear; when these assumptions fail, the process breaks down.
17

Residual Analysis

4.5 Residual Analysis

🧭 Overview

🧠 One-sentence thesis

Residual analysis techniques from simple linear regression can be applied to multiple linear regression models to validate model assumptions, though patterns in diagnostic plots should prompt caution rather than immediate rejection.

📌 Key points (3–5)

  • Purpose: residual analysis checks whether the assumptions used to develop the model are valid.
  • Key diagnostic plots: Residuals vs Fitted (scatter pattern), Q-Q plot (normality), Scale-Location, and Residuals vs Leverage.
  • What to look for: uniformly scattered residuals around zero with no obvious patterns; residuals roughly following the expected line in Q-Q plots.
  • Common confusion: minor deviations in Q-Q plots are normal due to random chance—don't reject a model based on slight nonlinearities alone; all models are imperfect.
  • When things go wrong: sometimes backward elimination produces nonsensical results with singularities, indicating the model cannot estimate certain coefficients.

📊 Diagnostic plot panel

📊 How to generate the plots

The excerpt shows that a panel of four diagnostic plots can be produced with:

  • par(mfrow=c(2,2)) sets up a 2×2 grid.
  • plot(int00.lm.6) generates the four standard diagnostic plots for the fitted model.

These plots are the same residual analysis techniques used in simple linear regression (Section 3.4), now applied to multiple linear regression.

🔍 Interpreting the Residuals vs Fitted plot

🔍 What it shows

  • Location: top left plot in the panel.
  • Purpose: checks whether residuals are well-behaved and uniformly scattered around zero.

✅ Good signs

  • Residuals appear "somewhat uniformly scattered about zero."
  • No obvious patterns visible.
  • The excerpt concludes: "this plot gives us no reason to believe that we have produced a poor model."

⚠️ What to watch for

If you see clear patterns (e.g., curves, funnels, clusters), that suggests the residuals are not well-behaved and the model may be inadequate.

📈 Interpreting the Q-Q plot

📈 What it shows

  • Location: top right plot in the panel.
  • Purpose: checks whether residuals are normally distributed.

Q-Q plot: a plot where residuals are compared against theoretical quantiles from a normal distribution; if residuals follow the indicated line, they are approximately normal.

✅ What the excerpt found

  • Residuals "roughly follow the indicated line."
  • Some nonlinearities and one or two points that "only slightly deviate" are visible.

🧠 How to interpret deviations

  • Don't overreact: "Even Q-Q plots from models drawn from normal distributions will deviate from the expected due to random chance."
  • Be cautious but not rejecting: the excerpt advises being "slightly more cautious" but says "we should not reject the model based on this one test."
  • Key takeaway: "all models are imperfect"—minor deviations are a reminder, not a fatal flaw.

  • Residuals roughly follow the line → the normality assumption is reasonably satisfied.
  • Slight nonlinearities visible → warrants caution, but not immediate rejection.
  • One or two points deviate slightly → normal due to random chance; not overly concerning.

🚨 When things go wrong

🚨 Singularities in model fitting

The excerpt introduces a failure mode: sometimes backward elimination produces "results that do not appear to make any sense."

🔧 Example: Int1992 data

When trying to fit a full multiple linear regression model with all predictors from Table 4.1, the output shows:

  • Message: (14 not defined because of singularities)
  • Result: most coefficients show NA for Estimate, Std. Error, t value, and p-value.
  • Only the intercept and clock have valid estimates; all other predictors (threads, cores, transistors, dieSize, voltage, featureSize, channel, FO4delay, L1icache, sqrt(L1icache), L1dcache, sqrt(L1dcache), L2cache, sqrt(L2cache)) are undefined.

🧩 What singularities mean

  • The model cannot estimate coefficients for those predictors.
  • This typically indicates multicollinearity or linear dependence among predictors (though the excerpt does not explicitly state the cause).
  • Example: if one predictor is a perfect linear combination of others, R cannot separate their individual effects.

⚠️ Don't confuse

  • Minor Q-Q plot deviations (random noise, acceptable) vs. singularities (fundamental model failure, unacceptable).
  • Singularities are a structural problem with the data or model specification, not a minor diagnostic concern.
18

When Things Go Wrong

4.6 When Things Go Wrong

🧭 Overview

🧠 One-sentence thesis

When building a multiple linear regression model, data anomalies such as constant-value columns or too many missing observations can prevent R from computing coefficients, requiring systematic investigation and removal of problematic predictors before proceeding with model selection.

📌 Key points (3–5)

  • The problem: R returns "NA" for most coefficients and reports "not defined because of singularities," meaning it cannot compute values due to data anomalies.
  • Root causes: columns where all values are identical (no variation), or columns with too few non-missing observations to support the number of coefficients needed.
  • Diagnostic approach: use the table() function to inspect each predictor column and identify which ones have insufficient variation or data.
  • Common confusion: a predictor may have valid-looking values but still be unusable if it appears in too few rows—variables must vary and have enough observations.
  • Resolution: systematically remove problematic predictors, then continue with the normal backward elimination process.

🚨 Recognizing the problem

🚨 What the error looks like

When attempting to fit a full multiple linear regression model to the Int1992 data, the output shows:

  • Every predictor except clock has NA for Estimate, Std. Error, t value, and Pr(>|t|).
  • A message: "14 not defined because of singularities."
  • A note: "72 observations deleted due to missingness."

Singularities: R could not compute coefficient values because of anomalies in the data; technically, it could not invert the matrix used in the least-squares minimization process.

🔍 Why this matters

  • The model cannot proceed with backward elimination if most coefficients are undefined.
  • The excerpt emphasizes that only 4 degrees of freedom remain from 78 total rows, meaning almost all data is unusable.
  • Example: of the 78 rows, 72 were deleted due to missing values, leaving only 6 observations; estimating the intercept and the clock coefficient uses 2 of them, so just 4 degrees of freedom remain. The numbers add up, but they indicate a severe data problem.

🔧 Diagnosing data anomalies

🔧 Using the table() function

The excerpt recommends table() as a quick way to summarize a data vector and spot anomalies.

How it works:

  • The top line shows unique values in the column.
  • The line below shows how many times each value appears.
  • Example: for clock, the output shows a range from 48 to 350 with reasonable frequency distribution (minimum 1 occurrence, maximum 10 occurrences).
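
A sketch that runs this kind of check over every column at once, assuming the int92.dat data frame from the excerpt:

sapply(int92.dat, function(col) length(unique(na.omit(col))))
# any column reporting a single unique value (e.g. threads, cores) cannot help the model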

⚠️ Identifying constant-value columns

When table(int92.dat$threads) is executed, the result is:

threads
1
78

This means all 78 entries contain the same value: 1.

Why this is a problem:

An input variable in which all of the elements are the same value has no predictive power in a regression model. Variables must vary!

  • If every row has the same value, there is no way to distinguish one row from another.
  • The excerpt emphasizes: "Variables must vary!"
  • Don't confuse: a column can have valid numeric data but still be useless if it never changes.

📉 Identifying insufficient-data columns

After removing threads and cores (both constant), the problem persists. Further investigation with table(int92.dat$L2cache) reveals:

L2cache
96  256  512
6    2    2
  • Only three unique values exist.
  • These appear in only 10 rows total.

Why this is a problem:

  • The excerpt states: "having only ten observations makes it impossible for lm() to compute the fourteen necessary model coefficients."
  • Don't confuse: the values (96, 256, 512) look reasonable, but the number of observations is too small to support the model.

🛠️ Resolving the problem

🛠️ Step-by-step elimination

The excerpt demonstrates a systematic process:

  1. Identify threads as constant (all 1s) → remove threads from the model.
  2. Identify cores as constant (all 1s) → remove cores from the model.
  3. Check the remaining predictors → still only 4 degrees of freedom, with 72 observations deleted.
  4. Identify L2cache as having only 10 observations → remove L2cache and sqrt(L2cache).
  5. Re-fit the model → now 26 degrees of freedom, 40 observations deleted, and all coefficients computed.

✅ What success looks like

After removing threads, cores, L2cache, and sqrt(L2cache), the model output shows:

  • All remaining predictors have numeric Estimate, Std. Error, t value, and Pr(>|t|) values.
  • 26 degrees of freedom (up from 4).
  • 40 observations deleted (down from 72).
  • Residuals with Min, 1Q, Median, 3Q, Max—indicating a proper fit.
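
A sketch of the re-fit described above; the predictor list is Table 4.1 minus the dropped columns, and int92.lm.full is just the name chosen here:

int92.lm.full <- lm(nperf ~ clock + transistors + dieSize + voltage + featureSize +
                      channel + FO4delay + L1icache + sqrt(L1icache) +
                      L1dcache + sqrt(L1dcache), data = int92.dat)
summary(int92.lm.full)   # every remaining coefficient should now have a numeric estimate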

Now the normal backward elimination process can proceed:

  • Begin by eliminating the predictor with the largest p-value above the threshold (in this case, transistors with p = 0.78018).
  • Continue until all predictors have p-values below the threshold.

🎯 Final model for Int1992

After completing backward elimination, the excerpt reports that only two predictors remain:

  • clock
  • dieSize

Important note from the excerpt:

"Notice that we could not have expected to only have these two predictors left in the model, just based on the p-values from the starting model. This is due to many factors including being able to include more observations as predictors that have missing values are removed from the model."

  • Don't assume the final model based on initial p-values alone.
  • Removing problematic predictors allows more observations to be included, which changes the model dynamics.
19

Data Splitting for Training and Testing

5.1 Data Splitting for Training and Testing

🧭 Overview

🧠 One-sentence thesis

To properly evaluate a regression model's predictive quality, you must train it on one subset of data and test it on a separate subset, because using the same data for both would artificially inflate accuracy.

📌 Key points (3–5)

  • Why split data: Using the same data to both build and test a model is like grading your exam with the answer key you copied from—it guarantees perfect results but proves nothing.
  • Training set vs testing set: the training set computes the model's coefficients; the testing set evaluates how well those coefficients predict new outcomes.
  • How to split: randomly partition the available data into two portions (e.g., 50/50) so both sets are similar but separate.
  • Common confusion: don't use all available data to fit the model and then reuse that same data to check predictions—you need independent test data.
  • Why randomization matters: random selection ensures the train-test split is unbiased and reproducible (if seeded) or truly random (if not seeded).

🚫 Why you cannot use the same data twice

🚫 The answer-key analogy

  • The excerpt compares reusing the same data to "copying exam answers from the answer key and then using that same answer key to grade your exam."
  • Result: you would always get a perfect score, but it tells you nothing about whether you actually understand the material.
  • In modeling: if you fit coefficients to a dataset and then test predictions on that same dataset, the model will appear artificially accurate because it was optimized for exactly those observations.

🔍 Training vs testing roles

Training: using a portion of data in the lm() function to compute the specific values of the model's coefficients.

Testing: using the remaining portion of data to see how well the model predicts results, compared to this test data.

  • Training = learning the relationship from data.
  • Testing = checking whether that learned relationship generalizes to new data.
  • Don't confuse: the training set teaches the model; the test set evaluates it. They must be separate.

🔀 How to split data into training and testing sets

🔀 The splitting process

The excerpt describes a sequence of operations to partition the int00.dat data frame:

  1. Set a seed (optional, for reproducibility): set.seed(1234) ensures the same random split every time; omit this line for true randomness.
  2. Count total rows: rows <- nrow(int00.dat) stores the number of observations.
  3. Choose a fraction: f <- 0.5 means 50% of data for training, 50% for testing (this choice is somewhat arbitrary).
  4. Find the split point: upper_bound <- floor(f * rows) rounds down to the nearest integer to mark where to divide the data.
  5. Randomize row order: sample(rows) returns a random permutation of row indices; using this as the row index shuffles the entire data frame into permuted_int00.dat.
  6. Assign training set: train.dat <- permuted_int00.dat[1:upper_bound, ] takes the first half.
  7. Assign testing set: test.dat <- permuted_int00.dat[(upper_bound+1):rows, ] takes the second half.
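
The same steps, collected into one runnable sketch (object names as in the excerpt):

set.seed(1234)                                   # optional: makes the split reproducible
rows        <- nrow(int00.dat)
f           <- 0.5                               # fraction of the data used for training
upper_bound <- floor(f * rows)
permuted_int00.dat <- int00.dat[sample(rows), ]  # shuffle the rows into a random order
train.dat <- permuted_int00.dat[1:upper_bound, ]
test.dat  <- permuted_int00.dat[(upper_bound + 1):rows, ]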

🎲 Why randomization is important

  • Random permutation ensures that both training and testing sets are similar (drawn from the same population) but separate (no overlap).
  • Without randomization, you might accidentally put all of one type of observation in one set, biasing the results.
  • Example: if the data were sorted by processor speed, taking the first half for training and the second half for testing would mean training on slow processors and testing on fast ones—poor generalization.

🏋️ Training the model and predicting test outcomes

🏋️ Training step

  • The excerpt calls lm() with the predictors identified earlier and the train.dat data frame.
  • This computes the model's coefficients and assigns the result to int00_new.lm.
  • Key point: only the training data influences the coefficient values.

🔮 Prediction step

  • The predict() function takes the trained model (int00_new.lm) and applies it to the test.dat data frame.
  • Syntax: predicted.dat <- predict(int00_new.lm, newdata=test.dat).
  • This produces predicted performance values for each processor in the test set, without using the actual measured performance from test.dat during prediction.

📏 Measuring prediction error

Delta (Δ): the difference between predicted and measured performance for each processor i, defined as Δ_i = Predicted_i − Measured_i.

  • Compute the vector of differences: delta <- predicted.dat - test.dat$nperf.
  • test.dat$nperf selects the actual measured performance column from the test data frame.
  • Interpretation: positive delta means the model overpredicted; negative means it underpredicted.
  • The excerpt notes that the mean of these delta values (for n different observations) is a useful summary statistic, though the text cuts off before completing this explanation.

📊 Overall training-and-testing workflow

📊 Flow diagram summary

The excerpt references Figure 5.1, which shows:

  • Split: int00.dat + fraction f → random partition → train.dat + test.dat
  • Train: train.dat (inputs + outputs) → lm() → int00_new.lm (model coefficients)
  • Predict: test.dat (inputs only) + int00_new.lm → predict() → predicted.dat
  • Evaluate: predicted.dat + test.dat$nperf → subtraction → delta (error vector)

🔄 Why this process works

  • The model never "sees" the test set outputs during training, so predictions on the test set are genuinely new.
  • Comparing predicted vs actual test outcomes reveals how well the model generalizes beyond the data it was fitted to.
  • Don't confuse: the test set's inputs are used for prediction, but its outputs are only used afterward to measure error—they do not influence the model coefficients.
20

Training and Testing

5.2 Training and Testing

🧭 Overview

🧠 One-sentence thesis

Training a regression model on one portion of data and testing it on another portion allows us to evaluate how well the model predicts unseen cases, with confidence intervals and scatter plots revealing prediction quality.

📌 Key points (3–5)

  • The training-testing split: randomly partition data into two portions—train the model on the first, test predictions on the second.
  • What delta measures: the difference between predicted and measured values for each test case (delta = Predicted − Measured).
  • How to judge quality: a tight confidence interval around zero and a uniform scatter of delta values around zero indicate good predictions.
  • Common confusion: reproducibility vs. accuracy—similar results across runs mean the model is consistent, not necessarily good; it's easier to spot a bad model than to confirm a good one.
  • Why randomness matters: without setting a random seed, different train/test splits produce different confidence intervals, so running multiple experiments reveals model stability.

🔄 The training and testing workflow

🔄 Partitioning the data

  • The excerpt describes splitting a dataset (int00.dat) into two randomly selected portions: train.dat and test.dat.
  • The overall flow: inputs and outputs from the original dataset are divided, the model is trained on train.dat, and predictions are tested against test.dat.
  • Example: if you have 100 processors, you might randomly assign 50 to training and 50 to testing.

🏋️ Training the model

Training the regression model: the process of computing the model's coefficients using the training data.

  • The excerpt uses the lm() function with predictors identified earlier (clock, cores, voltage, channel, L1icache, sqrt(L1icache), L1dcache, sqrt(L1dcache), L2cache, sqrt(L2cache)) and the train.dat data frame.
  • The result is stored in int00_new.lm.
  • This step fits the model to the training portion only.

🔮 Testing the model

  • The predict() function takes the trained model (int00_new.lm) and applies it to the test data (test.dat).
  • It computes predicted outputs for each processor in the test set.
  • The predictions are stored in predicted.dat.
  • This step evaluates whether the model generalizes to data it has not seen during training.

📏 Measuring prediction quality

📏 The delta metric

Delta for processor i: the difference between the predicted and measured performance, defined as delta_i = Predicted_i − Measured_i.

  • The excerpt computes a vector of delta values: delta <- predicted.dat - test.dat$nperf.
  • Interpretation:
    • delta = 0 → perfect prediction.
    • delta > 0 → model overpredicted (predicted performance higher than actual).
    • delta < 0 → model underpredicted (predicted performance lower than actual).
  • Example: if the model predicts performance of 80 but the actual measured performance is 75, delta = 5.

📊 Mean delta and confidence intervals

  • The mean of delta values across n processors is calculated as: mean delta = (1/n) × sum of all delta_i.
  • A confidence interval for this mean indicates how well the model predicted the test set.
  • The excerpt uses t.test(delta, conf.level = 0.95) to compute a 95% confidence interval.
  • In the example: the interval is [-3.08, 0.97].
    • This interval includes zero, suggesting the model does not systematically over- or underpredict.
    • Given that nperf is scaled 0–100, this is a "reasonably tight" interval.
  • What to look for: a tight confidence interval around zero indicates good predictions.

📉 Scatter plot of delta

  • The excerpt generates a scatter plot of delta values using plot(delta).
  • What good predictions look like: a tight band of values uniformly scattered around zero.
  • In the example (Figure 5.2): most values cluster near zero, but a few outliers are more than ten points above or below zero.
  • This visual check complements the confidence interval by revealing the distribution and any extreme mispredictions.

🎲 Variability and reproducibility

🎲 The role of random partitioning

  • The sample() function returns a different random permutation each time unless a random seed is set with set.seed(1234).
  • Different partitions assign different processors to train and test sets, leading to different confidence intervals and scatter plots.
  • Example from the excerpt: running the same test five times without a seed produced intervals [-1.94, 1.46], [-1.95, 2.68], [-2.66, 3.81], [-6.13, 0.75], [-4.21, 5.29].
  • Changing the fraction assigned to train vs. test (e.g., f = 0.5) also changes results.
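
A sketch of this repeated experiment; it reuses the split-train-test steps from Section 5.1 (no seed is set, so each pass produces a different partition and a different interval):

for (i in 1:5) {
  rows        <- nrow(int00.dat)
  shuffled    <- int00.dat[sample(rows), ]
  upper_bound <- floor(0.5 * rows)
  train.dat   <- shuffled[1:upper_bound, ]
  test.dat    <- shuffled[(upper_bound + 1):rows, ]
  m     <- lm(nperf ~ clock + cores + voltage + channel + L1icache + sqrt(L1icache) +
                L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data = train.dat)
  delta <- predict(m, newdata = test.dat) - test.dat$nperf
  print(t.test(delta, conf.level = 0.95)$conf.int)   # one 95% interval per random split
}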

🔍 Interpreting variability


  • Results vary wildly across runs → the model is unstable or sensitive to the data split → good reason for concern.
  • Results are similar across runs → the model is consistently reproducible → this does not guarantee the model is good, only that it is consistent.
  • Don't confuse: reproducibility (similar results each time) with accuracy (correct predictions).
  • The excerpt warns: "It is often easier to spot a bad model than to determine that a model is good."

✅ Best practice

  • Run the training-testing experiment several times and observe how results change.
  • If confidence intervals remain reasonably tight and scatter plots show consistent patterns, the model is likely stable.
  • In the example, repeated intervals and the scatter plot lead to the conclusion: "this model is reasonably good at predicting the performance of a set of processors when the model is trained on a different set of processors executing the same benchmark program."

🎯 Conclusion from the example

🎯 Model assessment

  • The 95% confidence interval [-3.08, 0.97] includes zero and is reasonably tight relative to the 0–100 scale of nperf.
  • The scatter plot shows most delta values near zero, with a few outliers.
  • Repeated experiments yield similar confidence intervals.
  • Overall judgment: the model is "reasonably good" but "not perfect."
  • Whether the differences are large enough to warrant concern depends on the user's requirements.

🎯 What "reasonably good" means

  • The model does not systematically over- or underpredict (interval includes zero).
  • Most predictions are close to actual values (tight scatter around zero).
  • The model generalizes to unseen processors executing the same benchmark.
  • Example: if you train on 50 processors and test on 50 different processors, the model's predictions are reliable enough for practical use, though not flawless.
21

Predicting Across Data Sets

5.3 Predicting Across Data Sets

🧭 Overview

🧠 One-sentence thesis

A regression model trained on one data set can be tested by predicting outcomes on different data sets, but models have limits and may fail when applied to data with different underlying factors.

📌 Key points (3–5)

  • Cross-dataset prediction: after training a model on one data set, you can test it by predicting outcomes on entirely different data sets (not just splitting the same data).
  • How to evaluate: use confidence intervals and scatter plots of differences (predicted minus actual) to assess prediction quality; zero should be in the confidence interval for a good model.
  • Common confusion: a confidence interval containing zero suggests reasonable predictions, but individual scatter plot points may still show large errors—always examine both statistics and visualizations.
  • Models have limits: a model trained on one version of data may fail completely on future or fundamentally different data if new factors emerge that the model doesn't capture.

🔬 Testing with different benchmarks

🔬 Using a related data set

The excerpt demonstrates training a model on Int2000 benchmark data, then using it to predict Fp2000 benchmark performance:

  • Train the regression model (int00.lm) using all the Int2000 data.
  • Apply the trained model to predict Fp2000 results using predict() with newdata=fp00.dat.
  • Calculate differences: delta = predicted - actual.
  • Run a t-test to get a 95% confidence interval.
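
A sketch of the last three steps plus the scatter plot, with object names from the excerpt; int00.lm is assumed to already be trained on the full Int2000 data, and fp00.dat to hold the Fp2000 data with matching column names:

predicted.dat <- predict(int00.lm, newdata = fp00.dat)   # apply the Int2000 model to Fp2000 systems
delta <- predicted.dat - fp00.dat$nperf                  # predicted minus actual
t.test(delta, conf.level = 0.95)                         # does the 95% interval contain zero?
plot(delta)                                              # errors should scatter randomly around zero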

✅ Signs of reasonable prediction

When predicting Fp2000 from the Int2000 model:

  • The confidence interval contains zero: [-0.45, 3.41].
  • The interval is relatively small.
  • The scatter plot shows values "randomly distributed around zero."

What this means: the model captures enough common factors between Int2000 and Fp2000 to make useful predictions.

⚠️ Hidden problems in seemingly good results

Even when the confidence interval looks good:

  • Individual predictions can be quite wrong.
  • The excerpt notes maximum positive deviation of almost 20 and negative deviation greater than 43.
  • Don't confuse: overall statistical measures (confidence interval) with individual prediction accuracy—both matter.

Example: The confidence interval suggests "relatively good results," but the scatter plot reveals "not all the values are well predicted."

🚫 When models fail

🚫 Predicting future versions

The excerpt shows a clear failure case: using the Int2000 model to predict Int2006 benchmark results.

What went wrong:

  • Confidence interval: [48.87, 52.94] — does not contain zero.
  • Mean difference: 50.9096 — predicted values are systematically much larger than actual values.
  • Scatter plot: all predicted values are "much larger than the actual values."

🧩 Why models have limits

"Models have their limits."

The excerpt explains:

  • More factors affect the next generation (Int2006) than the model captures.
  • The model was built on Int2000 data and doesn't account for changes in newer processor generations.
  • To predict future performance better, you would need to "uncover those factors" through deeper domain understanding.

Key lesson: A model is only as good as the factors it includes; when the underlying system changes, the model may become obsolete.

📊 Evaluation workflow

📊 Standard testing procedure

The excerpt describes a consistent workflow for cross-dataset prediction:

  1. Train: use lm() on the source data set → builds the regression model.
  2. Predict: use predict() with newdata= → applies the model to the new data.
  3. Calculate delta: predicted − actual → measures the prediction errors.
  4. Test: t.test(delta, conf.level=0.95) → gives the confidence interval.
  5. Visualize: scatter plot of delta vs. index → examines the error distribution.

🔍 What to look for

Good model indicators:

  • Confidence interval includes zero.
  • Scatter plot shows random distribution around zero.
  • No systematic bias (all positive or all negative errors).

Bad model indicators:

  • Confidence interval far from zero.
  • Systematic bias in one direction.
  • Large individual deviations even if statistics look acceptable.

Example: "It is often easier to spot a bad model than to determine that a model is good."

22

Reading CSV files

6.1 Reading CSV files

🧭 Overview

🧠 One-sentence thesis

Reading CSV files into R is conceptually simple but often requires custom functions to extract, filter, and organize messy real-world data into the format needed for regression modeling.

📌 Key points (3–5)

  • Basic CSV reading: R's read.csv() function directly loads comma-separated data into a data frame with rows and columns.
  • Data cleaning challenge: Real data often arrives messy (missing fields, inconsistent formats), and cleaning it is heavily data-dependent and beyond simple input functions.
  • Custom extraction functions: When data contains multiple benchmarks or subsets, you define your own functions to filter rows, select columns, and assemble new data frames.
  • Common confusion: Don't confuse is.na() (which finds missing values) with its complement !is.na() (which finds present values)—the latter is used to filter out incomplete records.
  • Why it matters: Getting data into the right format is often one of the most difficult aspects of model development, despite R's powerful modeling capabilities.

📂 Basic CSV import

📂 The read.csv() function

Comma separated values (CSV): a de facto standard format for exchanging data among computer systems, where each line is one record and commas separate fields.

  • R provides read.csv() to load a CSV file directly into a data frame.
  • Syntax: processors <- read.csv("all-data.csv")
    • Each file line becomes one row.
    • Each comma-separated field becomes one column.
  • After loading, the variable processors holds all the data organized by rows and columns.

🔍 Inspecting the data frame

  • Typing the variable name alone (e.g., processors) prints a long, confusing list.
  • Use head(processors) to see column headings and the first six rows.
  • This preview helps you determine which columns to extract for modeling.

Don't confuse: The data frame itself vs. a preview—head() shows only the top rows, not the entire dataset.
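
A short sketch of these first two steps, using the file name from the excerpt (the columns you see will depend on your own file):

# Read the CSV file into a data frame: one row per line, one column per field
processors <- read.csv("all-data.csv")

# Preview the column headings and the first six rows
head(processors)

# Optional quick checks on the overall shape of the data
dim(processors)    # number of rows and columns
names(processors)  # all column names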

🛠️ Custom data extraction

🛠️ Why custom functions are needed

  • Real data is often messy: missing fields, incorrectly recorded values, inconsistent formats.
  • Data cleaning: the process of getting data into the format necessary for analysis and modeling.
  • The excerpt states that specific cleaning steps are "heavily dependent on the data set and are thus beyond the scope of this tutorial."
  • Custom functions let you perform a sequence of operations multiple times on different data pieces.

🔧 Defining functions in R

General format:

function-name <- function(a1, a2, ...) {
  R expressions
  return(object)
}
  • function-name: the name you choose.
  • a1, a2, ...: the list of arguments.
  • The body contains R expressions evaluated when the function is called.
  • A function can return any type of data object using return().
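
For instance, a small hypothetical function following this template, which rescales a numeric vector to a 0-100 range (this helper is only an illustration, not part of the excerpt):

normalize <- function(x) {
  rng <- max(x) - min(x)              # range of the input values
  scaled <- 100 * (x - min(x)) / rng  # rescale to 0-100
  return(scaled)
}

# Example call
normalize(c(2, 5, 11))   # returns 0, 33.33, 100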

🎯 The extract_data() function

🎯 Purpose and usage

  • Extracts all rows that have a result for a given benchmark program from the processors data frame.
  • Example calls:
    • int92.dat <- extract_data("Int1992")
    • fp92.dat <- extract_data("Fp1992")
    • Each call filters rows with results for that benchmark and assigns them to a new data frame.

🧩 How extract_data() works

  1. Build the column name: Uses nested paste() functions to concatenate strings.
    • Example: extract_data("Int2000") produces "SpecInt2000..average.base.", the column name containing performance results.
  2. Extract performance data: Calls get_column(benchmark, temp) to select rows with the desired column.
  3. Normalize performance: Computes nperf (normalized performance) from the raw perf values:
    • Find max and min performance.
    • Compute range: max_perf - min_perf.
    • Normalize: nperf = 100 * (perf - min_perf) / range.
  4. Extract predictors: A sequence of get_column() calls extracts data for each predictor (clock speed, threads, cores, TDP, transistors, die size, voltage, feature size, channel length, FO4 delay, cache sizes).
  5. Assemble and return: The data.frame() function assembles all arguments into a single data frame, which extract_data() returns.

Don't confuse: The raw performance value (perf) with the normalized value (nperf)—the latter scales all results to a 0–100 range for easier comparison.
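
A condensed sketch of an extract_data()-style function is shown below. It assumes a get_column() helper like the one described in the next subsection; the predictor column names are placeholders, since the real names depend on the CSV file, and only a subset of the predictors listed above is shown:

extract_data <- function(benchmark) {
  # Build the performance column name, e.g. "SpecInt2000..average.base."
  temp <- paste(paste("Spec", benchmark, sep = ""), "..average.base.", sep = "")

  # Raw performance values for systems that report this benchmark
  perf <- get_column(benchmark, temp)

  # Normalize performance to a 0-100 scale
  max_perf <- max(perf)
  min_perf <- min(perf)
  range    <- max_perf - min_perf
  nperf    <- 100 * (perf - min_perf) / range

  # Extract predictors the same way (illustrative column names, subset only)
  clock       <- get_column(benchmark, "Clock..MHz.")
  transistors <- get_column(benchmark, "Transistors..millions.")

  # Assemble everything into one data frame and return it
  return(data.frame(nperf, perf, clock, transistors))
}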

🔎 The get_column() function

🔎 Purpose

  • Returns all the data in a given column for which the given benchmark program has been defined (i.e., filters out rows with missing benchmark results).

🧩 How get_column() works

  • Arguments: x (benchmark name string) and y (desired column name string).
  • Build the benchmark column name: Nested paste() functions produce the same result as in extract_data().
  • Find non-missing rows: is.na(processors[, benchmark]) returns a logical vector that is TRUE for rows whose benchmark column is NA (missing) and FALSE for rows with actual values.
  • Complement the result: the exclamation point in !is.na(...) negates that vector, so ix identifies every row that has a performance result for the indicated benchmark.
  • Extract and return: The function extracts the selected rows from the processors data frame (using ix as a filter) and returns them.

Common confusion: is.na() finds missing values, but !is.na() finds present values—the latter is what you need to filter out incomplete records.
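
A sketch of get_column() along these lines, assuming the processors data frame loaded earlier (the column-name construction mirrors extract_data()):

get_column <- function(x, y) {
  # x = benchmark name, y = name of the column to extract
  benchmark <- paste(paste("Spec", x, sep = ""), "..average.base.", sep = "")

  # TRUE for every row that has a result for this benchmark, FALSE where it is NA
  ix <- !is.na(processors[, benchmark])

  # Return the requested column, restricted to those rows
  return(processors[ix, y])
}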

⚠️ Practical considerations

⚠️ Data-dependent complexity

  • The excerpt emphasizes that these extraction functions "can be somewhat tricky to write, because they depend so much on the specific format of your input file."
  • The functions presented are a guide; you will need to adapt them to your own data format.

⚠️ Execution vs. concept

  • The excerpt notes that reading data is "conceptually a simple problem" but "the execution can be rather messy, depending on how the data was collected and organized in the file."
  • Example: The nested paste() calls and !is.na() filtering are necessary because the data contains multiple benchmarks and missing values.

📋 Summary of the workflow

Step | Function/Tool | What it does
1. Load CSV | read.csv() | Reads the file into a data frame
2. Preview | head() | Shows column names and the first six rows
3. Filter rows | get_column() | Extracts rows with non-missing benchmark results
4. Assemble subset | extract_data() | Builds a new data frame with selected predictors and normalized performance

Summary: Linear Regression Modeling Workflow in R

7 Summary

🧭 Overview

🧠 One-sentence thesis

Linear regression modeling in R follows a seven-step workflow—from reading and sanity-checking data through visualization, predictor selection, validation, and prediction—but the resulting model is only an approximation of the real system and must be used with caution.

📌 Key points (3–5)

  • Seven-step workflow: read data → sanity check → visualize → identify potential predictors → select predictors → validate → predict.
  • Data preparation is tricky: reading data into R can be one of the hardest tasks because you may not control the format; custom parsing functions are often needed.
  • Model validation uses multiple checks: R² and adjusted-R², residual analysis, and training/testing splits all help assess model quality.
  • Common confusion: train/test splitting must happen first—don't fit a model on all data before deciding to split.
  • Key limitation: the model is only an approximation of the real underlying system, not the system itself.

📥 Data preparation steps

📥 Reading data into R

  • Reading data can be one of the trickiest tasks because you may not have controlled how data was collected or formatted.
  • Be prepared to write custom functions to parse your data and load it into an R data frame.
  • The excerpt references Chapter 6 as an example of reading a moderately complicated CSV file.

🔍 Sanity checking your data

Sanity check: perform checks to ensure nothing is obviously wrong with the data.

Types of checks to perform (depend on your data specifics):

  • Find minimum, maximum, average, and standard deviation in each data frame column.
  • Look for parameter values that seem suspiciously outside expected limits.
  • Determine the fraction of missing (NA) values in each column to ensure sufficient data is available.
  • Determine the frequency of categorical parameters to see if any unexpected values appear.
  • Perform any other data-specific tests.

Goal: feel confident that your data set's values are reasonable and consistent.
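
A few checks of this kind might look as follows, assuming a data frame such as int00.dat with a numeric perf column and a categorical vendor column (the names are illustrative):

# Minimum, maximum, mean, and quartiles for every column
summary(int00.dat)

# Standard deviation of one numeric column
sd(int00.dat$perf)

# Fraction of missing (NA) values in each column
colMeans(is.na(int00.dat))

# Frequency of each level of a categorical column
table(int00.dat$vendor)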

📊 Visualizing your data

  • Always plot your data to get a basic sense of its shape and ensure nothing looks out of place.
  • Example: you may expect a somewhat linear relationship between two parameters; if you see something else (e.g., a horizontal line), investigate further.
  • The pairs() function is useful for performing this quick visual check.
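
For example, a quick pairwise check of a few columns (the data frame and column names are illustrative):

# Scatter plots of every pair of the selected columns
pairs(int00.dat[, c("nperf", "clock", "cores", "TDP")])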

Why visualization matters:

  • Your assumption about a linear relationship could be wrong.
  • The data may be corrupted.
  • Something completely unexpected may be happening.
  • You must understand what might be happening before developing the model.

🎯 Predictor identification and selection

🎯 Identifying potential predictors

Before beginning backward elimination, identify the set of all possible predictors that could go into your model.

Simplest case: all available columns in your data frame.

Reasons to eliminate columns before modeling:

  • A column containing only a few valid entries is probably not useful.
  • Your knowledge of the system may give good reason to eliminate a parameter (example: TDP was eliminated in Section 4.2).
  • You may want to include non-linear functions of parameters as possible predictors (example: square root of cache size terms).

Handling observations with missing values:

  • Consider removing observations that only have values for a few predictors.
  • An observation with missing values is useful only for building a model whose predictors are those for which that observation has values.

🔽 Selecting the predictors

  • Use the backward elimination process (described in Section 4.3).
  • Select predictors based on the p-value threshold you decide to use.
  • This step narrows down the potential predictors to those included in the final model.
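
A sketch of one backward-elimination pass, assuming int00.dat and a starting formula that contains all candidate predictors (the predictor names are placeholders for whatever columns your data frame provides):

# Start with all candidate predictors
int00.lm <- lm(nperf ~ clock + threads + cores + transistors + dieSize +
               voltage + featureSize + channel + FO4delay +
               L1icache + sqrt(L1icache) + L2cache + sqrt(L2cache),
               data = int00.dat)

# Inspect the p-values of the coefficients
summary(int00.lm)

# Drop the predictor with the largest p-value above your chosen threshold
# (say featureSize), refit, and repeat until every remaining p-value is
# below the threshold
int00.lm <- update(int00.lm, . ~ . - featureSize)
summary(int00.lm)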

✅ Model validation and use

✅ Validating the model

Multiple validation approaches:

Validation method | What to examine
R² values | Examine both R² and adjusted-R² values
Residual analysis | Further examine the model's quality
Training/testing split | See how well your model predicts values from the test set

Critical timing rule: If you intend to split your data into training and testing sets, do that first—don't fit a model using all of the data before deciding to train and test.

Don't confuse: fitting a model on all data and then splitting vs. splitting first and then fitting only on training data.
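
A minimal sketch of splitting first and only then fitting, assuming int00.dat, a training fraction f, and an illustrative predictor formula:

f    <- 0.5                                 # fraction of rows used for training
rows <- nrow(int00.dat)

# Choose the training rows at random before any model is fit
train_idx <- sample(1:rows, size = floor(f * rows))
train.dat <- int00.dat[train_idx, ]
test.dat  <- int00.dat[-train_idx, ]

# Fit only on the training set, then test on the held-out rows
train.lm <- lm(nperf ~ clock + cores + L2cache, data = train.dat)
delta    <- predict(train.lm, newdata = test.dat) - test.dat$nperf
t.test(delta, conf.level = 0.95)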

🔮 Predicting with the model

  • Once you have a model that appropriately explains your data, you can use it to predict previously unknown output values.
  • This is the final step after validation confirms the model's quality.

⚠️ Important limitations

⚠️ Models are approximations

What you have developed is only a model.

What a model can do:

  • Ideally, it is a useful tool for explaining variations in your measured data.
  • It helps understand relationships between inputs and output.

What a model cannot do:

  • It is only an approximation of the real underlying system.
  • It is limited in what it can tell us about that system.

Conclusion: Proceed with caution.

📚 Further learning resources

The excerpt mentions several categories of resources:

  • Books on R as a programming language.
  • Books that focus on specific statistical ideas and use R as the computational language.
  • A book on computer performance measurement.

A Few Things to Try Next

8 A Few Things to Try Next

🧭 Overview

🧠 One-sentence thesis

This chapter provides a structured set of exercises to practice and deepen regression modeling skills using R, progressing from data cleaning through simple and multiple linear regression to cross-dataset prediction evaluation.

📌 Key points (3–5)

  • Exercise progression: starts with data cleaning and visualization, moves through simple linear regression (SLR), then advances to multiple linear regression (MLR) and predictive modeling.
  • Model evaluation focus: exercises emphasize assessing model quality through residuals, p-values, residual standard errors, R² values, F-statistics, and residual analysis.
  • Cross-prediction testing: a key task is using models trained on one benchmark/year to predict performance on other benchmarks/years, revealing model generalization limits.
  • Data splitting exploration: exercises include experimenting with training set fraction (f) to find optimal splits and observing variability across repeated tests.
  • Common confusion: the excerpt distinguishes forward-in-time prediction (reasonable) from backward prediction (not sensible)—models should predict future or concurrent data, not past data.

🧹 Data preparation and exploration

🧹 Cleaning benchmark datasets

The excerpt suggests cleaning selected benchmark results (Int1992, Int1995, etc.) by:

  • Computing summary statistics for every column: average, variance, minimum, and maximum.
  • Sorting column data to identify outliers or unusual patterns.
  • Determining the fraction of NA (missing) values for each column.
  • The exercise asks "How else could you verify that the data looks reasonable?" encouraging exploration beyond these basic checks.

📊 Visualizing processor performance

  • Plot processor performance versus clock frequency for each benchmark result, similar to a referenced figure (Figure 3.1).
  • This visualization helps identify relationships and patterns before formal modeling.

🔢 Simple linear regression (SLR) exercises

🔢 Building SLR models

  • Develop a simple linear regression model for all benchmark results.
  • The exercise asks: "What input variable should you use as the predictor?" prompting thoughtful variable selection.

📈 Visualizing SLR fits

  • Superimpose SLR models on corresponding scatter plots of the data (referencing Figure 3.2).
  • This step helps visually assess how well the linear model captures the data pattern.

🔍 Evaluating SLR quality

The excerpt lists multiple evaluation criteria:

  • Residuals: examine the differences between observed and predicted values.
  • p-values of coefficients: assess statistical significance of predictors.
  • Residual standard errors: measure typical prediction error magnitude.
  • R² values: quantify proportion of variance explained by the model.
  • F-statistic: test overall model significance.
  • Residual analysis: perform appropriate diagnostic checks (e.g., checking assumptions).

🔀 Multiple linear regression (MLR) exercises

🔀 Developing MLR models

  • Generate pair-wise comparison plots for each benchmark result (referencing Figure 4.1) to explore relationships between multiple variables.
  • Develop a multiple linear regression model for each benchmark result.
  • Key questions: "Which predictors are the same and which are different across these models? What other similarities and differences do you see across these models?"

🔍 Evaluating MLR quality

The same evaluation criteria apply as for SLR:

  • Residuals, p-values, residual standard errors, R² values, F-statistic, and residual analysis.
  • This parallel structure emphasizes consistent model assessment across different modeling approaches.

🎯 Cross-prediction and model generalization

🎯 Forward-in-time prediction tables

The excerpt provides two tables (one for integer benchmarks Int1992–Int2006, one for floating-point benchmarks Fp1992–Fp2006) to be filled with prediction results:

  • Each cell should contain: x (± y) where x = mean of delta values (prediction errors) and y = width of the 95% confidence interval.
  • Critical constraint: "You need only predict forwards in time."

Example: Using a model developed with Int1992 data to predict Int2006 results is reasonable, but using Int2006 data to predict Int1992 results "does not make sense."
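
One way to compute a table cell, assuming delta already holds the prediction errors for a given (training benchmark, predicted benchmark) pair:

t <- t.test(delta, conf.level = 0.95)

x <- mean(delta)                      # mean prediction error
y <- t$conf.int[2] - t$conf.int[1]    # width of the 95% confidence interval

# Report the cell as "x (± y)"
cat(sprintf("%.2f (+/- %.2f)\n", x, y))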

🧪 Assessing predictive ability

The exercise asks what can be said about models' predictive abilities:

  • Same-year cross-type prediction: How well does an integer benchmark model predict floating-point benchmark performance in the same year?
  • Cross-generation prediction: How do predictions perform across different benchmark generations (e.g., 1992 → 2000)?
  • This reveals whether models capture generalizable relationships or are overfitted to specific datasets.

🔄 Data splitting experiments

🔄 Exploring training set fraction

f: the fraction of the complete data set used in the training set.

The exercise instructs:

  • For the Fp2000 dataset, plot a 95% confidence interval for the mean of delta for f = [0.1, 0.2, ..., 0.9].
  • Identify which value of f gives the best result (smallest confidence interval).
  • Repeat the test n = 5 times to observe how the best f value changes, revealing variability in optimal splits.
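
A sketch of this experiment for one data set, assuming fp00.dat with an nperf column and an illustrative predictor formula:

fs    <- seq(0.1, 0.9, by = 0.1)
lower <- numeric(length(fs))
upper <- numeric(length(fs))

for (i in seq_along(fs)) {
  rows      <- nrow(fp00.dat)
  train_idx <- sample(1:rows, size = floor(fs[i] * rows))
  train.dat <- fp00.dat[train_idx, ]
  test.dat  <- fp00.dat[-train_idx, ]

  fit   <- lm(nperf ~ clock + cores + L2cache, data = train.dat)
  delta <- predict(fit, newdata = test.dat) - test.dat$nperf

  ci <- t.test(delta, conf.level = 0.95)$conf.int
  lower[i] <- ci[1]
  upper[i] <- ci[2]
}

# The best f gives the narrowest interval around zero
plot(fs, upper, type = "b", ylim = range(c(lower, upper)),
     xlab = "Training fraction f", ylab = "95% CI for mean delta")
lines(fs, lower, type = "b")
abline(h = 0)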

🔁 Repeating across datasets

  • Repeat the f-variation experiment for all other datasets.
  • This systematic exploration helps understand how training set size affects prediction quality across different benchmark types and years.