PSTAT100: Data Science — Concepts and Analysis
May 23, 2026
Most real phenomena are driven by multiple factors simultaneously.
We wish to capture the multivariate relationships.
Omitting relevant predictors leads to the SLR slope absorbing the effect of missing variables that are correlated with X.
The SLR residual plot for body mass vs. flipper length showed systematic banding — a strong hint that species was acting as an omitted variable.
Multiple Linear Regression Model
For response y with p predictors x_1, \ldots, x_p, the MLR model is defined as y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, where \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2) (under the Gauss-Markov assumptions).
Example of observations, predictors and response for MLR.
In SLR, \hat{\beta}_1 captures the total marginal relationship between X and Y.
In MLR, each slope is a partial slope:
To simplify notation, we will express the MLR model in matrix form. We define the following:
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix},\quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}. \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
Then we can write our model as
\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
where we have that \boldsymbol{\varepsilon}\sim\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n).
Note here that \boldsymbol{X} has p+1 columns (including the intercept) and n rows (observations).
This matrix formulation is more compact and allows us to derive the OLS estimator using linear algebra.
| Assumption | Matrix statement |
|---|---|
| Linearity | \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0} |
| Independence + Homoscedasticity | \text{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n |
| Normality | \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{I}_n) |
| Full rank | \text{rank}(\mathbf{X}) = p + 1 (columns linearly independent) |
# Set seed and define true parameters
rng = np.random.default_rng(42)
n, p = 150, 3
beta0_true = 3.0 # intercept
beta_true = np.array([2.0, -1.5, 0.8]) # slopes
sigma_true = 2.0 # noise level
# Define the design matrix and response vector
X_pred = rng.normal(0, 1, size=(n, p))
X_design = np.column_stack([np.ones(n), X_pred])
y = X_design @ np.r_[beta0_true, beta_true] + rng.normal(0, sigma_true, size=n)
# Plot the first 10 rows of the design matrix
fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(X_design[:10], aspect='auto', cmap='Blues')
ax.set_xticks(range(p + 1))
ax.set_xticklabels(['$\\mathbf{1}$', '$x_1$', '$x_2$', '$x_3$'], fontsize=13)
ax.set_yticks(range(10))
ax.set_yticklabels([f'$i = {i+1}$' for i in range(10)])
ax.set_title('First 10 rows of the design matrix $\\mathbf{X}$ (n=150, p=3)')
plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout(); plt.show()plt.figure(figsize=(8, 5))
scatter = plt.scatter(X_pred[:, 0], y, c=X_pred[:, 1], s=100 * np.abs(X_pred[:, 2]),
cmap='viridis', alpha=0.7, edgecolor='k')
plt.xlabel('$X_1$', fontsize=12)
plt.ylabel('$Y$', fontsize=12)
plt.title('Scatterplot of $Y$ vs. $X_1$ (coloured by $X_2$, sized by $X_3$)', fontsize=13)
cbar = plt.colorbar(scatter)
cbar.set_label('$X_2$ value', fontsize=12)
plt.tight_layout(); plt.show() $X_1$ $X_2$ $X_3$ $Y$
$X_1$ 1.000 0.055 -0.009 0.560
$X_2$ 0.055 1.000 0.118 -0.432
$X_3$ -0.009 0.118 1.000 0.147
$Y$ 0.560 -0.432 0.147 1.000
Once again we will estimate \boldsymbol{\beta} using ordinary least squares (OLS). The OLS estimator is the value of \boldsymbol{\beta} that minimises the residual sum of squares:
\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}).
Expanding the objective we get that
\text{RSS}(\boldsymbol{\beta}) = \mathbf{Y}^\top\mathbf{Y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{Y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}.
Taking the matrix gradient with respect to \beta and setting it to zero gives the normal equations:
\mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{Y}.
Provided \mathbf{X}^\top\mathbf{X} is invertible (i.e. \mathbf{X} has full column rank), the unique solution is:
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}.
This derivation is beyond the scope of the course but we see again that our parameter estimates are given explicitly.
When p = 1 and \mathbf{X} = [\mathbf{1},\, \mathbf{x}], we have:
\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}, \quad \mathbf{X}^\top\mathbf{Y} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}.
Computing \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} and simplifying (using S_{xx} = \sum(x_i - \bar{x})^2), we recover:
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.
The fitted values are:
\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} = \mathbf{H}\mathbf{Y},
where \mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top
Since the residuals can be written as
\boldsymbol{e} = \boldsymbol{Y} - \hat{\boldsymbol{Y}} = (\boldsymbol{I}-\boldsymbol{H})\boldsymbol{Y}.
the variance of the residuals is
\text{Var}(\boldsymbol{e}) = \sigma^2(\boldsymbol{I}-\boldsymbol{H}).
# Solve the normal equations (numerically stable — avoids forming the explicit inverse)
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
# Compare to the true values
beta_all_true = np.r_[beta0_true, beta_true]
print("Parameter | True | Estimated")
print("-" * 36)
for j, (tv, ev) in enumerate(zip(beta_all_true, beta_hat)):
print(f" β_{j} | {tv:5.2f} | {ev:8.3f}")
# Plot
labels = [f'$\\hat{{\\beta}}_{j}$' for j in range(p + 1)]
x_pos = np.arange(p + 1)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x_pos - 0.18, beta_all_true, width=0.35,
label='True $\\boldsymbol{\\beta}$', color='crimson', alpha=0.75)
ax.bar(x_pos + 0.18, beta_hat, width=0.35,
label='OLS $\\hat{\\boldsymbol{\\beta}}$', color='steelblue', alpha=0.75)
ax.set_xticks(x_pos); ax.set_xticklabels(labels, fontsize=12)
ax.set_ylabel('Coefficient value')
ax.set_title('OLS Estimates vs. True Parameters')
ax.legend(); plt.tight_layout(); plt.show()Parameter | True | Estimated
------------------------------------
β_0 | 3.00 | 2.896
β_1 | 2.00 | 1.881
β_2 | -1.50 | -1.626
β_3 | 0.80 | 0.718
Substituting \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} into the OLS formula:
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}.
Taking expectations (using \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0} and treating \mathbf{X} as fixed):
\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}. \quad \checkmark
Collinearity
Collinearity refers to the case in which two or more predictors are correlated (related).
We wish to compute \text{Cov}(\hat{\boldsymbol{\beta}}). We have that
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} = \boldsymbol{C}\boldsymbol{Y}.
Applying the linear transformation rule for variance we have
\text{Var}(\hat{\boldsymbol{\beta}}) = C\text{Var}(\sigma^2\boldsymbol{I})C^\top = \sigma^2 CC^\top = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}.
| Factor | Effect on \text{Var}(\hat{\boldsymbol{\beta}}) |
|---|---|
| More observations (↑ n) | Smaller — more information |
| Less noise (↓ \sigma^2) | Smaller — cleaner signal |
| More spread in predictors | Smaller — better leverage |
| High multicollinearity | Larger — (\mathbf{X}^\top\mathbf{X})^{-1} blows up |
Gauss-Markov Theorem (Matrix Form)
Under GM1–GM3 (normality is not required), the OLS estimator \hat{\boldsymbol{\beta}} is BLUE:
For any other linear unbiased estimator \tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{Y}, \text{Var}(\tilde{\beta}_j) \geq \text{Var}(\hat{\beta}_j) \quad \text{for all } j.
Fitting p + 1 parameters costs p + 1 degrees of freedom. The unbiased estimator is:
\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1}.
We divide by n - p - 1, not n - 2 as in SLR.
rng2 = np.random.default_rng(7)
beta_hat_sims = []
for _ in range(2000):
y_s = X_design @ np.r_[beta0_true, beta_true] + rng2.normal(0, sigma_true, size=n)
bh_s = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_s)
beta_hat_sims.append(bh_s)
beta_hat_sims = np.array(beta_hat_sims) # shape (2000, 4)
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
for j in range(3):
lab = rf'$\hat{{\beta}}_{j+1}$'
tv = beta_true[j]
sns.histplot(beta_hat_sims[:, j + 1], bins=40, kde=True,
color='steelblue', ax=axes[j])
axes[j].axvline(tv, color='crimson', lw=2.5, linestyle='--',
label=f'True = {tv}')
axes[j].axvline(beta_hat_sims[:, j+1].mean(), color='navy', lw=2, linestyle=':',
label=f'Mean = {beta_hat_sims[:, j+1].mean():.3f}')
axes[j].set_xlabel(lab)
axes[j].set_title(f'Sampling distribution of {lab}')
axes[j].legend(fontsize=8)
plt.suptitle('Unbiasedness of MLR OLS Estimators (2 000 simulations)',
fontsize=13, y=1.01)
plt.tight_layout(); plt.show()Under GM1–GM4 (including normality):
T_j = \frac{\hat{\beta}_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n - p - 1}.
y_hat = X_design @ beta_hat
RSS = np.sum((y - y_hat)**2)
sigma_h = np.sqrt(RSS / (n - p - 1))
XtX_inv = np.linalg.inv(X_design.T @ X_design)
SE = sigma_h * np.sqrt(np.diag(XtX_inv))
print(f"{'':5} {'β̂':>8} {'SE':>8} {'t':>7} {'p-val':>10}")
for j in range(p + 1):
tv = beta_hat[j] / SE[j]
pv = 2 * t_dist.sf(abs(tv), df=n - p - 1)
print(f"β_{j} {beta_hat[j]:>8.3f} {SE[j]:>8.4f}"
f" {tv:>7.3f} {pv:>10.3e}") β̂ SE t p-val
β_0 2.896 0.1707 16.967 2.381e-36
β_1 1.881 0.1732 10.859 1.666e-20
β_2 -1.626 0.1813 -8.971 1.314e-15
β_3 0.718 0.1865 3.849 1.769e-04
Tests whether any predictor is useful:
H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0.
Overall F-Statistic
F = \frac{\text{SSR}/p}{\text{SSE}/(n - p - 1)} = \frac{R^2/p}{(1 - R^2)/(n - p - 1)} \sim F_{p,\, n-p-1} \quad \text{under } H_0.
R² = 0.5722
F-stat = 65.09
p-value = 8.88e-27
Model selection
Model selection is the application of a principled method to determine the complexity of the model, e.g. choosing a subset of predictors, choosing the degree of the polynomial model etc.
Overfitting
Overfitting is the phenomenon where the model is unnecessarily complex, in the sense that portions of the model captures the random noise in the observation, rather than the relationship between predictor(s) and response.
Overfitting visualization.
R^2 Always Increases
Adding a predictor to a model never decreases R^2, even if the predictor is pure noise.
rng3 = np.random.default_rng(99)
R2_vals = []
Xk_full = rng3.normal(0, 1, (n, 25))
for k in range(1, 25):
Xk = np.column_stack([
np.ones(n),
Xk_full[:, :k]
])
bk = np.linalg.solve(Xk.T @ Xk, Xk.T @ y)
yk = Xk @ bk
R2_vals.append(1 - np.sum((y - yk)**2) / SST)
plt.figure(figsize=(7, 3))
plt.plot(range(1, 25), R2_vals, 'o-', color='steelblue')
plt.axvline(3, color='crimson', lw=1.5, linestyle='--',
label='True $p = 3$')
plt.xlabel('Number of predictors (noise after $p=3$)')
plt.ylabel('$R^2$')
plt.title('$R^2$ never decreases as we add predictors')
plt.legend(); plt.tight_layout(); plt.show()Adjusted R^2
\bar{R}^2 = 1 - \frac{\text{SSE}/(n - p - 1)}{\text{SST}/(n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}.
adj_R2_vals = [1 - (1 - r2) * (n - 1) / (n - k - 1)
for k, r2 in enumerate(R2_vals, start=1)]
plt.figure(figsize=(7, 3))
plt.plot(range(1, 25), R2_vals, 'o-',
color='steelblue', label='$R^2$')
plt.plot(range(1, 25), adj_R2_vals, 's--',
color='crimson', label='Adj. $R^2$')
plt.axvline(3, color='gray', lw=1.5, linestyle=':',
label='True $p = 3$')
plt.xlabel('Number of predictors')
plt.ylabel('Fit statistic')
plt.title('$R^2$ vs. Adjusted $R^2$')
plt.legend(fontsize=9); plt.tight_layout(); plt.show()Information criteria balance goodness of fit against model complexity.
\text{AIC} = n\log\!\left(\frac{\text{RSS}}{n}\right) + 2(p + 2), \qquad \text{BIC} = n\log\!\left(\frac{\text{RSS}}{n}\right) + (p + 2)\log(n).
| Criterion | Penalty term | Best used for |
|---|---|---|
| AIC | 2(p+2) — mild | Maximising predictive accuracy |
| BIC | (p+2)\log n — stronger | Identifying the true model |
A partial F-test compares a full model to a reduced (nested) model:
F = \frac{(\text{SSE}_\text{red} - \text{SSE}_\text{full})/q}{\text{SSE}_\text{full}/(n - p_\text{full} - 1)} \sim F_{q,\, n - p_\text{full} - 1},
where q is the number of additional predictors in the full model.
anova_lm(reduced_model, full_model) — more on this in Lecture 12.Multicollinearity occurs when two or more predictors are highly linearly correlated with each other.
| Component | SLR (p = 1) | MLR (p predictors) |
|---|---|---|
| Model | Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i | \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} |
| OLS estimator | \hat{\beta}_1 = S_{xy}/S_{xx} | \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} |
| Fitted values | \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i | \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y} |
| Var of \hat{\beta} | \sigma^2/S_{xx} | \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1} |
| Residual df | n - 2 | n - p - 1 |
| Inference | t_{n-2} | t_{n-p-1}; F_{p,\,n-p-1} |
| Fit metric | R^2 | Adj. R^2, AIC, BIC |
statsmodels, handling categorical variables with dummy coding, model comparison, interaction terms, residual diagnostics, added-variable plots, and VIF.