PSTAT100: Data Science — Concepts and Analysis
May 23, 2026
| Goal | What matters | Implication |
|---|---|---|
| Prediction accuracy | Low out-of-sample error | Can tolerate many predictors if they help |
| Interpretability | Sparse, readable model | Prefer fewer, meaningful predictors |
| Inference validity | Correct standard errors | Need well-specified, parsimonious model |
| Parsimony | Fewest predictors possible | Occam’s razor; generalisability |
There is no universal “best” model — only models that are better suited to a given purpose. The criterion you use to select a model should match the criterion you care about.
For any estimator \hat{f} of f, the expected test error at a point x_0 decomposes as:
\mathbb{E}\!\left[(Y - \hat{f}(x_0))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\right)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}(\hat{f}(x_0))}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}.
Adding predictors reduces bias (the model fits more patterns) but inflates variance (the model memorises the training data). The tradeoff is managed by choosing model complexity wisely.
Multiple Linear Regression
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{I}_n),
where \mathbf{y} \in \mathbb{R}^n, \mathbf{X} \in \mathbb{R}^{n \times (p+1)} (including an intercept column), and \boldsymbol{\beta} \in \mathbb{R}^{p+1}.
The OLS estimator is:
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}.
| Symbol | Formula | Role |
|---|---|---|
| \hat{\mathbf{y}} | \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y} | Fitted values |
| \mathbf{H} | \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top | Hat (projection) matrix |
| h_{ii} | [\mathbf{H}]_{ii} | Leverage of observation i |
| \text{RSS} | \mathbf{e}^\top\mathbf{e} = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 | Residual sum of squares |
The hat matrix \mathbf{H} is an orthogonal projector onto \text{col}(\mathbf{X}):
We will see that h_{ii} reappears in the LOOCV formula, making leverage a bridge between geometry, influence, and cross-validation.
| Assumption | Statement | Consequence if violated |
|---|---|---|
| Linearity | \mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta} | Systematic bias in \hat{\boldsymbol{\beta}} |
| Homoscedasticity | \text{Var}(\varepsilon_i) = \sigma^2 for all i | Inefficient OLS; SEs wrong |
| No multicollinearity | \mathbf{X}^\top\mathbf{X} is invertible (ideally well-conditioned) | Inflated variances of \hat{\boldsymbol{\beta}} |
| Exogeneity | \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0} | OLS is biased |
| Normality | \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) | Needed for finite-sample inference |
For p \leq 20 or so, best subset is feasible. For p \gtrsim 30, stepwise or regularization is preferred.
For each k = 0, 1, \ldots, p:
Then select among \mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_p using a criterion such as AIC, BIC, or adjusted R^2.
Best subset selection is the gold standard for small p, but its exponential cost means stepwise methods or regularization are needed for moderate to large p.
What would this look like in
python?
def forward_stepwise(y, X_candidates, criterion='aic'):
"""
Greedy forward stepwise selection.
y: pd.Series, X_candidates: pd.DataFrame (predictors only, no intercept)
Returns a list of selected predictor names.
"""
remaining = list(X_candidates.columns) # predictors not yet in the model
selected = [] # predictors chosen so far
best_score = np.inf # track the best criterion seen
while remaining:
# Try adding each remaining predictor to the current selected set
scores = {}
for col in remaining:
cols = selected + [col] # candidate set with one addition
X_fit = sm.add_constant(X_candidates[cols])
res = sm.OLS(y, X_fit).fit()
scores[col] = getattr(res, criterion) # AIC (or BIC) of this candidate
# Pick whichever addition gave the lowest criterion value
best_col = min(scores, key=scores.get)
best_new = scores[best_col]
if best_new < best_score: # adding best_col actually improved the fit
best_score = best_new
selected.append(best_col) # commit the addition
remaining.remove(best_col)
else:
break # no remaining predictor improves the criterion: stop
return selecteddef backward_stepwise(y, X_candidates, criterion='aic'):
"""
Greedy backward stepwise selection.
"""
selected = list(X_candidates.columns) # start with all predictors in the model
# Score the full model as the initial benchmark
X_fit = sm.add_constant(X_candidates[selected])
best_score = getattr(sm.OLS(y, X_fit).fit(), criterion)
improved = True
while improved and len(selected) > 1: # stop if no improvement or only intercept left
improved = False
scores = {}
# Try dropping each predictor and score the resulting smaller model
for col in selected:
cols = [c for c in selected if c != col] # all predictors except col
X_fit = sm.add_constant(X_candidates[cols])
res = sm.OLS(y, X_fit).fit()
scores[col] = getattr(res, criterion) # criterion when col is removed
# Find the predictor whose removal most improves (lowers) the criterion
drop_col = min(scores, key=scores.get)
drop_score = scores[drop_col]
if drop_score < best_score: # dropping drop_col actually helped
best_score = drop_score
selected.remove(drop_col) # commit the removal
improved = True # signal that another pass should be tried
return selected| Property | Forward | Backward |
|---|---|---|
| Start | Null model | Full model |
| Requires n > p? | No | Yes (full model must be estimable) |
| Can re-include dropped predictors? | No | No |
| Can drop previously added predictors? | No | No |
Caution: stepwise selection inflates Type I error and produces overoptimistic p-values — it should be used for model building, not for inference. Report final model uncertainty honestly.
AIC and BIC
\text{AIC} = -2\ell(\hat{\boldsymbol{\beta}}) + 2k
\text{BIC} = -2\ell(\hat{\boldsymbol{\beta}}) + k\ln n
where \ell(\hat{\boldsymbol{\beta}}) is the maximised log-likelihood, k is the number of free parameters (including \sigma^2), and n is the sample size. Lower is better.
n_vals = np.arange(8, 1001)
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(n_vals, 2 * np.ones_like(n_vals), lw=2.5,
color='steelblue', label='AIC penalty per parameter (constant = 2)')
ax.plot(n_vals, np.log(n_vals), lw=2.5,
color='#fc8d62', label='BIC penalty per parameter ($\\ln n$)')
ax.set_xlabel('Sample size $n$')
ax.set_ylabel('Penalty per parameter')
ax.set_title('AIC vs. BIC: Penalty Per Parameter')
ax.legend(fontsize=10)
plt.tight_layout(); plt.show()Ordinary R^2 = 1 - \text{RSS}/\text{TSS} increases whenever a predictor is added, even if that predictor is irrelevant.
Adjusted R^2 corrects for this:
\bar{R}^2 = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n-1)} = 1 - \frac{n-1}{n-p-1}(1 - R^2).
# Load data and fit a sequence of models for illustration
housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()
df.columns = [c.lower() for c in df.columns]
# Fit models of increasing complexity
formulas = {
'Intercept only': 'medhouseval ~ 1',
'+ MedInc': 'medhouseval ~ medinc',
'+ MedInc + HouseAge': 'medhouseval ~ medinc + houseage',
'+ MedInc + HouseAge\n+ AveRooms + AveBedrms':
'medhouseval ~ medinc + houseage + averooms + avebedrms',
'Full (all 8)': 'medhouseval ~ medinc + houseage + averooms + avebedrms'
' + aveoccup + population + latitude + longitude',
}
rows = []
for name, formula in formulas.items():
fit = smf.ols(formula, data=df).fit()
rows.append({
'Model': name.replace('\n', ' '),
'k': int(fit.df_model) + 2, # predictors + intercept + sigma²
'R²': round(fit.rsquared, 4),
'Adj. R²': round(fit.rsquared_adj, 4),
'AIC': round(fit.aic, 1),
'BIC': round(fit.bic, 1),
})
cmp = pd.DataFrame(rows)
print(cmp.to_string(index=False)) Model k R² Adj. R² AIC BIC
Intercept only 2 0.0000 0.0000 64485.9 64493.9
+ MedInc 3 0.4734 0.4734 51249.3 51265.2
+ MedInc + HouseAge 4 0.5091 0.5091 49803.4 49827.2
+ MedInc + HouseAge + AveRooms + AveBedrms 6 0.5375 0.5374 48578.7 48618.4
Full (all 8) 10 0.6062 0.6061 45265.5 45337.0