Lecture 15: Regularization

PSTAT100: Data Science — Concepts and Analysis

John Inston

johninston@ucsb.edu

University of California, Santa Barbara

July 13, 2026

🚁 Overview

Aims of the lecture

Appreciate the limitations of subset selection and why they motivate regularization.
Understand cross-validation as a principled method for estimating out-of-sample error.
Introduce regularization (Ridge, Lasso, Elastic Net) and the shrinkage idea.
Diagnose multicollinearity and understand how regularization addresses it.
Compare nested models with partial F-tests.

📚 Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor
)

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import (
    LinearRegression,
    Ridge, Lasso, ElasticNet,
    RidgeCV, LassoCV, ElasticNetCV
)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

💅 Figure Styles

sns.set_style('whitegrid')
sns.set_palette('Set2')

Recap: Lecture 14 — Subset Selection

What we established

The bias-variance tradeoff: adding predictors reduces bias but inflates variance.
Information criteria (AIC, BIC, adj. R^2) add a complexity penalty to avoid overfitting.
Subset selection searches over models to find a good predictor set.

Subset selection strategies

Method	Start	Cost
Best subsets	—	2^p models
Forward stepwise	Null model	O(p^2)
Backward stepwise	Full model	O(p^2)
Hybrid	Either	O(p^2)

Example: Housing Data

We have been looking at California housing data!

# Load dataset — used throughout this lecture
housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()
df.columns = [c.lower() for c in df.columns]

predictors = ['medinc','houseage','averooms','avebedrms',
              'aveoccup','population','latitude','longitude']
feature_names = ['MedInc','HouseAge','AveRooms','AveBedrms',
                 'AveOccup','Population','Latitude','Longitude']

X_full = df[predictors].values
y_full = df['medhouseval'].values
# Standardize predictors for regularization 
X_scaled = StandardScaler().fit_transform(X_full)

Limitations of Subset Selection

Subset selection methods are powerful but carry important limitations that become severe as p grows.

Limitation	Detail
Computational cost	Best subsets requires 2^p fits — infeasible for p \gtrsim 30
Greedy, not global	Stepwise methods can miss the truly optimal subset
Discrete selection	Each predictor is either in or out — no gradation of importance
Instability	Small changes in the data can produce very different selected subsets
Fails when p \geq n	The full model cannot be estimated when predictors outnumber observations

Regularization (this lecture) addresses each of these: it is computationally cheap, solves a single convex optimization problem instead of searching greedily, produces continuous coefficient values, is more stable, and still works when p \gg n.
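To make the computational-cost row concrete, the short calculation below counts model fits for best subsets against forward stepwise. It assumes the usual count for forward stepwise (the null model plus p - k candidate fits at step k); the function names are illustrative only.

```python
def best_subset_fits(p: int) -> int:
    """Number of candidate models in best-subset selection: every subset of p predictors."""
    return 2 ** p


def forward_stepwise_fits(p: int) -> int:
    """Null model plus (p - k) candidate fits at each step k = 0, ..., p - 1."""
    return 1 + sum(p - k for k in range(p))


for p in (8, 20, 30):
    print(f"p = {p:>2}: best subsets = {best_subset_fits(p):>13,}  "
          f"forward stepwise = {forward_stepwise_fits(p):>4}")
```

At p = 30 best subsets already needs over a billion fits, while forward stepwise needs a few hundred.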

Cross-Validation

🔄 Why Cross-Validation?

The problem with in-sample fit

RSS, R^2, and the likelihood are computed on the training data — the same data used to fit the model.
The model has already “seen” these observations: in-sample fit is optimistic.

The solution: hold out data

Cross-validation repeatedly splits the data into a training set and a validation set.
Model error is evaluated on observations the model has not seen during fitting.
This gives an honest estimate of out-of-sample (generalisation) error.

Cross-validation is also our main tool for choosing the regularization hyperparameter \lambda — we will return to it in the regularization section.

k-Fold Cross-Validation

k-fold CV

Randomly partition the n observations into k roughly equal folds.
For each fold j = 1, \ldots, k: fit the model on the other k-1 folds, predict on fold j.
Report the average test error: \text{CV}_{(k)} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i^{(-\kappa(i))})^2, where \hat{y}_i^{(-\kappa(i))} is the prediction for observation i from the model fit without its fold \kappa(i).

k = 5 or k = 10 are standard choices — they balance bias and variance of the CV estimate well.
k = n (LOOCV): leave one observation out at a time — low bias, but high variance and expensive.
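The k-fold recipe above can be sketched by hand and checked against `cross_val_score`. This is a minimal illustration on synthetic data (an assumption, not the housing data); with equal-sized folds the pooled average in the formula coincides with the mean of the per-fold MSEs that sklearn reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual k-fold: every observation is predicted by the model fit without its fold
y_hat_cv = np.empty(n)
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat_cv[val_idx] = model.predict(X[val_idx])
cv_manual = np.mean((y - y_hat_cv) ** 2)

# Same splitter passed to sklearn; scores are negated MSEs, one per fold
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring='neg_mean_squared_error')
cv_sklearn = -scores.mean()

print(f"manual: {cv_manual:.6f}   sklearn: {cv_sklearn:.6f}")
```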

LOOCV and the Hat Matrix

For OLS, there is a remarkable shortcut that avoids refitting the model n times:

\text{LOOCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}}\right)^2,

where e_i = y_i - \hat{y}_i is the ordinary residual and h_{ii} is the i-th diagonal of the hat matrix \mathbf{H}.

Why does this work?

When observation i is left out and the model is refit, the prediction error at x_i is exactly e_i / (1 - h_{ii}), so every leave-one-out residual can be read off the single full fit.
Observations with high leverage h_{ii} have a large influence on the LOOCV score — they are “risky” in the sense that leaving them out changes the fit substantially.
This formula is computationally free once \mathbf{H} is computed.
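The shortcut can be checked numerically against brute-force refitting. The sketch below uses a small synthetic design (an assumption for illustration) and computes only the diagonal of \mathbf{H}, which is all the formula needs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Shortcut: one fit, then rescale each residual by 1 / (1 - h_ii)
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)    # diagonal of the hat matrix
loocv_shortcut = np.mean((e / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving out one observation
errs = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    errs[i] = y[i] - X[i] @ b_i
loocv_brute = np.mean(errs ** 2)

print(f"shortcut: {loocv_shortcut:.6f}   brute force: {loocv_brute:.6f}")
```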

k-Fold CV in Python with `sklearn`

kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Compare subsets of features using 10-fold CV MSE
feature_sets = {
    '1 predictor (MedInc)': X_full[:, [0]],
    '3 predictors': X_full[:, :3],
    '5 predictors': X_full[:, :5],
    'Full (8 predictors)': X_full,
}

print(f"{'Model':<30} {'10-fold CV MSE':>16}")
print("-" * 48)
for name, X_sub in feature_sets.items():
    scores = cross_val_score(
        LinearRegression(), X_sub, y_full,
        cv=kf, scoring='neg_mean_squared_error'
    )
    print(f"{name:<30} {-scores.mean():>16.4f}")

Model                            10-fold CV MSE
------------------------------------------------
1 predictor (MedInc)                     0.7013
3 predictors                             0.6514
5 predictors                             0.6189
Full (8 predictors)                      0.5279

LOOCV via the Hat Matrix

# LOOCV shortcut using the hat matrix
X_aug = sm.add_constant(
    df[['medinc','houseage','averooms','avebedrms',
        'aveoccup','population','latitude','longitude']]
)
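fit_full = sm.OLS(y_full, X_aug).fit()
# Leverages h_ii = diag(H), computed row by row so the n x n hat matrix is never formed
X_mat    = X_aug.to_numpy()
h_diag   = np.einsum('ij,ij->i', X_mat @ np.linalg.inv(X_mat.T @ X_mat), X_mat)
e        = np.asarray(fit_full.resid)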

loocv = np.mean((e / (1 - h_diag))**2)
print(f"LOOCV MSE (hat-matrix shortcut): {loocv:.4f}")

LOOCV MSE (hat-matrix shortcut): 0.5282

Regularization (Shrinkage Methods)

🎯 Why Regularize?

The problem with OLS in high dimensions

When p is large relative to n, or when predictors are highly correlated, \mathbf{X}^\top\mathbf{X} becomes ill-conditioned (and singular when p > n), so (\mathbf{X}^\top\mathbf{X})^{-1} is numerically unstable or does not exist.
OLS coefficients have high variance — tiny changes in the data lead to large changes in \hat{\boldsymbol{\beta}}.

The regularization idea

Add a penalty on the size of \boldsymbol{\beta} to the least squares objective.
Trading a little bias for a large reduction in variance can lower total test error.

This is exactly the bias-variance tradeoff in action: we deliberately introduce bias to tame variance.
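A small simulation can make the trade concrete. The sketch below is an illustration under assumed settings (two nearly collinear predictors, no intercept, \lambda = 1, unstandardized synthetic data): it redraws the data many times and compares the spread of the OLS and Ridge estimates of the first coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, n_sims = 50, 1.0, 500
beta_true = np.array([1.0, 1.0])

ols_draws, ridge_draws = [], []
for _ in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)        # nearly a copy of x1
    X = np.column_stack([x1, x2])
    y = X @ beta_true + rng.normal(size=n)
    XtX, Xty = X.T @ X, X.T @ y
    ols_draws.append(np.linalg.solve(XtX, Xty))
    ridge_draws.append(np.linalg.solve(XtX + lam * np.eye(2), Xty))

ols_draws, ridge_draws = np.array(ols_draws), np.array(ridge_draws)
print("Var of beta_1 hat,  OLS:  ", ols_draws[:, 0].var())
print("Var of beta_1 hat,  Ridge:", ridge_draws[:, 0].var())
print("Mean of beta_1 hat, Ridge:", ridge_draws[:, 0].mean(), "(truth 1.0)")
```

The Ridge estimates are slightly biased but vary far less across repeated samples than the OLS estimates.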

Regularization as Constrained Optimization

Regularization can be written in two equivalent forms.

Penalty (Lagrangian) form — what we minimize:

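\min_{\boldsymbol{\beta}}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|_q^q, \qquad \|\boldsymbol{\beta}\|_q^q = \sum_j |\beta_j|^q \quad (q = 1 \text{ for Lasso},\; q = 2 \text{ for Ridge})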

Constraint form — equivalent formulation:

\min_{\boldsymbol{\beta}}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_q \leq t

For every value of \lambda \geq 0 there is a corresponding budget t \geq 0 that produces the same solution.
As \lambda increases, t decreases — the constraint tightens and coefficients must shrink further.
At \lambda = 0 (t = \infty): no constraint — recovers the OLS solution.
At \lambda \to \infty (t = 0): all \hat{\beta}_j = 0 — the null model.

The constraint perspective makes the geometry transparent: the regularized solution is the point in the constraint region \|\boldsymbol{\beta}\|_q \leq t with the smallest RSS, i.e. the point where the RSS contours expanding from the OLS solution first touch the region.
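The correspondence between \lambda and the budget t can be seen numerically for Ridge (q = 2). The sketch below, on synthetic data with no intercept (both assumptions for illustration), solves the penalty form for several values of \lambda and reports the norm of each solution, which is the budget t that the constraint form would need.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)


def ridge_solution(X, y, lam):
    """Minimizer of ||y - X b||^2 + lam * ||b||_2^2 (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
budgets = [np.linalg.norm(ridge_solution(X, y, lam)) for lam in lambdas]
for lam, t in zip(lambdas, budgets):
    print(f"lambda = {lam:>7.1f}  ->  implied budget t = ||beta||_2 = {t:.4f}")
```

The implied budget shrinks as \lambda grows, and \lambda = 0 reproduces the OLS solution.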

⚠️ Why Standardize Predictors?

The regularization penalty \lambda \sum_j \beta_j^2 (or \lambda \sum_j |\beta_j|) penalizes coefficients by their magnitude.

The scale problem

Suppose X_1 is measured in millions of dollars and X_2 in metres.
A unit change in X_1 is a change of one million dollars, so \beta_1 will naturally be large; had X_1 been recorded in dollars, the same effect would give a coefficient 10^6 times smaller.
A unit change in X_2 is a single metre, so \beta_2 will naturally be small.
The penalty then falls almost entirely on \beta_1 and barely touches \beta_2, purely because of the units chosen.

The fix: standardize before fitting

X_j \leftarrow \frac{X_j - \bar{X}_j}{s_j}

This puts all predictors on a unit-variance, zero-mean scale so the penalty is applied equally.

Important: the intercept \beta_0 is never penalized. Shrinking it would pull every prediction toward zero instead of toward the mean of y, so the fit would depend on where the origin of y happens to be.
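The scale problem is easy to reproduce. In the sketch below (synthetic data and \lambda = 50 are assumptions chosen for illustration), the same predictor is expressed in two different units: the raw Ridge fit changes, while a pipeline that standardizes first gives the same fitted values either way.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)

# Same information, different units: first predictor in units 1000x larger
X_rescaled = X.copy()
X_rescaled[:, 0] /= 1000.0

lam = 50.0
raw_a = Ridge(alpha=lam).fit(X, y).predict(X)
raw_b = Ridge(alpha=lam).fit(X_rescaled, y).predict(X_rescaled)

std_a = make_pipeline(StandardScaler(), Ridge(alpha=lam)).fit(X, y).predict(X)
std_b = (make_pipeline(StandardScaler(), Ridge(alpha=lam))
         .fit(X_rescaled, y).predict(X_rescaled))

print("max change in fitted values, raw Ridge:         ", np.abs(raw_a - raw_b).max())
print("max change in fitted values, standardized Ridge:", np.abs(std_a - std_b).max())
```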

Ridge Regression (L_2 Penalty)

Ridge Regression

\hat{\boldsymbol{\beta}}_{\text{ridge}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|^2_2,

where \lambda \geq 0 is the regularization hyperparameter and \|\boldsymbol{\beta}\|^2_2 = \sum_j \beta_j^2 is the squared L_2 norm.

Closed-form solution:

\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}.

Why does adding \lambda\mathbf{I} help? Think in terms of eigenvalues.

Regularization and Eigenvalues

Impact of \lambda.

\mathbf{X}^\top\mathbf{X} is a symmetric positive semi-definite matrix with eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0.
The OLS inverse involves 1/\lambda_j for each eigenvalue. If any \lambda_j \approx 0 (nearly singular — which happens when predictors are correlated), that term explodes, inflating \text{Var}(\hat{\boldsymbol{\beta}}).
Adding \lambda\mathbf{I} shifts every eigenvalue up by \lambda: the eigenvalues of \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I} are \lambda_j + \lambda.
The inverse now involves 1/(\lambda_j + \lambda) — even if \lambda_j \approx 0, the denominator is at least \lambda > 0, so it stays bounded.

The matrix is therefore always invertible for any \lambda > 0, regardless of multicollinearity.

Ridge shrinks all coefficients toward zero but never sets them exactly to zero — all predictors are retained.
Particularly useful when predictors are correlated (multicollinearity).
Predictors must be standardised first (the penalty is scale-sensitive).
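The eigenvalue argument and the closed form can both be checked directly. This sketch uses synthetic, nearly collinear predictors (an assumption), standardizes them, centres y so that no intercept is needed, and compares the closed-form solution with sklearn's `Ridge` fitted without an intercept.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize columns
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)
y = y - y.mean()                                    # centre y so no intercept is needed

lam = 5.0
XtX = X.T @ X
eig_ols = np.linalg.eigvalsh(XtX)                   # ascending order
eig_ridge = np.linalg.eigvalsh(XtX + lam * np.eye(3))

beta_closed = np.linalg.solve(XtX + lam * np.eye(3), X.T @ y)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("eigenvalues of X'X:        ", eig_ols.round(4))
print("eigenvalues of X'X + lam I:", eig_ridge.round(4))
print("condition number before / after:",
      round(eig_ols[-1] / eig_ols[0], 1), "/", round(eig_ridge[-1] / eig_ridge[0], 1))
print("closed form:", beta_closed.round(4))
print("sklearn:    ", beta_sklearn.round(4))
```

Every eigenvalue moves up by exactly \lambda, which is what brings the condition number down.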

Ridge: The Effect of \lambda

As λ increases (moving right), Ridge coefficients are smoothly shrunk toward zero. All predictors remain in the model — none are zeroed out. Dashed horizontal lines show the OLS coefficient values (λ = 0 limit). Note: coefficients are on the standardised scale.

Lasso (L_1 Penalty)

Lasso — Least Absolute Shrinkage and Selection Operator

\hat{\boldsymbol{\beta}}_{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|_1,

where \|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j| is the L_1 norm.

No closed-form solution — requires coordinate descent or convex optimization.
The L_1 penalty induces sparsity: for large enough \lambda, some \hat{\beta}_j are set exactly to zero.
Lasso performs automatic variable selection — it simultaneously shrinks and selects.
Like Ridge, predictors must be standardised.

The key difference: Ridge shrinks every coefficient toward zero but keeps it non-zero; Lasso can zero coefficients out completely.

Lasso: Why Does Sparsity Happen?

Recall the constraint form: minimize RSS subject to \|\boldsymbol{\beta}\|_1 \leq t.

The L_1 constraint region (diamond) has corners on the coordinate axes.
The RSS contours are ellipses centred at the OLS solution, expanding outward.
For a given budget t, the solution is the point where the expanding ellipse first touches the diamond.
Because the diamond has sharp corners, contact most often occurs at a corner — where one or more \beta_j = 0 exactly.

The L₁ ball (diamond) has corners where an axis meets the boundary. The expanding RSS ellipse tends to first touch a corner, setting one coefficient exactly to zero.

This geometry is why L_2 (Ridge) does not produce exactly-zero coefficients in general: the smooth L_2 ball has no corners, so the ellipse almost always touches it at a non-axis point.
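The same contrast has a simple algebraic form in the special case of an orthonormal design (\mathbf{X}^\top\mathbf{X} = \mathbf{I}), which is an assumption made here purely for illustration. Under the lecture's objectives, Lasso soft-thresholds each OLS coefficient at \lambda/2, while Ridge rescales it by 1/(1 + \lambda). The sketch also checks the Lasso result against sklearn, whose `Lasso` scales the RSS by 1/(2n), so its `alpha` corresponds to \lambda/(2n).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 100, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))       # columns of Q are orthonormal
beta_true = np.array([3.0, -2.0, 0.2, 0.0, -0.1])
y = Q @ beta_true + 0.05 * rng.normal(size=n)

beta_ols = Q.T @ y                                  # OLS solution when Q'Q = I
lam = 1.0                                           # lambda in the lecture's objective


def soft_threshold(z, t):
    """Move each entry of z toward zero by t, clipping at exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


beta_lasso = soft_threshold(beta_ols, lam / 2)      # min ||y - Qb||^2 + lam * ||b||_1
beta_ridge = beta_ols / (1 + lam)                   # min ||y - Qb||^2 + lam * ||b||_2^2

# sklearn's Lasso minimizes RSS / (2n) + alpha * ||b||_1, hence alpha = lam / (2n)
beta_sk = Lasso(alpha=lam / (2 * n), fit_intercept=False, tol=1e-12).fit(Q, y).coef_

print("OLS:  ", beta_ols.round(3))
print("Lasso:", beta_lasso.round(3))
print("Ridge:", beta_ridge.round(3))
```

Small OLS coefficients are clipped to exactly zero by the soft threshold, whereas the Ridge rescaling can only make them smaller.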

Ridge vs. Lasso: Geometry

Constraint regions for Ridge (circle, L₂ ball) and Lasso (diamond, L₁ ball). The OLS solution is at the unconstrained minimum. The Lasso constraint region has corners along the axes, making it likely that the constrained minimum falls on a corner — setting one coefficient exactly to zero.

Elastic Net

Elastic Net

\hat{\boldsymbol{\beta}}_{\text{EN}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\!\left[\alpha\|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2}\|\boldsymbol{\beta}\|^2_2\right],

where \alpha \in [0, 1] controls the mix: \alpha = 1 is pure Lasso, \alpha = 0 is pure Ridge.

Combines the sparsity of Lasso with the grouping property of Ridge.
When predictors are correlated, Lasso tends to arbitrarily select one from a group; Elastic Net tends to select the group together.
Two hyperparameters: \lambda (overall penalty strength) and \alpha (L1 vs. L2 mix).
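A note on naming before the code: in sklearn the overall strength \lambda is the argument `alpha`, and the mix \alpha is the argument `l1_ratio`. The sketch below uses synthetic data with two exactly duplicated columns (an extreme, assumed case of correlation) to illustrate the grouping behaviour, and confirms that `l1_ratio=1.0` reproduces Lasso. With exact duplicates the Lasso solution is not unique, so which copy receives the weight depends on the solver.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=n)])     # columns 0 and 1 are exact duplicates
y = 2.0 * z + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

opts = dict(fit_intercept=False, tol=1e-10, max_iter=100_000)

# sklearn naming: `alpha` is the overall strength, `l1_ratio` is the L1/L2 mix
lasso = Lasso(alpha=0.1, **opts).fit(X, y)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0, **opts).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, **opts).fit(X, y)

print("Lasso:                      ", lasso.coef_.round(3))
print("ElasticNet, l1_ratio = 1.0: ", enet_as_lasso.coef_.round(3))
print("ElasticNet, l1_ratio = 0.5: ", enet.coef_.round(3))
```

Because the L_2 part makes the objective strictly convex, Elastic Net splits the weight equally between the two duplicated columns.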

Ridge vs. Lasso — Practical Guidance

	Ridge	Lasso	Elastic Net
Variable selection	No	Yes	Yes
Correlated predictors	Handles well	Picks one arbitrarily	Selects group
Works when p > n	Yes	Yes (selects \leq n)	Yes
Interpretability	All retained	Sparse	Sparse
Inference on \hat{\beta}	Non-standard	Non-standard	Non-standard

Rules of thumb

Use Ridge when you expect a dense signal — most predictors contribute something.
Use Lasso when you expect sparsity — only a few predictors truly matter.
Use Elastic Net when predictors are correlated and the signal is sparse: it combines the strengths of both.
When p \gg n, prefer Lasso or Elastic Net to obtain an interpretable sparse model.

Choosing \lambda: Cross-Validated Grid Search

# RidgeCV: 10-fold CV over a grid of alpha values (cv=None would use leave-one-out instead)
ridge_cv  = RidgeCV(alphas=np.logspace(-3, 4, 100), cv=10)
ridge_cv.fit(X_scaled, y_full)
print(f"Ridge best λ (alpha): {ridge_cv.alpha_:.4f}")

# LassoCV — coordinate descent + k-fold CV
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 100),
                   cv=10, max_iter=10_000, random_state=0)
lasso_cv.fit(X_scaled, y_full)
print(f"Lasso best λ (alpha): {lasso_cv.alpha_:.6f}")

Ridge best λ (alpha): 89.0215
Lasso best λ (alpha): 0.000320
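One refinement worth knowing: when the scaler is fitted on the full data before cross-validation, the validation folds have already influenced the scaling. A possible alternative, sketched here on synthetic data with mixed scales (an assumption, not the housing data), is to place `StandardScaler` inside a `Pipeline` and tune \lambda with `GridSearchCV`, so the scaler is refit on the training folds only. The step names `scale` and `ridge` are arbitrary labels.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n, p = 300, 6
X = rng.normal(size=(n, p)) * np.array([1, 10, 100, 1, 10, 100])   # mixed scales
y = X @ np.array([1.0, 0.1, 0.01, 0.0, 0.0, 0.0]) + rng.normal(size=n)

pipe = Pipeline([
    ('scale', StandardScaler()),    # refit on the training folds only
    ('ridge', Ridge()),
])
grid = {'ridge__alpha': np.logspace(-3, 4, 30)}
search = GridSearchCV(pipe, grid, scoring='neg_mean_squared_error',
                      cv=KFold(n_splits=10, shuffle=True, random_state=0))
search.fit(X, y)

best_alpha = search.best_params_['ridge__alpha']
print(f"best lambda (alpha): {best_alpha:.4f}")
print(f"10-fold CV MSE at best lambda: {-search.best_score_:.4f}")
```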

Coefficient Path Plots

alphas = np.logspace(-3, 4, 200)

# Ridge path
ridge_coefs = np.array([
    Ridge(alpha=a).fit(X_scaled, y_full).coef_
    for a in alphas
])

# Lasso path
lasso_coefs = np.array([
    Lasso(alpha=a, max_iter=10000).fit(X_scaled, y_full).coef_
    for a in alphas
])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

palette = sns.color_palette('tab10', n_colors=8)

for j, name in enumerate(feature_names):
    axes[0].plot(np.log10(alphas), ridge_coefs[:, j],
                 lw=1.8, color=palette[j], label=name)
    axes[1].plot(np.log10(alphas), lasso_coefs[:, j],
                 lw=1.8, color=palette[j], label=name)

for ax, title, best_a in [
    (axes[0], 'Ridge Coefficient Path', ridge_cv.alpha_),
    (axes[1], 'Lasso Coefficient Path', lasso_cv.alpha_),
]:
    ax.axvline(np.log10(best_a), color='crimson', lw=2,
               linestyle='--', label=f'CV-selected $\\lambda$')
    ax.axhline(0, color='gray', lw=0.8)
    ax.set_xlabel('$\\log_{10}(\\lambda)$')
    ax.set_ylabel('Coefficient value')
    ax.set_title(title)
    ax.legend(fontsize=7, ncol=2)

plt.suptitle('Regularization Paths — California Housing', fontsize=13, y=1.02)
plt.tight_layout(); plt.show()

Coefficient paths for Ridge (left) and Lasso (right) as λ increases. Ridge shrinks all coefficients smoothly toward zero; Lasso sets some to exactly zero, performing variable selection. The dashed red line marks the CV-selected λ.

Implementation in Python

🗺️ The Dataset: Diabetes

We use sklearn’s built-in diabetes dataset (GeeksForGeeks 2024), a classic regression benchmark.

n = 442 patients; response = disease progression one year after baseline (continuous).
p = 10 predictors: age, sex, BMI, blood pressure, and six blood serum measurements.
Predictors are already mean-centred and scaled by sklearn so that each column has unit sum of squares (standard deviation 1/\sqrt{n}, not 1).

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X_diab = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_diab = pd.Series(diabetes.target, name='progression')

print(X_diab.shape)
print(y_diab.describe().round(1))

(442, 10)
count    442.0
mean     152.1
std       77.1
min       25.0
25%       87.0
50%      140.5
75%      211.5
max      346.0
Name: progression, dtype: float64

Step 1: Train-Test Split

Before fitting any model we hold out 25% of the data as a test set — this will never be seen during training or hyperparameter tuning.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_diab, y_diab, test_size=0.25, random_state=42
)

print(f"Training set: {X_train.shape[0]} observations")
print(f"Test set:     {X_test.shape[0]} observations")

Training set: 331 observations
Test set:     111 observations

Because the predictors are already scaled in this dataset, we do not need to apply StandardScaler here. In general, you should always scale when using regularized methods.
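It is worth checking what "already scaled" means for this dataset. As far as the sklearn documentation describes it, each column is mean-centred and divided by its standard deviation times \sqrt{n}, so columns have unit sum of squares instead of unit variance. The quick check below inspects the bundled data; the small column scale is one reason the coefficients in the following steps are in the hundreds, and why a given `alpha` here is not directly comparable to the same `alpha` on unit-variance predictors.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data            # scaled version, bundled with sklearn (no download)

col_means = X.mean(axis=0)
col_ss = (X ** 2).sum(axis=0)
col_sd = X.std(axis=0)

print("column means:          ", col_means.round(10))
print("column sums of squares:", col_ss.round(6))
print("column std devs:       ", col_sd.round(4))
print("1 / sqrt(n):           ", round(1 / np.sqrt(X.shape[0]), 4))
```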

Step 2: Fitting Ridge Regression

We fit Ridge with alpha=1 (\lambda = 1) as a starting point, then inspect the coefficients.

# Fit Ridge with α = 1
ridge_model = Ridge(alpha=1)
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)

# Test MSE
mse_ridge = np.mean((y_pred_ridge - y_test)**2)
print(f"Ridge (α=1) — Test MSE: {mse_ridge:.2f}\n")

# Coefficient table
ridge_coef_df = pd.DataFrame({
    'Feature':    X_train.columns,
    'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print(ridge_coef_df.round(2).to_string(index=False))

Ridge (α=1) — Test MSE: 3105.47

Feature  Coefficient
    bmi       278.30
     s5       215.85
     bp       197.62
     s3      -151.39
     s4       120.32
     s6       101.76
    sex       -67.72
    age        50.55
     s2       -26.23
     s1        -6.25

Step 3: Fitting Lasso Regression

We fit Lasso with alpha=1. Notice that some coefficients are set exactly to zero — Lasso has performed automatic variable selection.

# Fit Lasso with α = 1
lasso_model = Lasso(alpha=1)
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.predict(X_test)

# Test MSE
mse_lasso = np.mean((y_pred_lasso - y_test)**2)
print(f"Lasso (α=1) — Test MSE: {mse_lasso:.2f}\n")

# Coefficient table — mark zeroed-out predictors
lasso_coef_df = pd.DataFrame({
    'Feature':    X_train.columns,
    'Coefficient': lasso_model.coef_,
    'Selected':    lasso_model.coef_ != 0
}).sort_values('Coefficient', key=abs, ascending=False)

print(f"Features retained: {(lasso_model.coef_ != 0).sum()} / {X_train.shape[1]}\n")
print(lasso_coef_df.round(2).to_string(index=False))

Lasso (α=1) — Test MSE: 3433.16

Features retained: 3 / 10

Feature  Coefficient  Selected
    bmi       398.39      True
     s5       238.19      True
     bp        46.18      True
    age         0.00     False
    sex        -0.00     False
     s1         0.00     False
     s2         0.00     False
     s3        -0.00     False
     s4         0.00     False
     s6         0.00     False

Step 4: Fitting Elastic Net

Elastic Net adds a second hyperparameter l1_ratio (\alpha in our notation) controlling the L1/L2 mix. Here we use l1_ratio=0.5 — equal weight to both penalties.

# Fit Elastic Net with λ=1, α=0.5
enet_model = ElasticNet(alpha=1, l1_ratio=0.5)
enet_model.fit(X_train, y_train)
y_pred_enet = enet_model.predict(X_test)

# Test MSE
mse_enet = np.mean((y_pred_enet - y_test)**2)
print(f"Elastic Net (α=1, l1_ratio=0.5) — Test MSE: {mse_enet:.2f}\n")

# Coefficient table
enet_coef_df = pd.DataFrame({
    'Feature':    X_train.columns,
    'Coefficient': enet_model.coef_,
    'Selected':    enet_model.coef_ != 0
}).sort_values('Coefficient', key=abs, ascending=False)

print(f"Features retained: {(enet_model.coef_ != 0).sum()} / {X_train.shape[1]}\n")
print(enet_coef_df.round(2).to_string(index=False))

Elastic Net (α=1, l1_ratio=0.5) — Test MSE: 5554.23

Features retained: 9 / 10

Feature  Coefficient  Selected
    bmi         3.30      True
     s5         2.95      True
     bp         2.26      True
     s4         2.14      True
     s3        -1.87      True
     s6         1.70      True
    age         0.41      True
     s1         0.34      True
     s2         0.08      True
    sex         0.00     False

Step 5: Choosing \lambda by Cross-Validation

alpha=1 was arbitrary. We use built-in CV classes to search over a grid and find the optimal \lambda for each method.

# Ridge: CV over grid
ridge_cv_diab = RidgeCV(alphas=np.logspace(-3, 4, 100), cv=10)
ridge_cv_diab.fit(X_train, y_train)

# Lasso: CV over grid
lasso_cv_diab = LassoCV(alphas=np.logspace(-3, 2, 100),
                        cv=10, max_iter=10_000, random_state=42)
lasso_cv_diab.fit(X_train, y_train)

# Elastic Net: CV over grid
enet_cv_diab = ElasticNetCV(alphas=np.logspace(-3, 2, 100),
                            l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0],
                            cv=10, max_iter=10_000, random_state=42)
enet_cv_diab.fit(X_train, y_train)

print(f"Ridge   best λ: {ridge_cv_diab.alpha_:.4f}")
print(f"Lasso   best λ: {lasso_cv_diab.alpha_:.4f}")
print(f"ElasNet best λ: {enet_cv_diab.alpha_:.4f},  "
      f"best l1_ratio: {enet_cv_diab.l1_ratio_:.2f}")

Ridge   best λ: 0.0955
Lasso   best λ: 0.0045
ElasNet best λ: 0.0045,  best l1_ratio: 1.00

Step 6: Comparing All Methods

# OLS baseline
ols_model = LinearRegression().fit(X_train, y_train)

models = {
    'OLS':                    ols_model,
    'Ridge (α=1)':            ridge_model,
    'Ridge (CV-tuned)':       ridge_cv_diab,
    'Lasso (α=1)':            lasso_model,
    'Lasso (CV-tuned)':       lasso_cv_diab,
    'Elastic Net (CV-tuned)': enet_cv_diab,
}

print(f"{'Method':<28} {'Test MSE':>10} {'# Features':>12}")
print("-" * 52)
for name, mdl in models.items():
    y_hat = mdl.predict(X_test)
    mse   = np.mean((y_hat - y_test)**2)
    # count non-zero coefficients (OLS and Ridge always keep all)
    n_feat = int((mdl.coef_ != 0).sum())
    print(f"{name:<28} {mse:>10.2f} {n_feat:>12}")

Method                         Test MSE   # Features
----------------------------------------------------
OLS                             2848.31           10
Ridge (α=1)                     3105.47           10
Ridge (CV-tuned)                2810.61           10
Lasso (α=1)                     3433.16            3
Lasso (CV-tuned)                2839.42            9
Elastic Net (CV-tuned)          2839.42            9

Visualising the Results

Coefficient estimates for all six methods on the diabetes dataset. OLS and Ridge retain all 10 features; Lasso and Elastic Net zero out some. CV tuning substantially changes the estimates compared to α=1.

Conclusion

✅ What We Covered

Limitations of subset selection: discrete, greedy, and unstable for large p.
Cross-validation: honest out-of-sample error estimation; the key tool for choosing \lambda.
Regularization: Ridge (L_2) shrinks smoothly; Lasso (L_1) shrinks and selects; Elastic Net combines both.
Constrained optimization view: regularization = fitting RSS subject to a norm ball constraint.
Standardization: essential before applying any penalized method.
Implementation: fitting Ridge, Lasso, and Elastic Net in sklearn, with CV-tuned \lambda.

📅 What’s Next?

Classification models: logistic regression, support vector machines, decision trees.
Non-linear regression and basis expansions.

References

GeeksForGeeks. 2024. Implementation of Lasso, Ridge and Elastic Net. https://www.geeksforgeeks.org/machine-learning/implementation-of-lasso-ridge-and-elastic-net/.
