Regularization

Shrinkage methods for regression including cross-validation, Ridge regression (L₂ penalty), Lasso (L₁ penalty), and Elastic Net, with Python implementation on the diabetes dataset.
Author

John Robin Inston

Published

May 22, 2026

Modified

May 22, 2026

ImportantLearning Objectives
  • Understand the limitations of subset selection and why they motivate regularization.
  • Describe \(k\)-fold cross-validation and the LOOCV shortcut via the hat matrix.
  • Define and compare Ridge (\(L_2\)), Lasso (\(L_1\)), and Elastic Net regularization.
  • Understand the geometric intuition for why Lasso induces sparsity.
  • Implement regularized regression in Python with cross-validated hyperparameter selection.
Libraries and Styling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    RidgeCV, LassoCV, ElasticNetCV,
)
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

sns.set_style('whitegrid')
sns.set_palette('Set2')

Limitations of Subset Selection

Subset selection methods are powerful, but carry important limitations that become severe as \(p\) grows.

Limitation Detail
Computational cost Best subsets requires \(2^p\) fits — infeasible for \(p \gtrsim 30\)
Greedy, not global Stepwise methods can miss the truly optimal subset
Discrete selection Each predictor is either in or out — no gradation of importance
Instability Small changes in data can produce very different selected subsets
Fails when \(p \geq n\) The full model cannot be estimated when predictors outnumber observations

Regularization overcomes all of these: it is computationally cheap, globally optimised, produces continuous coefficient values, is stable, and works when \(p \gg n\).

Cross-Validation

Why cross-validate?

RSS, \(R^2\), and the likelihood are computed on the training data — the same data used to fit the model. The model has already seen these observations, so in-sample fit is optimistic. Cross-validation (CV) evaluates error on observations the model has not seen during fitting, giving an honest estimate of out-of-sample (generalisation) error.

Important\(k\)-Fold Cross-Validation
  1. Randomly partition the \(n\) observations into \(k\) roughly equal folds.
  2. For each fold \(j = 1, \ldots, k\): fit the model on the other \(k-1\) folds, predict on fold \(j\).
  3. Report the average test error: \[\text{CV}_{(k)} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i^{(-\kappa(i))})^2,\] where \(\hat{y}_i^{(-\kappa(i))}\) is the prediction for observation \(i\) from the model fit without its fold.

\(k = 5\) or \(k = 10\) are standard choices. Leave-one-out CV (LOOCV) sets \(k = n\) — it has low bias but high variance and is computationally expensive for most methods.

The LOOCV shortcut for OLS

For ordinary least squares there is a closed-form shortcut that avoids refitting \(n\) times:

\[\text{LOOCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}}\right)^2,\]

where \(e_i = y_i - \hat{y}_i\) is the ordinary residual and \(h_{ii}\) is the \(i\)-th diagonal of the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). Observations with high leverage \(h_{ii}\) have large influence on the LOOCV score. This formula is computationally free once \(\mathbf{H}\) is computed.

housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()
df.columns = [c.lower() for c in df.columns]

predictors = ['medinc', 'houseage', 'averooms', 'avebedrms',
              'aveoccup', 'population', 'latitude', 'longitude']
X_full = df[predictors].values
y_full = df['medhouseval'].values
X_scaled = StandardScaler().fit_transform(X_full)

# 10-fold CV for models of increasing size
kf = KFold(n_splits=10, shuffle=True, random_state=0)
feature_sets = {
    '1 predictor (MedInc)': X_full[:, [0]],
    '3 predictors':         X_full[:, :3],
    '5 predictors':         X_full[:, :5],
    'Full (8 predictors)':  X_full,
}

print(f"{'Model':<30} {'10-fold CV MSE':>16}")
print("-" * 48)
for name, X_sub in feature_sets.items():
    scores = cross_val_score(
        LinearRegression(), X_sub, y_full,
        cv=kf, scoring='neg_mean_squared_error',
    )
    print(f"{name:<30} {-scores.mean():>16.4f}")
Model                            10-fold CV MSE
------------------------------------------------
1 predictor (MedInc)                     0.7013
3 predictors                             0.6514
5 predictors                             0.6189
Full (8 predictors)                      0.5279

Cross-validation is also our main tool for choosing the regularization hyperparameter \(\lambda\).

Why Regularize?

When \(p\) is large relative to \(n\), or when predictors are highly correlated, \((\mathbf{X}^\top\mathbf{X})^{-1}\) becomes ill-conditioned. OLS coefficients have high variance — tiny changes in the data lead to large changes in \(\hat{\boldsymbol{\beta}}\).

Regularization adds a penalty on the size of \(\boldsymbol{\beta}\) to the least squares objective, trading a little bias for a large reduction in variance. The two equivalent formulations are:

  • Penalty form: \(\min_{\boldsymbol{\beta}}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|_q\)
  • Constraint form: \(\min_{\boldsymbol{\beta}}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 \;\text{ subject to }\; \|\boldsymbol{\beta}\|_q \leq t\)

For every \(\lambda \geq 0\) there is a corresponding budget \(t \geq 0\) that produces the same solution. At \(\lambda = 0\) the OLS solution is recovered; at \(\lambda \to \infty\) all coefficients are driven to zero.

Why standardise predictors?

The penalty \(\lambda \sum_j \beta_j^2\) penalises coefficients by their magnitude. If predictors are measured in very different units, the penalty falls disproportionately on the predictors with larger natural scales. The fix is to standardise every predictor to zero mean and unit variance before fitting:

\[X_j \leftarrow \frac{X_j - \bar{X}_j}{s_j}.\]

The intercept \(\beta_0\) is never penalised.

Ridge Regression (\(L_2\) Penalty)

ImportantRidge Regression

\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|^2_2.\]

Closed-form solution: \[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}.\]

Adding \(\lambda\mathbf{I}\) shifts every eigenvalue of \(\mathbf{X}^\top\mathbf{X}\) up by \(\lambda\), making the matrix invertible for any \(\lambda > 0\) regardless of multicollinearity. Ridge shrinks all coefficients toward zero but never sets them exactly to zero — all predictors are retained. It is particularly useful when predictors are correlated.

Lasso (\(L_1\) Penalty)

ImportantLasso — Least Absolute Shrinkage and Selection Operator

\[\hat{\boldsymbol{\beta}}_{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\|\boldsymbol{\beta}\|_1,\]

where \(\|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|\).

The \(L_1\) penalty induces sparsity: for large enough \(\lambda\), some \(\hat{\beta}_j\) are set exactly to zero, so Lasso simultaneously shrinks and selects predictors. There is no closed-form solution — fitting requires coordinate descent.

Why does sparsity happen?

In the constraint form, the \(L_1\) constraint region is a diamond with sharp corners on the coordinate axes. The RSS contours are ellipses expanding outward from the OLS solution. As the constraint tightens, the first contact between the ellipse and the diamond most often occurs at a corner — where one or more \(\beta_j = 0\) exactly. The \(L_2\) constraint region (a smooth sphere) has no corners, which is why Ridge never produces exact zeros.

Constraint regions for Ridge (L₂ ball, left) and Lasso (L₁ diamond, right). RSS contours are shown as ellipses centred at the OLS solution. The Lasso solution touches a corner, setting β₁ = 0 exactly.

Elastic Net

ImportantElastic Net

\[\hat{\boldsymbol{\beta}}_{\text{EN}} = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2_2 + \lambda\!\left[\alpha\|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2}\|\boldsymbol{\beta}\|^2_2\right],\]

where \(\alpha \in [0, 1]\) controls the mix: \(\alpha = 1\) is pure Lasso, \(\alpha = 0\) is pure Ridge.

Elastic Net combines the sparsity of Lasso with the grouping property of Ridge. When predictors are correlated, Lasso tends to arbitrarily select one from a group; Elastic Net tends to select the group together.

Practical guidance

Ridge Lasso Elastic Net
Variable selection No Yes Yes
Correlated predictors Handles well Picks one arbitrarily Selects group
Works when \(p > n\) Yes Yes (selects \(\leq n\)) Yes
Interpretability All retained Sparse Sparse
  • Use Ridge when most predictors contribute something (dense signal).
  • Use Lasso when only a few predictors truly matter (sparse signal).
  • Use Elastic Net when predictors are correlated and sparse.

Regularization Paths

The regularization path shows how coefficients change as \(\lambda\) increases. Ridge shrinks all coefficients smoothly; Lasso sets some to exactly zero at a specific threshold.

feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'AveOccup', 'Population', 'Latitude', 'Longitude']
alphas = np.logspace(-3, 4, 200)

ridge_coefs = np.array([Ridge(alpha=a).fit(X_scaled, y_full).coef_ for a in alphas])
lasso_coefs = np.array([Lasso(alpha=a, max_iter=10000).fit(X_scaled, y_full).coef_
                        for a in alphas])

ridge_cv = RidgeCV(alphas=np.logspace(-3, 4, 100), cv=10)
ridge_cv.fit(X_scaled, y_full)
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 100), cv=10, max_iter=10_000, random_state=0)
lasso_cv.fit(X_scaled, y_full)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
palette   = sns.color_palette('tab10', n_colors=8)

for j, name in enumerate(feature_names):
    axes[0].plot(np.log10(alphas), ridge_coefs[:, j], lw=1.8, color=palette[j], label=name)
    axes[1].plot(np.log10(alphas), lasso_coefs[:, j], lw=1.8, color=palette[j], label=name)

for ax, title, best_a in [(axes[0], 'Ridge Coefficient Path', ridge_cv.alpha_),
                           (axes[1], 'Lasso Coefficient Path', lasso_cv.alpha_)]:
    ax.axvline(np.log10(best_a), color='crimson', lw=2, linestyle='--',
               label=f'CV-selected $\\lambda$')
    ax.axhline(0, color='gray', lw=0.8)
    ax.set_xlabel('$\\log_{10}(\\lambda)$'); ax.set_ylabel('Coefficient value')
    ax.set_title(title); ax.legend(fontsize=7, ncol=2)

plt.suptitle('Regularization Paths — California Housing', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Coefficient paths for Ridge (left) and Lasso (right) on California housing. Ridge shrinks all coefficients smoothly; Lasso zeroes some out, performing variable selection. The dashed red line marks the CV-selected λ.

Implementation: Diabetes Dataset

We demonstrate the full workflow on sklearn’s diabetes dataset: \(n = 442\) patients, \(p = 10\) predictors (age, sex, BMI, blood pressure, and six serum measurements), and a continuous response measuring disease progression one year after baseline. The predictors are already mean-centred and scaled.

Train-test split

diabetes  = load_diabetes()
X_diab    = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_diab    = pd.Series(diabetes.target, name='progression')

X_train, X_test, y_train, y_test = train_test_split(
    X_diab, y_diab, test_size=0.25, random_state=42
)
print(f"Training:  {X_train.shape[0]} obs  |  Test: {X_test.shape[0]} obs")
Training:  331 obs  |  Test: 111 obs

Fitting Ridge, Lasso, and Elastic Net

ridge_model = Ridge(alpha=1).fit(X_train, y_train)
lasso_model = Lasso(alpha=1).fit(X_train, y_train)
enet_model  = ElasticNet(alpha=1, l1_ratio=0.5).fit(X_train, y_train)

print(f"Ridge  (α=1) — Features retained: {(ridge_model.coef_ != 0).sum()}/10"
      f"   Test MSE: {np.mean((ridge_model.predict(X_test) - y_test)**2):.2f}")
print(f"Lasso  (α=1) — Features retained: {(lasso_model.coef_ != 0).sum()}/10"
      f"   Test MSE: {np.mean((lasso_model.predict(X_test) - y_test)**2):.2f}")
print(f"ElNet  (α=1) — Features retained: {(enet_model.coef_ != 0).sum()}/10"
      f"   Test MSE: {np.mean((enet_model.predict(X_test) - y_test)**2):.2f}")
Ridge  (α=1) — Features retained: 10/10   Test MSE: 3105.47
Lasso  (α=1) — Features retained: 3/10   Test MSE: 3433.16
ElNet  (α=1) — Features retained: 9/10   Test MSE: 5554.23

Choosing \(\lambda\) by cross-validation

ridge_cv_d = RidgeCV(alphas=np.logspace(-3, 4, 100), cv=10).fit(X_train, y_train)
lasso_cv_d = LassoCV(alphas=np.logspace(-3, 2, 100), cv=10,
                     max_iter=10_000, random_state=42).fit(X_train, y_train)
enet_cv_d  = ElasticNetCV(alphas=np.logspace(-3, 2, 100),
                          l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0],
                          cv=10, max_iter=10_000, random_state=42).fit(X_train, y_train)

print(f"Ridge best λ:   {ridge_cv_d.alpha_:.4f}")
print(f"Lasso best λ:   {lasso_cv_d.alpha_:.4f}")
print(f"ElNet best λ:   {enet_cv_d.alpha_:.4f}   l1_ratio: {enet_cv_d.l1_ratio_:.2f}")
Ridge best λ:   0.0955
Lasso best λ:   0.0045
ElNet best λ:   0.0045   l1_ratio: 1.00

Comparing all methods

ols_model = LinearRegression().fit(X_train, y_train)
models = {
    'OLS':                    ols_model,
    'Ridge (α=1)':            ridge_model,
    'Ridge (CV-tuned)':       ridge_cv_d,
    'Lasso (α=1)':            lasso_model,
    'Lasso (CV-tuned)':       lasso_cv_d,
    'Elastic Net (CV-tuned)': enet_cv_d,
}

print(f"{'Method':<28} {'Test MSE':>10} {'# Features':>12}")
print("-" * 52)
for name, mdl in models.items():
    y_hat  = mdl.predict(X_test)
    mse    = np.mean((y_hat - y_test)**2)
    n_feat = (mdl.coef_ != 0).sum()
    print(f"{name:<28} {mse:>10.2f} {n_feat:>12}")
Method                         Test MSE   # Features
----------------------------------------------------
OLS                             2848.31           10
Ridge (α=1)                     3105.47           10
Ridge (CV-tuned)                2810.61           10
Lasso (α=1)                     3433.16            3
Lasso (CV-tuned)                2839.42            9
Elastic Net (CV-tuned)          2839.42            9

CV-tuned Ridge and Lasso consistently outperform the arbitrary \(\alpha = 1\) choice and are competitive with OLS on a problem where all 10 predictors are informative. On datasets where only a few predictors are truly relevant, Lasso’s variable selection advantage becomes more pronounced.

Back to top