PSTAT100: Data Science — Concepts and Analysis
May 6, 2026
Regression
Regression is the task of estimating the conditional expectation function \mu(x) = \mathbb{E}[Y \mid X = x] from observed data (x_1, y_1), \ldots, (x_n, y_n).
| Task | Response Y | Example |
|---|---|---|
| Regression | Continuous | House price, temperature, body mass |
| Classification | Categorical | Spam / not-spam, tumour type |
Simple Linear Regression Model
We assume the data-generating process is: Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n, where:
1. (GM1: Linearity) \mathbb{E}[\varepsilon_i] = 0, so the mean response is exactly \beta_0 + \beta_1 x_i;
2. (GM2: Independence) the errors \varepsilon_1, \ldots, \varepsilon_n are independent;
3. (GM3: Homoscedasticity) \text{Var}(\varepsilon_i) = \sigma^2 for every i;
4. (GM4: Normality) \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
Together, GM1–GM4 imply that Y_i \stackrel{\text{ind.}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i,\, \sigma^2): the Y_i are independent but not identically distributed, since each mean depends on x_i.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n, beta0_true, beta1_true, sigma_true = 80, 2.0, 1.5, 2.5
x = rng.uniform(0, 10, size=n)
y = beta0_true + beta1_true * x + rng.normal(0, sigma_true, size=n)
x_line = np.linspace(0, 10, 200)
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.6, color='steelblue', label='Observed data $(x_i, Y_i)$', zorder=3)
ax.plot(x_line, beta0_true + beta1_true * x_line, color='crimson', lw=2.5,
label=rf'True line: $\mu(x) = {beta0_true} + {beta1_true}x$')
ax.set_xlabel('$x$')
ax.set_ylabel('$Y$')
ax.set_title('Simulated SLR Data')
ax.legend()
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.5, color='steelblue', zorder=3)
ax.plot(x_line, beta0_true + beta1_true * x_line, color='crimson', lw=2.5,
label='True line $\\mu(x)$')
# Show errors for a selection of points
rng_idx = np.random.default_rng(0)
show_idx = rng_idx.choice(n, size=14, replace=False)
for i in show_idx:
    mu_i = beta0_true + beta1_true * x[i]
    ax.plot([x[i], x[i]], [mu_i, y[i]], color='darkorange', lw=1.5, zorder=2)
ax.plot(
[], [],
color='darkorange',
lw=1.5,
label='Error $\\varepsilon_i = Y_i - \\mu(x_i)$'
)
ax.set_xlabel('$x$')
ax.set_ylabel('$Y$')
ax.set_title('Errors: Vertical Deviations from the True Line')
ax.legend()
plt.tight_layout()
plt.show()
Ordinary Least Squares (OLS)
The OLS estimators \hat{\beta}_0 and \hat{\beta}_1 minimise the residual sum of squares: \text{RSS}(b_0, b_1) = \sum_{i=1}^n \bigl(y_i - b_0 - b_1 x_i\bigr)^2.
Taking \partial \text{RSS}/\partial b_0 = 0 and \partial \text{RSS}/\partial b_1 = 0:
-2\sum_{i=1}^n (y_i - b_0 - b_1 x_i) = 0 \qquad \text{and} \qquad -2\sum_{i=1}^n x_i(y_i - b_0 - b_1 x_i) = 0.
Solving these normal equations (the first immediately gives b_0 = \bar{y} - b_1\bar{x}; substituting this into the second and simplifying isolates b_1) gives:
\boxed{\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x},}
where S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 and \quad S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).
If we assume GM1–GM4 then Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\, \sigma^2), \quad i = 1, \ldots, n.
The likelihood function is: \begin{aligned} L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n f_Y(y_i) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2\right). \end{aligned}
The log-likelihood function is: l(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2}_{\text{RSS}}.
Maximising L with respect to \beta_0 and \beta_1 is equivalent to minimising the RSS, so: \hat{\beta}_0^{\text{MLE}} = \hat{\beta}_0^{\text{OLS}}, \qquad \hat{\beta}_1^{\text{MLE}} = \hat{\beta}_1^{\text{OLS}}.
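To see the equivalence numerically, here is a quick sketch (assuming scipy is available) that minimises the negative log-likelihood directly; its first two coordinates should closely match the closed-form OLS estimates computed next.
from scipy.optimize import minimize

def nll(theta):
    # Negative log-likelihood in (b0, b1, log_sigma); the log-parametrisation keeps sigma > 0
    b0, b1, log_s = theta
    s2 = np.exp(2 * log_s)
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum((y - b0 - b1 * x)**2) / (2 * s2)

res = minimize(nll, x0=[0.0, 0.0, 0.0], method='Nelder-Mead')
print(res.x[:2])  # should closely match (beta0_hat, beta1_hat) below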
# Step 1: Compute the sufficient statistics
x_bar = x.mean()
y_bar = y.mean()
S_xx = np.sum((x - x_bar)**2)
S_xy = np.sum((x - x_bar) * (y - y_bar))
# Step 2: OLS slope and intercept
beta1_hat = S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar
print(f"True: β₀ = {beta0_true:.2f}, β₁ = {beta1_true:.2f}")
print(f"Estimated: β̂₀ = {beta0_hat:.3f}, β̂₁ = {beta1_hat:.3f}")
# Step 3: Plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.4, color='steelblue', label='Data', zorder=3)
ax.plot(x_line, beta0_true + beta1_true * x_line, 'crimson', lw=2, label='True line')
ax.plot(x_line, beta0_hat + beta1_hat * x_line, 'navy', lw=2, linestyle='--',
label=f'OLS fit: $\\hat{{y}} = {beta0_hat:.2f} + {beta1_hat:.2f}x$')
ax.set_xlabel('$x$'); ax.set_ylabel('$Y$')
ax.legend(); plt.tight_layout(); plt.show()
True: β₀ = 2.00, β₁ = 1.50
Estimated: β̂₀ = 1.103, β̂₁ = 1.616
Fitted Values and Residuals
The fitted value for observation i is \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.
The residual is the difference between the observed and fitted response: e_i = y_i - \hat{y}_i.
y_hat = beta0_hat + beta1_hat * x   # fitted values for each observation
residuals = y - y_hat               # residuals e_i, reused for RSS below
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.4, color='steelblue', zorder=3, label='Data')
ax.plot(x_line, beta0_hat + beta1_hat * x_line, 'navy', lw=2, label='OLS fit')
rng_idx2 = np.random.default_rng(1)
show_idx2 = rng_idx2.choice(n, size=16, replace=False)
for i in show_idx2:
    ax.plot([x[i], x[i]], [y_hat[i], y[i]], color='darkorange', lw=1.5, zorder=2)
ax.plot([], [], color='darkorange', lw=1.5, label='Residual $e_i = y_i - \\hat{y}_i$')
ax.set_xlabel('$x$'); ax.set_ylabel('$Y$')
ax.set_title('OLS Residuals: Vertical Deviations from the Fitted Line')
ax.legend()
plt.tight_layout()
plt.show()
We can write \hat{\beta}_1 = \sum_{i=1}^n c_i Y_i with weights c_i = (x_i - \bar{x})/S_{xx}, so the slope estimator is a linear combination of the responses.
Since Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i and \mathbb{E}[\varepsilon_i] = 0:
\begin{aligned} \mathbb{E}[\hat{\beta}_1] & = \sum_{i=1}^n c_i \mathbb{E}[Y_i] = \beta_0\underbrace{\sum_{i=1}^n c_i}_{=\frac{n\bar{x}-n\bar{x}}{S_{xx}}=0} + \beta_1 \underbrace{\sum_{i=1}^nc_ix_i}_{=\frac{S_{xx}}{S_{xx}}=1} = \beta_1. \quad \checkmark \end{aligned}
An analogous argument gives \mathbb{E}[\hat{\beta}_0] = \beta_0.
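Both facts used in this derivation, \sum_i c_i = 0 and \sum_i c_i x_i = 1, along with the linear-combination form itself, can be verified numerically; a quick sketch using the quantities already computed above:
c = (x - x_bar) / S_xx                       # weights c_i = (x_i - x_bar) / S_xx
print(np.isclose(np.sum(c * y), beta1_hat))  # beta1_hat really is sum_i c_i y_i
print(np.isclose(np.sum(c), 0.0))            # the weights sum to 0
print(np.isclose(np.sum(c * x), 1.0))        # the weighted x_i sum to 1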
Under GM1–GM4, the exact sampling variances are:
\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad \text{Var}(\hat{\beta}_0) = \sigma^2\!\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right).
| Factor | Effect on \text{Var}(\hat{\beta}_1) |
|---|---|
| More spread in x (↑ S_{xx}) | ↓ Variance — better leverage |
| Larger sample size (↑ n) | ↓ Variance — more information |
| Less noise (↓ \sigma^2) | ↓ Variance — cleaner signal |
Since \sigma^2 is unknown, we estimate it with the residual variance: \hat{\sigma}^2 = \frac{\text{RSS}}{n - 2}.
We divide by n - 2 (not n) because estimating \hat{\beta}_0 and \hat{\beta}_1 costs two degrees of freedom.
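As a sanity check, the following simulation sketch (reusing x, the true parameters, and S_{xx} from the setup above) suggests why n - 2 is the right divisor: RSS/(n - 2) should average out near \sigma^2 = 6.25, while RSS/n sits below it.
# Simulation sketch: RSS/(n-2) is unbiased for sigma^2, while RSS/n is biased low
rng_v = np.random.default_rng(11)
s2_unbiased, s2_biased = [], []
for _ in range(2000):
    y_v = beta0_true + beta1_true * x + rng_v.normal(0, sigma_true, size=n)
    b1_v = np.sum((x - x_bar) * (y_v - y_v.mean())) / S_xx
    b0_v = y_v.mean() - b1_v * x_bar
    rss_v = np.sum((y_v - b0_v - b1_v * x)**2)
    s2_unbiased.append(rss_v / (n - 2))
    s2_biased.append(rss_v / n)
print(f"mean RSS/(n-2) = {np.mean(s2_unbiased):.3f}  (true σ² = {sigma_true**2:.2f})")
print(f"mean RSS/n     = {np.mean(s2_biased):.3f}  (biased low)")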
Gauss-Markov Theorem
Under assumptions GM1–GM3 (linearity, independence, homoscedasticity — normality is not required), the OLS estimators \hat{\beta}_0 and \hat{\beta}_1 are BLUE:
Best Linear Unbiased Estimators: among all estimators that are linear in the Y_i and unbiased, OLS has the smallest variance.
import seaborn as sns

rng2 = np.random.default_rng(7)
beta1_sims = []
for _ in range(2000):
    y_sim = beta0_true + beta1_true * x + rng2.normal(0, sigma_true, size=n)
    S_xy_s = np.sum((x - x_bar) * (y_sim - y_sim.mean()))
    beta1_sims.append(S_xy_s / S_xx)
beta1_sims = np.array(beta1_sims)
theoretical_se = sigma_true / np.sqrt(S_xx)
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(beta1_sims, bins=45, kde=True, color='steelblue', ax=ax)
ax.axvline(beta1_true, color='crimson', lw=2.5, linestyle='--',
label=f'True β1 = {beta1_true}')
ax.axvline(beta1_sims.mean(), color='navy', lw=2, linestyle=':',
label=f'Mean β̂₁ across simulations = {beta1_sims.mean():.3f}')
ax.set_xlabel(r'$\hat{\beta}_1$'); ax.set_ylabel('Density')
ax.set_title(f'Sampling Distribution of $\\hat{{\\beta}}_1$ '
f'(theoretical SE = {theoretical_se:.3f}, empirical SE = {beta1_sims.std():.3f})')
ax.legend()
plt.tight_layout()
plt.show()
Since \hat{\beta}_1 = \sum_i c_i Y_i is a linear combination of independent normals:
\hat{\beta}_1 \sim \mathcal{N}\!\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right).
Because we estimate \sigma^2 from the same data, we incur extra uncertainty. This leads to:
T = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \sim t_{n-2}, \qquad s_{\hat{\beta}_1} = \frac{\hat{\sigma}}{\sqrt{S_{xx}}}.
The most common test is: H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0.
If \beta_1 = 0, then X has no linear association with Y in the population.
T = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} \sim t_{n-2} \quad \text{under } H_0.
p\text{-value} = 2\mathbb{P}(t_{n-2} \geq |T_{\text{obs}}|).
from scipy.stats import t as t_dist

RSS = np.sum(residuals**2)
sigma_hat = np.sqrt(RSS / (n - 2))
SE_b1 = sigma_hat / np.sqrt(S_xx)
SE_b0 = sigma_hat * np.sqrt(1/n + x_bar**2 / S_xx)
t_b1 = beta1_hat / SE_b1
p_b1 = 2 * t_dist.sf(abs(t_b1), df=n - 2)
print(f"β̂₁ = {beta1_hat:.4f}, SE = {SE_b1:.4f}")
print(f"t = {t_b1:.3f}, p = {p_b1:.2e}")β̂₁ = 1.6164, SE = 0.0968
t = 16.691, p = 1.84e-27
A (1 - \alpha) \times 100\% confidence interval for \beta_j is:
\hat{\beta}_j \;\pm\; t_{\alpha/2,\, n-2} \cdot s_{\hat{\beta}_j}.
alpha = 0.05
t_crit = t_dist.ppf(1 - alpha/2, df=n - 2)
CI_b1 = (beta1_hat - t_crit * SE_b1, beta1_hat + t_crit * SE_b1)
CI_b0 = (beta0_hat - t_crit * SE_b0, beta0_hat + t_crit * SE_b0)
print(f"95% CI for β₁: ({CI_b1[0]:.3f}, {CI_b1[1]:.3f})")
print(f"95% CI for β₀: ({CI_b0[0]:.3f}, {CI_b0[1]:.3f})")
print(f"True β₁ = {beta1_true} ← inside CI? {CI_b1[0] < beta1_true < CI_b1[1]}")95% CI for β₁: (1.424, 1.809)
95% CI for β₀: (-0.004, 2.209)
True β₁ = 1.5 ← inside CI? True
Variance Decomposition (ANOVA Identity)
\underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{SST (Total)}} = \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}_{\text{SSR (Regression)}} + \underbrace{\sum_{i=1}^n e_i^2}_{\text{SSE (Residual)}}.
| Quantity | Name | Degrees of Freedom | What it measures |
|---|---|---|---|
| SST | Total SS | n - 1 | Total variability in Y |
| SSR | Regression SS | 1 | Variability explained by the model |
| SSE | Residual SS | n - 2 | Variability unexplained (error) |
Coefficient of Determination R^2
R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}. R^2 is the proportion of variance in Y explained by the linear model.
SST = np.sum((y - y_bar)**2)
SSE_v = np.sum(residuals**2)
SSR = SST - SSE_v
R2 = SSR / SST
r = np.corrcoef(x, y)[0, 1]
print(f"SST = {SST:.2f} (total)")
print(f"SSR = {SSR:.2f} (explained)")
print(f"SSE = {SSE_v:.2f} (residual)")
print(f"R² = SSR/SST = {R2:.4f}")
print(f"r = {r:.4f} → r² = {r**2:.4f} ✓")
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.35, color='steelblue', zorder=3, label='Data')
ax.plot(x_line, beta0_hat + beta1_hat * x_line, 'navy', lw=2, label='OLS fit $\\hat{y}$')
ax.axhline(y_bar, color='gray', lw=1.5, linestyle='--', label='$\\bar{y}$')
i0 = np.argmin(np.abs(x - 6.8))
ax.annotate('', xy=(x[i0], y[i0]), xytext=(x[i0], y_bar),
arrowprops=dict(arrowstyle='<->', color='crimson', lw=2))
ax.annotate('', xy=(x[i0], y_hat[i0]), xytext=(x[i0], y_bar),
arrowprops=dict(arrowstyle='<->', color='seagreen', lw=2))
ax.annotate('', xy=(x[i0], y[i0]), xytext=(x[i0], y_hat[i0]),
arrowprops=dict(arrowstyle='<->', color='darkorange', lw=2))
ax.plot([], [], color='crimson', label='$y_i - \\bar{y}$ (SST component)')
ax.plot([], [], color='seagreen', label='$\\hat{y}_i - \\bar{y}$ (SSR component)')
ax.plot([], [], color='darkorange', label='$e_i$ (SSE component)')
ax.legend(fontsize=9); plt.tight_layout(); plt.show()
SST = 1952.34 (total)
SSR = 1525.27 (explained)
SSE = 427.07 (residual)
R² = SSR/SST = 0.7813
r = 0.8839 → r² = 0.7813 ✓
For a new input x^*, the point prediction is: \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*.
We may want to estimate:
1. the mean response \mathbb{E}[Y \mid X = x^*] at x^*, or
2. the value of a single new observation Y^* at x^*.
Both share the same point estimate \hat{y}^*, but they involve fundamentally different uncertainties.
Confidence interval for \mathbb{E}[Y \mid X = x^*] — uncertainty in the mean response:
\hat{y}^* \;\pm\; t_{\alpha/2,\, n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}.
Prediction interval for a new observation Y^* — uncertainty in a single future value:
\hat{y}^* \;\pm\; t_{\alpha/2,\, n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}.
x_pred = np.linspace(0, 10, 300)
y_pred = beta0_hat + beta1_hat * x_pred
t_crit2 = t_dist.ppf(0.975, df=n - 2)
se_mean = sigma_hat * np.sqrt(1/n + (x_pred - x_bar)**2 / S_xx)
se_pi = sigma_hat * np.sqrt(1 + 1/n + (x_pred - x_bar)**2 / S_xx)
fig, ax = plt.subplots(figsize=(11, 5))
ax.scatter(x, y, alpha=0.4, color='steelblue', label='Data', zorder=3)
ax.plot(x_pred, y_pred, 'navy', lw=2, label='OLS fit')
ax.fill_between(x_pred,
y_pred - t_crit2 * se_mean,
y_pred + t_crit2 * se_mean,
alpha=0.35, color='navy', label='95% CI for $\\mathbb{E}[Y|x^*]$')
ax.fill_between(x_pred,
y_pred - t_crit2 * se_pi,
y_pred + t_crit2 * se_pi,
alpha=0.12, color='steelblue', label='95% Prediction interval')
ax.set_xlabel('$x$'); ax.set_ylabel('$Y$')
ax.set_title('Confidence Interval for the Mean Response vs. Prediction Interval')
ax.legend()
plt.tight_layout()
plt.show()
| Step | Task | Key Formula / Tool |
|---|---|---|
| 1. Specify | Write down the statistical model | Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i |
| 2. Estimate | Minimise RSS | \hat{\beta}_1 = S_{xy}/S_{xx}, \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} |
| 3. Check properties | Verify unbiasedness, invoke Gauss-Markov | BLUE under GM1–GM3 |
| 4. Inference | t-tests and CIs for coefficients | T = \hat{\beta}_1 / \widehat{\text{SE}}(\hat{\beta}_1) \sim t_{n-2} |
| 5. Fit quality | Decompose variation | R^2 = \text{SSR}/\text{SST} |
| 6. Predict | Point estimate + interval | CI for mean vs. PI for new obs. |
Next: fitting this model with statsmodels and reading every line of the regression summary.
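As a preview, here is a minimal sketch (assuming statsmodels is installed) that fits the same model; the coefficients, standard errors, t-statistics, p-values, confidence intervals, and R² derived above all appear in its output.
import statsmodels.api as sm

X = sm.add_constant(x)    # design matrix: a column of ones plus x
fit = sm.OLS(y, X).fit()  # OLS fit of y on [1, x]
print(fit.summary())      # coefficients, SEs, t-tests, CIs, and R² in one table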