PSTAT100: Data Science — Concepts and Analysis
May 6, 2026
In this lecture we fit and interpret simple linear regression models with statsmodels.

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Can we predict a penguin’s body mass from its flipper length?
We focus on two columns: flipper_length_mm (the predictor) and body_mass_g (the response).

<class 'pandas.DataFrame'>
Index: 342 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 342 non-null str
1 island 342 non-null str
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null str
dtypes: float64(4), str(3)
memory usage: 21.4 KB
None
| | flipper_length_mm | body_mass_g |
|---|---|---|
| count | 342.00 | 342.00 |
| mean | 200.92 | 4201.75 |
| std | 14.06 | 801.95 |
| min | 172.00 | 2700.00 |
| 25% | 190.00 | 3550.00 |
| 50% | 197.00 | 4050.00 |
| 75% | 213.00 | 4750.00 |
| max | 231.00 | 6300.00 |
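The outputs above come from the standard load-and-inspect calls. A minimal sketch that would reproduce them, assuming seaborn's bundled penguins dataset and dropping rows that are missing the two measurements we model:

import seaborn as sns

# Load the data, peek at the first rows, then drop rows missing our two variables
penguins = sns.load_dataset('penguins')
print(penguins.head())
penguins = penguins.dropna(subset=['flipper_length_mm', 'body_mass_g'])
print(penguins.info())
print(penguins[['flipper_length_mm', 'body_mass_g']].describe().round(2))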
fig, axes = plt.subplots(1, 2, figsize=(12, 3.5))
for ax, col, label in zip(axes,
['flipper_length_mm', 'body_mass_g'],
['Flipper length (mm)', 'Body mass (g)']):
for sp, grp in penguins.groupby('species'):
ax.hist(grp[col], bins=18, alpha=0.55, label=sp, density=True)
ax.set_xlabel(label); ax.set_ylabel('Density')
ax.set_title(f'Distribution of {label}')
ax.legend(fontsize=9)
plt.tight_layout(); plt.show()

Before fitting any model, inspect the scatterplot for linearity, outliers, and subgroup structure.
flipper_length_mm body_mass_g
flipper_length_mm 1.0000 0.8712
body_mass_g 0.8712 1.0000
r = 0.8712
r² = 0.7590
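The correlation matrix and the r, r² values shown above can be reproduced with pandas; a brief sketch using the penguins frame:

# Pearson correlation between predictor and response
corr = penguins[['flipper_length_mm', 'body_mass_g']].corr()
r = corr.loc['flipper_length_mm', 'body_mass_g']
print(corr.round(4))
print(f"r  = {r:.4f}")
print(f"r² = {r**2:.4f}")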
The statsmodels OLS API offers two ways to specify a model (see the sketch below):

- Formula interface (recommended for readability)
- Array interface (more control, used in MLR)
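A minimal sketch of the two call styles; the names model and penguins match the rest of this lecture, while X, y, and model_arr are illustrative:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Formula interface: R-style string, intercept added automatically
model = smf.ols('body_mass_g ~ flipper_length_mm', data=penguins).fit()

# Array interface: build the design matrix yourself and add the intercept explicitly
X = sm.add_constant(penguins[['flipper_length_mm']])
y = penguins['body_mass_g']
model_arr = sm.OLS(y, X).fit()

print(model.summary())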
The formula 'body_mass_g ~ flipper_length_mm' maps directly to Y_i = \beta_0 + \beta_1 \cdot \text{flipper\_length}_i + \varepsilon_i; statsmodels automatically adds the intercept \beta_0.

                            OLS Regression Results
==============================================================================
Dep. Variable: body_mass_g R-squared: 0.759
Model: OLS Adj. R-squared: 0.758
Method: Least Squares F-statistic: 1071.
Date: Wed, 06 May 2026 Prob (F-statistic): 4.37e-107
Time: 15:13:19 Log-Likelihood: -2528.4
No. Observations: 342 AIC: 5061.
Df Residuals: 340 BIC: 5069.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -5780.8314 305.815 -18.903 0.000 -6382.358 -5179.305
flipper_length_mm 49.6856 1.518 32.722 0.000 46.699 52.672
==============================================================================
Omnibus: 5.634 Durbin-Watson: 2.176
Prob(Omnibus): 0.060 Jarque-Bera (JB): 5.585
Skew: 0.313 Prob(JB): 0.0613
Kurtosis: 3.019 Cond. No. 2.89e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The top block reports global information about the fit:
| Field | Meaning |
|---|---|
| Dep. Variable | The response variable Y |
| No. Observations | Sample size n |
| R-squared | Proportion of variance in Y explained by the model (R^2) |
| Adj. R-squared | R^2 adjusted for the number of predictors (more on this in Lecture 11) |
| F-statistic | Test of H_0: \beta_1 = 0 (all slopes are zero); equivalent to the t-test in SLR |
| Prob (F-statistic) | p-value for the F-test |
| AIC / BIC | Information criteria for model comparison (lower is better) |
| Log-Likelihood | Value of the log-likelihood at the MLE |
In SLR, Prob (F-statistic) and the p-value for \hat{\beta}_1 are identical; both test whether \beta_1 = 0.
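A quick check you can run on the fitted model object: in SLR the F statistic is exactly the square of the slope's t statistic.

# F equals t² when there is a single slope
t_slope = model.tvalues['flipper_length_mm']
print(f"t² = {t_slope**2:.2f}")
print(f"F  = {model.fvalue:.2f}")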
# Extract the coefficients table as a DataFrame
coef_table = model.summary2().tables[1]
print(coef_table.round(4))
# Intercept β₀
b0 = model.params['Intercept']
print(f"\nIntercept β̂₀ = {b0:.2f} g")
# Slope β₁
b1 = model.params['flipper_length_mm']
print(f"Slope β̂₁ = {b1:.4f} g per mm")
# 95% confidence intervals
ci = model.conf_int()
print(f"\n95% CI for β̂₁: ({ci.loc['flipper_length_mm', 0]:.3f}, {ci.loc['flipper_length_mm', 1]:.3f})") Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept -5780.8314 305.8145 -18.9031 0.0 -6382.3580 -5179.3047
flipper_length_mm 49.6856 1.5184 32.7222 0.0 46.6989 52.6722
Intercept β̂₀ = -5780.83 g
Slope β̂₁ = 49.6856 g per mm
95% CI for β̂₁: (46.699, 52.672)
| Column | Symbol | Meaning |
|---|---|---|
| coef | \hat{\beta}_j | OLS estimate |
| std err | \widehat{\text{SE}}(\hat{\beta}_j) | Standard error of the estimate |
| t | T = \hat{\beta}_j / \widehat{\text{SE}} | Test statistic for H_0: \beta_j = 0 |
| P>\|t\| | p-value | Evidence against H_0, not the probability that H_0 is true |
| [0.025, 0.975] | 95% CI | Range of plausible values for \beta_j |
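To connect the columns, here is a small sketch that recomputes the slope's t statistic and 95% CI by hand from the fitted model (model.bse holds the standard errors):

from scipy import stats

b1 = model.params['flipper_length_mm']       # β̂₁ (coef column)
se1 = model.bse['flipper_length_mm']         # std err column
print(f"t = {b1 / se1:.3f}")                 # t column
t_crit = stats.t.ppf(0.975, df=model.df_resid)
print(f"95% CI: ({b1 - t_crit*se1:.3f}, {b1 + t_crit*se1:.3f})")   # [0.025, 0.975] columns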
| Field | What it tests |
|---|---|
| Omnibus / Prob(Omnibus) | Normality of residuals (combined skewness + kurtosis test) |
| Skew | Skewness of the residuals (ideal: \approx 0) |
| Kurtosis | Kurtosis of the residuals (ideal: \approx 3) |
| Jarque-Bera (JB) / Prob(JB) | Another normality test for residuals |
| Durbin-Watson | Tests for serial correlation in residuals (ideal: \approx 2) |
| Cond. No. | Condition number; flags potential multicollinearity (relevant for MLR) |
These statistics give an initial indication — we will inspect assumptions visually through diagnostic plots in the next section.
# Fitted values and residuals
y_hat = model.fittedvalues
resid = model.resid
# Model fit statistics
print(f"R² = {model.rsquared:.4f}")
print(f"Adj. R² = {model.rsquared_adj:.4f}")
# F-test
print(f"\nF-statistic = {model.fvalue:.2f}")
print(f"Prob(F) = {model.f_pvalue:.2e}")
# Residual std deviation (σ̂)
print(f"\nσ̂ (RMSE) = {np.sqrt(model.mse_resid):.2f} g")
print(f"n = {int(model.nobs)}, df_resid = {int(model.df_resid)}")R² = 0.7590
Adj. R² = 0.7583
F-statistic = 1070.74
Prob(F) = 4.37e-107
σ̂ (RMSE) = 394.28 g
n = 342, df_resid = 340
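These quantities can also be recovered directly from the residuals; a quick sanity check using resid and penguins from above:

# σ̂² = SSE / (n - 2); R² = 1 - SSE / SST
sse = np.sum(resid**2)
sst = np.sum((penguins['body_mass_g'] - penguins['body_mass_g'].mean())**2)
print(f"σ̂ by hand = {np.sqrt(sse / (len(resid) - 2)):.2f} g")
print(f"R² by hand = {1 - sse / sst:.4f}")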
# Build a prediction grid
x_grid = pd.DataFrame({
'flipper_length_mm': np.linspace(
penguins['flipper_length_mm'].min(),
penguins['flipper_length_mm'].max(), 200)
})
# Get predictions with confidence interval
pred_ci = model.get_prediction(x_grid).summary_frame(alpha=0.05)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'],
alpha=0.4, color='steelblue', s=35, label='Observed data', zorder=3)
ax.plot(x_grid['flipper_length_mm'], pred_ci['mean'],
color='crimson', lw=2.5, label='OLS fit')
ax.fill_between(x_grid['flipper_length_mm'],
pred_ci['mean_ci_lower'], pred_ci['mean_ci_upper'],
alpha=0.25, color='crimson', label='95% CI for mean')
ax.set_xlabel('Flipper length (mm)'); ax.set_ylabel('Body mass (g)')
ax.set_title(f'SLR: Body Mass ~ Flipper Length ($R^2$ = {model.rsquared:.3f})')
ax.legend(); plt.tight_layout(); plt.show()

pred_pi = model.get_prediction(x_grid).summary_frame(alpha=0.05)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(penguins['flipper_length_mm'], penguins['body_mass_g'],
alpha=0.4, color='steelblue', s=35, label='Observed data', zorder=3)
ax.plot(x_grid['flipper_length_mm'], pred_pi['mean'],
color='crimson', lw=2.5, label='OLS fit')
ax.fill_between(x_grid['flipper_length_mm'],
pred_pi['mean_ci_lower'], pred_pi['mean_ci_upper'],
alpha=0.35, color='crimson', label='95% CI for mean response')
ax.fill_between(x_grid['flipper_length_mm'],
pred_pi['obs_ci_lower'], pred_pi['obs_ci_upper'],
alpha=0.12, color='steelblue', label='95% Prediction interval')
ax.set_xlabel('Flipper length (mm)'); ax.set_ylabel('Body mass (g)')
ax.set_title('Confidence Band vs. Prediction Band')
ax.legend(); plt.tight_layout(); plt.show()

# A new penguin with flipper length 195 mm
new_penguin = pd.DataFrame({'flipper_length_mm': [195]})
# Point prediction
y_hat_new = model.predict(new_penguin)
print(f"Predicted mass: {y_hat_new.values[0]:.1f} g")
# Full prediction with intervals
pred_new = (
model.get_prediction(new_penguin)
.summary_frame(alpha=0.05)
)
print(f"\n95% CI for mean: ({pred_new['mean_ci_lower'].values[0]:.1f},"
f" {pred_new['mean_ci_upper'].values[0]:.1f}) g")
print(f"95% PI for obs: ({pred_new['obs_ci_lower'].values[0]:.1f},"
f" {pred_new['obs_ci_upper'].values[0]:.1f}) g")Predicted mass: 3907.9 g
95% CI for mean: (3862.3, 3953.4) g
95% PI for obs: (3131.0, 4684.7) g
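The prediction interval is wider because it accounts for the noise in a single new observation, not just uncertainty about the mean response. For reference, the standard SLR interval half-widths at a new value x_0 are t_{0.975,\,n-2} times

\widehat{\text{SE}}_{\text{mean}}(x_0) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}, \qquad \widehat{\text{SE}}_{\text{pred}}(x_0) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}.

The extra 1 under the second square root is the variance of a new error term, which is why the prediction band never collapses even as n grows.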
| Plot | Assumption checked |
|---|---|
| Residuals vs. Fitted | Linearity (GM1) and homoscedasticity (GM3) |
| Normal Q-Q | Normality of errors (GM4) |
| Scale-Location | Homoscedasticity (GM3) — a different view |
| Cook’s Distance | Influential observations |
# Extract fitted values and residuals
fitted = model.fittedvalues
residual = model.resid
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(fitted, residual, alpha=0.5, color='steelblue', s=35, zorder=3)
ax.axhline(0, color='crimson', lw=1.5, linestyle='--')
lowess = sm.nonparametric.lowess(residual, fitted, frac=0.5)
ax.plot(lowess[:, 0], lowess[:, 1], color='crimson', lw=2, label='LOWESS smoother')
ax.set_xlabel('Fitted values $\\hat{y}$'); ax.set_ylabel('Residuals $e$')
ax.set_title('Residuals vs. Fitted'); ax.legend()
plt.tight_layout(); plt.show()

| Pattern | Likely violation | Action |
|---|---|---|
| Systematic curve | Linearity (GM1) — the true relationship is nonlinear | Add polynomial term |
| Fan / funnel shape | Homoscedasticity (GM3) | Log-transform Y; use WLS |
| Striped bands | Possible discrete/omitted variable | Add grouping variable |
Our penguins plot shows mild vertical banding — a hint that an omitted variable (species) is creating structure in the residuals. This will motivate adding species as a predictor in Lecture 11.
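One quick way to see this banding (a sketch, not part of the original lecture code) is to colour the residual plot by species, reusing the fitted and residual series extracted above:

# Distinct bands per species suggest an omitted grouping variable
fig, ax = plt.subplots(figsize=(9, 4))
for sp, grp in penguins.assign(fit=fitted, res=residual).groupby('species'):
    ax.scatter(grp['fit'], grp['res'], alpha=0.5, s=30, label=sp)
ax.axhline(0, color='crimson', lw=1, linestyle='--')
ax.set_xlabel('Fitted values'); ax.set_ylabel('Residuals')
ax.set_title('Residuals vs. Fitted, coloured by species')
ax.legend(); plt.tight_layout(); plt.show()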
from scipy.stats import probplot
qq_result = probplot(residual, dist='norm')
fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(qq_result[0][0], qq_result[0][1],
alpha=0.55, color='steelblue', s=30, zorder=3)
ax.plot(qq_result[0][0],
qq_result[1][1] + qq_result[1][0] * qq_result[0][0],
color='crimson', lw=2, label='Reference line')
ax.set_xlabel('Theoretical quantiles'); ax.set_ylabel('Sample quantiles')
ax.set_title('Normal Q-Q Plot of Residuals'); ax.legend()
plt.tight_layout(); plt.show()

| Pattern | Interpretation |
|---|---|
| Points on the line | Residuals are approximately normal ✓ |
| Heavy tails (S-curve) | Leptokurtic — more extreme values than normal |
| Light tails | Platykurtic — fewer extreme values than normal |
| Systematic curve | Skewed residuals — consider transformation |
# Approximately standardised residuals (divide by σ̂; exact studentised residuals would also account for leverage)
resid_std = residual / np.sqrt(np.sum(residual**2) / (len(residual) - 2))
sqrt_abs = np.sqrt(np.abs(resid_std))
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(fitted, sqrt_abs, alpha=0.5, color='steelblue', s=35, zorder=3)
lowess = sm.nonparametric.lowess(sqrt_abs, fitted, frac=0.5)
ax.plot(lowess[:, 0], lowess[:, 1], color='crimson', lw=2, label='LOWESS smoother')
ax.set_xlabel('Fitted values $\\hat{y}$')
ax.set_ylabel('$\\sqrt{|\\text{Standardised residuals}|}$')
ax.set_title('Scale-Location Plot'); ax.legend()
plt.tight_layout(); plt.show()

Cook’s Distance
Cook’s distance D_i measures how much the fitted values change if observation i is removed. A large D_i indicates an influential point — one that substantially affects the estimated coefficients.
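In symbols, the usual definition is the following, where \hat{y}_{j(i)} is the j-th fitted value when observation i is excluded, p is the number of estimated coefficients (p = 2 in SLR), and \hat{\sigma}^2 is the residual mean square:

D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, \hat{\sigma}^2}.

So D_i is large when deleting observation i moves the fitted line appreciably.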
# Compute influence measures via statsmodels
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
threshold = 4 / len(penguins)
fig, ax = plt.subplots(figsize=(11, 4))
ax.stem(range(len(cooks_d)), cooks_d,
linefmt='steelblue', markerfmt='o', basefmt=' ')
ax.axhline(threshold, color='crimson', linestyle='--',
label=f'Threshold 4/n = {threshold:.4f}')
# Flag influential points
flagged = np.where(cooks_d > threshold)[0]
print(f"Observations above threshold: {len(flagged)}")
ax.set_xlabel('Observation index'); ax.set_ylabel("Cook's distance $D_i$")
ax.set_title("Cook's Distance"); ax.legend()
plt.tight_layout(); plt.show()

Observations above threshold: 15
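For reference, statsmodels also ships ready-made versions of some of these diagnostics; two one-liners that could stand in for the manual constructions above, applied to the model fitted earlier:

import statsmodels.api as sm

# Q-Q plot of residuals with a standardised reference line
sm.qqplot(model.resid, line='s')

# Studentised residuals vs. leverage, with marker size based on Cook's distance
sm.graphics.influence_plot(model)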
fig, axes = plt.subplots(2, 2, figsize=(13, 6))
# 1. Residuals vs Fitted
axes[0,0].scatter(fitted, residual, alpha=0.45, color='steelblue', s=25, zorder=3)
axes[0,0].axhline(0, color='crimson', lw=1.5, linestyle='--')
lw1 = sm.nonparametric.lowess(residual, fitted, frac=0.5)
axes[0,0].plot(lw1[:,0], lw1[:,1], 'crimson', lw=2)
axes[0,0].set_xlabel('Fitted values'); axes[0,0].set_ylabel('Residuals')
axes[0,0].set_title('Residuals vs. Fitted')
# 2. Normal Q-Q
qq = probplot(residual, dist='norm')
axes[0,1].scatter(qq[0][0], qq[0][1], alpha=0.5, color='steelblue', s=25, zorder=3)
axes[0,1].plot(qq[0][0], qq[1][1] + qq[1][0]*qq[0][0], 'crimson', lw=2, label='Reference')
axes[0,1].set_xlabel('Theoretical quantiles'); axes[0,1].set_ylabel('Sample quantiles')
axes[0,1].set_title('Normal Q-Q'); axes[0,1].legend(fontsize=9)
# 3. Scale-Location
axes[1,0].scatter(fitted, sqrt_abs, alpha=0.45, color='steelblue', s=25, zorder=3)
lw2 = sm.nonparametric.lowess(sqrt_abs, fitted, frac=0.5)
axes[1,0].plot(lw2[:,0], lw2[:,1], 'crimson', lw=2)
axes[1,0].set_xlabel('Fitted values')
axes[1,0].set_ylabel('$\\sqrt{|\\mathrm{Std.\\ Residuals}|}$')
axes[1,0].set_title('Scale-Location')
# 4. Cook's Distance
axes[1,1].stem(range(len(cooks_d)), cooks_d,
linefmt='steelblue', markerfmt='o', basefmt=' ')
axes[1,1].axhline(threshold, color='crimson', linestyle='--', label=f'4/n = {threshold:.3f}')
axes[1,1].set_xlabel('Observation index')
axes[1,1].set_ylabel("Cook's $D_i$")
axes[1,1].set_title("Cook's Distance"); axes[1,1].legend(fontsize=9)
plt.suptitle("Regression Diagnostic Panel", fontsize=14, y=1.01)
plt.tight_layout(); plt.show()

In 1973, statistician Francis Anscombe constructed four datasets that share the same summary statistics but look completely different visually (Anscombe 1973):
Dataset I: x̄=9.00, ȳ=7.50, r=0.816, R²=0.667
Dataset II: x̄=9.00, ȳ=7.50, r=0.816, R²=0.666
Dataset III: x̄=9.00, ȳ=7.50, r=0.816, R²=0.666
Dataset IV: x̄=9.00, ȳ=7.50, r=0.817, R²=0.667
# Anscombe's quartet ships with seaborn (columns: dataset, x, y)
anscombe = sns.load_dataset('anscombe')
fig, axes = plt.subplots(2, 2, figsize=(12, 6), sharex=False, sharey=False)
for ax, (ds, grp) in zip(axes.flat, anscombe.groupby('dataset')):
model_a = smf.ols('y ~ x', data=grp).fit()
x_a = np.linspace(grp['x'].min(), grp['x'].max(), 100)
y_a = model_a.params['Intercept'] + model_a.params['x'] * x_a
ax.scatter(grp['x'], grp['y'], color='steelblue', s=60, zorder=3, alpha=0.8)
ax.plot(x_a, y_a, color='crimson', lw=2)
ax.set_title(f"Dataset {ds} (R² = {model_a.rsquared:.3f})", fontsize=11)
ax.set_xlabel('x'); ax.set_ylabel('y')
plt.suptitle("Anscombe's Quartet: Same Statistics, Different Structure", fontsize=13, y=1.01)
plt.tight_layout(); plt.show()

| Dataset | What is actually happening | What to do |
|---|---|---|
| I | Linear relationship + random noise — SLR is appropriate | Proceed normally |
| II | Curved / nonlinear relationship | Add polynomial term; use different model |
| III | Linear relationship with one influential outlier | Investigate the outlier; consider robust regression |
| IV | One extreme leverage point drives everything | Investigate the point; do not report the regression uncritically |
Always plot your data and your residuals. Summary statistics alone — including R^2 and p-values — can be identical for datasets with completely different structures.
| Step | Code | Purpose |
|---|---|---|
| 1. Load & inspect | sns.load_dataset(); .describe(); .head() |
Understand the data |
| 2. Visualise | plt.scatter(); sns.pairplot() |
Check linearity, outliers, subgroups |
| 3. Fit | smf.ols('y ~ x', data=df).fit() |
Estimate \hat{\beta}_0, \hat{\beta}_1 |
| 4. Summarise | model.summary() |
Read coefficients, p-values, R^2 |
| 5. Plot fit | model.get_prediction().summary_frame() |
Visualise line + CI/PI bands |
| 6. Diagnostics | Residuals vs. fitted, Q-Q, scale-location, Cook’s | Check assumptions |
| 7. Report | Coefficient, SE, CI, p-value, R^2 | Communicate results honestly |
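Condensed, the whole workflow is only a handful of lines; a sketch using the same names as this lecture:

import seaborn as sns
import statsmodels.formula.api as smf

# 1-2. Load, inspect, visualise
penguins = sns.load_dataset('penguins').dropna(subset=['flipper_length_mm', 'body_mass_g'])
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g')

# 3-4. Fit and summarise
model = smf.ols('body_mass_g ~ flipper_length_mm', data=penguins).fit()
print(model.summary())

# 5-6. Intervals for plotting, residuals for diagnostics
bands = model.get_prediction(penguins).summary_frame(alpha=0.05)
resid = model.resid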
These limitations motivate multiple linear regression: adding species and other predictors to better model the joint structure. That is the topic of Lecture 11, where we will continue to fit models in statsmodels using the formula interface.