Lecture 8: Statistics

PSTAT100: Data Science - Concepts and Analysis

John Inston

University of California, Santa Barbara

May 6, 2026

🚁 Overview

Aims of the lecture

  • Understand the framework of statistical inference.
  • Distinguish between populations, samples, parameters, and statistics.
  • Understand sampling distributions and the Central Limit Theorem.
  • Evaluate estimators using bias, variance, and MSE.
  • Apply maximum likelihood estimation (MLE).
  • Construct and interpret confidence intervals.
  • Conduct and interpret hypothesis tests.

📚 Required Libraries

In this lecture we will be using the following libraries:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, t, binom

💅 Figure Styles

sns.set_style('whitegrid')
sns.set_palette('Set2')

Introduction to Statistical Inference

🔭 What is Statistical Inference?

From data to conclusions

Statistical Inference

Statistical inference is the process of drawing conclusions about an unknown population from a sample of observed data, while rigorously accounting for uncertainty.

The core problem

  • We want to know something about a large population (e.g. all voters, all patients).
  • We cannot observe the whole population — we only have a sample.
  • We must reason from the sample to the population, quantifying how uncertain our conclusions are.

Two main branches

  • Frequentist inference: Unknown fixed parameters; probability describes long-run frequency of events.
  • Bayesian inference: Parameters are random variables with prior distributions, updated by data.

Populations and Samples

Key terminology

Population and Sample

A population is the complete set of units we are interested in. A sample is a subset of the population that we actually observe.

Parameters and statistics

  • A parameter is a numerical characteristic of the population — it is fixed but unknown.
    • Examples: population mean \mu, population variance \sigma^2, proportion p.
  • A statistic is a numerical function of the sample — it is observed and used to estimate parameters.
    • Examples: sample mean \bar{x}, sample variance s^2, sample proportion \hat{p}.

The goal

  • Use the statistic (computed from data) to estimate or test claims about the unknown parameter.

Example - FIV in Cats

Problem Statement

  • We wish to determine the proportion of cats that have FIV.

Probabilistic Model

  • We assume that this system can be modelled as a collection of independent Bernoulli random variables with unknown parameter p.
    • We cannot census all cats (the population) to find the true proportion p.

Statistical Model

  • Instead, we take a sample of n=100 cats and compute the sample proportion \hat{p}.
    • We use hat notation to specify parameter estimates.
    • This is a statistic since it was computed from data.
  • Specifically, this is known as a point estimate of a parameter.

Example - FIV in Cats Parameter Estimation

Simulation

  • Suppose we know the true underlying proportion of cats with FIV is p=0.15.
    • We write a function which generates a sample of n=100 realizations of a Bernoulli random variable with parameter p=0.15.
# Set seed for reproducibility
rng = np.random.default_rng(123)
# Write function generating cat samples
def generate_cat_sample():
    return rng.binomial(n=1, p=0.15, size=100)
# Example 
sample1 = generate_cat_sample()
print(sample1)
[0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0]

Estimation

  • We can compute the sample proportion \hat{p} from this sample.
sample1.mean()
np.float64(0.12)

Why Randomness Matters

Statistics are random variables

  • Because the sample is drawn randomly from the population, any statistic computed from it is a random variable.
  • If we took a different sample, we would get a different value of the statistic.
  • This sampling variability is what we need to quantify.

Let’s simulate the variability

  • We can use this same function to generate multiple samples and compute the corresponding sample proportions.
  • From this we produce a histogram to inspect the distribution.
sample_proportions = [generate_cat_sample().mean() for _ in range(1000)]
sns.histplot(sample_proportions, bins=20, kde=True, color='steelblue')

Why Randomness Matters

Sampling variability: distribution of sample proportions from 1000 samples of size n=100.

Another Example - Heights of Pool Players

Heights of pool players

  • We wish to estimate the average height of members of the SB pool league.
  • Probabilistic model: Assume that the heights are normally distributed with unknown mean \mu and standard deviation \sigma.
  • Statistical model: In reality, how would you estimate these parameters?
    • Collect a sample of players and measure their heights.
    • Compute the sample mean \bar{X}\approx \mu and sample standard deviation S\approx\sigma.

More Simulation

  • We assume that the true mean height is \mu=170 cm and the standard deviation is \sigma=3.
  • We write a function generating n=100 realizations of X\sim\text{Normal}(170, 3^2).
rng = np.random.default_rng(456)
def generate_height_sample(n=100):
    return rng.normal(loc=170, scale=3, size=n)
samples = [generate_height_sample() for _ in range(1000)]
sample_means = [samp.mean() for samp in samples]
sample_sdvs = [samp.std(ddof=1) for samp in samples]

Another Example - Heights of Pool Players

Sample Distribution

sns.histplot(np.array(samples).flatten(), kde=True, color='steelblue')

Distribution of all 1000 samples (each of size n=100).
  • This is just an n=100{,}000 sample from our Normal distribution.

Sample Mean Distribution

sns.histplot(sample_means, bins=20, kde=True, color='steelblue')

Distribution of sample means from 1000 samples of size n=100.
  • What distribution is this?

Sample Standard Deviation Distribution

sns.histplot(sample_sdvs, bins=20, kde=True, color='steelblue')

Distribution of sample standard deviations from 1000 samples of size n=100.
  • What about this?

Sampling Distributions and the CLT

📐 Sampling Distributions

The distribution of a statistic

Sampling Distribution

The sampling distribution of a statistic is the probability distribution of that statistic over all possible samples of a given size n from the population.

Properties of the sample mean

For a population with mean \mu and variance \sigma^2, the sample mean \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i satisfies:

\mathbb{E}[\bar{X}] = \mu, \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.

Standard error

  • The standard error (SE) of \bar{X} is \text{SE}(\bar{X}) = \sigma / \sqrt{n}.
    • This is the standard deviation of the sampling distribution of \bar{X}.
  • As n \to \infty, the SE shrinks — larger samples give more precise estimates.
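Checking the properties by simulation

  • As a quick sketch (the parameter values here are illustrative), we can verify that the empirical standard deviation of many simulated sample means matches \sigma/\sqrt{n}.
rng = np.random.default_rng(0)
mu_true, sigma_true, n = 10.0, 2.0, 50
# 5000 samples of size n; one sample mean per row
means = rng.normal(mu_true, sigma_true, size=(5000, n)).mean(axis=1)
print(means.mean())               # close to mu_true = 10
print(means.std(ddof=1))          # close to sigma / sqrt(n)
print(sigma_true / np.sqrt(n))    # theoretical SE: 2 / sqrt(50) ≈ 0.283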

The Central Limit Theorem

The most important theorem in statistics

Central Limit Theorem (CLT)

Let X_1, X_2, \ldots, X_n be i.i.d. random variables with mean \mu and finite variance \sigma^2. Then as n \to \infty: \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1).

What it means in practice

  • Regardless of the shape of the population, the sample mean is approximately normally distributed for large n.
  • A rule of thumb: the approximation is reliable for n \geq 30 (though this depends on skewness).
  • The CLT justifies applying normal-based inference to a huge variety of real-world data.

CLT in Action

Visualization of the CLT.
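  • A minimal sketch of such a visualization (the Exponential(1) population here is an illustrative choice): sample means from a skewed distribution look increasingly normal as n grows.
rng = np.random.default_rng(42)
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, n in zip(axes, [1, 5, 30]):
    # 5000 sample means from an Exponential(1) population
    means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    sns.histplot(means, bins=40, stat='density', color='steelblue', ax=ax)
    ax.set_title(f'Sample means, n = {n}')
plt.tight_layout()
plt.show()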

Point Estimation

🎯 Estimators and Estimates

Terminology

Estimator and Estimate

An estimator \hat{\theta} is a function of the sample used to estimate a population parameter \theta. An estimate is the specific value of the estimator computed from an observed sample.

Common estimators

Parameter          | Estimator          | Formula
Mean \mu           | Sample mean        | \bar{X} = \frac{1}{n}\sum X_i
Variance \sigma^2  | Sample variance    | S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2
Proportion p       | Sample proportion  | \hat{p} = \frac{\text{successes}}{n}

Properties of Estimators

How do we judge a good estimator?

Bias

The bias of an estimator \hat{\theta} is: \text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta. An estimator is unbiased if \text{Bias}(\hat{\theta}) = 0.

Mean Squared Error

\text{MSE}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] = \text{Var}(\hat{\theta}) + \left[\text{Bias}(\hat{\theta})\right]^2.

The bias-variance trade-off

  • A low-bias, high-variance estimator is accurate on average but erratic.
  • A high-bias, low-variance estimator is consistently wrong but predictably so.
  • Good estimators minimise MSE, balancing both.
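Seeing the trade-off numerically

  • A small simulation (values illustrative): for normal data, the 1/n variance estimator is biased but has lower variance than the 1/(n-1) version; here it even achieves a slightly lower MSE.
rng = np.random.default_rng(1)
true_var, n = 4.0, 10
samples = rng.normal(0, np.sqrt(true_var), size=(100_000, n))
for ddof, name in [(0, '1/n    '), (1, '1/(n-1)')]:
    v = samples.var(axis=1, ddof=ddof)
    print(f'{name}: bias={v.mean() - true_var:+.3f}, '
          f'var={v.var():.3f}, MSE={np.mean((v - true_var)**2):.3f}')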

Bias of the Sample Variance

Why divide by n-1?

  • The naive estimator \tilde{S}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2 is biased: \mathbb{E}[\tilde{S}^2] = \frac{n-1}{n}\sigma^2.
  • Dividing by n-1 instead of n corrects for this: \mathbb{E}[S^2] = \sigma^2. \qquad \text{(unbiased)}
  • The factor n-1 is called the degrees of freedom — we “lose” one degree of freedom because \bar{X} is estimated from the same data.
rng = np.random.default_rng(7)
true_var = 4.0
biased, unbiased = [], []
for _ in range(10_000):
    x = rng.normal(0, np.sqrt(true_var), size=10)
    biased.append(np.var(x, ddof=0))
    unbiased.append(np.var(x, ddof=1))

fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(biased,   bins=60, alpha=0.6, density=True, label=f'n   (mean={np.mean(biased):.2f})',   color='steelblue')
ax.hist(unbiased, bins=60, alpha=0.6, density=True, label=f'n-1 (mean={np.mean(unbiased):.2f})', color='crimson')
ax.axvline(true_var, color='black', lw=2, linestyle='--', label=f'True σ² = {true_var}')
ax.set_xlabel('Estimated variance'); ax.set_ylabel('Density')
ax.set_title('Biased vs. Unbiased Variance Estimator (n = 10, 10 000 samples)')
ax.legend()
plt.tight_layout()
plt.show()

Bias of the Sample Variance

Biased (n) vs. unbiased (n-1) sample variance estimators.

Maximum Likelihood Estimation

📈 The Likelihood Function

Choosing parameters that make the data most probable

Likelihood Function

Given data x_1, \ldots, x_n assumed i.i.d. from a distribution with parameter \theta, the likelihood is: L(\theta) = \prod_{i=1}^n f(x_i;\, \theta). The log-likelihood \ell(\theta) = \sum_{i=1}^n \log f(x_i;\, \theta) is typically easier to maximise.

Maximum Likelihood Estimator (MLE)

\hat{\theta}_{\text{MLE}} = \arg\max_\theta\, \ell(\theta).

  • The MLE is the parameter value that makes the observed data most probable.
  • MLEs are generally consistent (converge to the true value) and asymptotically normal.
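Numerical maximisation

  • In practice we often maximise the log-likelihood numerically. A minimal sketch using scipy.optimize (the Bernoulli example below mirrors the FIV data from earlier; the seed is illustrative):
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
data = rng.binomial(n=1, p=0.15, size=100)   # simulated Bernoulli sample

def neg_log_lik(p):
    # negative Bernoulli log-likelihood
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, data.mean())   # numerical MLE matches the analytic MLE p-hat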

MLE for the Gaussian

Deriving the MLE analytically

For data x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2):

\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.

Setting derivatives to zero gives

\hat{\mu}_{\text{MLE}} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.

Note

  • \hat{\mu}_{\text{MLE}} is unbiased, but \hat{\sigma}^2_{\text{MLE}} uses n (not n-1) and is biased.
  • This is why the sample variance corrects to n-1 in practice.

MLE in Python

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=50)

mu_grid    = np.linspace(3, 7, 300)
sigma2_grid = np.linspace(1, 9, 300)

# Profile log-likelihoods
ll_mu    = [-0.5 * np.sum((data - m)**2) / np.var(data, ddof=0)
             - 0.5 * len(data) * np.log(2 * np.pi * np.var(data, ddof=0))
             for m in mu_grid]
ll_s2    = [-len(data) / 2 * np.log(2 * np.pi * s2) - np.sum((data - data.mean())**2) / (2 * s2)
             for s2 in sigma2_grid]

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].plot(mu_grid, ll_mu, 'steelblue', lw=2)
axes[0].axvline(data.mean(), color='crimson', linestyle='--', label=f'MLE μ̂ = {data.mean():.2f}')
axes[0].set_xlabel('μ'); axes[0].set_ylabel('Log-likelihood')
axes[0].set_title('Profile log-likelihood for μ'); axes[0].legend()

axes[1].plot(sigma2_grid, ll_s2, 'steelblue', lw=2)
axes[1].axvline(np.var(data, ddof=0), color='crimson', linestyle='--',
                label=f'MLE σ̂² = {np.var(data, ddof=0):.2f}')
axes[1].set_xlabel('σ²'); axes[1].set_ylabel('Log-likelihood')
axes[1].set_title('Profile log-likelihood for σ²'); axes[1].legend()

plt.tight_layout()
plt.show()

MLE in Python

Profile log-likelihood for μ (left) and σ² (right) from a simulated Gaussian sample.

Confidence Intervals

📏 Quantifying Estimation Uncertainty

Beyond point estimates

  • A point estimate \hat{\theta} is a single number — it tells us nothing about uncertainty.
  • A confidence interval (CI) gives a range of plausible values for \theta.

Confidence Interval

A (1-\alpha)\times 100\% confidence interval for \theta is a random interval [L, U] (depending on the data) such that: P(L \leq \theta \leq U) = 1 - \alpha.

Correct interpretation

  • A 95% CI does not mean “there is a 95% probability that \theta is in this particular interval.”
  • It means: if we repeated the procedure many times, 95% of the resulting intervals would contain the true \theta.

CI for the Population Mean (Known σ)

The z-interval

When \sigma is known and either n is large (CLT) or the population is normal:

\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}},

where z_{\alpha/2} is the upper \alpha/2 quantile of \mathcal{N}(0, 1).

Common critical values

Confidence level | \alpha | z_{\alpha/2}
90%              | 0.10   | 1.645
95%              | 0.05   | 1.960
99%              | 0.01   | 2.576
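  • Computing a z-interval directly (a sketch; the population values are illustrative):
rng = np.random.default_rng(5)
sigma_known, n = 3.0, 100
sample = rng.normal(170, sigma_known, size=n)
z = norm.ppf(0.975)                  # z_{alpha/2} for a 95% interval
half = z * sigma_known / np.sqrt(n)  # margin of error
print(f'({sample.mean() - half:.2f}, {sample.mean() + half:.2f})')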

CI for the Population Mean (Unknown σ)

The t-interval

When \sigma is unknown (the typical case), we replace it with S and use the t-distribution:

\bar{X} \pm t_{\alpha/2,\, n-1} \frac{S}{\sqrt{n}},

where t_{\alpha/2, n-1} is the upper \alpha/2 quantile of the t-distribution with n-1 degrees of freedom.

The t-distribution

  • Heavier tails than the normal — accounts for extra uncertainty from estimating \sigma.
  • As n \to \infty, t_{n-1} \to \mathcal{N}(0,1).
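  • Constructing a single t-interval (a sketch with illustrative values; the coverage simulation on the next slide repeats this step many times):
rng = np.random.default_rng(8)
sample = rng.normal(170, 3, size=25)
xbar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
half = t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
print(f'({xbar - half:.2f}, {xbar + half:.2f})')
# equivalently, via scipy:
print(t.interval(0.95, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))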

Confidence Intervals in Python

# Set the seed 
rng = np.random.default_rng(12)
# Specify parameters 
true_mu, true_sigma, n_obs, n_intervals = 0.0, 1.0, 30, 50

# Loop generating confidence intervals
covered, not_covered = [], []
for _ in range(n_intervals):
    sample = rng.normal(true_mu, true_sigma, size=n_obs) # sampling from normal
    xbar, s = sample.mean(), sample.std(ddof=1)          # sample mean and std
    margin = t.ppf(0.975, df=n_obs - 1) * s / np.sqrt(n_obs) # compute bounds 
    (covered if abs(xbar - true_mu) <= margin else not_covered).append(
      (xbar - margin, xbar + margin)
      )

# Produce plot
fig, ax = plt.subplots(figsize=(10, 6))
for i, (lo, hi) in enumerate(covered):
    ax.plot([lo, hi], [i, i], color='steelblue', lw=1.5)
for i, (lo, hi) in enumerate(not_covered, start=len(covered)):
    ax.plot([lo, hi], [i, i], color='crimson', lw=1.5)
ax.axvline(true_mu, color='black', lw=2, linestyle='--', label='True μ = 0')
ax.set_xlabel('Parameter value'); ax.set_yticks([])
ax.set_title(f'95% Confidence Intervals (n={n_obs}): '
             f'{len(covered)}/{n_intervals} contain μ')
ax.legend()
plt.tight_layout()
plt.show()

Confidence Intervals in Python

Simulation: 50 confidence intervals at the 95% level. Blue = contain μ; red = miss.

Hypothesis Testing

❓ The Framework

Making decisions under uncertainty

Hypothesis Test

A hypothesis test is a formal procedure for deciding between two competing hypotheses about a parameter, based on observed data.

Null and alternative hypotheses

  • The null hypothesis H_0: a default claim we assume to be true (often “no effect”, “no difference”).
  • The alternative hypothesis H_1 (or H_a): what we are trying to find evidence for.
  • Examples:
    • H_0: \mu = 0 vs. H_1: \mu \neq 0 (two-sided)
    • H_0: \mu \leq 0 vs. H_1: \mu > 0 (one-sided)
The logic of testing

  • Compute a test statistic T from the data that measures evidence against H_0.
  • Ask: “If H_0 were true, how likely is a test statistic at least this extreme?”
  • If the answer is “very unlikely”, we reject H_0.

Test Statistics and p-Values

Formalising “how unlikely”

p-Value

The p-value is the probability, under H_0, of observing a test statistic at least as extreme as the one actually observed. p\text{-value} = P_{H_0}(T \geq t_{\text{obs}}) \quad \text{(one-sided example)}.

Decision rule at significance level \alpha

  • If p\text{-value} \leq \alpha: reject H_0 (result is “statistically significant”).
  • If p\text{-value} > \alpha: fail to reject H_0 (insufficient evidence).
  • Common choices: \alpha = 0.05 (5%) or \alpha = 0.01 (1%).
  • Important! What the p-value is not.
    • It is not the probability that H_0 is true.
    • It does not measure the size or practical importance of an effect.
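  • A worked example (a sketch reusing the FIV setting, with an assumed observed count): testing H_0: p = 0.15 after seeing 12 positives in n = 100 cats, using scipy's exact binomial test.
result = stats.binomtest(k=12, n=100, p=0.15, alternative='two-sided')
print(result.pvalue)   # well above 0.05, so we fail to reject H_0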

Type I and Type II Errors

Two ways a test can be wrong

                   | H_0 true                      | H_0 false
Reject H_0         | Type I error (false positive) | Correct (power)
Fail to reject H_0 | Correct                       | Type II error (false negative)

Definitions

  • Type I error rate = \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) — controlled by our choice of \alpha.
  • Type II error rate = \beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true}).
  • Power = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true}) — the probability of correctly detecting a true effect.

The trade-off

  • Decreasing \alpha (stricter test) reduces Type I errors but increases Type II errors.
  • Larger sample sizes increase power without inflating the Type I error rate.
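  • We can estimate power by simulation (a sketch with an assumed effect size of 0.5 standard deviations):
rng = np.random.default_rng(11)
for n in [10, 30, 100]:
    # fraction of 2000 simulated t-tests that reject H_0: mu = 0 at alpha = 0.05
    p_vals = [stats.ttest_1samp(rng.normal(0.5, 1.0, size=n), popmean=0.0).pvalue
              for _ in range(2000)]
    print(f'n={n:3d}: estimated power ≈ {np.mean(np.array(p_vals) <= 0.05):.2f}')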

The One-Sample t-Test

Testing a claim about the population mean

For testing H_0: \mu = \mu_0 based on a sample x_1, \ldots, x_n, the test statistic is:

T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1} \quad \text{under } H_0.

p-value computation

  • Two-sided (H_1: \mu \neq \mu_0): p = 2\,P(t_{n-1} \geq |t_{\text{obs}}|).
  • One-sided upper (H_1: \mu > \mu_0): p = P(t_{n-1} \geq t_{\text{obs}}).
  • One-sided lower (H_1: \mu < \mu_0): p = P(t_{n-1} \leq t_{\text{obs}}).
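  • Computing the statistic and two-sided p-value by hand (a sketch; it uses the same seed as the scipy example on the next slide, so the numbers match):
rng = np.random.default_rng(99)
sample = rng.normal(loc=2.5, scale=3.0, size=40)
# T = (xbar - mu_0) / (s / sqrt(n)) under H_0: mu_0 = 0
t_obs = (sample.mean() - 0.0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_two_sided = 2 * t.sf(abs(t_obs), df=len(sample) - 1)
print(t_obs, p_two_sided)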

t-Test in Python

rng = np.random.default_rng(99)
sample = rng.normal(loc=2.5, scale=3.0, size=40)
mu_0 = 0.0

t_stat, p_val = stats.ttest_1samp(sample, popmean=mu_0)
df = len(sample) - 1

x_grid = np.linspace(-5, 5, 400)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x_grid, t.pdf(x_grid, df=df), 'k', lw=2, label=f't distribution (df={df})')

# Shade rejection region (two-sided, α = 0.05)
crit = t.ppf(0.975, df=df)
x_left  = np.linspace(-5, -crit, 200)
x_right = np.linspace(crit,  5, 200)
ax.fill_between(x_left,  t.pdf(x_left,  df=df), alpha=0.3, color='crimson', label='Rejection region (α=0.05)')
ax.fill_between(x_right, t.pdf(x_right, df=df), alpha=0.3, color='crimson')

ax.axvline(t_stat, color='steelblue', lw=2.5, linestyle='--',
           label=f'Observed T = {t_stat:.2f},  p = {p_val:.4f}')
ax.set_xlabel('t'); ax.set_ylabel('Density')
ax.set_title(f'One-sample t-test: H₀: μ = {mu_0}  vs.  H₁: μ ≠ {mu_0}')
ax.legend()
plt.tight_layout()
plt.show()

t-Test in Python

One-sample t-test: observed test statistic against the null t-distribution.

Connecting CIs and Hypothesis Tests

Two sides of the same coin

  • A (1-\alpha) confidence interval for \mu contains exactly those values \mu_0 for which the two-sided test at level \alpha fails to reject H_0: \mu = \mu_0.
  • Equivalently: reject H_0: \mu = \mu_0 at level \alpha if and only if \mu_0 lies outside the (1-\alpha) CI.

Practical takeaway

  • CIs and hypothesis tests answer complementary questions:
    • CI: “What values of \mu are consistent with the data?”
    • Test: “Is a specific value \mu_0 consistent with the data?”
  • Reporting a CI is often more informative than a binary reject/fail-to-reject decision.
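  • A quick demonstration of the duality (a sketch with illustrative values): \mu_0 values inside the 95% CI give p > 0.05, values outside give p \leq 0.05.
rng = np.random.default_rng(21)
sample = rng.normal(1.0, 2.0, size=40)
xbar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
half = t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
print(f'95% CI: ({xbar - half:.2f}, {xbar + half:.2f})')
for mu_0 in [xbar - 2 * half, xbar - 0.5 * half, xbar + 0.5 * half, xbar + 2 * half]:
    print(f'mu_0 = {mu_0:6.2f}: p = {stats.ttest_1samp(sample, popmean=mu_0).pvalue:.4f}')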

Statistical vs. Practical Significance

Large samples can make small effects “significant”

  • With a very large n, even a trivially small difference from \mu_0 will yield a tiny p-value.
  • A statistically significant result is not necessarily practically important.

Effect sizes

  • Always report the estimated effect and a CI, not just the p-value.
  • Consider effect size measures such as Cohen’s d: d = \frac{\bar{X} - \mu_0}{S}.
  • A large effect size with a non-significant p-value may indicate insufficient power (small n), not the absence of an effect.
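  • Illustrating the point in code (a sketch with an assumed tiny shift of 0.05 standard deviations): with a huge n the test is highly significant, yet the effect size stays negligible.
rng = np.random.default_rng(17)
sample = rng.normal(0.05, 1.0, size=100_000)
t_stat, p_val = stats.ttest_1samp(sample, popmean=0.0)
d = sample.mean() / sample.std(ddof=1)   # Cohen's d against mu_0 = 0
print(f'p = {p_val:.1e} (significant), d = {d:.3f} (negligible effect)')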

Summary: The Inference Pipeline

Step                    | Question                                      | Tool
1. Formulate            | What do I want to learn?                      | Research question
2. Model                | What is the data-generating process?          | Probability model
3. Estimate             | What are the parameter values?                | MLE / method of moments
4. Quantify uncertainty | How precise is my estimate?                   | Confidence interval
5. Test                 | Is a specific claim consistent with the data? | Hypothesis test / p-value
6. Communicate          | What do the results mean in context?          | Effect size, CI, visualisation

Conclusion

✅ What we covered

  • Statistical inference: drawing conclusions about populations from samples.
  • Populations, parameters, samples, and statistics.
  • Sampling distributions and the Central Limit Theorem.
  • Point estimation: bias, variance, MSE, and the MLE.
  • Confidence intervals: construction and correct interpretation.
  • Hypothesis testing: null/alternative hypotheses, test statistics, p-values, Type I/II errors.
  • The t-test and the duality between CIs and tests.

📅 What’s next?

  • Simple linear regression.
  • Fitting, interpreting, and evaluating regression models.
