PSTAT100: Data Science - Concepts and Analysis
May 6, 2026
Statistical Inference
Statistical inference is the process of drawing conclusions about an unknown population from a sample of observed data, while rigorously accounting for uncertainty.
Population and Sample
A population is the complete set of units we are interested in. A sample is a subset of the population that we actually observe.
(Printed output: a length-100 array of 0/1 indicators marking which population units were included in the sample.)
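A minimal sketch of how such an indicator vector can be generated, assuming a population of N = 100 units and a simple random sample of n = 10 (both sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 10                                 # assumed population and sample sizes
idx = rng.choice(N, size=n, replace=False)     # simple random sample without replacement
indicator = np.zeros(N, dtype=int)
indicator[idx] = 1                             # 1 marks a sampled unit, 0 an unsampled one
print(indicator)

Sampling without replacement mirrors the usual survey setting, where each unit is observed at most once.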
Sampling Distribution
The sampling distribution of a statistic is the probability distribution of that statistic over all possible samples of a given size n from the population.
For a population with mean \mu and variance \sigma^2, the sample mean \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i satisfies:
\mathbb{E}[\bar{X}] = \mu, \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.
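Both identities can be checked by simulation; the exponential population below is an arbitrary illustrative choice (mean μ = 2, variance σ² = 4):

import numpy as np

rng = np.random.default_rng(1)
n = 25
# 100,000 samples of size n from an exponential population with scale 2
means = rng.exponential(scale=2.0, size=(100_000, n)).mean(axis=1)
print(means.mean())   # ≈ mu = 2.0
print(means.var())    # ≈ sigma^2 / n = 4 / 25 = 0.16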
Central Limit Theorem (CLT)
Let X_1, X_2, \ldots, X_n be i.i.d. random variables with mean \mu and finite variance \sigma^2. Then as n \to \infty: \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1).
Visualization of the CLT.
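A sketch that reproduces this kind of visualization, assuming a skewed exponential(1) population and sample sizes n = 1, 5, 30 (all illustrative choices). Histograms of standardized sample means are compared against the N(0, 1) density:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(2)
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
for ax, n in zip(axes, [1, 5, 30]):
    # Exponential(1) has mean 1 and standard deviation 1
    z = (rng.exponential(1.0, size=(20_000, n)).mean(axis=1) - 1.0) * np.sqrt(n)
    ax.hist(z, bins=60, density=True, color='steelblue', alpha=0.6)
    x = np.linspace(-4, 4, 200)
    ax.plot(x, norm.pdf(x), 'k', lw=2)   # N(0, 1) reference density
    ax.set_title(f'n = {n}')
plt.tight_layout()
plt.show()

Even for this strongly skewed population, the standardized mean is close to normal by n = 30.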
Estimator and Estimate
An estimator \hat{\theta} is a function of the sample used to estimate a population parameter \theta. An estimate is the specific value of the estimator computed from an observed sample.
| Parameter | Estimator | Formula |
|---|---|---|
| Mean \mu | Sample mean | \bar{X} = \frac{1}{n}\sum X_i |
| Variance \sigma^2 | Sample variance | S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2 |
| Proportion p | Sample proportion | \hat{p} = \frac{\text{successes}}{n} |
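Each estimator in the table is a one-liner in NumPy; the data below are made up for illustration:

import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9])   # hypothetical measurements
successes = np.array([1, 0, 1, 1, 0])      # hypothetical Bernoulli outcomes
print(x.mean())          # sample mean, estimates mu
print(x.var(ddof=1))     # sample variance S^2 (n - 1 denominator), estimates sigma^2
print(successes.mean())  # sample proportion p-hat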
Bias
The bias of an estimator \hat{\theta} is: \text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta. An estimator is unbiased if \text{Bias}(\hat{\theta}) = 0.
\text{MSE}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] = \text{Var}(\hat{\theta}) + \left[\text{Bias}(\hat{\theta})\right]^2.
import numpy as np
import matplotlib.pyplot as plt

# Compare the biased (divide by n) and unbiased (divide by n-1) variance estimators
rng = np.random.default_rng(7)
true_var = 4.0
biased, unbiased = [], []
for _ in range(10_000):
    x = rng.normal(0, np.sqrt(true_var), size=10)
    biased.append(np.var(x, ddof=0))     # divides by n
    unbiased.append(np.var(x, ddof=1))   # divides by n - 1
fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(biased, bins=60, alpha=0.6, density=True, label=f'divide by n (mean={np.mean(biased):.2f})', color='steelblue')
ax.hist(unbiased, bins=60, alpha=0.6, density=True, label=f'divide by n-1 (mean={np.mean(unbiased):.2f})', color='crimson')
ax.axvline(true_var, color='black', lw=2, linestyle='--', label=f'True σ² = {true_var}')
ax.set_xlabel('Estimated variance'); ax.set_ylabel('Density')
ax.set_title('Biased vs. Unbiased Variance Estimator (n = 10, 10 000 samples)')
ax.legend()
plt.tight_layout()
plt.show()
Likelihood Function
Given data x_1, \ldots, x_n assumed i.i.d. from a distribution with parameter \theta, the likelihood is: L(\theta) = \prod_{i=1}^n f(x_i;\, \theta). The log-likelihood \ell(\theta) = \sum_{i=1}^n \log f(x_i;\, \theta) is typically easier to maximise.
\hat{\theta}_{\text{MLE}} = \arg\max_\theta\, \ell(\theta).
For data x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2):
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.
\hat{\mu}_{\text{MLE}} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=50)
mu_grid = np.linspace(3, 7, 300)
sigma2_grid = np.linspace(1, 9, 300)
# Log-likelihood in each parameter, with the other held fixed at its MLE
ll_mu = [-0.5 * np.sum((data - m)**2) / np.var(data, ddof=0)
         - 0.5 * len(data) * np.log(2 * np.pi * np.var(data, ddof=0))
         for m in mu_grid]
ll_s2 = [-len(data) / 2 * np.log(2 * np.pi * s2) - np.sum((data - data.mean())**2) / (2 * s2)
         for s2 in sigma2_grid]
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].plot(mu_grid, ll_mu, 'steelblue', lw=2)
axes[0].axvline(data.mean(), color='crimson', linestyle='--', label=f'MLE μ̂ = {data.mean():.2f}')
axes[0].set_xlabel('μ'); axes[0].set_ylabel('Log-likelihood')
axes[0].set_title('Log-likelihood for μ (σ² fixed at MLE)'); axes[0].legend()
axes[1].plot(sigma2_grid, ll_s2, 'steelblue', lw=2)
axes[1].axvline(np.var(data, ddof=0), color='crimson', linestyle='--',
                label=f'MLE σ̂² = {np.var(data, ddof=0):.2f}')
axes[1].set_xlabel('σ²'); axes[1].set_ylabel('Log-likelihood')
axes[1].set_title('Log-likelihood for σ² (μ fixed at MLE)'); axes[1].legend()
plt.tight_layout()
plt.show()
Confidence Interval
A (1-\alpha)\times 100\% confidence interval for \theta is a random interval [L, U] (depending on the data) such that: P(L \leq \theta \leq U) = 1 - \alpha.
When \sigma is known and either n is large (CLT) or the population is normal:
\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}},
where z_{\alpha/2} is the upper \alpha/2 quantile of \mathcal{N}(0, 1).
| Confidence level | \alpha | z_{\alpha/2} |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
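The quantiles in this table can be reproduced with scipy.stats.norm, and the known-σ interval computed directly (the sample summary values below are assumptions for illustration):

import numpy as np
from scipy.stats import norm

print(norm.ppf([0.95, 0.975, 0.995]))    # 1.645, 1.960, 2.576

xbar, sigma, n = 10.3, 2.0, 40           # hypothetical sample mean, known sigma, sample size
z = norm.ppf(0.975)
print((xbar - z * sigma / np.sqrt(n),
       xbar + z * sigma / np.sqrt(n)))   # 95% z-interval for mu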
When \sigma is unknown (the typical case), we replace it with S and use the t-distribution:
\bar{X} \pm t_{\alpha/2,\, n-1} \frac{S}{\sqrt{n}},
where t_{\alpha/2, n-1} is the upper \alpha/2 quantile of the t-distribution with n-1 degrees of freedom.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

# Set the seed
rng = np.random.default_rng(12)
# Specify parameters
true_mu, true_sigma, n_obs, n_intervals = 0.0, 1.0, 30, 50
# Loop generating confidence intervals
covered, not_covered = [], []
for _ in range(n_intervals):
    sample = rng.normal(true_mu, true_sigma, size=n_obs)       # sample from the normal population
    xbar, s = sample.mean(), sample.std(ddof=1)                # sample mean and std
    margin = t.ppf(0.975, df=n_obs - 1) * s / np.sqrt(n_obs)   # half-width of the 95% t-interval
    (covered if abs(xbar - true_mu) <= margin else not_covered).append(
        (xbar - margin, xbar + margin)
    )
# Produce plot
fig, ax = plt.subplots(figsize=(10, 6))
for i, (lo, hi) in enumerate(covered):
    ax.plot([lo, hi], [i, i], color='steelblue', lw=1.5)
for i, (lo, hi) in enumerate(not_covered, start=len(covered)):
    ax.plot([lo, hi], [i, i], color='crimson', lw=1.5)
ax.axvline(true_mu, color='black', lw=2, linestyle='--', label='True μ = 0')
ax.set_xlabel('Parameter value'); ax.set_yticks([])
ax.set_title(f'95% Confidence Intervals (n={n_obs}): '
             f'{len(covered)}/{n_intervals} contain μ')
ax.legend()
plt.tight_layout()
plt.show()
Hypothesis Test
A hypothesis test is a formal procedure for deciding between two competing hypotheses about a parameter, based on observed data.
p-Value
The p-value is the probability, under H_0, of observing a test statistic at least as extreme as the one actually observed. p\text{-value} = P_{H_0}(T \geq t_{\text{obs}}) \quad \text{(one-sided example)}.
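For the one-sided case this is a single survival-function evaluation; the observed statistic and degrees of freedom below are assumed for illustration:

from scipy.stats import t

t_obs, df = 2.1, 29                 # hypothetical observed statistic and degrees of freedom
print(t.sf(t_obs, df=df))           # one-sided p-value: P(T >= t_obs) under H0
print(2 * t.sf(abs(t_obs), df=df))  # two-sided version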
| | H_0 true | H_0 false |
|---|---|---|
| Reject H_0 | Type I Error (false positive) | Correct (Power) |
| Fail to reject H_0 | Correct | Type II Error (false negative) |
For testing H_0: \mu = \mu_0 based on a sample x_1, \ldots, x_n, the test statistic is:
T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1} \quad \text{under } H_0.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import t

rng = np.random.default_rng(99)
sample = rng.normal(loc=2.5, scale=3.0, size=40)
mu_0 = 0.0
t_stat, p_val = stats.ttest_1samp(sample, popmean=mu_0)
df = len(sample) - 1
x_grid = np.linspace(-5, 5, 400)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x_grid, t.pdf(x_grid, df=df), 'k', lw=2, label=f't distribution (df={df})')
# Shade rejection region (two-sided, α = 0.05)
crit = t.ppf(0.975, df=df)
x_left = np.linspace(-5, -crit, 200)
x_right = np.linspace(crit, 5, 200)
ax.fill_between(x_left, t.pdf(x_left, df=df), alpha=0.3, color='crimson', label='Rejection region (α=0.05)')
ax.fill_between(x_right, t.pdf(x_right, df=df), alpha=0.3, color='crimson')
ax.axvline(t_stat, color='steelblue', lw=2.5, linestyle='--',
           label=f'Observed T = {t_stat:.2f}, p = {p_val:.4f}')
ax.set_xlabel('t'); ax.set_ylabel('Density')
ax.set_title(f'One-sample t-test: H₀: μ = {mu_0} vs. H₁: μ ≠ {mu_0}')
ax.legend()
plt.tight_layout()
plt.show()
| Step | Question | Tool |
|---|---|---|
| 1. Formulate | What do I want to learn? | Research question |
| 2. Model | What is the data-generating process? | Probability model |
| 3. Estimate | What are the parameter values? | MLE / method of moments |
| 4. Quantify uncertainty | How precise is my estimate? | Confidence interval |
| 5. Test | Is a specific claim consistent with the data? | Hypothesis test / p-value |
| 6. Communicate | What do the results mean in context? | Effect size, CI, visualisation |
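Steps 2 through 5 fit in a few lines. A sketch on simulated data (the normal model and all parameter values are assumptions chosen for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=1.2, scale=2.0, size=60)          # step 2: assumed normal model

xbar, s, n = data.mean(), data.std(ddof=1), len(data)   # step 3: point estimates
margin = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
print(f'95% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})')   # step 4: uncertainty

t_stat, p_val = stats.ttest_1samp(data, popmean=0.0)    # step 5: test H0: mu = 0
print(f'T = {t_stat:.2f}, p = {p_val:.4f}')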