Lecture 16: Classification

PSTAT100: Data Science — Concepts and Analysis

John Inston

University of California, Santa Barbara

May 23, 2026

🚁 Overview

Aims of the lecture

  • Understand what classification is and how it differs from regression.
  • Extend logistic regression from binary to multiclass problems.
  • Introduce decision trees: how splits are chosen and trees are grown.
  • Understand random forests as an ensemble that reduces variance.
  • Evaluate and compare all three classifiers on a real credit scoring problem.

📚 Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    classification_report, accuracy_score,
    precision_recall_fscore_support,
)

💅 Figure Styles

sns.set_style('whitegrid')
sns.set_palette('Set2')

From Regression to Classification

The Classification Problem

  • In regression, the response is continuous: Y \in \mathbb{R}.

  • In classification, the response is categorical: Y \in \{1, 2, \ldots, K\}.

Some examples

Task | Response classes
Email spam detection | Spam / Not Spam
Medical diagnosis | Disease / No Disease
Handwritten digit recognition | 0, 1, 2, …, 9
Credit scoring | Good / Standard / Poor

The goal is to learn a decision rule \hat{f}: \mathcal{X} \to \{1, \ldots, K\} that maps features to class labels as accurately as possible.

Recap: Binary Logistic Regression (Lecture 13)

For a binary response Y \in \{0, 1\}, logistic regression models the probability of the positive class via the sigmoid function:

P(Y = 1 \mid \mathbf{x}) = \sigma(\mathbf{x}^\top \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}

The decision rule is simply:

\hat{y} = \begin{cases} 1 & \text{if } P(Y=1 \mid \mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}

This corresponds to a linear decision boundary \mathbf{x}^\top\boldsymbol{\beta} = 0 in feature space.
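As a minimal sketch of the rule above, the snippet below evaluates the sigmoid and the 0.5 threshold with NumPy. The coefficient vector beta and the three feature rows are made-up illustrative values, not fitted quantities.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (intercept first) and feature rows, for illustration only
beta = np.array([-1.0, 2.0, 0.5])
X = np.array([
    [1.0, 0.0, 0.0],   # x^T beta = -1.0 -> below the boundary
    [1.0, 0.5, 0.0],   # x^T beta =  0.0 -> exactly on the boundary
    [1.0, 1.0, 1.0],   # x^T beta =  1.5 -> above the boundary
])

z = X @ beta                        # linear predictor x^T beta
p = sigmoid(z)                      # P(Y = 1 | x)
y_hat = (p >= 0.5).astype(int)      # equivalent to thresholding z at 0

print(np.round(p, 3), y_hat)
```

Thresholding the probability at 0.5 and thresholding the linear predictor at 0 give the same labels, which is why the boundary is the hyperplane x^T beta = 0.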

Key properties

Strength | Limitation
Interpretable coefficients | Assumes a linear boundary
Calibrated class probabilities | Struggles with interactions and non-linearity
Computationally efficient | Requires feature engineering for complex patterns

Extending to Multiple Classes

Binary logistic regression generalises to K > 2 classes in two main ways:

  • One-vs-Rest (OvR): fit K separate binary classifiers, each asking “is this observation class k or not?”. Predict the class whose classifier returns the highest probability.

  • Multinomial (softmax): fit all classes simultaneously using a single model:

P(Y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{x}^\top \boldsymbol{\beta}_k)}{\displaystyle\sum_{j=1}^{K} \exp(\mathbf{x}^\top \boldsymbol{\beta}_j)}

sklearn’s LogisticRegression uses the multinomial (softmax) approach by default when K > 2. The decision boundaries remain linear: each pair of classes is separated by a hyperplane.
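The softmax formula can be sketched directly in NumPy. The coefficient matrix B below (one column per class) and the two feature rows are hypothetical values chosen for illustration; they are not estimates from the credit data.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical coefficient matrix: one column of coefficients per class (K = 3)
B = np.array([
    [0.5, 0.0, -0.5],   # intercepts
    [1.0, 0.0, -1.0],   # coefficients for the single feature
])
X = np.array([[1.0, 2.0], [1.0, -2.0]])   # the leading 1 is the intercept term

probs = softmax(X @ B)          # P(Y = k | x) for every row and class
y_hat = probs.argmax(axis=1)    # predict the most probable class
print(np.round(probs, 3), y_hat)
```

Each row of probs sums to one, and the predicted class is simply the column with the largest linear score.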

Why Go Beyond Logistic Regression?

Two concentric circles cannot be separated by any straight line, so logistic regression on the raw features is structurally unable to solve this problem. A decision tree carves the space into axis-aligned rectangles and approximates the circular boundary well.
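The comparison can be reproduced on synthetic data. This is a sketch only: the sample size, noise level, and max_depth=6 are arbitrary choices for illustration, not the settings behind the lecture figure.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two noisy concentric circles: the inner circle is class 1, the outer circle is class 0
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

lr_demo = LogisticRegression().fit(X_tr, y_tr)
tree_demo = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_tr, y_tr)

acc_lr = lr_demo.score(X_te, y_te)       # a straight line cannot isolate the inner circle
acc_tree = tree_demo.score(X_te, y_te)   # axis-aligned boxes approximate the circle
print(f"Logistic regression: {acc_lr:.3f}   Decision tree: {acc_tree:.3f}")
```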

Meet the Data

💳 The Credit Score Dataset

Source: Kaggle - Credit Score Classification dataset

The business problem

  • A bank wants to automatically assign each customer to a credit score band:

    • Good — low credit risk; eligible for favourable rates
    • Standard — moderate risk; standard products
    • Poor — high risk; restricted lending

The data

  • 100,000 labelled records across 8 months
  • 27 raw features: income, payment history, loan counts, credit utilisation, …
  • Each customer appears in multiple monthly snapshots

Loading the Data

Training Data

  • We download the file train.csv from Kaggle and load it into a DataFrame:
raw = pd.read_csv('data/train.csv', low_memory=False)
  • We print the shape of the data:
print('Train Data Size : ',raw.shape)
Train Data Size :  (100000, 28)
  • We also check the first few rows to get a sense of the data:
raw.head()

ID Customer_ID Month Name Age SSN Occupation Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts ... Credit_Mix Outstanding_Debt Credit_Utilization_Ratio Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month Amount_invested_monthly Payment_Behaviour Monthly_Balance Credit_Score
0 0x1602 CUS_0xd40 January Aaron Maashoh 23 821-00-0265 Scientist 19114.12 1824.843333 3 ... _ 809.98 26.822620 22 Years and 1 Months No 49.574949 80.41529543900253 High_spent_Small_value_payments 312.49408867943663 Good
1 0x1603 CUS_0xd40 February Aaron Maashoh 23 821-00-0265 Scientist 19114.12 NaN 3 ... Good 809.98 31.944960 NaN No 49.574949 118.28022162236736 Low_spent_Large_value_payments 284.62916249607184 Good
2 0x1604 CUS_0xd40 March Aaron Maashoh -500 821-00-0265 Scientist 19114.12 NaN 3 ... Good 809.98 28.609352 22 Years and 3 Months No 49.574949 81.699521264648 Low_spent_Medium_value_payments 331.2098628537912 Good
3 0x1605 CUS_0xd40 April Aaron Maashoh 23 821-00-0265 Scientist 19114.12 NaN 3 ... Good 809.98 31.377862 22 Years and 4 Months No 49.574949 199.4580743910713 Low_spent_Small_value_payments 223.45130972736786 Good
4 0x1606 CUS_0xd40 May Aaron Maashoh 23 821-00-0265 Scientist 19114.12 1824.843333 3 ... Good 809.98 24.797347 22 Years and 5 Months No 49.574949 41.420153086217326 High_spent_Medium_value_payments 341.48923103222177 Good

5 rows × 28 columns

A First Look at the Features

raw.info()
<class 'pandas.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  str    
 1   Customer_ID               100000 non-null  str    
 2   Month                     100000 non-null  str    
 3   Name                      90015 non-null   str    
 4   Age                       100000 non-null  str    
 5   SSN                       100000 non-null  str    
 6   Occupation                100000 non-null  str    
 7   Annual_Income             100000 non-null  str    
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  str    
 13  Type_of_Loan              88592 non-null   str    
 14  Delay_from_due_date       100000 non-null  int64  
 15  Num_of_Delayed_Payment    92998 non-null   str    
 16  Changed_Credit_Limit      100000 non-null  str    
 17  Num_Credit_Inquiries      98035 non-null   float64
 18  Credit_Mix                100000 non-null  str    
 19  Outstanding_Debt          100000 non-null  str    
 20  Credit_Utilization_Ratio  100000 non-null  float64
 21  Credit_History_Age        90970 non-null   str    
 22  Payment_of_Min_Amount     100000 non-null  str    
 23  Total_EMI_per_month       100000 non-null  float64
 24  Amount_invested_monthly   95521 non-null   str    
 25  Payment_Behaviour         100000 non-null  str    
 26  Monthly_Balance           98800 non-null   str    
 27  Credit_Score              100000 non-null  str    
dtypes: float64(4), int64(4), str(20)
memory usage: 21.4 MB
  • Several numeric columns are stored as strings with dirty values (_, trailing letters, embedded underscores).
  • Data is missing in many columns, but not with a consistent placeholder.
  • Cleaning is needed before any modelling.

Data Cleaning

The raw data required several steps before modelling:

  • Dropped identifier and sensitive columns: ID, Customer_ID, Name, SSN, Type_of_Loan
  • Coerced dirty numeric strings: stripped embedded underscores and trailing characters from 8 columns
  • Parsed credit history age: "22 Years and 1 Months" → integer months (Credit_History_Months)
  • Replaced _ placeholder with NaN in Credit_Mix
print(f"Raw: {raw.shape} -> Cleaned: {df.shape}")
print(f"\nMissing values after cleaning:")
missing = df.isnull().sum()
print(
    missing[missing > 0]
    .sort_values(ascending=False)
    .to_string()
)
Raw: (100000, 28) -> Cleaned: (100000, 23)

Missing values after cleaning:
Credit_Mix                 20195
Monthly_Inhand_Salary      15002
Credit_History_Months       9030
Num_of_Delayed_Payment      7002
Amount_invested_monthly     4479
Changed_Credit_Limit        2091
Num_Credit_Inquiries        1965
Monthly_Balance             1200
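The cleaning code itself is not shown on the slides. The sketch below is one possible way to carry out the listed steps on a tiny made-up frame; the helper names to_number and history_to_months and the toy values are assumptions for illustration, not the lecture's implementation.

```python
import re
import numpy as np
import pandas as pd

def to_number(series):
    """Keep only digits, '.', and '-', then convert; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r'[^0-9.\-]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

def history_to_months(text):
    """Convert '22 Years and 1 Months' to 265; return NaN when the pattern is absent."""
    match = re.search(r'(\d+)\s+Years?\s+and\s+(\d+)\s+Months?', str(text))
    if match is None:
        return np.nan
    years, months = map(int, match.groups())
    return 12 * years + months

# Tiny made-up frame that mimics the dirty patterns visible in raw.head()
toy = pd.DataFrame({
    'Age': ['23', '28_', '-500'],
    'Annual_Income': ['19114.12', '34847.84_', '_'],
    'Credit_History_Age': ['22 Years and 1 Months', None, '8 Years and 11 Months'],
    'Credit_Mix': ['_', 'Good', 'Standard'],
})

clean = pd.DataFrame({
    'Age': to_number(toy['Age']),
    'Annual_Income': to_number(toy['Annual_Income']),
    'Credit_History_Months': toy['Credit_History_Age'].map(history_to_months),
    # Replace the '_' placeholder with a missing value
    'Credit_Mix': toy['Credit_Mix'].mask(toy['Credit_Mix'] == '_'),
})
print(clean)
```

Note that this sketch only fixes the string formatting: an implausible value such as an age of -500 still parses as a number and would need a separate range check.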

Target Distribution

  • Standard is the majority class (~53 %), Good is the rarest (~18 %) — the dataset is moderately imbalanced.
  • This means a naive classifier predicting “Standard” every time would achieve ~53 % accuracy — our models must do meaningfully better.
fig, ax = plt.subplots(figsize=(7, 4))
sns.countplot(data=df, x='Credit_Score', ax=ax)

n = len(df)
for bar in ax.patches:
    pct = bar.get_height() / n * 100
    ax.text(bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 400,
            f'{pct:.1f} %', ha='center', fontsize=10)

ax.set_xlabel('Credit Score Band')
ax.set_ylabel('Count')
ax.set_title('Credit Score Class Distribution')
plt.tight_layout()
plt.show()

Target Distribution

Credit score class distribution. Standard is the majority class (53 %); Good is the rarest (18 %). Imbalance will matter for evaluation.

Building the Feature Matrix

  • We select 17 numeric and 5 categorical features, encode the target as an integer, then perform an 80/20 stratified train/validation split — keeping class proportions equal in both sets.
    • We use the function train_test_split from sklearn.model_selection with stratify=y.
num_features = [
    'Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
    'Num_Credit_Card', 'Interest_Rate','Num_of_Loan', 'Delay_from_due_date',
    'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
    'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Credit_History_Months',
    'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance',
]
cat_features = [
    'Month', 'Occupation', 'Credit_Mix',
    'Payment_of_Min_Amount', 'Payment_Behaviour',
]
# Encode target: Poor=0, Standard=1, Good=2
label_map = {'Poor': 0, 'Standard': 1, 'Good': 2}
y = df['Credit_Score'].map(label_map).values
X = df[num_features + cat_features]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train: {X_train.shape[0]:,}")
print(f"Validation: {X_val.shape[0]:,}")
Train: 80,000
Validation: 20,000
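To see what stratify=y does, the sketch below splits a synthetic label vector with roughly the lecture's class mix and compares class proportions in the two parts. The labels and the placeholder feature are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the lecture's class mix (Poor 29 %, Standard 53 %, Good 18 %)
y_demo = np.repeat([0, 1, 2], [290, 530, 180])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

prop_train = np.bincount(y_tr) / len(y_tr)   # class shares in the training part
prop_val = np.bincount(y_va) / len(y_va)     # class shares in the validation part
print(np.round(prop_train, 3), np.round(prop_val, 3))
```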

Preprocessing Pipeline

  • Numeric features are median-imputed then standardised; categorical features are mode-imputed then one-hot (dummy) encoded.
  • The preprocessor is fitted on the training set only and then applied to the validation set. This prevents data leakage, where information from the validation set influences the training process.
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features),
])

X_train_p = preprocessor.fit_transform(X_train)
X_val_p   = preprocessor.transform(X_val)

ohe_cols     = (preprocessor.named_transformers_['cat']['encode']
                .get_feature_names_out(cat_features).tolist())
feature_names = num_features + ohe_cols
print(f"Processed feature matrix: {X_train_p.shape}")
Processed feature matrix: (80000, 54)

Evaluating Classifiers

Choosing an Evaluation Metric

Accuracy — the simplest metric: fraction of all predictions that are correct.

\text{Accuracy} = \frac{\text{Number correct}}{\text{Total predictions}}

But accuracy can be misleading. In our dataset, 53 % of customers are “Standard” — a classifier that always predicts “Standard” scores 53 % accuracy without learning anything useful.

For each class k we also compute:

  • Precision - Of all customers we predicted as k, what fraction truly are k?
  • Recall - Of all customers that truly are k, what fraction did we identify?
  • F1 - Harmonic mean of precision and recall — balances the two.

When classes are imbalanced, a model can game accuracy by favouring the majority class. Precision and recall expose this — they measure performance within each class separately.
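The majority-class baseline described above can be checked with scikit-learn's DummyClassifier. The labels here are synthetic, built to match the approximate class mix quoted in the lecture; the features are irrelevant for this baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 0 = Poor, 1 = Standard, 2 = Good
y_demo = np.repeat([0, 1, 2], [290, 530, 180])
X_demo = np.zeros((len(y_demo), 1))          # features are ignored by this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)            # always predicts the majority class

acc = accuracy_score(y_demo, y_pred)
recalls = recall_score(y_demo, y_pred, average=None)
print(f"Accuracy: {acc:.2f}")
print("Recall per class:", recalls)
```

The accuracy equals the majority-class share, while recall is zero for Poor and Good: exactly the failure that accuracy alone hides.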

Beyond Accuracy: The Confusion Matrix

Confusion Matrix

For K classes, a K \times K table where entry (i, j) is the number of observations of true class i predicted as class j.

  • Diagonal entries: correct predictions.
  • Off-diagonal entries: misclassifications.

For our credit score problem (K = 3):

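True \ Predicted | Predicted Poor | Predicted Standard | Predicted Good
True Poor | correct | error | error
True Standard | error | correct | error
True Good | error | error | correct

TP, FP, and FN are defined per class. For class k, TP_k is the diagonal entry (k, k); FN_k is the sum of the other entries in row k (truly k, predicted as something else); FP_k is the sum of the other entries in column k (predicted k, truly something else).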

Precision, Recall, and F1-Score

For each class k, treating it as “positive” vs. “all others”:

\text{Precision}_k = \frac{TP_k}{TP_k + FP_k} \qquad \text{Recall}_k = \frac{TP_k}{TP_k + FN_k}

F_{1,k} = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}

  • Precision: of all customers predicted as class k, what fraction truly belong to k? High precision means few false alarms.
  • Recall: of all customers who truly belong to class k, what fraction did we correctly identify? High recall means few missed cases.
  • F1: the harmonic mean of precision and recall — useful when you care about both equally, and penalises extreme imbalances between the two.

In credit scoring, recall for “Poor” is critical — a missed high-risk customer can be very costly. But precision also matters — falsely flagging a “Good” customer harms trust.
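The per-class formulas can be computed directly from a confusion matrix. The label vectors below are a small made-up example (0 = Poor, 1 = Standard, 2 = Good), and the result is cross-checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Small made-up example with labels 0 = Poor, 1 = Standard, 2 = Good
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
tp = np.diag(cm)                        # correct predictions for each class
fp = cm.sum(axis=0) - tp                # predicted as k but truly another class
fn = cm.sum(axis=1) - tp                # truly k but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn's implementation
p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(y_true, y_pred)
print(cm)
print(np.round(precision, 3), np.round(recall, 3), np.round(f1, 3))
```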

Aggregating Across Classes

  • Precision, recall, and F1 are defined per class, but we often want a single number to compare models. We aggregate across classes in different ways depending on how much we care about each one.
Strategy | How
Macro average | Unweighted mean across all K classes
Weighted average | Mean weighted by class support (number of true samples)
Micro average | Pool TP/FP/FN across all classes first, then compute

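Weighted average tracks overall performance when class sizes differ, as they do here (Good is the rarest band), because each class counts in proportion to its support. Macro average gives every class equal weight, so it is the one to watch when performance on a rare class matters as much as performance on the majority class.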
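The three aggregation strategies can be compared on a small made-up example (the label vectors below are illustrative, not lecture data). In single-label multiclass problems the micro average coincides with accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1])

per_class = f1_score(y_true, y_pred, average=None)   # one F1 value per class
support = np.bincount(y_true)                        # number of true samples per class

macro = per_class.mean()                             # every class counts equally
weighted = np.average(per_class, weights=support)    # classes weighted by support
micro = f1_score(y_true, y_pred, average='micro')    # pooled TP / FP / FN

print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
```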

Logistic Regression in Practice

Fitting a Logistic Regression

  • We fit a multinomial logistic regression on the preprocessed training data; max_iter=1000 raises the solver's iteration cap from the default of 100 so that it has room to converge.
  • The model is then used to predict the validation set, with overall accuracy printed as a quick baseline check.
lr = LogisticRegression(
    max_iter=1000,    # multinomial is the default for 3+ classes in sklearn ≥ 1.5
    random_state=42,
)
lr.fit(X_train_p, y_train)

y_pred_lr = lr.predict(X_val_p)
print(f"Validation accuracy: {accuracy_score(y_val, y_pred_lr):.4f}")
Validation accuracy: 0.6162

Evaluating Logistic Regression

  • classification_report prints per-class precision, recall, and F1 alongside macro and weighted averages.
  • This gives a richer picture than accuracy alone — especially important given the class imbalance (Good ≈ 18 %, Standard ≈ 53 %).
print(classification_report(
    y_val, y_pred_lr,
    target_names=['Poor', 'Standard', 'Good'],
))
              precision    recall  f1-score   support

        Poor       0.63      0.43      0.51      5799
    Standard       0.64      0.73      0.69     10635
        Good       0.52      0.57      0.54      3566

    accuracy                           0.62     20000
   macro avg       0.60      0.58      0.58     20000
weighted avg       0.62      0.62      0.61     20000

Confusion Matrix — Logistic Regression

  • We produce the confusion matrix which summarizes the counts of true vs. predicted classes.
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(
    confusion_matrix(y_val, y_pred_lr),
    display_labels=['Poor', 'Standard', 'Good'],
).plot(ax=ax, colorbar=False, cmap='Oranges')
ax.set_title('Logistic Regression — Confusion Matrix', fontsize=11)
plt.tight_layout()
plt.show()

Confusion Matrix — Logistic Regression

Confusion matrix for logistic regression (validation set). The linear boundary struggles with both minority classes: in the classification report, ‘Good’ has the lowest precision (0.52) and ‘Poor’ the lowest recall (0.43).

Decision Trees

🌳 What is a Decision Tree?

A decision tree partitions the feature space into rectangular regions using a recursive sequence of binary splits. Each leaf is assigned the majority class of training observations that land there.

Decision Tree Visualization.

Decision trees are fully non-parametric — no distributional assumption about the boundary shape. The tree structure is the model.

How Splits are Chosen: Gini Impurity

At each node we search for the feature j and threshold t that most reduce impurity in the resulting child nodes.

Gini Impurity

For a node containing observations with class proportions p_1, \ldots, p_K:

G = 1 - \sum_{k=1}^{K} p_k^2

G = 0 means the node is pure (all one class). G is maximised when classes are equally represented.

The Gini gain of a candidate split is:

\Delta G = G(\text{parent}) - \frac{n_L}{n}\,G(\text{left}) - \frac{n_R}{n}\,G(\text{right})

We choose (j^*, t^*) that maximises \Delta G.
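A worked example of the two formulas, using class counts rather than proportions as input. The node counts are made up: a parent with 5 observations of each class, split into children with counts (4, 1) and (1, 4).

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given its class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n, n_left, n_right = np.sum(parent), np.sum(left), np.sum(right)
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)

g_parent = gini([5, 5])                       # 1 - (0.25 + 0.25) = 0.5
g_child = gini([4, 1])                        # 1 - (0.64 + 0.04) = 0.32
gain = gini_gain([5, 5], [4, 1], [1, 4])      # 0.5 - 0.5 * 0.32 - 0.5 * 0.32 = 0.18
print(g_parent, g_child, gain)
```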

Gini Impurity — Visualized

Gini impurity and (scaled) entropy as a function of the positive-class proportion in a 2-class problem. Both criteria are maximised at p = 0.5 and zero at pure nodes.

Growing and Stopping a Tree

A decision tree is grown recursively:

  1. Find the best split (j^*, t^*) at the current node.
  2. Partition observations into left (x_{j^*} \leq t^*) and right (x_{j^*} > t^*) children.
  3. Recurse on each child.
  4. Stop when a stopping criterion is met.
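Steps 1 and 2 can be sketched as an exhaustive search over features and thresholds. This is a simplified illustration, not scikit-learn's implementation; the function name best_split and the four-row dataset are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) pair with the largest Gini gain."""
    n, n_features = X.shape
    best = (None, None, 0.0)   # (feature index j*, threshold t*, gain)
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            gain = (gini(y)
                    - left.sum() / n * gini(y[left])
                    - (~left).sum() / n * gini(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Made-up data: feature 1 separates the classes perfectly at 2.5, feature 0 does not
X = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 3.0], [2.0, 4.0]])
y = np.array([0, 0, 1, 1])
j_star, t_star, gain = best_split(X, y)
print(j_star, t_star, gain)
```

A full tree would apply best_split recursively to the left and right subsets until one of the stopping criteria is met.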

Stopping criteria

Parameter | Effect
max_depth | Hard cap on tree height
min_samples_split | Minimum observations needed to attempt a split
min_samples_leaf | Minimum observations required in each resulting leaf
min_impurity_decrease | Only split if \Delta G exceeds this threshold

Overfitting with Decision Trees

Train and validation accuracy as a function of max_depth. Shallow trees underfit; very deep trees memorise the training set and generalise poorly.
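The pattern in the figure can be reproduced on synthetic data. The dataset below is generated with make_classification and 10 % label noise; it stands in for the credit data, and the depth grid is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with 10 % label noise (flip_y), standing in for a real dataset
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_classes=3, flip_y=0.1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

scores = {}
for depth in [1, 2, 4, 8, 16, None]:   # None grows the tree until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))

for depth, (train_acc, val_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}  validation={val_acc:.3f}")
```

The unrestricted tree fits the training set perfectly, noise included, so the gap between its training and validation accuracy is a direct measure of overfitting.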

Post-Pruning: Cost-Complexity

The stopping criteria above are pre-pruning controls — they prevent the tree from growing too large in the first place.

Post-pruning takes the opposite approach: grow the full tree first, then cut back branches whose impurity reduction is outweighed by the cost of keeping them.

Cost-Complexity Criterion

For a subtree T with |T| leaves, define the penalised training cost:

R_\alpha(T) = R(T) + \alpha \cdot |T|

where R(T) is the weighted leaf impurity and \alpha \geq 0 is the complexity parameter.

Value of \alpha | Effect
\alpha = 0 | No penalty: the full, unpruned tree is returned
\alpha small | Only the weakest branches are removed
\alpha large | Heavy pruning; tree shrinks toward a single node

The optimal \alpha is chosen by evaluating held-out performance across the pruning path.

Choosing \alpha by Validation

  • cost_complexity_pruning_path returns every \alpha at which a branch would be pruned; we evaluate validation accuracy across this path and select the best value.

Validation accuracy peaks at an intermediate α — just enough pruning to improve generalisation without discarding useful splits.
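A sketch of the selection procedure on synthetic data (the lecture's own code for this figure is not shown, so the dataset and settings here are assumptions). It uses DecisionTreeClassifier.cost_complexity_pruning_path to obtain the candidate values and the ccp_alpha parameter to refit a pruned tree for each one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, flip_y=0.1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Effective alphas at which successive branches of the full tree would be pruned
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))   # guard against tiny negative round-off

# Refit one pruned tree per candidate alpha and score it on the validation set
val_acc = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr).score(X_va, y_va)
    for a in alphas
]
best_alpha = alphas[int(np.argmax(val_acc))]
print(f"{len(alphas)} candidate alphas; best alpha = {best_alpha:.5f}, "
      f"validation accuracy = {max(val_acc):.3f}")
```

Because the same validation set is used to pick the best value, the reported accuracy at that value is somewhat optimistic; a separate test set or cross-validation would give a cleaner estimate.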

Decision Trees in Practice

Fitting Our First Decision Tree

  • To fit our decision tree we use sklearn.tree.DecisionTreeClassifier with max_depth=8 and min_samples_leaf=50 to prevent overfitting.
  • We then predict the validation set and print the accuracy as a quick check.
dt = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=50,
    random_state=42,
)
dt.fit(X_train_p, y_train)

y_pred_dt = dt.predict(X_val_p)
print(f"Validation accuracy: {accuracy_score(y_val, y_pred_dt):.4f}")
Validation accuracy: 0.7085

Visualizing the Top of the Tree

  • We produce a visualization of the top 3 levels of the tree using sklearn.tree.plot_tree.
    • Each node displays the splitting feature, Gini impurity, sample count, and class distribution.
    • Node colours indicate the majority (predicted) class at each node.
fig, ax = plt.subplots(figsize=(14, 5))
plot_tree(
    dt,
    max_depth=3,
    feature_names=feature_names,
    class_names=['Poor', 'Standard', 'Good'],
    filled=True,
    rounded=True,
    fontsize=7,
    ax=ax,
)
ax.set_title('Decision Tree — Top 3 Levels', fontsize=12)
plt.tight_layout()
plt.show()

Visualizing the Top of the Tree

The first three levels of the decision tree. Each node shows the splitting feature, Gini impurity, sample count, and class distribution. Node colours indicate the majority (predicted) class at each node.

Evaluating the Decision Tree

  • To evaluate the decision tree, we print the classification report which includes precision, recall, and F1-score for each class, as well as macro and weighted averages.
print(classification_report(
    y_val, y_pred_dt,
    target_names=['Poor', 'Standard', 'Good'],
))
              precision    recall  f1-score   support

        Poor       0.72      0.69      0.70      5799
    Standard       0.75      0.74      0.75     10635
        Good       0.58      0.65      0.61      3566

    accuracy                           0.71     20000
   macro avg       0.68      0.69      0.69     20000
weighted avg       0.71      0.71      0.71     20000
  • We see that the tree performs reasonably well on “Poor” and “Standard” classes, but struggles with “Good”, likely due to overlap with “Standard” and the axis-aligned splits.
  • Compared with logistic regression, it has higher recall for “Poor” (0.69 vs. 0.43) and also higher precision (0.72 vs. 0.63): it catches more high-risk customers, and a larger share of its “Poor” flags are correct.

Confusion Matrix — Decision Tree

  • Finally we plot the confusion matrix for the decision tree predictions on the validation set, which shows the counts of true vs. predicted classes.
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(
    confusion_matrix(y_val, y_pred_dt),
    display_labels=['Poor', 'Standard', 'Good'],
).plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title('Decision Tree — Confusion Matrix', fontsize=11)
plt.tight_layout()
plt.show()

Confusion Matrix — Decision Tree

Confusion matrix for the decision tree (validation set). Rows are true labels; columns are predicted labels.

From Trees to Forests

🌲 The Variance Problem with Single Trees

Decision trees have high variance: small changes in the training data can produce very different trees.

Why?

  • Each split depends entirely on which observations happened to be in the training set.
  • A single influential observation can redirect an entire branch.
  • Deep trees memorise noise rather than signal.

Can we reduce variance without substantially increasing bias?

Bagging: Bootstrap Aggregating

Idea: train many trees on different bootstrap samples of the training data, then combine their predictions by majority vote.

Bagging Algorithm

For b = 1, \ldots, B:

  1. Draw a bootstrap sample \mathcal{D}^*_b of size n from the training data (sampling with replacement).
  2. Fit a full, unpruned decision tree T_b on \mathcal{D}^*_b.

Prediction: \hat{y} = \text{mode}\!\left\{T_1(\mathbf{x}), \ldots, T_B(\mathbf{x})\right\}.

  • Each bootstrap sample contains \approx 63\% of the distinct training observations; the rest form a natural out-of-bag (OOB) validation set for that tree.
  • Averaging over B trees reduces variance by a factor of \approx B, provided the trees are uncorrelated.
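The 63 % figure comes from the limit 1 - 1/e \approx 0.632 and is easy to check by simulation. The sample size n below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One bootstrap sample: n draws with replacement from the indices 0, ..., n-1
boot_idx = rng.integers(0, n, size=n)
in_bag = np.unique(boot_idx).size / n   # share of distinct observations that were drawn

print(f"In-bag share:     {in_bag:.3f}   (1 - 1/e = {1 - np.exp(-1):.3f})")
print(f"Out-of-bag share: {1 - in_bag:.3f}")
```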

Random Forests: Decorrelating the Trees

The problem with plain bagging: if one feature is very strong, all trees split on it first → trees are correlated → variance reduction is limited.
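The limit on variance reduction can be made precise under a simplifying assumption that is not stated in the lecture: treat the B tree outputs as numeric predictions \hat{f}_b(\mathbf{x}) (for example, predicted class probabilities) that are identically distributed with variance \sigma^2 and common pairwise correlation \rho. Then the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(\mathbf{x})\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

With \rho = 0 this reduces to \sigma^2 / B. With \rho > 0 the second term vanishes as B grows, but the first term \rho\,\sigma^2 remains, so adding more trees cannot push the variance below that floor. Lowering \rho is the only way to reduce it further.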

Random Forest

Bagging + feature subsampling: at each candidate split, consider only a random subset of m features (default: m = \lfloor\sqrt{p}\rfloor) rather than all p.

  • Decorrelates trees → greater variance reduction than plain bagging.
  • Slight bias increase (fewer features considered per split), but usually a net improvement.
  • The OOB observations give a free internal accuracy estimate — no separate validation set needed.

Feature Importance from Random Forests

Random forests provide a natural variable importance measure:

Mean Decrease in Impurity (MDI)

For feature j, sum the weighted Gini gain from every split on j across all trees:

\text{Importance}(j) = \frac{1}{B}\sum_{b=1}^{B} \sum_{\substack{v \in T_b \\ \text{split on } j}} \frac{n_v}{n}\,\Delta G_v

Features with large, frequent Gini gains are ranked most important.

Caveat: MDI can overstate importance for high-cardinality or continuous features. Permutation importance is a more reliable alternative for such cases.
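A minimal sketch comparing the two measures on synthetic data, using sklearn.inspection.permutation_importance. The dataset is made up: with shuffle=False, make_classification places the two informative features in the first two columns and fills the rest with noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0 and 1 are informative, columns 2 to 5 are pure noise
X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

mdi = forest.feature_importances_   # impurity-based, computed from the training data
perm = permutation_importance(forest, X_va, y_va, n_repeats=5, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Permutation importance is measured on held-out data as the drop in score when one column is shuffled, so noise features tend to score near zero, whereas MDI always assigns them some positive share.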

Random Forests in Practice

Fitting a Random Forest

  • We fit a random forest using sklearn.ensemble.RandomForestClassifier with 200 trees, max_features='sqrt', and min_samples_leaf=10 to prevent overfitting.
  • We also set oob_score=True to compute the out-of-bag accuracy, which provides an internal validation estimate without needing a separate validation set.
rf = RandomForestClassifier(
    n_estimators=200,       # 200 trees in the ensemble
    max_features='sqrt',    # √p features considered per split
    min_samples_leaf=10,
    n_jobs=-1,              # parallelise across all CPU cores
    oob_score=True,         # compute out-of-bag accuracy
    random_state=42,
)
rf.fit(X_train_p, y_train)

y_pred_rf = rf.predict(X_val_p)
print(f"OOB accuracy:        {rf.oob_score_:.4f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_pred_rf):.4f}")
OOB accuracy:        0.7465
Validation accuracy: 0.7512

Evaluating the Random Forest

  • To evaluate the random forest, we print the classification report which includes precision, recall, and F1-score for each class, as well as macro and weighted averages.
print(classification_report(
    y_val, y_pred_rf,
    target_names=['Poor', 'Standard', 'Good'],
))

              precision    recall  f1-score   support

        Poor       0.77      0.72      0.74      5799
    Standard       0.78      0.79      0.78     10635
        Good       0.65      0.68      0.67      3566

    accuracy                           0.75     20000
   macro avg       0.73      0.73      0.73     20000
weighted avg       0.75      0.75      0.75     20000

Confusion Matrix — Random Forest

  • We produce the confusion matrix for the random forest predictions on the validation set, which shows the counts of true vs. predicted classes.
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(
    confusion_matrix(y_val, y_pred_rf),
    display_labels=['Poor', 'Standard', 'Good'],
).plot(ax=ax, colorbar=False, cmap='Greens')
ax.set_title('Random Forest — Confusion Matrix', fontsize=11)
plt.tight_layout()
plt.show()

Confusion Matrix — Random Forest

Confusion matrix for the random forest (validation set). Compare the off-diagonal entries with the decision tree.

Feature Importance Plot

  • The following code produces a horizontal bar plot of the top 15 most important features according to mean decrease in Gini impurity.
  • The features at the top produce the largest, most frequent gains across all 200 trees in the random forest.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top15 = importances.sort_values(ascending=False).head(15)

fig, ax = plt.subplots(figsize=(9, 5))
top15.sort_values().plot.barh(ax=ax, color='steelblue', edgecolor='white')
ax.set_xlabel('Mean Decrease in Impurity', fontsize=11)
ax.set_title('Random Forest — Top 15 Feature Importances', fontsize=12)
plt.tight_layout()
plt.show()

Feature Importance Plot

Top 15 most important features by mean decrease in Gini impurity. Features at the top produce the largest, most frequent gains across all 200 trees.

Comparing All Three Classifiers

  • To compare the three classifiers head-to-head, we compute per-class precision, recall, and F1 for each model and visualise them side by side in a grouped bar chart.
classes = ['Poor', 'Standard', 'Good']

results = {}
for name, pred in [('Logistic Reg.', y_pred_lr),
                   ('Decision Tree', y_pred_dt),
                   ('Random Forest', y_pred_rf)]:
    p, r, f, _ = precision_recall_fscore_support(
        y_val, pred, labels=[0, 1, 2])
    results[name] = pd.DataFrame(
        {'Class': classes, 'Precision': p, 'Recall': r, 'F1': f})

combined = (
    pd.concat(results, names=['Model'])
    .reset_index(level=0)
    .rename(columns={'level_0': 'Model'})
)
melted = combined.melt(
    id_vars=['Model', 'Class'],
    value_vars=['Precision', 'Recall', 'F1'],
    var_name='Metric', value_name='Score',
)

fig, axes = plt.subplots(1, 3, figsize=(13, 4.5), sharey=True)
palette = ['#fc8d62', 'steelblue', '#66c2a5']
for ax, metric in zip(axes, ['Precision', 'Recall', 'F1']):
    sub = melted[melted['Metric'] == metric]
    sns.barplot(data=sub, x='Class', y='Score', hue='Model', ax=ax,
                palette=palette)
    ax.set_title(metric, fontsize=11)
    ax.set_ylim(0, 1.05)
    ax.set_xlabel('')
    ax.set_ylabel('Score' if ax is axes[0] else '')
    ax.legend(fontsize=7)

plt.suptitle('Logistic Regression vs. Decision Tree vs. Random Forest — Per-Class Metrics',
             fontsize=11, y=1.02)
plt.tight_layout()
plt.show()

Comparing All Three Classifiers

Per-class precision, recall, and F1 for all three classifiers. The random forest leads on every per-class metric in the classification reports above; logistic regression sets the linear baseline.

Summary Table

  • We also summarise the key results in a table comparing validation accuracy, OOB accuracy (for random forest), whether the model has a linear decision boundary, and interpretability.
summary = pd.DataFrame({
    'Model': [
        'Logistic Regression',
        'Decision Tree (depth 8)',
        'Random Forest (200 trees)',
    ],
    'Val. Accuracy': [
        round(accuracy_score(y_val, y_pred_lr), 4),
        round(accuracy_score(y_val, y_pred_dt), 4),
        round(accuracy_score(y_val, y_pred_rf), 4),
    ],
    'OOB Accuracy':   ['—', '—', round(rf.oob_score_, 4)],
    'Linear boundary?': ['Yes', 'No', 'No'],
    'Interpretable?':   ['Coefficients', 'Tree diagram', 'Feature importance only'],
})
print(summary.to_string(index=False))

Summary Table

                    Model  Val. Accuracy OOB Accuracy Linear boundary?          Interpretable?
      Logistic Regression         0.6162            —              Yes            Coefficients
  Decision Tree (depth 8)         0.7085            —               No            Tree diagram
Random Forest (200 trees)         0.7512       0.7465               No Feature importance only

Conclusion

✅ What We Covered

  • Classification — predicting a categorical response and how it differs from regression.
  • Logistic regression — binary sigmoid recap; multinomial extension; linear decision boundaries; applied as our baseline.
  • Decision trees — recursive binary splitting via Gini impurity; overfitting and depth control; applied and visualized.
  • Random forests — bagging + feature subsampling to reduce variance; OOB error; feature importance; applied and compared.
  • Evaluation — confusion matrix, precision, recall, F1, and their multiclass extensions.
  • Three-way comparison — the Kaggle Credit Score dataset taken from raw data to three fitted classifiers with a head-to-head performance summary.

📅 What’s Next?

  • Lecture 17: Time Series — stationarity, autocorrelation, ARIMA, and forecasting with a real economic dataset.
  • Lectures 18–20: Unsupervised learning, neural networks, and ethics in data science.