Poisson Regression

Poisson regression in python.
Author

John Robin Inston

Published

May 13, 2026

Modified

May 13, 2026

ImportantLearning Objectives
  • Define the Poisson regression model.
  • Implement Poisson regression in python.
Libraries and Styling
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, binom
from scipy.special import expit          # the logistic (sigmoid) function
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, roc_auc_score
)

# Figure styling
sns.set_style('whitegrid')
sns.set_palette('Set2')

Poisson Regression

Poisson regression is the GLM for count data: responses \(Y_i \in \{0, 1, 2, \ldots\}\). A natural model is \(Y_i \sim \text{Poisson}(\lambda_i)\), with \(\mathbb{E}[Y_i] = \lambda_i > 0\) and — crucially — \(\text{Var}(Y_i) = \lambda_i\) (variance equals mean).

We use the log link to ensure \(\lambda_i > 0\):

\[\log \lambda_i = \mathbf{x}_i^\top\boldsymbol{\beta}, \quad \text{so} \quad \lambda_i = e^{\mathbf{x}_i^\top\boldsymbol{\beta}}.\]

Coefficients are interpreted multiplicatively via the incidence rate ratio (IRR) \(e^{\hat{\beta}_j}\), the Poisson analogue of the odds ratio:

Coefficient Interpretation
\(\beta_j\) A one-unit increase in \(X_j\) adds \(\beta_j\) to \(\log\lambda\)
\(e^{\beta_j}\) A one-unit increase in \(X_j\) multiplies \(\lambda\) by \(e^{\beta_j}\)
\(e^{\beta_j} > 1\) The expected count increases
\(e^{\beta_j} < 1\) The expected count decreases

Application: Police Frisk Data

frisk = pd.read_csv("data/frisk.txt", sep=r"\s+", skiprows=6)
frisk.head()
stops pop past.arrests precinct eth crime
0 75 1720 191 1 1 1
1 36 1720 57 1 1 2
2 74 1720 599 1 1 3
3 17 1720 133 1 1 4
4 37 1368 62 1 2 1
X = (frisk
    .groupby(['eth', 'precinct'])[["stops", "past.arrests"]]
    .sum()
    .reset_index()
    .pipe(pd.get_dummies, columns=['eth', 'precinct'], dtype=int)
    .assign(intercept=1)  # Adds a column called 'intercept' with all values equal to 1.
    .sort_values(by='stops')
    .reset_index(drop=True)
)

y = X.pop("stops")
model_no_indicators = sm.GLM(
    y,
    X["intercept"],
    offset=np.log(X["past.arrests"]),
    family=sm.families.Poisson(),
)
result_no_indicators = model_no_indicators.fit()
print(result_no_indicators.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  stops   No. Observations:                  225
Model:                            GLM   Df Residuals:                      224
Model Family:                 Poisson   Df Model:                            0
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -23913.
Date:                Sat, 23 May 2026   Deviance:                       46120.
Time:                        13:13:28   Pearson chi2:                 4.96e+04
No. Iterations:                     5   Pseudo R-squ. (CS):              0.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.5877      0.003   -213.058      0.000      -0.593      -0.582
==============================================================================
plt.plot(y, result_no_indicators.fittedvalues, 'o')
plt.plot(y, y, '--', label='y = x')
plt.ylabel("fitted value")
plt.xlabel("observed value")
plt.legend()
plt.show()

model_with_ethnicity = sm.GLM(
    y,
    X[['intercept', 'eth_2', 'eth_3']],
    offset=np.log(X["past.arrests"]),
    family=sm.families.Poisson(),
)
result_with_ethnicity = model_with_ethnicity.fit()
print(result_with_ethnicity.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  stops   No. Observations:                  225
Model:                            GLM   Df Residuals:                      222
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -23572.
Date:                Sat, 23 May 2026   Deviance:                       45437.
Time:                        13:13:29   Pearson chi2:                 4.94e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9519
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.5881      0.004   -155.396      0.000      -0.596      -0.581
eth_2          0.0702      0.006     11.584      0.000       0.058       0.082
eth_3         -0.1616      0.009    -18.881      0.000      -0.178      -0.145
==============================================================================
model_with_ethnicity_and_precinct = sm.GLM(
    y,
    X.drop(columns=["eth_1", "precinct_1", "past.arrests"]),
    offset=np.log(X["past.arrests"]),
    family=sm.families.Poisson(),
)

result_with_ethnicity_and_precinct = model_with_ethnicity_and_precinct.fit()
print(result_with_ethnicity_and_precinct.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  stops   No. Observations:                  225
Model:                            GLM   Df Residuals:                      148
Model Family:                 Poisson   Df Model:                           76
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2566.9
Date:                Sat, 23 May 2026   Deviance:                       3427.1
Time:                        13:13:29   Pearson chi2:                 3.24e+03
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
eth_2           0.0102      0.007      1.498      0.134      -0.003       0.024
eth_3          -0.4190      0.009    -44.409      0.000      -0.437      -0.401
precinct_2     -0.1490      0.074     -2.013      0.044      -0.294      -0.004
precinct_3      0.5600      0.057      9.866      0.000       0.449       0.671
precinct_4      1.2106      0.058     21.037      0.000       1.098       1.323
precinct_5      0.2829      0.057      4.981      0.000       0.172       0.394
precinct_6      1.1442      0.058     19.712      0.000       1.030       1.258
precinct_7      0.2182      0.064      3.391      0.001       0.092       0.344
precinct_8     -0.3906      0.057     -6.868      0.000      -0.502      -0.279
precinct_9      0.4851      0.078      6.207      0.000       0.332       0.638
precinct_10     0.4171      0.059      7.087      0.000       0.302       0.532
precinct_11     0.6544      0.062     10.626      0.000       0.534       0.775
precinct_12     1.1487      0.061     18.811      0.000       1.029       1.268
precinct_13     1.0535      0.055     19.068      0.000       0.945       1.162
precinct_14     0.6141      0.059     10.444      0.000       0.499       0.729
precinct_15     1.0549      0.055     19.195      0.000       0.947       1.163
precinct_16     0.7977      0.060     13.231      0.000       0.680       0.916
precinct_17    -0.0493      0.060     -0.815      0.415      -0.168       0.069
precinct_18     0.1316      0.055      2.394      0.017       0.024       0.239
precinct_19     0.3192      0.058      5.503      0.000       0.205       0.433
precinct_20    -0.0406      0.057     -0.713      0.476      -0.152       0.071
precinct_21     0.4110      0.057      7.235      0.000       0.300       0.522
precinct_22     1.2112      0.053     22.683      0.000       1.107       1.316
precinct_23     0.5999      0.055     10.895      0.000       0.492       0.708
precinct_24     1.3195      0.055     24.166      0.000       1.212       1.426
precinct_25     0.9134      0.054     17.041      0.000       0.808       1.018
precinct_26    -0.1476      0.057     -2.589      0.010      -0.259      -0.036
precinct_27     1.8955      0.056     34.052      0.000       1.786       2.005
precinct_28    -0.7650      0.061    -12.558      0.000      -0.884      -0.646
precinct_29     1.1247      0.054     20.726      0.000       1.018       1.231
precinct_30     0.5247      0.055      9.475      0.000       0.416       0.633
precinct_31     1.6495      0.056     29.368      0.000       1.539       1.760
precinct_32     1.3863      0.060     23.229      0.000       1.269       1.503
precinct_33     1.0832      0.054     19.980      0.000       0.977       1.189
precinct_34     1.5190      0.054     27.945      0.000       1.412       1.626
precinct_35     0.8794      0.063     13.917      0.000       0.756       1.003
precinct_36     1.6163      0.059     27.490      0.000       1.501       1.732
precinct_37     1.4094      0.060     23.487      0.000       1.292       1.527
precinct_38     1.7638      0.056     31.388      0.000       1.654       1.874
precinct_39     0.5011      0.055      9.137      0.000       0.394       0.609
precinct_40     1.5320      0.061     24.924      0.000       1.412       1.652
precinct_41     1.9139      0.056     34.351      0.000       1.805       2.023
precinct_42     1.0842      0.055     19.890      0.000       0.977       1.191
precinct_43     0.5418      0.057      9.565      0.000       0.431       0.653
precinct_44     0.8830      0.057     15.582      0.000       0.772       0.994
precinct_45     0.8384      0.054     15.465      0.000       0.732       0.945
precinct_46     0.6559      0.054     12.202      0.000       0.551       0.761
precinct_47     1.1281      0.060     18.808      0.000       1.011       1.246
precinct_48     0.4119      0.057      7.239      0.000       0.300       0.523
precinct_49     1.1012      0.060     18.452      0.000       0.984       1.218
precinct_50     0.8788      0.054     16.241      0.000       0.773       0.985
precinct_51     0.3490      0.059      5.960      0.000       0.234       0.464
precinct_52     0.4919      0.055      8.870      0.000       0.383       0.601
precinct_53     0.5997      0.058     10.254      0.000       0.485       0.714
precinct_54     0.4918      0.060      8.192      0.000       0.374       0.610
precinct_55     0.4975      0.059      8.429      0.000       0.382       0.613
precinct_56     0.8909      0.066     13.502      0.000       0.762       1.020
precinct_57     1.5328      0.061     24.978      0.000       1.413       1.653
precinct_58     1.5788      0.055     28.955      0.000       1.472       1.686
precinct_59     1.0837      0.059     18.393      0.000       0.968       1.199
precinct_60     0.8338      0.054     15.380      0.000       0.728       0.940
precinct_61     1.2773      0.057     22.506      0.000       1.166       1.389
precinct_62     1.2064      0.057     21.289      0.000       1.095       1.317
precinct_63     1.3088      0.058     22.449      0.000       1.195       1.423
precinct_64     1.7125      0.058     29.642      0.000       1.599       1.826
precinct_65     1.8992      0.055     34.434      0.000       1.791       2.007
precinct_66     2.1021      0.054     38.763      0.000       1.996       2.208
precinct_67     1.2028      0.055     21.887      0.000       1.095       1.311
precinct_68     2.2013      0.059     37.155      0.000       2.085       2.317
precinct_69     1.7207      0.058     29.551      0.000       1.607       1.835
precinct_70     1.0320      0.055     18.828      0.000       0.925       1.139
precinct_71     1.4794      0.054     27.276      0.000       1.373       1.586
precinct_72     1.4650      0.054     27.262      0.000       1.360       1.570
precinct_73     0.9910      0.054     18.494      0.000       0.886       1.096
precinct_74     1.1515      0.058     19.845      0.000       1.038       1.265
precinct_75     1.5712      0.076     20.747      0.000       1.423       1.720
intercept      -1.3789      0.051    -27.027      0.000      -1.479      -1.279
===============================================================================
precinct_coefs = result_with_ethnicity_and_precinct.params.iloc[2:-1] # Only intersted in precinct
precinct_interval = result_with_ethnicity_and_precinct.conf_int().reindex(precinct_coefs.index)

plt.figure(figsize=(15, 6))
plt.plot(precinct_coefs, '.')
for precinct, interval in precinct_interval.iterrows():
    plt.plot([precinct, precinct], interval, color='C0')
plt.axhline(y=0, linestyle=':', color='black')
plt.xticks(
    precinct_coefs.index[::3],
    [int(x[1]) for x in precinct_coefs.index.str.split("_",)][::3]
)
plt.ylabel("Estimated coefficient")
plt.xlabel("Precinct")
plt.show()

plt.figure(figsize=(15,6))
plt.scatter(
    y, result_with_ethnicity.fittedvalues, 
    marker='o', color='#fc8d62', edgecolors='white', 
    linewidths=0.4, s=60, label="with ethnicity")
plt.scatter(
    y, result_with_ethnicity_and_precinct.fittedvalues, 
    marker='^', color='#8da0cb', edgecolors='white', 
    linewidths=0.4, s=60, label="with eth and precinct")
plt.plot(y, y, '--', color='black', label='y = x')
plt.ylabel("fitted value")
plt.xlabel("observed value")
plt.legend()
plt.show()

Application: Daily Bike Rentals

We simulate a dataset of daily bicycle rental counts with weather and calendar predictors — the structure mirrors real bike-sharing data and allows us to verify that the model recovers the known coefficients.

rng_b = np.random.default_rng(2024)
n_b = 500
temp_b    = rng_b.uniform(0.05, 0.95, n_b)
hum_b     = rng_b.uniform(0.30, 0.95, n_b)
wind_b    = rng_b.uniform(0.00, 0.50, n_b)
season_b  = rng_b.choice([1, 2, 3, 4], n_b)
working_b = rng_b.choice([0, 1], n_b)

log_lam = (
    5.5
    + 1.8 * temp_b
    - 1.2 * hum_b
    - 0.6 * wind_b
    + 0.3 * (season_b == 2).astype(float)
    + 0.1 * (season_b == 3).astype(float)
    - 0.1 * (season_b == 4).astype(float)
    + 0.15 * working_b
)
cnt_b = rng_b.poisson(np.exp(log_lam))

bikes = pd.DataFrame({
    'cnt': cnt_b, 'temp': temp_b, 'hum': hum_b,
    'windspeed': wind_b, 'season': season_b, 'workingday': working_b
})
print(bikes.describe().round(2))
           cnt    temp     hum  windspeed  season  workingday
count   500.00  500.00  500.00     500.00  500.00       500.0
mean    316.72    0.48    0.64       0.27    2.49         0.5
std     187.51    0.26    0.19       0.14    1.10         0.5
min      70.00    0.05    0.30       0.00    1.00         0.0
25%     177.75    0.25    0.47       0.15    2.00         0.0
50%     269.00    0.49    0.65       0.27    2.00         0.0
75%     404.25    0.71    0.82       0.39    3.00         1.0
max    1344.00    0.95    0.95       0.50    4.00         1.0

Distribution of daily rental counts. The right skew is characteristic of count data and motivates the use of Poisson regression over OLS.

Rentals vs. normalised temperature. A clear positive association is visible, consistent with the positive coefficient on temperature in the Poisson model.

Rentals vs. normalised humidity. Higher humidity is associated with fewer rentals, consistent with the negative humidity coefficient.

Rental counts by season (1 = Spring, 2 = Summer, 3 = Autumn, 4 = Winter). Summer and autumn show higher median rentals than spring or winter.
poisson1 = smf.poisson(
    'cnt ~ temp + hum + windspeed + C(season) + workingday',
    data=bikes
).fit()
print(poisson1.summary())
Optimization terminated successfully.
         Current function value: 4.250748
         Iterations 5
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:                    cnt   No. Observations:                  500
Model:                        Poisson   Df Residuals:                      492
Method:                           MLE   Df Model:                            7
Date:                Sat, 23 May 2026   Pseudo R-squ.:                  0.9204
Time:                        13:13:31   Log-Likelihood:                -2125.4
converged:                       True   LL-Null:                       -26687.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          5.5235      0.013    428.101      0.000       5.498       5.549
C(season)[T.2]     0.2826      0.007     40.687      0.000       0.269       0.296
C(season)[T.3]     0.0801      0.008     10.667      0.000       0.065       0.095
C(season)[T.4]    -0.1084      0.008    -13.792      0.000      -0.124      -0.093
temp               1.7894      0.010    175.609      0.000       1.769       1.809
hum               -1.2198      0.013    -93.153      0.000      -1.245      -1.194
windspeed         -0.5851      0.018    -32.299      0.000      -0.621      -0.550
workingday         0.1571      0.005     30.955      0.000       0.147       0.167
==================================================================================

Incidence rate ratios (IRR = exp(β̂)) with 95% CIs. An IRR > 1 means more rentals; < 1 means fewer rentals per unit increase in the predictor.

Poisson regression diagnostics. Left: Pearson residuals vs. fitted values — constant spread around zero indicates the model is well-specified; a fan shape would indicate overdispersion. Right: Q-Q plot of Pearson residuals — points near the diagonal confirm the standardised residuals follow N(0, 1) as expected under a correctly-specified Poisson model.

GLM Diagnostics and Assumptions

In OLS, residuals \(e_i = y_i - \hat{y}_i\) have constant variance. In GLMs the variance of \(Y_i\) depends on the mean, so raw residuals are not comparable across observations. Two standardised alternatives are used:

Residual Formula Use
Pearson \(r_i = (y_i - \hat{\mu}_i)/\sqrt{\hat{V}(\hat{\mu}_i)}\) Variance-standardised; detect outliers
Deviance \(d_i = \text{sign}(y_i - \hat{\mu}_i)\sqrt{2[y_i\log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]}\) Contribution to total deviance

Under a well-fitting GLM, both should look approximately \(\mathcal{N}(0,1)\) with no pattern against fitted values. Core assumptions and how to check them:

Assumption How to check
Correct distribution family Residual plots; compare AIC across families
Correct link function Component-plus-residual plots; try alternative links
Correct linear predictor Residuals vs. each predictor; added-variable plots
Independence of observations Study design; check for clusters or time structure
No influential observations Cook’s distance; leverage

GLMs do not require constant error variance or normally distributed errors — the exponential family framework handles both explicitly.

Overdispersion in Poisson Regression

ImportantOverdispersion

Overdispersion occurs when the observed variance exceeds the variance assumed by the model. For Poisson regression the model assumes \(\text{Var}(Y_i) = \lambda_i\). If the true variance is \(\phi \lambda_i\) with \(\phi > 1\), the data are overdispersed.

Overdispersion leads to underestimated standard errors and \(p\)-values that are too small. Diagnose it with the dispersion statistic \(\hat{\phi} = \chi^2_P / (n - p - 1)\), where \(\chi^2_P = \sum r_i^2\) is the Pearson chi-squared. Remedies include the quasi-Poisson model (which estimates \(\phi\) from the data) or the negative binomial distribution.

pearson_chi2 = np.sum(
    (bikes['cnt'] - poisson1.fittedvalues)**2 / poisson1.fittedvalues
)
phi_hat = pearson_chi2 / poisson1.df_resid
print(f"Pearson χ² / df = {phi_hat:.2f}")
print("Overdispersed" if phi_hat > 1.5 else "Dispersion OK")
Pearson χ² / df = 21750.01
Overdispersed

Summary

Linear regression is itself a GLM — Normal family, identity link. The framework unifies what might otherwise appear to be separate methods:

Feature Linear Regression GLM
Response type Continuous, unbounded Exponential family (binary, count, …)
Error distribution Normal Flexible
Mean function \(\mu = \mathbf{x}^\top\boldsymbol{\beta}\) (identity link) \(g(\mu) = \mathbf{x}^\top\boldsymbol{\beta}\)
Estimation OLS (closed-form) MLE via IRLS (iterative)
Residual measure RSS Deviance
Model comparison \(F\)-test, AIC, BIC LRT (\(G^2\), \(\chi^2\)), AIC, BIC
Coefficient interpretation Additive effect on \(\mu\) Additive effect on \(g(\mu)\); back-transform for multiplicative interpretation

References

Back to top