Data Visualization

Standard visualizations for EDA, matplotlib and seaborn fundamentals, variable-based formatting, and multiplotting with FacetGrid.
Author

John Robin Inston

Published

May 6, 2026

Modified

May 6, 2026

ImportantLearning Objectives
  • Understand what exploratory data analysis (EDA) is and how visualization supports it.
  • Know which plot types to use for univariate, bivariate, and grouped data.
  • Produce scatter plots, bar plots, histograms, box plots, and violin plots in matplotlib and seaborn.
  • Apply variable-based formatting (color, shape, size) to encode additional dimensions.
  • Create multiplots using matplotlib subplots and seaborn FacetGrid.

Exploratory Data Analysis

What is EDA?

After acquiring and preparing data, the next step in the data science lifecycle is exploration.

ImportantExploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analysing and visualising data to understand its structure, identify patterns, and uncover insights.

EDA typically follows three steps:

  1. Univariate analysis — understand the distribution of individual variables.
  2. Multivariate analysis — explore relationships between variables.
  3. Insight development — form hypotheses to inform modelling and deeper analysis.

EDA Workflow

A key tool throughout EDA is data visualization. Python’s standard library does not include a full visualization system, so we use dedicated libraries: matplotlib, seaborn, altair, and plotly.

Why Visualize Data?

Visualization helps to:

  • Improve interpretability of data.
  • Identify patterns and relationships that are not obvious in raw tables.
  • Communicate findings effectively to an audience.
  • Explore data in an interactive way.

Aim for visualizations that are:

  • Novel — contribute something beyond a plain table.
  • Informative — answer a specific question.
  • Efficient — convey the message with minimum ink.
  • Aesthetic — clear and visually appealing.

Always ask yourself: what is this visualization for? What is it contributing?

Types of Visualization

Typically we wish to produce visualizations for two main reasons:

  • Exploratory visualizations — created quickly during analysis to understand the data; not necessarily polished.
  • Presentation visualizations — used to communicate findings; refined with a focus on clarity and aesthetics.

Furthermore, we can categorize visualizations by their level of interactivity:

  • Static — fixed images; no user interaction.
  • Interactive — allow zooming, filtering, and hovering for detail.
  • Animated — show changes in data over time or across conditions.

Visualization Libraries

matplotlib

The foundational visualization library in Python. Almost every other library — including seaborn — is built on top of it.

  • Imperative interface: total control over all plot details, but often verbose for common plots.
  • Strong for fine-grained customization and publication-style figures.
  • We use the object-oriented interface (fig, ax = plt.subplots()), which is more flexible than the pyplot state-based interface.

seaborn

A declarative statistical visualization library built on top of matplotlib, similar to ggplot2 in R.

  • Creates common statistical visualizations with much less code.
  • Provides useful defaults for themes, color palettes, and grouped comparisons.
  • Well suited for EDA and rapid iteration.

Standard Visualizations

We use the penguins dataset from seaborn, which contains measurements of penguins from three species: Adelie, Chinstrap, and Gentoo.

penguins = sns.load_dataset("penguins")
print(penguins.shape)
penguins.head()
(344, 7)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Scatter Plots

A scatter plot pairs two continuous variables to visualise their relationship. We can produce a scater plot in matplotlib by calling ax.scatter() on an axis object:

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(
    "bill_length_mm", "bill_depth_mm",
    data=penguins,
    color="steelblue",
    alpha=0.95,
    s=7,
    marker="D"
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth"
)
plt.show()

Scatter plot of bill length vs bill depth (matplotlib).

We can produce the same plot in seaborn with less code by calling sns.scatterplot():

sns.set_theme(style="whitegrid", palette="Set2")
ax = sns.scatterplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=penguins
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth"
)
plt.show()

Scatter plot of bill length vs bill depth (seaborn).

Variable-Based Formatting

To examine the influence of a third categorical variable, we can map it to the color (hue), shape (style), or size (size) of points.

ax = sns.scatterplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=penguins,
    hue="species"
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth"
)
plt.show()

Scatter plot colored by species (seaborn).

Bar Plots

A bar plot represents categorical data with rectangular bars whose lengths are proportional to the values they represent. They are commonly used to show counts or proportions across categories.

In matplotlib, we need to pre-aggregate the data before plotting.

bp1_data = penguins.groupby("species").size().reset_index(name="count")
print(bp1_data)
     species  count
0     Adelie    152
1  Chinstrap     68
2     Gentoo    124
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(x="species", height="count", data=bp1_data, color="steelblue")
ax.set(xlabel="Species", ylabel="Count", title="Penguin Species Counts")
plt.show()

Bar plot of species counts (matplotlib).

Count Plots

In seaborn to avoid the initial aggregation we can use countplot which computes category counts automatically:

ax = sns.countplot(
    x="species", 
    data=penguins
    )
ax.set(
    xlabel="Species", ylabel="Count", title="Penguin Species Counts")
plt.show()

Count plot of species (seaborn).

Subgrouping

To compare across two categorical variables, use the hue parameter to split each bar into subgroups.

ax = sns.countplot(x="island", hue="species", data=penguins)
ax.set(xlabel="Island", ylabel="Count", title="Penguin Species Counts by Island")
ax.legend(title="Species")
plt.show()

Species counts by island with subgrouping by species (seaborn).

Box Plots

A box plot (box-and-whisker plot) displays the distribution of a continuous variable using the five-number summary: minimum, Q1, median, Q3, and maximum. Outliers are shown as individual points.

ax = sns.boxplot(x="species", y="bill_length_mm", data=penguins)
ax.set(
    xlabel="Species",
    ylabel="Bill Length (mm)",
    title="Penguin Bill Length by Species"
)
plt.show()

Box plot of bill length by species (seaborn).

Histograms

A histogram groups continuous data into bins and represents the frequency of observations in each bin as a bar height.

ax = sns.histplot(x="bill_length_mm", data=penguins, bins=20)
ax.set(xlabel="Bill Length (mm)", ylabel="Frequency", title="Histogram of Penguin Bill Length")
plt.show()

Histogram of bill length (seaborn).

Histograms with Kernel Density Estimates

Setting kde=True overlays a kernel density estimate (KDE) — a smooth approximation of the distribution. Combined with hue, we can compare distributions across groups in a single plot.

Note

Histogram shape depends on bin count and KDE smoothness depends on bandwidth. Use these as exploratory summaries rather than definitive descriptions of the distribution.

ax = sns.histplot(
    x="bill_length_mm",
    data=penguins,
    bins=20,
    kde=True,
    hue="species",
    multiple="layer"
)
ax.set(xlabel="Bill Length (mm)", ylabel="Frequency", title="Histogram of Penguin Bill Length")
plt.show()

Histogram with KDE curves by species (seaborn).

Violin Plots

A violin plot combines a box plot with a KDE, providing a richer view of the distribution’s shape, central tendency, and spread.

ax = sns.violinplot(
    x="species", y="bill_length_mm",
    data=penguins,
    inner="quartile",
    color="steelblue"
)
ax.set(
    xlabel="Species",
    ylabel="Bill Length (mm)",
    title="Penguin Bill Length by Species"
)
plt.show()

Violin plot of bill length by species (seaborn).

More Advanced Visualizations

Creative Combinations

Don’t treat plot types as rigid categories — effective visualizations often layer multiple geometries. For example, overlaying individual data points on a box plot reveals both the summary statistics and the raw distribution.

ax = sns.boxplot(
    x="bill_length_mm", y="species",
    data=penguins,
    color="lightsteelblue"
)
sns.stripplot(
    x="bill_length_mm", y="species",
    data=penguins,
    color="black",
    alpha=0.5,
    size=4,
    ax=ax
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Species",
    title="Penguin Bill Length by Species"
)
plt.show()

Box plot with strip plot overlay (seaborn).

Encoding Continuous Variables

We can map a continuous variable such as body mass to point size using the size and sizes parameters.

ax = sns.scatterplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=penguins,
    hue="species",
    style="species",
    size="body_mass_g",
    sizes=(20, 200),
    alpha=0.7
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth Scaled by Body Mass"
)
ax.legend(title="Species")
plt.show()

Scatter plot encoding species (color, shape) and body mass (size) (seaborn).

Multiplotting

Multiple Plots with matplotlib

plt.subplots() creates a figure with multiple axes. We then loop through axes to apply a plot to each subset of data.

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
species = penguins["species"].unique()

for ax, sp in zip(axes, species):
    subset = penguins[penguins["species"] == sp]
    ax.scatter(
        subset["bill_length_mm"], subset["bill_depth_mm"],
        color="steelblue",
        alpha=0.95,
        s=7,
        marker="D"
    )
    ax.set_title(sp)
    ax.set_xlabel("Bill Length (mm)")
    ax.set_ylabel("Bill Depth (mm)")

plt.tight_layout()
plt.show()

Per-species scatter plots using matplotlib subplots.

Facet Grids with seaborn

sns.FacetGrid() produces the same result with far less code. It creates a grid of subplots defined by the unique values of one or more categorical variables; g.map() applies a plotting function to each facet.

g = sns.FacetGrid(penguins, col="species", height=4, aspect=1)
g.map(sns.scatterplot, "bill_length_mm", "bill_depth_mm",
      color="steelblue", alpha=0.95, s=7, marker="D")
g.set_axis_labels("Bill Length (mm)", "Bill Depth (mm)")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle("Penguin Bill Length vs Depth by Species")
plt.show()

Per-species scatter plots using seaborn FacetGrid.

FacetGrid is particularly powerful for EDA because it scales to many groups with no additional code — simply change the col, row, or hue arguments.

Back to top