PSTAT100: Data Science - Concepts and Analysis
May 6, 2026
matplotlib and seaborn libraries for creating visualizations in Python.

Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, identify patterns, and uncover insights.
matplotlib, seaborn, altair, and plotly.

Always ask yourself: what is this visualization for? What is it contributing?

matplotlib
seaborn is built on top of it.

seaborn
ggplot2 in R.
matplotlib, meaning:

altair
matplotlib.
seaborn for EDA visualizations.
seaborn visualizations.
altair.

penguins

penguins dataset loaded from the seaborn library.
The dataset has shape (344, 7): 344 rows and 7 columns.
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
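
Row 3 of the preview is missing every measurement, so it is worth counting missing values before plotting. The sketch below retypes the five preview rows by hand so that it runs offline. The names `preview` and `complete` are placeholders introduced here. In practice the full data would presumably come from `sns.load_dataset("penguins")`, which downloads the file on first use.

```python
import numpy as np
import pandas as pd

# First five rows, copied by hand from the preview table above
preview = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["Male", "Female", "Female", np.nan, "Female"],
})

print(preview.shape)         # (rows, columns) of the preview
print(preview.isna().sum())  # Number of missing values per column
complete = preview.dropna()  # Keep only fully observed rows
```

Most plotting functions skip missing points without comment, so checking the counts first avoids surprises about how many observations a figure actually shows.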
matplotlib
seaborn and altair in this course.
matplotlib.

matplotlib Scatter Plots

A scatter plot is built in matplotlib as follows:

Create the figure and axis objects with plt.subplots().
Add the plot to the axis (e.g., ax.scatter()).
Set the labels and title with ax.set().
Show the plot with plt.show().

import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins")  # Setup assumed by the code that follows

fig, ax = plt.subplots(figsize=(8, 6))  # Create figure and axis objects
ax.scatter(                             # Create scatter plot
    "bill_length_mm", "bill_depth_mm",
    data=penguins,
    color="steelblue",                  # Set point color
    alpha=0.95,                         # Set point transparency
    s=7,                                # Set point size
    marker="D"                          # Set point type
)
ax.set(                                 # Set labels and title
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth"
)
plt.show()                              # Show the plot
[Figure: scatter plot (matplotlib)]

seaborn Scatter Plots

What makes seaborn so great?
seaborn is built on top of matplotlib but provides a higher-level interface for creating more complex and informative visualizations with less code.

[Figure: scatter plot (seaborn)]
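
As a rough illustration of the "less code" point, the sketch below draws a scatter plot colored and shaped by species in a single `sns.scatterplot()` call. The `toy` frame is a small hand-typed stand-in that reuses the penguins column names with illustrative values, not the real data. With the course data one would pass `data=penguins` instead.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for penguins: same column names, illustrative values
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 36.7, 46.5, 50.0, 46.1, 48.7],
    "bill_depth_mm": [18.7, 19.3, 17.9, 19.5, 13.2, 14.1],
})

ax = sns.scatterplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=toy,
    hue="species",    # Color by species
    style="species"   # Marker shape by species
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
    title="Penguin Bill Length vs Depth"
)
plt.show()
```

The grouping, color assignment, and legend all come from the `hue` and `style` arguments. In matplotlib the same result would need a loop over species.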
[Figure (seaborn)]

With seaborn, we can produce a scatter plot with a best fit line using the lmplot() function.

For now, treat best fit lines as descriptive trend summaries; we will cover regression assumptions, estimation, and interpretation in detail in future weeks.

seaborn Regression Lines

g = sns.lmplot(                  # Scatter with regression line
    x="bill_length_mm", y="bill_depth_mm",
    data=penguins,
    hue="species",
    markers=["o", "s", "^"],     # Set point types
    line_kws={"linewidth": 2},   # Set line width
    height=5, aspect=1.8         # Change figure size
)
g.set_axis_labels("Bill Length (mm)", "Bill Depth (mm)")
g.fig.subplots_adjust(top=0.9)   # Leave room for the overall title
g.fig.suptitle("Penguin Bill Length vs Depth")
plt.show()
[Figure: scatter plot with regression lines (seaborn)]

Bar plots in matplotlib are driven by the summary data frame we provide. In this case, we compute species counts in advance.

matplotlib Bar Plots
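
One way to compute the counts in advance is a `groupby` followed by `size()`, after which `ax.bar()` draws one bar per row of the summary. This is a sketch under stated assumptions, not necessarily the code used in lecture: `toy` and `summary` are names introduced here, and `toy` holds made-up species labels in place of the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for penguins: only the species column is needed here
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"]
})

# Summary data frame: one row per species with its count
summary = toy.groupby("species").size().reset_index(name="count")

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(summary["species"], summary["count"], color="steelblue")
ax.set(xlabel="Species", ylabel="Count", title="Penguin Count by Species")
plt.show()
```

The bars reflect whatever is in `summary`, so any filtering or reordering has to happen in the data frame before the call to `ax.bar()`.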
[Figure: bar plot (matplotlib)]

seaborn Count Plots

Although seaborn has an equivalent barplot function, a more useful plot type in this case is a countplot.

[Figure: count plot (seaborn)]
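
A minimal sketch of a count plot, assuming the raw (unsummarized) data are passed directly: `sns.countplot()` tallies the rows per category itself, so no counts need to be computed in advance. The `toy` frame is a made-up stand-in for the penguins data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for penguins: only the species column is needed here
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"]
})

ax = sns.countplot(
    x="species",
    data=toy,
    color="steelblue"  # Single bar color
)
ax.set(xlabel="Species", ylabel="Count", title="Penguin Count by Species")
plt.show()
```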
[Figure (seaborn)]

With seaborn, we can produce a box plot using the boxplot() function.

[Figure: box plot (seaborn)]

With seaborn, we can produce a histogram using the histplot() function.

[Figure: histogram (seaborn)]

seaborn Histograms with KDEs

ax = sns.histplot(
    x="bill_length_mm",
    data=penguins,
    bins=20,           # Set number of bins
    kde=True,          # Show kernel density estimate
    hue="species",     # Color by species
    multiple="layer"   # Overlay histograms
)
ax.set(xlabel="Bill Length (mm)", ylabel="Frequency", title="Histogram of Penguin Bill Length")
plt.show()
[Figure: histogram with KDE (seaborn)]

With seaborn, we can produce a violin plot using the violinplot() function.
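
A minimal sketch of such a call, laid out horizontally with one violin per species. The `toy` frame is a hand-typed stand-in with illustrative bill lengths, not the real measurements; with the course data one would pass `data=penguins`.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for penguins: four illustrative bill lengths per species
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Chinstrap"] * 4 + ["Gentoo"] * 4,
    "bill_length_mm": [36.7, 39.1, 39.5, 40.3,
                       46.5, 48.0, 50.0, 51.3,
                       45.1, 46.1, 47.5, 49.0],
})

ax = sns.violinplot(
    x="bill_length_mm", y="species",
    data=toy,
    color="lightsteelblue"  # Single fill color for all violins
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Species",
    title="Penguin Bill Length by Species"
)
plt.show()
```

Each violin is a mirrored kernel density estimate, so with only a handful of points per group, as here, its shape should not be over-interpreted.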
[Figure: violin plot (seaborn)]

ax = sns.boxplot(
    x="bill_length_mm", y="species",
    data=penguins,
    color="lightsteelblue"
)
sns.stripplot(
    x="bill_length_mm", y="species",
    data=penguins,
    color="black",   # Use a single point color for readability
    alpha=0.5,       # Set point transparency
    size=4,          # Set point size
    ax=ax            # Draw on the same axes as the box plot
)
ax.set(
    xlabel="Bill Length (mm)",
    ylabel="Species",
    title="Penguin Bill Length by Species"
)
plt.show()
[Figure: box plot with overlaid points (seaborn)]

We can use the size and sizes parameters to control the size of points based on body mass.

ax = sns.scatterplot(
    x="bill_length_mm", y="bill_depth_mm",
    data=penguins,
    hue="species",        # Color by species
    style="species",      # Shape by species
    size="body_mass_g",   # Scale by body mass
    sizes=(20, 200),      # Set size range
    alpha=0.7             # Set point transparency
)
ax.set(xlabel="Bill Length (mm)", ylabel="Bill Depth (mm)", title="Penguin Bill Length vs Depth Scaled by Body Mass")
ax.legend(title="Species")
plt.show()
[Figure: scatter plot scaled by body mass (seaborn)]

With matplotlib, we can use the subplots() function to create multiple plots in a single figure.
With seaborn, we can use the FacetGrid() function to create a grid of plots based on the values of one or more categorical variables.
The advantage of seaborn becomes apparent as we can create complex multiplots with very little code.

matplotlib

plt.subplots().

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))  # Create figure and axes objects
species = penguins["species"].unique()            # Get unique species
for ax, sp in zip(axes, species):                 # Loop through axes and species
    subset = penguins[penguins["species"] == sp]  # Subset data for each species
    ax.scatter(                                   # Create scatter plot
        subset["bill_length_mm"], subset["bill_depth_mm"],
        color="steelblue",
        alpha=0.95,
        s=7,
        marker="D"
    )
    ax.set_title(sp)
    ax.set_xlabel("Bill Length (mm)")
    ax.set_ylabel("Bill Depth (mm)")
plt.tight_layout()                                # Adjust layout
plt.show()
[Figure: scatter plots by species (matplotlib)]

seaborn

We can create the same multiplot using seaborn's FacetGrid() function with much less code.
The map() method is used to apply a plotting function (e.g., sns.scatterplot) to each subset of the data defined by the grid.

g = sns.FacetGrid(  # Create facet grid
    penguins,
    col="species",
    height=4,
    aspect=1
)
g.map(              # Map scatter plot to each facet
    sns.scatterplot,
    "bill_length_mm", "bill_depth_mm",
    color="steelblue",
    alpha=0.95,
    s=7,
    marker="D"
)
g.set_axis_labels(  # Set axis labels
    "Bill Length (mm)",
    "Bill Depth (mm)"
)
g.fig.subplots_adjust(top=0.8)                             # Adjust subplot spacing
g.fig.suptitle("Penguin Bill Length vs Depth by Species")  # Set overall title
plt.show()                                                 # Show the plot

[Figure: scatter plots by species (seaborn)]

matplotlib and seaborn.
seaborn visualizations.
altair.