Lecture 6: Exploratory Data Analysis

PSTAT100: Data Science - Concepts and Analysis

John Inston

johninston@ucsb.edu

University of California, Santa Barbara

May 23, 2026

🚁 Overview

Aims of the lecture

Perform exploratory data analysis (EDA) to understand data properties and relationships.
Use and understand descriptive statistics:
- Location measures (mean, median, mode).
- Spread measures (variance, standard deviation, IQR).
- Shape measures (skewness, kurtosis).
Use data visualizations:
- Histograms, box plots, scatter plots, heatmaps, etc.
- Understand distributions.

📚 Required Libraries

In this lecture we will be using the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

💅 Figure Styles

sns.set_style('whitegrid')
sns.set_palette('Set2')

Exploratory Data Analysis (EDA)

🔍 Exploratory Data Analysis (EDA)

What is EDA?

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, identify patterns, and uncover insights.

EDA Steps

Perform univariate analysis to understand the distribution of individual variables.
Perform multivariate analysis to explore relationships between variables.
Develop insight to inform modelling and deeper analysis.

Tools for EDA

Descriptive statistics (location, spread, shape, dependence).
Data visualization.

EDA: Example Dataset

Example Data Set: `titanic`

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

The titanic dataset contains information about passengers on the Titanic.
It contains both categorical (e.g., sex, class) and numerical (e.g., age, fare) variables.

Univariate Analysis

EDA: Univariate Analysis

What is Univariate Analysis?

Univariate analysis focuses on analyzing and summarizing a single variable.
Aim to understand:
- Location (mean, median, mode).
- Variability (standard deviation, interquartile range).
- Distribution shape (skewness, kurtosis).
- Count and frequency of categorical variables.

Dependence on Variable Type

Approach depends on the variable type:
- Quantitative vs Categorical variables.
- Changes the utility of different visualizations and statistics.

Categorical Variables: Frequency and Proportions

Statistics

For categorical variables, we produce a table summarizing:
- Frequency: The count of each category.
- Proportion: The relative frequency of each category (frequency divided by total count).

Example: Categorical Variable Summary Table

pd.DataFrame({
  'Frequency': titanic['class'].value_counts(),
  'Proportion': titanic['class'].value_counts(normalize=True),
  'Percentage': titanic['class'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
}).sort_index()

	Frequency	Proportion	Percentage
class
First	216	0.242424	24.24%
Second	184	0.206510	20.65%
Third	491	0.551066	55.11%

Categorical Variables: Visualizations

Visualization Options

Our visualization options essentially just display this same information:
- Count Plot: Displays the frequency or proportion of each category.
- Pie Chart: Shows the proportion of each category as a slice of a pie.
  - Not generally recommended for accurate comparison, but can be useful for showing relative proportions.

Example: Count Plot and Pie Chart

colors = sns.color_palette('pastel')[0:3]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
sns.countplot(
  data=titanic, 
  x='class', 
  ax=ax[0]
  )
ax[0].set(title='Count Plot', xlabel='Class', ylabel='Count')
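# Take the labels from the value_counts() index so each slice is
# guaranteed to carry the label that matches its count.
ax[1].pie(
  titanic['class'].value_counts(), 
  labels=titanic['class'].value_counts().index, 
  colors=colors, 
  autopct='%.0f%%')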
ax[1].set_title('Pie Chart of Passenger Class')
plt.tight_layout()
plt.show()

Categorical Variables: Visualizations

Count plot and pie chart for the `class` variable in the `titanic` dataset.

Numerical Variables: Location Measures

Mean

For a numerical variable, suppose we have a sample \(x_1, \ldots, x_n\) of \(n\) observations. The sample mean is computed as: \[\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i.\]

Median

The sample median is the middle value of the sorted data.
- If \(n\) is odd, it is the value at position \(\frac{n+1}{2}\).
- If \(n\) is even, it is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2}+1\).

Mode

The sample mode is the value that appears most frequently in the data.

Example: Location Measures

- Mean: the average of all values.
- Median: the middle value when sorted.
- Mode: the most frequent value.

The mean is sensitive to outliers, while the median is more robust.
- What if we add a value of 100?
The mode can be useful for understanding the most common value, especially in categorical data.
- Why might this stop being useful with float data?
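Both questions above can be explored with a minimal sketch. The sample values and the helper name `location_summary` are illustrative assumptions, not part of the lecture data; the code uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2, 3, 3, 5, 7]
with_outlier = sample + [100]


def location_summary(values):
    """Return the mean, median, and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


base = location_summary(sample)
shifted = location_summary(with_outlier)
print(base)     # mean 4, median 3, mode 3
print(shifted)  # mean 20, median 4.0, mode 3

# With continuous (float) data every value is typically unique,
# so every value ties for "most frequent" and the mode says little.
float_modes = statistics.multimode([0.12, 0.57, 0.91])
print(float_modes)  # all three values are returned
```

Adding the single value 100 moves the mean from 4 to 20, while the median only moves from 3 to 4 and the mode does not move at all.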

Numerical Variables: Spread Measures

Variance and Standard Deviation

The sample variance is computed as: \[s^2 := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.\]
- This can intuitively be thought of as the average squared distance of the data points from the mean.
- We divide by \(n-1\) instead of \(n\) to get an unbiased estimate of the population variance (Bessel’s correction - discussed further next week).
The sample standard deviation is the square root of the variance: \[s := \sqrt{s^2}.\]
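The two formulas can be checked step by step on a small sample. This is a sketch on made-up numbers (the array `x` is an assumption chosen so the arithmetic is easy to follow), compared against NumPy's built-in functions.

```python
import numpy as np

# Hypothetical sample, chosen so the arithmetic is easy to follow.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
n = len(x)

x_bar = x.mean()                     # sample mean
sq_dev = (x - x_bar) ** 2            # squared deviations from the mean
sample_var = sq_dev.sum() / (n - 1)  # divide by n - 1 (Bessel's correction)
sample_std = np.sqrt(sample_var)     # standard deviation is the square root

print(x_bar, sq_dev.sum(), sample_var, sample_std)

# NumPy divides by n unless ddof=1 is passed.
print(np.var(x), np.var(x, ddof=1))
```

Note that `np.var` and `np.std` divide by \(n\) by default, whereas the pandas methods `.var()` and `.std()` divide by \(n-1\) by default, so the two libraries can disagree on the same data unless `ddof` is set explicitly.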

Example: Sample Variance

[Interactive figure: the mean (x̄), the squared deviations (xᵢ − x̄)², their sum, the variance (dividing by n or by n−1), and the standard deviation.]

Min, Max, Range, and Interquartile Range

It is often useful to understand the range of values in the data, as well as the spread of the middle 50% of the data.
The minimum and maximum values provide the range of the data: \[\text{Range} := \max(x_i) - \min(x_i).\]
A quantile is a cut point that divides a sorted dataset into equal-sized groups.
- Percentiles are quantiles that divide the data into 100 equal parts.
- Quartiles are quantiles that divide the data into 4 equal parts.
  - The first quartile (Q1) is the 25th percentile.
  - The second quartile (Q2) is the 50th percentile (the median).
  - The third quartile (Q3) is the 75th percentile.
The interquartile range (IQR) is the difference between the third and first quartiles: \[\text{IQR} := Q3 - Q1.\]
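As a quick illustration, the range, quartiles, and IQR can be computed with `np.percentile`. The data below are made up for the example; real data rarely land this neatly on the quartile positions.

```python
import numpy as np

# Hypothetical sorted sample of nine values.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

data_range = x.max() - x.min()
q1, q2, q3 = np.percentile(x, [25, 50, 75])  # quartiles as percentiles
iqr = q3 - q1

print(f"Range: {data_range}, Q1: {q1}, median: {q2}, Q3: {q3}, IQR: {iqr}")
```

Several conventions exist for interpolating between observations when a quantile falls between two data points. NumPy uses linear interpolation by default, so other software may report slightly different quartiles for the same data.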

Numerical Data Summary and Box Plots

Data Summary

We can use the describe() method in pandas to compute most of the common summary statistics for a numerical variable.
A box plot (or box-and-whisker plot) is a graphical representation of the summary statistics, showing the median, quartiles, and potential outliers.

Example: Data Summary Table

titanic['age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

Example: Box Plot

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=titanic, x='age', ax=ax)
ax.set_title('Box Plot of Passenger Age')
plt.tight_layout()
plt.show()

Numerical Data Summary and Box Plots

Box plot for the `age` variable in the `titanic` dataset.

Histograms

Box plots are useful for summarizing the distribution of a numerical variable.
Histograms are useful for visualizing the shape of the distribution.
- A histogram divides the range of the data into intervals (bins) and counts the number of observations in each bin.
- The choice of bin width can affect the appearance of the histogram and our interpretation of the data.
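The effect of the bin choice can be seen without plotting anything, using `np.histogram` to do the counting. The data below are a small made-up sample, not the Titanic ages.

```python
import numpy as np

# Hypothetical sample with one large value.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Same data, two different numbers of equal-width bins.
counts_2, edges_2 = np.histogram(data, bins=2)
counts_4, edges_4 = np.histogram(data, bins=4)

print(edges_2, counts_2)  # two wide bins hide the shape
print(edges_4, counts_4)  # four bins reveal the peak and the gap
```

With two bins the data look like one large lump and a small remainder; with four bins the peak around 3 to 4 and the isolated value at 9 become visible. Every bin count sums to the same total, so only the apparent shape changes.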

Example: Histogram

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(data=titanic, x='age', bins=20, kde=True, ax=ax)
ax.set_title('Histogram of Passenger Age')
plt.tight_layout()
plt.show()

Histograms

Histogram of the `age` variable in the `titanic` dataset.

Histogram with Summary Statistic Overlays

Histogram of the `age` variable in the `titanic` dataset with summary statistic overlays.

Skewness

Skewness measures the asymmetry of the distribution of a numerical variable.
- A distribution is positively skewed (right-skewed) if it has a long tail on the right side.
- A distribution is negatively skewed (left-skewed) if it has a long tail on the left side.
The skewness can be calculated using the formula: \[\text{Skewness} = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s} \right)^3.\]

Kurtosis

Kurtosis measures the “tailedness” of the distribution of a numerical variable.
- A distribution is leptokurtic if it has heavy tails and a sharp peak.
- A distribution is platykurtic if it has light tails and a flat peak.
- A distribution is mesokurtic if it has tails and a peak similar to the normal distribution.
The kurtosis can be calculated using the formula: \[\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s} \right)^4.\]

Example: Skewness and Kurtosis

Skewness and Kurtosis Calculation

There are built-in functions in libraries like scipy.stats to calculate skewness and kurtosis, but we can also implement these calculations manually using the formulas provided.

def calculate_skewness_kurtosis(data):
    n = len(data)
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)  # Sample standard deviation
    skewness = (1/n) * np.sum(((data - mean) / std_dev) ** 3)
    kurtosis = (1/n) * np.sum(((data - mean) / std_dev) ** 4)
    return skewness, kurtosis

Histogram with Kurtosis and Skewness

skewness, kurtosis = calculate_skewness_kurtosis(titanic['age'].dropna())
print(f'Skewness: {skewness:.2f}, Kurtosis: {kurtosis:.2f}')

Skewness: 0.39, Kurtosis: 3.16

Quick Note on the Normal Distribution

What is the Normal Distribution?

The normal distribution (or Gaussian distribution) is a continuous probability distribution that is symmetric around its mean, with a bell-shaped curve.
- We will cover probability distributions in more detail next week, but it is helpful to have an idea about the normal distribution when discussing skewness and kurtosis.
It is defined by its mean (\(\mu\)) and standard deviation (\(\sigma\)).
Many natural phenomena and measurement errors tend to follow a normal distribution, making it a fundamental concept in statistics.

Properties of the Normal Distribution

The mean, median, and mode of a normal distribution are all equal.
The normal distribution is symmetric around the mean, meaning that it has no skewness.
The normal distribution has a kurtosis of 3 (or excess kurtosis of 0).
- Hence, leptokurtic distributions have kurtosis greater than 3, while platykurtic distributions have kurtosis less than 3.
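These properties can be checked by simulation. The sketch below applies the lecture's skewness and kurtosis formulas to a simulated normal sample and, for contrast, to a simulated exponential sample (a right-skewed, heavy-tailed distribution chosen here purely as an illustrative assumption). The helper name `skewness_kurtosis`, the seed, and the sample sizes are also assumptions.

```python
import numpy as np


def skewness_kurtosis(data):
    """Apply the lecture's formulas: mean of z^3 and mean of z^4."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std(ddof=1)  # standardise with sample std
    return np.mean(z ** 3), np.mean(z ** 4)


rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200_000)
skewed_sample = rng.exponential(scale=1.0, size=200_000)

normal_skew, normal_kurt = skewness_kurtosis(normal_sample)
exp_skew, exp_kurt = skewness_kurtosis(skewed_sample)

# Normal: skewness close to 0 and kurtosis close to 3.
print(f"normal:      skewness {normal_skew:.2f}, kurtosis {normal_kurt:.2f}")
# Exponential: clearly positive skewness and kurtosis well above 3.
print(f"exponential: skewness {exp_skew:.2f}, kurtosis {exp_kurt:.2f}")
```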

Bell Curve

Bell Curve Example

Multivariate Analysis

Cross Tabulation

What are our aims?

Find and measure relationships between variables.

Categorical Cross Tabulation

A cross tabulation displays the frequency distribution of two or more categorical variables.
It examines the relationship between the variables by showing how the categories of one variable are distributed across the categories of another variable.

pd.crosstab(
  titanic['class'], titanic['survived'], 
  margins=True, normalize='index')

survived	0	1
class
First	0.370370	0.629630
Second	0.527174	0.472826
Third	0.757637	0.242363
All	0.616162	0.383838
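In the table above, `normalize='index'` divides each row by its row total, so every row sums to 1 and can be read as the survival proportions within that class. The sketch below shows the difference between raw counts and row proportions on a tiny made-up data frame (the column names `group` and `outcome` are hypothetical).

```python
import pandas as pd

# Hypothetical toy data, not the Titanic data set.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "outcome": [0, 1, 1, 0, 0, 0, 0, 1],
})

# Raw counts for each (group, outcome) combination.
counts = pd.crosstab(toy["group"], toy["outcome"])

# Row proportions: each row is divided by its own total.
row_props = pd.crosstab(toy["group"], toy["outcome"], normalize="index")

print(counts)
print(row_props)
```

Row proportions make groups of different sizes comparable: group `a` has 3 observations and group `b` has 5, yet the proportions can be compared directly.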

Tabulation of Categorical and Quantitative Variables

What about when one variable is quantitative?

We can use grouped summary statistics to examine the relationship between a categorical variable and a quantitative variable.
- For example, we can compute the mean age of passengers in each class in the Titanic dataset.

Mixed Type Tabulation Example

titanic.groupby('class')['age'].describe()

	count	mean	std	min	25%	50%	75%	max
class
First	186.0	38.233441	14.802856	0.92	27.0	37.0	49.0	80.0
Second	173.0	29.877630	14.001077	0.67	23.0	29.0	36.0	70.0
Third	355.0	25.140620	12.495398	0.42	18.0	24.0	32.0	74.0

Covariance

Multiple Quantitative Variables

For two quantitative variables we are primarily interested in the sample covariance and correlation.

Sample Covariance

The sample covariance between two variables \(X\) and \(Y\) is calculated as: \[s_{XY} := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).\]
- This can be thought of as the average product of the deviations of the two variables from their respective means.
- A positive covariance indicates that the variables tend to increase together.
- A negative covariance indicates that one variable tends to increase when the other decreases.
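The covariance formula can be verified by hand on a small example. The arrays `x` and `y` are made-up values; the result is compared with `np.cov`, which also divides by \(n-1\) by default.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n = len(x)

# Sum of products of deviations, divided by n - 1.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(s_xy)  # 1.5

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is s_XY.
s_xy_numpy = np.cov(x, y)[0, 1]

# Covariance depends on units: rescaling x by 100 rescales s_XY by 100.
s_xy_rescaled = np.cov(100 * x, y)[0, 1]
print(s_xy_numpy, s_xy_rescaled)
```

Because the covariance changes with the units of measurement, its magnitude is hard to interpret on its own; only the sign is immediately informative.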

Correlation

Sample Correlation

The sample correlation between two variables \(X\) and \(Y\) is calculated as: \[r_{XY} := \frac{s_{XY}}{s_X s_Y},\] where \(s_X\) and \(s_Y\) are the sample standard deviations of \(X\) and \(Y\), respectively.
- The correlation coefficient ranges from -1 to 1.
- Values close to 1 indicate a strong positive linear relationship.
- Values close to -1 indicate a strong negative linear relationship.
- Values close to 0 indicate no linear relationship.
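A short sketch of the correlation formula on made-up data, checked against `np.corrcoef`. The arrays are illustrative assumptions.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

s_xy = np.cov(x, y)[0, 1]   # sample covariance (divides by n - 1)
s_x = np.std(x, ddof=1)     # sample standard deviations
s_y = np.std(y, ddof=1)
r_xy = s_xy / (s_x * s_y)   # sample correlation

r_numpy = np.corrcoef(x, y)[0, 1]

# Unlike covariance, correlation is unchanged when x is rescaled.
r_rescaled = np.corrcoef(100 * x, y)[0, 1]

print(f"r = {r_xy:.4f}, numpy = {r_numpy:.4f}, rescaled = {r_rescaled:.4f}")
```

Dividing by the standard deviations removes the units, which is why the correlation always lies between -1 and 1 and can be compared across pairs of variables.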

Correlation Visualization Example

Warnings about Correlation

Correlation is not causation!

Correlation does not imply causation!
- Just because two variables move together does not mean that one variable causes the other to move.
- For example, there is a strong correlation between ice cream sales and crime rates, but this does not mean that ice cream sales cause crime or vice versa; a third factor, such as hot weather, may plausibly drive both.

Correlation measures linear relationships!

Correlation measures only linear relationships!
- Just because two variables have a correlation of 0 does not mean that they are independent.

Non-Linear Relationships

Look at the following scatter plot:

Quadratic relationship with a correlation of nearly 0.

The correlation is nearly 0, but there is clearly a strong relationship.
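One way such a situation can arise is sketched below. This is an assumed construction, not the data behind the lecture's figure: `y` is an exact function of `x`, yet the correlation is essentially zero because the relationship is symmetric about \(x = 0\).

```python
import numpy as np

# x is symmetric around 0 and y is completely determined by x.
x = np.linspace(-3, 3, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {r:.6f}")

# Restricting to x >= 0 makes the relationship monotone,
# and the correlation becomes strongly positive.
mask = x >= 0
r_right_half = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"Correlation for x >= 0: {r_right_half:.3f}")
```

The increasing and decreasing halves of the curve cancel out in the covariance, which is why plotting the data remains essential even when a correlation has been computed.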

Covariance and Correlation Matrices

The covariance matrix of a set of variables is a square matrix that contains the variances and covariances of the variables.
The correlation matrix of a set of variables is a square matrix that contains the correlations of the variables.
- Both are symmetric and positive semi-definite (which means all eigenvalues are non-negative).
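These properties can be checked numerically. The sketch below uses simulated data (the column names `a`, `b`, `c`, the seed, and the dependence between `a` and `c` are all illustrative assumptions) and also shows that the correlation matrix is the covariance matrix rescaled by the standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical simulated data with three columns.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["c"] = df["a"] + 0.5 * df["c"]  # make column c depend on column a

cov = df.cov()
corr = df.corr()

# Symmetric matrices have real eigenvalues; eigvalsh exploits the symmetry.
cov_eigs = np.linalg.eigvalsh(cov.to_numpy())
corr_eigs = np.linalg.eigvalsh(corr.to_numpy())
print(cov_eigs, corr_eigs)  # all non-negative, up to rounding error

# Dividing entry (i, j) of the covariance matrix by s_i * s_j
# gives the correlation matrix.
std = np.sqrt(np.diag(cov))
corr_from_cov = cov.to_numpy() / np.outer(std, std)
```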

Covariance and Correlation Matrices Example

titanic[['age', 'fare']].cov()

	age	fare
age	211.019125	73.849030
fare	73.849030	2469.436846

titanic[['age', 'fare']].corr()

	age	fare
age	1.000000	0.096067
fare	0.096067	1.000000

Checking our Matrices

Let’s look at the scatter plot

fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(data=titanic, x='age', y='fare', ax=ax)
ax.set(
  title='Scatter Plot of Age and Fare',
  xlabel='Age',
  ylabel='Fare'
)
plt.tight_layout()
plt.show()

Checking our Matrices

Exploratory Analysis Example

Example Data: Wine Dataset

We consider the wine data set provided by Geeks for Geeks in their note on Exploratory Data Analysis in Python.

wine = pd.read_csv('data/WineQT.csv')
wine.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	Id
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5	0
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5	1
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5	2
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6	3
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5	4

The shape of the wine dataset is:

wine.shape

(1143, 13)

Summary Statistics

We can compute the summary statistics for the wine dataset using the describe method.

wine.describe()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	Id
count	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000	1143.000000
mean	8.311111	0.531339	0.268364	2.532152	0.086933	15.615486	45.914698	0.996730	3.311015	0.657708	10.442111	5.657043	804.969379
std	1.747595	0.179633	0.196686	1.355917	0.047267	10.250486	32.782130	0.001925	0.156664	0.170399	1.082196	0.805824	463.997116
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000	0.000000
25%	7.100000	0.392500	0.090000	1.900000	0.070000	7.000000	21.000000	0.995570	3.205000	0.550000	9.500000	5.000000	411.000000
50%	7.900000	0.520000	0.250000	2.200000	0.079000	13.000000	37.000000	0.996680	3.310000	0.620000	10.200000	6.000000	794.000000
75%	9.100000	0.640000	0.420000	2.600000	0.090000	21.000000	61.000000	0.997845	3.400000	0.730000	11.100000	6.000000	1209.500000
max	15.900000	1.580000	1.000000	15.500000	0.611000	68.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000	1597.000000

Univariate Analysis

Histograms to inspect the distribution of the variables

# Draw and save one histogram per column (assumes the assets/ folder exists).
for feature in wine.columns:
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.histplot(wine[feature], kde=True, ax=ax)
    ax.set_title(f"{feature} | Skewness: {wine[feature].skew():.2f}")
    fig.tight_layout()  # tidy each figure before saving it
    fig.savefig(f"assets/wine_{feature}.png")

plt.show()

Histogram Plots


Which variables are skewed?

Skewness and Kurtosis

Skewness

wine.skew()

fixed acidity           1.044930
volatile acidity        0.681547
citric acid             0.371561
residual sugar          4.361096
chlorides               6.026360
free sulfur dioxide     1.231261
total sulfur dioxide    1.665766
density                 0.102395
pH                      0.221138
sulphates               2.497266
alcohol                 0.863313
quality                 0.286792
Id                     -0.010419
dtype: float64

Kurtosis

wine.kurtosis()

fixed acidity            1.384614
volatile acidity         1.375531
citric acid             -0.714686
residual sugar          27.675366
chlorides               47.078324
free sulfur dioxide      1.932170
total sulfur dioxide     5.098748
density                  0.888123
pH                       0.925791
sulphates               12.017377
alcohol                  0.221179
quality                  0.314664
Id                      -1.216364
dtype: float64
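When reading the values above, note that pandas documents `kurtosis()` as using Fisher's definition, under which a normal distribution has kurtosis 0. The output is therefore excess kurtosis and should be compared with 0, not with 3. A minimal sketch on simulated normal data (the seed and sample size are assumptions) compares the lecture's formula with the pandas method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
s = pd.Series(rng.normal(size=100_000))

z = (s - s.mean()) / s.std()           # pandas std divides by n - 1 by default
raw_kurtosis = float(np.mean(z ** 4))  # lecture formula: about 3 for normal data
excess_kurtosis = s.kurtosis()         # pandas: about 0 for normal data

print(f"Lecture formula: {raw_kurtosis:.3f}")
print(f"pandas kurtosis: {excess_kurtosis:.3f}")
```

On this reading, `citric acid` and `Id` (negative values) are platykurtic, while variables such as `chlorides` and `residual sugar` are strongly leptokurtic.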

Bivariate Analysis

Pairplots

sns.pairplot(wine.iloc[:, :2])

Quality vs Alcohol

Box Plots

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(x='quality', y='alcohol', data=wine, ax=ax)
ax.set(
  title='Box Plot of Quality and Alcohol',
  xlabel='Quality',
  ylabel='Alcohol'
)
plt.tight_layout()
plt.show()

Quality vs Alcohol

Multivariate Analysis

Heatmap

plt.figure(figsize=(7, 7))

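# A diverging colormap centred on 0 keeps positive and negative correlations distinguishable.
sns.heatmap(wine.corr(), annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, linewidths=2)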
plt.title('Correlation Heatmap')
plt.show()

Multivariate Analysis

Conclusion

✅ What we covered

Exploratory Data Analysis (EDA):
- Univariate analysis.
- Bivariate analysis.
- Multivariate analysis.
- Correlation and covariance.
- Skewness and kurtosis.
- Cross tabulation.
- Grouped summary statistics.
- Mixed type tabulation.

📅 What’s next?

Statistical Foundations.
Probability and Distributions.
