Lecture 6: Exploratory Data Analysis

PSTAT100: Data Science - Concepts and Analysis

John Inston

University of California, Santa Barbara

May 6, 2026

🚁 Overview

Aims of the lecture

  • Perform exploratory data analysis (EDA) to understand data properties and relationships.
  • Use and understand descriptive statistics:
    • Location measures (mean, median, mode).
    • Spread measures (variance, standard deviation, IQR).
    • Shape measures (skewness, kurtosis).
  • Use data visualizations:
    • Histograms, box plots, scatter plots, heatmaps, etc.
    • Understand distributions.

📚 Required Libraries

In this lecture we will be using the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

💅 Figure Styles

sns.set_style('whitegrid')
sns.set_palette('Set2')

Exploratory Data Analysis (EDA)

🔍 Exploratory Data Analysis (EDA)

What is EDA?

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, identify patterns, and uncover insights.

EDA Steps

  1. Perform univariate analysis to understand the distribution of individual variables.
  2. Perform multivariate analysis to explore relationships between variables.
  3. Develop insight to inform modelling and deeper analysis.

Tools for EDA

  • Descriptive statistics (location, spread, shape, dependence).
  • Data visualization.

EDA: Example Dataset

Example Data Set: titanic

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
  • The titanic dataset contains information about passengers on the Titanic.
  • It includes both categorical (e.g., sex, class) and numerical (e.g., age, fare) variables.

Univariate Analysis

EDA: Univariate Analysis

What is Univariate Analysis?

  • Univariate analysis focuses on analyzing and summarizing a single variable.
  • Aim to understand:
    • Location (mean, median, mode).
    • Variability (standard deviation, interquartile range).
    • Distribution shape (skewness, kurtosis).
    • Count and frequency of categorical variables.

Dependence on Variable Type

  • Approach depends on the variable type:
    • Quantitative vs Categorical variables.
    • Changes the utility of different visualizations and statistics.

Categorical Variables: Frequency and Proportions

Statistics

  • For categorical variables, we produce a table summarizing:
    • Frequency: The count of each category.
    • Proportion: The relative frequency of each category (frequency divided by total count).

Example: Categorical Variable Summary Table

pd.DataFrame({
  'Frequency': titanic['class'].value_counts(),
  'Proportion': titanic['class'].value_counts(normalize=True),
  'Percentage': titanic['class'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
}).sort_index()
Frequency Proportion Percentage
class
First 216 0.242424 24.24%
Second 184 0.206510 20.65%
Third 491 0.551066 55.11%

Categorical Variables: Visualizations

Visualization Options

  • Our visualization options essentially just display this same information:
    • Count Plot: Displays the frequency or proportion of each category.
    • Pie Chart: Shows the proportion of each category as a slice of a pie.
      • Not generally recommended for accurate comparison, but can be useful for showing relative proportions.

Example: Count Plot and Pie Chart

colors = sns.color_palette('pastel')[0:3]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
sns.countplot(
  data=titanic, 
  x='class', 
  ax=ax[0]
  )
ax[0].set(title='Count Plot', xlabel='Class', ylabel='Count')
class_counts = titanic['class'].value_counts()
ax[1].pie(
  class_counts, 
  labels=class_counts.index,  # labels come from the counts, so each slice matches its label
  colors=colors, 
  autopct='%.0f%%')
ax[1].set_title('Pie Chart of Passenger Class')
plt.tight_layout()
plt.show()

Categorical Variables: Visualizations

Count plot and pie chart for the class variable in the titanic dataset.

Numerical Variables: Location Measures

Mean

  • For a numerical variable we have a sample \(x_1, \ldots, x_n\) of \(n\) observations. The sample mean is computed as: \[\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i.\]

Median

  • The sample median is the middle value of the sorted data.
    • If \(n\) is odd, it is the value at position \(\frac{n+1}{2}\).
    • If \(n\) is even, it is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2}+1\).

Mode

  • The sample mode is the value that appears most frequently in the data.

Example: Location Measures

  • Mean: average of all values.
  • Median: middle value when sorted.
  • Mode: most frequent value.
  • The mean is sensitive to outliers, while the median is more robust.
    • What if we add a value of 100?
  • The mode can be useful for understanding the most common value, especially in categorical data.
    • Why might this stop being useful with float data?
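
Both questions above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module on a made-up sample (the numbers are illustrative assumptions, not taken from the titanic data):

```python
import statistics

# Hypothetical toy sample, chosen only for illustration.
sample = [2, 3, 3, 5, 7]

mean_before = statistics.mean(sample)      # (2 + 3 + 3 + 5 + 7) / 5
median_before = statistics.median(sample)  # middle of the sorted values
mode_before = statistics.mode(sample)      # 3 appears twice

# Add a single extreme value and recompute.
with_outlier = sample + [100]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # average of the two middle values

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")

# With float data almost every value is unique, so every count is 1
# and the mode says little about where the data are concentrated.
floats = [0.42, 22.0, 38.5, 26.1, 35.3]
counts = {value: floats.count(value) for value in floats}
largest_count = max(counts.values())
print(f"largest count among floats: {largest_count}")
```

The mean jumps from 4 to 20 when the value 100 is added, while the median only moves from 3 to 4.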

Numerical Variables: Spread Measures

Variance and Standard Deviation

  • The sample variance is computed as: \[s^2 := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.\]

    • This can intuitively be thought of as the average squared distance of the data points from the mean.
    • We divide by \(n-1\) instead of \(n\) to get an unbiased estimate of the population variance (Bessel’s correction - discussed further next week).
  • The sample standard deviation is the square root of the variance: \[s := \sqrt{s^2}.\]
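
As a rough illustration of these formulas, the sketch below computes the sample variance step by step on a made-up array and compares it with NumPy. The data values are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
n = x.size

x_bar = x.mean()
sum_sq_dev = np.sum((x - x_bar) ** 2)  # sum of squared deviations from the mean

sample_var = sum_sq_dev / (n - 1)      # divides by n - 1 (Bessel's correction)
population_var = sum_sq_dev / n        # divides by n, shown for comparison
sample_std = np.sqrt(sample_var)

# np.var and np.std divide by n unless ddof=1 is passed;
# the pandas .var() and .std() methods use ddof=1 by default.
numpy_sample_var = np.var(x, ddof=1)
numpy_population_var = np.var(x)

print(f"sample variance:     {sample_var:.4f}")
print(f"population variance: {population_var:.4f}")
print(f"sample std:          {sample_std:.4f}")
```

The difference between the two divisors matters most for small \(n\); it shrinks as the sample grows.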

Example: Sample Variance

Interactive example panels: squared deviations (xᵢ − x̄)², mean (x̄), sum of squared deviations, variance (÷n), standard deviation.

Min, Max, Range, and Interquartile Range

  • It is often useful to understand the range of values in the data, as well as the spread of the middle 50% of the data.

  • The minimum and maximum values provide the range of the data: \[\text{Range} := \max(x_i) - \min(x_i).\]

  • A quantile is a cut point that divides a sorted dataset into equal-sized groups.

    • Percentiles are quantiles that divide the data into 100 equal parts.
    • Quartiles are quantiles that divide the data into 4 equal parts.
      • The first quartile (Q1) is the 25th percentile.
      • The second quartile (Q2) is the 50th percentile (the median).
      • The third quartile (Q3) is the 75th percentile.
  • The interquartile range (IQR) is the difference between the third and first quartiles: \[\text{IQR} := Q3 - Q1.\]
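
A minimal sketch of these quantities using NumPy on the integers 1 to 9 (an illustrative assumption, not lecture data). Several conventions exist for computing quantiles between data points; `np.percentile` interpolates linearly by default, so other software may give slightly different quartiles.

```python
import numpy as np

# Hypothetical data: the integers 1 to 9.
x = np.arange(1, 10)

data_range = x.max() - x.min()

# 25th, 50th and 75th percentiles (linear interpolation by default).
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```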

Numerical Data Summary and Box Plots

Data Summary

  • We can use the describe() method in pandas to compute most of the common summary statistics for a numerical variable.
  • A box plot (or box-and-whisker plot) is a graphical representation of the summary statistics, showing the median, quartiles, and potential outliers.
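
The "potential outliers" in a box plot are usually determined by a rule based on the IQR. By default, matplotlib and seaborn draw the whiskers to the most extreme points within 1.5 × IQR of the box and plot anything beyond as individual points. This rule is not stated in the lecture itself, so treat the sketch below as an assumption about the plotting defaults, applied to made-up data:

```python
import pandas as pd

# Hypothetical data with one unusually large value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

summary = values.describe()
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences would be drawn individually on the box plot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"flagged values: {outliers.tolist()}")
```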

Example: Data Summary Table

titanic['age'].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

Example: Box Plot

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=titanic, x='age', ax=ax)
ax.set_title('Box Plot of Passenger Age')
plt.tight_layout()
plt.show()

Numerical Data Summary and Box Plots

Box plot for the age variable in the titanic dataset.

Histograms

Histograms

  • Box plots are useful for summarizing the distribution of a numerical variable.
  • Histograms are useful for visualizing the shape of the distribution.
    • A histogram divides the range of the data into intervals (bins) and counts the number of observations in each bin.
    • The choice of bin width can affect the appearance of the histogram and our interpretation of the data.
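
To see how the number of bins changes what a histogram reports, the sketch below bins the same synthetic sample three ways with `np.histogram`. The simulated "ages" are an assumption for illustration and are not the titanic data.

```python
import numpy as np

# Synthetic sample standing in for a numerical variable.
rng = np.random.default_rng(0)
ages = rng.normal(loc=30, scale=14, size=500)

results = {}
for bins in (5, 20, 50):
    # counts[i] is the number of observations between edges[i] and edges[i + 1].
    counts, edges = np.histogram(ages, bins=bins)
    results[bins] = (counts, edges)
    bin_width = edges[1] - edges[0]
    print(f"bins={bins:2d}  width={bin_width:5.2f}  tallest bin={counts.max()}")
```

Few bins smooth away detail, while many bins produce a noisy outline; every choice still accounts for all 500 observations.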

Example: Histogram

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(data=titanic, x='age', bins=20, kde=True, ax=ax)
ax.set_title('Histogram of Passenger Age')
plt.tight_layout()
plt.show()

Histograms

Histogram of the age variable in the titanic dataset.

Histogram with Summary Statistic Overlays

Histogram of the age variable in the titanic dataset with summary statistic overlays.

Skewness

Skewness

  • Skewness measures the asymmetry of the distribution of a numerical variable.
    • A distribution is positively skewed (right-skewed) if it has a long tail on the right side.
    • A distribution is negatively skewed (left-skewed) if it has a long tail on the left side.
  • The skewness can be calculated using the formula: \[\text{Skewness} = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s} \right)^3.\]

Kurtosis

  • Kurtosis measures the “tailedness” of the distribution of a numerical variable.
    • A distribution is leptokurtic if it has heavy tails and a sharp peak.
    • A distribution is platykurtic if it has light tails and a flat peak.
    • A distribution is mesokurtic if it has tails and a peak similar to the normal distribution.
  • The kurtosis can be calculated using the formula: \[\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s} \right)^4.\]

Example: Skewness and Kurtosis

Skewness and Kurtosis Calculation

  • There are built-in functions in libraries like scipy.stats to calculate skewness and kurtosis, but we can also implement these calculations manually using the formulas provided.
def calculate_skewness_kurtosis(data):
    n = len(data)
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)  # Sample standard deviation
    skewness = (1/n) * np.sum(((data - mean) / std_dev) ** 3)
    kurtosis = (1/n) * np.sum(((data - mean) / std_dev) ** 4)
    return skewness, kurtosis

Histogram with Kurtosis and Skewness

Histogram of the age variable in the titanic dataset with summary statistic overlays.
skewness, kurtosis = calculate_skewness_kurtosis(titanic['age'].dropna())
print(f'Skewness: {skewness:.2f}, Kurtosis: {kurtosis:.2f}')
Skewness: 0.39, Kurtosis: 3.16

Quick Note on the Normal Distribution

What is the Normal Distribution?

  • The normal distribution (or Gaussian distribution) is a continuous probability distribution that is symmetric around its mean, with a bell-shaped curve.
    • We will cover probability distributions in more detail next week, but it is helpful to have an idea about the normal distribution when discussing skewness and kurtosis.
  • It is defined by its mean (\(\mu\)) and standard deviation (\(\sigma\)).
  • Many natural phenomena and measurement errors tend to follow a normal distribution, making it a fundamental concept in statistics.

Properties of the Normal Distribution

  • The mean, median, and mode of a normal distribution are all equal.
  • The normal distribution is symmetric around the mean, meaning that it has no skewness.
  • The normal distribution has a kurtosis of 3 (or excess kurtosis of 0).
    • Hence, leptokurtic distributions have kurtosis greater than 3, while platykurtic distributions have kurtosis less than 3.
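
These properties can be checked approximately on simulated data. The sketch below draws from a standard normal distribution and applies the lecture's moment formulas; the helper name `moment_ratio` is hypothetical. It also shows that pandas' `kurtosis()` method reports excess kurtosis (using a bias-adjusted estimator), so its values should be compared with 0 rather than 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)  # simulated standard normal sample

def moment_ratio(data, power):
    """Return (1/n) * sum(((x - mean) / s) ** power), with s the sample std."""
    standardized = (data - data.mean()) / data.std(ddof=1)
    return np.mean(standardized ** power)

skewness = moment_ratio(z, 3)        # close to 0 for a symmetric distribution
kurtosis = moment_ratio(z, 4)        # close to 3 for a normal distribution
excess_kurtosis = kurtosis - 3       # close to 0

# pandas reports excess kurtosis, so a normal sample gives a value near 0.
pandas_kurtosis = pd.Series(z).kurtosis()

print(f"skewness:         {skewness:.3f}")
print(f"kurtosis:         {kurtosis:.3f}")
print(f"excess kurtosis:  {excess_kurtosis:.3f}")
print(f"pandas kurtosis:  {pandas_kurtosis:.3f}")
```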

Bell Curve

Bell Curve Example

Bell curve example with different means and variances.

Multivariate Analysis

Cross Tabulation

What are our aims?

  • Find and measure relationships between variables.

Categorical Cross Tabulation

  • A cross tabulation displays the frequency distribution of two or more categorical variables.
  • Examines the relationship between the variables by showing how the categories of one variable are distributed across the categories of another variable.
pd.crosstab(
  titanic['class'], titanic['survived'], 
  margins=True, normalize='index')
survived 0 1
class
First 0.370370 0.629630
Second 0.527174 0.472826
Third 0.757637 0.242363
All 0.616162 0.383838

Tabulation of Categorical and Quantitative Variables

What about when one variable is quantitative?

  • We can use grouped summary statistics to examine the relationship between a categorical variable and a quantitative variable.
    • For example, we can compute the mean age of passengers in each class in the Titanic dataset.

Mixed Type Tabulation Example

titanic.groupby('class', observed=True)['age'].describe()
count mean std min 25% 50% 75% max
class
First 186.0 38.233441 14.802856 0.92 27.0 37.0 49.0 80.0
Second 173.0 29.877630 14.001077 0.67 23.0 29.0 36.0 70.0
Third 355.0 25.140620 12.495398 0.42 18.0 24.0 32.0 74.0

Covariance

Multiple Quantitative Variables

  • For two quantitative variables we are primarily interested in the sample covariance and correlation.

Sample Covariance

  • The sample covariance between two variables \(X\) and \(Y\) is calculated as: \[s_{XY} := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).\]
    • This can be thought of as the average product of the deviations of the two variables from their respective means.
    • A positive covariance indicates that the variables tend to increase together.
    • A negative covariance indicates that one variable tends to increase when the other decreases.
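
As a small worked instance of the covariance formula, the sketch below computes \(s_{XY}\) term by term on made-up paired data (an assumption for illustration) and checks it against `np.cov`.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n = x.size

# Products of the deviations from the respective means.
products = (x - x.mean()) * (y - y.mean())
s_xy = products.sum() / (n - 1)

# np.cov returns the 2x2 covariance matrix and divides by n - 1 by default.
s_xy_numpy = np.cov(x, y)[0, 1]

print(f"deviation products: {products}")
print(f"sample covariance:  {s_xy} (NumPy: {s_xy_numpy})")
```

Most of the products are zero or positive here, so the covariance is positive: larger values of one variable tend to occur with larger values of the other.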

Correlation

Sample Correlation

  • The sample correlation between two variables \(X\) and \(Y\) is calculated as: \[r_{XY} := \frac{s_{XY}}{s_X s_Y},\] where \(s_X\) and \(s_Y\) are the sample standard deviations of \(X\) and \(Y\), respectively.
    • The correlation coefficient ranges from -1 to 1.
    • Values close to 1 indicate a strong positive linear relationship.
    • Values close to -1 indicate a strong negative linear relationship.
    • Values close to 0 indicate little or no linear relationship.
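
One reason to prefer correlation over covariance is that it does not depend on the units of measurement. The sketch below, on made-up data with a hypothetical helper `sample_correlation`, rescales one variable and shows that the covariance changes while the correlation does not.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

def sample_correlation(a, b):
    """r = s_ab / (s_a * s_b), dividing by n - 1 in every term."""
    s_ab = np.cov(a, b)[0, 1]
    return s_ab / (a.std(ddof=1) * b.std(ddof=1))

r = sample_correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

# Rescale x (for example, a change of units).
x_scaled = 100 * x
cov_original = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x_scaled, y)[0, 1]
r_scaled = sample_correlation(x_scaled, y)

print(f"covariance:  {cov_original:.2f} -> {cov_scaled:.2f}")
print(f"correlation: {r:.4f} -> {r_scaled:.4f}")
```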

Correlation Visualization Example

Warnings about Correlation

Correlation is not causation!

  • Correlation does not imply causation!
    • Just because two variables move together does not mean that one variable causes the other to move.
    • For example, ice cream sales and crime rates are strongly correlated, but this does not mean that ice cream sales cause crime or vice versa; a third factor, such as warm weather, may drive both.

Correlation measures linear relationships!

  • Correlation measures only linear relationships!
    • Just because two variables have a correlation of 0 does not mean that they are independent.

Non-Linear Relationships

Look at the following scatter plot:

Quadratic relationship with a correlation of nearly 0.
  • The correlation is nearly 0, but there is clearly a strong relationship.
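
A scatter plot like this can be reproduced with a few lines. The sketch below is an assumption about how such data might be generated, not the code behind the figure: it uses \(y = x^2\) on a grid symmetric around 0, so \(y\) is completely determined by \(x\) and yet the correlation is essentially 0.

```python
import numpy as np

# x values symmetric around 0; y depends on x exactly, but not linearly.
x = np.linspace(-1, 1, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and x^2: {r:.6f}")

# Restricting to x >= 0 makes the relationship monotone,
# and the correlation becomes strongly positive.
mask = x >= 0
r_positive_half = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"correlation for x >= 0 only:  {r_positive_half:.3f}")
```

The positive and negative halves of the curve cancel in the covariance sum, which is why the overall correlation vanishes.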

Covariance and Correlation Matrices

Covariance and Correlation Matrices

  • The covariance matrix of a set of variables is a square matrix that contains the variances and covariances of the variables.
  • The correlation matrix of a set of variables is a square matrix that contains the correlations of the variables.
    • Both are symmetric and positive semi-definite (which means all eigenvalues are non-negative).
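
These properties can be verified numerically. The sketch below uses simulated data without missing values (an assumption for illustration) and checks symmetry, the sign of the eigenvalues, and the link between the two matrices: each correlation is the covariance divided by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

# Simulated data with three columns; "c" is made to depend on "a".
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["c"] = df["a"] + 0.5 * df["c"]

cov = df.cov().to_numpy()
corr = df.corr().to_numpy()

# Symmetry: entry (i, j) equals entry (j, i).
is_symmetric = np.allclose(cov, cov.T) and np.allclose(corr, corr.T)

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
cov_eigs = np.linalg.eigvalsh(cov)
corr_eigs = np.linalg.eigvalsh(corr)

# Rebuild the correlation matrix from the covariance matrix.
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)

print(f"symmetric: {is_symmetric}")
print(f"smallest covariance eigenvalue:  {cov_eigs.min():.4f}")
print(f"smallest correlation eigenvalue: {corr_eigs.min():.4f}")
```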

Covariance and Correlation Matrices Example

titanic[['age', 'fare']].cov()
age fare
age 211.019125 73.849030
fare 73.849030 2469.436846
titanic[['age', 'fare']].corr()
age fare
age 1.000000 0.096067
fare 0.096067 1.000000

Checking our Matrices

Let’s look at the scatter plot

fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(data=titanic, x='age', y='fare', ax=ax)
ax.set(
  title='Scatter Plot of Age and Fare',
  xlabel='Age',
  ylabel='Fare'
)
plt.tight_layout()
plt.show()

Checking our Matrices

Scatter Plot of Age and Fare

Exploratory Analysis Example

Example Data: Wine Dataset

wine = pd.read_csv('data/WineQT.csv')
wine.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality Id
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 0
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 1
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 2
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 3
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 4
  • The shape of the wine dataset is:
wine.shape
(1143, 13)

Summary Statistics

  • We can compute the summary statistics for the wine dataset using the describe method.
wine.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality Id
count 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000
mean 8.311111 0.531339 0.268364 2.532152 0.086933 15.615486 45.914698 0.996730 3.311015 0.657708 10.442111 5.657043 804.969379
std 1.747595 0.179633 0.196686 1.355917 0.047267 10.250486 32.782130 0.001925 0.156664 0.170399 1.082196 0.805824 463.997116
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000 0.000000
25% 7.100000 0.392500 0.090000 1.900000 0.070000 7.000000 21.000000 0.995570 3.205000 0.550000 9.500000 5.000000 411.000000
50% 7.900000 0.520000 0.250000 2.200000 0.079000 13.000000 37.000000 0.996680 3.310000 0.620000 10.200000 6.000000 794.000000
75% 9.100000 0.640000 0.420000 2.600000 0.090000 21.000000 61.000000 0.997845 3.400000 0.730000 11.100000 6.000000 1209.500000
max 15.900000 1.580000 1.000000 15.500000 0.611000 68.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000 1597.000000

Univariate Analysis

Histograms to inspect the distribution of the variables

for feature in wine.columns:
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.histplot(wine[feature], kde=True, ax=ax)
    ax.set_title(f"{feature} | Skewness: {round(wine[feature].skew(), 2)}")
    fig.tight_layout()  # adjust the layout of each figure before saving it
    fig.savefig(f"assets/wine_{feature}.png")  # the assets/ directory must already exist

plt.show()

Histogram Plots

  • Which variables are skewed?

Skewness and Kurtosis

Skewness

wine.skew()
fixed acidity           1.044930
volatile acidity        0.681547
citric acid             0.371561
residual sugar          4.361096
chlorides               6.026360
free sulfur dioxide     1.231261
total sulfur dioxide    1.665766
density                 0.102395
pH                      0.221138
sulphates               2.497266
alcohol                 0.863313
quality                 0.286792
Id                     -0.010419
dtype: float64

Kurtosis

wine.kurtosis()
fixed acidity            1.384614
volatile acidity         1.375531
citric acid             -0.714686
residual sugar          27.675366
chlorides               47.078324
free sulfur dioxide      1.932170
total sulfur dioxide     5.098748
density                  0.888123
pH                       0.925791
sulphates               12.017377
alcohol                  0.221179
quality                  0.314664
Id                      -1.216364
dtype: float64

Bivariate Analysis

Pairplots

sns.pairplot(wine.iloc[:, :2])

Quality vs Alcohol

Box Plots

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(x='quality', y='alcohol', data=wine, ax=ax)
ax.set(
  title='Box Plot of Quality and Alcohol',
  xlabel='Quality',
  ylabel='Alcohol'
)
plt.tight_layout()
plt.show()

Quality vs Alcohol

Box Plot of Quality and Alcohol

Multivariate Analysis

Heatmap

plt.figure(figsize=(7, 7))

sns.heatmap(wine.corr(), annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)
plt.title('Correlation Heatmap')
plt.show()

Multivariate Analysis

Heatmap of the wine dataset

Conclusion

✅ What we covered

  • Exploratory Data Analysis (EDA):
    • Univariate analysis.
    • Bivariate analysis.
    • Multivariate analysis.
    • Correlation and covariance.
    • Skewness and kurtosis.
    • Cross tabulation.
    • Grouped summary statistics.
    • Mixed type tabulation.

📅 What’s next?

  • Statistical Foundations.
  • Probability and Distributions.
