PSTAT100: Data Science - Concepts and Analysis
May 6, 2026
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, identify patterns, and uncover insights.
First five rows of the `titanic` dataset:

| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
The `titanic` dataset contains information about passengers on the Titanic. It includes both categorical (e.g., `sex`, `class`) and numerical (e.g., `age`, `fare`) variables.

| class | Frequency | Proportion | Percentage |
|---|---|---|---|
| First | 216 | 0.242424 | 24.24% |
| Second | 184 | 0.206510 | 20.65% |
| Third | 491 | 0.551066 | 55.11% |
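A frequency table like the one above can be built from a single categorical column. The following is a minimal sketch, not the code used to produce the table: the toy Series and the helper name `frequency_table` are assumptions for illustration, and the real `titanic['class']` column would take the place of the toy data.

```python
import pandas as pd

# Hypothetical toy column standing in for titanic['class'].
passenger_class = pd.Series(
    ["Third", "First", "Third", "First", "Third", "Second", "Third", "Second"],
    name="class",
)


def frequency_table(column: pd.Series) -> pd.DataFrame:
    """Return counts, proportions, and formatted percentages per category."""
    # value_counts() drops missing values by default; sort_index() orders
    # the rows by category label instead of by frequency.
    counts = column.value_counts().sort_index()
    proportions = counts / counts.sum()
    return pd.DataFrame(
        {
            "Frequency": counts,
            "Proportion": proportions,
            "Percentage": (100 * proportions).map("{:.2f}%".format),
        }
    )


table = frequency_table(passenger_class)
print(table)
```

The proportions always sum to 1, which is a quick sanity check on any table of this kind.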
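import matplotlib.pyplot as plt
import seaborn as sns

# Assumes `titanic` is already loaded as a DataFrame with a 'class' column.
colors = sns.color_palette('pastel')[0:3]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))

# Left panel: count plot of passenger class
sns.countplot(
    data=titanic,
    x='class',
    ax=ax[0]
)
ax[0].set(title='Count Plot', xlabel='Class', ylabel='Count')

# Right panel: pie chart. Take the labels from the same Series as the wedge
# sizes so that each label is guaranteed to match its wedge.
class_counts = titanic['class'].value_counts()
ax[1].pie(
    class_counts,
    labels=class_counts.index,
    colors=colors,
    autopct='%.0f%%')
ax[1].set_title('Pie Chart of Passenger Class')
plt.tight_layout()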
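plt.show()

[Figure: count plot and pie chart of the `class` variable in the titanic dataset.]

... `float` data?

The sample variance is computed as: \[s^2 := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2,\] where \(\bar{x}\) is the sample mean of the \(n\) observations \(x_1, \dots, x_n\).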
The sample standard deviation is the square root of the variance: \[s := \sqrt{s^2}.\]
It is often useful to understand the range of values in the data, as well as the spread of the middle 50% of the data.
The minimum and maximum values provide the range of the data: \[\text{Range} := \max(x_i) - \min(x_i).\]
A quantile is a cut point that divides a sorted dataset into equal-sized groups.
The interquartile range (IQR) is the difference between the third and first quartiles: \[\text{IQR} := Q3 - Q1.\]
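The spread measures above can be checked numerically. This is a small sketch under stated assumptions: the values in `x` are made-up toy data standing in for a numerical column such as `age`, and the quartiles use pandas' default linear interpolation, which is only one of several common quantile conventions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for a numerical column.
x = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])

n = len(x)
x_bar = x.sum() / n

# Sample variance with the n - 1 denominator, then its square root.
s2 = ((x - x_bar) ** 2).sum() / (n - 1)
s = np.sqrt(s2)

# Range: distance between the largest and smallest observation.
data_range = x.max() - x.min()

# IQR: spread of the middle 50% of the data (default linear interpolation).
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"variance={s2:.3f}, std={s:.3f}, range={data_range}, IQR={iqr}")
```

The manual variance and standard deviation agree with `x.var()` and `x.std()`, because pandas also divides by \(n-1\) by default.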
Use the `describe()` method in pandas to compute most of the common summary statistics for a numerical variable.

[Output: ... `age` variable grouped by `class` in the titanic dataset.]

[Figure: ... `age` variable in the titanic dataset.]

[Figure: ... `age` variable in the titanic dataset with summary statistic overlays.]

We can use `scipy.stats` to calculate skewness and kurtosis, but we can also implement these calculations manually using the formulas provided.

[Figure: ... `age` variable in the titanic dataset with summary statistic overlays.]

Skewness: 0.39, Kurtosis: 3.16
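To make the comparison between `scipy.stats` and a manual calculation concrete, here is a hedged sketch. The formulas referred to above did not survive in this text, so the code assumes the standard moment-based definitions: with central moments \(m_k = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^k\), skewness is \(m_3 / m_2^{3/2}\) and (non-excess) kurtosis is \(m_4 / m_2^2\). The helper names and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def central_moment(x: np.ndarray, k: int) -> float:
    """k-th central moment with a 1/n denominator."""
    return float(np.mean((x - x.mean()) ** k))


def skewness(x: np.ndarray) -> float:
    """Moment-based skewness m3 / m2^(3/2) (assumed definition)."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def pearson_kurtosis(x: np.ndarray) -> float:
    """Non-excess kurtosis m4 / m2^2; equals 3 for a normal distribution."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2


# Hypothetical toy values; drop missing values first when using real data,
# for example x = x[~np.isnan(x)].
x = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0, 58.0, 20.0])

print(f"manual: skew={skewness(x):.4f}, kurtosis={pearson_kurtosis(x):.4f}")
print(f"scipy:  skew={stats.skew(x):.4f}, "
      f"kurtosis={stats.kurtosis(x, fisher=False):.4f}")
```

Note that `scipy.stats.kurtosis` returns excess kurtosis (kurtosis minus 3) unless `fisher=False` is passed, so the convention should be stated whenever a kurtosis value is reported.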
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |
Summary statistics from the `describe` method:

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 | 1143.000000 |
| mean | 8.311111 | 0.531339 | 0.268364 | 2.532152 | 0.086933 | 15.615486 | 45.914698 | 0.996730 | 3.311015 | 0.657708 | 10.442111 | 5.657043 | 804.969379 |
| std | 1.747595 | 0.179633 | 0.196686 | 1.355917 | 0.047267 | 10.250486 | 32.782130 | 0.001925 | 0.156664 | 0.170399 | 1.082196 | 0.805824 | 463.997116 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 | 0.000000 |
| 25% | 7.100000 | 0.392500 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 21.000000 | 0.995570 | 3.205000 | 0.550000 | 9.500000 | 5.000000 | 411.000000 |
| 50% | 7.900000 | 0.520000 | 0.250000 | 2.200000 | 0.079000 | 13.000000 | 37.000000 | 0.996680 | 3.310000 | 0.620000 | 10.200000 | 6.000000 | 794.000000 |
| 75% | 9.100000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 61.000000 | 0.997845 | 3.400000 | 0.730000 | 11.100000 | 6.000000 | 1209.500000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 68.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 | 1597.000000 |
Skewness by column:

fixed acidity 1.044930
volatile acidity 0.681547
citric acid 0.371561
residual sugar 4.361096
chlorides 6.026360
free sulfur dioxide 1.231261
total sulfur dioxide 1.665766
density 0.102395
pH 0.221138
sulphates 2.497266
alcohol 0.863313
quality 0.286792
Id -0.010419
dtype: float64
Kurtosis by column:

fixed acidity 1.384614
volatile acidity 1.375531
citric acid -0.714686
residual sugar 27.675366
chlorides 47.078324
free sulfur dioxide 1.932170
total sulfur dioxide 5.098748
density 0.888123
pH 0.925791
sulphates 12.017377
alcohol 0.221179
quality 0.314664
Id -1.216364
dtype: float64
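Per-column summaries like the ones above can be produced for a whole DataFrame at once. The sketch below is illustrative only: the small `wine` frame is made-up toy data that borrows three column names from the table, and it is an assumption that the listings above came from the pandas `skew()` and `kurt()` methods.

```python
import pandas as pd

# Hypothetical toy frame; column names borrowed from the wine table.
wine = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 12.8, 11.1],
        "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.30, 3.35, 3.28],
        "quality": [5, 5, 5, 6, 5, 6, 7, 6],
    }
)

# count, mean, std, min, quartiles, and max for every numerical column.
summary = wine.describe()

# One value per column. pandas applies a small-sample bias correction, and
# kurt() reports excess kurtosis (a normal distribution gives 0).
skewness = wine.skew()
kurtosis = wine.kurt()

print(summary)
print(skewness)
print(kurtosis)
```

Because `kurt()` uses the excess convention, its values are not directly comparable with a non-excess kurtosis, for which a normal distribution gives 3.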