PSTAT100: Data Science - Concepts and Analysis
May 6, 2026
Tidy Data Standard
For data to be tidy, it must satisfy the following three rules:
Data Preparation
Data preparation is the process of cleaning and transforming data to make it suitable for analysis.
Missing Data
To graphically analyze the missingness of data we can use the Missingno Library, which provides a suite of tools for visualizing missing data.
Let’s take a first look at the first few rows and the shape.
| Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN | ... | NaN | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | NaN | NaN | 2009 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | ... | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | ... | NaN | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | NaN | NaN | 2009 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | ... | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | ... | NaN | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | NaN | NaN | 2009 |
5 rows × 102 columns
NaN values.isnull identifies missing values and sum counts them.DefTwoPoint 362433
BlockingPlayer 362341
TwoPointConv 361919
ChalReplayResult 359476
RecFumbPlayer 358513
RecFumbTeam 358513
Interceptor 358387
FieldGoalDistance 354528
FieldGoalResult 354431
ExPointResult 353399
dtype: int64
27.652267428200588
msno.bar to visualize the missing data counts (for the first few columns to make it easier to see).msno.matrix visualizes the missing data (white = missingness) for 500 random rows.msno.heatmap visualizes these relationships.msno.dendrogram to visualize the hierarchical relationships between missing data.Rows in original dataset: 362447
Rows with na's dropped: 0
Columns in original dataset: 102
Columns with na's dropped: 37
TimeSecs is a column that is missing a large number of values.
PenalizedTeam is also missing values.
fillna methodfillna method to impute missing data.
method to specify:
ffill - forward fill (use the previous value).bfill - backward fill (use the next value).axis to specify the axis to impute along (i.e. rows or columns).Imputation is a powerful tool and a topic in it’s own right. We will not cover it in depth in this course but recommend you explore it further if your project data has missing values.
| Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 15:00 | 15 | 3600.0 | 0.0 | TEN | ... | 1.146076 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | -0.032244 | 0.036899 | 2009 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | ... | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | ... | -5.031425 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.106663 | -0.156239 | 2009 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | ... | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | ... | 0.163935 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | -0.010456 | 0.006029 | 2009 |
5 rows × 102 columns
bfill to replace the missing values with the next value.axis=0 to impute along the columns.For our example we consider synthetic customer data customer.csv.
| customer_id | name | age | city | spend | ||
|---|---|---|---|---|---|---|
| 0 | 101 | Alice Johnson | alice@email.com | 34 | New York | 250.00 |
| 1 | 102 | Bob Smith | bob@email.com | 28 | Chicago | 180.50 |
| 2 | 103 | Carol White | carol@email.com | 45 | Houston | 320.00 |
| 3 | 101 | Alice Johnson | alice@email.com | 34 | New York | 250.00 |
| 4 | 104 | David Brown | david@email.com | 52 | Phoenix | 410.75 |
duplicated():Total duplicate values: 6
Duplicate index values: Index([3, 5, 10, 12, 14, 16], dtype='int64')
Carol White vs carol white.str.title().str.strip().drop_duplicates.np.int64(0)
groupby to aggregate duplicate values.dtypes, info() and spot-check values with head(), unique().pd.to_numeric, pd.to_datetime)dtype={...} or parse dates with parse_dates=[...].Date column original type: str
Date column new type: datetime64[us]
Floats column original type: str
Floats column new type: float64
Booleans column original type: str
Booleans column new type: boolean
0 2024-01-01
1 2001-12-21
2 2001-12-21
3 2024-01-02
4 2018-03-04
Name: Dates, dtype: datetime64[us]
0 True
1 False
2 True
3 False
4 True
Name: Booleans, dtype: boolean
Series.between, or isin to flag rowsNaN and treat as missing."NYC" vs "nyc" vs "New York").customerID vs customer_id), which blocks merges.str.strip, str.lower / str.title, remove extra spaces.map or replace.01/02/2024), mixed formats, or Excel serial numbers.NaT only where you are not looking.pd.read_csv(..., parse_dates=["date_col"]) when the column is consistently formatted.pd.to_datetime; use format='...' when you know the pattern, and errors='coerce' to isolate bad strings.dayfirst=True or yearfirst=True when the data is not ISO-8601 and locale order matters.This is a good guide to follow but it is not a strict recipe (as always)!
