PSTAT100 Data Science Concepts and Analysis
April 7, 2026
Welcome to PSTAT100 Data Science Concepts and Analysis!

OH: South Hall 5431T R 1PM to 3PM.
I am being assisted this term by the following wonderful teaching assistants:

Due to space availability, section switching must be confirmed in advance with your TA.
This course uses Python.
Prior experience with Python is not required, but you are expected to have some familiarity with a similar programming language. Supplementary materials summarizing these topics will be made available in the online lecture notes for review.
Communication is key!
No plagiarism!
AI tools are encouraged for learning.
Be respectful.
Please make sure to read through the course policies detailed in the syllabus.
The following is a tentative course outline:
In addition to technical topics we will also develop your professional skills as a data scientist including:
Your course aims should be:
What if I am not familiar with Python?
I will be hosting a Python Bootcamp this Thursday from 1:00pm to 3:00pm in South Hall 5421.
Attendance is optional!
Let's see what Claude thinks:

Data science is the practice of using data to extract insights and knowledge.

Notice that these disciplines loosely correspond to the prerequisites for this course.
Data Science Venn Diagram.
Definition: Data
Data is (digital) information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.

The amount of data being collected, stored, and processed is growing exponentially!
Larger variety of increasingly complicated data types.

Definition: Dataset
A dataset is a collection of observations taken on observational units, consisting of values measured on a set of variables.
Let's look at our first example dataset.
cats.csv
| | Breed | Age (Years) | Weight (kg) | Color | Gender |
|---|---|---|---|---|---|
| 0 | Russian Blue | 19 | 7 | Tortoiseshell | Female |
| 1 | Norwegian Forest | 19 | 9 | Tortoiseshell | Female |
| 2 | Chartreux | 3 | 3 | Brown | Female |
| 3 | Persian | 13 | 6 | Sable | Female |
| 4 | Ragdoll | 10 | 8 | Tabby | Male |
The cats dataset was provided by Waqar Ali on Kaggle.
Here the observational unit is a cat.
This dataset consists of 1000 observations (rows, individual cats), each with 5 variables (columns).
In general, we distinguish between the semantics and structure of a dataset.
The semantics of a dataset refers to the meaning behind the data.
The structure of a dataset refers to the way the data is organized.
The Breed, Color, and Gender variables are qualitative variables since they are categorical.
The Age and Weight variables are quantitative variables since they are numerical.
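One way to check this split in pandas is to separate columns by dtype. The sketch below rebuilds the five displayed rows by hand (the full cats.csv is not loaded here) and assumes that a numeric dtype is a reasonable proxy for "quantitative"; a category stored as numeric codes would need to be handled separately.

```python
import pandas as pd

# First five rows of the cats dataset, typed in by hand for illustration.
cats = pd.DataFrame({
    "Breed": ["Russian Blue", "Norwegian Forest", "Chartreux", "Persian", "Ragdoll"],
    "Age (Years)": [19, 19, 3, 13, 10],
    "Weight (kg)": [7, 9, 3, 6, 8],
    "Color": ["Tortoiseshell", "Tortoiseshell", "Brown", "Sable", "Tabby"],
    "Gender": ["Female", "Female", "Female", "Female", "Male"],
})

# Numeric columns are treated as quantitative, everything else as qualitative.
quantitative = cats.select_dtypes(include="number").columns.tolist()
qualitative = cats.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```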
Suppose we are interested in transforming the quantitative age variable into the categories young, middle-aged, and senior.
We can create a new variable AgeGroup with the following values:
Young: 0-5 years
Middle-Aged: 6-10 years
Senior: 11+ years
cats.csv
| | Breed | Age (Years) | Weight (kg) | Color | Gender | AgeGroup |
|---|---|---|---|---|---|---|
| 0 | Russian Blue | 19 | 7 | Tortoiseshell | Female | Senior |
| 1 | Norwegian Forest | 19 | 9 | Tortoiseshell | Female | Senior |
| 2 | Chartreux | 3 | 3 | Brown | Female | Young |
| 3 | Persian | 13 | 6 | Sable | Female | Senior |
| 4 | Ragdoll | 10 | 8 | Tabby | Male | Middle-Aged |
We have now transformed the numerical Age (Years) variable into a qualitative AgeGroup variable.
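A transformation like this can be sketched with the pandas `cut` function. The bin edges below are one reading of the age ranges given above (0-5, 6-10, 11+), and the five rows are entered by hand rather than loaded from cats.csv.

```python
import pandas as pd

# Breed and age for the five displayed cats, typed in by hand.
cats = pd.DataFrame({
    "Breed": ["Russian Blue", "Norwegian Forest", "Chartreux", "Persian", "Ragdoll"],
    "Age (Years)": [19, 19, 3, 13, 10],
})

# Right-closed bins: [0, 5], (5, 10], (10, inf].
cats["AgeGroup"] = pd.cut(
    cats["Age (Years)"],
    bins=[0, 5, 10, float("inf")],
    labels=["Young", "Middle-Aged", "Senior"],
    include_lowest=True,  # so that an age of 0 falls in "Young"
)

print(cats)
```

The result is a categorical column, so the ordering Young < Middle-Aged < Senior is preserved for later sorting or grouping.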
We will revisit variable transformations in more detail later in the course when we discuss data preparation.

In reality there is a wide variety of different data types:
Each data type has its own unique set of challenges and techniques for data scientists to apply.
The MNIST database is a subset of the larger NIST database, made available by Yann LeCun on Kaggle.
MNIST Data Visualization.
The world of data is confusing!

Definition: Data Literacy
Data Literacy is the ability to explore, understand, and communicate with data in a meaningful way. (Tableau)
The data science lifecycle (DSL) is the following multi-step process used to extract actionable insights from data:
In reality, the data science lifecycle has a more complex structure.
mammals.csv
Suppose we are provided with the following data set:
| | species | body_weight | brain_weight | slow_wave | paradox | total_sleep | lifespan | gestation | predation | exposure | danger |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | African elephant | 6654.000 | 5712.0 | NaN | NaN | 3.3 | 38.6 | 645.0 | 3 | 5 | 3 |
| 1 | African giant pouched rat | 1.000 | 6.6 | 6.3 | 2.0 | 8.3 | 4.5 | 42.0 | 3 | 1 | 3 |
| 2 | Arctic fox | 3.385 | 44.5 | NaN | NaN | 12.5 | 14.0 | 60.0 | 1 | 1 | 1 |
| 3 | Arctic ground squirrel | 0.920 | 5.7 | NaN | NaN | 16.5 | NaN | 25.0 | 5 | 2 | 3 |
| 4 | Asian elephant | 2547.000 | 4603.0 | 2.1 | 1.8 | 3.9 | 69.0 | 624.0 | 3 | 5 | 4 |
The mammals.csv data set (web source) comes from Allison and Cicchetti (1976) and contains data for 62 mammals.
Suppose we are interested in how the size of an animal's brain scales with its body size.
Dimensions:
(62, 11)
Missingness analysis:
species 0
body_weight 0
brain_weight 0
slow_wave 14
paradox 12
total_sleep 4
lifespan 4
gestation 4
predation 0
exposure 0
danger 0
dtype: int64
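Output of this kind can be produced with `DataFrame.shape` and `DataFrame.isna().sum()`. The sketch below uses only the five rows displayed earlier, entered by hand, so its counts differ from the full 62-row results.

```python
import pandas as pd

nan = float("nan")

# The five displayed rows of mammals.csv, typed in by hand for illustration.
mammals = pd.DataFrame({
    "species": ["African elephant", "African giant pouched rat", "Arctic fox",
                "Arctic ground squirrel", "Asian elephant"],
    "body_weight": [6654.0, 1.0, 3.385, 0.92, 2547.0],
    "brain_weight": [5712.0, 6.6, 44.5, 5.7, 4603.0],
    "slow_wave": [nan, 6.3, nan, nan, 2.1],
    "paradox": [nan, 2.0, nan, nan, 1.8],
    "total_sleep": [3.3, 8.3, 12.5, 16.5, 3.9],
    "lifespan": [38.6, 4.5, 14.0, nan, 69.0],
    "gestation": [645.0, 42.0, 60.0, 25.0, 624.0],
    "predation": [3, 3, 1, 5, 3],
    "exposure": [5, 1, 1, 2, 5],
    "danger": [3, 3, 1, 3, 4],
})

print("Dimensions:")
print(mammals.shape)

# Count missing values per column.
missing = mammals.isna().sum()
print("Missingness analysis:")
print(missing)
```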
Does this data suggest evidence of a relationship between a mammal's brain size and body weight?
Summary statistics:
body_weight brain_weight
count 62.000000 62.000000
mean 198.789984 283.134194
std 899.158011 930.278942
min 0.005000 0.140000
25% 0.600000 4.250000
50% 3.342500 17.250000
75% 48.202500 166.000000
max 6654.000000 5712.000000
Correlation:
body_weight brain_weight
body_weight 1.000000 0.934164
brain_weight 0.934164 1.000000
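Summaries like these come from `describe()` and `corr()`. This is again a sketch on the five displayed rows only, so the numbers will not match the full-data output above. Keep in mind that Pearson correlation measures linear association and can be dominated by a few very large values.

```python
import pandas as pd

# Body and brain weights for the five displayed mammals, typed in by hand.
weights = pd.DataFrame({
    "body_weight": [6654.0, 1.0, 3.385, 0.92, 2547.0],
    "brain_weight": [5712.0, 6.6, 44.5, 5.7, 4603.0],
})

summary = weights.describe()  # count, mean, std, min, quartiles, max
corr = weights.corr()         # Pearson correlation matrix (the default method)

print("Summary statistics:")
print(summary)
print("Correlation:")
print(corr)
```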


Our visualization suggests that the relationship could be modeled as
\[ \begin{aligned} \log(\text{brain}) & = \beta_0 + \beta_1 \log(\text{body})\\ \text{brain} & = \exp(\beta_0 + \beta_1 \log(\text{body})) \\ \text{brain} & = c\cdot\exp(\log(\text{body}^{\beta_1})), \quad c = e^{\beta_0} \\ \implies \text{brain} & = c\cdot\text{body}^{\beta_1} \propto \text{body}^{\beta_1}. \end{aligned} \]
This suggests that the relationship is a power law.
OLS Regression Results
================================================================================
Dep. Variable: np.log(brain_weight) R-squared: 0.921
Model: OLS Adj. R-squared: 0.919
Method: Least Squares F-statistic: 697.4
Date: Tue, 07 Apr 2026 Prob (F-statistic): 9.84e-35
Time: 22:12:32 Log-Likelihood: -64.336
No. Observations: 62 AIC: 132.7
Df Residuals: 60 BIC: 136.9
Df Model: 1
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 2.1348 0.096 22.227 0.000 1.943 2.327
np.log(body_weight) 0.7517 0.028 26.409 0.000 0.695 0.809
==============================================================================
Omnibus: 2.698 Durbin-Watson: 1.667
Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.933
Skew: 0.405 Prob(JB): 0.380
Kurtosis: 3.301 Cond. No. 3.73
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
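The `Dep. Variable` and `Intercept` labels in this summary are consistent with a fit through the statsmodels formula interface, although the original code is not shown. A minimal sketch of such a fit, using only the five rows displayed earlier (so the estimates will differ from the 62-mammal results above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Body and brain weights for the five displayed mammals, typed in by hand.
mammals = pd.DataFrame({
    "species": ["African elephant", "African giant pouched rat", "Arctic fox",
                "Arctic ground squirrel", "Asian elephant"],
    "body_weight": [6654.0, 1.0, 3.385, 0.92, 2547.0],
    "brain_weight": [5712.0, 6.6, 44.5, 5.7, 4603.0],
})

# Regress log(brain) on log(body); np.log is evaluated inside the formula.
model = smf.ols("np.log(brain_weight) ~ np.log(body_weight)", data=mammals)
fit = model.fit()

beta0 = fit.params["Intercept"]
beta1 = fit.params["np.log(body_weight)"]
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```

Calling `fit.summary()` prints a report in the same layout as the one above.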
Our fitted model is:
\[ \text{brain} \propto \text{body}^{0.7517}. \]
This means that a 1% increase in body weight is associated with a ~0.75% increase in brain weight.
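This reading can be checked with a little arithmetic: under the fitted power law, multiplying body weight by a factor k multiplies the predicted brain weight by k raised to the power 0.7517. The doubling case below is an added illustration, not a figure from the original analysis.

```python
beta1 = 0.7517  # fitted exponent from the regression output


def brain_pct_change(body_factor: float) -> float:
    """Percent change in predicted brain weight when body weight is multiplied by body_factor."""
    return (body_factor ** beta1 - 1) * 100


print(f"{brain_pct_change(1.01):.2f}%")  # body weight 1% larger
print(f"{brain_pct_change(2.0):.1f}%")   # body weight doubled
```

The first value is roughly 0.75%, matching the statement above; the second is roughly 68%, which shows that brain weight grows more slowly than body weight when the exponent is below 1.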
This leads us to draw the following conclusion:
For these 62 mammals there is evidence that brain weight changes in proportion to a power of body weight.