PSTAT100 Data Science and Analysis
Lecture Notes
Introduction
These are my online lecture notes for the PSTAT100 Data Science and Analysis course taught at UC Santa Barbara. In these lecture notes we study fundamental topics in data science and the tools we use for data retrieval, analysis, visualization, and reproducible research in preparation for advanced data science courses.
Throughout these notes we will conduct our data analysis in the Python programming language. The course assumes that you have some experience working in Python or a similar programming language such as R. You may use the Integrated Development Environment (IDE) of your choice, but I recommend VS Code or an editor built on it, such as Positron or Cursor. If you are unfamiliar with Python and need help with setup, I have included some guidance in the preliminary material.
Contents
- Preliminary Material
- Getting Started with Python
- Python Basics
- Pandas
- Linear Algebra
- Introduction to Data Science
- Data Science Terminology
- Variable Classification
- Data Science Lifecycle
- Data Preparation
- Data Structure
- Missingness
- Duplicates
- Invalid Values and Outliers
- Variable Classification
- Exploratory Data Analysis
- Summary Statistics
- Data Visualizations
- Univariate, Bivariate, and Multivariate Analysis
- Probability Theory Fundamentals
- Probability Measures
- Random Variables
- Probability Distributions
- Expectation and Variance
- Conditional Probability and Independence
- Statistics
- Inferential Statistics
- Estimators and Bias
- Sampling Distributions
- Confidence Intervals and Hypothesis Testing
- Likelihood and Maximum Likelihood Estimation
- Modelling
- Types of Models
- Statistical Modelling
- Simple Linear Regression
- Generalized Linear Regression
- Principal Components Analysis
- Logistic Regression and Classification
- Clustering