PSTAT100: Data Science - Concepts and Analysis
April 7, 2026
- numpy arrays.
- pandas DataFrames.
- pd.melt and pd.pivot.

Python is a general-purpose programming language.
Python natively stores data in variables of different types:
- str: text data.
- int: whole numbers.
- float: decimal numbers.
- bool: true/false values.
- list: ordered collections of data.
- dict: key-value pairs.
- tuple: ordered, immutable collections of data.
- set: unordered collections of unique data.

Packages such as pandas and numpy provide additional data types and functions for working with data, such as:

- DataFrame (pandas): tabular data.
- array (numpy): numerical data.

    my_string = "Hello, World!"
    my_integer = 10
    my_float = 3.14
    my_boolean = True
    my_list = [1, 2, 3, 4, 5]
    my_tuple = (1, 2, 3, 4, 5)
    my_set = {1, 2, 3, 4, 5}
    my_dict = {"name": "John", "age": 30}
    item_list = [my_string, my_integer, my_float, my_boolean,
                 my_list, my_tuple, my_set, my_dict]
    # Print the type and value of each item.
    for item in item_list:
        print(type(item), ":", item)

Output:

    <class 'str'> : Hello, World!
    <class 'int'> : 10
    <class 'float'> : 3.14
    <class 'bool'> : True
    <class 'list'> : [1, 2, 3, 4, 5]
    <class 'tuple'> : (1, 2, 3, 4, 5)
    <class 'set'> : {1, 2, 3, 4, 5}
    <class 'dict'> : {'name': 'John', 'age': 30}
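
Alongside the built-in types, numpy arrays hold numerical data. The following is a minimal sketch of three common ways to create one; the values are made up for illustration, and the alias np is the usual convention rather than something fixed by these notes.

```python
import numpy as np

# Build an array from an existing Python list.
from_list = np.array([1, 2, 3, 4, 5])

# Five evenly spaced values from 0 to 1, both endpoints included.
evenly_spaced = np.linspace(0, 1, num=5)

# A 2 x 3 array in which every entry is 7.
constant = np.full((2, 3), 7)

# Every array reports its shape and a single element type (dtype).
for arr in (from_list, evenly_spaced, constant):
    print(type(arr).__name__, arr.shape, arr.dtype)
```

Unlike a list, an array stores elements of one type, which is what makes vectorized arithmetic on it possible.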
numpy functions for creating arrays: array, linspace, full, etc.

Common file formats:

- .txt: simplest format but with loading challenges.
- .csv: most common format.
- .tsv: similar to CSV but uses tabs instead of commas.
- .xlsx: spreadsheets with complex formulas and formatting.

We focus on .csv files and .xlsx files.

.csv files

- .csv files are comma-separated values files and are commonly used to store tabular data.
- To load a .csv file we can use the pd.read_csv function.

.xlsx files

- .xlsx files are Excel files and are commonly used to store tabular data.
- To load a .xlsx file we can use the pd.read_excel function.

        OrderDate   Region      Rep    Item  Units  Unit Cost   Total
    0  2024-01-06     East    Jones  Pencil     95       1.99  189.05
    1  2024-01-23  Central   Kivell  Binder     50      19.99  999.50
    2  2024-02-09  Central  Jardine  Pencil     36       4.99  179.64
    3  2024-02-26  Central     Gill     Pen     27      19.99  539.73
    4  2024-03-15     West  Sorvino  Pencil     56       2.99  167.44
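
As a rough illustration of pd.read_csv, the sketch below reads the first three rows of the table above from an in-memory string instead of a file, so it runs without any data on disk. The variable names and the use of io.StringIO are choices made for this example only.

```python
import io

import pandas as pd

# Stand-in for a .csv file on disk; the rows are copied from the output above.
csv_text = """OrderDate,Region,Rep,Item,Units,Unit Cost,Total
2024-01-06,East,Jones,Pencil,95,1.99,189.05
2024-01-23,Central,Kivell,Binder,50,19.99,999.50
2024-02-09,Central,Jardine,Pencil,36,4.99,179.64
"""

# pd.read_csv accepts a file path or any file-like object.
# parse_dates converts the OrderDate column from text to datetimes.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["OrderDate"])

print(sales.shape)
print(sales.dtypes)

# An Excel workbook is read in a similar way, for example
# pd.read_excel("sales.xlsx"), where "sales.xlsx" is a hypothetical path
# and an Excel engine such as openpyxl must be installed.
```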

Common DataFrame operations:

- Inspecting: columns, index, shape, head, tail, etc.
- Selecting rows and columns (.loc, .iloc)
- Adding and removing data (.drop, .dropna, .insert, pd.concat)
- Sorting (.sort_values, .sort_index)

                  Breed          Color
    0      Russian Blue  Tortoiseshell
    1  Norwegian Forest  Tortoiseshell
    2         Chartreux          Brown
    3           Persian          Sable
    4           Ragdoll          Tabby

         Breed  Age (Years)  Weight (kg)  Color  Gender
    3  Persian           13            6  Sable  Female
    4  Ragdoll           10            8  Tabby    Male
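
The operations listed above can be sketched on the Breed and Color columns shown in the first table. This is only an illustration: the variable names are invented here, and the other columns are left out because their full values do not appear in these notes.

```python
import pandas as pd

# Breed and Color values are taken from the first table above.
cats = pd.DataFrame({
    "Breed": ["Russian Blue", "Norwegian Forest", "Chartreux", "Persian", "Ragdoll"],
    "Color": ["Tortoiseshell", "Tortoiseshell", "Brown", "Sable", "Tabby"],
})

print(cats.shape)             # (number of rows, number of columns)
print(cats.columns.tolist())  # column labels

# Label-based selection: the row labelled 2, column "Breed".
one_breed = cats.loc[2, "Breed"]

# Position-based selection: the last two rows.
last_two = cats.iloc[-2:]

# Sorting and dropping return new DataFrames; cats itself is unchanged.
by_breed = cats.sort_values("Breed")
without_color = cats.drop(columns="Color")
```

Note that .loc uses index labels while .iloc uses integer positions; the two only coincide here because the default index is 0, 1, 2, and so on.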
Often two data sets can have the same semantics but different data structures.
Data structures can be broadly categorized into two types: wide and long.

A wide table created with pd.DataFrame:

| team | points | assists | rebounds |
|---|---|---|---|
| A | 88 | 12 | 22 |
| B | 91 | 17 | 28 |
| C | 99 | 24 | 30 |
| D | 94 | 28 | 31 |

The same data in long format (excerpt):

| team | statistic | value |
|---|---|---|
| A | points | 88 |
| A | assists | 12 |
| A | rebounds | 22 |
| B | points | 91 |
| B | assists | 17 |
| B | rebounds | 28 |

In the long format:

- one column holds the measured values (value);
- one column holds the variable names (statistic);
- this layout is what plotting libraries (seaborn, ggplot in R) generally expect;
- conversion between the two formats is done with pandas.

pd.melt - Wide → Long

pd.melt is a function that converts a wide data structure to a long data structure.

- id_vars are the columns that identify the subject.
- var_name is the name of the column that will contain the variable names.
- value_name is the name of the column that will contain the variable values.

       team statistic  value
    0     A    points     88
    1     B    points     91
    2     C    points     99
    3     D    points     94
    4     A   assists     12
    5     B   assists     17
    6     C   assists     24
    7     D   assists     28
    8     A  rebounds     22
    9     B  rebounds     28
    10    C  rebounds     30
    11    D  rebounds     31
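
A call along the following lines would produce the long table above from the wide basketball table. It is a sketch rather than the exact code from the lecture: team is assumed to be a regular column, and the variable names are invented here. The last step reverses the reshape with pd.pivot, which is described next, as a round-trip check.

```python
import pandas as pd

# Wide table from the basketball example, with team as a regular column.
wide = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [88, 91, 99, 94],
    "assists": [12, 17, 24, 28],
    "rebounds": [22, 28, 30, 31],
})

# Wide -> long: one row per (team, statistic) pair.
long_df = pd.melt(wide, id_vars="team", var_name="statistic", value_name="value")

# Long -> wide again: one row per team, one column per statistic.
wide_again = pd.pivot(long_df, index="team", columns="statistic", values="value")

print(long_df)
print(wide_again)
```

After the round trip the statistic columns come back in alphabetical order, and team becomes the index rather than a column.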

pd.pivot - Long → Wide

pd.pivot is a function that converts a long data structure to a wide data structure.

- index is the column that will contain the subject identifiers.
- columns is the column that will contain the variable names.
- values is the column that will contain the variable values.

Data scientists introduced the tidy data standard to help solve the problem of inconsistent data structures.
It has two main advantages:

“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.” - Wickham and Grolemund, R for Data Science, 2017.
Tidy Data Standard
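For data to be tidy, it must satisfy the following three rules:

- Each observation is a row.
- Each variable is a column.
- Each type of observational unit is a table.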

- Load the .csv file using pd.read_csv.
- Check the dimensions using shape.

    (264, 61)

| | Country Name | Country Code | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -3.685029 | 3.446055 | -1.369863 | 4.198232 | 0.300000 | 5.700001 | 2.100000 | 1.999999 | NaN | NaN |
| 1 | Afghanistan | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 14.362441 | 0.426355 | 12.752287 | 5.600745 | 2.724543 | 1.451315 | 2.260314 | 2.647003 | 1.189228 | 3.911603 |
| 2 | Angola | AGO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.403933 | 3.471976 | 8.542188 | 4.954545 | 4.822628 | 0.943572 | -2.580050 | -0.147213 | -2.003630 | -0.624644 |
| 3 | Albania | ALB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.706892 | 2.545322 | 1.417526 | 1.001987 | 1.774487 | 2.218752 | 3.314805 | 3.802197 | 4.071301 | 2.240070 |
| 4 | Andorra | AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -1.974958 | -0.008070 | -4.974444 | -3.547597 | 2.504466 | 1.434140 | 3.709678 | 0.346072 | 1.588765 | 1.849238 |

5 rows × 61 columns

| Semantics | | Structure | |
|---|---|---|---|
| Observations: | Annual records | Rows: | Countries |
| Variables: | GDP growth and year | Columns: | Value of year |
| Observational units: | Countries | Tables: | Just one |
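
The data can be tidied in four steps:

- Set Country Name as the index using set_index.
- Remove the Country Code column using drop.
- Reshape the year columns into rows using melt.
- Sort the rows using sort_values.

| Country Name | year | growth_pct |
|---|---|---|
| Afghanistan | 1961 | NaN |
| Albania | 1961 | NaN |
| Algeria | 1961 | -13.605441 |
| American Samoa | 1961 | NaN |
| Andorra | 1961 | NaN |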
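
One way the set_index, drop, melt, and sort_values chain might look is sketched below on a three-country, two-year excerpt of the wide table (values copied from the rows shown earlier). The arguments are assumptions chosen to match the output above: the sort keys, the conversion of year to integers, and ignore_index=False (available in pandas 1.1 and later) are not confirmed by these notes.

```python
import numpy as np
import pandas as pd

# A small excerpt of the wide GDP growth table.
gdp = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan", "Angola"],
    "Country Code": ["ABW", "AFG", "AGO"],
    "2018": [np.nan, 1.189228, -2.003630],
    "2019": [np.nan, 3.911603, -0.624644],
})

tidy = (
    gdp.set_index("Country Name")      # countries become the row labels
       .drop(columns="Country Code")   # keep only the year columns
       # One row per (country, year); ignore_index=False keeps the index.
       .melt(var_name="year", value_name="growth_pct", ignore_index=False)
       .astype({"year": int})          # the column labels were strings
       .sort_values(["year", "Country Name"])  # assumed sort order
)

print(tidy)
```

Each row of the result is now one annual record for one country, so rows match observations and columns match variables.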

| Semantics | | Structure | |
|---|---|---|---|
| Observations: | Annual records | Rows: | Annual records |
| Variables: | GDP growth and year | Columns: | GDP growth and year |
| Observational units: | Countries | Tables: | Just one |

numpy and pandas