Data Structure

Python data structures, NumPy arrays, Pandas DataFrames, wide and long formats, and tidy data principles.
Author

John Robin Inston

Published

May 6, 2026

Modified

May 6, 2026

Important: Learning Objectives
  • Understand Python’s built-in data types and the limitations of native data structures.
  • Work with NumPy arrays and Pandas DataFrames.
  • Load tabular data from .csv and .xlsx files.
  • Distinguish between wide and long data formats and convert between them using pd.melt and pd.pivot.
  • Apply the tidy data standard to assess and clean real-world datasets.

Data in Python

Python Variables

Python is a general-purpose programming language. It natively stores data in variables of different types:

  • Strings (str) — text data.
  • Integers (int) — whole numbers.
  • Floating point numbers (float) — decimal numbers.
  • Booleans (bool) — true/false values.
  • Lists (list) — ordered collections of data.
  • Dictionaries (dict) — key-value pairs.
  • Tuples (tuple) — ordered, immutable collections of data.
  • Sets (set) — unordered collections of unique data.

Packages such as pandas and numpy provide additional data types optimised for data analysis:

  • Pandas DataFrames (pd.DataFrame) — tabular data.
  • NumPy arrays (np.ndarray) — numerical data.

Built-in Data Types

my_string = "Hello, World!"
my_integer = 10
my_float = 3.14
my_boolean = True
my_list = [1, 2, 3, 4, 5]
my_tuple = (1, 2, 3, 4, 5)
my_set = {1, 2, 3, 4, 5}
my_dict = {"name": "John", "age": 30}

item_list = [my_string, my_integer, my_float, my_boolean,
             my_list, my_tuple, my_set, my_dict]

for item in item_list:
    print(type(item), ":", item)
<class 'str'> : Hello, World!
<class 'int'> : 10
<class 'float'> : 3.14
<class 'bool'> : True
<class 'list'> : [1, 2, 3, 4, 5]
<class 'tuple'> : (1, 2, 3, 4, 5)
<class 'set'> : {1, 2, 3, 4, 5}
<class 'dict'> : {'name': 'John', 'age': 30}

Built-in data types have limitations for data analysis:

  • Scalability — inefficient memory usage at scale.
  • Functionality — no support for vector operations or tabular manipulation.

For example, adding a scalar to a list raises a TypeError:

my_list + 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[47], line 1
----> 1 my_list + 2

TypeError: can only concatenate list (not "int") to list

NumPy Arrays

NumPy arrays are \(n\)-dimensional arrays of homogeneous data. They support:

  • Vectorized operations — element-wise arithmetic without explicit loops.
  • Efficient memory usage — stored as contiguous blocks of typed data.

Make sure you are familiar with the basics of NumPy arrays: construction (array, linspace, full), indexing and slicing, and broadcasting.

import numpy as np

list_array = np.array([1, 2, 3, 4, 5])
linspace_array = np.linspace(0, 1, 5)

print(list_array)
print(linspace_array)

print(list_array[:3])
print(linspace_array[list_array > 2])
[1 2 3 4 5]
[0.   0.25 0.5  0.75 1.  ]
[1 2 3]
[0.5  0.75 1.  ]
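In contrast to the TypeError raised by the list above, arithmetic on arrays is vectorized, and broadcasting extends this to arrays of compatible shapes. A minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Element-wise arithmetic -- no explicit loop needed
shifted = arr + 2          # adds 2 to every element
scaled = arr * 0.5         # halves every element

# Broadcasting: a (3, 1) column combines with a (3,) row to give (3, 3)
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row           # row is added to each row of col

print(shifted)
print(grid)
```

This is exactly the operation (`my_list + 2`) that failed for the built-in list.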

Pandas DataFrames

What is Pandas?

Pandas is a library for data manipulation and analysis built on top of NumPy. It provides data structures and functions for working with tabular data.

Pandas DataFrames are 2-dimensional tabular data structures with labeled rows and columns. They are similar to NumPy arrays but with labeled axes, and similar to dictionaries but with labeled columns.

import pandas as pd

data = pd.DataFrame({
    "name": ["John", "Jane", "Jim", "Jill"],
    "age": [20, 21, 22, 23],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"]
})

print(data)
   name  age         city
0  John   20     New York
1  Jane   21  Los Angeles
2   Jim   22      Chicago
3  Jill   23      Houston

Loading Data Files

Data Formats

In PSTAT100 we primarily use tabular data stored in the following formats:

  • Text files (.txt) — the simplest format, but the delimiter and encoding often have to be specified by hand when loading.
  • Comma Separated Values (.csv) — most common format.
  • Tab Separated Values (.tsv) — similar to CSV but uses tabs as delimiters.
  • Excel files (.xlsx) — spreadsheets with complex formulas and formatting.
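pd.read_csv handles all of the plain-text formats above via its sep argument; tab-separated files just use sep="\t". A minimal sketch with a small in-memory file standing in for a .tsv on disk (the data here is made up):

```python
import io
import pandas as pd

# An in-memory tab-separated file (hypothetical data)
tsv_text = "name\tage\ncat\t3\ndog\t5\n"

# pd.read_csv reads any delimited text; pass sep="\t" for tabs
pets = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(pets)
```

For a real file you would pass the path (e.g. "data/pets.tsv") instead of the StringIO object.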

[Figure: example of a CSV file]

Loading .csv Files

.csv files store tabular data as plain text with comma-separated values. Use pd.read_csv to load them.

cats_data = pd.read_csv("data/cats.csv")
print(cats_data.head())
              Breed  Age (Years)  Weight (kg)          Color  Gender
0      Russian Blue           19            7  Tortoiseshell  Female
1  Norwegian Forest           19            9  Tortoiseshell  Female
2         Chartreux            3            3          Brown  Female
3           Persian           13            6          Sable  Female
4           Ragdoll           10            8          Tabby    Male

Loading .xlsx Files

.xlsx files are Excel spreadsheets. Use pd.read_excel to load them, specifying the sheet name where needed.

sales_data = pd.read_excel(
    "data/office_sales.xlsx",
    sheet_name="SalesOrders")
print(sales_data.head())
   OrderDate   Region      Rep    Item  Units  Unit Cost   Total
0 2024-01-06     East    Jones  Pencil     95       1.99  189.05
1 2024-01-23  Central   Kivell  Binder     50      19.99  999.50
2 2024-02-09  Central  Jardine  Pencil     36       4.99  179.64
3 2024-02-26  Central     Gill     Pen     27      19.99  539.73
4 2024-03-15     West  Sorvino  Pencil     56       2.99  167.44

DataFrame Basics

Manipulation and Indexing

There are many ways to manipulate DataFrames:

  • Finding information: columns, index, shape, head, tail.
  • Indexing and filtering: .loc, .iloc.
  • Adding and removing columns and rows: .drop, .dropna, .insert, pd.concat.
  • Sorting: .sort_values, .sort_index.
print(cats_data.columns)
print(cats_data.index)
print(cats_data.shape)
Index(['Breed', 'Age (Years)', 'Weight (kg)', 'Color', 'Gender'], dtype='str')
RangeIndex(start=0, stop=1000, step=1)
(1000, 5)
print(cats_data[["Breed", "Color"]].head())
print(cats_data.iloc[3:5])
              Breed          Color
0      Russian Blue  Tortoiseshell
1  Norwegian Forest  Tortoiseshell
2         Chartreux          Brown
3           Persian          Sable
4           Ragdoll          Tabby
     Breed  Age (Years)  Weight (kg)  Color  Gender
3  Persian           13            6  Sable  Female
4  Ragdoll           10            8  Tabby    Male
cats_data_dropped = cats_data.drop(columns=["Color"])
print(cats_data_dropped.columns)
Index(['Breed', 'Age (Years)', 'Weight (kg)', 'Gender'], dtype='str')
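The remaining methods from the list above (.loc filtering, .sort_values, pd.concat) can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Jim"],
    "age": [20, 21, 22]
})

# Label-based indexing with .loc: rows where age > 20, only the name column
older = df.loc[df["age"] > 20, "name"]

# Sorting by a column's values
by_age_desc = df.sort_values("age", ascending=False)

# Stacking two DataFrames row-wise with pd.concat
more = pd.DataFrame({"name": ["Jill"], "age": [23]})
combined = pd.concat([df, more], ignore_index=True)

print(older.tolist())
print(combined.shape)
```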

Data Structure

Wide and Long Formats

The same information can often be represented in datasets with different structures. These structures can be broadly categorised as:

Wide data structure:

  • One row per subject.
  • Repeated measurements are in separate columns.

Long data structure:

  • One row per measurement.
  • A separate column identifies which variable was measured.

Wide vs Long Data Structures

Wide Data Structure

To build intuition, consider a dataset of points, assists, and rebounds for four basketball teams (adapted from Luke Bennett).

basketball_data = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [88, 91, 99, 94],
    "assists": [12, 17, 24, 28],
    "rebounds": [22, 28, 30, 31]
})
basketball_data.set_index("team")
Basketball Data — Wide Format

      points  assists  rebounds
team
A         88       12        22
B         91       17        28
C         99       24        30
D         94       28        31

Each observation (team) occupies its own row, and each measurement type is in a separate column — this is the wide format.

Long Data Structure

The same data in a long format places each individual measurement on its own row, with a column identifying which variable was measured.

Basketball Data — Long Format

     statistic  value
team
A       points     88
A      assists     12
A     rebounds     22
B       points     91
B      assists     17
B     rebounds     28
  • Each measurement has its own row (value).
  • A column identifies which variable was measured (statistic).

Wide vs Long — Which Is Better?

Neither format is universally better; the appropriate choice depends on context:

  • Wide data is more human-readable and common in spreadsheets.
  • Long data is more computer-readable:
    • Required by most plotting libraries (e.g. seaborn in Python, ggplot2 in R).
    • Easier to filter, group, and aggregate in pandas.

It is therefore important to be able to convert between the two formats.

pd.melt — Wide → Long

pd.melt converts a wide DataFrame to a long DataFrame. Its key arguments are:

  • id_vars — columns that identify the subject (kept as-is).
  • var_name — name for the column that will hold the former column names.
  • value_name — name for the column that will hold the values.
basketball_data_long = basketball_data.melt(
    id_vars=["team"],
    var_name="statistic",
    value_name="value"
)
print(basketball_data_long)
   team statistic  value
0     A    points     88
1     B    points     91
2     C    points     99
3     D    points     94
4     A   assists     12
5     B   assists     17
6     C   assists     24
7     D   assists     28
8     A  rebounds     22
9     B  rebounds     28
10    C  rebounds     30
11    D  rebounds     31

pd.pivot — Long → Wide

pd.pivot converts a long DataFrame back to a wide DataFrame. Its key arguments are:

  • index — column to use as row identifiers.
  • columns — column whose values become the new column names.
  • values — column whose values fill the cells.
basketball_data_wide = basketball_data_long.pivot(
    index="team",
    columns="statistic",
    values="value"
)
print(basketball_data_wide)
statistic  assists  points  rebounds
team                                
A               12      88        22
B               17      91        28
C               24      99        30
D               28      94        31
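Note that pivot returns team as the index and names the column axis statistic, so the result is not byte-for-byte identical to the original wide frame. One sketch of a full round-trip back to a flat wide DataFrame:

```python
import pandas as pd

basketball_data = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [88, 91, 99, 94],
    "assists": [12, 17, 24, 28],
    "rebounds": [22, 28, 30, 31]
})

# Wide -> long
long_df = basketball_data.melt(
    id_vars=["team"], var_name="statistic", value_name="value"
)

# Long -> wide, then flatten: reset_index moves "team" back to a column,
# and rename_axis clears the leftover "statistic" column-axis name
wide_df = (
    long_df.pivot(index="team", columns="statistic", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(wide_df)
```

The columns come back in alphabetical order (assists, points, rebounds) rather than the original order, since pivot sorts the new column labels.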

The Lack of a Standard Format

Most data is stored in whatever layout made intuitive sense to its creator. The result is idiosyncratic and unprincipled: there are few widely used conventions and a great deal of variability in practice.

This creates two interdependent choices for data scientists:

  • Representation: how to encode information (e.g. dates as one variable or three? Ordinal values as numbers or labels?).
  • Form: how to display information (wide or long? one table or many?).

The tidy data standard was introduced to resolve this inconsistency by providing a principled set of rules for structuring datasets.

Tidy Data

Tidy Data Principles

“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.” — Wickham and Grolemund, R for Data Science, 2017.

A dataset has both:

  • Semantics: the meaning of each value.
  • Structure: how values are arranged.

The tidy standard aligns data semantics with data structure.

The Tidy Data Standard

Important: Tidy Data Standard

For data to be tidy, it must satisfy three rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observational unit forms a table.

Example 1 — GDP Growth Data

Consider World Bank data on annual GDP growth:

gdp = pd.read_csv(
    "data/annual_growth.csv",
    encoding="latin1"
)
print(gdp.shape)
gdp.head()
(264, 61)
  Country Name Country Code  1961  1962  1963  1964  1965  1966  1967  1968  ...       2010      2011       2012      2013      2014      2015      2016      2017      2018      2019
0        Aruba          ABW   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...  -3.685029  3.446055  -1.369863  4.198232  0.300000  5.700001   2.100000  1.999999       NaN       NaN
1  Afghanistan          AFG   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...  14.362441  0.426355  12.752287  5.600745  2.724543  1.451315   2.260314  2.647003  1.189228  3.911603
2       Angola          AGO   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   4.403933  3.471976   8.542188  4.954545  4.822628  0.943572  -2.580050 -0.147213 -2.003630 -0.624644
3      Albania          ALB   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   3.706892  2.545322   1.417526  1.001987  1.774487  2.218752   3.314805  3.802197  4.071301  2.240070
4      Andorra          AND   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...  -1.974958 -0.008070  -4.974444 -3.547597  2.504466  1.434140   3.709678  0.346072  1.588765  1.849238

5 rows × 61 columns

To assess whether this is tidy, we compare semantics with structure:

Semantics                               Structure
Observations: Annual country records    Rows: Countries
Variables: GDP growth and year          Columns: Values of year
Observational units: Countries          Tables: Just one

❌ Rules 1 and 2 are violated — column names are values (years), not variables. This dataset is not tidy.

Making the GDP Data Tidy

We need to:

  1. Set the index to Country Name using set_index.
  2. Drop the superfluous Country Code column using drop.
  3. Melt the data so that year and growth_pct are variables using melt.
  4. Sort by year and country using sort_values.
gdp_tidy = gdp.set_index(
    "Country Name"
).drop(
    columns="Country Code"
).melt(
    var_name="year",
    value_name="growth_pct",
    ignore_index=False
).reset_index(
).sort_values(
    ["year", "Country Name"]
).set_index("Country Name")

gdp_tidy.head()
                year  growth_pct
Country Name
Afghanistan     1961         NaN
Albania         1961         NaN
Algeria         1961  -13.605441
American Samoa  1961         NaN
Andorra         1961         NaN

Now we can verify:

Semantics                               Structure
Observations: Annual country records    Rows: Annual country records
Variables: GDP growth and year          Columns: GDP growth and year
Observational units: Countries          Tables: Just one

✅ All three rules are satisfied. The data is now tidy.
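With the data tidy, the grouping and aggregation advantages mentioned earlier follow directly. A sketch on a small made-up frame with the same shape as gdp_tidy:

```python
import pandas as pd

# A tiny tidy frame: one variable per column, one observation per row
# (country names and growth values are made up)
tidy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [1961, 1961, 1962, 1962],
    "growth_pct": [2.0, 4.0, 1.0, 3.0]
})

# Because year is a column (not a column *name*), grouping is one line
mean_by_year = tidy.groupby("year")["growth_pct"].mean()
print(mean_by_year)
```

The same one-liner on the untidy wide layout would instead require averaging across 59 separate year columns.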
