Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
This post contains notes for Chapter 4 of my course series Data Science with R, covering dataframes and lists. Specifically, we will look at R's base functionality for dataframes and lists, including how to construct and manipulate these objects. Although it is often preferable to manipulate dataframes using the tidyverse packages (something we cover in a future lesson), this note focusses first on detailing the base functionality.
Non-Atomic Data Structures
So far we have introduced a variety of atomic data structures (e.g. scalars, vectors, matrices and arrays), each of which is designed for ease of computation and interaction by being restricted to a single datatype. However, as any scientist who has ever collected data will tell you, we often need objects that contain multiple datatypes, namely:
- Dataframes; and
- Lists.
By containing multiple datatypes these objects sacrifice some computational tractability but gain immense flexibility for data storage.
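To make the trade-off concrete, here is a minimal sketch comparing an atomic vector with a list built from the same mixed inputs. The object names mixed_vector and mixed_list are illustrative and do not appear elsewhere in these notes.

```r
# An atomic vector holds a single datatype, so mixed inputs are
# silently coerced to a common type (here: character)
mixed_vector <- c(1, TRUE, "alpha")
class(mixed_vector)

# A list keeps the original datatype of every element
mixed_list <- list(1, TRUE, "alpha")
sapply(mixed_list, class)
```

The vector reports a single class for all of its values, whereas the list reports one class per element.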
Dataframes
In R a dataframe can be intuitively pictured as a table of data in which every column holds a single datatype, although different columns may hold different datatypes. For example, the table at the top of this post shows the first 6 rows of the pre-loaded dataframe iris, which contains 4 numeric columns and one factor column.
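The preview of iris and the column types described above can be checked directly in an R session. This is a small sketch using only base functions; iris ships with the datasets package that R loads by default.

```r
# First 6 rows of the built-in iris dataframe (6 is the default for head())
iris_preview <- head(iris)
iris_preview

# Datatype of each column: four numeric columns and one factor column
iris_classes <- sapply(iris, class)
iris_classes
```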
Constructing Dataframes
To construct a dataframe in R we can use the data.frame() function:
data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, fix.empty.names = TRUE,
stringsAsFactors = FALSE)
This function has multiple arguments, but the most important are ..., row.names and stringsAsFactors. Let's go through a few examples of constructing dataframes to build up our understanding.
Let's start by building a dataframe with three columns: the first containing numbers, the second containing boolean objects and the third containing character strings. To do so we specify each column's values in a vector, ensuring all three vectors are of the same length to avoid an error message:
data <- data.frame(
  c(1,2,3),
  c(TRUE, TRUE, FALSE),
  c("alpha", "beta", "gamma")
)
data
  c.1..2..3. c.TRUE..TRUE..FALSE. c..alpha....beta....gamma..
1          1                 TRUE                       alpha
2          2                 TRUE                        beta
3          3                FALSE                       gamma
Looking at our printout, we note that we have successfully created a dataframe with:
- Numbered rows from 1 to 3; and
- Generated column names based on the vector inputs.
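The requirement that the input vectors share a length can be illustrated with a short sketch. The two-element logical vector below is a hypothetical input chosen to trigger the error, and tryCatch() is used only so that the message can be captured without stopping the script.

```r
# Hypothetical mismatch: three numbers but only two booleans
result <- tryCatch(
  data.frame(c(1, 2, 3), c(TRUE, FALSE)),
  error = function(e) conditionMessage(e)
)

# result holds the error message reported by data.frame()
result
```

Note that data.frame() does recycle shorter vectors when their length divides the longest one evenly (for example a single value), so the error only appears for lengths that do not divide evenly.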
We can specify the names of the columns of the dataframe by slightly tweaking our code:
data <- data.frame(
  numbers = c(1,2,3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma")
)
data
  numbers booleans strings
1       1     TRUE   alpha
2       2     TRUE    beta
3       3    FALSE   gamma
What if we want to name the rows? In this case we can use the row.names argument:
data <- data.frame(
  numbers = c(1,2,3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma"),
  row.names = c("first", "second", "third")
)
data
       numbers booleans strings
first        1     TRUE   alpha
second       2     TRUE    beta
third        3    FALSE   gamma
Finally, we can use the stringsAsFactors argument to specify that we wish to convert the strings column into a factor column:
data <- data.frame(
  numbers = c(1,2,3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma"),
  row.names = c("first", "second", "third"),
  stringsAsFactors = TRUE
)
data
       numbers booleans strings
first        1     TRUE   alpha
second       2     TRUE    beta
third        3    FALSE   gamma
This does not change the dataframe visually, but we will see in the following section that it has changed the datatype of the strings column.
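The change of datatype can also be confirmed without printing the dataframe. The sketch below builds two small dataframes, data_chr and data_fct (illustrative names, not used elsewhere in these notes), and compares the class of the strings column. stringsAsFactors is set explicitly in both calls so that the result does not depend on the R version's default.

```r
# Same column, stored as plain character strings
data_chr <- data.frame(
  strings = c("alpha", "beta", "gamma"),
  stringsAsFactors = FALSE
)

# Same column, converted to a factor
data_fct <- data.frame(
  strings = c("alpha", "beta", "gamma"),
  stringsAsFactors = TRUE
)

class(data_chr$strings)   # "character"
class(data_fct$strings)   # "factor"
levels(data_fct$strings)  # the distinct values, sorted alphabetically
```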
Indexing Dataframes
Dataframe indexing behaves a lot like matrix indexing, as discussed in R Basics II (Atomic Data Structures). We can use numerical indexing with dataframes by specifying the row number \(i\) and column number \(j\) of the elements we wish to return inside square brackets []:
# return the element in the 2nd row and 3rd column
data[2,3]
[1] beta
Levels: alpha beta gamma
# return column 2
data[,2]
[1]  TRUE  TRUE FALSE
# return the second and third rows for columns 1 and 3
data[2:3, c(1,3)]
       numbers strings
second       2    beta
third        3   gamma
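In the examples above, selecting a single column returns a plain vector while selecting several columns returns a dataframe. The sketch below rebuilds the same data object and uses the drop argument of the [ operator to keep the dataframe structure for a single column. The names col_vec and col_df are illustrative.

```r
data <- data.frame(
  numbers = c(1, 2, 3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma"),
  row.names = c("first", "second", "third"),
  stringsAsFactors = TRUE
)

# Default behaviour: a single column is simplified to a vector
col_vec <- data[, 2]
class(col_vec)   # "logical"

# drop = FALSE keeps the result as a one-column dataframe
col_df <- data[, 2, drop = FALSE]
class(col_df)    # "data.frame"
```

Keeping the dataframe structure can be useful when later code expects row and column names to be present.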
Alternatively, we can specify the names of the columns and rows as quoted strings:
# return the numbers column
data["numbers"]
       numbers
first        1
second       2
third        3
# return the numbers column for the second row
data["second", "numbers"]
[1] 2
A quick syntax for returning a specific column is $:
# select the booleans column
data$booleans
[1]  TRUE  TRUE FALSE
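The $ shortcut needs the column name typed literally. When the name is stored in a variable, the double-bracket operator [[ ]] returns the same vector. The sketch below rebuilds a small version of data, and col_name is a hypothetical variable introduced for illustration.

```r
data <- data.frame(
  numbers = c(1, 2, 3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma")
)

# Three equivalent ways of extracting the booleans column as a vector
by_dollar <- data$booleans
by_name   <- data[["booleans"]]

col_name <- "booleans"        # hypothetical variable holding the name
by_var   <- data[[col_name]]

identical(by_dollar, by_name)
identical(by_dollar, by_var)
```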
As with atomic data structures, we can also filter values using logical statements, but it is important to keep in mind the datatype of each column being compared:
# return the rows that have numbers == 2
data[data$numbers == 2,]
       numbers booleans strings
second       2     TRUE    beta
# return the rows that have numbers == 2 or booleans == TRUE
data[data$numbers == 2 | data$booleans == TRUE,]
       numbers booleans strings
first        1     TRUE   alpha
second       2     TRUE    beta
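As a sketch of why datatypes matter when filtering, the example below rebuilds data with a factor column and filters on it. Comparing a factor against one of its labels works as expected, but converting the factor to integers returns its internal level codes rather than the labels. The names beta_rows, level_codes and both_rows are illustrative.

```r
data <- data.frame(
  numbers = c(1, 2, 3),
  booleans = c(TRUE, TRUE, FALSE),
  strings = c("alpha", "beta", "gamma"),
  row.names = c("first", "second", "third"),
  stringsAsFactors = TRUE
)

# Filtering on a factor column by comparing against a label
beta_rows <- data[data$strings == "beta", ]

# A factor is stored as integer codes, one per level
level_codes <- as.integer(data$strings)

# A logical column can be used as the filter directly, and
# conditions can be combined with & (and) as well as | (or)
both_rows <- data[data$numbers > 1 & data$booleans, ]
```

Here beta_rows and both_rows each contain only the row named second.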
Lists
Lists are the most flexible data storage tool, allowing us to store scalars, vectors, matrices, arrays, dataframes or even other lists, accessed by index or by name. We have already created a dataframe object data, and in previous posts we used the following code to create scalars, vectors and matrices:
y <- "scalar"
char_vec <- c("dog", "cat", "goose", "monkey", "elephant")
B <- matrix(
  seq(from=2, length.out=16, by=2),
  nrow = 4, ncol = 4, byrow = TRUE
)
To create a list of these objects called object_list we naturally use the list() function:
object_list <- list(
  scalar = y,
  vector = char_vec,
  matrix = B,
  dataframe = data
)
This creates a list with 4 elements, which we can access using double square brackets [[]], either using the specified name or the corresponding numerical index:
# return the vector from the list
object_list[["vector"]]
[1] "dog"      "cat"      "goose"    "monkey"   "elephant"
# return the third element from the list (the matrix)
object_list[[3]]
     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   10   12   14   16
[3,]   18   20   22   24
[4,]   26   28   30   32
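Single and double square brackets behave differently on lists, which is a common source of confusion. The sketch below uses a smaller stand-in list, small_list (an illustrative name), to show that [ ] returns a sub-list while [[ ]] and $ return the element itself.

```r
small_list <- list(
  scalar = "scalar",
  vector = c("dog", "cat", "goose"),
  matrix = matrix(1:4, nrow = 2)
)

# Single brackets return a list containing the selected element
sub_list <- small_list["vector"]
class(sub_list)    # "list"

# Double brackets return the element itself
element <- small_list[["vector"]]
class(element)     # "character"

# $ is shorthand for [[ ]] with a literal name
identical(small_list$vector, element)

# Indexing can be chained to reach inside an element
small_list[["matrix"]][2, 1]
```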
We typically use lists only for convenient data storage: because their elements can be of any type and size, the vectorised arithmetic available for atomic data structures cannot be applied to a list directly.
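As a small qualification to the point above, a list as a whole does not support arithmetic, but base R's lapply() and sapply() apply a function to each element in turn. The sketch below uses a hypothetical list of numeric vectors, number_list, introduced only for illustration.

```r
number_list <- list(a = c(1, 2, 3), b = c(10, 20))

# Arithmetic on the list itself fails, so we capture the error
attempt <- tryCatch(number_list + 1, error = function(e) "error")

# lapply() applies a function to every element and returns a list
doubled <- lapply(number_list, function(x) x * 2)

# sapply() does the same but simplifies the result to a vector
sums <- sapply(number_list, sum)
sums
```

This works when every element supports the function being applied, which is not guaranteed for a list holding mixed objects such as object_list.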