R Basics II - Atomic Data Structures

This post contains notes for Chapter 3 of my course series Data Science with R covering atomic data structures. Specifically, we discuss scalars, vectors and matrices, how to construct them, manipulate them, and perform computations with them. For efficiency I avoid discussion of the underlying mathematics of vector and matrix computations and direct any interested readers towards my course and relevant notes on Linear Algebra.

Scalars

A scalar data form is an object holding one value. For example:

x <- 16
y <- "scalar"
z <- FALSE

We often need to use different data structures that contain multiple values of different dimensions of different data types. R can handle data types including: (1) Vectors; (2) Matrices; (3) Dataframes; and (4) Lists. In this note we will cover the atomic data types of vectors and matrices, specifically how we define them and how we perform computations with them.

Vectors

A vector data form is an object holding multiple values of the same datatype, i.e. all numeric or all strings. We call objects with this property atomic.

There are multiple functions for constructing vectors in R including c(), seq(), :, and rep().

`c()` Function

The simplest way to define a vector in R is using the c() function where you can specify each element of the vector within the parenthesis:

# Numeric vector
num_vec <- c(4,1,6,8,4,2,5)
num_vec

[1] 4 1 6 8 4 2 5

# Character string vector
char_vec <- c("dog", "cat", "goose", "monkey", "elephant")
char_vec

[1] "dog"      "cat"      "goose"    "monkey"   "elephant"

# Boolean vector
bool_vec <- c("TRUE", "FALSE", "FALSE", "FALSE", "TRUE")
bool_vec

[1] "TRUE"  "FALSE" "FALSE" "FALSE" "TRUE"

`rep()` Function

The rep() function is a simple way to construct vectors where a single value is repeated. The arguments for the function rep(x, times) where x specifies the scalar / vector you with to repeat and times specifies the number of times you want it repeated:

# rep() function
rep(x = 7, times = 14)

 [1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7

`seq()` Function

The seq() function is used specifically to construct numeric sequences. The syntax for seq(from, to, by, length.out) has arguments: (1) from specifying the starting values; (2) to specifying the final value; (3) by specifying the step size; and (4) length.out specifying the length of the resulting vector. Note that we cannot include arguments (3) and (4) at the same time!

# seq() function with by
seq(from = 1, to = 10, by = 2)

[1] 1 3 5 7 9

# seq() function with length.out
seq(from = 1, to = 10, length.out = 9)

[1]  1.000  2.125  3.250  4.375  5.500  6.625  7.750  8.875 10.000

A quick alternative if we wish to make a vector of increasing integers is simply to use the : syntax:

# integers from -3 to 4
-3:4

[1] -3 -2 -1  0  1  2  3  4

Combining Methods

Any of these functions can be nested within one another to produce more complicated vectors.

# c() with seq() and rep()
c(
    rep(4, 5),
    seq(0, 10, by = 1)
)

 [1]  4  4  4  4  4  0  1  2  3  4  5  6  7  8  9 10

Exercise 1

Use any combinations of the above functions to reproduce the following vectors and save them to the corresponding objects.

 [1] 1.00 1.25 1.50 1.75 2.00 1.00 9.00 1.00 9.00 1.00 9.00

 [1] 1 2 3 4 4 3 2 1 0 0

Important Vector Functions

When working with vectors we will often use several key base functions to ascertain important properties and manipulate vector objects. The first of these key functions is length() which returns the number of elements of a vector:

length(num_vec)

[1] 7

Next, it is often useful to know what datatype the vector contains, something that we can determine with the typeof() function:

# confirm the data type of the character strring vector
typeof(char_vec)

[1] "character"

Finally, we often wish to sort vectors into some kind of order (e.g. alphabetical, decreasing). To accomplish this we can use the sort() function:

# sort numerical vector in increasing order
sort(num_vec, decreasing = FALSE)

[1] 1 2 4 4 5 6 8

Note that here we have specified decreasing = FALSE in the function. This is what is known as an argument and they are what allow to control more sophisticated functions. With any function we can see details of what arguments we can specify by using the help() or ? functions.

Vector Indexing

Now suppose we wish to retrieve only specific elements of a vector. We do so by indexing whereby we select which elements to return inside braces []. To return specific elements we simply note the numerical index of the element (i.e. 3 for the third element):

# return the 3rd element of the boolean vector
bool_vec[3]

[1] "FALSE"

# return the 3rd, 4th and 5th element of the numerical vector
num_vec[c(3,4,5)]

[1] 6 8 4

Note above that to return multiple elements we need to specify our index as a vector. To return every element besides specific elements we use negative indices:

# return everything but the 3rd element of the boolean vector
bool_vec[-3]

[1] "TRUE"  "FALSE" "FALSE" "TRUE"

# return everything but the 3rd, 4th and 5th element of the numerical vector
num_vec[-c(3,4,5)]

[1] 4 1 2 5

Vector Filtering

We can also index vectors using logical statements, that is return elements of a vector that satisfy some logical condition, through a process known as filtering:

# filter the numerical vector to return only numbers greater than or equal to 4
num_vec[num_vec >= 4]

[1] 4 6 8 4 5

Vector Computations

One of the strengths of R as a programming language is its efficiency with performing vector computations. There are a variety of vector computations we can perform, starting with muliplication by a scalar which is applied element-wise:

# vector times a scalar
2 * c(1,2,3,4)

[1] 2 4 6 8

We can also add a scalar and a vector which is also applied element-wise:

2 + c(1,2,3,4)

[1] 3 4 5 6

If we wish to add or multiply two vectors we must ensure they are of the same length:

# add two vectors
c(1,2,3) + c(10,12,14)

[1] 11 14 17

# multiply two vectors
c(1,2,3) * c(10,12,14)

[1] 10 24 42

Recycling

You might ask yourself what happens if we add together two vectors of different lengths?

c(1,2) + c(10,20,30,40)

[1] 11 22 31 42

From this example you might have discerned that R did not return an error message but instead recycled values of the smaller vector (looping from start to finish) to perform the computation.

Coercion

Vector coercion is when R changes the data type in a vector when it is presented with a situation when you are trying to combine different data types. The default order of priority taken by R is:

Character Strings;
Numeric; and
Boolean.

# Coercion from boolean to numeric
c(TRUE, FALSE, 1,2,3,1)

[1] 1 0 1 2 3 1

# Coercion from boolen and numeric to character strings
c("Dog", TRUE, TRUE, 5, 10)

[1] "Dog"  "TRUE" "TRUE" "5"    "10"

Matrices

Matrices are essentially 2 dimensional vectors and so again the mathematics of matrix computations is given by Linear Algebra. For example, we define a matrix \(A\) with \(4\) columns and \(3\) rows by

\[ A := \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \\ a_{4,1} & a_{4,2} & a_{4,3} \end{bmatrix}, \]

where \(a_{i,j}\in\mathbb{R}\) for all \(i,j\). We can see that vectors are simply a specific kind of matrix, i.e. a matrix with only one column.

Constructing Matrices

There are three functions we use to construct matrices:

matrix();
rbind();
cbind().

`matrix()`

The simplest syntax for the matrix() function is:

# matrix() syntax
matrix(data = NA, nrow = n, ncol = m, byrow = FALSE)

where:

data - is a vector of the matrix elements;
nrow - specifies the number of rows of the matrix;
ncol - specifies the number of columns of the matrix;
byrow - is a boolean specifying if the matrix should be built row-wise.

For an example, suppose we wish to construct a 4x4 matrix consisting of the integers from 1 to 16 constructed row-wise. Written mathematically, we wish to construct \[ A := \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix}. \]

To construct this matrix in R we can write:

# matrix() function
A <- matrix(1:16, nrow = 4, ncol = 4, byrow = TRUE)
A

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

`rbind()` and `cbind()`

Sometimes we might have several vectors that we with to combine into a matrix. To combine vectors row-wise we use rbind() and to combine vectors column-wise we use cbind().

# rbind() function
rbind(1:4, 5:8, 9:12, 13:16)

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

cbind(1:4, 5:8, 9:12, 13:16)

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Matrix Computations

Atomic data structures are not flexible but this rigid structure allows them to be easily manipulated. The mathematical background for arithmetic with matrices is studied in Linear Algebra.

We can add to matrices together if they have the same number of columns and rows (just rows for vectors) in which case we say the matrices have the same dimension. In this case, the sum of two matrices is simply computed element-wise.

B <- matrix(
    seq(from=2, length.out=16, by=2), 
    nrow = 4, ncol = 4, byrow = TRUE
    )
B

     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   10   12   14   16
[3,]   18   20   22   24
[4,]   26   28   30   32

A + B

     [,1] [,2] [,3] [,4]
[1,]    3    6    9   12
[2,]   15   18   21   24
[3,]   27   30   33   36
[4,]   39   42   45   48

Similarly, scalar multiplication is computed element-wise.

7 * A

     [,1] [,2] [,3] [,4]
[1,]    7   14   21   28
[2,]   35   42   49   56
[3,]   63   70   77   84
[4,]   91   98  105  112

When computing the cross-product of matrices \(A\times B\) we require that the number of rows of \(A\) is equal to the number of columns of \(B\).

A * B

     [,1] [,2] [,3] [,4]
[1,]    2    8   18   32
[2,]   50   72   98  128
[3,]  162  200  242  288
[4,]  338  392  450  512

B * c(1,2,3)

Warning in B * c(1, 2, 3): longer object length is not a multiple of shorter
object length

     [,1] [,2] [,3] [,4]
[1,]    2    8   18    8
[2,]   20   36   14   32
[3,]   54   20   44   72
[4,]   26   56   90   32

B * c(1,2,3,4)

     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   20   24   28   32
[3,]   54   60   66   72
[4,]  104  112  120  128

Matrix Indexing

When we wish to retrieve a specific element or elements of a data structure we need to perform indexing which in R. Considering the matrix \(A\) we defined earlier. If we wished to select the 5th element of the matrix we can use the brace [] syntax and write:

A[5]

[1] 2

Now, supposing we wished to select from the 2nd to the 8th element of the matrix, we can use the same syntax but this time specifying a vector of the index values in the braces:

A[2:8]

[1]  5  9 13  2  6 10 14

Often we might be more interested in selecting specific rows or columns of the matrix. To select a column we use the same syntax but this time with a comma A[i,j] where i specifies the row number of j specifies the column number. If we with to select the entire row we simply leave the value of j blank and vice versa:

# 1st row
A[1,]

[1] 1 2 3 4

# 4th column
A[,4]

[1]  4  8 12 16

We can select multiple rows / columns by using the same syntax as before and including a vector of the row / column index values you with to select:

# 2nd and 3rd rows
B[2:3, ]

     [,1] [,2] [,3] [,4]
[1,]   10   12   14   16
[2,]   18   20   22   24

# 2nd and 3rd rows + 1st and 4th columns
A[2:3, c(1,4)]

     [,1] [,2]
[1,]    5    8
[2,]    9   12

We can also chain index whereby we immediately index the output of our indexing:

A[2:3, c(1,4)][2:3]

[1] 9 8

Matrix Filtering

Now suppose we wish to only select elements of a matrix that specify a certain logical condition. We can again index but this time instead of specifying the index numbers we specify the condition we wish to be satisfied:

A[A > 10]

[1] 13 14 11 15 12 16

This is particularly helpful when we wish to edit our original matrix. For example, if we wish to change \(A\) such that all elements \(\leq 10\) are set to zero we could write:

A_new <- A
A_new[A_new < 10] <- 0
A_new

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0   10   11   12
[4,]   13   14   15   16

Matrix Functions

There are also a collection of preloaded functions for performing computations and handling vectors and matrices such as:

dim() - returns the dimensions of a matrix (i.e. number of rows and columns);
det() - computes the determinant of a matrix;
solve() - computes the inverse of a matrix.

# return the dimensions of A
dim(A)

[1] 4 4

# compute the determinant of A
det(A)

[1] 4.733165e-30

For more information on matrix determinants and inverses please see my course notes on Linear Algebra

← Previous Chapter

Next Chapter →

Scalars

Vectors

c() Function

rep() Function

seq() Function

Combining Methods

Important Vector Functions

Vector Indexing

Vector Filtering

Vector Computations

Recycling

Coercion

Matrices

Constructing Matrices

matrix()

rbind() and cbind()

Matrix Computations

Matrix Indexing

Matrix Filtering

Matrix Functions

`c()` Function

`rep()` Function

`seq()` Function

`matrix()`

`rbind()` and `cbind()`