Data types and data structures in R
In this post, I will outline the basic mathematical operations in R, the different data types you can define, and how to manipulate these data types. If you have not yet installed R, you can read the first post here, which outlines the installation process. As you read along, it is best to run the commands yourself to become familiar with them. So start up R and follow along.
Basic data types
R has six basic data types:
- character
- numeric (real or decimal)
- integer
- logical
- complex
- raw
Characters are specified by enclosing the text in double quotes (e.g., "ans" or "str"). Numerics are real numbers such as 2.0 or 4.5. You can specify an integer by affixing an L to the number, such as 2L or 23L, which tells R to store the number as an integer. Logicals can be TRUE or FALSE (case sensitive). The complex type represents complex numbers with real and imaginary parts, such as 4+5i. The raw type stores individual bytes.
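As a quick check of the types described above, the following sketch creates one value of each basic type and asks R to report its type with class(). The variable names are chosen here only for illustration.

```r
# One value of each basic type (variable names are illustrative only)
chr = "ans"         # character
num = 4.5           # numeric
int = 23L           # integer, note the L suffix
lgl = TRUE          # logical
cpx = 4+5i          # complex
rw  = as.raw(255)   # raw, a single byte

# class() reports the type of each value
types = c(class(chr), class(num), class(int), class(lgl), class(cpx), class(rw))
print(types)

# Without the L suffix, a whole number is still stored as numeric
print(class(2))
print(is.integer(2L))
```

Running this should show that 2 and 2L are stored differently even though they represent the same number.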
Basic numeric operations
You can use R as a simple calculator. The usual mathematical operations such as addition (+), subtraction (-), multiplication (*), and division (/) can be performed. You can also compute exponents (^), square roots (using the sqrt() function), and logarithms (using the log() function, which computes the natural logarithm by default), among others. Use parentheses ( ) to control the order of your computation; the operations inside the parentheses are evaluated first.
> 7+3
[1] 10
> 128 - 33
[1] 95
> 3*5
[1] 15
> 18/4
[1] 4.5
> 3^4
[1] 81
> 3^4+2
[1] 83
> 3^(4+2)
[1] 729
> (12+3)/(3+1.5)
[1] 3.333333
> sqrt(65)
[1] 8.062258
> log(33)
[1] 3.496508
Observe how the parentheses change the result of the two exponent examples. In 3^4+2, 3 is raised to the 4th power and then 2 is added, giving 83. In 3^(4+2), on the other hand, 3 is raised to the 6th power, giving 729.
In the examples above, R only prints the result of the operation in the console. If you want to save the result so that you can use it later, you need to assign it to a variable. A variable is a named value that can change throughout the session. A variable name can be made up of letters, digits, periods, and underscores, and it should begin with a letter, say x. To assign a value to a variable, you can use the single equal sign ( = ) or the arrow notation ( <- ).
> x = 5
> x
[1] 5
> x = x + 1
> x
[1] 6
> y = x * 10
> y
[1] 60
> ls()
[1] "x" "y"
In line 1, x is assigned a value of 5. To see its value, just type the variable name and press Enter. In line 4, x is assigned its current value plus 1, so its new value is 6. In line 7, another variable called y is assigned the product of the current value of x and 10. The command ls() lists all variables in the current session, as illustrated in lines 10 and 11.
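The session above uses only the equal sign. As a small sketch, the same steps can be written with the arrow notation, which assigns values in the same way:

```r
# Same assignments as above, written with the arrow notation
x <- 5
x <- x + 1     # x is now 6
y <- x * 10    # y is now 60

print(x)
print(y)
```

Both notations work for assignment at the console; many R users prefer the arrow because the equal sign is also used to pass arguments to functions.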
Vectors
If you want to process several numbers at once, for example, dividing them all by the same number, you could apply the division operator to one number at a time. But R provides a better way of doing this using vectors. A vector is basically an ordered collection of values of the same type, such as numbers. To create a vector, use the c() function with the vector elements enclosed in parentheses and separated by commas.
> vector = c(1,2,3,4,5)
> vector
[1] 1 2 3 4 5
> vector2 = c(vector, vector)
> vector2
[1] 1 2 3 4 5 1 2 3 4 5
> sq = 2:10
> sq
[1] 2 3 4 5 6 7 8 9 10
You can also create vectors from other vectors, as in line 4. If you want to generate a vector consisting of a sequence of numbers with an increment of 1, you can use the colon operator (:) as in line 7, which reads as "from 2 to 10 with increment equal to 1". For more general sequences, you can use the seq() function.
> sq = seq(from=20, to=50, by=3)
> sq
[1] 20 23 26 29 32 35 38 41 44 47 50
> sq = seq(from=50, to=20, by=-3)
> sq
[1] 50 47 44 41 38 35 32 29 26 23 20
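Two further behaviours of seq() are worth sketching here. The length.out argument, which is not used in the examples above, asks for a given number of evenly spaced values instead of a fixed increment. Also, when the increment does not land exactly on the end value, the sequence stops before it.

```r
# Five evenly spaced values from 0 to 1; R works out the increment (0.25)
sq = seq(from=0, to=1, length.out=5)
print(sq)

# 50 is not reachable from 20 in steps of 4, so the sequence stops at 48
sq2 = seq(from=20, to=50, by=4)
print(sq2)
```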
If you want to generate a vector with repeated elements, you can use the rep() function. Its arguments include x, the vector to repeat; times, the number of times to repeat x; and each, the number of times to repeat each element. When both times and each are given, each is applied first and the resulting vector is then repeated.
> rp = rep(x=1, times=10)
> rp
[1] 1 1 1 1 1 1 1 1 1 1
> rp = rep(x=c(1,5,10), times=3)
> rp
[1] 1 5 10 1 5 10 1 5 10
> rp = rep(x=c(2, 8, 12), times=2, each=3)
> rp
[1] 2 2 2 8 8 8 12 12 12 2 2 2 8 8 8 12 12 12
You can sort the elements of a vector using the sort() function. Its arguments are the vector x and decreasing, which indicates the direction of sorting. The value of decreasing can be either TRUE (case-sensitive) to sort from largest to smallest or FALSE to sort from smallest to largest.
> vc = c(1, 3, 10, 23, 55, 29, 0, -12, 22, -34)
> sort(x=vc)
[1] -34 -12 0 1 3 10 22 23 29 55
> sort(x=vc, decreasing=TRUE)
[1] 55 29 23 22 10 3 1 0 -12 -34
> sort(x=vc, decreasing=FALSE)
[1] -34 -12 0 1 3 10 22 23 29 55
When the decreasing argument is not specified, sort() sorts from smallest to largest by default (decreasing=FALSE), as can be seen in line 2. If you want to find the number of elements in a vector, you can use the length() function.
> length(x=c(3,1,23,45))
[1] 4
> vc = c(2, 1, 4, rep(x=1:3,times=3))
> length(x=vc)
[1] 12
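The motivation given at the start of this section, dividing several numbers by the same number, can be sketched as follows. Arithmetic operators and functions such as sqrt() work on every element of a vector at once. The values used are arbitrary.

```r
vc = c(10, 20, 30, 40, 50)

# Divide every element by 5 in a single step
divided = vc / 5
print(divided)

# Operations between two vectors of the same length work element by element
total = vc + c(1, 2, 3, 4, 5)
print(total)

# Functions such as sqrt() are applied to each element as well
roots = sqrt(c(4, 9, 16))
print(roots)
```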
Matrices
Matrices are extensions of vectors to two dimensions, given by the number of rows and columns. To create a matrix from a vector, you can use the function matrix(). Its arguments include data to specify the data used to populate the matrix, nrow to specify the number of rows, ncol to specify the number of columns, byrow to specify whether the matrix is populated column-wise (FALSE) or row-wise (TRUE), and dimnames to specify the names of the rows and columns.
> mm = matrix()
> mm
[,1]
[1,] NA
> mm = matrix(data=c(1,2,3,4,5,6))
> mm
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2)
> mm
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=TRUE)
> mm
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=3)
> mm
[,1] [,2] [,3]
[1,] 1 4 1
[2,] 2 5 2
[3,] 3 6 3
Without any argument, matrix() creates a 1 x 1 matrix whose only element is NA. When you specify only data without the other arguments, a column matrix is created. Specifying nrow and ncol creates a matrix with nrow rows and ncol columns. By default, the matrix is populated column-wise. To populate it row-wise instead, set byrow=TRUE. Observe that when there is not enough data to populate the matrix (line 26), data is recycled, that is, R reuses the first element of data, then the second, and so on until the matrix is completely filled. Recent versions of R may also print a warning in this case because the length of data differs from the size of the matrix.
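The dimnames argument is described above but not used in the examples. A minimal sketch, with row and column names invented for illustration, passes a list holding the row names first and the column names second:

```r
# dimnames takes a list: row names first, then column names
mm = matrix(data=1:6, nrow=3, ncol=2, byrow=TRUE,
            dimnames=list(c("r1", "r2", "r3"), c("a", "b")))
print(mm)

# The names can be read back with rownames() and colnames()
print(rownames(mm))
print(colnames(mm))

# Names can also be used to pick out a single element
print(mm["r2", "b"])
```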
If you want to create a matrix from multiple vectors of the same length, you can use the rbind() or cbind() function to combine them. rbind() combines the vectors as rows, while cbind() combines them as columns.
> mat = cbind(1:3, 4:6)
> mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> mat = rbind(1:3, 4:6)
> mat
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> dim(mat)
[1] 2 3
> nrow(mat)
[1] 2
> ncol(mat)
[1] 3
To get the dimensions of a matrix, you can use the function dim() (line 12). The number of rows can be obtained using the nrow() function (line 14) and the number of columns using the ncol() function (line 16).
Arrays
Arrays extend matrices to more than two dimensions. You can think of arrays as a generalization of vectors and matrices, with the former being arrays of dimension 1 and the latter arrays of dimension 2. To create an array, you can use the array() function with arguments that include data, a vector used to populate the array; dim, another vector giving the size of each dimension; and dimnames to specify the names of the dimensions of the array.
> arr = array(data=1:27, dim=c(3,3,3))
> arr
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
, , 3
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
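As a brief sketch of how the printed layers relate to the array itself, dim() returns the size of each dimension, and one element can be picked out by giving its row, column, and layer in square brackets. The indexing shown here goes slightly beyond what the text covers.

```r
arr = array(data=1:27, dim=c(3,3,3))

# Size of each dimension and total number of elements
print(dim(arr))
print(length(arr))

# Row 2, column 3 of the first layer (", , 1" in the printout above)
print(arr[2, 3, 1])

# An array with two dimensions is treated as a matrix
print(is.matrix(array(data=1:6, dim=c(3,2))))
```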
Note that so far, we have only considered the numeric data type in the examples. Vectors, matrices, and arrays can be defined for other data types as well (see the section on basic data types above). The primary limitation is that each of these data structures can store only one type of data. If you assign values of different data types to a vector, R will coerce them to a common data type. The example below shows that when you combine a numeric vector and a character vector, the result is converted into a character vector.
> vec = c(1:4, "4", "6")
> vec
[1] "1" "2" "3" "4" "4" "6"
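Coercion is not limited to characters. The sketch below, using made-up values, shows a few other combinations and how as.numeric() can convert a character vector back to numbers when every element looks like a number.

```r
v1 = c(1.5, TRUE, FALSE)   # logical to numeric: TRUE becomes 1, FALSE becomes 0
v2 = c(1L, 2.5)            # integer to numeric
v3 = c(TRUE, "a", 3)       # everything becomes character

print(v1)
print(class(v2))
print(v3)

# Converting the character vector from the example above back to numbers
vec = c(1:4, "4", "6")
nums = as.numeric(vec)
print(nums)
```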
R provides two more data structures, lists and data frames, which can hold multiple data types at once.
Lists
Lists are a very useful data structure since you can use them to group different data types into a single object. A single list can contain a numeric matrix, a vector of characters, a single number, and many others. To create a list, you can use the list() function. The arguments of list() usually have the form value or tag = value. The function returns a list of values or tagged values.
> lst = list(matrix(data=1:4, nrow=2, ncol=2),c("who", "are", "you"),"welcome")
> lst
[[1]]
[,1] [,2]
[1,] 1 3
[2,] 2 4
[[2]]
[1] "who" "are" "you"
[[3]]
[1] "welcome"
> lst = list(a=matrix(data=1:4, nrow=2, ncol=2),b=c("who", "are", "you"),c="welcome")
> lst
$a
[,1] [,2]
[1,] 1 3
[2,] 2 4
$b
[1] "who" "are" "you"
$c
[1] "welcome"
In the first list, the first element is a 2 x 2 matrix with numeric values, the second is a vector of characters, and the last element is a single character string. In this list, the elements are not tagged. In the second definition, all the elements of the list are tagged. The tags are the names before the equal sign in the arguments of list(). It is also possible to combine tagged and untagged elements in the same list.
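A list that mixes tagged and untagged elements might look like the sketch below. It also shows two common ways of reading elements back, $ for a tag and [[ ]] for a position, which match the $a and [[1]] labels in the printed output above.

```r
# The first and third elements are tagged, the second is not
lst = list(a=matrix(data=1:4, nrow=2, ncol=2), c("who", "are", "you"), c="welcome")

# Untagged elements get an empty name
print(names(lst))

# Access by tag and by position
print(lst$c)
print(lst[[2]])

# An element keeps its own structure, so the matrix can still be indexed
print(lst$a[2, 1])
```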
Data frames
The last data structure that will be considered is the data frame. A data frame represents a tightly coupled collection of variables. Like lists, data frames have no restrictions on the data types they can hold. You can store numeric values, characters, and so on. The important difference is that in a data frame, each member must be a vector of the same length, and the column names should be non-empty. To create a data frame, you can use the data.frame() function.
> df = data.frame(name=c("Peter","John","Paul","Ann","Dina","Fe"),
age=c(32, 23, 44, 33, 43,39),
sex=c("Male","Male","Male","Female","Female","Female"))
> df
name age sex
1 Peter 32 Male
2 John 23 Male
3 Paul 44 Male
4 Ann 33 Female
5 Dina 43 Female
6 Fe 39 Female
> dim(df)
[1] 6 3
> nrow(df)
[1] 6
> ncol(df)
[1] 3
In the above, a data frame is constructed with the first name, age, and sex of 6 individuals. Each row in the data frame is called a record and each column a variable. You can use the dim() function to get the dimensions of the data frame, nrow() to get the number of rows, and ncol() to get the number of columns, just like with matrices.
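To see that each column keeps its own data type, the sketch below rebuilds the same data frame and inspects its columns. The stringsAsFactors=FALSE argument is set explicitly as a precaution, because older versions of R convert character columns to factors by default. The last lines illustrate the same-length requirement using two throwaway columns, a and b.

```r
df = data.frame(name=c("Peter","John","Paul","Ann","Dina","Fe"),
                age=c(32, 23, 44, 33, 43, 39),
                sex=c("Male","Male","Male","Female","Female","Female"),
                stringsAsFactors=FALSE)

# Data type of each column
col_types = sapply(df, class)
print(col_types)

# A single column is an ordinary vector and can be used as one
print(df$age)
print(mean(df$age))

# Columns of different lengths (3 and 2 here) are rejected with an error
bad = try(data.frame(a=1:3, b=1:2), silent=TRUE)
print(inherits(bad, "try-error"))
```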