# Data types and data structures in R

In this post, I will outline the basic mathematical operations in R, the different data types you can define, and how to manipulate them. If you have not yet installed R, you can read the first post here, which outlines the installation process. As you read along, it is best to run the commands yourself to become familiar with them. So start up R and follow along.

## Basic data types

R has 6 basic data types:

- character
- numeric (real or decimal)
- integer
- logical
- complex
- raw

Characters are specified by enclosing text in quotation marks (e.g., `"ans"` or `"str"`). Numerics are numbers such as 2.0 or 4.5. You can specify an integer by affixing an L to the number, such as 2L or 23L, which tells R to store the number as an integer. Logicals can be `TRUE` or `FALSE` (case-sensitive). The complex type represents complex numbers with real and imaginary parts, such as 4+5i. The raw type stores raw bytes.
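If you want to check how R stores a value, the `class()` function reports its type. The sketch below is only an illustration: the values are arbitrary, and `charToRaw()` is used here simply as one convenient way to obtain a raw value.

```r
# Inspect the class of one example value per basic data type.
class("ans")           # "character"
class(4.5)             # "numeric"
class(2L)              # "integer"
class(TRUE)            # "logical"
class(4+5i)            # "complex"
class(charToRaw("A"))  # "raw"

# A number typed without the L suffix is stored as numeric, not integer.
is.integer(2)   # FALSE
is.integer(2L)  # TRUE
```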

## Basic numeric operations

You can use R as a simple calculator. All the basic mathematical operations, such as addition (`+`), subtraction (`-`), multiplication (`*`), and division (`/`), can be performed. You can compute exponents (`^`), square roots (using the `sqrt()` function), and logarithms (using the `log()` function), among others. Use `( )` to control the order of your computation: operations inside the parentheses are evaluated first.

```
> 7+3
[1] 10
> 128 - 33
[1] 95
> 3*5
[1] 15
> 18/4
[1] 4.5
> 3^4
[1] 81
> 3^4+2
[1] 83
> 3^(4+2)
[1] 729
> (12+3)/(3+1.5)
[1] 3.333333
> sqrt(65)
[1] 8.062258
> log(33)
[1] 3.496508
```

Observe how the `()` change the output of the operation. In `3^4+2`, 3 is raised to the 4th power and then 2 is added. On the other hand, in `3^(4+2)`, 3 is raised to the 6th power.
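One detail worth knowing about `log()` is that it computes the natural logarithm unless you tell it otherwise. The short sketch below shows a few ways to use other bases. The numbers are arbitrary examples chosen so the results are easy to check by hand.

```r
# log() computes the natural logarithm (base e) by default.
log(exp(1))          # 1

# Other bases can be requested with the base argument or helper functions.
log(100, base = 10)  # 2
log10(100)           # 2
log2(8)              # 3
```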

In the examples above, R only prints the result of the operation in the console. If you want to save the result so that you can use it later, you need to *assign* the result to a *variable*. A variable name, say `x`, can contain letters, digits, periods, and underscores, as long as it begins with a letter and is not a reserved word such as `TRUE`. A variable is something with a value that can vary throughout the session. To assign a value to a variable, you can use the single equal sign (`=`) or the arrow notation (`<-`).

```
> x = 5
> x
[1] 5
> x = x + 1
> x
[1] 6
> y = x * 10
> y
[1] 60
> ls()
[1] "x" "y"
```

In the first command, `x` is assigned a value of 5. To see its value, just type the variable name and press enter. In `x = x + 1`, `x` is assigned its original value plus 1, so its new value is 6. Next, another variable called `y` is assigned the product of the current value of `x` and 10. The command `ls()` lists all variables in the current session, here `x` and `y`.
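The transcript above uses `=` throughout. Here is a minimal sketch of the same steps written with the arrow notation, which behaves the same way in this setting:

```r
# Assign with the arrow notation instead of the equal sign.
x <- 5
x <- x + 1   # x now holds 6
y <- x * 10  # y now holds 60

x  # 6
y  # 60
```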

## Vectors

If you want to process several numbers at once, for example, dividing them all by the same number, you could apply the division operator one number at a time. But R provides a better way of doing this using *vectors*. A vector is basically just a collection of numbers. To create a vector, use the function `c()` with the vector elements (the numbers) enclosed in parentheses and separated by commas.

```
> vector = c(1,2,3,4,5)
> vector
[1] 1 2 3 4 5
> vector2 = c(vector, vector)
> vector2
[1] 1 2 3 4 5 1 2 3 4 5
> sq = 2:10
> sq
[1] 2 3 4 5 6 7 8 9 10
```

You can also create vectors from other vectors, as in `vector2 = c(vector, vector)`. If you want to generate a vector consisting of a sequence of numbers with increment 1, you can use the colon operator (`:`), as in `2:10`, which reads as "from 2 to 10 with increment equal to 1". For more general sequences, you can use the `seq()` function.

```
> sq = seq(from=20, to=50, by=3)
> sq
[1] 20 23 26 29 32 35 38 41 44 47 50
> sq = seq(from=50, to=20, by=-3)
> sq
[1] 50 47 44 41 38 35 32 29 26 23 20
```

If you want to generate a vector with repeated elements, you can use the `rep()` function. Its arguments include `x`, the vector to repeat; `times`, the number of times to repeat `x`; and `each`, the number of times to repeat each element.

```
> rp = rep(x=1, times=10)
> rp
[1] 1 1 1 1 1 1 1 1 1 1
> rp = rep(x=c(1,5,10), times=3)
> rp
[1] 1 5 10 1 5 10 1 5 10
> rp = rep(x=c(2, 8, 12), times=2, each=3)
> rp
[1] 2 2 2 8 8 8 12 12 12 2 2 2 8 8 8 12 12 12
```
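To see how `times` and `each` differ, it can help to try them separately before combining them. The sketch below reuses the vector from the last example with smaller repeat counts, which are arbitrary.

```r
v = c(2, 8, 12)

# each repeats every element in place before moving to the next one.
rep(x = v, each = 2)             # 2 2 8 8 12 12

# times repeats the whole vector end to end.
rep(x = v, times = 2)            # 2 8 12 2 8 12

# With both, each is applied first, then the result is repeated times times.
rep(x = v, times = 2, each = 2)  # 2 2 8 8 12 12 2 2 8 8 12 12
```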

You can sort the elements of a vector using the `sort()` function. Its arguments are the vector `x` and `decreasing`, which indicates the direction of sorting. The value of `decreasing` can be either `TRUE` (case-sensitive) to sort from largest to smallest or `FALSE` to sort from smallest to largest.

```
> vc = c(1, 3, 10, 23, 55, 29, 0, -12, 22, -34)
> sort(x=vc)
[1] -34 -12 0 1 3 10 22 23 29 55
> sort(x=vc, decreasing=TRUE)
[1] 55 29 23 22 10 3 1 0 -12 -34
> sort(x=vc, decreasing=FALSE)
[1] -34 -12 0 1 3 10 22 23 29 55
```

Without the `decreasing` argument, the default behavior of `sort()` is to sort from smallest to largest (`decreasing=FALSE`), as the first call, `sort(x=vc)`, shows. If you want to find the number of elements in a vector, you can use the `length()` function.

```
> length(x=c(3,1,23,45))
[1] 4
> vc = c(2, 1, 4, rep(x=1:3,times=3))
> length(x=vc)
[1] 12
```
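Returning to the motivation at the start of this section, dividing several numbers by the same number, the idea can be sketched as follows. The vector name `values` and its contents are made up for this illustration.

```r
values = c(10, 20, 30, 40)

# Arithmetic operators work element by element on a whole vector.
values / 10    # 1 2 3 4
values + 1     # 11 21 31 41

# Functions such as sqrt() are applied to every element as well.
sqrt(c(4, 9))  # 2 3

# Two vectors of the same length are combined element by element.
c(1, 2, 3) * c(4, 5, 6)  # 4 10 18
```

This is why vectors are the better way to go: one expression replaces a separate division for each number.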

## Matrices

Matrices are extensions of vectors with dimensions given by the number of rows and columns. To create a matrix from vectors, you can use the function `matrix()`. Its arguments include `data` to specify the data that will be used to populate the matrix, `nrow` to specify the number of rows, `ncol` to specify the number of columns, `byrow` to specify whether the matrix will be populated column-wise (`FALSE`) or row-wise (`TRUE`) using the data, and `dimnames` to specify the names of the rows and columns.

```
> mm = matrix()
> mm
     [,1]
[1,]   NA
> mm = matrix(data=c(1,2,3,4,5,6))
> mm
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2)
> mm
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=TRUE)
> mm
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
> mm = matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=3)
> mm
     [,1] [,2] [,3]
[1,]    1    4    1
[2,]    2    5    2
[3,]    3    6    3
```

Without any arguments, `matrix()` creates a 1 x 1 matrix containing a single `NA`. When you specify only `data` without the other arguments, a column matrix is created. Specifying `nrow` and `ncol` creates a matrix with `nrow` rows and `ncol` columns. By default, the matrix is populated column-wise. To populate it row-wise instead, set `byrow=TRUE`. Observe that when there is not enough data to populate the matrix (the last example, where 6 values go into a 3 x 3 matrix), `data` is recycled: R reuses the first element of `data`, then the second, and so on until the matrix is completely filled.
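The `dimnames` argument is not used in the transcript above, so here is a minimal sketch of it. The row and column names are invented for illustration. `rownames()` and `colnames()` are standard R functions for reading the names back.

```r
# dimnames takes a list: row names first, then column names.
mm = matrix(data = 1:6, nrow = 3, ncol = 2,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b")))
mm

rownames(mm)  # "r1" "r2" "r3"
colnames(mm)  # "a" "b"
```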

If you want to create a matrix from multiple vectors having the same length, you can use the `rbind()` or `cbind()` function to combine these vectors into a matrix. `rbind()` combines vectors row-wise, while `cbind()` combines them column-wise.

```
> mat = cbind(1:3, 4:6)
> mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> mat = rbind(1:3, 4:6)
> mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> dim(mat)
[1] 2 3
> nrow(mat)
[1] 2
> ncol(mat)
[1] 3
```

To get the dimensions of a matrix, you can use the function `dim()`. The number of rows can be obtained using the `nrow()` function and the number of columns using the `ncol()` function, as the last three commands show.

## Arrays

*Arrays* are extensions of matrices to more than 2 dimensions. You can think of arrays as a generalization of vectors and matrices, with the former being arrays of dimension 1 and the latter arrays of dimension 2. To create an array, you can use the `array()` function, whose arguments include `data`, a vector that will be used to populate the array; `dim`, another vector that specifies the dimensions of the array; and `dimnames`, which specifies the names of the dimensions of the array.

```
> arr = array(data=1:27, dim=c(3,3,3))
> arr
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
```
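The functions used on vectors and matrices carry over to arrays. The sketch below checks the dimensions of the array above and illustrates that a two-dimensional array is treated as a matrix. The variable names `m` and `named` and the dimension labels are made up for this example.

```r
arr = array(data = 1:27, dim = c(3, 3, 3))
dim(arr)     # 3 3 3
length(arr)  # 27

# An array with two dimensions is a matrix.
m = array(data = 1:6, dim = c(2, 3))
is.matrix(m)  # TRUE

# dimnames takes a list with one vector of names per dimension.
named = array(data = 1:8, dim = c(2, 2, 2),
              dimnames = list(c("r1", "r2"), c("c1", "c2"), c("k1", "k2")))
named
```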

Note that so far, we have only considered the numeric data type in all the examples. Vectors, matrices, and arrays can be defined for other data types as well (see the section on basic data types above). The primary limitation is that each of these data structures can store only one type of data. If you assign values of different data types to a vector, R will coerce them to a common data type. The example below shows that when you combine a numeric vector with character values, the result is converted into a character vector.

```
> vec = c(1:4, "4", "6")
> vec
[1] "1" "2" "3" "4" "4" "6"
```
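Coercion follows a fixed direction: logical values become integers, integers become numerics, and anything combined with a character becomes a character. A small sketch with arbitrary values, using `class()` to inspect the result:

```r
# Logical combined with integer gives integer (TRUE becomes 1L).
class(c(TRUE, 2L))  # "integer"

# Integer combined with numeric gives numeric.
class(c(1L, 2.5))   # "numeric"

# Anything combined with a character gives character.
class(c(1.5, "a"))  # "character"
c(TRUE, 2.5, "a")   # "TRUE" "2.5" "a"
```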

R provides two more data structures, lists and data frames, which can hold multiple data types at once.

## Lists

Lists are a very useful data structure because they let you group different data types into a single object. A single list can contain a numeric matrix, a vector of characters, a single number, and many others. To create a list, you can use the `list()` function. Its arguments usually have the form `value` or `tag = value`. The function returns a list of values or tagged values.

```
> lst = list(matrix(data=1:4, nrow=2, ncol=2),c("who", "are", "you"),"welcome")
> lst
[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[2]]
[1] "who" "are" "you"

[[3]]
[1] "welcome"

> lst = list(a=matrix(data=1:4, nrow=2, ncol=2),b=c("who", "are", "you"),c="welcome")
> lst
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$b
[1] "who" "are" "you"

$c
[1] "welcome"

```

In the first list, the first element is a 2 x 2 matrix with numeric values, the second is a vector of characters, and the last is a character string. In this list, the elements are not tagged. In the second definition, all the elements of the list are tagged. The tags are the names before the equal sign in the arguments to `list()`. It is also possible to combine tagged and untagged elements.
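As a small sketch of mixing tagged and untagged elements, the list below tags only its first and last members. The contents are arbitrary. The `$` and `[[ ]]` operators used to pull elements out are standard R, although this post does not otherwise cover them.

```r
lst = list(a = 1:3, "untagged", c = TRUE)

# Untagged elements show up with an empty name.
names(lst)   # "a" "" "c"
length(lst)  # 3

# Tagged elements can be retrieved by tag, any element by position.
lst$a     # 1 2 3
lst[[2]]  # "untagged"
```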

## Data frames

The last data structure we will consider is the data frame. Data frames represent tightly coupled collections of variables. Like lists, data frames place no restriction on the data types they can hold: you can store numeric values, characters, and so on. The important difference is that in a data frame, the members must be vectors of the same length, and column names should be non-empty. To create a data frame, you can use the `data.frame()` function.

```
> df = data.frame(name=c("Peter","John","Paul","Ann","Dina","Fe"),
+                 age=c(32, 23, 44, 33, 43, 39),
+                 sex=c("Male","Male","Male","Female","Female","Female"))
> df
   name age    sex
1 Peter  32   Male
2  John  23   Male
3  Paul  44   Male
4   Ann  33 Female
5  Dina  43 Female
6    Fe  39 Female
> dim(df)
[1] 6 3
> nrow(df)
[1] 6
> ncol(df)
[1] 3
```

In the example above, a data frame is constructed with the first name, age, and sex of 6 individuals. Each row in the data frame is called a *record* and each column a *variable*. You can use `dim()` to get the dimensions of the data frame, `nrow()` to get the number of rows, and `ncol()` to get the number of columns, just as with matrices.
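Two properties described above can be checked directly: each column keeps its own data type, and columns whose lengths cannot be matched up are rejected. The sketch below uses a shortened version of the example data. Passing `stringsAsFactors = FALSE` is an assumption made here so that the text columns stay as characters regardless of the R version, and `tryCatch()` is used only to capture the error instead of stopping the script.

```r
df = data.frame(name = c("Peter", "John", "Paul"),
                age = c(32, 23, 44),
                stringsAsFactors = FALSE)

# str() summarizes the structure, including the type of each column.
str(df)
class(df$name)  # "character"
class(df$age)   # "numeric"

# Columns of length 3 and 2 cannot be combined, so data.frame() fails.
result = tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
result  # "error"
```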