Getting Started with R

Author

Dr. Mohammad Nasir Abdullah

Starting up

1) Commenting in R console and script

We can add a comment to any code that we write in an R script by typing “#”.
It is good practice to write a comment alongside the code that we write. This helps to prevent misunderstanding of the code in the future. This can be done as follows:

# This is a comment
# Anything we type after '#' is treated as a comment
1+1 #this is a proper way to write a comment after writing a line of code.

[1] 2

We can turn all of the lines above into comments at once by selecting them and pressing “Ctrl + Shift + C” (in RStudio).

# comment will not be executed in R console
# this is another example that comment is not executed in R console: 
# 1 + 1

2) Explaining output on console

We can type code in the console, and R prints the output in the same console. Here, “print("Mohammad Nasir Abdullah")” is the code, and [1] "Mohammad Nasir Abdullah" is the output produced by R. The [1] prefix shows that the line begins with the first element of the result.

print("Mohammad Nasir Abdullah")

[1] "Mohammad Nasir Abdullah"

3) Object assignment

You can create variables from within the R environment and from files on your computer. R uses “=” or “<-” to assign values to a variable name.

Example 1: Assign using “=”

x = 2
print(x)

[1] 2

Example 2: Assign using “<-”

y <- 2
print(y)

[1] 2

4) R is case sensitive

One of the fundamental aspects to understand while working with R is its case sensitivity. In R, the identifiers such as variable names, function names, and other object names are case-sensitive. This means that Variable, variable, and VaRiAbLe are considered different identifiers in R.

It is crucial to maintain consistent capitalization to ensure that your code works as expected.

Example 1:

# Case sensitivity in R
Variable <- 1
variable <- 2

#The following will output 1, not 2
print(Variable)

[1] 1

#The following will output 2, not 1
print(variable)

[1] 2

In this example, Variable and variable are treated as two distinct objects, each holding different values.

5) Listing the objects in the workspace

The operations in the previous sections created several simple R objects. These objects are stored in the current R workspace. A list of all objects in the current workspace can be printed to the screen using the objects() function.

objects()

[1] "variable" "Variable" "x"        "y"

A synonym for objects() is ls(). Remember that if we quit our R session without saving the workspace image, these objects will disappear. If we save the workspace image, the workspace will be restored at our next R session.
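As a small sketch of how the workspace listing changes, the snippet below creates two throwaway objects, checks for them with ls(), and removes one. The names demo_a and demo_b are hypothetical and used only for this illustration, and rm() is a base R function that is not otherwise covered in this section.

```r
# demo_a and demo_b are hypothetical objects created only for this illustration
demo_a <- 1
demo_b <- "text"

# ls() returns the object names as a character vector
"demo_a" %in% ls()   # TRUE

# rm() deletes an object from the workspace
rm(demo_a)
"demo_a" %in% ls()   # FALSE
"demo_b" %in% ls()   # TRUE
```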

Data Types in R

One of the fundamental concepts in any programming language is the data type. In R, there are several basic data types that are used to define the type of data that can be stored and manipulated.

Basic data types in R:

1) Numeric: represents both integer and decimal numbers.

x <- 23.5
y <- 4

2) Character: represents strings or text.

name <- "Mohammad Nasir Abdullah"
greeting <- "Hello, World!"

3) Logical: represents boolean values (“TRUE” or “FALSE”).

is_true <- TRUE
is_false <- FALSE

4) Complex: represents complex numbers.

z <- 3 + 4i

5) Raw: represents raw bytes

raw_data <- charToRaw("Hello")
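To check which data type an object has, one option is the base R functions class() and typeof(). The sketch below reuses the example values from the list above; the 4L literal is an extra illustration that is not part of the original examples.

```r
x <- 23.5
name <- "Mohammad Nasir Abdullah"
is_true <- TRUE
z <- 3 + 4i
raw_data <- charToRaw("Hello")

class(x)         # "numeric"
class(name)      # "character"
class(is_true)   # "logical"
class(z)         # "complex"
class(raw_data)  # "raw"

# A number typed without a decimal point is still stored as a double;
# the L suffix asks R for an integer instead
class(4)         # "numeric"
typeof(4)        # "double"
class(4L)        # "integer"
```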

Data Structures in R

While the basic data types define a single value, R offers several data structures to store collections of values:

1) Vectors: A one-dimensional array that can hold elements of the same data type.

numeric_vector <- c(1,2,3,4,5,6)
character_vector <- c("apple", "banana", "cherry")

2) Matrix: A two-dimensional array where elements are arranged in rows and columns

matrix_data <- matrix(1:6, nrow=2, ncol=3)

3) List: A collection that can hold elements of different data types.

my_list <- list(name="Nasir", age=16, score=c(100,99,100))

4) Data frame: A table-like structure where columns can be of different data types

students <- data.frame(Name=c("Ali", "Abu", "Ahmad"), 
                       Age = c(28, 39,10), 
                       Grade = c("A", "A","A+"))

5) Factor: Used to represent categorical data

gender <- factor(c("Male", "Female", "Male"))

Vector

A vector is a basic data structure that holds elements of the same type. It is a sequence of data elements. For example, a numeric vector holds only numeric data, and a character vector holds only character data.

1) Creating a vector

You can create a vector in R using the c() function.

#Example   
numeric_vector <- c(1,2,3,4,5,6)  
character_vector <- c("A","B", "C", "D", "E", "F")

The : symbol can be used to create sequences of increasing (or decreasing) values. For example:

numbers5to20 <- 5:20  
numbers5to20

 [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
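The decreasing case works the same way. The sketch below also shows seq(), a base R function not used elsewhere in this section, for steps other than 1. The names countdown and evens are hypothetical.

```r
# The colon operator counts down when the first value is larger
countdown <- 10:5
countdown        # 10 9 8 7 6 5

# seq() allows a step other than 1
evens <- seq(2, 10, by = 2)
evens            # 2 4 6 8 10
```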

Vectors can be joined together (i.e., concatenated) with the c() function. For example, note what happens when we type:

c(numbers5to20, numeric_vector)

 [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20  1  2  3  4  5  6

We can append numbers5to20 to the end of numeric_vector, and then append the decreasing sequence from 4 to 1:

a.mess <- c(numeric_vector, numbers5to20, 4:1)

a.mess

 [1]  1  2  3  4  5  6  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20  4  3  2
[26]  1

2) Extracting elements from vectors

A convenient way to display the 22nd element of a.mess is to use square brackets [ ] to extract just that element:

#Extract 22nd element from a.mess
a.mess[22]

[1] 20

We can extract more than one element at a time. For example, the 3rd, 6th, and 7th elements of a.mess are:

#Extract 3rd, 6th, and 7th Elements
a.mess[c(3,6,7)]

[1] 3 6 5

To get the 3rd through 7th elements of a.mess, just type:

a.mess[3:7]

[1] 3 4 5 6 5

Negative indices can be used to omit certain element(s). For example:

# To omit 3rd element in a.mess
a.mess[-3]

 [1]  1  2  4  5  6  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20  4  3  2  1

#To omit the 2nd and 10th elements in a.mess
a.mess[-c(2,10)]

 [1]  1  3  4  5  6  5  6  7  9 10 11 12 13 14 15 16 17 18 19 20  4  3  2  1

Do not mix positive and negative indices. To see what happens, observe:

a.mess[c(-2, 10)]    

#Error in a.mess[c(-2, 10)] : only 0's may be mixed with negative subscripts

Always make sure that vector indices are integers. When fractional values are used, they are truncated towards 0, so an index of 0.5 becomes 0 and selects nothing:

a.mess[0.5]   
#numeric(0)
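A minimal sketch of the truncation rule, using a short hypothetical vector v so that the results are easy to check by eye:

```r
v <- c(10, 20, 30)   # hypothetical vector for illustration

v[2.7]    # 20: the index 2.7 is truncated to 2
v[0.5]    # numeric(0): the index 0.5 is truncated to 0, which selects nothing
v[-1.9]   # 20 30: the index -1.9 is truncated to -1, so the first element is omitted
```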

3) Vector arithmetic

Arithmetic can be done on R vectors. For example, we can multiply all elements of numeric_vector by 3.

#Multiplication
numeric_vector * 3

[1]  3  6  9 12 15 18

Note that the computation is performed element-wise. Addition (+), subtraction (-), and division (/) by a constant have the same kind of effect. For example:

#Subtraction
numeric_vector - 5

[1] -4 -3 -2 -1  0  1

#Division
numeric_vector / 2

[1] 0.5 1.0 1.5 2.0 2.5 3.0

#Addition
numeric_vector + 2

[1] 3 4 5 6 7 8

Next, consider taking the 3rd power of the elements of numeric_vector:

numeric_vector^3

[1]   1   8  27  64 125 216
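Element-wise computation also applies when both operands are vectors of the same length. The sketch below uses two hypothetical vectors, a and b, to show that each element is paired with the element in the same position:

```r
a <- c(1, 2, 3)      # hypothetical vectors for illustration
b <- c(10, 20, 30)

a + b    # 11 22 33
a * b    # 10 40 90
b / a    # 10 10 10
```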

4) Character vectors

Scalars and vectors can be made up of strings of characters instead of numbers. All elements of a vector must be of the same type. For example:

colors <- c("red", "yellow", "blue")   

more.colors <- c(colors, "green", "magenta", "pink") #This appended some new elements to colors

#An attempt to mix data types in a vector   

new <- c("green", "yellow", 1)

To see the contents of more.colors and new, simply type:

more.colors

[1] "red"     "yellow"  "blue"    "green"   "magenta" "pink"

new

[1] "green"  "yellow" "1"
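The number 1 appears in quotes because R converted it to a character string so that every element has the same type. A short sketch of this coercion follows; the vector mixed is a hypothetical extra example.

```r
new <- c("green", "yellow", 1)
class(new)            # "character": the number 1 was converted to "1"

# Mixing logical and numeric values gives a numeric vector (TRUE becomes 1)
mixed <- c(TRUE, 2.5)
mixed                 # 1.0 2.5
class(mixed)          # "numeric"
```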

5) Factor vector

Factors offer an alternative way to store character data. For example, a factor with 4 elements and 2 levels, control and treatment, can be created using:

grp <- c("control", "treatment", "control", "treatment")   

grp

[1] "control"   "treatment" "control"   "treatment"

#set as factor 

grp <- as.factor(grp) 

grp

[1] control   treatment control   treatment
Levels: control treatment

Factors can be an efficient way of storing character data when values are repeated among the vector elements. This is because the levels of a factor are internally coded as integers. To see what the codes are for our factor, we can type:

as.integer(grp)

[1] 1 2 1 2

The labels for the levels are stored just once each, rather than being repeated. The codes are indices of the vector of levels:

levels(grp)

[1] "control"   "treatment"

The levels() function can be used to change factor labels as well. For example, suppose we wish to change the “control” label to “placebo”.

levels(grp)[1] <- "placebo"
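To confirm the effect of the relabeling, the sketch below rebuilds grp so that it runs on its own, then inspects the factor after the change. Every former "control" element now reads "placebo", while the underlying integer codes stay the same.

```r
grp <- factor(c("control", "treatment", "control", "treatment"))
levels(grp)[1] <- "placebo"

as.character(grp)   # "placebo" "treatment" "placebo" "treatment"
levels(grp)         # "placebo" "treatment"
as.integer(grp)     # 1 2 1 2 (the integer codes are unchanged)
```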

An important use for factors is to list all possible values, even if some are not present. For example:

gender <- factor(c("Female", "Female", "Female"), levels = c("Female", "Male"))   

gender

[1] Female Female Female
Levels: Female Male

It shows that there are two possible values for gender, but only one is present in our vector.
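One place where this matters is when counting values. As a sketch, the base R function table() (not used elsewhere in this section) reports a count of zero for a level that is declared but absent; the name counts is hypothetical.

```r
gender <- factor(c("Female", "Female", "Female"), levels = c("Female", "Male"))

counts <- table(gender)
counts[["Female"]]   # 3
counts[["Male"]]     # 0

# Without the levels argument, "Male" would not be a level at all
levels(factor(c("Female", "Female", "Female")))   # "Female"
```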

Data Frames

Data sets frequently consist of more than one column of data, where each column represents measurements of a single variable. Each row usually represents a single observation. This format is referred to as case-by-variable format.

Most data sets are stored in R as data frames. These are like matrices, but with the columns having their own names.

A data frame is one of the most commonly used data structures in R, especially for data analysis and statistical modelling. Conceptually, it can be thought of as a table or a spreadsheet, where rows represent observations and columns represent variables. A data frame is similar to a matrix, but with the added flexibility that different columns can contain different types of data (e.g., numeric, character, factor).

Features:

Mixed Data Types: Unlike matrices, data frames can store different classes of objects in each column.
Column Names: Columns in a data frame can have names, which makes accessing and manipulating data easier and more intuitive.
Row Names: By default, rows have index names (from 1 to the number of rows), but these can also be explicitly set to other values.

Creation:

A data frame can be created using the data.frame() function:

df <- data.frame(Name = c("Ali", "Abu", "Ahmad"), 
                 Age = c(9, 6, 2), 
                 Score = c(82, 93, 92))
df

   Name Age Score
1   Ali   9    82
2   Abu   6    93
3 Ahmad   2    92

Indexing:

Columns: You can access a column in a data frame using the $ operator or double square brackets [[…]].
```
#Extracting the Name column
df$Name
```
```
[1] "Ali"   "Abu"   "Ahmad"
```
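A brief sketch comparing the ways of reaching a column, using the same three-row df rebuilt here so that the snippet runs on its own. The $ operator and double square brackets return the column as a plain vector, whereas single square brackets with a column name keep the data frame structure.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))

df$Score        # 82 93 92
df[["Score"]]   # 82 93 92 (same vector, column named by a string)
df[[3]]         # 82 93 92 (same vector, column selected by position)

# Single square brackets keep the result as a one-column data frame
class(df["Score"])   # "data.frame"
```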

Rows: Rows can be accessed using single square brackets […].

#Extracting 3rd row
df[3, ]

   Name Age Score
3 Ahmad   2    92

Subsetting: You can subset data frames using conditions

#Extracting rows where Score is greater than 90
df[df$Score > 90, ]

   Name Age Score
2   Abu   6    93
3 Ahmad   2    92

Useful functions

head() and tail(): Display the first or last part of a data frame.
str(): Provides the structure of a data frame, showing the data type of each column and the first few entries.
summary(): Gives a statistical summary of all columns in a data frame.
dim(): Returns the dimensions (number of rows and columns) of a data frame.
rownames() and colnames(): Get or set the row or column names of a data frame.
merge(): Merges two data frames by common columns or row names.

Examples

1) head() and tail()

These functions display the first or last part of a data frame, respectively. By default, they show six rows.

# Create a sample data frame
df <- data.frame(Name = c("Ali", "Abu", "Ahmad", "Aminah", "Rosnah", "Rozanae", "Rohana"),
                 Age = c(25, 32, 29, 24, 27, 31, 23),
                 Score = c(85, 90, 93, 87, 78, 91, 82))

# Display the first few rows
head(df)

     Name Age Score
1     Ali  25    85
2     Abu  32    90
3   Ahmad  29    93
4  Aminah  24    87
5  Rosnah  27    78
6 Rozanae  31    91

# Display the last few rows
tail(df)

     Name Age Score
2     Abu  32    90
3   Ahmad  29    93
4  Aminah  24    87
5  Rosnah  27    78
6 Rozanae  31    91
7  Rohana  23    82

2) str()

This function provides a concise display of the structure of an object, such as a data frame.

# Display the structure of df
str(df)

'data.frame':   7 obs. of  3 variables:
 $ Name : chr  "Ali" "Abu" "Ahmad" "Aminah" ...
 $ Age  : num  25 32 29 24 27 31 23
 $ Score: num  85 90 93 87 78 91 82

3) summary()

Gives a statistical summary of all columns in a data frame.

# Get a summary of df
summary(df)

     Name                Age            Score      
 Length:7           Min.   :23.00   Min.   :78.00  
 Class :character   1st Qu.:24.50   1st Qu.:83.50  
 Mode  :character   Median :27.00   Median :87.00  
                    Mean   :27.29   Mean   :86.57  
                    3rd Qu.:30.00   3rd Qu.:90.50  
                    Max.   :32.00   Max.   :93.00

4) dim()

Returns the dimensions of an object.

# Get the dimensions of df (number of rows and columns)
dim(df)

[1] 7 3

5) rownames() and colnames()

Retrieve or set the row or column names of a data frame.

# Get row names of df
rownames(df)

[1] "1" "2" "3" "4" "5" "6" "7"

# Get column names of df
colnames(df)

[1] "Name"  "Age"   "Score"

# Set new row names for df
rownames(df) <- c("A", "B", "C", "D", "E", "F", "G")

6) merge()

Merge two data frames by common columns or row names.

# Create another sample data frame
df2 <- data.frame(Name = c("Ali", "Abu", "Rosnah", "Rohana"),
                  Grade = c("A", "B", "A", "C"))

# Merge df and df2 by the "Name" column
merged_df <- merge(df, df2, by="Name")
print(merged_df)

    Name Age Score Grade
1    Abu  32    90     B
2    Ali  25    85     A
3 Rohana  23    82     C
4 Rosnah  27    78     A
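Note that only the four names present in both data frames appear in the result, because merge() keeps matching rows only by default. As a sketch, the all.x argument of merge() keeps every row of the first data frame and fills the missing grades with NA. The data frames are rebuilt here so that the snippet runs on its own, and the name left_joined is hypothetical.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad", "Aminah", "Rosnah", "Rozanae", "Rohana"),
                 Age = c(25, 32, 29, 24, 27, 31, 23),
                 Score = c(85, 90, 93, 87, 78, 91, 82))
df2 <- data.frame(Name = c("Ali", "Abu", "Rosnah", "Rohana"),
                  Grade = c("A", "B", "A", "C"))

# all.x = TRUE keeps every row of the first data frame (a left join)
left_joined <- merge(df, df2, by = "Name", all.x = TRUE)

nrow(left_joined)               # 7
sum(is.na(left_joined$Grade))   # 3 (Ahmad, Aminah, and Rozanae have no grade)
```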

Let’s use the mtcars data set.

This dataset comprises various specifications and details about different car models from the 1970s.

1. Quick Glance at the Dataset

First, let’s take a quick look at the mtcars dataset:

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2. Structure of the Dataset (str())

Examining the structure of mtcars:

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

3. Summary of the Dataset (summary())

Providing a statistical summary:

summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000

4. Dimensions of the Dataset (dim())

Checking the number of rows and columns:

dim(mtcars)

[1] 32 11

5. Column Names (colnames())

Retrieving the names of the columns:

colnames(mtcars) #same as names(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

6. Subsetting Example

Extracting data for cars with 6 cylinders and horsepower (hp) greater than 150:

mtcars[mtcars$cyl == 6 & mtcars$hp > 150, ]

              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
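The same filter can also be written with the base R function subset(), which is not used elsewhere in this section and lets the condition refer to columns by name without repeating mtcars$. This is offered as an alternative sketch; the name fast_six is hypothetical.

```r
# subset() evaluates the condition using the data frame's column names
fast_six <- subset(mtcars, cyl == 6 & hp > 150)
rownames(fast_six)   # "Ferrari Dino"

# The select argument keeps only the listed columns
subset(mtcars, cyl == 6 & hp > 150, select = c(mpg, hp))
```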

Exercise!

Create a numeric variable “num_var” with the value “42.5”.
Create a character variable “char_var” with the value “R is fun!”.
Print the data type of “char_var”.
Create a list “student_info” with the following elements:
1. “name”: “Mohammad Nasir Abdullah”
2. “age”: 18
3. “grades”: a numeric vector with values “99, 100, 89”.
Create a data frame “df_students” with the following columns:
1. “Name”: “John”, “Pablo”
2. “Age”: “22”, “30”
3. “Grade”: “A”, “C”
Create a numeric vector “vec_num” with the values “5,10,15,20”.
Extract the second and third elements from “vec_num”.
Create a vector “vec_seq” that contains a sequence of numbers from 1 to 10.
Create a vector “vec_rand” with 5 random numbers between 1 and 100.
Create a character vector “vec_char” with the values “apple”, “banana”, “cherry”.