Import and Exporting Data

Author

Dr. Mohammad Nasir Abdullah

Introduction to Packages

A package in R is a collection of functions, sample data, and documentation bundled together. By using packages, you can leverage the work of others to perform complex tasks with just a few lines of code.

Why Use Packages?

• Enhanced Functionality: Packages provide additional functions to perform a wide variety of tasks.

• Efficiency: Save time and effort by using pre-written and tested code.

• Community Support: Benefit from the extensive and vibrant R community.

Installing packages

You can install packages directly from CRAN (Comprehensive R Archive Network), or other repositories, and also from local files.

#installing the 'dplyr' package from CRAN 
install.packages("dplyr")

Loading packages

After installing a package, you need to load it into the R environment to use its functions.

#loading the 'dplyr' package 
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Using package functions

After loading a package, you can use its functions by calling them like any other function in R.

#using the 'filter' function from 'dplyr' to filter rows in a data frame.  
dplyr::filter(mtcars, mpg > 20)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

To see list of functions available in a package

ls(getNamespace("dplyr"))

To see the documentation of the package

help(package="dplyr")

Importing Data

In many data analysis, the important step is data gathering. Normally, in analysing data, the data is already available from other sources such as Microsoft Excel, databases, and statistical packages format.

It is often the case that data for analysis is contained in an external file (external to R) in plain text (ASCII) format. The data is typically delimited or separated by a special character such as a space, tab, or comma.

See the Quick-R section on packages, for information on obtaining and installing the these packages. Example of importing data are provided below.

In this notes, we will use demo.sav, demo.csv, demo.xls, Data Exercise.sav and KAP1.dta. All the dataset can be downloaded through this link: https://dataintror.s3-ap-southeast-1.amazonaws.com/dataset+Introduction+to+R.zip

At first, let us setting up the evironment. we need to make sure we already identify the working directory and clear all object in the environment.

1) To clear all objects in the R environment:

rm(list=ls())

2) To check working directory

#to check current working directory
getwd()
#If not satisfied, we can change the working directory using 
#(not run)
setwd("C:/Users/aaa/Documents")
#(not run)

1. Importing from a Comma Delimited Text File

First, we must make sure the original file has these criteria’s:

The first row of the file should contain variable names at the top.

2. The variable names must include normal letters, numbers, and underscores (for example: Age, Admission_date, Score_1).

3. Should not include special characters such as space, @,$,#(for example: “Admission Date”, #race, @prob).

4. The data should begin at the second row and first column onwards.
5. Everything that is not part of the data should be removed (for example: comments, labels, graphs, pivot tables).

This is the example of comma delimited text file.

mydata <- read.table("demo.csv", header=TRUE, sep=",")

Next, we can check for the dimension of the dataset by using this code:

dim(mydata) #To look at the dimensions of the data.
            # It will display row and columns

we also can view some of the observations by:

head(mydata) 
#It will show first 6 observations for all variables

head(mydata[1:4], n=10L) 
#It will show first 10 observations (10L)
#for variable number 1 until number 4

2. Importing from MS Excel File

One of the best ways to read an Excel file is to export it to a comma delimited file #and import it using the method above. Alternatively you can use the xlsx package to access Excel files. The first row should contain variable/column names. As as start, we can read in the first worksheet from the workbook demo.xls by double click on the orginal excel file and make sure the first row contains variable names and all the criteria mention in the previous section.

To start importing excel file, we need to install a library named xlsx.

install.packages("readxl")

If you already installed the package, please load it up by calling the library

library(readxl)

Now we can begin to import the data from MS Excel file

mydata1 <- read_excel("demo.xls", sheet="demo")
#We need to specify the sheet name.
#In this case the sheet name is "demo"

In the file name, you can mention either it is .xls or .xlsx There are another method to import data from excel, without specifying the sheet name. But you need to know which number of sheets you want to import, in this case, the sheet is in the 1st sheet.

mydata1 <- read_excel("demo.xls", 1)
#We need to specify the sheet number.
# In this case the "demo" sheet is the first sheet

3. Importing dataset from SPSS file

To import data file from SPSS format file we need to install ‘foreign’ library. In foreign library, we able to import data file from SPSS and STATA

install.packages("foreign")

If you already install the package, please load it up by calling the library

library(foreign)

Now we can begin to import the data from SPSS file

data.sav = read.spss("Data Exercise.sav", to.data.frame = TRUE)#SPSS
#Last option converts value labels to R factors

There are another method to import SPSS data file into R, by using Hmisc library

install.packages("Hmisc")
library(Hmisc)
mydata1 <- spss.get("Data Exercise.sav", use.value.labels=T)

4. Importing data from STATA file

In this example, we will use foreign library to import data

library(foreign)

Now, we will begin to import STATA data file format .dta into R.

mydata <- read.dta("KAP1.dta")

It actually pretty fast to import STATA data file into R.

5. Importing data from Internet / Web

The internet is a vast repository of data, ranging from structured dataset in CSV or Excel formats, APIs returning JSON or XML data, to unstructured data in web pages. R provides a range of tools and packages that make importing this data straightforwards.

From website: https://catalog.data.gov/dataset/electric-vehicle-population-data, we will download this data https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD, noticed that this is .csv file data provided from this webpage.

my_data <- read_csv("https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD")

Now, let explore some other dataset from data.gov.my. which is Number of births in Malaysia by date: https://storage.data.gov.my/demography/births.csv

birth <- read_csv("https://storage.data.gov.my/demography/births.csv")

Another example from the same website: RON97 Petrol Price: https://storage.data.gov.my/commodities/fuelprice.csv

ron97 <- read.csv("https://storage.data.gov.my/commodities/fuelprice.csv")

6. Importing data from .parquet file

Parquet is a columnar storage file format optimized for analytics. Originating from the Hadoop ecosystem, Parquet is especially used with big data technologies because of its efficiency. Some features and advantages of Parquet include:

Columnar Storage: By storing data column-wise, Parquet allows more efficient I/O operations, compression, and encoding shcemes.
Schema Evolution: Parquet supports complex nested data structures and allows schema evolution, meaning you can modify the schema over time without having to rewrite the entire dataset.
Language and Framework Neutrality: Parquet can be used with any project in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language.
Compression: Columnar storage also makes it possible to better compress data. Different compression schemes can be specified for different columns.

Reading a Parquet file in R

To read a parquet file in R, you can use the arrow package, which provides a set of tools for working with Parquet and Arrow dataset:

Here’s how you can read a Parquet file into R using the arrow package:

install.packages("arrow")

once installed the package, you can load it up into R environment:

library(arrow)

Let try to import Parquet file from data.gov.my where this data is about Exchange Rate Malaysia vs China

data1 <- read_parquet("https://storage.data.gov.my/finsector/exchangerates.parquet")

head(data1)

Noticed that the parquet file here is same as tibble data format.

Exporting data

1. Exporting Data to a Tab Delimited Text file.

There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you will need to load the foreign packages. For Excel, you will need the writexl package.

First, we need to load foreign library to enable us to export to text file.

library(foreign)

To begin exporting, let consider mydata object at the last importing notes (import from STATA) which is KAP1.dta.

write.table(mydata, "text.txt", sep="\t")

The function works are where x is our object data (mydata), then file=“” is the new file that we want to create. Others, quote, eol, na, dec, row.names, col.names, qmethod and fileEncoding definition can be view in the document file of write.table by :

?write.table

I would like to emphasize on ‘sep’ function in the command line. For tab delimited we use sep="," for comma delimited is sep=”,“, For space delimited is sep=" ".

Now we can find the text.txt file that we have created in our working directory.

2. Exporting data to comma separated file (`.csv`)

write.csv(data1, "data2.csv")

3. Exporting to MS Excel Spreadsheet

We need to call writexl library in order to export the data into excel spreadsheet

library(writexl)

To write into excel spreadsheet

write_xlsx(mydata, "myexcel.xlsx")

3. Exporting to SPSS Data File

To write out text datafile and an SPSS program to read it we need to use foreign library

library(foreign)

Begin to export to SPSS datafile

write.foreign(mydata, "mydata.txt", "mydata.sps",   package="SPSS")

4. Exporting to STATA Data File

To Export datafile into STATA data file format still under foreign library

library(foreign)

Begin to export to STATA binary datafile format

write.dta(mydata, "mydata.dta")

Conclusion Remarks

R offers an extensive set of tools for exporting and importing data. Depending on you needs and the tools you or your collaborators use, you can choose the appropriate method for exporting and importing. Always keep in mind the compatibility and constrains of your chosen data format.

Exercise!

Question 1

a. Download a sample CSV file from the internet and import it into R. What function would you use from base R to achieve this?

b. Import an Excel file with multiple sheets into R. How would you access the second sheet specifically?

Question 2

a. Suppose you have a .sav file, which is an SPSS data file. Which R package would you consider to import this file?

b. How would you import a Parquet file into R?

Question 3

a. You have a dataframe df. Export it to a CSV file named “my_data.csv” without row names.

b. How would you export the same dataframe df to an Excel with a sheet name “Dataset”?

Question 4

a. You are given a URL of a CSV file. Write a script to download this CSV, import it into R, make any transformation of your choice (eg: filter, mutate), and then export the transformed dataset to an Excel file.

b. You have scraped multiple tables from different web pages into a list of dataframes in R. How would you export each dataframe in the list to its own sheet in a single Excel workbook?