1 Introduction to R
1.1 R Language
R is a complete programming language and software environment for statistical computing and graphical representation. As part of the GNU Project (a free software, mass collaboration project), its source code is freely available. Its functionality can be extended by importing packages. For more details on R see https://www.r-project.org/.
1.1.1 R Packages
A package is a file generally composed of R scripts (e.g., functions). On all operating systems, the function “install.packages()” can be used to download and install a package automatically. A package that is already installed can be loaded in a session with the command “library()”. In R, the directories where the packages are stored are called “libraries”. The terms “package” and “library” are sometimes used synonymously. For example, to check the list of installed packages, the function “library()” can be called without arguments. When you open an R Markdown document (.Rmd), the program automatically offers to install the libraries listed there.
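The following is a minimal sketch of the package workflow described above. The package name "ggplot2" is only an illustrative assumption, and the installation line is commented out because it needs internet access. The use of “installed.packages()” and “.libPaths()” is one possible way to inspect the installation, not the only one.

```r
# Install a package from CRAN (illustrative package name; needs internet access)
# install.packages("ggplot2")

# Load an installed package into the session; "stats" ships with base R
library(stats)

# Directories ("libraries") where packages are stored
lib_paths <- .libPaths()
lib_paths

# Names of all installed packages
pkgs <- rownames(installed.packages())
"stats" %in% pkgs
```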
1.1.2 Some tips
- R is case sensitive!
- Previously used commands can be recalled in the console by using the up arrow on the keyboard.
- The working directory by default is “C:/user/…/Documents”.
- It can be found using the command “getwd()”.
- It can be changed using the command “setwd()”.
- In R Markdown: the working directory when evaluating R code chunks is the directory of the input document by default.
- To access a specific file in a sub-folder, use “./subfolder/file.ext”.
- To access a specific file in a parent folder, use “../upfolder/file.ext”.
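The tips above can be sketched as follows. The variable names and the use of “tempdir()” as a target directory are assumptions made for illustration. The folder and file names are the placeholders from the list above and do not need to exist for this code to run.

```r
# R is case sensitive: these are two different objects
value <- 10
Value <- 20
identical(value, Value)

# Remember the current working directory, then change it temporarily
old_wd <- getwd()
setwd(tempdir())   # tempdir() is used here only as a directory that always exists
getwd()

# Build relative paths to a sub-folder and to a folder one level up
rel_sub <- file.path(".", "subfolder", "file.ext")
rel_up  <- file.path("..", "upfolder", "file.ext")
rel_sub
rel_up

# Restore the original working directory
setwd(old_wd)
```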
1.2 R Markdown
This is an R Markdown document :-)
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a simple, easy-to-use plain-text language that lets you combine R code, results from data analysis (including plots and tables), and comments into a single nicely formatted and reproducible document (such as a report, publication, thesis chapter, or web page).
Code lines are organized into code blocks, each solving a specific task, referred to as “code chunks”. For more details on using R Markdown see http://rmarkdown.rstudio.com.
All you have to do during the computing labs is read each explanatory paragraph before running each R code chunk, one by one, and interpret the results. Finally, to create a personal document (usually a PDF) from R Markdown, you need to knit the document. Knitting a document simply means taking all the text and code and creating a nicely formatted document.
1.3 Data type in computational analysis
1.3.1 Variables
Variables are used to store values in a computer program. Values can be numbers (real and complex), words (strings), matrices, and even tables.
The fundamental (atomic) data types in R are:
- integer: number without decimals
- numeric: number with decimals (stored in double precision)
- character: string, label
- factor: a label with a limited number of categories (levels)
- logical: true/false
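As a small sketch of the types listed above, the following creates one object of each kind and inspects it with “class()” and “typeof()”. All object names and values are made up for illustration.

```r
n_int <- 5L                              # the L suffix creates an integer
x_num <- 3.14                            # numbers with decimals are stored as double
lab   <- "label"                         # character string
grp   <- factor(c("low", "high", "low")) # factor with two categories
flag  <- TRUE                            # logical

class(n_int)   # "integer"
class(x_num)   # "numeric"
typeof(x_num)  # "double"
class(lab)     # "character"
class(grp)     # "factor"
levels(grp)    # "high" "low" (levels are sorted alphabetically)
class(flag)    # "logical"
```

Note that a factor is stored internally as integer codes with a set of levels attached, which is why “typeof(grp)” returns "integer".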
1.3.2 Data structure in R
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they are homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types).
This gives rise to the four data structures most often used in data analysis:
A Vector is a one-dimensional structure which can contain objects of one type only: numerical (integer and double), character, or logical.
# Investigate vector's types:
v1 <- c(0.5, 0.7); v1; typeof(v1)
#> [1] 0.5 0.7
#> [1] "double"
v2 <-c(1:10); v2; typeof(v2)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> [1] "integer"
v3 <- c(TRUE, FALSE); v3; typeof(v3)
#> [1] TRUE FALSE
#> [1] "logical"
v4 <- c("Swiss", "Italy", "France", "Germany"); v4; typeof(v4)
#> [1] "Swiss" "Italy" "France" "Germany"
#> [1] "character"
#Create a sequence from 1 to 5 with a step of 0.5:
v5 <- seq(1, 5, by=0.5); v5; typeof(v5)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
#> [1] "double"
length(v5)
#> [1] 9
summary(v5)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 2 3 3 4 5
#Extract the third element of the vector
v5[3]
#> [1] 2
#Exclude the third element from the vector and save as new vector
v5[-3]
#> [1] 1.0 1.5 2.5 3.0 3.5 4.0 4.5 5.0
w5<-v5[-3]; w5
#> [1] 1.0 1.5 2.5 3.0 3.5 4.0 4.5 5.0
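Because a vector can hold one type only, R silently converts mixed input to a common type. The sketch below illustrates this behaviour, with object names chosen only for this example.

```r
# Mixing types in c() coerces all elements to the most general type
mixed1 <- c(1L, 2.5)        # integer + double    -> double
mixed2 <- c(1, TRUE)        # double + logical    -> double (TRUE becomes 1)
mixed3 <- c(1, "a", TRUE)   # anything + character -> character

typeof(mixed1)  # "double"
typeof(mixed2)  # "double"
typeof(mixed3)  # "character"
mixed3          # "1" "a" "TRUE"
```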
A Matrix is a two-dimensional structure which can contain objects of one type only. The function “matrix()” can be used to construct matrices with specific dimensions.
# Matrix of elements equal to "zero" and dimension 2x5
m1<-matrix(0,2,5); m1 #(two rows by five columns)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 0 0 0 0
#> [2,] 0 0 0 0 0
# Matrix of integer elements (1 to 12, 3x4)
m2<-matrix(1:12, 3,4); m2
#> [,1] [,2] [,3] [,4]
#> [1,] 1 4 7 10
#> [2,] 2 5 8 11
#> [3,] 3 6 9 12
# Extract the second row
m2[2, ]
#> [1] 2 5 8 11
# Extract the third column
m2[,3]
#> [1] 7 8 9
# Extract the second element of the third column
m2[2,3]
#> [1] 8
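By default “matrix()” fills the matrix column by column, as in the example above. The following sketch shows the alternative row-wise filling using the “byrow” argument, along with a few functions for inspecting dimensions. The object name m3 is an assumption chosen to continue the numbering above.

```r
# Fill the matrix row by row instead of column by column
m3 <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
m3

dim(m3)    # 3 4 (rows, columns)
nrow(m3)   # 3
ncol(m3)   # 4

m3[2, ]    # second row: 5 6 7 8
m3[2, 3]   # second element of the third column: 7

# Transpose: rows become columns
dim(t(m3)) # 4 3
```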
1.3.3 Data Frame
A data frame collects data of different types in columns. All columns must have the same length.
A list is a more flexible structure since it can contain variables of different types and lengths. Nevertheless, the preferred structure for statistical analyses and computation is the data frame.
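As a brief sketch of a list holding elements of different types and lengths, consider the following. The element names and values are invented for illustration.

```r
# A list with a character, a numeric vector, and a logical vector
l1 <- list(label  = "sample",
           values = c(1.5, 2.5, 3.5),
           flags  = c(TRUE, FALSE))

str(l1)                      # compact overview of the list
lens <- sapply(l1, length)   # length of each element: 1, 3, 2
lens

l1$values                    # access an element by name with "$"
l1[["label"]]                # or with double square brackets
mean(l1$values)              # 2.5
```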
It is good practice to explore the data frame before performing further computation on the data. This can be accomplished by using the commands “str()” to explore the structure of the data and “summary()” to display summary statistics. For numerical vectors, the command “hist()” can be used to plot a basic histogram of the given values.
# Create the vectors with the variables
cities <- c("Berlin", "New York", "Paris", "Tokyo")
area <- c(892, 1214, 105, 2188)
population <- c(3.4, 8.1, 2.1, 12.9)
continent <- c("Europe", "North America", "Europe", "Asia")
# Combine the vectors into a new data frame
df1 <- data.frame(cities, area, population, continent)
df1
#> cities area population continent
#> 1 Berlin 892 3.4 Europe
#> 2 New York 1214 8.1 North America
#> 3 Paris 105 2.1 Europe
#> 4 Tokyo 2188 12.9 Asia
#Add a column (e.g., language spoken) using the command "cbind"
df2 <- cbind(df1, "Language" = c("German", "English", "French", "Japanese"))
df2
#> cities area population continent Language
#> 1 Berlin 892 3.4 Europe German
#> 2 New York 1214 8.1 North America English
#> 3 Paris 105 2.1 Europe French
#> 4 Tokyo 2188 12.9 Asia Japanese
#Explore the data frame
str(df2) # see the structure
#> 'data.frame': 4 obs. of 5 variables:
#> $ cities : chr "Berlin" "New York" "Paris" "Tokyo"
#> $ area : num 892 1214 105 2188
#> $ population: num 3.4 8.1 2.1 12.9
#> $ continent : chr "Europe" "North America" "Europe" "Asia"
#> $ Language : chr "German" "English" "French" "Japanese"
summary(df2) # compute basic statistics
#> cities area population
#> Length:4 Min. : 105.0 Min. : 2.100
#> Class :character 1st Qu.: 695.2 1st Qu.: 3.075
#> Mode :character Median :1053.0 Median : 5.750
#> Mean :1099.8 Mean : 6.625
#> 3rd Qu.:1457.5 3rd Qu.: 9.300
#> Max. :2188.0 Max. :12.900
#> continent Language
#> Length:4 Length:4
#> Class :character Class :character
#> Mode :character Mode :character
#>
#>
#>
# Use the symbol "$" to address a particular column
pop<-(df2$population)
pop
#> [1] 3.4 8.1 2.1 12.9
hist(pop) # plot the histogram
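Beyond “str()”, “summary()”, and “$”, rows of a data frame can be selected with logical conditions and reordered. The sketch below rebuilds a small version of the data frame so that it runs on its own. The object names (df3, large, by_area) and the threshold of 5 are assumptions for illustration.

```r
# Rebuild a small data frame with the same values as above
df3 <- data.frame(cities     = c("Berlin", "New York", "Paris", "Tokyo"),
                  area       = c(892, 1214, 105, 2188),
                  population = c(3.4, 8.1, 2.1, 12.9),
                  stringsAsFactors = FALSE)

dim(df3)     # 4 3 (rows, columns)
names(df3)   # column names

# Select the rows where population exceeds 5
large <- df3[df3$population > 5, ]
large$cities                              # "New York" "Tokyo"

# Sort the cities by increasing area
by_area <- df3[order(df3$area), "cities"]
by_area                                   # "Paris" "Berlin" "New York" "Tokyo"

mean(df3$area)                            # 1099.75
```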