4  A Quick Introduction to R

4.1 Why Use R?

R (R Core Team, 2023) has a reputation for being difficult to learn, and a lot of that reputation is deserved. However, it is possible to teach R in an accessible way, and a little bit of R can take you a long way.

R is open source, and therefore free, statistical software that is particularly good at obtaining, analyzing and visualizing data.

R Commands are stored in a script or code file that usually ends in .R, e.g. myscript.R. The command file is distinct from your actual data, stored in an .RData file, e.g. mydata.RData.

A great deal of data analysis and visualization involves the same core set of steps.

Given the fact that we often want to apply the same core set of tasks to new questions and new data, there are ways to overcome the steep learning curve and learn a replicable set of commands that can be applied to problem after problem. The same 5 to 10 lines of R code can often be tweaked over and over again for multiple projects.

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#FFC20E',
      'primaryTextColor': '#000000',
      'primaryBorderColor': '#2D2926',
      'lineColor': '#2D2926',
      'secondaryColor': '#2D2926',
      'secondaryTextColor': '#000000',
      'tertiaryColor': '#F2F2F2',
      'tertiaryBorderColor': '#2D2926'
    }
  }
}%%

flowchart LR
  A(have a question) --> B(get data)
  B --> C(process and clean data) 
  C --> D(visualize data)
  D --> E(analyze data)
  E --> F(make conclusions)
  F --> G(share ideas)
  

4.2 Get R

R is available at https://www.r-project.org/. R is a lot easier to run if you run it from RStudio, http://www.rstudio.com.

4.3 Get Data

Data may already be in R format, or may come from other types of data files like SPSS, Stata, or Excel. Especially in beginning R programming, getting the data into R can be the most complicated part of your program.

4.3.1 Data in R Format

load("./simulate-data/MICSsimulated.RData") # data in R format

4.3.2 Data in Other Formats

If data are in other formats, slightly different code may be required.

library(haven) # library for importing data 

mydata <- read_sav("the/path/to/mySPSSfile.sav") # SPSS
mydata <- read_dta("the/path/to/myStatafile.dta") # Stata

library(readxl) # library for importing Excel files

mydata <- read_excel("the/path/to/mySpreadsheet.xls")

save(mydata, file = "mydata.RData") # save in R format

4.4 Process and Clean Data

4.4.1 The $ Sign

The $ sign is a kind of “connector”. mydata$x means: “The variable x in the dataset called mydata”.

4.4.2 Recoding Data

Data sometimes need to be recoded. For example, outliers may need to be changed to missing, or a value that is supposed to indicated missing data (e.g. -9) may need to be changed to missing.

Recoding uses the following construction:

data$variable[condition] <- new value

For example, change an outlier value: When cd1 is 2 change it to missing (NA).

MICSsimulated$cd1[MICSsimulated$cd1 == 2] <- NA # outlier (2) to NA

Change variable cd1 to missing (NA) when it is -9.

MICSsimulated$cd1[MICSsimulated$cd1 == -9] <- NA # missing (-9) to NA

4.4.3 Numeric and Factor Variables

R makes a strong distinction between continuous numeric variables that measure scales like mental health or neighborhood safety, and categorical factor variables that measure non-ordered categories like religious identity or gender identity.

Many statistical and graphical procedures are designed to recognize and work with different variable types. You often don’t need to use all of the options. e.g. mydata$w <- factor(mydata$z) will often work just fine. Changing variables from factor to numeric, and vice versa can sometimes be the simple solution that solves a lot of problems when you are trying to graph your variables.

MICSsimulated$aggression <- 
  factor(MICSsimulated$aggression, # original numeric variable
         levels = c(0, 1), 
         labels = c("no aggression", "aggression"), 
         ordered = TRUE) # whether order matters

# MICSsimulated$z <- as.numeric(MICSsimulated$w) # factor to numeric

4.5 Visualize Data

4.5.1 Histogram

hist(MICSsimulated$GII, # what I'm graphing
        main = "Gender Inequality Index", # title
        xlab = "GII", # label for x axis
        col = "blue") # color
Figure 4.1: Histogram of Gender Inequality Index
Tip

You often don’t need to use all of the options. e.g. hist(mydata$x) will work just fine.

4.5.2 Barplot

barplot(table(MICSsimulated$aggression), # what I'm graphing
        main = "Child Displays Aggression", # title
        xlab = "Aggression", # label for x axis
        col = "gold") # color
Figure 4.2: Barplot of Aggression
Tip

You often don’t need to use all of the options. e.g. barplot(table(mydata$z)) will work just fine.

4.6 Analyze Data: Descriptive Statistics

summary(mydata$x) # for continuous or factor variables

table(mydata$z) # especially suitable for factor variables
summary(MICSsimulated$GII)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    22.0    24.0    24.2    27.0    31.0 
table(MICSsimulated$aggression)

no aggression    aggression 
         1316          1684