4 A Quick Introduction to R
4.1 Why Use R?
R (R Core Team, 2023) has a reputation for being difficult to learn, and a lot of that reputation is deserved. However, it is possible to teach R in an accessible way, and a little bit of R can take you a long way.
R is open source, and therefore free, statistical software that is particularly good at obtaining, analyzing and visualizing data.
R commands are stored in a script or code file that usually ends in .R, e.g. myscript.R. The command file is distinct from your actual data, stored in an .RData file, e.g. mydata.RData.
A great deal of data analysis and visualization involves the same core set of steps.
Because we often want to apply the same core set of tasks to new questions and new data, there is a way to overcome the steep learning curve: learn a replicable set of commands that can be applied to problem after problem. The same 5 to 10 lines of R code can often be tweaked over and over again for multiple projects.
4.2 Get R
R is available at https://www.r-project.org/. R is a lot easier to run if you run it from RStudio, http://www.rstudio.com.
4.3 Get Data
Data may already be in R format, or may come from other types of data files like SPSS, Stata, or Excel. Especially in beginning R programming, getting the data into R can be the most complicated part of your program.
4.3.1 Data in R Format
load("./simulate-data/MICSsimulated.RData") # data in R format
4.3.2 Data in Other Formats
If data are in other formats, slightly different code may be required.
library(haven) # library for importing data
mydata <- read_sav("the/path/to/mySPSSfile.sav") # SPSS
mydata <- read_dta("the/path/to/myStatafile.dta") # Stata
library(readxl) # library for importing Excel files
mydata <- read_excel("the/path/to/mySpreadsheet.xls")
save(mydata, file = "mydata.RData") # save in R format
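As a minimal sketch of the save and load round trip, using only base R: the data frame `toy` and the temporary file path below are hypothetical, created just for illustration.

```r
# Hypothetical toy data frame, standing in for imported data
toy <- data.frame(id = 1:3, score = c(10, 20, 30))

# Save to a temporary .RData file (illustrative path, not from the text)
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)

# Remove the object, then restore it from disk
rm(toy)
loaded_names <- load(tmp)  # load() returns the names of the restored objects

print(loaded_names)        # the object comes back under its original name
print(toy)
```

Note that load() restores objects under the names they had when they were saved, so no assignment with <- is needed.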
4.4 Process and Clean Data
4.4.1 The $ Sign
The $ sign is a kind of “connector”. mydata$x means: “The variable x in the dataset called mydata”.
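A small sketch of the $ connector in use. The dataset and its values here are made up for illustration.

```r
# Hypothetical dataset with two variables, x and z
mydata <- data.frame(x = c(2, 4, 6), z = c("a", "b", "a"))

x_values <- mydata$x      # the variable x in the dataset mydata
x_mean <- mean(mydata$x)  # functions can use the variable directly

print(x_values)
print(x_mean)
```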
4.4.2 Recoding Data
Data sometimes need to be recoded. For example, outliers may need to be changed to missing, or a value that is supposed to indicate missing data (e.g. -9) may need to be changed to missing.
Recoding uses the following construction:
data$variable[condition] <- new value
For example, change an outlier value: when cd1 is 2, change it to missing (NA).
MICSsimulated$cd1[MICSsimulated$cd1 == 2] <- NA # outlier (2) to NA
Change variable cd1 to missing (NA) when it is -9.
MICSsimulated$cd1[MICSsimulated$cd1 == -9] <- NA # missing (-9) to NA
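The recoding construction can be tried on a small self-contained example. This is only a sketch: the data frame `toy` and the outlier value 99 are hypothetical and are not part of the MICS data.

```r
# Hypothetical data: -9 is a missing-data code, 99 is an implausible outlier
toy <- data.frame(cd1 = c(0, 1, -9, 1, 99, 0))

toy$cd1[toy$cd1 == -9] <- NA  # missing-data code to NA
toy$cd1[toy$cd1 == 99] <- NA  # outlier to NA

n_missing <- sum(is.na(toy$cd1))  # count missing values after recoding

print(toy$cd1)
print(n_missing)
```

Counting the NA values with is.na() after recoding is a quick way to confirm that the recode changed as many cases as expected.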
4.4.3 Numeric and Factor Variables
R makes a strong distinction between continuous numeric variables that measure scales like mental health or neighborhood safety, and categorical factor variables that measure non-ordered categories like religious identity or gender identity.
Many statistical and graphical procedures are designed to recognize and work with different variable types. You often don’t need to use all of the options; e.g., mydata$w <- factor(mydata$z) will often work just fine. Changing variables from factor to numeric, and vice versa, can sometimes be the simple solution that solves a lot of problems when you are trying to graph your variables.
MICSsimulated$aggression <- factor(MICSsimulated$aggression, # original numeric variable
levels = c(0, 1),
labels = c("no aggression", "aggression"),
ordered = TRUE) # whether order matters
# MICSsimulated$z <- as.numeric(MICSsimulated$w) # factor to numeric
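One caution when converting from factor back to numeric: as.numeric() applied to a factor returns the internal level codes, which may not be the original values. The sketch below, using made-up values, shows the difference and one common workaround of converting through as.character() first.

```r
# Hypothetical numeric variable with values 10 and 20
z <- c(10, 20, 10, 20, 20)

w <- factor(z)  # numeric to factor, default options

codes <- as.numeric(w)                   # internal level codes: 1, 2, 1, 2, 2
original <- as.numeric(as.character(w))  # original values: 10, 20, 10, 20, 20

print(levels(w))
print(codes)
print(original)
```

When the original values already happen to be 1, 2, 3, and so on, the two approaches give the same result, which is why the difference is easy to miss.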
4.5 Visualize Data
4.5.1 Histogram
hist(MICSsimulated$GII, # what I'm graphing
main = "Gender Inequality Index", # title
xlab = "GII", # label for x axis
col = "blue") # color
You often don’t need to use all of the options; e.g., hist(mydata$x) will work just fine.
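To see what a histogram is counting, it can help to ask hist() for the bins without drawing anything. The following is a sketch on made-up values; the vector `x` and the choice of breaks are assumptions for illustration only.

```r
# Hypothetical values, not from the MICS data
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# plot = FALSE returns the binning instead of drawing the histogram
h <- hist(x, breaks = 0:5, plot = FALSE)

print(h$breaks)  # bin edges
print(h$counts)  # number of values in each bin
```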
4.5.2 Barplot
barplot(table(MICSsimulated$aggression), # what I'm graphing
main = "Child Displays Aggression", # title
xlab = "Aggression", # label for x axis
col = "gold") # color
You often don’t need to use all of the options; e.g., barplot(table(mydata$z)) will work just fine.
4.6 Analyze Data: Descriptive Statistics
summary(mydata$x) # for continuous or factor variables
table(mydata$z) # especially suitable for factor variables
summary(MICSsimulated$GII)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   15.0    22.0    24.0    24.2    27.0    31.0
table(MICSsimulated$aggression)
no aggression    aggression
         1316          1684
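Descriptive statistics interact with missing values, so here is a minimal sketch of how summary(), table(), and mean() behave when NA is present. The data frame `toy` is hypothetical.

```r
# Hypothetical data with one missing value in each variable
toy <- data.frame(
  x = c(15, 22, 24, NA, 31),
  z = factor(c("no", "yes", "yes", NA, "no"))
)

print(summary(toy$x))                 # reports the NA count alongside the summary
print(table(toy$z, useNA = "ifany"))  # adds a category for missing values

x_mean <- mean(toy$x, na.rm = TRUE)   # drop NAs before averaging
print(x_mean)
```

By default table() silently leaves out missing values, and mean() returns NA if any value is missing, so the useNA and na.rm arguments are worth knowing early.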