6 Quantitative Data Analysis
6.1 Introduction
A great deal of data analysis (and visualization) involves the same core set of steps.
6.2 Some Tools for Analysis
Below we describe some simple data cleaning with R. We begin, however, by comparing several different tools for analysis including: Excel, Google Sheets, R, and Stata.
Tool | Cost | Ease of Use | Analysis Capabilities | Suitability for Large Data | Keep Track of Complicated Workflows |
---|---|---|---|---|---|
Excel | Comes installed on many computers | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
Google Sheets | Free with a Google account | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
R | Free | Challenging | Extensive | Excellent with large datasets | Yes, with script |
Stata | Some cost | Learning Curve but Intuitive | Extensive | Excellent with large datasets | Yes, with command file |
6.3 Working With R
6.3.1 Our Data
We take a look at our simulated data.
load("./simulate-data/MICSsimulated.RData") # data in R format
labelled::look_for(MICSsimulated) # look at variables and variable labels
pos variable label col_type missing values
1 id id int 0
2 country country int 0
3 GII Gender Inequality Index int 0
4 HDI Human Development Index int 0
5 cd1 spank int 0
6 cd2 beat int 0
7 cd3 shout int 0
8 cd4 explain int 0
9 aggression aggression int 0
head(MICSsimulated) # look at top (head) of data
id country GII HDI cd1 cd2 cd3 cd4 aggression
1 1 1 20 24 0 0 1 1 1
2 2 1 20 24 0 0 1 1 1
3 3 1 20 24 0 0 1 1 1
4 4 1 20 24 0 0 0 0 1
5 5 1 20 24 1 0 1 1 0
6 6 1 20 24 0 0 1 1 1
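For readers who do not have the .RData file at hand, the six rows printed above can be typed in directly to make a small stand-in data frame. The name MICStoy is our own invention for this sketch and is not part of the original data. It reproduces only the printed rows, not the full dataset.

```r
# Hypothetical stand-in: the six rows shown by head() above, typed in by hand
MICStoy <- data.frame(
  id         = 1:6,
  country    = rep(1L, 6),
  GII        = rep(20L, 6),
  HDI        = rep(24L, 6),
  cd1        = c(0L, 0L, 0L, 0L, 1L, 0L),  # spank
  cd2        = rep(0L, 6),                 # beat
  cd3        = c(1L, 1L, 1L, 0L, 1L, 1L),  # shout
  cd4        = c(1L, 1L, 1L, 0L, 1L, 1L),  # explain
  aggression = c(1L, 1L, 1L, 1L, 0L, 1L)
)

dim(MICStoy)  # number of rows and number of columns
str(MICStoy)  # variable names and storage types
```

A stand-in like this is only useful for trying out syntax; any statistics computed from it say nothing about the full simulated data.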
6.3.2 Cleaning Data
There are some basic data cleaning steps that are common to many projects.
- Only keep the variables of interest (Section 6.3.2.1).
- Add variable labels, if we can (Section 6.3.2.2).
- Add value labels, if we can (Section 6.3.2.3).
- Recode outliers, values that are errors, or values that should be coded as missing (Section 6.3.2.4).
Much of R’s functionality is accomplished by writing code that is saved in a script. Notice how, as our tasks get more and more complicated, the saved script provides documentation for the decisions that we have made with the data. A sample R script for the steps found in this chapter can be found in Appendix A.
6.3.2.1 Only keep the variables of interest.
We can easily accomplish this with the subset function.
mynewdata <- subset(MICSsimulated,
                    select = c(id, country, aggression)) # subset of data
head(mynewdata) # look at top (head) of data
id country aggression
1 1 1 1
2 2 1 1
3 3 1 1
4 4 1 1
5 5 1 0
6 6 1 1
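The same selection can also be written with base R bracket indexing, which some readers may find easier to use inside functions. This is a minimal sketch on a made-up data frame (toy and keep are hypothetical names, not part of the original data), shown alongside subset for comparison.

```r
# Hypothetical toy data with one extra column (GII) that we want to drop
toy <- data.frame(
  id         = 1:6,
  country    = rep(1L, 6),
  GII        = rep(20L, 6),
  aggression = c(1L, 1L, 1L, 1L, 0L, 1L)
)

keep <- c("id", "country", "aggression")  # variables of interest

by_subset  <- subset(toy, select = c(id, country, aggression))  # as in the text
by_bracket <- toy[, keep]                                       # bracket indexing

names(by_subset)   # columns kept by subset()
names(by_bracket)  # columns kept by bracket indexing
```

Both approaches keep the same three columns. The bracket form takes the variable names as a character vector, so the list of variables can be stored and reused.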
6.3.2.2 Add variable labels (if we can).
Adding variable labels is still somewhat new in R. The labelled library allows us to add or change variable labels. However, not every library in R recognizes variable labels.
library(labelled) # variable labels
var_label(MICSsimulated$id) <- "id"
var_label(MICSsimulated$country) <- "country"
var_label(MICSsimulated$cd4) <- "explain"
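One reason that not every library sees these labels is that a variable label is, as far as we understand the labelled package, stored as a plain "label" attribute on the column. The sketch below sets that attribute with base R on a made-up data frame (toy is a hypothetical name) rather than calling labelled itself, so the equivalence should be read as an assumption.

```r
# Hypothetical toy data
toy <- data.frame(id = 1:3, cd4 = c(1L, 1L, 0L))

# Assumption: this mirrors what var_label(toy$cd4) <- "explain" would store
attr(toy$cd4, "label") <- "explain"

attr(toy$cd4, "label")  # read the label back
attr(toy$id, "label")   # NULL: no label has been set on this column
```

A function that never looks at this attribute will simply ignore the label, which is consistent with the uneven support described above.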
6.3.2.3 Add value labels (if we can).
In contrast, value labels are straightforward in R, and can be accomplished by creating a factor variable. Below we demonstrate how to do this with the cd4 variable.
MICSsimulated$cd4 <- factor(MICSsimulated$cd4,
                            levels = c(0, 1),
labels = c("Did not explain",
"Explained"))
head(MICSsimulated) # head (top) of data
id country GII HDI cd1 cd2 cd3 cd4 aggression
1 1 1 20 24 0 0 1 Explained 1
2 2 1 20 24 0 0 1 Explained 1
3 3 1 20 24 0 0 1 Explained 1
4 4 1 20 24 0 0 0 Did not explain 1
5 5 1 20 24 1 0 1 Explained 0
6 6 1 20 24 0 0 1 Explained 1
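One detail worth knowing when converting a 1/0 variable to a factor is that the factor's internal integer codes start at 1, not 0. The sketch below uses a short made-up vector (x and f are hypothetical names) to show this.

```r
# cd4 values from the six rows printed above
x <- c(1, 1, 1, 0, 1, 1)

f <- factor(x,
            levels = c(0, 1),
            labels = c("Did not explain", "Explained"))

levels(f)          # the value labels, in level order
as.integer(f)      # internal codes: 1 = first level, 2 = second level
as.integer(f) - 1  # recovers the original 0/1 coding for this variable
```

If the numeric 0/1 version is needed later, for example to compute a proportion with mean, it can be safer to store the factor in a new variable rather than overwrite the original.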
6.3.2.4 Recode outliers, values that are errors, or values that should be coded as missing.
We can easily accomplish this using base R’s syntax for recoding: data$variable[rule] <- newvalue.
MICSsimulated$aggression[MICSsimulated$aggression > 1] <- NA # recode > 1 to NA
MICSsimulated$GII[MICSsimulated$GII > 100] <- NA # recode > 100 to NA
head(MICSsimulated) # head (top) of data
id country GII HDI cd1 cd2 cd3 cd4 aggression
1 1 1 20 24 0 0 1 Explained 1
2 2 1 20 24 0 0 1 Explained 1
3 3 1 20 24 0 0 1 Explained 1
4 4 1 20 24 0 0 0 Did not explain 1
5 5 1 20 24 1 0 1 Explained 0
6 6 1 20 24 0 0 1 Explained 1
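None of the six rows printed above exceed the cutoffs, so the effect of the recoding is not visible there. The sketch below applies the same rule to a made-up data frame (toy is hypothetical, and the values 7 and 999 are planted errors) so that the change can be seen.

```r
# Hypothetical toy data with two planted out-of-range values
toy <- data.frame(
  aggression = c(1, 0, 7, 1),      # 7 is not a valid 1/0 value
  GII        = c(20, 999, 24, 31)  # 999 is above the cutoff of 100
)

toy$aggression[toy$aggression > 1] <- NA  # recode > 1 to NA
toy$GII[toy$GII > 100] <- NA              # recode > 100 to NA

toy                  # the planted values now show as NA
colSums(is.na(toy))  # number of missing values per variable
```

Counting missing values before and after a recode is a quick check that the rule changed only the rows we intended.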
6.3.3 Simple Analysis
Our first step in analysis is to discover what kind of variables we have. We need to make a distinction between continuous variables that measure things like mental health, neighborhood safety, or age, and categorical variables that measure non-ordered categories like religious identity or gender identity.
Sometimes deciding whether a variable is continuous or categorical involves some hard thinking, or referring to the documentation for the data. In this data, all of the forms of discipline, as well as aggression, are 1/0 variables, so they are likely best conceptualized as categorical variables. In contrast, GII and HDI are best conceptualized as continuous variables.
- For continuous variables, it is most appropriate to take the average or mean.
- For categorical variables, it is most appropriate to generate a frequency table.
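A rough first pass at sorting variables into these two groups is to check which columns contain only 0 and 1. This is a minimal sketch on made-up values (toy and is_binary are hypothetical names); it is a heuristic and does not replace reading the documentation for the data.

```r
# Hypothetical toy data: GII values are made up, cd1 and aggression are 1/0
toy <- data.frame(
  GII        = c(15, 22, 24, 27, 31, 24),
  cd1        = c(0, 0, 0, 0, 1, 0),
  aggression = c(1, 1, 1, 1, 0, 1)
)

# TRUE when every value in a column is 0, 1, or missing
is_binary <- sapply(toy, function(col) all(col %in% c(0, 1, NA)))

is_binary  # candidates for frequency tables are marked TRUE
```

Columns flagged TRUE are candidates for a frequency table, and the rest are candidates for a mean, subject to the judgment described above.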
As a mostly command-based language, R relies on the idea of do_something(dataset$variable).
summary(MICSsimulated$GII) # descriptive statistics for GII
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.0 22.0 24.0 24.2 27.0 31.0
table(MICSsimulated$cd4) # frequency table of cd4
Did not explain Explained
674 2326
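Two small extensions of these commands are often useful: asking mean to skip missing values, and turning a frequency table into proportions. The sketch below uses made-up vectors (GII and cd4 here are hypothetical stand-ins, not the simulated data), so its numbers do not match the output above.

```r
# Hypothetical values: a continuous variable with one missing value
GII <- c(15, 22, 24, 27, 31, NA)

# Hypothetical 1/0 variable converted to a factor with value labels
cd4 <- factor(c(1, 1, 1, 0, 1, 1),
              levels = c(0, 1),
              labels = c("Did not explain", "Explained"))

mean(GII)                            # NA, because one value is missing
mean_GII <- mean(GII, na.rm = TRUE)  # mean of the non-missing values
mean_GII

freq <- table(cd4)        # frequency table
prop <- prop.table(freq)  # the same table expressed as proportions
freq
prop
```

Proportions make it easier to compare categories across groups of different sizes than raw counts do.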