6 Quantitative Data Analysis

6.1 Introduction

A great deal of data analysis (and visualization) involves the same core set of steps.

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#FFC20E',
      'primaryTextColor': '#000000',
      'primaryBorderColor': '#2D2926',
      'lineColor': '#2D2926',
      'secondaryColor': '#2D2926',
      'secondaryTextColor': '#000000',
      'tertiaryColor': '#F2F2F2',
      'tertiaryBorderColor': '#2D2926'
    }
  }
}%%

flowchart LR
  A(have a <br>question) --> B(get data)
  B --> B2(select <br>variables)
  B2 --> C(process and <br>clean data) 
  C --> D(visualize <br>data)
  D --> E(analyze <br>data)
  E --> F(make <br>conclusions)
  F --> G(share <br>ideas)

A Common Workflow

6.2 Some Tools for Analysis

Below we describe some simple data cleaning with R. We begin, however, by comparing several tools for analysis: Excel, Google Sheets, R, and Stata.

Tool	Cost	Ease of Use	Analysis Capabilities	Suitability for Large Data	Keep Track of Complicated Workflows
Excel	Comes installed on many computers	Easy	Limited	Difficult when N > 100	Difficult to Impossible
Google Sheets	Free with a Google account	Easy	Limited	Difficult when N > 100	Difficult to Impossible
R	Free	Challenging	Extensive	Excellent with large datasets	Yes, with script
Stata	Some cost	Learning Curve but Intuitive	Extensive	Excellent with large datasets	Yes, with command file

6.3 Working With R

6.3.1 Our Data

We take a look at our simulated data.

load("./simulate-data/MICSsimulated.RData") # data in R format

labelled::look_for(MICSsimulated) # look at variables and variable labels

 pos variable   label                   col_type missing values
 1   id         id                      int      0             
 2   country    country                 int      0             
 3   GII        Gender Inequality Index int      0             
 4   HDI        Human Development Index int      0             
 5   cd1        spank                   int      0             
 6   cd2        beat                    int      0             
 7   cd3        shout                   int      0             
 8   cd4        explain                 int      0             
 9   aggression aggression              int      0

head(MICSsimulated) # look at top (head) of data

  id country GII HDI cd1 cd2 cd3 cd4 aggression
1  1       1  20  24   0   0   1   1          1
2  2       1  20  24   0   0   1   1          1
3  3       1  20  24   0   0   1   1          1
4  4       1  20  24   0   0   0   0          1
5  5       1  20  24   1   0   1   1          0
6  6       1  20  24   0   0   1   1          1
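If the labelled package is not installed, base R offers a similar overview through str(), names(), and dim(). The sketch below uses a small hypothetical data frame, toy, standing in for MICSsimulated; its values are invented for illustration only.

```r
# Hypothetical stand-in for MICSsimulated (values invented for illustration)
toy <- data.frame(id = 1:4,
                  country = c(1L, 1L, 2L, 2L),
                  GII = c(20L, 20L, 27L, 27L),
                  aggression = c(1L, 0L, 1L, 1L))

str(toy)    # variable names, types, and the first few values
names(toy)  # variable names only
dim(toy)    # number of rows, then number of columns
```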

6.3.2 Cleaning Data

There are some basic data cleaning steps that are common to many projects.

Only keep the variables of interest. Section 6.3.2.1
Add variable labels (if we can). Section 6.3.2.2
Add value labels (if we can). Section 6.3.2.3
Recode outliers, values that are errors, or values that should be coded as missing. Section 6.3.2.4

Much of R’s functionality is accessed by writing code that is saved in a script. Notice how, as our tasks get more and more complicated, the saved script documents the decisions that we have made with the data. A sample R script for the steps in this chapter can be found in Appendix A.

6.3.2.1 Only keep the variables of interest.

We can easily accomplish this with the subset function.

mynewdata <- subset(MICSsimulated,
                    select = c(id, country, aggression)) # subset of data

head(mynewdata) # look at top (head) of data

  id country aggression
1  1       1          1
2  2       1          1
3  3       1          1
4  4       1          1
5  5       1          0
6  6       1          1
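The same selection can also be written with bracket indexing and quoted column names, a form that is often preferred inside longer scripts and functions. A minimal sketch, assuming a hypothetical toy data frame rather than the real data:

```r
# Hypothetical data (values invented for illustration)
toy <- data.frame(id = 1:3,
                  country = c(1L, 1L, 2L),
                  GII = c(20L, 20L, 27L),
                  aggression = c(1L, 0L, 1L))

keep_subset  <- subset(toy, select = c(id, country, aggression)) # as in the text
keep_bracket <- toy[, c("id", "country", "aggression")]          # bracket indexing

all.equal(keep_subset, keep_bracket) # check that both approaches agree
```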

6.3.2.2 Add variable labels (if we can).

Adding variable labels is still somewhat new in R. The labelled package allows us to add or change variable labels. However, not every package in R recognizes variable labels.

library(labelled) # variable labels

var_label(MICSsimulated$id) <- "id"

var_label(MICSsimulated$country) <- "country"

var_label(MICSsimulated$cd4) <- "explain"
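Under the hood, a variable label is an ordinary attribute on the column: labelled stores it in an attribute named "label". The sketch below sets and reads that attribute directly in base R on a hypothetical toy data frame. It illustrates the storage mechanism and one way a label can get lost; it is not a replacement for var_label().

```r
# Hypothetical data (values invented for illustration)
toy <- data.frame(id = 1:3, cd4 = c(1L, 0L, 1L))

attr(toy$cd4, "label") <- "explain" # the attribute that var_label() works with
attr(toy$cd4, "label")              # read the label back

# Labels are plain attributes, so functions that strip attributes drop them
attr(as.numeric(toy$cd4), "label")  # NULL: the label did not survive coercion
```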

6.3.2.3 Add value labels (if we can).

In contrast, value labels are straightforward in R, and can be added by converting a variable to a factor. Below we demonstrate how to do this with the cd4 variable.

MICSsimulated$cd4 <- factor(MICSsimulated$cd4,
                             levels = c(0, 1),
                             labels = c("Did not explain",
                                        "Explained"))

head(MICSsimulated) # head (top) of data

  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
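One thing to watch after converting to a factor: the original 0/1 codes are replaced by internal level positions 1 and 2, so as.integer() no longer returns the original values. A minimal sketch with a hypothetical vector of codes:

```r
cd4_raw <- c(1L, 1L, 0L, 1L) # hypothetical 0/1 codes

cd4 <- factor(cd4_raw,
              levels = c(0, 1),
              labels = c("Did not explain", "Explained"))

levels(cd4)          # "Did not explain" "Explained"
as.integer(cd4)      # 2 2 1 2 (level positions, not the original codes)
as.integer(cd4) - 1L # 1 1 0 1 (recovers the 0/1 coding in this case)
```

If the numeric codes are needed later, it can be safer to store the factor in a new variable and keep the original column unchanged.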

6.3.2.4 Recode outliers, values that are errors, or values that should be coded as missing.

We can accomplish this using base R’s syntax for recoding: data$variable[rule] <- newvalue.

MICSsimulated$aggression[MICSsimulated$aggression > 1] <- NA # recode > 1 to NA

MICSsimulated$GII[MICSsimulated$GII > 100] <- NA # recode > 100 to NA

head(MICSsimulated) # head (top) of data

  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
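The first six rows contain no out-of-range values, so head() looks the same before and after recoding. Counting missing values is a more direct check that the recode did what was intended. The sketch below uses a hypothetical vector with two invalid values, invented for illustration:

```r
gii <- c(20, 24, 999, 27, 150) # hypothetical values; 999 and 150 are out of range

sum(gii > 100)       # how many values will be recoded: 2
gii[gii > 100] <- NA # same recoding pattern as above
sum(is.na(gii))      # how many values are now missing: 2
```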

6.3.3 Simple Analysis

Our first step in analysis is to discover what kind of variables we have. We need to distinguish between continuous variables, which measure things like mental health, neighborhood safety, or age, and categorical variables, which measure non-ordered categories like religious identity or gender identity.

Sometimes deciding whether a variable is continuous or categorical involves some hard thinking, or referring to the documentation for the data. In this data, all of the forms of discipline, as well as aggression, are 1/0 variables, so they are likely best conceptualized as categorical variables. In contrast, GII and HDI are best conceptualized as continuous variables.

For continuous variables, it is most appropriate to take the average or mean.
For categorical variables, it is most appropriate to generate a frequency table.

As a mostly command-based language, R relies on the idea of do_something(dataset$variable).

summary(MICSsimulated$GII) # descriptive statistics for GII

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    22.0    24.0    24.2    27.0    31.0
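summary() sets missing values aside automatically, but mean() does not. Once values have been recoded to NA, mean() returns NA unless na.rm = TRUE is supplied. A small sketch with hypothetical values:

```r
gii <- c(20, 24, NA, 27) # hypothetical values with one missing

mean(gii)                # NA, because one value is missing
mean(gii, na.rm = TRUE)  # mean of the three non-missing values
```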

table(MICSsimulated$cd4) # frequency table of cd4


Did not explain       Explained 
            674            2326
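Frequencies are often easier to interpret as proportions. One way to get them in base R is prop.table(). The sketch below re-enters the two counts from the table above by hand as a named vector (the name counts is introduced here for illustration); with the data loaded, prop.table(table(MICSsimulated$cd4)) should give the same proportions.

```r
# Counts copied from the frequency table above
counts <- c("Did not explain" = 674, "Explained" = 2326)

prop.table(counts)           # each count divided by the total
round(prop.table(counts), 3) # 0.225 and 0.775
```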