6  Quantitative Data Analysis

6.1 Introduction

A great deal of data analysis (and visualization) involves the same core set of steps.

have a question → get data → process and clean data → visualize data → analyze data → make conclusions → share ideas

6.2 Some Tools for Analysis

Below we describe some simple data cleaning with R. First, however, we compare several tools for analysis: Excel, Google Sheets, R, and Stata.

| Tool | Cost | Ease of Use | Analysis Capabilities | Suitability for Large Data | Keep Track of Complicated Workflows |
|---|---|---|---|---|---|
| Excel | Comes installed on many computers | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
| Google Sheets | Free with a Google account | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
| R | Free | Challenging | Extensive | Excellent with large datasets | Yes, with script |
| Stata | Some cost | Learning Curve but Intuitive | Extensive | Excellent with large datasets | Yes, with command file |

6.3 Working With R

6.3.1 Our Data

We take a look at our simulated data.

load("./simulate-data/MICSsimulated.RData") # data in R format
labelled::look_for(MICSsimulated) # look at variables and variable labels
 pos variable   label                   col_type missing values
 1   id         id                      int      0             
 2   country    country                 int      0             
 3   GII        Gender Inequality Index int      0             
 4   HDI        Human Development Index int      0             
 5   cd1        spank                   int      0             
 6   cd2        beat                    int      0             
 7   cd3        shout                   int      0             
 8   cd4        explain                 int      0             
 9   aggression aggression              int      0             
head(MICSsimulated) # look at top (head) of data
  id country GII HDI cd1 cd2 cd3 cd4 aggression
1  1       1  20  24   0   0   1   1          1
2  2       1  20  24   0   0   1   1          1
3  3       1  20  24   0   0   1   1          1
4  4       1  20  24   0   0   0   0          1
5  5       1  20  24   1   0   1   1          0
6  6       1  20  24   0   0   1   1          1
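
Base R's dim(), names(), and str() are also commonly used for a first look at a data frame. The sketch below runs them on a small made-up data frame (the name toy and its values are hypothetical, not the MICS data) so that it runs on its own.

```r
# Hypothetical toy data standing in for MICSsimulated (values are made up)
toy <- data.frame(id = 1:6,
                  country = c(1, 1, 1, 2, 2, 2),
                  GII = c(20, 20, 20, 27, 27, 27),
                  aggression = c(1, 1, 0, 1, 0, 1))

dim(toy)   # number of rows, then number of columns
names(toy) # variable names
str(toy)   # type and first few values of every variable
```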

6.3.2 Cleaning Data

There are some basic data cleaning steps that are common to many projects.

Much of R’s functionality is accessed by writing code that is saved in a script. Notice how, as our tasks get more and more complicated, the saved script provides documentation for the decisions that we have made with the data. A sample R script for the steps found in this chapter can be found in Appendix A.

6.3.2.1 Only keep the variables of interest.

We can easily accomplish this with the subset function.

mynewdata <- subset(MICSsimulated,
                    select = c(id, country, aggression)) # subset of data
head(mynewdata) # look at top (head) of data
  id country aggression
1  1       1          1
2  2       1          1
3  3       1          1
4  4       1          1
5  5       1          0
6  6       1          1
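
The same selection can also be written with bracket indexing, and subset() can filter rows in the same call. The sketch below uses a small made-up data frame (toy and its values are hypothetical) rather than the MICS data.

```r
# Hypothetical toy data standing in for MICSsimulated (values are made up)
toy <- data.frame(id = 1:6,
                  country = c(1, 1, 1, 2, 2, 2),
                  GII = c(20, 20, 20, 27, 27, 27),
                  aggression = c(1, 1, 0, 1, 0, 1))

# Keep columns by name with subset(), as in the text
keep_subset <- subset(toy, select = c(id, country, aggression))

# The same columns selected with bracket indexing
keep_bracket <- toy[, c("id", "country", "aggression")]

all.equal(keep_subset, keep_bracket) # check that the two approaches agree

# subset() can also filter rows: keep only the rows where country is 2
country2 <- subset(toy, country == 2,
                   select = c(id, country, aggression))
country2
```

Bracket indexing needs the column names as quoted strings, while subset() accepts them unquoted.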

6.3.2.2 Add variable labels (if we can).

Adding variable labels is still somewhat new in R. The labelled package allows us to add or change variable labels. However, not every package in R recognizes variable labels.

library(labelled) # variable labels

var_label(MICSsimulated$id) <- "id"

var_label(MICSsimulated$country) <- "country"

var_label(MICSsimulated$cd4) <- "explain"
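
After setting a label, it is worth reading it back to confirm that it was stored. The sketch below does this on a made-up data frame (toy is hypothetical). It assumes that labelled keeps a variable label in the column's "label" attribute, which is worth checking against the package documentation, and it falls back to setting that attribute directly when labelled is not installed.

```r
# Hypothetical toy data (made up for illustration, not the MICS data)
toy <- data.frame(id = 1:3, cd4 = c(1, 0, 1))

if (requireNamespace("labelled", quietly = TRUE)) {
  # Same call as in the text, applied to the toy data
  labelled::var_label(toy$cd4) <- "explain"
} else {
  # Fallback when labelled is not installed: set the attribute directly
  # (assumes labelled uses the "label" attribute)
  attr(toy$cd4, "label") <- "explain"
}

attr(toy$cd4, "label") # read the label back with base R
```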

6.3.2.3 Add value labels (if we can).

In contrast, value labels are straightforward in R, and can be added by creating a factor variable. Below we demonstrate how to do this with the cd4 (explain) variable.

MICSsimulated$cd4 <- factor(MICSsimulated$cd4,
                             levels = c(0, 1),
                             labels = c("Did not explain",
                                        "Explained"))
head(MICSsimulated) # head (top) of data
  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
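
Two details of factor() are easy to miss: the underlying integer codes start at 1 rather than at the original 0, and any value not listed in levels becomes NA. The sketch below illustrates both with made-up 0/1 codes (cd4_raw is a hypothetical vector, not the MICS data).

```r
# Made-up 0/1 codes, for illustration only
cd4_raw <- c(1, 1, 1, 0, 1, 1)

cd4 <- factor(cd4_raw,
              levels = c(0, 1),
              labels = c("Did not explain", "Explained"))

levels(cd4) # the value labels, in level order
table(cd4)  # frequency of each label

# The integer codes behind a factor are 1, 2, ... in level order,
# so the original 0/1 coding is not what as.integer() returns
as.integer(cd4)

# Subtract 1 to recover the original 0/1 codes
as.integer(cd4) - 1L

# A value that is not listed in levels (here 9) becomes NA
with_stray <- factor(c(0, 1, 9),
                     levels = c(0, 1),
                     labels = c("Did not explain", "Explained"))
with_stray
```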

6.3.2.4 Recode outliers, values that are errors, or values that should be coded as missing.

We can easily accomplish this using Base R’s syntax for recoding: data$variable[rule] <- newvalue.

MICSsimulated$aggression[MICSsimulated$aggression > 1] <- NA # recode > 1 to NA

MICSsimulated$GII[MICSsimulated$GII > 100] <- NA # recode > 100 to NA
head(MICSsimulated) # head (top) of data
  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
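
In the simulated data the first rows are unchanged by these recodes, so the effect is easier to see on a small example that does contain out-of-range values. The sketch below uses a made-up data frame (toy and its values, including 999, 150, and 9, are hypothetical) and then counts the missing values in each variable.

```r
# Hypothetical toy data with a few deliberately invalid values
toy <- data.frame(id = 1:5,
                  GII = c(20, 24, 999, 27, 150),
                  aggression = c(1, 0, 9, 1, 0))

toy$GII[toy$GII > 100] <- NA             # recode > 100 to NA
toy$aggression[toy$aggression > 1] <- NA # recode > 1 to NA

toy                 # the invalid values now show as NA
colSums(is.na(toy)) # number of missing values in each variable
```

Counting missing values before and after a recode is a quick check that the rule caught only the intended cases.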

6.3.3 Simple Analysis

Our first step in analysis is to discover what kind of variables we have. We need to distinguish between continuous variables, which measure things like mental health, neighborhood safety, or age, and categorical variables, which measure non-ordered categories like religious identity or gender identity.

Sometimes deciding whether a variable is continuous or categorical involves some hard thinking, or referring to the documentation for the data. In this data, all of the forms of discipline, as well as aggression, are 1/0 variables, so they are likely best conceptualized as categorical variables. In contrast, GII and HDI are best conceptualized as continuous variables.

  • For continuous variables, it is most appropriate to take the average or mean.
  • For categorical variables, it is most appropriate to generate a frequency table.

As a mostly command-based language, R relies on the idea of do_something(dataset$variable).

summary(MICSsimulated$GII) # descriptive statistics for GII
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    22.0    24.0    24.2    27.0    31.0 
table(MICSsimulated$cd4) # frequency table of cd4

Did not explain       Explained 
            674            2326
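
Once values have been recoded to NA, mean() returns NA unless missing values are explicitly dropped, and a frequency table is often easier to read as proportions. In the table above, for example, 2326 of 3000 responses (about 78%) are "Explained". The sketch below shows both ideas on a made-up data frame (toy and its values are hypothetical, not the MICS data).

```r
# Hypothetical toy data with one missing GII value
toy <- data.frame(GII = c(15, 22, 24, NA, 31),
                  cd4 = factor(c(1, 1, 0, 1, 1),
                               levels = c(0, 1),
                               labels = c("Did not explain",
                                          "Explained")))

mean(toy$GII)               # NA, because one value is missing
mean(toy$GII, na.rm = TRUE) # mean of the non-missing values

freq <- table(toy$cd4)        # frequency table of cd4
prop <- prop.table(freq)      # the same table as proportions
round(100 * prop, 1)          # proportions as percentages
```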