Appendix B — Simulating MICS Data

This appendix details the process of creating the simulated MICS data that is employed in the examples on this website.

MICS data are freely available, but usage of MICS requires completing a user agreement, and registering for a user account, on the MICS website, and thus MICS data should not be shared openly on a public website.

This Appendix is highly technical. It is not necessary to understand this Appendix to benefit from the rest of this website. However, the details of creating this simulated data may be of interest to some users.

B.1 Call Relevant Libraries

We need to call a number of relevant R libraries to simulate the data.

Show the code
library(tibble) # new dataframes

library(ggplot2) # nifty graphs

library(labelled) # labels

library(haven) # write Stata

library(tidyr) # tidy data

library(dplyr) # wrangle data

library(lme4) # multilevel models

library(sjPlot) # nice tables for MLM

library(pander) # nice tables

B.2 Setup Some Basic Parameters of the Data

Because simulation is a random process, we set a random seed so that the simulation produces the same data set each time it is run.

We are going to simulate data with 30 countries, and 100 individuals per country.

Show the code
set.seed(1234) # random seed

N_countries <- 30 # number of countries

N <- 100 # sample size / country

B.3 Simulate Data Based on MICS

This is multilevel data where individuals are nested, or clustered, inside countries. Excellent technical and pedagogical discussions of multilevel models can be found in Raudenbush & Bryk (2002), Singer & Willett (2003), Rabe-Hesketh & Skrondal (2022), Luke (2004), and Kreft & de Leeuw (1998).

B.3.1 Level 2

Simulating the second level of the data is relatively easy. We simply need to provide the number of countries, and then generate random effects for each country. Random effects are discussed in the above references, but essentially represent country level differences in the data.

We also create GII, a Gender Inequality Index (United Nations Development Program, 2023) variable, and HDI, a measure of the Human Development Index (United Nations Development Program, 2022), since these are country level, or Level 2 variables.

Show the code
country <- seq(1:N_countries) # sequence 1 to 30

GII <- rbinom(N_countries, 100, .25) # gender inequality index

HDI <- rbinom(N_countries, 100, .25) # Human Development Index

u0 <- rnorm(N_countries, 0, .25) # random intercept

u1 <- rnorm(N_countries, 0, .05) # random slope

randomeffects <- data.frame(country, 
                            GII, 
                            HDI, 
                            u0, 
                            u1) # dataframe of random effects

B.3.2 Level 1

Simulating the Level 1 data is more complex.

We uncount the data by 100 to create 100 observations for each country. We then create an id number.

We create randomly simulated parental discipline variables with proportions similar to those in MICS.

Lastly, we need to create the dependent variable. Because this is a dichotomous outcome, the process is somewhat complex. We need to craete a linear combination z, using regression weights derived from MICS. We then calculate predicted probabilities, and lastly generate a dichotomous aggression outcome from those probabilities.

Show the code
MICSsimulated <- randomeffects %>% 
  uncount(N) %>% # N individuals / country
  mutate(id = row_number()) %>% # unique id
  mutate(cd1 = rbinom(N * N_countries, 1, .38), # spank
         cd2 = rbinom(N * N_countries, 1, .05), # beat
         cd3 = rbinom(N * N_countries, 1, .64), # shout
         cd4 = rbinom(N * N_countries, 1, .78)) %>% # explain
  mutate(z = 0 + # linear combination based on MICS
           .01 * GII +
           .23 * cd1 +
           .52 * cd2 +
           .42 * cd3 +
           -.21 * cd4 + 
           u0) %>%
  mutate(p = exp(z) / (1 + exp(z))) %>% # probability
  mutate(aggression = rbinom(N * N_countries, 1, p)) %>% # binomial y
  select(id, country, GII, HDI,
         cd1, cd2, cd3, cd4, 
         aggression)

B.3.3 Variable Labels

We add variable labels to the data which will help us to understand the data as we analyze it.

Show the code
var_label(MICSsimulated$id) <- "id"

var_label(MICSsimulated$country) <- "country"

var_label(MICSsimulated$GII) <- "Gender Inequality Index"

var_label(MICSsimulated$HDI) <- "Human Development Index"

var_label(MICSsimulated$cd1) <- "spank"

var_label(MICSsimulated$cd2) <- "beat"

var_label(MICSsimulated$cd3) <- "shout"

var_label(MICSsimulated$cd4) <- "explain"

var_label(MICSsimulated$aggression) <- "aggression"
Show the code
pander(labelled::look_for(MICSsimulated)[1:4]) # list out variable labels
Table B.1: Variable Labels
pos variable label col_type
1 id id int
2 country country int
3 GII Gender Inequality Index int
4 HDI Human Development Index int
5 cd1 spank int
6 cd2 beat int
7 cd3 shout int
8 cd4 explain int
9 aggression aggression int

B.4 Explore The Simulated Data With A Graph

Exploring the simulated data with a graph helps us to ensure that we have simulated plausible data.

Show the code
ggplot(MICSsimulated,
       aes(x = cd1, # x is spanking
           y = aggression, # y is aggression
           color = factor(country))) + # color is country
  geom_smooth(method = "glm", # glm smoother
              method.args = list(family = "binomial"),
              alpha = .1) + # transparency for CI's
  labs(title = "Aggression as a Function of Spanking",
       x = "spank",
       y = "aggression") +
  scale_color_viridis_d(name = "Country") + # nice colors
  theme_minimal()
Figure B.1: Graph of Simulated Data

B.5 Explore The Simulated Data With A Logistic Regression

Similarly, exploring the data with a logistic regression confirms that we have created plausible data.

Show the code
fit1 <- glmer(aggression ~ cd1 + cd2 + cd3 + cd4 + GII +
                (1 | country), 
              family = "binomial",
              data = MICSsimulated,
              control = glmerControl(optimizer ="bobyqa"))

summary(fit1) # table for PDF
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: aggression ~ cd1 + cd2 + cd3 + cd4 + GII + (1 | country)
   Data: MICSsimulated
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
  4043.0   4085.1  -2014.5   4029.0     2993 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3105 -1.0575  0.6729  0.8799  1.3032 

Random effects:
 Groups  Name        Variance Std.Dev.
 country (Intercept) 0.04616  0.2148  
Number of obs: 3000, groups:  country, 30

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.40705    0.35298  -1.153  0.24883    
cd1          0.21337    0.07766   2.748  0.00600 ** 
cd2          0.58546    0.17948   3.262  0.00111 ** 
cd3          0.45124    0.07778   5.801 6.57e-09 ***
cd4         -0.36836    0.09156  -4.023 5.74e-05 ***
GII          0.02269    0.01388   1.635  0.10201    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
    (Intr) cd1    cd2    cd3    cd4   
cd1 -0.070                            
cd2 -0.022 -0.010                     
cd3 -0.157  0.002  0.020              
cd4 -0.204 -0.009 -0.004 -0.020       
GII -0.954 -0.011 -0.003  0.023  0.004

B.6 Write data to various formats

Lastly, we write the data out to various formats: R, Stata, and SPSS.

Show the code
save(MICSsimulated, 
     file = "./simulate-data/MICSsimulated.RData") # R

write_dta(MICSsimulated, 
          "./simulate-data/MICSsimulated.dta") # Stata

write_sav(MICSsimulated, 
          "./simulate-data/MICSsimulated.sav") # SPSS

write.csv(MICSsimulated,
          "./simulate-data/MICSsimulated.csv") # CSV