Appendix B — Simulating MICS Data

Appendix B — Simulating MICS Data

This appendix details the process of creating the simulated MICS data that is employed in the examples on this website.

MICS data are freely available, but usage of MICS requires completing a user agreement, and registering for a user account, on the MICS website, and thus MICS data should not be shared openly on a public website.

This Appendix is highly technical. It is not necessary to understand this Appendix to benefit from the rest of this website. However, the details of creating this simulated data may be of interest to some users.

B.1 Call Relevant Libraries

We need to call a number of relevant R libraries to simulate the data.

Show the code

library(tibble) # new dataframes

library(ggplot2) # nifty graphs

library(labelled) # labels

library(haven) # write Stata

library(tidyr) # tidy data

library(dplyr) # wrangle data

library(lme4) # multilevel models

library(sjPlot) # nice tables for MLM

library(pander) # nice tables

B.2 Setup Some Basic Parameters of the Data

Because simulation is a random process, we set a random seed so that the simulation produces the same data set each time it is run.

We are going to simulate data with 30 countries, and 100 individuals per country.

Show the code

set.seed(1234) # random seed

N_countries <- 30 # number of countries

N <- 100 # sample size / country

B.3 Simulate Data Based on MICS

This is multilevel data where individuals are nested, or clustered, inside countries. Excellent technical and pedagogical discussions of multilevel models can be found in Raudenbush & Bryk (2002), Singer & Willett (2003), Rabe-Hesketh & Skrondal (2022), Luke (2004), and Kreft & de Leeuw (1998).

B.3.1 Level 2

Simulating the second level of the data is relatively easy. We simply need to provide the number of countries, and then generate random effects for each country. Random effects are discussed in the above references, but essentially represent country level differences in the data.

We also create GII, a Gender Inequality Index (United Nations Development Program, 2023) variable, and HDI, a measure of the Human Development Index (United Nations Development Program, 2022), since these are country level, or Level 2 variables.

Show the code

country <- seq(1:N_countries) # sequence 1 to 30

GII <- rbinom(N_countries, 100, .25) # gender inequality index

HDI <- rbinom(N_countries, 100, .25) # Human Development Index

u0 <- rnorm(N_countries, 0, .25) # random intercept

u1 <- rnorm(N_countries, 0, .05) # random slope

randomeffects <- data.frame(country, 
                            GII, 
                            HDI, 
                            u0, 
                            u1) # dataframe of random effects

B.3.2 Level 1

Simulating the Level 1 data is more complex.

We uncount the data by 100 to create 100 observations for each country. We then create an id number.

We create randomly simulated parental discipline variables with proportions similar to those in MICS.

Lastly, we need to create the dependent variable. Because this is a dichotomous outcome, the process is somewhat complex. We need to craete a linear combination z, using regression weights derived from MICS. We then calculate predicted probabilities, and lastly generate a dichotomous aggression outcome from those probabilities.

Show the code

MICSsimulated <- randomeffects %>% 
  uncount(N) %>% # N individuals / country
  mutate(id = row_number()) %>% # unique id
  mutate(cd1 = rbinom(N * N_countries, 1, .38), # spank
         cd2 = rbinom(N * N_countries, 1, .05), # beat
         cd3 = rbinom(N * N_countries, 1, .64), # shout
         cd4 = rbinom(N * N_countries, 1, .78)) %>% # explain
  mutate(z = 0 + # linear combination based on MICS
           .01 * GII +
           .23 * cd1 +
           .52 * cd2 +
           .42 * cd3 +
           -.21 * cd4 + 
           u0) %>%
  mutate(p = exp(z) / (1 + exp(z))) %>% # probability
  mutate(aggression = rbinom(N * N_countries, 1, p)) %>% # binomial y
  select(id, country, GII, HDI,
         cd1, cd2, cd3, cd4, 
         aggression)

B.3.3 Variable Labels

We add variable labels to the data which will help us to understand the data as we analyze it.

Show the code

var_label(MICSsimulated$id) <- "id"

var_label(MICSsimulated$country) <- "country"

var_label(MICSsimulated$GII) <- "Gender Inequality Index"

var_label(MICSsimulated$HDI) <- "Human Development Index"

var_label(MICSsimulated$cd1) <- "spank"

var_label(MICSsimulated$cd2) <- "beat"

var_label(MICSsimulated$cd3) <- "shout"

var_label(MICSsimulated$cd4) <- "explain"

var_label(MICSsimulated$aggression) <- "aggression"

Show the code

pander(labelled::look_for(MICSsimulated)[1:4]) # list out variable labels

Table B.1: Variable Labels

pos	variable	label	col_type
1	id	id	int
2	country	country	int
3	GII	Gender Inequality Index	int
4	HDI	Human Development Index	int
5	cd1	spank	int
6	cd2	beat	int
7	cd3	shout	int
8	cd4	explain	int
9	aggression	aggression	int

B.4 Explore The Simulated Data With A Graph

Exploring the simulated data with a graph helps us to ensure that we have simulated plausible data.

Show the code

ggplot(MICSsimulated,
       aes(x = cd1, # x is spanking
           y = aggression, # y is aggression
           color = factor(country))) + # color is country
  geom_smooth(method = "glm", # glm smoother
              method.args = list(family = "binomial"),
              alpha = .1) + # transparency for CI's
  labs(title = "Aggression as a Function of Spanking",
       x = "spank",
       y = "aggression") +
  scale_color_viridis_d(name = "Country") + # nice colors
  theme_minimal()

B.5 Explore The Simulated Data With A Logistic Regression

Similarly, exploring the data with a logistic regression confirms that we have created plausible data.

Show the code

fit1 <- glmer(aggression ~ cd1 + cd2 + cd3 + cd4 + GII +
                (1 | country), 
              family = "binomial",
              data = MICSsimulated,
              control = glmerControl(optimizer ="bobyqa"))

summary(fit1) # table for PDF

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: aggression ~ cd1 + cd2 + cd3 + cd4 + GII + (1 | country)
   Data: MICSsimulated
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
  4043.0   4085.1  -2014.5   4029.0     2993 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3105 -1.0575  0.6729  0.8799  1.3032 

Random effects:
 Groups  Name        Variance Std.Dev.
 country (Intercept) 0.04616  0.2148  
Number of obs: 3000, groups:  country, 30

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.40705    0.35298  -1.153  0.24883    
cd1          0.21337    0.07766   2.748  0.00600 ** 
cd2          0.58546    0.17948   3.262  0.00111 ** 
cd3          0.45124    0.07778   5.801 6.57e-09 ***
cd4         -0.36836    0.09156  -4.023 5.74e-05 ***
GII          0.02269    0.01388   1.635  0.10201    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
    (Intr) cd1    cd2    cd3    cd4   
cd1 -0.070                            
cd2 -0.022 -0.010                     
cd3 -0.157  0.002  0.020              
cd4 -0.204 -0.009 -0.004 -0.020       
GII -0.954 -0.011 -0.003  0.023  0.004

B.6 Write data to various formats

Lastly, we write the data out to various formats: R, Stata, and SPSS.

Show the code

save(MICSsimulated, 
     file = "./simulate-data/MICSsimulated.RData") # R

write_dta(MICSsimulated, 
          "./simulate-data/MICSsimulated.dta") # Stata

write_sav(MICSsimulated, 
          "./simulate-data/MICSsimulated.sav") # SPSS

write.csv(MICSsimulated,
          "./simulate-data/MICSsimulated.csv") # CSV