This appendix details the process of creating the simulated MICS data that is employed in the examples on this website.
MICS data are freely available, but usage of MICS requires completing a user agreement, and registering for a user account, on the MICS website, and thus MICS data should not be shared openly on a public website.
This Appendix is highly technical. It is not necessary to understand this Appendix to benefit from the rest of this website. However, the details of creating this simulated data may be of interest to some users.
B.1 Call Relevant Libraries
We need to call a number of relevant R libraries to simulate the data.
Because simulation is a random process, we set a random seed so that the simulation produces the same data set each time it is run.
We are going to simulate data with 30 countries, and 100 individuals per country.
Show the code
set.seed(1234) # random seedN_countries <-30# number of countriesN <-100# sample size / country
B.3 Simulate Data Based on MICS
This is multilevel data where individuals are nested, or clustered, inside countries. Excellent technical and pedagogical discussions of multilevel models can be found in Raudenbush & Bryk (2002), Singer & Willett (2003), Rabe-Hesketh & Skrondal (2022), Luke (2004), and Kreft & de Leeuw (1998).
B.3.1 Level 2
Simulating the second level of the data is relatively easy. We simply need to provide the number of countries, and then generate random effects for each country. Random effects are discussed in the above references, but essentially represent country level differences in the data.
country <-seq(1:N_countries) # sequence 1 to 30GII <-rbinom(N_countries, 100, .25) # gender inequality indexHDI <-rbinom(N_countries, 100, .25) # Human Development Indexu0 <-rnorm(N_countries, 0, .25) # random interceptu1 <-rnorm(N_countries, 0, .05) # random sloperandomeffects <-data.frame(country, GII, HDI, u0, u1) # dataframe of random effects
B.3.2 Level 1
Simulating the Level 1 data is more complex.
We uncount the data by 100 to create 100 observations for each country. We then create an id number.
We create randomly simulated parental discipline variables with proportions similar to those in MICS.
Lastly, we need to create the dependent variable. Because this is a dichotomous outcome, the process is somewhat complex. We need to craete a linear combination z, using regression weights derived from MICS. We then calculate predicted probabilities, and lastly generate a dichotomous aggression outcome from those probabilities.
pander(labelled::look_for(MICSsimulated)[1:4]) # list out variable labels
Table B.1: Variable Labels
pos
variable
label
col_type
1
id
id
int
2
country
country
int
3
GII
Gender Inequality Index
int
4
HDI
Human Development Index
int
5
cd1
spank
int
6
cd2
beat
int
7
cd3
shout
int
8
cd4
explain
int
9
aggression
aggression
int
B.4 Explore The Simulated Data With A Graph
Exploring the simulated data with a graph helps us to ensure that we have simulated plausible data.
Show the code
ggplot(MICSsimulated,aes(x = cd1, # x is spankingy = aggression, # y is aggressioncolor =factor(country))) +# color is countrygeom_smooth(method ="glm", # glm smoothermethod.args =list(family ="binomial"),alpha = .1) +# transparency for CI'slabs(title ="Aggression as a Function of Spanking",x ="spank",y ="aggression") +scale_color_viridis_d(name ="Country") +# nice colorstheme_minimal()
B.5 Explore The Simulated Data With A Logistic Regression
Similarly, exploring the data with a logistic regression confirms that we have created plausible data.
Show the code
fit1 <-glmer(aggression ~ cd1 + cd2 + cd3 + cd4 + GII + (1| country), family ="binomial",data = MICSsimulated,control =glmerControl(optimizer ="bobyqa"))summary(fit1) # table for PDF
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: aggression ~ cd1 + cd2 + cd3 + cd4 + GII + (1 | country)
Data: MICSsimulated
Control: glmerControl(optimizer = "bobyqa")
AIC BIC logLik deviance df.resid
4043.0 4085.1 -2014.5 4029.0 2993
Scaled residuals:
Min 1Q Median 3Q Max
-2.3105 -1.0575 0.6729 0.8799 1.3032
Random effects:
Groups Name Variance Std.Dev.
country (Intercept) 0.04616 0.2148
Number of obs: 3000, groups: country, 30
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.40705 0.35298 -1.153 0.24883
cd1 0.21337 0.07766 2.748 0.00600 **
cd2 0.58546 0.17948 3.262 0.00111 **
cd3 0.45124 0.07778 5.801 6.57e-09 ***
cd4 -0.36836 0.09156 -4.023 5.74e-05 ***
GII 0.02269 0.01388 1.635 0.10201
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) cd1 cd2 cd3 cd4
cd1 -0.070
cd2 -0.022 -0.010
cd3 -0.157 0.002 0.020
cd4 -0.204 -0.009 -0.004 -0.020
GII -0.954 -0.011 -0.003 0.023 0.004
B.6 Write data to various formats
Lastly, we write the data out to various formats: R, Stata, and SPSS.
Rabe-Hesketh, S., & Skrondal, A. (2022). Multilevel and longitudinal modeling using Stata. In Stata Press (4th ed.). Stata Press.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (pp. xxiv, 485 p.). Sage Publications.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis : Modeling change and event occurrence. In Applied longitudinal data analysis : modeling change and event occurrence. Oxford University Press.