Simulating Multilevel Data: Part 1

Here, I show how to simulate multilevel data, specifically intensive longitudinal (daily diary) data. Multilevel data poses some challenges for simulation because it violates assumptions of independence and analyses typically include estimation of random effects (e.g., each person may have their own intercept and slope). This tutorial assumes background knowledge of multilevel modeling and within- and between-person effects.

Once you understand how to simulate multilevel data, you can see how to incorporate it into power simulations here.

Like all data simulations, you will need to have an idea as to the general parameters for the data you intend to simulate (means, standard deviations, etc.) as well as how they relate to each other (e.g., the model). But multilevel data requires more parameters to do this than, say, a simple regression model.

For the present example, I will base the simulation on an existing dataset of people with type 2 diabetes to inform these parameters. You can read more about the original dataset here. Depending on what you are using your simulated data for, you may need to pull these parameters from existing literature or pilot data (e.g., if you are conducting an a priori power analysis to inform data collection).

We will use these packages in this tutorial:

library(nlme)
library(MASS)
library(tidyverse)

set.seed(4321)

Intro to the Data

Ok let’s look at the data.

head(diabetes, n = 15)
## # A tibble: 15 × 10
##       id  time collaboration PosAff PA_within PA_between NegAff NA_within
##    <dbl> <dbl>         <dbl>  <dbl>     <dbl>      <dbl>  <dbl>     <dbl>
##  1     1     0             4   2.67     0.333      -1.32   1.33   -0.0556
##  2     1     1             2   3        0.667      -1.32   1.33   -0.0556
##  3     1     2             2   3.33     1          -1.32   1      -0.389 
##  4     1     3             2   3        0.667      -1.32   1.33   -0.0556
##  5     1     4             2   3        0.667      -1.32   1.67    0.278 
##  6     1     5             4   2.33     0          -1.32   3       1.61  
##  7     1     6             1   2.67     0.333      -1.32   1      -0.389 
##  8     1     7            NA  NA       NA          -1.32  NA      NA     
##  9     1     8             3   1.33    -1          -1.32   1      -0.389 
## 10     1     9             2   2.33     0          -1.32   1      -0.389 
## 11     1    10             3   1.33    -1          -1.32   1.33   -0.0556
## 12     1    11             2   1.67    -0.667      -1.32   1.33   -0.0556
## 13     1    12             3   1.33    -1          -1.32   1.33   -0.0556
## 14     1    13            NA  NA       NA          -1.32  NA      NA     
## 15     2     0             5   5        0           1.35   1       0     
## # … with 2 more variables: NA_between <dbl>, age <dbl>

We’ve got this data in long format: each participant has one row for each day they completed a daily diary.

We have the following variables:

  • id: the participant ID
  • time: the day of the diary assessment (0 through 13, with some exceptions due to missing data and/or participant error)
  • PosAff: daily positive affect (PA)
  • PA_between: participant’s mean PA across the 14 days, centered around the grand mean
  • PA_within: how much someone deviated from their own average PA on a given day
  • NegAff: daily negative affect (NA)
  • NA_between: participant’s mean PA across the 14 days, centered around the grand mean
  • NA_within: how much someone deviated from their own average NA on a given day
  • shared_appraisal: daily shared appraisal rating, or how much the person perceived their diabetes to be shared with their romantic partner on a given day
  • collaboration: daily collaboration rating, or the extent to which the participant collaborated with their romantic partner to manage their diabetes on that day.
  • age: the participant’s age

Notice that we have three different variables for both positive and negative affect: the raw daily variable (PosAff and NegAff), the between-person variables, and the within-person variables. You can find some quick code for parsing the raw variable into within- and between-person effects here. Most important, notice that the between-person variables are the same across time for an individual, while the within-person variables change from day to day.

How many participants do we have?

diabetes %>% 
  group_by(id) %>% 
  summarise() %>% 
  nrow()
## [1] 200

Let’s say we are interested in whether mood predicts collaboration on a given day. We think that positive affect (PA) will predict collaboration. We also know that PA and negative affect (e.g., depressed mood or NA) overlaps with PA. So we want to control for NA when we are testing this hypothesis. We also want to control for participant age, because both PA and collaboration tend to increase as participants get older. We allow participants to randomly vary in their intercepts, as well as their slopes for time and PA_within.

Thus, the model we estimate using the original data is as follows:

collabMod <- lme(collaboration ~ PA_within + PA_between +
                   NA_within + NA_between + age + time,
                  data = diabetes,
                  random = ~ PA_within + time + 1|id,
                  na.action = na.omit)

Our task is to simulate a dataset that captures the same relationships identified in this model. The first crucial step is to write out your equation. Really, do this now – it will save a lot of headache later.

Here is the math behind the above model, broken down into level 1 (within-person) and level 2 (between-person) equations, where i represents an individual and j indicates a single assessment timepoint:

Level 1: \[collaboration_{ij} = \beta_{0j} + \beta_{1j}(PAwithin) + \beta_{2j}(NAwithin) + \beta_{3j}(time) + r_{ij} \]

Level 2: \[\beta_{0j} = \gamma_{00} + (\gamma_{01})PAbetween + (\gamma_{02}NAbetween) + (\gamma_{03})age + u_{0j} \] \[\beta_{1j} = \gamma_{10} + u_{1j} \] \[\beta_{2j} = \gamma_{20}\] \[\beta_{3j} = \gamma_{30} + u_{3j} \]

Combined and rearranged:

\[collaboration_{ij} = (\gamma_{00} + u_{0j}) + (\gamma_{10} + u_{1j})PAwithin + \] \[(\gamma_{20})NAwithin + (\gamma_{30} + u_{3j})time + (\gamma_{01})PAbetween +\] \[(\gamma_{02}NAbetween) + (\gamma_{03})age + r_{ij}\]

Looking at the combined model, \(\gamma\) (gamma) represents a fixed effect, and \(u\) represents a random effect. So, as we specified in our code above, we have random effects for the intercept (\(u_{0j}\)), as well as for the slopes of PA_within (\(u_{1j}\)) and time (\(u_{3j}\)). We’ll need all of these parameters to simulate our dependent variable, collaboration.

Ok, now let’s look at the full model output:

summary(collabMod)
## Linear mixed-effects model fit by REML
##   Data: diabetes 
##        AIC     BIC    logLik
##   6658.053 6739.37 -3315.027
## 
## Random effects:
##  Formula: ~PA_within + time + 1 | id
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr         
## (Intercept) 0.92056299 (Intr) PA_wth
## PA_within   0.34580651  0.141       
## time        0.04764546 -0.300  0.394
## Residual    0.78970783              
## 
## Fixed effects:  collaboration ~ PA_within + PA_between + NA_within + NA_between +      age + time 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  1.8529140 0.3176015 2265  5.834085  0.0000
## PA_within    0.0924109 0.0412089 2265  2.242501  0.0250
## PA_between   0.4127638 0.0954993  196  4.322164  0.0000
## NA_within   -0.0026098 0.0354924 2265 -0.073532  0.9414
## NA_between  -0.1358025 0.1361690  196 -0.997308  0.3198
## age          0.0112956 0.0058007  196  1.947279  0.0529
## time        -0.0211969 0.0055048 2265 -3.850616  0.0001
##  Correlation: 
##            (Intr) PA_wth PA_btw NA_wth NA_btw age   
## PA_within   0.029                                   
## PA_between  0.155 -0.021                            
## NA_within  -0.005  0.372 -0.007                     
## NA_between -0.064 -0.001  0.545 -0.032              
## age        -0.974 -0.013 -0.164  0.002  0.062       
## time       -0.101  0.148  0.014  0.032  0.006 -0.002
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.59580931 -0.57825261 -0.08428441  0.54513978  4.19543636 
## 
## Number of Observations: 2468
## Number of Groups: 200

There’s a lot of information here.

First, notice the random effects. We are given a standard deviation for each of our random effects, as well as for the residual. We also have a correlation matrix showing how these are related to each other.

Next, we see our fixed effects, including the estimate, SE, DF, t-value, and p-value. Below that, we see another correlation matrix representing how these fixed effects relate to each other.

So how do we start simulating a dataset that captures all of these relations represented in this model?

Starting Small

Let’s start small. We know that we want 200 participants, each with 14 days of diary data:

n_participants <- 200
n_timepoints <- 14
time <- rep(seq(0, n_timepoints-1, length = n_timepoints), n_participants)
id <- rep(1:n_participants, each = n_timepoints)

#put together into a data frame
sim_data <- data.frame(id, time)
head(sim_data, n = 20)
##    id time
## 1   1    0
## 2   1    1
## 3   1    2
## 4   1    3
## 5   1    4
## 6   1    5
## 7   1    6
## 8   1    7
## 9   1    8
## 10  1    9
## 11  1   10
## 12  1   11
## 13  1   12
## 14  1   13
## 15  2    0
## 16  2    1
## 17  2    2
## 18  2    3
## 19  2    4
## 20  2    5

Now let’s add our participants’ ages. What do participant ages look like in our observed sample?

diabetes %>% 
  summarise(mean_age = mean(age, na.rm = T),
            sd_age = sd(age, na.rm = T))
## # A tibble: 1 × 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     53.4   11.2
ggplot(diabetes, aes(x = age)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can randomly sample from a normal distribution based on this mean, SD, and distribution shape.

sim_data$age <- rnorm(n_participants, mean = 53.40, sd = 11.25) %>% 
  #let's just keep one decimal place
  round(1) %>% 
  #repeat each age 14 times (age stays the same within a participant)
  rep(each = n_timepoints) 

head(sim_data)
##   id time  age
## 1  1    0 48.6
## 2  1    1 48.6
## 3  1    2 48.6
## 4  1    3 48.6
## 5  1    4 48.6
## 6  1    5 48.6

Ok, now let’s estimate our raw daily PosAff and NegAff variables.

Just like above, we’ll need the means and standard deviations from our original dataset:

diabetes %>% summarise(meanPA = mean(PosAff, na.rm = TRUE),
                       sdPA = sd(diabetes$PosAff, na.rm = TRUE),
                       meanNA = mean(NegAff, na.rm = TRUE),
                       sdNA = sd(NegAff, na.rm = TRUE))
## # A tibble: 1 × 4
##   meanPA  sdPA meanNA  sdNA
##    <dbl> <dbl>  <dbl> <dbl>
## 1   3.66  1.04   1.49 0.810

Again, we can randomly sample from a normal distribution based on these parameters. Note that this assumes PA and NA are statistically unrelated – which is almost certainly false. We’ll go with this for now, and below will look at how to simulate correlated independent variables.

sim_data$PosAff <- 
  #each person gets a different score each day
  rnorm(n_participants*n_timepoints, 
                     mean = 3.655, sd = 1.042)

sim_data$NegAff <- rnorm(n_participants*n_timepoints,
                       mean = 1.485, sd = .810)

head(sim_data)
##   id time  age   PosAff    NegAff
## 1  1    0 48.6 3.355986 2.1198075
## 2  1    1 48.6 2.982700 1.1057182
## 3  1    2 48.6 3.511898 2.0819549
## 4  1    3 48.6 2.143793 0.8968255
## 5  1    4 48.6 1.386186 1.1530269
## 6  1    5 48.6 5.605619 1.0946574

Looks good! Now we just need to parse our raw mood variables into within- and between-person effects. If you are confused by this code, you can see how it works and why we parse the data like this here.

# parsing within- and between-person effects
sim_data <-  sim_data %>% 
    group_by(id) %>%
    mutate(PA_14daymean = mean(PosAff, na.rm=TRUE),
           NA_14daymean = mean(NegAff, na.rm=TRUE))%>%
    mutate(PA_within = PosAff - PA_14daymean,
           NA_within = NegAff -NA_14daymean) %>%
    ungroup()%>%
    mutate(PA_grandm = mean(PA_14daymean, na.rm=TRUE),
           NA_grandm = mean(NA_14daymean, na.rm=TRUE)) %>%
    mutate(PA_between = PA_14daymean - PA_grandm,
           NA_between = NA_14daymean - NA_grandm) %>% 
  select(-c(PA_14daymean, NA_14daymean, PA_grandm, NA_grandm)) 
#we can drop these variables--we only needed them to calculate the 
#within and between-person effects

head(sim_data)
## # A tibble: 6 × 9
##      id  time   age PosAff NegAff PA_within NA_within PA_between NA_between
##   <int> <dbl> <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1     1     0  48.6   3.36  2.12    -0.226      0.520    -0.0854      0.112
## 2     1     1  48.6   2.98  1.11    -0.599     -0.494    -0.0854      0.112
## 3     1     2  48.6   3.51  2.08    -0.0703     0.482    -0.0854      0.112
## 4     1     3  48.6   2.14  0.897   -1.44      -0.703    -0.0854      0.112
## 5     1     4  48.6   1.39  1.15    -2.20      -0.447    -0.0854      0.112
## 6     1     5  48.6   5.61  1.09     2.02      -0.506    -0.0854      0.112

Looks good!

dd_data Function

We can summarize everything we’ve done so far in the following function, which takes the number of participants and timepoints, and outputs a simulated data frame with id, time, age, PosAff, NegAff, PA_within, NA_within, PA_between, and NA_between.

dd_data <- function(n_participants, n_timepoints) {
  time <- rep(seq(1, n_timepoints, length = n_timepoints), n_participants)  
  id <- rep(1:n_participants, each = n_timepoints)
  age <- runif(n_participants, min = 25, max = 82)
  sim_data <- data.frame(id, time, age = rep(age, each = n_timepoints))
  sim_data$PosAff <- rnorm(n_participants*n_timepoints, mean = 3.655, sd = 1.485)
  sim_data$NegAff <- rnorm(n_participants*n_timepoints, mean = 1.042, sd = .810)
  sim_data <-  sim_data %>% # parsing within- and between-person effects
    group_by(id) %>%
    mutate(PA_14daymean = mean(PosAff, na.rm=TRUE),
           NA_14daymean = mean(NegAff, na.rm=TRUE))%>%
    mutate(PA_within = PosAff - PA_14daymean,
           NA_within = NegAff -NA_14daymean) %>%
    ungroup()%>%
    mutate(PA_grandm = mean(PA_14daymean, na.rm=TRUE),
           NA_grandm = mean(NA_14daymean, na.rm=TRUE)) %>%
    mutate(PA_between = PA_14daymean - PA_grandm,
           NA_between = NA_14daymean - NA_grandm) %>% 
    select(-c(PA_14daymean, NA_14daymean, PA_grandm, NA_grandm)) 
  return(sim_data)
}

Always good to test it:

dd_test <- dd_data(200, 14)
head(dd_test, n = 20)
## # A tibble: 20 × 9
##       id  time   age PosAff  NegAff PA_within NA_within PA_between NA_between
##    <int> <dbl> <dbl>  <dbl>   <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
##  1     1     1  38.0   4.62  1.35      1.27      0.157      -0.276      0.153
##  2     1     2  38.0   3.01  1.83     -0.344     0.639      -0.276      0.153
##  3     1     3  38.0   3.66  0.391     0.307    -0.800      -0.276      0.153
##  4     1     4  38.0   3.23  3.48     -0.119     2.29       -0.276      0.153
##  5     1     5  38.0   2.94  1.17     -0.409    -0.0242     -0.276      0.153
##  6     1     6  38.0   2.47  2.58     -0.883     1.39       -0.276      0.153
##  7     1     7  38.0   3.49  1.28      0.140     0.0881     -0.276      0.153
##  8     1     8  38.0   3.91  0.676     0.560    -0.515      -0.276      0.153
##  9     1     9  38.0   4.79  1.44      1.43      0.248      -0.276      0.153
## 10     1    10  38.0   3.56 -0.269     0.203    -1.46       -0.276      0.153
## 11     1    11  38.0   3.92  1.70      0.564     0.511      -0.276      0.153
## 12     1    12  38.0   2.05  1.05     -1.30     -0.140      -0.276      0.153
## 13     1    13  38.0   3.03  0.558    -0.324    -0.633      -0.276      0.153
## 14     1    14  38.0   2.26 -0.556    -1.09     -1.75       -0.276      0.153
## 15     2     1  45.3   3.61  1.35     -0.354     0.172       0.331      0.140
## 16     2     2  45.3   5.78 -0.0491    1.82     -1.23        0.331      0.140
## 17     2     3  45.3   3.97  1.89      0.0128    0.711       0.331      0.140
## 18     2     4  45.3   2.77  0.780    -1.19     -0.398       0.331      0.140
## 19     2     5  45.3   3.13  1.99     -0.831     0.814       0.331      0.140
## 20     2     6  45.3   3.80  2.18     -0.156     1.01        0.331      0.140

Cool! Now we just need to estimate our dependent variable, collaboration, right? Unfortunately, no.

Adding Random Effects

This is where things get more complicated, because we need to add those random effects before estimating the outcomes variable. As a reminder, a random effect captures individual heterogeneity across our intercept and slope values. For example, while our output from our initial model tells us that our intercept is 1.85, our random intercept says that this value is actually variable across participants. How variable?

VarCorr(collabMod)
## id = pdLogChol(PA_within + time + 1) 
##             Variance   StdDev     Corr         
## (Intercept) 0.84743622 0.92056299 (Intr) PA_wth
## PA_within   0.11958214 0.34580651  0.141       
## time        0.00227009 0.04764546 -0.300  0.394
## Residual    0.62363845 0.78970783

This table shows us the variance and standard deviation of our random effects and residuals, as well as how they are related to each other (i.e., the correlations). For instance, we can see that the effect of time on our outcome really doesn’t change that much from person to person, but within-person PA does. We can also see that people with larger intercept random effects tend to have somewhat larger (r = .141) within-person PA random effects.

So what do we do with this information?

First, we need to pull the covariance matrix of the random effects, rather than the correlation matrix, because the covariance matrix takes into account the standard deviations (or variances) of each random effect.

For nlme models, the getVarCov function accomplishes this.

cov <- getVarCov(collabMod)
cov
## Random effects variance covariance matrix
##             (Intercept) PA_within       time
## (Intercept)    0.847440  0.044833 -0.0131510
## PA_within      0.044833  0.119580  0.0064890
## time          -0.013151  0.006489  0.0022701
##   Standard Deviations: 0.92056 0.34581 0.047645

Then, we’ll use a function from the MASS packages called mvrnorm, which randomly samples from a multivariate normal distribution based on these covariances.

random_effects <- mvrnorm(n_participants, c(0,0,0), Sigma = cov)

Note here that n = n_participants, rather than n_participants*n_timepoints. That is, random effects vary across people, not across within-person time points.

This results in a 3 by 200 random_effects matrix.

head(random_effects)
##      (Intercept)   PA_within         time
## [1,]   1.4632102 -0.31907004 -0.018897867
## [2,]   0.3884006 -0.07199089 -0.040755470
## [3,]  -0.7905150 -0.33635644 -0.013735338
## [4,]  -2.4296293 -0.28559102  0.067819419
## [5,]   0.4081169 -0.24265508 -0.007748071
## [6,]   0.1985533  0.54023222  0.007813314

Each column is a random effect, each row represents a different person, and because of the way we sampled them, they should covary in the same way as the covariance matrix from the model:

cov(random_effects)
##             (Intercept)   PA_within         time
## (Intercept)  0.92708400 0.056000468 -0.013733531
## PA_within    0.05600047 0.131580572  0.005525947
## time        -0.01373353 0.005525947  0.002251855

Looks pretty close!

Recall our equations from above, especially our combined model:

\[collaboration_{ij} = (\gamma_{00} + u_{0j}) + (\gamma_{10} + u_{1j})PAwithin + \] \[(\gamma_{20})NAwithin + (\gamma_{30} + u_{3j})time + (\gamma_{01})PAbetween +\] \[(\gamma_{02}NAbetween) + (\gamma_{03})age + r_{ij}\]

We can see that the intercept and beta values for PA_within and time are simply the sum of a fixed effect, \(\gamma\), and a random effect, \(u\).

So all we need to do is to add our newly simulated random effects to the fixed effects from the original model.

You can pull these fixed effect coefficients from the model like so:

beta_int <-  fixed.effects(collabMod)["(Intercept)"]
beta_time <-  fixed.effects(collabMod)["time"]
beta_PA_within <- fixed.effects(collabMod)["PA_within"]

And combine them with the random effects like this:

#Pull out each column of random effects
rand_intercepts <- rep(random_effects[,1], each = n_timepoints)
rand_PA_within <- rep(random_effects[,2], each = n_timepoints)
rand_time <- rep(random_effects[,3], each = n_timepoints) 


sim_data <- sim_data %>% 
    mutate(intercepts = (beta_int + rand_intercepts),
           slope_time = (beta_time + rand_time),
           slope_PA_within = (beta_PA_within + rand_PA_within))

Notice that for each random effect, we have to repeat it n_timepoints times. This is because our sim_data data frame is in long format, where each person has a new line of data for each time point.

So now when we look at sim_data, we see the new variables intercepts, slope_time, and slope_PA_within each of which are different for different people:

sim_data %>% 
  select(id, intercepts, slope_time, slope_PA_within) %>% 
  head(n = 18)
## # A tibble: 18 × 4
##       id intercepts slope_time slope_PA_within
##    <int>      <dbl>      <dbl>           <dbl>
##  1     1       3.32    -0.0401         -0.227 
##  2     1       3.32    -0.0401         -0.227 
##  3     1       3.32    -0.0401         -0.227 
##  4     1       3.32    -0.0401         -0.227 
##  5     1       3.32    -0.0401         -0.227 
##  6     1       3.32    -0.0401         -0.227 
##  7     1       3.32    -0.0401         -0.227 
##  8     1       3.32    -0.0401         -0.227 
##  9     1       3.32    -0.0401         -0.227 
## 10     1       3.32    -0.0401         -0.227 
## 11     1       3.32    -0.0401         -0.227 
## 12     1       3.32    -0.0401         -0.227 
## 13     1       3.32    -0.0401         -0.227 
## 14     1       3.32    -0.0401         -0.227 
## 15     2       2.24    -0.0620          0.0204
## 16     2       2.24    -0.0620          0.0204
## 17     2       2.24    -0.0620          0.0204
## 18     2       2.24    -0.0620          0.0204

Now the last thing we need to take into account is our residual, or error term, \(r_{ij}\). Just as the subscript indicates, this does vary across time points. We can pull \(\sigma\), the SD of our residual directly from the model, and use it to randomly sample from a normal distribution with a mean of 0.

sigma <-  collabMod[["sigma"]]
error <- rnorm(n_participants*n_timepoints, 0, sigma)

#add to the data frame
sim_data <- sim_data %>% 
  mutate(error)

sim_data %>% select(id, error) %>% 
  head()
## # A tibble: 6 × 2
##      id  error
##   <int>  <dbl>
## 1     1  0.742
## 2     1 -0.364
## 3     1 -1.45 
## 4     1  0.104
## 5     1 -0.650
## 6     1  0.289

get_random_effects Function

We can summarize all of this syntax used to create the random effects in the following function, which takes as arguments the sim_data from our dd_dat function, the numbers of participants and time points, and the original nlme model:

get_random_effects <- function(sim_data, n_participants, 
                                n_timepoints, nlme.model){
  beta_int <-  fixed.effects(nlme.model)["(Intercept)"]
  beta_time <-  fixed.effects(nlme.model)["time"]
  beta_PA_within <- fixed.effects(nlme.model)["PA_within"]
  sigma <-  nlme.model[["sigma"]]
  cov <- getVarCov(nlme.model)
  a <- mvrnorm(n_participants, c(0,0,0), Sigma = cov)
  rand_intercepts <- rep(a[,1], each = n_timepoints)
  rand_PA_within <- rep(a[,2], each = n_timepoints)
  rand_time <- rep(a[,3], each = n_timepoints) 
  error <- rnorm(n_participants*n_timepoints, 0, sigma)
  sim_data <- sim_data %>% 
    mutate(intercepts = (beta_int + rand_intercepts),
           slope_time = (beta_time + rand_time),
           slope_PA_within = (beta_PA_within + rand_PA_within),
           error)
  return(sim_data)
}

Are we done yet?

get_outcome Function

Not quite! Now we can (finally) calculate our independent variable, collaboration. To do so, we plug in our combined equation from above:

get_outcome <- function(sim_data, n_participants, 
                             n_timepoints, nlme.model){
  #betas for predictors with only fixed effects:
  beta_PA_between <- fixed.effects(nlme.model)["PA_between"]
  beta_NA_within <- fixed.effects(nlme.model)["NA_within"]
  beta_NA_between <- fixed.effects(nlme.model)["NA_between"]
  beta_age <- fixed.effects(nlme.model)["age"]
  sim_data <- sim_data %>% 
    #this should match combined model above:
    mutate(collaboration = 
             intercepts + 
             slope_PA_within*PA_within + 
             beta_PA_between*PA_between +
             beta_NA_within*NA_within + 
             beta_NA_between*NA_between +
             slope_time*time + 
             beta_age*age + 
             error)
  return(sim_data)
}

Note that we still need to pull the coefficients for predictors with fixed effects only. As fixed effects, these slopes are the same across all participants, so don’t need to be added into our data frame as new columns.

Putting it all together

Now we can combine our three functions dd_data, get_random_effects and get_outcome into one sleek simulate_data function:

simulate_data <- function(n_participants, n_timepoints, nlme.model){
  dd_data(n_participants, n_timepoints) %>% 
    get_random_effects(n_participants, n_timepoints, nlme.model) %>% 
    get_outcome(n_participants, n_timepoints, nlme.model)
}

Let’s recap everything that is going on here. We input the number of participants, number of time points, and original nlme model. dd_dat takes just n_participants and n_timepoints and outputs the sim_data data frame where each participant has 14 lines of data on our predictors of interest. get_random_effects takes this sim_data and adds random effects for intercepts, time, and PA_within, and outputs a modified sim_data object. Finally, get_outcome uses sim_data and some additional information from our model to estimate our outcome based on our combined model.

(Remember that the pipe, %>% , takes the output of a line and inputs it as the first argument in the next line. That’s why we don’t have to type sim_data as an input – it’s being passed through each function under the hood.)

Now we can run this function, and use the resulting data frame to fit our original model. This new output should closely match the original output.

sim_test <- simulate_data(200, 14, collabMod)
simMod <- lme(collaboration ~ PA_within + PA_between +
                   NA_within + NA_between + age + time,
                  data = sim_test,
                  random = ~ PA_within + time + 1|id,
                  na.action = na.omit)
summary_fixed <- function(nlme.mod){
  summary(nlme.mod)[["tTable"]] %>% 
  as.data.frame() %>% 
  select(Value, Std.Error, "p-value") %>% 
  rename(p = "p-value") %>% 
  mutate(
    Value = round(Value, 2),
    Std.Error = round(Std.Error, 2),
    p = round(p, 3))
}

summary_fixed(collabMod)
##             Value Std.Error     p
## (Intercept)  1.85      0.32 0.000
## PA_within    0.09      0.04 0.025
## PA_between   0.41      0.10 0.000
## NA_within    0.00      0.04 0.941
## NA_between  -0.14      0.14 0.320
## age          0.01      0.01 0.053
## time        -0.02      0.01 0.000
summary_fixed(simMod)
##             Value Std.Error     p
## (Intercept)  1.91      0.23 0.000
## PA_within    0.04      0.03 0.130
## PA_between   0.32      0.17 0.064
## NA_within   -0.01      0.02 0.697
## NA_between   0.45      0.31 0.148
## age          0.01      0.00 0.007
## time        -0.03      0.01 0.000

Looks pretty good! Notice that the non-significant predictors (like NA_between) are quite different across the two models, while the significant predictors match.

We can also compare the random effects:

VarCorr(collabMod)
## id = pdLogChol(PA_within + time + 1) 
##             Variance   StdDev     Corr         
## (Intercept) 0.84743622 0.92056299 (Intr) PA_wth
## PA_within   0.11958214 0.34580651  0.141       
## time        0.00227009 0.04764546 -0.300  0.394
## Residual    0.62363845 0.78970783
VarCorr(simMod)
## id = pdLogChol(PA_within + time + 1) 
##             Variance    StdDev     Corr         
## (Intercept) 0.912248071 0.95511678 (Intr) PA_wth
## PA_within   0.140996279 0.37549471  0.143       
## time        0.002571546 0.05071042 -0.240  0.332
## Residual    0.583742629 0.76403052

Looks good! Now, you can use the simulate_data function in other analyses, like in Monte Carlo power simulations where you need to simulate data over and over again. To learn more about this, see my tutorial here.

Simulating dependent predictors

If you read through all of this and said to yourself, “Hm, I really wish this was more complicated,” then never fear, we can add another layer of complexity.

Above, we talked about how our positive and negative affect predictors are probably related to each other. That is, if someone reports feeling very happy on one day, they are probably also reporting very little sadness on that day. The way we have simulated this so far does not take this covariance into account. To see how to do this, see this tutorial, which uses the same sample data and model used here.