Here, I show how to simulate multilevel data, specifically intensive longitudinal (daily diary) data. Multilevel data poses some challenges for simulation because it violates assumptions of independence and analyses typically include estimation of random effects (e.g., each person may have their own intercept and slope). This tutorial assumes background knowledge of multilevel modeling and within- and between-person effects.

Once you understand how to simulate multilevel data, you can see how to incorporate it into power simulations here.

Like all data simulations, you will need to have an idea as to the general parameters for the data you intend to simulate (means, standard deviations, etc.) as well as how they relate to each other (e.g., the model). But multilevel data requires more parameters to do this than, say, a simple regression model.

For the present example, I will base the simulation on an existing dataset of people with type 2 diabetes to inform these parameters. You can read more about the original dataset here. Depending on what you are using your simulated data for, you may need to pull these parameters from existing literature or pilot data (e.g., if you are conducting an a priori power analysis to inform data collection).

We will use these packages in this tutorial:

```
library(nlme)
library(MASS)
library(tidyverse)
set.seed(4321)
```

# Intro to the Data

Ok let’s look at the data.

`head(diabetes, n = 15)`

```
## # A tibble: 15 × 10
## id time collaboration PosAff PA_within PA_between NegAff NA_within
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 4 2.67 0.333 -1.32 1.33 -0.0556
## 2 1 1 2 3 0.667 -1.32 1.33 -0.0556
## 3 1 2 2 3.33 1 -1.32 1 -0.389
## 4 1 3 2 3 0.667 -1.32 1.33 -0.0556
## 5 1 4 2 3 0.667 -1.32 1.67 0.278
## 6 1 5 4 2.33 0 -1.32 3 1.61
## 7 1 6 1 2.67 0.333 -1.32 1 -0.389
## 8 1 7 NA NA NA -1.32 NA NA
## 9 1 8 3 1.33 -1 -1.32 1 -0.389
## 10 1 9 2 2.33 0 -1.32 1 -0.389
## 11 1 10 3 1.33 -1 -1.32 1.33 -0.0556
## 12 1 11 2 1.67 -0.667 -1.32 1.33 -0.0556
## 13 1 12 3 1.33 -1 -1.32 1.33 -0.0556
## 14 1 13 NA NA NA -1.32 NA NA
## 15 2 0 5 5 0 1.35 1 0
## # … with 2 more variables: NA_between <dbl>, age <dbl>
```

We’ve got this data in long format: each participant has one row for each day they completed a daily diary.

We have the following variables:

: the participant ID`id`

: the day of the diary assessment (0 through 13, with some exceptions due to missing data and/or participant error)`time`

: daily positive affect (PA)`PosAff`

: participant’s mean PA across the 14 days, centered around the grand mean`PA_between`

: how much someone deviated from their own average PA on a given day`PA_within`

: daily negative affect (NA)`NegAff`

: participant’s mean PA across the 14 days, centered around the grand mean`NA_between`

: how much someone deviated from their own average NA on a given day`NA_within`

: daily shared appraisal rating, or how much the person perceived their diabetes to be shared with their romantic partner on a given day`shared_appraisal`

: daily collaboration rating, or the extent to which the participant collaborated with their romantic partner to manage their diabetes on that day.`collaboration`

: the participant’s age`age`

Notice that we have three different variables for both positive and negative affect: the raw daily variable (`PosAff`

and `NegAff`

), the between-person variables, and the within-person variables. You can find some quick code for parsing the raw variable into within- and between-person effects here. Most important, notice that the between-person variables are the same across time for an individual, while the within-person variables change from day to day.

How many participants do we have?

```
diabetes %>%
group_by(id) %>%
summarise() %>%
nrow()
```

`## [1] 200`

Let’s say we are interested in whether mood predicts collaboration on a given day. We think that positive affect (PA) will predict collaboration. We also know that PA and negative affect (e.g., depressed mood or NA) overlaps with PA. So we want to control for NA when we are testing this hypothesis. We also want to control for participant age, because both PA and collaboration tend to increase as participants get older. We allow participants to randomly vary in their intercepts, as well as their slopes for `time`

and `PA_within`

.

Thus, the model we estimate using the original data is as follows:

```
collabMod <- lme(collaboration ~ PA_within + PA_between +
NA_within + NA_between + age + time,
data = diabetes,
random = ~ PA_within + time + 1|id,
na.action = na.omit)
```

Our task is to simulate a dataset that captures the same relationships identified in this model. The first **crucial** step is to write out your equation. Really, do this now – it will save a lot of headache later.

Here is the math behind the above model, broken down into level 1 (within-person) and level 2 (between-person) equations, where *i* represents an individual and *j* indicates a single assessment timepoint:

Level 1: \[collaboration_{ij} = \beta_{0j} + \beta_{1j}(PAwithin) + \beta_{2j}(NAwithin) + \beta_{3j}(time) + r_{ij} \]

Level 2: \[\beta_{0j} = \gamma_{00} + (\gamma_{01})PAbetween + (\gamma_{02}NAbetween) + (\gamma_{03})age + u_{0j} \] \[\beta_{1j} = \gamma_{10} + u_{1j} \] \[\beta_{2j} = \gamma_{20}\] \[\beta_{3j} = \gamma_{30} + u_{3j} \]

Combined and rearranged:

\[collaboration_{ij} = (\gamma_{00} + u_{0j}) + (\gamma_{10} + u_{1j})PAwithin + \] \[(\gamma_{20})NAwithin + (\gamma_{30} + u_{3j})time + (\gamma_{01})PAbetween +\] \[(\gamma_{02}NAbetween) + (\gamma_{03})age + r_{ij}\]

Looking at the combined model, \(\gamma\) (gamma) represents a fixed effect, and \(u\) represents a random effect. So, as we specified in our code above, we have random effects for the intercept (\(u_{0j}\)), as well as for the slopes of `PA_within`

(\(u_{1j}\)) and `time`

(\(u_{3j}\)). We’ll need all of these parameters to simulate our dependent variable, `collaboration.`

Ok, now let’s look at the full model output:

`summary(collabMod)`

```
## Linear mixed-effects model fit by REML
## Data: diabetes
## AIC BIC logLik
## 6658.053 6739.37 -3315.027
##
## Random effects:
## Formula: ~PA_within + time + 1 | id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.92056299 (Intr) PA_wth
## PA_within 0.34580651 0.141
## time 0.04764546 -0.300 0.394
## Residual 0.78970783
##
## Fixed effects: collaboration ~ PA_within + PA_between + NA_within + NA_between + age + time
## Value Std.Error DF t-value p-value
## (Intercept) 1.8529140 0.3176015 2265 5.834085 0.0000
## PA_within 0.0924109 0.0412089 2265 2.242501 0.0250
## PA_between 0.4127638 0.0954993 196 4.322164 0.0000
## NA_within -0.0026098 0.0354924 2265 -0.073532 0.9414
## NA_between -0.1358025 0.1361690 196 -0.997308 0.3198
## age 0.0112956 0.0058007 196 1.947279 0.0529
## time -0.0211969 0.0055048 2265 -3.850616 0.0001
## Correlation:
## (Intr) PA_wth PA_btw NA_wth NA_btw age
## PA_within 0.029
## PA_between 0.155 -0.021
## NA_within -0.005 0.372 -0.007
## NA_between -0.064 -0.001 0.545 -0.032
## age -0.974 -0.013 -0.164 0.002 0.062
## time -0.101 0.148 0.014 0.032 0.006 -0.002
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.59580931 -0.57825261 -0.08428441 0.54513978 4.19543636
##
## Number of Observations: 2468
## Number of Groups: 200
```

There’s a lot of information here.

First, notice the random effects. We are given a standard deviation for each of our random effects, as well as for the residual. We also have a correlation matrix showing how these are related to each other.

Next, we see our fixed effects, including the estimate, SE, DF, t-value, and p-value. Below that, we see another correlation matrix representing how these fixed effects relate to each other.

So how do we start simulating a dataset that captures all of these relations represented in this model?

# Starting Small

Let’s start small. We know that we want 200 participants, each with 14 days of diary data:

```
n_participants <- 200
n_timepoints <- 14
time <- rep(seq(0, n_timepoints-1, length = n_timepoints), n_participants)
id <- rep(1:n_participants, each = n_timepoints)
#put together into a data frame
sim_data <- data.frame(id, time)
head(sim_data, n = 20)
```

```
## id time
## 1 1 0
## 2 1 1
## 3 1 2
## 4 1 3
## 5 1 4
## 6 1 5
## 7 1 6
## 8 1 7
## 9 1 8
## 10 1 9
## 11 1 10
## 12 1 11
## 13 1 12
## 14 1 13
## 15 2 0
## 16 2 1
## 17 2 2
## 18 2 3
## 19 2 4
## 20 2 5
```

Now let’s add our participants’ ages. What do participant ages look like in our observed sample?

```
diabetes %>%
summarise(mean_age = mean(age, na.rm = T),
sd_age = sd(age, na.rm = T))
```

```
## # A tibble: 1 × 2
## mean_age sd_age
## <dbl> <dbl>
## 1 53.4 11.2
```

```
ggplot(diabetes, aes(x = age)) +
geom_histogram()
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

We can randomly sample from a normal distribution based on this mean, SD, and distribution shape.

```
sim_data$age <- rnorm(n_participants, mean = 53.40, sd = 11.25) %>%
#let's just keep one decimal place
round(1) %>%
#repeat each age 14 times (age stays the same within a participant)
rep(each = n_timepoints)
head(sim_data)
```

```
## id time age
## 1 1 0 48.6
## 2 1 1 48.6
## 3 1 2 48.6
## 4 1 3 48.6
## 5 1 4 48.6
## 6 1 5 48.6
```

Ok, now let’s estimate our raw daily `PosAff`

and `NegAff`

variables.

Just like above, we’ll need the means and standard deviations from our original dataset:

```
diabetes %>% summarise(meanPA = mean(PosAff, na.rm = TRUE),
sdPA = sd(diabetes$PosAff, na.rm = TRUE),
meanNA = mean(NegAff, na.rm = TRUE),
sdNA = sd(NegAff, na.rm = TRUE))
```

```
## # A tibble: 1 × 4
## meanPA sdPA meanNA sdNA
## <dbl> <dbl> <dbl> <dbl>
## 1 3.66 1.04 1.49 0.810
```

Again, we can randomly sample from a normal distribution based on these parameters. Note that this assumes PA and NA are statistically unrelated – which is almost certainly false. We’ll go with this for now, and below will look at how to simulate correlated independent variables.

```
sim_data$PosAff <-
#each person gets a different score each day
rnorm(n_participants*n_timepoints,
mean = 3.655, sd = 1.042)
sim_data$NegAff <- rnorm(n_participants*n_timepoints,
mean = 1.485, sd = .810)
head(sim_data)
```

```
## id time age PosAff NegAff
## 1 1 0 48.6 3.355986 2.1198075
## 2 1 1 48.6 2.982700 1.1057182
## 3 1 2 48.6 3.511898 2.0819549
## 4 1 3 48.6 2.143793 0.8968255
## 5 1 4 48.6 1.386186 1.1530269
## 6 1 5 48.6 5.605619 1.0946574
```

Looks good! Now we just need to parse our raw mood variables into within- and between-person effects. If you are confused by this code, you can see how it works and why we parse the data like this here.

```
# parsing within- and between-person effects
sim_data <- sim_data %>%
group_by(id) %>%
mutate(PA_14daymean = mean(PosAff, na.rm=TRUE),
NA_14daymean = mean(NegAff, na.rm=TRUE))%>%
mutate(PA_within = PosAff - PA_14daymean,
NA_within = NegAff -NA_14daymean) %>%
ungroup()%>%
mutate(PA_grandm = mean(PA_14daymean, na.rm=TRUE),
NA_grandm = mean(NA_14daymean, na.rm=TRUE)) %>%
mutate(PA_between = PA_14daymean - PA_grandm,
NA_between = NA_14daymean - NA_grandm) %>%
select(-c(PA_14daymean, NA_14daymean, PA_grandm, NA_grandm))
#we can drop these variables--we only needed them to calculate the
#within and between-person effects
head(sim_data)
```

```
## # A tibble: 6 × 9
## id time age PosAff NegAff PA_within NA_within PA_between NA_between
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 48.6 3.36 2.12 -0.226 0.520 -0.0854 0.112
## 2 1 1 48.6 2.98 1.11 -0.599 -0.494 -0.0854 0.112
## 3 1 2 48.6 3.51 2.08 -0.0703 0.482 -0.0854 0.112
## 4 1 3 48.6 2.14 0.897 -1.44 -0.703 -0.0854 0.112
## 5 1 4 48.6 1.39 1.15 -2.20 -0.447 -0.0854 0.112
## 6 1 5 48.6 5.61 1.09 2.02 -0.506 -0.0854 0.112
```

Looks good!

`dd_data`

Function

We can summarize everything we’ve done so far in the following function, which takes the number of participants and timepoints, and outputs a simulated data frame with `id`

, `time`

, `age`

, `PosAff`

, `NegAff`

, `PA_within`

, `NA_within`

, `PA_between`

, and `NA_between`

.

```
dd_data <- function(n_participants, n_timepoints) {
time <- rep(seq(1, n_timepoints, length = n_timepoints), n_participants)
id <- rep(1:n_participants, each = n_timepoints)
age <- runif(n_participants, min = 25, max = 82)
sim_data <- data.frame(id, time, age = rep(age, each = n_timepoints))
sim_data$PosAff <- rnorm(n_participants*n_timepoints, mean = 3.655, sd = 1.485)
sim_data$NegAff <- rnorm(n_participants*n_timepoints, mean = 1.042, sd = .810)
sim_data <- sim_data %>% # parsing within- and between-person effects
group_by(id) %>%
mutate(PA_14daymean = mean(PosAff, na.rm=TRUE),
NA_14daymean = mean(NegAff, na.rm=TRUE))%>%
mutate(PA_within = PosAff - PA_14daymean,
NA_within = NegAff -NA_14daymean) %>%
ungroup()%>%
mutate(PA_grandm = mean(PA_14daymean, na.rm=TRUE),
NA_grandm = mean(NA_14daymean, na.rm=TRUE)) %>%
mutate(PA_between = PA_14daymean - PA_grandm,
NA_between = NA_14daymean - NA_grandm) %>%
select(-c(PA_14daymean, NA_14daymean, PA_grandm, NA_grandm))
return(sim_data)
}
```

Always good to test it:

```
dd_test <- dd_data(200, 14)
head(dd_test, n = 20)
```

```
## # A tibble: 20 × 9
## id time age PosAff NegAff PA_within NA_within PA_between NA_between
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 38.0 4.62 1.35 1.27 0.157 -0.276 0.153
## 2 1 2 38.0 3.01 1.83 -0.344 0.639 -0.276 0.153
## 3 1 3 38.0 3.66 0.391 0.307 -0.800 -0.276 0.153
## 4 1 4 38.0 3.23 3.48 -0.119 2.29 -0.276 0.153
## 5 1 5 38.0 2.94 1.17 -0.409 -0.0242 -0.276 0.153
## 6 1 6 38.0 2.47 2.58 -0.883 1.39 -0.276 0.153
## 7 1 7 38.0 3.49 1.28 0.140 0.0881 -0.276 0.153
## 8 1 8 38.0 3.91 0.676 0.560 -0.515 -0.276 0.153
## 9 1 9 38.0 4.79 1.44 1.43 0.248 -0.276 0.153
## 10 1 10 38.0 3.56 -0.269 0.203 -1.46 -0.276 0.153
## 11 1 11 38.0 3.92 1.70 0.564 0.511 -0.276 0.153
## 12 1 12 38.0 2.05 1.05 -1.30 -0.140 -0.276 0.153
## 13 1 13 38.0 3.03 0.558 -0.324 -0.633 -0.276 0.153
## 14 1 14 38.0 2.26 -0.556 -1.09 -1.75 -0.276 0.153
## 15 2 1 45.3 3.61 1.35 -0.354 0.172 0.331 0.140
## 16 2 2 45.3 5.78 -0.0491 1.82 -1.23 0.331 0.140
## 17 2 3 45.3 3.97 1.89 0.0128 0.711 0.331 0.140
## 18 2 4 45.3 2.77 0.780 -1.19 -0.398 0.331 0.140
## 19 2 5 45.3 3.13 1.99 -0.831 0.814 0.331 0.140
## 20 2 6 45.3 3.80 2.18 -0.156 1.01 0.331 0.140
```

Cool! Now we just need to estimate our dependent variable, `collaboration`

, right? Unfortunately, no.

# Adding Random Effects

This is where things get more complicated, because we need to add those random effects before estimating the outcomes variable. As a reminder, a random effect captures individual heterogeneity across our intercept and slope values. For example, while our output from our initial model tells us that our intercept is 1.85, our random intercept says that this value is actually variable across participants. How variable?

`VarCorr(collabMod)`

```
## id = pdLogChol(PA_within + time + 1)
## Variance StdDev Corr
## (Intercept) 0.84743622 0.92056299 (Intr) PA_wth
## PA_within 0.11958214 0.34580651 0.141
## time 0.00227009 0.04764546 -0.300 0.394
## Residual 0.62363845 0.78970783
```

This table shows us the variance and standard deviation of our random effects and residuals, as well as how they are related to each other (i.e., the correlations). For instance, we can see that the effect of time on our outcome really doesn’t change that much from person to person, but within-person PA does. We can also see that people with larger intercept random effects tend to have somewhat larger (*r* = .141) within-person PA random effects.

So what do we do with this information?

First, we need to pull the **covariance** matrix of the random effects, rather than the correlation matrix, because the covariance matrix takes into account the standard deviations (or variances) of each random effect.

For `nlme`

models, the `getVarCov`

function accomplishes this.

```
cov <- getVarCov(collabMod)
cov
```

```
## Random effects variance covariance matrix
## (Intercept) PA_within time
## (Intercept) 0.847440 0.044833 -0.0131510
## PA_within 0.044833 0.119580 0.0064890
## time -0.013151 0.006489 0.0022701
## Standard Deviations: 0.92056 0.34581 0.047645
```

Then, we’ll use a function from the `MASS`

packages called `mvrnorm`

, which randomly samples from a multivariate normal distribution based on these covariances.

`random_effects <- mvrnorm(n_participants, c(0,0,0), Sigma = cov)`

Note here that `n = n_participants`

, rather than `n_participants*n_timepoints`

. That is, random effects vary across **people**, not across within-person time points.

This results in a 3 by 200 `random_effects`

matrix.

`head(random_effects)`

```
## (Intercept) PA_within time
## [1,] 1.4632102 -0.31907004 -0.018897867
## [2,] 0.3884006 -0.07199089 -0.040755470
## [3,] -0.7905150 -0.33635644 -0.013735338
## [4,] -2.4296293 -0.28559102 0.067819419
## [5,] 0.4081169 -0.24265508 -0.007748071
## [6,] 0.1985533 0.54023222 0.007813314
```

Each column is a random effect, each row represents a different person, and because of the way we sampled them, they should covary in the same way as the covariance matrix from the model:

`cov(random_effects)`

```
## (Intercept) PA_within time
## (Intercept) 0.92708400 0.056000468 -0.013733531
## PA_within 0.05600047 0.131580572 0.005525947
## time -0.01373353 0.005525947 0.002251855
```

Looks pretty close!

Recall our equations from above, especially our combined model:

\[collaboration_{ij} = (\gamma_{00} + u_{0j}) + (\gamma_{10} + u_{1j})PAwithin + \] \[(\gamma_{20})NAwithin + (\gamma_{30} + u_{3j})time + (\gamma_{01})PAbetween +\] \[(\gamma_{02}NAbetween) + (\gamma_{03})age + r_{ij}\]

We can see that the intercept and beta values for `PA_within`

and `time`

are simply the sum of a fixed effect, \(\gamma\), and a random effect, \(u\).

So all we need to do is to add our newly simulated random effects to the fixed effects from the original model.

You can pull these fixed effect coefficients from the model like so:

```
beta_int <- fixed.effects(collabMod)["(Intercept)"]
beta_time <- fixed.effects(collabMod)["time"]
beta_PA_within <- fixed.effects(collabMod)["PA_within"]
```

And combine them with the random effects like this:

```
#Pull out each column of random effects
rand_intercepts <- rep(random_effects[,1], each = n_timepoints)
rand_PA_within <- rep(random_effects[,2], each = n_timepoints)
rand_time <- rep(random_effects[,3], each = n_timepoints)
sim_data <- sim_data %>%
mutate(intercepts = (beta_int + rand_intercepts),
slope_time = (beta_time + rand_time),
slope_PA_within = (beta_PA_within + rand_PA_within))
```

Notice that for each random effect, we have to repeat it `n_timepoints`

times. This is because our `sim_data`

data frame is in long format, where each person has a new line of data for each time point.

So now when we look at sim_data, we see the new variables `intercepts`

, `slope_time`

, and `slope_PA_within`

each of which are different for different people:

```
sim_data %>%
select(id, intercepts, slope_time, slope_PA_within) %>%
head(n = 18)
```

```
## # A tibble: 18 × 4
## id intercepts slope_time slope_PA_within
## <int> <dbl> <dbl> <dbl>
## 1 1 3.32 -0.0401 -0.227
## 2 1 3.32 -0.0401 -0.227
## 3 1 3.32 -0.0401 -0.227
## 4 1 3.32 -0.0401 -0.227
## 5 1 3.32 -0.0401 -0.227
## 6 1 3.32 -0.0401 -0.227
## 7 1 3.32 -0.0401 -0.227
## 8 1 3.32 -0.0401 -0.227
## 9 1 3.32 -0.0401 -0.227
## 10 1 3.32 -0.0401 -0.227
## 11 1 3.32 -0.0401 -0.227
## 12 1 3.32 -0.0401 -0.227
## 13 1 3.32 -0.0401 -0.227
## 14 1 3.32 -0.0401 -0.227
## 15 2 2.24 -0.0620 0.0204
## 16 2 2.24 -0.0620 0.0204
## 17 2 2.24 -0.0620 0.0204
## 18 2 2.24 -0.0620 0.0204
```

Now the last thing we need to take into account is our residual, or error term, \(r_{ij}\). Just as the subscript indicates, this *does* vary across time points. We can pull \(\sigma\), the SD of our residual directly from the model, and use it to randomly sample from a normal distribution with a mean of 0.

```
sigma <- collabMod[["sigma"]]
error <- rnorm(n_participants*n_timepoints, 0, sigma)
#add to the data frame
sim_data <- sim_data %>%
mutate(error)
sim_data %>% select(id, error) %>%
head()
```

```
## # A tibble: 6 × 2
## id error
## <int> <dbl>
## 1 1 0.742
## 2 1 -0.364
## 3 1 -1.45
## 4 1 0.104
## 5 1 -0.650
## 6 1 0.289
```

`get_random_effects`

Function

We can summarize all of this syntax used to create the random effects in the following function, which takes as arguments the `sim_data`

from our `dd_dat`

function, the numbers of participants and time points, and the original nlme model:

```
get_random_effects <- function(sim_data, n_participants,
n_timepoints, nlme.model){
beta_int <- fixed.effects(nlme.model)["(Intercept)"]
beta_time <- fixed.effects(nlme.model)["time"]
beta_PA_within <- fixed.effects(nlme.model)["PA_within"]
sigma <- nlme.model[["sigma"]]
cov <- getVarCov(nlme.model)
a <- mvrnorm(n_participants, c(0,0,0), Sigma = cov)
rand_intercepts <- rep(a[,1], each = n_timepoints)
rand_PA_within <- rep(a[,2], each = n_timepoints)
rand_time <- rep(a[,3], each = n_timepoints)
error <- rnorm(n_participants*n_timepoints, 0, sigma)
sim_data <- sim_data %>%
mutate(intercepts = (beta_int + rand_intercepts),
slope_time = (beta_time + rand_time),
slope_PA_within = (beta_PA_within + rand_PA_within),
error)
return(sim_data)
}
```

Are we done yet?

`get_outcome`

Function

Not quite! Now we can (finally) calculate our independent variable, `collaboration`

. To do so, we plug in our combined equation from above:

```
get_outcome <- function(sim_data, n_participants,
n_timepoints, nlme.model){
#betas for predictors with only fixed effects:
beta_PA_between <- fixed.effects(nlme.model)["PA_between"]
beta_NA_within <- fixed.effects(nlme.model)["NA_within"]
beta_NA_between <- fixed.effects(nlme.model)["NA_between"]
beta_age <- fixed.effects(nlme.model)["age"]
sim_data <- sim_data %>%
#this should match combined model above:
mutate(collaboration =
intercepts +
slope_PA_within*PA_within +
beta_PA_between*PA_between +
beta_NA_within*NA_within +
beta_NA_between*NA_between +
slope_time*time +
beta_age*age +
error)
return(sim_data)
}
```

Note that we still need to pull the coefficients for predictors with fixed effects only. As fixed effects, these slopes are the same across all participants, so don’t need to be added into our data frame as new columns.

# Putting it all together

Now we can combine our three functions `dd_data`

, `get_random_effects`

and `get_outcome`

into one sleek `simulate_data`

function:

```
simulate_data <- function(n_participants, n_timepoints, nlme.model){
dd_data(n_participants, n_timepoints) %>%
get_random_effects(n_participants, n_timepoints, nlme.model) %>%
get_outcome(n_participants, n_timepoints, nlme.model)
}
```

Let’s recap everything that is going on here. We input the number of participants, number of time points, and original nlme model. `dd_dat`

takes just `n_participants`

and `n_timepoints`

and outputs the `sim_data`

data frame where each participant has 14 lines of data on our predictors of interest. `get_random_effects`

takes this sim_data and adds random effects for `intercepts`

, `time`

, and `PA_within`

, and outputs a modified `sim_data`

object. Finally, `get_outcome`

uses `sim_data`

and some additional information from our model to estimate our outcome based on our combined model.

(Remember that the pipe, `%>%`

, takes the output of a line and inputs it as the first argument in the next line. That’s why we don’t have to type `sim_data`

as an input – it’s being passed through each function under the hood.)

Now we can run this function, and use the resulting data frame to fit our original model. This new output should closely match the original output.

`sim_test <- simulate_data(200, 14, collabMod)`

```
simMod <- lme(collaboration ~ PA_within + PA_between +
NA_within + NA_between + age + time,
data = sim_test,
random = ~ PA_within + time + 1|id,
na.action = na.omit)
```

```
summary_fixed <- function(nlme.mod){
summary(nlme.mod)[["tTable"]] %>%
as.data.frame() %>%
select(Value, Std.Error, "p-value") %>%
rename(p = "p-value") %>%
mutate(
Value = round(Value, 2),
Std.Error = round(Std.Error, 2),
p = round(p, 3))
}
summary_fixed(collabMod)
```

```
## Value Std.Error p
## (Intercept) 1.85 0.32 0.000
## PA_within 0.09 0.04 0.025
## PA_between 0.41 0.10 0.000
## NA_within 0.00 0.04 0.941
## NA_between -0.14 0.14 0.320
## age 0.01 0.01 0.053
## time -0.02 0.01 0.000
```

`summary_fixed(simMod)`

```
## Value Std.Error p
## (Intercept) 1.91 0.23 0.000
## PA_within 0.04 0.03 0.130
## PA_between 0.32 0.17 0.064
## NA_within -0.01 0.02 0.697
## NA_between 0.45 0.31 0.148
## age 0.01 0.00 0.007
## time -0.03 0.01 0.000
```

Looks pretty good! Notice that the non-significant predictors (like NA_between) are quite different across the two models, while the significant predictors match.

We can also compare the random effects:

`VarCorr(collabMod)`

```
## id = pdLogChol(PA_within + time + 1)
## Variance StdDev Corr
## (Intercept) 0.84743622 0.92056299 (Intr) PA_wth
## PA_within 0.11958214 0.34580651 0.141
## time 0.00227009 0.04764546 -0.300 0.394
## Residual 0.62363845 0.78970783
```

`VarCorr(simMod)`

```
## id = pdLogChol(PA_within + time + 1)
## Variance StdDev Corr
## (Intercept) 0.912248071 0.95511678 (Intr) PA_wth
## PA_within 0.140996279 0.37549471 0.143
## time 0.002571546 0.05071042 -0.240 0.332
## Residual 0.583742629 0.76403052
```

Looks good! Now, you can use the `simulate_data`

function in other analyses, like in Monte Carlo power simulations where you need to simulate data over and over again. To learn more about this, see my tutorial here.

### Simulating dependent predictors

If you read through all of this and said to yourself, “Hm, I really wish this was more complicated,” then never fear, we can add another layer of complexity.

Above, we talked about how our positive and negative affect predictors are probably related to each other. That is, if someone reports feeling very happy on one day, they are probably also reporting very little sadness on that day. The way we have simulated this so far does not take this covariance into account. To see how to do this, see this tutorial, which uses the same sample data and model used here.