Variance Component Analysis

Variance component analysis (VCA) is a way to assess the reliability of measures designed for intensive longitudinal data, like daily diary or ecological momentary assessment studies. These studies pose particular measurement challenges, because you need your measure to accurately capture your construct of interest while remaining sensitive to small changes over short periods of time. That is, you are explicitly interested in capturing these within-person changes in a reliable way. For a more complete introduction to this topic, I recommend Bolger and Laurenceau's classic Intensive Longitudinal Methods book.

This post focuses on conducting VCA in R. I'll briefly review the underlying theory, but the main goal is to write a function that can quickly calculate a measure's reliability in one step.

Brief Background

In typical psychological measures, reliability assessments are concerned with differentiating true variability (e.g., real variance in the construct of interest) from measurement error (noise present in the measures). For intensive longitudinal data, this task is complicated by the fact that the variance in a measure can stem from several additional sources.

For example, suppose we have a daily diary study where individuals reported their mood at the end of each day for 14 days. We're interested in participants' reported happiness, which was measured by three items each day. Here, we expect that every day, the three happiness items should "hang together" well (i.e., be highly correlated), but we would expect these measures to vary from day to day and from person to person. In other words, the variability across persons and time points is meaningful; we don't want to call this "error", because we are specifically interested in measuring within-person changes over time.

Bolger and Laurenceau use a random effects model to decompose these variances and estimate a reliability coefficient, \(R_C\). I’ll again point readers to their excellent book for a full discussion of this topic. But their formula of interest for VCA is:

\[ R_C = \frac{\sigma^2_{TP}}{\sigma^2_{TP} + \frac{\sigma^2_\epsilon}{k}}\]

where \(\sigma^2_{TP}\) is the variance of the time-by-person random effect (i.e., the variance specific to each person/time-point combination across the measurement period), \(\sigma^2_\epsilon\) represents the error variance, and \(k\) represents the number of items in the given measure.

(Note that \(\sigma^2_\epsilon\) is actually the sum of the three-way random effect \(\sigma^2_{TPI}\) and variance from all remaining sources, but these terms can’t be disentangled in this type of model.)
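To make the formula concrete, here is a quick calculation with made-up variance values (these numbers are purely illustrative and not from any real model):

sigma2_TP  <- 0.40 #hypothetical time-by-person variance
sigma2_eps <- 0.30 #hypothetical error variance
k          <- 3    #number of items in the scale

sigma2_TP / (sigma2_TP + sigma2_eps / k) #0.40 / (0.40 + 0.10) = 0.80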

Conducting VCA in R

This post is specifically focused on how to actually calculate this \(R_C\) value for a daily diary measure in R. Note that you need at least three items in your measure to conduct VCA. There are three key steps:

  • Transform your data into the correct format
  • Run the correct mixed effects model
  • Extract the necessary variance terms from your model and calculate \(R_C\) using the formula above.

Here, I break each of these steps into different helper functions, and then combine these into an overall function to streamline VCA calculations.

For this tutorial, I’ll use sample data from a 14-day daily diary study. Each night, participants rated their mood (happy, depressed, anxious, angry) for that day. Each mood subscale was measured with three different items.

Since we’re assessing measure reliability, we need to work with the individual items rather than the subscale average. Here’s what the diary data look like:

head(ddata)
##   ptid day happy1 happy2 happy3 depr1 depr2 depr3 anx1 anx2 anx3 ang1 ang2 ang3
## 1    1   0      3      2      3     1     2     1    2    3    3    1    1    1
## 2    1   1      3      3      3     1     2     1    3    2    2    1    1    1
## 3    1   2      3      4      3     1     1     1    2    2    2    1    1    1
## 4    1   3      3      3      3     1     2     1    3    2    2    1    1    1
## 5    1   4      3      3      3     1     1     3    3    2    2    1    1    1
## 6    1   5      5      1      1     4     2     3    5    5    2    5    2    1

Step 1: Function to Transform Data

Notice that our data above are in long format, meaning that each participant (ptid) has multiple rows of data, one for each day of the diary period (day column).

For VCA, we need to do a few data transformations before we can proceed. First, we need to select just the variables contained in the measure of interest, along with our ID variable and our time variable. Let's say we're interested in the reliability of our "happy" measure.

We identify the variables we want to keep and then select them from our data:

vars_to_keep <- c("ptid", "day", "happy1", "happy2", "happy3")
trimmedData <- ddata[, vars_to_keep, drop = FALSE]
head(trimmedData)
##   ptid day happy1 happy2 happy3
## 1    1   0      3      2      3
## 2    1   1      3      3      3
## 3    1   2      3      4      3
## 4    1   3      3      3      3
## 5    1   4      3      3      3
## 6    1   5      5      1      1

Next, we need to transform this dataset into “longer” format. Rather than having a different column for each “happy” variable, we want a single “value” column that contains the numeric values for all three of these. We also want an “item” column that retains the information about which item each number corresponds to. This is what the tidyverse function pivot_longer was built to do.

library(tidyverse)

longerData <- trimmedData %>% 
  pivot_longer(cols = c(happy1, happy2, happy3),
               names_to = "item",
               values_to = "values")

Here, cols should contain just the variables in our scale of interest.
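As an aside, because these items share a common prefix, a tidyselect helper would work here too; the following is equivalent for the happy items:

longerData <- trimmedData %>% 
  pivot_longer(cols = starts_with("happy"), #selects happy1, happy2, happy3
               names_to = "item",
               values_to = "values")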

Let’s take a look:

head(longerData, n = 20)
## # A tibble: 20 × 4
##     ptid   day item   values
##    <int> <int> <chr>   <int>
##  1     1     0 happy1      3
##  2     1     0 happy2      2
##  3     1     0 happy3      3
##  4     1     1 happy1      3
##  5     1     1 happy2      3
##  6     1     1 happy3      3
##  7     1     2 happy1      3
##  8     1     2 happy2      4
##  9     1     2 happy3      3
## 10     1     3 happy1      3
## 11     1     3 happy2      3
## 12     1     3 happy3      3
## 13     1     4 happy1      3
## 14     1     4 happy2      3
## 15     1     4 happy3      3
## 16     1     5 happy1      5
## 17     1     5 happy2      1
## 18     1     5 happy3      1
## 19     1     6 happy1      3
## 20     1     6 happy2      2

We can now see that both ptid and day are repeated, and our items are “stacked” – our data frame is now much longer than it was before!

We can streamline this process by condensing the above code into a function. Here, the arguments id and time should be character strings giving the names of your id and time variables, e.g., "ptid" and "day". The argument variables should be a character vector containing the names of the items in your scale of interest, e.g., variables = c("happy1", "happy2", "happy3").

There are a few small differences in the code below: we use the all_of() function within pivot_longer to specify our cols safely, we name the value column "value" (singular) to match the model syntax we'll use later, we convert the id, time, and item variables to factors at the very end, and we rename our id and time variables to facilitate future operations.

reshapeLonger <- function(data, id, time, variables){
  require(tidyverse, quietly = TRUE, warn.conflicts = TRUE)
  
  #merge arguments and select columns from larger data set
  select_cols = append(variables, c(id, time), after = 0)
  subdata <- data[, select_cols, drop = FALSE]

  #transform to "longer" format
  longerData <- subdata %>% 
    pivot_longer(cols = all_of(variables), #use all_of() here
               names_to = "item",
               values_to = "value")
  
  #convert the id, time, and "item" variables to factors
  factors <- c(id, time, "item")
  longerData[factors] <- lapply(longerData[factors], as.factor)
  
  #standardize names for ease in the future  
  names(longerData)[1] = "id" #ptid becomes id
  names(longerData)[2] = "time" #day becomes time
  
  return(longerData)
}

Let’s test it out!

longerDataTest <- reshapeLonger(data = ddata, id = "ptid", time = "day", 
                            variables = c("happy1", "happy2", "happy3"))

head(longerDataTest)
## # A tibble: 6 × 4
##   id    time  item   value
##   <fct> <fct> <fct>  <int>
## 1 1     0     happy1     3
## 2 1     0     happy2     2
## 3 1     0     happy3     3
## 4 1     1     happy1     3
## 5 1     1     happy2     3
## 6 1     1     happy3     3

Looks good!
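As one more sanity check, the reshaped data should contain exactly three rows (one per item) for every row of the original data:

nrow(longerDataTest) == nrow(ddata) * 3 #should be TRUE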

Step 2: Function to Run the Model

Here, we run the mixed effects model so that we can access the partitioned variance for our \(R_C\) formula. We use the package lme4 to do this.

Note that this function takes as its first argument the longerData object that we created in the function above. Since we created that data frame to our specifications, we know that it already has variables for id, time, and item. All we need to do is run the appropriate model with the provided data.

Here, our model includes random effects for each of our key sources of variance: id, time, and item, as well as random effects for their two-way interactions. The function returns the fitted model.

runModel <- function(longerData){
  require(lme4, quietly = TRUE)

  mod <- lmer(value ~  1 + (1|id) + (1|time) + (1|item) 
           + (1|id:time) + (1|id:item) + (1|time:item), 
           data = longerData) 
  
  return(mod)
}

Let’s test this on our longerDataTest object from above. We’ll save the output as hapmod because it is a model for our “happy” subscale. This sometimes takes a few seconds to run because the model is quite complex.

hapmod <- runModel(longerDataTest)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
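These messages come from lme4 loading its dependencies and are harmless. If you would rather keep your console clean, you could wrap the call in suppressMessages():

hapmod <- suppressMessages(runModel(longerDataTest))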

Step 3: Function to Calculate \(R_C\)

Recall our formula for \(R_C\) from above:

\[ R_C = \frac{\sigma^2_{TP}}{\sigma^2_{TP} + \frac{\sigma^2_\epsilon}{k}}\]

Our final task is to turn this into a function. To do so, we need to access all of the terms in the formula: \(\sigma^2_{TP}\), \(\sigma^2_\epsilon\), and \(k\).

\(\sigma^2_{TP}\) is the variance for the id:time random effect from our model. We can access this using the VarCorr function from lme4. This gives us the variance and correlation components of our mixed-effects model:

VarCorr(hapmod)
##  Groups    Name        Std.Dev.
##  id:time   (Intercept) 0.599702
##  id:item   (Intercept) 0.231063
##  id        (Intercept) 0.782819
##  time:item (Intercept) 0.034733
##  time      (Intercept) 0.035292
##  item      (Intercept) 0.207363
##  Residual              0.537315

While this prints the standard deviations, the variances are also stored in this object – and that is what we really want. We can access the variance for id:time like so:

modcov <- VarCorr(hapmod)
modcov[["id:time"]][1,1]
## [1] 0.359642
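If you would rather see all of the variances at once instead of indexing into the object, you can also convert the VarCorr output to a data frame; the vcov column holds the variances and the sdcor column holds the standard deviations:

as.data.frame(VarCorr(hapmod)) #vcov = variances, sdcor = standard deviations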

For \(\sigma^2_\epsilon\), we start with the sigma() function, which pulls the residual standard deviation from our model object:

sigma(hapmod)
## [1] 0.5373149
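Since sigma() returns the residual standard deviation rather than the variance, we square it to get \(\sigma^2_\epsilon\):

sigma(hapmod)^2 #approximately 0.289, the residual (error) variance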

Finally, \(k\), the number of items in our scale, is simply the length of our variables vector:

length(c("happy1", "happy2", "happy3"))
## [1] 3

Putting this all together, we can thus create a function that extracts all of these values and combines them using our \(R_C\) formula:

calcRc <- function(mod, variables){
  k <- length(variables) #k = number of items in the scale
  modvar <- VarCorr(mod) #variance components from the model
  idTime <- modvar[["id:time"]][1,1] #extract the id:time (time-by-person) variance
  modRc <- idTime / (idTime + (sigma(mod)^2 / k)) #calculate RC
  return(modRc)
}

We test this by inputting our model object and our variables vector:

calcRc(hapmod, variables = c("happy1", "happy2", "happy3"))
## [1] 0.7889

This tells us that our “happy” subscale shows good measurement reliability!

Step 4: Combining Into One Function

Now, we want to combine all of this work into a single function. This is actually the easiest step! We just use our three helper functions inside of a new varcomp function:

varcomp <- function(data, id, time, variables){
  longerData <- reshapeLonger(data, id, time, variables)
  mod <- runModel(longerData)
  return(calcRc(mod, variables))
}

And now we can run our VCA in a single step!

varcomp(data = ddata, id = "ptid", time = "day", 
        variables = c("happy1", "happy2", "happy3"))
## [1] 0.7889

Let’s try it on some of the other mood measures we have.

#anger
anger <- varcomp(data = ddata, id = "ptid", time = "day", 
        variables = c("ang1", "ang2", "ang3"))

#anxiety
anxiety <- varcomp(data = ddata, id = "ptid", time = "day", 
        variables = c("anx1", "anx2", "anx3"))

print(paste("Anger RC:", round(anger, 3)))
## [1] "Anger RC: 0.811"
print(paste("Anxiety RC:", round(anxiety, 3)))
## [1] "Anxiety RC: 0.491"

It looks like Anger has good reliability, but Anxiety does not. We would want to further investigate what is going on with Anxiety before proceeding with any analyses.
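Finally, if you want \(R_C\) for several subscales at once, you could loop varcomp() over a named list of item vectors. Here is a minimal sketch using the subscales in this dataset:

#item names for each mood subscale
subscales <- list(
  happy   = c("happy1", "happy2", "happy3"),
  depress = c("depr1", "depr2", "depr3"),
  anxiety = c("anx1", "anx2", "anx3"),
  anger   = c("ang1", "ang2", "ang3")
)

#returns a named vector of RC values, one per subscale
sapply(subscales, function(items){
  varcomp(data = ddata, id = "ptid", time = "day", variables = items)
})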