The `boot`

package in R makes bootstrapping easy – but the format of the output is somewhat of a nightmare. Here, I talk through a small function to extract your `boot`

confidence intervals in a more user-friendly format. If you are new to functions, this could be a good example to get a feel for how they work.

```
library(boot)
library(tidyverse)
```

Let’s say we have some sample data of N = 200. We’ve got 4 predictors `x1`

-`x4`

and an outcome variable `y.`

Predictors `x1`

-`x3`

and outcome `y`

are continuous, while `x4`

is binary:

`head(data)`

```
## # A tibble: 6 × 5
## x1 x2 x3 x4 y
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.33 2.21 1.25 1 6.6
## 2 5 1.05 1 2 6.2
## 3 4 1.04 1.07 1 6.9
## 4 5 1.32 1.02 1 6.4
## 5 4 1.08 1.10 1 6.5
## 6 3.28 1.08 1.95 1 5.8
```

We can see how these variables relate to y in the following simple model:

```
mod <- lm(y ~ x1 + x2 + x3 + x4,
data)
summary(mod)
```

```
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6296 -1.0859 -0.3642 0.4885 6.7462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.56805 1.06004 7.139 1.81e-11 ***
## x1 -0.37831 0.17696 -2.138 0.033779 *
## x2 -0.07984 0.26817 -0.298 0.766228
## x3 -0.06904 0.35081 -0.197 0.844191
## x4 0.84633 0.24370 3.473 0.000634 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.707 on 195 degrees of freedom
## Multiple R-squared: 0.08576, Adjusted R-squared: 0.06701
## F-statistic: 4.573 on 4 and 195 DF, p-value: 0.001495
```

But say we then notice that our outcome variable is actually quite skewed, and so we aren’t sure if we can trust the results from our model since our assumption of normality is violated:

```
ggplot(data, aes(x = y)) +
geom_histogram(bins = 100)
```

This is an instance where bootstrapping is handy. Bootstrapping takes a random sample of size `n`

from your sample, and fits your model again. Then it resamples, and fits your model again…and again, and again. It samples with *replacement*, so you’ll have some observations that appear multiple times in each iteration, while other observations are missed. After it has resampled your data and fit your model many times, it aggregates the results and spits out a bootstrapped estimate, standard error, and confidence interval. Since each iteration was based off of slightly different observations, your result is less likely to be biased by skew, outliers, or other wonky distributions.

Below we can see the arguments taken by the `boot::boot()`

function

`str(boot)`

```
## function (data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"),
## strata = rep(1, n), L = NULL, m = 0, weights = NULL, ran.gen = function(d,
## p) d, mle = NULL, simple = FALSE, ..., parallel = c("no", "multicore",
## "snow"), ncpus = getOption("boot.ncpus", 1L), cl = NULL)
```

Most important are the first three:

`data`

: the matrix or data frame you’re working with`statistic`

: a function that returns your statistic of interest`R`

: the number of times you want to resample

For us, the statistics we’re interested in are the estimates returned from our linear model. But we want to format a new function so that it automatically returns just those coefficients, and not all the other info that comes with `lm`

. We can create a small function to extract this info like so:

```
fun <- function(data, index){
return(coef(lm(mod, data = data, subset = index)))
}
```

Here, `mod`

is the model we fit above, `data`

is (obviously) our data, and `index`

is a vector indicating which observations (rows) from our data frame should be included. For example, if we select all of our rows, our output should be identical to the initial results:

`fun(data, 1:200)`

```
## (Intercept) x1 x2 x3 x4
## 7.56804577 -0.37830562 -0.07984083 -0.06903943 0.84632682
```

`coef(mod)`

```
## (Intercept) x1 x2 x3 x4
## 7.56804577 -0.37830562 -0.07984083 -0.06903943 0.84632682
```

`boot()`

randomly chooses which observations should be included in `index`

```
boot.object <- boot(data, fun, 1000)
boot.object
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 7.56804577 0.0482331382 1.0726538
## t2* -0.37830562 -0.0117212441 0.2004917
## t3* -0.07984083 0.0024717868 0.2827149
## t4* -0.06903943 -0.0040158995 0.3152365
## t5* 0.84632682 -0.0005952484 0.2533492
```

And `boot.ci`

gives us our confidence intervals. Note that, if you are interested in multiple coefficients like we are, you have to specifically call that estimate using the `index`

argument, where 1 calls t1* above (the intercept for us), 2 calls t2* (our estimate for x4) etc.

```
#this is the intercept
boot.ci(boot.object, conf = .95, type = c("norm", "basic", "bca"), index = 1)
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.object, conf = 0.95, type = c("norm",
## "basic", "bca"), index = 1)
##
## Intervals :
## Level Normal Basic BCa
## 95% ( 5.417, 9.622 ) ( 5.506, 9.635 ) ( 5.500, 9.630 )
## Calculations and Intervals on Original Scale
```

```
#this is for x1
boot.ci(boot.object, conf = .95, type = c("norm", "basic", "bca"), index = 2)
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.object, conf = 0.95, type = c("norm",
## "basic", "bca"), index = 2)
##
## Intervals :
## Level Normal Basic BCa
## 95% (-0.7595, 0.0264 ) (-0.7374, 0.0163 ) (-0.7717, -0.0188 )
## Calculations and Intervals on Original Scale
```

```
#this is for x2
boot.ci(boot.object, conf = .95, type = c("norm", "basic", "bca"), index = 3)
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.object, conf = 0.95, type = c("norm",
## "basic", "bca"), index = 3)
##
## Intervals :
## Level Normal Basic BCa
## 95% (-0.6364, 0.4718 ) (-0.6590, 0.4673 ) (-0.6288, 0.4856 )
## Calculations and Intervals on Original Scale
```

To return all of our coefficients and their associated CIs in a nice clean table, we can use the following formula:

```
CItable = function(mod, data, boot.object){
pred = c(names(mod$coefficients))
index = seq(1:length(pred))
CIt = data.frame(index, pred)
for (i in index){
CI = boot.ci(boot.object, conf = 0.95, type = "norm", index = i)
CIt$Estimate[i] = boot.object[["t0"]][[i]]
CIt$SEboot[i] = apply(boot.object$t,2,sd)[i]
CIt$lower.bound[i] = CI$norm[,2]
CIt$upper.bound[i] = CI$norm[,3]}
return(CIt)
}
```

Let’s walk through what this does.

As arguments, this function takes in our model of interest, our data, and `boot.object`

, the object created above using the `boot()`

function.

First, the function creates a vector called `pred`

which consists of the names of the model coefficients.

```
pred = c(names(mod$coefficients))
pred
```

`## [1] "(Intercept)" "x1" "x2" "x3" "x4"`

Next it creates an index object, which is just a sequence of the same length as `pred.`

```
index = seq(1:length(pred))
index
```

`## [1] 1 2 3 4 5`

We put them together in a data frame.

```
CIt = data.frame(index, pred)
CIt
```

```
## index pred
## 1 1 (Intercept)
## 2 2 x1
## 3 3 x2
## 4 4 x3
## 5 5 x4
```

Next, we initiate a `for`

loop which goes through each value in `index`

, and plugs that into the `boot.ci()`

function. Below, we input our `boot.object`

object from above and set `i`

to 1 as an example.

```
CI = boot.ci(boot.object, conf = 0.95, type = "norm", index = 1)
CI
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.object, conf = 0.95, type = "norm", index = 1)
##
## Intervals :
## Level Normal
## 95% ( 5.417, 9.622 )
## Calculations and Intervals on Original Scale
```

Ok, here we have our output in terrible text format.

Next, we pull the estimate out of our boot object.

`boot.object[["t0"]][[1]]`

`## [1] 7.568046`

Note that the `CIt$Estimate[i]=`

just adds this extracted value to the `i`

th row of `CIt`

in a new column called `Estimate`

.

To get the standard errors, we need to take the standard deviation of the 1000 replications of each of our coefficients. Here, `boot.object$t`

gives us a matrix with 5 columns (one for each coefficient), and 1000 rows (one for each repetition).

`head(boot.object$t)`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7.875881 -0.4552777 -0.62879867 0.5926316 0.7166760
## [2,] 6.570837 -0.1225229 0.06779021 -0.1819644 0.7719799
## [3,] 8.439980 -0.5099639 0.04043891 -0.3494414 0.7018210
## [4,] 6.619611 -0.2054833 -0.38904491 0.3489171 0.9155234
## [5,] 7.717201 -0.5518786 0.39719526 -0.3515270 0.8313202
## [6,] 7.269340 -0.3767260 -0.54977758 0.3026698 1.1379344
```

`str(boot.object$t)`

`## num [1:1000, 1:5] 7.88 6.57 8.44 6.62 7.72 ...`

The `apply()`

function is a fast way to apply a given function over different parts of a matrix. It takes in the matrix `boot.object`

, as well as a `margin`

value (1 for rows, 2 for columns) indicating which parts of the matrix the function will be applied over. Finally, it takes the function, `sd`

. The `[i]`

subscripts specifies which column we are interested in according to our progression through the `for`

loop. As an example:

`apply(boot.object$t, 2, sd)[1]`

`## [1] 1.072654`

Which matches the SE for the intercept from the overall `boot`

object.

`boot.object`

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 7.56804577 0.0482331382 1.0726538
## t2* -0.37830562 -0.0117212441 0.2004917
## t3* -0.07984083 0.0024717868 0.2827149
## t4* -0.06903943 -0.0040158995 0.3152365
## t5* 0.84632682 -0.0005952484 0.2533492
```

Finally, we can pull the lower and upper bounds like so:

`CI$norm[,2]`

`## [1] 5.41745`

`CI$norm[,3]`

`## [1] 9.622176`

Note that you can opt for different types of bootstrapped CIs, like BCA or percentile. But the subscripts in the final few lines of the function may change because of the different outputs. You can double check which ones to use by running.

`CI$norm`

```
## conf
## (Intercept) 0.95 5.41745 9.622176
```

Here we can clearly see that the first column is our confidence level, the second column is the lower bound, and the third is the upper bound. If we’re interested in BCA CIs:

```
CI = boot.ci(boot.object, conf = 0.95, type = "bca", index = 1)
CI$bca
```

```
## conf
## [1,] 0.95 25.01 975.96 5.500238 9.629616
```

We see there are some additional parameters in there, and we would want the 4th and 5th for our CI bounds.

Anyway, the estimate, SE, lower, and upper bounds are all added as columns to the original `CIt`

object, and finally, we return our completed table with `return(CIt)`

Now when we run our shiny new function, we get a shiny new table which doesn’t give you carpal tunnel from copying and pasting it to wherever it’s going to live. We can also now easily see which parameters have CIs that exclude 0, and which don’t. While both `x1`

and `x4`

were significant in the original linear model, it looks like `x1`

was not a very robust finding, as it’s CI includes 0 here.

`CItable(mod = mod, data = data, boot.object = boot.object)`

```
## index pred Estimate SEboot lower.bound upper.bound
## 1 1 (Intercept) 7.56804577 1.0726538 5.4174497 9.62217553
## 2 2 x1 -0.37830562 0.2004917 -0.7595409 0.02637217
## 3 3 x2 -0.07984083 0.2827149 -0.6364237 0.47179847
## 4 4 x3 -0.06903943 0.3152365 -0.6828758 0.55282870
## 5 5 x4 0.84632682 0.2533492 0.3503667 1.34347743
```

Getting comfortable with functions can really take your R skills to the next level. If you want to learn more about creating your own, I recommend Roger Peng’s R Programming for Data Science and Jenny Bryan’s Stat 545 sites.