23  Tutorial: Partial Associations

23.1 How a GLM works (for partial associations)

A key strength of a GLM is that it can estimate the association between a predictor and an outcome while simultaneously accounting for other variables in the model; that is, it can quantify partial associations rather than just simple bivariate relationships. It helps to remember that when a model includes multiple predictors, its coefficients describe partial associations, which are defined with respect to:

how variation in one predictor relates to the response after accounting for other predictors in the model

Although R (and other good software) estimates these quantities automatically (i.e., behind the scenes), it can be helpful to unpack what “controlling for” actually means. This can help demystify the GLM (or even general linear regression) framework. In this brief tutorial, we strip the GLM down and explain how partial associations are calculated by the model.
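To see why this distinction matters, here is a minimal simulated sketch (the variables x1, x2, and y are made up for illustration, not part of the tutorial dataset). Because x2 is correlated with x1 but has no true effect on y, a simple bivariate regression of y on x2 is misleading, while the multivariable model recovers the partial (near-zero) association:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)   # x2 correlated with x1, but...
y  <- 2 * x1 + rnorm(n)     # ...y depends only on x1

coef(glm(y ~ x2))       # simple association: clearly nonzero (misleading)
coef(glm(y ~ x1 + x2))  # partial association of x2: close to zero
```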

23.2 Residualization

One intuitive way to do this is through residualization. By successively removing the variation explained by a covariate, we can isolate the remaining structure in the data (the residuals) and then examine how it relates to the response variable. The sequence below reproduces, almost exactly, the coefficient for a single predictor in a GLM.

Note: Frisch–Waugh–Lovell Theorem (Residualization)

In a multiple regression, the coefficient of a predictor \(X_i\) equals the slope obtained by regressing the residuals of \(Y\) (after removing all other predictors) on the residuals of \(X_i\) (after removing those same predictors).
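In symbols (a compact restatement of the theorem, with \(\mathbf{X}_{-i}\) denoting all predictors other than \(X_i\)):

\[
\hat{\beta}_i \;=\; \frac{\operatorname{Cov}\!\left(\tilde{Y}, \tilde{X}_i\right)}{\operatorname{Var}\!\left(\tilde{X}_i\right)},
\qquad
\tilde{Y} = \operatorname{resid}\!\left(Y \sim \mathbf{X}_{-i}\right),
\quad
\tilde{X}_i = \operatorname{resid}\!\left(X_i \sim \mathbf{X}_{-i}\right).
\]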

23.3 Residualization: Two model predictors

First, we begin with the previously examined dataset of Lepusantilocapra pufferi (puffer jackalope). Again, you will need to set up your R workspace, including loading R packages and reading in the example dataset.

source("chapters/class_r_scripts/load_packages.r")

You will also need the following example data file:

Read these data into R:

pufferi <- read.csv("chapters/class_exercise_data/lepusantilocapra_pufferi_abundance.csv")
str(pufferi) # examine data structure
'data.frame':   146 obs. of  9 variables:
 $ site      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ abund     : int  76 161 39 410 177 695 352 674 624 736 ...
 $ dens      : num  0.00207 0.00352 0.000981 0.008039 0.005933 ...
 $ mean_depth: int  804 808 809 848 853 960 977 982 985 986 ...
 $ year      : int  1978 2001 2001 1979 2002 1980 1981 1979 1982 1980 ...
 $ period    : int  1 2 2 1 2 1 1 1 1 1 ...
 $ x_km      : num  98.8 76.8 103.8 91.5 107.1 ...
 $ y_km      : num  -57.5 178.6 -50.1 146.4 -37.1 ...
 $ swept_area: num  36710 45741 39775 51000 29831 ...

We will adopt a model that is good enough for a short example (even though we established earlier that it is a poorly specified model):

mod_base <- glm(abund ~ mean_depth + dens, data = pufferi)

The goal of this exercise is to estimate the partial association between mean_depth and abund. That is, we are trying to estimate the effect of mean_depth on abund after accounting for dens.

There are three steps for this example. Click on each of the callout boxes below to work through the process of estimating partial effects in a GLM. For the sake of clarity, I include only two covariates, and I use lengthy object names that do not follow snake_case.

First, regress mean_depth on dens (the two predictors, in this case) and calculate model residuals.

mod_DEPTH_on_DENS <- glm(mean_depth ~ dens, data = pufferi)
depth_resid <- resid(mod_DEPTH_on_DENS)

The residuals from this model represent the component of mean_depth that cannot be explained by dens. This answers the question: How does mean_depth vary once dens has been held constant?
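As a quick sanity check (not one of the three steps): residuals from a least-squares fit are, by construction, uncorrelated with the predictors used to produce them, so depth_resid should be numerically uncorrelated with dens.

```r
cor(depth_resid, pufferi$dens)  # effectively zero (floating-point noise)
```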

Next, regress abund on dens and retain the residuals.

mod_ABUND_on_DENS <- glm(abund ~ dens, data = pufferi)
abund_resid <- resid(mod_ABUND_on_DENS)

These residuals represent variation in abund that is not associated with dens. By now, you should see the general pattern of this residualization approach.

Finally, regress the residuals of abund (Step 2) onto the residuals of mean_depth (Step 1).

mod_resid_depth <- glm(abund_resid ~ depth_resid, data = pufferi)

You can run the summary() function to view the model output. The slope from this regression quantifies the association between mean_depth and abund after dens has been accounted for. This value will match the coefficient for mean_depth in the full multivariable model (full-model summary shown first, residual regression second):


Call:
glm(formula = abund ~ mean_depth + dens, data = pufferi)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.858e+01  1.926e+01   2.003   0.0471 *  
mean_depth  -3.730e-04  5.833e-03  -0.064   0.9491    
dens         3.787e+04  1.318e+03  28.735   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 5082.682)

    Null deviance: 7067483  on 145  degrees of freedom
Residual deviance:  726824  on 143  degrees of freedom
AIC: 1665.2

Number of Fisher Scoring iterations: 2

Call:
glm(formula = abund_resid ~ depth_resid, data = pufferi)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.009e-13  5.880e+00   0.000    1.000
depth_resid -3.730e-04  5.812e-03  -0.064    0.949

(Dispersion parameter for gaussian family taken to be 5047.386)

    Null deviance: 726844  on 145  degrees of freedom
Residual deviance: 726824  on 144  degrees of freedom
AIC: 1663.2

Number of Fisher Scoring iterations: 2

Indeed, you see that the effect size is -3.730e-04 in both cases. So, we have successfully reconstructed how GLMs estimate partial associations between predictors and outcome variables.
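You can also confirm the match programmatically. The sketch below refits both models from scratch (the object names mod_full_chk and mod_resid_chk are mine) and compares the two slopes:

```r
mod_full_chk  <- glm(abund ~ mean_depth + dens, data = pufferi)
mod_resid_chk <- glm(resid(glm(abund ~ dens, data = pufferi)) ~
                       resid(glm(mean_depth ~ dens, data = pufferi)))

coef(mod_full_chk)["mean_depth"]  # slope from the full model
coef(mod_resid_chk)[2]            # slope from the residual regression
all.equal(unname(coef(mod_full_chk)["mean_depth"]),
          unname(coef(mod_resid_chk)[2]))  # TRUE, up to numerical tolerance
```

(The point estimates are mathematically identical; only the standard errors differ slightly, because the residual regression does not account for the degrees of freedom used in the first-stage fits.)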

Note: Repeat this logic for dens

You could do the same thing for dens and find its partial association with abund after controlling for mean_depth.

# residual of focal, controlling for non-focal
mod_DENS_on_DEPTH <- glm(dens ~ mean_depth, data = pufferi)
dens_resid <- resid(mod_DENS_on_DEPTH)

# residual of effect of non-focal predictor
mod_ABUND_on_DEPTH <- glm(abund ~ mean_depth, data = pufferi)
abund_resid <- resid(mod_ABUND_on_DEPTH)

# overall residual effect
mod_resid_dens <- glm(abund_resid ~ dens_resid, data = pufferi)

23.3.1 Plotting all the residuals

ggplot(data.frame(dens_resid = dens_resid,
                  abund_resid = abund_resid),
       aes(x = dens_resid, y = abund_resid)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x = "residuals of `dens`",
    y = "residuals of `abund`",
    title = "partial association: `dens` on `abund`"
  ) +
  theme_bw()

ggplot(data.frame(depth_resid = depth_resid,
                  abund_resid = abund_resid),
       aes(x = depth_resid, y = abund_resid)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x = "residuals of `mean_depth`",
    y = "residuals of `abund`",
    title = "partial association: `mean_depth` on `abund`"
  ) +
  theme_bw()

Examine the summary of the full model (mod_base, i.e., abund ~ mean_depth + dens) to compare the coefficients for mean_depth and dens against the variation in each of these plots. You should immediately notice that the plot with less residual scatter (points more tightly clustered around the regression line) corresponds to the larger effect size.
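These residual-on-residual scatterplots are common enough that there is a packaged shortcut: the car package's avPlots() function produces added-variable plots (exactly these plots) directly from a fitted model. A minimal sketch, assuming the car package is installed:

```r
library(car)  # install.packages("car") if needed

mod_base <- glm(abund ~ mean_depth + dens, data = pufferi)
avPlots(mod_base)  # one added-variable plot per predictor
```

This saves the manual residualization steps when all you want is the plot, though working through the steps by hand (as above) is what makes the plots interpretable.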

23.4 Residualization: More than two model predictors

The above example clearly illustrates how residualization works for a model with two predictors. But what about a model with >2 predictors? We will simply add one of the other potential predictors (period) to our base model from above. Again, this is a rather nonsensical formulation:

mod_full <- glm(abund ~ mean_depth + dens + period, data = pufferi)
summary(mod_full)

Call:
glm(formula = abund ~ mean_depth + dens + period, data = pufferi)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.345e+01  2.821e+01   1.895   0.0601 .  
mean_depth  -9.281e-04  5.893e-03  -0.158   0.8751    
dens         3.764e+04  1.359e+03  27.702   <2e-16 ***
period      -9.265e+00  1.282e+01  -0.723   0.4710    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 5099.716)

    Null deviance: 7067483  on 145  degrees of freedom
Residual deviance:  724160  on 142  degrees of freedom
AIC: 1666.7

Number of Fisher Scoring iterations: 2

For more than two model predictors, the number of steps increases. But, arguably, the complexity does not. The logic is still the same:

A predictor’s coefficient equals the relationship between the parts of the outcome and that predictor that remain after removing the influence of all other variables (residuals!)

Below is how we would calculate, by residualization, the partial association coefficient between one predictor (mean_depth) and abund, controlling for dens and period. You can apply the same process to the other two predictors. Click on each callout box to expand.

# Step 1: residualize the focal predictor on the non-focal predictors
mod_DEPTH_on_OTHERS <- glm(mean_depth ~ dens + period, data = pufferi)
depth_resid <- resid(mod_DEPTH_on_OTHERS)

# Step 2: residualize the response on the non-focal predictors
mod_ABUND_on_OTHERS <- glm(abund ~ dens + period, data = pufferi)
abund_resid_depth <- resid(mod_ABUND_on_OTHERS)

For this step, we regress the residuals of abund (controlling for the non-focal predictors; Step 2) onto the residuals of mean_depth (controlling for the non-focal predictors; Step 1).

mod_partial_depth <- glm(abund_resid_depth ~ depth_resid)
summary(mod_partial_depth)

Call:
glm(formula = abund_resid_depth ~ depth_resid)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.046e-13  5.869e+00   0.000    1.000
depth_resid -9.281e-04  5.852e-03  -0.159    0.874

(Dispersion parameter for gaussian family taken to be 5028.886)

    Null deviance: 724286  on 145  degrees of freedom
Residual deviance: 724160  on 144  degrees of freedom
AIC: 1662.7

Number of Fisher Scoring iterations: 2

The slope equals the mean_depth coefficient in mod_full (-9.281e-04).
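Because the three steps are the same for any number of covariates, they can be wrapped in a small helper function (a sketch; the name partial_slope is mine, not from the tutorial):

```r
# Slope of `response ~ focal`, controlling for all predictors in `others`
partial_slope <- function(data, response, focal, others) {
  # Step 1 & 2: residualize the response and the focal predictor
  # on all other predictors
  r_y <- resid(glm(reformulate(others, response = response), data = data))
  r_x <- resid(glm(reformulate(others, response = focal), data = data))
  # Step 3: the residuals-on-residuals slope is the partial association
  unname(coef(glm(r_y ~ r_x))[2])
}

partial_slope(pufferi, "abund", "mean_depth", c("dens", "period"))
# matches the mean_depth coefficient in mod_full
```

This makes it easy to verify the equivalence for dens and period as well, by changing the focal argument.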

23.5 Summary

The goal of this tutorial was to demystify the mathematics of regression so that it feels like a transparent tool rather than a black box. We showed that a partial coefficient is simply the relationship between the leftover variation in a predictor and the leftover variation in the response after accounting for other variables. It really is that simple.