```mermaid
graph TD
C[Vegetation cover]:::node
X[Predator abundance]:::node
M[Nest concealment]:::node
Y[Nest survival]:::node
C --> X
C --> Y
X --> M
M --> Y
X --> Y
classDef node fill:#d9d9d9,color:#000,stroke-width:0px;
```
30 DAG Construction
30.1 Introduction
In this section, we explore the Directed Acyclic Graph (DAG), the first step of constructing an operational Structural Causal Model. Though it may seem like a simple drawing exercise, this step often becomes one of the most illuminating parts of the SCM process. If done thoroughly and carefully, building the DAG forces you to state explicitly assumptions that you did not realize were critical to your questions.
30.2 General architecture of a Directed Acyclic Graph (DAG)
A Directed Acyclic Graph represents all plausible causal relationships in the system. In essence, it encapsulates what we know about our study system. This DAG will form the basis for your entire set of questions about your system.
The most important aspect of Directed Acyclic Graphs is that they are composed of nodes (the boxes or circles) connected by edges (lines). In DAGs, edges have directionality: an arrow pointing from \(X\) to \(Y\) reflects our assumption that \(X\) directly causes \(Y\).
As an example, consider the DAG pictured above. It shows conceptual knowledge about what factors affect nest survival. Vegetation cover can influence both predator abundance and survival, while predators affect survival both directly and by influencing how well nests are concealed. In other words, part of the predator effect may work through how hidden the nests are (a behaviorally mediated effect).
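To make the node-and-edge structure concrete, here is an illustrative sketch in Python (the adjacency representation and the `parents` helper are my own, not part of any DAG library) encoding the nest-survival DAG and recovering each node's direct causes:

```python
# The nest-survival DAG as an adjacency mapping:
# each key is a node; each value is the set of nodes it points to.
dag = {
    "vegetation_cover": {"predator_abundance", "nest_survival"},
    "predator_abundance": {"nest_concealment", "nest_survival"},
    "nest_concealment": {"nest_survival"},
    "nest_survival": set(),
}

def parents(dag, node):
    """Direct causes of `node`: every variable with an edge into it."""
    return {v for v, children in dag.items() if node in children}

# All three upstream variables are direct causes of nest survival.
print(sorted(parents(dag, "nest_survival")))
```

Reading edges off a small data structure like this is exactly what DAG software (e.g., dagitty) does internally when it classifies variables as confounders, mediators, or colliders.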
Edges in the DAG should be motivated by:
- ecological theory (e.g. predator abundance → prey survival)
- empirical literature (e.g. temperature → metabolic rate)
- natural history observations (e.g. insects emerge after rainfall)
- experimental design (e.g. food supplementation → nestling growth)
Typical node categories include:
- environmental drivers (e.g. rainfall, temperature)
- organismal traits (e.g. body size, metabolic rate)
- behavioral responses (e.g. foraging rate, vigilance)
- demographic outcomes (e.g. survival, reproductive success)
- measurement processes (e.g. detection probability, observer effort)
It is helpful to have an idea of what kinds of roles each node can play. Here is a useful breakdown of causal roles for different types of nodes:
| Role | Key question | What it is | Include in model? | Example |
|---|---|---|---|---|
| Driver | What generates variation? | Broad upstream cause | Yes, when they are confounders of the focal relationship | Climate, traits |
| System setup | What does the system look like at the start? | Initial conditions or state | Usually; they define context and may confound relationships | Density, habitat, distance |
| Mechanism | What processes modify the system? | Specific pathways (mediators) | Yes, only when estimating direct effects | Predation, dispersal, mutualists |
| Outcome | What are we trying to explain? | Response variable | No, this is the variable being explained | Recruitment, survival |
| Measurement | How is it observed? | Observation/detection process | Sometimes include, when correcting for detection or measurement bias | Detection probability, counts |
| Selection / design | Why do we observe these data? | Sampling process | Sometimes, because conditioning can induce bias | Plot placement, effort |
Any good DAG should represent causal assumptions, not statistical correlations. If there is good ecological justification for a causal link between two variables, you should include it in the model (even if a relationship between two variables has never been tested). Unlike non-causal approaches, a useful DAG emphasizes conceptual completeness over statistical parsimony. In other words, it is better to include more knowledge in a DAG than try to simplify it. So, how do we efficiently construct a DAG that has strong causal assumptions?
30.3 Constructing a DAG for your study system
The goal of this section is to construct a single global Directed Acyclic Graph (DAG) that represents the causal structure of your study system. I use the term “global” by analogy to multimodel inference (i.e. use of AIC to select the best-supported model) where a global model includes all plausible predictors. In this causal case, a global DAG includes all plausible causal relationships in the system and acknowledges that there can be many outcome/response variables and that variables can play different roles (colliders, mediators, confounders) along multiple causal pathways. A global DAG:
- includes all relevant variables (measured and unmeasured)
- represents all plausible causal relationships
- integrates theory, empirical knowledge, and natural history
- is designed to support many possible causal questions
In other words, the goal of DAG construction is not to build a single model; its goal is to thoroughly represent your study system. Once constructed, a global DAG allows you to:
- logically and transparently ask multiple exposure–outcome questions without changing your assumptions about how the world works (something kitchen-sink models and AIC-based selection fundamentally do)
- identify confounders, mediators, and colliders
- determine valid adjustment sets (i.e. what covariates should be included in a statistical model)
- convert causal assumptions into statistical models that test predictions
You could easily imagine an ever-expanding (and reductionist) DAG. Certainly, global DAGs can be very large, but for purposes of causal identification and modeling in a focal study, a very large DAG is not necessary. As you gain experience with this approach, you will discover that covariates more than two edges away rarely play an important role in answering your set of questions. Therefore, you can usually restrict the DAG to variables within one or two edges of each exposure (the causal term for a focal predictor whose effect you want to estimate) and outcome.
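This pruning rule can be sketched mechanically. Below is an illustrative Python fragment (the `within_k_edges` helper and the extra upstream nodes such as `geology` are hypothetical, invented to show the idea) that keeps only variables reaching the outcome within two directed edges:

```python
# Toy DAG: geology -> soil_type -> vegetation_cover -> ... -> nest_survival.
dag = {
    "geology": {"soil_type"},
    "soil_type": {"vegetation_cover"},
    "vegetation_cover": {"predator_abundance", "nest_survival"},
    "predator_abundance": {"nest_survival"},
    "nest_survival": set(),
}

def within_k_edges(dag, target, k=2):
    """Variables that reach `target` along directed paths of length <= k."""
    frontier, reached = {target}, set()
    for _ in range(k):
        # Parents of the current frontier are one edge farther upstream.
        frontier = {v for v, children in dag.items() if children & frontier}
        reached |= frontier - {target}
    return reached

# soil_type sits two edges from nest survival and is kept;
# geology sits three edges away and can usually be dropped.
kept = within_k_edges(dag, "nest_survival", k=2)
print(sorted(kept))
```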
30.4 A practical workflow for building a DAG
Building a DAG for a study system typically involves the following steps:
- Step 1: Start with the study system, not a statistical model
- Step 2: List your core exposure–outcome questions (based on your hypotheses/predictions)
- Step 3: Draw simple causal links (e.g., \(X \rightarrow Y\))
- Step 4: Add confounders (common causes of exposure and outcome)
- Step 5: Add mediators (mechanisms linking exposure to outcome)
- Step 6: Merge everything into a single global DAG
- Step 7: Include important latent (unmeasured) variables when necessary
- Step 8: Refine the DAG based on theory, literature, and natural history
- Step 9: Add effect modification (interactions) to your structural equations
- Step 10: Use the DAG to identify what can be asked and estimated
Let us detail these 10 steps.
30.4.1 Step 1: Start with the system, not a statistical model
This is the hardest step of all. Begin by thinking about the ecological system you are studying. Do not think about the regression or other model that you plan to fit. Do not think about whether you should log-transform a variable. At this point, you should focus on three questions only:
- What are the major processes in this system? (e.g. predator-prey, pollination, intrapair communication)
- What variables are likely to influence one another? (e.g. temperature and metabolic rate)
- What outcomes do I care about explaining? (e.g. any response variable: fitness, probability of recruitment)
Again, at this first stage, your goal is to ignore anything related to statistical models. Do not worry about coefficients, significance, or functional form of a model.
30.4.2 Step 2: List your core exposure–outcome questions
Next, your task is to write down the main causal questions you want the DAG to support. For example, let us consider the direct causal links I thought about before constructing the DAG above:
- What is the effect of \(vegetation\_cover\) on \(nesting\_survival\)?
- What is the effect of \(vegetation\_cover\) on \(predator\_abundance\)?
- What is the effect of \(predator\_abundance\) on \(nest\_concealment\)?
- What is the effect of \(predator\_abundance\) on \(nesting\_survival\)?
- What is the effect of \(nest\_concealment\) on \(nesting\_survival\)?
These questions help identify the main parts of the system that must appear in the DAG.
30.4.3 Step 3: Start with one simple causal link
Your next step is to make an explicit causal claim about each of the exposure-outcome questions listed in Step 2. For example, if your question is:
What is the effect of \(vegetation\_cover\) (the exposure) on \(predator\_abundance\) (the outcome)?
you would draw the simplest causal chain possible, which is:
```mermaid
flowchart LR
vegetation_cover[Vegetation cover]:::exp --> predator_abundance[Predator abundance]:::out
classDef exp fill:#bcd4f6,color:#000,stroke-width:0px;
classDef out fill:#bfe3c0,color:#000,stroke-width:0px;
```
Even though this looks simple and almost trivial, be certain that each exposure-outcome link is well justified. Every arrow you draw should be scientifically defensible based on some combination of theory, data, or observation.
30.4.4 Step 4: Add common causes (confounders)
Next, ask whether there are any variables that could plausibly influence both the exposure and the outcome. Recall that these are called confounders, and they are often among the most important variables to identify. For each candidate confounder, ask these two questions:
- Is there evidence (literature, pilot data, or natural history) that it affects \(X\) (the exposure/treatment)?
- Is there evidence that it affects \(Y\) (the response/outcome)?
Only when both are justified should it be included as a confounder. Add this confounder variable (C) and the appropriate edges to your DAG:
```mermaid
flowchart LR
C["C (confounder)"]:::node --> X["X (exposure)"]:::node
C --> Y["Y (outcome)"]:::node
X --> Y
classDef node fill:#9fbfe8,color:#000,stroke-width:0px;
```
At this stage, it is fine to include as many confounders as you want, but each edge (and its directionality) should still be grounded in solid scientific evidence.
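One way to screen for candidate confounders mechanically is to look for shared ancestors of the exposure and the outcome. The sketch below (Python; the helper names are my own, and real tools such as the dagitty R package do this far more rigorously) illustrates the idea on the confounder triangle above:

```python
# The confounder triangle: C -> X, C -> Y, X -> Y.
dag = {"C": {"X", "Y"}, "X": {"Y"}, "Y": set()}

def ancestors(dag, node):
    """All variables with a directed path into `node`."""
    direct = {v for v, children in dag.items() if node in children}
    result = set(direct)
    for p in direct:
        result |= ancestors(dag, p)
    return result

def candidate_confounders(dag, x, y):
    """Common causes of x and y (excluding x and y themselves)."""
    return (ancestors(dag, x) & ancestors(dag, y)) - {x, y}

# C causes both the exposure X and the outcome Y.
print(candidate_confounders(dag, "X", "Y"))
```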
30.4.5 Step 5: Add mechanisms (mediators)
This is where the utility of DAGs really shines. The next step is to ask whether the effect of \(X\) on \(Y\) operates through intermediate variables. These variables are called mediators and represent ecological mechanisms. Note that, for the same causal question (“does \(X\) affect \(Y\)?”), a confounder cannot represent an ecological mechanism because it is upstream of the exposure \(X\). The question that you are asking is this:
- Through what variables does \(X\) affect \(Y\)?
For each candidate mediator, \(M\), ask:
- Is there evidence that \(X\) influences \(M\)?
- Is there evidence that \(M\) influences \(Y\)?
If evidence supports both of these edges, then you can add the mediator (M) to the DAG:
```mermaid
flowchart LR
C["C (confounder)"]:::node --> X["X (exposure)"]:::node
C --> Y["Y (outcome)"]:::node
X --> M["M (mediator)"]:::node
M --> Y
X --> Y
classDef node fill:#9fbfe8,color:#000,stroke-width:0px;
```
Remember that mediators should represent real ecological/biological pathways and not just additional statistical covariates. If you ignore that, you will be in the kitchen sink again.
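To see direct and mediated pathways side by side, a small sketch (Python; the `directed_paths` helper is hypothetical) can enumerate every directed path from the exposure to the outcome; any interior node on a path is a mediator for that pathway:

```python
# Confounder C plus mediator M: C -> X, C -> Y, X -> M, M -> Y, X -> Y.
dag = {"C": {"X", "Y"}, "X": {"M", "Y"}, "M": {"Y"}, "Y": set()}

def directed_paths(dag, start, end, path=None):
    """All directed paths from `start` to `end`."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for child in sorted(dag.get(start, ())):
        paths += directed_paths(dag, child, end, path)
    return paths

# Two causal pathways: the mediated effect X -> M -> Y
# and the direct effect X -> Y.
for p in directed_paths(dag, "X", "Y"):
    print(" -> ".join(p))
```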
30.4.6 Step 6: Merge overlapping pieces into one global DAG
While going through Steps 1-5, you will quickly notice that the same variables and pathways appear repeatedly. This is a sign that you are capturing shared structure in the system rather than constructing isolated hypotheses. (You can, of course, streamline this process in the future by starting with upstream, or exogenous, variables and working your way toward the outcome variable.) Your goal now is to integrate these overlapping pieces into a single global DAG. Remember:
You should build one DAG for your study system. Do not construct one DAG per question!
When constructing your merged DAG, focus on the following:
- retaining all variables that appear across multiple questions
- preserving pathways that are supported by evidence
- avoiding duplication of nodes that represent the same process
In the end, your draft global DAG should include:
- all major variables in your system
- all major causal pathways
- all shared confounders and mediators
- any latent variables, where necessary (see Step 7)
Importantly, each edge in this integrated, global DAG should still be defensible based on theory, literature, pilot data, or natural history. Do not worry if you forget something at this step; Step 8 is essentially a quality-control check. Just keep moving forward through the process.
30.4.7 Step 7: Represent underlying processes (latent variables and scale)
At this stage, your goal is to ask whether your DAG represents your best knowledge of the ecological process under study, or if it only represents the variables you chose to measure out of convenience. Hopefully, these lines of questioning occurred during the project planning phase, where you could pre-emptively address two common issues that arise in ecological studies:
- latent (unmeasured) variables
- scale and measurement
Of course, sometimes it is nearly impossible to anticipate all relevant ecological processes when you have not yet learned about a study system. Let us delve into both of these issues:
Latent (unmeasured) variables
What if we know that other factors matter—from prior literature, natural history knowledge, or previous studies—but we did not measure them in our own study? Sometimes a variable is not measured because it is logistically difficult to quantify, because there was insufficient funding, or because the variable’s importance became clear only after the study was designed. This situation is extremely common in ecology.
Even when such variables are unmeasured, they can still play vital roles in the causal system (and your DAG representation of it). Excluding a variable from your dataset does not remove it from the causal system. As such, it should be represented in the DAG as an unmeasured, or latent, variable (from the Latin latere, "to lie hidden"). A latent variable is simply a variable that:
- exists in the causal system
- influences other variables
- was not measured in your dataset
Examples can really include any variable, as long as it was not directly quantified in your study.
A very common mistake is to build the DAG from only the variables measured in your dataset. This implicitly assumes that unmeasured variables do not matter. Entire books could be written on that assumption.
In reality, a myriad of important processes (e.g., habitat quality, predation risk) routinely go unmeasured. While unmeasured, they can still act as causal confounders. Structural Causal Models address this by allowing latent variables in the DAG.
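A quick simulation shows why ignoring a latent confounder is dangerous. In this illustrative Python sketch (all variables and coefficients are invented), `u` is an unmeasured common cause; `x` has no causal effect on `y`, yet the two are strongly correlated:

```python
import random

random.seed(42)
n = 10_000
u = [random.gauss(0, 1) for _ in range(n)]   # latent: e.g. habitat quality
x = [ui + random.gauss(0, 1) for ui in u]    # e.g. canopy cover, driven by u
y = [ui + random.gauss(0, 1) for ui in u]    # e.g. nesting success, driven by u

def corr(a, b):
    """Pearson correlation, computed from scratch."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# x never appears in the equation for y, yet corr(x, y) lands near 0.5,
# purely because both share the latent cause u.
print(round(corr(x, y), 2))
```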
Scale
We know that different ecological processes are scale-dependent, so your "best knowledge" of your study system must also include information about the temporal and spatial scale at which you observed the system. That is a tall order: ideally, process scale and measurement scale align, but in practice they often do not. The goal is to represent the data-generating structure as faithfully as possible and to be transparent about your decisions.
So what does this mean? This means that your “best knowledge” must include not only the ecological processes in your study system, but it must also include:
- the spatial and temporal scale at which you think those processes actually operate
- the spatial and temporal scale at which your variables were measured
So, how do we represent scale in a DAG? Recall that, in a causal graph, nodes represent variables (measured or unmeasured, i.e. latent, quantities). Scale is not itself a measurement that can cause something else. Because every variable is observed, inferred, or conceptualized at a particular spatial and temporal scale, scale should be embedded in the definition and measurement of each node. Scale itself (e.g. a categorical variable for "large scale" or "small scale") should not appear as its own node in a DAG.
Practically, the integration of different scales involves using multiple nodes rather than one vague node. There are two primary scenarios you will likely need to deal with when it comes to defining scale in your DAG.
The first is when the same variable (e.g. precipitation) is measured at multiple scales (e.g., total daily precipitation versus total monthly precipitation). In this case, you should include them as two separate nodes with a defined relationship. What would that relationship be? First, think about how the variables measured at each scale are connected to each other. In this case, monthly precipitation is an aggregated quantity derived from daily precipitation measurements. Each is assumed to have direct effects on nesting success. Furthermore, daily precipitation can also affect nesting success via an indirect pathway through monthly precipitation. Therefore:
```mermaid
flowchart LR
D[Daily precipitation]:::node --> M[Monthly precipitation]:::node
D --> Y[Nesting success]:::node
M --> Y
classDef node fill:#9fbfe8,color:#000,stroke-width:0px;
```
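The edge from the daily node to the monthly node is purely an aggregation relationship. In code (an illustrative Python fragment with made-up rainfall values):

```python
# Hypothetical daily rainfall (mm) for part of a month.
daily_mm = [0.0, 2.1, 0.0, 5.4, 1.2, 0.0, 3.3]

# The monthly node is deterministically derived from the daily node,
# which is why the DAG carries an edge D -> M rather than M -> D.
monthly_mm = sum(daily_mm)
print(round(monthly_mm, 1))
```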
The second scenario is when the quantity that you measure is not the process itself. Rather, it is a local proxy for a broader-scale, unmeasured cause. Consider a system where \(territory\_quality\) is an unmeasured (latent) variable operating at a broader scale, while \(canopy\_cover\) is a directly observed, local measurement. Here is the DAG:
```mermaid
flowchart LR
U[Territory quality]:::node --> X[Canopy cover]:::node
U --> Y[Nesting success]:::node
X --> Y
classDef node fill:#9fbfe8,color:#000,stroke-width:0px;
```
The variable canopy cover is very localized (measured within five meters of the nest) and is driven by territory quality. If both are causally relevant, they should appear in the DAG as two separate nodes.
In practice, we often use what are called proxy or indicator measurements to describe some larger-level conceptual metric. In this example, we are measuring canopy cover (at multiple subplots, etc.) and perhaps combining this with other variables like humidity or insect abundance to estimate territory quality. This can make it tempting to reverse the edge direction in the DAG.
However, we now know that a DAG represents the underlying causal structure of our study system; it should not represent measurement order. In our example, territory quality reflects underlying unmeasured (latent) processes such as soil, disturbance, topography, or other territory-specific variables that together generate measured variables like canopy cover.
So, remember: Measurement order is not causal order.
Again, in this DAG, scale is not a separate node. Instead, it is embedded in how variables are defined:
- Territory quality: broader-scale latent condition
- Canopy cover: local measured proxy
- Nesting success: focal outcome
For both of the above scenarios, we represent scale by using separate nodes defined at different scales.
30.4.8 Step 8: Refine the DAG
Now you have a very solid draft DAG. Your goal in this next step is to take a big step back and carefully review and refine. This serves as a quality-control check on the causal assumptions that we have made in the previous steps. Once again, you ask some familiar questions:
- Is every edge (arrow) clearly justified (via paper, pilot data, or observation)?
- Are there any important pathways missing based on a priori knowledge?
- Should some variables be represented as unmeasured (latent) rather than observed?
- Does this DAG represent the system as it actually operates (compared to a preferred hypothesis)?
Though this step is laborious and seemingly a bit tedious, this formal step of careful and transparent refinement is where the DAG becomes more than a back-of-the-napkin sketch; it becomes a scientifically defensible representation of your study system. (Keep your napkin sketch, though, as you might need it again one day.) Below is an example of a refined DAG created by a graduate student studying a community of frugivorous jackalope species (indexed by \(i\)) and the fruits on which they forage (indexed by \(j\)).
The colors in this diagram reflect different roles that variables play in the system:
- Dark green (outcome variable): the focal response we are trying to explain (e.g., frugivory).
- Blue (drivers / modulators): broad ecological variables that influence the magnitude or likelihood of processes in the system.
- Light green (opportunity constraints): processes that must occur for the interaction to be possible (e.g., spatiotemporal overlap, encounter).
- Orange (feasibility constraints): biological or physical limits that determine whether an interaction can proceed once an opportunity exists (e.g., trait matching, detectability).
Importantly, these colors are used for interpretation and visualization only. The causal structure is defined by the arrows, not by the color of the nodes.
```mermaid
%%{init: {'themeVariables': { 'fontSize': '24px' }}}%%
flowchart LR
WeatherMicroclimate["Weather / Microclimate"]:::driver
subgraph Plant_System["Plant System"]
direction TB
VegetationStructure["Vegetation Structure"]:::driver
FruitAbundance["Fruit Abundance j"]:::driver
PeakMaturity["Peak Maturity j"]:::driver
NutritionalContent["Nutritional Content j"]:::driver
TraitMatching["Trait Matching"]:::feas
StructuralConstraints["Structural Constraints"]:::feas
end
SpatiotemporalOverlap["Spatiotemporal Overlap ij"]:::opp
Encounter["Encounter ij"]:::opp
subgraph Frugivore_System["Frugivore System"]
direction TB
FrugivoreAbundance["Frugivore Abundance i"]:::driver
BreedingCycle["Breeding Cycle i"]:::driver
ForagingStrategy["Foraging Strategy i"]:::driver
CompetitorAbundance["Competing Frugivore Abundance"]:::driver
DetectabilityLimits["Detectability Limits"]:::feas
end
Frugivory["Frugivory ij"]:::out
%% Weather / microclimate effects
WeatherMicroclimate --> VegetationStructure
WeatherMicroclimate --> FruitAbundance
WeatherMicroclimate --> PeakMaturity
WeatherMicroclimate --> FrugivoreAbundance
WeatherMicroclimate --> BreedingCycle
%% Plant-side structure and phenology
VegetationStructure --> FruitAbundance
VegetationStructure --> ForagingStrategy
VegetationStructure --> SpatiotemporalOverlap
PeakMaturity --> FruitAbundance
PeakMaturity --> NutritionalContent
PeakMaturity --> SpatiotemporalOverlap
%% Frugivore-side drivers of overlap
FrugivoreAbundance --> SpatiotemporalOverlap
BreedingCycle --> SpatiotemporalOverlap
ForagingStrategy --> SpatiotemporalOverlap
CompetitorAbundance --> FrugivoreAbundance
CompetitorAbundance --> SpatiotemporalOverlap
%% Opportunity chain
SpatiotemporalOverlap --> Encounter
Encounter --> Frugivory
%% Modulators of frugivory
FruitAbundance --> Frugivory
NutritionalContent --> Frugivory
FrugivoreAbundance --> Frugivory
ForagingStrategy --> Frugivory
CompetitorAbundance --> Frugivory
%% Feasibility constraints
DetectabilityLimits --> Encounter
TraitMatching --> Frugivory
StructuralConstraints --> Frugivory
%% Styling
classDef out fill:#2e7d32,color:#fff,stroke-width:0px;
classDef driver fill:#9fbfe8,color:#000,stroke-width:0px;
classDef opp fill:#cfe8cf,color:#000,stroke-width:0px;
classDef feas fill:#f6b26b,color:#000,stroke-width:0px;
style Plant_System fill:#f7f7f7,stroke:#999,stroke-width:1px
style Frugivore_System fill:#f7f7f7,stroke:#999,stroke-width:1px
```
At this stage, a large DAG represents a defensible hypothesis about what causes what in your study system. The DAG now represents a huge amount of knowledge packed into a relatively small graph. The next step is to consider how those causes combine, including whether the effect of one variable depends on the level of another.
30.4.9 Step 9: Add effect modification (interactions)
The last effect type in our causal framework is the effect modifier. An effect modifier is a variable \(Z\) that makes the effect of \(X\) on \(Y\) depend on the level of \(Z\). In practice, this is what we mean by an interaction (a term familiar to you from GLMs and GLMMs). Conceptually, it really is that simple. For example, predator abundance may reduce nest survival, but the strength of that effect may depend on vegetation structure. In a DAG, this is represented simply as both \(predator\_abundance\) and \(vegetation\_structure\) causing \(nest\_survival\):
```mermaid
flowchart LR
X[Predator abundance]:::exp_dark --> Y[Nest survival]:::out
Z[Vegetation structure]:::exp_light --> Y
classDef exp_dark fill:#9fbfe8,color:#000,stroke-width:0px;
classDef exp_light fill:#d6e6fb,color:#000,stroke-width:0px;
classDef out fill:#bfe3c0,color:#000,stroke-width:0px;
```
What does this show? Both predator abundance and vegetation structure affect nest survival. Effect modification arises when the effect of predator abundance on survival depends on vegetation structure (an interaction). The DAG itself does not change. Effect modification does not introduce new causal relationships—it only changes how existing causes combine in the model for \(Y\).
This raises an important question: where does the interaction appear, if not in the DAG? Remember that a DAG shows which variables are causes of \(Y\). The DAG does not specify how those causes combine, whether additively or interactively. Instead, the interaction appears in the structural equation for the model:
\[ Y \sim X + Z + X \times Z \]
Here, the \(X \times Z\) term (i.e. \(predator\_abundance \times vegetation\_structure\)) means that the effect of \(predator\_abundance\) on \(nest\_survival\) changes depending on \(vegetation\_structure\).
So, after building your DAG, carefully document what variables will go into your structural models, including interactions and random effects.
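To make the distinction concrete, here is a sketch of a structural equation with effect modification (Python; every coefficient is invented for illustration). The DAG edges are unchanged; what changes is that the slope of survival with respect to predators depends on vegetation:

```python
def expected_survival(pred, veg):
    """Hypothetical structural equation for nest survival (linear scale).
    The predator slope is (-0.8 + 0.5 * veg): effect modification."""
    return 2.0 - 0.8 * pred + 0.3 * veg + 0.5 * pred * veg

def predator_slope(veg):
    """Change in survival per unit of predator abundance, at a given veg."""
    return expected_survival(1.0, veg) - expected_survival(0.0, veg)

# In open habitat (veg = 0) each unit of predator abundance costs 0.8
# units of survival; in dense vegetation (veg = 1) the cost shrinks to 0.3.
print(predator_slope(0.0), predator_slope(1.0))
```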
30.4.10 Step 10: Use the DAG to see what questions are possible
This is one of the most fascinating and fun parts of DAG construction. A well-constructed global DAG does more than simply represent the causal assumptions underlying your original hypotheses and questions. A good DAG allows you to systematically evaluate which questions can be supported by the causal structure you have. Because you have painstakingly worked to justify every edge in the previous steps, you can now narrow your focus to a number of probing, yet flexible, questions such as:
- Which exposure–outcome relationships are supported by plausible pathways?
- Which variables play multiple roles across different questions?
- Which effects are identifiable or unidentifiable given the causal structure?
- What new questions emerge from the causal structure?
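For example, whether an effect is identifiable depends on whether its backdoor paths can be blocked. Here is a minimal sketch (Python; it enumerates candidate backdoor paths only and does not check colliders or full d-separation, for which a dedicated tool such as the dagitty R package is better suited):

```python
# The confounder triangle again: C -> X, C -> Y, X -> Y.
dag = {"C": {"X", "Y"}, "X": {"Y"}, "Y": set()}

def backdoor_paths(dag, x, y):
    """Acyclic paths from x to y that start with an edge INTO x.
    These are the paths that can carry confounding bias."""
    edges = [(a, b) for a, children in dag.items() for b in children]
    und = {}  # undirected adjacency, for walking paths in either direction
    for a, b in edges:
        und.setdefault(a, set()).add(b)
        und.setdefault(b, set()).add(a)
    found = []
    def walk(node, path):
        if node == y:
            found.append(path)
            return
        for nxt in sorted(und.get(node, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])
    # The first step must be along an edge pointing into x.
    for parent in sorted(a for a, b in edges if b == x):
        walk(parent, [x, parent])
    return found

# One backdoor path, X <- C -> Y, so adjusting for C identifies X -> Y.
print(backdoor_paths(dag, "X", "Y"))
```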
Remember that you are formalizing the causal assumptions of your study system. This can be done at any time (before, during, or after a study), though it is ideal to do so before you begin your study. But, as we all know from experience, our knowledge of the processes involved in our system grows as we study it (hopefully). We conduct more experiments, we make more observations, we read new and old published papers, we speak with colleagues. Therefore, we should absolutely expect the DAG to evolve over time. If we keep a DAG updated with new information, we can identify where past studies have fallen short, what present studies are constrained to estimate, and what future studies should prioritize.
30.5 Up next!
So, remember that a DAG is just one part of a Structural Causal Model. The other critical component is the set of actual structural equations. On the next page, we will focus on a single exposure–outcome pair, define the estimand, determine the correct adjustment set, and translate the DAG into a statistical model (using tools familiar to you).