Directed Acyclic Graphs (DAGs) for Causal Inference 2026: A Researcher’s Guide
Every quantitative researcher encounters the same uncomfortable question at the analysis stage: which variables should be included as covariates in the model? The answer cannot be found by running correlations, checking p-values, or applying stepwise selection algorithms. It requires causal reasoning, and directed acyclic graphs (DAGs) provide the formal language for doing that reasoning rigorously. Developed in their modern form by Judea Pearl and brought to epidemiology by Greenland, Pearl, and Robins in their landmark 1999 paper, DAGs have become a standard tool for transparent covariate-selection decisions in observational research across medicine, economics, psychology, and the social sciences.
This guide explains every core component of directed acyclic graphs for causal inference — nodes, edges, paths, confounders, colliders, mediators, and the backdoor criterion — and shows how to translate a DAG into a defensible adjustment strategy. Worked examples are drawn from recognisable research scenarios, and the guide closes with a practical walkthrough of DAGitty, the free browser-based tool that operationalises these concepts for everyday use.
Foundations: Nodes, Edges, and Paths
A DAG consists of three elements:
- Nodes — each node represents a variable (observed or unobserved). In a study on the effect of physical activity on cardiovascular disease, nodes might include physical activity (the exposure, X), cardiovascular disease (the outcome, Y), body mass index, age, and socioeconomic status.
- Directed edges — an arrow from node A to node B encodes the researcher’s assumption that A is a direct cause of B (holding all other variables constant). Crucially, an absent arrow is also an assumption: it encodes the belief that A has no direct causal effect on B.
- Acyclicity — the graph contains no directed cycles. A variable cannot be its own cause through any chain of arrows. This rules out contemporaneous feedback loops, which instead require dynamic models (such as structural equation models with lagged variables).
A path is any sequence of edges connecting two nodes, regardless of the direction of those edges. Paths are the channels through which associations, both causal and spurious, flow between variables. Working out which paths are open (transmitting association) and which are blocked (transmitting nothing) is the central task of DAG analysis.
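The three elements above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated tool: the edge list is a hypothetical DAG for the physical-activity example (the arrows are assumptions made for illustration only), and the function names `is_acyclic` and `all_paths` are invented here.

```python
from collections import defaultdict

# Hypothetical arrows for the physical-activity example (illustrative assumptions).
EDGES = [
    ("Age", "Activity"), ("Age", "CVD"),
    ("SES", "Activity"), ("SES", "CVD"),
    ("Activity", "BMI"), ("BMI", "CVD"),
    ("Activity", "CVD"),
]

def is_acyclic(edges):
    """Return True if the directed graph has no directed cycle (Kahn's algorithm)."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes that sit on a directed cycle can never reach indegree zero.
    return removed == len(nodes)

def all_paths(edges, start, end):
    """List every path from start to end, ignoring arrow direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in path:  # a path never revisits a node
                walk(path + [nxt])

    walk([start])
    return paths

print(is_acyclic(EDGES))
for p in all_paths(EDGES, "Activity", "CVD"):
    print(" - ".join(p))
```

Listing paths without regard to arrow direction mirrors the definition in the text: both causal and non-causal routes between exposure and outcome appear in the output, and classifying them is a separate step.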
d-Separation: The Blocking Rule
Two variables are d-separated (directionally separated) given a set of conditioned variables if every path between them is blocked. Three rules determine whether a path is blocked or open:
- Rule 1: The path contains a non-collider B (the middle node of a chain A → B → C or of a fork A ← B → C) and B is in the conditioning set. Conditioning on B blocks the flow.
- Rule 2: The path contains a collider B (A → B ← C) and neither B nor any descendant of B is in the conditioning set. The collider naturally blocks the path.
- Rule 3: The path contains a collider B, and B (or one of its descendants) is in the conditioning set. Conditioning on B opens the path at that node.
Rule 3 is the source of collider bias, discussed in detail below. The concept of d-separation, developed by Pearl and set out in Causality: Models, Reasoning, and Inference (2nd ed., Cambridge University Press, 2009), provides the mathematical foundation for reading conditional independencies off a DAG without running any regressions.
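The blocking rules can be written as a short path check. The sketch below is an illustration under simple assumptions: a path is supplied as an ordered list of node names, the graph as a list of (parent, child) pairs, and the helper names `descendants` and `is_blocked` are invented for this example.

```python
def descendants(edges, node):
    """Return every node reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def is_blocked(edges, path, conditioned):
    """Apply the d-separation rules to one path; True means no association flows."""
    edge_set = set(edges)
    for left, mid, right in zip(path, path[1:], path[2:]):
        is_collider = (left, mid) in edge_set and (right, mid) in edge_set
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if not (({mid} | descendants(edges, mid)) & conditioned):
                return True
        elif mid in conditioned:
            # A conditioned chain or fork node blocks the path.
            return True
    return False

chain = [("A", "B"), ("B", "C")]
fork = [("B", "A"), ("B", "C")]
collider = [("A", "B"), ("C", "B"), ("B", "D")]  # D is a descendant of the collider B
path = ["A", "B", "C"]

print(is_blocked(chain, path, set()))      # False: an unconditioned chain is open
print(is_blocked(fork, path, {"B"}))       # True: conditioning on the fork node blocks it
print(is_blocked(collider, path, set()))   # True: a collider blocks by default
print(is_blocked(collider, path, {"B"}))   # False: conditioning on the collider opens it
print(is_blocked(collider, path, {"D"}))   # False: so does conditioning on its descendant
```

Two variables are d-separated by a conditioning set when `is_blocked` returns True for every path between them.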
[Video: An introductory lecture on causal inference and DAGs. Source: Ohio State University Libraries on YouTube.]
Confounders: The Classic Source of Bias
A confounder is a variable that is a common cause of both the exposure and the outcome. In DAG terms, confounding arises when there is an open backdoor path from exposure X to outcome Y — a path that begins with an arrow into X.
Consider a study on whether coffee consumption (X) causes coronary heart disease (Y). Smoking is a common cause of both: smokers drink more coffee and are at higher risk of heart disease. The path Coffee ← Smoking → Heart Disease is a backdoor path. If smoking is not controlled, the estimated coffee effect is a mixture of the true causal effect and the spurious association transmitted through the smoking pathway.
DAG: Smoking → Coffee (X); Smoking → Heart Disease (Y)
The path X ← Smoking → Y is a backdoor path. It is open because Smoking is a common cause (a fork node) and is not conditioned upon.
The standard definition of confounding in classical epidemiology relied on three criteria (association with exposure, association with outcome, not on the causal path). DAGs expose the limitations of this checklist: a variable can meet all three criteria and still not be a confounder, or fail one criterion and still need to be controlled. The graphical approach, formalised by Greenland, Pearl, and Robins (1999), replaces the checklist with a precise structural definition rooted in d-separation.
An unobserved confounder, represented as a latent node U, cannot be adjusted for. This is a key reason why randomised controlled trials remain the gold standard: randomisation removes the arrows from all confounders (observed or not) into the exposure, so no backdoor path remains open and only the causal path X → Y connects exposure and outcome.

Colliders: The Misunderstood Bias Amplifier
A collider on a path is a node where two arrowheads meet: A → C ← B. When a collider exists on a path, that path is naturally blocked — it transmits no association. This seems like good news, but the situation reverses the moment a researcher conditions on the collider (or any of its descendants).
Conditioning on a collider opens the blocked path and induces a spurious association between its two parent variables — even if those variables were previously independent. This phenomenon is called collider-stratification bias or selection bias (when the collider is a selection variable). Greenland’s 2003 paper in Epidemiology provided one of the first formal quantifications of this bias in applied settings.
The Berkson’s Bias Example
Berkson’s bias is a textbook instance of collider bias. Suppose a researcher studies the association between diabetes and cholecystitis using a hospital sample. Both conditions independently increase the probability of hospitalisation. Hospitalisation is therefore a collider: Diabetes → Hospitalisation ← Cholecystitis. By restricting the sample to hospitalised patients (conditioning on Hospitalisation), the researcher opens the path and induces a spurious negative association between diabetes and cholecystitis — even if no true biological relationship exists in the general population.
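A small simulation makes the mechanism concrete. Everything numeric below is an assumption chosen for illustration (10% prevalence for each condition, a 5% baseline hospitalisation probability, and 40 percentage points added by each condition); none of it comes from real data. The two conditions are generated independently, so any association in the hospital sample is induced purely by selection.

```python
import random

def simulate_berkson(n=200_000, seed=1):
    """Generate (diabetes, cholecystitis, hospitalised) rows with independent diseases."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        diabetes = rng.random() < 0.10
        cholecystitis = rng.random() < 0.10  # drawn independently of diabetes
        # Both conditions raise the chance of hospitalisation (the collider).
        p_hospital = 0.05 + 0.40 * diabetes + 0.40 * cholecystitis
        hospitalised = rng.random() < p_hospital
        rows.append((diabetes, cholecystitis, hospitalised))
    return rows

def risk_difference(rows):
    """P(cholecystitis | diabetes) minus P(cholecystitis | no diabetes)."""
    with_diabetes = [c for d, c, _ in rows if d]
    without_diabetes = [c for d, c, _ in rows if not d]
    return (sum(with_diabetes) / len(with_diabetes)
            - sum(without_diabetes) / len(without_diabetes))

rows = simulate_berkson()
population_rd = risk_difference(rows)
hospital_rd = risk_difference([r for r in rows if r[2]])  # condition on the collider

print(f"whole population:  {population_rd:+.3f}")
print(f"hospitalised only: {hospital_rd:+.3f}")
```

Under these assumed probabilities the risk difference in the whole population should be close to zero, while the hospitalised subsample shows a clearly negative one, which is the pattern described above.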
The implications for covariate selection are profound: adding more covariates to a model is not always safer. Conditioning on a collider or a descendant of a collider introduces bias that would not otherwise be present. This directly contradicts the intuition that “controlling for more variables” necessarily reduces confounding.
For a deeper treatment of how conditioning on colliders interacts with missing data patterns, see the guide on handling missing data in dissertations, which covers how selection into the observed sample can itself create collider structures.
Mediators: What Not to Adjust For
A mediator (or intermediate variable) sits on the causal pathway between the exposure and the outcome: X → M → Y. The mediator is caused by the exposure and in turn causes the outcome. It carries part — or all — of the causal effect of X on Y.
When the research goal is to estimate the total causal effect of X on Y, conditioning on a mediator is almost always a mistake. Doing so blocks the causal pathway through M, so the estimate no longer includes the part of the effect transmitted via M and is typically attenuated. What remains is, at best, a controlled direct effect rather than a total effect, and the researcher may not be aware of the distinction.
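A worked equation shows the size of the distortion in the simplest case. The sketch assumes a linear model with no exposure-mediator interaction, no unmeasured common causes of M and Y, and independent error terms; the coefficients a, b, and c and the numerical values are hypothetical.

```latex
\begin{aligned}
M &= aX + \varepsilon_M, \\
Y &= cX + bM + \varepsilon_Y, \\
\text{total effect of } X \text{ on } Y &= c + ab, \\
\text{coefficient on } X \text{ after adjusting for } M &= c. \\[4pt]
\text{With } a = 0.5,\; b = 0.4,\; c = 0.2:\quad
c + ab &= 0.2 + 0.5 \times 0.4 = 0.4, \qquad c = 0.2.
\end{aligned}
```

In this illustration, adjusting for the mediator would report half of the total effect; when c = 0 (full mediation) it would report no effect at all.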
Mediation analysis is a legitimate research goal in its own right, but it requires a different estimand (the controlled direct effect or the natural direct/indirect effects) and distinct identifying assumptions. The foundational framework for this, drawing on potential outcomes integrated with DAGs, is laid out in detail by Hernán and Robins in Causal Inference: What If (Chapman & Hall/CRC, 2020; revised 2025), freely available from the authors.
The Backdoor Criterion and Adjustment Sets
The backdoor criterion, formalised by Pearl, provides a precise graphical answer to the question of which variables must be conditioned upon to identify the total causal effect of X on Y from observational data.
A set of variables S satisfies the backdoor criterion relative to the ordered pair (X, Y) if:
- No variable in S is a descendant of X.
- S blocks every backdoor path from X to Y (i.e., every path that begins with an arrow into X).
If such a set S exists and is observed, the causal effect of X on Y is identified (estimable from the data) and is given by the backdoor adjustment formula:
$$P(Y = y \mid \mathrm{do}(X = x)) = \sum_{s} P(Y = y \mid X = x, S = s)\, P(S = s)$$
In practice, this formula is often implemented by regressing Y on X with S included as covariates, which is valid only to the extent that the regression model is correctly specified. The theoretical justification for which covariates to include comes from the DAG, not from the data.
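For discrete variables, the backdoor adjustment formula can be evaluated directly. The sketch below returns to the coffee and smoking example with made-up probabilities (all numbers are assumptions for illustration) in which coffee has no effect on heart disease at all, so any crude association is pure confounding.

```python
# Hypothetical probabilities; S = smoking, X = coffee, Y = heart disease.
P_S = {1: 0.3, 0: 0.7}                     # P(S = s)
P_X1_GIVEN_S = {1: 0.8, 0: 0.4}            # P(X = 1 | S = s)
P_Y1_GIVEN_XS = {                          # P(Y = 1 | X = x, S = s); X has no effect
    (1, 1): 0.3, (0, 1): 0.3,
    (1, 0): 0.1, (0, 0): 0.1,
}

def p_x_given_s(x, s):
    """P(X = x | S = s) for binary X."""
    return P_X1_GIVEN_S[s] if x == 1 else 1 - P_X1_GIVEN_S[s]

def p_y_given_x(x):
    """Observational P(Y = 1 | X = x): smokers are over-represented when x = 1."""
    joint = sum(P_Y1_GIVEN_XS[(x, s)] * p_x_given_s(x, s) * P_S[s] for s in P_S)
    marginal = sum(p_x_given_s(x, s) * P_S[s] for s in P_S)
    return joint / marginal

def p_y_do_x(x):
    """Backdoor adjustment: sum over s of P(Y = 1 | X = x, S = s) * P(S = s)."""
    return sum(P_Y1_GIVEN_XS[(x, s)] * P_S[s] for s in P_S)

crude_rd = p_y_given_x(1) - p_y_given_x(0)
adjusted_rd = p_y_do_x(1) - p_y_do_x(0)

print(f"crude risk difference:    {crude_rd:.4f}")
print(f"adjusted risk difference: {adjusted_rd:.4f}")
```

The crude comparison weights each smoking stratum by how common it is among coffee drinkers or non-drinkers, whereas the adjustment formula weights both exposure groups by the same population distribution of S. That difference in weighting is what removes the spurious association.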
Minimal Adjustment Sets
Multiple sets of variables may satisfy the backdoor criterion for a given DAG. A minimal adjustment set is one for which no proper subset also satisfies the criterion. Using a minimal set is often preferable for two reasons: fewer covariates reduce the risk of variance inflation from over-parameterisation, and every additional covariate is another opportunity to condition inadvertently on a collider or mediator that is associated with the exposure or outcome but does not belong in the adjustment set.
The identification of minimal adjustment sets is analytically complex in large DAGs. DAGitty automates this computation, which is one of the primary reasons it has become the standard tool in applied epidemiology and social science. This is conceptually related to the challenge of construct and internal validity in research design: both require the researcher to reason about which sources of variation are causal versus artefactual.
Worked Example: Education, Income, and Health
Suppose a researcher wants to estimate the causal effect of educational attainment (Edu) on self-rated health (Health). Based on subject-matter knowledge and existing literature, the researcher draws the following DAG:
- Family socioeconomic status (SES) → Edu (higher SES families invest more in education)
- SES → Health (higher SES is associated with better health independently)
- Edu → Income (Inc) (education increases income)
- Inc → Health (income improves health access and behaviours)
- Edu → Health (direct pathway, e.g., health literacy)
Identifying the paths from Edu to Health:
- Path 1: Edu → Health (direct causal path; open, desired)
- Path 2: Edu → Inc → Health (causal mediated path; open, desired for total effect)
- Path 3: Edu ← SES → Health (backdoor path; open, confounding)
Applying the backdoor criterion:
- The only backdoor path is path 3, which runs through SES.
- Adjusting for SES blocks path 3 without blocking any causal path or conditioning on a collider.
- Adjusting for Inc would block the mediated causal path (path 2), reducing the estimate to the direct effect only.
- The minimal sufficient adjustment set for the total causal effect of Edu on Health is therefore {SES} alone.
This example illustrates how DAGs prevent two common errors: (1) omitting SES and thereby producing a confounded estimate, and (2) including Income and thereby inadvertently estimating a direct effect rather than the total effect. The decision requires no p-value or stepwise test — it follows directly from the structure of the graph. For comparison across statistical software packages that could be used to implement this adjustment, the guide on JASP vs Jamovi vs SPSS vs R for thesis statistics covers the relative merits of each platform for regression modelling.
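The reasoning in this worked example can be checked by brute force. The sketch below encodes the five arrows listed above, tests every subset of the remaining covariates against the two conditions of the backdoor criterion, and keeps the minimal sets. It is an illustration for small graphs only (the function names are invented here, and enumerating subsets does not scale); DAGitty performs the same task far more efficiently.

```python
from itertools import combinations

EDGES = [
    ("SES", "Edu"), ("SES", "Health"),
    ("Edu", "Inc"), ("Inc", "Health"),
    ("Edu", "Health"),
]
EXPOSURE, OUTCOME = "Edu", "Health"

def descendants(node):
    """Every node reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def undirected_paths(start, end, path=None):
    """Yield every path from start to end, ignoring arrow direction."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for a, b in EDGES:
        if path[-1] in (a, b):
            nxt = b if a == path[-1] else a
            if nxt not in path:
                yield from undirected_paths(start, end, path + [nxt])

def is_blocked(path, conditioned):
    """d-separation check for a single path."""
    for left, mid, right in zip(path, path[1:], path[2:]):
        if (left, mid) in EDGES and (right, mid) in EDGES:  # collider
            if not (({mid} | descendants(mid)) & conditioned):
                return True
        elif mid in conditioned:                            # chain or fork
            return True
    return False

def satisfies_backdoor(candidate):
    candidate = set(candidate)
    if candidate & descendants(EXPOSURE):   # condition 1: no descendants of X
        return False
    backdoor = [p for p in undirected_paths(EXPOSURE, OUTCOME)
                if (p[1], EXPOSURE) in EDGES]  # first edge points into X
    return all(is_blocked(p, candidate) for p in backdoor)  # condition 2

covariates = sorted({n for edge in EDGES for n in edge} - {EXPOSURE, OUTCOME})
valid = [set(c) for k in range(len(covariates) + 1)
         for c in combinations(covariates, k) if satisfies_backdoor(c)]
minimal = [s for s in valid if not any(t < s for t in valid)]
print(minimal)  # [{'SES'}]
```

The empty set fails because Path 3 stays open, and any set containing Inc fails because Inc is a descendant of Edu, which leaves {SES} as the only valid set in this DAG.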
Using DAGitty in Practice
DAGitty (described in detail in Textor et al., Epidemiology, 2011) is a browser-based tool that implements the d-separation rules computationally. It is freely available at dagitty.net and requires no installation. The R package dagitty, available on CRAN, provides the same functionality within scripted analysis workflows.
Step-by-Step Workflow
- Draw the DAG. Open dagitty.net and add nodes (double-click to create, drag to position). Draw arrows by hovering over the source node and dragging to the target. Label each variable clearly.
- Designate roles. Right-click each node to designate it as Exposure, Outcome, Adjusted, or Latent (unobserved). Latent nodes are shown as open circles — critical for honest representation of unmeasured confounders.
- Inspect the output panel. DAGitty immediately displays: (a) all minimal sufficient adjustment sets; (b) all paths from exposure to outcome, classified as causal or biasing; (c) all testable conditional independencies implied by the DAG, which can be used to partially check the DAG’s assumptions against the data.
- Export and report. DAGitty generates a text-based model code (dagitty syntax) that can be pasted into the Methods section for reproducibility. The R package allows the DAG to be embedded directly in analysis scripts.
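For readers who keep their DAG as an edge list in a script, the model code can also be generated programmatically. The sketch below writes the education example in the text format DAGitty uses as understood here (a `dag { ... }` block with bracketed role tags and `->` arrows); the helper name `to_dagitty` is invented, and the output should be pasted into dagitty.net to confirm it parses as intended.

```python
def to_dagitty(edges, exposure, outcome, latent=()):
    """Build DAGitty-style model code from a list of (parent, child) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    lines = ["dag {"]
    for node in nodes:
        if node == exposure:
            role = " [exposure]"
        elif node == outcome:
            role = " [outcome]"
        elif node in latent:
            role = " [latent]"
        else:
            role = ""
        lines.append(f"  {node}{role}")
    for parent, child in edges:
        lines.append(f"  {parent} -> {child}")
    lines.append("}")
    return "\n".join(lines)

EDGES = [
    ("SES", "Edu"), ("SES", "Health"),
    ("Edu", "Inc"), ("Inc", "Health"),
    ("Edu", "Health"),
]

model_code = to_dagitty(EDGES, exposure="Edu", outcome="Health")
print(model_code)
```

Keeping the edge list as the single source of truth means the figure, the reported model code, and the analysis script cannot drift apart.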
The R package ggdag (built on dagitty) provides ggplot2-style visualisation of DAGs and is increasingly the standard for publication-quality DAG figures in quantitative dissertations. The 2023 comparison of open-source DAG software by McGowan and colleagues provides a useful benchmark of DAGitty, ggdag, and several alternatives for applied researchers choosing between tools.
Limitations and Common Mistakes
DAGs are tools for encoding and communicating causal assumptions, not for discovering them from data. Several limitations deserve explicit acknowledgement in any manuscript that uses DAGs.
DAGs Cannot Be Confirmed by Data Alone
A DAG encodes the researcher’s prior causal beliefs. Two DAGs with different structures may imply identical sets of conditional independencies (they are Markov equivalent), making them statistically indistinguishable even with infinite data. The choice between Markov-equivalent DAGs must be made on subject-matter grounds. This epistemic limitation is discussed at length in Hernán and Robins (2020) and is one reason why DAGs must be accompanied by explicit causal justification in the methods section.
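Markov equivalence can be checked mechanically using a standard graphical characterisation (due to Verma and Pearl): two DAGs are Markov equivalent when they share the same skeleton (the same adjacencies, ignoring direction) and the same v-structures (colliders A → C ← B whose parents A and B are not adjacent). The sketch below applies that test to three-node graphs; the function names are invented for this illustration.

```python
from itertools import combinations

def skeleton(edges):
    """The set of adjacencies, with arrow direction discarded."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """All colliders a -> child <- b whose two parents are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, parent_set in parents.items():
        for a, b in combinations(sorted(parent_set), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges_1, edges_2):
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))

chain = [("X", "M"), ("M", "Y")]           # X -> M -> Y
reversed_chain = [("Y", "M"), ("M", "X")]  # X <- M <- Y
fork = [("M", "X"), ("M", "Y")]            # X <- M -> Y
collider = [("X", "M"), ("Y", "M")]        # X -> M <- Y

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, fork))            # True
print(markov_equivalent(chain, collider))        # False
```

The chain, the reversed chain, and the fork all imply the same single independence (X and Y are independent given M), so no dataset can choose among them; only the collider makes a different, testable claim.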
Unmeasured Confounding
If a required adjustment variable is unobserved (a latent node in the DAG), the backdoor criterion cannot be satisfied and the causal effect is not identifiable from observational data through adjustment alone. Alternative identification strategies — instrumental variables, regression discontinuity, difference-in-differences — may apply depending on the DAG structure, but each has its own distinct identifying assumptions.
Time-Varying Treatments and Feedback
The DAGs in this guide describe a single exposure at a single time point with no feedback. Longitudinal studies in which a time-varying exposure affects later confounders, which in turn affect later exposure, need time-indexed DAGs (one node per variable per measurement occasion), and conventional covariate adjustment is generally no longer sufficient. Such time-varying confounding is usually handled with g-methods, for example marginal structural models fitted by inverse probability weighting, as detailed in Part III of Hernán and Robins (2020). Single World Intervention Graphs (SWIGs) are a related extension that displays counterfactual variables on the graph. This is directly relevant to studies where the exposure and outcome are measured repeatedly, such as panel data or ecological momentary assessment designs.
Model Specification Is Separate from DAG Identification
A DAG identifies the adjustment set; it does not specify how adjustment should be implemented (linear regression, propensity score matching, doubly robust estimators). The choice of estimation method introduces its own assumptions about functional form and overlap. A correctly specified DAG paired with a mis-specified regression model can still produce biased estimates. For guidance on interpreting the uncertainty around any adjusted estimate, see the article on confidence interval interpretation and the discussion of internal and external validity.
Researchers working on systematic reviews should also be aware that publication bias in the primary literature can distort the evidence base from which any DAG is built — studies with null results for assumed confounding pathways may be under-represented, leading to a DAG that omits arrows that should be present.
Frequently Asked Questions
What is a directed acyclic graph (DAG) in causal inference?
A DAG is a graphical representation of assumed causal relationships among variables. Nodes represent variables and directed edges (arrows) represent assumed direct causal effects. “Acyclic” means a variable cannot cause itself through any chain of edges — there are no feedback loops. DAGs make the researcher’s causal assumptions explicit and allow systematic identification of confounders, colliders, and valid adjustment sets.
What is the backdoor criterion?
The backdoor criterion, formalised by Judea Pearl, specifies when a set of variables S is a valid adjustment set. S satisfies the criterion if: (1) no member of S is a descendant of the treatment variable, and (2) S blocks every backdoor path between treatment and outcome, that is, every path that begins with an arrow into the treatment. Conditioning on a valid backdoor adjustment set blocks all confounding paths, so that, if the DAG is correct, the remaining association between treatment and outcome reflects the causal effect.
What is a collider and why is conditioning on one dangerous?
A collider is a variable at which two arrowheads meet on a given path (A → C ← B). Left alone, a collider blocks the path and transmits no association. However, conditioning on a collider (or one of its descendants) opens the path and introduces a spurious association known as collider-stratification bias. This is a common source of bias in studies that naively adjust for every available covariate.
How is a mediator different from a confounder?
A confounder is a common cause of both the exposure and the outcome, creating a non-causal association. A mediator sits on the causal pathway between exposure and outcome — caused by the exposure and in turn causing the outcome. Conditioning on a mediator blocks the causal pathway of interest and estimates only the direct effect, which is almost always a mistake when the goal is to estimate the total causal effect.
What software can I use to draw and analyse DAGs?
DAGitty (dagitty.net) is the most widely used free browser-based tool. It allows researchers to draw DAGs, label variables as exposures, outcomes, adjusted, or latent, and then automatically lists the minimal sufficient adjustment sets, the causal and biasing paths, and the testable conditional independencies implied by the graph. It is also available as an R package (install.packages("dagitty")). The ggdag package provides ggplot2-style DAG visualisation for publication figures.
Do DAGs replace statistical tests for confounding?
No, and the reverse is also true. DAGs encode theoretical causal assumptions that cannot be fully tested from data: two structurally different DAGs may be statistically indistinguishable (Markov equivalent). What a DAG does is translate domain knowledge into a covariate-selection strategy; it does not substitute for that knowledge. Data-driven criteria such as change-in-estimate rules or p-values for candidate confounders are likewise insufficient replacements for explicit causal reasoning with a DAG.
