Bayesian causal inference

Author

Hovhannes Grigoryan

Published

April 26, 2026

Intended learning outcomes

By the end of this chapter, you will be able to:

State Rubin’s Bayesian potential-outcomes formulation and explain which posterior distribution is computed.
Apply BART (Bayesian Additive Regression Trees) to estimate heterogeneous treatment effects with uncertainty quantification.
Construct Bayesian Causal Forests (BCF, Hahn-Murray-Carvalho 2020) as a refinement of BART that separates prognostic from treatment-effect modeling.
Compute posterior intervals for individual treatment effects and interpret their frequentist coverage properties.
Perform Bayesian sensitivity analysis by placing informative priors on the strength of unmeasured confounding, that is, on departures from the unconfoundedness assumption.

Suggested lecture plan

Three 75–90 min lectures: (1) Bayesian formulation of potential outcomes, choice of priors on nuisances; (2) BART for causal inference (Hill 2011), computational aspects, hands-on with pymc-bart; (3) BCF and its improvements over BART, Bayesian sensitivity analysis, closing synthesis across the series.

1 The Bayesian potential-outcomes framework

Rubin (1978) formulated causal inference as a missing-data problem [@rubin1978bayesian]. The observed data are \((Y_i^{\text{obs}}, D_i, X_i)\) for \(i = 1, \ldots, N\). The unobserved data are the counterfactuals \(Y_i^{\text{mis}} = Y_i(1 - D_i)\). A Bayesian model places a joint prior over

\[ \theta = \text{model parameters}, \quad Y^{\text{mis}} = (Y_i^{\text{mis}})_{i=1}^N, \]

and computes the posterior \(p(Y^{\text{mis}}, \theta \mid Y^{\text{obs}}, D, X)\). From this posterior, any causal estimand is a functional:

\[ \tau_{\text{SATE}} := \frac{1}{N} \sum_i (Y_i(1) - Y_i(0)) = \frac{1}{N} \sum_i \left(D_i Y_i^{\text{obs}} - (1 - D_i) Y_i^{\text{obs}} + (1 - 2 D_i) Y_i^{\text{mis}}\right). \]

The posterior of \(\tau_{\text{SATE}}\) is obtained by propagating the uncertainty in \(Y^{\text{mis}}\).

This framework is conceptually clean but computationally demanding. Modern implementations use MCMC or variational inference over flexible nuisance models.
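As a minimal sketch of this propagation, the example below imputes \(Y^{\text{mis}}\) from a conjugate Bayesian linear model fitted separately in each arm and evaluates the SATE formula once per posterior draw. The simulated data, the known noise scale `SIGMA`, and the prior variance `PRIOR_VAR` are illustrative assumptions rather than part of Rubin's formulation, and the linear model stands in for the flexible nuisance models used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
X = rng.normal(size=N)
D = (rng.uniform(size=N) < 0.5).astype(int)   # randomized assignment (assumption)
Y0 = 1.0 + X + rng.normal(0, 1.0, N)
Y1 = Y0 + 2.0                                  # constant unit-level effect of 2
Y_obs = np.where(D == 1, Y1, Y0)               # only one potential outcome is seen
sate_true = np.mean(Y1 - Y0)

SIGMA = 1.0        # noise sd, assumed known to keep the posterior conjugate
PRIOR_VAR = 100.0  # weak N(0, PRIOR_VAR * I) prior on regression coefficients

def coef_posterior(Z, y):
    """Posterior mean and covariance of beta for y ~ N(Z beta, SIGMA^2 I)."""
    cov = np.linalg.inv(Z.T @ Z / SIGMA**2 + np.eye(Z.shape[1]) / PRIOR_VAR)
    return cov @ Z.T @ y / SIGMA**2, cov

Z = np.column_stack([np.ones(N), X])
post = {d: coef_posterior(Z[D == d], Y_obs[D == d]) for d in (0, 1)}

S = 2000
sate_draws = np.empty(S)
for s in range(S):
    # Draw theta from its posterior, one coefficient vector per arm.
    beta = {d: rng.multivariate_normal(*post[d]) for d in (0, 1)}
    # The missing outcome of unit i is Y_i(1 - D_i): use the other arm's model.
    mu_mis = np.where(D == 1, Z @ beta[0], Z @ beta[1])
    Y_mis = mu_mis + rng.normal(0, SIGMA, N)
    # Y_i(1) - Y_i(0), written in terms of observed and imputed outcomes.
    sate_draws[s] = np.mean(np.where(D == 1, Y_obs - Y_mis, Y_mis - Y_obs))

lo, hi = np.quantile(sate_draws, [0.025, 0.975])
print(f"true SATE {sate_true:.2f}, posterior mean {sate_draws.mean():.2f}, "
      f"95% interval [{lo:.2f}, {hi:.2f}]")
```

Each pass through the loop is one joint draw of \((\theta, Y^{\text{mis}})\), so the spread of `sate_draws` reflects both parameter uncertainty and the irreducible noise in the unobserved outcomes.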

2 BART for causal inference

Bayesian Additive Regression Trees (BART), introduced by Chipman, George, and McCulloch (2010) [@chipman2010bart], is a non-parametric regression tool: \(f(x) = \sum_k g_k(x; T_k)\), a sum of regression trees \(T_k\), each carrying its own leaf values, under a regularizing prior. Each tree is shallow and contributes only a small part of the fit; the sum is flexible.
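To make the sum-of-trees form concrete, here is a small sketch that evaluates \(f(x)\) for a fixed ensemble of depth-one trees. The tuple representation, the helper name `sum_of_trees`, and all split points and leaf values are hypothetical; BART would place a prior over the trees and sample them rather than fix them by hand.

```python
import numpy as np

# Each stump is (feature index, threshold, left leaf value, right leaf value).
# A point goes left when x[feature] <= threshold. All numbers are illustrative.
stumps = [
    (0, 0.0, -0.5, 0.5),
    (1, 1.0, 0.2, -0.2),
    (0, 2.0, 0.0, 1.0),
]

def sum_of_trees(X, stumps):
    """Evaluate f(x) = sum_k g_k(x) for an ensemble of depth-one trees."""
    X = np.atleast_2d(X)
    total = np.zeros(X.shape[0])
    for feature, threshold, left, right in stumps:
        total += np.where(X[:, feature] <= threshold, left, right)
    return total

X_new = np.array([[-1.0, 0.0], [1.0, 2.0], [3.0, 0.5]])
f = sum_of_trees(X_new, stumps)
print(f)   # one fitted value per row of X_new
```

No single stump can represent the resulting step function, but their sum can, which is the sense in which shallow trees add up to a flexible fit.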

Hill (2011) [@hill2011bayesian] applied BART to causal inference by fitting a single BART model \(f(d, x)\) to \(Y\) as a function of \((D, X)\), then reading off \(\tau(x) = f(1, x) - f(0, x)\). The approach:

Fit BART on all data, with \(D, X\) as regressors.
Predict at \((D = 1, X_i)\) and \((D = 0, X_i)\) for each \(i\).
The posterior distribution of the prediction difference is the posterior of \(\tau(X_i)\).
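The three steps can be sketched without any BART machinery by swapping in a conjugate Bayesian linear model with a \(D \times X\) interaction as the response surface. This is a stand-in chosen for brevity, not Hill's method; the simulated data, `SIGMA`, and `PRIOR_VAR` are assumptions. What carries over to BART is the bookkeeping: both counterfactual predictions use the same posterior draw, and the difference is taken draw by draw.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
X = rng.normal(size=N)
D = (rng.uniform(size=N) < 1 / (1 + np.exp(-X))).astype(float)
tau_true = 1.0 + 0.5 * X
Y = X + D * tau_true + rng.normal(0, 0.5, N)

def design(d, x):
    """Regressors of the single response surface f(d, x)."""
    return np.column_stack([np.ones_like(x), d, x, d * x])

SIGMA, PRIOR_VAR = 0.5, 100.0   # assumed known noise sd and weak prior variance

# Step 1: fit one model on all data, with (D, X) as regressors.
Z = design(D, X)
cov = np.linalg.inv(Z.T @ Z / SIGMA**2 + np.eye(Z.shape[1]) / PRIOR_VAR)
mean = cov @ Z.T @ Y / SIGMA**2
beta_draws = rng.multivariate_normal(mean, cov, size=2000)   # shape (S, 4)

# Step 2: predict at (D = 1, X_i) and (D = 0, X_i) using the same draws.
f1 = beta_draws @ design(np.ones(N), X).T    # shape (S, N)
f0 = beta_draws @ design(np.zeros(N), X).T

# Step 3: the draw-wise difference is a posterior sample of tau(X_i).
tau_draws = f1 - f0
tau_hat = tau_draws.mean(axis=0)
tau_ci = np.quantile(tau_draws, [0.025, 0.975], axis=0)

rmse = np.sqrt(np.mean((tau_hat - tau_true) ** 2))
print(f"CATE RMSE with the linear stand-in: {rmse:.3f}")
```

Differencing posterior means gives the same point estimate, but intervals must come from the paired draws: subtracting independently sampled predictions would overstate the uncertainty in \(\tau(X_i)\).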

Hill’s simulation results showed BART matches or exceeds propensity-score matching, IPW, and regression adjustment on standard benchmarks, often with smaller MSE and better-calibrated intervals.

3 Bayesian Causal Forests

Hahn, Murray, and Carvalho (2020) [@hahn2020bayesian] showed that vanilla BART suffers from regularization-induced confounding (RIC): when the prognostic part of the outcome varies strongly with the propensity score, the prior’s shrinkage on \(f(D, X)\) underfits that variation, and the unexplained part is attributed to \(D\), biasing \(\tau(x)\) even when the data would support the correct answer. The fix is Bayesian Causal Forests (BCF):

\[ \mathbb{E}[Y \mid D, X] = \mu(X, \hat e(X)) + D \cdot \tau(X), \]

with separate BART priors on the prognostic function \(\mu\) and the treatment effect \(\tau\). In the Hahn-Murray-Carvalho parametrization the estimated propensity score \(\hat e(X)\) enters \(\mu\) as an extra covariate, which is what counteracts regularization-induced confounding; \(\tau\) depends on \(X\) alone. BCF’s prior on \(\tau\) is stronger (less flexible) than on \(\mu\), reflecting the substantive prior that treatment effects are typically simpler than prognostic structure.
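The role of the propensity term can be illustrated in a linear analogue, with a ridge penalty standing in for the BART prior. This is a sketch of the mechanism, not of BCF itself: the data-generating process, the penalty `lam`, the helper `partial_ridge`, and the use of a linear-probability fit as \(\hat e(X)\) are all assumptions made for the example. The outcome depends on the covariates through the same index that drives treatment, so shrinking the covariate coefficients leaves confounding behind unless an unpenalized propensity term absorbs it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau = 2000, 30, 1.0
X = rng.normal(size=(n, p))
gamma = np.zeros(p)
gamma[:5] = 0.5
e = 1 / (1 + np.exp(-X @ gamma))                 # true propensity score
D = (rng.uniform(size=n) < e).astype(float)
Y = tau * D + X @ gamma + rng.normal(0, 1.0, n)  # prognostic index = propensity index

def partial_ridge(W, X, y, lam):
    """Least squares in W (unpenalized) with ridge penalty lam on X coefficients."""
    Z = np.column_stack([W, X])
    penalty = np.zeros(Z.shape[1])
    penalty[W.shape[1]:] = lam
    return np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)

ones = np.ones(n)
lam = 5000.0   # deliberately heavy shrinkage

# Naive: intercept and D unpenalized, covariates shrunk toward zero.
tau_naive = partial_ridge(np.column_stack([ones, D]), X, Y, lam)[1]

# Adjusted: add an estimated propensity score as an unpenalized regressor.
X1 = np.column_stack([ones, X])
e_hat = X1 @ np.linalg.lstsq(X1, D, rcond=None)[0]   # linear-probability stand-in
tau_adj = partial_ridge(np.column_stack([ones, D, e_hat]), X, Y, lam)[1]

# Unregularized benchmark (feasible here because p is much smaller than n).
tau_ols = np.linalg.lstsq(np.column_stack([ones, D, X]), Y, rcond=None)[0][1]

print(f"true {tau:.2f} | naive ridge {tau_naive:.2f} | "
      f"with e_hat {tau_adj:.2f} | OLS {tau_ols:.2f}")
```

In this linear setting the adjusted estimate coincides with the unregularized one, because the part of \(D\) not explained by \(\hat e(X)\) is orthogonal to the covariates and the penalty can no longer leak into the treatment coefficient. BCF relies on the same idea, with trees in place of linear terms.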

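Remark 12.1 (BCF bias reduction, after Hahn-Murray-Carvalho 2020)

Under unconfoundedness and appropriate priors, BCF typically estimates the CATE with smaller bias than single-surface BART. Hahn, Murray, and Carvalho support this with an analysis of regularization-induced confounding and with simulation studies rather than with a general rate theorem, so no \(o(1/\sqrt n)\) guarantee should be read into it. The intuition: separating \(\mu\) from \(\tau\) prevents the prior on \(\mu\) from contaminating the estimate of \(\tau\).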

BCF has become a widely used Bayesian tool for heterogeneous treatment effect estimation in policy evaluation, particularly in settings with modest sample sizes.

4 Posterior coverage

Bayesian posteriors of CATE are well-calibrated under the model, but frequentist coverage depends on the prior and model flexibility. Empirically:

BART’s 95% credible intervals have frequentist coverage around 88–95% on standard benchmarks (varies by DGP).
BCF’s intervals are better-calibrated (91–96%), especially at extreme CATE values.
Compared to the honest causal forest (Chapter 11), Bayesian methods tend to have slightly wider intervals but better calibration in small-sample regimes.
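The dependence of frequentist coverage on the prior can be checked by simulation in the simplest conjugate setting, a normal mean with known unit variance. The sketch below is a toy illustration with assumed values for the true parameter, sample size, and prior scales; it does not reproduce the BART or BCF figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, n, reps = 2.0, 20, 4000

def coverage(prior_sd):
    """Frequentist coverage of the 95% credible interval under a N(0, prior_sd^2) prior."""
    # Sampling distribution of the sample mean of n unit-variance observations.
    ybar = theta_true + rng.normal(0, 1 / np.sqrt(n), reps)
    post_prec = n + 1 / prior_sd**2
    post_mean = n * ybar / post_prec          # shrunk toward the prior mean 0
    half_width = 1.96 / np.sqrt(post_prec)
    return np.mean(np.abs(post_mean - theta_true) <= half_width)

cov_weak = coverage(10.0)    # nearly flat prior
cov_strong = coverage(0.5)   # tight prior centred away from the truth
print(f"coverage, weak prior:   {cov_weak:.3f}")
print(f"coverage, strong prior: {cov_strong:.3f}")
```

With the nearly flat prior the credible interval behaves like a classical confidence interval; with the tight prior the interval is both shifted and narrower, so it misses the truth far more often than 5% of the time. Regularizing tree priors affect CATE intervals through the same channel.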

5 Bayesian sensitivity analysis

A distinctive advantage of the Bayesian framework is that it can place a prior on the degree to which unconfoundedness fails. Specifically, model the treatment assignment as

\[ D_i \mid X_i, U_i \sim \text{Bernoulli}(e(X_i, U_i)), \]

with \(U\) an unmeasured confounder given a prior. Integrate over the prior on the confounder’s strength to obtain a posterior over \(\tau\) that accounts for the assumed confounding.

This generalizes Rosenbaum \(\Gamma\)-bounds and E-values: instead of reporting “the result holds if \(\Gamma \leq 2\),” we can report “under a plausible prior on hidden confounding, the posterior probability that \(\tau > 0\) is 92%.”
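A minimal Monte Carlo sketch of such a report follows, under strong simplifying assumptions that are not in the text: no covariates, a binary confounder \(U \sim \text{Bernoulli}(1/2)\), assignment \(e(U) = \text{logit}^{-1}(\alpha (U - 1/2))\), and a linear outcome \(Y = \tau D + \gamma U + \varepsilon\). In that setting the naive difference in means is biased by \(\gamma \tanh(\alpha / 4)\). The posterior draws for the naive effect and the priors on \(\alpha\) and \(\gamma\) are hypothetical numbers, and the bias is treated as independent of the naive posterior, which is an approximation.

```python
import numpy as np

rng = np.random.default_rng(4)
S = 100_000

# Hypothetical posterior draws of the effect obtained under unconfoundedness.
tau_naive = rng.normal(0.5, 0.2, S)

# Assumed priors on the strength of the unmeasured confounder U:
alpha = np.abs(rng.normal(0, 1.0, S))   # log-odds shift in treatment between U=1 and U=0
gamma = rng.normal(0, 0.5, S)           # effect of U on the outcome

# P(U=1 | D=1) - P(U=1 | D=0) equals tanh(alpha / 4) in this model,
# so the naive estimate overstates tau by gamma * tanh(alpha / 4).
bias = gamma * np.tanh(alpha / 4)
tau_adj = tau_naive - bias

p_naive = np.mean(tau_naive > 0)
p_adj = np.mean(tau_adj > 0)
print(f"P(tau > 0), assuming unconfoundedness: {p_naive:.3f}")
print(f"P(tau > 0), with prior on confounding: {p_adj:.3f}")
```

Tightening or widening the priors on `alpha` and `gamma` shows how quickly the conclusion degrades as more hidden confounding is considered plausible, which is the Bayesian counterpart of tracing out a \(\Gamma\) curve.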

6 Worked example

import numpy as np
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(0)
N, P = 500, 5

X = rng.normal(0, 1, (N, P))
e = 1 / (1 + np.exp(-(0.5 * X[:, 0])))
D = (rng.uniform(size=N) < e).astype(int)
tau_true = 1 + X[:, 0]          # heterogeneous CATE
Y = X[:, 0] + X[:, 1] + D * tau_true + rng.normal(0, 0.5, N)

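# Simple BART for causal inference (Hill 2011)
# Covariates include D explicitly; predict at D=1 and D=0 for each unit.
# X_in is a pm.Data container (pm.MutableData in older PyMC releases) so the
# design matrix can be swapped after fitting.
with pm.Model() as bart_model:
    X_in = pm.Data('X_in', np.column_stack([D, X]))
    sigma = pm.HalfNormal('sigma', 1)
    bart = pmb.BART('f', X=X_in, Y=Y, m=50)
    pm.Normal('Y', mu=bart, sigma=sigma, observed=Y, shape=bart.shape)
    trace = pm.sample(500, tune=500, chains=2, random_seed=0,
                       nuts={'target_accept': 0.95})

# Counterfactual design: the first N rows set D=1, the last N rows set D=0.
# Stacking both arms means each posterior draw evaluates the same trees in both.
X_cf = np.vstack([np.column_stack([np.ones(N), X]),
                  np.column_stack([np.zeros(N), X])])

# Swap the data and resample f from the fitted trees. This follows the
# out-of-sample pattern in the PyMC-BART documentation; details vary by version.
with bart_model:
    pm.set_data({'X_in': X_cf})
    ppc = pm.sample_posterior_predictive(trace, var_names=['f'], random_seed=0)

# Collapse chains and draws into one axis: shape (n_draws, 2N)
f_draws = ppc.posterior_predictive['f'].stack(sample=('chain', 'draw')).values.T
pred1, pred0 = f_draws[:, :N], f_draws[:, N:]

tau_hat = pred1.mean(axis=0) - pred0.mean(axis=0)
tau_ci = np.quantile(pred1 - pred0, [0.025, 0.975], axis=0)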

rmse = np.sqrt(np.mean((tau_hat - tau_true) ** 2))
coverage = np.mean((tau_ci[0] <= tau_true) & (tau_true <= tau_ci[1]))
print(f"BART CATE RMSE:   {rmse:.3f}")
print(f"95% credible coverage: {coverage:.3f}")

On this DGP, BART typically achieves RMSE comparable to a well-tuned causal forest with 500 trees, with 95% credible-interval coverage near 90% (slight undercoverage, attributable to BART’s regularization); exact figures vary with the seed and library version. BCF would improve the calibration but requires a custom implementation or the R package bcf.

7 Bibliographic notes

Rubin (1978), “Bayesian Inference for Causal Effects: The Role of Randomization,” Annals of Statistics 6, is the foundational paper.

Hill (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics 20, brought BART to causal inference.

Hahn, Murray, and Carvalho (2020), “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects,” Bayesian Analysis 15, introduces BCF.

Chipman, George, and McCulloch (2010), “BART: Bayesian Additive Regression Trees,” Annals of Applied Statistics 4, is the BART paper.

8 Exercises

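Exercise 12.1 (\(\star\star\)). Derive the BART posterior for \(\tau\) in the simplest limit: a single tree with one split on \(D\) and a conjugate normal prior on the two leaf values, which reduces to a Bayesian two-group normal-means model. Show that the credible interval for \(\tau\) equals the conjugate normal-model interval for a difference in means.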

Exercise 12.2 (\(\star\star\)). Explain how the BART prior produces regularization-induced confounding. Is the phenomenon specific to Bayesian models? Describe what the analogous effect would look like in a frequentist penalized regression.

Exercise 12.3 (\(\star\star\star\)). Design a prior on an unmeasured confounder and compute the posterior probability that the treatment effect is positive. Compare to the Rosenbaum \(\Gamma\)-sensitivity analysis from Chapter 2.

Exercise 12.4 (\(\star\star\)). Apply BCF (via R’s bcf package, accessed through rpy2) to a synthetic DGP. Compare its CATE RMSE to the causal forest from Chapter 11.

9 Series wrap-up

This completes the twelve-chapter arc from potential outcomes to ML-augmented causal inference. The thread:

Chapters 1–3. What causal estimands are, what identification requires, and how DAGs formalize those requirements.
Chapters 4–6. Classical identification strategies (randomization, unconfoundedness via observables, and instruments) and their estimators.
Chapters 7–9. Panel-data and design-based identification (difference-in-differences, regression discontinuity, synthetic control) for settings where treatment has a natural timing or threshold structure.
Chapters 10–11. Machine-learning-augmented estimation (DML for orthogonalized plug-in estimation, causal forests for heterogeneity), built on the classical identification foundations.
Chapter 12. A Bayesian framework that provides coherent uncertainty quantification for the estimands introduced in the earlier chapters.

The modern causal-inference toolkit is not any one estimator but a layering of identification assumptions, estimation strategies, and inference frameworks. Good applied work moves up and down this stack to match the problem. Identification first; estimation second; inference third.

10 References

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). “BART: Bayesian Additive Regression Trees.” Annals of Applied Statistics 4(1), 266–298.

Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020). “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects.” Bayesian Analysis 15(3), 965–1056.

Hill, J. L. (2011). “Bayesian Nonparametric Modeling for Causal Inference.” Journal of Computational and Graphical Statistics 20(1), 217–240.

Rubin, D. B. (1978). “Bayesian Inference for Causal Effects: The Role of Randomization.” Annals of Statistics 6(1), 34–58.