Instrumental variables

Author

Hovhannes Grigoryan

Published

April 5, 2026

Intended learning outcomes

By the end of this chapter, you will be able to:

State the three instrumental-variable (IV) assumptions (relevance, exclusion, and independence) and recognize them in applied settings.
Derive the two-stage least-squares (2SLS) estimator and state its identification requirements.
Prove the Imbens-Angrist LATE theorem: under monotonicity, the IV estimand is the average treatment effect on the compliers.
Diagnose weak instruments using the Stock-Yogo test and understand the consequences of weak-instrument bias.
Apply control-function methods for continuous or nonlinear treatments.
Implement 2SLS and LATE estimation in Python with appropriate standard errors.

Suggested lecture plan

Three lectures of 75–90 minutes.

Lecture 1, IV identification.

The endogeneity problem (15 min)
IV assumptions: relevance, exclusion, independence (20 min)
2SLS derivation in the linear constant-effects model (25 min)
Reduced form, first stage, and Wald estimator (15 min)
Hands-on: simulate endogeneity and estimate via 2SLS (10 min)

Lecture 2, LATE and heterogeneous effects.

Imbens-Angrist monotonicity assumption (15 min)
The LATE theorem with full proof (30 min)
Interpretation of LATE: complier ATE (15 min)
ITT, LATE, and per-protocol: a comparison (20 min)
Hands-on: LATE in a noncompliance trial (10 min)

Lecture 3, Weak instruments and beyond.

Weak-instrument bias: the Nagar expansion (25 min)
First-stage F-statistic and Stock-Yogo critical values (15 min)
Control function approach for nonlinear treatments (25 min)
Many weak instruments (brief) (10 min)
Hands-on: diagnose weak IV on simulated data (15 min)

1 The endogeneity problem

Consider the linear causal model

\[ Y_i = \alpha + \tau D_i + \epsilon_i, \tag{1}\]

where \(D\) is the treatment and \(\epsilon\) includes everything else affecting \(Y\). The fundamental problem is endogeneity: \(\mathbb{E}[\epsilon \mid D] \neq 0\). Without additional assumptions, OLS is inconsistent for \(\tau\).

Endogeneity has three canonical sources:

Omitted variable bias. An unobserved confounder \(U\) affects \(D\) and also enters \(\epsilon\).
Simultaneity. \(D\) and \(Y\) are jointly determined (e.g., price and quantity).
Measurement error. \(D\) is observed with classical measurement error, so the observed regressor is correlated with the composite error term.

In Chapter 2 we handled the first source, omitted variables, by assuming unconfoundedness given observed \(X\). When that assumption fails because the confounding is intrinsically unobservable, the instrumental-variables approach offers an alternative identification path.
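To make the omitted-variable case concrete, the following minimal sketch simulates a confounder \(U\) that enters both \(D\) and \(\epsilon\) and compares the OLS slope with \(\tau\). The coefficients (0.8 and 1.0), the sample size, and the seed are illustrative assumptions, not values from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
tau = 2.0

U = rng.normal(size=n)                 # unobserved confounder
D = 0.8 * U + rng.normal(size=n)       # treatment depends on U
eps = 1.0 * U + rng.normal(size=n)     # structural error also contains U
Y = 1.0 + tau * D + eps

# OLS slope of Y on D: Cov(D, Y) / Var(D)
ols_slope = np.cov(D, Y)[0, 1] / np.var(D, ddof=1)

# Population bias in this design: Cov(D, eps) / Var(D) = 0.8 / (0.8**2 + 1)
bias_theory = 0.8 / (0.8**2 + 1.0)

print(f"true tau:        {tau:.3f}")
print(f"OLS slope:       {ols_slope:.3f}")
print(f"tau + Cov/Var:   {tau + bias_theory:.3f}")
```

Increasing `n` does not shrink the gap between the OLS slope and \(\tau\), which is what inconsistency means here.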

2 The three IV assumptions

An instrumental variable \(Z\) is a variable that satisfies three conditions:

Assumption 6.1 — IV conditions

A variable \(Z\) is a valid instrument for the effect of \(D\) on \(Y\) if:

Relevance. \(\text{Cov}(Z, D) \neq 0\): \(Z\) predicts \(D\).
Exclusion restriction. \(Z\) affects \(Y\) only through \(D\): there is no direct pathway \(Z \to Y\) bypassing \(D\).
Independence (or exogeneity). \(Z \perp\!\!\!\perp \epsilon\): \(Z\) is independent of the error term in Equation 1.

Relevance is testable from data: regress \(D\) on \(Z\) and check that the coefficient is nonzero. Exclusion and independence are not testable from data alone. They are identifying assumptions that require substantive justification.

2.1 Canonical examples

Lottery instruments. The Vietnam-era draft lottery assigned young men to military service based on birthdate. Angrist (1990) used the lottery number as an instrument for veteran status, estimating the effect of military service on earnings.
Randomized encouragement. In a trial with noncompliance, the randomized assignment \(Z\) is an instrument for the treatment actually received \(D\): assignment predicts treatment received (relevance), is expected to affect outcomes only through treatment (exclusion), and is independent of unobservables by construction (independence).
Geographic instruments. Distance to a hospital that offers a specific treatment, used as an instrument for receiving that treatment when studying its effect on outcomes.
Policy discontinuities. Changes in eligibility rules around a fixed threshold (bridge to Chapter 8’s regression-discontinuity designs).

3 Two-stage least squares

3.1 Derivation

Under the IV assumptions and the constant-effects model in Equation 1, the structural parameter \(\tau\) is identified by

\[ \tau = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, D)}. \tag{2}\]

This is the Wald estimand, the ratio of reduced-form to first-stage covariances. It is the building block of 2SLS.

Proof of Equation 2

From the structural equation,

\[ \text{Cov}(Z, Y) = \text{Cov}(Z, \tau D + \epsilon) = \tau \text{Cov}(Z, D) + \text{Cov}(Z, \epsilon). \]

By independence, \(\text{Cov}(Z, \epsilon) = 0\). By relevance, \(\text{Cov}(Z, D) \neq 0\). Dividing gives Equation 2. \(\square\)
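The identification argument can be checked numerically. This sketch, under an assumed data-generating process (a continuous instrument with first-stage coefficient 0.5 and a confounder \(U\); all names and values are illustrative), computes the sample analogue of Equation 2 and compares it with the OLS slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
tau = 2.0

U = rng.normal(size=n)                        # unobserved confounder
Z = rng.normal(size=n)                        # instrument, drawn independently of U
D = 0.5 * Z + 0.8 * U + rng.normal(size=n)    # relevance: Cov(Z, D) = 0.5
Y = 1.0 + tau * D + U + rng.normal(size=n)    # exclusion: Z enters Y only through D

cov_zy = np.cov(Z, Y)[0, 1]                   # reduced-form covariance
cov_zd = np.cov(Z, D)[0, 1]                   # first-stage covariance
wald = cov_zy / cov_zd                        # sample analogue of Equation 2

ols = np.cov(D, Y)[0, 1] / np.var(D, ddof=1)  # inconsistent because of U

print(f"Wald / IV estimate: {wald:.3f}")
print(f"OLS estimate:       {ols:.3f}")
```

The ratio is only as stable as its denominator: shrinking the 0.5 first-stage coefficient toward zero makes `cov_zd` small and the estimate erratic.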

3.2 The 2SLS algorithm

With covariates \(X\) included linearly in the model, 2SLS generalizes the Wald estimator:

\[ Y_i = \alpha + \tau D_i + X_i^\top \beta + \epsilon_i, \quad D_i = \pi_0 + \pi_1 Z_i + X_i^\top \pi_X + v_i. \]

Stage 1. Regress \(D\) on \(Z, X\) by OLS. Obtain fitted values \(\hat D_i\).

Stage 2. Regress \(Y\) on \(\hat D, X\) by OLS. The coefficient on \(\hat D\) is \(\hat\tau^{\text{2SLS}}\).

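In matrix form, let \(W = [\mathbf{1}, D, X]\) collect the second-stage regressors, \(\tilde W = [\mathbf{1}, Z, X]\) the instrument and exogenous regressors, and \(\theta = (\alpha, \tau, \beta^\top)^\top\) the full coefficient vector. Then

\[ \hat\theta^{\text{2SLS}} = \left(W^\top P_{\tilde W} W\right)^{-1} W^\top P_{\tilde W} Y, \tag{3}\]

where \(P_{\tilde W} = \tilde W(\tilde W^\top \tilde W)^{-1} \tilde W^\top\) is the projection onto the column space of \(\tilde W\), and \(\hat\tau^{\text{2SLS}}\) is the element of \(\hat\theta^{\text{2SLS}}\) corresponding to \(D\).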
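A minimal NumPy sketch of the matrix formula follows. It assumes a constant column is included in both \(W\) and \(\tilde W\), and it never forms the \(N \times N\) projection matrix: it uses the fact that \(P_{\tilde W}\) is symmetric and idempotent, so \(W^\top P_{\tilde W} W = \hat W^\top \hat W\) with \(\hat W = P_{\tilde W} W\). The function name `two_sls` and the simulated data are illustrative assumptions.

```python
import numpy as np

def two_sls(Y, W, W_tilde):
    """Return (W' P W)^{-1} W' P Y, where P projects onto the columns of W_tilde."""
    # First stage: project every column of W on the instruments and exogenous regressors.
    W_hat = W_tilde @ np.linalg.lstsq(W_tilde, W, rcond=None)[0]
    # Second stage: W' P W = W_hat' W_hat and W' P Y = W_hat' Y.
    return np.linalg.solve(W_hat.T @ W_hat, W_hat.T @ Y)

rng = np.random.default_rng(2)
n = 50_000
U = rng.normal(size=n)                                   # unobserved confounder
X = rng.normal(size=n)                                   # exogenous covariate
Z = rng.normal(size=n)                                   # instrument
D = 0.5 * Z + 0.3 * X + 0.8 * U + rng.normal(size=n)
Y = 1.0 + 2.0 * D + 0.5 * X + U + rng.normal(size=n)

ones = np.ones(n)
W = np.column_stack([ones, D, X])          # second-stage regressors
W_tilde = np.column_stack([ones, Z, X])    # instrument plus exogenous regressors

theta_hat = two_sls(Y, W, W_tilde)         # order: intercept, tau, beta
print("intercept, tau, beta:", np.round(theta_hat, 3))
```

Exogenous columns (the constant and \(X\)) are reproduced exactly by the first stage, so only \(D\) is actually replaced by its fitted values.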

3.3 Standard errors

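The standard errors reported by the second-stage OLS fit are wrong: they are computed from residuals that use \(\hat D\) in place of \(D\), so the error variance is mis-estimated. Use either the closed-form 2SLS variance formula below, which assumes homoskedasticity, or a heteroskedasticity-robust version (Wooldridge 2010 §5.2):

\[ \widehat{\text{Var}}(\hat\theta^{\text{2SLS}}) = \hat\sigma^2 \left(W^\top P_{\tilde W} W\right)^{-1}, \quad \hat\sigma^2 = \frac{1}{N - k} \sum_i (Y_i - W_i^\top \hat\theta)^2, \]

where \(\hat\theta\) stacks the 2SLS coefficients on the columns of \(W\), \(k\) is the number of those columns, and the residuals use the observed \(D\), not \(\hat D\). The variance of \(\hat\tau^{\text{2SLS}}\) is the diagonal element corresponding to \(D\).

In Python, linearmodels.iv.IV2SLS reports 2SLS standard errors directly.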
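The difference between the naive and the correct standard error can be seen in a small sketch. It computes the homoskedastic 2SLS variance by hand in NumPy and contrasts it with what a literal second-stage OLS fit would report. The data-generating process is an illustrative assumption. In this particular design the naive standard error comes out too large, but the direction of the error depends on the design.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
U = rng.normal(size=n)
Z = rng.normal(size=n)
D = 0.5 * Z + 0.8 * U + rng.normal(size=n)
Y = 1.0 + 2.0 * D + U + rng.normal(size=n)

ones = np.ones(n)
W = np.column_stack([ones, D])             # second-stage regressors
W_tilde = np.column_stack([ones, Z])       # instruments
W_hat = W_tilde @ np.linalg.lstsq(W_tilde, W, rcond=None)[0]

theta = np.linalg.solve(W_hat.T @ W_hat, W_hat.T @ Y)   # 2SLS coefficients
k = W.shape[1]
bread = np.linalg.inv(W_hat.T @ W_hat)                  # (W' P W)^{-1}

# Correct: residuals are computed with the observed D.
resid = Y - W @ theta
se_2sls = np.sqrt(resid @ resid / (n - k) * np.diag(bread))

# Naive second-stage OLS: residuals are computed with the fitted D_hat.
resid_naive = Y - W_hat @ theta
se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * np.diag(bread))

print(f"tau_hat:            {theta[1]:.3f}")
print(f"2SLS SE (correct):  {se_2sls[1]:.4f}")
print(f"second-stage OLS SE: {se_naive[1]:.4f}")
```

Both versions share the same point estimate and the same \((W^\top P_{\tilde W} W)^{-1}\) term. Only the residual used for \(\hat\sigma^2\) differs.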

4 LATE and heterogeneous effects

The derivation above assumed a constant treatment effect \(\tau\). When effects are heterogeneous, what exactly does 2SLS estimate?

4.1 The Imbens-Angrist setup

Imbens and Angrist (1994) introduced the following framework for a binary treatment and binary instrument. For each unit \(i\), consider the pair of potential treatments:

\(D_i(1)\), treatment received if instrument \(Z_i = 1\)
\(D_i(0)\), treatment received if \(Z_i = 0\)

Based on these, units fall into four types:

Type	\(D(0)\)	\(D(1)\)	Interpretation
Complier	0	1	Takes treatment when encouraged
Always-taker	1	1	Takes treatment regardless
Never-taker	0	0	Never takes treatment
Defier	1	0	Takes treatment only when not encouraged

4.2 Monotonicity

Defiers are pathological: they do the opposite of what the instrument pushes them toward. The Imbens-Angrist monotonicity assumption rules them out:

Assumption 6.2 — Monotonicity

\(D_i(1) \geq D_i(0)\) for all \(i\).

Equivalently: the instrument moves treatment in a consistent direction across all units.

Monotonicity is often substantively plausible. In a draft-lottery setting, a low lottery number cannot dissuade someone from serving; at worst, it has no effect on a determined volunteer. Defiers would be individuals who serve only when exempted, which is implausible.

4.3 The LATE theorem

Theorem 6.1 — LATE theorem (Imbens-Angrist 1994)

Under Assumption 6.1 (IV), with independence understood as \(Z \perp\!\!\!\perp (Y(0), Y(1), D(0), D(1))\), and Assumption 6.2 (monotonicity),

\[ \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}{\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]} = \mathbb{E}[Y(1) - Y(0) \mid \text{complier}]. \tag{4}\]

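The IV estimand on the left-hand side of Equation 4 is the local average treatment effect (LATE), the ATE among compliers. With a binary instrument and no covariates, it is what 2SLS estimates.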

Proof sketch

Under monotonicity, no defiers exist. The observed relationship between \(Z\) and \(Y\) is entirely due to compliers (always-takers and never-takers have \(D_i(1) = D_i(0)\), so \(Z\) has no effect on their treatment and hence no effect on their outcome, by exclusion). Compute

\[ \mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0] = P(\text{complier}) \cdot \mathbb{E}[Y(1) - Y(0) \mid \text{complier}], \]

using the partition by type and the fact that only compliers’ outcomes change with \(Z\). Similarly \(\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0] = P(\text{complier})\). The ratio yields the complier ATE. \(\square\)
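The theorem can be illustrated by simulating the principal strata directly. In the sketch below the type shares (50% compliers, 20% always-takers, 30% never-takers) and the type-specific effects (1.0, 3.0, 0.5) are illustrative assumptions. There are no defiers, so monotonicity holds by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Principal strata without defiers (monotonicity holds by construction).
types = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.2, 0.3])
Z = rng.binomial(1, 0.5, size=n)                  # randomized binary instrument

D0 = (types == "always").astype(int)              # potential treatment D(0)
D1 = (types != "never").astype(int)               # potential treatment D(1)
D = np.where(Z == 1, D1, D0)                      # observed treatment

# Heterogeneous effects: 1.0 for compliers, 3.0 for always-takers, 0.5 for never-takers.
effect = np.select([types == "complier", types == "always"], [1.0, 3.0], default=0.5)
Y0 = rng.normal(size=n) + 1.0 * (types == "always")   # baseline outcome differs by type
Y1 = Y0 + effect
Y = np.where(D == 1, Y1, Y0)                      # exclusion: Z affects Y only via D

itt_y = Y[Z == 1].mean() - Y[Z == 0].mean()       # reduced form (ITT on Y)
itt_d = D[Z == 1].mean() - D[Z == 0].mean()       # first stage, estimates P(complier)
wald = itt_y / itt_d

complier_ate = (Y1 - Y0)[types == "complier"].mean()
ate = (Y1 - Y0).mean()

print(f"Wald estimate: {wald:.3f}")
print(f"Complier ATE:  {complier_ate:.3f}")
print(f"Population ATE: {ate:.3f}")
```

The Wald ratio tracks the complier effect and not the population ATE, even though always-takers have a much larger effect in this design.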

4.4 Interpretation warning

LATE is the effect on compliers, not on the whole population. Compliers are defined by their response to the specific instrument used. Different instruments identify different LATEs. This is a feature, not a bug: each LATE answers a specific, well-posed question.

The ATE is identified only if one additionally assumes constant treatment effects, or if the instrument moves everyone's treatment so that the complier population equals the full population (as in a randomized trial with perfect compliance).

5 Weak instruments

If \(Z\) only weakly predicts \(D\), so that \(\text{Cov}(Z, D)\) is small relative to sampling noise, the denominator of the Wald ratio is noisy. This inflates the 2SLS variance and introduces finite-sample bias.

5.1 The first-stage \(F\)-statistic

The conventional diagnostic is the first-stage \(F\)-statistic for the instrument(s) in the regression of \(D\) on \(Z, X\). Stock and Yogo (2005) provide critical values for various bias thresholds; the most cited rule of thumb is that \(F > 10\) indicates acceptable weak-instrument bias.
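As a sketch of the diagnostic, the first-stage \(F\)-statistic can be computed from restricted and unrestricted residual sums of squares. The helper name `first_stage_F` and the two simulated first stages (coefficients 0.5 and 0.02) are illustrative assumptions. This is the classical homoskedastic \(F\), not a robust variant.

```python
import numpy as np

def first_stage_F(D, Z, X=None):
    """Classical F-statistic for excluding the instrument(s) Z from D ~ 1 + Z + X."""
    n = len(D)
    Z = np.asarray(Z).reshape(n, -1)
    ones = np.ones((n, 1))
    restricted = ones if X is None else np.column_stack([ones, X])
    unrestricted = np.column_stack([restricted, Z])

    def rss(A):
        resid = D - A @ np.linalg.lstsq(A, D, rcond=None)[0]
        return resid @ resid

    q = Z.shape[1]                        # number of excluded instruments
    dof = n - unrestricted.shape[1]       # residual degrees of freedom
    return ((rss(restricted) - rss(unrestricted)) / q) / (rss(unrestricted) / dof)

rng = np.random.default_rng(5)
n = 1_000
Z = rng.normal(size=n)
v = rng.normal(size=n)

F_strong = first_stage_F(0.5 * Z + v, Z)     # strong first stage
F_weak = first_stage_F(0.02 * Z + v, Z)      # nearly irrelevant instrument

print(f"F (strong first stage): {F_strong:.1f}")
print(f"F (weak first stage):   {F_weak:.1f}")
```

With a single instrument this \(F\) equals the squared \(t\)-statistic on \(Z\) in the first-stage regression.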

5.2 Bias expansion

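Nagar (1959) derived an approximation to the finite-sample bias of 2SLS. A convenient version, written in terms of the first-stage \(F\)-statistic (Angrist and Pischke 2009, §4.6), is

\[ \text{Bias}(\hat\tau^{\text{2SLS}}) \approx \frac{\sigma_{\epsilon v}}{\sigma_v^2} \cdot \frac{1}{F + 1}, \]

where \(\sigma_{\epsilon v} = \text{Cov}(\epsilon, v)\), \(\sigma_v^2 = \text{Var}(v)\), and \(F\) is the population analogue of the first-stage \(F\)-statistic.

When the first stage is strong (\(F\) large), the bias vanishes. When it is weak (\(F\) near zero), the bias approaches \(\sigma_{\epsilon v} / \sigma_v^2\), which is the OLS bias when the instrument is irrelevant. In that extreme case 2SLS provides no benefit over OLS.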
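A small Monte Carlo sketch shows the weak-instrument bias at work. It compares the average 2SLS and OLS errors with a common approximation in which the 2SLS bias is roughly \(\text{Cov}(\epsilon, v)/\text{Var}(v)\) times \(1/(F+1)\), where \(F\) is the population first-stage \(F\). The design (10 instruments, \(N = 500\), \(\text{Cov}(\epsilon, v) = 0.8\), unit variances) and the helper `simulate_bias` are illustrative assumptions.

```python
import numpy as np

def simulate_bias(pop_F, n=500, K=10, tau=2.0, rho=0.8, reps=2000, seed=6):
    """Average (2SLS - tau, OLS - tau) over `reps` simulated samples."""
    rng = np.random.default_rng(seed)
    c = np.sqrt(pop_F / n)     # pi = c * 1_K gives population F = n * c**2 when Var(v) = 1
    est_2sls = np.empty(reps)
    est_ols = np.empty(reps)
    for r in range(reps):
        Z = rng.normal(size=(n, K))
        v = rng.normal(size=n)
        # Structural error with Var(eps) = 1 and Cov(eps, v) = rho.
        eps = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)
        D = Z @ np.full(K, c) + v
        Y = tau * D + eps
        D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]    # first-stage fitted values
        est_2sls[r] = (D_hat @ Y) / (D_hat @ D)
        est_ols[r] = (D @ Y) / (D @ D)
    return est_2sls.mean() - tau, est_ols.mean() - tau

for pop_F in (1.0, 50.0):
    bias_2sls, bias_ols = simulate_bias(pop_F)
    approx = 0.8 / (pop_F + 1)      # rho / Var(v) * 1 / (F + 1), with Var(v) = 1
    print(f"F = {pop_F:5.1f}: 2SLS bias {bias_2sls:+.3f}, "
          f"OLS bias {bias_ols:+.3f}, approximation {approx:+.3f}")
```

The model has no intercept because every variable is mean zero by construction. With real data the constant and covariates would be partialled out first.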

5.3 Limited information maximum likelihood (LIML)

LIML (Anderson and Rubin 1949) is a maximum-likelihood alternative to 2SLS with better finite-sample properties under weak instruments. In the just-identified case LIML and 2SLS coincide. With more instruments than endogenous regressors, LIML is approximately median-unbiased and its bias is substantially smaller than that of 2SLS, at the cost of a more dispersed sampling distribution. When \(F < 10\), consider reporting LIML alongside 2SLS.

6 Control functions

2SLS relies on a linear first-stage projection, which is restrictive when the treatment is continuous or enters nonlinearly. The control function (CF) method (Heckman and Robb 1985) is a more flexible approach.

6.1 CF estimator

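Suppose \(D = g(Z, X) + V\) where \(\mathbb{E}[V \mid Z, X] = 0\), and define the first-stage residual \(\hat V_i = D_i - \hat g(Z_i, X_i)\). If the structural error is linearly related to \(V\), that is \(\mathbb{E}[\epsilon \mid Z, X, V] = \gamma V\), then

\[ \mathbb{E}[Y \mid D, X, V] = \alpha + \tau D + X^\top \beta + \gamma V. \]

Running OLS of \(Y\) on \((D, X, \hat V)\) gives \(\hat\tau^{\text{CF}}\). The residual \(\hat V\) “controls for” the endogenous portion of \(D\). When the first stage is linear in \((Z, X)\) and estimated by OLS, the CF estimate of \(\tau\) is numerically identical to 2SLS. CF extends to nonlinear first-stage models (logit, probit, Poisson) and to heterogeneous effects, although consistency then depends on the first stage being correctly specified, and the standard errors must account for the fact that \(\hat V\) is estimated (for example, via the bootstrap).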
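The equivalence with 2SLS in the linear case can be checked numerically. The sketch below fits a linear first stage, adds its residual as a regressor, and compares the coefficient on \(D\) with the two-stage estimate. The simulated design and the helper `ols` are illustrative assumptions.

```python
import numpy as np

def ols(A, b):
    """Least-squares coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

rng = np.random.default_rng(7)
n = 5_000
U = rng.normal(size=n)                                  # unobserved confounder
X = rng.normal(size=n)                                  # exogenous covariate
Z = rng.normal(size=n)                                  # instrument
D = 0.5 * Z + 0.3 * X + 0.8 * U + rng.normal(size=n)
Y = 1.0 + 2.0 * D + 0.5 * X + U + rng.normal(size=n)

ones = np.ones(n)

# First stage: linear regression of D on (1, Z, X).
first = np.column_stack([ones, Z, X])
D_hat = first @ ols(first, D)
V_hat = D - D_hat                                       # control function

# Control function: OLS of Y on (1, D, X, V_hat).
coef_cf = ols(np.column_stack([ones, D, X, V_hat]), Y)
tau_cf, gamma_hat = coef_cf[1], coef_cf[3]

# 2SLS: OLS of Y on (1, D_hat, X).
tau_2sls = ols(np.column_stack([ones, D_hat, X]), Y)[1]

print(f"tau (CF):   {tau_cf:.6f}")
print(f"tau (2SLS): {tau_2sls:.6f}")
print(f"gamma_hat:  {gamma_hat:.3f}")
```

A coefficient on \(\hat V\) that is clearly nonzero is itself informative: it indicates that \(D\) is endogenous.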

7 Applied example

A synthetic noncompliance trial.

import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS
from statsmodels.api import OLS, add_constant

rng = np.random.default_rng(123)
N = 3000

# Unobserved confounder
U = rng.normal(0, 1, N)

# Instrument: random assignment
Z = rng.binomial(1, 0.5, N)

# Treatment received depends on assignment Z and on the unobservable U (noncompliance).
# Assigned units (Z = 1): always comply if U < 0.5, otherwise comply with probability 0.6.
# Unassigned units (Z = 0): take treatment with probability 0.1 if U < -0.5, else never.
p_treat_assigned = 0.6 + 0.4 * (U < 0.5)      # stays within [0, 1]
p_treat_unassigned = 0.1 * (U < -0.5)
# One uniform draw per unit is compared with either threshold, so D(1) >= D(0) (monotonicity).
p_treat = np.where(Z == 1, p_treat_assigned, p_treat_unassigned)
D = (rng.uniform(size=N) < p_treat).astype(int)

# Outcome depends on D and U; the treatment effect is constant at tau_true
tau_true = 2.0
Y = 1 + tau_true * D + 0.8 * U + rng.normal(0, 1, N)

df = pd.DataFrame({'Y': Y, 'D': D, 'Z': Z})

# OLS slope of Y on D (biased because U drives both D and Y)
ols_beta = np.cov(D, Y, ddof=1)[0, 1] / np.var(D, ddof=1)

# 2SLS with Z as the instrument for D
iv = IV2SLS.from_formula('Y ~ 1 + [D ~ Z]', df).fit()
tau_2sls = iv.params['D']

# First-stage F-statistic from regressing D on Z
first_stage = OLS(D, add_constant(Z)).fit()
F_stat = first_stage.fvalue

print(f"True τ:       {tau_true:.3f}")
print(f"OLS (biased): {ols_beta:.3f}")
print(f"2SLS LATE:    {tau_2sls:.3f}")
print(f"First-stage F: {F_stat:.1f}")

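2SLS recovers approximately 2.0. OLS is biased downward (its probability limit in this design is roughly 1.7) because units with low \(U\) are more likely to take the treatment while \(U\) raises \(Y\), so \(D\) is negatively correlated with the error. The first-stage F-statistic is in the thousands, confirming that the instrument is strong.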

8 Evaluation metrics

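First-stage F-statistic. By the usual rule of thumb it should exceed 10; compare it against the Stock-Yogo critical values for a formal statement about weak-instrument bias.
Reduced-form effect. \(\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]\), which is proportional to \(\text{Cov}(Z, Y)\), is the ITT effect. The IV estimate is this quantity scaled by the first stage, so a reduced form indistinguishable from zero implies an IV estimate indistinguishable from zero.
Placebo test. Regress an outcome that the treatment cannot have affected (e.g., pre-treatment \(Y\)) on \(Z\). A significant placebo effect is evidence against exclusion or independence.
LATE interpretation narrative. In your results, describe who the compliers are, for example: “The estimated effect is for individuals whose treatment decision was influenced by the instrument.”
Robustness to LIML. If the first-stage \(F\) is low (Section 5.3 suggests \(F < 10\)), report both 2SLS and LIML; large disagreement signals a weak-instrument problem.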

9 Bibliographic notes

Angrist, Imbens, and Rubin (1996), “Identification of Causal Effects Using Instrumental Variables,” JASA 91, 444–455, is the definitive statement of the modern LATE framework.

Angrist and Pischke (2009), Mostly Harmless Econometrics, Princeton University Press, provides the accessible applied treatment.

Stock and Yogo (2005), “Testing for Weak Instruments in Linear IV Regression,” in Identification and Inference for Econometric Models, Cambridge, introduces the bias-based critical values for the first-stage \(F\)-statistic.

Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data (2nd ed.), MIT Press, covers 2SLS, LIML, control functions, and weak-IV remedies in depth.

Card (1995), “Using Geographic Variation in College Proximity to Estimate the Return to Schooling,” is the canonical applied paper using distance to college as an instrument for education.

10 Exercises

10.1 Theoretical exercises

Exercise 6.1 (\(\star\star\)). Prove that under the constant-effects model and Assumption 6.1, 2SLS is consistent for \(\tau\).

Exercise 6.2 (\(\star\star\)). In the LATE theorem, what happens if monotonicity fails but all other IV assumptions hold? Derive the bias explicitly when compliers and defiers make up population shares \(p_c\) and \(p_d\), with average effects \(\tau_c\) and \(\tau_d\) respectively.

Exercise 6.3 (\(\star\star\star\)). Prove that the control-function estimator in Section 6 is numerically identical to 2SLS when the first stage is linear in \(Z, X\).

10.2 Computational exercises

Exercise 6.4 (\(\star\)). Simulate weak-instrument bias. For \(F\)-stats of 2, 5, 10, 20, 50, plot the distribution of \(\hat\tau^{\text{2SLS}}\) across 1000 simulations and compare to the distribution of OLS.

Exercise 6.5 (\(\star\star\)). Implement LIML from scratch on a simulated DGP and compare its bias to 2SLS when \(F = 5\).

Exercise 6.6 (\(\star\star\)). Use linearmodels.iv to estimate the effect of education on earnings using parental education as an instrument. What assumption does this require, and how would you argue for or against it?

10.3 Discussion exercises

Exercise 6.7. A researcher proposes using “state gun-law strictness” as an instrument for individual gun ownership in a study of gun ownership’s effect on depression. Evaluate each IV assumption.

Exercise 6.8. LATE estimates are specific to compliers. A policymaker asks, “What is the effect of minimum-wage increases on employment?” Explain why this framing is problematic and what additional information the LATE can and cannot provide.

11 References

Anderson, T. W., and Rubin, H. (1949). “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics 20(1), 46–63.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association 91(434), 444–455.

Angrist, J. D., and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Card, D. (1995). “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labour Market Behaviour, University of Toronto Press.

Heckman, J. J., and Robb, R. (1985). “Alternative Methods for Evaluating the Impact of Interventions.” In Longitudinal Analysis of Labor Market Data, Cambridge University Press.

Imbens, G. W., and Angrist, J. D. (1994). “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62(2), 467–475.

Stock, J. H., and Yogo, M. (2005). “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference for Econometric Models, Cambridge University Press.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.