How instrumental variables help in causal inference: a 2SLS worked example in Python

When the regressor we care about is correlated with the error term, ordinary least squares estimates the wrong thing. Instrumental variables and two-stage least squares offer a way out, but only under two conditions: a strong first stage we can test, and an exclusion restriction we cannot. A reproducible IV2SLS walkthrough in Python on the Mroz labor-supply data, with the first-stage F-statistic, the weak-instrument check, and how it reads when it fails.

When we regress an outcome on a treatment and the treatment is correlated with something we left in the error term, ordinary least squares (OLS) does not estimate the causal effect we wanted. It estimates a mixture of that effect and whatever the omitted variable was doing. The regression still runs, the coefficient still looks plausible, and the standard errors still print. None of that tells us the coefficient carries the causal meaning we intended. This is the endogeneity problem, and it underlies most of the situations where a clean correlation is not a clean cause.

Instrumental variables (IV) is one of the oldest answers to that problem. The idea is to find a variable that moves the treatment but has no other path to the outcome, and to use only the part of the treatment that this variable explains. When that variable exists and behaves, two-stage least squares (2SLS) recovers a causal effect from observational data where OLS cannot. The catch is that IV rests on two requirements, and one of them can never be tested. AI assistants now draft IV code in seconds, which makes it easier than ever to run a 2SLS regression without having decided whether either requirement is met. That decision is ours, and it is the whole job.

Let’s work through a concrete case, name the two requirements as decisions an analyst owns, run the estimator on real data, and read both the result and the ways it fails.

The decision IV is built for: an endogenous regressor

The case we carry through the article is the return to a year of schooling: how much an extra year of education raises wages. The natural move is to regress log wage on years of education and a few controls. The Mroz (1987) data, a Panel Study of Income Dynamics extract of 428 working married women distributed with Wooldridge’s textbook, lets us run exactly that [1].

When we run that OLS regression, the coefficient on education is 0.1075 [2]. Read literally, an extra year of schooling is associated with about an 11 percent higher wage. The problem is the word “associated.” Education is not handed out at random. People with more schooling differ in ways we do not observe, ability being the usual suspect, and ability raises wages on its own. The OLS coefficient therefore blends the true return to schooling with the wage premium that ability would have earned anyway. Education is correlated with the error term, which is the definition of an endogenous regressor.
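The blending of the true return with ability’s wage premium can be seen in a small simulation. This is a sketch on synthetic data, not the Mroz sample: the true return of 0.06, the effect sizes for ability, and every variable name are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process; all values are illustrative assumptions.
true_return = 0.06
ability = rng.normal(size=n)                      # unobserved by the analyst
educ = 12 + 2.0 * ability + rng.normal(size=n)    # ability raises schooling
lwage = 1.0 + true_return * educ + 0.10 * ability + rng.normal(scale=0.5, size=n)

# OLS of lwage on a constant and educ, with ability left in the error term.
X = np.column_stack([np.ones(n), educ])
beta_ols = np.linalg.lstsq(X, lwage, rcond=None)[0]

# Omitted-variable bias: (effect of ability) * Cov(educ, ability) / Var(educ).
expected_bias = 0.10 * 2.0 / (2.0**2 + 1.0)

print(f"true return:            {true_return:.3f}")
print(f"OLS estimate:           {beta_ols[1]:.3f}")
print(f"true return + OVB term: {true_return + expected_bias:.3f}")
```

In this setup the OLS slope lands near the true return plus the omitted-variable term, not near the true return itself, and nothing in the regression output signals the gap.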

This is the decision point. We can report 0.1075 and note in the same breath that it is probably biased upward, or we can look for a variable that shifts schooling without touching wages through any other door. That variable is an instrument, and choosing it is where the real work starts.

The two requirements, stated as decisions

An instrument has to satisfy two conditions. They are not symmetric. One we can check against the data; the other we have to argue for and can never confirm.

Relevance: the instrument actually moves the treatment. The instrument has to be correlated with the endogenous regressor, and the correlation has to be strong enough to build on. This is a decision we get to test, and we will. A weak instrument, one that barely moves the treatment, makes 2SLS unstable and biased even in large samples, so relevance is a quantity to measure, not a box to tick.

The exclusion restriction: the instrument affects the outcome only through the treatment. This is the requirement that cannot be tested. We argue for it from how the world works, and a reader is free to disagree.

For the return to schooling, the standard instruments are parents’ education: a mother’s and father’s years of schooling [1]. Relevance is easy to believe and easy to check, since better-educated parents tend to have better-educated children. The exclusion restriction is the hard part. It asks us to assume that parents’ education raises a woman’s wage only by raising her own schooling, and through no other channel. That is a strong assumption. Educated parents may pass on connections, expectations, or unobserved ability that lift wages directly. We cannot test whether that is happening. We can only state the assumption plainly and let the reader weigh it, which is the honest way to present any IV estimate.
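The two requirements can be written compactly for the simplest case of one instrument and no controls. The notation is an assumption introduced here for illustration: y is the outcome, x the endogenous treatment, z the instrument, and u the error term.

```latex
\begin{align*}
y &= \beta x + u, & \operatorname{Cov}(x, u) &\neq 0 && \text{(endogeneity)} \\
  &               & \operatorname{Cov}(z, x) &\neq 0 && \text{(relevance, testable)} \\
  &               & \operatorname{Cov}(z, u) &= 0    && \text{(exclusion, not testable)}
\end{align*}
\[
\hat{\beta}_{IV}
  = \frac{\widehat{\operatorname{Cov}}(z, y)}{\widehat{\operatorname{Cov}}(z, x)}
  \;\xrightarrow{p}\;
  \beta + \frac{\operatorname{Cov}(z, u)}{\operatorname{Cov}(z, x)}
\]
```

The limit shows where each requirement bites. Exclusion sets the numerator of the bias term to zero, and relevance keeps the denominator away from zero. When the denominator is small, even a slight violation of exclusion is magnified.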

A 2SLS walkthrough in Python

The first stage regresses the endogenous treatment on the instruments and the exogenous controls, producing a predicted treatment that varies only with the instruments and controls. The second stage regresses the outcome on that predicted treatment and the same controls. Because the predicted treatment is built from exogenous variables alone, the part of the treatment correlated with the error term is, under the exclusion restriction, gone.
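The two stages can be run by hand to see the mechanics. The sketch below uses synthetic data with a single hypothetical instrument z and one control; all names and effect sizes are assumptions, not the Mroz variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data: z is a valid instrument, ability is unobserved.
true_return = 0.06
ability = rng.normal(size=n)
z = rng.normal(size=n)                     # instrument, independent of ability
exper = rng.uniform(0, 30, size=n)         # exogenous control
educ = 12 + 1.0 * z + 2.0 * ability + rng.normal(size=n)
lwage = (0.5 + true_return * educ + 0.02 * exper
         + 0.10 * ability + rng.normal(scale=0.5, size=n))

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Stage 1: treatment on the instrument plus the exogenous controls.
Z = np.column_stack([const, exper, z])
educ_hat = Z @ ols(Z, educ)

# Stage 2: outcome on the predicted treatment plus the same controls.
beta_2sls = ols(np.column_stack([const, exper, educ_hat]), lwage)[2]

# Naive OLS on the actual treatment, for comparison.
beta_ols = ols(np.column_stack([const, exper, educ]), lwage)[2]

print(f"OLS:  {beta_ols:.3f}")
print(f"2SLS: {beta_2sls:.3f}")
```

Running the second stage by hand gives the right point estimate but the wrong standard errors, because the residuals should be formed with the actual treatment and not the predicted one. That is one reason to use a packaged estimator for anything beyond illustration.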

We use the linearmodels package and its IV2SLS class, whose constructor takes the dependent variable, the exogenous regressors, the endogenous regressors, and the instruments in that order [3].

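import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

# Load the Mroz extract from the URL in note 1 and keep rows with an observed
# wage, i.e. the working-women sample (n = 428). Loading it this way is an
# assumption; any DataFrame with the columns below will do.
df = pd.read_stata("https://www.stata.com/data/jwooldridge/eacsap/mroz.dta")
df = df.dropna(subset=["lwage"])

# Outcome: lwage (log hourly wage). Treatment: educ (years of schooling).
# Controls: exper, expersq. Instruments: fatheduc, motheduc.
exog = sm.add_constant(df[["exper", "expersq"]])

iv = IV2SLS(
    dependent=df["lwage"],
    exog=exog,
    endog=df["educ"],
    instruments=df[["fatheduc", "motheduc"]],
).fit(cov_type="robust")  # heteroskedasticity-robust standard errors

print(iv.params["educ"], iv.std_errors["educ"])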

Before trusting the second stage, we check relevance. The first stage regresses education on the instruments plus the controls, and the question is whether the instruments move education enough to rely on. The conventional summary is the first-stage F-statistic on the excluded instruments, with a rule of thumb that it should clear 10 [4].

import numpy as np

# First stage: educ on instruments and controls.
fs_X = sm.add_constant(df[["exper", "expersq", "fatheduc", "motheduc"]])
first_stage = sm.OLS(df["educ"], fs_X).fit()

# Classical (homoskedastic) joint F-test on the excluded instruments only.
f_test = first_stage.f_test("fatheduc = 0, motheduc = 0")
# fvalue is a scalar or a 1x1 array depending on the statsmodels version.
print(float(np.squeeze(f_test.fvalue)))

When we run this on the Mroz data, the joint F-statistic on the two parental-education instruments is 55.4 [2]. Both instruments enter the first stage with the expected positive sign and large t-statistics: father’s education at 0.19 (t = 5.6), mother’s at 0.16 (t = 4.4). An F of 55.4 sits well above the rule-of-thumb 10, so relevance, the testable requirement, is met.
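The partial F compares the first stage with and without the excluded instruments. A minimal sketch of that arithmetic follows, assuming homoskedastic errors (the classical F) and using simulated data. The function names, coefficients, and the sample size borrowed from the Mroz extract are illustrative assumptions.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from least squares of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(y, controls, instruments):
    """Classical F for H0: every instrument coefficient is zero.

    `controls` must already contain the constant column.
    """
    n = len(y)
    X_full = np.column_stack([controls, instruments])
    q = instruments.shape[1]        # number of excluded instruments
    k = X_full.shape[1]             # parameters in the unrestricted model
    ssr_r = ssr(controls, y)        # restricted: instruments dropped
    ssr_u = ssr(X_full, y)          # unrestricted: instruments included
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

rng = np.random.default_rng(2)
n = 428  # same size as the Mroz working sample; the data here are simulated
controls = np.column_stack([np.ones(n), rng.uniform(0, 30, size=n)])
z = rng.normal(size=(n, 2))
noise = rng.normal(size=n)

educ_strong = 12 + z @ np.array([0.6, 0.5]) + noise     # instruments matter
educ_weak = 12 + z @ np.array([0.03, 0.02]) + noise     # instruments barely matter

f_strong = partial_f(educ_strong, controls, z)
f_weak = partial_f(educ_weak, controls, z)
print(f"strong instruments: F = {f_strong:.1f}")
print(f"weak instruments:   F = {f_weak:.1f}")
```

Only the instrument coefficients are restricted to zero; the controls stay in both regressions, which is what makes this a partial F and not an overall regression F.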

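The linearmodels results object reports its own first-stage diagnostics, and here they need a caveat. Its built-in first-stage statistic is 100.2, larger than our 55.4, because the two are computed differently [2]. Ours is a classical F-test on the two excluded instruments. With cov_type="robust", the linearmodels documentation describes its f.stat as a Wald statistic on the instrument coefficients, built from the robust covariance and referred to a chi-squared distribution, so it is not divided by the number of instruments; the f.dist column of the diagnostics table shows which distribution applies. Dividing 100.2 by the two instruments gives about 50, the same order as 55.4. The rule-of-thumb 10 is stated on the F scale, so that is the scale to compare on. Both readings clear the threshold here, but the distinction matters in cases nearer the line.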

# linearmodels reports first-stage diagnostics directly.
print(iv.first_stage.diagnostics)

How to read it, and how it fails

With relevance established, we can read the second stage. The 2SLS coefficient on education is 0.0614, with a robust standard error of 0.0332 and a 95 percent confidence interval running from -0.004 to 0.126 [2].
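The interval follows from the coefficient and its standard error under a normal approximation. A short check of the arithmetic, using only the two numbers reported above and the standard library:

```python
from statistics import NormalDist

coef, se = 0.0614, 0.0332                # 2SLS estimate and robust SE from the text
z_crit = NormalDist().inv_cdf(0.975)     # two-sided 95% critical value, about 1.96

ci_low = coef - z_crit * se
ci_high = coef + z_crit * se
z_stat = coef / se

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"z = {z_stat:.2f}")
```

This reproduces the reported interval of -0.004 to 0.126. The ratio of the coefficient to its standard error is about 1.85, just short of the 1.96 needed to exclude zero at the 95 percent level.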

Two things stand out. First, the IV estimate of 0.0614 is a little more than half the OLS estimate of 0.1075. If the exclusion restriction is true, this is what we expected: stripping out the part of schooling correlated with unobserved ability lowers the estimated return, because the OLS estimate had included ability’s wage premium as part of the return to schooling. Second, the confidence interval now includes zero. IV trades precision for identification: using only the slice of schooling that parents’ education explains means a smaller, noisier signal than total schooling. That trade is typical, and it is honest to report the wider interval rather than the tighter OLS one that answered the wrong question.

How does this fail? Two ways, matching the two requirements.

Relevance fails when the instrument is weak. Had the first-stage F come in at 3 instead of 55, the 2SLS estimate would be unreliable no matter how large the sample: biased back toward the OLS estimate, with conventional confidence intervals that are too narrow. Nothing in the second-stage output flags the problem, because a coefficient built on a weak instrument prints exactly like one built on a strong instrument. The first-stage F is what catches it, which is why we compute it before reading the result, not after.
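A small Monte Carlo makes the weak-instrument failure visible. This is a sketch on simulated data with one hypothetical instrument; the true return, the instrument strengths, and the size of the ability effect are assumptions chosen to make the contrast clear.

```python
import numpy as np

TRUE_RETURN = 0.06

def iv_and_ols(strength, n, rng):
    """Simulate one sample and return (IV slope, OLS slope)."""
    ability = rng.normal(size=n)                      # unobserved confounder
    z = rng.normal(size=n)                            # instrument
    x = strength * z + ability + rng.normal(size=n)   # endogenous treatment
    y = TRUE_RETURN * x + 0.30 * ability + rng.normal(scale=0.5, size=n)
    # Demean so the covariance-ratio formulas apply without a constant.
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    return (zc @ yc) / (zc @ xc), (xc @ yc) / (xc @ xc)

rng = np.random.default_rng(3)
n, reps = 500, 4000
summary = {}

for label, strength in [("strong", 1.0), ("weak", 0.01)]:
    draws = np.array([iv_and_ols(strength, n, rng) for _ in range(reps)])
    q25, q50, q75 = np.percentile(draws[:, 0], [25, 50, 75])
    summary[label] = {
        "iv_median": q50,
        "iv_iqr": q75 - q25,
        "ols_median": np.median(draws[:, 1]),
    }
    print(f"{label:6s} IV median {q50:.3f}  IV IQR {q75 - q25:.3f}  "
          f"OLS median {summary[label]['ols_median']:.3f}")
```

In this setup the strong-instrument IV estimates centre near the true return, while the weak-instrument estimates drift toward the biased OLS value and spread far more widely across samples. Each individual weak-instrument regression would still print a coefficient and a standard error.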

A failed exclusion restriction produces no error, and nothing in the regression can catch it. If parents’ education raises a daughter’s wage through inherited ability or social connections, and not only through her own schooling, then the instrument has a second path to the outcome and the 2SLS estimate is biased. The only defense is the argument we made for the exclusion restriction in the first place, and the willingness to state it as an assumption a reader can reject. An IV result is only as credible as that argument.
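The silence of a failed exclusion restriction can also be simulated. In the sketch below, the same hypothetical instrument is used twice: once with no direct effect on the outcome, and once with a small direct effect of 0.03. All names and magnitudes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
true_return = 0.06

ability = rng.normal(size=n)
z = rng.normal(size=n)      # stand-in for a standardized parental-schooling measure
educ = 12 + 1.0 * z + 1.0 * ability + rng.normal(size=n)

def iv_slope(z, x, y):
    """Single-instrument IV estimate: Cov(z, y) / Cov(z, x)."""
    zc = z - z.mean()
    return (zc @ (y - y.mean())) / (zc @ (x - x.mean()))

results = {}
for direct_effect in (0.0, 0.03):
    # direct_effect is the path from z to the outcome that bypasses educ.
    lwage = (true_return * educ + direct_effect * z
             + 0.10 * ability + rng.normal(scale=0.5, size=n))
    results[direct_effect] = iv_slope(z, educ, lwage)
    print(f"direct effect {direct_effect:.2f}: IV estimate {results[direct_effect]:.3f}")
```

The first stage is identical in both runs, so no relevance check can tell them apart. The bias equals the direct effect divided by the first-stage coefficient, which here moves the estimate from roughly 0.06 to roughly 0.09.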

When to reach for IV versus difference-in-differences

IV and difference-in-differences (DiD) solve different versions of the same problem, and the choice between them is a choice about what we can credibly assume.

IV is the tool when the threat is a confounder we cannot observe and cannot difference away, and when we can name a variable that moves the treatment through a single defensible channel. If we cannot tell a convincing story for why the instrument touches the outcome only through the treatment, IV does not help.

DiD is the tool when treatment switches on at a known time for some units and not others, and when the untreated units plausibly trace the path the treated units would have followed absent treatment. Its untestable assumption is parallel trends rather than exclusion, and our DiD worked example in Python walks through estimating it and checking that assumption. For the broader question of which designs fit which policy settings, our guide to when difference-in-differences is the right tool lays out the cases.
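For contrast with the two stages of IV, the basic DiD estimate is a double difference of group means. A minimal sketch on simulated data in which parallel trends holds by construction; the group labels, effect size, and trend are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
true_effect = 2.0

treated = rng.integers(0, 2, size=n)   # 1 = unit belongs to the treated group
post = rng.integers(0, 2, size=n)      # 1 = observed after the policy switch

# Both groups share the same +1.0 time trend, so parallel trends holds here.
y = (5.0 + 3.0 * treated + 1.0 * post
     + true_effect * treated * post + rng.normal(size=n))

def cell_mean(group, period):
    """Mean outcome for one group in one period."""
    return y[(treated == group) & (post == period)].mean()

# Change in the treated group minus change in the comparison group.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"DiD estimate: {did:.2f}")
```

The level gap between groups (3.0) and the common trend (1.0) both difference out, which is what IV cannot do and what DiD needs the parallel-trends assumption for.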

The practical rule is to match the design to the assumption we can defend. If we have a clean policy switch and credible comparison units, DiD asks less of us. If we have a stubborn unobserved confounder and a genuinely excludable instrument, IV is the way through. Reaching for IV without an instrument we can defend, or for DiD without a credible comparison group, produces a number that runs cleanly and means nothing.

Closing

Instrumental variables recovers a causal estimate from an untrustworthy regressor, but only by trading one assumption for two: a strong first stage we can measure, and an exclusion restriction we have to argue for and can never confirm. The first stage is a number, and we should report it; on the Mroz data it was an F of 55.4, well clear of the weak-instrument range, so the 2SLS return of 0.0614 is at least not an artifact of a weak first stage. Whether it carries more weight than the OLS 0.1075 depends on the second requirement. The exclusion restriction is a sentence, and we should write it plainly, because the credibility of the whole estimate rests on whether a reader believes it.

The routine is portable. Name the endogenous regressor and why it is endogenous, find an instrument and check its first stage against the data, state the exclusion restriction as the assumption it is, and read the second stage knowing what precision it traded for identification. An assistant can draft the IV2SLS call; deciding whether the instrument is defensible is the part that stays with us.

Notes

  1. Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica, 55(4), 765-799. https://doi.org/10.2307/1911029. The extract used here is the mroz.dta file distributed with Wooldridge's Econometric Analysis of Cross Section and Panel Data, available at https://www.stata.com/data/jwooldridge/eacsap/mroz.dta.
  2. All numbers are computed from the saved reproduction script iv_2sls_mroz.py and its output iv_2sls_mroz_output.txt, run on the 428 working women in the Mroz extract. OLS return to schooling 0.1075; classical first-stage partial F on the excluded instruments (fatheduc, motheduc) 55.4; first-stage statistic on the same instruments reported by linearmodels under the robust covariance (a Wald statistic, not divided by the number of instruments) 100.2; 2SLS return 0.0614 (robust SE 0.0332, 95% CI [-0.004, 0.126]).
  3. Sheppard, K. linearmodels: IV2SLS. Constructor signature IV2SLS(dependent, exog, endog, instruments) with .fit(cov_type="robust"). https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html.
  4. Staiger, D., & Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3), 557-586. https://doi.org/10.2307/2171753. The first-stage F greater than 10 rule of thumb is commonly attributed to this work and to Stock, Wright, and Yogo (2002).

Cite this article

Cholette, V. (2026, June 21). How instrumental variables help in causal inference: a 2SLS worked example in Python. Too Early To Say. https://tooearlytosay.com/research/methodology/instrumental-variables-python/