A reference library for empirical methods

A `/papers-md-generator` skill that turns each paper we read into a structured entry in a queryable methods reference: estimand, estimator, identification strategy, named assumptions, and bibliography cross-checks.

The chore we keep doing by hand

When we sit down to decide whether to cite a causal-inference paper, four questions need answers fast: what is the estimand, what is the estimator, what is the identification strategy, and which assumptions does the paper name and use. Most methods sections do not deliver those four things cleanly. They braid the estimand into the estimator, mention the identification strategy in a footnote, and reference named assumptions across two sections.

A few specific confusions recur. Estimand-vs-estimator slippage is the most common. A paper says “we estimate the ATT” and then runs a two-way fixed-effects regression whose probability limit, under heterogeneous effects, is not the ATT. A paper writes “we use Callaway-Sant’Anna” in the methods paragraph, and the code in the replication package estimates TWFE with a single post-period dummy. An identification strategy is claimed but not warranted by the body. Parallel trends is invoked without a pre-trend plot; an instrument is named without a first-stage table.

The working-paper-to-journal drift is a separate problem for citation work. A method is announced in a 2022 NBER working paper, refined for a 2023 Review of Economic Studies version, and the specific result a downstream applied paper relies on lives only in the journal version. Citing the working paper credits the wrong artifact. Citing the journal version requires having actually checked which version contains the result. Both versions are circulating; both are findable; the methods section rarely tells us which one to cite.

The cost of this is not theoretical. The data suggests most working economists spend 20 to 30 minutes per paper extracting these four fields cleanly when reading for a literature review. Across a 50-paper pillar, that is a full work week before any writing happens.

↑ Back to top

Why AI?

Pattern matching on a methods-section corpus catches some of this. A search for “we estimate” finds maybe 30% of identification-strategy declarations. The rest are in passive voice, in subordinate clauses, in theorem statements, or in the appendix. Hand-coding catches them, but hand-coding 50 papers per pillar does not scale across the dozen or so pillars TETS plans to build.

LLM extraction handles the long tail of prose variation. Given a precise written definition (an operational definition) of what counts as an estimand, an estimator, an identification strategy, and a named assumption, the model can find each across a wide range of phrasings, including in theorem statements where simple pattern matching fails.

The risk with LLMs is the familiar one. A “summarize this PDF” prompt produces fluent text that may or may not be in the source. For citation work, fluent-but-wrong is the worst possible failure mode. The skill addresses this with two mechanisms. First, every extracted claim must be backed by a verbatim quote from the methods or theorem block, matched back to the parsed body of the paper with character-level tolerance for typesetting and scanning artifacts. If the quote does not match, the claim drops. Second, misattribution flags pass through a two-run agreement check at different sampling settings before being queued for human review.

The result is asymmetric. AI is doing the long-tail extraction work where pattern matching fails. A rule-based catalog lookup is doing the misattribution flagging where LLM judgment would be too unreliable. The LLM is not allowed to assert anything it cannot ground in a quote.

Implementation detail. Pattern matching here means regex over a methods-section corpus. The four operational definitions are passed to the model as `ESTIMAND`, `ESTIMATOR`, `IDENTIFICATION_STRATEGY`, and `ASSUMPTIONS_NAMED` fields with field-level instructions and exclusion criteria.

↑ Back to top

Getting our bearings

Three pieces of background help in reading the rest.

The first is the estimand / estimator / identification-strategy distinction. These three terms travel together in causal-inference writing and get conflated in practice. An estimand is the population parameter we want, for example the average treatment effect on the treated (ATT). An estimator is the procedure that produces a number from data, for example the Callaway-Sant’Anna doubly-robust group-time estimator. An identification strategy is the argument that the estimator’s probability limit equals the estimand under stated assumptions, for example parallel trends plus no-anticipation under staggered adoption. The slippage happens because authors often name the estimator and let the reader infer the estimand and the identification strategy. Methods-aware extraction requires forcing each into its own field.

The second is the PDF parser. We use an open-source service called GROBID that turns a PDF into a structured document with sections, references, and inline formulas tagged. The skill delegates all PDF parsing to that service and treats its output as the canonical text. We are not rebuilding PDF parsing; we are layering an audit on top of it.

The third is canonical-origin drift. A method’s canonical citation is the paper a careful user should cite when invoking it. Honest DiD is a useful worked example because it has multiple parent papers, each carrying different results, and citations across the literature confuse them. The pretest-bias diagnostic and the recommendation against pre-trends tests as a sole diagnostic are in Roth (2022, AER:Insights). The constructive sensitivity framework with restriction sets on differential trends (Δ^RM, Δ^SD, breakdown M) is in Rambachan and Roth (2023, Review of Economic Studies). The fixed-length confidence interval (FLCI) machinery that RR 2023 adapts for Δ^SD goes back further still, to Armstrong and Kolesár (2018, Econometrica). Citing any one of these for results that live in another is the misattribution pattern the catalog captures.

Implementation detail. GROBID's output is TEI XML (Text Encoding Initiative format), with section tags carrying a methodology type attribute and structured bibliography entries. The pipeline reads the TEI tree directly rather than re-parsing the PDF.

↑ Back to top

What the tool does

The input is a DOI or a local PDF. The output is a papers.md block that documents:

  • Estimand: the population parameter the paper is trying to identify
  • Estimator: the named computational procedure used; a paper can have several, each tagged as a main estimator, a secondary estimator, or a robustness check
  • Identification strategy: the doctrinal warrant connecting the data to the estimand
  • Named assumptions: parallel trends, no-anticipation, monotonicity, etc., each quoted verbatim from the methods or theorem blocks
  • Quote evidence: verbatim text from the source, matched back to the parsed paper body even when typesetting and scanning have nudged a few characters

PDF parsing and bibliographic metadata are delegated to GROBID, CrossRef, and Semantic Scholar (the standard reference-data services for academic work). The build is the audit layer on top.

The misattribution-trap engine reads a curated list of common credit errors, stored as a configuration file. v0.1 ships entries including FLCI methodology (credited to RR 2023 when it belongs to Armstrong-Kolesár 2018), the pretest-bias diagnostic (credited to RR 2023 when it belongs to Roth 2022), the Δ-restriction sensitivity framework (credited to Roth 2022 when it belongs to RR 2023), the doubly-robust staggered DiD (credited to Sun-Abraham 2021 when it belongs to Callaway-Sant’Anna 2021), the not-yet-treated comparison group (Sun-Abraham vs Callaway-Sant’Anna Assumption 5), and causal forests. If a paper’s extracted estimator citation matches a known wrong-credit entry and the paper’s bibliography contains the wrong-credit DOI but not the canonical origin, the engine flags it as a misattribution candidate.

New to Claude Code skills? The setup guide covers installation and first invocation.

Schema diagram showing the papers.md block structure for one DOI, with the four required fields (estimand, estimator, identification strategy, named assumptions), the verbatim-quote evidence array, and the multi-estimator role flag (main / robustness / secondary). Diagram includes the misattribution-candidate field for the wrong-credit case.

The papers.md schema. One DOI in; up to N estimators out, each with role flag, canonical citation, named assumptions, and verbatim-quote evidence. The misattribution-candidate field fires when bibliography mismatch is detected.

Flags are not asserted on a single run. The two-run agreement check requires the extractor to produce the same flag under two different sampling settings. Even then, the flag is queued to a pending-review file (a holding queue, not part of the canonical output) for human confirmation against the source PDF. Reputational stakes for cited authors make false positives more costly than false negatives.

Quote-source constraints are strict: only methods sections and theorem / proposition / assumption blocks are eligible. Intro, related-work, and conclusion sections are excluded. If a paper extracts zero main estimators, the tool emits a papers-md-draft-FAILED.md with the failure reason rather than guess. The schema is versioned at v1.0 and the block carries identifying markers (DOI plus generation timestamp) so reruns append rather than silently overwrite human edits.

Implementation detail. Estimators carry a `role` field that takes values `main`, `secondary`, or `robustness`. The misattribution-trap engine reads a YAML catalog keyed on `estimator_canonical_citation` with a `wrong_credit` field; it flags `MISATTRIBUTION_CANDIDATE` when the catalog match holds AND `wrong_credit` DOI is in the paper's bibliography AND the canonical-origin DOI is not. Eligible sections are restricted via GROBID section tags. The self-consistency gate compares output at temperature 0.0 / seed 7 and temperature 0.2 / seed 13.

↑ Back to top

(Un)Tested

The smoke test is where v0.1 currently sits. All 23 unit tests pass, and the end-to-end run on a hand-crafted TEI fixture for Callaway and Sant’Anna (2021) produced 3 main estimators with correct canonical citations, 5 exact-match verbatim quotes, and a clean misattribution catalog. That tells us the extraction logic, the schema validator, and the quote-matching pipeline hang together on a controlled input.

The two-run agreement check is the layer we lean on for the misattribution call specifically. A flag is only emitted when the extractor produces the same call on two independent runs with different sampling settings. Even then the flag is queued to a pending-review file for human confirmation against the source PDF, not asserted. We treat misattribution as a hypothesis to verify, not a finding to publish.

Quote verification is the floor under everything else. Each extracted claim must be backed by a verbatim quote from the methods or theorem block, matched back to the parsed paper body even when invisible character differences (like café rendered two different ways), typographic ligatures (the fi in efficient rendered as a single glyph), or one or two scan-misread characters per line stand between the quote and the body. We handle all three before declaring a quote unmatched. If the quote does not match after that, the claim drops. No matched quote, no extracted claim.

What we have not yet done matters too. CAPABILITY-05 specifies a 50-paper hand-coded validation stratified by method family, with precision and recall thresholds we have to clear before treating the output as more than starting evidence. The full 50-paper run has not happened. A 10-paper subsample on the sister /attribution-audit-network skill returned precision = 0.20 [Wilson 95% CI 0.06–0.51] under bibliography-only flagging, telling us that section-evidence checking (what this skill produces) is a prerequisite for clearing the 0.85 spec target, rather than an optional refinement. And while the GROBID container is installed locally, the end-to-end run against a real PDF has yet to complete in a single uninterrupted session; the TEI-fixture smoke result remains valid for the extraction logic and the quote-matching pipeline.

Implementation detail. The hand-crafted fixture is a TEI XML file simulating GROBID output. Normalization is NFKC (a Unicode standard that treats `café` and `cafe`-with-combining-accent as equal). Ligature decomposition covers seven explicit ligatures (`fi fl ff ffi ffl ſt st`). Fallback match is Levenshtein distance ≤2 per 100 characters against the GROBID-parsed body. Two-run agreement is checked at temperature 0.0 / seed 7 and temperature 0.2 / seed 13. The next install step is `docker pull lfoppiano/grobid:0.8.0`.

↑ Back to top

Walking through a run

Suppose we want to add Callaway and Sant’Anna (2021) to a CS DiD pillar’s papers.md. We invoke:

/papers-md-generator doi:10.1016/j.jeconom.2020.12.001 pillar:cs-did

A clean extraction produces a block like this:

## Callaway & Sant'Anna (2021). Difference-in-differences with multiple time periods.
**doi:** 10.1016/j.jeconom.2020.12.001
**estimators:**
  - estimand: ATT(g,t), group-time average treatment effect on the treated
    design: staggered adoption DiD
    estimator_name: doubly-robust group-time estimator
    estimator_canonical_citation: Callaway & Sant'Anna 2021
    assumptions_named: [parallel trends conditional on covariates, no anticipation, irreversibility of treatment]
    role: main
    section_evidence: sec_methodology
    confidence: 0.92
**verbatim_quotes:**
  - text: "We propose a doubly robust estimator for ATT(g,t)…"
    section_id: sec_methodology
    verification_method: exact
  - text: "Assumption 4 (No Anticipation): …"
    section_id: assumption_block
    verification_method: exact

We read this and check three things. Does the estimand match the estimator (yes, ATT(g,t) with the DR group-time estimator). Are the named assumptions actually named in the paper (yes, Assumption 4 is quoted verbatim). Is the canonical citation the right artifact (yes, the Journal of Econometrics version is where the DR formulation lives).

Now suppose we instead pass a downstream applied paper that uses honest DiD with the FLCI bound and cites only Rambachan and Roth (2023) for the FLCI methodology. The extraction returns the estimator object, and the misattribution-trap engine fires because the catalog flags this exact pattern (FLCI methodology origin omitted):

**misattribution_flags:**
  - status: MISATTRIBUTION_CANDIDATE (pending human review)
    paper_cites: Rambachan & Roth 2023 (Review of Economic Studies)
    catalog_suggests_also: Armstrong & Kolesár 2018 (Econometrica) — FLCI methodology origin
    method: FLCI (fixed-length confidence interval), as adapted in RR 2023 for Δ^SD
    self_consistency: agreed at T=0.0/seed=7 and T=0.2/seed=13
    queued_to: misattribution-flags-pending-review.md

We then open the source PDF, check the bibliography, and either confirm the flag (and add a note when citing the downstream paper) or reject it. Nothing moves into the canonical papers.md until we confirm.

A FAILED status looks like this:

papers-md-draft-FAILED.md
reason: no_main_estimator_extracted
diagnostics:
  grobid_no_methods_section: true
  candidates_considered: 2
  candidates_dropped: both at fuzzy<=2/100 verification step

The right response is not to guess. It is to read the methods section ourselves, decide whether the paper actually has a main estimator we missed, and either re-run with a hint list (["sdid"], ["IPW + PSM"]) or move on.

The point of the worked example is that the output is meant to be read, audited, and accepted or rejected. The tool produces a draft. We do the confirmation.

↑ Back to top

Extending the skill

The skill is a blueprint, not a finished database. Three extension points are open by design, in increasing order of effort.

Add a misattribution-catalog entry. This is the lowest-friction extension. The catalog at shared/misattribution-catalog.yaml is a flat list of entries, each documenting one wrong-credit pattern in our literature. To add the next entry, copy the schema from an existing one and fill in three pieces: the wrong-credit citation (authors, year, DOI, and the surface forms we see in the wild), the correct-credit citation (authors, year, DOI, and the section of the correct paper where the result actually lives), and a source-of-claim citation that backs the correction. Confidence level is consensus, strong, or provisional. The inference rule reads the new entry on the next skill invocation; no code change. Pull requests against the public repo are the welcome path. Method families we have not seeded yet (shift-share IV, bunching estimators, regression-kink designs, marginal-effect heterogeneity) are all reasonable targets.

Add a schema field. The papers.md block schema is versioned at v1.0 and lives in papers-md-generator/src/writer.py. Suppose we want to track the cluster level for each estimator’s standard errors, because it is doing a lot of work in modern applied papers and rarely gets named in the methods paragraph. We add a cluster_level field to the estimator dict, write a short operational definition (what counts as the cluster level, what to do when the paper does not state one), add the field to the extractor prompt in shared/extractor-prompt-v1.md, and bump the schema version to v1.1. The two-run agreement check applies to the new field automatically. The verbatim-quote constraint applies too: no extracted cluster level without a quote from the methods or footnote backing it. A 5-paper smoke set is enough to see whether the field extracts cleanly before we commit to schema v1.1.

Add a verification source. Bibliographic metadata currently flows through CrossRef and Semantic Scholar. A reader who wants to cross-validate against OpenAlex, Web of Science, or a private institutional database can write a thin adapter that returns the same record shape ({doi, title, authors, year, journal}) and register it as an additional source in papers-md-generator/src/metadata.py. When two sources disagree on a field, the field is flagged as ambiguous and pushed to the pending-review file, which is the same gate the misattribution flags pass through. This is the heaviest of the three extensions, but it is the one that turns the audit layer from “is this paper cited correctly” into “do the metadata sources we rely on agree about this paper.”

The schema, the operational definitions, the quote-verification pipeline, and the misattribution-trap engine are the contribution. The seed catalog is a starter. Coverage of our literature is the reader’s adoption work.

↑ Back to top

Next steps

The PDF parser is installed locally as GROBID 0.9.0-crf, the ARM64-compatible CRF-only build (the deep-learning variants ship amd64-only). The end-to-end run on a real PDF now completes: a Callaway and Sant’Anna (2021) arxiv PDF flows through GROBID’s full-text endpoint in 24 seconds, yields 33 body sections, the methods-section heuristic correctly fires on the econometric section heads (“Identification,” “Estimation,” “Model”), the bibliography extracts 12 entries, and the LLM extractor generates estimator candidates. The remaining gap is normalization: GROBID’s PDF-to-text extraction introduces typesetting artifacts (ATT(g,t) rendered as AT T (g, t) with stray spaces around tokens), and the Levenshtein-2-per-100-characters tolerance does not fold the whitespace structural drift. The quote-verification floor catches this and refuses to publish unverified claims, which is the designed behavior, so the v0.1.0 end-to-end output on a real PDF is a papers-md-draft-FAILED.md with quote_pipeline_all_failed: all candidate quotes failed normalization match. A v0.2 normalization upgrade folds whitespace around punctuation and parenthesized tokens before the Levenshtein gate; that is the work that converts the FAILED draft into a clean papers.md block. The methods-section heuristic is also tuned for econometrics heads (“Identification,” “Estimation,” “Model,” “Framework”) and currently misses ML-style sections (“Feature Engineering,” “Classifiers,” “Cross-Validation”); extending the heuristic vocabulary is a one-line change for readers whose pillar is ML rather than econ.

Cross-source bibliographic validation is now live. CrossRef serves as the primary lookup and OpenAlex as the secondary. A working test against Callaway and Sant’Anna (2021) confirms both sources resolve the DOI and agree on the paper, while disagreeing on author-name format (CrossRef returns surname-initial pairs; OpenAlex returns full given names). The cross-validation gate flags this kind of structured disagreement on a metadata field rather than silently picking one source. The original spec also wired in Semantic Scholar as a third source, but the issued key returns HTTP 403 across every endpoint (consistent with a deactivated key, not rate limiting), and the unauthenticated tier IP-throttles too aggressively for pipeline use: requests from a residential or shared IP return HTTP 429 with no Retry-After header even after 60-second cooling-off, because the free quota is shared across all anonymous traffic on the same IP block. OpenAlex covers the same citation-graph data publicly without throttling, so the v0.1 design treats S2 as a v0.2 enhancement rather than a v0.1 dependency.

The misattribution catalog ships with sixteen entries seeded from the honest-DiD pillar (six entries covering FLCI, the pretest-bias diagnostic, the Δ-restriction sensitivity framework, the conditional confidence sets, the default sensitivity parameter, and a formal-results misattribution), the staggered-DiD pillar (four entries covering the not-yet-treated comparison group, the doubly-robust estimator, the TWFE contamination decomposition, and the interaction-weighted estimator), causal forests, the surrogate index, and four surname / author-order traps. A 10-paper hand-coded validation run on the sister /attribution-audit-network skill returned precision = 0.20 [Wilson 95% CI 0.06–0.51] under bibliography-only flagging, well below the 0.85 spec target. Eight of ten flagged candidates were survey papers or methodology reviews citing a wrong-credit paper without applying the method as a main or robustness estimator, which tells us that section-evidence checking (the link back to the quote-verification layer this skill builds) is a v0.2 prerequisite rather than a nice-to-have. The schema, the operational definitions, the quote-verification pipeline, and the misattribution-trap engine are the contribution. Readers extend the catalog with entries for the methods they work in.

Implementation detail. The PDF parser install commands are `docker pull lfoppiano/grobid:0.9.0-crf` followed by `docker run -t -d --name grobid -p 8070:8070 --init -e JAVA_OPTS="-Xmx4g -Xms2g" lfoppiano/grobid:0.9.0-crf`. The `-crf` suffix selects the CRF-only build, the only variant currently published with an `arm64` image manifest. The container needs roughly 4 GiB of Java heap on a real-PDF full-text run; on Apple Silicon under Colima, allocate the VM at least 8 GiB (`colima start --memory 8 --cpu 4`) so the container does not get OOM-killed by the host. Container health-check is `curl http://127.0.0.1:8070/api/isalive` returning HTTP 200, typically 20 to 30 seconds after `docker run`. Cross-source metadata resolves via CrossRef, then OpenAlex, then DataCite (for ICPSR / Zenodo / dataset DOIs), in that order; Semantic Scholar sits behind a stub keyed on `SEMANTIC_SCHOLAR_API_KEY` for v0.2 reactivation. The body-text normalizer is the v0.2 hot spot: the `papers-md-generator/src/normalizer.py` Levenshtein gate currently runs at 2 edits per 100 characters, which does not fold GROBID's punctuation-spacing artifacts on tokens like `ATT(g,t)`.

↑ Back to top

Worth taking with us

The misattribution layer is the deeper win on top of the simple cite-verification one. A generic PDF summarizer cannot see that a paper crediting Rambachan-Roth 2023 alone for FLCI omits Armstrong-Kolesár 2018, or that a paper crediting RR 2023 for the pretest-bias diagnostic should be crediting Roth 2022. The skill sees both because the catalog is the contract; the rest is gate logic.

Have input? Get in touch.


Cite this article

Cholette, V. (2026, May 25). A reference library for empirical methods. Too Early To Say. https://tooearlytosay.com/research/methodology/papers-md-generator/