A common shape for econ replication packages

A `/replication-package-analytics` skill that imposes a uniform structure on openICPSR and AEJ packages so they can be surveyed across the field, plus a panel of twelve structural-compliance metrics per package.

The chore we keep doing by hand

Anyone who has tried to replicate a published econ paper, for a class or a referee report or a working-paper extension, has had some version of the same afternoon. We pull the package from openICPSR. We unzip it. We open the folder. Sometimes there is a README and a clean code/ directory and a master run_all.do. Sometimes there is a flat dump of a dozen scripts with names like 02_clean_v3_FINAL_use_this.do and no indication of which order they belong in. Sometimes the README says “Run in Stata 16” and our license is Stata 18 and we discover three deprecated commands by trial and error. Sometimes the scripts are full of hardcoded paths to /Users/jsmith/Desktop/project/ and we are not jsmith.

The AEA Data Editor publishes policy guidance defining what an AEJ replication package should contain: a master script, data subdirectories, a README, software-version statements, a data-availability statement. The Data Editor enforces that policy on submissions to AEA journals at the gate. What we have not seen is a continuously updated empirical picture of what deposited packages actually contain across the field in aggregate, with measurement noise reported alongside each metric. The standards are public. The compliance landscape, as a measured object, is not. The skill described here runs alongside the Data Editor’s enforcement work, with complementary scope.

↑ Back to top

Why AI?

We could try to build this with regular expressions (regex, fixed-pattern matchers) only, and the failure mode shows up fast. Stata users write seed calls in enough different ways that any strict pattern leaves cases on the table: set seed 12345, set seed`whatever’‘,set seed c(seed)', seeds set inside a wrapper program the regex never enters. R users have at least four common ways to wire up a master script: a top-level run_all.R, a Makefile, or one of the workflow packages (targets, drake). Python projects mix runnable-as-a-script entry points with notebook-driven workflows and shell glue. READMEs are free text written by humans for humans, and “data are available from the authors on request” is a classification problem, not a regex problem.

The pattern emerging from prototyping: regex first where regex is strict enough (extension census, hardcoded-path detection, version-pin file presence) and an LLM classifier where surface variation defeats fixed patterns (data-availability statement classification, ambiguous master-script detection, README content classification). The LLM sits behind the inspection logic for the long tail. A regex-only pipeline would either miss a meaningful share of legitimate patterns or overgenerate on false positives.

Implementation detail. Stata's idiomatic variation in `set seed` calls includes constructs like `set seed `\`whatever''` and `set seed `c(seed)'`. R master-script detection covers top-level `run_all.R`, a Makefile, a `targets` pipeline, and a `drake` plan. Python master-script detection looks for `if __name__ == "__main__"` blocks. The LLM classifier is zero-shot Claude Sonnet with strict JSON output validated against the per-record schema.

↑ Back to top

Getting our bearings

A short orientation. openICPSR is the Inter-university Consortium for Political and Social Research’s open archive, the deposit destination for replication packages at several AEA journals and a growing list of others. Deposits are subject-coded; the econ subset is what we crawl. An AEJ-compliant package, in skeleton, has a master script that runs the analysis end-to-end, a code/ directory with the analysis scripts, a data/ directory (or pointers to restricted data and a data-availability statement explaining how to obtain it), a README that documents software versions and runtime, and a version-pinning file where the language supports one.

The skill reads files statically. For source files in languages with a grammar (Python, R), we use a code-aware parser that understands the syntax tree; for shell scripts and Makefiles we fall back to regex. Static reading is cheap, parallelizable, and reproducible, with the trade-off that a bug only visible at runtime stays invisible. The output describes what each package contains.

Implementation detail. Static analysis means we never execute the package. The static analyzer wraps tree-sitter for AST parsing where the language has a grammar (Python, R, and Stata via R-Stata interop) and falls back to regex for shell scripts, Makefiles, and dotfiles. The pipeline runs each package in an isolated working directory, with the top-level directory of any nested archive treated as the package root.

↑ Back to top

What the tool does

The crawl frame: all openICPSR deposits in econ subject codes from 2020 through 2026, plus AEJ:Applied / AEJ:Macro / AEJ:Policy / AER replication packages over the same window. The analytic sample is a stratified random draw of N=500 packages from the crawl, stratified by year, journal, and software stack. The validation sample is N=30 hand-coded packages (10 Stata, 10 R, 10 Python) where the manual codes set the error bounds on the regex and LLM detectors.

The skill accepts an openICPSR ID, a direct URL, or a local zip file; it unzips nested archives, treats the top-level directory as the package root, and flags unconventional layouts.

The per-package metrics come from inspecting files, not running them:

  • Software stack detected (Stata / R / Python / Matlab / shell / mixed)
  • Folder-structure conformance against the AEA’s canonical layout
  • README presence and word count
  • Software-version mentions
  • Runtime estimates from any documentation present
  • Data-availability statement classification (whether data is public, restricted, or available on request)
  • Master-script detection (per language, with a regex check first and an LLM check for the cases regex misses)
  • Seed-setting prevalence per language (set seed, set.seed, np.random.seed, etc.)
  • Hard-coded absolute path prevalence
  • Version-pin file presence (a lock file or pin file that records exact package versions for the language)
  • Lines of code per language
  • Reproducibility-statement language in the README

New to Claude Code skills? The setup guide covers installation and first invocation.

Layered diagram of the twelve structural-compliance metrics extracted per replication package, with Path A (Wayback, no auth, covers software stack / folder structure / README / extension census) and Path B (authenticated openICPSR session, full coverage including seed-setting / hardcoded paths / version pins / lines-of-code) annotations.

The twelve structural-compliance metrics, with two extraction paths. Path A runs against Wayback-cached metadata with no authentication and covers about 60 percent of the metrics. Path B runs against an authenticated openICPSR session and covers all twelve.

The framing is descriptive, not normative. The skill emits metric values, not scores. A “reproducibility score” carries an implicit weighting (is having a seed twice as important as having a master script?) that no one in the field has worked out and that we are not in a position to settle. A panel of per-package metric values lets readers compute whatever composite they want, or none. Where the metric is noisy, it carries a noise band. The visual landscape post the skill feeds will present marginal and joint distributions, not summary indices. This is the same case that has pushed meta-research in clinical and behavioral fields away from single-number quality scores.

Implementation detail. Data-availability statement classification uses a zero-shot Claude Sonnet classifier with strict JSON output. Version-pin files include `renv.lock` (R), `requirements.txt` (Python), and `Project.toml` / `Manifest.toml` (Julia). Lines-of-code counts come from `cloc`. The master-script detector implements a regex-first / LLM-fallback architecture where regex handles the high-confidence filename and shebang matches and the LLM classifier handles ambiguous cases (Makefile targets, `targets` pipelines, multi-file shell glue).

↑ Back to top

(Un)Tested

We test the skill in layers, and we are honest about which layers have run so far and which are specified for v0.2.

The smoke run produced 10 output records, all in the expected structure (fields present, types right), with no crashes along the way. The inspection pipeline holds together end to end on a clean input. That tells us nothing in the per-package code path is silently dropping cases. What it does not tell us is whether the detectors are calibrated. For that we need ground truth.

Implementation detail. 10/10 schema-conformant JSON records (validated against `schemas/replication-package-v1.json`), 0 uncaught exceptions, static-analyzer logic verified end-to-end on a hand-crafted fixture in `tests/`.

A note on what the smoke run actually measures, because the framing matters. The initial smoke seed list was plausible-but-unverified openICPSR IDs and produced a 2/10 Wayback hit rate that reflected the seed list, not the harvester. Re-running against a hand-verified seed list of 10 real openICPSR project IDs from AEJ-published papers (2020-2024), pulled from publicly searchable AEA repository pages, the harvester finds Wayback snapshots for 9/10 and a parsable landing-page description for 9/10. The one miss is a project whose CDX response timed out twice in the harvest window; a longer retry would likely close that gap. The detectors run cleanly on what the snapshots return. What Path A is measuring at this stage is what the landing-page description text contains, not what the full package README contains; the difference matters and we report it directly below.

The validation step we have started, and partially completed, is a hand-coded sample stratified by software stack. We hand-coded the same 10 packages on the Path A description content (what the unauthenticated harvester actually sees) and computed per-detector binomial CIs from those codes. On the cheap binary detectors that fire on landing-page text, the regex matches the hand codes 10/10 on software-version mentions and runtime estimates (both 0/10, expected; deposit abstracts do not mention Stata 16 or R 4.2), and 8/10 on reproducibility-language detection, with a Wilson 95 percent CI of 49 to 94 percent on the agreement rate. The full N=30 sample stratified by stack (10 Stata, 10 R, 10 Python) is what tightens the per-language detectors that Path B activates: seed-setting, hardcoded paths, master-script presence, version-pin files. Stata seed detection in particular is noisy because Stata users write seed calls in enough different ways that a strict regex misses cases. When the Path B validation runs, the bands tighten and the metric’s noise drops below the typical between-package variation, which is the threshold at which the panel becomes useful for comparing packages.

Implementation detail. The N=10 Path A calibration sub-sample (`validation/hand_coded_10_2026-05-26.csv`) compares hand codes to detector outputs on the landing-page description text. Wilson 95 percent CIs are computed per binary detector. The full N=30 validation sample sets per-language binomial CIs on the Path B detectors; detector-level CIs are reported separately for regex-only and LLM-fallback runs against the same packages so that the LLM's marginal contribution is identifiable.

The dual-path crawl is tested independently. Path A (Wayback, no login) and Path B (authenticated openICPSR session) hit different infrastructure, return different metadata densities, and need separate smoke runs. Path A covers about 60 percent of the metrics; Path B covers all twelve. Both produce records in the same shape against the same schema, so the panel concatenates cleanly across paths.

↑ Back to top

Walking through a run

Suppose we have an openICPSR deposit for a 2024 AEJ:Applied paper. We point the skill at the package and run /replication-package-analytics. What comes back, for that one package, is a JSON record matching the schema in replication-landscape-data/schema-v1.json. The interesting fields look something like:

{
  "package_id": "openicpsr-123456",
  "primary_language": "stata",
  "language_share": {"stata": 0.78, "python": 0.22},
  "folder_structure": {
    "has_code_dir": true, "has_data_dir": true,
    "has_output_dir": false, "layout_score": 2
  },
  "readme": {
    "present": true, "word_count": 612,
    "das_class": "public_with_scripts", "das_class_confidence": 0.91,
    "has_software_versions": true, "has_runtime_estimate": false
  },
  "code_metrics": {
    "loc_total": 4820, "has_master_script": true,
    "master_script_path": "run_all.do",
    "seed_set_count": 2, "hardcoded_paths_count": 11
  },
  "dependencies": {"has_version_pins": false, "version_pin_files": []},
  "measurement_notes": {
    "stata_regex_uncertainty": "+/-15% on seed detection (95% CI from N=10 Stata validation)"
  }
}

The panel-of-metrics CSV the skill feeds looks like this (truncated to a few rows and columns):

doi,year,journal,software_stack,has_readme,has_master_script,sets_seed,n_loc_total
10.1257/app.20200xxxx,2024,AEJ:Applied,stata,1,1,1,4823
10.1257/aer.20210xxxx,2025,AER,r,1,0,1,1247
10.3886/E199xxx,2023,openICPSR,python,1,1,0,8910
10.3886/E201xxx,2024,openICPSR,mixed,0,0,1,2104

Each row is one package; each column is one metric. The reader who wants their own compliance composite can compute it from these columns; the skill does not impose one.

Across N=500 packages, those records stack into a panel-of-metrics CSV: one row per package, one column per metric, plus the measurement-noise columns. From there, a reader who disagrees with how we would weight the metrics has everything we have. We can compute a compliance index from layout_score, has_master_script, seed_set_count, and has_version_pins in whatever ratio we like, and report what fraction of the field clears whatever bar we pick. We are not committing the reader to our weighting.

The noise bands matter for interpretation. Stata seed detection is noisy because Stata users write seed calls in enough different ways that a strict regex misses cases. The N=30 hand-coded validation sample sets a band on what fraction of detected/undetected calls is correct. When we report that 47% of Stata packages set a seed detectable via regex, the honest read is “between roughly 32% and 62% set one, with the regex undercounting.” Readers who want a narrower band can look at the LLM-fallback runs against the same packages.

The same template extends to the other noisy metrics. Master-script detection by filename match is a heuristic, not a guarantee: a package that wires up its analysis through a Makefile we did not pattern-match against will read as having no master script when in fact it has a perfectly serviceable one. Hard-coded path detection is somewhat better, because the path-like string itself is a tighter regex target, but cross-platform variation (Windows backslashes, mixed forward and back, tilde-expanded home directories) still leaves room for miscounts. The measurement_notes field on every JSON record is where these caveats live, per-package. That way the noise is not a footnote in the landscape post; it travels with the data.

Implementation detail. The reported band on Stata seed detection is a binomial 95% CI computed from the N=10 Stata validation sub-sample. The `measurement_notes` field is structured: per-detector keys map to a string with the point estimate, CI, and the source of the CI (validation sub-sample size and date). The panel CSV carries a parallel set of `*_uncertainty` columns so downstream users can filter or weight by reliability.

↑ Back to top

The openICPSR Cloudflare problem and Path A / Path B

Descriptive measurement only works if the packages are actually reachable, and the build pass turned up an access constraint the spec round had not flagged: openICPSR sits behind Cloudflare (a content-delivery layer that blocks unauthenticated scrapers). Every direct URL returns an access-denied response to an unauthenticated request. The Wayback Machine works for cached metadata, which covers about 60 percent of the metrics (folder structure, README detection, software-stack detection from extension census). Full coverage requires logging into openICPSR.

The v0.1 pipeline therefore runs two paths. Path A uses Wayback only, no login, about 60% of the metrics, and is the deployed default because it needs no credentials. Path B uses an authenticated openICPSR session for full metrics, and the upgrade is a path swap, not a rewrite. The dual-path design keeps v0.1 usable today and keeps v0.2 cheap.

Implementation detail. Unauthenticated requests to direct openICPSR URLs return HTTP 403. Path A queries the Wayback Machine's CDX API for cached snapshots and falls back to the most recent snapshot per package. Path B uses a session cookie from an authenticated openICPSR login, with backoff on the documented per-session rate limit. Both paths emit records in the same shape against the same schema, with a `path` field on each record so downstream filtering can keep them separate or pool them.

↑ Back to top

Next steps

A hand-verified seed list of 10 openICPSR IDs from AEJ-published papers is now committed, and the smoke run against it returns 9/10 Wayback snapshots and 9/10 landing-page descriptions, with 10/10 schema-conformant output records. The Path A calibration sub-sample (N=10 hand codes against detector outputs on landing-page description text) sets a Wilson 95 percent CI of 49 to 94 percent on the reproducibility-language detector. What remains is the full N=30 stratified validation (10 Stata, 10 R, 10 Python) against actual package contents, which is what tightens the per-language CIs on the Path B detectors (seed-setting, hardcoded paths, master-script presence, version-pin files). The forthcoming empirical-landscape post is the visual deliverable the skill feeds; the teaser is drafted on the content queue and the post lands when the analytic crawl runs.

↑ Back to top

Worth taking with us

The result is a public, continuously updated picture of what econ replication packages actually contain. Nobody publishes that today. AI-augmented detection makes the measurement tractable; regex alone would miss the long tail. The descriptive-not-normative framing keeps the skill out of the AEA Data Editor’s enforcement lane, complementary scopes, not overlapping ones. v0.1 is the skeleton; v0.2 swaps in the authenticated path and the hand-coded validation, and the landscape post follows.

Have input? Get in touch.


Cite this article

Cholette, V. (2026, May 25). A common shape for econ replication packages. Too Early To Say. https://tooearlytosay.com/research/methodology/replication-package-analytics/