Every new coding session with AI starts the same way: re-explaining the project. Census tracts are 11-digit FIPS codes stored as strings, not integers. Coordinates use WGS84, not a projected system. The validated store file has 4,847 rows after classification.
This re-explanation takes 3-5 minutes per session. Across a three-month project with daily coding sessions, we spend 4.5 to 7.5 hours just restating context.
Agent-based tools like Claude Code solve this through persistent context files. A single CLAUDE.md file in the project root loads automatically every session. We write the context once. The agent reads it every time.
The Re-Explanation Tax
Let's consider a typical debugging session. A transit time calculation script crashes with a coordinate reference system error. With traditional AI assistance, we explain:
- Census tract centroids use EPSG:4326 (WGS84)
- Store locations also use EPSG:4326
- The OpenRouteService API expects coordinates in this format
- Previous scripts already validated both datasets use this system
The AI suggests a fix. We run it. It crashes again: the fix assumed FIPS codes were integers when they're actually strings. We explain this detail. The AI suggests another fix.
Each session requires re-establishing these conventions from scratch. The AI has no memory of yesterday's session where we debugged a similar issue.
What Goes in CLAUDE.md
A research project context file typically contains four types of information:
Data Conventions
```markdown
## Data Conventions
- Census tracts: 11-digit FIPS codes as strings (e.g., "06085511100")
- Coordinates: WGS84 (EPSG:4326), decimal degrees
- Validated stores: data/processed/stores_validated.csv (4,847 rows)
- Transit times: minutes as integers
- Missing values: -999 (not NaN, for compatibility with older scripts)
```

These conventions prevent a specific class of errors. If the agent assumes FIPS codes are integers, it might drop leading zeros. If it assumes coordinates need projection, it might transform them incorrectly. If it treats -999 as a valid transit time rather than a missing-value indicator, calculations break.
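As a concrete illustration, here is a minimal sketch of loading data under these conventions. The path and the -999 sentinel come from the file above; the column names tract_fips and transit_minutes are hypothetical stand-ins.

```python
import pandas as pd

# Load the validated store file; dtype=str preserves the leading
# zeros that an integer FIPS column would silently drop.
stores = pd.read_csv(
    "data/processed/stores_validated.csv",
    dtype={"tract_fips": str},  # hypothetical column name
)

# Convert the -999 sentinel to a true missing value before any
# statistics, so it never leaks into a mean or a percentile.
stores["transit_minutes"] = stores["transit_minutes"].replace(-999, pd.NA)
```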
File Naming Patterns
```markdown
## File Organization
- Raw data: data/raw/ (never modify)
- Processed data: data/processed/ (intermediate outputs)
- Final outputs: data/output/ (results for publication)
- Scripts: numbered by execution order (01_, 10_, 20_...)
```

When we say "the store validation file," the agent knows we mean data/processed/stores_validated.csv. When we say "the transit calculation script," it knows to look for scripts/30_calculate_transit.py.
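One way to make these conventions executable rather than merely documented is a small paths module that every script imports. The module and constant names below are hypothetical; the directories mirror the layout above.

```python
from pathlib import Path

RAW = Path("data/raw")              # read-only inputs; never modify
PROCESSED = Path("data/processed")  # intermediate outputs
OUTPUT = Path("data/output")        # results for publication

# Canonical name for a file referenced throughout the project
STORES_VALIDATED = PROCESSED / "stores_validated.csv"
```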
Current Analysis Focus
```markdown
## Current Work
Analyzing relationship between transit-based grocery access and SNAP enrollment.
Key hypothesis: geographic access (store proximity) matters less than economic access (SNAP rates).
Current script: scripts/40_calculate_vulnerability.py
Known issue: Vulnerability calculation needs refinement for rural tracts.
```

This context helps the agent understand why we're asking certain questions. If we say "the vulnerability calculation seems wrong," it knows to look in scripts/40_calculate_vulnerability.py, and it knows the current concern is how rural tracts are handled.
Known Gotchas
```markdown
## Known Issues & Edge Cases
- Rural tracts: Some have zero grocery stores within 60 minutes by transit
  Handle these as NA rather than setting to maximum time
- Coordinate precision: Round to 6 decimal places (sufficient for ~10cm accuracy)
- API rate limits: OpenRouteService allows 40 requests/minute
- Data vintage: Census 2020 geometries, ACS 2023 5-year estimates (mismatch acceptable)
```

We discovered these issues through previous debugging sessions. Without documentation, we'd rediscover them every few weeks as we encounter edge cases again.
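A sketch of how two of these gotchas translate into code: rounding coordinates to six decimal places and throttling calls to stay under the 40-requests-per-minute limit. The request_transit_time callable is a hypothetical wrapper around the OpenRouteService client, not part of the project.

```python
import time

MIN_INTERVAL = 60 / 40  # 40 requests/minute -> at least 1.5 s between calls

def fetch_transit_times(pairs, request_transit_time):
    """Query transit times for (origin, destination) coordinate pairs."""
    results = []
    for origin, destination in pairs:
        # Six decimal places is ~10 cm precision, per the convention above
        origin = tuple(round(c, 6) for c in origin)
        destination = tuple(round(c, 6) for c in destination)
        results.append(request_transit_time(origin, destination))
        time.sleep(MIN_INTERVAL)  # simple throttle for the API rate limit
    return results
```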
Real Example: Before and After
Before CLAUDE.md
Session 1 (Monday):
"Fix the transit calculation error."
"What format are your coordinates in?"
"WGS84."
"What format are FIPS codes?"
"Strings with leading zeros."
Time to first working fix: 12 minutes.
Session 2 (Tuesday):
"Optimize the vulnerability score calculation."
"What format are your coordinates in?"
"WGS84."
"What format are FIPS codes?"
"Strings."
Time to first working fix: 15 minutes (we forgot to mention leading zeros).
Session 3 (Wednesday):
"Generate maps of vulnerability scores."
"What coordinate system should I use?"
"WGS84."
Time to first working fix: 8 minutes.
Cumulative re-explanation time: 35 minutes over three sessions.
After CLAUDE.md
Session 1 (Monday):
"Fix the transit calculation error."
Agent reads CLAUDE.md, sees coordinate system and FIPS format
"Found the issue: line 47 treats FIPS codes as integers, dropping leading zeros."
Time to first working fix: 3 minutes.
Session 2 (Tuesday):
"Optimize the vulnerability score calculation."
Agent reads CLAUDE.md, sees data conventions and current analysis focus
"Reviewing scripts/40_calculate_vulnerability.py. Should I maintain the rural tract NA handling mentioned in Known Issues?"
Time to first working fix: 4 minutes.
Session 3 (Wednesday):
"Generate maps of vulnerability scores."
Agent reads CLAUDE.md, sees coordinate system and output conventions
"Creating map with WGS84 coordinates, saving to data/output/figures/ per file organization pattern."
Time to first working fix: 2 minutes.
Cumulative re-explanation time: 0 minutes.
Error Prevention
CLAUDE.md prevents errors before they occur. Consider coordinate reference systems.
Without context: An agent might assume coordinates need projection for distance calculations. It projects WGS84 to Web Mercator (EPSG:3857), whose distances are heavily distorted away from the equator. Downstream calculations now use Euclidean distances in a distorted plane instead of the great-circle distances underlying the API data. Results are subtly wrong, and we don't notice until comparing them against manual spot-checks days later.
With context: The CLAUDE.md file states coordinates are WGS84 and should remain unprojected. The agent doesn't make this assumption. The error never occurs.
Similar prevention happens with FIPS codes (leading zeros), missing value conventions (don't calculate statistics on -999), and file locations (don't modify raw data).
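The coordinate case is easy to see in code. A great-circle distance works directly on unprojected WGS84 coordinates; the sketch below is the standard haversine formula, shown for illustration rather than taken from the project itself.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (decimal degrees)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```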
Time Savings
For the food security project:
- Development period: 90 days
- Coding sessions: approximately 60 (most days, some skipped)
- Average re-explanation time without CLAUDE.md: 4 minutes per session
- Total time saved: 240 minutes (4 hours)
This understates the value. The real savings come from errors prevented:
- Zero coordinate system errors after adding CLAUDE.md
- Zero FIPS code leading-zero errors
- Zero incorrect file modification incidents (e.g., overwriting raw data)
Debugging these errors typically takes 15-30 minutes each. We probably prevented 8-10 such incidents, saving another 2-3 hours.
Total time savings: approximately 6-7 hours across the project.
What Not to Include
CLAUDE.md documents conventions, not code. We don't put:
- Entire data processing pipelines (those belong in scripts)
- Statistical methods in detail (those belong in README or papers)
- Every variable definition (code should be self-documenting)
The file contains information that would otherwise live in our heads: the unwritten conventions we apply when writing code.
If we catch ourselves explaining the same thing to AI in three consecutive sessions, that explanation belongs in CLAUDE.md.
Maintenance
The context file evolves with the project. When we discover a new gotcha (rural tracts need special handling), we add it. When we change conventions (switching from NaN to -999 for missing values), we update it.
Maintenance cost: approximately 2 minutes per week. Payoff: zero re-explanation time across all sessions.
We update CLAUDE.md in three situations:
- After discovering an edge case through debugging
- When changing project conventions
- When starting a new analysis phase (update "Current Work" section)
Implementation
Creating a CLAUDE.md file takes 15-20 minutes. We document:
- Data formats and conventions (5 minutes)
- File organization patterns (3 minutes)
- Current analysis focus (2 minutes)
- Known gotchas we've already discovered (5 minutes)
This 15-to-20-minute investment saves 4+ hours over a three-month project.
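As a starting point, a minimal skeleton might look like the following; the section names mirror the examples above, and the placeholder bullets are ours to fill in.

```markdown
# CLAUDE.md

## Data Conventions
- <formats, coordinate systems, sentinel values>

## File Organization
- <directory layout, naming patterns>

## Current Work
- <active hypothesis, current script, open questions>

## Known Issues & Edge Cases
- <gotchas discovered through debugging>
```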
For the food security analysis, the CLAUDE.md file has 47 lines. It documents the decisions we made over three months that would otherwise require constant re-explanation.
The alternative is spending those 4+ hours repeating ourselves, and debugging the errors that occur when AI makes incorrect assumptions about our data.
Next in series: From Methods Paragraph to Working Pipeline explores how well-specified methodology sections become direct implementation guides.
How to Cite This Research
Too Early To Say. "One Context File, Zero Re-Explanations: Teaching AI Your Research Project." October 2025. https://www.tooearlytosay.com/research/methodology/claude-md-research-context/