Question-First Data Tagging: Finding Forgotten Datasets

A Google Sheet became a paper's central contribution. I almost forgot it existed.

We were revising a paper on pandemic income shocks in California: how COVID-19 affected household income across labor markets, and whether safety net programs buffered the losses. The nominal income analysis told a reassuring story. Gaps between high-cost and low-cost areas narrowed during the pandemic. Low-income families in expensive metros like San Francisco gained ground relative to families in cheaper areas like Fresno.

Then came the intuition: "These families in expensive metros were probably still worse off in real terms."

That thought triggered a memory. Somewhere, years earlier, I had downloaded cost-of-living data from the Council for Community and Economic Research (C2ER). It sat in a Google Sheet, acquired for an entirely different project. It was absent from the analysis plan. It was absent from the editor's feedback. Simply forgotten.

Once found and integrated, the story reversed completely. Low-income families in San Francisco went from a $2,600 purchasing power advantage in 2019 to a $700 disadvantage by 2023. The finding that geographically uniform safety net benefits accidentally reversed fortunes across regions emerged entirely from auxiliary data we almost lost track of.

This raises an uncomfortable question: How much useful data sits dormant in our project folders, waiting for the right research question to make it relevant?

The Mismatch Problem

The C2ER incident reflects a systematic mismatch between how we organize research data and how we retrieve it. Let's look at this pattern more closely.

Split comparison showing folder organization on left versus question-based retrieval on right, with a mismatch symbol in the center — Research data organized by project, source, or date cannot be retrieved by research question. Question-first tagging bridges this gap.

How we organize:

Think about where our data lives. We store it by project (FoodSecurity/data/, MFCU/data/), by source (census/, bls/, c2er/), or by date (downloads folder, timestamped files). These organizational schemes make sense when working on a specific project and needing to find specific files.

How we retrieve:

But when a new research question emerges, we search by need: "What speaks to regional price variation?" or "Do we have cost-of-living adjustments?" or "The reviewer wants a robustness check with alternative data."

The folder structure fails to answer these queries. The connection between "what we have" and "what we need" lives entirely in our heads.

Why data is harder than notes:

Consider the difference with notes. Notes contain natural language. If we wrote about cost of living in a research memo, we can grep for "cost of living" and find the passage. Datasets differ: they contain tables of numbers. We can't grep a CSV for "helps answer questions about purchasing power."

The meaning of the data exists as tacit knowledge. It lives nowhere except our memory. And memory proves unreliable.

The documentation gap:

Standard data documentation captures what data is: source, variables, geography, time period. This matters for reproducibility and for others who might use our data.

But standard documentation omits what questions the data could answer. It omits "this dataset can compare purchasing power across regions" or "this dataset can adjust nominal income to real income." That semantic layer, the bridge between data and research questions, remains undocumented.

The Insight: Question-First Tagging

The solution is simple in principle: tag data by the questions it can answer, not just by what it contains.

Consider the C2ER cost-of-living data. Standard metadata captures the basics:

dataset: c2er_cost_of_living
source: Council for Community and Economic Research
variables: [COMPOSITE_INDEX, HOUSING, GROCERIES, UTILITIES]
geography: California metropolitan areas
time_period: 2018-2023 (quarterly)

This is necessary but insufficient. It tells us what the data is, not what it can do.

Question-based tags add the missing layer:

questions_answerable:
  - "How does purchasing power vary across regions?"
  - "Are flat-dollar policy benefits worth the same everywhere?"
  - "What is the real (cost-adjusted) income in different metros?"
  - "How much more expensive is San Francisco than Fresno?"

constructs_measured:
  - cost of living
  - regional price parity
  - real vs nominal conversion

linkable_to:
  - ACS microdata (via metro area)
  - Any income data needing COL adjustment

See the difference? The second set of tags would have surfaced C2ER immediately when the question arose: "Are these families actually worse off in real terms?"

The schema: Four dimensions for retrieval

Four concentric rings showing Questions Answerable at center, surrounded by Constructs Measured, Coverage, and Linkability — The question-first tagging schema prioritizes research questions as the primary retrieval dimension, supported by constructs, coverage, and linkability.

Here's a simple schema with four dimensions that support future retrieval:

Dimension	What It Captures	Example
Questions Answerable	Research questions this data could help answer	"Real vs nominal income comparison"
Constructs Measured	Conceptual variables, not just column names	Prices, purchasing power, cost of living
Coverage	Geography, time, unit of analysis	CA metros, 2018-2023, quarterly
Linkability	What other datasets it joins with	ACS via metro, any income data

The key shift: from "what is this data?" to "what can this data answer?"

When a new research question arises, we search for answers rather than data sources. Question-based tags align our storage patterns with our retrieval patterns.

Implementation: Data Cards

A data card is a brief document (YAML, Markdown, or JSON) that captures both standard metadata and question-based tags for each dataset we acquire.

Data card template showing three sections: Standard Metadata, Question-Based Tags (highlighted), and Notes — A data card captures standard metadata alongside question-based tags. The highlighted middle section enables semantic retrieval.

Here's the complete template, using C2ER as an example:

# Data Card: C2ER Cost of Living Index

## Standard Metadata
source: Council for Community and Economic Research
acquired_date: 2025-09-15
location: Google Sheets / FoodSecurity/data/c2er_cost_of_living.csv
format: CSV
variables: [COMPOSITE_INDEX, HOUSING, GROCERIES, UTILITIES, TRANSPORTATION]
geography: California metropolitan areas
time_period: 2018-2023 (quarterly)
unit_of_analysis: Metro area

## Question-Based Tags
questions_answerable:
  - "How does cost of living vary across California metros?"
  - "Can we convert nominal income to real purchasing power?"
  - "Are policy benefits worth the same in SF vs. Fresno?"
  - "How much higher are housing costs in coastal vs. inland areas?"

constructs_measured:
  - cost of living
  - regional price parity
  - housing affordability
  - real vs nominal income adjustment

linkable_to:
  - ACS microdata (join on metro area / CBSA)
  - BLS Regional Price Parities
  - Any household income dataset

potential_analyses:
  - Deflate nominal income to real income
  - Compare purchasing power across regions
  - Assess geographic equity of flat-dollar benefits

## Notes
acquisition_context: "Downloaded for MFCU wage competitiveness analysis"
known_limitations: "Some metros missing; imputation required"

A suggested workflow:

On acquisition: Create a data card with standard metadata plus 3-5 question-based tags
Claude enrichment: Ask Claude to suggest additional questions the data could answer
Periodic review: Revisit cards when starting new projects
Query on need: When a new research question arises, ask Claude to search the inventory

Where Claude fits:

Claude can help with both tag generation and inventory search:

Generate question-based tags: Given variable names and data structure, Claude can infer "this dataset could help answer questions about..."
Semantic matching: Given a new research question, Claude can search a tagged inventory for relevant datasets
Surface non-obvious connections: Claude can find matches that keyword search would miss

What Claude can't do:

Know what data we have unless we tell it
Replace our judgment about what matters
Automatically maintain the inventory

The system works through collaboration: We maintain inventory → Claude enriches tags → Claude queries on demand

What This Adds

Several practices already address data management, but each has a gap:

Practice	Purpose	Gap
Data documentation	Describe what data IS	Omits what questions it answers
FAIR principles	Make data findable for others	Omits findability for our future selves
Personal knowledge management	Organize notes and ideas	Excludes datasets
Research data management	Archive and version data	Excludes semantic retrieval

Most data management asks: "Can others use this?" or "Do we remember what this is?"

Question-first tagging asks a different question: "Will we find this when we need it for a question we haven't yet asked?"

The counterfactual:

Consider what would have happened with a data card for C2ER created when first downloading it, with question-based tags like "real vs nominal income" and "regional price comparison."

When the intuition arose during the JUE revision ("these families are probably still worse off in real terms"), we could have queried our inventory: "What speaks to real vs. nominal income?"

The C2ER card would have surfaced immediately. Instead of relying on a lucky memory, systematic retrieval would have done the work.

The broader pattern:

Researchers accumulate auxiliary datasets constantly. Census downloads for one project. BLS data for another. Conference presentation data. Replication package datasets. Most of this data sits unused after its initial purpose.

We almost certainly have useful data somewhere in our files. The question is whether we can find it when a new research question makes it relevant.

How to Start

Here's a concrete first step: tag our three most recent datasets.

Pick the three most recent datasets we've acquired
Create a data card for each with:
- Standard metadata (source, variables, coverage)
- 3-5 questions it could answer
- What other data it links to
Store the cards in a searchable location (folder, notes app, Claude project context)

Claude prompt for tag generation:

Here's a dataset we have:
- Source: [source]
- Variables: [list]
- Geography: [coverage]
- Time period: [years]

What research questions could this data help answer?
What other datasets might it link with?
Suggest 5 question-based tags for our data inventory.

Claude prompt for inventory query:

We have a new research question: [question]

Here's our data inventory: [paste cards or reference file]

Which datasets might be relevant? Include non-obvious connections.

The payoff:

We move from "we forgot we had this" to "Claude, what do we have that speaks to X?"

Serendipity becomes systematic. The auxiliary data accumulated over years of research becomes a searchable inventory rather than a graveyard of forgotten downloads.

The C2ER data transformed a paper. The next dataset that transforms our work might already be sitting in a folder somewhere, waiting for the right question to surface it. Question-first tagging ensures we can find it.

This is a methodology post from Too Early to Say, exploring how applied economists can work more effectively in the age of AI.

How to Cite This Research

Cholette, V. (2026, January 16). The data we forgot we had: A tagging system for research serendipity. Too Early To Say. https://tooearlytosay.com/research/methodology/question-first-data-management/

Copy citation