The Data We Forgot We Had: A Tagging System for Research Serendipity

How question-first tagging turns dormant datasets into discoverable assets

A Google Sheet became a paper’s central contribution. It almost got forgotten.

The revision was for a paper on pandemic income shocks in California: how COVID-19 affected household income across labor markets, and whether safety net programs buffered the losses. The nominal income analysis told a reassuring story. Gaps between high-cost and low-cost areas narrowed during the pandemic. Low-income families in expensive metros like San Francisco gained ground relative to families in cheaper areas like Fresno.

Then came the intuition: “These families in expensive metros were probably still worse off in real terms.”

That thought triggered a memory. Somewhere, months earlier, cost-of-living data from the Council for Community and Economic Research (C2ER) had been downloaded. It sat in a Google Sheet, acquired for an entirely different project. It was absent from the analysis plan. It was absent from the editor’s feedback. It had simply been forgotten.

Finding the file and integrating it reversed the story completely. Low-income families in San Francisco went from a $2,600 purchasing power advantage in 2019 to a $700 disadvantage by 2023. The finding that geographically uniform safety net benefits accidentally reversed fortunes across regions emerged entirely from auxiliary data that almost got lost.

This raised an uncomfortable realization: useful data sits dormant in project folders everywhere, waiting for the right research question to make it relevant.


The mismatch problem

The C2ER incident reflects a systematic mismatch between how research data gets organized and how it gets retrieved.

How data gets organized:

Research files are stored by project (FoodSecurity/data/, MFCU/data/), by source (census/, bls/, c2er/), or by date (downloads folder, timestamped files). These organizational schemes make sense when working on a specific project and needing to find specific files.

How data gets retrieved:

When a new research question emerges, the search is by need: “What speaks to regional price variation?” or “Do I have cost-of-living adjustments?” or “The reviewer wants a robustness check with alternative data.”

The folder structure fails to answer these queries. The connection between “what exists” and “what is needed” lives entirely in the researcher’s memory.

Why data is harder than notes:

Notes contain natural language. If something was written about cost of living in a research memo, grep finds the passage. Datasets differ: they contain tables of numbers. There is no way to grep a CSV for “helps answer questions about purchasing power.”

The meaning of the data exists as tacit knowledge. It lives nowhere except the researcher’s memory. And memory proves unreliable.

The documentation gap:

Standard data documentation captures what data is: source, variables, geography, time period. This matters for reproducibility and for others who might use the data.

Standard documentation omits what questions the data could answer. It omits “this dataset can compare purchasing power across regions” or “this dataset can adjust nominal income to real income.” That semantic layer, the bridge between data and research questions, remains undocumented.


The insight: question-first tagging

The solution is simple in principle: tag data by the questions it can answer, not just by what it contains.

Consider the C2ER cost-of-living data. Standard metadata captures the basics:

dataset: c2er_cost_of_living
source: Council for Community and Economic Research
variables: [COMPOSITE_INDEX, HOUSING, GROCERIES, UTILITIES]
geography: California metropolitan areas
time_period: 2018-2023 (quarterly)

This is necessary but insufficient. It tells what the data is, not what it can do.

Question-based tags add the missing layer:

questions_answerable:
  - "How does purchasing power vary across regions?"
  - "Are flat-dollar policy benefits worth the same everywhere?"
  - "What is the real (cost-adjusted) income in different metros?"
  - "How much more expensive is San Francisco than Fresno?"

constructs_measured:
  - cost of living
  - regional price parity
  - real vs nominal conversion

linkable_to:
  - ACS microdata (via metro area)
  - Any income data needing COL adjustment

The second set of tags would have surfaced C2ER immediately when the question arose: “Are these families actually worse off in real terms?”

The schema: four dimensions for retrieval

A simple schema with four dimensions supports future retrieval:

  • Questions answerable: research questions this data could help answer (e.g., “Real vs nominal income comparison”)
  • Constructs measured: conceptual variables, not just column names (e.g., prices, purchasing power, cost of living)
  • Coverage: geography, time, unit of analysis (e.g., CA metros, 2018-2023, quarterly)
  • Linkability: what other datasets it joins with (e.g., ACS via metro, any income data)

The key shift: from “what is this data?” to “what can this data answer?”

When a new research question arises, the search is for answers rather than data sources. Question-based tags align storage patterns with retrieval patterns.
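The four dimensions map naturally onto a small record type. A minimal Python sketch (the class and field names are illustrative, not a fixed standard):

```python
from dataclasses import dataclass, field

@dataclass
class DataCard:
    """One entry in a personal data inventory, following the four-dimension schema."""
    name: str
    questions_answerable: list[str] = field(default_factory=list)  # research questions it could help answer
    constructs_measured: list[str] = field(default_factory=list)   # conceptual variables, not just column names
    coverage: str = ""                                             # geography, time, unit of analysis
    linkable_to: list[str] = field(default_factory=list)           # what other datasets it joins with

# The C2ER example from this article, expressed as a card:
c2er = DataCard(
    name="c2er_cost_of_living",
    questions_answerable=["How does purchasing power vary across regions?"],
    constructs_measured=["cost of living", "regional price parity"],
    coverage="CA metros, 2018-2023, quarterly",
    linkable_to=["ACS microdata (via metro area)"],
)
```

A card like this can live in code, or be serialized to YAML/JSON; the point is that the question-based fields sit alongside the descriptive ones.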


Implementation: data cards

A data card is a brief document (YAML, Markdown, or JSON) that captures both standard metadata and question-based tags for each dataset acquired.

Here is the complete template, using C2ER as an example:

# Data Card: C2ER Cost of Living Index

## Standard Metadata
source: Council for Community and Economic Research
acquired_date: 2025-09-15
location: Google Sheets / FoodSecurity/data/c2er_cost_of_living.csv
format: CSV
variables: [COMPOSITE_INDEX, HOUSING, GROCERIES, UTILITIES, TRANSPORTATION]
geography: California metropolitan areas
time_period: 2018-2023 (quarterly)
unit_of_analysis: Metro area

## Question-Based Tags
questions_answerable:
  - "How does cost of living vary across California metros?"
  - "Can I convert nominal income to real purchasing power?"
  - "Are policy benefits worth the same in SF vs. Fresno?"
  - "How much higher are housing costs in coastal vs. inland areas?"

constructs_measured:
  - cost of living
  - regional price parity
  - housing affordability
  - real vs nominal income adjustment

linkable_to:
  - ACS microdata (join on metro area / CBSA)
  - BLS Regional Price Parities
  - Any household income dataset

potential_analyses:
  - Deflate nominal income to real income
  - Compare purchasing power across regions
  - Assess geographic equity of flat-dollar benefits

## Notes
acquisition_context: "Downloaded for MFCU wage competitiveness analysis"
known_limitations: "Some metros missing; imputation required"

The workflow:

  1. On acquisition: Create a data card with standard metadata plus 3-5 question-based tags
  2. AI enrichment: Ask an assistant to suggest additional questions the data could answer
  3. Periodic review: Revisit cards when starting new projects
  4. Query on need: When a new research question arises, search the inventory
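Step 4, "query on need," does not require anything sophisticated to be useful. A minimal stdlib sketch that scores word overlap between a new question and each card's tags (the toy inventory below reuses the C2ER example; a real inventory would be loaded from the card files):

```python
import re

# Toy inventory: dataset name -> its question-based tags.
INVENTORY = {
    "c2er_cost_of_living": [
        "How does purchasing power vary across regions?",
        "Can I convert nominal income to real purchasing power?",
    ],
    "acs_microdata": [
        "What is household income by metro area?",
    ],
}

def tokens(text: str) -> set[str]:
    """Lowercase word set, stripped of punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_datasets(question: str) -> list[tuple[str, int]]:
    """Rank datasets by the best word overlap between the question and any of their tags."""
    q = tokens(question)
    scores = {
        name: max(len(q & tokens(tag)) for tag in tags)
        for name, tags in INVENTORY.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

hits = rank_datasets("What speaks to real vs. nominal income?")
```

Even this crude scoring puts the C2ER card at the top for the question that triggered the revision, because the tags were written in question form in the first place.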

Where AI assistants fit:

AI coding assistants can help with both tag generation and inventory search:

  • Generate question-based tags: Given variable names and data structure, the assistant can infer “this dataset could help answer questions about…”
  • Semantic matching: Given a new research question, the assistant can search a tagged inventory for relevant datasets
  • Surface non-obvious connections: Semantic search finds matches that keyword search would miss
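As a crude local stand-in for embedding-based semantic matching, even stdlib fuzzy matching tolerates some rewording between a query and a tag (an assistant or embedding model would do far better; the tags below come from the C2ER card):

```python
import difflib

TAGS = [
    "How does purchasing power vary across regions?",
    "Are flat-dollar policy benefits worth the same everywhere?",
    "What is the real (cost-adjusted) income in different metros?",
]

# get_close_matches compares character sequences, so it survives moderate rewording.
query = "What is real income across different metros?"
matches = difflib.get_close_matches(query, TAGS, n=1, cutoff=0.4)
```

The limits are real, though: pure string similarity will miss a tag phrased as "purchasing power" when the query says "real income," which is exactly the gap semantic search closes.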

The assistant cannot know what data exists unless told. It cannot replace researcher judgment about what matters. It cannot automatically maintain the inventory.

The system works through collaboration: human maintains inventory, assistant enriches tags, assistant queries on demand.


What this adds

Several practices already address data management, but each has a gap:

  • Data documentation: describes what data is, but omits what questions it answers
  • FAIR principles: make data findable for others, but omit findability for the researcher’s future self
  • Personal knowledge management: organizes notes and ideas, but excludes datasets
  • Research data management: archives and versions data, but excludes semantic retrieval

Most data management asks: “Can others use this?” or “Do I remember what this is?”

Question-first tagging asks a different question: “Will I find this when I need it for a question I have yet to ask?”

The counterfactual:

If a data card had been created for C2ER at the time of download, with question-based tags like “real vs nominal income” and “regional price comparison,” what would have happened?

When the intuition arose during the revision (“these families are probably still worse off in real terms”), a query of the inventory (“What speaks to real vs. nominal income?”) would have surfaced C2ER immediately. Instead of relying on a lucky memory, the retrieval would have been systematic.

The broader pattern:

Researchers accumulate auxiliary datasets constantly. Census downloads for one project. Bureau of Labor Statistics (BLS) data for another. Conference presentation data. Replication package datasets. Most of this data sits unused after its initial purpose.

The question is whether it can be found when a new research question makes it relevant. Useful data almost certainly sits somewhere in most researchers’ files.


How to start

A concrete first step: tag three recent datasets.

  1. Pick the three most recently acquired datasets
  2. Create a data card for each with:
     • Standard metadata (source, variables, coverage)
     • 3-5 questions it could answer
     • What other data it links to
  3. Store the cards in a searchable location (folder, notes app, project context file)
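Creating the skeleton cards can itself be scripted. A sketch that writes a JSON data card (one of the card formats mentioned above) with empty fields to fill in; the output directory and field names are illustrative:

```python
import json
from pathlib import Path

def write_data_card(name: str, out_dir: str = "data_cards") -> Path:
    """Write a skeleton JSON data card with standard and question-based fields."""
    card = {
        "dataset": name,
        "source": "",                # fill in on acquisition
        "variables": [],
        "coverage": "",
        "questions_answerable": [],  # aim for 3-5 questions
        "constructs_measured": [],
        "linkable_to": [],
        "acquisition_context": "",
    }
    path = Path(out_dir)
    path.mkdir(exist_ok=True)
    out = path / f"{name}.json"
    out.write_text(json.dumps(card, indent=2))
    return out
```

Running this once per new dataset keeps the inventory a directory of small, grep-able files rather than a habit that depends on discipline alone.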

Prompt for tag generation:

Here's a dataset I have:
- Source: [source]
- Variables: [list]
- Geography: [coverage]
- Time period: [years]

What research questions could this data help answer?
What other datasets might it link with?
Suggest 5 question-based tags for my data inventory.

Prompt for inventory query:

I have a new research question: [question]

Here's my data inventory: [paste cards or reference file]

Which datasets might be relevant? Include non-obvious connections.

The payoff:

The move is from “I forgot I had this” to “what in the inventory speaks to X?”

Serendipity becomes systematic. The auxiliary data accumulated over years of research becomes a searchable inventory rather than a graveyard of forgotten downloads.

The C2ER data transformed a paper. The next dataset that transforms a piece of research might already be sitting in a folder somewhere, waiting for the right question to surface it. Question-first tagging ensures it can be found.

Cite this article

Cholette, V. (2026, January 15). The Data We Forgot We Had: A Tagging System for Research Serendipity. Too Early To Say. https://tooearlytosay.com/research/methodology/question-first-data-management/