# Question-First Data Card Template + Tagging Prompts

**Source article:** [The Data We Forgot We Had](https://tooearlytosay.com/research/methodology/question-first-data-management/)
**Pattern:** Tag data by the questions it can answer, not just by what it contains.

Standard data documentation captures what a dataset *is* (source, variables, geography, time period). It omits what questions the dataset can *answer*. Question-first tags add the missing semantic layer so the data can be retrieved by need rather than by memory.

Create a data card the moment a dataset enters the project folder, not later.

---

## Data card template

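Save the card as a `.yaml` or `.md` file in a `_cards/` folder alongside the data, e.g. `<your-project>/data/_cards/<dataset-slug>.yaml`. Keep a blank copy of the template at `<your-project>/data/_cards/_template.yaml`; Prompt 1 refers to it.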

```yaml
# Data Card: <dataset slug>

## Standard Metadata
source: <provider, e.g., Council for Community and Economic Research>
acquired_date: <YYYY-MM-DD>
location: <relative path or URL>
format: <CSV | Parquet | API | Excel | other>
variables: [<var1>, <var2>, <var3>]
geography: <e.g., California metropolitan areas>
time_period: <e.g., 2018-2023 (quarterly)>
unit_of_analysis: <e.g., metro area, person-year, school district>

## Question-Based Tags
questions_answerable:
  - "<research question 1 this dataset could help answer>"
  - "<research question 2>"
  - "<research question 3>"
  - "<research question 4>"

constructs_measured:
  - <conceptual variable, not the column name>
  - <e.g., cost of living, regional price parity>
  - <e.g., real vs nominal income adjustment>

linkable_to:
  - <other dataset 1> (join on <key>)
  - <other dataset 2> (join on <key>)
  - <any third dataset class this can be merged with>

potential_analyses:
  - <what this dataset enables when combined with linkable_to>
  - <e.g., deflate nominal income to real income>
  - <e.g., compare purchasing power across regions>

## Notes
acquisition_context: "<why we originally pulled this; what project>"
known_limitations: "<missing geographies, top-coding, imputation, etc.>"
```
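To make "create the card on intake" a one-line habit, the template can be scaffolded by a small script. The following is a minimal sketch, not part of the original template: the function name `scaffold_card`, the blank-field skeleton, and the choice to never overwrite an existing card are all assumptions made for illustration.

```python
from datetime import date
from pathlib import Path

# Hypothetical blank skeleton mirroring the field names of the template above.
CARD_SKELETON = """\
# Data Card: {slug}

## Standard Metadata
source: "{source}"
acquired_date: {acquired}
location: "{location}"
format:
variables: []
geography:
time_period:
unit_of_analysis:

## Question-Based Tags
questions_answerable: []
constructs_measured: []
linkable_to: []
potential_analyses: []

## Notes
acquisition_context: ""
known_limitations: ""
"""


def scaffold_card(project_root, slug, source, location, acquired=None):
    """Write a blank card for `slug` and return its path.

    Assumes the <project>/data/_cards/<slug>.yaml layout described above.
    Values are inserted as-is, so a source or location containing a double
    quote would need escaping by hand.
    """
    cards_dir = Path(project_root) / "data" / "_cards"
    cards_dir.mkdir(parents=True, exist_ok=True)
    card_path = cards_dir / f"{slug}.yaml"
    if card_path.exists():
        return card_path  # never overwrite tags someone already wrote
    acquired = acquired or date.today().isoformat()
    card_path.write_text(
        CARD_SKELETON.format(
            slug=slug, source=source, acquired=acquired, location=location
        ),
        encoding="utf-8",
    )
    return card_path
```

The question-based fields are left empty on purpose: the script only removes the friction of creating the file, and the tags still have to be written by a person or drafted with Prompt 1.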

---

## Why four dimensions, not just metadata

| Dimension | What it captures | Example |
|-----------|------------------|---------|
| Questions answerable | Research questions this data could help answer | "Real vs nominal income comparison" |
| Constructs measured | Conceptual variables, not just column names | Prices, purchasing power, cost of living |
| Coverage | Geography, time, unit of analysis | CA metros, 2018–2023, quarterly |
| Linkability | What other datasets it joins with | ACS via metro, any income data |

The shift is from "what is this data?" to "what can this data answer?" When a new research question arises, the search is for *answers* rather than data sources. Question-based tags align storage with retrieval.

---

## Prompt 1 — Generate tags for a new dataset

Use this on intake, the same day the data lands.

```
Here's a dataset I have:
  - Source: <source>
  - Variables: <list>
  - Geography: <coverage>
  - Time period: <years>
  - Unit of analysis: <unit>

What research questions could this data help answer?
What constructs does it measure (conceptually, not column names)?
What other public datasets might it link with, and on which keys?
Suggest 5 question-based tags I should add to my data inventory.

Format your answer as YAML matching the data card schema in
<your-project>/data/_cards/_template.yaml.
```
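If the standard metadata is already on hand, Prompt 1 can be filled in programmatically rather than by hand. This is a sketch under assumptions: `build_tagging_prompt`, the `REQUIRED` tuple, and the default `template_path` are hypothetical names, and the prompt wording is copied from the block above.

```python
# Prompt 1 with its placeholders turned into format fields.
PROMPT_1 = """\
Here's a dataset I have:
  - Source: {source}
  - Variables: {variables}
  - Geography: {geography}
  - Time period: {time_period}
  - Unit of analysis: {unit_of_analysis}

What research questions could this data help answer?
What constructs does it measure (conceptually, not column names)?
What other public datasets might it link with, and on which keys?
Suggest 5 question-based tags I should add to my data inventory.

Format your answer as YAML matching the data card schema in
{template_path}.
"""

# Metadata keys the prompt needs; names follow the data card template.
REQUIRED = ("source", "variables", "geography", "time_period", "unit_of_analysis")


def build_tagging_prompt(meta, template_path="data/_cards/_template.yaml"):
    """Return Prompt 1 filled from a metadata dict.

    `meta["variables"]` is expected to be a list of strings.
    Raises ValueError when a required field is missing or empty.
    """
    missing = [key for key in REQUIRED if not meta.get(key)]
    if missing:
        raise ValueError(f"missing metadata: {', '.join(missing)}")
    fields = {key: meta[key] for key in REQUIRED}
    fields["variables"] = ", ".join(meta["variables"])
    return PROMPT_1.format(template_path=template_path, **fields)
```

Failing loudly on missing fields is a deliberate choice here: a prompt with an empty geography or time period tends to produce generic tags.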

---

## Prompt 2 — Query the inventory for a new research question

Use this when an intuition surfaces during analysis or revisions.

```
I have a new research question: <question>.

Here is my data inventory:
<paste the contents of data/_cards/ or reference the folder>.

Which datasets might be relevant? Include non-obvious connections that
go beyond keyword matching. For each candidate, tell me which of the
"questions_answerable" or "constructs_measured" tags makes it relevant
to my new question.
```
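Pasting the inventory by hand gets tedious once there are more than a few cards. The sketch below assembles Prompt 2 from the cards folder. The helper names `load_inventory` and `build_query_prompt` are hypothetical, and skipping files whose names start with an underscore (so `_template.yaml` stays out of the inventory) is an assumption about the folder layout.

```python
from pathlib import Path

CARD_SUFFIXES = {".yaml", ".yml", ".md"}


def load_inventory(cards_dir):
    """Concatenate all cards in `cards_dir`, one labelled block per file.

    Files starting with "_" (e.g. _template.yaml) and non-card files are skipped.
    """
    blocks = []
    for path in sorted(Path(cards_dir).iterdir()):
        if not path.is_file() or path.name.startswith("_"):
            continue
        if path.suffix.lower() not in CARD_SUFFIXES:
            continue
        body = path.read_text(encoding="utf-8").strip()
        blocks.append(f"--- {path.name} ---\n{body}")
    return "\n\n".join(blocks)


def build_query_prompt(question, cards_dir):
    """Return Prompt 2 with the question and the full inventory filled in."""
    inventory = load_inventory(cards_dir)
    if not inventory:
        raise ValueError(f"no data cards found in {cards_dir}")
    return (
        f"I have a new research question: {question}\n\n"
        "Here is my data inventory:\n"
        f"{inventory}\n\n"
        "Which datasets might be relevant? Include non-obvious connections that\n"
        "go beyond keyword matching. For each candidate, tell me which of the\n"
        '"questions_answerable" or "constructs_measured" tags makes it relevant\n'
        "to my new question.\n"
    )
```

Labelling each block with its file name lets the assistant's answer point back to a specific card.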

---

## The workflow

1. **On acquisition:** create a data card with standard metadata plus 3–5 question-based tags. (Use Prompt 1.)
2. **AI enrichment:** ask an assistant to suggest additional questions the data could answer.
3. **Periodic review:** revisit cards when starting new projects.
4. **Query on need:** when a new research question arises, search the inventory. (Use Prompt 2.)

The system works through collaboration: the human maintains the inventory; the assistant enriches the tags and queries the inventory on demand.

---

## How to start

Tag three recent datasets, not the whole archive.

1. Pick the three most recently acquired datasets in the project folders.
2. Create a data card for each (standard metadata, 3–5 questions, `linkable_to` list).
3. Store the cards in a searchable location (folder, notes app, project context file).
4. Run Prompt 2 against the new three-card inventory the next time a research question surfaces.

The move is from "I forgot I had this" to "what in the inventory speaks to X?" Serendipity becomes systematic.
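Step 1 of the list above can be approximated with a short script that finds the newest data files that do not yet have a card. This is a rough sketch under several assumptions: file modification time stands in for acquisition date, a card is matched to a dataset by file stem, and the `DATA_SUFFIXES` set and the name `recent_uncarded` are hypothetical.

```python
from pathlib import Path

# Assumed set of data file extensions; adjust to the formats actually in use.
DATA_SUFFIXES = {".csv", ".parquet", ".xlsx", ".xls"}


def recent_uncarded(project_root, n=3):
    """Return up to `n` data files under <project>/data with no matching card.

    Newest first, using modification time as a proxy for acquisition date.
    A card "matches" when its file stem equals the dataset's file stem.
    """
    data_dir = Path(project_root) / "data"
    cards_dir = data_dir / "_cards"
    carded = set()
    if cards_dir.is_dir():
        carded = {p.stem for p in cards_dir.iterdir() if p.is_file()}
    candidates = [
        p
        for p in data_dir.rglob("*")
        if p.is_file()
        and p.suffix.lower() in DATA_SUFFIXES
        and cards_dir not in p.parents  # ignore anything stored with the cards
        and p.stem not in carded
    ]
    candidates.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return candidates[:n]
```

Run periodically, the same function doubles as a check for datasets that entered the project without a card.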

---

*Template extracted from work published on [Too Early To Say](https://tooearlytosay.com).*
*Licensed MIT. Victoria Cholette, 2026.*
