Data cleaning

Victoria Cholette

AI for Applied Researchers · Step 3 of 5

Updated July 21, 2026

This step is the pattern we use when a categorical field is too noisy to trust and the dataset is too large to clean by hand. It produces an analysis-ready file and a reproducible final-test record with the remaining error exposed.

The problem this step solves

We work with the dataset we have, not the one we wish we had. A public application programming interface (API) or an agency export rarely arrives clean. Categories overlap, names are inconsistent, and a single label covers records that mean different things.

In a food-access build, Google Places returned thousands of results for "grocery store" across California counties. Many were not grocery stores. The same query that surfaced Safeway and Trader Joe's also returned 7-Eleven locations, gas-station minimarts, liquor stores, and restaurants with incidental grocery items.

For research, the mismatch changes the answer. A convenience store does not provide the same food-security value as a full-service supermarket, so counting it as one overstates access. This step gives us a repeatable way to separate the records that belong from the records that do not and to know how often we get it wrong.

A precise specification is what lets an agent do useful work. Here, the specification fixes the label rule, the data partitions, the metric, and the point at which model selection stops.

When to use this step, and when not to

A classifier earns its place when a categorical field is unreliable, the dataset is too large to check by hand, and a defensible sample can still be labeled independently.

We skip this step when the data are small enough to clean directly, when a deterministic rule separates the categories cleanly, or when no feature carries signal about the true label. If business names, type tags, and counts tell us nothing about which records are real, a model will not invent the distinction.

Decision rule: use the classifier only when the labeling budget leaves enough independently adjudicated rows for separate training, validation, and final-test partitions. If model choices have already been informed by the final-test rows, those rows are validation data and a new untouched test is required.

Inputs required

Before we bring in an artificial intelligence (AI) assistant, we assemble:

- The raw export with the unreliable field, plus whatever attributes came attached to each record. In the grocery case these were business name, type tags, user rating, review count, and price level.
- A hand-labeled sample large enough to preserve separate training, validation, and final-test partitions.
- A source of ground truth and a written adjudication rule for ambiguous labels.
- A coding environment with an AI assistant in the loop to write and revise the classifier.

The AI-assisted move

With a labeled set in hand, the assistant writes a split manifest, builds preprocessing inside the training partition, and compares candidate specifications on the validation partition. The final-test partition stays sealed while features, model family, and hyperparameters change.
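A minimal sketch of the split manifest, assuming each labeled row is a dictionary with a stable `row_id` and the `is_real` label. The column name `row_id`, the seed value, and both function names are illustrative assumptions, not part of the original build. It uses only the Python standard library, so it shows the partition logic without claiming any particular modeling package.

```python
import csv
import io
import random
from collections import defaultdict

SEED = 12345  # hypothetical value; any fixed, recorded integer works


def stratified_split(rows, seed=SEED, fractions=(0.6, 0.2, 0.2)):
    """Map each row_id to 'train', 'validation', or 'final_test'.

    Rows are grouped by is_real first so every partition keeps roughly
    the same class balance as the labeled sample.
    """
    rng = random.Random(seed)
    ids_by_label = defaultdict(list)
    for row in rows:
        ids_by_label[row["is_real"]].append(row["row_id"])

    manifest = {}
    for label in sorted(ids_by_label):
        ids = sorted(ids_by_label[label])  # fixed order before shuffling
        rng.shuffle(ids)
        n_train = round(len(ids) * fractions[0])
        n_val = round(len(ids) * fractions[1])
        for position, row_id in enumerate(ids):
            if position < n_train:
                manifest[row_id] = "train"
            elif position < n_train + n_val:
                manifest[row_id] = "validation"
            else:
                manifest[row_id] = "final_test"
    return manifest


def write_manifest(manifest, handle):
    """Write the manifest as CSV (row_id, partition) to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["row_id", "partition"])
    for row_id in sorted(manifest):
        writer.writerow([row_id, manifest[row_id]])
```

In practice the handle would be something like `open("split_manifest.csv", "w", newline="")`, written once and before any candidate model is fit. Re-running with the same seed and the same rows reproduces the same assignment, which is what lets a reviewer confirm that the final-test rows never changed.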

Once we freeze one complete pipeline, the assistant fits that pipeline under the frozen rule and opens the final test once. It reports balanced accuracy, the confusion matrix, and class-specific error rates, then saves row-level predictions with the model specification and random seed.
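The three reported quantities can be computed directly from the saved labels and predictions. The sketch below assumes binary labels coded as in the protocol (1 = belongs to the category, 0 = does not); the function names are illustrative. Balanced accuracy here is the mean of the true-positive rate and the true-negative rate.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def final_test_report(y_true, y_pred):
    """Return the confusion counts, balanced accuracy, and per-class error rates."""
    c = confusion_counts(y_true, y_pred)
    positives = c["tp"] + c["fn"]
    negatives = c["tn"] + c["fp"]
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must appear in the final-test labels")
    true_positive_rate = c["tp"] / positives
    true_negative_rate = c["tn"] / negatives
    return {
        "confusion": c,
        "balanced_accuracy": (true_positive_rate + true_negative_rate) / 2,
        # Share of real records the model rejected.
        "false_negative_rate": c["fn"] / positives,
        # Share of non-members the model accepted.
        "false_positive_rate": c["fp"] / negatives,
    }
```

Because the report depends only on the true labels and the predictions, running it on the saved row-level prediction file should return the same numbers that were published.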

In the grocery-store case, this structure keeps model development separate from the final evaluation. The scores the article reports stay in the Provenance section because the same held-out set informed successive model changes, so they did not come from a sealed final test.

The assistant's job is to implement the partition and model rules exactly. Our job is to define the labels, choose the metric, freeze the candidate, and decide what error is acceptable for the analysis.

Copy-paste protocol

Paste this into the AI assistant once a labeled sample exists. The protocol separates model development from the one-time final test.

You are helping me clean a categorical field in a dataset by training a
binary classifier, with an honest accuracy estimate.

CONTEXT
- File: labeled_sample.csv
- Target column: is_real (1 = the record truly belongs to the category,
  0 = it does not). This was hand-labeled.
- Feature columns: [list every attribute, e.g. business_name, type_tags,
  user_rating, review_count, price_level]
- Full unlabeled file to score later: all_records.csv

DO THIS
1. Load labeled_sample.csv. Report class balance and any missing values.
2. Split once into training, validation, and final-test sets (60/20/20),
   stratified on is_real. Fix and print the random seed. Save
   split_manifest.csv with the stable row identifier (ID) and assigned partition.
3. Keep the final-test rows and metrics sealed through model selection.
   Do not use them to choose tokens, features, model family, thresholds,
   or hyperparameters.
4. Engineer features using the training partition only. Fit every learned
   preprocessing step on training data. Parse multi-value tag columns into
   explicit binary indicators.
5. Compare candidate pipelines on the validation partition only. Report
   balanced accuracy and a confusion matrix for each candidate. Record every
   attempted specification in model_search.csv.
6. Freeze one complete pipeline and decision threshold in model_spec.json.
   After that freeze, fit under the recorded rule and score the final-test
   partition exactly once.
7. Report final-test balanced accuracy, class-specific error rates, and the
   confusion matrix. Save final_test_predictions.csv with stable row IDs.
8. If the final-test rows or metrics influenced any earlier choice, label
   them validation data and stop. A new untouched final test is required.

If a required column or label definition is missing, state what is missing.
Do not guess.

STOP after step 8. Do not score all_records.csv yet. I will review the
split manifest, frozen specification, and final-test record first.
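Step 4 of the protocol asks for tag indicators that are learned from the training partition only. One way to read that is sketched below: the tag vocabulary is built from training rows, then applied unchanged to validation and final-test rows. The pipe separator, the function names, and the example tag strings are assumptions for illustration; the real export may delimit tags differently.

```python
from collections import Counter


def split_tags(raw, sep="|"):
    """Turn a raw multi-value tag string into a list of cleaned tags."""
    return [tag.strip().lower() for tag in (raw or "").split(sep) if tag.strip()]


def fit_tag_vocabulary(train_tag_strings, sep="|", min_count=1):
    """Learn the tag vocabulary from TRAINING rows only."""
    counts = Counter()
    for raw in train_tag_strings:
        counts.update(set(split_tags(raw, sep)))  # count each tag once per row
    return sorted(tag for tag, n in counts.items() if n >= min_count)


def tags_to_indicators(raw, vocabulary, sep="|"):
    """Encode one row as 0/1 indicators over the frozen vocabulary.

    Tags that never appeared in training are ignored, so validation and
    final-test rows cannot add columns or influence the feature set.
    """
    present = set(split_tags(raw, sep))
    return [1 if tag in present else 0 for tag in vocabulary]


# Illustrative tag strings, not rows from the original dataset.
train_tags = ["grocery_or_supermarket|food", "convenience_store|food", ""]
vocabulary = fit_tag_vocabulary(train_tags)
held_out_row = tags_to_indicators("food|liquor_store", vocabulary)
```

The design choice is that `fit_tag_vocabulary` sees only training rows, while `tags_to_indicators` is the only function applied to the other partitions. A tag that appears only in held-out rows is dropped by construction.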

Failure check and validation

Failure condition: any feature, model, threshold, or stopping decision was chosen after inspecting final-test rows or metrics. In that case, the reported score is a validation result and a new untouched test is required.

Pass condition: the split manifest predates model selection, the model specification is frozen before the test opens, and one saved prediction file reproduces every final-test metric.

A later audit may deliberately sample equal numbers from each predicted class. That design estimates error within the predicted classes. Its raw error count is not a population error rate unless the two predicted classes have equal population shares. Estimate population error by weighting each within-class error rate by that predicted class's share of all scored records.
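The weighting can be written out as a short calculation. This sketch assumes the audit records, for each predicted class, the number of errors found and the number of rows audited, and that the count of scored records in each predicted class is known. All numbers and names below are hypothetical and are not taken from the grocery-store case.

```python
def weighted_population_error(audit, scored_counts):
    """Weight each within-class error rate by that class's share of scored records.

    audit: {predicted_class: (errors_found, rows_audited)}
    scored_counts: {predicted_class: number of scored records in that class}
    """
    total_scored = sum(scored_counts.values())
    estimate = 0.0
    for predicted_class, n_scored in scored_counts.items():
        errors, n_audited = audit[predicted_class]
        within_class_rate = errors / n_audited
        estimate += (n_scored / total_scored) * within_class_rate
    return estimate


# Hypothetical audit: 40 rows sampled from each predicted class.
audit = {"predicted_in": (4, 40), "predicted_out": (1, 40)}
# Hypothetical scored file: the predicted classes are not equally common.
scored_counts = {"predicted_in": 3000, "predicted_out": 1000}

raw_rate = (4 + 1) / (40 + 40)  # 5 errors in 80 audited rows
weighted_rate = weighted_population_error(audit, scored_counts)
```

With these made-up inputs the raw rate is 5/80, while the weighted estimate is 0.75 × 0.10 + 0.25 × 0.025. The two differ because the class with the higher error rate makes up three quarters of the scored records but only half of the audit sample.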

The deliverable

The deliverable is an analysis-ready file and a reproducible final-test record. Together they carry the corrected category, the frozen model specification, the split manifest, and the test predictions. They include a population error estimate only when the audit design and weighting support one.

Provenance from our work

This step comes from the Grocery Store Classifier Results Under Review article. The case begins with 400 hand labels.¹

The first reported adaptive score is 78 percent and the last is 94 percent. The same held-out set informed successive model changes, so the 94 percent figure is not an untouched final-test estimate.

The predicted-class audit samples equal counts from each predicted class. Agreement is 94 percent in the predicted-grocery class and 96 percent in the predicted-non-grocery class. The stratified audit records five errors, a raw count that does not establish a population error rate without weighting by predicted-class prevalence.

The narrative's error-type labels conflict with the displayed predicted-class counts, so the error-type record needs reconciliation.

The starting API collection contains 6,613 records. The article reports 4,847 grocery classifications and a 27 percent reclassification rate. The hand labels and full classified output are not public, and the full-population result is under reconciliation with the related policy articles.

A pinned public folder provides supporting code and acquisition instructions, but it does not reproduce the full-population classification.²

References

1. Cholette, V. (2025, October 29). Grocery store classifier results under review. Too Early To Say.
2. Grocery-store classifier materials at pinned public commit 3c338ae.