AI for Applied Researchers · Step 3 of 5

Data cleaning

This step is the pattern we use when a categorical field is too noisy to trust and the dataset is too large to clean by hand. It produces an analysis-ready column and an error rate we can report instead of hide.

The problem this step solves

We work with the dataset we have, not the one we wish we had. A public API or an agency export rarely arrives clean. Categories overlap, names are inconsistent, and a single label covers records that mean different things.

In a food-access build, Google Places returned thousands of results for "grocery store" across California counties. Many were not grocery stores. The same query that surfaced Safeway and Trader Joe's also returned 7-Eleven locations, gas-station minimarts, liquor stores, and restaurants with incidental grocery items.

For research, the mismatch changes the answer. A convenience store does not provide the same food-security value as a full-service supermarket, so counting it as one overstates access. This step gives us a repeatable way to separate the records that belong from the records that do not and to know how often we get it wrong.

When to use this step, and when not to

A classifier earns its place when a categorical field is unreliable and the dataset is too large to check by hand. The grocery example ran to 6,613 locations, beyond what manual review can cover, but a few hundred records can still be labeled by a person.

We skip this step when the data are small enough to clean directly, when a deterministic rule separates the categories cleanly, or when no feature carries signal about the true label. If business names, type tags, and counts tell us nothing about which records are real, a model will not invent the distinction.

Inputs required

Before we bring in an AI assistant, we assemble:

  • The raw export with the unreliable field, plus whatever attributes came attached to each record. In the grocery case these were business name, type tags, user rating, review count, and price level.
  • A hand-labeled sample large enough to train and check. We labeled 400 locations: 200 confirmed grocery stores and 200 confirmed non-grocery records, drawn from convenience stores, liquor stores, and miscategorized restaurants.
  • A source of ground truth for labeling. Each location was checked against its business website and Google Street View imagery, about two hours of work.
  • A coding environment with an AI assistant in the loop to write and revise the classifier.

The AI-assisted move

With a labeled set in hand, the assistant iterates through classifier specifications in minutes rather than hours. Each cycle writes the code, runs it, reports balanced accuracy on a held-out test set, and proposes the next change.

In the grocery build, that loop ran five times. Logistic regression on raw features reached 78 percent balanced accuracy. Adding name-based indicators such as "market," "liquor," and "gas" reached 84 percent. Parsing the type tags into binary indicators reached 88 percent. Including review-count interaction terms reached 91 percent. A tuned gradient-boosted model reached 94 percent.

Two observations drove most of the lift. Business names carry signal. "Safeway" and "Trader Joe's" differ systematically from "7-Eleven" and "Chevron Food Mart," so substring indicators help. Feature interactions matter as well. Review count weighs more for records tagged "convenience_store" than for those tagged "supermarket," which is the kind of conditional structure a boosted model captures better than a linear one. Feature importance confirmed the ranking, with the type tags type_supermarket and type_convenience_store leading, followed by log review count and the name indicators.

The assistant's job is to explore this space quickly. Our job is to decide which model and error rate are acceptable for the analysis.

Copy-paste protocol

Paste this into the AI assistant once a labeled sample exists. The goal is to drive the iterate-and-report loop and force a held-out check, so reported accuracy is not measured on the training rows.

You are helping me clean a categorical field in a dataset by training a
binary classifier, with an honest accuracy estimate.

CONTEXT
- File: labeled_sample.csv
- Target column: is_real (1 = the record truly belongs to the category,
  0 = it does not). This was hand-labeled.
- Feature columns: [list every attribute, e.g. business_name, type_tags,
  user_rating, review_count, price_level]
- Full unlabeled file to score later: all_records.csv

DO THIS
1. Load labeled_sample.csv. Report class balance and any missing values.
2. Split into train and held-out test sets (80/20), stratified on is_real.
   Fix the random seed and print it.
3. Engineer features explicitly. For text columns, create substring
   indicators for tokens you expect to separate the classes, and tell me
   which tokens you chose and why. Parse multi-value tag columns into
   binary indicators.
4. Iterate, reporting BALANCED accuracy on the held-out test set each time:
   (a) logistic regression on raw numeric features,
   (b) add the name/text indicators,
   (c) add the parsed tag indicators,
   (d) a tree-based model (e.g. XGBoost) with light tuning.
   After each step print the balanced accuracy and what changed.
5. Print feature importances for the final model.
6. Do NOT report accuracy on rows the model trained on. If you ever do,
   flag it.

STOP after step 6. Do not score all_records.csv yet. I will review the
held-out accuracy and the feature list before we apply the model.

Failure check and validation

Held-out accuracy is necessary but not sufficient. Before trusting the cleaned data, we draw a fresh sample of the model's predictions and check them by hand against the same ground truth used for labeling.

In the grocery build we spot-checked 50 predictions from each category. Forty-seven of 50 predicted-grocery records were truly grocery, about 94 percent. Forty-eight of 50 predicted-non-grocery records were truly non-grocery, about 96 percent.

The errors matter as much as the averages. The three false negatives were ethnic markets with non-English names that lacked the usual "market" or "grocery" substrings, a real blind spot in the name features. The two false positives were a Walgreens and a CVS, large stores that arguably provide some food access. The boundary there is genuinely ambiguous rather than simply wrong.

The validation step tells us that roughly 6 percent misclassification remains, a number we can report rather than hide.

The deliverable

The deliverable is an analysis-ready file with the corrected category and a stated error rate attached to it. Applied to the full 6,613 locations, the classifier marked 4,847 as grocery stores, about 73 percent, and 1,766 as non-grocery, about 27 percent.

That 27 percent is the size of the error we would have carried into the analysis had we trusted the raw "grocery store" label. Food access would have been overstated by about that much. The cleaned column, the documented rule, the held-out accuracy, and the spot-check together show how the data were repaired and how much residual error remains.

Provenance from our work

This step is the classifier behind our published grocery-store validation. It used 400 hand labels, a five-step climb from 78 percent to 94 percent balanced accuracy, a held-out spot-check near 95 percent, and a documented finding that 27 percent of the raw "grocery store" results were not grocery stores. Code, labels, and acquisition steps are public.

Read the full account: "400 Labels to 94 Percent Accuracy: Validating Grocery Store Data."