Classifying Medicaid billers as potential fraud cases sounds like a textbook machine learning problem. Load the data, train a classifier (a model that assigns observations to categories, like "fraud" or "legitimate"), report accuracy, done. But a 10:1 class imbalance, temporal data leakage risks, and the difference between AUC-ROC and precision@k make it anything but textbook. Consider: 94% accuracy means little when roughly 91% of billers are legitimate. A model that predicts every single biller is legitimate scores 91% accuracy and catches zero fraud. So which metrics do matter? And how do we build a classifier that performs well on those metrics instead?
Several approaches handle class imbalance: SMOTE generates synthetic minority examples, LightGBM and CatBoost offer built-in class-weight handling with fast training, and deep tabular models can work at scale. Here we'll combine Random Forest with controlled undersampling for transparency, since that combination makes it easier to see exactly how each piece of the pipeline affects performance. The Alternatives section covers other options in more detail.
This walkthrough covers a full scikit-learn pipeline (Python's standard machine learning library), from stratified splitting through forward-chaining temporal cross-validation, using the evaluation framework that surfaces real predictive signal in highly imbalanced data. If logistic regression is familiar territory but random forests and gradient boosting are not, this tutorial bridges that gap.
To build a classifier that performs well on those better metrics, we need tools that handle imbalance explicitly and give us multiple evaluation lenses. Here is the technical stack.
Tool Stack: scikit-learn, XGBoost, and Temporal CV
| Component | Tool | Purpose |
|---|---|---|
| Models | RF, XGBoost, LogisticRegression | Three classifiers: one familiar, two tree-based |
| Imbalance | Undersampling + class weights | Corrects for rare positives in training data |
| Validation | Forward-chaining temporal CV | Train on past, validate on future (no data leakage) |
| Metrics | AUC-ROC, PR-AUC, precision@k | Multiple evaluation lenses |
All code runs in Python 3.10+ with scikit-learn 1.3, XGBoost 2.0, pandas, and numpy.
Step 1: Load Data and Sanity-Check It
Before anything else, let's look at what we're working with. The dataset contains ~38,000 Medicaid biller records with 25 continuous features (billing volumes, claim patterns, geographic indicators) and 1 categorical feature (provider type). The target variable is a binary flag: excluded (fraud/abuse finding) or not.
import pandas as pd
import numpy as np
df = pd.read_csv("medicaid_billers.csv")
print(f"Shape: {df.shape}")
print(f"Class distribution:\n{df['excluded'].value_counts(normalize=True)}")
print(f"Missing values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
A few things to check immediately:
- Class balance. If roughly 9-10% of records are positive (excluded), we're looking at a ~10:1 imbalance. Not extreme by fraud-detection standards, but enough to make accuracy useless as a metric.
- Missing values. Any feature with >30% missingness probably needs to be dropped or imputed with care. Median imputation works for continuous features here; for the categorical feature, a dedicated "Unknown" category avoids information loss.
- Outliers. Billing-volume features often have extreme right tails. Let's leave them unclipped for now. Tree-based models handle skew well, and outliers in fraud data are often the signal, not the noise.
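The median/"Unknown" strategy from the checklist above takes only a few lines. A minimal sketch, assuming a pandas DataFrame; the column names here are illustrative stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Toy frame with one continuous and one categorical feature (hypothetical names)
df = pd.DataFrame({
    'total_claims_amount': [100.0, None, 250.0, 90.0],
    'provider_type': ['clinic', None, 'pharmacy', 'clinic'],
})

# Median imputation for continuous features (median skips NaN by default)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Dedicated "Unknown" category for the categorical feature
df['provider_type'] = df['provider_type'].fillna('Unknown')

print(df)
```

Keeping "Unknown" as its own category lets tree-based models split on missingness itself, which in fraud data can carry signal.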
Step 2: Split with Stratification and Per-Fold Undersampling
Cross-validation splits the data into "folds," training the model on some folds and testing on the held-out fold, then rotating. It gives us a more honest estimate of performance than a single train/test split. But this is where the first subtle mistake typically happens. It's tempting to undersample the majority class once, globally, then split into folds. But that approach means every fold trains on the same subset of negatives. The model memorizes those specific controls rather than learning generalizable patterns.
Instead, undersampling needs to happen within each fold, with a deterministic but fold-specific seed. Notice in the code below that np.random.RandomState(42 + fold_idx) ensures each fold draws a different subset of negatives while remaining fully reproducible. A global np.random.seed(42) before the loop would reset to the same state if any operation consumes random numbers unpredictably, creating a silent reproducibility bug.
from sklearn.model_selection import StratifiedKFold

X = df.drop(columns=['excluded'])
y = df['excluded']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
RATIO = 3  # 3 negatives per positive

for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train_full, y_train_full = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    # Per-fold deterministic undersampling
    train_fraud = X_train_full[y_train_full == 1]
    train_legit = X_train_full[y_train_full == 0]
    rng = np.random.RandomState(42 + fold_idx)
    n_neg_train = min(len(train_fraud) * RATIO, len(train_legit))
    train_neg_idx = rng.choice(len(train_legit), size=n_neg_train, replace=False)
    X_train = pd.concat([train_fraud, train_legit.iloc[train_neg_idx]])
    y_train = pd.concat([
        y_train_full[y_train_full == 1],
        y_train_full[y_train_full == 0].iloc[train_neg_idx]
    ])
Step 3: Handle Class Imbalance (Belt and Suspenders)
Why use both undersampling and class weights? Undersampling changes the data distribution the model sees. Class weights change how the model penalizes errors. They operate on different mechanisms, and in practice the combination tends to yield better calibrated probability estimates for the minority class. This matters because our evaluation metrics all depend on predicted probabilities (each model outputs a fraud probability between 0 and 1 via predict_proba), not hard yes/no classifications.
Undersampling alone gets us partway there. Adding class_weight='balanced' to the classifier applies a second layer of correction at the loss-function level. This belt-and-suspenders approach tends to outperform either technique alone on PR-AUC:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200, max_depth=12, min_samples_leaf=5,
    class_weight='balanced',  # Adjusts loss to penalize minority-class errors more
    random_state=42, n_jobs=-1
)
A note on the 3:1 undersampling ratio: we tried 1:1, 2:1, 3:1, and 5:1 in preliminary experiments. A 1:1 ratio discards roughly 90% of the majority class, throwing away potentially useful patterns among legitimate billers. The 3:1 ratio preserves more majority-class information while still reducing imbalance enough for the models to attend to the minority class. PR-AUC peaked at 3:1 for random forest and XGBoost.
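The ratio sweep itself is mechanical. Below is a self-contained sketch on synthetic data; the synthetic frame, the 50-tree forest, and the single train/validation split are all stand-ins for the real per-fold pipeline described in Steps 2-5:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the biller data, with a rare positive class
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(2000, 5))
y_syn = (X_syn[:, 0] + rng.normal(scale=2, size=2000) > 2.5).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

scores = {}
for ratio in [1, 2, 3, 5]:
    pos = np.where(y_tr == 1)[0]
    neg = np.where(y_tr == 0)[0]
    n_neg = min(len(pos) * ratio, len(neg))
    keep = np.concatenate(
        [pos, np.random.RandomState(42).choice(neg, n_neg, replace=False)]
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    clf.fit(X_tr[keep], y_tr[keep])
    scores[ratio] = average_precision_score(y_va, clf.predict_proba(X_va)[:, 1])

print(scores)  # PR-AUC per candidate ratio
```

On the real data, the same loop (with the full per-fold undersampling and all three models) is what produced the 3:1 result reported above.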
Step 4: Train Three Models
A natural question: why not just use XGBoost? In fraud detection, interpretability carries weight. Logistic regression coefficients can be directly explained to auditors. Random forests offer feature importance that maps to investigative priorities. XGBoost often wins on raw metrics, yet its explanations require SHAP or similar post-hoc tools. Running all three lets us see whether the performance gap justifies the interpretability cost.
Let's compare three classifiers. Logistic regression is the workhorse most economists already know. A random forest aggregates hundreds of decision trees (each one a series of if/then splits on different features) and averages their predictions, which tends to reduce overfitting. XGBoost builds trees sequentially, where each new tree focuses on the cases the previous trees got wrong. Both tree-based methods capture nonlinear relationships and feature interactions that logistic regression misses.
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Logistic Regression - the familiar baseline
lr = LogisticRegression(
    class_weight='balanced', max_iter=1000, random_state=42
)

# Random Forest - averages many decision trees to reduce overfitting
rf = RandomForestClassifier(
    n_estimators=200, max_depth=12, min_samples_leaf=5,
    class_weight='balanced', random_state=42, n_jobs=-1
)

# XGBoost - builds trees sequentially, each correcting prior errors
# (use_label_encoder was removed in XGBoost 2.0, so it is not passed here)
scale_pos = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1,
    scale_pos_weight=scale_pos, random_state=42,
    eval_metric='aucpr'
)

models = {'LogisticRegression': lr, 'RandomForest': rf, 'XGBoost': xgb}
for name, model in models.items():
    model.fit(X_train, y_train)
Step 5: Evaluate with Metrics That Matter
This is where the pipeline diverges from textbook ML. Why three metrics? Each tells a different story, and together they reveal whether the model is actually useful or just good at a narrow statistical game. Let's look at all three:
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_score
)

results = {}
for name, model in models.items():
    y_score = model.predict_proba(X_val)[:, 1]
    auc_roc = roc_auc_score(y_val, y_score)
    pr_auc = average_precision_score(y_val, y_score)

    # Precision at top k%
    prec_at_k = {}
    for pct in [1, 5, 10]:
        threshold = np.percentile(y_score, 100 - pct)
        y_pred = (y_score >= threshold).astype(int)
        prec = precision_score(y_val, y_pred, zero_division=0)
        prec_at_k[f'P@{pct}%'] = prec

    results[name] = {
        'AUC-ROC': auc_roc,
        'PR-AUC': pr_auc,
        **prec_at_k
    }

pd.DataFrame(results).T.round(3)
- AUC-ROC (0.92-0.96 in our runs) measures discrimination across all thresholds. It answers: "Can the model rank positives above negatives?" Robust, but can be overly optimistic with class imbalance because the false positive rate denominator is huge.
- PR-AUC (0.85-0.91) focuses on the positive class. It answers: "When the model says fraud, how often is it right, and how many frauds does it find?" Almost always more informative than AUC-ROC for imbalanced problems.
- Precision@k% (~60-70% at k=5%) answers the operational question: "If we can only investigate the top 5% of flagged billers, what fraction are actually fraudulent?" This is the metric that maps directly to resource allocation. An investigative unit with 20 analysts cares about the precision of their caseload, not the full ROC curve.
To put precision@5% in concrete terms: a precision-at-5% of 65% means that out of roughly 1,900 flagged billers, about 1,230 would actually be fraudulent -- a substantial improvement over random selection, which at a 10% base rate would yield only about 190 true positives in that same group.
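The arithmetic behind that comparison is easy to verify:

```python
# Back-of-the-envelope check of the precision@5% figures above
n_population = 38_000               # approximate biller count from Step 1
flagged = int(n_population * 0.05)  # top 5% -> 1,900 billers
model_hits = round(flagged * 0.65)  # at 65% precision -> 1,235 true positives
random_hits = round(flagged * 0.10) # random selection at the 10% base rate -> 190

print(flagged, model_hits, random_hits)  # 1900 1235 190
```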
Overall accuracy, by contrast, is misleading here. A model that predicts every biller as legitimate scores 91% accuracy and catches zero fraud. PR-AUC and precision@k are the metrics that actually reflect operational value in imbalanced settings.
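The trap is quick to demonstrate directly, here on a toy vector with the same class balance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 91% legitimate (0), 9% excluded (1) -- the imbalance described in Step 1
y_true = np.array([0] * 91 + [1] * 9)
y_all_legit = np.zeros_like(y_true)  # predict "legitimate" for everyone

acc = accuracy_score(y_true, y_all_legit)                 # 0.91: looks respectable
rec = recall_score(y_true, y_all_legit, zero_division=0)  # 0.0: zero fraud caught
print(acc, rec)
```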
Step 6: Compare Against a Domain-Knowledge Baseline
A model's metrics are uninterpretable without a baseline, and not a random baseline. We need a domain-knowledge baseline. What's the simplest rule an experienced auditor might use?
# Single-feature baseline: flag billers above 95th percentile
# on total_claims_amount
threshold_95 = X_val['total_claims_amount'].quantile(0.95)
y_baseline = (X_val['total_claims_amount'] >= threshold_95).astype(int)
baseline_prec = precision_score(y_val, y_baseline, zero_division=0)
print(f"Baseline precision (top 5% by claims): {baseline_prec:.3f}")
If the random forest's precision@5% is 65% and the single-feature baseline hits 40%, we can say the model adds 25 percentage points of precision: a concrete, defensible improvement. If the baseline hits 60%, the model's marginal value is slim and the complexity may not be justified. Without this comparison, reporting "65% precision" floats in a vacuum.
So far, Steps 2 through 6 used stratified k-fold cross-validation: the data is split into folds randomly, with stratification ensuring each fold preserves the original class balance. That tells us the model works on a representative sample of the data, but not whether it will keep working as billing patterns change year over year. For temporal data, we need a fundamentally different splitting strategy.
Step 7: Test Whether the Model Holds Up Over Time
The stratified k-fold cross-validation in Step 2 splits data randomly while preserving class balance -- it's a strong approach for estimating general predictive performance. But for data with a time dimension, random splits have a fatal flaw: they can train on future observations to predict past ones. If billing patterns shift year over year (and they do, given policy changes, new fraud schemes, and pandemic disruptions), random CV will overestimate performance.
Forward-chaining temporal CV takes a different approach entirely. Instead of random splits preserving class balance, it splits strictly by time: each fold trains only on earlier years and validates on the next year forward. The training set grows with each fold, mimicking how a production model would actually be retrained. This means the model never sees the future during training, which gives us a more honest estimate of how well it will generalize to new data.
temporal_folds = [
    {'train': (2018, 2019), 'val': 2020},  # Fold 1
    {'train': (2018, 2020), 'val': 2021},  # Fold 2
    {'train': (2018, 2021), 'val': 2022},  # Fold 3
    {'train': (2018, 2022), 'val': 2023},  # Fold 4
    {'train': (2018, 2023), 'val': 2024},  # Fold 5
]

temporal_results = []
for fold in temporal_folds:
    train_mask = df['year'].between(*fold['train'])
    val_mask = df['year'] == fold['val']
    X_train_t, y_train_t = X[train_mask], y[train_mask]
    X_val_t, y_val_t = X[val_mask], y[val_mask]
    # Apply per-fold undersampling (same logic as Step 2), then train
    rf.fit(X_train_t, y_train_t)
    y_score_t = rf.predict_proba(X_val_t)[:, 1]
    fold_metrics = {
        'val_year': fold['val'],
        'AUC-ROC': roc_auc_score(y_val_t, y_score_t),
        'PR-AUC': average_precision_score(y_val_t, y_score_t),
    }
    temporal_results.append(fold_metrics)
If AUC-ROC is stable across folds (say, 0.93 +/- 0.02), the model seems to generalize well over time. If it degrades in later folds, something may be shifting: concept drift, policy changes, or data quality issues. That's valuable to know before deployment.
What Can Go Wrong
Having worked through the pipeline, let's catalog the failure modes. Each one can silently destroy a model's real-world utility.
Global random seeds make fold draws order-dependent. If np.random.seed(42) is called once before a CV loop, each fold's undersample depends on every random operation that ran before it; change anything upstream and the folds silently change. Per-fold RandomState(42 + fold_idx) instances avoid this entirely: reproducible across reruns, distinct across folds.
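The per-fold seeding strategy is easy to check in isolation, with a standalone sketch:

```python
import numpy as np

# Each fold gets its own deterministic stream: draws differ across folds
# but are identical across reruns.
draws = [
    np.random.RandomState(42 + fold_idx).choice(1000, size=10, replace=False)
    for fold_idx in range(3)
]
reruns = [
    np.random.RandomState(42 + fold_idx).choice(1000, size=10, replace=False)
    for fold_idx in range(3)
]

print(draws[0])  # fold 0's negatives -- same every time the script runs
```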
Label leakage. Features like months_since_exclusion or exclusion_year encode the outcome directly. Any feature that could only be known after the label was assigned must be removed. This seems obvious but is easy to miss in wide feature sets assembled by different teams.
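A simple guard is to drop such columns explicitly before training; the demo frame below is hypothetical, with the two leaky column names taken from the text above:

```python
import pandas as pd

# Hypothetical frame mixing legitimate and leaky features
df_demo = pd.DataFrame({
    'total_claims_amount': [120.0, 450.0],
    'months_since_exclusion': [3.0, None],   # only knowable after the label
    'exclusion_year': [2021.0, None],        # encodes the outcome directly
})

leaky = ['months_since_exclusion', 'exclusion_year']
X_clean = df_demo.drop(columns=[c for c in leaky if c in df_demo.columns])
print(list(X_clean.columns))
```

An explicit denylist like this is crude but auditable; reviewing it with the teams that built the feature set catches leaks a purely automated check would miss.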
No domain-knowledge baseline. Without one, there is no way to assess whether the model provides value beyond a simple rule. Stakeholders will ask "couldn't we just flag the biggest billers?" We need a quantitative answer.
Random CV splits leak future information. A model trained on 2022 data and validated on 2019 data has seen the future. Forward-chaining temporal CV is the only valid approach for data with a time dimension.
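When the data lacks an explicit year column, scikit-learn's TimeSeriesSplit provides the same forward-chaining guarantee based purely on row order (the demo array here is synthetic):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(20).reshape(-1, 1)  # 20 rows, already sorted by time
folds = list(TimeSeriesSplit(n_splits=4).split(X_demo))

for train_idx, val_idx in folds:
    # Every training index precedes every validation index: no future leakage
    assert train_idx.max() < val_idx.min()
```

Unlike the year-based folds in Step 7, TimeSeriesSplit sizes folds by row count, so it assumes rows are evenly distributed over time.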
Undersampling before splitting. If the majority class is undersampled globally before creating CV folds, every fold sees the same negative examples. The model overfits to those specific controls, and cross-validation estimates become optimistic.
When to Use This Approach
This pipeline fits well when:
- The positive class is rare (1-15% prevalence)
- Data has a temporal dimension that matters
- The operational question is "who should we investigate first?" (ranked retrieval)
- Interpretability matters alongside performance
- The dataset is moderate-sized (thousands to low millions of records)
Less suitable when:
- Classes are roughly balanced (standard CV and accuracy work fine)
- The problem is purely predictive with no ranking/prioritization need
- Data volume exceeds millions of records and training time matters; gradient boosting on the full dataset with scale_pos_weight may outperform undersampling
Alternatives Worth Exploring
This pipeline is one way to handle class imbalance. Several other approaches are worth considering, depending on the dataset and operational constraints.
- LightGBM and CatBoost often match or exceed XGBoost with faster training. CatBoost handles categorical features natively, which avoids one-hot encoding overhead.
- SMOTE and its variants (Borderline-SMOTE, ADASYN) generate synthetic minority examples instead of discarding majority ones. Results are mixed in the literature; undersampling tends to be more robust for fraud-like problems where the minority class is heterogeneous.
- Deep learning (tabular transformers, TabNet) can work for very large datasets but rarely outperforms well-tuned gradient boosting on structured data below ~1M records.
Limitations
A few caveats worth noting:
- Undersampling discards information. With a 3:1 ratio and ~3,500 positives, we use roughly 10,500 of ~34,500 negatives per fold. The discarded negatives might contain patterns the model never sees. Ensemble approaches (training multiple models on different negative subsets) can mitigate this.
- Temporal CV reduces effective training data. Fold 1 trains on only two years. If the signal is noisy, early folds may underperform simply due to sample size, not model quality.
- Precision@k depends on the k. Reporting precision@5% assumes the investigation unit can handle 5% of the population. If capacity is 1% or 15%, the metric needs to shift accordingly. Always align k with operational reality.
- Feature engineering is not covered here. The pipeline assumes features are already constructed. In practice, feature engineering (interaction terms, rolling averages, provider-network features) often matters more than model choice.
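To keep k aligned with capacity, a small helper (illustrative, not part of the original script) makes it an explicit parameter:

```python
import numpy as np

def precision_at_k(y_true, y_score, k_pct):
    """Precision among the top k% highest-scoring observations."""
    y_true = np.asarray(y_true)
    n_top = max(1, int(round(len(y_score) * k_pct / 100)))
    top_idx = np.argsort(y_score)[::-1][:n_top]  # highest scores first
    return float(y_true[top_idx].mean())

# Capacity for the top 25% of 8 cases -> 2 investigations.
# The two highest-scoring cases here are both true positives.
print(precision_at_k([0, 1, 0, 1, 0, 0, 0, 0],
                     [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.2, 0.4], 25))  # 1.0
```

Rerunning this with k_pct set to the unit's actual caseload, rather than a conventional 5%, keeps the reported metric honest.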
Code Availability
The complete implementation is available in phase3_classifiers_v2.py, which includes all preprocessing, temporal CV, model training, and evaluation logic in a single reproducible script.
References and Notes
[1] AUC-ROC values of 0.92-0.96 and PR-AUC of 0.85-0.91 reflect performance across 5 temporal folds on the Medicaid biller dataset described here. Results will vary with different feature sets, imbalance ratios, and temporal structures.