We have 1.8 million providers, cleaned labels from the List of Excluded Individuals/Entities (LEIE) and state exclusion lists, and billing features from 84 months of Medicaid claims. The standard next step is to train a supervised classifier: give the model examples of excluded and non-excluded providers, let it learn which billing features distinguish them, and rank everyone by predicted probability of fraud.
Before jumping in, though, it helps to be clear about what a classifier can and cannot accomplish, given everything from the first three posts (1, 2, 3).
A note on what follows: this post describes the analytical framework and preliminary findings from an ongoing analysis. The full empirical results (model performance, SHAP decompositions, temporal validation) will appear in a forthcoming working paper. What we present here is the reasoning that guides the pipeline, grounded in what the published literature tells us to expect.
What Published Studies Achieve
The fraud detection literature reports a wide range of model performance, typically measured by AUC (area under the receiver operating characteristic curve), which captures how well a model ranks positive examples above negatives. Here are studies spanning the main approaches (supervised, unsupervised, and ensemble) and data sources (Medicare Part B, Part D, inpatient):
| Study | Data | Method | Key Metric |
|---|---|---|---|
| Shekhar, Leder-Luis & Akoglu (2026) | Medicare inpatient | Unsupervised ensemble | 8x lift over random |
| Johnson & Khoshgoftaar (2023) | Medicare Part B/D | XGBoost, Random Forest | AUC ~0.83 (imbalanced) |
| Tajrobehkar et al. (2024) | Medicare ophthalmology | Stacking ensemble | AUC 0.907 |
| Herland et al. (2018) | Medicare Part B + D + DMEPOS | Logistic Regression | AUC 0.816 |
These studies use Medicare data, which has richer fields than the public Medicaid file (submitted charges, place of service, provider specialty directly in the data). They also use various definitions of “fraud” drawn from the LEIE, with all the label problems we discussed in Post 2.
What should we expect from a classifier trained on public Medicaid data, with its seven columns and no diagnosis codes? Probably an AUC somewhere between 0.70 and 0.85. That’s the range where the model discriminates better than chance and better than sorting by total paid, while still producing substantial numbers of false positives at any reasonable decision threshold.
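AUC has a concrete probabilistic reading that is easy to verify by hand: it is the fraction of (positive, negative) pairs the model ranks correctly. A minimal sketch, with made-up risk scores for illustration:

```python
def auc(pos_scores, neg_scores):
    """AUC = probability a randomly chosen positive example is ranked
    above a randomly chosen negative one (ties count as half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores: 3 excluded providers, 4 non-excluded.
excluded = [0.9, 0.6, 0.4]
not_excluded = [0.8, 0.3, 0.2, 0.1]
print(auc(excluded, not_excluded))  # 10 of 12 pairs ranked correctly: ~0.833
```

An AUC of 0.833 here means one non-excluded provider (score 0.8) outranks two excluded ones, which is exactly the kind of ranking error that generates false positives at the top of a flagged list.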
What SHAP Tells Us About the Labels
Often the most informative output from a supervised model is the feature importance structure rather than the predictions themselves. SHAP values (SHapley Additive exPlanations) come from the same cooperative game theory that economists use to allocate contributions among players in a coalition [5]. The logic works like a decomposition: just as we might decompose a wage gap into portions attributable to education, experience, and industry, SHAP decomposes each fraud prediction into the contribution of each billing feature. Every feature gets a signed credit: positive if it pushed the prediction toward “fraud,” negative if it pushed away. Those credits sum exactly to the total prediction.
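The additivity property can be demonstrated with an exact, brute-force Shapley computation on a toy scoring function. The features and weights below are invented for illustration; production implementations such as shap's TreeExplainer compute the same quantity efficiently for tree ensembles.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley decomposition of f(x) relative to a baseline input.
    Features outside a coalition are held at their baseline values."""
    n = len(x)
    def value(coalition):
        return f([x[i] if i in coalition else baseline[i] for i in range(n)])
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        credit = 0.0
        for size in range(n):
            for S in combinations(others, size):
                # Standard Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                credit += weight * (value(set(S) | {i}) - value(set(S)))
        phi.append(credit)
    return phi

# Toy "fraud score": linear in billing volume, plus an interaction
# between concentrated coding and temporal spikes. Weights are invented.
def score(z):
    volume, concentration, spike = z
    return 0.5 * volume + 0.3 * concentration * spike

phi = shapley_values(score, x=[2.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # credits sum exactly to score(x) - score(baseline) = 1.3
```

Note how the interaction term's credit is split evenly between the two features that produce it; that split, not any single coefficient, is what the signed credits report.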
This is where the label contamination problem from Post 2 shows up in the model’s behavior.
If we train a classifier using all LEIE exclusion types as the positive class, the top SHAP features all point in the same direction: lower total claims, lower total paid, and fewer unique beneficiaries push toward the “excluded” prediction. Two mechanisms contribute. Most exclusions in the training data are license revocations (1128(b)(4)) rather than fraud convictions, and these providers tend to be smaller practices. Additionally, providers excluded during the 2018-2024 panel have shorter observation windows, which mechanically deflates their panel totals. The model may be learning observation window length as much as billing behavior.
When we restrict labels to fraud-specific exclusion codes and retrain, the SHAP directions reverse. Higher billing volume, more concentrated procedure codes, and sharper temporal spikes push toward the “fraud” prediction. Part of this reversal reflects cleaner labels. Part reflects that fraud prosecutions take longer than administrative license revocations, so fraud-excluded providers tend to have longer observation windows in the panel. Separating these two mechanisms requires comparing per-month billing rates rather than totals. When we do, fraud-excluded providers bill at roughly 140 claims per month regardless of when enforcement acted — more than double the non-excluded rate of 63. The billing intensity signal is real; panel totals distort its magnitude.
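The correction is simple arithmetic: normalize each provider's totals by the length of their observed window. A sketch using the rates quoted above (the claim counts are constructed to reproduce those rates and are illustrative only):

```python
def claims_per_month(total_claims, first, last):
    """Per-month billing rate over an inclusive (year, month) window."""
    months = (last[0] - first[0]) * 12 + (last[1] - first[1]) + 1
    return total_claims / months

# A provider excluded mid-panel looks small in panel totals, not in rate:
truncated = claims_per_month(4200, (2018, 1), (2020, 6))    # 30 observed months
full_panel = claims_per_month(5292, (2018, 1), (2024, 12))  # all 84 months
print(truncated, full_panel)  # 140.0 63.0
```

The truncated provider's panel total (4,200 claims) is below the full-panel provider's (5,292), yet its monthly rate is more than double. Any feature built on totals inherits that distortion.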
Other studies confirm this pattern. SHAP combined with unsupervised anomaly detection on Belgian GP data uncovered a previously unknown billing trend [6]. In one ophthalmology fraud model, a constructed ratio (total payments divided by total patients) emerged as the most predictive SHAP feature, more informative than any raw variable [3]. The point worth sitting with: SHAP shows what the model actually learns, which may diverge from what we intended.
If SHAP shows us what the model learns, the next question is whether the model itself needs to be complex.
Does ML Actually Beat Logistic Regression?
Most fraud detection papers compare machine learning (ML) methods more complex than logistic regression: random forests, gradient-boosted trees, and neural networks, reporting whichever achieves the highest AUC. Few compare against logistic regression itself, the simplest baseline, and that omission matters.
A systematic review examined exactly this question across 71 clinical prediction studies [7]. At low risk of bias (proper validation, no data leakage, adequate sample size), the difference in discrimination between ML and logistic regression was zero: 0.00, with a 95% confidence interval from -0.18 to +0.18. ML appeared superior only in studies with high risk of bias, where methodological shortcuts inflated performance.
That review covered clinical prediction tasks (mortality, readmission, diagnosis) rather than fraud detection specifically. Fraud may involve more complex feature interactions that tree-based methods capture. Still, the burden of proof should run the other direction: demonstrate that the complex model outperforms the simple one on properly validated data, rather than assume it does.
Why does this matter for fraud detection on Medicaid data? If logistic regression matches XGBoost or random forest in discrimination, the simpler model wins. A state Medicaid Fraud Control Unit can explain to a judge exactly why logistic regression flagged a provider. It runs without GPU clusters or hyperparameter tuning pipelines. Analysts who understand regression coefficients can update it directly.
A broader argument in Nature Machine Intelligence makes the same point: for high-stakes decisions, interpretable models should be the default unless a black box demonstrably outperforms them [8]. In fraud detection, that “demonstrably outperforms” threshold is rarely met.
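The baseline check itself is cheap to run. A sketch on synthetic imbalanced data (all dataset parameters are arbitrary; real features would come from the claims panel, and nothing about these synthetic results predicts the Medicaid comparison):

```python
# Compare the simple baseline against a gradient-boosted model
# on synthetic, heavily imbalanced data (~1% positive class).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

for model in (LogisticRegression(max_iter=1000),
              GradientBoostingClassifier(random_state=0)):
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc, 3))
```

If the two AUCs are within noise of each other on properly validated data, the interpretable model should win the deployment decision.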
Temporal Validation: Why Cross-Validation Lies
A subtlety trips up much of the fraud detection literature: random k-fold cross-validation produces optimistically biased performance estimates. Why? Temporal leakage. If we train on 2022 data and test on 2020 data, the model can learn patterns that only became apparent after the fact. In real deployment, we always predict forward in time.
How big is the gap? A systematic review of 2,030 external validations of cardiovascular prediction models found a median 11% decrease in discrimination, with performance drops ranging from near-zero to over 30% depending on how different the validation population was from the development data [12]. In fraud detection, where temporal shifts in billing patterns and enforcement priorities are the norm, 0.05 to 0.15 AUC points is a reasonable estimate of the gap. A model reporting 0.85 on k-fold may perform at 0.70 to 0.80 on genuinely future data.
Proper temporal validation splits the data by time: train on 2018-2021, validate on 2022, test on 2023-2024. This mimics how the model would actually be used. Grouped cross-validation by provider National Provider Identifier (NPI) adds a second safeguard, preventing within-provider leakage where the model memorizes individual billing trajectories rather than learning generalizable patterns.
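One way to combine both safeguards fits in a few lines. A minimal two-way sketch over provider-year records (the dict schema is hypothetical, and the real pipeline adds a validation window between train and test):

```python
def temporal_grouped_split(records, train_end, test_start):
    """Train strictly on the past, test strictly on the future, and
    drop test records from providers already seen in training so the
    model cannot memorize individual billing trajectories (NPI grouping)."""
    train = [r for r in records if r["year"] <= train_end]
    trained_npis = {r["npi"] for r in train}
    test = [r for r in records if r["year"] >= test_start
            and r["npi"] not in trained_npis]
    return train, test

records = [
    {"npi": "A", "year": 2019}, {"npi": "A", "year": 2023},
    {"npi": "B", "year": 2023}, {"npi": "C", "year": 2020},
]
train, test = temporal_grouped_split(records, train_end=2021, test_start=2023)
# Provider A appears in both eras, so its 2023 records are held out of test.
print([r["npi"] for r in train], [r["npi"] for r in test])  # ['A', 'C'] ['B']
```

Dropping overlapping providers from the test set is a deliberately strict variant of NPI grouping; looser designs keep those records but report performance on unseen providers separately.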
Temporal data leakage is one of the most common methodological problems in fraud detection, and several recent reviews have flagged it. Most published studies skip temporal validation entirely, which means their reported AUCs are upper bounds on real-world performance, likely overstating how a model would perform in deployment.
What a Modest Result Actually Means
Say the classifier produces an AUC of 0.75, right in the middle of the expected range. What does that actually mean in practice?
At a 0.1% base rate, an AUC of 0.75 means the model ranks truly excluded providers above non-excluded ones 75% of the time. That's better than random, and better than sorting providers by total paid. But with roughly one true positive per thousand providers, any threshold loose enough to catch most excluded providers will also flag many legitimate ones.
This is where the cost-benefit math gets concrete. Medicaid Fraud Control Unit (MFCU) investigations recovered $1.2 billion in fiscal year (FY) 2023 from $369 million in expenditures, a return of roughly $3.25 per dollar spent [11]. If a classifier generates a ranked list, investigators need each flagged case to yield enough expected recovery to justify the investigation cost. At 80% precision, every five flags produce four true positives and one false positive; at 50% precision, every two flags produce one of each. And those false positives carry real costs: a provider under investigation loses referrals, faces reputational harm, and may reduce services to vulnerable populations.
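The break-even condition can be written down directly. A sketch with entirely hypothetical cost and recovery figures:

```python
def expected_net(precision, flags, recovery_per_case, cost_per_investigation):
    """Expected net return from working a ranked list: every flag costs
    an investigation, but only true positives recover anything."""
    true_positives = precision * flags
    return true_positives * recovery_per_case - flags * cost_per_investigation

# Illustrative figures only: 100 flags, $150k average recovery per proven
# case, $30k to investigate each flag.
print(expected_net(0.80, 100, 150_000, 30_000))  # 9000000.0
print(expected_net(0.50, 100, 150_000, 30_000))  # 4500000.0
```

Halving precision halves the recoveries but leaves the investigation costs untouched, which is why precision at the top of the list, not AUC, drives the economics.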
Few papers adopt this framing, but a modest AUC on public data is itself informative. It quantifies the ceiling of what this dataset can support. Understanding where prediction breaks down tells us where better data or different methods are needed [9]. ML has diagnosed physician decision-making errors, revealing that doctors simultaneously overtest low-risk patients and undertest high-risk ones [10]. Those predictions were informative precisely because the gap between model recommendations and physician behavior exposed systematic errors that traditional methods miss. The same logic applies here: where a classifier disagrees with investigator priorities may reveal more than where it agrees.
For Medicaid fraud detection, the informative finding may be: public billing data with seven columns and no diagnosis codes supports screening, though it falls short of adjudication. The ceiling is real. Pushing past it requires the restricted Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (the full claims data with diagnosis codes and demographics, available only to approved researchers), clinical records, or entirely different approaches.
Supervised and Unsupervised: Complements
Up to this point we’ve focused on supervised classification. But fraud detection also has an unsupervised side, and the distinction matters:
| Dimension | Supervised | Unsupervised |
|---|---|---|
| Detects | Known patterns from historical labels | Novel anomalous behavior |
| Labels needed | Yes (LEIE, state exclusion lists) | No |
| Bias risk | Inherits enforcement bias from labels | Less susceptible |
| Strength | High precision for established fraud types | Can discover new schemes |
| Weakness | Misses novel fraud; reflects historical priorities | Higher false positive rate |
The unsupervised ensemble we discussed in Post 3 achieved 8x lift over random selection for Medicare inpatient fraud [1]. The approach sidesteps the label problem entirely: instead of learning “what do excluded providers look like,” it asks “which providers look most unlike their peers?” That reframing bypasses the enforcement bias baked into LEIE labels, though it introduces its own difficulty: peer groups must be well-defined, and anomalous billing is not the same as fraud.
In practice, supervised and unsupervised methods serve different roles in the same pipeline. A supervised classifier trained on fraud-specific labels can screen for known patterns. An unsupervised anomaly detector can flag providers whose billing is unusual for reasons the labels never captured. Both still depend on the investigator who examines the medical records.
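The "unlike their peers" framing can be sketched with nothing more than z-scores within specialty groups. Real ensembles use stronger detectors and multiple feature dimensions, and an anomalous score still is not evidence of fraud; the fields below are hypothetical.

```python
from statistics import mean, stdev

def peer_anomaly_scores(providers):
    """Score each provider by how far its billing rate sits from its
    specialty peers, measured in standard deviations."""
    by_specialty = {}
    for p in providers:
        by_specialty.setdefault(p["specialty"], []).append(p)
    scores = {}
    for peers in by_specialty.values():
        rates = [p["claims_per_month"] for p in peers]
        mu = mean(rates)
        sigma = stdev(rates) if len(rates) > 1 else 1.0
        for p in peers:
            # Guard against a degenerate peer group with zero variance.
            scores[p["npi"]] = (p["claims_per_month"] - mu) / (sigma or 1.0)
    return scores

providers = [
    {"npi": "A", "specialty": "derm", "claims_per_month": 60},
    {"npi": "B", "specialty": "derm", "claims_per_month": 70},
    {"npi": "C", "specialty": "derm", "claims_per_month": 65},
    {"npi": "D", "specialty": "derm", "claims_per_month": 400},  # outlier
]
scores = peer_anomaly_scores(providers)
print(max(scores, key=scores.get))  # "D"
```

Notice that the whole sketch needs no labels at all, which is exactly why the peer-group definition carries so much weight: a miscast specialty makes an ordinary practice look extreme.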
What Does All This Tell Us?
What have these four posts been building toward? One thread: the distance between what we can observe in public Medicaid billing data and what we would need to observe to identify fraud.
The data is real, large, and monthly (Post 1). The labels are noisy, enforcement-biased, and capture a non-random subset of problematic providers (Post 2). Fraud-excluded providers are not small practices — they bill at more than double the non-excluded rate when measured per month — but enforcement truncation makes them appear small in panel totals, and billing volume in either direction proxies population need better than fraud risk (Post 3). And a supervised classifier trained on these labels and features produces modest discrimination bounded by data and label quality, matches logistic regression when validated honestly, and tells us more through its SHAP explanations than through its predictions (this post).
The data still has value. It works as a screening tool: useful for generating leads that investigators with clinical expertise and legal authority can pursue through medical record review. The risk, visible in real time as thousands of amateur analysts race to flag providers by name and address, is treating a screening tool as a verdict.
Fraud detection that works requires better labels (prospective audit programs rather than post-hoc exclusion lists), better data (diagnosis codes, clinical context, beneficiary demographics), better methods (positive-unlabeled learning that accounts for undetected fraud, unsupervised ensembles, equity audits), and better governance (false positive tracking, disparate impact analysis, appeal mechanisms). The public Medicaid spending file is one input to that system, and only one.
References
1. Shekhar, S., Leder-Luis, J., & Akoglu, L. (2026). Can machine learning target health care fraud? Evidence from Medicare hospitalizations. Journal of Policy Analysis and Management, 45(1).
2. Johnson, J.M. & Khoshgoftaar, T.M. (2023). Data-centric AI for healthcare fraud detection. SN Computer Science, 4(4), 389.
3. Tajrobehkar, M. et al. (2024). Utilization analysis and fraud detection in Medicare via machine learning. medRxiv, 2024.12.30.24319784. [Preprint.]
4. Herland, M. et al. (2018). Big data fraud detection using multiple Medicare data sources. Journal of Big Data, 5, 29.
5. Lundberg, S.M. & Lee, S.I. (2017). A unified approach to interpreting model predictions. NeurIPS, 30.
6. De Meulemeester, H. et al. (2025). SHAP-enhanced anomaly detection in healthcare billing. BMC Medical Informatics and Decision Making.
7. Christodoulou, E. et al. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22.
8. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206-215.
9. Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324), 483-485.
10. Mullainathan, S. & Obermeyer, Z. (2022). Diagnosing physician error: A machine learning approach to low-value health care. Quarterly Journal of Economics, 137(2), 679-727.
11. OIG OEI-09-24-00200. (2024). Medicaid Fraud Control Units fiscal year 2023 annual report.
12. Wessler, B.S. et al. (2021). External validations of cardiovascular clinical prediction models: A large-scale review of the literature. Circulation: Cardiovascular Quality and Outcomes, 14(8), e007858.
Updated February 25, 2026. The original analysis compared panel totals without accounting for observation window length. Providers excluded during the panel had shorter billing histories by construction, deflating their totals. Per-month billing rates — which correct for this — show fraud-excluded providers bill at more than double the non-excluded rate regardless of when enforcement occurred. Revised text reflects this correction.
Suggested Citation
Cholette, V. (February 2026). Can a Classifier Find What Investigators Miss? Too Early To Say. https://tooearlytosay.com/research/methodology/medicaid-fraud-classifier/