MEDICAID FRAUD SERIES — POST 4 OF 4

Can a Classifier Find What Investigators Miss?

We've mapped the data, cleaned the labels, and examined what billing patterns look like. Now the question: does machine learning add anything that a simpler approach can't?

We have 1.8 million providers, cleaned labels, and billing features from 84 months of claims. The natural next step is a supervised classifier: give the model examples of excluded and non-excluded providers, let it learn which billing features distinguish them, and see how well it ranks everyone. But before jumping in, it helps to be clear about what a classifier can and cannot do with the data we have, given everything from the first three posts (1, 2, 3).

What Published Studies Achieve

The fraud detection literature reports a wide range of model performance, typically measured by AUC (area under the receiver operating characteristic curve), which captures how well a model ranks positive examples above negatives. Here are studies spanning the main approaches (supervised, unsupervised, and ensemble) and data sources (Medicare Part B, Part D, inpatient):

| Study | Data | Method | Key Metric |
|---|---|---|---|
| Shekhar, Leder-Luis & Akoglu (2026) | Medicare inpatient | Unsupervised ensemble | 8x lift over random |
| Johnson & Khoshgoftaar (2023) | Medicare Part B/D | XGBoost, Random Forest | AUC ~0.83 (imbalanced) |
| Tajrobehkar et al. (2024) | Medicare ophthalmology | Stacking ensemble | AUC 0.907 |
| Herland et al. (2018) | Medicare Part B + D + DMEPOS | Logistic Regression | AUC 0.816 |

These studies use Medicare data, which has richer fields than the public Medicaid file (submitted charges, place of service, provider specialty directly in the data). They also use various definitions of “fraud” drawn from the LEIE, with all the label problems we discussed in Post 2.

What should we expect from a classifier trained on public Medicaid data, with its seven columns and no diagnosis codes? Probably an AUC somewhere between 0.70 and 0.85. That’s the range where the model discriminates better than chance and better than sorting by total paid, while still producing high false positive rates at any reasonable decision threshold.
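AUC has a concrete ranking interpretation worth keeping in mind when reading those numbers: it is the probability that a randomly chosen positive example outscores a randomly chosen negative one. A minimal sketch with synthetic scores (illustrative only, not the post's model) makes the equivalence explicit:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Illustrative scores: positives drawn from a higher-mean distribution,
# mimicking a classifier with moderate discrimination.
neg = rng.normal(0.0, 1.0, size=5000)   # non-excluded providers
pos = rng.normal(1.0, 1.0, size=500)    # excluded providers
y = np.concatenate([np.zeros(5000), np.ones(500)])
scores = np.concatenate([neg, pos])

auc = roc_auc_score(y, scores)
# AUC equals the probability that a random positive outscores a random
# negative (the Mann-Whitney statistic), computed here by brute force.
pairwise = (pos[:, None] > neg[None, :]).mean()
```

With these distributions the AUC lands around 0.76, squarely in the 0.70-0.85 range discussed above: clearly better than chance, nowhere near adjudication-grade.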

What SHAP Tells Us About the Labels

Often the most informative output from a supervised model is the feature importance structure rather than the predictions themselves. SHAP values (SHapley Additive exPlanations) come from the same cooperative game theory that economists use to allocate contributions among players in a coalition [5]. The logic works like a decomposition: just as we might decompose a wage gap into portions attributable to education, experience, and industry, SHAP decomposes each fraud prediction into the contribution of each billing feature. Every feature gets a signed credit: positive if it pushed the prediction toward “fraud,” negative if it pushed away. Those credits sum exactly to the total prediction.
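The additivity property can be verified directly on a toy model. This is a hypothetical sketch, not the post's pipeline: it computes exact Shapley values by brute force over coalitions, which is feasible only for a handful of features (SHAP's tree-specific algorithms exist precisely to make this tractable at scale):

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for one prediction, brute force over coalitions.
    Features outside a coalition are replaced by background-sample values."""
    d = len(x)
    phi = np.zeros(d)

    def value(S):
        # Coalition value: expected model output with features in S fixed
        # to x and the rest drawn from the background data.
        Xb = background.copy()
        Xb[:, list(S)] = x[list(S)]
        return f(Xb).mean()

    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size k.
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

rng = np.random.default_rng(1)
background = rng.normal(size=(200, 3))
# Toy model with an interaction term, standing in for a fitted classifier.
model = lambda X: 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 2]
x = np.array([1.0, -0.5, 2.0])

phi = shapley_values(model, x, background)
# Efficiency: the credits sum exactly to prediction minus background mean.
gap = model(x[None, :])[0] - model(background).mean()
```

The efficiency check (`phi.sum()` equals `gap`) is the "sum exactly to the total prediction" guarantee in the paragraph above, and it is what makes the wage-gap-style decomposition reading legitimate.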

SHAP feature directions reverse when labels are restricted to fraud-specific exclusion codes. With all LEIE types (left), lower billing volume predicts exclusion. With fraud-only labels (right), higher billing volume predicts exclusion, recovering the expected fraud signal.

When we restrict labels to fraud-specific exclusion codes, SHAP reveals billing intensity as the dominant signal. Higher claims per month, more concentrated procedure codes, and sharper temporal spikes push toward the fraud prediction. Fraud-excluded providers bill at roughly 140 claims per month regardless of when enforcement acted, more than double the non-excluded rate of 63. The billing intensity signal is the core finding.

The contrast with an all-LEIE model is instructive. If we train a classifier using all exclusion types as the positive class, the SHAP directions reverse: lower total claims, lower total paid, and fewer unique beneficiaries push toward “excluded.” Two mechanisms drive this. Most exclusions in the training data are license revocations (1128(b)(4)) rather than fraud convictions, and these providers tend to be smaller practices. Additionally, providers excluded during the 2018-2024 panel have shorter observation windows, which mechanically deflates their panel totals. The model ends up learning observation window length and label contamination rather than billing behavior. This is the label contamination problem from Post 2 showing up in the model’s feature structure.

Other studies confirm this pattern. SHAP combined with unsupervised anomaly detection on Belgian GP data uncovered a previously unknown billing trend [6]. In one ophthalmology fraud model, a constructed ratio (total payments divided by total patients) emerged as the most predictive SHAP feature, more informative than any raw variable [3]. The takeaway: SHAP shows what the model actually learns, which may diverge from what we intended.

If SHAP shows us what the model learns, the next question is whether the model itself needs to be complex.

Does ML Actually Beat Logistic Regression?

Most fraud detection papers compare random forests, gradient-boosted trees, and neural networks ("ML" here means exactly these methods, anything more complex than logistic regression), reporting whichever achieves the highest AUC. Few compare against logistic regression, the simplest baseline.

Precision-recall curves for three classifiers and a specialty-adjusted spending baseline. At the 5% recall threshold, logistic regression performs comparably to gradient-boosted methods.

A systematic review examined exactly this question across 71 clinical prediction studies [7]. Among studies at low risk of bias (proper validation, no data leakage, adequate sample size), the performance difference between ML and logistic regression was zero: an AUC difference of 0.00 (95% CI -0.18 to +0.18). ML appeared superior only in studies at high risk of bias, where methodological shortcuts inflated performance.

That review covered clinical prediction tasks (mortality, readmission, diagnosis) rather than fraud detection specifically. Fraud may involve more complex feature interactions that tree-based methods capture. Still, the burden of proof should run the other direction: demonstrate that the complex model outperforms the simple one on properly validated data, rather than assume it does.

Why does this matter for fraud detection on Medicaid data? If logistic regression matches XGBoost or random forest in discrimination, the simpler model wins. A state Medicaid Fraud Control Unit can explain to a judge exactly why logistic regression flagged a provider. It runs without GPU clusters or hyperparameter tuning pipelines. Analysts who understand regression coefficients can update it directly.
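On well-behaved tabular data the gap is often negligible, which is easy to check. A sketch on synthetic imbalanced data (a stand-in, since the Medicaid panel is not reproduced here) that fits both models under the same split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for billing features (~1% positive class).
X, y = make_classification(n_samples=20000, n_features=8, n_informative=5,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "gbm": GradientBoostingClassifier(random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Whether the tree model's edge (if any) survives proper temporal validation on the real panel is exactly the burden-of-proof question above; this sketch only shows how cheap the comparison is to run.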

A broader argument in Nature Machine Intelligence makes the same point: for high-stakes decisions, interpretable models should be the default unless a black box demonstrably outperforms them [8]. In fraud detection, that “demonstrably outperforms” threshold is rarely met.

Temporal Validation: Why Cross-Validation Lies

A subtlety trips up much of the fraud detection literature: random k-fold cross-validation produces optimistically biased performance estimates. Why? Temporal leakage. If we train on 2022 data and test on 2020 data, the model can learn patterns that only became apparent after the fact. In real deployment, we always predict forward in time.

How big is the gap? A systematic review of 2,030 external validations of cardiovascular prediction models found a median 11% decrease in discrimination, with performance drops ranging from near-zero to over 30% depending on how different the validation population was from the development data [12]. In fraud detection, where temporal shifts in billing patterns and enforcement priorities are the norm, 0.05 to 0.15 AUC points is a reasonable estimate of the gap. A model reporting 0.85 on k-fold may perform at 0.70 to 0.80 on genuinely future data.

Proper temporal validation splits the data by time: train on 2018-2021, validate on 2022, test on 2023-2024. This mimics how the model would actually be used. Grouped cross-validation by provider National Provider Identifier (NPI) adds a second safeguard, preventing within-provider leakage where the model memorizes individual billing trajectories rather than learning generalizable patterns.
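A minimal sketch of both safeguards, using hypothetical `npi` and `year` arrays in place of the real panel (the variable names and sizes are illustrative, not the T-MSIS schema):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 12000
npi = rng.integers(0, 3000, size=n)      # hypothetical provider IDs
year = rng.integers(2018, 2025, size=n)  # claim years 2018..2024
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)

# Safeguard 1 -- temporal split: train strictly before test, so the model
# always predicts forward in time (2022 reserved as a validation year).
train = year <= 2021
test = year >= 2023

# Safeguard 2 -- grouped CV inside the training window: folds never share
# a provider, so the model can't memorize individual billing trajectories.
X_tr, y_tr, g_tr = X[train], y[train], npi[train]
folds = list(GroupKFold(n_splits=5).split(X_tr, y_tr, groups=g_tr))
```

The two masks make the leakage failure mode concrete: random k-fold would scatter a provider's 2023 months into training and their 2019 months into test, which is exactly the backwards-in-time prediction the temporal split forbids.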

Temporal data leakage is one of the most common methodological problems in fraud detection; several recent reviews have flagged it. Most published studies skip temporal validation entirely, which means their reported AUCs are upper bounds on real-world performance, likely overstating how a model would perform in deployment.

What a Modest Result Actually Means

So what happens when we actually train a prospective classifier — one that learns from during-panel exclusions and gets tested on providers excluded after the panel ends?

At the top 5% screening threshold, investigation costs remain below the MFCU break-even threshold. Beyond 10%, marginal cost per detection rises sharply as false positives accumulate.

The AUC comes in at 0.725 (95% CI: 0.676-0.777). Is that fraud-specific, or could it be an artifact of enforcement timing? A placebo test helps here: we apply the same model to non-fraud post-panel exclusions. That yields an AUC of 0.579 (95% CI: 0.482-0.670). The confidence intervals do not overlap, consistent with a fraud-specific signal rather than an artifact.

What does 0.725 mean in practice? At the top 5% screening threshold, the classifier captures 23% of future fraud exclusions — roughly 1 in 4 providers who will eventually be excluded for fraud. This is a screening result, not a verdict. But it means billing intensity carries enough fraud-specific signal to concentrate audit resources on a smaller, higher-yield pool.
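The recall-at-top-5% metric is simple to compute: rank everyone by score, flag the top slice, and ask what share of true positives landed in it. A sketch on synthetic scores (illustrative, not the post's model output):

```python
import numpy as np

def recall_at_top_frac(scores, labels, frac=0.05):
    """Share of true positives captured when only the top `frac` of
    providers, ranked by model score, are sent for audit."""
    n_flag = max(1, int(len(scores) * frac))
    flagged = np.argsort(scores)[::-1][:n_flag]
    return labels[flagged].sum() / labels.sum()

rng = np.random.default_rng(0)
n = 100_000
labels = (rng.random(n) < 0.001).astype(int)   # ~0.1% base rate
# Positives score modestly higher on average: a weak-but-real signal.
scores = rng.normal(size=n) + 1.0 * labels

r = recall_at_top_frac(scores, labels, frac=0.05)
```

A random audit of the top 5% would capture 5% of positives by construction; any signal shows up as recall above that floor, which is the sense in which 23% at the top 5% concentrates the search.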

A natural worry: maybe the model is just learning observation window length, since providers excluded mid-panel have shorter billing histories. Dropping months_active from the feature set changes AUC from 0.725 to 0.723. The signal is not window length. A rate-only specification (claims per month, beneficiaries per month, paid per month, paid per beneficiary) achieves an AUC of 0.670, which tells us billing intensity alone carries most of the discrimination. The 2.3x billing intensity gap from Post 3 turns out to be the core feature driving prospective classification.

What about the cost-benefit math? MFCU investigations recovered $1.4 billion in FY 2024, roughly $3.50 returned per dollar spent [11]. At a base rate of 0.098%, precision at any operationally feasible threshold remains below 2%, so the classifier functions as an audit prioritization tool rather than a verdict engine. Concentrating 23% of future fraud cases into the top 5% of scored providers narrows the search space enough to direct investigative resources. And those false positives carry real costs: a provider under investigation loses referrals, faces reputational harm, and may reduce services to vulnerable populations.
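The sub-2% precision figure follows from the post's reported numbers by simple arithmetic:

```python
# All three inputs are the figures reported in this post.
base_rate = 0.00098   # fraud-exclusion prevalence (0.098%)
recall = 0.23         # share of future fraud exclusions captured
flag_frac = 0.05      # top 5% of scored providers audited

# Of N providers, recall * base_rate * N true positives land among
# flag_frac * N flagged providers, so:
precision = recall * base_rate / flag_frac   # ~0.45% of flags are hits
lift = recall / flag_frac                    # flagged pool is 4.6x enriched
```

So even a working classifier flags roughly 220 providers for every true future exclusion it surfaces; the value is in the 4.6x enrichment over random auditing, not in the precision itself.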

One more thing worth noting: the within-panel AUC of 0.830 overstates prospective performance by 0.105 points, because enforcement censoring creates systematic feature differences between during-panel and non-excluded providers. The prospective AUC of 0.725 is the clean benchmark — what the classifier achieves when it cannot exploit enforcement-truncated billing histories.

For Medicaid fraud detection, the informative finding is this: public billing data with seven columns and no diagnosis codes supports screening based on billing intensity, though it falls short of adjudication. Understanding where prediction breaks down tells us where better data or different methods are needed [9]. Pushing past this ceiling requires the restricted T-MSIS Analytic Files (the full claims data with diagnosis codes and demographics, available only to approved researchers), clinical records, or entirely different approaches.

Supervised and Unsupervised: Complements

Up to this point we have been asking one question: can a supervised classifier, trained on historical labels, predict future fraud? But there is another approach entirely. What if we skip the labels?

| Dimension | Supervised | Unsupervised |
|---|---|---|
| Detects | Known patterns from historical labels | Novel anomalous behavior |
| Labels needed | Yes (LEIE/S&I) | No |
| Bias risk | Inherits enforcement bias from labels | Less susceptible |
| Strength | High precision for established fraud types | Can discover new schemes |
| Weakness | Misses novel fraud; reflects historical priorities | Higher false positive rate |

The unsupervised ensemble we discussed in Post 3 achieved 8x lift over random selection for Medicare inpatient fraud [1]. Instead of learning “what do excluded providers look like,” unsupervised methods ask “which providers look most unlike their peers?” That reframing bypasses the enforcement bias baked into LEIE labels, though it introduces its own difficulty: peer groups must be well-defined, and anomalous billing is not the same as fraud.
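The peer-comparison framing is easy to sketch with scikit-learn's IsolationForest. Everything here is hypothetical: a single specialty peer group with two made-up rate features, where the peer center loosely echoes the non-excluded billing rate from the series and the dollar figures are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical peer group: (claims/month, paid/beneficiary) for one
# specialty. Both the features and their scales are illustrative.
peers = rng.normal(loc=[63.0, 120.0], scale=[15.0, 30.0], size=(2000, 2))
# A few providers billing far outside the peer distribution.
outliers = rng.normal(loc=[140.0, 400.0], scale=[10.0, 40.0], size=(20, 2))
X = np.vstack([peers, outliers])

# No labels anywhere: the model scores how easily each point is isolated
# from the rest of its peer group.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
anomaly = -iso.score_samples(X)   # higher = more unlike the peers
```

Note what the score means and what it doesn't: the outliers rank as anomalous because they are unlike their peers, which is a reason to look, not a finding of fraud; and the whole exercise depends on the peer group being defined correctly in the first place.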

In practice, supervised and unsupervised methods serve different roles in the same pipeline. A supervised classifier trained on fraud-specific labels can screen for known patterns. An unsupervised anomaly detector can flag providers whose billing is unusual for reasons the labels never captured. Both still depend on the investigator who examines the medical records.

What Does All This Tell Us?

What do these four posts add up to? One finding: billing intensity is a detectable, fraud-specific signal in public Medicaid data.

The evidence chain: the data is real, large, and monthly (Post 1). The labels are noisy, enforcement-biased, and capture a non-random subset of problematic providers (Post 2). When we clean those labels and adjust for observation windows, fraud-excluded providers bill at more than double the non-excluded rate — a pattern that holds regardless of when enforcement acted (Post 3). And a prospective classifier formalizes that intensity signal into an AUC of 0.725, with a placebo test confirming it is fraud-specific (this post).

So the data works as a screening tool: useful for concentrating investigative attention on providers whose billing intensity patterns resemble those of future fraud exclusions. The risk — visible in real time as thousands of amateur analysts race to flag providers by name and address — is treating a screening tool as a verdict.

What would fraud detection that actually works require? Better labels (prospective audit programs rather than post-hoc exclusion lists), better data (diagnosis codes, clinical context, beneficiary demographics), better methods (positive-unlabeled learning that accounts for undetected fraud, unsupervised ensembles, equity audits), and better governance (false positive tracking, disparate impact analysis, appeal mechanisms). The public Medicaid spending file is one input to that system, and only one. But the billing intensity signal establishes a floor: even with minimal data, fraud patterns are partially recoverable.


References

  1. Shekhar, S., Leder-Luis, J., & Akoglu, L. (2026). Can machine learning target health care fraud? Evidence from Medicare hospitalizations. Journal of Policy Analysis and Management, 45(1).
  2. Johnson, J.M. & Khoshgoftaar, T.M. (2023). Data-centric AI for healthcare fraud detection. SN Computer Science, 4(4), 389.
  3. Tajrobehkar, M. et al. (2024). Utilization analysis and fraud detection in Medicare via machine learning. medRxiv, 2024.12.30.24319784. [Preprint.]
  4. Herland, M. et al. (2018). Big data fraud detection using multiple Medicare data sources. Journal of Big Data, 5, 29.
  5. Lundberg, S.M. & Lee, S.I. (2017). A unified approach to interpreting model predictions. NeurIPS, 30.
  6. De Meulemeester, H. et al. (2025). SHAP-enhanced anomaly detection in healthcare billing. BMC Medical Informatics and Decision Making.
  7. Christodoulou, E. et al. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22.
  8. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206-215.
  9. Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324), 483-485.
  10. Mullainathan, S. & Obermeyer, Z. (2022). Diagnosing physician error: A machine learning approach to low-value health care. Quarterly Journal of Economics, 137(2), 679-727.
  11. OIG OEI-09-25-00200. (2025). Medicaid Fraud Control Units fiscal year 2024 annual report.
  12. Wessler, B.S. et al. (2021). External validations of cardiovascular clinical prediction models: A large-scale review of the literature. Circulation: Cardiovascular Quality and Outcomes, 14(8), e007858.

Updated March 2026 with validated results from Cholette (2026).

This series is based on the peer-reviewed working paper: Cholette, V. (2026). What Do Medicaid Fraud Classifiers Actually Detect? SSRN Working Paper.

Suggested Citation

Cholette, V. (February 2026). Can a Classifier Find What Investigators Miss? Too Early To Say. https://tooearlytosay.com/research/methodology/medicaid-fraud-classifier/