MEDICAID FRAUD SERIES — POST 1 OF 4

What 227 Million Rows of Medicaid Data Can and Can't Tell Us

The largest Medicaid dataset in history just went public. Here's what we actually find inside it, what's missing, and why that matters for anyone trying to screen for fraud.

On February 13, 2026, the Department of Health and Human Services published what it called “the largest Medicaid dataset in department history” on its open data portal. A former Department of Government Efficiency (DOGE) affiliate with a background in social media and no experience in fraud research posted on his own platform: “Medicaid data has been open sourced, so the level of fraud is easy to identify.” Fourteen million people saw it. Sixty-nine thousand liked it.

As someone who has been studying Medicaid fraud since before DOGE, I disagree.1

Within hours of the release, thousands of people downloaded the file. The findings came fast: a cryptocurrency commentator reported $90 billion in fraudulent payments from scanning 0.16% of providers. An anonymous social media account flagged providers billing from residential apartments in Maine. A Substack newsletter dedicated to health misinformation, run by a Florida attorney, identified 184 Medicaid providers at a single Minneapolis address. The Maine apartments are home addresses of behavior technicians who deliver autism therapy in clients’ homes. The Minneapolis address is an ABA therapy center where every technician registers a separate NPI: 184 technicians at one center is how autism services are structured nationwide. Checking either claim required only a basic understanding of how provider registration works.

Open-sourced data needs open-sourced methods. In this series, we walk through the context and technique required to use this dataset constructively.

The gap between what this dataset contains and what people think it contains is where most of the mistakes are happening. Let’s talk it through.

Seven Columns and 227 Million Rows

The file comes from the Transformed Medicaid Statistical Information System (T-MSIS), the federal warehouse where states submit their Medicaid claims data. It covers January 2018 through December 2024, a span of 84 months. It includes fee-for-service (FFS) claims, managed care encounters, and Children’s Health Insurance Program (CHIP) claims.

Here are the seven columns:

Column | What It Contains
BILLING_PROVIDER_NPI_NUM | The provider submitting the claim
SERVICING_PROVIDER_NPI_NUM | The provider who performed the service
HCPCS_CODE | The Healthcare Common Procedure Coding System (HCPCS) code: what was done
CLAIM_FROM_MONTH | The month of service
TOTAL_UNIQUE_BENEFICIARIES | How many patients received this service
TOTAL_CLAIMS | Number of claim lines
TOTAL_PAID | Total Medicaid payment

The grain (the unit of each row) is billing provider by servicing provider by procedure code by month. That gives us 227 million rows covering about 1.8 million unique provider NPIs (National Provider Identifiers).
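That grain is also something we can verify rather than assume. A minimal pandas sketch with a toy frame using the file’s seven columns (the toy rows and values here are invented for illustration; the real file is far too large to load this casually):

```python
import pandas as pd

# Toy rows mimicking the public file's seven columns (invented values).
df = pd.DataFrame({
    "BILLING_PROVIDER_NPI_NUM":   ["1234567890", "1234567890", "1234567890"],
    "SERVICING_PROVIDER_NPI_NUM": ["1234567890", "9876543210", "9876543210"],
    "HCPCS_CODE":                 ["99213", "99213", "97153"],
    "CLAIM_FROM_MONTH":           ["2022-01", "2022-01", "2022-01"],
    "TOTAL_UNIQUE_BENEFICIARIES": [40, 25, 14],
    "TOTAL_CLAIMS":               [55, 30, 120],
    "TOTAL_PAID":                 [4125.00, 2250.00, 3600.00],
})

# The stated grain: one row per billing NPI x servicing NPI x HCPCS x month.
grain = ["BILLING_PROVIDER_NPI_NUM", "SERVICING_PROVIDER_NPI_NUM",
         "HCPCS_CODE", "CLAIM_FROM_MONTH"]
dupes = df.duplicated(subset=grain).sum()
print(f"duplicate grain keys: {dupes}")  # 0 if the grain holds
```

Running the same duplicate check on the real file is a sensible first step before any analysis: if the grain does not hold, every aggregation downstream is suspect.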

And that’s it. The Centers for Medicare & Medicaid Services (CMS) published no documentation, no data dictionary, and no methodology paper explaining how it constructed the file. This is unusual. When CMS released the Medicare Provider Utilization and Payment Data, known as the Public Use File (PUF), in 2014, after a 33-year legal battle over physician payment transparency, it included extensive methodology documentation [6]. The Medicaid spending file came with none.

The 12-Claim Privacy Threshold

Before we look at what the data shows, we need to understand what it suppresses. CMS’s Cell Size Suppression Policy, grounded in the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule (45 CFR 164.514(b)), prohibits the display of cells with values between 1 and 10 [5]. The Medicaid dataset applies a slightly more conservative version: any row where TOTAL_CLAIMS falls below 12 is dropped entirely.

At the grain of provider by procedure by month, 12 claims is a low bar for any active practice, so the suppression mostly removes small providers, rare procedures, and new entrants. For fraud screening, these are not the providers we care about. The bigger effect is on peer group construction: removing the lower tail of the distribution inflates specialty averages and makes remaining providers look higher-volume than the true population. This distortion matters, though less than the gaps described below.
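The direction of that distortion is easy to demonstrate with a simulation. This sketch draws synthetic claim counts from a heavy-tailed lognormal (an assumption for illustration, not the true T-MSIS shape) and applies the file’s drop-below-12 rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly claim counts for 100,000 provider-procedure cells.
# The lognormal shape is an assumption, not the real T-MSIS distribution.
claims = rng.lognormal(mean=2.0, sigma=1.2, size=100_000).astype(int) + 1

# The public file's rule: drop any cell with fewer than 12 claims.
released = claims[claims >= 12]

print(f"cells surviving suppression: {len(released) / len(claims):.1%}")
print(f"mean claims, true population: {claims.mean():.1f}")
print(f"mean claims, released file:   {released.mean():.1f}")
```

Truncating the lower tail mechanically raises the mean, so any “average provider” computed from the released file describes a population that is larger-volume than the real one.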

What This File Is Missing

Here the gap between perception and reality widens. The public spending file contains aggregate billing data. The restricted T-MSIS Analytic Files that CMS makes available to approved researchers contain far more:

Data Element | In Restricted T-MSIS? | In Public Spending File?
Diagnosis codes (ICD-10) | Yes | No
Patient demographics (age, sex, race) | Yes | No
Managed care plan ID | Yes | No
Service begin/end dates | Yes | No (month only)
Beneficiary ID | Yes | No
Eligibility group codes | Yes | No
Provider taxonomy/specialty | Yes | No (requires NPPES crosswalk)

The absence of diagnosis codes, classified under the International Classification of Diseases (ICD-10), is the most consequential gap. Without them, we can see what a provider billed for and how much they were paid, but we cannot evaluate whether the services were clinically appropriate for the patients who received them. A provider billing heavily for evaluation and management codes (the 99213-99215 series) could be committing upcoding fraud, or they could be running a high-volume primary care practice in an underserved area. The billing data alone cannot distinguish between these.

Similarly, the absence of patient demographics means we cannot examine whether flagged providers disproportionately serve particular populations. Without race, age, or eligibility category, we cannot conduct the kind of equity analysis that demonstrated a widely used healthcare algorithm systematically disadvantaged Black patients by using cost as a proxy for health needs [2].

The provider NPI is the only identifier in the file. To get a provider’s specialty, location, or organization type, we need to cross-reference the National Plan and Provider Enumeration System (NPPES), a separate federal registry. Solvable, but a step many amateur analysts skip.
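The crosswalk itself is a single merge. A sketch with toy frames (the NPPES column names below follow the public NPPES file layout as best we know it; verify them against the current NPPES data dictionary before relying on them):

```python
import pandas as pd

# Toy slice of the spending file: NPI is the only provider identifier.
claims = pd.DataFrame({
    "BILLING_PROVIDER_NPI_NUM": ["1234567890", "9876543210"],
    "TOTAL_PAID": [4125.00, 3600.00],
})

# Toy slice of the NPPES registry download. Column names approximate the
# NPPES public-file layout ("NPI", "Healthcare Provider Taxonomy Code_1",
# practice-location state); confirm against the current data dictionary.
nppes = pd.DataFrame({
    "NPI": ["1234567890", "9876543210"],
    "Healthcare Provider Taxonomy Code_1": ["207Q00000X", "106S00000X"],
    "Provider Business Practice Location Address State Name": ["MN", "ME"],
})

enriched = claims.merge(
    nppes, left_on="BILLING_PROVIDER_NPI_NUM", right_on="NPI", how="left"
)
print(enriched[["BILLING_PROVIDER_NPI_NUM",
                "Healthcare Provider Taxonomy Code_1"]])
```

A left join preserves every spending-file row, so NPIs that fail to match NPPES surface as nulls instead of silently disappearing, which is itself a data-quality signal worth counting.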

Even with provider details filled in, the deeper question remains: which kinds of fraud can this data detect?

Types of Fraud, One Dataset

Healthcare fraud takes several forms, and the data handles each differently. A recent taxonomy identifies several major types [1]:

Upcoding: Billing for a more expensive service than what the provider actually delivered. A provider sees a patient for a routine 15-minute visit but bills it as a complex 40-minute visit. In the public data, this leaves a statistical footprint. If a provider bills 99215 (high-complexity) at three times the rate of specialty peers, that’s a signal. But without diagnosis codes, we cannot confirm whether those visits warranted the higher code. Partially detectable.
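That peer-rate signal can be sketched with toy claim counts. The 3x-peer threshold below is illustrative, not a validated cutoff, and the providers are invented:

```python
import pandas as pd

# Toy E/M claim counts for three same-specialty providers (invented).
em = pd.DataFrame({
    "npi":    ["A", "A", "B", "B", "C", "C"],
    "hcpcs":  ["99213", "99215"] * 3,
    "claims": [900, 100, 850, 150, 300, 700],
})

# Share of high-complexity 99215 within each provider's E/M mix.
totals = em.groupby("npi")["claims"].sum()
high = em[em["hcpcs"] == "99215"].set_index("npi")["claims"]
share = (high / totals).rename("share_99215")

peer_median = share.median()
flagged = share[share >= 3 * peer_median]  # illustrative 3x-peer threshold
print(share.round(2).to_dict(), "| flagged:", list(flagged.index))
```

Note what the flag does and does not say: provider C codes high-complexity visits far more than peers, but without diagnoses we cannot say whether C’s panel is genuinely sicker.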

Phantom billing: Billing for services never rendered. A provider submits claims for patients who never visited, or for visits that never occurred. This is the fraud type at the center of the Minnesota autism scheme and the single most commonly prosecuted form of Medicaid fraud. In aggregate billing data, phantom billing can leave a footprint when volumes exceed what is physically possible — a solo provider billing 200 patient-hours per day, or a clinic billing during periods it was closed. But at plausible volumes, phantom billing looks identical to legitimate high-volume practice. Partially detectable.
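The impossible-volume footprint can be checked mechanically for timed codes. This sketch assumes claim lines approximate billed units and that 97153 is billed in 15-minute units (standard under ABA coding, but state billing rules vary); both assumptions, and the work-month cap, are illustrative:

```python
import pandas as pd

MINUTES_PER_UNIT = {"97153": 15}       # 15-min units; verify per payer rules
WORK_MINUTES_PER_MONTH = 22 * 10 * 60  # generous cap: 22 days x 10 hours

# Toy monthly rows for two servicing providers billing a timed code.
df = pd.DataFrame({
    "SERVICING_PROVIDER_NPI_NUM": ["X", "Y"],
    "HCPCS_CODE": ["97153", "97153"],
    "CLAIM_FROM_MONTH": ["2023-03", "2023-03"],
    "TOTAL_CLAIMS": [600, 14_000],     # claim lines ~ billed units (assumption)
})

df["billed_minutes"] = df["TOTAL_CLAIMS"] * df["HCPCS_CODE"].map(MINUTES_PER_UNIT)
df["impossible"] = df["billed_minutes"] > WORK_MINUTES_PER_MONTH
print(df[["SERVICING_PROVIDER_NPI_NUM", "billed_minutes", "impossible"]])
```

Provider Y’s month implies roughly 3,500 hours of delivered service, which no individual can work; provider X sits comfortably inside the cap, which is exactly why plausible-volume phantom billing stays invisible here.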

Substandard care: Providing lower-quality care than what the claim describes. The claim looks normal. The billing codes match. The fraud lies in the gap between what the provider documented and what actually happened in the exam room. Aggregate billing data cannot detect this. Not detectable.

Medical necessity fraud: Performing services that patients do not need. A provider orders unnecessary lab tests or refers patients for procedures with no clinical indication. Detection requires diagnosis codes to evaluate whether the procedure matched the patient’s condition. Without diagnoses, aggregate billing data cannot reveal this. Not detectable.

Of these fraud types, upcoding and phantom billing leave statistical footprints in this dataset, though both are ambiguous without clinical context. Substandard care and medical necessity fraud remain invisible in aggregate billing data.

None of this criticizes the data release itself. It describes what the data can and cannot do, context that matters for anyone claiming fraud can be “easily identified” from this file.

How does this file compare to the Medicare data that researchers have worked with for over a decade?

How This Compares to Medicare Data

Medicare provider payment data has been available since 2014:

Feature | Medicare PUF (since 2014) | Medicaid Spending File (Feb 2026)
Granularity | NPI × HCPCS × place of service (annual) | NPI × HCPCS × month
Time resolution | Annual | Monthly
Payment fields | Allowed amount, Medicare payment, submitted charges | Total paid only
Population | Medicare FFS only | FFS + managed care + CHIP
Suppression | <11 beneficiaries | <12 total claims
Provider details | Name, credentials, address, specialty | NPI only (requires NPPES)
History | 10+ years of annual releases | Single release (2018-2024)
Documentation | Extensive CMS methodology docs | None published

The Medicare PUF has some real advantages: submitted charges allow us to compare what providers charge versus what Medicare pays, which is one of the clearest upcoding signals. Multiple payment fields enable richer feature construction. And a decade of annual releases provides temporal depth for trend analysis.

The Medicaid file has one clear advantage: monthly time resolution. Where the Medicare PUF collapses everything to annual totals, the Medicaid file preserves month-by-month billing patterns. This lets us observe seasonal variation, detect billing spikes, and identify providers whose patterns change sharply over time. For fraud detection, this temporal granularity matters.
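A simple way to exploit that resolution is a trailing z-score on each provider’s monthly payments. The series, window length, and z > 3 cutoff below are all illustrative choices on invented data:

```python
import pandas as pd

# Toy monthly payments for one provider: stable, then a sharp spike.
s = pd.Series(
    [10_000, 11_000, 9_500, 10_500, 10_000, 52_000],
    index=pd.period_range("2023-01", periods=6, freq="M"),
    name="TOTAL_PAID",
)

# Z-score each month against the trailing 4-month window (shift(1) keeps
# the current month out of its own baseline). Window length is illustrative.
mean = s.shift(1).rolling(4).mean()
std = s.shift(1).rolling(4).std()
z = (s - mean) / std
spikes = z[z > 3]
print(spikes)
```

A flag like this is a prompt to look closer, not evidence: a legitimate practice that hires three clinicians in one month produces the same spike as a billing scheme ramping up.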

The Medicaid file also covers a broader population. Medicare data captures only fee-for-service claims, missing the roughly 50% of Medicare beneficiaries enrolled in Medicare Advantage. The Medicaid file includes fee-for-service, managed care, and CHIP, providing a more complete picture of the program.

T-MSIS Data Quality: The Known Unknowns

The underlying data warehouse has well-documented quality problems, and those problems flow directly into the public spending file.

A January 2021 Government Accountability Office (GAO) report found that 30 states failed to submit acceptable data for inpatient managed care encounters [3]. A March 2021 Office of Inspector General (OIG) report described Medicaid managed care payment data as “incomplete and inaccurate” [4]. CMS’s own Medicaid & CHIP Scorecard shows improvement: 41 states and 3 territories met data quality targets as of the 2025 Scorecard, up from 25 in April 2022. But “met data quality targets” is not the same as “research-grade data.”

The practical implications:

  • Inconsistent coding across states. What counts as a “claim” varies by state Medicaid program. States report provider specialization identifiers differently. Procedure code usage varies with state-specific billing rules.
  • Managed care encounter gaps. Managed care organizations submit encounter data to states, which submit to T-MSIS. At each step, data can be dropped, delayed, or miscoded.
  • Lag between service and claim. Months can pass between when a service is delivered and when the claim appears in the data. Payers adjust or void some claims long after initial submission.

None of this makes the data useless. But cross-state comparisons require accounting for state-level variation in data quality and coding conventions. A provider in one state who appears to bill twice as much as a peer in another state may just be in a state that codes encounters differently. If we want to build something useful from this data, these are the adjustments we need to start with.
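One concrete adjustment is to score providers against in-state, in-specialty peers rather than a national pool, so state coding differences do not masquerade as outliers. In this sketch, state and specialty are assumed to come from an NPPES crosswalk (they are not in the spending file), and all values are invented:

```python
import pandas as pd

# Toy provider totals; state and specialty assumed to come from NPPES.
df = pd.DataFrame({
    "npi":       ["A", "B", "C", "D", "E", "F"],
    "state":     ["MN", "MN", "MN", "ME", "ME", "ME"],
    "specialty": ["ABA"] * 6,
    "paid":      [100_000, 120_000, 400_000, 50_000, 60_000, 55_000],
})

# Z-score within (state, specialty) peer groups so each provider is
# compared only to peers reported under the same state's conventions.
grp = df.groupby(["state", "specialty"])["paid"]
df["peer_z"] = (df["paid"] - grp.transform("mean")) / grp.transform("std")
print(df[["npi", "state", "peer_z"]].round(2))
```

The within-group transform means a Maine technician is never judged against a Minnesota clinic’s totals, which is the minimum hedge against the cross-state comparability problems described above.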

What Can We Actually Do With This?

The honest answer: quite a lot, if we’re careful about what we claim.

With 84 months of billing, we can construct provider-level profiles, identify statistical outliers in billing volume and payment amounts within peer groups, track temporal patterns, and flag sharp changes. We can also cross-reference against the OIG’s List of Excluded Individuals/Entities (LEIE) to check whether providers who were eventually excluded showed different billing patterns beforehand.
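The exclusion-list cross-reference is another merge. This sketch uses toy frames; the LEIE column names ("NPI", "EXCLDATE" as YYYYMMDD) follow the downloadable CSV layout as best we know it and should be verified against the current file:

```python
import pandas as pd

# Toy billing totals from the spending file (invented).
billing = pd.DataFrame({
    "BILLING_PROVIDER_NPI_NUM": ["1234567890", "9876543210"],
    "TOTAL_PAID": [250_000.0, 90_000.0],
})

# Toy slice of the OIG LEIE download. Column names approximate the LEIE
# CSV layout; verify against the current file. Many LEIE rows carry an
# unknown NPI (recorded as 0) and will never match on NPI alone.
leie = pd.DataFrame({
    "NPI": ["1234567890"],
    "EXCLDATE": ["20240715"],
})

merged = billing.merge(
    leie, left_on="BILLING_PROVIDER_NPI_NUM", right_on="NPI",
    how="left", indicator=True,
)
merged["excluded"] = merged["_merge"] == "both"
print(merged[["BILLING_PROVIDER_NPI_NUM", "excluded"]])
```

The NPI-is-missing caveat matters: matching on NPI alone undercounts exclusions, which is one of the label complications Post 2 takes up.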

This data alone cannot confirm fraud. Every statistical flag requires investigation by people with clinical expertise, legal authority, and access to the underlying medical records. The dataset is a screening tool, not a verdict engine.

In the posts that follow, we’ll walk through each of these steps: how we construct labels from exclusion lists and why those labels are more complicated than they appear (Post 2), what billing patterns actually look like for excluded versus non-excluded providers (Post 3), and whether a supervised classifier can find anything that simpler methods miss (Post 4).

This data has real uses. Identifying fraud by sorting on total paid is not one of them.


References

  1. Leder-Luis, J. & Malani, A. (2025). The economics of healthcare fraud. NBER Working Paper 33592.
  2. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
  3. GAO-21-196. (2021). Medicaid: CMS should take steps to mitigate program risks.
  4. OIG OEI-02-19-00180. (2021). Opportunities exist to improve Medicaid managed care encounter data quality.
  5. CMS Cell Size Suppression Policy. ResDAC. https://resdac.org/articles/cms-cell-size-suppression-policy
  6. Abelson, R. (2014). The Medicare physician-data release — Context and questions. New England Journal of Medicine, 371, 99-101.

  1. See Perez, V. & Wing, C. (2019). Should we do more to police Medicaid fraud? American Journal of Health Economics, 5(4), 481-508; Nguyen, T. & Perez, V. (2020). Privatizing plaintiffs. Journal of Risk and Insurance, 87(4), 1063-1091; Perez, V. & Ramos Pastrana, J.A. (2023). Finding fraud. International Journal of Health Economics and Management, 23, 393-409. 

Suggested Citation

Cholette, V. (February 2026). What 227 Million Rows of Medicaid Data Can and Can't Tell Us. Too Early To Say. https://tooearlytosay.com/research/methodology/medicaid-data-landscape/