A fraud detection classifier lands somewhere between 92% and 96% AUC depending on how the training sample is constructed. That's a solid result. But the natural follow-up question is harder than the modeling itself: what's actually driving these predictions?
Feature importance from tree-based models gives us a ranking, sure. But a ranking doesn't tell us whether a feature pushes predictions toward fraud or away from it, how strong that push is, or whether the effect depends on other features. SHAP values (SHapley Additive exPlanations) answer all three questions. They decompose every prediction into per-feature contributions, grounded in cooperative game theory. The catch: interpreting SHAP correctly requires understanding what it measures and what it does not. A note on causal interpretation appears in Limitations.
Several interpretation methods exist for this kind of problem. LIME fits local linear models around individual predictions. Permutation importance measures how much test-set performance drops when a feature is shuffled. Partial dependence plots show the marginal effect of a feature across its range. SHAP is worth focusing on because it provides both local and global interpretability in a single framework, with consistency guarantees from cooperative game theory that the other methods lack. Let's walk through the workflow.
The interpretation workflow has seven steps, but they all depend on two core tools: a tree-based classifier and the TreeSHAP explainer. Here is the foundation.
Tool Stack: Random Forest + TreeSHAP
| Component | Tool | Purpose |
|---|---|---|
| Classifier | RandomForestClassifier | 200 trees, max_depth=12 |
| Explainer | shap.TreeExplainer | Fast exact SHAP for tree-based models |
| Visualization | SHAP summary plot | Feature importance with directional effects |
| Validation | Category aggregation | Group features into volume / intensity / behavioral / peer-relative |
We're working with 26 engineered features across four conceptual categories. The dataset contains Medicare provider billing records, and the target is a binary fraud label. Nothing exotic about the setup, and that's deliberate. The interpretation layer is where the real work happens.
Step 1: Fit the Random Forest
Let's start with a straightforward classifier. The key choices here are class_weight='balanced' (fraud is rare, so we need the model to pay attention to minority-class examples) and max_depth=12 (deep enough to capture interactions, shallow enough to avoid memorizing noise).
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=200,
max_depth=12,
min_samples_leaf=5,
class_weight='balanced',
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
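To see these choices run end to end, here is a minimal stand-in on synthetic imbalanced data (the dataset, sizes, and variable names here are invented for illustration; the real pipeline uses the Medicare features described above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fraud data: roughly 5% positive class.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=42)

rf_demo = RandomForestClassifier(
    n_estimators=50, max_depth=12, min_samples_leaf=5,
    class_weight='balanced', random_state=42, n_jobs=-1)
rf_demo.fit(X_demo, y_demo)

# class_weight='balanced' rescales each class inversely to its frequency,
# so the rare positive class is not drowned out during tree construction.
```
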
The model is a means to an end; what we really want is the explanation layer on top.
Step 2: Initialize TreeSHAP
TreeSHAP computes exact SHAP values for tree ensembles in polynomial time, versus the exponential cost of brute-force computation. For a 200-tree random forest on a few thousand test samples, it runs in seconds.
The idea behind SHAP values comes from cooperative game theory. Shapley values were originally designed to fairly allocate credit among players in a coalition game. Here, the "players" are features and the "game" is the model's prediction. Each feature's SHAP value represents its average marginal contribution across all possible feature combinations, which gives us a principled way to decompose any single prediction into per-feature effects.
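To make the coalition game concrete, here is a brute-force Shapley computation for a hypothetical three-feature "game" (the coalition values and feature names are invented; real SHAP replaces this value function with the model's expected output given a feature subset):

```python
from itertools import combinations
from math import factorial

def coalition_value(coalition):
    # Hypothetical "game": the payoff of a coalition of features.
    v = 0.0
    if 'billing' in coalition:
        v += 0.30
    if 'volatility' in coalition:
        v += 0.10
    if 'billing' in coalition and 'volatility' in coalition:
        v += 0.06  # an interaction effect, split evenly by symmetry
    return v

def shapley_values(players, value):
    # Average marginal contribution of each player over all orderings,
    # via the standard |S|! (n-|S|-1)! / n! subset weighting.
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {p}) - value(set(S)))
        phi[p] = total
    return phi

players = ['billing', 'volatility', 'tenure']
phi = shapley_values(players, coalition_value)
# Efficiency property: the values sum to the grand-coalition payoff.
assert abs(sum(phi.values()) - coalition_value(set(players))) < 1e-9
```

The 0.06 interaction term gets split evenly between `billing` and `volatility` (0.03 each), while `tenure`, which contributes nothing to any coalition, gets exactly zero.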
Computing SHAP values on the test set tells us how the model explains new, unseen cases. This matters more for generalization than explaining training data the model may have memorized. Training-set SHAP values can reflect overfitting patterns, so test-set explanations tend to be a more honest picture of what the model has actually learned.
One thing worth noting about TreeSHAP specifically: when you pass a background dataset, TreeExplainer uses an interventional approach that treats features as independent when computing expectations; without one (as in our code below), it falls back to the path-dependent algorithm, which conditions on the tree's own split structure. Either way, when features are correlated (as they often are with engineered features), the SHAP credit may get distributed among the correlated features in ways that understate each one's individual contribution. We'll see this come up again when we look at peer-relative z-scores in Step 6.
import shap
import numpy as np
# TreeExplainer uses the tree structure for fast, exact SHAP computation.
# For a sklearn random forest the explained output is the predicted class
# probability (the trees average probabilities), not log-odds.
# With no background data passed, feature_perturbation falls back to
# 'tree_path_dependent'.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)  # Explain test set, not training set
Step 3: Handle the Binary Classification Array Shape
The first version-dependent issue appears here. Depending on the shap library version, shap_values comes back in different shapes. Older versions return a Python list of two arrays (one per class). Newer versions sometimes return a 3D NumPy array of shape (n_samples, n_features, n_classes). Our code needs to handle both.
if isinstance(shap_values, list):
sv = np.array(shap_values[1]) # Positive class (fraud)
else:
sv = shap_values[:, :, 1] # Same thing, different structure
We take index 1 because that's the positive class, the fraud predictions. The SHAP values for class 0 are the negation of class 1 in binary classification, so there's no information loss.
After this step, sv has shape (n_samples, 26): one SHAP value per feature per test observation.
Step 4: Aggregate to Mean Absolute SHAP
Individual SHAP values are signed: positive means the feature pushed toward a fraud prediction, negative means it pushed away. To get overall importance, we take the mean of absolute values across all test samples.
It's worth pausing on what "importance" means here. Mean |SHAP| captures the average magnitude of a feature's contribution, but a feature with high mean |SHAP| does not necessarily have a consistent directional effect. It might push some predictions strongly toward fraud and others strongly away. The summary plot in Step 5 helps disambiguate this, so we should interpret the importance ranking and the directional plot together rather than treating the ranking alone as conclusive.
import pandas as pd

mean_abs_shap = np.abs(sv).mean(axis=0)
importance = pd.Series(mean_abs_shap, index=X_test.columns).sort_values(ascending=False)
This gives us a global importance ranking (keeping in mind that SHAP values reflect association, not causation — see Limitations).
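One way to see the mixed-direction issue: compare the signed mean to the mean of absolute values on synthetic SHAP columns (numbers invented for illustration):

```python
import numpy as np

# Synthetic SHAP columns for two hypothetical features across 6 observations.
sv_demo = np.array([
    [ 0.20, 0.05],
    [-0.18, 0.06],
    [ 0.22, 0.05],
    [-0.21, 0.04],
    [ 0.19, 0.06],
    [-0.20, 0.04],
])  # col 0: strong but mixed-direction; col 1: weak but consistent

mean_signed = sv_demo.mean(axis=0)
mean_abs = np.abs(sv_demo).mean(axis=0)
# Feature 0 ranks first by mean |SHAP| (0.20 vs 0.05) even though its
# signed mean is near zero: its pushes toward and away from fraud cancel.
```
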
Step 5: Summary Plot (Top 15 Features)
The SHAP summary plot is where things get genuinely informative. Each dot is one test observation. The x-axis is the SHAP value (for a sklearn random forest, the contribution to the predicted fraud probability), and the color encodes the feature's actual value (red = high, blue = low).
import matplotlib.pyplot as plt

shap.summary_plot(sv, X_test, max_display=15, show=False)
plt.tight_layout()
plt.savefig("shap_summary_top15.png", dpi=150, bbox_inches="tight")
plt.close()
What to look for: features where red dots cluster on the right (high feature value pushes toward fraud) versus features where the relationship is mixed or reversed. For instance, if avg_paid_per_claim shows red dots on the right, that means providers with high per-claim billing amounts are being flagged, consistent with domain knowledge about upcoding.
Reading a Single Provider's SHAP Profile
To make this concrete, consider what a SHAP waterfall might look like for a specific flagged provider. Suppose the base rate (average model output across the dataset) sits at 0.12 on the probability scale. For this provider, billing_intensity pushes the prediction +0.15 toward fraud, monthly_spending_volatility adds another +0.07, while years_in_practice pulls it -0.08 toward legitimate and unique_hcpcs pulls -0.03. The net effect tips the model to about 0.23 — enough to trigger a flag. That per-feature decomposition is precisely what makes SHAP useful for audit workflows: rather than a black-box score, we can point to which features drove this particular decision and by how much.
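The arithmetic behind that hypothetical profile is just SHAP's additivity property; a sketch with the same illustrative numbers:

```python
base_rate = 0.12  # hypothetical average model output (probability scale)

# Illustrative per-feature SHAP values for one flagged provider.
contributions = {
    'billing_intensity': +0.15,
    'monthly_spending_volatility': +0.07,
    'years_in_practice': -0.08,
    'unique_hcpcs': -0.03,
}

# Additivity: base rate plus the per-feature SHAP values reconstructs
# this provider's score exactly.
prediction = base_rate + sum(contributions.values())
```
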
Step 6: Group Features into Conceptual Categories
Twenty-six features are too many to reason about individually. Grouping them into conceptual categories reveals which types of signal the model relies on most.
Intensity features (per-claim amounts, per-beneficiary costs, spending volatility) tend to dominate in fraud classifiers. This makes intuitive sense: fraudulent providers tend to bill at unusual rates per encounter rather than simply submitting more claims. Volume matters too, but intensity appears to be the stronger discriminator. The category breakdown below gives us a way to test whether this intuition holds in our particular model.
categories = {
'Volume': ['total_paid', 'total_claims', 'total_beneficiaries',
'months_active', 'claims_per_month', 'avg_monthly_paid',
'entity_type'],
'Intensity': ['avg_paid_per_claim', 'avg_paid_per_beneficiary',
'claims_per_beneficiary', 'max_single_month_paid',
'monthly_spending_volatility', 'cv_monthly_paid'],
'Behavioral': ['share_top_code', 'hcpcs_hhi', 'hcpcs_entropy',
'unique_hcpcs', 'share_em_codes', 'share_high_reimburse',
'telehealth_share', 'rbcs_category_diversity',
'billing_gap_ratio'],
'Peer-Relative': ['z_cpm', 'z_ppb', 'z_entropy', 'z_paid']
}
for cat, feats in categories.items():
cat_importance = importance[feats].sum()
cat_share = cat_importance / importance.sum()
print(f"{cat:15s} {cat_share:.1%}")
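The loop above assumes the real importance Series from Step 4; a self-contained miniature with invented names and numbers shows the mechanics:

```python
import pandas as pd

# Synthetic importances (invented numbers) to exercise the aggregation logic.
demo_importance = pd.Series({
    'total_paid': 0.08, 'avg_paid_per_claim': 0.12,
    'share_top_code': 0.04, 'z_cpm': 0.02,
})
demo_categories = {
    'Volume': ['total_paid'],
    'Intensity': ['avg_paid_per_claim'],
    'Behavioral': ['share_top_code'],
    'Peer-Relative': ['z_cpm'],
}

# Because the categories partition the features, the shares sum to 1.
shares = {cat: demo_importance[feats].sum() / demo_importance.sum()
          for cat, feats in demo_categories.items()}
```
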
In our fraud detection pipeline, the breakdown looks roughly like this:
| Category | Share of Total SHAP |
|---|---|
| Intensity | ~40% |
| Volume | ~35% |
| Behavioral | ~15% |
| Peer-Relative | ~10% |
The peer-relative features (z-scores comparing a provider to specialty peers) contribute surprisingly little. We engineered these to normalize for specialty differences, but the model appears to extract most of that signal from raw intensity features directly. One likely explanation: correlated features split SHAP values between them, and the z-scores correlate heavily with their raw counterparts. The category aggregation partially addresses this by summing within groups, but it's worth keeping in mind when interpreting any single feature's ranking.
Step 7: Does Importance Align with Domain Knowledge?
This is the step that separates mechanical SHAP computation from actual interpretation. Let's pose some questions and see whether the evidence supports sensible answers.
Do the top features match known fraud patterns? If avg_paid_per_claim and monthly_spending_volatility rank high, that's consistent with upcoding and burst-billing, both well-documented fraud schemes. If entity_type ranks high, we should check whether the model is picking up a real signal (organizations vs. individuals bill differently) or a data artifact.
Are any surprises genuine discoveries or artifacts? Suppose telehealth_share ranks unexpectedly high. Is that because telehealth genuinely correlates with fraud in the data (plausible for certain time periods), or because telehealth providers also tend to be smaller practices with different billing patterns? Disentangling association from mechanism requires domain investigation beyond what SHAP alone can provide.
Does the category breakdown shift across subgroups? Running the same aggregation on different provider specialties or entity types can reveal whether the model uses different signal types for different subpopulations. A model that relies on volume for one specialty and intensity for another might be picking up legitimate structural differences, or it might reflect label imbalance across groups.
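A sketch of the per-subgroup aggregation (synthetic arrays; in the real pipeline `sv` comes from Step 3 and the labels from a test-set column such as specialty):

```python
import numpy as np

rng = np.random.default_rng(0)
sv_demo = rng.normal(size=(100, 4))          # 100 providers, 4 features
subgroup = rng.choice(['A', 'B'], size=100)  # hypothetical specialty labels

# Mean |SHAP| computed separately within each subgroup.
per_group = {g: np.abs(sv_demo[subgroup == g]).mean(axis=0)
             for g in np.unique(subgroup)}
# per_group['A'] and per_group['B'] can now be compared feature by feature
# to see whether the model leans on different signals per subpopulation.
```
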
What Can Go Wrong
SHAP is robust, but interpretation is fragile. Here are the failure modes we've encountered.
Array shape ambiguity. As noted in Step 3, the shap library's output format changed across versions. Code that works with shap==0.41 may break on shap==0.44. Always check type(shap_values) and shap_values.shape before proceeding.
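One defensive pattern is to normalize the output shape in a single helper (a sketch, not part of the shap API; the branches cover the formats we've seen):

```python
import numpy as np

def positive_class_shap(shap_values):
    """Normalize binary-classification SHAP output to (n_samples, n_features).

    Handles both the older list-of-arrays format and the newer 3D array
    of shape (n_samples, n_features, n_classes).
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[1])  # older shap: [class0, class1]
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        return arr[:, :, 1]                # newer shap: last axis is class
    return arr                             # already (n_samples, n_features)

# Both formats reduce to the same 2D array:
old_style = [np.zeros((5, 3)), np.ones((5, 3))]
new_style = np.stack([np.zeros((5, 3)), np.ones((5, 3))], axis=-1)
```
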
Feature engineering artifacts. Z-scores, ratios, and log-transforms change what SHAP measures. If we feed the model z_paid (a provider's total billing normalized by specialty mean), SHAP tells us how much deviation from peers matters, not how much raw billing matters. These are different questions, and it's easy to conflate them.
Reweighting changes the SHAP landscape. If the training pipeline includes entropy balancing, inverse propensity weighting, or any sample reweighting, the SHAP values reflect the reweighted model rather than raw data relationships. Conditional importance under reweighting can look very different from unconditional importance. Both are valid; they answer different questions.
When to Use This Approach
TreeSHAP works best when we have a tree-based model (random forest, gradient boosting, XGBoost, LightGBM) and need per-prediction explanations. It's fast, exact, and well-supported. For global interpretation of a production classifier, it's hard to beat.
Less suitable when:
- The model is a neural network or SVM (use KernelSHAP or DeepSHAP instead, but expect slower computation)
- Features number in the thousands (SHAP summary plots become unreadable; consider feature selection first)
- The goal is causal inference rather than predictive explanation (use causal methods instead)
- Stakeholders need a single importance number per feature and won't engage with distributional plots (permutation importance might communicate more clearly)
Limitations: SHAP Explains the Model, Not the World
Three constraints are worth keeping in mind. First, SHAP values explain model behavior, not data-generating processes. High SHAP importance means the feature is useful for prediction; it does not mean the feature causes the outcome. A provider's zip code might have high SHAP importance because fraud prosecution rates vary by jurisdiction, which tells us nothing about whether geography causes fraud. If the model is wrong (biased training data, label noise, concept drift), SHAP will faithfully explain that wrong model. Second, computational cost scales with feature count and sample size; our 26-feature, ~3,000-sample test set runs in under a minute, but 500 features and 100,000 samples will need subsampling. Third, SHAP values are additive (they sum to the difference between the prediction and the base rate), which keeps the math clean but means interaction effects get split between the interacting features rather than reported separately; TreeSHAP's interaction values can recover them, at extra computational cost.
The full pipeline is available in 05_fraud_detection/phase4g_shap.py.
References
[1] Lundberg, S.M. & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems 30. The foundational paper connecting Shapley values to model interpretation.
[2] Lundberg, S.M. et al. (2020). "From local explanations to global understanding with explainable AI for trees." Nature Machine Intelligence, 2, 56--67. Introduces the TreeSHAP algorithm used here.
[3] The category shares (Volume ~35%, Intensity ~40%, Behavioral ~15%, Peer-Relative ~10%) are from a specific model run on Medicare Part B provider-level data. These proportions shift with feature engineering choices, training sample construction, and reweighting strategy.