XAI: Variance-Based Analysis
MVA - APM_5MV75_TP
Mathieu FONTAINE
mathieu.fontaine@telecom-paris.fr
January 2025
Goal: define "feature importance" precisely, then introduce Sobol indices and their use for local XAI (Fel et al.).
Big picture: importance is not a primitive concept — it depends on how we vary the input.
We will emphasize interactions (why total-order indices matter in deep models and in audio).
Outline
I - What Does “Feature Importance” Really Mean?
II - Measuring Importance Through Variance
III - Sobol Decomposition: Theory and Interpretation [Sob. 01]
IV - Local Feature Importance via Perturbations [Fel. 21]
V - Sobol Indices in Practice (Estimators)
VI - What Sobol Explains (and What It Does Not)
VII - Take-Home Messages
Part I: make the concept of importance explicit with simple examples.
Part II: introduce variance and conditional variance as the clean measure of sensitivity.
Part III: functional ANOVA / Sobol decomposition (why variance becomes additive).
Part IV: turn Sobol into local XAI by studying perturbation masks around a fixed input.
Part V: practical estimators (Saltelli/Jansen) + sample efficiency (QMC + Sobol sequences).
Part VI: interpretability caveats (off-manifold, baseline dependence, independence assumptions).
Sobol, I. M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 2001.
Fel, Thomas, et al. "Look at the variance! Efficient black-box explanations with Sobol-based sensitivity analysis." NeurIPS, 2021.
I - What Does “Feature Importance” Really Mean?
We start with the key point: importance is not uniquely defined.
We need to specify how we change the input and how we measure the effect on the prediction.
Not really mathematical yet…
Consider a model $f:\mathbb{R}^d \to \mathbb{R}$ (e.g. a classification score).
Intuitive definition:
A feature is important if changing it changes the prediction.
Changing it how?
Gradient-based importance (infinitesimal changes)
Occlusion / masking (finite removal)
SHAP (coalitions / marginal contributions)
Sobol (variance decomposition + interactions)
Stress: you cannot talk about importance without defining a notion of variation.
Different XAI methods correspond to different variation mechanisms + aggregation rules.
Example 1: Linear model
Consider $f(x_1,x_2)=3x_1+0.1x_2$.
Changing $x_1$ affects the output much more than changing $x_2$.
In this special case: importance $\approx$ coefficient magnitude.
This is a global notion (same for all inputs).
Here, "importance" feels obvious because the model is additive and linear.
But this breaks as soon as the function is non-linear or interaction-dominated.
Example 2: Gradient (local)
Consider $f(x_1,x_2)=x_1^2+x_2$.
$\nabla f(x) = (2x_1,\,1)$.
At $x=(0,1)$: the sensitivity to $x_1$ is $0$, to $x_2$ is $1$.
We have a local notion of importance.
It depends on the local metric / scaling.
Gradient answers: "which infinitesimal direction changes the output most at this point?".
It can fail on plateaus, saturations, or when information is mainly in interactions.
Example 3: Occlusion (finite change)
Occlusion = remove some input information and compare the prediction before/after.
Image: hide a patch
Audio: remove a frequency band / TF region
Text: delete a word
Let $f(x_1,x_2)=x_1x_2$ and input $(x_1,x_2)=(1,1)$. Then $f(1,1)=1$.
If we remove $x_1$ (set $x_1=0$), then $f(0,1)=0$. If we remove $x_2$, $f(1,0)=0$.
Both are important… individually or only jointly?
Occlusion detects "usefulness" but does not explain whether the usefulness is due to main effects or interactions.
This motivates variance decomposition methods that explicitly separate main/interaction contributions.
Hidden interaction (Sobol motivation)
In $f(x_1,x_2)=x_1x_2$, neither variable has a strong individual effect everywhere:
the effect comes from their interaction .
How can we separate what comes from individual effects and what comes from interactions?
We want a principled decomposition that assigns a "variance budget" to main effects and interactions.
This is the key promise of Sobol indices.
Feature importance (in this course)
Feature importance: measure how much the prediction varies when a given feature is allowed to vary,
possibly jointly with others.
Sobol: provides an exact and additive decomposition of variance into main effects and interactions (under assumptions).
We now formalize "variation" using probability and variance.
Then we show how variance can be split across variables and interactions.
Variance-based feature importance
To define "importance", we must define input variability .
$$\textbf{(1) Random input:}\qquad X=(X_1,\dots,X_d)\sim p_X$$
$$\textbf{(2) Random output:}\qquad Y=f(X)$$
$$\textbf{(3) Global variability:}\qquad \mathrm{Var}(Y)=\mathbb{E}[Y^2]-\mathbb{E}[Y]^2$$
$$\textbf{(4) Variance explained by }X_i:\qquad \mathrm{Var}\!\big(\mathbb{E}[Y\mid X_i]\big)$$
Intuition: if knowing $X_i$ changes the conditional mean prediction a lot, then $X_i$ is important.
At this stage, we do not define any "Sobol index" yet.
We only build the variance-based intuition: importance = explained variance.
\(\mathbb{E}[Y|X_i]\) is the best prediction of \(Y\) using only \(X_i\).
If \(X_i\) carries information, then \(\mathbb{E}[Y|X_i]\) varies with \(X_i\), hence its variance is large.
If \(X_i\) is irrelevant, then \(\mathbb{E}[Y|X_i]\approx \mathbb{E}[Y]\), hence variance is ~0.
II - Measuring Importance Through Variance
We formalize input variations by introducing random variables and studying output variance.
Key tool: law of total variance, leading to main and total effects.
Setup: make the output random
Deterministic model: $f:\mathbb{R}^d \to \mathbb{R}$
Introduce random input: $X=(X_1,\dots,X_d)$
Output becomes a random variable:
$$Y=f(X)$$
All importance notions depend on how we choose the distribution of $X$ (or local perturbations later).
Variance-based methods are global by default (distribution over the domain).
Fel et al. will adapt this to local explanations using perturbation distributions around a fixed input.
Notation
For a given index $i$, we denote:
$$X_{\sim i}=(X_1,\dots,X_{i-1},X_{i+1},\dots,X_d)$$
$X_i$ = “feature of interest”, $X_{\sim i}$ = “all other features”.
I will use $X_{\sim i}$ everywhere: it simply means “everything except $X_i$”.
Variance as global sensitivity
$$\mathbb{E}[Y]=\int f(x)\,p_X(x)\,dx$$
$$\mathrm{Var}(Y)=\mathbb{E}[Y^2]-\mathbb{E}[Y]^2$$
High variance $\Rightarrow$ sensitive output. Low variance $\Rightarrow$ stable output.
Variance ignores sign: it measures magnitude of variation, not direction.
This is exactly what we need for an “importance magnitude”.
Law of total variance
For any $Y$ and any variable $X_i$:
$$\mathrm{Var}(Y)=\mathrm{Var}\big(\mathbb{E}[Y\mid X_i]\big)+\mathbb{E}\big[\mathrm{Var}(Y\mid X_i)\big]$$
This identity is the starting point for variance-based feature importance.
Interpretation: explained variance + residual variance.
$\mathrm{Var}(\mathbb{E}[Y|X_i])$: how much the mean output changes with $X_i$.
$\mathbb{E}[\mathrm{Var}(Y|X_i)]$: what remains variable even if $X_i$ is known.
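The identity is easy to check numerically. Below is a minimal numpy sketch on the toy model $Y = X_1 X_2$, where the conditioning on $X_1$ is approximated by binning its range into narrow strips (the model, sample size, and bin count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
x1 = rng.uniform(0, 1, N)
x2 = rng.uniform(0, 1, N)
y = x1 * x2  # toy model with an interaction

# Approximate conditioning on X1 by binning its range into 100 narrow strips.
bins = np.linspace(0, 1, 101)
idx = np.digitize(x1, bins) - 1
cond_mean = np.array([y[idx == k].mean() for k in range(100)])
cond_var = np.array([y[idx == k].var() for k in range(100)])

var_explained = cond_mean.var()   # Var(E[Y | X1])
var_residual = cond_var.mean()    # E[Var(Y | X1)]
total = y.var()                   # Var(Y)
print(var_explained + var_residual, total)  # the two sides agree
```

The two printed numbers match up to Monte Carlo and binning error, illustrating that the law of total variance splits the output variance into an explained and a residual part.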
Main effect vs total effect
Main-effect variance contribution:
$$D_i \triangleq \mathrm{Var}\big(\mathbb{E}[Y\mid X_i]\big)$$
Total-effect variance contribution (includes interactions):
$$D_i^{\mathrm{tot}} \triangleq \mathbb{E}\big[\mathrm{Var}(Y\mid X_{\sim i})\big]$$
$$= \mathrm{Var}(Y)-\mathrm{Var}\big(\mathbb{E}[Y\mid X_{\sim i}]\big)$$
$D_i$: effect of $X_i$ alone.
$D_i^{\mathrm{tot}}$: everything involving $X_i$ (main + interactions).
Normalized indices
$$S_i \triangleq \frac{D_i}{\mathrm{Var}(Y)} \qquad (\text{first-order})$$
$$ST_i \triangleq \frac{D_i^{\mathrm{tot}}}{\mathrm{Var}(Y)} \qquad (\text{total-order})$$
$$0 \le S_i \le ST_i \le 1$$
$S_i$ tells you if the feature matters alone.
$ST_i$ tells you if it matters at all (alone or via interactions).
The gap $ST_i - S_i$ is an interaction diagnostic.
Toy example A: additive model
$X_1,X_2 \sim \mathcal{U}[0,1]$ independent, $Y=X_1+X_2$
$$\mathrm{Var}(Y)=\mathrm{Var}(X_1)+\mathrm{Var}(X_2)=\frac{1}{12}+\frac{1}{12}=\frac{1}{6}$$
$$D_1=\mathrm{Var}(\mathbb{E}[Y|X_1])=\mathrm{Var}(X_1)=\frac{1}{12}$$
$$S_1=\frac{D_1}{\mathrm{Var}(Y)}=\frac{1/12}{1/6}=\frac{1}{2}$$
No interactions: $S_1=ST_1=1/2$, $S_2=ST_2=1/2$.
Great sanity check: additive model means no interaction terms in Sobol decomposition.
Toy example B: pure interaction
$X_1,X_2 \sim \mathcal{U}[0,1]$ independent, $Y=(X_1-\frac{1}{2})(X_2 - \frac{1}{2})$
$$\mathbb{E}[Y]=0$$
$$\mathbb{E}[Y^2]=\frac{1}{12}\cdot\frac{1}{12}=\frac{1}{144} = \mathrm{Var}(Y)$$
$\mathbb{E}[Y \mid X_1]=0$ and, by symmetry, $\mathbb{E}[Y \mid X_2]=0$.
Main effects vanish: $S_1=S_2=0$. Total effects are maximal: $ST_1=ST_2=1$.
This is the “interaction-only” archetype: importance exists, but only jointly.
This is why we will care about total-order indices for XAI.
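The vanishing main effects and maximal total effects of this archetype can be checked by Monte Carlo. The sketch below estimates $S_1$ and $ST_1$ directly from their conditional-variance definitions, approximating the conditioning by binning (estimates, not exact values):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
x1 = rng.uniform(0, 1, N)
x2 = rng.uniform(0, 1, N)
y = (x1 - 0.5) * (x2 - 0.5)   # pure-interaction model

var_y = y.var()               # analytic value: 1/144

# First-order index of X1: Var(E[Y | X1]) / Var(Y), via binning X1.
idx1 = np.digitize(x1, np.linspace(0, 1, 101)) - 1
cond_mean = np.array([y[idx1 == k].mean() for k in range(100)])
S1 = cond_mean.var() / var_y  # ~ 0: the main effect vanishes

# Total-order index of X1: E[Var(Y | X2)] / Var(Y), via binning X2.
idx2 = np.digitize(x2, np.linspace(0, 1, 101)) - 1
cond_var = np.array([y[idx2 == k].var() for k in range(100)])
ST1 = cond_var.mean() / var_y # ~ 1: all variance flows through the interaction
print(S1, ST1)
```

The estimates land near $S_1 \approx 0$ and $ST_1 \approx 1$: importance exists, but only jointly.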
III - Sobol Decomposition: Theory and Interpretation [Sob. 01]
Now we justify why variance splits cleanly across main effects and interactions.
We do NOT aim for a full proof; we focus on the role of centering and orthogonality.
Functional ANOVA / Hoeffding decomposition
Assume $X_1,\dots,X_d$ independent and $f\in L^2$.
Assume that $\int f_u(x_u)\,d\mathbb{P}_{X_i}(x_i) = 0$ for all $i \in u$ and all $u \subseteq \{1, \dots, d\}$.
Then $f$ admits a unique decomposition:
$$f(X)=f_0+\sum_i f_i(X_i)+\sum_{i< j} f_{ij}(X_i,X_j)+\cdots+f_{1\cdots d}(X_1,\dots,X_d)$$
“main effects + pairwise interactions + higher-order interactions”.
Sobol, I. M. Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments 1, 407-414, 1993.
Van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 2012.
This is an ANOVA-like decomposition but for general non-linear functions.
Each term represents the information not explained by lower-order terms.
Centering constraint for variance additivity (1/2)
Without constraints, the decomposition is not unique (constants can be moved between terms).
The zero-mean constraints above make the terms identifiable:
$$\mathbb{E}[f_i(X_i)] = 0,\quad \mathbb{E}[f_{ij}(X_i,X_j)\mid X_i]=0,\quad \mathbb{E}[f_{ij}(X_i,X_j)\mid X_j]=0 $$ etc.
Each term contains only the “new effect” not already captured by lower-order components.
Example: you can add $+c$ to $f_1$ and subtract $c$ from $f_0$ without changing $f$; the constraints fix a canonical representation.
Interpretation: $f_i$ is the pure main effect after removing the global mean; $f_{ij}$ is the pure interaction after removing main effects.
Centering constraint for variance additivity (2/2)
Under independence + centering constraints, ANOVA components are orthogonal:
$$\mathbb{E}\big[f_u(X_u)\,f_v(X_v)\big]=0 \quad \text{for } u\neq v$$
Therefore, variance decomposes exactly:
$$\mathrm{Var}(f(X))=\sum_{u\neq\emptyset}\mathrm{Var}(f_u(X_u))$$
This is the core reason Sobol indices are “clean”: exact variance budget, no heuristics.
Like perpendicular vectors: cross-terms vanish (no covariance terms).
That’s why we choose variance (not Lp norms): it decomposes additively here.
Sobol indices for subsets
For any subset $u\subseteq\{1,\dots,d\}$:
$$D_u \triangleq \mathrm{Var}(f_u(X_u)),\qquad S_u\triangleq \frac{D_u}{\mathrm{Var}(f(X))}$$
$$S_u\ge 0,\qquad \sum_{u\neq\emptyset} S_u = 1$$
These indices form a probability-like decomposition of variance over subsets.
They quantify contribution of each interaction order.
Alternative expressions (useful)
Main-effect function (intuition):
$$f_i(x_i)=\mathbb{E}[f(X)\mid X_i=x_i]-f_0$$
Main-effect variance contribution (from Part II):
$$D_i=\mathrm{Var}\big(\mathbb{E}[f(X)\mid X_i]\big)$$
This is the bridge between Part II (conditional variance) and Part III (functional ANOVA).
Students often ask: “where do the functions f_i come from?” Answer: conditional expectation minus global mean.
This is also the basis for Monte Carlo estimators.
Total-order index and interactions
Total-order index aggregates all terms involving feature $i$:
$$ST_i = \sum_{u\ni i} S_u$$
Interpretation: $ST_i$ measures “everything that uses feature $i$” (directly or via interactions).
$$ST_i \approx S_i \Rightarrow \text{weak interactions}, \qquad ST_i \gg S_i \Rightarrow \text{strong interactions}.$$
For deep networks and signal tasks, the gap is often large: interactions dominate.
IV - Local Feature Importance via Perturbations [Fel. 21]
Part II–III were global: Sobol indices measure sensitivity under an input distribution.
Fel et al. adapt Sobol to LOCAL explanations by introducing a perturbation distribution around a fixed input x.
Randomness here is not the dataset distribution: it is the perturbation mask distribution.
Goal: produce a saliency map (heatmap) for one input x.
Local explanation (1/2)
Local variance-based explanation $\texttt{[Fel. 21]}$: perturb patches, evaluate the model, estimate total-order Sobol indices, and visualize as a heatmap.
We want to explain a single input $x$ (image / audio / text).
Introduce a random mask $M=(M_1,\dots,M_d)$ controlling visibility of regions.
Define a perturbation operator $\Phi(x,M)$.
Local explanation (2/2)
We want to explain a single input $x$ (image / audio / text).
Introduce a random mask $M=(M_1,\dots,M_d)$ controlling visibility of regions.
Define a perturbation operator $\Phi(x,M)$.
$$M=(M_1,\dots,M_d)\in[0,1]^d$$
$$\tilde x = \Phi(x,M)$$
$$Y = f(\tilde x)=f(\Phi(x,M))$$
We apply variance-based analysis to the random variable $Y$ induced by random perturbations $M$.
Very important: here “features” are the components of the mask, not raw pixels.
We can choose d as patches / superpixels / TF tiles / words.
Independence assumption is now on mask variables, and is often true by design (we sample them i.i.d.).
Perturbation operator: baseline inpainting
Typical continuous perturbation used in the paper:
$$\Phi(x,M)=x\odot M + (1-M)\odot \mu$$
$\mu$ is a baseline / reference input (black image / blur / mean / inpainted content).
Explanations are conditional on the perturbation model $(\Phi,\mu)$.
This is exactly like occlusion, but relaxed to a soft mask in [0,1].
Baseline choice is critical: it defines what “removing information” means.
In audio, baseline can be silence, spectral floor, or TF inpainting.
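A minimal sketch of the baseline-inpainting operator, assuming a toy 8×8 input, a 2×2 mask grid upsampled by nearest neighbour, and a black (zero) baseline; all sizes here are illustrative:

```python
import numpy as np

def perturb(x, mask, baseline):
    """Soft-mask perturbation: keep x where mask~1, fade to baseline where mask~0."""
    return x * mask + (1.0 - mask) * baseline

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (8, 8))            # toy 8x8 "image"
coarse = rng.uniform(0, 1, (2, 2))       # coarse mask: one value per region
mask = np.kron(coarse, np.ones((4, 4)))  # nearest-neighbour upsampling to pixel grid
mu = np.zeros_like(x)                    # "black image" baseline

x_tilde = perturb(x, mask, mu)
print(x_tilde.shape)
```

Note the two limiting cases: an all-ones mask returns the original input, an all-zeros mask returns the baseline, which is exactly the occlusion setting relaxed to a soft mask.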
From perturbations to a saliency map
We sample many masks $M^{(1)},\dots,M^{(N)}$.
We build perturbed inputs $\tilde x^{(n)}=\Phi(x,M^{(n)})$.
We evaluate the model outputs $Y^{(n)}=f(\tilde x^{(n)})$.
We estimate Sobol indices for each mask component $M_i$.
One importance score per region $\Rightarrow$ reshape into a grid $\Rightarrow$ heatmap.
This slide explains where the heatmap comes from: ST_i is computed per patch.
Then we just “put back” ST_i on the corresponding spatial region and upsample for display.
Local Sobol importance (Total-order)
For a given region/feature $i$ (mask component $M_i$), we use the total-order Sobol index.
$$ST_i =
\frac{
\mathbb{E}_{M_{\sim i}}\left[\mathrm{Var}_{M_i}\left(Y \mid M_{\sim i}\right)\right]
}{
\mathrm{Var}(Y)
}$$
$ST_i$ captures the contribution of region $i$, including all interactions with other regions.
This is the correct total-order formula: expectation of conditional variance.
Interpretation: fix all regions except i, vary only i, measure how much variability remains.
Deep models often rely on interactions (textures, shapes, harmonics) → total-order is safer than first-order.
Why total-order is the default in this paper
First-order $S_i$ measures the isolated effect of region $i$.
Total-order $ST_i$ measures everything that involves region $i$.
If $ST_i \gg S_i$, region $i$ mostly acts through interactions .
For local saliency, interactions matter a lot (edges + context, harmonics + formants, etc.).
Good opportunity to connect to the product example in Part II: pure interaction can have S_i=0 but ST_i=1.
This is why interaction-aware explanations are very different from gradients.
Signed importance (optional)
Sobol indices are non-negative: they measure magnitude (how much it matters), not direction.
A simple signed variant attaches a sign using an occlusion test:
$$\mathrm{sign}_i=\mathrm{sign}\big(f(x)-f(x\setminus i)\big)$$
$$ST_i^{\Delta} = ST_i \cdot \mathrm{sign}_i$$
The sign is baseline-dependent. The magnitude $ST_i$ is the robust part.
You can skip this slide if you want to stay minimal.
Signed maps are useful for “supports vs contradicts” storytelling, but depend strongly on what “removal” means.
V - Sobol Indices in Practice (Estimators)
Now we need to compute ST_i efficiently with only black-box access to f.
Fel et al. use Jansen estimators + Quasi-Monte Carlo (Sobol sequences) to reduce estimator variance.
Why we need estimators
Definitions involve conditional expectations / high-dimensional integrals.
We want a black-box estimator using only evaluations of $f(\Phi(x,M))$.
Key trick: build pairs of perturbations that differ in one component .
We will estimate the conditional variance term inside $ST_i$.
We are approximating “vary only region i, keep others fixed” using structured sampling matrices.
This is where the A/B/C matrices come from.
Saltelli sampling: A, B, and mixed C
Sample two matrices of masks: $A,B\in[0,1]^{N\times d}$.
For each feature $i$, build $C^{(i)}$ by replacing column $i$ of $A$ with column $i$ of $B$.
$$C^{(i)} = (A_1,\dots,A_{i-1},B_i,A_{i+1},\dots,A_d)$$
Evaluate $f(\Phi(x,\cdot))$ on $A$, $B$, and each $C^{(i)}$.
Saltelli, Andrea, et al. "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index." Computer physics communications 181.2 , 2010
Great figure: draw A and B as two matrices, then show the “swap one column” operation.
Interpretation: A and C(i) differ only in region i → isolates feature i effect (including interactions).
Jansen estimator: total-order
Estimate the output variance (using $A$):
$$\hat V = \frac{1}{N}\sum_{j=1}^N f(\Phi(x,A_j))^2
- \left(\frac{1}{N}\sum_{j=1}^N f(\Phi(x,A_j))\right)^2$$
$$\widehat{ST}_i=\frac{1}{2N\hat V}\sum_{j=1}^N
\Big(f(\Phi(x,A_j))-f(\Phi(x,C^{(i)}_j))\Big)^2$$
Jansen, M. Analysis of variance designs for model output. Computer Physics Communications, 1999.
Key point: squared differences measure sensitivity when only region i is switched.
The 1/(2N Vhat) normalization makes it a fraction of explained variance.
This estimator targets the “expected conditional variance” definition of ST_i.
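The estimator is short to implement. Below is a self-contained numpy sketch (the function name and sampling choices are illustrative, not the paper's code), sanity-checked on the pure-interaction toy model from Part II, whose total-order indices are both 1:

```python
import numpy as np

def jansen_total_order(f, N, d, rng):
    """Jansen estimator of total-order Sobol indices for f : [0,1]^d -> R."""
    A = rng.uniform(0, 1, (N, d))
    B = rng.uniform(0, 1, (N, d))
    fA = f(A)
    var_hat = fA.var()
    ST = np.empty(d)
    for i in range(d):
        C = A.copy()
        C[:, i] = B[:, i]   # swap column i: A and C^(i) differ only in feature i
        ST[i] = np.mean((fA - f(C)) ** 2) / (2.0 * var_hat)
    return ST

# Pure-interaction model: expected ST = (1, 1).
f = lambda X: (X[:, 0] - 0.5) * (X[:, 1] - 0.5)
ST = jansen_total_order(f, N=50_000, d=2, rng=np.random.default_rng(0))
print(ST)   # both close to 1
```

In the local XAI setting, `f` would be the composition $f(\Phi(x, \cdot))$ evaluated on mask samples, one estimate per mask component.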
Jansen estimator: first-order
$$\widehat{S}_i = 1-\frac{1}{2N\hat V}\sum_{j=1}^N
\Big(f(\Phi(x,B_j))-f(\Phi(x,C^{(i)}_j))\Big)^2$$
In XAI, we typically visualize $\widehat{ST}_i$ as the main saliency map.
First-order is useful pedagogically and for diagnosing interactions (ST much larger than S).
But in the paper, total-order is the main explanation map.
Computational cost
Evaluations required: $N(d+2)$ forward passes.
Typical setting in the paper: mask grid $11\times 11$ $\Rightarrow d=121$ and $N=32$:
$$32\times(121+2)=3936 \text{ forward passes}$$
Still feasible with batching on GPU.
Compare to random masking methods: many more masks are needed for stable heatmaps.
Batching is crucial: we can evaluate all perturbed inputs as a batch.
Quasi-Monte Carlo (Sobol sequences)
Instead of i.i.d. Monte Carlo masks, use a low-discrepancy sequence.
Better coverage of $[0,1]^d$ $\Rightarrow$ lower estimator variance in practice.
Very easy visual demo: random points vs Sobol points in 2D.
Main message: higher stability for the same budget.
Gerber, M. On integration methods based on scrambled nets of arbitrary size. Journal of Complexity, 2015.
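To keep the demo dependency-free, the sketch below uses a Halton sequence, an easy-to-code low-discrepancy cousin of the Sobol sequences used in the paper (in practice one would call a library QMC sampler). It compares integration error against i.i.d. Monte Carlo on a smooth test function:

```python
import numpy as np

def halton(n, base):
    """1-D Halton (van der Corput) sequence: a simple low-discrepancy sequence."""
    seq = np.zeros(n)
    for i in range(1, n + 1):
        f, x, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i - 1] = x
    return seq

n = 1024
qmc_pts = np.stack([halton(n, 2), halton(n, 3)], axis=1)   # 2-D low-discrepancy points
mc_pts = np.random.default_rng(0).uniform(0, 1, (n, 2))    # i.i.d. Monte Carlo points

# Smooth integrand with known integral 1/2 over [0,1]^2.
g = lambda p: np.cos(2 * np.pi * p[:, 0]) * p[:, 1] + p[:, 0]
err_qmc = abs(g(qmc_pts).mean() - 0.5)
err_mc = abs(g(mc_pts).mean() - 0.5)
print(err_qmc, err_mc)   # low-discrepancy error is typically much smaller
```

The better coverage of the unit square is the whole story: for the same evaluation budget, the estimator variance drops, hence more stable heatmaps.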
White-box vs Black-box explanations
In XAI, the evaluation protocol often depends on whether we have access to the model internals.
White-box methods use gradients / internals:
Examples: Saliency maps (∇), Integrated Gradients, Grad-CAM, LRP
Require access to model structure + backpropagation
Black-box methods use only model outputs:
Examples: Occlusion, RISE, SHAP, Sobol perturbation-based saliency
Only require forward passes: $f(x)$ queries
Black-box explanations are generally more expensive, but more model-agnostic.
White-box: you can open the model and see inside (gradients, activations).
Black-box: you treat the model as an oracle: you query it and observe outputs.
Fel et al. is fully black-box: everything is estimated from forward passes.
Evaluation metrics we show next are also mostly black-box protocols.
Evaluation #1: Pointing Game
Goal: check whether the saliency map points to the object of interest.
Requires: ground-truth object location (bounding box or segmentation mask).
Protocol:
Compute a saliency map $S(x)\in\mathbb{R}^{H\times W}$
Find the most salient pixel / region:
$$p^\star = \arg\max_{p} S_p(x)$$
Hit if $p^\star$ lies inside the ground-truth object mask / box
Score = fraction of images where the explanation “points” inside the target object.
Simple and intuitive, but only tests the single most salient point.
Pointing game is like: “Where does the explanation look the most?”
It is very popular because it is simple and cheap to compute.
But it ignores the rest of the heatmap: only the maximum matters.
Also depends on annotation quality: bounding boxes may be too coarse.
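The protocol fits in a few lines. A toy sketch with an 8×8 saliency map and a boolean ground-truth mask (shapes and values are illustrative):

```python
import numpy as np

def pointing_game_hit(saliency, gt_mask):
    """Hit if the single most salient pixel falls inside the ground-truth mask."""
    p = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(gt_mask[p])

gt = np.zeros((8, 8), dtype=bool)
gt[:4, :4] = True                  # object occupies the top-left quadrant

sal_hit = np.zeros((8, 8)); sal_hit[1, 2] = 1.0    # peak inside the object
sal_miss = np.zeros((8, 8)); sal_miss[6, 6] = 1.0  # peak outside the object
print(pointing_game_hit(sal_hit, gt), pointing_game_hit(sal_miss, gt))
```

The dataset-level score is simply the mean of this boolean over all annotated images.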
Evaluation #2: Deletion score
Goal: measure how “faithful” the saliency map is to the model decision.
Idea: remove the most important pixels first and track the output drop.
Protocol:
Rank pixels/regions by importance (descending): $p_1,p_2,\dots$
Create a sequence of perturbed inputs $x^{(k)}$ by deleting top-$k$ pixels
Track the score (e.g., class logit/probability) along the deletion path:
$$s_k = f_c\!\left(x^{(k)}\right)$$
A good explanation yields a fast decrease of the target class score under deletion.
Typical metric: Area Under the Curve (AUC) of $s_k$ vs. deleted fraction
Lower AUC = more faithful (faster score collapse)
Deletion answers: “If we remove what the explanation says is important, does the model fail?”
We can compute it for any black-box model, no gradients needed.
Important caveat: depends strongly on the deletion operator (baseline / blur / inpainting).
Deletion may create off-manifold samples: the model might behave strangely.
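A minimal sketch of the deletion protocol, using the average score along the deletion path as a simple Riemann approximation of the AUC (the toy model, saliency maps, and grid size are made up):

```python
import numpy as np

def deletion_score(f, x, saliency, baseline, steps=10):
    """Delete the most salient pixels first; average the model score along
    the path (lower = more faithful)."""
    order = np.argsort(saliency.ravel(), kind="stable")[::-1]  # most salient first
    scores = []
    for k in range(steps + 1):
        xk = x.copy().ravel()
        removed = order[: int(k / steps * order.size)]
        xk[removed] = baseline.ravel()[removed]
        scores.append(f(xk.reshape(x.shape)))
    return float(np.mean(scores))

# Toy model: the score is the mean of the top-left quadrant.
f = lambda img: img[:4, :4].mean()
x = np.ones((8, 8)); baseline = np.zeros_like(x)
good_sal = np.zeros((8, 8)); good_sal[:4, :4] = 1.0   # targets the right region
bad_sal = np.zeros((8, 8)); bad_sal[4:, 4:] = 1.0     # targets an irrelevant region
print(deletion_score(f, x, good_sal, baseline), deletion_score(f, x, bad_sal, baseline))
```

The saliency map that targets the quadrant the model actually reads collapses the score much faster, hence a lower deletion score.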
Fel et al. (2021): what to remember
Local Sobol saliency maps are obtained by applying Sobol indices to a perturbation model.
Total-order maps naturally capture interactions between regions.
QMC sampling improves stability / convergence for a fixed evaluation budget.
Provides a principled “variance budget” view of explanations.
Emphasize: the method is faithful to variance-based decomposition, not just heuristic masking.
VI - What Sobol Explains (and What It Does Not)
We now clarify what kind of “explanation” Sobol provides.
It’s sensitivity under a perturbation model — not causality.
Interpretation: sensitivity, not causality
Sobol-based explanations quantify sensitivity under a user-defined perturbation model.
They do not provide causal explanations.
Depends on $p(M)$ (mask distribution) and $\Phi(x,M)$ (perturbation operator).
Depends on baseline $\mu$.
Key message: explanations are conditional.
If you change baseline or perturbation model, you change the question you ask the model.
Limitations / pitfalls
Off-manifold: perturbations may create unrealistic inputs.
Feature dependence: classical Sobol assumes independent variables (true for masks; not always for raw features).
Granularity: pixel vs superpixel vs patch changes the explanation.
Faithfulness metrics: deletion/insertion depend on the perturbation protocol.
Sobol is principled, but the explanation is conditional on modeling choices.
Important: in “real” images/audio, features are correlated; Sobol classic theory assumes independence.
Fel et al. partially bypass this by defining independent mask variables.
Audio / Signal Segment (Why Sobol is natural here)
Audio is interaction-heavy: time-frequency structure, harmonics, formants.
Sobol total-order naturally captures such joint effects.
Time-frequency features
Mixture: $x(t)=s(t)+n(t)$
STFT: $X(\omega,\tau)$
Mask TF regions:
$$\widetilde{X}(\omega,\tau)=M(\omega,\tau)\odot X(\omega,\tau)$$
$$Y = f(\widetilde{X})$$
Feature can be a TF-bin, a TF-patch, or a frequency band.
This matches your audio background: Sobol saliency can be presented as “which TF regions drive the model output”.
Good figure: show a spectrogram and an importance heatmap overlay.
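A toy numpy sketch of TF masking, where the "feature" is a frequency band of a synthetic magnitude spectrogram (shapes, bin indices, and the scoring function are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 64, 32                        # frequency bins x time frames
X = rng.uniform(0, 0.1, (F, T))      # noise floor
X[10, :] += 1.0                      # "fundamental" at bin 10
X[20, :] += 0.5                      # "harmonic" at bin 20

# Mask a frequency band (bins 8..12): the feature is the band, not a single bin.
M = np.ones((F, T))
M[8:13, :] = 0.0
X_masked = M * X

# Toy model reading energy at both harmonic positions.
score = lambda S: S[10, :].mean() + S[20, :].mean()
print(score(X), score(X_masked))     # masking the band removes the fundamental's energy
```

In the Sobol setting, each band (or TF patch) gets one mask component, and total-order indices report which spectro-temporal regions drive the output, including joint harmonic effects.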
Why interactions matter in audio
Speech/music structure:
harmonics (correlated frequencies)
formants (bands)
time coherence (onsets/transients)
Many features are informative only jointly .
Total-order Sobol indices capture these spectro-temporal interactions.
Example: removing one harmonic alone may not hurt, but removing the harmonic structure does.
Sobol total-order attributes importance to bins that matter through joint structure.
Spectrogram in audio
The word "friend" seems important (maybe the phoneme?); some specific frequencies do too.
Take-Home Messages
Feature importance must be defined : it depends on “how you vary the input”.
Variance is a principled sensitivity measure (scalar, stable, decomposable).
Sobol = functional ANOVA : exact decomposition into main effects + interactions (under assumptions).
Total-order indices are key in interaction-heavy models.
Local XAI : apply Sobol to $g(M)=f(\Phi(x,M))$ using perturbation masks.
End with the “conditional explanation” statement: Sobol explains sensitivity under a specific perturbation model.
If time: recap the interaction diagnostic $ST_i - S_i$.
References
Sobol, I. M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 2001.
Fel, Thomas, et al. "Look at the variance! Efficient black-box explanations with Sobol-based sensitivity analysis." NeurIPS, 2021.
Jansen, M. Analysis of variance designs for model output. Computer Physics Communications, 1999.
Gerber, M. On integration methods based on scrambled nets of arbitrary size. Journal of Complexity, 2015.
XAI: Counterfactual explanations
MVA - APM_5MV75_TP
Mathieu FONTAINE
mathieu.fontaine@telecom-paris.fr
January 2025
Goal: explain a prediction by answering: “what minimal changes would flip the decision?”
Focus today: DiCE = generate multiple diverse counterfactual explanations.
We keep the course MVA-style: optimization, constraints, metrics, limitations.
Outline
I — Why counterfactuals?
II — Formal definition
III — Single counterfactual = optimization view
IV — DiCE: multiple diverse counterfactuals
V — Distances, constraints, mixed data
VI — Evaluation + limitations
VII — App: Counterfactuals for medical image
VIII — App: Counterfactuals in text-audio
We’ll move from “one CF” to “set of CFs” + diversity.
We’ll emphasize actionability, constraints, and evaluation.
I — Why Counterfactual Explanations?
Counterfactuals are explanations in the form of actionable alternatives.
Different from “feature importance”: not “what mattered?”, but “what to change?”.
Explanation as “minimal change”
We have an instance $x$ and a model prediction $y=f(x)$.
We want a new point $c$ such that the decision becomes $y^\star$.
Counterfactual = “a feasible input close to $x$ that flips the outcome”.
Human-friendly: "If income were +5k€ and savings +2k€, the loan would be approved".
Useful for recourse: suggests changes.
Why one counterfactual is not enough
Many different changes can flip the decision.
We want multiple options to provide flexibility.
For a given $x$, the set $\{c: f(c)=y^\star\}$ can be large and disconnected.
Users may be unable/unwilling to change certain features.
Hence “diverse CFs”: different routes to the target outcome.
II — Formal Definition
We keep this general: classifier or score model.
Then we’ll instantiate in optimization objectives.
Model + target outcome
Let $f:\mathbb{R}^D \to \mathcal{Y}$ be a black-box predictor.
Given an input $x\in\mathbb{R}^D$, we want $c\in\mathbb{R}^D$ such that:
$$f(c)=y^\star$$
In practice: $f$ may output a probability/logit; we enforce $f(c)\ge \tau$ for some target threshold.
For multiclass, y* is a class; for binary, y*=1 for “approved”.
Thresholding: ensures robust flip.
Core desiderata
Validity: $c$ achieves the target outcome.
Proximity: $c$ is close to $x$ (small change).
Feasibility / actionability: respect constraints (immutable features).
Interpretability: sparse changes preferred (few features changed).
These are in tension: smaller proximity may conflict with feasibility or sparsity.
DiCE focuses also on “diversity” for sets of counterfactuals.
III — Single Counterfactual: Optimization View
First: one CF, standard formulation: minimize distance subject to flipping.
Then: relax constraints into loss.
Constrained formulation
Minimal change subject to achieving the target:
$$\min_{c\in\mathcal{C}} \; d(c,x) \quad \text{s.t.}\quad f(c)=y^\star$$
$\mathcal{C}$ encodes feasibility: immutable features, box constraints, categorical constraints.
This is a generic “projection to decision set” problem.
Hard constraint can be relaxed for gradient-based optimization.
Relaxed (penalized) formulation
Convert constraint to a loss term:
$$\min_{c\in\mathcal{C}}\; \lambda \,\ell\big(f(c),y^\star\big) + d(c,x)$$
$\ell$ can be hinge-like on the target class probability (or cross-entropy on $y^\star$).
Typical trick: if model returns score s(c), enforce s(c) >= tau using hinge loss.
Many CF methods are variants of this optimization.
Discrete / categorical constraints
Some features are categorical ({CDI, CDD, Freelancer}, one-hot encoded), ordinal ({Bac < Bachelor < Master < PhD}), or continuous (salary, etc.).
Practical solutions:
continuous relaxation (categories become real-valued) + projection back onto the category space
heuristic search (beam search, simulated annealing, etc.)
optimization in a continuous latent space of a generator
DiCE provides practical handling for mixed data via tailored distance + postprocessing.
IV — DiCE: Diverse Counterfactual Explanations
Now the key: produce K counterfactuals at once.
We add an explicit diversity term. This is the paper’s main novelty.
K counterfactuals
Find $\mathcal{C}=\{c_1,\dots,c_K\}$ such that each $c_k$ is valid and close.
Additionally: counterfactuals should be diverse .
$$f(c_k)=y^\star,\quad k=1,\dots,K$$
Diversity gives multiple “paths” to reach the desired outcome.
Example: either change income, or reduce debt, or extend employment duration…
Users can pick the most feasible explanation.
DiCE objective in a nutshell
Optimize all CFs jointly:
$$\min_{c_1,\dots,c_K\in\mathcal{C}} \;\;
\sum_{k=1}^K \Big[ \lambda\,\ell(f(c_k),y^\star) + d(c_k,x)\Big]
\;-\; \eta \cdot \mathrm{Diversity}(c_1,\dots,c_K)$$
This is the standard CF loss + a repulsion/spread term.
The paper uses a DPP-style determinant-based diversity.
DPP-style diversity (determinant)
Define a similarity kernel between counterfactuals, e.g. RBF:
$$K_{ij}=\exp\!\left(-\frac{\|c_i-c_j\|_2^2}{\sigma^2}\right)$$
$$\mathrm{Diversity}(c_1,\dots,c_K)=\det(K)$$
$\det(K)$ is large when points are dissimilar (spread out).
Intuition: if two CFs are identical, kernel rows become dependent → det small.
Determinant encourages “orthogonal-ish” set → spread.
We don’t need full DPP theory; determinant as a diversity score is enough.
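The determinant-based diversity score is nearly a one-liner. A sketch with an RBF kernel (the bandwidth $\sigma$ and the example points are chosen arbitrarily):

```python
import numpy as np

def diversity(C, sigma=1.0):
    """det of the RBF kernel over a set of counterfactuals (rows of C)."""
    sq = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.linalg.det(np.exp(-sq / sigma**2))

collapsed = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]])  # near-duplicate CFs
spread = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])       # well-separated CFs
print(diversity(collapsed), diversity(spread))                # tiny vs close to 1
```

Near-duplicate counterfactuals make the kernel rows almost linearly dependent, so the determinant collapses toward 0; a well-spread set keeps it near 1.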
Practical note: gradients
If $f$ is differentiable (NN), optimize by gradient descent in input space.
If $f$ is non-differentiable, DiCE can use model-agnostic search / sampling.
Differentiable optimization is fast but can produce off-manifold counterfactuals.
In practice you often constrain or regularize to avoid unrealistic CFs.
We’ll discuss feasibility and manifold issues in Part VI.
V — Distances, Constraints, Mixed Data
The choice of distance d(.,.) is not cosmetic: it defines what “minimal change” means.
For mixed data, we must treat categorical features properly.
Continuous features: robust scaling (MAD)
For continuous features, normalize by median absolute deviation (MAD):
$$d_{\text{cont}}(c,x)=\sum_{j\in \mathcal{J}_{\text{cont}}}\frac{|c_j-x_j|}{\mathrm{MAD}_j}$$
MAD is more robust than standard deviation (less sensitive to outliers).
This is important in tabular data: different scales (income vs age).
Using raw L2 can be meaningless if features have different units.
Categorical features: mismatch penalty
For one-hot encoded categorical features:
$$d_{\text{cat}}(c,x)=\sum_{j\in \mathcal{J}_{\text{cat}}} \mathbf{1}[c_j \neq x_j]$$
Enforce valid one-hot vectors via projection (argmax) or constraints.
If you relax one-hot to continuous, you must project back (otherwise infeasible categories).
Some implementations treat categorical as immutable to avoid invalid states.
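The MAD-scaled continuous term and the categorical mismatch term combine into one mixed distance. A minimal sketch (helper names `mad` and `mixed_distance` are illustrative, and the toy income values are made up to show MAD's robustness to an outlier):

```python
import numpy as np

def mad(values):
    """Median absolute deviation: a robust scale for continuous features."""
    med = np.median(values)
    return float(np.median(np.abs(values - med)))

def mixed_distance(c, x, cont_idx, cat_idx, scale):
    """MAD-normalized L1 on continuous features + mismatch count on categoricals."""
    d_cont = sum(abs(c[j] - x[j]) / scale[j] for j in cont_idx)
    d_cat = sum(int(c[j] != x[j]) for j in cat_idx)
    return d_cont + d_cat

# Toy record: feature 0 = income (continuous), feature 1 = housing type (categorical).
income_train = np.array([20.0, 30.0, 40.0, 50.0, 200.0])  # outlier barely moves MAD
scale = {0: mad(income_train)}
x = [30.0, "rent"]
c = [40.0, "own"]
d = mixed_distance(c, x, cont_idx=[0], cat_idx=[1], scale=scale)
```

Note how the 200.0 outlier would inflate a standard deviation but leaves the MAD at 10, so the income change of 10 counts as exactly one robust unit.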
Immutable / actionable constraints
Partition features into:
immutable: cannot change (e.g., age, past defaults)
actionable: can change (e.g., savings, debt)
conditionally actionable: changes allowed only in one direction
$$c_j = x_j \;\;\text{for } j\in\mathcal{J}_{\text{immut}}$$
This is essential: otherwise explanations can be unethical/unrealistic.
Also prevents trivial solutions like “change race / gender”.
Sparsity
Interpretability often requires changing few features.
Proxy: encourage sparse deltas $\Delta = c-x$ via $\ell_1$ penalty:
$$d(c,x) \;+\; \gamma \|c-x\|_1$$
Alternative: post-process CFs to keep only the largest changes.
True sparsity is L0, but L1 is a convex proxy.
For discrete features, “number of changed features” is more direct.
VI — Evaluation and Limitations
How do we compare counterfactual methods?
And what can go wrong: feasibility, realism, causality, manifold issues.
Standard evaluation metrics
Validity: fraction of CFs achieving the target.
Proximity: average distance $d(c_k,x)$.
Diversity: determinant-based or pairwise distance statistics.
Sparsity: average number of changed features.
These metrics serve both as optimization objectives and as reported criteria for comparing methods.
Diversity is crucial when K>1; otherwise CFs collapse to the same solution.
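The four metrics above are straightforward to compute. A minimal sketch on a toy linear “credit” model (the function name `evaluate_cfs`, the model, and the numbers are illustrative):

```python
import numpy as np

def evaluate_cfs(f, x, cfs, target, tol=1e-9):
    """Validity, proximity (mean L1 distance), sparsity (mean #changed features)."""
    validity = float(np.mean([f(c) == target for c in cfs]))
    proximity = float(np.mean([np.sum(np.abs(c - x)) for c in cfs]))
    sparsity = float(np.mean([np.sum(np.abs(c - x) > tol) for c in cfs]))
    return validity, proximity, sparsity

# Toy model: approve (1) iff income - debt > 0.
f = lambda v: int(v[0] - v[1] > 0)
x = np.array([1.0, 2.0])             # rejected instance
cfs = [np.array([3.0, 2.0]),         # CF 1: raise income
       np.array([1.0, 0.5])]         # CF 2: reduce debt
validity, proximity, sparsity = evaluate_cfs(f, x, cfs, target=1)
```

Both counterfactuals flip the decision by changing a single feature each, which is exactly the diverse, sparse recourse DiCE aims for.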
Local decision boundary approximation (intuition)
Counterfactuals approximate the local boundary around $x$.
One can fit a simple surrogate near $x$ (e.g., 1-NN on generated samples).
Idea: evaluate how well CFs “cover” different boundary directions locally.
This connects CFs with local explanation fidelity: are we really exploring the neighborhood?
Good to mention, but keep details light unless you want extra depth.
Main limitations
Off-manifold: CFs can be unrealistic (not resembling real data).
Feature dependence: changing one feature may require changing others.
Causality: actionable recourse should respect causal relations.
Multiple solutions: must balance proximity vs diversity vs sparsity.
Counterfactuals are “action suggestions” only under a chosen feasibility model.
Emphasize: CFs are not causal explanations unless constraints encode causality.
Connecting to MVA: this is about optimization under constraints, not truth.
Take-Home Messages
Wrap-up: definition, optimization, diversity, constraints, pitfalls.
Take-Home Messages
Counterfactuals explain by giving changes that flip the outcome.
Mathematically: solve a constrained optimization (validity + proximity).
DiCE: generate K CFs and add a diversity term (determinant / DPP-style).
Mixed data + constraints are central (actionability, immutability, categorical).
CFs are only meaningful under an explicit feasibility model.
Encourage students to always state distance + constraints + target threshold.
Next step: connect to recourse and causal constraints.
References
DiCE: Mothilal, Sharma, Tan. “Explaining ML Classifiers through Diverse Counterfactual Explanations.” FAT* 2020.
(Optional background) Wachter et al. “Counterfactual explanations without opening the black box.” 2017.
Application (Images) — Counterfactuals via Diffusion Autoencoder [Atad et al., 2024]
Goal: generate counterfactual medical images that remain realistic by moving in a generative latent space.
Key idea: do counterfactuals in the latent space of a diffusion autoencoder (unsupervised), then add a simple linear classifier/regressor on top.
Why MVA-friendly: geometry in latent space + hyperplane distance + controlled traversal across decision boundary.
Problem setting (medical imaging)
We have an image classifier / regressor: what minimal change flips the decision?
We want counterfactual image $x_{cf}$ such that:
$$\text{(flip)}\quad f(x_{cf}) \neq f(x) \qquad \text{and}\qquad \text{(realism)}\quad x_{cf}\sim p_{\text{data}}$$
Pixel-space optimization often produces adversarial-like artifacts. Generative latent spaces help keep edits on-manifold.
Matan Atad et al., Counterfactual Explanations for Medical Image Classification and Regression using Diffusion Autoencoder, Machine Learning for Biomedical Imaging (MELBA), 2024
Remind difference with adversarial examples: adversarial can be imperceptible but off-manifold; counterfactual aims to be plausible and semantically meaningful.
In medical imaging, plausibility is critical for clinician trust.
Diffusion Autoencoder (DAE) as unsupervised feature extractor
Train a DAE on unlabeled images → semantic latent space $z_{\mathrm{sem}}$.
DAE learns a compressed representation where interpolations are meaningful.
$$x \xrightarrow{\text{encoder}} z_{\mathrm{sem}} \in \mathbb{R}^m \qquad\qquad \hat x = \text{decode}(z_{\mathrm{sem}}, \text{(stochastic latent)})$$
Unsupervised training → same latent space can support multiple downstream tasks (classification + ordinal regression).
From the paper: the DAE's semantic encoder yields $z_{\mathrm{sem}}$; decoding uses conditional diffusion to reconstruct.
Take-home: CE generation becomes a latent edit + decode, instead of directly editing pixels.
Linear decision boundary in latent space
Encode labeled samples: $w = z_{\mathrm{sem}}(x)$
Fit a linear classifier (SVM / linear layer): hyperplane
$$\mathcal{P}:\quad n^\top w + b = 0$$
$$\text{signed distance:}\quad \mathrm{dist}(w,\mathcal{P})=\frac{n^\top w + b}{\|n\|}$$
$n:$ semantic direction corresponding to the presence of the pathology
$b:$ bias term
The vector n becomes a semantic direction for the pathology.
This is the key geometric ingredient: you now have a direction that corresponds to “more pathology” vs “more healthy”.
Distance to hyperplane can be used as a continuous severity proxy (then calibrate to ordinal grades).
Binary counterfactual = reflect across hyperplane
Given latent code w , the paper proposes a closed-form CE in latent space.
$$w_{cf} = w - 2\cdot \mathrm{dist}(w,\mathcal{P})\;\frac{n}{\|n\|}$$
Then decode: $$x_{cf} = \mathrm{decode}(w_{cf})$$
CE is an on-manifold semantic edit: flip class while staying realistic.
Geometric intuition: reflection keeps the same distance to boundary but on the other side.
This is much cleaner than iterative pixel optimization; the generative model enforces realism.
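The reflection is closed-form, so it is easy to verify that the counterfactual lands at the mirror-image point: same unsigned distance, opposite sign. A toy check with random values for $n$, $b$, and $w$ (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(size=5)     # hyperplane normal (illustrative values)
b = 0.7                    # bias term

def signed_dist(w):
    """Signed distance of latent code w to the hyperplane n^T w + b = 0."""
    return (n @ w + b) / np.linalg.norm(n)

w = rng.normal(size=5)                                   # latent code of the input
w_cf = w - 2.0 * signed_dist(w) * n / np.linalg.norm(n)  # reflect across the plane
```

Decoding `w_cf` through the DAE would then give the counterfactual image; the geometry guarantees the class score of the linear head flips sign.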
Ordinal counterfactuals (severity progression)
Severity measure:
severity is defined as the signed distance to the hyperplane
$$
\mathrm{dist}(w,\mathcal{P}) = \frac{n^\top w + b}{\|n\|}
$$
Calibration to ordinal grades:
$$
\mathrm{grade}(w) \approx \alpha\,\mathrm{dist}(w,\mathcal{P}) + \beta
$$
Controlled severity traversal:
we explicitly parametrize a latent trajectory along the normal direction
$$
w(t) = w_0 + t\,\frac{n}{\|n\|}
$$
Key property:
along this path,
$$
\mathrm{dist}(w(t),\mathcal{P}) = \mathrm{dist}(w_0,\mathcal{P}) + t
$$
$\Rightarrow$ increasing $t$ increases severity linearly.
Counterfactual generation:
decoded images form a smooth severity progression
$$
x(t) = \text{decode}(w(t))
$$
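The linearity of severity along the traversal follows from a one-line computation, and a tiny numeric check confirms it (toy values for $n$, $b$, and $w_0$):

```python
import numpy as np

n = np.array([1.0, 2.0, 2.0])   # toy normal direction, ||n|| = 3
b = -1.0
unit = n / np.linalg.norm(n)

def dist(w):
    """Signed distance to the hyperplane n^T w + b = 0."""
    return (n @ w + b) / np.linalg.norm(n)

w0 = np.array([0.5, -0.2, 0.1])
ts = np.linspace(-2.0, 2.0, 9)
# dist(w(t)) = dist(w0) + t: severity moves linearly with t along the traversal.
dists = np.array([dist(w0 + t * unit) for t in ts])
```

So the single scalar $t$ acts as a severity knob, and each $w(t)$ can be decoded into one frame of the progression.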
Results of the generator
sometimes hallucinations appear when moving too far in latent space.
Quantitative results (accuracy, grade prediction, etc.)
Limitations
Linearity assumption: a single hyperplane direction may not match complex pathologies.
Confounders in latent space: “move to healthy” may introduce spurious anatomical changes (bias).
Calibration: mapping distance → grade can be imperfect (especially under label scarcity / imbalance).
Counterfactuals are only as good as (i) the generative model, (ii) the latent geometry, (iii) the downstream classifier.
Strong MVA message: this is geometry + representation learning + simple supervised layer.
Counterfactuals become a tool for diagnosing model bias (not only explaining decisions).
Application (Audio) — Learning with Counterfactual Captions [Vosoughi et al., 2024]
Switch domain: here counterfactuals are used during representation learning, not only for explaining a fixed classifier.
Idea: build counterfactual captions (text) for the same audio and enforce the embedding geometry to separate factual vs counterfactual descriptions.
Why it matters: many “alternative scenarios” are hard/unsafe/impossible to record (gunshot vs firework in same context).
Main idea
This paper can be chosen for the evaluation!
Ali Vosoughi et al., “Learning Audio Concepts from Counterfactual Natural Language,” ICASSP, 2024
XAI: Prototypical Networks
MVA - APM_5MV75_TP
Mathieu FONTAINE
mathieu.fontaine@telecom-paris.fr
February 2025
Goal: explain predictions by comparisons to learned prototypes.
Connect to kNN, metric learning, counterfactuals, and robustness.
We focus on the math: objectives, gradients, and a few clean propositions.
Why prototypes?
Standard deep nets rely on abstract latent features.
Prototype nets explain by analogy : “this looks like that”.
Interpretability is intrinsic (part of the model), not post-hoc.
We build a model where logits are sums of similarities to learned prototypes.
Snell et al. Prototypical Networks for Few Shot Learning, NIPS 2017
Good to contrast with gradients/SHAP: here explanations are example-based.
Prototype = representative pattern in a learned metric space.
Outline
I — Motivations
II — Prototypical networks: $\texttt{[Sne. 17]}$
III — ProtoPNet: patch prototypes $\texttt{[Che. 19]}$
IV - Prototypical and Counterfactual
V — Prototypes in Sound Classification $\texttt{[Zin. 21]}$
VI — take-home
Snell et al. Prototypical Networks for Few Shot Learning, NIPS 2017
Chen et al. This Looks Like That: Deep Learning for Interpretable Image Recognition, NeurIPS 2019
Zinemanas et al. An Interpretable Deep Learning Model for Automatic Sound Classification, Electronics, 2021.
We will prove 3–5 short propositions; full proofs in the .tex handout.
Notebook TP: implement prototypical classifier + prototype projection + explain a few test samples.
II — Prototypical networks: $\texttt{[Sne. 17]}$
Few-shot classification setting
We consider an episodic learning setup.
Each episode mimics a few-shot task.
Support set:
$S = \{(x_i, y_i)\}_{i=1}^N,\quad y_i \in \{1,\dots,K\}$
$S_k = \{x_i : y_i = k\}$
Key idea: train and test conditions must match.
Episodes are small classification problems.
Learning an embedding
Inputs are mapped to a representation space:
$f_\phi : \mathbb{R}^D \to \mathbb{R}^M$
$f_\phi$ is a neural network (CNN, ResNet, …)
All reasoning happens in the embedding space
Non-linearity is entirely in the embedding.
The classifier itself will be very simple.
Class prototypes
Each class is represented by a prototype.
$$
c_k = \frac{1}{|S_k|} \sum_{(x_i,y_i)\in S_k} f_\phi(x_i)
$$
A prototype is the mean embedding of support examples.
This is not a heuristic: it will be justified mathematically.
One prototype per class.
Classification rule
$$
p_\phi(y=k\mid x)
=
\frac{\exp\big(-d(f_\phi(x), c_k)\big)}
{\sum_{k'} \exp\big(-d(f_\phi(x), c_{k'})\big)}
$$
$d(\cdot,\cdot)$ is a distance in embedding space
This is a soft nearest-prototype classifier.
All classes compete via distances.
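The whole classifier (prototype computation plus soft nearest-prototype rule) fits in a few lines. A minimal NumPy sketch on a toy 2-way episode with a 2-D embedding space (function names and values are illustrative; in practice the embeddings come from $f_\phi$):

```python
import numpy as np

def prototypes(embeddings, labels, num_classes):
    """Class prototype = mean embedding of its support examples."""
    return np.stack([embeddings[labels == k].mean(axis=0)
                     for k in range(num_classes)])

def predict_proba(z, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = -np.sum((protos - z) ** 2, axis=1)
    e = np.exp(logits - logits.max())   # stabilized softmax
    return e / e.sum()

# Toy episode: two support points per class, already embedded.
emb = np.array([[0.0, 0.0], [0.2, 0.0],   # class 0 support
                [3.0, 3.0], [3.2, 3.0]])  # class 1 support
lab = np.array([0, 0, 1, 1])
protos = prototypes(emb, lab, num_classes=2)
p = predict_proba(np.array([0.1, 0.1]), protos)  # query near class 0
```

A query near the class-0 support gets almost all the probability mass, exactly the soft nearest-prototype behavior described above.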
Episodic training objective
Minimize negative log-likelihood over the episode's query set $Q$:
$$
\mathcal{L}(\phi)
=
- \sum_{(x,y)\in Q} \log p_\phi(y \mid x)
$$
Episodes are sampled during training
SGD on $\phi$
Same embedding used for support and query.
No fine-tuning at test time.
Why is the prototype the mean?
Assume the distance is a Bregman divergence ($\varphi$ strictly convex on a convex domain).
$$
d_\varphi(z,z')
=
\varphi(z) - \varphi(z') - (z-z')^\top \nabla\varphi(z')
$$
For Bregman divergences, the cluster representative minimizing $\sum_i d_\varphi(z_i, c)$ over $c$ is the mean.
Classic result from Banerjee et al. (2005).
This is the core theoretical justification.
Connection to mixture models
Exponential family density:
$$
p(z\mid\theta)
=
\exp\{-d_\varphi(z,\mu(\theta)) - g(z)\}
$$
One component per class
Equal mixture weights
Inference = soft assignment to clusters.
ProtoNet ≈ discriminative mixture model.
Squared Euclidean distance
$$
d(z,c) = \|z-c\|^2
$$
Corresponds to spherical Gaussian clusters
Prototype = empirical mean
This choice works best empirically.
Cosine distance is not a Bregman divergence, so the mean is no longer the optimal representative.
Equivalence to a linear classifier
$$
-\|f(x)-c_k\|^2
=
2c_k^\top f(x) - \|c_k\|^2 + \text{const}
$$
$$
= w_k^\top f(x) + b_k
$$
ProtoNet = linear classifier on learned embeddings
All non-linearity is in $f_\phi$.
Classifier head is extremely simple.
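The algebraic identity above can be checked directly: the negative squared distances and the linear scores differ only by the class-independent constant $-\|f(x)\|^2$, so they induce the same softmax and the same argmax. A toy verification (random prototypes and embedding, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
c = rng.normal(size=(3, 5))   # 3 class prototypes in R^5 (toy values)
fx = rng.normal(size=5)       # embedding f(x)

neg_sq = -np.sum((fx - c) ** 2, axis=1)   # -||f(x) - c_k||^2

W = 2.0 * c                        # w_k = 2 c_k
bias = -np.sum(c ** 2, axis=1)     # b_k = -||c_k||^2
linear = W @ fx + bias

diff = neg_sq - linear   # equals -||f(x)||^2 for every class: a shared constant
```

Since softmax is invariant to adding a constant to all logits, ProtoNet's distance-based head and the linear head are exactly the same classifier.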
Few-shot terminology and datasets
N-way: number of classes in a classification episode
K-shot: number of labeled examples per class
An episode = N-way K-shot classification task
Omniglot: handwritten characters, many classes, low intra-class variability
miniImageNet: natural images, fewer classes, high variability
Important: “way” refers to the task, not the dataset size.
Few-shot learning is about learning to solve new tasks.
Experimental validation (ProtoNet)
Benchmarks: Omniglot and miniImageNet
Few-shot classification tasks:
$N$-way, $K$-shot (e.g. 5-way 1-shot, 5-way 5-shot)
Episodic training and testing
Support set → prototypes → softmax over distances
No fine-tuning at test time, only prototype computation.
Stress that train/test tasks have the same structure.
This avoids the usual train–test mismatch in few-shot learning.
The simplicity of inference is part of the contribution.
Results
This slide closes the theory → practice loop.
Distance choice is not cosmetic: it encodes assumptions.
Nice example where theory predicts empirical behavior.
Key takeaways
Classification via distances to prototypes
Prototype = mean (theoretically optimal)
Distance choice encodes distributional assumptions
Model is simple, stable, and interpretable
Simplicity is the inductive bias.
Why ProtoNets matter for interpretability
Each decision is explained by distances to class prototypes
Prototypes live in the same space as data embeddings
No opaque classifier head
Interpretability is intrinsic, not post-hoc.
This contrasts with gradient or perturbation-based explanations.
ProtoPNet will push this idea further with patch-level prototypes.
III — ProtoPNet: patch prototypes $\texttt{[Che. 19]}$
Patch embedding map
Backbone (ResNet, VGG ...) outputs feature map: $F_\theta(x)\in\mathbb{R}^{H\times W\times d}$
$d$: number of channels
Patch vectors: $z_{ij}(x)\in\mathbb{R}^d$
A prototype corresponds to a patch pattern (beak, wheel, texture...), not the whole image.
Show a figure here: feature map grid + one prototype matching to one cell.
Explain: interpretability comes from localizing which patch activated which prototype.
Matching: max over patches
$$
s_k(x)=\max_{i,j}\; -\|z_{ij}(x)-p_k\|_2^2
$$
Non-smooth due to max; use subgradients or a softmax approximation.
Max selects the best-matching patch: yields localization for free (argmax location).
Non-smoothness is fine; training uses subgradient (choose argmax).
Softmax (log-sum-exp) smoothing
$$
s_k^\tau(x)= \tau \log\sum_{i,j}\exp\Big(-\tfrac{1}{\tau}\|z_{ij}(x)-p_k\|^2\Big)
$$
As $\tau\to 0$, $s_k^\tau(x)\to \max_{i,j} -\|z_{ij}-p_k\|^2$.
Classic log-sum-exp approximation: stable gradients, avoids argmax jumps.
You can mention: temperature controls “softness” of localization.
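The convergence of the temperature-scaled log-sum-exp to the hard max is easy to demonstrate numerically; a minimal sketch with toy squared distances (function names `s_max`, `s_tau` are illustrative):

```python
import numpy as np

def s_max(d2):
    """Hard score: the best (closest) patch."""
    return float(np.max(-d2))

def s_tau(d2, tau):
    """Temperature-scaled log-sum-exp of -d2, stabilized around the max term."""
    a = -d2 / tau
    m = a.max()
    return float(tau * (m + np.log(np.sum(np.exp(a - m)))))

d2 = np.array([4.0, 1.0, 9.0, 1.5])   # toy squared patch-prototype distances
# The smooth score upper-bounds the max and tightens as tau -> 0.
gaps = [s_tau(d2, t) - s_max(d2) for t in (1.0, 0.1, 0.01)]
```

The stabilization (subtracting the max exponent before exponentiating) is the standard trick for computing log-sum-exp without overflow at small temperatures.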
Objective: accuracy + interpretability
$$
\mathcal{L}=\mathcal{L}_{cls}+\lambda_1\mathcal{L}_{p2d}+\lambda_2\mathcal{L}_{d2p}
$$
$\mathcal{L}_{cls}$: cross-entropy on logits
$\mathcal{L}_{p2d}$: each prototype close to at least one real patch (same class)
$\mathcal{L}_{d2p}$: each sample covered by at least one prototype (same class)
p2d prevents “phantom” prototypes; d2p avoids unused prototypes.
Min over dataset/patches makes it non-smooth but manageable (batch approximations).
Prototype-to-data term (patch version)
$$
\mathcal{L}_{p2d}=\sum_{k=1}^K \min_{n:\,y_n=c(k)} \;\min_{i,j}\; \|z_{ij}(x_n)-p_k\|_2^2
$$
Makes each prototype represent a real pattern from the training data.
In practice: approximate mins via nearest patch found during training.
Keep the “same class” constraint to avoid prototypes drifting to other classes.
Data-to-prototype term
$$
\mathcal{L}_{d2p}=\mathbb{E}\Big[\min_{k:c(k)=y}\min_{i,j}\|z_{ij}(x)-p_k\|_2^2\Big]
$$
This avoids prototypes that never explain any sample (coverage).
Interpretation: each sample should be “near” some prototype of its class.
Prototype projection (key XAI step)
$$
(x^\star,i^\star,j^\star)=\arg\min_{n:\,y_n=c(k)}\;\min_{i,j}\;\|z_{ij}(x_n)-p_k\|^2
$$
$$
p_k \leftarrow z_{i^\star j^\star}(x^\star)
$$
Prototypes become exact training patches, hence directly visualizable.
After projection, you can display the corresponding image patch as “this is the prototype”.
Easy proposition: this cannot increase the p2d nearest distance (argmin).
IV — Prototypical and Counterfactual?
Prototype-driven counterfactual (idea)
Target class $y'$ has prototypes $\{p_k:c(k)=y'\}$.
Seek $c$ close to $x$ but near a target prototype:
$$
\min_c \|c-x\|
\quad\text{s.t.}\quad
\exists\, k:\, c(k)=y',\;\; \|f_\phi(c)-p_k\|\le \epsilon
$$
Interpretation: recourse as moving toward a “typical” pattern of the target class.
This is a nice bridge to your counterfactual lecture.
Proto-based Interpretability for Audio: APNet [Zin. 21]
We now switch from “prototype nets for images” to an audio-native prototype architecture.
Goal: explanations that are not post-hoc saliency maps, but part of the model’s reasoning.
Paper: Zinemanas et al., Electronics 2021 (APNet).
What APNet tries to fix (audio XAI)
Deep audio classifiers are accurate but opaque.
Post-hoc saliency (gradients / LRP / etc.) can be unstable or hard to interpret in the time–frequency domain.
APNet: the explanation is case-based:
the prediction is based on similarity to a small set of learned prototypes.
Interpretability is intrinsic: the decision is explicitly computed from prototype similarities.
Key angle: “explain like a human”: show representative prototype(s) and how close you are to them.
In audio, prototypes can be visualized (mel-spectrogram) and even listened to (via inversion).
Architecture at a glance
Input: log-mel spectrogram $X_i \in \mathbb{R}^{T\times F}$
Encoder: $Z_i = f(X_i)$ with $Z_i \in \mathbb{R}^{T'\times F'\times C}$
Prototype layer: store $M$ prototypes $P_j \in \mathbb{R}^{T'\times F'\times C}$
Similarity $\rightarrow$ weighted sum over frequency $\rightarrow$ linear classifier
The model is basically: autoencoder for latent space + prototype classifier on top.
Important: prototypes live in latent space, but can be decoded back to mel and audio.
Autoencoder objective (makes prototypes audible)
Decoder reconstructs $\hat X_i = g(Z_i)$
Reconstruction loss:
$$\ell_r=\frac{1}{N}\sum_{i=1}^N \|X_i-\hat X_i\|_2^2$$
This is crucial: if reconstruction is garbage, decoded prototypes will be meaningless.
They use pooling indices/masks (SegNet-like trick) to make decoding workable after max-pool.
Prototype loss = “prototypes are real” + “data are covered”
Distance from each sample to each prototype (latent space):
$$D_{ij}=\|Z_i-P_j\|_2^2$$
$$\ell_p=\frac{1}{N}\sum_{i=1}^N\min_j D_{ij} \;+\; \frac{1}{M}\sum_{j=1}^M\min_i D_{ij}$$
First term = each training example is close to at least one prototype (“coverage”).
Second term = each prototype is close to at least one training example (“no fake prototype drifting off-manifold”).
So: prototypes become representative and usable for explanation.
Audio-specific twist: frequency-dependent similarity
They compute similarity per frequency bin $f$ (latent TF structure):
$$S_{ij}[f]=\exp\!\left(-\sum_{t=1}^{T'}\sum_{c=1}^{C}\big(Z_i[t,f,c]-P_j[t,f,c]\big)^2\right)$$
Then aggregate frequency with a learnable kernel $H_j[f]$:
$$\hat S_{ij}=\sum_{f=1}^{F'} H_j[f]\,S_{ij}[f]$$
Unlike images, frequency has semantics: low vs high band matters.
$H_j[f]$ learns which frequency regions matter for prototype j.
Interpretability bonus: you can “see” what band the prototype cares about.
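The two formulas above are a couple of tensor operations. A minimal NumPy sketch with toy latent shapes (all sizes and values illustrative, chosen just to match the $(T', F', C)$ layout in the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
Tp, Fp, C, M = 4, 6, 2, 3              # latent time, freq bins, channels, #prototypes
Z = rng.normal(size=(Tp, Fp, C))       # encoded sample Z_i (toy values)
P = rng.normal(size=(M, Tp, Fp, C))    # learned prototypes (toy values)
H = np.abs(rng.normal(size=(M, Fp)))   # learnable frequency weights H_j[f]

# S_ij[f]: per-frequency similarity, summing squared errors over time and channels.
sq = np.sum((Z[None] - P) ** 2, axis=(1, 3))   # shape (M, Fp)
S = np.exp(-sq)
# Frequency-weighted aggregation into one similarity per prototype.
S_hat = np.sum(H * S, axis=1)                  # shape (M,)
```

The per-frequency factorization is what lets APNet report which bands drive the match; collapsing the frequency axis with a plain mean would throw that information away.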
Classifier + training objective
Linear layer on similarities (no bias for interpretability):
$$\hat Y = \mathrm{softmax}(\hat S\,W)$$
$$\ell_c=-\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^K Y_{ik}\log \hat Y_{ik}$$
$$\ell=\alpha \ell_c + \beta \ell_p + \gamma \ell_r$$
$\ell_c$ = accuracy, $\ell_p$ = prototype meaningfulness, $\ell_r$ = decodability.
Interpretability is enforced by construction + objectives, not after training.
Datasets (3 tasks)
UrbanSound8K: 10 urban classes (≤4 s clips).
Medley-solos-DB: 9 instruments, 3 s clips.
Google Speech Commands V2: 35 keywords, 1 s clips.
Nice didactic point: speech/music/environmental = very different audio “natures”.
So if interpretability works across all three, it’s more convincing.
Results
They compare to SB-CNN, Att-CRNN, and OpenL3 features + MLP.
APNet is competitive on all tasks.
Key message: interpretability does not necessarily kill accuracy.
Also mention model size: APNet is not huge compared to OpenL3 pipeline.
Inspection: what makes it interpretable? (1/2)
Decode prototypes $\to$ mel-spectrograms (and audio) to inspect “what a class looks like”.
Fig.8 is a great story: mean aggregation fails, learned weighted sum finds the correct “background” siren.
This is a rare paper where you can literally show “the model’s reasoning chain”.
Inspection: what makes it interpretable? (2/2)
Inspect $W$ (prototype-to-class connections): prototypes mostly connect to their own class.
Refinement (because you can debug it)
Prototype redundancy pruning using prototype-prototype distances.
Channel redundancy pruning by analyzing prototype distances per channel.
After pruning + short retrain: fewer params and sometimes better accuracy.
Interpretability is also a tool for model improvement (not just “pretty explanations”).
“if you can see it, you can fix it”.
Interested? Then pick this paper's extension for the project :-)
Take-home (APNet)
Prototype networks give intrinsic explanations: “this sounds like that”.
Audio needs domain-aware similarity (frequency matters).
Decoding prototypes to audio makes explanations tangible (listen + view).
Interpretable structure enables pruning + debugging beyond accuracy.
Bridge to your next lecture: ProtoPNet patch-based for images; audio analogue is TF-local prototypes.
Also connect to counterfactuals: prototypes give “nearest explanation”, counterfactuals give “minimal change”.