EUSIPCO 2019, Coruña

Cauchy Multichannel Speech Enhancement with a Deep Speech Prior

matfontaine.github.io

Mathieu FONTAINE
fontaine.mathieu2@gmail.com

September 3rd, 2019

Authors:
Mathieu FONTAINE, Aditya Arie NUGRAHA, Roland BADEAU, Kazuyoshi YOSHII, Antoine LIUTKUS

I: Introduction (state-of-the-art)

Speech Enhancement?

Partially or totally remove the noise from a speech signal

In the short-time Fourier transform (STFT) domain, this means removing the noise component $\bold{x}_{ft}^{n}\in \mathbb{C}^K$, where $K$ is the number of channels, $F$ the number of frequency bins and $T$ the number of time frames.

Paradigm for Probabilistic Denoising Algorithms

E. Vincent et al. (2011, Machine Audition). Probabilistic modeling paradigms for audio source separation.

Example: Gaussian & Wiener Filter

Assuming the following model for speech $\bold{x}^s$ and noise $\bold{x}^n$

and given the observation $\bold{x}$, we can estimate $\bold{x}^{s}$ as: $$ \mathbb{E}\left[\bold{x}_{ft}^s \mid \bold{x}_{ft},\left\{\color{blue}{a_{ft}^j}, \color{red}{\bold{R}_f^{j}}\right\}_{j \in \{s,n\}}\right] = \color{blue}{a_{ft}^s}\color{red}{\bold{R}_{f}^{s}}\left(\sum_{j \in \{s,n\}}\color{blue}{a_{ft}^j}\color{red}{\bold{R}_{f}^{j}}\right)^{-1}\bold{x}_{ft}$$
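The filter above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `wiener_filter` and the array layouts (`(F, T, K)` for the STFT, `(F, T)` for $a_{ft}^j$, `(F, K, K)` for $\bold{R}_f^j$) are assumptions made for the example.

```python
import numpy as np

def wiener_filter(x, a_s, a_n, R_s, R_n):
    """Multichannel Wiener estimate of the speech component.

    x:        (F, T, K) complex mixture STFT (assumed layout)
    a_s, a_n: (F, T) nonnegative speech / noise parameters a^j_ft
    R_s, R_n: (F, K, K) spatial covariance matrices R^j_f
    """
    # Per-bin covariances a^j_ft * R^j_f, shape (F, T, K, K).
    C_s = a_s[..., None, None] * R_s[:, None]
    C_n = a_n[..., None, None] * R_n[:, None]
    # Solve (C_s + C_n) y = x instead of forming the inverse explicitly.
    y = np.linalg.solve(C_s + C_n, x[..., None])
    # Apply the speech covariance: a^s R^s (sum_j a^j R^j)^{-1} x.
    return (C_s @ y)[..., 0]
```

With identical speech and noise parameters the filter simply halves the mixture, which is a convenient sanity check.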

Parameter estimation $\rightarrow$ log-likelihood, variational autoencoder (VAE), etc.

What about heavy-tailed distributions?

N. Q. K. Duong et al. (2009, TASLP). Under-determined reverberant audio source separation using a full-rank spatial covariance model.

II: Cauchy Model & Projection-Based Wiener Filter

Cauchy Distribution

A random vector $\mathbf{y} \sim \mathcal{C}_{c}^{K}\!\left(\mathbf{y} \mid \bold{\mu}, \mathbf{V} \right)$ follows a circularly-symmetric multivariate complex Cauchy distribution of dimension $K$ iff its probability density $p_{\bold{\mu},\mathbf{V}}$ is $$ p_{\bold{\mu},\mathbf{V}}\!\left(\mathbf{y}\right) = A_{K,\mathbf{V}}\left(1+\left(\mathbf{y} - \bold{\mu}\right)^{H}\mathbf{V}^{-1}\left(\mathbf{y} - \bold{\mu}\right)\right)^{-K-\frac{1}{2}}, $$ where $$ A_{K,\mathbf{V}} = \frac{\prod_{k=1}^{K}\left(K-k+\frac{1}{2}\right)}{\pi^{K}\det\left(\mathbf{V}\right)}. $$
Real Cauchy distribution with $K=1$:

G. Samorodnitsky (1995). Stable non-Gaussian random processes.
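As a quick numerical companion to the density on this slide, the sketch below evaluates its logarithm for a single vector. The function name `complex_cauchy_logpdf` is hypothetical, and $\mathbf{V}$ is assumed Hermitian positive definite.

```python
import numpy as np

def complex_cauchy_logpdf(y, mu, V):
    """Log-density of the K-dimensional circularly-symmetric complex Cauchy law.

    y, mu: (K,) complex vectors; V: (K, K) Hermitian positive-definite matrix.
    """
    K = y.shape[0]
    d = y - mu
    # Quadratic form (y - mu)^H V^{-1} (y - mu); real-valued for Hermitian V.
    q = np.real(np.vdot(d, np.linalg.solve(V, d)))
    # log A_{K,V} = sum_k log(K - k + 1/2) - K log(pi) - log det(V)
    k = np.arange(1, K + 1)
    _, logdet = np.linalg.slogdet(V)
    log_A = np.sum(np.log(K - k + 0.5)) - K * np.log(np.pi) - logdet
    return log_A - (K + 0.5) * np.log1p(q)
```

For $K=1$, $\bold{\mu}=0$ and $\mathbf{V}=1$, the value at the origin is $\ln\frac{1}{2\pi}$, which follows directly from $A_{1,1} = \frac{1}{2}\pi^{-1}$.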

Source Model

Which filtering method is suitable?

C. Févotte et al. (2011, Neural Computation). Algorithms for nonnegative matrix factorization with the β-divergence.

Projection-Based Wiener Filter

Projection of observation vectors $\bold{x}_{ft} \in \mathbb{C}^K$ to $\mathbb{C}$: $$ x_{mft} = \mathbf{u}_{m}^{H}\mathbf{x}_{ft} \ \forall m,f,t $$
where $\bold{u}_m \in \mathbb{C}^{K}$ and $x_{mft}\in \mathbb{C}$ is the $m^{\text{th}}$ projection of $\bold{x}_{ft}$. We then have: $$ \hat{x}^{s}_{mft}\triangleq\mathbb{E}\left[\mathbf{u}_m^{H}\mathbf{x}^{s}_{ft} \mid x_{mft},\bold{\Psi}\right] = \sqrt{\frac{v^{s}_{mft}}{v_{mft}}}x_{mft},$$
where $$ \begin{cases} v^{s}_{mft} = a^{s}_{ft} \mathbf{u}_{m}^{H}\mathbf{R}^{s}_{f}\mathbf{u}_{m},\\ v^{n}_{mft} = a^{n}_{ft} \mathbf{u}_{m}^{H}\mathbf{R}^{n}_{f}\mathbf{u}_{m},\\ v_{mft} = \left(\sqrt{v^{s}_{mft}} + \sqrt{v^{n}_{mft}}\right)^2 , \end{cases}~~~~~\text{and} ~~~~~\bold{\Psi} \triangleq \left\{a^{s}_{ft}, a^{n}_{ft}, \mathbf{R}^{s}_{f}, \mathbf{R}^{n}_{f}\right\}. $$
An estimator $\hat{\bold{x}}_{ft}^{s}$ of $\bold{x}_{ft}^{s}$ is: $$ \hat{\mathbf{x}}^{s}_{ft} = \mathbf{U}^{\dagger}\left[\hat{x}^{s}_{1ft}, \cdots, \hat{x}^{s}_{Mft}\right]^{T} $$
with $\mathbf{U}\triangleq\left[\mathbf{u}_1,\cdots, \mathbf{u}_M\right]^{H}\in \mathbb{C}^{M\times K}$ and $\cdot^{\dagger}$ the pseudo-inverse operator.
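The three steps on this slide (project, filter each projection, back-project with the pseudo-inverse) can be sketched as below. The function name `projection_wiener`, the array layouts and the small `eps` guard against division by zero are assumptions for the example, not part of the original method description.

```python
import numpy as np

def projection_wiener(x, U, a_s, a_n, R_s, R_n, eps=1e-12):
    """Projection-based Wiener filter (illustrative sketch).

    x:        (F, T, K) complex mixture STFT
    U:        (M, K) complex matrix whose rows are u_m^H
    a_s, a_n: (F, T) nonnegative parameters a^j_ft
    R_s, R_n: (F, K, K) spatial scatter matrices R^j_f
    """
    # x_mft = u_m^H x_ft for all projectors, shape (F, T, M).
    x_proj = x @ U.T

    def projected_scale(a, R):
        # u_m^H R_f u_m for every (f, m); clipped at 0 for numerical safety.
        g = np.einsum("mk,fkl,ml->fm", U, R, U.conj()).real
        return a[:, :, None] * np.maximum(g, 0.0)[:, None, :]

    root_s = np.sqrt(projected_scale(a_s, R_s))   # sqrt(v^s_mft)
    root_n = np.sqrt(projected_scale(a_n, R_n))   # sqrt(v^n_mft)
    # sqrt(v^s / v) with v = (sqrt(v^s) + sqrt(v^n))^2.
    gain = root_s / (root_s + root_n + eps)
    s_proj = gain * x_proj
    # Back-projection: x_hat_ft = U^dagger [s_1ft, ..., s_Mft]^T.
    return s_proj @ np.linalg.pinv(U).T
```

When $\mathbf{U}$ has full column rank and the noise parameters are zero, the output reproduces the input, since $\mathbf{U}^{\dagger}\mathbf{U} = \mathbf{I}$.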

A. Liutkus et al. (2016, ICASSP). PROJET - Spatial Audio Separation Using Projections.

III: Parameter Estimation

VAE for speech magnitude estimation (training phase)

where the model parameters $\theta, \phi$ are optimized by minimizing an upper bound on the negative log-likelihood: $$ -\ln p_\theta\left(\bold{a}_{t}^s\right) = - \ln \! \int_{\mathbf{z}_{t}} \frac{q_{\phi}(\mathbf{z}_{t} | \mathbf{a}^{s}_{t})}{q_{\phi}(\mathbf{z}_{t} | \mathbf{a}^{s}_{t})} p_{\theta} (\mathbf{a}^{s}_{t}, \mathbf{z}_{t}) \mathrm{d} \mathbf{z}_{t} \leq \mathcal{L}^{\text{mag}} + \mathcal{L}^{\text{reg}} $$
$$ \footnotesize{\mathcal{L}^{\text{mag}} \stackrel{c}{=} \frac{1}{T} \! \sum^{F,T}_{f,t=1} \!\! \left( \ln\!\left[\gamma_{\theta}(\mathbf{z}_t)\right]_f + \ln\!\left(\! 1 + \frac{\big(a^{s}_{ft} - \left[\mu_{\theta}(\mathbf{z}_t)\right]_f\big)^2}{ \left[\gamma_{\theta}(\mathbf{z}_t)\right]_f^2}\right)\! \right);~~ \mathcal{L}^{\text{reg}} =\frac{1}{2T} \! \sum_{d,t=1}^{D,T}\!\! \Bigg([\bold{\mu}_{\phi}^{q}(\mathbf{a}^{s}_{t})]_{d}^{2} +[\bold{\sigma}_{\phi}^{q} (\mathbf{a}^{s}_{t})]_{d}^{2} -\ln [\bold{\sigma}_{\phi}^{q}(\mathbf{a}^{s}_{t})]_{d}^{2} - 1\Bigg).} $$
with $p_{\theta} ( \mathbf{z}_{t} ) = \mathcal{N} ( \mathbf{z}_{t} \mid \mathbf{0}, \mathbf{I} )$, and we take $\hat{\bold{a}}_t^s = \bold{\mu}_t$ from the decoder output.
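The two loss terms can be evaluated as in the sketch below, given the network outputs as arrays. The function name `cauchy_vae_loss` and the argument names are hypothetical, and the sketch assumes the scale in the denominator of $\mathcal{L}^{\text{mag}}$ is the decoder scale output $\left[\gamma_{\theta}(\mathbf{z}_t)\right]_f$. The encoder and decoder networks themselves are not shown.

```python
import numpy as np

def cauchy_vae_loss(a, mu_dec, gamma_dec, mu_enc, sigma_enc):
    """Cauchy reconstruction term and Gaussian KL regularizer.

    a:                 (F, T) clean speech magnitudes a^s_ft
    mu_dec, gamma_dec: (F, T) decoder location and positive scale
    mu_enc, sigma_enc: (D, T) encoder mean and positive standard deviation
    """
    T = a.shape[1]
    # L^mag: negative Cauchy log-likelihood, up to an additive constant.
    l_mag = np.sum(np.log(gamma_dec)
                   + np.log1p(((a - mu_dec) / gamma_dec) ** 2)) / T
    # L^reg: KL divergence between q_phi(z_t | a_t) and N(0, I).
    var = sigma_enc ** 2
    l_reg = 0.5 * np.sum(mu_enc ** 2 + var - np.log(var) - 1.0) / T
    return l_mag, l_reg
```

Both terms vanish when the decoder reproduces the input with unit scale and the encoder outputs a standard normal posterior.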

Test phase

Deep Speech Prior & magnitude spectrogram of speech

Sample $\bold{z}_t$ from $q_{\phi}\left(\bold{z}_t \mid \left|\bold{x}_t\right|\right)$, with $\left|\bold{x}_t\right|$ averaged over channels $\Rightarrow \mu_\theta\left(\bold{z}_t\right)$ and $\bold{a}_t^s$
Update $\bold{z}_t$ by gradient descent (using backpropagation) to minimize: $$ D\left(v\right) \stackrel{c}{=} \sum_{m,f,t=1}^{M,F,T}\frac{3}{2}\ln\left(v_{mft} + \left|x_{mft}\right|^2\right) - \frac{1}{2}\ln\left(v_{mft}\right) $$
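To make the cost $D(v)$ concrete, the sketch below evaluates it together with its derivative with respect to $v_{mft}$, then runs plain gradient descent on $\ln v$. This is a toy stand-in: it optimizes $v$ directly rather than $\bold{z}_t$ through the decoder, and the function names, step size and iteration count are assumptions chosen for the example.

```python
import numpy as np

def cauchy_cost(v, x_proj):
    """D(v) = sum 3/2 ln(v + |x|^2) - 1/2 ln(v)."""
    power = np.abs(x_proj) ** 2
    return float(np.sum(1.5 * np.log(v + power) - 0.5 * np.log(v)))

def cauchy_cost_grad(v, x_proj):
    """Element-wise derivative of D with respect to v."""
    power = np.abs(x_proj) ** 2
    return 1.5 / (v + power) - 0.5 / v

def fit_scale(x_proj, n_iter=200, step=2.0):
    """Gradient descent on theta = ln(v), which keeps v positive."""
    theta = np.zeros(np.shape(x_proj))
    for _ in range(n_iter):
        v = np.exp(theta)
        # Chain rule: dD/dtheta = v * dD/dv.
        theta -= step * v * cauchy_cost_grad(v, x_proj)
    return np.exp(theta)
```

Setting the derivative to zero gives $v_{mft} = |x_{mft}|^2 / 2$ for an unconstrained $v$, which is the point `fit_scale` converges to. In the actual method $v$ is tied to the model parameters, so this optimum is generally not reachable for every bin at once.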

Magnitude spectrogram of noise & spatial scatter matrices

Use the Majorization-Equalization (ME) strategy to get:

$$ w_{fl} \leftarrow \frac{1}{3}w_{fl}\frac{\sum_{mt}h_{lt}\psi^{n}_{mft}}{\sum_{mt}h_{lt}\psi^{n}_{mft}\xi_{mft}}, ~~ h_{lt} \leftarrow \frac{1}{3}h_{lt}\frac{\sum_{mf}w_{fl}\psi^{n}_{mft}}{\sum_{mf}w_{fl}\psi^{n}_{mft}\xi_{mft}}, ~~ r^j_{m'f}\leftarrow\frac{1}{3} r^j_{m'f}\frac{\sum_{mt}a^j_{ft}\eta^j_{mm'ft}}{\sum_{mt}a^j_{ft}\eta^j_{mm'ft}\xi_{mft}} $$
with, $\forall j \in \{s,n\}$: $$\psi^{n}_{mft} \triangleq \frac{\mathbf{u}_{m}^{H}\hat{\mathbf{R}}^{n}_{f}\mathbf{u}_{m}}{\sqrt{v^{n}_{mft}v_{mft}}}, ~~ \xi_{mft} \triangleq \left(1 + \frac{|x_{mft}|^{2}}{v_{mft}}\right)^{-1}, ~~ \eta^j_{mm'ft} \triangleq \frac{|\mathbf{u}_{m'}^{H}\mathbf{u}_{m}|^{2}}{\sqrt{v^j_{mft}v_{mft}}} ~~\text{and}~\hat{\mathbf{R}}^j_{f} = \sum_{m^{\prime}} r^j_{m^{\prime}f}\mathbf{u}_{m^{\prime}}\mathbf{u}_{m^{\prime}}^{H} $$

IV: Evaluation

Experimental setup

Algorithms

Proposed Method: Cauchy VAE-MNMF
Similar model but with Gaussian distributions: Gaussian VAE-MNMF $\text{[Leg. 2019]}$
Cauchy model with a speech NMF trained on clean speech: Cauchy MNMF

Corpus

CHiME-4 corpus sampled at 16 kHz
7138 single-channel clean speech signals for the DNN and MNMF training
1640 single-channel clean speech signals as the validation set for the DNN training
Evaluation done on 132 ($\simeq$ 10%) noisy utterances

Settings

Dimension of the latent variable $\bold{z}_t$: $D=32$
Number of bases of the noise model: $L=32$
The projection matrix $\bold{U}$ is taken unitary ($\bold{U}^{\dagger}=\bold{U}^{H}$) and $M=8$ projectors are sampled
$64$ optimization iterations for Cauchy MNMF and $50$ for both VAE-MNMF methods

S. Leglaive et al. (2019, ICASSP). Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization.

Results

Mean in white and standard deviation in black

Audio demo

Noisy

Clean Speech

Gaussian VAE-MNMF

Cauchy VAE-MNMF

Cauchy MNMF

Conclusion

Discussion

New combination of a VAE with a heavy-tailed distribution
Globally outperforms the Gaussian model

Future work

Report STOI, PESQ and SDR scores per type of noise
Replace the backpropagation-based update with Metropolis-Hastings sampling
Extend the Cauchy VAE-MNMF to an elliptically contoured multivariate stable one

Thank you! Questions?

M. Fontaine et al. (2019, EUSIPCO). Cauchy Multichannel Speech Enhancement with a Deep Speech Prior.