Authors:
Mathieu FONTAINE, Aditya Arie NUGRAHA,
Roland BADEAU, Kazuyoshi YOSHII, Antoine LIUTKUS
I: Introduction (state-of-the-art)
Speech Enhancement?
Partially or totally remove the noise from a speech signal
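In the short-time Fourier transform (STFT) domain, this means removing the noise component $\bold{x}_{ft}^{n}\in \mathbb{C}^K$ from the observed mixture $\bold{x}_{ft} = \bold{x}_{ft}^{s} + \bold{x}_{ft}^{n}$, where: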
$K$: number of channels
$F$: number of frequency bins
$T$: number of time frames
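To make the indexing concrete, here is a minimal NumPy sketch of a multichannel STFT that returns an array of shape $(F, T, K)$, so that `X[f, t]` plays the role of $\bold{x}_{ft}\in\mathbb{C}^K$. The function name `multichannel_stft`, the Hann window, and the values `n_fft=1024` and `hop=256` are illustrative assumptions, not settings taken from the talk.

```python
import numpy as np

def multichannel_stft(x, n_fft=1024, hop=256):
    """STFT of a (K, N) real signal; returns a complex array of shape (F, T, K).

    Assumes N >= n_fft; trailing samples that do not fill a frame are dropped.
    """
    K, N = x.shape
    window = np.hanning(n_fft)
    T = 1 + (N - n_fft) // hop      # number of full frames (no padding)
    F = n_fft // 2 + 1              # non-negative frequency bins of a real signal
    X = np.empty((F, T, K), dtype=complex)
    for t in range(T):
        frame = x[:, t * hop : t * hop + n_fft] * window   # (K, n_fft)
        X[:, t, :] = np.fft.rfft(frame, axis=-1).T         # (F, K)
    return X

# Illustrative input: one second of 2-channel white noise at 16 kHz.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16000))
X = multichannel_stft(x)
print(X.shape)   # (F, T, K)
```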
Paradigm for Probabilistic Denoising Algorithms
E. Vincent et al. (2011, Machine Audition). Probabilistic modeling paradigms for audio source separation.
Example: Gaussian & Wiener Filter
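Assuming the following independent Gaussian model for speech $\bold{x}^s$ and noise $\bold{x}^n$:
$$\bold{x}_{ft}^{j} \sim \mathcal{N}_c\left(\bold{0}, \color{blue}{a_{ft}^j}\color{red}{\bold{R}_f^{j}}\right), \quad j \in \{s,n\},$$
and given the observation $\bold{x} = \bold{x}^s + \bold{x}^n$, we can estimate $\bold{x}^{s}$ as: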
$$
\mathbb{E}\left[\bold{x}_{ft}^s \mid \bold{x}_{ft},\left\{\color{blue}{a_{ft}^j}, \color{red}{\bold{R}_f^{j}}\right\}_{j \in \{s,n\}}\right] =
\color{blue}{a_{ft}^s}\color{red}{\bold{R}_{f}^{s}}\left(\sum_{j \in \{s,n\}}\color{blue}{a_{ft}^j}\color{red}{\bold{R}_{f}^{j}}\right)^{-1}\bold{x}_{ft}$$
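The multichannel Wiener filter above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names `wiener_estimate` and `random_psd`, the dictionary layout of the parameters, and the random test data are all assumptions made for the example.

```python
import numpy as np

def wiener_estimate(x, a, R, target="s"):
    """Posterior mean of one source under the Gaussian model.

    x: (F, T, K) complex mixture STFT
    a: dict {"s", "n"} -> (F, T) non-negative time-frequency gains a_ft^j
    R: dict {"s", "n"} -> (F, K, K) positive-definite spatial covariances R_f^j
    """
    sources = ("s", "n")
    # Covariance of each source at every bin: a_ft^j * R_f^j, shape (F, T, K, K)
    cov = {j: a[j][:, :, None, None] * R[j][:, None, :, :] for j in sources}
    cov_mix = sum(cov[j] for j in sources)
    # Solve cov_mix @ w = x instead of forming the inverse explicitly
    w = np.linalg.solve(cov_mix, x[..., None])   # (F, T, K, 1)
    return (cov[target] @ w)[..., 0]             # (F, T, K)

def random_psd(rng, F, K):
    """Hypothetical helper: one random Hermitian positive-definite matrix per frequency."""
    A = rng.standard_normal((F, K, K)) + 1j * rng.standard_normal((F, K, K))
    return A @ A.conj().transpose(0, 2, 1) + 1e-3 * np.eye(K)

rng = np.random.default_rng(0)
F, T, K = 4, 5, 3
x = rng.standard_normal((F, T, K)) + 1j * rng.standard_normal((F, T, K))
a = {j: rng.uniform(0.1, 1.0, size=(F, T)) for j in ("s", "n")}
R = {j: random_psd(rng, F, K) for j in ("s", "n")}

s_hat = wiener_estimate(x, a, R, "s")
n_hat = wiener_estimate(x, a, R, "n")
# The two estimates should add back up to the mixture.
print(np.allclose(s_hat + n_hat, x))
```

Because the speech and noise filters sum to the identity, the two estimates reconstruct the observation exactly, which is a convenient sanity check.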
Parameter estimation $\rightarrow$ log-likelihood, variational autoencoder (VAE), etc.
What about heavy-tailed distributions?
N. Q. K. Duong et al. (2009, TASLP). Under-determined reverberant audio source separation using a full-rank spatial covariance model.
II: Cauchy Model & Projection-Based Wiener Filter
Cauchy Distribution
$\mathbf{y} \sim \mathcal{C}_{c}^{K}\!\left(\mathbf{y} \mid \bold{\mu}, \mathbf{V} \right)$ follows a circularly-symmetric multivariate complex Cauchy distribution of dimension $K$ iff its probability density $p_{\bold{\mu},\mathbf{V}}$ is:
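Under one standard parameterization, this density can be written as follows. The notation is an assumption made here: $\bold{\mu}\in\mathbb{C}^K$ is a location vector, $\mathbf{V}$ is a Hermitian positive-definite scatter matrix, and $^{\mathsf{H}}$ denotes the conjugate transpose.

$$
p_{\bold{\mu},\mathbf{V}}(\mathbf{y}) =
\frac{\prod_{k=1}^{K}\left(k-\tfrac{1}{2}\right)}{\pi^{K}\det \mathbf{V}}
\left(1+(\mathbf{y}-\bold{\mu})^{\mathsf{H}}\,\mathbf{V}^{-1}(\mathbf{y}-\bold{\mu})\right)^{-K-\frac{1}{2}}
$$

For $K=1$, $\bold{\mu}=0$ and $\mathbf{V}=\sigma^2$, this reduces to $\frac{\sigma}{2\pi\left(|y|^2+\sigma^2\right)^{3/2}}$. The polynomial decay, in contrast with the exponential decay of the Gaussian, is what makes the distribution heavy-tailed.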
Similar model but with Gaussian distributions: Gaussian VAE-MNMF [Leglaive et al., 2019]
Cauchy NMF with speech NMF trained on clean speech: Cauchy MNMF
Corpus
CHiME-4 corpus sampled at 16 kHz
7138 single-channel clean speech signals for the DNN and MNMF training
1640 single-channel clean speech signals as the validation set for the DNN training
Evaluation performed on 132 ($\simeq$ 10%) noisy utterances
Settings
Dimension of the latent variable $\bold{z}_t$: $D=32$
Number of bases of the noise model: $L=32$
The projection matrix $\bold{U}$ is taken unitary ($\bold{U}=\bold{U}^{\dagger}$) and $M=8$ projectors are sampled
$64$ optimization iterations for Cauchy MNMF and $50$ for both VAE-MNMF methods
S. Leglaive et al. (2019, ICASSP). Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization.
Results
Mean in white and standard deviation in black
Audio demo
Conclusion
Discussion
New combination of a VAE with a heavy-tailed distribution
Overall, it outperforms the Gaussian model
Future work
Report STOI, PESQ and SDR scores per noise type
Replace backpropagation with Metropolis-Hastings sampling
Extend the Cauchy VAE-MNMF to an elliptically contoured multivariate stable model
Thank you! Questions?
M. Fontaine et al. (2019, EUSIPCO). Cauchy Multichannel Speech Enhancement with a Deep Speech Prior.