Roland BADEAU & Mathieu FONTAINE
{mathieu.fontaine, roland.badeau}@telecom-paris.fr
June 10th and 17th, 2022
Outline
I - Introduction
II - Mathematical reminders
III - Linear instantaneous mixtures
IV - Independent component analysis
V - Second order methods
VI - Time-frequency methods
VII- Convolutive mixtures
VIII- Under-determined mixtures
IX - Conclusion
I - Introduction
Source separation
Art of estimating "source" signals, assumed independent, from the observation
of one or several "mixtures" of these sources
Application examples
Denoising (cocktail party, suppression of vuvuzelas, karaoke)
Separation of the instruments in polyphonic music
Typology of the mixture models (1/2)
Definition of the problem
Observations: $M$ mixtures $x_m(t)$ concatenated in a vector $\bold{x}(t)$
Unknowns: $K$ sources $s_k(t)$ concatenated in a vector $\bold{s}(t)$
General mixture model: function $\mathcal{A}$ which transforms $\bold{s}(t)$ into $\bold{x}(t)$
Property typology
Stationarity: $\mathcal{A}$ is translation invariant
Linearity: $\mathcal{A}$ is a linear map
Memory
Convolutive mixtures
Instantaneous mixtures: $\bold{x}(t)=\bold{A}\bold{s}(t)$
$\quad\rightarrow \mathcal{A}$ is defined by the "mixing matrix" $\bold{A}$ (of dimension $M\times K$)
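A minimal numerical sketch of the instantaneous model $\bold{x}(t)=\bold{A}\bold{s}(t)$, with two synthetic sources and a random mixing matrix (all concrete values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                      # number of samples
K, M = 2, 2                   # K sources, M mixtures

# Two independent synthetic sources s(t), stacked as a K x T array
s = np.vstack([np.sign(rng.standard_normal(T)),   # binary source
               rng.uniform(-1, 1, T)])            # uniform source

A = rng.standard_normal((M, K))  # mixing matrix A (M x K)
x = A @ s                        # instantaneous mixture x(t) = A s(t)

print(x.shape)  # (2, 1000): M mixtures of T samples each
```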
Typology of the mixture models (2/2)
Invertibility
Determined mixtures: $M=K$
Overdetermined mixtures: $M>K$
Under-determined mixtures: $M < K$
Instantaneous linear mixtures
Anechoic linear mixtures
Convolutive linear mixtures
II - Mathematical reminders
Real random vectors
Notations
$\bold{x}$ is a random vector of dimension $M$.
$\phi[\bold{x}]$ denotes a function of $p(\bold{x})$
Mean: $\mu_{x}=\mathbb{E}[\bold{x}]$
BSS problem: estimate $\bold{A}$ and sources $\bold{s}(t)$ given $\bold{x}(t)$
Blind source separation (BSS) (2/2)
Non-mixing matrix
A matrix $\bold{C}$ of dimension $K\times K$ is non-mixing iff. it has a unique non-zero entry in each row and each column
If $\tilde{\bold{s}}(t) = \bold{C}\bold{s}(t)$ and $\tilde{\bold{A}}=\bold{A}\bold{C}^{-1}$, then $\bold{x}(t)= \tilde{\bold{A}}\tilde{\bold{s}}(t)$
is another admissible decomposition of the observations
$\quad\rightarrow$ Sources can be recovered up to a permutation and a multiplicative factor
We can thus assume $\bold{B} = \bold{U}^\top\bold{W}$ where $\bold{U}$ is a rotation matrix
Higher order statistics
One can estimate $\Sigma_{xx}$ from the observations and get $\bold{W}$
The whiteness property (second order cumulants) determines $\bold{W}$ and leaves $\bold{U}$ unknown
If sources are Gaussian, the $z_k$ are independent and $\bold{U}$ cannot be determined
In order to determine the rotation $\bold{U}$, we need to exploit the non-Gaussianity of the
sources and characterize independence using cumulants of order greater than 2.
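The whitening step described above can be sketched numerically: from the sample covariance $\Sigma_{xx}$ we build $\bold{W}=\Sigma_{xx}^{-1/2}$, after which any rotation of the whitened signals is equally white. This is a minimal illustration (Laplacian sources and the mixing matrix are assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))          # non-Gaussian independent sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])   # mixing matrix
x = A @ s

# Whitening: W = Sigma_xx^{-1/2}, so that z = W x has identity covariance
Sigma_xx = np.cov(x)
eigval, eigvec = np.linalg.eigh(Sigma_xx)
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
z = W @ x

# cov(z) is the identity, and U z is equally white for any rotation U:
# second-order statistics alone cannot determine U for IID sources
print(np.round(np.cov(z), 2))
```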
Segmentation of mixture signals and estimation of covariance
matrices $\Sigma_{xx}(t)$ on windows centered at different times $t$
Joint diagonalization of matrices $\Sigma_{xx}(t)$ in a common basis $\bold{B}$
Estimation of source signals via $\bold{y}(t) = \bold{Bx}(t)$
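For two time windows, the joint diagonalization step above reduces to a generalized eigenvalue problem. A sketch, assuming sources whose variances differ between the two windows (all signal parameters are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
T = 20000
# Independent Gaussian sources with different temporal dynamics:
# each source is loud in a different half of the signal
s = rng.standard_normal((2, T))
s[0, :T // 2] *= 3.0
s[1, T // 2:] *= 3.0
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s

# Covariance matrices Sigma_xx(t) on two time windows
S1 = np.cov(x[:, :T // 2])
S2 = np.cov(x[:, T // 2:])

# Joint diagonalization of two matrices = generalized eigenproblem S1 v = w S2 v
_, V = eigh(S1, S2)
B = V.T                      # separating matrix: diagonalizes both covariances
y = B @ x                    # estimated sources, decorrelated on both windows

print(np.round(np.corrcoef(y), 2))
```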
Conclusion of the first part
The use of higher order cumulants is only necessary for the
non-Gaussian IID source model
Second order statistics are sufficient for sources that are:
$\quad\rightarrow$ stationary but not IID (→ spectral dynamics)
$\quad\rightarrow$ non-stationary (→ temporal dynamics)
Remember that classical tools (based on second order
statistics) are appropriate for blind separation of independent
(and possibly Gaussian) sources, on condition that the
spectral / temporal source dynamics is taken into account
VI - Time-frequency methods
Time-frequency (TF) representations
Motivations
Spectral and temporal dynamics are highlighted by a TF representation of signals
TF domain: adequate to process convolutive and/or under-determined mixtures
Use of a filter bank (STFT and MDCT)
Decomposition into $F$ sub-bands and decimation by a factor $T \leq F$
Analysis filters $h_f$ and synthesis filters $g_f$
TF representation of mixture: $x_m(f,n) = (h_f\ast x_m)(nT)$
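The analysis filter bank is in practice an STFT; a minimal sketch with `scipy.signal.stft` (sampling rate and window length are assumed for the example):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                       # sampling rate (Hz), illustrative
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # one-second 440 Hz tone as a mixture channel

# STFT = analysis filter bank: F sub-bands, hop size T = nperseg - noverlap
f, n, X = stft(x, fs=fs, nperseg=512, noverlap=256)
print(X.shape)   # (F, N): F = 257 frequency bins, N time frames
```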
Instantaneous mixture model: unsuitable for real acoustic mixtures
Mixture of source image model
Let $\bold{x}_k(f,n) \in \mathbb{R}^M$ be the source image of $s_k(f,n)$
$\quad\rightarrow$ received multichannel signal if only source $s_k(f,n)$ was active
Let $s_k(t)$ be $K$ IID sources, among which at most one is Gaussian,
and $\bold{y}(t) = \bold{C} \ast \bold{s}(t)$ with $\bold{C}$ invertible ((over-)determined case).
If signals $y_k(t)$ are independent, then $\bold{C}$ is non-mixing.
Time-Frequency approach
Mixture model and narrow-band approximation
$x_m(t) = \sum_{k=1}^{K}(a_{mk} \ast s_k)(t)$,
the filter bank corresponds to an STFT
the impulse response of $a_{mk}$ is short compared with the window length
$\forall m,k,f, a_{mk}(\nu)$ varies slowly compared with $h_f(\nu)$
Approximation of the convolutive mixture model
$x_m(f,n)=\sum_{k=1}^K a_{mk}(f)s_k(f,n)$ i.e. $\bold{x}(f,n)=\bold{A}(f)\bold{s}(f,n)$
$\quad\rightarrow$ $F$ instantaneous mixture models in every sub-band
$\quad\rightarrow$ we can use any ICA method in every sub-band
Independent component analysis
Let $\bold{y}(f,n) = \bold{B}(f)\bold{x}(f,n)$ where $\bold{B}(f)\in\mathbb{C}^{K\times M}$
Linear separation is feasible if $\bold{A}(f)$ has rank $K$.
$\quad\rightarrow\bold{y}(f,n)=\bold{s}(f,n)$ with
$\bold{B}(f)=
\begin{cases}
\bold{A}(f)^{-1} & \mathrm{if~} M=K \\
\bold{A}(f)^{\dagger} & \mathrm{if~} M>K \\
\emptyset & \mathrm{if~} M< K
\end{cases}
$
In practice $\bold{A}(f)$ is unknown:
$\quad\rightarrow$ We look for $\bold{B}(f)$ that makes $y_k(f,n)$ independent (ICA)
$\quad\rightarrow$ We get $\bold{y}(f,n) = \bold{C}(f)\bold{s}(f,n)$ where $\bold{C}(f) = \bold{B}(f)\bold{A}(f)$
$\quad\rightarrow \bold{C}(f)$ is non-mixing
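The $M=K$ and $M>K$ cases of the separation matrix above are both covered by the pseudo-inverse. A sketch in one sub-band, with an assumed known $\bold{A}(f)$ (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
K, M, N = 2, 3, 100              # K sources, M > K sensors, N time frames

# One sub-band f: known mixing matrix A(f) and source frames s(f, n)
A_f = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
s_f = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
x_f = A_f @ s_f

# For M >= K with rank K, the pseudo-inverse recovers the sources exactly
B_f = np.linalg.pinv(A_f)        # equals A_f^{-1} when M == K
y_f = B_f @ x_f
print(np.allclose(y_f, s_f))     # True
```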
Indeterminacies
Indeterminacies (permutations and multiplicative factors) in matrices $\bold{C}(f)$
$\forall k$, identify the indexes $k_f$ such that $\forall f, y_{k_f}(f,n)=c_{k_f,k}s_k(f,n)$
identify the multiplicative factors $c_{k_f,k}$
Infinitely many solutions $\implies$ need to constrain the model
Assumptions on the mixture and sources
continuity of the frequency responses $a_{mk}(f)$ with respect to $f$
$\quad\rightarrow$ beamforming model or anechoic model
similarity of the temporal dynamics of $\sigma_k^2(f,n)$ (or NMF model)
Convolutive mixture models
Beamforming model
Assumptions: plane waves, far field, no reverberation, linear antenna
Model: $a_{mk}(f)=e^{-2i\pi f\tau_{mk}}$ where $\tau_{mk}=\frac{d_m}{c}\sin(\theta_k)$
Parameters: positions $d_m$ of the sensors and angles $\theta_k$ of the sources
Anechoic model
Assumptions: point sources, no reverberation
Model: $a_{mk}(f)=\alpha_{mk}e^{-2i\pi f\tau_{mk}}$ where $\tau_{mk}=\frac{r_{mk}}{c}$ and $\alpha_{mk} = \frac{1}{\sqrt{4\pi}r_{mk}}$
Parameters: distances $r_{mk}$ between the sensors and sources
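The anechoic model above can be evaluated directly: each entry of $\bold{A}(f)$ is a delayed, attenuated complex exponential. A sketch with assumed sensor-to-source distances (all values illustrative):

```python
import numpy as np

c = 343.0                        # speed of sound (m/s)
f = 1000.0                       # frequency (Hz)

# Distances r_mk (meters) between M = 2 sensors and K = 2 sources
r = np.array([[1.0, 2.0],
              [1.2, 1.7]])

tau = r / c                                   # propagation delays tau_mk
alpha = 1.0 / (np.sqrt(4 * np.pi) * r)        # attenuations alpha_mk
A_f = alpha * np.exp(-2j * np.pi * f * tau)   # mixing matrix A(f), entries a_mk(f)
print(A_f.shape)  # (2, 2)
```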
VIII - Under-determined mixtures
Under-determined convolutive mixtures
Usual case in audio: monophonic $(M = 1)$ or stereophonic
$(M = 2)$ mixtures, with a number of sources $K > M$
Convolutive mixture model and assumption
$\bold{x}(f,n)=\bold{A}(f)\bold{s}(f,n)$ with $M< K$
We assume that $\bold{A}(f)$ and $\Sigma_{ss}(f,n) = \mathrm{diag}(\sigma^2_k(f,n))$ are known
Even in this case, there is no matrix $\bold{B}(f)$ such that $\bold{B}(f)\bold{A}(f)=\bold{I}_K$
Separation via non-stationary filtering
Let $\bold{y}(f,n) = \bold{B}(f,n)\bold{x}(f,n)$ where $\bold{B}(f,n) \in \mathbb{C}^{K\times M}$
Mean-Squared error estimation
We look for $\bold{B}(f,n)$ which minimizes $\mathbb{E}[\mid\mid\bold{y}(f,n)-\bold{s}(f,n)\mid\mid^2_{2} ]$
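The minimizer of this mean-squared error is the classical multichannel Wiener filter, $\bold{B}(f,n) = \Sigma_{ss}(f,n)\bold{A}(f)^{\mathrm{H}}\big(\bold{A}(f)\Sigma_{ss}(f,n)\bold{A}(f)^{\mathrm{H}}\big)^{-1}$. A numerical sketch for one TF bin, with assumed mixing matrix and source variances:

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 2, 3                      # under-determined: fewer mixtures than sources

A = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))   # A(f)
sigma2 = np.array([1.0, 4.0, 0.25])      # source variances sigma_k^2(f, n)
Sigma_ss = np.diag(sigma2).astype(complex)

# MMSE (multichannel Wiener) filter: B = Sigma_ss A^H (A Sigma_ss A^H)^{-1}
Sigma_xx = A @ Sigma_ss @ A.conj().T
B = Sigma_ss @ A.conj().T @ np.linalg.inv(Sigma_xx)
print(B.shape)  # (3, 2): K x M, one output per source
```

The orthogonality principle gives a quick check of the solution: $\bold{B}\Sigma_{xx} = \Sigma_{ss}\bold{A}^{\mathrm{H}}$.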
Estimation of $\theta_{k}$ and of the active source $k_{(f,n)}$
$\quad\rightarrow$ computation of the histogram of the angles of vectors $\bold{x}(f,n)$
$\quad\rightarrow$ peak detection in order to estimate $\theta_k$
$\quad\rightarrow$ determination of the active source by proximity with $\theta_{k}$
Source separation: for all $k$,
$\quad\rightarrow$ estimation of source images via binary masking:
$\quad \bold{y}_k(f,n)=
\begin{cases}
0 & \mathrm{if~} k \neq k_{(f,n)}, \\
\bold{x}(f,n) & \mathrm{otherwise}
\end{cases}
$
MMSE estimation of the sources: $y_k(f,n) = \frac{\bold{a}_k(f)^{\mathrm{H}}}{\mid\mid\bold{a}_k(f)\mid\mid^2_2}\bold{y}_{k}(f,n)$
TF synthesis of the sources:
$y_k(t) = \sum_{f=1}^F \sum_{n\in \mathbb{Z}} g_f(t-nT)y_k(f,n)$
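The binary masking step above can be sketched on a toy TF mixture, assuming the active-source index $k_{(f,n)}$ has already been estimated (all shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
F, N, M, K = 64, 50, 2, 2

# Toy TF mixture and a per-bin active-source index k_(f, n) in {0, 1}
x_tf = rng.standard_normal((M, F, N)) + 1j * rng.standard_normal((M, F, N))
active = rng.integers(0, K, size=(F, N))      # assumed already estimated

# Binary masking: each TF bin is assigned entirely to its active source
y = np.zeros((K, M, F, N), dtype=complex)
for k in range(K):
    mask = (active == k)                      # boolean mask over TF bins
    y[k] = x_tf * mask                        # zero where another source is active

# The source images sum back to the mixture (the masks form a partition)
print(np.allclose(y.sum(axis=0), x_tf))  # True
```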
IX - Conclusion
Conclusion
Summary
Source separation requires to make assumptions about the
mixture and sources
For an (over-)determined instantaneous linear mixture, the
assumption of independent sources is sufficient
In all other cases, we need to model the mixture and/or the
sources
Perspectives
Non-stationary mixtures (adaptive algorithms)
Informed source separation (e.g. from music score)
Deep learning techniques
Objective assessment of audio source separation
Bibliography
Audio source separation and Blind source separation
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot. Audio Source Separation and Speech
Enhancement. Wiley Publishing, 1st edition, 2018.
Jean-François Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE,
86(10):2009–2025, 1998.
Independent Component Analysis
Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287 –
314, April 1994. Special issue on Higher-Order Statistics.
Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent Component
Analysis and Applications. Academic Press, Inc. (Elsevier), USA, 1st edition, 2010.
Informed Source separation
Sebastian Ewert and Meinard Müller. Multimodal Music Processing, volume 3, chapter Score-
Informed Source Separation for Music Signals, pages 73–94. January 2012.
Simon Leglaive, Roland Badeau, and Gaël Richard. Multichannel audio source separation with
probabilistic reverberation priors. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 24(12):2453–2465, 2016.
Multichannel Nonnegative Matrix Factorization for audio source separation
Alexey Ozerov and Cédric Févotte. Multichannel nonnegative matrix factorization in convolutive
mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Process-
ing, 18(3):550–563, 2010.
Multichannel Deep learning audio source separation
Aditya Arie Nugraha, Antoine Liutkus, Emmanuel Vincent. Multichannel audio source separation
with deep neural networks. [Research Report] RR-8740, INRIA. 2015.