TAD-GAN: a robust unsupervised time series anomaly detection method

DSAIDIS (Axis 1)

Mathieu FONTAINE
mathieu.fontaine@telecom-paris.fr

May, 24th 2023

I - Introduction

What is a time series?

  • $\bold{X} = (\bold{x}^1, \dots,$ $\bold{x}^t$$, \dots, \bold{x}^T) \in \mathbb{R}^{M \times T}$
  • $M$: number of sensors / measurements
  • $T$: number of time steps
multivariate time series $\implies$ several synchronous measurements
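As a toy illustration (the array and variable names below are illustrative, not from the slides), a multivariate time series is simply an $M \times T$ array:

```python
# Minimal sketch: a multivariate time series X with M sensors and T time steps.
import numpy as np

M, T = 3, 1000                      # 3 sensors, 1000 time steps
t = np.arange(T)
X = np.stack([np.sin(2 * np.pi * t / 50 + phase) for phase in (0.0, 0.5, 1.0)])
X += 0.1 * np.random.randn(M, T)    # measurement noise
print(X.shape)                      # (M, T) = (3, 1000)
```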

Anomaly in time series: definition

  • Given observed data $\bold{X}$
  • Find sequences ${\color{red}\bold{A}_{seq}} \triangleq \{{\color{red}\bold{a}^{(1)}_{seq, 1}}, \dots, {\color{red}\bold{a}^{(K)}_{seq, K}}\} $ showing unusual behavior
  • ${\text{seq}, i}$ denotes the set of time indexes associated with the $i^{\text{th}}$ anomaly
In an industrial context, which machine learning framework may be suitable?

Unsupervised setting for time anomaly detection (TAD)

Unsupervised approach $\rightarrow$ suitable in various real-world scenarios:

  • No identified "known anomalies" to train the model
  • No availability of "normal baselines"
  • $\quad \rightarrow$ the available signal may itself include anomalies.
  • No clear segmentation
  • $\quad \rightarrow$ periodic signals (e.g. EEG, ECG $\texttt{[Qiu. 19]}$) are OK... but otherwise?

Three main families of methods for unsupervised TAD:
proximity-, prediction- and reconstruction-based methods
  • Qiu, J. et al. (2019). Kpi-tsad: A time-series anomaly detector for kpi monitoring in cloud applications. Symmetry.

Prediction-based anomaly detection (1/2)

Method

  • learn a predictive model that fits the given time series data
  • use the model to predict future values
$|\text{pred}_{value} - \text{true}_{value}| > \text{threshold}$ $\implies$ anomaly
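A minimal sketch of this idea, assuming a simple AR(1) predictor and a mean-plus-four-sigma threshold (both are illustrative choices, not the cited methods):

```python
# Prediction-based TAD sketch: fit a one-step predictor, flag |pred - true| > threshold.
import numpy as np

def ar1_predict(x):
    """One-step-ahead prediction with an AR(1) coefficient fit by least squares."""
    a = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    pred = np.empty_like(x)
    pred[0] = x[0]
    pred[1:] = a * x[:-1]
    return pred

x = np.sin(np.linspace(0, 20 * np.pi, 1000))
x[400] += 3.0                               # inject a point anomaly
err = np.abs(ar1_predict(x) - x)            # |pred - true|
threshold = err.mean() + 4 * err.std()
anomalies = np.where(err > threshold)[0]
print(anomalies)                            # should contain index 400
```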

Prediction-based anomaly detection (2/2)

Examples

  • Statistical models
  • $\quad \rightarrow$ ARIMA $\texttt{[Pen. 13]}$, FDA $\texttt{[Tor. 11]}$
  • Deep neural networks
  • $\quad \rightarrow$ Long Short-Term Memory (LSTM) techniques $\texttt{[Hun. 18]}$
    $\quad \rightarrow$ Transformers-based
    $\quad \quad \triangleright$ Temporal Fusion Transformer $\texttt{[Lim 21]}$, Autoformer $\texttt{[Wu 21]}$
  • Pena, E. H. et al. (2013). Anomaly detection using forecasting methods arima and hwds, SCCC.
  • Torres, J. M. et al. (2011). Detection of outliers in gas emissions from urban areas using functional data analysis, JHM.
  • Hundman, K. et al. (2018). Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding, SIGKDD.
  • Lim, B. et al. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting, IJF.
  • Wu, H. et al. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, ANIPS.

Reconstruction-based anomaly detection (1/2)

Method

  • learn a model to capture a low-dimensional representation $\bold{Z}$ of time series $\bold{X}$
  • create a synthetic reconstruction $\hat{\bold{X}}$ of the data from $\bold{Z}$
$d(\bold{X}_{\text{seq}}, \hat{\bold{X}}_{\text{seq}}) > \text{threshold}$ $\implies $ anomaly on indexes $\text{seq}$
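A hedged sketch of this reconstruction idea, using PCA (one of the models cited on the next slide) as a stand-in for the generic low-dimensional model; the window scheme and threshold are illustrative:

```python
# Reconstruction-based TAD sketch: project windows to a low-dimensional Z with PCA,
# reconstruct X_hat, and flag windows with a large reconstruction distance.
import numpy as np

def pca_reconstruct(W, n_components=2):
    mu = W.mean(axis=0)
    Wc = W - mu
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    Z = Wc @ Vt[:n_components].T            # low-dimensional representation Z
    return Z @ Vt[:n_components] + mu       # synthetic reconstruction X_hat

T, s_w = 2000, 50
x = np.sin(np.linspace(0, 40 * np.pi, T))
x[1000:1010] += 2.0                          # anomalous segment
W = np.stack([x[i:i + s_w] for i in range(0, T - s_w, s_w)])   # non-overlapping windows
err = np.linalg.norm(W - pca_reconstruct(W), axis=1)
print(np.argsort(err)[-1])                   # the window covering t ~ 1000 should rank highest
```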

Reconstruction-based anomaly detection (2/2)

Examples

  • Machine Learning Approach
  • $\quad \rightarrow$ Principal Component Analysis (PCA) $[\texttt{Rin. 07}]$
  • Auto-encoders (AE)
  • $\quad \rightarrow$ AE & variational AE $[\texttt{An 15}]$ for TAD
  • LSTM
  • $\quad \rightarrow$ LSTM Encoder-Decoder $[\texttt{Mal. 16}]$
  • Diffusion Models
  • $\quad \rightarrow$ (NOT times series) Medical anomaly detection $[\texttt{Wol. 22, Mul. 22}]$
  • Generative adversarial models (GAN)
  • $\quad \rightarrow$ BeatGAN $[\texttt{Zho. 19}]$, MAD-GAN $[\texttt{Li 19}]$, TAD-GAN $[\texttt{Gei. 20}]$
  • Ringberg, H. et al. (2007). Sensitivity of PCA for traffic anomaly detection, ACM SIGMETRICS.
  • An, J. et al. (2015). Variational autoencoder based anomaly detection using reconstruction probability, IE.
  • Malhotra, P. et al. (2016). LSTM-based encoder-decoder for multi-sensor anomaly detection, ICML.
  • Wolleb, J. et al. (2022). Diffusion models for medical anomaly detection, MICCAI.
  • Müller-Franzes, G. et al. (2022). Diffusion Probabilistic Models beat GANs on Medical Images, CVPR.
  • Zhou, B. et al. (2019). BeatGAN: Anomalous Rhythm Detection using Adversarially Generated Time Series, IJCAI.
  • Li, D. et al. (2019). MAD-GAN: Multivariate anomaly detection for time series data with GAN, ICANN.
  • Geiger, A. et al. (2020). Tadgan: Time series anomaly detection using generative adversarial networks, ICBD.

Outline

I - Introduction

II - TAD-GAN

$\quad$ II.A - GAN and extensions

$\quad$II.B - TAD-GAN in a nutshell

III - Anomaly detection process for TAD-GAN

IV - Experiments

V - Conclusion

II.A - Generative adversarial networks and extensions

General Definition (1/2)

  • Generator: $\mathcal{G}(z)$ produces samples intended to look as if drawn from $\mathbb{P}_{\bold{X}}$ (trained to fool the critic $\mathcal{C}_x$)
  • Discriminator (critic $\mathcal{C}_x$): tells whether a sample is real or fake (trained on $\bold{X}$, for which it should answer "real")
A common loss (based on a zero-sum game) is: $$ \mathcal{L} = \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\log(\mathcal{C}_x(x))] + \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\log\left(1-\mathcal{C}_x(\mathcal{G}(z))\right)] $$
  • Goodfellow I.J et al. (2014). Generative Adversarial Networks, NEURIPS.

General Definition (2/2)

Summary of the training procedure

  • $\{z_k\}_{k=1}^{K}, \{x_k\}_{k=1}^{K}$: samples from $\mathbb{P}_{\bold{Z}}$ and $\mathbb{P}_{\bold{X}}$ respectively.
  • $\Theta_{\mathcal{C},x}$ and $\Theta_{\mathcal{G}}$ the parameters of $\mathcal{C}_x$ and $\mathcal{G}$ respectively.

  1. Compute $P$ times (at each iteration, new $\{z_k\}_{k=1}^{K}, \{x_k\}_{k=1}^{K}$) $\nabla_{\Theta_{\mathcal{C},x}}\frac{1}{K}\sum_{k} [\log(\mathcal{C}_x(x_k)) + \log\left(1-\mathcal{C}_x(\mathcal{G}(z_k))\right)] \quad \texttt{(Critics training)}$ $\quad \quad \rightarrow$ backpropagation along $\Theta_{\mathcal{C},x}$
  2. Compute (with new $\{z_k\}_{k=1}^{K}$): $\nabla_{\Theta_{\mathcal{G}}}\frac{1}{K}\sum_{k} [ \log\left(1-\mathcal{C}_x(\mathcal{G}(z_k))\right)] \qquad\qquad\qquad~~ \texttt{(Generator training)}$ $\quad \quad \rightarrow$ backpropagation along $\Theta_{\mathcal{G}}$
  3. Repeat steps 1 and 2 many times (at least until close to a Nash equilibrium); see the sketch below
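A hedged sketch of this alternating procedure on toy 1-D data; the tiny MLPs, learning rates and numbers of steps are illustrative assumptions, not the paper's settings:

```python
# Vanilla GAN training sketch: P critic steps, then one generator step, repeated.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))                  # generator
C = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())    # critic C_x
opt_C = torch.optim.Adam(C.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
K, P = 64, 5

for step in range(200):
    for _ in range(P):                             # 1. critic training (P inner steps)
        x = 2.0 + 0.5 * torch.randn(K, 1)          # samples from P_X (toy Gaussian data)
        z = torch.randn(K, 1)                      # samples from P_Z
        loss_C = -(torch.log(C(x) + 1e-8) + torch.log(1 - C(G(z)) + 1e-8)).mean()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()
    z = torch.randn(K, 1)                          # 2. generator training
    loss_G = torch.log(1 - C(G(z)) + 1e-8).mean()  # minimize log(1 - C(G(z)))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```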

Pros & Cons

  • Can approach any implicit distribution if well tuned and trained
  • put differently, it can generate data close to $\bold{X}$
  • Hard to train in practice (instability, very poor convergence, etc.)

Wasserstein Extension

Definition

  • A way to mitigate unstable gradients, using the Wasserstein-1 distance $\mathcal{W}$
  • Under some conditions, if $\mathcal{G}$ is locally Lipschitz, then $\mathcal{W}(\mathbb{P}_{\bold{X}}, \mathbb{P}_{\mathcal{G}(\bold{Z})})$ is differentiable w.r.t. the generator parameters
  • The objective becomes (using the Kantorovich-Rubinstein duality):
  • $\underset{\mathcal{G}}{\min} \underset{\mathcal{C}_x \in \mathscr{L}_{x}^{1}}{\max} \underbrace{\left( \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_x(x)] - \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_x(\mathcal{G}(z))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W}}} $
  • $\mathscr{L}_{x}^{1}$: the set of $1$-Lipschitz functions.

New Procedure

  • Replace the previous loss $\mathcal{L}$ with $\mathcal{L}_{\mathcal{W}}$
  • clip $\Theta_{\mathcal{C}_x}$ to a small interval (weight clipping); see the sketch below
  • Arjovsky, M. et al. (2017). Wasserstein gan, ICML.
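A minimal sketch of the two WGAN changes, assuming the critic is a PyTorch module; the clipping value is an illustrative choice:

```python
# WGAN sketch: Wasserstein critic loss L_W and weight clipping to keep C_x ~1-Lipschitz.
import torch

def critic_loss_w(C, x_real, x_fake):
    # maximize E[C(x)] - E[C(G(z))]  ->  minimize the negative
    return -(C(x_real).mean() - C(x_fake).mean())

def clip_critic_(C, c=0.01):
    # clip every critic parameter to the interval [-c, c] (in place)
    with torch.no_grad():
        for p in C.parameters():
            p.clamp_(-c, c)
```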

Multivariate anomaly detection: MAD-GAN (1/3)

Generalities

  • Takes the inter-correlations between sensors into account (TAD with a GAN)
  • LSTMs for both the critic $\mathcal{C}_x$ and the generator $\mathcal{G}$
  • Use "normal data" for training and compute a "residual reconstruction error"

Data Pre-processing

  • $M$, $T$: number of measurements and of time steps, respectively
  • sub-sequences of $\bold{X} \in \mathbb{R}^{M \times T}$ with sliding window of length $s_w$ and step size $s_s$
  • $\quad \rightarrow$ results in $\bold{\tilde{X}}_{s,w} = \{\tilde{\bold{X}}_i\}_{i=1}^{I}$ where $I=s_s^{-1}(T-s_w), \tilde{\bold{X}}_i \in \mathbb{R}^{M \times s_w}$
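A minimal sketch of this sliding-window pre-processing (the function name is illustrative):

```python
# Sub-sequences of length s_w taken every s_s steps, one window per start index.
import numpy as np

def sliding_windows(X, s_w, s_s):
    M, T = X.shape
    starts = range(0, T - s_w + 1, s_s)
    return np.stack([X[:, i:i + s_w] for i in starts])     # shape (I, M, s_w)

X = np.random.randn(3, 1000)
print(sliding_windows(X, s_w=100, s_s=10).shape)            # (91, 3, 100)
```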

Multivariate anomaly detection: MAD-GAN (2/3)

Training procedure $\rightarrow$ as a "usual" GAN on $\bold{\tilde{X}}^{\text{train}}_{s,w} \in \mathbb{R}^{M \times I_1 \times s_w}$
For the sake of clarity, we set $M=1$ and omit the subscripts $s,w$.

Test Procedure

Two losses are computed on $\bold{\tilde{X}}\in \mathbb{R}^{I_2 \times s_w}$, with $\tilde{x}^{\tau}_{i} \in \mathbb{R}, \forall (\tau,i) \in \llbracket 1, s_w \rrbracket \times \llbracket 1, I_2 \rrbracket$:
  • Critic loss: $\mathcal{L}_{\mathcal{C}_x}^{\tau}=\mathcal{C}_x(\tilde{\bold{x}}^{\tau}) \in \mathbb{R}^{I_2}$
  • Residual loss between $\bold{\tilde{X}}$ and $\mathcal{G}(Z^k)$:
  • $\quad \rightarrow$ sample $Z^1$ to get a random $\mathcal{G}(Z^1) \\$ $\quad \rightarrow$ iteratively update $Z^k$ with the gradient obtained from $\epsilon\\$
    $$ \underset{Z^k}{\min ~}\epsilon(\bold{\tilde{X}} , \mathcal{G}(Z^k)) = 1- \mathrm{Cov}(\bold{\tilde{X}}, \mathcal{G}(Z^k)) $$
    $\quad \rightarrow$ compute the residual error $\mathcal{L}_\mathcal{R}$ (point-wise)
    $$ \mathcal{L}_{\mathcal{R}}^{\tau} = |\tilde{\bold{x}}^{\tau} - \mathcal{G}(Z^{k,\tau})| \in \mathbb{R}^{I_2} $$
  • The Anomaly Detection Loss (ADL) is given for $\tau \in \llbracket 1, s_w \rrbracket$ by:
  • $$ \mathcal{L}_{\text{An}}^{\tau} = \lambda \mathcal{L}_{\mathcal{C}_x}^{\tau} + (1-\lambda)\mathcal{L}_{\mathcal{R}}^{\tau} \in \mathbb{R}^{I_2} $$
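A hedged sketch of this test-time score for one window: the latent search is done by gradient descent on a correlation-based $\epsilon$ (a stand-in for $1 - \mathrm{Cov}$), and $\mathcal{G}$, $\mathcal{C}_x$ below are untrained stand-ins; in practice they come from the GAN training above.

```python
# MAD-GAN-style anomaly loss for one window: latent search + lambda-weighted combination.
import torch
import torch.nn as nn

s_w, latent_dim, lam = 100, 20, 0.5
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, s_w))
C_x = nn.Sequential(nn.Linear(s_w, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())

x_tilde = torch.sin(torch.linspace(0, 6.28, s_w))          # one test window

Z = torch.randn(latent_dim, requires_grad=True)             # sample Z^1
opt = torch.optim.Adam([Z], lr=1e-2)
for _ in range(200):                                         # iterative latent update
    rec = G(Z)
    cov = ((rec - rec.mean()) * (x_tilde - x_tilde.mean())).mean()
    eps = 1 - cov / (rec.std() * x_tilde.std())              # 1 - correlation, stand-in for 1 - Cov
    opt.zero_grad(); eps.backward(); opt.step()

residual = (x_tilde - G(Z)).abs().detach()                   # L_R^tau, point-wise
critic = C_x(x_tilde).detach()                               # critic score for this window
adl = lam * critic + (1 - lam) * residual                    # anomaly detection loss per tau
```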

Multivariate anomaly detection: MAD-GAN (3/3)

Discrimination and Reconstruction Anomaly Score (DR-Score)

  • We finally get the ADL collection $\{\mathcal{L}_{\text{An},i}^{\tau}\}_{i,\tau} \in \mathbb{R}^{I_2 \times s_w}$
  • DR-Score (DRS) computed by mapping ADL to the original time series:

  • $$ DRS_t = \frac{\sum_{i+\tau = t} \mathcal{L}_{\text{An},i}^{\tau}}{\#\{(i, \tau) : i + \tau = t\}} $$
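A minimal sketch of this mapping, assuming the windows are taken with step $s_s$ (indexing conventions below are illustrative): every $(i, \tau)$ pair with $i + \tau = t$ contributes to $DRS_t$, and the contributions are averaged.

```python
# Map window-level losses back to the original time axis by averaging overlaps.
import numpy as np

def drs(adl, s_s=1):
    """adl: array of shape (I, s_w) of per-window, per-offset anomaly losses."""
    I, s_w = adl.shape
    T = (I - 1) * s_s + s_w
    acc, cnt = np.zeros(T), np.zeros(T)
    for i in range(I):
        t0 = i * s_s
        acc[t0:t0 + s_w] += adl[i]
        cnt[t0:t0 + s_w] += 1
    return acc / cnt                        # DRS_t for t = 1..T

print(drs(np.random.rand(10, 5)).shape)     # (14,)
```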

Anomaly detection

  • The anomaly indicator at time $t$ is computed as:
  • $$ A_t^{\mathrm{test}} = \begin{cases} 1, \quad \text{if } \mathbb{H}(DRS_t, 1) > \text{threshold} \\ 0, \quad \text{otherwise} \end{cases} $$
  • $\mathbb{H} (\cdot, \cdot)$ : cross-entropy error
  1. Worse results than ARIMA on many real datasets!
  2. Weak points: the residual error and the conventional GAN training

II.B - TAD-GAN in a nutshell

Ingredients for a robust time series reconstruction GAN

  • Two mapping functions : $\mathcal{E}:\bold{X} \to \bold{Z}$ and $\mathcal{G}: \bold{Z} \to \bold{X}$
  • We can reconstruct the input time series: $\bold{x}^{t} \to \mathcal{E}(\bold{x}^{t}) \to \mathcal{G}(\mathcal{E}(\bold{x}^{t})) \approx \bold{x}^{t}$
  • $\mathcal{G}$ and $\mathcal{E}$ can each be seen as a generator, with respective critics $\mathcal{C}_x$ and $\mathcal{C}_z$

The "high-level" objective of TAD-GAN consists of:

  • Two Wasserstein losses (match $\mathcal{G}(\bold{Z})\leftrightarrow\bold{X}$ and $\mathcal{E}(\bold{X})\leftrightarrow\bold{Z}$)
  • Cycle consistency losses to prevent a contradiction between $\mathcal{G}$ and $\mathcal{E}$

Cycle Consistency Principle

  • First introduced in $\texttt{[Zhu. 17]}$ for Image-to-image translation
  • mode collapse: several input images map to the same output image
  • Consistency loss: translating an image to the other domain and back should return the original image
  • Zhu J.Y. et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV

Full Objective Loss

The full objective finally combines:
  • The two following Wasserstein objectives:
  • $\underset{\mathcal{G}}{\min} \underset{\mathcal{C}_x \in \mathscr{L}_{x}^{1}}{\max} \underbrace{\left( \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_x(x)] - \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_x(\mathcal{G}(z))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W},\mathcal{G}}} $

    $\underset{\mathcal{E}}{\min} \underset{\mathcal{C}_z \in \mathscr{L}_{z}^{1}}{\max} \underbrace{\left( \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_z(z)] - \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_z(\mathcal{E}(x))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W},\mathcal{E}}} $
  • A single cycle-consistency loss, on reconstructed samples only:
    $ \mathcal{L}_{2,\mathcal{G}} = ||x-\mathcal{G}(\mathcal{E}(x))||_2 $

The Full objective loss $\mathcal{L}$ is then:

$ \mathcal{L} = \mathcal{L}_{2,\mathcal{G}} + \mathcal{L}_{\mathcal{W},\mathcal{G}} + \mathcal{L}_{\mathcal{W},\mathcal{E}} $
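A hedged sketch of how the three terms are computed for one batch; $\mathcal{E}$, $\mathcal{G}$, $\mathcal{C}_x$, $\mathcal{C}_z$ are stand-in linear modules, and the Lipschitz constraints on the critics are omitted here.

```python
# Full TAD-GAN-style objective for one batch, from the generator/encoder point of view.
import torch
import torch.nn as nn

s_w, k = 100, 20
E = nn.Linear(s_w, k)        # encoder   E: X -> Z
G = nn.Linear(k, s_w)        # generator G: Z -> X
C_x = nn.Linear(s_w, 1)      # critic on the data space
C_z = nn.Linear(k, 1)        # critic on the latent space

x = torch.randn(64, s_w)     # batch of (pre-processed) windows
z = torch.randn(64, k)       # batch of latent samples

L_w_g = C_x(x).mean() - C_x(G(z)).mean()     # Wasserstein term, data space
L_w_e = C_z(z).mean() - C_z(E(x)).mean()     # Wasserstein term, latent space
L_2   = (x - G(E(x))).norm(dim=1).mean()     # cycle-consistency reconstruction term
L     = L_2 + L_w_g + L_w_e                  # full objective (min over E, G; max over critics)
```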

Training procedure

To simplify, consider $M=1$ (univariate time series)
  • As in MAD-GAN, we consider (subscripts $s,w$ omitted for clarity)
  • $\bold{{X}} = \{\bold{x}_i\}_{i=1}^{I}$ where $I=s_s^{-1}(T-s_w), \bold{x}_i \in \mathbb{R}^{s_w}$
  • We also consider $\bold{Z} = \{\bold{z}_i\}_{i=1}^{I}$ with $\bold{z}_i \in \mathbb{R}^{K}$ drawn from a normal distribution
  • $\bold{X},\bold{Z}$ are fed to the GAN model, which is trained with the loss $\mathcal{L}$

Computing the reconstruction on the test set

Again, we set $M=1$
  • Pre-processed test set $\bold{{X}} = \{\bold{x}_i\}_{i=1}^{I}$
  • Generate reconstructed sequences as $\bold{x}_i \to \mathcal{E}(\bold{x}_i) \to \mathcal{G}(\mathcal{E}(\bold{x}_i)) = \hat{\bold{x}}_i$
  • For each time point $t$, we get a collection $\tilde{\bold{x}}^t \triangleq \{\hat x_i^\tau : i+\tau=t\}$
  • Take the median of each $\tilde{\bold{x}}^t$ to keep the reconstructed value $\hat{x}^t \triangleq \mathbb{M}[\tilde{\bold{x}}^t]$ (see the sketch below)
  • The reconstructed time series is $(\hat{x}^1, \dots, \hat{x}^T)$
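A minimal sketch of this median assembly over overlapping windows (step size $s_s$ and the function name are illustrative):

```python
# Assemble the reconstructed series: for each t, take the median of all reconstructed
# values that overlapping windows place at t.
import numpy as np

def median_reconstruction(x_hat_windows, s_s=1):
    """x_hat_windows: (I, s_w) array of reconstructed windows G(E(x_i))."""
    I, s_w = x_hat_windows.shape
    T = (I - 1) * s_s + s_w
    collections = [[] for _ in range(T)]
    for i in range(I):
        for tau in range(s_w):
            collections[i * s_s + tau].append(x_hat_windows[i, tau])
    return np.array([np.median(c) for c in collections])   # (x_hat^1, ..., x_hat^T)
```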

Computing the Critic scores on the test set

  • The Critic output $\bold{x}_i \to \mathcal{C}_x(\bold{x}_i) = \hat{\bold{c}}_i$ can directly serve as an anomaly score
  • Similarly at a time step $t$ we have some $\{\hat c_i^\tau, i+\tau=t\}$
  • Kernel density estimation (KDE) is applied and the maximum value is selected
  • We get a sequence $(\hat{c}^1, \dots, \hat{c}^T)$
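A hedged sketch of the KDE smoothing for one time step, assuming the collection of critic scores is already gathered (here the KDE maximum is searched over the observed scores, an illustrative simplification):

```python
# For one time step t: fit a KDE on the critic scores of all windows covering t and
# keep the most likely value.
import numpy as np
from scipy.stats import gaussian_kde

def kde_smooth(scores_at_t):
    scores_at_t = np.asarray(scores_at_t, dtype=float)
    if len(np.unique(scores_at_t)) < 2:          # KDE needs some spread
        return float(scores_at_t[0])
    kde = gaussian_kde(scores_at_t)
    return float(scores_at_t[np.argmax(kde(scores_at_t))])

print(kde_smooth([0.10, 0.12, 0.11, 0.90]))      # the densest region (~0.11) wins
```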


III - Anomaly detection process for TAD-GAN

Anomaly scores using reconstruction errors

Point-wise difference

  • Most intuitive score:
  • $|x^t - \hat{x}^t|$

Area difference

  • Measures the similarity between local regions.
  • defined as
  • $\frac{1}{2l}\!\left|\int_{t-l}^{t+l}(x^{\tau}\!\!-\!\!\hat{x}^{\tau}) \mathrm{d}\tau\right|$

Dynamic Time Warping (DTW)

  • Handle time shift issues
  • $\bold{X}=\{x_{t+k}\}_{k=-l+1}^{l} \\$ $\hat{\bold{X}}=\{\hat x_{t+k}\}_{k=-l+1}^{l}$ (so $L = 2l$ points)
  • $\bold{W} \in \mathbb{R}^{L \times L}$: a matrix containing the distances between each pair of points
  • The DTW distance is defined as follows:
  • $$ s_t=\min_{\bold{W}}\left(\frac{1}{L^2}\sqrt{\sum_{l,l^\prime =1}^{L} [\bold{W}]_{l,l^\prime}}\right) $$ where the minimum is taken over admissible warping paths $\bold{W}$
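A hedged sketch of the three reconstruction-error flavours on a local window of length $L = 2l$ around $t$; the DTW below is a standard dynamic-programming implementation used in place of the exact formula above, and the boundary handling is an illustrative choice.

```python
# Point-wise, area, and DTW reconstruction errors between x and its reconstruction x_hat.
import numpy as np

def pointwise(x, x_hat):
    return np.abs(x - x_hat)

def area_diff(x, x_hat, l=5):
    # |integral over [t-l, t+l] of (x - x_hat)| / (2l), approximated by a sum
    d = x - x_hat
    return np.array([np.abs(d[max(0, t - l):t + l].sum()) / (2 * l) for t in range(len(x))])

def dtw(a, b):
    L = len(a)
    D = np.full((L + 1, L + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, L + 1):
        for j in range(1, L + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[L, L] / L       # length-normalized DTW distance
```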

Scores combination

  • $RE(x)$ the reconstruction error and $\mathcal{C}_x(x)$ the Critic output
  • A high $RE(x)$ together with a low $\mathcal{C}_x(x)$ indicates a high anomaly score
  • Mean and std. are computed and their $z$-scores $Z_{RE}(x), Z_{\mathcal{C}_x}(x)$ are used

Two proposed combinations:

  • $\bold{a}(x) = \alpha Z_{RE}(x) + (1-\alpha)Z_{\mathcal{C}_x}(x) \quad \quad \texttt{(convex combination)}$
  • $\bold{a}(x) = \alpha Z_{RE}(x) \odot Z_{\mathcal{C}_x}(x) \qquad \qquad ~~~ \texttt{(product combination)}$
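A minimal sketch of both combinations, assuming the per-time-step reconstruction errors and critic scores are already computed (whether $Z_{\mathcal{C}_x}$ needs to be negated depends on the critic's sign convention, which is left as an assumption here):

```python
# Convex and product combinations of the z-scored reconstruction error and critic score.
import numpy as np

def zscore(v):
    return (v - v.mean()) / v.std()

def combine(re, critic, alpha=0.5, mode="convex"):
    z_re, z_c = zscore(re), zscore(critic)   # negate z_c if low critic = more anomalous
    if mode == "convex":
        return alpha * z_re + (1 - alpha) * z_c
    return alpha * z_re * z_c                # "product" combination
```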

Locally adaptive thresholding

  • Sliding windows are used with $s_w = T/3$ and $s_s = T/30$
  • for each sliding window, a static threshold is defined as $4$ std. from the mean
  • anomaly score larger than the threshold $\implies$ anomaly
  • consecutive anomalous points yield the anomalous sequences $\{\bold{a}_{\text{seq}}^k\}_{k=1}^{K}$ (see the sketch below)
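A minimal sketch of this locally adaptive thresholding, assuming the anomaly score series is given (window bookkeeping at the series boundaries is simplified):

```python
# Flag points more than 4 std above the mean within each sliding window, then group
# consecutive flagged points into anomalous sequences.
import numpy as np

def adaptive_threshold(scores, n_windows=30):
    T = len(scores)
    s_w, s_s = T // 3, T // n_windows            # s_w = T/3, s_s = T/30
    flagged = np.zeros(T, dtype=bool)
    for start in range(0, T - s_w + 1, s_s):
        w = scores[start:start + s_w]
        flagged[start:start + s_w] |= w > w.mean() + 4 * w.std()
    idx = np.where(flagged)[0]
    # split consecutive indices into separate anomalous sequences
    return np.split(idx, np.where(np.diff(idx) > 1)[0] + 1) if len(idx) else []
```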

Anomaly pruning $[\texttt{Hun. 18]}$

  • A technique to mitigate false positives
  • Consider the maximum values $\{\bold{a}_{\max}^k\}_{k=1}^{K}$ in descending order
  • Compute the decrease percentage $p^{k} = (a^{k-1}_{\max} - a^{k}_{\max})/a^{k-1}_{\max}$
  • At the first $i$ such that $p^{i} < \text{threshold} = 0.1$, sequence $i$ and all following ones are reclassified as normal
  • The sequences $\{\bold{a}_{\text{seq}}^{j}, 1 \leq j < i\}$ remain as anomalies
  • Hundman K. et al. (2018). Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding, SIGKDD.
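A hedged sketch of this pruning rule, assuming the anomalous sequences from the previous step are given as lists of time indexes (function and variable names are illustrative):

```python
# Keep anomalous sequences in descending order of their maximum score until the
# decrease percentage between consecutive maxima falls below the threshold.
def prune(sequences, scores, threshold=0.1):
    """sequences: list of index arrays; scores: the anomaly score series (numpy array)."""
    maxima = sorted(((scores[s].max(), s) for s in sequences),
                    reverse=True, key=lambda pair: pair[0])
    kept = []
    for k in range(len(maxima)):
        if k > 0:
            prev, cur = maxima[k - 1][0], maxima[k][0]
            if (prev - cur) / prev < threshold:   # decrease percent p^k below threshold
                break                             # this and later sequences are normal
        kept.append(maxima[k][1])
    return kept
```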

IV - Experiments

TAD-GAN Architecture and inputs

  • Inputs are time series of length $100$ and the latent space dimension is $20$
  • 1 Layer of Bi-LSTM with $100$ hidden units for $\mathcal{E}$
  • 2 layers of Bi-LSTM with $64$ hidden units for $\mathcal{G}$
  • 1-D convolutional layer for both Critics
  • batch size of $64$
  • $2000$ iterations in total
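A hedged PyTorch sketch of this architecture: the layer sizes follow the list above, but the exact wiring (e.g. how the Bi-LSTM outputs are projected, or the critic's convolution size) is an assumption.

```python
# Stand-in modules with the described sizes: Bi-LSTM encoder/generator, 1-D conv critic.
import torch
import torch.nn as nn

class Encoder(nn.Module):                       # E: window of length 100 -> latent of dim 20
    def __init__(self, s_w=100, k=20):
        super().__init__()
        self.lstm = nn.LSTM(1, 100, num_layers=1, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 100, k)
    def forward(self, x):                       # x: (batch, s_w)
        h, _ = self.lstm(x.unsqueeze(-1))
        return self.proj(h[:, -1])              # (batch, k)

class Generator(nn.Module):                     # G: latent of dim 20 -> window of length 100
    def __init__(self, s_w=100, k=20):
        super().__init__()
        self.s_w = s_w
        self.lstm = nn.LSTM(k, 64, num_layers=2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 64, 1)
    def forward(self, z):                       # z: (batch, k)
        h, _ = self.lstm(z.unsqueeze(1).repeat(1, self.s_w, 1))
        return self.proj(h).squeeze(-1)         # (batch, s_w)

# 1-D convolutional critic on the data space; input shape (batch, 1, s_w)
critic_x = nn.Sequential(nn.Conv1d(1, 16, 5), nn.LeakyReLU(), nn.Flatten(), nn.LazyLinear(1))
```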

Baselines

  • ARIMA: learns autocorrelations in the time series to predict future values. The point-wise prediction error is used as the anomaly score.
  • HTM: encodes the current input to a hidden state and predicts the next hidden state. Prediction errors are computed as the differences between the predicted and the true state.
  • LSTM: prediction-based, with point-wise prediction errors used as anomaly scores.
  • AutoEncoder: an LSTM autoencoder, with the point-wise reconstruction error used as the anomaly score.
  • MAD-GAN: already introduced; a multivariate time series reconstruction method based on a GAN.
  • Microsoft Azure Anomaly Detector: uses a spectral residual CNN (SR-CNN).
  • Amazon DeepAR: uses an autoregressive recurrent network.

Considered datasets

Data Preparation

  • the data is normalized
  • $s_w=100$ and $s_s=1$

Data presentation

  • Spacecraft telemetry: signals provided by NASA
  • Yahoo S5: $A_1$ is based on real production traffic; $A_2, \dots, A_4$ are synthetic datasets.
  • Numenta Anomaly Benchmark (NAB): multiple time series from various domains

Evaluation metric

F1-score = $\frac{\text{TP}}{\text{TP}+0.5(\text{FN} + \text{FP})}$

Comparison with best settings for TAD-GAN

  • TAD-GAN overall has a lower standard deviation and a better mean F1-score
  • TAD-GAN largely outperforms MAD-GAN
  • TAD-GAN is slightly outperformed by ARIMA on Yahoo synthetic data

Ablation results

  • Using only the Critic gives poor results
  • DTW slightly outperforms the other two reconstruction error types

Baseline models in comparison to ARIMA

V - Conclusion

Takehome message

  • Design of TAD-GAN: a new reconstruction-based anomaly detection method
  • Outperforms MAD-GAN and is more stable
  • However, a specific score-combination setting is needed to perform best
  • TAD-GAN is outperformed on synthetic datasets that have seasonality

Improvements

  • Some combination of BeatGAN and TAD-GAN to deal with periodicity?
  • Learn an "anomaly score"
  • Transfer learning techniques for domain adaptation?
Thank you !