TAD-GAN: a robust unsupervised time series anomaly detection method

DSAIDIS (Axis 1)

Mathieu FONTAINE
mathieu.fontaine@telecom-paris.fr

May, 24th 2023

I - Introduction

What is a time series?

  • $\bold{X} = (\bold{x}^1, \dots,$ $\bold{x}^t$$, \dots, \bold{x}^T) \in \mathbb{R}^{M \times T}$
  • $M$: number of sensors / measurements
  • $T$: number of time steps
multivariate time series $\implies$ several synchronous measurements
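As a toy illustration (the array and variable names below are illustrative, not from the slides), a multivariate time series is simply an $M \times T$ array:

```python
# Minimal sketch: a multivariate time series X with M sensors and T time steps.
import numpy as np

M, T = 3, 1000                      # 3 sensors, 1000 time steps
t = np.arange(T)
X = np.stack([np.sin(2 * np.pi * t / 50 + phase) for phase in (0.0, 0.5, 1.0)])
X += 0.1 * np.random.randn(M, T)    # measurement noise
print(X.shape)                      # (M, T) = (3, 1000)
```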

Anomaly in time series: definition

  • Given observed data $\bold{X}$
  • Find sequences ${\color{red}\bold{A}_{seq}} \triangleq \{{\color{red}\bold{a}^{(1)}_{seq, 1}}, \dots, {\color{red}\bold{a}^{(K)}_{seq, K}}\} $ showing unusual behavior
  • ${\text{seq}, i}$ denotes the set of time indexes associated with the $i^{\text{th}}$ anomaly
In an industrial context, which machine learning framework may be suitable?

Unsupervised setting for time anomaly detection (TAD)

Unsupervised approach $\rightarrow$ suitable in various real-world scenarios:

  • No identified "known anomalies" to train the model
  • No availability of "normal baselines"
  • $\quad \rightarrow$ the available signal may itself include anomalies.
  • No clear segmentation
  • $\quad \rightarrow$ periodic signals (e.g. EEG, ECG $\texttt{[Qiu. 19]}$) are OK... but otherwise?

Three main families of methods for unsupervised TAD:
proximity-, prediction- and reconstruction-based methods
  • Qiu, J. et al. (2019). Kpi-tsad: A time-series anomaly detector for kpi monitoring in cloud applications. Symmetry.

Prediction-based anomaly detection (1/2)

Method

  • learn a predictive model that fits the given time series data
  • use the model to predict future values
$|\text{pred}_{value} - \text{true}_{value}| > \text{threshold}$ $\implies$ anomaly
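A minimal sketch of this idea, assuming a simple AR(1) predictor and a mean-plus-four-sigma threshold (both are illustrative choices, not the cited methods):

```python
# Prediction-based TAD sketch: fit a one-step predictor, flag |pred - true| > threshold.
import numpy as np

def ar1_predict(x):
    """One-step-ahead prediction with an AR(1) coefficient fit by least squares."""
    a = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    pred = np.empty_like(x)
    pred[0] = x[0]
    pred[1:] = a * x[:-1]
    return pred

x = np.sin(np.linspace(0, 20 * np.pi, 1000))
x[400] += 3.0                               # inject a point anomaly
err = np.abs(ar1_predict(x) - x)            # |pred - true|
threshold = err.mean() + 4 * err.std()
anomalies = np.where(err > threshold)[0]
print(anomalies)                            # should contain index 400
```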

Prediction-based anomaly detection (2/2)

Examples

  • Statistical models
  • $\quad \rightarrow$ ARIMA $\texttt{[Pen. 13]}$, FDA $\texttt{[Tor. 11]}$
  • Deep neural networks
  • $\quad \rightarrow$ Long Short-Term Memory (LSTM) techniques $\texttt{[Hun. 18]}$
    $\quad \rightarrow$ Transformers-based
    $\quad \quad \triangleright$ Temporal Fusion Transformer $\texttt{[Lim 21]}$, Autoformer $\texttt{[Wu 21]}$
  • Pena, E. H. et al. (2013). Anomaly detection using forecasting methods arima and hwds, SCCC.
  • Torres, J. M. et al. (2011). Detection of outliers in gas emissions from urban areas using functional data analysis, JHM.
  • Hundman, K. et al. (2018). Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding, SIGKDD.
  • Lim, B. et al. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting, IJF.
  • Wu, H. et al. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, ANIPS.

Reconstruction-based anomaly detection (1/2)

Method

  • learn a model to capture a low-dimensional representation $\bold{Z}$ of time series $\bold{X}$
  • create a synthetic reconstruction $\hat{\bold{X}}$ of the data from $\bold{Z}$
$d(\bold{X}_{\text{seq}}, \hat{\bold{X}}_{\text{seq}}) > \text{threshold}$ $\implies $ anomaly on indexes $\text{seq}$
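A hedged sketch of this reconstruction idea, using PCA (one of the models cited on the next slide) as a stand-in for the generic low-dimensional model; the window scheme and threshold are illustrative:

```python
# Reconstruction-based TAD sketch: project windows to a low-dimensional Z with PCA,
# reconstruct X_hat, and flag windows with a large reconstruction distance.
import numpy as np

def pca_reconstruct(W, n_components=2):
    mu = W.mean(axis=0)
    Wc = W - mu
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    Z = Wc @ Vt[:n_components].T            # low-dimensional representation Z
    return Z @ Vt[:n_components] + mu       # synthetic reconstruction X_hat

T, s_w = 2000, 50
x = np.sin(np.linspace(0, 40 * np.pi, T))
x[1000:1010] += 2.0                          # anomalous segment
W = np.stack([x[i:i + s_w] for i in range(0, T - s_w, s_w)])   # non-overlapping windows
err = np.linalg.norm(W - pca_reconstruct(W), axis=1)
print(np.argsort(err)[-1])                   # the window covering t ~ 1000 should rank highest
```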

Reconstruction-based anomaly detection (2/2)

Examples

  • Machine Learning Approach
  • $\quad \rightarrow$ Principal Component Analysis (PCA) $[\texttt{Rin. 07}]$
  • Auto-encoders (AE)
  • $\quad \rightarrow$ AE & variational AE $[\texttt{An 15}]$ for TAD
  • LSTM
  • $\quad \rightarrow$ LSTM Encoder-Decoder $[\texttt{Mal. 16}]$
  • Diffusion Models
  • $\quad \rightarrow$ (NOT times series) Medical anomaly detection $[\texttt{Wol. 22, Mul. 22}]$
  • Generative adversarial models (GAN)
  • $\quad \rightarrow$ BeatGAN $[\texttt{Zho. 19}]$, MAD-GAN $[\texttt{Li 19}]$, TAD-GAN $[\texttt{Gei. 20}]$
  • Ringberg, H. et al. (2007). Sensitivity of PCA for traffic anomaly detection, ACM SIGMETRICS.
  • An, J. et al. (2015). Variational autoencoder based anomaly detection using reconstruction probability, IE.
  • Malhotra, P. et al. (2016). LSTM-based encoder-decoder for multi-sensor anomaly detection, ICML.
  • Wolleb, J. et al. (2022). Diffusion models for medical anomaly detection, MICCAI.
  • Müller-Franzes, G. et al. (2022). Diffusion Probabilistic Models beat GANs on Medical Images, CVPR.
  • Zhou, B. et al. (2019). BeatGAN: Anomalous Rhythm Detection using Adversarially Generated Time Series, IJCAI.
  • Li, D. et al. (2019). MAD-GAN: Multivariate anomaly detection for time series data with GAN, ICANN.
  • Geiger, A. et al. (2020). Tadgan: Time series anomaly detection using generative adversarial networks, ICBD.

Outline

I - Introduction

II - TAD-GAN

$\quad$ II.A - GAN and extensions

$\quad$II.B - TAD-GAN in a nutshell

III - Anomaly detection process for TAD-GAN

IV - Experiments

V - Conclusion

II.A - Generative adversarial networks and extensions

General Definition (1/2)

  • Generator: $\mathcal{G}(z)$ produces samples intended to look as if drawn from $\mathbb{P}_{\bold{X}}$ (trained to fool the critic $\mathcal{C}_x$)
  • Discriminator (critic $\mathcal{C}_x$): tells whether a sample is real or fake (trained on $\bold{X}$, for which it should answer "real")
A common loss (based on a zero-sum game) is: $$ \mathcal{L} = \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\log(\mathcal{C}_x(x))] + \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\log\left(1-\mathcal{C}_x(\mathcal{G}(z))\right)] $$
  • Goodfellow I.J et al. (2014). Generative Adversarial Networks, NEURIPS.

General Definition (2/2)

Summary of the training procedure

  • $\{z_k\}_{k=1}^{K}, \{x_k\}_{k=1}^{K}$: samples from $\mathbb{P}_{\bold{Z}}$ and $\mathbb{P}_{\bold{X}}$ respectively.
  • $\Theta_{\mathcal{C},x}$ and $\Theta_{\mathcal{G}}$ the parameters of $\mathcal{C}_x$ and $\mathcal{G}$ respectively.

  1. Compute $P$ times (at each iteration, new $\{z_k\}_{k=1}^{K}, \{x_k\}_{k=1}^{K}$) $\nabla_{\Theta_{\mathcal{C},x}}\frac{1}{K}\sum_{k} [\log(\mathcal{C}_x(x_k)) + \log\left(1-\mathcal{C}_x(\mathcal{G}(z_k))\right)] \quad \texttt{(Critics training)}$ $\quad \quad \rightarrow$ backpropagation along $\Theta_{\mathcal{C},x}$
  2. Compute (with new $\{z_k\}_{k=1}^{K}$): $\nabla_{\Theta_{\mathcal{G}}}\frac{1}{K}\sum_{k} [ \log\left(1-\mathcal{C}_x(\mathcal{G}(z_k))\right)] \qquad\qquad\qquad~~ \texttt{(Generator training)}$ $\quad \quad \rightarrow$ backpropagation along $\Theta_{\mathcal{G}}$
  3. Repeat steps 1 and 2 many times (at least until close to a Nash equilibrium); see the sketch below
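A hedged sketch of this alternating procedure on toy 1-D data; the tiny MLPs, learning rates and numbers of steps are illustrative assumptions, not the paper's settings:

```python
# Vanilla GAN training sketch: P critic steps, then one generator step, repeated.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))                  # generator
C = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())    # critic C_x
opt_C = torch.optim.Adam(C.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
K, P = 64, 5

for step in range(200):
    for _ in range(P):                             # 1. critic training (P inner steps)
        x = 2.0 + 0.5 * torch.randn(K, 1)          # samples from P_X (toy Gaussian data)
        z = torch.randn(K, 1)                      # samples from P_Z
        loss_C = -(torch.log(C(x) + 1e-8) + torch.log(1 - C(G(z)) + 1e-8)).mean()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()
    z = torch.randn(K, 1)                          # 2. generator training
    loss_G = torch.log(1 - C(G(z)) + 1e-8).mean()  # minimize log(1 - C(G(z)))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```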

Pros & Cons

  • Can approach any implicit distribution if well tuned and trained
  • put differently, it can generate data close to $\bold{X}$
  • Hard to train in practice (instability, very poor convergence, etc.)

Wasserstein Extension

Definition

  • A way to mitigate unstable gradients, using the Wasserstein-1 distance $\mathcal{W}$
  • Under some conditions, if $\mathcal{G}$ is locally Lipschitz, then $\mathcal{W}(\mathbb{P}_{\bold{X}}, \mathbb{P}_{\mathcal{G}(\bold{Z})})$ is differentiable w.r.t. the generator parameters
  • The objective becomes (using the Kantorovich-Rubinstein duality):
  • $\underset{\mathcal{G}}{\min} \underset{\mathcal{C}_x \in \mathscr{L}_{x}^{1}}{\max} \underbrace{\left( \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_x(x)] - \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_x(\mathcal{G}(z))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W}}} $
  • $\mathscr{L}_{x}^{1}$: the set of $1$-Lipschitz functions.

New Procedure

  • Replace the previous loss $\mathcal{L}$ with $\mathcal{L}_{\mathcal{W}}$
  • clip $\Theta_{\mathcal{C}_x}$ to a small interval (weight clipping); see the sketch below
  • Arjovsky, M. et al. (2017). Wasserstein gan, ICML.
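A minimal sketch of the two WGAN changes, assuming the critic is a PyTorch module; the clipping value is an illustrative choice:

```python
# WGAN sketch: Wasserstein critic loss L_W and weight clipping to keep C_x ~1-Lipschitz.
import torch

def critic_loss_w(C, x_real, x_fake):
    # maximize E[C(x)] - E[C(G(z))]  ->  minimize the negative
    return -(C(x_real).mean() - C(x_fake).mean())

def clip_critic_(C, c=0.01):
    # clip every critic parameter to the interval [-c, c] (in place)
    with torch.no_grad():
        for p in C.parameters():
            p.clamp_(-c, c)
```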

Multivariate anomaly detection: MAD-GAN (1/3)

Generalities

  • Takes the inter-correlations between sensors into account (TAD with a GAN)
  • LSTMs for both the critic $\mathcal{C}_x$ and the generator $\mathcal{G}$
  • Use "normal data" for training and compute a "residual reconstruction error"

Data Pre-processing

  • $M$, $T$: number of measurements and of time steps, respectively
  • sub-sequences of $\bold{X} \in \mathbb{R}^{M \times T}$ with sliding window of length $s_w$ and step size $s_s$
  • $\quad \rightarrow$ results in $\bold{\tilde{X}}_{s,w} = \{\tilde{\bold{X}}_i\}_{i=1}^{I}$ where $I=s_s^{-1}(T-s_w), \tilde{\bold{X}}_i \in \mathbb{R}^{M \times s_w}$
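A minimal sketch of this sliding-window pre-processing (the function name is illustrative):

```python
# Sub-sequences of length s_w taken every s_s steps, one window per start index.
import numpy as np

def sliding_windows(X, s_w, s_s):
    M, T = X.shape
    starts = range(0, T - s_w + 1, s_s)
    return np.stack([X[:, i:i + s_w] for i in starts])     # shape (I, M, s_w)

X = np.random.randn(3, 1000)
print(sliding_windows(X, s_w=100, s_s=10).shape)            # (91, 3, 100)
```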

Multivariate anomaly detection: MAD-GAN (2/3)

Training procedure $\rightarrow$ as a "usual" GAN on $\bold{\tilde{X}}^{\text{train}}_{s,w} \in \mathbb{R}^{M \times I_1 \times s_w}$
For the sake of clarity, we set $M=1$ and omit the subscripts $s,w$.

Test Procedure

Two losses are computed on $\bold{\tilde{X}}\in \mathbb{R}^{I_2 \times s_w}$, with $\tilde{x}^{\tau}_{i} \in \mathbb{R}, \forall (\tau,i) \in \llbracket 1, s_w \rrbracket \times \llbracket 1, I_2 \rrbracket$:
  • Critic loss: $\mathcal{L}_{\mathcal{C}_x}^{\tau}=\mathcal{C}_x(\tilde{\bold{x}}^{\tau}) \in \mathbb{R}^{I_2}$
  • Residual loss between $\bold{\tilde{X}}$ and $\mathcal{G}(Z^k)$:
  • $\quad \rightarrow$ sample $Z^1$ to get a random $\mathcal{G}(Z^1) \\$ $\quad \rightarrow$ iteratively update $Z^k$ with the gradient obtained from $\epsilon\\$
    $$ \underset{Z^k}{\min ~}\epsilon(\bold{\tilde{X}} , \mathcal{G}(Z^k)) = 1- \mathrm{Cov}(\bold{\tilde{X}}, \mathcal{G}(Z^k)) $$
    $\quad \rightarrow$ compute the residual error $\mathcal{L}_\mathcal{R}$ (point-wise)
    $$ \mathcal{L}_{\mathcal{R}}^{\tau} = |\tilde{\bold{x}}^{\tau} - \mathcal{G}(Z^{k,\tau})| \in \mathbb{R}^{I_2} $$
  • The Anomaly Detection Loss (ADL) is given for $\tau \in \llbracket 1, s_w \rrbracket$ by:
  • $$ \mathcal{L}_{\text{An}}^{\tau} = \lambda \mathcal{L}_{\mathcal{C}_x}^{\tau} + (1-\lambda)\mathcal{L}_{\mathcal{R}}^{\tau} \in \mathbb{R}^{I_2} $$
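A hedged sketch of this test-time score for one window: the latent search is done by gradient descent on a correlation-based $\epsilon$ (a stand-in for $1 - \mathrm{Cov}$), and $\mathcal{G}$, $\mathcal{C}_x$ below are untrained stand-ins; in practice they come from the GAN training above.

```python
# MAD-GAN-style anomaly loss for one window: latent search + lambda-weighted combination.
import torch
import torch.nn as nn

s_w, latent_dim, lam = 100, 20, 0.5
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, s_w))
C_x = nn.Sequential(nn.Linear(s_w, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())

x_tilde = torch.sin(torch.linspace(0, 6.28, s_w))          # one test window

Z = torch.randn(latent_dim, requires_grad=True)             # sample Z^1
opt = torch.optim.Adam([Z], lr=1e-2)
for _ in range(200):                                         # iterative latent update
    rec = G(Z)
    cov = ((rec - rec.mean()) * (x_tilde - x_tilde.mean())).mean()
    eps = 1 - cov / (rec.std() * x_tilde.std())              # 1 - correlation, stand-in for 1 - Cov
    opt.zero_grad(); eps.backward(); opt.step()

residual = (x_tilde - G(Z)).abs().detach()                   # L_R^tau, point-wise
critic = C_x(x_tilde).detach()                               # critic score for this window
adl = lam * critic + (1 - lam) * residual                    # anomaly detection loss per tau
```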

Multivariate anomaly detection: MAD-GAN (3/3)

Discrimination and Reconstruction Anomaly Score (DR-Score)

  • We finally get the ADL collection $\{\mathcal{L}_{\text{An},i}^{\tau}\}_{i,\tau} \in \mathbb{R}^{I_2 \times s_w}$
  • DR-Score (DRS) computed by mapping ADL to the original time series:

  • $$ DRS_t = \frac{\sum_{i+\tau = t} \mathcal{L}_{\text{An},i}^{\tau}}{\#\{(i, \tau) : i + \tau = t\}} $$
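A minimal sketch of this mapping, assuming the windows are taken with step $s_s$ (indexing conventions below are illustrative): every $(i, \tau)$ pair with $i + \tau = t$ contributes to $DRS_t$, and the contributions are averaged.

```python
# Map window-level losses back to the original time axis by averaging overlaps.
import numpy as np

def drs(adl, s_s=1):
    """adl: array of shape (I, s_w) of per-window, per-offset anomaly losses."""
    I, s_w = adl.shape
    T = (I - 1) * s_s + s_w
    acc, cnt = np.zeros(T), np.zeros(T)
    for i in range(I):
        t0 = i * s_s
        acc[t0:t0 + s_w] += adl[i]
        cnt[t0:t0 + s_w] += 1
    return acc / cnt                        # DRS_t for t = 1..T

print(drs(np.random.rand(10, 5)).shape)     # (14,)
```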

Anomaly detection

  • The anomaly indicator at time $t$ is computed as:
  • $$ A_t^{\mathrm{test}} = \begin{cases} 1, \quad \text{if } \mathbb{H}(DRS_t, 1) > \text{threshold} \\ 0, \quad \text{otherwise} \end{cases} $$
  • $\mathbb{H} (\cdot, \cdot)$ : cross-entropy error
  1. Worse results than ARIMA on many real datasets!
  2. Weak points: the residual error and the conventional GAN training

II.B - TAD-GAN in a nutshell

Ingredients for a robust time series reconstruction GAN

  • Two mapping functions : $\mathcal{E}:\bold{X} \to \bold{Z}$ and $\mathcal{G}: \bold{Z} \to \bold{X}$
  • We can reconstruct the input time series: $\bold{x}^{t} \to \mathcal{E}(\bold{x}^{t}) \to \mathcal{G}(\mathcal{E}(\bold{x}^{t})) \approx \bold{x}^{t}$
  • $\mathcal{G}$ and $\mathcal{E}$ can each be seen as a generator, with respective critics $\mathcal{C}_x$ and $\mathcal{C}_z$

The "high-level" objective of TAD-GAN consists of:

  • Two Wasserstein losses (match $\mathcal{G}(\bold{Z})\leftrightarrow\bold{X}$ and $\mathcal{E}(\bold{X})\leftrightarrow\bold{Z}$)
  • Cycle consistency losses to prevent a contradiction between $\mathcal{G}$ and $\mathcal{E}$

Cycle Consistency Principle

  • First introduced in $\texttt{[Zhu. 17]}$ for Image-to-image translation
  • mode collapse: several input images map to the same output image
  • Consistency loss: translating an image to the other domain and back should return the original image
  • Zhu J.Y. et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV

Full Objective Loss

The full objective finally combines:
  • The two following Wasserstein objectives:
  • $\underset{\mathcal{G}}{\min} \underset{\mathcal{C}_x \in \mathscr{L}_{x}^{1}}{\max} \underbrace{\left( \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_x(x)] - \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_x(\mathcal{G}(z))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W},\mathcal{G}}} $

    $\underset{\mathcal{E}}{\min} \underset{\mathcal{C}_z \in \mathscr{L}_{z}^{1}}{\max} \underbrace{\left( \mathbb{E}_{z \in \mathbb{P}_{\bold{Z}}}[\mathcal{C}_z(z)] - \mathbb{E}_{x \in \mathbb{P}_{\bold{X}}}[\mathcal{C}_z(\mathcal{E}(x))] \right)}_{\triangleq \mathcal{L}_{\mathcal{W},\mathcal{E}}} $
  • A single cycle-consistency loss, on reconstructed samples only:
    $ \mathcal{L}_{2,\mathcal{G}} = ||x-\mathcal{G}(\mathcal{E}(x))||_2 $

The Full objective loss $\mathcal{L}$ is then:

$ \mathcal{L} = \mathcal{L}_{2,\mathcal{G}} + \mathcal{L}_{\mathcal{W},\mathcal{G}} + \mathcal{L}_{\mathcal{W},\mathcal{E}} $
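A hedged sketch of how the three terms are computed for one batch; $\mathcal{E}$, $\mathcal{G}$, $\mathcal{C}_x$, $\mathcal{C}_z$ are stand-in linear modules, and the Lipschitz constraints on the critics are omitted here.

```python
# Full TAD-GAN-style objective for one batch, from the generator/encoder point of view.
import torch
import torch.nn as nn

s_w, k = 100, 20
E = nn.Linear(s_w, k)        # encoder   E: X -> Z
G = nn.Linear(k, s_w)        # generator G: Z -> X
C_x = nn.Linear(s_w, 1)      # critic on the data space
C_z = nn.Linear(k, 1)        # critic on the latent space

x = torch.randn(64, s_w)     # batch of (pre-processed) windows
z = torch.randn(64, k)       # batch of latent samples

L_w_g = C_x(x).mean() - C_x(G(z)).mean()     # Wasserstein term, data space
L_w_e = C_z(z).mean() - C_z(E(x)).mean()     # Wasserstein term, latent space
L_2   = (x - G(E(x))).norm(dim=1).mean()     # cycle-consistency reconstruction term
L     = L_2 + L_w_g + L_w_e                  # full objective (min over E, G; max over critics)
```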

Training procedure

To simplify, consider $M=1$ (univariate time series)
  • As in MAD-GAN, we consider (subscripts $s,w$ omitted for clarity)
  • $\bold{{X}} = \{\bold{x}_i\}_{i=1}^{I}$ where $I=s_s^{-1}(T-s_w), \bold{x}_i \in \mathbb{R}^{s_w}$
  • We also consider $\bold{Z} = \{\bold{z}_i\}_{i=1}^{I}$ with $\bold{z}_i \in \mathbb{R}^{K}$ drawn from a normal distribution
  • $\bold{X},\bold{Z}$ are fed to the GAN model, which is trained with the loss $\mathcal{L}$

Computing the reconstruction on the test set

Again, we set $M=1$
  • Pre-processed test set $\bold{{X}} = \{\bold{x}_i\}_{i=1}^{I}$
  • Generate reconstructed sequences as $\bold{x}_i \to \mathcal{E}(\bold{x}_i) \to \mathcal{G}(\mathcal{E}(\bold{x}_i)) = \hat{\bold{x}}_i$
  • For each time point $t$, we get a collection $\tilde{\bold{x}}^t \triangleq \{\hat x_i^\tau : i+\tau=t\}$
  • Take the median of each $\tilde{\bold{x}}^t$ to keep the reconstructed value $\hat{x}^t \triangleq \mathbb{M}[\tilde{\bold{x}}^t]$ (see the sketch below)
  • The reconstructed time series is $(\hat{x}^1, \dots, \hat{x}^T)$
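A minimal sketch of this median assembly over overlapping windows (step size $s_s$ and the function name are illustrative):

```python
# Assemble the reconstructed series: for each t, take the median of all reconstructed
# values that overlapping windows place at t.
import numpy as np

def median_reconstruction(x_hat_windows, s_s=1):
    """x_hat_windows: (I, s_w) array of reconstructed windows G(E(x_i))."""
    I, s_w = x_hat_windows.shape
    T = (I - 1) * s_s + s_w
    collections = [[] for _ in range(T)]
    for i in range(I):
        for tau in range(s_w):
            collections[i * s_s + tau].append(x_hat_windows[i, tau])
    return np.array([np.median(c) for c in collections])   # (x_hat^1, ..., x_hat^T)
```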

Computing the Critic scores on the test set

  • The Critic output $\bold{x}_i \to \mathcal{C}_x(\bold{x}_i) = \hat{\bold{c}}_i$ can directly serve as an anomaly score
  • Similarly at a time step $t$ we have some $\{\hat c_i^\tau, i+\tau=t\}$
  • Kernel density estimation (KDE) is applied and the maximum value is selected
  • We get a sequence $(\hat{c}^1, \dots, \hat{c}^T)$
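A hedged sketch of the KDE smoothing for one time step, assuming the collection of critic scores is already gathered (here the KDE maximum is searched over the observed scores, an illustrative simplification):

```python
# For one time step t: fit a KDE on the critic scores of all windows covering t and
# keep the most likely value.
import numpy as np
from scipy.stats import gaussian_kde

def kde_smooth(scores_at_t):
    scores_at_t = np.asarray(scores_at_t, dtype=float)
    if len(np.unique(scores_at_t)) < 2:          # KDE needs some spread
        return float(scores_at_t[0])
    kde = gaussian_kde(scores_at_t)
    return float(scores_at_t[np.argmax(kde(scores_at_t))])

print(kde_smooth([0.10, 0.12, 0.11, 0.90]))      # the densest region (~0.11) wins
```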


III - Anomaly detection process for TAD-GAN

Anomaly scores using reconstruction errors

Point-wise difference

  • Most intuitive score:
  • $|x^t - \hat{x}^t|$

Area difference

  • Measures the similarity between local regions.
  • defined as
  • $\frac{1}{2l}\!\left|\int_{t-l}^{t+l}(x^{\tau}\!\!-\!\!\hat{x}^{\tau}) \mathrm{d}\tau\right|$

Dynamic Time Warping (DTW)

  • Handle time shift issues
  • $\bold{X}=\{x_{t+k}\}_{k=-l+1}^{l} \\$ $\hat{\bold{X}}=\{\hat x_{t+k}\}_{k=-l+1}^{l}$ (so $L = 2l$ points)
  • $\bold{W} \in \mathbb{R}^{L \times L}$: a matrix containing the distances between each pair of points
  • The DTW distance is defined as follows:
  • $$ s_t=\min_{\bold{W}}\left(\frac{1}{L^2}\sqrt{\sum_{l,l^\prime =1}^{L} [\bold{W}]_{l,l^\prime}}\right) $$ where the minimum is taken over admissible warping paths $\bold{W}$
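A hedged sketch of the three reconstruction-error flavours on a local window of length $L = 2l$ around $t$; the DTW below is a standard dynamic-programming implementation used in place of the exact formula above, and the boundary handling is an illustrative choice.

```python
# Point-wise, area, and DTW reconstruction errors between x and its reconstruction x_hat.
import numpy as np

def pointwise(x, x_hat):
    return np.abs(x - x_hat)

def area_diff(x, x_hat, l=5):
    # |integral over [t-l, t+l] of (x - x_hat)| / (2l), approximated by a sum
    d = x - x_hat
    return np.array([np.abs(d[max(0, t - l):t + l].sum()) / (2 * l) for t in range(len(x))])

def dtw(a, b):
    L = len(a)
    D = np.full((L + 1, L + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, L + 1):
        for j in range(1, L + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[L, L] / L       # length-normalized DTW distance
```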

Scores combination

  • $RE(x)$ the reconstruction error and $\mathcal{C}_x(x)$ the Critic output
  • A high $RE(x)$ together with a low $\mathcal{C}_x(x)$ indicates a high anomaly score
  • Mean and std. are computed and their $z$-scores $Z_{RE}(x), Z_{\mathcal{C}_x}(x)$ are used

Two proposed combinations:

  • $\bold{a}(x) = \alpha Z_{RE}(x) + (1-\alpha)Z_{\mathcal{C}_x}(x) \quad \quad \texttt{(convex combination)}$
  • $\bold{a}(x) = \alpha Z_{RE}(x) \odot Z_{\mathcal{C}_x}(x) \qquad \qquad ~~~ \texttt{(product combination)}$
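A minimal sketch of both combinations, assuming the per-time-step reconstruction errors and critic scores are already computed (whether $Z_{\mathcal{C}_x}$ needs to be negated depends on the critic's sign convention, which is left as an assumption here):

```python
# Convex and product combinations of the z-scored reconstruction error and critic score.
import numpy as np

def zscore(v):
    return (v - v.mean()) / v.std()

def combine(re, critic, alpha=0.5, mode="convex"):
    z_re, z_c = zscore(re), zscore(critic)   # negate z_c if low critic = more anomalous
    if mode == "convex":
        return alpha * z_re + (1 - alpha) * z_c
    return alpha * z_re * z_c                # "product" combination
```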

Locally adaptive thresholding

  • Sliding windows are used with $s_w = T/3$ and $s_s = T/30$
  • for each sliding window, a static threshold is defined as $4$ std. from the mean
  • anomaly score larger than the threshold $\implies$ anomaly
  • consecutive anomalous points yield the anomalous sequences $\{\bold{a}_{\text{seq}}^k\}_{k=1}^{K}$ (see the sketch below)
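A minimal sketch of this locally adaptive thresholding, assuming the anomaly score series is given (window bookkeeping at the series boundaries is simplified):

```python
# Flag points more than 4 std above the mean within each sliding window, then group
# consecutive flagged points into anomalous sequences.
import numpy as np

def adaptive_threshold(scores, n_windows=30):
    T = len(scores)
    s_w, s_s = T // 3, T // n_windows            # s_w = T/3, s_s = T/30
    flagged = np.zeros(T, dtype=bool)
    for start in range(0, T - s_w + 1, s_s):
        w = scores[start:start + s_w]
        flagged[start:start + s_w] |= w > w.mean() + 4 * w.std()
    idx = np.where(flagged)[0]
    # split consecutive indices into separate anomalous sequences
    return np.split(idx, np.where(np.diff(idx) > 1)[0] + 1) if len(idx) else []
```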

Anomaly pruning $[\texttt{Hun. 18]}$

  • A technique to mitigate false positives
  • Consider the maximum values $\{\bold{a}_{\max}^k\}_{k=1}^{K}$ in descending order
  • Compute the decrease percentage $p^{k} = (a^{k-1}_{\max} - a^{k}_{\max})/a^{k-1}_{\max}$
  • At the first $i$ such that $p^{i} < \text{threshold} = 0.1$, sequence $i$ and all following ones are reclassified as normal
  • The sequences $\{\bold{a}_{\text{seq}}^{j}, 1 \leq j < i\}$ remain as anomalies
  • Hundman K. et al. (2018). Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding, SIGKDD.
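A hedged sketch of this pruning rule, assuming the anomalous sequences from the previous step are given as lists of time indexes (function and variable names are illustrative):

```python
# Keep anomalous sequences in descending order of their maximum score until the
# decrease percentage between consecutive maxima falls below the threshold.
def prune(sequences, scores, threshold=0.1):
    """sequences: list of index arrays; scores: the anomaly score series (numpy array)."""
    maxima = sorted(((scores[s].max(), s) for s in sequences),
                    reverse=True, key=lambda pair: pair[0])
    kept = []
    for k in range(len(maxima)):
        if k > 0:
            prev, cur = maxima[k - 1][0], maxima[k][0]
            if (prev - cur) / prev < threshold:   # decrease percent p^k below threshold
                break                             # this and later sequences are normal
        kept.append(maxima[k][1])
    return kept
```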

IV - Experiments

TAD-GAN Architecture and inputs

  • Inputs are time series of length $100$ and the latent space dimension is $20$
  • 1 Layer of Bi-LSTM with $100$ hidden units for $\mathcal{E}$
  • 2 layers of Bi-LSTM with $64$ hidden units for $\mathcal{G}$
  • 1-D convolutional layer for both Critics
  • batch size of $64$
  • $2000$ iterations in total
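A hedged PyTorch sketch of this architecture: the layer sizes follow the list above, but the exact wiring (e.g. how the Bi-LSTM outputs are projected, or the critic's convolution size) is an assumption.

```python
# Stand-in modules with the described sizes: Bi-LSTM encoder/generator, 1-D conv critic.
import torch
import torch.nn as nn

class Encoder(nn.Module):                       # E: window of length 100 -> latent of dim 20
    def __init__(self, s_w=100, k=20):
        super().__init__()
        self.lstm = nn.LSTM(1, 100, num_layers=1, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 100, k)
    def forward(self, x):                       # x: (batch, s_w)
        h, _ = self.lstm(x.unsqueeze(-1))
        return self.proj(h[:, -1])              # (batch, k)

class Generator(nn.Module):                     # G: latent of dim 20 -> window of length 100
    def __init__(self, s_w=100, k=20):
        super().__init__()
        self.s_w = s_w
        self.lstm = nn.LSTM(k, 64, num_layers=2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 64, 1)
    def forward(self, z):                       # z: (batch, k)
        h, _ = self.lstm(z.unsqueeze(1).repeat(1, self.s_w, 1))
        return self.proj(h).squeeze(-1)         # (batch, s_w)

# 1-D convolutional critic on the data space; input shape (batch, 1, s_w)
critic_x = nn.Sequential(nn.Conv1d(1, 16, 5), nn.LeakyReLU(), nn.Flatten(), nn.LazyLinear(1))
```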

Baselines

  • ARIMA: learns autocorrelations in the time series to predict future values. The point-wise prediction error is used as the anomaly score.
  • HTM: encodes the current input to a hidden state and predicts the next hidden state. Prediction errors are computed as the differences between the predicted and the true state.
  • LSTM: prediction-based, with point-wise prediction errors used as anomaly scores.
  • AutoEncoder: an LSTM autoencoder, with the point-wise reconstruction error used as the anomaly score.
  • MAD-GAN: already introduced; a multivariate time series reconstruction method based on a GAN.
  • Microsoft Azure Anomaly Detector: uses a spectral residual CNN (SR-CNN).
  • Amazon DeepAR: uses an autoregressive recurrent network.

Considered datasets

Data Preparation

  • the data is normalized
  • $s_w=100$ and $s_s=1$

Data presentation

  • Spacecraft telemetry: signals provided by NASA
  • Yahoo S5: $A_1$ is based on real production traffic; $A_2, \dots, A_4$ are synthetic datasets.
  • Numenta Anomaly Benchmark (NAB): multiple time series from various domains

Evaluation metric

F1-score = $\frac{\text{TP}}{\text{TP}+0.5(\text{FN} + \text{FP})}$

Comparison with best settings for TAD-GAN

  • TAD-GAN overall has a lower standard deviation and a better mean F1-score
  • TAD-GAN largely outperforms MAD-GAN
  • TAD-GAN is slightly outperformed by ARIMA on Yahoo synthetic data

Ablation results

  • Using only the Critic gives poor results
  • DTW slightly outperforms the other two reconstruction error types

Baseline models in comparison to ARIMA

V - Conclusion

Takehome message

  • Design of TAD-GAN: a new reconstruction-based anomaly detection method
  • Outperforms MAD-GAN and is more stable
  • However, a specific score-combination setting is needed to perform best
  • TAD-GAN is outperformed on synthetic datasets that have seasonality

Improvements

  • Some combination of BeatGAN and TAD-GAN to deal with periodicity?
  • Learn an "anomaly score"
  • Transfer learning techniques for domain adaptation?
Thank you !