DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)
- Abstract
- Introduction
- Methodology
- Overall Architecture
- Dual Attention Contrastive Structure
- Representation Discrepancy
- Anomaly Criterion
0. Abstract
Challenge of TS anomaly detection
- learn a representation map that enables effective discrimination of anomalies.
Categories of methods
- Reconstruction-based methods
- Contrastive learning
DCdetector
- a multi-scale dual attention contrastive representation learning model
- utilizes a novel dual attention asymmetric design to create the permutated environment
- learn a permutation invariant representation with superior discrimination abilities
1. Introduction
Challenges in TS-AD
- (1) Determining what the anomalies will be like.
- (2) Anomalies are rare
- hard to get labels
- most supervised or semi-supervised methods fail to work given limited labeled training data.
- (3) Should consider temporal, multidimensional, and non-stationary features for TS
TS anomaly detection methods
( ex. statistical, classic machine learning, and deep learning based methods )
- Supervised and Semi-supervised methods
- cannot handle the challenge of limited labeled data
- Unsupervised methods
- without strict requirements on labeled data
- ex) one class classification-based, probabilistic based, distance-based, forecasting-based, reconstruction-based approaches
- Reconstruction-based methods
- pros) developing rapidly: they handle complex data well when combined with different machine learning models, and they are interpretable, since a poorly reconstructed instance is one that behaves abnormally.
- cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies.
- Contrastive Learning
- outstanding performance in downstream computer vision tasks
- the effectiveness of contrastive representation learning still needs to be explored in TS-AD
DCdetector ( Dual attention Contrastive representation learning anomaly detector )
handle the challenges in TS AD
key idea : normal TS points share the latent pattern
( = normal points have strong correlations with other points <-> anomalies do not )
- Learning consistent representations :
- hard for anomalies
- easy for normal points
- Motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies without a highly qualified reconstruction model
- contrastive structure with two branches & dual attention
- two branches share weights
- representation difference between normal and abnormal data is enlarged
- patching-based attention networks: to capture the temporal dependency
- multi-scale design: to reduce information loss during patching
- channel independence design for MTS
- does not require prior knowledge about anomalies
2. Methodology
MTS of length \(T\) : \(\mathcal{X}=\left(x_1, x_2, \ldots, x_T\right)\)
- where \(x_t \in \mathbb{R}^d\)
- given input TS \(\mathcal{X}\),
for another unknown test sequence \(\mathcal{X}_{\text {test }}\) of length \(T^{\prime}\)
- we want to predict \(\mathcal{Y}_{\text {test }}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)\).
- \(y_t \in\{0,1\}\) : 1 = anomaly & 0 = normal
Inductive bias ( as Anomaly Transformer explored )
- anomalies are less connected to the whole TS than to their adjacent points
- Anomaly Transformer: detects anomalies by association discrepancy between ..
- (1) a learned Gaussian kernel
- (2) attention weight distribution.
- DCdetector
- via a dual-attention self-supervised contrastive-type structure.
- Reconstruction-based approach
- Anomaly Transformer
- observation that it is difficult to build nontrivial associations from abnormal points to the whole series.
- discrepancies
- prior discrepancy : learned with Gaussian Kernel
- association discrepancy : learned with a transformer module
- MinMax association learning & Reconstruction loss
- DCdetector
- concise ( does not need a specially designed Gaussian Kernel, a MinMax learning strategy, or a reconstruction loss )
- mainly leverages the designed CL-based dual-branch attention for discrepancy learning of anomalies in different views
(1) Overall Architecture
4 main components
- Forward Process module
- Dual Attention Contrastive Structure module
- Representation Discrepancy module
- Anomaly Criterion module.
a) Forward Process module
( channel-independent )
- a-1) instance normalization
- a-2) patching
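The two forward-process steps can be sketched in NumPy as below. This is a minimal illustration, not the authors' code: the helper names `instance_normalize` and `patchify`, the `eps` constant, and the `(d, N, P)` axis order are assumptions made for the example.

```python
import numpy as np

def instance_normalize(x, eps=1e-5):
    """Normalize each channel of one window to zero mean and unit variance.

    x has shape (T, d). eps is an assumed small constant for stability.
    """
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

def patchify(x, patch_size):
    """Split a (T, d) window into non-overlapping patches, one channel at a time.

    Returns shape (d, N, P) with N = T // patch_size (axis order is an assumption).
    """
    T, d = x.shape
    if T % patch_size != 0:
        raise ValueError("T must be divisible by patch_size")
    n_patches = T // patch_size
    # (T, d) -> (d, T) -> (d, N, P): each channel is processed independently
    return x.T.reshape(d, n_patches, patch_size)

rng = np.random.default_rng(0)
window = rng.normal(loc=3.0, scale=2.0, size=(12, 2))  # T=12, d=2
normalized = instance_normalize(window)
patches = patchify(normalized, patch_size=4)
print(patches.shape)  # (2, 3, 4)
```

Putting the channel axis first mirrors the channel-independent design: the leading axis can later be merged with the batch dimension.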
b) Dual Attention Contrastive Structure module
- each channel shares the same self-attention network
- representation results are concatenated as the final output \(\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)\).
- Dual Attention Contrastive Structure module
- learns the representation of inputs in different views.
c) Representation Discrepancy module
Key Insight
- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed).
- anomalies: rare & do not have explicit patterns
\(\rightarrow\) difference will be slight for normal points representations in different views and large for anomalies.
d) Anomaly Criterion module.
calculate anomaly scores based on the discrepancy between the two representations
use a prior threshold for AD
(2) Dual Attention Contrastive Structure
TS from different views: takes ..
- (1) patch-wise representations
- (2) in-patch representations
Does not construct pairs like the typical contrastive methods
- similar to the contrastive methods only using positive samples
a) Dual Attention
Input time series \(\mathcal{X} \in \mathbb{R}^{T \times d}\) is patched as \(\mathcal{X} \in \mathbb{R}^{P \times N \times d}\)
- \(P\) : patch size
- \(N\) : number of patches
Fuse the channel information with the batch dimension ( \(\because\) channel independence )
\(\rightarrow\) becomes \(\mathcal{X} \in \mathbb{R}^{P \times N}\).
[ Patch-wise representation ]
- single patch is considered as a unit
- the embedding operation is applied along the patch-size \((P)\) dimension
- capture dependencies among patches ( = patch-wise attention )
- embedding shape : \(X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text {model }}}\).
- apply multi-head attention to \(X_{\mathcal{N}}\)
[ In-patch representation ]
- dependencies of points in the same patch
- the embedding operation is applied along the number-of-patches \((N)\) dimension
Note that \(W_{\mathcal{Q}_i}, W_{\mathcal{K}_i}\) are weights shared between the in-patch & patch-wise attention
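The two views can be illustrated for a single channel with the NumPy sketch below. It is a simplified, single-head version under stated assumptions: the embeddings are plain linear maps, all weights are random rather than learned, and the names `embed_patch_wise`, `embed_in_patch`, and `attention_weights` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(emb, w_q, w_k):
    """Scaled dot-product attention weights for one head."""
    q, k = emb @ w_q, emb @ w_k
    return softmax(q @ k.T / np.sqrt(w_q.shape[1]))

rng = np.random.default_rng(0)
N, P, d_model = 3, 4, 8
x = rng.normal(size=(N, P))  # one channel: N patches of size P

# Assumed linear embeddings (random here, learned in practice)
embed_patch_wise = rng.normal(size=(P, d_model))  # applied along the P dimension
embed_in_patch = rng.normal(size=(N, d_model))    # applied along the N dimension

# Query/key projections shared by both branches
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

x_n = x @ embed_patch_wise   # (N, d_model): one token per patch
x_p = x.T @ embed_in_patch   # (P, d_model): one token per in-patch position

attn_n = attention_weights(x_n, w_q, w_k)  # (N, N): dependencies among patches
attn_p = attention_weights(x_p, w_q, w_k)  # (P, P): dependencies within a patch
print(attn_n.shape, attn_p.shape)  # (3, 3) (4, 4)
```

Because both branches embed into the same `d_model`, one pair of projection matrices can serve both, even though the two token sequences have different lengths.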
b) Up-sampling and Multi-scale Design
Patch-wise attention
- ignores the relevance among points in a patch
In-patch attention
- ignores the relevance among patches.
To compare these two representations …. need upsampling!
Multi-scale design:
= final representation concatenates results in different scales (i.e., patch sizes)
- Final patch-wise representation: \(\mathcal{N}\)
- \(\mathcal{N}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)\),
- Final in-patch representation: \(\mathcal{P}\)
- \(\mathcal{P}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)\).
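A rough sketch of how the two attention maps could be brought to a common \(T \times T\) size and summed over patch sizes is shown below. The specific upsampling rules (repeating entries for the patch-wise map, tiling for the in-patch map) are an assumption for illustration, and random matrices stand in for the real attention outputs.

```python
import numpy as np

def upsample_patch_wise(attn_n, patch_size):
    """(N, N) -> (T, T): every point inherits the value of the patch it belongs to."""
    return np.repeat(np.repeat(attn_n, patch_size, axis=0), patch_size, axis=1)

def upsample_in_patch(attn_p, n_patches):
    """(P, P) -> (T, T): the in-patch pattern is copied once per patch."""
    return np.tile(attn_p, (n_patches, n_patches))

T = 8
patch_list = [2, 4]  # hypothetical patch sizes; each must divide T
rng = np.random.default_rng(0)

rep_n = np.zeros((T, T))  # final patch-wise representation
rep_p = np.zeros((T, T))  # final in-patch representation
for P in patch_list:
    N = T // P
    attn_n = rng.random((N, N))  # stand-in for the patch-wise attention output
    attn_p = rng.random((P, P))  # stand-in for the in-patch attention output
    rep_n += upsample_patch_wise(attn_n, P)
    rep_p += upsample_in_patch(attn_p, N)

print(rep_n.shape, rep_p.shape)  # (8, 8) (8, 8)
```

After upsampling, both representations index the same \(T\) time points, so they can be compared point by point regardless of the patch size that produced them.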
c) Contrastive Structure
Patch-wise sample representation
- learns a weighted combination between sample points in the same position from each patch
In-patch sample representation
- learns a weighted combination between points within the same patch.
\(\rightarrow\) Treat these two representations as “permutated multi-view representations”
(3) Representation Discrepancy
Kullback-Leibler divergence (KL divergence)
- to measure the similarity of these two representations
Loss function definition
( no reconstruction part is used )
\(\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))\).
- Stop-gradient : to train 2 branches asynchronously
(4) Anomaly Criterion
Final anomaly score of \(\mathcal{X} \in \mathbb{R}^{T \times d}\) :
- \(\operatorname{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})\)
\(y_i= \begin{cases}1: \text{anomaly} & \operatorname{AnomalyScore}\left(X_i\right) \geq \delta \\ 0: \text{normal} & \operatorname{AnomalyScore}\left(X_i\right)<\delta\end{cases}\)
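The scoring and thresholding step can be sketched as follows. This toy example assumes each row of \(\mathcal{P}\) and \(\mathcal{N}\) is a probability distribution for one time point and that \(\mathcal{D}\) is a row-wise KL divergence; the helper names, the `eps` constant, and the threshold value `delta = 0.5` are hypothetical.

```python
import numpy as np

def kl_rows(p, q, eps=1e-8):
    """Row-wise KL(p || q); eps is an assumed constant that guards against log(0)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def anomaly_score(rep_p, rep_n):
    """Symmetric KL between the two views, one score per time point."""
    return 0.5 * kl_rows(rep_p, rep_n) + 0.5 * kl_rows(rep_n, rep_p)

T = 6
rep_n = np.full((T, T), 1.0 / T)  # patch-wise view: uniform rows
rep_p = rep_n.copy()              # in-patch view agrees with it everywhere ...
peaked = np.eye(T)[3] + 0.01
rep_p[3] = peaked / peaked.sum()  # ... except at point 3, where the views disagree

scores = anomaly_score(rep_p, rep_n)
delta = 0.5                       # hypothetical threshold
labels = (scores >= delta).astype(int)
print(labels.tolist())  # [0, 0, 0, 1, 0, 0]
```

Points whose two views agree score zero, while the point whose views diverge exceeds the threshold. The score has the same form as the training loss with the stop-gradient removed, which only affects backpropagation and not the forward value.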