DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)
https://arxiv.org/pdf/2306.10347.pdf
Contents
- Abstract
- Introduction
- Methodology
  - Overall Architecture
  - Dual Attention Contrastive Structure
  - Representation Discrepancy
  - Anomaly Criterion

0. Abstract
Challenge of TS anomaly detection
- learn a representation map that enables effective discrimination of anomalies

Categories of methods
- Reconstruction-based methods
- Contrastive learning

DCdetector
- a multi-scale dual attention contrastive representation learning model
- utilizes a novel dual attention asymmetric design to create a permutated environment
- learns a permutation-invariant representation with superior discrimination ability
 
1. Introduction
Challenges in TS-AD
- (1) Determining what the anomalies will look like
- (2) Anomalies are rare
  - hard to get labels
  - most supervised or semi-supervised methods fail to work given limited labeled training data
- (3) Temporal, multidimensional, and non-stationary features of TS should be considered

TS anomaly detection methods
( ex. statistical, classic machine learning, and deep learning based methods )
- Supervised and semi-supervised methods
  - cannot handle the challenge of limited labeled data
- Unsupervised methods
  - no strict requirements on labeled data
  - ex) one-class classification-based, probabilistic-based, distance-based, forecasting-based, reconstruction-based approaches
 
 
Examples
- Reconstruction-based methods
  - pros) developing rapidly due to their power in handling complex data (by combining with different machine learning models) and their interpretability: instances that are reconstructed poorly are the ones behaving unusually
  - cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies
- Contrastive learning
  - outstanding performance in downstream tasks in computer vision
  - effectiveness of contrastive representation learning still needs to be explored in TS-AD
 
 
DCdetector
( Dual attention Contrastive representation learning anomaly detector )
- handles the challenges in TS-AD
- key idea : normal TS points share a latent pattern
  ( = normal points have strong correlations with other points <-> anomalies do not )
- learning consistent representations :
  - hard for anomalies
  - easy for normal points
- motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies without a highly qualified reconstruction model
 
Details
- contrastive structure with two branches & dual attention
  - the two branches share weights
- representation difference between normal and abnormal data is enlarged
- patching-based attention networks: to capture the temporal dependency
- multi-scale design: to reduce information loss during patching
- channel-independence design for MTS
- does not require prior knowledge about anomalies
 
2. Methodology
MTS of length \(T\) : \(\mathcal{X}=\left(x_1, x_2, \ldots, x_T\right)\)
- where \(x_t \in \mathbb{R}^d\)

Task:
- given input TS \(\mathcal{X}\),
- for another unknown test sequence \(\mathcal{X}_{\text{test}}\) of length \(T^{\prime}\),
- we want to predict \(\mathcal{Y}_{\text{test}}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)\)
  - \(y_t \in\{0,1\}\) : 1 = anomaly & 0 = normal
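A toy sketch of the shapes involved ( values are arbitrary, only for illustration ):

```python
import numpy as np

# Toy shapes for the task definition above (illustrative values, not from the paper)
T, T_prime, d = 1000, 200, 7
X = np.random.randn(T, d)             # training MTS, each x_t in R^d
X_test = np.random.randn(T_prime, d)  # unknown test sequence of length T'
y_test = np.zeros(T_prime, dtype=int) # to be predicted: 1 = anomaly, 0 = normal
```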
 
 
Inductive bias ( as Anomaly Transformer explored )
- anomalies have less connection with the whole TS than their adjacent points
- Anomaly Transformer: detects anomalies by the association discrepancy between ..
  - (1) a learned Gaussian kernel
  - (2) the attention weight distribution
- DCdetector: detects anomalies via a dual-attention, self-supervised, contrastive-type structure
 
 
Comparison

- Reconstruction-based approach
- Anomaly Transformer
  - observation: it is difficult to build nontrivial associations from abnormal points to the whole series
  - discrepancies
    - prior discrepancy : learned with a Gaussian kernel
    - association discrepancy : learned with a transformer module
  - MinMax association learning & reconstruction loss
- DCdetector
  - concise ( does not need a specially designed Gaussian kernel, a MinMax learning strategy, or a reconstruction loss )
  - mainly leverages the designed contrastive-learning-based dual-branch attention for discrepancy learning of anomalies in different views
 
 
(1) Overall Architecture

4 main components
- Forward Process module
- Dual Attention Contrastive Structure module
- Representation Discrepancy module
- Anomaly Criterion module
 

a) Forward Process module
( channel-independent )
- a-1) instance normalization
- a-2) patching
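A minimal sketch of these two steps, assuming per-window instance normalization, non-overlapping patches, and channel independence ( the function name and shapes are illustrative, not the authors' code ):

```python
import torch

def forward_process(x, patch_size):
    """Minimal sketch of the forward process.
    x: (batch, T, d) multivariate time-series window."""
    # a-1) instance normalization: zero-mean / unit-variance per channel within the window
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + 1e-5
    x = (x - mean) / std
    # channel independence: fold the d channels into the batch dimension
    b, t, d = x.shape
    x = x.permute(0, 2, 1).reshape(b * d, t)                     # (b*d, T)
    # a-2) patching: N non-overlapping patches of length P per univariate series
    n = t // patch_size
    return x[:, : n * patch_size].reshape(b * d, n, patch_size)  # (b*d, N, P)
```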
 
b) Dual Attention Contrastive Structure module
- each channel shares the same self-attention network
- representation results are concatenated as the final output \(\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)\)
- learns the representation of inputs in different views
 
 
c) Representation Discrepancy module
Key Insight
- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed)
- anomalies: rare & do not have explicit patterns

\(\rightarrow\) the difference between representations in different views will be slight for normal points and large for anomalies.

d) Anomaly Criterion module
- calculate anomaly scores based on the discrepancy between the two representations
- use a prior threshold for AD
 
(2) Dual Attention Contrastive Structure
Represents the TS from two different views:
- (1) patch-wise representations
- (2) in-patch representations

Does not construct pairs like typical contrastive methods
- similar to contrastive methods that use only positive samples
 
a) Dual Attention
Input time series \(\mathcal{X} \in \mathbb{R}^{T \times d}\) are patched as \(\mathcal{X} \in \mathbb{R}^{P \times N \times d}\)
- \(P\) : patch size
- \(N\) : number of patches
 
Fuse the channel information with the batch dimension ( \(\because\) channel independence )
\(\rightarrow\) becomes \(\mathcal{X} \in \mathbb{R}^{P \times N}\).
[ Patch-wise representation ]
- a single patch is considered as a unit
  - the embedding operation is applied along the patch_size \((P)\) dimension
- captures dependencies among patches ( = patch-wise attention )
- embedding shape : \(X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text{model}}}\)
- apply multi-head attention to \(X_{\mathcal{N}}\)

[ In-patch representation ]
- captures dependencies of points within the same patch ( = in-patch attention )
  - the embedding operation is applied along the number-of-patches \((N)\) dimension

Note that the \(W_{Q_i}, W_{\mathcal{K}_i}\) weights are shared between the in-patch & patch-wise attention
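A minimal sketch of the two attention branches for a single patch size, in PyTorch ( class and variable names are illustrative; the Q/K weight sharing noted above is omitted for brevity ):

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the dual-attention branches for one patch size.
    Input: patches of shape (B, N, P) with channels already folded into the batch."""
    def __init__(self, patch_size, num_patches, d_model=64, n_heads=4):
        super().__init__()
        # patch-wise branch: embed along the patch_size (P) dimension, attend over the N patches
        self.patchwise_embed = nn.Linear(patch_size, d_model)
        self.patchwise_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # in-patch branch: embed along the number-of-patches (N) dimension,
        # attend over the P points inside a patch
        self.inpatch_embed = nn.Linear(num_patches, d_model)
        self.inpatch_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, N, P)
        x_n = self.patchwise_embed(x)                       # (B, N, d_model)
        attn_n, _ = self.patchwise_attn(x_n, x_n, x_n)      # dependencies among patches
        x_p = self.inpatch_embed(x.transpose(1, 2))         # (B, P, d_model)
        attn_p, _ = self.inpatch_attn(x_p, x_p, x_p)        # dependencies within a patch
        return attn_n, attn_p
```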
b) Up-sampling and Multi-scale Design
Patch-wise attention
- ignores the relevance among points in a patch
 
In-patch attention
- ignores the relevance among patches.
 
To compare these two representations at the original resolution, up-sampling is needed ( see the sketch after the multi-scale equations below ).

Multi-scale design:
= the final representation concatenates results at different scales (i.e., patch sizes)
- final patch-wise representation \(\mathcal{N}\) :
  - \(\mathcal{N}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)\)
- final in-patch representation \(\mathcal{P}\) :
  - \(\mathcal{P}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)\)
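A minimal sketch of the up-sampling and multi-scale summation ( the repeat/tile scheme is an assumption about how both views are aligned back to \(T\) points; function names are illustrative ):

```python
import torch

def upsample_patchwise(attn_n, patch_size):
    """attn_n: (B, N, d_model) patch-wise result.
    Repeat each patch representation P times along the time axis -> (B, N*P, d_model)."""
    return attn_n.repeat_interleave(patch_size, dim=1)

def upsample_inpatch(attn_p, num_patches):
    """attn_p: (B, P, d_model) in-patch result.
    Tile the within-patch pattern across the N patches -> (B, N*P, d_model)."""
    return attn_p.repeat(1, num_patches, 1)

def multi_scale_sum(upsampled_list):
    """Multi-scale design: sum the up-sampled representations over the patch-size list."""
    return torch.stack(upsampled_list, dim=0).sum(dim=0)
```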
 
 
c) Contrastive Structure
Patch-wise sample representation
- learns a weighted combination between sample points in the same position from each patch
 
In-patch sample representation
- learns a weighted combination between points within the same patch.
 
\(\rightarrow\) Treat these two representations as “permutated multi-view representations”
(3) Representation Discrepancy
Kullback-Leibler divergence (KL divergence)
- to measure the similarity between the two representations
 
Loss function definition
( no reconstruction part is used )
\(\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))\).
- Stop-gradient : to train 2 branches asynchronously
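A minimal sketch of this loss, assuming the representations are softmax-normalized before the KL computation ( function names are illustrative ):

```python
import torch
import torch.nn.functional as F

def kl_div_rowwise(p, q, eps=1e-8):
    """KL(p || q) per time point, summed over the last dimension.
    p, q are assumed to already be distributions along dim=-1."""
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1)

def dcdetector_loss(p_repr, n_repr):
    """Discrepancy loss with stop-gradient and no reconstruction term:
    L = 1/2 * D(P, stopgrad(N)) + 1/2 * D(N, stopgrad(P))."""
    p = F.softmax(p_repr, dim=-1)
    n = F.softmax(n_repr, dim=-1)
    return 0.5 * kl_div_rowwise(p, n.detach()).mean() + \
           0.5 * kl_div_rowwise(n, p.detach()).mean()
```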
 
(4) Anomaly Criterion
Final anomaly score of \(\mathcal{X} \in \mathbb{R}^{T \times d}\) :
- \(\operatorname{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})\)
 
\(y_i= \begin{cases}1: \text { anomaly } & \text { AnomalyScore }\left(X_i\right) \geq \delta \\ 0: \text { normal } & \text { AnomalyScore }\left(X_i\right)<\delta\end{cases}\).
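A minimal sketch of the scoring and thresholding step, under the same softmax-normalization assumption as the loss sketch above:

```python
import torch
import torch.nn.functional as F

def anomaly_score(p_repr, n_repr, eps=1e-8):
    """Point-wise anomaly score: symmetric KL discrepancy between the two views.
    p_repr, n_repr: (B, T, d) final in-patch / patch-wise representations.
    No stop-gradient is applied at inference."""
    p = F.softmax(p_repr, dim=-1)
    n = F.softmax(n_repr, dim=-1)
    kl_pn = (p * (torch.log(p + eps) - torch.log(n + eps))).sum(dim=-1)
    kl_np = (n * (torch.log(n + eps) - torch.log(p + eps))).sum(dim=-1)
    return 0.5 * kl_pn + 0.5 * kl_np

def label_points(scores, delta):
    """Apply the prior threshold delta: y_i = 1 (anomaly) if score >= delta, else 0."""
    return (scores >= delta).long()
```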