DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)
https://arxiv.org/pdf/2306.10347.pdf
Contents
- Abstract
- Introduction
- Methodology
  - Overall Architecture
  - Dual Attention Contrastive Structure
  - Representation Discrepancy
  - Anomaly Criterion

0. Abstract
Challenge of TS anomaly detection
- learn a representation map that enables effective discrimination of anomalies

Categories of methods
- Reconstruction-based methods
- Contrastive learning

DCdetector
- a multi-scale dual attention contrastive representation learning model
- utilizes a novel dual attention asymmetric design to create a permutated environment
- learns a permutation-invariant representation with superior discrimination ability
 
1. Introduction
Challenges in TS-AD
- (1) Determining what the anomalies will look like
- (2) Anomalies are rare
  - hard to get labels
  - most supervised or semi-supervised methods fail to work given limited labeled training data
- (3) Temporal, multidimensional, and non-stationary features of TS should be considered

TS anomaly detection methods
( ex. statistical, classic machine learning, and deep learning based methods )
- Supervised and semi-supervised methods
  - cannot handle the challenge of limited labeled data
- Unsupervised methods
  - no strict requirements on labeled data
  - ex) one-class classification-based, probabilistic-based, distance-based, forecasting-based, reconstruction-based approaches
 
 
Examples
- Reconstruction-based methods
  - pros) developing rapidly due to their power in handling complex data (by combining with different machine learning models) and their interpretability: instances that are reconstructed poorly are the ones behaving unusually
  - cons) challenging to learn a well-reconstructed model for normal data without being obstructed by anomalies
- Contrastive learning
  - outstanding performance in downstream tasks in computer vision
  - effectiveness of contrastive representation learning still needs to be explored in TS-AD
 
 
DCdetector
( Dual attention Contrastive representation learning anomaly detector )
- handles the challenges in TS-AD
- key idea : normal TS points share a latent pattern
  ( = normal points have strong correlations with other points <-> anomalies do not )
- learning consistent representations :
  - hard for anomalies
  - easy for normal points
- motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies without a highly qualified reconstruction model
 
Details
- contrastive structure with two branches & dual attention
  - the two branches share weights
- representation difference between normal and abnormal data is enlarged
- patching-based attention networks: to capture the temporal dependency
- multi-scale design: to reduce information loss during patching
- channel-independence design for MTS
- does not require prior knowledge about anomalies
 
2. Methodology
MTS of length \(T\) : \(\mathcal{X}=\left(x_1, x_2, \ldots, x_T\right)\)
- where \(x_t \in \mathbb{R}^d\)

Task:
- given input TS \(\mathcal{X}\),
- for another unknown test sequence \(\mathcal{X}_{\text{test}}\) of length \(T^{\prime}\),
- we want to predict \(\mathcal{Y}_{\text{test}}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)\)
  - \(y_t \in\{0,1\}\) : 1 = anomaly & 0 = normal
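A toy sketch of the shapes involved ( values are arbitrary, only for illustration ):

```python
import numpy as np

# Toy shapes for the task definition above (illustrative values, not from the paper)
T, T_prime, d = 1000, 200, 7
X = np.random.randn(T, d)             # training MTS, each x_t in R^d
X_test = np.random.randn(T_prime, d)  # unknown test sequence of length T'
y_test = np.zeros(T_prime, dtype=int) # to be predicted: 1 = anomaly, 0 = normal
```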
 
 
Inductive bias ( as Anomaly Transformer explored )
- anomalies have less connection with the whole TS than their adjacent points
- Anomaly Transformer: detects anomalies by the association discrepancy between ..
  - (1) a learned Gaussian kernel
  - (2) the attention weight distribution
- DCdetector: detects anomalies via a dual-attention, self-supervised, contrastive-type structure
 
 
Comparison

- Reconstruction-based approach
- Anomaly Transformer
  - observation: it is difficult to build nontrivial associations from abnormal points to the whole series
  - discrepancies
    - prior discrepancy : learned with a Gaussian kernel
    - association discrepancy : learned with a transformer module
  - MinMax association learning & reconstruction loss
- DCdetector
  - concise ( does not need a specially designed Gaussian kernel, a MinMax learning strategy, or a reconstruction loss )
  - mainly leverages the designed contrastive-learning-based dual-branch attention for discrepancy learning of anomalies in different views
 
 
(1) Overall Architecture

4 main components
- Forward Process module
- Dual Attention Contrastive Structure module
- Representation Discrepancy module
- Anomaly Criterion module
 

a) Forward Process module
( channel-independent )
- a-1) instance normalization
- a-2) patching
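A minimal sketch of these two steps, assuming per-window instance normalization, non-overlapping patches, and channel independence ( the function name and shapes are illustrative, not the authors' code ):

```python
import torch

def forward_process(x, patch_size):
    """Minimal sketch of the forward process.
    x: (batch, T, d) multivariate time-series window."""
    # a-1) instance normalization: zero-mean / unit-variance per channel within the window
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + 1e-5
    x = (x - mean) / std
    # channel independence: fold the d channels into the batch dimension
    b, t, d = x.shape
    x = x.permute(0, 2, 1).reshape(b * d, t)                     # (b*d, T)
    # a-2) patching: N non-overlapping patches of length P per univariate series
    n = t // patch_size
    return x[:, : n * patch_size].reshape(b * d, n, patch_size)  # (b*d, N, P)
```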
 
b) Dual Attention Contrastive Structure module
- each channel shares the same self-attention network
- representation results are concatenated as the final output \(\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)\)
- learns the representation of inputs in different views
 
 
c) Representation Discrepancy module
Key Insight
- normal points: share the same latent pattern even in different views (a strong correlation is not easily destroyed)
- anomalies: rare & do not have explicit patterns

\(\rightarrow\) the difference between representations in different views will be slight for normal points and large for anomalies.

d) Anomaly Criterion module
- calculate anomaly scores based on the discrepancy between the two representations
- use a prior threshold for AD
 
(2) Dual Attention Contrastive Structure
Represents the TS from two different views:
- (1) patch-wise representations
- (2) in-patch representations

Does not construct pairs like typical contrastive methods
- similar to contrastive methods that use only positive samples
 
a) Dual Attention
Input time series \(\mathcal{X} \in \mathbb{R}^{T \times d}\) are patched as \(\mathcal{X} \in \mathbb{R}^{P \times N \times d}\)
- \(P\) : patch size
- \(N\) : number of patches
 
Fuse the channel information with the batch dimension ( \(\because\) channel independence )
\(\rightarrow\) becomes \(\mathcal{X} \in \mathbb{R}^{P \times N}\).
[ Patch-wise representation ]
- a single patch is considered as a unit
  - the embedding operation is applied along the patch_size \((P)\) dimension
- captures dependencies among patches ( = patch-wise attention )
- embedding shape : \(X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text{model}}}\)
- apply multi-head attention to \(X_{\mathcal{N}}\)

[ In-patch representation ]
- captures dependencies of points within the same patch ( = in-patch attention )
  - the embedding operation is applied along the number-of-patches \((N)\) dimension

Note that the \(W_{Q_i}, W_{\mathcal{K}_i}\) weights are shared between the in-patch & patch-wise attention
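A minimal sketch of the two attention branches for a single patch size, in PyTorch ( class and variable names are illustrative; the Q/K weight sharing noted above is omitted for brevity ):

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the dual-attention branches for one patch size.
    Input: patches of shape (B, N, P) with channels already folded into the batch."""
    def __init__(self, patch_size, num_patches, d_model=64, n_heads=4):
        super().__init__()
        # patch-wise branch: embed along the patch_size (P) dimension, attend over the N patches
        self.patchwise_embed = nn.Linear(patch_size, d_model)
        self.patchwise_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # in-patch branch: embed along the number-of-patches (N) dimension,
        # attend over the P points inside a patch
        self.inpatch_embed = nn.Linear(num_patches, d_model)
        self.inpatch_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, N, P)
        x_n = self.patchwise_embed(x)                       # (B, N, d_model)
        attn_n, _ = self.patchwise_attn(x_n, x_n, x_n)      # dependencies among patches
        x_p = self.inpatch_embed(x.transpose(1, 2))         # (B, P, d_model)
        attn_p, _ = self.inpatch_attn(x_p, x_p, x_p)        # dependencies within a patch
        return attn_n, attn_p
```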
b) Up-sampling and Multi-scale Design
Patch-wise attention
- ignores the relevance among points in a patch
 
In-patch attention
- ignores the relevance among patches.
 
To compare these two representations at the original resolution, up-sampling is needed ( see the sketch after the multi-scale equations below ).

Multi-scale design:
= the final representation concatenates results at different scales (i.e., patch sizes)
- final patch-wise representation \(\mathcal{N}\) :
  - \(\mathcal{N}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)\)
- final in-patch representation \(\mathcal{P}\) :
  - \(\mathcal{P}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)\)
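A minimal sketch of the up-sampling and multi-scale summation ( the repeat/tile scheme is an assumption about how both views are aligned back to \(T\) points; function names are illustrative ):

```python
import torch

def upsample_patchwise(attn_n, patch_size):
    """attn_n: (B, N, d_model) patch-wise result.
    Repeat each patch representation P times along the time axis -> (B, N*P, d_model)."""
    return attn_n.repeat_interleave(patch_size, dim=1)

def upsample_inpatch(attn_p, num_patches):
    """attn_p: (B, P, d_model) in-patch result.
    Tile the within-patch pattern across the N patches -> (B, N*P, d_model)."""
    return attn_p.repeat(1, num_patches, 1)

def multi_scale_sum(upsampled_list):
    """Multi-scale design: sum the up-sampled representations over the patch-size list."""
    return torch.stack(upsampled_list, dim=0).sum(dim=0)
```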
 
 
c) Contrastive Structure
Patch-wise sample representation
- learns a weighted combination between sample points in the same position from each patch
 
In-patch sample representation
- learns a weighted combination between points within the same patch.
 
\(\rightarrow\) Treat these two representations as “permutated multi-view representations”
(3) Representation Discrepancy
Kullback-Leibler divergence (KL divergence)
- to measure the similarity between the two representations
 
Loss function definition
( no reconstruction part is used )
\(\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))\).
- Stop-gradient : to train 2 branches asynchronously
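A minimal sketch of this loss, assuming the representations are softmax-normalized before the KL computation ( function names are illustrative ):

```python
import torch
import torch.nn.functional as F

def kl_div_rowwise(p, q, eps=1e-8):
    """KL(p || q) per time point, summed over the last dimension.
    p, q are assumed to already be distributions along dim=-1."""
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1)

def dcdetector_loss(p_repr, n_repr):
    """Discrepancy loss with stop-gradient and no reconstruction term:
    L = 1/2 * D(P, stopgrad(N)) + 1/2 * D(N, stopgrad(P))."""
    p = F.softmax(p_repr, dim=-1)
    n = F.softmax(n_repr, dim=-1)
    return 0.5 * kl_div_rowwise(p, n.detach()).mean() + \
           0.5 * kl_div_rowwise(n, p.detach()).mean()
```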
 
(4) Anomaly Criterion
Final anomaly score of \(\mathcal{X} \in \mathbb{R}^{T \times d}\) :
- \(\operatorname{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})\)
 
\(y_i= \begin{cases}1: \text { anomaly } & \text { AnomalyScore }\left(X_i\right) \geq \delta \\ 0: \text { normal } & \text { AnomalyScore }\left(X_i\right)<\delta\end{cases}\).
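A minimal sketch of the scoring and thresholding step, under the same softmax-normalization assumption as the loss sketch above:

```python
import torch
import torch.nn.functional as F

def anomaly_score(p_repr, n_repr, eps=1e-8):
    """Point-wise anomaly score: symmetric KL discrepancy between the two views.
    p_repr, n_repr: (B, T, d) final in-patch / patch-wise representations.
    No stop-gradient is applied at inference."""
    p = F.softmax(p_repr, dim=-1)
    n = F.softmax(n_repr, dim=-1)
    kl_pn = (p * (torch.log(p + eps) - torch.log(n + eps))).sum(dim=-1)
    kl_np = (n * (torch.log(n + eps) - torch.log(p + eps))).sum(dim=-1)
    return 0.5 * kl_pn + 0.5 * kl_np

def label_points(scores, delta):
    """Apply the prior threshold delta: y_i = 1 (anomaly) if score >= delta, else 0."""
    return (scores >= delta).long()
```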