DCdetector: Dual Attention Contrastive Representation Learning for TS Anomaly Detection (KDD 2023)


https://arxiv.org/pdf/2306.10347.pdf

Contents

  0. Abstract
  1. Introduction
  2. Methodology
    1. Overall Architecture
    2. Dual Attention Contrastive Structure
    3. Representation Discrepancy
    4. Anomaly Criterion


0. Abstract

Challenge of TS anomaly detection

  • learn a representation map that enables effective discrimination of anomalies.


Categories of methods

  • Reconstruction-based methods
  • Contrastive learning


DCdetector

  • a multi-scale dual attention contrastive representation learning model
    • utilizes a novel dual attention asymmetric design to create the permutated environment
  • learns a permutation invariant representation with superior discrimination abilities


1. Introduction

Challenges in TS-AD

  • (1) Determining what anomalies will look like.
  • (2) Anomalies are rare
    • hard to get labels
    • most supervised or semi-supervised methods fail to work given limited labeled training data.
  • (3) Temporal, multidimensional, and non-stationary features of TS should all be considered


TS anomaly detection methods

( ex. statistical, classic machine learning, and deep learning based methods )

  • Supervised and Semi-supervised methods
    • cannot handle the challenge of limited labeled data
  • Unsupervised methods
    • without strict requirements on labeled data
    • ex) one class classification-based, probabilistic based, distance-based, forecasting-based, reconstruction-based approaches


Examples

  • Reconstruction-based methods
    • pros) developing rapidly, thanks to their power in handling complex data (by combining with different machine learning models) and their interpretability: detected anomalies are the instances that behave unusually.
    • cons) challenging to learn a reconstruction model that fits normal data well without being obstructed by anomalies.
  • Contrastive Learning
    • outstanding performance on downstream tasks in computer vision
    • the effectiveness of contrastive representation learning still needs to be explored for TS-AD


DCdetector

( Dual attention Contrastive representation learning anomaly detector )

  • handle the challenges in TS AD

  • key idea : normal TS points share the latent pattern

    ( = normal points have strong correlations with other points, whereas anomalies do not )

  • Learning consistent representations :
    • hard for anomalies
    • easy for normal points
  • Motivation : if normal and abnormal points’ representations are distinguishable, we can detect anomalies without a highly qualified reconstruction model


Details

  • contrastive structure with two branches & dual attention
    • two branches share weights
  • representation difference between normal and abnormal data is enlarged
  • patching-based attention networks: to capture the temporal dependency
  • multi-scale design: to reduce information loss during patching
  • channel independence design for MTS
  • does not require prior knowledge about anomalies


2. Methodology

MTS of length \(T\) : \(X=\left(x_1, x_2, \ldots, x_T\right)\)

  • where \(x_t \in \mathbb{R}^d\)


Task:

  • given input TS \(\mathcal{X}\),
  • for another unknown test sequence \(\mathcal{X}_{\text {test }}\) of length \(T^{\prime}\)

  • we want to predict \(\mathcal{Y}_{\text {test }}=\left(y_1, y_2, \ldots, y_{T^{\prime}}\right)\).
    • \(y_t \in\{0,1\}\) : 1 = anomaly & 0 = normal


Inductive bias ( as Anomaly Transformer explored )

  • anomalies have less connection with the whole TS than their adjacent points
  • Anomaly Transformer: detects anomalies by the association discrepancy between:
    • (1) a learned Gaussian kernel
    • (2) attention weight distribution.
  • DCdetector
    • via a dual-attention self-supervised contrastive-type structure.


Comparison

( figure 2 )

  1. Reconstruction-based approach
  2. Anomaly Transformer
    • observation that it is difficult to build nontrivial associations from abnormal points to the whole series.
    • discrepancies
      • prior discrepancy : learned with Gaussian Kernel
      • association discrepancy : learned with a transformer module
    • MinMax association learning & Reconstruction loss
  3. DCdetector
    • concise ( does not need a specially designed Gaussian Kernel, a MinMax learning strategy, or a reconstruction loss )
    • mainly leverages the designed CL-based dual-branch attention for discrepancy learning of anomalies in different views


(1) Overall Architecture

( figure 2 )

4 main components

  1. Forward Process module
  2. Dual Attention Contrastive Structure module
  3. Representation Discrepancy module
  4. Anomaly Criterion module.



a) Forward Process module

( channel-independent )

  • a-1) instance normalization
  • a-2) patching
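
A minimal PyTorch sketch of this forward process, assuming per-channel instance normalization over time and non-overlapping patches (the function name and the exact normalization are my own simplifications, not the paper's code):

```python
import torch

def forward_process(x, patch_size, eps=1e-5):
    """Channel-independent instance normalization + patching.

    x: (batch, T, d) multivariate time series; T is assumed divisible by patch_size.
    Returns patches of shape (batch * d, num_patches, patch_size).
    """
    # a-1) instance normalization: normalize each channel of each series over time
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True)
    x = (x - mean) / (std + eps)

    # channel independence: fold the d channels into the batch dimension
    b, t, d = x.shape
    x = x.permute(0, 2, 1).reshape(b * d, t)       # (batch * d, T)

    # a-2) patching: split the length-T series into N non-overlapping patches of size P
    n = t // patch_size
    return x.reshape(b * d, n, patch_size)         # (batch * d, N, P)


patches = forward_process(torch.randn(8, 96, 7), patch_size=8)
print(patches.shape)                               # torch.Size([56, 12, 8])
```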


b) Dual Attention Contrastive Structure module

  • each channel shares the same self-attention network
  • representation results are concatenated as the final output \(\left(X^{\prime} \in \mathbb{R}^{N \times d}\right)\).
  • Dual Attention Contrastive Structure module
    • learns the representation of inputs in different views.


c) Representation Discrepancy module

Key Insight

  • normal points: share the same latent pattern even in different views ( their strong correlation is not easily destroyed )
  • anomalies: rare & do not have explicit patterns

\(\rightarrow\) the difference between representations in different views will be slight for normal points and large for anomalies.


d) Anomaly Criterion module.

  • calculate anomaly scores based on the discrepancy between the two representations

  • use a prior threshold for AD


(2) Dual Attention Contrastive Structure

Represents the TS from two different views:

  • (1) patch-wise representations
  • (2) in-patch representations


Does not construct pairs like the typical contrastive methods

  • similar to contrastive methods that use only positive samples


a) Dual Attention

Input time series \(\mathcal{X} \in \mathbb{R}^{T \times d}\) are patched as \(\mathcal{X} \in \mathbb{R}^{P \times N \times d}\)

  • \(P\) : patch size
  • \(N\) : number of patches


Fuse the channel information with the batch dimension ( \(\because\) channel independence )

\(\rightarrow\) becomes \(\mathcal{X} \in \mathbb{R}^{P \times N}\).


[ Patch-wise representation ]

  • a single patch is considered as a unit
    • the embedding operation is applied along the patch_size \((P)\) dimension
  • capture dependencies among patches ( = patch-wise attention )
  • embedding shape : \(X_{\mathcal{N}} \in \mathbb{R}^{N \times d_{\text {model }}}\).
  • apply multi-head attention to \(X_{\mathcal{N}}\)


[ In-patch representation ]

  • dependencies of points within the same patch
    • the embedding operation is applied along the number-of-patches \((N)\) dimension


Note that \(W_{Q_i}, W_{K_i}\) are shared weights between the in-patch & patch-wise attention
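
A minimal PyTorch sketch of the two attention views; reusing a single `nn.MultiheadAttention` module for both views shares all of its projection weights, which simplifies the paper's \(W_{Q_i}, W_{K_i}\) sharing, and the single-layer, single-scale setup is assumed for brevity:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Patch-wise and in-patch attention over a patched input of shape (B, N, P).

    B fuses batch and channel (channel independence), N = number of patches,
    P = patch size.
    """
    def __init__(self, patch_size, num_patches, d_model=64, n_heads=4):
        super().__init__()
        self.embed_patch = nn.Linear(patch_size, d_model)   # embed along the P dimension
        self.embed_point = nn.Linear(num_patches, d_model)  # embed along the N dimension
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, P)
        # patch-wise view: each patch is one token -> dependencies among patches
        xn = self.embed_patch(x)                   # (B, N, d_model)
        patch_wise, _ = self.attn(xn, xn, xn)

        # in-patch view: each position inside a patch is one token
        xp = self.embed_point(x.transpose(1, 2))   # (B, P, d_model)
        in_patch, _ = self.attn(xp, xp, xp)
        return patch_wise, in_patch


attn_n, attn_p = DualAttention(patch_size=8, num_patches=12)(torch.randn(56, 12, 8))
print(attn_n.shape, attn_p.shape)                  # (56, 12, 64) and (56, 8, 64)
```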


b) Up-sampling and Multi-scale Design

Patch-wise attention

  • ignores the relevance among points in a patch

In-patch attention

  • ignores the relevance among patches.


To compare these two representations … we need up-sampling!

( figure 2 )


Multi-scale design:

= the final representation fuses results from different scales (i.e., patch sizes) by summing the up-sampled outputs over the patch-size list ( see the sketch after the formulas below )

  • final patch-wise representation: \(\mathcal{N}\)
    • \(\mathcal{N}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{N}}\right)\)
  • final in-patch representation: \(\mathcal{P}\)
    • \(\mathcal{P}=\sum_{\text{Patch list}} \operatorname{Upsampling}\left(\operatorname{Attn}_{\mathcal{P}}\right)\)
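
A minimal sketch of the up-sampling and multi-scale summation, assuming the patch-wise output is repeated within each patch and the in-patch output is tiled across patches so both recover the original length \(T\) (this repeat/tile choice is my reading of the figure, not stated code):

```python
import torch

def upsample_patch_wise(attn_n, patch_size):
    """Patch-wise branch: (B, N, d_model) -> (B, N * P, d_model).
    Each patch token is repeated P times so every time step gets its patch's vector."""
    return attn_n.repeat_interleave(patch_size, dim=1)

def upsample_in_patch(attn_p, num_patches):
    """In-patch branch: (B, P, d_model) -> (B, N * P, d_model).
    The within-patch pattern is tiled across the N patches."""
    return attn_p.repeat(1, num_patches, 1)

def multi_scale_sum(attn_n_list, attn_p_list, patch_sizes, seq_len):
    """Sum the up-sampled branch outputs over the patch-size list."""
    N = sum(upsample_patch_wise(a, p) for a, p in zip(attn_n_list, patch_sizes))
    P = sum(upsample_in_patch(a, seq_len // p) for a, p in zip(attn_p_list, patch_sizes))
    return N, P


# e.g. two scales (patch sizes 8 and 4) on a length-96 series
attn_n = [torch.randn(56, 12, 64), torch.randn(56, 24, 64)]
attn_p = [torch.randn(56, 8, 64), torch.randn(56, 4, 64)]
N, P = multi_scale_sum(attn_n, attn_p, patch_sizes=[8, 4], seq_len=96)
print(N.shape, P.shape)                            # both torch.Size([56, 96, 64])
```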


c) Contrastive Structure

Patch-wise sample representation

  • learns a weighted combination between sample points in the same position from each patch

In-patch sample representation

  • learns a weighted combination between points within the same patch.

\(\rightarrow\) Treat these two representations as “permutated multi-view representations”


(3) Representation Discrepancy

Kullback-Leibler divergence (KL divergence)

  • to measure the similarity of such two representations


Loss function definition

( no reconstruction part is used )

\(\mathcal{L}\{\mathcal{P}, \mathcal{N} ; X\}=\frac{1}{2} \mathcal{D}(\mathcal{P}, \operatorname{Stopgrad}(\mathcal{N}))+\frac{1}{2} \mathcal{D}(\mathcal{N}, \operatorname{Stopgrad}(\mathcal{P}))\).

  • Stop-gradient : to train 2 branches asynchronously
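
A minimal sketch of this loss, assuming each time step's representation is normalized into a distribution with a softmax before taking the KL divergence (the softmax step is my assumption; the symmetric halves and the stop-gradient follow the formula above):

```python
import torch
import torch.nn.functional as F

def kl_rowwise(p, q, eps=1e-8):
    """Mean KL(p || q), where the last dimension holds probability vectors."""
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()

def dcdetector_loss(P, N):
    """L(P, N; X) = 1/2 * D(P, Stopgrad(N)) + 1/2 * D(N, Stopgrad(P)).

    P: in-patch representation, N: patch-wise representation, both (B, T, d_model).
    """
    P = F.softmax(P, dim=-1)   # turn each row into a distribution (sketch assumption)
    N = F.softmax(N, dim=-1)
    # stop-gradient: each branch is optimized against a frozen copy of the other
    return 0.5 * kl_rowwise(P, N.detach()) + 0.5 * kl_rowwise(N, P.detach())
```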


(4) Anomaly Criterion

Final anomaly score of \(\mathcal{X} \in \mathbb{R}^{T \times d}\) :

  • \(\operatorname{AnomalyScore}(X)=\frac{1}{2} \mathcal{D}(\mathcal{P}, \mathcal{N})+\frac{1}{2} \mathcal{D}(\mathcal{N}, \mathcal{P})\)


\(y_i= \begin{cases}1: \text { anomaly } & \text { AnomalyScore }\left(X_i\right) \geq \delta \\ 0: \text { normal } & \text { AnomalyScore }\left(X_i\right)<\delta\end{cases}\).
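
A minimal sketch of the criterion, reusing the same softmax-plus-KL convention as the loss sketch above and keeping a per-time-step score so each point can be labeled with the prior threshold \(\delta\) (the threshold value below is a placeholder):

```python
import torch
import torch.nn.functional as F

def anomaly_score(P, N, eps=1e-8):
    """Per-time-step score: 1/2 * KL(P || N) + 1/2 * KL(N || P) (no stop-gradient at test time)."""
    P = F.softmax(P, dim=-1)
    N = F.softmax(N, dim=-1)
    kl_pn = (P * (torch.log(P + eps) - torch.log(N + eps))).sum(dim=-1)
    kl_np = (N * (torch.log(N + eps) - torch.log(P + eps))).sum(dim=-1)
    return 0.5 * kl_pn + 0.5 * kl_np               # shape (B, T)


scores = anomaly_score(torch.randn(1, 96, 64), torch.randn(1, 96, 64))
delta = 0.5                                        # prior threshold (placeholder value)
y = (scores >= delta).long()                       # 1 = anomaly, 0 = normal
```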
