Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation (2012, 440)

Contents

  0. Abstract
  1. Introduction
  2. Problem Formulation
  3. CPD via Density-Ratio Estimation
    1. Divergence-based dissimilarity measure & Density Ratio estimation
    2. KLIEP ( KL Importance estimation procedure )
    3. uLSIF ( Unconstrained least-squares importance fitting )
    4. RuLSIF ( Relative uLSIF )


0. Abstract

Change point = a point at which the properties of a time series change abruptly

Proposes a statistical change-point detection method

  • based on “non-parametric” divergence estimation
  • uses the “relative Pearson divergence” as the dissimilarity measure


1. Introduction

2 categories of CPD

  • 1) Real-time detection
  • 2) Retrospective detection


1) Real-time detection

  • require “immediate” responses

2) Retrospective detection

  • require “longer reaction periods”
  • more robust & accurate detection
  • allow certain delays


This paper focuses on “2) Retrospective detection”

& proposes a novel “non-parametric” method


[ method 1 ]

  • Compare probability distributions of time-series samples over 2 intervals
  • raise an alarm when the two distributions become significantly different!


[ method 2 ]

  • Subspace method
  • dissimilarity measured by distance between the subspaces


Both [method 1] & [method 2] rely on pre-designed parametric models


Non-parametric

  • ex) KDE (Kernel Density Estimation)
    • but becomes less accurate in high-dimensional problems
  • solution : use the “RATIO of probability densities” directly, without estimating each density


Contribution

  • 1) apply uLSIF (unconstrained least-squares importance fitting) to CPD
  • 2) further improve uLSIF with RuLSIF (Relative uLSIF)
    • problem of the density-ratio-based approach : the “ratio can be unbounded”
    • solution : RuLSIF, which estimates a bounded relative density-ratio


2. Problem Formulation

\(\boldsymbol{y}(t) \in \mathbb{R}^{d}\).

  • \(d\)-dim TS at time \(t\)


\(\boldsymbol{Y}(t):=\left[\boldsymbol{y}(t)^{\top}, \boldsymbol{y}(t+1)^{\top}, \ldots, \boldsymbol{y}(t+k-1)^{\top}\right]^{\top} \in \mathbb{R}^{d k}\).

  • subsequence of time series
  • at time \(t\), with length \(k\)
  • \(\boldsymbol{Y}(t)\) is one sample


\(\mathcal{Y}(t):=\{\boldsymbol{Y}(t), \boldsymbol{Y}(t+1), \ldots, \boldsymbol{Y}(t+n-1)\}\).

  • \(n\) retrospective subsequence samples
  • \([\boldsymbol{Y}(t), \boldsymbol{Y}(t+1), \ldots, \boldsymbol{Y}(t+n-1)] \in \mathbb{R}^{d k \times n}\) : Hankel matrix
    • key role in CPD based on subspace learning


For CPD, consider two consecutive segments

  • \(\mathcal{Y}(t)\) and \(\mathcal{Y}(t+n)\)
  • compute a dissimilarity measure between \(\mathcal{Y}(t)\) and \(\mathcal{Y}(t+n)\) (see the sketch below)
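
A minimal sketch (NumPy; the function name and variables are hypothetical, not from the paper) of how the subsequence samples and the two consecutive segments could be constructed:

```python
import numpy as np

def build_subsequences(y, t, k, n):
    """Stack n sliding subsequences of length k starting at time t.

    y : array of shape (T, d), a d-dimensional time series.
    Returns an array of shape (n, d*k); its transpose corresponds to the
    Hankel matrix [Y(t), ..., Y(t+n-1)] in R^{dk x n}.
    """
    return np.stack([y[t + i : t + i + k].ravel() for i in range(n)])

# Two consecutive segments Y(t) and Y(t+n) to be compared.
T, d, k, n, t = 200, 3, 5, 50, 10
y = np.random.randn(T, d)
Y_ref  = build_subsequences(y, t, k, n)      # samples treated as drawn from P_t
Y_test = build_subsequences(y, t + n, k, n)  # samples treated as drawn from P_{t+n}
```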


3. CPD via Density-Ratio Estimation

define dissimilarity measure


(1) Divergence-based dissimilarity measure & Density Ratio estimation

basic form : \(D\left(P_{t} \mid \mid P_{t+n}\right)+D\left(P_{t+n} \mid \mid P_{t}\right)\), where \(P_{t}\) and \(P_{t+n}\) are the distributions underlying the samples in \(\mathcal{Y}(t)\) and \(\mathcal{Y}(t+n)\)


\(f\)-divergence ( \(D\left(P \mid \mid P^{\prime}\right)\) )

\(D\left(P \mid \mid P^{\prime}\right):=\int p^{\prime}(\boldsymbol{Y}) f\left(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}\right) \mathrm{d} \boldsymbol{Y}\).

  • \(f\) : convex function & \(f(1)=0\)
  • \(p(\boldsymbol{Y})\) and \(p^{\prime}(\boldsymbol{Y})\) are strictly positive
  • asymmetric!
    • thus, symmetrize as above
  • example)
    • KL-divergence : \(f(t)=t \log t\).
      • \(\operatorname{KL}\left(P \mid \mid P^{\prime}\right) :=\int p(\boldsymbol{Y}) \log \left(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}\right) \mathrm{d} \boldsymbol{Y}\).
    • Pearson (PE) divergence : \(f(t)=\frac{1}{2}(t-1)^{2}\)
      • \(\operatorname{PE}\left(P \mid \mid P^{\prime}\right) :=\frac{1}{2} \int p^{\prime}(\boldsymbol{Y})\left(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}-1\right)^{2} \mathrm{~d} \boldsymbol{Y}\).
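
As a quick numerical sanity check of the two example divergences above (not from the paper; two 1-D Gaussians, with the integrals approximated on a grid):

```python
import numpy as np

# Two 1-D Gaussian densities p = N(0, 1) and p' = N(1, 1) on a grid.
x  = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p  = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
pp = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)

r  = p / pp                               # density ratio p(Y) / p'(Y)
kl = np.sum(p * np.log(r)) * dx           # KL(P||P'); closed form here: 1/2
pe = 0.5 * np.sum(pp * (r - 1)**2) * dx   # PE(P||P'); closed form here: (e-1)/2
print(kl, pe)                             # ~0.5, ~0.859
```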


problem : \(p(\boldsymbol{Y})\) and \(p^{\prime}(\boldsymbol{Y})\) are unknown in practice!

  • naive way?
    • plug in estimated densities \(\widehat{p}(\boldsymbol{Y})\) and \(\widehat{p}^{\prime}(\boldsymbol{Y})\)
    • but density estimation itself is hard, so this plug-in approach is not reliable in practice


Direct density-ratio estimation

  • learn the density-ratio function \(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}\) directly
  • easier problem than estimating the two densities separately
  • use samples from…
    • \(\left\{\boldsymbol{Y}_{i}\right\}_{i=1}^{n} \sim p(\boldsymbol{Y})\)
    • \(\left\{\boldsymbol{Y}_{j}^{\prime}\right\}_{j=1}^{n} \sim p^{\prime}(\boldsymbol{Y})\)


(2) KLIEP ( KL Importance estimation procedure )

estimates the KL divergence


kernel model : \(g(\boldsymbol{Y} ; \boldsymbol{\theta}):=\sum_{\ell=1}^{n} \theta_{\ell} K\left(\boldsymbol{Y}, \boldsymbol{Y}_{\ell}\right)\).

  • \(\boldsymbol{\theta}:=\left(\theta_{1}, \ldots, \theta_{n}\right)^{\top}\) : parameters to be learned
  • \(K\left(\boldsymbol{Y}, \boldsymbol{Y}^{\prime}\right)\) : kernel basis function
    • ex) Gaussian kernel : \(K\left(\boldsymbol{Y}, \boldsymbol{Y}^{\prime}\right)=\exp \left(-\frac{\left\|\boldsymbol{Y}-\boldsymbol{Y}^{\prime}\right\|^{2}}{2 \sigma^{2}}\right)\)
  • goal : minimize KL divergence from 1) to 2)
    • 1) \(p(\boldsymbol{Y})\)
    • 2) \(g(\boldsymbol{Y} ; \boldsymbol{\theta}) p^{\prime}(\boldsymbol{Y})\)
  • unique global optimal solution can be obtained


density-ratio estimator

  • \(\widehat{g}(\boldsymbol{Y})=\sum_{\ell=1}^{n} \widehat{\theta}_{\ell} K\left(\boldsymbol{Y}, \boldsymbol{Y}_{\ell}\right)\).


approximator of KL divergence

  • \(\widehat{\mathrm{KL}}:=\frac{1}{n} \sum_{i=1}^{n} \log \widehat{g}\left(\boldsymbol{Y}_{i}\right)\).
  • \(\boldsymbol{\theta}\) is learned by maximizing this objective, subject to the normalization constraint \(\frac{1}{n} \sum_{j=1}^{n} g\left(\boldsymbol{Y}_{j}^{\prime} ; \boldsymbol{\theta}\right)=1\) and \(\boldsymbol{\theta} \geq \mathbf{0}\) ; the optimized value serves as the KL-divergence estimate (a minimal code sketch follows)
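
A minimal KLIEP sketch under the kernel model above (assumptions: Gaussian kernel, plain projected gradient ascent, fixed \(\sigma\); the original KLIEP selects \(\sigma\) by cross-validation):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kliep(Y, Y_prime, sigma=1.0, n_iter=1000, lr=1e-3):
    """Fit g(Y; theta) = sum_l theta_l K(Y, Y_l) by maximizing
    (1/n) sum_i log g(Y_i; theta)  s.t.  (1/n) sum_j g(Y'_j; theta) = 1, theta >= 0."""
    A = gaussian_kernel(Y, Y, sigma)                        # A[i, l] = K(Y_i,  Y_l)
    b = gaussian_kernel(Y_prime, Y, sigma).mean(0)          # b[l]    = (1/n) sum_j K(Y'_j, Y_l)
    theta = np.ones(Y.shape[0]) / (Y.shape[0] * b.mean())   # feasible start: b @ theta = 1
    for _ in range(n_iter):
        theta += lr * A.T @ (1.0 / (A @ theta))             # gradient step on the log term
        theta += b * (1.0 - b @ theta) / (b @ b)            # project back onto b @ theta = 1
        theta = np.maximum(theta, 0.0)                      # enforce non-negativity
        theta /= b @ theta                                  # renormalize
    kl_hat = np.mean(np.log(A @ theta))                     # approximator of KL(P || P')
    return theta, kl_hat

# e.g. theta_hat, kl_hat = kliep(Y_ref, Y_test, sigma=1.0)
```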


(3) uLSIF ( Unconstrained least-squares importance fitting )

estimates the PE divergence


same model as KLIEP…

but training criterion is different : squared loss

\(\begin{aligned} J(\boldsymbol{\theta}) &=\frac{1}{2} \int\left(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}-g(\boldsymbol{Y} ; \boldsymbol{\theta})\right)^{2} p^{\prime}(\boldsymbol{Y}) \mathrm{d} \boldsymbol{Y} \\ &=\frac{1}{2} \int\left(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}\right)^{2} p^{\prime}(\boldsymbol{Y}) \mathrm{d} \boldsymbol{Y}-\int p(\boldsymbol{Y}) g(\boldsymbol{Y} ; \boldsymbol{\theta}) \mathrm{d} \boldsymbol{Y}+\frac{1}{2} \int g(\boldsymbol{Y} ; \boldsymbol{\theta})^{2} p^{\prime}(\boldsymbol{Y}) \mathrm{d} \boldsymbol{Y} \end{aligned}\).

  • the first term is constant w.r.t. \(\boldsymbol{\theta}\) and can be ignored ; the remaining two terms are approximated by sample averages, and an \(\ell_{2}\)-regularizer \(\frac{\lambda}{2} \boldsymbol{\theta}^{\top} \boldsymbol{\theta}\) is added


solution can be obtained analytically

  • \(\widehat{\boldsymbol{\theta}}=\left(\widehat{\boldsymbol{H}}+\lambda \boldsymbol{I}_{n}\right)^{-1} \widehat{\boldsymbol{h}}\).
  • where \(\widehat{H}_{\ell, \ell^{\prime}}=\frac{1}{n} \sum_{j=1}^{n} K\left(\boldsymbol{Y}_{j}^{\prime}, \boldsymbol{Y}_{\ell}\right) K\left(\boldsymbol{Y}_{j}^{\prime}, \boldsymbol{Y}_{\ell^{\prime}}\right)\) and \(\widehat{h}_{\ell}=\frac{1}{n} \sum_{i=1}^{n} K\left(\boldsymbol{Y}_{i}, \boldsymbol{Y}_{\ell}\right)\)
  • \(\lambda\) : regularization parameter


density-ratio estimator

  • \(\widehat{g}(\boldsymbol{Y})=\sum_{\ell=1}^{n} \widehat{\theta}_{\ell} K\left(\boldsymbol{Y}, \boldsymbol{Y}_{\ell}\right)\).


approximator of PE divergence

  • \(\widehat{\mathrm{PE}}:=-\frac{1}{2 n} \sum_{j=1}^{n} \widehat{g}\left(\boldsymbol{Y}_{j}^{\prime}\right)^{2}+\frac{1}{n} \sum_{i=1}^{n} \widehat{g}\left(\boldsymbol{Y}_{i}\right)-\frac{1}{2}\).
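
The estimator above comes from the identity \(\operatorname{PE}\left(P \mid \mid P^{\prime}\right)=-\frac{1}{2} \mathbb{E}_{P^{\prime}}\left[r(\boldsymbol{Y})^{2}\right]+\mathbb{E}_{P}[r(\boldsymbol{Y})]-\frac{1}{2}\) with \(\widehat{g}\) plugged in for the ratio \(r\). A minimal uLSIF sketch (reusing gaussian_kernel from the KLIEP sketch; \(\sigma\) and \(\lambda\) would normally be chosen by cross-validation):

```python
import numpy as np

def ulsif(Y, Y_prime, sigma=1.0, lam=0.1):
    """uLSIF: analytic least-squares fit of g(Y; theta) to p(Y)/p'(Y),
    plus the plug-in estimate of PE(P || P')."""
    K  = gaussian_kernel(Y, Y, sigma)        # K[i, l]  = K(Y_i,  Y_l)
    Kp = gaussian_kernel(Y_prime, Y, sigma)  # Kp[j, l] = K(Y'_j, Y_l)
    H = Kp.T @ Kp / Kp.shape[0]              # H[l, l'] = (1/n) sum_j K(Y'_j, Y_l) K(Y'_j, Y_l')
    h = K.mean(0)                            # h[l]     = (1/n) sum_i K(Y_i, Y_l)
    theta = np.linalg.solve(H + lam * np.eye(K.shape[1]), h)
    g, gp = K @ theta, Kp @ theta            # g_hat(Y_i), g_hat(Y'_j)
    pe_hat = -0.5 * np.mean(gp**2) + np.mean(g) - 0.5
    return theta, pe_hat
```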


(4) RuLSIF ( Relative uLSIF )

problem : density-ratio value can be unbounded!

consider \(\alpha\)-relative PE-divergence!


\(\begin{aligned} \operatorname{PE}_{\alpha}\left(P \mid \mid P^{\prime}\right) &:=\operatorname{PE}\left(P \mid \mid \alpha P+(1-\alpha) P^{\prime}\right) \\ &=\frac{1}{2} \int p_{\alpha}^{\prime}(\boldsymbol{Y})\left(\frac{p(\boldsymbol{Y})}{p_{\alpha}^{\prime}(\boldsymbol{Y})}-1\right)^{2} \mathrm{~d} \boldsymbol{Y}, \end{aligned}\).

  • where \(p_{\alpha}^{\prime}(\boldsymbol{Y})=\alpha p(\boldsymbol{Y})+(1-\alpha) p^{\prime}(\boldsymbol{Y})\) is the \(\alpha\)-mixture density

  • \(r_{\alpha}(\boldsymbol{Y})=\frac{p(\boldsymbol{Y})}{\alpha p(\boldsymbol{Y})+(1-\alpha) p^{\prime}(\boldsymbol{Y})}\) : \(\alpha\)-relative density-ratio
    • bounded above by \(\frac{1}{\alpha}\) for \(\alpha>0\), even when the plain ratio \(\frac{p(\boldsymbol{Y})}{p^{\prime}(\boldsymbol{Y})}\) is unbounded → more stable estimation (a sketch of the resulting change score follows)
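
Putting the pieces together, a hedged sketch of a RuLSIF-style change score (assumptions: the analytic solution has the same form as uLSIF with \(\widehat{\boldsymbol{H}}\) built from the \(\alpha\)-mixture of the two sample sets, and the symmetrized score \(\operatorname{PE}_{\alpha}\left(P_{t} \mid \mid P_{t+n}\right)+\operatorname{PE}_{\alpha}\left(P_{t+n} \mid \mid P_{t}\right)\) is used as the change-point score; gaussian_kernel as above):

```python
import numpy as np

def rulsif_pe(Y, Y_prime, alpha=0.1, sigma=1.0, lam=0.1):
    """Estimate the alpha-relative PE divergence PE_alpha(P || P')
    from samples Y ~ P and Y_prime ~ P'."""
    K  = gaussian_kernel(Y, Y, sigma)            # kernel centers at {Y_l}
    Kp = gaussian_kernel(Y_prime, Y, sigma)
    # H mixes the two sample sets according to the alpha-mixture density p'_alpha
    H = alpha * (K.T @ K) / K.shape[0] + (1 - alpha) * (Kp.T @ Kp) / Kp.shape[0]
    h = K.mean(0)
    theta = np.linalg.solve(H + lam * np.eye(K.shape[1]), h)
    g, gp = K @ theta, Kp @ theta                # g_hat on the two sample sets
    return (-alpha / 2 * np.mean(g**2)
            - (1 - alpha) / 2 * np.mean(gp**2)
            + np.mean(g) - 0.5)

# Symmetrized change-point score between the two consecutive segments
# (Y_ref, Y_test from the windowing sketch in the Problem Formulation section):
# score = rulsif_pe(Y_ref, Y_test) + rulsif_pe(Y_test, Y_ref)
```

For \(\alpha=0\) this reduces to the uLSIF estimator above; a larger \(\alpha\) makes the relative ratio smoother and bounded, which is the motivation for RuLSIF.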
