Diffusion Models for Tabular Data Imputation and Synthetic Data Generation


Contents

  0. Abstract
  1. Introduction
  2. Related Work
  3. Background
  4. TabGenDDPM


0. Abstract

Use of diffusion models with transformer conditioning for both

  • (1) data imputation
  • (2) data generation


Transformer conditioning

  • harness the ability of transformers to model dependencies and cross-feature interactions within tabular data.


1. Introduction

Considers synthetic data generation as a general case of data imputation


TabGenDDPM

  • New conditioning in diffusion model for tabular data using a transformer
  • Special masking mechanism
    • makes it possible to tackle both tasks with a single model


Contributions

  • (1) Incorporation of a transformer within the diffusion model
    • to model inter-feature interactions better within tabular data.
  • (2) Innovative masking & conditioning strategy on features
    • enabling both data imputation and generation with a single model.
  • (3) State-of-the-art (SOTA) results in Machine Learning (ML) utility and statistical similarity


2. Related Work

(1) Diffusion Models

Pass

(2) Data Imputation

Traditional approaches

  • ex) Removing rows or columns with missing entries
  • ex) Filling gaps with the average value of a particular feature


Recent trends

  • ML techniques
  • Deep generative models


(3) Generative Models

Generative models for tabular data

  • Tabular VAEs and GANs
  • ex) TabDDPM
    • powerful method for tabular data generation, leveraging the strengths of Diffusion Models.

$\rightarrow$ TabGenDDPM builds upon TabDDPM, targeting both tabular data generation and imputation.


3. Background

(1) Diffusion

Forward process

  • $q\left(x_{1: T} \mid x_0\right)=\prod_{t=1}^T q\left(x_t \mid x_{t-1}\right)$ .


Reverse process

  • $p_\theta\left(x_{0: T}\right)=p\left(x_T\right) \prod_{t=1}^T p_\theta\left(x_{t-1} \mid x_t\right)$.


Loss function: variational lower bound

  • $L_{\mathrm{vlb}} :=L_0+L_1+\ldots+L_{T-1}+L_T$.
    • $L_0 :=-\log p_\theta\left(x_0 \mid x_1\right)$.
    • $L_{t-1} :=D_{KL}\left(q\left(x_{t-1} \mid x_t, x_0\right) \,\|\, p_\theta\left(x_{t-1} \mid x_t\right)\right)$.
    • $L_T :=D_{KL}\left(q\left(x_T \mid x_0\right) \,\|\, p\left(x_T\right)\right)$.
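
For the Gaussian diffusion case described below, the forward posterior $q\left(x_{t-1} \mid x_t, x_0\right)$ appearing in each $L_{t-1}$ term has a closed form (a standard DDPM result, stated here only for reference, not taken from this paper):

  • $q\left(x_{t-1} \mid x_t, x_0\right)=\mathcal{N}\left(x_{t-1} ; \tilde{\mu}_t\left(x_t, x_0\right), \tilde{\beta}_t I\right)$.
    • $\tilde{\mu}_t\left(x_t, x_0\right)=\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} x_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} x_t$.
    • $\tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.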


a) Gaussian diffusion models

Operate in continuous spaces $\left(x_t \in \mathbb{R}^n\right)$


Forward process

  • $q\left(x_t \mid x_{t-1}\right):=\mathcal{N}\left(x_t ; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)$.
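
This forward process gives the closed form $q\left(x_t \mid x_0\right)=\mathcal{N}\left(x_t ; \sqrt{\bar{\alpha}_t}\, x_0,\left(1-\bar{\alpha}_t\right) I\right)$, which can be sampled directly. A minimal sketch, assuming a linear $\beta_t$ schedule and PyTorch (variable names are my own, not from the paper):

```python
import torch

# Assumed linear beta schedule; the paper may use a different schedule.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)            # \bar{\alpha}_t

def q_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) for numerical features (Gaussian diffusion)."""
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(-1)                 # broadcast over the feature dimension
    xt = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise
    return xt, noise
```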


Prior

  • $q\left(x_T\right):=\mathcal{N}\left(x_T ; 0, I\right)$.


Reverse process

  • $p_\theta\left(x_{t-1} \mid x_t\right):=\mathcal{N}\left(x_{t-1} ; \mu_\theta\left(x_t, t\right), \Sigma_\theta\left(x_t, t\right)\right)$.


Noise prediction

  • $\mu_\theta\left(x_t, t\right)=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta\left(x_t, t\right)\right)$.
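
A minimal sketch of one reverse (ancestral sampling) step built from this noise prediction; the fixed variance choice $\sigma_t=\sqrt{\beta_t}$ and the `model` interface are assumptions, not taken from the paper:

```python
import torch

@torch.no_grad()
def p_sample_step(model, xt, t, betas, alphas, alpha_bars):
    """One reverse step x_t -> x_{t-1} using the predicted noise (t is a Python int)."""
    eps = model(xt, t)                               # \epsilon_\theta(x_t, t)
    mu = (xt - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mu
    sigma = torch.sqrt(betas[t])                     # assumed fixed variance sigma_t = sqrt(beta_t)
    return mu + sigma * torch.randn_like(xt)
```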


Loss

  • $L_{\text {simple }}:=\mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon-\epsilon_\theta\left(x_t, t\right)\right\|^2\right]$.
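
A corresponding training-step sketch, reusing `q_sample` and `T` from the forward-process sketch above (this is the standard DDPM noise-prediction objective, assumed rather than quoted from the paper):

```python
import torch
import torch.nn.functional as F

def gaussian_diffusion_loss(model, x0):
    """Simplified objective for numerical features: predict the noise added by q_sample."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    xt, noise = q_sample(x0, t)
    return F.mse_loss(model(xt, t), noise)
```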


b) Multinomial diffusion models

Generate categorical data, where $x_t \in\{0,1\}^{Cl}$ is a one-hot encoded categorical variable with $Cl$ classes.


Forward process

  • $q\left(x_t \mid x_{t-1}\right):=\operatorname{Cat}\left(x_t ;\left(1-\beta_t\right) x_{t-1}+\beta_t / Cl\right)$.
  • $q\left(x_t \mid x_0\right):=\operatorname{Cat}\left(x_t ; \bar{\alpha}_t x_0+\left(1-\bar{\alpha}_t\right) / Cl\right)$.
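
A minimal sketch of sampling $x_t \sim q\left(x_t \mid x_0\right)$ for one categorical feature, reusing `alpha_bars` from the Gaussian sketch above (a uniform-noise corruption; names are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def q_sample_categorical(x0_onehot: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Sample x_t ~ Cat(x_t; abar_t * x_0 + (1 - abar_t)/Cl) for a (batch, Cl) one-hot input."""
    Cl = x0_onehot.shape[-1]
    ab = alpha_bars[t].unsqueeze(-1)
    probs = ab * x0_onehot + (1.0 - ab) / Cl
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return F.one_hot(idx, Cl).float()
```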


Prior

  • $q\left(x_T\right):=\operatorname{Cat}\left(x_T ; 1 / Cl\right)$.


Forward posterior

  • $q\left(x_{t-1} \mid x_t, x_0\right)=\operatorname{Cat}\left(x_{t-1} ; \pi / \sum_{k=1}^{Cl} \pi_k\right)$.
    • where $\pi=\left[\alpha_t x_t+\left(1-\alpha_t\right) / Cl\right] \odot\left[\bar{\alpha}_{t-1} x_0+\left(1-\bar{\alpha}_{t-1}\right) / Cl\right]$.


Reverse distribution

  • $p_\theta\left(x_{t-1} \mid x_t\right)$ is parameterized as $q\left(x_{t-1} \mid x_t, \hat{x}_0\left(x_t, t\right)\right)$, where $\hat{x}_0\left(x_t, t\right)$ is the network's prediction of $x_0$.
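
A minimal sketch of this parameterization, assuming the model outputs logits of $\hat{x}_0$ and reusing the $\pi$ posterior defined above (indexing assumes $t \geq 1$; names are my own):

```python
import torch
import torch.nn.functional as F

def multinomial_posterior(xt, x0_probs, t, alphas, alpha_bars):
    """q(x_{t-1} | x_t, x_0) = Cat(pi / sum_k pi_k), with pi as defined above (assumes t >= 1)."""
    Cl = xt.shape[-1]
    a_t = alphas[t].unsqueeze(-1)
    ab_prev = alpha_bars[t - 1].unsqueeze(-1)
    pi = (a_t * xt + (1.0 - a_t) / Cl) * (ab_prev * x0_probs + (1.0 - ab_prev) / Cl)
    return pi / pi.sum(dim=-1, keepdim=True)

def p_sample_categorical(model, xt, t, alphas, alpha_bars):
    """Reverse step: predict logits of x_0-hat, plug them into the forward posterior, then sample."""
    x0_probs = torch.softmax(model(xt, t), dim=-1)
    probs = multinomial_posterior(xt, x0_probs, t, alphas, alpha_bars)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return F.one_hot(idx, xt.shape[-1]).float()
```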


4. TabGenDDPM

Builds upon the principles of TabDDPM

  • Improves its capabilities in data imputation and synthetic data generation


(Figure 2)


Key distinctions

  • (1) Denoising model
    • [TabDDPM] a simple MLP architecture
    • [TabGenDDPM] : an encoder-decoder structure
      • columnar embedding and transformer architecture
    • Boosts synthetic data quality & offers improved conditioning for the reverse diffusion process (see the sketch after this list)
  • (2) Conditioning mechanism.
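
Regarding distinction (1), a minimal sketch of a columnar-embedding + transformer denoiser. The paper describes an encoder-decoder; this encoder-only version, and all names, dimensions, and the timestep-embedding choice, are my own assumptions for illustration:

```python
import torch
import torch.nn as nn

class ColumnarTransformerDenoiser(nn.Module):
    """Hypothetical sketch: one token per column, mixed by self-attention across features."""
    def __init__(self, k_num, cat_cards, d_model=64, n_heads=4, n_layers=2, max_t=1000):
        super().__init__()
        self.num_proj = nn.ModuleList([nn.Linear(1, d_model) for _ in range(k_num)])
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d_model) for c in cat_cards])
        self.t_emb = nn.Embedding(max_t, d_model)             # learned timestep embedding (assumption)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        out_dim = k_num + sum(cat_cards)                      # noise preds + per-class logits
        self.head = nn.Linear(d_model * (k_num + len(cat_cards)), out_dim)

    def forward(self, x_num, x_cat, t):
        tokens = [proj(x_num[:, i:i + 1]) for i, proj in enumerate(self.num_proj)]
        tokens += [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_emb)]
        h = torch.stack(tokens, dim=1) + self.t_emb(t).unsqueeze(1)
        h = self.encoder(h)                                   # cross-feature attention
        return self.head(h.flatten(1))
```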


(1) Problem Definition

  • $D=\left\{x_i^{j_c}, x_i^{j_n}, y_i\right\}_{i=1}^N$,
    • $x_i^{j_n}$ with $j_n \in \left[1, K_{\text {num }}\right]$ : set of numerical features
    • $x_i^{j_c}$ with $j_c \in \left[1, K_{\text {cat }}\right]$ : set of categorical features
    • $y_i$ : label
  • $i \in[1, N]$ : dataset rows
  • $N$ : total number of rows
  • $K=K_{\text {num }}+K_{\text {cat }}$ : total number of features.


Consistent preprocessing procedure across our benchmark datasets

  • [Numerical] Gaussian quantile transformation
  • [Categorical] Ordinal encoding
  • [Missing values] Replaced with zeros
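
A minimal preprocessing sketch with scikit-learn, matching the steps above (the exact column handling and transformer settings are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, QuantileTransformer

def preprocess(X_num: np.ndarray, X_cat: np.ndarray):
    """Gaussian quantile transform for numerical columns, ordinal encoding for categorical ones."""
    X_num = np.nan_to_num(X_num, nan=0.0)   # missing numerical values -> 0 (ordering is an assumption)
    qt = QuantileTransformer(output_distribution="normal")
    oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    return qt.fit_transform(X_num), oe.fit_transform(X_cat), qt, oe
```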


Modeling

  • [Numerical] with Gaussian diffusion
  • [Categorical] with multinomial diffusion


TabGenDDPM

  • generalizes the approach of TabDDPM
  • [TabDDPM] learns $p\left(x_{t-1} \mid x_t, y\right)$,
  • [TabGenDDPM] extends this by …
    • allowing conditioning on both the target variable $y$ and a subset of input features


(Details) Partition variable $x$ into $x^M$ and $\bar{x}^M$.

  • $x^M$ : Masked variables set
    • perturbed by the forward diffusion process
  • $\bar{x}^M$ : Untouched variable subset
    • conditions the reverse diffusion.

$\rightarrow$ Models $p\left(x_{t-1}^M \mid x_t^M, \bar{x}^M, y\right)$, with $\bar{x}^M$ remaining constant across timesteps $t$.
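
A minimal sketch of one possible masking strategy consistent with this formulation (the paper's actual masking distribution is not reproduced here; `p_mask` and the names are assumptions):

```python
import torch

def sample_feature_mask(batch_size: int, k: int, p_mask: float = 0.5) -> torch.Tensor:
    """1 = feature is diffused (x^M), 0 = feature is kept as conditioning (x-bar^M)."""
    return (torch.rand(batch_size, k) < p_mask).float()

# p_mask = 1 for every feature recovers pure generation; masking only the columns that are
# actually missing at inference time corresponds to imputation. Unmasked features are fed
# to the denoiser unchanged at every timestep t, together with the label y.
```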


Results

  • enhances model performance in data generation
  • enables data imputation with the same model.


Reverse diffusion process $p\left(x_{t-1}^M \mid x_t^M, \bar{x}^M, y\right)$

(Figure 2)

  • [Numerical] estimate the amount of noise added
  • [Categorical] predict the (logits of the) distribution of the categorical variable at $t=0$.


Output dim = $K_{\text {num }}+\sum_{i=1}^{K_{\text {cat }}} Cl_i$

  • where $Cl_i$ is the number of classes of the $i$-th categorical feature
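
A minimal sketch of splitting a denoiser output of this size into its numerical and categorical parts (names are my own, not from the paper):

```python
import torch

def split_model_output(out: torch.Tensor, k_num: int, cat_cards: list[int]):
    """Split output of size K_num + sum(Cl_i) into noise predictions and per-feature logits."""
    eps_num = out[:, :k_num]                       # predicted noise for the numerical features
    logits, start = [], k_num
    for cl in cat_cards:
        logits.append(out[:, start:start + cl])    # logits of x_0-hat for each categorical feature
        start += cl
    return eps_num, logits
```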

