Tabular Data: Deep Learning is Not All You Need (ICML 2021)

https://arxiv.org/pdf/2106.03253.pdf


Contents

  Abstract
  1. Introduction
  2. DL for Tabular
    (1) TabNet
    (2) NODE
    (3) DNF-Net
    (4) 1D-CNN
    (5) Ensemble of models
  3. Comparing the models
    (1) Experimental Setup
    (2) Results


Abstract

Explores whether DL is actually needed for tabular data!

Results

  • (1) XGBoost outperforms the DL models
  • (2) XGBoost requires much less tuning
  • (3) An ensemble of the DL models + XGBoost performs better than XGBoost alone


1. Introduction

DL papers on tabular data

  • usually use different datasets & there is no standard benchmark

\(\rightarrow\) making it difficult to compare models fairly!


Questions.

  • Q1) Are the models more accurate on unseen datasets ( i.e., not the datasets used in their own papers )?
  • Q2) How long do training & hyperparameter search take?


Result: XGBoost is better & an ensemble of XGBoost and DL is even better


2. DL for Tabular

Algorithms

  • TabNet
  • NODE ( Neural Oblivious Decision Ensembles )
  • DNF-Net
  • 1D-CNN


(1) TabNet

  • sequential decision steps encode features ( using sparse learned masks )
  • selects relevant features at each step via the masks ( an attention mechanism )
    • sparsemax layers: force the model to use a small set of features ( see the sketch below )
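
For intuition, a minimal NumPy sketch of standard sparsemax ( Martins & Astudillo, 2016 ): a softmax-like mapping that can output exact zeros, which is what lets TabNet's masks hard-select a small feature subset. This is only the sparsemax function itself, not TabNet's full attentive transformer:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability
    simplex, producing sparse probability vectors."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # sort descending
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = k[1 + k * z_sorted > cssv]     # all k with 1 + k*z_(k) > cumsum_k
    k_z = support[-1]                        # size of the support
    tau = (cssv[k_z - 1] - 1) / k_z          # threshold
    return np.maximum(z - tau, 0.0)

# Unlike softmax, exact zeros appear, i.e. a hard feature selection:
print(sparsemax([1.5, 0.3, -2.0]))          # -> [1. 0. 0.]
```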


(2) NODE ( Neural Oblivious Decision Ensembles )

  • consists of equal-depth oblivious decision trees ( ODTs )

    ( = an ensemble of differentiable trees )

  • only one feature ( & threshold ) is chosen per level, shared by all nodes at that level

    \(\rightarrow\) balanced ODT ( see the hard-split sketch below )
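
For intuition, a hard ( non-differentiable ) ODT at inference time. NODE's actual trees replace these hard comparisons with entmax-based soft choices so the whole ensemble trains by gradient descent; the names and sizes below are illustrative:

```python
import numpy as np

def odt_predict(X, feat_idx, thresholds, leaf_values):
    """One oblivious decision tree of depth d: every node at level i
    splits on the SAME (feature, threshold), so a sample's leaf index
    is just a d-bit binary code."""
    leaf = np.zeros(len(X), dtype=int)
    for i in range(len(feat_idx)):
        bit = (X[:, feat_idx[i]] > thresholds[i]).astype(int)
        leaf = (leaf << 1) | bit             # append this level's bit
    return leaf_values[leaf]

# depth-2 example: 4 leaves, splitting on features 0 and 2
X = np.array([[0.1, 5.0, 0.9], [0.8, 1.0, 0.2]])
print(odt_predict(X, feat_idx=[0, 2], thresholds=[0.5, 0.5],
                  leaf_values=np.array([1.0, 2.0, 3.0, 4.0])))  # [2. 3.]
```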


(3) DNF-Net

  • simulates disjunctive normal form (DNF) formulas in DNNs
  • key = DNNF block
    • (1) FC layer
    • (2) DNNF layer ( formed by a soft version of binary conjunctions over literals; toy sketch below )
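
A toy PyTorch illustration of "softening" boolean structure; this is an assumption-laden sketch of the general idea ( soft literals, soft AND per clause, soft OR over clauses ), not DNF-Net's exact tanh-based formulation:

```python
import torch

def soft_dnf(x, W, b, groups):
    """Toy soft DNF: literals -> soft AND per clause -> soft OR.
    `groups` lists which literals belong to each conjunction."""
    literals = torch.sigmoid(x @ W + b)          # soft truth values in (0, 1)
    clauses = [literals[:, g].prod(dim=1)        # soft conjunction (AND)
               for g in groups]
    c = torch.stack(clauses, dim=1)
    return 1 - (1 - c).prod(dim=1)               # soft disjunction (OR)

x = torch.randn(4, 3)
W, b = torch.randn(3, 6), torch.randn(6)
print(soft_dnf(x, W, b, groups=[[0, 1], [2, 3], [4, 5]]))
```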


(4) 1D-CNN

  • best single-model performance in a Kaggle competition with tabular data
  • tabular features have no local characteristics, so a plain CNN is a poor fit
  • thus, an FC layer first creates a representation with locality, then 1D-conv layers are applied ( with short-cut connections; see the sketch below )
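
A rough PyTorch sketch of the FC \(\rightarrow\) reshape \(\rightarrow\) 1D-conv-with-shortcut idea. Sizes, depth, and the class name are hypothetical, since the notes don't spell out the Kaggle solution's architecture:

```python
import torch
import torch.nn as nn

class Tabular1DCNN(nn.Module):
    """FC layer manufactures a (channels, length) representation with
    locality; a 1D conv with a shortcut connection then refines it."""
    def __init__(self, n_features, channels=16, length=32, n_out=1):
        super().__init__()
        self.fc = nn.Linear(n_features, channels * length)
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.head = nn.Linear(channels * length, n_out)
        self.channels, self.length = channels, length

    def forward(self, x):
        h = torch.relu(self.fc(x)).view(-1, self.channels, self.length)
        h = h + torch.relu(self.conv(h))          # short-cut connection
        return self.head(h.flatten(1))

print(Tabular1DCNN(n_features=10)(torch.randn(4, 10)).shape)  # (4, 1)
```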


(5) Ensemble of models

includes 5 classifiers: models (1)~(4) + XGBoost

  • weight of each model = its normalized validation loss ( sketch below )
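
A sketch of loss-based weighting. The notes only say "normalized validation loss"; the mapping below assumes lower loss \(\rightarrow\) higher weight via normalized inverse losses, which is an assumption, not the paper's stated formula:

```python
import numpy as np

def ensemble_predict(probas, val_losses):
    """Weighted average of the member models' predicted probabilities;
    weights are normalized inverse validation losses (assumption)."""
    losses = np.asarray(val_losses, dtype=float)
    w = (1.0 / losses) / (1.0 / losses).sum()     # weights sum to 1
    return sum(wi * p for wi, p in zip(w, probas))

# 3 models' class probabilities for 2 samples
probas = [np.array([[0.9, 0.1], [0.2, 0.8]]),
          np.array([[0.7, 0.3], [0.4, 0.6]]),
          np.array([[0.6, 0.4], [0.5, 0.5]])]
print(ensemble_predict(probas, val_losses=[0.30, 0.45, 0.60]))
```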


3. Comparing the models

Desirable properties

  1. accurate performance
  2. efficient inference
  3. short optimization time


(1) Experimental Setup

a) Datasets

4 DL models, evaluated on 11 datasets

  • 9 datasets = 3 datasets from each of the 3 model papers ( TabNet, NODE, DNF-Net )
  • 2 new unseen datasets ( from Kaggle )


b) Optimization process

Bayesian optimization ( with HyperOpt; example below )

Use 3 random seed initializations on the same partition & average the performance
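
A minimal sketch of what tuning XGBoost with HyperOpt looks like ( HyperOpt's TPE algorithm is a form of Bayesian optimization ); the search space and budget here are illustrative, not the paper's exact ones:

```python
from hyperopt import fmin, tpe, hp
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# illustrative search space, NOT the paper's
space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7, 9]),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    model = XGBClassifier(n_estimators=200, **params)
    return -cross_val_score(model, X, y, cv=3).mean()  # minimize -accuracy

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)
```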


(2) Results

a) Do the DL models generalize well to other datasets?

  1. The DL models perform worse on the two unseen datasets

  2. XGBoost generally outperforms the DL models

  3. No DL model consistently outperforms the others

    ( though 1D-CNN may seem to perform best )

  4. The ensemble of DL models & XGBoost outperforms the others in most cases




b) Do we need both XGBoost & DL?

3 ensemble variants

  • Simple ensemble: XGBoost + SVM + CatBoost
  • Deep ensemble w/o XGBoost: only models (1) ~ (4)
  • Deep ensemble w/ XGBoost: models (1) ~ (4) + XGBoost


c) Subset of models

Ensembling improves accuracy!

But it adds computation…

\(\rightarrow\) consider using subsets of the models within the ensemble


Criteria for the order in which models are added ( greedy sketch below )

  • (1) validation loss ( models with low validation error first )
  • (2) the model's uncertainty on each example
  • (3) random order
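
A sketch of criterion (1) as a greedy loop, using equal-weight averaging of the ensemble-so-far for simplicity ( the paper's exact weighting may differ; all names here are illustrative ):

```python
import numpy as np

def greedy_subset(val_losses, predictions, y_val, loss_fn):
    """Add models in order of increasing validation loss and record the
    running ensemble's loss after each addition; the best subset size is
    the one minimizing that curve."""
    order = np.argsort(val_losses)               # best model first
    curve, running = [], 0.0
    for i, idx in enumerate(order, start=1):
        running = running + predictions[idx]
        curve.append(loss_fn(y_val, running / i))
    return order, curve

# toy usage with mean squared error
mse = lambda y, p: float(np.mean((y - p) ** 2))
y = np.array([0.0, 1.0, 1.0])
preds = [np.array([0.1, 0.8, 0.9]), np.array([0.3, 0.6, 0.7])]
print(greedy_subset([0.1, 0.3], preds, y, mse))
```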



d) How difficult is the optimization?

XGBoost outperformed the deep models, converging faster!



These results may be affected by several factors

1) the hyperparameters of the Bayesian optimization process itself

2) XGBoost's initial hyperparameters may be more robust

( they had previously been optimized over many datasets )

3) XGBoost's inherent characteristics

