( reference : Machine Learning Data Lifecycle in Production )

Feature Engineering

[1] Introduction to Preprocessing

Outline

  • Getting the most out of our data
  • Feature Engineering (FE) process
  • FE in typical ML pipeline


  1. Getting the most out of our data
  • make data useful for training a model
    • ex) normalization, standardization
  • increase predictive quality
    • ex) useful features for prediction
  • dimensionality reduction
    • lower computational cost


  2. Feature Engineering (FE) process

(figure: the Feature Engineering process)


  3. FE in typical ML pipeline
  • training : transformations run in BATCH over the full dataset
  • serving : the same transformations must run in REAL-TIME, per request

(figure: FE in the ML pipeline)
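
Since the transformations fitted at training time must be reproduced exactly at serving time, one common pattern is to fit the transform once in batch, persist it, and reload it per request. A minimal sketch, assuming scikit-learn's StandardScaler and joblib (neither is named in the original notes):

import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# training (BATCH): fit the transform once over the full training set
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, 'scaler.joblib')

# serving (REAL-TIME): reload and apply the identical transform per request
request = np.array([[2.5, 350.0]])
features = joblib.load('scaler.joblib').transform(request)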


Summary

  • FE is difficult & time-consuming, but VERY IMPORTANT

  • Squeeze the most out of our data!

    $\rightarrow$ higher model performance

  • LOW feature dimension, HIGH predictive information

    $\rightarrow$ Lower computational cost

  • FE applied during training $\rightarrow$ must also be applied identically during serving


[2] Preprocessing Operations

Outline

  • Main preprocessing operations
  • Mapping raw data $\rightarrow$ feature
  • Mapping numeric / categorical values
  • Empirical knowledge of data


Data preprocessing

= transforming raw data into a clean & training-ready dataset


1. Main preprocessing operations

(figure: main preprocessing operations)


2. Mapping raw data $\rightarrow$ feature

  • raw data cannot be used directly as model input.

    Therefore, it must be converted into a feature vector.

(figure: mapping raw data to a feature vector)
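
A minimal sketch of this mapping (the field names and vocabulary are illustrative, borrowed from the product-class example below):

import numpy as np

raw = {'price': 19.99, 'quantity': 3, 'product_class': 'kitchenware'}
vocab = ['kitchenware', 'electronics', 'sports']

# numeric fields pass through; the categorical field becomes a one-hot block
one_hot = [1.0 if raw['product_class'] == v else 0.0 for v in vocab]
feature_vector = np.array([raw['price'], raw['quantity']] + one_hot)
# -> [19.99, 3.0, 1.0, 0.0, 0.0]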


3. Mapping categorical values

  • categorical values are not numeric, so they cannot be used directly as model input.

    Therefore, they must be transformed, e.g., with one-hot encoding.

(figure: one-hot encoding of categorical values)


ex) categorical vocabulary

  • Reference : https://www.tensorflow.org/api_docs/python/tf/feature_column
import tensorflow as tf

feature_name = 'product_class'
vocab_list = ['kitchenware', 'electronics', 'sports']
vocab_file = 'product_class.txt'   # one vocabulary entry per line

# option 1: vocabulary supplied in-line as a list
vocab_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
  key = feature_name,              # key = name of the input feature
  vocabulary_list = vocab_list
)

# option 2: vocabulary read from a file
vocab_feature_column = tf.feature_column.categorical_column_with_vocabulary_file(
  key = feature_name,
  vocabulary_file = vocab_file,
  vocabulary_size = 3
)
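
To feed this categorical column to a model, it is typically wrapped so it yields a dense one-hot vector, e.g. with an indicator column (a usage sketch, not part of the original notes):

one_hot_column = tf.feature_column.indicator_column(vocab_feature_column)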



4. Empirical knowledge of data

Text

  • stemming, lemmatization, TF-IDF, n-grams, embedding lookup table

Images

  • clipping, resizing, cropping, blur, …
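
For the text operations above, a minimal TF-IDF + n-grams sketch, assuming scikit-learn (the notes do not name a library):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# unigrams and bigrams, weighted by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)            # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())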


[3] Feature Engineering Techniques

Outline

  • Feature Scaling
  • Normalization & Standardization
  • Bucketizing & Binning
  • Other techniques


[ types of Feature Engineering techniques ]

  • Numerical Range : scaling, normalizing, standardizing
  • Grouping : bucketizing, bag of words (BoW)

1. Feature Scaling

Scaling = converting values into a prescribed range

  • ex) image pixel values [0, 255] $\rightarrow$ [-1, 1]
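
A one-line sketch of that pixel rescaling:

import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.float32)
scaled = pixels / 127.5 - 1.0   # maps [0, 255] -> [-1.0, 1.0]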

Advantages

  • faster convergence of neural networks
  • helps do away with NaN errors (e.g., from values overflowing numeric precision)
  • the model learns appropriate weights for each feature


2. Normalization & Standardization

Normalization

  • min-max scaling : rescales values into the range [0, 1]

Standardization

  • z-score : rescales values to mean 0 and standard deviation 1
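
With $\mu$ the feature mean and $\sigma$ its standard deviation:

  • min-max : $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

  • z-score : $z = \frac{x - \mu}{\sigma}$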


3. Bucketizing & Binning

convert continuous values into groups (buckets), then one-hot encode the bucket index

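A minimal NumPy sketch (the bucket boundaries are illustrative):

import numpy as np

ages = np.array([12, 25, 47, 68])
boundaries = [18, 35, 60]                    # 4 buckets: <18, 18-34, 35-59, >=60
bucket_ids = np.digitize(ages, boundaries)   # -> [0, 1, 2, 3]
one_hot = np.eye(len(boundaries) + 1)[bucket_ids]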


4. Other techniques

Dimension Reduction in embeddings

  • ex) PCA, t-SNE, UMAP
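
e.g., projecting embeddings down to 2-D with PCA, a sketch assuming scikit-learn (random data stands in for real embeddings):

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 64)                      # 100 examples, 64-dim embeddings
reduced = PCA(n_components=2).fit_transform(embeddings)   # -> shape (100, 2)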

Feature Crossing


TF Embedding Projector (https://projector.tensorflow.org)

(figure: TF Embedding Projector)


[4] Feature Crosses

Feature Cross = combining multiple features into a new feature

( = a synthetic feature that encodes nonlinearity in the feature space )

  • captures non-linearity in the feature space
  • allows using fewer feature dimensions


Example

  • A x B x C x D x E : multiplying 5 features
  • [Day of Week, Hour] $\rightarrow$ [Hour of Week] (see the sketch below)
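
A minimal sketch of the [Day of Week, Hour] cross (the integer encoding is an assumption):

import numpy as np

day_of_week = np.array([0, 3, 6])        # 0 = Monday ... 6 = Sunday
hour = np.array([9, 14, 23])             # hour of day, 0-23
hour_of_week = day_of_week * 24 + hour   # single crossed feature, 168 possible values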
