( reference : Machine Learning Data Lifecycle in Production )
Feature Transformation at Scale
[1] Preprocessing Data at Scale
Preprocessing data AT SCALE
- Real-world data : LARGE SCALE
- thus, we will deal with LARGE-SCALE data processing frameworks
- consistent transforms
  - not only on the train & eval datasets,
  - but also on the serving data
( figure : ML Pipeline )
Outline
- (1) inconsistencies in FE
- (2) preprocessing granularity
- (3) preprocessing training data
- (4) optimize instance-level transformation
(1) inconsistencies in FE
- Training & Serving code paths : DIFFERENT
- Diverse deployment scenarios
- mobile / web / server …
- Training & Serving skews
(2) preprocessing granularity
Full-pass vs Instance-level
- Full-pass : requires the WHOLE dataset ( e.g. computing a mean for normalization )
- Instance-level : uses only the INDIVIDUAL example ( e.g. clipping ) ( see the sketch below )
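A minimal sketch ( hypothetical feature values, not from the source ) contrasting the two granularities : a full-pass transform needs a statistic computed over the entire dataset, while an instance-level transform only needs the current example.

import numpy as np

# hypothetical feature values
x = np.array([1.0, 5.0, 200.0, 3.0])

# full-pass : the statistic ( mean / std ) requires the WHOLE dataset
x_standardized = (x - x.mean()) / x.std()

# instance-level : each value can be transformed on its own
x_clipped = np.clip(x, 0.0, 100.0)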
(3) preprocessing training data
Pre-process your training dataset!
- pros) done only ONCE ( not once per training run )
- cons) the same transformations also have to be applied to the serving data
Transforming within the model
- pros) easy iteration
- cons) expensive transforms
  - transformation per batch : risk of skew
Why transform per batch?
- ex) batch normalization
- only a single batch of data ( not the full dataset ) is available ( see the sketch below )
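A minimal sketch ( assumed toy model, not from the source ) of a transformation living inside the model : the BatchNormalization layer only ever sees the current batch during training, so its statistics are computed per batch ( moving averages are used at serving time ).

import tensorflow as tf

# assumed toy model : batch normalization is a per-batch transformation inside the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),   # sees only the current batch at training time
    tf.keras.layers.Dense(1),
])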
(4) optimize instance-level transformation
- affects both training & serving efficiency
- use accelerators…! ( see the sketch below )
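A minimal sketch ( hypothetical per-instance transform, not from the source ) of keeping instance-level transformations efficient : parallelize the map and prefetch so preprocessing overlaps with the accelerator's work.

import tensorflow as tf

def per_instance_transform(x):
    # hypothetical instance-level transformation
    return tf.math.log1p(tf.cast(x, tf.float32))

dataset = (
    tf.data.Dataset.range(1000)
    .map(per_instance_transform, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)   # overlap preprocessing with training
)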
[2] TFT ( TensorFlow Transform )
Outline
- benefits of using TFT
- Feature Transformation
- tf.Transform Analyzers
- How it works
TensorFlow Extended ( TFX )
- will be covered in detail in the coding post
tf.Transform : layout
[ Transformation ]
( applied not only at TRAINING time, but also at SERVING time )
- input : from ExampleGen & SchemaGen
  - data split by ExampleGen
  - schema generated by SchemaGen
  - + user's code ( the FE we want )
- result : TF graph
  - transform graph & transformed data $\rightarrow$ given to the Trainer
( a minimal component wiring sketch follows below )
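A minimal wiring sketch of this layout in a TFX pipeline ( the data path 'data/' and the module file 'preprocessing.py' holding the user's preprocessing_fn are assumptions ) :

from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, Transform

example_gen = CsvExampleGen(input_base='data/')                           # splits the data
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])   # generates the schema

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='preprocessing.py',   # the user's FE code ( preprocessing_fn )
)
# transform.outputs['transform_graph'] & transform.outputs['transformed_examples']
# are then passed on to the Trainer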
tf.Transform : deeper
Training the model also creates a TF graph!
2 graphs
- (1) from transform
- (2) from training
( both graphs are given to the serving infrastructure )
with the tf.Transform API…
- express the FE we want ( provide the code )
  ( or, hand that code to an Apache Beam distributed processing cluster )
- result : a SavedModel ( see the serving-time sketch below )
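A minimal sketch ( the output directory path is an assumption ) of reusing the transform graph at serving time, so raw features go through exactly the same transformations as during training :

import tensorflow_transform as tft

# assumed path : directory written by the Transform step
tf_transform_output = tft.TFTransformOutput('transform_output/')

# a Keras layer that applies the saved transform graph to raw features
transform_layer = tf_transform_output.transform_features_layer()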
Analyzer
makes a full pass over our dataset in order to collect constants,
which are needed during feature engineering
- ex) tft.min ( minimum over the whole training dataset ) ( see the sketch below )
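A minimal sketch ( hypothetical feature x, not from the source ) of analyzers inside a preprocessing_fn : tft.min & tft.max each make a full pass over the training data, and their results are embedded in the transform graph as constants.

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    x_min = tft.min(x)   # analyzer : minimum over the whole training dataset
    x_max = tft.max(x)   # analyzer : maximum over the whole training dataset
    return {'x_scaled': (x - x_min) / (x_max - x_min)}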
[3] Hello World with tf.Transform
[ Steps ]
- 1) Data Collection
- 2) Define Metadata ( DatasetMetadata )
- 3) Transform ( with tf.Transform analyzers )
- 4) Result : Constant Graph
1) Data Collection
example) 3 features & 3 examples

raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]
2) Define Meta data
metadata = declares the types of those 3 features
- x & y : float
- s : string
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'y': tf.io.FixedLenFeature([], tf.float32),
        'x': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    })
)
3) Transform
Preprocess Data
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x, y, s = inputs['x'], inputs['y'], inputs['s']
    x_centered = x - tft.mean(x)                          # subtract the mean of the whole dataset
    y_normalized = tft.scale_to_0_1(y)                    # scale to [0, 1]
    s_integerized = tft.compute_and_apply_vocabulary(s)   # map strings to vocabulary indices
    z = x_centered * y_normalized
    outputs = {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'z': z,
    }
    return outputs
4) Result : Constant Graph
Running the pipeline ( with the main() function )
- uses Apache Beam
import tempfile
import tensorflow_transform.beam as tft_beam

def main():
    # Apache Beam executes the analyze & transform passes
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (raw_data, raw_data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )
        transformed_data, transformed_metadata = transformed_dataset

if __name__ == '__main__':
    main()
Result : the transformed data & the transform graph ( the analyzers' outputs are baked in as constants )
Summary
- in a TFX pipeline, tf.Transform is used for feature engineering ( transformation )
- tf.Transform :
  - preprocessing of input data & feature engineering
  - defines preprocessing pipelines & executes them using large-scale data processing frameworks ( e.g. Apache Beam )