( reference : Introduction to Machine Learning in Production )

Label and Organize Data

[1] Obtaining data

How much to spend obtaining data?

figure2

  • get this iteration loop as quickly as possible

  • question

    • (X) how long to take \(m\) examples?
    • (O) how much data can we get in \(m\) days?

    ( exception : when we know in advance how much data we need! )


Inventory data

brainstorm list of data sources

( + size of data & cost/time to collect )

figure2


How to label data?

3 options

  • 1) in-house
  • 2) out-sourcing
  • 3) crowd-source


Who labels? ( qualification )

  • ex) speech recognition : fluent speaker
  • ex) factory inspection, medical image diagnosis : SME (subject matter expert)
  • ex) recommender system : (maybe impossible)


Tips

  • do not increase data more than “10 times at a time”
  • do it gradually & check model performance!


[2] Data pipeline

Data pipelines ( = Data Cascades )

  • multiple steps of processing before getting to the final output.


Example)

  • data : user information
  • model goal : predict if a given user is looking for a job
  • business goal : surface job ads


Given raw data, need pre-processing (cleaning) before feeding into algorithm

  • by using script

  • ex) spam cleanup, user ID merge, ….

figure2


figure2

  • make sure the pre-processing scripts are highly replicable

    ( distribution of 2 datasets should be similar )


POC & Production phases

POC (Proof-of-Concept)

  • goal : decide if the application is worth deploying

  • focus on getting the “prototype”

  • OK if pre-processing is manual

    ( just take notes/comments )


Production phases

  • after project utility is established, use more sophisticated tools to make sure that

    the “data pipeline is replicable”

  • ex) Tensorflow transform, Apache Beam, Airflow..


[3] Meta-data, data provenance and lineage

Data pipeline example

figure2


What if error is detected in “Spam dataset”…!?

\(\rightarrow\) it will affect all the process!

\(\rightarrow\) thus, need to keep track of data provenance & lineage

  • data provenance : where the data “comes from”
  • data lineage : “sequence of steps”


Meta-data

meta-data = “data of data”

usefulness

  • 1) error analysis ( spotting unexpected effects )
  • 2) keeping track of data provenance


[4] Balanced train/dev/test splits

stratified split!

ex) 100 examples & 30 are positive ( 30% )

  • train/dev/test (60%/20%/20%) = 60/20/20

  • positive samples inside train/dev/test = also (60%/20%/20%)

    \(\rightarrow\) 18/6/6

Categories:

Updated: