( reference : Introduction to Machine Learning in Production )

Data Iteration

[1] Data-centric AI development

2 types of view

  • 1) model-centric view
  • 2) data-centric view


MODEL-centric view

  • develop a model that does well on the “GIVEN data”
  • keep improving the code/model


DATA-centric view

  • key point : “QUALITY of data”

    \(\rightarrow\) use tools to improve data quality

  • keep code FIXED, and iteratively improve “data”


[2] Useful picture of data augmentation

Data augmentation to help the performance of a model


Example ) different types of “speech input data”

  • type A) Car / Plane / Train / Machine noise
  • type B) Cafe / Library / Food court noise


different performance & HLP ( human-level performance ) for each category

figure2
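The per-category comparison can be sketched as below. This is a minimal illustration only: the category names, the accuracy numbers and the helper `gap_to_hlp` are all made up for the example, not taken from the course.

```python
def gap_to_hlp(model_acc, hlp_acc):
    """Return HLP minus model accuracy for every category."""
    return {cat: hlp_acc[cat] - model_acc[cat] for cat in model_acc}


# Hypothetical numbers, for illustration only.
model_acc = {"car noise": 0.88, "cafe noise": 0.80, "library noise": 0.93}
hlp_acc = {"car noise": 0.90, "cafe noise": 0.92, "library noise": 0.94}

gaps = gap_to_hlp(model_acc, hlp_acc)

# The category with the largest gap is a natural candidate for augmentation.
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 2))
```

A large gap means that humans do well where the model does not, so more ( or better ) data for that category is likely to help.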


Improving one category ( by data augmentation )

also tends to improve the other categories! ( to different degrees, depending on how similar they are )

figure2


[3] Data Augmentation

Example) speech recognition

  • make synthetic training examples, by adding “noise”

figure2
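A minimal sketch of this kind of augmentation, assuming the clips are plain lists of float samples. The helper names `rms` and `add_noise`, the SNR parameter and the toy signals are assumptions for illustration; a real pipeline would load recorded speech and noise clips instead.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to add
    # Scale the noise so that 20 * log10(rms(speech) / rms(scaled noise)) == snr_db.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scale = target_noise_rms / noise_rms
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy stand-ins for real audio: a sine tone as "speech", Gaussian samples as "cafe noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
cafe_noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

augmented = add_noise(speech, cafe_noise, snr_db=10)
```

The label ( transcript ) of the augmented clip is the same as that of the original clip; only the input \(X\) changes.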


Goal of “data augmentation”

create “realistic” examples, that

  • 1) the algorithm does “poorly” on,
  • 2) but “humans (or other baselines)” do well on


Checklist

  • 1) does it sound “realistic”?

  • 2) is the “mapping \(X \rightarrow Y\)” clear?

    ( = can humans recognize it? )

  • 3) is the algorithm doing poorly on it now?


Data Iteration Loop

figure2


[4] Can adding data hurt?

answer : “depends on the capacity of the model”


Unstructured data problems : if

  • 1) model is LARGE ( ex. NN )
  • 2) mapping \(X \rightarrow Y\) is clear

then “adding data RARELY hurts” the performance!


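Counter example : Photo OCR ( ambiguous characters, ex. “1” vs “I” : the mapping \(X \rightarrow Y\) is not clear, so adding many examples of one class can skew the predictions )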

figure2


[5] Adding features

Structured data problems :

  • data augmentation…?

    \(\rightarrow\) creating brand new training examples is difficult!

  • then, do what…?

    \(\rightarrow\) figure out if there are additional useful features to add

Example) restaurant recommendations

  • error analysis showed that the model was

    “recommending to vegetarians restaurants that only had meat options”

  • why not add a feature indicating whether the user is vegetarian? ( hard version )

    or, the proportion of vegetarian dishes in their previous orders? ( soft version )
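The hard and soft versions of the feature could be computed roughly as below. This is a sketch under assumptions: the order records, the `vegetarian` flag and both helper functions are hypothetical, and the threshold for the hard version is an arbitrary choice.

```python
def vegetarian_fraction(orders):
    """Soft feature: share of a user's past orders that were vegetarian."""
    if not orders:
        return 0.0  # assumption: treat users with no history as 0.0
    return sum(1 for o in orders if o["vegetarian"]) / len(orders)


def is_vegetarian(orders, threshold=1.0):
    """Hard feature: True only if the vegetarian share reaches `threshold`."""
    return bool(orders) and vegetarian_fraction(orders) >= threshold


# Hypothetical order history for one user.
history = [
    {"dish": "tofu bowl", "vegetarian": True},
    {"dish": "veggie burger", "vegetarian": True},
    {"dish": "chicken wrap", "vegetarian": False},
    {"dish": "salad", "vegetarian": True},
]

soft = vegetarian_fraction(history)  # 3 of 4 orders were vegetarian
hard = is_vegetarian(history)        # one meat order, so the hard flag is False
```

The soft version keeps more information: a user who orders vegetarian most of the time still gets a high value, while the hard version drops to False after a single meat order.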


Data iteration for “structured data problems”

  • step 1) start out with some model & train the model

  • step 2) error analysis

    • ( might be harder on structured data problems,

      if there is no good baseline ( ex. HLP ) to compare against )

  • step 3) go back and select some features to add


figure2


Hand-designed features are QUITE GOOD!

( only in the case of “structured” datasets )


[6] From big data to good data

Properties of “GOOD” data

  • 1) good coverage of input \(x\)
  • 2) definition of \(y\) is unambiguous
  • 3) distribution covers data/concept drift
  • 4) is sized appropriately
