( reference : Introduction to Machine Learning in Production )
Data Iteration
[1] Data-centric AI development
2 views of AI development
- 1) model-centric view
- 2) data-centric view
MODEL-centric view
- develop a model that does well on the “GIVEN data”
- keep improving the code/model
DATA-centric view
- key point : “QUALITY of data”
  \(\rightarrow\) use tools to improve data quality
- keep code FIXED, and iteratively improve the “data”
[2] Useful picture of data augmentation
Data augmentation can help improve a model’s performance
Example ) different types of “speech input data”
- type A) Car / Plane / Train / Machine noise
- type B) Cafe / Library / Food court noise
different performance & HLP for each category
Improvement in one category ( by data augmentation )
will also improve other categories! ( to varying degrees, depending on their similarity )
[3] Data Augmentation
Example) speech recognition
- make synthetic training examples, by adding “noise” to clean audio
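A minimal sketch of this idea in NumPy ( assuming 1-D waveforms at the same sample rate; `mix_at_snr` and the SNR target are illustrative, not the course’s code ):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on a clean waveform at a target SNR (in dB)."""
    # Tile/trim the noise clip so it matches the clean clip's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so clean_power / noise_power matches the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise

# Usage: one clean utterance + several noise types -> several synthetic examples
# clean = load_wav("speech.wav")  # hypothetical loader
# augmented = [mix_at_snr(clean, n, snr_db=10) for n in (cafe_noise, car_noise)]
```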
Goal of “data augmentation”
create “realistic” examples that
- 1) the algorithm does “poorly” on,
- 2) but “humans (or other baselines)” do well on
Checklist
- 1) does it sound “realistic”?
- 2) is the “mapping \(X \rightarrow Y\)” clear?
  ( = can humans recognize it? )
- 3) is the algorithm doing poorly on it now?
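Item 3) can be checked mechanically; a tiny sketch ( `model_predict` is a hypothetical callable; items 1) and 2) still need human / baseline review ):

```python
def select_hard_examples(candidates, labels, model_predict):
    """Keep augmented examples the current model still gets wrong (checklist item 3).

    Items 1) and 2) -- realism and a clear X -> Y mapping -- require
    human (or baseline) judgment and are not automated here.
    """
    return [(x, y) for x, y in zip(candidates, labels) if model_predict(x) != y]
```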
Data Iteration Loop : add/improve data \(\rightarrow\) train model \(\rightarrow\) error analysis \(\rightarrow\) repeat
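A skeleton of the loop, with each step passed in as a callable ( the function names are illustrative, not from the course ):

```python
def data_iteration_loop(train, error_analysis, improve_data, data, dev_set, rounds=3):
    """Hold the code/model FIXED; iteratively improve the data and retrain."""
    model = train(data)
    for _ in range(rounds):
        weak_slices = error_analysis(model, dev_set)  # e.g., noise categories with low accuracy
        data = improve_data(data, weak_slices)        # augment / collect data for those slices
        model = train(data)                           # same code, better data
    return model
```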
[4] Can adding data hurt?
answer : “depends on the capacity of model”
Unstructured data problems : if
- 1) the model is LARGE ( ex. a big NN, low bias )
- 2) the mapping \(X \rightarrow Y\) is clear ( humans can make accurate predictions from \(x\) )
then “adding data RARELY hurts” the performance!
Counter example : Photo OCR
( ambiguous characters like “1” vs “I” ; when the mapping is NOT clear,
skewing the data toward rare ambiguous cases CAN hurt )
[5] Adding features
Structured data problems :
- data augmentation…?
  \(\rightarrow\) creating brand-new training examples is difficult!
- then, what to do…?
  \(\rightarrow\) figure out if there are additional useful features to add
Example) restaurant recommendations
- after error analysis, found that the model was
  “recommending restaurants that only had meat options to vegetarians”
- why not add a feature indicating whether the user is vegetarian? ( hard version )
  or the proportion of their previous orders that were vegetarian? ( soft version )
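A sketch of both versions with pandas ( the table and column names are made up for illustration ):

```python
import pandas as pd

# Hypothetical order history: one row per past order.
orders = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "is_veg_dish": [1, 1, 0, 0, 0],
})

# Soft version: fraction of each user's past orders that were vegetarian.
user_features = (
    orders.groupby("user_id")["is_veg_dish"]
          .mean()
          .rename("veg_order_ratio")
          .to_frame()
)

# Hard version: a binary flag derived from the soft score (0.8 is an arbitrary threshold).
user_features["is_likely_vegetarian"] = (user_features["veg_order_ratio"] > 0.8).astype(int)
print(user_features)
```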
Data iteration for “structured data problems”
- step 1) start out with some model & train it
- step 2) error analysis
  ( might be harder on structured data problems,
  if there is no good baseline ( ex. HLP ) to compare against )
- step 3) go back and select some features to add
- Hand-designed features can be QUITE GOOD!
  ( but mainly for “structured” datasets )
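Step 2) often boils down to comparing error rates across slices; a toy sketch with fabricated data:

```python
import pandas as pd

# One row per recommendation shown; "correct" marks whether it was a good match.
results = pd.DataFrame({
    "user_is_vegetarian": [1, 1, 1, 0, 0, 0],
    "correct":            [0, 0, 1, 1, 1, 1],
})

# Error rate per slice: a large gap flags where a new feature might help.
error_by_slice = 1 - results.groupby("user_is_vegetarian")["correct"].mean()
print(error_by_slice)
```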
[6] From big data to good data
Properties of “GOOD” data
- 1) good coverage of input \(x\)
- 2) definition of \(y\) is unambiguous
- 3) distribution covers data/concept drift
- 4) is sized appropriately
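Property 2) can be spot-checked by measuring annotator agreement; a small sketch with made-up labels:

```python
from collections import Counter

# Votes from three annotators per example (fabricated).
labels_by_annotator = [
    ["1", "I", "1"],   # disagreement -> definition of y may be ambiguous (cf. Photo OCR)
    ["7", "7", "7"],   # unanimous -> unambiguous
]

def agreement(votes):
    """Fraction of annotators voting for the majority label."""
    return Counter(votes).most_common(1)[0][1] / len(votes)

for i, votes in enumerate(labels_by_annotator):
    print(f"example {i}: agreement = {agreement(votes):.2f}")
```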