( reference : Introduction to Machine Learning in Production )

Define Data and Establish Baseline

[1] Why is data definition hard?

figure2

how to define data

  • what is \(X\) & \(Y\)?


Labels may be “inconsistent”

ex) Image Detection

figure2

figure2


ex) Speech Recognition

figure2

\(\rightarrow\) standardize on “one convention”!!


[2] Major types of data problems

figure2


(axis 1) structured vs unstructured

Unstructured data problems

  • people can help label the data, and data augmentation can generate more data


Structured data problems

  • harder to obtain more data, and harder to apply data augmentation

Best practices for unstructured & structured data are different!


(axis 2) large vs small dataset

Large vs small dataset

  • no precise definition
  • set a somewhat arbitrary threshold, e.g. 10,000 examples


Small dataset

  • “having clean labels is critical”


Huge dataset

  • going through every example is difficult, so the “data process” is more important!
  • ex) how you collect / install the data / the labeling instructions … be impossible for


When working in 1 of these 4 quadrants…

  • advice from someone who has worked in the same quadrant will be much more helpful!
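
As a rough sketch of the two axes, the helper below places a problem in one of the 4 quadrants. The name `data_quadrant` and the strict `>` comparison are assumptions made for illustration; the default cutoff is the arbitrary threshold of 10000 mentioned above.

```python
def data_quadrant(is_structured: bool, n_examples: int, threshold: int = 10_000) -> str:
    """Return the quadrant name, e.g. 'small unstructured'.

    `threshold` is an arbitrary cutoff between "small" and "large".
    """
    kind = "structured" if is_structured else "unstructured"
    size = "large" if n_examples > threshold else "small"
    return f"{size} {kind}"


# Hypothetical projects, only to show the four possible outcomes.
print(data_quadrant(False, 500))        # images of phone defects
print(data_quadrant(False, 5_000_000))  # speech clips
print(data_quadrant(True, 800))         # small table of housing prices
print(data_quadrant(True, 2_000_000))   # user purchase records
```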


[3] Small data and label consistency

figure2

The 5 examples on the far right

  • have clean & consistent labels
  • let us fit a function much more confidently! ( compared to the far left )


ex) Phone defect

figure2

labelers might have different criteria for what counts as a “scratch”

\(\rightarrow\) get the labelers to agree on one criterion, e.g. a “length of 0.3”
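
The effect of a shared criterion can be sketched as follows. The function `label_scratch`, the `>=` comparison, the personal thresholds 0.2 and 0.4, and the sample lengths are all hypothetical; only the agreed “length of 0.3” comes from the example above.

```python
# Agreed convention: a mark is a "scratch" defect once its length reaches 0.3.
SCRATCH_THRESHOLD = 0.3


def label_scratch(length: float, threshold: float = SCRATCH_THRESHOLD) -> int:
    """Return 1 (defect) if the scratch is at least `threshold` long, else 0."""
    return 1 if length >= threshold else 0


lengths = [0.1, 0.25, 0.3, 0.5]  # made-up scratch lengths

# Two labelers who each use a private criterion disagree on some phones.
labeler_a = [label_scratch(x, threshold=0.2) for x in lengths]
labeler_b = [label_scratch(x, threshold=0.4) for x in lengths]

# With the shared convention, every labeler produces the same labels.
shared = [label_scratch(x) for x in lengths]

print("labeler A:", labeler_a)
print("labeler B:", labeler_b)
print("shared   :", shared)
```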


Big data problems can have small data challenges too!

  • case : “long tail of rare events”

  • ex 1) web search

    • large web search engine companies all have very large data sets of web search queries,

      but many queries are very rare, so there is little data for each of them!

  • ex 2) self-driving cars

    • accidents are rare, so there is very little data on them
  • ex 3) product recommendation

    • a lot of products have sold only a few units
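
A small sketch of how a long tail can be measured in a large dataset. The function `long_tail_summary`, the cutoff `min_count=5`, and the toy query log are assumptions for illustration only.

```python
from collections import Counter


def long_tail_summary(events, min_count=5):
    """Summarize how much of a dataset sits in rarely seen categories.

    An event type is treated as "rare" if it occurs fewer than `min_count` times.
    `events` must be a non-empty list.
    """
    counts = Counter(events)
    rare = {event: c for event, c in counts.items() if c < min_count}
    return {
        "distinct": len(counts),
        "rare_distinct": len(rare),
        "rare_share_of_events": sum(rare.values()) / len(events),
    }


# Toy query log: two very common queries and three queries seen once each.
queries = ["weather"] * 1000 + ["news"] * 500 + ["q1", "q2", "q3"]
summary = long_tail_summary(queries)
print(summary)
```

In this toy log most distinct queries are rare even though the log itself is large, which is the small data challenge inside a big data problem.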


[4] Improving label consistency

How to improve “label consistency” ?

  • 1) have multiple labelers label the same example
  • 2) when they disagree, discuss the definition of \(y\) until they reach agreement
  • 3) if labelers believe \(x\) doesn’t contain enough information, consider changing \(x\)
  • 4) iterate until it is hard to significantly increase agreement
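
To know when agreement has stopped improving, it has to be measured. The sketch below uses a simple pairwise agreement rate; the function names and the toy labels are hypothetical, and other agreement measures could be used instead.

```python
from itertools import combinations


def agreement_rate(labels_per_example):
    """Fraction of labeler pairs that gave the same label, over all examples."""
    agree = 0
    total = 0
    for labels in labels_per_example:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")


def disputed_examples(labels_per_example):
    """Indices of examples on which the labelers did not all agree."""
    return [i for i, labels in enumerate(labels_per_example) if len(set(labels)) > 1]


# Each inner list holds the labels that 3 labelers gave to one example.
labels = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
print(agreement_rate(labels))
print(disputed_examples(labels))
```

The disputed examples are the ones worth discussing in step 2; re-running the measurement after each round of discussion shows whether agreement is still increasing.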


Example

figure2


Another option

add another class label to capture “uncertainty”

  • ex) defect detection : “borderline”
  • ex) speech recognition : “[unintelligible]”


figure2


Small & Big data

Small data

  • small number of labelers \(\rightarrow\) they can discuss among themselves


Big data

  • agree on a consistent definition within a small “group” first
  • then send the labeling “instructions” to the labelers
  • + can have “multiple labelers” label the same example
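
The notes do not say how labels from multiple labelers are combined. One common choice, assumed here, is a majority vote; the function `majority_vote` and the rule of returning `None` on a tie are illustrative only.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when the top two labels are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority: send the example back for review
    return ranked[0][0]


print(majority_vote(["defect", "ok", "defect"]))
print(majority_vote(["ok", "defect"]))
```

Tied examples are good candidates for refining the labeling instructions.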


[5] Human Level Performance (HLP)

Why do we need HLP?

  • to estimate Bayes error / irreducible error
  • this helps with error analysis, and with deciding where to “focus”
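
A minimal sketch of how HLP can guide where to focus: compare model accuracy with HLP per category and look at the gap. The category names and all numbers are made up for illustration, and `gap_to_hlp` is a hypothetical helper.

```python
def gap_to_hlp(accuracy, hlp):
    """Per-category room for improvement, measured as HLP minus model accuracy."""
    return {category: hlp[category] - accuracy[category] for category in accuracy}


# Made-up numbers for a speech recognition system.
accuracy = {"clean speech": 0.94, "car noise": 0.89, "low bandwidth": 0.70}
hlp = {"clean speech": 0.95, "car noise": 0.93, "low bandwidth": 0.70}

gaps = gap_to_hlp(accuracy, hlp)
focus = max(gaps, key=gaps.get)  # category with the largest gap to HLP

for category, gap in gaps.items():
    print(f"{category:15s} gap to HLP: {gap:.2f}")
print("focus on:", focus)
```

Here the category with the lowest accuracy has no gap to HLP, so it may already be close to the irreducible error, while the category with the largest gap is the more promising one to work on.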


Other uses of HLP

  • 1) respectable benchmark
  • 2) reasonable target
  • 3) prove ML > humans
    • but, use this with “CAUTION”!


beating HLP \(\neq\) proof of ML superiority

\(\rightarrow\) the ML system may just have an unfair advantage!

example )

figure2
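
To see how an ML system can beat HLP without being more useful, consider a made-up case (the numbers are assumptions, not taken from the notes): two labeling conventions are equally valid, 70% of labelers use the first and 30% use the second. HLP is then the chance that two independent labelers pick the same convention, while a model can simply always output the more common one.

```python
def human_agreement(p: float) -> float:
    """Chance that two independent labelers pick the same convention."""
    return p * p + (1 - p) * (1 - p)


def model_agreement(p: float) -> float:
    """Agreement with a labeler for a model that always outputs the more common convention."""
    return max(p, 1 - p)


p = 0.7  # assumed share of labelers using the first convention
print(f"HLP (human vs human): {human_agreement(p):.2f}")
print(f"model vs human:       {model_agreement(p):.2f}")
```

With these numbers HLP is 0.58 and the model reaches 0.70. The 12 point gain only reflects picking the popular convention, and both conventions were equally acceptable to begin with.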


[6] Raising HLP

2 cases : the ground truth label is

  • case 1) externally defined ( REAL ground truth )
  • case 2) just defined by another human


case 1) externally defined

  • HLP gives an estimate of “Bayes error / Irreducible error”


case 2) by another human

  • HLP only measures how well one person ( = person 1 ) agrees with the label from another person ( = person 2 )

figure2


Summary

  • if \(y\) comes from human ( = case 2 ),

    HLP \(\ll\) 100% may indicate “ambiguous labeling instructions”

  • improve label consistency \(\rightarrow\) raise HLP

    • makes it harder for ML to beat HLP

    • But, “more consistent labels” let ML reach better performance

      \(\rightarrow\) benefit the actual application performance


HLP on structured data

  • HLP is used less often for structured data problems
  • exceptions :
    • ex) User ID merging : do these 2 user IDs refer to the same person?
    • ex) is it a spam account?
