( reference : Machine Learning Data Lifecycle in Production )

Collecting Data

[1] Importance of Data

Outline

  • importance of DATA QUALITY
  • Data pipeline consists of..
    • (1) data COLLECTION
    • (2) data INGESTION
    • (3) data PREPARATION
  • continuously MONITOR data collection


Everything starts with DATA!

  • garbage in, garbage out
  • models aren’t magic!
  • need to translate user’s needs into data problems


Meaningful data

  • Maximize predictive content
  • Remove non-informative data
  • feature space coverage


[2] Knowing your data

Key considerations

  • (1) data availability & collection
    • what / how much data is available?
    • how often is it refreshed?
    • is it labeled?
  • (2) translate users’ needs to data needs
    • data / features / labels


Know your data!

  • identify the data sources
  • check if they are refreshed
  • check consistency
    • values / units / data types
  • keep monitoring your data!
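
The consistency check above (values / units / data types) can be sketched roughly as follows. This is only a minimal illustration: the records, the field names (`height`, `unit`, `source`) and the helper `check_consistency` are all hypothetical and not part of the course material.

```python
# Hypothetical records merged from two data sources (illustrative only).
records = [
    {"source": "app", "height": 172.0, "unit": "cm"},
    {"source": "app", "height": 180.5, "unit": "cm"},
    {"source": "web", "height": "5.9", "unit": "ft"},   # wrong type and unit
    {"source": "web", "height": None, "unit": "cm"},    # missing value
]


def check_consistency(rows, field, expected_type, expected_unit):
    """Return (row_index, problem) pairs for one field of the records."""
    problems = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value is None:
            problems.append((i, "missing value"))
        elif not isinstance(value, expected_type):
            problems.append((i, f"unexpected type {type(value).__name__}"))
        if row.get("unit") != expected_unit:
            problems.append((i, f"unexpected unit {row.get('unit')!r}"))
    return problems


issues = check_consistency(records, "height", float, "cm")
for index, problem in issues:
    print(index, problem)
```

Running a check like this every time a source is refreshed is one simple way to keep monitoring the data.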


Measure Data Effectiveness

  • check feature importance
    • does it help prediction?
  • Feature Engineering & Feature Selection :
    • Feature Engineering : helps maximize predictive signals
    • Feature Selection : helps measure predictive signals
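
One rough way to measure whether a feature helps prediction is to score each feature against the label. The sketch below uses the absolute Pearson correlation as a stand-in score; this is an assumption chosen for simplicity (it only captures linear signal), not the method used in the course. The toy features and labels are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy dataset: three candidate features and one label.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "shoe_size": [42, 38, 41, 39, 42, 38],
}
label = [10, 20, 30, 40, 50, 60]

# Score every feature, then rank from most to least informative.
scores = {name: abs(pearson(col, label)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

Features that score near zero (like the constant column here) are candidates for removal as non-informative data.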


Summary : understand your users’ needs, then translate them into DATA PROBLEMS


[3] Responsible Data : Security, Privacy, Fairness

Outline

  • Data Sourcing
  • Data Security & User Privacy
  • Bias & Fairness


Avoid problematic biases in datasets!

  • example )

figure2


Source Data responsibly

  • data may come from many different sources!
  • need to think about where it came from

figure2


Data Security & Privacy

  • give user control of what data can be collected
  • comply with regulations & policies ( ex. GDPR )


User Privacy

  • protect personally identifiable information
  • Solution :
    • (1) aggregation : replace individual values with summary values
    • (2) redaction : remove some data, to create a less complete picture
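
The two ideas above can be sketched in a few lines. This is a minimal illustration, assuming made-up user records and hypothetical helpers (`redact`, `aggregate_mean`); dropping direct identifiers like this does not by itself guarantee anonymity.

```python
from statistics import mean

# Hypothetical user records (illustrative only).
users = [
    {"name": "Alice", "email": "alice@example.com", "city": "Seoul", "age": 31},
    {"name": "Bob", "email": "bob@example.com", "city": "Seoul", "age": 45},
    {"name": "Chen", "email": "chen@example.com", "city": "Busan", "age": 28},
]

# Assumption: these are the fields treated as personally identifiable.
PII_FIELDS = {"name", "email"}


def redact(rows, pii_fields):
    """Redaction: drop the PII fields from every record."""
    return [{k: v for k, v in row.items() if k not in pii_fields} for row in rows]


def aggregate_mean(rows, group_key, value_key):
    """Aggregation: replace individual values with a per-group mean."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


redacted = redact(users, PII_FIELDS)
mean_age_by_city = aggregate_mean(users, "city", "age")
print(redacted)
print(mean_age_by_city)
```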


Fairness

your model needs to be fair!

  • group fairness, equal accuracy
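
A simple way to look at "equal accuracy" is to compute accuracy separately for each group and compare. The sketch below assumes toy labels, predictions, and a hypothetical group attribute with values "A" and "B"; it is one possible check, not a complete fairness evaluation.

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} computed over the examples of each group."""
    correct, total = {}, {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical labels, model predictions, and group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
# A large gap between the best and worst group suggests unequal accuracy.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```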

there may be bias in human-labeled data, & ML models might amplify it!!


Human-Labeled Data

who are the raters?

-> depends on the data!

figure2
