[Week 1-2] Collecting Data

(coursera) Machine Learning Data Lifecycle in Production - Collecting, Labeling and Validating Data


Seunghan Lee


( reference : Machine Learning Data Lifecycle in Production )

Collecting Data

[1] Importance of Data

Outline

importance of DATA QUALITY
Data pipeline consists of..
- (1) data COLLECTION
- (2) data INGESTION
- (3) data PREPARATION
continuously MONITOR data collection

Everything starts with DATA!

garbage in, garbage out
models aren’t magic!
need to translate user’s needs into data problems

Meaningful data

Maximize predictive content
Remove non-informative data
feature space coverage

[2] Knowing your data

Key considerations

(1) data availability & collection
- what / how much data is available?
- how often is it refreshed?
- is it labeled?
(2) translate users’ needs to data needs
- data / features / labels

Know your data!

identify the data sources
check if they are refreshed
check consistency
- values / units / data types
keep monitoring your data!
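
A minimal sketch of the consistency check above (values / units / data types), using only the Python standard library. The field names, the expected ranges and the `check_consistency` helper are all hypothetical, chosen for illustration; a production pipeline would more likely use a dedicated data validation tool.

```python
def check_consistency(records, schema):
    """Return a list of human-readable problems found in `records`.

    `schema` maps a field name to (expected_type, min_value, max_value).
    All names here are illustrative assumptions, not from the course.
    """
    problems = []
    for i, row in enumerate(records):
        for field, (expected_type, lo, hi) in schema.items():
            value = row.get(field)
            if value is None:
                problems.append(f"row {i}: missing {field}")
                continue
            if not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {field} has type {type(value).__name__}"
                )
                continue
            # A value far outside the expected range often signals a unit mix-up.
            if not (lo <= value <= hi):
                problems.append(f"row {i}: {field}={value} outside [{lo}, {hi}]")
    return problems


# Hypothetical records: heights are expected in centimetres.
records = [
    {"height_cm": 172.0, "age": 31},
    {"height_cm": 1.68, "age": 45},     # probably recorded in metres
    {"height_cm": "180", "age": 27},    # string instead of a number
    {"height_cm": 165.5, "age": None},  # missing value
]
schema = {"height_cm": (float, 50.0, 250.0), "age": (int, 0, 120)}

problems = check_consistency(records, schema)
for problem in problems:
    print(problem)
```

Running the same check on every refresh of the source is one simple way to keep monitoring the data.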

Measure Data Effectiveness

check feature importance
- does it help prediction?
Feature Engineering & Feature Selection :
- Feature Engineering : helps maximize predictive signals
- Feature Selection : helps measure predictive signals
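
One rough way to ask "does it help prediction?" is to score each feature against the label. The sketch below ranks features by absolute Pearson correlation, which is only a simple proxy (it only captures linear relationships) and is an assumption of this example rather than the method used in the course. The feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, label):
    """Sort features by |correlation with the label|, strongest first."""
    scores = {name: abs(pearson(vals, label)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy dataset.
columns = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [42, 38, 42, 38, 42, 38],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
label = [0, 0, 0, 1, 1, 1]

ranked = rank_features(columns, label)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Features that score near zero (like the constant column here) are candidates for removal as non-informative data.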

Summary : understand your users' needs, and translate them into DATA PROBLEMS

[3] Responsible Data : Security, Privacy, Fairness

Outline

Data Sourcing
Data Security & User Privacy
Bias & Fairness

Avoid problematic biases in datasets!

Source Data responsibly

data may come from many different sources!
need to think about where it came from

Data Security & Privacy

give user control of what data can be collected
comply with regulations & policies ( ex. GDPR )

User Privacy

protect personally identifiable information
Solution :
- (1) aggregation : replace individual values with summary values
- (2) redaction : remove some data, to create a less complete picture
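
The two solutions above can be sketched in a few lines. The records, field names, and the `redact` / `aggregate` helpers are hypothetical, and this is only an illustration of the idea, not a complete privacy solution.

```python
from collections import defaultdict
from statistics import mean


def redact(records, pii_fields):
    """Redaction: return copies of the records without the given fields."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in records
    ]


def aggregate(records, group_key, value_key):
    """Aggregation: replace individual values with per-group summaries."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }


# Hypothetical records containing personally identifiable information.
records = [
    {"name": "A. Kim", "email": "a@example.com", "city": "Seoul", "purchase": 30},
    {"name": "B. Lee", "email": "b@example.com", "city": "Seoul", "purchase": 50},
    {"name": "C. Park", "email": "c@example.com", "city": "Busan", "purchase": 20},
]

redacted = redact(records, {"name", "email"})
summary = aggregate(records, "city", "purchase")
print(redacted)
print(summary)
```

Note that a group with a single member (Busan here) still exposes an individual value, so in practice very small groups are usually dropped or merged before the summary is shared.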

Fairness

your model needs to be fair!

group fairness, equal accuracy

there may be bias in human-labeled data & ML models might amplify it!
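
A simple way to check "equal accuracy" is to compute accuracy separately for each group and look at the gap. The labels, predictions, group names and the `accuracy_by_group` helper below are made up for illustration; which groups to compare and what gap is acceptable depend on the application.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for predictions split by group membership."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical labels, predictions and group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

acc = accuracy_by_group(y_true, y_pred, groups)
gap = max(acc.values()) - min(acc.values())
print(acc)
print(f"accuracy gap between groups: {gap:.2f}")
```

A large gap suggests the model serves one group worse than another, which is worth tracing back to the data (coverage, label quality) before tuning the model.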

Human Labeled data

who are the raters?

-> depends on the data!
