( reference : Machine Learning Data Lifecycle in Production )
Feature Selection
[1] Feature Spaces
Outline
- What is Feature Space
- Introduction to Feature Selection
- Feature Selection Methods
- (1) filter methods
- (2) wrapper methods
- (3) embedded methods
Feature Space
- $N$ features $\rightarrow$ $N$-dimensional feature space ( do not include the target label )
Feature Space Coverage
IMPORTANT : ensure feature space coverage!!
- train & evaluation datasets' feature coverage $\approx$ serving dataset's feature coverage
- same feature coverage =
  - same numerical ranges
  - same classes
  - similar characteristics of data ( IMAGE )
  - similar vocab, syntax, semantics ( NLP )
Challenges
- Data is affected by seasonality / trend / drift
- New values in features & labels in serving data
$\rightarrow$ thus, need CONTINUOUS MONITORING
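A minimal sketch of one such coverage check, assuming train_df and serving_df are pandas DataFrames with the same feature columns ( the function name and column lists are hypothetical ) :

import pandas as pd

def check_coverage(train_df, serving_df, num_cols, cat_cols):
    # numerical features: flag serving values outside the training range
    for col in num_cols:
        lo, hi = train_df[col].min(), train_df[col].max()
        out = ((serving_df[col] < lo) | (serving_df[col] > hi)).mean()
        print(f"{col}: {out:.1%} of serving rows outside train range")
    # categorical features: flag classes never seen during training
    for col in cat_cols:
        unseen = set(serving_df[col].unique()) - set(train_df[col].unique())
        print(f"{col}: unseen classes {unseen}")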
[2] Feature Selection
- identify correlations between features
- remove unimportant features
- reduce the dimensionality of the feature space
Why?? $\rightarrow$ less storage & I/O, lower training / serving cost
Feature Selection methods
- Unsupervised
  - remove redundant features
- Supervised
  - select the most contributing features
  - ex) filter methods, wrapper methods, embedded methods
[3] Filter Methods
Filter Methods
- (1) correlation
- (2) univariate feature selection
(1) correlation
- high correlation $\rightarrow$ redundant information
- draw the correlation matrix
- ex) Pearson correlation coefficient ( sketch below )
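A minimal sketch of a correlation-based filter, assuming X is a pandas DataFrame; the 0.9 threshold is an arbitrary choice :

import numpy as np

# absolute Pearson correlation between every pair of features
corr = X.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# drop any feature that is highly correlated with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X.drop(columns=to_drop)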
(2) Univariate feature selection
( feat. sklearn )
- Univariate feature selection routines :
  - SelectKBest
  - SelectPercentile
  - GenericUnivariateSelect
- Statistical tests :
  - Regression : f_regression, mutual_info_regression
  - Classification : chi2, f_classif, mutual_info_classif
from sklearn.feature_selection import SelectKBest, chi2
# keep the 20 features with the highest chi-squared scores
selector = SelectKBest(chi2, k=20)
X_new = selector.fit_transform(X, y)
# integer indices of the selected features
X_new_feature_idx = selector.get_support(indices=True)
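The other routines follow the same fit / transform pattern; a minimal sketch with SelectPercentile + f_classif ( the percentile value is an arbitrary choice ) :

from sklearn.feature_selection import SelectPercentile, f_classif
# keep the top 25% of features ranked by the ANOVA F-statistic
selector = SelectPercentile(f_classif, percentile=25)
X_new = selector.fit_transform(X, y)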
[4] Wrapper Methods
Wrapper Methods
- (1) Forward Selection
- (2) Backward Elimination
- (3) Recursive Feature Elimination
(1) Forward Selection
Iterative, greedy method
- start with 1 feature
- evaluate model performance when ADDING each of the remaining features ( one at a time )
- add the feature that gives the BEST performance
- repeat until no further improvement ( sketch below )
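One off-the-shelf implementation of this greedy search is scikit-learn's SequentialFeatureSelector; a minimal sketch, assuming X, y and arbitrary choices of estimator and feature count :

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
# greedily add one feature at a time, keeping the addition that improves
# the cross-validated score the most, until 20 features are selected
sfs = SequentialFeatureSelector(RandomForestClassifier(),
                                n_features_to_select=20,
                                direction='forward', cv=5)
sfs = sfs.fit(X, y)
selected_idx = sfs.get_support(indices=True)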
(2) Backward Elimination
- start with ALL features
- evaluate model performance when REMOVING each feature ( one at a time )
- remove the feature that contributes the LEAST ( its removal hurts performance the least )
- repeat until no further improvement ( sketch below )
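The same SequentialFeatureSelector sketch covers this case by flipping the search direction :

# start from all features and greedily drop the least useful one each round
sfs = SequentialFeatureSelector(RandomForestClassifier(),
                                n_features_to_select=20,
                                direction='backward', cv=5)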
(3) Recursive Feature Elimination
- select a model ( JUST FOR measuring FEATURE IMPORTANCE )
- select the desired # of features
- fit the model
- rank features by importance
- discard the least important features
- repeat until the desired # of features remains
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
K = 20
# the estimator is only used to rank features by importance
model = RandomForestClassifier()
rfe = RFE(model, n_features_to_select=K)
rfe = rfe.fit(X, y)
# boolean mask of the selected features
X_features_idx = rfe.get_support()
X_features = X.columns[X_features_idx]
[5] Embedded Methods
Embedded Methods
- (1) L1 regularization ( sketch at the end of this section )
- (2) Feature Importance
  - scores for each feature in the data
  - discard features with low feature importance
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model = model.fit(X, y)
# feature_importances_ is a plain numpy array; wrap it in a Series indexed
# by column name so nlargest() and plotting keep the feature names
FI = pd.Series(model.feature_importances_, index=X.columns)
print(FI)
FI.nlargest(10).plot(kind='barh')
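A minimal sketch of the L1 regularization route from (1), using SelectFromModel with an L1-penalized LinearSVC ( the estimator and C value are arbitrary choices ) :

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
# the L1 penalty drives the weights of uninformative features to zero;
# SelectFromModel keeps only the features whose coefficients survive
l1_model = LinearSVC(C=0.01, penalty='l1', dual=False)
selector = SelectFromModel(l1_model)
X_new = selector.fit_transform(X, y)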