🐍 Practice n°5: cross validation & hyperparameter tuning¶
The objective of this session is to optimize an end-to-end machine learning pipeline, from feature engineering to model training and evaluation. Some notions were already tackled in sessions 1 and 2; this notebook focuses on cross-validation and hyperparameter tuning.
We remain in the setting of the binary classification task from session 2: our goal is a model that outputs a label/class (good review or bad review).
Here are the main steps of the notebook:
- Preparation
- Data split
- Model training and evaluation
- Manual hyperparameter tuning
- Cross validation & automated hyperparameter tuning
Preparation¶
Install & import modules¶
import pandas as pd
import numpy as np
from sklearn import (
    pipeline,
    metrics,
    linear_model,
    model_selection,
    tree,
)

pd.set_option("display.max_colwidth", None)
Read remote dataset¶
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df = pd.read_parquet(file_url)
df.head()
| | beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter chocolate flavor with a good coffee cocoa finish | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with garnet highlights, short lived foam, aroma is quite port like and raisin, flavor is sweet raisin or prune, toasty almond, waffle batter, nice finish | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, slight caramel flavor but watery not much hops some effervescence | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foam , citric soour aroma with hoppy citrus and lime bitter flavor, could use a little more malt for balance | 20 | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pour with off white soapy foam, fruity metallic aroma, flavors of dried apricots, metallics, fruity medicinal finish | 18 | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 |
Splits: Train/test & features/target¶
Split the data keeping 20% in the test set.
target = "is_good"  # LINE TO BE REMOVED FOR STUDENTS
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
]

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df[features],
    df[target],
    test_size=0.2,  # LINE TO BE REMOVED FOR STUDENTS
    random_state=42,
)
Model training and evaluation¶
Train a logistic regression on the train set.
pipe = pipeline.make_pipeline(
    linear_model.LogisticRegression(solver="liblinear"),
)
pipe.fit(
    X_train,  # LINE TO BE REMOVED FOR STUDENTS
    y_train,  # LINE TO BE REMOVED FOR STUDENTS
)
y_pred = pipe.predict(X_test)
# sklearn's signature is confusion_matrix(y_true, y_pred): rows are true labels
metrics.confusion_matrix(y_test, y_pred)
array([[28111,  2061],
       [ 3314,  6514]])
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.89      0.93      0.91     30172
           1       0.76      0.66      0.71      9828

    accuracy                           0.87     40000
   macro avg       0.83      0.80      0.81     40000
weighted avg       0.86      0.87      0.86     40000
Let's use the F1-score as the evaluation metric. Compute the score on the test set.
metrics.f1_score(y_test, y_pred) # LINE TO BE REMOVED FOR STUDENTS
0.70792805520839
Manual hyperparameter tuning¶
Let's see which parameters of the logistic regression reach the best F1-score on the test set.
def compute_test_score(pipe, reg, c):
    pipe[-1].set_params(penalty=reg, C=c)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    return metrics.f1_score(y_test, y_pred)
Find the best combination of penalty method and strength to maximize the F1-score on the test set.
compute_test_score(
    pipe,  # LINE TO BE REMOVED FOR STUDENTS
    "l1",  # LINE TO BE REMOVED FOR STUDENTS
    10,  # LINE TO BE REMOVED FOR STUDENTS
)
0.7079847908745247
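To search over several combinations, the same kind of function can simply be called in a loop. Here is a minimal self-contained sketch of that manual search (it uses a small synthetic dataset and hypothetical names like `toy_pipe`, not the ratebeer data, so the scores differ from the notebook's):

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection, pipeline

# Hypothetical toy data standing in for the ratebeer features/target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_pipe = pipeline.make_pipeline(
    linear_model.LogisticRegression(solver="liblinear")
)

# Manual grid: try every (penalty, C) pair and keep the best test F1-score
best_params, best_score = None, -1.0
for reg in ["l1", "l2"]:
    for c in [0.01, 1, 100]:
        toy_pipe[-1].set_params(penalty=reg, C=c)
        toy_pipe.fit(X_tr, y_tr)
        score = metrics.f1_score(y_te, toy_pipe.predict(X_te))
        if score > best_score:
            best_params, best_score = (reg, c), score

print(best_params, best_score)
```

Note that this loop evaluates every candidate on the test set, which is exactly the leakage problem discussed next.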
In the manual tuning above, we selected hyperparameters using the test set, which biases our final evaluation. A standard remedy is to split off a third subset, the validation set: tune hyperparameters on the validation set and keep the test set for a single final evaluation. Cross-validation generalizes this idea.
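A three-way train/validation/test split can be sketched with two successive calls to `train_test_split` (toy arrays here; note the second split uses 0.25 so that 0.25 × 0.8 = 20% of the original data becomes validation):

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy dataset: 50 samples, 2 features
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0, 1] * 25)

# First carve out the test set (20%)...
X_trainval, X_test_toy, y_trainval, y_test_toy = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# ...then split the remainder into train and validation
X_train_toy, X_val, y_train_toy, y_val = model_selection.train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train_toy), len(X_val), len(X_test_toy))  # 30 10 10
```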
Cross validation¶
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.
A solution to this problem is a procedure called cross-validation. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
- A model is trained using k - 1 of the folds as training data
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy)
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
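The k-fold procedure above is available in one call via `cross_val_score`; a minimal sketch, assuming a small synthetic dataset in place of the real one:

```python
import numpy as np
from sklearn import linear_model, model_selection

# Hypothetical toy data: 100 samples, noisy linear decision boundary
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# 5-fold CV: trains 5 models, each validated on the held-out fold
scores = model_selection.cross_val_score(
    linear_model.LogisticRegression(), X_toy, y_toy, cv=5, scoring="f1"
)
print(scores, scores.mean())  # one F1-score per fold, then their average
```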
Automated hyperparameter tuning¶
As manual hyperparameter tuning can take a long time, tools exist to automate the search for the best combination.
Let's implement a grid search with 3-fold cross-validation to optimize our logistic regression hyperparameters.
pipe.get_params()
{'memory': None,
'steps': [('logisticregression',
LogisticRegression(C=10, penalty='l1', solver='liblinear'))],
'verbose': False,
'logisticregression': LogisticRegression(C=10, penalty='l1', solver='liblinear'),
'logisticregression__C': 10,
'logisticregression__class_weight': None,
'logisticregression__dual': False,
'logisticregression__fit_intercept': True,
'logisticregression__intercept_scaling': 1,
'logisticregression__l1_ratio': None,
'logisticregression__max_iter': 100,
'logisticregression__multi_class': 'auto',
'logisticregression__n_jobs': None,
'logisticregression__penalty': 'l1',
'logisticregression__random_state': None,
'logisticregression__solver': 'liblinear',
'logisticregression__tol': 0.0001,
'logisticregression__verbose': 0,
'logisticregression__warm_start': False}
regularization_types = ["l1", "l2"]
regularization_strengths = [0.0001, 0.01, 1, 100, 10000]

gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        # parameters to explore, with the syntax "pipelinestep__parameter"
        # (e.g. logisticregression__penalty)
        # use pipe.get_params() to know which parameters are available in the pipeline
        "logisticregression__C": regularization_strengths,  # LINE TO BE REMOVED FOR STUDENTS
        "logisticregression__penalty": regularization_types,  # LINE TO BE REMOVED FOR STUDENTS
    },
    # scoring function
    scoring="f1",  # LINE TO BE REMOVED FOR STUDENTS
    # K-fold cross-validation parameter
    cv=3,  # LINE TO BE REMOVED FOR STUDENTS
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
GridSearchCV(cv=3,
estimator=Pipeline(steps=[('logisticregression',
LogisticRegression(C=10, penalty='l1',
solver='liblinear'))]),
n_jobs=-1,
param_grid={'logisticregression__C': [0.0001, 0.01, 1, 100, 10000],
'logisticregression__penalty': ['l1', 'l2']},
             return_train_score=True, scoring='f1', verbose=1)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_logisticregression__penalty | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.302523 | 0.014409 | 0.027974 | 0.012377 | 100 | l2 | 0.712707 | 0.713193 | 0.708231 | 0.711377 | 0.002233 | 1 | 0.711017 | 0.710367 | 0.713287 | 0.711557 | 0.001252 |
| 9 | 0.324879 | 0.013821 | 0.030534 | 0.014661 | 10000 | l2 | 0.712707 | 0.713193 | 0.708231 | 0.711377 | 0.002233 | 1 | 0.711017 | 0.710367 | 0.713287 | 0.711557 | 0.001252 |
| 5 | 0.306638 | 0.003537 | 0.028256 | 0.012410 | 1 | l2 | 0.712939 | 0.713086 | 0.708036 | 0.711354 | 0.002346 | 3 | 0.710612 | 0.710109 | 0.713219 | 0.711313 | 0.001363 |
| 4 | 2.994843 | 0.098568 | 0.028198 | 0.014128 | 1 | l1 | 0.712601 | 0.713193 | 0.708245 | 0.711346 | 0.002206 | 4 | 0.710979 | 0.710367 | 0.713352 | 0.711566 | 0.001287 |
| 8 | 3.025996 | 0.038840 | 0.026668 | 0.012305 | 10000 | l1 | 0.712601 | 0.713164 | 0.708269 | 0.711345 | 0.002187 | 5 | 0.710979 | 0.710353 | 0.713364 | 0.711565 | 0.001297 |
| 6 | 3.007619 | 0.046535 | 0.028817 | 0.014680 | 100 | l1 | 0.712601 | 0.713164 | 0.708269 | 0.711345 | 0.002187 | 5 | 0.710979 | 0.710338 | 0.713364 | 0.711560 | 0.001302 |
| 2 | 2.724853 | 0.040249 | 0.027870 | 0.013004 | 0.010000 | l1 | 0.709125 | 0.711334 | 0.706343 | 0.708934 | 0.002042 | 7 | 0.706986 | 0.708851 | 0.711213 | 0.709017 | 0.001730 |
| 3 | 0.253460 | 0.007172 | 0.034018 | 0.016331 | 0.010000 | l2 | 0.670664 | 0.672511 | 0.668110 | 0.670428 | 0.001805 | 8 | 0.670319 | 0.669710 | 0.671176 | 0.670402 | 0.000602 |
| 1 | 0.154929 | 0.006636 | 0.035645 | 0.007476 | 0.000100 | l2 | 0.001232 | 0.001540 | 0.000616 | 0.001129 | 0.000384 | 9 | 0.001232 | 0.000847 | 0.001540 | 0.001206 | 0.000283 |
| 0 | 0.275470 | 0.008802 | 0.024821 | 0.007042 | 0.000100 | l1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Now we can take our best estimator and compute its score on the test set.
optimized_pipe = gridsearch.best_estimator_
y_pred = (
    optimized_pipe
    .predict(X_test)  # LINE TO BE REMOVED FOR STUDENTS
)
metrics.f1_score(y_test, y_pred)
0.7079847908745247
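Besides `best_estimator_`, a fitted `GridSearchCV` also exposes the winning hyperparameter combination and its mean cross-validated score directly. A self-contained sketch on toy data (the attribute names are real sklearn API; the data and grid are illustrative):

```python
import numpy as np
from sklearn import linear_model, model_selection

# Hypothetical toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

gs = model_selection.GridSearchCV(
    linear_model.LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",
    cv=3,
)
gs.fit(X_toy, y_toy)

print(gs.best_params_)  # dict of the winning parameter values
print(gs.best_score_)   # mean cross-validated F1 of the best candidate
```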
Now it's your turn! Choose a classifier and use the sklearn documentation to understand its hyperparameters. Then, use the code above to optimize these hyperparameters with a grid search and cross-validation.
pipe = pipeline.make_pipeline(
    tree.DecisionTreeClassifier(),  # LINE TO BE REMOVED FOR STUDENTS
)

max_depths = [3, 5, 10, 20]  # LINE TO BE REMOVED FOR STUDENTS
class_weights = [None, "balanced"]  # LINE TO BE REMOVED FOR STUDENTS
scoring = "f1"

gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        "decisiontreeclassifier__max_depth": max_depths,  # LINE TO BE REMOVED FOR STUDENTS
        "decisiontreeclassifier__class_weight": class_weights,  # LINE TO BE REMOVED FOR STUDENTS
    },
    scoring=scoring,
    cv=3,
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
GridSearchCV(cv=3,
estimator=Pipeline(steps=[('decisiontreeclassifier',
DecisionTreeClassifier())]),
n_jobs=-1,
param_grid={'decisiontreeclassifier__class_weight': [None,
'balanced'],
'decisiontreeclassifier__max_depth': [3, 5, 10, 20]},
             return_train_score=True, scoring='f1', verbose=1)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_decisiontreeclassifier__class_weight | param_decisiontreeclassifier__max_depth | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.089219 | 0.005742 | 0.030550 | 0.010181 | None | 5 | 0.714246 | 0.720191 | 0.715334 | 0.716590 | 0.002585 | 1 | 0.715069 | 0.716594 | 0.719009 | 0.716890 | 0.001622 |
| 6 | 0.158087 | 0.003758 | 0.029076 | 0.011842 | balanced | 10 | 0.707766 | 0.705384 | 0.706907 | 0.706686 | 0.000985 | 2 | 0.717082 | 0.715781 | 0.719324 | 0.717396 | 0.001463 |
| 2 | 0.125592 | 0.003296 | 0.033634 | 0.014105 | None | 10 | 0.701359 | 0.709018 | 0.706432 | 0.705603 | 0.003181 | 3 | 0.716454 | 0.719820 | 0.725999 | 0.720758 | 0.003953 |
| 5 | 0.112411 | 0.003879 | 0.026811 | 0.012117 | balanced | 5 | 0.701048 | 0.705499 | 0.699621 | 0.702056 | 0.002503 | 4 | 0.704027 | 0.701759 | 0.701387 | 0.702391 | 0.001166 |
| 4 | 0.095171 | 0.004111 | 0.028576 | 0.012639 | balanced | 3 | 0.676007 | 0.713444 | 0.712545 | 0.700666 | 0.017440 | 5 | 0.679652 | 0.713050 | 0.713499 | 0.702067 | 0.015851 |
| 7 | 0.189083 | 0.005064 | 0.033745 | 0.012654 | balanced | 20 | 0.690135 | 0.687700 | 0.688462 | 0.688766 | 0.001017 | 6 | 0.749428 | 0.748467 | 0.748198 | 0.748698 | 0.000528 |
| 3 | 0.172290 | 0.014766 | 0.042265 | 0.008551 | None | 20 | 0.681040 | 0.680347 | 0.680676 | 0.680687 | 0.000283 | 7 | 0.753875 | 0.751615 | 0.754406 | 0.753299 | 0.001210 |
| 0 | 0.066172 | 0.004491 | 0.029563 | 0.013022 | None | 3 | 0.662919 | 0.665324 | 0.661313 | 0.663185 | 0.001648 | 8 | 0.663316 | 0.662115 | 0.664121 | 0.663184 | 0.000824 |
optimized_pipe = gridsearch.best_estimator_
y_pred = optimized_pipe.predict(X_test)
metrics.f1_score(y_test, y_pred)
0.717235566793089