🐍 Practice n°5: cross-validation & hyperparameter tuning (student version)
The objective of this session is to optimize an end-to-end machine learning pipeline, from feature engineering to model training and evaluation. Some of these notions were already tackled in sessions 1 and 2; this notebook focuses on cross-validation and hyperparameter tuning.
We stay in the binary classification setting of session 2: our goal is a model that outputs a label/class (good review or bad review).
Here are the main steps of the notebook:
- Preparation
- Data split
- Model training and evaluation
- Manual hyperparameter tuning
- Cross-validation & automated hyperparameter tuning
Preparation
Install & import modules
import pandas as pd
import numpy as np
from sklearn import (
    pipeline,
    metrics,
    linear_model,
    model_selection,
    tree,
)
pd.set_option("display.max_colwidth", None)
Read remote dataset
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df = pd.read_parquet(file_url)
df.head()
Splits: Train/test & features/target
Split the data keeping 20% in the test set.
*** FILL THE MISSING LINE ***
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df[features],
    df[target],
    *** FILL THE MISSING LINE ***
    random_state=42,
)
Model training and evaluation
Train a logistic regression on the train set.
pipe = pipeline.make_pipeline(
    linear_model.LogisticRegression(solver="liblinear"),
)
pipe.fit(
    *** FILL THE MISSING LINE ***
    *** FILL THE MISSING LINE ***
)
y_pred = pipe.predict(X_test)
metrics.confusion_matrix(y_test, y_pred)
print(metrics.classification_report(y_test, y_pred))
Let's use the F1-score metric for evaluation. Compute the score on the test set.
*** FILL THE MISSING LINE ***
Manual hyperparameter tuning
Let's see which parameters of the logistic regression achieve the best F1-score on the test set.
def compute_test_score(pipe, reg, c):
    """Refit the pipeline with the given penalty and strength, then return the test F1-score."""
    pipe[-1].set_params(penalty=reg, C=c)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    return metrics.f1_score(y_test, y_pred)
Find the best combination of penalty method and strength to maximize the F1-score on the test set.
compute_test_score(
    *** FILL THE MISSING LINE ***
    *** FILL THE MISSING LINE ***
    *** FILL THE MISSING LINE ***
)
Cross validation
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.
A solution to this problem is a procedure called cross-validation. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
- A model is trained using k - 1 of the folds as training data
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy)
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
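The procedure above can be sketched on synthetic data (this toy example is purely illustrative and does not use the ratebeer dataset or touch the exercise variables):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# 5-fold CV: the model is trained on 4 folds and scored on the held-out fold,
# repeated 5 times; the reported performance is the mean of the 5 scores.
scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X, y, cv=5, scoring="f1"
)
print(scores)
print(scores.mean())
```

Note that no test set is consumed here: all 5 evaluations use only training data, so the real test set stays untouched until the final evaluation.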
Automated hyperparameter tuning
Since manual hyperparameter tuning can take a long time, scikit-learn provides tools to automate the search for the best combination.
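For instance, `GridSearchCV` exhaustively tries every combination in a parameter grid, scoring each with cross-validation. Here is a minimal, self-contained sketch on synthetic data (the decision-tree estimator and the grid values are illustrative choices, not part of this notebook's exercise):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [2, 4, 8],      # depth limits to try
        "min_samples_leaf": [1, 5],  # leaf-size constraints to try
    },
    scoring="f1",
    cv=3,  # 3-fold cross-validation for each combination
)
search.fit(X, y)  # fits all 3 x 2 = 6 combinations, 3 folds each

print(search.best_params_)
print(search.best_score_)
```

After fitting, `best_params_` holds the winning combination and, with `refit=True` (the default), `best_estimator_` is already retrained on the full training data.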
Let's implement a grid search with 3-fold cross-validation to optimize our logistic regression hyperparameters.
pipe.get_params()
regularization_types = ["l1", "l2"]
regularization_strengths = [0.0001, 0.01, 1, 100, 10000]
gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        # parameters to explore, with the syntax "pipelinestep__parameter"
        # (e.g. logisticregression__penalty)
        # use pipe.get_params() to see which parameters are available in the pipeline
        *** FILL THE MISSING LINE ***
        *** FILL THE MISSING LINE ***
    },
    # scoring function
    *** FILL THE MISSING LINE ***
    # K-fold cross-validation parameter
    *** FILL THE MISSING LINE ***
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
Now we can take our best estimator and compute its score on the test set.
optimized_pipe = gridsearch.best_estimator_
y_pred = (
    optimized_pipe
    *** FILL THE MISSING LINE ***
)
metrics.f1_score(y_test, y_pred)
Now it's your turn! Choose a classifier and use the scikit-learn documentation to understand its hyperparameters. Then, use the code above to optimize these hyperparameters with a grid search and cross-validation.
pipe = pipeline.make_pipeline(
    *** FILL THE MISSING LINE ***
)
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
scoring = "f1"
gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        *** FILL THE MISSING LINE ***
        *** FILL THE MISSING LINE ***
    },
    scoring=scoring,
    cv=3,
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
optimized_pipe = gridsearch.best_estimator_
y_pred = optimized_pipe.predict(X_test)
metrics.f1_score(y_test, y_pred)