🐍 Practice n°5: cross validation & hyperparameter tuning¶
The objective of this session is to optimize an end-to-end machine learning pipeline, from feature engineering to model training and evaluation. Some notions were already tackled in sessions 1 and 2; this notebook focuses on cross-validation and hyperparameter tuning.
We remain in the setting of the binary classification task from session 2: our goal is a model that outputs a label/class (good review or bad review).
Here are the main steps of the notebook:
- Preparation
- Data split
- Model training and evaluation
- Manual hyperparameter tuning
- Cross validation & automated hyperparameter tuning
Preparation¶
Install & import modules¶
import pandas as pd
import numpy as np
from sklearn import (
    pipeline,
    metrics,
    linear_model,
    model_selection,
    tree,
)

pd.set_option("display.max_colwidth", None)
Read remote dataset¶
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df = pd.read_parquet(file_url)
df.head()
| | beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter chocolate flavor with a good coffee cocoa finish | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with garnet highlights, short lived foam, aroma is quite port like and raisin, flavor is sweet raisin or prune, toasty almond, waffle batter, nice finish | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, slight caramel flavor but watery not much hops some effervescence | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foam , citric soour aroma with hoppy citrus and lime bitter flavor, could use a little more malt for balance | 20 | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pour with off white soapy foam, fruity metallic aroma, flavors of dried apricots, metallics, fruity medicinal finish | 18 | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 |
Splits: Train/test & features/target¶
Split the data keeping 20% in the test set.
target = "is_good"  # LINE TO BE REMOVED FOR STUDENTS
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
]

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df[features],
    df[target],
    test_size=0.2,  # LINE TO BE REMOVED FOR STUDENTS
    random_state=42,
)
Model training and evaluation¶
Train a logistic regression on the train set.
pipe = pipeline.make_pipeline(
    linear_model.LogisticRegression(solver="liblinear"),
)
pipe.fit(
    X_train,  # LINE TO BE REMOVED FOR STUDENTS
    y_train,  # LINE TO BE REMOVED FOR STUDENTS
)
y_pred = pipe.predict(X_test)
# sklearn's signature is confusion_matrix(y_true, y_pred): rows are true labels
metrics.confusion_matrix(y_test, y_pred)
array([[28111,  2061],
       [ 3314,  6514]])
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.89      0.93      0.91     30172
           1       0.76      0.66      0.71      9828

    accuracy                           0.87     40000
   macro avg       0.83      0.80      0.81     40000
weighted avg       0.86      0.87      0.86     40000
Let's use the F1-score as the evaluation metric. Compute the score on the test set.
metrics.f1_score(y_test, y_pred) # LINE TO BE REMOVED FOR STUDENTS
0.70792805520839
Manual hyperparameter tuning¶
Let's see which parameters of the logistic regression reach the best F1-score on the test set.
def compute_test_score(pipe, reg, c):
    pipe[-1].set_params(penalty=reg, C=c)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    return metrics.f1_score(y_test, y_pred)
Find the best combination of penalty method and strength to maximize the F1-score on the test set.
compute_test_score(
    pipe,  # LINE TO BE REMOVED FOR STUDENTS
    "l1",  # LINE TO BE REMOVED FOR STUDENTS
    10,  # LINE TO BE REMOVED FOR STUDENTS
)
0.7079847908745247
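To search over several combinations, the same kind of function can simply be called in a loop. Here is a minimal self-contained sketch of that manual search (it uses a small synthetic dataset and hypothetical names like `toy_pipe`, not the ratebeer data, so the scores differ from the notebook's):

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection, pipeline

# Hypothetical toy data standing in for the ratebeer features/target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_pipe = pipeline.make_pipeline(
    linear_model.LogisticRegression(solver="liblinear")
)

# Manual grid: try every (penalty, C) pair and keep the best test F1-score
best_params, best_score = None, -1.0
for reg in ["l1", "l2"]:
    for c in [0.01, 1, 100]:
        toy_pipe[-1].set_params(penalty=reg, C=c)
        toy_pipe.fit(X_tr, y_tr)
        score = metrics.f1_score(y_te, toy_pipe.predict(X_te))
        if score > best_score:
            best_params, best_score = (reg, c), score

print(best_params, best_score)
```

Note that this loop evaluates every candidate on the test set, which is exactly the leakage problem discussed next.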
In the manual tuning above, we selected hyperparameters using the test set, which biases our final evaluation. A standard remedy is to split off a third subset, the validation set: tune hyperparameters on the validation set and keep the test set for a single final evaluation. Cross-validation generalizes this idea.
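A three-way train/validation/test split can be sketched with two successive calls to `train_test_split` (toy arrays here; note the second split uses 0.25 so that 0.25 × 0.8 = 20% of the original data becomes validation):

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy dataset: 50 samples, 2 features
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0, 1] * 25)

# First carve out the test set (20%)...
X_trainval, X_test_toy, y_trainval, y_test_toy = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# ...then split the remainder into train and validation
X_train_toy, X_val, y_train_toy, y_val = model_selection.train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train_toy), len(X_val), len(X_test_toy))  # 30 10 10
```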
Cross validation¶
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.
A solution to this problem is a procedure called cross-validation. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
- A model is trained using k - 1 of the folds as training data
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy)
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
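The k-fold procedure above is available in one call via `cross_val_score`; a minimal sketch, assuming a small synthetic dataset in place of the real one:

```python
import numpy as np
from sklearn import linear_model, model_selection

# Hypothetical toy data: 100 samples, noisy linear decision boundary
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# 5-fold CV: trains 5 models, each validated on the held-out fold
scores = model_selection.cross_val_score(
    linear_model.LogisticRegression(), X_toy, y_toy, cv=5, scoring="f1"
)
print(scores, scores.mean())  # one F1-score per fold, then their average
```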
Automated hyperparameter tuning¶
As manual hyperparameter tuning can take a long time, tools exist to automate the search for the best combination.
Let's implement a grid search with 3-fold cross-validation to optimize our logistic regression hyperparameters.
pipe.get_params()
{'memory': None,
'steps': [('logisticregression',
LogisticRegression(C=10, penalty='l1', solver='liblinear'))],
'verbose': False,
'logisticregression': LogisticRegression(C=10, penalty='l1', solver='liblinear'),
'logisticregression__C': 10,
'logisticregression__class_weight': None,
'logisticregression__dual': False,
'logisticregression__fit_intercept': True,
'logisticregression__intercept_scaling': 1,
'logisticregression__l1_ratio': None,
'logisticregression__max_iter': 100,
'logisticregression__multi_class': 'auto',
'logisticregression__n_jobs': None,
'logisticregression__penalty': 'l1',
'logisticregression__random_state': None,
'logisticregression__solver': 'liblinear',
'logisticregression__tol': 0.0001,
'logisticregression__verbose': 0,
'logisticregression__warm_start': False}
regularization_types = ["l1", "l2"]
regularization_strengths = [0.0001, 0.01, 1, 100, 10000]

gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        # parameters to explore, with the syntax "pipelinestep__parameter"
        # (e.g. logisticregression__penalty)
        # use pipe.get_params() to know which parameters are available in the pipeline
        "logisticregression__C": regularization_strengths,  # LINE TO BE REMOVED FOR STUDENTS
        "logisticregression__penalty": regularization_types,  # LINE TO BE REMOVED FOR STUDENTS
    },
    # scoring function
    scoring="f1",  # LINE TO BE REMOVED FOR STUDENTS
    # K-fold cross-validation parameter
    cv=3,  # LINE TO BE REMOVED FOR STUDENTS
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
GridSearchCV(cv=3,
estimator=Pipeline(steps=[('logisticregression',
LogisticRegression(C=10, penalty='l1',
solver='liblinear'))]),
n_jobs=-1,
param_grid={'logisticregression__C': [0.0001, 0.01, 1, 100, 10000],
'logisticregression__penalty': ['l1', 'l2']},
             return_train_score=True, scoring='f1', verbose=1)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_logisticregression__penalty | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.302523 | 0.014409 | 0.027974 | 0.012377 | 100 | l2 | 0.712707 | 0.713193 | 0.708231 | 0.711377 | 0.002233 | 1 | 0.711017 | 0.710367 | 0.713287 | 0.711557 | 0.001252 |
| 9 | 0.324879 | 0.013821 | 0.030534 | 0.014661 | 10000 | l2 | 0.712707 | 0.713193 | 0.708231 | 0.711377 | 0.002233 | 1 | 0.711017 | 0.710367 | 0.713287 | 0.711557 | 0.001252 |
| 5 | 0.306638 | 0.003537 | 0.028256 | 0.012410 | 1 | l2 | 0.712939 | 0.713086 | 0.708036 | 0.711354 | 0.002346 | 3 | 0.710612 | 0.710109 | 0.713219 | 0.711313 | 0.001363 |
| 4 | 2.994843 | 0.098568 | 0.028198 | 0.014128 | 1 | l1 | 0.712601 | 0.713193 | 0.708245 | 0.711346 | 0.002206 | 4 | 0.710979 | 0.710367 | 0.713352 | 0.711566 | 0.001287 |
| 8 | 3.025996 | 0.038840 | 0.026668 | 0.012305 | 10000 | l1 | 0.712601 | 0.713164 | 0.708269 | 0.711345 | 0.002187 | 5 | 0.710979 | 0.710353 | 0.713364 | 0.711565 | 0.001297 |
| 6 | 3.007619 | 0.046535 | 0.028817 | 0.014680 | 100 | l1 | 0.712601 | 0.713164 | 0.708269 | 0.711345 | 0.002187 | 5 | 0.710979 | 0.710338 | 0.713364 | 0.711560 | 0.001302 |
| 2 | 2.724853 | 0.040249 | 0.027870 | 0.013004 | 0.010000 | l1 | 0.709125 | 0.711334 | 0.706343 | 0.708934 | 0.002042 | 7 | 0.706986 | 0.708851 | 0.711213 | 0.709017 | 0.001730 |
| 3 | 0.253460 | 0.007172 | 0.034018 | 0.016331 | 0.010000 | l2 | 0.670664 | 0.672511 | 0.668110 | 0.670428 | 0.001805 | 8 | 0.670319 | 0.669710 | 0.671176 | 0.670402 | 0.000602 |
| 1 | 0.154929 | 0.006636 | 0.035645 | 0.007476 | 0.000100 | l2 | 0.001232 | 0.001540 | 0.000616 | 0.001129 | 0.000384 | 9 | 0.001232 | 0.000847 | 0.001540 | 0.001206 | 0.000283 |
| 0 | 0.275470 | 0.008802 | 0.024821 | 0.007042 | 0.000100 | l1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Now we can take our best estimator and compute its score on the test set.
optimized_pipe = gridsearch.best_estimator_
y_pred = (
    optimized_pipe
    .predict(X_test)  # LINE TO BE REMOVED FOR STUDENTS
)
metrics.f1_score(y_test, y_pred)
0.7079847908745247
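Besides `best_estimator_`, a fitted `GridSearchCV` also exposes the winning hyperparameter combination and its mean cross-validated score directly. A self-contained sketch on toy data (the attribute names are real sklearn API; the data and grid are illustrative):

```python
import numpy as np
from sklearn import linear_model, model_selection

# Hypothetical toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

gs = model_selection.GridSearchCV(
    linear_model.LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",
    cv=3,
)
gs.fit(X_toy, y_toy)

print(gs.best_params_)  # dict of the winning parameter values
print(gs.best_score_)   # mean cross-validated F1 of the best candidate
```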
Now it's your turn! Choose a classifier and use the sklearn documentation to understand its hyperparameters. Then, use the code above to optimize these hyperparameters with a grid search and cross-validation.
pipe = pipeline.make_pipeline(
    tree.DecisionTreeClassifier(),  # LINE TO BE REMOVED FOR STUDENTS
)

max_depths = [3, 5, 10, 20]  # LINE TO BE REMOVED FOR STUDENTS
class_weights = [None, "balanced"]  # LINE TO BE REMOVED FOR STUDENTS
scoring = "f1"

gridsearch = model_selection.GridSearchCV(
    pipe,
    param_grid={
        "decisiontreeclassifier__max_depth": max_depths,  # LINE TO BE REMOVED FOR STUDENTS
        "decisiontreeclassifier__class_weight": class_weights,  # LINE TO BE REMOVED FOR STUDENTS
    },
    scoring=scoring,
    cv=3,
    n_jobs=-1,
    refit=True,
    return_train_score=True,
    verbose=1,
)
gridsearch.fit(X_train, y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
GridSearchCV(cv=3,
estimator=Pipeline(steps=[('decisiontreeclassifier',
DecisionTreeClassifier())]),
n_jobs=-1,
param_grid={'decisiontreeclassifier__class_weight': [None,
'balanced'],
'decisiontreeclassifier__max_depth': [3, 5, 10, 20]},
             return_train_score=True, scoring='f1', verbose=1)
(
    pd.DataFrame(gridsearch.cv_results_)
    .sort_values(by="rank_test_score")
    .drop("params", axis=1)
    .style.background_gradient()
)
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_decisiontreeclassifier__class_weight | param_decisiontreeclassifier__max_depth | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.089219 | 0.005742 | 0.030550 | 0.010181 | None | 5 | 0.714246 | 0.720191 | 0.715334 | 0.716590 | 0.002585 | 1 | 0.715069 | 0.716594 | 0.719009 | 0.716890 | 0.001622 |
| 6 | 0.158087 | 0.003758 | 0.029076 | 0.011842 | balanced | 10 | 0.707766 | 0.705384 | 0.706907 | 0.706686 | 0.000985 | 2 | 0.717082 | 0.715781 | 0.719324 | 0.717396 | 0.001463 |
| 2 | 0.125592 | 0.003296 | 0.033634 | 0.014105 | None | 10 | 0.701359 | 0.709018 | 0.706432 | 0.705603 | 0.003181 | 3 | 0.716454 | 0.719820 | 0.725999 | 0.720758 | 0.003953 |
| 5 | 0.112411 | 0.003879 | 0.026811 | 0.012117 | balanced | 5 | 0.701048 | 0.705499 | 0.699621 | 0.702056 | 0.002503 | 4 | 0.704027 | 0.701759 | 0.701387 | 0.702391 | 0.001166 |
| 4 | 0.095171 | 0.004111 | 0.028576 | 0.012639 | balanced | 3 | 0.676007 | 0.713444 | 0.712545 | 0.700666 | 0.017440 | 5 | 0.679652 | 0.713050 | 0.713499 | 0.702067 | 0.015851 |
| 7 | 0.189083 | 0.005064 | 0.033745 | 0.012654 | balanced | 20 | 0.690135 | 0.687700 | 0.688462 | 0.688766 | 0.001017 | 6 | 0.749428 | 0.748467 | 0.748198 | 0.748698 | 0.000528 |
| 3 | 0.172290 | 0.014766 | 0.042265 | 0.008551 | None | 20 | 0.681040 | 0.680347 | 0.680676 | 0.680687 | 0.000283 | 7 | 0.753875 | 0.751615 | 0.754406 | 0.753299 | 0.001210 |
| 0 | 0.066172 | 0.004491 | 0.029563 | 0.013022 | None | 3 | 0.662919 | 0.665324 | 0.661313 | 0.663185 | 0.001648 | 8 | 0.663316 | 0.662115 | 0.664121 | 0.663184 | 0.000824 |
optimized_pipe = gridsearch.best_estimator_
y_pred = optimized_pipe.predict(X_test)
metrics.f1_score(y_test, y_pred)
0.717235566793089