Download Open in colab

🐍 Practice n°3: regression¶

The objective of this session is to understand the regression problems and the analytical approach to solve them. This notebook discusses theoretical notions and contains practical parts.

Here are the main steps of the notebook :

What is regression ?
Focus on linear regression
Preparation
Implementation of a linear regression
Model improvement

1. What is regression ?¶

Regression in machine learning consists of mathematical methods that allow to predict a continuous outcome (y) based on the value of one or more predictor variables (x)

1D Example¶

x = Number of aircrafts in an airspace sector

y = Subjective rating of air traffic controller workload

To predict the air traffic controller workload, we need to find the best function y=f(x). For that, we will define an error and find the function which minimizes this error.

Capture d’écran 2022-11-04 à 11.25.07.png

Generalisation¶

In reality, we often deal with multiple explicative variables and the dimensionality of the input data is greater than 1.

$$x = (x_1,...,x_n)$$ $$\hat{y} = f(x_1,...,x_n)$$

If we come back to our traffic control example, the air traffic controller workload can be a function of both the number of aircraft and the weather.

2. Focus on linear regression¶

The linear regression is a regression using a linear function ŷ = ax + b

1D Example¶

Capture d’écran 2022-11-04 à 12.18.01.png

Generalisation¶

In reality, we often deal with multiple explicative variables and the dimensionality of the input data is greater than 1. The linear function will look like :

$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

Now let's practice with the ratebeer dataset !

3. Preparation¶

Install & import modules¶

In [1]:

Copied!





import pandas as pd
from wordcloud import STOPWORDS
import numpy as np
from sklearn import (
    linear_model,
    preprocessing,
    pipeline,
    model_selection,
    metrics,
    compose,
)
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import matplotlib as mpl

mpl.rcParams['agg.path.chunksize'] = 10000
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})
import pandas as pd
from wordcloud import STOPWORDS
import numpy as np
from sklearn import (
    linear_model,
    preprocessing,
    pipeline,
    model_selection,
    metrics,
    compose,
)
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import matplotlib as mpl

mpl.rcParams['agg.path.chunksize'] = 10000
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})

Read remote dataset¶

The data is in this git repository: ML-boot-camp/ratebeer.git.

The data is located in the the ratebeer/data/ folder.

In [2]:

Copied!

file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"

Implementation of a linear regression¶

In this first section, you will train a linear regression model to predict the overall rating of a review based on its other ratings. For that, you will :

Load & describe the data
Split the data between train and test set
Train and evaluate the linear regression model

Load data & describe¶

In [3]:

Copied!

df_master = pd.read_parquet(file_url)
df_master = pd.read_parquet(file_url)

In [4]:

Copied!

df_master.head()
df_master.head()

Out[4]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text	beer_degree	brewery_degree	user_degree	text_length	date
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter choc...	51	413	300	90	2008-07-31 02:00:00
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	blutt59	bottle, received in trade, dark brown with gar...	4	413	300	191	2011-08-06 02:00:00
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	blutt59	12 oz. bottle, amber color with soapy head, sl...	43	413	300	109	2008-03-09 01:00:00
3	Breckenridge Lucky U IPA	383	6.20	India Pale Ale (IPA)	3	6	3	7	12	1255737600	blutt59	bottle, golden orange color with light tan foa...	20	413	300	151	2009-10-17 02:00:00
4	Fullers Vintage Ale 2009	55	8.50	English Strong Ale	3	7	3	8	14	1282003200	blutt59	bottle, thanks to SS, almond amber colored pou...	18	978	300	159	2010-08-17 02:00:00

In [5]:

Copied!

df_master.shape
df_master.shape

Out[5]:

(200000, 18)

Splits: Train/test & features/target¶

The goal of supervised learning is to build a model that performs well on new data. If you have new data, it’s a good idea to see how your model performs on it. The problem is that you may not have new data, but you can simulate this experience by splitting your dataset between train and test.

Train/test split¶

Split keeping 10% of the data in the test set.

Hint: https://scikit-learn.org/stable/index.html

In [6]:

Copied!





(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(
    df_master, 
    test_size=0.1 # LINE TO BE REMOVED FOR STUDENTS
)
(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(
    df_master, 
    test_size=0.1 # LINE TO BE REMOVED FOR STUDENTS
)

Features/target split¶

Split keeping:

only the wanted features
N data samples in the training set

In [7]:

Copied!





def split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N=None,
):
    if N is None :
        X_train = df_features_and_target_train[features]
        y_train = df_features_and_target_train[target]
    else :
        X_train = df_features_and_target_train[features].sample(N)
        y_train = df_features_and_target_train[target].loc[X_train.index]
    X_test = df_features_and_target_test[features]
    y_test = df_features_and_target_test[target]
    return X_train, y_train, X_test, y_test


target = [
    "rating" # LINE TO BE REMOVED FOR STUDENTS
] 
features = [
    "alcohol", # LINE TO BE REMOVED FOR STUDENTS
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train, # LINE TO BE REMOVED FOR STUDENTS
    df_features_and_target_test, # LINE TO BE REMOVED FOR STUDENTS
    features, # LINE TO BE REMOVED FOR STUDENTS
    target, # LINE TO BE REMOVED FOR STUDENTS
)
def split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N=None,
):
    if N is None :
        X_train = df_features_and_target_train[features]
        y_train = df_features_and_target_train[target]
    else :
        X_train = df_features_and_target_train[features].sample(N)
        y_train = df_features_and_target_train[target].loc[X_train.index]
    X_test = df_features_and_target_test[features]
    y_test = df_features_and_target_test[target]
    return X_train, y_train, X_test, y_test


target = [
    "rating" # LINE TO BE REMOVED FOR STUDENTS
] 
features = [
    "alcohol", # LINE TO BE REMOVED FOR STUDENTS
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train, # LINE TO BE REMOVED FOR STUDENTS
    df_features_and_target_test, # LINE TO BE REMOVED FOR STUDENTS
    features, # LINE TO BE REMOVED FOR STUDENTS
    target, # LINE TO BE REMOVED FOR STUDENTS
)

Model training & evaluation¶

Model training¶

Use a pipeline to wrap the model with its automatized preprocessing steps

In [8]:

Copied!





pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(),
)

pipe.fit(
    X_train, # LINE TO BE REMOVED FOR STUDENTS
    y_train # LINE TO BE REMOVED FOR STUDENTS
)
pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(),
)

pipe.fit(
    X_train, # LINE TO BE REMOVED FOR STUDENTS
    y_train # LINE TO BE REMOVED FOR STUDENTS
)

Out[8]:

Pipeline(steps=[('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Model evaluation¶

R2 score¶

To assess the performance of our model, many metrics can be used. The easiest one for regression is $R^2$. It is used to measure the goodness of fit or best-fit line.

$$R^2 = 1 - \frac{\sum_{i}{(y_i - \hat{y}_i)^2}}{\sum_{i}{(y_i - \bar{y}_i)^2}}$$

R2 score in practice¶

In [9]:

Copied!





score_train = pipe.score(
    X_train, 
    y_train
)
score_test = pipe.score(
    X_test, # LINE TO BE REMOVED FOR STUDENTS
    y_test # LINE TO BE REMOVED FOR STUDENTS
)

print(f"R2 (train): {score_train}")
print(f"R2 (test): {score_test}")
score_train = pipe.score(
    X_train, 
    y_train
)
score_test = pipe.score(
    X_test, # LINE TO BE REMOVED FOR STUDENTS
    y_test # LINE TO BE REMOVED FOR STUDENTS
)

print(f"R2 (train): {score_train}")
print(f"R2 (test): {score_test}")

R2 (train): 0.8063431786340525
R2 (test): 0.8024042308518796

Residuals¶

Compute residuals dataframe containing the true & predicted values of y and and the residual, which is the difference between the true & predicted values.

Plot the residuals, using histograms & violinplots.

In [10]:

Copied!





def compute_df_residual(pipe, X, y):
    return pd.DataFrame(
        {
            "y_true": y.rating.reset_index(drop=True),
            "y_pred": pipe.predict(X).reshape(-1),
        }
    ).assign(residual=lambda df: df.y_true - df.y_pred)


def plot_residual(df_residual):
    df_residual.plot.hist(bins=200, subplots=True, layout=(1,3), figsize=(20, 5))
    plt.show()
    sns.violinplot(data=df_residual, x="y_true", y="y_pred")


df_residual_train = compute_df_residual(pipe, X_train, y_train)
df_residual_test = compute_df_residual(pipe, X_test, y_test)
plot_residual(df_residual_train)
plot_residual(df_residual_test)
def compute_df_residual(pipe, X, y):
    return pd.DataFrame(
        {
            "y_true": y.rating.reset_index(drop=True),
            "y_pred": pipe.predict(X).reshape(-1),
        }
    ).assign(residual=lambda df: df.y_true - df.y_pred)


def plot_residual(df_residual):
    df_residual.plot.hist(bins=200, subplots=True, layout=(1,3), figsize=(20, 5))
    plt.show()
    sns.violinplot(data=df_residual, x="y_true", y="y_pred")


df_residual_train = compute_df_residual(pipe, X_train, y_train)
df_residual_test = compute_df_residual(pipe, X_test, y_test)
plot_residual(df_residual_train)
plot_residual(df_residual_test)

No description has been provided for this image

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

Other regression metrics¶

Another popular metric is the Mean Squared Error (MSE)

$$MSE = \frac{1}{N} \sum_{i}(y_i – \hat{y}_i)^2$$

This metric is interesting if we want to penalize large errors. However it is not easily interpretable as it is not in the same unit than the target.

This is why the Root Mean Squared Error (RMSE) is widely used.

$$RMSE = \sqrt{\frac{1}{N} \sum_{i}(y_i – \hat{y}_i)^2}$$

As well as the Mean Absolute Error which is even more interpretable.

$$MAE = \frac{1}{N} \sum_{i}\lvert(y_i – \hat{y}_i)\lvert$$

Other metrics in practice¶

Compute a set of metrics on train & test datasets:

negative RMSE
negative MAE
negative MSE

NB: A "score" function in sklearn obeys to the "greater is better" principle. That's why error functions are converted to their negative versions to obey this principle.

In [11]:

Copied!





def compute_score(pipe, X, y, metric_names=None, label=None, verbose=False):
    if metric_names is None:
        metric_names = ["neg_root_mean_squared_error"]
    scores = dict()
    for metric_name in metric_names:
        metric = metrics.get_scorer(metric_name)
        score = metric(pipe, X, y)
        printed_label = f" ({label})" if label else ""
        if verbose:
            print(f"{metric_name}{printed_label}: {score:.3g}")
        scores.update({f"{metric_name}{printed_label}": score})
    return scores


metric_names = [
    "neg_root_mean_squared_error",
    "neg_mean_absolute_error",
    "neg_mean_squared_error",
]
score_train = compute_score(
    pipe, X_train, y_train, metric_names, label="train", verbose=True
)
score_test = compute_score(
    pipe, X_test, y_test, metric_names, label="test", verbose=True # LINE TO BE REMOVED FOR STUDENTS
)
def compute_score(pipe, X, y, metric_names=None, label=None, verbose=False):
    if metric_names is None:
        metric_names = ["neg_root_mean_squared_error"]
    scores = dict()
    for metric_name in metric_names:
        metric = metrics.get_scorer(metric_name)
        score = metric(pipe, X, y)
        printed_label = f" ({label})" if label else ""
        if verbose:
            print(f"{metric_name}{printed_label}: {score:.3g}")
        scores.update({f"{metric_name}{printed_label}": score})
    return scores


metric_names = [
    "neg_root_mean_squared_error",
    "neg_mean_absolute_error",
    "neg_mean_squared_error",
]
score_train = compute_score(
    pipe, X_train, y_train, metric_names, label="train", verbose=True
)
score_test = compute_score(
    pipe, X_test, y_test, metric_names, label="test", verbose=True # LINE TO BE REMOVED FOR STUDENTS
)

neg_root_mean_squared_error (train): -1.47
neg_mean_absolute_error (train): -1.09
neg_mean_squared_error (train): -2.17
neg_root_mean_squared_error (test): -1.49
neg_mean_absolute_error (test): -1.09
neg_mean_squared_error (test): -2.22

Feature importance¶

Plot model's coefficients in a bar chart

In [12]:

Copied!





def get_feature_names(pipe, X):
    return (
        X.columns
        if len(pipe) == 1
        else pipe[:-1].get_feature_names_out(X.columns)
    )


def plot_features_coefficients(pipe, X):
    df_coef = pd.DataFrame(
        {"coef": pipe[-1].coef_.reshape(-1)}, index=get_feature_names(pipe, X)
    ).assign(color=lambda df: (df.coef >= 0).map({True: "g", False: "r"}))
    df_coef.coef.plot(
        kind="barh", color=df_coef.color, figsize=(6, len(df_coef) * 0.25)
    )


plot_features_coefficients(pipe, X_train)
def get_feature_names(pipe, X):
    return (
        X.columns
        if len(pipe) == 1
        else pipe[:-1].get_feature_names_out(X.columns)
    )


def plot_features_coefficients(pipe, X):
    df_coef = pd.DataFrame(
        {"coef": pipe[-1].coef_.reshape(-1)}, index=get_feature_names(pipe, X)
    ).assign(color=lambda df: (df.coef >= 0).map({True: "g", False: "r"}))
    df_coef.coef.plot(
        kind="barh", color=df_coef.color, figsize=(6, len(df_coef) * 0.25)
    )


plot_features_coefficients(pipe, X_train)

The coefficients of the linear regression can be misleading ! They do not represent the feature importance if the explicative variables are not comparable.

Model improvement¶

Feature engineering¶

Outliers management¶

In [13]:

Copied!

df_master["alcohol"].plot(kind="hist", bins=100)
df_master["alcohol"].plot(kind="hist", bins=100)

Out[13]:

<Axes: ylabel='Frequency'>

We can see that we have few lines with alcohol = -1. Let's remove those lines

In [14]:

Copied!





df_clean = (
    df_master
    .loc[lambda df: df.alcohol >= 0] # LINE TO BE REMOVED FOR STUDENTS
)

df_clean["alcohol"].plot(kind="hist", bins=100)
df_clean = (
    df_master
    .loc[lambda df: df.alcohol >= 0] # LINE TO BE REMOVED FOR STUDENTS
)

df_clean["alcohol"].plot(kind="hist", bins=100)

Out[14]:

<Axes: ylabel='Frequency'>

In [15]:

Copied!





(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1)
(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1)

In [16]:

Copied!





target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(), # LINE TO BE REMOVED FOR STUDENTS
)

pipe.fit(X_train, y_train)
target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(), # LINE TO BE REMOVED FOR STUDENTS
)

pipe.fit(X_train, y_train)

Out[16]:

Pipeline(steps=[('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [17]:

Copied!

plot_features_coefficients(pipe, X_train)
plot_features_coefficients(pipe, X_train)

Let's have a look at the impact on the MAE

In [18]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.08
MAE test : -1.09

Categorical variable encoding¶

How to use categorical variables ?

type	rating
Stout	14
Belgian Ale	11
IPA	13

We have to encode them to numerical values. Multiple approaches exist for that.

Label encoding

type	type_encoded	rating
Stout	1	14
Belgian Ale	2	11
IPA	3	13

Label encoding is a simple way to encode categorical variables. However, it creates a hierarchy/order between the categories, which does not always reflect the reality.

One hot encoding

type	type_Stout	type_Belgian_Ale	type_IPA	rating
Stout	1	0	0	14
Belgian Ale	0	1	0	11
IPA	0	0	1	13

Here, there is no hierarchy, but in the case of a lot of categories it will add many columns to the dataset.

Let's try to one hot encode some categorical variables ! For that, we will create a custom categorical variable from alcohol.

In [19]:

Copied!

df_clean.head(3)
df_clean.head(3)

Out[19]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text	beer_degree	brewery_degree	user_degree	text_length	date
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter choc...	51	413	300	90	2008-07-31 02:00:00
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	blutt59	bottle, received in trade, dark brown with gar...	4	413	300	191	2011-08-06 02:00:00
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	blutt59	12 oz. bottle, amber color with soapy head, sl...	43	413	300	109	2008-03-09 01:00:00

In [20]:

Copied!





def alcohol_level (row):
    if row['alcohol'] < 5.5 :
        return 'Light'
    if row['alcohol'] > 7 :
        return 'Strong'
    else :
        return 'Medium'

df_clean['alcohol_level'] = df_clean.apply(lambda row: alcohol_level(row), axis=1)

df_clean['alcohol_level'].value_counts()
def alcohol_level (row):
    if row['alcohol'] < 5.5 :
        return 'Light'
    if row['alcohol'] > 7 :
        return 'Strong'
    else :
        return 'Medium'

df_clean['alcohol_level'] = df_clean.apply(lambda row: alcohol_level(row), axis=1)

df_clean['alcohol_level'].value_counts()

/tmp/ipykernel_2016/4176320187.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['alcohol_level'] = df_clean.apply(lambda row: alcohol_level(row), axis=1)

Out[20]:

alcohol_level
Light     74945
Strong    62413
Medium    53205
Name: count, dtype: int64

In [21]:

Copied!

df_clean.head()
df_clean.head()

Out[21]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text	beer_degree	brewery_degree	user_degree	text_length	date	alcohol_level
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter choc...	51	413	300	90	2008-07-31 02:00:00	Light
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	blutt59	bottle, received in trade, dark brown with gar...	4	413	300	191	2011-08-06 02:00:00	Strong
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	blutt59	12 oz. bottle, amber color with soapy head, sl...	43	413	300	109	2008-03-09 01:00:00	Light
3	Breckenridge Lucky U IPA	383	6.20	India Pale Ale (IPA)	3	6	3	7	12	1255737600	blutt59	bottle, golden orange color with light tan foa...	20	413	300	151	2009-10-17 02:00:00	Medium
4	Fullers Vintage Ale 2009	55	8.50	English Strong Ale	3	7	3	8	14	1282003200	blutt59	bottle, thanks to SS, almond amber colored pou...	18	978	300	159	2010-08-17 02:00:00	Strong

In [22]:

Copied!





(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1, random_state=1)

target = ["rating"]
features = [
    "alcohol_level",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

X_train = (
    pd.get_dummies(X_train, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)
X_test = (
    pd.get_dummies(X_test, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)
(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1, random_state=1)

target = ["rating"]
features = [
    "alcohol_level",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

X_train = (
    pd.get_dummies(X_train, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)
X_test = (
    pd.get_dummies(X_test, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)

In [23]:

Copied!

pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

plot_features_coefficients(pipe, X_train)
pipe = pipeline.make_pipeline(
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

plot_features_coefficients(pipe, X_train)

In [24]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.08
MAE test : -1.09

Feature scaling¶

In [25]:

Copied!





target = ["rating"]
features = [
    "alcohol_level",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

X_train = pd.get_dummies(X_train, columns=["alcohol_level"])
X_test = pd.get_dummies(X_test, columns=["alcohol_level"])

pipe = pipeline.make_pipeline(
    compose.ColumnTransformer([
        ('scaler', preprocessing.StandardScaler(), ['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']),
        ('passthrough', "passthrough", ["alcohol_level_Strong", "alcohol_level_Medium", "alcohol_level_Light"])
    ], remainder='passthrough'),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)
target = ["rating"]
features = [
    "alcohol_level",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

X_train = pd.get_dummies(X_train, columns=["alcohol_level"])
X_test = pd.get_dummies(X_test, columns=["alcohol_level"])

pipe = pipeline.make_pipeline(
    compose.ColumnTransformer([
        ('scaler', preprocessing.StandardScaler(), ['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']),
        ('passthrough', "passthrough", ["alcohol_level_Strong", "alcohol_level_Medium", "alcohol_level_Light"])
    ], remainder='passthrough'),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

Out[25]:

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('scaler', StandardScaler(),
                                                  ['rating_appearance',
                                                   'rating_aroma',
                                                   'rating_palate',
                                                   'rating_taste']),
                                                 ('passthrough', 'passthrough',
                                                  ['alcohol_level_Strong',
                                                   'alcohol_level_Medium',
                                                   'alcohol_level_Light'])])),
                ('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Check the features created & the transformation applied to them, using histograms.

In [26]:

Copied!





def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(20, 10))
        )
    else:
        print("no plot: features not transformed")


plot_features_transformed_histograms(pipe, X_train)
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(20, 10))
        )
    else:
        print("no plot: features not transformed")


plot_features_transformed_histograms(pipe, X_train)

And now we can check again the linear regression coefficients. They can now be seen as feature importance.

In [27]:

Copied!

plot_features_coefficients(pipe, X_train)
plot_features_coefficients(pipe, X_train)

In [28]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.08
MAE test : -1.09

Feature standardization¶

We have also seen that some features do not have a gaussian distribution. However one of the assumption made by the linear regression is that all the features follow a gaussian distribution. We can use some transformers to make the features more normal.

In [29]:

Copied!





target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.PowerTransformer(), # LINE TO BE REMOVED FOR STUDENTS
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)
target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.PowerTransformer(), # LINE TO BE REMOVED FOR STUDENTS
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

Out[29]:

Pipeline(steps=[('powertransformer', PowerTransformer()),
                ('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [30]:

Copied!





def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(23, 10))
        )
    else:
        print("no plot: features not transformed")

plot_features_transformed_histograms(pipe, X_train)
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(23, 10))
        )
    else:
        print("no plot: features not transformed")

plot_features_transformed_histograms(pipe, X_train)

In [31]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.19
MAE test : -1.2

In [32]:

Copied!

plot_features_coefficients(pipe, X_train)
plot_features_coefficients(pipe, X_train)

Polynomial Features (optional)¶

Another possible feature engineering step is to add some non linearity. As the linear regression is a linear model, the non linearity can be managed thanks to the use of polynomial features.

In [33]:

Copied!





target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(degree=2), # LINE TO BE REMOVED FOR STUDENTS
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)
target = ["rating"]
features = [
    "alcohol",
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(degree=2), # LINE TO BE REMOVED FOR STUDENTS
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

Out[33]:

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [34]:

Copied!

plot_features_coefficients(pipe, X_train)
plot_features_coefficients(pipe, X_train)

In [35]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.06
MAE test : -1.07

Review feature extraction¶

Word counts & mean review¶

Compute the "word counts" dataframe:

containing:
- count of words in review
- mean rating review of the reviews containing this text
remove stop words
remove single letter words
keep only the 1000 most frequent words

In [36]:

Copied!





def tokenize(serie):
    return (
        (serie)
        .str.lower() # LINE TO BE REMOVED FOR STUDENTS
        .str.replace(r"[^a-z]", " ")
        .str.replace(r" +", " ")
        .str.split(" ")
    )


df_words_count = (
    (df_clean)
    .assign(tokenized_text=lambda df: tokenize(df.text))
    .loc[:, ["rating", "tokenized_text"]]
    .explode("tokenized_text")
    .groupby("tokenized_text", as_index=False)
    .agg(["mean", "count"])
    .reset_index()
    .sort_values(by=("rating", "count"), ascending=False)
    .loc[lambda df: ~df.tokenized_text.isin(list(STOPWORDS))]
    .loc[lambda df: df.tokenized_text.str.len() > 1]
    .head(1000) # LINE TO BE REMOVED FOR STUDENTS
)
def tokenize(serie):
    return (
        (serie)
        .str.lower() # LINE TO BE REMOVED FOR STUDENTS
        .str.replace(r"[^a-z]", " ")
        .str.replace(r" +", " ")
        .str.split(" ")
    )


df_words_count = (
    (df_clean)
    .assign(tokenized_text=lambda df: tokenize(df.text))
    .loc[:, ["rating", "tokenized_text"]]
    .explode("tokenized_text")
    .groupby("tokenized_text", as_index=False)
    .agg(["mean", "count"])
    .reset_index()
    .sort_values(by=("rating", "count"), ascending=False)
    .loc[lambda df: ~df.tokenized_text.isin(list(STOPWORDS))]
    .loc[lambda df: df.tokenized_text.str.len() > 1]
    .head(1000) # LINE TO BE REMOVED FOR STUDENTS
)

In [37]:

Copied!

df_words_count.sort_values(by=[('rating',  'mean')])
df_words_count.sort_values(by=[('rating',  'mean')])

Out[37]:

	index	tokenized_text	rating
			mean	count
87390	87390	corn	8.36	2165
76903	76903	cheap	8.79	1153
252071	252071	water	9.10	1129
60309	60309	bland	9.74	1372
62886	62886	boring	9.92	1045
...	...	...	...	...
188864	188864	perfect	16.08	2141
257196	257196	wonderful	16.25	2040
109976	109976	fantastic	16.29	1047
42498	42498	amazing	16.34	1184
49833	49833	awesome	16.35	1122

1000 rows × 4 columns

Word mean review¶

Compute the "word ratings" dictionary containing the mean review for each word

In [38]:

Copied!

word_mean_review = (
    df_words_count.set_index("tokenized_text").rating["mean"].to_dict()
)
word_mean_review = (
    df_words_count.set_index("tokenized_text").rating["mean"].to_dict()
)

Positive & negative words sets¶

Compute the "positive words" set containing the words with a rating >= 14.5
Compute the "negative words" set containing the words with a rating <= 13

In [39]:

Copied!





positive_words_set = set(
    df_words_count.loc[lambda df: df.rating["mean"] >= 14.5]
    .loc[:, "tokenized_text"]
    .tolist()
)
negative_words_set = set(
    df_words_count.loc[lambda df: df.rating["mean"] <= 13] # LINE TO BE REMOVED FOR STUDENTS
    .loc[:, "tokenized_text"]
    .tolist()
)
positive_words_set = set(
    df_words_count.loc[lambda df: df.rating["mean"] >= 14.5]
    .loc[:, "tokenized_text"]
    .tolist()
)
negative_words_set = set(
    df_words_count.loc[lambda df: df.rating["mean"] <= 13] # LINE TO BE REMOVED FOR STUDENTS
    .loc[:, "tokenized_text"]
    .tolist()
)

In [40]:

Copied!

print(len(negative_words_set))
print(len(positive_words_set))
print(len(negative_words_set))
print(len(positive_words_set))

167
155

Features & target¶

Compute the "features & target" dataframe containing:

tokenized text
mean word rating
positive words count
negative words count
negative words negative count

In [41]:

Copied!





def compute_mean_word_rating(words, word_mean_review):
    return np.mean(
        [word_mean_review[w] for w in words if w in word_mean_review] or [0]
    )


def count_words_in_set(words, word_set):
    return len(set(words) & word_set)


df_features_and_target = (
    (df_clean)
    .assign(tokenized_text=lambda df: tokenize(df.text))
    .assign(
        mean_word_rating=lambda df: (df.tokenized_text).apply(
            compute_mean_word_rating, args=(word_mean_review,)
        )
    )
    .assign(
        positive_words_count=lambda df: (df.tokenized_text).apply(
            count_words_in_set, args=(positive_words_set,)
        )
    )
    .assign(
        negative_words_count=lambda df: (df.tokenized_text).apply(
            count_words_in_set, args=(negative_words_set,)
        )
    )
)
def compute_mean_word_rating(words, word_mean_review):
    return np.mean(
        [word_mean_review[w] for w in words if w in word_mean_review] or [0]
    )


def count_words_in_set(words, word_set):
    return len(set(words) & word_set)


df_features_and_target = (
    (df_clean)
    .assign(tokenized_text=lambda df: tokenize(df.text))
    .assign(
        mean_word_rating=lambda df: (df.tokenized_text).apply(
            compute_mean_word_rating, args=(word_mean_review,)
        )
    )
    .assign(
        positive_words_count=lambda df: (df.tokenized_text).apply(
            count_words_in_set, args=(positive_words_set,)
        )
    )
    .assign(
        negative_words_count=lambda df: (df.tokenized_text).apply(
            count_words_in_set, args=(negative_words_set,)
        )
    )
)

In [42]:

Copied!

df_features_and_target.head(5)
df_features_and_target.head(5)

Out[42]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	...	brewery_degree	user_degree	text_length	date	alcohol_level	tokenized_text	mean_word_rating	positive_words_count	negative_words_count
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	...	413	300	90	2008-07-31 02:00:00	Light	[bottle,, oat, nose, with, black, color,, bitt...	14.09	4	0
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	...	413	300	191	2011-08-06 02:00:00	Strong	[bottle,, received, in, trade,, dark, brown, w...	13.92	4	1
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	...	413	300	109	2008-03-09 01:00:00	Light	[12, oz., bottle,, amber, color, with, soapy, ...	13.24	0	3
3	Breckenridge Lucky U IPA	383	6.20	India Pale Ale (IPA)	3	6	3	7	12	1255737600	...	413	300	151	2009-10-17 02:00:00	Medium	[bottle,, golden, orange, color, with, light, ...	13.64	2	2
4	Fullers Vintage Ale 2009	55	8.50	English Strong Ale	3	7	3	8	14	1282003200	...	978	300	159	2010-08-17 02:00:00	Strong	[bottle,, thanks, to, ss,, almond, amber, colo...	13.46	1	3

5 rows × 23 columns

In [43]:

Copied!





(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_features_and_target, test_size=0.1, random_state=1)
(
    df_features_and_target_train,
    df_features_and_target_test,
) = model_selection.train_test_split(df_features_and_target, test_size=0.1, random_state=1)

In [44]:

Copied!





target = ["rating"]
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count", # LINE TO BE REMOVED FOR STUDENTS
    "negative_words_count", # LINE TO BE REMOVED FOR STUDENTS
    "mean_word_rating" # LINE TO BE REMOVED FOR STUDENTS
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)
target = ["rating"]
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count", # LINE TO BE REMOVED FOR STUDENTS
    "negative_words_count", # LINE TO BE REMOVED FOR STUDENTS
    "mean_word_rating" # LINE TO BE REMOVED FOR STUDENTS
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

pipe = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)

pipe.fit(X_train, y_train)

Out[44]:

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [45]:

Copied!

plot_features_coefficients(pipe, X_train)
plot_features_coefficients(pipe, X_train)

In [46]:

Copied!





metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))

MAE train : -1.08
MAE test : -1.09

Learning curve¶

Simple model¶

Compute the training & validation R2-scores for various training sets sizes

In [47]:

Copied!





target = ["rating"]
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count",
    "negative_words_count",
    "mean_word_rating"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

model = linear_model.LinearRegression()
train_sizes_percent = np.geomspace(0.001, 1, 20)
(
    train_sizes,
    train_scores,
    validation_scores,
    _,
    _,
) = model_selection.learning_curve(
    model,
    X_train,
    y_train,
    cv=7,
    n_jobs=-1,
    train_sizes=train_sizes_percent,
    return_times=True,
    verbose=1,
)
target = ["rating"]
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count",
    "negative_words_count",
    "mean_word_rating"
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
)

model = linear_model.LinearRegression()
train_sizes_percent = np.geomspace(0.001, 1, 20)
(
    train_sizes,
    train_scores,
    validation_scores,
    _,
    _,
) = model_selection.learning_curve(
    model,
    X_train,
    y_train,
    cv=7,
    n_jobs=-1,
    train_sizes=train_sizes_percent,
    return_times=True,
    verbose=1,
)

[learning_curve] Training set sizes: [   147    211    304    437    629    905   1302   1873   2694   3876
   5575   8019  11536  16594  23870  34336  49391  71046 102196 147005]

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed:    2.8s finished

Plot the training & validation scores.

Since they come from a K-fold cross-validation, you have K values for each training set size: use seaborn's lineplot to plot the mean value with a confidence interval.

In [48]:

Copied!





def compute_df_sizes(train_sizes):
    return pd.DataFrame(
        {
            "train_size_id": range(len(train_sizes)),
            "train_size": train_sizes,
        }
    )


def compute_df_score(scores, name):
    df = (
        pd.DataFrame(scores)
        .stack()
        .rename_axis(index=("train_size_id", "run_id"))
        .rename(name)
        .reset_index()
    )
    return df


def compute_df_scores_long(train_scores, validation_scores, train_sizes):
    return (
        compute_df_score(train_scores, name="train_score")
        .merge(compute_df_score(validation_scores, name="validation_score"))
        .merge(compute_df_sizes(train_sizes), on="train_size_id")
        .melt(
            id_vars=["train_size_id", "run_id", "train_size"],
            value_vars=["train_score", "validation_score"],
            var_name="score_name",
            value_name="score",
        )
    )


df_scores_long = compute_df_scores_long(
    train_scores, validation_scores, train_sizes
)
sns.lineplot(
    data=df_scores_long.loc[lambda df: df.score > 0.5],
    x="train_size",
    y="score",
    hue="score_name",
    style="score_name",
    dashes=False,
    markers=True,
).set(xscale="log")
def compute_df_sizes(train_sizes):
    return pd.DataFrame(
        {
            "train_size_id": range(len(train_sizes)),
            "train_size": train_sizes,
        }
    )


def compute_df_score(scores, name):
    df = (
        pd.DataFrame(scores)
        .stack()
        .rename_axis(index=("train_size_id", "run_id"))
        .rename(name)
        .reset_index()
    )
    return df


def compute_df_scores_long(train_scores, validation_scores, train_sizes):
    return (
        compute_df_score(train_scores, name="train_score")
        .merge(compute_df_score(validation_scores, name="validation_score"))
        .merge(compute_df_sizes(train_sizes), on="train_size_id")
        .melt(
            id_vars=["train_size_id", "run_id", "train_size"],
            value_vars=["train_score", "validation_score"],
            var_name="score_name",
            value_name="score",
        )
    )


df_scores_long = compute_df_scores_long(
    train_scores, validation_scores, train_sizes
)
sns.lineplot(
    data=df_scores_long.loc[lambda df: df.score > 0.5],
    x="train_size",
    y="score",
    hue="score_name",
    style="score_name",
    dashes=False,
    markers=True,
).set(xscale="log")

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

Out[48]:

[None]

Plot the features importance of the linear model trained on the full dataset

In [49]:

Copied!

pipe = pipeline.make_pipeline(model)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
pipe = pipeline.make_pipeline(model)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)

Complex model (optional)¶

Compute the training & validation R2-scores for various training sets sizes

Plot the training & validation scores.

In [50]:

Copied!





features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count",
    "negative_words_count",
    "mean_word_rating"
]

pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.LinearRegression(),
)
(
    train_sizes,
    train_scores,
    validation_scores,
    fit_times,
    _,
) = model_selection.learning_curve(
    pipe,
    X_train,
    y_train,
    cv=5,
    n_jobs=-1,
    train_sizes=train_sizes_percent,
    return_times=True,
    verbose=1,
)

df_scores_long = compute_df_scores_long(
    train_scores, validation_scores, train_sizes
)
sns.lineplot(
    data=df_scores_long.loc[lambda df: df.score > 0.5],
    x="train_size",
    y="score",
    hue="score_name",
    style="score_name",
    dashes=False,
    markers=True,
).set(xscale="log")
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "alcohol",
    "user_degree",
    "positive_words_count",
    "negative_words_count",
    "mean_word_rating"
]

pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.LinearRegression(),
)
(
    train_sizes,
    train_scores,
    validation_scores,
    fit_times,
    _,
) = model_selection.learning_curve(
    pipe,
    X_train,
    y_train,
    cv=5,
    n_jobs=-1,
    train_sizes=train_sizes_percent,
    return_times=True,
    verbose=1,
)

df_scores_long = compute_df_scores_long(
    train_scores, validation_scores, train_sizes
)
sns.lineplot(
    data=df_scores_long.loc[lambda df: df.score > 0.5],
    x="train_size",
    y="score",
    hue="score_name",
    style="score_name",
    dashes=False,
    markers=True,
).set(xscale="log")

[learning_curve] Training set sizes: [   137    197    283    408    587    844   1215   1748   2514   3617
   5203   7485  10767  15488  22278  32047  46098  66309  95383 137204]

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   44.2s finished
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

Out[50]:

[None]

Plot the features importance of the linear model with polynomial features trained on the full dataset with 6 features

In [51]:

Copied!

pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)

Overfitting & Underfitting¶

Overfitting happens when a model fits quite well to the training data but does not generalize well on new data it has never seen. It can have multiple causes :

The training dataset is not representative of the real data
The model is too complex and learned from noise on the training dataset

Underfitting happens when the model is too simple to capture the data patterns. in this case, the model would perform bad on the train and on the test dataset.

To avoid overfitting, multiple actions can be performed.

From here we'll use a dataset containing 2000 rows & 7 features

In [52]:

Copied!





N = 2000
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "mean_word_rating",
    "positive_words_count",
    "negative_words_count",
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N,
)
N = 2000
features = [
    "rating_appearance",
    "rating_aroma",
    "rating_palate",
    "rating_taste",
    "mean_word_rating",
    "positive_words_count",
    "negative_words_count",
]
X_train, y_train, X_test, y_test = split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N,
)

Model regularization¶

The regularization allows to reduce the model complexity. For that, we will not only minimize our cost error, but we will minimize an objective function with a regularization term.

$$min \sum_{i}(y_i – X_i\beta)^2 + Regularization$$

Ridge - L2 regularization¶

For the ridge regression, the regularization term will be the L2-norm of the coefficients :

$$min \sum_{i}(y_i – X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_2^2$$

where $$\vert\vert\beta\vert\vert_2^2 = \sum_{j}\beta_j^2$$

It is mainly used in the case of correlated features, which can bring unexpected behaviours with regard to the coefficients.

Simple loop on regularization parameter¶

Fit n models for n values of alpha (regularization parameter)
Compute the scores
Retrieve the values of coefficients

In [53]:

Copied!





alphas = np.logspace(-6, 3, 28)

def compute_scores_and_coefs(pipe, alphas):
    scores = dict()
    coefs = dict()
    for alpha in tqdm(alphas):
        pipe[-1].set_params(alpha=alpha)
        pipe.fit(X_train, y_train)
        score_train = compute_score(pipe, X_train, y_train, metric_names = ["neg_mean_absolute_error"], label="train")
        score_test = compute_score(pipe, X_test, y_test, metric_names = ["neg_mean_absolute_error"], label="test")
        scores.update({alpha: {**score_train, **score_test}})
        coef = dict(
            zip(get_feature_names(pipe, X_train), pipe[-1].coef_.reshape(-1))
        )
        coefs.update({alpha: coef})
    return scores, coefs


pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.Ridge(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
alphas = np.logspace(-6, 3, 28)

def compute_scores_and_coefs(pipe, alphas):
    scores = dict()
    coefs = dict()
    for alpha in tqdm(alphas):
        pipe[-1].set_params(alpha=alpha)
        pipe.fit(X_train, y_train)
        score_train = compute_score(pipe, X_train, y_train, metric_names = ["neg_mean_absolute_error"], label="train")
        score_test = compute_score(pipe, X_test, y_test, metric_names = ["neg_mean_absolute_error"], label="test")
        scores.update({alpha: {**score_train, **score_test}})
        coef = dict(
            zip(get_feature_names(pipe, X_train), pipe[-1].coef_.reshape(-1))
        )
        coefs.update({alpha: coef})
    return scores, coefs


pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.Ridge(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)

100%|██████████| 28/28 [00:03<00:00,  7.61it/s]

Plot the evolution of the score as a function of alpha

In [54]:

Copied!





(
    pd.DataFrame.from_dict(scores, orient="index")
    .plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
(
    pd.DataFrame.from_dict(scores, orient="index")
    .plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)

Out[54]:

<Axes: >

Plot the evolution of the coefficients as a function of alpha

In [55]:

Copied!





(
    pd.DataFrame.from_dict(coefs, orient="index").plot(
        logx=True, legend=False, marker="."
    )
)
(
    pd.DataFrame.from_dict(coefs, orient="index").plot(
        logx=True, legend=False, marker="."
    )
)

Out[55]:

<Axes: >

Plot the features importance of a model trained with a specific alpha value (e.g: alpha=1e-1)

In [56]:

Copied!





pipe[-1].set_params(
    alpha=1e1 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
pipe[-1].set_params(
    alpha=1e1 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)

Lasso - L1 regularization¶

For the lasso regression, the regularization term will be the L1-norm of the coefficients :

$$min \sum_{i}(y_i – X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_1^2$$

where $$\vert\vert\beta\vert\vert_1^2 = \sum_{j}\vert\beta_j\vert$$

Simple loop on regularization parameter¶

Fit n models for n values of alpha (regularization parameter)
Compute the scores
Retrieve the values of coefficients

In [57]:

Copied!





pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.Lasso(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(include_bias=False, degree=2),
    preprocessing.PowerTransformer(),
    linear_model.Lasso(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)

  0%|          | 0/28 [00:00<?, ?it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.090e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
  4%|▎         | 1/28 [00:00<00:04,  5.43it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.088e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
  7%|▋         | 2/28 [00:00<00:05,  5.11it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.084e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 11%|█         | 3/28 [00:00<00:04,  5.22it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.076e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 14%|█▍        | 4/28 [00:00<00:04,  5.30it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.057e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 18%|█▊        | 5/28 [00:00<00:04,  5.32it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.016e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 21%|██▏       | 6/28 [00:01<00:04,  5.35it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.924e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 25%|██▌       | 7/28 [00:01<00:04,  5.19it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.730e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 29%|██▊       | 8/28 [00:01<00:03,  5.24it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.388e+03, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 32%|███▏      | 9/28 [00:01<00:03,  5.34it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.910e+02, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 36%|███▌      | 10/28 [00:01<00:03,  5.42it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.635e+01, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 39%|███▉      | 11/28 [00:02<00:03,  5.48it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.007e+01, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 43%|████▎     | 12/28 [00:02<00:02,  5.58it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.858e+00, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 46%|████▋     | 13/28 [00:02<00:02,  5.54it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.213e+00, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
 50%|█████     | 14/28 [00:02<00:02,  5.73it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.760e+00, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(
100%|██████████| 28/28 [00:04<00:00,  6.21it/s]

Plot the evolution of the score as a function of alpha

In [58]:

Copied!





(
    pd.DataFrame.from_dict(scores, orient="index")
    .plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
(
    pd.DataFrame.from_dict(scores, orient="index")
    .plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)

Out[58]:

<Axes: >

Plot the evolution of the coefficients as a function of alpha

In [59]:

Copied!





(
    pd.DataFrame.from_dict(coefs, orient="index").plot(
        logx=True, legend=False, marker="."
    )
)
(
    pd.DataFrame.from_dict(coefs, orient="index").plot(
        logx=True, legend=False, marker="."
    )
)

Out[59]:

<Axes: >

Plot the features importance of a model trained with a specific alpha value (e.g: alpha=1e-2)

In [60]:

Copied!





pipe[-1].set_params(
    alpha=1e-2 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print(score_train)
print(score_test)
pipe[-1].set_params(
    alpha=1e-2 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print(score_train)
print(score_test)

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.858e+00, tolerance: 2.331e+00
  model = cd_fast.enet_coordinate_descent(

0.8110210044666418
0.7937570454994202