🐍 Practice n°3: regression (students version)¶
The objective of this session is to understand regression problems and the analytical approach to solving them. This notebook mixes theoretical notions with practical exercises.
Here are the main steps of the notebook:
- What is regression?
- Focus on linear regression
- Preparation
- Implementation of a linear regression
- Model improvement
1. What is regression?¶
Regression in machine learning consists of mathematical methods that predict a continuous outcome (y) from the value of one or more predictor variables (x).
1D Example¶
x = Number of aircraft in an airspace sector
y = Subjective rating of air traffic controller workload
To predict the air traffic controller workload, we need to find the best function y=f(x). For that, we will define an error and find the function which minimizes this error.
Generalisation¶
In reality, we often deal with multiple explanatory variables, and the dimensionality of the input data is greater than 1.
$$x = (x_1,...,x_n)$$ $$\hat{y} = f(x_1,...,x_n)$$
If we come back to our traffic control example, the air traffic controller workload can be a function of both the number of aircraft and the weather.
2. Focus on linear regression¶
Linear regression is a regression using a linear function: ŷ = ax + b
1D Example¶
Generalisation¶
In reality, we often deal with multiple explanatory variables, and the dimensionality of the input data is greater than 1. The linear function then looks like:
$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$
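In code, this linear combination is just a dot product. A minimal sketch with made-up values (the $\beta$ coefficients below are illustrative, not fitted on any data):

```python
import numpy as np

# Illustrative coefficients for a 2-feature linear model (not fitted values)
beta_0 = 1.0                  # intercept
beta = np.array([0.5, -2.0])  # one weight per feature

# One observation x = (x_1, x_2)
x = np.array([4.0, 1.5])

# y_hat = beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat = beta_0 + beta @ x
print(y_hat)  # 1.0 + 0.5*4.0 - 2.0*1.5 = 0.0
```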
Now let's practice with the ratebeer dataset!
3. Preparation¶
Install & import modules¶
import pandas as pd
from wordcloud import STOPWORDS
import numpy as np
from sklearn import (
    linear_model,
    preprocessing,
    pipeline,
    model_selection,
    metrics,
    compose,
)
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})
Read remote dataset¶
The data is in this git repository: ML-boot-camp/ratebeer.git.
It is located in the ratebeer/data/ folder.
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
4. Implementation of a linear regression¶
In this first section, you will train a linear regression model to predict the overall rating of a review based on its other ratings. For that, you will:
- Load & describe the data
- Split the data between train and test set
- Train and evaluate the linear regression model
Load data & describe¶
df_master = pd.read_parquet(file_url)
df_master.head()
df_master.shape
Splits: Train/test & features/target¶
The goal of supervised learning is to build a model that performs well on new data. If you had new data, you could check how your model performs on it. In practice you may not have new data yet, but you can simulate this situation by splitting your dataset into a train set and a test set.
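As a sketch of the idea, on a toy dataset (the values below are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 10% of the rows to play the role of "new" data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
print(len(X_tr), len(X_te))  # 9 1
```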
Train/test split¶
Split keeping 10% of the data in the test set.
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(
df_master,
*** FILL THE MISSING LINE ***
)
def split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N=None,
):
    if N is None:
        X_train = df_features_and_target_train[features]
        y_train = df_features_and_target_train[target]
    else:
        X_train = df_features_and_target_train[features].sample(N)
        y_train = df_features_and_target_train[target].loc[X_train.index]
    X_test = df_features_and_target_test[features]
    y_test = df_features_and_target_test[target]
    return X_train, y_train, X_test, y_test
target = [
*** FILL THE MISSING LINE ***
]
features = [
*** FILL THE MISSING LINE ***
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
Model training & evaluation¶
Model training¶
Use a pipeline to wrap the model with its automated preprocessing steps.
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
Model evaluation¶
R2 score¶
To assess the performance of our model, many metrics can be used. The most common one for regression is $R^2$, which measures the goodness of fit of the model to the data.
$$R^2 = 1 - \frac{\sum_{i}{(y_i - \hat{y}_i)^2}}{\sum_{i}{(y_i - \bar{y})^2}}$$
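To see that $R^2$ is nothing magical, here is the formula above computed by hand on toy values, checked against sklearn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true / predicted values (illustrative)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)  # 0.975
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```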
R2 score in practice¶
score_train = pipe.score(
X_train,
y_train
)
score_test = pipe.score(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
print(f"R2 (train): {score_train}")
print(f"R2 (test): {score_test}")
Residuals¶
Compute the residuals dataframe containing the true & predicted values of y and the residual, which is the difference between the true & predicted values.
Plot the residuals, using histograms & violin plots.
def compute_df_residual(pipe, X, y):
    return pd.DataFrame(
        {
            "y_true": y.rating.reset_index(drop=True),
            "y_pred": pipe.predict(X).reshape(-1),
        }
    ).assign(residual=lambda df: df.y_true - df.y_pred)
def plot_residual(df_residual):
    df_residual.plot.hist(bins=200, subplots=True, layout=(1, 3), figsize=(20, 5))
    plt.show()
    sns.violinplot(data=df_residual, x="y_true", y="y_pred")
df_residual_train = compute_df_residual(pipe, X_train, y_train)
df_residual_test = compute_df_residual(pipe, X_test, y_test)
plot_residual(df_residual_train)
plot_residual(df_residual_test)
Other regression metrics¶
Another popular metric is the Mean Squared Error (MSE):
$$MSE = \frac{1}{N} \sum_{i}(y_i - \hat{y}_i)^2$$
This metric is interesting if we want to penalize large errors. However, it is not easily interpretable as it is not in the same unit as the target.
This is why the Root Mean Squared Error (RMSE) is widely used:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i}(y_i - \hat{y}_i)^2}$$
As is the Mean Absolute Error (MAE), which is even more interpretable:
$$MAE = \frac{1}{N} \sum_{i}\lvert y_i - \hat{y}_i \rvert$$
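A toy computation of these three metrics (illustrative values):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 16.0])

errors = y_true - y_pred             # [-1, 0, -2]
mse = np.mean(errors ** 2)           # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                  # back in the same unit as the target
mae = np.mean(np.abs(errors))        # (1 + 0 + 2) / 3

print(mse, rmse, mae)
```

Note how the MSE is dominated by the single large error, while the MAE weights all errors equally.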
Other metrics in practice¶
Compute a set of metrics on train & test datasets:
- negative RMSE
- negative MAE
- negative MSE
NB: A "score" function in sklearn follows the "greater is better" principle. That's why error functions are converted to their negative versions, so that they obey this principle.
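A quick check of this convention on a toy fitted model (toy data, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.0])

model = LinearRegression().fit(X, y)

# The scorer returns -MAE, so that "greater is better" holds
neg_mae = get_scorer("neg_mean_absolute_error")(model, X, y)
mae = mean_absolute_error(y, model.predict(X))

print(neg_mae, mae)
assert np.isclose(neg_mae, -mae)
```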
def compute_score(pipe, X, y, metric_names=None, label=None, verbose=False):
    if metric_names is None:
        metric_names = ["neg_root_mean_squared_error"]
    scores = dict()
    for metric_name in metric_names:
        metric = metrics.get_scorer(metric_name)
        score = metric(pipe, X, y)
        printed_label = f" ({label})" if label else ""
        if verbose:
            print(f"{metric_name}{printed_label}: {score:.3g}")
        scores.update({f"{metric_name}{printed_label}": score})
    return scores
metric_names = [
"neg_root_mean_squared_error",
"neg_mean_absolute_error",
"neg_mean_squared_error",
]
score_train = compute_score(
pipe, X_train, y_train, metric_names, label="train", verbose=True
)
score_test = compute_score(
*** FILL THE MISSING LINE ***
)
Feature importance¶
Plot the model's coefficients in a bar chart.
def get_feature_names(pipe, X):
    return (
        X.columns
        if len(pipe) == 1
        else pipe[:-1].get_feature_names_out(X.columns)
    )

def plot_features_coefficients(pipe, X):
    df_coef = pd.DataFrame(
        {"coef": pipe[-1].coef_.reshape(-1)}, index=get_feature_names(pipe, X)
    ).assign(color=lambda df: (df.coef >= 0).map({True: "g", False: "r"}))
    df_coef.coef.plot(
        kind="barh", color=df_coef.color, figsize=(6, len(df_coef) * 0.25)
    )
plot_features_coefficients(pipe, X_train)
The coefficients of the linear regression can be misleading! They do not represent feature importance if the explanatory variables are not comparable.
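A toy demonstration of why: expressing the same feature in a different unit rescales its coefficient, even though the fitted model is unchanged (synthetic data, for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)

# The same feature expressed in two different units (e.g. metres vs millimetres)
coef_m = LinearRegression().fit(x, y).coef_[0]
coef_mm = LinearRegression().fit(x * 1000, y).coef_[0]

# Identical model and predictions, but the coefficient shrinks by the scale
# factor: raw coefficients are not comparable across differently-scaled features
print(coef_m, coef_mm)
assert np.isclose(coef_m, coef_mm * 1000)
```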
5. Model improvement¶
Feature engineering¶
Outliers management¶
df_master["alcohol"].plot(kind="hist", bins=100)
We can see that a few lines have alcohol = -1. Let's remove those lines.
df_clean = (
df_master
*** FILL THE MISSING LINE ***
)
df_clean["alcohol"].plot(kind="hist", bins=100)
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1)
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Let's have a look at the impact on the MAE
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Categorical variable encoding¶
How to use categorical variables?
| type | rating |
|---|---|
| Stout | 14 |
| Belgian Ale | 11 |
| IPA | 13 |
We have to encode them as numerical values. Multiple approaches exist for that.
Label encoding
| type | type_encoded | rating |
|---|---|---|
| Stout | 1 | 14 |
| Belgian Ale | 2 | 11 |
| IPA | 3 | 13 |
Label encoding is a simple way to encode categorical variables. However, it creates a hierarchy/order between the categories, which does not always reflect reality.
One hot encoding
| type | type_Stout | type_Belgian_Ale | type_IPA | rating |
|---|---|---|---|---|
| Stout | 1 | 0 | 0 | 14 |
| Belgian Ale | 0 | 1 | 0 | 11 |
| IPA | 0 | 0 | 1 | 13 |
Here, there is no hierarchy, but when there are many categories this approach adds many columns to the dataset.
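A minimal sketch of one-hot encoding with pandas' `get_dummies`, on the toy table above:

```python
import pandas as pd

df = pd.DataFrame({"type": ["Stout", "Belgian Ale", "IPA"], "rating": [14, 11, 13]})

# One 0/1 column per category; the original `type` column is dropped
df_encoded = pd.get_dummies(df, columns=["type"])
print(df_encoded.columns.tolist())
# ['rating', 'type_Belgian Ale', 'type_IPA', 'type_Stout']
```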
Let's try to one-hot encode some categorical variables! For that, we will create a custom categorical variable from alcohol.
df_clean.head(3)
def alcohol_level(row):
    if row["alcohol"] < 5.5:
        return "Light"
    elif row["alcohol"] > 7:
        return "Strong"
    else:
        return "Medium"

df_clean["alcohol_level"] = df_clean.apply(alcohol_level, axis=1)
df_clean['alcohol_level'].value_counts()
df_clean.head()
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = (
*** FILL THE MISSING LINE ***
)
X_test = (
*** FILL THE MISSING LINE ***
)
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Feature scaling¶
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = pd.get_dummies(X_train, columns=["alcohol_level"])
X_test = pd.get_dummies(X_test, columns=["alcohol_level"])
pipe = pipeline.make_pipeline(
compose.ColumnTransformer([
('scaler', preprocessing.StandardScaler(), ['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']),
('passthrough', "passthrough", ["alcohol_level_Strong", "alcohol_level_Medium", "alcohol_level_Light"])
], remainder='passthrough'),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Check the features created & the transformation applied to them, using histograms.
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(20, 10))
        )
    else:
        print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
And now we can check again the linear regression coefficients. They can now be seen as feature importance.
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Feature standardization¶
We have also seen that some features are far from normally distributed. Linear regression does not strictly require Gaussian features, but classical linear-model inference assumes normally distributed residuals, and strongly skewed features often hurt the fit. We can use transformers to make the feature distributions more normal.
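A sketch of the effect on synthetic skewed data (log-normal draws, purely illustrative): `PowerTransformer` (Yeo-Johnson by default) brings the sample skewness close to zero.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(a):
    # Sample skewness: third standardized moment
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # strongly right-skewed

x_t = PowerTransformer().fit_transform(x)  # Yeo-Johnson by default

print(skewness(x[:, 0]), skewness(x_t[:, 0]))
assert abs(skewness(x_t[:, 0])) < abs(skewness(x[:, 0]))
```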
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(23, 10))
        )
    else:
        print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
plot_features_coefficients(pipe, X_train)
Polynomial Features (optional)¶
Another possible feature engineering step is to add some non-linearity. As linear regression is a linear model, non-linearity can be introduced through the use of polynomial features.
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Review feature extraction¶
Word counts & mean review¶
Compute the "word counts" dataframe:
- containing, for each word:
  - the number of reviews in which the word appears
  - the mean rating of the reviews containing the word
- remove stop words
- remove single-letter words
- keep only the 1000 most frequent words
def tokenize(serie):
    return (
        (serie)
        *** FILL THE MISSING LINE ***
        .str.replace(r"[^a-z]", " ", regex=True)
        .str.replace(r" +", " ", regex=True)
        .str.split(" ")
    )
df_words_count = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.loc[:, ["rating", "tokenized_text"]]
.explode("tokenized_text")
.groupby("tokenized_text", as_index=False)
.agg(["mean", "count"])
.reset_index()
.sort_values(by=("rating", "count"), ascending=False)
.loc[lambda df: ~df.tokenized_text.isin(list(STOPWORDS))]
.loc[lambda df: df.tokenized_text.str.len() > 1]
*** FILL THE MISSING LINE ***
)
df_words_count.sort_values(by=[('rating', 'mean')])
Word mean review¶
Compute the "word ratings" dictionary containing the mean rating for each word.
word_mean_review = (
df_words_count.set_index("tokenized_text").rating["mean"].to_dict()
)
Positive & negative words sets¶
- Compute the "positive words" set containing the words with a rating >= 14.5
- Compute the "negative words" set containing the words with a rating <= 13
positive_words_set = set(
df_words_count.loc[lambda df: df.rating["mean"] >= 14.5]
.loc[:, "tokenized_text"]
.tolist()
)
negative_words_set = set(
*** FILL THE MISSING LINE ***
.loc[:, "tokenized_text"]
.tolist()
)
print(len(negative_words_set))
print(len(positive_words_set))
Features & target¶
Compute the "features & target" dataframe containing:
- tokenized text
- mean word rating
- positive words count
- negative words count
def compute_mean_word_rating(words, word_mean_review):
    return np.mean(
        [word_mean_review[w] for w in words if w in word_mean_review] or [0]
    )

def count_words_in_set(words, word_set):
    return len(set(words) & word_set)
df_features_and_target = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.assign(
mean_word_rating=lambda df: (df.tokenized_text).apply(
compute_mean_word_rating, args=(word_mean_review,)
)
)
.assign(
positive_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(positive_words_set,)
)
)
.assign(
negative_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(negative_words_set,)
)
)
)
df_features_and_target.head(5)
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_features_and_target, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Learning curve¶
Simple model¶
Compute the training & validation R2-scores for various training set sizes.
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
model = linear_model.LinearRegression()
train_sizes_percent = np.geomspace(0.001, 1, 20)
(
train_sizes,
train_scores,
validation_scores,
_,
_,
) = model_selection.learning_curve(
model,
X_train,
y_train,
cv=7,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
Plot the training & validation scores.
Since they come from a K-fold cross-validation, you have K values for each
training set size: use seaborn's lineplot to plot the mean value with a
confidence interval.
def compute_df_sizes(train_sizes):
    return pd.DataFrame(
        {
            "train_size_id": range(len(train_sizes)),
            "train_size": train_sizes,
        }
    )

def compute_df_score(scores, name):
    df = (
        pd.DataFrame(scores)
        .stack()
        .rename_axis(index=("train_size_id", "run_id"))
        .rename(name)
        .reset_index()
    )
    return df
def compute_df_scores_long(train_scores, validation_scores, train_sizes):
    return (
        compute_df_score(train_scores, name="train_score")
        .merge(compute_df_score(validation_scores, name="validation_score"))
        .merge(compute_df_sizes(train_sizes), on="train_size_id")
        .melt(
            id_vars=["train_size_id", "run_id", "train_size"],
            value_vars=["train_score", "validation_score"],
            var_name="score_name",
            value_name="score",
        )
    )
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
Plot the feature importances of the linear model trained on the full dataset.
pipe = pipeline.make_pipeline(model)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Complex model (optional)¶
Compute the training & validation R2-scores for various training set sizes.
Plot the training & validation scores.
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.LinearRegression(),
)
(
train_sizes,
train_scores,
validation_scores,
fit_times,
_,
) = model_selection.learning_curve(
pipe,
X_train,
y_train,
cv=5,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
Plot the feature importances of the linear model with polynomial features, trained on the full dataset.
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Overfitting & Underfitting¶
Overfitting happens when a model fits the training data well but does not generalize to new data it has never seen. It can have multiple causes:
- The training dataset is not representative of the real data
- The model is too complex and learned the noise of the training dataset
Underfitting happens when the model is too simple to capture the patterns in the data. In this case, the model performs poorly on both the train and the test datasets.
To avoid overfitting, multiple actions can be performed.
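A toy illustration of overfitting (synthetic data): a degree-15 polynomial fits the training points better than a straight line, but its test score falls behind its own training score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=40)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

scores = {}
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    scores[degree] = (model.score(x_tr, y_tr), model.score(x_te, y_te))

# The flexible model wins on train data but its test score lags behind
for degree, (r2_train, r2_test) in scores.items():
    print(degree, round(r2_train, 3), round(r2_test, 3))
```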
From here on, we'll use a dataset containing 2000 rows & 7 features.
N = 2000
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"mean_word_rating",
"positive_words_count",
"negative_words_count",
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
N,
)
Model regularization¶
Regularization reduces model complexity. Instead of minimizing the error alone, we minimize an objective function that includes a regularization term.
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \text{Regularization}$$
Ridge - L2 regularization¶
For ridge regression, the regularization term is the L2-norm of the coefficients:
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_2^2$$
where $$\vert\vert\beta\vert\vert_2^2 = \sum_{j}\beta_j^2$$
It is mainly used when features are correlated, which can produce unexpected coefficient behaviour.
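A sketch of the shrinkage effect on synthetic data: for any alpha > 0, ridge produces a coefficient vector with a smaller L2-norm than plain least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.5, size=100)

coef_ols = LinearRegression().fit(X, y).coef_
coef_ridge = Ridge(alpha=100.0).fit(X, y).coef_

# The L2 penalty shrinks the coefficient vector towards zero
print(np.linalg.norm(coef_ols), np.linalg.norm(coef_ridge))
assert np.linalg.norm(coef_ridge) < np.linalg.norm(coef_ols)
```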
Simple loop on regularization parameter¶
- Fit `n` models for `n` values of `alpha` (the regularization parameter)
- Compute the scores
- Retrieve the values of the coefficients
alphas = np.logspace(-6, 3, 28)
def compute_scores_and_coefs(pipe, alphas):
    scores = dict()
    coefs = dict()
    for alpha in tqdm(alphas):
        pipe[-1].set_params(alpha=alpha)
        pipe.fit(X_train, y_train)
        score_train = compute_score(
            pipe, X_train, y_train, metric_names=["neg_mean_absolute_error"], label="train"
        )
        score_test = compute_score(
            pipe, X_test, y_test, metric_names=["neg_mean_absolute_error"], label="test"
        )
        scores.update({alpha: {**score_train, **score_test}})
        coef = dict(
            zip(get_feature_names(pipe, X_train), pipe[-1].coef_.reshape(-1))
        )
        coefs.update({alpha: coef})
    return scores, coefs
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Ridge(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
Plot the feature importances of a model trained with a specific alpha value (e.g. alpha=1e-1).
pipe[-1].set_params(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Lasso - L1 regularization¶
For lasso regression, the regularization term is the L1-norm of the coefficients:
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_1$$
where $$\vert\vert\beta\vert\vert_1 = \sum_{j}\vert\beta_j\vert$$
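A sketch of the key difference with ridge, on synthetic data: the L1 penalty drives the coefficients of irrelevant features exactly to zero (here features 3 to 5 are pure noise by construction).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

coef = Lasso(alpha=0.5).fit(X, y).coef_

# Unlike ridge, lasso sets the irrelevant coefficients exactly to zero,
# which makes it useful for feature selection
print(coef)
assert np.sum(coef == 0.0) >= 3
```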
Simple loop on regularization parameter¶
- Fit `n` models for `n` values of `alpha` (the regularization parameter)
- Compute the scores
- Retrieve the values of the coefficients
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Lasso(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
Plot the feature importances of a model trained with a specific alpha value (e.g. alpha=1e-2).
pipe[-1].set_params(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print(score_train)
print(score_test)