š Practice n°3: regression¶
The objective of this session is to understand the regression problems and the analytical approach to solve them. This notebook discusses theoretical notions and contains practical parts.
Here are the main steps of the notebook :
- What is regression ?
- Focus on linear regression
- Preparation
- Implementation of a linear regression
- Model improvement
1. What is regression ?¶
Regression in machine learning consists of mathematical methods that allow to predict a continuous outcome (y) based on the value of one or more predictor variables (x)
1D Example¶
x = Number of aircrafts in an airspace sector
y = Subjective rating of air traffic controller workload
To predict the air traffic controller workload, we need to find the best function y=f(x). For that, we will define an error and find the function which minimizes this error.
Generalisation¶
In reality, we often deal with multiple explicative variables and the dimensionality of the input data is greater than 1.
$$x = (x_1,...,x_n)$$ $$\hat{y} = f(x_1,...,x_n)$$
If we come back to our traffic control example, the air traffic controller workload can be a function of both the number of aircraft and the weather.
2. Focus on linear regression¶
The linear regression is a regression using a linear function Å· = ax + b
1D Example¶
Generalisation¶
In reality, we often deal with multiple explicative variables and the dimensionality of the input data is greater than 1. The linear function will look like :
$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$
Now let's practice with the ratebeer dataset !
3. Preparation¶
Install & import modules¶
import pandas as pd
from wordcloud import STOPWORDS
import numpy as np
from sklearn import (
linear_model,
preprocessing,
pipeline,
model_selection,
metrics,
compose,
)
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})
Read remote dataset¶
The data is in this git repository: ML-boot-camp/ratebeer.git.
The data is located in the the ratebeer/data/ folder.
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
Implementation of a linear regression¶
In this first section, you will train a linear regression model to predict the overall rating of a review based on its other ratings. For that, you will :
- Load & describe the data
- Split the data between train and test set
- Train and evaluate the linear regression model
Load data & describe¶
df_master = pd.read_parquet(file_url)
df_master.head()
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with gar... | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, sl... | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foa... | 20 | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pou... | 18 | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 |
df_master.shape
(200000, 18)
Splits: Train/test & features/target¶
The goal of supervised learning is to build a model that performs well on new data. If you have new data, itās a good idea to see how your model performs on it. The problem is that you may not have new data, but you can simulate this experience by splitting your dataset between train and test.
Train/test split¶
Split keeping 10% of the data in the test set.
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(
df_master,
test_size=0.1 # LINE TO BE REMOVED FOR STUDENTS
)
def split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
N=None,
):
if N is None :
X_train = df_features_and_target_train[features]
y_train = df_features_and_target_train[target]
else :
X_train = df_features_and_target_train[features].sample(N)
y_train = df_features_and_target_train[target].loc[X_train.index]
X_test = df_features_and_target_test[features]
y_test = df_features_and_target_test[target]
return X_train, y_train, X_test, y_test
target = [
"rating" # LINE TO BE REMOVED FOR STUDENTS
]
features = [
"alcohol", # LINE TO BE REMOVED FOR STUDENTS
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train, # LINE TO BE REMOVED FOR STUDENTS
df_features_and_target_test, # LINE TO BE REMOVED FOR STUDENTS
features, # LINE TO BE REMOVED FOR STUDENTS
target, # LINE TO BE REMOVED FOR STUDENTS
)
Model training & evaluation¶
Model training¶
Use a pipeline to wrap the model with its automatized preprocessing steps
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(
X_train, # LINE TO BE REMOVED FOR STUDENTS
y_train # LINE TO BE REMOVED FOR STUDENTS
)
Pipeline(steps=[('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('linearregression', LinearRegression())])LinearRegression()
Model evaluation¶
R2 score¶
To assess the performance of our model, many metrics can be used. The easiest one for regression is $R^2$. It is used to measure the goodness of fit or best-fit line.
$$R^2 = 1 - \frac{\sum_{i}{(y_i - \hat{y}_i)^2}}{\sum_{i}{(y_i - \bar{y}_i)^2}}$$
R2 score in practice¶
score_train = pipe.score(
X_train,
y_train
)
score_test = pipe.score(
X_test, # LINE TO BE REMOVED FOR STUDENTS
y_test # LINE TO BE REMOVED FOR STUDENTS
)
print(f"R2 (train): {score_train}")
print(f"R2 (test): {score_test}")
R2 (train): 0.8063431786340525 R2 (test): 0.8024042308518796
Residuals¶
Compute residuals dataframe containing the true & predicted values of y and
and the residual, which is the difference between the true & predicted values.
Plot the residuals, using histograms & violinplots.
def compute_df_residual(pipe, X, y):
return pd.DataFrame(
{
"y_true": y.rating.reset_index(drop=True),
"y_pred": pipe.predict(X).reshape(-1),
}
).assign(residual=lambda df: df.y_true - df.y_pred)
def plot_residual(df_residual):
df_residual.plot.hist(bins=200, subplots=True, layout=(1,3), figsize=(20, 5))
plt.show()
sns.violinplot(data=df_residual, x="y_true", y="y_pred")
df_residual_train = compute_df_residual(pipe, X_train, y_train)
df_residual_test = compute_df_residual(pipe, X_test, y_test)
plot_residual(df_residual_train)
plot_residual(df_residual_test)
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector): /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector): /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector): /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector): /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead if pd.api.types.is_categorical_dtype(vector):
Other regression metrics¶
Another popular metric is the Mean Squared Error (MSE)
$$MSE = \frac{1}{N} \sum_{i}(y_i ā \hat{y}_i)^2$$
This metric is interesting if we want to penalize large errors. However it is not easily interpretable as it is not in the same unit than the target.
This is why the Root Mean Squared Error (RMSE) is widely used.
$$RMSE = \sqrt{\frac{1}{N} \sum_{i}(y_i ā \hat{y}_i)^2}$$
As well as the Mean Absolute Error which is even more interpretable.
$$MAE = \frac{1}{N} \sum_{i}\lvert(y_i ā \hat{y}_i)\lvert$$
Other metrics in practice¶
Compute a set of metrics on train & test datasets:
- negative RMSE
- negative MAE
- negative MSE
NB: A "score" function in sklearn obeys to the "greater is better" principle. That's why error functions are converted to their negative versions to obey this principle.
def compute_score(pipe, X, y, metric_names=None, label=None, verbose=False):
if metric_names is None:
metric_names = ["neg_root_mean_squared_error"]
scores = dict()
for metric_name in metric_names:
metric = metrics.get_scorer(metric_name)
score = metric(pipe, X, y)
printed_label = f" ({label})" if label else ""
if verbose:
print(f"{metric_name}{printed_label}: {score:.3g}")
scores.update({f"{metric_name}{printed_label}": score})
return scores
metric_names = [
"neg_root_mean_squared_error",
"neg_mean_absolute_error",
"neg_mean_squared_error",
]
score_train = compute_score(
pipe, X_train, y_train, metric_names, label="train", verbose=True
)
score_test = compute_score(
pipe, X_test, y_test, metric_names, label="test", verbose=True # LINE TO BE REMOVED FOR STUDENTS
)
neg_root_mean_squared_error (train): -1.47 neg_mean_absolute_error (train): -1.09 neg_mean_squared_error (train): -2.17 neg_root_mean_squared_error (test): -1.49 neg_mean_absolute_error (test): -1.09 neg_mean_squared_error (test): -2.22
Feature importance¶
Plot model's coefficients in a bar chart
def get_feature_names(pipe, X):
return (
X.columns
if len(pipe) == 1
else pipe[:-1].get_feature_names_out(X.columns)
)
def plot_features_coefficients(pipe, X):
df_coef = pd.DataFrame(
{"coef": pipe[-1].coef_.reshape(-1)}, index=get_feature_names(pipe, X)
).assign(color=lambda df: (df.coef >= 0).map({True: "g", False: "r"}))
df_coef.coef.plot(
kind="barh", color=df_coef.color, figsize=(6, len(df_coef) * 0.25)
)
plot_features_coefficients(pipe, X_train)
The coefficients of the linear regression can be misleading ! They do not represent the feature importance if the explicative variables are not comparable.
Model improvement¶
Feature engineering¶
Outliers management¶
df_master["alcohol"].plot(kind="hist", bins=100)
<Axes: ylabel='Frequency'>
We can see that we have few lines with alcohol = -1. Let's remove those lines
df_clean = (
df_master
.loc[lambda df: df.alcohol >= 0] # LINE TO BE REMOVED FOR STUDENTS
)
df_clean["alcohol"].plot(kind="hist", bins=100)
<Axes: ylabel='Frequency'>
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1)
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(), # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('linearregression', LinearRegression())])LinearRegression()
plot_features_coefficients(pipe, X_train)
Let's have a look at the impact on the MAE
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.08 MAE test : -1.09
Categorical variable encoding¶
How to use categorical variables ?
| type | rating |
|---|---|
| Stout | 14 |
| Belgian Ale | 11 |
| IPA | 13 |
We have to encode them to numerical values. Multiple approaches exist for that.
Label encoding
| type | type_encoded | rating |
|---|---|---|
| Stout | 1 | 14 |
| Belgian Ale | 2 | 11 |
| IPA | 3 | 13 |
Label encoding is a simple way to encode categorical variables. However, it creates a hierarchy/order between the categories, which does not always reflect the reality.
One hot encoding
| type | type_Stout | type_Belgian_Ale | type_IPA | rating |
|---|---|---|---|---|
| Stout | 1 | 0 | 0 | 14 |
| Belgian Ale | 0 | 1 | 0 | 11 |
| IPA | 0 | 0 | 1 | 13 |
Here, there is no hierarchy, but in the case of a lot of categories it will add many columns to the dataset.
Let's try to one hot encode some categorical variables ! For that, we will create a custom categorical variable from alcohol.
df_clean.head(3)
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with gar... | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, sl... | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 |
def alcohol_level (row):
if row['alcohol'] < 5.5 :
return 'Light'
if row['alcohol'] > 7 :
return 'Strong'
else :
return 'Medium'
df_clean['alcohol_level'] = df_clean.apply(lambda row: alcohol_level(row), axis=1)
df_clean['alcohol_level'].value_counts()
/tmp/ipykernel_2016/4176320187.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_clean['alcohol_level'] = df_clean.apply(lambda row: alcohol_level(row), axis=1)
alcohol_level Light 74945 Strong 62413 Medium 53205 Name: count, dtype: int64
df_clean.head()
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good | alcohol_level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 | Light |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with gar... | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 | Strong |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, sl... | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 | Light |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foa... | 20 | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 | Medium |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pou... | 18 | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 | Strong |
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = (
pd.get_dummies(X_train, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)
X_test = (
pd.get_dummies(X_test, columns=["alcohol_level"]) # LINE TO BE REMOVED FOR STUDENTS
)
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.08 MAE test : -1.09
Feature scaling¶
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = pd.get_dummies(X_train, columns=["alcohol_level"])
X_test = pd.get_dummies(X_test, columns=["alcohol_level"])
pipe = pipeline.make_pipeline(
compose.ColumnTransformer([
('scaler', preprocessing.StandardScaler(), ['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']),
('passthrough', "passthrough", ["alcohol_level_Strong", "alcohol_level_Medium", "alcohol_level_Light"])
], remainder='passthrough'),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('scaler', StandardScaler(),
['rating_appearance',
'rating_aroma',
'rating_palate',
'rating_taste']),
('passthrough', 'passthrough',
['alcohol_level_Strong',
'alcohol_level_Medium',
'alcohol_level_Light'])])),
('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('scaler', StandardScaler(),
['rating_appearance',
'rating_aroma',
'rating_palate',
'rating_taste']),
('passthrough', 'passthrough',
['alcohol_level_Strong',
'alcohol_level_Medium',
'alcohol_level_Light'])])),
('linearregression', LinearRegression())])ColumnTransformer(remainder='passthrough',
transformers=[('scaler', StandardScaler(),
['rating_appearance', 'rating_aroma',
'rating_palate', 'rating_taste']),
('passthrough', 'passthrough',
['alcohol_level_Strong',
'alcohol_level_Medium',
'alcohol_level_Light'])])['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']
StandardScaler()
['alcohol_level_Strong', 'alcohol_level_Medium', 'alcohol_level_Light']
passthrough
[]
passthrough
LinearRegression()
Check the features created & the transformation applied to them, using histograms.
def plot_features_transformed_histograms(pipe, X, **kwargs):
if len(pipe) > 1:
columns = pipe[:-1].get_feature_names_out(X.columns)
(
pd.DataFrame(
pipe[:-1].transform(X),
columns=columns,
).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(20, 10))
)
else:
print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
And now we can check again the linear regression coefficients. They can now be seen as feature importance.
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.08 MAE test : -1.09
Feature standardization¶
We have also seen that some features do not have a gaussian distribution. However one of the assumption made by the linear regression is that all the features follow a gaussian distribution. We can use some transformers to make the features more normal.
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
preprocessing.PowerTransformer(), # LINE TO BE REMOVED FOR STUDENTS
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('powertransformer', PowerTransformer()),
('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('powertransformer', PowerTransformer()),
('linearregression', LinearRegression())])PowerTransformer()
LinearRegression()
def plot_features_transformed_histograms(pipe, X, **kwargs):
if len(pipe) > 1:
columns = pipe[:-1].get_feature_names_out(X.columns)
(
pd.DataFrame(
pipe[:-1].transform(X),
columns=columns,
).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(23, 10))
)
else:
print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.19 MAE test : -1.2
plot_features_coefficients(pipe, X_train)
Polynomial Features (optional)¶
Another possible feature engineering step is to add some non linearity. As the linear regression is a linear model, the non linearity can be managed thanks to the use of polynomial features.
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(degree=2), # LINE TO BE REMOVED FOR STUDENTS
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])PolynomialFeatures()
StandardScaler()
LinearRegression()
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.06 MAE test : -1.07
Review feature extraction¶
Word counts & mean review¶
Compute the "word counts" dataframe:
- containing:
- count of words in review
- mean rating review of the reviews containing this text
- remove stop words
- remove single letter words
- keep only the 1000 most frequent words
def tokenize(serie):
return (
(serie)
.str.lower() # LINE TO BE REMOVED FOR STUDENTS
.str.replace(r"[^a-z]", " ")
.str.replace(r" +", " ")
.str.split(" ")
)
df_words_count = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.loc[:, ["rating", "tokenized_text"]]
.explode("tokenized_text")
.groupby("tokenized_text", as_index=False)
.agg(["mean", "count"])
.reset_index()
.sort_values(by=("rating", "count"), ascending=False)
.loc[lambda df: ~df.tokenized_text.isin(list(STOPWORDS))]
.loc[lambda df: df.tokenized_text.str.len() > 1]
.head(1000) # LINE TO BE REMOVED FOR STUDENTS
)
df_words_count.sort_values(by=[('rating', 'mean')])
| index | tokenized_text | rating | ||
|---|---|---|---|---|
| mean | count | |||
| 87390 | 87390 | corn | 8.36 | 2165 |
| 76903 | 76903 | cheap | 8.79 | 1153 |
| 252071 | 252071 | water | 9.10 | 1129 |
| 60309 | 60309 | bland | 9.74 | 1372 |
| 62886 | 62886 | boring | 9.92 | 1045 |
| ... | ... | ... | ... | ... |
| 188864 | 188864 | perfect | 16.08 | 2141 |
| 257196 | 257196 | wonderful | 16.25 | 2040 |
| 109976 | 109976 | fantastic | 16.29 | 1047 |
| 42498 | 42498 | amazing | 16.34 | 1184 |
| 49833 | 49833 | awesome | 16.35 | 1122 |
1000 rows Ć 4 columns
Word mean review¶
Compute the "word ratings" dictionary containing the mean review for each word
word_mean_review = (
df_words_count.set_index("tokenized_text").rating["mean"].to_dict()
)
Positive & negative words sets¶
- Compute the "positive words" set containing the words with a rating >= 14.5
- Compute the "negative words" set containing the words with a rating <= 13
positive_words_set = set(
df_words_count.loc[lambda df: df.rating["mean"] >= 14.5]
.loc[:, "tokenized_text"]
.tolist()
)
negative_words_set = set(
df_words_count.loc[lambda df: df.rating["mean"] <= 13] # LINE TO BE REMOVED FOR STUDENTS
.loc[:, "tokenized_text"]
.tolist()
)
print(len(negative_words_set))
print(len(positive_words_set))
167 155
Features & target¶
Compute the "features & target" dataframe containing:
- tokenized text
- mean word rating
- positive words count
- negative words count
- negative words negative count
def compute_mean_word_rating(words, word_mean_review):
return np.mean(
[word_mean_review[w] for w in words if w in word_mean_review] or [0]
)
def count_words_in_set(words, word_set):
return len(set(words) & word_set)
df_features_and_target = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.assign(
mean_word_rating=lambda df: (df.tokenized_text).apply(
compute_mean_word_rating, args=(word_mean_review,)
)
)
.assign(
positive_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(positive_words_set,)
)
)
.assign(
negative_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(negative_words_set,)
)
)
)
df_features_and_target.head(5)
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | ... | brewery_degree | user_degree | text_length | date | is_good | alcohol_level | tokenized_text | mean_word_rating | positive_words_count | negative_words_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | ... | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 | Light | [bottle,, oat, nose, with, black, color,, bitt... | 14.09 | 4 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | ... | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 | Strong | [bottle,, received, in, trade,, dark, brown, w... | 13.92 | 4 | 1 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | ... | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 | Light | [12, oz., bottle,, amber, color, with, soapy, ... | 13.24 | 0 | 3 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | ... | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 | Medium | [bottle,, golden, orange, color, with, light, ... | 13.64 | 2 | 2 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | ... | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 | Strong | [bottle,, thanks, to, ss,, almond, amber, colo... | 13.46 | 1 | 3 |
5 rows Ć 23 columns
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_features_and_target, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count", # LINE TO BE REMOVED FOR STUDENTS
"negative_words_count", # LINE TO BE REMOVED FOR STUDENTS
"mean_word_rating" # LINE TO BE REMOVED FOR STUDENTS
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])StandardScaler()
LinearRegression()
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
MAE train : -1.08 MAE test : -1.09
Learning curve¶
Simple model¶
Compute the training & validation R2-scores for various training sets sizes
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
model = linear_model.LinearRegression()
train_sizes_percent = np.geomspace(0.001, 1, 20)
(
train_sizes,
train_scores,
validation_scores,
_,
_,
) = model_selection.learning_curve(
model,
X_train,
y_train,
cv=7,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
[learning_curve] Training set sizes: [ 147 211 304 437 629 905 1302 1873 2694 3876 5575 8019 11536 16594 23870 34336 49391 71046 102196 147005]
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers. [Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed: 2.8s finished
Plot the training & validation scores.
Since they come from a K-fold cross-validation, you have K values for each
training set size: use seaborn's lineplot to plot the mean value with a
confidence interval.
def compute_df_sizes(train_sizes):
return pd.DataFrame(
{
"train_size_id": range(len(train_sizes)),
"train_size": train_sizes,
}
)
def compute_df_score(scores, name):
df = (
pd.DataFrame(scores)
.stack()
.rename_axis(index=("train_size_id", "run_id"))
.rename(name)
.reset_index()
)
return df
def compute_df_scores_long(train_scores, validation_scores, train_sizes):
return (
compute_df_score(train_scores, name="train_score")
.merge(compute_df_score(validation_scores, name="validation_score"))
.merge(compute_df_sizes(train_sizes), on="train_size_id")
.melt(
id_vars=["train_size_id", "run_id", "train_size"],
value_vars=["train_score", "validation_score"],
var_name="score_name",
value_name="score",
)
)
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
[None]
Plot the features importance of the linear model trained on the full dataset
pipe = pipeline.make_pipeline(model)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Complex model (optional)¶
Compute the training & validation R2-scores for various training sets sizes
Plot the training & validation scores.
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.LinearRegression(),
)
(
train_sizes,
train_scores,
validation_scores,
fit_times,
_,
) = model_selection.learning_curve(
pipe,
X_train,
y_train,
cv=5,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
[learning_curve] Training set sizes: [ 137 197 283 408 587 844 1215 1748 2514 3617 5203 7485 10767 15488 22278 32047 46098 66309 95383 137204]
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 44.2s finished
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
[None]
Plot the features importance of the linear model with polynomial features trained on the full dataset with 6 features
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Overfitting & Underfitting¶
Overfitting happens when a model fits quite well to the training data but does not generalize well on new data it has never seen. It can have multiple causes :
- The training dataset is not representative of the real data
- The model is too complex and learned from noise on the training dataset
Underfitting happens when the model is too simple to capture the data patterns. in this case, the model would perform bad on the train and on the test dataset.
To avoid overfitting, multiple actions can be performed.
From here we'll use a dataset containing 2000 rows & 7 features
N = 2000
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"mean_word_rating",
"positive_words_count",
"negative_words_count",
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
N,
)
Model regularization¶
The regularization allows to reduce the model complexity. For that, we will not only minimize our cost error, but we will minimize an objective function with a regularization term.
$$min \sum_{i}(y_i ā X_i\beta)^2 + Regularization$$
Ridge - L2 regularization¶
For the ridge regression, the regularization term will be the L2-norm of the coefficients :
$$min \sum_{i}(y_i ā X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_2^2$$
where $$\vert\vert\beta\vert\vert_2^2 = \sum_{j}\beta_j^2$$
It is mainly used in the case of correlated features, which can bring unexpected behaviours with regard to the coefficients.
Simple loop on regularization parameter¶
- Fit
nmodels fornvalues ofalpha(regularization parameter) - Compute the scores
- Retrieve the values of coefficients
alphas = np.logspace(-6, 3, 28)
def compute_scores_and_coefs(pipe, alphas):
scores = dict()
coefs = dict()
for alpha in tqdm(alphas):
pipe[-1].set_params(alpha=alpha)
pipe.fit(X_train, y_train)
score_train = compute_score(pipe, X_train, y_train, metric_names = ["neg_mean_absolute_error"], label="train")
score_test = compute_score(pipe, X_test, y_test, metric_names = ["neg_mean_absolute_error"], label="test")
scores.update({alpha: {**score_train, **score_test}})
coef = dict(
zip(get_feature_names(pipe, X_train), pipe[-1].coef_.reshape(-1))
)
coefs.update({alpha: coef})
return scores, coefs
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Ridge(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
100%|āāāāāāāāāā| 28/28 [00:03<00:00, 7.61it/s]
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
<Axes: >
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
<Axes: >
Plot the features importance of a model trained with a specific alpha
value (e.g: alpha=1e-1)
pipe[-1].set_params(
alpha=1e1 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Lasso - L1 regularization¶
For the lasso regression, the regularization term will be the L1-norm of the coefficients :
$$min \sum_{i}(y_i ā X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_1^2$$
where $$\vert\vert\beta\vert\vert_1^2 = \sum_{j}\vert\beta_j\vert$$
Simple loop on regularization parameter¶
- Fit
nmodels fornvalues ofalpha(regularization parameter) - Compute the scores
- Retrieve the values of coefficients
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Lasso(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
0%| | 0/28 [00:00<?, ?it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.090e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 4%|ā | 1/28 [00:00<00:04, 5.43it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.088e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 7%|ā | 2/28 [00:00<00:05, 5.11it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.084e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 11%|ā | 3/28 [00:00<00:04, 5.22it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.076e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 14%|āā | 4/28 [00:00<00:04, 5.30it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.057e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 18%|āā | 5/28 [00:00<00:04, 5.32it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.016e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 21%|āāā | 6/28 [00:01<00:04, 5.35it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.924e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 25%|āāā | 7/28 [00:01<00:04, 5.19it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.730e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 29%|āāā | 8/28 [00:01<00:03, 5.24it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.388e+03, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 32%|āāāā | 9/28 [00:01<00:03, 5.34it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.910e+02, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 36%|āāāā | 10/28 [00:01<00:03, 5.42it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.635e+01, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 39%|āāāā | 11/28 [00:02<00:03, 5.48it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.007e+01, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 43%|āāāāā | 12/28 [00:02<00:02, 5.58it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.858e+00, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 46%|āāāāā | 13/28 [00:02<00:02, 5.54it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.213e+00, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 50%|āāāāā | 14/28 [00:02<00:02, 5.73it/s]/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.760e+00, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent( 100%|āāāāāāāāāā| 28/28 [00:04<00:00, 6.21it/s]
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
<Axes: >
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
<Axes: >
Plot the features importance of a model trained with a specific alpha
value (e.g: alpha=1e-2)
pipe[-1].set_params(
alpha=1e-2 # LINE TO BE REMOVED FOR STUDENTS
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print(score_train)
print(score_test)
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.858e+00, tolerance: 2.331e+00 model = cd_fast.enet_coordinate_descent(
0.8110210044666418 0.7937570454994202