🐍 Practice n°3: regression (students version)¶
The objective of this session is to understand regression problems and the analytical approach to solving them. This notebook mixes theoretical notions with practical exercises.
Here are the main steps of the notebook:
- What is regression?
- Focus on linear regression
- Preparation
- Implementation of a linear regression
- Model improvement
1. What is regression?¶
Regression in machine learning consists of mathematical methods that predict a continuous outcome (y) from the value of one or more predictor variables (x).
1D Example¶
x = Number of aircraft in an airspace sector
y = Subjective rating of air traffic controller workload
To predict the air traffic controller workload, we need to find the best function y=f(x). For that, we will define an error and find the function which minimizes this error.
Generalisation¶
In reality, we often deal with multiple explanatory variables, and the dimensionality of the input data is greater than 1.
$$x = (x_1,...,x_n)$$ $$\hat{y} = f(x_1,...,x_n)$$
If we come back to our traffic control example, the air traffic controller workload can be a function of both the number of aircraft and the weather.
2. Focus on linear regression¶
Linear regression is a regression using a linear function: ŷ = ax + b
1D Example¶
Generalisation¶
In reality, we often deal with multiple explanatory variables, and the dimensionality of the input data is greater than 1. The linear function then looks like:
$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$
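In code, this linear combination is just a dot product. A minimal sketch with made-up values (the $\beta$ coefficients below are illustrative, not fitted on any data):

```python
import numpy as np

# Illustrative coefficients for a 2-feature linear model (not fitted values)
beta_0 = 1.0                  # intercept
beta = np.array([0.5, -2.0])  # one weight per feature

# One observation x = (x_1, x_2)
x = np.array([4.0, 1.5])

# y_hat = beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat = beta_0 + beta @ x
print(y_hat)  # 1.0 + 0.5*4.0 - 2.0*1.5 = 0.0
```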
Now let's practice with the ratebeer dataset!
3. Preparation¶
Install & import modules¶
import pandas as pd
from wordcloud import STOPWORDS
import numpy as np
from sklearn import (
    linear_model,
    preprocessing,
    pipeline,
    model_selection,
    metrics,
    compose,
)
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})
Read remote dataset¶
The data is in this git repository: ML-boot-camp/ratebeer.git.
It is located in the ratebeer/data/ folder.
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
4. Implementation of a linear regression¶
In this first section, you will train a linear regression model to predict the overall rating of a review based on its other ratings. For that, you will:
- Load & describe the data
- Split the data between train and test set
- Train and evaluate the linear regression model
Load data & describe¶
df_master = pd.read_parquet(file_url)
df_master.head()
df_master.shape
Splits: Train/test & features/target¶
The goal of supervised learning is to build a model that performs well on new data. If you had new data, you could check how your model performs on it. In practice you may not have new data yet, but you can simulate this situation by splitting your dataset into a train set and a test set.
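As a sketch of the idea, on a toy dataset (the values below are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 10% of the rows to play the role of "new" data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
print(len(X_tr), len(X_te))  # 9 1
```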
Train/test split¶
Split keeping 10% of the data in the test set.
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(
df_master,
*** FILL THE MISSING LINE ***
)
def split_features_and_target(
    df_features_and_target_train,
    df_features_and_target_test,
    features,
    target,
    N=None,
):
    if N is None:
        X_train = df_features_and_target_train[features]
        y_train = df_features_and_target_train[target]
    else:
        X_train = df_features_and_target_train[features].sample(N)
        y_train = df_features_and_target_train[target].loc[X_train.index]
    X_test = df_features_and_target_test[features]
    y_test = df_features_and_target_test[target]
    return X_train, y_train, X_test, y_test
target = [
*** FILL THE MISSING LINE ***
]
features = [
*** FILL THE MISSING LINE ***
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
Model training & evaluation¶
Model training¶
Use a pipeline to wrap the model with its automated preprocessing steps.
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
Model evaluation¶
R2 score¶
To assess the performance of our model, many metrics can be used. The most common one for regression is $R^2$, which measures the goodness of fit of the model to the data.
$$R^2 = 1 - \frac{\sum_{i}{(y_i - \hat{y}_i)^2}}{\sum_{i}{(y_i - \bar{y})^2}}$$
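To see that $R^2$ is nothing magical, here is the formula above computed by hand on toy values, checked against sklearn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true / predicted values (illustrative)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)  # 0.975
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```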
R2 score in practice¶
score_train = pipe.score(
X_train,
y_train
)
score_test = pipe.score(
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
)
print(f"R2 (train): {score_train}")
print(f"R2 (test): {score_test}")
Residuals¶
Compute the residuals dataframe containing the true & predicted values of y and the residual, which is the difference between the true & predicted values.
Plot the residuals, using histograms & violin plots.
def compute_df_residual(pipe, X, y):
    return pd.DataFrame(
        {
            "y_true": y.rating.reset_index(drop=True),
            "y_pred": pipe.predict(X).reshape(-1),
        }
    ).assign(residual=lambda df: df.y_true - df.y_pred)
def plot_residual(df_residual):
    df_residual.plot.hist(bins=200, subplots=True, layout=(1, 3), figsize=(20, 5))
    plt.show()
    sns.violinplot(data=df_residual, x="y_true", y="y_pred")
df_residual_train = compute_df_residual(pipe, X_train, y_train)
df_residual_test = compute_df_residual(pipe, X_test, y_test)
plot_residual(df_residual_train)
plot_residual(df_residual_test)
Other regression metrics¶
Another popular metric is the Mean Squared Error (MSE):
$$MSE = \frac{1}{N} \sum_{i}(y_i - \hat{y}_i)^2$$
This metric is interesting if we want to penalize large errors. However, it is not easily interpretable as it is not in the same unit as the target.
This is why the Root Mean Squared Error (RMSE) is widely used:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i}(y_i - \hat{y}_i)^2}$$
As is the Mean Absolute Error (MAE), which is even more interpretable:
$$MAE = \frac{1}{N} \sum_{i}\lvert y_i - \hat{y}_i \rvert$$
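A toy computation of these three metrics (illustrative values):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 16.0])

errors = y_true - y_pred             # [-1, 0, -2]
mse = np.mean(errors ** 2)           # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                  # back in the same unit as the target
mae = np.mean(np.abs(errors))        # (1 + 0 + 2) / 3

print(mse, rmse, mae)
```

Note how the MSE is dominated by the single large error, while the MAE weights all errors equally.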
Other metrics in practice¶
Compute a set of metrics on train & test datasets:
- negative RMSE
- negative MAE
- negative MSE
NB: A "score" function in sklearn follows the "greater is better" principle. That's why error functions are converted to their negative versions, so that they obey this principle.
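A quick check of this convention on a toy fitted model (toy data, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.0])

model = LinearRegression().fit(X, y)

# The scorer returns -MAE, so that "greater is better" holds
neg_mae = get_scorer("neg_mean_absolute_error")(model, X, y)
mae = mean_absolute_error(y, model.predict(X))

print(neg_mae, mae)
assert np.isclose(neg_mae, -mae)
```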
def compute_score(pipe, X, y, metric_names=None, label=None, verbose=False):
    if metric_names is None:
        metric_names = ["neg_root_mean_squared_error"]
    scores = dict()
    for metric_name in metric_names:
        metric = metrics.get_scorer(metric_name)
        score = metric(pipe, X, y)
        printed_label = f" ({label})" if label else ""
        if verbose:
            print(f"{metric_name}{printed_label}: {score:.3g}")
        scores.update({f"{metric_name}{printed_label}": score})
    return scores
metric_names = [
"neg_root_mean_squared_error",
"neg_mean_absolute_error",
"neg_mean_squared_error",
]
score_train = compute_score(
pipe, X_train, y_train, metric_names, label="train", verbose=True
)
score_test = compute_score(
*** FILL THE MISSING LINE ***
)
Feature importance¶
Plot the model's coefficients in a bar chart.
def get_feature_names(pipe, X):
    return (
        X.columns
        if len(pipe) == 1
        else pipe[:-1].get_feature_names_out(X.columns)
    )

def plot_features_coefficients(pipe, X):
    df_coef = pd.DataFrame(
        {"coef": pipe[-1].coef_.reshape(-1)}, index=get_feature_names(pipe, X)
    ).assign(color=lambda df: (df.coef >= 0).map({True: "g", False: "r"}))
    df_coef.coef.plot(
        kind="barh", color=df_coef.color, figsize=(6, len(df_coef) * 0.25)
    )
plot_features_coefficients(pipe, X_train)
The coefficients of the linear regression can be misleading! They do not represent feature importance if the explanatory variables are not comparable.
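A toy demonstration of why: expressing the same feature in a different unit rescales its coefficient, even though the fitted model is unchanged (synthetic data, for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)

# The same feature expressed in two different units (e.g. metres vs millimetres)
coef_m = LinearRegression().fit(x, y).coef_[0]
coef_mm = LinearRegression().fit(x * 1000, y).coef_[0]

# Identical model and predictions, but the coefficient shrinks by the scale
# factor: raw coefficients are not comparable across differently-scaled features
print(coef_m, coef_mm)
assert np.isclose(coef_m, coef_mm * 1000)
```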
5. Model improvement¶
Feature engineering¶
Outliers management¶
df_master["alcohol"].plot(kind="hist", bins=100)
We can see that a few lines have alcohol = -1. Let's remove those lines.
df_clean = (
df_master
*** FILL THE MISSING LINE ***
)
df_clean["alcohol"].plot(kind="hist", bins=100)
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1)
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Let's have a look at the impact on the MAE
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Categorical variable encoding¶
How to use categorical variables?
| type | rating |
|---|---|
| Stout | 14 |
| Belgian Ale | 11 |
| IPA | 13 |
We have to encode them as numerical values. Multiple approaches exist for that.
Label encoding
| type | type_encoded | rating |
|---|---|---|
| Stout | 1 | 14 |
| Belgian Ale | 2 | 11 |
| IPA | 3 | 13 |
Label encoding is a simple way to encode categorical variables. However, it creates a hierarchy/order between the categories, which does not always reflect reality.
One hot encoding
| type | type_Stout | type_Belgian_Ale | type_IPA | rating |
|---|---|---|---|---|
| Stout | 1 | 0 | 0 | 14 |
| Belgian Ale | 0 | 1 | 0 | 11 |
| IPA | 0 | 0 | 1 | 13 |
Here, there is no hierarchy, but when there are many categories this approach adds many columns to the dataset.
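A minimal sketch of one-hot encoding with pandas' `get_dummies`, on the toy table above:

```python
import pandas as pd

df = pd.DataFrame({"type": ["Stout", "Belgian Ale", "IPA"], "rating": [14, 11, 13]})

# One 0/1 column per category; the original `type` column is dropped
df_encoded = pd.get_dummies(df, columns=["type"])
print(df_encoded.columns.tolist())
# ['rating', 'type_Belgian Ale', 'type_IPA', 'type_Stout']
```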
Let's try to one-hot encode some categorical variables! For that, we will create a custom categorical variable from alcohol.
df_clean.head(3)
def alcohol_level(row):
    if row["alcohol"] < 5.5:
        return "Light"
    elif row["alcohol"] > 7:
        return "Strong"
    else:
        return "Medium"

df_clean["alcohol_level"] = df_clean.apply(alcohol_level, axis=1)
df_clean['alcohol_level'].value_counts()
df_clean.head()
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_clean, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = (
*** FILL THE MISSING LINE ***
)
X_test = (
*** FILL THE MISSING LINE ***
)
pipe = pipeline.make_pipeline(
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Feature scaling¶
target = ["rating"]
features = [
"alcohol_level",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
X_train = pd.get_dummies(X_train, columns=["alcohol_level"])
X_test = pd.get_dummies(X_test, columns=["alcohol_level"])
pipe = pipeline.make_pipeline(
compose.ColumnTransformer([
('scaler', preprocessing.StandardScaler(), ['rating_appearance', 'rating_aroma', 'rating_palate', 'rating_taste']),
('passthrough', "passthrough", ["alcohol_level_Strong", "alcohol_level_Medium", "alcohol_level_Light"])
], remainder='passthrough'),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
Check the features created & the transformation applied to them, using histograms.
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(20, 10))
        )
    else:
        print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
And now we can check again the linear regression coefficients. They can now be seen as feature importance.
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Feature standardization¶
We have also seen that some features are far from normally distributed. Linear regression does not strictly require Gaussian features, but classical linear-model inference assumes normally distributed residuals, and strongly skewed features often hurt the fit. We can use transformers to make the feature distributions more normal.
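A sketch of the effect on synthetic skewed data (log-normal draws, purely illustrative): `PowerTransformer` (Yeo-Johnson by default) brings the sample skewness close to zero.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(a):
    # Sample skewness: third standardized moment
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # strongly right-skewed

x_t = PowerTransformer().fit_transform(x)  # Yeo-Johnson by default

print(skewness(x[:, 0]), skewness(x_t[:, 0]))
assert abs(skewness(x_t[:, 0])) < abs(skewness(x[:, 0]))
```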
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
def plot_features_transformed_histograms(pipe, X, **kwargs):
    if len(pipe) > 1:
        columns = pipe[:-1].get_feature_names_out(X.columns)
        (
            pd.DataFrame(
                pipe[:-1].transform(X),
                columns=columns,
            ).plot.hist(subplots=True, layout=(2, 4), bins=200, figsize=(23, 10))
        )
    else:
        print("no plot: features not transformed")
plot_features_transformed_histograms(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
plot_features_coefficients(pipe, X_train)
Polynomial Features (optional)¶
Another possible feature engineering step is to add some non-linearity. As linear regression is a linear model, non-linearity can be introduced through the use of polynomial features.
target = ["rating"]
features = [
"alcohol",
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
*** FILL THE MISSING LINE ***
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Review feature extraction¶
Word counts & mean review¶
Compute the "word counts" dataframe:
- containing, for each word:
  - the number of reviews in which the word appears
  - the mean rating of the reviews containing the word
- remove stop words
- remove single-letter words
- keep only the 1000 most frequent words
def tokenize(serie):
    return (
        (serie)
        *** FILL THE MISSING LINE ***
        .str.replace(r"[^a-z]", " ", regex=True)
        .str.replace(r" +", " ", regex=True)
        .str.split(" ")
    )
df_words_count = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.loc[:, ["rating", "tokenized_text"]]
.explode("tokenized_text")
.groupby("tokenized_text", as_index=False)
.agg(["mean", "count"])
.reset_index()
.sort_values(by=("rating", "count"), ascending=False)
.loc[lambda df: ~df.tokenized_text.isin(list(STOPWORDS))]
.loc[lambda df: df.tokenized_text.str.len() > 1]
*** FILL THE MISSING LINE ***
)
df_words_count.sort_values(by=[('rating', 'mean')])
Word mean review¶
Compute the "word ratings" dictionary containing the mean rating for each word.
word_mean_review = (
df_words_count.set_index("tokenized_text").rating["mean"].to_dict()
)
Positive & negative words sets¶
- Compute the "positive words" set containing the words with a rating >= 14.5
- Compute the "negative words" set containing the words with a rating <= 13
positive_words_set = set(
df_words_count.loc[lambda df: df.rating["mean"] >= 14.5]
.loc[:, "tokenized_text"]
.tolist()
)
negative_words_set = set(
*** FILL THE MISSING LINE ***
.loc[:, "tokenized_text"]
.tolist()
)
print(len(negative_words_set))
print(len(positive_words_set))
Features & target¶
Compute the "features & target" dataframe containing:
- tokenized text
- mean word rating
- positive words count
- negative words count
def compute_mean_word_rating(words, word_mean_review):
    return np.mean(
        [word_mean_review[w] for w in words if w in word_mean_review] or [0]
    )

def count_words_in_set(words, word_set):
    return len(set(words) & word_set)
df_features_and_target = (
(df_clean)
.assign(tokenized_text=lambda df: tokenize(df.text))
.assign(
mean_word_rating=lambda df: (df.tokenized_text).apply(
compute_mean_word_rating, args=(word_mean_review,)
)
)
.assign(
positive_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(positive_words_set,)
)
)
.assign(
negative_words_count=lambda df: (df.tokenized_text).apply(
count_words_in_set, args=(negative_words_set,)
)
)
)
df_features_and_target.head(5)
(
df_features_and_target_train,
df_features_and_target_test,
) = model_selection.train_test_split(df_features_and_target, test_size=0.1, random_state=1)
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
*** FILL THE MISSING LINE ***
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
pipe = pipeline.make_pipeline(
preprocessing.StandardScaler(),
linear_model.LinearRegression(),
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
metric = metrics.get_scorer("neg_mean_absolute_error")
score = metric(pipe, X_train, y_train)
print("MAE train : " + str(round(score, 2)))
score = metric(pipe, X_test, y_test)
print("MAE test : " + str(round(score, 2)))
Learning curve¶
Simple model¶
Compute the training & validation R2-scores for various training set sizes.
target = ["rating"]
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
)
model = linear_model.LinearRegression()
train_sizes_percent = np.geomspace(0.001, 1, 20)
(
train_sizes,
train_scores,
validation_scores,
_,
_,
) = model_selection.learning_curve(
model,
X_train,
y_train,
cv=7,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
Plot the training & validation scores.
Since they come from a K-fold cross-validation, you have K values for each
training set size: use seaborn's lineplot to plot the mean value with a
confidence interval.
def compute_df_sizes(train_sizes):
    return pd.DataFrame(
        {
            "train_size_id": range(len(train_sizes)),
            "train_size": train_sizes,
        }
    )

def compute_df_score(scores, name):
    df = (
        pd.DataFrame(scores)
        .stack()
        .rename_axis(index=("train_size_id", "run_id"))
        .rename(name)
        .reset_index()
    )
    return df
def compute_df_scores_long(train_scores, validation_scores, train_sizes):
    return (
        compute_df_score(train_scores, name="train_score")
        .merge(compute_df_score(validation_scores, name="validation_score"))
        .merge(compute_df_sizes(train_sizes), on="train_size_id")
        .melt(
            id_vars=["train_size_id", "run_id", "train_size"],
            value_vars=["train_score", "validation_score"],
            var_name="score_name",
            value_name="score",
        )
    )
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
Plot the feature importances of the linear model trained on the full dataset.
pipe = pipeline.make_pipeline(model)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Complex model (optional)¶
Compute the training & validation R2-scores for various training set sizes.
Plot the training & validation scores.
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"alcohol",
"user_degree",
"positive_words_count",
"negative_words_count",
"mean_word_rating"
]
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.LinearRegression(),
)
(
train_sizes,
train_scores,
validation_scores,
fit_times,
_,
) = model_selection.learning_curve(
pipe,
X_train,
y_train,
cv=5,
n_jobs=-1,
train_sizes=train_sizes_percent,
return_times=True,
verbose=1,
)
df_scores_long = compute_df_scores_long(
train_scores, validation_scores, train_sizes
)
sns.lineplot(
data=df_scores_long.loc[lambda df: df.score > 0.5],
x="train_size",
y="score",
hue="score_name",
style="score_name",
dashes=False,
markers=True,
).set(xscale="log")
Plot the feature importances of the linear model with polynomial features, trained on the full dataset.
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Overfitting & Underfitting¶
Overfitting happens when a model fits the training data well but does not generalize to new data it has never seen. It can have multiple causes:
- The training dataset is not representative of the real data
- The model is too complex and learned the noise of the training dataset
Underfitting happens when the model is too simple to capture the patterns in the data. In this case, the model performs poorly on both the train and the test datasets.
To avoid overfitting, multiple actions can be performed.
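A toy illustration of overfitting (synthetic data): a degree-15 polynomial fits the training points better than a straight line, but its test score falls behind its own training score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=40)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

scores = {}
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    scores[degree] = (model.score(x_tr, y_tr), model.score(x_te, y_te))

# The flexible model wins on train data but its test score lags behind
for degree, (r2_train, r2_test) in scores.items():
    print(degree, round(r2_train, 3), round(r2_test, 3))
```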
From here on, we'll use a dataset containing 2000 rows & 7 features.
N = 2000
features = [
"rating_appearance",
"rating_aroma",
"rating_palate",
"rating_taste",
"mean_word_rating",
"positive_words_count",
"negative_words_count",
]
X_train, y_train, X_test, y_test = split_features_and_target(
df_features_and_target_train,
df_features_and_target_test,
features,
target,
N,
)
Model regularization¶
Regularization reduces model complexity. Instead of minimizing the error alone, we minimize an objective function that includes a regularization term.
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \text{Regularization}$$
Ridge - L2 regularization¶
For ridge regression, the regularization term is the L2-norm of the coefficients:
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_2^2$$
where $$\vert\vert\beta\vert\vert_2^2 = \sum_{j}\beta_j^2$$
It is mainly used when features are correlated, which can produce unexpected coefficient behaviour.
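A sketch of the shrinkage effect on synthetic data: for any alpha > 0, ridge produces a coefficient vector with a smaller L2-norm than plain least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.5, size=100)

coef_ols = LinearRegression().fit(X, y).coef_
coef_ridge = Ridge(alpha=100.0).fit(X, y).coef_

# The L2 penalty shrinks the coefficient vector towards zero
print(np.linalg.norm(coef_ols), np.linalg.norm(coef_ridge))
assert np.linalg.norm(coef_ridge) < np.linalg.norm(coef_ols)
```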
Simple loop on regularization parameter¶
- Fit `n` models for `n` values of `alpha` (the regularization parameter)
- Compute the scores
- Retrieve the values of the coefficients
alphas = np.logspace(-6, 3, 28)
def compute_scores_and_coefs(pipe, alphas):
    scores = dict()
    coefs = dict()
    for alpha in tqdm(alphas):
        pipe[-1].set_params(alpha=alpha)
        pipe.fit(X_train, y_train)
        score_train = compute_score(
            pipe, X_train, y_train, metric_names=["neg_mean_absolute_error"], label="train"
        )
        score_test = compute_score(
            pipe, X_test, y_test, metric_names=["neg_mean_absolute_error"], label="test"
        )
        scores.update({alpha: {**score_train, **score_test}})
        coef = dict(
            zip(get_feature_names(pipe, X_train), pipe[-1].coef_.reshape(-1))
        )
        coefs.update({alpha: coef})
    return scores, coefs
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Ridge(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
Plot the feature importances of a model trained with a specific alpha value (e.g. alpha=1e-1).
pipe[-1].set_params(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
Lasso - L1 regularization¶
For lasso regression, the regularization term is the L1-norm of the coefficients:
$$\min_\beta \sum_{i}(y_i - X_i\beta)^2 + \alpha\vert\vert\beta\vert\vert_1$$
where $$\vert\vert\beta\vert\vert_1 = \sum_{j}\vert\beta_j\vert$$
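A sketch of the key difference with ridge, on synthetic data: the L1 penalty drives the coefficients of irrelevant features exactly to zero (here features 3 to 5 are pure noise by construction).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

coef = Lasso(alpha=0.5).fit(X, y).coef_

# Unlike ridge, lasso sets the irrelevant coefficients exactly to zero,
# which makes it useful for feature selection
print(coef)
assert np.sum(coef == 0.0) >= 3
```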
Simple loop on regularization parameter¶
- Fit `n` models for `n` values of `alpha` (the regularization parameter)
- Compute the scores
- Retrieve the values of the coefficients
pipe = pipeline.make_pipeline(
preprocessing.PolynomialFeatures(include_bias=False, degree=2),
preprocessing.PowerTransformer(),
linear_model.Lasso(),
)
scores, coefs = compute_scores_and_coefs(pipe, alphas)
Plot the evolution of the score as a function of alpha
(
pd.DataFrame.from_dict(scores, orient="index")
.plot(logx=True, marker=".", ylim=(-1.2, -1.05))
)
Plot the evolution of the coefficients as a function of alpha
(
pd.DataFrame.from_dict(coefs, orient="index").plot(
logx=True, legend=False, marker="."
)
)
Plot the feature importances of a model trained with a specific alpha value (e.g. alpha=1e-2).
pipe[-1].set_params(
*** FILL THE MISSING LINE ***
)
pipe.fit(X_train, y_train)
plot_features_coefficients(pipe, X_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print(score_train)
print(score_test)