Download Open in colab

🐍 Practice n°4: classification¶

The objective of this session is to learn about classification task. You will have to build a new model only using the text review.

In order to apply classification models, we need to change the target definition. We will divide the reviews into good or bad using the rating.

So our goal is to have a model, a function, that takes text as input and output a label/class (good review or bad review).

$$f(text) = label$$

The data are the same as those used for the regression (source ratebeer dataset description).

Here are the main steps of the notebook :

What is classification ?
Focus on logistic regression
Preparation
Binary target definition
Text cleaning
Modelling

6.1 First model using CountVectorizer
6.2 Choose the right metrics
6.3 Second model using TF-IDF

1. What is classification ?¶

Classification in machine learning consists of mathematical methods that allow to predict a discrete outcome (y) based on the value of one or more predictor variables (x).

There are several types of classification :

Binary classification: the task of classifying the data into two groups (each called class).

Example: an email can be classified as belonging to one of two classes: "spam" and "not spam".

Multi-class classification: the task of classifying the data into N groups (N > 2).

Example: an image can be classified as belonging to one of N classes: "cat", "dog", "cow" or "fish".

Multi-label classification: this is a generalization of multi-class classification problem where an instance can be assigned to multiple classes.

Example: a movie can be classified as belonging to one or more classes: "action", "adventure", "thriller" or all simultaneously.

In this session, we will focus on the Binary classification.

2. Focus on logistic regression¶

As seen in the last session, we can represent the link between the explicatives variables and the target to be predicted as follows

$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

The difference here is that the target is not a continous variable (rating) but a discrete one (good review / bad review). If we limit ourselves to this model, the linear combination of each input will give an unbounded number that does not allow us to classify the review into good or bad.

To transform the number provided by the linear combination into a classification, we use a function called sigmoid function which has the interesting property of transforming the numbers passed inside into numbers between 0 and 1.

After passing the linear combination through this function, the output will be considered as a probability $P$ and using a threshold (0.5 in general), the review can be classified as bad review, if $P < 0.5$, or good review, if $P >= 0.5$.

This threshold can be modified in some contexts

Then during the training phase, we will compute the parameters β in order to optimize the Maximum Likelihood i.e., for a given bad review, we want the probalility estimated by our model to be minimal and for a given good review, we want the probalility estimated by our model to be maximal.

3. Preparation¶

Install & import modules¶

In [1]:

Copied!





import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, classification_report
import nltk
from nltk.corpus import stopwords
import seaborn as sns
nltk.download('stopwords')

pd.set_option('display.max_colwidth', None)
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, classification_report
import nltk
from nltk.corpus import stopwords
import seaborn as sns
nltk.download('stopwords')

pd.set_option('display.max_colwidth', None)
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")

[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Read remote dataset¶

In [2]:

Copied!

file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df_full = pd.read_parquet(file_url)
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df_full = pd.read_parquet(file_url)

In [3]:

Copied!

df_full.head(5)
df_full.head(5)

Out[3]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text	beer_degree	brewery_degree	user_degree	text_length	date
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter chocolate flavor with a good coffee cocoa finish	51	413	300	90	2008-07-31 02:00:00
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	blutt59	bottle, received in trade, dark brown with garnet highlights, short lived foam, aroma is quite port like and raisin, flavor is sweet raisin or prune, toasty almond, waffle batter, nice finish	4	413	300	191	2011-08-06 02:00:00
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	blutt59	12 oz. bottle, amber color with soapy head, slight caramel flavor but watery not much hops some effervescence	43	413	300	109	2008-03-09 01:00:00
3	Breckenridge Lucky U IPA	383	6.20	India Pale Ale (IPA)	3	6	3	7	12	1255737600	blutt59	bottle, golden orange color with light tan foam , citric soour aroma with hoppy citrus and lime bitter flavor, could use a little more malt for balance	20	413	300	151	2009-10-17 02:00:00
4	Fullers Vintage Ale 2009	55	8.50	English Strong Ale	3	7	3	8	14	1282003200	blutt59	bottle, thanks to SS, almond amber colored pour with off white soapy foam, fruity metallic aroma, flavors of dried apricots, metallics, fruity medicinal finish	18	978	300	159	2010-08-17 02:00:00

4. Binary target definition¶

Filter the data using only 1000 reviews and then explore some reviews

In [4]:

Copied!

# Filter the data
N_rows = 1000
df = df_full[["text", "rating", "is_good"]].sample(N_rows, random_state=42)
# Filter the data
N_rows = 1000
df = df_full[["text", "rating", "is_good"]].sample(N_rows, random_state=42)

In [5]:

Copied!

# Display some text reviews

print("GUESS THE RATING ?")
df_example = df.sample(n=1)
df_example.text
# Display some text reviews

print("GUESS THE RATING ?")
df_example = df.sample(n=1)
df_example.text

GUESS THE RATING ?

Out[5]:

117167    Cask at Spoons, Milton Keynes. Dark amber with a thick white head. Couldnt pick up an aroma. Faint malty taste overlaid by hops. Oily, creamy mouthfeel, probably thanks to the head, which was courtesy of Mr Wetherspoon. Slight bitter finish. Something lacking here.
Name: text, dtype: object

In [6]:

Copied!

print(f"RATING: {df_example.rating.iloc[0]}")
print(f"RATING: {df_example.rating.iloc[0]}")

RATING: 10

To begin with a binary classification problem, we will bin the target into 2 classes: bad review and good review

First, look at the target distribution and choose a threshlold to identify good reviews from the rest.

In [7]:

Copied!

# display the target distribution
df.rating.astype(int).plot(kind="hist")
# display the target distribution
df.rating.astype(int).plot(kind="hist")

Out[7]:

<Axes: ylabel='Frequency'>

No description has been provided for this image

You can play with the rating_threshold and look at the new target distribution.

In [8]:

Copied!

# Create a binary target and display the target distribution
rating_threshold = 16 # LINE TO BE REMOVED FOR STUDENTS
(df.rating >= rating_threshold).astype(int).value_counts(normalize=True)
# Create a binary target and display the target distribution
rating_threshold = 16 # LINE TO BE REMOVED FOR STUDENTS
(df.rating >= rating_threshold).astype(int).value_counts(normalize=True)

Out[8]:

rating
0    0.74
1    0.26
Name: proportion, dtype: float64

Usually the threshold is defined by looking manually at the data: annotating a few reviews as "good" or "bad" and see which ratings they had. E.g: on google maps a "good" review is above 4 stars (out of 5 stars). For simplicity, here we'll use the is_good binary target defined during the data engineering phase.

In [9]:

Copied!

# Create a binary target and display the target distribution
rating_threshold = df.rating.median() # LINE TO BE REMOVED FOR STUDENTS
(df.rating >= rating_threshold).astype(int).value_counts(normalize=True)
# Create a binary target and display the target distribution
rating_threshold = df.rating.median() # LINE TO BE REMOVED FOR STUDENTS
(df.rating >= rating_threshold).astype(int).value_counts(normalize=True)

Out[9]:

rating
1    0.56
0    0.44
Name: proportion, dtype: float64

5. Text cleaning¶

From text reviews to numerical vectors¶

Before training some models, the first step is to transform text into numbers :

from

f(raw_text) = rating_classe

to

f(numerical_vector_representing_text) = rating_classe

Indeed, we can't direclty feed an algortihm with text.

For example:

Wow, that beer is SOOOO good :O !!

must be transformed to something like:

[3, 4, 21, 0, 0, 8, 19]

where the values of the vector contain the meaning of the text. Knowing that the closer texts are in terms of meaning, the more closed their vector representation is too. Moreover, it is often more convenient to convert texts of different sizes into vectors of fixed size.

For example:

"Wow, that beer is SOOOO good :O !!"
-> characters : 34
-> vector (1x7) : [3, 4, 21, 0, 0, 8, 19]

"This beer is very tasty"
-> characters : 23
-> vector (1x7) : [3, 4, 20, 0, 0, 7, 19]

But:

"It's not a beer, just motor oil at best."
-> characters : 40
-> vector (1x7) : [0, 4, 1, 12, 14, 0, 0]

From raw text reviews to clean list of words¶

Before converting text to numerical vector, the first step is to clean the text to keep only the pertinent information.

Here are the following cleaning steps that we will apply on the reviews:

Convert letter to lowercase

"Wow, that beer is SOOOO good :O !!" -> "wow, that beer is soooo good :o !!"

Remove the punctuation letter to lowercase

"wow, that beer is soooo good :o !!" -> "wow that beer is soooo good"

Transform the text into tokens

"wow that beer is soooo good" -> ["wow", "that", "beer", "is", "soooo", "good"]

Remove the stopwords, the most common english words that often bring noise to the models.

["wow", "that", "beer", "is", "soooo", "good"] -> ["wow", "beer", "soooo", "good"]

To go further, some techniques can be used to reduce the forms of each word into a common base or root. This can be done with:

(1) Stemming is the process of eliminating affixes (suffixed, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

am, are, is $\Rightarrow$ be

(2) Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma.

book, books, book's, book' $\Rightarrow$ book

The steps presented here are just the most basic, many different things can be applied to the cleaning part of the text.

In [10]:

Copied!





def convert_text_to_lowercase(df, colname):
    df[colname] = (
        df[colname]
        .str.lower() # LINE TO BE REMOVED FOR STUDENTS
    )
    return df

def not_regex(pattern):
        return r"((?!{}).)".format(pattern)

def remove_punctuation(df, colname):
    df[colname] = df[colname].str.replace("\n", " ")
    df[colname] = df[colname].str.replace("\r", " ")
    alphanumeric_characters_extended = "(\\b[-/]\\b|[a-zA-Z0-9])"
    df[colname] = df[colname].str.replace(not_regex(alphanumeric_characters_extended), " ")
    return df

def tokenize_sentence(df, colname):
    df[colname] = df[colname].str.split()
    return df

def remove_stop_words(df, colname):
    stop_words = stopwords.words("english")
    df[colname] = df[colname].apply(lambda x: [word for word in x if word not in stop_words])
    return df

def reverse_tokenize_sentence(df, colname):
    df[colname] = df[colname].map(lambda word: " ".join(word))
    return df


def text_cleaning(df, colname):
    """
    Takes in a string of text, then performs the following:
    1. convert text to lowercase
    2. remove punctuation and new line characters "\n"
    3. Tokenize sentences
    4. Remove all stopwords
    5. convert tokenized text to text
    """
    df = (
        df
        .pipe(convert_text_to_lowercase, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(remove_punctuation, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(tokenize_sentence, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(remove_stop_words, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(reverse_tokenize_sentence, colname) # LINE TO BE REMOVED FOR STUDENTS
    )
    return df
def convert_text_to_lowercase(df, colname):
    df[colname] = (
        df[colname]
        .str.lower() # LINE TO BE REMOVED FOR STUDENTS
    )
    return df

def not_regex(pattern):
        return r"((?!{}).)".format(pattern)

def remove_punctuation(df, colname):
    df[colname] = df[colname].str.replace("\n", " ")
    df[colname] = df[colname].str.replace("\r", " ")
    alphanumeric_characters_extended = "(\\b[-/]\\b|[a-zA-Z0-9])"
    df[colname] = df[colname].str.replace(not_regex(alphanumeric_characters_extended), " ")
    return df

def tokenize_sentence(df, colname):
    df[colname] = df[colname].str.split()
    return df

def remove_stop_words(df, colname):
    stop_words = stopwords.words("english")
    df[colname] = df[colname].apply(lambda x: [word for word in x if word not in stop_words])
    return df

def reverse_tokenize_sentence(df, colname):
    df[colname] = df[colname].map(lambda word: " ".join(word))
    return df


def text_cleaning(df, colname):
    """
    Takes in a string of text, then performs the following:
    1. convert text to lowercase
    2. remove punctuation and new line characters "\n"
    3. Tokenize sentences
    4. Remove all stopwords
    5. convert tokenized text to text
    """
    df = (
        df
        .pipe(convert_text_to_lowercase, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(remove_punctuation, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(tokenize_sentence, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(remove_stop_words, colname) # LINE TO BE REMOVED FOR STUDENTS
        .pipe(reverse_tokenize_sentence, colname) # LINE TO BE REMOVED FOR STUDENTS
    )
    return df

In [11]:

Copied!





# Apply data cleaning
df_cleaned = text_cleaning(
    df, # LINE TO BE REMOVED FOR STUDENTS
    "text"
)
# Apply data cleaning
df_cleaned = text_cleaning(
    df, # LINE TO BE REMOVED FOR STUDENTS
    "text"
)

In [12]:

Copied!

# Control the cleaning
df_cleaned.head()
# Control the cleaning
df_cleaned.head()

Out[12]:

	text	rating	is_good
119737	0.33l 2009-bottle. pours deep red body thin bubbly side-sticking almost yellow-beige head. aroma lovely caramel sweet, hops, smoke brown sugar. medium-bodied medium+ carbonated. quite boozy first, caramel- brown sugar-sweet slight floral note. something barlywines normally sweet me, sweet, sweet first would thought. also strong bitterness equalized sweetness. combined warming alcohol made pleasant experience.	15	0
72272	another nice seasonal muskoka. pours nice dark brown beige frothy head goes nice covering top. nose lotsa chocolate cocoa. mouth feel touch thin otherwise works well. flavor lotsa chocolate touch cranberries decent well hidden amount booze.	16	1
158154	pours hazy, white hued light golden, aroma sweet, estery fruity notes banana sweet melon prevalent. rich chewy mouth anise-like bitterness nice light burn finish, slightly warming.	14	0
65426	served draft beer run charlottesville, va 3/30/11. pours clear copper color medium sized creamy tan head. good head retention lacing. aroma pretty much straight bourbon toffee notes quite bit booze. taste bourbon, vanilla, toffee, caramel, booze lightly toasty finish. medium bodied.	15	0
30074	tap old hat. listed "amber rhy" scoreboard inclined think spelling error brewpub staff. cloudy orange-red body, thin light cream head. sweet grainy aroma notes caramel. flavor caramely, fruity sweetness stout doppelbock had, beer boring comparison. little rye character permeated unrefined sweetness. sharp carbonation, bland beer. something uninitiated, suppose.	12	0

We still have to transform the list of tokens to a fixed size numerical vector

For that, we will use 2 very common techniques : CountVectorizer and TF-IDF

1) CountVectorizer:

CountVectorizer is used to convert a collection of text documents to a vector of token counts.

Example:

["beer", "most", "tasty", "beer", "world"]

Will be transformed into ⬇

beer	most	tasty	world
2	1	1	1

In practice, you have to define a vocabulary size and each text will be transform into a vector of size [1 x vocabulary size]. Consequently, zeros will be added to the vector for each word present in the corpus vocabulary but missing in the specific review. The vocabulary space is defined using term frequency across the corpus: the most frequent words are kept.

2) TF-IDF (optional):

TF-IDF or term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to reflect how important a word is to a document in a collection or corpus.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:

TF(t) = (Nbr of times term t appears in a document) / (Total nbr of terms in the
document)

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words wherein the word book appears 3 times. The term frequency (i.e., tf) for cat is:

TF(t) = (Nbr of times term t appears in a document) / (Total nbr of terms in the
document)
      = 3 / 100
      = 0.03

Now, assume we have 10 million documents and the word book appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as

IDF(t) = log(Total number of documents / Number of documents with term t in it)
       = log(10,000,000 / 1,000)
       = 4

Thus, the Tf-idf weight is the product of these quantities:

Tf-IDF = TF(t) * IDF(t)
       = 0.03 * 4
       = 0.12

Split the data into train/test sets¶

Before applying these transormation on the text, we will just split the data for the modelling part.

Keep 20% of the data for the test dataset

In [13]:

Copied!





TARGET = "is_good" # LINE TO BE REMOVED FOR STUDENTS
FEATURE = "text" # LINE TO BE REMOVED FOR STUDENTS

x_train, x_test, y_train, y_test = train_test_split(
    df_cleaned[FEATURE], # LINE TO BE REMOVED FOR STUDENTS
    df_cleaned[TARGET], # LINE TO BE REMOVED FOR STUDENTS
    test_size=0.2,
    random_state=42)
TARGET = "is_good" # LINE TO BE REMOVED FOR STUDENTS
FEATURE = "text" # LINE TO BE REMOVED FOR STUDENTS

x_train, x_test, y_train, y_test = train_test_split(
    df_cleaned[FEATURE], # LINE TO BE REMOVED FOR STUDENTS
    df_cleaned[TARGET], # LINE TO BE REMOVED FOR STUDENTS
    test_size=0.2,
    random_state=42)

6. Modelling¶

6.1 First model using CountVectorizer¶

Transform the text reviews into numerical vectors by counting the number of words in each reviews. Use the scikit-learn library.

In order not to bring information from the train set into the test set, you must train the CountVectorizer on the train set and apply it to the test set.

Hint:

sklearn.feature_extraction.text.CountVectorizer

In [14]:

Copied!





# Define the vocabulary size to 100
count_vectorizer = CountVectorizer(
    analyzer="word",
    max_features=100 # LINE TO BE REMOVED FOR STUDENTS
    )
# Define the vocabulary size to 100
count_vectorizer = CountVectorizer(
    analyzer="word",
    max_features=100 # LINE TO BE REMOVED FOR STUDENTS
    )

In [15]:

Copied!

# Apply the CountVectorizer and check the results on some rows
count_vectorizer.fit(x_train)

x_train_features = count_vectorizer.transform(x_train).toarray()
x_test_features = count_vectorizer.transform(x_test).toarray()
# Apply the CountVectorizer and check the results on some rows
count_vectorizer.fit(x_train)

x_train_features = count_vectorizer.transform(x_train).toarray()
x_test_features = count_vectorizer.transform(x_test).toarray()

In [16]:

Copied!

x_train_features[0]
x_train_features[0]

Out[16]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [17]:

Copied!

x_train.iloc[0]
x_train.iloc[0]

Out[17]:

'bottle home ... colins tasting ... deep hazy brown ... thin lacing ... light sour cheery light strawb nose ... tart sourness ... cherry ... light sourness ... ok.'

The next step is to define the model that will use in input the vectors from the the CountVectorizer. We will use a logistic regression model, using again the scikit-learn library.

In order to produce cleaner code, we can combine these 2 steps (CountVectorizer and Logistic regression) into a single pipeline.

Initialize the CountVectorizer
Initialize the LogisticRegression
Define your Pipeline object with these 2 steps
Fit the Pipeline

Hint:

In [18]:

Copied!





# Initialize the CountVectorizer
count_vectorizer = CountVectorizer(
    analyzer="word",
    max_features=100
    )

# Initialize the logistic regression
logit = LogisticRegression(solver="lbfgs", verbose=2, n_jobs=-1)

# Combine them into a Pipeline object
pipeline_cv = Pipeline([
    ("vectorizer", count_vectorizer), # LINE TO BE REMOVED FOR STUDENTS
    ("model", logit)]) # LINE TO BE REMOVED FOR STUDENTS

# Fit the Pipeline
pipeline_cv.fit(x_train, y_train)
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer(
    analyzer="word",
    max_features=100
    )

# Initialize the logistic regression
logit = LogisticRegression(solver="lbfgs", verbose=2, n_jobs=-1)

# Combine them into a Pipeline object
pipeline_cv = Pipeline([
    ("vectorizer", count_vectorizer), # LINE TO BE REMOVED FOR STUDENTS
    ("model", logit)]) # LINE TO BE REMOVED FOR STUDENTS

# Fit the Pipeline
pipeline_cv.fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          101     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  5.54518D+02    |proj g|=  2.04000D+02

At iterate    1    f=  5.31344D+02    |proj g|=  8.78928D+01

At iterate    2    f=  4.55160D+02    |proj g|=  4.01187D+01

At iterate    3    f=  4.32123D+02    |proj g|=  8.26324D+01

At iterate    4    f=  4.16715D+02    |proj g|=  3.98241D+01

At iterate    5    f=  3.97087D+02    |proj g|=  1.47372D+01

At iterate    6    f=  3.65120D+02    |proj g|=  1.53607D+01

At iterate    7    f=  3.40922D+02    |proj g|=  1.62430D+01

At iterate    8    f=  3.29177D+02    |proj g|=  3.47076D+00

At iterate    9    f=  3.27222D+02    |proj g|=  3.63787D+00

At iterate   10    f=  3.24542D+02    |proj g|=  4.96374D+00

At iterate   11    f=  3.22516D+02    |proj g|=  3.37517D+00

At iterate   12    f=  3.21303D+02    |proj g|=  1.66355D+00

At iterate   13    f=  3.21030D+02    |proj g|=  1.21264D+00

At iterate   14    f=  3.20910D+02    |proj g|=  4.79720D-01

At iterate   15    f=  3.20867D+02    |proj g|=  2.75051D+00

At iterate   16    f=  3.20817D+02    |proj g|=  7.42277D-01

At iterate   17    f=  3.20805D+02    |proj g|=  3.80123D-01

At iterate   18    f=  3.20798D+02    |proj g|=  2.03584D-01

At iterate   19    f=  3.20788D+02    |proj g|=  1.89959D-01

At iterate   20    f=  3.20780D+02    |proj g|=  1.59201D-01

At iterate   21    f=  3.20776D+02    |proj g|=  1.64005D-01

At iterate   22    f=  3.20775D+02    |proj g|=  6.25600D-02

At iterate   23    f=  3.20774D+02    |proj g|=  3.73012D-02

At iterate   24    f=  3.20774D+02    |proj g|=  2.13423D-02

At iterate   25    f=  3.20774D+02    |proj g|=  3.09411D-02

At iterate   26    f=  3.20774D+02    |proj g|=  1.86634D-02

At iterate   27    f=  3.20774D+02    |proj g|=  1.77506D-02

At iterate   28    f=  3.20774D+02    |proj g|=  1.19104D-02

At iterate   29    f=  3.20774D+02    |proj g|=  6.78925D-03

At iterate   30    f=  3.20774D+02    |proj g|=  4.54960D-03

At iterate   31    f=  3.20774D+02    |proj g|=  4.83343D-03

At iterate   32    f=  3.20774D+02    |proj g|=  8.18759D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  101     32     35      1     0     0   8.188D-03   3.208D+02
  F =   320.77383276003553     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH

 This problem is unconstrained.

Out[18]:

Pipeline(steps=[('vectorizer', CountVectorizer(max_features=100)),
                ('model', LogisticRegression(n_jobs=-1, verbose=2))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Now you can make predictions on the test set

In [19]:

Copied!

# predictions
y_pred_cv = pipeline_cv.predict(x_test)
# predictions
y_pred_cv = pipeline_cv.predict(x_test)

How to evaluate our model ?

Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observation to the total observations

In [20]:

Copied!

# Compute accuracy
print(f"model accuracy : {accuracy_score(y_pred_cv, y_test)} %")
# Compute accuracy
print(f"model accuracy : {accuracy_score(y_pred_cv, y_test)} %")

model accuracy : 0.72 %

What is you opinion about the accuracy score ?

One may think that, if we have high accuracy then our model is best. Yes, accuracy is a great measure but only when you have symmetric datasets where values of false positive and false negatives are almost same. Therefore, you have to look at other parameters to evaluate the performance of your model.

One might think that if we have high accuracy, our model is the best. Yes, accuracy is an excellent measure, but only when you have balanced data (i.e. an equivalent representation of each class in the data).

Let's do a test with a reference model to show how accuracy can be a source of error when evaluating a model: Create a model that predict everytime the most frequent class and compare the results.

In [21]:

Copied!

# Baseline model: predict the most-frequent class
y_pred_baseline = [y_test.mode()[0]] * len(y_test)

# Compute accuracy
print(f"model baseline accuracy : {accuracy_score(y_pred_baseline, y_test)} %")
# Baseline model: predict the most-frequent class
y_pred_baseline = [y_test.mode()[0]] * len(y_test)

# Compute accuracy
print(f"model baseline accuracy : {accuracy_score(y_pred_baseline, y_test)} %")

model baseline accuracy : 0.7 %

Accuracy are closed, but the last model is completely idiot !

In case of an imbalanced target (let's say 99% zeros), the accuracy of this dumb model will be 99% !

Therefore, you need to look at other metrics to evaluate the performance of your model.

6.2 Choose the right metrics¶

We need now define other metrics to evaluate our model.

True Positives (TP): these are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes.
True Negatives (TN): these are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no.
False Positives (FP): when actual class is no and predicted class is yes.
False Negatives (FN): when actual class is yes but predicted class in no.
Accuracy: the ratio of correctly predicted observation to the total observations.

Accuracy = $\frac{TP+TN}{TP+FP+FN+TN}$

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision relates to the low false positive rate.

Precision = $\frac{TP}{TP+FP}$

Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to the all observations in actual class - yes.

Recall = $\frac{TP}{TP+FN}$

F1 score: F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.

F1 Score = $2* \frac{Recall * Precision}{Recall + Precision}$

Let's now compare again our 2 models !

First, plot the confusion matrix, the precision, recall and f1 score for each model. You can also print the classification report of scikit learn that sums up the main classification metrics.

Hint:

In [22]:

Copied!

# Confusion matrices
print(f"Confusion matrix of the first model: \n {confusion_matrix(y_test, y_pred_cv)}")
print(f"Confusion matrix of the baseline model: \n {confusion_matrix(y_test, y_pred_baseline)}")
# Confusion matrices
print(f"Confusion matrix of the first model: \n {confusion_matrix(y_test, y_pred_cv)}")
print(f"Confusion matrix of the baseline model: \n {confusion_matrix(y_test, y_pred_baseline)}")

Confusion matrix of the first model: 
 [[122  18]
 [ 38  22]]
Confusion matrix of the baseline model: 
 [[140   0]
 [ 60   0]]

In [23]:

Copied!





# Evaluate the first model
print(f"first model precision : {precision_score(y_pred_cv, y_test):.{3}f}%")
print(f"first model recall    : {recall_score(y_pred_cv, y_test)}%")
print(f"first model f1 score  : {f1_score(y_pred_cv, y_test):.{3}f}%\n")

# Evaluate the baseline model
print(f"baseline model precision : {precision_score(y_pred_baseline, y_test)}%")
print(f"baseline model recall    : {recall_score(y_pred_baseline, y_test)}%")
print(f"baseline model f1 score  : {f1_score(y_pred_baseline, y_test)}%")
# Evaluate the first model
print(f"first model precision : {precision_score(y_pred_cv, y_test):.{3}f}%")
print(f"first model recall    : {recall_score(y_pred_cv, y_test)}%")
print(f"first model f1 score  : {f1_score(y_pred_cv, y_test):.{3}f}%\n")

# Evaluate the baseline model
print(f"baseline model precision : {precision_score(y_pred_baseline, y_test)}%")
print(f"baseline model recall    : {recall_score(y_pred_baseline, y_test)}%")
print(f"baseline model f1 score  : {f1_score(y_pred_baseline, y_test)}%")

first model precision : 0.367%
first model recall    : 0.55%
first model f1 score  : 0.440%

baseline model precision : 0.0%
baseline model recall    : 0.0%
baseline model f1 score  : 0.0%

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1469: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

In [24]:

Copied!





# Classification report
print(classification_report(
    y_test, y_pred_cv # LINE TO BE REMOVED FOR STUDENTS
))
# Classification report
print(classification_report(
    y_test, y_pred_cv # LINE TO BE REMOVED FOR STUDENTS
))

              precision    recall  f1-score   support

           0       0.76      0.87      0.81       140
           1       0.55      0.37      0.44        60

    accuracy                           0.72       200
   macro avg       0.66      0.62      0.63       200
weighted avg       0.70      0.72      0.70       200

In [25]:

Copied!

# Classification report
print(classification_report(y_test, y_pred_baseline))
# Classification report
print(classification_report(y_test, y_pred_baseline))

              precision    recall  f1-score   support

           0       0.70      1.00      0.82       140
           1       0.00      0.00      0.00        60

    accuracy                           0.70       200
   macro avg       0.35      0.50      0.41       200
weighted avg       0.49      0.70      0.58       200

/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1469: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1469: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1469: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

6.3 Second model using TF-IDF Vectorizer (optional)¶

In this last section, you will use a better approach in term of vectorization: TF-IDF

Scikit-learn provide the TfidfVectorizer that can be used in the same way as CountVectorizer.

Hint:

sklearn.feature_extraction.text.TfidfVectorizer

In [26]:

Copied!





# Initialize the TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    max_features=100 # LINE TO BE REMOVED FOR STUDENTS
    )

# Apply the TfidfVectorizer and check the results on some rows
tfidf_vectorizer.fit(x_train)

x_train_features = tfidf_vectorizer.transform(x_train).toarray()
x_test_features = tfidf_vectorizer.transform(x_test).toarray()
# Initialize the TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    max_features=100 # LINE TO BE REMOVED FOR STUDENTS
    )

# Apply the TfidfVectorizer and check the results on some rows
tfidf_vectorizer.fit(x_train)

x_train_features = tfidf_vectorizer.transform(x_train).toarray()
x_test_features = tfidf_vectorizer.transform(x_test).toarray()

In [27]:

Copied!

x_train_features[0]
x_train_features[0]

Out[27]:

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.19400832, 0.        , 0.22794963, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.32713051, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.30997343,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.28915725, 0.59635253, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.26488472, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.34835863, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.27665128, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [28]:

Copied!

x_train.iloc[0]
x_train.iloc[0]

Out[28]:

'bottle home ... colins tasting ... deep hazy brown ... thin lacing ... light sour cheery light strawb nose ... tart sourness ... cherry ... light sourness ... ok.'

You can now combine the vectorizer to a logistic regression in a single pipeline

In [29]:

Copied!





# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    max_features=100
    )

# Initialize the logistic regression
logit = LogisticRegression(solver='lbfgs', verbose=2, n_jobs=-1)

# Combine them into a Pipeline object
pipeline_tfidf = Pipeline([
    ('vectorizer', tfidf_vectorizer), # LINE TO BE REMOVED FOR STUDENTS
    ('model', logit)]) # LINE TO BE REMOVED FOR STUDENTS

# Fit the Pipeline
pipeline_tfidf.fit(x_train, y_train)
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    max_features=100
    )

# Initialize the logistic regression
logit = LogisticRegression(solver='lbfgs', verbose=2, n_jobs=-1)

# Combine them into a Pipeline object
pipeline_tfidf = Pipeline([
    ('vectorizer', tfidf_vectorizer), # LINE TO BE REMOVED FOR STUDENTS
    ('model', logit)]) # LINE TO BE REMOVED FOR STUDENTS

# Fit the Pipeline
pipeline_tfidf.fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          101     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  5.54518D+02    |proj g|=  2.04000D+02

At iterate    1    f=  4.43986D+02    |proj g|=  9.94050D+00

At iterate    2    f=  4.41607D+02    |proj g|=  5.63999D+00

At iterate    3    f=  4.31240D+02    |proj g|=  2.62300D+01

At iterate    4    f=  4.16711D+02    |proj g|=  4.32906D+01

At iterate    5    f=  3.94456D+02    |proj g|=  4.20929D+01

At iterate    6    f=  3.76615D+02    |proj g|=  8.35456D+00

At iterate    7    f=  3.75845D+02    |proj g|=  1.10314D+00

At iterate    8    f=  3.75745D+02    |proj g|=  2.13981D+00

At iterate    9    f=  3.75463D+02    |proj g|=  2.69875D+00

At iterate   10    f=  3.74909D+02    |proj g|=  1.97574D+00

At iterate   11    f=  3.74747D+02    |proj g|=  2.81994D-01

At iterate   12    f=  3.74739D+02    |proj g|=  9.21477D-02

At iterate   13    f=  3.74736D+02    |proj g|=  1.40607D-01

At iterate   14    f=  3.74733D+02    |proj g|=  3.94355D-01

At iterate   15    f=  3.74732D+02    |proj g|=  9.60931D-02

At iterate   16    f=  3.74731D+02    |proj g|=  1.79723D-02

At iterate   17    f=  3.74731D+02    |proj g|=  2.14785D-02

At iterate   18    f=  3.74731D+02    |proj g|=  1.95197D-02

At iterate   19    f=  3.74731D+02    |proj g|=  6.76658D-02

At iterate   20    f=  3.74731D+02    |proj g|=  3.09067D-02

At iterate   21    f=  3.74731D+02    |proj g|=  7.79600D-03

At iterate   22    f=  3.74731D+02    |proj g|=  5.07095D-03

At iterate   23    f=  3.74731D+02    |proj g|=  6.33766D-03

At iterate   24    f=  3.74731D+02    |proj g|=  2.25929D-03

At iterate   25    f=  3.74731D+02    |proj g|=  1.51012D-03

At iterate   26    f=  3.74731D+02    |proj g|=  1.39964D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  101     26     30      1     0     0   1.400D-03   3.747D+02
  F =   374.73103485123914     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH

 This problem is unconstrained.

Out[29]:

Pipeline(steps=[('vectorizer', TfidfVectorizer(max_features=100)),
                ('model', LogisticRegression(n_jobs=-1, verbose=2))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [30]:

Copied!





# Predictions
y_pred_tfidf = pipeline_tfidf.predict(x_test)

# Evaluate the second model
print(f"Confusion matrix of the first model: \n {confusion_matrix(y_test, y_pred_tfidf)}")
print(f"second model precision : {precision_score(y_pred_tfidf, y_test):.{3}f}%")
print(f"second model recall    : {recall_score(y_pred_tfidf, y_test)}%")
print(f"second model f1 score  : {f1_score(y_pred_tfidf, y_test):.{3}f}%\n")
# Predictions
y_pred_tfidf = pipeline_tfidf.predict(x_test)

# Evaluate the second model
print(f"Confusion matrix of the first model: \n {confusion_matrix(y_test, y_pred_tfidf)}")
print(f"second model precision : {precision_score(y_pred_tfidf, y_test):.{3}f}%")
print(f"second model recall    : {recall_score(y_pred_tfidf, y_test)}%")
print(f"second model f1 score  : {f1_score(y_pred_tfidf, y_test):.{3}f}%\n")

Confusion matrix of the first model: 
 [[138   2]
 [ 49  11]]
second model precision : 0.183%
second model recall    : 0.8461538461538461%
second model f1 score  : 0.301%

In [31]:

Copied!

# Classification report
print(classification_report(y_test, y_pred_tfidf))
# Classification report
print(classification_report(y_test, y_pred_tfidf))

              precision    recall  f1-score   support

           0       0.74      0.99      0.84       140
           1       0.85      0.18      0.30        60

    accuracy                           0.74       200
   macro avg       0.79      0.58      0.57       200
weighted avg       0.77      0.74      0.68       200

6.4 Optimize the model¶

You can now try to optimize the model by changing a lot of parameters:

Take more reviews in input
Increase the max_features parameter
Remove the most frequent words
Try adding n-grams to the vectorizer
Improve the text cleaning
etc...

In [ ]: