! pip install seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.precision", 2)
sns.set_style("whitegrid")
sns.set_context(rc={"patch.linewidth": 0.15})
Read remote dataset¶
The data is in this git repository: ML-boot-camp/ratebeer.git.
The data is located in the ratebeer/data/ folder.
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_enriched.parquet"
df = pd.read_parquet(file_url)
df.shape
(200000, 18)
Overview¶
Get some information about the content of the dataframe:
- number of null values per column
- data type of each column
- memory usage
Methods you'll need:
df.info(memory_usage="deep") # LINE TO BE REMOVED FOR STUDENTS
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200000 entries, 0 to 199999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 beer 200000 non-null object 1 brewery 200000 non-null object 2 alcohol 200000 non-null float64 3 type 200000 non-null object 4 rating_appearance 200000 non-null int64 5 rating_aroma 200000 non-null int64 6 rating_palate 200000 non-null int64 7 rating_taste 200000 non-null int64 8 rating 200000 non-null int64 9 timestamp 200000 non-null object 10 user 200000 non-null object 11 text 200000 non-null object 12 beer_degree 200000 non-null int64 13 brewery_degree 200000 non-null int64 14 user_degree 200000 non-null int64 15 text_length 200000 non-null int64 16 date 200000 non-null datetime64[us] 17 is_good 200000 non-null int64 dtypes: datetime64[us](1), float64(1), int64(10), object(6) memory usage: 153.7 MB
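The same three pieces of information can also be collected into a single table, which is convenient for sorting or filtering. The following is only a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame(
    {
        "beer": ["A", "B", None],
        "alcohol": [5.0, None, 7.5],
        "rating": [14, 12, 19],
    }
)

# One row per column: null count, dtype and deep memory usage in bytes.
overview = pd.DataFrame(
    {
        "n_null": toy.isna().sum(),
        "dtype": toy.dtypes.astype(str),
        "memory_bytes": toy.memory_usage(deep=True, index=False),
    }
)
print(overview)
```

Passing `deep=True` makes pandas measure the actual size of the Python strings in `object` columns, which is the same option used by `df.info(memory_usage="deep")`.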
Sample¶
Show a sample of the data
df
| | beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... | 51 | 413 | 300 | 90 | 2008-07-31 02:00:00 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with gar... | 4 | 413 | 300 | 191 | 2011-08-06 02:00:00 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, sl... | 43 | 413 | 300 | 109 | 2008-03-09 01:00:00 | 0 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foa... | 20 | 413 | 300 | 151 | 2009-10-17 02:00:00 | 0 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pou... | 18 | 978 | 300 | 159 | 2010-08-17 02:00:00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199995 | Die M Dunkel | 6587 | -1.00 | Dunkel | 4 | 5 | 3 | 3 | 8 | 1204070400 | kuleko | Bottle, supermarket. Very dark colour. Sweet c... | 2 | 2 | 1 | 163 | 2008-02-27 01:00:00 | 0 |
| 199996 | Montt Hafen Porter | 8240 | 6.00 | Porter | 5 | 10 | 5 | 10 | 19 | 1175040000 | Andressantis | Sabrosa,con amargor justo y ligeramente dulce.... | 1 | 1 | 1 | 97 | 2007-03-28 02:00:00 | 1 |
| 199997 | Oxymore | 8022 | 5.00 | Saison | 4 | 8 | 3 | 6 | 12 | 1181952000 | DesmondLangham | Very earthy nose.\tClear and golden, head kept... | 1 | 1 | 1 | 192 | 2007-06-16 02:00:00 | 0 |
| 199998 | Stonehouse Station Bitter | 8561 | 3.90 | Bitter | 3 | 5 | 4 | 6 | 13 | 1238716800 | thedees | Cask at the Three Pigeons Nescliffe. Maly swee... | 1 | 2 | 1 | 88 | 2009-04-03 02:00:00 | 0 |
| 199999 | Pilker Negra | 6687 | 5.00 | Stout | 4 | 6 | 2 | 5 | 5 | 1142553600 | apeters | Roasty aroma.Black color.Dry and bitter flavor... | 1 | 1 | 1 | 144 | 2006-03-17 01:00:00 | 0 |
200000 rows × 18 columns
Describing statistics¶
Compute statistics to understand the content of each column.
Methods you'll need:
Bonus: fill NaN values with an empty string "" for better readability using:
df.describe(include="all").fillna("").T
| | count | unique | top | freq | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| beer | 200000 | 42125 | Guinness Draught | 252 | | | | | | | |
| brewery | 200000 | 5413 | 32 | 3217 | | | | | | | |
| alcohol | 200000.0 | | | | 6.28 | -1.0 | 5.0 | 5.8 | 8.0 | 57.7 | 2.78 |
| type | 200000 | 89 | India Pale Ale (IPA) | 12698 | | | | | | | |
| rating_appearance | 200000.0 | | | | 3.43 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 | 0.81 |
| rating_aroma | 200000.0 | | | | 6.35 | 1.0 | 6.0 | 7.0 | 7.0 | 10.0 | 1.64 |
| rating_palate | 200000.0 | | | | 3.26 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 | 0.83 |
| rating_taste | 200000.0 | | | | 6.45 | 1.0 | 6.0 | 7.0 | 8.0 | 10.0 | 1.62 |
| rating | 200000.0 | | | | 13.19 | 1.0 | 12.0 | 14.0 | 15.0 | 20.0 | 3.35 |
| timestamp | 200000 | 4105 | 1188604800 | 392 | | | | | | | |
| user | 200000 | 11251 | fonefan | 1146 | | | | | | | |
| text | 200000 | 199261 | | 334 | | | | | | | |
| beer_degree | 200000.0 | | | | 34.88 | 1.0 | 4.0 | 16.0 | 50.0 | 252.0 | 44.31 |
| brewery_degree | 200000.0 | | | | 605.39 | 1.0 | 68.0 | 304.0 | 828.0 | 3217.0 | 748.4 |
| user_degree | 200000.0 | | | | 156.38 | 1.0 | 34.0 | 97.0 | 204.0 | 1146.0 | 185.3 |
| text_length | 200000.0 | | | | 307.46 | 0.0 | 176.0 | 254.0 | 375.0 | 7989.0 | 212.23 |
| date | 200000 | | | | 2008-03-08 05:48:03.672000 | 2000-04-23 02:00:00 | 2006-05-23 02:00:00 | 2008-07-08 02:00:00 | 2010-05-08 02:00:00 | 2012-01-13 01:00:00 | |
| is_good | 200000.0 | | | | 0.24 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.43 |
Target¶
The dataset contains 2 possible targets:
- rating: an ordinal variable, which can be used to define a regression
- is_good: a binary variable, which can be used to define a classification

For the rest of the exploratory data analysis, for the sake of clarity, we'll consider only the binary target (classification). Some plots would be different for a numeric target.
is_good¶
(
(df.is_good)
.value_counts()
.plot.bar() # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='is_good'>
sns.countplot(
df,
x="is_good", # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='is_good', ylabel='count'>
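The bar plots show absolute counts; the class balance can also be read as proportions (the `describe` table above reports a mean of 0.24 for `is_good`, i.e. roughly a quarter of the reviews are positive). A minimal sketch on hypothetical toy labels:

```python
import pandas as pd

# Toy labels standing in for df.is_good (hypothetical values).
is_good = pd.Series([0, 0, 0, 1], name="is_good")

# normalize=True returns the share of each class instead of raw counts.
balance = is_good.value_counts(normalize=True).sort_index()
print(balance)
```

On the real data, the same call would be applied to `df.is_good`.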
Quantitative variables¶
- alcohol
- date
- rating_appearance
- rating_aroma
- rating_palate
- rating_taste
((df.alcohol).plot.hist(bins=100)) # LINE TO BE REMOVED FOR STUDENTS
<Axes: ylabel='Frequency'>
sns.displot(
df,
x="alcohol",
)
<seaborn.axisgrid.FacetGrid at 0x7fac4fb92790>
sns.displot(df, x="alcohol", hue="is_good")
<seaborn.axisgrid.FacetGrid at 0x7fac4f0af690>
sns.displot(
df.loc[lambda df: df.alcohol < 20], x="alcohol", hue="is_good"
)
<seaborn.axisgrid.FacetGrid at 0x7fac4e8c70d0>
sns.displot(
df.loc[lambda df: df.alcohol < 20],
x="alcohol",
hue="is_good",
kde=True,
)
<seaborn.axisgrid.FacetGrid at 0x7fac5493dc50>
sns.displot(
df.loc[lambda df: df.alcohol < 20],
x="alcohol",
hue="is_good",
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac4e386290>
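With `multiple="fill"`, each histogram bin is normalized so that the bars show the share of each class within the bin rather than its count. The numbers behind such a plot can be approximated by binning the variable and averaging the binary target per bin. This is only a sketch on hypothetical toy values, with arbitrarily chosen bin edges:

```python
import pandas as pd

# Toy data standing in for the alcohol / is_good columns (hypothetical values).
toy = pd.DataFrame(
    {
        "alcohol": [4.5, 4.8, 5.2, 5.9, 8.1, 9.0],
        "is_good": [0, 0, 0, 1, 1, 1],
    }
)

# Arbitrary bin edges, chosen only for the illustration.
bins = [4, 5, 6, 10]

# Mean of a 0/1 target per bin = share of positive reviews in that bin.
share = (
    toy.assign(alcohol_bin=pd.cut(toy.alcohol, bins=bins))
    .groupby("alcohol_bin", observed=True)
    .is_good.mean()
)
print(share)
```

The bins seaborn picks automatically will differ from these edges, so the values are comparable in spirit only.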
date¶
Distribution¶
sns.displot(
df,
x="date", # LINE TO BE REMOVED FOR STUDENTS
)
<seaborn.axisgrid.FacetGrid at 0x7fac4dc46c50>
Relationship with the target¶
sns.displot(
df,
x="date", # LINE TO BE REMOVED FOR STUDENTS
hue="is_good", # LINE TO BE REMOVED FOR STUDENTS
kde=True,
)
<seaborn.axisgrid.FacetGrid at 0x7fac4db6cb90>
sns.displot(
df,
x="date", # LINE TO BE REMOVED FOR STUDENTS
hue="is_good", # LINE TO BE REMOVED FOR STUDENTS
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac4e68fe10>
review_columns = [
"rating_appearance",
"rating_aroma",
"rating_palate", # LINE TO BE REMOVED FOR STUDENTS
"rating_taste", # LINE TO BE REMOVED FOR STUDENTS
]
df_rating_long = df.melt(id_vars="is_good", value_vars=review_columns)
df_rating_long
| | is_good | variable | value |
|---|---|---|---|
| 0 | 0 | rating_appearance | 4 |
| 1 | 0 | rating_appearance | 3 |
| 2 | 0 | rating_appearance | 3 |
| 3 | 0 | rating_appearance | 3 |
| 4 | 0 | rating_appearance | 3 |
| ... | ... | ... | ... |
| 799995 | 0 | rating_taste | 3 |
| 799996 | 1 | rating_taste | 10 |
| 799997 | 0 | rating_taste | 6 |
| 799998 | 0 | rating_taste | 6 |
| 799999 | 0 | rating_taste | 5 |
800000 rows × 3 columns
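`melt` reshapes the table from wide to long format: each of the four rating columns is stacked into a single `value` column, with `variable` holding the original column name. This is why the result has 200000 × 4 = 800000 rows. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Toy wide table (hypothetical values), one column per rating.
wide = pd.DataFrame(
    {
        "is_good": [0, 1],
        "rating_aroma": [6, 9],
        "rating_taste": [5, 10],
    }
)

# Stack the rating columns: 2 rows x 2 value columns -> 4 rows.
long = wide.melt(id_vars="is_good", value_vars=["rating_aroma", "rating_taste"])
print(long)
```

The long format is what lets seaborn draw one facet per rating with `row="variable"`.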
sns.displot(
df_rating_long,
x="value",
row="variable",
discrete=True,
height=3,
aspect=2,
)
<seaborn.axisgrid.FacetGrid at 0x7fac4d83e810>
Relationship with the target¶
sns.displot(
df_rating_long,
x="value",
hue="is_good",
row="variable",
discrete=True,
height=3,
aspect=2,
)
<seaborn.axisgrid.FacetGrid at 0x7fac4d996b90>
sns.displot(
df_rating_long,
x="value",
hue="is_good",
row="variable",
discrete=True,
height=3,
aspect=2,
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac11005010>
Categorical variables¶
- type
- beer
(
(df.type)
.value_counts()
.plot.bar()
)
<Axes: xlabel='type'>
Relationship with the target¶
sns.displot(
df,
x="type",
discrete=True,
hue="is_good",
)
<seaborn.axisgrid.FacetGrid at 0x7fac110496d0>
df_styles = (
(df)
.groupby("type")
.is_good.agg(["count", "mean"])
.add_prefix("review_")
.reset_index()
.sort_values(by="review_mean", ascending=False)
.reset_index(drop=True)
.assign(
bar_left_position=lambda df: df.review_count.cumsum().shift(1, fill_value=0)
)
)
df_styles
| | type | review_count | review_mean | bar_left_position |
|---|---|---|---|---|
| 0 | Imperial Stout | 8460 | 0.63 | 0 |
| 1 | Ice Cider/Perry | 131 | 0.56 | 8460 |
| 2 | Abt/Quadrupel | 1327 | 0.55 | 8591 |
| 3 | Sour Ale/Wild Ale | 3264 | 0.55 | 9918 |
| 4 | Imperial/Double IPA | 7215 | 0.53 | 13182 |
| ... | ... | ... | ... | ... |
| 84 | Sak - Infused | 16 | 0.00 | 199959 |
| 85 | Sak - Honjozo | 5 | 0.00 | 199975 |
| 86 | Sak - Genshu | 6 | 0.00 | 199980 |
| 87 | Sak - Taru | 9 | 0.00 | 199986 |
| 88 | Sak - Tokubetsu | 5 | 0.00 | 199995 |
89 rows × 4 columns
plt.bar(
x=df_styles.bar_left_position,
height=df_styles.review_mean,
width=df_styles.review_count,
align="edge",
alpha=0.5,
edgecolor="k",
linewidth=0.5,
)
<BarContainer object of 89 artists>
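In this variable-width bar chart, each bar's width is the number of reviews of the style and its height is the share of good reviews, so each bar must start where the previous one ends. The `cumsum().shift(1, fill_value=0)` expression computes exactly those left edges. A small sketch using the first three `review_count` values from the table above:

```python
import pandas as pd

# First three review counts from the df_styles table.
counts = pd.Series([8460, 131, 1327], name="review_count")

# Left edge of each bar = total width of all previous bars.
left = counts.cumsum().shift(1, fill_value=0)

# Right edge = left edge + own width; it is the next bar's left edge.
right = left + counts
print(pd.DataFrame({"left": left, "width": counts, "right": right}))
```

Combined with `align="edge"`, this places the bars side by side without gaps or overlaps.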
High cardinality variables¶
- beer
- brewery
- user

All these high-cardinality variables can be thought of as the nodes of a network: a review comprises a beer, a brewery and a user, and can be seen as a link between them.
In other words, the review table describes the links of a network with 3 types of nodes: users, beers and breweries.
The first property to compute about each node is its "degree", which is its number of connections with other nodes. High degree means "highly connected".
Analysing the degree of the nodes is a way to answer the following questions:
- is an experienced user more severe?
- is a new user more forgiving?
- is a popular beer (or a big brewery) disadvantaged by a "boredom factor"?
- is a new beer (or a small brewery) benefitting from a "novelty factor"?
To compute the degree you'll need:
df_beer_degree = df.loc[:, ["beer", "beer_degree"]].drop_duplicates()
df_beer_degree
| | beer | beer_degree |
|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 51 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 4 |
| 2 | Breckenridge Avalanche Amber | 43 |
| 3 | Breckenridge Lucky U IPA | 20 |
| 4 | Fullers Vintage Ale 2009 | 18 |
| ... | ... | ... |
| 199994 | Acadie-Broue La Patente | 1 |
| 199996 | Montt Hafen Porter | 1 |
| 199997 | Oxymore | 1 |
| 199998 | Stonehouse Station Bitter | 1 |
| 199999 | Pilker Negra | 1 |
42125 rows × 2 columns
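The degree columns are already present in this dataset. The statistics above suggest that a node's degree is its number of reviews in the sample (for instance, the maximum `beer_degree`, 252, equals the review count of the most frequent beer). Under that assumption, a minimal sketch of how such columns could be derived on a hypothetical toy review table:

```python
import pandas as pd

# Toy review table (hypothetical values): one row per review.
reviews = pd.DataFrame(
    {
        "beer": ["A", "A", "B", "A", "C"],
        "user": ["u1", "u2", "u1", "u3", "u1"],
    }
)

# Assumed definition: degree = number of reviews involving the node.
# transform("count") broadcasts the group size back to every row.
reviews = reviews.assign(
    beer_degree=lambda d: d.groupby("beer")["beer"].transform("count"),
    user_degree=lambda d: d.groupby("user")["user"].transform("count"),
)
print(reviews)
```

How the degrees were actually computed for this dataset is not shown here, so this should be read as an illustration rather than the original preprocessing.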
(
(df_beer_degree.beer_degree)
.value_counts()
.reset_index()
.plot.scatter(
x="beer_degree", y="count", marker="." # LINE TO BE REMOVED FOR STUDENTS
)
)
<Axes: xlabel='beer_degree', ylabel='count'>
Many networks are scale-free networks, meaning that the degree distribution of their nodes follows a power law, which is better visualized using a log-log scale:

(
(df_beer_degree.beer_degree)
.value_counts()
.reset_index()
.plot.scatter( # LINE TO BE REMOVED FOR STUDENTS
x="beer_degree", # LINE TO BE REMOVED FOR STUDENTS
y="count", # LINE TO BE REMOVED FOR STUDENTS
loglog=True, # LINE TO BE REMOVED FOR STUDENTS
marker="." # LINE TO BE REMOVED FOR STUDENTS
) # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='beer_degree', ylabel='count'>
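On log-log axes a power law $count \propto degree^{-\alpha}$ becomes a straight line, since $\log(count) = -\alpha \log(degree) + c$. A rough way to read off the exponent is to fit a line to the log-transformed values. The sketch below uses synthetic data with a known exponent (the values are made up, not taken from the dataset), and a least-squares fit like this is only a quick diagnostic, not a rigorous estimator:

```python
import numpy as np

# Synthetic power law with exponent 2 (hypothetical values).
degree = np.array([1, 2, 4, 8, 16], dtype=float)
count = 1000.0 * degree ** -2.0

# Fit a straight line in log-log space: the slope estimates -alpha.
slope, intercept = np.polyfit(np.log10(degree), np.log10(count), deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On real degree counts the points scatter around the line, especially in the sparse high-degree tail, so the fitted slope should be interpreted with care.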
Another way to visualize the power law distribution is the rank-frequency plot.
Instead of plotting the frequency as a function of the value (as in a usual distribution plot), we plot the frequency as a function of its rank. Note: by construction, this curve is always monotonically decreasing.
Power law distributions also exhibit a distinctive visual pattern in the rank-frequency plot, known as Zipf's law:
def plot_rank_size(series):
    return (
        (series)
        .rename("")
        .value_counts()
        .reset_index()
        .assign(rank=lambda df: range(1, 1 + df.shape[0]))
        .plot(x="rank", y="count", loglog=True, marker=".")
    )
plot_rank_size(df_beer_degree.beer_degree)
<Axes: xlabel='rank'>
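To make the rank-frequency idea concrete, here is a small sketch that builds the table behind such a plot without drawing it. The helper name `rank_frequency` and the toy degrees are hypothetical; the explicit renames are only there so that the column names do not depend on the pandas version.

```python
import pandas as pd


def rank_frequency(series: pd.Series) -> pd.DataFrame:
    """Return one row per distinct value, from most to least frequent."""
    return (
        series.value_counts()  # counts, sorted in decreasing order
        .rename("count")
        .rename_axis("value")
        .reset_index()
        .assign(rank=lambda d: range(1, len(d) + 1))
    )


# Hypothetical toy degrees: many small values, few large ones.
toy_degrees = pd.Series([1, 1, 1, 1, 2, 2, 3])
table = rank_frequency(toy_degrees)
print(table)
```

Because `value_counts` sorts by decreasing count, the `count` column can never increase with `rank`, which is why the plot is always decreasing.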
Relationship with target
sns.displot(df, x="beer_degree", hue="is_good")
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<seaborn.axisgrid.FacetGrid at 0x7fac10416c50>
sns.displot(df, x="beer_degree", hue="is_good", log_scale=True)
<seaborn.axisgrid.FacetGrid at 0x7fac0ff42610>
sns.displot(
df,
x="beer_degree",
hue="is_good",
log_scale=True,
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0fc496d0>
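The `multiple="fill"` plot shows, for each degree bin, the proportion of each class. A numeric counterpart can be sketched with `pd.cut` and a group-wise mean. The toy data, the bin edges and the 0/1 encoding of `is_good` below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; is_good is assumed to be encoded as 0/1.
toy = pd.DataFrame(
    {
        "beer_degree": [1, 2, 3, 10, 20, 30, 100, 200, 300],
        "is_good": [0, 0, 1, 0, 1, 1, 1, 1, 1],
    }
)

# Log-spaced bin edges (one bin per decade), left-closed: [1, 10), [10, 100), ...
bins = [1, 10, 100, 1000]

share_good = (
    toy.assign(degree_bin=lambda d: pd.cut(d["beer_degree"], bins=bins, right=False))
    .groupby("degree_bin", observed=True)["is_good"]
    .agg(["mean", "count"])  # mean of a 0/1 column = share of good reviews
)
print(share_good)
```

Keeping the `count` next to the `mean` helps judge how reliable each share is, since high-degree bins often contain few nodes.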
df_brewery_degree = df.loc[:, ["brewery", "brewery_degree"]].drop_duplicates()
df_brewery_degree
| | brewery | brewery_degree |
|---|---|---|
| 0 | 383 | 413 |
| 4 | 55 | 978 |
| 5 | 32 | 3217 |
| 9 | 284 | 610 |
| 10 | 149 | 400 |
| ... | ... | ... |
| 199992 | 13436 | 2 |
| 199994 | 12747 | 1 |
| 199996 | 8240 | 1 |
| 199997 | 8022 | 1 |
| 199999 | 6687 | 1 |
5413 rows × 2 columns
(
(df_brewery_degree.brewery_degree)
.value_counts()
.reset_index()
.plot.scatter( # LINE TO BE REMOVED FOR STUDENTS
x="brewery_degree", # LINE TO BE REMOVED FOR STUDENTS
y="count", # LINE TO BE REMOVED FOR STUDENTS
marker=".", # LINE TO BE REMOVED FOR STUDENTS
) # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='brewery_degree', ylabel='count'>
(
(df_brewery_degree.brewery_degree)
.value_counts()
.reset_index()
.plot.scatter( # LINE TO BE REMOVED FOR STUDENTS
x="brewery_degree", # LINE TO BE REMOVED FOR STUDENTS
y="count", # LINE TO BE REMOVED FOR STUDENTS
loglog=True, # LINE TO BE REMOVED FOR STUDENTS
marker=".", # LINE TO BE REMOVED FOR STUDENTS
) # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='brewery_degree', ylabel='count'>
plot_rank_size(df_brewery_degree.brewery_degree)
<Axes: xlabel='rank'>
Relationship with target
sns.displot(df, x="brewery_degree", hue="is_good")
<seaborn.axisgrid.FacetGrid at 0x7fac0f8b3c10>
sns.displot(
df, x="brewery_degree", hue="is_good", log_scale=True
)
<seaborn.axisgrid.FacetGrid at 0x7fac0f328f10>
sns.displot(
df,
x="brewery_degree",
hue="is_good",
log_scale=True,
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0f8d1490>
df_user_degree = df.loc[:, ["user", "user_degree"]].drop_duplicates()
df_user_degree
| | user | user_degree |
|---|---|---|
| 0 | blutt59 | 300 |
| 300 | VAYankee | 23 |
| 323 | garretjax | 7 |
| 330 | randolphjj | 10 |
| 340 | Leafs93 | 56 |
| ... | ... | ... |
| 199995 | kuleko | 1 |
| 199996 | Andressantis | 1 |
| 199997 | DesmondLangham | 1 |
| 199998 | thedees | 1 |
| 199999 | apeters | 1 |
11251 rows × 2 columns
(
(df_user_degree.user_degree)
.value_counts()
.reset_index()
.plot.scatter( # LINE TO BE REMOVED FOR STUDENTS
x="user_degree", # LINE TO BE REMOVED FOR STUDENTS
y="count", # LINE TO BE REMOVED FOR STUDENTS
marker=".", # LINE TO BE REMOVED FOR STUDENTS
) # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='user_degree', ylabel='count'>
(
(df_user_degree.user_degree)
.value_counts()
.reset_index()
.plot.scatter( # LINE TO BE REMOVED FOR STUDENTS
x="user_degree", # LINE TO BE REMOVED FOR STUDENTS
y="count", # LINE TO BE REMOVED FOR STUDENTS
loglog=True, # LINE TO BE REMOVED FOR STUDENTS
marker=".", # LINE TO BE REMOVED FOR STUDENTS
) # LINE TO BE REMOVED FOR STUDENTS
)
<Axes: xlabel='user_degree', ylabel='count'>
plot_rank_size(df_user_degree.user_degree)
<Axes: xlabel='rank'>
Relationship with target
sns.displot(
df, x="user_degree", hue="is_good"
)
<seaborn.axisgrid.FacetGrid at 0x7fac0efb1050>
sns.displot(
df,
x="user_degree",
hue="is_good",
log_scale=True,
)
<seaborn.axisgrid.FacetGrid at 0x7fac0ea93750>
sns.displot(
df,
x="user_degree",
hue="is_good",
log_scale=True,
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0e51fbd0>
Text variable
Using the pd.Series.str API
(df.text_length).plot.hist(bins=200) # LINE TO BE REMOVED FOR STUDENTS
<Axes: ylabel='Frequency'>
((df.text_length).plot.hist(bins=200, logy=True))
<Axes: ylabel='Frequency'>
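The histograms above rely on a `text_length` column. As a hedged sketch of how such a column could be derived with the `pd.Series.str` API (the toy `reviews` frame is hypothetical, and the notebook may compute the column differently):

```python
import pandas as pd

# Hypothetical toy reviews, including an empty and a missing text.
reviews = pd.DataFrame({"text": ["Nice hoppy aroma.", "", None, "Thin"]})

# .str.len() counts characters; missing texts are treated as empty here
# (an assumption) so that the result stays an integer column.
reviews["text_length"] = reviews["text"].fillna("").str.len()
print(reviews)
```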
Relationship with the target
sns.displot(df, x="text_length")
<seaborn.axisgrid.FacetGrid at 0x7fac0e153b90>
sns.displot(
df,
x="text_length",
hue="is_good",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0d4efe10>
sns.displot(
df.loc[lambda df: df.text_length < 1500],
x="text_length",
hue="is_good",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0c3cbcd0>
sns.displot(
df.loc[lambda df: df.text_length < 1500],
x="text_length",
hue="is_good",
multiple="fill",
)
<seaborn.axisgrid.FacetGrid at 0x7fac0c311cd0>
Words associated with positive & negative reviews (optional)
(
    (df)
    .head(100000)
    .assign(
        tokenized_text=lambda df: (df.text)
        .str.lower()
        # regex=True is needed: since pandas 2.0, str.replace treats the
        # pattern as a literal string by default
        .str.replace(r"[^a-z]", " ", regex=True)
        .str.replace(r" +", " ", regex=True)
        .str.split(" ")
    )
    .loc[:, ["rating", "tokenized_text"]]
    .explode("tokenized_text")
    .loc[lambda df: df.tokenized_text != ""]
    .groupby("tokenized_text", as_index=False)
    .agg(["mean", "count"])
    .reset_index()
    .sort_values(by=("rating", "count"), ascending=False)
    .head(200)
    .style.background_gradient(cmap="RdYlGn")
)
| | index | tokenized_text | rating (mean) | rating (count) |
|---|---|---|---|---|
| 23275 | 23275 | a | 13.586334 | 241381 |
| 28030 | 28030 | and | 13.721123 | 212352 |
| 148797 | 148797 | the | 13.713381 | 169186 |
| 161860 | 161860 | with | 13.563324 | 156450 |
| 112885 | 112885 | of | 13.675712 | 133776 |
| 90940 | 90940 | is | 13.447157 | 109834 |
| 150780 | 150780 | to | 13.522130 | 58857 |
| 89101 | 89101 | in | 13.507647 | 49298 |
| 97612 | 97612 | light | 12.859438 | 47296 |
| 82869 | 82869 | head. | 13.214893 | 47270 |
| 29776 | 29776 | aroma | 13.304910 | 46335 |
| 138634 | 138634 | some | 13.531472 | 44992 |
| 44510 | 44510 | but | 13.407445 | 44109 |
| 149542 | 149542 | this | 13.633697 | 42211 |
| 157022 | 157022 | very | 13.926570 | 39943 |
| 88160 | 88160 | i | 13.446812 | 39135 |
| 145181 | 145181 | sweet | 13.353797 | 37920 |
| 72292 | 72292 | flavor | 13.438029 | 32943 |
| 91152 | 91152 | it | 13.442805 | 32057 |
| 111067 | 111067 | not | 12.693228 | 30922 |
| 160838 | 160838 | white | 12.765617 | 30655 |
| 148708 | 148708 | that | 13.628333 | 30191 |
| 113892 | 113892 | on | 13.704276 | 29913 |
| 31158 | 31158 | at | 13.454488 | 28498 |
| 58150 | 58150 | dark | 14.380695 | 27723 |
| 109609 | 109609 | nice | 14.524580 | 26078 |
| 122501 | 122501 | pours | 13.746651 | 25230 |
| 103581 | 103581 | medium | 13.422387 | 25086 |
| 30626 | 30626 | as | 13.803236 | 24471 |
| 100764 | 100764 | malt | 13.348653 | 24196 |
| 158640 | 158640 | was | 13.377105 | 22503 |
| 73925 | 73925 | for | 13.503043 | 21360 |
| 71204 | 71204 | finish. | 13.413420 | 21327 |
| 82746 | 82746 | head | 13.647756 | 20923 |
| 37198 | 37198 | bit | 13.368343 | 20899 |
| 43168 | 43168 | brown | 14.151154 | 20833 |
| 34593 | 34593 | beer | 13.235946 | 20564 |
| 78974 | 78974 | good | 13.946021 | 20156 |
| 147289 | 147289 | taste | 13.303670 | 18609 |
| 40087 | 40087 | bottle | 13.538968 | 18079 |
| 75063 | 75063 | from | 13.663889 | 17399 |
| 27959 | 27959 | an | 13.509422 | 17141 |
| 39125 | 39125 | body | 13.382865 | 16878 |
| 97915 | 97915 | like | 13.042391 | 16442 |
| 37357 | 37357 | bitter | 13.499561 | 15952 |
| 98706 | 98706 | little | 13.228940 | 15812 |
| 45933 | 45933 | caramel | 13.783775 | 15655 |
| 86751 | 86751 | hops | 13.759464 | 15137 |
| 111258 | 111258 | notes | 13.309095 | 14701 |
| 27424 | 27424 | amber | 13.368961 | 14511 |
| 86086 | 86086 | hop | 14.022538 | 14376 |
| 106960 | 106960 | more | 13.861212 | 14324 |
| 53151 | 53151 | color | 13.430517 | 14320 |
| 63943 | 63943 | dry | 13.494540 | 13827 |
| 82338 | 82338 | has | 13.569667 | 13299 |
| 136869 | 136869 | slightly | 13.206275 | 13196 |
| 51533 | 51533 | clear | 12.633244 | 12736 |
| 78761 | 78761 | golden | 12.470602 | 12688 |
| 101560 | 101560 | malty | 12.990689 | 11922 |
| 29931 | 29931 | aroma. | 12.967441 | 11794 |
| 149239 | 149239 | thin | 12.733708 | 11754 |
| 137156 | 137156 | small | 12.979860 | 11569 |
| 9029 | 9029 | - | 13.323211 | 11559 |
| 91489 | 91489 | its | 13.395435 | 11390 |
| 71131 | 71131 | finish | 13.701222 | 11373 |
| 114591 | 114591 | orange | 13.661272 | 11369 |
| 129833 | 129833 | roasted | 14.284308 | 11273 |
| 125314 | 125314 | quite | 13.656752 | 11196 |
| 110779 | 110779 | nose | 13.751688 | 10813 |
| 92905 | 92905 | just | 13.294800 | 10597 |
| 75764 | 75764 | fruity | 13.565176 | 10579 |
| 40285 | 40285 | bottle. | 13.097241 | 10510 |
| 49752 | 49752 | chocolate | 14.750944 | 10323 |
| 108504 | 108504 | my | 13.549403 | 9969 |
| 107899 | 107899 | much | 12.856110 | 9959 |
| 82665 | 82665 | hazy | 13.491174 | 9801 |
| 100788 | 100788 | malt, | 13.555246 | 9693 |
| 126698 | 126698 | really | 13.882165 | 9683 |
| 50818 | 50818 | citrus | 14.087832 | 9632 |
| 113998 | 113998 | one | 13.868399 | 9582 |
| 105037 | 105037 | mild | 13.179398 | 9465 |
| 146515 | 146515 | tan | 14.401638 | 9399 |
| 44828 | 44828 | by | 13.822816 | 9318 |
| 56539 | 56539 | creamy | 14.200086 | 9311 |
| 159634 | 159634 | well | 14.450652 | 9129 |
| 25862 | 25862 | alcohol | 14.389866 | 8980 |
| 29570 | 29570 | are | 13.712486 | 8970 |
| 34777 | 34777 | beer. | 13.502987 | 8871 |
| 38010 | 38010 | black | 14.801697 | 8840 |
| 45947 | 45947 | caramel, | 13.911252 | 8834 |
| 148625 | 148625 | than | 13.355771 | 8750 |
| 110115 | 110115 | no | 12.417195 | 8677 |
| 86895 | 86895 | hops. | 13.367686 | 8374 |
| 123193 | 123193 | pretty | 13.506425 | 8327 |
| 97827 | 97827 | lightly | 13.027896 | 8209 |
| 106284 | 106284 | moderate | 12.945842 | 8069 |
| 151530 | 151530 | too | 12.780524 | 8051 |
| 86611 | 86611 | hoppy | 13.600974 | 8012 |
| 151961 | 151961 | touch | 13.537463 | 8008 |
| 34195 | 34195 | be | 13.331522 | 7912 |
| 145204 | 145204 | sweet, | 13.192654 | 7895 |
| 113079 | 113079 | off-white | 13.344354 | 7864 |
| 116984 | 116984 | pale | 11.721063 | 7862 |
| 46556 | 46556 | carbonation. | 13.440301 | 7831 |
| 137670 | 137670 | smooth | 14.616456 | 7754 |
| 136840 | 136840 | slight | 13.355573 | 7734 |
| 36787 | 36787 | big | 14.338218 | 7643 |
| 163706 | 163706 | yellow | 11.484215 | 7507 |
| 148977 | 148977 | there | 13.509148 | 7379 |
| 75280 | 75280 | fruit | 13.969281 | 7357 |
| 155508 | 155508 | up | 13.993055 | 7343 |
| 85156 | 85156 | hint | 13.467099 | 7249 |
| 82445 | 82445 | have | 13.392852 | 7247 |
| 72755 | 72755 | flavour | 12.493092 | 7238 |
| 37654 | 37654 | bitterness | 13.744235 | 7112 |
| 101371 | 101371 | malts | 13.707405 | 6955 |
| 99421 | 99421 | lots | 14.327616 | 6938 |
| 81563 | 81563 | had | 13.487399 | 6904 |
| 59011 | 59011 | deep | 14.241690 | 6889 |
| 146796 | 146796 | tap | 13.618348 | 6878 |
| 122472 | 122472 | poured | 13.585611 | 6769 |
| 142930 | 142930 | strong | 14.024537 | 6643 |
| 26720 | 26720 | all | 13.653858 | 6636 |
| 86792 | 86792 | hops, | 13.856798 | 6627 |
| 52552 | 52552 | coffee | 14.676586 | 6620 |
| 76183 | 76183 | full | 14.875551 | 6573 |
| 148650 | 148650 | thanks | 14.364516 | 6510 |
| 112957 | 112957 | off | 13.485213 | 6492 |
| 72613 | 72613 | flavors | 14.252025 | 6420 |
| 73089 | 73089 | floral | 13.862421 | 6360 |
| 58841 | 58841 | decent | 13.161631 | 6255 |
| 80396 | 80396 | great | 15.095230 | 6206 |
| 49772 | 49772 | chocolate, | 15.006938 | 6198 |
| 46520 | 46520 | carbonation | 13.379681 | 6142 |
| 27023 | 27023 | almost | 13.432958 | 6123 |
| 140221 | 140221 | spicy | 13.992455 | 6097 |
| 39186 | 39186 | body. | 13.241162 | 6025 |
| 95061 | 95061 | lacing. | 13.933050 | 5885 |
| 115474 | 115474 | out | 13.864741 | 5885 |
| 111288 | 111288 | notes. | 13.252891 | 5880 |
| 139069 | 139069 | sour | 13.625856 | 5840 |
| 116823 | 116823 | palate | 13.731880 | 5781 |
| 149136 | 149136 | thick | 14.581053 | 5774 |
| 164131 | 164131 | you | 13.451472 | 5739 |
| 122396 | 122396 | pour | 13.908388 | 5687 |
| 114544 | 114544 | or | 13.060817 | 5607 |
| 155607 | 155607 | updated: | 14.085699 | 5566 |
| 138108 | 138108 | so | 13.434869 | 5512 |
| 29811 | 29811 | aroma, | 13.159846 | 5449 |
| 27175 | 27175 | also | 13.554172 | 5381 |
| 101387 | 101387 | malts, | 13.882528 | 5363 |
| 31741 | 31741 | average | 12.630856 | 5315 |
| 129208 | 129208 | rich | 15.086973 | 5197 |
| 145613 | 145613 | sweetness | 13.762656 | 5195 |
| 72390 | 72390 | flavor. | 13.148306 | 5165 |
| 162647 | 162647 | would | 13.489411 | 5147 |
| 35536 | 35536 | beige | 13.784184 | 5134 |
| 82798 | 82798 | head, | 13.087473 | 5133 |
| 85189 | 85189 | hints | 13.777712 | 5079 |
| 163330 | 163330 | yeast | 13.754000 | 5000 |
| 138370 | 138370 | soft | 13.829122 | 4986 |
| 103132 | 103132 | me | 13.453575 | 4965 |
| 127058 | 127058 | red | 13.999385 | 4878 |
| 101031 | 101031 | malt. | 13.419189 | 4857 |
| 107175 | 107175 | mostly | 12.826060 | 4789 |
| 141901 | 141901 | still | 13.993242 | 4735 |
| 51887 | 51887 | cloudy | 13.662056 | 4699 |
| 107460 | 107460 | mouthfeel | 14.115672 | 4677 |
| 55211 | 55211 | copper | 13.671597 | 4665 |
| 64989 | 64989 | earthy | 13.888191 | 4615 |
| 37699 | 37699 | bitterness. | 13.345585 | 4598 |
| 39142 | 39142 | body, | 13.120541 | 4513 |
| 69532 | 69532 | fairly | 13.391995 | 4472 |
| 160299 | 160299 | what | 13.128217 | 4313 |
| 78657 | 78657 | gold | 12.690288 | 4304 |
| 160627 | 160627 | which | 13.716495 | 4268 |
| 88438 | 88438 | if | 12.988051 | 4268 |
| 101571 | 101571 | malty, | 12.852782 | 4259 |
| 90414 | 90414 | into | 13.997877 | 4240 |
| 91235 | 91235 | it. | 13.330701 | 4179 |
| 159886 | 159886 | well. | 14.203650 | 4110 |
| 147569 | 147569 | tastes | 12.829526 | 4071 |
| 39064 | 39064 | bodied | 13.725253 | 4051 |
| 147086 | 147086 | tart | 14.334900 | 4043 |
| 30284 | 30284 | aromas | 13.767154 | 4037 |
| 99124 | 99124 | long | 14.158210 | 4001 |
| 150850 | 150850 | toasted | 13.797226 | 3965 |
| 95981 | 95981 | lasting | 13.619108 | 3946 |
| 53206 | 53206 | color. | 13.207562 | 3941 |
| 74725 | 74725 | fresh | 14.431298 | 3930 |
| 80176 | 80176 | grassy | 12.579619 | 3837 |
| 160367 | 160367 | wheat | 12.803390 | 3835 |
| 83719 | 83719 | heavy | 14.096791 | 3833 |
| 32776 | 32776 | balanced | 14.709516 | 3804 |
| 138768 | 138768 | somewhat | 13.084395 | 3768 |
| 21967 | 21967 | : | 12.685775 | 3768 |
| 72321 | 72321 | flavor, | 13.207923 | 3761 |
| 56872 | 56872 | crisp | 13.542400 | 3750 |
| 36423 | 36423 | better | 13.078177 | 3620 |
| 45340 | 45340 | can | 12.682158 | 3615 |
Word frequencies (optional)
Compute the frequency of the most used words in the texts.
Methods you'll need:
Bonus: plot a histogram of the values, on a log scale, using:
Is it a power-law distribution?
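df_word_frequencies = (
    (df.text)
    .str.lower()
    # pandas >= 2.0 matches patterns literally unless regex=True is passed
    .str.replace(r"[^a-z\ ]", "", regex=True)  # keep only letters and spaces
    .str.replace(r"\ +", " ", regex=True)  # collapse runs of spaces
    .str.split(" ") # LINE TO BE REMOVED FOR STUDENTS
    .explode()  # one row per word
    .loc[lambda x: x != ""]  # drop empty tokens
    .value_counts(normalize=True)  # relative frequency, sorted descending
    .rename("word_frequency")
    .rename_axis(index="word")
    .reset_index()
    .assign(rank=lambda df: range(1, 1 + df.shape[0]))  # rank 1 = most frequent
)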
df_word_frequencies
| | word | word_frequency | rank |
|---|---|---|---|
| 0 | a | 4.38e-02 | 1 |
| 1 | and | 3.63e-02 | 2 |
| 2 | the | 3.13e-02 | 3 |
| 3 | with | 2.67e-02 | 4 |
| 4 | of | 2.35e-02 | 5 |
| ... | ... | ... | ... |
| 269396 | vapeur-like | 9.51e-08 | 269397 |
| 269397 | tongue-cleaning. | 9.51e-08 | 269398 |
| 269398 | sidra | 9.51e-08 | 269399 |
| 269399 | virtuous | 9.51e-08 | 269400 |
| 269400 | fruit-oriented | 9.51e-08 | 269401 |
269401 rows × 3 columns
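Tokens such as `tongue-cleaning.` in the table above still carry punctuation, which suggests the cleaning patterns were not applied as regular expressions when that output was produced. In pandas 2.x, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed. A minimal sketch of the difference, using made-up review snippets (the `texts` series is purely illustrative and not taken from the dataset):

```python
import pandas as pd

# Made-up review snippets, used only to illustrate the behaviour.
texts = pd.Series(["Tongue-cleaning.  Malty, sweet!", "Hazy gold; crisp"])

# regex=False: the pattern is searched for as a literal substring,
# which never occurs here, so only the lower-casing has an effect.
literal = texts.str.lower().str.replace(r"[^a-z\ ]", "", regex=False)

# regex=True: the character class is interpreted, punctuation is removed,
# and runs of spaces are then collapsed into a single space.
cleaned = (
    texts.str.lower()
    .str.replace(r"[^a-z\ ]", "", regex=True)
    .str.replace(r"\ +", " ", regex=True)
)

print(literal.tolist())
print(cleaned.tolist())
```

Passing `regex` explicitly in both directions keeps the behaviour the same across pandas versions.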
df_word_frequencies.head(10000).plot(x="rank", y="word_frequency", marker=".")
<Axes: xlabel='rank'>
df_word_frequencies.head(10000).plot(
x="rank", y="word_frequency", loglog=True, marker="."
)
<Axes: xlabel='rank'>
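A power law `f = C * rank**(-s)` appears as a straight line on a log-log plot, so one rough check is to fit a line to `log(frequency)` against `log(rank)` and read the exponent off the slope. The sketch below uses synthetic Zipf-like frequencies (an assumption for illustration, not the review data) and `numpy.polyfit`:

```python
import numpy as np

# Synthetic Zipf-like frequencies (illustrative assumption):
# frequency proportional to 1 / rank, normalised to sum to 1.
rank = np.arange(1, 10_001).astype(float)
freq = rank ** -1.0
freq /= freq.sum()

# log f = log C - s * log rank, so a degree-1 least-squares fit
# in log-log space recovers -s as the slope.
slope, intercept = np.polyfit(np.log(rank), np.log(freq), deg=1)
exponent = -slope

print(round(exponent, 3))
```

On the real data, the same fit could be applied to the `rank` and `word_frequency` columns of `df_word_frequencies`. A least-squares fit in log-log space is only a quick diagnostic: a roughly linear curve is consistent with a power law but does not prove one.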
sns.displot(
df.loc[lambda df: df.alcohol < 20],
x="alcohol",
hue="rating",
multiple="fill",
palette="RdYlGn",
bins=20,
)
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
/home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<seaborn.axisgrid.FacetGrid at 0x7fac079a4710>
sns.displot(
df.loc[lambda df: df.alcohol < 20],
x="alcohol",
y="rating",
bins=20,
)
<seaborn.axisgrid.FacetGrid at 0x7fabeba2c290>
Categorical variable
sns.displot(
df,
x="rating_appearance",
discrete=True,
hue="rating",
multiple="stack",
palette="RdYlGn",
)
<seaborn.axisgrid.FacetGrid at 0x7fabddd06990>
sns.displot(
df,
x="rating_appearance",
discrete=True,
hue="rating",
multiple="fill",
palette="RdYlGn",
)
<seaborn.axisgrid.FacetGrid at 0x7fabddd0bed0>
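With `multiple="fill"`, each bar is rescaled to height 1, so the plot shows the share of each `rating` within a `rating_appearance` level rather than raw counts. The same proportions can be tabulated with `pd.crosstab` and `normalize="index"`. A minimal sketch on toy data (the values are made up; only the column names mirror the plots above):

```python
import pandas as pd

# Toy data (made up); column names mirror those used in the plots.
toy = pd.DataFrame(
    {
        "rating_appearance": [1, 1, 2, 2, 2, 3],
        "rating": [0, 1, 1, 1, 2, 2],
    }
)

# normalize="index" divides each row by its total, so every
# rating_appearance level sums to 1, like one bar of the "fill" plot.
proportions = pd.crosstab(
    toy["rating_appearance"], toy["rating"], normalize="index"
)

print(proportions)
```

Reading the table alongside the stacked (`multiple="stack"`) version helps keep in mind how many observations sit behind each proportion.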
sns.violinplot(
df,
x="rating_appearance",
y="rating",
)
<Axes: xlabel='rating_appearance', ylabel='rating'>