🐍 Practice n°1: data engineering¶
The objective of this session is to learn about the basics of data engineering. You will have to explore the Ratebeer dataset using sql and python.
This dataset consists of beer reviews from ratebeer. The data span a period of more than 10 years, including all ~3 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review. We also have reviews from beeradvocate.
source ratebeer dataset description
To avoid high compute time, we are going to work with a sample during the session. Also, the data is already cleaned.
Here are the main steps of the notebook :
- Preparation
- Data engineering in sql with duckdb
- Data engineering in python with pandas
This is a data engineering tutorial in Python/pandas, it assumes you have already some some knowledge of data engineering in SQL.
Preparation¶
Install & import modules¶
!pip install pyarrow
Requirement already satisfied: pyarrow in /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages (13.0.0) Requirement already satisfied: numpy>=1.16.6 in /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages (from pyarrow) (1.26.0)
import pandas as pd
pd.set_option("display.max_columns", 100)
Preparation¶
Get some doc¶
Read data¶
The data is in this git repository: ML-boot-camp/ratebeer.git.
The data is located in the ratebeer/data/ folder.
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_clean.parquet"
Load the file ratebeer_sample_clean.parquet to extract a pandas DataFrame and
assign
it the variable df_raw.
Hint:
df_raw = pd.read_parquet(file_url)
df_raw.shape
(200000, 12)
Overview¶
Get a few information about the content of the dataframe:
- number of null values per column
- data type of each column
- memory usage
Methods you'll need:
df_raw.isnull().sum()
beer 0 brewery 0 alcohol 0 type 0 rating_appearance 0 rating_aroma 0 rating_palate 0 rating_taste 0 rating 0 timestamp 0 user 0 text 0 dtype: int64
df_raw.dtypes
beer object brewery object alcohol float64 type object rating_appearance int64 rating_aroma int64 rating_palate int64 rating_taste int64 rating int64 timestamp object user object text object dtype: object
df_raw.info(memory_usage="deep") # LINE TO BE REMOVED FOR STUDENTS
<class 'pandas.core.frame.DataFrame'> Index: 200000 entries, 2747119 to 2130618 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 beer 200000 non-null object 1 brewery 200000 non-null object 2 alcohol 200000 non-null float64 3 type 200000 non-null object 4 rating_appearance 200000 non-null int64 5 rating_aroma 200000 non-null int64 6 rating_palate 200000 non-null int64 7 rating_taste 200000 non-null int64 8 rating 200000 non-null int64 9 timestamp 200000 non-null object 10 user 200000 non-null object 11 text 200000 non-null object dtypes: float64(1), int64(5), object(6) memory usage: 146.1 MB
df_raw.head(5)
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2747119 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... |
| 680733 | De Dolle Arabier | 1163 | 8.00 | Belgian Strong Ale | 4 | 9 | 4 | 8 | 18 | 1291939200 | AndreaDel | Bottle. Orange color, big head. Aroma is hoppy... |
| 2797108 | Grafensteiner Pils | 2545 | 4.70 | Classic German Pilsener | 3 | 7 | 4 | 5 | 9 | 1306108800 | Koelschtrinker | Die Frage ist, ob ich das selbe Bier wie die a... |
| 824617 | New Glarus Raspberry Tart | 1248 | 4.00 | Fruit Beer | 4 | 8 | 5 | 8 | 16 | 1211760000 | polomagnifico | Bottle thanks to nflmvp, thanks Steve! Aroma ... |
| 2730732 | Nils Oscar Hop Yard IPA | 1086 | 7.30 | India Pale Ale (IPA) | 4 | 7 | 4 | 8 | 15 | 1275004800 | dEnk | Bottle, tfs Vaiz! Brownish amber, small head, ... |
Describing statistics¶
Compute statistics to understand the content of each column.
Methods you'll need:
Bonus: fill NaN values with an empty string "" for a better readability using:
df_raw.describe(include="all").fillna("")
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200000 | 200000 | 200000.0 | 200000 | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000 | 200000 | 200000 |
| unique | 42125 | 5413 | 89 | 4105 | 11251 | 199261 | ||||||
| top | Guinness Draught | 32 | India Pale Ale (IPA) | 1188604800 | fonefan | |||||||
| freq | 252 | 3217 | 12698 | 392 | 1146 | 334 | ||||||
| mean | 6.276042 | 3.430205 | 6.3512 | 3.258225 | 6.44773 | 13.19231 | ||||||
| std | 2.781795 | 0.813961 | 1.643552 | 0.829915 | 1.623235 | 3.350988 | ||||||
| min | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ||||||
| 25% | 5.0 | 3.0 | 6.0 | 3.0 | 6.0 | 12.0 | ||||||
| 50% | 5.8 | 3.0 | 7.0 | 3.0 | 7.0 | 14.0 | ||||||
| 75% | 8.0 | 4.0 | 7.0 | 4.0 | 8.0 | 15.0 | ||||||
| max | 57.7 | 5.0 | 10.0 | 5.0 | 10.0 | 20.0 |
Sometimes you only need the describing statistics for a single column. Count and display the distinct beer names.
Hint:
(df_raw.beer).nunique() # LINE TO BE REMOVED FOR STUDENTS
42125
(df_raw.beer).unique() # LINE TO BE REMOVED FOR STUDENTS
array(['Breckenridge Oatmeal Stout', 'De Dolle Arabier',
'Grafensteiner Pils', ..., 'Rockbottom Wheat Beer',
'Dow Bridge Adventus', 'Beach Chalet Third Eye Rye'], dtype=object)
(df_raw.type).value_counts() # LINE TO BE REMOVED FOR STUDENTS
type
India Pale Ale (IPA) 12698
Pale Lager 10062
Belgian Strong Ale 8486
Imperial Stout 8460
Imperial/Double IPA 7215
...
Sak - Taru 9
Sak - Genshu 6
Sak - Honjozo 5
Sak - Tokubetsu 5
Sak - Koshu 3
Name: count, Length: 89, dtype: int64
Select data¶
Create the following dataframe :
- Keep only those columns:
beeralcoholtypeuserrating
- Keep only rows for which the
typecolumn contains the string"Stout"
Hint:
selected_columns = [
"beer",
"alcohol",
"type",
"user",
"rating",
]
df_stout = (
(df_raw)
.loc[:, selected_columns]
.loc[lambda df: df.type.str.contains("Stout")] # LINE TO BE REMOVED FOR STUDENTS
.reset_index(drop=True)
)
df_stout
| beer | alcohol | type | user | rating | |
|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 4.95 | Stout | blutt59 | 14 |
| 1 | Founders Breakfast Stout | 8.30 | Imperial Stout | beerandIT | 20 |
| 2 | Thirsty Dog Siberian Night Imperial Stout | 9.70 | Imperial Stout | Lehighbri | 14 |
| 3 | St. Ambroise Oatmeal Stout | 5.00 | Stout | SimonB | 18 |
| 4 | Ithaca Excelsior! Eleven | 8.50 | Stout | vtafro | 14 |
| ... | ... | ... | ... | ... | ... |
| 18114 | St Peters Cream Stout | 6.50 | Stout | Quietman | 14 |
| 18115 | Weyerbacher Old Heathen | 8.00 | Imperial Stout | cking | 16 |
| 18116 | Pisgah Valdez | 6.80 | Stout | CaptainCougar | 15 |
| 18117 | De Dolle Extra Export Stout | 9.00 | Foreign Stout | bierkoning | 16 |
| 18118 | Goose Island Honest Stout | 5.70 | Stout | Mangino | 14 |
18119 rows × 5 columns
Compute the number of occurences of each Stout beers.
df_stout.type.value_counts()
type Imperial Stout 8460 Stout 4539 Sweet Stout 2762 Dry Stout 1539 Foreign Stout 819 Name: count, dtype: int64
Feature engineering¶
High cardinality variables¶
beerbreweryuser
All those high cardinality variables can be thought as links of a network. Indeed, a review is an object comprising a beer, a brewery and a user and can be thought as a network link between them.
In other words, the review table is the a table describing the links in a network with 3 types of nodes: users, beers and breweries.
The first property to compute about each node is its "degree", which is its number of connections with other nodes. High degree means "highly connected".
To compute the degree you'll need:
df_beer_degree = (
(df_raw.beer)
.value_counts() # LINE TO BE REMOVED FOR STUDENTS
.rename("beer_degree")
.reset_index()
)
df_beer_degree
| beer | beer_degree | |
|---|---|---|
| 0 | Guinness Draught | 252 |
| 1 | Pabst Blue Ribbon | 245 |
| 2 | Dogfish Head 90 Minute Imperial IPA | 232 |
| 3 | Budweiser | 217 |
| 4 | Sierra Nevada Pale Ale (Bottle) | 207 |
| ... | ... | ... |
| 42120 | Meinel-Bru Zwickel | 1 |
| 42121 | 3 Rivers Murphys Law | 1 |
| 42122 | Skagit River Golden Lager | 1 |
| 42123 | Allsvensk | 1 |
| 42124 | Brew Wharf & The Kernel Collaboration | 1 |
42125 rows × 2 columns
Check that this table will merge properly.
df_tmp = df_raw.merge(
df_beer_degree,
on="beer",
how="outer",
validate="m:1",
indicator=True,
)
df_tmp._merge.value_counts()
_merge both 200000 left_only 0 right_only 0 Name: count, dtype: int64
df_brewery_degree = (
(df_raw.brewery)
.value_counts() # LINE TO BE REMOVED FOR STUDENTS
.rename("brewery_degree")
.reset_index()
)
df_brewery_degree
| brewery | brewery_degree | |
|---|---|---|
| 0 | 32 | 3217 |
| 1 | 76 | 2648 |
| 2 | 198 | 2576 |
| 3 | 96 | 2401 |
| 4 | 84 | 2266 |
| ... | ... | ... |
| 5408 | 5407 | 1 |
| 5409 | 8930 | 1 |
| 5410 | 12924 | 1 |
| 5411 | 7842 | 1 |
| 5412 | 8639 | 1 |
5413 rows × 2 columns
Check that this table will merge properly.
df_tmp = df_raw.merge(
df_brewery_degree,
on="brewery",
how="outer",
validate="m:1",
indicator=True,
)
df_tmp._merge.value_counts()
_merge both 200000 left_only 0 right_only 0 Name: count, dtype: int64
df_user_degree = (
(df_raw.user)
.value_counts() # LINE TO BE REMOVED FOR STUDENTS
.rename("user_degree")
.reset_index()
)
df_user_degree
| user | user_degree | |
|---|---|---|
| 0 | fonefan | 1146 |
| 1 | Ungstrup | 1063 |
| 2 | Papsoe | 927 |
| 3 | yespr | 889 |
| 4 | oh6gdx | 701 |
| ... | ... | ... |
| 11246 | BattleandBrew | 1 |
| 11247 | Borgen | 1 |
| 11248 | drainey99 | 1 |
| 11249 | biermeister99 | 1 |
| 11250 | rob51 | 1 |
11251 rows × 2 columns
Check that this table will merge properly.
df_tmp = df_raw.merge(
df_user_degree,
on="user",
how="outer",
validate="m:1",
indicator=True,
)
df_tmp._merge.value_counts()
_merge both 200000 left_only 0 right_only 0 Name: count, dtype: int64
We'll then merge the 3 dataframe at once.
text_length = df_raw.text.str.len()
text_length
2747119 90
680733 140
2797108 324
824617 371
2730732 166
...
1563335 222
2791415 157
449553 198
1666173 150
2130618 243
Name: text, Length: 200000, dtype: int64
Convert timestamp¶
date = (df_raw.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
date
2747119 2008-07-31
680733 2010-12-10
2797108 2011-05-23
824617 2008-05-26
2730732 2010-05-28
...
1563335 2009-09-05
2791415 2009-05-23
449553 2010-08-04
1666173 2009-05-23
2130618 2006-10-05
Name: timestamp, Length: 200000, dtype: datetime64[ns]
Binary target¶
The prediction problem is to predict rating based on other information. Since
rating is a numeric variable, it is a regression problem. We'd like also to do a
classification so we'll create a binary target based on the rating.
The ìs_good column is True if the rating is above or equal to 16 (expert judgement),
False otherwise.
Note: also convert the binary target to integer (O or 1) for better readability.
Methods you'll need:
is_good = (df_raw.rating >= 16).astype(int)
is_good
2747119 0
680733 1
2797108 0
824617 1
2730732 0
..
1563335 0
2791415 0
449553 0
1666173 0
2130618 0
Name: rating, Length: 200000, dtype: int64
What are the values of this binary target ?
is_good.value_counts()
rating 0 151269 1 48731 Name: count, dtype: int64
Combine dataframes¶
Create a dataframe combining information from:
df_raw: the original datasetdf_beer_degree: merged onbeercolumndf_brewery_degree: merged onbrewerycolumndf_user_degree: merged onusercolumntext_length: added as a new columndate: added as a new columnis_good: added as a new column
Note: merge is the equivalent of JOIN in SQL, and it changes the order of the rows
! So to add our data columns properly in the dataset, we have 2 options:
- add the new columns using precomputed arrays, but before merging (not recommended):
e.g:
df_raw.text.str.len() - add the new columns using a function (recommended):
e.g:
lambda df: df.text.str.len()
Hint:
Note: If some columns in both the left & right dataframes have the same name, you'll
obtain duplicated columns in the merge result. pandas adds the suffixes _x
and _y to avoid creating duplicate columns.
Use the suffixes argument to specify the suffixes to apply to duplicated
columns. In this example, there is no common column name in both dataframes.
We made lots of transformation to our datasets: we want to verify that all
values in the "primary keys" columns are indeed unique. Use the validate
argument to do so.
df_main = (
(df_raw)
.merge(
df_beer_degree,
on="beer", # LINE TO BE REMOVED FOR STUDENTS
how="inner",
validate="m:1",
)
.merge(
df_brewery_degree,
on="brewery", # LINE TO BE REMOVED FOR STUDENTS
how="inner",
validate="m:1",
)
.merge(
df_user_degree,
on="user", # LINE TO BE REMOVED FOR STUDENTS
how="inner",
validate="m:1",
)
.assign(text_length=lambda df: df.text.str.len())
.assign(
date=lambda df: (df.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
)
.assign(is_good=lambda df: (df.rating >= 16).astype(int))
)
df_main
| beer | brewery | alcohol | type | rating_appearance | rating_aroma | rating_palate | rating_taste | rating | timestamp | user | text | beer_degree | brewery_degree | user_degree | text_length | date | is_good | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breckenridge Oatmeal Stout | 383 | 4.95 | Stout | 4 | 7 | 4 | 7 | 14 | 1217462400 | blutt59 | bottle, oat nose with black color, bitter choc... | 51 | 413 | 300 | 90 | 2008-07-31 | 0 |
| 1 | Breckenridge 471 Small Batch Imperial Porter | 383 | 7.50 | Imperial/Strong Porter | 3 | 8 | 3 | 8 | 14 | 1312588800 | blutt59 | bottle, received in trade, dark brown with gar... | 4 | 413 | 300 | 191 | 2011-08-06 | 0 |
| 2 | Breckenridge Avalanche Amber | 383 | 5.41 | Amber Ale | 3 | 5 | 3 | 5 | 10 | 1205020800 | blutt59 | 12 oz. bottle, amber color with soapy head, sl... | 43 | 413 | 300 | 109 | 2008-03-09 | 0 |
| 3 | Breckenridge Lucky U IPA | 383 | 6.20 | India Pale Ale (IPA) | 3 | 6 | 3 | 7 | 12 | 1255737600 | blutt59 | bottle, golden orange color with light tan foa... | 20 | 413 | 300 | 151 | 2009-10-17 | 0 |
| 4 | Fullers Vintage Ale 2009 | 55 | 8.50 | English Strong Ale | 3 | 7 | 3 | 8 | 14 | 1282003200 | blutt59 | bottle, thanks to SS, almond amber colored pou... | 18 | 978 | 300 | 159 | 2010-08-17 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199995 | Die M Dunkel | 6587 | -1.00 | Dunkel | 4 | 5 | 3 | 3 | 8 | 1204070400 | kuleko | Bottle, supermarket. Very dark colour. Sweet c... | 2 | 2 | 1 | 163 | 2008-02-27 | 0 |
| 199996 | Montt Hafen Porter | 8240 | 6.00 | Porter | 5 | 10 | 5 | 10 | 19 | 1175040000 | Andressantis | Sabrosa,con amargor justo y ligeramente dulce.... | 1 | 1 | 1 | 97 | 2007-03-28 | 1 |
| 199997 | Oxymore | 8022 | 5.00 | Saison | 4 | 8 | 3 | 6 | 12 | 1181952000 | DesmondLangham | Very earthy nose.\tClear and golden, head kept... | 1 | 1 | 1 | 192 | 2007-06-16 | 0 |
| 199998 | Stonehouse Station Bitter | 8561 | 3.90 | Bitter | 3 | 5 | 4 | 6 | 13 | 1238716800 | thedees | Cask at the Three Pigeons Nescliffe. Maly swee... | 1 | 2 | 1 | 88 | 2009-04-03 | 0 |
| 199999 | Pilker Negra | 6687 | 5.00 | Stout | 4 | 6 | 2 | 5 | 5 | 1142553600 | apeters | Roasty aroma.Black color.Dry and bitter flavor... | 1 | 1 | 1 | 144 | 2006-03-17 | 0 |
200000 rows × 18 columns
Save the final result to a parquet file named ratebeer_sample_enriched.parquet.
Hint:
# Uncomment the line below to save the dataset to disk
df_main.to_parquet("ratebeer_sample_enriched.parquet")
GOOD JOB 👍
