Download Open in colab

🐍 Practice n°1: data engineering¶

The objective of this session is to learn about the basics of data engineering. You will have to explore the Ratebeer dataset using sql and python.

This dataset consists of beer reviews from ratebeer. The data span a period of more than 10 years, including all ~3 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review. We also have reviews from beeradvocate.

source ratebeer dataset description

To avoid high compute time, we are going to work with a sample during the session. Also, the data is already cleaned.

Here are the main steps of the notebook :

Preparation
Data engineering in sql with duckdb
Data engineering in python with pandas

This is a data engineering tutorial in Python/pandas, it assumes you have already some some knowledge of data engineering in SQL.

Preparation¶

Install & import modules¶

In [1]:

Copied!

!pip install pyarrow
!pip install pyarrow

Requirement already satisfied: pyarrow in /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages (13.0.0)
Requirement already satisfied: numpy>=1.16.6 in /home/runner/micromamba/envs/ml-bootcamp/lib/python3.11/site-packages (from pyarrow) (1.26.0)

In [2]:

Copied!

import pandas as pd

pd.set_option("display.max_columns", 100)
import pandas as pd

pd.set_option("display.max_columns", 100)

Preparation¶

Get some doc¶

Read data¶

The data is in this git repository: ML-boot-camp/ratebeer.git.

The data is located in the ratebeer/data/ folder.

In [3]:

Copied!

file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_clean.parquet"
file_url = "https://github.com/ML-boot-camp/ratebeer/raw/master/data/ratebeer_sample_clean.parquet"

Load the file ratebeer_sample_clean.parquet to extract a pandas DataFrame and assign it the variable df_raw. Hint:

pd.read_parquet

In [4]:

Copied!

df_raw = pd.read_parquet(file_url)
df_raw = pd.read_parquet(file_url)

General information¶

Shape¶

Have a first overview of the dataframe size, i.e. number of rows & columns.

Methods you'll need:

pd.DataFrame.shape

In [5]:

Copied!

df_raw.shape
df_raw.shape

Out[5]:

(200000, 12)

Overview¶

Get a few information about the content of the dataframe:

number of null values per column
data type of each column
memory usage

Methods you'll need:

In [6]:

Copied!

df_raw.isnull().sum()
df_raw.isnull().sum()

Out[6]:

beer                 0
brewery              0
alcohol              0
type                 0
rating_appearance    0
rating_aroma         0
rating_palate        0
rating_taste         0
rating               0
timestamp            0
user                 0
text                 0
dtype: int64

In [7]:

Copied!

df_raw.dtypes
df_raw.dtypes

Out[7]:

beer                  object
brewery               object
alcohol              float64
type                  object
rating_appearance      int64
rating_aroma           int64
rating_palate          int64
rating_taste           int64
rating                 int64
timestamp             object
user                  object
text                  object
dtype: object

In [8]:

Copied!

df_raw.info(memory_usage="deep")  # LINE TO BE REMOVED FOR STUDENTS
df_raw.info(memory_usage="deep")  # LINE TO BE REMOVED FOR STUDENTS

<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, 2747119 to 2130618
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   beer               200000 non-null  object 
 1   brewery            200000 non-null  object 
 2   alcohol            200000 non-null  float64
 3   type               200000 non-null  object 
 4   rating_appearance  200000 non-null  int64  
 5   rating_aroma       200000 non-null  int64  
 6   rating_palate      200000 non-null  int64  
 7   rating_taste       200000 non-null  int64  
 8   rating             200000 non-null  int64  
 9   timestamp          200000 non-null  object 
 10  user               200000 non-null  object 
 11  text               200000 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 146.1 MB

Sample¶

Show a sample of the data

Methods you'll need:

In [9]:

Copied!

df_raw.head(5)
df_raw.head(5)

Out[9]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text
2747119	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter choc...
680733	De Dolle Arabier	1163	8.00	Belgian Strong Ale	4	9	4	8	18	1291939200	AndreaDel	Bottle. Orange color, big head. Aroma is hoppy...
2797108	Grafensteiner Pils	2545	4.70	Classic German Pilsener	3	7	4	5	9	1306108800	Koelschtrinker	Die Frage ist, ob ich das selbe Bier wie die a...
824617	New Glarus Raspberry Tart	1248	4.00	Fruit Beer	4	8	5	8	16	1211760000	polomagnifico	Bottle thanks to nflmvp, thanks Steve! Aroma ...
2730732	Nils Oscar Hop Yard IPA	1086	7.30	India Pale Ale (IPA)	4	7	4	8	15	1275004800	dEnk	Bottle, tfs Vaiz! Brownish amber, small head, ...

Describing statistics¶

Compute statistics to understand the content of each column.

Methods you'll need:

pd.DataFrame.describe

Bonus: fill NaN values with an empty string "" for a better readability using:

pd.DataFrame.fillna

In [10]:

Copied!

df_raw.describe(include="all").fillna("")
df_raw.describe(include="all").fillna("")

Out[10]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text
count	200000	200000	200000.0	200000	200000.0	200000.0	200000.0	200000.0	200000.0	200000	200000	200000
unique	42125	5413		89						4105	11251	199261
top	Guinness Draught	32		India Pale Ale (IPA)						1188604800	fonefan
freq	252	3217		12698						392	1146	334
mean			6.276042		3.430205	6.3512	3.258225	6.44773	13.19231
std			2.781795		0.813961	1.643552	0.829915	1.623235	3.350988
min			-1.0		1.0	1.0	1.0	1.0	1.0
25%			5.0		3.0	6.0	3.0	6.0	12.0
50%			5.8		3.0	7.0	3.0	7.0	14.0
75%			8.0		4.0	7.0	4.0	8.0	15.0
max			57.7		5.0	10.0	5.0	10.0	20.0

Sometimes you only need the describing statistics for a single column. Count and display the distinct beer names.

Hint:

In [11]:

Copied!

(df_raw.beer).nunique()  # LINE TO BE REMOVED FOR STUDENTS
(df_raw.beer).nunique()  # LINE TO BE REMOVED FOR STUDENTS

Out[11]:

In [12]:

Copied!

(df_raw.beer).unique()  # LINE TO BE REMOVED FOR STUDENTS
(df_raw.beer).unique()  # LINE TO BE REMOVED FOR STUDENTS

Out[12]:

array(['Breckenridge Oatmeal Stout', 'De Dolle Arabier',
       'Grafensteiner Pils', ..., 'Rockbottom Wheat Beer',
       'Dow Bridge Adventus', 'Beach Chalet Third Eye Rye'], dtype=object)

Display the number of reviews per beer type.

Hint:

pd.Series.value_counts

In [13]:

Copied!

(df_raw.type).value_counts()  # LINE TO BE REMOVED FOR STUDENTS
(df_raw.type).value_counts()  # LINE TO BE REMOVED FOR STUDENTS

Out[13]:

type
India Pale Ale (IPA)    12698
Pale Lager              10062
Belgian Strong Ale       8486
Imperial Stout           8460
Imperial/Double IPA      7215
                        ...  
Sak - Taru                  9
Sak - Genshu                6
Sak - Honjozo               5
Sak - Tokubetsu             5
Sak - Koshu                 3
Name: count, Length: 89, dtype: int64

Select data¶

Create the following dataframe :

Keep only those columns:
- beer
- alcohol
- type
- user
- rating
Keep only rows for which the type column contains the string "Stout"

Hint:

In [14]:

Copied!





selected_columns = [
    "beer",
    "alcohol",
    "type",
    "user",
    "rating",
]

df_stout = (
    (df_raw)
    .loc[:, selected_columns]
    .loc[lambda df: df.type.str.contains("Stout")]  # LINE TO BE REMOVED FOR STUDENTS
    .reset_index(drop=True)
)
df_stout
selected_columns = [
    "beer",
    "alcohol",
    "type",
    "user",
    "rating",
]

df_stout = (
    (df_raw)
    .loc[:, selected_columns]
    .loc[lambda df: df.type.str.contains("Stout")]  # LINE TO BE REMOVED FOR STUDENTS
    .reset_index(drop=True)
)
df_stout

Out[14]:

	beer	alcohol	type	user	rating
0	Breckenridge Oatmeal Stout	4.95	Stout	blutt59	14
1	Founders Breakfast Stout	8.30	Imperial Stout	beerandIT	20
2	Thirsty Dog Siberian Night Imperial Stout	9.70	Imperial Stout	Lehighbri	14
3	St. Ambroise Oatmeal Stout	5.00	Stout	SimonB	18
4	Ithaca Excelsior! Eleven	8.50	Stout	vtafro	14
...	...	...	...	...	...
18114	St Peters Cream Stout	6.50	Stout	Quietman	14
18115	Weyerbacher Old Heathen	8.00	Imperial Stout	cking	16
18116	Pisgah Valdez	6.80	Stout	CaptainCougar	15
18117	De Dolle Extra Export Stout	9.00	Foreign Stout	bierkoning	16
18118	Goose Island Honest Stout	5.70	Stout	Mangino	14

18119 rows × 5 columns

Compute the number of occurences of each Stout beers.

In [15]:

Copied!

df_stout.type.value_counts()
df_stout.type.value_counts()

Out[15]:

type
Imperial Stout    8460
Stout             4539
Sweet Stout       2762
Dry Stout         1539
Foreign Stout      819
Name: count, dtype: int64

Feature engineering¶

High cardinality variables¶

beer
brewery
user

All those high cardinality variables can be thought as links of a network. Indeed, a review is an object comprising a beer, a brewery and a user and can be thought as a network link between them.

In other words, the review table is the a table describing the links in a network with 3 types of nodes: users, beers and breweries.

The first property to compute about each node is its "degree", which is its number of connections with other nodes. High degree means "highly connected".

To compute the degree you'll need:

pd.Series.value_counts

In [16]:

Copied!





df_beer_degree = (
    (df_raw.beer)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("beer_degree")
    .reset_index()
)
df_beer_degree
df_beer_degree = (
    (df_raw.beer)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("beer_degree")
    .reset_index()
)
df_beer_degree

Out[16]:

	beer	beer_degree
0	Guinness Draught	252
1	Pabst Blue Ribbon	245
2	Dogfish Head 90 Minute Imperial IPA	232
3	Budweiser	217
4	Sierra Nevada Pale Ale (Bottle)	207
...	...	...
42120	Meinel-Bru Zwickel	1
42121	3 Rivers Murphys Law	1
42122	Skagit River Golden Lager	1
42123	Allsvensk	1
42124	Brew Wharf & The Kernel Collaboration	1

42125 rows × 2 columns

Check that this table will merge properly.

In [17]:

Copied!





df_tmp = df_raw.merge(
    df_beer_degree,
    on="beer",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()
df_tmp = df_raw.merge(
    df_beer_degree,
    on="beer",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()

Out[17]:

_merge
both          200000
left_only          0
right_only         0
Name: count, dtype: int64

In [18]:

Copied!





df_brewery_degree = (
    (df_raw.brewery)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("brewery_degree")
    .reset_index()
)
df_brewery_degree
df_brewery_degree = (
    (df_raw.brewery)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("brewery_degree")
    .reset_index()
)
df_brewery_degree

Out[18]:

	brewery	brewery_degree
0	32	3217
1	76	2648
2	198	2576
3	96	2401
4	84	2266
...	...	...
5408	5407	1
5409	8930	1
5410	12924	1
5411	7842	1
5412	8639	1

5413 rows × 2 columns

Check that this table will merge properly.

In [19]:

Copied!





df_tmp = df_raw.merge(
    df_brewery_degree,
    on="brewery",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()
df_tmp = df_raw.merge(
    df_brewery_degree,
    on="brewery",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()

Out[19]:

_merge
both          200000
left_only          0
right_only         0
Name: count, dtype: int64

In [20]:

Copied!





df_user_degree = (
    (df_raw.user)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("user_degree")
    .reset_index()
)
df_user_degree
df_user_degree = (
    (df_raw.user)
    .value_counts()  # LINE TO BE REMOVED FOR STUDENTS
    .rename("user_degree")
    .reset_index()
)
df_user_degree

Out[20]:

	user	user_degree
0	fonefan	1146
1	Ungstrup	1063
2	Papsoe	927
3	yespr	889
4	oh6gdx	701
...	...	...
11246	BattleandBrew	1
11247	Borgen	1
11248	drainey99	1
11249	biermeister99	1
11250	rob51	1

11251 rows × 2 columns

Check that this table will merge properly.

In [21]:

Copied!





df_tmp = df_raw.merge(
    df_user_degree,
    on="user",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()
df_tmp = df_raw.merge(
    df_user_degree,
    on="user",
    how="outer",
    validate="m:1",
    indicator=True,
)
df_tmp._merge.value_counts()

Out[21]:

_merge
both          200000
left_only          0
right_only         0
Name: count, dtype: int64

We'll then merge the 3 dataframe at once.

Text length¶

Compute the length of the texts in the dataset.

Methods you'll need:

pd.Series.str.len

In [22]:

Copied!

text_length = df_raw.text.str.len()
text_length
text_length = df_raw.text.str.len()
text_length

Out[22]:

2747119     90
680733     140
2797108    324
824617     371
2730732    166
          ... 
1563335    222
2791415    157
449553     198
1666173    150
2130618    243
Name: text, Length: 200000, dtype: int64

Convert timestamp¶

In [23]:

Copied!

date = (df_raw.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
date
date = (df_raw.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
date

Out[23]:

2747119   2008-07-31
680733    2010-12-10
2797108   2011-05-23
824617    2008-05-26
2730732   2010-05-28
             ...    
1563335   2009-09-05
2791415   2009-05-23
449553    2010-08-04
1666173   2009-05-23
2130618   2006-10-05
Name: timestamp, Length: 200000, dtype: datetime64[ns]

Binary target¶

The prediction problem is to predict rating based on other information. Since rating is a numeric variable, it is a regression problem. We'd like also to do a classification so we'll create a binary target based on the rating.

The ìs_good column is True if the rating is above or equal to 16 (expert judgement), False otherwise.

Note: also convert the binary target to integer (O or 1) for better readability.

Methods you'll need:

pd.Series.astype

In [24]:

Copied!

is_good = (df_raw.rating >= 16).astype(int)
is_good
is_good = (df_raw.rating >= 16).astype(int)
is_good

Out[24]:

2747119    0
680733     1
2797108    0
824617     1
2730732    0
          ..
1563335    0
2791415    0
449553     0
1666173    0
2130618    0
Name: rating, Length: 200000, dtype: int64

What are the values of this binary target ?

In [25]:

Copied!

is_good.value_counts()
is_good.value_counts()

Out[25]:

rating
0    151269
1     48731
Name: count, dtype: int64

Combine dataframes¶

Create a dataframe combining information from:

df_raw: the original dataset
df_beer_degree: merged on beer column
df_brewery_degree: merged on brewery column
df_user_degree: merged on user column
text_length: added as a new column
date: added as a new column
is_good: added as a new column

Note: merge is the equivalent of JOIN in SQL, and it changes the order of the rows ! So to add our data columns properly in the dataset, we have 2 options:

add the new columns using precomputed arrays, but before merging (not recommended): e.g: df_raw.text.str.len()
add the new columns using a function (recommended): e.g: lambda df: df.text.str.len()

Hint:

Note: If some columns in both the left & right dataframes have the same name, you'll obtain duplicated columns in the merge result. pandas adds the suffixes _x and _y to avoid creating duplicate columns. Use the suffixes argument to specify the suffixes to apply to duplicated columns. In this example, there is no common column name in both dataframes.

We made lots of transformation to our datasets: we want to verify that all values in the "primary keys" columns are indeed unique. Use the validate argument to do so.

In [26]:

Copied!





df_main = (
    (df_raw)
    .merge(
        df_beer_degree,
        on="beer",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .merge(
        df_brewery_degree,
        on="brewery",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .merge(
        df_user_degree,
        on="user",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .assign(text_length=lambda df: df.text.str.len())
    .assign(
        date=lambda df: (df.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
    )
    .assign(is_good=lambda df: (df.rating >= 16).astype(int))
)
df_main
df_main = (
    (df_raw)
    .merge(
        df_beer_degree,
        on="beer",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .merge(
        df_brewery_degree,
        on="brewery",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .merge(
        df_user_degree,
        on="user",  # LINE TO BE REMOVED FOR STUDENTS
        how="inner",
        validate="m:1",
    )
    .assign(text_length=lambda df: df.text.str.len())
    .assign(
        date=lambda df: (df.timestamp).astype(int).apply(pd.Timestamp.fromtimestamp)
    )
    .assign(is_good=lambda df: (df.rating >= 16).astype(int))
)
df_main

Out[26]:

	beer	brewery	alcohol	type	rating_appearance	rating_aroma	rating_palate	rating_taste	rating	timestamp	user	text	beer_degree	brewery_degree	user_degree	text_length	date	is_good
0	Breckenridge Oatmeal Stout	383	4.95	Stout	4	7	4	7	14	1217462400	blutt59	bottle, oat nose with black color, bitter choc...	51	413	300	90	2008-07-31	0
1	Breckenridge 471 Small Batch Imperial Porter	383	7.50	Imperial/Strong Porter	3	8	3	8	14	1312588800	blutt59	bottle, received in trade, dark brown with gar...	4	413	300	191	2011-08-06	0
2	Breckenridge Avalanche Amber	383	5.41	Amber Ale	3	5	3	5	10	1205020800	blutt59	12 oz. bottle, amber color with soapy head, sl...	43	413	300	109	2008-03-09	0
3	Breckenridge Lucky U IPA	383	6.20	India Pale Ale (IPA)	3	6	3	7	12	1255737600	blutt59	bottle, golden orange color with light tan foa...	20	413	300	151	2009-10-17	0
4	Fullers Vintage Ale 2009	55	8.50	English Strong Ale	3	7	3	8	14	1282003200	blutt59	bottle, thanks to SS, almond amber colored pou...	18	978	300	159	2010-08-17	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
199995	Die M Dunkel	6587	-1.00	Dunkel	4	5	3	3	8	1204070400	kuleko	Bottle, supermarket. Very dark colour. Sweet c...	2	2	1	163	2008-02-27	0
199996	Montt Hafen Porter	8240	6.00	Porter	5	10	5	10	19	1175040000	Andressantis	Sabrosa,con amargor justo y ligeramente dulce....	1	1	1	97	2007-03-28	1
199997	Oxymore	8022	5.00	Saison	4	8	3	6	12	1181952000	DesmondLangham	Very earthy nose.\tClear and golden, head kept...	1	1	1	192	2007-06-16	0
199998	Stonehouse Station Bitter	8561	3.90	Bitter	3	5	4	6	13	1238716800	thedees	Cask at the Three Pigeons Nescliffe. Maly swee...	1	2	1	88	2009-04-03	0
199999	Pilker Negra	6687	5.00	Stout	4	6	2	5	5	1142553600	apeters	Roasty aroma.Black color.Dry and bitter flavor...	1	1	1	144	2006-03-17	0

200000 rows × 18 columns

Save the final result to a parquet file named ratebeer_sample_enriched.parquet.

Hint:

pd.DataFrame.to_parquet

In [27]:

Copied!

# Uncomment the line below to save the dataset to disk
df_main.to_parquet("ratebeer_sample_enriched.parquet")
# Uncomment the line below to save the dataset to disk
df_main.to_parquet("ratebeer_sample_enriched.parquet")

GOOD JOB 👍