Skip to content

Basic data engineering

Join, group by & aggregate

Join

Join type

A join is a database operation that combines rows from two tables based on a related column between them, allowing you to query data from both tables as if they were a single entity. Different types of joins determine which rows are included in the result, providing flexibility in how the tables are merged.

Join types
Join type Description Includes all data from May loose data from
Inner Combines matching rows from both tables None Both
Left Includes all rows from the left table Left Right
Right Includes all rows from the right table Right Left
Outer Combines all rows from both tables, matching where possible Both None

Join types are related to the concepts of inclusion and completeness:

  • Use outer join, if you don't want to loose data, but you may have incomplete data.
  • Use inner join, if you don't want to have incomplete data, but you may loose data.

Note

When exploring data using pandas, always start your joins (using the merge method) by:

  • doing an outer join (how="outer")
  • checking the result of the join in the _merge column (e.g: using the value_counts method)

E.g:

Python
result = (
    df1
    .merge(df2, on="key_column", how="outer")
    ._merge
    .value_counts()
)

This shows how many rows are coming from each dataframe (both, left_only or right_only).

Matching type

A matching type refers to the relationship between key columns in the tables being joined, such as one-to-one, one-to-many, or many-to-many. It defines how many records in one table correspond to records in another, guiding the structure of the join and affecting the final result.

Matching types
Matching Type Description Example
One-to-One (1:1) Each row in the first table corresponds to one row in the second table. Employee to Social Security Number
One-to-Many (1:m) Each row in the first table corresponds to multiple rows in the second table. Customer to Orders (One customer, many orders)
Many-to-Many (m:m) Multiple rows in the first table correspond to multiple rows in the second table. Products to Categories (Many products, many categories)

Matching types are related to the concept of cardinality.

In a one-to-one matching:

  • The cardinality of the inner join is the cardinality of the intersection \( | L \cap R | \).
  • The cardinality of the outer join is the cardinality of the union \( | L \cup R | \).

Whereas one-to-many, many-to-one or many-to-many matchings do duplicate rows in the join result and don't preserve cardinality.

Note

When exploring data using pandas, always use the validate argument specify the matching type of your joins to avoid unpleasant surprises. Like:

  • validate="1:1"
  • validate="1:m"
  • validate="m:1"
  • validate="m:m" (No effect for this one. Can you guess why?)

E.g:

Python
df1.merge(df2, on="key_column", how="inner", validate="1:1")

Key type

Key columns have 2 types:

  • Key columns that have unique values are called primary keys. In other words, a primary key uniquely identifies each row within a table.
  • Key columns in a table that are referring to a primary key in another table are called foreign keys. In other words, a foreign key links the data from one table to another (but its values are not expected to be unique !).

A join between 2 tables using:

  • 2 primary keys is a one-to-one matching
  • 1 primary key and 1 foreign key is a one-to-many matching
  • 2 foreign keys is a many-to-many matching

Examples

Let's use 2 example tables:

Customers table

customer_id name
1 Alice
2 Bob
3 Carol

Orders table

order_id customer_id total_price
101 1 100
102 2 200
103 4 150

Inner Join on customer_id: combines rows that are matching customer IDs in both data frames.

customer_id name order_id total_price
1 Alice 101 100
2 Bob 102 200

Left Join on customer_id: same tables as Inner Join, but includes all customer rows.

customer_id name order_id total_price
1 Alice 101 100
2 Bob 102 200
3 Carol NaN NaN

Right Join on customer_id: includes all order rows and matched customer rows.

customer_id name order_id total_price
1 Alice 101 100
2 Bob 102 200
NaN NaN 103 150

Outer Join on customer_id: includes all rows from both data frames.

customer_id name order_id total_price
1 Alice 101 100
2 Bob 102 200
3 Carol NaN NaN
NaN NaN 103 150

Groupby and aggregation

Group by divides rows into groups based on column values and aggregation operations (like sum, average, or maximum) condense these groups into summary data. This allows for targeted analysis of specific segments of your data.

Aggregation types

Here are the most common SQL aggregation functions:

SQL Function Description
SUM Adds up the values in a numeric column
AVG Calculates the average of a numeric column
COUNT Counts the number of rows in a column or table
MAX Returns the maximum value in a column
MIN Returns the minimum value in a column
MEDIAN Finds the middle value of a numeric column
MODE Finds the most frequently occurring value in a column
STDDEV Calculates the standard deviation of a numeric column
VARIANCE Calculates the variance of a numeric column
FIRST Returns the first value in a column (order may vary by DBMS)
LAST Returns the last value in a column (order may vary by DBMS)

These functions provide various ways to summarize and analyze data. Different databases might offer additional aggregation functions tailored to specific needs or data types.

Note

The behavior of "FIRST" and "LAST" may depend on the database management system (DBMS) and how the data is ordered in the query. In some systems, you may need to use specific ordering clauses to get the desired behavior.

Example

Customer & orders table (join between customers and orders):

CustomerID Name OrderID Amount
1 Alice 1 100
1 Alice 3 200
2 Bob 2 50
3 Charlie 4 150

In SQL, if we want to find the total amount spent by each customer, we'll use a GROUP BY operation on the CustomerID and Name columns, and aggregate the Amount column using the SUM operation:

SQL
SELECT CustomerID, Name, SUM(Amount) as TotalAmount
FROM CustomerOrders
GROUP BY CustomerID, Name;

Which would give:

CustomerID Name TotalAmount
1 Alice 300
2 Bob 50
3 Charlie 150

The GROUP BY operation groups the rows by CustomerID and Name, and the SUM function calculates the total amount for each group, giving us a summarized view of the spending by each customer.

Note

The python/pandas equivalent of this SQL statement would be:

Python
result = (
    customer_orders
    .groupby(['CustomerID', 'Name'])
    .agg(TotalAmount=('Amount', 'sum'))
    .reset_index()
)