Plots¶

Point based plots¶

Scatter¶

A scatter plot is the most basic way to encode data using points and their visual properties. Position (x, y) provides two visual variables, each of which can represent any measurement type (nominal, ordinal, interval or ratio).

Below, position encodes two ratio variables (notice that both scales include zero):

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Miles per gallon (Ratio)

Out[1]:
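
The code cell behind this chart is not shown, so here is only a minimal sketch of the encoding table above, written as a Vega-Lite specification (the JSON grammar that Altair charts compile to) held in a plain Python dictionary. The field names `Horsepower` and `Miles_per_gallon` are assumptions about the column names, and the data source is left out on purpose.

```python
import json

def scatter_spec(x_field, y_field):
    """Build a Vega-Lite scatter spec mapping two quantitative columns to position."""
    def position(field):
        # "quantitative" covers interval and ratio data; zero=True keeps the
        # origin on the axis, which suits ratio variables.
        return {"field": field, "type": "quantitative", "scale": {"zero": True}}

    return {
        "mark": "point",
        "encoding": {"x": position(x_field), "y": position(y_field)},
    }

# Assumed column names, not confirmed by the original page.
spec = scatter_spec("Horsepower", "Miles_per_gallon")
print(json.dumps(spec, indent=2))
```

Each entry under `encoding` is one row of the table: a visual variable on the left, a data column and its measurement type on the right.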

What we learn from this plot is that there might be a quadratic relationship between $ X $ and $ Y $ like: $ Y = a X^{2} + b X + c $.
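
To make the formula concrete, here is a minimal sketch of how the coefficients $ a $, $ b $ and $ c $ could be estimated by least squares, using only the standard library. The sample points are made up for illustration (they are not taken from the cars data), and the helper name `fit_quadratic` is hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    # Power sums S[k] = sum(x**k) and moments T[k] = sum(y * x**k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(3):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(row[3] for row in m)

# Made-up points lying exactly on y = 2x^2 - 3x + 5.
xs = [0, 1, 2, 3, 4]
ys = [5, 4, 7, 14, 25]
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

On real measurements the points will not lie exactly on the curve, and the fit returns the parabola with the smallest sum of squared vertical distances.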

Let's try to learn something by adding more information on this chart, mapping another data column on visual properties:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Miles per gallon (Ratio)
Color hue	Cylinders (Ordinal)

Out[2]:

We can improve this chart by using redundancy, i.e. several visual variables used to represent only a single column of data:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Miles per gallon (Ratio)
Color hue	Cylinders (Ordinal)
Shape	Cylinders (Ordinal)

Out[3]:

There is a slight issue of overplotting: the information is saturated due to elements being on top of each other. One solution is to use a bit of transparency.

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Miles per gallon (Ratio)
Color hue	Cylinders (Ordinal)
Shape	Cylinders (Ordinal)
Opacity	25%

Out[4]:
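
Redundancy and transparency can be sketched in the same Vega-Lite dictionary form (the grammar Altair compiles to). This is an illustration under assumptions, not the page's original code: the helper `add_redundant_encoding` is hypothetical and the field names are assumed.

```python
import copy

def add_redundant_encoding(spec, field, channels=("color", "shape"), opacity=None):
    """Return a copy of a Vega-Lite spec with `field` mapped to several channels."""
    out = copy.deepcopy(spec)
    for channel in channels:
        # The same column drives every listed channel: that is the redundancy.
        out["encoding"][channel] = {"field": field, "type": "ordinal"}
    if opacity is not None:
        # A constant opacity is a fixed value, not a data mapping.
        out["encoding"]["opacity"] = {"value": opacity}
    return out

# Assumed column names for the base scatter plot.
base = {
    "mark": "point",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_gallon", "type": "quantitative"},
    },
}
chart = add_redundant_encoding(base, "Cylinders", opacity=0.25)
print(sorted(chart["encoding"]))
```

Note that opacity is the only row of the table that is not tied to a column: every point gets the same 25% value, so overlapping points add up to darker regions.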

We can now see that most cars sit in the long-distance driving region, with low fuel consumption (high miles per gallon) and low power.

Can we learn something by adding even more information?

Below we try to map another data column on a supplementary visual variable:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Miles per gallon (Ratio)
Color hue	Cylinders (Ordinal)
Shape	Cylinders (Ordinal)
Opacity	25%
Size	Weight in lbs (Ratio)

Out[5]:

The plot is barely readable. 😕

It is saturated by overplotting, so we cannot perceive the relationship between the new column and the others. It is also too complex: we are trying to show 4 data columns at once, and it is often difficult to represent more than 3 pieces of information at once in an understandable way.

Strip¶

If you'd like to visualize the relationship between a discrete variable (nominal or ordinal) and a quantitative variable (interval or ratio), you can try to use a scatter plot:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Cylinders (Ordinal)
Opacity	25%

Out[6]:

But the result is often barely readable, even with transparency: the points are drawn on just a few lines (one for each discrete value of the data), which causes heavy overplotting.

The strip plot is a simple chart that displays information with short vertical lines instead of points:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Cylinders (Ordinal)
Opacity	25%

Out[7]:

It's a bit more readable than the scatter plot and very compact, but it is rarely used. For teaching purposes, it shows how a mix of discrete and quantitative data can be encoded on position: the trick is to draw lines instead of points.
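
As a minimal sketch, assuming Vega-Lite dictionaries (the grammar Altair compiles to) and assumed field names, the scatter plot and the strip plot differ only by their mark: `point` versus `tick`. The helper name `horsepower_by_cylinders` is hypothetical.

```python
def horsepower_by_cylinders(mark):
    """Same encoding, different mark: "point" gives a scatter, "tick" a strip plot."""
    return {
        "mark": {"type": mark, "opacity": 0.25},
        "encoding": {
            "x": {"field": "Horsepower", "type": "quantitative"},
            "y": {"field": "Cylinders", "type": "ordinal"},
        },
    }

scatter = horsepower_by_cylinders("point")
strip = horsepower_by_cylinders("tick")

# Only the mark changes; the data-to-position mapping is identical.
print(scatter["encoding"] == strip["encoding"], strip["mark"]["type"])
```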

Note that we can also use redundancy to improve the readability of the chart:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Cylinders (Ordinal)
Color	Cylinders (Ordinal)
Opacity	25%

Out[8]:

Rectangle based plots¶

Bar & histogram¶

Bar plots use the size of the bar as a visual variable to encode data. Bar size is good at representing a quantity, so it is used by default in histograms:

Visual variable	Data
Position (x)	Horsepower (Ratio) binned
Size	Count (aggregate for each bin)

Out[9]:
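
The binning and counting behind a histogram can be sketched in a few lines of plain Python. The horsepower values below are made up for illustration and the bin width of 40 is an arbitrary choice.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per fixed-width bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up horsepower values, not taken from the cars data.
horsepower = [52, 65, 68, 70, 88, 90, 95, 110, 130, 150, 165, 198]
bins = histogram(horsepower, 40)

# Draw each bar as a run of '#' whose length encodes the count.
for left_edge, count in bins.items():
    print(f"[{left_edge:>3}, {left_edge + 40:>3}) {'#' * count}")
```

The continuous column becomes a small set of ordered bins on the x axis, and the count of each bin is what the bar size encodes.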

You can also encode a discrete variable on color as a secondary visual variable, which gives a stacked bar chart:

Visual variable	Data
Position (x)	Horsepower binned (Ordinal)
Size	Count (for each bin)
Color	Cylinders (Ordinal)

Out[10]:

It gives an idea of the distribution of values in each class of the discrete data, but a precise assessment is difficult: only the bottom segments share a common baseline, so the distributions of the other classes cannot be compared accurately in a stacked bar chart. The overall distribution is still understandable, though.
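
A small sketch of the stacking itself helps to see where the difficulty comes from. The rows below are made-up (horsepower, cylinders) pairs and `stack_counts` is a hypothetical helper: for each bin it returns the start and end height of every colored segment.

```python
from collections import defaultdict

def stack_counts(rows, bin_width):
    """rows: (value, group) pairs. Returns {bin: [(group, y0, y1), ...]}."""
    counts = defaultdict(lambda: defaultdict(int))
    for value, group in rows:
        counts[(value // bin_width) * bin_width][group] += 1
    stacked = {}
    for left_edge in sorted(counts):
        y0, segments = 0, []
        for group in sorted(counts[left_edge]):
            # Each segment starts where the previous one ended.
            y1 = y0 + counts[left_edge][group]
            segments.append((group, y0, y1))
            y0 = y1
        stacked[left_edge] = segments
    return stacked

# Made-up (horsepower, cylinders) rows, for illustration only.
rows = [(52, 4), (65, 4), (70, 4), (88, 4), (95, 4), (90, 6),
        (105, 6), (110, 6), (125, 6), (130, 8), (150, 8)]
stacked = stack_counts(rows, 40)
for left_edge, segments in stacked.items():
    print(left_edge, segments)
```

In this toy data the 6-cylinder segment starts at height 2 in one bin and at height 0 in the next, so its length has to be judged without a shared baseline, while the top of each stack still shows the overall count.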

To get a precise view of each distribution and to compare them, it's better to separate them using the position as a visual variable:

Visual variable	Data
Position (x)	Horsepower binned (Ordinal)
Size	Count (for each bin)
Color	Cylinders (Ordinal)
Position (y)	-

Out[11]:

But you lose the representation of the overall distribution.

2D-histogram¶

Finally, if you have too much overplotting with scatter plots, it's possible to make 2D histograms. The data is binned across 2 quantitative dimensions, which are transformed into ordinal data. The count aggregate can then be encoded into visual variables, such as size or color.
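
The two-dimensional binning can be sketched with the standard library. The (horsepower, miles per gallon) pairs are made up for illustration, and the bin widths are arbitrary.

```python
from collections import Counter

def histogram_2d(points, x_width, y_width):
    """Count points per rectangular cell; keys are the lower-left corner of each cell."""
    return Counter(
        ((x // x_width) * x_width, (y // y_width) * y_width) for x, y in points
    )

# Made-up (horsepower, miles per gallon) pairs, not taken from the cars data.
points = [(52, 33), (65, 31), (68, 36), (88, 27), (90, 24), (95, 22),
          (110, 19), (130, 16), (150, 15), (165, 14), (198, 12)]
cells = histogram_2d(points, x_width=50, y_width=10)

for (x_edge, y_edge), count in sorted(cells.items()):
    print(f"hp >= {x_edge:>3}, mpg >= {y_edge:>2}: {count}")
```

Each cell now holds a single number, however many points fell into it, which is why overplotting disappears.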

When the count is encoded on the size of the points, it's a bubble plot:

Visual variable	Data
Position (x)	Horsepower binned (Ordinal)
Position (y)	Miles per gallon binned (Ordinal)
Size	Count (Ratio)

Out[12]:

You can also improve it using redundancy:

Visual variable	Data
Position (x)	Horsepower binned (Ordinal)
Position (y)	Miles per gallon binned (Ordinal)
Size	Count (Ratio)
Color	Count (Ratio)

Out[13]:

When the count is encoded on the color instead of the size, it's a heatmap:

Visual variable	Data
Position (x)	Horsepower binned (Ordinal)
Position (y)	Miles per gallon binned (Ordinal)
Color	Count (Ratio)

Out[14]:

Line based plots¶

Line¶

In some cases, the scatter plot is not really relevant. The plot below shows the mean horsepower of all cars released in a given year, broken down by number of cylinders. Can you guess what's wrong with this chart?

Visual variable	Data
Position (x)	Year (Ordinal)
Position (y)	Horsepower mean (Ratio)
Color hue	Cylinders (Ordinal)
Shape	Cylinders (Ordinal)

Out[15]:

This temporal relationship between data points can be represented using a line chart. A line chart joins the points with small segments, which bring 2 additional visual variables (size and slope) that are powerful for perceiving a rate of variation.

Visual variable	Data
Position (x)	Year (Ordinal)
Position (y)	Horsepower mean (Ratio)
Size (of segments)	Horsepower mean variation (Interval)
Slope (of segments)	Horsepower mean variation (Interval)
Color hue	Cylinders (Ordinal)

Out[16]:

We can refine this chart by encoding the count of cars for each year and cylinder class, since each point is a mean aggregate. The count is ratio data, which can be shown using the size of the points:

Visual variable	Data
Position (x)	Year (Ordinal)
Position (y)	Horsepower mean (Ratio)
Size (of segments)	Horsepower mean variation (Interval)
Slope (of segments)	Horsepower mean variation (Interval)
Color hue	Cylinders (Ordinal)
Size (of points)	Count

Out[17]:
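
The aggregation behind this chart can be sketched with the standard library: one mean and one count per (year, cylinders) pair, then the slope of each segment joining consecutive years. The rows are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict

def mean_by_group(rows):
    """rows: (year, cylinders, horsepower). Returns {cylinders: [(year, mean, count), ...]}."""
    buckets = defaultdict(list)
    for year, cylinders, horsepower in rows:
        buckets[(cylinders, year)].append(horsepower)
    series = defaultdict(list)
    for cylinders, year in sorted(buckets):
        values = buckets[(cylinders, year)]
        series[cylinders].append((year, sum(values) / len(values), len(values)))
    return dict(series)

def segment_slopes(points):
    """Slope of each segment joining consecutive (year, mean, count) points."""
    return [
        (y1 - y0) / (x1 - x0)
        for (x0, y0, _), (x1, y1, _) in zip(points, points[1:])
    ]

# Made-up (year, cylinders, horsepower) rows, for illustration only.
rows = [(1970, 8, 200), (1970, 8, 180), (1971, 8, 170), (1972, 8, 150),
        (1972, 8, 160), (1970, 4, 90), (1971, 4, 85), (1971, 4, 95), (1972, 4, 80)]
series = mean_by_group(rows)
for cylinders, points in series.items():
    print(cylinders, points, segment_slopes(points))
```

The slope is the variation of the mean per year, which the eye reads directly from the tilt of each segment; the count tells how many cars stand behind each mean.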

This chart tells the whole story of the car industry in the 70s and the 80s: at first, muscle cars were very popular thanks to their huge power and their 8 cylinders, but the oil shocks of the 70s led people to prefer more fuel-efficient models with 4 cylinders.

In the end, the industry tried to design less powerful cars that still had 8 cylinders, in an attempt to lure consumers into thinking that they were still buying muscle cars (only more fuel-efficient).

Area & distribution¶

Area charts are line charts filled with color. Like bar charts, they are powerful for representing quantities.

They can also replace a histogram made on a binned quantitative variable, by using kernel smoothing (a type of weighted moving average).

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Density (Ratio)

Out[18]:
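
Kernel smoothing can be sketched as follows, assuming a Gaussian kernel (the page does not say which kernel is used). The horsepower values are made up and the bandwidth of 15 is an arbitrary choice.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Every sample contributes, weighted by its distance to x.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Made-up horsepower values, not taken from the cars data.
horsepower = [52, 65, 68, 70, 88, 90, 95, 110, 130, 150, 165, 198]
density = gaussian_kde(horsepower, bandwidth=15.0)

# Points of the curve that the area chart would fill.
curve = [(x, density(x)) for x in range(0, 261, 20)]
for x, d in curve:
    print(f"{x:>3} {'#' * round(d * 2000)}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the samples more closely, much like the bin width of a histogram.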

As with histograms, we can also encode ordinal data on color as a visual variable:

Visual variable	Data
Position (x)	Horsepower (Ratio)
Position (y)	Density (Ratio)
Color	Cylinders (Ordinal)

Out[19]:

Complex plots¶

Many complex plots exist and we can't show them all. For this beginner course, the only one worth showing is the parallel coordinates plot, which shows all the dimensions of the data at the same time:

Out[20]:
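
The geometry of a parallel coordinates plot can be sketched in plain Python: each column gets its own vertical axis, each value is rescaled to the [0, 1] range of its axis, and each row becomes one polyline crossing all the axes. The rows and column names below are made up for illustration, and min-max rescaling is an assumption about how the axes are scaled.

```python
def parallel_coordinates(rows, columns):
    """Rescale each column to [0, 1] and return one polyline per row."""
    lows = {c: min(r[c] for r in rows) for c in columns}
    highs = {c: max(r[c] for r in rows) for c in columns}
    lines = []
    for row in rows:
        line = []
        for axis_index, c in enumerate(columns):
            span = highs[c] - lows[c]
            # A constant column has no spread; place it mid-axis.
            height = (row[c] - lows[c]) / span if span else 0.5
            line.append((axis_index, height))
        lines.append(line)
    return lines

# Made-up rows, for illustration only.
rows = [
    {"Horsepower": 50, "Miles_per_gallon": 40, "Cylinders": 4},
    {"Horsepower": 200, "Miles_per_gallon": 10, "Cylinders": 8},
    {"Horsepower": 125, "Miles_per_gallon": 25, "Cylinders": 6},
]
lines = parallel_coordinates(rows, ["Horsepower", "Miles_per_gallon", "Cylinders"])
for line in lines:
    print(line)
```

Lines that cross between two neighboring axes hint at a negative relationship between the two columns, while lines that stay roughly parallel hint at a positive one.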

Exercises¶

Exercise 1: Pick one bad visualization from tumblr.com/badvisualisations and explain how you would do the visualization better.

Exercise 2: Explain why the data visualization below, made by Minard in 1869, is known as "the best statistical graphic ever drawn": Minard's Map