Standard Imports¶

In [1]:

import pandas as pd
import altair as alt

A Data Source¶

The vega_datasets package, installed with pip install vega_datasets, provides a data module: a collection of sixty-something datasets intended for tutorials very like this one.

Nearly all the datasets are loaded as pandas DataFrames by calling the dataset name as a method of data - data.iris(), data.seattle_weather(). The ones that don't return pandas DataFrames return dictionaries, which are then easily turned into pandas DataFrames.
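As a rough sketch of that last step, here is one way a dictionary might be turned into a DataFrame. The dictionary below is a made-up stand-in, not a real vega_datasets entry, and its keys ("name", "records", "city", "temp_max") are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a dataset that arrives as a dictionary
# rather than a DataFrame; the keys and values here are made up.
raw = {
    "name": "example",
    "records": [
        {"city": "Seattle", "temp_max": 12.8},
        {"city": "Seattle", "temp_max": 10.6},
    ],
}

# json_normalize flattens a list of records into one row per record.
records_df = pd.json_normalize(raw["records"])
print(records_df)
```

How much unpacking a real dictionary needs depends on its shape, so treat this as a starting point rather than a recipe.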

The Iris Dataset¶

Here we're going to use the famous Iris dataset that Sir Ronald Fisher first published in 1936. The dataset covers three species of Iris - setosa, virginica and versicolor - with fifty samples of each. For each of the 150 samples, it records the species and the sepal and petal length and width.

In [2]:

from vega_datasets import data

In [3]:

df = data.iris()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sepalLength  150 non-null    float64
 1   sepalWidth   150 non-null    float64
 2   petalLength  150 non-null    float64
 3   petalWidth   150 non-null    float64
 4   species      150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

In [4]:

df.head()

Out[4]:

	sepalLength	sepalWidth	petalLength	petalWidth	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

And this is the iris data as we expected it - four numerical variables, one categorical.

Altair¶

Altair is not very pythonic. It's more like JavaScript, working by chaining methods one after the other. At first, it can be helpful to chant the steps as you're creating the charts, for fear of leaving one out. The first four, and most important, links in the chain are:

alt.
Chart()
mark_[something]()
encode()

Chart takes the dataset as a parameter. mark_[something] determines the nature of the chart - mark_circle() for a scatter chart, mark_bar() for a bar chart, and so on. And finally encode() is where the chart is configured with its x- and y-axes and the rest.

A Scatter Chart¶

In [5]:

c = alt.Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth')
c

Out[5]:

And here it is, and it's - not that spectacular. A lot like its R equivalent, if truth be told. But it can be improved a little.

The first thing we'll do is change .mark_point() to .mark_circle(), which draws filled circles rather than hollow points and gives the thing a little more body, a little more presence.

In [6]:

c = alt.Chart(df).mark_circle().encode(
    x='petalLength',
    y='petalWidth')
c

Out[6]:

Going Interactive¶

Calling interactive() on our chart does two things:

The viewer can move the chart around by clicking and dragging.
The viewer can zoom in and out of the chart by pinching a thumbpad or by scrolling a mouse wheel.

And suddenly the R chart is left behind.

In [7]:

c = alt.Chart(df).mark_circle().encode(
    x='petalLength',
    y='petalWidth').interactive()
c

Out[7]:

Adding Color¶

We can add a color parameter to .encode() in order to identify the Iris species.

In [8]:

c = alt.Chart(df).mark_circle().encode(
    x='petalLength',
    y='petalWidth',
    color='species').interactive()
c

Out[8]:

And a final coup de grâce¶

And now we can see why this dataset has been so useful down the years. The setosa irises are easily distinguished from the other two. The boundary between versicolor and virginica is much harder to identify. And now we can add one final parameter to encode, the parameter that really lets the interactive, online chart come into its own and leaves "flat" charts far behind.

We can allow the user to hover her mouse over a point on the chart and see the species, petal length and petal width by using a tooltip.

In [9]:

c = alt.Chart(df).mark_circle().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    tooltip=['species', 'petalLength', 'petalWidth']).interactive()

c

Out[9]:

Intro to Altair - Scatter Plots