import pandas as pd
import altair as alt
vega_datasets, installed with pip install vega_datasets, has a data module. data is a collection of sixty-something different datasets, intended for use in tutorials very like this one.
Nearly all the datasets are returned as pandas DataFrames by calling the dataset name as a data method - data.iris(), data.seattle_weather(). The ones that don't return pandas DataFrames return dictionaries, which are then easily turned into pandas DataFrames.
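As a rough sketch of that last step, a dictionary of records can be handed straight to pandas. The raw dictionary below is invented for illustration and is not one of the vega_datasets entries; a real dictionary-style dataset may be nested differently, so the key names here are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for a dataset that comes back as a dictionary
# rather than a DataFrame. The keys and values are made up.
raw = {
    "name": "example",
    "records": [
        {"city": "Seattle", "temp_max": 12.8},
        {"city": "New York", "temp_max": 9.4},
    ],
}

# A list of flat dictionaries becomes one row per dictionary.
df_from_dict = pd.DataFrame(raw["records"])

# json_normalize does the same job and can also carry along
# top-level keys (here "name") as extra columns.
df_normalized = pd.json_normalize(raw, record_path="records", meta=["name"])
```

Which of the two calls fits best depends on how deeply the dictionary is nested.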
Here we're going to use the famous Iris dataset that Sir Ronald Fisher first published in 1936. The dataset covers three species of Iris - setosa, virginica and versicolor - with fifty samples of each. For each of the 150 samples, it records the species and the sepal and petal length and width.
from vega_datasets import data
df = data.iris()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   sepalLength  150 non-null    float64
 1   sepalWidth   150 non-null    float64
 2   petalLength  150 non-null    float64
 3   petalWidth   150 non-null    float64
 4   species      150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
df.head()
| | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
And this is the iris data as we expected it - four numerical variables, one categorical.
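One way to confirm that split in code, rather than by eye, is to ask pandas which columns are numeric. This is a minimal sketch: the sample frame is just the five rows printed by df.head(), typed in by hand so the snippet runs on its own, and the variable names are our own.

```python
import pandas as pd

# The five rows shown by df.head(), copied by hand so this runs
# without vega_datasets installed.
sample = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepalWidth": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petalLength": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# Columns holding numbers versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
```

Running the same two select_dtypes calls on the full df should give the same column split.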
Altair is not very pythonic. It's more like JavaScript, working by chaining methods one after the other. At first, it can be helpful to chant the methods as you're creating the charts, for fear of leaving one out. The first three, and most important, calls in the chain are:
alt.Chart()
mark_[something]()
encode()
Chart takes the dataset as a parameter. mark_[something] determines the nature of the chart - mark_circle() for a scatter chart, mark_bar() for a bar chart, and so on. And finally encode() is where the chart is configured with its x- and y-axes and the rest.
c = alt.Chart(df).mark_point().encode(
x='petalLength',
y='petalWidth')
c
And here it is, and it's - not that spectacular. A lot like its R equivalent, if truth be told. But it can be improved a little.
The first thing we'll do is change .mark_point() to .mark_circle(), and give the thing a little more body, a little more presence.
c = alt.Chart(df).mark_circle().encode(
x='petalLength',
y='petalWidth')
c
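Calling interactive() on our chart does two things: it lets the user pan around the chart by clicking and dragging, and zoom in and out with the mouse wheel.
And suddenly the R chart is left behind.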
c = alt.Chart(df).mark_circle().encode(
x='petalLength',
y='petalWidth').interactive()
c
We can add a color parameter to .encode() in order to identify the Iris species.
c = alt.Chart(df).mark_circle().encode(
x='petalLength',
y='petalWidth',
color='species').interactive()
c
And now we can see why this dataset has been so useful down the years. The setosa irises are easily distinguished from the other two. The boundary between versicolor and virginica is much harder to identify. And now we can add one final parameter to encode, the parameter that really lets the interactive, online chart come into its own and leaves "flat" charts far behind.
We can allow the user to hover her mouse over a point on the chart and see the species, petal length and petal width by using a tooltip.
c = alt.Chart(df).mark_circle().encode(
x='petalLength',
y='petalWidth',
color='species',
tooltip=['species', 'petalLength', 'petalWidth']).interactive()
c