import altair as alt
import pandas as pd
The Seattle Weather dataset is a daily record of the weather in Seattle, WA, USA, covering four years, 2012 to 2015 inclusive.
In this example I'm using a slightly modified version of the data with four extra columns, year, month, day and dayOfYear, for each row of data, because that makes it easier to create useful and informative graphs, which is what we're all about.
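The modified file itself isn't reproduced here, but as a rough sketch, the four extra columns could be derived from the date column along these lines. The helper name add_date_parts and the three-row sample frame are made up for illustration, and I'm assuming day holds the three-letter weekday name, as it appears to in the rows shown in this post.

```python
import pandas as pd


def add_date_parts(df):
    """Return a copy of df with year, month, day and dayOfYear columns.

    Hypothetical helper: assumes df has a 'date' column of ISO date strings.
    """
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    # Three-letter weekday name, e.g. 'Sun' (an assumption about the file's format)
    out["day"] = dates.dt.day_name().str[:3]
    out["dayOfYear"] = dates.dt.dayofyear
    return out


# Tiny made-up sample, not the real dataset
sample = pd.DataFrame({"date": ["2012-01-01", "2012-01-02", "2012-12-31"]})
print(add_date_parts(sample))
```

Deriving the parts once up front keeps the chart code short, since each chart can then refer to a plain column name.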
These are the first five rows of data:
df = pd.read_csv('seattle_weather.csv')
df.head()
 | date | precipitation | temp_max | temp_min | wind | weather | year | month | day | dayOfYear |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | 2012 | 1 | Sun | 1 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | 2012 | 1 | Mon | 2 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | 2012 | 1 | Tue | 3 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | 2012 | 1 | Wed | 4 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | 2012 | 1 | Thu | 5 |
And this is the standard summary of the data:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   date           1461 non-null   object
 1   precipitation  1461 non-null   float64
 2   temp_max       1461 non-null   float64
 3   temp_min       1461 non-null   float64
 4   wind           1461 non-null   float64
 5   weather        1461 non-null   object
 6   year           1461 non-null   int64
 7   month          1461 non-null   int64
 8   day            1461 non-null   object
 9   dayOfYear      1461 non-null   int64
dtypes: float64(4), int64(3), object(3)
memory usage: 114.3+ KB
We can create a kind of histogram here using temp_min, because (a) temp_min takes discrete values, making it suitable for this sort of thing, and (b) it's a good way of understanding how altair makes histograms.
c = alt.Chart(df).mark_bar().encode(
x='temp_min',
y='count()',
tooltip=['temp_min', 'count()'])
c
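To get a feel for what y='count()' is doing in that chart, here is a small pandas sketch of the same idea: count the rows for each distinct temp_min value. The frame df_demo is made-up data, not the real dataset, and this only mirrors the aggregation, not how altair computes it internally.

```python
import pandas as pd

# Made-up readings for illustration only
df_demo = pd.DataFrame({"temp_min": [5.0, 2.8, 7.2, 5.6, 2.8, 5.0, 2.8]})

# One row per distinct temp_min value, with the number of times it occurs;
# each of these would become one bar in the chart above
counts = df_demo["temp_min"].value_counts().sort_index()
print(counts)
```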
alt.X() and Bins

So what happens when the data isn't conveniently discrete, the way temp_min is? Well. It turns out that the x=someColumn construction is shorthand for the real works. Behind the x and y parameters are the alt.X() and alt.Y() channel classes, and we can use these to add additional parameters to our values.
To create a histogram, then, is just the same as making a bar chart in that we call .mark_bar() as above. The difference is that we add the parameter bin=True to alt.X().
c = alt.Chart(df).mark_bar().encode(
x=alt.X('temp_min', bin=True),
y='count()',
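# bin the tooltip field as well, so rows are grouped the same way as on the x-axis
tooltip=[alt.Tooltip('temp_min', bin=True), 'count()'])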
c
Those bins are a little on the portly side. We can tweak that by changing bin=True to bin=alt.Bin(maxbins=50), which allows up to 50 bins.
c = alt.Chart(df).mark_bar().encode(
x=alt.X('temp_min', bin=alt.Bin(maxbins=50)),
y='count()',
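# bin the tooltip field as well, so rows are grouped the same way as on the x-axis
tooltip=[alt.Tooltip('temp_min', bin=alt.Bin(maxbins=50)), 'count()'])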
c
Histograms show the central tendency of the data. If we want to see the mirror image of that, and identify the outliers, we look to boxplots.
Making a boxplot in altair is a piece of cake. Call mark_boxplot() on alt.Chart(df), where df is the data frame with your data, and set either the x or y parameter to whichever numerical column you wish to explore. Setting the x parameter returns a horizontal boxplot; setting the y parameter returns a vertical boxplot. The sizing is done automatically.
alt.Chart(df).mark_boxplot().encode(
x='temp_min')
If we add a tooltip, we can easily identify the outliers in the data.
alt.Chart(df).mark_boxplot().encode(
x='precipitation',
tooltip=['date', 'precipitation'])
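For a sense of which points end up drawn as outliers, here is a pandas sketch of the usual Tukey rule: anything more than 1.5 times the interquartile range beyond the quartiles. As far as I know, 1.5 is also the default whisker extent for these boxplots, but the quartiles computed behind the chart may not match pandas exactly, so treat this as an approximation. The function name tukey_outliers and the sample values are made up.

```python
import pandas as pd


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    s = pd.Series(values, dtype="float64")
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return s[(s < lower) | (s > upper)]


# Made-up precipitation readings for illustration only
sample = [0.0, 10.9, 0.8, 20.3, 1.3, 0.0, 0.0, 2.5, 55.9]
outliers = tukey_outliers(sample)
print(outliers)
```

With rain data most days sit at or near zero, so the box is squashed against the axis and the wet days show up as a long trail of outlier points.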
If you wish to break out the data by a categorical parameter, feel free. Here we break out precipitation by year by setting y to the categorical parameter (note the year:N rather than year, to enforce categorical recognition), while x remains the numerical column, precipitation. Again, if we wanted a vertical chart we'd just switch those around.
alt.Chart(df).mark_boxplot().encode(
y='year:N',
x='precipitation',
tooltip=['date', 'precipitation'])
Altair does boxplots so well it makes you feel like cheering.