import pandas as pd
import altair as alt
from vega_datasets import data
The Seattle weather dataset is a record of the weather in Seattle, WA, USA, over four years.
df = data.seattle_weather()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   date           1461 non-null   datetime64[ns]
 1   precipitation  1461 non-null   float64
 2   temp_max       1461 non-null   float64
 3   temp_min       1461 non-null   float64
 4   wind           1461 non-null   float64
 5   weather        1461 non-null   object
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 68.6+ KB
Because `date` is a datetime type, we can use it to create subcategories. This is useful for the tutorial, but it is also common in business practice, where data is broken down by year, by month, or by day.
The additional categories are created by applying `lambda` functions to the `date` column. There's a full list of the available format codes here: https://strftime.org/. These codes are shared by many languages.
df['year'] = df.date.apply(lambda x: x.strftime('%Y'))
df['month'] = df.date.apply(lambda x: x.strftime('%m'))
df['day'] = df.date.apply(lambda x: x.strftime('%a'))
df['dayOfYear'] = df.date.apply(lambda x: x.strftime('%j'))
df.head()
| | date | precipitation | temp_max | temp_min | wind | weather | year | month | day | dayOfYear |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | 2012 | 01 | Sun | 001 |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | 2012 | 01 | Mon | 002 |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | 2012 | 01 | Tue | 003 |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | 2012 | 01 | Wed | 004 |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | 2012 | 01 | Thu | 005 |
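As a quick check of the format codes used above, here is a minimal sketch that applies the same four codes to a single date with Python's standard `datetime` module. The variable names are illustrative only, and the weekday abbreviation assumes the default English (C) locale.

```python
from datetime import datetime

# First date in the dataset, as shown in row 0 of the table above
first_day = datetime(2012, 1, 1)

parts = {
    'year': first_day.strftime('%Y'),       # four-digit year
    'month': first_day.strftime('%m'),      # zero-padded month number
    'day': first_day.strftime('%a'),        # abbreviated weekday name
    'dayOfYear': first_day.strftime('%j'),  # zero-padded day of the year
}
print(parts)
```

Note that every value is a string, so the new columns behave as categories rather than numbers.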
Bar charts, as you may remember from school, consist of categories along the horizontal axis and values for those categories along the vertical axis. But how do we group the data by category? If we want to chart `precipitation`, say, and have `year` on the x-axis, how do we handle all the `precipitation` data that exists for each `year`?
The answer is that we have to aggregate the data. There are two ways to do that:
- call `.groupby()` on the `df` DataFrame, grouping the data by year and then calling `.sum()`, `.count()`, or another aggregation on the result, or
- use Altair's built-in aggregation methods inside the chart encoding.

Let's look at the Altair aggregation methods.
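Before moving on, the pandas `.groupby()` route might look like the sketch below. It uses a small made-up DataFrame (the values are illustrative, not taken from the Seattle dataset) so that it runs on its own. With the real data you would group `df` by its `year` column in the same way.

```python
import pandas as pd

# Made-up sample data for illustration only
sample = pd.DataFrame({
    'year': ['2012', '2012', '2012', '2013', '2013'],
    'precipitation': [0.0, 10.9, 0.8, 4.1, 0.0],
})

# One row per year, one column per aggregation
summary = sample.groupby('year')['precipitation'].agg(
    ['count', 'sum', 'min', 'max', 'mean'])
print(summary)
```

The five column names match the five Altair aggregations covered in the rest of this section.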
`count()` shows the number of days for which we have records: 365 for 2013, 2014, and 2015, and 366 for 2012, a leap year.
c = alt.Chart(df).mark_bar().encode(
x = 'year',
y = 'count(precipitation)',
tooltip=['year', 'count(precipitation)'])
c
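The day counts can be confirmed without the dataset. This small sketch uses only the standard library to count the days in each of the four years. The totals add up to the 1,461 rows reported by `df.info()`.

```python
import calendar

# Days per year for the four years covered by the dataset
days_per_year = {
    year: 366 if calendar.isleap(year) else 365
    for year in range(2012, 2016)
}
print(days_per_year)

# Total number of daily records expected across all four years
total_days = sum(days_per_year.values())
print(total_days)
```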
`sum()` shows the total precipitation for each year.
c = alt.Chart(df).mark_bar().encode(
x = 'year',
y = 'sum(precipitation)',
tooltip=['year', 'sum(precipitation)'])
c
`min()` shows the smallest recorded amount of daily precipitation for each year in the dataset, which is 0 in each case.
c = alt.Chart(df).mark_bar().encode(
x = 'year',
y = 'min(precipitation)',
tooltip=['year', 'min(precipitation)'])
c
`max()` shows the largest recorded amount of daily precipitation for each year in the dataset.
c = alt.Chart(df).mark_bar().encode(
x = 'year',
y = 'max(precipitation)',
tooltip=['year', 'max(precipitation)'])
c
`mean()` shows the average amount of daily precipitation for each year in the dataset.
c = alt.Chart(df).mark_bar().encode(
x = 'year',
y = 'mean(precipitation)',
tooltip=['year', 'mean(precipitation)'])
c
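These bar charts can be improved in two ways:
- by adding a `color` argument to the `.encode()` method, thus giving us a stacked chart, and
- by swapping the `x` and `y` encodings, which lays the bars out horizontally.

c = alt.Chart(df).mark_bar().encode(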
y = 'year',
x = 'sum(precipitation)',
color='month',
tooltip=['year', 'month', 'sum(precipitation)'])
c
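Each colored segment in the stacked chart is the precipitation total for one (year, month) pair. A rough pandas equivalent, again on made-up values rather than the real dataset, groups by both columns:

```python
import pandas as pd

# Made-up sample data for illustration only
sample = pd.DataFrame({
    'year': ['2012', '2012', '2012', '2013', '2013'],
    'month': ['01', '01', '02', '01', '02'],
    'precipitation': [1.0, 2.5, 4.0, 3.0, 0.5],
})

# Total per (year, month); unstack() turns the months into columns
stacked = (sample.groupby(['year', 'month'])['precipitation']
           .sum()
           .unstack())
print(stacked)

# Row totals give the full bar length for each year
bar_lengths = stacked.sum(axis=1)
print(bar_lengths)
```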
Line charts are easy. All you need is numerical data for the y-axis and away you go.
c = alt.Chart(df).mark_line().encode(
x = 'date',
y = 'precipitation',
tooltip = ['date', 'precipitation']).interactive()
c
Well, maybe not quite. You do need suitable data: 1,461 values are too many to plot along an x-axis. However, if we reduce the timespan of the data, things get better. We can slice the data to show only January 2012:
c = alt.Chart(df[(df.date>'2011-12-31')&(df.date<'2012-02-01')]).mark_line().encode(
x = 'date',
y = 'precipitation',
tooltip = ['date', 'precipitation']).interactive()
c
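The comparison strings in the slice work because pandas parses them as dates. An alternative sketch, which some readers may find easier to read, selects the month through the `.dt` accessor. It runs on a made-up date range (not the Seattle data) and compares the rows picked by the two masks.

```python
import pandas as pd

# Made-up frame with one row per day around January 2012
sample = pd.DataFrame({'date': pd.date_range('2011-12-30', '2012-02-02')})

# Mask used in the chart above: strictly between the two boundary dates
by_bounds = sample[(sample.date > '2011-12-31') & (sample.date < '2012-02-01')]

# Alternative: select on the year and month components
by_parts = sample[(sample.date.dt.year == 2012) & (sample.date.dt.month == 1)]

print(len(by_bounds), len(by_parts))
```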
Or, we can use the `dayOfYear` category we added to our dataset at the start to compare precipitation day by day across the years, and find the exact dates using the tooltip. This is online graph-making at its most useful:
c = alt.Chart(df).mark_line().encode(
x = 'dayOfYear',
y = 'precipitation',
color = 'year',
tooltip = ['date', 'precipitation']).interactive()
c
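One caveat when comparing by day of year: 2012 is a leap year, so from day 060 onward the same `dayOfYear` value falls on a different calendar date than in the other years. That is one reason the `date` field in the tooltip is handy. A short standard-library sketch shows the shift:

```python
from datetime import datetime

# Parse "year-dayOfYear" strings back into calendar dates
leap = datetime.strptime('2012-060', '%Y-%j')
non_leap = datetime.strptime('2013-060', '%Y-%j')

# Day 060 lands on different month/day pairs in the two years
print((leap.month, leap.day), (non_leap.month, non_leap.day))
```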
We can use aggregations in line charts, but they don't always come out as well as we might like.
c = alt.Chart(df).mark_line().encode(
x = 'month',
y = 'sum(precipitation)',
color = 'year',
tooltip = ['date', 'precipitation']).interactive()
c
There is, however, another type of chart in which aggregations work out beautifully, and that gives us the overall "feel" for the data of a bar chart combined with the notion of progress-in-time of a line chart. This is an area chart.
Just substitute `.mark_area()` for `.mark_line()` and keep the same `x`, `y`, and `color` encodings (the tooltip below is also updated to show the aggregated values):
c = alt.Chart(df).mark_area().encode(
x = 'month',
y = 'sum(precipitation)',
color = 'year',
tooltip = ['month', 'year', 'sum(precipitation)']).interactive()
c
A very beautiful and meaningful result.