Pandas Basics

Python DevCon -- Principles of Data Wrangling

... aka why @stefsy <3s pandas

In [2]:
print "Hi! This is an ipython cell. Press the play button to run it and show the output below."
Hi! This is an ipython cell. Press the play button to run it and show the output below.

In [3]:
cars = 10
print(cars + 100)
110

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
In [5]:
#plot settings
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)

Before we go on: this workshop is a short version of @jvns' awesome pandas cookbook, with some extra content. I highly recommend the cookbook; it's available in full here: https://github.com/jvns/pandas-cookbook

1.0 Reading in Data

1.1 Reading CSV files

In [6]:
bikes = pd.read_csv('./data/bikes.csv')
In [7]:
bikes.head()
Out[7]:
Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

In [8]:
bikes.tail()
Out[8]:
Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
305 01/11/2012;2405;;1208;1701;3082;2076;165;2461
306 02/11/2012;1582;;737;1109;2277;1392;97;1888
307 03/11/2012;844;;380;612;1137;713;105;1302
308 04/11/2012;966;;446;710;1277;692;197;1374
309 05/11/2012;2247;;1170;1705;3221;2143;179;2430

5 rows × 1 columns

YUCK!

In [9]:
fixed_bikes = pd.read_csv('./data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
In [10]:
fixed_bikes.head()
Out[10]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN

5 rows × 9 columns

1.2 Reading from dbs

In [11]:
import sqlite3
In [12]:
con = sqlite3.connect("./data/weather_2012.sqlite")
df = pd.read_sql("SELECT * from weather_2012 where temp >0 LIMIT 5", con)
In [13]:
df
Out[13]:
id date_time temp
0 14 2012-01-01 13:00:00 0.2
1 15 2012-01-01 14:00:00 0.8
2 16 2012-01-01 15:00:00 1.8
3 17 2012-01-01 16:00:00 2.6
4 18 2012-01-01 17:00:00 3.0

5 rows × 3 columns

In [14]:
#define index columns. We're going to come back to this in another section.
weather = pd.read_sql("SELECT * from weather_2012", con,
                 index_col=['id', 'date_time'])
In [15]:
weather.head()
Out[15]:
temp
id date_time
1 2012-01-01 00:00:00 -1.8
2 2012-01-01 01:00:00 -1.8
3 2012-01-01 02:00:00 -1.8
4 2012-01-01 03:00:00 -1.5
5 2012-01-01 04:00:00 -1.5

5 rows × 1 columns

Notes

pandas can connect to other kinds of databases too, like MySQL, PostgreSQL, etc. Reading from them works the same way as the example above: create a connection, then use .read_sql().
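
For example, here's a minimal sketch of reading from PostgreSQL. It assumes SQLAlchemy and a Postgres driver (e.g. psycopg2) are installed; the connection string and credentials are hypothetical placeholders.

#a sketch for a non-sqlite database: build an engine, then read as before
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/weatherdb')  #hypothetical credentials
pg_weather = pd.read_sql('SELECT * FROM weather_2012', engine)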

pandas can also write data back into databases! Handy when you're working with datasets too big to comfortably fit in memory.
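
A quick sketch of writing back, reusing the sqlite connection we already opened ('weather_copy' is just an example table name):

#write a DataFrame back out to the database
weather.to_sql('weather_copy', con, if_exists='replace')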

1.3 Reading from JSON

In [16]:
jsa = pd.read_json('./data/fernfamily.json')
jsa
Out[16]:
children name
0 {u'name': u'Lycopodiophytes'} Trachaeophyta
1 {u'name': u'Euphyllophytes', u'children': [{u'... Trachaeophyta

2 rows × 2 columns

pd.read_json doesn't flatten nested structures for you, so you usually have to do a fair amount of processing to read nested JSON files: walk the nested data, append one record per node to a list (or dict), then turn that into a DataFrame.
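
A minimal sketch of that kind of loop, assuming each node in the file looks like {'name': ..., 'children': [...]} (which is what the output above suggests):

import json

#walk the nested tree, collect one flat record per node, then build a DataFrame
def flatten(node, parent=None, rows=None):
    if rows is None:
        rows = []
    rows.append({'name': node['name'], 'parent': parent})
    for child in node.get('children', []):
        flatten(child, parent=node['name'], rows=rows)
    return rows

with open('./data/fernfamily.json') as f:
    tree = json.load(f)

ferns = pd.DataFrame(flatten(tree))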

2.0 Using DataFrames

DataFrame -- the primary data structure in pandas: a two-dimensional, tabular structure with labelled rows and columns. Pretty awesome. You can do matrix multiplication, comparisons, handle duplicates or NA values, create pivot tables, and so much more!
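
A tiny made-up example of a few of those operations (the numbers are invented and the keyword names follow current pandas; it's just to show the calls):

#comparisons, NA handling and pivot tables on a toy frame
import numpy as np
toy = pd.DataFrame({'district': ['A', 'A', 'B', 'B'],
                    'month':    [1, 2, 1, 2],
                    'riders':   [10, np.nan, 30, 40]})
toy['riders'] > 20           #element-wise comparison
toy.dropna()                 #drop rows with missing values
toy.pivot_table(values='riders', index='district', columns='month')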

In [17]:
fixed_bikes.head()
Out[17]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN

5 rows × 9 columns

In [18]:
fixed_bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 00:00:00 to 2012-11-05 00:00:00
Data columns (total 9 columns):
Berri 1                                310 non-null int64
Brébeuf (données non disponibles)      0 non-null float64
Côte-Sainte-Catherine                  310 non-null int64
Maisonneuve 1                          310 non-null int64
Maisonneuve 2                          310 non-null int64
du Parc                                310 non-null int64
Pierre-Dupuy                           310 non-null int64
Rachel1                                310 non-null int64
St-Urbain (données non disponibles)    0 non-null float64
dtypes: float64(2), int64(7)

2.1 Selecting Rows and Columns

In [19]:
#selecting a single column by name 
fixed_bikes['Berri 1']
Out[19]:
Date
2012-01-01     35
2012-01-02     83
2012-01-03    135
2012-01-04    144
2012-01-05    197
2012-01-06    146
2012-01-07     98
2012-01-08     95
2012-01-09    244
2012-01-10    397
2012-01-11    273
2012-01-12    157
2012-01-13     75
2012-01-14     32
2012-01-15     54
...
2012-10-22    3650
2012-10-23    4177
2012-10-24    3744
2012-10-25    3735
2012-10-26    4290
2012-10-27    1857
2012-10-28    1310
2012-10-29    2919
2012-10-30    2887
2012-10-31    2634
2012-11-01    2405
2012-11-02    1582
2012-11-03     844
2012-11-04     966
2012-11-05    2247
Name: Berri 1, Length: 310
In [20]:
#selecting the first 3 rows of data
fixed_bikes[:3]
Out[20]:
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN

3 rows × 9 columns

In [21]:
#selecting the first 3 rows of specific columns
fixed_bikes[:3][['Berri 1','Rachel1']]
Out[21]:
Berri 1 Rachel1
Date
2012-01-01 35 16
2012-01-02 83 43
2012-01-03 135 58

3 rows × 2 columns
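
The chained selection above works, but label-based .loc does the same thing in one step. A sketch (the date strings match the index shown above):

#same selection with .loc: a slice of dates, plus a list of column names
fixed_bikes.loc['2012-01-01':'2012-01-03', ['Berri 1', 'Rachel1']]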

3.0 Plotting

In [22]:
#easy plotting! fewer cyclists at the beginning of the year
fixed_bikes['Berri 1'].plot()
Out[22]:
<matplotlib.axes.AxesSubplot at 0x1075d0c90>
In [23]:
#let's try plotting the number of cyclists across all locations
fixed_bikes.plot(figsize=(15,10))
Out[23]:
<matplotlib.axes.AxesSubplot at 0x1075e4d90>
In [24]:
#that's cool but confusing. let's break it apart into subplots.
fixed_bikes.plot(subplots=True,figsize=(15,10))
Out[24]:
array([<matplotlib.axes.AxesSubplot object at 0x107acf6d0>,
       <matplotlib.axes.AxesSubplot object at 0x108594990>,
       <matplotlib.axes.AxesSubplot object at 0x1085b7710>,
       <matplotlib.axes.AxesSubplot object at 0x10882cb90>,
       <matplotlib.axes.AxesSubplot object at 0x108850810>,
       <matplotlib.axes.AxesSubplot object at 0x108553e90>,
       <matplotlib.axes.AxesSubplot object at 0x108576490>,
       <matplotlib.axes.AxesSubplot object at 0x108890b10>,
       <matplotlib.axes.AxesSubplot object at 0x108878c10>], dtype=object)
In [25]:
#it's now visually obvious that two of our columns don't contain any data, which we can verify with .info()
fixed_bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 00:00:00 to 2012-11-05 00:00:00
Data columns (total 9 columns):
Berri 1                                310 non-null int64
Brébeuf (données non disponibles)      0 non-null float64
Côte-Sainte-Catherine                  310 non-null int64
Maisonneuve 1                          310 non-null int64
Maisonneuve 2                          310 non-null int64
du Parc                                310 non-null int64
Pierre-Dupuy                           310 non-null int64
Rachel1                                310 non-null int64
St-Urbain (données non disponibles)    0 non-null float64
dtypes: float64(2), int64(7)
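
Since those two columns are entirely NaN, one option (a sketch) is to drop any all-empty columns before plotting:

#drop columns with no data at all, then re-plot the rest
bikes_with_data = fixed_bikes.dropna(axis=1, how='all')
bikes_with_data.plot(subplots=True, figsize=(15, 10))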

4.0 Aggregation Methods

4.1 Counts and Combos

In [26]:
weather = pd.read_csv('./data/weather_2012.csv')
weather.head()
Out[26]:
Date/Time Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather
0 2012-01-01 00:00:00 -1.8 -3.9 86 4 8.0 101.24 Fog
1 2012-01-01 01:00:00 -1.8 -3.7 87 4 8.0 101.24 Fog
2 2012-01-01 02:00:00 -1.8 -3.4 89 7 4.0 101.26 Freezing Drizzle,Fog
3 2012-01-01 03:00:00 -1.5 -3.2 88 6 4.0 101.27 Freezing Drizzle,Fog
4 2012-01-01 04:00:00 -1.5 -3.3 88 7 4.8 101.23 Fog

5 rows × 8 columns

In [27]:
#show us the weather category only
weather['Weather']
Out[27]:
0                      Fog
1                      Fog
2     Freezing Drizzle,Fog
3     Freezing Drizzle,Fog
4                      Fog
5                      Fog
6                      Fog
7                      Fog
8                      Fog
9                      Fog
10                     Fog
11                     Fog
12                     Fog
13                     Fog
14                     Fog
...
8769        Snow
8770    Snow,Fog
8771        Snow
8772        Snow
8773        Snow
8774        Snow
8775        Snow
8776        Snow
8777        Snow
8778        Snow
8779        Snow
8780        Snow
8781        Snow
8782        Snow
8783        Snow
Name: Weather, Length: 8784, dtype: object
In [28]:
#show us the unique values for weather
weather['Weather'].unique()
Out[28]:
array(['Fog', 'Freezing Drizzle,Fog', 'Mostly Cloudy', 'Cloudy', 'Rain',
       'Rain Showers', 'Mainly Clear', 'Snow Showers', 'Snow', 'Clear',
       'Freezing Rain,Fog', 'Freezing Rain', 'Freezing Drizzle',
       'Rain,Snow', 'Moderate Snow', 'Freezing Drizzle,Snow',
       'Freezing Rain,Snow Grains', 'Snow,Blowing Snow', 'Freezing Fog',
       'Haze', 'Rain,Fog', 'Drizzle,Fog', 'Drizzle',
       'Freezing Drizzle,Haze', 'Freezing Rain,Haze', 'Snow,Haze',
       'Snow,Fog', 'Snow,Ice Pellets', 'Rain,Haze', 'Thunderstorms,Rain',
       'Thunderstorms,Rain Showers', 'Thunderstorms,Heavy Rain Showers',
       'Thunderstorms,Rain Showers,Fog', 'Thunderstorms',
       'Thunderstorms,Rain,Fog', 'Thunderstorms,Moderate Rain Showers,Fog',
       'Rain Showers,Fog', 'Rain Showers,Snow Showers', 'Snow Pellets',
       'Rain,Snow,Fog', 'Moderate Rain,Fog',
       'Freezing Rain,Ice Pellets,Fog', 'Drizzle,Ice Pellets,Fog',
       'Drizzle,Snow', 'Rain,Ice Pellets', 'Drizzle,Snow,Fog',
       'Rain,Snow Grains', 'Rain,Snow,Ice Pellets', 'Snow Showers,Fog',
       'Moderate Snow,Blowing Snow'], dtype=object)
In [29]:
#show us an aggregate count of the weather types
weather['Weather'].value_counts()
Out[29]:
Mainly Clear                               2106
Mostly Cloudy                              2069
Cloudy                                     1728
Clear                                      1326
Snow                                        390
Rain                                        306
Rain Showers                                188
Fog                                         150
Rain,Fog                                    116
Drizzle,Fog                                  80
Snow Showers                                 60
Drizzle                                      41
Snow,Fog                                     37
Snow,Blowing Snow                            19
Rain,Snow                                    18
Haze                                         16
Thunderstorms,Rain Showers                   16
Drizzle,Snow,Fog                             15
Freezing Rain                                14
Freezing Drizzle,Snow                        11
Freezing Drizzle                              7
Snow,Ice Pellets                              6
Freezing Drizzle,Fog                          6
Snow,Haze                                     5
Rain,Snow,Ice Pellets                         4
Freezing Rain,Fog                             4
Moderate Snow                                 4
Snow Showers,Fog                              4
Freezing Fog                                  4
Thunderstorms,Rain Showers,Fog                3
Thunderstorms,Rain                            3
Freezing Drizzle,Haze                         3
Rain,Haze                                     3
Moderate Snow,Blowing Snow                    2
Thunderstorms                                 2
Rain Showers,Snow Showers                     2
Drizzle,Snow                                  2
Freezing Rain,Haze                            2
Thunderstorms,Heavy Rain Showers              1
Thunderstorms,Moderate Rain Showers,Fog       1
Snow Pellets                                  1
Rain,Ice Pellets                              1
Rain,Snow,Fog                                 1
Drizzle,Ice Pellets,Fog                       1
Freezing Rain,Ice Pellets,Fog                 1
Freezing Rain,Snow Grains                     1
Thunderstorms,Rain,Fog                        1
Moderate Rain,Fog                             1
Rain Showers,Fog                              1
Rain,Snow Grains                              1
dtype: int64
In [30]:
#chain it! show us top 10 weather patterns
weather['Weather'].value_counts()[:10]
Out[30]:
Mainly Clear     2106
Mostly Cloudy    2069
Cloudy           1728
Clear            1326
Snow              390
Rain              306
Rain Showers      188
Fog               150
Rain,Fog          116
Drizzle,Fog        80
dtype: int64
In [31]:
#combo combo combo plot it!
weather['Weather'].value_counts()[:10].plot(kind='bar')
Out[31]:
<matplotlib.axes.AxesSubplot at 0x1075d0890>

4.2 groupby and aggregation

In [32]:
berri_bikes = fixed_bikes[['Berri 1']]
#note the double brackets: instead of a bare Series, we get back a DataFrame with the 'Berri 1' column plus the Date index
berri_bikes[:5]
Out[32]:
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197

5 rows × 1 columns

In [33]:
#pandas handles time series very well. We can get the day of the month from the date index like so:
berri_bikes.index.day
Out[33]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,  1,  2,  3,
        4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
       21, 22, 23, 24, 25, 26, 27, 28, 29,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
       26, 27, 28, 29, 30, 31,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
       12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29, 30,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23, 24, 25, 26, 27, 28, 29, 30, 31,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
       26, 27, 28, 29, 30, 31,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
       12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29, 30,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,  1,
        2,  3,  4,  5], dtype=int32)
In [34]:
#and get the weekday too! 0 is Monday, 6 is Sunday.
berri_bikes.index.weekday
Out[34]:
array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5,
       6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
       1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
       3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,
       5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6,
       0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
       4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0], dtype=int32)
In [35]:
#create a new column in our berri_bikes dataframe
berri_bikes['weekday'] = berri_bikes.index.weekday
berri_bikes.head()
Out[35]:
Berri 1 weekday
Date
2012-01-01 35 6
2012-01-02 83 0
2012-01-03 135 1
2012-01-04 144 2
2012-01-05 197 3

5 rows × 2 columns

In [36]:
#time to aggregate
weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
weekday_counts
Out[36]:
Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6 99310

7 rows × 1 columns
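
aggregate(sum) applies a single function; a couple of related sketches: .sum() is the idiomatic shortcut, and .agg() can compute several statistics in one pass.

#the idiomatic shortcut for a single aggregation
berri_bikes.groupby('weekday').sum()
#...or several aggregates at once
berri_bikes.groupby('weekday')['Berri 1'].agg(['sum', 'mean', 'max'])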

In [37]:
#rename for clarity
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts
Out[37]:
Berri 1
Monday 134298
Tuesday 135305
Wednesday 152972
Thursday 160131
Friday 141771
Saturday 101578
Sunday 99310

7 rows × 1 columns

In [38]:
#plot it!
weekday_counts.plot(kind='bar')
Out[38]:
<matplotlib.axes.AxesSubplot at 0x10a045b10>

The End! Questions? Comments? Try stuff now!

See my other talks and d3 viz examples here!