Pandas is the Python Data Analysis Library.
Pandas is an extremely versatile tool for manipulating datasets.
It also produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.
The most important tool provided by Pandas is the data frame.
A data frame is a table in which each row and column is given a label.
Pandas DataFrames are documented at:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
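To make the row and column labels concrete, here is a minimal sketch with made-up data (the names and values are purely illustrative):

```python
import pandas as pd

# A tiny, made-up table: rows are labeled by name, columns by attribute.
people = pd.DataFrame({'age': [25, 32, 41],
                       'height_cm': [170, 180, 165]},
                      index=['alice', 'bob', 'carol'])
print(people)

# Any cell can be looked up by its row and column labels.
print(people.loc['bob', 'age'])
```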
import pandas as pd
import pandas_datareader.data as web
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from datetime import datetime
#pd.__version__
%matplotlib inline
For demonstration purposes, we'll use pandas-datareader, a companion library to Pandas that fetches data from standard online sources, such as Google and Yahoo! Finance.
More information on what types of data you can fetch is at: http://pandas.pydata.org/pandas-docs/stable/remote_data.html
stocks = 'YELP'
data_source = 'google'
start = datetime(2015,1,1)
end = datetime(2015,12,31)
yahoo_stocks = web.DataReader(stocks, data_source, start, end)
yahoo_stocks.head()
yahoo_stocks.info()
yahoo_stocks.head()
yahoo_stocks.to_csv('yahoo_data.csv')
print(open('yahoo_data.csv').read())
df = pd.read_csv('yahoo_data.csv')
df
The number of rows in the DataFrame:
len(df)
d = {'A':[1., 2., 3., 4.],
     'B':[4., 3., 2., 1.]}
pd.DataFrame(d)
d = [[1,2,3],[4,5,6]]
pd.DataFrame(d, index = ['one','two'], columns = ['A','B','C'])
The columns or "features" in your data
df.columns
Selecting a single column from your data
df['Open']
Another way of selecting a single column from your data
df.Open
df[['Open','Close']].head()
df.Date.head(10)
df.Date.tail(10)
Changing the column names:
new_column_names = [x.lower().replace(' ','_') for x in df.columns]
df.columns = new_column_names
df.info()
Now all columns can be accessed using the dot notation:
df.adj_close.head()
A DataFrame object has many useful methods.
df.mean()
df.std()
df.median()
df.open.mean()
df.high.mean()
df.high.plot(label='high')
df.low.plot(label='low')
plt.legend(loc='best') #puts the legend in the best possible position
df.close.hist(bins=50)
Methods like sum() and std() work on entire columns.
We can run our own functions across all values in a column (or row) using apply().
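Before applying this to the stock data, here is a self-contained sketch of apply() on toy data (the column names a and b are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
# apply() runs an arbitrary Python function on every value in the Series.
print(s.apply(lambda x: x ** 2))

df_toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
# With axis=1, the function receives one row at a time.
print(df_toy.apply(lambda row: row.a + row.b, axis=1))
```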
df.info()
df.date.head()
The values property of the column returns an array of values for the column. Inspecting the first value reveals that these are strings with a particular format.
first_date = df.date.values[0]
first_date
datetime.strptime(first_date, "%Y-%m-%d")
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date.head()
Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.
The row indices so far have been auto-generated by pandas, and are simply integers starting from 0.
From now on we will use dates instead of integers for indices -- the benefits of this will show later.
Overwriting the index is as easy as assigning to the index property of the DataFrame.
df.index = df.date
df.head()
Now that we have made an index based on date, we can drop the original date column.
df = df.drop(['date'],axis=1) #axis = 0 means rows, axis =1 means columns
df.info()
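As a preview of one benefit, sketched on a toy frame (the dates and prices are invented): with a date-based index, rows can be selected by partial date strings.

```python
import pandas as pd

toy = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]},
                   index=pd.to_datetime(['2015-01-05', '2015-01-06',
                                         '2015-02-02', '2015-02-03']))

# Select every row from February 2015 with a single partial date string.
print(toy.loc['2015-02'])
```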
So far we've seen how to access a column of the DataFrame. To access a row we use a different notation.
To access a row by its index value, use the .loc attribute (note the square brackets).
df.loc[datetime(2015,1,23,0,0)]
To access a row by its sequence number (i.e., like an array index), use .iloc ('integer location'), again with square brackets.
df.iloc[0,:]
df.iloc[0:2,[1,3]]
To iterate over the rows, use .iterrows()
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days is {}.".format(num_positive_days))
It is very easy to select interesting rows from the data.
All these operations below return a new DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.
tmp_high = df.high > 32
tmp_high.head()
Summing a Boolean array is the same as counting the number of True values.
sum(tmp_high)
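To see why this works, consider a small made-up Boolean Series:

```python
import pandas as pd

mask = pd.Series([True, False, True, True])
# True is treated as 1 and False as 0, so summing counts the True values.
print(sum(mask))    # the built-in sum
print(mask.sum())   # the pandas method, same count
```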
Now, let's select only the rows of df where tmp_high is True.
df[tmp_high]
Putting it all together, we have the following commonly-used patterns:
positive_days = df[df.close > df.open]
positive_days.head()
very_positive_days = df[df.close-df.open > 1]
very_positive_days.head()
To create a new column, simply assign values to it. Think of the columns as a dictionary:
df['profit'] = (df.open < df.close)
df.head()
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx,'gain'] = 'negative'
    elif (row.close - row.open) < 1:
        df.loc[idx,'gain'] = 'small_gain'
    elif (row.close - row.open) < 6:
        df.loc[idx,'gain'] = 'medium_gain'
    else:
        df.loc[idx,'gain'] = 'large_gain'
df.head()
Here is another, more "functional", way to accomplish the same thing: define a function that classifies rows, and apply it to each row.
def gainrow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'
df['test_column'] = df.apply(gainrow, axis = 1)
#axis = 1 applies the function to each row (axis = 0 would apply it to each column)
df.head()
OK, point made; let's get rid of that extraneous test_column:
df = df.drop('test_column', axis = 1)
df.head()
An extremely powerful DataFrame method is groupby().
It is entirely analogous to GROUP BY in SQL.
It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.
Here we will look at the average gain among the categories of gains (negative, small, medium, and large) we defined above and stored in the column gain.
gain_groups = df.groupby('gain')
Essentially, gain_groups behaves like a dictionary keyed on the values of the gain column:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')
for gain, gain_data in df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain,
          gain_data.close.mean()))
We can obtain the DataFrame that corresponds to a specific group by using the get_group method of the groupby object:
sm = gain_groups.get_group('small_gain')
sm
We can join data frames much as we do joins in SQL.
data_source = 'google'
start = datetime(2015,1,1)
end = datetime(2015,12,31)
dy = web.DataReader('YELP', data_source, start, end)
di = web.DataReader('IBM', data_source, start, end)
dy.to_csv('yelp.csv')
di.to_csv('ibm.csv')
yelp = pd.read_csv('yelp.csv')
ibm = pd.read_csv('ibm.csv')
new_column_names = [x.lower().replace(' ','_') for x in yelp.columns]
yelp.columns = new_column_names
ibm.columns = new_column_names
yelp['gain'] = yelp.apply(gainrow, axis = 1)
ibm['gain'] = ibm.apply(gainrow, axis = 1)
print(yelp.head())
print(ibm.head())
common_gain_dates = pd.merge(yelp, ibm, on=['date','gain'])
common_gain_dates.head()
#common_gain_dates.date
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
dfm = pd.merge(left, right, on='key') #keeps only the common key 'foo'
print(dfm)
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
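For completeness, a sketch of the other how options on the same toy frames (how='outer' and how='right' are standard arguments of pandas merge):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})

# how='outer' keeps every key from both frames, with NaN for missing values.
outer = pd.merge(left, right, on='key', how='outer')
print(outer)

# how='right' keeps all the keys from the right frame.
print(pd.merge(left, right, on='key', how='right'))
```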
A DataFrame is essentially an annotated 2-D array.
Pandas also has annotated versions of 1-D and 3-D arrays.
A 1-D array in Pandas is called a Series.
A 3-D array in Pandas was called a Panel; note that Panels have since been deprecated and removed in recent versions of pandas, which recommend a MultiIndex DataFrame instead.
To use these, read the documentation!
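As a minimal Series sketch (values and labels invented): a Series pairs a 1-D array of values with an index of labels.

```python
import pandas as pd

s = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
print(s)
print(s['b'])     # access a value by its label
print(s.values)   # the underlying array of values
print(s.index)    # the labels
```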
As a last task, we will use the experience we have gained so far -- and learn some new things -- to compare the performance of several stocks fetched from the remote data source.
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
attr = 'Close'
df = web.DataReader(stocks,
                    data_source,
                    start=datetime(2014, 1, 1),
                    end=datetime(2014, 12, 31))[attr]
df.head()
df.ORCL.plot(label = 'oracle')
df.TSLA.plot(label = 'tesla')
df.IBM.plot(label = 'ibm')
df.MSFT.plot(label = 'msft')
df.YELP.plot(label = 'yelp')
_ = plt.legend(loc='best')
Next, we will calculate returns over a period of length $T$, defined as:
$$r(t) = \frac{f(t)-f(t-T)}{f(t-T)} $$
The returns can be computed with the simple DataFrame method pct_change(). Note that for the first $T$ timesteps this value is not defined (of course):
rets = df.pct_change(30)
rets.iloc[25:35]
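To check the formula on toy data (prices invented so that the return is exactly 10% at each step):

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 121.0])
# pct_change(1) computes (f(t) - f(t-1)) / f(t-1); the first value is NaN.
r = prices.pct_change(1)
print(r)
```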
Now we'll plot the time series of the returns of the different stocks.
Notice that the NaN values are gracefully dropped by the plotting function.
rets.ORCL.plot(label = 'oracle')
rets.TSLA.plot(label = 'tesla')
rets.IBM.plot(label = 'ibm')
rets.MSFT.plot(label = 'msft')
rets.YELP.plot(label = 'yelp')
_ = plt.legend(loc='best')
plt.scatter(rets.TSLA, rets.YELP)
plt.xlabel('TESLA 30-day returns')
_ = plt.ylabel('YELP 30-day returns')
There appears to be some (fairly strong) correlation between the movement of TSLA and YELP stocks. Let's measure this.
The correlation coefficient between variables $X$ and $Y$ is defined as follows:
$$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$
Pandas provides a DataFrame method that computes the correlation coefficient of all pairs of columns: corr().
rets.corr()
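A toy example makes the output easy to check: below, y is perfectly correlated with x and z is perfectly anti-correlated (data invented):

```python
import pandas as pd

toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                    'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x: correlation +1
                    'z': [4.0, 3.0, 2.0, 1.0]})  # z = 5-x: correlation -1
c = toy.corr()
print(c)
```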
It takes a bit of time to examine that table and draw conclusions.
To speed that process up it helps to visualize the table.
We will learn more about visualization later, but for now this is a simple example.
_ = sns.heatmap(rets.corr(), annot=True)
Finally, it is important to know that the plotting performed by Pandas is just a layer on top of matplotlib (i.e., the plt package).
So Pandas' plots can (and should) be replaced or improved by using additional functions from matplotlib.
For example, suppose we want to know both the mean return of a stock and the standard deviation of its returns (i.e., its risk).
Here is a visualization of such an analysis, constructed using only functions from matplotlib.
_ = plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label,
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
To understand what these functions are doing (especially the annotate function), consult the online documentation for matplotlib.
df = pd.read_csv('distributions_short.csv',
names=list('ABCD'))
dfs = df.sort_values(by='A', ascending = True)
#plot column B against A
plt.figure(); dfs.plot(x = 'A', y = 'B');
#plot column B against A in log-log scale
plt.figure(); dfs.plot(x = 'A', y = 'B', loglog=True);
#plot columns B and C against A in loglog scale
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], loglog=True);
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], logy=True);
#Using MATLAB-style notation
plt.plot(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
plt.figure();plt.loglog(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
#scatter plot of columns A and B
plt.figure(); df.plot(kind ='scatter', x='A', y='B')
plt.figure(); plt.scatter(df.A, df.B)
#scatter plot of columns A and B in log-log scale
plt.figure(); df.plot(kind ='scatter', x='A', y='B', loglog = True)
#putting many scatter plots into the same plot
t = df.plot(kind='scatter', x='A', y='B', color='DarkBlue', label='B curve', loglog=True);
df.plot(kind='scatter', x='A', y='C',color='DarkGreen', label='C curve', ax=t, loglog = True);