Pandas is the Python Data Analysis Library.
Pandas is an extremely versatile tool for manipulating datasets.
It also produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.
The most important tool provided by Pandas is the data frame.
A data frame is a table in which each row and column is given a label.
Pandas DataFrames are documented at:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
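To make the row and column labels concrete, here is a minimal sketch with made-up data (the names and values are purely illustrative):

```python
import pandas as pd

# A tiny, made-up table: rows are labeled by name, columns by attribute.
people = pd.DataFrame({'age': [25, 32, 41],
                       'height_cm': [170, 180, 165]},
                      index=['alice', 'bob', 'carol'])
print(people)

# Any cell can be looked up by its row and column labels.
print(people.loc['bob', 'age'])
```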
import pandas as pd
import pandas_datareader.data as web
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from datetime import datetime
#pd.__version__
%matplotlib inline
For demonstration purposes, we'll use pandas-datareader, a companion library to Pandas that fetches data from standard online sources, such as Google and Yahoo! Finance.
More information on what types of data you can fetch is at: http://pandas.pydata.org/pandas-docs/stable/remote_data.html
stocks = 'YELP'
data_source = 'google'
start = datetime(2015,1,1)
end = datetime(2015,12,31)
yahoo_stocks = web.DataReader(stocks, data_source, start, end)
yahoo_stocks.head()
yahoo_stocks.info()
yahoo_stocks.head()
yahoo_stocks.to_csv('yahoo_data.csv')
print(open('yahoo_data.csv').read())
df = pd.read_csv('yahoo_data.csv')
df
The number of rows in the DataFrame:
len(df)
d = {'A':[1., 2., 3., 4.],
     'B':[4., 3., 2., 1.]}
pd.DataFrame(d)
d = [[1,2,3],[4,5,6]]
pd.DataFrame(d, index = ['one','two'], columns = ['A','B','C'])
The columns or "features" in your data
df.columns
Selecting a single column from your data
df['Open']
Another way of selecting a single column from your data
df.Open
df[['Open','Close']].head()
df.Date.head(10)
df.Date.tail(10)
Changing the column names:
new_column_names = [x.lower().replace(' ','_') for x in df.columns]
df.columns = new_column_names
df.info()
Now all columns can be accessed using the dot notation:
df.adj_close.head()
A DataFrame object has many useful methods.
df.mean()
df.std()
df.median()
df.open.mean()
df.high.mean()
df.high.plot(label='high')
df.low.plot(label='low')
plt.legend(loc='best') #puts the legend in the best possible position
df.close.hist(bins=50)
Methods like sum() and std() work on entire columns.
We can run our own functions across all values in a column (or row) using apply().
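Before applying this to the stock data, here is a self-contained sketch of apply() on toy data (the column names a and b are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
# apply() runs an arbitrary Python function on every value in the Series.
print(s.apply(lambda x: x ** 2))

df_toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
# With axis=1, the function receives one row at a time.
print(df_toy.apply(lambda row: row.a + row.b, axis=1))
```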
df.info()
df.date.head()
The values property of the column returns an array of values for the column. Inspecting the first value reveals that these are strings with a particular format.
first_date = df.date.values[0]
first_date
datetime.strptime(first_date, "%Y-%m-%d")
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date.head()
Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.
The row indices so far have been auto-generated by pandas, and are simply integers starting from 0.
From now on we will use dates instead of integers for indices -- the benefits of this will show later.
Overwriting the index is as easy as assigning to the index property of the DataFrame.
df.index = df.date
df.head()
Now that we have made an index based on date, we can drop the original date column.
df = df.drop(['date'],axis=1) #axis = 0 means rows, axis =1 means columns
df.info()
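As a preview of one benefit, sketched on a toy frame (the dates and prices are invented): with a date-based index, rows can be selected by partial date strings.

```python
import pandas as pd

toy = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]},
                   index=pd.to_datetime(['2015-01-05', '2015-01-06',
                                         '2015-02-02', '2015-02-03']))

# Select every row from February 2015 with a single partial date string.
print(toy.loc['2015-02'])
```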
So far we've seen how to access a column of the DataFrame. To access a row we use a different notation.
To access a row by its index value, use the .loc attribute (note the square brackets).
df.loc[datetime(2015,1,23,0,0)]
To access a row by its sequence number (i.e., like an array index), use .iloc ('integer location'), again with square brackets.
df.iloc[0,:]
df.iloc[0:2,[1,3]]
To iterate over the rows, use .iterrows()
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days is {}.".format(num_positive_days))
It is very easy to select interesting rows from the data.
All these operations below return a new DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.
tmp_high = df.high > 32
tmp_high.head()
Summing a Boolean array is the same as counting the number of True values.
sum(tmp_high)
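To see why this works, consider a small made-up Boolean Series:

```python
import pandas as pd

mask = pd.Series([True, False, True, True])
# True is treated as 1 and False as 0, so summing counts the True values.
print(sum(mask))    # the built-in sum
print(mask.sum())   # the pandas method, same count
```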
Now, let's select only the rows of df where tmp_high is True.
df[tmp_high]
Putting it all together, we have the following commonly-used patterns:
positive_days = df[df.close > df.open]
positive_days.head()
very_positive_days = df[df.close-df.open > 1]
very_positive_days.head()
To create a new column, simply assign values to it. Think of the columns as a dictionary:
df['profit'] = (df.open < df.close)
df.head()
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx,'gain'] = 'negative'
    elif (row.close - row.open) < 1:
        df.loc[idx,'gain'] = 'small_gain'
    elif (row.close - row.open) < 6:
        df.loc[idx,'gain'] = 'medium_gain'
    else:
        df.loc[idx,'gain'] = 'large_gain'
df.head()
Here is another, more "functional", way to accomplish the same thing: define a function that classifies rows, and apply it to each row.
def gainrow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'
df['test_column'] = df.apply(gainrow, axis = 1)
#axis = 1 applies the function to each row (axis = 0 would apply it to each column)
df.head()
OK, point made; let's get rid of that extraneous test_column:
df = df.drop('test_column', axis = 1)
df.head()
An extremely powerful DataFrame method is groupby().
It is entirely analogous to GROUP BY in SQL.
It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.
Here we will look at the average gain among the categories of gains (negative, small, medium, and large) we defined above and stored in the column gain.
gain_groups = df.groupby('gain')
Essentially, gain_groups behaves like a dictionary keyed on the values of the gain column:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')
for gain, gain_data in df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain,
          gain_data.close.mean()))
We can obtain the DataFrame that corresponds to a specific group by using the get_group method of the groupby object:
sm = gain_groups.get_group('small_gain')
sm
We can join data frames much as we do joins in SQL.
data_source = 'google'
start = datetime(2015,1,1)
end = datetime(2015,12,31)
dy = web.DataReader('YELP', data_source, start, end)
di = web.DataReader('IBM', data_source, start, end)
dy.to_csv('yelp.csv')
di.to_csv('ibm.csv')
yelp = pd.read_csv('yelp.csv')
ibm = pd.read_csv('ibm.csv')
new_column_names = [x.lower().replace(' ','_') for x in yelp.columns]
yelp.columns = new_column_names
ibm.columns = new_column_names
yelp['gain'] = yelp.apply(gainrow, axis = 1)
ibm['gain'] = ibm.apply(gainrow, axis = 1)
print(yelp.head())
print(ibm.head())
common_gain_dates = pd.merge(yelp, ibm, on=['date','gain'])
common_gain_dates.head()
#common_gain_dates.date
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
dfm = pd.merge(left, right, on='key') #keeps only the common key 'foo'
print(dfm)
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
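For completeness, a sketch of the other how options on the same toy frames (how='outer' and how='right' are standard arguments of pandas merge):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})

# how='outer' keeps every key from both frames, with NaN for missing values.
outer = pd.merge(left, right, on='key', how='outer')
print(outer)

# how='right' keeps all the keys from the right frame.
print(pd.merge(left, right, on='key', how='right'))
```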
A DataFrame is essentially an annotated 2-D array.
Pandas also has annotated versions of 1-D and 3-D arrays.
A 1-D array in Pandas is called a Series.
A 3-D array in Pandas was called a Panel; note that Panels have since been deprecated and removed in recent versions of pandas, which recommend a MultiIndex DataFrame instead.
To use these, read the documentation!
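As a minimal Series sketch (values and labels invented): a Series pairs a 1-D array of values with an index of labels.

```python
import pandas as pd

s = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
print(s)
print(s['b'])     # access a value by its label
print(s.values)   # the underlying array of values
print(s.index)    # the labels
```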
As a last task, we will use the experience we have gained so far -- and learn some new things -- to compare the performance of several stocks fetched from the remote data source.
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
attr = 'Close'
df = web.DataReader(stocks,
                    data_source,
                    start=datetime(2014, 1, 1),
                    end=datetime(2014, 12, 31))[attr]
df.head()
df.ORCL.plot(label = 'oracle')
df.TSLA.plot(label = 'tesla')
df.IBM.plot(label = 'ibm')
df.MSFT.plot(label = 'msft')
df.YELP.plot(label = 'yelp')
_ = plt.legend(loc='best')
Next, we will calculate returns over a period of length $T$, defined as:
$$r(t) = \frac{f(t)-f(t-T)}{f(t-T)} $$
The returns can be computed with the simple DataFrame method pct_change(). Note that for the first $T$ timesteps this value is not defined (of course):
rets = df.pct_change(30)
rets.iloc[25:35]
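To check the formula on toy data (prices invented so that the return is exactly 10% at each step):

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 121.0])
# pct_change(1) computes (f(t) - f(t-1)) / f(t-1); the first value is NaN.
r = prices.pct_change(1)
print(r)
```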
Now we'll plot the time series of the returns of the different stocks.
Notice that the NaN values are gracefully dropped by the plotting function.
rets.ORCL.plot(label = 'oracle')
rets.TSLA.plot(label = 'tesla')
rets.IBM.plot(label = 'ibm')
rets.MSFT.plot(label = 'msft')
rets.YELP.plot(label = 'yelp')
_ = plt.legend(loc='best')
plt.scatter(rets.TSLA, rets.YELP)
plt.xlabel('TESLA 30-day returns')
_ = plt.ylabel('YELP 30-day returns')
There appears to be some (fairly strong) correlation between the movement of TSLA and YELP stocks. Let's measure this.
The correlation coefficient between variables $X$ and $Y$ is defined as follows:
$$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$
Pandas provides a DataFrame method that computes the correlation coefficient of all pairs of columns: corr().
rets.corr()
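A toy example makes the output easy to check: below, y is perfectly correlated with x and z is perfectly anti-correlated (data invented):

```python
import pandas as pd

toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                    'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x: correlation +1
                    'z': [4.0, 3.0, 2.0, 1.0]})  # z = 5-x: correlation -1
c = toy.corr()
print(c)
```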
It takes a bit of time to examine that table and draw conclusions.
To speed that process up it helps to visualize the table.
We will learn more about visualization later, but for now this is a simple example.
_ = sns.heatmap(rets.corr(), annot=True)
Finally, it is important to know that the plotting performed by Pandas is just a layer on top of matplotlib (i.e., the plt package).
So Pandas' plots can (and should) be replaced or improved by using additional functions from matplotlib.
For example, suppose we want to know both the mean return of a stock and the standard deviation of its returns (i.e., its risk).
Here is a visualization of such an analysis, constructed using only functions from matplotlib.
_ = plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label,
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
To understand what these functions are doing (especially the annotate function), consult the online documentation for matplotlib.
df = pd.read_csv('distributions_short.csv',
names=list('ABCD'))
dfs = df.sort_values(by='A', ascending = True)
#plot column B against A
plt.figure(); dfs.plot(x = 'A', y = 'B');
#plot column B against A in log-log scale
plt.figure(); dfs.plot(x = 'A', y = 'B', loglog=True);
#plot columns B and C against A in loglog scale
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], loglog=True);
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], logy=True);
#Using MATLAB-style notation
plt.plot(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
plt.figure();plt.loglog(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
#scatter plot of columns A and B
plt.figure(); df.plot(kind ='scatter', x='A', y='B')
plt.figure(); plt.scatter(df.A, df.B)
#scatter plot of columns A and B in log-log scale
plt.figure(); df.plot(kind ='scatter', x='A', y='B', loglog = True)
#putting many scatter plots into the same plot
t = df.plot(kind='scatter', x='A', y='B', color='DarkBlue', label='B curve', loglog=True);
df.plot(kind='scatter', x='A', y='C',color='DarkGreen', label='C curve', ax=t, loglog = True);