In this tutorial we will go over the basics of pandas, the Python Data Analysis Library.
First, here are some libraries that we may want to use at various points in our examples.
NumPy is one of the most important libraries, as it enables us to manipulate data in the form of multidimensional arrays. It also provides operations such as random number generation.
The matplotlib library enables plotting using an interface similar to MATLAB.
#this command is useful for making the plots appear inside the notebook
%matplotlib inline
import pandas as pd
import pandas_datareader.data as web
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from datetime import datetime
Pandas has different data structures that we can use for manipulating different types of data. Here we will see Series (one-dimensional data structures) and DataFrames (two-dimensional data structures).
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
s = pd.Series(data, index=index)
Here, data can be many different things:
- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels.
#Series from list with index (row names)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
#s = pd.Series([1,2,3,4,5],index=['a', 'b', 'c', 'a', 'e'] )
s
#Series from list without index. Default index = [0,1,...]
pd.Series(np.random.randn(5))
#Series from dictionary: The keys are used as index, and the values as series values
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)
#Series with missing data
pd.Series(d, index=['b', 'c', 'd', 'a'])
#note that by setting the index we can re-arrange the series values
Note: NaN (not a number) is the standard missing data marker used in pandas
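Missing values can be inspected and handled with the standard pandas methods; a minimal sketch, reusing the reindexed series from above:

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(d, index=['b', 'c', 'd', 'a'])  # 'd' has no value, so it is NaN

print(s.isna())       # boolean mask: True where the value is NaN
print(s.dropna())     # drop the missing entries
print(s.fillna(0.0))  # replace missing entries with a default value
```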
#Series with a scalar
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
We can select parts of the series, and perform operations on series
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
#Selection from series
print(s.iloc[0]) #selection by position; s[0] is deprecated for a labeled index
print (s['a'])
print (s[:2])
print(s[3:5])
print(s.median())
print(s>s.median())
print(s[s>s.median()])
#Operations on Series
#print (s+s)
#print (2*s)
#print (s**2)
#Using s[predicate] we can select different entries in the data series
print(s[s>0])
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
- a dict of 1-D ndarrays, lists, dicts, or Series
- a 2-D numpy.ndarray
- a Series
- another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
print(df)
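To illustrate the point above, passing an explicit index (or columns) subsets the dict of Series, and anything not matching becomes NaN; a small sketch:

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Only rows 'd' and 'b' are kept; 'one' has no value at 'd', so that cell is NaN
print(pd.DataFrame(d, index=['d', 'b']))

# Columns work the same way: a column with no matching data is created as all-NaN
print(pd.DataFrame(d, index=['a', 'b'], columns=['two', 'three']))
```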
d = {'one' : [1., 2., 3., 4.],
'two' : [4., 3., 2., 1.]}
# 'two' : [4., 3., 2., 1.,7.]}
pd.DataFrame(d)
pd.DataFrame(d, index = ['a','b','c','d'])
pd.DataFrame.from_dict({'A': [1, 2, 3], 'B': [4, 5, 6]},
orient='index', columns=['one', 'two', 'three'])
#orient='index' makes the dictionary keys become rows; otherwise they would be columns
#(from_items was removed from pandas; from_dict is the replacement)
df2 = pd.DataFrame({ 'A' : 1.,
#'B' : pd.Timestamp('20130102'),
'B': pd.date_range('20151101', periods=4),
'C' : np.random.randn(4),
'D' : pd.Categorical(["test","train","test","train"]),
'E' : ["foo","bar","foo","bar"] }
,index = ['first','second','third','fourth']
)
df2
#df2.info()
tdf = pd.DataFrame([['a',1],['a',1],['b',2],['c',3],['c',2],['a',1]])
tdf
Loading a data frame from a CSV file
df = pd.read_csv('distributions_short.csv',
names=list('ABCD'))
df
Loading data from the web
More information on what types of data you can fetch (via the separate pandas-datareader package): http://pandas.pydata.org/pandas-docs/stable/remote_data.html
stocks = 'YELP'
ydf = web.get_data_yahoo(stocks)
# start=datetime.datetime(2011, 10, 1),
# end=datetime.datetime(2014, 12, 31))
ydf.info()
ydf.head()
Changing column names
new_column_names = [x.lower().replace(' ','_') for x in ydf.columns]
ydf.columns = new_column_names
ydf.info()
Saving data
Save a df into a csv:
df.to_csv('foo.csv')
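One gotcha worth noting: to_csv writes the index as the first column by default, so a naive round trip produces an extra "Unnamed: 0" column. A sketch of the two usual fixes:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3) * 2})

df.to_csv('foo.csv', index=False)   # option 1: do not write the index at all
df_back = pd.read_csv('foo.csv')
print(df_back.columns.tolist())     # just ['a', 'b'], no extra column

df.to_csv('foo.csv')                          # option 2: write the index ...
df_back = pd.read_csv('foo.csv', index_col=0) # ... and read it back as the index
```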
Indexing
The basics of indexing are as follows:
| Operation | Syntax | Result |
| --- | --- | --- |
| Select column | df[col] (df.col) | Series |
| Select columns | df[[col_list]] | DataFrame |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[5:10] | DataFrame |
df2['B']
df2.A
df2.loc['first']
#df2.ix['first'] #.ix is deprecated; use df2.loc['first'] (labels) or df2.iloc[0] (positions)
df2.iloc[1:3]
df2.iloc[1]
df2[['A','B']]
df2[['A','B']][1:3]
df2[1:3][['A','B']]
Accessing the data
df.columns #the names of the columns
df.values # all values without index and column names in numpy array
df2.info() # info about the data per column
df.head() # top lines
df.tail() # bottom lines
Iterating over rows
num_positive_days = 0
for idx, row in ydf.iterrows():
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days is {}.".format(num_positive_days))
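iterrows is slow on large frames; the same count can be done with a single vectorized boolean comparison. A sketch using a tiny stand-in frame (since ydf itself comes from a live web fetch):

```python
import pandas as pd

# A tiny stand-in for ydf with the same open/close columns
prices = pd.DataFrame({'open':  [10.0, 11.0, 12.0, 11.5],
                       'close': [11.0, 10.5, 12.5, 11.0]})

# Vectorized equivalent of the iterrows loop: sum of a boolean mask
num_positive_days = (prices.close > prices.open).sum()
print("The total number of positive-gain days is {}.".format(num_positive_days))  # 2
```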
Filtering rows
We can select a subset of interesting rows using a boolean predicate; this also returns a data frame.
df2.query('C>0')
df2[df2.C>0]
df[df>0.2]
df[df>0.2].A
df.A[df.D > .002]
tmp_high = ydf.high > 55
tmp_high.head()
#ydf[tmp_high]
positive_days = ydf[ydf.close > ydf.open]
positive_days.head()
very_positive_days = ydf[ydf.close-ydf.open > 4]
very_positive_days.head()
Transforming the data
#Transposing data
df.T
#Pivoting data:
df3 = pd.DataFrame({ 'A' : 1.,
#'B' : pd.Timestamp('20130102'),
'B': pd.date_range('20151101', periods=4),
'C' : np.random.randn(4),
'D' : pd.Categorical(["test","train","test","train"]),
'E' : ["foo","bar","foo","bar"] }
,index = ['first','second','third','fourth']
)
pd.pivot_table(df3, index=['B'], columns=['D'], values='C')
#the column values in B to become the rows,
#the values in D to become columns,
#the values in C become the cell values
#if more than one value maps to a cell, the values are aggregated by taking the mean
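The aggregation is not limited to the mean: the aggfunc parameter of pivot_table switches it. A sketch on a small hand-made frame with duplicate (index, column) pairs:

```python
import pandas as pd

df = pd.DataFrame({'B': ['x', 'x', 'x', 'y'],
                   'D': ['test', 'test', 'train', 'train'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# The two ('x', 'test') rows are aggregated; aggfunc='sum' gives 1 + 2 = 3
pt = pd.pivot_table(df, index=['B'], columns=['D'], values='C', aggfunc='sum')
print(pt)
```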
#Apply a function
df.apply(lambda x: 2*x)
#df.a.apply(lambda x: x+1)
#Histograms
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
s.value_counts()
np.histogram(df.A,10)
Creating new columns
The assign() method allows you to easily create new columns that are potentially derived from existing columns. assign always returns a copy of the data, leaving the original DataFrame untouched.
We can also simply create a new column as we would with a dictionary; in this case the column stays in the data frame, and it can be deleted with del, again as in a dictionary.
df = pd.DataFrame({'a':pd.Series(np.random.randn(5)),
'b':pd.Series(np.random.randn(5))})
#df['c'] = df['a']/df['b']
#df
#del df['c']
df3 = df.assign(c = df['a']/df['b'])
print(df)
print(df3)
df['c'] = df['a']/df['b']
print(df)
#remove the new column
df.drop('c', axis = 1) #axis=0 means rows, axis=1 means columns; drop returns a new frame and leaves df unchanged
ydf['profit'] = ydf.close - ydf.open
ydf.head()
for idx, row in ydf.iterrows():
    if row.close < row.open:
        ydf.loc[idx, 'gain'] = 'negative'
    elif (row.close - row.open) < 1:
        ydf.loc[idx, 'gain'] = 'small_gain'
    elif (row.close - row.open) < 6:
        ydf.loc[idx, 'gain'] = 'medium_gain'
    else:
        ydf.loc[idx, 'gain'] = 'large_gain'
ydf.head()
Define a function that classifies rows, and apply it to each row to create a new column
def label_row(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'
ydf['test_column'] = ydf.apply(label_row, axis = 1)
print(ydf.head())
ydf = ydf.drop('test_column', axis = 1)
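The same bucketing can be done without apply at all, using np.select on vectorized conditions. A sketch on a small stand-in frame (the thresholds of 1 and 6 follow the tutorial's buckets); np.select picks the label of the first condition that is true:

```python
import pandas as pd
import numpy as np

prices = pd.DataFrame({'open':  [10.0, 10.0, 10.0, 10.0],
                       'close': [ 9.0, 10.5, 13.0, 20.0]})

gain = prices.close - prices.open
conditions = [gain < 0, gain < 1, gain < 6]
labels = ['negative', 'small_gain', 'medium_gain']
prices['gain'] = np.select(conditions, labels, default='large_gain')
print(prices)
```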
Computing Statistics and other operations
print(df)
# unless the axis is specified as 1, all operations are on columns
print(df.sum())
print(df.mean())
print(df.mean(1)) #df.mean(axis = 1)
print(df.std())
print(df.var())
print(df.describe())
Join
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
dfm = pd.merge(left, right, on='key') #keeps only the common key 'foo'
print(dfm)
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
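merge also supports how='right' and how='outer'; the indicator flag is handy for seeing which side each row came from. A small sketch on the same two frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})

# how='outer' keeps keys from both sides; indicator=True adds a _merge column
# with values 'both', 'left_only', or 'right_only'
dfm = pd.merge(left, right, on='key', how='outer', indicator=True)
print(dfm)
```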
Group by
df3 = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
print(df3)
print(df3.groupby('A').sum())
# column B cannot be summed, so it is omitted.
print(df3.groupby('A').min())
print(df3.groupby(['A','B']).sum())
g = df3.groupby(['A','B']).sum()
print(type(g))
print(g)
print(g.loc[('bar','one')])
Group by produces a DataFrameGroupBy object that contains one data frame for each combination of values of the group-by attributes.
g = df3.groupby(['A','B'])
type(g)
for vals, data in g:
    print(vals)
    print(data)
    print('--------------')
print(g.get_group(('bar','one'))) #g is a DataFrameGroupBy, so .loc does not apply; use get_group
g = df3.groupby(['A','B'])
bo = g.get_group(('foo','two')) #selects a specific group: returns a data frame
bo[['C','D']]
g = df3[['A','B','C']].groupby(['A','B'])
list(g.get_group(('foo','two')).C)
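Beyond sum and min, the agg method lets each column get its own aggregation function. A sketch on a small frame of the same shape as df3:

```python
import pandas as pd

df3 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0],
                    'D': [10.0, 20.0, 30.0, 40.0]})

# Per-column aggregations: mean of C, sum of D, within each group of A
print(df3.groupby('A').agg({'C': 'mean', 'D': 'sum'}))
```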
df4 = pd.DataFrame({'A':np.random.randint(1,3,10),
'B': np.random.randint(1,5,10),
'C': np.random.randn(10)})
df4.sort_values(by='A')
df4.sort_values(by = ['A','B'], ascending = [True,False])
df4[['A','B']].drop_duplicates().sort_values(by = 'A') #drop_duplicates: remove duplicate rows from a data frame
l = list(df4.B.unique()) #unique: the distinct values of a series; faster than drop_duplicates
l.sort()
print(l)
Plotting in Python is done with the matplotlib library, whose matplotlib.pyplot module provides a MATLAB-like plotting interface. You can use it directly, or through the wrapper functions that come with data frames and series.
You can read more about matplotlib here, and about plotting with Pandas here.
Creating a Histogram
data = pd.Series(np.random.randn(1000))
plt.figure();plt.hist(data) #plt.figure() creates a new figure
plt.figure();plt.hist(data, bins=50)
# density plot
plt.figure();data.plot(kind = 'kde') #kde: kernel density estimate
plt.savefig('foo.jpg')
dfx= pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
#print(dfx)
#Compute the cumulative sum for each column
dfx = dfx.cumsum()
#plot all columns in the same plot
plt.figure(); dfx.plot();
#plot column C against column B in df2
plt.figure(); df2.plot(x = 'B', y = 'C');
plt.figure(); df2.plot(kind = 'bar', x = 'B', y = 'C'); #bar plot
df = pd.read_csv('distributions_short.csv',
names=list('ABCD'))
dfs = df.sort_values(by='A', ascending = True)
#plot column B against A
plt.figure(); dfs.plot(x = 'A', y = 'B');
#plot column B against A in log-log scale
plt.figure(); dfs.plot(x = 'A', y = 'B', loglog=True);
#plot columns B and C against A in loglog scale
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], loglog=True);
#Using MATLAB-style notation
plt.plot(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
plt.loglog(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
#scatter plot of columns A and B
plt.figure(); df.plot(kind ='scatter', x='A', y='B')
plt.figure(); plt.scatter(df.A, df.B)
#scatter plot of columns A and B in log-log scale
plt.figure(); df.plot(kind ='scatter', x='A', y='B', loglog = True)
#putting many scatter plots into the same plot
t = df.plot(kind='scatter', x='A', y='B', color='DarkBlue', label='B curve', loglog=True);
df.plot(kind='scatter', x='A', y='C',color='DarkGreen', label='C curve', ax=t, loglog = True);
df6 = pd.DataFrame({'x':np.random.randn(100),
'y':np.random.randn(100),
'v': np.random.randint(1,500,100)})
plt.figure();df6.plot(kind = 'scatter', x='x',y='y')
plt.figure();df6.plot(kind='hexbin', x='x', y='y', gridsize=10)
plt.figure();df6.plot(kind = 'scatter', x='x',y='y', s=df6['v'],c=df6['v'])
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
attr = 'Close'
dfstocks = web.get_data_yahoo(stocks,
start=datetime(2014, 1, 1),
end=datetime(2014, 12, 31))[attr]
dfstocks.head()
dfstocks.ORCL.plot(label = 'oracle')
dfstocks.TSLA.plot(label = 'tesla')
dfstocks.IBM.plot(label = 'ibm')
dfstocks.MSFT.plot(label = 'msft')
dfstocks.YELP.plot(label = 'yelp')
plt.legend(loc='best')
plt.scatter(dfstocks.TSLA, dfstocks.ORCL)
plt.xlabel('TESLA')
plt.ylabel('ORCL')
Correlations of columns
The correlation coefficient between variables $X$ and $Y$ is defined as follows: $$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$ In python we can compute the correlation coefficient of all pairs of columns with corr( )
corr = dfstocks.corr()
print(corr)
sns.heatmap(corr, annot=True)
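As a sanity check, the formula above can be computed by hand on synthetic data and compared with corr(). Using population moments (ddof=0) consistently in numerator and denominator, the normalization cancels and the result equals the Pearson coefficient:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = pd.Series(rng.randn(100))
y = 0.5 * x + pd.Series(rng.randn(100))  # y is positively correlated with x

# E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y), with ddof=0 throughout
manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(manual, x.corr(y))  # the two values agree
```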
from pandas.plotting import scatter_matrix #pandas.tools.plotting was removed in newer pandas
plt.figure(); scatter_matrix(dfstocks, alpha=0.2, figsize=(10, 10), diagonal='kde')