Introduction to Pandas and other libraries¶

(Many thanks to Evimaria Terzi and Mark Crovella for their code and examples)

Pandas¶

Pandas is the Python Data Analysis Library.

Pandas is an extremely versatile tool for manipulating datasets, mostly tabular data. You can think of Pandas as an evolution of Excel spreadsheets, with more capabilities for coding, and support for SQL-style operations such as joins and group-by.

It also produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.

You can find more details at https://pandas.pydata.org

Storing data tables¶

Most data can be viewed as tables or matrices (in the case where all entries are numeric). The rows correspond to objects and the columns correspond to the attributes or features.

There are different ways we can store such data tables in Python

Two-dimensional lists

In [1]:
D = [[0.3, 10, 1000],[0.5,2,509],[0.4, 8, 789]]
print(D)
[[0.3, 10, 1000], [0.5, 2, 509], [0.4, 8, 789]]
In [2]:
D = [[30000, 'Married', 1],[20000,'Single', 0],[45000, 'Married', 0]]
print(D)
[[30000, 'Married', 1], [20000, 'Single', 0], [45000, 'Married', 0]]

Numpy Arrays

NumPy is the Python library for numerical computation and matrix manipulation. It offers much of the functionality of Matlab, and also supports data analysis operations (similar to Pandas). Read more about NumPy here: http://www.numpy.org/

The array is the main data structure of NumPy. It stores multidimensional numeric tables.

We can create NumPy arrays from lists

In [3]:
import numpy as np

#1-dimensional array
x = np.array([2,5,18,14,4])
print ("\n Deterministic 1-dimensional array \n")
print (x)

#2-dimensional array
x = np.array([[2,5,18,14,4], [12,15,1,2,8]])
print ("\n Deterministic 2-dimensional array \n")
print (x)
 Deterministic 1-dimensional array 

[ 2  5 18 14  4]

 Deterministic 2-dimensional array 

[[ 2  5 18 14  4]
 [12 15  1  2  8]]

There are also NumPy operations that create arrays of different types

In [4]:
x = np.random.rand(5,5)
print ("\n Random 5x5 2-dimensional array \n")
print (x)

x = np.ones((4,4))
print ("\n 4x4 array with ones \n")
print (x)

x = np.diag([1,2,3])
print ("\n Diagonal matrix\n")
print(x)
 Random 5x5 2-dimensional array 

[[0.34871799 0.62071892 0.60618484 0.12913798 0.80754808]
 [0.68774236 0.12672021 0.3677389  0.95823523 0.19599754]
 [0.04738102 0.01502908 0.04402727 0.53070598 0.82008422]
 [0.65242762 0.81563413 0.55416071 0.86403159 0.77256431]
 [0.6733942  0.49771376 0.05805456 0.51350696 0.34186178]]

 4x4 array with ones 

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

 Diagonal matrix

[[1 0 0]
 [0 2 0]
 [0 0 3]]

Why do we need NumPy arrays? Because we can perform linear algebra operations on the numeric arrays

For example:

In [5]:
x = np.random.randint(10,size=(2,3))
print("\n Random 2x3 array with integers")
print(x)

#Matrix transpose
print ("\n Transpose of the matrix \n")
print (x.T)

#multiplication and addition with scalar value
print("\n Matrix 2x+1 \n") 
print(2*x+1)
 Random 2x3 array with integers
[[6 2 3]
 [7 6 3]]

 Transpose of the matrix 

[[6 7]
 [2 6]
 [3 3]]

 Matrix 2x+1 

[[13  5  7]
 [15 13  7]]
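
We can also compute matrix products. A small sketch using the 2x3 array x from above:

In [ ]:
#Matrix multiplication: x is 2x3, so x @ x.T is 2x2 and x.T @ x is 3x3
print(x @ x.T)
print(np.dot(x.T, x)) #np.dot is equivalent to the @ operator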

Transform back to list of lists

In [6]:
lx = [list(y) for y in x]
lx
Out[6]:
[[6, 2, 3], [7, 6, 3]]

Pandas data frames

A data frame is a table in which each row and each column is given a label, very similar to a spreadsheet or a SQL table.

Pandas DataFrames are documented at: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html

Pandas dataframes enable different data analysis operations

Creating Data Frames¶

A dataframe has names for the columns and the rows of the table. The column names are stored in the attribute columns, and the row names in the attribute index. When these are not specified, they default to the numbers 0,1,...

There are multiple ways we can create a data frame. Here we list just a few.

In [1]:
import pandas as pd #The pandas library
from pandas import Series, DataFrame #Main pandas data structures
In [2]:
#Creating a data frame from a list of lists

df = pd.DataFrame([[1,2,3],[9,10,12]])
print(df)

# Each list becomes a row
# Names of columns are 0,1,2
# Rows are indexed by position numbers 0,1
   0   1   2
0  1   2   3
1  9  10  12
In [9]:
#Creating a data frame from a numpy array

df = pd.DataFrame(np.array([[1,2,3],[9,10,12]]))
print(df)
   0   1   2
0  1   2   3
1  9  10  12
In [10]:
# Specifying column names
df = pd.DataFrame(np.array([[1,2,3],[9,10,12]]), columns=['A','B','C'])
print(df)
   A   B   C
0  1   2   3
1  9  10  12
In [4]:
#Creating a data frame from a dictionary
# Keys are column names, values are lists with column values

dfe = pd.DataFrame({'A':[1,2,3], 'B':['a','b','c']})
print(dfe)
   A  B
0  1  a
1  2  b
2  3  c
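
We can also specify the row labels at creation time via the index parameter. A small sketch (the labels r1, r2, r3 are arbitrary):

In [ ]:
#Creating a data frame with explicit row labels
df_labeled = pd.DataFrame({'A':[1,2,3], 'B':['a','b','c']}, index=['r1','r2','r3'])
print(df_labeled)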
In [12]:
# Reading from a csv file:
df = pd.read_csv('example.csv')
print(df)

# The first row of the file is used for the column names 
# The property columns gives us the column names
print(df.columns)
print(list(df.columns))
   NUMBER CHAR
0       1    a
1       2    b
2       3    c
Index(['NUMBER', 'CHAR'], dtype='object')
['NUMBER', 'CHAR']
In [13]:
# Reading from a csv file without header:
df = pd.read_csv('no-header.csv',header = None)
print(df)
   0  1
0  1  a
1  2  b
2  3  c
In [14]:
# Reading from an Excel file:
df = pd.read_excel('example.xlsx')
print(df)
   NUMBER CHAR
0       1    a
1       2    b
2       3    c
In [15]:
#Writing to a csv file:
df.to_csv('example2.csv')
for x in open('example2.csv').readlines():
    print(x.strip())
,NUMBER,CHAR
0,1,a
1,2,b
2,3,c
In [16]:
# By default the row index is added as a column; we can remove it by setting index=False
df.to_csv('example2.csv',index = False)
for x in open('example2.csv').readlines():
    print(x.strip())
NUMBER,CHAR
1,a
2,b
3,c

Fetching data

For demonstration purposes, we will use data from Tiingo on stock quotes. We will see two ways of fetching data from Tiingo, one using the Tiingo client and one using the Data Reader library of Pandas.

More information on what types of data you can fetch is at: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

We will use stock quotes from Tiingo here, and later from IEX (in the Joins section). To make use of these services you need to first create an account and obtain an API key. You then either pass the key to the client directly, or set the corresponding environment variable (e.g. TIINGO_API_KEY or IEX_API_KEY) to the value of the key, as shown below

In [2]:
from datetime import datetime #For handling dates
In [3]:
from tiingo import TiingoClient
client = TiingoClient({'api_key':'614c1590a592cc6696f6082f83b2666cd83882ef'})
start = datetime(2018,1,1)
end = datetime(2018,12,31)
stocks_data = client.get_dataframe('META',frequency='daily',startDate=start,endDate=end)
stocks_data = stocks_data[['open','close','low','high','volume']]
In [ ]:
#Alternative without the client
import os
import pandas_datareader.data as web # For accessing web data

stocks_data = web.get_data_tiingo('META', start, end, api_key=os.getenv('TIINGO_API_KEY'))

stocks_data = stocks_data.reset_index(level='symbol',drop=True)
stocks_data = stocks_data[['open','close','low','high','volume']]
In [4]:
# the method info() outputs basic information for our data frame
stocks_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2018-01-02 00:00:00+00:00 to 2018-12-31 00:00:00+00:00
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    251 non-null    float64
 1   close   251 non-null    float64
 2   low     251 non-null    float64
 3   high    251 non-null    float64
 4   volume  251 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 11.8 KB

The number of rows in the DataFrame:

In [5]:
len(stocks_data)
Out[5]:
251
In [6]:
#the method head() outputs the top rows of the data frame
stocks_data.head()
Out[6]:
open close low high volume
date
2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216
In [7]:
#the method tail() outputs the last rows of the data frame
stocks_data.tail()
Out[7]:
open close low high volume
date
2018-12-24 00:00:00+00:00 123.10 124.06 123.02 129.74 22066002
2018-12-26 00:00:00+00:00 126.00 134.18 125.89 134.24 39723370
2018-12-27 00:00:00+00:00 132.44 134.52 129.67 134.99 31202509
2018-12-28 00:00:00+00:00 135.34 133.20 132.20 135.92 22627569
2018-12-31 00:00:00+00:00 134.45 131.09 129.95 134.64 24625308

Note that date is the index of the rows, not a regular column. The index gives a name to each row. The default index is the numbers 0,...,len(df)-1; here we index the rows by the date

In [8]:
#trying to access the date column will give an error

stocks_data.date
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [8], line 3
      1 #trying to access the date column will give an error
----> 3 stocks_data.date

File C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'date'
In [29]:
stocks_data.to_csv('stocks_data.csv')
for x in open('stocks_data.csv').readlines()[0:10]:
    print(x.strip()) 
df = pd.read_csv('stocks_data.csv')
df.head()
date,open,close,low,high,volume
2018-01-02 00:00:00+00:00,177.68,181.42,177.55,181.58,17694891
2018-01-03 00:00:00+00:00,181.88,184.67,181.33,184.78,16595495
2018-01-04 00:00:00+00:00,184.9,184.33,184.1,186.21,13554357
2018-01-05 00:00:00+00:00,185.59,186.85,184.93,186.9,13042388
2018-01-08 00:00:00+00:00,187.2,188.28,186.33,188.9,14719216
2018-01-09 00:00:00+00:00,188.7,187.87,187.1,188.8,12342722
2018-01-10 00:00:00+00:00,186.94,187.84,185.63,187.89,10464528
2018-01-11 00:00:00+00:00,188.4,187.77,187.38,188.4,8855144
2018-01-12 00:00:00+00:00,178.06,179.37,177.4,181.48,76645626
Out[29]:
date open close low high volume
0 2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
1 2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2 2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
3 2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
4 2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216
In [7]:
len(df)
Out[7]:
251
In [13]:
#the method head() outputs the top rows of the data frame
df.head()
Out[13]:
date open close low high volume
0 2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
1 2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2 2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
3 2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
4 2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216

Note that in the new dataframe, there is now a date column, while the index values are numbers 0,1,...
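
If we wanted to keep the dates as the index when reading the file back, read_csv can do this directly. A sketch using the index_col and parse_dates parameters:

In [ ]:
#Re-reading the file with the date column as the index
pd.read_csv('stocks_data.csv', index_col='date', parse_dates=True).head()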

Working with data columns¶

The columns are the "features" in your data

In [29]:
#an object that refers to the names of the columns
df.columns
Out[29]:
Index(['date', 'open', 'high', 'low', 'close', 'volume'], dtype='object')

We can also assign a list to the columns property in order to change the attribute names.

Alternatively, you can change the name of an attribute using rename:

In [14]:
df = df.rename(columns = {'volume':'V'})
print(list(df.columns))
df.columns = ['date', 'open', 'high', 'low', 'close', 'vol']
df.head()
['date', 'open', 'close', 'low', 'high', 'V']
Out[14]:
date open high low close vol
0 2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
1 2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2 2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
3 2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
4 2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216

Selecting a single column from your data.

It is important to keep in mind that selecting a single column returns a Series (a one-dimensional labeled sequence of values), while selecting a list of columns (shown further below) returns a new data frame.

In [31]:
df['open'].head()
Out[31]:
0    177.68
1    181.88
2    184.90
3    185.59
4    187.20
Name: open, dtype: float64

Another way of selecting a single column from your data

In [32]:
df.open.head()
Out[32]:
0    177.68
1    181.88
2    184.90
3    185.59
4    187.20
Name: open, dtype: float64

Selecting multiple columns

In [33]:
df[['open','close']].head()
Out[33]:
open close
0 177.68 181.42
1 181.88 184.67
2 184.90 184.33
3 185.59 186.85
4 187.20 188.28

We can use the values property to obtain the values of one or more attributes. It returns a NumPy array. You can transform it into a list by applying the list() operator.

In [9]:
df.open.values[:10]
Out[9]:
array([177.68, 181.88, 184.9 , 185.59, 187.2 , 188.7 , 186.94, 188.4 ,
       178.06, 181.5 ])
In [12]:
type(df[['open','close']])
Out[12]:
pandas.core.frame.DataFrame
In [35]:
df[['open','close']].values[:10]
Out[35]:
array([[177.68, 181.42],
       [181.88, 184.67],
       [184.9 , 184.33],
       [185.59, 186.85],
       [187.2 , 188.28],
       [188.7 , 187.87],
       [186.94, 187.84],
       [188.4 , 187.77],
       [178.06, 179.37],
       [181.5 , 178.39]])

Data Frame methods¶

A DataFrame object has many useful methods.

In [15]:
df.mean() #produces the mean of the columns/features
C:\Users\tsap\AppData\Local\Temp\ipykernel_20824\3584231175.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df.mean() #produces the mean of the columns/features
Out[15]:
open     1.714729e+02
high     1.715110e+02
low      1.693031e+02
close    1.736133e+02
vol      2.743828e+07
dtype: float64

Note that date does not appear in the list. This is because it stores strings, for which the mean is not defined.

Recent versions of the pandas library raise a warning (as shown above); we should preselect the numeric attributes we want to compute the mean for

In [14]:
df[['open','high','low','close']].mean()
Out[14]:
open     171.454424
high     173.615298
low      169.303110
close    171.510936
dtype: float64
In [38]:
df[['open','high','low','close']].std() #produces the standard deviation of the columns/features
Out[38]:
open     19.683435
high     19.423837
low      20.074382
close    19.977448
dtype: float64
In [39]:
df[['open','high','low','close']].sem() #produces the standard error of the mean of the columns/features
Out[39]:
open     1.242407
high     1.226022
low      1.267084
close    1.260965
dtype: float64

Here is a manual way to compute a confidence interval for the mean. What you need is the sample size and the confidence level.

In [40]:
#confidence interval
import scipy.stats as stats
conf = 0.95
t = stats.t.ppf((1+conf)/2.0, len(df)-1)
low = df[['open','high','low','close']].mean() - t*df[['open','high','low','close']].sem()
high = df[['open','high','low','close']].mean() + t*df[['open','high','low','close']].sem()
pd.DataFrame({'CI lower end':low, 'CI higher end':high})
Out[40]:
CI lower end CI higher end
open 169.007505 173.901344
high 171.200650 176.029945
low 166.807590 171.798629
close 169.027467 173.994406
In [41]:
df[['open','high','low','close']].median() #produces the median of the columns/features
Out[41]:
open     174.89
high     176.98
low      172.83
close    174.70
dtype: float64
In [42]:
df.open.mean()
Out[42]:
171.4544243027888
In [43]:
#95%-confidence interval
(df.open.mean()-t*df.open.sem(), df.open.mean()+t*df.open.sem())
Out[43]:
(169.00750496130627, 173.90134364427132)
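
As a cross-check, scipy can compute the same interval directly. A sketch using stats.t.interval:

In [ ]:
#Same 95% confidence interval for 'open', computed by scipy
stats.t.interval(0.95, len(df)-1, loc=df.open.mean(), scale=df.open.sem())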

Use describe to get summary statistics for the numeric columns of the data

In [44]:
stocks_data.describe()
Out[44]:
open high low close volume
count 251.000000 251.000000 251.000000 251.000000 2.510000e+02
mean 171.454424 173.615298 169.303110 171.510936 2.768798e+07
std 19.683435 19.423837 20.074382 19.977448 1.922117e+07
min 123.100000 129.740000 123.020000 124.060000 9.588587e+06
25% 157.815000 160.745000 155.525000 157.915000 1.782839e+07
50% 174.890000 176.980000 172.830000 174.700000 2.186093e+07
75% 184.890000 186.450000 183.420000 185.270000 3.031384e+07
max 215.715000 218.620000 214.270000 217.500000 1.698037e+08

We can obtain the sum of the column entries using the sum operation

In [45]:
stocks_data.sum()
Out[45]:
open      4.303506e+04
high      4.357744e+04
low       4.249508e+04
close     4.304924e+04
volume    6.949682e+09
dtype: float64

The functions we have seen so far operate on columns. We can apply them to rows as well by specifying the axis of the operation.

axis = 0 refers to the index (the rows), and is the default behavior

axis = 1 refers to the columns

It can be confusing, but the axis is the one along which the operation is performed. Summing over axis 0 collapses the rows and yields one sum per column; summing over axis 1 collapses the columns and yields one sum per row.

In [16]:
df.sum(axis=1)
C:\Users\tsap\AppData\Local\Temp\ipykernel_23208\1459321664.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df.sum(axis=1)
Out[16]:
0      1.815262e+07
1      1.688730e+07
2      1.388164e+07
3      1.357528e+07
4      1.799548e+07
           ...     
246    2.206650e+07
247    3.972389e+07
248    3.120304e+07
249    2.262811e+07
250    2.462584e+07
Length: 251, dtype: float64
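
A tiny sketch to make the axis convention concrete:

In [ ]:
small = pd.DataFrame({'a':[1,2],'b':[10,20]})
print(small.sum(axis=0)) #collapses the rows: one sum per column (a=3, b=30)
print(small.sum(axis=1)) #collapses the columns: one sum per row (11, 22)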

Sorting: You can sort by a specific column, ascending (default) or descending. You can also sort inplace.

In [16]:
#when inplace is False (the default) it returns a new dataframe that is sorted.
#when it is True it does not return anything, just changes the dataframe.
stocks_data.sort_values(by = 'open', ascending =False, inplace=False).head()
Out[16]:
open close low high volume
date
2018-07-25 00:00:00+00:00 215.72 217.50 214.27 218.62 64592585
2018-07-24 00:00:00+00:00 215.11 214.67 212.60 216.20 28468681
2018-07-23 00:00:00+00:00 210.58 210.91 208.80 211.62 16731969
2018-07-18 00:00:00+00:00 209.82 209.36 208.44 210.99 15334907
2018-07-20 00:00:00+00:00 208.85 209.94 208.50 211.50 16241508
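
We can also sort on several columns at once, or sort by the index. A sketch (sort_index orders the rows by their labels, here the dates):

In [ ]:
#Sort by 'open' descending, breaking ties by 'close' ascending
stocks_data.sort_values(by=['open','close'], ascending=[False,True]).head()
#stocks_data.sort_index() would order the rows by date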

Bulk Operations¶

Methods like sum( ) and std( ) work on entire columns.

We can run our own functions across all values in a column (or row) using apply( ).

In [17]:
df.date.head()
Out[17]:
0    2018-01-02 00:00:00+00:00
1    2018-01-03 00:00:00+00:00
2    2018-01-04 00:00:00+00:00
3    2018-01-05 00:00:00+00:00
4    2018-01-08 00:00:00+00:00
Name: date, dtype: object

The values property of the column returns an array with the values of the column. Inspecting the first value reveals that these are strings with a particular format.

In [18]:
first_date = df.date.values[0]
first_date
#returns a string
Out[18]:
'2018-01-02 00:00:00+00:00'

The datetime library handles dates. The method strptime transforms a string into a date, according to a format given as a parameter.

In [25]:
datetime.strptime(first_date, "%Y-%m-%d %H:%M:%S%z")
Out[25]:
datetime.datetime(2018, 1, 2, 0, 0, tzinfo=datetime.timezone.utc)

We will now make use of two operations:

The apply method applies a function, given as input, to all the entries of a data frame. In the case below we apply it to just one column.

A lambda function lets us define an anonymous function that takes some parameters (here, d) and uses them to compute an expression.

Using a lambda with apply, we can apply the function to all the entries of the data frame (in this case, the values of one column).

In [30]:
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S%z").strftime("%Y-%m-%d"))
date_series = df.date # We want to keep the dates
df.date.head()

#Another way to do the same thing, by applying the function to every row (axis = 1)
#df.date = df.apply(lambda row: datetime.strptime(row.date, "%Y-%m-%d %H:%M:%S%z").strftime("%Y-%m-%d"), axis=1)
Out[30]:
0    2018-01-02
1    2018-01-03
2    2018-01-04
3    2018-01-05
4    2018-01-08
Name: date, dtype: object
In [31]:
df.date.head()
Out[31]:
0    2018-01-02
1    2018-01-03
2    2018-01-04
3    2018-01-05
4    2018-01-08
Name: date, dtype: object
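
For date parsing specifically, pandas also provides pd.to_datetime, which converts a whole column in a single vectorized call. A sketch:

In [ ]:
#Parse the (already reformatted) date strings into Timestamps in one call
pd.to_datetime(df.date).head()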

For example, we can obtain the integer part of the open value

In [17]:
df.apply(lambda r: int(r.open), axis=1)
Out[17]:
0      177
1      181
2      184
3      185
4      187
      ... 
246    123
247    126
248    132
249    135
250    134
Length: 251, dtype: int64
In [54]:
dftest = df['open']
dftest.apply(lambda x: int(x))
Out[54]:
0      177
1      181
2      184
3      185
4      187
      ... 
246    123
247    126
248    132
249    135
250    134
Name: open, Length: 251, dtype: int64

Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

The row indices so far have been auto-generated by pandas, and are simply integers starting from 0.

From now on we will use dates instead of integers for indices -- the benefits of this will show later.

Overwriting the index is as easy as assigning to the index property of the DataFrame.

In [21]:
list(df.index)[0:10]
Out[21]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [22]:
df.index = df.date
df.head()
Out[22]:
date open high low close volume
date
2018-01-02 2018-01-02 177.68 181.58 177.5500 181.42 18151903
2018-01-03 2018-01-03 181.88 184.78 181.3300 184.67 16886563
2018-01-04 2018-01-04 184.90 186.21 184.0996 184.33 13880896
2018-01-05 2018-01-05 185.59 186.90 184.9300 186.85 13574535
2018-01-08 2018-01-08 187.20 188.90 186.3300 188.28 17994726

Another example, using the simple dataframe dfe we created earlier from a dictionary

In [57]:
dfe
Out[57]:
A B
0 1 a
1 2 b
2 3 c
In [58]:
dfe.index = dfe.B
In [59]:
dfe
Out[59]:
A B
B
a 1 a
b 2 b
c 3 c

Now that we have made an index based on date, we can drop the original date column.

In [23]:
df = df.drop(columns = ['date']) #Equivalent to df = df.drop(['date'], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2018-01-02 to 2018-12-31
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    251 non-null    float64
 1   high    251 non-null    float64
 2   low     251 non-null    float64
 3   close   251 non-null    float64
 4   volume  251 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 11.8 KB
In [61]:
#axis = 0 refers to dropping labels from rows (or you can use index = labels as a parameter). 
# Essentially we are dropping a set of rows
#axis = 1 refers to dropping labels from columns.
dfe.drop(index='b')
Out[61]:
A B
B
a 1 a
c 3 c

Accessing rows of the DataFrame¶

So far we've seen how to access a column of the DataFrame. To access a row we use a different notation.

To access a row by its index value, use the .loc[] indexer.

In [62]:
df.loc[datetime(2018,5,7)]
Out[62]:
open          177.35
high          179.50
low           177.17
close         177.97
vol      18697195.00
Name: 2018-05-07 00:00:00, dtype: float64

To access a row by its sequence number (i.e., like an array index), use .iloc[] ('integer location')

In [63]:
df.iloc[10:20] #dataframe with rows 10 through 19 (the right endpoint is excluded)
Out[63]:
open high low close vol
date
2018-01-17 179.26 179.32 175.8000 177.60 27992376
2018-01-18 178.13 180.98 177.0800 179.80 23304901
2018-01-19 180.85 182.37 180.1702 181.29 26826540
2018-01-22 180.80 185.39 180.4100 185.37 21059464
2018-01-23 186.05 189.55 185.5500 189.35 25678781
2018-01-24 189.89 190.66 186.5200 186.55 24334548
2018-01-25 187.95 188.62 186.6000 187.48 17377740
2018-01-26 187.75 190.00 186.8100 190.00 17759212
2018-01-29 188.75 188.84 185.6301 185.98 20453172
2018-01-30 183.01 188.18 181.8400 187.12 20858556
In [64]:
df.iloc[0:2,[1,3]] #dataframe with rows 0:2, and the second and fourth columns
Out[64]:
high close
date
2018-01-02 181.58 181.42
2018-01-03 184.78 184.67
In [65]:
#select rows and columns
df[['high','close']].iloc[0:2]
Out[65]:
high close
date
2018-01-02 181.58 181.42
2018-01-03 184.78 184.67

To iterate over the rows, use .iterrows()¶

In [66]:
num_positive_days = 0
for idx, row in df.iterrows(): #returns the index name and the row
    if row.close > row.open:
        num_positive_days += 1
        
print("The total number of positive-gain days is {}.".format(num_positive_days))
The total number of positive-gain days is 130.

You can also do it this way:

In [67]:
num_positive_days = 0
for i in range(len(df)):
    row = df.iloc[i]
    if row.close > row.open:
        num_positive_days += 1
        
print("The total number of positive-gain days is {}.".format(num_positive_days))
The total number of positive-gain days is 130.

Or this way:

In [68]:
pos_days = [idx for (idx,row) in df.iterrows() if row.close > row.open]
print("The total number of positive-gain days is "+str(len(pos_days)))
The total number of positive-gain days is 130
In [69]:
pos_days[0:10]
Out[69]:
[Timestamp('2018-01-02 00:00:00'),
 Timestamp('2018-01-03 00:00:00'),
 Timestamp('2018-01-05 00:00:00'),
 Timestamp('2018-01-08 00:00:00'),
 Timestamp('2018-01-10 00:00:00'),
 Timestamp('2018-01-12 00:00:00'),
 Timestamp('2018-01-18 00:00:00'),
 Timestamp('2018-01-19 00:00:00'),
 Timestamp('2018-01-22 00:00:00'),
 Timestamp('2018-01-23 00:00:00')]
In [70]:
df.loc[pos_days]
Out[70]:
open high low close vol
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903
2018-01-03 181.88 184.78 181.3300 184.67 16886563
2018-01-05 185.59 186.90 184.9300 186.85 13574535
2018-01-08 187.20 188.90 186.3300 188.28 17994726
2018-01-10 186.94 187.89 185.6300 187.84 10529894
... ... ... ... ... ...
2018-12-18 141.08 145.93 139.8301 143.66 24709084
2018-12-20 130.70 135.57 130.0000 133.40 40297944
2018-12-24 123.10 129.74 123.0200 124.06 22066002
2018-12-26 126.00 134.24 125.8900 134.18 39723370
2018-12-27 132.44 134.99 129.6700 134.52 31202509

130 rows × 5 columns

In [71]:
#This will iterate over the column names:
for x in df:
    print(x)
open
high
low
close
vol
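
For simple counts like the one above, a vectorized comparison avoids the Python-level loop entirely. A sketch:

In [ ]:
#Comparing two columns yields a boolean Series; summing it counts the True entries
(df.close > df.open).sum()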

Filtering¶

It is very easy to select interesting rows from the data.

All these operations below return a new DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.

We can perform boolean operations on the columns. The result is a boolean Series, with one value per row

In [72]:
tmp_high = df.high > 170
tmp_high.head()
Out[72]:
date
2018-01-02    True
2018-01-03    True
2018-01-04    True
2018-01-05    True
2018-01-08    True
Name: high, dtype: bool

Summing a Boolean array is the same as counting the number of True values.

In [73]:
sum(tmp_high)
Out[73]:
149

We can use the boolean Series to select the rows with a True value. The operation below returns only the rows of df for which tmp_high is True

In [74]:
df[tmp_high].head()
Out[74]:
open high low close vol
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903
2018-01-03 181.88 184.78 181.3300 184.67 16886563
2018-01-04 184.90 186.21 184.0996 184.33 13880896
2018-01-05 185.59 186.90 184.9300 186.85 13574535
2018-01-08 187.20 188.90 186.3300 188.28 17994726

Putting it all together, we have the following commonly-used patterns:

In [75]:
positive_days = df[df.close > df.open]
positive_days.head()
Out[75]:
open high low close vol
date
2018-01-02 177.68 181.58 177.55 181.42 18151903
2018-01-03 181.88 184.78 181.33 184.67 16886563
2018-01-05 185.59 186.90 184.93 186.85 13574535
2018-01-08 187.20 188.90 186.33 188.28 17994726
2018-01-10 186.94 187.89 185.63 187.84 10529894
In [76]:
very_positive_days = df[df.close-df.open > 5]
very_positive_days.head()
Out[76]:
open high low close vol
date
2018-02-06 178.57 185.77 177.7400 185.31 37758505
2018-02-14 173.45 179.81 173.2119 179.52 28929704
2018-04-10 157.93 165.98 157.0100 165.04 58947041
2018-07-17 204.90 210.46 204.8400 209.99 15349892
2018-08-02 170.68 176.79 170.2700 176.37 32399954

We can form more complex boolean expressions. The and is &, the or is |, the not is ~, and you also need the parentheses around each comparison to make it work

In [24]:
df[(df.high<170)&(df.low>80)]
Out[24]:
open high low close volume
date
2018-03-23 165.44 167.10 159.02 159.39 53609706
2018-03-26 160.82 161.10 149.02 160.06 126116634
2018-03-27 156.31 162.85 150.75 152.22 79116995
2018-03-28 151.65 155.88 150.80 153.03 60029170
2018-03-29 155.15 161.42 154.14 159.79 59434293
... ... ... ... ... ...
2018-12-24 123.10 129.74 123.02 124.06 22066002
2018-12-26 126.00 134.24 125.89 134.18 39723370
2018-12-27 132.44 134.99 129.67 134.52 31202509
2018-12-28 135.34 135.92 132.20 133.20 22627569
2018-12-31 134.45 134.64 129.95 131.09 24625308

102 rows × 5 columns

Using the Python keyword and instead of & raises an error:

In [78]:
df[(df.high<170)and(df.low>80)]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15712/807571659.py in <module>
----> 1 df[(df.high<170)and(df.low>80)]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1535     @final
   1536     def __nonzero__(self):
-> 1537         raise ValueError(
   1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In [27]:
#The same filter, applied in two steps
temp_df = df[df.high<170]
temp_df = temp_df[temp_df.low > 80]
temp_df
Out[27]:
open high low close volume
date
2018-03-23 165.44 167.10 159.02 159.39 53609706
2018-03-26 160.82 161.10 149.02 160.06 126116634
2018-03-27 156.31 162.85 150.75 152.22 79116995
2018-03-28 151.65 155.88 150.80 153.03 60029170
2018-03-29 155.15 161.42 154.14 159.79 59434293
... ... ... ... ... ...
2018-12-24 123.10 129.74 123.02 124.06 22066002
2018-12-26 126.00 134.24 125.89 134.18 39723370
2018-12-27 132.44 134.99 129.67 134.52 31202509
2018-12-28 135.34 135.92 132.20 133.20 22627569
2018-12-31 134.45 134.64 129.95 131.09 24625308

102 rows × 5 columns
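
The query method offers an alternative syntax for the same filter, where and/or work as keywords. A sketch:

In [ ]:
#Equivalent filter expressed as a query string
df.query('high < 170 and low > 80').head()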

Creating new columns¶

To create a new column, simply assign values to it. Think of the columns as a dictionary:

In [79]:
df['profit'] = (df.close - df.open)
df.head()
Out[79]:
open high low close vol profit
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 3.74
2018-01-03 181.88 184.78 181.3300 184.67 16886563 2.79
2018-01-04 184.90 186.21 184.0996 184.33 13880896 -0.57
2018-01-05 185.59 186.90 184.9300 186.85 13574535 1.26
2018-01-08 187.20 188.90 186.3300 188.28 17994726 1.08
In [80]:
df.profit[df.profit>0].describe()
Out[80]:
count    130.000000
mean       2.193566
std        1.783093
min        0.020000
25%        0.720000
50%        1.630000
75%        3.280000
max        8.180000
Name: profit, dtype: float64
In [29]:
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx,'gain']='negative'
    elif (row.close - row.open) < 1:
        df.loc[idx,'gain']='small_gain'
    elif (row.close - row.open) < 3:
        df.loc[idx,'gain']='medium_gain'
    else:
        df.loc[idx,'gain']='large_gain'
df.head()
Out[29]:
open high low close volume test_column gain
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 large_gain large_gain
2018-01-03 181.88 184.78 181.3300 184.67 16886563 medium_gain medium_gain
2018-01-04 184.90 186.21 184.0996 184.33 13880896 negative negative
2018-01-05 185.59 186.90 184.9300 186.85 13574535 medium_gain medium_gain
2018-01-08 187.20 188.90 186.3300 188.28 17994726 medium_gain medium_gain

Here is another, more "functional", way to accomplish the same thing.

Define a function that classifies rows, and apply it to each row.

In [28]:
def gainrow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 3:
        return 'medium_gain'
    else:
        return 'large_gain'

df['test_column'] = df.apply(gainrow, axis = 1)
#axis = 1 applies the function to each row; axis = 0 would apply it to each column
In [83]:
df.head()
Out[83]:
open high low close vol profit gain test_column
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 3.74 large_gain large_gain
2018-01-03 181.88 184.78 181.3300 184.67 16886563 2.79 medium_gain medium_gain
2018-01-04 184.90 186.21 184.0996 184.33 13880896 -0.57 negative negative
2018-01-05 185.59 186.90 184.9300 186.85 13574535 1.26 medium_gain medium_gain
2018-01-08 187.20 188.90 186.3300 188.28 17994726 1.08 medium_gain medium_gain

OK, point made, let's get rid of that extraneous test_column:

In [84]:
df = df.drop('test_column', axis = 1)
df.head()
Out[84]:
open high low close vol profit gain
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 3.74 large_gain
2018-01-03 181.88 184.78 181.3300 184.67 16886563 2.79 medium_gain
2018-01-04 184.90 186.21 184.0996 184.33 13880896 -0.57 negative
2018-01-05 185.59 186.90 184.9300 186.85 13574535 1.26 medium_gain
2018-01-08 187.20 188.90 186.3300 188.28 17994726 1.08 medium_gain
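
If performance matters, here is a vectorized sketch of the same classification using np.select (conditions are evaluated in order and the first match wins; rows matching none get the default):

In [ ]:
conditions = [df.close < df.open,
              df.close - df.open < 1,
              df.close - df.open < 3]
choices = ['negative', 'small_gain', 'medium_gain']
#Wrap the resulting array in a Series aligned with df's index
pd.Series(np.select(conditions, choices, default='large_gain'), index=df.index).head()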

Missing values¶

Data often has missing values. In Pandas these are denoted as NaN values. These may be part of our data (e.g. empty cells in an Excel sheet), or they may appear as a result of a join. There are special methods for handling these values.

In [85]:
mdf = pd.read_csv('example-missing.csv')
mdf
Out[85]:
A B C
0 1.0 a x
1 5.0 b NaN
2 3.0 c y
3 9.0 NaN z
4 NaN a x
In [86]:
mdf.A.mean()
Out[86]:
4.5

Note that mean() ignored the missing value in column A. We can fill in the missing values using the fillna method

In [87]:
mdf.fillna(0)
Out[87]:
A B C
0 1.0 a x
1 5.0 b 0
2 3.0 c y
3 9.0 0 z
4 0.0 a x
In [88]:
mdf.A = mdf.A.fillna(mdf.A.mean())
mdf = mdf.fillna('')
mdf
Out[88]:
A B C
0 1.0 a x
1 5.0 b
2 3.0 c y
3 9.0 z
4 4.5 a x

We can drop the rows with missing values

In [89]:
mdf = pd.read_csv('example-missing.csv')
mdf.dropna()
Out[89]:
A B C
0 1.0 a x
2 3.0 c y

We can find those rows

In [90]:
mdf[mdf.B.isnull()]
Out[90]:
A B C
3 9.0 NaN z
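
We can also count the missing values per column. A sketch:

In [ ]:
#isnull() marks missing entries; summing counts them per column
mdf.isnull().sum()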

Grouping¶

An extremely powerful DataFrame method is groupby().

This is entirely analogous to GROUP BY in SQL.

It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.

Here we will look at the average gain among the categories of gains (negative, small, medium and large) we defined above and stored in column gain.

In [91]:
gain_groups = df.groupby('gain')
In [92]:
type(gain_groups)
Out[92]:
pandas.core.groupby.generic.DataFrameGroupBy

Essentially, gain_groups behaves like a dictionary

  • The keys are the unique values found in the gain column, and
  • The values are DataFrames that contain only the rows having the corresponding unique values.
In [93]:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')
large_gain
              open    high     low   close       vol  profit        gain
date                                                                    
2018-01-02  177.68  181.58  177.55  181.42  18151903    3.74  large_gain
2018-01-22  180.80  185.39  180.41  185.37  21059464    4.57  large_gain
2018-01-23  186.05  189.55  185.55  189.35  25678781    3.30  large_gain
2018-01-30  183.01  188.18  181.84  187.12  20858556    4.11  large_gain
2018-02-01  188.22  195.32  187.89  193.09  54211293    4.87  large_gain
=============================
medium_gain
              open    high     low   close       vol  profit         gain
date                                                                     
2018-01-03  181.88  184.78  181.33  184.67  16886563    2.79  medium_gain
2018-01-05  185.59  186.90  184.93  186.85  13574535    1.26  medium_gain
2018-01-08  187.20  188.90  186.33  188.28  17994726    1.08  medium_gain
2018-01-12  178.06  181.48  177.40  179.37  77551299    1.31  medium_gain
2018-01-18  178.13  180.98  177.08  179.80  23304901    1.67  medium_gain
=============================
negative
              open    high       low   close       vol  profit      gain
date                                                                    
2018-01-04  184.90  186.21  184.0996  184.33  13880896   -0.57  negative
2018-01-09  188.70  188.80  187.1000  187.87  12393057   -0.83  negative
2018-01-11  188.40  188.40  187.3800  187.77   9588587   -0.63  negative
2018-01-16  181.50  181.75  178.0400  178.39  36183842   -3.11  negative
2018-01-17  179.26  179.32  175.8000  177.60  27992376   -1.66  negative
=============================
small_gain
              open    high       low   close       vol  profit        gain
date                                                                      
2018-01-10  186.94  187.89  185.6300  187.84  10529894    0.90  small_gain
2018-01-19  180.85  182.37  180.1702  181.29  26826540    0.44  small_gain
2018-02-20  175.77  177.95  175.1100  176.01  21204921    0.24  small_gain
2018-02-22  178.70  180.21  177.4100  178.99  18464192    0.29  small_gain
2018-02-26  184.58  185.66  183.2228  184.93  17599703    0.35  small_gain
=============================

We can obtain the dataframe that corresponds to a specific group by using the get_group method of the groupby object

In [94]:
sm = gain_groups.get_group('small_gain')
sm.head()
Out[94]:
open high low close vol profit gain
date
2018-01-10 186.94 187.89 185.6300 187.84 10529894 0.90 small_gain
2018-01-19 180.85 182.37 180.1702 181.29 26826540 0.44 small_gain
2018-02-20 175.77 177.95 175.1100 176.01 21204921 0.24 small_gain
2018-02-22 178.70 180.21 177.4100 178.99 18464192 0.29 small_gain
2018-02-26 184.58 185.66 183.2228 184.93 17599703 0.35 small_gain
In [95]:
for gain, gain_data in df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain,
                                                    gain_data.close.mean()))
The average closing value for the large_gain group is 174.99081081081084
The average closing value for the medium_gain group is 174.18557692307695
The average closing value for the negative group is 169.2336363636363
The average closing value for the small_gain group is 171.6991463414634

The operation above can be done with a typical SQL-like group by, where we group by one or more attributes and aggregate the values of (some of) the other attributes.

For example group by "gain" and take the average of the values for open, high, low, close, volume.

You can also use other aggregators such as count, sum, median, max, min.

Pandas returns a new dataframe indexed by the values of the group-by attribute(s), with the other attributes as columns

In [96]:
gdf= df[['open','low','high','close','vol','gain']].groupby('gain').mean()
type(gdf)
Out[96]:
pandas.core.frame.DataFrame
In [97]:
gdf
Out[97]:
open low high close vol
gain
large_gain 170.459459 169.941454 175.660722 174.990811 3.034571e+07
medium_gain 172.305504 171.410923 175.321108 174.185577 2.795407e+07
negative 171.473133 168.024464 172.441342 169.233636 2.771124e+07
small_gain 171.217688 169.827283 173.070561 171.699146 2.488339e+07

We can also apply a custom aggregation function using the agg method and a lambda function. This is also useful for processing string attributes. For example, if we want the aggregation function to be max minus min, we can do the following:

In [ ]:
gdf2 = df[['open','low','high','close','vol','gain']].groupby('gain').agg(lambda x: x.max() - x.min())
gdf2

If you want to remove the (hierarchical) index and have the group-by attribute(s) be part of the table, you can use the reset_index method. This method makes the index one more column in the dataframe and reverts to the default integer index

In [98]:
#This can be used to remove the hierarchical index, if necessary
gdf = gdf.reset_index()
gdf
Out[98]:
gain open low high close vol
0 large_gain 170.459459 169.941454 175.660722 174.990811 3.034571e+07
1 medium_gain 172.305504 171.410923 175.321108 174.185577 2.795407e+07
2 negative 171.473133 168.024464 172.441342 169.233636 2.771124e+07
3 small_gain 171.217688 169.827283 173.070561 171.699146 2.488339e+07
In [99]:
gdf.set_index('gain')
Out[99]:
open low high close vol
gain
large_gain 170.459459 169.941454 175.660722 174.990811 3.034571e+07
medium_gain 172.305504 171.410923 175.321108 174.185577 2.795407e+07
negative 171.473133 168.024464 172.441342 169.233636 2.771124e+07
small_gain 171.217688 169.827283 173.070561 171.699146 2.488339e+07

Another example:

In [100]:
test = pd.DataFrame({'A':[1,2,3,4],'B':['a','b','b','a'],'C':['a','a','b','a']})
test
Out[100]:
A B C
0 1 a a
1 2 b a
2 3 b b
3 4 a a
In [101]:
gtest = test.groupby(['B','C']).mean()
gtest
#note that in this case we get a hierarchical index
Out[101]:
A
B C
a a 2.5
b a 2.0
b 3.0
In [102]:
gtest = gtest.reset_index()
gtest
#the hierarchical index is flattened out
Out[102]:
B C A
0 a a 2.5
1 b a 2.0
2 b b 3.0
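
Alternatively, passing as_index=False to groupby keeps the grouping columns as regular columns from the start, avoiding the hierarchical index altogether. A sketch:

In [ ]:
#Same aggregation, but B and C stay as columns
test.groupby(['B','C'], as_index=False).mean()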

Joins¶

We can join data frames in much the same way as we do joins in SQL

In [103]:
data_source = 'iex'
start = datetime(2018,1,1)
end = datetime(2018,12,31)

dfb = web.DataReader('FB', data_source, start, end)
dgoog = web.DataReader('GOOG', data_source, start, end)

print(dfb.head())
print(dgoog.head())
              open    high       low   close    volume
date                                                  
2018-01-02  177.68  181.58  177.5500  181.42  18151903
2018-01-03  181.88  184.78  181.3300  184.67  16886563
2018-01-04  184.90  186.21  184.0996  184.33  13880896
2018-01-05  185.59  186.90  184.9300  186.85  13574535
2018-01-08  187.20  188.90  186.3300  188.28  17994726
               open       high        low    close      volume
date                                                          
2018-01-02  52.4170  53.347000  52.261500  53.2500  24751280.0
2018-01-03  53.2155  54.314500  53.160500  54.1240  28603400.0
2018-01-04  54.4000  54.678495  54.200085  54.3200  20092100.0
2018-01-05  54.7000  55.212500  54.600000  55.1115  25582460.0
2018-01-08  55.1115  55.563500  55.081000  55.3470  20952060.0

Perform a join on the date (the index value). Note the _x and _y suffixes in the column names, which differentiate columns with the same name coming from the left (_x) and the right (_y) dataframes

In [104]:
common_dates = pd.merge(dfb,dgoog,on='date')
common_dates.head()
Out[104]:
open_x high_x low_x close_x volume_x open_y high_y low_y close_y volume_y
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 52.4170 53.347000 52.261500 53.2500 24751280.0
2018-01-03 181.88 184.78 181.3300 184.67 16886563 53.2155 54.314500 53.160500 54.1240 28603400.0
2018-01-04 184.90 186.21 184.0996 184.33 13880896 54.4000 54.678495 54.200085 54.3200 20092100.0
2018-01-05 185.59 186.90 184.9300 186.85 13574535 54.7000 55.212500 54.600000 55.1115 25582460.0
2018-01-08 187.20 188.90 186.3300 188.28 17994726 55.1115 55.563500 55.081000 55.3470 20952060.0

We can specify the suffixes for the left and right tables

In [105]:
common_dates = pd.merge(dfb,dgoog,on='date',suffixes=('_fb', '_goog'))
common_dates.head()
Out[105]:
open_fb high_fb low_fb close_fb volume_fb open_goog high_goog low_goog close_goog volume_goog
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 52.4170 53.347000 52.261500 53.2500 24751280.0
2018-01-03 181.88 184.78 181.3300 184.67 16886563 53.2155 54.314500 53.160500 54.1240 28603400.0
2018-01-04 184.90 186.21 184.0996 184.33 13880896 54.4000 54.678495 54.200085 54.3200 20092100.0
2018-01-05 185.59 186.90 184.9300 186.85 13574535 54.7000 55.212500 54.600000 55.1115 25582460.0
2018-01-08 187.20 188.90 186.3300 188.28 17994726 55.1115 55.563500 55.081000 55.3470 20952060.0

Compute gain and perform join on the date AND gain.

In [106]:
dfb['gain'] = dfb.apply(gainrow, axis = 1)
dgoog['gain'] = dgoog.apply(gainrow, axis = 1)
dfb['profit'] = dfb.close-dfb.open
dgoog['profit'] = dgoog.close-dgoog.open
In [107]:
common_gain_dates = pd.merge(dfb, dgoog, on=['date','gain'],suffixes=('_fb', '_goog'))
common_gain_dates.head()
Out[107]:
open_fb high_fb low_fb close_fb volume_fb gain profit_fb open_goog high_goog low_goog close_goog volume_goog profit_goog
date
2018-01-04 184.90 186.21 184.0996 184.33 13880896 negative -0.57 54.4000 54.678495 54.200085 54.3200 20092100.0 -0.0800
2018-01-09 188.70 188.80 187.1000 187.87 12393057 negative -0.83 55.4700 55.528500 55.061535 55.3130 18050820.0 -0.1570
2018-01-10 186.94 187.89 185.6300 187.84 10529894 small_gain 0.90 54.8550 55.230000 54.805500 55.1305 20855860.0 0.2755
2018-01-11 188.40 188.40 187.3800 187.77 9588587 negative -0.63 55.3150 55.326250 54.979500 55.2760 19565840.0 -0.0390
2018-01-16 181.50 181.75 178.0400 178.39 36183842 negative -3.11 56.6255 56.995500 55.891580 56.0880 31505220.0 -0.5375

More join examples, including left outer join

In [108]:
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
print('\n')
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
print('\n')
dfm = pd.merge(left, right, on='key') #keeps only the common key 'foo'
print(dfm)
   key  lval
0  foo     1
1  foo     2
2  boo     3


   key  rval
0  foo     4
1  hoo     5


   key  lval  rval
0  foo     1     4
1  foo     2     4

Left outer join

In [109]:
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
print('\n')
dfm = dfm.fillna(0) #fills the NaN values with specified value
dfm
   key  lval  rval
0  foo     1   4.0
1  foo     2   4.0
2  boo     3   NaN


Out[109]:
key lval rval
0 foo 1 4.0
1 foo 2 4.0
2 boo 3 0.0
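
Other join types work the same way: how='outer' keeps the keys from both sides, and how='right' keeps the keys from the right table. A sketch:

In [ ]:
#Full outer join: 'boo' gets NaN for rval, 'hoo' gets NaN for lval
pd.merge(left, right, on='key', how='outer')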

You can also use the join command to perform joins between dataframes.

It works very similarly to the merge command, but it joins on the index and performs a left outer join by default

In [110]:
dfb.join(dgoog,lsuffix='_FB',rsuffix='_GOOG')
Out[110]:
open_FB high_FB low_FB close_FB volume_FB gain_FB profit_FB open_GOOG high_GOOG low_GOOG close_GOOG volume_GOOG gain_GOOG profit_GOOG
date
2018-01-02 177.68 181.58 177.5500 181.42 18151903 large_gain 3.74 52.4170 53.347000 52.261500 53.2500 24751280.0 small_gain 0.8330
2018-01-03 181.88 184.78 181.3300 184.67 16886563 medium_gain 2.79 53.2155 54.314500 53.160500 54.1240 28603400.0 small_gain 0.9085
2018-01-04 184.90 186.21 184.0996 184.33 13880896 negative -0.57 54.4000 54.678495 54.200085 54.3200 20092100.0 negative -0.0800
2018-01-05 185.59 186.90 184.9300 186.85 13574535 medium_gain 1.26 54.7000 55.212500 54.600000 55.1115 25582460.0 small_gain 0.4115
2018-01-08 187.20 188.90 186.3300 188.28 17994726 medium_gain 1.08 55.1115 55.563500 55.081000 55.3470 20952060.0 small_gain 0.2355
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-12-24 123.10 129.74 123.0200 124.06 22066002 small_gain 0.96 48.6950 50.177000 48.505500 48.8110 31806560.0 small_gain 0.1160
2018-12-26 126.00 134.24 125.8900 134.18 39723370 large_gain 8.18 49.4505 52.000000 49.150000 51.9730 47465400.0 medium_gain 2.5225
2018-12-27 132.44 134.99 129.6700 134.52 31202509 medium_gain 2.08 50.8575 52.194500 49.850000 52.1940 42195540.0 medium_gain 1.3365
2018-12-28 135.34 135.92 132.2000 133.20 22627569 negative -2.14 52.4810 52.778000 51.655000 51.8540 28275440.0 negative -0.6270
2018-12-31 134.45 134.64 129.9500 131.09 24625308 negative -3.36 52.5480 52.635000 51.179500 51.7805 29874440.0 negative -0.7675

251 rows × 14 columns

In [111]:
left.index=left.key
left = left.drop('key',axis=1)
right.index =right.key
right = right.drop('key',axis=1)
left.join(right)
Out[111]:
lval rval
key
boo 3 NaN
foo 1 4.0
foo 2 4.0