Introduction to Pandas and other libraries¶

(Many thanks to Evimaria Terzi and Mark Crovella for their code and examples)

Pandas¶

Pandas is the Python Data Analysis Library.

Pandas is an extremely versatile tool for manipulating datasets, mostly tabular data. You can think of Pandas as an evolution of Excel spreadsheets, with more capabilities for coding, and with SQL-style operations such as joins and group-by.

It also produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.

You can find more details at the pandas website: https://pandas.pydata.org/

Storing data tables¶

Most data can be viewed as tables or matrices (in the case where all entries are numeric). The rows correspond to objects and the columns correspond to the attributes or features.

There are different ways we can store such data tables in Python

Two-dimensional lists

In [ ]:
D = [[0.3, 10, 1000],[0.5,2,509],[0.4, 8, 789]]
print(D)
[[0.3, 10, 1000], [0.5, 2, 509], [0.4, 8, 789]]
In [ ]:
D = [[30000, 'Married', 1],[20000,'Single', 0],[45000, 'Married', 0]]
print(D)
[[30000, 'Married', 1], [20000, 'Single', 0], [45000, 'Married', 0]]

Numpy Arrays

NumPy is the main Python library for numerical computations and matrix manipulations. It has a lot of the functionality of Matlab, but also allows for data analysis operations (similar to Pandas). Read more about NumPy here: http://www.numpy.org/

The array (ndarray) is the main data structure of NumPy. It stores multidimensional numeric tables.

We can create numpy arrays from lists

In [2]:
import numpy as np

#1-dimensional array
x = np.array([2,5,18,14,4])
print ("\n Deterministic 1-dimensional array \n")
print (x)

#2-dimensional array
x = np.array([[2,5,18,14,4], [12,15,1,2,8]])
print ("\n Deterministic 2-dimensional array \n")
print (x)
 Deterministic 1-dimensional array 

[ 2  5 18 14  4]

 Deterministic 2-dimensional array 

[[ 2  5 18 14  4]
 [12 15  1  2  8]]

There are also numpy operations that create arrays of different types

In [ ]:
x = np.random.rand(5,5)
print ("\n Random 5x5 2-dimensional array \n")
print (x)

x = np.ones((4,4))
print ("\n 4x4 array with ones \n")
print (x)

x = np.diag([1,2,3])
print ("\n Diagonal matrix\n")
print(x)
 Random 5x5 2-dimensional array 

[[0.03573091 0.48721998 0.27345982 0.74677877 0.73333542]
 [0.35490938 0.67864398 0.16007005 0.50396204 0.70817655]
 [0.47903301 0.99878053 0.77743757 0.84333336 0.89785166]
 [0.09531723 0.51871558 0.91669484 0.74306461 0.75330863]
 [0.24522124 0.49625669 0.37934991 0.5472943  0.76281352]]

 4x4 array with ones 

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

 Diagonal matrix

[[1 0 0]
 [0 2 0]
 [0 0 3]]

Why do we need NumPy arrays? Because we can perform various linear algebra operations directly on numeric arrays

For example:

In [ ]:
x = np.random.randint(10,size=(2,3))
print("\n Random 2x3 array with integers")
print(x)

#Matrix transpose
print ("\n Transpose of the matrix \n")
print (x.T)

#multiplication and addition with scalar value
print("\n Matrix 2x+1 \n")
print(2*x+1)
 Random 2x3 array with integers
[[6 4 4]
 [9 0 7]]

 Transpose of the matrix 

[[6 9]
 [4 0]
 [4 7]]

 Matrix 2x+1 

[[13  9  9]
 [19  1 15]]
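Beyond the transpose and scalar arithmetic, we can also multiply matrices. Here is a quick sketch using the @ operator (equivalent to np.dot):

In [ ]:
#Matrix product: x is 2x3, so x @ x.T is a 2x2 matrix
print("\n Matrix product x @ x.T \n")
print(x @ x.T)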

Transform back to list of lists

In [ ]:
lx = [list(y) for y in x]
lx
Out[ ]:
[[6, 4, 4], [9, 0, 7]]
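NumPy arrays also provide the tolist() method, which performs the same conversion (recursively for nested arrays):

In [ ]:
x.tolist()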

Pandas data frames

A data frame is a table in which each row and column is given a label. Very similar to a spreadsheet or a SQL table.

Pandas DataFrames are documented at: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html

Pandas dataframes enable different data analysis operations

Creating Data Frames¶

A dataframe has names for the columns and the rows of the table. The column names are stored in the attribute columns, while the row names are stored in the attribute index. When these are not specified, they are just indexed by default with the numbers 0, 1, ...

There are multiple ways we can create a data frame. Here we list just a few.

In [1]:
import pandas as pd #The pandas library
from pandas import Series, DataFrame #Main pandas data structures
In [3]:
#Creating a data frame from a list of lists

df = pd.DataFrame([[1,2,3],[9,10,12]])
print(df)

# Each list becomes a row
# Names of columns default to 0,1,2
# Rows are indexed by position numbers 0,1
   0   1   2
0  1   2   3
1  9  10  12
In [ ]:
#Creating a data frame from a numpy array

df = pd.DataFrame(np.array([[1,2,3],[9,10,12]]))
print(df)
In [4]:
# Specifying column names
df = pd.DataFrame(np.array([[1,2,3],[9,10,12]]), columns=['A','B','C'])
print(df)
   A   B   C
0  1   2   3
1  9  10  12
In [43]:
#Creating a data frame from a dictionary
# Keys are column names, values are lists with column values

dfe = pd.DataFrame({'A':[1,2,3], 'B':['a','b','c']})
print(dfe)
   A  B
0  1  a
1  2  b
2  3  c
In [5]:
# Reading from a csv file:
df = pd.read_csv('example.csv')
print(df)

# The first row of the file is used for the column names
# The property columns gives us the column names
print(df.columns)
print(list(df.columns))
   NUMBER CHAR
0       1    a
1       2    b
2       3    c
Index(['NUMBER', 'CHAR'], dtype='object')
['NUMBER', 'CHAR']
In [7]:
# Reading from a csv file without header:
df = pd.read_csv('no-header.csv',header = None)
print(df)
   0  1
0  1  a
1  2  b
2  3  c
In [8]:
# Reading from an Excel file:
df = pd.read_excel('example.xlsx')
print(df)
   NUMBER CHAR
0       1    a
1       2    b
2       3    c
In [9]:
#Writing to a csv file:
df.to_csv('example2.csv')
for x in open('example2.csv').readlines():
    print(x.strip())
,NUMBER,CHAR
0,1,a
1,2,b
2,3,c
In [10]:
# By default the row index is written out as an extra column; we can omit it by setting index=False
df.to_csv('example2.csv',index = False)
for x in open('example2.csv').readlines():
    print(x.strip())
NUMBER,CHAR
1,a
2,b
3,c

Fetching data

For demonstration purposes, we will use data from Tiingo on stock quotes. We will see two ways of fetching data from Tiingo, one using the Tiingo client and one using the Data Reader library of Pandas.

More information on what types of data you can fetch is at: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

To fetch quotes from Tiingo you need to first create an account and obtain an API key. The key is then passed to the TiingoClient (alternatively, you can set the TIINGO_API_KEY environment variable to the value of the key), as shown below

In [11]:
from datetime import datetime #For handling dates
In [12]:
!pip install tiingo
import os
Collecting tiingo
  Downloading tiingo-0.15.6-py2.py3-none-any.whl.metadata (15 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from tiingo) (2.32.3)
Requirement already satisfied: websocket-client in /usr/local/lib/python3.10/dist-packages (from tiingo) (1.8.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->tiingo) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->tiingo) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->tiingo) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->tiingo) (2024.8.30)
Downloading tiingo-0.15.6-py2.py3-none-any.whl (15 kB)
Installing collected packages: tiingo
Successfully installed tiingo-0.15.6
In [13]:
from tiingo import TiingoClient

import pandas_datareader.data as web # For accessing web data


client = TiingoClient({'api_key':'614c1590a592cc6696f6082f83b2666cd83882ef'})
start = datetime(2018,1,1)
end = datetime(2018,12,31)
stocks_data = client.get_dataframe('META',frequency='daily',startDate=start,endDate=end)
stocks_data = stocks_data[['open','close','low','high','volume']]
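The second way uses the pandas-datareader library that we imported above as web. Below is a minimal sketch, assuming your installed version of pandas-datareader supports the 'tiingo' source; the API key shown is a placeholder:

In [ ]:
# Alternative (sketch): fetch the same data through pandas-datareader.
# 'YOUR_TIINGO_API_KEY' is a placeholder; substitute your own key,
# or set the TIINGO_API_KEY environment variable instead.
stocks_data_dr = web.DataReader('META', 'tiingo', start, end,
                                api_key='YOUR_TIINGO_API_KEY')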
In [14]:
# the method info() outputs basic information for our data frame
stocks_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2018-01-02 00:00:00+00:00 to 2018-12-31 00:00:00+00:00
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    251 non-null    float64
 1   close   251 non-null    float64
 2   low     251 non-null    float64
 3   high    251 non-null    float64
 4   volume  251 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 11.8 KB

The number of rows in the DataFrame:

In [15]:
len(stocks_data)
Out[15]:
251
In [16]:
#the method head() outputs the first rows of the data frame
stocks_data.head()
Out[16]:
open close low high volume
date
2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216
In [ ]:
#the method tail() outputs the last rows of the data frame
stocks_data.tail()
Out[ ]:
open close low high volume
date
2018-12-24 00:00:00+00:00 123.10 124.06 123.02 129.74 22066002
2018-12-26 00:00:00+00:00 126.00 134.18 125.89 134.24 39723370
2018-12-27 00:00:00+00:00 132.44 134.52 129.67 134.99 31202509
2018-12-28 00:00:00+00:00 135.34 133.20 132.20 135.92 22627569
2018-12-31 00:00:00+00:00 134.45 131.09 129.95 134.64 24625308

Note that the date attribute is the index of the rows, not a regular column. The index gives a name to each row. The default index is the numbers 0, ..., len(df)-1; here the rows are indexed by the date.

In [ ]:
stocks_data.close
Out[ ]:
close
date
2018-01-02 00:00:00+00:00 181.42
2018-01-03 00:00:00+00:00 184.67
2018-01-04 00:00:00+00:00 184.33
2018-01-05 00:00:00+00:00 186.85
2018-01-08 00:00:00+00:00 188.28
... ...
2018-12-24 00:00:00+00:00 124.06
2018-12-26 00:00:00+00:00 134.18
2018-12-27 00:00:00+00:00 134.52
2018-12-28 00:00:00+00:00 133.20
2018-12-31 00:00:00+00:00 131.09

251 rows × 1 columns


In [17]:
#trying to access the date column will give an error

stocks_data.date
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-d893f040ef09> in <cell line: 3>()
      1 #trying to access the date column will give an error
      2 
----> 3 stocks_data.date

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)
   6300 
   6301     @final

AttributeError: 'DataFrame' object has no attribute 'date'
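To access the dates, use the index itself; alternatively, reset_index() turns the index back into a regular column. A quick sketch:

In [ ]:
stocks_data.index[:5]        # the dates live in the index
#stocks_data.reset_index()   # or turn the index into a regular column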
In [18]:
stocks_data.to_csv('stocks_data.csv')
for x in open('stocks_data.csv').readlines()[0:10]: # Read and print First 10 lines
    print(x.strip())
df = pd.read_csv('stocks_data.csv')
date,open,close,low,high,volume
2018-01-02 00:00:00+00:00,177.68,181.42,177.55,181.58,17694891
2018-01-03 00:00:00+00:00,181.88,184.67,181.33,184.78,16595495
2018-01-04 00:00:00+00:00,184.9,184.33,184.1,186.21,13554357
2018-01-05 00:00:00+00:00,185.59,186.85,184.93,186.9,13042388
2018-01-08 00:00:00+00:00,187.2,188.28,186.33,188.9,14719216
2018-01-09 00:00:00+00:00,188.7,187.87,187.1,188.8,12342722
2018-01-10 00:00:00+00:00,186.94,187.84,185.63,187.89,10464528
2018-01-11 00:00:00+00:00,188.4,187.77,187.38,188.4,8855144
2018-01-12 00:00:00+00:00,178.06,179.37,177.4,181.48,76645626
In [ ]:
len(df)
Out[ ]:
251
In [19]:
#the method head() outputs the first rows of the data frame
df.head()
Out[19]:
date open close low high volume
0 2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
1 2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2 2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
3 2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
4 2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216
In [20]:
df.date
Out[20]:
date
0 2018-01-02 00:00:00+00:00
1 2018-01-03 00:00:00+00:00
2 2018-01-04 00:00:00+00:00
3 2018-01-05 00:00:00+00:00
4 2018-01-08 00:00:00+00:00
... ...
246 2018-12-24 00:00:00+00:00
247 2018-12-26 00:00:00+00:00
248 2018-12-27 00:00:00+00:00
249 2018-12-28 00:00:00+00:00
250 2018-12-31 00:00:00+00:00

251 rows × 1 columns


Note that in the new dataframe, there is now a date column, while the index values are numbers 0,1,...

Working with data columns¶

The columns are the "features" in your data

In [21]:
#an object that refers to the names of the columns
df.columns
Out[21]:
Index(['date', 'open', 'close', 'low', 'high', 'volume'], dtype='object')

We can also assign a list to the columns property in order to change the attribute names. Note that this relabels the columns purely by position, so the list must follow the current column order. (The list assigned below does not: after the rename the order is date, open, close, low, high, V, so the labels close and high end up swapped with respect to the underlying data, a common pitfall to watch out for.)

Alternatively, you can change the name of an attribute using rename:

In [22]:
df = df.rename(columns = {'volume':'V'})
print(list(df.columns))
df.columns = ['date', 'open', 'high', 'low', 'close', 'vol']
df.head()
['date', 'open', 'close', 'low', 'high', 'V']
Out[22]:
date open high low close vol
0 2018-01-02 00:00:00+00:00 177.68 181.42 177.55 181.58 17694891
1 2018-01-03 00:00:00+00:00 181.88 184.67 181.33 184.78 16595495
2 2018-01-04 00:00:00+00:00 184.90 184.33 184.10 186.21 13554357
3 2018-01-05 00:00:00+00:00 185.59 186.85 184.93 186.90 13042388
4 2018-01-08 00:00:00+00:00 187.20 188.28 186.33 188.90 14719216

Selecting a single column from your data.

It is important to keep in mind that selection returns a new object and does not modify the original data frame. Selecting a single column returns a pandas Series; selecting a list of columns, as shown further below, returns a new data frame.

In [23]:
df['open'].head()
Out[23]:
open
0 177.68
1 181.88
2 184.90
3 185.59
4 187.20

Another way of selecting a single column from your data, using attribute (dot) notation

In [ ]:
df.open.head()
Out[ ]:
open
0 177.68
1 181.88
2 184.90
3 185.59
4 187.20

Selecting multiple columns

In [ ]:
df[['open','close']].head()
Out[ ]:
open close
0 177.68 181.58
1 181.88 184.78
2 184.90 186.21
3 185.59 186.90
4 187.20 188.90

We can use the values attribute to obtain the values of one or more attributes. It returns a numpy array. You can transform it into a list by applying the list() operator.

In [ ]:
df.open.values[:10]
Out[ ]:
array([177.68, 181.88, 184.9 , 185.59, 187.2 , 188.7 , 186.94, 188.4 ,
       178.06, 181.5 ])
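For example, a quick sketch of the list conversion:

In [ ]:
list(df.open.values[:10])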
In [24]:
type(df.open.values[:10])
Out[24]:
numpy.ndarray
In [ ]:
type(df[['open','close']])
Out[ ]:
pandas.core.frame.DataFrame
In [ ]:
df[['open','close']].values[:10]
Out[ ]:
array([[177.68, 181.58],
       [181.88, 184.78],
       [184.9 , 186.21],
       [185.59, 186.9 ],
       [187.2 , 188.9 ],
       [188.7 , 188.8 ],
       [186.94, 187.89],
       [188.4 , 188.4 ],
       [178.06, 181.48],
       [181.5 , 181.75]])

Data Frame methods¶

A DataFrame object has many useful methods.

In [25]:
df.mean() #produces the mean of the columns/features
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-bfb9cc9bfb14> in <cell line: 1>()
----> 1 df.mean() #produces the mean of the columns/features

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in mean(self, axis, skipna, numeric_only, **kwargs)
  11691         **kwargs,
  11692     ):
> 11693         result = super().mean(axis, skipna, numeric_only, **kwargs)
  11694         if isinstance(result, Series):
  11695             result = result.__finalize__(self, method="mean")

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in mean(self, axis, skipna, numeric_only, **kwargs)
  12418         **kwargs,
  12419     ) -> Series | float:
> 12420         return self._stat_function(
  12421             "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12422         )

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12375         validate_bool_kwarg(skipna, "skipna", none_allowed=False)
  12376 
> 12377         return self._reduce(
  12378             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12379         )

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11560         # After possibly _get_data and transposing, we are now in the
  11561         #  simple case where we can use BlockManager.reduce
> 11562         res = df._mgr.reduce(blk_func)
  11563         out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11564         if out_dtype is not None and out.dtype != "boolean":

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/managers.py in reduce(self, func)
   1498         res_blocks: list[Block] = []
   1499         for blk in self.blocks:
-> 1500             nbs = blk.reduce(func)
   1501             res_blocks.extend(nbs)
   1502 

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/blocks.py in reduce(self, func)
    402         assert self.ndim == 2
    403 
--> 404         result = func(self.values)
    405 
    406         if self.values.ndim == 1:

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in blk_func(values, axis)
  11479                     return np.array([result])
  11480             else:
> 11481                 return op(values, axis=axis, skipna=skipna, **kwds)
  11482 
  11483         def _get_data() -> DataFrame:

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    145                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    146             else:
--> 147                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    148 
    149             return result

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in new_func(values, axis, skipna, mask, **kwargs)
    402             mask = isna(values)
    403 
--> 404         result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    405 
    406         if datetimelike:

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in nanmean(values, axis, skipna, mask)
    718     count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719     the_sum = values.sum(axis, dtype=dtype_sum)
--> 720     the_sum = _ensure_numeric(the_sum)
    721 
    722     if axis is not None and getattr(the_sum, "ndim", False):

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in _ensure_numeric(x)
   1684             if inferred in ["string", "mixed"]:
   1685                 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1686                 raise TypeError(f"Could not convert {x} to numeric")
   1687             try:
   1688                 x = x.astype(np.complex128)

TypeError: Could not convert ['2018-01-02 00:00:00+00:002018-01-03 00:00:00+00:002018-01-04 00:00:00+00:002018-01-05 00:00:00+00:002018-01-08 00:00:00+00:002018-01-09 00:00:00+00:002018-01-10 00:00:00+00:002018-01-11 00:00:00+00:002018-01-12 00:00:00+00:002018-01-16 00:00:00+00:002018-01-17 00:00:00+00:002018-01-18 00:00:00+00:002018-01-19 00:00:00+00:002018-01-22 00:00:00+00:002018-01-23 00:00:00+00:002018-01-24 00:00:00+00:002018-01-25 00:00:00+00:002018-01-26 00:00:00+00:002018-01-29 00:00:00+00:002018-01-30 00:00:00+00:002018-01-31 00:00:00+00:002018-02-01 00:00:00+00:002018-02-02 00:00:00+00:002018-02-05 00:00:00+00:002018-02-06 00:00:00+00:002018-02-07 00:00:00+00:002018-02-08 00:00:00+00:002018-02-09 00:00:00+00:002018-02-12 00:00:00+00:002018-02-13 00:00:00+00:002018-02-14 00:00:00+00:002018-02-15 00:00:00+00:002018-02-16 00:00:00+00:002018-02-20 00:00:00+00:002018-02-21 00:00:00+00:002018-02-22 00:00:00+00:002018-02-23 00:00:00+00:002018-02-26 00:00:00+00:002018-02-27 00:00:00+00:002018-02-28 00:00:00+00:002018-03-01 00:00:00+00:002018-03-02 00:00:00+00:002018-03-05 00:00:00+00:002018-03-06 00:00:00+00:002018-03-07 00:00:00+00:002018-03-08 00:00:00+00:002018-03-09 00:00:00+00:002018-03-12 00:00:00+00:002018-03-13 00:00:00+00:002018-03-14 00:00:00+00:002018-03-15 00:00:00+00:002018-03-16 00:00:00+00:002018-03-19 00:00:00+00:002018-03-20 00:00:00+00:002018-03-21 00:00:00+00:002018-03-22 00:00:00+00:002018-03-23 00:00:00+00:002018-03-26 00:00:00+00:002018-...

For the date column we cannot calculate the mean, because it stores strings.

Recent versions of the pandas library raise an error here (older versions only warned and skipped such columns). We should preselect the attributes we want to compute the mean for

In [26]:
df[['open','high','low','close']].mean()
Out[26]:
0
open 171.472948
high 171.510956
low 169.303128
close 173.613347

In [ ]:
df[['open','high','low','close']].std() #produces the standard deviation of the columns/features
Out[ ]:
0
open 19.696493
high 19.977452
low 20.074408
close 19.424564

In [ ]:
df[['open','high','low','close']].sem() #produces the standard error of the mean of the columns/features
Out[ ]:
0
open 1.243232
high 1.260966
low 1.267085
close 1.226068

Here is a manual way to compute a confidence interval for the mean. What you need is the sample size and the desired confidence level.

In [ ]:
#confidence interval
import scipy.stats as stats
conf = 0.95
t = stats.t.ppf((1+conf)/2.0, len(df)-1)
low = df[['open','high','low','close']].mean() - t*df[['open','high','low','close']].sem()
high = df[['open','high','low','close']].mean() + t*df[['open','high','low','close']].sem()
pd.DataFrame({'CI lower end':low, 'CI higher end':high})
Out[ ]:
CI lower end CI higher end
open 169.024405 173.921491
high 169.027487 173.994426
low 166.807605 171.798651
close 171.198608 176.028085
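scipy can also compute this interval directly with stats.t.interval; a minimal sketch using the same sample mean and standard error as above:

In [ ]:
m = df[['open','high','low','close']].mean()
se = df[['open','high','low','close']].sem()
stats.t.interval(conf, len(df)-1, loc=m, scale=se)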
In [ ]:
df[['open','high','low','close']].median() #produces the median of the columns/features
Out[ ]:
0
open 174.89
high 174.70
low 172.83
close 176.98

In [ ]:
df.open.mean()
Out[ ]:
171.47294820717133

Use describe to get the summary statistics for all the (numeric) columns of the data

In [27]:
stocks_data.describe()
Out[27]:
open close low high volume
count 251.000000 251.000000 251.000000 251.000000 2.510000e+02
mean 171.472948 171.510956 169.303128 173.613347 2.743828e+07
std 19.696493 19.977452 20.074408 19.424564 1.910143e+07
min 123.100000 124.060000 123.020000 129.740000 8.855144e+06
25% 157.815000 157.915000 155.525000 160.745000 1.750763e+07
50% 174.890000 174.700000 172.830000 176.980000 2.186093e+07
75% 184.915000 185.270000 183.420000 186.450000 2.984990e+07
max 215.720000 217.500000 214.270000 218.620000 1.698037e+08

We can obtain the sum of the column entries using the sum operation

In [28]:
stocks_data.sum()
Out[28]:
0
open 4.303971e+04
close 4.304925e+04
low 4.249509e+04
high 4.357695e+04
volume 6.887007e+09

The functions we have seen work on columns. We can apply them to rows as well by specifying the axis of the data.

axis = 0 refers to the index, i.e. the rows, and is the default behavior.

axis = 1 refers to the columns.

It can be confusing, but the axis refers to the axis along which we perform the operation. For example, when we specify the axis for the sum method, it means that we sum over that axis. So if we sum over axis 0 we collapse the rows and obtain the sum of each column, while summing over axis 1 gives the sum of each row.

In [29]:
df.sum(axis=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-f6cc363c055e> in <cell line: 1>()
----> 1 df.sum(axis=1)

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in sum(self, axis, skipna, numeric_only, min_count, **kwargs)
  11668         **kwargs,
  11669     ):
> 11670         result = super().sum(axis, skipna, numeric_only, min_count, **kwargs)
  11671         return result.__finalize__(self, method="sum")
  11672 

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in sum(self, axis, skipna, numeric_only, min_count, **kwargs)
  12504         **kwargs,
  12505     ):
> 12506         return self._min_count_stat_function(
  12507             "sum", nanops.nansum, axis, skipna, numeric_only, min_count, **kwargs
  12508         )

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _min_count_stat_function(self, name, func, axis, skipna, numeric_only, min_count, **kwargs)
  12487             axis = 0
  12488 
> 12489         return self._reduce(
  12490             func,
  12491             name=name,

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11560         # After possibly _get_data and transposing, we are now in the
  11561         #  simple case where we can use BlockManager.reduce
> 11562         res = df._mgr.reduce(blk_func)
  11563         out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11564         if out_dtype is not None and out.dtype != "boolean":

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/managers.py in reduce(self, func)
   1498         res_blocks: list[Block] = []
   1499         for blk in self.blocks:
-> 1500             nbs = blk.reduce(func)
   1501             res_blocks.extend(nbs)
   1502 

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/blocks.py in reduce(self, func)
    402         assert self.ndim == 2
    403 
--> 404         result = func(self.values)
    405 
    406         if self.values.ndim == 1:

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in blk_func(values, axis)
  11479                     return np.array([result])
  11480             else:
> 11481                 return op(values, axis=axis, skipna=skipna, **kwds)
  11482 
  11483         def _get_data() -> DataFrame:

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in _f(*args, **kwargs)
     83                 )
     84             try:
---> 85                 return f(*args, **kwargs)
     86             except ValueError as e:
     87                 # we want to transform an object array

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in new_func(values, axis, skipna, mask, **kwargs)
    402             mask = isna(values)
    403 
--> 404         result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    405 
    406         if datetimelike:

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in newfunc(values, axis, **kwargs)
    475             return np.array(results)
    476 
--> 477         return func(values, axis=axis, **kwargs)
    478 
    479     return cast(F, newfunc)

/usr/local/lib/python3.10/dist-packages/pandas/core/nanops.py in nansum(values, axis, skipna, min_count, mask)
    644         dtype_sum = np.dtype(np.float64)
    645 
--> 646     the_sum = values.sum(axis, dtype=dtype_sum)
    647     the_sum = _maybe_null_out(the_sum, axis, mask, values.shape, min_count=min_count)
    648 

/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     48          initial=_NoValue, where=True):
---> 49     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     50 
     51 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: can only concatenate str (not "float") to str
In [73]:
df[['open', 'close', 'low', 'high', 'vol']].sum(axis=1)
Out[73]:
0
date
2018-01-02 17695609.23
2018-01-03 16596227.66
2018-01-04 13555096.54
2018-01-05 13043132.27
2018-01-08 14719966.71
... ...
2018-12-24 22066501.92
2018-12-26 39723890.31
2018-12-27 31203040.62
2018-12-28 22628105.66
2018-12-31 24625838.13

251 rows × 1 columns


Sorting: You can sort by a specific column, ascending (default) or descending. You can also sort inplace.

In [31]:
#when inplace is False (the default) it returns a new dataframe that is sorted.
#when it is True it does not return anything, just changes the dataframe.
stocks_data.sort_values(by = 'open', ascending =False, inplace=False).head()
Out[31]:
open close low high volume
date
2018-07-25 00:00:00+00:00 215.72 217.50 214.27 218.62 64592585
2018-07-24 00:00:00+00:00 215.11 214.67 212.60 216.20 28468681
2018-07-23 00:00:00+00:00 210.58 210.91 208.80 211.62 16731969
2018-07-18 00:00:00+00:00 209.82 209.36 208.44 210.99 15334907
2018-07-20 00:00:00+00:00 208.85 209.94 208.50 211.50 16241508

Bulk Operations¶

Methods like sum( ) and std( ) work on entire columns.

We can run our own functions across all values in a column (or row) using apply( ).

In [32]:
df.date.head()
Out[32]:
date
0 2018-01-02 00:00:00+00:00
1 2018-01-03 00:00:00+00:00
2 2018-01-04 00:00:00+00:00
3 2018-01-05 00:00:00+00:00
4 2018-01-08 00:00:00+00:00

The values property of the column returns a numpy array of values for the column. Inspecting the first value reveals that these are strings with a particular format.

In [33]:
first_date = df.date.values[0]
first_date
#returns a string
Out[33]:
'2018-01-02 00:00:00+00:00'

The datetime module handles dates. The method strptime parses a string into a datetime object, according to a format string given as a parameter.

In [34]:
datetime.strptime(first_date, "%Y-%m-%d %H:%M:%S%z")
Out[34]:
datetime.datetime(2018, 1, 2, 0, 0, tzinfo=datetime.timezone.utc)

We will now make use of two operations:

The apply method applies a function, given as input, to all the entries of the data frame (or, as in the case below, to a single column).

A lambda expression lets us define an anonymous function that takes some parameters (d) and uses them to compute an expression.

Using a lambda function with apply, we can apply the function to all the entries of the data frame (in this case the column values)

In [35]:
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S%z").strftime("%Y-%m-%d"))
date_series = df.date # We want to keep the dates
df.date.head()

#Another way to do the same thing, by applying the function to every row (axis = 1)
#df.date = df.apply(lambda row: datetime.strptime(row.date, "%Y-%m-%d"), axis=1)
Out[35]:
date
0 2018-01-02
1 2018-01-03
2 2018-01-04
3 2018-01-05
4 2018-01-08

In [36]:
date_series.head()
Out[36]:
date
0 2018-01-02
1 2018-01-03
2 2018-01-04
3 2018-01-05
4 2018-01-08

For example, we can obtain the integer part of the open value

In [37]:
#dftest = df[['open','close']]
#dftest.apply(lambda x: int(x))
df.apply(lambda r: int(r.open), axis=1)
Out[37]:
0
0 177
1 181
2 184
3 185
4 187
... ...
246 123
247 126
248 132
249 135
250 134

251 rows × 1 columns


In [38]:
dftest = df['open']
dftest.apply(lambda x: int(x))
Out[38]:
open
0 177
1 181
2 184
3 185
4 187
... ...
246 123
247 126
248 132
249 135
250 134

251 rows × 1 columns


Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

The row indices so far have been auto-generated by pandas, and are simply integers starting from 0.

From now on we will use dates instead of integers for indices -- the benefits of this will show later.

Overwriting the index is as easy as assigning to the index property of the DataFrame.

In [40]:
list(df.index)[0:10]
Out[40]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [41]:
df.index = df.date
df.head()
Out[41]:
date open high low close vol
date
2018-01-02 2018-01-02 177.68 181.42 177.55 181.58 17694891
2018-01-03 2018-01-03 181.88 184.67 181.33 184.78 16595495
2018-01-04 2018-01-04 184.90 184.33 184.10 186.21 13554357
2018-01-05 2018-01-05 185.59 186.85 184.93 186.90 13042388
2018-01-08 2018-01-08 187.20 188.28 186.33 188.90 14719216

Another example using the simple example.csv data we loaded

In [44]:
dfe
Out[44]:
A B
0 1 a
1 2 b
2 3 c
In [45]:
dfe.index = dfe.B
In [46]:
dfe
Out[46]:
A B
B
a 1 a
b 2 b
c 3 c

Now that we have made an index based on date, we can drop the original date column. We kept a copy of it in date_series above, so we can still use the dates later on.

In [47]:
df = df.drop(columns = ['date']) #Equivalent to df = df.drop(['date'], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 251 entries, 2018-01-02 to 2018-12-31
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    251 non-null    float64
 1   high    251 non-null    float64
 2   low     251 non-null    float64
 3   close   251 non-null    float64
 4   vol     251 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 11.8+ KB
In [48]:
#axis = 0 refers to dropping labels from the rows (or you can use index = labels as a parameter);
#essentially we are dropping a set of rows.
#axis = 1 refers to dropping labels from the columns.
dfe.drop(index='b')
Out[48]:
A B
B
a 1 a
c 3 c

Accessing rows of the DataFrame¶

So far we've seen how to access a column of the DataFrame. To access a row we use a different notation.

To access a row by its index value, use the .loc[] indexer.

In [49]:
df.index = pd.to_datetime(df.index)
print(df.index)
DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
               '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
               '2018-01-12', '2018-01-16',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='date', length=251, freq=None)
In [50]:
df.loc[datetime(2018,5,7)]
Out[50]:
2018-05-07
open 177.35
high 177.97
low 177.17
close 179.50
vol 18697195.00

To access a row by its sequence number (i.e., like an array index), use .iloc[] ('integer location')

In [51]:
df.iloc[10:20] #dataframe with rows 10 up to (but not including) 20
Out[51]:
open high low close vol
date
2018-01-17 179.26 177.60 175.80 179.32 27356988
2018-01-18 178.13 179.80 177.08 180.98 22783759
2018-01-19 180.85 181.29 180.17 182.37 26266081
2018-01-22 180.80 185.37 180.41 185.39 20567285
2018-01-23 186.05 189.35 185.55 189.55 24956444
2018-01-24 189.89 186.55 186.52 190.66 22992031
2018-01-25 187.95 187.48 186.60 188.62 16698537
2018-01-26 187.75 190.00 186.81 190.00 16652197
2018-01-29 188.75 185.98 185.63 188.84 20022931
2018-01-30 187.62 187.12 181.84 188.18 20382841
In [52]:
df.iloc[0:2,[1,3]] #dataframe with rows 0:2, and the second and fourth columns
Out[52]:
high close
date
2018-01-02 181.42 181.58
2018-01-03 184.67 184.78
In [53]:
#select rows and columns
df[['high','close']].iloc[0:2]
Out[53]:
high close
date
2018-01-02 181.42 181.58
2018-01-03 184.67 184.78

To iterate over the rows, use .iterrows()¶

In [54]:
print(df.dtypes)
open     float64
high     float64
low      float64
close    float64
vol        int64
dtype: object
In [55]:
num_positive_days = 0
for idx, row in df.iterrows(): #returns the index name and the row
    if row.close > row.open:
        num_positive_days += 1

print("The total number of positive-gain days is {}.".format(num_positive_days))
The total number of positive-gain days is 244.

You can also do it this way:

In [56]:
num_positive_days = 0
for i in range(len(df)):
    row = df.iloc[i]
    if row.close > row.open:
        num_positive_days += 1

print("The total number of positive-gain days is {}.".format(num_positive_days))
The total number of positive-gain days is 244.

Or this way:

In [57]:
pos_days = [idx for (idx,row) in df.iterrows() if row.close > row.open]
print("The total number of positive-gain days is "+str(len(pos_days)))
The total number of positive-gain days is 244
In [58]:
pos_days[0:10]
Out[58]:
[Timestamp('2018-01-02 00:00:00'),
 Timestamp('2018-01-03 00:00:00'),
 Timestamp('2018-01-04 00:00:00'),
 Timestamp('2018-01-05 00:00:00'),
 Timestamp('2018-01-08 00:00:00'),
 Timestamp('2018-01-09 00:00:00'),
 Timestamp('2018-01-10 00:00:00'),
 Timestamp('2018-01-12 00:00:00'),
 Timestamp('2018-01-16 00:00:00'),
 Timestamp('2018-01-17 00:00:00')]
In [59]:
df.loc[pos_days]
Out[59]:
open high low close vol
date
2018-01-02 177.68 181.42 177.55 181.58 17694891
2018-01-03 181.88 184.67 181.33 184.78 16595495
2018-01-04 184.90 184.33 184.10 186.21 13554357
2018-01-05 185.59 186.85 184.93 186.90 13042388
2018-01-08 187.20 188.28 186.33 188.90 14719216
... ... ... ... ... ...
2018-12-24 123.10 124.06 123.02 129.74 22066002
2018-12-26 126.00 134.18 125.89 134.24 39723370
2018-12-27 132.44 134.52 129.67 134.99 31202509
2018-12-28 135.34 133.20 132.20 135.92 22627569
2018-12-31 134.45 131.09 129.95 134.64 24625308

244 rows × 5 columns

In [60]:
#Iterating over a dataframe iterates over its column names:
for x in df:
    print(x)
open
high
low
close
vol

Filtering¶

It is very easy to select interesting rows from the data.

All these operations below return a new DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.

We can perform boolean operations on the columns. The result is a boolean Series with one True/False value per row

In [66]:
tmp_high = df.high > 170
tmp_high.tail()
Out[66]:
high
date
2018-12-24 False
2018-12-26 False
2018-12-27 False
2018-12-28 False
2018-12-31 False

Summing a Boolean array is the same as counting the number of True values.

In [64]:
sum(tmp_high)
Out[64]:
144

We can use the boolean Series to select rows with True values. The operation below returns only the rows of df where tmp_high is True

In [67]:
df[tmp_high].tail()
Out[67]:
open high low close vol
date
2018-08-28 178.10 176.26 175.83 178.24 15910675
2018-08-29 176.30 175.90 174.75 176.79 18678301
2018-08-30 175.90 177.64 175.70 179.79 24216532
2018-08-31 177.15 175.73 174.98 177.62 18065159
2018-09-04 173.50 171.16 168.80 173.89 29808971

Putting it all together, we have the following commonly-used patterns:

In [68]:
positive_days = df[df.close > df.open]
positive_days.head()
Out[68]:
open high low close vol
date
2018-01-02 177.68 181.42 177.55 181.58 17694891
2018-01-03 181.88 184.67 181.33 184.78 16595495
2018-01-04 184.90 184.33 184.10 186.21 13554357
2018-01-05 185.59 186.85 184.93 186.90 13042388
2018-01-08 187.20 188.28 186.33 188.90 14719216
In [74]:
very_positive_days = df[df.close-df.open > 5]
very_positive_days.head()
Out[74]:
open high low close vol
date
2018-02-01 188.22 193.09 187.89 195.32 53608910
2018-02-06 178.57 185.31 177.74 185.77 36829710
2018-02-14 173.45 179.52 173.21 179.81 27963758
2018-03-07 178.74 183.71 178.07 183.82 19097293
2018-03-21 164.80 169.39 163.30 173.40 105350867

We can have more complex boolean expressions. The and is written &, the or is |, the not is ~, and each condition needs to be wrapped in parentheses for the expression to work

In [75]:
df[(df.high<170)&(df.low>80)]
Out[75]:
open high low close vol
date
2018-03-20 167.47 168.15 161.95 170.20 128925534
2018-03-21 164.80 169.39 163.30 173.40 105350867
2018-03-22 166.13 164.89 163.72 170.27 73389988
2018-03-23 165.44 159.39 159.02 166.60 52306891
2018-03-26 160.82 160.06 149.02 161.10 125438294
... ... ... ... ... ...
2018-12-24 123.10 124.06 123.02 129.74 22066002
2018-12-26 126.00 134.18 125.89 134.24 39723370
2018-12-27 132.44 134.52 129.67 134.99 31202509
2018-12-28 135.34 133.20 132.20 135.92 22627569
2018-12-31 134.45 131.09 129.95 134.64 24625308

107 rows × 5 columns
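The same filter can also be expressed more readably with the query method; a quick sketch:

In [ ]:
df.query('high < 170 and low > 80')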

In [76]:
temp_df = df[df.high<170]
temp_df = temp_df[temp_df.low > 80] #chained filtering, equivalent to the combined condition above
temp_df
Out[76]:
open high low close vol
date
2018-03-20 167.47 168.15 161.95 170.20 128925534
2018-03-21 164.80 169.39 163.30 173.40 105350867
2018-03-22 166.13 164.89 163.72 170.27 73389988
2018-03-23 165.44 159.39 159.02 166.60 52306891
2018-03-26 160.82 160.06 149.02 161.10 125438294
... ... ... ... ... ...
2018-12-24 123.10 124.06 123.02 129.74 22066002
2018-12-26 126.00 134.18 125.89 134.24 39723370
2018-12-27 132.44 134.52 129.67 134.99 31202509
2018-12-28 135.34 133.20 132.20 135.92 22627569
2018-12-31 134.45 131.09 129.95 134.64 24625308

107 rows × 5 columns

Creating new columns¶

To create a new column, simply assign values to it. Think of the columns as a dictionary:

In [77]:
df['profit'] = (df.close - df.open)
df.head()
Out[77]:
open high low close vol profit
date
2018-01-02 177.68 181.42 177.55 181.58 17694891 3.90
2018-01-03 181.88 184.67 181.33 184.78 16595495 2.90
2018-01-04 184.90 184.33 184.10 186.21 13554357 1.31
2018-01-05 185.59 186.85 184.93 186.90 13042388 1.31
2018-01-08 187.20 188.28 186.33 188.90 14719216 1.70
In [78]:
df.profit[df.profit>0].describe()
Out[78]:
profit
count 244.000000
mean 2.201803
std 1.834447
min 0.010000
25% 0.785000
50% 1.585000
75% 3.427500
max 8.600000

In [79]:
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx,'gain']='negative'
    elif (row.close - row.open) < 1:
        df.loc[idx,'gain']='small_gain'
    elif (row.close - row.open) < 3:
        df.loc[idx,'gain']='medium_gain'
    else:
        df.loc[idx,'gain']='large_gain'
df.head()
Out[79]:
open high low close vol profit gain
date
2018-01-02 177.68 181.42 177.55 181.58 17694891 3.90 large_gain
2018-01-03 181.88 184.67 181.33 184.78 16595495 2.90 medium_gain
2018-01-04 184.90 184.33 184.10 186.21 13554357 1.31 medium_gain
2018-01-05 185.59 186.85 184.93 186.90 13042388 1.31 medium_gain
2018-01-08 187.20 188.28 186.33 188.90 14719216 1.70 medium_gain

Here is another, more "functional", way to accomplish the same thing.

Define a function that classifies rows, and apply it to each row.

In [80]:
def gainrow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 3:
        return 'medium_gain'
    else:
        return 'large_gain'

df['test_column'] = df.apply(gainrow, axis = 1)
#axis = 1 applies the function to each row; axis = 0 (the default) applies it to each column
In [81]:
df.head()
Out[81]:
open high low close vol profit gain test_column
date
2018-01-02 177.68 181.42 177.55 181.58 17694891 3.90 large_gain large_gain
2018-01-03 181.88 184.67 181.33 184.78 16595495 2.90 medium_gain medium_gain
2018-01-04 184.90 184.33 184.10 186.21 13554357 1.31 medium_gain medium_gain
2018-01-05 185.59 186.85 184.93 186.90 13042388 1.31 medium_gain medium_gain
2018-01-08 187.20 188.28 186.33 188.90 14719216 1.70 medium_gain medium_gain

OK, point made, let's get rid of that extraneous test_column:

In [82]:
df = df.drop('test_column', axis = 1)
df.head()
Out[82]:
open high low close vol profit gain
date
2018-01-02 177.68 181.42 177.55 181.58 17694891 3.90 large_gain
2018-01-03 181.88 184.67 181.33 184.78 16595495 2.90 medium_gain
2018-01-04 184.90 184.33 184.10 186.21 13554357 1.31 medium_gain
2018-01-05 185.59 186.85 184.93 186.90 13042388 1.31 medium_gain
2018-01-08 187.20 188.28 186.33 188.90 14719216 1.70 medium_gain

Missing values¶

Data often has missing values. In Pandas these are denoted as NaN values. They may be part of our data (e.g. empty cells in an Excel sheet), or they may appear as the result of a join. There are special methods for handling these values.

In [83]:
mdf = pd.read_csv('example-missing.csv')
mdf
Out[83]:
A B C
0 1.0 a x
1 5.0 b NaN
2 3.0 c y
3 9.0 NaN z
4 NaN a x
In [84]:
mdf.A.mean()
Out[84]:
4.5

Note that mean() skipped the missing value (skipna=True is the default). We can fill in the missing values using the fillna method

In [85]:
mdf.fillna(0)
Out[85]:
A B C
0 1.0 a x
1 5.0 b 0
2 3.0 c y
3 9.0 0 z
4 0.0 a x
In [86]:
mdf.A = mdf.A.fillna(mdf.A.mean())
mdf = mdf.fillna('')
mdf
Out[86]:
A B C
0 1.0 a x
1 5.0 b
2 3.0 c y
3 9.0 z
4 4.5 a x

We can drop the rows with missing values

In [87]:
mdf = pd.read_csv('example-missing.csv')
mdf.dropna()
Out[87]:
A B C
0 1.0 a x
2 3.0 c y

We can also find the rows that have a missing value in a given column

In [88]:
mdf[mdf.B.isnull()]
Out[88]:
A B C
3 9.0 NaN z
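To count the missing values in each column, we can combine isnull() with sum(); a quick sketch:

In [ ]:
mdf.isnull().sum()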

Grouping¶

An extremely powerful DataFrame method is groupby().

This is entirely analogous to GROUP BY in SQL.

It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.

Here we will look at the average gain among the categories of gains (negative, small, medium and large) we defined above and stored in column gain.

In [89]:
gain_groups = df.groupby('gain')
In [90]:
type(gain_groups) #DataFrameGroupBy Object
Out[90]:
pandas.core.groupby.generic.DataFrameGroupBy

Essentially, gain_groups behaves like a dictionary

  • The keys are the unique values found in the gain column, and
  • The values are DataFrames that contain only the rows having the corresponding unique values.
In [91]:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')
large_gain
              open    high     low   close       vol  profit        gain
date                                                                    
2018-01-02  177.68  181.42  177.55  181.58  17694891    3.90  large_gain
2018-01-12  178.06  179.37  177.40  181.48  76645626    3.42  large_gain
2018-01-22  180.80  185.37  180.41  185.39  20567285    4.59  large_gain
2018-01-23  186.05  189.35  185.55  189.55  24956444    3.50  large_gain
2018-02-01  188.22  193.09  187.89  195.32  53608910    7.10  large_gain
=============================
medium_gain
              open    high     low   close       vol  profit         gain
date                                                                     
2018-01-03  181.88  184.67  181.33  184.78  16595495    2.90  medium_gain
2018-01-04  184.90  184.33  184.10  186.21  13554357    1.31  medium_gain
2018-01-05  185.59  186.85  184.93  186.90  13042388    1.31  medium_gain
2018-01-08  187.20  188.28  186.33  188.90  14719216    1.70  medium_gain
2018-01-18  178.13  179.80  177.08  180.98  22783759    2.85  medium_gain
=============================
small_gain
              open    high     low   close       vol  profit        gain
date                                                                    
2018-01-09  188.70  187.87  187.10  188.80  12342722    0.10  small_gain
2018-01-10  186.94  187.84  185.63  187.89  10464528    0.95  small_gain
2018-01-11  188.40  187.77  187.38  188.40   8855144    0.00  small_gain
2018-01-16  181.50  178.39  178.04  181.75  35027166    0.25  small_gain
2018-01-17  179.26  177.60  175.80  179.32  27356988    0.06  small_gain
=============================

We can obtain the dataframe that corresponds to a specific group by using the get_group method of the groupby object

In [92]:
sm = gain_groups.get_group('small_gain')
sm.head()
Out[92]:
open high low close vol profit gain
date
2018-01-09 188.70 187.87 187.10 188.80 12342722 0.10 small_gain
2018-01-10 186.94 187.84 185.63 187.89 10464528 0.95 small_gain
2018-01-11 188.40 187.77 187.38 188.40 8855144 0.00 small_gain
2018-01-16 181.50 178.39 178.04 181.75 35027166 0.25 small_gain
2018-01-17 179.26 177.60 175.80 179.32 27356988 0.06 small_gain
In [93]:
for gain, gain_data in df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain,
                                                    gain_data.close.mean()))
The average closing value for the large_gain group is 171.15434782608693
The average closing value for the medium_gain group is 175.6158333333333
The average closing value for the small_gain group is 173.3509302325581

The operation above can be done with a typical SQL-like group by, where we group by one or more attributes and aggregate the values of (some of) the other attributes.

For example, group by "gain" and take the average of the values for open, high, low, close, volume.

Example SQL code: SELECT gain, AVG(close) AS avg_close FROM stock_data GROUP BY gain;

You can also use other aggregators such as count, sum, median, max, min.

Pandas returns a new dataframe indexed by the values of the group-by attribute(s), with the other attributes as columns

In [94]:
gdf= df[['open','low','high','close','vol','gain']].groupby('gain').mean()
type(gdf)
Out[94]:
pandas.core.frame.DataFrame
In [96]:
gdf
Out[96]:
                   open         low        high       close           vol
gain
large_gain   166.503623  165.626087  169.394638  171.154348  3.481156e+07
medium_gain  173.778021  171.991458  173.742292  175.615833  2.371978e+07
small_gain   172.886860  169.252385  170.718140  173.350930  2.567338e+07

We can also apply a customized aggregation function using the agg method together with a lambda function. This is also useful for processing string attributes. For example, if we want the aggregation function to compute the max-min range, we can do the following:
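A minimal sketch of what that cell would look like (this code is ours, assuming df is the stock dataframe from earlier in the notebook, with the gain column already computed):

In [ ]:
#custom aggregation with agg and a lambda: the max-min range per group
#(assumes df has the numeric columns open, low, high, close, vol and the gain column)
df.groupby('gain')[['open','low','high','close','vol']].agg(lambda x: x.max() - x.min())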

If you want to remove the (hierarchical) index and make the group-by attribute(s) part of the table, you can use the reset_index method. This method turns the index attribute into one more column of the dataframe and reverts to the default integer index.

In [97]:
#This can be used to remove the hierarchical index, if necessary
gdf = gdf.reset_index()
gdf
Out[97]:
          gain        open         low        high       close           vol
0   large_gain  166.503623  165.626087  169.394638  171.154348  3.481156e+07
1  medium_gain  173.778021  171.991458  173.742292  175.615833  2.371978e+07
2   small_gain  172.886860  169.252385  170.718140  173.350930  2.567338e+07
In [98]:
gdf.set_index('gain')
Out[98]:
                   open         low        high       close           vol
gain
large_gain   166.503623  165.626087  169.394638  171.154348  3.481156e+07
medium_gain  173.778021  171.991458  173.742292  175.615833  2.371978e+07
small_gain   172.886860  169.252385  170.718140  173.350930  2.567338e+07

Another example:

In [99]:
test = pd.DataFrame({'A':[1,2,3,4],'B':['a','b','b','a'],'C':['a','a','b','a']})
test
Out[99]:
   A  B  C
0  1  a  a
1  2  b  a
2  3  b  b
3  4  a  a
In [100]:
gtest = test.groupby(['B','C']).mean()
gtest
#note that in this case we get a hierarchical index
Out[100]:
       A
B C
a a  2.5
b a  2.0
  b  3.0
In [101]:
gtest = gtest.reset_index()
gtest
#the hierarchical index is flattened out
Out[101]:
   B  C    A
0  a  a  2.5
1  b  a  2.0
2  b  b  3.0
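A shortcut worth knowing (not shown in the original output above): passing as_index=False to groupby produces the flat table directly, without a separate reset_index step.

In [ ]:
#equivalent one-step version: group without building a hierarchical index
test.groupby(['B','C'], as_index=False).mean()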

Joins¶

We can join data frames in a similar way to how we do joins in SQL.

In [102]:
client = TiingoClient({'api_key':'614c1590a592cc6696f6082f83b2666cd83882ef'})
start = datetime(2018,1,1)
end = datetime(2018,12,31)

dfb = client.get_dataframe('META',frequency='daily',startDate=start,endDate=end)
dfb = dfb[['open','close','low','high','volume']]

dgoog = client.get_dataframe('GOOGL',frequency='daily',startDate=start,endDate=end)
dgoog = dgoog[['open','close','low','high','volume']]

print(dfb.head())
print(dgoog.head())
                             open   close     low    high    volume
date                                                               
2018-01-02 00:00:00+00:00  177.68  181.42  177.55  181.58  17694891
2018-01-03 00:00:00+00:00  181.88  184.67  181.33  184.78  16595495
2018-01-04 00:00:00+00:00  184.90  184.33  184.10  186.21  13554357
2018-01-05 00:00:00+00:00  185.59  186.85  184.93  186.90  13042388
2018-01-08 00:00:00+00:00  187.20  188.28  186.33  188.90  14719216
                              open    close      low     high   volume
date                                                                  
2018-01-02 00:00:00+00:00  1053.02  1073.21  1053.02  1075.98  1555809
2018-01-03 00:00:00+00:00  1073.93  1091.52  1073.43  1096.10  1550593
2018-01-04 00:00:00+00:00  1097.09  1095.76  1094.26  1104.08  1289293
2018-01-05 00:00:00+00:00  1103.45  1110.29  1101.80  1113.58  1493389
2018-01-08 00:00:00+00:00  1111.00  1114.21  1110.00  1119.16  1148958

Perform a join on the date (the index value). Note the _x and _y suffixes in the column names, which differentiate columns with the same name coming from the left (x) and the right (y) dataframes.

In [106]:
common_dates = pd.merge(dfb,dgoog,on='date')
common_dates.head()
Out[106]:
                           open_x  close_x   low_x  high_x  volume_x   open_y  close_y    low_y   high_y  volume_y
date
2018-01-02 00:00:00+00:00  177.68   181.42  177.55  181.58  17694891  1053.02  1073.21  1053.02  1075.98   1555809
2018-01-03 00:00:00+00:00  181.88   184.67  181.33  184.78  16595495  1073.93  1091.52  1073.43  1096.10   1550593
2018-01-04 00:00:00+00:00  184.90   184.33  184.10  186.21  13554357  1097.09  1095.76  1094.26  1104.08   1289293
2018-01-05 00:00:00+00:00  185.59   186.85  184.93  186.90  13042388  1103.45  1110.29  1101.80  1113.58   1493389
2018-01-08 00:00:00+00:00  187.20   188.28  186.33  188.90  14719216  1111.00  1114.21  1110.00  1119.16   1148958
In [107]:
len(common_dates)
Out[107]:
251

We can specify the suffixes for the left and right tables.

In [108]:
common_dates = pd.merge(dfb,dgoog,on='date',suffixes=('_fb', '_goog'))
common_dates.head()
Out[108]:
                           open_fb  close_fb  low_fb  high_fb  volume_fb  open_goog  close_goog  low_goog  high_goog  volume_goog
date
2018-01-02 00:00:00+00:00   177.68    181.42  177.55   181.58   17694891    1053.02     1073.21   1053.02    1075.98      1555809
2018-01-03 00:00:00+00:00   181.88    184.67  181.33   184.78   16595495    1073.93     1091.52   1073.43    1096.10      1550593
2018-01-04 00:00:00+00:00   184.90    184.33  184.10   186.21   13554357    1097.09     1095.76   1094.26    1104.08      1289293
2018-01-05 00:00:00+00:00   185.59    186.85  184.93   186.90   13042388    1103.45     1110.29   1101.80    1113.58      1493389
2018-01-08 00:00:00+00:00   187.20    188.28  186.33   188.90   14719216    1111.00     1114.21   1110.00    1119.16      1148958

Compute gain and perform join on the date AND gain.

In [109]:
dfb['gain'] = dfb.apply(gainrow, axis = 1)
dgoog['gain'] = dgoog.apply(gainrow, axis = 1)
dfb['profit'] = dfb.close-dfb.open
dgoog['profit'] = dgoog.close-dgoog.open
In [110]:
common_gain_dates = pd.merge(dfb, dgoog, on=['date','gain'],suffixes=('_fb', '_goog'))
common_gain_dates.head()
Out[110]:
                           open_fb  close_fb  low_fb  high_fb  volume_fb        gain  profit_fb  open_goog  close_goog  low_goog  high_goog  volume_goog  profit_goog
date
2018-01-02 00:00:00+00:00   177.68    181.42  177.55   181.58   17694891  large_gain       3.74    1053.02     1073.21   1053.02    1075.98      1555809        20.19
2018-01-04 00:00:00+00:00   184.90    184.33  184.10   186.21   13554357    negative      -0.57    1097.09     1095.76   1094.26    1104.08      1289293        -1.33
2018-01-09 00:00:00+00:00   188.70    187.87  187.10   188.80   12342722    negative      -0.83    1118.44     1112.79   1108.20    1118.44      1335995        -5.65
2018-01-11 00:00:00+00:00   188.40    187.77  187.38   188.40    8855144    negative      -0.63    1112.31     1112.05   1106.48    1114.85      1102461        -0.26
2018-01-16 00:00:00+00:00   181.50    178.39  178.04   181.75   35027166    negative      -3.11    1140.31     1130.70   1126.66    1148.88      1783881        -9.61
In [111]:
len(common_gain_dates)
Out[111]:
128

Only 128 of the 251 common dates survive because merge performs an inner join by default: a date is kept only if both stocks fall into the same gain category on that date.

More join examples, including a left outer join:

In [116]:
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
print('\n')
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
print('\n')
   key  lval
0  foo     1
1  foo     2
2  boo     3


   key  rval
0  foo     4
1  hoo     5


In [117]:
dfm = pd.merge(left, right, on='key')
print(dfm)
   key  lval  rval
0  foo     1     4
1  foo     2     4

Left outer join

In [118]:
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
print('\n')
dfm = dfm.fillna(0) #fills the NaN values with the specified value
dfm
   key  lval  rval
0  foo     1   4.0
1  foo     2   4.0
2  boo     3   NaN


Out[118]:
   key  lval  rval
0  foo     1   4.0
1  foo     2   4.0
2  boo     3   0.0
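fillna also accepts a dict mapping column names to fill values, which is handy when different columns need different defaults. A small sketch of ours, using the same left and right frames (re-running the merge so the NaN is still present):

In [ ]:
#per-column fill values: only the rval column gets filled, other columns are untouched
pd.merge(left, right, on='key', how='left').fillna({'rval': 0})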

You can also use the join method to perform joins between dataframes.

It works very similarly to merge, but it joins on the index (rather than on arbitrary columns) and performs a left outer join by default.

In [119]:
dfb.join(dgoog,lsuffix='_FB',rsuffix='_GOOG')
Out[119]:
                           open_FB  close_FB  low_FB  high_FB  volume_FB      gain_FB  profit_FB  open_GOOG  close_GOOG  low_GOOG  high_GOOG  volume_GOOG   gain_GOOG  profit_GOOG
date
2018-01-02 00:00:00+00:00   177.68    181.42  177.55   181.58   17694891   large_gain       3.74    1053.02     1073.21   1053.02    1075.98      1555809  large_gain        20.19
2018-01-03 00:00:00+00:00   181.88    184.67  181.33   184.78   16595495  medium_gain       2.79    1073.93     1091.52   1073.43    1096.10      1550593  large_gain        17.59
2018-01-04 00:00:00+00:00   184.90    184.33  184.10   186.21   13554357     negative      -0.57    1097.09     1095.76   1094.26    1104.08      1289293    negative        -1.33
2018-01-05 00:00:00+00:00   185.59    186.85  184.93   186.90   13042388  medium_gain       1.26    1103.45     1110.29   1101.80    1113.58      1493389  large_gain         6.84
2018-01-08 00:00:00+00:00   187.20    188.28  186.33   188.90   14719216  medium_gain       1.08    1111.00     1114.21   1110.00    1119.16      1148958  large_gain         3.21
...                            ...       ...     ...      ...        ...          ...        ...        ...         ...       ...        ...          ...         ...          ...
2018-12-24 00:00:00+00:00   123.10    124.06  123.02   129.74   22066002   small_gain       0.96     984.32      984.67    977.66    1012.12      1817955  small_gain         0.35
2018-12-26 00:00:00+00:00   126.00    134.18  125.89   134.24   39723370   large_gain       8.18     997.99     1047.85    992.65    1048.45      2315862  large_gain        49.86
2018-12-27 00:00:00+00:00   132.44    134.52  129.67   134.99   31202509  medium_gain       2.08    1026.20     1052.90   1007.00    1053.34      2299806  large_gain        26.70
2018-12-28 00:00:00+00:00   135.34    133.20  132.20   135.92   22627569     negative      -2.14    1059.50     1046.68   1042.00    1064.23      1718352    negative       -12.82
2018-12-31 00:00:00+00:00   134.45    131.09  129.95   134.64   24625308     negative      -3.36    1057.83     1044.96   1033.04    1062.99      1655504    negative       -12.87

251 rows × 14 columns

In [120]:
left.index=left.key
left = left.drop('key',axis=1)
right.index =right.key
right = right.drop('key',axis=1)
left.join(right)
Out[120]:
     lval  rval
key
foo     1   4.0
foo     2   4.0
boo     3   NaN
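The default left outer join of the join method can be overridden with the how parameter. For example, an inner join keeps only the keys that appear in both frames:

In [ ]:
#inner join on the index: only keys present in both frames survive
left.join(right, how='inner')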