Introduction to Numpy, Scipy, SciKit-Learn¶

In this tutorial we will look into the Numpy library: http://www.numpy.org/

Numpy is a very important library for numerical computations and matrix manipulation. It has a lot of the functionality of Matlab, and some of the functionality of Pandas

We will also use the Scipy library for scientific computation: http://docs.scipy.org/doc/numpy/reference/index.html

In [98]:
import numpy as np
import scipy as sp
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
In [100]:
import scipy.sparse as sp_sparse
import scipy.spatial.distance as sp_dist
import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
from sklearn import preprocessing
import scipy.sparse.linalg as linalg

Why Numpy?¶

To demonstrate the benefits of using numpy we will perform the addition of two vectors using two different ways, one using lists, one using numpy and for loops, and one using the built-in addition operator of numpy

In [101]:
import time

def trad_version():
    t1 = time.time()
    X = range(10000000)
    Y = range(10000000)
    Z = [x+y for x,y in zip(X,Y)]
    return time.time() - t1

def naive_numpy_version():
    t1 = time.time()
    X = np.arange(10000000)  # Numpy array
    Y = np.arange(10000000)  # Numpy array
    Z = np.zeros(10000000)   # Numpy array
    for i in range(10000000):
        Z[i] = X[i]+Y[i]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(10000000)
    Y = np.arange(10000000)
    Z = X + Y
    return time.time() - t1


traditional_time = trad_version()
naive_numpy_time = naive_numpy_version()
numpy_time = numpy_version()
print ("Traditional time = "+ str(traditional_time))
print ("Naive numpy time = "+ str(naive_numpy_time))
print ("Numpy time       = "+ str(numpy_time))
Traditional time = 2.067781686782837
Naive numpy time = 6.293838024139404
Numpy time       = 0.14075922966003418

Arrays¶

Creating Arrays¶

In Numpy data is organized into arrays. There are many different ways to create a numpy array.

For the following we will use the random library of Numpy: http://docs.scipy.org/doc/numpy-1.10.0/reference/routines.random.html

Creating arrays from lists

In [ ]:
#1-dimensional arrays
x = np.array([2,5,18,14,4])
print ("\n Deterministic 1-dimensional array \n")
print (x)

#2-dimensional arrays
x = np.array([[2,5,18,14,4], [12,15,1,2,8]])
print ("\n Deterministic 2-dimensional array \n")
print (x)
 Deterministic 1-dimensional array 

[ 2  5 18 14  4]

 Deterministic 2-dimensional array 

[[ 2  5 18 14  4]
 [12 15  1  2  8]]

We can also create Numpy arrays from Pandas DataFrames

In [ ]:
d = {'A':[1., 2., 3., 4.],
     'B':[4., 3., 2., 1.]}
df = pd.DataFrame(d)
x = np.array(df)
print(x)
[[1. 4.]
 [2. 3.]
 [3. 2.]
 [4. 1.]]

Creating random arrays

In [ ]:
#1-dimensional arrays
x = np.random.rand(5)
print ("\n Random 1-dimensional array \n")
print (x)

#2-dimensional arrays

x = np.random.rand(5,5)
print ("\n Random 5x5 2-dimensional array \n")
print (x)

x = np.random.randint(10,size=(2,3))
print("\n Random 2x3 array with integers")
print(x)
 Random 1-dimensional array 

[0.50457438 0.33339231 0.75741759 0.86867317 0.96011403]

 Random 5x5 2-dimensional array 

[[0.18233561 0.08269357 0.27816999 0.96417408 0.69902401]
 [0.14659692 0.30611214 0.66732051 0.61254577 0.72106603]
 [0.41956495 0.8328845  0.94123913 0.69286056 0.74035365]
 [0.04374242 0.37434048 0.79260902 0.81932212 0.36689422]
 [0.13302682 0.78787713 0.84957869 0.02067925 0.33595578]]

 Random 2x3 array with integers
[[9 3 8]
 [9 5 1]]

Transpose and get array dimensions

In [ ]:
print("\n Matrix Dimensions \n")
print(x.shape)
print ("\n Transpose of the matrix \n")
print (x.T)
print ("\n Transpose matrix Dimensions \n")
print (x.T.shape)
 Matrix Dimensions 

(2, 3)

 Transpose of the matrix 

[[9 9]
 [3 5]
 [8 1]]

 Transpose matrix Dimensions 

(3, 2)

Special Arrays

In [ ]:
x = np.zeros((4,4))
print ("\n 4x4 array with zeros \n")
print(x)

x = np.ones((4,4))
print ("\n 4x4 array with ones \n")
print (x)

x = np.eye(4)
print ("\n Identity matrix of size 4\n")
print(x)

x = np.diag([1,2,3])
print ("\n Diagonal matrix\n")
print(x)
 4x4 array with zeros 

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

 4x4 array with ones 

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

 Identity matrix of size 4

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

 Diagonal matrix

[[1 0 0]
 [0 2 0]
 [0 0 3]]

Diagonal matrices are useful because we can use them to mutliply the rows or columns of a matrix by a value. For example below we will multiply the first row of matrix A with 2 and the second row with 3.

In [ ]:
A = np.random.randint(10,size=(2,3))
A
Out[ ]:
array([[3, 0, 7],
       [5, 6, 9]])
In [ ]:
v = np.array([4,5])
D = np.diag(v)
print(D@A)
[[12  0 28]
 [25 30 45]]

Operations on arrays.¶

These are very similar to what we did with Pandas

In [ ]:
x = np.random.randint(10, size = (2,4))
print (x)
print('\n mean value of all elements')
print (np.mean(x))
print('\n vector of mean values for columns')
print (np.mean(x,0)) #0 signifies the dimension meaning columns, similar to axis=0 in pandas
print('\n vector of mean values for rows')
print (np.mean(x,1)) #1 signifies the dimension meaning rows, similar to axis=1 in pandas
[[0 4 9 0]
 [0 3 9 2]]

 mean value of all elements
3.375

 vector of mean values for columns
[0.  3.5 9.  1. ]

 vector of mean values for rows
[3.25 3.5 ]
In [ ]:
print (x)
print('\n standard deviation of all elements')
print (np.std(x))
print('\n vector of std values for rows')
print (np.std(x,1)) #1 signifies the dimension meaning rows
print('\n median value of all elements')
print (np.median(x))
print('\n vector of median values for rows')
print (np.median(x,1))
print('\n sum of all elements')
print (np.sum(x))
print('\n vector of column sums')
print (np.sum(x,0))
print('\n product of all elements')
print (np.prod(x))
print('\n vector of row products')
print (np.prod(x,1))
[[0 4 9 0]
 [0 3 9 2]]

 standard deviation of all elements
3.5333235062756425

 vector of std values for rows
[3.69966215 3.35410197]

 median value of all elements
2.5

 vector of median values for rows
[2.  2.5]

 sum of all elements
27

 vector of column sums
[ 0  7 18  2]

 product of all elements
0

 vector of row products
[0 0]
In [ ]:
x
Out[ ]:
array([[0, 4, 9, 0],
       [0, 3, 9, 2]])

What is going on in this example?

In [ ]:
s = np.sum(x,1)
s = np.diag(1/s)
s@x
Out[ ]:
array([[0.        , 0.30769231, 0.69230769, 0.        ],
       [0.        , 0.21428571, 0.64285714, 0.14285714]])

Manipulating arrays¶

Accessing and Slicing¶

Here the operations are similar to those used in lists, or in Matlab

Again dont forget in Python the 0-indexing !!

python 0 indexing joke.png

In [ ]:
x = np.random.rand(4,3)
print(x)
print("\n element\n")
print(x[1,2])
print("\n row zero \n")
print(x[0,:])
print(x[0])
print('\nfirst two rows\n')
print(x[0:2])
print("\n column 2 \n")
print(x[:,2])
print("\n submatrix \n")
print(x[1:3,0:2])
print("\n entries > 0.5 \n")
print(x[x>0.5])
[[0.2842668  0.84076974 0.11116952]
 [0.32006523 0.57831104 0.43788748]
 [0.40621726 0.3021298  0.00387381]
 [0.92741045 0.62079339 0.98846728]]

 element

0.43788747706943076

 row zero 

[0.2842668  0.84076974 0.11116952]
[0.2842668  0.84076974 0.11116952]

first two rows

[[0.2842668  0.84076974 0.11116952]
 [0.32006523 0.57831104 0.43788748]]

 column 2 

[0.11116952 0.43788748 0.00387381 0.98846728]

 submatrix 

[[0.32006523 0.57831104]
 [0.40621726 0.3021298 ]]

 entries > 0.5 

[0.84076974 0.57831104 0.92741045 0.62079339 0.98846728]

Changing entries¶

In [ ]:
x = np.random.rand(4,3)
print(x)

x[2,0] = -5      #change an entry
x[:2,:] += 1     #change a set of rows: add 1 to all the elements of the first two rows
x[2:4,1:3] = 0.5 #change a block
print('\n')
print(x)

print('\n Set entries > 0.5 to zero')
x[x>0.5] = 0
print(x)
[[0.92221645 0.94827402 0.7846012 ]
 [0.47433553 0.81953789 0.69990238]
 [0.59931135 0.71572562 0.02280937]
 [0.68927939 0.23903059 0.88867906]]


[[ 1.92221645  1.94827402  1.7846012 ]
 [ 1.47433553  1.81953789  1.69990238]
 [-5.          0.5         0.5       ]
 [ 0.68927939  0.5         0.5       ]]

 Set entries > 0.5 to zero
[[ 0.   0.   0. ]
 [ 0.   0.   0. ]
 [-5.   0.5  0.5]
 [ 0.   0.5  0.5]]

Manipulating the diagonal¶

In [ ]:
x = np.random.rand(4,4)
print(x)
print('\n Read Diagonal \n')
print(x.diagonal())
print('\n Fill Diagonal with 1s \n')
np.fill_diagonal(x,1)
print(x)
print('\n Fill Diagonal with vector \n')
x[np.diag_indices_from(x)] = [1,2,3,4]
print(x)
[[0.53465514 0.73452408 0.6076236  0.04444903]
 [0.45891009 0.151404   0.72442041 0.70714302]
 [0.54433935 0.03699876 0.5773934  0.01864703]
 [0.66722031 0.39552228 0.81874451 0.1935456 ]]

 Read Diagonal 

[0.53465514 0.151404   0.5773934  0.1935456 ]

 Fill Diagonal with 1s 

[[1.         0.73452408 0.6076236  0.04444903]
 [0.45891009 1.         0.72442041 0.70714302]
 [0.54433935 0.03699876 1.         0.01864703]
 [0.66722031 0.39552228 0.81874451 1.        ]]

 Fill Diagonal with vector 

[[1.         0.73452408 0.6076236  0.04444903]
 [0.45891009 2.         0.72442041 0.70714302]
 [0.54433935 0.03699876 3.         0.01864703]
 [0.66722031 0.39552228 0.81874451 4.        ]]

Quiz¶

We want to create a dataset of 10 users and 5 items, where each user i selects an item j with probability 0.3.

How can we do this with matrix operations?

In [ ]:
x = np.random.rand(10,5)
x[x<0.3]=1
x[x!=1]=0
x
Out[ ]:
array([[0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1.],
       [0., 1., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.]])

Or another similar way:

In [ ]:
D = np.random.rand(10,5)
print(D)
D[D>=0.7] = 1     # 30% chance
D[D< 0.7] = 0     # 70% chance
#D[D <= 0.3] = 1
#D[D != 1] = 0
D
[[0.65425279 0.66935535 0.66740194 0.60666757 0.97978199]
 [0.34101405 0.83097327 0.44760625 0.67784447 0.67050312]
 [0.14823977 0.28683124 0.4375794  0.89159433 0.31809913]
 [0.69930581 0.46244382 0.99249947 0.59998347 0.53639178]
 [0.88846221 0.06707406 0.71958525 0.66636098 0.82200912]
 [0.45051753 0.15884326 0.16406689 0.19519915 0.34009175]
 [0.46275625 0.8511663  0.51038054 0.2665805  0.18880077]
 [0.00213541 0.96584797 0.62825686 0.13478724 0.4421784 ]
 [0.03507128 0.04660249 0.65641959 0.83667493 0.19624016]
 [0.5139164  0.26268219 0.38399554 0.87902329 0.12176352]]
Out[ ]:
array([[0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]])

Operations with Arrays¶

Multiplication and addition with scalar¶

In [ ]:
x = np.random.rand(4,3)
print(x)

#multiplication and addition with scalar value
print("\n Matrix 2x+1 \n")
print(2*x+1)
[[0.52530579 0.41955143 0.63449375]
 [0.10039893 0.35166512 0.64652786]
 [0.03893962 0.721026   0.69740297]
 [0.75940929 0.08626766 0.67322999]]

 Matrix 2x+1 

[[2.05061159 1.83910286 2.26898751]
 [1.20079785 1.70333024 2.29305572]
 [1.07787923 2.442052   2.39480594]
 [2.51881858 1.17253531 2.34645998]]

Vector-vector dot product

There are three ways to get the dot product of two vectors:

  • Using the method .dot of an array
  • Using the method dot of the numpy library
  • Using the '@' operator
  • In [ ]:
    y = np.array([2,-1,3])
    z = np.array([-1,2,2])
    print('\n y:',y)
    print(' z:',z)
    print('\n vector-vector dot product')
    print(y.dot(z))
    print(np.dot(y,z))
    print(z@y)
    
     y: [ 2 -1  3]
     z: [-1  2  2]
    
     vector-vector dot product
    2
    2
    2
    

    External product

    The external product between two vectors x,y of size (n,1) and (m,1) results in a matrix M of size (n,m) with entries M(i,j) = x(i)*y(j)

    In [ ]:
    print('\n y:',y)
    print(' z:',z)
    print('\n vector-vector external product')
    print(np.outer(y,z))
    
     y: [ 2 -1  3]
     z: [-1  2  2]
    
     vector-vector external product
    [[-2  4  4]
     [ 1 -2 -2]
     [-3  6  6]]
    

    Element-wise operations

    In [ ]:
    print('\n y:',y)
    print(' z:',z)
    print('\n element-wise addition')
    print(y+z)
    print('\n element-wise product')
    print(y*z)
    print('\n element-wise division')
    print(y/z)
    print('\n element-wise inversion')
    print(1/y)
    
     y: [ 2 -1  3]
     z: [-1  2  2]
    
     element-wise addition
    [1 1 5]
    
     element-wise product
    [-2 -2  6]
    
     element-wise division
    [-2.  -0.5  1.5]
    
     element-wise inversion
    [ 0.5        -1.          0.33333333]
    

    Matrix-Vector multiplication

    Again we can do the multiplication either using the dot method or the '@' operator

    In [ ]:
    X = np.random.randint(10, size = (4,3))
    print('Matrix X:\n',X)
    y = np.array([1,0,0])
    print("\n Matrix-vector right multiplication with",y,"\n")
    print(X.dot(y))
    print(np.dot(X,y))
    print(X@y)
    print(y.shape)
    y = np.array([1,0,1,0])
    print("\n Matrix-vector left multiplication with",y,"\n")
    print(y.dot(X),'\n')
    print(np.dot(y,X),'\n')
    print(y@X,'\n')
    print(y.shape)
    
    Matrix X:
     [[6 6 3]
     [5 6 1]
     [9 9 5]
     [7 8 8]]
    
     Matrix-vector right multiplication with [1 0 0] 
    
    [6 5 9 7]
    [6 5 9 7]
    [6 5 9 7]
    (3,)
    
     Matrix-vector left multiplication with [1 0 1 0] 
    
    [15 15  8] 
    
    [15 15  8] 
    
    [15 15  8] 
    
    (4,)
    

    Matrix-Matrix multiplication

    Same for the matrix-matrix operation

    In [ ]:
    Y = np.random.randint(10, size=(3,2))
    print("\n Matrix-matrix multiplication\n")
    print('Matrix X:\n',X)
    print('Matrix Y:\n',Y)
    print('Product:\n',X.dot(Y))
    print('Product:\n',np.dot(X,Y))
    print('Product:\n',X@Y)
    
     Matrix-matrix multiplication
    
    Matrix X:
     [[6 6 3]
     [5 6 1]
     [9 9 5]
     [7 8 8]]
    Matrix Y:
     [[3 6]
     [1 7]
     [4 1]]
    Product:
     [[ 36  81]
     [ 25  73]
     [ 56 122]
     [ 61 106]]
    Product:
     [[ 36  81]
     [ 25  73]
     [ 56 122]
     [ 61 106]]
    Product:
     [[ 36  81]
     [ 25  73]
     [ 56 122]
     [ 61 106]]
    

    Matrix-Matrix element-wise operations

    In [ ]:
    Z = np.random.randint(10, size=(3,2))+1
    print('Matrix Y:\n',Y)
    print('Matrix Z:\n',Z)
    print("\n Matrix-matrix element-wise addition\n")
    print(Y+Z)
    print("\n Matrix-matrix element-wise multiplication\n")
    print(Y*Z)
    print("\n Matrix-matrix element-wise division\n")
    print(Y/Z)
    
    Matrix Y:
     [[3 6]
     [1 7]
     [4 1]]
    Matrix Z:
     [[4 6]
     [7 5]
     [8 9]]
    
     Matrix-matrix element-wise addition
    
    [[ 7 12]
     [ 8 12]
     [12 10]]
    
     Matrix-matrix element-wise multiplication
    
    [[12 36]
     [ 7 35]
     [32  9]]
    
     Matrix-matrix element-wise division
    
    [[0.75       1.        ]
     [0.14285714 1.4       ]
     [0.5        0.11111111]]
    

    Inverting a matrix¶

    In [ ]:
    M = np.random.randint(10,size=(3,3))
    print(M)
    print('\nInversion\n')
    print(np.invert(M))
    
    [[1 5 4]
     [5 6 1]
     [5 6 3]]
    
    Inversion
    
    [[-2 -6 -5]
     [-6 -7 -2]
     [-6 -7 -4]]
    

    Creating Sparse Arrays¶

    For sparse arrays we need to use the sp_sparse library from SciPy: http://docs.scipy.org/doc/scipy/reference/sparse.html

    There are three types of sparse matrices:

    • csr: compressed row format
    • csc: compressed column format
    • lil: list of lists format
    • coo: coordinates format
    These different types have to do with the matrix implementation and the data structures used.

    The csr and csc formats are fast for arithmetic operations, but slow for slicing and incremental changes.

    The lil format is fast for slicing and incremental construction, but slow for arithmetic operations.

    The coo format does not support arithmetic operations and slicing, but it is very fast for constructing a matrix incrementally. You should then transform it to some other format for operations.

    Creation of matrix from triplets

    Triplets are of the form (row, column, value)

    In [ ]:
    import scipy.sparse as sp_sparse
    
    # d = matrix of elements and there possitions ( graph edges (0,1) and weights (2) )
    d = np.array([[0, 0, 12],
                  [0, 1, 1],
                  [0, 5, 34],
                  [1, 3, 12],
                  [1, 2, 6],
                  [2, 0, 23],
                  [3, 4, 14],
                  ])
    row = d[:,0]
    col = d[:,1]
    data = d[:,2]
    
    # a matrix M with M[row[i],col[i]] = data[i] will be created
    M = sp_sparse.csr_matrix((data,(row,col)), shape=(4,6))      # Create a Compressed Sparse Row (CSR) matrix M
    print(M)
    print('\n')
    print(M.toarray()) #transforms back to full matrix
    
    <Compressed Sparse Row sparse matrix of dtype 'int64'
    	with 7 stored elements and shape (4, 6)>
      Coords	Values
      (0, 0)	12
      (0, 1)	1
      (0, 5)	34
      (1, 2)	6
      (1, 3)	12
      (2, 0)	23
      (3, 4)	14
    
    
    [[12  1  0  0  0 34]
     [ 0  0  6 12  0  0]
     [23  0  0  0  0  0]
     [ 0  0  0  0 14  0]]
    

    The method toarray() creates a dense matrix. If you need a sparse matrix you should not use it.

    Making a full matrix sparse

    In [ ]:
    x = np.random.randint(2,size = (3,4))
    print(x)
    print('\n make x sparce')
    A = sp_sparse.csr_matrix(x)
    print(A)
    
    [[1 0 0 1]
     [0 1 0 0]
     [0 0 0 0]]
    
     make x sparce
    <Compressed Sparse Row sparse matrix of dtype 'int64'
    	with 3 stored elements and shape (3, 4)>
      Coords	Values
      (0, 0)	1
      (0, 3)	1
      (1, 1)	1
    

    Creating a sparse matrix incrementally

    In [ ]:
    # Use lil (list of lists) representation, you can also use coo (coordinates)
    A = sp_sparse.lil_matrix((10, 10))
    A[0, :5] = np.random.randint(10,size = 5)
    A[1, 5:10] = A[0, :5]
    A.setdiag(np.random.randint(10,size = 10))
    A[9,9] = 99
    A[9,0]=1
    print(A)
    print('\n')
    print(A.toarray())
    print('\n')
    print(A.diagonal())
    A = A.tocsr() # makes it to csr format (Now that we have finished building it with lil)
    B = A.dot(np.ones(10)) # Csr better for dot product.
    print('\n')
    print(B)
    
    <List of Lists sparse matrix of dtype 'float64'
    	with 17 stored elements and shape (10, 10)>
      Coords	Values
      (0, 0)	7.0
      (0, 1)	6.0
      (0, 3)	7.0
      (0, 4)	2.0
      (1, 1)	5.0
      (1, 5)	3.0
      (1, 6)	6.0
      (1, 8)	7.0
      (1, 9)	2.0
      (2, 2)	4.0
      (3, 3)	2.0
      (4, 4)	2.0
      (5, 5)	1.0
      (6, 6)	7.0
      (8, 8)	5.0
      (9, 0)	1.0
      (9, 9)	99.0
    
    
    [[ 7.  6.  0.  7.  2.  0.  0.  0.  0.  0.]
     [ 0.  5.  0.  0.  0.  3.  6.  0.  7.  2.]
     [ 0.  0.  4.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  2.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  2.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  7.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  5.  0.]
     [ 1.  0.  0.  0.  0.  0.  0.  0.  0. 99.]]
    
    
    [ 7.  5.  4.  2.  2.  1.  7.  0.  5. 99.]
    
    
    [ 22.  23.   4.   2.   2.   1.   7.   0.   5. 100.]
    

    All operations work like before

    In [ ]:
    print(A.dot(A.T))
    
    <Compressed Sparse Row sparse matrix of dtype 'float64'
    	with 25 stored elements and shape (10, 10)>
      Coords	Values
      (0, 4)	4.0
      (0, 3)	14.0
      (0, 1)	30.0
      (0, 9)	7.0
      (0, 0)	138.0
      (1, 9)	198.0
      (1, 8)	35.0
      (1, 6)	42.0
      (1, 5)	3.0
      (1, 1)	123.0
      (1, 0)	30.0
      (2, 2)	16.0
      (3, 3)	4.0
      (3, 0)	14.0
      (4, 4)	4.0
      (4, 0)	4.0
      (5, 5)	1.0
      (5, 1)	3.0
      (6, 6)	49.0
      (6, 1)	42.0
      (8, 8)	25.0
      (8, 1)	35.0
      (9, 1)	198.0
      (9, 9)	9802.0
      (9, 0)	7.0
    
    In [ ]:
    A[0].mean()
    
    Out[ ]:
    np.float64(2.2000000000000006)

    Singluar Value Decomposition¶

    For the singular value decomposition we will use the libraries from Numpy and SciPy and SciKit Learn

    image.png

    We use sklearn to create a low-rank matrix (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html). We will create a matrix with effective rank 2.

    In [62]:
    import sklearn.datasets as sk_data
    
    data = sk_data.make_low_rank_matrix(n_samples=100, n_features=50, effective_rank=2, tail_strength=0.0, random_state=None)
    sns.heatmap(data, xticklabels=False, yticklabels=False, linewidths=0)
    
    Out[62]:
    <Axes: >
    No description has been provided for this image

    We will use the numpy.linalg.svd function to compute the Singular Value Decomposition of the matrix we created (http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html).

    In [63]:
    U, s, V = np.linalg.svd(data,full_matrices = False)
    print (U.shape, s.shape, V.shape)
    print('\n')
    print(s)
    print('\n')
    print(s[0:10])
    plt.plot(s[0:10])
    plt.ylabel('singular value')
    plt.xlabel('number of singular values')
    
    (100, 50) (50,) (50, 50)
    
    
    [1.00000000e+00 7.78800783e-01 3.67879441e-01 1.05399225e-01
     1.83156389e-02 1.93045414e-03 1.23409804e-04 4.78511739e-06
     1.12535175e-07 1.60522805e-09 1.38879427e-11 7.28724346e-14
     2.34399394e-16 9.54441483e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 6.81365481e-17 6.81365481e-17 6.81365481e-17
     6.81365481e-17 3.39251633e-17]
    
    
    [1.00000000e+00 7.78800783e-01 3.67879441e-01 1.05399225e-01
     1.83156389e-02 1.93045414e-03 1.23409804e-04 4.78511739e-06
     1.12535175e-07 1.60522805e-09]
    
    Out[63]:
    Text(0.5, 0, 'number of singular values')
    No description has been provided for this image

    We can also use the scipy.sparse.linalg libary to compute the SVD for sparse matrices (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.linalg.svds.html)

    We need to specify the number of components, otherwise it is by default k = 6. The singular values are in increasing order.

    In [64]:
    import scipy.sparse.linalg as sp_linalg
    
    data2 = sp_sparse.csc_matrix(data)    # Convers dense matrix into sparce
    print(data2.shape)
    U,s,V = sp_linalg.svds(data2, k = 10) # By default returns k=6 singular values (truncated)
    print (U.shape, s.shape, V.shape)     # Shapes now with 10
    print(s)
    plt.plot(s[::-1]) #invert the order of the singular values
    plt.ylabel('eigenvalue value')
    plt.xlabel('number of eigenvalues')
    
    (100, 50)
    (100, 10) (10,) (10, 50)
    [1.60522805e-09 1.12535175e-07 4.78511739e-06 1.23409804e-04
     1.93045414e-03 1.83156389e-02 1.05399225e-01 3.67879441e-01
     7.78800783e-01 1.00000000e+00]
    
    Out[64]:
    Text(0.5, 0, 'number of eigenvalues')
    No description has been provided for this image

    We can also compute SVD using the library of SciKit Learn (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)

    In [66]:
    from sklearn.decomposition import TruncatedSVD
    
    K = 10
    svd = TruncatedSVD(n_components=K)
    svd.fit(data2)
    print(svd.components_.shape) # the V vectors
    print(svd.transform(data2).shape) # the U vectors
    print(svd.singular_values_)
    
    (10, 50)
    (100, 10)
    [1.00000000e+00 7.78800783e-01 3.67879441e-01 1.05399225e-01
     1.83156389e-02 1.93045414e-03 1.23409804e-04 4.78511739e-06
     1.12535175e-07 1.60522805e-09]
    

    Obtaining a low rank approximation of the data¶

    To obtain a rank-k approximation of the matrix we multiplty the k first columns of U, with the diagonal matrix with the k first (largest) singular values, with the matrix with the first k rows of V transpose

    In [69]:
    K = 6
    U_k,s_k,V_k = sp_linalg.svds(data2, K, which = 'LM')  # LM = Largest Magnitude
    print (U_k.shape, s_k.shape, V_k.shape)
    print(s_k)
    plt.plot(s_k[::-1])
    plt.ylabel('eigenvalue value')
    plt.xlabel('number of eigenvalues')
    S_k = np.diag(s_k)
    k=K
    S_k[k:,k:]
    
    (100, 6) (6,) (6, 50)
    [0.00193045 0.01831564 0.10539922 0.36787944 0.77880078 1.        ]
    
    Out[69]:
    array([], shape=(0, 0), dtype=float64)
    No description has been provided for this image
    In [70]:
    reconstruction_error = []
    for k in range(K-1,-1,-1): #iterate from end to start
        r = K-k
        #print(S_k[k:,k:])
        data_k = U_k[:,k:].dot(S_k[k:,k:]).dot(V_k[k:,:]) #here we obtain the rank-r matrix
        error = np.linalg.norm(data_k-data2,ord='fro')
        reconstruction_error.append(error)
        print(r,error)
    data_k = U_k.dot(S_k).dot(V_k)
    print(np.linalg.norm(data_k-data2,ord='fro'))
    
    plt.plot(1+np.array(range(6)),reconstruction_error)
    plt.ylabel('rank-k reconstruction error')
    plt.xlabel('rank')
    
    1 0.8679367165994608
    2 0.38312332780557673
    3 0.10699626662742326
    4 0.01841750618200929
    5 0.0019344006983659247
    6 0.00012350259009393956
    0.00012350259009393956
    
    Out[70]:
    Text(0.5, 0, 'rank')
    No description has been provided for this image

    An example¶

    We will create a block diagonal matrix, with blocks of different "intensity" of values

    In [71]:
    import numpy as np
    
    M1 = np.random.randint(1,50,(50,20))
    M2 = np.random.randint(1,10,(50,20))
    M3 = np.random.randint(1,10,(50,20))
    M4 = np.random.randint(1,50,(50,20))
    
    T = np.concatenate((M1,M2),axis=1)
    B = np.concatenate((M3,M4),axis=1)
    M = np.concatenate((T,B),axis = 0)
    
    In [72]:
    # Matrix Visualized as Heatmap
    plt.figure(figsize=(8, 6))
    im = plt.imshow(M, cmap='hot')
    plt.colorbar(im, label='Matrix Value') # Pass the image object 'im' to colorbar
    plt.show()
    
    No description has been provided for this image

    We observe that there is a correlation between the column and row sums and the left and right singular vectors

    How is this clear from the Pearson Correlation results?

    Note: The values of the vectors are negative. We would get the same result if we make them positive.

    In [73]:
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    
    (U,S,V) = np.linalg.svd(M,full_matrices = False)
    #print(S)
    c = M.sum(0)
    r = M.sum(1)
    print(stats.pearsonr(r,U[:,0]))
    print(stats.pearsonr(c,V[0]))
    plt.scatter(r,U[:,0])
    plt.figure()
    plt.scatter(c,V[0])
    
    PearsonRResult(statistic=np.float64(-0.9721346141745567), pvalue=np.float64(1.504249108163752e-63))
    PearsonRResult(statistic=np.float64(-0.9364675493665189), pvalue=np.float64(7.01937908198349e-19))
    
    Out[73]:
    <matplotlib.collections.PathCollection at 0x795ca54a4320>
    No description has been provided for this image
    No description has been provided for this image

    We observe a famous property of SVD on non-negative matrices:

    The first singular vector (the largest component) acts as an average row/column sum estimator.

    Using the first two signular vectors we can clearly differentiate the two blocks of rows

    In [74]:
    plt.scatter(U[:,0],U[:,1])
    
    Out[74]:
    <matplotlib.collections.PathCollection at 0x795c96107f80>
    No description has been provided for this image
    In [75]:
    plt.scatter(x = U[:50,0],y = U[:50,1], color = 'r')
    plt.scatter(x = U[50:,0],y = U[50:,1], color = 'b')
    
    Out[75]:
    <matplotlib.collections.PathCollection at 0x795c93f07650>
    No description has been provided for this image

    As is expected this clear clusterring can also be confirmed from the singular values

    In [76]:
    plt.plot(S)
    
    Out[76]:
    [<matplotlib.lines.Line2D at 0x795c9179afc0>]
    No description has been provided for this image

    PCA using SciKit Learn¶

    We will now use the PCA package from the SciKit Learn (sklearn) library. PCA is the same as SVD but now the matrix is centered: the mean is removed from the columns of the matrix.

    You can read more here: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

    In [78]:
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(M)
    
    Out[78]:
    PCA(n_components=2)
    In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
    On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
    PCA(n_components=2)
    In [79]:
    pca.components_
    
    Out[79]:
    array([[ 0.10515321,  0.19404979,  0.14055598,  0.1469998 ,  0.16922203,
             0.13837153,  0.16201805,  0.14300106,  0.15004895,  0.16414542,
             0.15661362,  0.16215106,  0.17030279,  0.16546738,  0.1241947 ,
             0.14789572,  0.15698525,  0.1430435 ,  0.15675912,  0.16973438,
            -0.15590506, -0.17195176, -0.14194407, -0.1626008 , -0.15910389,
            -0.15049444, -0.16594936, -0.17213403, -0.18228625, -0.14769501,
            -0.18126023, -0.16654362, -0.18139975, -0.15408992, -0.18369372,
            -0.16568987, -0.15259719, -0.14687577, -0.15546674, -0.12202896],
           [-0.31651872,  0.14371345, -0.14229102,  0.30086307,  0.0695832 ,
             0.14437878,  0.27759904,  0.13356004,  0.23907233,  0.07863036,
            -0.2278267 , -0.20982028,  0.28510476, -0.21277656, -0.17279117,
             0.12673185,  0.09861236, -0.18693302,  0.4152043 , -0.18194852,
            -0.00978419,  0.04326397,  0.01999254,  0.00495667, -0.0335748 ,
            -0.01494252,  0.06097415,  0.06449193,  0.07299188,  0.00307976,
             0.09940295,  0.07348219,  0.10467432, -0.02862334,  0.08325732,
            -0.01290569,  0.03593026,  0.03630658,  0.09609304,  0.05425316]])
    In [80]:
    plt.scatter(pca.components_[0],pca.components_[1])
    
    Out[80]:
    <matplotlib.collections.PathCollection at 0x795c95d3b0e0>
    No description has been provided for this image

    Using the operation transform we can transform the data directly to the lower-dimensional space

    In [81]:
    MPCA = pca.transform(M)
    print(MPCA.shape)
    
    (100, 2)
    
    In [82]:
    plt.scatter(MPCA[:,0],MPCA[:,1])
    
    Out[82]:
    <matplotlib.collections.PathCollection at 0x795c95d38920>
    No description has been provided for this image

    We will now experiment with a well-known dataset of data analysis, the iris dataset:

    https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

    In [84]:
    from sklearn import datasets
    iris = datasets.load_iris()  # 4 Features: Sepal length/width , Petal length/width
    X = iris.data
    y = iris.target              # contains the labels of the data (3 flower targets)
    
    pca = PCA(n_components=3)
    pca.fit(X)
    X = pca.transform(X)
    pca.components_
    
    Out[84]:
    array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
           [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
           [-0.58202985,  0.59791083,  0.07623608,  0.54583143]])
    In [85]:
    pca.explained_variance_
    
    Out[85]:
    array([4.22824171, 0.24267075, 0.0782095 ])
    In [86]:
    plt.scatter(X[:,0],X[:,1])
    
    Out[86]:
    <matplotlib.collections.PathCollection at 0x795c91795d60>
    No description has been provided for this image
    In [87]:
    plt.scatter(X[y==0,0],X[y==0,1], color='b')
    plt.scatter(X[y==1,0],X[y==1,1], color='r')
    plt.scatter(X[y==2,0],X[y==2,1], color='g')
    
    Out[87]:
    <matplotlib.collections.PathCollection at 0x795c917d0b60>
    No description has been provided for this image
    In [88]:
    from mpl_toolkits.mplot3d import Axes3D
    import plotly.express as px
    import pandas as pd
    
    # Convert data to a DataFrame for Plotly
    df = pd.DataFrame(X, columns=['PC1', 'PC2', 'PC3'])
    df['label'] = y
    
    # Create an interactive 3D scatter plot
    fig = px.scatter_3d(
        df,
        x='PC1',
        y='PC2',
        z='PC3',
        color='label',
        title='3D PCA Plot',
        labels={'label': 'Class'}
    )
    
    fig.update_traces(marker=dict(size=5))
    fig.show()
    
    In [53]:
    from sklearn.datasets import fetch_20newsgroups
    
    categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
    news_data = fetch_20newsgroups(subset='train', categories=categories)
    
    In [54]:
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vectorizer = TfidfVectorizer(stop_words='english', min_df=4,max_df=0.8)
    dtm = vectorizer.fit_transform(news_data.data)
    
    In [89]:
    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize, sent_tokenize
    
    import nltk
    nltk.download('punkt_tab')
    
    
    stemmed_data = [" ".join(SnowballStemmer("english", ignore_stopwords=True).stem(word)
           for sent in sent_tokenize(message)
            for word in word_tokenize(sent))
            for message in news_data.data]
    
    # stemmed_data = news_data.data
    
    [nltk_data] Downloading package punkt to /root/nltk_data...
    [nltk_data]   Package punkt is already up-to-date!
    [nltk_data] Downloading package stopwords to /root/nltk_data...
    [nltk_data]   Package stopwords is already up-to-date!
    [nltk_data] Downloading package punkt_tab to /root/nltk_data...
    [nltk_data]   Package punkt_tab is already up-to-date!
    
    In [90]:
    dtm = vectorizer.fit_transform(stemmed_data)
    terms = vectorizer.get_feature_names_out()
    print(terms)
    
    ['00' '000' '0005' ... 'zx' 'zy' 'zz']
    
    In [92]:
    dtm_dense = dtm.todense() # TfIdfVectorizer by default creates a sparce matrix: We convert it back to dense
    centered_dtm = dtm_dense - np.mean(dtm_dense, axis=0) # Centering data before doing PCA
    np.sum(centered_dtm,axis=0)[:,:10]
    
    Out[92]:
    matrix([[-6.86083135e-16, -2.56565602e-15, -2.49019555e-15,
             -2.45983789e-15,  5.33384101e-15, -1.18178037e-17,
              3.10255294e-15, -3.24393290e-15, -1.11624035e-14,
              3.10553449e-15]])
    In [93]:
    u, s, vt = np.linalg.svd(centered_dtm)
    
    In [94]:
    plt.xlim([0,50])
    plt.plot(range(1,len(s)+1),s)
    
    Out[94]:
    [<matplotlib.lines.Line2D at 0x795c94004410>]
    No description has been provided for this image
    In [95]:
    k=2
    vectorsk = np.array(u[:,:k] @ np.diag(s[:k])) # (Document Vectors) * (Singular Values)
    labels = [news_data.target_names[i] for i in news_data.target]
    sns.scatterplot(x=vectorsk[:,0], y=vectorsk[:, 1], hue=labels)
    
    Out[95]:
    <Axes: >
    No description has been provided for this image

    Pairwise Relationship of Principal Components

    In [96]:
    import seaborn as sns
    k = 5
    Xk = u[:,:k] @ np.diag(s[:k])
    X_df = pd.DataFrame(Xk)
    g = sns.PairGrid(X_df)
    g.map(plt.scatter)
    
    Out[96]:
    <seaborn.axisgrid.PairGrid at 0x795c93ff8740>
    No description has been provided for this image
    In [97]:
    terms = vectorizer.get_feature_names_out()
    for i in range(6):
        top = np.argsort(np.abs(vt[i]))
        topterms = [terms[top[0,f]] for f in range(12)]
        print (i, topterms)
    
    0 ['ate', 'mwwhj', 'mwwiz', 'pmf9f9fq', 'wrj', 'wwhjnuy', 'f3w2tg', '7kmzwt', '9f9f9f9f', 'm9l0qax', 'f9f9f9f9f0', 'wwizw']
    1 ['m2j', 'dle', 'ahl', 'u45', '0va', 'mwwhj', 'mwwiz', 'pmf9f9fq', 'wrj', 'f3w2tg', 'wwhjnuy', '7kmzwt']
    2 ['sake', 'bobbi', 'mound', 'macdonald', 'dia', 'clarifi', 'hansen', 'glavin', 'welcom', 'polytechn', '9g', 'sleep']
    3 ['wrist', 'prowess', '60', 'division', 'desktop', 'swartzendrub', 'dswartz', 'slam', '636', 'pretend', 'eye', 'wish']
    4 ['consider', '185', 'stretch', 'dud', 'hom', '649', '320', 't1', '270', 'luddington', 'cs902043', 'giant']
    5 ['missouri', 'av', '373', 'rumor', 'hlvs', 'magic', 'vr', 'mead', 'leak', 'enviro', 'lp', 'cp']