Introduction to Numpy, Scipy, SciKit-Learn

In this tutorial we will look into the Numpy library: http://www.numpy.org/

Numpy is a fundamental Python library for numerical computation and matrix manipulation. It offers much of the functionality of Matlab and some of the functionality of Pandas.

We will also use the Scipy library for scientific computation: http://docs.scipy.org/doc/numpy/reference/index.html

Why Numpy?

Arrays

Creating Arrays

In Numpy, data is organized into arrays. There are many different ways to create a Numpy array.

For the following we will use the random library of Numpy: http://docs.scipy.org/doc/numpy-1.10.0/reference/routines.random.html

Creating arrays from lists

We can also create Numpy arrays from Pandas DataFrames.
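A minimal sketch of both routes follows. The list contents and the column names a and b are made up for illustration, and DataFrame.to_numpy assumes a reasonably recent Pandas (0.24 or later).

```python
import numpy as np
import pandas as pd

# A flat list becomes a 1-D array; a list of lists becomes a 2-D array.
v = np.array([1, 2, 3])
A = np.array([[1, 2, 3], [4, 5, 6]])

# Hypothetical DataFrame, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
B = df.to_numpy()  # one array row per DataFrame row

print(v.shape, A.shape, B.shape)
```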

Creating random arrays
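One way to draw random arrays is sketched below. It uses the np.random.default_rng generator interface rather than the older np.random functions documented at the link above; the seed 0 and the array sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

U = rng.random((2, 3))           # uniform samples in [0, 1)
N = rng.normal(size=(2, 3))      # standard normal samples
Z = rng.integers(0, 10, size=5)  # integers from 0 (inclusive) to 10 (exclusive)

print(U)
print(N)
print(Z)
```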

Transpose and get array dimensions
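A short illustration, using a small made-up array:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

print(A.shape)    # (2, 3)
print(A.T.shape)  # (3, 2)
print(A.ndim)     # 2, the number of dimensions
print(A.size)     # 6, the total number of entries
```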

Special Arrays
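A few commonly used constructors are sketched here; the shapes and values are arbitrary.

```python
import numpy as np

Z = np.zeros((2, 3))     # all zeros
O = np.ones((2, 3))      # all ones
I = np.eye(3)            # 3x3 identity matrix
D = np.diag([1, 2, 3])   # diagonal matrix with the given diagonal
R = np.arange(0, 10, 2)  # evenly spaced values: [0 2 4 6 8]

print(Z, O, I, D, R, sep="\n")
```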

Operations on arrays

These are very similar to what we did with Pandas.
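For instance, aggregations take an axis argument much like their Pandas counterparts. The array below is made up for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

print(A.sum())         # 21, sum of all entries
print(A.sum(axis=0))   # [5 7 9], one sum per column
print(A.sum(axis=1))   # [ 6 15], one sum per row
print(A.mean(axis=1))  # [2. 5.], one mean per row
print(A.max())         # 6
```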

Manipulating arrays

Accessing and Slicing

Changing entries
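The sketch below shows indexing, slicing, and assignment on a small made-up array. Note that a slice is a view of the original data, so writing through it changes the array it came from.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # rows: [0..3], [4..7], [8..11]

print(A[1, 2])      # 6, a single entry
print(A[0])         # [0 1 2 3], the first row
print(A[:, 1])      # [1 5 9], the second column
print(A[0:2, 1:3])  # [[1 2] [5 6]], a sub-block
print(A[A > 8])     # [ 9 10 11], boolean mask

B = A.copy()        # work on a copy so A stays intact
B[0, 0] = 100       # change one entry
B[1, :] = 0         # change a whole row
view = B[2]         # a slice is a view, not a copy
view[0] = -8        # so this also changes B[2, 0]
print(B)
```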

Quiz

We want to create a dataset of 10 users and 5 items, where each user i selects item j with probability 0.3.

How can we do this with matrix operations?
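One possible answer, as a sketch: draw a 10x5 matrix of uniform random numbers and threshold it at 0.3, so that each entry is 1 with probability 0.3. The seed and the variable names are arbitrary, and other approaches (for example sampling from a binomial distribution) would work as well.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_users, n_items, p = 10, 5, 0.3

# Entry (i, j) is 1 if user i selected item j, and 0 otherwise.
D = (rng.random((n_users, n_items)) < p).astype(int)
print(D)
```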

Operations with Arrays

Multiplication and addition with scalar

Vector-vector dot product

There are three ways to get the dot product of two vectors:

  • Using the method .dot of an array

  • Using the function dot of the numpy library

  • Using the '@' operator

Outer product

The outer product of two vectors x, y of size (n,1) and (m,1) is a matrix M of size (n,m) with entries M(i,j) = x(i)*y(j).
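The three dot-product forms, and the product M(i,j) = x(i)*y(j) defined above, can be sketched as follows. NumPy exposes the latter as np.outer; the vectors here are made up for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = np.array([10, 20])

# Three equivalent ways to compute the dot product: 1*4 + 2*5 + 3*6 = 32
d1 = x.dot(y)
d2 = np.dot(x, y)
d3 = x @ y
print(d1, d2, d3)

# M has shape (3, 2) and M[i, j] = x[i] * z[j]
M = np.outer(x, z)
print(M)

# The same result with explicit column vectors of shape (n, 1) and (m, 1)
xc = x.reshape(-1, 1)
zc = z.reshape(-1, 1)
M2 = xc @ zc.T
```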

Element-wise operations

Matrix-Vector multiplication

Again, we can do the multiplication using either the dot method or the '@' operator.
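A small sketch with a made-up 3x2 matrix and a vector of length 2:

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
v = np.array([1, -1])                   # shape (2,)

r1 = A.dot(v)  # each entry is the dot product of a row of A with v
r2 = A @ v
print(r1)      # [-1 -1 -1]
print(r2)      # [-1 -1 -1]
```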

Matrix-Matrix multiplication

The same holds for matrix-matrix multiplication.

Matrix-Matrix element-wise operations
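To make the distinction concrete, the sketch below contrasts the matrix product ('@' or dot) with the element-wise operators on two made-up 2x2 matrices.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

P = A @ B  # matrix product (same as A.dot(B))
E = A * B  # element-wise product
S = A + B  # element-wise sum

print(P)   # [[2 1] [4 3]]
print(E)   # [[0 2] [3 0]]
print(S)   # [[1 3] [4 4]]
```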

Creating Sparse Arrays

For sparse arrays we need to use the sparse module of SciPy (scipy.sparse, referred to here as sp_sparse): http://docs.scipy.org/doc/scipy/reference/sparse.html

There are three types of sparse matrices:

Creation of matrix from triplets

Triplets are of the form (row, column, value).
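As a sketch, the coordinate (COO) format takes exactly such triplets, given as three parallel arrays. The triplet values below are made up, and the alias sp_sparse is assumed to stand for scipy.sparse.

```python
import numpy as np
import scipy.sparse as sp_sparse  # alias assumed for this tutorial

# Triplets (0, 0, 1.0), (1, 2, 2.0), (2, 1, 3.0) as parallel arrays
rows = np.array([0, 1, 2])
cols = np.array([0, 2, 1])
vals = np.array([1.0, 2.0, 3.0])

S = sp_sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))
print(S.toarray())  # dense view, for inspection only
```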

Making a full matrix sparse
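A minimal sketch: passing a dense array to a sparse constructor keeps only its nonzero entries. The CSR format is chosen here as an example; the dense matrix is made up.

```python
import numpy as np
import scipy.sparse as sp_sparse  # alias assumed for this tutorial

A = np.array([[0, 0, 1], [2, 0, 0]])
S = sp_sparse.csr_matrix(A)  # compressed sparse row format

print(S.nnz)        # 2 stored nonzero entries
print(S.toarray())  # convert back to a dense array
```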

Creating a sparse matrix incrementally

All operations work like before.
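Incremental construction followed by a few of the familiar operations can be sketched as below. The LIL format is one format that supports cheap entry-by-entry assignment, and converting to CSR before doing arithmetic is a common pattern rather than a requirement. The entries are made up.

```python
import numpy as np
import scipy.sparse as sp_sparse  # alias assumed for this tutorial

S = sp_sparse.lil_matrix((3, 3))  # empty 3x3 matrix
S[0, 1] = 1.0                     # fill in entries one at a time
S[2, 0] = 5.0

S = S.tocsr()                     # convert for efficient arithmetic
v = np.ones(3)

print(S @ v)                      # matrix-vector product: [1. 0. 5.]
print(S.T.toarray())              # transpose
print(S.sum())                    # 6.0
```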

Singular Value Decomposition

For the singular value decomposition we will use libraries from Numpy, SciPy, and SciKit Learn.

We use sklearn to create a low-rank matrix (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_low_rank_matrix.html). We will create a matrix with effective rank 2.

We will use the numpy.linalg.svd function to compute the Singular Value Decomposition of the matrix we created (http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html).
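These two steps might look as follows. The matrix size (100x50) and the random seed are assumptions made for this sketch; only the effective rank of 2 comes from the text.

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix

# Size and seed are arbitrary choices for illustration.
M = make_low_rank_matrix(n_samples=100, n_features=50,
                         effective_rank=2, random_state=0)

# full_matrices=False returns the compact form: U is 100x50, Vt is 50x50
U, s, Vt = np.linalg.svd(M, full_matrices=False)

print(U.shape, s.shape, Vt.shape)
print(s[:5])  # numpy returns singular values in decreasing order

# Multiplying the factors back together recovers M
print(np.allclose(M, U @ np.diag(s) @ Vt))
```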

We can also use the scipy.sparse.linalg library to compute the SVD for sparse matrices (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.linalg.svds.html).

We need to specify the number of components k; otherwise the default k = 6 is used. The singular values are returned in increasing order.
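A sketch of a truncated SVD on a sparse matrix follows. The random sparse matrix, its density, and k = 3 are made up for illustration. Rather than relying on the order in which the singular values come back, the code sorts them explicitly into decreasing order.

```python
import numpy as np
import scipy.sparse as sp_sparse  # alias assumed for this tutorial
from scipy.sparse.linalg import svds

# Hypothetical 100x50 sparse matrix with about 10% nonzero entries
S = sp_sparse.random(100, 50, density=0.1, format="csr", random_state=0)

k = 3  # number of singular values and vectors to compute
U, s, Vt = svds(S, k=k)

# Reorder so the largest singular value comes first
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

print(U.shape, s.shape, Vt.shape)
print(s)
```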

We can also compute the SVD using the SciKit Learn library (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
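A sketch using TruncatedSVD is shown below. The input matrix size, the seeds, and n_components = 2 are assumptions for illustration. With its default settings TruncatedSVD uses a randomized algorithm, so the results are approximations of the exact SVD.

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.decomposition import TruncatedSVD

# Size and seed are arbitrary choices for illustration.
M = make_low_rank_matrix(n_samples=100, n_features=50,
                         effective_rank=2, random_state=0)

svd = TruncatedSVD(n_components=2, random_state=0)
M2 = svd.fit_transform(M)     # rows of M expressed in 2 dimensions

print(M2.shape)               # (100, 2)
print(svd.singular_values_)   # the 2 largest singular values
print(svd.components_.shape)  # (2, 50), the top right singular vectors as rows
```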

Obtaining a low rank approximation of the data

To obtain a rank-k approximation of the matrix, we multiply the first k columns of U by the diagonal matrix holding the k largest singular values, and then by the first k rows of V transpose.
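The product described above can be sketched as follows, using a small random matrix as a stand-in for the data (the matrix, seed, and k = 2 are made up). The last lines check a standard property: the Frobenius error of the rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
M = rng.random((6, 4))          # stand-in data matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
# First k columns of U, k largest singular values, first k rows of V transpose
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(M - M_k)   # Frobenius norm of the residual
print(np.linalg.matrix_rank(M_k))
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```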

An example

We will create a block diagonal matrix, with blocks of different "intensity" of values.

We observe that there is a correlation between the row and column sums and the left and right singular vectors, respectively.

Note: The values of the vectors are negative. We would get the same result if we made them positive.

Using the first two singular vectors we can clearly differentiate the two blocks of rows.
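A toy version of this experiment is sketched below. The matrix is an assumption for illustration: a 5x5 matrix with a "strong" block of 5s and a "weak" block of 1s, without the randomness a realistic example would have. The signs of the singular vectors are arbitrary, so absolute values are printed.

```python
import numpy as np

M = np.zeros((5, 5))
M[:3, :2] = 5.0  # strong block: rows 0-2, columns 0-1
M[3:, 2:] = 1.0  # weak block: rows 3-4, columns 2-4

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Coordinates of each row along the first two left singular vectors
coords = U[:, :2] * s[:2]

print(M.sum(axis=1))                # row sums
print(np.round(np.abs(U[:, 0]), 3)) # first left singular vector
print(np.round(np.abs(coords), 3))  # rows of each block share the same coordinates
```

In this idealized case the rows of the strong block have a nonzero first coordinate and a zero second coordinate, and the rows of the weak block the reverse, so the two groups of rows are separated exactly.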

PCA using SciKit Learn

We will now use the PCA package from the SciKit Learn (sklearn) library. PCA is the same as SVD, except that the matrix is first centered: the mean of each column is subtracted from that column.

You can read more here: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Using the transform operation we can project the data directly onto the lower-dimensional space.
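The sketch below fits PCA on made-up random data, applies transform, and then checks the relationship with SVD by centering the columns manually. The data, seed, and number of components are assumptions for illustration; svd_solver="full" is set so that an exact SVD is used. The two results agree up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(20, 4))    # made-up data: 20 samples, 4 features

pca = PCA(n_components=2, svd_solver="full")
pca.fit(X)
X2 = pca.transform(X)           # data in the 2-dimensional space
print(X2.shape)                 # (20, 2)

# The same projection via SVD of the column-centered matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X2_svd = U[:, :2] * s[:2]

print(np.allclose(np.abs(X2), np.abs(X2_svd)))
```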

We will now experiment with a well-known dataset in data analysis, the iris dataset:

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
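A minimal sketch of such an experiment: load the iris data bundled with sklearn and project its four features onto two principal components. The choice of two components is an assumption made here so that the result could be plotted in the plane.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 classes

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)      # fit and project in one step

print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The array X2 can then be drawn as a scatter plot colored by y to see how well the species separate in two dimensions.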