Supervised learning using scikit-learn

The goal of this tutorial is to introduce you to the scikit-learn library for classification. We will also cover feature selection and evaluation.

In [1]:
import numpy as np
import scipy.sparse as sp_sparse

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics

import seaborn as sns

%matplotlib inline

Feature Selection

Feature selection is about finding the most useful features for your classifier. It matters most when training data is scarce, because every extra feature makes overfitting easier. The idea is to score features by themselves (for example, by their variance), with respect to the class we want to predict, or with respect to other features.

http://scikit-learn.org/stable/modules/feature_selection.html

Variance Threshold

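VarianceThreshold drops features whose variance does not exceed a given threshold. A binary feature that is 1 in a fraction p of the samples has variance p(1 - p), so for binary features we can set the threshold exactly: threshold = .8 * (1 - .8) removes every feature that takes the same value (0 or 1) in more than 80% of the samples.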

In [14]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(np.array(X))
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]
Out[14]:
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
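To see where the threshold comes from, the sketch below recomputes the per-column variances of the same toy matrix with NumPy and compares them with the mask reported by get_support(). The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# For a 0/1 column, the variance equals p * (1 - p), where p is the fraction of ones.
p = X.mean(axis=0)
variances = p * (1 - p)
threshold = 0.8 * (1 - 0.8)

sel = VarianceThreshold(threshold=threshold).fit(X)
print("fraction of ones:", p)
print("variances:       ", variances)
print("threshold:       ", threshold)
print("kept columns:    ", sel.get_support())
```

The first column is 0 in five of the six rows (about 83%), so its variance falls below the threshold and it is the only one dropped.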
In [15]:
import sklearn.datasets as sk_data
iris = sk_data.load_iris()
X = iris.data
print(X[1:10,:])
print(X.var(axis = 0))
sel = VarianceThreshold(threshold=0.2)
sel.fit_transform(X)[1:10]
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0.68112222 0.18871289 3.09550267 0.57713289]
Out[15]:
array([[4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2],
       [5.4, 1.7, 0.4],
       [4.6, 1.4, 0.3],
       [5. , 1.5, 0.2],
       [4.4, 1.4, 0.2],
       [4.9, 1.5, 0.1]])

Univariate Feature Selection

A more sophisticated feature selection technique uses a statistical test to determine whether a feature and the class label are independent. One such test is the chi-square test (there are others, such as the ANOVA F-test).

We keep the features with a high chi-square score and a low p-value, and reject the features with the lowest scores and the highest p-values.

The chi-square test is usually applied to categorical data or counts; scikit-learn's chi2 requires non-negative feature values.

In [16]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = sk_data.load_iris()
X, y = iris.data, iris.target
print(X.shape)
print('Features:')
print(X[1:10,:])
print('Labels:')
print(y[1:10])
sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
print('Selected Features:')
print(X_new[1:10])
(150, 4)
Features:
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
Labels:
[0 0 0 0 0 0 0 0 0]
Selected Features:
[[1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]]

The chi-square values and the p-values between each feature and the target variable (the columns of X and y):

In [17]:
print('Chi2 values')
print(sel.scores_)
c, p = chi2(X, y)  # chi2 was imported from sklearn.feature_selection above
print('Chi2 values')
print(c) #The chi-square value between X columns and y
print('p-values')
print(p) #The p-value for the test
Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]
Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]
p-values
[4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
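The scores above can be reproduced by hand. The following is a sketch of the computation, assuming (as the scikit-learn documentation describes) that the feature values are treated as non-negative counts: the observed per-class feature totals are compared with the totals expected if the feature were independent of the class. The names observed, expected, chi2_manual and p_manual are illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)

# One-hot encode the labels: Y[i, c] = 1 if sample i belongs to class c.
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

# Observed: total value of each feature within each class.
observed = Y.T @ X                            # shape (n_classes, n_features)
# Expected under independence: class frequency times the feature total.
class_prob = Y.mean(axis=0)                   # shape (n_classes,)
feature_total = X.sum(axis=0)                 # shape (n_features,)
expected = np.outer(class_prob, feature_total)

# Chi-square statistic per feature, and its p-value with (n_classes - 1) degrees of freedom.
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)
p_manual = stats.chi2.sf(chi2_manual, df=len(classes) - 1)

chi2_sklearn, p_sklearn = chi2(X, y)
print(chi2_manual)
print(p_manual)
```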

Supervised Learning

scikit-learn provides classes that implement different supervised learning techniques, such as regression and classification.

Regardless of the model, every estimator implements the following methods:

The method fit() takes the training data and labels/values, and trains the model

The method predict() takes as input the test data and applies the model.
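As a minimal sketch of this shared interface, the loop below trains two unrelated models with exactly the same two calls. The choice of models and the use of the first five Iris rows are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two different models, same two calls: fit() on labeled data, predict() on inputs.
models = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier(n_neighbors=3)]
predictions = {}
for model in models:
    model.fit(X, y)
    name = type(model).__name__
    predictions[name] = model.predict(X[:5])
    print(name, predictions[name])
```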

Preparing the data

To perform classification we first need to prepare the data into train and test datasets.

In [2]:
from sklearn.datasets import load_iris
import sklearn.utils as utils

iris = load_iris()
print("sample of data")
print(iris.data[:5,:])
print("the class labels vector")
print(iris.target)
print("the names of the classes:",iris.target_names)
print(iris.feature_names)
sample of data
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
the class labels vector
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
the names of the classes: ['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Randomly shuffle the data. The Iris samples are stored sorted by class, so shuffling ensures that the train/test split below does not depend on that order.

In [3]:
X, y = utils.shuffle(iris.data, iris.target, random_state=1) #shuffle the data
print(X.shape)
print(y.shape)
print(y)
(150, 4)
(150,)
[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
 0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2
 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1
 2 0]

Select a subset for training and a subset for testing

In [4]:
train_set_size = 100
X_train = X[:train_set_size]  # selects first 100 rows (examples) for train set
y_train = y[:train_set_size]
X_test = X[train_set_size:]   # selects from row 100 until the last one for test set
y_test = y[train_set_size:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(100, 4) (100,)
(50, 4) (50,)

We can also use the train_test_split function of scikit-learn to split the data into train and test sets. It shuffles the data by default, so the manual shuffling above is not needed (but it does not hurt).

In [43]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
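The sketch below shows what the split produces, and adds the optional stratify argument (not used in the cell above) to keep the class proportions identical in both parts. The variable names X_tr, X_te, y_tr, y_te are chosen here so as not to overwrite the tutorial's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# test_size=0.4 keeps 60% of the rows for training; stratify preserves the
# class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0, stratify=y_all)

print(X_tr.shape, X_te.shape)
print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```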

Classification models

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

scikit-learn has classes that implement the different classification techniques that we described in class.

Decision Trees

http://scikit-learn.org/stable/modules/tree.html

Train and apply a decision tree classifier. The score() method of a classifier returns the accuracy by default. Decision trees can also give us class probabilities through predict_proba().

In [5]:
from sklearn import tree

dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)

print("classifier accuracy:",dtree.score(X_test,y_test))

y_pred = dtree.predict(X_test)
y_prob = dtree.predict_proba(X_test)
print("classifier predictions:",y_pred[:10])
print("ground truth labels   :",y_test[:10])
print(y_prob[:10])
classifier accuracy: 0.9
classifier predictions: [0 1 0 1 1 0 1 0 0 2]
ground truth labels   : [0 1 0 1 1 0 1 0 0 2]
[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

Compute some more metrics

In [22]:
print("accuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
accuracy: 0.88

Confusion matrix
[[19  0  0]
 [ 0 16  2]
 [ 0  4  9]]

Precision Score per class
[1.         0.8        0.81818182]

Average Precision Score
0.8807272727272726

Recall Score per class
[1.         0.88888889 0.69230769]

Average Recall Score
0.88

F1-score Score per class
[1.         0.84210526 0.75      ]

Average F1 Score
0.8781578947368422
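All of these numbers can be derived from the confusion matrix alone. The sketch below copies the matrix printed above and recomputes the per-class and weighted scores with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[19, 0, 0],
               [0, 16, 2],
               [0, 4, 9]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / predicted as that class
recall = true_positives / cm.sum(axis=1)      # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)                      # number of true samples per class
weighted_precision = np.average(precision, weights=support)
accuracy = true_positives.sum() / cm.sum()

print(precision, recall, f1)
print(weighted_precision, accuracy)
```

The 'weighted' average weighs each class by its support, which is why the weighted recall coincides with the accuracy.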

Visualize the decision tree.

For this you will need to install the package python-graphviz

In [6]:
#conda install python-graphviz
import graphviz 
print(iris.feature_names)
dot_data = tree.export_graphviz(dtree,out_file=None)
graph = graphviz.Source(dot_data)
graph
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Out[6]:
[Figure: decision tree rendered by graphviz. Children are indented under their parent; the first child is the branch where the test is true.]
node 0: X[3] <= 1.75, gini = 0.665, samples = 100, value = [31, 32, 37]
    node 1 (True): X[2] <= 2.45, gini = 0.515, samples = 64, value = [31, 32, 1]
        node 2: gini = 0.0, samples = 31, value = [31, 0, 0]
        node 3: X[1] <= 2.25, gini = 0.059, samples = 33, value = [0, 32, 1]
            node 4: X[2] <= 4.5, gini = 0.5, samples = 2, value = [0, 1, 1]
                node 5: gini = 0.0, samples = 1, value = [0, 1, 0]
                node 6: gini = 0.0, samples = 1, value = [0, 0, 1]
            node 7: gini = 0.0, samples = 31, value = [0, 31, 0]
    node 8 (False): gini = 0.0, samples = 36, value = [0, 0, 36]
In [7]:
dtree2 = tree.DecisionTreeClassifier(max_depth=2)
dtree2 = dtree2.fit(X_train, y_train)
print(dtree2.score(X_test,y_test))
dot_data2 = tree.export_graphviz(dtree2,out_file=None)
graph2 = graphviz.Source(dot_data2)
graph2
0.9
Out[7]:
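[Figure: depth-2 decision tree rendered by graphviz. Children are indented under their parent; the first child is the branch where the test is true.]
node 0: X[3] <= 1.75, gini = 0.665, samples = 100, value = [31, 32, 37]
    node 1 (True): X[3] <= 0.75, gini = 0.515, samples = 64, value = [31, 32, 1]
        node 2: gini = 0.0, samples = 31, value = [31, 0, 0]
        node 3: gini = 0.059, samples = 33, value = [0, 32, 1]
    node 4 (False): gini = 0.0, samples = 36, value = [0, 0, 36]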
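If graphviz is not available, a text rendering is a possible alternative. The sketch below uses sklearn.tree.export_text on a small tree trained on the full Iris data (not on the tutorial's train split, so the thresholds may differ from the figures above).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris_data.data, iris_data.target)

# export_text returns the tree as an indented string of if/else rules.
rules = export_text(clf, feature_names=list(iris_data.feature_names))
print(rules)
```

Passing feature_names makes the rules readable; without it the features appear as feature_0, feature_1, and so on.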
In [25]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
print("classifier score:", knn.score(X_test,y_test))

y_pred = knn.predict(X_test)

print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.94

accuracy: 0.94

Confusion matrix
[[19  0  0]
 [ 0 16  2]
 [ 0  1 12]]

Precision Score per class
[1.         0.94117647 0.85714286]

Average Precision Score
0.9416806722689074

Recall Score per class
[1.         0.88888889 0.92307692]

Average Recall Score
0.94

F1-score Score per class
[1.         0.91428571 0.88888889]

Average F1 Score
0.9402539682539682
In [27]:
from sklearn import svm

#svm_clf = svm.LinearSVC()
#svm_clf = svm.SVC(kernel = 'poly')
svm_clf = svm.SVC()
svm_clf.fit(X_train,y_train)
print("classifier score:",svm_clf.score(X_test,y_test))
y_pred = svm_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.98

accuracy: 0.98

Confusion matrix
[[19  0  0]
 [ 0 18  0]
 [ 0  1 12]]

Precision Score per class
[1.         0.94736842 1.        ]

Average Precision Score
0.9810526315789474

Recall Score per class
[1.         1.         0.92307692]

Average Recall Score
0.98

F1-score Score per class
[1.         0.97297297 0.96      ]

Average F1 Score
0.9798702702702704
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [29]:
import sklearn.linear_model as linear_model

lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_train, y_train)
print("classifier score:",lr_clf.score(X_test,y_test))
y_pred = lr_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.88

accuracy: 0.88

Confusion matrix
[[19  0  0]
 [ 0 13  5]
 [ 0  1 12]]

Precision Score per class
[1.         0.92857143 0.70588235]

Average Precision Score
0.8978151260504202

Recall Score per class
[1.         0.72222222 0.92307692]

Average Recall Score
0.88

F1-score Score per class
[1.     0.8125 0.8   ]

Average F1 Score
0.8805000000000001
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)

For Logistic Regression we can also obtain the probabilities for the different classes

In [27]:
probs = lr_clf.predict_proba(X_test)
print("Class Probabilities (first 10):")
print (probs[:10])
print(probs.argmax(axis = 1)[:10])
print(probs.max(axis = 1)[:10])
Class Probabilities (first 10):
[[8.97030460e-03 3.32653685e-01 6.58376010e-01]
 [3.65818540e-03 4.12481405e-01 5.83860409e-01]
 [6.12425133e-04 3.31379557e-01 6.68008018e-01]
 [9.06929006e-01 9.26073940e-02 4.63599597e-04]
 [8.98809388e-01 1.00868455e-01 3.22156817e-04]
 [9.57598497e-01 4.23682210e-02 3.32819743e-05]
 [1.32310636e-03 3.27816831e-01 6.70860062e-01]
 [1.27558143e-03 3.77948164e-01 6.20776255e-01]
 [1.50692477e-03 3.85667745e-01 6.12825330e-01]
 [8.56351814e-04 2.05563299e-01 7.93580350e-01]]
[2 2 2 0 0 0 2 2 2 2]
[0.65837601 0.58386041 0.66800802 0.90692901 0.89880939 0.9575985
 0.67086006 0.62077626 0.61282533 0.79358035]

And the coefficients of the logistic regression model

In [30]:
print(lr_clf.coef_)
[[-0.38922579  0.7901789  -2.07561283 -0.88405677]
 [-0.8655343  -1.82238198  0.83829068 -1.14920552]
 [ 0.3621683  -0.39018483  2.41717507  2.06070229]]
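The coefficient matrix has one row per class and one column per feature. As a sketch of how it is used, the code below recomputes the per-class linear scores from coef_ and intercept_ and turns them into probabilities with a softmax. This assumes the multinomial formulation, which recent scikit-learn releases use for multiclass problems with lbfgs; under the one-vs-rest formulation of older releases the probabilities are normalized sigmoids instead, although the predicted class is the same.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_iris, y_iris)

# One row of coefficients and one intercept per class.
print(clf.coef_.shape, clf.intercept_.shape)

# Linear score of every sample for every class.
scores = X_iris @ clf.coef_.T + clf.intercept_

# Assumption: multinomial (softmax) model, so probabilities are the softmax of the scores.
probs = softmax(scores, axis=1)
print(probs[:3])
print(probs.argmax(axis=1)[:3])
```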

Linear Regression

Linear regression is implemented by the class sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [31]:
from sklearn.linear_model import LinearRegression
X_reg = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])

# y = 1 * x_0 + 2 * x_1 + 3
y_reg = np.dot(X_reg, np.array([1, 2])) + 3

reg = LinearRegression().fit(X_reg, y_reg)
In [32]:
#Obtain the function coefficients
print(reg.coef_)
#and the intercept
print(reg.intercept_)
[1. 2.]
3.0000000000000018

The $R^2$ score (coefficient of determination) measures the fraction of the variance of the target that the model explains:

$R^2 = 1-\frac{\sum_i (y_i -\hat y_i)^2}{\sum_i (y_i -\bar y)^2}$

where $\hat y_i$ is the prediction for point $x_i$ and $\bar y$ is the mean value of the target variable

In [33]:
print(reg.score(X_reg, y_reg))
1.0
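The score of 1.0 is expected here because the toy targets were generated without noise. The sketch below evaluates the $R^2$ formula by hand on the same inputs after adding a small, hand-picked perturbation to the targets (an assumption made only so that the fit is not perfect), and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_toy = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# Same linear function as above, plus hand-picked noise so the fit is not perfect.
y_toy = X_toy @ np.array([1, 2]) + 3 + np.array([0.5, -0.5, 0.3, -0.2])

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

ss_res = np.sum((y_toy - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_toy, y_hat), model.score(X_toy, y_toy))
```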
In [35]:
#Predict for a new point
reg.predict(np.array([[3, 5]]))
Out[35]:
array([16.])

A more complex example with the diabetes dataset

In [37]:
diabetes_X, diabetes_y = sk_data.load_diabetes(return_X_y=True)

# Shuffle the data
diabetes_X, diabetes_y = utils.shuffle(diabetes_X, diabetes_y, random_state=1)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
# Computed over the *test* data
print('Coefficient of determination: %.2f'
      % metrics.r2_score(diabetes_y_test, diabetes_y_pred))
print('Predictions:')
print(diabetes_y_pred)
print('True values:')
print(diabetes_y_test)
Coefficients: 
 [  -8.80343059 -238.68845774  515.45209151  329.26528155 -878.18276219
  530.03363161  126.03912568  213.28475276  734.46021416   67.32526032]
Mean squared error: 2304.97
Coefficient of determination: 0.68
Predictions:
[149.75303117 199.7656287  248.11294766 182.95040528  98.34540804
  96.66271486 248.59757565  64.84343648 234.52373522 209.30957394
 179.26665684  85.95716856  70.54292903 197.93453267 100.34630781
 116.81521079 134.97372936  64.08572743 178.32873088 155.32247369]
True values:
[168. 221. 310. 283.  81.  94. 277.  72. 270. 268. 174.  96.  83. 222.
  69. 153. 202.  43. 124. 276.]

Computing Scores

In [38]:
p,r,f,s = metrics.precision_recall_fscore_support(y_test,y_pred)
print(p)
print(r)
print(f)
print(s)
[1.         0.92857143 0.70588235]
[1.         0.72222222 0.92307692]
[1.     0.8125 0.8   ]
[19 18 13]
In [39]:
report = metrics.classification_report(y_test,y_pred)
print(report)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.93      0.72      0.81        18
           2       0.71      0.92      0.80        13

    accuracy                           0.88        50
   macro avg       0.88      0.88      0.87        50
weighted avg       0.90      0.88      0.88        50

In [41]:
cancer_data = sk_data.load_breast_cancer()
X_cancer,y_cancer  = utils.shuffle(cancer_data.data, cancer_data.target, random_state=1)
X_cancer_train = X_cancer[:500]
y_cancer_train = y_cancer[:500]
X_cancer_test = X_cancer[500:]
y_cancer_test = y_cancer[500:]
lr_clf.fit(X_cancer_train, y_cancer_train)
print("classifier score:",lr_clf.score(X_cancer_test,y_cancer_test))
classifier score: 0.9565217391304348
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:947: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
  "of iterations.", ConvergenceWarning)
In [42]:
y_cancer_pred = lr_clf.predict(X_cancer_test)
cancer_probs = lr_clf.predict_proba(X_cancer_test)
print("Class Probabilities (first 10):")
print (cancer_probs[:10])
y_cancer_scores = cancer_probs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(y_cancer_test,y_cancer_scores)
#plt.scatter(recall,precision)
plt.plot(recall,precision, color='darkorange',lw=2)
print(recall)
print(precision)
print(thresholds)
Class Probabilities (first 10):
[[5.89753170e-01 4.10246830e-01]
 [7.92422013e-03 9.92075780e-01]
 [9.99984050e-01 1.59499811e-05]
 [9.99883765e-01 1.16235236e-04]
 [2.32299932e-02 9.76770007e-01]
 [9.99979786e-01 2.02137763e-05]
 [4.58383527e-03 9.95416165e-01]
 [1.51230669e-03 9.98487693e-01]
 [1.83894055e-03 9.98161059e-01]
 [6.20787344e-02 9.37921266e-01]]
[1.         0.97619048 0.97619048 0.97619048 0.95238095 0.92857143
 0.9047619  0.88095238 0.85714286 0.83333333 0.80952381 0.78571429
 0.76190476 0.76190476 0.73809524 0.71428571 0.69047619 0.66666667
 0.64285714 0.61904762 0.5952381  0.57142857 0.54761905 0.52380952
 0.5        0.47619048 0.45238095 0.42857143 0.4047619  0.38095238
 0.35714286 0.33333333 0.30952381 0.28571429 0.26190476 0.23809524
 0.21428571 0.19047619 0.16666667 0.14285714 0.11904762 0.0952381
 0.07142857 0.04761905 0.02380952 0.        ]
[0.93333333 0.93181818 0.95348837 0.97619048 0.97560976 0.975
 0.97435897 0.97368421 0.97297297 0.97222222 0.97142857 0.97058824
 0.96969697 1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.        ]
[0.41024683 0.46462295 0.6469258  0.76726865 0.8438038  0.84791902
 0.85405505 0.9298477  0.9328004  0.93452853 0.93792127 0.93995404
 0.95035813 0.95459886 0.96173054 0.97677001 0.97884705 0.98496682
 0.98894314 0.9892078  0.99158777 0.99207578 0.99261661 0.99284162
 0.99285247 0.99335536 0.9951331  0.99541616 0.99545223 0.99548238
 0.99630561 0.9972414  0.99747063 0.99805455 0.99816106 0.99848769
 0.99926891 0.99939715 0.99949829 0.99960473 0.99971177 0.99974118
 0.9997877  0.99980147 0.99987041]
In [43]:
fpr, tpr, ths = metrics.roc_curve(y_cancer_test,y_cancer_scores)
plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve')
print(metrics.roc_auc_score(y_cancer_test,y_cancer_scores))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',label='random guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")  # uses the labels passed to plt.plot above
plt.show()
0.9894179894179893
In [44]:
(Xtoy,y_toy)=sk_data.make_classification(n_samples=1000)
Xttrain = Xtoy[:800,:]
Xttest = Xtoy[800:,:]
yttrain = y_toy[:800]
yttest = y_toy[800:]

lr_clf.fit(Xttrain, yttrain)
#print(lr_clf.score(Xttest,yttest))
#y_tpred = lr_clf.predict(X_test)
tprobs = lr_clf.predict_proba(Xttest)
print (tprobs[:10])

y_tscores = tprobs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(yttest,y_tscores)
plt.plot(recall,precision)
[[3.01429343e-02 9.69857066e-01]
 [5.17109254e-04 9.99482891e-01]
 [9.86817813e-01 1.31821866e-02]
 [4.58170927e-01 5.41829073e-01]
 [6.03672784e-01 3.96327216e-01]
 [8.89021469e-02 9.11097853e-01]
 [3.62442439e-03 9.96375576e-01]
 [1.77673884e-02 9.82232612e-01]
 [9.25776066e-01 7.42239344e-02]
 [7.29918870e-01 2.70081130e-01]]
Out[44]:
[<matplotlib.lines.Line2D at 0x20e2996afd0>]

k-fold cross validation

In k-fold cross-validation the data is split into k parts of (roughly) equal size; k-1 parts are used for training and the remaining one for testing. k models are trained, each time holding out a different part for testing.

https://scikit-learn.org/stable/modules/cross_validation.html

There are two functions for k-fold cross-validation in the sklearn.model_selection module: cross_val_score and cross_validate. The latter allows multiple metrics to be considered together.

In [45]:
import sklearn.model_selection as model_selection

scores = model_selection.cross_val_score(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring='f1_weighted',
                                          cv=5)
print (scores)
print (scores.mean())
[1.         0.93333333 0.96658312 0.96658312 0.86111111]
0.9455221386800334
In [46]:
scores = model_selection.cross_validate(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring=['precision_weighted','recall_weighted'],
                                          cv=3)
print (scores)
print (scores['test_precision_weighted'].mean(),scores['test_recall_weighted'].mean())
{'fit_time': array([0.00112987, 0.        , 0.00102115]), 'score_time': array([0.00246477, 0.00201488, 0.00296402]), 'test_precision_weighted': array([0.96078431, 0.90952381, 0.87830688]), 'test_recall_weighted': array([0.96078431, 0.90196078, 0.875     ])}
0.916205000518726 0.9125816993464052
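To make the procedure explicit, here is a sketch of the loop that cross_val_score performs for a classifier: split into stratified folds, train a fresh copy of the model on k-1 of them, and score it on the held-out fold. It uses the unshuffled Iris data and a fixed random_state, so the numbers need not match the outputs above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# For a classifier and an integer cv, cross_val_score uses stratified folds.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    fold_model = clone(model)                          # fresh, unfitted copy per fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])   # train on k-1 parts
    fold_pred = fold_model.predict(X_cv[test_idx])     # evaluate on the held-out part
    manual_scores.append(f1_score(y_cv[test_idx], fold_pred, average='weighted'))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_cv, y_cv, scoring='f1_weighted', cv=5)
print(manual_scores, manual_scores.mean())
print(library_scores, library_scores.mean())
```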

Creating a pipeline

If the same steps are often repeated, you can create a pipeline to perform them all at once:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
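As a minimal sketch, the pipeline below chains feature scaling with a logistic regression on the breast cancer dataset. The step names 'scale' and 'clf', the 70/30 split and max_iter=1000 are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=0)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

# fit() fits the scaler on the training data, transforms it, then fits the classifier.
pipe.fit(X_bc_train, y_bc_train)
# score() applies the already-fitted scaler to the test data before predicting.
pipeline_accuracy = pipe.score(X_bc_test, y_bc_test)
print("pipeline accuracy:", pipeline_accuracy)
```

Because the pipeline behaves like a single estimator, it can be passed directly to cross_val_score, and the scaler is then refit inside every fold.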

Text classification Example

We will use the 20 newsgroups dataset for a text classification example.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.space','rec.sport.baseball']
#categories = ['alt.atheism', 'rec.sport.baseball']
news_train = sk_data.fetch_20newsgroups(subset='train', 
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
print (len(news_train.target))
X_news_train_data = news_train.data
y_news_train = news_train.target
news_test = sk_data.fetch_20newsgroups(subset='test', 
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
print (len(news_test.target))
X_news_test_data = news_test.data
y_news_test = news_test.target
1190
791
In [5]:
import sklearn.linear_model as linear_model

import sklearn.feature_extraction.text as sk_text

vectorizer = sk_text.TfidfVectorizer(stop_words='english',
                             max_features = 1000,
                             min_df=4, max_df=0.8)
X_news_train = vectorizer.fit_transform(X_news_train_data)

lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_news_train, y_news_train)
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [6]:
X_news_test = vectorizer.transform(X_news_test_data)
print("classifier score:",lr_clf.score(X_news_test,y_news_test))
classifier score: 0.9077117572692794
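Note that the vectorizer is fitted on the training documents only and then reused on the test documents. The sketch below illustrates this on a few made-up sentences (not taken from the newsgroups data): the test matrix has the same columns as the training matrix, and words that were never seen in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only.
train_docs = ["the rocket reached orbit",
              "the pitcher threw a fastball",
              "the shuttle left orbit"]
test_docs = ["a new rocket and a new telescope"]

vec = TfidfVectorizer(stop_words='english')
X_train_toy = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_test_toy = vec.transform(test_docs)         # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
print("non-zero entries in the test row:", X_test_toy.nnz)
```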

Word embeddings and Text classification

We will now see how we can train and use word embeddings. We will also see the NLTK library.

The NLTK library allows for sophisticated text processing. It can do stemming, create a parse tree, do PoS (Part-of-Speech) tagging, and find noun phrases and entities.

In [18]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
In [16]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tsapa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[16]:
True
In [55]:
import string

X_news_train_nltk = []
y_news_train_nltk = []
for x,y in zip(X_news_train_data,y_news_train):
    wt = word_tokenize(x.lower())
    doc = [w for w in wt if w not in string.punctuation]
    if len(doc) == 0: continue
    X_news_train_nltk.append(doc)
    y_news_train_nltk.append(y)
In [9]:
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tsapa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
Out[9]:
True
In [30]:
english_stop_words = set(stopwords.words('english'))
In [57]:
X_news_test_nltk = []
y_news_test_nltk = []

for x,y in zip(X_news_test_data,y_news_test):
    wt = word_tokenize(x.lower())
    doc = [w for w in wt if (w not in english_stop_words) and (w not in string.punctuation)]
    if len(doc) == 0: continue    
    X_news_test_nltk.append(doc)
    y_news_test_nltk.append(y)

The Gensim library

The Gensim library has several NLP models.

You can use existing modules to train a word2vec model: https://radimrehurek.com/gensim/models/word2vec.html

In [32]:
import gensim 
from gensim.models import Word2Vec

Train a CBOW embedding on the training data corpus

In [79]:
embedding_size = 100
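# CBOW is the default architecture (sg=0). In gensim >= 4.0 the `size` parameter is named `vector_size`.
cbow_model = gensim.models.Word2Vec(X_news_train_nltk, min_count = 1,size = embedding_size, window = 50) 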
In [59]:
cbow_model.wv['space']
Out[59]:
array([ 0.4922015 , -0.9359473 ,  0.34343162, -1.3081925 , -0.1901973 ,
        0.21532376,  1.2156173 ,  0.4430774 ,  0.25710803, -0.50712353,
       -4.272391  ,  2.6873357 , -2.493701  , -2.589504  ,  3.1585612 ,
       -0.36507684,  2.8055093 , -0.57850176,  2.6417184 , -3.5526915 ,
       -2.2415674 ,  3.0676148 ,  1.9202963 ,  4.1860013 ,  2.3986182 ,
        0.3706956 , -1.5442146 , -0.50923645,  2.043965  , -1.8197259 ,
        0.14750983,  0.23102613, -0.3497653 , -1.5297139 , -1.9090512 ,
       -1.4567456 , -3.1899529 ,  1.9151456 ,  0.46872106, -1.0813582 ,
        2.6567688 , -0.6640329 ,  2.2825925 , -1.6533066 ,  1.4027395 ,
        0.18779953,  0.2779936 ,  0.8941543 ,  0.66988635,  2.8031368 ,
       -0.57614326,  3.6284869 ,  1.7607224 ,  0.63334346, -3.1453493 ,
       -0.3909697 ,  2.4043996 ,  0.9988226 ,  1.2679374 ,  1.4401687 ,
       -0.42137426, -0.17361656,  0.75003636, -2.102788  ,  1.1037221 ,
       -0.05709558,  5.175388  ,  1.8924491 ,  1.6269192 , -0.27977344,
        2.0932372 , -0.8879969 , -0.38593277, -1.2551974 ,  0.9152297 ,
       -2.774163  , -1.2016242 ,  2.342689  ,  2.1376672 , -0.32378462,
       -1.8339034 ,  1.3824389 , -2.2503915 , -3.9719074 ,  1.3599632 ,
       -0.30928013,  1.3319025 ,  2.289628  ,  3.4073575 , -1.0615308 ,
       -0.32421514, -4.495946  ,  1.0212944 , -0.61979413, -1.0651779 ,
       -4.571965  ,  0.6625646 , -4.2656155 ,  1.8576877 ,  0.5310398 ],
      dtype=float32)

Transform the train and test data

In [80]:
X_news_train_cbow = []
for x in X_news_train_nltk:
    vx = np.zeros(embedding_size)
    for w in x: 
        vx += cbow_model.wv[w]
    vx /= len(x)
    X_news_train_cbow.append(vx)
In [81]:
X_news_test_cbow = []
for x in X_news_test_nltk:
    vx = np.zeros(embedding_size)
    length = 0
    for w in x: 
        if (w not in cbow_model.wv): continue
        length += 1
        vx += cbow_model.wv[w]
    if length != 0: vx /= length
    X_news_test_cbow.append(vx)
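The two loops above do the same thing: represent a document by the average of the vectors of its words, skipping words the model does not know. A sketch of that step as a reusable function follows. Here average_embedding is a hypothetical helper, and a plain dictionary stands in for cbow_model.wv so that the example runs without a trained model.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of the tokens found in `vectors`; zeros if none are found."""
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy stand-in for cbow_model.wv: any mapping from word to vector works here.
toy_vectors = {
    'space': np.array([1.0, 0.0]),
    'orbit': np.array([0.0, 1.0]),
}

doc_vector = average_embedding(['space', 'orbit', 'unseen'], toy_vectors, dim=2)
empty_vector = average_embedding(['unseen'], toy_vectors, dim=2)
print(doc_vector, empty_vector)
```

With the trained model, the same function could be called as average_embedding(doc, cbow_model.wv, embedding_size) for every tokenized document.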

Train a classifier on the embeddings

In [82]:
lr_clf.fit(np.array(X_news_train_cbow), np.array(y_news_train_nltk))
Out[82]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [83]:
lr_clf.score(np.array(X_news_test_cbow),y_news_test_nltk)
Out[83]:
0.6973684210526315

Train a SkipGram embedding on the training data corpus

In [84]:
embedding_size = 100
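# sg=1 selects skip-gram instead of CBOW. In gensim >= 4.0 the `size` parameter is named `vector_size`.
skipgram_model = gensim.models.Word2Vec(X_news_train_nltk, min_count = 1,size = embedding_size, window = 50, sg = 1) 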

Transform the train and test data

In [85]:
X_news_train_skipgram = []
for x in X_news_train_nltk:
    vx = np.zeros(embedding_size)
    for w in x: 
        vx += skipgram_model.wv[w]
    vx /= len(x)
    X_news_train_skipgram.append(vx)
In [86]:
X_news_test_skipgram = []
for x in X_news_test_nltk:
    vx = np.zeros(embedding_size)
    length = 0
    for w in x: 
        if (w not in skipgram_model.wv): continue
        length += 1
        vx += skipgram_model.wv[w]
    if length != 0: vx /= length
    X_news_test_skipgram.append(vx)

Train a classifier on the embeddings

In [87]:
lr_clf.fit(np.array(X_news_train_skipgram), np.array(y_news_train_nltk))

lr_clf.score(np.array(X_news_test_skipgram),y_news_test_nltk)
Out[87]:
0.9368421052631579

You can also download Google's pretrained word2vec model: 300-dimensional vectors trained on the Google News corpus. The download is about 1.6 GB.

In [120]:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
[==================================================] 100.0% 1662.8/1662.8MB downloaded
C:\Users\tsapa/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
In [88]:
path = r'C:\Users\tsapa/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz'
g_model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)  
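
The path in the cell above is specific to one machine. As a hedged sketch, the same location can be built portably with pathlib. It assumes the downloader kept its default gensim-data folder under the user's home directory, which matches the path printed by api.load above; the variable names are illustrative.

```python
from pathlib import Path

model_name = "word2vec-google-news-300"
# Assumption: the downloader stored the model under <home>/gensim-data/<model_name>/
model_path = Path.home() / "gensim-data" / model_name / (model_name + ".gz")
print(model_path)
```

The result can be passed to load_word2vec_format as str(model_path).
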
In [89]:
print(len(g_model['hello']))
print(g_model['hello'])
300
[-0.05419922  0.01708984 -0.00527954  0.33203125 -0.25       -0.01397705
 -0.15039062 -0.265625    0.01647949  0.3828125  -0.03295898 -0.09716797
 -0.16308594 -0.04443359  0.00946045  0.18457031  0.03637695  0.16601562
  0.36328125 -0.25585938  0.375       0.171875    0.21386719 -0.19921875
  0.13085938 -0.07275391 -0.02819824  0.11621094  0.15332031  0.09082031
  0.06787109 -0.0300293  -0.16894531 -0.20800781 -0.03710938 -0.22753906
  0.26367188  0.012146    0.18359375  0.31054688 -0.10791016 -0.19140625
  0.21582031  0.13183594 -0.03515625  0.18554688 -0.30859375  0.04785156
 -0.10986328  0.14355469 -0.43554688 -0.0378418   0.10839844  0.140625
 -0.10595703  0.26171875 -0.17089844  0.39453125  0.12597656 -0.27734375
 -0.28125     0.14746094 -0.20996094  0.02355957  0.18457031  0.00445557
 -0.27929688 -0.03637695 -0.29296875  0.19628906  0.20703125  0.2890625
 -0.20507812  0.06787109 -0.43164062 -0.10986328 -0.2578125  -0.02331543
  0.11328125  0.23144531 -0.04418945  0.10839844 -0.2890625  -0.09521484
 -0.10351562 -0.0324707   0.07763672 -0.13378906  0.22949219  0.06298828
  0.08349609  0.02929688 -0.11474609  0.00534058 -0.12988281  0.02514648
  0.08789062  0.24511719 -0.11474609 -0.296875   -0.59375    -0.29492188
 -0.13378906  0.27734375 -0.04174805  0.11621094  0.28320312  0.00241089
  0.13867188 -0.00683594 -0.30078125  0.16210938  0.01171875 -0.13867188
  0.48828125  0.02880859  0.02416992  0.04736328  0.05859375 -0.23828125
  0.02758789  0.05981445 -0.03857422  0.06933594  0.14941406 -0.10888672
 -0.07324219  0.08789062  0.27148438  0.06591797 -0.37890625 -0.26171875
 -0.13183594  0.09570312 -0.3125      0.10205078  0.03063965  0.23632812
  0.00582886  0.27734375  0.20507812 -0.17871094 -0.31445312 -0.01586914
  0.13964844  0.13574219  0.0390625  -0.29296875  0.234375   -0.33984375
 -0.11816406  0.10644531 -0.18457031 -0.02099609  0.02563477  0.25390625
  0.07275391  0.13574219 -0.00138092 -0.2578125  -0.2890625   0.10107422
  0.19238281 -0.04882812  0.27929688 -0.3359375  -0.07373047  0.01879883
 -0.10986328 -0.04614258  0.15722656  0.06689453 -0.03417969  0.16308594
  0.08642578  0.44726562  0.02026367 -0.01977539  0.07958984  0.17773438
 -0.04370117 -0.00952148  0.16503906  0.17285156  0.23144531 -0.04272461
  0.02355957  0.18359375 -0.41601562 -0.01745605  0.16796875  0.04736328
  0.14257812  0.08496094  0.33984375  0.1484375  -0.34375    -0.14160156
 -0.06835938 -0.14648438 -0.02844238  0.07421875 -0.07666016  0.12695312
  0.05859375 -0.07568359 -0.03344727  0.23632812 -0.16308594  0.16503906
  0.1484375  -0.2421875  -0.3515625  -0.30664062  0.00491333  0.17675781
  0.46289062  0.14257812 -0.25       -0.25976562  0.04370117  0.34960938
  0.05957031  0.07617188 -0.02868652 -0.09667969 -0.01281738  0.05859375
 -0.22949219 -0.1953125  -0.12207031  0.20117188 -0.42382812  0.06005859
  0.50390625  0.20898438  0.11230469 -0.06054688  0.33203125  0.07421875
 -0.05786133  0.11083984 -0.06494141  0.05639648  0.01757812  0.08398438
  0.13769531  0.2578125   0.16796875 -0.16894531  0.01794434  0.16015625
  0.26171875  0.31640625 -0.24804688  0.05371094 -0.0859375   0.17089844
 -0.39453125 -0.00156403 -0.07324219 -0.04614258 -0.16210938 -0.15722656
  0.21289062 -0.15820312  0.04394531  0.28515625  0.01196289 -0.26953125
 -0.04370117  0.37109375  0.04663086 -0.19726562  0.3046875  -0.36523438
 -0.23632812  0.08056641 -0.04248047 -0.14648438 -0.06225586 -0.0534668
 -0.05664062  0.18945312  0.37109375 -0.22070312  0.04638672  0.02612305
 -0.11474609  0.265625   -0.02453613  0.11083984 -0.02514648 -0.12060547
  0.05297852  0.07128906  0.00063705 -0.36523438 -0.13769531 -0.12890625]

Transform the train and test data

In [90]:
X_news_train_gmodel = []
for x in X_news_train_nltk:
    vx = np.zeros(300)
    length = 0
    for w in x: 
        if w in g_model:
            length += 1
            vx += g_model[w]
    if length != 0: vx /= length
    X_news_train_gmodel.append(vx)
In [92]:
X_news_test_gmodel = []
for x in X_news_test_nltk:
    vx = np.zeros(300)
    length = 0
    for w in x: 
        if (w not in g_model): continue
        length += 1
        vx += g_model[w]
    if length != 0: vx /= length
    X_news_test_gmodel.append(vx)

Train a classifier on the embeddings

In [91]:
lr_clf.fit(X_news_train_gmodel, np.array(y_news_train_nltk))
Out[91]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [93]:
lr_clf.score(np.array(X_news_test_gmodel),y_news_test_nltk)
Out[93]:
0.9447368421052632
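
To compare the three embeddings side by side, the test accuracies reported above can be collected in one place. This is only a summary sketch: the numbers are copied from the outputs of cells 83, 87 and 93, and the labels are informal names chosen here.

```python
# Test accuracies copied from the outputs above (labels are informal).
results = {
    "CBOW trained on the corpus": 0.6973684210526315,
    "SkipGram trained on the corpus": 0.9368421052631579,
    "pretrained Google News word2vec": 0.9447368421052632,
}

# Print the models from most to least accurate.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<35s} {acc:.3f}")

best = max(results, key=results.get)
```

On this split, the pretrained vectors score slightly higher than the skip-gram model trained on the corpus, and both are well ahead of the CBOW model.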