Introduction to Classification

The goal of this tutorial is to introduce you to the scikit-learn classes for classification. We will also cover feature normalization, feature selection, and evaluation.

In [ ]:
import numpy as np
import scipy.sparse as sp_sparse

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
from sklearn import preprocessing

import seaborn as sns

%matplotlib inline

Feature normalization

The scikit-learn preprocessing module provides functionality for normalizing and standardizing data. Be careful though: some operations work only with dense data.

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

The scale function standardizes the data by removing the mean and dividing by the standard deviation. This is done per feature, that is, per column of the dataset.

In [2]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  1.],
              [ 0.,  1., -1.]])
print(X.mean(axis = 0))
print(X.std(axis = 0))
X_scaled = preprocessing.scale(X)
print(X_scaled)
print(X_scaled.mean(axis=0))
print(X_scaled.var(axis = 0))
[ 1.          0.          0.66666667]
[ 0.81649658  0.81649658  1.24721913]
[[ 0.         -1.22474487  1.06904497]
 [ 1.22474487  0.          0.26726124]
 [-1.22474487  1.22474487 -1.33630621]]
[  0.00000000e+00   0.00000000e+00   1.48029737e-16]
[ 1.  1.  1.]
In [8]:
import scipy.sparse
cX = scipy.sparse.csc_matrix(X)
cX_scaled = preprocessing.scale(cX)
print(cX_scaled)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-600a78fca8bb> in <module>()
      1 import scipy.sparse
      2 cX = scipy.sparse.csc_matrix(X)
----> 3 cX_scaled = preprocessing.scale(cX)
      4 print(cX_scaled)

~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in scale(X, axis, with_mean, with_std, copy)
    135         if with_mean:
    136             raise ValueError(
--> 137                 "Cannot center sparse matrices: pass `with_mean=False` instead"
    138                 " See docstring for motivation and alternatives.")
    139         if axis != 0:

ValueError: Cannot center sparse matrices: pass `with_mean=False` instead See docstring for motivation and alternatives.
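As the error message suggests, a sparse matrix can still be scaled if we skip the centering step, which would otherwise destroy sparsity. A minimal sketch, reusing cX from above:

In [ ]:
# Scale the sparse matrix column-wise without centering (sparsity is preserved)
cX_scaled = preprocessing.scale(cX, with_mean=False)
print(cX_scaled)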

The same can be done with the StandardScaler from the sklearn preprocessing module.

The method fit() computes the parameters for the scaling (the per-column mean and standard deviation), and transform() applies the scaling.

In [9]:
from sklearn import preprocessing
std_scaler = preprocessing.StandardScaler().fit(X)
print(std_scaler.mean_)
print(std_scaler.scale_)
X_std = std_scaler.transform(X)
print(X_std)
[ 1.          0.          0.66666667]
[ 0.81649658  0.81649658  1.24721913]
[[ 0.         -1.22474487  1.06904497]
 [ 1.22474487  0.          0.26726124]
 [-1.22474487  1.22474487 -1.33630621]]

The advantage is that we can now apply the same transformation to new data.

For example, we compute the scaling parameters on the training data and then apply the scaling to the test data.

In [10]:
y = np.array([[2.,3.,1.],
              [1.,2.,1.]])
print(std_scaler.transform(y))
[[ 1.22474487  3.67423461  0.26726124]
 [ 0.          2.44948974  0.26726124]]

The MinMaxScaler subtracts from each column its minimum value and then divides by the range (max − min).

In [14]:
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X)
print(X_minmax)
print(min_max_scaler.transform(y))
[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.66666667]
 [ 0.          1.          0.        ]]
[[ 1.          2.          0.66666667]
 [ 0.5         1.5         0.66666667]]

The MaxAbsScaler divides each column by its maximum absolute value.

The MaxAbsScaler can work with sparse data, since it does not destroy sparsity. With the other scalers, removing the mean (or the minimum) turns zeros into non-zeros and destroys the sparsity of the data.

Sometimes we may choose to normalize only the non-zero values. This must be done manually; a sketch follows the MaxAbsScaler cells below.

In [15]:
# works with sparse data
max_abs_scaler = preprocessing.MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X)
X_maxabs
Out[15]:
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0.5],
       [ 0. ,  1. , -0.5]])
In [17]:
cX_scaled = max_abs_scaler.transform(cX)
print(cX_scaled)
  (0, 0)	0.5
  (1, 0)	1.0
  (0, 1)	-1.0
  (2, 1)	1.0
  (0, 2)	1.0
  (1, 2)	0.5
  (2, 2)	-0.5
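
As noted above, standardizing only the non-zero values has to be done manually. A minimal sketch (not a library call): operate directly on the .data array of the sparse matrix, which holds only the stored non-zero entries.

In [ ]:
sX = scipy.sparse.csr_matrix(X)
vals = sX.data                                # only the stored non-zero values
sX.data = (vals - vals.mean()) / vals.std()   # standardize just those values
print(sX)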

The normalize function scales each row so that it becomes a unit vector under the norm we specify (e.g., 'l1' or 'l2'). It can be applied to sparse matrices without destroying sparsity.

In [18]:
#works with sparse data

X_normalized = preprocessing.normalize(X, norm='l2')

X_normalized                                      
Out[18]:
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 0.89442719,  0.        ,  0.4472136 ],
       [ 0.        ,  0.70710678, -0.70710678]])
In [19]:
crX = scipy.sparse.csr_matrix(X)
crX_scaled = preprocessing.normalize(crX,norm='l1')
print(crX_scaled)
  (0, 0)	0.25
  (0, 1)	-0.25
  (0, 2)	0.5
  (1, 0)	0.666666666667
  (1, 2)	0.333333333333
  (2, 1)	0.5
  (2, 2)	-0.5

OneHotEncoder

The OneHotEncoder transforms categorical data into binary indicator features: for each distinct attribute value there is a column that is 1 if the example takes that value and 0 otherwise. It works with numerical (integer-coded) categorical values.

In [20]:
X = [[0,1,2],
     [1,2,3],
     [0,1,4]]
enc = preprocessing.OneHotEncoder()
enc.fit(X)
enc.transform([[0,2,4],[1,1,2]]).toarray()
Out[20]:
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  1.,  1.,  0.,  1.,  0.,  0.]])

We can also apply it selectively to only some columns of the data.

In [21]:
#works with sparse data

X = [[0, 10, 45100],
     [1, 20, 45221],
     [0, 20, 45212]]
enc = preprocessing.OneHotEncoder(categorical_features=[2]) #only the third column is categorical
enc.fit(X)
enc.transform([[5,13,45212],[4,12,45221]]).toarray()
Out[21]:
array([[  0.,   1.,   0.,   5.,  13.],
       [  0.,   0.,   1.,   4.,  12.]])
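
Since this encoder expects integer-coded categories, string-valued attributes can first be mapped to integers. A minimal sketch, assuming we use LabelEncoder for that mapping (the colors list is just an illustrative example):

In [ ]:
le = preprocessing.LabelEncoder()
colors = ['red', 'green', 'blue', 'green']
codes = le.fit_transform(colors)              # alphabetical coding: [2, 1, 0, 1]
onehot = preprocessing.OneHotEncoder()
# one binary column per distinct color
print(onehot.fit_transform(codes.reshape(-1, 1)).toarray())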

Feature Selection

Feature selection is about finding the most useful features for your classifier. This is especially important when you do not have much training data. The idea is to compute metrics that characterize the features either on their own, with respect to the class we want to predict, or with respect to other features.

http://scikit-learn.org/stable/modules/feature_selection.html

The VarianceThreshold selector drops features whose variance is below some threshold. For binary (Bernoulli) features the variance is p(1 − p), where p is the fraction of 1's, so we can set the threshold exactly; for example, threshold = 0.8 * (1 − 0.8) removes features that are either 0 or 1 in more than 80% of the samples.

In [22]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(np.array(X))
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]
Out[22]:
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

A more sophisticated feature selection technique uses the chi-square test to determine if a feature and the class label are independent.

https://en.wikipedia.org/wiki/Chi-squared_test

If the feature and the class are not independent (i.e., the feature is informative about the class), the feature will have a high chi-square score and a low p-value. The features with the lowest scores are rejected.

The chi-square test is usually applied to categorical (e.g., count) data.

In [23]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
print(X[1:10,:],y[1:10])
sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
print(X_new[1:10])
print(sel.scores_)
c,p = sk.feature_selection.chi2(X, y)
print(c) #The chi-square value between X columns and y
print(p) #The p-value for the test
(150, 4)
[[ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]] [0 0 0 0 0 0 0 0 0]
[[ 1.4  0.2]
 [ 1.3  0.2]
 [ 1.5  0.2]
 [ 1.4  0.2]
 [ 1.7  0.4]
 [ 1.4  0.3]
 [ 1.5  0.2]
 [ 1.4  0.2]
 [ 1.5  0.1]]
[  10.81782088    3.59449902  116.16984746   67.24482759]
[  10.81782088    3.59449902  116.16984746   67.24482759]
[  4.47651499e-03   1.65754167e-01   5.94344354e-26   2.50017968e-15]

Classification models

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

scikit-learn provides classes that implement the different classification techniques that we described in class.

Regardless of the choice of classifier, each classifier implements the same two methods:

The method fit() takes the training data and labels, and trains the model.

The method predict() takes the test data as input and applies the model, returning the predicted labels.
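
As a quick illustration of this common pattern, here is a minimal sketch; the choice of a k-nearest-neighbors classifier on the iris data is arbitrary and only for illustration:

In [ ]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
clf = KNeighborsClassifier()
clf.fit(data.data, data.target)      # train the model on features and labels
print(clf.predict(data.data[:5]))    # apply the model to some examples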

Preparing the data

Here we load the data, randomly shuffle it, and select part of the data for training and the rest for testing.

In [24]:
from sklearn.datasets import load_iris
import sklearn.utils as utils

iris = load_iris()
print(iris.data[:5,:])
print(iris.target)
print(iris.target_names)
X, y = utils.shuffle(iris.data, iris.target, random_state=1) #shuffle the data
print(X.shape)
print(y.shape)
print(y)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
(150, 4)
(150,)
[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
 0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2
 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1
 2 0]
In [25]:
train_set_size = 100
X_train = X[:train_set_size]  # selects first 100 rows (examples) for train set
y_train = y[:train_set_size]
X_test = X[train_set_size:]   # selects from row 100 until the last one for test set
y_test = y[train_set_size:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(100, 4) (100,)
(50, 4) (50,)
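An alternative sketch, assuming a scikit-learn version that provides sklearn.model_selection: the shuffling and splitting can be done in one step with train_test_split. Different variable names are used here so as not to overwrite the split above.

In [ ]:
from sklearn.model_selection import train_test_split

# shuffle and hold out 50 examples for testing in a single call
Xtr, Xte, ytr, yte = train_test_split(iris.data, iris.target,
                                      test_size=50, random_state=1)
print(Xtr.shape, Xte.shape)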
In [26]:
from sklearn import tree

dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)
print(dtree.score(X_test,y_test))

y_pred = dtree.predict(X_test)
y_prob = dtree.predict_proba(X_test)
print(y_pred[:10])
print(y_prob[:10])
print(metrics.accuracy_score(y_test,y_pred))
print(metrics.confusion_matrix(y_test,y_pred))
print(metrics.precision_score(y_test,y_pred,average=None))
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print(metrics.recall_score(y_test,y_pred,average=None))
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print(metrics.f1_score(y_test,y_pred,average=None))
print(metrics.f1_score(y_test,y_pred,average='weighted'))
0.88
[0 1 0 1 1 0 1 0 0 2]
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]
0.88
[[19  0  0]
 [ 0 16  2]
 [ 0  4  9]]
[ 1.          0.8         0.81818182]
0.880727272727
[ 1.          0.88888889  0.69230769]
0.88
[ 1.          0.84210526  0.75      ]
0.878157894737
In [27]:
print(dtree.tree_)
from inspect import getmembers
print( getmembers( dtree.tree_ ) )
<sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>
[('__class__', <class 'sklearn.tree._tree.Tree'>), ('__delattr__', <method-wrapper '__delattr__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__dir__', <built-in method __dir__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__doc__', "Array-based representation of a binary decision tree.\n\n    The binary tree is represented as a number of parallel arrays. The i-th\n    element of each array holds information about the node `i`. Node 0 is the\n    tree's root. You can find a detailed description of all arrays in\n    `_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split\n    nodes, resp. In this case the values of nodes of the other type are\n    arbitrary!\n\n    Attributes\n    ----------\n    node_count : int\n        The number of nodes (internal nodes + leaves) in the tree.\n\n    capacity : int\n        The current capacity (i.e., size) of the arrays, which is at least as\n        great as `node_count`.\n\n    max_depth : int\n        The maximal depth of the tree.\n\n    children_left : array of int, shape [node_count]\n        children_left[i] holds the node id of the left child of node i.\n        For leaves, children_left[i] == TREE_LEAF. Otherwise,\n        children_left[i] > i. This child handles the case where\n        X[:, feature[i]] <= threshold[i].\n\n    children_right : array of int, shape [node_count]\n        children_right[i] holds the node id of the right child of node i.\n        For leaves, children_right[i] == TREE_LEAF. Otherwise,\n        children_right[i] > i. This child handles the case where\n        X[:, feature[i]] > threshold[i].\n\n    feature : array of int, shape [node_count]\n        feature[i] holds the feature to split on, for the internal node i.\n\n    threshold : array of double, shape [node_count]\n        threshold[i] holds the threshold for the internal node i.\n\n    value : array of double, shape [node_count, n_outputs, max_n_classes]\n        Contains the constant prediction value of each node.\n\n    impurity : array of double, shape [node_count]\n        impurity[i] holds the impurity (i.e., the value of the splitting\n        criterion) at node i.\n\n    n_node_samples : array of int, shape [node_count]\n        n_node_samples[i] holds the number of training samples reaching node i.\n\n    weighted_n_node_samples : array of int, shape [node_count]\n        weighted_n_node_samples[i] holds the weighted number of training samples\n        reaching node i.\n    "), ('__eq__', <method-wrapper '__eq__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__format__', <built-in method __format__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__ge__', <method-wrapper '__ge__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__getattribute__', <method-wrapper '__getattribute__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__getstate__', <built-in method __getstate__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__gt__', <method-wrapper '__gt__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__hash__', <method-wrapper '__hash__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__init__', <method-wrapper '__init__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__init_subclass__', <built-in method __init_subclass__ of type object at 0x00007FFE52DE77E0>), ('__le__', <method-wrapper '__le__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__lt__', <method-wrapper '__lt__' of 
sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__ne__', <method-wrapper '__ne__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__new__', <built-in method __new__ of type object at 0x00007FFE52DE77E0>), ('__pyx_vtable__', <capsule object NULL at 0x00000295ACDCA7B0>), ('__reduce__', <built-in method __reduce__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__reduce_ex__', <built-in method __reduce_ex__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__repr__', <method-wrapper '__repr__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__setattr__', <method-wrapper '__setattr__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__setstate__', <built-in method __setstate__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__sizeof__', <built-in method __sizeof__ of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__str__', <method-wrapper '__str__' of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('__subclasshook__', <built-in method __subclasshook__ of type object at 0x00007FFE52DE77E0>), ('apply', <built-in method apply of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('capacity', 9), ('children_left', array([ 1,  2, -1,  4,  5, -1, -1, -1, -1], dtype=int64)), ('children_right', array([ 8,  3, -1,  7,  6, -1, -1, -1, -1], dtype=int64)), ('compute_feature_importances', <built-in method compute_feature_importances of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('decision_path', <built-in method decision_path of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('feature', array([ 3,  3, -2,  1,  3, -2, -2, -2, -2], dtype=int64)), ('impurity', array([ 0.6646    ,  0.51513672,  0.        ,  0.05876951,  0.5       ,
        0.        ,  0.        ,  0.        ,  0.        ])), ('max_depth', 4), ('max_n_classes', 3), ('n_classes', array([3], dtype=int64)), ('n_features', 4), ('n_node_samples', array([100,  64,  31,  33,   2,   1,   1,  31,  36], dtype=int64)), ('n_outputs', 1), ('node_count', 9), ('predict', <built-in method predict of sklearn.tree._tree.Tree object at 0x00000295ACDEC6B0>), ('threshold', array([ 1.75,  0.75, -2.  ,  2.25,  1.25, -2.  , -2.  , -2.  , -2.  ])), ('value', array([[[ 31.,  32.,  37.]],

       [[ 31.,  32.,   1.]],

       [[ 31.,   0.,   0.]],

       [[  0.,  32.,   1.]],

       [[  0.,   1.,   1.]],

       [[  0.,   1.,   0.]],

       [[  0.,   0.,   1.]],

       [[  0.,  31.,   0.]],

       [[  0.,   0.,  36.]]])), ('weighted_n_node_samples', array([ 100.,   64.,   31.,   33.,    2.,    1.,    1.,   31.,   36.]))]
In [30]:
import graphviz 
print(iris.feature_names)
dot_data = tree.export_graphviz(dtree,out_file=None)
graph = graphviz.Source(dot_data)
graph
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Out[30]:
Tree 0 X[3] <= 1.75 gini = 0.665 samples = 100 value = [31, 32, 37] 1 X[3] <= 0.75 gini = 0.515 samples = 64 value = [31, 32, 1] 0->1 True 8 gini = 0.0 samples = 36 value = [0, 0, 36] 0->8 False 2 gini = 0.0 samples = 31 value = [31, 0, 0] 1->2 3 X[1] <= 2.25 gini = 0.059 samples = 33 value = [0, 32, 1] 1->3 4 X[3] <= 1.25 gini = 0.5 samples = 2 value = [0, 1, 1] 3->4 7 gini = 0.0 samples = 31 value = [0, 31, 0] 3->7 5 gini = 0.0 samples = 1 value = [0, 1, 0] 4->5 6 gini = 0.0 samples = 1 value = [0, 0, 1] 4->6
In [29]:
from sklearn import tree
import graphviz 

dtree2 = tree.DecisionTreeClassifier(max_depth=2)
dtree2 = dtree2.fit(X_train, y_train)
dot_data2 = tree.export_graphviz(dtree2,out_file=None)
graph2 = graphviz.Source(dot_data2)
graph2
Out[29]:
Tree 0 X[3] <= 1.75 gini = 0.665 samples = 100 value = [31, 32, 37] 1 X[2] <= 2.45 gini = 0.515 samples = 64 value = [31, 32, 1] 0->1 True 4 gini = 0.0 samples = 36 value = [0, 0, 36] 0->4 False 2 gini = 0.0 samples = 31 value = [31, 0, 0] 1->2 3 gini = 0.059 samples = 33 value = [0, 32, 1] 1->3
In [31]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
print(knn.score(X_test,y_test))

y_pred = knn.predict(X_test)
print(metrics.confusion_matrix(y_test,y_pred))

print(metrics.precision_score(y_test,y_pred,average=None))
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print(metrics.recall_score(y_test,y_pred,average=None))
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print(metrics.f1_score(y_test,y_pred,average=None))
print(metrics.f1_score(y_test,y_pred,average='weighted'))
0.94
[[19  0  0]
 [ 0 16  2]
 [ 0  1 12]]
[ 1.          0.94117647  0.85714286]
0.941680672269
[ 1.          0.88888889  0.92307692]
0.94
[ 1.          0.91428571  0.88888889]
0.940253968254
In [32]:
from sklearn import svm

#svm_clf = svm.LinearSVC()
#svm_clf = svm.SVC(kernel = 'poly')
svm_clf = svm.SVC()
svm_clf.fit(X_train,y_train)
print(svm_clf.score(X_test,y_test))
y_pred = svm_clf.predict(X_test)
print(metrics.confusion_matrix(y_test,y_pred))

print(metrics.precision_score(y_test,y_pred,average=None))
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print(metrics.recall_score(y_test,y_pred,average=None))
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print(metrics.f1_score(y_test,y_pred,average=None))
print(metrics.f1_score(y_test,y_pred,average='weighted'))
0.98
[[19  0  0]
 [ 0 18  0]
 [ 0  1 12]]
[ 1.          0.94736842  1.        ]
0.981052631579
[ 1.          1.          0.92307692]
0.98
[ 1.          0.97297297  0.96      ]
0.97987027027
In [33]:
import sklearn.linear_model as linear_model

lr_clf = linear_model.LogisticRegression()
lr_clf.fit(X_train, y_train)
print(lr_clf.score(X_test,y_test))
y_pred = lr_clf.predict(X_test)
print(metrics.confusion_matrix(y_test,y_pred))
print(metrics.precision_score(y_test,y_pred,average=None))
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print(metrics.recall_score(y_test,y_pred,average=None))
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print(metrics.f1_score(y_test,y_pred,average=None))
print(metrics.f1_score(y_test,y_pred,average='weighted'))
0.96
[[19  0  0]
 [ 0 16  2]
 [ 0  0 13]]
[ 1.          1.          0.86666667]
0.965333333333
[ 1.          0.88888889  1.        ]
0.96
[ 1.          0.94117647  0.92857143]
0.96025210084
In [34]:
probs = lr_clf.predict_proba(X_test)
print (probs)
print (probs.argmax(axis = 1))
print (probs.max(axis = 1))
[[  9.42848537e-01   5.71375773e-02   1.38860506e-05]
 [  7.36901098e-02   7.22694123e-01   2.03615768e-01]
 [  7.11546818e-01   2.87755584e-01   6.97598209e-04]
 [  4.89825361e-02   7.17197031e-01   2.33820433e-01]
 [  1.81245540e-02   6.43677875e-01   3.38197571e-01]
 [  8.56222465e-01   1.43598825e-01   1.78709592e-04]
 [  1.32661219e-02   6.69340523e-01   3.17393355e-01]
 [  8.90737869e-01   1.09004746e-01   2.57384935e-04]
 [  7.99367919e-01   2.00379203e-01   2.52878017e-04]
 [  1.26409293e-03   1.57157352e-01   8.41578555e-01]
 [  2.49910290e-03   2.46064318e-01   7.51436579e-01]
 [  6.39100774e-04   2.24939995e-01   7.74420904e-01]
 [  8.33706116e-01   1.66210803e-01   8.30810191e-05]
 [  8.94412871e-01   1.05503577e-01   8.35520171e-05]
 [  9.72020748e-03   3.00444130e-01   6.89835663e-01]
 [  7.98905301e-01   2.00868235e-01   2.26463968e-04]
 [  1.60403736e-03   2.95399378e-01   7.02996584e-01]
 [  7.87969440e-01   2.11818459e-01   2.12101017e-04]
 [  2.72568465e-03   2.82704683e-01   7.14569632e-01]
 [  3.64574157e-03   3.04211789e-01   6.92142469e-01]
 [  8.22276156e-01   1.77562266e-01   1.61577503e-04]
 [  2.09344504e-03   1.84118529e-01   8.13788026e-01]
 [  8.09401345e-01   1.90462524e-01   1.36131352e-04]
 [  6.59873831e-02   6.19755173e-01   3.14257444e-01]
 [  7.80727909e-01   2.18951841e-01   3.20249733e-04]
 [  6.71158547e-02   7.66891891e-01   1.65992254e-01]
 [  4.02285825e-02   6.89671837e-01   2.70099580e-01]
 [  8.56946753e-01   1.42998178e-01   5.50690948e-05]
 [  7.94214334e-01   2.05641671e-01   1.43995870e-04]
 [  1.25434739e-01   6.99690556e-01   1.74874705e-01]
 [  8.95288270e-01   1.04669525e-01   4.22044393e-05]
 [  5.07419345e-02   5.50399848e-01   3.98858218e-01]
 [  1.33216135e-02   5.48832988e-01   4.37845398e-01]
 [  8.37568960e-01   1.62343221e-01   8.78194536e-05]
 [  2.74945520e-02   6.41899518e-01   3.30605930e-01]
 [  4.41253274e-02   6.20573300e-01   3.35301372e-01]
 [  5.42982750e-03   5.05914148e-01   4.88656025e-01]
 [  4.37634497e-02   7.61758107e-01   1.94478443e-01]
 [  1.60403736e-03   2.95399378e-01   7.02996584e-01]
 [  8.18520368e-01   1.81402743e-01   7.68890305e-05]
 [  7.72715352e-01   2.27097263e-01   1.87384606e-04]
 [  7.72125588e-04   4.37082934e-01   5.62144941e-01]
 [  7.65259021e-02   7.19907084e-01   2.03567014e-01]
 [  1.39728181e-03   4.66169653e-01   5.32433065e-01]
 [  1.58140880e-01   7.60607424e-01   8.12516952e-02]
 [  3.77017915e-03   4.60739591e-01   5.35490230e-01]
 [  1.80675792e-03   3.27728808e-01   6.70464434e-01]
 [  3.59953765e-03   4.77448945e-01   5.18951517e-01]
 [  9.40355077e-04   1.96873447e-01   8.02186198e-01]
 [  7.82774102e-01   2.17108428e-01   1.17470682e-04]]
[0 1 0 1 1 0 1 0 0 2 2 2 0 0 2 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1
 1 2 0 0 2 1 2 1 2 2 2 2 0]
[ 0.94284854  0.72269412  0.71154682  0.71719703  0.64367787  0.85622247
  0.66934052  0.89073787  0.79936792  0.84157856  0.75143658  0.7744209
  0.83370612  0.89441287  0.68983566  0.7989053   0.70299658  0.78796944
  0.71456963  0.69214247  0.82227616  0.81378803  0.80940134  0.61975517
  0.78072791  0.76689189  0.68967184  0.85694675  0.79421433  0.69969056
  0.89528827  0.55039985  0.54883299  0.83756896  0.64189952  0.6205733
  0.50591415  0.76175811  0.70299658  0.81852037  0.77271535  0.56214494
  0.71990708  0.53243307  0.76060742  0.53549023  0.67046443  0.51895152
  0.8021862   0.7827741 ]

Computing Scores

In [35]:
p,r,f,s = metrics.precision_recall_fscore_support(y_test,y_pred)
print(p)
print(r)
print(f)
[ 1.          1.          0.86666667]
[ 1.          0.88888889  1.        ]
[ 1.          0.94117647  0.92857143]
In [36]:
report = metrics.classification_report(y_test,y_pred)
print(report)
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        19
          1       1.00      0.89      0.94        18
          2       0.87      1.00      0.93        13

avg / total       0.97      0.96      0.96        50

In [37]:
#y_true = np.array([0, 0, 1, 1])
y_true = np.array(y_test)
print(y_true)
print(y_test)
y_true[y_true != 2] = 0
y_true[y_true==2] = 1
#y_scores = np.array([0.1, 0.4, 0.35, 0.8])
y_scores = probs[:,2]
precision, recall, thresholds = metrics.precision_recall_curve(y_true,y_scores)
plt.scatter(recall,precision)
print(recall)
print(precision)
print(thresholds)
fpr, tpr, ths = metrics.roc_curve(y_true,y_scores)
print(metrics.roc_auc_score(y_true,y_scores))
[0 1 0 1 1 0 1 0 0 2 2 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1
 1 2 0 0 2 1 2 1 2 2 1 2 0]
[0 1 0 1 1 0 1 0 0 2 2 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1
 1 2 0 0 2 1 2 1 2 2 1 2 0]
[ 1.          0.92307692  0.84615385  0.76923077  0.69230769  0.69230769
  0.61538462  0.46153846  0.38461538  0.30769231  0.23076923  0.15384615
  0.07692308  0.        ]
[ 0.92857143  0.92307692  0.91666667  0.90909091  0.9         1.          1.
  1.          1.          1.          1.          1.          1.          1.        ]
[ 0.53243307  0.53549023  0.56214494  0.67046443  0.68983566  0.69214247
  0.70299658  0.71456963  0.75143658  0.7744209   0.8021862   0.81378803
  0.84157856]
0.991683991684
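
A minimal sketch: the fpr and tpr arrays computed in the previous cell can be plotted to visualize the ROC curve.

In [ ]:
# ROC curve for the binarized problem (virginica vs. rest)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()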
In [38]:
(Xtoy,y_toy)=sk_data.make_classification(n_samples=1000)
Xttrain = Xtoy[:800,:]
Xttest = Xtoy[800:,:]
yttrain = y_toy[:800]
yttest = y_toy[800:]

lr_clf.fit(Xttrain, yttrain)
#print(lr_clf.score(Xttest,yttest))
#y_tpred = lr_clf.predict(X_test)
tprobs = lr_clf.predict_proba(Xttest)
print (tprobs)

y_tscores = tprobs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(yttest,y_tscores)
plt.scatter(recall,precision)
[[  8.45092914e-01   1.54907086e-01]
 [  9.99986461e-01   1.35387408e-05]
 [  1.86410469e-03   9.98135895e-01]
 [  7.19296692e-06   9.99992807e-01]
 [  7.69377078e-04   9.99230623e-01]
 [  2.46954912e-01   7.53045088e-01]
 [  3.83185594e-03   9.96168144e-01]
 [  9.97436461e-01   2.56353912e-03]
 [  9.87310275e-01   1.26897245e-02]
 [  6.68878808e-03   9.93311212e-01]
 [  1.20824092e-04   9.99879176e-01]
 [  9.48775721e-01   5.12242791e-02]
 [  7.07352926e-01   2.92647074e-01]
 [  9.98429120e-01   1.57087955e-03]
 [  9.79761333e-01   2.02386674e-02]
 [  2.06536170e-03   9.97934638e-01]
 [  2.97926888e-01   7.02073112e-01]
 [  4.94170630e-01   5.05829370e-01]
 [  1.21516137e-03   9.98784839e-01]
 [  9.98261314e-01   1.73868570e-03]
 [  9.92486607e-01   7.51339344e-03]
 [  9.92981490e-01   7.01850976e-03]
 [  1.76058441e-01   8.23941559e-01]
 [  2.14806280e-03   9.97851937e-01]
 [  9.77064059e-01   2.29359410e-02]
 [  3.39512180e-03   9.96604878e-01]
 [  4.09388105e-03   9.95906119e-01]
 [  9.99178554e-01   8.21446116e-04]
 [  1.93094040e-02   9.80690596e-01]
 [  5.90347668e-01   4.09652332e-01]
 [  9.99938151e-01   6.18493543e-05]
 [  4.76430473e-05   9.99952357e-01]
 [  1.03610170e-02   9.89638983e-01]
 [  9.51130283e-01   4.88697174e-02]
 [  9.73766974e-01   2.62330256e-02]
 [  1.69188523e-03   9.98308115e-01]
 [  9.93411050e-01   6.58895042e-03]
 [  9.94786439e-01   5.21356135e-03]
 [  8.65064846e-01   1.34935154e-01]
 [  4.12347132e-04   9.99587653e-01]
 [  9.34688513e-01   6.53114869e-02]
 [  2.13593120e-04   9.99786407e-01]
 [  1.47307725e-01   8.52692275e-01]
 [  7.95356140e-01   2.04643860e-01]
 [  1.00242132e-02   9.89975787e-01]
 [  8.89218851e-01   1.10781149e-01]
 [  2.97048000e-01   7.02952000e-01]
 [  9.97622832e-01   2.37716829e-03]
 [  3.61147543e-03   9.96388525e-01]
 [  3.54042722e-04   9.99645957e-01]
 [  9.36876572e-04   9.99063123e-01]
 [  5.47673571e-01   4.52326429e-01]
 [  1.74196178e-02   9.82580382e-01]
 [  3.68668356e-03   9.96313316e-01]
 [  9.17141910e-01   8.28580896e-02]
 [  6.22999787e-01   3.77000213e-01]
 [  1.43549497e-02   9.85645050e-01]
 [  9.07031163e-01   9.29688371e-02]
 [  9.75844489e-02   9.02415551e-01]
 [  9.99473321e-01   5.26678911e-04]
 [  9.91979468e-01   8.02053161e-03]
 [  9.77758638e-01   2.22413621e-02]
 [  9.99274749e-01   7.25251458e-04]
 [  9.95593793e-01   4.40620693e-03]
 [  1.27896010e-03   9.98721040e-01]
 [  5.06231711e-01   4.93768289e-01]
 [  9.64050032e-01   3.59499678e-02]
 [  8.98497038e-01   1.01502962e-01]
 [  1.13892561e-02   9.88610744e-01]
 [  4.56244958e-03   9.95437550e-01]
 [  8.28328038e-01   1.71671962e-01]
 [  1.37138999e-04   9.99862861e-01]
 [  9.97209673e-01   2.79032728e-03]
 [  3.09539478e-02   9.69046052e-01]
 [  4.86753522e-03   9.95132465e-01]
 [  9.93109958e-01   6.89004153e-03]
 [  1.96627045e-03   9.98033730e-01]
 [  1.20152862e-02   9.87984714e-01]
 [  9.95386008e-01   4.61399247e-03]
 [  9.99181598e-01   8.18401568e-04]
 [  9.96594041e-01   3.40595894e-03]
 [  5.79396213e-01   4.20603787e-01]
 [  5.52724453e-03   9.94472755e-01]
 [  9.97732610e-01   2.26739004e-03]
 [  9.96888152e-01   3.11184826e-03]
 [  8.35164591e-01   1.64835409e-01]
 [  9.74790112e-01   2.52098876e-02]
 [  8.81903132e-01   1.18096868e-01]
 [  9.99520061e-01   4.79939164e-04]
 [  9.45127885e-01   5.48721152e-02]
 [  3.51597328e-02   9.64840267e-01]
 [  7.37738215e-01   2.62261785e-01]
 [  9.99670040e-01   3.29959671e-04]
 [  9.98584942e-01   1.41505776e-03]
 [  1.87747545e-03   9.98122525e-01]
 [  9.73040493e-01   2.69595073e-02]
 [  4.83069474e-04   9.99516931e-01]
 [  6.07419687e-01   3.92580313e-01]
 [  2.24816279e-04   9.99775184e-01]
 [  4.06548701e-01   5.93451299e-01]
 [  9.76712621e-01   2.32873794e-02]
 [  3.44961753e-01   6.55038247e-01]
 [  1.38719474e-04   9.99861281e-01]
 [  9.71216404e-01   2.87835960e-02]
 [  9.96556395e-01   3.44360457e-03]
 [  2.16004102e-02   9.78399590e-01]
 [  9.46124939e-01   5.38750614e-02]
 [  3.07089089e-04   9.99692911e-01]
 [  9.96549170e-01   3.45082963e-03]
 [  6.70631709e-03   9.93293683e-01]
 [  3.67014889e-03   9.96329851e-01]
 [  9.96842593e-01   3.15740684e-03]
 [  2.80133313e-01   7.19866687e-01]
 [  2.02312182e-02   9.79768782e-01]
 [  2.71087185e-03   9.97289128e-01]
 [  6.45733522e-04   9.99354266e-01]
 [  2.36791119e-03   9.97632089e-01]
 [  2.59330141e-02   9.74066986e-01]
 [  1.64545752e-05   9.99983545e-01]
 [  4.13528690e-02   9.58647131e-01]
 [  8.95796256e-01   1.04203744e-01]
 [  2.68293408e-04   9.99731707e-01]
 [  1.22466603e-02   9.87753340e-01]
 [  5.41780086e-02   9.45821991e-01]
 [  1.45401866e-02   9.85459813e-01]
 [  7.90342485e-03   9.92096575e-01]
 [  9.93949744e-01   6.05025605e-03]
 [  1.31722968e-02   9.86827703e-01]
 [  9.94522254e-01   5.47774590e-03]
 [  2.95682895e-01   7.04317105e-01]
 [  1.20990530e-02   9.87900947e-01]
 [  9.26557808e-01   7.34421918e-02]
 [  9.46920275e-01   5.30797248e-02]
 [  9.77216220e-01   2.27837797e-02]
 [  9.99219999e-01   7.80001340e-04]
 [  2.21185338e-01   7.78814662e-01]
 [  5.38986858e-06   9.99994610e-01]
 [  9.99851312e-01   1.48687548e-04]
 [  1.30886344e-02   9.86911366e-01]
 [  5.18032268e-04   9.99481968e-01]
 [  9.95831351e-01   4.16864866e-03]
 [  7.70361772e-04   9.99229638e-01]
 [  1.50845046e-06   9.99998492e-01]
 [  3.75878277e-01   6.24121723e-01]
 [  9.59756216e-03   9.90402438e-01]
 [  1.07405329e-02   9.89259467e-01]
 [  1.50815462e-03   9.98491845e-01]
 [  5.45156579e-02   9.45484342e-01]
 [  9.99887245e-01   1.12754546e-04]
 [  9.96316632e-01   3.68336822e-03]
 [  9.77745534e-01   2.22544658e-02]
 [  9.61563394e-01   3.84366060e-02]
 [  9.99867310e-01   1.32689709e-04]
 [  3.41830452e-03   9.96581695e-01]
 [  5.51009055e-04   9.99448991e-01]
 [  1.86847455e-03   9.98131525e-01]
 [  2.63710096e-02   9.73628990e-01]
 [  7.78496579e-01   2.21503421e-01]
 [  2.63253377e-03   9.97367466e-01]
 [  9.60113489e-01   3.98865112e-02]
 [  2.22688977e-03   9.97773110e-01]
 [  9.88631477e-02   9.01136852e-01]
 [  8.65131286e-04   9.99134869e-01]
 [  5.87174678e-03   9.94128253e-01]
 [  8.92437332e-01   1.07562668e-01]
 [  4.26814662e-03   9.95731853e-01]
 [  8.62315820e-01   1.37684180e-01]
 [  9.99081211e-01   9.18788953e-04]
 [  3.16875148e-04   9.99683125e-01]
 [  9.84831840e-01   1.51681604e-02]
 [  1.17748130e-02   9.88225187e-01]
 [  1.71670730e-03   9.98283293e-01]
 [  9.33041784e-01   6.69582157e-02]
 [  9.73245741e-01   2.67542588e-02]
 [  8.32469895e-03   9.91675301e-01]
 [  9.24258301e-01   7.57416993e-02]
 [  1.82314999e-02   9.81768500e-01]
 [  1.33682284e-02   9.86631772e-01]
 [  2.35799182e-03   9.97642008e-01]
 [  5.81617775e-01   4.18382225e-01]
 [  1.08992170e-03   9.98910078e-01]
 [  9.68238242e-01   3.17617580e-02]
 [  9.73717518e-01   2.62824816e-02]
 [  9.87619280e-01   1.23807203e-02]
 [  9.89257070e-01   1.07429300e-02]
 [  9.81882875e-01   1.81171249e-02]
 [  1.31338374e-04   9.99868662e-01]
 [  9.75389902e-01   2.46100979e-02]
 [  9.98085253e-01   1.91474655e-03]
 [  3.86817277e-02   9.61318272e-01]
 [  9.67688695e-01   3.23113053e-02]
 [  8.12478722e-01   1.87521278e-01]
 [  9.22846918e-01   7.71530819e-02]
 [  5.96546420e-03   9.94034536e-01]
 [  9.99964336e-01   3.56635976e-05]
 [  6.64349270e-03   9.93356507e-01]
 [  9.69345346e-01   3.06546541e-02]
 [  8.27770532e-01   1.72229468e-01]
 [  9.92443123e-01   7.55687652e-03]
 [  9.99219871e-01   7.80128649e-04]]
Out[38]:
<matplotlib.collections.PathCollection at 0x295ad62a240>
In [43]:
import sklearn.cross_validation as cross_validation

scores = cross_validation.cross_val_score(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring='accuracy',
                                          cv=3)
print (scores)
print (scores.mean())
[ 0.96078431  0.98039216  0.875     ]
0.938725490196
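
Note that the cross_validation module has since been removed from scikit-learn; in newer versions the equivalent call (a sketch, assuming sklearn.model_selection is available) is:

In [ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(dtree, X, y, scoring='accuracy', cv=3)
print(scores)
print(scores.mean())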