Supervised learning using scikit-learn¶

The goal of this tutorial is to introduce you to the scikit-learn library for classification. We will also cover feature selection and evaluation.

In [41]:
import numpy as np
import scipy.sparse as sp_sparse

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics

import seaborn as sns

%matplotlib inline

Feature Selection¶

Feature selection is about finding the best features for your classifier. This may be important if you do not have enough training data. The idea is to find metrics that either characterize the features by themselves, or with respect to the class we want to predict, or with respect to other features.

http://scikit-learn.org/stable/modules/feature_selection.html

Variance Threshold¶

The VarianceThreshold selector drops features whose variance is below some threshold. If we have binary (Bernoulli) features, the variance is p(1-p), where p is the fraction of 1's, so we can set the threshold exactly so as to guarantee a specific ratio of 0's and 1's.

In [42]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(np.array(X))
print('\n')
sel = VarianceThreshold(threshold=(0.8*(1 - .8)))  # For a binary feature Var = p*(1-p); with p = 0.8 the threshold is 0.8*0.2 = 0.16
sel.fit_transform(X)                               # Features where one value appears in more than 80% of the samples are dropped
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]


Out[42]:
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

What happened in this example? Which feature was removed and why?

In [ ]:
import sklearn.datasets as sk_data

iris = sk_data.load_iris()
X = iris.data
print(X[1:10,:])
print('\nVariance of the Features:')
print(X.var(axis = 0))
sel = VarianceThreshold(threshold=0.2)
print('\nData after applying Variance Threshold:')
sel.fit_transform(X)[1:10]
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Variance of the Features:
[0.68112222 0.18871289 3.09550267 0.57713289]

Data after applying Variance Threshold:
Out[ ]:
array([[4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2],
       [5.4, 1.7, 0.4],
       [4.6, 1.4, 0.3],
       [5. , 1.5, 0.2],
       [4.4, 1.4, 0.2],
       [4.9, 1.5, 0.1]])

Is it always a good idea to remove low variance features? Can we think of a counterexample?

Univariate Feature Selection¶

A more sophisticated feature selection technique uses a statistical test to determine whether a feature and the class label are independent. An example of such a test is the chi-square test (there are more, as we have seen when studying statistics).

In this case we keep the features with high chi-square score and low p-value.

The features with the lowest scores and highest p-values are rejected.

The chi-square test is usually applied on categorical data.

In [ ]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = sk_data.load_iris()
X, y = iris.data, iris.target
print(X.shape)
print('\nFeatures:')
print(X[1:10,:])
print('\nLabels:')
print(y)
sel = SelectKBest(chi2, k=2)     # Select the top k=2 features with the highest chi-square scores
X_new = sel.fit_transform(X, y)  # Now with chi2 we NEED the targets y to apply it
print('\nSelected Features:')
print(X_new[1:10])
(150, 4)

Features:
[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Selected Features:
[[1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]]

The chi-square values and the p-values between features and target variable (X columns and y)

In [ ]:
print('Chi2 values')
print(sel.scores_)
c,p = sk.feature_selection.chi2(X, y)
print('\nChi2 values')
print(c) #The chi-square value between X columns and y
print('\np-values')
print(p) #The p-value for the test
Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]

Chi2 values
[ 10.81782088   3.7107283  116.31261309  67.0483602 ]

p-values
[4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]

Supervised Learning¶

scikit-learn has several classes and objects for implementing different supervised learning techniques such as Regression and Classification.

Regardless of the model being used, the following methods are implemented:

The method fit() takes the training data and labels/values, and trains the model.

The method predict() takes as input the test data and applies the model.
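
For illustration, here is a minimal sketch of this shared fit/predict interface, using a k-nearest-neighbors classifier on a tiny made-up dataset (the model choice and the numbers are placeholders, not part of the tutorial data):

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up dataset: two features, two classes
X_toy = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_toy = [0, 1, 0, 1]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_toy, y_toy)                 # learn from training data and labels
print(model.predict([[0.9, 0.1]]))      # apply the model to new data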

Linear Regression¶

Linear Regression is implemented in the library sklearn.linear_model.

LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Before we dive into Classification (by classifying flowers), let's start with the simplest form of Supervised Learning: Linear Regression.

What is Supervised Learning?

Imagine teaching a child. You show them flashcards:

  • Input (X): A picture of a cat.
  • Answer (y): "Cat".
  • Input (X): A picture of a dog.
  • Answer (y): "Dog".

Over time, the child learns to associate the Input with the Answer. In Machine Learning:

  • Regression: The answer is a Number (e.g., Price, Temperature, Salary).
  • Classification: The answer is a Label (e.g., Cat, Dog, Spam, Not Spam).

The Task: "Study Time vs. Grades"

We will create a tiny dataset to predict a student's Test Score (y) based on how many Hours they Studied (X).

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# 1. Prepare the Data
# X = Input Feature (Hours Studied)
X_hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# y = Target Label (Test Score)
# We add some "noise" so they don't fall on a perfect line.
# Perfect world: [60, 70, 80, 90, 100]
# Real world:
y_scores = np.array([58, 74, 78, 92, 96])

print("Inputs (X):\n", X_hours)
print("Targets (y):\n", y_scores)

# 2. Visualize the Data
plt.figure(figsize=(6,4))
plt.scatter(X_hours, y_scores, color='blue', s=100, label='Actual Students')
plt.title("Study Hours vs. Test Scores (Real Data)")
plt.xlabel("Hours Studied")
plt.ylabel("Test Score")
plt.grid(True)
plt.legend()
plt.show()
Inputs (X):
 [[1]
 [2]
 [3]
 [4]
 [5]]
Targets (y):
 [58 74 78 92 96]

Training the Model:

The "Line of Best Fit" Notice that the points do not form a perfect straight line.

Linear Regression cannot touch every point. Instead, it tries to find the "Best Fit" line—the line that is close to everyone on average, minimizing the total error (distance) between the points and the line.

In [ ]:
from sklearn.linear_model import LinearRegression

# 1. Instantiate
reg = LinearRegression()

# 2. Fit (Find the best line through the mess)
reg.fit(X_hours, y_scores)

# 3. Inspect the model
print(f"Coefficient (Slope): {reg.coef_[0]:.2f}")
print(f"Intercept (Bias):   {reg.intercept_:.2f}")

# The Equation
print(f"\nModel Equation: Score = {reg.coef_[0]:.2f} * Hours + {reg.intercept_:.2f}")
Coefficient (Slope): 9.40
Intercept (Bias):   51.40

Model Equation: Score = 9.40 * Hours + 51.40

Question: What grade would we expect for a student who studied 3.5 hours?

In [ ]:
# 1. Predict for a new value (3.5 hours)
new_X = np.array([[3.5]])
prediction = reg.predict(new_X)

print(f"Prediction: If you study for 3.5 hours, you will get a score of: {prediction[0]:.2f}")

# 2. Visualize the Result
plt.figure(figsize=(8,5))

# Plot actual data points
plt.scatter(X_hours, y_scores, color='blue', s=100, label='Actual Data')

# Plot the model's line
# We predict on the original X to draw the red line
plt.plot(X_hours, reg.predict(X_hours), color='red', linewidth=2, label='Line of Best Fit')

# Plot the new prediction
plt.scatter(new_X, prediction, color='green', marker='*', s=300, zorder=5, label='Prediction (3.5 hrs)')

# Visual Trick: Draw vertical lines to show the "Error" (Residuals)
for i in range(len(X_hours)):
    plt.plot([X_hours[i], X_hours[i]], [y_scores[i], reg.predict(X_hours)[i]], 'k--', alpha=0.3)

plt.title("Linear Regression: Minimizing the Error")
plt.xlabel("Hours Studied")
plt.ylabel("Test Score")
plt.legend()
plt.grid(True)
plt.show()
Prediction: If you study for 3.5 hours, you will get a score of: 84.30

The $R^2$ score computes the "explained variance"

$R^2 = 1-\frac{\sum_i (y_i -\hat y_i)^2}{\sum_i (y_i -\bar y)^2}$

where $\hat y_i$ is the prediction for point $x_i$ and $\bar y$ is the mean value of the target variable
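
As a quick sanity check, the formula above can be computed by hand with NumPy and compared against the value returned by reg.score() in the next cell (a sketch reusing reg, X_hours and y_scores from earlier):

In [ ]:
# Compute R^2 manually from its definition
y_hat = reg.predict(X_hours)                        # predictions for each x_i
ss_res = np.sum((y_scores - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_scores - y_scores.mean()) ** 2)  # total sum of squares
print("Manual R^2:", 1 - ss_res / ss_tot)           # should match reg.score(X_hours, y_scores)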

In [ ]:
r2_score = reg.score(X_hours, y_scores)

print(f"R² Score: {r2_score:.4f}")
R² Score: 0.9571

Preparing the data¶

To perform classification we first need to split the data into train and test datasets.

In [ ]:
from sklearn.datasets import load_iris
import sklearn.utils as utils

iris = load_iris()
print("sample of data")
print(iris.data[:5,:])
print("\nthe class labels vector")
print(iris.target)
print("\nthe names of the classes:",iris.target_names)
print(iris.feature_names)
sample of data
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

the class labels vector
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

the names of the classes: ['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Randomly shuffle the data. This ensures that the data is in random order before we split it.

In [ ]:
X, y = utils.shuffle(iris.data, iris.target, random_state=1) #shuffle the data
print(X.shape)
print(y.shape)
print(y)
(150, 4)
(150,)
[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
 0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2
 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1
 2 0]

Select a subset for training and a subset for testing

In [ ]:
train_set_size = 100
X_train = X[:train_set_size]  # selects first 100 rows (samples) for train set
y_train = y[:train_set_size]
X_test = X[train_set_size:]   # selects from row 100 until the last one for test set
y_test = y[train_set_size:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(100, 4) (100,)
(50, 4) (50,)

We can also use the train_test_split function of scikit-learn for splitting the data into train and test sets. In this case you do not need the random shuffling (but it does not hurt).

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

Classification models¶

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

scikit-learn has classes and objects that implement the different classification techniques that we described in class.

Decision Trees¶

http://scikit-learn.org/stable/modules/tree.html

Train and apply a decision tree classifier. The default score computed in the classifier object is the accuracy. Decision trees can also give us probabilities.

Discrete Mathematics: Elementary and Beyond (2003) by László Lovász, József Pelikán, and Katalin Vesztergombi. Chapter 8:

image.png

In [ ]:
from sklearn import tree

dtree = tree.DecisionTreeClassifier() # defaults: max_depth=None; no fixed random_state, so ties between equally good splits may be broken differently across runs (set random_state=0 for reproducibility)
dtree = dtree.fit(X_train, y_train)

print("classifier accuracy:",dtree.score(X_test,y_test))

y_pred = dtree.predict(X_test)
y_prob = dtree.predict_proba(X_test)
print("\nclassifier predictions:",y_pred[:10])
print("ground truth labels   :",y_test[:10])
print('\n')
print(y_prob[:10])
classifier accuracy: 0.9166666666666666

classifier predictions: [2 2 2 0 0 0 2 2 2 2]
ground truth labels   : [1 2 2 0 0 0 2 2 2 2]


[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]

Why does y_prob have only 0 and 1 entries?

Compute some more metrics

In [ ]:
print("accuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
accuracy: 0.9166666666666666

Confusion matrix
[[20  0  0]
 [ 0 15  4]
 [ 0  1 20]]

Precision Score per class
[1.         0.9375     0.83333333]

Average Precision Score
0.921875

Recall Score per class
[1.         0.78947368 0.95238095]

Average Recall Score
0.9166666666666666

F1-score Score per class
[1.         0.85714286 0.88888889]

Average F1 Score
0.9158730158730158

Visualize the decision tree.

For this you will need to install the package python-graphviz

In [ ]:
#conda install python-graphviz
import graphviz
print(iris.feature_names)
dot_data = tree.export_graphviz(dtree,out_file=None)
graph = graphviz.Source(dot_data)
graph
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Out[ ]:
In [ ]:
dtree2 = tree.DecisionTreeClassifier(max_depth=3)  # Based on the diagram above, most of the classification is done by depth 3
dtree2 = dtree2.fit(X_train, y_train)
print(dtree2.score(X_test,y_test))
dot_data2 = tree.export_graphviz(dtree2,out_file=None)
graph2 = graphviz.Source(dot_data2)
graph2
0.9166666666666666
Out[ ]:

k-NN Classification¶

https://scikit-learn.org/stable/modules/neighbors.html#classification

While the Decision Tree tried to learn explicit rules (like "if petal length < 2.45..."), the k-Nearest Neighbors (k-NN) algorithm doesn't "learn" rules at all. Instead, it relies on similarity.

It is often called a "Lazy Learner" because it doesn't build a model during the training phase. It simply memorizes the training data.

The Intuition: The logic follows the old proverb:

"Tell me who your friends are, and I will tell you who you are."

How it works: When we ask the model to classify a new, unseen flower, it follows these steps (a small code sketch of these steps follows below):

  1. Measure Distance: It calculates the distance (usually Euclidean) between the new flower and every flower in the training set.
  2. Find Neighbors: It picks the $k$ closest flowers (where $k$ is a number we choose, e.g., 3).
  3. Majority Vote: It looks at the classes of those $k$ neighbors. If 2 are 'Versicolor' and 1 is 'Virginica', the model predicts 'Versicolor'.

Key Parameter: n_neighbors. The most important setting here is $k$ (in code: n_neighbors).

  • Small $k$ (e.g., 1): The model is very sensitive to noise (a single weird flower can flip the result).
  • Large $k$ (e.g., 50): The decision becomes very smooth, but we might lose local details.
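
Before using scikit-learn's implementation in the next cell, here is a minimal NumPy sketch of the three steps above for a single test point (illustrative only; it reuses X_train, y_train, X_test from earlier):

In [ ]:
# Illustrative sketch of k-NN "by hand" (not the scikit-learn implementation)
k = 3
x_new = X_test[0]                                         # one unseen sample

# 1. Measure Distance: Euclidean distance to every training sample
dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

# 2. Find Neighbors: indices of the k closest training samples
nearest = np.argsort(dists)[:k]

# 3. Majority Vote among the neighbors' labels
votes = np.bincount(y_train[nearest])
print("neighbor labels:", y_train[nearest], "-> prediction:", votes.argmax())
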
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
print("classifier score:", knn.score(X_test,y_test))

y_pred = knn.predict(X_test)

print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.9333333333333333

accuracy: 0.9333333333333333

Confusion matrix
[[20  0  0]
 [ 0 16  3]
 [ 0  1 20]]

Precision Score per class
[1.         0.94117647 0.86956522]

Average Precision Score
0.9357203751065644

Recall Score per class
[1.         0.84210526 0.95238095]

Average Recall Score
0.9333333333333333

F1-score Score per class
[1.         0.88888889 0.90909091]

Average F1 Score
0.9329966329966329

Example image of how k-NN draws decision boundaries. In our case, with 4 features, such a plot cannot be made since it would have to be in 4D.

image.png

SVM Classification¶

http://scikit-learn.org/stable/modules/svm.html

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [ ]:
from sklearn import svm

#svm_clf = svm.LinearSVC()
#svm_clf = svm.SVC(kernel = 'poly')   # kernel determines how the data is transformed to make it linearly separable
svm_clf = svm.SVC()                   # default kernel: Radial basis function (rbf)
svm_clf.fit(X_train,y_train)
print("classifier score:",svm_clf.score(X_test,y_test))
y_pred = svm_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.95

accuracy: 0.95

Confusion matrix
[[20  0  0]
 [ 0 18  1]
 [ 0  2 19]]

Precision Score per class
[1.   0.9  0.95]

Average Precision Score
0.9508333333333333

Recall Score per class
[1.         0.94736842 0.9047619 ]

Average Recall Score
0.95

F1-score Score per class
[1.         0.92307692 0.92682927]

Average F1 Score
0.9500312695434646

kNN vs SVM Decision Boundaries

kNN: Decision boundaries are determined by local data density based on majority voting among the k nearest neighbors

SVM: Decision boundaries are determined by maximizing the margin using support vectors. Particularly effective in high-dimensional or sparse datasets.

Logistic Regression¶

http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

Despite its name, Logistic Regression is used for Classification, not Regression.

Why is it different?

  1. The Shape: Unlike Decision Trees (which cut the data into rectangles) or k-NN (which creates complex, wiggly shapes), Logistic Regression is a Linear Classifier. It tries to draw a single straight line (or a flat plane) to separate the classes.
  2. The Output: It doesn't just give us a "Yes/No" answer. It gives us a Probability (a score between 0 and 1).

The Intuition¶

Imagine you are trying to separate red and blue marbles on a table:

  • Decision Tree: You build a wall of LEGO bricks around the red marbles.
  • k-NN: You look at clusters of marbles.
  • Logistic Regression: You place a single straight stick on the table to separate them as best as you can.

Because it is "linear," it might struggle if the data is very complex (e.g., a circle inside a circle), but it is very fast and easy to interpret.
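
The probability comes from passing a linear score through the logistic (sigmoid) function. A tiny sketch of that mapping (the coefficient, intercept and feature values below are made up, purely for illustration):

In [ ]:
# Sketch: a linear score w*x + b squashed into a probability in (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 1.5, -4.0                      # hypothetical coefficient and intercept
x = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical feature values
print(sigmoid(w * x + b))             # probabilities between 0 and 1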

In [ ]:
import sklearn.linear_model as linear_model

lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_train, y_train)
print("classifier score:",lr_clf.score(X_test,y_test))
y_pred = lr_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.95

accuracy: 0.95

Confusion matrix
[[20  0  0]
 [ 0 17  2]
 [ 0  1 20]]

Precision Score per class
[1.         0.94444444 0.90909091]

Average Precision Score
0.9505892255892257

Recall Score per class
[1.         0.89473684 0.95238095]

Average Recall Score
0.95

F1-score Score per class
[1.         0.91891892 0.93023256]

Average F1 Score
0.9499057196731615

For Logistic Regression we can also obtain the probabilities for the different classes

In [ ]:
probs = lr_clf.predict_proba(X_test)
print("Class Probabilities (first 10):")
print (probs[:10])
print(y_test[:10])
print(probs.argmax(axis = 1)[:10])
print(probs.max(axis = 1)[:10])
Class Probabilities (first 10):
[[3.58062452e-03 4.78837536e-01 5.17581840e-01]
 [1.14629520e-03 4.05835759e-01 5.93017945e-01]
 [3.92636719e-05 6.43273069e-02 9.35633429e-01]
 [9.58095468e-01 4.19028153e-02 1.71652335e-06]
 [9.43362912e-01 5.66341310e-02 2.95722406e-06]
 [9.82882494e-01 1.71170688e-02 4.36780706e-07]
 [5.55192517e-05 7.15153840e-02 9.28429097e-01]
 [1.95254291e-04 1.77447939e-01 8.22356806e-01]
 [1.39404689e-04 1.43508126e-01 8.56352469e-01]
 [1.28136030e-05 1.91639422e-02 9.80823244e-01]]
[1 2 2 0 0 0 2 2 2 2]
[2 2 2 0 0 0 2 2 2 2]
[0.51758184 0.59301795 0.93563343 0.95809547 0.94336291 0.98288249
 0.9284291  0.82235681 0.85635247 0.98082324]

And the coefficients of the logistic regression model

In [ ]:
print(lr_clf.coef_)
[[-0.41634761  0.72750266 -2.19259055 -0.94546959]
 [ 0.0821038  -0.38691851 -0.0226097  -0.69668026]
 [ 0.33424381 -0.34058416  2.21520024  1.64214985]]
In [ ]:
from matplotlib.colors import ListedColormap

# Select two features for visualization
feature_1_idx = 0  # e.g., sepal length
feature_2_idx = 2  # e.g., petal length

X_train_2d = X_train[:, [feature_1_idx, feature_2_idx]]
X_test_2d = X_test[:, [feature_1_idx, feature_2_idx]]

# Train Logistic Regression on the two selected features
lr_clf_2d = linear_model.LogisticRegression(solver='lbfgs', multi_class='ovr')
lr_clf_2d.fit(X_train_2d, y_train)

# Generate a grid for plotting decision boundaries
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Predict for each grid point
Z = lr_clf_2d.predict(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.figure(figsize=(10, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

plt.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_light)
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
plt.title("Logistic Regression Decision Boundaries - Feature 1 vs Feature 2")
plt.xlabel(f"Feature {feature_1_idx + 1} (e.g., {iris.feature_names[feature_1_idx]})")
plt.ylabel(f"Feature {feature_2_idx + 1} (e.g., {iris.feature_names[feature_2_idx]})")
plt.show()
/usr/local/lib/python3.12/dist-packages/sklearn/linear_model/_logistic.py:1256: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. Use OneVsRestClassifier(LogisticRegression(..)) instead. Leave it to its default value to avoid this warning.
  warnings.warn(

Neural Networks: Multi-Layer Perceptron¶

https://scikit-learn.org/stable/modules/neural_networks_supervised.html

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier

Multi-Layer Perceptron (MLP) is the simplest form of a Neural Network.

Logistic Regression used a single equation (a single "neuron") to separate the data with a straight line.

  • MLP works by combining many of these "neurons" together in layers.
  • Hidden Layers: The layers in the middle allow the model to learn complex, non-linear patterns.

image.png

Linear vs. Non-Linear:

  • Logistic Regression: Can only draw straight lines.
  • MLP: Can draw curves, circles, and complex squiggles. It bends the decision boundary to fit the data.

The Trade-off: While powerful, MLPs are "Black Boxes." Unlike the Decision Tree (where we could see the rules) or Logistic Regression (where we saw the coefficients), it is very hard to look inside a Neural Network and understand exactly why it made a specific decision.
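
The architecture is controlled by the hidden_layer_sizes parameter. As a sketch, a network with two hidden layers could be specified as follows (the layer sizes are arbitrary, chosen only for illustration):

In [ ]:
from sklearn.neural_network import MLPClassifier

# Sketch: two hidden layers with 10 and 5 neurons (arbitrary sizes)
mlp_custom = MLPClassifier(hidden_layer_sizes=(10, 5), solver='lbfgs', max_iter=1000)
mlp_custom.fit(X_train, y_train)
print("custom architecture score:", mlp_custom.score(X_test, y_test))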

In [ ]:
from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier(solver='lbfgs')
print("MLP Default Architecture")
print(mlp_clf.hidden_layer_sizes)
mlp_clf.fit(X_train, y_train)
print("\nclassifier score:",mlp_clf.score(X_test,y_test))
y_pred = mlp_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
MLP Default Architecture
(100,)

classifier score: 0.9333333333333333

accuracy: 0.9333333333333333

Confusion matrix
[[20  0  0]
 [ 0 17  2]
 [ 0  2 19]]

Precision Score per class
[1.         0.89473684 0.9047619 ]

Average Precision Score
0.9333333333333333

Recall Score per class
[1.         0.89473684 0.9047619 ]

Average Recall Score
0.9333333333333333

F1-score Score per class
[1.         0.89473684 0.9047619 ]

Average F1 Score
0.9333333333333333

Let's see the decision boundaries that the MLP makes

In [ ]:
# Select two features for visualization
feature_1_idx = 0  # First feature
feature_2_idx = 2  # Third feature

X_train_2d = X_train[:, [feature_1_idx, feature_2_idx]]
X_test_2d = X_test[:, [feature_1_idx, feature_2_idx]]

# Train MLP on selected features
mlp_clf_2d = MLPClassifier(solver='lbfgs')
mlp_clf_2d.fit(X_train_2d, y_train)

# Generate grid for decision boundaries
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = mlp_clf_2d.predict(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.figure(figsize=(10, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

plt.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_light)
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
plt.title("MLP Decision Boundaries - Feature 1 vs Feature 3")
plt.xlabel(f"Feature {feature_1_idx + 1}")
plt.ylabel(f"Feature {feature_2_idx + 1}")
plt.show()

Computing Scores

In [ ]:
p,r,f,s = metrics.precision_recall_fscore_support(y_test,y_pred)  # y_pred is from the MLP
print(p)
print(r)
print(f)
print(s)
[1.         0.89473684 0.9047619 ]
[1.         0.89473684 0.9047619 ]
[1.         0.89473684 0.9047619 ]
[20 19 21]
In [ ]:
report = metrics.classification_report(y_test,y_pred)  # Support tells us how many samples of each class are in y_test
print(report)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       0.89      0.89      0.89        19
           2       0.90      0.90      0.90        21

    accuracy                           0.93        60
   macro avg       0.93      0.93      0.93        60
weighted avg       0.93      0.93      0.93        60

More on Evaluation: http://scikit-learn.org/stable/model_selection.html#model-selection

A more complex example with the diabetes dataset

In [ ]:
diabetes_X, diabetes_y = sk_data.load_diabetes(return_X_y=True)   # X = 10 features (Age, BMI, Blood pressure, etc...)
                                                                  # y = Quantitative measure for disease progression (Continuous Value)
# Shuffle the data
diabetes_X, diabetes_y = utils.shuffle(diabetes_X, diabetes_y, random_state=1)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('\nMean squared error: %.2f'
      % metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction           # Obviously here we don't have a perfect fit for a linear model like before
# Computed over the *test* data
print('\nCoefficient of determination: %.2f'
      % metrics.r2_score(diabetes_y_test, diabetes_y_pred))
print('\nPredictions:')
print(diabetes_y_pred)
print('\nTrue values:')
print(diabetes_y_test)
Coefficients: 
 [  -8.80064784 -238.68584171  515.45675075  329.26068533 -878.18560171
  530.03616927  126.04120869  213.28386451  734.45899793   67.32731226]

Mean squared error: 2305.01

Coefficient of determination: 0.68

Predictions:
[149.7546802  199.76667761 248.11135815 182.95023775  98.34758327
  96.66442169 248.60103198  64.8494556  234.5250113  209.30960598
 179.26527876  85.95464444  70.53999409 197.9358827  100.34679414
 116.8171366  134.97124147  64.08460686 178.33480132 155.32208789]

True values:
[168. 221. 310. 283.  81.  94. 277.  72. 270. 268. 174.  96.  83. 222.
  69. 153. 202.  43. 124. 276.]

Precision vs Recall

High Precision = Fewer false positives

High Recall = Fewer false negatives

In this binary example we get different Precision and Recall scores for different threshold values.

For a given threshold: $t$, the model predicts the positive class $(\hat{y} = 1)$ for all samples where:

$$ P(\text{class}=1) \geq t $$

and predicts the negative class $( \hat{y} = 0)$ otherwise. The threshold is varied over the range of predicted probabilities, and for each threshold, the Precision and Recall are computed as:

$$ \text{Precision}(t) = \frac{\text{True Positives}(t)}{\text{True Positives}(t) + \text{False Positives}(t)} $$

$$ \text{Recall}(t) = \frac{\text{True Positives}(t)}{\text{True Positives}(t) + \text{False Negatives}(t)} $$

The resulting Precision-Recall Curve is plotted by connecting the Precision and Recall values computed for all thresholds.
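
The cell below assumes a binary dataset with variables X_cancer_test, y_cancer_test and a classifier lr_clf fitted on the corresponding training split; this preparation is not shown in the cells above. One way it might be set up (a sketch using scikit-learn's breast cancer dataset) is:

In [ ]:
# Sketch: prepare binary-classification data for the Precision-Recall example below
# (assumes the breast cancer dataset; not part of the original cells)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.linear_model as linear_model

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)
X_cancer_train, X_cancer_test, y_cancer_train, y_cancer_test = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=0)

lr_clf = linear_model.LogisticRegression(solver='lbfgs', max_iter=5000)
lr_clf.fit(X_cancer_train, y_cancer_train)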

In [ ]:
y_cancer_pred = lr_clf.predict(X_cancer_test)
cancer_probs = lr_clf.predict_proba(X_cancer_test)
print("Class Probabilities (first 10):")
print (cancer_probs[:10])
y_cancer_scores = cancer_probs[:,1]

precision, recall, thresholds = metrics.precision_recall_curve(y_cancer_test, y_cancer_scores)

# Plot the Precision-Recall Curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='darkorange', lw=2)
plt.title("Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.grid(True)
plt.show()
In [ ]:
fpr, tpr, ths = metrics.roc_curve(y_cancer_test,y_cancer_scores)
plt.plot(fpr,tpr,color='darkorange',lw=2)
print(metrics.roc_auc_score(y_cancer_test,y_cancer_scores))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
#plt.legend(loc="lower right")
plt.show()
In [ ]:
(Xtoy,y_toy)=sk_data.make_classification(n_samples=10000)  # Synthetic dataset for binary classification
Xttrain = Xtoy[:8000,:]                                    # X = 20 features
Xttest = Xtoy[8000:,:]
yttrain = y_toy[:8000]
yttest = y_toy[8000:]

lr_clf.fit(Xttrain, yttrain)
#print(lr_clf.score(Xttest,yttest))
#y_tpred = lr_clf.predict(X_test)
tprobs = lr_clf.predict_proba(Xttest)
print (tprobs[:10])

y_tscores = tprobs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(yttest,y_tscores)
plt.plot(recall,precision)

k-fold cross validation¶

In k-fold cross validation the data is split into k equal parts; k-1 parts are used for training and the remaining one for testing. k models are trained, each time leaving out a different part for testing.

https://scikit-learn.org/stable/modules/cross_validation.html

There are two methods for implementing k-fold cross-validation in the model_selection module: cross_val_score and cross_validate. The latter allows multiple metrics to be considered together.
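
Under the hood, both helpers perform the splitting and repeated training explicitly. A minimal sketch of what they do, using KFold directly (illustrative only):

In [ ]:
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Sketch: explicit 5-fold cross validation for a decision tree
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier()
    clf.fit(X[train_idx], y[train_idx])                       # train on k-1 parts
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out part
print(fold_scores, np.mean(fold_scores))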

In [ ]:
import sklearn.model_selection as model_selection

scores = model_selection.cross_val_score(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring='f1_weighted',
                                          cv=5)
print (scores)
print (scores.mean())
[1.         0.93333333 0.96658312 0.96658312 0.93265993]
0.9598319029897977
In [ ]:
scores = model_selection.cross_validate(#lr_clf,
                                          #svm_clf,
                                          #knn,
                                          dtree,
                                          X,
                                          y,
                                          scoring=['precision_weighted','recall_weighted'],
                                          cv=3)
print (scores)
print (scores['test_precision_weighted'].mean(),scores['test_recall_weighted'].mean())
{'fit_time': array([0.00221992, 0.00099707, 0.00091362]), 'score_time': array([0.00517941, 0.003649  , 0.0036037 ]), 'test_precision_weighted': array([0.96      , 0.90834586, 0.88308772]), 'test_recall_weighted': array([0.96, 0.9 , 0.88])}
0.9171445279866332 0.9133333333333332

Creating a pipeline¶

If the same steps are often repeated, you can create a pipeline to perform them all at once:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
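
A minimal sketch of such a pipeline, chaining a scaler and a classifier into a single estimator (the step names and the chosen models here are arbitrary, for illustration only):

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Sketch: scaling + k-NN as one estimator; fit()/predict() run both steps in order
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X_train, y_train)
print("pipeline score:", pipe.score(X_test, y_test))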

Text classification Example¶

We will use the 20 newsgroups dataset to do a text classification example.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.space','rec.sport.baseball']
#categories = ['alt.atheism', 'rec.sport.baseball']
news_train = sk_data.fetch_20newsgroups(subset='train',
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
print (len(news_train.target))
X_news_train_data = news_train.data
y_news_train = news_train.target
news_test = sk_data.fetch_20newsgroups(subset='test',
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
print (len(news_test.target))
X_news_test_data = news_test.data
y_news_test = news_test.target
1190
791
In [3]:
X_news_train_data[0]
Out[3]:
"I've been saying this for quite some time, but being absent from the\nnet for a while I figured I'd stick my neck out a bit...\n\nThe Royals will set the record for fewest runs scored by an AL\nteam since the inception of the DH rule.  (p.s. any ideas what this is?)\n\nThey will fall easily short of 600 runs, that's for damn sure.  I can't\nbelieve these media fools picking them to win the division (like our\nTom Gage of the Detroit News claiming Herk Robinson is some kind of\ngenius for the trades/aquisitions he's made)\n\nc-ya\n\nSean\n\n"
In [4]:
y_news_train[0]
Out[4]:
np.int64(0)
In [5]:
from sklearn.linear_model import LinearRegression
import sklearn.linear_model as linear_model

import sklearn.feature_extraction.text as sk_text

vectorizer = sk_text.TfidfVectorizer(stop_words='english',
                             #max_features = 1000,
                             min_df=4, max_df=0.8)
X_news_train = vectorizer.fit_transform(X_news_train_data)

lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_news_train, y_news_train)
Out[5]:
LogisticRegression()
In [6]:
X_news_test = vectorizer.transform(X_news_test_data)
print("classifier score:",lr_clf.score(X_news_test,y_news_test))
classifier score: 0.9216182048040455

Word embeddings and Text classification¶

We will now see how we can train and use word embeddings.

The Gensim library¶

The Gensim library has several NLP models.

You can use existing modules to train a word2vec model: https://radimrehurek.com/gensim/models/word2vec.html

In [7]:
!pip install gensim

import gensim
import gensim.models
from gensim.models import Word2Vec
from gensim import utils
Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.12/dist-packages (from gensim) (2.0.2)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.16.3)
Requirement already satisfied: smart_open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim) (7.5.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open>=1.8.1->gensim) (2.0.1)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 64.4 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0

The utils library does preprocessing of the text: it lower-cases and tokenizes the text and removes punctuation. The end result is a list of tokens, which is what we need as input for training or using the different models.

In [8]:
train_gsim = [utils.simple_preprocess(x) for x in X_news_train_data]
train_data_labels = [(x,y) for (x,y) in zip(train_gsim, y_news_train) if len(x) > 0]  # Remove documents that become empty after preprocessing
X_news_train_gsim = [x for (x,y) in train_data_labels]
y_news_train_gsim = [y for (x,y) in train_data_labels]
In [9]:
test_gsim = [utils.simple_preprocess(x) for x in X_news_test_data]
test_data_labels = [(x,y) for (x,y) in zip(test_gsim, y_news_test) if len(x) > 0]
X_news_test_gsim = [x for (x,y) in test_data_labels]
y_news_test_gsim = [y for (x,y) in test_data_labels]
In [10]:
X_news_train_gsim[0]
Out[10]:
['ve',
 'been',
 'saying',
 'this',
 'for',
 'quite',
 'some',
 'time',
 'but',
 'being',
 'absent',
 'from',
 'the',
 'net',
 'for',
 'while',
 'figured',
 'stick',
 'my',
 'neck',
 'out',
 'bit',
 'the',
 'royals',
 'will',
 'set',
 'the',
 'record',
 'for',
 'fewest',
 'runs',
 'scored',
 'by',
 'an',
 'al',
 'team',
 'since',
 'the',
 'inception',
 'of',
 'the',
 'dh',
 'rule',
 'any',
 'ideas',
 'what',
 'this',
 'is',
 'they',
 'will',
 'fall',
 'easily',
 'short',
 'of',
 'runs',
 'that',
 'for',
 'damn',
 'sure',
 'can',
 'believe',
 'these',
 'media',
 'fools',
 'picking',
 'them',
 'to',
 'win',
 'the',
 'division',
 'like',
 'our',
 'tom',
 'gage',
 'of',
 'the',
 'detroit',
 'news',
 'claiming',
 'herk',
 'robinson',
 'is',
 'some',
 'kind',
 'of',
 'genius',
 'for',
 'the',
 'trades',
 'aquisitions',
 'he',
 'made',
 'ya',
 'sean']

Train a CBOW (Continuous Bag Of Words) embedding on the training data corpus

image.png

Machine learning models cannot understand text; they only understand numbers. Word2Vec is a technique to turn words into lists of numbers (vectors) so that words with similar meanings have similar numbers.

In [11]:
embedding_size = 50 # "Resolution" of the word representation: each word ("apple", "king", etc.) is represented as a list of 50 numbers.
cbow_model = gensim.models.Word2Vec(X_news_train_gsim, min_count = 1,vector_size = embedding_size, window = 10) # "Attention span": looks at 10 words to the left and 10 words to the right.

We now have a representation of the words as 50-dimensional real vectors

In [12]:
cbow_model.wv['pitch']
Out[12]:
array([-0.15768205, -0.15172504, -0.12478593,  0.12387501, -0.27375695,
       -0.32890198,  0.29313874,  0.8522625 , -0.48969692, -0.08721875,
        0.1272012 , -0.6764343 ,  0.14053679,  0.5181793 , -0.2863375 ,
        0.18444148,  0.47363278,  0.05032672, -0.6023291 , -0.57959276,
        0.29312116,  0.30031574,  0.70870405, -0.30768996,  0.6378389 ,
        0.13288465, -0.27912104,  0.05217481, -0.64838284,  0.2454371 ,
        0.04064707,  0.06659085,  0.04146409, -0.06713676, -0.33795828,
        0.19997104,  0.35694647, -0.21540166,  0.3260302 , -0.06656712,
        0.33125332,  0.02459017,  0.08362663, -0.02666873,  0.672934  ,
       -0.19353233, -0.10657518, -0.18800081,  0.5240707 ,  0.2803613 ],
      dtype=float32)

image.png

We can use this to find similar words

In [13]:
cbow_model.wv.most_similar('pitch')
Out[13]:
[('great', 0.9990735650062561),
 ('against', 0.9990703463554382),
 ('wasn', 0.998989462852478),
 ('down', 0.9989439845085144),
 ('every', 0.9989256262779236),
 ('average', 0.9988952875137329),
 ('very', 0.9988757967948914),
 ('am', 0.9988664984703064),
 ('again', 0.9988647699356079),
 ('run', 0.9988623857498169)]

Use the additivity property: Chicago + Cubs - Boston = Sox

Teams: Chicago Cubs and Boston Red Sox

In [14]:
cbow_model.wv.most_similar(positive=['chicago','cubs'],negative=['boston'])
Out[14]:
[('suck', 0.9981583952903748),
 ('detroit', 0.9950345158576965),
 ('road', 0.9944502711296082),
 ('milwaukee', 0.9943985342979431),
 ('astros', 0.9943371415138245),
 ('st', 0.9942942261695862),
 ('hernandez', 0.994268536567688),
 ('west', 0.9942171573638916),
 ('won', 0.9941810965538025),
 ('slg', 0.9940813779830933)]

Use the word embeddings to obtain a vector representation of the document by taking the average of the embeddings of its words. Transform the train and test data.

Average Embedding Vector = "Summary" Vector for the Document

In [16]:
np.array([cbow_model.wv[x] for x in X_news_train_gsim[0]]).mean(axis = 0)
Out[16]:
array([-0.4029563 , -0.33122525, -0.30095032,  0.21110743, -0.5092193 ,
       -0.7030677 ,  0.5381219 ,  1.8370438 , -1.0126023 , -0.14332198,
        0.25449035, -1.3880475 ,  0.29733127,  1.1351492 , -0.5628445 ,
        0.41576612,  1.0588037 ,  0.28731224, -1.2441657 , -1.1561385 ,
        0.6357985 ,  0.651013  ,  1.4021233 , -0.6070939 ,  1.3530061 ,
        0.268321  , -0.6106677 ,  0.17058086, -1.3819162 ,  0.4244899 ,
        0.07523321,  0.05736452,  0.13832642, -0.12398918, -0.7068729 ,
        0.43235666,  0.7909126 , -0.4662388 ,  0.73772854, -0.12105183,
        0.6979285 ,  0.06914084,  0.19897568, -0.16035889,  1.4181232 ,
       -0.32762772, -0.21987288, -0.3918733 ,  1.0586956 ,  0.5894503 ],
      dtype=float32)
In [20]:
X_news_train_cbow = [np.array([cbow_model.wv[x] for x in y]).mean(axis = 0) for y in X_news_train_gsim]
In [21]:
X_news_test_cbow = [np.array([cbow_model.wv[x] for x in y if x in cbow_model.wv]).mean(axis = 0) for y in X_news_test_gsim]

Train a classifier on the embeddings

In [22]:
lr_clf.fit(np.array(X_news_train_cbow), np.array(y_news_train_gsim))
Out[22]:
LogisticRegression()
In [23]:
lr_clf.score(np.array(X_news_test_cbow),y_news_test_gsim)
Out[23]:
0.7889182058047494

Train a SkipGram embedding on the training data corpus

image.png

CBOW vs. Skip-gram

In the code above, we trained a Word2Vec model. But how does it actually learn?

Word2Vec is not just one algorithm; it is a family of two distinct architectures that learn word meanings in opposite ways.

  1. Continuous Bag of Words (CBOW)
  • The Goal: Predict the Target Word based on the Context (surrounding words).
  • The Analogy: A "Fill in the blank" game.
  • Example:
    • Context: ["The", "cat", "sits", "on"]
    • Target: ?
    • Prediction: "mat"
  • Why use it? It is generally faster to train and produces slightly better accuracy for frequent words.
  • Note: This is the default method in gensim (which we used above).

  2. Skip-gram
  • The Goal: Predict the Context (surrounding words) based on the Target Word.
  • The Analogy: The reverse of CBOW. You are given one word and have to guess the "story" around it.
  • Example:
    • Target: "mat"
    • Prediction: ["The", "cat", "sits", "on"]
  • Why use it? It is slower to train but works much better for small datasets and rare words.
In [24]:
embedding_size = 50
skipgram_model = gensim.models.Word2Vec(X_news_train_gsim, min_count = 1,vector_size = embedding_size, window = 10, sg = 1)

Transform the train and test data

In [25]:
X_news_train_skipgram = [np.array([skipgram_model.wv[x] for x in y]).mean(axis = 0) for y in X_news_train_gsim]

X_news_test_skipgram = [np.array([skipgram_model.wv[x] for x in y if x in skipgram_model.wv]).mean(axis = 0) for y in X_news_test_gsim]

Train a classifier on the embeddings

In [26]:
lr_clf.fit(np.array(X_news_train_skipgram), np.array(y_news_train_gsim))

lr_clf.score(np.array(X_news_test_skipgram),y_news_test_gsim)
Out[26]:
0.9261213720316622

You can also download the Google word2vec model, trained over millions of documents (an example of using pre-trained models)

In [43]:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
[==================================================] 100.0% 1662.8/1662.8MB downloaded
/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
In [44]:
#path = 'C:\\Users\\tsapa/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz'
g_model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)
In [45]:
print(len(g_model['hello']))
print(g_model['hello'])
300
[-0.05419922  0.01708984 -0.00527954  0.33203125 -0.25       -0.01397705
 -0.15039062 -0.265625    0.01647949  0.3828125  -0.03295898 -0.09716797
 -0.16308594 -0.04443359  0.00946045  0.18457031  0.03637695  0.16601562
  0.36328125 -0.25585938  0.375       0.171875    0.21386719 -0.19921875
  0.13085938 -0.07275391 -0.02819824  0.11621094  0.15332031  0.09082031
  0.06787109 -0.0300293  -0.16894531 -0.20800781 -0.03710938 -0.22753906
  0.26367188  0.012146    0.18359375  0.31054688 -0.10791016 -0.19140625
  0.21582031  0.13183594 -0.03515625  0.18554688 -0.30859375  0.04785156
 -0.10986328  0.14355469 -0.43554688 -0.0378418   0.10839844  0.140625
 -0.10595703  0.26171875 -0.17089844  0.39453125  0.12597656 -0.27734375
 -0.28125     0.14746094 -0.20996094  0.02355957  0.18457031  0.00445557
 -0.27929688 -0.03637695 -0.29296875  0.19628906  0.20703125  0.2890625
 -0.20507812  0.06787109 -0.43164062 -0.10986328 -0.2578125  -0.02331543
  0.11328125  0.23144531 -0.04418945  0.10839844 -0.2890625  -0.09521484
 -0.10351562 -0.0324707   0.07763672 -0.13378906  0.22949219  0.06298828
  0.08349609  0.02929688 -0.11474609  0.00534058 -0.12988281  0.02514648
  0.08789062  0.24511719 -0.11474609 -0.296875   -0.59375    -0.29492188
 -0.13378906  0.27734375 -0.04174805  0.11621094  0.28320312  0.00241089
  0.13867188 -0.00683594 -0.30078125  0.16210938  0.01171875 -0.13867188
  0.48828125  0.02880859  0.02416992  0.04736328  0.05859375 -0.23828125
  0.02758789  0.05981445 -0.03857422  0.06933594  0.14941406 -0.10888672
 -0.07324219  0.08789062  0.27148438  0.06591797 -0.37890625 -0.26171875
 -0.13183594  0.09570312 -0.3125      0.10205078  0.03063965  0.23632812
  0.00582886  0.27734375  0.20507812 -0.17871094 -0.31445312 -0.01586914
  0.13964844  0.13574219  0.0390625  -0.29296875  0.234375   -0.33984375
 -0.11816406  0.10644531 -0.18457031 -0.02099609  0.02563477  0.25390625
  0.07275391  0.13574219 -0.00138092 -0.2578125  -0.2890625   0.10107422
  0.19238281 -0.04882812  0.27929688 -0.3359375  -0.07373047  0.01879883
 -0.10986328 -0.04614258  0.15722656  0.06689453 -0.03417969  0.16308594
  0.08642578  0.44726562  0.02026367 -0.01977539  0.07958984  0.17773438
 -0.04370117 -0.00952148  0.16503906  0.17285156  0.23144531 -0.04272461
  0.02355957  0.18359375 -0.41601562 -0.01745605  0.16796875  0.04736328
  0.14257812  0.08496094  0.33984375  0.1484375  -0.34375    -0.14160156
 -0.06835938 -0.14648438 -0.02844238  0.07421875 -0.07666016  0.12695312
  0.05859375 -0.07568359 -0.03344727  0.23632812 -0.16308594  0.16503906
  0.1484375  -0.2421875  -0.3515625  -0.30664062  0.00491333  0.17675781
  0.46289062  0.14257812 -0.25       -0.25976562  0.04370117  0.34960938
  0.05957031  0.07617188 -0.02868652 -0.09667969 -0.01281738  0.05859375
 -0.22949219 -0.1953125  -0.12207031  0.20117188 -0.42382812  0.06005859
  0.50390625  0.20898438  0.11230469 -0.06054688  0.33203125  0.07421875
 -0.05786133  0.11083984 -0.06494141  0.05639648  0.01757812  0.08398438
  0.13769531  0.2578125   0.16796875 -0.16894531  0.01794434  0.16015625
  0.26171875  0.31640625 -0.24804688  0.05371094 -0.0859375   0.17089844
 -0.39453125 -0.00156403 -0.07324219 -0.04614258 -0.16210938 -0.15722656
  0.21289062 -0.15820312  0.04394531  0.28515625  0.01196289 -0.26953125
 -0.04370117  0.37109375  0.04663086 -0.19726562  0.3046875  -0.36523438
 -0.23632812  0.08056641 -0.04248047 -0.14648438 -0.06225586 -0.0534668
 -0.05664062  0.18945312  0.37109375 -0.22070312  0.04638672  0.02612305
 -0.11474609  0.265625   -0.02453613  0.11083984 -0.02514648 -0.12060547
  0.05297852  0.07128906  0.00063705 -0.36523438 -0.13769531 -0.12890625]

The commands below can be slow and memory-intensive

In [46]:
g_model.most_similar('pitch')
Out[46]:
[('pitches', 0.7401652932167053),
 ('backdoor_slider', 0.5972762107849121),
 ('fastball', 0.5737808346748352),
 ('curveball', 0.5543882846832275),
 ('hanging_slider', 0.5523896217346191),
 ('hittable_pitch', 0.5503243207931519),
 ('leadoff_batter_Cliff_Floyd', 0.5496907830238342),
 ('offspeed_pitch', 0.547758936882019),
 ('atbat', 0.5441111326217651),
 ('yaw_converters_SCADA', 0.5410848259925842)]
In [47]:
g_model.most_similar(positive=['woman','king'],negative=['man'])
Out[47]:
[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]

Transform the train and test data

In [48]:
train_gmodel = [[g_model[x] for x in y if x in g_model] for y in X_news_train_gsim]
train_data_labels = [(x,y) for (x,y) in zip(train_gmodel, y_news_train) if len(x) > 0]
X_news_train_gm = [np.array(x) for (x,y) in train_data_labels]
y_news_train_gm = [y for (x,y) in train_data_labels]
In [49]:
X_news_train_gmodel = [x.mean(axis = 0) for x in X_news_train_gm]
In [50]:
test_gmodel = [[g_model[x] for x in y if x in g_model] for y in X_news_test_gsim]
test_data_labels = [(x,y) for (x,y) in zip(test_gmodel, y_news_test) if len(x) > 0]
X_news_test_gm = [np.array(x) for (x,y) in test_data_labels]
y_news_test_gm = [y for (x,y) in test_data_labels]
In [51]:
X_news_test_gmodel = [x.mean(axis = 0) for x in X_news_test_gm]

Train a classifier on the embeddings

In [52]:
lr_clf.fit(X_news_train_gmodel, np.array(y_news_train_gm))
Out[52]:
LogisticRegression()
In [53]:
lr_clf.score(np.array(X_news_test_gmodel),y_news_test_gm)
Out[53]:
0.4947229551451187

Why did the much more general word2vec model from Google perform much worse than our simple CBOW and Skip-gram models?

The GloVe Embedding¶

This is another embedding, produced by Stanford and available online. It is faster to use and performs better in terms of similarity and analogies, but not for classification.

In [28]:
import gensim.downloader as api
print(api.load('glove-wiki-gigaword-50', return_path=True))
[==================================================] 100.0% 66.0/66.0MB downloaded
/root/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
In [29]:
glove_model = api.load("glove-wiki-gigaword-50")
In [30]:
glove_model.most_similar('pitch')
Out[30]:
[('pitches', 0.8380102515220642),
 ('pitching', 0.775322675704956),
 ('ball', 0.7705615162849426),
 ('infield', 0.769540548324585),
 ('inning', 0.7672455906867981),
 ('game', 0.751035213470459),
 ('hitters', 0.7493574619293213),
 ('outfield', 0.7477315068244934),
 ('hitter', 0.7467021346092224),
 ('pitched', 0.7417561411857605)]
In [31]:
glove_model.most_similar(positive=['chicago','rangers'],negative=['texas'])
Out[31]:
[('blackhawks', 0.798629641532898),
 ('sabres', 0.7900724411010742),
 ('canucks', 0.7876150608062744),
 ('canadiens', 0.7621992826461792),
 ('leafs', 0.7570874094963074),
 ('bruins', 0.7503584027290344),
 ('oilers', 0.7478305101394653),
 ('dodgers', 0.7437342405319214),
 ('phillies', 0.7399099469184875),
 ('mets', 0.7378402352333069)]
In [32]:
glove_model.most_similar(positive=['woman','king'],negative=['man'])
Out[32]:
[('queen', 0.8523604273796082),
 ('throne', 0.7664334177970886),
 ('prince', 0.7592144012451172),
 ('daughter', 0.7473883628845215),
 ('elizabeth', 0.7460219860076904),
 ('princess', 0.7424570322036743),
 ('kingdom', 0.7337412238121033),
 ('monarch', 0.721449077129364),
 ('eldest', 0.7184861898422241),
 ('widow', 0.7099431157112122)]
In [33]:
train_glove = [[glove_model[x] for x in y if x in glove_model] for y in X_news_train_gsim]
train_data_labels = [(x,y) for (x,y) in zip(train_glove, y_news_train) if len(x) > 0]
X_news_train_glove = [np.array(x).mean(axis=0) for (x,y) in train_data_labels]
y_news_train_glove = [y for (x,y) in train_data_labels]
In [34]:
test_glove = [[glove_model[x] for x in y if x in glove_model] for y in X_news_test_gsim]
test_data_labels = [(x,y) for (x,y) in zip(test_glove, y_news_test) if len(x) > 0]
X_news_test_glove = [np.array(x).mean(axis=0) for (x,y) in test_data_labels]
y_news_test_glove = [y for (x,y) in test_data_labels]
In [35]:
lr_clf.fit(X_news_train_glove, np.array(y_news_train_glove))
Out[35]:
LogisticRegression()
In [36]:
lr_clf.score(np.array(X_news_test_glove),y_news_test_glove)
Out[36]:
0.5118733509234829

The Doc2Vec model¶

Similar to Word2Vec, Doc2Vec produces embeddings for full documents: the whole block of text is turned into a single vector. It performs much better for classification, indicating that simple averaging of the word embeddings is not a good option.

In [37]:
train_corpus = [gensim.models.doc2vec.TaggedDocument(X_news_train_gsim[i], [i]) for i in range(len(X_news_train_gsim))]
In [38]:
d2v_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
d2v_model.build_vocab(train_corpus)
d2v_model.train(train_corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
In [39]:
X_news_train_d2v = [d2v_model.infer_vector(x) for x in X_news_train_gsim]
X_news_test_d2v = [d2v_model.infer_vector(x) for x in X_news_test_gsim]
In [40]:
lr_clf.fit(X_news_train_d2v, np.array(y_news_train_gsim))
lr_clf.score(X_news_test_d2v,y_news_test_gsim)
Out[40]:
0.945910290237467