Supervised learning using scikit-learn¶
The goal of this tutorial is to introduce you to the scikit-learn library for classification. We will also cover feature selection and evaluation.
import numpy as np
import scipy.sparse as sp_sparse
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
import seaborn as sns
%matplotlib inline
Feature Selection¶
Feature selection is about finding the best features for your classifier. This may be important if you do not have enough training data. The idea is to find metrics that either characterize the features by themselves, or with respect to the class we want to predict, or with respect to other features.
http://scikit-learn.org/stable/modules/feature_selection.html
Variance Threshold¶
The VarianceThreshold selector drops features whose variance is below some threshold. If we have binary features we can set the threshold exactly so as to guarantee a specific ratio of 0's and 1's.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(np.array(X))
print('\n')
sel = VarianceThreshold(threshold=(0.8*(1 - .8))) # variance of a binary feature is p*(1-p); with p = 0.8 the threshold is 0.16 (the maximum, 0.25, occurs at p = 0.5)
sel.fit_transform(X) # drops features whose variance is below the threshold
[[0 0 1] [0 1 0] [1 0 0] [0 1 1] [0 1 0] [0 1 1]]
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
What happened in this example? Which feature was removed, and why?
import sklearn.datasets as sk_data
iris = sk_data.load_iris()
X = iris.data
print(X[1:10,:])
print('\nVariance of the Features:')
print(X.var(axis = 0))
sel = VarianceThreshold(threshold=0.2)
print('\nData after applying Variance Threshold:')
sel.fit_transform(X)[1:10]
[[4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2] [5.4 3.9 1.7 0.4] [4.6 3.4 1.4 0.3] [5. 3.4 1.5 0.2] [4.4 2.9 1.4 0.2] [4.9 3.1 1.5 0.1]] Variance of the Features: [0.68112222 0.18871289 3.09550267 0.57713289] Data after applying Variance Threshold:
array([[4.9, 1.4, 0.2],
[4.7, 1.3, 0.2],
[4.6, 1.5, 0.2],
[5. , 1.4, 0.2],
[5.4, 1.7, 0.4],
[4.6, 1.4, 0.3],
[5. , 1.5, 0.2],
[4.4, 1.4, 0.2],
[4.9, 1.5, 0.1]])
Is it always a good idea to remove low variance features? Can we think of a counterexample?
Univariate Feature Selection¶
A more sophisticated feature selection technique uses a statistical test to determine whether a feature and the class label are independent. An example of such a test is the chi-square test (there are more, as we have seen when studying statistics).
In this case we keep the features with high chi-square scores and low p-values.
The features with the lowest scores and highest p-values are rejected.
The chi-square test is usually applied to categorical data.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = sk_data.load_iris()
X, y = iris.data, iris.target
print(X.shape)
print('\nFeatures:')
print(X[1:10,:])
print('\nLabels:')
print(y)
sel = SelectKBest(chi2, k=2) # Select the top k=2 features with the highest chi-square scores
X_new = sel.fit_transform(X, y) # Now with chi2 we NEED the targets y to apply it
print('\nSelected Features:')
print(X_new[1:10])
(150, 4) Features: [[4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2] [5.4 3.9 1.7 0.4] [4.6 3.4 1.4 0.3] [5. 3.4 1.5 0.2] [4.4 2.9 1.4 0.2] [4.9 3.1 1.5 0.1]] Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] Selected Features: [[1.4 0.2] [1.3 0.2] [1.5 0.2] [1.4 0.2] [1.7 0.4] [1.4 0.3] [1.5 0.2] [1.4 0.2] [1.5 0.1]]
The chi-square scores and the p-values between each feature and the target variable (the columns of X and y):
print('Chi2 values')
print(sel.scores_)
c,p = sk.feature_selection.chi2(X, y)
print('\nChi2 values')
print(c) #The chi-square value between X columns and y
print('\np-values')
print(p) #The p-value for the test
Chi2 values [ 10.81782088 3.7107283 116.31261309 67.0483602 ] Chi2 values [ 10.81782088 3.7107283 116.31261309 67.0483602 ] p-values [4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
Supervised Learning¶
scikit-learn provides classes and objects for implementing different supervised learning techniques such as Regression and Classification.
Regardless of the model being used, the following methods are implemented:
The method fit() takes the training data and labels/values, and trains the model.
The method predict() takes as input the test data and applies the model.
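Because every scikit-learn estimator exposes the same interface, different models can be swapped without changing the surrounding code. A minimal sketch (the variable names and the 60/40 split here are illustrative; the iris data is introduced properly later in this tutorial):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
# Any classifier can be dropped into the same fit / predict / score calls
for model in [DecisionTreeClassifier(random_state=0), KNeighborsClassifier(n_neighbors=3)]:
    model.fit(X_tr, y_tr)             # train on the training data and labels
    y_pred = model.predict(X_te)      # apply the trained model to the test data
    print(type(model).__name__, model.score(X_te, y_te))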
Linear Regression¶
Linear Regression is implemented in the library sklearn.linear_model.
LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Before we dive into Classification (by classifying flowers), let's start with the simplest form of Supervised Learning: Linear Regression.
What is Supervised Learning?
Imagine teaching a child. You show them flashcards:
- Input (X): A picture of a cat.
- Answer (y): "Cat".
- Input (X): A picture of a dog.
- Answer (y): "Dog".
Over time, the child learns to associate the Input with the Answer. In Machine Learning:
- Regression: The answer is a Number (e.g., Price, Temperature, Salary).
- Classification: The answer is a Label (e.g., Cat, Dog, Spam, Not Spam).
The Task: "Study Time vs. Grades"
We will create a tiny dataset to predict a student's Test Score (y) based on how many Hours they Studied (X).
import numpy as np
import matplotlib.pyplot as plt
# 1. Prepare the Data
# X = Input Feature (Hours Studied)
X_hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
# y = Target Label (Test Score)
# We add some "noise" so they don't fall on a perfect line.
# Perfect world: [60, 70, 80, 90, 100]
# Real world:
y_scores = np.array([58, 74, 78, 92, 96])
print("Inputs (X):\n", X_hours)
print("Targets (y):\n", y_scores)
# 2. Visualize the Data
plt.figure(figsize=(6,4))
plt.scatter(X_hours, y_scores, color='blue', s=100, label='Actual Students')
plt.title("Study Hours vs. Test Scores (Real Data)")
plt.xlabel("Hours Studied")
plt.ylabel("Test Score")
plt.grid(True)
plt.legend()
plt.show()
Inputs (X): [[1] [2] [3] [4] [5]] Targets (y): [58 74 78 92 96]
Training the Model: The "Line of Best Fit"
Notice that the points do not form a perfect straight line. Linear Regression cannot touch every point. Instead, it tries to find the "best fit" line: the line that is closest to all the points on average, minimizing the total error (distance) between the points and the line.
from sklearn.linear_model import LinearRegression
# 1. Instantiate
reg = LinearRegression()
# 2. Fit (Find the best line through the mess)
reg.fit(X_hours, y_scores)
# 3. Inspect the model
print(f"Coefficient (Slope): {reg.coef_[0]:.2f}")
print(f"Intercept (Bias): {reg.intercept_:.2f}")
# The Equation
print(f"\nModel Equation: Score = {reg.coef_[0]:.2f} * Hours + {reg.intercept_:.2f}")
Coefficient (Slope): 9.40 Intercept (Bias): 51.40 Model Equation: Score = 9.40 * Hours + 51.40
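As a sanity check (not part of the original notebook), the same slope and intercept can be computed with the closed-form least-squares formulas, using the X_hours and y_scores arrays defined above:
# slope = Cov(x, y) / Var(x); intercept = mean(y) - slope * mean(x)
x = X_hours.ravel().astype(float)
slope = np.sum((x - x.mean()) * (y_scores - y_scores.mean())) / np.sum((x - x.mean())**2)
intercept = y_scores.mean() - slope * x.mean()
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")  # 9.40 and 51.40, matching reg.coef_ and reg.intercept_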
Question: What score would we expect for a student who studied 3.5 hours?
# 1. Predict for a new value (3.5 hours)
new_X = np.array([[3.5]])
prediction = reg.predict(new_X)
print(f"Prediction: If you study for 3.5 hours, you will get a score of: {prediction[0]:.2f}")
# 2. Visualize the Result
plt.figure(figsize=(8,5))
# Plot actual data points
plt.scatter(X_hours, y_scores, color='blue', s=100, label='Actual Data')
# Plot the model's line
# We predict on the original X to draw the red line
plt.plot(X_hours, reg.predict(X_hours), color='red', linewidth=2, label='Line of Best Fit')
# Plot the new prediction
plt.scatter(new_X, prediction, color='green', marker='*', s=300, zorder=5, label='Prediction (3.5 hrs)')
# Visual Trick: Draw vertical lines to show the "Error" (Residuals)
for i in range(len(X_hours)):
plt.plot([X_hours[i], X_hours[i]], [y_scores[i], reg.predict(X_hours)[i]], 'k--', alpha=0.3)
plt.title("Linear Regression: Minimizing the Error")
plt.xlabel("Hours Studied")
plt.ylabel("Test Score")
plt.legend()
plt.grid(True)
plt.show()
Prediction: If you study for 3.5 hours, you will get a score of: 84.30
The $R^2$ score computes the "explained variance"
$R^2 = 1-\frac{\sum_i (y_i -\hat y_i)^2}{\sum_i (y_i -\bar y)^2}$
where $\hat y_i$ is the prediction for point $x_i$ and $\bar y$ is the mean value of the target variable
r2_score = reg.score(X_hours, y_scores)
print(f"R² Score: {r2_score:.4f}")
R² Score: 0.9571
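To connect the score to the formula above, here is a small sketch (using reg, X_hours and y_scores from the cells above) that computes $R^2$ by hand:
y_hat = reg.predict(X_hours)                         # model predictions
ss_res = np.sum((y_scores - y_hat)**2)               # residual sum of squares
ss_tot = np.sum((y_scores - y_scores.mean())**2)     # total sum of squares around the mean
print(f"R² by hand: {1 - ss_res/ss_tot:.4f}")        # should match reg.score(X_hours, y_scores) ≈ 0.9571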
Preparing the data¶
To perform classification we first need to split the data into train and test sets.
from sklearn.datasets import load_iris
import sklearn.utils as utils
iris = load_iris()
print("sample of data")
print(iris.data[:5,:])
print("\nthe class labels vector")
print(iris.target)
print("\nthe names of the classes:",iris.target_names)
print(iris.feature_names)
sample of data [[5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2]] the class labels vector [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] the names of the classes: ['setosa' 'versicolor' 'virginica'] ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Randomly shuffle the data. This ensures that the examples are in random order before we split them into train and test sets.
X, y = utils.shuffle(iris.data, iris.target, random_state=1) #shuffle the data
print(X.shape)
print(y.shape)
print(y)
(150, 4) (150,) [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1 0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2 2 0 0 1 0 2 0 2 2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1 2 0]
Select a subset for training and a subset for testing
train_set_size = 100
X_train = X[:train_set_size] # selects first 100 rows (samples) for train set
y_train = y[:train_set_size]
X_test = X[train_set_size:] # selects from row 100 until the last one for test set
y_test = y[train_set_size:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(100, 4) (100,) (50, 4) (50,)
We can also use the train_test_split function of scikit-learn for splitting the data into train and test sets. In this case you do not need the random shuffling (but it does not hurt).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
Classification models¶
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
scikit-learn has classes and objects that implement the different classification techniques that we described in class.
Decision Trees¶
http://scikit-learn.org/stable/modules/tree.html
Train and apply a decision tree classifier. The default score computed by the classifier object is the accuracy. Decision trees can also give us class probabilities.
Background reading on trees: Discrete Mathematics: Elementary and Beyond (2003) by László Lovász, József Pelikán, and Katalin Vesztergombi, Chapter 8.
from sklearn import tree
dtree = tree.DecisionTreeClassifier() # defaults: max_depth=None; ties between equally good splits are broken randomly (set random_state=0 for reproducible results)
dtree = dtree.fit(X_train, y_train)
print("classifier accuracy:",dtree.score(X_test,y_test))
y_pred = dtree.predict(X_test)
y_prob = dtree.predict_proba(X_test)
print("\nclassifier predictions:",y_pred[:10])
print("ground truth labels :",y_test[:10])
print('\n')
print(y_prob[:10])
classifier accuracy: 0.9166666666666666 classifier predictions: [2 2 2 0 0 0 2 2 2 2] ground truth labels : [1 2 2 0 0 0 2 2 2 2] [[0. 0. 1.] [0. 0. 1.] [0. 0. 1.] [1. 0. 0.] [1. 0. 0.] [1. 0. 0.] [0. 0. 1.] [0. 0. 1.] [0. 0. 1.] [0. 0. 1.]]
Why does y_prob have only 0 and 1 entries?
Compute some more metrics
print("accuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
accuracy: 0.9166666666666666 Confusion matrix [[20 0 0] [ 0 15 4] [ 0 1 20]] Precision Score per class [1. 0.9375 0.83333333] Average Precision Score 0.921875 Recall Score per class [1. 0.78947368 0.95238095] Average Recall Score 0.9166666666666666 F1-score Score per class [1. 0.85714286 0.88888889] Average F1 Score 0.9158730158730158
Visualize the decision tree.
For this you will need to install the package python-graphviz
#conda install python-graphviz
import graphviz
print(iris.feature_names)
dot_data = tree.export_graphviz(dtree,out_file=None)
graph = graphviz.Source(dot_data)
graph
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
dtree2 = tree.DecisionTreeClassifier(max_depth=3) # Based on the diagram above, most of the classification is done within depth 3
dtree2 = dtree2.fit(X_train, y_train)
print(dtree2.score(X_test,y_test))
dot_data2 = tree.export_graphviz(dtree2,out_file=None)
graph2 = graphviz.Source(dot_data2)
graph2
0.9166666666666666
k-NN Classification¶
https://scikit-learn.org/stable/modules/neighbors.html#classification
While the Decision Tree tried to learn explicit rules (like "if petal length < 2.45..."), the k-Nearest Neighbors (k-NN) algorithm doesn't "learn" rules at all. Instead, it relies on similarity.
It is often called a "Lazy Learner" because it doesn't build a model during the training phase. It simply memorizes the training data.
The Intuition: The logic follows the old proverb:
"Tell me who your friends are, and I will tell you who you are."
How it works: When we ask the model to classify a new, unseen flower, it follows these steps:
- Measure Distance: It calculates the distance (usually Euclidean) between the new flower and every flower in the training set.
- Find Neighbors: It picks the $k$ closest flowers (where $k$ is a number we choose, e.g., 3).
- Majority Vote: It looks at the classes of those $k$ neighbors. If 2 are 'Versicolor' and 1 is 'Virginica', the model predicts 'Versicolor'.
Key Parameter: n_neighbors.
The most important setting here is $k$ (in code: n_neighbors).
- Small $k$ (e.g., 1): The model is very sensitive to noise (a single weird flower can flip the result).
- Large $k$ (e.g., 50): The decision becomes very smooth, but we might lose local details.
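To make these three steps concrete, here is a from-scratch sketch of a single k-NN prediction (illustrative only; below we use scikit-learn's KNeighborsClassifier, which also handles ties and does this efficiently):
from collections import Counter
def knn_predict_one(X_tr, y_tr, x_query, k=3):
    distances = np.linalg.norm(X_tr - x_query, axis=1)   # 1. Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                  # 2. indices of the k closest points
    votes = Counter(y_tr[nearest])                       # 3. majority vote among the neighbors' classes
    return votes.most_common(1)[0][0]
# Tiny made-up example: two clusters, classes 0 and 1
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict_one(X_demo, y_demo, np.array([1.1, 0.9])))   # predicts 0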
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
print("classifier score:", knn.score(X_test,y_test))
y_pred = knn.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.9333333333333333 accuracy: 0.9333333333333333 Confusion matrix [[20 0 0] [ 0 16 3] [ 0 1 20]] Precision Score per class [1. 0.94117647 0.86956522] Average Precision Score 0.9357203751065644 Recall Score per class [1. 0.84210526 0.95238095] Average Recall Score 0.9333333333333333 F1-score Score per class [1. 0.88888889 0.90909091] Average F1 Score 0.9329966329966329
Example image of how k-NN draws decision boundaries. In our case, with 4 features, such a plot cannot be made since it would have to be in 4D.
from sklearn import svm
#svm_clf = svm.LinearSVC()
#svm_clf = svm.SVC(kernel = 'poly') # kernel determines how the data is transformed to make it linearly separable
svm_clf = svm.SVC() # default kernel: Radial basis function (rbf)
svm_clf.fit(X_train,y_train)
print("classifier score:",svm_clf.score(X_test,y_test))
y_pred = svm_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.95 accuracy: 0.95 Confusion matrix [[20 0 0] [ 0 18 1] [ 0 2 19]] Precision Score per class [1. 0.9 0.95] Average Precision Score 0.9508333333333333 Recall Score per class [1. 0.94736842 0.9047619 ] Average Recall Score 0.95 F1-score Score per class [1. 0.92307692 0.92682927] Average F1 Score 0.9500312695434646
k-NN vs SVM Decision Boundaries
k-NN: decision boundaries are determined by local data density, based on majority voting among the k nearest neighbors.
SVM: decision boundaries are determined by maximizing the margin using support vectors. SVMs are particularly effective in high-dimensional or sparse datasets.
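For intuition, the sketch below (not part of the original notebook) plots the two kinds of boundaries side by side using only two of the four iris features, since the full 4D boundary cannot be drawn:
from matplotlib.colors import ListedColormap
X2_train = X_train[:, [0, 2]]   # use sepal length and petal length only, so the regions can be drawn in 2D
knn_2d = KNeighborsClassifier(n_neighbors=3).fit(X2_train, y_train)
svm_2d = svm.SVC().fit(X2_train, y_train)   # default rbf kernel
x_min, x_max = X2_train[:, 0].min() - 1, X2_train[:, 0].max() + 1
y_min, y_max = X2_train[:, 1].min() - 1, X2_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
grid = np.c_[xx.ravel(), yy.ravel()]
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, clf, title in zip(axes, [knn_2d, svm_2d], ['k-NN (k=3)', 'SVM (rbf kernel)']):
    Z = clf.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_light)
    ax.scatter(X2_train[:, 0], X2_train[:, 1], c=y_train, edgecolor='k', s=30)
    ax.set_title(title)
plt.show()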
Despite its name, Logistic Regression is used for Classification, not Regression.
Why is it different?
- The Shape: Unlike Decision Trees (which cut the data into rectangles) or k-NN (which creates complex, wiggly shapes), Logistic Regression is a Linear Classifier. It tries to draw a single straight line (or a flat plane) to separate the classes.
- The Output: It doesn't just give us a "Yes/No" answer. It gives us a Probability (a score between 0 and 1).
The Intuition¶
Imagine you are trying to separate red and blue marbles on a table:
- Decision Tree: You build a wall of LEGO bricks around the red marbles.
- k-NN: You look at clusters of marbles.
- Logistic Regression: You place a single straight stick on the table to separate them as best as you can.
Because it is "linear," it might struggle if the data is very complex (e.g., a circle inside a circle), but it is very fast and easy to interpret.
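To make the "probability" output concrete, here is a small sketch of the logistic (sigmoid) function, which squashes the linear score into a value between 0 and 1:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.linspace(-6, 6, 200)                       # linear scores (w·x + b)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color='gray', linestyle='--')    # the usual decision threshold P = 0.5
plt.xlabel("linear score")
plt.ylabel("P(class = 1)")
plt.title("The logistic (sigmoid) function")
plt.show()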
import sklearn.linear_model as linear_model
lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_train, y_train)
print("classifier score:",lr_clf.score(X_test,y_test))
y_pred = lr_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
classifier score: 0.95 accuracy: 0.95 Confusion matrix [[20 0 0] [ 0 17 2] [ 0 1 20]] Precision Score per class [1. 0.94444444 0.90909091] Average Precision Score 0.9505892255892257 Recall Score per class [1. 0.89473684 0.95238095] Average Recall Score 0.95 F1-score Score per class [1. 0.91891892 0.93023256] Average F1 Score 0.9499057196731615
For Logistic Regression we can also obtain the probabilities for the different classes
probs = lr_clf.predict_proba(X_test)
print("Class Probabilities (first 10):")
print (probs[:10])
print(y_test[:10])
print(probs.argmax(axis = 1)[:10])
print(probs.max(axis = 1)[:10])
Class Probabilities (first 10): [[3.58062452e-03 4.78837536e-01 5.17581840e-01] [1.14629520e-03 4.05835759e-01 5.93017945e-01] [3.92636719e-05 6.43273069e-02 9.35633429e-01] [9.58095468e-01 4.19028153e-02 1.71652335e-06] [9.43362912e-01 5.66341310e-02 2.95722406e-06] [9.82882494e-01 1.71170688e-02 4.36780706e-07] [5.55192517e-05 7.15153840e-02 9.28429097e-01] [1.95254291e-04 1.77447939e-01 8.22356806e-01] [1.39404689e-04 1.43508126e-01 8.56352469e-01] [1.28136030e-05 1.91639422e-02 9.80823244e-01]] [1 2 2 0 0 0 2 2 2 2] [2 2 2 0 0 0 2 2 2 2] [0.51758184 0.59301795 0.93563343 0.95809547 0.94336291 0.98288249 0.9284291 0.82235681 0.85635247 0.98082324]
And the coeffients of the logistic regression model
print(lr_clf.coef_)
[[-0.41634761 0.72750266 -2.19259055 -0.94546959] [ 0.0821038 -0.38691851 -0.0226097 -0.69668026] [ 0.33424381 -0.34058416 2.21520024 1.64214985]]
from matplotlib.colors import ListedColormap
# Select two features for visualization
feature_1_idx = 0 # e.g., sepal length
feature_2_idx = 2 # e.g., petal length
X_train_2d = X_train[:, [feature_1_idx, feature_2_idx]]
X_test_2d = X_test[:, [feature_1_idx, feature_2_idx]]
# Train Logistic Regression on the two selected features
lr_clf_2d = linear_model.LogisticRegression(solver='lbfgs', multi_class='ovr')
lr_clf_2d.fit(X_train_2d, y_train)
# Generate a grid for plotting decision boundaries
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Predict for each grid point
Z = lr_clf_2d.predict(grid_points)
Z = Z.reshape(xx.shape)
# Plot decision boundaries
plt.figure(figsize=(10, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_light)
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
plt.title("Logistic Regression Decision Boundaries - Feature 1 vs Feature 2")
plt.xlabel(f"Feature {feature_1_idx + 1} (e.g., {iris.feature_names[feature_1_idx]})")
plt.ylabel(f"Feature {feature_2_idx + 1} (e.g., {iris.feature_names[feature_2_idx]})")
plt.show()
/usr/local/lib/python3.12/dist-packages/sklearn/linear_model/_logistic.py:1256: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. Use OneVsRestClassifier(LogisticRegression(..)) instead. Leave it to its default value to avoid this warning. warnings.warn(
Multi-Layer Perceptron (MLP) is the simplest form of a Neural Network.
Logistic Regression used a single equation (a single "neuron") to separate the data with a straight line.
- MLP works by combining many of these "neurons" together in layers.
- Hidden Layers: The layers in the middle allow the model to learn complex, non-linear patterns.
Linear vs. Non-Linear:
- Logistic Regression: Can only draw straight lines.
- MLP: Can draw curves, circles, and complex squiggles. It bends the decision boundary to fit the data.
The Trade-off: While powerful, MLPs are "Black Boxes." Unlike the Decision Tree (where we could see the rules) or Logistic Regression (where we saw the coefficients), it is very hard to look inside a Neural Network and understand exactly why it made a specific decision.
from sklearn.neural_network import MLPClassifier
mlp_clf = MLPClassifier(solver='lbfgs')
print("MLP Default Architecture")
print(mlp_clf.hidden_layer_sizes)
mlp_clf.fit(X_train, y_train)
print("\nclassifier score:",mlp_clf.score(X_test,y_test))
y_pred = mlp_clf.predict(X_test)
print("\naccuracy:",metrics.accuracy_score(y_test,y_pred))
print("\nConfusion matrix")
print(metrics.confusion_matrix(y_test,y_pred))
print("\nPrecision Score per class")
print(metrics.precision_score(y_test,y_pred,average=None))
print("\nAverage Precision Score")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRecall Score per class")
print(metrics.recall_score(y_test,y_pred,average=None))
print("\nAverage Recall Score")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nF1-score Score per class")
print(metrics.f1_score(y_test,y_pred,average=None))
print("\nAverage F1 Score")
print(metrics.f1_score(y_test,y_pred,average='weighted'))
MLP Default Architecture (100,) classifier score: 0.9333333333333333 accuracy: 0.9333333333333333 Confusion matrix [[20 0 0] [ 0 17 2] [ 0 2 19]] Precision Score per class [1. 0.89473684 0.9047619 ] Average Precision Score 0.9333333333333333 Recall Score per class [1. 0.89473684 0.9047619 ] Average Recall Score 0.9333333333333333 F1-score Score per class [1. 0.89473684 0.9047619 ] Average F1 Score 0.9333333333333333
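The default architecture printed above is a single hidden layer with 100 neurons. As a sketch, a different (hypothetical) architecture can be specified via hidden_layer_sizes, reusing the X_train/y_train split from above:
mlp_two_layers = MLPClassifier(hidden_layer_sizes=(20, 10),   # two hidden layers with 20 and 10 neurons
                               activation='relu',             # non-linearity applied between layers
                               solver='lbfgs',
                               max_iter=1000,
                               random_state=0)
mlp_two_layers.fit(X_train, y_train)
print("two-hidden-layer MLP score:", mlp_two_layers.score(X_test, y_test))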
Let's see the decision boundaries that the MLP produces.
# Select two features for visualization
feature_1_idx = 0 # First feature
feature_2_idx = 2 # Third feature
X_train_2d = X_train[:, [feature_1_idx, feature_2_idx]]
X_test_2d = X_test[:, [feature_1_idx, feature_2_idx]]
# Train MLP on selected features
mlp_clf_2d = MLPClassifier(solver='lbfgs')
mlp_clf_2d.fit(X_train_2d, y_train)
# Generate grid for decision boundaries
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = mlp_clf_2d.predict(grid_points)
Z = Z.reshape(xx.shape)
# Plot decision boundaries
plt.figure(figsize=(10, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_light)
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
plt.title("MLP Decision Boundaries - Feature 1 vs Feature 3")
plt.xlabel(f"Feature {feature_1_idx + 1}")
plt.ylabel(f"Feature {feature_2_idx + 1}")
plt.show()
Computing Scores
p,r,f,s = metrics.precision_recall_fscore_support(y_test,y_pred) # y_pred is from the MLP
print(p)
print(r)
print(f)
print(s)
[1. 0.89473684 0.9047619 ] [1. 0.89473684 0.9047619 ] [1. 0.89473684 0.9047619 ] [20 19 21]
report = metrics.classification_report(y_test,y_pred) # Support tells us how many samples of each class are in y_test
print(report)
precision recall f1-score support
0 1.00 1.00 1.00 20
1 0.89 0.89 0.89 19
2 0.90 0.90 0.90 21
accuracy 0.93 60
macro avg 0.93 0.93 0.93 60
weighted avg 0.93 0.93 0.93 60
More on Evaluation: http://scikit-learn.org/stable/model_selection.html#model-selection
A more complex example: linear regression on the diabetes dataset
diabetes_X, diabetes_y = sk_data.load_diabetes(return_X_y=True) # X = 10 features (Age, BMI, Blood pressure, etc...)
# y = Quantitative measure for disease progression (Continuous Value)
# Shuffle the data
diabetes_X, diabetes_y = utils.shuffle(diabetes_X, diabetes_y, random_state=1)
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('\nMean squared error: %.2f'
% metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction. Obviously here we don't have a perfect fit like in the simple example before
# Computed over the *test* data
print('\nCoefficient of determination: %.2f'
% metrics.r2_score(diabetes_y_test, diabetes_y_pred))
print('\nPredictions:')
print(diabetes_y_pred)
print('\nTrue values:')
print(diabetes_y_test)
Coefficients: [ -8.80064784 -238.68584171 515.45675075 329.26068533 -878.18560171 530.03616927 126.04120869 213.28386451 734.45899793 67.32731226] Mean squared error: 2305.01 Coefficient of determination: 0.68 Predictions: [149.7546802 199.76667761 248.11135815 182.95023775 98.34758327 96.66442169 248.60103198 64.8494556 234.5250113 209.30960598 179.26527876 85.95464444 70.53999409 197.9358827 100.34679414 116.8171366 134.97124147 64.08460686 178.33480132 155.32208789] True values: [168. 221. 310. 283. 81. 94. 277. 72. 270. 268. 174. 96. 83. 222. 69. 153. 202. 43. 124. 276.]
Precision vs Recall
High Precision = Fewer false positives
High Recall = Fewer false negatives
In this binary example we get different Precision and Recall scores for different threshold values.
For a given threshold $t$, the model predicts the positive class $(\hat{y} = 1)$ for all samples where:
$$ P(\text{class}=1) \geq t $$
and predicts the negative class $(\hat{y} = 0)$ otherwise. The threshold is varied over the range of predicted probabilities, and for each threshold, the Precision and Recall are computed as:
$$ \text{Precision}(t) = \frac{\text{True Positives}(t)}{\text{True Positives}(t) + \text{False Positives}(t)} $$
$$ \text{Recall}(t) = \frac{\text{True Positives}(t)}{\text{True Positives}(t) + \text{False Negatives}(t)} $$
The resulting Precision-Recall Curve is plotted by connecting the Precision and Recall values computed for all thresholds.
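The next cell uses X_cancer_test and y_cancer_test, which are not defined in the cells shown above; a cell that loads a binary dataset and fits the logistic regression classifier appears to be assumed. A minimal sketch of that missing step, using the breast cancer dataset (an assumption):
from sklearn.datasets import load_breast_cancer
X_cancer, y_cancer = load_breast_cancer(return_X_y=True)   # binary labels: malignant vs. benign
X_cancer_train, X_cancer_test, y_cancer_train, y_cancer_test = train_test_split(
    X_cancer, y_cancer, test_size=0.4, random_state=0)
lr_clf = linear_model.LogisticRegression(solver='lbfgs', max_iter=5000)   # higher max_iter to help convergence
lr_clf.fit(X_cancer_train, y_cancer_train)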
y_cancer_pred = lr_clf.predict(X_cancer_test)
cancer_probs = lr_clf.predict_proba(X_cancer_test)
print("Class Probabilities (first 10):")
print (cancer_probs[:10])
y_cancer_scores = cancer_probs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(y_cancer_test, y_cancer_scores)
# Plot the Precision-Recall Curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='darkorange', lw=2)
plt.title("Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.grid(True)
plt.show()
fpr, tpr, ths = metrics.roc_curve(y_cancer_test,y_cancer_scores)
plt.plot(fpr,tpr,color='darkorange',lw=2)
print(metrics.roc_auc_score(y_cancer_test,y_cancer_scores))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
#plt.legend(loc="lower right")
plt.show()
(Xtoy,y_toy)=sk_data.make_classification(n_samples=10000) # Synthetic dataset for binary classification
Xttrain = Xtoy[:8000,:] # X = 20 features
Xttest = Xtoy[8000:,:]
yttrain = y_toy[:8000]
yttest = y_toy[8000:]
lr_clf.fit(Xttrain, yttrain)
#print(lr_clf.score(Xttest,yttest))
#y_tpred = lr_clf.predict(X_test)
tprobs = lr_clf.predict_proba(Xttest)
print (tprobs[:10])
y_tscores = tprobs[:,1]
precision, recall, thresholds = metrics.precision_recall_curve(yttest,y_tscores)
plt.plot(recall,precision)
k-fold cross validation¶
In k-fold cross validation the data is split into k equal parts; k-1 parts are used for training and the remaining part for testing. k models are trained, each time leaving out a different part for testing.
https://scikit-learn.org/stable/modules/cross_validation.html
There are two functions for performing k-fold cross-validation in the model_selection module: cross_val_score and cross_validate. The latter allows multiple metrics to be considered together.
import sklearn.model_selection as model_selection
scores = model_selection.cross_val_score(#lr_clf,
#svm_clf,
#knn,
dtree,
X,
y,
scoring='f1_weighted',
cv=5)
print (scores)
print (scores.mean())
[1. 0.93333333 0.96658312 0.96658312 0.93265993] 0.9598319029897977
scores = model_selection.cross_validate(#lr_clf,
#svm_clf,
#knn,
dtree,
X,
y,
scoring=['precision_weighted','recall_weighted'],
cv=3)
print (scores)
print (scores['test_precision_weighted'].mean(),scores['test_recall_weighted'].mean())
{'fit_time': array([0.00221992, 0.00099707, 0.00091362]), 'score_time': array([0.00517941, 0.003649 , 0.0036037 ]), 'test_precision_weighted': array([0.96 , 0.90834586, 0.88308772]), 'test_recall_weighted': array([0.96, 0.9 , 0.88])}
0.9171445279866332 0.9133333333333332
Creating a pipeline¶
If the same steps are often repeated, you can create a pipeline to perform them all at once:
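For example, a sketch of a pipeline that chains the TF-IDF vectorizer and the logistic regression classifier used in the next section (illustrative; the cells below perform these steps separately):
from sklearn.pipeline import Pipeline
import sklearn.feature_extraction.text as sk_text
import sklearn.linear_model as linear_model
text_clf = Pipeline([
    ('tfidf', sk_text.TfidfVectorizer(stop_words='english', min_df=4, max_df=0.8)),
    ('logreg', linear_model.LogisticRegression(solver='lbfgs')),
])
# text_clf.fit(X_news_train_data, y_news_train)    # vectorize and train in one call
# text_clf.score(X_news_test_data, y_news_test)    # vectorize and evaluate in one call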
Text classification Example¶
We will use the 20 newsgroups dataset for a text classification example.
from sklearn.datasets import fetch_20newsgroups
categories = ['sci.space','rec.sport.baseball']
#categories = ['alt.atheism', 'rec.sport.baseball']
news_train = sk_data.fetch_20newsgroups(subset='train',
remove=('headers', 'footers', 'quotes'),
categories=categories)
print (len(news_train.target))
X_news_train_data = news_train.data
y_news_train = news_train.target
news_test = sk_data.fetch_20newsgroups(subset='test',
remove=('headers', 'footers', 'quotes'),
categories=categories)
print (len(news_test.target))
X_news_test_data = news_test.data
y_news_test = news_test.target
1190 791
X_news_train_data[0]
"I've been saying this for quite some time, but being absent from the\nnet for a while I figured I'd stick my neck out a bit...\n\nThe Royals will set the record for fewest runs scored by an AL\nteam since the inception of the DH rule. (p.s. any ideas what this is?)\n\nThey will fall easily short of 600 runs, that's for damn sure. I can't\nbelieve these media fools picking them to win the division (like our\nTom Gage of the Detroit News claiming Herk Robinson is some kind of\ngenius for the trades/aquisitions he's made)\n\nc-ya\n\nSean\n\n"
y_news_train[0]
np.int64(0)
from sklearn.linear_model import LinearRegression
import sklearn.linear_model as linear_model
import sklearn.feature_extraction.text as sk_text
vectorizer = sk_text.TfidfVectorizer(stop_words='english',
#max_features = 1000,
min_df=4, max_df=0.8)
X_news_train = vectorizer.fit_transform(X_news_train_data)
lr_clf = linear_model.LogisticRegression(solver='lbfgs')
lr_clf.fit(X_news_train, y_news_train)
LogisticRegression()
X_news_test = vectorizer.transform(X_news_test_data)
print("classifier score:",lr_clf.score(X_news_test,y_news_test))
classifier score: 0.9216182048040455
Word embeddings and Text classification¶
We will now see how we can train and use word embeddings.
The Gensim library¶
The Gensim library has several NLP models.
You can use existing modules to train a word2vec model: https://radimrehurek.com/gensim/models/word2vec.html
!pip install gensim
import gensim
import gensim.models
from gensim.models import Word2Vec
from gensim import utils
Collecting gensim Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB) Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.12/dist-packages (from gensim) (2.0.2) Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from gensim) (1.16.3) Requirement already satisfied: smart_open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim) (7.5.0) Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open>=1.8.1->gensim) (2.0.1) Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 64.4 MB/s eta 0:00:00 Installing collected packages: gensim Successfully installed gensim-4.4.0
The utils module does the preprocessing of the text: it lower-cases and tokenizes the text and removes punctuation. The end result is a list of tokens, which is what we need as input for training or using the different models.
train_gsim = [utils.simple_preprocess(x) for x in X_news_train_data]
train_data_labels = [(x,y) for (x,y) in zip(train_gsim, y_news_train) if len(x) > 0] # Removes documents that become empty after preprocessing (e.g., if all words in a document are stopwords or invalid).
X_news_train_gsim = [x for (x,y) in train_data_labels]
y_news_train_gsim = [y for (x,y) in train_data_labels]
test_gsim = [utils.simple_preprocess(x) for x in X_news_test_data]
test_data_labels = [(x,y) for (x,y) in zip(test_gsim, y_news_test) if len(x) > 0]
X_news_test_gsim = [x for (x,y) in test_data_labels]
y_news_test_gsim = [y for (x,y) in test_data_labels]
X_news_train_gsim[0]
['ve', 'been', 'saying', 'this', 'for', 'quite', 'some', 'time', 'but', 'being', 'absent', 'from', 'the', 'net', 'for', 'while', 'figured', 'stick', 'my', 'neck', 'out', 'bit', 'the', 'royals', 'will', 'set', 'the', 'record', 'for', 'fewest', 'runs', 'scored', 'by', 'an', 'al', 'team', 'since', 'the', 'inception', 'of', 'the', 'dh', 'rule', 'any', 'ideas', 'what', 'this', 'is', 'they', 'will', 'fall', 'easily', 'short', 'of', 'runs', 'that', 'for', 'damn', 'sure', 'can', 'believe', 'these', 'media', 'fools', 'picking', 'them', 'to', 'win', 'the', 'division', 'like', 'our', 'tom', 'gage', 'of', 'the', 'detroit', 'news', 'claiming', 'herk', 'robinson', 'is', 'some', 'kind', 'of', 'genius', 'for', 'the', 'trades', 'aquisitions', 'he', 'made', 'ya', 'sean']
Train a CBOW (Continuous Bag Of Words) embedding on the training data corpus
Machine learning models cannot understand text; they only understand numbers. Word2Vec is a technique to turn words into lists of numbers (vectors) so that words with similar meanings have similar numbers.
embedding_size = 50 # the "resolution" of the word representation: each word ("apple", "king", etc.) is represented as a vector of 50 numbers
cbow_model = gensim.models.Word2Vec(X_news_train_gsim, min_count = 1, vector_size = embedding_size, window = 10) # window is the "attention span": look at 10 words to the left and 10 words to the right
We now have a representation of the words as 50-dimensional real vectors
cbow_model.wv['pitch']
array([-0.15768205, -0.15172504, -0.12478593, 0.12387501, -0.27375695,
-0.32890198, 0.29313874, 0.8522625 , -0.48969692, -0.08721875,
0.1272012 , -0.6764343 , 0.14053679, 0.5181793 , -0.2863375 ,
0.18444148, 0.47363278, 0.05032672, -0.6023291 , -0.57959276,
0.29312116, 0.30031574, 0.70870405, -0.30768996, 0.6378389 ,
0.13288465, -0.27912104, 0.05217481, -0.64838284, 0.2454371 ,
0.04064707, 0.06659085, 0.04146409, -0.06713676, -0.33795828,
0.19997104, 0.35694647, -0.21540166, 0.3260302 , -0.06656712,
0.33125332, 0.02459017, 0.08362663, -0.02666873, 0.672934 ,
-0.19353233, -0.10657518, -0.18800081, 0.5240707 , 0.2803613 ],
dtype=float32)
We can use this to find similar words
cbow_model.wv.most_similar('pitch')
[('great', 0.9990735650062561),
('against', 0.9990703463554382),
('wasn', 0.998989462852478),
('down', 0.9989439845085144),
('every', 0.9989256262779236),
('average', 0.9988952875137329),
('very', 0.9988757967948914),
('am', 0.9988664984703064),
('again', 0.9988647699356079),
('run', 0.9988623857498169)]
Use the additivity property: Chicago + Cubs - Boston = Sox
Teams: Chicago Cubs and Boston Red Sox
cbow_model.wv.most_similar(positive=['chicago','cubs'],negative=['boston'])
[('suck', 0.9981583952903748),
('detroit', 0.9950345158576965),
('road', 0.9944502711296082),
('milwaukee', 0.9943985342979431),
('astros', 0.9943371415138245),
('st', 0.9942942261695862),
('hernandez', 0.994268536567688),
('west', 0.9942171573638916),
('won', 0.9941810965538025),
('slg', 0.9940813779830933)]
Use the word embeddings to obtain a vector representation of each document by taking the average of the embeddings of its words. Transform the train and test data.
Average embedding vector = "summary" vector for the document
np.array([cbow_model.wv[x] for x in X_news_train_gsim[0]]).mean(axis = 0)
array([-0.4029563 , -0.33122525, -0.30095032, 0.21110743, -0.5092193 ,
-0.7030677 , 0.5381219 , 1.8370438 , -1.0126023 , -0.14332198,
0.25449035, -1.3880475 , 0.29733127, 1.1351492 , -0.5628445 ,
0.41576612, 1.0588037 , 0.28731224, -1.2441657 , -1.1561385 ,
0.6357985 , 0.651013 , 1.4021233 , -0.6070939 , 1.3530061 ,
0.268321 , -0.6106677 , 0.17058086, -1.3819162 , 0.4244899 ,
0.07523321, 0.05736452, 0.13832642, -0.12398918, -0.7068729 ,
0.43235666, 0.7909126 , -0.4662388 , 0.73772854, -0.12105183,
0.6979285 , 0.06914084, 0.19897568, -0.16035889, 1.4181232 ,
-0.32762772, -0.21987288, -0.3918733 , 1.0586956 , 0.5894503 ],
dtype=float32)
X_news_train_cbow = [np.array([cbow_model.wv[x] for x in y]).mean(axis = 0) for y in X_news_train_gsim]
X_news_test_cbow = [np.array([cbow_model.wv[x] for x in y if x in cbow_model.wv]).mean(axis = 0) for y in X_news_test_gsim]
Train a classifier on the embeddings
lr_clf.fit(np.array(X_news_train_cbow), np.array(y_news_train_gsim))
LogisticRegression()
lr_clf.score(np.array(X_news_test_cbow),y_news_test_gsim)
0.7889182058047494
Train a SkipGram embedding on the training data corpus
CBOW vs. Skip-gram
In the code above, we trained a Word2Vec model. But how does it actually learn?
Word2Vec is not just one algorithm; it is a family of two distinct architectures that learn word meanings in opposite ways.
- Continuous Bag of Words (CBOW)
  - The Goal: Predict the Target Word based on the Context (surrounding words).
  - The Analogy: A "Fill in the blank" game.
  - Example:
    - Context: ["The", "cat", "sits", "on"]
    - Target: ?
    - Prediction: "mat"
  - Why use it? It is generally faster to train and produces slightly better accuracy for frequent words.
  - Note: This is the default method in gensim (which we used above).
- Skip-gram
  - The Goal: Predict the Context (surrounding words) based on the Target Word.
  - The Analogy: The reverse of CBOW. You are given one word and have to guess the "story" around it.
  - Example:
    - Target: "mat"
    - Prediction: ["The", "cat", "sits", "on"]
  - Why use it? It is slower to train but works much better for small datasets and rare words.
embedding_size = 50
skipgram_model = gensim.models.Word2Vec(X_news_train_gsim, min_count = 1,vector_size = embedding_size, window = 10, sg = 1)
Transform the train and test data
X_news_train_skipgram = [np.array([skipgram_model.wv[x] for x in y]).mean(axis = 0) for y in X_news_train_gsim]
X_news_test_skipgram = [np.array([skipgram_model.wv[x] for x in y if x in skipgram_model.wv]).mean(axis = 0) for y in X_news_test_gsim]
Train a classifier on the embeddings
lr_clf.fit(np.array(X_news_train_skipgram), np.array(y_news_train_gsim))
lr_clf.score(np.array(X_news_test_skipgram),y_news_test_gsim)
0.9261213720316622
You can also download Google's word2vec model, trained over millions of documents (i.e., use a pre-trained model).
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
[==================================================] 100.0% 1662.8/1662.8MB downloaded /root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
#path = 'C:\\Users\\tsapa/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz'
g_model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)
print(len(g_model['hello']))
print(g_model['hello'])
300 [-0.05419922 0.01708984 -0.00527954 0.33203125 -0.25 -0.01397705 -0.15039062 -0.265625 0.01647949 0.3828125 -0.03295898 -0.09716797 -0.16308594 -0.04443359 0.00946045 0.18457031 0.03637695 0.16601562 0.36328125 -0.25585938 0.375 0.171875 0.21386719 -0.19921875 0.13085938 -0.07275391 -0.02819824 0.11621094 0.15332031 0.09082031 0.06787109 -0.0300293 -0.16894531 -0.20800781 -0.03710938 -0.22753906 0.26367188 0.012146 0.18359375 0.31054688 -0.10791016 -0.19140625 0.21582031 0.13183594 -0.03515625 0.18554688 -0.30859375 0.04785156 -0.10986328 0.14355469 -0.43554688 -0.0378418 0.10839844 0.140625 -0.10595703 0.26171875 -0.17089844 0.39453125 0.12597656 -0.27734375 -0.28125 0.14746094 -0.20996094 0.02355957 0.18457031 0.00445557 -0.27929688 -0.03637695 -0.29296875 0.19628906 0.20703125 0.2890625 -0.20507812 0.06787109 -0.43164062 -0.10986328 -0.2578125 -0.02331543 0.11328125 0.23144531 -0.04418945 0.10839844 -0.2890625 -0.09521484 -0.10351562 -0.0324707 0.07763672 -0.13378906 0.22949219 0.06298828 0.08349609 0.02929688 -0.11474609 0.00534058 -0.12988281 0.02514648 0.08789062 0.24511719 -0.11474609 -0.296875 -0.59375 -0.29492188 -0.13378906 0.27734375 -0.04174805 0.11621094 0.28320312 0.00241089 0.13867188 -0.00683594 -0.30078125 0.16210938 0.01171875 -0.13867188 0.48828125 0.02880859 0.02416992 0.04736328 0.05859375 -0.23828125 0.02758789 0.05981445 -0.03857422 0.06933594 0.14941406 -0.10888672 -0.07324219 0.08789062 0.27148438 0.06591797 -0.37890625 -0.26171875 -0.13183594 0.09570312 -0.3125 0.10205078 0.03063965 0.23632812 0.00582886 0.27734375 0.20507812 -0.17871094 -0.31445312 -0.01586914 0.13964844 0.13574219 0.0390625 -0.29296875 0.234375 -0.33984375 -0.11816406 0.10644531 -0.18457031 -0.02099609 0.02563477 0.25390625 0.07275391 0.13574219 -0.00138092 -0.2578125 -0.2890625 0.10107422 0.19238281 -0.04882812 0.27929688 -0.3359375 -0.07373047 0.01879883 -0.10986328 -0.04614258 0.15722656 0.06689453 -0.03417969 0.16308594 0.08642578 0.44726562 0.02026367 -0.01977539 0.07958984 0.17773438 -0.04370117 -0.00952148 0.16503906 0.17285156 0.23144531 -0.04272461 0.02355957 0.18359375 -0.41601562 -0.01745605 0.16796875 0.04736328 0.14257812 0.08496094 0.33984375 0.1484375 -0.34375 -0.14160156 -0.06835938 -0.14648438 -0.02844238 0.07421875 -0.07666016 0.12695312 0.05859375 -0.07568359 -0.03344727 0.23632812 -0.16308594 0.16503906 0.1484375 -0.2421875 -0.3515625 -0.30664062 0.00491333 0.17675781 0.46289062 0.14257812 -0.25 -0.25976562 0.04370117 0.34960938 0.05957031 0.07617188 -0.02868652 -0.09667969 -0.01281738 0.05859375 -0.22949219 -0.1953125 -0.12207031 0.20117188 -0.42382812 0.06005859 0.50390625 0.20898438 0.11230469 -0.06054688 0.33203125 0.07421875 -0.05786133 0.11083984 -0.06494141 0.05639648 0.01757812 0.08398438 0.13769531 0.2578125 0.16796875 -0.16894531 0.01794434 0.16015625 0.26171875 0.31640625 -0.24804688 0.05371094 -0.0859375 0.17089844 -0.39453125 -0.00156403 -0.07324219 -0.04614258 -0.16210938 -0.15722656 0.21289062 -0.15820312 0.04394531 0.28515625 0.01196289 -0.26953125 -0.04370117 0.37109375 0.04663086 -0.19726562 0.3046875 -0.36523438 -0.23632812 0.08056641 -0.04248047 -0.14648438 -0.06225586 -0.0534668 -0.05664062 0.18945312 0.37109375 -0.22070312 0.04638672 0.02612305 -0.11474609 0.265625 -0.02453613 0.11083984 -0.02514648 -0.12060547 0.05297852 0.07128906 0.00063705 -0.36523438 -0.13769531 -0.12890625]
The commands below can be slow and may run out of memory on machines with limited RAM.
g_model.most_similar('pitch')
[('pitches', 0.7401652932167053),
('backdoor_slider', 0.5972762107849121),
('fastball', 0.5737808346748352),
('curveball', 0.5543882846832275),
('hanging_slider', 0.5523896217346191),
('hittable_pitch', 0.5503243207931519),
('leadoff_batter_Cliff_Floyd', 0.5496907830238342),
('offspeed_pitch', 0.547758936882019),
('atbat', 0.5441111326217651),
('yaw_converters_SCADA', 0.5410848259925842)]
g_model.most_similar(positive=['woman','king'],negative=['man'])
[('queen', 0.7118193507194519),
('monarch', 0.6189674139022827),
('princess', 0.5902431011199951),
('crown_prince', 0.5499460697174072),
('prince', 0.5377321839332581),
('kings', 0.5236844420433044),
('Queen_Consort', 0.5235945582389832),
('queens', 0.518113374710083),
('sultan', 0.5098593235015869),
('monarchy', 0.5087411403656006)]
Transform the train and test data
train_gmodel = [[g_model[x] for x in y if x in g_model] for y in X_news_train_gsim]
train_data_labels = [(x,y) for (x,y) in zip(train_gmodel, y_news_train) if len(x) > 0]
X_news_train_gm = [np.array(x) for (x,y) in train_data_labels]
y_news_train_gm = [y for (x,y) in train_data_labels]
X_news_train_gmodel = [x.mean(axis = 0) for x in X_news_train_gm]
test_gmodel = [[g_model[x] for x in y if x in g_model] for y in X_news_test_gsim]
test_data_labels = [(x,y) for (x,y) in zip(test_gmodel, y_news_test) if len(x) > 0]
X_news_test_gm = [np.array(x) for (x,y) in test_data_labels]
y_news_test_gm = [y for (x,y) in test_data_labels]
X_news_test_gmodel = [x.mean(axis = 0) for x in X_news_test_gm]
Train a classifier on the embeddings
lr_clf.fit(X_news_train_gmodel, np.array(y_news_train_gm))
LogisticRegression()
lr_clf.score(np.array(X_news_test_gmodel),y_news_test_gm)
0.4947229551451187
Why did the much more general word2vec model from Google perform much worse than our simple CBOW and Skip-gram models?
The GloVe Embedding¶
This is another embedding, produced by researchers at Stanford and available online. It is faster to use and performs better in terms of similarity and analogies, but not for classification.
import gensim.downloader as api
print(api.load('glove-wiki-gigaword-50', return_path=True))
[==================================================] 100.0% 66.0/66.0MB downloaded /root/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
glove_model = api.load("glove-wiki-gigaword-50")
glove_model.most_similar('pitch')
[('pitches', 0.8380102515220642),
('pitching', 0.775322675704956),
('ball', 0.7705615162849426),
('infield', 0.769540548324585),
('inning', 0.7672455906867981),
('game', 0.751035213470459),
('hitters', 0.7493574619293213),
('outfield', 0.7477315068244934),
('hitter', 0.7467021346092224),
('pitched', 0.7417561411857605)]
glove_model.most_similar(positive=['chicago','rangers'],negative=['texas'])
[('blackhawks', 0.798629641532898),
('sabres', 0.7900724411010742),
('canucks', 0.7876150608062744),
('canadiens', 0.7621992826461792),
('leafs', 0.7570874094963074),
('bruins', 0.7503584027290344),
('oilers', 0.7478305101394653),
('dodgers', 0.7437342405319214),
('phillies', 0.7399099469184875),
('mets', 0.7378402352333069)]
glove_model.most_similar(positive=['woman','king'],negative=['man'])
[('queen', 0.8523604273796082),
('throne', 0.7664334177970886),
('prince', 0.7592144012451172),
('daughter', 0.7473883628845215),
('elizabeth', 0.7460219860076904),
('princess', 0.7424570322036743),
('kingdom', 0.7337412238121033),
('monarch', 0.721449077129364),
('eldest', 0.7184861898422241),
('widow', 0.7099431157112122)]
train_glove = [[glove_model[x] for x in y if x in glove_model] for y in X_news_train_gsim]
train_data_labels = [(x,y) for (x,y) in zip(train_glove, y_news_train) if len(x) > 0]
X_news_train_glove = [np.array(x).mean(axis=0) for (x,y) in train_data_labels]
y_news_train_glove = [y for (x,y) in train_data_labels]
test_glove = [[glove_model[x] for x in y if x in glove_model] for y in X_news_test_gsim]
test_data_labels = [(x,y) for (x,y) in zip(test_glove, y_news_test) if len(x) > 0]
X_news_test_glove = [np.array(x).mean(axis=0) for (x,y) in test_data_labels]
y_news_test_glove = [y for (x,y) in test_data_labels]
lr_clf.fit(X_news_train_glove, np.array(y_news_train_glove))
LogisticRegression()
lr_clf.score(np.array(X_news_test_glove),y_news_test_glove)
0.5118733509234829
The Doc2Vec model¶
Similar to Word2Vec, Doc2Vec produces embeddings for full documents: the whole block of text is turned into a single vector. It performs much better for classification, indicating that simple averaging of the word embeddings is not a good option.
train_corpus = [gensim.models.doc2vec.TaggedDocument(X_news_train_gsim[i], [i]) for i in range(len(X_news_train_gsim))]
d2v_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
d2v_model.build_vocab(train_corpus)
d2v_model.train(train_corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
X_news_train_d2v = [d2v_model.infer_vector(x) for x in X_news_train_gsim]
X_news_test_d2v = [d2v_model.infer_vector(x) for x in X_news_test_gsim]
lr_clf.fit(X_news_train_d2v, np.array(y_news_train_gsim))
lr_clf.score(X_news_test_d2v,y_news_test_gsim)
0.945910290237467