Introduction to Sci-Kit Learn and Clustering¶

In this tutorial we will introduce the Sci-Kit Learn library: https://scikit-learn.org/stable/

This is a very important library with a huge toolkit for data processing, unsupervised and supervised learning. It is one of the core tools for data science.

We will see some of the capabilities of this toolkit and focus on clustering.

In [ ]:
import numpy as np
import scipy as sp
import scipy.sparse as sp_sparse
import scipy.spatial.distance as sp_dist

import matplotlib.pyplot as plt

import sklearn as sk
import sklearn.datasets as sk_data
import sklearn.metrics as metrics
from sklearn import preprocessing
import sklearn.cluster as sk_cluster
import sklearn.feature_extraction.text as sk_text
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import scipy.cluster.hierarchy as hr

import time
import seaborn as sns

%matplotlib inline

Computing distances¶

For computing distances there are functions in SciPy's spatial.distance module:

http://docs.scipy.org/doc/scipy-0.15.1/reference/spatial.distance.html#module-scipy.spatial.distance

but also in scikit-learn's metrics module:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

Most of these work with sparse data as well.

Compute distances using scipy¶

Computing distances between vectors

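For reference, these are the standard definitions used by scipy.spatial.distance for the four distances computed in the next cell (for $n$-dimensional vectors $x$ and $y$; for Jaccard, the 0/1 vectors are read as sets of the positions holding a 1):

$$d_{\text{euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{cosine}}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

$$d_{\text{jaccard}}(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$$

$$d_{\text{hamming}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]$$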

In [ ]:
import scipy.spatial.distance as sp_dist

x = np.random.randint(2, size = 5)
y = np.random.randint(2, size = 5)
print (x)
print (y)
print (sp_dist.cosine(x,y))
print (sp_dist.euclidean(x,y))
print (sp_dist.jaccard(x,y))
print (sp_dist.hamming(x,y))
# When computing jaccard similarity of 0/1 matrices,
# 1 means that the element corresponding to the column is in the set,
# 0 that the element is not in the set
[0 0 0 0 1]
[0 0 1 0 0]
1.0
1.4142135623730951
1.0
0.4

Compute all pairwise distances within a table using SciPy's pdist.

When given a matrix, it computes all pairwise distances between its rows. The output is a condensed vector with N(N-1)/2 entries (N is the number of rows). We can transform it into an NxN distance matrix using squareform.

In [ ]:
A = np.random.randint(2, size = (5,3))

# computes the matrix of all pairwise distances of rows
# returns a vector with N(N-1)/2 entries (N number of rows)
D = sp_dist.pdist(A, 'jaccard')
print (A)
print('\n All row distances')
print (D)
print('\n All row distances in Symmetric Matrix form')
print(sp_dist.squareform(D))
[[0 1 0]
 [1 0 1]
 [0 1 0]
 [1 0 1]
 [1 1 1]]

 All row distances
[1.         0.         1.         0.66666667 1.         0.
 0.33333333 1.         0.66666667 0.33333333]

 All row distances in Symmetric Matrix form
[[0.         1.         0.         1.         0.66666667]
 [1.         0.         1.         0.         0.33333333]
 [0.         1.         0.         1.         0.66666667]
 [1.         0.         1.         0.         0.33333333]
 [0.66666667 0.33333333 0.66666667 0.33333333 0.        ]]

Similarly for the single vectors x, y from before, reshaped into 1x5 row vectors (cdist expects 2D arrays):

In [ ]:
x = x.reshape(1,5)
y = y.reshape(1,5)
sp_dist.cdist(x,y,'cosine')
Out[ ]:
array([[1.]])

We can compute all pairwise distances between the rows of two tables A and B using the cdist function of SciPy. If A has N rows and B has M rows, the result is an NxM matrix with all the distances.

In [ ]:
B = np.random.randint(2, size = (3,3))
print(A)
print('\n')
print(B)
D = sp_dist.cdist(A,B,'jaccard')
print('\n')
print(D)
[[0 1 0]
 [1 0 1]
 [0 1 0]
 [1 0 1]
 [1 1 1]]


[[1 0 0]
 [1 1 0]
 [0 1 0]]


[[1.         0.5        0.        ]
 [0.5        0.66666667 1.        ]
 [1.         0.5        0.        ]
 [0.5        0.66666667 1.        ]
 [0.66666667 0.33333333 0.66666667]]

Compute distances using sklearn¶

In [ ]:
import sklearn.metrics as metrics

#computes the matrix of all pairwise distances of rows
# returns a NxN matrix (N number of rows)
print(A)
D2 = metrics.pairwise_distances(A,metric = 'jaccard')
print('\n The matrix of row distances')
print(D2)
[[0 1 0]
 [1 0 1]
 [0 1 0]
 [1 0 1]
 [1 1 1]]

 The matrix of row distances
[[0.         1.         0.         1.         0.66666667]
 [1.         0.         1.         0.         0.33333333]
 [0.         1.         0.         1.         0.66666667]
 [1.         0.         1.         0.         0.33333333]
 [0.66666667 0.33333333 0.66666667 0.33333333 0.        ]]
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/pairwise.py:2466: DataConversionWarning: Data was converted to boolean for metric jaccard
  warnings.warn(msg, DataConversionWarning)

Some similarity and distance metrics are directly computed in the pairwise library:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise

In [ ]:
C = metrics.pairwise.cosine_similarity(A)
print(A)
print('\n Cosine Similarity')
print(C)
[[0 1 0]
 [1 0 1]
 [0 1 0]
 [1 0 1]
 [1 1 1]]

 Cosine Similarity
[[1.         0.         1.         0.         0.57735027]
 [0.         1.         0.         1.         0.81649658]
 [1.         0.         1.         0.         0.57735027]
 [0.         1.         0.         1.         0.81649658]
 [0.57735027 0.81649658 0.57735027 0.81649658 1.        ]]

Compute distances between the rows of two tables

In [ ]:
print(A)
print('\n')
print(B)

#computes the matrix of all pairwise distances of rows of A with rows of B
# returns an NxM matrix (N rows of A, M rows of B)
D3 = metrics.pairwise_distances(A,B,metric = 'jaccard')
print('\n The matrix of distances between the rows of A and B')
print(D3)
[[0 1 0]
 [1 0 1]
 [0 1 0]
 [1 0 1]
 [1 1 1]]


[[1 0 0]
 [1 1 0]
 [0 1 0]]

 The matrix of distances between the rows of A and B
[[1.         0.5        0.        ]
 [0.5        0.66666667 1.        ]
 [1.         0.5        0.        ]
 [0.5        0.66666667 1.        ]
 [0.66666667 0.33333333 0.66666667]]
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/pairwise.py:2466: DataConversionWarning: Data was converted to boolean for metric jaccard
  warnings.warn(msg, DataConversionWarning)

We can apply all of the above to sparse matrices.

In [ ]:
d = np.array([[0, 0, 12],    # columns 0,1 => (i,j) positions
              [0, 1, 1],     # column 2    => Value
              [0, 5, 34],
              [1, 3, 12],
              [1, 2, 6],
              [2, 0, 23],
              [3, 4, 14],
              ])
s = sp_sparse.csr_matrix((d[:,2],(d[:,0],d[:,1])), shape=(4,6))
D4 = metrics.pairwise.pairwise_distances(s,metric = 'euclidean')
print('Sparse Matrix')
print(s.toarray())
print('\n Matrix of row distances')
print(D4)
Sparse Matrix
[[12  1  0  0  0 34]
 [ 0  0  6 12  0  0]
 [23  0  0  0  0  0]
 [ 0  0  0  0 14  0]]

 Matrix of row distances
[[ 0.         38.48376281 35.74912586 38.69108424]
 [38.48376281  0.         26.62705391 19.39071943]
 [35.74912586 26.62705391  0.         26.92582404]
 [38.69108424 19.39071943 26.92582404  0.        ]]
In [ ]:
v = np.random.randint(2, size = 6)
v = v.reshape(1,6)
print(v)
print('\n')
print(s.toarray())
print('\n Matrix of row distances')
metrics.pairwise.pairwise_distances(v,s,metric = 'euclidean')
[[1 1 1 1 1 0]]


[[12  1  0  0  0 34]
 [ 0  0  6 12  0  0]
 [23  0  0  0  0  0]
 [ 0  0  0  0 14  0]]

 Matrix of row distances
Out[ ]:
array([[35.77708764, 12.20655562, 22.09072203, 13.15294644]])

Clustering¶

You can read more about clustering in SciKit here:

http://scikit-learn.org/stable/modules/clustering.html

Generate data from Gaussian distributions.

More on data generation here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

In [ ]:
centers = [[1,1], [-1, -1], [1, -1]]
X, true_labels = sk_data.make_blobs(n_samples=500, centers=centers, n_features=2,
                                    center_box=(-10.0, 10.0),random_state=0)
plt.scatter(X[:,0], X[:,1])
Out[ ]:
<matplotlib.collections.PathCollection at 0x79663808c650>
No description has been provided for this image
In [ ]:
print(type(X))
print('\n')
print(true_labels)
print('\n')
print(len(true_labels[true_labels==0]),len(true_labels[true_labels==1]),len(true_labels[true_labels==2]))
<class 'numpy.ndarray'>


[2 0 1 1 0 1 1 2 2 2 1 0 0 1 1 0 0 2 2 0 0 0 2 2 0 1 0 2 0 2 0 0 0 2 1 1 0
 0 2 2 0 2 1 0 2 2 0 0 1 2 2 0 0 1 0 2 1 1 1 2 2 1 0 0 2 1 1 2 2 2 2 1 2 0
 0 0 2 2 0 0 0 0 0 2 1 2 2 0 0 2 2 1 0 2 1 0 1 2 1 1 2 2 1 2 1 0 1 1 0 2 2
 2 0 2 0 2 2 0 1 1 0 1 2 1 1 2 2 1 2 0 0 0 1 2 2 0 2 0 2 1 2 1 0 0 1 0 2 1
 0 1 2 2 2 0 1 0 1 0 2 2 0 1 0 0 1 2 1 1 1 2 1 2 1 0 1 0 2 2 0 2 1 0 2 2 0
 1 2 0 0 2 1 2 2 2 0 2 2 1 2 1 0 2 1 2 1 2 0 0 0 1 2 0 0 2 1 1 2 2 0 1 2 0
 0 1 1 1 0 2 2 2 1 2 1 1 1 0 1 2 0 2 1 2 0 2 1 1 2 2 1 2 0 0 1 0 1 0 2 1 2
 1 1 1 1 0 0 1 0 1 1 1 1 2 1 0 0 0 0 2 1 2 2 0 0 1 0 2 1 0 2 2 1 0 0 1 0 2
 1 2 1 0 0 0 1 2 0 0 2 2 1 0 0 1 1 0 2 1 0 1 2 1 1 0 2 0 2 1 2 1 0 0 0 1 0
 2 2 1 0 2 2 2 0 1 1 1 0 1 0 0 0 2 0 2 0 2 2 0 2 2 2 2 1 1 1 2 2 2 2 0 0 0
 1 2 0 1 0 1 0 1 2 2 0 2 1 0 1 2 2 0 1 2 1 2 0 0 0 1 2 0 0 1 2 2 0 2 1 0 2
 0 1 0 2 0 0 1 0 0 0 0 1 0 2 2 2 1 0 1 2 1 2 1 0 1 1 1 1 1 0 2 1 2 0 0 1 2
 0 2 1 0 0 1 1 2 1 2 1 1 1 1 2 0 1 1 0 1 2 0 1 1 0 2 0 0 1 1 1 0 1 0 1 1 1
 1 0 1 1 2 2 2 1 1 1 0 1 2 2 2 1 0 0 2]


167 167 166
In [ ]:
plt.scatter(X[true_labels==1,0], X[true_labels==1,1],c = 'r')
plt.scatter(X[true_labels==2,0], X[true_labels==2,1],c = 'b')
plt.scatter(X[true_labels==0,0], X[true_labels==0,1],c = 'g')
Out[ ]:
<matplotlib.collections.PathCollection at 0x796635c50a10>
No description has been provided for this image

Useful command: We will visualize the distance matrix as a heatmap using the pcolormesh function of matplotlib.pyplot.

In [ ]:
plt.figure(figsize=(8, 6))

# Calculate the distances
euclidean_dists = metrics.euclidean_distances(X)

# Plot the heatmap
mesh = plt.pcolormesh(euclidean_dists, cmap=plt.cm.coolwarm)

# Add Colorbar
cbar = plt.colorbar(mesh)
cbar.set_label('Euclidean Distance', rotation=270, labelpad=15)

# Add Labels
plt.title('Pairwise Distance Matrix')
plt.xlabel('Data Point i')
plt.ylabel('Data Point j')
plt.show()
No description has been provided for this image

Clustering Algorithms¶

scikit-learn has a huge set of tools for unsupervised learning generally, and clustering specifically. These are in sklearn.cluster. http://scikit-learn.org/stable/modules/clustering.html

All of the clustering classes provide three methods:

  • fit(): builds the model from the training data (e.g., for k-means, it finds the centroids)
  • predict(): assigns labels to the data after building the model
  • fit_predict(): does both on the same data (e.g., in k-means, it finds the centroids and assigns labels to the dataset)

K-means clustering¶

More on the k-means clustering here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

Important parameters:

init: determines the way the initialization is done; 'k-means++' is the default.

n_init: number of random initializations.

max_iter: maximum number of iterations.

Important attributes:

labels_: the labels for each point

cluster_centers_: the cluster centroids

inertia_: the SSE value

How K-means works¶

K-Means is an iterative algorithm. It tries to partition the dataset into $K$ distinct, non-overlapping subgroups (clusters) where each data point belongs to the cluster with the nearest mean (centroid).

The algorithm follows these 4 steps:

  1. Initialization: We pick $K$ random points as the initial "centroids" (centers of the clusters).
  2. Assignment (The "E" Step): Every data point calculates its distance to each centroid and assigns itself to the closest one.
  3. Update (The "M" Step): The centroids move to the new center (the mean) of the points assigned to them.
  4. Repeat: Steps 2 and 3 are repeated until the centroids stop moving (convergence) or we reach a maximum number of iterations (a minimal sketch of this loop follows below).
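To make these steps concrete, here is a minimal NumPy sketch of the loop. This is an illustration only, not scikit-learn's actual implementation; it assumes a data matrix X (such as the blobs generated above) and a number of clusters k.

In [ ]:
def simple_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins the cluster of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

sketch_labels, sketch_centroids = simple_kmeans(X, 3)
print(sketch_centroids)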

Understanding the Code Parameters¶

In the code below, we configure the algorithm with these settings:

  • n_clusters=3: This is the $K$. We are telling the algorithm to look for exactly 3 groups.
  • init='k-means++': Instead of placing the initial centroids purely randomly (which can lead to bad results), this method places them in a smart way (far apart from each other) to speed up convergence.
  • max_iter=100: If the algorithm doesn't settle down (converge), we force it to stop after 100 loops to prevent it from running forever.

The "Error" (Inertia)¶

The code will print the Inertia. This is the sum of squared distances of samples to their closest cluster center.

  • Lower Inertia = Better: It means the clusters are tight and points are close to their centroids.
  • Higher Inertia = Worse: It means the points are spread far out from their centroids.
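In formula form, for data points $x_i$ and centroids $\mu_1, \dots, \mu_K$:

$$\text{Inertia} = \sum_{i=1}^{N} \min_{j \in \{1, \dots, K\}} \lVert x_i - \mu_j \rVert^2$$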
In [ ]:
import sklearn.cluster as sk_cluster

kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=3, max_iter=100)
kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
kmeans_labels = kmeans.labels_
error = kmeans.inertia_

print ("The total error of the clustering is: ", error)
print ('\nCluster labels')
print(kmeans_labels)
print ('\n Cluster Centroids')
print (centroids)
The total error of the clustering is:  729.4109184603107

Cluster labels
[0 0 0 2 1 2 2 0 0 0 2 1 1 2 2 1 1 1 2 0 1 1 0 1 1 2 1 2 1 0 0 1 1 0 2 2 1
 1 0 0 1 0 2 1 0 0 1 1 0 0 2 1 1 2 1 1 2 2 1 1 0 2 1 1 1 2 2 2 0 0 1 2 0 1
 0 1 2 0 1 1 1 1 1 0 2 0 1 1 1 0 2 2 1 0 0 0 2 0 1 2 1 0 0 0 2 1 2 2 1 0 0
 0 1 0 0 0 0 1 2 2 1 2 0 2 2 0 0 2 0 1 1 1 2 0 0 1 0 1 0 0 0 2 0 0 2 1 0 0
 1 2 1 0 0 1 2 1 2 0 1 0 1 0 0 1 2 0 2 2 2 0 0 0 2 1 2 1 2 0 1 0 2 1 1 0 1
 2 0 1 1 1 2 0 0 0 1 0 0 2 0 2 1 0 2 0 2 0 1 1 1 2 0 1 1 2 1 2 0 0 1 2 0 1
 1 2 2 1 1 0 1 0 2 0 2 1 2 1 2 1 1 1 2 0 1 0 2 0 0 1 2 0 1 0 2 1 2 1 0 2 0
 2 2 2 0 1 1 1 1 2 2 1 2 0 2 1 1 1 1 0 2 0 0 1 1 2 1 0 2 1 2 0 2 0 1 2 1 1
 2 0 2 1 1 1 2 0 1 1 1 0 2 0 1 2 2 1 0 2 1 2 1 0 2 1 0 1 0 2 1 0 1 1 1 2 1
 1 0 2 0 0 0 0 1 2 2 2 0 1 1 1 1 0 1 2 0 0 2 1 1 0 2 2 2 0 2 1 0 0 0 1 1 1
 0 1 1 2 1 2 0 1 2 0 1 0 2 1 2 0 0 1 2 0 2 0 1 0 1 2 0 1 1 2 0 0 1 0 2 1 0
 0 2 2 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 0 0 2 1 2 2 0 2 2 1 2 2 0 1 1 2 0
 1 0 2 1 0 0 2 0 2 0 2 2 2 2 0 0 2 2 1 2 0 1 2 2 1 0 0 1 2 2 1 0 2 0 2 2 1
 2 1 2 2 0 1 0 2 2 2 1 1 1 0 0 0 0 0 1]

 Cluster Centroids
[[ 1.14033049 -1.174838  ]
 [ 0.78589165  1.17781335]
 [-1.34430057 -1.28806973]]
In [ ]:
actual_centroids = np.array([[1, 1], [-1, -1], [1, -1]])
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', s=50, edgecolor='k')
# Plot centroids as red stars
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='*', s=200, label='K-Means Centroids')
# Plot also the actual centroids as orange stars
plt.scatter(actual_centroids[:, 0], actual_centroids[:, 1], c='orange', marker='*', s=200, label='Actual Centroids')
plt.title("K-Means Clustering Results")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
No description has been provided for this image

Useful command: numpy.argsort sorts a set of values and returns the indices that would produce the sorted order.

In [ ]:
plt.figure(figsize=(8, 6))

# Sort the data based on the cluster labels
idx = np.argsort(kmeans_labels) # returns the indices in sorted order
rX = X[idx,:]

# Recalculate distances on the sorted data
r_euclid = metrics.euclidean_distances(rX)
# Alternative efficiency tip: re-order the existing matrix instead of re-calculating
# r_euclid = euclidean_dists[idx,:][:,idx]

# Plot the heatmap
mesh = plt.pcolormesh(r_euclid, cmap=plt.cm.coolwarm)

# Add Colorbar and Labels
cbar = plt.colorbar(mesh)
cbar.set_label('Euclidean Distance', rotation=270, labelpad=15)
plt.title('Pairwise Distance Matrix (Sorted by K-Means Clusters)')
plt.xlabel('Data Point i (Sorted)')
plt.ylabel('Data Point j (Sorted)')
plt.show()
No description has been provided for this image

What can we notice in the above Pairwise Distance matrix compared to before?

Evaluation¶

Confusion matrix¶

Confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [ ]:
C = confusion_matrix(true_labels, kmeans_labels)

print (C)
print('\n')

plt.figure(figsize=(6, 5))
disp = ConfusionMatrixDisplay(confusion_matrix=C)
disp.plot(cmap=plt.cm.Reds)

plt.title('Confusion Matrix (Cluster vs True)')
plt.show()
[[ 26 140   1]
 [ 20  12 135]
 [122  29  15]]


<Figure size 600x500 with 0 Axes>
No description has been provided for this image

Label Permutation Problem: In supervised learning (classification), if a picture is a "Cat" (Label 0), the model must predict "0". In clustering (unsupervised), the algorithm has no idea what the label names mean. The result: a perfect clustering might produce a confusion matrix where the "correct" matches are not on the diagonal but scattered in other cells. In other words, don't expect a diagonal (as in classification); we will address this below.

Important: In the produced confusion matrix, the first argument defines the rows and the second the columns. The matrix is always square, regardless of whether the numbers of classes and clusters are the same; any extra rows or columns are filled with zeros.

Create a function that maps each cluster to the class that has the most points.

Be careful: if multiple clusters map to the same class, this simple approach will not work.

Useful command: numpy.argmax returns the index of the max element

In [ ]:
def cluster_class_mapping(kmeans_labels,true_labels):
    C= metrics.confusion_matrix(kmeans_labels,true_labels)
    mapping = list(np.argmax(C,axis=1)) #for each row (cluster) find the best class in the confusion matrix
    mapped_kmeans_labels = [mapping[l] for l in kmeans_labels]
    C2= metrics.confusion_matrix(mapped_kmeans_labels,true_labels)
    return mapped_kmeans_labels,C2

mapped_kmeans_labels,C = cluster_class_mapping(kmeans_labels,true_labels)
print(C)
plt.pcolormesh(C, cmap=plt.cm.Reds)

# Annotate the heatmap with cell values
for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        plt.text(j + 0.5, i + 0.5, str(C[i, j]),
                 ha='center', va='center', color='black')

# Set axis labels
plt.xlabel("True Labels", fontsize=12)
plt.ylabel("Mapped Cluster Labels", fontsize=12)

# Add tick marks with labels
plt.xticks(np.arange(C.shape[1]) + 0.5, range(C.shape[1]), rotation=45)
plt.yticks(np.arange(C.shape[0]) + 0.5, range(C.shape[0]))

plt.title("Confusion Matrix Heatmap", fontsize=14)
plt.colorbar()
plt.show()
[[140  12  29]
 [  1 135  15]
 [ 26  20 122]]
No description has been provided for this image

Predicted VS Actual Values¶

[Figure: table of predicted vs. actual values]

Precision and recall¶

Precision: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score

Recall: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score

These metrics are for classification, so they assume that cluster i corresponds to class i (row i is mapped to column i).

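As a reminder, for each class (treating that class as the "positive" one, with TP, FP and FN the true positives, false positives and false negatives):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$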

In [ ]:
p = metrics.precision_score(true_labels,kmeans_labels, average=None)
print(p)
r = metrics.recall_score(true_labels,kmeans_labels, average = None)
print(r)
[0.1547619  0.06629834 0.09933775]
[0.15568862 0.07185629 0.09036145]

Why are some scores so low?¶

You likely see Precision and Recall scores near 0.0 (or very low), despite the Confusion Matrix showing good separation.

The Problem: Label Mismatch Standard metrics assume that Label 0 in the data matches Cluster 0 in the prediction. However, K-Means assigns label IDs randomly!

  • True Data: Group 0 might be the "Blue" group.
  • K-Means: Might find the "Blue" group perfectly but name it "Cluster 1".

Because 0 != 1, the standard precision score counts this as a total failure.

The Solution: We need metrics that are invariant to label swapping. We don't care what the cluster is named, only that it contains the right points.

Evaluating with Mapped Labels¶

Now we calculate the scores again, but this time we use the mapped_kmeans_labels that we created using our helper function.

Why are the scores better? Because we fixed the naming problem! By translating the random K-Means cluster names to match the most likely true label, we can now compare them directly.

Understanding the Output:

  • average=None: Returns 3 scores (one for each cluster). You can see if one specific cluster is harder to separate than the others.
  • average='weighted': Returns a single number representing the overall performance, weighted by the size of each cluster.

Note: In a real-world scenario where we don't have True Labels to create a mapping, we can't use this method. This is why we need specific clustering metrics (like Homogeneity) which we will look at next.

In [ ]:
p = metrics.precision_score(true_labels,mapped_kmeans_labels, average=None)
print(p)
r = metrics.recall_score(true_labels,mapped_kmeans_labels, average = None)
print(r)
f = metrics.f1_score(true_labels,mapped_kmeans_labels, average = None)
print(f)
p = metrics.precision_score(true_labels,mapped_kmeans_labels, average='weighted')
print(p)
r = metrics.recall_score(true_labels,mapped_kmeans_labels, average = 'weighted')
print(r)
f = metrics.f1_score(true_labels,mapped_kmeans_labels, average = 'weighted')
print(f)
[0.77348066 0.89403974 0.72619048]
[0.83832335 0.80838323 0.73493976]
[0.8045977  0.8490566  0.73053892]
0.7980470510548809
0.794
0.794859459999974

Homogeneity and completeness¶

http://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness

Homogeneity and completeness are computed using the conditional entropy of the class labels given the cluster, and the conditional entropy of the cluster labels given the class. The V-measure combines the two in the same way that the F-measure combines precision and recall.

Standard metrics like Precision and Recall depend on the labels matching perfectly. Since clustering algorithms shuffle labels, we use Homogeneity and Completeness instead. These metrics use Entropy (Information Theory), which measures "uncertainty" or "messiness."

1. Homogeneity ("Purity")¶

  • The Question: "Does each cluster contain only members of a single class?"
  • The Math: $Homogeneity = 1 - \frac{H(C|K)}{H(C)}$
  • Intuition: If we pick a cluster, do we know exactly what class belongs there?
    • If a cluster has 99 Red dots and 1 Blue dot, the uncertainty (Entropy) is low. Homogeneity is high.
    • If a cluster is 50% Red and 50% Blue, uncertainty is maximum. Homogeneity is low.

2. Completeness ("Togetherness")¶

  • The Question: "Are all members of a given class assigned to the same cluster?"
  • The Math: $Completeness = 1 - \frac{H(K|C)}{H(K)}$
  • Intuition: If we pick a class (e.g., "The Blue Team"), are they all in one place?
    • If the Blue Team is split across 3 different clusters, the uncertainty of "Where is the Blue Team?" is high. Completeness is low.

3. V-Measure¶

  • This is the Harmonic Mean of Homogeneity and Completeness.
  • The Math: $\text{V-measure} = 2 \cdot \frac{Homogeneity \cdot Completeness}{Homogeneity + Completeness}$
  • It is the clustering equivalent of the F1-Score.


Math Reminder: Entropy Definitions¶

For those interested, here is a reminder of the Shannon entropy ($H$) from information theory, which measures "uncertainty" or "messiness".

1. Entropy of the Classes ($H(C)$) First, we measure the uncertainty of the dataset based on the true labels alone. $$H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{N} \cdot \log\left(\frac{n_c}{N}\right)$$

  • $N$: Total number of data points.
  • $n_c$: Number of points belonging to class $c$.
  • Interpretation: If the dataset has an equal mix of classes, Entropy is high (1.0). If it has only one class, Entropy is 0.

2. Conditional Entropy ($H(C|K)$) Next, we calculate the conditional entropy of the classes given the cluster assignments. This asks: "If I know which cluster a point is in, how uncertain am I about its true class?" $$H(C|K) = \sum_{k=1}^{|K|} \frac{n_k}{N} \cdot \underbrace{\left( -\sum_{c=1}^{|C|} \frac{n_{c,k}}{n_k} \cdot \log\left(\frac{n_{c,k}}{n_k}\right) \right)}_{\text{Entropy inside Cluster } k}$$

  • $n_k$: Number of points in cluster $k$.
  • $n_{c,k}$: Number of points of class $c$ inside cluster $k$.
  • Interpretation: We calculate the entropy for each cluster individually, then take the weighted average.
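As a sanity check of these formulas, here is a small sketch (using the true_labels and mapped_kmeans_labels from above) that computes $H(C)$, $H(C|K)$ and the homogeneity directly from the confusion matrix and compares the result with scikit-learn's metrics.homogeneity_score:

In [ ]:
# Sketch: homogeneity computed by hand from the entropy formulas above (natural log)
cm = metrics.confusion_matrix(true_labels, mapped_kmeans_labels).astype(float)  # rows: classes c, columns: clusters k
N = cm.sum()

n_c = cm.sum(axis=1)                              # points per true class c
H_C = -np.sum((n_c / N) * np.log(n_c / N))        # H(C)

H_C_given_K = 0.0
for k in range(cm.shape[1]):                      # loop over clusters
    n_k = cm[:, k].sum()                          # points in cluster k
    p = cm[:, k][cm[:, k] > 0] / n_k              # class distribution inside cluster k
    H_C_given_K += (n_k / N) * (-np.sum(p * np.log(p)))

print('homogeneity (by hand):', 1 - H_C_given_K / H_C)
print('homogeneity (sklearn):', metrics.homogeneity_score(true_labels, mapped_kmeans_labels))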
In [ ]:
h = metrics.homogeneity_score(true_labels,mapped_kmeans_labels)
print(h)
c = metrics.completeness_score(true_labels,mapped_kmeans_labels)
print(c)
v = metrics.v_measure_score(true_labels,mapped_kmeans_labels)
print(v)
0.44199547480098583
0.4430951461741084
0.44254462735008065

Interpreting the Results¶

We got a V-Measure of roughly 0.44. What does this mean?

  1. Moderate Separation: The score is not close to 1.0, which means the clusters are not perfectly separated. Looking back at our data generation, our centers were close together ([1,1] vs [1,-1]), causing the "blobs" to overlap significantly. K-Means struggles to draw perfect lines when data points mix.

  2. Balanced Error: Homogeneity and Completeness are nearly identical. This means the algorithm isn't biased; it's confusing the classes equally.

  3. A Key Lesson on Metrics:

    • For Precision/Recall, we had to map the labels first (0 → 2, etc.) to get a good score.
    • For Homogeneity/Completeness, we did not need to map them. These formulas look at the structure of the groups, not the specific names (0, 1, 2).
    • Try it yourself: Replace mapped_kmeans_labels with kmeans_labels in the code above. The score will remain exactly 0.44!

Finding the Optimal K: The Elbow Method¶

In the real world, we rarely know the true number of clusters ($K$) beforehand. To find it, we often use the Elbow Method.

How it works:

  1. We run K-Means multiple times with different values of $K$ (e.g., 1 to 10).
  2. We calculate the Inertia (Error) for each $K$.
  3. We plot the Error vs. $K$.

What to look for: As we add more clusters, the error naturally goes down. However, we are looking for the "Elbow"—the point where the error stops dropping drastically and starts to flatten out. This represents the point of diminishing returns.

The SSE plot:

In [ ]:
error = np.zeros(11)
sh_score = np.zeros(11)
for k in range(1,11):
    kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=k)
    kmeans.fit_predict(X)
    error[k] = kmeans.inertia_
    if k>1: sh_score[k]= metrics.silhouette_score(X, kmeans.labels_)

plt.plot(range(1,len(error)),error[1:])
plt.xlabel('Number of clusters')
plt.ylabel('Error')
Out[ ]:
Text(0, 0.5, 'Error')
No description has been provided for this image

Silhouette score¶

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

The Elbow method can sometimes be ambiguous (the "bend" isn't always sharp). A clearer method is to use the Silhouette Score.

This metric measures how "distinct" the clusters are. It answers two questions for every data point:

  1. How close is it to points in its own cluster? (Cohesion)
  2. How far is it from points in the nearest neighbor cluster? (Separation)

How to read the plot:

  • Higher is Better: We are looking for the global maximum.
  • The Range:
    • +1: Perfect separation.
    • 0: Overlapping clusters.
    • -1: Wrong clustering.
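Concretely, for a point $i$, let $a(i)$ be its mean distance to the other points in its own cluster (cohesion) and $b(i)$ its mean distance to the points of the nearest other cluster (separation). Then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

and the silhouette score of a clustering is the average of $s(i)$ over all points.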
In [ ]:
plt.plot(range(2,len(sh_score)),sh_score[2:])
plt.xlabel('Number of clusters')
plt.ylabel('silhouette score')
Out[ ]:
Text(0, 0.5, 'silhouette score')
No description has been provided for this image

Plot the silhouette score and the SSE together in a single figure with two different y-axes. The red line (left axis) is the silhouette score and the blue line (right axis) is the SSE. We can now study the two curves together.

In [ ]:
fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('number of clusters')
ax1.set_ylabel('silhouette score', color=color)
ax1.plot(range(2,len(sh_score)),sh_score[2:], color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('SSE', color=color)  # we already handled the x-label with ax1
ax2.plot(range(2,len(error)),error[2:], color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
No description has been provided for this image
In [ ]:
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
plt.scatter(X[:, 0], X[:, 1], color=colors[kmeans_labels].tolist(), s=10, alpha=0.8)
Out[ ]:
<matplotlib.collections.PathCollection at 0x7966355da570>
No description has been provided for this image

Agglomerative Clustering¶

So far, we have used K-Means, which is a "centroid-based" algorithm (it finds centers and assigns points to them).

Now, we will explore Hierarchical Clustering. Specifically, we will use Agglomerative Clustering, which is a "bottom-up" approach.

How it works¶

  1. Start: Treat every single data point as its own cluster. (If we have 500 points, we start with 500 clusters).
  2. Merge: Find the two "closest" clusters and merge them into one.
  3. Repeat: Keep merging the closest pairs until we are left with the desired number of clusters (in our case, n_clusters=3).

The "Linkage" Parameter¶

In the code below, you will see linkage='complete'. This defines how we measure the distance between two clusters (groups of points).

  • Complete Linkage: The distance between Cluster A and Cluster B is the distance between the farthest points in those two clusters. This approach tends to find compact clusters of equal diameters.
  • Other options include 'single' linkage (distance between the nearest points) and 'ward' linkage (which minimizes the increase in within-cluster variance); see the formulas below.
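In formulas, for two clusters $A$ and $B$ and a point-to-point distance $d(a, b)$:

$$d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b), \qquad d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b)$$

('ward' instead merges, at each step, the pair of clusters whose merge gives the smallest increase in the total within-cluster variance.)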

Evaluation¶

Just like with K-Means, Agglomerative Clustering assigns arbitrary label IDs (0, 1, 2). To evaluate the precision and recall, we will use our cluster_class_mapping function again to align the predicted labels with the true labels.

More on Agglomerative Clustering here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html


In [ ]:
agglo = sk_cluster.AgglomerativeClustering(linkage = 'complete', n_clusters = 3)
agglo_labels = agglo.fit_predict(X)

print("Confusion Matrix before label mapping:")
C_agglo = metrics.confusion_matrix(agglo_labels, true_labels)
print(C_agglo)
#plt.pcolor(C_agglo, cmap=plt.cm.coolwarm)
#plt.pcolormesh(C_agglo, cmap=plt.cm.Reds)

print("\nConfusion Matrix after label mapping:")
mapped_agglo_labels, C_agglo = cluster_class_mapping(agglo_labels, true_labels)
print(C_agglo)

print("\nPrecision score:")
p = metrics.precision_score(true_labels, mapped_agglo_labels, average='weighted')
print(p)

print("\nRecall score:")
r = metrics.recall_score(true_labels, mapped_agglo_labels, average='weighted')
print(r)
Confusion Matrix before label mapping:
[[ 33 156 108]
 [126  10  16]
 [  8   1  42]]

Confusion Matrix after label mapping:
[[126  10  16]
 [ 33 156 108]
 [  8   1  42]]

Precision score:
0.7257145291928573

Recall score:
0.648

Another way to do agglomerative clustering using SciPy:

https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

In [ ]:
import scipy.spatial.distance as sp_dist
D = sp_dist.pdist(X, 'euclidean')
Z = hr.linkage(D, method='complete')
print (Z.shape, X.shape)
print('\n')
print("Linkage Matrix (Z):")
print("\n Index of Cluster 1 | Index of Cluster 2 | Distance Between Clusters | Number of Points in New Cluster")
print(Z)
(499, 4) (500, 2)


Linkage Matrix (Z):

 Index of Cluster 1 | Index of Cluster 2 | Distance Between Clusters | Number of Points in New Cluster
[[3.01000000e+02 4.12000000e+02 4.62482589e-03 2.00000000e+00]
 [1.41000000e+02 2.59000000e+02 9.35716392e-03 2.00000000e+00]
 [2.16000000e+02 4.22000000e+02 1.05873218e-02 2.00000000e+00]
 ...
 [9.91000000e+02 9.94000000e+02 5.60370001e+00 2.97000000e+02]
 [9.92000000e+02 9.96000000e+02 7.70360478e+00 3.48000000e+02]
 [9.95000000e+02 9.97000000e+02 8.02903447e+00 5.00000000e+02]]

Understanding "Z": The Linkage Matrix¶

The variable Z is not just a random matrix. It is the History Log of every single step the algorithm took.

Remember, Agglomerative clustering starts with $N$ points (clusters) and merges them one by one until only 1 giant cluster remains. Since we have 500 data points, there will be exactly 499 merge steps (rows).

How to read a row: Each row represents one merge. The columns tell you exactly what happened:

  1. Column 0 & 1 (Cluster Indices): "I took Cluster A and Cluster B..."
    • Note: Indices $0$ to $499$ are your original data points. Indices $500+$ are the newly formed clusters created in previous steps.
  2. Column 2 (Distance): "...because they were this far apart..."
    • This is the distance between them at the moment they were merged.
  3. Column 3 (Cluster Size): "...and combined them into a new cluster with this many items."

Example: If the first row reads: [12, 45, 0.1, 2]

  • It means: "Data point 12 and Data point 45 were merged. They were 0.1 units apart. The new cluster has 2 items."
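For example, given the linkage matrix Z above, SciPy's fcluster can cut this merge history into a desired number of flat clusters (a small sketch):

In [ ]:
# Cut the merge history stored in Z into 3 flat clusters
flat_labels = hr.fcluster(Z, t=3, criterion='maxclust')   # labels take values 1..3
print(np.unique(flat_labels, return_counts=True))
# These labels can then be evaluated like any other clustering,
# e.g. with cluster_class_mapping(flat_labels - 1, true_labels)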

A visual way to find the optimal number of clusters in agglomerative clustering is the dendrogram:

Look for the largest vertical distance that can be drawn without crossing a horizontal line. This represents the point at which merging should stop to achieve an optimal number of clusters.

In [ ]:
fig = plt.figure(figsize=(10,10))
T = hr.dendrogram(Z,color_threshold=0.4, leaf_font_size=2)
# Add a horizontal line to indicate the threshold for 3 clusters
plt.axhline(y=7.5, color='r', linestyle='--', label='Threshold for 3 Clusters')
fig.show()

# y-axis height in dendrogram = distance between clusters
No description has been provided for this image

Another way to do agglomerative clustering (and visualizing it): http://seaborn.pydata.org/generated/seaborn.clustermap.html

In [ ]:
distances = metrics.euclidean_distances(X)
cg = sns.clustermap(distances, method="complete", figsize=(13,13), xticklabels=False)
print (cg.dendrogram_col.reordered_ind)
/usr/local/lib/python3.12/dist-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/usr/local/lib/python3.12/dist-packages/seaborn/matrix.py:530: ClusterWarning: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  linkage = hierarchy.linkage(self.array, method=self.method,
/usr/local/lib/python3.12/dist-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/usr/local/lib/python3.12/dist-packages/seaborn/matrix.py:530: ClusterWarning: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  linkage = hierarchy.linkage(self.array, method=self.method,
[177, 469, 83, 179, 343, 61, 34, 124, 3, 466, 252, 490, 442, 312, 354, 230, 240, 476, 302, 57, 317, 154, 438, 167, 71, 472, 232, 399, 450, 35, 236, 454, 172, 478, 219, 107, 320, 455, 283, 434, 244, 426, 425, 121, 123, 25, 335, 432, 254, 6, 174, 127, 388, 423, 267, 53, 435, 257, 197, 209, 481, 415, 491, 264, 206, 294, 181, 411, 275, 12, 117, 208, 226, 187, 332, 444, 238, 274, 263, 310, 75, 355, 374, 4, 424, 465, 95, 114, 142, 309, 281, 129, 468, 471, 80, 250, 200, 266, 419, 235, 436, 383, 194, 462, 28, 160, 441, 301, 412, 40, 11, 173, 242, 380, 100, 350, 221, 330, 392, 410, 36, 499, 287, 394, 88, 31, 54, 43, 155, 182, 347, 20, 222, 295, 369, 32, 15, 188, 440, 326, 316, 447, 24, 82, 137, 92, 480, 168, 223, 431, 382, 484, 492, 52, 345, 363, 58, 300, 47, 78, 51, 387, 356, 145, 207, 346, 416, 62, 329, 37, 305, 98, 321, 153, 178, 247, 348, 306, 417, 112, 148, 163, 405, 367, 81, 255, 323, 304, 131, 313, 135, 218, 402, 120, 16, 482, 63, 205, 333, 474, 443, 231, 1, 403, 150, 108, 397, 45, 19, 376, 140, 357, 23, 282, 189, 398, 237, 239, 175, 212, 276, 284, 420, 372, 414, 26, 368, 318, 46, 299, 87, 211, 273, 430, 253, 17, 55, 477, 196, 319, 33, 279, 385, 277, 427, 364, 192, 195, 307, 38, 41, 29, 258, 136, 453, 115, 220, 458, 365, 400, 143, 157, 111, 475, 289, 216, 422, 97, 324, 159, 8, 353, 86, 265, 158, 225, 409, 204, 149, 213, 130, 340, 371, 79, 214, 233, 377, 73, 292, 486, 64, 269, 21, 328, 70, 105, 184, 228, 59, 493, 245, 251, 327, 7, 291, 344, 485, 103, 496, 370, 74, 401, 165, 249, 384, 94, 375, 147, 448, 30, 49, 198, 69, 201, 68, 459, 186, 134, 418, 243, 497, 72, 379, 185, 215, 13, 99, 351, 268, 373, 413, 90, 479, 106, 360, 463, 144, 166, 10, 56, 298, 362, 224, 234, 65, 488, 42, 202, 156, 50, 483, 311, 437, 278, 404, 118, 489, 325, 66, 457, 395, 270, 429, 91, 260, 67, 248, 14, 456, 358, 446, 296, 164, 96, 452, 199, 272, 315, 104, 308, 341, 261, 290, 141, 259, 342, 467, 180, 390, 190, 27, 460, 349, 210, 361, 76, 378, 109, 60, 116, 421, 193, 227, 138, 133, 101, 126, 44, 122, 151, 433, 2, 48, 113, 84, 408, 461, 288, 18, 473, 5, 119, 293, 132, 359, 176, 139, 286, 331, 449, 183, 336, 170, 191, 161, 381, 77, 262, 246, 439, 39, 171, 256, 280, 89, 366, 352, 470, 93, 322, 393, 498, 297, 396, 285, 203, 391, 314, 406, 464, 0, 407, 338, 152, 22, 271, 303, 128, 241, 487, 110, 162, 217, 229, 389, 494, 339, 146, 337, 102, 451, 386, 445, 9, 428, 169, 125, 495, 85, 334]
No description has been provided for this image

DBSCAN Algorithm¶

More on DBSCAN here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed, while marking points that lie alone as Noise (labeled as -1).

The 2 Key Parameters¶

Unlike K-Means, we don't pick $K$ (number of clusters). Instead, we define "density" using two parameters:

  1. eps ($\epsilon$): The maximum distance (radius) between two samples for one to be considered as in the neighborhood of the other.
  2. min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a "Core Point".

How it works (The 3 types of points)¶

The algorithm classifies every data point into one of three categories:

  1. Core Point: A point that has at least min_samples within its eps radius. (These are the "heart" of a cluster).
  2. Border Point: A point that is within the eps radius of a Core Point but has fewer than min_samples neighbors of its own. (These form the "edge" of the cluster).
  3. Noise Point: A point that is neither a Core nor a Border point. (These are outliers).
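A small sketch of how to inspect these three types after fitting (using the blob data X from above; scikit-learn exposes the core points through the core_sample_indices_ attribute):

In [ ]:
db = sk_cluster.DBSCAN(eps=0.3, min_samples=5).fit(X)

is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True   # core points
is_noise = db.labels_ == -1               # noise points are labeled -1
is_border = ~is_core & ~is_noise          # in a cluster, but not dense enough to be core

print('core points:  ', is_core.sum())
print('border points:', is_border.sum())
print('noise points: ', is_noise.sum())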

The Algorithm Steps¶

  1. Pick a random point that hasn't been visited.
  2. Check its neighborhood (using eps).
    • If it has enough neighbors ($\ge$ min_samples), it becomes a Core Point and starts a new cluster.
    • If not, label it as Noise (for now).
  3. Expand the Cluster: If it's a Core Point, check its neighbors. If those neighbors are also Core Points, they merge into the same cluster. This "chain reaction" continues until the density drops.
  4. Repeat: Pick the next unvisited point and repeat until all points are processed.

Why use DBSCAN?

  • It does not require us to know the number of clusters ($K$) beforehand.
  • It can find arbitrarily shaped clusters, whereas K-Means assumes spherical clusters.
  • It is robust to outliers (Noise).
In [ ]:
dbscan = sk_cluster.DBSCAN(eps=0.3)
dbscan_labels = dbscan.fit_predict(X)

print("DBSCAN Labels (Note: -1 corresponds to noise):")
print(dbscan_labels)

# Rename labels to avoid negative values for metrics
renamed_dbscan_labels = [x + 1 for x in dbscan_labels]

print("\nConfusion Matrix comparing DBSCAN labels with true labels:")
C = metrics.confusion_matrix(renamed_dbscan_labels, true_labels)
print(C[:, :max(true_labels) + 1])

# Calculating scores
print("\nPrecision Score (weighted):")
precision = metrics.precision_score(true_labels, renamed_dbscan_labels, average='weighted', zero_division=1)
print(precision)
print("\nRecall Score (weighted):")
recall = metrics.recall_score(true_labels, renamed_dbscan_labels, average='weighted', zero_division=1)
print(recall)
print("\nF1 Score (weighted):")
f1_score = metrics.f1_score(true_labels, renamed_dbscan_labels, average='weighted', zero_division=1)
print(f1_score)
DBSCAN Labels (Note: -1 corresponds to noise):
[ 5  0  0 -1 -1  0 -1  0  0  3  0  0 -1 -1  0 -1  0  0  0  0  0  0  5  0
  0 -1  0 -1 -1  0  0  0 -1  0 -1  0  0  0  0 -1  0  0  0  0  0  0  0 -1
  0  0  0 -1  1 -1  0  0  0  0  1  0  0 -1  0  0  0  0  0  0  0  0  0 -1
  0  0  0 -1  0  0 -1  0  2  0  0 -1  0  3  0  0  0 -1  0  0 -1  0  0 -1
  0  0  0 -1  0  0  0  0  0  0  0  0  0  0 -1  0  0  0 -1  0  0 -1  0  0
  0 -1  0 -1 -1  3  0 -1 -1 -1  0  0  0  0  0  0  0  0  0  0  0  0 -1  0
  0  0  4  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0 -1  1  0  0  0 -1
 -1  3  0 -1 -1  0 -1  0  0 -1  0 -1  0 -1  0  0  0  0  0 -1 -1  0 -1  0
  0  0  0  0  0 -1  0  0  2  0  0 -1  0  0 -1  0 -1 -1 -1  0  0  0  0 -1
  0 -1  0  0  0 -1  0 -1  0  0 -1  0  0  0  0  0  0  0  0  2  0  0 -1  0
  0  5  0  0 -1  0  0  0  0  0  2  0 -1  0 -1  0 -1 -1  0  0  0  0  0 -1
 -1  0  2 -1  0  0  0  5  0  0 -1 -1  0  0  0  0 -1 -1  0  0  0 -1  0  0
  0  0  0  0  0  0 -1  0  0  0  0  0  1  0  0 -1  0  0  0  0  0 -1 -1  0
  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0 -1  0  3 -1
  0  4  5  4  0  0  0 -1  0  1  0  0  0 -1  0  0 -1  0  0 -1  0  0  0  0
  0 -1  0  1  0  0 -1  0  0  0  0  0  0  0 -1  0  0  0 -1  0  0  0 -1  2
  0  0  0 -1 -1  4  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0 -1 -1  5
  0  0  0 -1  0  0  0 -1  0  0  0  2  0 -1  0 -1 -1 -1 -1  0  3  0  0 -1
 -1  0  0 -1  2  0  0  0 -1  0  0  0 -1  0  0  0  0  0  0  0  0  0 -1 -1
  0  0  0  0 -1  0  0  0 -1 -1 -1  0 -1 -1 -1  2 -1  0  0  0  0  0 -1  0
 -1 -1  0  0 -1  0  0  5  0  0 -1 -1  1  0  4  3  0  0  0  0]

Confusion Matrix comparing DBSCAN labels with true labels:
[[ 48  47  26]
 [106 117 120]
 [  3   3   1]
 [  9   0   0]
 [  0   0   7]
 [  0   0   5]
 [  1   0   7]]

Precision Score (weighted):
0.29385446835168544

Recall Score (weighted):
0.332

F1 Score (weighted):
0.26841854244588004
In [ ]:
#colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
#colors = np.hstack([colors] * 20)
colors = np.array([x for x in 'bgrcmywk'*10])
plt.scatter(X[:, 0], X[:, 1], color=colors[dbscan_labels].tolist(), s=10, alpha=0.8)
Out[ ]:
<matplotlib.collections.PathCollection at 0x7966351c3980>
No description has been provided for this image

A case where DBSCAN works better than K-Means.

In [ ]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

# Create synthetic data with a non-convex shape
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Apply KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Plotting the results for DBSCAN
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis', marker='o', edgecolor='k')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Plotting the results for KMeans
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', marker='o', edgecolor='k')
plt.title('KMeans Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Evaluating the clustering performance using Adjusted Rand Index (ARI)
dbscan_ari = adjusted_rand_score(_, dbscan_labels)
kmeans_ari = adjusted_rand_score(_, kmeans_labels)

print("Adjusted Rand Index for DBSCAN:", dbscan_ari)
print("Adjusted Rand Index for KMeans:", kmeans_ari)
No description has been provided for this image
Adjusted Rand Index for DBSCAN: 1.0
Adjusted Rand Index for KMeans: 0.2525184244081995

Clustering text data¶

An example of what we want to do: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

SciKit datasets: http://scikit-learn.org/stable/datasets/

We will use the 20-newsgroups dataset, which consists of postings from 20 different newsgroups.

More information here: http://scikit-learn.org/stable/datasets/#the-20-newsgroups-text-dataset

In [ ]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
#categories = ['alt.atheism', 'sci.space','rec.sport.baseball']
news_data = sk_data.fetch_20newsgroups(subset='train',
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)
print (news_data.target)
print (len(news_data.target))
[2 0 0 ... 2 1 2]
1781
In [ ]:
print (type(news_data))
print (news_data.filenames)
print (news_data.target[:10])
print (news_data.data[1])
print (len(news_data.data))
<class 'sklearn.utils._bunch.Bunch'>
['/root/scikit_learn_data/20news_home/20news-bydate-train/sci.space/60940'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.os.ms-windows.misc/9955'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.os.ms-windows.misc/9846'
 ...
 '/root/scikit_learn_data/20news_home/20news-bydate-train/sci.space/60891'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.baseball/104484'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/sci.space/61110']
[2 0 0 2 0 0 1 2 2 1]
Recently the following problem has arrisen.  The first time I turn on my  
computer when windows starts (from my autoexec) after the win31 title screen  
the computer reboots on its own.  Usually the second time (after reboot) or  
from the DOS prompt everything works fine.

 s far as I remember I have not changed my config.sys or autoxec.bat or  
win.ini.  I can't remember whether this problem occured before I  
optimized/defragmented my disk and created a larger swap file (Thank you  
MathCAD 4 :(  )

System 386sx, 4MB, stacker 2.0, win31, DOS 5

---
---------------------------------------------------------------------
1781

Make text data into vector form using TfidfVectorizer
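As a reminder of what TF-IDF computes (with scikit-learn's default smooth_idf=True, and before the final L2 normalization), the weight of term $t$ in document $d$, over a corpus of $n$ documents, is roughly

$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right)$$

where $\text{tf}(t, d)$ is the count of $t$ in $d$ and $\text{df}(t)$ is the number of documents that contain $t$.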

In [ ]:
vectorizer = sk_text.TfidfVectorizer(stop_words='english',  # removes common words ("the", "is", "and", etc.)
                             #max_features = 1000,
                             min_df=4, max_df=0.8)
data = vectorizer.fit_transform(news_data.data)
print(type(data))  # Sparse matrix: 1781 rows by vocabulary-size columns
<class 'scipy.sparse._csr.csr_matrix'>

Apply our 3 clustering algorithms

In [ ]:
# K-Means
kmeans = sk_cluster.KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=10, random_state=42)
km_labels = kmeans.fit_predict(data)

# Agglomerative Clustering
# Used 'ward' linkage to minimize variance
# Note: Ward linkage requires a dense matrix, so we convert the sparse data.
agglo = sk_cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agglo.fit_predict(data.toarray())

# DBSCAN
# Note: We MUST use 'cosine' metric for text. Think why...
# eps=0.5 is a standard starting threshold for cosine distance (range 0 to 1).
# min_samples=5 is the default density threshold.
dbscan = sk_cluster.DBSCAN(eps=0.5, min_samples=5, metric='cosine')
db_labels = dbscan.fit_predict(data)


# Helper Function to Calculate Scores
def get_scores(true_labels, pred_labels):
    if len(set(pred_labels)) < 2: # Check if algorithm failed to find clusters
        return 0.0, 0.0, 0.0
    h = metrics.homogeneity_score(true_labels, pred_labels)
    c = metrics.completeness_score(true_labels, pred_labels)
    v = metrics.v_measure_score(true_labels, pred_labels)
    return h, c, v

# Calculate scores for all three
km_scores = get_scores(news_data.target, km_labels)
agg_scores = get_scores(news_data.target, agg_labels)
db_scores = get_scores(news_data.target, db_labels)

# Print Comparison Table
print(f"{'Algorithm':<20} | {'Homogeneity':<12} | {'Completeness':<12} | {'V-Measure':<12}")
print("-" * 65)
print(f"{'K-Means':<20} | {km_scores[0]:.3f}        | {km_scores[1]:.3f}        | {km_scores[2]:.3f}")
print(f"{'Agglomerative':<20} | {agg_scores[0]:.3f}        | {agg_scores[1]:.3f}        | {agg_scores[2]:.3f}")
print(f"{'DBSCAN':<20} | {db_scores[0]:.3f}        | {db_scores[1]:.3f}        | {db_scores[2]:.3f}")
Algorithm            | Homogeneity  | Completeness | V-Measure   
-----------------------------------------------------------------
K-Means              | 0.447        | 0.514        | 0.478
Agglomerative        | 0.281        | 0.488        | 0.357
DBSCAN               | 0.009        | 0.173        | 0.017

Print the number of clusters DBSCAN found and how many points it marked as noise, to explain why its score is so low.

In [ ]:
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = list(db_labels).count(-1)
print(f"\nDBSCAN found {n_db_clusters} clusters and marked {n_noise} points as noise")
DBSCAN found 2 clusters and marked 1765 points as noise

To understand the clusters, we can print the words that have the highest values in each cluster centroid.

In [ ]:
# Function to calculate centroids manually (needed for Agglomerative & DBSCAN, which don't provide centroids)
def print_top_terms(labels, data, vectorizer, algorithm_name, n_terms=10):
    print(f"\n--- Top Terms for {algorithm_name} ---")

    # Get all unique labels found by the algorithm
    unique_labels = sorted(set(labels))
    terms = vectorizer.get_feature_names_out()

    for label in unique_labels:
        # Handle Noise points (-1) typically found in DBSCAN
        label_name = f"Cluster {label}" if label != -1 else "Noise (-1)"
        print(f"{label_name}:", end=" ")

        # Identify which rows belong to this cluster
        indices = np.where(labels == label)[0]

        if len(indices) == 0:
            continue

        # Calculate the Mean (Centroid) of these rows
        # data is sparse, so we use .mean(axis=0) which returns a matrix
        centroid = data[indices].mean(axis=0)
        # Flatten to 1D array
        centroid = np.array(centroid).flatten()

        # Sort indices to find highest TF-IDF values
        top_indices = centroid.argsort()[::-1][:n_terms]

        # Print words
        top_words = [terms[i] for i in top_indices]
        print(", ".join(top_words))

# Run the function for all three algorithms to compare
print_top_terms(km_labels, data, vectorizer, "K-Means")
print_top_terms(agg_labels, data, vectorizer, "Agglomerative")
print_top_terms(db_labels, data, vectorizer, "DBSCAN")
--- Top Terms for K-Means ---
Cluster 0: windows, file, dos, files, thanks, use, driver, drivers, card, problem
Cluster 1: year, team, game, games, runs, good, think, hit, baseball, pitching
Cluster 2: space, just, like, nasa, think, know, don, time, people, thanks

--- Top Terms for Agglomerative ---
Cluster 0: space, year, just, like, think, don, good, know, team, time
Cluster 1: windows, file, thanks, dos, files, use, problem, card, drivers, driver
Cluster 2: ax, max, b8f, g9v, a86, 145, 1d9, pl, 2di, 0t

--- Top Terms for DBSCAN ---
Noise (-1): windows, space, like, just, know, think, don, thanks, year, does
Cluster 0: ax, max, b8f, g9v, a86, 145, 1d9, pl, 2di, 0t
Cluster 1: mystery, 9591, circumference, 2178, 18084tm, mcwilliams, grows, 336, 355, 517