KMeans in PyTorch with Cosine Distance 🥧🔦

Introduction

KMeans is one of the easiest and fastest clustering algorithms. Here we tweak the algorithm to cluster vectors of unit length.

Data

We randomly generate a million data points with 768 dimensions (a typical size for transformer embeddings) and then normalize them all to unit length.

import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm

N = 1000000
D = 768
x = F.normalize(torch.randn(N, D), dim=-1)
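As a quick sanity check (my addition, not part of the original post), every row should now have approximately unit L2 norm:

norms = x.norm(dim=-1)
print(norms.min().item(), norms.max().item())  # both should be ~1.0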
Algorithm
The following shows our KMeans implementation. The steps are as follows:

1. Choose n_clusters points from our dataset at random and set them as our initial centroids.
2. Iterate through all data points and assign each point to one of the centroids.
3. Recalculate the centroids by averaging the data points assigned to each cluster. As an additional step over usual KMeans, normalize the new centroids to unit length.
4. Repeat from step 2 for epochs iterations.

To keep memory usage down, we compute the similarities batch-wise and store only the index of the most similar centroid (argmax) for each point.
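To make the saving concrete, here is a rough back-of-the-envelope estimate (my numbers, assuming float32 similarities, int64 indices, and the sizes used in this post):

# Rough estimate, assuming float32 (4 bytes) similarities and int64 (8 bytes) indices.
N, K = 1_000_000, 1_000
full_matrix_gb = N * K * 4 / 1e9   # materializing the full N x K similarity matrix
argmax_only_gb = N * 8 / 1e9       # keeping just one index per point
print(full_matrix_gb, argmax_only_gb)  # 4.0 GB vs 0.008 GB

With that in mind, here is the implementation: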
class KMeans:
    def __init__(self, n_clusters, batch_size, epochs):
        self.n_clusters = n_clusters
        self.batch_size = batch_size
        self.epochs = epochs
        self.centers = None
        self.similarities = []

    def fit(self, x):
        if self.centers is None:
            # Initialize centroids with n_clusters random points from the dataset.
            idx = np.random.choice(len(x), self.n_clusters, replace=False)
            centers = x[idx]
        else:
            centers = self.centers
        for _ in range(self.epochs):
            batch_cluster = []
            similarity_mean = 0
            for i in tqdm(range(0, len(x), self.batch_size)):
                # Cosine similarity of one batch against all centroids (all
                # vectors are unit length, so the dot product is the cosine
                # similarity); keep only the best value and its centroid index.
                similarity, idx = (x[i:i+self.batch_size] @ centers.T).max(dim=-1)
                similarity_mean += similarity.sum()
                batch_cluster.append(idx)
            x_assigned_cluster = torch.cat(batch_cluster)
            # Recompute each centroid as the mean of its assigned points and
            # renormalize it to unit length. (A cluster that receives no
            # points would yield a NaN centroid here.)
            centers = torch.stack(
                [
                    F.normalize(x[x_assigned_cluster == i].mean(dim=0), dim=-1)
                    for i in range(self.n_clusters)
                ]
            )
            # Track the mean best-similarity per point to monitor convergence.
            self.similarities.append(similarity_mean / len(x))
        self.centers = centers
kmeans = KMeans(1000, 64, 10)
kmeans.fit(x)
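The class above only learns the centroids. If you also want cluster assignments for new (already unit-normalized) vectors, a minimal sketch of a helper could look like this; predict is my addition, not part of the original class, reusing the same batch-wise argmax trick:

def predict(kmeans, x_new):
    # Hypothetical helper (not in the original post): assign each point in
    # x_new (assumed unit-normalized) to its nearest centroid by cosine similarity.
    assignments = []
    for i in range(0, len(x_new), kmeans.batch_size):
        sims = x_new[i:i + kmeans.batch_size] @ kmeans.centers.T
        assignments.append(sims.argmax(dim=-1))
    return torch.cat(assignments)

labels = predict(kmeans, x[:10])  # cluster index for the first 10 points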
You can see the similarities converging below:

import matplotlib.pyplot as plt

plt.plot(kmeans.similarities)
Shameless Self Promotion

If you enjoyed the tutorial, buy my course (usually 90% off).