april 2026
7 min read
k-means at scale: what i learned building a clustering pipeline
there's a class of company emerging that i find genuinely fascinating. not companies that use AI to augment human work. companies that use AI to run entire business processes autonomously — discovering what to build, building it, shipping it, and growing it with minimal human intervention.
the first hard technical problem in that pipeline is one you might not expect: community clustering. if you're trying to identify which of millions of niche online communities represent viable software markets, you need to cluster them at scale, automatically, with no human in the loop. that's where k-means comes in.
i spent a week going deep on this while preparing for a technical interview. what i found is that the algorithm itself is almost trivially simple. the engineering around making it production-ready — especially for automated, autonomous pipelines — is where it gets interesting.
before writing any code, it's worth being precise about the problem. you have a dataset of communities — reddit subreddits, discord servers, niche forums. each one is represented as a feature vector capturing signals like posting frequency, member retention, topic vocabulary density, engagement rate, and cross-promotion fit. the goal is to find natural groupings automatically, without knowing how many groups exist, and use those groupings to score opportunities.
the output you want isn't just clusters. it's archetypes. 'cluster 4 represents highly engaged, niche communities with low churn and strong willingness to pay signals.' that archetype, ranked by opportunity score, becomes the next product to build.
k-means alternates between two steps until convergence: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. the mathematical guarantee is clean — inertia (total within-cluster sum of squared distances) decreases or stays the same at every step. the assignment step picks the nearest centroid by definition. the update step places the centroid at the mean, which is the point that minimizes the sum of squared distances to a set of points — that's literally the definition of the mean. both steps are monotone. inertia is bounded below by zero. therefore it terminates.
what it doesn't guarantee is optimality. k-means converges to a local minimum, not the global minimum. the solution depends entirely on initialization.
i structured this as a class rather than standalone functions for one reason: a trained model is a stateful object. it has learned centroids that need to persist for prediction on new data. a function runs and disappears. a class persists. this mirrors sklearn's api intentionally — trailing underscores for attributes learned during training, clean separation between configuration, learning, and inference.
the class structure — configuration vs learned state
class KMeans:
def __init__(self, k, max_iters=300, tol=1e-4, init="kmeans++"):
self.k = k
self.max_iters = max_iters
self.tol = tol
self.init = init
# learned during fit() — trailing underscore = sklearn convention
# signals these were learned from data, not set by the user
self.centroids_ = None
self.labels_ = None
self.inertia_ = None
self.n_iters_ = Nonerandom initialization is the most common failure mode. if two initial centroids happen to land in the same natural cluster, that cluster gets split and another gets ignored. the algorithm converges, but to a locally optimal solution that's structurally wrong.
k-means++ fixes this with a probabilistic initialization: after picking the first centroid randomly, each subsequent centroid is chosen with probability proportional to its squared distance from the nearest existing centroid. points far from all current centroids are much more likely to be picked next. the key insight is squaring the distance rather than using it directly — a point twice as far gets four times the probability. this makes the initialization more aggressive about spreading centroids across the space.
k-means++ initialization
def _kmeans_plus_plus_init(self, X):
# first centroid is purely random - no information yet
centroids = [X[np.random.randint(len(X))]]
for _ in range(self.k - 1):
# squared distance from each point to its nearest existing centroid
# squaring amplifies differences so far points get
# disproportionately higher probability
distances = np.min(
[np.linalg.norm(X - c, axis=1) ** 2 for c in centroids],
axis=0
)
# divide by sum → valid probability distribution summing to 1
probs = distances / distances.sum()
# sample next centroid — far points more likely, but still random
centroids.append(X[np.random.choice(len(X), p=probs)])
return np.array(centroids)the normalization step — dividing by the sum — converts raw squared distances into a valid probability distribution that sums to exactly one. without this, numpy's random.choice can't use them as probabilities. this is the detail most explanations gloss over.
the assignment step is where performance lives. the naive implementation uses a double nested loop — for each point, compute distance to each centroid. for 50,000 communities and 20 centroids, that's one million iterations in python. the vectorized version does the same computation in a single numpy operation using broadcasting.
assignment step — vectorized with broadcasting
def _assign_labels(self, X, centroids):
# X[:, np.newaxis] reshapes (n,d) → (n,1,d)
# broadcasting against centroids (k,d) gives differences (n,k,d)
# this computes every point's diff to every centroid simultaneously
differences = X[:, np.newaxis] - centroids
# norm on axis=2 collapses d → gives distance matrix (n,k)
distances = np.linalg.norm(differences, axis=2)
# argmin on axis=1 picks nearest centroid per point → gives (n,) labels
return np.argmin(distances, axis=1)the np.newaxis insertion is the key. it reshapes X from (n,d) to (n,1,d), allowing numpy to broadcast against centroids (k,d) to produce a (n,k,d) result — every point's difference vector to every centroid, computed simultaneously in c-level operations.
the empty cluster problem is the most important edge case to handle. a centroid can end up in a region of space with no nearby points — every point is closer to some other centroid, so this one gets zero assignments. taking the mean of an empty array gives nan, which silently propagates through every subsequent computation and corrupts your results without raising an error.
update step — with empty cluster guard
def _update_centroids(self, X, labels):
new_centroids = np.zeros((self.k, X.shape[1]))
for i in range(self.k):
cluster_points = X[labels == i]
if len(cluster_points) == 0:
# empty cluster guard — without this we get NaN from empty mean
# reinitialize randomly rather than crashing silently
new_centroids[i] = X[np.random.randint(len(X))]
else:
# mean(axis=0) = column-wise mean = geometric center of cluster
new_centroids[i] = cluster_points.mean(axis=0)
return new_centroidstolerance-based convergence is the other one. checking for exact equality between old and new centroids fails because floating point arithmetic means they almost never reach exactly zero movement. using a tolerance threshold handles this correctly.
fit — the main training loop
def fit(self, X):
X = np.array(X, dtype=np.float64)
centroids = self._initialize_centroids(X)
for iteration in range(self.max_iters):
labels = self._assign_labels(X, centroids)
new_centroids = self._update_centroids(X, labels)
# measure how much centroids moved — our convergence signal
shift = np.linalg.norm(new_centroids - centroids)
centroids = new_centroids
self.n_iters_ = iteration + 1
if shift < self.tol:
break
self.centroids_ = centroids
self.labels_ = labels
self.inertia_ = self._compute_inertia(X, labels, centroids)
# return self enables chaining: model.fit(X).predict(X_new)
return selffor an autonomous pipeline, you can't have a human staring at an elbow plot. you need automatic k selection. inertia can't do this — it always decreases as k increases. if you told a system to minimize inertia, it would always pick the highest k you allow. setting k equal to n gives inertia of zero, which is perfectly tight and completely meaningless.
silhouette score is the right tool. for every point p, you compute two values: a(p) — the mean distance to every other point in its own cluster, measuring cohesion; and b(p) — the mean distance to every point in the nearest neighboring cluster, measuring separation. the score is (b - a) / max(a, b), ranging from -1 to +1. unlike inertia, silhouette score peaks at the k that best matches the natural cluster structure of the data, then drops off on either side.
silhouette score — implemented from scratch
def compute_a(X, labels, i):
own_cluster = labels[i]
cluster_points = X[labels == own_cluster]
if len(cluster_points) <= 1:
return 0.0
distances = np.linalg.norm(cluster_points - X[i], axis=1)
return distances.sum() / (len(cluster_points) - 1)
def compute_b(X, labels, i):
own_cluster = labels[i]
other_clusters = [c for c in np.unique(labels) if c != own_cluster]
mean_distances = []
for cluster in other_clusters:
neighbor_points = X[labels == cluster]
mean_dist = np.linalg.norm(neighbor_points - X[i], axis=1).mean()
mean_distances.append(mean_dist)
return min(mean_distances)
def silhouette_score_custom(X, labels):
X = np.array(X, dtype=np.float64)
n = len(X)
scores = np.zeros(n)
for i in range(n):
a = compute_a(X, labels, i)
b = compute_b(X, labels, i)
scores[i] = (b - a) / max(a, b)
return scores.mean(), scoreswith silhouette score, automatic k selection becomes a clean loop — run k-means for each candidate k, evaluate the result, keep the best.
automatic k selection — no human needed
def auto_select_k(X, k_min=2, k_max=15):
best_k = k_min
best_score = -1
best_model = None
all_scores = {}
for k in range(k_min, k_max + 1):
model = KMeans(k=k, init="kmeans++")
model.fit(X)
score, _ = silhouette_score_custom(X, model.labels_)
all_scores[k] = score
if score > best_score:
best_score = score
best_k = k
# save the whole fitted model — no need to refit later
best_model = model
return best_k, best_model, all_scoresat the scale of a real community discovery pipeline — potentially millions of data points across hundreds of features — two things break. the distance matrix computation becomes O(n × k × d × i). at n=1M, k=20, d=100, that's prohibitively expensive. mini-batch k-means addresses this by sampling a random subset each iteration. FAISS addresses it at the distance computation level, using approximate nearest neighbor indexing.
the curse of dimensionality is the second problem. in high-dimensional spaces, euclidean distance loses its geometric meaning. the ratio between the maximum and minimum pairwise distance approaches one, meaning all points appear roughly equidistant. k-means assignments become unreliable. the fix is dimensionality reduction before clustering — PCA projects the data onto the directions of maximum variance, typically reducing 100+ dimensions to 20-30 while preserving 90-95% of the variance. you're not throwing away information. you're throwing away noise, which often improves clustering quality rather than hurting it.
the most important step in the pipeline is interpretation. after fitting, each centroid represents the average feature profile of its cluster. looking at which features are highest in each centroid tells you what that cluster represents. in a community discovery context, this is where the algorithm connects to the product decision.
interpreting clusters — the step most people skip
# after fitting, understand what each cluster actually means
df['cluster'] = model.labels_
print(df.groupby('cluster').mean().round(3))
print(df['cluster'].value_counts().sort_index())
# a cluster with high posting_frequency, low churn, niche topic_specificity
# signals an engaged community with clear shared needs
# that's a viable product target
# a cluster with high member_count but high churn and broad vocabulary
# is large but shallow — probably not worth building forthe algorithm is 60 years old and fits in 50 lines of python. the engineering around making it production-ready — handling edge cases, vectorizing correctly, scaling to real data, selecting k automatically — is where the interesting work lives.
the other thing i took away is about the application itself. clustering has been used in data science for decades. the interesting question is what becomes possible when you apply it at scale, automatically, as the input to a decision-making pipeline rather than as an end in itself. when discovery is automated and the unit economics of serving niche markets change, a lot of software that wasn't worth building before suddenly becomes worth building.
that's the engineering problem i want to keep working on.