K-Mode Clustering in Python
K-mode clustering is an unsupervised machine-learning used to group categorical data into k clusters (groups). The K-Modes clustering partitions the data into two mutually exclusive groups. Unlike K-Means which uses distances between numbers K-Modes uses the number of mismatches between categorical values to decide how similar two data points are. For example:
- Data point 1: ["red", "small", "round"]
- Data point 2: ["blue", "small", "square"]
Here there are 2 mismatches (color and shape) so these two are not very similar.
When Should You Use K-Modes?
Use K-Modes when:
- Your dataset contains categorical variables like gender, color, brand etc.
- You want to group customers by product preferences
- You're analyzing survey responses Yes/No, Male/Female etc.
How K-Modes clustering works?
Unlike hierarchical clustering KModes requires us to decide the number of clusters (K) in advance. Here's how it works step by step:
- Start by picking clusters: Randomly select K data points from the dataset to act as the starting clusters these are called "modes".
- Assign data to clusters: Check how similar each data point is to these clusters using the total number of mismatches and assign each data point to the cluster it matches the most.
- Update the clusters: Find the most common value for each cluster and update the cluster centers based on this.
- Repeat the process: Keep repeating steps 2 and 3 until no data points are reassigned to different clusters.
Let X be a set of categorical data objects of
Apply dissimilarity metric equation for data objects
Suppose we want to K cluster Then we have Q =
Overall the goal of K-modes clustering is to minimize the dissimilarities between the data objects and the centroids (modes) of the clusters using a measure of categorical similarity such as the Hamming distance.
Implementation of the k-mode clustering algorithm
K-Modes is a way to group categorical data into clusters. Here's how you can do it step-by-step in Python using just NumPy and Pandas.
Step 1: Prepare Your Data
Start by defining your dataset. Each row is a data point and each column contains categorical values like letters or labels.
import numpy as np
import pandas as pd
data = np.array([
['A', 'B', 'C'],
['B', 'C', 'A'],
['C', 'A', 'B'],
['A', 'C', 'B'],
['A', 'A', 'B']
])
Step 2: Set Number of Clusters
Decide how many groups you want to divide your data into.
k = 2
Step 3: Pick Starting Points (Modes)
Randomly choose k
rows from the data to be the starting cluster centers.
np.random.seed(0)
modes = data[np.random.choice(data.shape[0], k, replace=False)]
Step 4: Assign Data to Clusters
For each data point, count how many features are different from each mode. Assign the point to the most similar cluster.
clusters = np.zeros(data.shape[0], dtype=int)
for _ in range(10):
for i, point in enumerate(data):
distances = [np.sum(point != mode) for mode in modes]
clusters[i] = np.argmin(distances)
Step 5: Update Cluster Modes
After assigning all points update each clusterâs mode to the most common values in that cluster.
for j in range(k):
if np.any(clusters == j):
modes[j] = pd.DataFrame(data[clusters == j]).mode().iloc[0].values
Step 6: View Final Results
Print out which cluster each data point belongs to and what the final cluster centers (modes) are.
print("Cluster assignments:", clusters)
print("Cluster modes:", modes)
Output:

The output shows that the first data point belongs to cluster 1 and the rest belong to cluster 0. Each cluster has a common pattern: cluster 0 has mode values ['A', 'A', 'B'] and cluster 1 has ['A', 'B', 'C']. These modes represent the most frequent values in each cluster and are used to group similar rows together.
Cluster with kmodes Library
pip install kmodes
Optimal number of clusters in the K-Mode algorithm
Elbow method is used to find the optimal number of clusters
import pandas as pd
import numpy as np
# !pip install kmodes
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt
%matplotlib inline
cost = []
K = range(1,5)
for k in list(K):
kmode = KModes(n_clusters=k, init = "random", n_init = 5, verbose=1)
kmode.fit_predict(data)
cost.append(kmode.cost_)
plt.plot(K, cost, 'x-')
plt.xlabel('No. of clusters')
plt.ylabel('Cost')
plt.title('Elbow Curve')
plt.show()
Outputs:

As we can see from the graph there is an elbow-like shape at 2.0 and 3.0 Now it we can consider either 2.0 or 3.0 cluster. Let's consider Number of cluster =2.0
kmode = KModes(n_clusters=2, init = "random", n_init = 5, verbose=1)
clusters = kmode.fit_predict(data)
clusters
Outputs :
array([1, 0, 1, 1, 1], dtype=uint16)
This also shows that the first, third, fourth and fifth data points have been assigned to the first cluster and the second data points have been assigned to the second cluster. So our previous answer was 100 % correct. To find the best number of groups we use the Elbow Method which helps us see when adding more groups doesn't make a big difference. K-Modes is an easy and effective way to group similar data when working with categories