Complement Naive Bayes (CNB) Algorithm
Complement Naive Bayes (CNB) is a variant of the Naive Bayes algorithm that is specifically designed to improve classification performance on imbalanced datasets and text classification tasks. It modifies the way probabilities are estimated to reduce bias towards majority classes, making it more suitable than the standard Multinomial Naive Bayes in many cases.
Challenge of Imbalanced Datasets
An imbalanced dataset is one in which one class appears far more often than the others. This is common in spam filtering (many more legitimate emails than spam) or medical diagnosis (many more healthy cases than disease cases).
Example:
If 95% of cases are "not fraud" and only 5% are "fraud," a model that always predicts "not fraud" will be 95% accurate but will miss all fraud cases. This shows why special methods are needed to deal with such uneven data.
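To make this concrete, here is a minimal sketch (using synthetic labels, not data from a real system) of a "model" that always predicts the majority class: accuracy looks excellent while every fraud case is missed.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
# Synthetic labels: 95% "not fraud" (0), 5% "fraud" (1)
y_true = np.array([0] * 95 + [1] * 5)
# A baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)
print("Accuracy:", accuracy_score(y_true, y_pred))      # 0.95 -- looks impressive
print("Fraud recall:", recall_score(y_true, y_pred))    # 0.0  -- misses every fraud case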
How Complement Naive Bayes Works
- For each class, compute the complement frequency: the frequency of features in all other classes combined.
- Estimate the conditional probabilities using these complement frequencies.
- Normalize the values to ensure they form valid probability distributions.
- Classify a sample by assigning it to the class with the smallest complement-based score, i.e., the class whose complement the sample matches least (see the sketch below).
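The following is a minimal from-scratch sketch of these steps on toy word-count data (the array values and variable names are illustrative assumptions, not part of any library API):
import numpy as np
# Toy term-count matrix: rows = documents, columns = vocabulary terms
X = np.array([[2, 0, 1],
              [3, 1, 0],
              [0, 4, 2],
              [1, 3, 3]])
y = np.array([0, 0, 1, 1])  # class label of each document
alpha = 1.0                 # Laplace smoothing parameter
classes = np.unique(y)
n_features = X.shape[1]
# Step 1: complement counts -- per-feature counts over all *other* classes
comp_counts = np.array([X[y != c].sum(axis=0) for c in classes])
# Steps 2-3: smoothed complement probabilities, normalized so each row sums to 1
comp_prob = (comp_counts + alpha) / (
    comp_counts.sum(axis=1, keepdims=True) + alpha * n_features)
# Step 4: score a new sample and choose the class with the *lowest*
# complement score (the class whose complement it matches least)
x_new = np.array([2, 1, 0])
scores = np.log(comp_prob) @ x_new
print("Predicted class:", classes[scores.argmin()])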
Formula
For a class c and feature f:
P(f|c) = \frac{count(f, \bar{c}) + \alpha}{\sum_{f'} count(f', \bar{c}) + \alpha \cdot |V|}
where:
- count(f, \bar{c}) = count of feature f in the complement of class c (all classes other than c)
- \alpha = smoothing parameter (Laplace smoothing)
- |V| = vocabulary size
Example
Suppose we are classifying sentences as Apples or Bananas using word frequencies, with training counts Apples: {Round: 3, Red: 4, Soft: 1} and Bananas: {Round: 5, Red: 1, Soft: 3}. To classify a new sentence (Round = 1, Red = 1, Soft = 1):
- MNB would estimate probabilities for Apples using only Apples data
- CNB estimates probabilities for Apples using Bananas' data (complement) and vice versa
Solving by CNB: We classify a new sentence with features {Round = 1, Red = 1, Soft = 1} and vocabulary {Round, Red, Soft}.
Step 1: Complement counts
- For Apples, use Bananas' counts -> {Round: 5, Red: 1, Soft: 3}
- For Bananas, use Apples' counts -> {Round: 3, Red: 4, Soft: 1}
Step 2: Probabilities (using Laplace smoothing, α = 1)
For Apples:
- Round = (5+1)/(5+1+3+3) = 6/12 = 0.5
- Red = (1+1)/12 ≈ 0.167
- Soft = (3+1)/12 ≈ 0.333
For Bananas:
- Round = (3+1)/(3+4+1+3) = 4/11 ≈ 0.364
- Red = (4+1)/11 ≈ 0.455
- Soft = (1+1)/11 ≈ 0.182
Step 3: Scores. Multiply the complement feature probabilities for each class:
- Apples = 0.5 × 0.167 × 0.333 ≈ 0.0278
- Bananas = 0.364 × 0.455 × 0.182 ≈ 0.0301
Because these scores are built from each class's complement, CNB assigns the sentence to the class with the lower score, i.e., the class whose complement it matches least.
Final Result -> Apples (0.0278 < 0.0301)
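The hand arithmetic above can be double-checked with a few lines of Python (the numbers are taken directly from Step 2 of the worked example):
# Complement probabilities from Step 2 (Round, Red, Soft)
apples = [6 / 12, 2 / 12, 4 / 12]   # built from Bananas' counts
bananas = [4 / 11, 5 / 11, 2 / 11]  # built from Apples' counts
score_apples = apples[0] * apples[1] * apples[2]      # ~0.0278
score_bananas = bananas[0] * bananas[1] * bananas[2]  # ~0.0301
# CNB picks the class with the smaller complement score
print("Apples" if score_apples < score_bananas else "Bananas")  # Apples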
Implementing CNB
We can implement CNB using scikit-learn on the wine dataset (for demonstration purposes).
1. Import libraries and load data
We will import the required libraries and load the dataset:
- Import load_wine from sklearn.datasets to load the dataset.
- Use train_test_split to divide data into training and test sets.
- Import ComplementNB as the classifier.
- Import evaluation metrics: classification_report and accuracy_score.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, accuracy_score
# Load the wine dataset
data = load_wine()
X, y = data.data, data.target
2. Split into training and test sets
We will split the dataset into training and test sets:
- Split the dataset into 70% training and 30% testing data.
- Set random_state=42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
3. Train the CNB classifier
We will train the Complement Naive Bayes classifier:
- Create a ComplementNB instance.
- Fit the classifier on the training data.
cnb = ComplementNB()
cnb.fit(X_train, y_train)
4. Evaluate the model
We will now evaluate the trained model:
- Predict class labels for the test set using predict().
- Print the accuracy score and the classification report for detailed metrics.
y_pred = cnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Note: CNB is better suited for discrete data like text. For continuous features (as in this dataset), Gaussian Naive Bayes might perform better.
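As a hedged follow-up to that note, the snippet below (a sketch reusing the same wine data and split as above) compares ComplementNB with GaussianNB side by side; the exact scores depend on the split.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.metrics import accuracy_score
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Fit both Naive Bayes variants on the same continuous features and compare accuracy
for model in (ComplementNB(), GaussianNB()):
    acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    print(type(model).__name__, round(acc, 3))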
When to Use CNB
| Scenario | Why CNB is Suitable |
|---|---|
| Imbalanced class distributions | The complement approach ensures minority classes receive fairer parameter estimates. |
| Text classification | CNB handles discrete feature counts (e.g., word frequencies) very effectively. |
| Large feature spaces | CNB is computationally efficient and easy to interpret, even with many features. |
Limitations of CNB
- Feature independence assumption: Like all Naive Bayes variants, CNB assumes that features are conditionally independent given the class. This assumption is rarely true in real-world datasets and can reduce accuracy when violated.
- Best suited for discrete features: CNB is primarily designed for tasks with discrete data, such as word counts in text classification. Continuous data typically requires preprocessing for optimal results.
- Bias in balanced datasets: The complement-based parameter estimation can introduce unnecessary bias when classes are already balanced. This may reduce its advantage compared to standard Naive Bayes models.