
Complement Naive Bayes (CNB) Algorithm

Last Updated : 04 Sep, 2025

Complement Naive Bayes (CNB) is a variant of the Naive Bayes algorithm that is specifically designed to improve classification performance on imbalanced datasets and text classification tasks. It modifies the way probabilities are estimated to reduce bias towards majority classes, making it more suitable than the standard Multinomial Naive Bayes in many cases.

Challenge of Imbalanced Datasets

An imbalanced dataset is one in which one class appears far more often than the others. This often happens in spam filtering (many more legitimate emails than spam) or medical diagnosis (many more healthy cases than disease cases).

Example:

If 95% of cases are "not fraud" and only 5% are "fraud," a model that always predicts "not fraud" will be 95% accurate but will miss all fraud cases. This shows why special methods are needed to deal with such uneven data.
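As a quick illustration of this accuracy trap, the sketch below builds a hypothetical 95/5 label split and uses scikit-learn's DummyClassifier as the always-majority baseline; the class counts are made up for demonstration:
Python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 950 "not fraud" (0) and 50 "fraud" (1)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features are irrelevant for a majority-class baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))    # 0.95
print("Fraud recall:", recall_score(y, y_pred))  # 0.0 -- every fraud case is missed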

How Complement Naive Bayes Works

  1. For each class, compute the complement frequency: the frequency of features in all other classes combined.
  2. Estimate the conditional probabilities using these complement frequencies.
  3. Normalize the values to ensure they form valid probability distributions.
  4. Classify a sample as the class whose complement-based score is lowest, i.e., the class that is the poorest match to its complement.

Formula

For a class c and feature f, CNB estimates a complement-based probability from the counts of all classes other than c:

P(f|c) = \frac{count(f, \bar{c}) + \alpha}{\sum_{f'} count(f', \bar{c}) + \alpha \cdot |V|}

  • count(f, \bar{c}) = count of feature f in the complement of class c
  • \alpha = smoothing parameter (Laplace smoothing)
  • |V| = vocabulary size
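A minimal sketch of this estimate in plain Python is shown below; the function name complement_prob and the example count dictionary are illustrative choices, not part of any library:
Python
def complement_prob(feature, complement_counts, alpha=1.0):
    """Smoothed estimate of P(f|c) from complement counts.

    complement_counts maps every vocabulary feature to its total count
    in all classes other than c; alpha is the Laplace smoothing term.
    """
    vocab_size = len(complement_counts)
    total = sum(complement_counts.values())
    return (complement_counts.get(feature, 0) + alpha) / (total + alpha * vocab_size)

# Example: complement counts for some class over a 3-word vocabulary
counts = {"round": 5, "red": 1, "soft": 3}
print(complement_prob("round", counts))  # (5+1)/(9+3) = 0.5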

Example

Suppose we are classifying sentences as Apples or Bananas using word frequencies. To classify a new sentence (Round=1, Red=1, Soft=1):

  • MNB would estimate probabilities for Apples using only Apples data
  • CNB estimates probabilities for Apples using Bananas' data (complement) and vice versa

Solving by CNB: we classify the new sentence with features {Round: 1, Red: 1, Soft: 1} over the vocabulary {Round, Red, Soft}.

Step 1: Complement counts

  • For Apples, use Bananas’ counts -> {Round:5, Red:1, Soft:3}
  • For Bananas, use Apples’ counts -> {Round:3, Red:4, Soft:1}

Step 2: Probabilities (using Laplace smoothing, α = 1)

For Apples:

  • Round = (5+1)/(9+3) = 6/12 = 0.5 (denominator: total complement count 9 plus α·|V| = 3)
  • Red = (1+1)/12 ≈ 0.167
  • Soft = (3+1)/12 ≈ 0.333

For Bananas:

  • Round = (3+1)/(8+3) = 4/11 ≈ 0.364 (denominator: total complement count 8 plus α·|V| = 3)
  • Red = (4+1)/11 ≈ 0.455
  • Soft = (1+1)/11 ≈ 0.182

Step 3: Scores. Multiply the complement probabilities for the observed features:

  • Apples (complement score) = 0.5 × 0.167 × 0.333 ≈ 0.0278
  • Bananas (complement score) = 0.364 × 0.455 × 0.182 ≈ 0.0301

The new sentence matches Apples' complement (Bananas' data) less well than it matches Bananas' complement, so Apples is the poorest complement match and CNB assigns the sentence to it.

Final Result -> Apples
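This hand calculation can be checked with a short, self-contained sketch; it hard-codes the complement counts from Step 1 and sums log-probabilities instead of multiplying, which is how the comparison is usually done in practice (note that it omits class priors and the weight normalization that scikit-learn's ComplementNB applies):
Python
import math

new_doc = {"Round": 1, "Red": 1, "Soft": 1}
vocab = ["Round", "Red", "Soft"]
alpha = 1

# Complement counts from Step 1 (each class uses the other class's counts)
complement = {
    "Apples":  {"Round": 5, "Red": 1, "Soft": 3},
    "Bananas": {"Round": 3, "Red": 4, "Soft": 1},
}

scores = {}
for cls, counts in complement.items():
    denom = sum(counts.values()) + alpha * len(vocab)
    scores[cls] = sum(new_doc[f] * math.log((counts[f] + alpha) / denom) for f in vocab)

print(scores)  # Apples ≈ -3.58, Bananas ≈ -3.50
# The poorest complement match (smallest score) wins
print("Prediction:", min(scores, key=scores.get))  # Apples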

Implementing CNB

We can implement CNB using scikit-learn on the wine dataset (for demonstration purposes).

1. Import libraries and load data

We first import the required libraries and load the dataset:

  • Import load_wine for dataset loading from sklearn.
  • Use train_test_split to divide data into training and test sets.
  • Import ComplementNB as the classifier.
  • Import evaluation metrics: classification_report and accuracy_score.
Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, accuracy_score

# Load the wine dataset
data = load_wine()
X, y = data.data, data.target

2. Split into training and test sets

We will split the dataset into training and test sets:

  • Split the dataset into 70% training and 30% testing data.
  • Set random_state=42 for reproducibility.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

3. Train the CNB classifier

We will train the Complement Naive Bayes classifier:

  • Create a ComplementNB instance.
  • Fit the classifier on the training data.
Python
cnb = ComplementNB()
cnb.fit(X_train, y_train)

4. Evaluate the model

We will now evaluate the trained model:

  • Predict class labels for the test set using predict().
  • Print the accuracy score and the classification report for detailed metrics.
Python
y_pred = cnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output: accuracy score and classification report for the test set

Note: CNB is better suited for discrete data like text. For continuous features (as in this dataset), Gaussian Naive Bayes might perform better.
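If we want to stay with CNB on this kind of continuous data, one reasonable (though not the only) option is to bin each feature into non-negative discrete codes first. The sketch below uses scikit-learn's KBinsDiscretizer in a pipeline; the choice of 10 uniform bins is arbitrary and purely for illustration:
Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bin each continuous feature into 10 ordinal bins so ComplementNB sees
# non-negative discrete values instead of raw measurements.
model = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
    ComplementNB(),
)
model.fit(X_train, y_train)
print("Accuracy with binned features:", accuracy_score(y_test, model.predict(X_test)))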

When to Use CNB

CNB is a good fit in the following scenarios:

  • Imbalanced class distributions: the complement approach ensures minority classes receive fairer parameter estimates.
  • Text classification: CNB handles discrete feature counts (e.g., word frequencies) very effectively.
  • Large feature spaces: CNB is computationally efficient and easy to interpret, even with many features.

Limitations of CNB

  • Feature independence assumption: Like all Naive Bayes variants, CNB assumes that features are conditionally independent given the class. This assumption is rarely true in real-world datasets and can reduce accuracy when violated.
  • Best suited for discrete features: CNB is primarily designed for tasks with discrete data, such as word counts in text classification. Continuous data typically requires preprocessing for optimal results.
  • Bias in balanced datasets: The complement-based parameter estimation can introduce unnecessary bias when classes are already balanced. This may reduce its advantage compared to standard Naive Bayes models.