Label Encoding in Python
Label encoding is a fundamental data preprocessing technique used to convert categorical data into a numerical format suitable for machine learning models. Many algorithms cannot process non-numeric values, making encoding a necessary step when working with features such as colors, cities or product types.
For example
In a dataset with a Fruit column containing "Apple," "Banana," and "Orange," label encoding can map :
Apple -> 0
Banana -> 1
Orange -> 2
Understanding Label Encoding
Categorical data is broadly divided into two types:
- Nominal Data: Categories without inherent order (e.g., colors: red, blue, green).
- Ordinal Data: Categories with a natural order (e.g., satisfaction levels: low, medium, high).
Label encoding works best for ordinal data, where the assigned numbers reflect the order. However, applying it to nominal data can unwantedly suggest an order (e.g., Red = 0, Blue = 1, Green = 2), which may mislead algorithms like linear regression. Thus, the choice of encoding must align with the data type and the algorithm used.
When to Use Label Encoding
Label encoding is particularly valuable when:
- Encoding ordinal features: Numbers can capture the inherent order of categories.
- Using tree-based algorithms: Models like decision trees or random forests are insensitive to numerical order assumptions.
- Memory efficiency is critical: Each category is stored as a single integer, unlike one-hot encoding which expands data into multiple columns.
For nominal data and algorithms sensitive to numerical values, one-hot encoding is often a better alternative.
Implementing Label Encoding in Python
Python provides two primary ways to perform label encoding: scikit-learn's LabelEncoder and pandas’ Categorical type.
1. Using scikit-learn’s LabelEncoder
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Sample dataset
data = pd.DataFrame({
'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Orange', 'Banana'],
'Price': [1.2, 0.5, 0.8, 1.3, 0.9, 0.6]
})
# Initialize and fit LabelEncoder
le = LabelEncoder()
data['Fruit_Encoded'] = le.fit_transform(data['Fruit'])
print(data)
print("Category Mapping:", le.classes_)
Output:

The fit_transform method both learns the unique categories and applies the encoding, while the classes_ attribute stores the mapping for future reference.
2. Using pandas’ Categorical
Type
# Encode using pandas categorical type
data['Fruit_Encoded_Pandas'] = data['Fruit'].astype('category').cat.codes
print(data)
print("Category Mapping:", dict(enumerate(data['Fruit'].astype('category').cat.categories)))
Output:

This approach is simpler for pandas-based workflows and does not require an external library.
Encoding Ordinal Data
When dealing with ordinal data, a custom mapping ensures the numeric values preserve order:
data = pd.DataFrame({
'Satisfaction': ['Low', 'High', 'Medium', 'Low', 'High'],
'Score': [3, 8, 5, 2, 9]
})
satisfaction_order = {'Low': 0, 'Medium': 1, 'High': 2}
data['Satisfaction_Encoded'] = data['Satisfaction'].map(satisfaction_order)
print(data)
Output:

This approach is ideal for features where the order carries semantic meaning.
Performance and Limitations
Label encoding is computationally efficient. Both LabelEncoder and pandas' Categorical require a single scan of the data (O(n)) to map categories. Memory usage is minimal as only integer codes and the category map are stored.
Limitations
- Nominal data misinterpretation: Encoded integers can imply false order; one-hot encoding is safer for nominal features.
- Missing values: These must be handled prior to encoding.
- Unseen categories in test data: Encoders will fail if new categories appear; handle this with a default value or ensure training includes all possible categories.
- High cardinality: Features with many unique categories may still require additional feature engineering.
Best Practices
- Apply label encoding primarily to ordinal features or tree-based models.
- Handle missing values before encoding.
- Save the encoder or category mapping to enable inverse transformation during evaluation or deployment.
- For nominal features in algorithms sensitive to numerical relationships, use one-hot encoding instead.