Feature Encoding in Machine Learning

Chanchala Gorale
3 min read · Jun 12, 2024

Introduction to Feature Encoding

In machine learning, feature encoding is the process of converting categorical data into a numerical format that algorithms can consume. Most algorithms require numerical input, yet many real-world datasets contain categorical features, so encoding them properly is crucial to getting the full predictive potential out of a model.

Why Feature Encoding is Important

Most machine learning algorithms cannot process categorical data directly, because categories carry no inherent numerical relationship. Feature encoding transforms these variables into numerical values while preserving the information they contain, letting models learn from categorical data effectively.

Types of Feature Encoding

  1. Label Encoding
  2. One-Hot Encoding
  3. Ordinal Encoding
  4. Binary Encoding
  5. Target Encoding
  6. Frequency Encoding
  7. Hashing Encoding

Label Encoding

Label encoding converts categorical values into numerical values by assigning a unique integer to each category. This method is simple and efficient but can introduce ordinal relationships where there may be none.

Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'color': ['red', 'blue', 'green', 'blue', 'green', 'red']}
df = pd.DataFrame(data)

# Assign a unique integer to each category (alphabetical order: blue=0, green=1, red=2)
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])

print(df)

Output:

   color  color_encoded
0    red              2
1   blue              0
2  green              1
3   blue              0
4  green              1
5    red              2
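
Because the mapping is stored on the encoder, it can also be reversed. A small follow-up to the example above, using scikit-learn's standard inverse_transform:

# Recover the original strings from the encoded integers
print(label_encoder.inverse_transform(df['color_encoded']))
# ['red' 'blue' 'green' 'blue' 'green' 'red']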

One-Hot Encoding

One-hot encoding transforms categorical variables into a binary matrix. Each category is represented by a binary column, and only the column corresponding to the specific category is set to 1, while others are set to 0.

Example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'color': ['red', 'blue', 'green']}
df = pd.DataFrame(data)

# sparse_output=False returns a dense array (use sparse=False on scikit-learn < 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_data = one_hot_encoder.fit_transform(df[['color']])
encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out(['color']))

print(encoded_df)

Output:

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
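
For quick experiments, pandas can produce the same binary matrix in one line with get_dummies. For linear models, dropping one column avoids perfectly correlated features (the "dummy variable trap"); OneHotEncoder supports this via drop='first'. A minimal sketch:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green']})

# pandas equivalent of one-hot encoding
print(pd.get_dummies(df['color'], prefix='color'))

# Drop the first category so the remaining columns are linearly independent
encoder = OneHotEncoder(sparse_output=False, drop='first')
print(encoder.fit_transform(df[['color']]))  # two columns instead of three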

Ordinal Encoding

Ordinal encoding assigns integer values to categorical variables, maintaining the order of the categories. This method is suitable for ordinal data where the order of categories is important.

Example:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = {'size': ['small', 'medium', 'large']}
df = pd.DataFrame(data)

# Pass the categories explicitly so the integers respect the small < medium < large order
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']])

print(df)

Output:

     size  size_encoded
0   small           0.0
1  medium           1.0
2   large           2.0
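
One practical wrinkle: at prediction time the data may contain a size the encoder never saw. Recent scikit-learn versions let OrdinalEncoder map unseen categories to a sentinel value instead of raising an error; a minimal sketch:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'size': ['small', 'medium', 'large']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']],
                         handle_unknown='use_encoded_value',
                         unknown_value=-1)  # unseen categories become -1
encoder.fit(train[['size']])

test = pd.DataFrame({'size': ['medium', 'extra-large']})
print(encoder.transform(test[['size']]))  # [[ 1.] [-1.]]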

Binary Encoding

Binary encoding combines label encoding and one-hot encoding. Each category is first converted to an integer label, then these integers are converted to binary code, and finally, the binary digits are split into separate columns.

Example:

import pandas as pd
from category_encoders import BinaryEncoder

data = {'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Each city gets an integer label (starting at 1), which is then split into binary digits
binary_encoder = BinaryEncoder()
df_encoded = binary_encoder.fit_transform(df['city'])
print(df_encoded)

Output:

   city_0  city_1  city_2
0       0       0       1
1       0       1       0
2       0       1       1
3       1       0       0
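
To see what the encoder does under the hood, the sketch below reproduces the two steps by hand: category_encoders assigns integer labels starting at 1, and four cities fit in three binary digits.

# Manual walk-through of binary encoding (illustration only)
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston']
for label, city in enumerate(cities, start=1):
    print(f'{city:12} -> {label} -> {format(label, "03b")}')
# New York     -> 1 -> 001
# Los Angeles  -> 2 -> 010
# Chicago      -> 3 -> 011
# Houston      -> 4 -> 100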

Target Encoding

Target encoding replaces a categorical variable with the mean of the target variable. This method can be beneficial for high cardinality categorical variables but can lead to overfitting if not used carefully.

Example:

import pandas as pd
from category_encoders import TargetEncoder

data = {'category': ['A', 'B', 'A', 'B', 'C'], 'target': [1, 2, 1, 2, 3]}
df = pd.DataFrame(data)

# Replaces each category with (a smoothed version of) its mean target value
target_encoder = TargetEncoder()
df['category_encoded'] = target_encoder.fit_transform(df['category'], df['target'])
print(df)

Output (shown here as the raw per-category means; note that TargetEncoder applies smoothing by default, which pulls the actual values toward the global target mean):

  category  target  category_encoded
0        A       1               1.0
1        B       2               2.0
2        A       1               1.0
3        B       2               2.0
4        C       3               3.0
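
A common guard against the overfitting mentioned above is out-of-fold target encoding: each row's value is computed from the other folds, so no row ever sees its own target. A minimal sketch using pandas and scikit-learn's KFold (one of several possible approaches):

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'],
                   'target':   [1, 2, 1, 2, 3, 0]})

global_mean = df['target'].mean()  # fallback for categories unseen in a fold
df['category_oof'] = global_mean

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # Means are computed on the training fold only, never on the row being encoded
    fold_means = df.iloc[train_idx].groupby('category')['target'].mean()
    df.loc[df.index[val_idx], 'category_oof'] = (
        df.iloc[val_idx]['category'].map(fold_means).fillna(global_mean).values)

print(df)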

Frequency Encoding

Frequency encoding replaces each category with the frequency of its occurrence. This method is simple and useful for high cardinality features.

Example:

import pandas as pd

data = {'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']}
df = pd.DataFrame(data)

# Map each city to its relative frequency in the column
df['city_encoded'] = df['city'].map(df['city'].value_counts(normalize=True))
print(df)

Output:

          city  city_encoded
0     New York           0.4
1  Los Angeles           0.2
2      Chicago           0.2
3      Houston           0.2
4     New York           0.4
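
If raw counts are preferred over relative frequencies, simply drop the normalize flag (continuing the example above):

# Raw occurrence counts instead of relative frequencies
df['city_count'] = df['city'].map(df['city'].value_counts())
print(df[['city', 'city_count']])  # New York -> 2, all others -> 1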

Hashing Encoding

Hashing encoding uses a hash function to convert categories into numerical values. This method is efficient for high cardinality features but introduces some risk of collisions (two categories hashing to the same value).

Example:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

data = {'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']}
df = pd.DataFrame(data)

# Hash the city names into 8 buckets; binary=True and norm=None keep the output as 0/1 flags.
# Note: HashingVectorizer tokenizes on words, so 'New York' can activate two features.
hashing_encoder = HashingVectorizer(n_features=8, alternate_sign=False, norm=None, binary=True)
encoded_data = hashing_encoder.fit_transform(df['city']).toarray()
encoded_df = pd.DataFrame(encoded_data, columns=[f'feature_{i}' for i in range(encoded_data.shape[1])])
print(encoded_df)

Output (exact feature positions depend on the hash function):

   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7
0        0.0        0.0        1.0        0.0        0.0        0.0        0.0        0.0
1        0.0        0.0        1.0        0.0        1.0        0.0        0.0        0.0
2        0.0        0.0        0.0        0.0        0.0        0.0        1.0        0.0
3        0.0        0.0        0.0        0.0        0.0        0.0        0.0        1.0
4        0.0        0.0        1.0        0.0        0.0        0.0        0.0        0.0
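
Because HashingVectorizer tokenizes text (splitting "New York" into two words), a hasher built for categorical columns can be a cleaner fit. category_encoders provides one; a minimal sketch, with the caveat that the resulting column contents depend on the hash function:

import pandas as pd
from category_encoders import HashingEncoder

df = pd.DataFrame({'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']})

# Hash each whole category value into 8 buckets; identical cities always share a bucket
hashing_encoder = HashingEncoder(n_components=8)
print(hashing_encoder.fit_transform(df[['city']]))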

Conclusion

Feature encoding is a crucial step in the machine learning pipeline, transforming categorical data into numerical formats suitable for model training. The choice of encoding technique depends on the nature of the data and the requirements of the model. Understanding and correctly applying these methods can significantly improve both the performance and the interpretability of your machine learning models.
