Support Vector Machines in Machine Learning
Introduction to Support Vector Machines (SVM)
Support Vector Machines (SVMs) are powerful supervised machine learning models used primarily for classification tasks but also applicable to regression. SVMs are effective in high-dimensional spaces and are versatile thanks to the variety of kernel functions they can use to transform data. The fundamental goal of an SVM is to find the hyperplane that best separates the data into different classes.
Key Concepts and Terminology
1. Hyperplane
A hyperplane in an n-dimensional space is a flat affine subspace of dimension n-1. In the context of SVM, a hyperplane is used to separate data points of different classes: in a 2D space it is a line, in a 3D space it is a plane, and in higher dimensions it is simply referred to as a hyperplane.
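For a linear SVM, the separating hyperplane can be written as w · x + b = 0, where w is the weight (normal) vector and b is the bias; a new point is classified according to the sign of w · x + b.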
2. Support Vectors
Support vectors are the data points closest to the hyperplane and are critical in defining its position and orientation. They lie on the boundary of the margin; moving or removing a support vector changes the hyperplane, while points farther from the boundary have no influence on it.
3. Margin
The margin is the distance between the hyperplane and the nearest data points from each class. SVM aims to maximize this margin to ensure that the hyperplane not only separates the classes but also does so with the greatest possible distance from any data point.
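With the hyperplane written as w · x + b = 0 and the data scaled so that the closest points of each class satisfy |w · x + b| = 1, the margin width is 2 / ||w||. Maximizing the margin is therefore equivalent to minimizing ||w||, which leads directly to the optimization problems below.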
Mathematical Formulation of SVM
1. Hard Margin SVM
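When the two classes are perfectly linearly separable, the hard margin SVM finds the maximum-margin hyperplane by solving: minimize (1/2)||w||² subject to yᵢ(w · xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), where yᵢ ∈ {-1, +1}. Because it allows no margin violations at all, the hard margin formulation only applies to separable data and is sensitive to outliers.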
2. Soft Margin SVM
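Real-world data are rarely perfectly separable, so the soft margin formulation introduces slack variables ξᵢ ≥ 0 that let some points violate the margin: minimize (1/2)||w||² + C Σ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ. The hyperparameter C controls the trade-off between a wide margin and few margin violations; it is the same C parameter passed to SVC in the code later in this article.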
3. Cost Function and Hinge Loss
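Equivalently, the soft margin problem can be written as the unconstrained minimization of a regularized hinge loss: (1/2)||w||² + C Σ max(0, 1 - yᵢ(w · xᵢ + b)). The hinge loss max(0, 1 - y f(x)) is zero for points that are correctly classified outside the margin and grows linearly for points inside the margin or on the wrong side of the hyperplane.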
Kernel Trick
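When the classes cannot be separated by a straight hyperplane in the original feature space, the kernel trick lets an SVM learn a non-linear boundary without explicitly computing a high-dimensional mapping. A kernel function K(xᵢ, xⱼ) returns the inner product of the mapped points directly; common choices are the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²).
The snippet below is a minimal illustrative sketch, not part of the spam example: it uses scikit-learn's make_moons toy dataset (an arbitrary choice) to show how an RBF kernel handles data that a linear kernel cannot separate well.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Two interleaving half-circles: not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# A linear kernel struggles on this data...
linear_clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)
# ...while the RBF kernel implicitly maps the points to a space
# where a separating hyperplane exists
rbf_clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)
print('Linear kernel accuracy:', linear_clf.score(X_test, y_test))
print('RBF kernel accuracy:', rbf_clf.score(X_test, y_test))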
Use Case Example: Email Spam Detection
One common application of SVM is in email spam detection. Here, the task is to classify emails as either spam or not spam based on their content and other features.
- Data Collection: Gather a labeled dataset of emails with features extracted from the email text, metadata, etc.
- Feature Extraction: Transform the raw email data into numerical features using techniques like TF-IDF for text data.
- Training the SVM Model: Use the labeled dataset to train an SVM model, selecting an appropriate kernel function (e.g., linear or RBF).
- Model Evaluation: Evaluate the model using metrics such as accuracy, precision, recall, and F1-score on a validation dataset.
- Deployment: Deploy the trained model to classify incoming emails as spam or not spam in real time (a brief persistence sketch follows the training code below).
GitHub repo: https://github.com/hypothesistribetechnology/spam-email-detection/blob/main/model-on-kaggle-dataset.ipynb
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
# Sample dataset (Replace this with your actual dataset)
# You can also use the Kaggle dataset below for this project
# https://www.kaggle.com/datasets/venky73/spam-mails-dataset
data = {
'text': [
'Congratulations, you have won a lottery! Claim your prize now.',
'Hi John, can we reschedule our meeting?',
'Get cheap meds online without prescription!',
'Reminder: Your appointment is tomorrow at 10 AM.',
'Limited time offer! Buy one get one free.'
],
'spam': [1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)
# Text Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['spam']
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the SVM model
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
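To sketch the deployment step listed earlier, the fitted vectorizer and model can be persisted and reloaded to score new emails. This is a minimal illustration assuming the variables from the snippet above; the file names and the sample email are placeholders, not part of the original project.
import joblib
# Persist the fitted TF-IDF vectorizer and the trained SVM
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')
joblib.dump(model, 'spam_svm.joblib')
# At serving time, load the artifacts and classify incoming text
loaded_vectorizer = joblib.load('tfidf_vectorizer.joblib')
loaded_model = joblib.load('spam_svm.joblib')
new_email = ['You have been selected for a free cruise, reply now!']
features = loaded_vectorizer.transform(new_email)
print('Spam' if loaded_model.predict(features)[0] == 1 else 'Not spam')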