Statistics in Machine Learning

4 min read · Jun 8, 2024

Statistics play a crucial role in machine learning (ML), providing the foundation for understanding data, making inferences, and validating models. Here’s an in-depth look at how statistics intertwine with various aspects of ML:

1. Descriptive Statistics

Descriptive statistics summarize and describe the features of a dataset. This is often the first step in data analysis and includes:

Measures of Central Tendency: Mean, median, mode.
Measures of Variability: Range, variance, standard deviation.
Distribution Shape: Skewness and kurtosis.
Visualization: Histograms, box plots, scatter plots.

Example:

import numpy as np

data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# Central tendency
mean = np.mean(data)
median = np.median(data)

# Variability (NumPy defaults to the population formulas, ddof=0)
variance = np.var(data)
standard_deviation = np.std(data)

print(f"Mean: {mean}, Median: {median}, Variance: {variance}, Standard Deviation: {standard_deviation}")
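
The distribution-shape measures listed above can be computed in a similar way. The following is a minimal sketch, assuming SciPy is installed; it also contrasts the population variance (NumPy's default) with the sample variance, which divides by n - 1.

```python
import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# Population variance (ddof=0, NumPy's default) vs. sample variance (ddof=1)
population_variance = np.var(data)
sample_variance = np.var(data, ddof=1)

# Shape of the distribution
skewness = skew(data)             # positive values indicate a longer right tail
excess_kurtosis = kurtosis(data)  # Fisher definition: 0 for a normal distribution

print(f"Population variance: {population_variance}, Sample variance: {sample_variance}")
print(f"Skewness: {skewness}, Excess kurtosis: {excess_kurtosis}")
```

For small samples the two variance estimates can differ noticeably, so it is worth stating which one a report uses.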

2. Probability Theory

Probability forms the basis of making predictions in ML. Understanding the likelihood of events helps in creating models that can generalize well.

Probability Distributions: Normal distribution, binomial distribution, Poisson distribution.
Bayes’ Theorem: Foundation for many ML algorithms, particularly in classification problems (e.g., Naive Bayes classifier).
Random Variables: Discrete and continuous variables, probability density functions (PDFs), cumulative distribution functions (CDFs).

Example:

from scipy.stats import norm

# Probability of a data point within one standard deviation of the mean in a normal distribution
probability_within_1_std = norm.cdf(1) - norm.cdf(-1)
print(f"Probability within one standard deviation: {probability_within_1_std}")
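
Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), can be illustrated with a small worked calculation. This is a sketch with made-up numbers: the spam rate and word frequencies below are hypothetical values chosen only for illustration.

```python
# Hypothetical inputs (illustrative assumptions, not real measurements)
p_spam = 0.2              # prior: P(spam)
p_word_given_spam = 0.6   # likelihood: P(word | spam)
p_word_given_ham = 0.05   # likelihood: P(word | not spam)

# Total probability of seeing the word: P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | word): {p_spam_given_word:.2f}")
```

A Naive Bayes classifier applies the same update for many words at once, under the simplifying assumption that the words are conditionally independent given the class.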

3. Inferential Statistics

Inferential statistics allow us to make predictions or inferences about a population based on a sample.

Hypothesis Testing: Null hypothesis, alternative hypothesis, p-values, significance levels.
Confidence Intervals: Estimating population parameters.
t-tests and ANOVA: Comparing means between groups.

Example:

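from scipy.stats import ttest_ind

group1 = [2.1, 2.5, 2.8, 3.0, 3.2]
group2 = [3.1, 3.4, 3.6, 3.8, 4.0]

# Independent two-sample t-test (assumes equal variances by default;
# pass equal_var=False for Welch's t-test)
t_stat, p_value = ttest_ind(group1, group2)
print(f"t-statistic: {t_stat}, p-value: {p_value}")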
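
A confidence interval for a population mean can be built from the same kind of sample. The sketch below assumes the data are roughly normally distributed and uses the t distribution with n - 1 degrees of freedom; the 95% level is an arbitrary but common choice.

```python
import numpy as np
from scipy import stats

sample = np.array([2.1, 2.5, 2.8, 3.0, 3.2])
n = len(sample)

sample_mean = sample.mean()
standard_error = stats.sem(sample)       # sample std (ddof=1) divided by sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value

ci_low = sample_mean - t_crit * standard_error
ci_high = sample_mean + t_crit * standard_error

print(f"Mean: {sample_mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```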

4. Regression Analysis

Regression analysis involves modeling the relationship between dependent and independent variables. It is a key technique in both descriptive and predictive analytics.

Linear Regression: Simple linear regression, multiple linear regression.
Logistic Regression: Used for binary classification.
Assumptions of Linear Regression: Linearity, independence of errors, homoscedasticity (constant error variance), normality of errors.

Example:

import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2D feature matrix, hence the reshape
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 7])
model = LinearRegression().fit(X, y)
print(f"Intercept: {model.intercept_}, Coefficients: {model.coef_}")
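
Logistic regression, mentioned above for binary classification, follows the same fit-and-inspect pattern. This is a minimal sketch on a hypothetical toy dataset (hours studied versus pass/fail); the numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied and whether the exam was passed
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Predicted probability of the positive class for two new inputs
proba = clf.predict_proba([[1.0], [3.5]])[:, 1]
print(f"P(pass | 1.0 h): {proba[0]:.2f}, P(pass | 3.5 h): {proba[1]:.2f}")
```

Unlike linear regression, the model outputs a probability between 0 and 1, which is then thresholded (0.5 by default in `predict`) to obtain a class label.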

5. Evaluation Metrics

Evaluation metrics are crucial for assessing the performance of ML models. Different types of metrics are used depending on the problem (e.g., regression, classification).

Regression Metrics: Mean squared error (MSE), mean absolute error (MAE), R-squared.
Classification Metrics: Accuracy, precision, recall, F1 score, ROC curve, AUC.

Example:

from sklearn.metrics import mean_squared_error, r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE: {mse}, R-squared: {r2}")
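
The classification metrics listed above are available from the same module. The following sketch uses small, made-up label and score arrays purely for illustration.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Hypothetical labels, hard predictions, and predicted scores
labels_true = [0, 0, 1, 1, 1, 0, 1, 0]
labels_pred = [0, 1, 1, 1, 0, 0, 1, 0]
scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

accuracy = accuracy_score(labels_true, labels_pred)
precision = precision_score(labels_true, labels_pred)  # TP / (TP + FP)
recall = recall_score(labels_true, labels_pred)        # TP / (TP + FN)
f1 = f1_score(labels_true, labels_pred)                # harmonic mean of the two
auc = roc_auc_score(labels_true, scores)               # uses scores, not hard labels

print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1: {f1}, AUC: {auc}")
```

Note that AUC is computed from the continuous scores, while the other four metrics depend on the threshold used to turn scores into labels.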

6. Sampling Techniques

Proper sampling techniques ensure that the data used for training and testing are representative of the population.

Random Sampling: Each member of the population has an equal chance of being selected.
Stratified Sampling: Ensures that subgroups are proportionally represented.
Cross-Validation: Techniques like k-fold cross-validation to assess model performance.

Example:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Hold out roughly a third of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(f"X_train: {X_train}, X_test: {X_test}")
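
Stratified sampling and k-fold cross-validation can be combined in a few lines. This is a sketch on synthetic data: the two Gaussian clusters and the choice of logistic regression as the model are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy data: two clusters of 20 points each
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Each of the 5 folds keeps the 50/50 class balance of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print(f"Fold accuracies: {scores}, Mean accuracy: {scores.mean():.3f}")
```

Averaging over folds gives a less noisy performance estimate than a single train/test split, at the cost of fitting the model k times.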

7. Bayesian Statistics

Bayesian statistics provide a probabilistic approach to inference, incorporating prior knowledge with evidence from data.

Bayesian Inference: Updating the probability of a hypothesis as more evidence becomes available.
Markov Chain Monte Carlo (MCMC): A method for sampling from probability distributions.

Example:

import numpy as np
import pymc3 as pm  # legacy package name; newer PyMC releases are imported with `import pymc as pm`

# Observed data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([2, 3, 4, 5, 6])

with pm.Model() as model:
    # Define priors
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Define likelihood
    mu = alpha + pm.math.dot(X, beta)
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y)

    # Inference: draw 1000 posterior samples per chain with MCMC
    trace = pm.sample(1000, cores=2)

print(pm.summary(trace))
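
For simple models the Bayesian update can be done in closed form, without MCMC. The sketch below assumes a coin-flip setting with a Beta prior, which is conjugate to the binomial likelihood; the flip counts are hypothetical.

```python
from scipy.stats import beta

# Prior belief about the probability of heads: Beta(1, 1) is uniform on [0, 1]
prior_a, prior_b = 1, 1

# Hypothetical evidence: 7 heads and 3 tails
heads, tails = 7, 3

# Conjugate update: add the observed counts to the prior parameters
post_a = prior_a + heads
post_b = prior_b + tails
posterior = beta(post_a, post_b)

posterior_mean = posterior.mean()
lower, upper = posterior.ppf([0.025, 0.975])  # 95% equal-tailed credible interval

print(f"Posterior mean: {posterior_mean:.3f}, 95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Observing more flips would narrow the credible interval, which is the "updating as more evidence becomes available" behavior described above.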

8. Dimensionality Reduction

Dimensionality reduction techniques are used to reduce the number of features in a dataset, making models simpler and faster.

Principal Component Analysis (PCA): Transforming data to a lower-dimensional space.
t-SNE: Non-linear dimensionality reduction for visualization.

Example:

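import numpy as np
from sklearn.decomposition import PCA

# 5 samples with 3 features each, projected onto the 2 directions of highest variance
X = np.array([[2, 3, 4], [1, 5, 7], [2, 8, 6], [3, 7, 8], [4, 6, 9]])
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced data: {X_reduced}")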
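
When choosing how many components to keep, it helps to check how much of the original variance each one retains. A minimal sketch on the same small matrix, using scikit-learn's `explained_variance_ratio_` attribute:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2, 3, 4], [1, 5, 7], [2, 8, 6], [3, 7, 8], [4, 6, 9]])
pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each principal component
ratios = pca.explained_variance_ratio_
retained = ratios.sum()

print(f"Explained variance ratios: {ratios}, Total retained: {retained:.3f}")
```

Components are ordered by decreasing variance, so a common heuristic is to keep the smallest number whose cumulative ratio passes a chosen threshold.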

9. Resampling Techniques

Resampling methods repeatedly draw new samples from the observed data to estimate how much a statistic or a model's performance would vary from sample to sample.

Bootstrap: Random sampling with replacement to estimate the sampling distribution.
Jackknife: Systematic leave-one-out method to estimate the bias and variance.

Example:

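import numpy as np
from sklearn.utils import resample

data = np.array([1, 2, 3, 4, 5])

# Draw 5 values with replacement; random_state makes the draw reproducible
bootstrapped_samples = resample(data, n_samples=5, replace=True, random_state=42)
print(f"Bootstrapped samples: {bootstrapped_samples}")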
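
A single bootstrap draw is rarely useful on its own; the value comes from repeating it many times. The sketch below, using plain NumPy, estimates a percentile confidence interval for the mean with the bootstrap and a standard error with the jackknife. The number of replicates (2000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
n = len(data)

# Bootstrap: resample with replacement many times and record the mean each time
n_boot = 2000
boot_means = np.array([
    rng.choice(data, size=n, replace=True).mean() for _ in range(n_boot)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Jackknife: recompute the mean with each observation left out in turn
jack_means = np.array([np.delete(data, i).mean() for i in range(n)])
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))

print(f"Bootstrap 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Jackknife standard error of the mean: {jack_se:.3f}")
```

For the sample mean, the jackknife standard error coincides with the familiar formula s / sqrt(n), which makes it a convenient sanity check.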

Understanding statistics is fundamental for working effectively with ML. It provides the tools to describe and infer properties of data, evaluate models, and make sound decisions based on data analysis. Mastery of both basic and advanced statistical methods allows data scientists and ML practitioners to build more robust, accurate, and interpretable models.


Written by Chanchala Gorale
