Linear regression, convergence, cost function, and gradient descent in ML
Linear regression, convergence, cost function, and gradient descent are foundational concepts in machine learning, particularly in supervised learning for regression tasks. Here’s an in-depth explanation of each term and how they interrelate:
1. Linear Regression
Definition: Linear regression is a statistical method used to model the relationship between a dependent variable (often called the target or outcome) and one or more independent variables (often called features or predictors). The goal is to find a linear equation that best predicts the dependent variable based on the independent variables.
Mathematical Form: In simple linear regression with one independent variable $x$, the model can be expressed as: $y = \beta_0 + \beta_1 x + \epsilon$, where:
- $y$ is the dependent variable.
- $\beta_0$ is the y-intercept.
- $\beta_1$ is the slope of the line.
- $x$ is the independent variable.
- $\epsilon$ is the error term (the difference between the predicted and actual values).
In multiple linear regression with multiple independent variables $x_1, x_2, \ldots, x_p$, the model is: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$
Use Case: Linear regression is used in various domains where the relationship between variables needs to be quantified. For example:
- Predicting house prices based on features like size, location, and number of rooms.
- Estimating the effect of advertising expenditure on sales.
- Modeling the relationship between body weight and height.
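To make this concrete, here is a minimal sketch of fitting a simple linear regression with NumPy's closed-form least-squares solver. The data values and variable names are illustrative only:

```python
import numpy as np

# Toy data: house size (square meters) vs. price in thousands (illustrative values only)
x = np.array([50, 70, 90, 110, 130], dtype=float)
y = np.array([150, 200, 240, 290, 330], dtype=float)

# Design matrix with a column of ones so the model learns the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimizes ||X @ beta - y||^2 in closed form
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept = {beta[0]:.2f}, slope = {beta[1]:.2f}")

# Predict the price of a 100 m^2 house
print("predicted price:", beta[0] + beta[1] * 100)
```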
2. Convergence
Definition: Convergence in machine learning refers to the process of iteratively adjusting model parameters until the model reaches a state where further updates produce minimal or no reduction in the cost function. Essentially, it means the algorithm has settled on parameters that (at least approximately) minimize the error.
Use Case: Convergence is crucial for training models using iterative algorithms like gradient descent. Ensuring convergence means that the model parameters have reached values that provide the best possible predictions according to the chosen cost function.
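As a toy illustration of what a convergence check can look like in practice (this is a generic one-parameter example, not linear regression; the function and tolerance are chosen arbitrarily):

```python
# Minimize f(w) = (w - 3)^2 with gradient steps and stop once the
# improvement in the cost falls below a small tolerance.
def f(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, learning_rate, tol = 0.0, 0.1, 1e-10
prev_cost = f(w)
for step in range(10_000):            # hard cap on iterations as a safety net
    w -= learning_rate * grad(w)      # one gradient descent step
    cost = f(w)
    if abs(prev_cost - cost) < tol:   # negligible improvement -> converged
        break
    prev_cost = cost

print(f"converged after {step + 1} steps, w = {w:.4f}")
```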
3. Cost Function
Definition: A cost function, also known as a loss function or objective function, measures the error between the predicted values and the actual values. In linear regression, the most commonly used cost function is the Mean Squared Error (MSE), which is defined as: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where:
- $n$ is the number of observations.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.
Use Case: The cost function is used to evaluate how well the model is performing. During training, the goal is to minimize the cost function, which indicates better predictions.
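In code, MSE is a one-liner; the sketch below uses made-up numbers purely to show the calculation:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 0.5, -0.5, and -1.0, so MSE = (0.25 + 0.25 + 1.0) / 3 = 0.5
print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))
```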
4. Gradient Descent
Definition: Gradient descent is an optimization algorithm used to minimize the cost function by iteratively adjusting the model parameters. The basic idea is to take steps proportional to the negative of the gradient (or approximate gradient) of the cost function with respect to the parameters.
Mathematical Form: For linear regression, the parameter update rule for gradient descent is: $\beta_j := \beta_j - \alpha \frac{\partial J}{\partial \beta_j}$, where:
- $\beta_j$ is the parameter being updated.
- $\alpha$ is the learning rate (a hyperparameter that determines the step size).
- $J$ is the cost function (e.g., MSE).
The partial derivative $\frac{\partial J}{\partial \beta_j}$ represents the gradient of the cost function with respect to the parameter $\beta_j$.
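Here is a small sketch of these updates for simple linear regression with the MSE cost (toy data reused from the earlier example; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([50, 70, 90, 110, 130], dtype=float)
y = np.array([150, 200, 240, 290, 330], dtype=float)

# Standardizing the feature keeps the gradients well-scaled for a fixed learning rate
x_scaled = (x - x.mean()) / x.std()

b0, b1 = 0.0, 0.0      # initial guesses for beta_0 and beta_1
alpha = 0.1            # learning rate
n = len(x_scaled)

for _ in range(1000):
    y_hat = b0 + b1 * x_scaled
    error = y_hat - y
    # Partial derivatives of MSE with respect to b0 and b1
    grad_b0 = (2.0 / n) * error.sum()
    grad_b1 = (2.0 / n) * (error * x_scaled).sum()
    # Move each parameter in the direction opposite to its gradient
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```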
Use Case: Gradient descent is widely used for training machine learning models, especially in large-scale problems where analytical solutions are infeasible. Its variants, such as stochastic gradient descent (SGD) and mini-batch gradient descent, are used to handle large datasets more efficiently.
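The only change for a mini-batch (stochastic) variant is that each update estimates the gradient from a random subset of the data rather than from the whole dataset; a rough sketch, reusing the same toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy data as above (illustrative values only)
x = np.array([50, 70, 90, 110, 130], dtype=float)
y = np.array([150, 200, 240, 290, 330], dtype=float)
x = (x - x.mean()) / x.std()

b0, b1, alpha, batch_size = 0.0, 0.0, 0.05, 2

for _ in range(2000):
    # Estimate the gradient from a random mini-batch instead of the full dataset
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    error = (b0 + b1 * xb) - yb
    b0 -= alpha * (2.0 / batch_size) * error.sum()
    b1 -= alpha * (2.0 / batch_size) * (error * xb).sum()

# Noisy estimates that hover near the least-squares solution
print(f"b0 ~ {b0:.1f}, b1 ~ {b1:.1f}")
```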
Interrelation and Workflow in Machine Learning
- Model Definition (Linear Regression): Define the linear regression model you want to train, specifying the relationship between the dependent and independent variables.
- Initialization: Start with initial guesses for the model parameters (e.g., β0\beta_0β0 and β1\beta_1β1).
- Cost Function Calculation: Calculate the initial cost using the cost function (e.g., MSE) to measure the error of the initial model.
- Gradient Descent: Apply the gradient descent algorithm to iteratively update the model parameters. This involves:
  - Computing the gradients (partial derivatives of the cost function with respect to each parameter).
  - Updating the parameters by moving in the direction opposite to the gradient.
  - Recalculating the cost function after each update to monitor progress.
- Convergence Check: Continue iterating until the change in the cost function is below a predefined threshold, indicating that convergence has been achieved.
- Model Evaluation: After convergence, evaluate the final model on a separate validation dataset to assess its performance.
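Putting the whole workflow together, here is an end-to-end sketch on synthetic data (the true coefficients, noise level, split sizes, learning rate, and tolerance are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data for illustration: y = 4 + 3x + noise
x = rng.uniform(0, 10, size=200)
y = 4.0 + 3.0 * x + rng.normal(0.0, 1.0, size=200)

# Hold out part of the data for the final evaluation step
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Model definition and initialization: y_hat = b0 + b1 * x, starting from zeros
b0, b1 = 0.0, 0.0
alpha, tol = 0.01, 1e-9
n = len(x_train)
prev_cost = mse(y_train, b0 + b1 * x_train)    # initial cost

# Gradient descent with a convergence check
for step in range(100_000):
    error = (b0 + b1 * x_train) - y_train
    b0 -= alpha * (2.0 / n) * error.sum()
    b1 -= alpha * (2.0 / n) * (error * x_train).sum()
    cost = mse(y_train, b0 + b1 * x_train)
    if abs(prev_cost - cost) < tol:            # convergence check
        break
    prev_cost = cost

# Model evaluation on the held-out validation set
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, steps = {step + 1}")
print(f"validation MSE = {mse(y_val, b0 + b1 * x_val):.3f}")
```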
These concepts form the backbone of many machine learning algorithms, and understanding them is crucial for developing and optimizing predictive models.