EVERYTHING YOU NEED TO KNOW ABOUT LINEAR REGRESSION – PART 1

Akash Chandel
Apr 7, 2024

Linear Regression models the relationship between a Dependent Variable and one or more Independent Variables (it quantifies association, not necessarily causation).

It can be of 2 types:

Simple Linear Regression

In simple linear regression, we have only one independent variable explaining the dependent variable.

Y = mx + c, where m is the slope and c is the intercept.

Multiple Linear Regression

In multiple linear regression, we have multiple independent variables explaining the dependent variable.

Y = m1x1 + m2x2 + … + mnxn + c, with one slope per independent variable and a single intercept c.
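As a minimal sketch (assuming scikit-learn and a made-up toy dataset), both forms can be fitted the same way:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                 # two independent variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 2.0 + rng.normal(scale=0.1, size=100)

# Simple linear regression: only the first independent variable
simple = LinearRegression().fit(X[:, [0]], y)
print(simple.coef_, simple.intercept_)        # slope m and intercept c

# Multiple linear regression: both independent variables
multiple = LinearRegression().fit(X, y)
print(multiple.coef_, multiple.intercept_)    # slopes m1, m2 and intercept c
```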

APPLICATIONS OF LINEAR REGRESSION

It is used to identify the strength of the effect that an independent variable has on the dependent variable.

It is used to forecast the impact of changes in the independent variables on the dependent variable.

It is used to predict trends and future values, and to obtain point estimates.

ASSUMPTIONS OF MULTIPLE LINEAR REGRESSION

  1. Linearity : The relationship between X and Y should be linear, i.e. a change in X should produce a proportional change in Y (the line may slope upward or downward).
  2. No Endogeneity : The error term should not be correlated with the independent variables. If it is, for example because an omitted variable influences both X and Y, the coefficient estimates become biased.
  3. Normality of Residuals : The residuals, i.e. the differences between the actual values (ytest) and the predicted values (ypred), should be approximately normally distributed with a mean close to zero when plotted.
  4. Homoscedasticity : Homoscedasticity means “having the same spread”: the residuals should have a constant variance. This can be validated by plotting ypred on the x axis against the residuals on the y axis; the scatter should form a uniform band (see the sketch after this list).
  5. No Autocorrelation : Autocorrelation occurs when the residuals violate the independence assumption because they are correlated across time. It measures the relationship between a variable’s current values and its past values, and is common in time-series data.
  6. No Multicollinearity : No two or more independent variables should be strongly correlated with each other. High collinearity inflates the variance of the coefficient estimates and makes them unreliable.
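A quick, hedged sketch of checking assumptions 3 and 4 on a fitted model (the data and model here are hypothetical; substitute your own predictions and residuals):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothetical data and fit; swap in your own model's predictions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 1.5 * X[:, 1] + 2 + rng.normal(scale=0.5, size=200)
y_pred = LinearRegression().fit(X, y).predict(X)
residuals = y - y_pred

# Normality of residuals: the Shapiro-Wilk p-value should not be tiny
print(stats.shapiro(residuals))

# Homoscedasticity: residuals vs predictions should form a uniform band
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("predicted values (ypred)")
plt.ylabel("residuals")
plt.show()
```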

CHECKS AND FIXES FOR ASSUMPTIONS OF MULTIPLE LINEAR REGRESSION

Linearity : If the pattern is non-linear, use non-linear regression or transform the relationship (exponential, logarithmic).

Endogeneity : Check via a graphical representation of the residuals against the independent variables.

Heteroscedasticity : Check for omitted variable bias, remove outliers, or apply a logarithmic transformation.

Multicollinearity : Remove highly correlated variables; check Tolerance and the Variance Inflation Factor (VIF).

VIF = 1/(1 - R^2)

Tolerance = 1 - R^2

where R^2 here is obtained by regressing each independent variable on all the other independent variables.
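A minimal sketch of computing VIF with statsmodels (the feature names and data are made up):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical features; x3 is deliberately near-collinear with x1
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["x3"] = 0.95 * df["x1"] + rng.normal(scale=0.1, size=100)

X = add_constant(df)  # VIF should be computed with an intercept present
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # rule of thumb: VIF above 5 (or 10) flags multicollinearity
```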

Autocorrelation : Durbin-Watson test.
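A one-line check via statsmodels (on hypothetical residuals):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no autocorrelation; values toward 0 indicate
# positive and values toward 4 negative autocorrelation
rng = np.random.default_rng(2)
residuals = rng.normal(size=100)   # hypothetical independent residuals
print(durbin_watson(residuals))    # expect a value close to 2
```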

STATISTICAL MEASURES FOR REGRESSION

1. R-squared

It tells us how much better our model is than a simple horizontal line through the mean of the dependent variable.

R^2 = SSR/SST

SST = SSR + SSE

SSR : sum of squares due to regression

SST : total sum of squares

SSE : sum of squared error

Sum of squares due to regression SSR : It is the sum of squared differences between the predicted values and the mean of the dependent variable, a measure that describes how much of the variability the fitted line captures.

Aka Explained Variation

Total sum of squares SST : It is the sum of squared differences between the observed values and their mean. It is the measure of the total variability of the dataset.

Aka Total variation

Sum of squared errors SSE : It is the sum of squared differences between the observed/actual values and the predicted values of the dependent variable.

Aka Unexplained variation
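The decomposition SST = SSR + SSE can be verified numerically; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data to illustrate SST = SSR + SSE
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.5, size=100)
y_pred = LinearRegression().fit(X, y).predict(X)

ssr = np.sum((y_pred - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_pred) ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation

print(np.isclose(sst, ssr + sse))        # True for OLS with an intercept
print("R^2 =", ssr / sst)                # same as 1 - sse/sst
```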

2. Adjusted R-squared

Adjusted R-squared penalises the R-squared value for the excessive use of variables; hence it is always less than or equal to R-squared.

Note : If adjusted R^2 increases after adding an independent variable, the newly added variable is contributing new information to the model, whereas if adjusted R^2 decreases, the newly added variable provides little to no substantial information.
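The usual formula is Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. A small sketch of the penalty in action (the R^2 values are made up):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding a near-useless variable barely raises R^2, so the penalty wins
print(adjusted_r2(0.800, n=100, p=3))   # 0.79375
print(adjusted_r2(0.801, n=100, p=4))   # ~0.7926, lower despite higher R^2
```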

3. Mean Squared Error MSE :

It is the sum of the squared differences between the predicted and actual values, divided by the number of observations.

4. Root Mean Squared Error RMSE :

It is the square root of the MSE, i.e. the square root of the sum of squared differences between the predicted and actual values divided by the number of observations. Unlike MSE, it is in the same units as the dependent variable.

5. Mean Absolute Error MAE :

Mean absolute error is a measure of the difference between two continuous variables: the average of the absolute differences between the predicted and actual values. It is more intuitive and less sensitive to outliers than MSE or RMSE.
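A minimal sketch of all three metrics with scikit-learn (the arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.8, 5.4, 2.9, 6.1])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)
print(mse, rmse, mae)
```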
