Intuition of Univariate Linear Regression

Emir Hurturk
7 min read · Feb 3, 2021
Photo by Roman Mager on Unsplash

When I first learned linear regression and machine learning, I was amazed. Now, in this article, I will share my knowledge with you and, hopefully, give you a better intuition for linear regression. First, I am going to explain the basics of univariate linear regression, then we’ll dive into multivariate linear regression in the second part, and finally we’ll dive right into the code. Let’s get started.

Univariate Linear Regression

In machine learning, our ultimate goal is to make the most accurate predictions based on previous data. Therefore, we want to minimize the errors between our model’s predictions, and the real data. Let’s go back to basics.

y = mx + b   (the basic slope-intercept form, in which m represents the slope and b represents the y-intercept)

You all know this equation, right? This is the equation of a line in a 2D plane, in which the coefficient, or weight, m represents the slope of the line and the constant term, or bias, b represents the y-intercept of the line. If we plot this equation, our graph would look like this:

Graph of the function y=2x+2

And in this case, m is 2, and bias b is also 2.
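The graphs in this article were made with Desmos, but if you want to reproduce this one in code, here is a minimal sketch. I'm assuming Python with NumPy and Matplotlib here; the choice of libraries is mine, not something the article depends on.

```python
import numpy as np
import matplotlib.pyplot as plt

# Slope (weight) m and y-intercept (bias) b of the line y = 2x + 2
m, b = 2, 2

x = np.linspace(-5, 5, 100)  # 100 evenly spaced x-values
y = m * x + b                # evaluate the line at every x

plt.plot(x, y, label="y = 2x + 2")
plt.axhline(0, color="gray", linewidth=0.5)  # draw the x-axis
plt.axvline(0, color="gray", linewidth=0.5)  # draw the y-axis
plt.legend()
plt.show()
```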

Now, let’s work with some data. Let’s take a random list of 7 points. If we plot them,

Our plane would look something like this. Now, since we want to find the most accurate line, in other words, the trendline, we want a line that tries to pass through nearly all points. I said nearly because it is almost impossible to find a line that goes through all points. (Actually, it is possible with polynomial functions, but we don’t want that kind of graph, because we may lose accuracy on future data.) Back to linear regression: we first initialize a random weight and a random bias as our initial function. Let’s say we initialize both parameters to 1. Now, our 2D plane looks like this:

y=x+1

And our function is y = 1x + 1. By convention, we call this function the hypothesis function. Our hypothesis function looks pretty accurate now. To measure how accurate a hypothesis function is, in machine learning we typically use a function called the cost function, which for linear regression is the mean squared error (MSE). With this function, we can calculate the accuracy of an arbitrary hypothesis function, and its formula looks something like this:

J(θ) = (1/n) · Σᵢ (h(xᵢ) − yᵢ)²   (Cost Function, where the sum runs over all n data points)

θ (theta) represents our coefficient, or weight, vector. (In case you don’t know what a vector or matrix is, I’ll explain them in chapter 2 of this series, Multivariate Linear Regression. For now, think of a vector as an array that contains some ordered numbers.) h(x) is our hypothesis function, which returns our prediction for a particular input xᵢ. For univariate linear regression, xᵢ is the x-coordinate of a particular data point, which we call a feature. yᵢ is the real value. We square the difference between the prediction and the real value because there might be a difference of −5 and a difference of 5, and we don’t want these values to cancel out to 0 when summed; squaring makes negative differences positive. A visual intuition of the cost function would look like this:

The vertical lines between our hypothesis function and the data points are our differences, or errors. We take the sum of the squares of these errors and divide it by the number of data points to get the average squared error. So a basic interpretation of the cost function is: the average squared distance between our predictions and the real values.
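To make this concrete, here is a small sketch of the cost function in code. This is my own illustration in Python with NumPy; the seven data points are made up, and names like mse_cost are hypothetical, since the article’s code comes later in the series.

```python
import numpy as np

def hypothesis(x, m, b):
    """The hypothesis function h(x) = m*x + b."""
    return m * x + b

def mse_cost(x, y, m, b):
    """Mean squared error between the predictions h(x_i) and the real values y_i."""
    errors = hypothesis(x, m, b) - y  # differences can be negative or positive
    return np.mean(errors ** 2)       # squaring makes them positive, the mean averages them

# Seven made-up data points that roughly follow a line
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2])

print(mse_cost(x, y, m=1, b=1))  # the cost of the initial guess y = x + 1
print(mse_cost(x, y, m=2, b=1))  # a better-fitting line gives a much lower cost
```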

For univariate linear regression, if we graph the cost function against a single parameter, we get a bowl-shaped (parabolic) curve.

Since our aim is to find the parameters that minimize our average error and make the most accurate predictions, the best value for a particular parameter is the vertex, or the minimum, of the graph of the cost function vs. that particular parameter.
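As a quick illustration (again a sketch with the same made-up data as above), we can sweep one parameter while keeping the other fixed and watch the cost trace out that bowl shape:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2])  # the same made-up points as above

def cost(m, b=1.0):
    """MSE of the line y = m*x + b on the sample data."""
    return np.mean((m * x + b - y) ** 2)

# Sweep the slope m while keeping the bias fixed at b = 1
for m in np.arange(0.0, 4.5, 0.5):
    print(f"m = {m:.1f}  ->  cost = {cost(m):.2f}")
# The cost drops as m approaches 2 and rises again afterwards:
# plotting these values gives the bowl-shaped curve described above.
```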

To move a parameter to its minimum point, we use an algorithm called Gradient Descent. This algorithm simply updates the parameters, step by step, toward the values that minimize the cost function. The steps of this algorithm are:

Step 1: Initialize the parameters (for ULR, the slope m and the bias b) with some starting values.
Step 2: Update each parameter using the gradient descent update rule below.
Step 3: Repeat step 2 until the parameters converge.

I find it more intuitive to first give the equation, then the graph. So, here is the equation of gradient descent:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ   (General Gradient Descent Equation)

There are a lot of symbols in this equation! But it becomes quite meaningful once you get the intuition. First of all, let’s start with the := operator. This is an ‘assignment’ operator, used to avoid confusion with mathematical equality. θⱼ is the current parameter, and for univariate linear regression (ULR), θⱼ is either the bias b or the slope coefficient m.

Now comes α. This is the learning rate. The learning rate controls how big each step is as the parameter moves along the cost function graph, and by common convention the learning rate is chosen between 0.001 and 0.9. (Of course, your learning rate may fall outside this range, but don’t choose it too large, because your parameters may not converge! And don’t choose it too small, because the learning process can take too long.) You can find a good learning rate α by experimenting with your code.

Now comes the partial derivative part! The partial derivative is simply the multivariable version of the ordinary derivative from calculus, and it gives the slope of a function with respect to one particular parameter while the others are held fixed. Since we are dealing with only one parameter at a time in ULR, we can treat this partial derivative as an ordinary derivative. (I’ll explain partial derivatives more deeply in part 2, Multivariate Linear Regression.) For anyone who does not know calculus: the derivative returns the slope of the tangent line at a given point, so we can take derivatives of functions to measure the rate of change of ‘y’ with respect to changes in ‘x’. And the derivative at the vertex of the curve is 0, because the tangent line there is horizontal.

Since we want the parameter to converge to the local minimum, at which the slope of the tangent line equals zero, we subtract from the parameter the quantity α times the current slope of the tangent line at the parameter’s current point.
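As a tiny numeric example (the numbers here are my own, purely for illustration): if the slope of the tangent line at the current parameter value is 4 and α = 0.1, a single update moves the parameter downhill by 0.4:

```python
alpha = 0.1   # learning rate
theta = 3.0   # current value of the parameter (made up for this example)
slope = 4.0   # slope of the tangent line of the cost curve at theta (also made up)

theta = theta - alpha * slope  # one gradient descent update
print(theta)                   # 2.6 -> the parameter moved toward the minimum
```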

A visual approach would be:

As you can see from the image above, the learning rate α controls the size of the learning steps, and toward the minimum the steps become much smaller because the slope of the tangent line at the current parameter value decreases.

Thus, we can declare convergence for a parameter when the change per step drops below a threshold value, e.g. 0.5. After running gradient descent on all parameters until they converge, we have the ‘perfect’ parameters that fit our data best. For ULR, we only have 2 parameters, m and b, so we simply run the algorithm for those two. And that’s it! After running the algorithm until convergence, we have the best parameters for our data!
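Putting all of this together, here is a minimal sketch of gradient descent for ULR. Again, this is my own illustration in Python with NumPy: the data points, the learning rate of 0.01, and the convergence threshold are choices I made for the example, not values from the article.

```python
import numpy as np

# The same made-up data points used in the earlier sketches
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2])

def gradient_descent(x, y, alpha=0.01, threshold=1e-6, max_iters=100_000):
    m, b = 1.0, 1.0                  # step 1: initialize the parameters
    n = len(x)
    for _ in range(max_iters):
        errors = (m * x + b) - y     # h(x_i) - y_i for every data point
        # Partial derivatives of the MSE cost with respect to m and b
        grad_m = (2 / n) * np.sum(errors * x)
        grad_b = (2 / n) * np.sum(errors)
        new_m = m - alpha * grad_m   # step 2: update both parameters
        new_b = b - alpha * grad_b
        # step 3: stop when the change per step drops below the threshold
        if max(abs(new_m - m), abs(new_b - b)) < threshold:
            return new_m, new_b
        m, b = new_m, new_b
    return m, b

m, b = gradient_descent(x, y)
print(f"m = {m:.2f}, b = {b:.2f}")   # roughly the line y = 2x + 1 for this data
```

If you make alpha much larger or much smaller in this sketch, you can watch the behavior described in the learning-rate section: the parameters either fail to converge or crawl toward the minimum very slowly.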

Conclusion

In this article, we learned what a hypothesis function is, and how we can calculate the errors between our hypothesis function’s predictions and the real data. Then, we learned how we can find the best parameters for our hypothesis function with the help of an algorithm called ‘Gradient Descent’. In conclusion, I’d like to put together, once again, all of the equations you’d use in Univariate Linear Regression.

Here are all the equations we learned in this article:

Hypothesis Function: h(x) = mx + b
Cost Function: J(θ) = (1/n) · Σᵢ (h(xᵢ) − yᵢ)²
A simplified version of the Cost Function
Graph of the Cost Function according to a Parameter
Gradient Descent Equation: θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ
A visual way of Gradient Descent

Hope you found this article helpful!

Image Credits:

The header photo is by Roman Mager on Unsplash. I used Desmos to make all the other graphs, and LaTeX to write those beautiful equations.
