At this point in the course, you understand how to build, interpret, and select regression models. You can also check that linear regression is suited to a dataset by verifying the five assumptions from the last chapter.

However, you still don't know how the coefficients of the linear regression method are calculated and why you need to satisfy these five assumptions. To understand the mechanisms behind regression modeling, you need to have a good grasp of the mathematical aspects of the method.

At the end of this chapter, you should have a good understanding of the following:

How the linear regression coefficients are calculated with both OLS and MLE.

The fundamental differences between the OLS and the MLE methods.

Where the log-likelihood statistic comes from.

The concept of a loss function.

This chapter is more formal and mathematical than the previous ones. To make it more palatable, we will sacrifice some mathematical rigor. We will also mostly restrict ourselves to the univariate case.

### Overview

There are several ways to calculate the optimal coefficients for a linear regression model. Here we focus on the ordinary least squares (OLS) method and the maximum likelihood estimation (MLE) method.

The **ordinary least squares (OLS)** method is tailored to the linear regression model. If the data is not too weird, it should always give a decent result. The OLS method does not make any assumption about the *probabilistic* nature of the variables and is considered to be *deterministic*.

The **maximum likelihood estimation (MLE)** method is a more general approach, probabilistic by nature, that is not limited to linear regression models.

The cool thing is that under certain conditions, the MLE and OLS methods lead to the same solutions.

### The Context

Let's briefly reiterate the context of univariate linear regression. We have an outcome variable $y$, a predictor $x$, and $n$ samples. $x$ and $y$ are $n$-sized vectors. We assume that there's a linear relation between $y$ and $x$. We can write:

$$y_i = a x_i + b + \varepsilon_i \quad \text{for all samples } i \in [1, n]$$

Where $\varepsilon$ is a random variable that introduces some noise in the dataset. $\varepsilon$ is assumed to follow a normal distribution with mean 0.

The goal is to find the best coefficients $\hat{a}$ and $\hat{b}$ such that the estimation $\hat{y}_i = \hat{a} x_i + \hat{b}$ is as close as possible to the real values $y_i$. In other words, to **minimize the distance between $y_i$ and $\hat{y}_i$ for all samples $i$**.

We can rewrite that last equation as a product of a vector of coefficients $\omega$ and a design matrix of predictors $X$:

$$\hat{y} = X \omega$$

Where $\omega = [b, a]^T$ and $X$ is the $n$ by 2 design matrix defined by:

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

What's important here is that:

We want $y$ and $\hat{y}$ to be as close as possible.

$\hat{y}$ can be written as $\hat{y} = X \omega$, a product of a vector of coefficients $\omega$ and a matrix $X$ of samples.
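As a small sketch with made-up toy data, here is one way to build such a design matrix with NumPy (the variable names are just for illustration):

```python
import numpy as np

# Hypothetical toy data, roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix: one row [1, x_i] per sample, so that X @ omega = b + a * x
X = np.column_stack([np.ones_like(x), x])
print(X.shape)  # (4, 2): n rows, 2 columns
```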

### The OLS Method

When we say we want the estimated values $\hat{y}$ to be *as close as possible* to the real values $y$, this implies a notion of **distance** between the samples. In math, a good reliable distance is the quadratic distance.

In two dimensions, if the points have the coordinates $p = (p_1, p_2)$ and $q = (q_1, q_2)$, then the distance is given by:

$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

Our goal is to **minimize the quadratic distance** between all the real points $y_i$ and the inferred points $\hat{y}_i$.

And the distance between $y$ and $\hat{y}$ is:

$$d(y, \hat{y}) = \sum_{i=1}^{n} (X_i \omega - y_i)^2$$

Where $X_i$ is the $i$-th row of the design matrix $X$.

To find the value of $\omega$ that minimizes that distance, take the derivative of $\sum_{i=1}^{n} (X_i \omega - y_i)^2$ with respect to $\omega$, and solve the equation:

$$\frac{\partial}{\partial \omega} \sum_{i=1}^{n} (X_i \omega - y_i)^2 = 0$$

Easy peasy?

We'll skip the gory details of that derivation (it's online somewhere), and instead fast-forward to the solution:

$$\hat{\omega} = (X^T X)^{-1} X^T y$$
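To make this concrete, here is a minimal sketch (with made-up data) that applies the closed-form solution directly and cross-checks it against NumPy's least-squares solver:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.8, 5.1, 7.2, 8.9])  # roughly y = 2x + 1

X = np.column_stack([np.ones_like(x), x])  # rows [1, x_i]

# Closed-form OLS solution: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)  # the two solutions agree
```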

And that's how you calculate the coefficients of the linear regression with OLS.

#### Univariate Case

In the univariate case with $n$ samples, the problem comes down to finding $\hat{a}$ and $\hat{b}$ that best solve a set of $n$ equations with two unknowns:

$$\begin{aligned} y_1 &= \hat{a} x_1 + \hat{b} \\ y_2 &= \hat{a} x_2 + \hat{b} \\ &\cdots \\ y_n &= \hat{a} x_n + \hat{b} \end{aligned}$$

The solution to this set of $n$ equations is given by:

$$\hat{a} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} \qquad \hat{b} = \overline{y} - \hat{a}\,\overline{x}$$

Where $\overline{x}$ and $\overline{y}$ are respectively the means of $x$ and $y$.
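Here is a quick numerical check of these formulas on made-up data, compared against NumPy's degree-1 polynomial fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.8])

# a_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), b_hat = y_bar - a_hat * x_bar
x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b_hat = y_bar - a_hat * x_bar

# Same answer as NumPy's degree-1 polynomial fit
a_ref, b_ref = np.polyfit(x, y, 1)
print(a_hat, b_hat)
```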

#### Too Good to Be True?

Although it may seem that the OLS method will always output meaningful results, this is unfortunately not the case when dealing with real-world data. In the multivariate case, the operations involved in calculating this exact OLS solution can sometimes be problematic.

Calculating the matrix $(X^T X)^{-1} X^T$ involves many operations, including inverting the matrix $X^T X$. Forming $X^T X$ takes work proportional to the number of samples $n$, and with many predictors the matrix to invert becomes big; inverting big matrices takes time and large amounts of memory. Even though matrix inversion is greatly optimized today, using the closed-form solution given by the OLS method is not always the most efficient way to obtain the coefficients.

**The feasibility of calculating the closed-form solution $\hat{\omega} = (X^T X)^{-1} X^T y$ is what drives the five assumptions of linear regression.**

For example, it can be shown that if there is *perfect multicollinearity* between the predictors (one predictor is a linear combination of the others), then the normal matrix has no inverse, and the coefficients cannot be calculated by the OLS method.
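A minimal illustration, with a hypothetical dataset in which one predictor is an exact multiple of another: the normal matrix $X^T X$ is then rank-deficient and has no inverse:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                      # second predictor is an exact multiple of the first
X = np.column_stack([np.ones_like(x1), x1, x2])

# The normal matrix X^T X is rank-deficient, so it has no inverse
XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank, XtX.shape[0])  # rank 2 < size 3
```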

#### OLS Recap

What's important to understand and remember about the OLS method:

The OLS method comes down to **minimizing the squared residuals.**

There is a **closed-form solution.**

Being able to calculate this exact solution is what **drives the five assumptions** of linear regression.

The exact solution is difficult to calculate for a **large number of samples.**

Let's turn our attention now to the second method for calculating the regression coefficients, the maximum likelihood estimation, or MLE.

### Maximum Likelihood Estimation (MLE)

Let's look at the problem of calculating the best coefficients differently, in a way that applies to linear regression or any other kind of model. Among all potential sets of parameters, there is one for which the model best approaches the given dataset. You want to find this set of optimum parameters.

You want to maximize the **probability of observing your dataset, given the parameters of that model.**

In other words, find the parameters of the model that makes the current observation most **likely.** This is called **maximizing the likelihood** of the dataset.

How do we define the **likelihood** mathematically?

Consider $p(x_i \mid \omega)$, the probability of observing a sample $x_i$ given the model and its parameters $\omega$. If the model were the true model that generated the dataset, what would be the chance of seeing that particular sample?

Now consider that probability for each sample in the dataset, and multiply them all together:

$$L(\omega) = \prod_{i=1}^{n} p(x_i \mid \omega)$$

$L(\omega)$ is called the **likelihood,** and this is what you want to maximize.

Suppose here that the samples are independent of each other: $p(x_i, x_k) = p(x_i)\,p(x_k) \text{ for all } (i, k) \text{ with } i \neq k$.

Since the log function is always increasing, maximizing $L(\omega)$ or $\log(L(\omega))$ amounts to the same thing. And if you take the log of $L(\omega)$, the product transforms into a sum, and you have the mathematical definition of the **log-likelihood!**

$$l(\omega \mid x) = \sum_{i=1}^{n} \log p(x_i \mid \omega)$$
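A small numerical sketch of why the log matters in practice: the raw likelihood is a product of many small numbers, while the log-likelihood is a well-behaved sum (data drawn from a standard normal, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100)

# Standard normal density for each sample
pdf = np.exp(-samples**2 / 2) / np.sqrt(2 * np.pi)

likelihood = np.prod(pdf)             # product of tiny numbers: underflows quickly
log_likelihood = np.sum(np.log(pdf))  # sum of logs: numerically stable

print(likelihood, log_likelihood)
```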

#### MLE for Linear Regression

In the case of linear regression, and when the residuals are normally distributed, the two methods OLS and MLE lead to the same optimal coefficients.

#### Math Demo (From Afar)

The demonstration goes like this:

If you assume that the residuals $\hat{y} - y$ are normally distributed with mean 0 and variance $\sigma^2$, that is $N(0, \sigma^2)$, then the likelihood function can be expressed as:

$$L(\sigma^2, \omega \mid y, X) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

The log-likelihood function is then:

$$l(\sigma^2, \omega \mid y, X) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

In other words:

$$l(\sigma^2, \omega \mid y, X) = \text{constant} - \frac{1}{2\sigma^2}\, d(\hat{y}, y)$$

And you find the same distance $d(y, \hat{y})$ between the real values $y$ and the inferred values $\hat{y}$ as in the OLS method!

Notice the negative sign in front of $d(y, \hat{y})$. This is why **minimizing** the quadratic distance in OLS is the same as **maximizing** the log-likelihood.

If the assumptions of the linear regression are met, notably the normality of the residuals, the OLS and the MLE methods both lead to the same optimal coefficients.
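As a sanity check, here is a sketch (on made-up data) that maximizes the Gaussian log-likelihood in the coefficients by plain gradient descent on the squared residuals, and compares the result to the OLS closed form:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.0, 7.2, 8.8])
X = np.column_stack([np.ones_like(x), x])

# OLS closed form for reference
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# Maximizing the Gaussian log-likelihood in the coefficients is the same as
# minimizing sum((X w - y)^2); here we do that by plain gradient descent
w = np.zeros(2)
lr = 0.01
for _ in range(20000):
    grad = 2 * X.T @ (X @ w - y)  # gradient of the squared residuals
    w -= lr * grad

print(w_ols, w)  # the two estimates agree
```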

### Loss Functions

To conclude this chapter, I'd like to talk about loss functions. In the OLS and MLE methods, we choose a particular function of the model parameters and the model's estimation error, and then minimize or maximize that function with respect to the parameters. We can generalize that idea by considering other functions that we could minimize to find optimal model parameters.

For instance:

We could consider the absolute value of the residuals $|\hat{y} - y|$ instead of the quadratic distance.

We could also add a term $\alpha \left\| \omega \right\|^2$ that takes into account the magnitude of the coefficients, so that the quantity to be minimized becomes $\left\| \hat{y} - y \right\|^2 + \alpha \left\| \omega \right\|^2$. This loss function is used in a method called ridge regression.
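As a sketch, ridge regression also admits a closed-form solution, $(X^T X + \alpha I)^{-1} X^T y$; on made-up data you can see it shrink the coefficients relative to OLS (here the penalty is applied to all coefficients, including the intercept, to match the formula above):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])

alpha = 1.0
# Ridge closed form: w = (X^T X + alpha * I)^{-1} X^T y
w_ridge = np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T @ y
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y

print(w_ols, w_ridge)  # the ridge coefficients are shrunk toward zero
```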

By changing the loss function, we can build different models that lead to different optimal solutions.

In fact, as you will see in the chapter on logistic regression, the loss function in binary classification (when $y$ can only take the 2 values 0 or 1) is based on the log-likelihood:

$$J(\omega) = \sum_{i=1}^{n} y_i \log P(y_i = 1 \mid \omega) + (1 - y_i) \log P(y_i = 0 \mid \omega)$$

Elegant!
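As a tiny numerical sketch (with made-up outcomes and predicted probabilities), evaluating this log-likelihood is just a sum over the samples:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # observed binary outcomes
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # model's predicted P(y=1) per sample

# Log-likelihood of the observations under the model
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_lik)
```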

### Summary

In this chapter, we focused on the mathematical theory that drives the linear regression method.

The key takeaways are:

The math is what dictates the five assumptions of linear regression.

The ordinary least squares method **minimizes the square of the residuals.**

The OLS method is computationally costly in the presence of large datasets.

The maximum likelihood estimation method **maximizes the probability of observing the dataset** given a model and its parameters.

In linear regression, OLS and MLE lead to the same optimal set of coefficients.

Changing the loss function leads to other optimal solutions.

This concludes Part 2 of the course! Amazing work! You should now have a good grasp on how to apply linear regression to data, how to evaluate the quality of a linear regression model, the conditions that need to be met, and what calculation drives the linear regression method.

In Part 3, we're going to extend linear regression, first to handle categorical variables, both as an outcome and as predictors, and then to effectively deal with nonlinear datasets.

### Go One Step Further: Math Stuff That Did Not Make the Cut

I won't lie, math has a bad rap for being obtuse. But I like the elegance and the feeling of magic behind equations and logical reasoning. That said, here is some math stuff that I find very cool, but that is not necessary to understand OLS or MLE.

#### Distances

A distance is simply the length between two points, cities, places, etc. But in math, we can play with a whole slew of distances.

For instance:

The absolute value distance ($L_1$ norm) between 2 points $x$ and $y$ is defined as $L_1(x, y) = |x - y|$. It is also known as the Manhattan, or taxi, distance.

The quadratic norm ($L_2$) is defined by $L_2(x, y) = \left\| x - y \right\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$.

You can even define an $\infty$ distance with $\left\| x - y \right\|_{\infty} = \max_i \left\{ |x_i - y_i| \right\}$.
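Here is a quick sketch computing these three distances with NumPy on two made-up points:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

d1 = np.linalg.norm(p - q, ord=1)         # L1 / Manhattan: |3| + |2| + |0| = 5
d2 = np.linalg.norm(p - q)                # L2 / Euclidean: sqrt(9 + 4 + 0)
dinf = np.linalg.norm(p - q, ord=np.inf)  # L-infinity: max(3, 2, 0) = 3

print(d1, d2, dinf)
```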

So much fun!

#### Inverting Matrices

There are several equivalent conditions that are necessary and sufficient for a matrix $A$ to be invertible. For instance:

Its determinant is nonzero: $\det(A) \neq 0$.

Its columns are linearly independent.

The only solution to $Ax = 0$ is $x = 0$.

In a way, this is similar to the scalar case, where the inverse of $0$, i.e. $\frac{1}{0}$, cannot be calculated.
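A quick sketch contrasting a singular and an invertible matrix through their determinants:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # second row is twice the first: not invertible
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])  # determinant 1: invertible

det_A, det_B = np.linalg.det(A), np.linalg.det(B)
print(det_A, det_B)
```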