
Appreciate Ordinary Least Squares and Maximum Likelihood Estimation

At this point in the course, you understand how to build, interpret, and select regression models. You can also check that linear regression is appropriate for the dataset by verifying the five assumptions from the last chapter.

However, you still don't know how the coefficients of the linear regression method are calculated and why you need to satisfy these five assumptions. To understand the mechanisms behind regression modeling, you need to have a good grasp of the mathematical aspects of the method.

At the end of this chapter, you should have a good understanding of the following:

  • How the linear regression coefficients are calculated with both OLS and MLE.

  • The fundamental differences between the OLS and MLE methods.

  • Where the log-likelihood statistic comes from.

  • The concept of a loss function.

This chapter is more formal and mathematical than the previous ones. To make it more palatable, we will sacrifice some mathematical rigor. We will also mostly restrict ourselves to the univariate case.

Overview

There are several ways to calculate the optimal coefficients of a linear regression model. Here, we focus on the ordinary least squares (OLS) method and the maximum likelihood estimation (MLE) method.

The ordinary least squares (OLS) method is tailored to the linear regression model. If the data is not too weird, it should always give a decent result. The OLS method does not make any assumptions about the probabilistic nature of the variables and is considered deterministic.

The maximum likelihood estimation (MLE) method is a more general approach, probabilistic by nature, that is not limited to linear regression models.

The cool thing is that under certain conditions, the MLE and OLS methods lead to the same solutions.

The Context

Let's briefly reiterate the context of univariate linear regression. We have an outcome variable $\(y\)$, a predictor $\(x\)$, and $\(N\)$ samples. $\(x\)$ and $\(y\)$ are $\(N\)$-sized vectors. We assume that there's a linear relation between $\(y\)$ and $\(x\)$. We can write:

$\[y_i = ax_i + b + \varepsilon_i \text{ for all samples } i \in [1,n]\]$

Where $\(\varepsilon\)$ is a random variable that introduces some noise into the dataset. It is assumed to follow a normal distribution with mean 0.

The goal is to find the best coefficients $\(\hat{a}\)$ and $\(\hat{b}\)$ such that the estimation $\(\hat{y_i} = \hat{a} x_i + \hat{b}\)$ is as close as possible to the real values $\(y_i\)$. In other words, we want to minimize the distance between $\(y_i\)$ and $\(\hat{y_i}\)$ for all samples $\(i\)$.

We can rewrite that last equation as a product of a vector of coefficients $\(\omega\)$ and a design matrix of predictors $\(X\)$:

$\[\hat{y} = X \omega\]$

Where $\(\omega = [b,a]^T\)$ and $\(X\)$ is the $\(N\)$ by 2 design matrix defined by:

$\[X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\]$

What's important here is that:

  1. We want $\(y\)$ and $\(\hat{y}\)$ to be as close as possible.

  2.  $\(\hat{y}\)$ can be written as $\(\hat{y} = X \omega\)$, a product of a vector of coefficients $\(\omega\)$ and a matrix $\(X\)$ of samples.

The OLS Method

When we say we want the estimated values $\(\hat{y}\)$ to be as close as possible to the real values $\(y\)$, this implies a notion of distance between the samples. In math, a good, reliable distance is the quadratic (Euclidean) distance.

In two dimensions, if the points have the coordinates $\(p = (p_1, p_2)\)$ and $\(q = (q_1, q_2)\)$, then the distance is given by:

$\[d(\mathbf {p} ,\mathbf {q} )={\sqrt {(q_{1}-p_{1})^{2}+(q_{2}-p_{2})^{2}}}.\]$

Our goal is to minimize the quadratic distance between all the real points $\(y_i\)$ and the inferred points $\(\hat{y_i}\)$ .

The (squared) distance between $\(y\)$ and $\(\hat{y}\)$ is (we can drop the square root, since minimizing the squared distance gives the same solution):

$\[d(y, \hat{y}) = \sum_{i = 1}^{n} ( x_i \omega - y_i )^2\]$

To find the value of $\(\omega\)$ that minimizes that distance, take the derivative of $\({\displaystyle \sum_{i = 1}^{n} ( x_i \omega - y_i )^2 }\)$ with respect to $\(\omega\)$, and solve the equation:

$\[{\displaystyle \frac{\partial {\displaystyle \sum_{i = 1}^{n} ( x_i \omega - y_i )^2 } }{\partial \omega}= 0}\]$

Easy peasy?

We'll skip the gory details of that derivation (you can find it online) and fast-forward to the solution:

$\[\hat{\omega} = (X^T X )^{-1} X^T y\]$

And that's how you calculate the coefficients of the linear regression with OLS.
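To make this concrete, here is a minimal sketch of the closed-form solution computed with NumPy on a small synthetic dataset (the variable names and the simulated coefficients are just choices made for the example):

```python
# A minimal sketch of the OLS closed-form solution with NumPy on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = a*x + b + noise, with a = 2 and b = 1 chosen for the example
N = 100
x = rng.uniform(0, 10, size=N)
y = 2 * x + 1 + rng.normal(0, 1, size=N)

# Design matrix X: a column of ones (for the intercept b) next to the predictor x
X = np.column_stack([np.ones(N), x])

# Closed-form solution: omega_hat = (X^T X)^{-1} X^T y
omega_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(omega_hat)  # approximately [1, 2], i.e. [b_hat, a_hat]
```

In practice, you would rather call `np.linalg.lstsq` or a regression library than invert $\(X^T X\)$ explicitly, but the result is the same.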

Univariate Case

In the univariate case with N samples, the problem comes down to finding $\(\hat{a}\)$ and $\(\hat{b}\)$ that best solve a set of N equations with two unknowns:

$\[y_1 = \hat{a} x_1 + \hat{b} \\ y_2 = \hat{a} x_2 + \hat{b} \\ \vdots \\ y_n = \hat{a} x_n + \hat{b}\]$

The solution to this set of N equations is given by:
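$\[\hat{a} = \frac{\displaystyle{\sum_{i=1}^{n} (x_i - \overline {x})(y_i - \overline {y})}}{\displaystyle{\sum_{i=1}^{n} (x_i - \overline {x})^2}} \qquad \hat{b} = \overline {y} - \hat{a} \, \overline {x}\]$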

Where $\(\overline {x}\)$ and $\(\overline {y}\)$ are respectively the means of x and y.

Too Good to Be True?

Although it may seem that the OLS method will always output meaningful results, this is unfortunately not the case when dealing with real-world data. In the multivariate case, the operations involved in calculating this exact OLS solution can sometimes be problematic.

Calculating the matrix $\((X^T X )^{-1} X^T\)$ involves many operations: forming the product $\(X^T X\)$ means going through all $\(N\)$ samples, and the resulting matrix (one row and one column per coefficient) then has to be inverted. For big datasets, with many samples and many predictors, these computations take time and large amounts of memory. Even though matrix inversion is greatly optimized today, using the closed-form solution given by the OLS method is not always the most efficient way to obtain the coefficients.

The feasibility of calculating the closed-form solution $\(\hat{\omega} = (X^T X )^{-1} X^T y\)$ is what drives the five assumptions of linear regression.

For example, it can be shown that if there is perfect multicollinearity between the predictors (one predictor is a linear combination of the others), then the normal matrix $\(X^T X\)$ has no inverse, and the coefficients cannot be calculated by the OLS method.
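You can check this numerically. Here is a small sketch (the numbers are made up) where duplicating a predictor makes $\(X^T X\)$ rank-deficient, hence not invertible:

```python
# Sketch: perfect multicollinearity makes X^T X singular (not invertible).
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1                           # x2 is a linear combination of x1
X = np.column_stack([np.ones(4), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))     # 2 instead of 3: the matrix is rank-deficient
print(np.linalg.det(XtX))             # (numerically) 0: no inverse exists
```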

OLS Recap

What's important to understand and remember about the OLS method:

  • The OLS method comes down to minimizing the squared residuals.

  • There is a closed-form solution.

  • Being able to calculate this exact solution is what drives the five assumptions of linear regression.

  • The exact solution is difficult to calculate for a large number of samples.

Let's turn our attention now to the second method for calculating the regression coefficients, the maximum likelihood estimation, or MLE.

Maximum Likelihood Estimation (MLE)

Let's look at the problem of calculating the best coefficients differently. This time, the reasoning applies to linear regression or to any other kind of model. The model has a set of parameters that best approximates a given dataset, and you want to find this optimal set of parameters.

You want to maximize the probability of observing your dataset, given the parameters of that model.

In other words, find the parameters of the model that make the current observations most likely. This is called maximizing the likelihood of the dataset.

How do we define the likelihood mathematically?

Consider $\(p(x_i / \omega)\)$, the probability of observing a sample $\(x_i\)$ given the model and its parameters $\(\omega\)$. If the model were the true model that generated the dataset, what would be the chance of seeing that particular sample?

Now consider the probability for each sample in the dataset and multiply them all.

$\[L(\omega) = \displaystyle{ \prod_{i = 1}^{n} p(x_i / \omega)}\]$

 $\(L(\omega)\)$ is called the likelihood, and this is what you want to maximize.

Writing the likelihood as a product assumes that the samples are independent of each other: $\(p(x_i , x_k) = p(x_i) \, p(x_k) \text{ for all } (i, k) \text{ with } i\neq k\)$.

Since the log function is strictly increasing, maximizing $\(L(\omega)\)$ or $\(\log(L(\omega))\)$ is the same thing. And if you take the log of $\(L(\omega)\)$, the product transforms into a sum, and you have the mathematical definition of the log-likelihood!

$\[l(\omega / x) = \sum_{i = 1}^n \log p(x_i / \omega)\]$
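To see the product and the sum of logs side by side, here is a tiny sketch (the normal model, the samples, and the candidate parameters are all made up for the illustration):

```python
# Sketch: likelihood (product) vs. log-likelihood (sum of logs) under a normal model.
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 3.8, 4.9, 5.3])   # made-up samples
mu, sigma = 4.5, 0.6                       # candidate parameters omega = (mu, sigma)

likelihood = np.prod(norm.pdf(x, loc=mu, scale=sigma))
log_likelihood = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(likelihood, np.log(likelihood), log_likelihood)  # the last two values match
```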

MLE for Linear Regression

In the case of linear regression, and when the residuals are normally distributed, the two methods OLS and MLE lead to the same optimal coefficients.

Math Demo (From Afar)

The demonstration goes like this:

If you assume that the residuals $\(\hat{y} -y\)$ are normally distributed, centered, with variance $\(\sigma^2\)$, i.e. $\(N(0, \sigma^2)\)$, then the likelihood function can be expressed as:

$\[{\displaystyle L(\sigma^2, \omega / y,X ) = \frac{1}{ \sqrt{ (2\pi \sigma^2 )^n}} \, e^{ -\frac{1}{2 \sigma^2} \displaystyle{\sum_{i=1}^n (\hat{y_i} - y_i)^2 }} }\]$

The log-likelihood function is then:

$\[l(\sigma^2, \omega / y,X ) = -\frac{n}{2} \log{ (2\pi )} - \frac{n}{2} \log{ (\sigma^2 )} - \frac{1}{2 \sigma^2} \displaystyle{ \sum_{i=1}^n (\hat{y_i} - y_i)^2 }\]$

In other words, since the first two terms do not depend on $\(\omega\)$:

$\[l(\sigma^2, \omega / y,X ) = \text{const.} -\frac{1}{2 \sigma^2} d(\hat{y},y) \]$

And you find the same distance $\(d(y, \hat{y})\)$ between the real values $\(y\)$ and the inferred values $\(\hat{y}\)$ as in the OLS method!

Notice the negative sign in front of $\(d(y, \hat{y})\)$ . This is why minimizing the quadratic distance in OLS is the same as maximizing the log-likelihood.

If the assumptions of the linear regression are met, notably the normally distributed property of the residuals, the OLS and the MLE methods both lead to the same optimal coefficients.
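Here is a minimal numerical sketch of that equivalence: we maximize the Gaussian log-likelihood (by minimizing its negative) with SciPy and compare the result with the OLS closed form. The simulated data and starting values are assumptions made for the example.

```python
# Sketch: with normally distributed residuals, MLE and OLS give the same coefficients.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(0, 5, size=N)
y = 3 * x - 2 + rng.normal(0, 1.5, size=N)
X = np.column_stack([np.ones(N), x])

# OLS closed-form solution
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# Negative Gaussian log-likelihood as a function of (b, a, log_sigma)
def neg_log_likelihood(params):
    b, a, log_sigma = params
    sigma2 = np.exp(log_sigma) ** 2
    resid = y - (a * x + b)
    return 0.5 * N * np.log(2 * np.pi * sigma2) + np.sum(resid**2) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0]).x

print(w_ols)      # approximately [-2, 3]
print(w_mle[:2])  # the same intercept and slope, up to numerical precision
```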

Loss Functions

To conclude this chapter, I'd like to talk about loss functions. In the OLS and MLE methods, we choose a particular function of the model parameters and of the model's estimation error, and then we minimize or maximize that function with respect to the parameters of the model. We can generalize that idea by considering other functions that we could minimize to find optimal model parameters.

For instance:

  • We could consider the absolute value of the residuals $\(|\hat{y} -y|\)$ instead of the quadratic distance.

  • We could also add a term $\(\alpha \left\| \omega \right\|^2\)$ that penalizes large coefficients, so that the term to be minimized becomes $\(\left\| \hat{y} - y \right\|^2 + \alpha \left\| \omega \right\|^2\)$. This loss function is used in a method called ridge regression (sketched below).

By changing the loss function, we can build different models that lead to different optimal solutions.
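For instance, the ridge loss above also has a closed-form solution, $\((X^T X + \alpha I)^{-1} X^T y\)$. A rough NumPy sketch (for simplicity, the intercept is penalized here as well, which libraries usually avoid):

```python
# Sketch: ridge regression adds alpha * ||omega||^2 to the squared-error loss.
# Closed form: omega = (X^T X + alpha * I)^{-1} X^T y
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(0, 10, size=N)
y = 2 * x + 1 + rng.normal(0, 1, size=N)
X = np.column_stack([np.ones(N), x])

alpha = 1.0
w_ridge = np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T @ y
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_ols, w_ridge)  # the ridge coefficients are typically shrunk slightly toward zero
```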

In fact, as you will see in the chapter on logistic regression, the loss function in binary classification (when $\(y\)$ can only take the 2 values 0 or 1) is the negative of a log-likelihood:

$\[J(\omega) = -\displaystyle{ \sum_{i=1}^n \left[ y_i \log P(y_i=1 /\omega) + (1 - y_i) \log P(y_i=0 /\omega) \right] }\]$

Elegant!

Summary

In this chapter, we focused on the mathematical theory that drives the linear regression method.

The key takeaways are:

  • The math is what dictates the five assumptions of linear regression.

  • The ordinary least squares method minimizes the sum of the squared residuals.

  • The OLS method is computationally costly in the presence of large datasets.

  • The maximum likelihood estimation method maximizes the probability of observing the dataset given a model and its parameters.

  • In linear regression, OLS and MLE lead to the same optimal set of coefficients.

  • Changing the loss functions leads to other optimal solutions.

This concludes Part 2 of the course! Amazing work! You should now have a good grasp on how to apply linear regression to data, how to evaluate the quality of a linear regression model, the conditions that need to be met, and what calculation drives the linear regression method.

In Part 3, we're going to extend linear regression, first to handle categorical variables, both as an outcome and as predictors, and then to effectively deal with nonlinear datasets.

Go One Step Further: Math Stuff That Did Not Make the Cut

I won't lie, math has a bad rap for being obtuse. I like the elegance and the feeling of magic behind equations and logical reasoning. That said, here is some math stuff that I find very cool, but that is not necessary to understand OLS or MLE.

Distances

A distance is simply the length between two points, cities, places, etc. But in math, we can play with a whole slew of distances.

For instance:

  • The absolute value distance ( $\(L_1\)$ norm) between 2 points $\(x\)$ and $\(y\)$ is defined as
    $\(L_1(x,y) = | x - y |\)$

  • While the quadratic norm is defined by
    $\(L_2(x,y) = \left\| x - y\right\|_2 = \sqrt{ (x_{1}-y_{1})^{2}+(x_{2}-y_{2})^{2}+\cdots +(x_{n}-y_{n})^{2} }\)$

  • The Manhattan distance, aka taxi distance, which is the $\(L_1\)$ norm in several dimensions: the sum of the absolute differences of the coordinates.

  • You can even define an $\(\infty\)$  distance with
    $\(\left\| x-y \right\|_{\infty} = max_i \left\{ |x_i - y_i| \right\}\)$

So much fun!
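If you want to play with these norms, here is a tiny NumPy sketch (the two vectors are chosen arbitrarily):

```python
# Sketch: L1, L2 (quadratic), and infinity distances between two vectors.
import numpy as np

x = np.array([1.0, 4.0, -2.0])
y = np.array([3.0, 1.0, 2.0])

print(np.linalg.norm(x - y, ord=1))       # L1 / Manhattan distance: 2 + 3 + 4 = 9
print(np.linalg.norm(x - y))              # L2 / quadratic distance: sqrt(4 + 9 + 16)
print(np.linalg.norm(x - y, ord=np.inf))  # infinity norm: max(2, 3, 4) = 4
```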

Inverting Matrices

There are several equivalent conditions that are necessary and sufficient for a square matrix $\(A\)$ to be invertible. For instance:

  • Its determinant is nonzero: $\(\det(A) \neq 0\)$

  • Its columns are not linearly dependent.

  • The only solution to $\(Ax = 0\)$ is $\(x = 0\)$

In a way, this is similar to the scalar case, where the inverse of 0, $\(\frac{1}{0}\)$, cannot be calculated.
