Gradient Descent: Part II
Chapter 3 showed how to leverage the gradient descent approach to find the minimum of a given function. This chapter shows how gradient descent is used in linear regression. In linear regression, we have a set of observed data points, and the task is to identify the underlying function that generated them. For instance, let’s assume the underlying hidden function is \(y=4x_1^2 + 2x_2^2\). The code snippet below uses this function to generate 1,000 data points.
import pandas as pd
import random
# set seed so that the generated data is repeatable
random.seed(10)
# randomly generate 1,000 values for x1 and x2
dataDF = pd.DataFrame({
    'x1': [random.randint(-100, 100) for x in range(1000)],
    'x2': [random.randint(-100, 100) for x in range(1000)]
})
# compute y for the given set of x1 and x2
dataDF['y'] = 4 * dataDF['x1']**2 + 2 * dataDF['x2']**2
display(dataDF.sample(10))
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go
# init_notebook_mode(connected=False)
# visualize the generated data points as a 3D scatter plot
iplot([go.Scatter3d(x=dataDF['x1'], y=dataDF['x2'], z=dataDF['y'], opacity=0.5, mode='markers')], show_link=False)
|  | x1 | x2 | y |
|---|---|---|---|
| 588 | 34 | 67 | 13602 |
| 174 | 24 | 46 | 6536 |
| 117 | -43 | 18 | 8044 |
| 865 | -90 | -11 | 32642 |
| 326 | -39 | 46 | 10316 |
| 986 | 46 | -88 | 23952 |
| 351 | 97 | -86 | 52428 |
| 259 | 4 | -62 | 7752 |
| 870 | -64 | 71 | 26466 |
| 834 | 28 | 51 | 8338 |
Now the challenge is: can we discover the underlying function (\(y=4x_1^2 + 2x_2^2\)) using only the generated data points above? On the surface, this doesn’t look like an optimization problem that calls for the gradient descent approach. But let’s rethink the problem with Chapter 1 in mind.
Chapter 1 discussed various metrics (such as MAE, R Squared, etc.) to evaluate the quality of a regression model. For the given dataset, we can come up with many different possible models such as:
\(y = 3x_1^2 + 2x_2^2\)
\(y = 4x_1^2 + 2x_2^2\)
\(y = 4x_1^2 + 3x_2^2\)
Without knowing the underlying function, how do we figure out which of these models best fits our observed data points? Well, as discussed in Chapter 1, we can compare predicted values to the actual values and compute various statistics such as mean absolute error, R squared, etc. The best model will be the one that minimizes our selected metric, say MAE.
For a moment, let’s assume the metric we are interested in is MAE (refer to Chapter 1 for details). Mathematically, we can compute the MAE for each of our models as follows:
\(MAE_1 = \frac{1}{n} \sum_{i=1}^{n}|y - (3x_1^2 + 2x_2^2)|\)
\(MAE_2 = \frac{1}{n} \sum_{i=1}^{n}|y - (4x_1^2 + 2x_2^2)|\)
\(MAE_3 = \frac{1}{n} \sum_{i=1}^{n}|y - (4x_1^2 + 3x_2^2)|\)
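To make this concrete, here is a small sketch (assuming the dataDF generated above) that computes the MAE of each of the three candidate models directly from the data:
import numpy as np
# candidate models expressed as (theta1, theta2) pairs for y = theta1*x1^2 + theta2*x2^2
candidates = [(3, 2), (4, 2), (4, 3)]
for t1, t2 in candidates:
    # predictions of the candidate model on the observed x1, x2
    y_pred = t1 * dataDF['x1']**2 + t2 * dataDF['x2']**2
    # mean absolute error against the observed y
    mae = np.mean(np.abs(dataDF['y'] - y_pred))
    print("theta1={}, theta2={} -> MAE={:.2f}".format(t1, t2, mae))
Since the data was generated without noise, the second candidate (the true generating function) should come out with an MAE of exactly zero.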
Now there are infinitely many different models that we can come up with. So how do we find the one with the minimum MAE? An alternative approach is to parameterize our model. All our models have the same form and can be abstractly represented as:
\(y = \theta_1x_1^2 + \theta_2x_2^2\)
For model 1, \(\theta_1 = 3\) and \(\theta_2 = 2\). For model 2, \(\theta_1 = 4\) and \(\theta_2 = 2\), and so on. Further, we can abstractly represent the MAE computation as:
\(MAE = \frac{1}{n} \sum_{i=1}^{n}|y - (\theta_1x_1^2 + \theta_2x_2^2)|\)
Thus, our problem of finding the underlying function that fits our data can now be expressed as minimizing the above equation. Great, now this starts to look like Chapter 3: we have a function that we want to minimize, and we want to identify the optimal \(\theta_1\) and \(\theta_2\) for which the MAE is minimum.
But there is a problem. **To apply gradient descent, the function should be differentiable.** However, the “absolute” function in MAE is not differentiable (at zero). So we need to look for another metric. The reason we used the “absolute” function in MAE is to treat under-estimation and over-estimation equally. Another way to treat them equally is to take the square of the difference between the actual and the predicted value. “Square” as a function is differentiable and hence allows us to use the gradient descent approach to find the optimal parameters (i.e. \(\theta_1\) and \(\theta_2\)). This new metric is referred to as “Mean Squared Error” (or MSE), as we are taking the square of the difference between the actual and predicted values:
\(MSE = \frac{1}{n} \sum_{i=1}^{n}(y - (\theta_1x_1^2 + \theta_2x_2^2))^2\)
The first derivative of the above function with respect to the parameters (i.e. \(\theta_1\) and \(\theta_2\)) is given as:
\(\frac{\partial MSE}{\partial \theta_1} = \frac{-2}{n} \sum_{i=1}^{n}(y - (\theta_1x_1^2 + \theta_2x_2^2))\,x_1^2\)
\(\frac{\partial MSE}{\partial \theta_2} = \frac{-2}{n} \sum_{i=1}^{n}(y - (\theta_1x_1^2 + \theta_2x_2^2))\,x_2^2\)
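As a quick sanity check on these derivatives, here is a small sketch (assuming the dataDF generated above) that compares the analytic partial derivatives against a central finite-difference approximation of the MSE at an arbitrary point in parameter space:
import numpy as np
# features (already squared) and observed targets from the generated data
x1_sq = dataDF['x1'].values ** 2
x2_sq = dataDF['x2'].values ** 2
y_obs = dataDF['y'].values
def mse(t1, t2):
    return np.mean((y_obs - (t1 * x1_sq + t2 * x2_sq)) ** 2)
t1, t2 = 3.0, 2.0   # an arbitrary point in parameter space
eps = 1e-4
# analytic partial derivatives of MSE
d_t1 = -2.0 * np.mean((y_obs - (t1 * x1_sq + t2 * x2_sq)) * x1_sq)
d_t2 = -2.0 * np.mean((y_obs - (t1 * x1_sq + t2 * x2_sq)) * x2_sq)
# central finite-difference approximations
fd_t1 = (mse(t1 + eps, t2) - mse(t1 - eps, t2)) / (2 * eps)
fd_t2 = (mse(t1, t2 + eps) - mse(t1, t2 - eps)) / (2 * eps)
print(d_t1, fd_t1)   # the two values in each pair should be close
print(d_t2, fd_t2)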
Great! Now we have a function, its first derivative, and we want to minimize it. These are all the components that we used in the previous chapter. But before we jump into implementing this, there is one more detail we need to consider.
In Chapter 3, we evaluated the gradient at a single point, i.e., there was no summation operator across data points. In this case, we evaluate the gradient at every data point and then take the average. This works because the derivative of a sum is the sum of the derivatives, so the gradient of MSE is simply the average of the per-point gradients. The full implementation at the end of this section demonstrates that, by taking MSE as the function to minimize, we can identify the optimal parameters using the gradient descent approach.
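Before the full implementation, here is a minimal sketch (again assuming dataDF from above) that makes this averaging explicit: the gradient contribution of each data point is computed individually, and the overall gradient is just their average (with the constant factor of 2 dropped, as the implementation below also does):
import numpy as np
# squared features and observed targets from the generated data
features = (dataDF[['x1', 'x2']] ** 2).values
targets = dataDF['y'].values
theta = np.array([3.0, 2.0])   # an arbitrary parameter guess
# gradient contribution of each individual data point
# (the constant factor of 2 is dropped; it can be absorbed into the learning rate)
residuals = targets - features @ theta
per_sample_gradients = -residuals[:, None] * features
# the overall gradient is simply the average over all data points
avg_gradient = per_sample_gradients.mean(axis=0)
print(avg_gradient)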
There is also one more improvement we can make. \(\theta_1x_1^2 + \theta_2x_2^2\) can be abstractly represented as \(\sum_{j}\theta_jX_j\), with the assumption that \(X_1\) and \(X_2\) are already squared for us. The advantage of this approach is that it makes our implementation of the algorithm independent of the form of the equation, as long as the form can be represented as a linear combination of the parameters. Thus, our new MSE and MSE’ forms can be expressed as:
\(MSE = \frac{1}{n} \sum_{i=1}^{n}\big(y - \sum_{j}\theta_jX_j\big)^2\)
and
\(\frac{\partial MSE}{\partial \theta_j} = \frac{-2}{n} \sum_{i=1}^{n}\big(y - \sum_{k}\theta_kX_k\big)X_j\)
The code snippet below uses this generalized form to compute the gradient. Note that the implementation drops the constant factor of 2 from the derivative; it simply gets absorbed into the learning rate.
from copy import copy
from scipy.spatial import distance
from random import random
from IPython.display import HTML
import numpy as np
# this function returns the gradient of MSE
# (the constant factor of 2 is dropped; it is absorbed into the learning rate)
def gradient(X, y, theta):
    num_samples = X.shape[0]
    # assuming the function to be a linear combination of the parameters,
    # the prediction for each row of X is sum_j theta_j * X_j
    y_prime = np.sum(np.transpose(theta) * X, axis=1)
    diff = y - y_prime
    gradient = -1. * np.sum(diff.reshape((num_samples, 1)) * X, axis=0) / float(num_samples)
    return gradient
def minimize(X, y, eta=0.03, max_iterations=1000, traceback=None, stopping_threshold=1.0e-6):
    """Minimizes MSE using gradient descent.

    X -- data frame (or array) containing the features (here, x1^2 and x2^2)
    y -- actual output values
    eta -- learning rate
    max_iterations -- maximum number of gradient descent iterations
    traceback -- if provided, keeps track of the parameters at each iteration
    stopping_threshold -- maximum distance between the previous and new parameters at which the algorithm should stop
    """
    # number of samples
    m = X.shape[0]
    # number of parameters
    c = X.shape[1]
    # starting point -- randomly selected
    theta1 = [random() for i in range(c)]
    for iter in range(max_iterations):
        if traceback is not None:
            traceback.append(copy(theta1))
        # compute the average gradient over all data points
        # and move in the opposite direction
        theta2 = theta1 - eta * gradient(X, y, theta1)
        # check if we reached the stopping criteria threshold
        if distance.euclidean(theta2, theta1) < stopping_threshold:
            return theta1
        else:
            theta1 = theta2
    # if we reached max iterations then return the current point
    return theta1
# note: we square x1 and x2 up front so that the model is a linear combination of the parameters
X = (dataDF[['x1', 'x2']] ** 2).values
y = dataDF['y'].values
traceback = []
theta = minimize(X, y, eta=1e-8, max_iterations=10000, traceback=traceback)
display(HTML("<strong>Optimized Parameters: {}</strong>".format(theta)))
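As a small follow-up, the traceback list we passed in can be used to inspect convergence; the sketch below (re-using X, y, and traceback from the cell above) evaluates the MSE at the first and last recorded parameter values:
import numpy as np
# evaluate MSE at each recorded parameter vector to see how far we descended
mse_history = [np.mean((y - X @ np.asarray(t)) ** 2) for t in traceback]
print("MSE at the starting parameters: {:.2f}".format(mse_history[0]))
print("MSE at the final parameters:    {:.2f}".format(mse_history[-1]))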
Isn’t this magic? Our algorithm was able to correctly identify the parameters of our parabola function just by looking at the generated data points. However, this example is far from reality. In particular, I made a lot of assumptions and ignored some details. Below are a few things that we should think about:
I assumed the functional form of a parabola (i.e. \(\theta_1X_1^2 + \theta_2X_2^2\)). But, in reality, we don’t know how the features (i.e. \(x_1\) and \(x_2\)) relate to the target variable. For instance, what if I had started with a functional form of \(\theta_1X_1 + \theta_2X_2\) or \(\theta_1X_1 + \theta_2X_2 + \theta_3X_1X_2\)? There are infinitely many possible functional forms with different numbers of parameters. So, one thing to note is that the gradient descent based approach gives the optimal parameters (\(\theta\)) for a single functional form. If we change our functional form, we have to re-run the gradient descent algorithm to identify the optimal parameters for the new functional form (see the sketch after this list).
When generating the dataset (X, y), I used a pure parabola function. However, in reality, collected data has some level of noise, and there can be outliers. These outliers can significantly influence the optimal parameters. As discussed in the next chapter, dealing with outliers requires some form of regularization.
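To illustrate the first point, the sketch below re-uses minimize() with a different (hypothetical) functional form, \(\theta_1x_1 + \theta_2x_2\), simply by feeding it the un-squared features; gradient descent still finds the best parameters for that form, but a linear form can never recover the parabola that generated the data. The learning rate here is an ad hoc choice of mine to keep the updates stable on the un-squared features.
# fit the alternative functional form theta1*x1 + theta2*x2 by passing the
# un-squared features to minimize()
X_linear = dataDF[['x1', 'x2']].values
theta_linear = minimize(X_linear, y, eta=1e-4, max_iterations=10000)
display(HTML("<strong>Parameters for the linear form: {}</strong>".format(theta_linear)))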