Machine Learning Series Day 1 (Linear Regression)

I promise it’s not just another “ML Article.”

--

Terminology:

  1. Supervised Learning: Analyzing a labeled dataset to produce a prediction function that can forecast new examples.
  2. Overfitting: A model that has learned ‘too much’ of the dataset. Hence, the model will not be as useful on new examples.
  3. Ground Truth: The actual result.
  4. MSE: Mean Squared Error. It’s a formula that measures how well the model is performing. For each observation, it calculates the difference between the prediction and the ground truth. It then sums the squares of those differences. Lastly, it calculates the mean (dividing by the number of observations).
  5. Gradient Descent: A first-order iterative optimization algorithm for finding the minimum of a function.
  6. Error term: The part of the data that the model cannot explain; we can never fit 100% of our data.
  7. Shrinking: Reducing the effect of the coefficients.

Concept:

For supervised learning, the dataset will have independent variables and a dependent variable. As a data scientist, your responsibility is to discover the underlying pattern that maps the independent variables to the dependent variable.

Think about it like this. You go to a park. You sit down and observe that there are many mothers with their children. They decide to do a fun contest: predict the age of the next mother who comes to the park (this will never happen) from the age of her kid. You’ll probably take a look at the mothers and children who are already in the park. You notice that when a child is 3, most mothers appear to be around 30 years old. When a child is 10, most mothers appear to be around 40. You do some mental math and decide that if a child is a newborn, you’ll predict 28, and for every additional year of the child’s age, you’ll predict the mother is 1.5 years older. So if a kid is 5, you’ll predict the mother is 35.5 years old. Notice that you are applying a weight to the independent variable (kid’s age) to predict the dependent variable (mother’s age).
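Here is a minimal Python sketch of that mental model, using the made-up numbers above (an intercept of 28 and a weight of 1.5):

```python
# The park guessing game as a tiny prediction function:
# an intercept (bias) of 28 plus a weight of 1.5 applied to the kid's age.
def predict_mother_age(kid_age):
    bias = 28.0    # predicted age of a mother with a newborn
    weight = 1.5   # extra years of mother's age per year of the kid's age
    return bias + weight * kid_age

print(predict_mother_age(5))  # 35.5, matching the example above
```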

Linear Regression is a standard algorithm that is not as complicated (or as accurate) as other models. Yet it is a building block for more complicated models.

Details:

  • Supervised/Unsupervised: Supervised
  • Regression/Classification: Regression

Visual:

The goal is to fit the line to the data values as well as possible. That is, we calculate the “line” that produces the least total error across all data points.

Mathematics/Statistics:

Formula — (Linear Regression):
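The bullets below describe the standard linear regression equation:

y = β0 + β1·x1 + β2·x2 + … + βp-1·xp-1 + ϵ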

  • X variables (Inputs): These are the input values in the dataset. If you’re predicting employee salaries, some inputs could be age, level of education, or location.
  • β0 (Bias Term): A bias term is needed unless we believe the model passes through the origin. The bias term is where the line intercepts the y-axis, and that y-intercept matters. For example, if we were predicting a baby’s weight, a baby cannot weigh 0 lbs (assuming they were born).
  • β1 … βp-1 variables (Coefficients): We multiply the X variables by the weights/betas (represented by β). The betas are what the model calculates. As you’ll see later, they are how we manipulate the fitted line to match the dependent variable.
  • The epsilon (ϵ): Represents the error term, the part of the dependent variable the model cannot capture. We can never expect our data to be a perfect representation of the population.

The objective is to predict the betas (weights) that “fits” the data values the best.

They are the coefficients to be multiplied with the inputs. If there were only one predictor, the goal would be to find the straight line that best “fits” the data points.

“What next?”

  • An evaluation method must be established: a cost function. For linear regression, the cost function is typically the Mean Squared Error (MSE).

Formula — MSE (Mean Squared Error):

The MSE formula measures the average squared difference between the predicted values and the actual (ground truth) values.
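Using the symbols defined in the bullets below, the cost can be written as:

MSE = (1 / 2N) · Σᵢ (yᵢ − W·Xᵢ)²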

  • yᵢ: The ground truth for each observation.
  • W*X: Multiplying our weights by the X variables to get a prediction.
  • W0: Not shown in the function, but we will use a bias term.

So the process is relatively straight-forward. Let’s use an example:

  1. You’re tasked with predicting people’s salary. Age is your independent variable. You will probably have more variables than age, but let’s stick with one variable. You will use the training set to measure accuracy.
  2. If you were only to find the pattern for one person, whose salary is $100,000 and who is 50 years old, the coefficient (beta) would be 2000. Meaning, for every additional year of age, a person is expected to earn $2000 more.
  3. Now there are two people. The second person earns $60,000 but is 20 years old. Hence, our weight of 2000 no longer works.

Our job is to find a weight that minimizes the average error across all the observations.
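Here is a minimal sketch of that search in Python, computing the MSE (with no bias term, and with a few made-up candidate weights) on the two observations above:

```python
# Two observations from the example: (age, salary)
ages = [50, 20]
salaries = [100_000, 60_000]

def mse(weight, xs, ys):
    # Mean squared error with no bias term, averaged over 2N
    # (the 2N convention is explained in the notes below).
    n = len(xs)
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# No single weight fits both people perfectly; we pick the one with the lowest MSE.
for w in (2000, 2200, 2500):
    print(w, mse(w, ages, salaries))
```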

The task becomes very complicated when there are a lot of variables and observations.

Additional Notes (MSE):

  • We square the differences so that they are all positive. The capital sigma, Σ, means we sum the squared differences between the predictions and the ground truths. Lastly, we divide by the total number of observations (N) to get the average.
  • The reason we use “2N” instead of N is to simplify the derivatives: the 2 cancels when we differentiate the squared term.

Our goal is to minimize the Cost Function (MSE).

“So how do we minimize the MSE?”

With gradient descent.

Formula — Gradient Descent (Simple Regression):

The diagram (look below) illustrates the goal of gradient descent: we need to “reach” the bottom of the parabola. The slopes are drawn as the straight colored lines (green, yellow, and red). The steeper the slope, the quicker we reach the bottom. Hence, the objective is to follow the slope that gets us from our current position to the base as efficiently as possible.

For a simple linear regression problem, we calculate the derivatives of the cost with respect to our coefficients (including the bias). The equations below are the derivatives for m and b. Realize that m is the same as the betas/weights in simple linear regression (terminology is often inconsistent).
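Using the 1/(2N) version of the MSE above (the 2 cancels against the power rule), those derivatives work out to:

∂MSE/∂m = −(1/N) · Σᵢ xᵢ · (yᵢ − (m·xᵢ + b))

∂MSE/∂b = −(1/N) · Σᵢ (yᵢ − (m·xᵢ + b))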

  • We calculate the derivatives of the bias and weight for each observation, then sum them and divide by the number of observations to get the average.
  • The last step is updating the betas. These derivatives calculate how we should tweak the betas.
  • The update formula is original_beta minus (learning_rate * derivative); a small code sketch follows this list.
  • The derivative is the direction we want to move towards, and the learning rate is how fast we would like to proceed. The learning rate is typically 0.05 but could be adjusted.
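Putting those bullets together, here is a minimal sketch of the loop in Python (the data, learning rate, and iteration count are only illustrative):

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=1000):
    # Start the weight (m) and bias (b) at zero and update them iteratively.
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivatives of the 1/(2N) MSE with respect to m and b.
        dm = -sum(x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        db = -sum((y - (m * x + b)) for x, y in zip(xs, ys)) / n
        # Update rule: original_beta minus (learning_rate * derivative).
        m -= learning_rate * dm
        b -= learning_rate * db
    return m, b

print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))  # should approach m ≈ 2, b ≈ 1
```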

Formula — Gradient Descent (Multi-Linear Regression):

More realistically, you’ll be dealing with a multi-linear regression problem. Hence, the equations below describe the partial derivatives with respect to each of the possible weights, each of which is associated with an independent variable. The “h0(x)” term is our prediction.
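Writing J for the cost (the MSE) and h0(x) = β0 + β1·x1 + … + βp-1·xp-1, each weight’s partial derivative follows the same pattern:

∂J/∂βⱼ = (1/N) · Σᵢ (h0(xᵢ) − yᵢ) · xᵢⱼ

where xᵢⱼ is the j-th input of observation i (and simply 1 for the bias term β0).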

If you’re rock climbing, you can move forward, backward, up, or down. The partial derivatives describe how much you should move in each direction of that 3-dimensional space.

Is there anything else?

Yes. One of the difficulties in machine learning is a concept called overfitting. Overfitting is when our model performs well on the training set (the dataset the model learns from) but not as well on the validation or testing set (the datasets the model is validated or tested with).

One technique to reduce overfitting is by shrinking your weights/betas.

It’s a fascinating concept. The strongest beta/weight will still be the strongest relative to the betas/weights of the other independent variables.

However, the magnitude will be reduced for all independent variables.

For this, we have ridge and lasso regression.

Lasso and Ridge Regression

  • The formula is the same as it was for multiple independent variables, with the addition of lambda (λ).
  • The lambda term is referred to as the penalty term because the model is penalized for having large coefficients.
  • The formula above only describes the derivative of the coefficients (betas).
  • Once the derivative is calculated, as noted above, it is multiplied by the learning rate and subtracted from the current coefficients (betas).
  • The larger the lambda (λ), the larger the penalty term.
  • The larger the penalty term, the more the coefficients (betas) are reduced (see the equations after this list).
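Reading those bullets together, for the squared (ridge) penalty the per-coefficient derivative and update look roughly like this (the lasso version replaces 2·λ·βⱼ with λ·sign(βⱼ)):

∂J/∂βⱼ = ∂MSE/∂βⱼ + 2·λ·βⱼ

βⱼ := βⱼ − learning_rate · ∂J/∂βⱼ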

Ridge versus Lasso

The difference between the ridge and lasso regression is how the regularization term is computed.

  • Ridge regression squares the regularization term.
  • Lasso regression uses the absolute value of the regularization term.
  • Also, the lasso shrinks the less significant coefficients all the way to zero, thus removing some features altogether. This works well for feature selection when we have a considerable number of features.
  • As we increase the complexity of our model, the regularization term grows larger to balance it out (the penalty terms are written out below).
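Written out, the two penalized cost functions are (conventions differ on whether the bias term is included in the penalty; typically it is not):

Ridge (L2): cost = MSE + λ · Σⱼ βⱼ²

Lasso (L1): cost = MSE + λ · Σⱼ |βⱼ|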

There’s one additional difference between the L1 Regularization and the L2 Regularization.

The image above showcases the constraint on the model. The constraint is the regularization term we impose on the coefficients. Without it, the solution would sit at the center of the ellipse; the ellipse represents the cost contours of our coefficients (betas), and the closer to the middle, the better the model performs on the training data.

  1. Since the L1 Regularization uses absolute values, we obtain the left image. Geometrically, it’s a diamond (in 2 dimensions). If the vector was [x, y], its L1 norm is |x| + |y|.
  2. Since the L2 Regularization uses squared values, we obtain the right image. Geometrically, it’s a circle (in 2 dimensions). If the vector was [x, y], its squared L2 norm is x² + y² (a circle).

Moreover, we want to figure out where the constraint region and the model’s cost contours “touch” one another. There are cases where they “touch” on an axis. Hence, some independent variables are pruned (removed, since their coefficients are driven to 0 or very close to it).
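If you want to see that pruning effect in code, here is a small scikit-learn sketch on synthetic data (the alpha values, which play the role of λ, are only illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter in this synthetic target.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # every coefficient shrinks, none forced to exactly zero
print("lasso:", np.round(lasso.coef_, 3))  # less significant coefficients shrink to (or very near) zero
```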

WANT MORE…

If so, I suggest following my Instagram page. I post summaries and thoughts on the books that I have read and am currently reading.

Instagram: Booktheories, Personal

Follow me on: Twitter, GitHub, and LinkedIn

AND if you liked this article, I’d appreciate it if you click on the like button below. THANKS!
