Wednesday, June 28, 2017

Linear Regression for Least Squares Fit of 2D Data

This post describes a script which finds the best fit line for a data set.  My work was motivated by a desire to learn a bit more about artificial intelligence algorithms, and confusion about why the average of the slopes (y/x) of a dataset (when the x and y values are proportional) doesn't agree with the slope of a best fit line (with the y-intercept set to zero).

Artificial intelligence is a vast subject.  A subset of this subject relates to tasks that computers can be taught to complete.  Since computers operate much faster than humans, it is easy to give them simple tasks and have them iterate repeatedly.  This article from an online Stanford Computer Science course describes "supervised learning", where computers use a training dataset to "learn".  A very simple example is a best fit line; the computer determines a rule (y=mx + b) which predicts the output value (y) for a given input value (x) based on a training dataset of (x,y) data points.  This rule has two unknowns (m = slope and b = y-intercept) which can be independently varied to give the best fit of the training dataset.  Spreadsheet programs like Excel can fit data based on a wide variety of rule categories (linear, polynomial, exponential, etc).

A related and interesting observation relates to a spreadsheet analysis of a dataset wherein x and y should vary linearly.  For example, the heat released by salts when they dissolve in water follows a linear relationship with the amount of salt used.  A material property, the enthalpy of solution, ΔHsoln, is equal to the ratio of heat to mass (ΔHsoln=q/M) and is constant for a given temperature and pressure.  This means that a plot of heat (q) versus mass (M) is linear with a slope equal to the enthalpy of solution (q = ΔHsoln  M is in the form y = m x).  You can also determine the enthalpy by calculating q/M for each trial of a dataset and then averaging the results.  The picture below shows the results of a spreadsheet analysis for ammonium chloride dissolved in water.

What is interesting to me, is that the enthalpy as determined by the slope of the best fit line with zero intercept (283 J/g) is not the same as the enthalpy as determined by averaging the values of q/m for each trial (277 J/g).  To understand why they are not the same, I decided to learn a bit more about how Excel fits linear data.

Fitting of linear data can be done with a linear regression.  This basically means that we guess a linear function (y = mx + b) which could fit the dataset.  Then we change either b or m and see if the new function is a better or a worse fit.  If the fit is better, we "keep going" in the direction we picked.  For example, if we were making the slope bigger, then we keep increasing the slope.  If we made the fit worse, we want to change the parameter in the other direction.

As described, this is a "poor man's" version of a gradient descent protocol.  The difference is that in a gradient descent, we are not so much randomly changing and checking our parameters, but are looking at a mathematical function cost = f (m, b) and determining the direction (specified by a vector in the m, b plane) which leads to the largest decrease in cost.  We then change m and b concurrently so as to move in the direction of maximum decrease.  In multivariable calculus, the gradient is analogous to the derivative in 2D calculus.  Hence gradient descent is analogous to minimization problems done in introductory calculus classes.

From my point of view, doing a real gradient descent is not necessary as the computing power available from a standard laptop allows me to fit the data within seconds using a more rudimentary algorithm.

The basic algorithm is shown in the flowchart at right.  Code can be found online on github.




At the heart of the algorithm is a cost function which is the sum of the squares of the differences between the predicted y value (mxi + b) and the actual y value (yi) for each x value.  Suppose that your training dataset is depicted by a series of i data points (xi, yi).  Then













In python, this calculation is easily done by reading in the dataset







and then calculating the cost

assuming that your data is saved in a file called data.csv located in the same place as the python program, and that the data file contains x and y values separated by a comma with a different point on each line.

The rest of the "meat" of the program is in incrementing/decrementing m or b which is done by the following:

and in toggling whether we are incrementing/decrementing m and b, which is done by the following:
There are several factors in the program which determine how quickly the best fit line converges.  The first is how close the initial guess is.  The second is how much the slope or y-intercept changes each time through the iteration.  These latter "step sizes" are stored in the variables called "deltaSlope" and "deltaIntercept".

As a test, I fed in the enthalpy data discussed above into both Excel and this python program.  I chose an initial guess of m = 250, b = 0 and a step size of 1 for both slope and intercept.  The python program "converged" to m ~ 285 while b ranged between 0 and -20.

The Excel fit of the data gives m = 285 and b = -12, so I'm definitely in the right ballpark.

Obviously this a pretty rudimentary program.  It doesn't assess how close it's getting and it doesn't adjust the step sizes at all, so we don't see convergence.  Rather we see fluctuations around the actual values which would minimize the cost function.  More robust algorithms could adjust the values of "deltaSlope" and "deltaIntercept" perhaps in proportion to the relative value of the cost function.  Alternately, the program could just stop changing slope or intercept once changes lead to minimal cost change.  These are all tweaks that I'm sure have been thoroughly exploited in calculators and spreadsheet program. 

So back to the original motivating question... why is the slope of a best fit line not the same as the average of the y/x ratios for all the data points?  The best fit line uses a least squares minimization, so it's a nonlinear approach as opposed to a simple average.  So the results won't be the same.