Machine Learning allows computers to carry out complex tasks without being specifically programmed to do so. One way that this is possible is through the process known as Supervised Learning, in which the computer is given a set of training data – samples of input and output of the desired function – and from that data is able to generate a function which, given the input needed to perform the desired task, is able to return the correct output. There are two main kinds of tasks that are done with supervised learning: classification and regression. With classification, the computer takes some input data, usually as a vector, and returns one of a finite set of possible outcomes. For example, tell the computer that something is a reptile, that it has a dome-shaped shell, and that its average lifespan is over 100 years, and a properly “learned” machine could tell you that what you are describing is a tortoise. With regression, the computer takes a vector of input data and maps it to the predicted result within a continuous solution space. For example, tell the computer how large your house is in square feet, how many bedrooms/bathrooms it has, how old it is, etc. and it could return the predicted value of your house in dollars. But how do these processes work? Regression and classification both commonly use the same methods at their core: using gradient descent to achieve linear or logistic regression. For the sake of simplicity, let’s stick to linear regression for now. Linear regression is the process of taking a set of data-points and finding the line that best fits the data points. How do we determine how well a line “fits” the data? Consider a line in 2-dimensional Euclidean Space, whose function is given by:

*y* = m*x* + b

Where *x* is an independent variable, *y* is a dependent variable, and m and b are constant coefficients. Now consider a sample data point, *p _{0}* , which is located at the point (

*x*) in 2-dimensional space. We can calculate the Cost Function at this point to be |

_{0}, y_{0}*y*–

_{0}*y |*, where

*y*= m

*x*+ b. So, given all of our data points, we can calculate sum of the cost for all of these points. Actually, it is more useful to calculate something called the

_{0}*mean squared error*, which is 1/2 times the

*average*

*squared cost*for all of the points, that is, the average of the squares of the cost divided by 2. We can actually represent this as a function of our coefficients m and b. So, we want to find the m and b which gives us the smallest average square cost. How we find the optimal m and b brings us to the crux of the problem: gradient descent. This average squared cost function forms a paraboloid in 3-dimensional space. The minimum of the paraboloid represents our m and b values which minimize the average cost. So how do we find this minimum? Pick any two values to start with for m and b. Now think back to calculus. If we can perpetually find the direction of greatest descent at every point in time, ie, the direction with the lowest directional derivative, and continuously move in that direction, we will eventually reach the bottom of the paraboloid. And if you think back to calculus again, it is easy to find this direction, to find the gradient of a function at a point, just calculate the directional derivatives for each vector in a basis for your vector space and multiply those directional derivatives by their respective element in the basis. This gives us the steepest

*positive*gradient. To get the steepest

*negative*gradient, just go in the exact opposite direction. So that’s the direction we move in. And if we do this and move at small enough decreasing increments, we can approximate doing it continuously.