Neural Networks - backpropagation

XXX: fix the example so it's the same as in the C file
XXX: link to the C file

The basic idea behind backpropagation is that if you know how the network works and what the correct output should be, you can adjust the weights so that the network's output matches the desired output.

Consider a single-neuron network that computes the function:

f(x, a, b) = ax+b

Where x is the input and a, b are the weights. This neuron computes a linear function. If you know what the output y should be for a given x, you can adjust the weights a and b for the best match. This process is called backpropagation.
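
In code, this neuron is just one line (a minimal sketch; the function name f and the parameter names are simply the ones used in this text):

float f(float x, float a, float b){
    return a * x + b; // one input, two weights, linear output
}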

Backpropagation requires two things: a loss function that measures how bad the network's guess is, and the derivatives of that loss with respect to the weights (these show up in step 3 below).

So our loss function, L( f(x_i), y_i ), is going to look like this:

f(x_i) is the neural network's result
y_i    is the correct result

L = ( f(x_i) - y_i )^2  = ( ax_i + b - y_i )^2
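
For example, if the network outputs f(x_i) = 7 for a point whose correct value is y_i = 10, the loss for that point is (7 - 10)^2 = 9. Squaring keeps the loss positive and punishes large misses much more than small ones.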

The main backpropagation loop works like this:

  1. Run the data through the network (in this case, calculate the result of f for each given x)
  2. Run the results through L ("How bad was my guess?") - this produces a single number (a scalar)
  3. Update the weights (parameters) according to the gradient descent algorithm.

Step 1. Run data through network

This is relatively simple. For each input point x_i you compute f(x_i, a, b), where a and b are the neuron's weights (parameters), and save the results into an array.

Pseudocode:

float *x = ...;      // This is the input
float *y = ...;      // This is the desired output
float *result = ...; // This array holds the result of the current iteration

float a, b; // These are the network parameters - we adjust them

for(int i = 0; i < N_POINTS; i++){
    result[i] = f(x[i], a, b); // f as defined above
}

Step 2. Run the result through L (Find how bad the guess was)

This is also simple and mostly used for monitoring: if the loss increases after each iteration, something is wrong, because the network's guesses are getting worse.

Pseudocode:

float L(float result, float target){
    return (result - target) * (result - target); // squared error; same as pow(result - target, 2) from <math.h>
}


// The total loss is usually computed as the mean of the per-point errors
// (this is why it is often called the "mean squared error", MSE).

float loss = 0;

for(int i = 0; i < N_POINTS; i++){
    loss += L(result[i], y[i]);
}
loss /= N_POINTS;

Step 3. Adjust the weights

This is the tricky part, so I will try to explain the gradient descent algorithm. I will do it for one parameter first.

Consider a neuron with a single parameter: f(x, a) = a*x.

We do steps 1 and 2 using the error function L( f(x, a), y ). This means that for a given a, we run all the data points through the network and get some error value. The smaller the error, the closer we are. It usually never reaches zero, because in most cases no value of a fits all the points exactly.

The parameter we adjust is a. The error function depends on it: the closer a gets to the correct value, the smaller the error.

Now, how do we find which way to adjust a? It is surprisingly easy if you use the derivative like this:

a = a - k * dL/da           where k is some learning coefficient, usually 0.1 - 0.001
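
For example, if at the current a the slope is dL/da = 4 and k = 0.01, the update moves a by -0.04: the loss grows in the +a direction, so we take a small step in the -a direction.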

Why is that? Consider this graph:

  /'\
   |                                 the function slope is dL/da
   |                               /
 L |----------........____       '/_         ___...-----
   |                      ``--X_         __--` 
   |                            `-.   _-`
   |                               `.`
   |                                  minimum error
   |                                :
   +--------------------------------:------------------->
                                 optimal a               a

   X marks the value of L for the current a

On this graph, X is on a slope. The slope is dL/da, and its sign tells us which way to go to reach the optimum point. By repeating the equation above, we adjust a in small steps towards the optimum. The learning coefficient k is there so that we do not overshoot and jump over the optimum point. k cannot be too small, or progress is very slow and we might get stuck in some local minimum; it can't be too big, or we overshoot. The amount we change a by is proportional to the derivative (the slope), so on a steeper slope we take bigger steps (this is somewhat counterintuitive, and it is another reason k cannot be too big).
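
To see the effect of k, here is a toy example (made up for illustration, not part of the network above): gradient descent on L(a) = (a - 3)^2, whose derivative is dL/da = 2*(a - 3). With k = 0.1 the steps shrink as the slope flattens and a settles at 3; change k to 1.1 and the updates overshoot further each step and diverge.

#include <stdio.h>

int main(void){
    float a = 0.0f;
    float k = 0.1f; // learning coefficient - try 1.1 to watch it diverge
    for(int i = 0; i < 20; i++){
        float dL_da = 2.0f * (a - 3.0f); // slope of L(a) = (a - 3)^2
        a -= k * dL_da;                  // the update rule from above
        printf("step %2d: a = %f\n", i, a);
    }
    return 0;
}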

To wrap this up, in our two-parameter case f(x, a, b) = ax+b the error function's graph lives in 3D: the first two axes are a and b, the third axis is L (we plot L(a, b)). On that chart we have a surface, and the two partial derivatives tell us which way to go along each axis.

The two derivatives in our case are:

dL/da = dL/df * df/da
dL/db = dL/df * df/db

L = L( f(x), y ) = ( f - y )^2    // we write L in terms of f so we can use the chain rule

df/da = x           df/db = 1

dL/df = 2( f - y )
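
A quick way to convince yourself these formulas are right (a common sanity check, not something the text above requires) is to compare the analytic derivative against a finite difference:

#include <stdio.h>

float f(float x, float a, float b){ return a * x + b; }
float L(float result, float target){ return (result - target) * (result - target); }

int main(void){
    float x = 2.0f, y = 5.0f, a = 1.0f, b = 0.0f; // made-up sample values

    // Analytic: dL/da = dL/df * df/da = 2*(f - y) * x
    float analytic = 2.0f * (f(x, a, b) - y) * x;

    // Numeric: nudge a by a tiny eps and measure how much L moves
    float eps = 1e-3f;
    float numeric = (L(f(x, a + eps, b), y) - L(f(x, a - eps, b), y)) / (2.0f * eps);

    printf("analytic dL/da = %f, numeric dL/da = %f\n", analytic, numeric);
    return 0;
}

The two numbers should agree to several decimal places; the same trick works for dL/db.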

So each iteration we adjust the weights like this:

// This is done for each training iteration (epoch)
float k = 0.01;

float dL_da = 0;
float dL_db = 0;

for(int i = 0; i < N_POINTS; i++){
    float dL_df = 2 * (result[i] - y[i]); // dL/df = 2(f - y)
    dL_da += dL_df * x[i];                // times df/da = x
    dL_db += dL_df * 1;                   // times df/db = 1
}

// Average the gradients so they match the mean loss from step 2
dL_da /= N_POINTS;
dL_db /= N_POINTS;

a -= k * dL_da;
b -= k * dL_db;

If we do this enough times, a and b will eventually get very close to the values that best fit the data.
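
Putting the three steps together, a minimal self-contained version might look like this (my own sketch, not necessarily identical to the C file mentioned at the top; the training data is made up and sampled from y = 2x + 1):

#include <stdio.h>

#define N_POINTS 4
#define N_EPOCHS 1000

float f(float x, float a, float b){ return a * x + b; }

int main(void){
    // Made-up training data sampled from y = 2x + 1
    float x[N_POINTS] = { 0.0f, 1.0f, 2.0f, 3.0f };
    float y[N_POINTS] = { 1.0f, 3.0f, 5.0f, 7.0f };
    float result[N_POINTS];

    float a = 0.0f, b = 0.0f; // network parameters
    float k = 0.01f;          // learning coefficient

    for(int epoch = 0; epoch < N_EPOCHS; epoch++){
        // Step 1: run the data through the network
        for(int i = 0; i < N_POINTS; i++)
            result[i] = f(x[i], a, b);

        // Step 2: mean loss, used only for monitoring
        float loss = 0;
        for(int i = 0; i < N_POINTS; i++)
            loss += (result[i] - y[i]) * (result[i] - y[i]);
        loss /= N_POINTS;

        // Step 3: gradient descent update
        float dL_da = 0, dL_db = 0;
        for(int i = 0; i < N_POINTS; i++){
            float dL_df = 2 * (result[i] - y[i]);
            dL_da += dL_df * x[i];
            dL_db += dL_df;
        }
        dL_da /= N_POINTS;
        dL_db /= N_POINTS;
        a -= k * dL_da;
        b -= k * dL_db;

        if(epoch % 100 == 0)
            printf("epoch %4d: loss = %f  a = %f  b = %f\n", epoch, loss, a, b);
    }
    printf("final: a = %f, b = %f (data came from a = 2, b = 1)\n", a, b);
    return 0;
}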