- How do these two work together in neural networks?
- What is the basic math involved?
Forward Propagation
As the name suggests, the input data is fed through the network in the forward direction. Each hidden layer accepts the input, processes it according to its activation function, and passes the result to the next layer.
To generate an output, the data must flow in the forward direction only. If the data flowed in the reverse direction during output generation, it would form a cycle and an output could never be produced. Networks configured this way are known as feed-forward networks, and they are what enable forward propagation.
At each neuron in a hidden or output layer, the processing happens in two steps:
- Pre-activation: a weighted sum of the inputs, i.e., a linear transformation of the inputs by the weights. Based on this aggregated sum and the activation function, the neuron decides whether to pass the information further.
- Activation: the calculated weighted sum of inputs is passed to the activation function, a mathematical function that adds non-linearity to the network. Four commonly used and popular activation functions are sigmoid, hyperbolic tangent (tanh), ReLU, and softmax.
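The two steps above can be sketched for a single neuron. This is a minimal illustration using NumPy, with sigmoid as the activation; the input, weight, and bias values are made-up examples.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    # Pre-activation: weighted sum of inputs plus bias (linear transformation)
    z = np.dot(w, x) + b
    # Activation: non-linear transform of the weighted sum
    a = sigmoid(z)
    return z, a

x = np.array([0.5, -1.0, 2.0])   # example inputs
w = np.array([0.1, 0.4, -0.2])   # example weights
b = 0.3                          # example bias
z, a = neuron_forward(x, w, b)   # z = -0.45, a = sigmoid(-0.45)
```

Stacking many such neurons side by side gives a layer, and chaining layers gives the full forward pass.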
Backward Propagation
Back-propagation is the essence of neural net training. It is the practice of fine-tuning the weights of a neural net based on the error rate (i.e., loss) obtained in the previous epoch (i.e., iteration). Proper tuning of the weights ensures lower error rates, making the model reliable by increasing its generalization.
Initially the model does not give accurate predictions, which is attributed to the fact that its weights have not yet been tuned; the loss quantifies that error. Back-propagation is all about feeding this loss backwards through the network so that the weights can be fine-tuned accordingly. An optimization function such as gradient descent then finds the weights that will, hopefully, yield a smaller loss in the next iteration.
The overall steps are:
- In the forward propagation stage, the data flows through the network to produce the outputs
- The loss function is used to calculate the total error
- Then, we use the backward propagation algorithm to calculate the gradient of the loss function with respect to each weight and bias
- Finally, we use gradient descent to update the weights and biases at each layer
- We repeat the above steps to minimize the total error of the neural network
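The full loop above can be sketched end to end on a toy problem. This is an illustrative example, not a general framework: a single linear neuron is trained with mean squared error to recover the made-up target function y = 2x + 1, with gradients derived by hand via the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inputs in [-1, 1], targets from y = 2x + 1 (assumed example)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # untuned weights give inaccurate predictions at first
lr = 0.5          # learning rate for gradient descent

for epoch in range(200):
    # 1. Forward propagation: data flows through to the outputs
    y_hat = X * w + b
    # 2. Loss function: total error as mean squared error
    error = y_hat - y
    loss = np.mean(error ** 2)
    # 3. Backward propagation: gradients of the loss w.r.t. w and b
    grad_w = np.mean(2.0 * error * X)
    grad_b = np.mean(2.0 * error)
    # 4. Gradient descent: update the parameters
    w -= lr * grad_w
    b -= lr * grad_b
    # 5. Repeat until the total error is minimized
```

After training, w and b approach 2 and 1 respectively, and the loss approaches zero.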
Using the input variables x and y, the forward pass calculates the output z as a function of x and y, i.e., z = f(x, y).
During the backward pass, on receiving dL/dz (the derivative of the total loss L with respect to the output z), we can calculate the individual gradients of the loss with respect to x and y by applying the chain rule, as shown in the figure.
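The chain rule step can be traced by hand for a concrete choice of f. Here f(x, y) = x * y is picked as a simple example node, and the upstream gradient dL/dz is an assumed value for illustration.

```python
# Forward pass: z = f(x, y) = x * y (a simple example node)
x, y = 3.0, -2.0
z = x * y            # forward output: -6.0

# Backward pass: suppose the upstream gradient dL/dz arrives from the loss
dL_dz = 5.0          # assumed value for illustration

# Chain rule: multiply the local gradient by the upstream gradient
dz_dx = y            # local gradient: for z = x*y, dz/dx = y
dz_dy = x            # local gradient: for z = x*y, dz/dy = x
dL_dx = dL_dz * dz_dx   # 5.0 * (-2.0) = -10.0
dL_dy = dL_dz * dz_dy   # 5.0 *   3.0  =  15.0
```

Each node in a computational graph repeats exactly this pattern, which is why the loss can be propagated backwards through arbitrarily deep networks.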