Forward pass¶
As stated earlier, the network can be seen as a function \(h: \mathbb{R}^{m} \rightarrow \mathbb{R}^{k}\). In the case of a multi-class classification problem, given a data-point \(\boldsymbol{x} \in \mathbb{R}^m\), the network outputs a vector of probabilities: $$ h(\boldsymbol{x}) = \boldsymbol{\hat{y}} $$ If the feature matrix \(\boldsymbol{X}\) of size \(n \times m\) is passed as input, we get a matrix of probabilities, \(\boldsymbol{\hat{Y}}\), of size \(n \times k\). For a regression problem, \(k = 1\) and the output is interpreted as a real number. This input-output relationship can be computed in an iterative, layer-by-layer manner; this computation is termed a forward pass.
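To make the shapes concrete, here is a minimal NumPy sketch; the single weight matrix `W` and the row-wise softmax are purely illustrative stand-ins for the network \(h\), not the architecture described below:

```python
import numpy as np

n, m, k = 5, 4, 3                        # data-points, features, classes

X = np.random.randn(n, m)               # feature matrix of size n x m
W = np.random.randn(m, k)               # illustrative parameters, not the real network

def h(X):
    """Hypothetical network: a linear map followed by a row-wise softmax."""
    scores = X @ W                       # size n x k
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

Y_hat = h(X)                             # matrix of probabilities, size n x k
print(Y_hat.shape)                       # (5, 3)
print(Y_hat.sum(axis=1))                 # each row sums to 1
```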
Activations¶
First, we look at what happens at a single layer of the network. At any layer, three steps are performed in the following sequence:
- Accept input
- Linearly combine the inputs
- Apply the activation function
The outputs of a layer are called the activations at that layer. The linear combinations of the inputs to a layer are called the pre-activations.
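A minimal NumPy sketch of these three steps for one layer; the sigmoid activation and the sizes used here are assumptions chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, S_prev, S_l = 5, 4, 3                 # assumed sizes: data-points, previous layer, this layer

A_prev = np.random.randn(n, S_prev)      # step 1: accept input (activations of the previous layer)
W = np.random.randn(S_prev, S_l)
b = np.random.randn(S_l)

Z = A_prev @ W + b                       # step 2: linear combination -> pre-activations, size n x S_l
A = sigmoid(Z)                           # step 3: apply the activation function -> activations
```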
\(\boldsymbol{Z_l}\) and \(\boldsymbol{A_l}\) represent the matrices of pre-activations and activations respectively at layer \(l\) for \(0 \leq l \leq L\). \(\boldsymbol{A_0} = \boldsymbol{X}\), the feature matrix. The activation matrix \(\boldsymbol{A_l}\) is of size \(n \times S_l\). Each row of the activation matrix corresponds to the activation vector for one of the \(n\) data-points.
Algorithm¶
If \(\boldsymbol{A_{l - 1}}\) represents the activation matrix at layer \(l - 1\), then the activations at layer \(l\) can be computed iteratively using the following pair of equations: $$ \boldsymbol{Z_{l}} = \boldsymbol{A_{l - 1}} \boldsymbol{W_{l}} + \boldsymbol{b_{l}} $$ $$ \boldsymbol{A_{l}} = g(\boldsymbol{Z_{l}}) $$
Here, \(\boldsymbol{A_0} = \boldsymbol{X}\). The shapes of these matrices/vectors are as follows:
- \(\boldsymbol{A_{l - 1}}\): \(n \times S_{l - 1}\)
- \(\boldsymbol{W_l}\): \(S_{l - 1} \times S_{l}\)
- \(\boldsymbol{b_l}\): \(S_l\)
- \(\boldsymbol{Z_l}\): \(n \times S_l\)
- \(\boldsymbol{A_{l}}\): \(n \times S_{l}\)
Note that \(\boldsymbol{b_{l}}\) gets added to each row of the product \(\boldsymbol{A_{l - 1} W_{l}}\) according to NumPy
broadcasting rules. \(g\) is the hidden-layer activation function for \(1 \leq l \leq L - 1\) and the output-layer activation function for \(l = L\). The final shape of the output activations at layer \(L\) is:
- \(n\) for regression and binary classification problems
- \(n \times k\) for a multi-class classification problem
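Putting the recurrence and the shapes together, here is a minimal NumPy sketch of the forward pass; the sigmoid hidden activation, the softmax output activation, and the layer sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def forward(X, weights, biases, g_hidden=sigmoid, g_output=softmax):
    """Compute A_L = Y_hat by iterating Z_l = A_{l-1} W_l + b_l, A_l = g(Z_l)."""
    A = X                                          # A_0 = X
    L = len(weights)
    for l in range(L):
        Z = A @ weights[l] + biases[l]             # b_l broadcasts over the n rows
        A = g_hidden(Z) if l < L - 1 else g_output(Z)
    return A                                       # A_L = Y_hat

# Illustrative sizes: S_0 = m = 4 features, one hidden layer of 5 units, k = 3 classes
n, sizes = 6, [4, 5, 3]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[l + 1]) for l in range(len(sizes) - 1)]

Y_hat = forward(rng.standard_normal((n, sizes[0])), weights, biases)
print(Y_hat.shape)                                 # (6, 3) -- n x k for multi-class classification
```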
According to our notation, \(\boldsymbol{A_L} = \boldsymbol{\hat{Y}}\). The algorithm can now be specified as given below: