Introduction
Artificial Neural Networks are a type of machine learning algorithm inspired by the structure and function of the human brain. They are composed of a large number of interconnected processing nodes, called artificial neurons, which work together to solve complex problems.
The math behind ANNs is based on linear algebra and calculus, specifically matrix operations and optimization techniques. At its core, an ANN is a series of matrix multiplications and nonlinear activation functions.
The input data is passed through the layers of the network, where it is multiplied by a set of weights, and then passed through an activation function. The output of each layer is then passed as input to the next layer, until the final output is produced.
The goal of training an ANN is to find the best set of weights for the network, so that it can accurately predict the output for a given input. This is done by minimizing the error between the predicted output and the true output, using optimization techniques such as gradient descent.
The backpropagation algorithm is used to calculate the gradient of the error with respect to the weights, which is then used to update the weights in the opposite direction of the gradient, so as to minimize the error.
In short, the math behind ANNs combines matrix operations, nonlinear activation functions, and optimization techniques to find the set of weights that lets the network accurately predict the output for a given input.
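As a sketch of this layer-by-layer computation, here is a tiny forward pass in Python. The layer sizes, weights, and biases are made up for illustration, and a sigmoid activation is assumed:

```python
import math

def sigmoid(x):
    # Nonlinear activation: squashes any real number into (0, 1)
    return 1 / (1 + math.exp(-x))

def layer(inputs, weights, biases):
    # One layer: for each neuron, a weighted sum of the inputs
    # plus a bias, passed through the activation function
    return [sigmoid(sum(w * a for w, a in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical 3-input -> 2-hidden -> 1-output network
x = [1.0, 0.0, 1.0]
hidden = layer(x, weights=[[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]],
               biases=[0.0, 0.1])
output = layer(hidden, weights=[[0.7, -0.3]], biases=[0.05])
```

The output of the first layer is fed as input to the next, exactly as described above.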
Application Example
Let's take an example to see how an ANN[^1] works.
| Obesity | Exercise | Smoking | Diabetic |
|---------|----------|---------|----------|
| 1       | 0        | 0       | 1        |
| 0       | 1        | 0       | 0        |
| 0       | 0        | 1       | 0        |
| 1       | 1        | 0       | 1        |
In the previous table, the value 1 represents true and the value 0 represents false. Notice that, in these examples, a person is diabetic exactly when they are obese. How can we create a program that learns from these four examples and predicts whether a new person has diabetes? This can be achieved by designing an Artificial Neural Network (ANN) with three input neurons corresponding to the Obesity, Exercise, and Smoking columns, and a single output neuron corresponding to the Diabetic column.
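For illustration, the four rows of the table can be encoded as input/output pairs in Python (the variable names are our own):

```python
# Each input is (Obesity, Exercise, Smoking); each output is Diabetic
training_inputs = [
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (1, 1, 0),
]
training_outputs = [1, 0, 0, 1]
```

In this tiny dataset the Diabetic column coincides with the Obesity column, which is the pattern we expect the network to discover.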
input layer          output layer

   ○ ─────────┐
              │
   ○ ─────────┼───────▶ ○
              │
   ○ ─────────┘
Neural Network Operation
The input values are passed to the input neurons, and the network processes them to determine whether the person is diabetic. Each connection between neurons carries a unique value called a weight, and each neuron holds a value calculated from the previous neurons' values and the weights. The ultimate goal of the neural network is to find the weights that produce accurate predictions.

To achieve this, the network repeats several steps. First, it computes the values of the output layer's neurons, known as the prediction. Then it calculates the difference between the prediction and the actual output, and adjusts the weights accordingly to increase the accuracy of the prediction. This process is repeated until the network reaches an acceptable level of accuracy.

Once the neural network is trained, it can be used to predict whether a person has diabetes by inputting values such as 1, 1, and 1 (indicating that the person is obese, exercises, and smokes); the network will then provide an answer indicating whether the person is diabetic.
A. Forward propagation
Forward Propagation is the process of passing input data through the network to produce the output. The input data is passed through the layers of the network, where it is multiplied by a set of weights and then passed through an activation function. The output of each layer is then passed as input to the next layer, until the final output is produced.
As previously mentioned, each neuron in an Artificial Neural Network holds a value that ranges between 0 and 1, which is calculated using the values of the previous neurons, the weights, and a bias term. This section will delve into the details of how these values are calculated.
Notation:
- Weights: $w_k$
- Input-layer activations: $a^{(l-1)}_1, a^{(l-1)}_2, a^{(l-1)}_3$
- Output-layer activation: $a^{(l)}_1$
- Activation function: $\sigma$
- Layer number: $l$
Let's introduce the notation $z^{(l)}_1$ for the bias added to the dot product of the weights $w_k$ and the values of the previous layer's neurons $a^{(l-1)}_k$:
$$ z^{(l)}_1=b^{(l)}_1+\sum_{k=1}^{n^{(l-1)}} a^{(l-1)}_k w_k $$
In order to keep the values in the neurons between 0 and 1, we are going to use the sigmoid function, whose output always lies between 0 and 1, denoted:
$$ \sigma(x)=\frac{1}{1+e^{-x}} $$
Thus,
$$ a^{(l)}_1=\sigma\left(z^{(l)}_1\right) \newline a^{(l)}_1 \in \lbrack {0,1} \rbrack $$
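A quick numerical check of the sigmoid in Python confirms that its output stays between 0 and 1:

```python
import math

def sigmoid(x):
    # 1 / (1 + e^(-x)): large negative inputs approach 0,
    # large positive inputs approach 1
    return 1 / (1 + math.exp(-x))

print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```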
This computation can be visualised in matrix form:
$$ a^{(l)}_1=\sigma\left(\begin{bmatrix}w_1 & w_2 & w_3\end{bmatrix}\begin{bmatrix}a^{(l-1)}_1 \\ a^{(l-1)}_2 \\ a^{(l-1)}_3\end{bmatrix}+\begin{bmatrix}b^{(l)}_1\end{bmatrix}\right) $$
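For a single output neuron, the matrix form above reduces to a dot product. A sketch with hypothetical weights and bias:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

a_prev = [0.9, 0.2, 0.4]   # activations of layer l-1 (hypothetical)
w = [0.5, -1.0, 0.25]      # weights w_1, w_2, w_3 (hypothetical)
b = 0.1                    # bias b_1 of layer l (hypothetical)

# z = w . a + b, then a = sigma(z)
z = sum(wk * ak for wk, ak in zip(w, a_prev)) + b
a_out = sigmoid(z)
```

Whatever the weights, the final activation always lands between 0 and 1 thanks to the sigmoid.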
B. Back propagation
Backpropagation is the process of adjusting the weights of the network in order to minimize the error between the predicted output and the true output. The backpropagation algorithm is used to calculate the gradient of the error with respect to the weights. This gradient is then used to update the weights in the opposite direction of the gradient, so as to minimize the error.
- Cost function

Let's take a simple example of a neural network with a single input neuron and a single output neuron. $$ a^{(l-1)} \to a^{(l)} $$
Once again,
$$ z^{(l)}=b+a^{(l-1)}w \newline a^{(l)}=\sigma\left(z^{(l)}\right) $$
The goal is now to adjust the weights to make the prediction more accurate. Let's introduce the cost function, which calculates the squared difference between the prediction and the actual output $y$.
$$ C_1\left(a^{(l)},y\right)=\left(a^{(l)}-y\right)^2 $$
The accuracy of predictions made by an Artificial Neural Network (ANN) can be measured by the cost function. Mathematically, the goal is to minimize this function. The smaller the value of the cost function, the more accurate the predictions of the ANN are considered to be.
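The cost function $C_1$ is straightforward to express in code; a minimal sketch:

```python
def cost(prediction, target):
    # Squared difference between the prediction and the true output
    return (prediction - target) ** 2

print(cost(0.9, 1))  # small cost: the prediction is close to the target
print(cost(0.1, 1))  # large cost: the prediction is far from the target
```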
- Gradient descent
Now we need to understand how sensitive the cost function is to small changes in the weight $w$, because, as described in Neural Network Operation above, the goal is to adjust the weights. Thus, we will determine the partial derivative of $C_1$ with respect to $w$ using the chain rule.
$$ {\frac{\partial C_1}{\partial w}=\frac{\partial C_1}{\partial a^{(l)}} \frac{\partial a^{(l)}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial w}} $$
Indeed,
$$ \frac{\partial C_1}{\partial a^{(l)}}=2\left(a^{(l)}-y\right) $$
$$ \frac{\partial a^{(l)}}{\partial z^{(l)}}=\frac{\partial \sigma\left(z^{(l)}\right)}{\partial z^{(l)}}=\sigma'\left(z^{(l)}\right) $$
$$ \frac{\partial z^{(l)}}{\partial w}=\frac{\partial \left(b+a^{(l-1)}w\right)}{\partial w}=a^{(l-1)} $$
All together, it gives us
$$ \frac{\partial C_1}{\partial w}=2\left(a^{(l)}-y\right) \sigma'\left(z^{(l)}\right) a^{(l-1)} $$
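This gradient can be computed directly in code. The sketch below uses the standard identity $\sigma'(x)=\sigma(x)\left(1-\sigma(x)\right)$ for the sigmoid derivative (not derived in this article); the input values are hypothetical:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dcost_dw(a_prev, w, b, y):
    # Chain rule: dC1/dw = 2 * (a - y) * sigma'(z) * a_prev
    z = a_prev * w + b
    a = sigmoid(z)
    sigma_prime = a * (1 - a)  # derivative of the sigmoid at z
    return 2 * (a - y) * sigma_prime * a_prev
```

With a target of $y=1$ and an activation below 1, the gradient is negative, so stepping against the gradient increases the weight, as expected.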
We will apply this formula repeatedly to calculate the adjustments to make to the weight, until the predictions are accurate.
$$ w=w-\alpha \frac{\partial C_1}{\partial w} $$
where $\alpha$ is the learning rate.
In summary, forward propagation is the process of passing input data through the network to produce the output, while backpropagation is the process of adjusting the weights of the network to minimize the error between the predicted output and the true output.
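Tying the two processes together, here is a minimal end-to-end sketch in Python that trains a 3-input, 1-output network on the four examples from the table. The learning rate, number of iterations, and zero-valued initial weights are arbitrary choices, and the sigmoid-derivative identity $\sigma'(z)=\sigma(z)\left(1-\sigma(z)\right)$ is assumed:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Training data from the table: (Obesity, Exercise, Smoking) -> Diabetic
inputs = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
targets = [1, 0, 0, 1]

w = [0.0, 0.0, 0.0]  # one weight per input neuron
b = 0.0              # bias of the output neuron
alpha = 0.5          # learning rate

for _ in range(10000):
    for x, y in zip(inputs, targets):
        # Forward propagation
        z = sum(wk * xk for wk, xk in zip(w, x)) + b
        a = sigmoid(z)
        # Backpropagation: dC1/dz = 2 * (a - y) * sigma'(z)
        delta = 2 * (a - y) * a * (1 - a)
        # Gradient descent: move each weight against its gradient
        w = [wk - alpha * delta * xk for wk, xk in zip(w, x)]
        b -= alpha * delta

def predict(x):
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
```

After training, the network has learned that the Diabetic column tracks the Obesity column, so a query such as `predict((1, 1, 1))` yields a value near 1, while `predict((0, 1, 1))` yields a value near 0.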
Footnotes

[^1]: Artificial Neural Network