Introduction to Deep Learning

AI, But Simple Issue #27

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

Deep learning is a subset of machine learning that uses neural networks, allowing a machine to learn complicated patterns and relationships within data.

It is a pivotal technology behind innovations like Large Language Models (LLMs), self-driving cars, Natural Language Processing (NLP), and object detection.

Deep learning focuses on Artificial Neural Networks (ANNs) with many layers, often referred to as deep neural networks.

These networks are capable of learning complex patterns from data by stacking multiple layers (hence the term “deep”) with many neurons.

Neural networks do not require explicit programming to handle specific conditions but instead rely on training to learn difficult patterns.

This makes them ideal for solving a wide range of problems in the real world.

Neural networks are function approximators. They approximate a function based on the input data and true outputs (labels) you give them, using an optimization algorithm that learns based on how close the neural network’s output is to the true output.

During training, the neural network learns the function that transforms the input data into the true outputs.

To put this into perspective, we’ll use an example. Suppose you have a function: if you are given the function explicitly, it is guaranteed to produce the correct output for any input.

But what if you weren’t given the function at all and instead had some data points, and you wanted to guess what the function looked like?

For instance, there is no explicit function that can tell you whether a tumor is malignant or benign. But using neural networks, we can give the network some data and tell it to make some sense of it.

If we train it well, then it can predict whether a given tumor is malignant or benign with decent accuracy, good enough for real-world applications (while not being 100% accurate).

Neural Network Structure

Neural networks are made up of an input layer, multiple hidden layers, and one output layer.

  • Input Layer: Receives the processed input data (text, audio, images) in a numerical representation, with one neuron per input value.

  • Hidden Layers: Contain neurons that mathematically transform the data to learn the underlying patterns. The word "deep" in deep networks and deep learning comes from stacking many of these layers together.

  • Output Layer: Produces the final output (some prediction or classification).

The specific name for the network that we’ll be studying is the fully-connected feed-forward network: fully-connected because each neuron in one layer has a connection to every neuron in the next layer, and feed-forward because data passes forward through the network.
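To make this concrete, here is a minimal NumPy sketch of the parameters of a fully-connected network; the layer sizes (5 inputs, 3 hidden neurons, 1 output) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 5 input features, 3 hidden neurons, 1 output.
layer_sizes = [5, 3, 1]

# In a fully-connected network, each layer's weight matrix has one row per
# neuron in that layer and one column per neuron in the previous layer.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

for W, b in zip(weights, biases):
    print(W.shape, b.shape)  # (3, 5) (3,) then (1, 3) (1,)
```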

Forward Propagation

During forward propagation (or the forward pass), the neural network mathematically transforms the input data as it passes through each layer, finally ending up as a relevant output at the output layer.

  • As data passes from layer to layer, patterns are learned, and the neurons transform the data into abstract representations.

To better understand how the forward propagation works, let’s do an example with some math.

To make this example simple, we’ll use 5 input neurons and 1 neuron in the current layer. It will teach us how data is passed through neurons and transformed.

Given the input data from the input neurons, we want to mathematically transform it and output it to the next neurons.

To do this, the first thing to calculate is the “pre-activation”, denoted as z:

z = w · x + b = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + w₅x₅ + b

Above, z is computed by taking the sum of all of the weight values (weight vector w) multiplied by the corresponding input values (input vector x), then adding a bias b.

In simple terms, we would have to perform a matrix multiplication, then add the bias:

z = Wx + b

Here, the weight matrix W is 1 × 5: one row for the single neuron in the current layer and one column for each of the 5 input neurons. The weight matrix only has this shape if the network is fully connected.
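Here is a small NumPy sketch of the pre-activation computation; the input values, weights, and bias are made up for illustration:

```python
import numpy as np

# Illustrative values: 5 inputs feeding a single neuron.
x = np.array([0.5, -1.2, 3.0, 0.0, 2.1])   # input vector
w = np.array([0.4, 0.3, -0.1, 0.8, 0.05])  # one weight per input connection
b = 0.2                                     # bias

# Pre-activation: dot product of the weights and inputs, plus the bias.
z = np.dot(w, x) + b
print(z)  # about -0.155
```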

After calculating the pre-activation (z), we pass it through an activation function to further transform the data and introduce some nonlinearity.

Although the ReLU activation function is used here, it can be replaced with any other activation function, such as sigmoid, tanh, GELU, ELU, or softmax.

  • Other activation functions transform the pre-activation differently; activation functions are used on a case-by-case basis.
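As a quick sketch, here are ReLU and a couple of the alternatives mentioned above, applied to the pre-activation from the earlier example (the value is illustrative):

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and clips negatives to zero: max(0, z).
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = -0.155  # the pre-activation from the earlier example
print(relu(z))     # 0.0 (the negative value is eliminated)
print(np.tanh(z))  # about -0.154
print(sigmoid(z))  # about 0.461
```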

Here’s a diagram of a neuron (in a simple neural network) to better visualize what happens:

Here, the x vector is given as the input connections and is multiplied elementwise with the w vector; the products are summed and added to a bias term.

This sum is passed through a ReLU activation function, which takes the maximum of the sum and 0 (eliminating negative values).

After the data is passed through the activation function, the node then sends the output along its outgoing connections (to the neurons in the next layer).

We repeat this for all neurons in each layer, finally obtaining a relevant output at the output layer.

  • This output could be a probability distribution of the class an image belongs to, a numerical prediction, or a sequence of values representing a sentence.
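Putting the pieces together, here is a minimal NumPy sketch of a full forward pass through a small fully-connected network; the layer sizes and random weights are illustrative, and the output layer is kept linear here (a real network would choose an output activation to match the task):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    # Pass the data through every layer: pre-activation, then activation.
    a = x
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b                   # pre-activation for the whole layer
        a = relu(z) if i < last else z  # ReLU in hidden layers; linear output
    return a

# Illustrative sizes, as before: 5 inputs -> 3 hidden neurons -> 1 output.
sizes = [5, 3, 1]
weights = [rng.standard_normal((o, i)) for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

x = rng.standard_normal(5)
print(forward(x, weights, biases))  # the network's output for this input
```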

For a more in-depth math explanation of forward propagation, consider upgrading for more features!

Loss Function

The network learns by comparing its output to the actual target value.

This measure of error (the difference between the output and the actual target value) can be represented through loss functions.

Some examples include Mean Squared Error (MSE) for regression (numerical outputs) and Cross-Entropy for classification (categorical outputs).
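As a sketch, here is how these two loss functions can be computed in NumPy (the predictions and targets are made up for illustration):

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean Squared Error: average squared difference, used for regression.
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(probs, true_class):
    # Cross-entropy: penalizes assigning low probability to the true class.
    return -np.log(probs[true_class])

print(mse(np.array([2.5, 0.0]), np.array([3.0, -0.5])))  # 0.25
print(cross_entropy(np.array([0.7, 0.2, 0.1]), 0))       # about 0.357
```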

Backpropagation

Through the loss function, we obtain the error, which we use as a metric to train the network.

The “training” occurs when the parameter values of the neural network are changed, with some common parameters being weight matrices and bias vectors (the “weights” and “biases”).

These values are tweaked using an optimization algorithm (known as an optimizer) during the backward pass, with backpropagation computing the gradients that these algorithms rely on.

Backpropagation gets its name because errors from the loss function are propagated backward through the network, computing gradients from the last layer to the first using the chain rule from calculus.

Some common optimization methods include gradient descent, mini-batch gradient descent, RMSprop, and Adam.

After the parameters are adjusted, we repeat the cycle with more forward and backward passes over many iterations (a full pass over the training data is called an epoch), looking to improve the network’s accuracy and lower the loss with each pass.
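To make the training loop concrete, here is a minimal NumPy sketch of gradient descent on a single-neuron (linear) model with an MSE loss; the data and hyperparameters are made up, and the gradients are derived by hand via the chain rule rather than by an autodiff library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: the underlying function is y = 2x + 1 plus noise.
x = rng.standard_normal(100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0   # parameters to learn (a single weight and bias)
lr = 0.1          # learning rate (an illustrative hyperparameter)

for epoch in range(200):
    # Forward pass: predictions, then the MSE loss.
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)

    # Backward pass: gradients of the loss w.r.t. w and b via the chain rule.
    grad_pred = 2.0 * (y_pred - y) / len(x)  # dL/dy_pred
    grad_w = np.sum(grad_pred * x)           # dL/dw
    grad_b = np.sum(grad_pred)               # dL/db

    # Gradient descent update: step opposite to the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should end up near 2.0 and 1.0
```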

This way, the model learns to recognize patterns and generalize from the data effectively.

Here’s a special thanks to our biggest supporters:

Sushant Waidande

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team