All About Activation Functions
AI, But Simple Issue #28
Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
Activation Functions, Simply Explained
Activation functions are mathematical functions in deep learning models placed between layers, playing a critical role in the performance and capabilities of neural networks.
They introduce non-linearity into the model, enabling it to learn and represent complex patterns in data.
This non-linearity allows models to perform popular deep learning tasks like image recognition, natural language processing (NLP), and, more recently, text generation with LLMs.
Without activation functions, neural networks would be limited to linear mappings: their decision boundaries would remain linear no matter how many layers are stacked. Activation functions enable networks to model complicated non-linear relationships.
Mathematically speaking, an activation function is an equation that determines the output of a neuron given an input or set of inputs.
The activation function “decides” whether (and how strongly) a neuron should be activated, based on the weighted sum of its inputs plus a bias.
Where are activation functions used in neural networks?
Activation functions are used in a network’s hidden layers to add non-linearity and to mathematically transform the data.
Each and every neuron in the hidden layers of a neural network applies an activation function.
To further understand the activation function’s importance in the hidden layer, here’s how a hidden layer is composed: a linear transformation (the pre-activation) followed by a non-linear activation function.
The linear transformation differs depending on the type of model being used:
In Multi-Layer Perceptrons (MLPs), the linear transformation will be the perceptron’s pre-activation (Wx+b).
In CNNs, it takes the form of the convolutional layer (a convolution with weights W plus a bias, playing the role of Wx + b).
In RNNs, it will be the hidden state pre-activation (a form of Wx + b including timesteps).
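In equation form:

z = Wx + b
a = g(z)

where g is the activation function.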
Above, the math representation of this hidden layer is for an MLP, with the pre-activation (z) represented using vectors: W is the weight matrix, b is the bias vector, and x is the input vector.
Activation functions are also used at the output layer. The activation function used in the output layer depends on the type of problem since it transforms the network's output into a desired format.
If the problem is a regression problem, we would typically use a linear activation function and simply take the output from the last hidden layer as the numerical prediction.
If the problem is a classification problem:
For multiple classes (multi-class), we will often use the Softmax activation function, generating a probability distribution over all possible classes.
For binary classification, we can use the Sigmoid activation function.
In most classification settings, whether the model is an RNN, a Transformer, or a CNN, the output is passed through Softmax.
The mathematical process behind the output layer runs the same as the hidden layer. The previous layer input is mathematically transformed by the pre-activation, then passed through an activation function—only this time, the output is final.
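As a minimal sketch of this flow, here is a ReLU hidden layer followed by a Softmax output layer in plain numpy (the layer sizes are made up for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input vector with 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # hidden layer: 8 neurons
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # output layer: 3 classes

h = relu(W1 @ x + b1)          # hidden layer: pre-activation, then activation
probs = softmax(W2 @ h + b2)   # output layer: pre-activation, then Softmax
print(probs, probs.sum())      # a probability distribution summing to 1
```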
In deep learning, activation functions also enable deeper networks, allowing multiple layers to be stacked so that each layer can learn increasingly complex patterns in the data.
They can also control the output range, normalizing outputs to a specific range, which is useful for tasks like classification where outputs represent probabilities.
Since deep learning models rely heavily on activation functions, they have remained extremely popular and have been continually researched over the past 20 years.
In this issue, we’ll look at the foundational activation functions of deep learning along with some more recent activation functions that have brought a boost in performance over the years.
But before we get into that, we’re proud to announce that we’ve partnered with 1440 Media for this sponsored segment:
Fact-based news without bias awaits. Make 1440 your choice today.
Overwhelmed by biased news? Cut through the clutter and get straight facts with your daily 1440 digest. From politics to sports, join millions who start their day informed.
Linear Activation Functions
Linear Activation Function
The linear activation function is an activation function that outputs the same value as its input. In other words, the output is directly proportional to the input.
Due to this feature, the linear activation function does not introduce any non-linearity.
This means that when using multiple layers with linear activation functions, they collapse into an equivalent single-layer model.
One of the uses of the linear activation function (the identity function) is in skip (residual) connections in CNNs, allowing data to flow directly from one layer to a later one, bypassing intermediate layers and helping prevent the signal and its gradients from vanishing.
It is also used when the network needs to predict continuous values without bounding them, making it useful for regression tasks.
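Here is a quick numpy check of the "collapse" mentioned above (toy sizes, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# two stacked layers, both with linear (identity) activations
W1, b1 = rng.normal(size=(6, 5)), rng.normal(size=6)
W2, b2 = rng.normal(size=(3, 6)), rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# an equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: stacking added no expressive power
```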
Non-Linear Activation Functions
Non-linear activation functions inserted between layers are crucial for deep learning models to capture complex patterns in data and build non-linear decision boundaries.
Sigmoid Function
The sigmoid function is an s-shaped curve that ranges from 0 to 1, exclusive. It is a foundational activation function, used widely in early neural networks.
It is commonly used when the output represents a probability of belonging to a particular class, for binary classification.
It is only sometimes used between layers, since it bottlenecks performance compared to ReLU and other activation functions due to the vanishing gradient problem.
The vanishing gradient problem occurs because the sigmoid’s gradient becomes very small for large positive or negative inputs, slowing down learning in the earlier layers of the network.
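For reference, the sigmoid and its derivative are:

σ(x) = 1 / (1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))

The derivative peaks at 0.25 (at x = 0) and approaches 0 for large positive or negative x, so multiplying these small gradients across many layers shrinks them toward zero.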
Hyperbolic Tangent (Tanh) Function
Tanh is similar to the sigmoid function in shape (it is also an s-shaped curve); however, it is centered around 0, with a range from -1 to 1, exclusive.
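For reference:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2σ(2x) - 1

so tanh is essentially a rescaled, zero-centered version of the sigmoid.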
The tanh function helps center the data, potentially leading to faster convergence, although it still suffers from the vanishing gradient problem.
It is used specifically in the hidden layers (or hidden states) of RNNs for handling sequential data, not seeing much use outside of RNNs.
Rectified Linear Unit (ReLU)
The ReLU activation function is the most widely used activation function in all of deep learning, especially in hidden layers of deep neural networks.
The use of ReLU is especially well documented in convolutional neural networks (CNNs), where it has improved performance in image models for tasks like object detection and image segmentation.
For instance, ResNet, VGGNet, and AlexNet all use ReLU in their architecture.
It outputs zero for negative inputs and outputs the input for positive inputs (linear relationship), with a range between 0 inclusive and positive infinity exclusive.
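In symbols:

ReLU(x) = max(0, x)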
It is extremely simple and fast to compute, on top of alleviating the vanishing gradient problem.
However, even as this activation function seemingly solves the vanishing gradient problem, it creates another problem of its own: the dying ReLU problem.
When using the ReLU activation function, neurons can become inactive and output only zero if their inputs are always negative; since the gradient is also zero in that region, those neurons stop learning.
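Here is a tiny numpy illustration of that failure mode (the pre-activation values are made up for demonstration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for positive inputs, 0 otherwise

# a hypothetical neuron whose pre-activations happen to always be negative
z = np.array([-3.2, -0.7, -1.5])
print(relu(z))        # [0. 0. 0.] -> the neuron only ever outputs zero
print(relu_grad(z))   # [0. 0. 0.] -> zero gradient, so its weights stop updating
```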
The first three activation functions (Sigmoid, Tanh, and ReLU) are noted for their computational efficiency and occupy an important place in the history of neural networks.
One may opt to use these activation functions instead of more complicated ones to speed up training.
Softplus Function
Softplus is a smooth approximation of ReLU, introduced in 2001. It is an always positive, always increasing activation function with a range of 0 to infinity, exclusive.
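For reference:

Softplus(x) = ln(1 + e^x)

Its derivative is exactly the sigmoid function, which is why it behaves like a smoothed-out ReLU.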
Softplus is rarely used in practice, since it requires more computation without offering much of an advantage over even the standard ReLU.
It is sometimes used in models where a smooth activation function is beneficial for estimating probabilities, but not much elsewhere.
Leaky ReLU
Leaky ReLU was developed to keep a small, non-zero output (and gradient) for negative inputs, mitigating the inactive-neuron problem of the standard ReLU.
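In symbols:

Leaky ReLU(x) = x if x > 0, αx otherwise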
Above, α is a small constant, typically 0.01, so the left half of the ReLU function gains only a slight negative slope; it remains nearly horizontal.
The Leaky ReLU function ranges from negative infinity to positive infinity exclusive, and is used in deep neural networks with sparse data.
It is mainly used in models where negative pre-activations are common and the dying ReLU problem needs to be addressed.
Parametric ReLU (PReLU)
PReLU is extremely similar to the Leaky ReLU function, so much so that the functions appear to be the same.
However, here, α is a learnable parameter instead of a fixed constant.
The authors of the paper behind PReLU figured: “why not let the α in αx get learned instead of constantly tweaking and tuning the constant?”
PReLU allows the network to learn the optimal value of the negative slope, making it easier to tune than Leaky ReLU; however, it also introduces additional parameters, which increases the risk of overfitting.
PReLU is used in advanced computer vision models for image recognition tasks.
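As a rough sketch of how this looks in practice, here is PyTorch's built-in nn.PReLU with a made-up feature-map shape:

```python
import torch
import torch.nn as nn

# nn.PReLU stores the negative slope α as a learnable weight; with
# num_parameters equal to the channel count, each channel gets its own α.
prelu = nn.PReLU(num_parameters=16, init=0.25)

x = torch.randn(8, 16, 32, 32)   # e.g. a batch of CNN feature maps (illustrative shape)
y = prelu(x)                     # αx for negative entries, x for positive ones
print(prelu.weight.shape)        # torch.Size([16]) -> the learnable α values
```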
Exponential Linear Unit (ELU)
The ELU is an activation function that is identical to the ReLU for positive inputs but curves smoothly for negative inputs.
For negative inputs, it smoothly decreases until it approaches -α; the ELU has a range of -α to infinity, exclusive.
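In symbols:

ELU(x) = x if x > 0, α(e^x - 1) otherwise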
The ELU is accurate and often produces better results: it reduces the vanishing gradient problem and mitigates the dying ReLU problem while preserving ReLU-like behavior for positive inputs.
It is used in deep networks that need decently fast convergence and good performance.
Gaussian Error Linear Unit (GELU)
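The GELU is defined as:

GELU(x) = x · P(X ≤ x), where X ~ N(0, 1)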
Above, P(X ≤ x) is the cumulative distribution function (CDF) of the standard normal distribution, hence the term “Gaussian”.
The GELU is smooth and differentiable, combining the properties of dropout and ReLU to improve performance, although it is computationally expensive.
It is used in certain transformer models in natural language processing (NLP) such as BERT, GPT-2, and GPT-3 for natural language understanding and generation tasks.
The GELU function can also be approximated as:
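GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))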
This approximation is accurate enough to be used in practice to speed up computation.
Swish Function
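The Swish is defined as:

Swish(x) = x · σ(βx)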
Here, σ is the sigmoid function, and β is a parameter that is usually set to 1.
The Swish resembles the GELU but combines the properties of the sigmoid and ReLU functions to achieve a similar shape.
It can have a slight edge over ReLU in certain deep networks for image recognition tasks, while being more computationally complex and expensive.
It is implemented in Google deep learning models such as EfficientNet (and, as the hard-swish variant, MobileNetV3) for image classification tasks.
It’s interesting to see how activation functions have evolved in their use in neural networks, and specifically how researchers have leaned towards ReLU-based activation functions for optimal performance.
The ReLU activation function and its variants are commonly used in hidden layers across many domains, from image processing to language processing to speech recognition.
ReLU-based activations are also effective and efficient in deep networks, and modern networks have become increasingly deep in pursuit of performance.
Here’s a special thanks to our biggest supporters:
Sushant Waidande
If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!
Feedback, inquiries, advertising? Send us an email at [email protected].
If you like, you can also donate to our team to push out better newsletters every week!
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team