In a neural network, the activation function acts as the mathematical "gatekeeper" for every neuron. Conceptually, it decides whether a neuron should "fire" (activate) or remain dormant based on the weighted sum of inputs it receives. Without these functions, a neural network—no matter how many layers deep—would behave mathematically like a single linear regression model.
The power of these functions lies in non-linearity. Real-world data is messy and rarely fits into straight lines. Non-linear functions bend and twist the decision boundary, allowing the network to learn intricate patterns like the curve of a cat's ear or the sentiment of a sentence. Furthermore, the derivative (or gradient) is critical for training. It tells the network "which direction to move" to minimize error. If the derivative vanishes (becomes zero), the network stops learning.
We can trace the history of Deep Learning through these functions.
Generation 1 (Biological): Functions like Sigmoid and Tanh mimicked the firing rate of biological neurons. While intuitive, they failed in deep networks due to vanishing gradients.
Generation 2 (Computational): ReLU changed everything in 2012 (AlexNet). By being computationally "free" and providing a constant gradient, it enabled the Deep Learning boom.
Generation 3 (Probabilistic): Modern functions like GELU and Swish use smooth, probabilistic curves to optimize massive models like GPT-4 and EfficientNet.
| f(x) | f'(x) |
|---|---|
| 1 if x >= 0 else 0 | 0 |
The Step Function acts like a basic light switch. If the input is positive, the neuron fires (1); if negative, it stays off (0). While it laid the foundation for the "Perceptron" in the 1950s, it is essentially useless for modern deep learning. The fatal flaw lies in its derivative: the gradient is zero everywhere (except at the jump, where it is undefined). In the Backpropagation algorithm, the gradient tells the network how much to update its weights to reduce error. If the gradient is zero, the network assumes no changes are needed, and learning halts immediately.
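The step function and its (almost-everywhere) zero derivative can be sketched in a few lines of plain Python; the function names here are illustrative, not from any library:

```python
def step(x):
    # Heaviside step: the neuron "fires" (1) for non-negative input, stays off (0) otherwise
    return 1.0 if x >= 0 else 0.0

def step_grad(x):
    # The gradient is 0 everywhere it is defined (and undefined at the jump x = 0),
    # so backpropagation receives no signal and weight updates are always zero.
    return 0.0
```

Because `step_grad` returns 0 for every input, gradient descent leaves the weights untouched, which is exactly why learning halts.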
| f(x) | f'(x) |
|---|---|
| x | 1 |
Linear Activation (Identity) passes the input through unchanged ($f(x) = x$). It is typically used only in the output layer of regression models, such as predicting house prices, where the output is a continuous value. Using Linear activation in hidden layers is a critical error. Mathematical theory proves that a neural network composed of any number of linear hidden layers is functionally equivalent to a single linear layer. The network loses its ability to model complex, non-linear relationships (like curves) and becomes just a standard Linear Regression model.
| f(x) | f'(x) |
|---|---|
| 1 / (1 + e^-x) | f(x) * (1 - f(x)) |
The Sigmoid function takes any real value and squashes it into a range between 0 and 1. This probability-like output made it the default choice in early neural networks. However, for hidden layers, it has largely been abandoned due to the severe Vanishing Gradient Problem. If you observe the red derivative line, you will see it peaks at only 0.25 (at the center) and quickly drops to zero at the tails. In deep networks, these small gradients multiply together during backpropagation ($0.25 \times 0.25 \dots$), causing the error signal to become infinitesimally small. This prevents early layers from learning. Today, it is primarily used only in the final output layer for binary classification.
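A short plain-Python sketch makes the vanishing-gradient arithmetic concrete (the 0.25 peak below is the best case; saturated neurons contribute far less):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)), which peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

# Best-case gradient through 10 sigmoid layers: 0.25 multiplied 10 times
signal = 1.0
for _ in range(10):
    signal *= 0.25  # signal shrinks to 0.25**10, under one millionth
```

Even in this best case, ten layers shrink the error signal below $10^{-6}$, starving the early layers of any usable update.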
| f(x) | f'(x) |
|---|---|
| (e^x - e^-x) / (e^x + e^-x) | 1 - f(x)^2 |
Tanh is effectively a stretched Sigmoid that maps inputs to the range $[-1, 1]$. Because it is zero-centered (the average output is closer to 0), it generally allows for faster and easier optimization than Sigmoid. Furthermore, its derivative peaks at 1.0 (at $x=0$) rather than 0.25, providing a stronger gradient signal. Despite these advantages, Tanh still suffers from the same saturation issues: for very large or small inputs, the curve flattens out, and the gradient vanishes. It remains a standard choice in Recurrent Neural Networks (RNNs, LSTMs, GRUs) but is rarely used in deep convolutional networks.
Hard-Sigmoid is a piecewise linear approximation of the standard Sigmoid function. In environments with limited computational resources—such as mobile phones or IoT devices—calculating exponentials ($e^x$) can be computationally expensive. Hard-Sigmoid approximates the smooth curve using simple straight lines (clipping). While it is mathematically less precise and its derivative is a simple "box" function, the speed gains often outweigh the slight loss in accuracy. It is famously used in efficient mobile architectures like MobileNetV3.
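Exact slopes vary by framework, but one common variant, $\text{clip}(x/6 + 0.5,\ 0,\ 1)$ (equivalent to the $\text{ReLU6}(x+3)/6$ form used in MobileNetV3-style h-sigmoid), can be sketched as:

```python
def hard_sigmoid(x):
    # Piecewise-linear approximation of sigmoid: clip(x/6 + 0.5, 0, 1).
    # No exponential is computed, only a multiply, an add, and two comparisons.
    return min(1.0, max(0.0, x / 6.0 + 0.5))
```

Inputs below $-3$ clamp to 0 and inputs above $+3$ clamp to 1; in between the function is a straight line through $(0, 0.5)$, matching Sigmoid at the center.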
| f(x) | f'(x) |
|---|---|
| max(0, x) | 1 if x>0 else 0 |
ReLU is the undisputed "king" of modern deep learning. Its beauty lies in its simplicity: if the input is positive, output it unchanged; if negative, output zero. This provides two massive benefits. First, it is computationally free (just a threshold check). Second, for positive inputs, the derivative is exactly 1. This constant gradient allows errors to flow back through extremely deep networks without vanishing. The main downside is the "Dying ReLU" problem: if a neuron's weights shift such that it always receives negative inputs, it outputs 0, its gradient becomes 0, and it effectively "dies," never learning again.
Leaky ReLU was designed specifically to fix the "Dying ReLU" problem. Instead of forcing the output to be exactly zero for negative inputs, it allows a small, non-zero gradient (typically 0.01) to leak through. This means that even if a neuron is inactive (receiving negative input), it still generates a tiny gradient signal during backpropagation. This keeps the weights adaptable and allows "dead" neurons to eventually recover during training. It is a robust alternative if you find your standard ReLU network is underfitting.
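The leak is a one-line change to ReLU; `alpha` below is the conventional 0.01 slope:

```python
def leaky_relu(x, alpha=0.01):
    # Positives pass through; negatives are scaled by a small slope instead of zeroed,
    # so the gradient on the negative side is alpha rather than 0
    return x if x > 0 else alpha * x
```

A neuron stuck in the negative region still produces a gradient of `alpha`, so its weights can drift back toward usefulness instead of dying.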
ELU attempts to combine the best features of ReLU and Tanh. Like ReLU, it is linear for positive values (providing good gradient flow). Unlike ReLU, it uses a smooth exponential curve for negative values that saturates at -1. This negative saturation helps push the mean activation of the layer closer to zero, which speeds up convergence during training. Furthermore, because it handles negative noise smoothly rather than cutting it off abruptly (as ReLU does), ELU models are often more robust to noise in the input data. The trade-off is slightly higher computational cost.
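A minimal sketch, using the common default $\alpha = 1$ so the negative branch saturates at $-1$:

```python
import math

def elu(x, alpha=1.0):
    # Linear for positives (like ReLU); smooth exponential for negatives
    # that saturates at -alpha instead of clamping abruptly to zero
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

The `math.exp` call is the "slightly higher computational cost" mentioned above: unlike ReLU's bare comparison, every negative input pays for an exponential.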
SELU is a specialized variant of ELU designed for "Self-Normalizing Neural Networks". When used in a dense network with a specific weight initialization method (LeCun Normal), SELU mathematically forces the output of each layer to automatically maintain a mean of 0 and a variance of 1 during training. This magical property prevents gradients from exploding or vanishing, even in networks with hundreds of layers, often eliminating the need for external normalization layers like Batch Normalization. It is highly effective for deep Feed-Forward Networks (FNNs) on tabular data.
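SELU is ELU scaled by two fixed constants derived in the original self-normalization paper (Klambauer et al., 2017); the values below are the canonical ones:

```python
import math

# Canonical SELU constants (alpha and lambda from Klambauer et al., 2017)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # Scaled ELU: these exact constants are what make layer outputs
    # converge toward mean 0 and variance 1 under LeCun Normal init
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))
```

Note that the self-normalizing guarantee only holds with LeCun Normal initialization and plain dense layers; swapping in other initializers breaks the fixed-point argument.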
GELU is the activation function of choice for the modern NLP revolution, powering state-of-the-art models like BERT, GPT-3, and GPT-4. Rather than deterministically gating inputs by their sign (like ReLU: "if positive, pass; if negative, cut"), GELU weights inputs by their percentile in a Gaussian distribution. Visually, it looks like a smoother ReLU that dips slightly below zero before curving up. This smoothness and probabilistic nature help complex Transformer models optimize more easily than the rigid corners of ReLU, serving as a bridge between deterministic and stochastic regularization.
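The widely used tanh approximation of GELU (Hendrycks & Gimpel, 2016) shows the "smoother ReLU with a dip" shape directly:

```python
import math

def gelu(x):
    # Tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    # Weights x by (approximately) its Gaussian CDF instead of a hard sign check
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

For small negative inputs the output dips below zero (e.g. $x = -1$ maps to roughly $-0.16$), while large positive inputs pass through almost unchanged, exactly the smoothed-ReLU behavior described above.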
Swish (also known as SiLU) was discovered by Google Brain researchers using automated search algorithms (AutoML). Defined as $f(x) = x \cdot \text{sigmoid}(x)$, it is a smooth, non-monotonic function. The critical feature is the "dip" where the curve goes slightly negative for small negative inputs before returning to 0. This unique shape allows a small amount of "negative" information to propagate through the network, which has been empirically shown to improve performance in very deep computer vision architectures like EfficientNet, often outperforming ReLU in classification accuracy.
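From the definition $f(x) = x \cdot \text{sigmoid}(x)$, the whole function is one expression:

```python
import math

def swish(x):
    # SiLU / Swish: x * sigmoid(x). Non-monotonic: for small negative x the
    # output dips slightly below zero before returning toward 0.
    return x / (1.0 + math.exp(-x))
```

For large positive inputs the sigmoid factor approaches 1 and Swish behaves like identity; the characteristic dip sits near $x \approx -1.28$, where the output bottoms out around $-0.28$.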
Mish is a self-regularized, non-monotonic function defined as $x \cdot \tanh(\ln(1 + e^x))$. While the formula looks complex, the resulting curve is extremely smooth and conceptually similar to Swish. This smoothness ensures that the gradient changes gradually rather than abruptly (as in ReLU). This results in a "smoother loss landscape," making it easier for the optimizer to find the global minimum. Mish gained immense popularity after being used in the YOLOv4 object detection model, where it significantly boosted accuracy compared to Leaky ReLU.
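Translating $x \cdot \tanh(\ln(1 + e^x))$ directly (this naive version can overflow for very large positive inputs, which production implementations guard against):

```python
import math

def mish(x):
    # x * tanh(softplus(x)); log1p(exp(x)) is the softplus term ln(1 + e^x)
    return x * math.tanh(math.log1p(math.exp(x)))
```

Like Swish, Mish is non-monotonic with a small negative dip, and for large positive inputs it tracks the identity function almost exactly.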
Softplus is a smooth approximation of ReLU, defined as $\ln(1 + e^x)$. While ReLU has a sharp "corner" at zero (which makes it technically non-differentiable at that specific point), Softplus is a continuous, smooth curve. Interestingly, the derivative of Softplus is exactly the Sigmoid function! It is typically used in scientific computing or probabilistic models (like Variational Autoencoders) where you need to constrain outputs to be always positive (like ReLU) but also require the function to be mathematically smooth and differentiable at every single point.
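The derivative identity is easy to confirm with a numerical finite-difference check:

```python
import math

def softplus(x):
    # ln(1 + e^x), a smooth everywhere-differentiable approximation of ReLU
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Central-difference check: d/dx softplus(x) should equal sigmoid(x)
h = 1e-6
x = 0.7
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2 * h)
assert abs(numeric_grad - sigmoid(x)) < 1e-6
```

Softplus is always strictly positive (it approaches 0 only as $x \to -\infty$), which is exactly the constraint probabilistic models exploit when parameterizing quantities like variances.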
A common mistake beginners make is using the same activation function everywhere. You must distinguish between Hidden Layers (internal processing) and the Output Layer (final prediction).
| Layer Type | Goal | Recommended Function |
|---|---|---|
| Hidden Layers | Extract features / patterns | ReLU (Default), GELU (Transformers), Swish (Deep CNNs) |
| Output (Regression) | Predict any continuous number | Linear (Identity) |
| Output (Binary) | Predict Yes/No (0 to 1) | Sigmoid |
| Output (Multi-Class) | Predict classes (Cat, Dog, Car) | Softmax |
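Softmax, recommended above for multi-class outputs, turns raw scores (logits) into a probability distribution. A minimal sketch in plain Python:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability (does not change the result),
    # then exponentiate and normalize so the outputs sum to 1
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The largest logit always receives the largest probability, so `argmax` of the softmax output is the predicted class (e.g. Cat vs. Dog vs. Car).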
Each activation function pairs best with a specific scheme for initializing the network's random weights. If you mismatch them, your network may fail to train.
• If using ReLU / Leaky ReLU / GELU → Use He (Kaiming) Initialization.
• If using Sigmoid / Tanh → Use Xavier (Glorot) Initialization.
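The two schemes differ only in the standard deviation of the sampled weights. A minimal sketch using the normal-distribution variants (function names are illustrative; `fan_in`/`fan_out` are the layer's input/output sizes):

```python
import math
import random

def he_normal(fan_in):
    # He (Kaiming): std = sqrt(2 / fan_in), tuned for ReLU-family activations,
    # which zero out roughly half their inputs
    return random.gauss(0.0, math.sqrt(2.0 / fan_in))

def xavier_normal(fan_in, fan_out):
    # Xavier (Glorot): std = sqrt(2 / (fan_in + fan_out)), tuned to keep
    # Sigmoid/Tanh units out of their saturated tails
    return random.gauss(0.0, math.sqrt(2.0 / (fan_in + fan_out)))
```

Both keep early-layer activations at a sensible scale: too-large weights saturate Sigmoid/Tanh (vanishing gradients), while too-small weights shrink the signal layer by layer.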
Q: Which activation function should I start with?
A: Start with ReLU. It is standard, fast, and works for 90% of hidden layers. If you are doing NLP (text), start with GELU. If you are doing computer vision and ReLU isn't working, try Swish.
Q: Why do neural networks need activation functions at all?
A: Without non-linearity, a neural network is just a giant linear regression model. Activation functions bend the space, allowing the network to learn complex shapes (like a spiral or a face).
Q: Can I mix different activation functions in one network?
A: You generally use one type (e.g., ReLU) for all hidden layers and a different type (e.g., Softmax) for the output layer. Mixing different types within hidden layers is rare and usually unnecessary.