Loss Functions Visualizer

Interactive exploration of Neural Network Optimization

In the realm of deep learning, a **Loss Function** (or Cost Function) serves as the mathematical compass for a neural network. It quantifies the "error" by measuring the discrepancy between the model's current prediction ($\hat{y}$) and the ground truth ($y$). During training, the network minimizes this value iteratively through an optimization algorithm such as Gradient Descent.

The choice of function fundamentally dictates how the model learns. **Regression tasks** (predicting continuous quantities like house prices) typically employ MSE or MAE to measure numerical discrepancies. **Classification tasks** (predicting categories like "Spam" or "Not Spam") rely on Cross-Entropy to penalize probability divergences.
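The three families mentioned above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the visualizer's own code; the function names and the example values are made up for the demonstration:

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error: average of squared differences."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error: average of absolute differences."""
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy; p is the predicted probability of class 1.
    Clipping by eps avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Regression example: predicted vs. actual house prices (in $1000s).
y_true = np.array([300.0, 450.0, 500.0])
y_pred = np.array([310.0, 440.0, 520.0])
print(mse(y_true, y_pred))  # the 20-unit miss dominates after squaring
print(mae(y_true, y_pred))  # each miss contributes in proportion to its size

# Classification example: spam (1) vs. not spam (0).
labels = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(labels, probs))
```

Note how the squared error of the third house (a miss of 20) contributes four times as much to MSE as the other two misses combined, while MAE weights each miss linearly.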

How to read this graph (The Analogy)

1. The Red Diamond (The Bullseye): Represents the True Value ($y$). This is the correct answer.

2. The Blue Line (The Penalty Map): Shows the penalty for every possible prediction. Low = Good, High = Bad.

3. The Valley: The goal of the neural network is to roll down the hill into the valley where the diamond sits.

Why Overlay? (Educational Insight)

Use the "Overlay Comparison" dropdown below to plot two functions at once. This reveals behaviors that are hard to see in isolation, such as the point where the MSE and MAE curves cross at an error of 1.


Comparison of Loss Functions

| Function | Formula | Pros | Cons | Best Use Case |
| --- | --- | --- | --- | --- |
| MSE | $(y-\hat{y})^2$ | Penalizes large errors heavily; gradient approaches 0 smoothly (stable convergence). | Very sensitive to outliers; squaring a huge error creates a massive gradient that can destabilize training. | General regression (house prices, temperature). |
| MAE | $\lvert y-\hat{y}\rvert$ | Robust to outliers (linear penalty). | Gradient is constant (doesn't shrink near 0), causing the model to "bounce" around the target unless the learning rate is decayed. | Regression with noisy data or outliers (finance). |
| Huber | Piecewise | Best of both worlds: quadratic near 0 (smooth), linear far away (robust). | More complex to compute; introduces a hyperparameter ($\delta$) that must be tuned. | Robust regression (messy data that still needs precision). |
| Cross-Entropy | $-y \log(\hat{y})$ | Grounded in information theory; heavily penalizes confident wrong answers. | Unstable if predicted probabilities reach exactly 0 or 1 ($\log(0) = -\infty$). | Classification (cats vs. dogs, spam detection). |
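The "Piecewise" entry for Huber expands to the standard definition $L_\delta(e) = \tfrac{1}{2}e^2$ when $\lvert e\rvert \le \delta$ and $\delta(\lvert e\rvert - \tfrac{1}{2}\delta)$ otherwise, with $e = y - \hat{y}$. A minimal sketch (the function name and test values are illustrative):

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = y - y_hat
    abs_err = np.abs(err)
    quadratic = 0.5 * err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

print(huber(0.0, 0.5))  # small error: 0.5 * 0.25 = 0.125 (MSE-like)
print(huber(0.0, 5.0))  # large error: 1.0 * (5 - 0.5) = 4.5 (MAE-like)
```

The two branches are defined so that the loss and its slope match at $\lvert e\rvert = \delta$, which is what keeps the curve smooth at the transition.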

Frequently Asked Questions

Why does the Cross-Entropy curve shoot up to infinity?

Cross-Entropy uses a logarithm: $-\log(p)$. As the probability $p$ approaches 0 (meaning the model says "0% chance" for something that is actually true), the logarithm approaches infinity. This ensures the model is severely punished for being "confidently wrong," forcing it to learn quickly away from that error.
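You can watch this blow-up numerically. The snippet below is a standalone illustration; the `eps` clipping trick at the end is a common safeguard in real implementations, not something specific to this visualizer:

```python
import math

# -log(p) grows without bound as the predicted probability p approaches 0.
for p in [0.5, 0.1, 0.01, 1e-6]:
    print(f"p={p:<8}  -log(p)={-math.log(p):.2f}")

# Practical implementations clip p away from exactly 0 (and 1),
# so the loss stays large but finite.
eps = 1e-12
p = 0.0
print(-math.log(max(p, eps)))  # about 27.63 rather than infinity
```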

When should I use Huber Loss over MSE?

If your dataset contains "outliers" (noisy data points that are very far from the average), MSE will try too hard to fit them because it squares the error (an error of 10 becomes 100). Huber loss down-weights these extreme outliers by penalizing them only linearly, while still fitting the normal data points accurately with its quadratic region.
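A quick per-point comparison makes the difference concrete. The error values here are invented for illustration; the last one plays the role of the outlier:

```python
import numpy as np

errors = np.array([0.1, -0.2, 0.1, 10.0])  # last point is an outlier

# Per-point MSE penalty: squaring lets the outlier dominate.
mse_terms = errors ** 2

# Per-point Huber penalty (delta = 1): the outlier is penalized only linearly.
delta = 1.0
huber_terms = np.where(np.abs(errors) <= delta,
                       0.5 * errors ** 2,
                       delta * (np.abs(errors) - 0.5 * delta))

print(mse_terms)    # outlier contributes 100.0 to the total
print(huber_terms)  # outlier contributes only 9.5
```

Under MSE the single outlier accounts for nearly all of the total loss, so the optimizer bends the fit toward it; under Huber its influence is roughly an order of magnitude smaller.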

What does the "True Value" slider represent?

In a real training scenario, the "True Value" ($y$) is your training data label (e.g., the actual price of the house). It is fixed. However, in this visualizer, we let you move it so you can see how the loss function shifts dynamically. Notice that the lowest point of the curve (Loss=0) always aligns with the True Value.
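The claim that the minimum sits exactly at the True Value can be checked with a brute-force scan (a toy example with made-up candidate predictions, using MSE):

```python
# For MSE, the loss (y - y_hat)^2 is zero exactly when y_hat == y,
# so the lowest point of the curve sits at the True Value.
y = 3.0  # the "True Value" slider position
candidates = [1.0, 2.0, 3.0, 4.0, 5.0]
losses = [(y - y_hat) ** 2 for y_hat in candidates]
best = candidates[losses.index(min(losses))]
print(best)  # -> 3.0
```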

Why isn't MAE used as often as MSE?

While MAE is robust, its gradient is constant (the slope is always -1 or +1). This means as the model gets very close to the answer, the "push" to change the weights doesn't get smaller. This can cause the model to "overshoot" the target back and forth. MSE's gradient gets smaller as you approach 0, acting like a natural brake.
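The gradients behind this answer are $\frac{\partial}{\partial \hat{y}}(y-\hat{y})^2 = -2(y-\hat{y})$ for MSE and $-\mathrm{sign}(y-\hat{y})$ for MAE. A small sketch (values chosen for illustration) shows MSE's "natural brake" in action:

```python
import numpy as np

y = 5.0  # true value
for y_hat in [1.0, 4.0, 4.9, 4.99]:
    err = y - y_hat
    grad_mse = -2 * err        # shrinks toward 0 as the prediction improves
    grad_mae = -np.sign(err)   # magnitude is always 1, however close we get
    print(f"error={err:5.2f}  MSE grad={grad_mse:6.2f}  MAE grad={grad_mae:5.2f}")
```

As the prediction closes in on 5.0, the MSE gradient fades to nearly zero while the MAE gradient keeps pushing with full strength, which is what causes the overshooting.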

What is the "Delta" in Huber Loss?

Delta ($\delta$) is the threshold. If the error is smaller than Delta, the function acts like MSE (curved). If the error is larger than Delta, it acts like MAE (straight lines). Adjusting Delta allows you to decide exactly how "far" an error needs to be before you consider it an outlier.
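The same error value can therefore land on either side of the threshold depending on $\delta$. A small illustration (the helper name is made up, the formula is the standard Huber definition):

```python
def huber_point(err, delta):
    """Huber loss for a single error: quadratic inside delta, linear outside."""
    if abs(err) <= delta:
        return 0.5 * err ** 2
    return delta * (abs(err) - 0.5 * delta)

err = 3.0
print(huber_point(err, delta=5.0))  # 0.5 * 9 = 4.5: treated as a normal point
print(huber_point(err, delta=1.0))  # 1 * (3 - 0.5) = 2.5: treated as an outlier
```

With a generous $\delta = 5$ the error of 3 is still squared; tightening $\delta$ to 1 reclassifies it as an outlier and caps its penalty to linear growth.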

What does the Overlay show me?

By overlaying MSE and MAE, you can see the "Crossing Point" at error = 1. Below an error of 1, MSE is actually smaller than MAE (squaring a fraction makes it smaller). Above 1, MSE grows much faster. This helps explain why MSE prioritizes fixing large errors while, compared with MAE, applying relatively little pressure to small ones.
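The crossing point is a one-liner to verify; the sample error values below are arbitrary:

```python
# MSE (err^2) vs. MAE (|err|): squaring shrinks penalties below 1
# and amplifies them above 1; the curves meet exactly at |error| = 1.
for err in [0.25, 0.5, 1.0, 2.0, 4.0]:
    mse, mae = err ** 2, abs(err)
    print(f"|error|={err:4.2f}  MSE={mse:5.2f}  MAE={mae:4.2f}  MSE<MAE: {mse < mae}")
```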

Recommended Reading