Interactive tool to visualize classification results and compute essential machine learning evaluation metrics

📊 Understanding the Confusion Matrix

In machine learning and statistical classification, a confusion matrix (also called an error matrix) is a simple but powerful way to summarize how well a classification model is performing. It is a table that compares the actual class labels in your data with the predicted labels from your model. This lets you see not only the overall accuracy, but also exactly what kinds of mistakes the model is making.

The name “confusion” comes from the fact that the matrix shows where the model gets confused between classes. For a binary classification problem, the confusion matrix is a 2×2 table with four key quantities: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Starting from these four numbers, you can compute many important performance metrics, such as accuracy, precision, recall, and F1 score, which give a much deeper understanding of how and where your model is performing well or failing.

⚙️ Enter Your Classification Results


True Positive (TP) — correctly predicted positive cases
True Negative (TN) — correctly predicted negative cases
False Positive (FP) — Type I Error: predicted positive, actually negative
False Negative (FN) — Type II Error: predicted negative, actually positive

                  Predicted Positive    Predicted Negative
Actual Positive   TP = 85               FN = 15
Actual Negative   FP = 10               TN = 90

⚠️ A Note on Class Imbalance

When your data shows significant class imbalance, accuracy can be misleading. Consider focusing on Precision, Recall, F1 Score, or Balanced Accuracy instead.

📈 Computed Metrics

Metric               Value    Percentage
Accuracy             0.875    87.50%
Precision            0.895    89.47%
Recall (TPR)         0.850    85.00%
F1 Score             0.872    87.18%
Specificity (TNR)    0.900    90.00%
FPR                  0.100    10.00%
FNR                  0.150    15.00%
Balanced Accuracy    0.875    87.50%
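The values above can be reproduced directly from the four counts. This is a minimal sketch (the function and key names are ours, not part of the tool):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # TPR / sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TNR
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "fpr": 1 - specificity,
        "fnr": 1 - recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }

m = confusion_metrics(tp=85, fn=15, fp=10, tn=90)
print(round(m["accuracy"], 3), round(m["precision"], 3), round(m["f1"], 3))
# 0.875 0.895 0.872
```

The guards against zero denominators matter in practice: precision is undefined when the model never predicts positive, and recall is undefined when there are no positive samples.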

⚖️ Confusion Matrix for Imbalanced Datasets

In real-world machine learning, class imbalance is the norm rather than the exception. Fraud detection deals with 0.1% fraudulent transactions, disease screening with 1-5% positive cases, and manufacturing defect detection with even rarer anomalies. When one class vastly outnumbers the other, traditional accuracy becomes a dangerously misleading metric.

The Accuracy Paradox

Consider a dataset with 1,000 samples where only 10 are positive (disease cases) and 990 are negative (healthy). A naive model that always predicts negative achieves 99% accuracy—yet it catches zero disease cases! This is the accuracy paradox: high accuracy can mask complete failure on the minority class.

Naive Model (Always Predicts Negative)

TP: 0
FN: 10
FP: 0
TN: 990
Accuracy: 99% — Looks great!
Recall: 0% — Catches no positive cases
F1 Score: 0% — Reveals the failure

Useful Model (Balanced Performance)

TP: 8
FN: 2
FP: 50
TN: 940
Accuracy: 94.8% — Lower than naive!
Recall: 80% — Catches most positives
F1 Score: 23.5% — Shows the real tradeoff
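The comparison above is easy to verify numerically. A quick sketch (helper names are ours):

```python
def acc(tp, fn, fp, tn):
    """Plain accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fn, fp):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Naive model: always predicts negative on 10 positives / 990 negatives
print(acc(0, 10, 0, 990), recall(0, 10), f1(0, 10, 0))            # 0.99 0.0 0.0

# Useful model: catches 8 of 10 positives at the cost of 50 false alarms
print(round(acc(8, 2, 50, 940), 3), recall(8, 2), round(f1(8, 2, 50), 3))
# 0.948 0.8 0.235
```

Note how accuracy ranks the useless model higher, while recall and F1 immediately reveal which model actually finds positive cases.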

Recommended Metrics for Imbalanced Data

When working with imbalanced datasets, rely on metrics that account for class distribution:

Balanced Accuracy

Balanced Accuracy is the arithmetic mean of sensitivity (Recall) and specificity. It gives equal weight to both classes regardless of their size, making it robust to imbalance:

\[ \text{Balanced Accuracy} = \frac{\text{TPR} + \text{TNR}}{2} = \frac{\text{Recall} + \text{Specificity}}{2} \]
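Applied to the naive always-negative model from the accuracy paradox above, balanced accuracy immediately exposes the failure that 99% accuracy hides:

```python
# Balanced accuracy = (TPR + TNR) / 2, using the naive model's counts
tpr = 0 / 10          # recall: 0 of 10 positives caught
tnr = 990 / 990       # specificity: all 990 negatives correct
balanced_acc = (tpr + tnr) / 2
print(balanced_acc)   # 0.5 — no better than random guessing
```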

Practical Strategies

1. Resampling: Use SMOTE (oversampling minority) or random undersampling to balance training data.

2. Class Weights: Assign higher weights to minority class errors in your loss function.

3. Threshold Tuning: Adjust the decision threshold based on Precision-Recall curves rather than using default 0.5.

4. Ensemble Methods: Techniques like BalancedRandomForest or EasyEnsemble are designed for imbalanced scenarios.
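Strategy 3 can be sketched in a few lines. This toy example (scores and labels are made up for illustration) sweeps candidate thresholds and picks the one that maximizes F1 instead of defaulting to 0.5:

```python
# Illustrative predicted probabilities and true labels
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def f1_at(threshold):
    """F1 score when predicting positive for every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Each observed score is a candidate cutoff; keep the one with the best F1
best = max(scores, key=f1_at)
print(best, round(f1_at(best), 3))   # 0.2 0.8
```

In real projects you would do the same sweep on a held-out validation set, often optimizing whichever metric matches your cost structure rather than F1 specifically.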

📐 Understanding Each Metric

Each metric is derived from specific cells of the confusion matrix. In each formula below, the shaded terms indicate which cells form the numerator and which are included in the denominator.

Accuracy

Accuracy measures the overall correctness of the model by calculating the proportion of all correct predictions (both true positives and true negatives) out of the total number of cases. While intuitive, accuracy can be misleading with imbalanced datasets.

\[ \text{Accuracy} = \frac{\colorbox{#bbf7d0}{TP} + \colorbox{#bbf7d0}{TN}}{\colorbox{#e0e7ff}{TP + TN + FP + FN}} \]

Interpretation: "What percentage of all predictions were correct?"


Precision (Positive Predictive Value)

Precision answers the question: "Of all the cases predicted as positive, how many were actually positive?" High precision indicates low false positive rate, which is crucial when the cost of false alarms is high (e.g., spam detection).

\[ \text{Precision} = \frac{\colorbox{#bbf7d0}{TP}}{\colorbox{#e0e7ff}{TP + FP}} \]

Interpretation: "When the model predicts positive, how often is it right?"


Recall / Sensitivity / True Positive Rate (TPR)

Recall measures the model's ability to find all positive cases. It answers: "Of all actual positive cases, how many did we correctly identify?" High recall is essential when missing positive cases is costly (e.g., disease screening, fraud detection).

\[ \text{Recall} = \text{TPR} = \frac{\colorbox{#bbf7d0}{TP}}{\colorbox{#e0e7ff}{TP + FN}} \]

Interpretation: "Of all actual positive cases, how many did we catch?"


F1 Score

The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is particularly useful when you need to compare models and want to find a balance between precision and recall.

\[ F_1 = \frac{2 \times \colorbox{#bbf7d0}{TP}}{2 \times \colorbox{#e0e7ff}{TP} + \colorbox{#e0e7ff}{FP} + \colorbox{#e0e7ff}{FN}} \]

Interpretation: "Harmonic balance between precision and recall — penalizes extreme values."

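The count-based formula above is algebraically identical to the more familiar harmonic-mean form 2·Precision·Recall / (Precision + Recall). A quick check using the example counts from the calculator (variable names are ours):

```python
tp, fn, fp = 85, 15, 10   # example counts from the calculator above

precision = tp / (tp + fp)
recall = tp / (tp + fn)

harmonic = 2 * precision * recall / (precision + recall)
count_form = 2 * tp / (2 * tp + fp + fn)

print(round(harmonic, 4), round(count_form, 4))   # 0.8718 0.8718
```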

Specificity / True Negative Rate (TNR)

Specificity measures how well the model identifies negative cases. It answers: "Of all actual negative cases, how many were correctly classified as negative?" This is the complement perspective to recall.

\[ \text{Specificity} = \text{TNR} = \frac{\colorbox{#bbf7d0}{TN}}{\colorbox{#e0e7ff}{TN + FP}} \]

Interpretation: "Of all actual negative cases, how many did we correctly identify as negative?"


False Positive Rate (FPR)

FPR indicates the proportion of actual negatives that were incorrectly classified as positive. It equals (1 - Specificity) and is plotted on the x-axis of ROC curves. Lower FPR means fewer false alarms.

\[ \text{FPR} = \frac{\colorbox{#fecaca}{FP}}{\colorbox{#e0e7ff}{FP + TN}} = 1 - \text{Specificity} \]

Interpretation: "Of all actual negative cases, how many did we incorrectly flag as positive?"


False Negative Rate (FNR) / Miss Rate

FNR indicates the proportion of actual positives that were incorrectly classified as negative. It equals (1 - Recall) and represents missed detections. Lower FNR is critical when missing positive cases is dangerous.

\[ \text{FNR} = \frac{\colorbox{#fecaca}{FN}}{\colorbox{#e0e7ff}{FN + TP}} = 1 - \text{Recall} \]

Interpretation: "Of all actual positive cases, how many did we miss?"


📖 Glossary of Terms

True Positive (TP)
The model correctly predicts the positive class. The actual value was positive, and the model said positive. Example: A medical test correctly identifies a patient who has the disease.
True Negative (TN)
The model correctly predicts the negative class. The actual value was negative, and the model said negative. Example: A spam filter correctly identifies a legitimate email as "not spam."
False Positive (FP) — Type I Error
The model incorrectly predicts positive when the actual value is negative. Also called a "false alarm." Example: A medical test says a healthy patient has a disease when they don't.
False Negative (FN) — Type II Error
The model incorrectly predicts negative when the actual value is positive. Also called a "miss." Example: A security system fails to detect an actual intrusion.

❓ Frequently Asked Questions

When should I prioritize Precision over Recall?

Prioritize Precision when false positives are costly. For example, in email spam filtering, marking a legitimate email as spam (FP) could cause users to miss important messages. In such cases, you want to be very confident before predicting "positive."

When should I prioritize Recall over Precision?

Prioritize Recall when missing positive cases (false negatives) is dangerous or expensive. In medical diagnosis for serious diseases, failing to detect a condition (FN) could be life-threatening. It's better to have some false alarms than to miss actual cases.

Why is Accuracy sometimes a poor metric?

Accuracy can be misleading with imbalanced datasets. If 95% of cases are negative, a model that always predicts "negative" achieves 95% accuracy while being completely useless at finding positive cases. In such scenarios, F1 Score, Precision-Recall AUC, or ROC-AUC provide better insights.

What is a good F1 Score?

There's no universal threshold—it depends on your domain and problem difficulty. Generally, F1 > 0.9 is excellent, 0.8-0.9 is good, and below 0.5 suggests the model struggles significantly. Always compare against baseline models and domain-specific benchmarks.

How do TPR and FPR relate to ROC curves?

ROC (Receiver Operating Characteristic) curves plot TPR (y-axis) against FPR (x-axis) at various classification thresholds. A perfect classifier reaches the top-left corner (TPR=1, FPR=0). The Area Under the ROC Curve (AUC-ROC) summarizes overall discriminative ability.
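The sweep behind an ROC curve can be sketched directly: for each threshold, classify every score at or above it as positive and record the resulting (FPR, TPR) pair. The scores and labels here are illustrative only:

```python
# Illustrative predicted probabilities and true labels
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]

def roc_point(threshold):
    """Return the (FPR, TPR) pair for a given decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos   # x and y coordinates on the ROC curve

# Sweep thresholds from strict to lenient; the points trace the curve
for t in sorted(scores, reverse=True):
    print(t, roc_point(t))
```

As the threshold loosens, both TPR and FPR can only increase, which is why the curve runs monotonically from (0, 0) toward (1, 1). Libraries such as scikit-learn's `roc_curve` perform this sweep efficiently.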

What metrics should I use for imbalanced datasets?

For imbalanced datasets, avoid relying solely on Accuracy. Instead, use Precision, Recall, F1 Score, Matthews Correlation Coefficient (MCC), Balanced Accuracy, or Precision-Recall AUC. These metrics provide a more realistic picture of model performance on minority classes.
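Of these, the Matthews Correlation Coefficient is the only one not shown in the calculator above; it uses all four cells at once and stays informative under imbalance. A minimal sketch, applied to the example counts (the function name is ours):

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient: +1 perfect, 0 random, -1 inverted."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Example counts from the calculator above: TP=85, FN=15, FP=10, TN=90
print(round(mcc(85, 15, 10, 90), 3))   # 0.751
```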

What is the difference between Type I and Type II errors?

Type I Error (False Positive) occurs when the model incorrectly predicts positive when the actual value is negative—a false alarm. Type II Error (False Negative) occurs when the model incorrectly predicts negative when the actual value is positive—a miss. The relative cost of each error type depends on your specific application.

📚 Recommended Reading