Interactive tool to visualize classification results and compute essential machine learning evaluation metrics
In machine learning and statistical classification, a confusion matrix (also called an error matrix) is a simple but powerful way to summarize how well a classification model is performing. It is a table that compares the actual class labels in your data with the predicted labels from your model. This lets you see not only the overall accuracy, but also exactly what kinds of mistakes the model is making.
The name “confusion” comes from the fact that the matrix shows where the model gets confused between classes. For a binary classification problem, the confusion matrix is a 2×2 table with four key quantities:
Starting from these four numbers, you can compute many important performance metrics—such as accuracy, precision, recall, and F1 Score—which give a much deeper understanding of how and where your model is performing well or failing.
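As a minimal sketch (the label lists below are invented example data, not output from any real model), the four cells can be counted directly in plain Python:

```python
# Invented example labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # actual labels
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # model predictions

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
print(tp, tn, fp, fn)  # 2 4 1 1
```

In practice, scikit-learn's `sklearn.metrics.confusion_matrix(y_true, y_pred)` returns the same counts as a 2×2 array (rows are actual classes, columns are predicted classes).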
Click on a scenario above to load realistic example values and learn about the context.
Correctly predicted positive cases
Correctly predicted negative cases
Type I Error — predicted positive, actually negative
Type II Error — predicted negative, actually positive
PREDICTED CLASS
Your data shows significant class imbalance. Accuracy may be misleading—consider focusing on Precision, Recall, F1 Score, or Balanced Accuracy instead.
In real-world machine learning, class imbalance is the norm rather than the exception. Fraud detection deals with 0.1% fraudulent transactions, disease screening with 1-5% positive cases, and manufacturing defect detection with even rarer anomalies. When one class vastly outnumbers the other, traditional accuracy becomes a dangerously misleading metric.
Consider a dataset with 1,000 samples where only 10 are positive (disease cases) and 990 are negative (healthy). A naive model that always predicts negative achieves 99% accuracy—yet it catches zero disease cases! This is the accuracy paradox: high accuracy can mask complete failure on the minority class.
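The paradox is easy to reproduce; this sketch uses exactly the counts from the paragraph above:

```python
# 990 healthy (0) and 10 disease cases (1), as in the example
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a naive "model" that always predicts negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy)  # 0.99
print(caught)    # 0 -- not a single disease case caught
```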
When working with imbalanced datasets, rely on metrics that account for class distribution:
Balanced Accuracy is the arithmetic mean of sensitivity (Recall) and specificity. It gives equal weight to both classes regardless of their size, making it robust to imbalance:
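In formula form, Balanced Accuracy = (Recall + Specificity) / 2. A sketch using the always-negative model from the 1,000-sample example shows how it exposes what plain accuracy hides:

```python
# Always-predict-negative on 10 positives and 990 negatives
tp, fn = 0, 10    # every positive is missed
tn, fp = 990, 0   # every negative is "correctly" rejected

recall = tp / (tp + fn)        # sensitivity = 0.0
specificity = tn / (tn + fp)   # = 1.0
balanced_accuracy = (recall + specificity) / 2
print(balanced_accuracy)  # 0.5 -- no better than random guessing
```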
1. Resampling: Use SMOTE (oversampling minority) or random undersampling to balance training data.
2. Class Weights: Assign higher weights to minority class errors in your loss function.
3. Threshold Tuning: Adjust the decision threshold based on Precision-Recall curves rather than using default 0.5.
4. Ensemble Methods: Techniques like BalancedRandomForest or EasyEnsemble are designed for imbalanced scenarios.
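As an illustration of strategy 3, threshold tuning needs no extra library; the probability scores below are invented for the sketch:

```python
# Invented probabilities from a hypothetical classifier, plus true labels
scores = [0.95, 0.70, 0.45, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    0,    0]

def predict(scores, threshold):
    """Binarize scores at the given decision threshold."""
    return [1 if s >= threshold else 0 for s in scores]

print(predict(scores, 0.5))  # [1, 1, 0, 0, 0, 0] -- misses the positive at 0.45
print(predict(scores, 0.4))  # [1, 1, 1, 1, 0, 0] -- recovers it, costs one FP
```

In practice the threshold would be chosen by scanning a Precision-Recall curve on a validation set rather than hand-picked as here.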
Each metric is derived from specific cells in the confusion matrix. Below, we visualize which cells form the numerator and which cells are included in the denominator for each formula.
Accuracy measures the overall correctness of the model by calculating the proportion of all correct predictions (both true positives and true negatives) out of the total number of cases. While intuitive, accuracy can be misleading with imbalanced datasets.
Interpretation: "What percentage of all predictions were correct?"
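With assumed example counts, the formula reads:

```python
tp, tn, fp, fn = 50, 35, 10, 5  # assumed example counts

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```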
Precision answers the question: "Of all the cases predicted as positive, how many were actually positive?" High precision indicates a low false positive rate, which is crucial when the cost of false alarms is high (e.g., spam detection).
Interpretation: "When the model predicts positive, how often is it right?"
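A sketch with assumed example counts:

```python
tp, fp = 50, 10  # assumed example counts

# Precision = TP / (TP + FP): 50 correct out of 60 positive predictions
precision = tp / (tp + fp)
print(round(precision, 3))  # 0.833
```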
Recall measures the model's ability to find all positive cases. It answers: "Of all actual positive cases, how many did we correctly identify?" High recall is essential when missing positive cases is costly (e.g., disease screening, fraud detection).
Interpretation: "Of all actual positive cases, how many did we catch?"
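A sketch with assumed example counts:

```python
tp, fn = 50, 5  # assumed example counts

# Recall = TP / (TP + FN): 50 caught out of 55 actual positives
recall = tp / (tp + fn)
print(round(recall, 3))  # 0.909
```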
The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is particularly useful when you need to compare models and want to find a balance between precision and recall.
Interpretation: "Harmonic balance between precision and recall — penalizes extreme values."
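A sketch showing both the formula and why the harmonic mean punishes extremes (the input values are assumed examples):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.833, 0.909), 3))  # 0.869
# Extreme imbalance is penalized: the arithmetic mean here would be 0.505
print(round(f1(1.00, 0.010), 3))   # 0.02
```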
Specificity measures how well the model identifies negative cases. It answers: "Of all actual negative cases, how many were correctly classified as negative?" This is the complement perspective to recall.
Interpretation: "Of all actual negative cases, how many did we correctly identify as negative?"
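A sketch with assumed example counts:

```python
tn, fp = 35, 10  # assumed example counts

# Specificity = TN / (TN + FP): 35 of 45 actual negatives kept negative
specificity = tn / (tn + fp)
print(round(specificity, 3))  # 0.778
```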
FPR indicates the proportion of actual negatives that were incorrectly classified as positive. It equals (1 - Specificity) and is plotted on the x-axis of ROC curves. Lower FPR means fewer false alarms.
Interpretation: "Of all actual negative cases, how many did we incorrectly flag as positive?"
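A sketch with assumed example counts, confirming the complement relationship:

```python
tn, fp = 35, 10  # assumed example counts

# FPR = FP / (FP + TN) = 1 - Specificity
fpr = fp / (fp + tn)
specificity = tn / (tn + fp)
print(round(fpr, 3))                         # 0.222
print(abs(fpr - (1 - specificity)) < 1e-12)  # True
```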
FNR indicates the proportion of actual positives that were incorrectly classified as negative. It equals (1 - Recall) and represents missed detections. Lower FNR is critical when missing positive cases is dangerous.
Interpretation: "Of all actual positive cases, how many did we miss?"
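A sketch with assumed example counts, confirming the complement relationship:

```python
tp, fn = 50, 5  # assumed example counts

# FNR = FN / (FN + TP) = 1 - Recall
fnr = fn / (fn + tp)
recall = tp / (tp + fn)
print(round(fnr, 3))                    # 0.091
print(abs(fnr - (1 - recall)) < 1e-12)  # True
```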
When should I prioritize Precision over Recall?
Prioritize Precision when false positives are costly. For example, in email spam filtering, marking a legitimate email as spam (FP) could cause users to miss important messages. In such cases, you want to be very confident before predicting "positive."
When should I prioritize Recall over Precision?
Prioritize Recall when missing positive cases (false negatives) is dangerous or expensive. In medical diagnosis for serious diseases, failing to detect a condition (FN) could be life-threatening. It's better to have some false alarms than to miss actual cases.
Why is Accuracy sometimes a poor metric?
Accuracy can be misleading with imbalanced datasets. If 95% of cases are negative, a model that always predicts "negative" achieves 95% accuracy while being completely useless at finding positive cases. In such scenarios, F1 Score, Precision-Recall AUC, or ROC-AUC provide better insights.
What is a good F1 Score?
There's no universal threshold—it depends on your domain and problem difficulty. Generally, F1 > 0.9 is excellent, 0.8-0.9 is good, and below 0.5 suggests the model struggles significantly. Always compare against baseline models and domain-specific benchmarks.
How do TPR and FPR relate to ROC curves?
ROC (Receiver Operating Characteristic) curves plot TPR (y-axis) against FPR (x-axis) at various classification thresholds. A perfect classifier reaches the top-left corner (TPR=1, FPR=0). The Area Under the ROC Curve (AUC-ROC) summarizes overall discriminative ability.
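A sketch of the ROC construction using invented scores: sweep the threshold and record one (FPR, TPR) point per setting.

```python
# Invented classifier scores and true labels (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

def roc_point(threshold):
    """Return the (FPR, TPR) point for one decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

for thr in (0.85, 0.50, 0.05):
    print(thr, roc_point(thr))
# A stricter threshold moves toward (0, 0); a looser one toward (1, 1)
```

scikit-learn's `sklearn.metrics.roc_curve` performs this sweep over all candidate thresholds, and `roc_auc_score` summarizes the resulting curve as a single area.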
What metrics should I use for imbalanced datasets?
For imbalanced datasets, avoid relying solely on Accuracy. Instead, use Precision, Recall, F1 Score, Matthews Correlation Coefficient (MCC), Balanced Accuracy, or Precision-Recall AUC. These metrics provide a more realistic picture of model performance on minority classes.
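Matthews Correlation Coefficient is the only metric listed here that uses all four cells at once; as a sketch with assumed example counts:

```python
import math

tp, tn, fp, fn = 50, 35, 10, 5  # assumed example counts

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# Ranges from -1 (total disagreement) through 0 (random) to +1 (perfect)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(round(mcc, 3))  # 0.698
```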
What is the difference between Type I and Type II errors?
Type I Error (False Positive) occurs when the model incorrectly predicts positive when the actual value is negative—a false alarm. Type II Error (False Negative) occurs when the model incorrectly predicts negative when the actual value is positive—a miss. The relative cost of each error type depends on your specific application.
Comprehensive guide to classification metrics in Python's scikit-learn library.
Learn how ROC curves and AUC extend confusion matrix concepts.
Goodfellow et al.'s comprehensive textbook covering evaluation methodology.
Interactive guide to neural network activation functions.
Explore how loss functions guide model training and optimization.
Python library for handling imbalanced datasets with SMOTE and more.