
False Positives vs False Negatives: The Errors That Shape Science and Safety

When testing anything — a medical diagnosis, an AI classifier, a scientific hypothesis — two kinds of mistakes can happen. Understanding these errors isn't just academic: they determine whether innocent people go to prison, whether diseases get caught early, and whether scientific discoveries are real or illusions.

The Two Types of Errors

False Positive (Type I Error): You detect something that isn't there.

False Negative (Type II Error): You miss something that is there.
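These two errors are easiest to see as two cells of a confusion matrix. A minimal Python sketch (the labels and predictions below are made-up illustration data):

```python
def confusion_counts(y_true, y_pred):
    """Tally the four outcomes of a binary test.

    1 = condition present / test positive, 0 = absent / negative.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Six hypothetical cases; the test gets two of them wrong.
truth      = [1, 1, 1, 0, 0, 0]
prediction = [1, 1, 0, 1, 0, 0]
print(confusion_counts(truth, prediction))  # (2, 1, 1, 2): one error of each type
```

Every metric discussed below — sensitivity, specificity, false positive rate — is just a ratio of these four counts.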

Why Different Fields Fear Different Errors

The choice of which error to minimize depends entirely on the consequences.

Fields That Fear False Positives

Drug Approval (FDA, EMA): A false positive means approving a drug that doesn't work — or worse, one that harms patients. The thalidomide disaster of the 1960s, which caused severe birth defects, made regulators extremely cautious about Type I errors.

Criminal Justice: "Better that ten guilty persons escape than that one innocent suffer" — this legal principle explicitly prioritizes avoiding false positives (wrongful convictions) over false negatives (letting criminals go free).

Scientific Publishing: A false positive means publishing a claim that isn't true. This erodes trust in science and wastes resources as others try to build on false foundations.

Fields That Fear False Negatives

Cancer Screening: Missing an early-stage tumor is often a death sentence. Mammograms and other screening tests are calibrated to catch as many cancers as possible, accepting some false alarms.

Security & Threat Detection: Missing a terrorist or a cyberattack has catastrophic consequences. Airport security accepts many false alarms (bags flagged for inspection that turn out to be harmless) to avoid missing real threats.

Safety Engineering: Not detecting a fault in an aircraft or nuclear plant can be fatal. These systems are designed with redundancy specifically to minimize false negatives.

Real-World Disasters: When Errors Have Consequences

The Theranos Scandal: False Results at Scale

Theranos promised revolutionary blood testing from a single drop of blood. The reality was far darker. Between 2013 and 2016, the company sold over 1.5 million blood tests — and more than 10% of results were later voided or corrected.

The human cost was severe:

  - Patients received erroneous results for serious conditions, prompting needless anxiety, repeat testing, and unnecessary follow-up procedures.
  - Clinicians made decisions based on numbers that were simply wrong, producing both false alarms and false reassurance.

Elizabeth Holmes was ultimately sentenced to over 11 years in prison. The case illustrates how unreliable testing doesn't just produce "errors" — it derails lives.

The Replication Crisis: Science's False Positive Problem

In 2011, psychologist Daryl Bem published a paper appearing to show evidence for precognition — that people could predict the future. The study was methodologically standard for its field, and that was precisely the problem: if standard methods could support a conclusion almost everyone considers impossible, what else had they gotten wrong?

The Open Science Collaboration attempted to replicate 100 psychology studies. The results were sobering: while 97% of original studies reported statistically significant results, only 36% replicated successfully.

Famous findings that failed to replicate include:

  - Power posing (the claim that adopting an expansive pose changes hormones and risk tolerance)
  - Ego depletion (the idea that willpower is a finite resource that runs out with use)
  - Social priming effects (e.g., reading words about aging making people walk more slowly)

These weren't obscure findings — they were featured in TED talks, bestselling books, and corporate training programs. The replication crisis revealed that a significant portion of published research may represent false positives: "discoveries" of effects that don't actually exist.

AI in Medicine: New Technology, Old Problems

ECRI's 2026 health technology hazards report lists AI chatbot misuse as the top concern. The problem isn't just that AI makes mistakes — it's that people trust it too much.

Research from Cambridge and Simon Fraser universities found that machine learning in medical imaging is "highly unstable" and may produce both false positives (flagging healthy tissue as diseased) and false negatives (missing actual pathology). Unlike a human radiologist who might say "I'm not sure," AI systems typically produce confident-looking outputs regardless of their actual reliability.

The FDA recently relaxed requirements for clinical decision support tools, meaning many AI systems providing diagnostic suggestions may reach clinics without rigorous validation. We may be setting up conditions for a new wave of both error types.

The Statistics Behind the Trade-Off

In formal hypothesis testing:

  - α (the significance level) is the probability of a Type I error: rejecting a null hypothesis that is actually true.
  - β is the probability of a Type II error: failing to reject a null hypothesis that is actually false.
  - Power (1 − β) is the probability of detecting an effect that really exists.

Here's the fundamental constraint: you cannot minimize both errors simultaneously without more data or better measurements.

This is why the 5% significance threshold is contentious. It's a somewhat arbitrary trade-off point. Some researchers now advocate for stricter thresholds (0.5% or 0.1%) to reduce false positives, while others argue this will cause too many real effects to be missed.
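This constraint can be demonstrated with a small simulation: tightening the significance threshold on the very same test lowers its power, i.e., raises the false negative rate. A sketch under assumed conditions (a one-sided z-test with known variance, and an invented effect size and sample size):

```python
import math
import random
import statistics

def power_at_alpha(alpha, effect=0.5, n=20, trials=2000, seed=1):
    """Estimate power: how often a one-sided z-test at level `alpha`
    detects a true mean shift of `effect` (sigma = 1, H0: mean = 0)."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha)  # rejection cutoff
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = statistics.fmean(sample) * math.sqrt(n)      # z-statistic under H0
        if z > z_crit:
            hits += 1
    return hits / trials

loose = power_at_alpha(alpha=0.05)    # the conventional threshold
strict = power_at_alpha(alpha=0.005)  # stricter: fewer false positives...
print(f"power at alpha=0.05:  {loose:.2f}")
print(f"power at alpha=0.005: {strict:.2f}")  # ...at the cost of more false negatives
```

With these assumed numbers, power drops from roughly 0.7 to roughly 0.4 when α is tightened tenfold. The exact figures depend on the invented effect size and sample size, but the direction of the trade-off does not.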

Visualizing the Trade-Off: ROC Curves and AUC

The tension between false positives and false negatives is beautifully captured in the ROC curve (Receiver Operating Characteristic). This visualization, originally developed for radar operators in World War II to distinguish enemy aircraft from noise, has become the standard tool for evaluating classifiers.

An ROC curve plots the True Positive Rate (sensitivity — how many actual positives you catch) against the False Positive Rate (how many negatives you incorrectly flag) as you vary the decision threshold.

                    ROC Curve (AUC ≈ 0.87)
┌────────────────────────────────────────────────────────────┐
│                                ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▞▀▀▀▀▀▀▀▀▀▀▀▀│ 1.0
│                    ▄▄▄▄▄▀▀▀▀▀▀▀                           │
│             ▗▄▄▀▀▀▀                                       │
│         ▗▄▀▀▘                                             │ TPR
│      ▗▄▀▘                                                 │(Sensitivity)
│    ▗▀▘                                                    │
│  ▗▞▘                                                      │
│ ▗▘                                                        │
│ ▞                                                         │
│▐                                                          │
▗▘                                                          │
▞▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│ 0
└────────────────────────────────────────────────────────────┘
0.0                           0.5                          1.0
                    FPR (1 - Specificity)

How to read it:

  - The diagonal from (0, 0) to (1, 1) is a classifier guessing at random: every gain in true positives costs an equal share of false positives.
  - A perfect classifier hugs the upper-left corner, reaching 100% sensitivity before incurring any false positives.
  - The more the curve bows toward the upper left, the better the classifier discriminates.

The AUC (Area Under the Curve) gives a single number summarizing performance:

AUC        Interpretation
0.5        No better than random
0.7–0.8    Acceptable
0.8–0.9    Good
0.9+       Excellent

The key insight: Every point on the ROC curve represents a different trade-off. Moving up and right catches more true positives but also more false positives. Moving down and left reduces false alarms but misses more real cases. There's no free lunch — you're sliding along the curve, not jumping above it.

The only way to get a better ROC curve (one that bows more toward the upper-left) is to improve the underlying classifier: better features, more training data, or a more sophisticated model.
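Both points are easy to verify from scratch. In the sketch below (classifier scores are invented), `auc_score` uses the rank interpretation of AUC — the probability that a randomly chosen positive outscores a randomly chosen negative — and `roc_point` shows how sliding the threshold trades one error for the other:

```python
def auc_score(scores_pos, scores_neg):
    """AUC via its rank interpretation: the probability that a random
    positive example outscores a random negative one (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def roc_point(scores_pos, scores_neg, threshold):
    """One (FPR, TPR) point: call 'positive' whenever score >= threshold."""
    tpr = sum(s >= threshold for s in scores_pos) / len(scores_pos)
    fpr = sum(s >= threshold for s in scores_neg) / len(scores_neg)
    return fpr, tpr

# Invented scores from a hypothetical classifier.
pos = [0.9, 0.8, 0.7, 0.55, 0.3]   # scores assigned to actual positives
neg = [0.6, 0.4, 0.35, 0.2, 0.1]   # scores assigned to actual negatives

print(f"AUC = {auc_score(pos, neg):.2f}")          # AUC = 0.84
for t in (0.3, 0.5, 0.7):                          # sliding along the curve
    fpr, tpr = roc_point(pos, neg, t)
    print(f"threshold {t}: FPR = {fpr:.1f}, TPR = {tpr:.1f}")
```

Lowering the threshold from 0.7 to 0.3 raises the true positive rate from 0.6 to 1.0, but the false positive rate climbs from 0.0 to 0.6: the same slide along the curve, not a jump above it.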

Calibration: When Confidence Meets Reality

There's another dimension to prediction quality that ROC curves don't capture: calibration. A model can have excellent discrimination (high AUC) but terrible calibration — and this matters enormously for decision-making.

Calibration asks: When you say you're 70% confident, are you actually right 70% of the time?

A well-calibrated forecaster, as statistician Philip Dawid put it, is one where "of those events to which he assigns a probability of 30%, the long-run proportion that actually occurs turns out to be 30%."

The Gold Standard: Weather Forecasters

Weather forecasters are famously well-calibrated. When they say there's a 30% chance of rain, it rains about 30% of those times. When they say 80%, it rains about 80%. This isn't an accident — they've spent decades refining their probability estimates against actual outcomes.

This is what a reliability diagram (or calibration curve) shows:

         Calibration: Confidence vs Reality

Actual   1.0 │                          ╱
Accuracy     │                       ╱⬤  ← Perfect calibration
(fraction    │                    ╱     (diagonal line)
correct) 0.7 │                 ╱  ●
             │              ╱ ●
         0.5 │           ╱●      ← Overconfident model
             │        ╱●          (says 80%, right only 60%)
         0.3 │     ╱ ●
             │  ╱ ●
         0.0 └────────────────────────────
             0.0   0.3   0.5   0.7   1.0
                  Stated Confidence

The Problem with Modern AI

Modern neural networks are often poorly calibrated — and they tend toward overconfidence. A model might output "92% confident this is benign" while actually being correct only 75% of the time. The prediction looks precise; it isn't.

This matters because decisions depend on these probabilities:

  - whether a flagged lesion warrants a biopsy,
  - which alerts a security or triage team reviews first,
  - how aggressively a patient is treated given a stated risk.

If those confidence scores are systematically wrong, the decisions will be systematically wrong too.

Calibration vs. Discrimination

Here's the subtle point: AUC and calibration measure different things.

  - Discrimination (what AUC measures): can the model rank positives above negatives?
  - Calibration: do the model's stated probabilities match the observed frequencies?

A model can perfectly separate cats from dogs (AUC = 1.0) while being horribly miscalibrated — claiming 99% confidence when it should say 70%. For many applications, especially in medicine and safety, you need both.

Measuring Calibration

Two common metrics:

  - Expected Calibration Error (ECE): bin predictions by stated confidence, then take the weighted average gap between each bin's confidence and its actual accuracy.
  - Brier score: the mean squared difference between predicted probabilities and actual outcomes, which mixes calibration with discrimination.

The good news: calibration can often be fixed after training through techniques like temperature scaling or Platt scaling, which adjust the probability outputs without changing the underlying predictions.
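Expected Calibration Error, for example, fits in a few lines. A minimal sketch (the function and the toy predictions are my own; the binning scheme is the standard equal-width one):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then average the
    |confidence - accuracy| gap per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf = 1.0 joins the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(confidences)) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: always says 90%, but is right only 60% of the time.
confs = [0.9, 0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 1, 0, 0]
print(f"ECE = {expected_calibration_error(confs, correct):.2f}")  # ECE = 0.30
```

A perfectly calibrated model scores 0; here the 0.30 gap is exactly the overconfidence (90% stated versus 60% achieved).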

Reducing Errors in Practice

To reduce false positives:

  - Demand stronger evidence before acting: stricter thresholds, replication, pre-registration.
  - Correct for multiple comparisons when running many tests at once.

To reduce false negatives:

  - Increase sensitivity: better instruments, repeated screening, lower decision thresholds.
  - Build in redundancy, so that one missed detection is not final.

To reduce both simultaneously:

  - Collect more data and better measurements. As with the ROC curve, this is the only move that improves the whole trade-off rather than sliding along it.

The Decision Framework

When designing any detection system — medical, security, scientific, or AI — ask:

  1. What's the cost of a false positive? (Unnecessary treatment, wrongful conviction, wasted research effort)
  2. What's the cost of a false negative? (Missed cancer, security breach, undiscovered scientific truth)
  3. Which error is more reversible? (Can you catch missed cases later? Can you undo false alarms?)
  4. What's the base rate? (If the condition is rare, even a good test will produce many false positives)
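Point 4 rewards a worked example. With invented but plausible numbers, Bayes' rule shows why even a good test for a rare condition produces mostly false positives:

```python
# Hypothetical screening test for a rare condition (all numbers are assumptions).
prevalence = 0.01    # 1% of the screened population has the condition
sensitivity = 0.90   # the test catches 90% of real cases
specificity = 0.95   # the test clears 95% of healthy people

# Fractions of the whole population landing in each outcome.
true_pos = prevalence * sensitivity               # 0.009
false_pos = (1 - prevalence) * (1 - specificity)  # 0.0495

# Positive predictive value: the chance a positive result is real.
ppv = true_pos / (true_pos + false_pos)
print(f"PPV = {ppv:.1%}")  # PPV = 15.4%
```

Despite 90% sensitivity and 95% specificity, fewer than one positive in six is a true case, because healthy people outnumber sick ones 99 to 1.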

There's no universal right answer. A screening test for a deadly but treatable cancer should accept many false positives. A test to convict someone of a crime should demand near-certainty. A scientific publication should require strong evidence — but perhaps not so strong that real effects go undiscovered for decades.

The key is to make the trade-off consciously, understanding what you're optimizing for and what you're sacrificing.


Links: Theranos Fraud Case (DOJ) | Replication Crisis (Wikipedia) | Type I and II Errors (Wikipedia) | ROC Curve (Wikipedia) | Calibration (statistics) (Wikipedia) | Forecast Calibration (World Climate Service) | Ten Psychology Findings That Failed to Replicate (BPS) | ECRI 2026 Health Tech Hazards (Healthcare Dive) | AI in Medical Imaging Instability (Cambridge)

#calibration #decision-making #methodology #replication-crisis #science #statistics