
The Cost of Being Wrong: False Positives, False Negatives, and the Berlin Snow Storm That Wasn't

On January 10, 2026, Berlin braced for impact. Storm "Elli" was coming. Schools suspended mandatory attendance, the Zoo and Tierpark closed their gates, hospitals prepared for an influx of ice-related injuries, and Deutsche Bahn preemptively cancelled trains. Meteorologists warned of 15-20 cm of snow, dangerous ice rain, and wind gusts up to 60 km/h.

Then... almost nothing happened. The snow band passed north of the city. Berlin got a dusting while Hamburg drowned in white. Cue the inevitable backlash: Was this all overblown? Did we shut down a city for nothing?

This question sits at the heart of one of the most misunderstood concepts in prediction and decision-making: the asymmetric costs of false positives versus false negatives.

The Two Ways to Be Wrong

Every binary prediction can fail in exactly two ways:

  1. False positive: we predict the event, but it does not occur
  2. False negative: we fail to predict the event, and it does occur

In the Berlin case:

  1. The false positive is what actually happened: a city-wide warning for a storm that largely missed
  2. The false negative would have been the reverse: no warning, and 15-20 cm of snow hitting an unprepared city

The critical insight is that these errors are not symmetric. The cost of a false positive (a lost school day, some inconvenience) is fundamentally different from the cost of a false negative (an 86-year-old killed by a snow plow in Villingendorf, as actually happened elsewhere during Elli).

Expected Loss: Probability Meets Consequence

Risk management gives us a useful framework here. In credit risk, we decompose expected loss as:

$$\text{Expected Loss} = PD \times LGD \times EAD$$

Where:

  - PD is the probability of default
  - LGD is the loss given default (the fraction of exposure lost if default occurs)
  - EAD is the exposure at default

We can adapt this for prediction decisions:

$$\text{Expected Cost of Action } a = \sum_{s \in \text{States}} P(s) \times L(a, s)$$

For the weather warning decision, let W denote "issue warning" and ¬W denote "no warning", while S denotes "storm hits" and ¬S denotes "storm misses":

|                 | Storm Hits (S)                          | Storm Misses (¬S)            |
|-----------------|-----------------------------------------|------------------------------|
| Warn (W)        | $L_{TP}$ (correct, costs of precaution) | $L_{FP}$ (false alarm costs) |
| Don't Warn (¬W) | $L_{FN}$ (disaster costs)               | $L_{TN}$ (correct, no costs) |

The optimal decision rule becomes: Issue warning if

$$P(S) \cdot L_{FN} + P(\neg S) \cdot L_{TN} > P(S) \cdot L_{TP} + P(\neg S) \cdot L_{FP}$$

Rearranging:

$$P(S) > \frac{L_{FP} - L_{TN}}{(L_{FN} - L_{TP}) + (L_{FP} - L_{TN})}$$

If we assume $L_{TN} \approx L_{TP} \approx 0$ (correct decisions have minimal cost), this simplifies to:

$$P(S) > \frac{L_{FP}}{L_{FN} + L_{FP}}$$

Here's the key: when $L_{FN} \gg L_{FP}$, we should warn even at low probabilities.

If a false negative (missed severe storm) costs 100 units and a false positive (unnecessary warning) costs 1 unit:

$$P(S) > \frac{1}{100 + 1} \approx 0.01$$

We should issue the warning if there's even a 1% chance of the storm hitting. The asymmetry in consequences demands asymmetry in our decision threshold.
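
To make this concrete, here is a minimal sketch in Python. The threshold function implements the decision rule derived above; the cost values are illustrative assumptions, not real figures.

```python
# Warning-threshold sketch: the probability above which issuing a warning
# has lower expected cost than staying silent. Costs are illustrative.

def warning_threshold(l_fp: float, l_fn: float,
                      l_tp: float = 0.0, l_tn: float = 0.0) -> float:
    """P(S) above which 'warn' beats 'don't warn' in expected cost."""
    return (l_fp - l_tn) / ((l_fn - l_tp) + (l_fp - l_tn))

# Missed storm (false negative) assumed 100x as costly as a false alarm:
t = warning_threshold(l_fp=1.0, l_fn=100.0)
print(f"Warn whenever P(storm) > {t:.4f}")  # ~0.0099, i.e. about 1%
```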

ROC Curves: The Trade-off Made Visible

Every prediction system has a threshold that determines when it "triggers." A weather model doesn't output "storm" or "no storm"; it outputs a probability or a continuous score. We choose where to draw the line.

The Receiver Operating Characteristic (ROC) curve visualizes this trade-off:

  - True Positive Rate (TPR): the fraction of actual storms correctly warned about, $TP / (TP + FN)$
  - False Positive Rate (FPR): the fraction of non-events that trigger a warning anyway, $FP / (FP + TN)$

As we lower our warning threshold (become more cautious), we catch more true storms (TPR increases) but also trigger more false alarms (FPR increases). The ROC curve traces this relationship.

                       ROC Curve (TPR vs FPR)
    ┌─────────────────────────────────────────────────────┐
  1 │                          ●────────────────────────○ │
    │                      ╱       ↖ High sensitivity     │
    │                  ╱             (catch more,         │
    │              ╱                  more false alarms)  │
    │          ╱                                          │
T   │        ╱                                            │
P   │      ╱                                              │
R   │    ╱                                                │
    │   ●  ← High specificity                             │
    │  ╱       (fewer false alarms,                       │
    │ ╱         miss more storms)                         │
    │╱                                                    │
  0 ○─────────────────────────────────────────────────────│
    └─────────────────────────────────────────────────────┘
      0                    FPR                          1

The diagonal from (0,0) to (1,1) represents random guessing: a useless classifier. The further the curve bows toward the upper-left corner, the better the prediction system. Every point on the curve represents a different threshold choice.
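
A minimal sketch of how such a curve is computed: sweep the threshold over forecast scores and record (FPR, TPR) at each cut. The scores and outcomes below are simulated for illustration, not real forecast data.

```python
# ROC sketch: one (FPR, TPR) point per threshold over simulated forecasts.
import numpy as np

def roc_points(scores, labels):
    """Return FPR and TPR arrays, one point per threshold."""
    order = np.argsort(-scores)        # most confident predictions first
    labels = labels[order]
    tps = np.cumsum(labels)            # true positives if we cut here
    fps = np.cumsum(1 - labels)        # false positives if we cut here
    tpr = tps / labels.sum()
    fpr = fps / (1 - labels).sum()
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)               # 1 = storm occurred
scores = 0.3 * labels + rng.normal(0.4, 0.25, 200)  # noisy model scores

fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal AUC
print(f"AUC of the toy forecaster: {auc:.3f}")
```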

The choice of operating point on the ROC curve is fundamentally a value judgment, not a statistical one.

A weather service optimizing for public safety will operate high on the curve (high sensitivity, accepting more false alarms). A service optimizing for credibility might operate lower (fewer false alarms, but missing more events).

The DWD's decision to warn Berlin reflects a choice to operate at high sensitivity. Given that Elli caused significant damage in Hamburg and Schleswig-Holstein, and killed people in other regions, this seems defensible.

The Problem: Binary Outputs Hide Uncertainty

Here's where current practice fails us. The public received a binary message: UNWETTERWARNUNG (severe weather warning), yes or no. But the underlying meteorology was probabilistic.

Danijel Stanic from the DWD spoke of an "extreme situation," but what probability did the models actually assign to Berlin being hit? 30%? 50%? 80%? We don't know, because that information wasn't communicated.

This matters enormously for individual decision-making. Consider two scenarios:

  1. "80% chance of 15cm snow in Berlin" β†’ Most people stay home
  2. "20% chance of 15cm snow in Berlin" β†’ Many might risk the commute

By collapsing probability into a binary warning, we remove the individual's ability to calibrate their response to their own risk tolerance and circumstances. A surgeon who needs to perform an emergency operation has different stakes than someone whose meeting could be rescheduled.

Confidence Intervals: Communicating What We Don't Know

Beyond point probabilities, we should communicate uncertainty about our uncertainty. A forecast might say:

"Snow accumulation: 5-20 cm (90% confidence interval), most likely 12 cm"

Or for probability itself:

"Probability of >10cm snow: 40-60% (model disagreement range)"

This acknowledges that meteorological models themselves disagree, and that our probability estimate has uncertainty. The public can then understand not just the prediction, but how confident we are in that prediction.
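
One way to produce such statements is from an ensemble of model runs. The sketch below simulates a 40-member ensemble; the gamma distribution is purely an illustrative assumption, and real forecasts would of course use actual model output.

```python
# Interval-communication sketch: summarize an ensemble of model runs into
# the kind of statement proposed above. The ensemble here is simulated.
import numpy as np

rng = np.random.default_rng(1)
snow_cm = rng.gamma(shape=3.0, scale=4.0, size=40)  # fake 40-member ensemble

lo, hi = np.percentile(snow_cm, [5, 95])            # 90% interval
typical = np.median(snow_cm)
p_over_10 = (snow_cm > 10).mean()                   # fraction of members >10 cm

print(f"Snow accumulation: {lo:.0f}-{hi:.0f} cm (90% interval), "
      f"most likely around {typical:.0f} cm")
print(f"Probability of >10 cm snow: {p_over_10:.0%} (ensemble fraction)")
```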

The Bayesian Perspective

Bayesian statistics offers a natural framework for this kind of reasoning. Instead of asking "will the storm hit?" (a frequentist question about a single event), we ask "what is our degree of belief that the storm will hit, given all available evidence?"

This belief is expressed as a posterior distribution:

$$P(\text{Storm} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Storm}) \cdot P(\text{Storm})}{P(\text{Data})}$$

Where:

  - $P(\text{Storm})$ is the prior: our belief before the latest observations
  - $P(\text{Data} \mid \text{Storm})$ is the likelihood: how probable the observed data would be if a storm were coming
  - $P(\text{Data})$ is the evidence: the overall probability of observing this data
  - $P(\text{Storm} \mid \text{Data})$ is the posterior: our updated belief

Credible intervals from the posterior give us direct probability statements: "There is a 90% probability that snow accumulation will be between 5 and 20 cm." This is more intuitive than frequentist confidence intervals, which make statements about hypothetical repeated sampling.

For decision-making, we can then compute:

$$\text{Expected Loss}(a) = \int L(a, \theta) \cdot P(\theta \mid \text{Data}) \, d\theta$$

This integrates over our uncertainty, weighting each possible outcome by our belief in its occurrence.
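
In practice this integral is rarely solved analytically; it is typically approximated by Monte Carlo over posterior samples. A minimal sketch, assuming an illustrative gamma posterior over snow depth and a toy loss function:

```python
# Bayesian decision sketch: approximate E[loss] by averaging the loss over
# posterior samples of snow depth. Posterior and losses are illustrative.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.gamma(shape=3.0, scale=4.0, size=10_000)  # posterior samples (cm)

lo, hi = np.percentile(theta, [5, 95])
print(f"90% credible interval: {lo:.1f}-{hi:.1f} cm")

def loss(action, snow_cm):
    """Toy loss: warning always costs 1; an unwarned >10 cm snowfall costs 100."""
    if action == "warn":
        return np.ones_like(snow_cm)
    return np.where(snow_cm > 10, 100.0, 0.0)

for a in ("warn", "don't warn"):
    print(f"Expected loss({a}) ~ {loss(a, theta).mean():.2f}")
```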

A Critical Assessment of the Berlin Decision

Was the Berlin warning the right call? Let's examine the decision:

Arguments in favor:

  1. Asymmetric costs: The potential harm from an unwarned severe storm (deaths, injuries, chaos) far exceeds the cost of a false alarm (inconvenience, lost productivity)
  2. Genuine uncertainty: The snow band's exact path was uncertain until hours before. The models showed Berlin in the potential impact zone.
  3. Regional accuracy: The warning was correct for the broader region; Hamburg, Schleswig-Holstein, and parts of Brandenburg were severely affected
  4. Precautionary principle: Given the potential for loss of life, erring on the side of caution is appropriate

Arguments against:

  1. Erosion of trust: Repeated false alarms lead to "cry wolf" syndrome, potentially causing people to ignore future warnings
  2. Economic costs: Closed businesses, cancelled events, and disrupted logistics all carry real costs
  3. Binary communication: The all-or-nothing warning didn't allow for nuanced individual decisions
  4. Post-hoc clarity: With better mesoscale modeling, the sharp boundary could potentially have been predicted

My assessment:

The decision to warn was probably correct given the information available at decision time. The Tagesspiegel article makes this point well: the outcome on Friday "simply wasn't foreseeable" on Thursday when the decision was made.

However, the communication was suboptimal. A better approach might have been:

"Severe weather warning for Berlin region. Models show 40-70% probability of significant snow accumulation (10-20 cm). High confidence for areas north and east of Berlin. Exercise caution and consider postponing non-essential travel."

This communicates:

  1. The severity of the situation (it is still a warning)
  2. The probability range, including model disagreement (40-70%)
  3. Where confidence is higher (areas north and east of Berlin)
  4. A concrete, proportionate recommendation

Lessons for Prediction Under Uncertainty

  1. Asymmetric costs demand asymmetric thresholds: When false negatives are catastrophic and false positives are merely inconvenient, we should warn even at moderate probabilities.

  2. Communicate probabilities, not just decisions: Let individuals calibrate their response to their own circumstances and risk tolerance.

  3. Include uncertainty about uncertainty: Confidence/credible intervals acknowledge model disagreement and epistemic limitations.

  4. Distinguish aleatory from epistemic uncertainty: Some uncertainty is inherent randomness (where exactly will the snow band go); some is limited knowledge (better models might reduce it). Both matter, but they call for different responses.

  5. Evaluate decisions by process, not outcome: The Berlin warning should be judged by whether it was reasonable given available information, not by the fact that the storm missed. A good decision can have a bad outcome; that doesn't make it wrong.

  6. Build calibrated systems: Over many predictions, a well-calibrated system should have its "30% events" occur about 30% of the time. This allows users to trust and appropriately weight probabilistic forecasts.
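
A simple way to check calibration is to bin forecasts by stated probability and compare each bin's average forecast with its observed event frequency. The sketch below simulates a well-calibrated forecaster; real verification would use archived forecasts and outcomes.

```python
# Calibration sketch: a well-calibrated forecaster's "30% events" should
# occur about 30% of the time. Forecasts and outcomes here are simulated.
import numpy as np

rng = np.random.default_rng(3)
p_forecast = rng.uniform(0, 1, size=5000)  # stated probabilities
occurred = rng.random(5000) < p_forecast   # events occur at the stated rate

edges = np.linspace(0, 1, 11)              # ten probability bins
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (p_forecast >= lo) & (p_forecast < hi)
    if in_bin.any():
        print(f"forecast {lo:.1f}-{hi:.1f}: "
              f"observed frequency {occurred[in_bin].mean():.2f}")
```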

The storm that wasn't is not evidence of failure. It's evidence of a system appropriately calibrated for asymmetric risk. The real failure would be a system so afraid of false alarms that it misses the storm that kills.


"Uncertainties are part of meteorology" β€” and of life. The goal isn't to eliminate uncertainty, but to make good decisions despite it.


Links: Storm Warning Analysis (Tagesspiegel) | Berlin Storm Coverage (BZ Berlin) | Winter Storm Report (Tagesschau) | ROC Curves (Wikipedia) | Bayesian Inference (Wikipedia)

#decision-theory #risk-management #statistics #weather