just a tourist

Trust, Miscalibrated

By now most people use AI for something. Drafting an email. Reading a contract. Writing code. Diagnosing a rash. The question of how much to trust the output is being answered, implicitly, every minute, by hundreds of millions of people. Almost none of them have thought about it as a measurable property. Most are either deferring too much, taking outputs as fact because the prose sounds right, or rejecting too aggressively because of an earlier bad experience. Both are mistakes, and they are not symmetric mistakes. The literature on human-automation interaction has been thinking about this since the 1990s, and there is a small, surprisingly clean framework underneath. It is worth knowing.

What calibrated trust actually means

The framework starts with a single idea: trust is calibrated when a person's subjective confidence in a system matches the system's actual reliability. If the system is right 90% of the time and the user behaves as if it is right 90% of the time, trust is calibrated. If the user behaves as if it is right 99% of the time, trust is too high. Parasuraman and Riley called this misuse in their 1997 Human Factors paper "Humans and Automation: Use, Misuse, Disuse, Abuse." If the user treats the system as if it is right 50% of the time and refuses to rely on it, trust is too low: disuse, sometimes called algorithm aversion.

Lee and See's 2004 follow-up, "Trust in Automation: Designing for Appropriate Reliance," made the design implication explicit. The goal is not maximum trust. The goal is appropriate reliance: a 45° line where subjective trust equals system reliability, and any deviation in either direction is a failure. Lee and See also named a companion property, resolution: how sharply subjective trust distinguishes between different levels of objective trustworthiness. A user whose trust barely moves as reliability changes has poor resolution, even if the average happens to land near the line.
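To make the 45° line concrete, here is a minimal sketch that labels a (subjective trust, reliability) pair as calibrated, misuse, or disuse. The function name and the `tolerance` band are assumptions made for illustration; neither paper prescribes a threshold.

```python
def classify_reliance(subjective_trust: float, reliability: float,
                      tolerance: float = 0.05) -> str:
    """Label a point relative to the calibrated line (trust == reliability).

    `tolerance` is an illustrative assumption: how far from the line
    we still call "calibrated".
    """
    gap = subjective_trust - reliability
    if gap > tolerance:
        return "overtrust (misuse)"
    if gap < -tolerance:
        return "undertrust (disuse)"
    return "calibrated"


# The three cases from the text: a system that is right 90% of the time,
# treated as if it were right 99%, 90%, and 50% of the time.
for trust in (0.99, 0.90, 0.50):
    print(trust, classify_reliance(trust, reliability=0.90))
```

In practice the hard part is not the comparison but measuring either number: reliability varies by task, and subjective trust has to be inferred from behavior.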

The framework is older than LLMs and was originally aimed at autopilots and decision-support systems. It applies to AI assistants with almost no modification, except that AI assistants make the calibration problem dramatically harder.

            Subjective trust vs system reliability
┌────────────────────────────────────────────────────────────┐
││                                              +++++++++++oo│ 1
││                                   +++++++++++       xxoo  │
││                            +++++++              xxxxoo    │
││                       +++++                 xxxx  oo      │
││                    +++                  xxxx    oo        │
││                 +++                 xxxx     ooo          │
││              +++                xxxx       oo             │
││           +++               xxxx        ooo               │
││         ++              xxxx         ooo                  │
││       ++            xxxx          ooo                     │
││     ++          xxxx           ooo                        │
││    +        xxxx           oooo                           │
││  ++     xxxx           oooo                               │
││++   xxxx         oooooo                                   │
│+ xxxx     oooooooo                                         │
│ooooooooooo─────────────────────────────────────────────────│ 0
└────────────────────────────────────────────────────────────┘
0.0                           0.5                          1.0
         ++ Overtrust (misuse)   xx Calibrated (y=x)
                    oo Undertrust (disuse)

Illustrative. The 45° line is the calibrated case (subjective trust equals actual reliability). Curves above it are overtrust (misuse); curves below are undertrust (disuse). The two failure modes are the two regions, not the two endpoints.

Why AI breaks the trust cues humans use

Humans calibrate trust in other humans through cues: how fluent the person sounds, whether they hedge, whether they cite sources, whether their confidence tracks the difficulty of the question. These cues work, imperfectly, between humans because all of them are expensive to fake at scale. Fluent confident talk costs something (preparation, social risk), and most people don't pay it unless they have grounds. The cues correlate with actual competence not because they are competence but because they share a common cause.

LLMs break the correlation. Every single one of those cues is degraded or inverted when the speaker is a language model.

Fluency. Humans take fluency as evidence of competence. LLMs are fluent by training objective. They are fluent on questions they understand and questions they don't, on facts they remember and facts they invent. The fluency cue, when the speaker is a language model, conveys exactly zero information about correctness. Several CHI-track studies have found that users update their trust on fluency anyway.

Hedging. Humans use hedging language ("I think," "probably," "I'm not sure but") to signal uncertainty. The hedge is informative because saying it costs face. Linguistic-confidence research from the last two years (Wang et al. 2024, the late-2025 arXiv work on calibrating verbalized confidence) shows that LLM hedging is largely uncorrelated with internal confidence: models hedge when their training data hedged in similar contexts, not when they are actually unsure. Hedge frequency tracks topic, not reliability.

Citations. Humans treat citations as evidence of grounded sources. LLMs sometimes produce real citations and sometimes invent them, with no surface difference between the two. The 2023 New York case Mata v. Avianca, in which a lawyer submitted a brief full of fabricated case law, was an extreme version of a general pattern. Recent work on "situated faithfulness" (for instance the 2024 arXiv paper of that title) shows that even when the model is given external context, it tends to over-trust that context regardless of whether the context is correct.

Verbalized confidence. Ask an LLM "how confident are you?" and it will give a number. Several 2025-2026 papers have shown those numbers are systematically miscalibrated unless the model has been explicitly trained for calibration, which most have not been. The Rewarding Doubt line of work (arXiv 2503.02623) shows that the miscalibration can be reduced with the right reinforcement-learning objective; the default state, without that training, is overconfidence everywhere.
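One common way to put a number on "systematically miscalibrated" is expected calibration error (ECE): bucket answers by stated confidence, then compare each bucket's average confidence with its actual accuracy. The sketch below is a plain-Python version under simple assumptions (equal-width bins, confidences in [0, 1]); it is not the metric definition used by any specific paper cited here, and the sample data is invented.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Invented example: the model says "95% sure" ten times, is right six times.
stated = [0.95] * 10
outcomes = [True] * 6 + [False] * 4
print(round(expected_calibration_error(stated, outcomes), 2))
```

For the invented data the gap is about 0.35: the model claims 95% and delivers 60%. A calibrated model would score near zero.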

The net effect is brutal. Every surface signal a human uses to gauge another human's trustworthiness either fails or actively misleads when the speaker is a language model. People reading LLM output are reading a confidence cue that was optimized hard, by the training process, to look like the real thing.

Trust is a trajectory, not a number

So far this has been static. The actual experience of trust is dynamic. de Visser and colleagues' 2020 paper in the International Journal of Social Robotics, "Towards a Theory of Longitudinal Trust Calibration in Human-Robot Teams," proposed a longitudinal model with one striking central concept: relationship equity. Equity, in their framing, is the cumulative reserve of goodwill built from positive interactions over time. It explains why a trusted colleague's one bad day does not blow up the working relationship, while a stranger's first bad impression does. Equity is the moving average; any single interaction is a noisy sample.
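The "moving average" reading of equity can be sketched as an exponential moving average over interaction outcomes. This is a toy model of my own, not the formal model in de Visser et al.; the update rule, the `rate` parameter, and the outcome scale (+1 good, -1 bad) are all assumptions.

```python
from typing import Iterable


def update_equity(equity: float, outcome: float, rate: float = 0.1) -> float:
    """Blend one interaction outcome into the running equity (toy EMA)."""
    return (1 - rate) * equity + rate * outcome


def run(outcomes: Iterable[float], start: float = 0.0, rate: float = 0.1) -> float:
    """Fold a sequence of outcomes into a final equity value."""
    equity = start
    for outcome in outcomes:
        equity = update_equity(equity, outcome, rate)
    return equity


# Trusted colleague: thirty good interactions, then one bad day.
colleague = run([1.0] * 30 + [-1.0])
# Stranger: the bad interaction is the first and only sample.
stranger = run([-1.0])

print("colleague equity stays positive:", colleague > 0)
print("stranger equity goes negative:", stranger < 0)
```

The same bad outcome lands on very different reserves, which is the point of the equity concept. Note that this toy update treats good and bad outcomes symmetrically.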

The model gives four dynamics for how trust changes:

There is a specific failure mode inside the "destroyed" line that deserves its own paragraph because it is the most common way LLM trust is killed.

The trivial-task asymmetry. A confident-wrong answer on a hard question barely shifts the user's prior. Hard questions were going to be hard; the user has weak expectations going in. A confident-wrong answer on a trivial question (basic arithmetic, a word count, a date the user can check on Wikipedia in five seconds) is catastrophic. The textbook example is the model that confidently reports there are two r's in "strawberry" when there are three. The user is forced to conclude not just that the system got this one wrong, but that the confidence cue is uncorrelated with reality. That single observation retroactively invalidates every previous interaction in which the user trusted the system because it sounded sure. One easily verifiable confident blunder destroys more equity than dozens of quiet successes built. The cost is informational, not emotional. The user has just learned that they were never reading the confidence cue. They were reading the fluency cue.

The size of that learning depends on what the user already knows. The "strawberry" failure has a known structural cause: modern tokenizers split text into chunks that obscure individual letters, so the model is reasoning about tokens, not characters. A reader who knows this updates very little. For that reader, a miscounted letter is a known limitation, not a signal about everything else the model says. A reader who does not know this updates a lot. The trivial-task asymmetry is sharpest for users with the least mechanistic understanding of LLMs, which is most users. The user-side fix is to learn a few of the structural failure modes, so observed failures can be sorted into "known limitation" versus "the confidence cue is broken." That sorting is what protects the equity counter.

This is, in my impression of the field, the single most efficient mechanism for trust destruction in current LLM use. It is also the mechanism that an honestly calibrated system would deliberately defend against, by visibly refusing or correcting on the cheap cases: the ones where the cost of a mistake is low and the diagnostic value of getting it right is high.
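The trivial-task asymmetry can be read as a Bayesian update on one hypothesis: "this system's confidence tracks reality." The sketch below uses invented likelihoods purely to show the shape of the argument; none of the numbers come from the cited literature.

```python
def posterior(prior: float, p_obs_if_tracks: float, p_obs_if_not: float) -> float:
    """P(confidence tracks reality | observed a confident-wrong answer)."""
    numerator = prior * p_obs_if_tracks
    return numerator / (numerator + (1 - prior) * p_obs_if_not)


PRIOR = 0.8  # assumed starting belief that the confidence cue is real

# Assumed likelihoods of a confident-wrong answer under each hypothesis.
# Hard question: plausible either way, so the observation barely discriminates.
after_hard = posterior(PRIOR, p_obs_if_tracks=0.30, p_obs_if_not=0.40)

# Trivial question: very unlikely if confidence tracks reality.
after_trivial = posterior(PRIOR, p_obs_if_tracks=0.01, p_obs_if_not=0.20)

# Same trivial failure, but the reader knows the tokenizer explanation and
# expects it under both hypotheses, so it carries no information.
after_trivial_informed = posterior(PRIOR, p_obs_if_tracks=0.20, p_obs_if_not=0.20)

print(f"after hard miss:               {after_hard:.2f}")
print(f"after trivial miss:            {after_trivial:.2f}")
print(f"after trivial miss (informed): {after_trivial_informed:.2f}")
```

Under these assumptions the hard miss moves belief only slightly, the trivial miss collapses it, and the informed reader does not move at all. What drives the update is the likelihood ratio, not how embarrassing the error looks.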

There is one further awkward asymmetry for LLM products specifically. In de Visser's original framing, equity compounds across interactions because the interactions happen with the same teammate over time. Most consumer LLM use today resets every session. Equity does not carry. Users rebuild from scratch every conversation. Multi-turn agentic systems with persistent memory are the first products where intra-system equity even starts to accumulate. Even there, a single trivial-task blunder during the session can reset the equity counter to zero.

The industry trust stack: theater vs substance

The companies shipping AI systems are not unaware of any of this. They have built a stack of trust-signaling mechanisms. Some of these mechanisms track reality. Some perform trustworthiness without delivering it.

On the substance side: published model and system cards (Anthropic publishes one for each Claude release; Anthropic's February 2026 Opus 4.6 card includes benchmark comparisons across vendors), Responsible Scaling Policies (RSP) and the evaluation regimes that go with them, Constitutional AI as a training method with documented principles, formal red-teaming, model specs that pin down intended behavior. Stanford's Foundation Model Transparency Index measures all of this and ranks vendors on a real gradient. Anthropic sits at "Annotated Disclosure," OpenAI and Google at "Summary Disclosure," which corresponds to actual differences in what is published. These signals track substance. The numbers in the system cards are reproducible by anyone who runs the benchmarks. The behavior described in the model spec is observable in the deployed system.

On the mixed side: refusal messages. Sometimes a refusal expresses a real safety policy traceable to RSP commitments. Sometimes it is liability minimization styled as ethics, where the model declines a benign request because something tangentially adjacent has been flagged. The user cannot tell which they are getting. The signal is real on average but noisy in the particular.

On the theater side: hedging language that does not track calibration; verbalized confidence scores from models that were never trained to be calibrated; transparency hubs that summarize without revealing internals; citations rendered with the visual styling of a citation but no actual retrieval underneath; "AI safety" UI elements that perform safety without changing what the underlying model will produce. The diagnostic question for any trust signal is the same one Lee and See would have asked: does the signal change when the system is actually less trustworthy? If the hedge appears at the same rate regardless of internal confidence, the hedge is theater. If the same refusal pattern fires on both genuinely sensitive content and on merely embarrassing requests, the refusal is theater. If the system card lists capabilities without specifying the evaluation under which they were measured, the card is theater.
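The diagnostic question translates into a simple check, assuming you have a sample of outputs labeled for correctness: compare how often the signal (a hedge, a refusal, a low stated confidence) appears on wrong answers versus right ones. The function name, record format, and sample data below are hypothetical.

```python
from typing import Iterable, Tuple


def signal_lift(records: Iterable[Tuple[bool, bool]]) -> float:
    """Rate of the signal on wrong answers minus its rate on right answers.

    Each record is (signal_present, answer_correct). A lift near zero means
    the signal does not change when the system is less trustworthy.
    """
    on_wrong = [signal for signal, ok in records if not ok]
    on_right = [signal for signal, ok in records if ok]
    if not on_wrong or not on_right:
        raise ValueError("need both correct and incorrect answers")
    return sum(on_wrong) / len(on_wrong) - sum(on_right) / len(on_right)


# Invented data: a hedge that fires 30% of the time regardless of correctness.
theater = ([(True, True)] * 3 + [(False, True)] * 7
           + [(True, False)] * 3 + [(False, False)] * 7)

# Invented data: a hedge that fires on 80% of wrong answers, 10% of right ones.
informative = ([(True, False)] * 8 + [(False, False)] * 2
               + [(True, True)] * 1 + [(False, True)] * 9)

print("theater lift:", round(signal_lift(theater), 2))
print("informative lift:", round(signal_lift(informative), 2))
```

A lift near zero marks the signal as theater by the test above; a large positive lift means the signal carries information about reliability.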

The good news is that the same disciplines that produced the theater also produce the substance, and the substance is improving fast. Calibrated-confidence training is now a research line with results. Situated-faithfulness work is teaching models to weigh internal vs external evidence. Adaptive UI interventions, for example the "Adjust for Trust" CHI line, insert deliberate friction at high-risk moments. The trajectory is in the right direction. It is slow, and most users are not seeing the results yet.

The two-sided coordination problem

Appropriate reliance is an equilibrium between two players: the AI must produce signals that track its actual ability, and the user must read those signals correctly. The industry pours money into the first half: alignment, calibration, evals, model cards, red-teaming. The second half, AI literacy among users, is mostly self-taught from anecdotes, social media, and one-off incidents. The result is predictable. Users over-rely in domains where the AI sounds confident and the user does not have ground truth (medical, legal, niche technical), and they under-rely in domains where the AI would actually help most but the user is already overconfident (routine cognitive work, drafting, code review).

The companies that ship trust-calibration UI (visible confidence that tracks reality, deliberate friction on confident-wrong moments, interfaces that distinguish "I remember this" from "I am generating this") are doing work on both sides at once. Very few do. It does not sell well. A model that says "I'm not sure" loses a benchmark and loses a user demo. Calibrated humility costs the vendor immediately and pays the user back later. Most product roadmaps cannot tolerate that gradient.

One-second takeaway

Calibrated trust is the only kind of trust that compounds. Overtrust eventually crashes: the system confidently says something false in a domain that matters, and the user finds out. Undertrust fails quietly: the user never adopts the tool that would have helped, and the cost is invisible because it is a counterfactual. Between them sits the narrow band where reliable use actually lives. The width of that band is determined by how well the signals coming out of the system track its real ability, and by how well the person on the other end has learned to read those signals. Both have to move, or neither does.

A practical working rule, for anyone reading: assume the AI is more confident than it should be, and that the confidence cue you are reading is the part that has been optimized hardest. Before trusting it on something important, probe with something trivial that you can verify: a date, a sum, a quotation. The trivial probe gives you more information about the confidence cue than ten hard questions would, because a confident-wrong answer on a trivial check is informational gold. The signal you trust is the signal you most need to verify.


Links: Longitudinal Trust Calibration in Human-Robot Teams (de Visser et al. 2020, Int. J. Social Robotics) | Trust in Automation: Designing for Appropriate Reliance (Lee & See 2004, Human Factors) | Humans and Automation: Use, Misuse, Disuse, Abuse (Parasuraman & Riley 1997, Human Factors) | Rewarding Doubt: RL for Calibrated Confidence (arXiv) | Adjust for Trust (Srinivasan & Thomason, arXiv) | Situated Faithfulness in LLMs (arXiv) | Can LLMs Express Uncertainty Like Human? (arXiv) | Anthropic Transparency Hub (Anthropic) | Foundation Model Transparency Index 2025 (Stanford CRFM)

#ai #calibration #hci #human-factors #machine-learning #trust