Hallucination-Aware Calibration for Vision-Language Models

TL;DR

Medical Vision-Language Models (VLMs) are systematically overconfident, and popular fixes like prompting or scaling do not resolve this. We introduce Hallucination-Aware Calibration (HAC), which leverages vision-grounded hallucination signals to improve both calibration and ranking, reducing ECE by up to 0.38 pp and improving AUROC by up to 7.3 pp on open-ended questions.

1. Why Calibration Matters in Medicine

In clinical decision support, knowing when to trust a model is just as critical as its accuracy. Modern VLMs suffer from a dangerous combination: they hallucinate, and they are highly confident about it.

Medical VLM — Clinical Decision Support

"Does the brain MRI show any abnormalities?"

"No abnormal findings are observed." Confidence: 95%

Qwen3-VL-8B · SLAKE benchmark · Ground truth: brain edema, brain enhancing tumor

The consequence: The issue is not merely incorrect predictions, but overconfident ones. High-confidence errors can mislead clinicians about when to trust the model, posing a critical safety risk.

2. Overconfidence is Pervasive Across Models

We evaluate three open-source VLM families—Qwen3-VL, InternVL3, and LLaVA-NeXT—spanning 2B to 38B parameters on three medical VQA benchmarks (VQA-RAD, SLAKE, VQA-Med).

**Figure 1.** Mean confidence vs. accuracy for different question types on medical VQA benchmarks. All models fall below the diagonal (grey region), indicating consistent **overconfidence**. The gap is larger for open-ended questions.

Scaling Does Not Improve Calibration

Model	Accuracy	Confidence	Gap	ECE	ACE
Qwen3-VL-2B	.562	.860	.298	.316	.305
Qwen3-VL-8B	.605	.943	.338	.340	.338
Qwen3-VL-32B	.651	.954	.304	.307	.303
InternVL3-2B	.554	.774	.220	.227	.231
InternVL3-8B	.618	.811	.192	.204	.202
InternVL3-38B	.549	.767	.218	.231	.237
LLaVA-NeXT-7B	.434	.640	.206	.213	.226
LLaVA-NeXT-34B	.472	.751	.279	.285	.293

Table 1. Accuracy, confidence, overconfidence gap (confidence minus accuracy), ECE, and ACE on pooled medical VQA benchmarks. Larger models are generally more accurate, but the overconfidence gap stays high at every scale (between .19 and .34).
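ECE and ACE are reported above without a definition. The sketch below uses one common formulation: both measure the average gap between confidence and accuracy within bins, with ECE using equal-width bins and ACE using equal-mass bins. The bin counts and the toy data are assumptions for illustration, not the settings used for Table 1.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        bins[idx].append((c, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += len(b) / n * abs(acc - avg_conf)
    return total


def ace(confidences, correct, n_bins=10):
    """Adaptive Calibration Error: same gap, but bins hold equal sample counts."""
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0])
    n = len(pairs)
    total = 0.0
    for k in range(n_bins):
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        if not chunk:
            continue
        avg_conf = sum(c for c, _ in chunk) / len(chunk)
        acc = sum(y for _, y in chunk) / len(chunk)
        total += len(chunk) / n * abs(acc - avg_conf)
    return total


# Toy data (hypothetical): two confidence levels, 60% accuracy at each.
conf = [0.95] * 5 + [0.65] * 5
correct = [1, 1, 1, 0, 0] + [1, 1, 1, 0, 0]

ece_value = ece(conf, correct)            # 0.5*|0.6-0.95| + 0.5*|0.6-0.65|
ace_value = ace(conf, correct, n_bins=2)  # two equal-mass bins give the same split
print(f"ECE={ece_value:.3f}  ACE={ace_value:.3f}")
```

With many tied confidences, equal-mass bins can split a tie group arbitrarily, which is one reason ECE and ACE can disagree slightly on real data.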

No Prompting Strategy Consistently Improves Calibration

Our study across 8+ confidence estimation strategies finds that no prompting strategy consistently improves calibration. CoT offers little to no benefit, and most verbalized variants fail to improve over vanilla prompting, often worsening calibration.

**Figure 2.** Adaptive Calibration Error (ACE) across sampling-based and verbalized confidence extraction methods and their prompting variants, averaged across the 7B/8B models on pooled benchmarks. No prompting strategy consistently improves calibration.

Key finding: Overconfidence is a systematic property, not an artifact of any particular setup. Neither scaling nor prompting resolves it.

3. Post-Hoc Calibration Works—But Has a Ceiling

Standard methods like Platt Scaling and Isotonic Regression dramatically reduce calibration error. But they have a fundamental structural limitation.

**Figure 3.** Calibration errors (ECE and ACE) before and after post-hoc calibration (Platt scaling) for sampling-based and verbalized confidence on closed- and open-ended questions. Post-hoc calibration consistently and substantially reduces calibration error.

The Monotonicity Trap

These methods apply monotonic transformations. They compress or stretch scores but cannot change the rank order of predictions.

Illustration: predictions sorted by confidence (descending), before and after Platt scaling.

Outcome	Before calibration	After calibration (Platt)
✗ wrong	1.00	.78
✓ correct	.95	.56
✓ correct	.95	.56
✗ wrong	.95	.56
✓ correct	.90	.31
✗ wrong	.90	.31
✓ correct	.85	.14
✗ wrong	.85	.14

Optimal threshold: .95 before calibration, .56 after.

Predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). Calibration rescales the overconfident scores, but the same wrong predictions remain above the threshold. AUROC stays at .614.

The core problem: A confidently wrong prediction will always remain ranked above a less confident but correct one. Standard calibration fixes the scale but leaves discriminative quality (AUROC) frozen. To break this ceiling, we need a signal orthogonal to raw confidence.
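The invariance can be checked directly. The sketch below applies a Platt-style sigmoid to the eight raw confidences from the illustration and computes AUROC before and after. The parameters `A` and `B` are hypothetical values chosen to land near the illustrated outputs, not fitted values from the paper, and the eight points are a small subset, so their AUROC will not match the .614 reported for the full set.

```python
import math


def auroc(scores, correct):
    """Probability that a correct prediction outscores a wrong one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, correct) if y == 1]
    neg = [s for s, y in zip(scores, correct) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def platt(c, a, b):
    """Platt scaling: a sigmoid of an affine function of the raw confidence."""
    return 1.0 / (1.0 + math.exp(-(a * c + b)))


# Raw confidences and outcomes (1 = correct, 0 = wrong) from the illustration.
raw = [1.00, 0.95, 0.95, 0.95, 0.90, 0.90, 0.85, 0.85]
correct = [0, 1, 1, 0, 1, 0, 1, 0]

A, B = 20.5, -19.25  # hypothetical Platt parameters (A > 0 keeps the map increasing)
calibrated = [platt(c, A, B) for c in raw]

auroc_before = auroc(raw, correct)
auroc_after = auroc(calibrated, correct)
print(auroc_before, auroc_after)  # identical: the ordering and the ties are unchanged
```

Any strictly increasing map gives the same result, because AUROC depends only on pairwise comparisons between scores.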

4. Hallucination-Aware Calibration (HAC)

We found that vision-grounded hallucination scores provide the orthogonal signal we need. For example, VASE (Liao et al., MICCAI 2025) detects hallucination by contrasting output distributions from the original image versus a weakly perturbed version—a higher score indicates the prediction is not grounded in the visual input.

Raw confidence ($c$) + hallucination score (VASE, $h$) → HAC score (calibrated and reranked)

HAC-Platt

$s(c, h) = \sigma(a \cdot c \;+\; b \cdot h \;+\; d)$

where $a \geq 0$ (confidence ↑) and $b \leq 0$ (hallucination ↓)

HAC-Gate

$s(c, h) = c \cdot \sigma(-\alpha \cdot h + \beta)$

Sigmoid gate attenuates confidence when hallucination is high
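As a minimal sketch, the two scoring rules translate directly into code. All parameter values below are hypothetical and chosen only to show the effect; the function names are ours, and the non-negativity check on `alpha` is an assumption (the formula above does not state it, but the gate only attenuates with rising $h$ when $\alpha \geq 0$).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hac_platt(c, h, a, b, d):
    """s(c, h) = sigmoid(a*c + b*h + d), with a >= 0 and b <= 0."""
    if a < 0 or b > 0:
        raise ValueError("HAC-Platt expects a >= 0 and b <= 0")
    return sigmoid(a * c + b * h + d)


def hac_gate(c, h, alpha, beta):
    """s(c, h) = c * sigmoid(-alpha*h + beta): a gate in (0, 1) scales the confidence."""
    if alpha < 0:
        raise ValueError("assumed alpha >= 0 so the gate closes as h grows")
    return c * sigmoid(-alpha * h + beta)


# Two predictions with the same raw confidence but different hallucination scores.
c = 0.95
h_grounded, h_hallucinated = 0.0, 2.2

# Hypothetical parameters, for illustration only.
platt_params = dict(a=16.0, b=-0.8, d=-14.0)
gate_params = dict(alpha=1.0, beta=2.0)

platt_scores = (hac_platt(c, h_grounded, **platt_params),
                hac_platt(c, h_hallucinated, **platt_params))
gate_scores = (hac_gate(c, h_grounded, **gate_params),
               hac_gate(c, h_hallucinated, **gate_params))

print("HAC-Platt:", platt_scores)
print("HAC-Gate: ", gate_scores)
```

Under both rules the prediction with the higher hallucination score ends up ranked lower, even though the two raw confidences are tied. A transformation of $c$ alone cannot separate them.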

Illustration: the same predictions with raw confidence, VASE score, and HAC score, sorted by HAC score (descending).

Outcome	Raw confidence	VASE score	HAC score
✗ wrong	1.00	0.7	.79
✓ correct	.95	0.0	.77
✓ correct	.95	0.5	.68
✗ wrong	.95	2.2	.36
✓ correct	.85	1.1	.24
✗ wrong	.90	2.0	.23
✓ correct	.90	2.6	.16
✗ wrong	.85	2.4	.10

Optimal threshold: .95 on raw confidence, .43 on HAC score (between the .68 and .36 rows).

Same predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). HAC uses hallucination score to penalize hallucinated predictions, breaking the monotonicity trap. AUROC improves from .614 to .734.
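This post does not describe how the HAC-Platt parameters are fitted. One plausible approach, shown here purely as an assumption, is to minimize the negative log-likelihood on a held-out calibration set with projected gradient descent, clipping $a$ to be non-negative and $b$ to be non-positive after each step. The data are the eight (confidence, VASE score, outcome) triples from the illustration; with so few points the fitted values are only illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nll(params, data):
    """Mean negative log-likelihood of s(c, h) = sigmoid(a*c + b*h + d)."""
    a, b, d = params
    eps = 1e-12  # guards log(0)
    total = 0.0
    for c, h, y in data:
        s = sigmoid(a * c + b * h + d)
        total -= y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
    return total / len(data)


def fit_hac_platt(data, lr=0.1, steps=2000):
    """Projected gradient descent: a is clipped to >= 0 and b to <= 0 each step."""
    a, b, d = 0.0, 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        ga = gb = gd = 0.0
        for c, h, y in data:
            err = sigmoid(a * c + b * h + d) - y  # dNLL/dz for one sample
            ga += err * c
            gb += err * h
            gd += err
        a = max(0.0, a - lr * ga / n)
        b = min(0.0, b - lr * gb / n)
        d = d - lr * gd / n
    return a, b, d


# (raw confidence, VASE score, correct?) from the illustration above.
data = [(1.00, 0.7, 0), (0.95, 0.0, 1), (0.95, 0.5, 1), (0.95, 2.2, 0),
        (0.90, 2.6, 1), (0.90, 2.0, 0), (0.85, 1.1, 1), (0.85, 2.4, 0)]

initial_loss = nll((0.0, 0.0, 0.0), data)
a, b, d = fit_hac_platt(data)
final_loss = nll((a, b, d), data)
print(f"a={a:.3f} b={b:.3f} d={d:.3f}  NLL {initial_loss:.3f} -> {final_loss:.3f}")
```

A bounded quasi-Newton solver would serve the same purpose. The projection step is simply the shortest way to honor the sign constraints in a self-contained example.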

5. Results

+5.3 pp: average AUROC improvement across all models
+7.3 pp: AUROC gain on open-ended questions
+10.1 pp: AUROC gain for verbalized confidence on open-ended questions

**Figure 4.** AUROC comparison across post-hoc calibration methods on pooled datasets. Platt scaling preserves the same AUROC due to its monotonic transformation. HAC improves AUROC by incorporating hallucination signals, with larger gains on open-ended questions.

AUROC: Discriminative Quality (higher is better)

Platt scaling preserves AUROC by construction, because it is a monotonic transformation of the confidence score. HAC consistently improves it.

Model	Uncalibrated (Samp.)	Uncalibrated (Verb.)	Platt Scaling (Samp.)	Platt Scaling (Verb.)	HAC-Platt (Samp.)	HAC-Platt (Verb.)	HAC-Gate (Samp.)	HAC-Gate (Verb.)
Qwen3-VL-8B	0.643	0.710	0.643	0.710	0.750	0.771	0.750	0.743
InternVL3-8B	0.764	0.614	0.764	0.614	0.789	0.742	0.793	0.736
LLaVA-NeXT-7B	0.706	0.603	0.706	0.603	0.725	0.703	0.721	0.702

Table 2. AUROC ($\uparrow$) comparison on open-ended questions (pooled datasets). HAC improves AUROC across all models while Platt scaling leaves it unchanged.
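As a quick arithmetic check on Table 2, the snippet below averages the HAC-Platt gain over uncalibrated AUROC across the six model and confidence-type cells. The result rounds to 7.3 pp. Whether the headline open-ended figure was aggregated in exactly this way is an assumption, not something stated here.

```python
# AUROC values copied from Table 2: (sampling-based, verbalized) per model.
uncalibrated = {
    "Qwen3-VL-8B": (0.643, 0.710),
    "InternVL3-8B": (0.764, 0.614),
    "LLaVA-NeXT-7B": (0.706, 0.603),
}
hac_platt = {
    "Qwen3-VL-8B": (0.750, 0.771),
    "InternVL3-8B": (0.789, 0.742),
    "LLaVA-NeXT-7B": (0.725, 0.703),
}

# Per-cell gain of HAC-Platt over the uncalibrated score.
gains = [
    after - before
    for model in uncalibrated
    for before, after in zip(uncalibrated[model], hac_platt[model])
]

mean_gain_pp = 100 * sum(gains) / len(gains)
print(f"Mean AUROC gain: {mean_gain_pp:.1f} pp")  # 7.3 pp
```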

ACE: Calibration Error (lower is better)

Model	Uncalibrated (Samp.)	Uncalibrated (Verb.)	Platt Scaling (Samp.)	Platt Scaling (Verb.)	HAC-Platt (Samp.)	HAC-Platt (Verb.)	HAC-Gate (Samp.)	HAC-Gate (Verb.)
Qwen3-VL-8B	0.400	0.363	0.098	0.095	0.086	0.075	0.086	0.075
InternVL3-8B	0.228	0.352	0.094	0.102	0.094	0.087	0.097	0.091
LLaVA-NeXT-7B	0.256	0.539	0.082	0.104	0.096	0.083	0.077	0.084

Table 3. ACE ($\downarrow$) comparison on open-ended questions (pooled datasets). HAC achieves calibration gains comparable to or better than standard post-hoc methods.

Bottom line: Among the methods compared, HAC is the only one that improves both calibration error and discriminative quality. Standard post-hoc methods reduce error but leave AUROC frozen. HAC breaks this ceiling by leveraging hallucination signals to rerank predictions.

Citation

@article{byun2026overconfidence,
  title     = {Overconfidence and Calibration in Medical VQA:
               Empirical Findings and Hallucination-Aware Mitigation},
  author    = {Byun, Ji Young and Park, Young-Jin and
               Corbeil, Jean-Philippe and Ben Abacha, Asma},
  journal   = {arXiv preprint arXiv:2604.02543},
  year      = {2026}
}
