Hallucination-Aware Calibration
for Vision-Language Models

Empirical Studies on Overconfidence and Calibration in Medical VQA

Ji Young Byun¹, Young-Jin Park², Jean-Philippe Corbeil³, Asma Ben Abacha³
¹ Johns Hopkins University, ² MIT, ³ Microsoft

TL;DR

Medical Vision-Language Models (VLMs) are systematically overconfident, and popular fixes like prompting or scaling do not resolve this. We introduce Hallucination-Aware Calibration (HAC), which leverages vision-grounded hallucination signals to improve both calibration and ranking, reducing ECE by up to −0.38 pp and improving AUROC by up to +7.3 pp on open-ended questions.


1. Why Calibration Matters in Medicine

In clinical decision support, knowing when to trust a model is just as critical as its accuracy. Modern VLMs suffer from a dangerous combination: they hallucinate, and they are highly confident about it.

Example: Qwen3-VL-8B on the SLAKE benchmark (image xmlab449)
Image: brain MRI with tumor
Question: "Does the brain MRI show any abnormalities?"
Model answer: "No abnormal findings are observed." Confidence: 95%
Ground truth: brain edema, brain enhancing tumor
The consequence: The issue is not merely incorrect predictions, but overconfident ones. High-confidence errors can mislead clinicians about when to trust the model, posing a critical safety risk.

2. Overconfidence is Pervasive Across Models

We evaluate three open-source VLM families—Qwen3-VL, InternVL3, and LLaVA-NeXT—spanning 2B to 38B parameters on three medical VQA benchmarks (VQA-RAD, SLAKE, VQA-Med).

Figure 1. Mean confidence vs. accuracy for different question types on medical VQA benchmarks. All models fall below the diagonal (grey region), indicating consistent overconfidence. The gap is larger for open-ended questions.

Scaling Does Not Improve Calibration

Model            Accuracy  Confidence  Gap    ECE    ACE
Qwen3-VL-2B      .562      .860        .298   .316   .305
Qwen3-VL-8B      .605      .943        .338   .340   .338
Qwen3-VL-32B     .651      .954        .304   .307   .303
InternVL3-2B     .554      .774        .220   .227   .231
InternVL3-8B     .618      .811        .192   .204   .202
InternVL3-38B    .549      .767        .218   .231   .237
LLaVA-NeXT-7B    .434      .640        .206   .213   .226
LLaVA-NeXT-34B   .472      .751        .279   .285   .293

Table 1. Accuracy, confidence, overconfidence gap (confidence minus accuracy), ECE, and ACE on pooled medical VQA benchmarks. Larger models often achieve higher accuracy, but the overconfidence gap remains high at every scale.
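To make the metrics in Table 1 concrete, here is a minimal sketch of how the overconfidence gap, ECE, and ACE can be computed from per-question confidences and correctness labels. The bin count (10), equal-width binning for ECE, and equal-mass binning for ACE are common conventions assumed here; the page does not state the exact settings, and the function names are illustrative.

```python
def overconfidence_gap(conf, correct):
    """Mean confidence minus accuracy (positive means overconfident)."""
    return sum(conf) / len(conf) - sum(correct) / len(correct)


def _weighted_bin_error(bins, n):
    """Sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    err = 0.0
    for b in bins:
        if not b:
            continue  # skip empty bins
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        err += len(b) / n * abs(acc - mean_conf)
    return err


def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    return _weighted_bin_error(bins, len(conf))


def ace(conf, correct, n_bins=10):
    """Adaptive Calibration Error with equal-mass bins (assumed convention)."""
    pairs = sorted(zip(conf, correct))
    n = len(pairs)
    bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    return _weighted_bin_error(bins, n)


# Toy data (not from the paper): always 90% confident, right half the time.
toy_conf = [0.9] * 10
toy_correct = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(round(overconfidence_gap(toy_conf, toy_correct), 3))
print(round(ece(toy_conf, toy_correct), 3))
```

With equal-mass bins, ACE keeps every bin populated even when confidences cluster near 1, which is the typical situation for the overconfident models in Table 1.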

No Prompting Strategy Consistently Improves Calibration

Our study across 8+ confidence estimation strategies finds that no prompting strategy consistently improves calibration. CoT offers little to no benefit, and most verbalized variants fail to improve over vanilla prompting, often worsening calibration.

Figure 2. Adaptive Calibration Error (ACE) across sampling-based and verbalized confidence extraction methods and their prompting variants, averaged across 7/8B models on pooled benchmarks. No prompting strategy consistently improves calibration.
Key finding: Overconfidence is a systematic property, not an artifact of any particular setup. Neither scaling nor prompting resolves it.
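The page does not spell out how the two confidence types are extracted. A common reading, assumed here, is that sampling-based confidence is the share of sampled answers that agree with the most frequent answer, and verbalized confidence is the percentage the model states in its own response. The sketch below follows that assumption; the function names, the exact-match normalization, and the "Confidence: NN%" response format are illustrative, not the authors' pipeline.

```python
import re
from collections import Counter


def sampling_confidence(samples):
    """Return (majority answer, fraction of samples that agree with it).

    Assumes answers are compared by case-insensitive exact match; open-ended
    answers would likely need semantic matching instead.
    """
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def verbalized_confidence(response):
    """Parse a stated confidence such as 'Confidence: 95%' into 0.95.

    Returns None when no confidence statement is found.
    """
    match = re.search(r"confidence:\s*(\d+(?:\.\d+)?)\s*%", response, re.IGNORECASE)
    return float(match.group(1)) / 100 if match else None


majority, sampled_conf = sampling_confidence(["Yes", "yes", "no", "yes "])
stated_conf = verbalized_confidence(
    '"No abnormal findings are observed." Confidence: 95%'
)
print(majority, sampled_conf, stated_conf)
```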

3. Post-Hoc Calibration Works—But Has a Ceiling

Standard methods like Platt Scaling and Isotonic Regression dramatically reduce calibration error. But they have a fundamental structural limitation.

Figure 3. Calibration errors (ECE and ACE) before and after post-hoc calibration (Platt scaling) for sampling-based and verbalized confidence on closed- and open-ended questions. Post-hoc calibration consistently and substantially reduces calibration error.
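As a rough illustration of what Platt scaling does, the sketch below fits a two-parameter sigmoid to raw confidences by gradient descent on log loss. The data are synthetic and deliberately overconfident (not from the paper), and the optimizer, learning rate, and step count are arbitrary choices made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(conf, correct, lr=1.0, steps=3000):
    """Fit s(c) = sigmoid(a * c + d) by gradient descent on mean log loss."""
    a, d = 1.0, 0.0
    n = len(conf)
    for _ in range(steps):
        grad_a = grad_d = 0.0
        for c, y in zip(conf, correct):
            err = sigmoid(a * c + d) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * c
            grad_d += err
        a -= lr * grad_a / n
        d -= lr * grad_d / n
    return a, d


# Synthetic overconfident model: the true chance of being correct is p,
# but the reported confidence p ** 0.3 is pushed toward 1.
rng = random.Random(0)
true_p = [rng.uniform(0.2, 0.9) for _ in range(300)]
correct = [1 if rng.random() < p else 0 for p in true_p]
conf = [p ** 0.3 for p in true_p]

a, d = fit_platt(conf, correct)
calibrated = [sigmoid(a * c + d) for c in conf]

accuracy = sum(correct) / len(correct)
gap_before = sum(conf) / len(conf) - accuracy
gap_after = sum(calibrated) / len(calibrated) - accuracy
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

In practice the parameters would be fitted on a held-out calibration split rather than on the data being evaluated.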

The Monotonicity Trap

These methods apply monotonic transformations. They compress or stretch scores but cannot change the rank order of predictions.

Predictions sorted by confidence, highest first:

Outcome             Before calibration   After calibration (Platt)
✗ wrong             1.00                 .78
✓ correct           .95                  .56
✓ correct           .95                  .56
✗ wrong             .95                  .56
✓ correct           .90                  .31
✗ wrong             .90                  .31
✓ correct           .85                  .14
✗ wrong             .85                  .14
Optimal threshold   .95                  .56

Predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). Calibration rescales the overconfident scores, but the same wrong predictions remain above the threshold. AUROC stays at .614.

The core problem: Under any monotonic transformation, a confidently wrong prediction stays ranked above a less confident but correct one. Standard calibration fixes the scale but leaves discriminative quality (AUROC) frozen. To break this ceiling, we need a signal orthogonal to raw confidence.
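The monotonicity argument can be checked numerically. The sketch below computes AUROC (the probability that a correct prediction outranks a wrong one, with ties counted as one half) on the eight predictions displayed in the figure above, before and after a strictly increasing sigmoid map. The slope and intercept are hypothetical values chosen for illustration, and because only the eight displayed predictions are used, the resulting AUROC is not the .614 reported for the full set.

```python
import math


def auroc(scores, labels):
    """Fraction of (correct, wrong) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# The eight displayed predictions (1 = correct, 0 = wrong).
raw_conf = [1.00, 0.95, 0.95, 0.95, 0.90, 0.90, 0.85, 0.85]
labels = [0, 1, 1, 0, 1, 0, 1, 0]

# Hypothetical Platt-style parameters; any positive slope behaves the same way.
a, d = 8.0, -7.0
rescaled = [1.0 / (1.0 + math.exp(-(a * c + d))) for c in raw_conf]

auroc_before = auroc(raw_conf, labels)
auroc_after = auroc(rescaled, labels)
print(auroc_before, auroc_after)
```

Both values are identical because a strictly increasing map preserves every pairwise comparison, including ties.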

4. Hallucination-Aware Calibration (HAC)

We found that vision-grounded hallucination scores provide the orthogonal signal we need. For example, VASE (Liao et al., MICCAI 2025) detects hallucination by contrasting output distributions from the original image versus a weakly perturbed version—a higher score indicates the prediction is not grounded in the visual input.

Raw confidence ($c$) + hallucination score (VASE, $h$) → HAC score (calibrated and reranked)

HAC-Platt
$s(c, h) = \sigma(a \cdot c + b \cdot h + d)$
where $a \geq 0$ (the score rises with confidence) and $b \leq 0$ (the score falls with the hallucination score)

HAC-Gate
$s(c, h) = c \cdot \sigma(-\alpha \cdot h + \beta)$
A sigmoid gate attenuates confidence when the hallucination score is high.
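The two scoring rules translate directly into code. The sketch below is a plain transcription of the formulas; the parameter values in the usage lines are hypothetical (the page does not report fitted values), and treating $\alpha$ as non-negative is an assumption drawn from the statement that the gate attenuates confidence when hallucination is high.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hac_platt(c, h, a, b, d):
    """HAC-Platt: s(c, h) = sigmoid(a*c + b*h + d), with a >= 0 and b <= 0."""
    if a < 0 or b > 0:
        raise ValueError("HAC-Platt requires a >= 0 and b <= 0")
    return sigmoid(a * c + b * h + d)


def hac_gate(c, h, alpha, beta):
    """HAC-Gate: s(c, h) = c * sigmoid(-alpha*h + beta)."""
    return c * sigmoid(-alpha * h + beta)


# Two predictions with the same raw confidence but different hallucination
# scores. All parameter values below are hypothetical.
grounded_platt = hac_platt(0.95, 0.0, a=4.0, b=-1.0, d=-2.0)
hallucinated_platt = hac_platt(0.95, 2.2, a=4.0, b=-1.0, d=-2.0)
grounded_gate = hac_gate(0.95, 0.0, alpha=1.0, beta=0.0)
hallucinated_gate = hac_gate(0.95, 2.2, alpha=1.0, beta=0.0)
print(grounded_platt, hallucinated_platt, grounded_gate, hallucinated_gate)
```

Because the score depends on $h$ as well as $c$, two predictions tied on raw confidence can receive different scores, which a function of $c$ alone cannot do.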

Raw confidence, highest first (optimal threshold: .95):

Confidence   Outcome     VASE
1.00         ✗ wrong     0.7
.95          ✓ correct   0.0
.95          ✓ correct   0.5
.95          ✗ wrong     2.2
.90          ✓ correct   2.6
.90          ✗ wrong     2.0
.85          ✓ correct   1.1
.85          ✗ wrong     2.4

After HAC, sorted by HAC score, highest first:

HAC score   Outcome     Raw confidence
.79         ✗ wrong     1.00
.77         ✓ correct   .95
.68         ✓ correct   .95
(optimal threshold: .43)
.36         ✗ wrong     .95
.24         ✓ correct   .85
.23         ✗ wrong     .90
.16         ✓ correct   .90
.10         ✗ wrong     .85

Same predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). HAC uses hallucination score to penalize hallucinated predictions, breaking the monotonicity trap. AUROC improves from .614 to .734.
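The reranking effect can be reproduced in miniature on the eight displayed predictions, using their raw confidences and VASE scores with the HAC-Gate formula. The gate parameters ($\alpha = 1$, $\beta = 0$) are hypothetical, so the resulting scores differ from the HAC scores shown above, and the AUROC values apply only to these eight examples, not to the full evaluation set.

```python
import math


def auroc(scores, labels):
    """Fraction of (correct, wrong) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def hac_gate(c, h, alpha, beta):
    """HAC-Gate: s(c, h) = c * sigmoid(-alpha*h + beta)."""
    return c / (1.0 + math.exp(alpha * h - beta))


# Displayed predictions: raw confidence, VASE score, and outcome (1 = correct).
raw_conf = [1.00, 0.95, 0.95, 0.95, 0.90, 0.90, 0.85, 0.85]
vase = [0.7, 0.0, 0.5, 2.2, 2.6, 2.0, 1.1, 2.4]
labels = [0, 1, 1, 0, 1, 0, 1, 0]

# Hypothetical gate parameters, not fitted values from the paper.
hac_scores = [hac_gate(c, h, alpha=1.0, beta=0.0) for c, h in zip(raw_conf, vase)]

auroc_raw = auroc(raw_conf, labels)
auroc_hac = auroc(hac_scores, labels)
print(f"AUROC raw: {auroc_raw:.4f}, AUROC with gate: {auroc_hac:.4f}")
```

Even with untuned parameters, the gate pushes most of the high-VASE wrong predictions below the low-VASE correct ones, which no monotonic rescaling of confidence alone could achieve.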

5. Results

+5.3 pp: average AUROC improvement across all models
+7.3 pp: AUROC gain on open-ended questions
+10.1 pp: AUROC gain with verbalized confidence (open-ended)
Figure 4. AUROC comparison across post-hoc calibration methods on pooled datasets. Platt scaling preserves the same AUROC due to its monotonic transformation. HAC improves AUROC by incorporating hallucination signals, with larger gains on open-ended questions.

AUROC: Discriminative Quality (higher is better)

Platt scaling preserves AUROC by definition. HAC consistently improves it.

Model           Uncalibrated     Platt Scaling    HAC-Platt        HAC-Gate
                Samp.   Verb.    Samp.   Verb.    Samp.   Verb.    Samp.   Verb.
Qwen3-VL-8B     0.643   0.710    0.643   0.710    0.750   0.771    0.750   0.743
InternVL3-8B    0.764   0.614    0.764   0.614    0.789   0.742    0.793   0.736
LLaVA-NeXT-7B   0.706   0.603    0.706   0.603    0.725   0.703    0.721   0.702

Table 2. AUROC ($\uparrow$) comparison on open-ended questions (pooled datasets). HAC improves AUROC across all models while Platt scaling leaves it unchanged.

ACE: Calibration Error (lower is better)

Model           Uncalibrated     Platt Scaling    HAC-Platt        HAC-Gate
                Samp.   Verb.    Samp.   Verb.    Samp.   Verb.    Samp.   Verb.
Qwen3-VL-8B     0.400   0.363    0.098   0.095    0.086   0.075    0.086   0.075
InternVL3-8B    0.228   0.352    0.094   0.102    0.094   0.087    0.097   0.091
LLaVA-NeXT-7B   0.256   0.539    0.082   0.104    0.096   0.083    0.077   0.084

Table 3. ACE ($\downarrow$) comparison on open-ended questions (pooled datasets). HAC achieves calibration gains comparable to or better than standard post-hoc methods.

Bottom line: Among the methods compared, HAC is the only one that improves both calibration error and discriminative quality. Standard post-hoc methods reduce error but leave AUROC frozen. HAC breaks this ceiling by leveraging hallucination signals to rerank predictions.

Citation

@article{byun2026overconfidence,
  title   = {Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation},
  author  = {Byun, Ji Young and Park, Young-Jin and Corbeil, Jean-Philippe and Ben Abacha, Asma},
  journal = {arXiv preprint arXiv:2604.02543},
  year    = {2026}
}