Empirical Studies on Overconfidence and Calibration in Medical VQA
Medical Vision-Language Models (VLMs) are systematically overconfident, and common remedies such as prompting strategies or larger models do not resolve this. We introduce Hallucination-Aware Calibration (HAC), which leverages vision-grounded hallucination signals to improve both calibration and ranking, reducing ECE by up to 0.38 pp and improving AUROC by up to 7.3 pp on open-ended questions.
In clinical decision support, knowing when to trust a model is just as critical as its accuracy. Modern VLMs suffer from a dangerous combination: they hallucinate, and they are highly confident about it.
We evaluate three open-source VLM families—Qwen3-VL, InternVL3, and LLaVA-NeXT—spanning 2B to 38B parameters on three medical VQA benchmarks (VQA-RAD, SLAKE, VQA-Med).
| Model | Accuracy | Confidence | Gap | ECE | ACE |
|---|---|---|---|---|---|
| Qwen3-VL-2B | .562 | .860 | .298 | .316 | .305 |
| Qwen3-VL-8B | .605 | .943 | .338 | .340 | .338 |
| Qwen3-VL-32B | .651 | .954 | .304 | .307 | .303 |
| InternVL3-2B | .554 | .774 | .220 | .227 | .231 |
| InternVL3-8B | .618 | .811 | .192 | .204 | .202 |
| InternVL3-38B | .549 | .767 | .218 | .231 | .237 |
| LLaVA-NeXT-7B | .434 | .640 | .206 | .213 | .226 |
| LLaVA-NeXT-34B | .472 | .751 | .279 | .285 | .293 |
Table 1. Accuracy, confidence, overconfidence gap (confidence minus accuracy), ECE, and ACE on pooled medical VQA benchmarks. Larger models often achieve higher accuracy, but the overconfidence gap remains large at every scale.
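The columns of Table 1 can be computed from per-question confidences and correctness labels. The following is a minimal sketch, assuming ECE uses equal-width confidence bins and ACE uses equal-mass (adaptive) bins. The bin count of 10 and the function names are assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def overconfidence_gap(conf, correct):
    """Mean confidence minus accuracy (the 'Gap' column)."""
    return float(np.mean(conf) - np.mean(correct))

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins (assumed binning)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bins are (lo, hi]; the clip keeps every index inside [0, n_bins - 1].
    idx = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| by its share of samples.
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def ace(conf, correct, n_bins=10):
    """Adaptive Calibration Error with equal-mass bins (assumed variant)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf)
    total = 0.0
    for chunk in np.array_split(order, n_bins):
        if chunk.size:
            total += (chunk.size / conf.size) * abs(
                correct[chunk].mean() - conf[chunk].mean()
            )
    return total

# Made-up example: four answers, two of them wrong.
conf = [0.95, 0.85, 0.75, 0.65]
correct = [1, 0, 1, 0]
print(overconfidence_gap(conf, correct), ece(conf, correct), ace(conf, correct, n_bins=2))
```

Equal-width bins can leave most bins empty when a model reports nearly all confidences above 0.9, which is one reason an equal-mass variant is often reported alongside ECE.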
Across the 8+ confidence estimation strategies we study, no prompting strategy consistently improves calibration. Chain-of-thought (CoT) prompting offers little to no benefit, and most verbalized-confidence variants fail to improve on vanilla prompting and often worsen calibration.
Standard methods like Platt Scaling and Isotonic Regression dramatically reduce calibration error. But they have a fundamental structural limitation.
These methods apply monotonic transformations. They compress or stretch scores but cannot change the rank order of predictions.
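The rank-preservation argument can be checked numerically. The sketch below applies a Platt-style map to the logit of each confidence and compares AUROC before and after. The parameters `a` and `b` are hand-picked rather than fitted, the data is made up, and the helper names are illustrative assumptions.

```python
import numpy as np

def auroc(scores, correct):
    """Probability that a correct answer outscores a wrong one (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = scores[correct], scores[~correct]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def platt(scores, a, b):
    """sigmoid(a * logit(s) + b); strictly increasing in s whenever a > 0."""
    s = np.clip(np.asarray(scores, dtype=float), 1e-6, 1 - 1e-6)
    z = a * np.log(s / (1 - s)) + b
    return 1.0 / (1.0 + np.exp(-z))

# Made-up overconfident scores: three correct and three wrong answers.
conf = np.array([0.99, 0.95, 0.97, 0.90, 0.98, 0.92])
correct = np.array([1, 0, 1, 1, 0, 0])

calibrated = platt(conf, a=0.5, b=-1.5)

print(conf.mean(), calibrated.mean())                      # mean confidence drops
print(auroc(conf, correct), auroc(calibrated, correct))    # identical AUROC
```

The rescaled scores are lower on average, yet every pairwise comparison between a correct and a wrong answer comes out the same way, so AUROC does not move.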
Predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). Calibration rescales the overconfident scores, but the same wrong predictions remain above the threshold. AUROC stays at .614.
We found that vision-grounded hallucination scores provide the orthogonal signal we need. For example, VASE (Liao et al., MICCAI 2025) detects hallucination by contrasting output distributions from the original image versus a weakly perturbed version—a higher score indicates the prediction is not grounded in the visual input.
Same predictions from InternVL3-8B on VQA-RAD (open-ended, verbalized confidence). HAC uses hallucination score to penalize hallucinated predictions, breaking the monotonicity trap. AUROC improves from .614 to .734.
Platt scaling leaves AUROC unchanged because it is a monotonic transformation of the confidence score. HAC consistently improves it.
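The exact HAC formulation is not spelled out on this page, so the following is only one plausible reading of a Platt-style variant: a logistic map that takes the hallucination score as a second input next to the confidence logit. The function name `hac_platt_sketch`, the weights `a`, `b`, `c`, and all data values are hypothetical; in practice the weights would be fitted on a held-out calibration set.

```python
import numpy as np

def auroc(scores, correct):
    """Probability that a correct answer outscores a wrong one (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    diff = scores[correct][:, None] - scores[~correct][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def hac_platt_sketch(conf, halluc, a, b, c):
    """Hypothetical two-feature map: sigmoid(a * logit(conf) + b - c * halluc).

    With c = 0 this reduces to a Platt-style map on confidence alone.
    With c > 0, a high hallucination score pulls the output down, so two
    predictions can swap rank.
    """
    s = np.clip(np.asarray(conf, dtype=float), 1e-6, 1 - 1e-6)
    z = a * np.log(s / (1 - s)) + b - c * np.asarray(halluc, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: the wrong answers carry high hallucination scores.
conf = np.array([0.99, 0.95, 0.97, 0.90, 0.98, 0.92])
halluc = np.array([0.1, 0.8, 0.2, 0.1, 0.9, 0.7])
correct = np.array([1, 0, 1, 1, 0, 0])

platt_only = hac_platt_sketch(conf, halluc, a=0.5, b=-1.5, c=0.0)
with_halluc = hac_platt_sketch(conf, halluc, a=0.5, b=-1.5, c=3.0)

print(auroc(conf, correct))         # 5/9, about 0.556
print(auroc(platt_only, correct))   # same as above: ranking unchanged
print(auroc(with_halluc, correct))  # 1.0 on this toy data
```

The toy data is constructed so that the second feature separates correct from wrong answers perfectly; real hallucination scores are noisier, and the gain depends on how much information they carry beyond the confidence itself.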
| Model | Uncalibrated (Samp.) | Uncalibrated (Verb.) | Platt Scaling (Samp.) | Platt Scaling (Verb.) | HAC-Platt (Samp.) | HAC-Platt (Verb.) | HAC-Gate (Samp.) | HAC-Gate (Verb.) |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 0.643 | 0.710 | 0.643 | 0.710 | 0.750 | 0.771 | 0.750 | 0.743 |
| InternVL3-8B | 0.764 | 0.614 | 0.764 | 0.614 | 0.789 | 0.742 | 0.793 | 0.736 |
| LLaVA-NeXT-7B | 0.706 | 0.603 | 0.706 | 0.603 | 0.725 | 0.703 | 0.721 | 0.702 |
Table 2. AUROC ($\uparrow$) comparison on open-ended questions (pooled datasets). HAC improves AUROC across all models while Platt scaling leaves it unchanged.
| Model | Uncalibrated (Samp.) | Uncalibrated (Verb.) | Platt Scaling (Samp.) | Platt Scaling (Verb.) | HAC-Platt (Samp.) | HAC-Platt (Verb.) | HAC-Gate (Samp.) | HAC-Gate (Verb.) |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 0.400 | 0.363 | 0.098 | 0.095 | 0.086 | 0.075 | 0.086 | 0.075 |
| InternVL3-8B | 0.228 | 0.352 | 0.094 | 0.102 | 0.094 | 0.087 | 0.097 | 0.091 |
| LLaVA-NeXT-7B | 0.256 | 0.539 | 0.082 | 0.104 | 0.096 | 0.083 | 0.077 | 0.084 |
Table 3. ACE ($\downarrow$) comparison on open-ended questions (pooled datasets). HAC achieves calibration gains comparable to or better than standard post-hoc methods.
@article{byun2026overconfidence,
title = {Overconfidence and Calibration in Medical VQA:
Empirical Findings and Hallucination-Aware Mitigation},
author = {Byun, Ji Young and Park, Young-Jin and
Corbeil, Jean-Philippe and Ben Abacha, Asma},
journal = {arXiv preprint arXiv:2604.02543},
year = {2026}
}