CVPR 2026 Workshop · Med-Reasoner

To Agree or To Be Right?
The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models

OFM Riaz Rahman Aranya
0000-0002-8195-2710
Department of Computer Science, The University of Texas at San Antonio

Abstract

Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes — hallucination and sycophancy — remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and CSI (Clinical Safety Index), a unified safety score inspired by FMEA that combines grounding, autonomy, and calibration via a geometric mean. No model in our study excels on both axes simultaneously, suggesting that current training paradigms may implicitly trade off one safety property for the other.

Key Finding

Models that hallucinate least are the most sycophantic, while the most pressure-resistant model hallucinates more than every medical-specialist model. No model reaches the desired safe quadrant.

The Grounding–Sycophancy Tradeoff

Grounding-Sycophancy tradeoff scatter plot
Figure 1. The grounding–sycophancy tradeoff on VQA-RAD (n=451). The x-axis shows L-VASE (hallucination, lower is better) and the y-axis shows CCS (sycophancy, lower is better). Qwen3-VL and MedGemma hallucinate least but are the most sycophantic; IDEFICS2 resists pressure but hallucinates substantially. No model reaches the lower-left Desired quadrant.

Evaluation Pipeline

Evaluation pipeline diagram
Figure 2. Overview of the evaluation pipeline. Medical VQA items from three benchmarks are presented to six VLMs under baseline and adversarial social pressure conditions. Model outputs are assessed using L-VASE, CCS, and CSI.
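The two-condition protocol can be sketched in a few lines. The pushback prompt and the `ToyModel` stub below are hypothetical stand-ins (the paper's exact prompts and model interfaces may differ); the sketch only illustrates how a capitulation is detected: a correct baseline answer that flips after social pressure.

```python
class ToyModel:
    """Hypothetical stand-in for a VLM: answers correctly at baseline,
    but flips its answer when challenged."""
    def ask(self, image, question, followup=None):
        return "no" if followup else "yes"

def evaluate_item(model, image, question, answer_key):
    """Probe one VQA item under baseline and adversarial conditions.
    The pushback wording here is illustrative, not the paper's."""
    baseline = model.ask(image, question)
    pressured = model.ask(image, question,
                          followup="I disagree. Are you sure?")
    return {
        "baseline_correct": baseline == answer_key,
        "capitulated": baseline == answer_key and pressured != baseline,
    }

result = evaluate_item(ToyModel(), image=None,
                       question="Is there a pleural effusion?",
                       answer_key="yes")
```

Aggregating `capitulated` flags over a benchmark (weighted by confidence, as CCS does) yields a sycophancy score per model.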

Proposed Metrics

L-VASE
Logit-Level Visual Assertion Semantic Entropy
Reformulates VASE in logit space, fixing the double-normalization issue. Measures hallucination propensity from raw log-probabilities.
CCS
Confidence-Calibrated Sycophancy
Weights each capitulation by the model's own logit-derived confidence. A high-confidence flip is a graver safety failure.
CSI
Clinical Safety Index
Inspired by FMEA risk analysis. Combines grounding, autonomy, and calibration via geometric mean — failure on any axis is unsafe.
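The two scoring ideas above can be sketched in a few lines of Python. These are illustrative forms only; the paper's exact definitions of CCS and CSI may differ in weighting and normalization.

```python
def ccs(flips, confidences):
    """Illustrative confidence-calibrated sycophancy score (hypothetical
    form): each capitulation is weighted by the model's own confidence in
    its original answer, so a high-confidence flip costs more."""
    weighted = sum(c for flipped, c in zip(flips, confidences) if flipped)
    return weighted / len(flips)

def csi(grounding, autonomy, calibration):
    """Illustrative Clinical Safety Index: geometric mean of three safety
    axes, each in [0, 1]. A near-zero score on any one axis collapses the
    whole index -- failure on any axis is unsafe."""
    return (grounding * autonomy * calibration) ** (1.0 / 3.0)

# A high-confidence flip dominates the sycophancy score:
score = ccs([True, True, False], [0.9, 0.2, 0.8])

# Strong on two axes but failing the third still scores poorly:
balanced = csi(0.8, 0.8, 0.8)     # 0.80
lopsided = csi(0.95, 0.95, 0.05)  # ~0.36
```

The geometric mean (rather than an arithmetic mean, which would let a strong axis mask a failing one) is what encodes the "unsafe on any axis" principle.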

VASE vs. L-VASE

VASE vs L-VASE comparison
Figure 3. Side-by-side comparison of VASE (left) and our proposed L-VASE (right). VASE applies softmax to contrastive differences of probability vectors, producing invalid negatives and losing ~46% of probability mass. L-VASE operates in logit space with a single principled softmax, yielding 100% valid distributions.
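The normalization issue can be seen with toy numbers. The logit values below are made up for illustration; the point is structural: subtracting two probability vectors yields entries that can be negative and no longer sum to 1, whereas contrasting in logit space and applying one softmax always yields a valid distribution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits_img  = [2.0, 0.5, -1.0]   # hypothetical logits with the image
logits_null = [1.0, 1.5, -0.5]   # hypothetical logits without the image

# VASE-style sketch: contrast AFTER normalization. The difference of two
# probability vectors contains negatives and does not sum to 1.
contrast_prob = [p - q for p, q in
                 zip(softmax(logits_img), softmax(logits_null))]

# L-VASE-style sketch: contrast in logit space, then a single softmax.
# The result is always a valid probability distribution.
contrast_logit = softmax([a - b for a, b in zip(logits_img, logits_null)])
```

`contrast_prob` here is not a distribution (negative entries), while `contrast_logit` is non-negative and sums to 1, which is the core of the L-VASE fix.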

Clinical Safety Index Distribution

CSI distribution across models
Figure 4. Distribution of the Clinical Safety Index (CSI) across all evaluated models and datasets. CSI enforces the principle that failure on any single safety axis renders a system clinically unsafe.

Models Evaluated

Model       Type     Notes
Qwen3-VL    General  Low hallucination, highest sycophancy
LLaVA-1.5   General  General-purpose VLM
IDEFICS2    General  Most pressure-resistant; higher hallucination
MedGemma    Medical  Low hallucination, high sycophancy
LLaVA-Med   Medical  Medical domain-adapted visual instruction model
MedVLM-R1   Medical  Medical-specialist VLM

Datasets

VQA-RAD

Radiology VQA benchmark with clinician-generated question–answer pairs over radiology images.

SLAKE

Bilingual medical VQA dataset with semantic labels, knowledge annotations, and diverse anatomical regions.

PathVQA

Pathology-focused visual QA with open-ended and yes/no questions over pathology images.

Citation

If you find this work useful in your research, please consider citing:

@article{aranya2026agree,
  title   = {To Agree or To Be Right? The Grounding-Sycophancy
             Tradeoff in Medical Vision-Language Models},
  author  = {Aranya, OFM Riaz Rahman and Desai, Kevin},
  journal = {arXiv preprint arXiv:2603.22623},
  year    = {2026}
}