Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
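To make the objective concrete, the min-max formulation described above can be sketched as follows; the perturbation set Δ(x), the semantic-distance bound ε, and the shifted-context notation x ⊕ δ are illustrative choices of ours rather than the paper's exact notation:

\min_{\theta}\; \mathbb{E}_{(v,\, x,\, y_w,\, y_l)}\Big[\max_{\delta \in \Delta(x)} \mathcal{L}_{\mathrm{DPO}}\big(\theta;\; v,\; x \oplus \delta,\; y_w,\; y_l\big)\Big], \qquad \Delta(x) = \{\delta : d_{\mathrm{sem}}(x \oplus \delta,\, x) \le \epsilon\},

where, for image v, textual context x, preferred response y_w, and dispreferred response y_l, the inner term is the standard DPO loss

\mathcal{L}_{\mathrm{DPO}}(\theta;\, v,\, x,\, y_w,\, y_l) = -\log \sigma\!\Big(\beta \log \frac{\pi_{\theta}(y_w \mid v, x)}{\pi_{\mathrm{ref}}(y_w \mid v, x)} - \beta \log \frac{\pi_{\theta}(y_l \mid v, x)}{\pi_{\mathrm{ref}}(y_l \mid v, x)}\Big),

with reference policy π_ref and scaling factor β. The inner maximization applies a token-level distributional shift within a semantic budget, and the outer minimization is the expected preference loss under that shift.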
Figures (a) and (b) illustrate standard DPO and our proposed token-adaptive strategy. Figure (c) shows a representative VQA example where DPO generates a hallucinated answer, failing to ground its output in the image content. In contrast, TARS produces a visually aligned response. Figures (d) and (e) visualize token-to-query attention maps during decoding. The attention map from DPO reveals excessive focus on spurious correlation tokens unrelated to the image, while TARS correctly shifts attention toward causally grounded visual-semantic cues, enabling more reliable multimodal reasoning.
TARS reformulates preference optimization as a min-max problem: (1) the maximization branch perturbs visual-agnostic tokens to simulate semantically shifted contexts (red dashed box); (2) the minimization branch fine-tunes the model to align with human preferences via the DPO objective (purple dashed box). TARS encourages the model to attend to causally grounded visual signals rather than spurious textual correlations, thereby reducing hallucinations.
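As a rough sketch of this two-branch update in code: the snippet below assumes a torch-style interface where the policy and reference models are called as model(image, input_ids) and return next-token logits, approximates the maximization branch by sampling a few candidate perturbations of the prompt and keeping the one with the largest preference loss, and uses a crude random-resampling placeholder (perturb_tokens) for the semantically constrained, visual-agnostic perturbation. All names and interfaces here are our own illustration, not the released implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy, ref, image, prompt_ids, chosen_ids, rejected_ids, beta=0.1):
    # Standard DPO loss for one preference pair (chosen vs. rejected response).
    def seq_logprob(model, response_ids):
        ids = torch.cat([prompt_ids, response_ids], dim=-1)
        logits = model(image, ids)  # assumed MLLM interface: returns (batch, seq, vocab) logits
        resp_logits = logits[:, prompt_ids.size(-1) - 1:-1, :]  # positions that predict response tokens
        logps = torch.log_softmax(resp_logits, dim=-1)
        return logps.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

    pi_w, pi_l = seq_logprob(policy, chosen_ids), seq_logprob(policy, rejected_ids)
    with torch.no_grad():
        ref_w, ref_l = seq_logprob(ref, chosen_ids), seq_logprob(ref, rejected_ids)
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin).mean()

def perturb_tokens(prompt_ids, vocab_size, frac=0.1):
    # Placeholder for the semantically constrained perturbation of visual-agnostic tokens:
    # here we simply resample a small fraction of prompt tokens at random.
    mask = torch.rand(prompt_ids.shape, device=prompt_ids.device) < frac
    return torch.where(mask, torch.randint_like(prompt_ids, vocab_size), prompt_ids)

def tars_step(policy, ref, optimizer, batch, vocab_size, n_candidates=4):
    # Maximization branch: keep the perturbed context that most increases the preference loss.
    worst_loss, worst_prompt = None, batch["prompt_ids"]
    for _ in range(n_candidates):
        cand = perturb_tokens(batch["prompt_ids"], vocab_size)
        with torch.no_grad():
            loss = dpo_loss(policy, ref, batch["image"], cand,
                            batch["chosen_ids"], batch["rejected_ids"])
        if worst_loss is None or loss > worst_loss:
            worst_loss, worst_prompt = loss, cand

    # Minimization branch: an ordinary DPO gradient step under the worst-case context.
    optimizer.zero_grad()
    loss = dpo_loss(policy, ref, batch["image"], worst_prompt,
                    batch["chosen_ids"], batch["rejected_ids"])
    loss.backward()
    optimizer.step()
    return loss.item()

In the method itself, the perturbation is restricted to visual-agnostic tokens and bounded by a semantic constraint rather than sampled uniformly at random, so the candidate search above should be read only as a stand-in for that branch.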
Comparison across hallucination evaluation benchmarks. We evaluate state-of-the-art MLLMs as reference baselines, denoted by §. For algorithms with available checkpoints, re-tested results are marked with †; for those without, we reproduce results using the settings from the original papers, denoted by ‡. All experiments use greedy decoding with temperature set to 0 for consistency and reproducibility. Bold denotes the best performance, and underlined denotes the second-best.
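For reference, a minimal sketch of this decoding setup with the Hugging Face generate API is shown below; the checkpoint name, image path, and prompt are illustrative placeholders rather than the exact evaluation harness.

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder baseline checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder benchmark image
prompt = "USER: <image>\nIs there a dog in the picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Greedy decoding: sampling disabled, single beam, so temperature plays no role (equivalent to 0).
output_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))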
@article{zhang2025tarsminmaxtokenadaptivepreference,
  title={TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs},
  author={Kejia Zhang and Keda Tao and Zhiming Luo and Chang Liu and Jiasheng Tang and Huan Wang},
  journal={arXiv preprint arXiv:2507.21584},
  year={2025}
}