TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang, Huan Wang
1 Xiamen University   2 Westlake University   3 AWS AI Lab, Amazon
4 DAMO Academy, Alibaba Group   5 Hupan Laboratory
*Corresponding author.


Left: We present TARS, a token-adaptive preference strategy for mitigating hallucinations in MLLMs. TARS reformulates direct preference optimization (DPO) as a min-max objective that (1) minimizes behavioral misalignment via preference feedback and (2) maximizes adaptability through perturbations of visual-agnostic tokens.

Right: Evaluation of LLaVA-v1.5-13B trained with preference optimization (PO), alongside industrial MLLMs, on the AMBER benchmark shows that TARS surpasses PO baselines in hallucination suppression and matches the performance of GPT-4o.

Abstract

Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces the hallucination rate from 26.4% to 13.2% and lowers the cognition score from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
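
For concreteness, the two branches can be written schematically as follows; the notation is our paraphrase of the description above, not necessarily the paper's exact formulation. Let \pi_\theta denote the policy, \pi_{\mathrm{ref}} a frozen reference model, \beta a temperature, (v, x, y_w, y_l) an image, prompt, preferred response, and rejected response, and \Delta(x) a semantically constrained set of perturbations \delta over visual-agnostic tokens. Standard DPO minimizes

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(v,x,y_w,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid v, x)}{\pi_{\mathrm{ref}}(y_w \mid v, x)} - \beta\log\frac{\pi_\theta(y_l \mid v, x)}{\pi_{\mathrm{ref}}(y_l \mid v, x)}\right)\right],

whereas TARS optimizes the min-max objective

\min_{\theta}\;\max_{\delta \in \Delta(x)}\;\mathcal{L}_{\mathrm{DPO}}\!\left(\theta;\, v,\, x \oplus \delta,\, y_w,\, y_l\right),

so that the preference margin must survive worst-case shifts of the visual-agnostic context rather than being memorized from fixed preference pairs.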

Motivation

Figures (a) and (b) illustrate standard DPO and our proposed token-adaptive strategy. Figure (c) shows a representative VQA example where DPO generates a hallucinated answer, failing to ground its output in the image content. In contrast, TARS produces a visually aligned response. Figures (d) and (e) visualize token-to-query attention maps during decoding. The attention map from DPO reveals excessive focus on spurious correlation tokens unrelated to the image, while TARS correctly shifts attention toward causally grounded visual-semantic cues, enabling more reliable multimodal reasoning.

Figure: Motivation illustration for TARS.

Detailed Overview of Our Proposed Method

TARS reformulates preference optimization as a Min–Max problem: (1) The maximization branch perturbs visual-agnostic tokens to simulate semantically shifted contexts (red dashed box); (2) The minimization branch fine-tunes the model to align with human preferences via the DPO objective (purple dashed box). TARS encourages the model to attend to causally grounded visual signals rather than spurious textual correlations, thereby reducing hallucinations.
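
To make the training loop concrete, below is a minimal, self-contained PyTorch sketch of one min-max step. Everything specific in it is an illustrative assumption rather than the paper's implementation: ToyLM stands in for the MLLM policy, the perturbation is realized as l-inf-bounded additive noise on the embeddings of visual-agnostic token positions (found by a few steps of signed gradient ascent on the DPO loss), and the noise is applied only to the preferred-response pass for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Stand-in for an MLLM policy: embedding plus linear next-token head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def seq_logprob(self, ids, delta=None):
        # Sum of next-token log-probs for a batch of sequences, with an
        # optional additive perturbation `delta` on the token embeddings.
        h = self.emb(ids)
        if delta is not None:
            h = h + delta
        logits = self.head(h[:, :-1])                # position t predicts token t+1
        logp = F.log_softmax(logits, dim=-1)
        tgt = ids[:, 1:].unsqueeze(-1)
        return logp.gather(-1, tgt).squeeze(-1).sum(-1)

def dpo_loss(policy, ref, chosen, rejected, delta=None, beta=0.1):
    # Standard DPO logistic loss; the perturbation (if given) enters the
    # policy's preferred-response pass only (an illustrative choice).
    with torch.no_grad():
        ref_w = ref.seq_logprob(chosen)
        ref_l = ref.seq_logprob(rejected)
    pol_w = policy.seq_logprob(chosen, delta)
    pol_l = policy.seq_logprob(rejected)
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -F.logsigmoid(margin).mean()

def minmax_step(policy, ref, opt, chosen, rejected, agnostic_mask,
                eps=0.05, inner_steps=3, inner_lr=0.01):
    # Inner maximization: ascend the DPO loss over bounded embedding noise,
    # masked to visual-agnostic token positions; the "semantic constraint"
    # is approximated here by the l-inf budget `eps`.
    delta = torch.zeros(*chosen.shape, DIM, requires_grad=True)
    mask = agnostic_mask.unsqueeze(-1)               # (B, T, 1), broadcast over DIM
    for _ in range(inner_steps):
        loss = dpo_loss(policy, ref, chosen, rejected, delta * mask)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()          # gradient *ascent*
            delta.clamp_(-eps, eps)
    # Outer minimization: one descent step under the worst-case noise.
    opt.zero_grad()
    loss = dpo_loss(policy, ref, chosen, rejected, (delta * mask).detach())
    loss.backward()
    opt.step()
    return loss.item()

A toy invocation with random token ids and a random visual-agnostic mask:

policy, ref = ToyLM(), ToyLM()
ref.load_state_dict(policy.state_dict())
ref.requires_grad_(False)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-3)
chosen = torch.randint(0, VOCAB, (4, 16))
rejected = torch.randint(0, VOCAB, (4, 16))
agnostic = (torch.rand(4, 16) < 0.3).float()         # pretend 30% of tokens are visual-agnostic
print(minmax_step(policy, ref, opt, chosen, rejected, agnostic))

The point the sketch isolates is the alternation: the inner loop searches for the context shift that most degrades the preference margin, and the outer step updates the policy against that worst case, which is what discourages reliance on spurious textual cues.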

Figure: Overview of our proposed TARS method.

Experimental Results

Figure: Effectiveness of TARS on VQA tasks.

BibTeX


@article{zhang2025tarsminmaxtokenadaptivepreference,
  title   = {TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs},
  author  = {Kejia Zhang and Keda Tao and Zhiming Luo and Chang Liu and Jiasheng Tang and Huan Wang},
  journal = {arXiv preprint arXiv:2507.21584},
  year    = {2025}
}