Does AI Reviewer See the Full
Picture?
Attacking and Defending Multimodal Peer Review
AI peer reviewers are pervasively vulnerable to cross-modal adversarial manipulation of submitted papers—through both text and figures. PaperGuard is the first benchmark to systematically measure these attacks and offers a practical, near-zero-false-positive defense.
Abstract
The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this score”) rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework makes three contributions: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.
Contributions
PaperGuard is the first standardized framework to evaluate the robustness of AI-generated scientific reviews under multimodal adversarial manipulation.
Multimodal Dataset
1,136 ICLR and F1000Research papers across scientific domains, parsed into text and key method/results figures.
Unified Attack Suite
Black-box prompt injection plus white-box attacks across modalities—GCG for text, PGD / APGD / C&W for figures.
Practical Defense
Chunk-based embedding search that localizes hidden malicious instructions in long papers with near-zero false positives.
Framework
PaperGuard processes multi-platform papers, formulates cross-modal attack tasks designed to mislead AI reviewers, and proposes defenses to detect and mitigate them.
Key Results
Across open-source and commercial LLMs/MLLMs, adversarial vulnerabilities persist broadly—and our defense is the only one that holds up across both modalities.
High susceptibility across SOTA models. Advanced LLMs and MLLMs are pervasively vulnerable to black-box prompt injection. Attacks succeed primarily by suppressing criticism rather than fabricating praise—negative review tags drop sharply (e.g., −3.26 per review on Mistral-Small-3.1).
Capability correlates with vulnerability. Stronger, larger models often succumb more easily due to superior instruction-following. The lowest ASR (DeepSeek-R1-Distill-Llama-8B, 0.46) reflects a capability failure—not security alignment.
Imperceptible figure perturbations alone mislead reviewers. White-box visual attacks (PGD / APGD / C&W) inflate scores without altering any text, demonstrating the insufficiency of text-only safeguards. Text GCG reaches up to 0.78 ASR on Qwen3-8B.
Standard defenses fail; ours holds. Moderation APIs and global classifiers reach 0.0 recall (instructions lost in document noise), while LLM-as-Judge suffers 100% FPR. Chunk-based Embedding Search achieves 95.0% / 93.5% accuracy and 92.86% / 90.32% recall on text / visual attacks at near-zero FPR.
| Defense | Recall ↑ | FPR ↓ |
|---|---|---|
| EmbSearch (ours) | 100.0 (17/17) | 0.0 |
| LLM-as-Judge (GPT-4o) | 64.7 (11/17) | 0.0 |
| Moderation API | 0.0 (0/17) | 0.0 |
Case Study
A qualitative look at how each attack modality exploits a distinct vulnerability to achieve score inflation.
BibTeX
@inproceedings{
zhao2026does,
title={Does {AI} Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review},
author={Zhao, Xinyu and Khan, Rana Muhammad Shahroz and Xu, Zhen and Tan, Zhen and Chen, Tianlong},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=l8RBjihPFk}
}