Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

Zhao, Xinyu; Khan, Rana Muhammad Shahroz; Xu, Zhen; Tan, Zhen; Chen, Tianlong

Does AI Reviewer See the Full Picture?
Attacking and Defending Multimodal Peer Review

ICML 2026

Xinyu Zhao^*¹ Rana Muhammad Shahroz Khan^*¹ Zhen Xu¹ Zhen Tan² Tianlong Chen¹

¹The University of North Carolina at Chapel Hill ²Arizona State University

^* Equal contribution

Paper arXiv Code Benchmark

TL;DR

AI peer reviewers are pervasively vulnerable to cross-modal adversarial manipulation of submitted papers—through both text and figures. PaperGuard is the first benchmark to systematically measure these attacks and offers a practical, near-zero-false-positive defense.

Overview of PaperGuard attacks against AI peer reviewers — AI reviewers ingest a paper's text and figures to produce a review and numerical scores. PaperGuard exposes three attack families that all inflate scores: **(a) black-box prompt injection** (e.g., GPT-4o), **(b) learnable text attacks** via GCG (e.g., Qwen3-8B), and **(c) imperceptible image perturbations** (e.g., Qwen2.5-VL-7B). In each radar chart, the inflated **attacked** scores (orange) overshadow the **clean** assessment (green).

Abstract

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this score”) rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework makes three contributions: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

Contributions

PaperGuard is the first standardized framework to evaluate the robustness of AI-generated scientific reviews under multimodal adversarial manipulation.

Multimodal Dataset

1,136 ICLR and F1000Research papers across scientific domains, parsed into text and key method/results figures.

Unified Attack Suite

Black-box prompt injection plus white-box attacks across modalities—GCG for text, PGD / APGD / C&W for figures.

Practical Defense

Chunk-based embedding search that localizes hidden malicious instructions in long papers with near-zero false positives.

Framework

PaperGuard processes multi-platform papers, formulates cross-modal attack tasks designed to mislead AI reviewers, and proposes defenses to detect and mitigate them.

Key Results

Across open-source and commercial LLMs/MLLMs, adversarial vulnerabilities persist broadly—and our defense is the only one that holds up across both modalities.

80%

Black-box prompt injection ASR against Claude-Sonnet-4.5

+14.1

Points of score inflation from text and visual attacks

95.0%

Detection accuracy of our Chunk-based Embedding Search

~0%

False positive rate—legitimate authors are not penalized

High susceptibility across SOTA models. Advanced LLMs and MLLMs are pervasively vulnerable to black-box prompt injection. Attacks succeed primarily by suppressing criticism rather than fabricating praise—negative review tags drop sharply (e.g., −3.26 per review on Mistral-Small-3.1).

Capability correlates with vulnerability. Stronger, larger models often succumb more easily due to superior instruction-following. The lowest ASR (DeepSeek-R1-Distill-Llama-8B, 0.46) reflects a capability failure—not security alignment.

Imperceptible figure perturbations alone mislead reviewers. White-box visual attacks (PGD / APGD / C&W) inflate scores without altering any text, demonstrating the insufficiency of text-only safeguards. Text GCG reaches up to 0.78 ASR on Qwen3-8B.

Standard defenses fail; ours holds. Moderation APIs and global classifiers reach 0.0 recall (instructions lost in document noise), while LLM-as-Judge suffers 100% FPR. Chunk-based Embedding Search achieves 95.0% / 93.5% accuracy and 92.86% / 90.32% recall on text / visual attacks at near-zero FPR.

Detection on 17 real arXiv papers with confirmed hidden prompt injections. Best in bold.
Defense	Recall ↑	FPR ↓
EmbSearch (ours)	100.0 (17/17)	0.0
LLM-as-Judge (GPT-4o)	64.7 (11/17)	0.0
Moderation API	0.0 (0/17)	0.0

Case Study

A qualitative look at how each attack modality exploits a distinct vulnerability to achieve score inflation.

BibTeX

@inproceedings{
zhao2026does,
title={Does {AI} Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review},
author={Zhao, Xinyu and Khan, Rana Muhammad Shahroz and Xu, Zhen and Tan, Zhen and Chen, Tianlong},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=l8RBjihPFk}
}

Does AI Reviewer See the Full Picture?Attacking and Defending Multimodal Peer Review