PaperGuard logoDoes AI Reviewer See the Full Picture?
Attacking and Defending Multimodal Peer Review

ICML 2026
1The University of North Carolina at Chapel Hill 2Arizona State University
* Equal contribution
TL;DR

AI peer reviewers are pervasively vulnerable to cross-modal adversarial manipulation of submitted papers—through both text and figures. PaperGuard is the first benchmark to systematically measure these attacks and offers a practical, near-zero-false-positive defense.

Overview of PaperGuard attacks against AI peer reviewers
AI reviewers ingest a paper's text and figures to produce a review and numerical scores. PaperGuard exposes three attack families that all inflate scores: (a) black-box prompt injection (e.g., GPT-4o), (b) learnable text attacks via GCG (e.g., Qwen3-8B), and (c) imperceptible image perturbations (e.g., Qwen2.5-VL-7B). In each radar chart, the inflated attacked scores (orange) overshadow the clean assessment (green).

Abstract

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this score”) rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework makes three contributions: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

Contributions

PaperGuard is the first standardized framework to evaluate the robustness of AI-generated scientific reviews under multimodal adversarial manipulation.

Multimodal Dataset

1,136 ICLR and F1000Research papers across scientific domains, parsed into text and key method/results figures.

Unified Attack Suite

Black-box prompt injection plus white-box attacks across modalities—GCG for text, PGD / APGD / C&W for figures.

Practical Defense

Chunk-based embedding search that localizes hidden malicious instructions in long papers with near-zero false positives.

Framework

PaperGuard processes multi-platform papers, formulates cross-modal attack tasks designed to mislead AI reviewers, and proposes defenses to detect and mitigate them.

PaperGuard framework pipeline
The PaperGuard pipeline. Diverse papers from multiple platforms are parsed into text and figures. The attack task manipulates this content (prompt injection, image perturbation) to maximize review-score inflation, while the defense task identifies attacked papers via LLM-as-Judge, trained classifiers, and chunk-based embedding search.

Key Results

Across open-source and commercial LLMs/MLLMs, adversarial vulnerabilities persist broadly—and our defense is the only one that holds up across both modalities.

80%
Black-box prompt injection ASR against Claude-Sonnet-4.5
+14.1
Points of score inflation from text and visual attacks
95.0%
Detection accuracy of our Chunk-based Embedding Search
~0%
False positive rate—legitimate authors are not penalized

High susceptibility across SOTA models. Advanced LLMs and MLLMs are pervasively vulnerable to black-box prompt injection. Attacks succeed primarily by suppressing criticism rather than fabricating praise—negative review tags drop sharply (e.g., −3.26 per review on Mistral-Small-3.1).

Capability correlates with vulnerability. Stronger, larger models often succumb more easily due to superior instruction-following. The lowest ASR (DeepSeek-R1-Distill-Llama-8B, 0.46) reflects a capability failure—not security alignment.

Imperceptible figure perturbations alone mislead reviewers. White-box visual attacks (PGD / APGD / C&W) inflate scores without altering any text, demonstrating the insufficiency of text-only safeguards. Text GCG reaches up to 0.78 ASR on Qwen3-8B.

Standard defenses fail; ours holds. Moderation APIs and global classifiers reach 0.0 recall (instructions lost in document noise), while LLM-as-Judge suffers 100% FPR. Chunk-based Embedding Search achieves 95.0% / 93.5% accuracy and 92.86% / 90.32% recall on text / visual attacks at near-zero FPR.

Detection on 17 real arXiv papers with confirmed hidden prompt injections. Best in bold.
Defense Recall ↑ FPR ↓
EmbSearch (ours) 100.0 (17/17) 0.0
LLM-as-Judge (GPT-4o) 64.7 (11/17) 0.0
Moderation API 0.0 (0/17) 0.0

Case Study

A qualitative look at how each attack modality exploits a distinct vulnerability to achieve score inflation.

Qualitative case study of the three attack modalities
The three attack modalities evaluated in PaperGuard. (a) Prompt Injection uses obfuscated instructions (e.g., Base64) to bypass safety filters in proprietary models. (b) Learnable Text Attacks optimize adversarial suffixes that look like gibberish to humans but maximize the score vector in open-source LLMs. (c) Image Perturbation adds pixel-level noise to scientific figures to mislead MLLMs. Radar charts contrast inflated attacked (orange) and clean (green) scores.

BibTeX

@inproceedings{
zhao2026does,
title={Does {AI} Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review},
author={Zhao, Xinyu and Khan, Rana Muhammad Shahroz and Xu, Zhen and Tan, Zhen and Chen, Tianlong},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=l8RBjihPFk}
}