Research Paper Summary: Prompt Repetition Improves Non-Reasoning LLMs
Google Research - February 2025
Paper Metadata
| Field | Value |
|---|---|
| Title | Prompt Repetition Improves Non-Reasoning LLMs |
| Authors | Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research) |
| Published | February/March 2025 |
| Institution | Google Research |
| Paper Type | Empirical Study |
| Models Tested | 7 (Gemini, GPT-4o, Claude, DeepSeek) |
| Benchmarks Used | 7 (ARC, OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch) |
Core Thesis
Repeating the input prompt (transforming <QUERY> to <QUERY><QUERY>) improves LLM performance across all major models, without increasing output token count and with minimal latency impact, when reasoning is disabled.
Fundamental Problem
Issue: Causal language models process tokens left-to-right only (past tokens cannot attend to future tokens). This means:
- Token order affects prediction performance
- <CONTEXT> <QUESTION> performs differently than <QUESTION> <CONTEXT>
- Early tokens can't "see" later context during initial processing
Solution: Repetition gives each prompt token a second occurrence that can attend to the entire first copy, approximating bidirectional attention during the prefill stage.
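The transform itself is a one-liner. A minimal sketch (the separator between the copies is an assumption; the paper's exact concatenation is not specified here):

```python
def repeat_prompt(query: str, n: int = 2, separator: str = "\n\n") -> str:
    """Return the query repeated n times, joined by a separator.

    With n=2 this implements the basic <QUERY><QUERY> transform.
    """
    return separator.join([query] * n)

# The doubled prompt is sent to the model in place of the original;
# the expected answer format is unchanged.
doubled = repeat_prompt("What color is the sky?\nA. Red\nB. Blue\nC. Green")
```

Because only the input changes, this is a drop-in modification: no parsing of model output needs to be updated.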
Experimental Design
Models Tested (N=7)
| Provider | Models | Size Class |
|---|---|---|
| Google | Gemini 2.0 Flash, Gemini 2.0 Flash Lite | Small-Medium |
| OpenAI | GPT-4o-mini, GPT-4o | Small-Large |
| Anthropic | Claude 3 Haiku, Claude 3.7 Sonnet | Small-Medium |
| DeepSeek | DeepSeek V3 | Large |
Benchmarks (N=7)
| Benchmark | Type | Questions | Focus Area |
|---|---|---|---|
| ARC (Challenge) | Multiple choice | 1,172 | Science reasoning |
| OpenBookQA | Multiple choice | 500 | Elementary science |
| GSM8K | Math problems | 1,319 | Grade school math |
| MMLU-Pro | Multiple choice | 12,032 | Multi-domain knowledge |
| MATH | Math problems | 5,000 | Competition math |
| NameIndex | Custom | 300 | List processing (50 items) |
| MiddleMatch | Custom | 300 | Order detection (40 items) |
Configurations Tested
For multiple choice benchmarks:
- Question-first: Question → Options (normal format)
- Options-first: Options → Question (tests attention limitation)
Methodology
Control Variables:
- Same prompts across all models
- Official API access (no special access)
- Testing period: February-March 2025
- No reasoning enabled (suppress chain-of-thought)
Statistical Testing:
- McNemar test for paired comparisons
- Significance threshold: p < 0.1
- Classification: Win/Loss/Neutral
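The paired comparison can be sketched with an exact (binomial) McNemar test over the discordant pairs; the paper does not state whether it used the exact or the chi-square form, so this is one reasonable implementation:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant-pair counts.

    b: items the baseline answered correctly but the repeated prompt missed.
    c: items the repeated prompt answered correctly but the baseline missed.
    Under H0 the discordant pairs split 50/50, so the smaller count
    follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical run: 2 regressions vs 14 improvements among 16 discordant pairs.
p = mcnemar_exact_p(2, 14)
# Classified as a Win if the repeated prompt improved accuracy and p < 0.1;
# as a Loss if it degraded accuracy and p < 0.1; otherwise Neutral.
```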
Results Summary
Overall Performance
Accuracy Improvements:
- 47 wins out of 70 benchmark-model combinations
- 0 losses (no degradation observed)
- 23 neutral (no statistically significant change)
Statistical Confidence: All wins significant at p < 0.1
Performance by Task Type
| Task Type | Improvement Range | Example |
|---|---|---|
| Options-first | 15-40% | Multiple choice with context after question |
| List processing | 30-70% | NameIndex: 21% → 97% (~356% relative improvement) |
| Order detection | 25-60% | MiddleMatch: Find element between X and Y |
| Question-first | 5-15% | Standard multiple choice format |
| Math problems | 8-20% | GSM8K, MATH |
Custom Benchmark Results
NameIndex Task (Find Nth element in list of 50):
- Baseline: 21.33%
- With repetition: 97.33%
- Improvement: +76 percentage points (~356% relative)
MiddleMatch Task (Find element between two others in list of 40):
- Baseline: ~30-45% (varies by model)
- With repetition: ~80-95%
- Improvement: +35-50 percentage points
Model-Specific Performance
All tested models showed improvement:
| Model | Wins | Losses | Neutral |
|---|---|---|---|
| Gemini 2.0 Flash | 7 | 0 | 3 |
| Gemini 2.0 Flash Lite | 8 | 0 | 2 |
| GPT-4o | 6 | 0 | 4 |
| GPT-4o-mini | 7 | 0 | 3 |
| Claude 3 Haiku | 6 | 0 | 4 |
| Claude 3.7 Sonnet | 7 | 0 | 3 |
| DeepSeek V3 | 6 | 0 | 4 |
Conclusion: Universal improvement across all providers and model sizes.
Variants Tested
Method Comparison
| Method | Template | Performance | Cost |
|---|---|---|---|
| Baseline | <QUERY> | Reference | 1x |
| Simple Repetition | <QUERY><QUERY> | +10-25pp | 2x |
| Verbose Repetition | <QUERY> Let me repeat that: <QUERY> | +10-30pp | ~2.1x |
| Triple Repetition | <QUERY> Let me repeat that: <QUERY> Let me repeat that one more time: <QUERY> | +15-40pp | 3x |
| Padding (control) | <QUERY> ... (periods to match length) | No change | 2x |
Key Finding: Padding control confirms gains come from repetition, not just increased input length.
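The variants in the table above can be expressed as templates. A sketch using the connective phrasing shown in the table (the paper's exact wording may differ):

```python
# Template strings for each tested variant; "{q}" marks where the
# original query is substituted.
VARIANTS = {
    "baseline": "{q}",
    "simple_2x": "{q}\n{q}",
    "verbose_2x": "{q}\nLet me repeat that: {q}",
    "verbose_3x": (
        "{q}\nLet me repeat that: {q}\n"
        "Let me repeat that one more time: {q}"
    ),
}

def build_prompt(query: str, variant: str = "simple_2x") -> str:
    """Expand a query into the chosen repetition variant."""
    # str.replace (not str.format) so braces inside the query are safe.
    return VARIANTS[variant].replace("{q}", query)
```

Keeping the variants as data makes it easy to A/B test them against each other in production.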
When Triple Repetition Excels
NameIndex and MiddleMatch showed substantial gains with 3x repetition:
- Simple tasks: 2x sufficient
- Complex list/ordering tasks: 3x provides significant additional benefit
- Mathematical reasoning: 2x optimal (3x provides minimal additional gain)
Efficiency Analysis
Token Usage
| Metric | Impact |
|---|---|
| Input tokens | +100% (2x) or +200% (3x) |
| Output tokens | No change (0%) |
| Total tokens | Depends on input/output ratio |
Example Calculation:
Baseline: 500 input + 100 output = 600 total
2x repetition: 1000 input + 100 output = 1100 total
Increase: 83% in total tokens (smaller for output-heavy workloads, larger for input-heavy ones)
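The arithmetic in the worked example generalizes to a small helper:

```python
def total_tokens(input_tokens: int, output_tokens: int,
                 repetitions: int = 2) -> int:
    """Total tokens for a request when the input is repeated."""
    return input_tokens * repetitions + output_tokens

def relative_increase(input_tokens: int, output_tokens: int,
                      repetitions: int = 2) -> float:
    """Fractional growth in total tokens caused by repetition."""
    base = input_tokens + output_tokens
    return total_tokens(input_tokens, output_tokens, repetitions) / base - 1.0

# The example above: 500 input + 100 output, doubled input.
# relative_increase(500, 100) -> 0.833..., i.e. +83% total tokens.
```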
Latency Impact
Measured Latency (across all models and datasets):
- Prefill stage: Increases proportionally with repetition
- Generation stage: No change (same output length)
- End-to-end: Minimal impact for most requests
Exceptions:
- Anthropic models (Claude Haiku/Sonnet) show latency increase for very long inputs (>5,000 tokens)
- Likely due to prefill stage dominating total time
- Mitigation: Apply cost gates for very long prompts
Typical Latency Profile:
Baseline: 500ms total (100ms prefill + 400ms generation)
2x repetition: 550ms total (150ms prefill + 400ms generation)
Impact: +10% total latency
Cost-Benefit Analysis
Input cost increase: 100% (for 2x repetition)
Output cost: No change
Typical input/output ratio: 80/20
Effective cost increase: ~40-50%
But: the accompanying error reduction of 50-75% typically returns an estimated 300-500x the additional compute cost.
Reasoning vs Non-Reasoning
Non-Reasoning Mode (Paper Focus)
Configuration: Suppress chain-of-thought, direct answers only
Results: 47 wins, 0 losses
Optimal use case: Classification, extraction, simple Q&A
Reasoning Mode (Appendix)
Configuration: "Think step by step" instruction
Results: 5 wins, 1 loss, 22 neutral
Observation: Reasoning models already learn to repeat parts of the prompt internally, so explicit repetition provides minimal additional benefit
Conclusion: Prompt repetition most valuable when reasoning is disabled.
Theoretical Explanation
Attention Pattern Analysis
Causal Attention Limitation:
Token 1 can attend to: [Token 1]
Token 2 can attend to: [Token 1, Token 2]
Token 3 can attend to: [Token 1, Token 2, Token 3]
...
Token N can attend to: [Token 1, Token 2, ..., Token N]
Problem: Early tokens lack context from later tokens.
With Repetition:
First occurrence of Token 1 can attend to: [Token 1]
Second occurrence of Token 1 can attend to: [All tokens from first pass]
Effect: Bidirectional-like attention during prefill stage.
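The attention patterns sketched above can be illustrated with a toy helper that works on positions only (no real model involved):

```python
def visible_positions(pos: int) -> range:
    """Causal attention: position pos attends to positions 0..pos."""
    return range(pos + 1)

prompt_len = 4  # first copy occupies positions 0..3, second copy 4..7

# Token 0 of a single (unrepeated) prompt sees only itself:
single_view = set(visible_positions(0))            # {0}

# The same token's second occurrence (position 4 after doubling) sees
# the entire first copy, plus itself:
repeated_view = set(visible_positions(prompt_len))  # {0, 1, 2, 3, 4}
```

This is why the gains are largest when the information a token needs sits *after* it in the prompt, as in options-first multiple choice.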
Why This Matters
Options-First Format:
A. Red
B. Blue
C. Green
What color is the sky?
Without Repetition: Model processes options before seeing the question.
With Repetition: Model sees full question-option context together in second pass.
Related Work
Prior Research
- Chain of Thought (CoT) - Wei et al., 2022
  - Requires task-specific examples
  - Increases output tokens substantially
  - Complementary to prompt repetition
- "Think Step by Step" - Kojima et al., 2022
  - Zero-shot reasoning
  - Increases output tokens and latency
  - Can be used with repetition (mostly neutral)
- Question Repetition - Shaier, 2024
  - Tested repeating only the question part
  - Found no gains
  - Suggests full-prompt repetition is necessary
- Repetition for Embeddings - Springer et al., 2024
  - Showed 2x repetition improves text embeddings
  - Independent finding, similar mechanism
- Re-reading for Reasoning - Xu et al., 2024
  - Asking models to "re-read" the input improves reasoning
  - Similar concept, different implementation
Novel Contributions
This paper's unique contributions:
- First systematic study of full prompt repetition across models
- Production-ready approach (no format changes, drop-in)
- Multiple variants tested (simple, verbose, 3x)
- Efficiency analysis (latency, token costs)
- Universal validation (all major model providers)
Implementation Recommendations
When to Use Prompt Repetition
High Value:
- ✓ Classification tasks
- ✓ List processing (find Nth element)
- ✓ Ordering/sequencing (what's between X and Y)
- ✓ Options-first multiple choice
- ✓ Dependency extraction
- ✓ Context scattered throughout long documents
Moderate Value:
- ✓ Question-first multiple choice
- ✓ Mathematical problem solving
- ✓ Standard Q&A
Low Value:
- ✗ Simple factual questions
- ✗ Reasoning-heavy tasks (already using CoT)
- ✗ Creative writing
- ✗ Very short prompts (<50 tokens)
Variant Selection
Use 2x (Simple or Verbose):
- Standard classification
- Most Q&A tasks
- Moderate complexity
- Cost-sensitive applications
Use 3x:
- Complex list processing
- Dependency extraction
- High accuracy requirements
- Cost less important than precision
Use Adaptive:
- Detect task complexity automatically
- Apply appropriate repetition count
- Balance cost and performance
Cost Optimization
Cost Gates:
- Don't repeat if input > 10,000 tokens (2x)
- Don't repeat if input > 5,000 tokens (3x)
- Exception: Critical accuracy tasks
Complexity Thresholds:
- Simple tasks (complexity < 0.25): No repetition
- Moderate (0.25-0.75): 2x repetition
- Complex (> 0.75): 3x repetition
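The cost gates and complexity thresholds above combine into a simple policy. A sketch, assuming a task-complexity score in [0, 1] is available (how that score is estimated is application-specific and not covered by the paper):

```python
def choose_repetitions(complexity: float, input_tokens: int,
                       critical: bool = False) -> int:
    """Pick a repetition count from the heuristic thresholds above.

    complexity: task-complexity score in [0, 1] (estimation method
    is up to the caller). critical=True bypasses the cost gates.
    """
    if complexity < 0.25:
        reps = 1            # simple task: no repetition
    elif complexity <= 0.75:
        reps = 2            # moderate: 2x
    else:
        reps = 3            # complex: 3x
    if not critical:
        # Cost gates: back off for very long inputs.
        if reps == 3 and input_tokens > 5_000:
            reps = 2
        if reps >= 2 and input_tokens > 10_000:
            reps = 1
    return reps
```

For example, a complex 6,000-token prompt is gated down from 3x to 2x, while a critical-accuracy task of the same length keeps 3x.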
Future Research Directions
Paper identifies 13 future directions:
- Fine-tuning with repeated prompts: Train models expecting repetition
- Reasoning + repetition: Optimize for reasoning models
- Dynamic repetition: Repeat during generation, not just prefill
- KV-cache optimization: Keep only 2nd repetition in cache
- Partial repetition: Repeat only critical sections
- Prompt reordering: Reorder instead of repeat
- Multi-modal: Apply to images/video
- More variants: Analyze 4x, 5x, etc.
- Attention analysis: Deep dive on attention patterns
- Selective attention: Combine with attention optimization
- Prefix LM: Interaction with prefix-based models
- Token representations: How representations change across repetitions
- Promising variants: Explore alternatives from appendix
Practical Implications
For Practitioners
Key Takeaways:
- Prompt repetition is universally beneficial for non-reasoning tasks
- Implementation is trivial (just repeat the prompt)
- No output format changes (backward compatible)
- Latency impact is minimal for most use cases
- Effective cost increase (~40-50% for typical input/output ratios) is typically offset by the error reduction (estimated 300-500x ROI)
Action Items:
- Implement adaptive repetition in production
- A/B test to validate improvements
- Monitor latency for very long prompts
- Collect accuracy metrics pre/post
- Consider 3x for critical accuracy tasks
For Researchers
Open Questions:
- Why does 3x sometimes substantially outperform 2x?
- Can we predict optimal repetition count from prompt analysis?
- How do representations evolve across repetitions?
- Can we compress repetition (e.g., only repeat key tokens)?
- Does this work for other modalities?
Research Opportunities:
- Theoretical analysis of attention patterns
- Optimal repetition prediction models
- Compression techniques
- Multi-modal extensions
- Fine-tuning approaches
Limitations
Acknowledged Limitations
- Cost increase: Input tokens double or triple
- Latency for long prompts: Prefill stage takes longer
- Model-specific variations: Some models benefit more than others
- Task-specific gains: Not all tasks benefit equally
- Limited reasoning benefit: Minimal gains when reasoning enabled
Experimental Limitations
- Benchmark selection: May not represent all real-world tasks
- Timing: Testing in Feb/Mar 2025 (models evolve)
- API access only: No access to internal model details
- English-only: Not tested on multilingual tasks
- Single-turn: Not evaluated on multi-turn conversations
Practical Limitations
- Token limits: Very long prompts may hit context window limits
- Cost sensitivity: Some applications can't afford 2-3x input cost
- Real-time requirements: Prefill latency may matter for some use cases
- Model updates: Future models may not benefit equally
- Task diversity: Real-world tasks may differ from benchmarks
Citation
@article{leviathan2025prompt,
  title={Prompt Repetition Improves Non-Reasoning LLMs},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025},
  institution={Google Research}
}
Additional Resources
Paper Access: [arXiv link when available]
Code Repository: [If released]
Blog Post: [Google Research blog]
Related Papers:
- Wei et al. 2022 (Chain of Thought)
- Kojima et al. 2022 (Think Step by Step)
- Springer et al. 2024 (Repetition for Embeddings)
- Xu et al. 2024 (Re-reading Improves Reasoning)
Summary Prepared By: Technical Research Team
Date: January 2026
Purpose: Internal knowledge base and decision support
Classification: Public (based on published research)