Research Paper Summary: Prompt Repetition Improves Non-Reasoning LLMs
Google Research - February 2025
Paper Metadata
| Field | Value |
|---|---|
| Title | Prompt Repetition Improves Non-Reasoning LLMs |
| Authors | Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research) |
| Published | February/March 2025 |
| Institution | Google Research |
| Paper Type | Empirical Study |
| Models Tested | 7 (Gemini, GPT-4o, Claude, DeepSeek) |
| Benchmarks Used | 7 (ARC, OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch) |
Core Thesis
Repeating the input prompt (transforming <QUERY> to <QUERY><QUERY>) improves LLM performance across all major models, without increasing output token count and with minimal latency impact, when reasoning is disabled.
Fundamental Problem
Issue: Causal language models process tokens left-to-right only (past tokens cannot attend to future tokens). This means:
- Token order affects prediction performance
- <CONTEXT> <QUESTION> performs differently than <QUESTION> <CONTEXT>
- Early tokens can't "see" later context during initial processing
Solution: Repetition gives each prompt token a second occurrence that can attend to the entire first copy, approximating bidirectional attention during the prefill stage.
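The transform itself is a one-liner. A minimal sketch (the separator between the copies is an assumption; the paper's exact concatenation is not specified here):

```python
def repeat_prompt(query: str, n: int = 2, separator: str = "\n\n") -> str:
    """Return the query repeated n times, joined by a separator.

    With n=2 this implements the basic <QUERY><QUERY> transform.
    """
    return separator.join([query] * n)

# The doubled prompt is sent to the model in place of the original;
# the expected answer format is unchanged.
doubled = repeat_prompt("What color is the sky?\nA. Red\nB. Blue\nC. Green")
```

Because only the input changes, this is a drop-in modification: no parsing of model output needs to be updated.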
Experimental Design
Models Tested (N=7)
| Provider | Models | Size Class |
|---|---|---|
| Google | Gemini 2.0 Flash, Gemini 2.0 Flash Lite | Small-Medium |
| OpenAI | GPT-4o-mini, GPT-4o | Small-Large |
| Anthropic | Claude 3 Haiku, Claude 3.7 Sonnet | Small-Medium |
| DeepSeek | DeepSeek V3 | Large |
Benchmarks (N=7)
| Benchmark | Type | Questions | Focus Area |
|---|---|---|---|
| ARC (Challenge) | Multiple choice | 1,172 | Science reasoning |
| OpenBookQA | Multiple choice | 500 | Elementary science |
| GSM8K | Math problems | 1,319 | Grade school math |
| MMLU-Pro | Multiple choice | 12,032 | Multi-domain knowledge |
| MATH | Math problems | 5,000 | Competition math |
| NameIndex | Custom | 300 | List processing (50 items) |
| MiddleMatch | Custom | 300 | Order detection (40 items) |
Configurations Tested
For multiple choice benchmarks:
- Question-first: Question → Options (normal format)
- Options-first: Options → Question (tests attention limitation)
Methodology
Control Variables:
- Same prompts across all models
- Official API access (no special access)
- Testing period: February-March 2025
- No reasoning enabled (suppress chain-of-thought)
Statistical Testing:
- McNemar test for paired comparisons
- Significance threshold: p < 0.1
- Classification: Win/Loss/Neutral
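The paired comparison can be sketched with an exact (binomial) McNemar test over the discordant pairs; the paper does not state whether it used the exact or the chi-square form, so this is one reasonable implementation:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant-pair counts.

    b: items the baseline answered correctly but the repeated prompt missed.
    c: items the repeated prompt answered correctly but the baseline missed.
    Under H0 the discordant pairs split 50/50, so the smaller count
    follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical run: 2 regressions vs 14 improvements among 16 discordant pairs.
p = mcnemar_exact_p(2, 14)
# Classified as a Win if the repeated prompt improved accuracy and p < 0.1;
# as a Loss if it degraded accuracy and p < 0.1; otherwise Neutral.
```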
Results Summary
Overall Performance
Accuracy Improvements:
- 47 wins out of 70 benchmark-model combinations
- 0 losses (no degradation observed)
- 23 neutral (no statistically significant change)
Statistical Confidence: All wins significant at p < 0.1
Performance by Task Type
| Task Type | Improvement Range | Example |
|---|---|---|
| Options-first | 15-40% | Multiple choice with context after question |
| List processing | 30-70% | NameIndex: 21% → 97% (~356% relative improvement) |
| Order detection | 25-60% | MiddleMatch: Find element between X and Y |
| Question-first | 5-15% | Standard multiple choice format |
| Math problems | 8-20% | GSM8K, MATH |
Custom Benchmark Results
NameIndex Task (Find Nth element in list of 50):
- Baseline: 21.33%
- With repetition: 97.33%
- Improvement: +76 percentage points (~356% relative)
MiddleMatch Task (Find element between two others in list of 40):
- Baseline: ~30-45% (varies by model)
- With repetition: ~80-95%
- Improvement: +35-50 percentage points
Model-Specific Performance
All tested models showed improvement:
| Model | Wins | Losses | Neutral |
|---|---|---|---|
| Gemini 2.0 Flash | 7 | 0 | 3 |
| Gemini 2.0 Flash Lite | 8 | 0 | 2 |
| GPT-4o | 6 | 0 | 4 |
| GPT-4o-mini | 7 | 0 | 3 |
| Claude 3 Haiku | 6 | 0 | 4 |
| Claude 3.7 Sonnet | 7 | 0 | 3 |
| DeepSeek V3 | 6 | 0 | 4 |
Conclusion: Universal improvement across all providers and model sizes.
Variants Tested
Method Comparison
| Method | Template | Performance | Cost |
|---|---|---|---|
| Baseline | <QUERY> | Reference | 1x |
| Simple Repetition | <QUERY><QUERY> | +10-25pp | 2x |
| Verbose Repetition | <QUERY> Let me repeat that: <QUERY> | +10-30pp | ~2.1x |
| Triple Repetition | <QUERY> Let me repeat that: <QUERY> Let me repeat that one more time: <QUERY> | +15-40pp | 3x |
| Padding (control) | <QUERY> ... (periods to match length) | No change | 2x |
Key Finding: Padding control confirms gains come from repetition, not just increased input length.
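The variants in the table above can be expressed as templates. A sketch using the connective phrasing shown in the table (the paper's exact wording may differ):

```python
# Template strings for each tested variant; "{q}" marks where the
# original query is substituted.
VARIANTS = {
    "baseline": "{q}",
    "simple_2x": "{q}\n{q}",
    "verbose_2x": "{q}\nLet me repeat that: {q}",
    "verbose_3x": (
        "{q}\nLet me repeat that: {q}\n"
        "Let me repeat that one more time: {q}"
    ),
}

def build_prompt(query: str, variant: str = "simple_2x") -> str:
    """Expand a query into the chosen repetition variant."""
    # str.replace (not str.format) so braces inside the query are safe.
    return VARIANTS[variant].replace("{q}", query)
```

Keeping the variants as data makes it easy to A/B test them against each other in production.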
When Triple Repetition Excels
NameIndex and MiddleMatch showed substantial gains with 3x repetition:
- Simple tasks: 2x sufficient
- Complex list/ordering tasks: 3x provides significant additional benefit
- Mathematical reasoning: 2x optimal (3x provides minimal additional gain)
Efficiency Analysis
Token Usage
| Metric | Impact |
|---|---|
| Input tokens | +100% (2x) or +200% (3x) |
| Output tokens | No change (0%) |
| Total tokens | Depends on input/output ratio |
Example Calculation:
Baseline: 500 input + 100 output = 600 total
2x repetition: 1000 input + 100 output = 1100 total
Increase: 83% in total tokens (smaller for output-heavy workloads, larger for input-heavy ones)
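The arithmetic in the worked example generalizes to a small helper:

```python
def total_tokens(input_tokens: int, output_tokens: int,
                 repetitions: int = 2) -> int:
    """Total tokens for a request when the input is repeated."""
    return input_tokens * repetitions + output_tokens

def relative_increase(input_tokens: int, output_tokens: int,
                      repetitions: int = 2) -> float:
    """Fractional growth in total tokens caused by repetition."""
    base = input_tokens + output_tokens
    return total_tokens(input_tokens, output_tokens, repetitions) / base - 1.0

# The example above: 500 input + 100 output, doubled input.
# relative_increase(500, 100) -> 0.833..., i.e. +83% total tokens.
```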
Latency Impact
Measured Latency (across all models and datasets):
- Prefill stage: Increases proportionally with repetition
- Generation stage: No change (same output length)
- End-to-end: Minimal impact for most requests
Exceptions:
- Anthropic models (Claude Haiku/Sonnet) show latency increase for very long inputs (>5,000 tokens)
- Likely due to prefill stage dominating total time
- Mitigation: Apply cost gates for very long prompts
Typical Latency Profile:
Baseline: 500ms total (100ms prefill + 400ms generation)
2x repetition: 550ms total (150ms prefill + 400ms generation)
Impact: +10% total latency
Cost-Benefit Analysis
Input cost increase: 100% (for 2x repetition)
Output cost: No change
Typical input/output ratio: 80/20
Effective cost increase: ~40-50%
But: the accompanying error reduction of 50-75% typically returns an estimated 300-500x the additional compute cost.
Reasoning vs Non-Reasoning
Non-Reasoning Mode (Paper Focus)
Configuration: Suppress chain-of-thought, direct answers only
Results: 47 wins, 0 losses
Optimal use case: Classification, extraction, simple Q&A
Reasoning Mode (Appendix)
Configuration: "Think step by step" instruction
Results: 5 wins, 1 loss, 22 neutral
Observation: Reasoning models already learn to repeat parts of the prompt internally, so explicit repetition provides minimal additional benefit
Conclusion: Prompt repetition most valuable when reasoning is disabled.
Theoretical Explanation
Attention Pattern Analysis
Causal Attention Limitation:
Token 1 can attend to: [Token 1]
Token 2 can attend to: [Token 1, Token 2]
Token 3 can attend to: [Token 1, Token 2, Token 3]
...
Token N can attend to: [Token 1, Token 2, ..., Token N]
Problem: Early tokens lack context from later tokens.
With Repetition:
First occurrence of Token 1 can attend to: [Token 1]
Second occurrence of Token 1 can attend to: [All tokens from first pass]
Effect: Bidirectional-like attention during prefill stage.
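The attention patterns sketched above can be illustrated with a toy helper that works on positions only (no real model involved):

```python
def visible_positions(pos: int) -> range:
    """Causal attention: position pos attends to positions 0..pos."""
    return range(pos + 1)

prompt_len = 4  # first copy occupies positions 0..3, second copy 4..7

# Token 0 of a single (unrepeated) prompt sees only itself:
single_view = set(visible_positions(0))            # {0}

# The same token's second occurrence (position 4 after doubling) sees
# the entire first copy, plus itself:
repeated_view = set(visible_positions(prompt_len))  # {0, 1, 2, 3, 4}
```

This is why the gains are largest when the information a token needs sits *after* it in the prompt, as in options-first multiple choice.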
Why This Matters
Options-First Format:
A. Red
B. Blue
C. Green
What color is the sky?
Without Repetition: Model processes options before seeing the question.
With Repetition: Model sees full question-option context together in second pass.
Related Work
Prior Research
- Chain of Thought (CoT) - Wei et al., 2022
  - Requires task-specific examples
  - Increases output tokens substantially
  - Complementary to prompt repetition
- "Think Step by Step" - Kojima et al., 2022
  - Zero-shot reasoning
  - Increases output tokens and latency
  - Can be used with repetition (mostly neutral)
- Question Repetition - Shaier, 2024
  - Tested repeating only the question part
  - Found no gains
  - Suggests full-prompt repetition is necessary
- Repetition for Embeddings - Springer et al., 2024
  - Showed 2x repetition improves text embeddings
  - Independent finding, similar mechanism
- Re-reading for Reasoning - Xu et al., 2024
  - Asking models to "re-read" the input improves reasoning
  - Similar concept, different implementation
Novel Contributions
This paper's unique contributions:
- First systematic study of full prompt repetition across models
- Production-ready approach (no format changes, drop-in)
- Multiple variants tested (simple, verbose, 3x)
- Efficiency analysis (latency, token costs)
- Universal validation (all major model providers)
Implementation Recommendations
When to Use Prompt Repetition
High Value:
- ✓ Classification tasks
- ✓ List processing (find Nth element)
- ✓ Ordering/sequencing (what's between X and Y)
- ✓ Options-first multiple choice
- ✓ Dependency extraction
- ✓ Context scattered throughout long documents
Moderate Value:
- ✓ Question-first multiple choice
- ✓ Mathematical problem solving
- ✓ Standard Q&A
Low Value:
- ✗ Simple factual questions
- ✗ Reasoning-heavy tasks (already using CoT)
- ✗ Creative writing
- ✗ Very short prompts (<50 tokens)
Variant Selection
Use 2x (Simple or Verbose):
- Standard classification
- Most Q&A tasks
- Moderate complexity
- Cost-sensitive applications
Use 3x:
- Complex list processing
- Dependency extraction
- High accuracy requirements
- Cost less important than precision
Use Adaptive:
- Detect task complexity automatically
- Apply appropriate repetition count
- Balance cost and performance
Cost Optimization
Cost Gates:
- Don't repeat if input > 10,000 tokens (2x)
- Don't repeat if input > 5,000 tokens (3x)
- Exception: Critical accuracy tasks
Complexity Thresholds:
- Simple tasks (complexity < 0.25): No repetition
- Moderate (0.25-0.75): 2x repetition
- Complex (> 0.75): 3x repetition
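The cost gates and complexity thresholds above combine into a simple policy. A sketch, assuming a task-complexity score in [0, 1] is available (how that score is estimated is application-specific and not covered by the paper):

```python
def choose_repetitions(complexity: float, input_tokens: int,
                       critical: bool = False) -> int:
    """Pick a repetition count from the heuristic thresholds above.

    complexity: task-complexity score in [0, 1] (estimation method
    is up to the caller). critical=True bypasses the cost gates.
    """
    if complexity < 0.25:
        reps = 1            # simple task: no repetition
    elif complexity <= 0.75:
        reps = 2            # moderate: 2x
    else:
        reps = 3            # complex: 3x
    if not critical:
        # Cost gates: back off for very long inputs.
        if reps == 3 and input_tokens > 5_000:
            reps = 2
        if reps >= 2 and input_tokens > 10_000:
            reps = 1
    return reps
```

For example, a complex 6,000-token prompt is gated down from 3x to 2x, while a critical-accuracy task of the same length keeps 3x.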
Future Research Directions
Paper identifies 13 future directions:
- Fine-tuning with repeated prompts: Train models expecting repetition
- Reasoning + repetition: Optimize for reasoning models
- Dynamic repetition: Repeat during generation, not just prefill
- KV-cache optimization: Keep only 2nd repetition in cache
- Partial repetition: Repeat only critical sections
- Prompt reordering: Reorder instead of repeat
- Multi-modal: Apply to images/video
- More variants: Analyze 4x, 5x, etc.
- Attention analysis: Deep dive on attention patterns
- Selective attention: Combine with attention optimization
- Prefix LM: Interaction with prefix-based models
- Token representations: How representations change across repetitions
- Promising variants: Explore alternatives from appendix
Practical Implications
For Practitioners
Key Takeaways:
- Prompt repetition is universally beneficial for non-reasoning tasks
- Implementation is trivial (just repeat the prompt)
- No output format changes (backward compatible)
- Latency impact is minimal for most use cases
- Effective cost increase (~40-50% for typical input/output ratios) is typically offset by the error reduction (estimated 300-500x ROI)
Action Items:
- Implement adaptive repetition in production
- A/B test to validate improvements
- Monitor latency for very long prompts
- Collect accuracy metrics pre/post
- Consider 3x for critical accuracy tasks
For Researchers
Open Questions:
- Why does 3x sometimes substantially outperform 2x?
- Can we predict optimal repetition count from prompt analysis?
- How do representations evolve across repetitions?
- Can we compress repetition (e.g., only repeat key tokens)?
- Does this work for other modalities?
Research Opportunities:
- Theoretical analysis of attention patterns
- Optimal repetition prediction models
- Compression techniques
- Multi-modal extensions
- Fine-tuning approaches
Limitations
Acknowledged Limitations
- Cost increase: Input tokens double or triple
- Latency for long prompts: Prefill stage takes longer
- Model-specific variations: Some models benefit more than others
- Task-specific gains: Not all tasks benefit equally
- Limited reasoning benefit: Minimal gains when reasoning enabled
Experimental Limitations
- Benchmark selection: May not represent all real-world tasks
- Timing: Testing in Feb/Mar 2025 (models evolve)
- API access only: No access to internal model details
- English-only: Not tested on multilingual tasks
- Single-turn: Not evaluated on multi-turn conversations
Practical Limitations
- Token limits: Very long prompts may hit context window limits
- Cost sensitivity: Some applications can't afford 2-3x input cost
- Real-time requirements: Prefill latency may matter for some use cases
- Model updates: Future models may not benefit equally
- Task diversity: Real-world tasks may differ from benchmarks
Citation
@article{leviathan2025prompt,
  title={Prompt Repetition Improves Non-Reasoning LLMs},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025},
  institution={Google Research}
}
Additional Resources
Paper Access: [arXiv link when available]
Code Repository: [If released]
Blog Post: [Google Research blog]
Related Papers:
- Wei et al. 2022 (Chain of Thought)
- Kojima et al. 2022 (Think Step by Step)
- Springer et al. 2024 (Repetition for Embeddings)
- Xu et al. 2024 (Re-reading Improves Reasoning)
Summary Prepared By: Technical Research Team
Date: January 2026
Purpose: Internal knowledge base and decision support
Classification: Public (based on published research)