Research Paper Summary: Prompt Repetition Improves Non-Reasoning LLMs

Google Research - February 2025

Paper Metadata

| Field | Value |
| --- | --- |
| Title | Prompt Repetition Improves Non-Reasoning LLMs |
| Authors | Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research) |
| Published | February/March 2025 |
| Institution | Google Research |
| Paper Type | Empirical Study |
| Models Tested | 7 (Gemini, GPT-4, Claude, Deepseek) |
| Benchmarks Used | 7 (ARC, OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch) |

Core Thesis

Repeating the input prompt (transforming <QUERY> into <QUERY><QUERY>) improves LLM performance across all major models without increasing output token count, and with minimal latency impact, when reasoning is disabled.
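
The transformation itself is trivial; a minimal sketch (the helper name is ours, not the paper's — the paper's simple variant concatenates the copies directly):

```python
def repeat_prompt(query: str, times: int = 2, separator: str = "") -> str:
    """Return the prompt repeated `times` times, i.e. <QUERY> -> <QUERY><QUERY>."""
    return separator.join([query] * times)

# The repeated prompt simply replaces the original in the API call;
# output format, output length, and decoding settings are unchanged.
doubled = repeat_prompt("List the planets in order from the sun.")
```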

Fundamental Problem

Issue: Causal language models process tokens left-to-right only (past tokens cannot attend to future tokens). This means:

  • Token order affects prediction performance
  • <CONTEXT> <QUESTION> performs differently than <QUESTION> <CONTEXT>
  • Early tokens can't "see" later context during initial processing

Solution: Repetition enables each prompt token to attend to every other prompt token bidirectionally during the prefill stage.

Experimental Design

Models Tested (N=7)

| Provider | Models | Size Class |
| --- | --- | --- |
| Google | Gemini 2.0 Flash, Gemini 2.0 Flash Lite | Small-Medium |
| OpenAI | GPT-4o-mini, GPT-4o | Small-Large |
| Anthropic | Claude 3 Haiku, Claude 3.7 Sonnet | Small-Medium |
| Deepseek | Deepseek V3 | Large |

Benchmarks (N=7)

| Benchmark | Type | Questions | Focus Area |
| --- | --- | --- | --- |
| ARC (Challenge) | Multiple choice | 1,172 | Science reasoning |
| OpenBookQA | Multiple choice | 500 | Elementary science |
| GSM8K | Math problems | 1,319 | Grade school math |
| MMLU-Pro | Multiple choice | 12,032 | Multi-domain knowledge |
| MATH | Math problems | 5,000 | Competition math |
| NameIndex | Custom | 300 | List processing (50 items) |
| MiddleMatch | Custom | 300 | Order detection (40 items) |

Configurations Tested

For multiple choice benchmarks:

  1. Question-first: Question → Options (normal format)
  2. Options-first: Options → Question (tests attention limitation)

Methodology

Control Variables:

  • Same prompts across all models
  • Official API access (no special access)
  • Testing period: February-March 2025
  • No reasoning enabled (suppress chain-of-thought)

Statistical Testing:

  • McNemar test for paired comparisons
  • Significance threshold: p < 0.1
  • Classification: Win/Loss/Neutral

Results Summary

Overall Performance

Accuracy Improvements:

  • 47 wins out of 70 benchmark-model combinations
  • 0 losses (no degradation observed)
  • 23 neutral (no statistically significant change)

Statistical Confidence: All wins significant at p < 0.1

Performance by Task Type

| Task Type | Improvement Range | Example |
| --- | --- | --- |
| Options-first | 15-40% | Multiple choice with context after question |
| List processing | 30-70% | NameIndex: 21% → 97% (~356% relative improvement) |
| Order detection | 25-60% | MiddleMatch: find element between X and Y |
| Question-first | 5-15% | Standard multiple choice format |
| Math problems | 8-20% | GSM8K, MATH |

Custom Benchmark Results

NameIndex Task (Find Nth element in list of 50):

  • Baseline: 21.33%
  • With repetition: 97.33%
  • Improvement: +76 percentage points (~356% relative)

MiddleMatch Task (Find element between two others in list of 40):

  • Baseline: ~30-45% (varies by model)
  • With repetition: ~80-95%
  • Improvement: +35-50 percentage points
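
The custom tasks are easy to reproduce in spirit; below is our reconstruction of a NameIndex-style query (the paper's exact wording and item vocabulary are not given in this summary, so the format and placeholder names are assumptions):

```python
import random

def make_nameindex_prompt(n_items: int = 50, seed: int = 0) -> tuple[str, str]:
    """Build a NameIndex-style query: list n_items names, ask for the Nth one.
    Returns (prompt, expected_answer)."""
    rng = random.Random(seed)  # seeded for reproducible benchmark items
    names = [f"name{i}" for i in range(n_items)]
    rng.shuffle(names)
    idx = rng.randrange(n_items)
    prompt = (f"Here is a list of names: {', '.join(names)}. "
              f"What is name number {idx + 1} in the list? Answer with the name only.")
    return prompt, names[idx]
```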

Model-Specific Performance

All tested models showed improvement:

| Model | Wins | Losses | Neutral |
| --- | --- | --- | --- |
| Gemini 2.0 Flash | 7 | 0 | 3 |
| Gemini 2.0 Flash Lite | 8 | 0 | 2 |
| GPT-4o | 6 | 0 | 4 |
| GPT-4o-mini | 7 | 0 | 3 |
| Claude 3 Haiku | 6 | 0 | 4 |
| Claude 3.7 Sonnet | 7 | 0 | 3 |
| Deepseek V3 | 6 | 0 | 4 |

Conclusion: Universal improvement across all providers and model sizes.

Variants Tested

Method Comparison

| Method | Template | Performance | Cost |
| --- | --- | --- | --- |
| Baseline | <QUERY> | Reference | 1x |
| Simple Repetition | <QUERY><QUERY> | +10-25pp | 2x |
| Verbose Repetition | <QUERY> Let me repeat that: <QUERY> | +10-30pp | ~2.1x |
| Triple Repetition | <QUERY> Let me repeat that: <QUERY> Let me repeat that one more time: <QUERY> | +15-40pp | 3x |
| Padding (control) | <QUERY> ... (periods to match length) | No change | 2x |

Key Finding: Padding control confirms gains come from repetition, not just increased input length.
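
The variants above can be expressed as simple templates (a sketch; the connector phrasings come from the variant descriptions, and the length-matched padding control is omitted since its padding rule is not fully specified here):

```python
# Repetition templates; {q} is the original query.
TEMPLATES = {
    "baseline": "{q}",
    "simple_2x": "{q}{q}",
    "verbose_2x": "{q} Let me repeat that: {q}",
    "verbose_3x": "{q} Let me repeat that: {q} Let me repeat that one more time: {q}",
}

def apply_variant(query: str, variant: str) -> str:
    """Expand one of the repetition templates around the query."""
    return TEMPLATES[variant].format(q=query)
```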

When Triple Repetition Excels

NameIndex and MiddleMatch showed substantial gains with 3x repetition:

  • Simple tasks: 2x sufficient
  • Complex list/ordering tasks: 3x provides significant additional benefit
  • Mathematical reasoning: 2x optimal (3x provides minimal additional gain)

Efficiency Analysis

Token Usage

| Metric | Impact |
| --- | --- |
| Input tokens | +100% (2x) or +200% (3x) |
| Output tokens | No change (0%) |
| Total tokens | Depends on input/output ratio |

Example Calculation:

Baseline: 500 input + 100 output = 600 total
2x repetition: 1000 input + 100 output = 1100 total
Increase: 83% total tokens (but only 50% if output-heavy)
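
The same accounting as a tiny helper (illustrative only):

```python
def total_tokens(input_toks: int, output_toks: int, repeats: int = 2) -> int:
    """Total tokens after repeating the input `repeats` times; output is unchanged."""
    return input_toks * repeats + output_toks

base = total_tokens(500, 100, repeats=1)      # 600
doubled = total_tokens(500, 100, repeats=2)   # 1100
increase = doubled / base - 1                 # ~0.83, i.e. +83% total tokens
```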

Latency Impact

Measured Latency (across all models and datasets):

  • Prefill stage: Increases proportionally with repetition
  • Generation stage: No change (same output length)
  • End-to-end: Minimal impact for most requests

Exceptions:

  • Anthropic models (Claude Haiku/Sonnet) show latency increase for very long inputs (>5,000 tokens)
  • Likely due to prefill stage dominating total time
  • Mitigation: Apply cost gates for very long prompts

Typical Latency Profile:

Baseline: 500ms total (100ms prefill + 400ms generation)
2x repetition: 550ms total (150ms prefill + 400ms generation)
Impact: +10% total latency

Cost-Benefit Analysis

Input cost increase: 100% (for 2x repetition)
Output cost: No change
Typical input/output token ratio: 80/20
Effective cost increase: ~40-50% (output tokens are typically priced several times higher than input tokens, so input is only roughly half the bill)

But: Error reduction of 50-75% typically saves 300-500x the additional compute cost.
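
The effective increase depends on how output tokens are priced relative to input tokens; a sketch, assuming a ~4x output-to-input price ratio (our assumption, common among providers but not stated in the paper):

```python
def effective_cost_increase(in_toks: int, out_toks: int,
                            out_price_ratio: float = 4.0,
                            repeats: int = 2) -> float:
    """Fractional bill increase from repeating the input `repeats` times.
    out_price_ratio: output-token price relative to input-token price
    (assumed ~4x here; adjust to the actual provider pricing)."""
    base = in_toks + out_toks * out_price_ratio
    new = in_toks * repeats + out_toks * out_price_ratio
    return new / base - 1

# An 80/20 token split with 4x output pricing makes input half the bill,
# so 2x repetition raises total cost by ~50%.
effective_cost_increase(800, 200)
```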

Reasoning vs Non-Reasoning

Non-Reasoning Mode (Paper Focus)

Configuration: Suppress chain-of-thought, direct answers only
Results: 47 wins, 0 losses
Optimal use case: Classification, extraction, simple Q&A

Reasoning Mode (Appendix)

Configuration: "Think step by step" instruction
Results: 5 wins, 1 loss, 22 neutral
Observation: Reasoning models already learn to repeat parts of the prompt internally, so explicit repetition provides minimal additional benefit

Conclusion: Prompt repetition most valuable when reasoning is disabled.

Theoretical Explanation

Attention Pattern Analysis

Causal Attention Limitation:

Token 1 can attend to: [Token 1]
Token 2 can attend to: [Token 1, Token 2]
Token 3 can attend to: [Token 1, Token 2, Token 3]
...
Token N can attend to: [Token 1, Token 2, ..., Token N]

Problem: Early tokens lack context from later tokens.

With Repetition:

First occurrence of Token 1 can attend to: [Token 1]
Second occurrence of Token 1 can attend to: [All tokens from first pass]

Effect: Bidirectional-like attention during prefill stage.
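
The effect can be checked positionally: under a causal mask, position j attends to position k iff k ≤ j, so every token in the second copy can attend to the whole first copy (a toy illustration):

```python
def causal_attends(query_pos: int, key_pos: int) -> bool:
    """Causal attention: a token may attend only to itself and earlier positions."""
    return key_pos <= query_pos

n = 4                # prompt length; the repeated prompt occupies positions 0 .. 2n-1
seq_len = 2 * n

# First occurrence of the first prompt token sees only itself...
first_view = [k for k in range(seq_len) if causal_attends(0, k)]
# ...but its second occurrence (position n) sees the entire first copy.
second_view = [k for k in range(seq_len) if causal_attends(n, k)]
```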

Why This Matters

Options-First Format:

A. Red
B. Blue
C. Green
What color is the sky?

Without Repetition: Model processes options before seeing the question.
With Repetition: Model sees full question-option context together in second pass.

Prior Research

  1. Chain of Thought (CoT) - Wei et al., 2022

    • Requires task-specific examples
    • Increases output tokens substantially
    • Complementary to prompt repetition
  2. "Think Step by Step" - Kojima et al., 2023

    • Zero-shot reasoning
    • Increases output tokens and latency
    • Can be used with repetition (mostly neutral)
  3. Question Repetition - Shaier, 2024

    • Tested repeating only question part
    • Found no gains
    • Confirms full prompt repetition is necessary
  4. Repetition for Embeddings - Springer et al., 2024

    • Showed 2x repetition improves text embeddings
    • Independent finding, similar mechanism
  5. Re-reading for Reasoning - Xu et al., 2024

    • Asking models to "re-read" improves reasoning
    • Similar concept, different implementation

Novel Contributions

This paper's unique contributions:

  1. First systematic study of full prompt repetition across models
  2. Production-ready approach (no format changes, drop-in)
  3. Multiple variants tested (simple, verbose, 3x)
  4. Efficiency analysis (latency, token costs)
  5. Universal validation (all major model providers)

Implementation Recommendations

When to Use Prompt Repetition

High Value:

  • ✓ Classification tasks
  • ✓ List processing (find Nth element)
  • ✓ Ordering/sequencing (what's between X and Y)
  • ✓ Options-first multiple choice
  • ✓ Dependency extraction
  • ✓ Context scattered throughout long documents

Moderate Value:

  • ✓ Question-first multiple choice
  • ✓ Mathematical problem solving
  • ✓ Standard Q&A

Low Value:

  • ✗ Simple factual questions
  • ✗ Reasoning-heavy tasks (already using CoT)
  • ✗ Creative writing
  • ✗ Very short prompts (<50 tokens)

Variant Selection

Use 2x (Simple or Verbose):

  • Standard classification
  • Most Q&A tasks
  • Moderate complexity
  • Cost-sensitive applications

Use 3x:

  • Complex list processing
  • Dependency extraction
  • High accuracy requirements
  • Cost less important than precision

Use Adaptive:

  • Detect task complexity automatically
  • Apply appropriate repetition count
  • Balance cost and performance

Cost Optimization

Cost Gates:

  • Don't repeat if input > 10,000 tokens (2x)
  • Don't repeat if input > 5,000 tokens (3x)
  • Exception: Critical accuracy tasks

Complexity Thresholds:

  • Simple tasks (complexity < 0.25): No repetition
  • Moderate (0.25-0.75): 2x repetition
  • Complex (> 0.75): 3x repetition
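
Combining the cost gates and complexity thresholds above into one selector (a sketch; the complexity score itself is application-defined and not specified by the paper):

```python
def choose_repeats(complexity: float, input_tokens: int) -> int:
    """Pick a repetition count from a [0, 1] complexity score, then apply
    the token-count cost gates. Returns 1 (no repetition), 2, or 3."""
    if complexity < 0.25:
        return 1                      # simple tasks: no repetition
    repeats = 3 if complexity > 0.75 else 2
    # Cost gates: no 3x above 5,000 input tokens, no 2x above 10,000.
    if repeats == 3 and input_tokens > 5_000:
        repeats = 2
    if repeats == 2 and input_tokens > 10_000:
        repeats = 1
    return repeats
```

Critical-accuracy tasks can bypass the gates, per the exception noted above.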

Future Research Directions

Paper identifies 13 future directions:

  1. Fine-tuning with repeated prompts: Train models expecting repetition
  2. Reasoning + repetition: Optimize for reasoning models
  3. Dynamic repetition: Repeat during generation, not just prefill
  4. KV-cache optimization: Keep only 2nd repetition in cache
  5. Partial repetition: Repeat only critical sections
  6. Prompt reordering: Reorder instead of repeat
  7. Multi-modal: Apply to images/video
  8. More variants: Analyze 4x, 5x, etc.
  9. Attention analysis: Deep dive on attention patterns
  10. Selective attention: Combine with attention optimization
  11. Prefix LM: Interaction with prefix-based models
  12. Token representations: How representations change across repetitions
  13. Promising variants: Explore alternatives from appendix

Practical Implications

For Practitioners

Key Takeaways:

  1. Prompt repetition is universally beneficial for non-reasoning tasks
  2. Implementation is trivial (just repeat the prompt)
  3. No output format changes (backward compatible)
  4. Latency impact is minimal for most use cases
  5. Cost increase (~40-50% effective) is typically offset by error reduction (300-500x ROI)

Action Items:

  1. Implement adaptive repetition in production
  2. A/B test to validate improvements
  3. Monitor latency for very long prompts
  4. Collect accuracy metrics pre/post
  5. Consider 3x for critical accuracy tasks

For Researchers

Open Questions:

  1. Why does 3x sometimes substantially outperform 2x?
  2. Can we predict optimal repetition count from prompt analysis?
  3. How do representations evolve across repetitions?
  4. Can we compress repetition (e.g., only repeat key tokens)?
  5. Does this work for other modalities?

Research Opportunities:

  1. Theoretical analysis of attention patterns
  2. Optimal repetition prediction models
  3. Compression techniques
  4. Multi-modal extensions
  5. Fine-tuning approaches

Limitations

Acknowledged Limitations

  1. Cost increase: Input tokens double or triple
  2. Latency for long prompts: Prefill stage takes longer
  3. Model-specific variations: Some models benefit more than others
  4. Task-specific gains: Not all tasks benefit equally
  5. Limited reasoning benefit: Minimal gains when reasoning enabled

Experimental Limitations

  1. Benchmark selection: May not represent all real-world tasks
  2. Timing: Testing in Feb/Mar 2025 (models evolve)
  3. API access only: No access to internal model details
  4. English-only: Not tested on multilingual tasks
  5. Single-turn: Not evaluated on multi-turn conversations

Practical Limitations

  1. Token limits: Very long prompts may hit context window limits
  2. Cost sensitivity: Some applications can't afford 2-3x input cost
  3. Real-time requirements: Prefill latency may matter for some use cases
  4. Model updates: Future models may not benefit equally
  5. Task diversity: Real-world tasks may differ from benchmarks

Citation

@article{leviathan2025prompt,
  title={Prompt Repetition Improves Non-Reasoning LLMs},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025},
  institution={Google Research}
}

Additional Resources

Paper Access: [arXiv link when available]
Code Repository: [If released]
Blog Post: [Google Research blog]
Related Papers:

  • Wei et al. 2022 (Chain of Thought)
  • Kojima et al. 2022 (Think Step by Step)
  • Springer et al. 2024 (Repetition for Embeddings)
  • Xu et al. 2024 (Re-reading Improves Reasoning)

Summary Prepared By: Technical Research Team
Date: January 2026
Purpose: Internal knowledge base and decision support
Classification: Public (based on published research)