Context Optimization Techniques
Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. The goal is not to magically enlarge the window but to make better use of available capacity: done well, optimization can double or triple effective capacity without requiring larger models or longer contexts.
When to Use
✅ Use this skill when:
- Context limits constrain task complexity
- Optimizing for cost reduction (fewer tokens = lower costs)
- Reducing latency for long conversations
- Implementing long-running agent systems
- Needing to handle larger documents or conversations
- Building production systems at scale
❌ Don't use this skill when:
- Short conversations with ample context budget
- Latency-insensitive applications where optimization overhead isn't worth it
- Single-turn interactions
Core Concepts
Context optimization extends effective capacity through four primary strategies:
| Strategy | Mechanism | Token Savings | Quality Impact |
|---|---|---|---|
| Compaction | Summarize context near limits | 50-70% | Low if done well |
| Observation Masking | Replace verbose outputs with references | 60-80% | Very low |
| KV-Cache Optimization | Reuse cached computations | Cost savings | None |
| Context Partitioning | Split work across isolated contexts | Variable | Enables isolation |
Compaction Strategies
Compaction summarizes context contents when approaching limits, then reinitializes with the summary.
What to Compress (Priority Order)
- Tool outputs: Replace with summaries (highest impact)
- Old turns: Summarize early conversation
- Retrieved docs: Summarize if recent versions exist
- Never compress: System prompt
Summary Generation by Type
| Message Type | Preserve | Remove |
|---|---|---|
| Tool outputs | Key findings, metrics, conclusions | Verbose raw output |
| Conversational | Decisions, commitments, context shifts | Filler, back-and-forth |
| Documents | Key facts and claims | Supporting evidence, elaboration |
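The priorities above can be sketched in a few lines. This is a minimal, illustrative implementation: `summarize` stands in for a real LLM summarization call (here it is naive truncation), and the message dictionary shape is an assumption, not a required schema.

```python
# Sketch of priority-ordered compaction. `summarize` is a hypothetical
# stand-in for an LLM summarization call (naive truncation here).
def summarize(text: str, limit: int = 200) -> str:
    return text if len(text) <= limit else text[:limit] + " …[summarized]"

def compact(messages: list[dict], keep_recent: int = 3) -> list[dict]:
    """Compress tool outputs and old turns; never touch the system prompt."""
    compacted = []
    cutoff = len(messages) - keep_recent
    for i, msg in enumerate(messages):
        if msg["role"] == "system":
            compacted.append(msg)                     # never compress
        elif msg["role"] == "tool" or i < cutoff:     # highest-impact targets
            compacted.append({**msg, "content": summarize(msg["content"])})
        else:
            compacted.append(msg)                     # keep recent turns verbatim
    return compacted
```

In practice the summarizer would follow the preserve/remove rules in the table above, keeping findings and decisions while dropping verbose raw output.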
Observation Masking
Tool outputs can comprise 80%+ of token usage in agent trajectories. Once an agent has used a tool output to make a decision, keeping the full output provides diminishing value.
Masking Strategy Selection
| Condition | Action |
|---|---|
| Critical to current task | Never mask |
| Most recent turn | Never mask |
| Used in active reasoning | Never mask |
| 3+ turns ago | Consider masking |
| Purpose has been served | Consider masking |
| Already summarized | Always mask |
| Boilerplate headers/footers | Always mask |
Implementation Pattern
```python
def mask_observation(observation: str, max_length: int = 500) -> str:
    if len(observation) <= max_length:
        return observation
    # Store the full observation so it remains retrievable by reference.
    # store_observation and extract_key_points are application-provided.
    ref_id = store_observation(observation)
    key_info = extract_key_points(observation)
    return f"[Obs:{ref_id} elided. Key: {key_info}]"
```
KV-Cache Optimization
The KV-cache stores Key and Value tensors computed during inference. Caching across requests with identical prefixes avoids recomputation.
Cache-Friendly Context Ordering
```python
# Optimal ordering for cache hits
context = []

# 1. Stable content first (highly cacheable)
context.extend([system_prompt, tool_definitions])

# 2. Frequently reused elements
context.extend([reused_templates, common_instructions])

# 3. Unique content last (not cached)
context.extend([unique_query, dynamic_content])
```
Cache Stability Design
- Avoid dynamic content like timestamps in stable sections
- Use consistent formatting across sessions
- Keep structure stable, vary only what must vary
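As a concrete example of these rules, the sketch below keeps volatile data (a timestamp) out of the stable prefix so that repeated requests share a byte-identical, cache-hit-eligible prefix. The prompt content itself is illustrative.

```python
from datetime import datetime, timezone

# Stable prefix: byte-identical across requests, so it stays cacheable.
STABLE_PREFIX = "\n".join([
    "You are a research assistant.",      # system prompt: never varies
    "Tools: search(query), fetch(url)",   # tool definitions: never vary
])

def build_prompt(user_query: str) -> str:
    # Volatile data (timestamp) goes in the dynamic suffix, after the prefix.
    now = datetime.now(timezone.utc).isoformat()
    return f"{STABLE_PREFIX}\n[time: {now}]\n{user_query}"
```

Placing the timestamp before the system prompt would invalidate the cached prefix on every request.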
Context Partitioning
The most aggressive optimization: partition work across sub-agents with isolated contexts.
When to Partition
- Single context cannot contain all required information
- Subtasks are logically independent
- Context degradation is already occurring
- Parallel execution would speed completion
Partitioning Pattern
```
┌─────────────────────────────────────────────┐
│              Coordinator Agent              │
│      (Minimal context: task + routing)      │
└─────────────────────┬───────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   ┌────────┐    ┌────────┐    ┌────────┐
   │ Sub-1  │    │ Sub-2  │    │ Sub-3  │
   │ (Clean │    │ (Clean │    │ (Clean │
   │context)│    │context)│    │context)│
   └────────┘    └────────┘    └────────┘
```
Each sub-agent operates in a clean context focused on its subtask.
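A minimal sketch of this pattern with `asyncio`. `run_subagent` is a hypothetical placeholder for a real agent invocation; the key property is that each sub-agent sees only its own subtask, and the coordinator receives only compact summaries back.

```python
import asyncio

# Sketch of context partitioning: each sub-agent works in an isolated
# context, and only summaries flow back to the coordinator.
async def run_subagent(subtask: str) -> str:
    await asyncio.sleep(0)                    # placeholder for real agent work
    return f"summary of: {subtask}"

async def coordinate(task: str, subtasks: list[str]) -> str:
    # Sub-agents run in parallel, each with a clean context.
    summaries = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Coordinator context holds only the task plus compact summaries.
    return f"{task}: " + "; ".join(summaries)
```

The coordinator's context grows only with the summaries, not with the sub-agents' full working contexts.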
Budget Management
Context Budget Allocation
```yaml
context_budget:
  system_prompt: 2000      # Fixed, never compress
  tool_definitions: 1500   # Fixed
  retrieved_docs: 3000     # Compressible
  message_history: 5000    # Compressible
  reserved_buffer: 2000    # Safety margin
  total_limit: 16000
```
Trigger-Based Optimization
```python
def should_optimize(context_usage: int, budget: int) -> bool:
    utilization = context_usage / budget
    if utilization > 0.8:
        return True  # Hard trigger
    if utilization > 0.7 and quality_degradation_detected():
        return True  # Soft trigger with quality check
    return False
```
Performance Targets
| Technique | Target Savings | Quality Threshold |
|---|---|---|
| Compaction | 50-70% reduction | <5% quality degradation |
| Masking | 60-80% on masked obs | Minimal impact |
| Cache | 70%+ hit rate | No impact |
| Partitioning | Variable | Enables capability |
Optimization Decision Framework
```
Context utilization > 70%?
├── Yes → Check what dominates context
│   ├── Tool outputs dominate → Apply observation masking
│   ├── Retrieved docs dominate → Summarize or partition
│   ├── Message history dominates → Compaction with summarization
│   └── Multiple components → Combine strategies
│
└── No → Continue, but monitor
```
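The decision tree above can be encoded directly. Component names are illustrative, and the multi-component branch is simplified to a fallback:

```python
# Pick a strategy based on which component dominates the context.
def select_strategy(usage: dict[str, int], budget: int) -> str:
    total = sum(usage.values())
    if total / budget <= 0.7:
        return "monitor"                  # under threshold: continue, but monitor
    dominant = max(usage, key=usage.get)
    # Map the dominant component to a strategy; fall back to combining.
    return {
        "tool_outputs": "observation_masking",
        "retrieved_docs": "summarize_or_partition",
        "message_history": "compaction",
    }.get(dominant, "combined")
```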
Example: Combined Optimization
```python
async def optimize_context(context: Context) -> Context:
    # Step 1: Mask old observations
    context.mask_old_observations(turns_threshold=3)

    # Step 2: Check if still over budget
    if context.utilization > 0.8:
        # Step 3: Compact message history
        context.compact_history(preserve_recent=5)

    # Step 4: If still over, partition to sub-agents
    if context.utilization > 0.9:
        return await partition_to_subagents(context)

    return context
```
Guidelines
- Measure before optimizing—know your current state
- Apply compaction before masking when possible
- Design for cache stability with consistent prompts
- Partition before context becomes problematic
- Monitor optimization effectiveness over time
- Balance token savings against quality preservation
- Test optimization at production scale
- Implement graceful degradation for edge cases
Related Components
Skills
- context-fundamentals - Context basics (prerequisite)
- context-degradation - Understanding when to optimize
- context-compression - Detailed compression strategies
Agents
- context-health-analyst - Monitor optimization needs
- multi-agent-coordinator - Partition coordination
Scripts
- external/Agent-Skills-for-Context-Engineering/skills/context-optimization/scripts/compaction.py - Observation store and budget management
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: context-optimization
Completed:
- [x] Context utilization measured (before: X%, after: Y%)
- [x] Optimization strategy selected and applied
- [x] Token savings achieved (Z% reduction)
- [x] Quality validation passed (<5% degradation)
- [x] Performance metrics recorded
Optimization Applied:
- Strategy: [Compaction/Masking/Partitioning/Combined]
- Tokens before: X | Tokens after: Y | Savings: Z%
- Quality impact: <5% degradation
- Cache hit rate: N% (if KV-cache optimization)
Outputs:
- Optimized context with reduced token count
- Quality metrics report
- Performance benchmarks
Completion Checklist
Before marking this skill as complete, verify:
- Baseline context utilization measured before optimization
- Optimization strategy selected based on what dominates context
- Compaction applied to message history (if needed)
- Observation masking applied to old tool outputs (if needed)
- KV-cache stability design implemented (if applicable)
- Context partitioning to sub-agents (if needed)
- Token reduction achieved (target: 50-70%)
- Quality degradation measured (<5% threshold)
- Performance benchmarks recorded (latency, throughput)
- Budget allocation documented
- Trigger-based optimization rules configured
- Monitoring in place for ongoing optimization
Failure Indicators
This skill has FAILED if:
- ❌ Context utilization still >80% after optimization
- ❌ Quality degradation >5% (unacceptable quality loss)
- ❌ No token savings achieved (ineffective optimization)
- ❌ System prompt accidentally compressed (critical data lost)
- ❌ Recent tool outputs masked (breaks current reasoning)
- ❌ Cache hit rate <50% (poor cache design)
- ❌ Partitioning overhead exceeds savings (wrong strategy)
- ❌ Agent cannot complete task after optimization
- ❌ Optimization triggers too frequently (thrashing)
- ❌ No measurement before/after (cannot prove effectiveness)
When NOT to Use
Do NOT use context-optimization when:
- Plenty of context available - Utilization <70%, no optimization needed
- Single-turn interactions - No context accumulation to optimize
- Short conversations - Optimization overhead not worth it
- Latency-critical applications - Optimization adds processing time
- Prototyping phase - Premature optimization
- Context already minimal - Nothing left to compress
- Quality cannot be compromised - Risk of degradation too high
- Skip optimization when: context utilization <70% and no quality issues
- Use a larger model when: more context is needed and quality is paramount
- Use this skill when: context limits constrain capability or cost reduction is needed
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Compress system prompt | Critical instructions lost, agent fails | Never compress system prompt |
| Mask recent outputs | Breaks current reasoning chain | Only mask outputs 3+ turns old |
| No quality measurement | Degradation goes unnoticed | Measure quality before and after |
| Optimize prematurely | Overhead without benefit | Wait until utilization >70% |
| Fixed compression ratio | Over/under compression | Adaptive compression based on content type |
| No baseline metrics | Cannot prove effectiveness | Measure tokens and quality before optimizing |
| Aggressive partitioning | Coordination overhead exceeds savings | Only partition when single context insufficient |
| Ignore cache stability | Poor cache hit rates | Design for cache reuse (stable prefixes) |
| No monitoring | Optimization degrades over time | Monitor utilization and quality continuously |
| One-size-fits-all | Wrong strategy for content type | Select strategy based on what dominates context |
Principles
This skill embodies the following CODITECT principles:
#2 First Principles Thinking:
- Understand WHY context limits exist: computational cost, attention degradation
- Apply optimization only when benefits exceed costs
#4 Measure and Verify:
- Measure baseline before optimizing
- Verify quality impact <5% threshold
- Benchmark performance improvements
#5 Eliminate Ambiguity:
- Clear optimization targets (50-70% reduction, <5% quality loss)
- Explicit strategy selection criteria
#8 No Assumptions:
- Never assume optimization worked (measure it)
- Verify cache hit rates, don't assume high reuse
- Test quality on representative examples
#10 Automation First:
- Trigger-based optimization (not manual)
- Automated quality validation
- Dynamic strategy selection based on context composition
Full Standard: CODITECT-STANDARD-AUTOMATION.md