Agentic AI Troubleshooting Guide
Diagnostic Decision Trees for Common Issues
Document ID: F3-TROUBLESHOOTING
Version: 1.0
Category: Operations
Quick Diagnostic Flow
START: What type of issue?
│
├─► Output Quality Issues ──► Section 1
├─► Performance Issues ──────► Section 2
├─► Tool/Integration Issues ─► Section 3
├─► Cost/Token Issues ───────► Section 4
└─► Coordination Issues ─────► Section 5
Section 1: Output Quality Issues
1.1 Hallucination / Inaccurate Information
Decision Tree:
Is the system using GS paradigm with retrieval?
├─► NO: Consider switching to GS for factual accuracy
└─► YES: Check retrieval quality
│
Are retrieved documents relevant?
├─► NO: Improve retrieval (see 1.1a)
└─► YES: Check citation usage
│
Is agent citing sources correctly?
├─► NO: Strengthen citation prompts (see 1.1b)
└─► YES: Check source quality
│
Are sources accurate/current?
├─► NO: Update knowledge base
└─► YES: Escalate to advanced diagnosis
1.1a: Improve Retrieval
# Solution: Add reranking
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.content) for doc in results])
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
1.1b: Strengthen Citation Prompts
Add to system prompt:
"CRITICAL: You MUST cite a source for every factual claim.
Format: [Source N] immediately after the claim.
If you cannot cite a source, state 'Unable to verify' instead of the claim."
1.2 Inconsistent Outputs
Decision Tree:
Same input producing different outputs?
├─► Check temperature setting
│ ├─► Temperature > 0.5: Lower to 0.3-0.5
│ └─► Temperature < 0.5: Check other factors
│
├─► Using caching?
│ ├─► NO: Consider adding for consistency
│ └─► YES: Check cache invalidation
│
└─► Check for non-deterministic tools
└─► Web search results vary: Add result pinning
Solution: Output Validation
def validate_output(output, schema):
try:
validated = schema.parse(output)
return validated, True
except ValidationError as e:
# Retry with more explicit instructions
retry_prompt = f"Previous output was invalid: {e}. Please provide output matching: {schema}"
return retry(retry_prompt), False
1.3 Missing Required Information
Checklist:
- Is the required information in the input?
- Is the required information in retrieved context?
- Is the output schema clearly specified?
- Are there examples of expected output?
Solution: Explicit Output Requirements
Your response MUST include ALL of the following:
1. [Required field 1] - [description]
2. [Required field 2] - [description]
3. [Required field 3] - [description]
If any information is unavailable, explicitly state "Not available" for that field.
Do not omit any required fields.
Section 2: Performance Issues
2.1 High Latency
Diagnostic Questions:
- Where is time spent? (LLM, retrieval, tools, network)
- What's the token count per request?
- Are there sequential dependencies that could parallelize?
Quick Fixes:
| Symptom | Likely Cause | Solution |
|---|---|---|
| First token slow | Cold start | Implement keep-alive |
| Retrieval slow | Large corpus | Add caching, optimize index |
| Many tool calls | Sequential execution | Parallelize where possible |
| Long context | Token overhead | Summarize, use sliding window |
Latency Optimization Code:
import asyncio
async def parallel_retrieval(queries):
"""Execute multiple retrievals in parallel"""
tasks = [retrieve(q) for q in queries]
results = await asyncio.gather(*tasks)
return results
async def optimized_agent(task):
# Parallel: decompose + initial retrieval
decompose_task = decompose(task)
initial_retrieval_task = retrieve_context(task)
subtasks, context = await asyncio.gather(
decompose_task,
initial_retrieval_task
)
# Continue with results
...
2.2 Timeout Errors
Decision Tree:
What component is timing out?
├─► LLM API
│ ├─► Check rate limits
│ ├─► Reduce context size
│ └─► Implement streaming
│
├─► External Tools
│ ├─► Add timeout with fallback
│ ├─► Implement circuit breaker
│ └─► Check tool health
│
└─► Orchestrator
├─► Subagent stuck: Add heartbeat
├─► Coordination deadlock: Add timeout
└─► Resource exhaustion: Scale resources
Timeout Handling Pattern:
async def call_with_timeout(func, timeout=30, fallback=None):
try:
return await asyncio.wait_for(func(), timeout=timeout)
except asyncio.TimeoutError:
logger.warning(f"{func.__name__} timed out after {timeout}s")
if fallback:
return await fallback()
raise
Section 3: Tool/Integration Issues
3.1 Tool Call Failures
Common Causes:
- Invalid parameters (hallucinated)
- Rate limiting
- Authentication expired
- Schema mismatch
Diagnostic Steps:
def diagnose_tool_failure(tool_call, error):
# 1. Check parameter validity
try:
tool.schema.validate(tool_call.params)
except ValidationError as e:
return f"Invalid parameters: {e}"
# 2. Check rate limits
if "429" in str(error) or "rate limit" in str(error).lower():
return "Rate limit exceeded"
# 3. Check authentication
if "401" in str(error) or "403" in str(error):
return "Authentication/authorization failure"
# 4. Check connectivity
if "timeout" in str(error).lower() or "connection" in str(error).lower():
return "Network connectivity issue"
return f"Unknown error: {error}"
3.2 API Integration Problems
Checklist:
- API key/token valid and not expired?
- Correct endpoint URL?
- Request format matches API spec?
- Response parsing handles all cases?
- Error responses handled gracefully?
Section 4: Cost/Token Issues
4.1 Unexpected High Token Usage
Investigation Steps:
1. Check input token count
- Context too large?
- Retrieved too many documents?
2. Check output token count
- Verbose responses?
- Unnecessary explanations?
3. Check iteration count (EP paradigm)
- Stuck in loop?
- Reflexion generating too much?
4. Check multi-agent overhead
- Too many agents?
- Redundant coordination?
Token Reduction Strategies:
| Strategy | Token Reduction | Implementation |
|---|---|---|
| Context summarization | 30-50% | Summarize old messages |
| Retrieval limiting | 20-40% | Reduce top_k, filter by score |
| Response length limits | 10-30% | Add max_tokens, be explicit |
| Tool result truncation | 20-40% | Limit tool output size |
4.2 Cost Optimization
class TokenBudgetManager:
def __init__(self, budget_per_task):
self.budget = budget_per_task
self.used = 0
def check_budget(self, estimated_tokens):
if self.used + estimated_tokens > self.budget:
raise BudgetExceededError(
f"Would exceed budget: {self.used + estimated_tokens} > {self.budget}"
)
def record_usage(self, tokens):
self.used += tokens
if self.used > self.budget * 0.8:
logger.warning(f"Token usage at {self.used/self.budget*100:.0f}%")
Section 5: Coordination Issues
5.1 Multi-Agent Deadlock
Symptoms:
- Orchestrator waiting indefinitely
- Agents waiting for each other
- No progress despite activity
Solutions:
- Add timeouts to all agent calls
- Implement circuit breaker pattern
- Add progress heartbeats
- Define maximum wait times
5.2 Result Aggregation Failures
Checklist:
- All subagents returned results?
- Result formats consistent?
- Conflict resolution defined?
- Partial results handled?
Emergency Procedures
Full System Rollback
# 1. Enable emergency mode
export AGENT_EMERGENCY_MODE=true
# 2. Route all traffic to fallback
./scripts/enable_fallback.sh
# 3. Notify stakeholders
./scripts/notify_incident.sh "Agentic AI rollback initiated"
# 4. Preserve logs for analysis
./scripts/archive_logs.sh
Incident Response Template
INCIDENT: [Brief description]
TIME: [When detected]
IMPACT: [User/business impact]
SYMPTOMS: [What was observed]
ROOT CAUSE: [If known]
ACTIONS TAKEN: [Steps taken]
STATUS: [Current status]
NEXT STEPS: [Planned actions]
Document maintained by CODITECT Support Team