Agentic AI Troubleshooting Guide

Diagnostic Decision Trees for Common Issues

Document ID: F3-TROUBLESHOOTING
Version: 1.0
Category: Operations


Quick Diagnostic Flow

START: What type of issue?

├─► Output Quality Issues ──► Section 1
├─► Performance Issues ──────► Section 2
├─► Tool/Integration Issues ─► Section 3
├─► Cost/Token Issues ───────► Section 4
└─► Coordination Issues ─────► Section 5

Section 1: Output Quality Issues

1.1 Hallucination / Inaccurate Information

Decision Tree:

Is the system using GS paradigm with retrieval?
├─► NO: Consider switching to GS for factual accuracy
└─► YES: Check retrieval quality

Are retrieved documents relevant?
├─► NO: Improve retrieval (see 1.1a)
└─► YES: Check citation usage

Is agent citing sources correctly?
├─► NO: Strengthen citation prompts (see 1.1b)
└─► YES: Check source quality

Are sources accurate/current?
├─► NO: Update knowledge base
└─► YES: Escalate to advanced diagnosis

1.1a: Improve Retrieval

# Solution: Add reranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.content) for doc in results])
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)

1.1b: Strengthen Citation Prompts

Add to system prompt:
"CRITICAL: You MUST cite a source for every factual claim.
Format: [Source N] immediately after the claim.
If you cannot cite a source, state 'Unable to verify' instead of the claim."

1.2 Inconsistent Outputs

Decision Tree:

Same input producing different outputs?
├─► Check temperature setting
│ ├─► Temperature > 0.5: Lower to 0.3-0.5
│ └─► Temperature <= 0.5: Check other factors

├─► Using caching?
│ ├─► NO: Consider adding for consistency
│ └─► YES: Check cache invalidation

└─► Check for non-deterministic tools
└─► Web search results vary: Add result pinning
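Caching, noted in the tree above, is a straightforward way to guarantee identical outputs for identical inputs. A minimal sketch, keyed on prompt and sampling parameters; the `call_llm` callable is a stand-in for whatever client the system uses:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(prompt: str, model: str, temperature: float, call_llm) -> str:
    """Return a cached response for identical (prompt, model, temperature) inputs."""
    key = hashlib.sha256(
        json.dumps([prompt, model, temperature]).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```

Note that the cache key must include every parameter that affects generation; omitting one (e.g. temperature) silently reintroduces inconsistency.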

Solution: Output Validation

from pydantic import ValidationError  # assuming a pydantic-style schema object

def validate_output(output, schema):
    try:
        validated = schema.parse(output)
        return validated, True
    except ValidationError as e:
        # Retry with more explicit instructions
        retry_prompt = (
            f"Previous output was invalid: {e}. "
            f"Please provide output matching: {schema}"
        )
        return retry(retry_prompt), False

1.3 Missing Required Information

Checklist:

  • Is the required information in the input?
  • Is the required information in retrieved context?
  • Is the output schema clearly specified?
  • Are there examples of expected output?

Solution: Explicit Output Requirements

Your response MUST include ALL of the following:
1. [Required field 1] - [description]
2. [Required field 2] - [description]
3. [Required field 3] - [description]

If any information is unavailable, explicitly state "Not available" for that field.
Do not omit any required fields.
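The same requirement can be enforced programmatically before accepting a response. A minimal sketch; the field names and the re-prompt wording are illustrative, not part of this guide's schema:

```python
REQUIRED_FIELDS = ["summary", "confidence", "sources"]  # hypothetical field names

def check_required_fields(response: dict) -> list:
    """Return the list of required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS
            if f not in response or response[f] in (None, "")]

response = {"summary": "...", "confidence": 0.9}
missing = check_required_fields(response)
if missing:
    # Re-prompt the agent, listing exactly which fields were omitted
    retry_prompt = (
        f"Your response omitted required fields: {missing}. "
        "Resend with ALL required fields, using 'Not available' where needed."
    )
```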

Section 2: Performance Issues

2.1 High Latency

Diagnostic Questions:

  1. Where is time spent? (LLM, retrieval, tools, network)
  2. What's the token count per request?
  3. Are there sequential dependencies that could be parallelized?

Quick Fixes:

Symptom          | Likely Cause         | Solution
-----------------|----------------------|------------------------------
First token slow | Cold start           | Implement keep-alive
Retrieval slow   | Large corpus         | Add caching, optimize index
Many tool calls  | Sequential execution | Parallelize where possible
Long context     | Token overhead       | Summarize, use sliding window

Latency Optimization Code:

import asyncio

async def parallel_retrieval(queries):
    """Execute multiple retrievals in parallel"""
    tasks = [retrieve(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

async def optimized_agent(task):
    # Parallel: decompose + initial retrieval
    decompose_task = decompose(task)
    initial_retrieval_task = retrieve_context(task)

    subtasks, context = await asyncio.gather(
        decompose_task,
        initial_retrieval_task,
    )

    # Continue with results
    ...

2.2 Timeout Errors

Decision Tree:

What component is timing out?
├─► LLM API
│ ├─► Check rate limits
│ ├─► Reduce context size
│ └─► Implement streaming

├─► External Tools
│ ├─► Add timeout with fallback
│ ├─► Implement circuit breaker
│ └─► Check tool health

└─► Orchestrator
├─► Subagent stuck: Add heartbeat
├─► Coordination deadlock: Add timeout
└─► Resource exhaustion: Scale resources

Timeout Handling Pattern:

async def call_with_timeout(func, timeout=30, fallback=None):
    try:
        return await asyncio.wait_for(func(), timeout=timeout)
    except asyncio.TimeoutError:
        logger.warning(f"{func.__name__} timed out after {timeout}s")
        if fallback:
            return await fallback()
        raise

Section 3: Tool/Integration Issues

3.1 Tool Call Failures

Common Causes:

  1. Invalid parameters (hallucinated)
  2. Rate limiting
  3. Authentication expired
  4. Schema mismatch

Diagnostic Steps:

def diagnose_tool_failure(tool, tool_call, error):
    # 1. Check parameter validity against the tool's schema
    try:
        tool.schema.validate(tool_call.params)
    except ValidationError as e:
        return f"Invalid parameters: {e}"

    # 2. Check rate limits
    if "429" in str(error) or "rate limit" in str(error).lower():
        return "Rate limit exceeded"

    # 3. Check authentication
    if "401" in str(error) or "403" in str(error):
        return "Authentication/authorization failure"

    # 4. Check connectivity
    if "timeout" in str(error).lower() or "connection" in str(error).lower():
        return "Network connectivity issue"

    return f"Unknown error: {error}"

3.2 API Integration Problems

Checklist:

  • API key/token valid and not expired?
  • Correct endpoint URL?
  • Request format matches API spec?
  • Response parsing handles all cases?
  • Error responses handled gracefully?
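For transient failures such as rate limits (HTTP 429) or timeouts, retrying with exponential backoff usually resolves the issue. A minimal sketch, assuming the wrapped call raises an exception whose message contains the status code; the retry counts and delays are illustrative defaults:

```python
import random
import time

def call_with_backoff(func, max_retries=4, base_delay=1.0):
    """Retry a flaky API call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            transient = "429" in str(e) or "timeout" in str(e).lower()
            if not transient or attempt == max_retries - 1:
                raise
            # Sleep base, 2x, 4x, ... plus jitter to avoid thundering herd
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```

Non-transient errors (e.g. authentication failures) are re-raised immediately, since retrying them only burns budget.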

Section 4: Cost/Token Issues

4.1 Unexpected High Token Usage

Investigation Steps:

1. Check input token count
- Context too large?
- Retrieved too many documents?

2. Check output token count
- Verbose responses?
- Unnecessary explanations?

3. Check iteration count (EP paradigm)
- Stuck in loop?
- Reflexion generating too much?

4. Check multi-agent overhead
- Too many agents?
- Redundant coordination?
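A per-component token ledger makes steps 1-4 concrete by showing where usage concentrates. This sketch uses a rough 4-characters-per-token estimate rather than a real tokenizer, and the component names are illustrative:

```python
from collections import defaultdict

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

class TokenLedger:
    """Attribute token usage to components (prompt, retrieval, tools, output)."""
    def __init__(self):
        self.usage = defaultdict(int)

    def record(self, component: str, text: str):
        self.usage[component] += estimate_tokens(text)

    def report(self):
        """Return {component: (tokens, percent_of_total)}."""
        total = sum(self.usage.values())
        return {c: (n, round(100 * n / total)) for c, n in self.usage.items()}
```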

Token Reduction Strategies:

Strategy               | Token Reduction | Implementation
-----------------------|-----------------|------------------------------
Context summarization  | 30-50%          | Summarize old messages
Retrieval limiting     | 20-40%          | Reduce top_k, filter by score
Response length limits | 10-30%          | Add max_tokens, be explicit
Tool result truncation | 20-40%          | Limit tool output size

4.2 Cost Optimization

import logging

logger = logging.getLogger(__name__)

class BudgetExceededError(Exception):
    pass

class TokenBudgetManager:
    def __init__(self, budget_per_task):
        self.budget = budget_per_task
        self.used = 0

    def check_budget(self, estimated_tokens):
        if self.used + estimated_tokens > self.budget:
            raise BudgetExceededError(
                f"Would exceed budget: {self.used + estimated_tokens} > {self.budget}"
            )

    def record_usage(self, tokens):
        self.used += tokens
        if self.used > self.budget * 0.8:
            logger.warning(f"Token usage at {self.used / self.budget * 100:.0f}%")

Section 5: Coordination Issues

5.1 Multi-Agent Deadlock

Symptoms:

  • Orchestrator waiting indefinitely
  • Agents waiting for each other
  • No progress despite activity

Solutions:

  1. Add timeouts to all agent calls
  2. Implement circuit breaker pattern
  3. Add progress heartbeats
  4. Define maximum wait times
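The circuit breaker from step 2 can be sketched as a wrapper that stops calling a failing agent after repeated errors, then allows a trial call once a cooldown elapses. The thresholds below are illustrative:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow retry after `reset_after` seconds."""
    def __init__(self, max_failures=3, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("Circuit open: skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```

Wrapping each subagent call this way converts an indefinite hang on a dead agent into a fast, loggable failure the orchestrator can route around.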

5.2 Result Aggregation Failures

Checklist:

  • All subagents returned results?
  • Result formats consistent?
  • Conflict resolution defined?
  • Partial results handled?
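An aggregator that covers the last two checklist items might look like this sketch; the result shape and agent-name keys are illustrative:

```python
def aggregate_results(results: dict) -> dict:
    """Merge subagent results, separating successes from failures.

    `results` maps agent name -> result dict, or an Exception for failed agents.
    """
    merged, failed = {}, []
    for agent, result in results.items():
        if isinstance(result, Exception):
            failed.append(agent)
        else:
            merged[agent] = result
    return {
        "merged": merged,
        "failed_agents": failed,
        "partial": bool(failed),  # caller decides whether partial results are acceptable
    }
```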

Emergency Procedures

Full System Rollback

# 1. Enable emergency mode
export AGENT_EMERGENCY_MODE=true

# 2. Route all traffic to fallback
./scripts/enable_fallback.sh

# 3. Notify stakeholders
./scripts/notify_incident.sh "Agentic AI rollback initiated"

# 4. Preserve logs for analysis
./scripts/archive_logs.sh

Incident Response Template

INCIDENT: [Brief description]
TIME: [When detected]
IMPACT: [User/business impact]
SYMPTOMS: [What was observed]
ROOT CAUSE: [If known]
ACTIONS TAKEN: [Steps taken]
STATUS: [Current status]
NEXT STEPS: [Planned actions]

Document maintained by CODITECT Support Team