Agentic AI Troubleshooting Guide

Diagnostic Decision Trees for Common Issues

Document ID: F3-TROUBLESHOOTING
Version: 1.0
Category: Operations

Quick Diagnostic Flow

START: What type of issue?
│
├─► Output Quality Issues ──► Section 1
├─► Performance Issues ──────► Section 2
├─► Tool/Integration Issues ─► Section 3
├─► Cost/Token Issues ───────► Section 4
└─► Coordination Issues ─────► Section 5

Section 1: Output Quality Issues

1.1 Hallucination / Inaccurate Information

Decision Tree:

Is the system using GS paradigm with retrieval?
├─► NO: Consider switching to GS for factual accuracy
└─► YES: Check retrieval quality
    │
    Are retrieved documents relevant?
    ├─► NO: Improve retrieval (see 1.1a)
    └─► YES: Check citation usage
        │
        Is agent citing sources correctly?
        ├─► NO: Strengthen citation prompts (see 1.1b)
        └─► YES: Check source quality
            │
            Are sources accurate/current?
            ├─► NO: Update knowledge base
            └─► YES: Escalate to advanced diagnosis

1.1a: Improve Retrieval

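# Solution: rerank retrieved documents with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, results, top_k=5):
    """Return the top_k retrieved documents, most relevant first."""
    # Score each (query, document) pair; assumes each result exposes `.content`
    scores = reranker.predict([(query, doc.content) for doc in results])
    reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:top_k]]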

1.1b: Strengthen Citation Prompts

Add to system prompt:
"CRITICAL: You MUST cite a source for every factual claim.
Format: [Source N] immediately after the claim.
If you cannot cite a source, state 'Unable to verify' instead of the claim."
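
To check whether the citation prompt is taking effect, sampled outputs can be scanned for sentences that carry no citation. The sketch below is a rough heuristic, not a definitive checker: the function name `uncited_sentences` is hypothetical, the sentence splitting is naive (abbreviations such as "e.g." will confuse it), and it assumes citations follow the `[Source N]` format from the prompt above.

```python
import re

CITATION = re.compile(r"\[Source \d+\]")
# Split after ., ! or ? unless the next token is a citation marker.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?!\s|\[Source)")

def uncited_sentences(text):
    """Return sentences with neither a [Source N] marker nor 'Unable to verify'."""
    sentences = [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]
    return [
        s for s in sentences
        if not CITATION.search(s) and "unable to verify" not in s.lower()
    ]
```

A non-empty result on a sample of outputs suggests the prompt needs further strengthening or that the agent is ignoring it.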

1.2 Inconsistent Outputs

Decision Tree:

Same input producing different outputs?
├─► Check temperature setting
│   ├─► Temperature > 0.5: Lower to 0.3-0.5
│   └─► Temperature ≤ 0.5: Check other factors
│
├─► Using caching?
│   ├─► NO: Consider adding for consistency
│   └─► YES: Check cache invalidation
│
└─► Check for non-deterministic tools
    └─► Web search results vary: Add result pinning
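
The caching branch above can be sketched as a small in-memory cache keyed by a hash of the prompt and sampling parameters, so that identical requests return the stored response. This is a minimal illustration under stated assumptions: `ResponseCache` and the `generate` callable are hypothetical names, and a production cache would also need expiry and invalidation.

```python
import hashlib
import json

class ResponseCache:
    """In-memory cache keyed by a hash of the prompt and sampling parameters."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt, params):
        # sort_keys makes the key independent of dict ordering
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, params, generate):
        """Return the cached response, calling generate(prompt, **params) on a miss."""
        key = self._key(prompt, params)
        if key not in self._store:
            self._store[key] = generate(prompt, **params)
        return self._store[key]
```

The same idea applies to result pinning: storing the first set of web search results under a hash of the query keeps later runs on identical inputs.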

Solution: Output Validation

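# Assumes `schema.parse` raises ValidationError on invalid output and that
# `retry` re-invokes the agent with the given prompt; both come from your stack.
def validate_output(output, schema, max_retries=1):
    """Return (result, is_valid), re-prompting up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return schema.parse(output), True
        except ValidationError as e:
            if attempt == max_retries:
                return output, False
            # Retry with more explicit instructions
            retry_prompt = f"Previous output was invalid: {e}. Please provide output matching: {schema}"
            output = retry(retry_prompt)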

1.3 Missing Required Information

Checklist:

Is the required information in the input?
Is the required information in retrieved context?
Is the output schema clearly specified?
Are there examples of expected output?

Solution: Explicit Output Requirements

Your response MUST include ALL of the following:
1. [Required field 1] - [description]
2. [Required field 2] - [description]
3. [Required field 3] - [description]

If any information is unavailable, explicitly state "Not available" for that field.
Do not omit any required fields.
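
Once the response has been parsed into a dictionary, the requirement above can be enforced in code. The sketch below assumes a parsed `dict` response; the function name `missing_fields` is hypothetical. A field set to "Not available" counts as present, matching the prompt's instruction.

```python
def missing_fields(response, required_fields):
    """Return required fields that are absent or empty in a parsed response dict."""
    missing = []
    for field in required_fields:
        value = response.get(field)
        # None and blank strings both count as missing
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing
```

A non-empty return value can trigger a retry that names the missing fields explicitly.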

Section 2: Performance Issues

2.1 High Latency

Diagnostic Questions:

Where is time spent? (LLM, retrieval, tools, network)
What's the token count per request?
Are there sequential dependencies that could parallelize?
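
To answer the first question, each stage of a request can be wrapped in a timer that accumulates wall-clock time by stage name. This is a minimal sketch using only the standard library; `StageTimer` and the stage names are hypothetical, and a tracing system would normally replace it in production.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock seconds per named stage of a request."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises
            self.totals[name] += time.perf_counter() - start

    def slowest(self):
        """Return the stage with the largest total, or None if nothing was timed."""
        return max(self.totals, key=self.totals.get) if self.totals else None
```

Usage: wrap each phase in `with timer.stage("retrieval"):` or `with timer.stage("llm"):`, then log `timer.totals` per request.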

Quick Fixes:

Symptom	Likely Cause	Solution
First token slow	Cold start	Implement keep-alive
Retrieval slow	Large corpus	Add caching, optimize index
Many tool calls	Sequential execution	Parallelize where possible
Long context	Token overhead	Summarize, use sliding window

Latency Optimization Code:

import asyncio

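async def parallel_retrieval(queries):
    """Execute multiple retrievals concurrently; results keep the order of `queries`."""
    # `retrieve` must be an async function: each call creates a coroutine for gather
    tasks = [retrieve(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results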

async def optimized_agent(task):
    # Parallel: decompose + initial retrieval
    decompose_task = decompose(task)
    initial_retrieval_task = retrieve_context(task)
    
    subtasks, context = await asyncio.gather(
        decompose_task, 
        initial_retrieval_task
    )
    
    # Continue with results
    ...

2.2 Timeout Errors

Decision Tree:

What component is timing out?
├─► LLM API
│   ├─► Check rate limits
│   ├─► Reduce context size
│   └─► Implement streaming
│
├─► External Tools
│   ├─► Add timeout with fallback
│   ├─► Implement circuit breaker
│   └─► Check tool health
│
└─► Orchestrator
    ├─► Subagent stuck: Add heartbeat
    ├─► Coordination deadlock: Add timeout
    └─► Resource exhaustion: Scale resources
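
The circuit breaker named in the tree can be sketched as follows. This is a simplified illustration, not a complete implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions, and it handles synchronous calls only.

```python
import time

class CircuitBreaker:
    """Stop calling a failing tool for `reset_after` seconds after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            # Cool-down elapsed: allow one trial call (half-open state)
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the circuit is open, callers fail fast and can route to a fallback instead of waiting on a tool that is known to be unhealthy.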

Timeout Handling Pattern:

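import asyncio
import logging

logger = logging.getLogger(__name__)

async def call_with_timeout(func, timeout=30, fallback=None):
    """Await func() for at most `timeout` seconds; on timeout, use fallback if given."""
    try:
        return await asyncio.wait_for(func(), timeout=timeout)
    except asyncio.TimeoutError:
        name = getattr(func, "__name__", repr(func))  # partials have no __name__
        logger.warning(f"{name} timed out after {timeout}s")
        if fallback is not None:
            return await fallback()
        raise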

Section 3: Tool/Integration Issues

3.1 Tool Call Failures

Common Causes:

Invalid parameters (hallucinated)
Rate limiting
Authentication expired
Schema mismatch

Diagnostic Steps:

# ValidationError comes from whichever schema library backs `tool.schema`.
def diagnose_tool_failure(tool, tool_call, error):
    """Classify a tool-call failure. `tool` is the definition of the tool that was called."""
    message = str(error).lower()

    # 1. Check parameter validity
    try:
        tool.schema.validate(tool_call.params)
    except ValidationError as e:
        return f"Invalid parameters: {e}"

    # 2. Check rate limits
    if "429" in message or "rate limit" in message:
        return "Rate limit exceeded"

    # 3. Check authentication (401) and authorization (403)
    if "401" in message or "403" in message:
        return "Authentication/authorization failure"

    # 4. Check connectivity
    if "timeout" in message or "timed out" in message or "connection" in message:
        return "Network connectivity issue"

    return f"Unknown error: {error}"
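
For failures diagnosed as rate limiting, a common remedy is to retry with exponential backoff and jitter. The sketch below is one possible shape, under stated assumptions: `retry_with_backoff`, the `is_retryable` predicate, and the default delays are hypothetical, and the `sleep` parameter exists only so the delay can be stubbed in tests.

```python
import random
import time

def retry_with_backoff(func, is_retryable, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on retryable errors wait base_delay * 2**attempt (with jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying cannot fix
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
```

Authentication failures and invalid parameters should not be retried this way, since repeating the call cannot fix them.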

3.2 API Integration Problems

Checklist:

API key/token valid and not expired?
Correct endpoint URL?
Request format matches API spec?
Response parsing handles all cases?
Error responses handled gracefully?

Section 4: Cost/Token Issues

4.1 Unexpected High Token Usage

Investigation Steps:

1. Check input token count
   - Context too large?
   - Retrieved too many documents?
   
2. Check output token count
   - Verbose responses?
   - Unnecessary explanations?
   
3. Check iteration count (EP paradigm)
   - Stuck in loop?
   - Reflexion generating too much?
   
4. Check multi-agent overhead
   - Too many agents?
   - Redundant coordination?
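
For step 3, one rough way to spot a stuck loop is to count how often the agent repeats the same action within a task. The sketch below assumes each action can be summarized as a hashable value such as a string; `detect_loop` and the threshold are hypothetical.

```python
from collections import Counter

def detect_loop(actions, max_repeats=3):
    """Return the first action seen more than max_repeats times, else None."""
    counts = Counter()
    for action in actions:
        counts[action] += 1
        if counts[action] > max_repeats:
            return action
    return None
```

Counting exact repeats misses loops where the agent varies its arguments slightly, so treat a clean result as a hint and not as proof.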

Token Reduction Strategies:

Strategy	Token Reduction	Implementation
Context summarization	30-50%	Summarize old messages
Retrieval limiting	20-40%	Reduce top_k, filter by score
Response length limits	10-30%	Add max_tokens, be explicit
Tool result truncation	20-40%	Limit tool output size
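
Two of the strategies in the table, retrieval limiting and tool result truncation, can be sketched as small helpers. Function names, default thresholds, and the `(doc, score)` pair format are assumptions for illustration; the reduction percentages above are not derived from this code.

```python
def limit_retrieval(scored_docs, top_k=3, min_score=0.5):
    """Keep at most top_k (doc, score) pairs whose score is at least min_score."""
    kept = [pair for pair in scored_docs if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

def truncate_tool_result(text, max_chars=2000):
    """Cut tool output to max_chars and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {omitted} characters]"
```

Leaving a truncation marker in the output tells the agent that the result is incomplete, so it can request a narrower query if needed.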

4.2 Cost Optimization

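import logging

logger = logging.getLogger(__name__)

class BudgetExceededError(Exception):
    """Raised when a call would push a task past its token budget."""

class TokenBudgetManager:
    def __init__(self, budget_per_task):
        if budget_per_task <= 0:
            raise ValueError("budget_per_task must be positive")
        self.budget = budget_per_task
        self.used = 0

    def check_budget(self, estimated_tokens):
        """Call before a request; raises if the estimate would exceed the budget."""
        projected = self.used + estimated_tokens
        if projected > self.budget:
            raise BudgetExceededError(
                f"Would exceed budget: {projected} > {self.budget}"
            )

    def record_usage(self, tokens):
        """Call after a request with the actual token count; warns above 80% usage."""
        self.used += tokens
        if self.used > self.budget * 0.8:
            logger.warning(f"Token usage at {self.used/self.budget*100:.0f}%")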

Section 5: Coordination Issues

5.1 Multi-Agent Deadlock

Symptoms:

Orchestrator waiting indefinitely
Agents waiting for each other
No progress despite activity

Solutions:

Add timeouts to all agent calls
Implement circuit breaker pattern
Add progress heartbeats
Define maximum wait times
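
Progress heartbeats can be sketched as a monitor that records when each agent last reported and lists those that have been silent too long. This is an illustrative outline: `HeartbeatMonitor`, the `max_silence` default, and the injectable `clock` are assumptions, and how agents emit heartbeats depends on your orchestrator.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat per agent and report agents that have gone silent."""

    def __init__(self, max_silence=60.0, clock=time.monotonic):
        self.max_silence = max_silence
        self.clock = clock
        self.last_seen = {}

    def beat(self, agent_id):
        """Record that agent_id made progress just now."""
        self.last_seen[agent_id] = self.clock()

    def stalled(self):
        """Return agent ids silent for longer than max_silence seconds."""
        now = self.clock()
        return sorted(
            agent for agent, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )
```

The orchestrator can poll `stalled()` on a timer and cancel or restart the listed agents instead of waiting indefinitely.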

5.2 Result Aggregation Failures

Checklist:

All subagents returned results?
Result formats consistent?
Conflict resolution defined?
Partial results handled?
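
The first and last checklist items can be addressed by collecting subagent results so that one failure or timeout does not discard the rest. The sketch below uses `asyncio.gather` with `return_exceptions=True`; the function name `gather_partial` and the name-to-coroutine mapping are assumptions about how subagents are invoked.

```python
import asyncio

async def gather_partial(named_coros, timeout=30.0):
    """Run subagent coroutines concurrently; return (results, failures) keyed by name."""
    names = list(named_coros)
    outcomes = await asyncio.gather(
        *(asyncio.wait_for(named_coros[name], timeout) for name in names),
        return_exceptions=True,  # exceptions are returned instead of raised
    )
    results, failures = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failures[name] = outcome
        else:
            results[name] = outcome
    return results, failures
```

The aggregator can then decide whether the successful subset is enough to answer, or whether the entries in `failures` must be retried first.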

Emergency Procedures

Full System Rollback

# 1. Enable emergency mode
export AGENT_EMERGENCY_MODE=true

# 2. Route all traffic to fallback
./scripts/enable_fallback.sh

# 3. Notify stakeholders
./scripts/notify_incident.sh "Agentic AI rollback initiated"

# 4. Preserve logs for analysis
./scripts/archive_logs.sh

Incident Response Template

INCIDENT: [Brief description]
TIME: [When detected]
IMPACT: [User/business impact]
SYMPTOMS: [What was observed]
ROOT CAUSE: [If known]
ACTIONS TAKEN: [Steps taken]
STATUS: [Current status]
NEXT STEPS: [Planned actions]

Document maintained by CODITECT Support Team
