---
title: MoE System Enhancement Recommendations
component_type: reference
version: 1.0.0
audience: contributor
status: active
summary: Comprehensive recommendations for improving the CODITECT Mixture of Experts classification system based on codebase analysis
keywords:
  - moe
  - enhancement
  - recommendations
  - improvements
  - semantic-embeddings
  - machine-learning
tokens: ~5000
created: '2025-12-31'
updated: '2025-12-31'
type: reference
tags:
  - reference
  - architecture
  - moe
  - improvements
moe_confidence: 0.950
moe_classified: 2025-12-31
---

MoE System Enhancement Recommendations

Generated: December 31, 2025
Analysis Type: Codebase Gap Analysis & Enhancement Roadmap
Certainty: HIGH (95%) - Based on direct code inspection of 23+ MoE implementation files


Executive Summary

The current CODITECT MoE system is well-architected with a solid three-layer model (analysts → judges → consensus). However, analysis reveals 7 high-impact enhancement opportunities that could significantly improve classification accuracy, reduce escalations, and enable adaptive learning.

Impact Summary

| Enhancement | Impact | Effort | Priority |
|---|---|---|---|
| Semantic Embeddings | HIGH | HIGH | P0 |
| Historical Learning | HIGH | MEDIUM | P0 |
| Memory System Integration | MEDIUM | LOW | P1 |
| Adaptive Thresholds | MEDIUM | MEDIUM | P1 |
| Confidence Calibration | HIGH | MEDIUM | P1 |
| Additional Judge Types | MEDIUM | LOW | P2 |
| Batch Corpus Analysis | LOW | HIGH | P2 |


Enhancement 1: True Semantic Embeddings (P0)

Current State

The SemanticSimilarityAnalyst in core/deep_analysts.py:216 explicitly states:

```python
class SemanticSimilarityAnalyst:
    """
    Analyzes document similarity to known exemplars.
    Uses pattern matching as a lightweight embedding proxy.  # <-- Current approach
    """
```

The system uses regex patterns (EXEMPLAR_PATTERNS dict with 7 document types) as a proxy for semantic similarity. This approach:

  • Cannot capture nuanced meaning
  • Fails on paraphrased content
  • Misses semantic relationships between concepts
  • Has fixed pattern vocabulary
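To see why a fixed pattern vocabulary falls short, consider this toy comparison. Everything here is invented for illustration (`AGENT_PATTERN`, `bow_cosine`, and the sample texts are not from the codebase, and a bag-of-words cosine is only a crude stand-in for a real embedding): the exemplar regex misses a paraphrase that even a simple similarity score still catches.

```python
import math
import re
from collections import Counter

# Hypothetical exemplar regex, in the spirit of EXEMPLAR_PATTERNS
AGENT_PATTERN = re.compile(r"You are a specialized AI agent", re.IGNORECASE)

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (crude stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

exemplar = "you are a specialized ai agent for code review"
paraphrase = "this assistant is a specialized ai reviewer for code"

print(AGENT_PATTERN.search(paraphrase))            # None - the regex misses the paraphrase
print(round(bow_cosine(exemplar, paraphrase), 2))  # 0.56 - similarity still detected
```

A true embedding model would score the paraphrase higher still, because it captures synonymy ("assistant" vs "agent") rather than only shared tokens.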

Implement true vector embeddings using a lightweight embedding model:

```python
# Proposed: core/embeddings.py
from typing import Dict, Tuple

import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticEmbeddingService:
    """
    True semantic embedding service for document classification.
    Uses sentence-transformers for efficient local embeddings.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        # Lightweight model: 80MB, ~14K docs/sec on CPU
        self.model = SentenceTransformer(model_name)
        self.exemplar_embeddings: Dict[str, np.ndarray] = {}
        self._load_exemplars()

    def _load_exemplars(self):
        """Pre-compute embeddings for known document types."""
        exemplar_texts = {
            "agent": [
                "You are a specialized AI agent for...",
                "This agent handles... capabilities include...",
                "System prompt: You are a...",
            ],
            "command": [
                "Usage: /command-name [options]",
                "This slash command executes...",
                "Invocation: /cmd --flag value",
            ],
            # ... other types
        }

        for doc_type, texts in exemplar_texts.items():
            embeddings = self.model.encode(texts)
            self.exemplar_embeddings[doc_type] = np.mean(embeddings, axis=0)

    def classify(self, content: str) -> Tuple[str, float]:
        """Classify document by embedding similarity."""
        doc_embedding = self.model.encode(content[:8000])  # Truncate for efficiency

        similarities = {}
        for doc_type, exemplar_emb in self.exemplar_embeddings.items():
            similarity = np.dot(doc_embedding, exemplar_emb) / (
                np.linalg.norm(doc_embedding) * np.linalg.norm(exemplar_emb)
            )
            similarities[doc_type] = float(similarity)

        best_type = max(similarities, key=similarities.get)
        return best_type, similarities[best_type]
```

Implementation Notes

| Aspect | Recommendation |
|---|---|
| Model | all-MiniLM-L6-v2 (80MB, fast, accurate) |
| Fallback | Keep regex patterns for offline/no-model scenarios |
| Caching | Cache embeddings in context.db with document hash key |
| Memory | ~200MB RAM overhead for model |
| Performance | <100ms per document |
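The caching recommendation above can be sketched as a small SQLite table keyed by a content hash. This is a minimal illustration only: the `EmbeddingCache` class and the `embedding_cache` table name are hypothetical, not part of the existing context.db schema, and vectors are stored here as opaque bytes.

```python
import hashlib
import sqlite3

class EmbeddingCache:
    """Cache embedding vectors in SQLite, keyed by a hash of the document content."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embedding_cache "
            "(doc_hash TEXT PRIMARY KEY, vector BLOB)"
        )

    @staticmethod
    def _key(content: str) -> str:
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def get(self, content: str):
        """Return cached vector bytes, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT vector FROM embedding_cache WHERE doc_hash = ?",
            (self._key(content),),
        ).fetchone()
        return row[0] if row else None

    def put(self, content: str, vector: bytes) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO embedding_cache VALUES (?, ?)",
            (self._key(content), vector),
        )
        self.conn.commit()

cache = EmbeddingCache()
cache.put("doc text", b"\x00\x01")
print(cache.get("doc text"))    # b'\x00\x01'
print(cache.get("other text"))  # None (unseen content)
```

Because the key is a content hash, an edited document is automatically re-embedded while unchanged documents keep hitting the cache.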

Expected Impact

  • Accuracy improvement: +15-25% for ambiguous documents
  • Escalation reduction: -30% (fewer documents sent to deep analysis)
  • Semantic understanding: Captures meaning, not just keywords

Enhancement 2: Historical Learning Loop (P0)

Current State

The system currently does no learning from classification history: each document is classified independently, with no feedback mechanism. From core/models.py:

```python
@dataclass
class ClassificationResult:
    """Final classification result with audit trail."""
    # ... stores the result but has no mechanism to learn from it
```

Implement a feedback loop that learns from confirmed classifications:

```python
# Proposed: core/learning.py
import json
import sqlite3
from datetime import datetime
from typing import Dict

from core.models import ClassificationResult


class ClassificationLearner:
    """
    Learns from historical classification outcomes.
    Tracks analyst accuracy and adjusts weights dynamically.
    """

    def __init__(self, db_path: str = "context.db"):
        self.db_path = db_path
        self._init_tables()

    def _init_tables(self):
        """Create learning tables."""
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS classification_outcomes (
                id INTEGER PRIMARY KEY,
                document_path TEXT,
                predicted_type TEXT,
                actual_type TEXT,    -- NULL until confirmed
                confidence REAL,
                analyst_votes TEXT,  -- JSON
                confirmed_at TIMESTAMP,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS analyst_accuracy (
                analyst TEXT PRIMARY KEY,
                correct_count INTEGER DEFAULT 0,
                total_count INTEGER DEFAULT 0,
                accuracy REAL DEFAULT 0.0,
                last_updated TIMESTAMP
            )
        """)
        conn.commit()
        conn.close()

    def record_classification(self, result: ClassificationResult):
        """Record classification for future learning."""
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            INSERT INTO classification_outcomes
                (document_path, predicted_type, confidence, analyst_votes)
            VALUES (?, ?, ?, ?)
        """, (result.document_path, result.result.classification,
              result.result.confidence,
              json.dumps([v.__dict__ for v in result.result.votes])))
        conn.commit()
        conn.close()

    def confirm_classification(self, document_path: str, actual_type: str):
        """Confirm or correct a classification - triggers learning."""
        conn = sqlite3.connect(self.db_path)

        # Get the original classification
        cursor = conn.execute("""
            SELECT predicted_type, analyst_votes FROM classification_outcomes
            WHERE document_path = ? AND actual_type IS NULL
            ORDER BY created_at DESC LIMIT 1
        """, (document_path,))
        row = cursor.fetchone()

        if row:
            predicted, votes_json = row
            votes = json.loads(votes_json)

            # Update outcome
            conn.execute("""
                UPDATE classification_outcomes
                SET actual_type = ?, confirmed_at = ?
                WHERE document_path = ? AND actual_type IS NULL
            """, (actual_type, datetime.utcnow(), document_path))

            # Update analyst accuracy
            for vote in votes:
                is_correct = 1 if vote['classification'] == actual_type else 0
                conn.execute("""
                    INSERT INTO analyst_accuracy
                        (analyst, correct_count, total_count, accuracy, last_updated)
                    VALUES (?, ?, 1, ?, ?)
                    ON CONFLICT(analyst) DO UPDATE SET
                        correct_count = correct_count + ?,
                        total_count = total_count + 1,
                        accuracy = CAST(correct_count + ? AS REAL) / (total_count + 1),
                        last_updated = ?
                """, (vote['agent'], is_correct, float(is_correct), datetime.utcnow(),
                      is_correct, is_correct, datetime.utcnow()))

        conn.commit()
        conn.close()

    def get_analyst_weights(self) -> Dict[str, float]:
        """Get dynamic weights based on analyst accuracy."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.execute("""
            SELECT analyst, accuracy, total_count FROM analyst_accuracy
            WHERE total_count >= 10  -- Minimum samples
        """)

        weights = {}
        for analyst, accuracy, count in cursor.fetchall():
            # Weight = observed accuracy shrunk toward 0.5 until enough samples
            confidence_factor = min(1.0, count / 100)  # Full confidence at 100 samples
            weights[analyst] = accuracy * confidence_factor + (1 - confidence_factor) * 0.5

        conn.close()
        return weights
```
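The weight formula in get_analyst_weights shrinks an analyst's observed accuracy toward a neutral 0.5 prior until enough confirmed samples accumulate. A standalone rendition of the arithmetic (the `analyst_weight` helper is written here purely for illustration):

```python
def analyst_weight(accuracy: float, total_count: int) -> float:
    """Shrink observed accuracy toward a neutral 0.5 prior; full trust at 100 samples."""
    confidence_factor = min(1.0, total_count / 100)
    return accuracy * confidence_factor + (1 - confidence_factor) * 0.5

print(analyst_weight(0.9, 100))  # 0.9 - enough samples, accuracy fully trusted
print(analyst_weight(0.9, 10))   # ~0.54 - mostly the 0.5 prior
```

This keeps a lucky analyst with 3/3 correct votes from immediately dominating the consensus.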

Integration with Consensus Calculator

Modify core/consensus.py to use dynamic weights:

```python
# In ConsensusCalculator
def calculate_from_votes(self, votes: List[AnalystVote]) -> ConsensusResult:
    # Get dynamic weights from the learning system
    dynamic_weights = self.learner.get_analyst_weights()

    # Apply weights to votes
    weighted_votes = {}
    for vote in votes:
        weight = dynamic_weights.get(vote.agent, 1.0)  # Default weight 1.0
        if vote.classification not in weighted_votes:
            weighted_votes[vote.classification] = 0.0
        weighted_votes[vote.classification] += vote.confidence * weight

    # Continue with weighted consensus...
```
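Filled in, the weighted tally can look like this. The vote triples and weights below are made-up numbers for illustration; in the real system they would come from AnalystVote objects and ClassificationLearner.get_analyst_weights.

```python
from collections import defaultdict

def weighted_consensus(votes, weights, default_weight=1.0):
    """votes: (agent, classification, confidence) triples -> (winner, winner's share)."""
    totals = defaultdict(float)
    for agent, classification, confidence in votes:
        totals[classification] += confidence * weights.get(agent, default_weight)
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

votes = [
    ("structural", "agent", 0.9),
    ("keyword", "guide", 0.8),
    ("semantic", "agent", 0.7),
]
weights = {"structural": 0.9, "keyword": 0.4, "semantic": 0.8}
print(weighted_consensus(votes, weights))  # 'agent' wins with ~81% of the weighted mass
```

Note how the low-accuracy "keyword" analyst's dissenting vote is discounted by its 0.4 weight.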

Expected Impact

  • Self-improvement: System gets better over time
  • Analyst accountability: Track which analysts are accurate
  • Weight optimization: More accurate analysts have more influence

Enhancement 3: Memory System Integration (P1)

Current State

The MoE system operates independently of the CODITECT memory system (context.db, 584MB of historical context). There is no integration with /cxq queries or session knowledge.

Integrate with existing context.db to leverage historical patterns:

```python
# Proposed: core/memory_integration.py
import sqlite3
from typing import Dict, List


class MemoryEnhancedClassifier:
    """
    Enhances classification using the CODITECT memory system.
    Queries historical patterns from context.db.
    """

    def __init__(self, context_db_path: str):
        self.db_path = context_db_path

    def find_similar_documents(self, content: str, limit: int = 5) -> List[Dict]:
        """Find similar documents from session history."""
        conn = sqlite3.connect(self.db_path)

        # Search unified messages for similar file discussions
        cursor = conn.execute("""
            SELECT content, metadata FROM unified_messages
            WHERE content LIKE '%type:%' OR content LIKE '%component_type:%'
            ORDER BY timestamp DESC
            LIMIT ?
        """, (limit * 10,))  # Get more, filter later

        # TODO: Replace with vector search when embeddings are added
        results = []
        for row in cursor.fetchall():
            results.append({
                'content': row[0][:500],
                'metadata': row[1],
            })

        conn.close()
        return results[:limit]

    def get_project_conventions(self) -> Dict[str, str]:
        """Extract project-specific naming/type conventions."""
        conn = sqlite3.connect(self.db_path)

        # Find patterns in historical classifications
        cursor = conn.execute("""
            SELECT content FROM unified_messages
            WHERE content LIKE '%classified as%' OR content LIKE '%document type%'
            ORDER BY timestamp DESC
            LIMIT 100
        """)

        conventions = {}
        # Parse and extract patterns...

        conn.close()
        return conventions
```
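The LIKE-based lookup can be exercised against an in-memory database. The three-column unified_messages schema below is a minimal assumption for illustration only, not the actual context.db layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema - the real context.db table has more columns
conn.execute("CREATE TABLE unified_messages (content TEXT, metadata TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO unified_messages VALUES (?, ?, ?)",
    [
        ("frontmatter says component_type: agent", "{}", "2025-12-30"),
        ("unrelated chatter", "{}", "2025-12-29"),
        ("header declares type: guide", "{}", "2025-12-28"),
    ],
)
rows = conn.execute(
    "SELECT content FROM unified_messages "
    "WHERE content LIKE '%type:%' OR content LIKE '%component_type:%' "
    "ORDER BY timestamp DESC"
).fetchall()
print(len(rows))  # 2 - the chatter row is filtered out
```

Substring matching like this is deliberately crude; it is the piece the TODO in find_similar_documents earmarks for replacement by vector search.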

Expected Impact

  • Context awareness: Uses historical project patterns
  • Consistency: Aligns with prior classification decisions
  • Reduced manual review: Leverages existing knowledge

Enhancement 4: Adaptive Thresholds (P1)

Current State

Fixed thresholds in core/consensus.py:

```python
AUTO_APPROVAL_CONFIDENCE = 0.90   # Fixed
JUDGE_APPROVAL_CONFIDENCE = 0.85  # Fixed
AGREEMENT_THRESHOLD = 0.60        # Fixed
```

Implement adaptive thresholds based on document corpus:

```python
# Proposed: core/adaptive_thresholds.py
from dataclasses import dataclass

import numpy as np


@dataclass
class AdaptiveThresholdConfig:
    """Thresholds that adjust based on classification outcomes."""

    # Base thresholds
    base_auto_approval: float = 0.90
    base_judge_approval: float = 0.85
    base_agreement: float = 0.60

    # Adjustment factors
    escalation_penalty: float = 0.01  # Lower thresholds if too many escalations
    accuracy_bonus: float = 0.02      # Raise thresholds if accuracy is high

    def adjust(self,
               escalation_rate: float,
               accuracy_rate: float,
               target_escalation: float = 0.15) -> 'AdaptiveThresholdConfig':
        """Adjust thresholds based on performance metrics."""

        # If the escalation rate is too high, lower thresholds slightly
        if escalation_rate > target_escalation:
            adjustment = -self.escalation_penalty * (escalation_rate - target_escalation) * 10
        else:
            adjustment = self.accuracy_bonus * (accuracy_rate - 0.85)

        return AdaptiveThresholdConfig(
            base_auto_approval=np.clip(self.base_auto_approval + adjustment, 0.80, 0.95),
            base_judge_approval=np.clip(self.base_judge_approval + adjustment, 0.75, 0.92),
            base_agreement=np.clip(self.base_agreement + adjustment * 0.5, 0.50, 0.75),
        )
```
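A worked example of the adjustment arithmetic, written as a dependency-free helper (`adjust_auto_approval` is illustrative and mirrors only the auto-approval branch of adjust):

```python
def adjust_auto_approval(base=0.90, escalation_rate=0.15, accuracy_rate=0.85,
                         target_escalation=0.15, penalty=0.01, bonus=0.02):
    """Pure-Python mirror of AdaptiveThresholdConfig.adjust for one threshold."""
    if escalation_rate > target_escalation:
        # Too many escalations: lower the bar slightly
        adjustment = -penalty * (escalation_rate - target_escalation) * 10
    else:
        # Escalations under control: let high accuracy raise the bar
        adjustment = bonus * (accuracy_rate - 0.85)
    return max(0.80, min(0.95, base + adjustment))

print(round(adjust_auto_approval(escalation_rate=0.25), 3))                      # 0.89
print(round(adjust_auto_approval(escalation_rate=0.10, accuracy_rate=0.95), 3))  # 0.902
```

With a 25% escalation rate (10 points over target), the auto-approval threshold drops by one point to 0.89; with escalations under control and 95% accuracy, it creeps up to 0.902. The clipping bounds keep either drift from running away.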

Expected Impact

  • Self-tuning: Thresholds optimize for corpus characteristics
  • Reduced escalations: Adjust based on historical patterns
  • Better calibration: Align confidence with actual accuracy

Enhancement 5: Confidence Calibration Validation (P1)

Current State

There is currently no validation that confidence scores correlate with actual accuracy: a 90%-confidence classification might actually be correct only 70% of the time.

Implement calibration curve tracking:

```python
# Proposed: core/calibration.py
from typing import Dict, List, Tuple

import numpy as np


class ConfidenceCalibrator:
    """
    Validates and calibrates confidence scores.
    Goal: 90% confidence should mean 90% accuracy.
    """

    def __init__(self):
        self.bins = 10
        self.calibration_data: List[Tuple[float, bool]] = []

    def record(self, confidence: float, was_correct: bool):
        """Record a classification outcome."""
        self.calibration_data.append((confidence, was_correct))

    def get_calibration_curve(self) -> Dict[str, object]:
        """Calculate the calibration curve."""
        if len(self.calibration_data) < 100:
            return {"warning": "Insufficient data for calibration"}

        confidences = np.array([c[0] for c in self.calibration_data])
        accuracies = np.array([float(c[1]) for c in self.calibration_data])

        bin_edges = np.linspace(0, 1, self.bins + 1)
        bin_centers = []
        bin_accuracies = []

        for i in range(self.bins):
            mask = (confidences >= bin_edges[i]) & (confidences < bin_edges[i + 1])
            if mask.sum() > 0:
                bin_centers.append((bin_edges[i] + bin_edges[i + 1]) / 2)
                bin_accuracies.append(accuracies[mask].mean())

        return {
            "predicted_confidence": bin_centers,
            "actual_accuracy": bin_accuracies,
            "expected_calibration_error": self._calculate_ece(confidences, accuracies),
        }

    def _calculate_ece(self, confidences: np.ndarray, accuracies: np.ndarray) -> float:
        """Calculate Expected Calibration Error."""
        ece = 0.0
        bin_edges = np.linspace(0, 1, self.bins + 1)

        for i in range(self.bins):
            mask = (confidences >= bin_edges[i]) & (confidences < bin_edges[i + 1])
            if mask.sum() > 0:
                avg_confidence = confidences[mask].mean()
                avg_accuracy = accuracies[mask].mean()
                ece += mask.sum() * abs(avg_accuracy - avg_confidence)

        return float(ece / len(confidences))

    def calibrate(self, raw_confidence: float) -> float:
        """Apply calibration to a raw confidence score."""
        # TODO: Implement Platt scaling or isotonic regression
        # For now, return the raw confidence unchanged
        return raw_confidence
```
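To make ECE concrete, here is a dependency-free rendition of the same binning arithmetic (the `ece` function is a standalone illustration matching _calculate_ece) on a tiny synthetic sample where 0.95-confidence predictions are only 80% correct:

```python
def ece(samples, bins=10):
    """Expected Calibration Error over (confidence, was_correct) pairs."""
    total = len(samples)
    score = 0.0
    for i in range(bins):
        lo, hi = i / bins, (i + 1) / bins
        bucket = [(c, ok) for c, ok in samples if lo <= c < hi]
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            avg_acc = sum(ok for _, ok in bucket) / len(bucket)
            score += len(bucket) * abs(avg_acc - avg_conf)
    return score / total

# 0.95-confidence bin: 80% correct (0.15 gap); 0.65 bin: 60% correct (0.05 gap)
samples = ([(0.95, True)] * 8 + [(0.95, False)] * 2
           + [(0.65, True)] * 6 + [(0.65, False)] * 4)
print(round(ece(samples), 2))  # 0.1 - the sample-weighted average of the per-bin gaps
```

An ECE of 0.10 here is well above the 0.05 target in the metrics section: the high-confidence bin is overconfident by 15 points.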

Expected Impact

  • Honest confidence: Scores reflect actual accuracy
  • Better decisions: Escalation decisions based on calibrated confidence
  • Transparency: Users know what a confidence score actually means

Enhancement 6: Additional Judge Types (P2)

Current State

Only 3 judges in judges/base.py:

  • ConsistencyJudge (cross-reference)
  • QualityJudge (vote distribution)
  • DomainJudge (CODITECT rules)

Add specialized judges:

```python
# Proposed: judges/specialized_judges.py

class FrontmatterJudge(BaseJudge):
    """
    Validates classification against frontmatter metadata.
    Vetoes if frontmatter explicitly contradicts the classification.
    """

    name = "frontmatter"
    has_veto_authority = True
    weight = 1.5  # Higher weight - frontmatter is authoritative

    def evaluate(self, document: Document, votes: List[AnalystVote],
                 consensus: ConsensusResult) -> JudgeDecision:
        frontmatter = document.frontmatter

        # Check explicit type declarations
        declared_type = (
            frontmatter.get('component_type') or
            frontmatter.get('type') or
            frontmatter.get('doc_type')
        )

        if declared_type and declared_type != consensus.classification:
            return JudgeDecision(
                judge=self.name,
                approved=False,  # VETO
                reason=f"Frontmatter declares '{declared_type}' but classified as '{consensus.classification}'",
                confidence=0.95,
            )

        return JudgeDecision(
            judge=self.name,
            approved=True,
            reason="No frontmatter conflict",
            confidence=0.9,
        )


class DirectoryConventionJudge(BaseJudge):
    """
    Validates classification against directory placement.
    """

    name = "directory"
    has_veto_authority = False  # Advisory only
    weight = 0.8

    DIRECTORY_CONVENTIONS = {
        "agents/": "agent",
        "commands/": "command",
        "skills/": "skill",
        "workflows/": "workflow",
        "adrs/": "adr",
        "guides/": "guide",
    }

    def evaluate(self, document: Document, votes: List[AnalystVote],
                 consensus: ConsensusResult) -> JudgeDecision:
        path = str(document.path).lower()

        for dir_pattern, expected_type in self.DIRECTORY_CONVENTIONS.items():
            if dir_pattern in path:
                if consensus.classification != expected_type:
                    return JudgeDecision(
                        judge=self.name,
                        approved=False,
                        reason=f"Directory '{dir_pattern}' suggests '{expected_type}'",
                        confidence=0.7,
                    )

        return JudgeDecision(
            judge=self.name,
            approved=True,
            reason="Directory placement consistent",
            confidence=0.8,
        )


class HistoricalPatternJudge(BaseJudge):
    """
    Compares against historical classifications of similar documents.
    """

    name = "historical"
    has_veto_authority = False
    weight = 0.7

    def __init__(self, learner: ClassificationLearner):
        super().__init__()
        self.learner = learner

    def evaluate(self, document: Document, votes: List[AnalystVote],
                 consensus: ConsensusResult) -> JudgeDecision:
        # Check whether similar documents were classified differently
        similar = self.learner.find_similar_by_path(document.path)

        if similar:
            historical_type = similar[0]['actual_type']
            if historical_type and historical_type != consensus.classification:
                return JudgeDecision(
                    judge=self.name,
                    approved=False,
                    reason=f"Similar document was classified as '{historical_type}'",
                    confidence=0.6,
                )

        return JudgeDecision(
            judge=self.name,
            approved=True,
            reason="No conflicting historical patterns",
            confidence=0.7,
        )
```
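The precedence rule in FrontmatterJudge (component_type over type over doc_type, veto on mismatch) can be isolated into a small function. `frontmatter_verdict` is invented here for illustration and is not part of the judge API:

```python
def frontmatter_verdict(frontmatter: dict, classification: str):
    """Return (approved, reason) following FrontmatterJudge's precedence rule."""
    declared = (
        frontmatter.get("component_type")
        or frontmatter.get("type")
        or frontmatter.get("doc_type")
    )
    if declared and declared != classification:
        return False, f"Frontmatter declares '{declared}' but classified as '{classification}'"
    return True, "No frontmatter conflict"

print(frontmatter_verdict({"type": "guide"}, "agent")[0])  # False - veto
print(frontmatter_verdict({}, "agent")[0])                 # True - nothing declared
# component_type outranks a conflicting type key:
print(frontmatter_verdict({"component_type": "command", "type": "guide"}, "command")[0])  # True
```

The `or` chain means the first non-empty key wins, so a stale lower-priority key cannot trigger a spurious veto.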

Expected Impact

  • Frontmatter authority: Explicit declarations override inference
  • Convention compliance: Ensure directory structure alignment
  • Historical consistency: Learn from past decisions

Enhancement 7: Batch Corpus Analysis (P2)

Current State

Each document is classified independently; there is no corpus-level analysis.

Implement batch processing with cross-document insights:

```python
# Proposed: core/batch_processor.py
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class BatchCorpusAnalyzer:
    """
    Analyzes a document corpus for cross-document patterns.
    """

    def __init__(self, orchestrator: MoEOrchestrator):
        self.orchestrator = orchestrator

    def analyze_corpus(self, documents: List[Document]) -> Dict:
        """Analyze the entire corpus for patterns before classification."""

        # Phase 1: Pre-scan for corpus characteristics
        corpus_profile = self._profile_corpus(documents)

        # Phase 2: Identify clusters of similar documents
        clusters = self._cluster_documents(documents)

        # Phase 3: Classify with corpus context
        results = []
        with ThreadPoolExecutor(max_workers=5) as executor:
            futures = []
            for doc in documents:
                cluster_context = clusters.get(doc.path, {})
                future = executor.submit(
                    self._classify_with_context, doc, corpus_profile, cluster_context
                )
                futures.append(future)

            for future in futures:
                results.append(future.result())

        return {
            "corpus_profile": corpus_profile,
            "clusters": len(clusters),
            "results": results,
        }

    def _profile_corpus(self, documents: List[Document]) -> Dict:
        """Build a corpus profile."""
        directory_patterns = {}

        for doc in documents:
            # Analyze directory patterns
            parent = str(doc.path.parent)
            if parent not in directory_patterns:
                directory_patterns[parent] = []
            directory_patterns[parent].append(doc.filename)

        return {
            "document_count": len(documents),
            "directory_count": len(directory_patterns),
            "directories": directory_patterns,
        }

    def _cluster_documents(self, documents: List[Document]) -> Dict:
        """Cluster similar documents."""
        # TODO: Implement clustering when embeddings are available
        return {}

    def _classify_with_context(self, doc: Document,
                               corpus_profile: Dict,
                               cluster_context: Dict) -> ClassificationResult:
        """Classify with corpus awareness."""
        # Inject corpus context into classification
        return self.orchestrator.classify(doc)
```
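The profiling step in _profile_corpus reduces to grouping filenames by parent directory. A standalone sketch over plain path strings (the `profile_paths` helper is illustrative):

```python
from collections import defaultdict
from pathlib import Path

def profile_paths(paths):
    """Group filenames by parent directory, as _profile_corpus does."""
    directory_patterns = defaultdict(list)
    for p in map(Path, paths):
        directory_patterns[str(p.parent)].append(p.name)
    return {
        "document_count": len(paths),
        "directory_count": len(directory_patterns),
        "directories": dict(directory_patterns),
    }

profile = profile_paths(["agents/reviewer.md", "agents/planner.md", "guides/setup.md"])
print(profile["directory_count"])        # 2
print(profile["directories"]["agents"])  # ['reviewer.md', 'planner.md']
```

This corpus profile is what gives the DirectoryConventionJudge-style signals something to lean on when individual documents are ambiguous.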

Expected Impact

  • Corpus awareness: Understand document ecosystem
  • Cluster consistency: Similar docs get similar types
  • Efficiency: Batch processing optimizations

Implementation Roadmap

Phase 1: Foundation (P0) - Weeks 1-2

| Task | Effort | Dependencies |
|---|---|---|
| Add sentence-transformers dependency | 1d | None |
| Implement SemanticEmbeddingService | 3d | Dependency |
| Create classification_outcomes table | 1d | None |
| Implement ClassificationLearner | 3d | Table |
| Add confirm_classification endpoint | 1d | Learner |

Phase 2: Integration (P1) - Weeks 3-4

| Task | Effort | Dependencies |
|---|---|---|
| Integrate with context.db | 2d | Phase 1 |
| Implement AdaptiveThresholdConfig | 2d | Phase 1 |
| Implement ConfidenceCalibrator | 3d | Phase 1 |
| Add calibration dashboard | 2d | Calibrator |

Phase 3: Enhancement (P2) - Weeks 5-6

| Task | Effort | Dependencies |
|---|---|---|
| Implement FrontmatterJudge | 1d | None |
| Implement DirectoryConventionJudge | 1d | None |
| Implement HistoricalPatternJudge | 2d | Phase 1 |
| Implement BatchCorpusAnalyzer | 3d | Phase 1-2 |

Metrics & Success Criteria

Baseline Metrics (Current)

| Metric | Current Value | Source |
|---|---|---|
| Auto-approval rate | ~85% | Estimated |
| Escalation rate | ~15% | Estimated |
| Human review required | ~5% | Estimated |
| Confidence calibration | Unknown | Not measured |

Target Metrics (Post-Enhancement)

| Metric | Target | Improvement |
|---|---|---|
| Auto-approval rate | ≥92% | +7% |
| Escalation rate | ≤8% | -7% |
| Human review required | ≤2% | -3% |
| Calibration error (ECE) | ≤0.05 | New metric |
| Analyst accuracy tracking | 100% | New capability |

Risk Assessment

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Embedding model latency | Medium | Medium | Caching, async processing |
| Learning feedback sparsity | High | Medium | Bootstrap with existing data |
| Threshold oscillation | Low | Low | Smoothing, minimum samples |
| Memory consumption | Medium | Low | Lazy loading, cleanup |


Document Version: 1.0.0
Last Updated: December 31, 2025
Author: CODITECT Analysis System
Certainty Level: HIGH (95%) - Based on comprehensive codebase analysis