ADR-100: CODITECT Master Index Search and Retrieval System
Status: Proposed Date: December 16, 2025 Author: AZ1.AI INC / CODITECT Architecture Team Supersedes: None Related ADRs: ADR-029 (FoundationDB Data Model), ADR-066 (Ephemeral Workspaces)
Executive Summary
This Architecture Decision Record defines the CODITECT Master Index System - a comprehensive platform-wide search and retrieval system that indexes every markdown document, code file, agent definition, command, skill, script, and configuration across the entire CODITECT ecosystem. The system provides unified semantic search, knowledge graph navigation, and RAG (Retrieval-Augmented Generation) capabilities for both human users and AI agents.
Context
Current State
The CODITECT platform currently has:
- 57 submodules across 8 category folders
- 560+ framework components (agents, commands, skills, scripts, hooks)
- 1,096 N8N workflows across 29 industries
- Thousands of markdown documents scattered across repositories
- 100K+ lines of Rust/Python/TypeScript code
Problem Statement
- Fragmented Discovery: No unified way to search across all CODITECT repositories
- Context Loss: AI agents cannot reference documents outside current scope
- Duplicate Content: Same patterns documented in multiple places without linking
- No Semantic Understanding: Current search is keyword-based only
- Missing Relationships: No graph of document relationships and dependencies
Requirements
- Universal Indexing: Index ALL content across 57 submodules
- Semantic Search: Find documents by meaning, not just keywords
- Knowledge Graph: Map relationships between documents, code, and components
- RAG Integration: Enable AI agents to retrieve relevant context automatically
- Real-time Updates: Incremental indexing on git push
- Multi-tenant Ready: Scale to enterprise with tenant isolation
Decision
We will implement a three-layer Master Index System:
Layer 1: Document Indexer
Technology: Python + SQLite/PostgreSQL + Full-Text Search
Scope:
- All .md files across 57 submodules
- All component definitions (agents/, commands/, skills/)
- Configuration files (JSON, YAML, TOML)
- Code documentation (docstrings, comments)
Schema:
CREATE TABLE documents (
id UUID PRIMARY KEY,
path TEXT UNIQUE NOT NULL,
title TEXT,
content TEXT NOT NULL,
content_hash TEXT NOT NULL,
document_type VARCHAR(50), -- 'markdown', 'agent', 'command', 'skill', 'code'
submodule VARCHAR(100),
category VARCHAR(50), -- 'core', 'cloud', 'labs', etc.
frontmatter JSONB,
word_count INTEGER,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ,
indexed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE document_sections (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
heading TEXT,
heading_level INTEGER,
content TEXT NOT NULL,
position INTEGER,
parent_section_id UUID REFERENCES document_sections(id)
);
CREATE TABLE document_links (
id UUID PRIMARY KEY,
source_document_id UUID REFERENCES documents(id),
target_document_id UUID REFERENCES documents(id),
link_text TEXT,
link_type VARCHAR(50), -- 'reference', 'see_also', 'import', 'extends'
is_broken BOOLEAN DEFAULT FALSE
);
-- Full-text search
CREATE INDEX idx_documents_fts ON documents
USING GIN (to_tsvector('english', title || ' ' || content));
CREATE INDEX idx_sections_fts ON document_sections
USING GIN (to_tsvector('english', heading || ' ' || content));
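Phase 1 names SQLite/FTS5 as the lightweight alternative to the PostgreSQL tsvector indexes above. A minimal sketch of the same full-text layer using Python's stdlib sqlite3 with an FTS5 virtual table (table name, sample rows, and the `search` helper are illustrative, not part of the ADR's schema):

```python
import sqlite3

# In-memory SQLite database standing in for the Phase 1 document index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE documents_fts USING fts5(title, content, path UNINDEXED)"
)
conn.executemany(
    "INSERT INTO documents_fts (title, content, path) VALUES (?, ?, ?)",
    [
        ("ADR-029 Data Model", "FoundationDB multi-tenant key design", "docs/adr-029.md"),
        ("Orchestrator Agent", "Coordinates sub-agents via task queue", "agents/orchestrator.md"),
    ],
)

def search(query: str, limit: int = 20) -> list[str]:
    """Rank matches with BM25 (lower score = better) and return document paths."""
    rows = conn.execute(
        "SELECT path FROM documents_fts WHERE documents_fts MATCH ? "
        "ORDER BY bm25(documents_fts) LIMIT ?",
        (query, limit),
    )
    return [r[0] for r in rows]

print(search("foundationdb"))
```

The PostgreSQL GIN indexes above serve the same role at scale; FTS5 keeps the MVP dependency-free.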
Layer 2: Semantic Embeddings
Technology: Sentence Transformers + Vector Database (ChromaDB/Faiss/pgvector)
Purpose: Enable meaning-based search beyond keyword matching
Schema:
CREATE TABLE embeddings (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
section_id UUID REFERENCES document_sections(id),
chunk_text TEXT NOT NULL,
chunk_index INTEGER,
embedding vector(1536), -- OpenAI ada-002 dimensions
model VARCHAR(100),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Vector similarity index (pgvector)
CREATE INDEX idx_embeddings_vector ON embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Embedding Strategy:
- Chunk documents by section (respect heading boundaries)
- Maximum chunk size: 512 tokens
- Overlap: 50 tokens between chunks
- Model: text-embedding-ada-002 (OpenAI) or all-MiniLM-L6-v2 (local); note that all-MiniLM-L6-v2 produces 384-dimensional vectors, so the embeddings table's vector(1536) column must match the chosen model
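The chunking strategy above (512-token chunks, 50-token overlap, applied per section) can be sketched as follows. Whitespace-split words stand in for tokens here; a real pipeline would use the embedding model's tokenizer, so the function name and word-based counting are assumptions for illustration:

```python
def chunk_section(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split one section's text into overlapping chunks.

    Words approximate tokens (an assumption); sections are chunked
    independently so heading boundaries are never crossed.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [" ".join(words)] if words else []
    chunks = []
    step = max_tokens - overlap  # each chunk advances by 462 words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # final chunk reached the end of the section
    return chunks
```

Because documents are chunked section by section, an embedding always maps back to exactly one `document_sections` row.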
Layer 3: Knowledge Graph
Technology: FoundationDB / Neo4j / NetworkX
Purpose: Map relationships between all platform entities
Graph Schema:
Nodes:
├── Document (path, title, type)
├── Component (name, type, version)
├── Submodule (name, category, url)
├── Concept (name, domain)
├── Entity (name, type) -- extracted entities
└── Author (name, email)
Edges:
├── CONTAINS (Submodule → Document)
├── REFERENCES (Document → Document)
├── DEFINES (Document → Component)
├── IMPLEMENTS (Code → Component)
├── DEPENDS_ON (Component → Component)
├── RELATES_TO (Concept → Concept)
├── MENTIONS (Document → Entity)
└── AUTHORED_BY (Document → Author)
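The node and edge types above can be modeled minimally before committing to Neo4j or FDB. A stdlib-only sketch (class and node-ID convention are hypothetical, not the ADR's API) storing typed edges with one-hop lookup:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Minimal in-memory stand-in for the graph layer.

    Node IDs are strings like "doc:adr-029"; edge types are the
    labels from the graph schema (CONTAINS, DEFINES, DEPENDS_ON, ...).
    """
    def __init__(self) -> None:
        self.edges: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self.edges[(source, edge_type)].add(target)

    def neighbors(self, source: str, edge_type: str) -> set[str]:
        return self.edges[(source, edge_type)]

g = KnowledgeGraph()
g.add_edge("submodule:coditect-core", "CONTAINS", "doc:adr-029")
g.add_edge("doc:adr-029", "DEFINES", "component:foundationdb-expert")
g.add_edge("component:foundationdb-expert", "DEPENDS_ON", "component:database-architect")

print(sorted(g.neighbors("doc:adr-029", "DEFINES")))
```

The same adjacency shape maps directly onto the `/graph/edges/{edge_type}/{source_id}/{target_id}` key layout below, or onto NetworkX for traversal queries in Phase 3.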
FoundationDB Key Design:
/master_index/
├── /documents/{hash}/metadata
├── /documents/{hash}/content
├── /documents/{hash}/sections/{section_id}
├── /embeddings/{document_hash}/{chunk_index}
├── /graph/nodes/{node_type}/{node_id}
├── /graph/edges/{edge_type}/{source_id}/{target_id}
├── /search/inverted/{term}/{document_hash}
└── /stats/{metric}/{timestamp}
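The key layout above turns every lookup into a prefix scan. A readable approximation of the key builders (slash-joined strings; production FoundationDB code would use the tuple layer, `fdb.tuple.pack`, for ordering-safe encoding, and these function names are illustrative):

```python
PREFIX = "/master_index"

def doc_metadata_key(doc_hash: str) -> str:
    """Key for one document's metadata record."""
    return f"{PREFIX}/documents/{doc_hash}/metadata"

def embedding_key(doc_hash: str, chunk_index: int) -> str:
    """Key for one embedding chunk; all chunks of a document share a prefix."""
    return f"{PREFIX}/embeddings/{doc_hash}/{chunk_index}"

def edge_key(edge_type: str, source_id: str, target_id: str) -> str:
    """Key for one typed graph edge; a range scan over
    f"{PREFIX}/graph/edges/{edge_type}/{source_id}/" lists all targets."""
    return f"{PREFIX}/graph/edges/{edge_type}/{source_id}/{target_id}"

print(embedding_key("abc123", 0))
```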
Architecture
System Components
┌─────────────────────────────────────────────────────────────────┐
│ CODITECT Master Index System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scanner │ │ Parser │ │ Indexer │ │
│ │ (Git Walk) │──│ (Markdown) │──│ (SQL/FDB) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedder │ │ Graph │ │ Search │ │
│ │ (Semantic) │──│ (Neo4j) │──│ (Query) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ RAG API │ │
│ │ Endpoint │ │
│ └─────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Claude Code │ │ Web UI │ │ AI Agents │ │
│ │ CLI │ │ Search │ │ Context │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Data Flow
Implementation Plan
Phase 1: Document Indexer (Week 1-2)
Deliverables:
- Git repository scanner (walk all 57 submodules)
- Markdown parser with frontmatter extraction
- SQLite/PostgreSQL schema migration
- FTS5/tsvector indexing
- CLI: coditect-index scan --all
- CLI: coditect-index search "query"
Files:
- scripts/master-index/scanner.py
- scripts/master-index/parser.py
- scripts/master-index/indexer.py
- scripts/master-index/cli.py
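A hedged sketch of what scanner.py might do: walk a checkout, collect every .md file, and compute the fields the `documents` table needs (`path`, `content_hash`, `word_count`). A real scanner would also recurse into the 57 git submodules and honor .gitignore; this minimal version only handles a plain directory tree:

```python
import hashlib
from pathlib import Path

def scan_markdown(root: str) -> list[dict]:
    """Collect per-document index fields from every .md file under root."""
    records = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        records.append({
            "path": str(path.relative_to(root)),
            # SHA-256 of the content drives incremental re-indexing
            "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "word_count": len(text.split()),
        })
    return records
```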
Phase 2: Semantic Embeddings (Week 3-4)
Deliverables:
- Chunking strategy (section-aware, 512 tokens)
- Embedding generation (OpenAI or local model)
- Vector storage (ChromaDB or pgvector)
- Similarity search API
- CLI: coditect-index embed --incremental
- CLI: coditect-index semantic "query"
Phase 3: Knowledge Graph (Week 5-6)
Deliverables:
- Entity extraction (components, concepts, authors)
- Relationship mapping (references, dependencies)
- Graph database integration (Neo4j or FDB)
- Graph traversal queries
- CLI: coditect-index graph --visualize
Phase 4: RAG API (Week 7-8)
Deliverables:
- REST API for search and retrieval
- Unified query endpoint (FTS + semantic + graph)
- Context assembly for AI agents
- Claude Code integration (/cxq --master-index)
- Web UI for visual search
API Design
Search Endpoint
POST /api/v1/master-index/search
Content-Type: application/json
{
"query": "FoundationDB multi-tenant patterns",
"search_type": "hybrid", // "fts", "semantic", "graph", "hybrid"
"filters": {
"document_type": ["markdown", "agent"],
"submodule": ["coditect-core", "coditect-labs-v4-archive"],
"category": ["core", "labs"]
},
"limit": 20,
"include_context": true,
"context_window": 500
}
Response:
{
"results": [
{
"document": {
"id": "uuid",
"path": "submodules/labs/coditect-labs-v4-archive/docs/architecture/adrs/ADR-029-v4-foundationdb-issue-tracking-data-model-part2-technical.md",
"title": "ADR-029: FoundationDB Issue Tracking Data Model",
"type": "markdown",
"submodule": "coditect-labs-v4-archive"
},
"relevance_score": 0.94,
"match_type": "semantic",
"context": "Multi-tenant key design uses prefix /{tenant_id}/ for all entity types...",
"related_documents": [
"agents/foundationdb-expert.md",
"skills/foundationdb-queries/SKILL.md"
]
}
],
"graph_context": {
"concepts": ["multi-tenant", "ACID transactions", "key design"],
"related_components": ["foundationdb-expert", "database-architect"],
"dependency_chain": ["coditect-core", "coditect-labs-v4-archive"]
},
"total_results": 47,
"search_time_ms": 45
}
Claude Code Integration
# New slash command
/cxq --master-index "FoundationDB patterns"
# Or via cxq with flag
/cxq --scope master "database architecture"
# List all indexed documents
/cxq --master-index --list --type agent
# Graph traversal
/cxq --master-index --related "orchestrator agent"
Quality Metrics
Indexing Coverage
| Metric | Target | Measurement |
|---|---|---|
| Document coverage | 100% | All .md files indexed |
| Code coverage | 80% | Docstrings and key files |
| Component coverage | 100% | All agents, commands, skills |
| Link integrity | 95% | Broken link detection |
Search Quality
| Metric | Target | Measurement |
|---|---|---|
| Precision@10 | >85% | Relevant results in top 10 |
| Recall | >90% | Find relevant documents |
| Latency p95 | <200ms | Search response time |
| Embedding quality | >0.8 | Cosine similarity threshold |
Freshness
| Metric | Target | Measurement |
|---|---|---|
| Index lag | <5 min | Time from commit to indexed |
| Incremental sync | 100% | Only changed files re-indexed |
| Stale detection | <1 day | Identify outdated content |
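The incremental-sync target above relies on the `content_hash` column: only files whose hash differs from the stored value are re-indexed. A minimal sketch of that change detection (mapping names `stored`/`current` and the helper functions are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Same SHA-256 content hash stored in the documents table."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(stored: dict[str, str], current: dict[str, str]) -> set[str]:
    """Paths that are new or whose content changed since the last index run.

    `stored` maps path -> content_hash from the index; `current` is the
    same mapping computed from the working tree after a git push.
    """
    return {p for p, h in current.items() if stored.get(p) != h}

stored = {"a.md": content_hash("old"), "b.md": content_hash("same")}
current = {
    "a.md": content_hash("new"),    # modified -> re-index
    "b.md": content_hash("same"),   # unchanged -> skip
    "c.md": content_hash("added"),  # new file -> index
}
print(sorted(needs_reindex(stored, current)))  # → ['a.md', 'c.md']
```

Paths present in `stored` but absent from `current` would be the deletion set, which also drives the embedding-purge rule under Data Protection.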
Security Considerations
Access Control
- Read Access: All indexed content is read from git (existing permissions)
- Query Access: API requires authentication (future: tenant-scoped)
- Embedding Storage: Vectors don't expose raw content
- Graph Traversal: Respects document-level permissions
Data Protection
- No PII in Index: Exclude credentials, secrets, personal data
- Content Hashing: Detect unauthorized modifications
- Audit Logging: Track all search queries
- Retention: Purge embeddings when documents deleted
Alternatives Considered
Alternative 1: Elasticsearch
Pros: Mature, scalable, rich query language Cons: Heavy infrastructure, operational complexity, cost Decision: Rejected for MVP; consider for enterprise tier
Alternative 2: Algolia
Pros: Managed service, fast, good UI Cons: Vendor lock-in, cost per operation, no semantic search Decision: Rejected due to cost and limited semantic capabilities
Alternative 3: Pinecone
Pros: Managed vector database, scalable Cons: No FTS, cost per vector, requires separate FTS Decision: Rejected; prefer unified solution
Selected Approach: Hybrid (PostgreSQL + ChromaDB + Graph)
Rationale:
- PostgreSQL provides robust FTS and structured data
- ChromaDB handles semantic embeddings (self-hosted)
- Graph layer maps relationships
- All components can scale to enterprise via FDB
Dependencies
Infrastructure
- PostgreSQL 14+ (with pgvector extension)
- ChromaDB or Faiss (vector search)
- Redis (caching, optional)
- Neo4j or FoundationDB (graph, optional for Phase 3)
Python Libraries
- sentence-transformers - Local embeddings
- chromadb - Vector database
- markdown-it-py - Markdown parsing
- pyyaml - Frontmatter parsing
- networkx - Graph operations
- fastapi - API server
External Services (Optional)
- OpenAI API - High-quality embeddings
- Anthropic API - Entity extraction
Success Criteria
MVP (Phase 1-2)
- All 57 submodules indexed
- FTS search operational
- Semantic search operational
- CLI integration complete
- Documentation complete
Enterprise (Phase 3-4)
- Knowledge graph operational
- RAG API deployed
- Claude Code integration
- Multi-tenant isolation
- <200ms p95 latency
Future
- FoundationDB backend
- Real-time sync
- 1M+ document capacity
- Global distribution
References
- ADR-029: FoundationDB Issue Tracking Data Model
- MULTI-TENANT-CONTEXT-architecture.md
- Sentence Transformers Documentation
- ChromaDB Documentation
- pgvector Extension
Owner: AZ1.AI INC Lead: Hal Casteel, Founder/CEO/CTO Review Date: December 23, 2025 Implementation Start: TBD (pending approval)