
ADR-100: CODITECT Master Index Search and Retrieval System

Status: Proposed
Date: December 16, 2025
Author: AZ1.AI INC / CODITECT Architecture Team
Supersedes: None
Related ADRs: ADR-029 (FoundationDB Data Model), ADR-066 (Ephemeral Workspaces)


Executive Summary

This Architecture Decision Record defines the CODITECT Master Index System: a comprehensive platform-wide search and retrieval system that indexes every markdown document, code file, agent definition, command, skill, script, and configuration across the entire CODITECT ecosystem. The system provides unified semantic search, knowledge graph navigation, and RAG (Retrieval-Augmented Generation) capabilities for both human users and AI agents.


Context

Current State

The CODITECT platform currently has:

  • 57 submodules across 8 category folders
  • 560+ framework components (agents, commands, skills, scripts, hooks)
  • 1,096 N8N workflows across 29 industries
  • Thousands of markdown documents scattered across repositories
  • 100K+ lines of Rust/Python/TypeScript code

Problem Statement

  1. Fragmented Discovery: No unified way to search across all CODITECT repositories
  2. Context Loss: AI agents cannot reference documents outside current scope
  3. Duplicate Content: Same patterns documented in multiple places without linking
  4. No Semantic Understanding: Current search is keyword-based only
  5. Missing Relationships: No graph of document relationships and dependencies

Requirements

  1. Universal Indexing: Index ALL content across 57 submodules
  2. Semantic Search: Find documents by meaning, not just keywords
  3. Knowledge Graph: Map relationships between documents, code, and components
  4. RAG Integration: Enable AI agents to retrieve relevant context automatically
  5. Real-time Updates: Incremental indexing on git push
  6. Multi-tenant Ready: Scale to enterprise with tenant isolation

Decision

We will implement a three-layer Master Index System:

Layer 1: Document Indexer

Technology: Python + SQLite/PostgreSQL + Full-Text Search

Scope:

  • All .md files across 57 submodules
  • All component definitions (agents/, commands/, skills/)
  • Configuration files (JSON, YAML, TOML)
  • Code documentation (docstrings, comments)

Schema:

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    document_type VARCHAR(50), -- 'markdown', 'agent', 'command', 'skill', 'code'
    submodule VARCHAR(100),
    category VARCHAR(50), -- 'core', 'cloud', 'labs', etc.
    frontmatter JSONB,
    word_count INTEGER,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ,
    indexed_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE document_sections (
    id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(id),
    heading TEXT,
    heading_level INTEGER,
    content TEXT NOT NULL,
    position INTEGER,
    parent_section_id UUID REFERENCES document_sections(id)
);

CREATE TABLE document_links (
    id UUID PRIMARY KEY,
    source_document_id UUID REFERENCES documents(id),
    target_document_id UUID REFERENCES documents(id),
    link_text TEXT,
    link_type VARCHAR(50), -- 'reference', 'see_also', 'import', 'extends'
    is_broken BOOLEAN DEFAULT FALSE
);

-- Full-text search (COALESCE guards the nullable title/heading columns,
-- which would otherwise make the whole tsvector NULL)
CREATE INDEX idx_documents_fts ON documents
USING GIN (to_tsvector('english', COALESCE(title, '') || ' ' || content));

CREATE INDEX idx_sections_fts ON document_sections
USING GIN (to_tsvector('english', COALESCE(heading, '') || ' ' || content));
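The FTS path can be exercised end-to-end with the SQLite option listed under Phase 1. A minimal sketch, assuming an FTS5 virtual table whose columns mirror the schema above (the FTS5 layout itself is an illustration, not part of the ADR):

```python
import sqlite3

# Assumed FTS5 layout mirroring the documents schema above; requires a
# Python build with the SQLite FTS5 extension compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE documents_fts USING fts5(path, title, content)")
conn.executemany(
    "INSERT INTO documents_fts (path, title, content) VALUES (?, ?, ?)",
    [
        ("docs/adr-029.md", "FoundationDB Data Model",
         "Multi-tenant key design uses tenant prefixes."),
        ("docs/adr-066.md", "Ephemeral Workspaces",
         "Workspaces are created on demand and garbage collected."),
    ],
)

# Rank matching documents by BM25 relevance, best first.
rows = conn.execute(
    "SELECT path FROM documents_fts WHERE documents_fts MATCH ? "
    "ORDER BY bm25(documents_fts)",
    ("tenant",),
).fetchall()
print(rows)  # [('docs/adr-029.md',)]
```

The PostgreSQL deployment replaces this with the `to_tsvector` GIN indexes above; the query shape (match, then rank) is the same.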

Layer 2: Semantic Embeddings

Technology: Sentence Transformers + Vector Database (ChromaDB/Faiss/pgvector)

Purpose: Enable meaning-based search beyond keyword matching

Schema:

CREATE TABLE embeddings (
    id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(id),
    section_id UUID REFERENCES document_sections(id),
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(1536), -- OpenAI ada-002 dimensions
    model VARCHAR(100),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Vector similarity index (pgvector)
CREATE INDEX idx_embeddings_vector ON embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
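`vector_cosine_ops` orders results by cosine distance. The underlying metric is easy to state in plain Python; this is illustrative only, since real queries run inside pgvector against 1536-dimensional vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: the metric behind vector_cosine_ops above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real 1536-dim embeddings.
query = [1.0, 0.0, 1.0]
chunks = {"chunk-a": [1.0, 0.0, 1.0], "chunk-b": [0.0, 1.0, 0.0]}
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # chunk-a
```

The ivfflat index approximates this ranking by probing only the nearest of the 100 cluster lists rather than scanning every row.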

Embedding Strategy:

  1. Chunk documents by section (respect heading boundaries)
  2. Maximum chunk size: 512 tokens
  3. Overlap: 50 tokens between chunks
  4. Model: text-embedding-ada-002 (OpenAI) or all-MiniLM-L6-v2 (local)
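A minimal sketch of the sliding-window part of this strategy, approximating tokens with whitespace-separated words (production code would count tokens with the embedding model's tokenizer and respect heading boundaries first):

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into overlapping windows; words approximate tokens here."""
    words = text.split()
    step = max_tokens - overlap  # advance so consecutive chunks share `overlap` words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covers the tail
    return chunks

# 1000 synthetic words -> windows starting at 0, 462, and 924.
chunks = chunk_text(" ".join(f"w{i}" for i in range(1000)))
print(len(chunks))  # 3
```

Each chunk's text and index then become one row in the embeddings table above.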

Layer 3: Knowledge Graph

Technology: FoundationDB / Neo4j / NetworkX

Purpose: Map relationships between all platform entities

Graph Schema:

Nodes:
├── Document (path, title, type)
├── Component (name, type, version)
├── Submodule (name, category, url)
├── Concept (name, domain)
├── Entity (name, type) -- extracted entities
└── Author (name, email)

Edges:
├── CONTAINS (Submodule → Document)
├── REFERENCES (Document → Document)
├── DEFINES (Document → Component)
├── IMPLEMENTS (Code → Component)
├── DEPENDS_ON (Component → Component)
├── RELATES_TO (Concept → Concept)
├── MENTIONS (Document → Entity)
└── AUTHORED_BY (Document → Author)
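The node and edge types above can be modeled with a plain adjacency list before committing to Neo4j or FoundationDB; `neighbors` and `reachable` below are hypothetical helpers sketching the traversal queries, and the sample edges are invented for illustration:

```python
# (source, edge_type, target) triples using the edge vocabulary above.
edges = [
    ("coditect-core", "CONTAINS", "docs/orchestrator.md"),
    ("docs/orchestrator.md", "DEFINES", "orchestrator"),
    ("orchestrator", "DEPENDS_ON", "foundationdb-expert"),
    ("docs/adr-029.md", "DEFINES", "foundationdb-expert"),
]

def neighbors(node, edge_type=None):
    """Outgoing targets from node, optionally filtered by edge type."""
    return [t for (s, e, t) in edges if s == node and edge_type in (None, e)]

def reachable(start):
    """Every node reachable from start via any edge type (BFS)."""
    seen, queue = set(), [start]
    while queue:
        node = queue.pop(0)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("coditect-core")))
# ['docs/orchestrator.md', 'foundationdb-expert', 'orchestrator']
```

NetworkX (listed under Dependencies) provides the same operations with real graph algorithms layered on top.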

FoundationDB Key Design:

/master_index/
├── /documents/{hash}/metadata
├── /documents/{hash}/content
├── /documents/{hash}/sections/{section_id}
├── /embeddings/{document_hash}/{chunk_index}
├── /graph/nodes/{node_type}/{node_id}
├── /graph/edges/{edge_type}/{source_id}/{target_id}
├── /search/inverted/{term}/{document_hash}
└── /stats/{metric}/{timestamp}
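One way to render that layout as concrete keys; `index_key` is a hypothetical helper, and real FoundationDB code would pack the parts with the tuple layer (`fdb.tuple`) or the directory layer rather than joining strings:

```python
# Hypothetical key builder mirroring the /master_index/ layout above.
def index_key(*parts):
    return "/master_index/" + "/".join(str(p) for p in parts)

doc_hash = "a1b2c3"  # content hash standing in for a real document
keys = [
    index_key("documents", doc_hash, "metadata"),
    index_key("embeddings", doc_hash, 0),
    index_key("graph", "edges", "REFERENCES", doc_hash, "d4e5f6"),
]
print(keys[0])  # /master_index/documents/a1b2c3/metadata
```

Because FoundationDB range-reads are prefix-based, this layout lets one range scan fetch everything under a document hash or all edges of a given type.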

Architecture

System Components

┌─────────────────────────────────────────────────────────────────┐
│ CODITECT Master Index System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scanner │ │ Parser │ │ Indexer │ │
│ │ (Git Walk) │──│ (Markdown) │──│ (SQL/FDB) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedder │ │ Graph │ │ Search │ │
│ │ (Semantic) │──│ (Neo4j) │──│ (Query) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ RAG API │ │
│ │ Endpoint │ │
│ └─────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Claude Code │ │ Web UI │ │ AI Agents │ │
│ │ CLI │ │ Search │ │ Context │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Data Flow


Implementation Plan

Phase 1: Document Indexer (Week 1-2)

Deliverables:

  • Git repository scanner (walk all 57 submodules)
  • Markdown parser with frontmatter extraction
  • SQLite/PostgreSQL schema migration
  • FTS5/tsvector indexing
  • CLI: coditect-index scan --all
  • CLI: coditect-index search "query"

Files:

  • scripts/master-index/scanner.py
  • scripts/master-index/parser.py
  • scripts/master-index/indexer.py
  • scripts/master-index/cli.py
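A sketch of how scanner.py might walk a checkout and split frontmatter from body; `scan_markdown` is a hypothetical helper, and the naive `---` split stands in for the pyyaml parsing listed under Dependencies:

```python
import os
import tempfile

def scan_markdown(root):
    """Yield (path, frontmatter, body) for every .md file under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            frontmatter, body = "", text
            if text.startswith("---\n"):
                # Naive split: before / inside / after the frontmatter fence.
                _, frontmatter, body = text.split("---\n", 2)
            yield path, frontmatter.strip(), body.strip()

# Demo against a throwaway directory with one document.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "adr.md"), "w", encoding="utf-8") as f:
        f.write("---\ntitle: ADR-100\n---\n# Body\n")
    results = list(scan_markdown(root))
print(results[0][1])  # title: ADR-100
```

The real scanner would additionally walk each of the 57 submodules via git and skip ignored paths.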

Phase 2: Semantic Embeddings (Week 3-4)

Deliverables:

  • Chunking strategy (section-aware, 512 tokens)
  • Embedding generation (OpenAI or local model)
  • Vector storage (ChromaDB or pgvector)
  • Similarity search API
  • CLI: coditect-index embed --incremental
  • CLI: coditect-index semantic "query"

Phase 3: Knowledge Graph (Week 5-6)

Deliverables:

  • Entity extraction (components, concepts, authors)
  • Relationship mapping (references, dependencies)
  • Graph database integration (Neo4j or FDB)
  • Graph traversal queries
  • CLI: coditect-index graph --visualize

Phase 4: RAG API (Week 7-8)

Deliverables:

  • REST API for search and retrieval
  • Unified query endpoint (FTS + semantic + graph)
  • Context assembly for AI agents
  • Claude Code integration (/cxq --master-index)
  • Web UI for visual search

API Design

Search Endpoint

POST /api/v1/master-index/search
Content-Type: application/json

{
  "query": "FoundationDB multi-tenant patterns",
  "search_type": "hybrid",  // "fts", "semantic", "graph", "hybrid"
  "filters": {
    "document_type": ["markdown", "agent"],
    "submodule": ["coditect-core", "coditect-labs-v4-archive"],
    "category": ["core", "labs"]
  },
  "limit": 20,
  "include_context": true,
  "context_window": 500
}

Response:

{
  "results": [
    {
      "document": {
        "id": "uuid",
        "path": "submodules/labs/coditect-labs-v4-archive/docs/architecture/adrs/ADR-029-v4-foundationdb-issue-tracking-data-model-part2-technical.md",
        "title": "ADR-029: FoundationDB Issue Tracking Data Model",
        "type": "markdown",
        "submodule": "coditect-labs-v4-archive"
      },
      "relevance_score": 0.94,
      "match_type": "semantic",
      "context": "Multi-tenant key design uses prefix /{tenant_id}/ for all entity types...",
      "related_documents": [
        "agents/foundationdb-expert.md",
        "skills/foundationdb-queries/SKILL.md"
      ]
    }
  ],
  "graph_context": {
    "concepts": ["multi-tenant", "ACID transactions", "key design"],
    "related_components": ["foundationdb-expert", "database-architect"],
    "dependency_chain": ["coditect-core", "coditect-labs-v4-archive"]
  },
  "total_results": 47,
  "search_time_ms": 45
}
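The "hybrid" search_type must merge the FTS and semantic rankings into one result list. The ADR does not mandate a fusion method; reciprocal rank fusion is one common choice, sketched here with invented document names:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both backends beats one ranked well by only one.
fts = ["adr-029.md", "adr-066.md", "skill.md"]
semantic = ["adr-029.md", "skill.md", "agent.md"]
fused = reciprocal_rank_fusion([fts, semantic])
print(fused)  # ['adr-029.md', 'skill.md', 'adr-066.md', 'agent.md']
```

The fused ordering then feeds the `relevance_score` and `match_type` fields in the response above.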

Claude Code Integration

# New slash command
/cxq --master-index "FoundationDB patterns"

# Or via cxq with flag
/cxq --scope master "database architecture"

# List all indexed documents
/cxq --master-index --list --type agent

# Graph traversal
/cxq --master-index --related "orchestrator agent"

Quality Metrics

Indexing Coverage

Metric | Target | Measurement
------ | ------ | -----------
Document coverage | 100% | All .md files indexed
Code coverage | 80% | Docstrings and key files
Component coverage | 100% | All agents, commands, skills
Link integrity | 95% | Broken link detection

Search Quality

Metric | Target | Measurement
------ | ------ | -----------
Precision@10 | >85% | Relevant results in top 10
Recall | >90% | Find relevant documents
Latency p95 | <200ms | Search response time
Embedding quality | >0.8 | Cosine similarity threshold
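Precision@10 is simple to compute once a labeled set of relevant documents exists; a minimal sketch with invented document IDs:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top)

retrieved = [f"doc{i}" for i in range(10)]          # ranked search output
relevant = {f"doc{i}" for i in range(10)} - {"doc8"}  # ground-truth labels
print(precision_at_k(retrieved, relevant))  # 0.9
```

A run scoring 0.9 here would clear the >85% target in the table above.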

Freshness

Metric | Target | Measurement
------ | ------ | -----------
Index lag | <5 min | Time from commit to indexed
Incremental sync | 100% | Only changed files re-indexed
Stale detection | <1 day | Identify outdated content

Security Considerations

Access Control

  1. Read Access: All indexed content is read from git (existing permissions)
  2. Query Access: API requires authentication (future: tenant-scoped)
  3. Embedding Storage: Vectors don't expose raw content
  4. Graph Traversal: Respects document-level permissions

Data Protection

  1. No PII in Index: Exclude credentials, secrets, personal data
  2. Content Hashing: Detect unauthorized modifications
  3. Audit Logging: Track all search queries
  4. Retention: Purge embeddings when documents deleted
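Content hashing (point 2) is the same mechanism that backs the `content_hash` column and incremental sync: if the stored hash differs from a fresh one, the document was modified and must be re-indexed. A minimal sketch:

```python
import hashlib

def content_hash(text):
    """SHA-256 over the document text, matching the content_hash column."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

stored = content_hash("# ADR-100\nOriginal body")
current = content_hash("# ADR-100\nTampered body")
print(stored != current)  # True -> re-index (or flag) this document
```

The same comparison skips unchanged files during incremental indexing, so only modified documents pay the parsing and embedding cost.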

Alternatives Considered

Alternative 1: Elasticsearch

Pros: Mature, scalable, rich query language
Cons: Heavy infrastructure, operational complexity, cost
Decision: Rejected for MVP; consider for enterprise tier

Alternative 2: Algolia

Pros: Managed service, fast, good UI
Cons: Vendor lock-in, cost per operation, no semantic search
Decision: Rejected due to cost and limited semantic capabilities

Alternative 3: Pinecone

Pros: Managed vector database, scalable
Cons: No built-in FTS (a separate FTS system would be required), cost per vector
Decision: Rejected; prefer unified solution

Selected Approach: Hybrid (PostgreSQL + ChromaDB + Graph)

Rationale:

  • PostgreSQL provides robust FTS and structured data
  • ChromaDB handles semantic embeddings (self-hosted)
  • Graph layer maps relationships
  • All components can scale to enterprise via FDB

Dependencies

Infrastructure

  • PostgreSQL 14+ (with pgvector extension)
  • ChromaDB or Faiss (vector search)
  • Redis (caching, optional)
  • Neo4j or FoundationDB (graph, optional for Phase 3)

Python Libraries

  • sentence-transformers - Local embeddings
  • chromadb - Vector database
  • markdown-it-py - Markdown parsing
  • pyyaml - Frontmatter parsing
  • networkx - Graph operations
  • fastapi - API server

External Services (Optional)

  • OpenAI API - High-quality embeddings
  • Anthropic API - Entity extraction

Success Criteria

MVP (Phase 1-2)

  • All 57 submodules indexed
  • FTS search operational
  • Semantic search operational
  • CLI integration complete
  • Documentation complete

Enterprise (Phase 3-4)

  • Knowledge graph operational
  • RAG API deployed
  • Claude Code integration
  • Multi-tenant isolation
  • <200ms p95 latency

Future

  • FoundationDB backend
  • Real-time sync
  • 1M+ document capacity
  • Global distribution


Owner: AZ1.AI INC
Lead: Hal Casteel, Founder/CEO/CTO
Review Date: December 23, 2025
Implementation Start: TBD (pending approval)