
ADR-100: CODITECT Master Index Search and Retrieval System

Status: Proposed
Date: December 16, 2025
Author: AZ1.AI INC / CODITECT Architecture Team
Supersedes: None
Related ADRs: ADR-029 (FoundationDB Data Model), ADR-066 (Ephemeral Workspaces)


Executive Summary

This Architecture Decision Record defines the CODITECT Master Index System: a comprehensive platform-wide search and retrieval system that indexes every markdown document, code file, agent definition, command, skill, script, and configuration across the entire CODITECT ecosystem. The system provides unified semantic search, knowledge graph navigation, and RAG (Retrieval-Augmented Generation) capabilities for both human users and AI agents.


Context

Current State

The CODITECT platform currently has:

  • 57 submodules across 8 category folders
  • 560+ framework components (agents, commands, skills, scripts, hooks)
  • 1,096 N8N workflows across 29 industries
  • Thousands of markdown documents scattered across repositories
  • 100K+ lines of Rust/Python/TypeScript code

Problem Statement

  1. Fragmented Discovery: No unified way to search across all CODITECT repositories
  2. Context Loss: AI agents cannot reference documents outside current scope
  3. Duplicate Content: Same patterns documented in multiple places without linking
  4. No Semantic Understanding: Current search is keyword-based only
  5. Missing Relationships: No graph of document relationships and dependencies

Requirements

  1. Universal Indexing: Index ALL content across 57 submodules
  2. Semantic Search: Find documents by meaning, not just keywords
  3. Knowledge Graph: Map relationships between documents, code, and components
  4. RAG Integration: Enable AI agents to retrieve relevant context automatically
  5. Real-time Updates: Incremental indexing on git push
  6. Multi-tenant Ready: Scale to enterprise with tenant isolation

Decision

We will implement a three-layer Master Index System:

Layer 1: Document Indexer

Technology: Python + SQLite/PostgreSQL + Full-Text Search

Scope:

  • All .md files across 57 submodules
  • All component definitions (agents/, commands/, skills/)
  • Configuration files (JSON, YAML, TOML)
  • Code documentation (docstrings, comments)

Schema:

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    document_type VARCHAR(50), -- 'markdown', 'agent', 'command', 'skill', 'code'
    submodule VARCHAR(100),
    category VARCHAR(50), -- 'core', 'cloud', 'labs', etc.
    frontmatter JSONB,
    word_count INTEGER,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ,
    indexed_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE document_sections (
    id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(id),
    heading TEXT,
    heading_level INTEGER,
    content TEXT NOT NULL,
    position INTEGER,
    parent_section_id UUID REFERENCES document_sections(id)
);

CREATE TABLE document_links (
    id UUID PRIMARY KEY,
    source_document_id UUID REFERENCES documents(id),
    target_document_id UUID REFERENCES documents(id),
    link_text TEXT,
    link_type VARCHAR(50), -- 'reference', 'see_also', 'import', 'extends'
    is_broken BOOLEAN DEFAULT FALSE
);

-- Full-text search (COALESCE guards the nullable title/heading columns,
-- which would otherwise make the whole tsvector NULL)
CREATE INDEX idx_documents_fts ON documents
USING GIN (to_tsvector('english', COALESCE(title, '') || ' ' || content));

CREATE INDEX idx_sections_fts ON document_sections
USING GIN (to_tsvector('english', COALESCE(heading, '') || ' ' || content));
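The FTS path can be exercised end-to-end with the SQLite option listed under Phase 1. A minimal sketch, assuming an FTS5 virtual table whose columns mirror the schema above (the FTS5 layout itself is an illustration, not part of the ADR):

```python
import sqlite3

# Assumed FTS5 layout mirroring the documents schema above; requires a
# Python build with the SQLite FTS5 extension compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE documents_fts USING fts5(path, title, content)")
conn.executemany(
    "INSERT INTO documents_fts (path, title, content) VALUES (?, ?, ?)",
    [
        ("docs/adr-029.md", "FoundationDB Data Model",
         "Multi-tenant key design uses tenant prefixes."),
        ("docs/adr-066.md", "Ephemeral Workspaces",
         "Workspaces are created on demand and garbage collected."),
    ],
)

# Rank matching documents by BM25 relevance, best first.
rows = conn.execute(
    "SELECT path FROM documents_fts WHERE documents_fts MATCH ? "
    "ORDER BY bm25(documents_fts)",
    ("tenant",),
).fetchall()
print(rows)  # [('docs/adr-029.md',)]
```

The PostgreSQL deployment replaces this with the `to_tsvector` GIN indexes above; the query shape (match, then rank) is the same.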

Layer 2: Semantic Embeddings

Technology: Sentence Transformers + Vector Database (ChromaDB/Faiss/pgvector)

Purpose: Enable meaning-based search beyond keyword matching

Schema:

CREATE TABLE embeddings (
    id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(id),
    section_id UUID REFERENCES document_sections(id),
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(1536), -- OpenAI ada-002 dimensions
    model VARCHAR(100),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Vector similarity index (pgvector)
CREATE INDEX idx_embeddings_vector ON embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
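`vector_cosine_ops` orders results by cosine distance. The underlying metric is easy to state in plain Python; this is illustrative only, since real queries run inside pgvector against 1536-dimensional vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: the metric behind vector_cosine_ops above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real 1536-dim embeddings.
query = [1.0, 0.0, 1.0]
chunks = {"chunk-a": [1.0, 0.0, 1.0], "chunk-b": [0.0, 1.0, 0.0]}
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # chunk-a
```

The ivfflat index approximates this ranking by probing only the nearest of the 100 cluster lists rather than scanning every row.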

Embedding Strategy:

  1. Chunk documents by section (respect heading boundaries)
  2. Maximum chunk size: 512 tokens
  3. Overlap: 50 tokens between chunks
  4. Model: text-embedding-ada-002 (OpenAI) or all-MiniLM-L6-v2 (local)
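A minimal sketch of the sliding-window part of this strategy, approximating tokens with whitespace-separated words (production code would count tokens with the embedding model's tokenizer and respect heading boundaries first):

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into overlapping windows; words approximate tokens here."""
    words = text.split()
    step = max_tokens - overlap  # advance so consecutive chunks share `overlap` words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covers the tail
    return chunks

# 1000 synthetic words -> windows starting at 0, 462, and 924.
chunks = chunk_text(" ".join(f"w{i}" for i in range(1000)))
print(len(chunks))  # 3
```

Each chunk's text and index then become one row in the embeddings table above.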

Layer 3: Knowledge Graph

Technology: FoundationDB / Neo4j / NetworkX

Purpose: Map relationships between all platform entities

Graph Schema:

Nodes:
├── Document (path, title, type)
├── Component (name, type, version)
├── Submodule (name, category, url)
├── Concept (name, domain)
├── Entity (name, type) -- extracted entities
└── Author (name, email)

Edges:
├── CONTAINS (Submodule → Document)
├── REFERENCES (Document → Document)
├── DEFINES (Document → Component)
├── IMPLEMENTS (Code → Component)
├── DEPENDS_ON (Component → Component)
├── RELATES_TO (Concept → Concept)
├── MENTIONS (Document → Entity)
└── AUTHORED_BY (Document → Author)
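The node and edge types above can be modeled with a plain adjacency list before committing to Neo4j or FoundationDB; `neighbors` and `reachable` below are hypothetical helpers sketching the traversal queries, and the sample edges are invented for illustration:

```python
# (source, edge_type, target) triples using the edge vocabulary above.
edges = [
    ("coditect-core", "CONTAINS", "docs/orchestrator.md"),
    ("docs/orchestrator.md", "DEFINES", "orchestrator"),
    ("orchestrator", "DEPENDS_ON", "foundationdb-expert"),
    ("docs/adr-029.md", "DEFINES", "foundationdb-expert"),
]

def neighbors(node, edge_type=None):
    """Outgoing targets from node, optionally filtered by edge type."""
    return [t for (s, e, t) in edges if s == node and edge_type in (None, e)]

def reachable(start):
    """Every node reachable from start via any edge type (BFS)."""
    seen, queue = set(), [start]
    while queue:
        node = queue.pop(0)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("coditect-core")))
# ['docs/orchestrator.md', 'foundationdb-expert', 'orchestrator']
```

NetworkX (listed under Dependencies) provides the same operations with real graph algorithms layered on top.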

FoundationDB Key Design:

/master_index/
├── /documents/{hash}/metadata
├── /documents/{hash}/content
├── /documents/{hash}/sections/{section_id}
├── /embeddings/{document_hash}/{chunk_index}
├── /graph/nodes/{node_type}/{node_id}
├── /graph/edges/{edge_type}/{source_id}/{target_id}
├── /search/inverted/{term}/{document_hash}
└── /stats/{metric}/{timestamp}
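One way to render that layout as concrete keys; `index_key` is a hypothetical helper, and real FoundationDB code would pack the parts with the tuple layer (`fdb.tuple`) or the directory layer rather than joining strings:

```python
# Hypothetical key builder mirroring the /master_index/ layout above.
def index_key(*parts):
    return "/master_index/" + "/".join(str(p) for p in parts)

doc_hash = "a1b2c3"  # content hash standing in for a real document
keys = [
    index_key("documents", doc_hash, "metadata"),
    index_key("embeddings", doc_hash, 0),
    index_key("graph", "edges", "REFERENCES", doc_hash, "d4e5f6"),
]
print(keys[0])  # /master_index/documents/a1b2c3/metadata
```

Because FoundationDB range-reads are prefix-based, this layout lets one range scan fetch everything under a document hash or all edges of a given type.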

Architecture

System Components

┌─────────────────────────────────────────────────────────────────┐
│ CODITECT Master Index System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scanner │ │ Parser │ │ Indexer │ │
│ │ (Git Walk) │──│ (Markdown) │──│ (SQL/FDB) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedder │ │ Graph │ │ Search │ │
│ │ (Semantic) │──│ (Neo4j) │──│ (Query) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ RAG API │ │
│ │ Endpoint │ │
│ └─────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Claude Code │ │ Web UI │ │ AI Agents │ │
│ │ CLI │ │ Search │ │ Context │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Data Flow


Implementation Plan

Phase 1: Document Indexer (Week 1-2)

Deliverables:

  • Git repository scanner (walk all 57 submodules)
  • Markdown parser with frontmatter extraction
  • SQLite/PostgreSQL schema migration
  • FTS5/tsvector indexing
  • CLI: coditect-index scan --all
  • CLI: coditect-index search "query"

Files:

  • scripts/master-index/scanner.py
  • scripts/master-index/parser.py
  • scripts/master-index/indexer.py
  • scripts/master-index/cli.py
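A sketch of how scanner.py might walk a checkout and split frontmatter from body; `scan_markdown` is a hypothetical helper, and the naive `---` split stands in for the pyyaml parsing listed under Dependencies:

```python
import os
import tempfile

def scan_markdown(root):
    """Yield (path, frontmatter, body) for every .md file under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            frontmatter, body = "", text
            if text.startswith("---\n"):
                # Naive split: before / inside / after the frontmatter fence.
                _, frontmatter, body = text.split("---\n", 2)
            yield path, frontmatter.strip(), body.strip()

# Demo against a throwaway directory with one document.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "adr.md"), "w", encoding="utf-8") as f:
        f.write("---\ntitle: ADR-100\n---\n# Body\n")
    results = list(scan_markdown(root))
print(results[0][1])  # title: ADR-100
```

The real scanner would additionally walk each of the 57 submodules via git and skip ignored paths.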

Phase 2: Semantic Embeddings (Week 3-4)

Deliverables:

  • Chunking strategy (section-aware, 512 tokens)
  • Embedding generation (OpenAI or local model)
  • Vector storage (ChromaDB or pgvector)
  • Similarity search API
  • CLI: coditect-index embed --incremental
  • CLI: coditect-index semantic "query"

Phase 3: Knowledge Graph (Week 5-6)

Deliverables:

  • Entity extraction (components, concepts, authors)
  • Relationship mapping (references, dependencies)
  • Graph database integration (Neo4j or FDB)
  • Graph traversal queries
  • CLI: coditect-index graph --visualize

Phase 4: RAG API (Week 7-8)

Deliverables:

  • REST API for search and retrieval
  • Unified query endpoint (FTS + semantic + graph)
  • Context assembly for AI agents
  • Claude Code integration (/cxq --master-index)
  • Web UI for visual search

API Design

Search Endpoint

POST /api/v1/master-index/search
Content-Type: application/json

{
  "query": "FoundationDB multi-tenant patterns",
  "search_type": "hybrid",  // "fts", "semantic", "graph", "hybrid"
  "filters": {
    "document_type": ["markdown", "agent"],
    "submodule": ["coditect-core", "coditect-labs-v4-archive"],
    "category": ["core", "labs"]
  },
  "limit": 20,
  "include_context": true,
  "context_window": 500
}

Response:

{
  "results": [
    {
      "document": {
        "id": "uuid",
        "path": "submodules/labs/coditect-labs-v4-archive/docs/architecture/adrs/ADR-029-v4-foundationdb-issue-tracking-data-model-part2-technical.md",
        "title": "ADR-029: FoundationDB Issue Tracking Data Model",
        "type": "markdown",
        "submodule": "coditect-labs-v4-archive"
      },
      "relevance_score": 0.94,
      "match_type": "semantic",
      "context": "Multi-tenant key design uses prefix /{tenant_id}/ for all entity types...",
      "related_documents": [
        "agents/foundationdb-expert.md",
        "skills/foundationdb-queries/SKILL.md"
      ]
    }
  ],
  "graph_context": {
    "concepts": ["multi-tenant", "ACID transactions", "key design"],
    "related_components": ["foundationdb-expert", "database-architect"],
    "dependency_chain": ["coditect-core", "coditect-labs-v4-archive"]
  },
  "total_results": 47,
  "search_time_ms": 45
}
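The "hybrid" search_type must merge the FTS and semantic rankings into one result list. The ADR does not mandate a fusion method; reciprocal rank fusion is one common choice, sketched here with invented document names:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both backends beats one ranked well by only one.
fts = ["adr-029.md", "adr-066.md", "skill.md"]
semantic = ["adr-029.md", "skill.md", "agent.md"]
fused = reciprocal_rank_fusion([fts, semantic])
print(fused)  # ['adr-029.md', 'skill.md', 'adr-066.md', 'agent.md']
```

The fused ordering then feeds the `relevance_score` and `match_type` fields in the response above.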

Claude Code Integration

# New slash command
/cxq --master-index "FoundationDB patterns"

# Or via cxq with flag
/cxq --scope master "database architecture"

# List all indexed documents
/cxq --master-index --list --type agent

# Graph traversal
/cxq --master-index --related "orchestrator agent"

Quality Metrics

Indexing Coverage

Metric | Target | Measurement
------ | ------ | -----------
Document coverage | 100% | All .md files indexed
Code coverage | 80% | Docstrings and key files
Component coverage | 100% | All agents, commands, skills
Link integrity | 95% | Broken link detection

Search Quality

Metric | Target | Measurement
------ | ------ | -----------
Precision@10 | >85% | Relevant results in top 10
Recall | >90% | Find relevant documents
Latency p95 | <200ms | Search response time
Embedding quality | >0.8 | Cosine similarity threshold
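Precision@10 is simple to compute once a labeled set of relevant documents exists; a minimal sketch with invented document IDs:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top)

retrieved = [f"doc{i}" for i in range(10)]          # ranked search output
relevant = {f"doc{i}" for i in range(10)} - {"doc8"}  # ground-truth labels
print(precision_at_k(retrieved, relevant))  # 0.9
```

A run scoring 0.9 here would clear the >85% target in the table above.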

Freshness

Metric | Target | Measurement
------ | ------ | -----------
Index lag | <5 min | Time from commit to indexed
Incremental sync | 100% | Only changed files re-indexed
Stale detection | <1 day | Identify outdated content

Security Considerations

Access Control

  1. Read Access: All indexed content is read from git (existing permissions)
  2. Query Access: API requires authentication (future: tenant-scoped)
  3. Embedding Storage: Vectors don't expose raw content
  4. Graph Traversal: Respects document-level permissions

Data Protection

  1. No PII in Index: Exclude credentials, secrets, personal data
  2. Content Hashing: Detect unauthorized modifications
  3. Audit Logging: Track all search queries
  4. Retention: Purge embeddings when documents deleted
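Content hashing (point 2) is the same mechanism that backs the `content_hash` column and incremental sync: if the stored hash differs from a fresh one, the document was modified and must be re-indexed. A minimal sketch:

```python
import hashlib

def content_hash(text):
    """SHA-256 over the document text, matching the content_hash column."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

stored = content_hash("# ADR-100\nOriginal body")
current = content_hash("# ADR-100\nTampered body")
print(stored != current)  # True -> re-index (or flag) this document
```

The same comparison skips unchanged files during incremental indexing, so only modified documents pay the parsing and embedding cost.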

Alternatives Considered

Alternative 1: Elasticsearch

Pros: Mature, scalable, rich query language
Cons: Heavy infrastructure, operational complexity, cost
Decision: Rejected for MVP; consider for enterprise tier

Alternative 2: Algolia

Pros: Managed service, fast, good UI
Cons: Vendor lock-in, cost per operation, no semantic search
Decision: Rejected due to cost and limited semantic capabilities

Alternative 3: Pinecone

Pros: Managed vector database, scalable
Cons: No built-in FTS (a separate FTS system would be required), cost per vector
Decision: Rejected; prefer unified solution

Selected Approach: Hybrid (PostgreSQL + ChromaDB + Graph)

Rationale:

  • PostgreSQL provides robust FTS and structured data
  • ChromaDB handles semantic embeddings (self-hosted)
  • Graph layer maps relationships
  • All components can scale to enterprise via FDB

Dependencies

Infrastructure

  • PostgreSQL 14+ (with pgvector extension)
  • ChromaDB or Faiss (vector search)
  • Redis (caching, optional)
  • Neo4j or FoundationDB (graph, optional for Phase 3)

Python Libraries

  • sentence-transformers - Local embeddings
  • chromadb - Vector database
  • markdown-it-py - Markdown parsing
  • pyyaml - Frontmatter parsing
  • networkx - Graph operations
  • fastapi - API server

External Services (Optional)

  • OpenAI API - High-quality embeddings
  • Anthropic API - Entity extraction

Success Criteria

MVP (Phase 1-2)

  • All 57 submodules indexed
  • FTS search operational
  • Semantic search operational
  • CLI integration complete
  • Documentation complete

Enterprise (Phase 3-4)

  • Knowledge graph operational
  • RAG API deployed
  • Claude Code integration
  • Multi-tenant isolation
  • <200ms p95 latency

Future

  • FoundationDB backend
  • Real-time sync
  • 1M+ document capacity
  • Global distribution


Owner: AZ1.AI INC
Lead: Hal Casteel, Founder/CEO/CTO
Review Date: December 23, 2025
Implementation Start: TBD (pending approval)