
ADR-029: Hierarchical Knowledge Store

Status

PROPOSED

Date

2026-01-15

Context

Document analysis generates insights at multiple granularities: raw extractions, chunk summaries, section summaries, and corpus-level synthesis. Users need to query at different levels depending on their task:

  • Executive: "What are the top 3 customer complaints?" → Corpus summary
  • Analyst: "Show me complaints about pricing" → Section-level with drill-down
  • Auditor: "What exactly did customer X say?" → Raw extraction with citation

A flat storage model forces re-computation for each query type. A hierarchical model pre-computes summaries at each level, enabling instant retrieval at any granularity.

Research Findings

From Pieces.app research on hierarchical summarization:

"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."

This informs our design: preserve raw extractions at the base layer, use extractive (not abstractive) summaries at intermediate levels, and only apply abstractive synthesis at the top.
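
To make the extractive/abstractive distinction concrete, here is a minimal, self-contained sketch of an extractive selector (illustrative only, not the production algorithm): it can only return sentences that already exist verbatim in the input, which is why extractive lower tiers cannot introduce new facts.

```python
from collections import Counter

def extractive_top_k(sentences: list[str], k: int = 2) -> list[str]:
    """Rank sentences by average document-wide word frequency; return top k verbatim."""
    doc_freq = Counter(w.lower() for s in sentences for w in s.split())

    def score(s: str) -> float:
        words = s.split()
        return sum(doc_freq[w.lower()] for w in words) / max(len(words), 1)

    return sorted(sentences, key=score, reverse=True)[:k]

sents = [
    "Pricing complaints rose in Q4.",
    "Support tickets mention pricing and billing.",
    "The office moved to a new building.",
]
top = extractive_top_k(sents, k=2)
assert all(s in sents for s in top)  # output is always verbatim source text
```

An abstractive summarizer, by contrast, generates new sentences, which is exactly where hallucination risk enters; hence the rule of confining abstraction to the top tier.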

Decision

Implement a Hierarchical Knowledge Store with four tiers stored in FoundationDB.

Tier Architecture

┌─────────────────────────────────────────────────────────────────┐
│ HIERARCHICAL KNOWLEDGE STORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TIER 4: CORPUS SUMMARY (Abstractive) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Executive-level insights │ │
│ │ • Cross-document themes │ │
│ │ • Aggregate statistics │ │
│ │ • Storage: ~1KB per corpus │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Synthesizes from │
│ TIER 3: SECTION SUMMARIES (Extractive + Light Abstraction) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Topic-grouped insights │ │
│ │ • Key quotes preserved │ │
│ │ • Citation chains maintained │ │
│ │ • Storage: ~10KB per section │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Groups and merges │
│ TIER 2: CHUNK SUMMARIES (Extractive Only) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Per-document summaries │ │
│ │ • Key sentence extraction (TextRank/TF-IDF) │ │
│ │ • Entity mentions preserved │ │
│ │ • Storage: ~5KB per document │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Extracts from │
│ TIER 1: RAW EXTRACTIONS (Immutable Source of Truth) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Original extracted data from map phase │ │
│ │ • Exact quotes with character offsets │ │
│ │ • Full schema-based extractions │ │
│ │ • Storage: Full extraction size │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Hallucination Prevention Strategy

class HallucinationPrevention:
    """
    Strategies to prevent hallucination propagation through hierarchy.
    Based on Pieces.app research findings.
    """

    TIER_STRATEGIES = {
        1: {  # Raw extractions
            "type": "preservation",
            "rules": [
                "Store verbatim extracted text",
                "Include character offsets for verification",
                "No summarization or paraphrasing",
                "Immutable after creation"
            ]
        },
        2: {  # Chunk summaries
            "type": "extractive_only",
            "rules": [
                "Select sentences, don't generate new ones",
                "Use TextRank or TF-IDF for selection",
                "Preserve exact wording of selected sentences",
                "Include back-references to Tier 1 sources"
            ]
        },
        3: {  # Section summaries
            "type": "extractive_with_light_synthesis",
            "rules": [
                "80% extractive (key quotes)",
                "20% connective tissue (transitions only)",
                "Never introduce facts not in Tier 2",
                "Flag synthesized content explicitly"
            ]
        },
        4: {  # Corpus summary
            "type": "abstractive_with_grounding",
            "rules": [
                "May synthesize new sentences",
                "Every claim must cite Tier 3 source",
                "Validation agent checks claim support",
                "Include confidence scores"
            ]
        }
    }
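
A sketch of how the strategy table can gate writes at each tier (trimmed to the `type` field; the guard function is illustrative, not part of the ADR's API):

```python
# Trimmed copy of the strategy table: only the "type" field matters here.
TIER_STRATEGIES = {
    1: {"type": "preservation"},
    2: {"type": "extractive_only"},
    3: {"type": "extractive_with_light_synthesis"},
    4: {"type": "abstractive_with_grounding"},
}

def synthesis_allowed(tier: int) -> bool:
    """Tiers 1-2 must be verbatim; only Tiers 3-4 may contain generated text."""
    return TIER_STRATEGIES[tier]["type"] in (
        "extractive_with_light_synthesis",
        "abstractive_with_grounding",
    )

assert [t for t in TIER_STRATEGIES if synthesis_allowed(t)] == [3, 4]
```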

FoundationDB Schema

# Hierarchical Knowledge Store schema

knowledge_store_schema = {
    "corpus": {
        # Key: ('knowledge', 'corpus', corpus_id)
        "corpus_id": "uuid",
        "name": "string",
        "created_at": "timestamp",
        "document_count": "int",
        "tier4_summary": "json",  # Corpus summary
        "tier4_generated_at": "timestamp",
        "metadata": "json"
    },

    "sections": {
        # Key: ('knowledge', 'sections', corpus_id, section_id)
        "section_id": "uuid",
        "corpus_id": "uuid",
        "topic": "string",
        "tier3_summary": "json",
        "source_chunks": "list[uuid]",  # Back-references
        "key_quotes": "list[json]",
        "synthesized_content_ratio": "float",  # Track abstraction %
        "generated_at": "timestamp"
    },

    "chunks": {
        # Key: ('knowledge', 'chunks', corpus_id, document_id, chunk_id)
        "chunk_id": "uuid",
        "document_id": "uuid",
        "corpus_id": "uuid",
        "tier2_summary": "json",
        "selected_sentences": "list[json]",  # With char offsets
        "extraction_method": "string",  # textrank|tfidf|llm_extractive
        "source_extractions": "list[uuid]",  # Back-references
        "generated_at": "timestamp"
    },

    "extractions": {
        # Key: ('knowledge', 'extractions', corpus_id, document_id, extraction_id)
        "extraction_id": "uuid",
        "document_id": "uuid",
        "corpus_id": "uuid",
        "tier1_data": "json",  # Raw extraction per schema
        "source_text": "string",  # Verbatim quote
        "char_offset_start": "int",
        "char_offset_end": "int",
        "confidence": "float",
        "extracted_at": "timestamp",
        "immutable": True  # Enforced at application layer
    },

    # Secondary indexes for efficient querying
    "indexes": {
        "by_topic": {
            # Key: ('knowledge', 'idx', 'topic', topic_hash, section_id)
        },
        "by_entity": {
            # Key: ('knowledge', 'idx', 'entity', entity_name, extraction_id)
        },
        "by_document": {
            # Key: ('knowledge', 'idx', 'document', document_id, tier, item_id)
        }
    }
}
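
FoundationDB's tuple layer preserves the lexicographic order of these key tuples, so every item under a shared prefix is contiguous and retrievable with one range scan. Plain Python tuples sort the same way, which lets the semantics be sketched without a database (keys and values below are illustrative):

```python
# Plain Python tuples model the tuple-layer ordering: everything under a
# prefix is contiguous, so one range scan fetches a whole document or corpus.
store = {
    ('knowledge', 'extractions', 'c1', 'docA', 'e1'): {"confidence": 0.9},
    ('knowledge', 'extractions', 'c1', 'docA', 'e2'): {"confidence": 0.8},
    ('knowledge', 'extractions', 'c1', 'docB', 'e3'): {"confidence": 0.7},
    ('knowledge', 'chunks', 'c1', 'docA', 'ch1'): {"sentences": 10},
}

def range_scan(prefix: tuple) -> list:
    """Return all values whose key starts with `prefix`, in key order."""
    return [v for k, v in sorted(store.items()) if k[:len(prefix)] == prefix]

# All Tier 1 extractions for corpus c1, across documents:
hits = range_scan(('knowledge', 'extractions', 'c1'))
assert len(hits) == 3
```

This is why the schema places corpus_id and document_id early in each key: queries narrow from corpus to document to item simply by lengthening the prefix.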

Implementation

# /coditect/knowledge/hierarchical_store.py

import hashlib
import logging
import time
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

class ValidationError(Exception):
    """Raised when a tier's content fails its hallucination-prevention checks."""

class KnowledgeTier(IntEnum):
    RAW_EXTRACTION = 1
    CHUNK_SUMMARY = 2
    SECTION_SUMMARY = 3
    CORPUS_SUMMARY = 4

@dataclass
class KnowledgeItem:
    """Base class for items at any tier"""
    item_id: str
    corpus_id: str
    tier: KnowledgeTier
    content: Dict[str, Any]
    sources: List[str]  # IDs of items this was derived from
    created_at: float
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Citation:
    """Citation linking to source material"""
    extraction_id: str
    document_id: str
    text: str
    char_start: int
    char_end: int
    confidence: float

class HierarchicalKnowledgeStore:
    """
    Multi-tier knowledge storage with hallucination prevention.

    Tier 1: Raw extractions (immutable)
    Tier 2: Chunk summaries (extractive only)
    Tier 3: Section summaries (extractive + light synthesis)
    Tier 4: Corpus summary (abstractive with grounding)
    """

    def __init__(
        self,
        fdb_client: 'FoundationDBClient',
        agent_pool: Optional['AgentPool'] = None  # needed for Tier 4 synthesis
    ):
        self.fdb = fdb_client
        self.agent_pool = agent_pool
        self.validators = {
            KnowledgeTier.CHUNK_SUMMARY: self._validate_extractive,
            KnowledgeTier.SECTION_SUMMARY: self._validate_section,
            KnowledgeTier.CORPUS_SUMMARY: self._validate_corpus
        }

    # ==================== TIER 1: Raw Extractions ====================

    async def store_extraction(
        self,
        corpus_id: str,
        document_id: str,
        extraction: Dict[str, Any],
        source_text: str,
        char_offset: tuple[int, int],
        confidence: float
    ) -> str:
        """
        Store raw extraction (Tier 1). Immutable after creation.
        """
        extraction_id = self._generate_id("ext", document_id, source_text)

        item = {
            "extraction_id": extraction_id,
            "document_id": document_id,
            "corpus_id": corpus_id,
            "tier1_data": extraction,
            "source_text": source_text,
            "char_offset_start": char_offset[0],
            "char_offset_end": char_offset[1],
            "confidence": confidence,
            "extracted_at": time.time(),
            "immutable": True
        }

        key = ('knowledge', 'extractions', corpus_id, document_id, extraction_id)
        await self.fdb.set(key, item)

        # Index by entities mentioned
        for entity in self._extract_entities(extraction):
            idx_key = ('knowledge', 'idx', 'entity', entity.lower(), extraction_id)
            await self.fdb.set(idx_key, {"extraction_id": extraction_id})

        return extraction_id

    # ==================== TIER 2: Chunk Summaries ====================

    async def build_chunk_summary(
        self,
        corpus_id: str,
        document_id: str,
        extraction_ids: List[str],
        method: str = "textrank"
    ) -> str:
        """
        Build extractive summary from raw extractions (Tier 2).
        Only selects sentences, never generates new content.
        """
        # Load source extractions
        extractions = await self._load_extractions(corpus_id, document_id, extraction_ids)

        # Extract key sentences (no generation)
        if method == "textrank":
            selected = self._textrank_select(extractions, top_k=10)
        elif method == "tfidf":
            selected = self._tfidf_select(extractions, top_k=10)
        else:
            raise ValueError(f"Unknown extraction method: {method}")

        # Validate: all selected content exists verbatim in sources
        await self._validate_extractive(selected, extractions)

        chunk_id = self._generate_id("chunk", document_id)

        item = {
            "chunk_id": chunk_id,
            "document_id": document_id,
            "corpus_id": corpus_id,
            "tier2_summary": {
                "selected_sentences": selected,
                "sentence_count": len(selected)
            },
            "extraction_method": method,
            "source_extractions": extraction_ids,
            "generated_at": time.time()
        }

        key = ('knowledge', 'chunks', corpus_id, document_id, chunk_id)
        await self.fdb.set(key, item)

        return chunk_id

    def _textrank_select(
        self,
        extractions: List[Dict],
        top_k: int
    ) -> List[Dict]:
        """TextRank-based key sentence selection"""
        # Build sentence graph based on similarity
        sentences = [e["source_text"] for e in extractions]

        # Compute similarity matrix (local imports keep heavy deps optional)
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(sentences)
        similarity_matrix = cosine_similarity(tfidf_matrix)

        # PageRank-style scoring
        import numpy as np
        scores = np.ones(len(sentences))
        damping = 0.85

        for _ in range(10):  # Fixed number of power-iteration steps (approximate convergence)
            new_scores = (1 - damping) + damping * similarity_matrix.T @ scores
            scores = new_scores / new_scores.sum()

        # Select top-k by score
        top_indices = np.argsort(scores)[-top_k:][::-1]

        return [
            {
                "text": extractions[i]["source_text"],
                "extraction_id": extractions[i]["extraction_id"],
                "score": float(scores[i]),
                "char_start": extractions[i]["char_offset_start"],
                "char_end": extractions[i]["char_offset_end"]
            }
            for i in top_indices
        ]

    # ==================== TIER 3: Section Summaries ====================

    async def build_section_summary(
        self,
        corpus_id: str,
        topic: str,
        chunk_ids: List[str],
        max_synthesis_ratio: float = 0.2
    ) -> str:
        """
        Build section summary from chunk summaries (Tier 3).
        Limited synthesis allowed for transitions.
        """
        # Load chunk summaries
        chunks = await self._load_chunks(corpus_id, chunk_ids)

        # Group by subtopic
        grouped = self._group_by_subtopic(chunks)

        # Build section with mostly extractive content
        section_content = []
        synthesized_chars = 0
        total_chars = 0

        for subtopic, items in grouped.items():
            # Add key quotes (extractive)
            key_quotes = self._select_representative_quotes(items, top_k=3)
            for quote in key_quotes:
                section_content.append({
                    "type": "extractive",
                    "content": quote["text"],
                    "source": quote["extraction_id"]
                })
                total_chars += len(quote["text"])

            # Add minimal transition (synthetic)
            if len(grouped) > 1:
                transition = f"Related findings on {subtopic}:"
                section_content.append({
                    "type": "synthetic",
                    "content": transition,
                    "source": None
                })
                synthesized_chars += len(transition)
                total_chars += len(transition)

        # Validate synthesis ratio
        actual_ratio = synthesized_chars / total_chars if total_chars > 0 else 0
        if actual_ratio > max_synthesis_ratio:
            raise ValidationError(
                f"Synthesis ratio {actual_ratio:.2%} exceeds max {max_synthesis_ratio:.2%}"
            )

        section_id = self._generate_id("section", topic)

        item = {
            "section_id": section_id,
            "corpus_id": corpus_id,
            "topic": topic,
            "tier3_summary": {
                "content": section_content,
                "key_quotes": [c for c in section_content if c["type"] == "extractive"]
            },
            "source_chunks": chunk_ids,
            "synthesized_content_ratio": actual_ratio,
            "generated_at": time.time()
        }

        key = ('knowledge', 'sections', corpus_id, section_id)
        await self.fdb.set(key, item)

        # Index by topic
        topic_hash = hashlib.md5(topic.lower().encode()).hexdigest()[:8]
        idx_key = ('knowledge', 'idx', 'topic', topic_hash, section_id)
        await self.fdb.set(idx_key, {"section_id": section_id})

        return section_id

    # ==================== TIER 4: Corpus Summary ====================

    async def build_corpus_summary(
        self,
        corpus_id: str,
        section_ids: List[str],
        validation_agent: 'ValidationAgent'
    ) -> Dict[str, Any]:
        """
        Build corpus summary from sections (Tier 4).
        Abstractive synthesis with mandatory validation.
        """
        # Load sections
        sections = await self._load_sections(corpus_id, section_ids)

        # Generate abstractive summary via LLM
        synthesis_prompt = self._build_corpus_synthesis_prompt(sections)

        synthesis_agent = await self.agent_pool.acquire(AgentRole.SYNTHESIZER)
        try:
            raw_summary = await synthesis_agent.execute(
                prompt=synthesis_prompt,
                require_citations=True
            )
        finally:
            await self.agent_pool.release(synthesis_agent)

        # CRITICAL: Validate every claim against sources
        validated_summary = await validation_agent.validate_claims(
            claims=raw_summary.claims,
            sources=sections,
            require_support=True
        )

        # Filter out unsupported claims
        supported_claims = [
            claim for claim in validated_summary.claims
            if claim.support_score >= 0.7
        ]

        unsupported = len(raw_summary.claims) - len(supported_claims)
        if unsupported > 0:
            logger.warning(
                f"Filtered {unsupported} unsupported claims from corpus summary"
            )

        corpus_summary = {
            "summary_text": self._reconstruct_summary(supported_claims),
            "claims": [c.to_dict() for c in supported_claims],
            "filtered_claims": unsupported,
            "validation_scores": validated_summary.scores,
            "source_sections": section_ids,
            "generated_at": time.time()
        }

        # Update corpus record
        key = ('knowledge', 'corpus', corpus_id)
        corpus = await self.fdb.get(key)
        corpus["tier4_summary"] = corpus_summary
        corpus["tier4_generated_at"] = time.time()
        await self.fdb.set(key, corpus)

        return corpus_summary

    # ==================== Query Interface ====================

    async def query(
        self,
        corpus_id: str,
        query: str,
        target_tier: KnowledgeTier = KnowledgeTier.SECTION_SUMMARY,
        include_sources: bool = True
    ) -> 'QueryResult':
        """
        Query knowledge store at specified tier with drill-down capability.
        """
        if target_tier == KnowledgeTier.CORPUS_SUMMARY:
            result = await self._query_corpus(corpus_id, query)
        elif target_tier == KnowledgeTier.SECTION_SUMMARY:
            result = await self._query_sections(corpus_id, query)
        elif target_tier == KnowledgeTier.CHUNK_SUMMARY:
            result = await self._query_chunks(corpus_id, query)
        else:
            result = await self._query_extractions(corpus_id, query)

        if include_sources:
            result.sources = await self._load_source_chain(result.item_ids)

        return result

    async def drill_down(
        self,
        item_id: str,
        current_tier: KnowledgeTier
    ) -> List['KnowledgeItem']:
        """
        Get source items from the tier below.
        """
        if current_tier == KnowledgeTier.RAW_EXTRACTION:
            return []  # Already at bottom

        item = await self._load_item(item_id, current_tier)
        source_ids = item.get("sources", [])

        target_tier = KnowledgeTier(current_tier - 1)
        return await self._load_items(source_ids, target_tier)

    # ==================== Validation ====================

    async def _validate_extractive(
        self,
        selected: List[Dict],
        sources: List[Dict]
    ) -> None:
        """Ensure selected content exists verbatim in sources"""
        source_texts = {s["source_text"] for s in sources}

        for item in selected:
            if item["text"] not in source_texts:
                raise ValidationError(
                    f"Selected text not found in sources: {item['text'][:50]}..."
                )

    async def _validate_section(
        self,
        section: Dict,
        chunks: List[Dict]
    ) -> None:
        """Validate section synthesis ratio and source coverage"""
        # Implementation details...
        pass

    async def _validate_corpus(
        self,
        summary: Dict,
        sections: List[Dict]
    ) -> None:
        """Validate corpus claims are supported by sections"""
        # Implementation details...
        pass
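
The power iteration inside _textrank_select can be exercised in isolation. A pure-Python toy run over a hand-built 3x3 similarity matrix (no sklearn or numpy needed at this scale); the middle sentence is strongly similar to both neighbors, so it should rank first:

```python
# Toy power iteration over a symmetric sentence-similarity matrix.
sim = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
n = len(sim)
damping = 0.85
scores = [1.0] * n

for _ in range(10):  # same fixed iteration count as _textrank_select
    new = [
        (1 - damping) + damping * sum(sim[j][i] * scores[j] for j in range(n))
        for i in range(n)
    ]
    total = sum(new)
    scores = [s / total for s in new]  # normalize each step

best = max(range(n), key=lambda i: scores[i])
assert best == 1  # the sentence connected to both others ranks first
```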

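Similarly, the synthesis-ratio bookkeeping in build_section_summary reduces to a small calculation that can be checked independently (the sample section content below is invented for illustration):

```python
def synthesis_ratio(blocks: list[dict]) -> float:
    """Fraction of characters that were generated rather than quoted."""
    synth = sum(len(b["content"]) for b in blocks if b["type"] == "synthetic")
    total = sum(len(b["content"]) for b in blocks)
    return synth / total if total else 0.0

section = [
    {"type": "extractive", "content": "Customers cited pricing as the top complaint in Q3."},
    {"type": "synthetic",  "content": "On pricing:"},  # short connective transition
    {"type": "extractive", "content": "Pricing was mentioned in 40% of churn interviews."},
]
ratio = synthesis_ratio(section)
assert ratio < 0.2  # within the 20% budget enforced at Tier 3
```
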
CLI Commands

# Build hierarchy for a corpus
coditect knowledge build \
    --corpus-id corpus_20260115 \
    --extraction-method textrank \
    --max-synthesis-ratio 0.2

# Query at specific tier
coditect knowledge query \
    --corpus-id corpus_20260115 \
    --query "customer complaints about pricing" \
    --tier section \
    --include-sources

# Drill down from summary to sources
coditect knowledge drilldown \
    --item-id section_abc123 \
    --format json

# Validate hierarchy integrity
coditect knowledge validate \
    --corpus-id corpus_20260115 \
    --check-citations \
    --check-synthesis-ratios
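
The --check-citations pass and the support_score >= 0.7 filter in build_corpus_summary both need a notion of claim support. The real check is delegated to a validation agent; a toy lexical stand-in (illustrative only, not the production scorer) shows the shape of the contract:

```python
def support_score(claim: str, source: str) -> float:
    """Toy grounding check: fraction of claim terms present in the source."""
    claim_terms = {w.lower().strip(".,") for w in claim.split()}
    source_terms = {w.lower().strip(".,") for w in source.split()}
    if not claim_terms:
        return 0.0
    return len(claim_terms & source_terms) / len(claim_terms)

source = "Pricing complaints rose 30% quarter over quarter."
claims = [
    "Pricing complaints rose 30% quarter over quarter.",   # fully supported
    "Customers are switching to a cheaper competitor.",    # unsupported
]
kept = [c for c in claims if support_score(c, source) >= 0.7]
assert len(kept) == 1  # only the grounded claim survives the filter
```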

Consequences

Positive

  • Multi-granularity access: Query at any level of detail
  • Hallucination prevention: Extractive lower tiers ground upper tiers
  • Audit trail: Full citation chain from summary to source
  • Performance: Pre-computed summaries enable instant retrieval

Negative

  • Storage overhead: content is stored at four granularities, though the summary tiers (~1-10KB each) are small relative to Tier 1 raw extractions
  • Build time: Hierarchy construction adds processing time
  • Staleness: Updates require rebuild of affected tiers
  • Complexity: More moving parts to maintain

Metrics

Metric                    | Target | Measurement
--------------------------|--------|--------------------------------
Hallucination rate        | <2%    | Claims without source support
Query latency (any tier)  | <100ms | p99 response time
Citation accuracy         | >98%   | Valid source references
Synthesis ratio (Tier 3)  | <20%   | Generated vs. extracted content

Related Decisions

  • ADR-027: Hybrid Document Processing Architecture (parent)
  • ADR-028: Map-Reduce Agent Orchestration (feeds Tier 1)
  • ADR-030: Compliance-Aware RAG (queries this store)