
ADR-029: Hierarchical Knowledge Store

Status

PROPOSED

Date

2026-01-15

Context

Document analysis generates insights at multiple granularities: raw extractions, chunk summaries, section summaries, and corpus-level synthesis. Users need to query at different levels depending on their task:

  • Executive: "What are the top 3 customer complaints?" → Corpus summary
  • Analyst: "Show me complaints about pricing" → Section-level with drill-down
  • Auditor: "What exactly did customer X say?" → Raw extraction with citation

A flat storage model forces re-computation for each query type. A hierarchical model pre-computes summaries at each level, enabling instant retrieval at any granularity.

Research Findings

From Pieces.app research on hierarchical summarization:

"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."

This informs our design: preserve raw extractions at the base layer, use extractive (not abstractive) summaries at intermediate levels, and only apply abstractive synthesis at the top.
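
To make the extractive/abstractive distinction concrete, here is a minimal, self-contained sketch of an extractive selector (illustrative only, not the production algorithm): it can only return sentences that already exist verbatim in the input, which is why extractive lower tiers cannot introduce new facts.

```python
from collections import Counter

def extractive_top_k(sentences: list[str], k: int = 2) -> list[str]:
    """Rank sentences by average document-wide word frequency; return top k verbatim."""
    doc_freq = Counter(w.lower() for s in sentences for w in s.split())

    def score(s: str) -> float:
        words = s.split()
        return sum(doc_freq[w.lower()] for w in words) / max(len(words), 1)

    return sorted(sentences, key=score, reverse=True)[:k]

sents = [
    "Pricing complaints rose in Q4.",
    "Support tickets mention pricing and billing.",
    "The office moved to a new building.",
]
top = extractive_top_k(sents, k=2)
assert all(s in sents for s in top)  # output is always verbatim source text
```

An abstractive summarizer, by contrast, generates new sentences, which is exactly where hallucination risk enters; hence the rule of confining abstraction to the top tier.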

Decision

Implement a Hierarchical Knowledge Store with four tiers stored in FoundationDB.

Tier Architecture

┌─────────────────────────────────────────────────────────────────┐
│ HIERARCHICAL KNOWLEDGE STORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TIER 4: CORPUS SUMMARY (Abstractive) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Executive-level insights │ │
│ │ • Cross-document themes │ │
│ │ • Aggregate statistics │ │
│ │ • Storage: ~1KB per corpus │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Synthesizes from │
│ TIER 3: SECTION SUMMARIES (Extractive + Light Abstraction) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Topic-grouped insights │ │
│ │ • Key quotes preserved │ │
│ │ • Citation chains maintained │ │
│ │ • Storage: ~10KB per section │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Groups and merges │
│ TIER 2: CHUNK SUMMARIES (Extractive Only) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Per-document summaries │ │
│ │ • Key sentence extraction (TextRank/TF-IDF) │ │
│ │ • Entity mentions preserved │ │
│ │ • Storage: ~5KB per document │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Extracts from │
│ TIER 1: RAW EXTRACTIONS (Immutable Source of Truth) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Original extracted data from map phase │ │
│ │ • Exact quotes with character offsets │ │
│ │ • Full schema-based extractions │ │
│ │ • Storage: Full extraction size │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Hallucination Prevention Strategy

class HallucinationPrevention:
    """
    Strategies to prevent hallucination propagation through hierarchy.
    Based on Pieces.app research findings.
    """

    TIER_STRATEGIES = {
        1: {  # Raw extractions
            "type": "preservation",
            "rules": [
                "Store verbatim extracted text",
                "Include character offsets for verification",
                "No summarization or paraphrasing",
                "Immutable after creation"
            ]
        },
        2: {  # Chunk summaries
            "type": "extractive_only",
            "rules": [
                "Select sentences, don't generate new ones",
                "Use TextRank or TF-IDF for selection",
                "Preserve exact wording of selected sentences",
                "Include back-references to Tier 1 sources"
            ]
        },
        3: {  # Section summaries
            "type": "extractive_with_light_synthesis",
            "rules": [
                "80% extractive (key quotes)",
                "20% connective tissue (transitions only)",
                "Never introduce facts not in Tier 2",
                "Flag synthesized content explicitly"
            ]
        },
        4: {  # Corpus summary
            "type": "abstractive_with_grounding",
            "rules": [
                "May synthesize new sentences",
                "Every claim must cite Tier 3 source",
                "Validation agent checks claim support",
                "Include confidence scores"
            ]
        }
    }
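
A sketch of how the strategy table can gate writes at each tier (trimmed to the `type` field; the guard function is illustrative, not part of the ADR's API):

```python
# Trimmed copy of the strategy table: only the "type" field matters here.
TIER_STRATEGIES = {
    1: {"type": "preservation"},
    2: {"type": "extractive_only"},
    3: {"type": "extractive_with_light_synthesis"},
    4: {"type": "abstractive_with_grounding"},
}

def synthesis_allowed(tier: int) -> bool:
    """Tiers 1-2 must be verbatim; only Tiers 3-4 may contain generated text."""
    return TIER_STRATEGIES[tier]["type"] in (
        "extractive_with_light_synthesis",
        "abstractive_with_grounding",
    )

assert [t for t in TIER_STRATEGIES if synthesis_allowed(t)] == [3, 4]
```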

FoundationDB Schema

# Hierarchical Knowledge Store schema

knowledge_store_schema = {
    "corpus": {
        # Key: ('knowledge', 'corpus', corpus_id)
        "corpus_id": "uuid",
        "name": "string",
        "created_at": "timestamp",
        "document_count": "int",
        "tier4_summary": "json",  # Corpus summary
        "tier4_generated_at": "timestamp",
        "metadata": "json"
    },

    "sections": {
        # Key: ('knowledge', 'sections', corpus_id, section_id)
        "section_id": "uuid",
        "corpus_id": "uuid",
        "topic": "string",
        "tier3_summary": "json",
        "source_chunks": "list[uuid]",  # Back-references
        "key_quotes": "list[json]",
        "synthesized_content_ratio": "float",  # Track abstraction %
        "generated_at": "timestamp"
    },

    "chunks": {
        # Key: ('knowledge', 'chunks', corpus_id, document_id, chunk_id)
        "chunk_id": "uuid",
        "document_id": "uuid",
        "corpus_id": "uuid",
        "tier2_summary": "json",
        "selected_sentences": "list[json]",  # With char offsets
        "extraction_method": "string",  # textrank|tfidf|llm_extractive
        "source_extractions": "list[uuid]",  # Back-references
        "generated_at": "timestamp"
    },

    "extractions": {
        # Key: ('knowledge', 'extractions', corpus_id, document_id, extraction_id)
        "extraction_id": "uuid",
        "document_id": "uuid",
        "corpus_id": "uuid",
        "tier1_data": "json",  # Raw extraction per schema
        "source_text": "string",  # Verbatim quote
        "char_offset_start": "int",
        "char_offset_end": "int",
        "confidence": "float",
        "extracted_at": "timestamp",
        "immutable": True  # Enforced at application layer
    },

    # Secondary indexes for efficient querying
    "indexes": {
        "by_topic": {
            # Key: ('knowledge', 'idx', 'topic', topic_hash, section_id)
        },
        "by_entity": {
            # Key: ('knowledge', 'idx', 'entity', entity_name, extraction_id)
        },
        "by_document": {
            # Key: ('knowledge', 'idx', 'document', document_id, tier, item_id)
        }
    }
}
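
FoundationDB's tuple layer preserves the lexicographic order of these key tuples, so every item under a shared prefix is contiguous and retrievable with one range scan. Plain Python tuples sort the same way, which lets the semantics be sketched without a database (keys and values below are illustrative):

```python
# Plain Python tuples model the tuple-layer ordering: everything under a
# prefix is contiguous, so one range scan fetches a whole document or corpus.
store = {
    ('knowledge', 'extractions', 'c1', 'docA', 'e1'): {"confidence": 0.9},
    ('knowledge', 'extractions', 'c1', 'docA', 'e2'): {"confidence": 0.8},
    ('knowledge', 'extractions', 'c1', 'docB', 'e3'): {"confidence": 0.7},
    ('knowledge', 'chunks', 'c1', 'docA', 'ch1'): {"sentences": 10},
}

def range_scan(prefix: tuple) -> list:
    """Return all values whose key starts with `prefix`, in key order."""
    return [v for k, v in sorted(store.items()) if k[:len(prefix)] == prefix]

# All Tier 1 extractions for corpus c1, across documents:
hits = range_scan(('knowledge', 'extractions', 'c1'))
assert len(hits) == 3
```

This is why the schema places corpus_id and document_id early in each key: queries narrow from corpus to document to item simply by lengthening the prefix.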

Implementation

# /coditect/knowledge/hierarchical_store.py

import hashlib
import logging
import time
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

class ValidationError(Exception):
    """Raised when a tier's content fails its hallucination-prevention checks."""

class KnowledgeTier(IntEnum):
    RAW_EXTRACTION = 1
    CHUNK_SUMMARY = 2
    SECTION_SUMMARY = 3
    CORPUS_SUMMARY = 4

@dataclass
class KnowledgeItem:
    """Base class for items at any tier"""
    item_id: str
    corpus_id: str
    tier: KnowledgeTier
    content: Dict[str, Any]
    sources: List[str]  # IDs of items this was derived from
    created_at: float
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Citation:
    """Citation linking to source material"""
    extraction_id: str
    document_id: str
    text: str
    char_start: int
    char_end: int
    confidence: float

class HierarchicalKnowledgeStore:
    """
    Multi-tier knowledge storage with hallucination prevention.

    Tier 1: Raw extractions (immutable)
    Tier 2: Chunk summaries (extractive only)
    Tier 3: Section summaries (extractive + light synthesis)
    Tier 4: Corpus summary (abstractive with grounding)
    """

    def __init__(
        self,
        fdb_client: 'FoundationDBClient',
        agent_pool: Optional['AgentPool'] = None  # needed for Tier 4 synthesis
    ):
        self.fdb = fdb_client
        self.agent_pool = agent_pool
        self.validators = {
            KnowledgeTier.CHUNK_SUMMARY: self._validate_extractive,
            KnowledgeTier.SECTION_SUMMARY: self._validate_section,
            KnowledgeTier.CORPUS_SUMMARY: self._validate_corpus
        }

    # ==================== TIER 1: Raw Extractions ====================

    async def store_extraction(
        self,
        corpus_id: str,
        document_id: str,
        extraction: Dict[str, Any],
        source_text: str,
        char_offset: tuple[int, int],
        confidence: float
    ) -> str:
        """
        Store raw extraction (Tier 1). Immutable after creation.
        """
        extraction_id = self._generate_id("ext", document_id, source_text)

        item = {
            "extraction_id": extraction_id,
            "document_id": document_id,
            "corpus_id": corpus_id,
            "tier1_data": extraction,
            "source_text": source_text,
            "char_offset_start": char_offset[0],
            "char_offset_end": char_offset[1],
            "confidence": confidence,
            "extracted_at": time.time(),
            "immutable": True
        }

        key = ('knowledge', 'extractions', corpus_id, document_id, extraction_id)
        await self.fdb.set(key, item)

        # Index by entities mentioned
        for entity in self._extract_entities(extraction):
            idx_key = ('knowledge', 'idx', 'entity', entity.lower(), extraction_id)
            await self.fdb.set(idx_key, {"extraction_id": extraction_id})

        return extraction_id

    # ==================== TIER 2: Chunk Summaries ====================

    async def build_chunk_summary(
        self,
        corpus_id: str,
        document_id: str,
        extraction_ids: List[str],
        method: str = "textrank"
    ) -> str:
        """
        Build extractive summary from raw extractions (Tier 2).
        Only selects sentences, never generates new content.
        """
        # Load source extractions
        extractions = await self._load_extractions(corpus_id, document_id, extraction_ids)

        # Extract key sentences (no generation)
        if method == "textrank":
            selected = self._textrank_select(extractions, top_k=10)
        elif method == "tfidf":
            selected = self._tfidf_select(extractions, top_k=10)
        else:
            raise ValueError(f"Unknown extraction method: {method}")

        # Validate: all selected content exists verbatim in sources
        await self._validate_extractive(selected, extractions)

        chunk_id = self._generate_id("chunk", document_id)

        item = {
            "chunk_id": chunk_id,
            "document_id": document_id,
            "corpus_id": corpus_id,
            "tier2_summary": {
                "selected_sentences": selected,
                "sentence_count": len(selected)
            },
            "extraction_method": method,
            "source_extractions": extraction_ids,
            "generated_at": time.time()
        }

        key = ('knowledge', 'chunks', corpus_id, document_id, chunk_id)
        await self.fdb.set(key, item)

        return chunk_id

    def _textrank_select(
        self,
        extractions: List[Dict],
        top_k: int
    ) -> List[Dict]:
        """TextRank-based key sentence selection"""
        # Build sentence graph based on similarity
        sentences = [e["source_text"] for e in extractions]

        # Compute similarity matrix (local imports keep heavy deps optional)
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(sentences)
        similarity_matrix = cosine_similarity(tfidf_matrix)

        # PageRank-style scoring
        import numpy as np
        scores = np.ones(len(sentences))
        damping = 0.85

        for _ in range(10):  # Fixed number of power-iteration steps (approximate convergence)
            new_scores = (1 - damping) + damping * similarity_matrix.T @ scores
            scores = new_scores / new_scores.sum()

        # Select top-k by score
        top_indices = np.argsort(scores)[-top_k:][::-1]

        return [
            {
                "text": extractions[i]["source_text"],
                "extraction_id": extractions[i]["extraction_id"],
                "score": float(scores[i]),
                "char_start": extractions[i]["char_offset_start"],
                "char_end": extractions[i]["char_offset_end"]
            }
            for i in top_indices
        ]

    # ==================== TIER 3: Section Summaries ====================

    async def build_section_summary(
        self,
        corpus_id: str,
        topic: str,
        chunk_ids: List[str],
        max_synthesis_ratio: float = 0.2
    ) -> str:
        """
        Build section summary from chunk summaries (Tier 3).
        Limited synthesis allowed for transitions.
        """
        # Load chunk summaries
        chunks = await self._load_chunks(corpus_id, chunk_ids)

        # Group by subtopic
        grouped = self._group_by_subtopic(chunks)

        # Build section with mostly extractive content
        section_content = []
        synthesized_chars = 0
        total_chars = 0

        for subtopic, items in grouped.items():
            # Add key quotes (extractive)
            key_quotes = self._select_representative_quotes(items, top_k=3)
            for quote in key_quotes:
                section_content.append({
                    "type": "extractive",
                    "content": quote["text"],
                    "source": quote["extraction_id"]
                })
                total_chars += len(quote["text"])

            # Add minimal transition (synthetic)
            if len(grouped) > 1:
                transition = f"Related findings on {subtopic}:"
                section_content.append({
                    "type": "synthetic",
                    "content": transition,
                    "source": None
                })
                synthesized_chars += len(transition)
                total_chars += len(transition)

        # Validate synthesis ratio
        actual_ratio = synthesized_chars / total_chars if total_chars > 0 else 0
        if actual_ratio > max_synthesis_ratio:
            raise ValidationError(
                f"Synthesis ratio {actual_ratio:.2%} exceeds max {max_synthesis_ratio:.2%}"
            )

        section_id = self._generate_id("section", topic)

        item = {
            "section_id": section_id,
            "corpus_id": corpus_id,
            "topic": topic,
            "tier3_summary": {
                "content": section_content,
                "key_quotes": [c for c in section_content if c["type"] == "extractive"]
            },
            "source_chunks": chunk_ids,
            "synthesized_content_ratio": actual_ratio,
            "generated_at": time.time()
        }

        key = ('knowledge', 'sections', corpus_id, section_id)
        await self.fdb.set(key, item)

        # Index by topic
        topic_hash = hashlib.md5(topic.lower().encode()).hexdigest()[:8]
        idx_key = ('knowledge', 'idx', 'topic', topic_hash, section_id)
        await self.fdb.set(idx_key, {"section_id": section_id})

        return section_id

    # ==================== TIER 4: Corpus Summary ====================

    async def build_corpus_summary(
        self,
        corpus_id: str,
        section_ids: List[str],
        validation_agent: 'ValidationAgent'
    ) -> Dict[str, Any]:
        """
        Build corpus summary from sections (Tier 4).
        Abstractive synthesis with mandatory validation.
        """
        # Load sections
        sections = await self._load_sections(corpus_id, section_ids)

        # Generate abstractive summary via LLM
        synthesis_prompt = self._build_corpus_synthesis_prompt(sections)

        synthesis_agent = await self.agent_pool.acquire(AgentRole.SYNTHESIZER)
        try:
            raw_summary = await synthesis_agent.execute(
                prompt=synthesis_prompt,
                require_citations=True
            )
        finally:
            await self.agent_pool.release(synthesis_agent)

        # CRITICAL: Validate every claim against sources
        validated_summary = await validation_agent.validate_claims(
            claims=raw_summary.claims,
            sources=sections,
            require_support=True
        )

        # Filter out unsupported claims
        supported_claims = [
            claim for claim in validated_summary.claims
            if claim.support_score >= 0.7
        ]

        unsupported = len(raw_summary.claims) - len(supported_claims)
        if unsupported > 0:
            logger.warning(
                f"Filtered {unsupported} unsupported claims from corpus summary"
            )

        corpus_summary = {
            "summary_text": self._reconstruct_summary(supported_claims),
            "claims": [c.to_dict() for c in supported_claims],
            "filtered_claims": unsupported,
            "validation_scores": validated_summary.scores,
            "source_sections": section_ids,
            "generated_at": time.time()
        }

        # Update corpus record
        key = ('knowledge', 'corpus', corpus_id)
        corpus = await self.fdb.get(key)
        corpus["tier4_summary"] = corpus_summary
        corpus["tier4_generated_at"] = time.time()
        await self.fdb.set(key, corpus)

        return corpus_summary

    # ==================== Query Interface ====================

    async def query(
        self,
        corpus_id: str,
        query: str,
        target_tier: KnowledgeTier = KnowledgeTier.SECTION_SUMMARY,
        include_sources: bool = True
    ) -> 'QueryResult':
        """
        Query knowledge store at specified tier with drill-down capability.
        """
        if target_tier == KnowledgeTier.CORPUS_SUMMARY:
            result = await self._query_corpus(corpus_id, query)
        elif target_tier == KnowledgeTier.SECTION_SUMMARY:
            result = await self._query_sections(corpus_id, query)
        elif target_tier == KnowledgeTier.CHUNK_SUMMARY:
            result = await self._query_chunks(corpus_id, query)
        else:
            result = await self._query_extractions(corpus_id, query)

        if include_sources:
            result.sources = await self._load_source_chain(result.item_ids)

        return result

    async def drill_down(
        self,
        item_id: str,
        current_tier: KnowledgeTier
    ) -> List['KnowledgeItem']:
        """
        Get source items from the tier below.
        """
        if current_tier == KnowledgeTier.RAW_EXTRACTION:
            return []  # Already at bottom

        item = await self._load_item(item_id, current_tier)
        source_ids = item.get("sources", [])

        target_tier = KnowledgeTier(current_tier - 1)
        return await self._load_items(source_ids, target_tier)

    # ==================== Validation ====================

    async def _validate_extractive(
        self,
        selected: List[Dict],
        sources: List[Dict]
    ) -> None:
        """Ensure selected content exists verbatim in sources"""
        source_texts = {s["source_text"] for s in sources}

        for item in selected:
            if item["text"] not in source_texts:
                raise ValidationError(
                    f"Selected text not found in sources: {item['text'][:50]}..."
                )

    async def _validate_section(
        self,
        section: Dict,
        chunks: List[Dict]
    ) -> None:
        """Validate section synthesis ratio and source coverage"""
        # Implementation details...
        pass

    async def _validate_corpus(
        self,
        summary: Dict,
        sections: List[Dict]
    ) -> None:
        """Validate corpus claims are supported by sections"""
        # Implementation details...
        pass
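
The power iteration inside _textrank_select can be exercised in isolation. A pure-Python toy run over a hand-built 3x3 similarity matrix (no sklearn or numpy needed at this scale); the middle sentence is strongly similar to both neighbors, so it should rank first:

```python
# Toy power iteration over a symmetric sentence-similarity matrix.
sim = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
n = len(sim)
damping = 0.85
scores = [1.0] * n

for _ in range(10):  # same fixed iteration count as _textrank_select
    new = [
        (1 - damping) + damping * sum(sim[j][i] * scores[j] for j in range(n))
        for i in range(n)
    ]
    total = sum(new)
    scores = [s / total for s in new]  # normalize each step

best = max(range(n), key=lambda i: scores[i])
assert best == 1  # the sentence connected to both others ranks first
```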

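Similarly, the synthesis-ratio bookkeeping in build_section_summary reduces to a small calculation that can be checked independently (the sample section content below is invented for illustration):

```python
def synthesis_ratio(blocks: list[dict]) -> float:
    """Fraction of characters that were generated rather than quoted."""
    synth = sum(len(b["content"]) for b in blocks if b["type"] == "synthetic")
    total = sum(len(b["content"]) for b in blocks)
    return synth / total if total else 0.0

section = [
    {"type": "extractive", "content": "Customers cited pricing as the top complaint in Q3."},
    {"type": "synthetic",  "content": "On pricing:"},  # short connective transition
    {"type": "extractive", "content": "Pricing was mentioned in 40% of churn interviews."},
]
ratio = synthesis_ratio(section)
assert ratio < 0.2  # within the 20% budget enforced at Tier 3
```
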
CLI Commands

# Build hierarchy for a corpus
coditect knowledge build \
    --corpus-id corpus_20260115 \
    --extraction-method textrank \
    --max-synthesis-ratio 0.2

# Query at specific tier
coditect knowledge query \
    --corpus-id corpus_20260115 \
    --query "customer complaints about pricing" \
    --tier section \
    --include-sources

# Drill down from summary to sources
coditect knowledge drilldown \
    --item-id section_abc123 \
    --format json

# Validate hierarchy integrity
coditect knowledge validate \
    --corpus-id corpus_20260115 \
    --check-citations \
    --check-synthesis-ratios
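
The --check-citations pass and the support_score >= 0.7 filter in build_corpus_summary both need a notion of claim support. The real check is delegated to a validation agent; a toy lexical stand-in (illustrative only, not the production scorer) shows the shape of the contract:

```python
def support_score(claim: str, source: str) -> float:
    """Toy grounding check: fraction of claim terms present in the source."""
    claim_terms = {w.lower().strip(".,") for w in claim.split()}
    source_terms = {w.lower().strip(".,") for w in source.split()}
    if not claim_terms:
        return 0.0
    return len(claim_terms & source_terms) / len(claim_terms)

source = "Pricing complaints rose 30% quarter over quarter."
claims = [
    "Pricing complaints rose 30% quarter over quarter.",   # fully supported
    "Customers are switching to a cheaper competitor.",    # unsupported
]
kept = [c for c in claims if support_score(c, source) >= 0.7]
assert len(kept) == 1  # only the grounded claim survives the filter
```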

Consequences

Positive

  • Multi-granularity access: Query at any level of detail
  • Hallucination prevention: Extractive lower tiers ground upper tiers
  • Audit trail: Full citation chain from summary to source
  • Performance: Pre-computed summaries enable instant retrieval

Negative

  • Storage overhead: content is stored at four granularities, though the summary tiers (~1-10KB each) are small relative to Tier 1 raw extractions
  • Build time: Hierarchy construction adds processing time
  • Staleness: Updates require rebuild of affected tiers
  • Complexity: More moving parts to maintain

Metrics

Metric                    | Target | Measurement
--------------------------|--------|--------------------------------
Hallucination rate        | <2%    | Claims without source support
Query latency (any tier)  | <100ms | p99 response time
Citation accuracy         | >98%   | Valid source references
Synthesis ratio (Tier 3)  | <20%   | Generated vs. extracted content

Related Decisions

  • ADR-027: Hybrid Document Processing Architecture (parent)
  • ADR-028: Map-Reduce Agent Orchestration (feeds Tier 1)
  • ADR-030: Compliance-Aware RAG (queries this store)