Better Methods: Advanced Alternatives to File-Based External Memory
Assessment of the Video's Approach
What It Gets Right
- Core insight is valid: Externalized state enables unbounded processing
- Checkpoint-resume pattern: Fundamental technique that scales
- Accessibility: Zero-code barrier for non-technical users
- Immediate utility: Works today with Claude Code
Critical Limitations
| Limitation | Impact | Severity |
|---|---|---|
| Linear processing | No parallelization, O(n) time complexity | High |
| No semantic awareness | Processes files in order, not by relevance | High |
| Full re-read on resume | Reads entire file set each checkpoint | Medium |
| No pre-filtering | LLM processes raw data, high token waste | High |
| Single-level memory | No hierarchy, flat structure | Medium |
| No quality validation | Hallucinations propagate through summaries | High |
| Manual prompt engineering | Each use case requires custom prompt | Medium |
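For reference, the video's checkpoint-resume pattern reduces to a few lines. This sketch (function and file names are illustrative, and the LLM call is elided) also makes the "full re-read on resume" limitation concrete: the state file is re-read at the start of every run.

```python
import json
import os
from typing import List

def process_with_checkpoint(files: List[str], state_path: str = "state.json") -> None:
    # Load prior progress if a checkpoint exists (this is the "full re-read").
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    for name in files:
        if name in done:
            continue  # resume: skip files already processed in a past session
        # ... call the LLM on `name` here and append results to an output file ...
        done.add(name)
        # Persist progress after every file so a crash loses at most one file.
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
```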
Alternative Method 1: Map-Reduce Pattern
Concept
Split documents into chunks, process in parallel ("map"), then aggregate results ("reduce"). Enables horizontal scaling and dramatically reduces wall-clock time.
Architecture
```
                  MAP PHASE (parallel)

  ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
  │ Doc 1  │   │ Doc 2  │   │ Doc 3  │   │ Doc N  │
  └───┬────┘   └───┬────┘   └───┬────┘   └───┬────┘
      ▼            ▼            ▼            ▼
  ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
  │Summary │   │Summary │   │Summary │   │Summary │
  │   1    │   │   2    │   │   3    │   │   N    │
  └───┬────┘   └───┬────┘   └───┬────┘   └───┬────┘
      └────────────┴──────┬─────┴────────────┘
                          ▼
                    REDUCE PHASE

      ┌──────────────────────────────────────┐
      │         Aggregate Summaries          │
      │  (Combine, Deduplicate, Synthesize)  │
      └──────────────────┬───────────────────┘
                         ▼
      ┌──────────────────────────────────────┐
      │             Final Output             │
      └──────────────────────────────────────┘
```
Advantages Over Video Method
- Process 50 files concurrently instead of sequentially → 10-50x faster wall-clock time
- Independent failure isolation → one bad file doesn't block others
- Optimal token utilization → each agent processes within context window
- Stateless map workers → no checkpoint complexity per worker
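As a minimal illustration of these properties, here is a stdlib-only sketch of the map phase with per-task failure isolation. The `summarize` stub stands in for a real LLM call (here it just truncates); `asyncio.gather(..., return_exceptions=True)` is what keeps one bad file from aborting the batch.

```python
import asyncio
from typing import List

async def summarize(doc: str) -> str:
    # Stand-in for an LLM call; truncation simulates a summary.
    if not doc:
        raise ValueError("empty document")
    return doc[:20]

async def map_reduce(docs: List[str]) -> str:
    # Map: run all summaries concurrently; exceptions are isolated per task.
    results = await asyncio.gather(
        *(summarize(d) for d in docs), return_exceptions=True
    )
    # Drop failed tasks so one bad file doesn't block the others.
    summaries = [r for r in results if isinstance(r, str)]
    # Reduce: aggregate (a real system would call the LLM again here).
    return " | ".join(summaries)
```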
Implementation (LangChain Example)
```python
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_anthropic import ChatAnthropic

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10000,
    chunk_overlap=1000
)
docs = text_splitter.split_documents(raw_documents)

# Map-reduce chain (MAP_PROMPT and REDUCE_PROMPT are user-defined PromptTemplates)
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=MAP_PROMPT,
    combine_prompt=REDUCE_PROMPT
)
result = chain.invoke(docs)
```
When to Use
- Batch processing of independent documents
- Time-sensitive analysis (need results fast)
- Documents don't have cross-references
- Infrastructure supports parallel API calls
Alternative Method 2: Hierarchical Summarization
Concept
Build a pyramid of summaries: summarize chunks → summarize summaries → summarize the summary of summaries. Each level compresses information while preserving key insights.
Architecture
```
Level 3 (Final):   Master Summary (500 tokens)
                     ├─ Level 2: Section Summary A (2K tokens)
                     │    ├─ Level 1: Chunk Summary C1 ◄─ Level 0: Doc 1 (raw)
                     │    ├─ Level 1: Chunk Summary C2 ◄─ Level 0: Doc 2 (raw)
                     │    └─ Level 1: Chunk Summary C3 ◄─ Level 0: Doc 3 (raw)
                     └─ Level 2: Section Summary B (2K tokens)
                          ├─ Level 1: Chunk Summary C4 ◄─ Level 0: Doc 4 (raw)
                          └─ Level 1: Chunk Summary C5 ◄─ Level 0: Doc 5 (raw)
```
Key Insight: Compression Ratios
| Level | Input Size | Output Size | Compression |
|---|---|---|---|
| L0 → L1 | 10K tokens | 1K tokens | 10:1 |
| L1 → L2 | 5K tokens (5×1K) | 4K tokens (2×2K) | 1.25:1 |
| L2 → L3 | 4K tokens (2×2K) | 500 tokens | 8:1 |
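Plugging in the figures from the diagram (five 10K-token documents), the pyramid compresses 50K raw tokens down to a 500-token master summary, a 100:1 end-to-end ratio:

```python
# Token budget for the pyramid above (five 10K-token documents).
raw     = 5 * 10_000   # Level 0: raw documents
level_1 = 5 * 1_000    # Level 1: 10:1 per-chunk summaries
level_2 = 2 * 2_000    # Level 2: two 2K-token section summaries
level_3 = 500          # Level 3: master summary

overall = raw / level_3  # end-to-end compression ratio
```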
Advantages Over Video Method
- Multi-level retrieval: Query at appropriate granularity
- Progressive disclosure: Start with summary, drill down as needed
- Model flexibility: Use cheaper models for lower levels
- Parallelizable at each level: All L1 summaries can run in parallel
Critical Warning: Hallucination Propagation
Research from Pieces.app shows that over-preprocessing increases hallucinations:
"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."
Mitigation:
- Preserve exact quotes at lower levels
- Use extractive summaries (exact text selection) before abstractive
- Ground upper levels in lower-level citations
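An extractive first pass needs no LLM at all. The sketch below is a deliberately simple word-frequency scorer (production systems would typically use TF-IDF or TextRank); because it only selects verbatim sentences, nothing at this level can be hallucinated:

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    # Split into sentences and score each by summed word frequency.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    top = set(scored[:num_sentences])
    # Preserve original order so quotes stay verbatim and in context.
    return " ".join(s for s in sentences if s in top)
```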
Implementation Pattern
```python
import asyncio
from typing import List

async def hierarchical_summarize(
    documents: List[str],
    chunk_size: int = 10000,
    compression_ratio: float = 0.1
) -> HierarchySummary:
    """Build a hierarchical summary pyramid."""
    # Level 0 → Level 1: chunk summaries (parallel)
    chunks = split_into_chunks(documents, chunk_size)
    level_1 = await asyncio.gather(*[
        summarize_chunk(chunk, target_size=int(chunk_size * compression_ratio))
        for chunk in chunks
    ])

    # Level 1 → Level 2: section summaries
    sections = group_related_summaries(level_1, group_size=5)
    level_2 = await asyncio.gather(*[
        merge_summaries(section)
        for section in sections
    ])

    # Level 2 → Level 3: master summary
    level_3 = await create_master_summary(level_2)

    return HierarchySummary(
        master=level_3,
        sections=level_2,
        chunks=level_1,
        raw_count=len(documents)
    )
```
When to Use
- Long-form document analysis (books, reports, transcripts)
- Need both high-level overview AND ability to drill down
- Information density varies across document
- Query complexity varies (simple → detailed)
Alternative Method 3: RAG with Vector Store
Concept
Instead of processing all documents sequentially, embed documents into a vector database and retrieve only the most relevant chunks for each query. Process 10% of data, get 90% of value.
Architecture
```
INGESTION PHASE (one-time):
  ┌──────────┐    ┌────────────┐    ┌─────────────────────┐
  │ Raw Docs │───►│ Chunker +  │───►│  Vector Database    │
  │          │    │ Embedder   │    │  (Pinecone/Chroma)  │
  └──────────┘    └────────────┘    └─────────────────────┘

QUERY PHASE (per-request):
  ┌──────────┐    ┌────────────┐    ┌─────────────────────┐
  │  Query   │───►│  Embed +   │───►│  Top-K Retrieval    │
  │          │    │  Search    │    │  (k=5-20 chunks)    │
  └──────────┘    └────────────┘    └──────────┬──────────┘
                                               ▼
                                    ┌─────────────────────┐
                                    │   LLM Generation    │
                                    │   (with context)    │
                                    └─────────────────────┘
```
Advantages Over Video Method
- Semantic relevance: Retrieve by meaning, not file order
- Massive scale: Handle millions of documents
- Query-specific context: Only load what's needed
- Incremental updates: Add new docs without reprocessing all
- Sub-second retrieval: Approximate nearest-neighbor indexes give sub-linear search
RAG Variants for Different Use Cases
| Variant | Description | Best For |
|---|---|---|
| Traditional RAG | Simple retrieve + generate | FAQ, simple queries |
| Long RAG | Retrieve entire sections, not snippets | Context-heavy analysis |
| Self-RAG | Model decides when to retrieve | Complex reasoning |
| Corrective RAG | Validates and corrects retrieved context | High-accuracy needs |
| GraphRAG | Knowledge graph + vector retrieval | Entity relationships |
| Adaptive RAG | Dynamic strategy based on query | Mixed query types |
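Adaptive RAG puts a router in front of the retriever. A real router is usually a trained classifier or an LLM call; this toy heuristic only shows the shape of the interface (the strategy names mirror the table above):

```python
def route_query(query: str) -> str:
    # Naive heuristic router: pick a RAG strategy from surface features.
    q = query.lower()
    if any(w in q for w in ("why", "how", "explain")):
        return "self-rag"       # reasoning-heavy: let the model decide retrieval
    if len(q.split()) > 15:
        return "long-rag"       # long analytical queries get whole sections
    return "traditional-rag"    # short factual lookups
```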
Implementation (Simple RAG)
```python
from langchain_community.vectorstores import Chroma
from langchain_anthropic import ChatAnthropic
from langchain.chains import RetrievalQA

# Create vector store (one-time; embedding_model is any LangChain Embeddings instance)
vectorstore = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding_model,
    persist_directory="./chroma_db"
)

# Create retrieval chain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10}
)
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatAnthropic(model="claude-sonnet-4-20250514"),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What are customer pain points?"})
```
When to Use
- Large document corpus (>100 documents)
- Repeated queries against same corpus
- Need source attribution
- Query intent varies widely
- Real-time response requirements
Alternative Method 4: Multi-Level Memory Hierarchy
Concept
Inspired by human memory: maintain separate memory systems operating at different timescales and abstraction levels.
Architecture
```
┌─────────────────────────────────────────────────┐
│            WORKING MEMORY (Immediate)           │
│  - Current conversation context                 │
│  - Active task state                            │
│  - Scope: last 5-10 interactions                │
│  - Storage: in-context (prompt)                 │
└────────────────────────┬────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────┐
│            EPISODIC MEMORY (Session)            │
│  - Important past interactions                  │
│  - Key decisions and outcomes                   │
│  - Scope: current session + recent sessions     │
│  - Storage: compressed summaries in DB          │
└────────────────────────┬────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────┐
│           SEMANTIC MEMORY (Long-term)           │
│  - General knowledge extracted over time        │
│  - User preferences and patterns                │
│  - Scope: all historical interactions           │
│  - Storage: knowledge graph + vector store      │
└─────────────────────────────────────────────────┘
```
Advantages Over Video Method
- Appropriate granularity: Recent = detailed, old = summarized
- Efficient retrieval: Query correct memory level
- Continuous learning: Semantic memory evolves over time
- Graceful degradation: Lose detail, preserve essence
Implementation Pattern
```python
from dataclasses import dataclass

@dataclass
class MemoryHierarchy:
    working: WorkingMemory    # In-context, last N messages
    episodic: EpisodicMemory  # Compressed session summaries
    semantic: SemanticMemory  # Extracted knowledge graph

    async def recall(self, query: str) -> MemoryContext:
        """Retrieve relevant context from all memory levels."""
        # Always include working memory
        working_ctx = self.working.get_recent(n=10)

        # Search episodic memory for relevant past sessions
        episodic_ctx = await self.episodic.search(
            query=query,
            max_results=5
        )

        # Query semantic memory for general knowledge
        semantic_ctx = await self.semantic.query(
            query=query,
            include_relations=True
        )

        return MemoryContext(
            working=working_ctx,
            episodic=episodic_ctx,
            semantic=semantic_ctx
        )

    async def commit(self, interaction: Interaction):
        """Update the memory hierarchy after an interaction."""
        # Update working memory
        self.working.append(interaction)

        # Periodically compress working memory into episodic memory
        if self.working.should_compress():
            summary = await self.compress_working()
            await self.episodic.store(summary)
            self.working.clear_old()

        # Extract semantic knowledge
        facts = await self.extract_facts(interaction)
        await self.semantic.update(facts)
```
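The `should_compress()` and `clear_old()` hooks referenced in the pattern are left abstract; one plausible minimal implementation (capacity and eviction policy are arbitrary choices here) is a fixed-capacity rolling buffer:

```python
from typing import List

class WorkingMemory:
    # Minimal sketch: fixed-size rolling buffer with a compression trigger.
    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self.items: List[str] = []

    def append(self, item: str) -> None:
        self.items.append(item)

    def should_compress(self) -> bool:
        # Compress once the buffer exceeds its capacity.
        return len(self.items) > self.capacity

    def clear_old(self) -> None:
        # After compressing to episodic memory, keep only the most recent half.
        self.items = self.items[-self.capacity // 2:]
```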
When to Use
- Long-running agent sessions
- Need to maintain user context over time
- Information has different decay rates
- Query complexity varies
Alternative Method 5: Pre-Processing Pipeline
Concept
Instead of feeding raw documents to LLM, apply traditional NLP/ML techniques first to extract structure, filter noise, and reduce token consumption.
Pipeline Architecture
```
PRE-PROCESSING PIPELINE:
  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │   Raw    │──►│   OCR/   │──►│  Format  │──►│  Entity  │
  │   Docs   │   │  Parse   │   │  Detect  │   │ Extract  │
  └──────────┘   └──────────┘   └──────────┘   └─────┬────┘
                                                     ▼
  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │   LLM    │◄──│ Semantic │◄──│ Keyword  │◄──│   NER    │
  │ Analysis │   │ Chunking │   │  Filter  │   │  Output  │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘
```
Pre-Processing Techniques
| Stage | Technique | Token Reduction |
|---|---|---|
| Parsing | Extract text from PDF/DOCX, remove boilerplate | 20-40% |
| Deduplication | Remove repeated content, headers/footers | 10-30% |
| NER | Extract names, dates, amounts before LLM | 5-15% |
| Keyword filtering | Only process sections containing target terms | 50-80% |
| Extractive summary | Select key sentences (TF-IDF, TextRank) | 60-90% |
| Compression | Remove redundancy while preserving meaning | 30-50% |
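Deduplicating repeated headers and footers is one of the cheapest wins in the table. A simple frequency heuristic (the 0.6 threshold is an arbitrary choice for illustration) drops any line that appears on most pages:

```python
from collections import Counter
from typing import List

def strip_boilerplate(pages: List[str], threshold: float = 0.6) -> List[str]:
    # Lines that recur across most pages (headers/footers) are boilerplate.
    line_counts = Counter(
        line for page in pages for line in set(page.splitlines())
    )
    cutoff = threshold * len(pages)
    return [
        "\n".join(l for l in page.splitlines() if line_counts[l] <= cutoff)
        for page in pages
    ]
```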
Advantages Over Video Method
- Dramatic token savings: Process 10-20% of original tokens
- Structured output: Entities extracted in consistent format
- Deterministic extraction: NER is reproducible, LLM isn't
- Cost reduction: Less LLM API spend
Implementation Example
```python
import spacy
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer

class PreProcessor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.tfidf = TfidfVectorizer(max_features=1000)

    def process(self, documents: List[str], keywords: List[str]) -> ProcessedCorpus:
        results = []
        for doc in documents:
            # Step 1: Basic cleaning
            cleaned = self.clean_text(doc)

            # Step 2: Keyword filtering (keep only relevant sections)
            relevant_sections = self.filter_by_keywords(cleaned, keywords)
            if not relevant_sections:
                continue

            # Step 3: NER extraction
            entities = self.extract_entities(relevant_sections)

            # Step 4: Extractive summary (reduce to key sentences)
            key_sentences = self.extractive_summary(
                relevant_sections,
                num_sentences=10
            )

            results.append(ProcessedDoc(
                original_tokens=len(doc.split()),
                processed_tokens=len(key_sentences.split()),
                entities=entities,
                summary=key_sentences
            ))
        return ProcessedCorpus(documents=results)

    def filter_by_keywords(self, text: str, keywords: List[str]) -> str:
        """Keep only paragraphs containing target keywords."""
        paragraphs = text.split('\n\n')
        relevant = [
            p for p in paragraphs
            if any(kw.lower() in p.lower() for kw in keywords)
        ]
        return '\n\n'.join(relevant)
```
When to Use
- High document volume (>1000 files)
- Known extraction targets (specific entities, topics)
- Cost-sensitive applications
- Need deterministic extraction alongside LLM analysis
Hybrid Approach: Best of All Methods
Recommended Architecture for Enterprise
```
HYBRID PROCESSING PIPELINE

STAGE 1: PRE-PROCESSING (reduce input)
┌─────────────────────────────────────────────────────────┐
│ OCR → Clean → NER → Keyword Filter → Extractive Summary │
│ Result: ~80% token reduction, structured entities       │
└────────────────────────────┬────────────────────────────┘
                             ▼
STAGE 2: INDEXING (enable retrieval)
┌─────────────────────────────────────────────────────────┐
│ Chunk → Embed → Store in Vector DB → Build KG           │
│ Result: semantic search enabled, entity graph built     │
└────────────────────────────┬────────────────────────────┘
                             ▼
STAGE 3: MAP-REDUCE ANALYSIS (parallel processing)
┌─────────────────────────────────────────────────────────┐
│ Parallel chunk analysis → Aggregate → Synthesize        │
│ Result: comprehensive insights in ~1/10th the time      │
└────────────────────────────┬────────────────────────────┘
                             ▼
STAGE 4: HIERARCHICAL MEMORY (persistent state)
┌─────────────────────────────────────────────────────────┐
│ Store summaries → Build hierarchy → Enable drill-down   │
│ Result: multi-level retrieval, persistent knowledge     │
└────────────────────────────┬────────────────────────────┘
                             ▼
STAGE 5: RAG QUERY (interactive retrieval)
┌─────────────────────────────────────────────────────────┐
│ Query → Retrieve relevant context → Generate response   │
│ Result: fast, accurate, cited responses                 │
└─────────────────────────────────────────────────────────┘
```
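The five stages compose as an ordinary function pipeline, each stage feeding its output to the next. A hypothetical driver could be as small as:

```python
from typing import Any, Callable, List

def run_pipeline(data: Any, stages: List[Callable[[Any], Any]]) -> Any:
    # Feed each stage's output into the next stage.
    for stage in stages:
        data = stage(data)
    return data
```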
Comparison Matrix
| Method | Speed | Accuracy | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Video (Sequential) | Slow | Medium | High | Low | Quick demos |
| Map-Reduce | Fast | Medium | Medium | Medium | Batch processing |
| Hierarchical | Medium | High | Medium | High | Long documents |
| RAG | Fast | High | Low* | High | Interactive queries |
| Pre-Processing | Fast | Medium | Low | Medium | High volume |
| Hybrid | Fast | High | Low | Very High | Enterprise |
*After initial indexing investment
Coditect Implementation Recommendations
Priority 1: Map-Reduce Agent Orchestration
```python
import asyncio
from typing import List

class CoditectMapReduce:
    """Parallel document processing with agent specialization"""
    async def process_corpus(
        self,
        documents: List[Document],
        extraction_schema: ExtractionSchema
    ) -> CorpusAnalysis:
        # Map: spawn a specialized extraction agent per document
        map_tasks = [
            self.spawn_extraction_agent(doc, extraction_schema)
            for doc in documents
        ]
        # Execute in parallel within per-agent token budgets
        map_results = await asyncio.gather(*map_tasks)

        # Reduce: a synthesis agent aggregates the per-document results
        return await self.synthesis_agent.aggregate(map_results)
```
Priority 2: Hierarchical Knowledge Store
```python
import fdb
fdb.api_version(730)

class CoditectKnowledgeStore:
    """FoundationDB-backed hierarchical memory"""
    def __init__(self, cluster_file: str):
        self.db = fdb.open(cluster_file)
        # One directory subspace per level of the hierarchy
        self.hierarchy = {
            level: fdb.directory.create_or_open(self.db, ('knowledge', level))
            for level in ('raw', 'chunks', 'summaries', 'master')
        }
```
Priority 3: Compliance-Aware RAG
```python
from datetime import datetime

class CoditectRAG:
    """RAG with audit trails for regulated industries"""
    async def query(
        self,
        query: str,
        user_context: UserContext
    ) -> AuditedResponse:
        # Retrieve with access control
        chunks = await self.retriever.search(
            query=query,
            filters=self.access_control.get_filters(user_context)
        )

        # Generate with source tracking
        response = await self.generator.generate(
            query=query,
            context=chunks,
            require_citations=True
        )

        # Create audit record (21 CFR Part 11)
        audit_record = await self.audit_log.record(
            query=query,
            sources=chunks,
            response=response,
            user=user_context.user_id,
            timestamp=datetime.utcnow()
        )

        return AuditedResponse(
            answer=response.text,
            citations=response.citations,
            audit_id=audit_record.id
        )
```
Actionable Recommendations
For Immediate Use (Today)
- Switch to Map-Reduce for batch processing → 5-10x speed improvement
- Add keyword pre-filtering → 50-80% token reduction
- Use RAG for repeated queries against same corpus
For Coditect Roadmap
- Sprint 1-2: Implement parallel agent orchestration with map-reduce
- Sprint 3-4: Build hierarchical knowledge store on FoundationDB
- Sprint 5-6: Add RAG layer with compliance audit trails
- Sprint 7-8: Integrate pre-processing pipeline for token efficiency
ROI Projections
| Improvement | Token Savings | Time Savings | Cost Savings |
|---|---|---|---|
| Map-Reduce | 0% | 80-90% | 0% |
| Pre-Processing | 60-80% | 20% | 60-80% |
| RAG (vs full reprocess) | 90%+ | 95%+ | 90%+ |
| Combined | 70-85% | 90-95% | 75-90% |
Conclusion
The video's approach is a valid starting point but represents the simplest possible implementation. Enterprise-grade systems require:
- Parallelization (Map-Reduce) for speed
- Semantic retrieval (RAG) for relevance
- Hierarchical memory for multi-scale access
- Pre-processing for token efficiency
- Audit trails for compliance
Coditect's opportunity is to package these advanced patterns into a turnkey solution that regulated industries can adopt without building from scratch.