
Document Segmentation

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for intelligently splitting large documents at semantic boundaries rather than arbitrary character counts. Essential for processing research papers, technical documentation, and long-form content.

When to Use

Use this skill when:

  • Processing research papers longer than the context window
  • Analyzing technical documentation with code examples
  • Splitting documents that contain algorithms or formulas
  • Preserving semantic context across segments
  • Processing markdown documents with nested structure
  • Handling PDFs converted to text

Don't use this skill when:

  • Document fits within context window
  • Simple text without structure (plain prose)
  • Already-segmented content (chapters, sections)
  • Real-time streaming content
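
The first "don't use" condition can be checked mechanically before invoking the segmenter. A minimal sketch, assuming the ~4-characters-per-token heuristic used throughout this skill and a 100K-token context window:

```python
def needs_segmentation(text: str, context_tokens: int = 100_000) -> bool:
    """Return True when a document's rough token estimate exceeds the
    context window, i.e. when segmentation is worth the overhead."""
    estimated_tokens = len(text) // 4  # ~4 characters per token (rough heuristic)
    return estimated_tokens > context_tokens

# A short README fits comfortably; a million-character paper does not.
print(needs_segmentation("short doc"))               # False
print(needs_segmentation("x" * 1_000_000, 100_000))  # True
```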

Core Algorithm

Semantic Boundary Detection

import re
from typing import List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum


class BoundaryType(Enum):
    """Types of semantic boundaries"""
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    CODE_BLOCK = "code_block"
    FORMULA = "formula"
    LIST = "list"
    TABLE = "table"
    SECTION = "section"


@dataclass
class Segment:
    """A document segment with metadata"""
    content: str
    start_line: int
    end_line: int
    boundary_type: BoundaryType
    heading: Optional[str] = None
    token_estimate: int = 0


@dataclass
class SegmentationConfig:
    """Configuration for document segmentation"""
    target_tokens: int = 4000    # Target tokens per segment
    max_tokens: int = 6000       # Hard maximum
    min_tokens: int = 500        # Minimum viable segment
    overlap_tokens: int = 200    # Overlap for context preservation
    preserve_code_blocks: bool = True
    preserve_formulas: bool = True
    preserve_tables: bool = True


class DocumentSegmenter:
    """
    Intelligently segment documents at semantic boundaries.

    Key Innovation: Split at natural boundaries (sections, paragraphs)
    rather than arbitrary character counts, preserving context.
    """

    # Patterns for boundary detection
    HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
    CODE_BLOCK_PATTERN = re.compile(r'```[\s\S]*?```', re.MULTILINE)
    FORMULA_PATTERN = re.compile(r'\$\$[\s\S]*?\$\$|\\\[[\s\S]*?\\\]', re.MULTILINE)
    TABLE_PATTERN = re.compile(r'^\|.+\|$', re.MULTILINE)
    LIST_PATTERN = re.compile(r'^[\s]*[-*+]\s+.+$', re.MULTILINE)

    def __init__(self, config: Optional[SegmentationConfig] = None):
        self.config = config or SegmentationConfig()

    def estimate_tokens(self, text: str) -> int:
        """Estimate token count (rough: ~4 chars per token)"""
        return len(text) // 4

    def find_boundaries(self, text: str) -> List[Tuple[int, BoundaryType, str]]:
        """
        Find all semantic boundaries in the document.

        Returns a list of (position, type, content) tuples sorted by position.
        """
        boundaries = []

        # Find headings
        for match in self.HEADING_PATTERN.finditer(text):
            level = len(match.group(1))
            boundaries.append((
                match.start(),
                BoundaryType.HEADING,
                f"h{level}: {match.group(2)}"
            ))

        # Find code blocks (to preserve)
        for match in self.CODE_BLOCK_PATTERN.finditer(text):
            boundaries.append((
                match.start(),
                BoundaryType.CODE_BLOCK,
                "code_block"
            ))

        # Find formulas (to preserve)
        for match in self.FORMULA_PATTERN.finditer(text):
            boundaries.append((
                match.start(),
                BoundaryType.FORMULA,
                "formula"
            ))

        # Sort by position
        boundaries.sort(key=lambda x: x[0])
        return boundaries

    def find_safe_split_point(
        self,
        text: str,
        target_pos: int,
        boundaries: List[Tuple[int, BoundaryType, str]]
    ) -> int:
        """
        Find the nearest safe split point (paragraph or heading boundary).

        Avoids splitting:
        - Inside code blocks
        - Inside formulas
        - Mid-sentence
        """
        # Find the nearest boundary before the target position
        best_pos = 0
        for pos, btype, _ in boundaries:
            if pos > target_pos:
                break
            if btype in (BoundaryType.HEADING, BoundaryType.PARAGRAPH):
                best_pos = pos

        # If no boundary found, fall back to the last paragraph break
        if best_pos == 0:
            # Look for a double newline (paragraph break)
            para_breaks = [m.start() for m in re.finditer(r'\n\n', text[:target_pos])]
            if para_breaks:
                best_pos = para_breaks[-1] + 2

        # Last resort: fall back to the last newline
        if best_pos == 0:
            newlines = [m.start() for m in re.finditer(r'\n', text[:target_pos])]
            if newlines:
                best_pos = newlines[-1] + 1

        return best_pos if best_pos > 0 else target_pos

    def segment(self, text: str) -> List[Segment]:
        """
        Segment a document into chunks respecting semantic boundaries.

        Args:
            text: Full document text

        Returns:
            List of Segment objects
        """
        segments = []
        boundaries = self.find_boundaries(text)

        current_pos = 0
        current_heading = None

        while current_pos < len(text):
            # Calculate the target end position
            remaining_text = text[current_pos:]
            target_chars = self.config.target_tokens * 4  # Rough estimate

            if len(remaining_text) <= target_chars:
                # Last segment - take everything
                segment_text = remaining_text
                end_pos = len(text)
            else:
                # Find a safe split point
                target_end = current_pos + target_chars
                split_pos = self.find_safe_split_point(
                    text,
                    target_end,
                    boundaries
                )

                # Ensure minimum progress
                if split_pos <= current_pos:
                    split_pos = min(current_pos + target_chars, len(text))

                segment_text = text[current_pos:split_pos]
                end_pos = split_pos

            # Find the current heading context
            for pos, btype, content in boundaries:
                if pos >= end_pos:
                    break
                if btype == BoundaryType.HEADING and pos >= current_pos:
                    current_heading = content

            # Create the segment
            segment = Segment(
                content=segment_text.strip(),
                start_line=text[:current_pos].count('\n') + 1,
                end_line=text[:end_pos].count('\n') + 1,
                boundary_type=BoundaryType.SECTION,
                heading=current_heading,
                token_estimate=self.estimate_tokens(segment_text)
            )
            segments.append(segment)

            # Stop once the end of the document is reached
            if end_pos >= len(text):
                break

            # Move to the next position, backing up for overlap while
            # guaranteeing forward progress
            overlap_chars = self.config.overlap_tokens * 4
            current_pos = max(end_pos - overlap_chars, current_pos + 1)

        return segments

    def segment_with_context(self, text: str) -> List[dict]:
        """
        Segment with context preservation for LLM processing.

        Returns segments with a summary of preceding headings.
        """
        segments = self.segment(text)
        result = []

        for i, segment in enumerate(segments):
            context = {
                "segment_number": i + 1,
                "total_segments": len(segments),
                "content": segment.content,
                "heading": segment.heading,
                "token_estimate": segment.token_estimate,
                "lines": f"{segment.start_line}-{segment.end_line}",
            }

            # Add preceding heading context (deduplicated, order preserved)
            if i > 0:
                prev_headings = [
                    s.heading for s in segments[:i]
                    if s.heading
                ]
                context["preceding_sections"] = list(dict.fromkeys(prev_headings))

            result.append(context)

        return result

Research Paper Specific Processing

class ResearchPaperSegmenter(DocumentSegmenter):
    """
    Specialized segmenter for academic papers.

    Handles:
    - Abstract, Introduction, Methods, Results, Discussion sections
    - Equations and formulas
    - Algorithm pseudocode
    - References section
    """

    SECTION_PATTERNS = {
        "abstract": re.compile(r'^#+\s*Abstract', re.IGNORECASE | re.MULTILINE),
        "introduction": re.compile(r'^#+\s*Introduction', re.IGNORECASE | re.MULTILINE),
        "methods": re.compile(r'^#+\s*(Methods?|Methodology)', re.IGNORECASE | re.MULTILINE),
        "results": re.compile(r'^#+\s*Results?', re.IGNORECASE | re.MULTILINE),
        "discussion": re.compile(r'^#+\s*Discussion', re.IGNORECASE | re.MULTILINE),
        "conclusion": re.compile(r'^#+\s*Conclusion', re.IGNORECASE | re.MULTILINE),
        "references": re.compile(r'^#+\s*References?', re.IGNORECASE | re.MULTILINE),
    }

    ALGORITHM_PATTERN = re.compile(
        r'(Algorithm\s+\d+|Procedure\s+\d+|\\begin\{algorithm\}[\s\S]*?\\end\{algorithm\})',
        re.IGNORECASE
    )

    def identify_paper_sections(self, text: str) -> dict:
        """Identify standard paper sections and their positions"""
        sections = {}
        for name, pattern in self.SECTION_PATTERNS.items():
            match = pattern.search(text)
            if match:
                sections[name] = match.start()
        return dict(sorted(sections.items(), key=lambda x: x[1]))

    def extract_algorithms(self, text: str) -> List[dict]:
        """Extract algorithm blocks that must be preserved intact"""
        algorithms = []
        for match in self.ALGORITHM_PATTERN.finditer(text):
            algorithms.append({
                "content": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
        return algorithms

    def segment_paper(self, text: str) -> List[dict]:
        """
        Segment a research paper with section awareness.

        Splitting relies on the base class's semantic boundary detection;
        extract_algorithms() exposes algorithm regions for callers that need
        explicit protection. Each segment is annotated with the paper section
        it belongs to.
        """
        sections = self.identify_paper_sections(text)

        # Segment with context preservation
        segments = self.segment_with_context(text)

        # Annotate segments with section info
        for segment in segments:
            # Locate the segment in the source to find its enclosing section
            segment_start = text.find(segment["content"][:100])
            current_section = None
            if segment_start != -1:
                for section_name, section_start in sections.items():
                    if section_start <= segment_start:
                        current_section = section_name
            segment["paper_section"] = current_section

        return segments

Usage Examples

Basic Document Segmentation

# Create a segmenter with a custom config
config = SegmentationConfig(
    target_tokens=4000,
    max_tokens=6000,
    overlap_tokens=200
)
segmenter = DocumentSegmenter(config)

# Read and segment the document
with open("large_document.md") as f:
    text = f.read()

segments = segmenter.segment(text)

for i, seg in enumerate(segments):
    print(f"Segment {i+1}: {seg.token_estimate} tokens")
    print(f"  Lines: {seg.start_line}-{seg.end_line}")
    print(f"  Heading: {seg.heading}")

Research Paper Processing

# Process a research paper
paper_segmenter = ResearchPaperSegmenter()

with open("research_paper.md") as f:
    paper_text = f.read()

# Get the segmented paper with section awareness
segments = paper_segmenter.segment_paper(paper_text)

for seg in segments:
    print(f"Section: {seg['paper_section']}")
    print(f"Tokens: {seg['token_estimate']}")
    print(f"Preceding: {seg.get('preceding_sections', [])}")
    print("---")

Integration with LLM Processing

async def process_large_document(document_path: str, llm_client) -> dict:
    """Process a large document in segments with an LLM"""

    segmenter = DocumentSegmenter()

    with open(document_path) as f:
        text = f.read()

    segments = segmenter.segment_with_context(text)
    results = []

    for seg in segments:
        # heading may be present but None, so fall back with `or`
        prompt = f"""
Analyzing document segment {seg['segment_number']} of {seg['total_segments']}.

Current section: {seg.get('heading') or 'Unknown'}
Preceding sections: {', '.join(seg.get('preceding_sections', []))}

Content:
{seg['content']}

Extract key concepts, algorithms, and implementation details.
"""

        result = await llm_client.complete(prompt)
        results.append({
            "segment": seg['segment_number'],
            "analysis": result
        })

    return {
        "total_segments": len(segments),
        "segment_analyses": results
    }

Best Practices

DO

  • Preserve code blocks intact - Never split mid-code
  • Keep formulas together - Mathematical notation must stay complete
  • Use overlap for context - Small overlap prevents context loss
  • Annotate with headings - Track document structure across segments
  • Process sequentially - Build understanding across segments

DON'T

  • Don't split mid-sentence - Always find sentence boundaries
  • Don't ignore document structure - Headings matter
  • Don't use fixed character splits - Use semantic boundaries
  • Don't lose context - Pass section summaries between segments
  • Don't process out of order - Documents have flow
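
The mid-sentence rule above can be enforced as a last-resort fallback in the split-point search. A rough sketch (the regex treats `.`, `!`, or `?` followed by whitespace as a sentence end; real prose also needs handling for abbreviations and closing quotes):

```python
import re

def last_sentence_boundary(text: str, target_pos: int) -> int:
    """Return the position just after the last sentence-ending punctuation
    before target_pos, or target_pos itself if none is found."""
    window = text[:target_pos]
    ends = [m.end() for m in re.finditer(r'[.!?](?=\s)', window)]
    return ends[-1] if ends else target_pos

text = "First sentence. Second sentence. Third sentence continues"
split = last_sentence_boundary(text, 40)
print(text[:split])  # "First sentence. Second sentence."
```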

Configuration Reference

| Parameter            | Default | Description               |
| -------------------- | ------- | ------------------------- |
| target_tokens        | 4000    | Target tokens per segment |
| max_tokens           | 6000    | Hard maximum per segment  |
| min_tokens           | 500     | Minimum viable segment    |
| overlap_tokens       | 200     | Overlap for context       |
| preserve_code_blocks | true    | Keep code blocks intact   |
| preserve_formulas    | true    | Keep math notation intact |
| preserve_tables      | true    | Keep tables intact        |

Integration with CODITECT

Recommended integration points:

| Agent/Workflow    | Usage                    | Notes                          |
| ----------------- | ------------------------ | ------------------------------ |
| paper-to-code     | Primary input processor  | Segment papers before analysis |
| thoughts-analyzer | Large document analysis  | Process research documents     |
| research-agent    | Technical doc processing | Handle long specifications     |
| code-indexer      | Repository documentation | Process README files           |

Success Metrics

| Metric                  | Target | Measurement                |
| ----------------------- | ------ | -------------------------- |
| Code block preservation | 100%   | No splits inside code      |
| Formula preservation    | 100%   | No splits inside math      |
| Context continuity      | >90%   | Heading context maintained |
| Token efficiency        | ±10%   | Segments near target size  |
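
The code-block preservation metric can be verified mechanically: a segment containing an odd number of fence lines was split inside a fenced block. A minimal check, assuming plain triple-backtick fences only (no tilde or indented fences):

```python
FENCE = "`" * 3  # three backticks, built programmatically for readability

def code_fences_balanced(segment: str) -> bool:
    """A segment with an odd count of fence lines was split
    inside a fenced code block."""
    fence_lines = [ln for ln in segment.splitlines() if ln.lstrip().startswith(FENCE)]
    return len(fence_lines) % 2 == 0

good = f"Intro\n{FENCE}python\nx = 1\n{FENCE}\nOutro"
bad = f"Intro\n{FENCE}python\nx = 1"  # segment ends mid-block
print(code_fences_balanced(good))  # True
print(code_fences_balanced(bad))   # False
```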

Source Reference

This pattern was extracted from DeepCode (HKUDS/DeepCode) multi-agent system.

Original location: tools/document_segmentation_server.py (1,600+ lines)

Original codebase stats:

  • 51 Python files analyzed
  • 33,497 lines of code
  • 12 patterns extracted

See /submodules/labs/DeepCode/DEEP-ANALYSIS.md for complete analysis.

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: document-segmentation

Completed:
- [x] Document analyzed for semantic boundaries
- [x] Segments created respecting code blocks, formulas, and tables
- [x] Context preserved with heading hierarchy
- [x] Token targets achieved (±10% variance)
- [x] Segment metadata generated

Outputs:
- segments/ directory with numbered segment files (segment-001.md, segment-002.md, ...)
- segments/manifest.json (segment metadata: token counts, headings, line ranges)
- segments/context-map.json (preceding sections for each segment)

Statistics:
- Total segments: X
- Average tokens per segment: Y
- Code blocks preserved: 100%
- Formula preservation: 100%
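
One way the outputs listed above might be produced from segment_with_context() results — a sketch, with file names and manifest fields taken from the listing above rather than from any fixed API:

```python
import json
from pathlib import Path

def write_segment_outputs(segments: list, out_dir: str = "segments") -> dict:
    """Write numbered segment files plus manifest.json.
    `segments` is the list of dicts from segment_with_context()."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    manifest = []
    for seg in segments:
        name = f"segment-{seg['segment_number']:03d}.md"
        (out / name).write_text(seg["content"])
        manifest.append({
            "file": name,
            "tokens": seg["token_estimate"],
            "heading": seg["heading"],
            "lines": seg["lines"],
        })
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return {"total_segments": len(manifest)}
```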

Completion Checklist

Before marking this skill as complete, verify:

  • Document exceeds context window size (>100K tokens)
  • Semantic boundaries identified (headings, paragraphs, code blocks)
  • No code blocks split across segments
  • No formulas/equations split across segments
  • No tables split across segments
  • Segment overlap configured (default 200 tokens)
  • Each segment includes heading context
  • Token estimates within ±10% of target
  • Manifest file created with all segment metadata
  • Context map generated for sequential processing
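
Several checklist items are machine-checkable. A sketch of the token-variance check (the exemption for the final segment is an assumption: the last segment simply takes whatever text remains):

```python
def within_token_variance(token_counts: list[int], target: int = 4000,
                          tolerance: float = 0.10) -> bool:
    """True when every segment except the last is within ±tolerance of target."""
    interior = token_counts[:-1]  # final segment is exempt
    return all(abs(t - target) / target <= tolerance for t in interior)

print(within_token_variance([4100, 3900, 4200, 1200]))  # True (last exempt)
print(within_token_variance([4100, 5000, 4200]))        # False (5000 is +25%)
```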

Failure Indicators

This skill has FAILED if:

  • ❌ Code block split mid-function (syntax would break)
  • ❌ Formula split mid-equation (mathematical notation incomplete)
  • ❌ Table split mid-row (data integrity compromised)
  • ❌ Segment has no heading context (orphaned content)
  • ❌ Token variance >20% from target (inefficient distribution)
  • ❌ Segments created for document that fits in context window
  • ❌ Protected regions (algorithms, proofs) fragmented

When NOT to Use

Do NOT use this skill when:

  • Document fits within the context window (<100K tokens)
  • Simple plain text without structure (no headings, code, or formulas)
  • Already-segmented content (existing chapters/sections are sufficient)
  • Real-time streaming content (not a static document)
  • You need random access (segmentation implies sequential processing)
  • Document is primarily images/diagrams (text segmentation does not apply)

Use alternative skills:

  • summarization - When you need a condensed version, not full segmentation
  • chunk-and-embed - When building a vector database for semantic search
  • outline-extraction - When you only need the document structure, not full segments

Anti-Patterns (Avoid)

| Anti-Pattern                 | Problem                            | Solution                                   |
| ---------------------------- | ---------------------------------- | ------------------------------------------ |
| Fixed character splits       | Breaks code/formulas mid-block     | Use semantic boundary detection            |
| Ignoring document structure  | Loses context across segments      | Always include heading hierarchy           |
| No overlap between segments  | Context discontinuity              | Use 200-token overlap (configurable)       |
| Processing out of order      | Breaks document-flow assumptions   | Process segments sequentially              |
| Skipping metadata generation | No way to reconstruct context      | Always create manifest.json                |
| Over-segmentation            | Too many small segments            | Respect min_tokens threshold (default 500) |
| Under-segmentation           | Segments exceed the context window | Enforce the max_tokens hard limit          |
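
The over-segmentation fix in the last rows can be sketched as a merge pass over adjacent chunks, using the same ~4-chars-per-token estimate and the SegmentationConfig defaults:

```python
def merge_small_segments(chunks: list[str], min_tokens: int = 500,
                         max_tokens: int = 6000) -> list[str]:
    """Merge any chunk below min_tokens into its predecessor, unless
    doing so would push the predecessor past max_tokens."""
    def est(s: str) -> int:
        return len(s) // 4  # rough token estimate

    merged: list[str] = []
    for chunk in chunks:
        if merged and est(chunk) < min_tokens and est(merged[-1]) + est(chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged

chunks = ["a" * 8000, "b" * 100, "c" * 8000]  # ~2000, ~25, ~2000 tokens
print([len(c) for c in merge_small_segments(chunks)])  # [8102, 8000]
```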

Principles

This skill embodies:

  • #2 First Principles - Understand document semantics before splitting
  • #3 Keep It Simple - Split at natural boundaries, not arbitrary positions
  • #4 Separation of Concerns - Preserve code blocks, formulas, tables intact
  • #6 Clear, Understandable, Explainable - Each segment self-contained with context
  • #8 No Assumptions - Verify boundaries don't break syntax or semantics

Full Standard: CODITECT-STANDARD-AUTOMATION.md