Document Segmentation
Expert skill for intelligently splitting large documents at semantic boundaries rather than arbitrary character counts. Essential for processing research papers, technical documentation, and long-form content.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
When to Use
Use this skill when:
- Processing research papers longer than the context window
- Analyzing technical documentation with code examples
- Splitting documents that contain algorithms or formulas
- Preserving semantic context across segments
- Processing markdown documents with nested structure
- Handling PDFs converted to text
Don't use this skill when:
- Document fits within context window
- Simple text without structure (plain prose)
- Already-segmented content (chapters, sections)
- Real-time streaming content
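The context-window gate above can be checked up front with the same rough ~4-characters-per-token heuristic used later in this skill. A minimal sketch; the 100K-token default mirrors the threshold stated in this skill, but should be set to match your target model:

```python
def should_segment(text: str, context_window_tokens: int = 100_000) -> bool:
    """Return True if the document likely exceeds the context window.

    Uses the rough ~4 characters-per-token estimate; the default
    threshold is a placeholder and should match your target model.
    """
    estimated_tokens = len(text) // 4
    return estimated_tokens > context_window_tokens


print(should_segment("a short note"))   # small document: skip segmentation
print(should_segment("x" * 800_000))    # ~200K tokens: segment it
```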
Core Algorithm
Semantic Boundary Detection
```python
import re
from typing import List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum


class BoundaryType(Enum):
    """Types of semantic boundaries"""
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    CODE_BLOCK = "code_block"
    FORMULA = "formula"
    LIST = "list"
    TABLE = "table"
    SECTION = "section"


@dataclass
class Segment:
    """A document segment with metadata"""
    content: str
    start_line: int
    end_line: int
    boundary_type: BoundaryType
    heading: Optional[str] = None
    token_estimate: int = 0


@dataclass
class SegmentationConfig:
    """Configuration for document segmentation"""
    target_tokens: int = 4000    # Target tokens per segment
    max_tokens: int = 6000       # Hard maximum
    min_tokens: int = 500        # Minimum viable segment
    overlap_tokens: int = 200    # Overlap for context preservation
    preserve_code_blocks: bool = True
    preserve_formulas: bool = True
    preserve_tables: bool = True


class DocumentSegmenter:
    """
    Intelligently segment documents at semantic boundaries.

    Key Innovation: Split at natural boundaries (sections, paragraphs)
    rather than arbitrary character counts, preserving context.
    """

    # Patterns for boundary detection
    HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
    CODE_BLOCK_PATTERN = re.compile(r'```[\s\S]*?```', re.MULTILINE)
    FORMULA_PATTERN = re.compile(r'\$\$[\s\S]*?\$\$|\\\[[\s\S]*?\\\]', re.MULTILINE)
    TABLE_PATTERN = re.compile(r'^\|.+\|$', re.MULTILINE)
    LIST_PATTERN = re.compile(r'^[\s]*[-*+]\s+.+$', re.MULTILINE)

    def __init__(self, config: Optional[SegmentationConfig] = None):
        self.config = config or SegmentationConfig()

    def estimate_tokens(self, text: str) -> int:
        """Estimate token count (rough heuristic: ~4 chars per token)"""
        return len(text) // 4

    def find_boundaries(self, text: str) -> List[Tuple[int, BoundaryType, str]]:
        """
        Find all semantic boundaries in the document.
        Returns a list of (position, type, content) tuples sorted by position.
        """
        boundaries = []

        # Find headings
        for match in self.HEADING_PATTERN.finditer(text):
            level = len(match.group(1))
            boundaries.append((
                match.start(),
                BoundaryType.HEADING,
                f"h{level}: {match.group(2)}"
            ))

        # Find code blocks (to preserve intact)
        for match in self.CODE_BLOCK_PATTERN.finditer(text):
            boundaries.append((match.start(), BoundaryType.CODE_BLOCK, "code_block"))

        # Find formulas (to preserve intact)
        for match in self.FORMULA_PATTERN.finditer(text):
            boundaries.append((match.start(), BoundaryType.FORMULA, "formula"))

        # Sort by position
        boundaries.sort(key=lambda x: x[0])
        return boundaries

    def find_safe_split_point(
        self,
        text: str,
        target_pos: int,
        boundaries: List[Tuple[int, BoundaryType, str]]
    ) -> int:
        """
        Find the nearest safe split point at or before target_pos.

        Avoids splitting:
        - Inside code blocks
        - Inside formulas
        - Mid-sentence
        """
        # Find the nearest heading/paragraph boundary before the target position
        best_pos = 0
        for pos, btype, _ in boundaries:
            if pos > target_pos:
                break
            if btype in (BoundaryType.HEADING, BoundaryType.PARAGRAPH):
                best_pos = pos

        # If no boundary found, fall back to the last paragraph break (blank line)
        if best_pos == 0:
            para_breaks = [m.start() for m in re.finditer(r'\n\n', text[:target_pos])]
            if para_breaks:
                best_pos = para_breaks[-1] + 2

        # Last resort: fall back to the last newline
        if best_pos == 0:
            newlines = [m.start() for m in re.finditer(r'\n', text[:target_pos])]
            if newlines:
                best_pos = newlines[-1] + 1

        return best_pos if best_pos > 0 else target_pos

    def segment(self, text: str) -> List[Segment]:
        """
        Segment the document into chunks respecting semantic boundaries.

        Args:
            text: Full document text

        Returns:
            List of Segment objects
        """
        segments = []
        boundaries = self.find_boundaries(text)
        current_pos = 0
        current_heading = None

        while current_pos < len(text):
            # Calculate target end position
            remaining_text = text[current_pos:]
            target_chars = self.config.target_tokens * 4  # Rough estimate

            if len(remaining_text) <= target_chars:
                # Last segment: take everything
                segment_text = remaining_text
                end_pos = len(text)
            else:
                # Find a safe split point near the target
                target_end = current_pos + target_chars
                split_pos = self.find_safe_split_point(text, target_end, boundaries)

                # Ensure minimum progress
                if split_pos <= current_pos:
                    split_pos = min(current_pos + target_chars, len(text))

                segment_text = text[current_pos:split_pos]
                end_pos = split_pos

            # Track the most recent heading for context
            for pos, btype, content in boundaries:
                if pos >= end_pos:
                    break
                if btype == BoundaryType.HEADING and pos >= current_pos:
                    current_heading = content

            # Create segment
            segments.append(Segment(
                content=segment_text.strip(),
                start_line=text[:current_pos].count('\n') + 1,
                end_line=text[:end_pos].count('\n') + 1,
                boundary_type=BoundaryType.SECTION,
                heading=current_heading,
                token_estimate=self.estimate_tokens(segment_text)
            ))

            # Move to the next position, backing up by the overlap while
            # guaranteeing forward progress (otherwise the loop never ends)
            if end_pos >= len(text):
                current_pos = end_pos
            else:
                overlap_chars = self.config.overlap_tokens * 4
                current_pos = max(end_pos - overlap_chars, current_pos + 1)

        return segments

    def segment_with_context(self, text: str) -> List[dict]:
        """
        Segment with context preservation for LLM processing.
        Returns segments with a summary of preceding section headings.
        """
        segments = self.segment(text)
        result = []

        for i, segment in enumerate(segments):
            context = {
                "segment_number": i + 1,
                "total_segments": len(segments),
                "content": segment.content,
                "heading": segment.heading,
                "token_estimate": segment.token_estimate,
                "lines": f"{segment.start_line}-{segment.end_line}",
            }

            # Add preceding context summary (deduplicated, order-preserving)
            if i > 0:
                prev_headings = [s.heading for s in segments[:i] if s.heading]
                context["preceding_sections"] = list(dict.fromkeys(prev_headings))

            result.append(context)

        return result
```
Research Paper Specific Processing
```python
class ResearchPaperSegmenter(DocumentSegmenter):
    """
    Specialized segmenter for academic papers.

    Handles:
    - Abstract, Introduction, Methods, Results, Discussion sections
    - Equations and formulas
    - Algorithm pseudocode
    - References section
    """

    SECTION_PATTERNS = {
        "abstract": re.compile(r'^#+\s*Abstract', re.IGNORECASE | re.MULTILINE),
        "introduction": re.compile(r'^#+\s*Introduction', re.IGNORECASE | re.MULTILINE),
        "methods": re.compile(r'^#+\s*(Methods?|Methodology)', re.IGNORECASE | re.MULTILINE),
        "results": re.compile(r'^#+\s*Results?', re.IGNORECASE | re.MULTILINE),
        "discussion": re.compile(r'^#+\s*Discussion', re.IGNORECASE | re.MULTILINE),
        "conclusion": re.compile(r'^#+\s*Conclusion', re.IGNORECASE | re.MULTILINE),
        "references": re.compile(r'^#+\s*References?', re.IGNORECASE | re.MULTILINE),
    }

    ALGORITHM_PATTERN = re.compile(
        r'(Algorithm\s+\d+|Procedure\s+\d+|\\begin\{algorithm\}[\s\S]*?\\end\{algorithm\})',
        re.IGNORECASE
    )

    def identify_paper_sections(self, text: str) -> dict:
        """Identify standard paper sections and their positions"""
        sections = {}
        for name, pattern in self.SECTION_PATTERNS.items():
            match = pattern.search(text)
            if match:
                sections[name] = match.start()
        # Return sections ordered by their position in the document
        return dict(sorted(sections.items(), key=lambda x: x[1]))

    def extract_algorithms(self, text: str) -> List[dict]:
        """Extract algorithm blocks that must be preserved intact"""
        algorithms = []
        for match in self.ALGORITHM_PATTERN.finditer(text):
            algorithms.append({
                "content": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
        return algorithms

    def segment_paper(self, text: str) -> List[dict]:
        """
        Segment a research paper with section awareness.

        Ensures:
        - Algorithms stay intact
        - Sections don't split mid-paragraph
        - Mathematical notation is preserved
        """
        sections = self.identify_paper_sections(text)
        algorithms = self.extract_algorithms(text)

        # Mark algorithm regions as protected.
        # NOTE: protected_regions is computed here but not yet consulted by
        # segment_with_context; enforcing it is left as an extension point.
        protected_regions = [(a["start"], a["end"]) for a in algorithms]

        segments = self.segment_with_context(text)

        # Annotate each segment with the paper section it belongs to
        for segment in segments:
            # Locate the segment by its first 100 characters
            segment_start = text.find(segment["content"][:100])
            current_section = None
            if segment_start >= 0:
                for section_name, section_start in sections.items():
                    if section_start <= segment_start:
                        current_section = section_name
            segment["paper_section"] = current_section

        return segments
```
Usage Examples
Basic Document Segmentation
```python
# Create a segmenter with a custom config
config = SegmentationConfig(
    target_tokens=4000,
    max_tokens=6000,
    overlap_tokens=200
)
segmenter = DocumentSegmenter(config)

# Read and segment the document
with open("large_document.md") as f:
    text = f.read()

segments = segmenter.segment(text)

for i, seg in enumerate(segments):
    print(f"Segment {i + 1}: {seg.token_estimate} tokens")
    print(f"  Lines: {seg.start_line}-{seg.end_line}")
    print(f"  Heading: {seg.heading}")
```
Research Paper Processing
```python
# Process a research paper
paper_segmenter = ResearchPaperSegmenter()

with open("research_paper.md") as f:
    paper_text = f.read()

# Get the segmented paper with section awareness
segments = paper_segmenter.segment_paper(paper_text)

for seg in segments:
    print(f"Section: {seg['paper_section']}")
    print(f"Tokens: {seg['token_estimate']}")
    print(f"Preceding: {seg.get('preceding_sections', [])}")
    print("---")
```
Integration with LLM Processing
```python
async def process_large_document(document_path: str, llm_client) -> dict:
    """Process a large document in segments with an LLM"""
    segmenter = DocumentSegmenter()

    with open(document_path) as f:
        text = f.read()

    segments = segmenter.segment_with_context(text)
    results = []

    for seg in segments:
        prompt = f"""
Analyzing document segment {seg['segment_number']} of {seg['total_segments']}.

Current section: {seg.get('heading', 'Unknown')}
Preceding sections: {', '.join(seg.get('preceding_sections', []))}

Content:
{seg['content']}

Extract key concepts, algorithms, and implementation details.
"""
        result = await llm_client.complete(prompt)
        results.append({
            "segment": seg['segment_number'],
            "analysis": result
        })

    return {
        "total_segments": len(segments),
        "segment_analyses": results
    }
```
Best Practices
DO
- Preserve code blocks intact - Never split mid-code
- Keep formulas together - Mathematical notation must stay complete
- Use overlap for context - Small overlap prevents context loss
- Annotate with headings - Track document structure across segments
- Process sequentially - Build understanding across segments
DON'T
- Don't split mid-sentence - Always find sentence boundaries
- Don't ignore document structure - Headings matter
- Don't use fixed character splits - Use semantic boundaries
- Don't lose context - Pass section summaries between segments
- Don't process out of order - Documents have flow
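One cheap way to honor the "never split mid-code" rule is to count code fences before a candidate split position: an odd count means the position sits inside an open block. A minimal sketch of that check (an illustration, not the segmenter's actual guard):

```python
FENCE = "`" * 3  # a Markdown code fence, built up to avoid literal fences here


def inside_code_block(text: str, pos: int) -> bool:
    """True if `pos` falls inside an open fenced code block.

    An odd number of fences before `pos` means a block was opened
    but not yet closed at that position, so splitting there is unsafe.
    """
    return text[:pos].count(FENCE) % 2 == 1


doc = f"intro\n{FENCE}python\nprint('hi')\n{FENCE}\noutro"
assert inside_code_block(doc, doc.index("print"))      # mid-block: unsafe split
assert not inside_code_block(doc, doc.index("outro"))  # after block: safe
```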
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| target_tokens | 4000 | Target tokens per segment |
| max_tokens | 6000 | Hard maximum per segment |
| min_tokens | 500 | Minimum viable segment |
| overlap_tokens | 200 | Overlap for context |
| preserve_code_blocks | true | Keep code blocks intact |
| preserve_formulas | true | Keep math notation intact |
| preserve_tables | true | Keep tables intact |
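The parameters above carry implicit ordering constraints (min below target, target at or below max, overlap smaller than target). A validation sketch; `validate` is not part of the skill's API, and the config class here is a reduced copy for self-containment:

```python
from dataclasses import dataclass


@dataclass
class SegmentationConfig:
    # Reduced copy of the skill's config, for illustration only
    target_tokens: int = 4000
    max_tokens: int = 6000
    min_tokens: int = 500
    overlap_tokens: int = 200


def validate(config: SegmentationConfig) -> None:
    """Raise ValueError if the token limits are mutually inconsistent."""
    if not (config.min_tokens < config.target_tokens <= config.max_tokens):
        raise ValueError("require min_tokens < target_tokens <= max_tokens")
    if config.overlap_tokens >= config.target_tokens:
        raise ValueError("overlap_tokens must be smaller than target_tokens")


validate(SegmentationConfig())  # the defaults in the table are consistent
```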
Integration with CODITECT
Recommended integration points:
| Agent/Workflow | Usage | Notes |
|---|---|---|
| paper-to-code | Primary input processor | Segment papers before analysis |
| thoughts-analyzer | Large document analysis | Process research documents |
| research-agent | Technical doc processing | Handle long specifications |
| code-indexer | Repository documentation | Process README files |
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Code block preservation | 100% | No splits inside code |
| Formula preservation | 100% | No splits inside math |
| Context continuity | >90% | Heading context maintained |
| Token efficiency | ±10% | Segments near target size |
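The two 100% preservation metrics can be spot-checked mechanically: blocks that survive intact leave an even number of code fences and `$$` delimiters inside each segment. A minimal checker sketch (a hypothetical helper, not part of the skill):

```python
def check_preservation(segment_text: str) -> dict:
    """Spot-check that no fenced code block or $$ formula is split.

    Delimiters come in pairs, so an odd count inside a single
    segment indicates a block was cut mid-way.
    """
    fence = "`" * 3  # Markdown code fence
    return {
        "code_blocks_intact": segment_text.count(fence) % 2 == 0,
        "formulas_intact": segment_text.count("$$") % 2 == 0,
    }


good = "text\n" + "`" * 3 + "py\nx = 1\n" + "`" * 3 + "\nmore"
bad = "text\n" + "`" * 3 + "py\nx = 1\n"  # fence never closed
assert check_preservation(good) == {"code_blocks_intact": True, "formulas_intact": True}
assert check_preservation(bad)["code_blocks_intact"] is False
```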
Source Reference
This pattern was extracted from the DeepCode (HKUDS/DeepCode) multi-agent system.
Original location: tools/document_segmentation_server.py (1,600+ lines)
Original codebase stats:
- 51 Python files analyzed
- 33,497 lines of code
- 12 patterns extracted
See /submodules/labs/DeepCode/DEEP-ANALYSIS.md for complete analysis.
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: document-segmentation
Completed:
- [x] Document analyzed for semantic boundaries
- [x] Segments created respecting code blocks, formulas, and tables
- [x] Context preserved with heading hierarchy
- [x] Token targets achieved (±10% variance)
- [x] Segment metadata generated
Outputs:
- segments/ directory with numbered segment files (segment-001.md, segment-002.md, ...)
- segments/manifest.json (segment metadata: token counts, headings, line ranges)
- segments/context-map.json (preceding sections for each segment)
Statistics:
- Total segments: X
- Average tokens per segment: Y
- Code blocks preserved: 100%
- Formula preservation: 100%
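The output layout above can be produced from `segment_with_context` results with a short writer. A sketch under the assumption that segments are the dicts this skill returns; the file names and JSON fields mirror the layout listed above:

```python
import json
from pathlib import Path


def write_segments(segments: list, out_dir: str = "segments") -> None:
    """Write numbered segment files plus manifest.json and context-map.json."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    manifest, context_map = [], {}
    for seg in segments:
        name = f"segment-{seg['segment_number']:03d}.md"
        (out / name).write_text(seg["content"])
        manifest.append({
            "file": name,
            "tokens": seg["token_estimate"],
            "heading": seg.get("heading"),
            "lines": seg["lines"],
        })
        context_map[name] = seg.get("preceding_sections", [])

    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    (out / "context-map.json").write_text(json.dumps(context_map, indent=2))
```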
Completion Checklist
Before marking this skill as complete, verify:
- Document exceeds context window size (>100K tokens)
- Semantic boundaries identified (headings, paragraphs, code blocks)
- No code blocks split across segments
- No formulas/equations split across segments
- No tables split across segments
- Segment overlap configured (default 200 tokens)
- Each segment includes heading context
- Token estimates within ±10% of target
- Manifest file created with all segment metadata
- Context map generated for sequential processing
Failure Indicators
This skill has FAILED if:
- ❌ Code block split mid-function (syntax would break)
- ❌ Formula split mid-equation (mathematical notation incomplete)
- ❌ Table split mid-row (data integrity compromised)
- ❌ Segment has no heading context (orphaned content)
- ❌ Token variance >20% from target (inefficient distribution)
- ❌ Segments created for document that fits in context window
- ❌ Protected regions (algorithms, proofs) fragmented
When NOT to Use
Do NOT use this skill when:
- Document fits within context window (<100K tokens)
- Simple plain text without structure (no headings, code, formulas)
- Already-segmented content (existing chapters/sections are sufficient)
- Real-time streaming content (not static document)
- Need random access (segmentation implies sequential processing)
- Document is primarily images/diagrams (text segmentation not applicable)
Use alternative skills:
- summarization - When you need a condensed version, not full segmentation
- chunk-and-embed - When building a vector database for semantic search
- outline-extraction - When you only need the document structure, not full segments
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Fixed character splits | Breaks code/formulas mid-block | Use semantic boundary detection |
| Ignoring document structure | Loses context across segments | Always include heading hierarchy |
| No overlap between segments | Context discontinuity | Use 200-token overlap (configurable) |
| Processing out of order | Document flow assumptions broken | Process segments sequentially |
| Skipping metadata generation | No way to reconstruct context | Always create manifest.json |
| Over-segmentation | Too many small segments | Respect min_tokens threshold (default 500) |
| Under-segmentation | Segments exceed context window | Enforce max_tokens hard limit |
Principles
This skill embodies:
- #2 First Principles - Understand document semantics before splitting
- #3 Keep It Simple - Split at natural boundaries, not arbitrary positions
- #4 Separation of Concerns - Preserve code blocks, formulas, tables intact
- #6 Clear, Understandable, Explainable - Each segment self-contained with context
- #8 No Assumptions - Verify boundaries don't break syntax or semantics
Full Standard: CODITECT-STANDARD-AUTOMATION.md