Doc-to-Skill Converter

Purpose

Specialist agent for converting documentation websites into CODITECT-standard Claude Code skills. Implements intelligent scraping, smart categorization, and automatic pattern extraction.

Core Capabilities

1. Smart Scraping

SCRAPING_STRATEGIES = {
    "llms_txt": {
        "priority": 1,
        "files": ["llms-full.txt", "llms.txt", "llms-small.txt"],
        "speed": "10x faster",
        "description": "LLM-optimized documentation"
    },
    "sitemap": {
        "priority": 2,
        "files": ["sitemap.xml", "sitemap_index.xml"],
        "speed": "5x faster",
        "description": "XML sitemap navigation"
    },
    "bfs_crawl": {
        "priority": 3,
        "strategy": "breadth_first_search",
        "speed": "baseline",
        "description": "Full site crawl"
    }
}
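
The priority order above could be applied roughly as follows. This is a minimal sketch, not the agent's actual implementation: `choose_strategy` and the `url_exists` probe are hypothetical names, and resolving the candidate files relative to `base_url` is an assumption.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Candidate files per strategy, in priority order (copied from SCRAPING_STRATEGIES).
STRATEGY_FILES = [
    ("llms_txt", ["llms-full.txt", "llms.txt", "llms-small.txt"]),
    ("sitemap", ["sitemap.xml", "sitemap_index.xml"]),
]


def choose_strategy(
    base_url: str, url_exists: Callable[[str], bool]
) -> tuple[str, Optional[str]]:
    """Return (strategy_name, entry_url) for the highest-priority strategy available.

    `url_exists` is a hypothetical probe supplied by the caller (for example,
    an HTTP HEAD request). `base_url` should end with a slash so that urljoin
    appends the file name instead of replacing the last path segment.
    """
    for name, files in STRATEGY_FILES:
        for filename in files:
            candidate = urljoin(base_url, filename)
            if url_exists(candidate):
                return name, candidate
    # Nothing faster is available: fall back to a full breadth-first crawl.
    return "bfs_crawl", None


if __name__ == "__main__":
    available = {"https://docs.example.com/sitemap.xml"}
    print(choose_strategy("https://docs.example.com/", available.__contains__))
```

Passing the probe in as a callable keeps the selection logic testable without network access.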

2. Smart Categorization

Automatic content categorization using multi-signal scoring:

| Signal | Points | Example |
|---|---|---|
| URL match | 3 | /api/ → api_reference |
| Title match | 2 | "Getting Started" → getting_started |
| Content keywords | 1 | Contains "import" → tutorials |
| H1 heading | 2 | "API Reference" → api_reference |

CATEGORY_MAPPINGS = {
    "getting_started": ["intro", "quickstart", "install", "setup", "getting-started"],
    "tutorials": ["tutorial", "guide", "walkthrough", "learn", "example"],
    "api_reference": ["api", "reference", "methods", "functions", "class"],
    "concepts": ["concept", "architecture", "overview", "understand"],
    "advanced": ["advanced", "optimization", "performance", "best-practices"],
    "troubleshooting": ["faq", "troubleshoot", "debug", "error", "issue"]
}
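
One way the signal points and keyword lists could combine into a score is sketched below. The `categorize` function, the substring matching, the hyphen normalization, and the `"uncategorized"` fallback label are all assumptions made for illustration; the source does not specify how matches are computed or how ties are broken.

```python
from urllib.parse import urlparse

# Keyword lists copied from CATEGORY_MAPPINGS.
CATEGORY_MAPPINGS = {
    "getting_started": ["intro", "quickstart", "install", "setup", "getting-started"],
    "tutorials": ["tutorial", "guide", "walkthrough", "learn", "example"],
    "api_reference": ["api", "reference", "methods", "functions", "class"],
    "concepts": ["concept", "architecture", "overview", "understand"],
    "advanced": ["advanced", "optimization", "performance", "best-practices"],
    "troubleshooting": ["faq", "troubleshoot", "debug", "error", "issue"],
}

# Points per signal, taken from the scoring table.
SIGNAL_POINTS = {"url": 3, "title": 2, "h1": 2, "content": 1}


def _normalize(text: str) -> str:
    """Lowercase and treat hyphens/underscores as spaces,
    so that 'Getting Started' matches the keyword 'getting-started'."""
    return text.lower().replace("-", " ").replace("_", " ")


def categorize(
    url: str, title: str = "", h1: str = "", content: str = ""
) -> tuple[str, dict[str, int]]:
    """Return (best_category, scores) using naive substring matching."""
    signals = {
        # Score only the path, so a host like example.com cannot match "example".
        "url": _normalize(urlparse(url).path),
        "title": _normalize(title),
        "h1": _normalize(h1),
        "content": _normalize(content),
    }
    scores = {}
    for category, keywords in CATEGORY_MAPPINGS.items():
        normalized = [_normalize(keyword) for keyword in keywords]
        scores[category] = sum(
            SIGNAL_POINTS[name]
            for name, text in signals.items()
            if any(keyword in text for keyword in normalized)
        )
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return (best if scores[best] > 0 else "uncategorized"), scores


if __name__ == "__main__":
    category, scores = categorize(
        "https://example.com/api/users", "Users", "API Reference", "List of methods"
    )
    print(category, scores[category])
```

Substring matching is deliberately crude here (for example, "api" also matches "rapid"); a production categorizer would more likely tokenize first.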

3. Language Detection

Multi-strategy code language detection:

DETECTION_STRATEGIES = [
    # 1. CSS class detection
    {"selector": "[class*='language-']", "extract": "language-{lang}"},
    {"selector": "[class*='lang-']", "extract": "lang-{lang}"},

    # 2. Data attributes
    {"selector": "[data-language]", "extract": "data-language"},

    # 3. Keyword heuristics
    {
        "python": ["def ", "import ", "from ", "class ", "__init__"],
        "javascript": ["const ", "let ", "function ", "=>", "async "],
        "typescript": ["interface ", "type ", ": string", ": number"],
        "go": ["func ", "package ", "import (", "go func"],
        "rust": ["fn ", "let mut ", "impl ", "pub fn"],
        "gdscript": ["extends ", "func ", "var ", "signal "]
    }
]
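
The three strategies can be read as a fallback chain: CSS class first, then the data attribute, then keyword counting. The sketch below assumes the caller has already pulled the `class` and `data-language` attribute values out of the HTML; `detect_language` is a hypothetical helper, and counting distinct keyword hits is one plausible heuristic rather than the documented one.

```python
import re
from typing import Optional

# Keyword lists copied from the third entry of DETECTION_STRATEGIES.
KEYWORD_HINTS = {
    "python": ["def ", "import ", "from ", "class ", "__init__"],
    "javascript": ["const ", "let ", "function ", "=>", "async "],
    "typescript": ["interface ", "type ", ": string", ": number"],
    "go": ["func ", "package ", "import (", "go func"],
    "rust": ["fn ", "let mut ", "impl ", "pub fn"],
    "gdscript": ["extends ", "func ", "var ", "signal "],
}

# Matches class names such as "language-python" or "lang-js".
CLASS_PATTERN = re.compile(r"\b(?:language|lang)-([\w+#-]+)")


def detect_language(
    css_classes: str = "", data_language: Optional[str] = None, code: str = ""
) -> Optional[str]:
    """Apply the three strategies in order and return the first result."""
    # 1. CSS class detection
    match = CLASS_PATTERN.search(css_classes)
    if match:
        return match.group(1).lower()

    # 2. Data attribute
    if data_language:
        return data_language.strip().lower()

    # 3. Keyword heuristics: pick the language with the most distinct hits.
    hits = {
        language: sum(1 for keyword in keywords if keyword in code)
        for language, keywords in KEYWORD_HINTS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


if __name__ == "__main__":
    print(detect_language(css_classes="highlight language-python"))
    print(detect_language(code="fn main() {\n    let mut x = 1;\n}"))
```

Several keywords overlap between languages ("func " appears for both Go and GDScript, "let " is a prefix of "let mut "), so the heuristic step is best treated as a last resort.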

4. Pattern Extraction

Extract reusable patterns from documentation:

PATTERN_TYPES = {
    "api_patterns": {
        "description": "Common API usage patterns",
        "extraction": "multi_code_block_sequences"
    },
    "config_patterns": {
        "description": "Configuration examples",
        "extraction": "yaml_json_toml_blocks"
    },
    "error_handling": {
        "description": "Error handling examples",
        "extraction": "try_catch_blocks"
    },
    "integration_patterns": {
        "description": "Integration with other tools",
        "extraction": "import_dependency_blocks"
    }
}
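
As a rough illustration of how a single code block might be tagged with these pattern types, the sketch below uses simple regular expressions. This is a deliberately simplified stand-in: the comparison table later in this document describes the agent's extraction as AST-aware, which this is not. `classify_block` is a hypothetical helper, and `api_patterns` is omitted because it depends on sequences of blocks rather than a single one.

```python
import re

# Languages treated as configuration formats (per "yaml_json_toml_blocks").
CONFIG_LANGUAGES = {"yaml", "yml", "json", "toml"}

# A line that starts with try/except/catch, optionally after a closing brace.
ERROR_HANDLING = re.compile(r"^\s*(?:\}\s*)?(?:try|except|catch)\b", re.MULTILINE)

# A line that starts with an import-style statement.
IMPORT_LINE = re.compile(r"^\s*(?:import|from|use)\s", re.MULTILINE)


def classify_block(language: str, code: str) -> list[str]:
    """Return the PATTERN_TYPES keys that appear to apply to one code block."""
    labels = []
    if language.lower() in CONFIG_LANGUAGES:
        labels.append("config_patterns")
    if ERROR_HANDLING.search(code):
        labels.append("error_handling")
    if IMPORT_LINE.search(code):
        labels.append("integration_patterns")
    return labels


if __name__ == "__main__":
    snippet = "import requests\ntry:\n    requests.get(url)\nexcept Exception:\n    pass"
    print(classify_block("python", snippet))
```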

Workflow

INPUT                    PROCESSING                  OUTPUT
─────────────────────────────────────────────────────────────
docs_url ──────► ┌──────────────┐
                 │  URL Parser  │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐
                 │  llms.txt?   │─── Yes ──► Fast path
                 └──────┬───────┘
                        │ No
                 ┌──────▼───────┐
                 │ BFS Crawler  │
                 │ • respect    │
                 │   robots.txt │
                 │ • rate limit │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐
                 │ Categorizer  │
                 │ • URL        │
                 │ • Title      │
                 │ • Content    │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐          ┌─────────────┐
                 │ Pattern      │────────► │ SKILL.md    │
                 │ Extractor    │          │ references/ │
                 └──────────────┘          │ examples/   │
                                           └─────────────┘
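
The BFS Crawler stage in the diagram could look something like the sketch below. It uses the standard library's `urllib.robotparser` for the robots.txt check and takes the link graph through a caller-supplied `get_links` function, so it runs without network access. `bfs_crawl` and `get_links` are hypothetical names, and rate limiting is left out; a real fetcher would pause between requests.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.robotparser import RobotFileParser


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    robots_txt: str = "",
    max_pages: int = 1000,
    user_agent: str = "*",
) -> list[str]:
    """Visit pages breadth-first, skipping URLs that robots.txt disallows.

    `get_links` is a hypothetical callback that fetches a page and returns
    the URLs it links to; `robots_txt` is the raw text of the site's file.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt: do not fetch or expand
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    site = {
        "https://docs.example.com/": [
            "https://docs.example.com/a",
            "https://docs.example.com/private/x",
            "https://docs.example.com/b",
        ],
        "https://docs.example.com/a": ["https://docs.example.com/c"],
    }
    rules = "User-agent: *\nDisallow: /private/"
    for page in bfs_crawl("https://docs.example.com/", lambda u: site.get(u, []), rules):
        print(page)
```

The `seen` set is updated when a link is queued rather than when it is visited, which prevents the same URL from being queued twice.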

Invocation

# Basic scraping
/agent doc-to-skill-converter "Convert https://fastapi.tiangolo.com/ to skill"

# With configuration
/agent doc-to-skill-converter "Scrape https://react.dev/ with:
- max_pages: 500
- categories: [hooks, components, patterns]
- async mode for speed
- output: ~/.coditect/skills/react/"

# Estimate first
/agent doc-to-skill-converter "Estimate pages at https://docs.python.org/"

Configuration Schema

{
  "name": "framework-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",

  "selectors": {
    "main_content": "article, main, .content",
    "title": "h1, .title",
    "code_blocks": "pre code, .highlight"
  },

  "url_patterns": {
    "include": ["/docs", "/api", "/guide"],
    "exclude": ["/blog", "/news", "/changelog"]
  },

  "categories": {
    "getting_started": ["intro", "quickstart", "install"],
    "api": ["api", "reference", "methods"],
    "tutorials": ["tutorial", "guide", "example"]
  },

  "scraping": {
    "rate_limit": 0.5,
    "max_pages": 1000,
    "timeout": 30,
    "async_mode": true,
    "workers": 8
  },

  "output": {
    "format": "coditect",
    "enhance_with_ai": true,
    "extract_patterns": true
  }
}
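
To illustrate how `base_url` and `url_patterns` might gate which pages are scraped, here is a small sketch. The schema does not say whether patterns are prefixes, substrings, or globs; this example assumes path prefixes, lets `exclude` win over `include`, and restricts the crawl to the host of `base_url`. `should_scrape` is a hypothetical helper.

```python
from urllib.parse import urlparse


def should_scrape(url: str, config: dict) -> bool:
    """Decide whether a URL falls inside the configured scrape scope.

    Assumes url_patterns entries are path prefixes (not confirmed by the schema).
    """
    base = urlparse(config["base_url"])
    target = urlparse(url)
    if target.netloc != base.netloc:
        return False  # stay on the documentation host

    patterns = config.get("url_patterns", {})
    path = target.path
    if any(path.startswith(prefix) for prefix in patterns.get("exclude", [])):
        return False

    include = patterns.get("include", [])
    # An empty include list is treated as "everything not excluded".
    return not include or any(path.startswith(prefix) for prefix in include)


if __name__ == "__main__":
    config = {
        "base_url": "https://docs.example.com/",
        "url_patterns": {
            "include": ["/docs", "/api", "/guide"],
            "exclude": ["/blog", "/news", "/changelog"],
        },
    }
    print(should_scrape("https://docs.example.com/docs/intro", config))
    print(should_scrape("https://docs.example.com/blog/post", config))
```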

Output Structure

skill_name/
├── SKILL.md                  # AI-enhanced with:
│                             #   • Real code examples (5-10)
│                             #   • Quick reference
│                             #   • Key concepts
│                             #   • Navigation by skill level
│
├── references/
│   ├── index.md              # Category index
│   ├── getting_started/      # Categorized content
│   ├── api_reference/
│   ├── tutorials/
│   └── patterns/             # Extracted patterns
│
├── examples/
│   ├── basic/
│   ├── advanced/
│   └── integration/
│
└── metadata.json             # Scraping metadata
    {
      "source_url": "...",
      "pages_scraped": 847,
      "scrape_date": "2026-01-23",
      "categories": {...},
      "patterns_extracted": 23
    }
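
The layout above can be created with a few lines of `pathlib`. The sketch below only builds the empty skeleton and writes `metadata.json`; `create_skeleton` is a hypothetical helper, and the metadata values in the usage example are the sample figures from the tree, not real results.

```python
import json
from pathlib import Path

# Subdirectories taken from the output tree.
REFERENCE_DIRS = ["getting_started", "api_reference", "tutorials", "patterns"]
EXAMPLE_DIRS = ["basic", "advanced", "integration"]


def create_skeleton(root: Path, metadata: dict) -> Path:
    """Create the skill directory layout under `root` and write metadata.json."""
    for name in REFERENCE_DIRS:
        (root / "references" / name).mkdir(parents=True, exist_ok=True)
    for name in EXAMPLE_DIRS:
        (root / "examples" / name).mkdir(parents=True, exist_ok=True)

    # Placeholder files; the real content is produced by later stages.
    (root / "SKILL.md").touch()
    (root / "references" / "index.md").touch()

    metadata_path = root / "metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = create_skeleton(
            Path(tmp) / "skill_name",
            {"pages_scraped": 847, "scrape_date": "2026-01-23", "patterns_extracted": 23},
        )
        print(path.read_text(encoding="utf-8"))
```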

Quality Metrics

| Metric | Target | Description |
|---|---|---|
| Page coverage | 95%+ | Pages successfully scraped |
| Categorization accuracy | 90%+ | Correct category assignment |
| Code extraction | 100% | All code blocks captured |
| Pattern detection | 85%+ | Relevant patterns identified |
| Example validity | 100% | Examples compile/run |
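
These targets lend themselves to a simple automated gate. The following sketch compares measured ratios against the targets in the table; the metric key names and the `failed_metrics` function are hypothetical, and how each ratio is measured is outside its scope.

```python
# Minimum acceptable ratios, taken from the Quality Metrics table.
QUALITY_TARGETS = {
    "page_coverage": 0.95,
    "categorization_accuracy": 0.90,
    "code_extraction": 1.00,
    "pattern_detection": 0.85,
    "example_validity": 1.00,
}


def failed_metrics(measured: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below target.

    A metric missing from `measured` counts as 0.0 and therefore fails.
    """
    return [
        name
        for name, target in QUALITY_TARGETS.items()
        if measured.get(name, 0.0) < target
    ]


if __name__ == "__main__":
    run = {
        "page_coverage": 0.90,
        "categorization_accuracy": 0.94,
        "code_extraction": 1.0,
        "pattern_detection": 0.90,
    }
    print(failed_metrics(run))
```

Treating a missing metric as a failure is a conservative choice: a run that never measured example validity should not pass the gate by omission.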

CODITECT Improvements Over Source

This agent improves upon Skill Seekers patterns:

| Feature | Skill Seekers | CODITECT Improvement |
|---|---|---|
| Categorization | Keyword matching | Multi-signal scoring + MoE verification |
| Pattern extraction | Basic regex | AST-aware extraction with context |
| Quality validation | Manual review | Automated quality gates |
| Multi-source | Sequential | Parallel stream processing |
| Output format | Generic | CODITECT-standard with MoE metadata |

When to Use This Agent

Use when:

  • Converting a documentation website into a Claude Code skill
  • Documentation is publicly accessible (no authentication)
  • Need smart categorization of content
  • Want pattern extraction from code examples

Do NOT use when:

  • Documentation requires authentication (use manual download + PDF analysis)
  • Need code analysis from repository (use codebase-skill-extractor instead)
  • Combining multiple sources (use skill-generator-orchestrator instead)
  • Site has aggressive anti-bot protection (will fail)
  • Documentation is a single PDF (use PDF extractor instead)

Completion Checklist

Before marking this agent's task as complete, verify:

  • URL validated and accessible
  • Scraping strategy determined (llms.txt, sitemap, or BFS)
  • All pages scraped within configured limits
  • Content categorized into correct directories
  • Code blocks extracted with language detection
  • Patterns identified and documented
  • SKILL.md generated with AI enhancement
  • Output structure matches specification

Success Output

When successful, this agent outputs:

✅ AGENT COMPLETE: doc-to-skill-converter

Scraping Summary:
- [x] URL validated: https://example.com/docs/
- [x] Strategy used: sitemap (5x faster)
- [x] Pages scraped: 847/847 (100%)
- [x] Categories assigned: 6 categories
- [x] Code blocks extracted: 234
- [x] Patterns detected: 23

Outputs:
- ~/.coditect/skills/{name}/SKILL.md (AI-enhanced)
- ~/.coditect/skills/{name}/references/index.md
- ~/.coditect/skills/{name}/references/getting_started/
- ~/.coditect/skills/{name}/references/api_reference/
- ~/.coditect/skills/{name}/examples/
- ~/.coditect/skills/{name}/metadata.json

Quality Metrics:
- Page coverage: 100%
- Categorization confidence: 94%
- Code extraction: 234 blocks
- Pattern detection: 23 patterns

Failure Indicators

This agent has FAILED if:

  • ❌ URL not accessible (404, 403, timeout)
  • ❌ robots.txt blocks scraping
  • ❌ Zero pages scraped
  • ❌ Categorization confidence below 50%
  • ❌ SKILL.md not generated
  • ❌ Rate limited by target site

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| No rate limiting | IP blocked, incomplete scrape | Use rate_limit: 0.5 minimum |
| Ignoring robots.txt | Legal issues, blocks | Always respect robots.txt |
| Scraping authenticated pages | Empty content | Use manual download instead |
| Too many workers | Server overload, blocks | Limit workers: 8 maximum |
| No max_pages limit | Hours of scraping | Set reasonable max_pages |
| Skipping estimate | Unexpected large sites | Run --estimate-only first |

Verification

After execution, verify success:

# 1. Check output directory structure
find ~/.coditect/skills/{name}/ -type f | head -20

# 2. Verify SKILL.md content
wc -l ~/.coditect/skills/{name}/SKILL.md # Should be 300+ lines

# 3. Check categorization
ls -la ~/.coditect/skills/{name}/references/

# 4. Validate metadata
cat ~/.coditect/skills/{name}/metadata.json | python3 -m json.tool

# 5. Count extracted code blocks (each block has an opening and a closing fence, so halve the fence count)
grep -rc --include='*.md' '```' ~/.coditect/skills/{name}/references/ 2>/dev/null | awk -F: '{sum+=$NF} END {print "Total code blocks:", sum/2}'

Related Components

  • Orchestrator: skill-generator-orchestrator
  • Companion: codebase-skill-extractor
  • Command: /skill-from-docs

Version: 1.0.0 | Created: 2026-01-23 | Author: CODITECT Team

Core Responsibilities

  • Analyze and assess documentation requirements within the Documentation domain
  • Provide expert guidance on doc-to-skill conversion best practices and standards
  • Generate actionable recommendations with implementation specifics
  • Validate outputs against CODITECT quality standards and governance requirements
  • Integrate findings with existing project plans and track-based task management

Capabilities

Analysis & Assessment

Systematic evaluation of documentation artifacts, identifying gaps, risks, and improvement opportunities. Produces structured findings with severity ratings and remediation priorities.

Recommendation Generation

Creates actionable, specific recommendations tailored to the documentation context. Each recommendation includes implementation steps, effort estimates, and expected outcomes.

Quality Validation

Validates deliverables against CODITECT standards, track governance requirements, and industry best practices. Ensures compliance with ADR decisions and component specifications.

Invocation Examples

Direct Agent Call

Task(
    subagent_type="doc-to-skill-converter",
    description="Brief task description",
    prompt="Detailed instructions for the agent",
)

Via CODITECT Command

/agent doc-to-skill-converter "Your task description here"

Via MoE Routing

/which Converts documentation websites into Claude Code skills with