Web Archive - Systematic Web Content Archival
System Prompt
⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:
- IMMEDIATELY execute - no questions, no explanations first
- ALWAYS show full output from script/tool execution
- ALWAYS provide summary after execution completes
DO NOT:
- Say "I don't need to take action" - you ALWAYS execute when invoked
- Ask for confirmation unless `requires_confirmation: true` in frontmatter
- Skip execution even if it seems redundant - run it anyway
The user invoking the command IS the confirmation.
Usage
/web-archive
Archive web content from: $ARGUMENTS
Systematically archive web content from a seed URL with recursive link discovery, organized directory structure, and comprehensive tracking. Perfect for archiving documentation, research, competitive intelligence, and reference materials.
Arguments
$ARGUMENTS - Seed URL and Configuration (required)
Specify the web archival task:
- Single URL: "Archive https://example.com/docs"
- With depth: "Archive https://example.com depth 3"
- With project: "Archive buildermethods.com agent-os documentation"
- Documentation site: "Archive API docs from https://api.example.com"
Configuration Options
Can be specified in arguments or prompted:
- Seed URL (required): Starting point for archival
- Depth limit: How many levels deep to follow links (default: 2)
- Domain filter: Restrict to specific domain
- Include patterns: URL patterns to include (e.g., /docs/, /api/)
- Exclude patterns: URL patterns to skip (e.g., /login, /signup)
What This Command Does
- Creates project-specific tracking document from template
- Fetches seed URL and converts to markdown
- Discovers and follows links (respecting depth limit and filters)
- Organizes content into clean directory structure
- Tracks progress with real-time updates to tracking document
- Validates directory structure and markdown quality
Steps to Follow
Step 1: Initialize Tracking Document
Action: Create project-specific tracking document from template.
# Create research archive directory
mkdir -p research-archive/[PROJECT-NAME]
# Copy template
cp .coditect/CODITECT-CORE-STANDARDS/TEMPLATES/WEB-SEARCH-URL-TEMPLATE.md \
research-archive/[PROJECT-NAME]/WEB-SEARCH-URL.md
Replace placeholders:
- `[PROJECT-NAME-PLACEHOLDER]` → Your project name
- `[URL-PLACEHOLDER]` → Seed URL to scrape
- `[DOMAIN-PLACEHOLDER]` → Domain filter (e.g., buildermethods.com)
- `[DEPTH-PLACEHOLDER]` → Max depth (recommended: 2-3)
- `[URL-PATTERN-PLACEHOLDER]` → URL patterns to include
- `[DATE-PLACEHOLDER]` → Current date
Example:
---
title: "Web Research Archive - BuilderMethods Agent OS"
seed_url: "https://buildermethods.com/agent-os"
domain: "buildermethods.com"
created: "2025-12-03"
depth_limit: 3
---
Step 2: Configure Filters (Optional)
Action: Customize inclusion/exclusion patterns for your use case.
Common Patterns:
Documentation Sites:
--include-pattern "/docs/,/api/,/guides/"
--exclude-pattern "/login,/signup,/pricing"
Blog/News Sites:
--include-pattern "/blog/,/articles/"
--exclude-pattern "/author/,/category/,/tag/"
Product Pages:
--include-pattern "/product/,/features/"
--exclude-pattern "/cart,/checkout,/account"
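The comma-separated patterns above are plain URL substrings. As an illustrative sketch of how such filters are typically evaluated (the actual logic lives in web-archive-scraper.py and may differ), exclusions win over inclusions, and an empty include list means everything passes:

```python
# Hypothetical sketch of include/exclude pattern matching; the real
# scraper's implementation may differ in detail.
def should_scrape(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if the URL passes the include/exclude filters."""
    if any(pat in url for pat in exclude):
        return False          # exclusions always win
    if include:
        return any(pat in url for pat in include)
    return True               # no include list means everything passes

include = ["/docs/", "/api/"]
exclude = ["/login", "/signup"]

print(should_scrape("https://example.com/docs/intro", include, exclude))  # True
print(should_scrape("https://example.com/login", include, exclude))       # False
print(should_scrape("https://example.com/pricing", include, exclude))     # False
```

Note that substring matching means `/login` also excludes `/login-help`; add a trailing slash to patterns if you need stricter matching.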
Step 3: Run Scraper
Action: Execute web-archive-scraper.py with your configuration.
Basic Usage:
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md"
Advanced Usage:
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--depth 3 \
--rate-limit 2.0 \
--include-pattern "/docs/,/api/" \
--exclude-pattern "/login,/signup" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md" \
--verbose
Parameters:
- `--url` (required): Seed URL to start scraping
- `--output`: Output directory (default: research-archive/)
- `--tracking`: Path to tracking document
- `--depth`: Max depth to scrape (default: 3)
- `--rate-limit`: Seconds between requests (default: 2.0)
- `--domain-filter`: Only scrape this domain (default: seed URL domain)
- `--include-pattern`: Comma-separated URL patterns to include
- `--exclude-pattern`: Comma-separated URL patterns to exclude
- `--verbose`: Enable detailed logging
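For reference, a minimal `argparse` skeleton mirroring the documented flags looks like this. This is an illustrative sketch, not the actual web-archive-scraper.py source, so flag handling in the real script may differ:

```python
import argparse

# Illustrative CLI skeleton matching the flags documented above.
parser = argparse.ArgumentParser(prog="web-archive-scraper.py")
parser.add_argument("--url", required=True, help="Seed URL to start scraping")
parser.add_argument("--output", default="research-archive/")
parser.add_argument("--tracking", default=None)
parser.add_argument("--depth", type=int, default=3)
parser.add_argument("--rate-limit", type=float, default=2.0, dest="rate_limit")
parser.add_argument("--domain-filter", default=None)
parser.add_argument("--include-pattern", default="", dest="include_pattern")
parser.add_argument("--exclude-pattern", default="", dest="exclude_pattern")
parser.add_argument("--verbose", action="store_true")

args = parser.parse_args(["--url", "https://example.com", "--depth", "2"])
# Comma-separated patterns become a list, dropping empty entries.
include = [p for p in args.include_pattern.split(",") if p]
print(args.depth, args.rate_limit, include)  # 2 2.0 []
```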
Step 4: Monitor Progress
Action: Watch the tracking document update in real-time.
What to Monitor:
- ✅ Scraped Pages - Successfully archived content
- 🔍 Discovered Links - Links found but not yet processed
- ❌ Excluded Links - Links filtered out by rules
- ⚠️ Failed Links - Errors encountered
Check Progress:
# View tracking document
cat research-archive/project-name/WEB-SEARCH-URL.md
# Count scraped pages
find research-archive/example-com -name "*.md" | wc -l
# View processing log
tail -f research-archive/project-name/WEB-SEARCH-URL.md
Step 5: Validate Results
Action: Verify directory structure and markdown quality.
Validation Checklist:
Directory Structure:
- URLs map to filesystem paths correctly
- No duplicate filenames
- Special characters handled properly
- Depth structure reflects link hierarchy
Markdown Quality:
- All pages have YAML frontmatter
- Content is readable (not raw HTML)
- Headings follow hierarchy
- Links are preserved
Metadata Completeness:
- All pages have `source_url`
- All pages have `scraped_at` timestamp
- All pages have `depth` recorded
- All pages have `domain`
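The first checklist item, URLs mapping cleanly to filesystem paths, can be sketched as follows. This assumes the common `<output>/<url-path>/index.md` layout with unsafe characters replaced; the real scraper's mapping may differ in detail:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical URL-to-path mapping: each page becomes
# <output_root>/<sanitized url path>/index.md.
def url_to_path(url: str, output_root: str) -> str:
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Replace anything that is not alphanumeric, dash, underscore, or dot.
    safe = ["".join(c if c.isalnum() or c in "-_." else "-" for c in s)
            for s in segments]
    return str(PurePosixPath(output_root, *safe, "index.md"))

print(url_to_path("https://example.com/docs/getting started",
                  "research-archive/example-com"))
# research-archive/example-com/docs/getting-started/index.md
```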
Example Validation:
# Check frontmatter
head -10 research-archive/example-com/path/to/page/index.md
# Validate structure
tree research-archive/example-com -L 3
# Check for errors in tracking doc
grep "Failed:" research-archive/project-name/WEB-SEARCH-URL.md
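Metadata completeness can also be checked programmatically. The sketch below assumes simple `key: value` YAML frontmatter delimited by `---` lines; adjust it for your actual page layout:

```python
import re

# Required frontmatter keys from the validation checklist.
REQUIRED = {"source_url", "scraped_at", "depth", "domain"}

def missing_keys(markdown_text: str) -> set[str]:
    """Return the required frontmatter keys absent from a page."""
    match = re.match(r"^---\n(.*?)\n---", markdown_text, re.DOTALL)
    if not match:
        return set(REQUIRED)  # no frontmatter at all
    keys = {line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines() if ":" in line}
    return REQUIRED - keys

sample = "---\nsource_url: https://example.com\nscraped_at: 2025-12-03\n---\n# Page"
print(missing_keys(sample))  # {'depth', 'domain'} (set order may vary)
```

Walking the archive with `pathlib.Path.rglob("*.md")` and applying this check to each file flags incomplete pages in one pass.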
Step 6: Error Recovery (If Needed)
Action: Handle interruptions or failures gracefully.
Resume from Checkpoint: If scraping was interrupted, resume from last successful page:
python3 .coditect/scripts/web-archive-scraper.py \
--resume "research-archive/project-name/WEB-SEARCH-URL.md"
Retry Failed Links: Re-attempt failed links with exponential backoff:
python3 .coditect/scripts/web-archive-scraper.py \
--retry-failed "research-archive/project-name/WEB-SEARCH-URL.md"
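The exponential backoff behind `--retry-failed` can be sketched as follows; `fetch` here is a stand-in for the scraper's real fetch call, and the delays (1s, 2s, 4s, ...) are illustrative defaults:

```python
import time

# Illustrative exponential-backoff retry wrapper.
def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Simulated flaky fetch: fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"

result = fetch_with_backoff(flaky, "https://example.com", base_delay=0.01)
print(result)  # <html>ok</html>
```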
Output Deliverables
This command produces:
1. Organized Directory Structure

   research-archive/[DOMAIN]/
   ├── [url-path-1]/
   │   ├── index.md
   │   └── sub-pages/
   ├── [url-path-2]/
   │   └── index.md
   └── WEB-SEARCH-URL.md

2. Markdown Files with Metadata
   - YAML frontmatter with source URL, timestamp, depth, parent
   - Clean markdown content (basic HTML-to-markdown conversion)
   - Preserved links and structure

3. Tracking Document (WEB-SEARCH-URL.md)
   - Real-time progress updates
   - Link discovery tracking
   - Processing statistics
   - Error logs

4. Processing Statistics
   - Total pages discovered
   - Successfully scraped count
   - Excluded/failed counts
   - Average fetch time
   - Progress percentage
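The statistics above derive from simple counters kept during the run. A sketch, with illustrative field names (the tracking document's actual field names may differ):

```python
# Derive tracking-document statistics from per-run counters.
def progress_stats(discovered, scraped, excluded, failed, total_fetch_seconds):
    processed = scraped + excluded + failed
    return {
        "total_discovered": discovered,
        "scraped": scraped,
        "excluded": excluded,
        "failed": failed,
        # Average fetch time over successfully scraped pages only.
        "avg_fetch_time": round(total_fetch_seconds / scraped, 2) if scraped else 0.0,
        # Progress counts every processed link against everything discovered.
        "progress_pct": round(100.0 * processed / discovered, 1) if discovered else 0.0,
    }

stats = progress_stats(discovered=40, scraped=30, excluded=6, failed=2,
                       total_fetch_seconds=45.0)
print(stats["progress_pct"], stats["avg_fetch_time"])  # 95.0 1.5
```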
Use Cases
1. Documentation Archival
Archive complete documentation sites for offline reference or competitor analysis.
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://docs.example.com" \
--include-pattern "/docs/" \
--depth 4 \
--output "research-archive/example-docs/"
2. Competitive Intelligence
Archive competitor product pages, features, and pricing.
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://competitor.com/product" \
--include-pattern "/product/,/features/,/pricing" \
--depth 2 \
--output "research-archive/competitor-analysis/"
3. Research Material Collection
Gather research papers, articles, and references from academic sites.
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://research-site.edu/papers" \
--include-pattern "/papers/,/publications/" \
--depth 3 \
--output "research-archive/academic-research/"
4. Blog/Content Archival
Archive blog posts and articles for analysis or backup.
python3 .coditect/scripts/web-archive-scraper.py \
--url "https://blog.example.com" \
--include-pattern "/blog/" \
--exclude-pattern "/author/,/tag/" \
--depth 2 \
--output "research-archive/blog-archive/"
Best Practices
Rate Limiting
- Default: 2 seconds between requests (respectful to servers)
- Faster (1 second): For internal sites or with permission
- Slower (3-5 seconds): For rate-limited APIs or public sites
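The rate-limit guidance above amounts to enforcing a minimum interval between consecutive requests. A minimal polite limiter, consistent with the documented `--rate-limit` semantics (the scraper's internal implementation may differ):

```python
import time

# Enforce at least min_interval seconds between consecutive requests.
class RateLimiter:
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=0.05)  # short interval for the demo
start = time.monotonic()
for _ in range(3):
    limiter.wait()   # in the scraper this would precede each fetch
elapsed = time.monotonic() - start
print(elapsed >= 0.10)  # True: at least two full intervals elapsed
```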
Depth Selection
- Depth 1: Seed page + direct links only (quick test)
- Depth 2: Seed + 2 levels (most documentation)
- Depth 3: Comprehensive archival (recommended)
- Depth 4+: Very large sites (can take hours)
Filtering Strategy
- Include patterns: Focus on content paths (e.g., /docs/, /guides/)
- Exclude patterns: Skip navigation, auth, media paths
- Domain filter: Stay within target domain (avoid external links)
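A same-domain check is the core of the domain filter. This sketch also accepts subdomains, which is an assumption; the scraper's `--domain-filter` may be stricter:

```python
from urllib.parse import urlparse

# True if the URL's host is the target domain or one of its subdomains.
def same_domain(url: str, domain: str) -> bool:
    host = urlparse(url).netloc.lower()
    return host == domain or host.endswith("." + domain)

print(same_domain("https://docs.example.com/api", "example.com"))  # True
print(same_domain("https://other.com/page", "example.com"))        # False
```

The suffix check uses `"." + domain` so that a lookalike host such as `notexample.com` does not slip through.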
Error Handling
- Monitor failed links: Check tracking document for 404s, timeouts
- Resume capability: Use --resume for large scrapes
- Retry strategy: Use --retry-failed with exponential backoff
Performance Optimization
- Start small: Test with depth 1 first
- Adjust rate limit: Balance speed vs. server load
- Filter aggressively: Exclude unnecessary paths early
- Monitor progress: Watch tracking document for issues
Troubleshooting
Issue: No files created
Solution: Check that seed URL is accessible, domain filter is correct.
Issue: Too many files created
Solution: Refine include/exclude patterns, reduce depth limit.
Issue: Poor markdown quality
Solution: Install html2text library for better conversion:
pip install html2text
# Then modify scraper to use html2text.HTML2Text()
Issue: Scraper taking too long
Solution: Reduce depth, add more exclude patterns, or shorten the delay between requests (e.g., --rate-limit 1.0) if the server allows it.
Issue: 404 errors
Solution: Check failed links in tracking document, verify URL patterns.
Integration with Other Commands
This command complements:
- /web-search-hooks - Research hooks patterns from archived docs
- /analyze-hooks - Analyze archived documentation for patterns
- /multi-agent-research - Use archived content for research workflows
Together they provide:
- ✅ Systematic content archival (web-archive)
- ✅ Pattern extraction and analysis
- ✅ Multi-agent orchestration for complex research
Important Notes
- Respect robots.txt: The scraper should honor robots.txt (future enhancement)
- Rate limiting: Default 2 seconds is respectful; adjust as needed
- Legal considerations: Only archive publicly accessible content
- Storage: Large sites can create thousands of files
- Bandwidth: Be mindful of bandwidth usage on metered connections
- Attribution: Keep source URLs in frontmatter for proper attribution
Success Criteria
- ✅ All target pages successfully scraped
- ✅ Directory structure is clean and navigable
- ✅ Markdown files have complete metadata
- ✅ Tracking document shows 100% progress
- ✅ No unresolved failures (or documented reasons)
- ✅ Content is readable and usable
Action Policy
<default_behavior> This command creates research archives without modifying source sites. It provides:
- Systematic archival workflow
- Organized directory structure
- Real-time progress tracking
- Error recovery procedures
User decides:
- Which URLs to archive
- Depth and filtering settings
- How to use archived content </default_behavior>
Command Version: 1.0.0
Created: 2025-12-03
CODITECT Standards Compliant: ✅
Requires: web-archive-scraper.py v1.0.0, WEB-SEARCH-URL-TEMPLATE.md
Success Output
When web archival completes:
✅ COMMAND COMPLETE: /web-archive
Seed URL: <seed-url>
Pages Scraped: N
Depth: D levels
Output: <output-path>
Tracking: <tracking-doc>
Completion Checklist
Before marking complete:
- Tracking document created
- Seed URL fetched
- Links discovered and followed
- Directory structure organized
- Progress 100%
Failure Indicators
This command has FAILED if:
- ❌ Seed URL inaccessible
- ❌ No pages scraped
- ❌ Directory structure broken
- ❌ Tracking document missing
When NOT to Use
Do NOT use when:
- Site requires authentication
- Robots.txt disallows scraping
- Single page fetch needed (use WebFetch)
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Depth too high | Hours of scraping | Start with depth 2-3 |
| No filters | Too many files | Add include/exclude patterns |
| Skip rate limiting | IP blocked | Use 2+ second delay |
Principles
This command embodies:
- #3 Complete Execution - Full archival workflow
- #1 Recycle → Extend - Organized for reuse
Full Standard: CODITECT-STANDARD-AUTOMATION.md