
Web Archive - Systematic Web Content Archival

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. IMMEDIATELY execute - no questions, no explanations first
  2. ALWAYS show full output from script/tool execution
  3. ALWAYS provide summary after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation unless requires_confirmation: true in frontmatter
  • Skip execution even if it seems redundant - run it anyway

The user invoking the command IS the confirmation.


Usage

/web-archive

Archive web content from: $ARGUMENTS

Systematically archive web content from a seed URL with recursive link discovery, organized directory structure, and comprehensive tracking. Perfect for archiving documentation, research, competitive intelligence, and reference materials.

Arguments

$ARGUMENTS - Seed URL and Configuration (required)

Specify the web archival task: at minimum the seed URL, plus any of the configuration options below.

Configuration Options

These can be supplied in the arguments or provided when prompted:

  • Seed URL (required): Starting point for archival
  • Depth limit: How many levels deep to follow links (default: 2)
  • Domain filter: Restrict to specific domain
  • Include patterns: URL patterns to include (e.g., /docs/, /api/)
  • Exclude patterns: URL patterns to skip (e.g., /login, /signup)

What This Command Does

  1. Creates project-specific tracking document from template
  2. Fetches seed URL and converts to markdown
  3. Discovers and follows links (respecting depth limit and filters)
  4. Organizes content into clean directory structure
  5. Tracks progress with real-time updates to tracking document
  6. Validates directory structure and markdown quality

Steps to Follow

Step 1: Initialize Tracking Document

Action: Create project-specific tracking document from template.

# Create research archive directory
mkdir -p research-archive/[PROJECT-NAME]

# Copy template
cp .coditect/CODITECT-CORE-STANDARDS/TEMPLATES/WEB-SEARCH-URL-TEMPLATE.md \
research-archive/[PROJECT-NAME]/WEB-SEARCH-URL.md

Replace placeholders:

  • [PROJECT-NAME-PLACEHOLDER] → Your project name
  • [URL-PLACEHOLDER] → Seed URL to scrape
  • [DOMAIN-PLACEHOLDER] → Domain filter (e.g., buildermethods.com)
  • [DEPTH-PLACEHOLDER] → Max depth (recommended: 2-3)
  • [URL-PATTERN-PLACEHOLDER] → URL patterns to include
  • [DATE-PLACEHOLDER] → Current date

Example:

---
title: "Web Research Archive - BuilderMethods Agent OS"
seed_url: "https://buildermethods.com/agent-os"
domain: "buildermethods.com"
created: "2025-12-03"
depth_limit: 3
---
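
If you prefer to fill the placeholders programmatically rather than by hand, a minimal sketch is below. The fill_template helper and the example values are illustrative only; they are not part of the shipped tooling.

#!/usr/bin/env python3
"""Fill WEB-SEARCH-URL-TEMPLATE.md placeholders (illustrative sketch)."""
from datetime import date
from pathlib import Path

def fill_template(template_path: str, output_path: str, values: dict) -> None:
    text = Path(template_path).read_text()
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text(text)

fill_template(
    ".coditect/CODITECT-CORE-STANDARDS/TEMPLATES/WEB-SEARCH-URL-TEMPLATE.md",
    "research-archive/buildermethods-agent-os/WEB-SEARCH-URL.md",
    {
        "[PROJECT-NAME-PLACEHOLDER]": "BuilderMethods Agent OS",
        "[URL-PLACEHOLDER]": "https://buildermethods.com/agent-os",
        "[DOMAIN-PLACEHOLDER]": "buildermethods.com",
        "[DEPTH-PLACEHOLDER]": "3",
        "[URL-PATTERN-PLACEHOLDER]": "/agent-os/",
        "[DATE-PLACEHOLDER]": date.today().isoformat(),
    },
)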

Step 2: Configure Filters (Optional)

Action: Customize inclusion/exclusion patterns for your use case.

Common Patterns:

Documentation Sites:

--include-pattern "/docs/,/api/,/guides/"
--exclude-pattern "/login,/signup,/pricing"

Blog/News Sites:

--include-pattern "/blog/,/articles/"
--exclude-pattern "/author/,/category/,/tag/"

Product Pages:

--include-pattern "/product/,/features/"
--exclude-pattern "/cart,/checkout,/account"

Step 3: Run Scraper

Action: Execute web-archive-scraper.py with your configuration.

Basic Usage:

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md"

Advanced Usage:

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--depth 3 \
--rate-limit 2.0 \
--include-pattern "/docs/,/api/" \
--exclude-pattern "/login,/signup" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md" \
--verbose

Parameters:

  • --url (required): Seed URL to start scraping
  • --output: Output directory (default: research-archive/)
  • --tracking: Path to tracking document
  • --depth: Max depth to scrape (default: 3)
  • --rate-limit: Seconds between requests (default: 2.0)
  • --domain-filter: Only scrape this domain (default: seed URL domain)
  • --include-pattern: Comma-separated URL patterns to include
  • --exclude-pattern: Comma-separated URL patterns to exclude
  • --verbose: Enable detailed logging
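
For orientation, a scraper like this is essentially a rate-limited breadth-first crawl. The sketch below is not the shipped web-archive-scraper.py; it only illustrates how --depth, --rate-limit, and --domain-filter interact (BeautifulSoup is assumed here for link extraction).

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

def crawl(seed: str, depth_limit: int = 3, rate_limit: float = 2.0) -> dict[str, str]:
    """Rate-limited breadth-first crawl restricted to the seed URL's domain."""
    domain = urlparse(seed).netloc
    pages: dict[str, str] = {}
    queue, seen = deque([(seed, 0)]), {seed}
    while queue:
        url, depth = queue.popleft()
        time.sleep(rate_limit)                       # --rate-limit: pause between requests
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue                                 # a real scraper would record this as a failed link
        pages[url] = response.text
        if depth >= depth_limit:                     # --depth: do not follow links any deeper
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:  # --domain-filter
                seen.add(link)
                queue.append((link, depth + 1))
    return pages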

Step 4: Monitor Progress

Action: Watch the tracking document update in real-time.

What to Monitor:

  • Scraped Pages - Successfully archived content
  • 🔍 Discovered Links - Links found but not yet processed
  • Excluded Links - Links filtered out by rules
  • ⚠️ Failed Links - Errors encountered

Check Progress:

# View tracking document
cat research-archive/project-name/WEB-SEARCH-URL.md

# Count scraped pages
find research-archive/example-com -name "*.md" | wc -l

# View processing log
tail -f research-archive/project-name/WEB-SEARCH-URL.md
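
You can also summarize progress directly from the scraped files' frontmatter. This sketch assumes the depth field documented in Step 5; adjust the key if your frontmatter differs.

import re
from collections import Counter
from pathlib import Path

counts = Counter()
for md_file in Path("research-archive/example-com").rglob("*.md"):
    match = re.search(r"^depth:\s*(\d+)", md_file.read_text(), re.MULTILINE)
    if match:
        counts[int(match.group(1))] += 1

for depth, total in sorted(counts.items()):
    print(f"depth {depth}: {total} pages")
print(f"total: {sum(counts.values())} pages")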

Step 5: Validate Results

Action: Verify directory structure and markdown quality.

Validation Checklist:

Directory Structure:

  • URLs map to filesystem paths correctly
  • No duplicate filenames
  • Special characters handled properly
  • Depth structure reflects link hierarchy

Markdown Quality:

  • All pages have YAML frontmatter
  • Content is readable (not raw HTML)
  • Headings follow hierarchy
  • Links are preserved

Metadata Completeness:

  • All pages have source_url
  • All pages have scraped_at timestamp
  • All pages have depth recorded
  • All pages have domain

Example Validation:

# Check frontmatter
head -10 research-archive/example-com/path/to/page/index.md

# Validate structure
tree research-archive/example-com -L 3

# Check for errors in tracking doc
grep "Failed:" research-archive/project-name/WEB-SEARCH-URL.md

Step 6: Error Recovery (If Needed)

Action: Handle interruptions or failures gracefully.

Resume from Checkpoint: If scraping was interrupted, resume from last successful page:

python3 .coditect/scripts/web-archive-scraper.py \
--resume "research-archive/project-name/WEB-SEARCH-URL.md"

Retry Failed Links: Re-attempt failed links with exponential backoff:

python3 .coditect/scripts/web-archive-scraper.py \
--retry-failed "research-archive/project-name/WEB-SEARCH-URL.md"
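
If your scraper version lacks --retry-failed, the behavior it describes is easy to approximate. A minimal exponential-backoff sketch (the retry_fetch helper is illustrative, not part of the script):

import time
import requests

def retry_fetch(url: str, attempts: int = 4, base_delay: float = 2.0) -> str | None:
    """Fetch a URL, doubling the wait after each failure (2s, 4s, 8s, ...)."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            wait = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {wait:.0f}s")
            time.sleep(wait)
    return None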

Output Deliverables

This command produces:

  1. Organized Directory Structure

    research-archive/[DOMAIN]/
    ├── [url-path-1]/
    │   ├── index.md
    │   └── sub-pages/
    ├── [url-path-2]/
    │   └── index.md
    └── WEB-SEARCH-URL.md
  2. Markdown Files with Metadata

    • YAML frontmatter with source URL, timestamp, depth, parent
    • Clean markdown content (basic HTML-to-markdown conversion)
    • Preserved links and structure
  3. Tracking Document (WEB-SEARCH-URL.md)

    • Real-time progress updates
    • Link discovery tracking
    • Processing statistics
    • Error logs
  4. Processing Statistics

    • Total pages discovered
    • Successfully scraped count
    • Excluded/failed counts
    • Average fetch time
    • Progress percentage
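
The layout in item 1 comes from mapping each URL's path onto the filesystem. The exact sanitization rules belong to web-archive-scraper.py; the sketch below shows one plausible mapping (query strings dropped, unsafe characters replaced, each page saved as index.md).

import re
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url: str, root: str = "research-archive") -> Path:
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    safe = [re.sub(r"[^A-Za-z0-9._-]", "-", s) for s in segments]  # handle special characters
    return Path(root, parsed.netloc, *safe, "index.md")

print(url_to_path("https://docs.example.com/docs/getting-started"))
# research-archive/docs.example.com/docs/getting-started/index.md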

Use Cases

1. Documentation Archival

Archive complete documentation sites for offline reference or competitor analysis.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://docs.example.com" \
--include-pattern "/docs/" \
--depth 4 \
--output "research-archive/example-docs/"

2. Competitive Intelligence

Archive competitor product pages, features, and pricing.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://competitor.com/product" \
--include-pattern "/product/,/features/,/pricing" \
--depth 2 \
--output "research-archive/competitor-analysis/"

3. Research Material Collection

Gather research papers, articles, and references from academic sites.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://research-site.edu/papers" \
--include-pattern "/papers/,/publications/" \
--depth 3 \
--output "research-archive/academic-research/"

4. Blog/Content Archival

Archive blog posts and articles for analysis or backup.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://blog.example.com" \
--include-pattern "/blog/" \
--exclude-pattern "/author/,/tag/" \
--depth 2 \
--output "research-archive/blog-archive/"

Best Practices

Rate Limiting

  • Default: 2 seconds between requests (respectful to servers)
  • Faster (1 second): For internal sites or with permission
  • Slower (3-5 seconds): For rate-limited APIs or public sites

Depth Selection

  • Depth 1: Seed page + direct links only (quick test)
  • Depth 2: Seed + 2 levels (most documentation)
  • Depth 3: Comprehensive archival (recommended)
  • Depth 4+: Very large sites (can take hours)

Filtering Strategy

  • Include patterns: Focus on content paths (e.g., /docs/, /guides/)
  • Exclude patterns: Skip navigation, auth, media paths
  • Domain filter: Stay within target domain (avoid external links)

Error Handling

  • Monitor failed links: Check tracking document for 404s, timeouts
  • Resume capability: Use --resume for large scrapes
  • Retry strategy: Use --retry-failed with exponential backoff

Performance Optimization

  • Start small: Test with depth 1 first
  • Adjust rate limit: Balance speed vs. server load
  • Filter aggressively: Exclude unnecessary paths early
  • Monitor progress: Watch tracking document for issues

Troubleshooting

Issue: No files created

Solution: Check that the seed URL is accessible and the domain filter is correct.

Issue: Too many files created

Solution: Refine include/exclude patterns, reduce depth limit.

Issue: Poor markdown quality

Solution: Install the html2text library for better conversion:

pip install html2text
# Then modify scraper to use html2text.HTML2Text()
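
The change is small. A sketch of the conversion step using html2text (the option names are real html2text settings; where this hooks into the scraper depends on its internals):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep hyperlinks in the markdown output
converter.ignore_images = True   # skip inline images
converter.body_width = 0         # disable hard line wrapping

markdown = converter.handle("<h1>Title</h1><p>Some <a href='/docs'>docs</a>.</p>")
print(markdown)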

Issue: Scraper taking too long

Solution: Reduce depth, add more exclude patterns, or lower the --rate-limit delay (only if the server can tolerate faster requests).

Issue: 404 errors

Solution: Check failed links in tracking document, verify URL patterns.

Integration with Other Commands

This command complements:

  • /web-search-hooks - Research hooks patterns from archived docs
  • /analyze-hooks - Analyze archived documentation for patterns
  • /multi-agent-research - Use archived content for research workflows

Together they provide:

  • ✅ Systematic content archival (web-archive)
  • ✅ Pattern extraction and analysis
  • ✅ Multi-agent orchestration for complex research

Important Notes

  • Respect robots.txt: The scraper does not yet honor robots.txt (planned enhancement); check it manually before scraping
  • Rate limiting: Default 2 seconds is respectful; adjust as needed
  • Legal considerations: Only archive publicly accessible content
  • Storage: Large sites can create thousands of files
  • Bandwidth: Be mindful of bandwidth usage on metered connections
  • Attribution: Keep source URLs in frontmatter for proper attribution

Success Criteria

  • ✅ All target pages successfully scraped
  • ✅ Directory structure is clean and navigable
  • ✅ Markdown files have complete metadata
  • ✅ Tracking document shows 100% progress
  • ✅ No unresolved failures (or documented reasons)
  • ✅ Content is readable and usable

Action Policy

<default_behavior> This command creates research archives without modifying source sites. It provides:

  • Systematic archival workflow
  • Organized directory structure
  • Real-time progress tracking
  • Error recovery procedures

User decides:

  • Which URLs to archive
  • Depth and filtering settings
  • How to use archived content
</default_behavior>

After archival, verify:

  • Directory structure is correct
  • Markdown quality is acceptable
  • Tracking document shows completion
  • No critical errors in logs
  • Content is usable for intended purpose

Command Version: 1.0.0
Created: 2025-12-03
CODITECT Standards Compliant
Requires: web-archive-scraper.py v1.0.0, WEB-SEARCH-URL-TEMPLATE.md

Success Output

When web archival completes:

✅ COMMAND COMPLETE: /web-archive
Seed URL: <seed-url>
Pages Scraped: N
Depth: D levels
Output: <output-path>
Tracking: <tracking-doc>

Completion Checklist

Before marking complete:

  • Tracking document created
  • Seed URL fetched
  • Links discovered and followed
  • Directory structure organized
  • Progress 100%

Failure Indicators

This command has FAILED if:

  • ❌ Seed URL inaccessible
  • ❌ No pages scraped
  • ❌ Directory structure broken
  • ❌ Tracking document missing

When NOT to Use

Do NOT use when:

  • Site requires authentication
  • Robots.txt disallows scraping
  • Single page fetch needed (use WebFetch)

Anti-Patterns (Avoid)

Anti-Pattern | Problem | Solution
Depth too high | Hours of scraping | Start with depth 2-3
No filters | Too many files | Add include/exclude patterns
Skip rate limiting | IP blocked | Use 2+ second delay

Principles

This command embodies:

  • #3 Complete Execution - Full archival workflow
  • #1 Recycle → Extend - Organized for reuse

Full Standard: CODITECT-STANDARD-AUTOMATION.md