
Web Archive - Systematic Web Content Archival

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. IMMEDIATELY execute - no questions, no explanations first
  2. ALWAYS show full output from script/tool execution
  3. ALWAYS provide summary after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation unless requires_confirmation: true in frontmatter
  • Skip execution even if it seems redundant - run it anyway

The user invoking the command IS the confirmation.


Usage

/web-archive

Archive web content from: $ARGUMENTS

Systematically archive web content from a seed URL with recursive link discovery, organized directory structure, and comprehensive tracking. Perfect for archiving documentation, research, competitive intelligence, and reference materials.

Arguments

$ARGUMENTS - Seed URL and Configuration (required)

Specify the web archival task: at minimum the seed URL, plus any of the configuration options below.

Configuration Options

These can be supplied in the arguments or provided when prompted:

  • Seed URL (required): Starting point for archival
  • Depth limit: How many levels deep to follow links (default: 2)
  • Domain filter: Restrict to specific domain
  • Include patterns: URL patterns to include (e.g., /docs/, /api/)
  • Exclude patterns: URL patterns to skip (e.g., /login, /signup)

What This Command Does

  1. Creates project-specific tracking document from template
  2. Fetches seed URL and converts to markdown
  3. Discovers and follows links (respecting depth limit and filters)
  4. Organizes content into clean directory structure
  5. Tracks progress with real-time updates to tracking document
  6. Validates directory structure and markdown quality

Steps to Follow

Step 1: Initialize Tracking Document

Action: Create project-specific tracking document from template.

# Create research archive directory
mkdir -p research-archive/[PROJECT-NAME]

# Copy template
cp .coditect/CODITECT-CORE-STANDARDS/TEMPLATES/WEB-SEARCH-URL-TEMPLATE.md \
research-archive/[PROJECT-NAME]/WEB-SEARCH-URL.md

Replace placeholders:

  • [PROJECT-NAME-PLACEHOLDER] → Your project name
  • [URL-PLACEHOLDER] → Seed URL to scrape
  • [DOMAIN-PLACEHOLDER] → Domain filter (e.g., buildermethods.com)
  • [DEPTH-PLACEHOLDER] → Max depth (recommended: 2-3)
  • [URL-PATTERN-PLACEHOLDER] → URL patterns to include
  • [DATE-PLACEHOLDER] → Current date

Example:

---
title: "Web Research Archive - BuilderMethods Agent OS"
seed_url: "https://buildermethods.com/agent-os"
domain: "buildermethods.com"
created: "2025-12-03"
depth_limit: 3
---
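
If you prefer to fill the placeholders programmatically rather than by hand, a minimal sketch is below. The fill_template helper and the example values are illustrative only; they are not part of the shipped tooling.

#!/usr/bin/env python3
"""Fill WEB-SEARCH-URL-TEMPLATE.md placeholders (illustrative sketch)."""
from datetime import date
from pathlib import Path

def fill_template(template_path: str, output_path: str, values: dict) -> None:
    text = Path(template_path).read_text()
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text(text)

fill_template(
    ".coditect/CODITECT-CORE-STANDARDS/TEMPLATES/WEB-SEARCH-URL-TEMPLATE.md",
    "research-archive/buildermethods-agent-os/WEB-SEARCH-URL.md",
    {
        "[PROJECT-NAME-PLACEHOLDER]": "BuilderMethods Agent OS",
        "[URL-PLACEHOLDER]": "https://buildermethods.com/agent-os",
        "[DOMAIN-PLACEHOLDER]": "buildermethods.com",
        "[DEPTH-PLACEHOLDER]": "3",
        "[URL-PATTERN-PLACEHOLDER]": "/agent-os/",
        "[DATE-PLACEHOLDER]": date.today().isoformat(),
    },
)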

Step 2: Configure Filters (Optional)

Action: Customize inclusion/exclusion patterns for your use case.

Common Patterns:

Documentation Sites:

--include-pattern "/docs/,/api/,/guides/"
--exclude-pattern "/login,/signup,/pricing"

Blog/News Sites:

--include-pattern "/blog/,/articles/"
--exclude-pattern "/author/,/category/,/tag/"

Product Pages:

--include-pattern "/product/,/features/"
--exclude-pattern "/cart,/checkout,/account"

Step 3: Run Scraper

Action: Execute web-archive-scraper.py with your configuration.

Basic Usage:

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md"

Advanced Usage:

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://example.com/page" \
--depth 3 \
--rate-limit 2.0 \
--include-pattern "/docs/,/api/" \
--exclude-pattern "/login,/signup" \
--output "research-archive/example-com/" \
--tracking "research-archive/project-name/WEB-SEARCH-URL.md" \
--verbose

Parameters:

  • --url (required): Seed URL to start scraping
  • --output: Output directory (default: research-archive/)
  • --tracking: Path to tracking document
  • --depth: Max depth to scrape (default: 3)
  • --rate-limit: Seconds between requests (default: 2.0)
  • --domain-filter: Only scrape this domain (default: seed URL domain)
  • --include-pattern: Comma-separated URL patterns to include
  • --exclude-pattern: Comma-separated URL patterns to exclude
  • --verbose: Enable detailed logging
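
For orientation, a scraper like this is essentially a rate-limited breadth-first crawl. The sketch below is not the shipped web-archive-scraper.py; it only illustrates how --depth, --rate-limit, and --domain-filter interact (BeautifulSoup is assumed here for link extraction).

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

def crawl(seed: str, depth_limit: int = 3, rate_limit: float = 2.0) -> dict[str, str]:
    """Rate-limited breadth-first crawl restricted to the seed URL's domain."""
    domain = urlparse(seed).netloc
    pages: dict[str, str] = {}
    queue, seen = deque([(seed, 0)]), {seed}
    while queue:
        url, depth = queue.popleft()
        time.sleep(rate_limit)                       # --rate-limit: pause between requests
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue                                 # a real scraper would record this as a failed link
        pages[url] = response.text
        if depth >= depth_limit:                     # --depth: do not follow links any deeper
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:  # --domain-filter
                seen.add(link)
                queue.append((link, depth + 1))
    return pages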

Step 4: Monitor Progress

Action: Watch the tracking document update in real-time.

What to Monitor:

  • Scraped Pages - Successfully archived content
  • 🔍 Discovered Links - Links found but not yet processed
  • Excluded Links - Links filtered out by rules
  • ⚠️ Failed Links - Errors encountered

Check Progress:

# View tracking document
cat research-archive/project-name/WEB-SEARCH-URL.md

# Count scraped pages
find research-archive/example-com -name "*.md" | wc -l

# View processing log
tail -f research-archive/project-name/WEB-SEARCH-URL.md
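
You can also summarize progress directly from the scraped files' frontmatter. This sketch assumes the depth field documented in Step 5; adjust the key if your frontmatter differs.

import re
from collections import Counter
from pathlib import Path

counts = Counter()
for md_file in Path("research-archive/example-com").rglob("*.md"):
    match = re.search(r"^depth:\s*(\d+)", md_file.read_text(), re.MULTILINE)
    if match:
        counts[int(match.group(1))] += 1

for depth, total in sorted(counts.items()):
    print(f"depth {depth}: {total} pages")
print(f"total: {sum(counts.values())} pages")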

Step 5: Validate Results

Action: Verify directory structure and markdown quality.

Validation Checklist:

Directory Structure:

  • URLs map to filesystem paths correctly
  • No duplicate filenames
  • Special characters handled properly
  • Depth structure reflects link hierarchy

Markdown Quality:

  • All pages have YAML frontmatter
  • Content is readable (not raw HTML)
  • Headings follow hierarchy
  • Links are preserved

Metadata Completeness:

  • All pages have source_url
  • All pages have scraped_at timestamp
  • All pages have depth recorded
  • All pages have domain

Example Validation:

# Check frontmatter
head -10 research-archive/example-com/path/to/page/index.md

# Validate structure
tree research-archive/example-com -L 3

# Check for errors in tracking doc
grep "Failed:" research-archive/project-name/WEB-SEARCH-URL.md

Step 6: Error Recovery (If Needed)

Action: Handle interruptions or failures gracefully.

Resume from Checkpoint: If scraping was interrupted, resume from last successful page:

python3 .coditect/scripts/web-archive-scraper.py \
--resume "research-archive/project-name/WEB-SEARCH-URL.md"

Retry Failed Links: Re-attempt failed links with exponential backoff:

python3 .coditect/scripts/web-archive-scraper.py \
--retry-failed "research-archive/project-name/WEB-SEARCH-URL.md"
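
If your scraper version lacks --retry-failed, the behavior it describes is easy to approximate. A minimal exponential-backoff sketch (the retry_fetch helper is illustrative, not part of the script):

import time
import requests

def retry_fetch(url: str, attempts: int = 4, base_delay: float = 2.0) -> str | None:
    """Fetch a URL, doubling the wait after each failure (2s, 4s, 8s, ...)."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            wait = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {wait:.0f}s")
            time.sleep(wait)
    return None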

Output Deliverables

This command produces:

  1. Organized Directory Structure

    research-archive/[DOMAIN]/
    ├── [url-path-1]/
    │   ├── index.md
    │   └── sub-pages/
    ├── [url-path-2]/
    │   └── index.md
    └── WEB-SEARCH-URL.md
  2. Markdown Files with Metadata

    • YAML frontmatter with source URL, timestamp, depth, parent
    • Clean markdown content (basic HTML-to-markdown conversion)
    • Preserved links and structure
  3. Tracking Document (WEB-SEARCH-URL.md)

    • Real-time progress updates
    • Link discovery tracking
    • Processing statistics
    • Error logs
  4. Processing Statistics

    • Total pages discovered
    • Successfully scraped count
    • Excluded/failed counts
    • Average fetch time
    • Progress percentage
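
The layout in item 1 comes from mapping each URL's path onto the filesystem. The exact sanitization rules belong to web-archive-scraper.py; the sketch below shows one plausible mapping (query strings dropped, unsafe characters replaced, each page saved as index.md).

import re
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url: str, root: str = "research-archive") -> Path:
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    safe = [re.sub(r"[^A-Za-z0-9._-]", "-", s) for s in segments]  # handle special characters
    return Path(root, parsed.netloc, *safe, "index.md")

print(url_to_path("https://docs.example.com/docs/getting-started"))
# research-archive/docs.example.com/docs/getting-started/index.md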

Use Cases

1. Documentation Archival

Archive complete documentation sites for offline reference or competitor analysis.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://docs.example.com" \
--include-pattern "/docs/" \
--depth 4 \
--output "research-archive/example-docs/"

2. Competitive Intelligence

Archive competitor product pages, features, and pricing.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://competitor.com/product" \
--include-pattern "/product/,/features/,/pricing" \
--depth 2 \
--output "research-archive/competitor-analysis/"

3. Research Material Collection

Gather research papers, articles, and references from academic sites.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://research-site.edu/papers" \
--include-pattern "/papers/,/publications/" \
--depth 3 \
--output "research-archive/academic-research/"

4. Blog/Content Archival

Archive blog posts and articles for analysis or backup.

python3 .coditect/scripts/web-archive-scraper.py \
--url "https://blog.example.com" \
--include-pattern "/blog/" \
--exclude-pattern "/author/,/tag/" \
--depth 2 \
--output "research-archive/blog-archive/"

Best Practices

Rate Limiting

  • Default: 2 seconds between requests (respectful to servers)
  • Faster (1 second): For internal sites or with permission
  • Slower (3-5 seconds): For rate-limited APIs or public sites

Depth Selection

  • Depth 1: Seed page + direct links only (quick test)
  • Depth 2: Seed + 2 levels (most documentation)
  • Depth 3: Comprehensive archival (recommended)
  • Depth 4+: Very large sites (can take hours)

Filtering Strategy

  • Include patterns: Focus on content paths (e.g., /docs/, /guides/)
  • Exclude patterns: Skip navigation, auth, media paths
  • Domain filter: Stay within target domain (avoid external links)

Error Handling

  • Monitor failed links: Check tracking document for 404s, timeouts
  • Resume capability: Use --resume for large scrapes
  • Retry strategy: Use --retry-failed with exponential backoff

Performance Optimization

  • Start small: Test with depth 1 first
  • Adjust rate limit: Balance speed vs. server load
  • Filter aggressively: Exclude unnecessary paths early
  • Monitor progress: Watch tracking document for issues

Troubleshooting

Issue: No files created

Solution: Check that the seed URL is accessible and the domain filter is correct.

Issue: Too many files created

Solution: Refine include/exclude patterns, reduce depth limit.

Issue: Poor markdown quality

Solution: Install the html2text library for better conversion:

pip install html2text
# Then modify scraper to use html2text.HTML2Text()
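
The change is small. A sketch of the conversion step using html2text (the option names are real html2text settings; where this hooks into the scraper depends on its internals):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep hyperlinks in the markdown output
converter.ignore_images = True   # skip inline images
converter.body_width = 0         # disable hard line wrapping

markdown = converter.handle("<h1>Title</h1><p>Some <a href='/docs'>docs</a>.</p>")
print(markdown)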

Issue: Scraper taking too long

Solution: Reduce depth, add more exclude patterns, or lower the --rate-limit delay (only if the server can tolerate faster requests).

Issue: 404 errors

Solution: Check failed links in tracking document, verify URL patterns.

Integration with Other Commands

This command complements:

  • /web-search-hooks - Research hooks patterns from archived docs
  • /analyze-hooks - Analyze archived documentation for patterns
  • /multi-agent-research - Use archived content for research workflows

Together they provide:

  • ✅ Systematic content archival (web-archive)
  • ✅ Pattern extraction and analysis
  • ✅ Multi-agent orchestration for complex research

Important Notes

  • Respect robots.txt: The scraper does not yet honor robots.txt (planned enhancement); check it manually before scraping
  • Rate limiting: Default 2 seconds is respectful; adjust as needed
  • Legal considerations: Only archive publicly accessible content
  • Storage: Large sites can create thousands of files
  • Bandwidth: Be mindful of bandwidth usage on metered connections
  • Attribution: Keep source URLs in frontmatter for proper attribution

Success Criteria

  • ✅ All target pages successfully scraped
  • ✅ Directory structure is clean and navigable
  • ✅ Markdown files have complete metadata
  • ✅ Tracking document shows 100% progress
  • ✅ No unresolved failures (or documented reasons)
  • ✅ Content is readable and usable

Action Policy

<default_behavior> This command creates research archives without modifying source sites. It provides:

  • Systematic archival workflow
  • Organized directory structure
  • Real-time progress tracking
  • Error recovery procedures

User decides:

  • Which URLs to archive
  • Depth and filtering settings
  • How to use archived content
</default_behavior>

After archival, verify:

  • Directory structure is correct
  • Markdown quality is acceptable
  • Tracking document shows completion
  • No critical errors in logs
  • Content is usable for intended purpose

Command Version: 1.0.0
Created: 2025-12-03
CODITECT Standards Compliant
Requires: web-archive-scraper.py v1.0.0, WEB-SEARCH-URL-TEMPLATE.md

Success Output

When web archival completes:

✅ COMMAND COMPLETE: /web-archive
Seed URL: <seed-url>
Pages Scraped: N
Depth: D levels
Output: <output-path>
Tracking: <tracking-doc>

Completion Checklist

Before marking complete:

  • Tracking document created
  • Seed URL fetched
  • Links discovered and followed
  • Directory structure organized
  • Progress 100%

Failure Indicators

This command has FAILED if:

  • ❌ Seed URL inaccessible
  • ❌ No pages scraped
  • ❌ Directory structure broken
  • ❌ Tracking document missing

When NOT to Use

Do NOT use when:

  • Site requires authentication
  • Robots.txt disallows scraping
  • Single page fetch needed (use WebFetch)

Anti-Patterns (Avoid)

Anti-Pattern | Problem | Solution
Depth too high | Hours of scraping | Start with depth 2-3
No filters | Too many files | Add include/exclude patterns
Skip rate limiting | IP blocked | Use 2+ second delay

Principles

This command embodies:

  • #3 Complete Execution - Full archival workflow
  • #1 Recycle → Extend - Organized for reuse

Full Standard: CODITECT-STANDARD-AUTOMATION.md