/browser-research - Agentic Multi-Page Content Extraction

Agentic browser research that autonomously discovers pages from a starting URL or search query, presents an interactive checklist for selection, extracts content from all selected pages (including PDFs), and assembles everything into a structured research document with summaries and recommendations.

Usage

# From a starting URL (discovers linked pages)
/browser-research "https://docs.coditect.ai"

# From a search query (finds relevant URLs)
/browser-research "CODITECT pricing competitors 2026"

# With options
/browser-research "https://example.com" --depth 2 # Follow links 2 levels deep
/browser-research "https://example.com" --include-pdfs # Download and extract PDFs
/browser-research "https://example.com" --max-pages 20 # Limit to 20 pages
/browser-research "https://example.com" --scope internal # Only same-domain pages
/browser-research "https://example.com" --scope sitemap # Use sitemap.xml
/browser-research "https://example.com" --output ./research # Custom output directory
/browser-research "https://example.com" --auto # Skip checklist, extract all
/browser-research "https://example.com" --summary-only # Only produce summary doc

System Prompt

EXECUTION DIRECTIVE: When the user invokes /browser-research, execute this 6-phase agentic workflow:

Phase 1: URL Discovery

From a URL starting point:

# Step 1: Navigate to starting page
npx agent-browser open "<url>"

# Step 2: Get page snapshot + extract all links
npx agent-browser snapshot
npx agent-browser eval "JSON.stringify({
  title: document.title,
  links: [...document.querySelectorAll('a[href]')].map(a => ({
    text: a.textContent.trim().slice(0, 80),
    href: a.href,
    internal: a.host === location.host,
    isDoc: /\\.(pdf|doc|docx|txt|csv|xlsx)$/i.test(a.href),
    isPdf: /\\.pdf$/i.test(a.href),
    section: a.closest('nav,header,footer,main,aside')?.tagName?.toLowerCase() || 'body'
  })).filter(l => l.href.startsWith('http'))
})"

If --depth 2 or higher: Follow discovered internal links and extract their links too.
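
A minimal sketch of the bookkeeping this implies, assuming the agent keeps a queue and visited set between calls (startUrl, MAX_DEPTH, MAX_PAGES come from the command arguments; extractLinks is a hypothetical helper, not an agent-browser command):

// Depth-limited, deduplicated breadth-first crawl (illustrative sketch).
// extractLinks(url) stands in for the open + eval sequence from Step 2.
const visited = new Set();
const queue = [{ url: startUrl, depth: 0 }];
const discovered = [];

while (queue.length > 0 && discovered.length < MAX_PAGES) {
  const { url, depth } = queue.shift();
  if (visited.has(url)) continue;
  visited.add(url);
  discovered.push(url);
  if (depth >= MAX_DEPTH) continue;        // honor --depth
  const links = await extractLinks(url);   // navigate + link extraction
  for (const link of links) {
    if (link.internal && !visited.has(link.href)) {
      queue.push({ url: link.href, depth: depth + 1 });
    }
  }
}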

If --scope sitemap:

npx agent-browser navigate "<origin>/sitemap.xml"
npx agent-browser eval "JSON.stringify(
  [...document.querySelectorAll('loc')].map(l => l.textContent)
)"

From a search query:

# Use web search to find URLs
npx agent-browser open "https://www.google.com/search?q=<encoded-query>"
npx agent-browser snapshot -i
# Extract search result URLs
npx agent-browser eval "JSON.stringify([...document.querySelectorAll('a[href]')]
  .filter(a => a.closest('#search') || a.closest('#rso'))
  .map(a => ({text: a.textContent.trim(), href: a.href}))
  .filter(l => !l.href.includes('google.com'))
)"

Phase 2: Interactive Checklist

Present the discovered URLs to the user as an interactive checklist using AskUserQuestion:

Found 15 pages from https://docs.coditect.ai:

PAGES:
[x] / - CODITECT Documentation Home
[x] /getting-started - Getting Started Guide
[x] /api-reference - API Reference
[ ] /changelog - Changelog
[x] /pricing - Pricing
[ ] /blog/post-1 - Blog: Announcing v2.0
[ ] /blog/post-2 - Blog: Performance Update

DOCUMENTS:
[x] /docs/whitepaper.pdf - CODITECT Technical Whitepaper (PDF)
[ ] /docs/soc2-report.pdf - SOC 2 Report (PDF)

EXTERNAL:
[ ] https://github.com/coditect-ai - GitHub Repository
[ ] https://status.coditect.ai - Status Page

Select pages to extract (default: all internal pages + PDFs marked with [x]):

User interaction options:

  • Select/deselect individual pages
  • "All internal" — select all same-domain pages
  • "All + PDFs" — all internal pages plus all PDFs
  • "All" — everything including external
  • "Custom" — user provides a list or regex pattern

If --auto is specified, skip the checklist and extract all discovered pages (respecting --scope and --max-pages).
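
When the user picks "Custom" with a regex, selection reduces to a filter over the Phase 1 inventory; a minimal sketch (discovered is the link list collected during discovery, userPattern is the user's input):

// Apply a user-supplied pattern against path and full URL (illustrative).
const pattern = new RegExp(userPattern);   // e.g. "^/docs/|\\.pdf$"
const selected = discovered.filter(l =>
  pattern.test(new URL(l.href).pathname) || pattern.test(l.href)
);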

Phase 3: Content Extraction

Extract content from each selected page sequentially:

# For each selected URL:
npx agent-browser navigate "<url>"
npx agent-browser snapshot

# Get metadata
npx agent-browser eval "JSON.stringify({
  title: document.title,
  description: document.querySelector('meta[name=description]')?.content,
  wordCount: document.body.innerText.split(/\\s+/).length,
  headings: [...document.querySelectorAll('h1,h2,h3')].map(h => ({
    level: parseInt(h.tagName[1]),
    text: h.textContent.trim()
  }))
})"

For each page, produce:

  1. Markdown extraction (snapshot -> markdown conversion; a rough sketch follows this list)
  2. Key metadata (title, description, word count, headings)
  3. Link inventory (for cross-referencing)

Progress reporting:

Extracting: [3/8] /api-reference (1,234 words)...

Phase 4: PDF Extraction

For selected PDFs (--include-pdfs or user-selected):

# Navigate to PDF URL
npx agent-browser navigate "<pdf-url>"

# Check if browser rendered it
npx agent-browser eval "document.contentType"

# Option A: Browser PDF viewer - extract visible text
# (may return little or no text when the PDF renders in a plugin-based viewer)
npx agent-browser eval "document.body.innerText"

# Option B: Download for local processing
npx agent-browser eval "
  const resp = await fetch(location.href);
  const blob = await resp.blob();
  const reader = new FileReader();
  return new Promise(resolve => {
    reader.onload = () => resolve({
      type: blob.type,
      size: blob.size,
      // Preview only; drop the slice to return the full base64 payload
      base64: reader.result.split(',')[1]?.slice(0, 100) + '...'
    });
    reader.readAsDataURL(blob);
  });
"

For downloaded PDFs, use the Read tool to process them locally (Claude can read PDFs natively).

PDF output: Extract text, preserve heading structure where possible, note page count and any tables found.

Phase 5: Assembly

Combine all extracted content into a structured research document:

---
research_source: {starting_url_or_query}
pages_extracted: {count}
pdfs_extracted: {count}
total_words: {sum}
extracted_at: {timestamp}
---

# Research: {Topic/Domain}

**Source:** {starting URL or search query}
**Pages:** {count} pages + {count} PDFs extracted
**Total content:** ~{word_count} words
**Extracted:** {date}

---

## Table of Contents

1. [Summary](#summary)
2. [Page 1: {title}](#page-1-title)
3. [Page 2: {title}](#page-2-title)
...
N. [Appendix: PDF - {title}](#appendix-pdf-title)

---

## Summary

{AI-generated 3-5 paragraph executive summary of all extracted content}

### Key Findings

- {Finding 1}
- {Finding 2}
- {Finding 3}

### Recommendations

| # | Recommendation | Source | Priority |
|---|---------------|--------|----------|
| 1 | {recommendation} | {page reference} | HIGH |
| 2 | {recommendation} | {page reference} | MEDIUM |

---

## Page 1: {Title}

**URL:** {url}
**Words:** {count}

{Full markdown extraction of page content}

---

## Page 2: {Title}

...

---

## Appendix: PDF - {Title}

**URL:** {url}
**Pages:** {count}

{Extracted PDF text content}

---

Phase 6: Suggestions

After assembly, generate actionable suggestions based on the extracted content:

## Suggestions & Next Steps

### Content Opportunities
- {Gap identified across pages — e.g., "No API rate limits documented"}
- {Inconsistency — e.g., "Pricing page says X but FAQ says Y"}

### Cross-Page Issues
- {Broken internal links between pages}
- {Duplicate content across pages}
- {Navigation items that lead to 404}

### Competitive Intelligence (if applicable)
- {Positioning observations}
- {Feature gaps vs competitors}
- {Pricing model comparison}

### Follow-Up Research
- {Additional URLs worth investigating}
- {Topics that need deeper extraction}
- {External sources to cross-reference}

Options

| Option | Description |
|--------|-------------|
| <url-or-query> | Starting URL or search query (required) |
| --depth <N> | Link-following depth (default: 1) |
| --max-pages <N> | Maximum pages to extract (default: 10) |
| --scope <mode> | internal (default), sitemap, all |
| --include-pdfs | Download and extract linked PDFs |
| --auto | Skip interactive checklist, extract all |
| --summary-only | Only produce summary, not full extractions |
| --output <dir> | Custom output directory |
| --format <type> | Output format: markdown (default), json, both |
| --analyze | Include per-page analysis (from /browser-extract --analyze) |
| --no-close | Keep browser open after research |
| --help | Show this help |

Output Structure

analyze-new-artifacts/coditect-browser-analysis/
  {domain}-research/
    README.md                    # Master research document with summary
    pages/
      01-{slug}.md               # Individual page extractions
      02-{slug}.md
      ...
    pdfs/
      {filename}.md              # Extracted PDF content
    data/
      url-inventory.json         # All discovered URLs
      link-graph.json            # Cross-page link relationships
      extraction-manifest.json   # What was extracted, when, stats
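
An illustrative shape for extraction-manifest.json (field names here are assumptions, not a fixed schema):

// Illustrative shape only; field names are assumptions, not a fixed schema.
{
  "source": "https://docs.coditect.ai",
  "extracted_at": "2026-02-08T14:02:11Z",
  "options": { "depth": 2, "max_pages": 20, "scope": "internal", "include_pdfs": true },
  "discovered": 23,
  "selected": 17,
  "pages": [
    { "url": "https://docs.coditect.ai/getting-started",
      "file": "pages/02-getting-started.md",
      "words": 1234,
      "status": "ok" }
  ],
  "pdfs": [
    { "url": "https://docs.coditect.ai/docs/whitepaper.pdf",
      "file": "pdfs/whitepaper.md",
      "pages": 12,
      "status": "ok" }
  ],
  "errors": []
}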

Examples

Research a Documentation Site

/browser-research "https://docs.coditect.ai" --depth 2 --include-pdfs
  1. Opens docs.coditect.ai, discovers all linked pages
  2. Follows links 2 levels deep (docs -> sub-pages -> sub-sub-pages)
  3. Presents checklist: 23 pages found, 2 PDFs
  4. User selects 15 pages + both PDFs
  5. Extracts all, assembles research document
  6. Generates summary + recommendations
/browser-research "AI code generation platforms pricing 2026" --max-pages 5
  1. Searches Google for the query
  2. Presents top 10 results as checklist
  3. User selects 5 competitor pages
  4. Extracts pricing and feature information
  5. Assembles comparison document with suggestions

Quick Site Audit

/browser-research "https://mysite.com" --scope sitemap --analyze --auto
  1. Fetches sitemap.xml
  2. Auto-extracts all pages (no checklist)
  3. Runs analysis on each page
  4. Produces comprehensive site audit

Extract a Specific Set of Pages

/browser-research "https://example.com" --scope internal --max-pages 50 --summary-only
  1. Discovers all internal pages up to 50
  2. Extracts content from all
  3. Produces summary document only (not individual page files)

Agentic Behavior

This command is agentic — it makes autonomous decisions during execution:

| Decision | How It Decides |
|----------|----------------|
| Which links to follow | Internal links in <main> or <nav>, not ads/tracking |
| PDF detection | File extension .pdf or content-type application/pdf |
| Duplicate detection | URL normalization (strip trailing slash, fragments, tracking params) |
| Rate limiting | 1-second delay between page loads to be respectful |
| Error recovery | Skip pages that fail to load after 10s timeout, log error, continue |
| Content relevance | Skip utility pages (login, 404, search results) unless explicitly selected |
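
The duplicate-detection row reduces to a normalization pass before URLs enter the visited set; a minimal sketch (the tracking-parameter list is an assumption):

// Normalize a URL before dedup: strip fragment, tracking params, trailing slash.
// The TRACKING list is illustrative, not exhaustive.
const TRACKING = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'gclid', 'fbclid']);

function normalizeUrl(raw) {
  const u = new URL(raw);
  u.hash = '';
  for (const key of [...u.searchParams.keys()]) {
    if (TRACKING.has(key)) u.searchParams.delete(key);
  }
  u.pathname = u.pathname.replace(/\/+$/, '') || '/';
  return u.toString();
}

// normalizeUrl('https://a.com/docs/?utm_source=x#intro') -> 'https://a.com/docs'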

Success Output

/browser-research: https://docs.coditect.ai
Discovered: 23 pages, 2 PDFs
Selected: 15 pages, 2 PDFs
Extracted: 15/15 pages, 2/2 PDFs (28,450 words total)
Output: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-research/
Summary: README.md (executive summary + 5 recommendations)

Principles

This command embodies:

  • #3 Complete Execution — Full research pipeline: discover -> select -> extract -> assemble -> summarize
  • #4 Separation of Concerns — Discovery, extraction, and assembly are distinct phases
  • #6 Clear, Understandable — Interactive checklist gives user control
  • #8 No Assumptions — User confirms which pages to extract via checklist
  • #9 Based on Facts — Real page content, cross-referenced and summarized

Command Version: 1.0.0 · Created: 2026-02-08 · Author: CODITECT Core Team