/browser-research - Agentic Multi-Page Content Extraction

Agentic browser research that autonomously discovers pages from a starting URL or search query, presents an interactive checklist for selection, extracts content from all selected pages (including PDFs), and assembles everything into a structured research document with summaries and recommendations.

Usage

# From a starting URL (discovers linked pages)
/browser-research "https://docs.coditect.ai"

# From a search query (finds relevant URLs)
/browser-research "CODITECT pricing competitors 2026"

# With options
/browser-research "https://example.com" --depth 2 # Follow links 2 levels deep
/browser-research "https://example.com" --include-pdfs # Download and extract PDFs
/browser-research "https://example.com" --max-pages 20 # Limit to 20 pages
/browser-research "https://example.com" --scope internal # Only same-domain pages
/browser-research "https://example.com" --scope sitemap # Use sitemap.xml
/browser-research "https://example.com" --output ./research # Custom output directory
/browser-research "https://example.com" --auto # Skip checklist, extract all
/browser-research "https://example.com" --summary-only # Only produce summary doc

System Prompt

EXECUTION DIRECTIVE: When the user invokes /browser-research, execute this 6-phase agentic workflow:

Phase 1: URL Discovery

From a URL starting point:

# Step 1: Navigate to starting page
npx agent-browser open "<url>"

# Step 2: Get page snapshot + extract all links
npx agent-browser snapshot
npx agent-browser eval "JSON.stringify({
  title: document.title,
  links: [...document.querySelectorAll('a[href]')].map(a => ({
    text: a.textContent.trim().slice(0, 80),
    href: a.href,
    internal: a.host === location.host,
    isDoc: /\\.(pdf|doc|docx|txt|csv|xlsx)$/i.test(a.href),
    isPdf: /\\.pdf$/i.test(a.href),
    section: a.closest('nav,header,footer,main,aside')?.tagName?.toLowerCase() || 'body'
  })).filter(l => l.href.startsWith('http'))
})"

If --depth 2 or higher: Follow discovered internal links and extract their links too.
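
A minimal sketch of the bookkeeping this implies, assuming the agent keeps a queue and visited set between calls (startUrl, MAX_DEPTH, MAX_PAGES come from the command arguments; extractLinks is a hypothetical helper, not an agent-browser command):

// Depth-limited, deduplicated breadth-first crawl (illustrative sketch).
// extractLinks(url) stands in for the open + eval sequence from Step 2.
const visited = new Set();
const queue = [{ url: startUrl, depth: 0 }];
const discovered = [];

while (queue.length > 0 && discovered.length < MAX_PAGES) {
  const { url, depth } = queue.shift();
  if (visited.has(url)) continue;
  visited.add(url);
  discovered.push(url);
  if (depth >= MAX_DEPTH) continue;        // honor --depth
  const links = await extractLinks(url);   // navigate + link extraction
  for (const link of links) {
    if (link.internal && !visited.has(link.href)) {
      queue.push({ url: link.href, depth: depth + 1 });
    }
  }
}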

If --scope sitemap:

npx agent-browser navigate "<origin>/sitemap.xml"
npx agent-browser eval "JSON.stringify(
  [...document.querySelectorAll('loc')].map(l => l.textContent)
)"

From a search query:

# Use web search to find URLs
npx agent-browser open "https://www.google.com/search?q=<encoded-query>"
npx agent-browser snapshot -i
# Extract search result URLs
npx agent-browser eval "JSON.stringify([...document.querySelectorAll('a[href]')]
  .filter(a => a.closest('#search') || a.closest('#rso'))
  .map(a => ({text: a.textContent.trim(), href: a.href}))
  .filter(l => !l.href.includes('google.com'))
)"

Phase 2: Interactive Checklist

Present the discovered URLs to the user as an interactive checklist using AskUserQuestion:

Found 15 pages from https://docs.coditect.ai:

PAGES:
[x] / - CODITECT Documentation Home
[x] /getting-started - Getting Started Guide
[x] /api-reference - API Reference
[ ] /changelog - Changelog
[x] /pricing - Pricing
[ ] /blog/post-1 - Blog: Announcing v2.0
[ ] /blog/post-2 - Blog: Performance Update

DOCUMENTS:
[x] /docs/whitepaper.pdf - CODITECT Technical Whitepaper (PDF)
[ ] /docs/soc2-report.pdf - SOC 2 Report (PDF)

EXTERNAL:
[ ] https://github.com/coditect-ai - GitHub Repository
[ ] https://status.coditect.ai - Status Page

Select pages to extract (default: all internal pages + PDFs marked with [x]):

User interaction options:

  • Select/deselect individual pages
  • "All internal" — select all same-domain pages
  • "All + PDFs" — all internal pages plus all PDFs
  • "All" — everything including external
  • "Custom" — user provides a list or regex pattern

If --auto is specified, skip the checklist and extract all discovered pages (respecting --scope and --max-pages).
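
When the user picks "Custom" with a regex, selection reduces to a filter over the Phase 1 inventory; a minimal sketch (discovered is the link list collected during discovery, userPattern is the user's input):

// Apply a user-supplied pattern against path and full URL (illustrative).
const pattern = new RegExp(userPattern);   // e.g. "^/docs/|\\.pdf$"
const selected = discovered.filter(l =>
  pattern.test(new URL(l.href).pathname) || pattern.test(l.href)
);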

Phase 3: Content Extraction

Extract content from each selected page sequentially:

# For each selected URL:
npx agent-browser navigate "<url>"
npx agent-browser snapshot

# Get metadata
npx agent-browser eval "JSON.stringify({
  title: document.title,
  description: document.querySelector('meta[name=description]')?.content,
  wordCount: document.body.innerText.split(/\\s+/).length,
  headings: [...document.querySelectorAll('h1,h2,h3')].map(h => ({
    level: parseInt(h.tagName[1]),
    text: h.textContent.trim()
  }))
})"

For each page, produce:

  1. Markdown extraction (snapshot -> markdown conversion; a rough sketch follows this list)
  2. Key metadata (title, description, word count, headings)
  3. Link inventory (for cross-referencing)

Progress reporting:

Extracting: [3/8] /api-reference (1,234 words)...

Phase 4: PDF Extraction

For selected PDFs (--include-pdfs or user-selected):

# Navigate to PDF URL
npx agent-browser navigate "<pdf-url>"

# Check if browser rendered it
npx agent-browser eval "document.contentType"

# Option A: Browser PDF viewer - extract visible text
# (may return little or no text when the PDF renders in a plugin-based viewer)
npx agent-browser eval "document.body.innerText"

# Option B: Download for local processing
npx agent-browser eval "
  const resp = await fetch(location.href);
  const blob = await resp.blob();
  const reader = new FileReader();
  return new Promise(resolve => {
    reader.onload = () => resolve({
      type: blob.type,
      size: blob.size,
      // Preview only; drop the slice to return the full base64 payload
      base64: reader.result.split(',')[1]?.slice(0, 100) + '...'
    });
    reader.readAsDataURL(blob);
  });
"

For downloaded PDFs, use the Read tool to process them locally (Claude can read PDFs natively).

PDF output: Extract text, preserve heading structure where possible, note page count and any tables found.

Phase 5: Assembly

Combine all extracted content into a structured research document:

---
research_source: {starting_url_or_query}
pages_extracted: {count}
pdfs_extracted: {count}
total_words: {sum}
extracted_at: {timestamp}
---

# Research: {Topic/Domain}

**Source:** {starting URL or search query}
**Pages:** {count} pages + {count} PDFs extracted
**Total content:** ~{word_count} words
**Extracted:** {date}

---

## Table of Contents

1. [Summary](#summary)
2. [Page 1: {title}](#page-1-title)
3. [Page 2: {title}](#page-2-title)
...
N. [Appendix: PDF - {title}](#appendix-pdf-title)

---

## Summary

{AI-generated 3-5 paragraph executive summary of all extracted content}

### Key Findings

- {Finding 1}
- {Finding 2}
- {Finding 3}

### Recommendations

| # | Recommendation | Source | Priority |
|---|---------------|--------|----------|
| 1 | {recommendation} | {page reference} | HIGH |
| 2 | {recommendation} | {page reference} | MEDIUM |

---

## Page 1: {Title}

**URL:** {url}
**Words:** {count}

{Full markdown extraction of page content}

---

## Page 2: {Title}

...

---

## Appendix: PDF - {Title}

**URL:** {url}
**Pages:** {count}

{Extracted PDF text content}

---

Phase 6: Suggestions

After assembly, generate actionable suggestions based on the extracted content:

## Suggestions & Next Steps

### Content Opportunities
- {Gap identified across pages — e.g., "No API rate limits documented"}
- {Inconsistency — e.g., "Pricing page says X but FAQ says Y"}

### Cross-Page Issues
- {Broken internal links between pages}
- {Duplicate content across pages}
- {Navigation items that lead to 404}

### Competitive Intelligence (if applicable)
- {Positioning observations}
- {Feature gaps vs competitors}
- {Pricing model comparison}

### Follow-Up Research
- {Additional URLs worth investigating}
- {Topics that need deeper extraction}
- {External sources to cross-reference}

Options

| Option | Description |
|--------|-------------|
| <url-or-query> | Starting URL or search query (required) |
| --depth <N> | Link-following depth (default: 1) |
| --max-pages <N> | Maximum pages to extract (default: 10) |
| --scope <mode> | internal (default), sitemap, all |
| --include-pdfs | Download and extract linked PDFs |
| --auto | Skip interactive checklist, extract all |
| --summary-only | Only produce summary, not full extractions |
| --output <dir> | Custom output directory |
| --format <type> | Output format: markdown (default), json, both |
| --analyze | Include per-page analysis (from /browser-extract --analyze) |
| --no-close | Keep browser open after research |
| --help | Show this help |

Output Structure

analyze-new-artifacts/coditect-browser-analysis/
  {domain}-research/
    README.md                    # Master research document with summary
    pages/
      01-{slug}.md               # Individual page extractions
      02-{slug}.md
      ...
    pdfs/
      {filename}.md              # Extracted PDF content
    data/
      url-inventory.json         # All discovered URLs
      link-graph.json            # Cross-page link relationships
      extraction-manifest.json   # What was extracted, when, stats
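
An illustrative shape for extraction-manifest.json (field names here are assumptions, not a fixed schema):

// Illustrative shape only; field names are assumptions, not a fixed schema.
{
  "source": "https://docs.coditect.ai",
  "extracted_at": "2026-02-08T14:02:11Z",
  "options": { "depth": 2, "max_pages": 20, "scope": "internal", "include_pdfs": true },
  "discovered": 23,
  "selected": 17,
  "pages": [
    { "url": "https://docs.coditect.ai/getting-started",
      "file": "pages/02-getting-started.md",
      "words": 1234,
      "status": "ok" }
  ],
  "pdfs": [
    { "url": "https://docs.coditect.ai/docs/whitepaper.pdf",
      "file": "pdfs/whitepaper.md",
      "pages": 12,
      "status": "ok" }
  ],
  "errors": []
}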

Examples

Research a Documentation Site

/browser-research "https://docs.coditect.ai" --depth 2 --include-pdfs
  1. Opens docs.coditect.ai, discovers all linked pages
  2. Follows links 2 levels deep (docs -> sub-pages -> sub-sub-pages)
  3. Presents checklist: 23 pages found, 2 PDFs
  4. User selects 15 pages + both PDFs
  5. Extracts all, assembles research document
  6. Generates summary + recommendations
/browser-research "AI code generation platforms pricing 2026" --max-pages 5
  1. Searches Google for the query
  2. Presents top 10 results as checklist
  3. User selects 5 competitor pages
  4. Extracts pricing and feature information
  5. Assembles comparison document with suggestions

Quick Site Audit

/browser-research "https://mysite.com" --scope sitemap --analyze --auto
  1. Fetches sitemap.xml
  2. Auto-extracts all pages (no checklist)
  3. Runs analysis on each page
  4. Produces comprehensive site audit

Extract a Specific Set of Pages

/browser-research "https://example.com" --scope internal --max-pages 50 --summary-only
  1. Discovers all internal pages up to 50
  2. Extracts content from all
  3. Produces summary document only (not individual page files)

Agentic Behavior

This command is agentic — it makes autonomous decisions during execution:

| Decision | How It Decides |
|----------|----------------|
| Which links to follow | Internal links in <main> or <nav>, not ads/tracking |
| PDF detection | File extension .pdf or content-type application/pdf |
| Duplicate detection | URL normalization (strip trailing slash, fragments, tracking params) |
| Rate limiting | 1-second delay between page loads to be respectful |
| Error recovery | Skip pages that fail to load after 10s timeout, log error, continue |
| Content relevance | Skip utility pages (login, 404, search results) unless explicitly selected |
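
The duplicate-detection row reduces to a normalization pass before URLs enter the visited set; a minimal sketch (the tracking-parameter list is an assumption):

// Normalize a URL before dedup: strip fragment, tracking params, trailing slash.
// The TRACKING list is illustrative, not exhaustive.
const TRACKING = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'gclid', 'fbclid']);

function normalizeUrl(raw) {
  const u = new URL(raw);
  u.hash = '';
  for (const key of [...u.searchParams.keys()]) {
    if (TRACKING.has(key)) u.searchParams.delete(key);
  }
  u.pathname = u.pathname.replace(/\/+$/, '') || '/';
  return u.toString();
}

// normalizeUrl('https://a.com/docs/?utm_source=x#intro') -> 'https://a.com/docs'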

Success Output

/browser-research: https://docs.coditect.ai
Discovered: 23 pages, 2 PDFs
Selected: 15 pages, 2 PDFs
Extracted: 15/15 pages, 2/2 PDFs (28,450 words total)
Output: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-research/
Summary: README.md (executive summary + 5 recommendations)

Principles

This command embodies:

  • #3 Complete Execution — Full research pipeline: discover -> select -> extract -> assemble -> summarize
  • #4 Separation of Concerns — Discovery, extraction, and assembly are distinct phases
  • #6 Clear, Understandable — Interactive checklist gives user control
  • #8 No Assumptions — User confirms which pages to extract via checklist
  • #9 Based on Facts — Real page content, cross-referenced and summarized

Command Version: 1.0.0 · Created: 2026-02-08 · Author: CODITECT Core Team