/browser-research - Agentic Multi-Page Content Extraction
Agentic browser research: starting from a URL or search query, the command autonomously discovers pages, presents an interactive checklist for selection, and extracts content from every selected page (including PDFs). It then assembles the results into a structured research document with summaries and recommendations.
Usage
# From a starting URL (discovers linked pages)
/browser-research "https://docs.coditect.ai"
# From a search query (finds relevant URLs)
/browser-research "CODITECT pricing competitors 2026"
# With options
/browser-research "https://example.com" --depth 2 # Follow links 2 levels deep
/browser-research "https://example.com" --include-pdfs # Download and extract PDFs
/browser-research "https://example.com" --max-pages 20 # Limit to 20 pages
/browser-research "https://example.com" --scope internal # Only same-domain pages
/browser-research "https://example.com" --scope sitemap # Use sitemap.xml
/browser-research "https://example.com" --output ./research # Custom output directory
/browser-research "https://example.com" --auto # Skip checklist, extract all
/browser-research "https://example.com" --summary-only # Only produce summary doc
System Prompt
EXECUTION DIRECTIVE:
When the user invokes /browser-research, execute this 6-phase agentic workflow:
Phase 1: URL Discovery
From a URL starting point:
# Step 1: Navigate to starting page
npx agent-browser open "<url>"
# Step 2: Get page snapshot + extract all links
npx agent-browser snapshot
npx agent-browser eval "JSON.stringify({
title: document.title,
links: [...document.querySelectorAll('a[href]')].map(a => ({
text: a.textContent.trim().slice(0, 80),
href: a.href,
internal: a.host === location.host,
isDoc: /\\.(pdf|doc|docx|txt|csv|xlsx)$/i.test(a.pathname),
isPdf: /\\.pdf$/i.test(a.pathname),
section: a.closest('nav,header,footer,main,aside')?.tagName?.toLowerCase() || 'body'
})).filter(l => l.href.startsWith('http'))
})"
If --depth 2 or higher: Follow discovered internal links and extract their links too.
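The depth rule can be read as a breadth-first walk that stops after `--depth` levels. The sketch below is one possible reading, not the command's actual implementation: `discover` and `getLinks` are hypothetical names, `getLinks` stands in for the open/eval steps above, and applying `--max-pages` as a cap during discovery is an assumption.

```javascript
// Breadth-first link discovery with a depth limit and a page cap.
// `getLinks` is a hypothetical async callback that returns the link objects
// produced by the Step 2 eval ({ href, internal, ... }) for one URL.
async function discover(startUrl, getLinks, { depth = 1, maxPages = 10 } = {}) {
  const seen = new Set([startUrl]);
  let frontier = [startUrl];

  // Level 1 reads links from the start page; each further level follows
  // the internal links found on the previous one.
  for (let level = 1; level <= depth && frontier.length > 0; level++) {
    const next = [];
    for (const url of frontier) {
      for (const link of await getLinks(url)) {
        if (seen.size >= maxPages) return [...seen]; // cap reached, stop early
        if (!link.internal || seen.has(link.href)) continue;
        seen.add(link.href);
        next.push(link.href);
      }
    }
    frontier = next;
  }
  return [...seen];
}
```

With `depth: 1` only the starting page's links are collected; with `depth: 2` the links on those pages are collected as well.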
If --scope sitemap:
npx agent-browser navigate "<origin>/sitemap.xml"
# Pass a single expression, matching the other eval calls
npx agent-browser eval "JSON.stringify(
[...document.querySelectorAll('loc')].map(l => l.textContent.trim())
)"
From a search query:
# Use web search to find URLs
npx agent-browser open "https://www.google.com/search?q=<encoded-query>"
npx agent-browser snapshot -i
# Extract search result URLs
npx agent-browser eval "JSON.stringify([...document.querySelectorAll('a[href]')]
.filter(a => a.closest('#search') || a.closest('#rso'))
.map(a => ({text: a.textContent.trim(), href: a.href}))
.filter(l => !l.href.includes('google.com'))
)"
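The `<encoded-query>` placeholder above implies percent-encoding the query before it goes into the search URL. A minimal sketch of that step follows, with a guess at how the command might tell a URL from a query. Both `searchUrl` and `isUrl` are hypothetical helper names, and the "parses as http(s)" dispatch rule is an assumption.

```javascript
// Build the search URL for a free-text query.
// encodeURIComponent escapes spaces, '&', '#', and other reserved characters
// so the whole query survives as a single `q` parameter.
function searchUrl(query) {
  return "https://www.google.com/search?q=" + encodeURIComponent(query.trim());
}

// Treat the argument as a URL only when it parses with an http(s) scheme;
// anything else is handled as a search query. (Assumed dispatch rule.)
function isUrl(input) {
  try {
    const { protocol } = new URL(input);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}
```

Under this rule, `"https://docs.coditect.ai"` goes to URL discovery and `"CODITECT pricing competitors 2026"` goes to search.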
Phase 2: Interactive Checklist
Present the discovered URLs to the user as an interactive checklist using AskUserQuestion:
Found 15 pages from https://docs.coditect.ai:
PAGES:
[x] / - CODITECT Documentation Home
[x] /getting-started - Getting Started Guide
[x] /api-reference - API Reference
[ ] /changelog - Changelog
[x] /pricing - Pricing
[ ] /blog/post-1 - Blog: Announcing v2.0
[ ] /blog/post-2 - Blog: Performance Update
DOCUMENTS:
[x] /docs/whitepaper.pdf - CODITECT Technical Whitepaper (PDF)
[ ] /docs/soc2-report.pdf - SOC 2 Report (PDF)
EXTERNAL:
[ ] https://github.com/coditect-ai - GitHub Repository
[ ] https://status.coditect.ai - Status Page
Select pages to extract (default: the internal pages and PDFs pre-marked [x]):
User interaction options:
- Select/deselect individual pages
- "All internal" — select all same-domain pages
- "All + PDFs" — all internal pages plus all PDFs
- "All" — everything including external
- "Custom" — user provides a list or regex pattern
If --auto is specified, skip the checklist and extract all discovered pages (respecting --scope and --max-pages).
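The selection presets can be sketched as filters over the link objects from Phase 1. This is an illustration only: `applyPreset` is a hypothetical name, and treating "All internal" as internal non-PDF pages is an assumption drawn from the wording of the presets.

```javascript
// Apply one of the checklist presets to the discovered entries.
// Entry shape follows the Phase 1 link objects: { href, internal, isPdf }.
// `pattern` is only used by the "Custom" preset.
function applyPreset(entries, preset, pattern) {
  switch (preset) {
    case "All internal": // same-domain pages, PDFs excluded (assumed)
      return entries.filter(e => e.internal && !e.isPdf);
    case "All + PDFs": // internal pages plus every PDF
      return entries.filter(e => e.internal || e.isPdf);
    case "All": // everything, including external links
      return [...entries];
    case "Custom": {
      const re = new RegExp(pattern);
      return entries.filter(e => re.test(e.href));
    }
    default:
      throw new Error(`Unknown preset: ${preset}`);
  }
}
```

For example, `applyPreset(entries, "Custom", "/blog/")` would keep only entries whose URL contains `/blog/`.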
Phase 3: Content Extraction
Extract content from each selected page sequentially:
# For each selected URL:
npx agent-browser navigate "<url>"
npx agent-browser snapshot
# Get metadata
npx agent-browser eval "JSON.stringify({
title: document.title,
description: document.querySelector('meta[name=description]')?.content,
wordCount: document.body.innerText.trim().split(/\\s+/).filter(Boolean).length,
headings: [...document.querySelectorAll('h1,h2,h3')].map(h => ({
level: parseInt(h.tagName[1]),
text: h.textContent.trim()
}))
})"
For each page, produce:
- Markdown extraction (snapshot -> markdown conversion)
- Key metadata (title, description, word count, headings)
- Link inventory (for cross-referencing)
Progress reporting:
Extracting: [3/8] /api-reference (1,234 words)...
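The sequential loop and its progress line might look like the following sketch. `extractAll`, `progressLine`, and the `extractPage` callback are hypothetical names; `extractPage` stands in for the navigate/snapshot/eval steps above, and the delay default mirrors the 1-second pause listed under Agentic Behavior.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Format one progress line in the style shown above.
function progressLine(index, total, url, wordCount) {
  const path = new URL(url).pathname;
  const words = wordCount.toLocaleString("en-US"); // 1234 -> "1,234"
  return `Extracting: [${index}/${total}] ${path} (${words} words)...`;
}

// Extract pages one at a time, logging progress and continuing past failures.
// `extractPage` is a hypothetical async callback resolving to { wordCount, ... }.
async function extractAll(urls, extractPage, { delayMs = 1000, log = console.log } = {}) {
  const results = [];
  const errors = [];
  for (const [i, url] of urls.entries()) {
    try {
      const page = await extractPage(url);
      log(progressLine(i + 1, urls.length, url, page.wordCount));
      results.push({ url, ...page });
    } catch (err) {
      // Record the failure and move on to the next page.
      errors.push({ url, message: String(err.message || err) });
    }
    if (i < urls.length - 1) await sleep(delayMs); // pause between page loads
  }
  return { results, errors };
}
```

Collecting errors separately keeps one failed page from aborting the whole run.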
Phase 4: PDF Extraction
For selected PDFs (--include-pdfs or user-selected):
# Navigate to PDF URL
npx agent-browser navigate "<pdf-url>"
# Check if browser rendered it
npx agent-browser eval "document.contentType"
# Option A: Browser PDF viewer - extract visible text
npx agent-browser eval "document.body.innerText"
# Option B: Download for local processing
# (async IIFE so that await is valid; the base64 payload is truncated to a 100-character preview)
npx agent-browser eval "(async () => {
const resp = await fetch(location.href);
const blob = await resp.blob();
const reader = new FileReader();
return new Promise(resolve => {
reader.onload = () => resolve({
type: blob.type,
size: blob.size,
base64: reader.result.split(',')[1]?.slice(0, 100) + '...'
});
reader.readAsDataURL(blob);
});
})()"
For downloaded PDFs, use the Read tool to process them locally (Claude can read PDFs natively).
PDF output: Extract text, preserve heading structure where possible, note page count and any tables found.
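Deciding whether a URL should go through this phase can combine the two signals the command mentions elsewhere: a `.pdf` extension and an `application/pdf` content type. The helper below is a sketch with a hypothetical name (`isPdf`). Checking the path rather than the full URL is a choice made here so that query strings do not hide the extension.

```javascript
// Return true when either the URL path ends in .pdf or the reported
// content type is application/pdf (parameters such as charset are ignored).
function isPdf(url, contentType = "") {
  const path = new URL(url).pathname;
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return /\.pdf$/i.test(path) || mime === "application/pdf";
}
```

The `contentType` argument would come from the `document.contentType` check above. It is optional, so the extension test still works before the page has been loaded.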
Phase 5: Assembly
Combine all extracted content into a structured research document:
---
research_source: {starting_url_or_query}
pages_extracted: {count}
pdfs_extracted: {count}
total_words: {sum}
extracted_at: {timestamp}
---
# Research: {Topic/Domain}
**Source:** {starting URL or search query}
**Pages:** {count} pages + {count} PDFs extracted
**Total content:** ~{word_count} words
**Extracted:** {date}
---
## Table of Contents
1. [Summary](#summary)
2. [Page 1: {title}](#page-1-title)
3. [Page 2: {title}](#page-2-title)
...
N. [Appendix: PDF - {title}](#appendix-pdf-title)
---
## Summary
{AI-generated 3-5 paragraph executive summary of all extracted content}
### Key Findings
- {Finding 1}
- {Finding 2}
- {Finding 3}
### Recommendations
| # | Recommendation | Source | Priority |
|---|---------------|--------|----------|
| 1 | {recommendation} | {page reference} | HIGH |
| 2 | {recommendation} | {page reference} | MEDIUM |
---
## Page 1: {Title}
**URL:** {url}
**Words:** {count}
{Full markdown extraction of page content}
---
## Page 2: {Title}
...
---
## Appendix: PDF - {Title}
**URL:** {url}
**Pages:** {count}
{Extracted PDF text content}
---
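The front matter and table of contents in the template can be generated mechanically from the extraction results. The sketch below assumes each page or PDF record carries `title` and `words` fields. Those field names, along with `anchor` and `buildHeader`, are hypothetical, and the anchor rule is simply one that reproduces the `#page-1-title` style used in the template.

```javascript
// Turn a heading into an anchor like the template's: lowercase, punctuation
// dropped, runs of spaces/hyphens collapsed to a single hyphen.
function anchor(heading) {
  return heading
    .toLowerCase()
    .replace(/[^\w\s-]/g, "")
    .trim()
    .replace(/[\s-]+/g, "-");
}

// Build the YAML front matter and the table of contents.
// `pages` and `pdfs` are arrays of { title, words } (assumed record shape).
function buildHeader({ source, pages, pdfs, extractedAt }) {
  const totalWords = [...pages, ...pdfs].reduce((sum, p) => sum + p.words, 0);
  const frontMatter = [
    "---",
    `research_source: ${source}`,
    `pages_extracted: ${pages.length}`,
    `pdfs_extracted: ${pdfs.length}`,
    `total_words: ${totalWords}`,
    `extracted_at: ${extractedAt}`,
    "---",
  ];
  const headings = [
    "Summary",
    ...pages.map((p, i) => `Page ${i + 1}: ${p.title}`),
    ...pdfs.map(p => `Appendix: PDF - ${p.title}`),
  ];
  const toc = headings.map((h, i) => `${i + 1}. [${h}](#${anchor(h)})`);
  return [...frontMatter, "", "## Table of Contents", ...toc].join("\n");
}
```

Anchor generation differs between markdown renderers, so the links may need adjusting for the viewer actually used.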
Phase 6: Suggestions
After assembly, generate actionable suggestions based on the extracted content:
## Suggestions & Next Steps
### Content Opportunities
- {Gap identified across pages — e.g., "No API rate limits documented"}
- {Inconsistency — e.g., "Pricing page says X but FAQ says Y"}
### Cross-Page Issues
- {Broken internal links between pages}
- {Duplicate content across pages}
- {Navigation items that lead to 404}
### Competitive Intelligence (if applicable)
- {Positioning observations}
- {Feature gaps vs competitors}
- {Pricing model comparison}
### Follow-Up Research
- {Additional URLs worth investigating}
- {Topics that need deeper extraction}
- {External sources to cross-reference}
Options
| Option | Description |
|---|---|
| <url-or-query> | Starting URL or search query (required) |
| --depth <N> | Link-following depth (default: 1) |
| --max-pages <N> | Maximum pages to extract (default: 10) |
| --scope <mode> | internal (default), sitemap, all |
| --include-pdfs | Download and extract linked PDFs |
| --auto | Skip interactive checklist, extract all |
| --summary-only | Only produce summary, not full extractions |
| --output <dir> | Custom output directory |
| --format <type> | Output format: markdown (default), json, both |
| --analyze | Include per-page analysis (from /browser-extract --analyze) |
| --no-close | Keep browser open after research |
| --help | Show this help |
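As an illustration of how the options and their defaults fit together, here is a small argument parser. It is a sketch, not the command's real parser: `parseArgs` and the camel-cased keys are hypothetical, defaults are copied from the table, and flags with no stated default are assumed to be off or unset.

```javascript
// Defaults come from the Options table; flags with no stated default are
// assumed off or unset.
const DEFAULTS = {
  target: null, depth: 1, maxPages: 10, scope: "internal", format: "markdown",
  output: null, includePdfs: false, auto: false, summaryOnly: false,
  analyze: false, noClose: false, help: false,
};

// Flags that consume the next argument as their value.
const VALUE_FLAGS = new Map([
  ["--depth", "depth"], ["--max-pages", "maxPages"], ["--scope", "scope"],
  ["--output", "output"], ["--format", "format"],
]);

// Flags that are simple on/off switches.
const BOOL_FLAGS = new Map([
  ["--include-pdfs", "includePdfs"], ["--auto", "auto"],
  ["--summary-only", "summaryOnly"], ["--analyze", "analyze"],
  ["--no-close", "noClose"], ["--help", "help"],
]);

const CHOICES = {
  scope: ["internal", "sitemap", "all"],
  format: ["markdown", "json", "both"],
};

function parseArgs(argv) {
  const opts = { ...DEFAULTS };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (BOOL_FLAGS.has(arg)) {
      opts[BOOL_FLAGS.get(arg)] = true;
    } else if (VALUE_FLAGS.has(arg)) {
      const key = VALUE_FLAGS.get(arg);
      const value = argv[++i];
      if (value === undefined) throw new Error(`${arg} requires a value`);
      if (key === "depth" || key === "maxPages") {
        if (!/^[1-9]\d*$/.test(value)) throw new Error(`${arg} must be a positive integer`);
        opts[key] = Number(value);
      } else if (CHOICES[key] && !CHOICES[key].includes(value)) {
        throw new Error(`${arg} must be one of: ${CHOICES[key].join(", ")}`);
      } else {
        opts[key] = value;
      }
    } else if (arg.startsWith("--")) {
      throw new Error(`Unknown option: ${arg}`);
    } else if (opts.target === null) {
      opts.target = arg; // first positional argument: URL or search query
    } else {
      throw new Error(`Unexpected extra argument: ${arg}`);
    }
  }
  if (!opts.help && opts.target === null) {
    throw new Error("A starting URL or search query is required");
  }
  return opts;
}
```

Rejecting unknown flags and out-of-range values up front means a typo such as `--scope intenal` fails before any browser session is opened.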
Output Structure
analyze-new-artifacts/coditect-browser-analysis/
{domain}-research/
README.md # Master research document with summary
pages/
01-{slug}.md # Individual page extractions
02-{slug}.md
...
pdfs/
{filename}.md # Extracted PDF content
data/
url-inventory.json # All discovered URLs
link-graph.json # Cross-page link relationships
extraction-manifest.json # What was extracted, when, stats
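The directory and file names in this layout can be derived from the URLs. The sketch below reproduces the `docs-coditect-ai-research` name shown under Success Output. The slug rule for page files, including the use of `index` for the root path, is an assumption, and both function names are hypothetical. It does not cover runs started from a search query, where there is no single domain.

```javascript
// "https://docs.coditect.ai" -> "docs-coditect-ai-research"
function researchDirName(startUrl) {
  return new URL(startUrl).hostname.replace(/\./g, "-") + "-research";
}

// Build "NN-{slug}.md" from a 1-based position and a page URL.
// Slug rule (assumed): lowercase path, non-alphanumeric runs become hyphens,
// leading/trailing hyphens are trimmed, and the root path maps to "index".
function pageFileName(position, url) {
  const slug =
    new URL(url).pathname
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "") || "index";
  return `${String(position).padStart(2, "0")}-${slug}.md`;
}
```

Zero-padding the position keeps the files in extraction order when listed alphabetically, at least up to 99 pages.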
Examples
Research a Documentation Site
/browser-research "https://docs.coditect.ai" --depth 2 --include-pdfs
- Opens docs.coditect.ai, discovers all linked pages
- Follows links 2 levels deep (docs -> sub-pages -> sub-sub-pages)
- Presents checklist: 23 pages found, 2 PDFs
- User selects 15 pages + both PDFs
- Extracts all, assembles research document
- Generates summary + recommendations
Competitive Research from Search
/browser-research "AI code generation platforms pricing 2026" --max-pages 5
- Searches Google for the query
- Presents top 10 results as checklist
- User selects 5 competitor pages
- Extracts pricing and feature information
- Assembles comparison document with suggestions
Quick Site Audit
/browser-research "https://mysite.com" --scope sitemap --analyze --auto
- Fetches sitemap.xml
- Auto-extracts all pages (no checklist)
- Runs analysis on each page
- Produces comprehensive site audit
Extract a Specific Set of Pages
/browser-research "https://example.com" --scope internal --max-pages 50 --summary-only
- Discovers all internal pages up to 50
- Extracts content from all
- Produces summary document only (not individual page files)
Agentic Behavior
This command is agentic — it makes autonomous decisions during execution:
| Decision | How It Decides |
|---|---|
| Which links to follow | Internal links in <main> or <nav>, not ads/tracking |
| PDF detection | File extension .pdf or content-type application/pdf |
| Duplicate detection | URL normalization (strip trailing slash, fragments, tracking params) |
| Rate limiting | 1-second delay between page loads to be respectful |
| Error recovery | Skip pages that fail to load after 10s timeout, log error, continue |
| Content relevance | Skip utility pages (login, 404, search results) unless explicitly selected |
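The duplicate-detection rule in the table (strip trailing slash, fragments, tracking params) could be implemented roughly as follows. The list of tracking parameters is an assumption, since the command does not enumerate them, and `normalizeUrl` and `dedupe` are hypothetical names.

```javascript
// Query parameters treated as tracking noise (assumed list).
const TRACKING_PARAMS = /^(utm_\w+|fbclid|gclid|mc_cid|mc_eid)$/i;

// Normalize a URL so that trivially different forms compare equal.
function normalizeUrl(raw) {
  const url = new URL(raw);
  url.hash = ""; // drop the fragment
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  // Strip trailing slashes, but leave the root path "/" alone.
  if (url.pathname.length > 1) url.pathname = url.pathname.replace(/\/+$/, "");
  return url.toString();
}

// Collapse a list of URLs to its distinct normalized forms, keeping order.
function dedupe(urls) {
  return [...new Set(urls.map(normalizeUrl))];
}
```

Non-tracking query parameters are kept, so `?page=2` and `?page=3` remain distinct pages.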
Success Output
/browser-research: https://docs.coditect.ai
Discovered: 23 pages, 2 PDFs
Selected: 15 pages, 2 PDFs
Extracted: 15/15 pages, 2/2 PDFs (28,450 words total)
Output: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-research/
Summary: README.md (executive summary + 5 recommendations)
Related
- Command: /browser-extract — Single-page extraction
- Command: /browser — Direct browser control
- Skill: browser-content-extraction — Extraction patterns
- Skill: browser-automation-patterns — Browser workflow patterns
- Agent: coditect-browser-agent — Browser automation agent
Principles
This command embodies:
- #3 Complete Execution — Full research pipeline: discover -> select -> extract -> assemble -> summarize
- #4 Separation of Concerns — Discovery, extraction, and assembly are distinct phases
- #6 Clear, Understandable — Interactive checklist gives user control
- #8 No Assumptions — User confirms which pages to extract via checklist
- #9 Based on Facts — Real page content, cross-referenced and summarized
Command Version: 1.0.0
Created: 2026-02-08
Author: CODITECT Core Team