# /browser-extract - Browser Content Extraction

Extract web page content directly into structured markdown or nested JSON. No screenshot required. Optionally analyze content quality, accessibility, and SEO, and compare pages side by side.
## Usage

```bash
# Basic extraction (default: markdown)
/browser-extract "https://example.com"

# Output format
/browser-extract "https://example.com" --json       # Nested JSON
/browser-extract "https://example.com" --both       # Markdown + JSON
/browser-extract "https://example.com" --csv        # Tables as CSV

# Focused extraction
/browser-extract "https://..." --sections           # Headings + content by section
/browser-extract "https://..." --links              # Link graph (internal/external/broken)
/browser-extract "https://..." --forms              # Form structure, fields, validation
/browser-extract "https://..." --tables             # Tables as markdown/CSV
/browser-extract "https://..." --meta               # SEO: title, og:*, schema.org, robots

# Analysis modes
/browser-extract "https://..." --analyze            # Full content + UX analysis + suggestions
/browser-extract "https://..." --a11y               # Accessibility audit from snapshot
/browser-extract "https://..." --compare "https://..."  # Side-by-side page diff
/browser-extract "https://..." --monitor            # Save baseline, detect changes on re-run

# Output control
/browser-extract "https://..." --output ./report.md # Custom output path
/browser-extract "https://..." --screenshot         # Include screenshot in output
/browser-extract "https://..." --depth 3            # Snapshot depth limit
/browser-extract "https://..." --scope "#main"      # CSS scope for extraction
/browser-extract "https://..." --no-close           # Keep browser open after extraction

# Combined
/browser-extract "https://example.com" --analyze --screenshot --output ./analysis.md
```
## System Prompt

EXECUTION DIRECTIVE: When the user invokes `/browser-extract`, you MUST:

- **Run the orchestrator** — `scripts/browser/extract.sh "<url>" --all --snapshot` for combined JSON + snapshot
- **Convert snapshot to markdown** — Apply the element mapping rules from the skill
- **Merge JSON metadata** — Add orchestrator data as frontmatter and inline tables
- **Run analysis** — If `--analyze`, `--a11y`, or `--meta` requested, apply the 8-dimension rubric
- **Write file** — Save to the output path (default: `analyze-new-artifacts/coditect-browser-analysis/`)
- **Close browser** — Unless `--no-close` is specified (the orchestrator handles this)
## Extraction Pipeline

```
URL → extract.sh (open → snapshot → JS extractors → close) → agent (structure → analyze → write)
```
## Primary Method: Orchestrator Script

The orchestrator handles the browser lifecycle, runs JS extractors from files (avoiding shell quoting issues), and outputs combined JSON:

```bash
# Full extraction — all 9 JS extractors + snapshot
scripts/browser/extract.sh "<url>" --all --snapshot

# Specific extractors only
scripts/browser/extract.sh "<url>" --meta --seo --links

# With screenshot
scripts/browser/extract.sh "<url>" --all --snapshot --screenshot /tmp/page.png

# Keep browser open (for multi-page or follow-up commands)
scripts/browser/extract.sh "<url>" --all --snapshot --no-close
```
The orchestrator runs 9 JS extractor modules from `scripts/browser/extractors/`:

| Extractor | Flag | Data |
|---|---|---|
| `meta.js` | `--meta` | Title, description, viewport, word count, copyright |
| `og-twitter.js` | `--og` | Open Graph + Twitter Card tags |
| `structured-data.js` | `--structured` | JSON-LD schemas |
| `links.js` | `--links` | Internal, external, mailto links |
| `forms.js` | `--forms` | Form structure and fields |
| `tables.js` | `--tables` | Table data with headers and rows |
| `images.js` | `--images` | Image inventory with alt text |
| `landmarks.js` | `--landmarks` | ARIA landmarks + heading hierarchy |
| `seo.js` | `--seo` | Full SEO assessment with pass/fail |
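Each extractor is evaluated in the page and reduces to a function over the DOM that returns a JSON-serializable object. A minimal sketch in the spirit of `meta.js` (the field names and return shape here are illustrative assumptions, not the module's actual contract; the document is passed in so the logic can run outside a browser):

```javascript
// Hypothetical meta-style extractor. Real modules live in
// scripts/browser/extractors/ and run against the live page.
function extractMeta(doc) {
  // Read one attribute from the first element matching a selector.
  const pick = (sel, attr) => {
    const el = doc.querySelector(sel);
    return el ? el.getAttribute(attr) : null;
  };
  const text = doc.body ? doc.body.textContent : "";
  return {
    title: doc.title || null,
    description: pick('meta[name="description"]', "content"),
    viewport: pick('meta[name="viewport"]', "content"),
    wordCount: text.trim().split(/\s+/).filter(Boolean).length,
  };
}
```

The same shape (pure function in, plain object out) makes each extractor's output easy to merge into the orchestrator's combined JSON.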
## Fallback: Manual Step-by-Step

If the orchestrator is unavailable, run individual extractors directly:

```bash
# Step 1: Open and snapshot
npx agent-browser open "<url>"
npx agent-browser snapshot

# Step 2: Run JS extractors via file (avoids shell quoting issues)
npx agent-browser eval "$(cat scripts/browser/extractors/meta.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/links.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"

# Step 3: Close
npx agent-browser close
```
## Step 3: Structure as Markdown

Convert the snapshot accessibility tree to markdown:

| Snapshot Element | Markdown Output |
|---|---|
| `heading [level=1]` | `# Heading Text` |
| `heading [level=2]` | `## Heading Text` |
| `heading [level=3]` | `### Heading Text` |
| `paragraph` | Plain text paragraph |
| `link "text" [ref=eN]` | `[text](url)` |
| `list > listitem` | `- Item text` |
| `table` | Markdown table |
| `textbox "label"` | `**label:** [input field]` |
| `button "text"` | `[Button: text]` |
| `img "alt"` | `![alt](src)` |
| `region "name"` | `## name` (section heading) |
| `navigation` | `### Navigation` |
| `banner` | `### Header` |
| `contentinfo` | `### Footer` |
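The mapping above amounts to a recursive renderer over snapshot nodes. A sketch, assuming a simplified `{ role, level, name, url, children }` node shape (an illustration, not the tool's actual snapshot format):

```javascript
// Render a simplified snapshot tree to markdown using the mapping table.
function toMarkdown(node) {
  switch (node.role) {
    case "heading":
      return "#".repeat(node.level || 1) + " " + node.name;
    case "paragraph":
      return node.name;
    case "link":
      return `[${node.name}](${node.url || ""})`;
    case "listitem":
      return "- " + node.name;
    case "button":
      return `[Button: ${node.name}]`;
    case "img":
      return `![${node.name}]`;
    default:
      // Containers (list, region, navigation, ...): render children.
      return (node.children || []).map(toMarkdown).join("\n");
  }
}
```

A fuller version would also emit the `### Navigation` / `### Header` / `### Footer` headings for landmark containers before recursing.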
## Step 4: Structure as JSON

Nested JSON preserving the document hierarchy:

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "extracted_at": "2026-02-08T06:00:00Z",
  "meta": {
    "description": "...",
    "og_title": "...",
    "og_image": "...",
    "canonical": "...",
    "lang": "en"
  },
  "content": {
    "navigation": {
      "links": [
        {"text": "Home", "href": "/", "type": "internal"},
        {"text": "About", "href": "/about", "type": "internal"}
      ]
    },
    "sections": [
      {
        "heading": "Hero Section",
        "level": 1,
        "content": [
          {"type": "paragraph", "text": "..."},
          {"type": "cta", "text": "Get Started", "href": "/register"}
        ],
        "subsections": [...]
      }
    ],
    "footer": {
      "copyright": "...",
      "links": [...]
    }
  },
  "stats": {
    "word_count": 1800,
    "headings": 36,
    "links": {"internal": 8, "external": 2},
    "images": 5,
    "forms": 0,
    "tables": 3,
    "interactive_elements": 68
  }
}
```
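The `stats` block is derived from the extracted content tree. For example, the internal/external link tally can be computed by walking the link arrays (field names follow the JSON example above; real extracts may carry links in other places too):

```javascript
// Tally internal vs. external links from an extract's content tree,
// walking the navigation and footer link arrays.
function tallyLinks(content) {
  const counts = { internal: 0, external: 0 };
  const groups = [
    (content.navigation || {}).links || [],
    (content.footer || {}).links || [],
  ];
  for (const links of groups) {
    for (const l of links) {
      counts[l.type === "external" ? "external" : "internal"]++;
    }
  }
  return counts;
}
```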
## Analysis Mode (`--analyze`)

When `--analyze` is specified, append a full analysis section covering:
| Category | What to Assess |
|---|---|
| Content Quality | Clarity, jargon level, reading grade, narrative structure |
| Information Architecture | Heading hierarchy, section flow, content grouping |
| CTAs & Conversion | CTA count, placement, hierarchy, clarity of action |
| Trust Signals | Testimonials, logos, case studies, security badges, social proof |
| SEO Basics | Title tag, meta description, H1 count, canonical, structured data |
| Accessibility | Heading levels, ARIA landmarks, alt text, form labels |
| Mobile Readiness | Viewport meta, responsive indicators, touch targets |
| Content Gaps | Missing pricing, FAQ gaps, no contact info, stale dates |
Rate each category: Strong / Adequate / Needs Improvement / Missing
Provide prioritized recommendations: HIGH / MEDIUM / LOW with effort estimates.
See: Analysis Rubric
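One possible heuristic for turning category ratings into recommendation priorities (a hypothetical mapping for illustration; the actual rubric may weight categories differently):

```javascript
// Map a category rating to a recommendation priority.
// "Missing" and "Needs Improvement" items surface first; "Strong"
// categories produce no recommendation.
function priorityFor(rating) {
  const map = {
    "Missing": "HIGH",
    "Needs Improvement": "MEDIUM",
    "Adequate": "LOW",
    "Strong": null,
  };
  return map[rating] ?? null;
}
```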
## Accessibility Mode (`--a11y`)

Run an accessibility audit using snapshot data:

| Check | Source | How |
|---|---|---|
| Heading hierarchy | Snapshot `[level=N]` | Verify H1 -> H2 -> H3 with no skips |
| ARIA landmarks | Snapshot roles | Verify `banner`, `main`, `contentinfo`, `navigation` present |
| Link text | Snapshot link elements | Flag "click here", "read more", empty links |
| Form labels | Snapshot + JS | Every input must have an associated label |
| Alt text | JS extraction | Every `<img>` must have an `alt` attribute |
| Color contrast | JS extraction | Check computed contrast ratios |
| Focus order | Snapshot `tabindex` | Flag positive `tabindex` values |
| Language | JS `document.documentElement.lang` | Must be set |
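Two of these checks (heading-level skips and vague link text) are purely structural and can be sketched as small functions over snapshot data:

```javascript
// Flag heading-level skips (e.g. H1 -> H3) in a document-order list
// of heading levels taken from the snapshot.
function auditHeadings(levels) {
  const issues = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      issues.push(`skip: H${levels[i - 1]} -> H${levels[i]}`);
    }
  }
  return issues;
}

// Flag vague or empty link text from the snapshot's link elements.
const VAGUE = new Set(["click here", "read more", "here", ""]);
function auditLinkText(texts) {
  return texts.filter((t) => VAGUE.has(t.trim().toLowerCase()));
}
```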
## Meta / SEO Mode (`--meta`)

Extract and assess using the orchestrator or individual extractors:

```bash
# Via orchestrator (recommended)
scripts/browser/extract.sh "<url>" --meta --og --structured --seo

# Via individual extractor files
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/og-twitter.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/structured-data.js)"
```
## Compare Mode (`--compare`)

Extract both pages, then produce a side-by-side diff:

```bash
/browser-extract "https://v1.example.com" --compare "https://v2.example.com"
```

Output sections:

- **Structural Diff** — Sections added/removed/changed
- **Content Diff** — Text changes (word-level)
- **Link Diff** — Links added/removed/changed
- **Meta Diff** — SEO tag changes
- **Stats Comparison** — Word count, headings, links, images
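The stats comparison can be computed as a field-by-field numeric diff of the two extracts' `stats` objects; a sketch:

```javascript
// Report numeric deltas between two extracts' stats objects,
// skipping fields that are unchanged or non-numeric.
function diffStats(a, b) {
  const out = {};
  for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const va = a[key] ?? 0;
    const vb = b[key] ?? 0;
    if (typeof va === "number" && typeof vb === "number" && va !== vb) {
      out[key] = { from: va, to: vb, delta: vb - va };
    }
  }
  return out;
}
```

Nested fields such as `links: {internal, external}` would be diffed recursively in the same way.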
## Monitor Mode (`--monitor`)

The first run saves a baseline; subsequent runs diff against it:

```bash
# First run: saves baseline
/browser-extract "https://example.com" --monitor

# Later run: detects changes
/browser-extract "https://example.com" --monitor
# Output: "3 sections changed, 2 links added, copyright updated"
```

Baselines are stored in `analyze-new-artifacts/coditect-browser-analysis/.baselines/`.
## Options

| Option | Description |
|---|---|
| `<url>` | URL to extract (required) |
| `--json` | Output as nested JSON |
| `--both` | Output markdown + JSON |
| `--csv` | Extract tables as CSV |
| `--sections` | Headings + content by section |
| `--links` | Link graph with internal/external classification |
| `--forms` | Form structure and field inventory |
| `--tables` | Tables as markdown or CSV |
| `--meta` | SEO metadata extraction and assessment |
| `--analyze` | Full content + UX analysis with recommendations |
| `--a11y` | Accessibility audit from snapshot |
| `--compare <url>` | Side-by-side comparison with another URL |
| `--monitor` | Change detection against saved baseline |
| `--output <path>` | Custom output file path |
| `--screenshot` | Include full-page screenshot |
| `--depth <N>` | Snapshot depth limit |
| `--scope <selector>` | CSS selector to scope extraction |
| `--no-close` | Keep browser open after extraction |
| `--help` | Show this help |
## Output Paths

| Mode | Default Output Path |
|---|---|
| Basic extract | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.md` |
| JSON | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.json` |
| Analysis | `analyze-new-artifacts/coditect-browser-analysis/{domain}-analysis.md` |
| Compare | `analyze-new-artifacts/coditect-browser-analysis/{domain}-vs-{domain2}-compare.md` |
| Monitor baseline | `analyze-new-artifacts/coditect-browser-analysis/.baselines/{domain}-{date}.json` |
| Screenshot | `analyze-new-artifacts/coditect-browser-screenshots/{domain}.png` |
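The `{domain}` placeholder in these paths is the URL's hostname with dots replaced by hyphens; a sketch of that mapping:

```javascript
// Derive the {domain} filename slug from a URL: hostname with dots
// replaced by hyphens (e.g. docs.coditect.ai -> docs-coditect-ai).
function domainSlug(url) {
  return new URL(url).hostname.replace(/\./g, "-");
}
```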
## Examples

### Quick Extract to Markdown

```bash
/browser-extract "https://docs.coditect.ai"
# Creates: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-extract.md
```

### Full Analysis with Screenshot

```bash
/browser-extract "https://auth.coditect.ai" --analyze --screenshot
# Creates: auth-coditect-ai-analysis.md + auth-coditect-ai.png
```

### Extract Just the Main Content

```bash
/browser-extract "https://blog.example.com/post" --scope "article" --sections
# Extracts only the <article> element, organized by sections
```

### Compare Two Versions

```bash
/browser-extract "https://staging.example.com" --compare "https://example.com"
# Creates: staging-example-com-vs-example-com-compare.md
```

### SEO Quick Check

```bash
/browser-extract "https://example.com" --meta
# Outputs: title length, meta description, OG tags, structured data, robots
```
## Success Output

```
/browser-extract: https://example.com
Format: markdown
Sections: 8
Words: 1,842
Links: 12 internal, 3 external
Output: analyze-new-artifacts/coditect-browser-analysis/example-com-extract.md
```
## Related

- Command: `/browser` — Direct browser control
- Command: `/browser-research` — Multi-page agentic extraction
- Skill: `browser-content-extraction` — Extraction patterns
- Skill: `browser-automation-patterns` — Browser workflow patterns
- Agent: `coditect-browser-agent` — Browser automation agent
## Principles
This command embodies:
- #3 Complete Execution — open -> extract -> structure -> analyze -> write -> close in one command
- #6 Clear, Understandable — Markdown output readable by anyone
- #9 Based on Facts — Content extracted from real page state, not assumptions
**Command Version:** 1.0.0 | **Created:** 2026-02-08 | **Author:** CODITECT Core Team