/browser-extract - Browser Content Extraction

Extract web page content directly into structured markdown or nested JSON. No screenshot is required. Optionally analyze content quality, accessibility, and SEO, or compare two pages side by side.

Usage

# Basic extraction (default: markdown)
/browser-extract "https://example.com"

# Output format
/browser-extract "https://example.com" --json # Nested JSON
/browser-extract "https://example.com" --both # Markdown + JSON
/browser-extract "https://example.com" --csv # Tables as CSV

# Focused extraction
/browser-extract "https://..." --sections # Headings + content by section
/browser-extract "https://..." --links # Link graph (internal/external/broken)
/browser-extract "https://..." --forms # Form structure, fields, validation
/browser-extract "https://..." --tables # Tables as markdown/CSV
/browser-extract "https://..." --meta # SEO: title, og:*, schema.org, robots

# Analysis modes
/browser-extract "https://..." --analyze # Full content + UX analysis + suggestions
/browser-extract "https://..." --a11y # Accessibility audit from snapshot
/browser-extract "https://..." --compare "https://..." # Side-by-side page diff
/browser-extract "https://..." --monitor # Save baseline, detect changes on re-run

# Output control
/browser-extract "https://..." --output ./report.md # Custom output path
/browser-extract "https://..." --screenshot # Include screenshot in output
/browser-extract "https://..." --depth 3 # Snapshot depth limit
/browser-extract "https://..." --scope "#main" # CSS scope for extraction
/browser-extract "https://..." --no-close # Keep browser open after extraction

# Combined
/browser-extract "https://example.com" --analyze --screenshot --output ./analysis.md

System Prompt

EXECUTION DIRECTIVE: When the user invokes /browser-extract, you MUST:

  1. Run the orchestrator: scripts/browser/extract.sh "<url>" --all --snapshot for combined JSON + snapshot
  2. Convert snapshot to markdown: apply the element mapping rules from the skill
  3. Merge JSON metadata: add orchestrator data as frontmatter and inline tables
  4. Run analysis: if --analyze, --a11y, or --meta is requested, apply the 8-dimension rubric
  5. Write file: save to the output path (default: analyze-new-artifacts/coditect-browser-analysis/)
  6. Close browser: unless --no-close is specified (the orchestrator handles this)

Extraction Pipeline

URL → extract.sh (open → snapshot → JS extractors → close) → agent (structure → analyze → write)

Primary Method: Orchestrator Script

The orchestrator manages the browser lifecycle, runs the JS extractors from files (avoiding shell quoting issues), and outputs combined JSON:

# Full extraction — all 9 JS extractors + snapshot
scripts/browser/extract.sh "<url>" --all --snapshot

# Specific extractors only
scripts/browser/extract.sh "<url>" --meta --seo --links

# With screenshot
scripts/browser/extract.sh "<url>" --all --snapshot --screenshot /tmp/page.png

# Keep browser open (for multi-page or follow-up commands)
scripts/browser/extract.sh "<url>" --all --snapshot --no-close

The orchestrator runs 9 JS extractor modules from scripts/browser/extractors/:

| Extractor | Flag | Data |
| --- | --- | --- |
| `meta.js` | `--meta` | Title, description, viewport, word count, copyright |
| `og-twitter.js` | `--og` | Open Graph + Twitter Card tags |
| `structured-data.js` | `--structured` | JSON-LD schemas |
| `links.js` | `--links` | Internal, external, mailto links |
| `forms.js` | `--forms` | Form structure and fields |
| `tables.js` | `--tables` | Table data with headers and rows |
| `images.js` | `--images` | Image inventory with alt text |
| `landmarks.js` | `--landmarks` | ARIA landmarks + heading hierarchy |
| `seo.js` | `--seo` | Full SEO assessment with pass/fail |
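
As a rough illustration of how the flags above map to an orchestrator invocation, the following Python sketch assembles the argument list for scripts/browser/extract.sh. The helper name `build_extract_command` and its parameters are hypothetical and not part of the command; only the script path and the flags are taken from this document.

```python
from typing import List, Optional, Sequence

# Flag -> extractor file, copied from the extractor table in this document.
EXTRACTORS = {
    "--meta": "meta.js",
    "--og": "og-twitter.js",
    "--structured": "structured-data.js",
    "--links": "links.js",
    "--forms": "forms.js",
    "--tables": "tables.js",
    "--images": "images.js",
    "--landmarks": "landmarks.js",
    "--seo": "seo.js",
}


def build_extract_command(
    url: str,
    flags: Sequence[str] = (),
    snapshot: bool = True,
    screenshot: Optional[str] = None,
    no_close: bool = False,
) -> List[str]:
    """Hypothetical helper: build the argv list for the orchestrator script."""
    unknown = [flag for flag in flags if flag not in EXTRACTORS]
    if unknown:
        raise ValueError(f"unknown extractor flags: {unknown}")

    cmd = ["scripts/browser/extract.sh", url]
    # With no explicit extractor flags, fall back to --all.
    cmd += list(flags) if flags else ["--all"]
    if snapshot:
        cmd.append("--snapshot")
    if screenshot:
        cmd += ["--screenshot", screenshot]
    if no_close:
        cmd.append("--no-close")
    return cmd
```

Passing the list to `subprocess.run` rather than joining it into a string keeps the URL as a single argument, which sidesteps the shell quoting issues mentioned above.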

Fallback: Manual Step-by-Step

If the orchestrator is unavailable, run individual extractors directly:

# Step 1: Open and snapshot
npx agent-browser open "<url>"
npx agent-browser snapshot

# Step 2: Run JS extractors via file (avoids shell quoting issues)
npx agent-browser eval "$(cat scripts/browser/extractors/meta.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/links.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"

# Step 3: Close
npx agent-browser close

Step 3: Structure as Markdown

Convert the snapshot accessibility tree to markdown:

| Snapshot Element | Markdown Output |
| --- | --- |
| `heading [level=1]` | `# Heading Text` |
| `heading [level=2]` | `## Heading Text` |
| `heading [level=3]` | `### Heading Text` |
| `paragraph` | Plain text paragraph |
| `link "text" [ref=eN]` | `[text](url)` |
| `list > listitem` | `- Item text` |
| `table` | Markdown table |
| `textbox "label"` | `**label:** [input field]` |
| `button "text"` | `[Button: text]` |
| `img "alt"` | `![alt](src)` |
| `region "name"` | `## name` (section heading) |
| `navigation` | `### Navigation` |
| `banner` | `### Header` |
| `contentinfo` | `### Footer` |
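
The mapping above could be sketched in Python roughly as follows. This assumes each snapshot node has already been parsed into a dict with `role`, `name`, and optional `level`, `url`, or `src` keys; that dict shape and the function name `node_to_markdown` are assumptions made for illustration, not the actual snapshot format. Tables are left out because they need row and column data.

```python
# Landmark roles that become fixed headings, per the mapping table.
LANDMARK_HEADINGS = {
    "navigation": "### Navigation",
    "banner": "### Header",
    "contentinfo": "### Footer",
}


def node_to_markdown(node: dict) -> str:
    """Convert one parsed snapshot node (assumed dict shape) to markdown."""
    role = node["role"]
    name = node.get("name", "")

    if role == "heading":
        return "#" * int(node.get("level", 1)) + " " + name
    if role == "paragraph":
        return name
    if role == "link":
        return f"[{name}]({node.get('url', '')})"
    if role == "listitem":
        return f"- {name}"
    if role == "textbox":
        return f"**{name}:** [input field]"
    if role == "button":
        return f"[Button: {name}]"
    if role == "img":
        return f"![{name}]({node.get('src', '')})"
    if role == "region":
        return f"## {name}"
    if role in LANDMARK_HEADINGS:
        return LANDMARK_HEADINGS[role]
    # Unknown roles fall back to their plain text.
    return name
```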

Step 4: Structure as JSON

Nested JSON preserving the document hierarchy:

{
  "url": "https://example.com",
  "title": "Page Title",
  "extracted_at": "2026-02-08T06:00:00Z",
  "meta": {
    "description": "...",
    "og_title": "...",
    "og_image": "...",
    "canonical": "...",
    "lang": "en"
  },
  "content": {
    "navigation": {
      "links": [
        {"text": "Home", "href": "/", "type": "internal"},
        {"text": "About", "href": "/about", "type": "internal"}
      ]
    },
    "sections": [
      {
        "heading": "Hero Section",
        "level": 1,
        "content": [
          {"type": "paragraph", "text": "..."},
          {"type": "cta", "text": "Get Started", "href": "/register"}
        ],
        "subsections": [...]
      }
    ],
    "footer": {
      "copyright": "...",
      "links": [...]
    }
  },
  "stats": {
    "word_count": 1800,
    "headings": 36,
    "links": {"internal": 8, "external": 2},
    "images": 5,
    "forms": 0,
    "tables": 3,
    "interactive_elements": 68
  }
}
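
One way to build the nested `sections` / `subsections` hierarchy shown above from a flat list of headings is a simple stack walk. The sketch below assumes the headings arrive as `(level, text)` tuples in document order; the function name `nest_sections` and that input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def nest_sections(headings: List[Tuple[int, str]]) -> List[Dict]:
    """Nest flat (level, text) headings into sections with subsections."""
    root: Dict = {"level": 0, "subsections": []}
    stack = [root]  # stack[-1] is the innermost open section

    for level, text in headings:
        section = {
            "heading": text,
            "level": level,
            "content": [],
            "subsections": [],
        }
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["subsections"].append(section)
        stack.append(section)

    return root["subsections"]
```

Because the synthetic root has level 0 and real headings start at level 1, the stack never empties, so a page that starts with an H2 still produces a valid top-level section.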

Analysis Mode (--analyze)

When --analyze is specified, append a full analysis section covering:

| Category | What to Assess |
| --- | --- |
| Content Quality | Clarity, jargon level, reading grade, narrative structure |
| Information Architecture | Heading hierarchy, section flow, content grouping |
| CTAs & Conversion | CTA count, placement, hierarchy, clarity of action |
| Trust Signals | Testimonials, logos, case studies, security badges, social proof |
| SEO Basics | Title tag, meta description, H1 count, canonical, structured data |
| Accessibility | Heading levels, ARIA landmarks, alt text, form labels |
| Mobile Readiness | Viewport meta, responsive indicators, touch targets |
| Content Gaps | Missing pricing, FAQ gaps, no contact info, stale dates |

Rate each category: Strong / Adequate / Needs Improvement / Missing

Provide prioritized recommendations: HIGH / MEDIUM / LOW with effort estimates.

See: Analysis Rubric

Accessibility Mode (--a11y)

Run an accessibility audit using snapshot data:

| Check | Source | How |
| --- | --- | --- |
| Heading hierarchy | Snapshot `[level=N]` | Verify H1 -> H2 -> H3 with no skips |
| ARIA landmarks | Snapshot roles | Verify banner, main, contentinfo, navigation present |
| Link text | Snapshot link elements | Flag "click here", "read more", empty links |
| Form labels | Snapshot + JS | Every input must have associated label |
| Alt text | JS extraction | Every `<img>` must have alt attribute |
| Color contrast | JS extraction | Check computed contrast ratios |
| Focus order | Snapshot tabindex | Flag positive tabindex values |
| Language | JS `document.documentElement.lang` | Must be set |
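
Two of the checks above, heading hierarchy and link text, are simple enough to sketch in Python. The function names and input shapes (a list of heading levels and a list of link texts, both in document order) are assumptions made for illustration; they are not the audit's actual implementation.

```python
from typing import List, Tuple

# Phrases the audit table flags as non-descriptive link text.
VAGUE_LINK_TEXT = {"click here", "read more"}


def find_heading_skips(levels: List[int]) -> List[Tuple[int, int]]:
    """Return (previous, current) pairs where a heading level is skipped.

    A jump such as H1 -> H3 is reported as (1, 3). A page whose first
    heading is not an H1 is reported with previous == 0.
    """
    skips = []
    previous = 0
    for level in levels:
        if level > previous + 1:
            skips.append((previous, level))
        previous = level
    return skips


def flag_link_text(texts: List[str]) -> List[str]:
    """Return link texts that are empty or match a vague phrase."""
    return [
        text
        for text in texts
        if not text.strip() or text.strip().lower() in VAGUE_LINK_TEXT
    ]
```

Moving back up the hierarchy (H3 followed by H2) is not treated as a skip; only downward jumps of more than one level are reported.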

Meta / SEO Mode (--meta)

Extract and assess using the orchestrator or individual extractors:

# Via orchestrator (recommended)
scripts/browser/extract.sh "<url>" --meta --og --structured --seo

# Via individual extractor files
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/og-twitter.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/structured-data.js)"

Compare Mode (--compare)

Extract both pages, then produce a side-by-side diff:

/browser-extract "https://v1.example.com" --compare "https://v2.example.com"

Output sections:

  1. Structural Diff — Sections added/removed/changed
  2. Content Diff — Text changes (word-level)
  3. Link Diff — Links added/removed/changed
  4. Meta Diff — SEO tag changes
  5. Stats Comparison — Word count, headings, links, images
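
A minimal sketch of the link diff and the word-level content diff, using Python's standard `difflib`. The function names and the plain-string inputs are assumptions for illustration; the actual compare mode may compute its diffs differently.

```python
import difflib
from typing import Dict, Iterable, List, Tuple


def link_diff(old_links: Iterable[str], new_links: Iterable[str]) -> Dict[str, List[str]]:
    """Report links that were added or removed between two pages."""
    old, new = set(old_links), set(new_links)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def word_diff(old_text: str, new_text: str) -> List[Tuple[str, str, str]]:
    """Return (operation, old words, new words) for each changed span."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes
```

Splitting on whitespace before matching is what makes the diff word-level rather than character-level; `autojunk=False` turns off difflib's popularity heuristic so that frequent words on long pages are still matched.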

Monitor Mode (--monitor)

The first run saves a baseline. Subsequent runs diff the page against that baseline:

# First run: saves baseline
/browser-extract "https://example.com" --monitor

# Later run: detects changes
/browser-extract "https://example.com" --monitor
# Output: "3 sections changed, 2 links added, copyright updated"

Baselines are stored in: analyze-new-artifacts/coditect-browser-analysis/.baselines/
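
The save-then-diff behavior can be sketched as a single function that writes the baseline when none exists and otherwise reports which top-level keys changed. The function name, the explicit `baseline_path` argument, and the key-level comparison are assumptions for illustration; how the command locates the dated baseline file is not specified here.

```python
import json
from pathlib import Path
from typing import List, Optional


def check_against_baseline(baseline_path: Path, current: dict) -> Optional[List[str]]:
    """Save a baseline on first run; afterwards list the keys that changed.

    Returns None when the baseline was just created, otherwise a sorted
    list of top-level keys whose values differ from the baseline.
    """
    if not baseline_path.exists():
        baseline_path.parent.mkdir(parents=True, exist_ok=True)
        baseline_path.write_text(json.dumps(current, indent=2, sort_keys=True))
        return None

    baseline = json.loads(baseline_path.read_text())
    keys = set(baseline) | set(current)
    return sorted(key for key in keys if baseline.get(key) != current.get(key))
```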

Options

| Option | Description |
| --- | --- |
| `<url>` | URL to extract (required) |
| `--json` | Output as nested JSON |
| `--both` | Output markdown + JSON |
| `--csv` | Extract tables as CSV |
| `--sections` | Headings + content by section |
| `--links` | Link graph with internal/external classification |
| `--forms` | Form structure and field inventory |
| `--tables` | Tables as markdown or CSV |
| `--meta` | SEO metadata extraction and assessment |
| `--analyze` | Full content + UX analysis with recommendations |
| `--a11y` | Accessibility audit from snapshot |
| `--compare <url>` | Side-by-side comparison with another URL |
| `--monitor` | Change detection against saved baseline |
| `--output <path>` | Custom output file path |
| `--screenshot` | Include full-page screenshot |
| `--depth <N>` | Snapshot depth limit |
| `--scope <selector>` | CSS selector to scope extraction |
| `--no-close` | Keep browser open after extraction |
| `--help` | Show this help |

Output Paths

| Mode | Default Output Path |
| --- | --- |
| Basic extract | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.md` |
| JSON | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.json` |
| Analysis | `analyze-new-artifacts/coditect-browser-analysis/{domain}-analysis.md` |
| Compare | `analyze-new-artifacts/coditect-browser-analysis/{domain}-vs-{domain2}-compare.md` |
| Monitor baseline | `analyze-new-artifacts/coditect-browser-analysis/.baselines/{domain}-{date}.json` |
| Screenshot | `analyze-new-artifacts/coditect-browser-screenshots/{domain}.png` |
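
Judging by the examples in this document (docs.coditect.ai becomes docs-coditect-ai), the `{domain}` placeholder appears to be the hostname with dots replaced by hyphens. The following Python sketch derives the default paths under that assumption; the function names are hypothetical.

```python
from urllib.parse import urlparse

OUTPUT_DIR = "analyze-new-artifacts/coditect-browser-analysis"


def domain_slug(url: str) -> str:
    """Hostname with dots replaced by hyphens (assumed {domain} rule)."""
    host = urlparse(url).hostname or ""
    return host.replace(".", "-")


def default_output_path(url: str, mode: str = "extract", ext: str = "md") -> str:
    """Build the default path for extract, analysis, or JSON output."""
    return f"{OUTPUT_DIR}/{domain_slug(url)}-{mode}.{ext}"


def compare_output_path(url: str, other_url: str) -> str:
    """Build the default path for a --compare report."""
    return f"{OUTPUT_DIR}/{domain_slug(url)}-vs-{domain_slug(other_url)}-compare.md"
```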

Examples

Quick Extract to Markdown

/browser-extract "https://docs.coditect.ai"
# Creates: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-extract.md

Full Analysis with Screenshot

/browser-extract "https://auth.coditect.ai" --analyze --screenshot
# Creates: auth-coditect-ai-analysis.md + auth-coditect-ai.png

Extract Just the Main Content

/browser-extract "https://blog.example.com/post" --scope "article" --sections
# Extracts only the <article> element, organized by sections

Compare Two Versions

/browser-extract "https://staging.example.com" --compare "https://example.com"
# Creates: staging-example-com-vs-example-com-compare.md

SEO Quick Check

/browser-extract "https://example.com" --meta
# Outputs: title length, meta description, OG tags, structured data, robots

Success Output

/browser-extract: https://example.com
Format: markdown
Sections: 8
Words: 1,842
Links: 12 internal, 3 external
Output: analyze-new-artifacts/coditect-browser-analysis/example-com-extract.md

Principles

This command embodies:

  • #3 Complete Execution — open -> extract -> structure -> analyze -> write -> close in one command
  • #6 Clear, Understandable — Markdown output readable by anyone
  • #9 Based on Facts — Content extracted from real page state, not assumptions

Command Version: 1.0.0 | Created: 2026-02-08 | Author: CODITECT Core Team