# /browser-extract - Browser Content Extraction

Extract web page content directly into structured markdown or nested JSON. No screenshot required. Optionally analyze content quality, accessibility, and SEO, and compare pages side by side.
## Usage

```bash
# Basic extraction (default: markdown)
/browser-extract "https://example.com"

# Output format
/browser-extract "https://example.com" --json       # Nested JSON
/browser-extract "https://example.com" --both       # Markdown + JSON
/browser-extract "https://example.com" --csv        # Tables as CSV

# Focused extraction
/browser-extract "https://..." --sections           # Headings + content by section
/browser-extract "https://..." --links              # Link graph (internal/external/broken)
/browser-extract "https://..." --forms              # Form structure, fields, validation
/browser-extract "https://..." --tables             # Tables as markdown/CSV
/browser-extract "https://..." --meta               # SEO: title, og:*, schema.org, robots

# Analysis modes
/browser-extract "https://..." --analyze            # Full content + UX analysis + suggestions
/browser-extract "https://..." --a11y               # Accessibility audit from snapshot
/browser-extract "https://..." --compare "https://..."  # Side-by-side page diff
/browser-extract "https://..." --monitor            # Save baseline, detect changes on re-run

# Output control
/browser-extract "https://..." --output ./report.md # Custom output path
/browser-extract "https://..." --screenshot         # Include screenshot in output
/browser-extract "https://..." --depth 3            # Snapshot depth limit
/browser-extract "https://..." --scope "#main"      # CSS scope for extraction
/browser-extract "https://..." --no-close           # Keep browser open after extraction

# Combined
/browser-extract "https://example.com" --analyze --screenshot --output ./analysis.md
```
## System Prompt

EXECUTION DIRECTIVE: When the user invokes `/browser-extract`, you MUST:

- **Run the orchestrator** — `scripts/browser/extract.sh "<url>" --all --snapshot` for combined JSON + snapshot
- **Convert snapshot to markdown** — Apply the element mapping rules from the skill
- **Merge JSON metadata** — Add orchestrator data as frontmatter and inline tables
- **Run analysis** — If `--analyze`, `--a11y`, or `--meta` requested, apply the 8-dimension rubric
- **Write file** — Save to the output path (default: `analyze-new-artifacts/coditect-browser-analysis/`)
- **Close browser** — Unless `--no-close` is specified (the orchestrator handles this)
## Extraction Pipeline

```
URL → extract.sh (open → snapshot → JS extractors → close) → agent (structure → analyze → write)
```
## Primary Method: Orchestrator Script

The orchestrator handles the browser lifecycle, runs JS extractors from files (avoiding shell quoting issues), and outputs combined JSON:

```bash
# Full extraction — all 9 JS extractors + snapshot
scripts/browser/extract.sh "<url>" --all --snapshot

# Specific extractors only
scripts/browser/extract.sh "<url>" --meta --seo --links

# With screenshot
scripts/browser/extract.sh "<url>" --all --snapshot --screenshot /tmp/page.png

# Keep browser open (for multi-page or follow-up commands)
scripts/browser/extract.sh "<url>" --all --snapshot --no-close
```
The orchestrator runs 9 JS extractor modules from `scripts/browser/extractors/`:

| Extractor | Flag | Data |
|---|---|---|
| `meta.js` | `--meta` | Title, description, viewport, word count, copyright |
| `og-twitter.js` | `--og` | Open Graph + Twitter Card tags |
| `structured-data.js` | `--structured` | JSON-LD schemas |
| `links.js` | `--links` | Internal, external, mailto links |
| `forms.js` | `--forms` | Form structure and fields |
| `tables.js` | `--tables` | Table data with headers and rows |
| `images.js` | `--images` | Image inventory with alt text |
| `landmarks.js` | `--landmarks` | ARIA landmarks + heading hierarchy |
| `seo.js` | `--seo` | Full SEO assessment with pass/fail |
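Each extractor is evaluated in the page and reduces to a function over the DOM that returns a JSON-serializable object. A minimal sketch in the spirit of `meta.js` (the field names and return shape here are illustrative assumptions, not the module's actual contract; the document is passed in so the logic can run outside a browser):

```javascript
// Hypothetical meta-style extractor. Real modules live in
// scripts/browser/extractors/ and run against the live page.
function extractMeta(doc) {
  // Read one attribute from the first element matching a selector.
  const pick = (sel, attr) => {
    const el = doc.querySelector(sel);
    return el ? el.getAttribute(attr) : null;
  };
  const text = doc.body ? doc.body.textContent : "";
  return {
    title: doc.title || null,
    description: pick('meta[name="description"]', "content"),
    viewport: pick('meta[name="viewport"]', "content"),
    wordCount: text.trim().split(/\s+/).filter(Boolean).length,
  };
}
```

The same shape (pure function in, plain object out) makes each extractor's output easy to merge into the orchestrator's combined JSON.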
## Fallback: Manual Step-by-Step

If the orchestrator is unavailable, run individual extractors directly:

```bash
# Step 1: Open and snapshot
npx agent-browser open "<url>"
npx agent-browser snapshot

# Step 2: Run JS extractors via file (avoids shell quoting issues)
npx agent-browser eval "$(cat scripts/browser/extractors/meta.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/links.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"

# Step 3: Close
npx agent-browser close
```
## Step 3: Structure as Markdown

Convert the snapshot accessibility tree to markdown:

| Snapshot Element | Markdown Output |
|---|---|
| `heading [level=1]` | `# Heading Text` |
| `heading [level=2]` | `## Heading Text` |
| `heading [level=3]` | `### Heading Text` |
| `paragraph` | Plain text paragraph |
| `link "text" [ref=eN]` | `[text](url)` |
| `list > listitem` | `- Item text` |
| `table` | Markdown table |
| `textbox "label"` | `**label:** [input field]` |
| `button "text"` | `[Button: text]` |
| `img "alt"` | `![alt](src)` |
| `region "name"` | `## name` (section heading) |
| `navigation` | `### Navigation` |
| `banner` | `### Header` |
| `contentinfo` | `### Footer` |
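The mapping above amounts to a recursive renderer over snapshot nodes. A sketch, assuming a simplified `{ role, level, name, url, children }` node shape (an illustration, not the tool's actual snapshot format):

```javascript
// Render a simplified snapshot tree to markdown using the mapping table.
function toMarkdown(node) {
  switch (node.role) {
    case "heading":
      return "#".repeat(node.level || 1) + " " + node.name;
    case "paragraph":
      return node.name;
    case "link":
      return `[${node.name}](${node.url || ""})`;
    case "listitem":
      return "- " + node.name;
    case "button":
      return `[Button: ${node.name}]`;
    case "img":
      return `![${node.name}]`;
    default:
      // Containers (list, region, navigation, ...): render children.
      return (node.children || []).map(toMarkdown).join("\n");
  }
}
```

A fuller version would also emit the `### Navigation` / `### Header` / `### Footer` headings for landmark containers before recursing.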
## Step 4: Structure as JSON

Nested JSON preserving the document hierarchy:

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "extracted_at": "2026-02-08T06:00:00Z",
  "meta": {
    "description": "...",
    "og_title": "...",
    "og_image": "...",
    "canonical": "...",
    "lang": "en"
  },
  "content": {
    "navigation": {
      "links": [
        {"text": "Home", "href": "/", "type": "internal"},
        {"text": "About", "href": "/about", "type": "internal"}
      ]
    },
    "sections": [
      {
        "heading": "Hero Section",
        "level": 1,
        "content": [
          {"type": "paragraph", "text": "..."},
          {"type": "cta", "text": "Get Started", "href": "/register"}
        ],
        "subsections": [...]
      }
    ],
    "footer": {
      "copyright": "...",
      "links": [...]
    }
  },
  "stats": {
    "word_count": 1800,
    "headings": 36,
    "links": {"internal": 8, "external": 2},
    "images": 5,
    "forms": 0,
    "tables": 3,
    "interactive_elements": 68
  }
}
```
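The `stats` block is derived from the extracted content tree. For example, the internal/external link tally can be computed by walking the link arrays (field names follow the JSON example above; real extracts may carry links in other places too):

```javascript
// Tally internal vs. external links from an extract's content tree,
// walking the navigation and footer link arrays.
function tallyLinks(content) {
  const counts = { internal: 0, external: 0 };
  const groups = [
    (content.navigation || {}).links || [],
    (content.footer || {}).links || [],
  ];
  for (const links of groups) {
    for (const l of links) {
      counts[l.type === "external" ? "external" : "internal"]++;
    }
  }
  return counts;
}
```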
## Analysis Mode (`--analyze`)

When `--analyze` is specified, append a full analysis section covering:
| Category | What to Assess |
|---|---|
| Content Quality | Clarity, jargon level, reading grade, narrative structure |
| Information Architecture | Heading hierarchy, section flow, content grouping |
| CTAs & Conversion | CTA count, placement, hierarchy, clarity of action |
| Trust Signals | Testimonials, logos, case studies, security badges, social proof |
| SEO Basics | Title tag, meta description, H1 count, canonical, structured data |
| Accessibility | Heading levels, ARIA landmarks, alt text, form labels |
| Mobile Readiness | Viewport meta, responsive indicators, touch targets |
| Content Gaps | Missing pricing, FAQ gaps, no contact info, stale dates |
Rate each category: Strong / Adequate / Needs Improvement / Missing
Provide prioritized recommendations: HIGH / MEDIUM / LOW with effort estimates.
See: Analysis Rubric
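One possible heuristic for turning category ratings into recommendation priorities (a hypothetical mapping for illustration; the actual rubric may weight categories differently):

```javascript
// Map a category rating to a recommendation priority.
// "Missing" and "Needs Improvement" items surface first; "Strong"
// categories produce no recommendation.
function priorityFor(rating) {
  const map = {
    "Missing": "HIGH",
    "Needs Improvement": "MEDIUM",
    "Adequate": "LOW",
    "Strong": null,
  };
  return map[rating] ?? null;
}
```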
## Accessibility Mode (`--a11y`)

Run an accessibility audit using snapshot data:

| Check | Source | How |
|---|---|---|
| Heading hierarchy | Snapshot `[level=N]` | Verify H1 -> H2 -> H3 with no skips |
| ARIA landmarks | Snapshot roles | Verify `banner`, `main`, `contentinfo`, `navigation` present |
| Link text | Snapshot link elements | Flag "click here", "read more", empty links |
| Form labels | Snapshot + JS | Every input must have an associated label |
| Alt text | JS extraction | Every `<img>` must have an `alt` attribute |
| Color contrast | JS extraction | Check computed contrast ratios |
| Focus order | Snapshot `tabindex` | Flag positive `tabindex` values |
| Language | JS `document.documentElement.lang` | Must be set |
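Two of these checks (heading-level skips and vague link text) are purely structural and can be sketched as small functions over snapshot data:

```javascript
// Flag heading-level skips (e.g. H1 -> H3) in a document-order list
// of heading levels taken from the snapshot.
function auditHeadings(levels) {
  const issues = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      issues.push(`skip: H${levels[i - 1]} -> H${levels[i]}`);
    }
  }
  return issues;
}

// Flag vague or empty link text from the snapshot's link elements.
const VAGUE = new Set(["click here", "read more", "here", ""]);
function auditLinkText(texts) {
  return texts.filter((t) => VAGUE.has(t.trim().toLowerCase()));
}
```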
## Meta / SEO Mode (`--meta`)

Extract and assess using the orchestrator or individual extractors:

```bash
# Via orchestrator (recommended)
scripts/browser/extract.sh "<url>" --meta --og --structured --seo

# Via individual extractor files
npx agent-browser eval "$(cat scripts/browser/extractors/seo.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/og-twitter.js)"
npx agent-browser eval "$(cat scripts/browser/extractors/structured-data.js)"
```
## Compare Mode (`--compare`)

Extract both pages, then produce a side-by-side diff:

```bash
/browser-extract "https://v1.example.com" --compare "https://v2.example.com"
```

Output sections:

- **Structural Diff** — Sections added/removed/changed
- **Content Diff** — Text changes (word-level)
- **Link Diff** — Links added/removed/changed
- **Meta Diff** — SEO tag changes
- **Stats Comparison** — Word count, headings, links, images
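The stats comparison can be computed as a field-by-field numeric diff of the two extracts' `stats` objects; a sketch:

```javascript
// Report numeric deltas between two extracts' stats objects,
// skipping fields that are unchanged or non-numeric.
function diffStats(a, b) {
  const out = {};
  for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const va = a[key] ?? 0;
    const vb = b[key] ?? 0;
    if (typeof va === "number" && typeof vb === "number" && va !== vb) {
      out[key] = { from: va, to: vb, delta: vb - va };
    }
  }
  return out;
}
```

Nested fields such as `links: {internal, external}` would be diffed recursively in the same way.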
## Monitor Mode (`--monitor`)

The first run saves a baseline; subsequent runs diff against it:

```bash
# First run: saves baseline
/browser-extract "https://example.com" --monitor

# Later run: detects changes
/browser-extract "https://example.com" --monitor
# Output: "3 sections changed, 2 links added, copyright updated"
```

Baselines are stored in `analyze-new-artifacts/coditect-browser-analysis/.baselines/`.
## Options

| Option | Description |
|---|---|
| `<url>` | URL to extract (required) |
| `--json` | Output as nested JSON |
| `--both` | Output markdown + JSON |
| `--csv` | Extract tables as CSV |
| `--sections` | Headings + content by section |
| `--links` | Link graph with internal/external classification |
| `--forms` | Form structure and field inventory |
| `--tables` | Tables as markdown or CSV |
| `--meta` | SEO metadata extraction and assessment |
| `--analyze` | Full content + UX analysis with recommendations |
| `--a11y` | Accessibility audit from snapshot |
| `--compare <url>` | Side-by-side comparison with another URL |
| `--monitor` | Change detection against saved baseline |
| `--output <path>` | Custom output file path |
| `--screenshot` | Include full-page screenshot |
| `--depth <N>` | Snapshot depth limit |
| `--scope <selector>` | CSS selector to scope extraction |
| `--no-close` | Keep browser open after extraction |
| `--help` | Show this help |
## Output Paths

| Mode | Default Output Path |
|---|---|
| Basic extract | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.md` |
| JSON | `analyze-new-artifacts/coditect-browser-analysis/{domain}-extract.json` |
| Analysis | `analyze-new-artifacts/coditect-browser-analysis/{domain}-analysis.md` |
| Compare | `analyze-new-artifacts/coditect-browser-analysis/{domain}-vs-{domain2}-compare.md` |
| Monitor baseline | `analyze-new-artifacts/coditect-browser-analysis/.baselines/{domain}-{date}.json` |
| Screenshot | `analyze-new-artifacts/coditect-browser-screenshots/{domain}.png` |
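The `{domain}` placeholder in these paths is the URL's hostname with dots replaced by hyphens; a sketch of that mapping:

```javascript
// Derive the {domain} filename slug from a URL: hostname with dots
// replaced by hyphens (e.g. docs.coditect.ai -> docs-coditect-ai).
function domainSlug(url) {
  return new URL(url).hostname.replace(/\./g, "-");
}
```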
## Examples

### Quick Extract to Markdown

```bash
/browser-extract "https://docs.coditect.ai"
# Creates: analyze-new-artifacts/coditect-browser-analysis/docs-coditect-ai-extract.md
```

### Full Analysis with Screenshot

```bash
/browser-extract "https://auth.coditect.ai" --analyze --screenshot
# Creates: auth-coditect-ai-analysis.md + auth-coditect-ai.png
```

### Extract Just the Main Content

```bash
/browser-extract "https://blog.example.com/post" --scope "article" --sections
# Extracts only the <article> element, organized by sections
```

### Compare Two Versions

```bash
/browser-extract "https://staging.example.com" --compare "https://example.com"
# Creates: staging-example-com-vs-example-com-compare.md
```

### SEO Quick Check

```bash
/browser-extract "https://example.com" --meta
# Outputs: title length, meta description, OG tags, structured data, robots
```
## Success Output

```
/browser-extract: https://example.com
Format: markdown
Sections: 8
Words: 1,842
Links: 12 internal, 3 external
Output: analyze-new-artifacts/coditect-browser-analysis/example-com-extract.md
```
## Related

- Command: `/browser` — Direct browser control
- Command: `/browser-research` — Multi-page agentic extraction
- Skill: `browser-content-extraction` — Extraction patterns
- Skill: `browser-automation-patterns` — Browser workflow patterns
- Agent: `coditect-browser-agent` — Browser automation agent
## Principles
This command embodies:
- #3 Complete Execution — open -> extract -> structure -> analyze -> write -> close in one command
- #6 Clear, Understandable — Markdown output readable by anyone
- #9 Based on Facts — Content extracted from real page state, not assumptions
**Command Version:** 1.0.0 | **Created:** 2026-02-08 | **Author:** CODITECT Core Team