5 ways to automate a browser with an AI agent

TL;DR: browser-use leads on benchmarks (89.1% WebVoyager), Stagehand wins on cost with action caching, Playwright CLI is 4x more token-efficient than MCP, and Skyvern handles sites with no usable DOM. To record and share what your agent does, screencli turns any of these into a polished video with one command.

The landscape in 30 seconds

AI agents that control browsers need a framework to do it. There are now five serious options, each with different tradeoffs on token cost, speed, and reliability. Here’s how they compare in March 2026.

Framework	Language	GitHub stars	WebVoyager score	Cost	Best for
browser-use	Python	78,000+	89.1%	~$0.07 / 10 steps	Autonomous agents
Stagehand	TypeScript	21,000+	—	$0.002–0.02	Hybrid automation
Playwright MCP	Any (MCP)	—	—	Free (high token cost)	Chat-style agents
Playwright CLI	Any (shell)	—	—	Free (low token cost)	Coding agents
Skyvern	Python	20,000+	85.85%	From $29/mo	Legacy/gov sites
screencli	Any (CLI)	—	—	Free / Pro $12/mo	Recording agent sessions

browser-use: highest benchmark score

browser-use holds the current state of the art for autonomous web interaction at 89.1% on WebVoyager (586 diverse web tasks). It’s model-agnostic: it works with Claude, GPT-4o, Gemini, or local models via LiteLLM.

The tradeoff is token consumption. Each step requires an LLM call, so a 10-step task costs roughly $0.07, and that adds up on long sessions. But for complex, multi-step workflows where the agent needs to reason about what it sees, it posts the highest WebVoyager score of the tools compared here.
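To get a feel for how that adds up, here is a back-of-the-envelope estimate. It assumes cost scales linearly with step count at the ~$0.07 per 10 steps quoted above; the helper name and the linear model are illustrative assumptions, and real cost will vary with the model and page size.

```typescript
// Rough figure from the comparison above: ~$0.07 per 10 steps.
const USD_PER_STEP = 0.07 / 10;

// Hypothetical helper: assumes every step costs the same, which real sessions won't.
function estimateSessionCostUsd(steps: number, runsPerDay: number = 1): number {
  if (steps < 0 || runsPerDay < 0) {
    throw new RangeError("steps and runsPerDay must be non-negative");
  }
  return steps * runsPerDay * USD_PER_STEP;
}

// A 50-step task run 100 times a day.
console.log(estimateSessionCostUsd(50, 100).toFixed(2));
```

Under this model, a 50-step task run 100 times a day comes to about $35 a day, which is where per-step pricing starts to matter.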

pip install browser-use

Use it when: your agent needs to handle unpredictable pages autonomously and you’re working in Python.

Stagehand: cheapest repeat runs

Stagehand takes a different approach. Instead of full autonomy, it gives you three primitives — act, extract, observe — that combine Playwright’s precision with AI reasoning.

The killer feature in v3 (February 2026) is action caching. Actions that succeed once are stored and replayed without an LLM call on subsequent runs. Browserbase reports 44% faster execution on average and up to 80% speedup on repeated workflows, with ~30% cost reduction.

At $0.002–0.02 per action with caching, it’s the cheapest option for repetitive workflows.
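The idea behind action caching can be sketched in a few lines. This is not Stagehand’s actual implementation: the class, the cache key, and the `resolveWithLlm` callback are all hypothetical stand-ins, meant only to show why a repeat run can skip the model call.

```typescript
// Illustrative only: not Stagehand's real cache. All names here are hypothetical.
type ResolvedAction = { selector: string; method: "click" | "fill" };

class ActionCache {
  private store = new Map<string, ResolvedAction>();
  private resolveWithLlm: (instruction: string) => ResolvedAction;
  llmCalls = 0;

  // `resolveWithLlm` stands in for the expensive model call that maps an
  // instruction to a concrete selector.
  constructor(resolveWithLlm: (instruction: string) => ResolvedAction) {
    this.resolveWithLlm = resolveWithLlm;
  }

  resolve(url: string, instruction: string): ResolvedAction {
    const key = `${url}::${instruction}`;
    const cached = this.store.get(key);
    if (cached) return cached; // cache hit: no LLM call

    this.llmCalls += 1;
    const action = this.resolveWithLlm(instruction);
    this.store.set(key, action);
    return action;
  }

  // Drop an entry when replay fails, e.g. because the page layout changed.
  invalidate(url: string, instruction: string): void {
    this.store.delete(`${url}::${instruction}`);
  }
}

const cache = new ActionCache(() => ({ selector: "button#sign-in", method: "click" }));
cache.resolve("https://your-app.com", "click the Sign In button"); // first run: calls the model
cache.resolve("https://your-app.com", "click the Sign In button"); // repeat run: served from cache
console.log(cache.llmCalls);
```

Two resolves, one model call. The `invalidate` method reflects the usual catch with this design: a cached action is only good until the page changes, so a failed replay has to fall back to the model.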

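// Stagehand is a named export, not a default export
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL" }); // run against a local browser
await stagehand.init();
// In v3, pages are reached through the context (v2 exposed stagehand.page)
const page = stagehand.context.pages()[0];
await page.goto("https://your-app.com");
await stagehand.act("click the Sign In button");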

Use it when: you’re in TypeScript, your workflows are repeatable, and you want to minimize LLM spend.

Playwright MCP: zero setup, high token cost

Microsoft’s official Playwright MCP server gives any AI agent browser control through the Model Context Protocol. It uses the accessibility tree for interactions instead of screenshots, which means fast, text-based actions with no vision model overhead.

The catch: the MCP schema for Playwright’s 26 tools costs ~3,600 tokens just to load. A content-rich page can return thousands more tokens of accessibility data per action. One benchmark measured 114,000 tokens for a typical automation task via MCP. For more on why token cost matters, see “Context engineering is the new prompt engineering.”

# Claude Code
claude mcp add playwright -- npx @playwright/mcp@latest

Use it when: you need plug-and-play browser control in a chat-style agent without filesystem access. Accept the token overhead.

Playwright CLI: 4x fewer tokens

Playwright CLI launched in February 2026 as Microsoft’s answer to the MCP token problem. Same Playwright engine, but it saves state to disk instead of streaming it back into the context window.
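A minimal sketch of why that matters, assuming a crude 4-characters-per-token heuristic rather than a real tokenizer. The function names, the file name, and the fake snapshot are all illustrative; this is not how either Playwright tool is implemented, only the shape of the tradeoff.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Crude heuristic (an assumption, not a real tokenizer): ~4 characters per token.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// MCP-style: the whole snapshot goes back into the model's context.
function respondInline(snapshot: string): string {
  return snapshot;
}

// CLI-style: the snapshot is written to disk and only a short pointer is returned.
function respondViaDisk(snapshot: string, dir: string): string {
  const file = path.join(dir, "snapshot.txt");
  fs.writeFileSync(file, snapshot, "utf8");
  return `Snapshot saved to ${file}`;
}

// Stand-in for a large accessibility snapshot.
const snapshot = '- button "Sign In" [ref=e1]\n'.repeat(2000);
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "snapshots-"));

console.log("inline tokens:", approxTokens(respondInline(snapshot)));
console.log("disk tokens:", approxTokens(respondViaDisk(snapshot, dir)));
```

The inline reply grows with the page; the disk reply stays the size of a file path, and the agent reads only the parts of the file it actually needs.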

The numbers: 27,000 tokens for the same task that costs 114,000 via MCP — a 4x reduction. The skill definition is ~68 tokens total versus 3,600 for the MCP schema. On longer sessions, early adopters report up to 10x fewer tokens.
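For reference, the arithmetic behind those figures. The inputs are the numbers quoted above, taken as reported rather than independently measured; the variable names are just for illustration.

```typescript
// Figures quoted above; treat them as reported, not independently measured.
const mcpTaskTokens = 114000;
const cliTaskTokens = 27000;
const mcpSchemaTokens = 3600;
const cliSkillTokens = 68;

const taskRatio = mcpTaskTokens / cliTaskTokens;      // about 4.2x fewer tokens per task
const schemaRatio = mcpSchemaTokens / cliSkillTokens; // about 53x smaller up-front definition
const tokensSaved = mcpTaskTokens - cliTaskTokens;    // 87,000 tokens saved on this task

console.log(taskRatio.toFixed(1), schemaRatio.toFixed(1), tokensSaved);
```

The per-task ratio is the headline 4x, but the up-front difference is far larger: the skill definition is roughly 1/53 the size of the MCP schema.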

# Install
npm i @playwright/cli@latest

# Use from any coding agent
npx playwright-cli navigate https://your-app.com
npx playwright-cli click "Sign In"
npx playwright-cli screenshot

Use it when: your agent has filesystem access (Claude Code, Cursor, Copilot) and you care about token efficiency.

Skyvern: no selectors needed

Skyvern uses computer vision + LLM reasoning to interact with pages without relying on DOM selectors or accessibility trees. It looks at what’s on screen and decides what to click.

This makes it the only viable option for government portals, legacy enterprise apps, and sites where the DOM is inaccessible or meaningless. It scored 85.85% on WebVoyager with its 2.0 release, and it’s the best-performing agent specifically on form-filling tasks.

Starting at $29/month with 30,000 credits, it’s priced for production use rather than experimentation.

Use it when: you’re automating sites with inaccessible DOMs, heavy iframes, or anti-bot measures that break selector-based tools.

Record what your agent does with screencli

These frameworks automate the browser. But none of them produce a shareable video of what happened. That’s the missing piece.

screencli is an open-source screen recording CLI built for AI agents. It wraps Playwright under the hood — your agent navigates the page, and screencli records the session with auto-trim, auto-zoom, click highlights, and gradient backgrounds. One command, one shareable link.

npx screencli record https://your-app.com -p "Demo the checkout flow"
# → https://screencli.sh/v/a3f2c8e1

It pairs with any of the frameworks above. Your agent automates the browser. screencli turns that session into a polished video you can drop into a PR, a changelog, or a tweet. For a full step-by-step tutorial, see how to record product demo videos with Claude Code.

How to pick

Budget-constrained, repeatable workflows: Stagehand. Action caching pays for itself fast.

Highest reliability on unknown pages: browser-use. The benchmark scores speak for themselves.

Token-efficient coding agent integration: Playwright CLI. 4x savings over MCP, disk-based state.

Legacy or government sites: Skyvern. Vision-based approach bypasses DOM entirely.

Quick prototype, no filesystem: Playwright MCP. Zero config, as long as you can accept the token cost.

Record and share the result: screencli. Turns any agent session into a shareable video.
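If it helps, the automation picks above can be written down as a small decision function. This is a hypothetical helper, not part of any of these tools: the field names are made up for illustration, and the order of the checks is an assumption you should adjust to your own priorities.

```typescript
// Hypothetical helper encoding the rules of thumb above.
type Needs = {
  domUnusable?: boolean;   // legacy or government sites with an inaccessible DOM
  repeatable?: boolean;    // the same workflow runs many times
  unknownPages?: boolean;  // unpredictable pages that need full autonomy
  hasFilesystem?: boolean; // the agent can read and write files (coding agents)
};

type Tool = "Skyvern" | "Stagehand" | "browser-use" | "Playwright CLI" | "Playwright MCP";

function pickTool(needs: Needs): Tool {
  if (needs.domUnusable) return "Skyvern";          // vision-based, no selectors
  if (needs.repeatable) return "Stagehand";         // action caching pays off
  if (needs.unknownPages) return "browser-use";     // highest benchmark score
  if (needs.hasFilesystem) return "Playwright CLI"; // disk-based state, fewer tokens
  return "Playwright MCP";                          // zero config fallback
}

console.log(pickTool({ hasFilesystem: true }));
```

Recording is orthogonal to this choice, so screencli is left out of the function: it layers on top of whichever tool you pick.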

FAQ

Which browser automation framework is best for Claude Code? Playwright CLI. It was designed for coding agents with filesystem access and uses 4x fewer tokens than Playwright MCP. browser-use is a strong alternative for complex multi-step tasks.

How much does browser automation cost with AI agents? It ranges from free tooling where you pay only for model tokens (Playwright CLI/MCP) to ~$0.07 per 10-step task (browser-use) to plans starting at $29/month (Skyvern). Stagehand’s action caching can reduce repeat workflow costs by ~30%.

Can I record what my AI agent does in the browser? Yes. Tools like screencli record AI-driven browser sessions into shareable videos with auto-zoom, click highlights, and gradient backgrounds — one command, no manual recording.

What’s the difference between Playwright MCP and Playwright CLI? Both use the same Playwright engine. MCP streams browser state into the LLM context window (high token cost, no filesystem needed). CLI saves state to disk and lets the agent read what it needs (low token cost, requires filesystem access).

Which framework has the highest success rate? browser-use leads with 89.1% on WebVoyager. Skyvern follows at 85.85%, with particular strength on form-filling tasks.

Try screencli free → screencli.sh