blogccasion

Testing browser-use, a scriptable AI browser agent

I'm not a big LinkedIn user, but the other day, my Google colleague Franziska Hinkelmann posted something about a project called browser-use that caught my eye:

Got low stakes repetitive tasks in the browser? Playwright + LLMs (Gemini 2.0) to the rescue! Super easy to make somebody else cough agents cough do the work for you, especially if you have to repeat a task for many rows in a Google Sheet.

After seeing her demo, I went and tried it out myself. Here are the steps that worked for me on macOS:

  1. Install uv following their installation instructions. (The usual caveat of first checking the source code before pasting anything in the Terminal applies.)

    curl -LsSf https://astral.sh/uv/install.sh | less
  2. Create a new Python environment and activate it. This is from browser-use's quickstart instructions.

    uv venv --python 3.11
    source .venv/bin/activate
  3. Install the dependencies and Playwright.

    uv pip install browser-use
    playwright install
  4. Create a .env file and add your OpenAI API key in the form OPENAI_API_KEY=abc123.

  5. Create an agent.py file with the source code of your agent. Here's the one I tried. As you can see, I'm tasking the agent with the following job: "Go to developer.chrome.com and find out what built-in AI APIs Chrome supports".

    from langchain_openai import ChatOpenAI
    from browser_use import Agent
    import asyncio
    from dotenv import load_dotenv
    load_dotenv()
    
    async def main():
        agent = Agent(
            task="Go to developer.chrome.com and find out what built-in AI APIs Chrome supports.",
            llm=ChatOpenAI(model="gpt-4o"),
        )
        result = await agent.run()
        print(result)
    
    asyncio.run(main())

The coolest thing is that you can watch the agent do its job in real time and see the reasoning steps in the command line.

Headless browser window with all UI elements annotated.

Here's an excerpt of the logs.

INFO     [agent] πŸš€ Starting task: Go to developer.chrome.com and find out what built-in AI APIs Chrome supports.
INFO     [agent] πŸ“ Step 1
INFO     [agent] 🀷 Eval: Unknown - It's the starting state.
INFO     [agent] 🧠 Memory: Start searching for built-in AI APIs supported by Chrome on developer.chrome.com.
INFO     [agent] 🎯 Next goal: Navigate to developer.chrome.com
INFO     [agent] πŸ› οΈ  Action 1/1: {"go_to_url":{"url":"https://developer.chrome.com"}}
INFO     [controller] πŸ”—  Navigated to https://developer.chrome.com
INFO     [agent] πŸ“ Step 2
INFO     [agent] πŸ‘ Eval: Success - Navigated to the correct URL.
INFO     [agent] 🧠 Memory: Visited developer.chrome.com. Need to search for information on built-in AI APIs supported by Chrome.
INFO     [agent] 🎯 Next goal: Use search input to find information on AI APIs.
INFO     [agent] πŸ› οΈ  Action 1/1: {"click_element":{"index":9}}
INFO     [controller] πŸ–±οΈ  Clicked button with index 9:
INFO     [agent] πŸ“ Step 3
INFO     [agent] πŸ‘ Eval: Success - The search bar was clicked but no input was entered yet.
INFO     [agent] 🧠 Memory: Need to input search query about AI APIs in Chrome using the search bar.
INFO     [agent] 🎯 Next goal: Enter 'AI APIs' in the search input to find relevant information.
INFO     [agent] πŸ› οΈ  Action 1/1: {"input_text":{"index":4,"text":"AI APIs"}}
INFO     [controller] ⌨️  Input AI APIs into index 4
[…]

At the end of the log messages is the agent's full response to your task, formatted in Markdown. FWIW, the answer is exactly correct:

Chrome supports several built-in AI APIs, including:

  1. Translator API - Available in origin trial for early preview participants. Use cases include on-demand translation and supporting multi-language communication.
  2. Language Detector API - Available in Chrome origin trial. It helps detect languages without user input and can label texts for better screen reader pronunciation.
  3. Summarizer API - Allows for local experimentation to preview program participants. It can summarize meeting transcripts, articles, and forum questions.
  4. Writer and Rewriter APIs - Experimental status in early preview program, used for creating and refining text content.
  5. Prompt API - Allows natural language requests to Gemini Nano in Chrome, in an experimental early stage.

Visit developer.chrome.com for complete details and participation in early trials.

It's pretty wild what this scriptable agent is capable of doing today. Be sure to check out some of the other demos and also try the browser-use web-ui, which adds a nice UI on top.