Make Models Use Your Computer: Hands-On With Gemini Computer Use And ChatGPT Agents



A window opens. No hands on the mouse. A cursor glides to a button, fills a form, and checks out a cart while you sip coffee. It is not magic. It is your AI using a browser like a person.

This guide shows how to make models use your computer, hands-on, with Gemini Computer Use and ChatGPT Agents. You will learn what “computer use” really means, how to set it up with Gemini’s new Computer Use tools, how to do it with ChatGPT Agents and Operator, and how to ship safe flows that book, buy, and confirm without drama. We will keep the language simple, the steps direct, and the examples practical.


Why “computer use” matters now

Large models can read, reason, and talk. The leap is when they can act on a screen. Both Google and OpenAI now provide agents that control real user interfaces. Google released the Gemini 2.5 Computer Use model and tool so agents can click, type, scroll, and complete browser tasks through the Gemini API. It is built to see screenshots, plan actions, and operate UI elements with lower latency than earlier attempts, and it targets web and mobile flows. (blog.google)

OpenAI launched Operator, powered by a Computer-Using Agent that uses vision and reinforcement learning to work inside a browser. The system clicks buttons, fills fields, and hands back control at sensitive moments like logins or payments. The same push shows up in the broader ChatGPT Agent feature, which gives ChatGPT its own virtual computer to finish multi-step jobs. (OpenAI)

The result is simple. Your model can do more than draft an email: it can open Gmail, paste the copy, attach a file, and schedule the send. It can do more than suggest hotels: it can reserve a room after you approve a summary.


What “computer use” means in real life

  • Web admin chores: update a CMS field, upload media, publish a draft, or fix a broken link when no API exists. Gemini Computer Use and the Google ADK show how to drive a Chromium browser through Playwright with model-guided actions. (Google GitHub)
  • Shopping and checkout: in ChatGPT, Instant Checkout lets you confirm a product and pay in the same chat using the Agentic Commerce Protocol, which OpenAI is opening to merchants. This turns a conversation into a transaction without sending you to a long checkout funnel. (OpenAI Developers)
  • Travel planning: search flights, compare hotels, and lock a restaurant reservation, with a short approval gate before each commit. Operator’s design centers on GUI interaction plus human handoff where needed. (OpenAI)
  • Research at scale: ChatGPT’s deep research mode plans, browses, and cites sources, then you approve actions that need control or checkout. (The Verge)

The core idea in one sentence

Your agent watches the screen, reasons about what to click, proposes a plan, gets your approval, and then acts, step by step, with logs you can audit. Gemini Computer Use and ChatGPT Agents both support this pattern through their own stacks. (Google AI for Developers)
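To make that sentence concrete, here is a minimal, platform-neutral sketch of the plan, approve, act, and log cycle. The names Step, AuditLog, and run_plan are hypothetical and belong to neither Gemini nor ChatGPT; the approve and act callbacks stand in for your real approval UI and browser runner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str
    irreversible: bool = False  # e.g. paying, booking, publishing


@dataclass
class AuditLog:
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, event: str, detail: str) -> None:
        self.entries.append((event, detail))


def run_plan(steps: List[Step],
             approve: Callable[[Step], bool],
             act: Callable[[Step], None],
             log: AuditLog) -> bool:
    """Execute steps in order, pausing for approval before irreversible ones."""
    for step in steps:
        if step.irreversible and not approve(step):
            log.record("rejected", step.description)
            return False  # stop the whole plan when the user says no
        act(step)
        log.record("done", step.description)
    return True


# Example: the user declines the payment step, so the plan halts there.
plan = [Step("Open cart"), Step("Pay $42.00", irreversible=True)]
log = AuditLog()
finished = run_plan(plan, approve=lambda s: False, act=lambda s: None, log=log)
```

Keeping approval as a callback means the same loop works whether the confirmation comes from a chat reply, a CLI prompt, or a button in your own UI.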


Set up Gemini Computer Use

What you get
A Gemini model plus a “Computer Use” tool that reads screenshots and emits concrete UI actions. You run those actions with a safe automator, often Playwright. Google’s docs and the Agent Development Kit show supported actions, constraints, and how to link the tool to a headless browser. (Google AI for Developers)

Install and scaffold

  1. Get API access and credentials for the Gemini API.
  2. Install Playwright and the ADK toolset. (Google GitHub)
  3. Create a small runner that loops: screenshot → model action plan → execute → verify.

Minimal Python sketch

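# pip install google-genai playwright
# playwright install chromium
from google import genai
from google.genai import types
from playwright.sync_api import sync_playwright

client = genai.Client(api_key="YOUR_GEMINI_KEY")
# Preview model ID at the time of writing; check the Gemini docs for the current one.
MODEL = "gemini-2.5-computer-use-preview-10-2025"
WIDTH, HEIGHT = 1280, 800
CONFIG = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))]
)

def suggest_actions(screenshot_png, goal_text):
    # Ask the Computer Use model for the next UI action(s).
    # Simplification: every turn re-sends the goal with a fresh screenshot.
    # The documented loop instead returns a function response per executed action.
    response = client.models.generate_content(
        model=MODEL,
        contents=[types.Content(role="user", parts=[
            types.Part(text=goal_text),
            types.Part.from_bytes(data=screenshot_png, mime_type="image/png"),
        ])],
        config=CONFIG,
    )
    parts = response.candidates[0].content.parts or []
    return [part.function_call for part in parts if part.function_call]

def to_pixels(args):
    # The model reports coordinates on a 0-999 grid; scale them to the viewport.
    return args["x"] * WIDTH / 1000, args["y"] * HEIGHT / 1000

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": WIDTH, "height": HEIGHT})
    page.goto("https://example.com/login")
    for _ in range(30):  # hard cap so a confused agent cannot loop forever
        actions = suggest_actions(page.screenshot(), "Sign in, open profile, export CSV")
        if not actions:  # no proposed action: the model has nothing more to do
            break
        for a in actions:
            # Only three of the documented actions are handled in this sketch,
            # and optional flags on those actions are ignored.
            if a.name == "click_at":
                page.mouse.click(*to_pixels(a.args))
            elif a.name == "type_text_at":
                page.mouse.click(*to_pixels(a.args))
                page.keyboard.type(a.args["text"])
            elif a.name == "navigate":
                page.goto(a.args["url"])
        page.wait_for_load_state()  # let the page settle before the next screenshot
        # simple stop condition
        if page.url.endswith("export.csv"):
            break
    browser.close()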

How it works
The Computer Use model sees the live screenshot and proposes UI actions like click, type, or navigate. Your code executes those actions, waits for the page to settle, and loops. The Google docs describe the action schema and the best way to verify progress. (Google AI for Developers)

Design for safety

  • Add a cost and risk summary before any irreversible step.
  • Confirm destination URLs.
  • Keep an allowlist of permitted domains.
  • Store a screenshot on each loop for auditing.
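The domain rule in the list above can be enforced before every navigation with a small check like the one below. This is a sketch under assumptions: the hosts in ALLOWED_HOSTS are placeholders, and is_allowed is a hypothetical helper, not part of Playwright or the Gemini SDK.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the domains your flow really needs.
ALLOWED_HOSTS = {"example.com", "accounting.example.org"}


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs whose host is allowlisted or a subdomain of it."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Compare on whole labels so "example.com.evil.net" does not slip through.
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

Call it on the target URL before each navigation and again on the page URL after each action, since a click can redirect the browser somewhere you never asked to go.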

When Gemini Computer Use shines

  • Sites without good APIs.
  • Mixed web and mobile tasks.
  • Visual tasks like dragging a file or handling rich widgets. Google’s announcement highlights improved UI control accuracy and latency on benchmarks for web and mobile tasks. (blog.google)

Set up ChatGPT Agents and Operator

What you get
Two related paths:

  • ChatGPT Agent gives ChatGPT a virtual machine to think and act across steps. You can interrupt, steer, and resume. It is built for iterative work with a visible plan and check-ins. (OpenAI)
  • Operator focuses on direct browser control. It uses a Computer-Using Agent trained with vision and reinforcement learning to interact with GUIs and hand control back at key moments. (OpenAI)

Why developers care
You can build your own agent flows with Agent Builder and tool calling. Agent Builder provides a visual canvas for multi-step workflows and an SDK to export code. Tool calling lets the model propose function calls with structured arguments while your app decides if and how to execute them. (OpenAI Platform)

Basic tool calling loop

// Node.js example with OpenAI
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const tools = [{
  type: "function",
  function: {
    name: "book_hotel",
    description: "Reserve a hotel room",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string" },
        date: { type: "string" },
        maxPrice: { type: "number" }
      },
      required: ["city", "date"]
    }
  }
}];

const messages = [
  { role: "system", content: "Plan first, ask for approval before booking." },
  { role: "user", content: "Find a hotel near the arena on Jan 14 under 200." }
];

const first = await client.chat.completions.create({
  model: "gpt-5", // or current model
  messages,
  tools,
  tool_choice: "auto"
});

const call = first.choices[0].message.tool_calls?.[0];

if (call?.function?.name === "book_hotel") {
  const args = JSON.parse(call.function.arguments);
  // run your booking code or open an Operator task here
}

What the docs say
Tool calling is a multi-step dialogue where the model proposes actions and your app executes them. Agent Builder lets you compose flows, add guardrails, and ship without hand-rolling orchestration. (OpenAI Platform)
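The "your app executes them" half of that dialogue deserves its own guard. Below is a minimal Python sketch of the receiving side: it takes a tool name and the JSON argument string the model proposed, validates them, and runs the handler only after approval. The HANDLERS table, execute_tool_call, and the book_hotel stub are hypothetical illustrations, not part of the OpenAI SDK.

```python
import json
from typing import Optional


def book_hotel(city: str, date: str, maxPrice: Optional[float] = None) -> str:
    """Hypothetical stand-in for real booking code."""
    limit = f" under {maxPrice}" if maxPrice is not None else ""
    return f"Booked {city} on {date}{limit}"


# tool name -> (handler, required argument names, optional argument names)
HANDLERS = {"book_hotel": (book_hotel, {"city", "date"}, {"maxPrice"})}


def execute_tool_call(name: str, arguments_json: str, approved: bool) -> dict:
    """Validate a model-proposed call, then run it only if the user approved."""
    if name not in HANDLERS:
        return {"status": "error", "detail": f"unknown tool: {name}"}
    handler, required, optional = HANDLERS[name]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"status": "error", "detail": "arguments are not valid JSON"}
    if not isinstance(args, dict):
        return {"status": "error", "detail": "arguments must be a JSON object"}
    missing = sorted(required - set(args))
    unexpected = sorted(set(args) - required - optional)
    if missing or unexpected:
        return {"status": "error",
                "detail": f"missing={missing} unexpected={unexpected}"}
    if not approved:
        # Hand the parsed arguments back so the UI can show them for approval.
        return {"status": "needs_approval", "detail": args}
    return {"status": "ok", "detail": handler(**args)}
```

This sketch checks argument names only, not types or ranges. A production version would validate values against the same JSON schema you sent to the model.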

Hands-on with Operator
Operator is built for GUI work. It types, clicks, scrolls, and tries again when the page changes. OpenAI describes the model, training approach, and the human handoff for logins and sensitive tasks. Early coverage shows it running inside ChatGPT for Pro users first, with a plan to expand access. (OpenAI)


Buying inside ChatGPT with Instant Checkout

If you want the agent to make a purchase in the chat, use Instant Checkout. It sits on the Agentic Commerce Protocol, an open standard that routes orders to merchants while keeping payment systems intact. Merchants integrate a few flows. Users approve the total and complete the order in one place. OpenAI is opening the protocol and working with large catalogs so shopping does not stop at links. (OpenAI Developers)

Recent news mentions live pilots with marketplaces and retailers, which shows how fast agentic commerce is moving into real catalogs. (Barron’s)


A clear comparison: Gemini Computer Use vs ChatGPT Agents

Scope and control

  • Gemini Computer Use focuses on translating screenshots to UI actions you execute through your code. You run the browser, set rules, and keep tight control. The ADK documents a Playwright toolset that drives Chromium and exposes click and type primitives. (Google GitHub)
  • ChatGPT Agents and Operator give you a higher level of built-in GUI control inside ChatGPT, plus a full agent platform with tool calling, Agent Builder, and a path to in-chat commerce. (OpenAI Platform)

Strengths

  • Gemini shines when you need a programmable, headless runner that you own, and when you want strict domain whitelists and custom verification between steps. Google describes lower latency and strong accuracy on browser tasks. (blog.google)
  • ChatGPT shines when your users already live in ChatGPT and want planning, screen control, and checkout in one place. The Operator model and Agent features are aimed at end-to-end tasks with human handoff baked in. (OpenAI)

Ecosystem

  • Gemini Computer Use lives in the Gemini API and ADK with Playwright. (Google AI for Developers)
  • ChatGPT Agents live with Agent Builder, ChatKit for embedding a UI, and the commerce protocol for purchases. (OpenAI Platform)

When to use which

  • Pick Gemini Computer Use for back-office automations, private dashboards, and any task where you want to host the runner.
  • Pick ChatGPT Agents when the chat is the product, you need fast prototyping, and you want a checkout inside the conversation.

Build a working project in a weekend

Goal: a “personal admin” that files invoices from email, updates your accounting site, and pings you when done.

Gemini path

  1. Scope: Gmail label, PDF invoice, accounting site with no API.
  2. Runner: Playwright with a Chromium profile that stays logged in. (Google GitHub)
  3. Loop:
    • Open Gmail, filter by label, download the latest invoice.
    • Go to accounting site, click “New bill,” fill fields from the PDF, attach file, save.
    • Screenshot after each step and verify that totals match.
  4. Approval: before saving, show a summary: vendor, date, amount.
  5. Log: store the screenshots, actions, and final URL in a sqlite file.
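The logging step in the Gemini path can be as small as one SQLite table. The sketch below uses only the Python standard library; the table layout and the helper names open_log and log_step are assumptions chosen for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS steps (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts TEXT NOT NULL,
               action TEXT NOT NULL,
               url TEXT NOT NULL,
               screenshot BLOB
           )"""
    )
    return conn


def log_step(conn: sqlite3.Connection, action: str, url: str,
             screenshot_png: bytes) -> int:
    """Store one executed action with its URL and screenshot; return the row id."""
    ts = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO steps (ts, action, url, screenshot) VALUES (?, ?, ?, ?)",
        (ts, action, url, screenshot_png),
    )
    conn.commit()
    return cur.lastrowid


# Example with an in-memory database and placeholder screenshot bytes.
conn = open_log()
first_id = log_step(conn, "click New bill",
                    "https://accounting.example.com/bills/new", b"\x89PNG")
rows = conn.execute("SELECT action, url FROM steps").fetchall()
```

Pass a file path such as "agent_log.sqlite" to open_log to keep the trail across runs.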

ChatGPT path

  1. Agent plan: use ChatGPT Agent to read your goal and propose the steps. (OpenAI)
  2. Tool: create a create_bill function for the accounting API if available. If not, route to Operator to handle the GUI. (OpenAI Platform)
  3. Approvals: the agent posts a bill summary and asks you to confirm in chat.
  4. Receipts: the agent emails a receipt and updates a private log.

Tip: keep the first version narrow. One vendor. One form. One site.


Prompts that steer agents well

Planning prompt

Prompt: Break the goal into numbered steps. Name the tool or GUI action for each step. Ask for approval before any irreversible step. After execution, return a short receipt with item, date, amount, and link.

Selector prompt

Prompt: When using a browser, prefer role, label, placeholder, and visible text over brittle CSS. After each action, verify by checking URL or a unique text on the page. If three retries fail, ask for human help.

Money prompt

Prompt: Before a purchase, show vendor, item, price, tax, shipping, and total. Wait for a plain “Approve purchase” reply. Do not proceed without it.

These match what both platforms recommend in their docs and guides about clear instructions, approvals, and guardrails. (OpenAI CDN)
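A prompt alone is a soft guarantee, so the money rule is worth enforcing in code as well. Here is a minimal sketch, assuming amounts arrive as decimal strings; order_summary and is_approved are hypothetical helper names.

```python
from decimal import Decimal


def order_summary(vendor: str, item: str, price: str, tax: str, shipping: str) -> str:
    """Build the summary shown to the user, with the total computed exactly."""
    total = Decimal(price) + Decimal(tax) + Decimal(shipping)
    return (
        f"Vendor: {vendor}\nItem: {item}\nPrice: {price}\n"
        f"Tax: {tax}\nShipping: {shipping}\nTotal: {total}"
    )


def is_approved(reply: str) -> bool:
    """Accept only the exact phrase, ignoring case and surrounding whitespace."""
    return reply.strip().lower() == "approve purchase"


summary = order_summary("Acme Gifts", "Tea sampler", "19.99", "1.65", "4.00")
```

Decimal avoids the rounding surprises that floats introduce when adding currency amounts, and the strict phrase match means a casual "ok" or "sure" never triggers a payment.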


Safety and trust by design

  • Human in the loop: always confirm totals or dates before booking or buying. OpenAI and third-party reporting describe Operator pausing for sensitive actions. (OpenAI)
  • Domain allowlist: block actions outside approved domains.
  • PII scrubbing: mask sensitive fields in logs.
  • Signed actions: write an event trail with who, what, when, and where.
  • Kill switch: if the agent loops or the UI changes, stop and ask.
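Two items from the list above, PII scrubbing and the kill switch, are easy to sketch in plain Python. The regular expressions are deliberately crude and will miss or over-match some inputs, and scrub and LoopGuard are hypothetical names; treat this as a starting point, not a complete redaction or loop-detection scheme.

```python
import re
from collections import deque

# Rough patterns: 13 to 16 digits with optional space or dash separators, and emails.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(text: str) -> str:
    """Mask card-like numbers and email addresses before text reaches a log."""
    text = CARD_RE.sub("[CARD]", text)
    return EMAIL_RE.sub("[EMAIL]", text)


class LoopGuard:
    """Trip when the same (url, action) pair repeats too often in recent history."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def should_stop(self, url: str, action: str) -> bool:
        self.recent.append((url, action))
        return self.recent.count((url, action)) >= self.max_repeats


masked = scrub("Paid with 4111 1111 1111 1111, receipt to jane@example.com")
guard = LoopGuard()
trips = [guard.should_stop("https://example.com/cart", "click Pay") for _ in range(3)]
```

When should_stop returns True, halt the runner and ask a person to look, rather than letting the agent keep clicking the same button.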

Performance and limitations

Google says the Gemini 2.5 Computer Use model outperforms leading alternatives on web and mobile control benchmarks and improves latency. That is promising, but you should still design retries and fallbacks for flaky pages. (blog.google)
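One way to design those retries is a small wrapper that performs an action, verifies the result, backs off between attempts, and escalates to a person when it runs out of tries. This is a sketch under assumptions: act_with_retry and NeedsHuman are hypothetical names, and act and verify stand in for your own browser calls.

```python
import time
from typing import Callable


class NeedsHuman(Exception):
    """Raised when automated retries are exhausted."""


def act_with_retry(act: Callable[[], None],
                   verify: Callable[[], bool],
                   retries: int = 3,
                   base_delay: float = 0.5,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Run act(), confirm with verify(), and back off between failed attempts.

    Returns the attempt number that succeeded; raises NeedsHuman otherwise.
    """
    for attempt in range(1, retries + 1):
        try:
            act()
            if verify():
                return attempt
        except Exception:
            pass  # treat an action error like a failed verification
        if attempt < retries:
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise NeedsHuman(f"gave up after {retries} attempts")


# Example: the second click is the one that takes effect.
state = {"clicks": 0}


def click() -> None:
    state["clicks"] += 1


attempt = act_with_retry(click, lambda: state["clicks"] >= 2, sleep=lambda s: None)
```

Only wrap actions that are safe to repeat. Retrying a "Pay" click blindly can double-charge, so irreversible steps should verify first and escalate instead.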

Operator and ChatGPT Agent can handle long tasks, hand back control for logins, and recover from mistakes, but complex sites and CAPTCHAs still require a human step. Early coverage notes both progress and practical limits. (OpenAI)

Deep research handles planning and citations, yet even strong agents can misjudge source quality. Keep approvals on. Keep your logs tight. (The Verge)


Common pitfalls and fixes

  • Wrong button: prefer robust selectors and verify with page text or URL. The ADK guidance points to Playwright roles and labels for stability. (Google GitHub)
  • Form mismatch: add a schema check before submit.
  • Site update: store selectors in one file and re-validate them weekly.
  • Over-calling tools: tighten your system prompt and raise the decision threshold in your planner. Agent Builder helps you version guardrails. (OpenAI Platform)
  • Checkout confusion: summarize totals and require explicit approval. Instant Checkout flows are built around that step. (OpenAI Developers)
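The "schema check before submit" fix from the list above can be a plain function that returns a list of problems. The field names vendor, date, and amount mirror the bill example used elsewhere in this guide; validate_bill itself is a hypothetical helper, and your form will need its own rules.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def validate_bill(form: dict) -> list:
    """Return a list of problems; an empty list means the form is safe to submit."""
    errors = []
    if not str(form.get("vendor", "")).strip():
        errors.append("vendor is required")
    try:
        date.fromisoformat(str(form.get("date", "")))
    except ValueError:
        errors.append("date must be an ISO date such as 2026-01-14")
    try:
        if Decimal(str(form.get("amount", ""))) <= 0:
            errors.append("amount must be positive")
    except InvalidOperation:
        errors.append("amount must be a number")
    return errors


ok = validate_bill({"vendor": "Acme", "date": "2026-01-14", "amount": "120.50"})
bad = validate_bill({"vendor": " ", "date": "14/01/2026", "amount": "abc"})
```

Run the check on the values the agent is about to type, and again on the values read back from the page, so a field that was filled into the wrong box is caught before saving.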

Cookbook: five hands-on recipes

1) Meeting scheduler from an email thread

  • Parse meeting intent from the thread.
  • Open the calendar UI, create an event, invite the thread sender, and paste the agenda.
  • Ask for a thumbs up before sending.
  • Log the event link.

Use Gemini Computer Use if your calendar lacks an API and you want tight control. Use ChatGPT Agent if you want a chat-first experience with a visible plan.

2) Vendor onboarding in a portal

  • Click through a wizard that has no API.
  • Upload W-9, fill bank info, and set notification preferences.
  • Pause for your review at the final screen.
  • Save a PDF of the confirmation.

Gemini ADK with Playwright is ideal here. (Google GitHub)

3) Local dinner plan with in-chat checkout

  • Find three options under a budget.
  • Hold a table or choose pick-up.
  • Buy a small host gift with Instant Checkout after confirming totals. (OpenAI Developers)

4) CMS cleanup

  • Crawl a list of stale posts.
  • Open each editor page, fix broken links, and update alt text.
  • Save, then post a report.

Gemini Computer Use can verify changes by checking page text after each save. (Google AI for Developers)

5) Team research sprint

  • Use deep research to gather sources and draft a brief with citations.
  • Switch to Operator when a site requires a download or a tricky UI. (The Verge)

Developer notes for both stacks

Selectors beat pixels
Use roles, labels, and visible text. Avoid absolute x-y clicks when you can. Playwright supports semantic queries that survive layout shifts. (Google GitHub)

Plan, approve, act
Make the plan visible. Ask for approval. Execute. This matches both OpenAI’s and Google’s recommended patterns for agent reliability. (OpenAI Platform)

Instrument everything
Record actions, screenshots, and outcomes. Replay a sample set nightly to catch drift.
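A nightly replay can start as a comparison between recorded outcomes and fresh ones. In this sketch, recorded maps a flow name to the outcome you stored earlier, and replay is whatever function re-runs a flow; both, along with find_drift, are assumptions for illustration.

```python
import random
from typing import Callable, Dict, List


def find_drift(recorded: Dict[str, str],
               replay: Callable[[str], str],
               sample_size: int = 2,
               seed: int = 0) -> List[str]:
    """Re-run a random sample of recorded flows and list those whose outcome changed."""
    rng = random.Random(seed)  # seeded so a given night's sample is reproducible
    names = sorted(recorded)
    sample = rng.sample(names, min(sample_size, len(names)))
    return sorted(name for name in sample if replay(name) != recorded[name])


# Example: the "export" flow now fails, so it is reported as drifted.
recorded = {"login": "ok", "export": "ok", "publish": "ok"}
drifted = find_drift(recorded,
                     replay=lambda name: "failed" if name == "export" else "ok",
                     sample_size=3)
```

Vary the seed by date if you want a different sample each night while still being able to reproduce any single run.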

Buy only with clear consent
If using ChatGPT for commerce, follow the Agentic Commerce Protocol and show totals. Keep merchant systems intact and pass only what is required. (OpenAI Developers)

Ship small, then widen
Start with one site and one flow. Add branches after your evals are green.


Quick FAQ

Can I run Gemini Computer Use locally with full privacy?
You host the browser and runner. The model still runs in Google’s cloud through the Gemini API. Use domain allowlists and redact sensitive text in screenshots. (Google AI for Developers)

Can ChatGPT Agents handle checkout?
Yes, through Instant Checkout, which uses the Agentic Commerce Protocol. You approve totals in chat and the order routes to the merchant. (OpenAI Developers)

Is this only for power users?
No. Agent Builder reduces the glue code. Operator and ChatGPT Agent focus on friendly approvals. Gemini’s ADK gives you a starter browser toolset. (OpenAI Platform)

What about mobile apps?
Google notes Computer Use targets web and mobile control tasks. For mobile on your side, consider hosted device farms or vendor SDKs. (blog.google)


The future is agents that respect your time

We now have the parts to make models use your computer, with Gemini Computer Use and ChatGPT Agents, in a way that feels safe and useful. Gemini provides a precise, programmable path for UI control through your own runner. ChatGPT brings planning, a virtual computer, and in-chat checkout. Both expect you to stay in the loop, approve costs, and review results. That is the right balance for the work most of us do each day.

If you ship one narrow use case, you will feel the difference right away. No more 15 clicks to post a file. No more bouncing between tabs to complete a booking. The agent handles the grind. You handle the calls that matter.


Build checklist

  • Clear system prompt that enforces plan, approval, and execution
  • Domain allowlist and screenshot logging
  • Selector strategy with roles and labels
  • Approval gates on dates, money, and irreversible changes
  • Instant Checkout used only after a total summary is shown
  • Nightly replay and eval report

Final word

You do not need a giant team to automate real work now. With Gemini Computer Use and ChatGPT Agents, you can hand off repetitive web tasks, gain back hours, and keep full control over key decisions. Start small. Keep approvals tight. Add commerce only when it improves the experience. The tools are ready.


By James Fristik

Writer and IT geek. James grew up fascinated with technology. He is a bookworm with a thirst for stories. This led James down a path of writing poetry, short stories, and song lyrics, and of playing roleplaying games like Dungeons & Dragons. His love for technology began at 10 years old when his dad bought him his first computer. From 1999 until 2007 James learned about and repaired computers for family, friends, and strangers he was recommended to. His desire to learn web design, 3D graphic rendering, graphic arts, programming, and server administration propelled him into the Information Technology career he has pursued for the last 15 years.
