OpenAI rolls out GPT‑5.4, pushing ChatGPT from answers to actions in office software

In a recent internal test at legal technology startup Harvey, an artificial intelligence system read a set of dense contracts, drafted proposed changes, updated a partner’s spreadsheet and prepared a summary, work that would normally occupy a junior associate for hours. The system then stepped beyond the document window, logging into a web portal to file the paperwork, all without a person touching the keyboard.

The engine behind that test, OpenAI’s new GPT‑5.4 model, began rolling out March 5 and is now being embedded into ChatGPT, developer tools and a growing list of office software. It is the company’s most explicit bid yet to move AI from answering questions into doing work inside the same applications where white‑collar employees spend their day.

OpenAI describes GPT‑5.4 as “our most capable and efficient frontier model for professional work.” The system is optimized for what the company calls reasoning‑heavy, multi‑step tasks: writing and editing documents, generating and debugging code, analyzing data, and orchestrating tools, including direct control of computers and web browsers. It appears in ChatGPT for paying users under the label “GPT‑5.4 Thinking” and is available through OpenAI’s application programming interface (API) and the company’s Codex workflow platform.

The launch marks a turning point in how large AI models are being deployed in offices, and intensifies a competition with Google and Anthropic to automate more of the routine work of lawyers, bankers, consultants and back‑office staff. It also raises questions about how companies will manage the risks of giving software the ability to click, type and transact across real systems.

A model built to act, not just chat

GPT‑5.4 sits at the high end of OpenAI’s model lineup, following last year’s GPT‑5.2 and the specialized GPT‑5.3‑Codex for software development. It is not the default brain behind free ChatGPT, but a more expensive “frontier” option designed for complex work.

Technically, the model can accept both text and images and respond in text. It supports what OpenAI calls a reasoning control, letting developers dial how much computational effort the system spends on a given task. It can process up to roughly 1.05 million tokens of context — enough to ingest thousands of pages of contracts, financial statements or source code — and produce up to 128,000 tokens of output in a single run. Its training data extends through Aug. 31, 2025.
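For a rough sense of scale, the context figure can be converted into a page count. The sketch below is illustrative only: the 500‑tokens‑per‑page figure is an assumption, not a number from OpenAI, and real documents vary widely. The `reserved_tokens` parameter is likewise hypothetical, standing in for any room a deployment might set aside for instructions or a reply.

```python
# Figures reported for GPT-5.4 in this article.
CONTEXT_TOKENS = 1_050_000
MAX_OUTPUT_TOKENS = 128_000

# Assumption for illustration only: a dense page runs about 500 tokens.
ASSUMED_TOKENS_PER_PAGE = 500


def pages_that_fit(context_tokens: int, tokens_per_page: int,
                   reserved_tokens: int = 0) -> int:
    """Return how many whole pages fit after setting aside reserved tokens."""
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_page


# Whole window devoted to source material.
print(pages_that_fit(CONTEXT_TOKENS, ASSUMED_TOKENS_PER_PAGE))  # 2100

# Hypothetical case: hold back room equal to the maximum output length.
print(pages_that_fit(CONTEXT_TOKENS, ASSUMED_TOKENS_PER_PAGE,
                     reserved_tokens=MAX_OUTPUT_TOKENS))  # 1844
```

Under that assumption the window holds about 2,100 pages, which is consistent with the "thousands of pages" description; a denser or sparser page estimate would shift the result proportionally.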

GPT‑5.4 also underpins new integrations aimed squarely at office workflows. On launch day, OpenAI introduced a ChatGPT for Excel add‑in that runs in the spreadsheet’s sidebar, allowing users to ask natural‑language questions about data, generate formulas or have it build financial models and dashboards. Updated “skills” for spreadsheets and presentations in the company’s Codex toolchain tap GPT‑5.4 to design financial models and draft slide decks.

The most notable change, however, is not in how GPT‑5.4 writes but in how it acts.

OpenAI calls it “our first general‑purpose model with native computer‑use capabilities.” In practice, that means GPT‑5.4 can generate code that drives browser automation frameworks such as Playwright and can also issue mouse clicks and keyboard input by looking at screenshots of a user’s screen. With that, it can log in to web portals, fill out forms, navigate complex business software, and update records in line‑of‑business systems.

In internal tests on OSWorld‑Verified, a benchmark that measures an AI’s ability to navigate a desktop computer using screenshots plus mouse and keyboard, GPT‑5.4 completed 75% of tasks, OpenAI said. That compared with 47.3% for GPT‑5.2 and 72.4% for human participants in the same evaluation, meaning the new model slightly outperformed reported human accuracy in that setting.

On Online‑Mind2Web, a separate benchmark that tests browser navigation using only screenshots, GPT‑5.4 reached 92.8% task success. That is up from 70.9% for an earlier “agent mode” of ChatGPT that OpenAI had promoted in 2025 as an early step toward autonomous assistants.

“This is our first general‑purpose system that can reliably operate computers end‑to‑end across a wide range of applications,” the company said in its announcement, adding that behavior can be constrained by system instructions and custom confirmation rules.

Better at spreadsheets, slides and statutes

OpenAI and early partners say GPT‑5.4 is not just more active but more accurate than its predecessors on professional tasks.

On an internal benchmark called GDPval, which spans 44 occupations across large segments of the U.S. economy, the company reported that GPT‑5.4 matched or exceeded industry professionals in 83% of head‑to‑head comparisons. GPT‑5.2 scored 70.9% on the same set.

The improvements are especially pronounced in spreadsheets and presentations. On an investment banking–style modeling benchmark, GPT‑5.4 achieved an average score of 87.3%, up from 68.4% for GPT‑5.2, according to the model’s technical report. In a presentation‑generation test, human raters preferred slide decks generated by GPT‑5.4 over those from GPT‑5.2 in 68% of pairwise comparisons, citing more polished layout and better use of visuals.

Niko Grupen, head of applied research at Harvey, which builds AI tools for law firms, said in a statement that GPT‑5.4 “sets a new bar for document‑heavy legal work.” The model scored 91% on an internal “BigLaw Bench” evaluation that simulates complex tasks such as contract review and drafting in large law firms, the company said.

OpenAI also claims GPT‑5.4 is its “most factual model yet.” In tests on de‑identified user prompts where previous models had produced factual mistakes, GPT‑5.4 generated 33% fewer individual false claims and 18% fewer responses that contained any errors compared with GPT‑5.2.

On the coding side, GPT‑5.4 folds in many of the capabilities of GPT‑5.3‑Codex but is tuned to excel in workflows that involve multiple tools over longer sessions, such as building and iteratively testing a web application. On SWE‑Bench Pro, a standard benchmark that measures performance on software engineering tasks, GPT‑5.4 slightly outperformed both GPT‑5.3‑Codex and GPT‑5.2, though the gains were modest.

Enterprise pilots and a race for the office

Several companies that tested GPT‑5.4 before launch report that it makes automation substantially faster and cheaper.

Mainstay, a startup that automates homeowners’ association dues and property tax payments, said GPT‑5.4 achieved a 95% success rate on first attempts and 100% within three attempts when navigating roughly 30,000 different property portals. Chief executive Dod Fraser said the system also used about 70% fewer tokens (a measure of how much text the model processes) and completed sessions about three times faster than earlier models.

Tech outlets that examined the release have emphasized its office orientation. TechRadar called GPT‑5.4 OpenAI’s “most capable and efficient frontier model for professional work” and highlighted its strength with spreadsheets alongside the Excel integration. Tom’s Guide wrote that GPT‑5.4 “just made every other AI model look slow,” pointing to its million‑token context window and a new “/fast” mode in Codex that can generate responses up to 1.5 times faster at the same intelligence level.

Axios described the launch as a direct move into workplace tasks in competition with Alphabet’s Google and Anthropic. Google has been promoting its Gemini models inside Gmail, Docs and Sheets, while Anthropic has focused on enterprise contracts and research with its Claude assistant.

OpenAI is positioning GPT‑5.4 at the high end of the market. The company lists standard programming interface pricing at $2.50 per 1 million input tokens and $15 per 1 million output tokens, with discounted rates for cached input. A higher‑capacity GPT‑5.4 Pro variant, aimed at the most demanding workloads, is priced at $30 per 1 million input tokens and $180 per 1 million output tokens. By comparison, GPT‑5.2 inputs cost $1.75 per 1 million tokens on standard plans.
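To see what those rates mean for a single job, the listed prices can be applied to a token count. The following is a minimal sketch that uses only the per‑million figures above; the dictionary keys are labels chosen for this example rather than confirmed API model identifiers, the workload size is made up, and the cached‑input discount is left out because no rate is given.

```python
from decimal import Decimal

# USD per 1 million tokens, as listed in this article.
# Keys are illustrative labels, not confirmed API model identifiers.
PRICES = {
    "gpt-5.4": {"input": Decimal("2.50"), "output": Decimal("15")},
    "gpt-5.4-pro": {"input": Decimal("30"), "output": Decimal("180")},
}
MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request at the listed standard rates."""
    rates = PRICES[model]
    total = rates["input"] * input_tokens + rates["output"] * output_tokens
    return total / MILLION


# Hypothetical workload: 200,000 tokens of contracts in, 10,000 tokens out.
for name in PRICES:
    cost = estimate_cost(name, 200_000, 10_000)
    print(f"{name}: ${cost:.2f}")
# gpt-5.4: $0.65
# gpt-5.4-pro: $7.80
```

At these rates the Pro variant costs 12 times as much as the standard model for the same workload, since both its input and output prices are 12 times higher.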

Safety controls for “high cyber capability”

OpenAI has classified GPT‑5.4 as “high cyber capability” under its internal Preparedness Framework, the same risk tier it has used for earlier advanced models. That label, the company says, triggers stricter safeguards and monitoring.

Because GPT‑5.4 can operate computers, a central concern is preventing it from taking harmful actions or being manipulated through so‑called prompt injection, where malicious content in a web page or document tries to steer an AI agent into leaking data or executing risky commands.

To address that, OpenAI allows developers and enterprise customers to define custom confirmation policies for computer use. For example, an organization can require human approval for any action that moves money, changes customer data in bulk or sends emails to large distribution lists. Companies can also limit which tools and applications agents are allowed to access.
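A confirmation policy of that kind might look something like the sketch below. It is not OpenAI’s actual interface: the class names, fields and thresholds are all hypothetical, chosen only to show how the three example rules and a tool allowlist could be expressed as a check that runs before an agent’s action is executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical description of a step an agent wants to take."""
    tool: str                  # e.g. "browser", "email", "payments"
    moves_money: bool = False
    records_changed: int = 0
    email_recipients: int = 0


@dataclass(frozen=True)
class ConfirmationPolicy:
    """Hypothetical organization-defined rules; thresholds are illustrative."""
    allowed_tools: frozenset
    bulk_record_threshold: int = 100
    large_list_threshold: int = 50

    def decide(self, action: ProposedAction) -> str:
        # Tools outside the allowlist are refused outright.
        if action.tool not in self.allowed_tools:
            return "deny"
        # Money movement, bulk data changes and mass emails need a human.
        if (action.moves_money
                or action.records_changed >= self.bulk_record_threshold
                or action.email_recipients >= self.large_list_threshold):
            return "require_approval"
        return "allow"


policy = ConfirmationPolicy(allowed_tools=frozenset({"browser", "email"}))
print(policy.decide(ProposedAction(tool="browser")))                        # allow
print(policy.decide(ProposedAction(tool="email", email_recipients=500)))    # require_approval
print(policy.decide(ProposedAction(tool="payments", moves_money=True)))     # deny
```

Checking the allowlist first means a disallowed tool is blocked even if a person would have approved the action, which keeps the two controls independent.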

The company has also released an evaluation it calls “chain‑of‑thought controllability,” designed to test whether models can deliberately hide their reasoning steps to evade monitoring. In public documentation, OpenAI said GPT‑5.4 Thinking showed low ability to suppress its reasoning traces when asked, which it presents as a positive sign for oversight systems that rely on reviewing an AI’s intermediate steps.

Researchers outside the company have been working on related problems. Recent academic work on so‑called agentic AI has explored methods for teaching models when to act and when to refuse, and how to coordinate multiple agents across complex workflows without losing control. Security experts at large cloud providers, including Microsoft, have meanwhile warned enterprises about “double‑agent” risks, where poorly governed AI systems gain access to internal tools and act in unexpected ways.

Jobs, governance and what comes next

The arrival of GPT‑5.4 comes amid a broader debate about how quickly AI will reshape white‑collar work. Unlike earlier chatbots that primarily produced text on request, the new model is designed to perform end‑to‑end tasks that map closely to the duties of junior analysts, paralegals, operations staff and interns: cleaning data, building models, drafting memos, preparing slides and updating multiple systems in sequence.

In the near term, employers are likely to present deployments as a way to reduce drudgery and allow professionals to focus on higher‑value work. But by OpenAI’s own metrics, GPT‑5.4 is already competitive with human professionals on large slices of simulated tasks, particularly in finance, law and sales support. That raises the prospect that, over time, firms may need fewer entry‑level workers to handle the same volume of work, and may shift human roles toward oversight, exception handling and client interaction.

The impact will not be uniform. Large companies with in‑house engineering teams can build elaborate guardrails around GPT‑5.4 agents, integrate them deeply into back‑office systems and monitor their behavior. Smaller organizations may rely on off‑the‑shelf tools such as ChatGPT for Excel or basic agent platforms, gaining access to powerful automation but with less capacity to police it.

Regulators are beginning to pay attention to these dynamics, particularly as AI agents move into regulated sectors such as finance and health care. Possible measures under discussion in policy circles include disclosure and audit requirements for high‑capability models, mandatory logging of AI‑driven actions in critical systems and clearer standards for how companies store and review an AI system’s reasoning in sensitive cases.

For now, GPT‑5.4’s release underscores how quickly general‑purpose AI is moving closer to the center of office work. Where earlier models offered help drafting emails or summarizing documents, OpenAI’s latest system is designed to work across those documents, the spreadsheets behind them and the portals where they are filed.

Whether it is experienced as a tireless assistant or an uneasy competitor may depend less on the model’s capabilities than on how employers, software vendors and regulators choose to deploy and constrain it in the months ahead.
