ZDNet AI
GPT-5.4 Test: Impressive Yet Off-Target

A hands-on GPT-5.4 test finds that output quality is high but instructions are often ignored.
30-Second TL;DR
What Changed
GPT-5.4 produces high-quality answers but often fails to follow the instructions in the prompt.
Why It Matters
The test highlights the risk of overhyping model capabilities for professional use and the need to validate outputs. It could also affect trust in OpenAI's next-generation announcements.
What To Do Next
Test GPT-5.4 Thinking on your key professional prompts and check whether its answers follow the instructions you gave.
Who should care: Developers & AI Engineers
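One way to act on the advice to test your own prompts is to check each response against the explicit constraints the prompt stated, separately from judging answer quality. The sketch below is a minimal illustration under stated assumptions: the `Check` type, the helper names, and the three example constraints are hypothetical and not part of any OpenAI tooling, and the response string is hard-coded rather than fetched from a model.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """A named, mechanically verifiable instruction (hypothetical helper type)."""
    name: str
    passed: Callable[[str], bool]


def max_words(limit: int) -> Check:
    # Instruction of the form "answer in at most N words".
    return Check(f"at most {limit} words", lambda text: len(text.split()) <= limit)


def is_json_object() -> Check:
    # Instruction of the form "reply with a single JSON object".
    def _ok(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return False
    return Check("valid JSON object", _ok)


def must_not_contain(phrase: str) -> Check:
    # Instruction of the form "do not mention X" (case-insensitive match).
    return Check(f"omits '{phrase}'", lambda text: phrase.lower() not in text.lower())


def failed_checks(response: str, checks: list[Check]) -> list[str]:
    """Return the names of the instructions the response did not follow."""
    return [c.name for c in checks if not c.passed(response)]


if __name__ == "__main__":
    # Stand-in for a model response; replace with real output from your own prompts.
    response = '{"summary": "Revenue rose 4% in Q3."}'
    checks = [is_json_object(), max_words(20), must_not_contain("as an AI")]
    print(failed_checks(response, checks))
```

Checks like these only cover instructions that can be verified mechanically; whether an answer addresses the right question still needs human review.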
Deep Insight
Web-grounded analysis with 7 cited sources.
Enhanced Key Takeaways
- GPT-5.4 Thinking achieves a state-of-the-art 75.0% success rate on the OSWorld-Verified benchmark for desktop navigation via screenshots and keyboard/mouse actions, surpassing the reported human performance of 72.4% [3].
- On the GDPval benchmark, GPT-5.4 matches or exceeds industry professionals in 83% of knowledge-work tasks across 44 occupations, up from 70.9% for GPT-5.2 [2].
- GPT-5.4 introduces native computer-use capabilities in Codex and the API, enabling agents to operate software, navigate file systems, and execute multi-step workflows [1][2][3].
Technical Deep Dive
- Supports a 1-million-token context window for handling longer workflows and complex prompts while maintaining coherence [2][5][7].
- Introduces "tool search" in the API, allowing efficient operation across larger tool ecosystems with lower token cost and latency [1][3].
- Improves tool-calling accuracy on the Toolathlon benchmark, enabling better multi-step tasks such as reading emails, extracting attachments, grading, and updating spreadsheets [3].
- Offers native computer use via screenshots, mouse/keyboard commands, and libraries like Playwright, with steerable behavior through custom safety policies [3].
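The summary above does not say how "tool search" works internally. As a rough illustration of the general idea only (pick the few relevant tool definitions for a request instead of sending every definition each time), here is a minimal sketch using naive keyword overlap. The `Tool` type, the `REGISTRY` contents, and `search_tools` are all hypothetical and are not OpenAI's API or algorithm.

```python
from dataclasses import dataclass

_PUNCT = ".,:;!?()"


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool definition: a name plus a short description."""
    name: str
    description: str


# Hypothetical registry, loosely based on the email and spreadsheet tasks mentioned above.
REGISTRY = [
    Tool("read_email", "Read messages from an email inbox"),
    Tool("save_attachment", "Extract and save an attachment from an email message"),
    Tool("update_spreadsheet", "Write values into spreadsheet cells"),
    Tool("book_flight", "Search for and book airline flights"),
]


def _tokens(text: str) -> set[str]:
    # Lowercase words with surrounding punctuation stripped.
    words = (w.strip(_PUNCT).lower() for w in text.split())
    return {w for w in words if w}


def search_tools(query: str, registry: list[Tool], top_k: int = 2) -> list[Tool]:
    """Return up to top_k tools whose name or description shares words with the query."""
    query_words = _tokens(query)
    scored = []
    for tool in registry:
        tool_words = _tokens(tool.name.replace("_", " ") + " " + tool.description)
        overlap = len(query_words & tool_words)
        if overlap > 0:
            scored.append((overlap, tool))
    # Highest overlap first; sort is stable, so ties keep registry order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:top_k]]


if __name__ == "__main__":
    for tool in search_tools("extract the email attachment", REGISTRY):
        print(tool.name)
```

Only the selected definitions would then be included in the request, which is where a token and latency saving could come from. A production system would likely use something stronger than keyword overlap, such as embedding similarity.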
Future Implications
AI analysis grounded in cited sources
GPT-5.4 will accelerate agentic AI adoption in enterprise by reducing need for specialized frameworks
Hallucination rates will drop 33% in professional outputs compared to GPT-5.2
OpenAI reports individual factual claims are 33% less likely to be incorrect, with overall responses 18% less error-prone, enhancing reliability for decision-ready work[2].
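The 33% and 18% figures are relative reductions, not percentage-point drops, so their practical effect depends on the starting error rate, which the summary does not give. The sketch below works through the arithmetic with a purely hypothetical 6% baseline chosen for illustration. `reduced_rate` is an illustrative helper, not something from the cited sources.

```python
def reduced_rate(baseline: float, relative_reduction: float) -> float:
    """Apply a relative reduction (e.g. 0.33 for "33% less likely") to a baseline rate."""
    if not 0.0 <= baseline <= 1.0:
        raise ValueError("baseline must be a probability between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline * (1.0 - relative_reduction)


if __name__ == "__main__":
    # ASSUMPTION: a 6% per-claim error rate for the older model, for illustration only.
    assumed_baseline = 0.06
    new_rate = reduced_rate(assumed_baseline, 0.33)
    print(f"per-claim error rate: {assumed_baseline:.2%} -> {new_rate:.2%}")
```

Under that assumed baseline, a 33% relative reduction moves the per-claim error rate from 6% to about 4%, a drop of roughly two percentage points rather than 33.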
Timeline
2026-03
OpenAI releases the GPT-5.4 series, including Thinking, Pro, and Instant variants, with computer-use and tool improvements.
Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: ZDNet AI