BREAKING: OpenAI's GPT-4o Can Do It All

Everything OpenAI Didn't Mention in Their 'Spring Update'

Monday morning at OpenAI headquarters, CTO Mira Murati strode out onto the stage to the raucous applause of a small group of employees and guest attendees. OpenAI’s “Spring Update” was billed as a live stream showcasing “ChatGPT & GPT-4 updates for users.” Or, as Sam Altman put it, stuff that “feels like magic.”

And they didn’t disappoint.

GPT-4o: A New Flagship Model

No, it’s not GPT-5, but OpenAI has a brand-new flagship model – and it wowed us all yesterday morning. GPT-4o – the “o” stands for “omni” – stunned with its wide range of capabilities, impressively low latency, and an all-around feel reminiscent of Samantha, the operating system from “Her.”

GPT-4o is a single model trained end-to-end across text, vision, and audio. This means all of its inputs and outputs are processed by the same neural network, making it the first OpenAI model to combine these modalities in one. In fact, GPT-4o is natively a multimodal-token-in, multimodal-token-out model.

On stage, OpenAI researchers held real-time conversations with the model, having it tell bedtime stories with freakishly real emotiveness, using it as a language interpreter, and giving it access to their screen so it could walk them through coding and math problems. Even its laugh is shockingly real.

A big part of this achievement comes down to latency. Previously, Voice Mode chained together three models - transcription, intelligence, and text-to-speech - adding significant latency at every hop. Now it all happens natively. Because GPT-4o reasons across voice, text, and vision in a single model, response latency drops to as little as 232 ms, with an average of 320 ms - far better than the prior Voice Mode averages of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4).
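To make that pipeline concrete, here’s a minimal sketch of what the old three-model Voice Mode flow looks like when stitched together with the OpenAI Python SDK. This is an illustration, not OpenAI’s internal implementation; the model names are public API models and the file names are placeholders.

```python
# Sketch of the old three-model Voice Mode pipeline: transcription ->
# intelligence -> text-to-speech. Each hop is a separate network call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Transcription: speech -> text
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2) Intelligence: text -> text
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Text-to-speech: text -> speech
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```

Every hop adds network and inference latency, and the transcription step throws away tone, emotion, and background sound. Collapsing all three stages into one natively multimodal model is what makes the sub-second, emotive responses possible.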

Another bonus coming with GPT-4o is improved speed and efficiency across 50 languages. But all of this pales next to the headline announcement of the presentation: GPT-4o will be available to everyone. That’s right, free users of ChatGPT will also get access to GPT-4o and its full suite of capabilities, including access to GPTs through the GPT Store, Vision, Browsing, Memory, Advanced Data Analysis, and even the ability to upload files for “assistance summarizing, writing or analyzing.”

Meanwhile, on the API, GPT-4o arrives with some additional benefits over its predecessor, GPT-4 Turbo. Namely, it’s 2x faster, 50% cheaper, and has a 5x higher rate limit than Turbo. Developers everywhere are cheering.
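For developers, switching is largely a matter of changing the model name. Here’s a minimal sketch of a Chat Completions call to GPT-4o with mixed text and image input; the image URL is a placeholder.

```python
# Minimal sketch: calling GPT-4o through the Chat Completions API with
# text + image input in a single request. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is going on in this chart?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/temperature_chart.png"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The request shape is the same as GPT-4 Turbo with vision, so existing integrations only need the new model name to pick up the speed and price improvements.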

And finally, the other big announcement of the day was that OpenAI will be releasing a new desktop app for ChatGPT on macOS, along with an improved, slimmed-down UI for the web experience. The desktop app is slowly rolling out to Plus users, but in the meantime, the previews look great. For both free and paid users, it is “designed to integrate seamlessly into anything you’re doing on your computer. With a simple keyboard shortcut (Option + Space), you can instantly ask ChatGPT a question.” You will also be able to take and discuss screenshots directly in the app.

As all of this and more was demoed live, we couldn’t help but be blown away by the endless possibilities and applications GPT-4o will open up for us all.

A Demo for the Ages

Before we dive into what the OpenAI team conveniently left out of their live demo presentation, let’s recap the incredible abilities they showed off to the audience in real-time.

Mira, along with two OpenAI employees, Mark Chen and Barret Zoph, put the new flagship model through a barrage of live demos, and despite some slight bugs and timing quirks, it was nothing short of amazing. First, Chen prompted ChatGPT to tell Barret a bedtime story. The demo escalated again and again as they prompted the model to demonstrate increasing emotiveness, culminating in GPT-4o showing off its wide vocal range, its robot voice, and even its singing chops.

Next up was some real-time translation. The model acted as a perfect interpreter, translating Mark’s English to Italian and Mira’s Italian right back to English. Language barriers are looking more and more like a thing of the past.

Then, breaking out GPT-4o’s visual capabilities, they put the model through a series of prompts, like evaluating what it saw, judging Barret’s facial expressions, and plenty more. It even helped the crew walk through some basic math problems as they were written down in front of the camera. Before long, the OpenAI team was booting up the macOS desktop app with access to the researchers’ screen.

Once they pulled up a graph of temperature data, it was clear GPT-4o could quickly and accurately read the images it was seeing on the screen and answer questions about them aloud with stunning speed and clarity. It was a tour de force of the new flagship’s multimodal-in, multimodal-out abilities.

What OpenAI Didn’t Reveal Live

But, to be honest, some of the greatest, most impressive capabilities of GPT-4o weren’t even mentioned in the live demo! Those were discovered later by a storm of interested users on X digging into the accompanying GPT-4o blog post on the OpenAI website. And some of them will shock you.

AI-Generated Text Imagery

Judging by the examples provided in the blog post, GPT-4o seems to be light-years ahead of any other image generation model at producing legible text within an image.

Font Creation

The image generation abilities don’t stop there – it can also seemingly generate entirely new fonts of its own.

3D Rendering

Somehow, they failed to mention that GPT-4o can do what they’re calling “3D object synthesis.” Check out the rotating OpenAI logo it produced in the blog post.

Photo-to-Caricature

The model can even take an input image of a person and generate a realistic caricature of them in one shot.

Lecture/Meeting Summarization and Note-Taking

Another valuable use of GPT-4o will certainly be generating summaries and notes from uploaded meetings, lectures, and much more. In the blog post, it does exactly that from a 45-minute video input.
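The API didn’t accept raw video at launch, so if you want something similar today, a common workaround is to sample frames from the recording and hand them to GPT-4o alongside a note-taking prompt. Here’s a rough sketch using OpenCV; the file name and one-frame-per-minute sampling rate are illustrative.

```python
# Sketch: summarizing a lecture video by sampling frames and sending them
# to GPT-4o. File path and sampling interval are illustrative.
# Requires: pip install openai opencv-python
import base64
import cv2
from openai import OpenAI

client = OpenAI()

video = cv2.VideoCapture("lecture.mp4")
fps = video.get(cv2.CAP_PROP_FPS)
frames = []
frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    # Keep roughly one frame per minute to stay within context limits.
    if frame_index % int(fps * 60) == 0:
        _, buffer = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buffer).decode("utf-8"))
    frame_index += 1
video.release()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "These are frames from a lecture. Write concise notes and a summary."},
                *[
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                    for f in frames
                ],
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For meetings where the audio matters more than the slides, you could also transcribe the soundtrack with whisper-1 and include the transcript in the same prompt.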

Sound Effects Synthesis

GPT-4o will not be limited to speech synthesis, either. In the blog post, OpenAI demonstrates its ability to generate “the sounds of coins clanging on metal.”

New Tokenizer

Many were left wondering how OpenAI achieved such a boost in efficiency and cost reduction. Well, it also went unsaid during the stream, but GPT-4o relies on a brand-new tokenizer – “o200k_base” – with a vocabulary roughly twice the size of its predecessor’s, so the same text compresses into fewer, longer tokens. This is a key reason behind the huge efficiency gains across many non-English languages.
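You can see the difference yourself with OpenAI’s open-source tiktoken library, which ships both encodings. A quick sketch below compares token counts for the same sentence; the exact numbers will vary with the text you choose.

```python
# Compare GPT-4 Turbo's tokenizer (cl100k_base) with GPT-4o's new
# o200k_base tokenizer on the same text. Requires: pip install tiktoken
import tiktoken

text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

print(f"cl100k_base: {len(old_enc.encode(text))} tokens")
print(f"o200k_base:  {len(new_enc.encode(text))} tokens")
```

Fewer tokens per request means lower cost and more usable context, which helps explain both the price cut and the non-English gains.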

And Yes, GPT-4o Is “im-also-a-good-gpt2-chatbot”

Yup! This was revealed in a tweet after the launch by William Fedus of OpenAI.

Evals - How Good Is It Really?

At this point, it’s worth diving into the evaluations and benchmark performances which OpenAI included in their blog post to answer the question – how good is GPT-4o, really? If it’s being touted as the next big thing, does it live up to the hype? Is it as good as it’s made out to be?

It’s going to take more time, and plenty more public input, to say for sure, but for now we can judge it by the evals OpenAI has given us. First off, on text evaluation, GPT-4o achieves a new high score of 88.7% on 0-shot CoT MMLU (general knowledge questions), demonstrating its improvements in reasoning. It also shows a pretty large jump on the MATH benchmark.

On speech recognition, it also beats out the rest of the field: GPT-4o “dramatically improves speech recognition performance over Whisper-v3 across all languages.” On speech translation, it’s a similarly large leap over Whisper-v3, and it edges out Gemini to set a new state of the art.

Finally, on the visual understanding evals provided, GPT-4o again comes out on top, beating GPT-4 and Claude Opus across the board. It’s safe to say that GPT-4o achieves state-of-the-art performance, once again. One area not covered in the blog post, but shared by Sam Altman himself on X, is GPT-4o’s coding ability. There, too, it sees a massive jump over its competitors. What can’t this model do?

What This Means for OpenAI, Apple, and the Future of AI

After the dust has settled, there’s plenty we could say about what the release of GPT-4o means—for OpenAI, for Apple, and even for the future of the field of AI.

For OpenAI, it’s clear that they’ve had “Omni” in the works for quite some time. Some have even theorized that GPT-4o is closer to GPT-5 than expected, potentially an early version of the “Arrakis” model, or something else entirely. But it’s not GPT-5. OpenAI’s decision to brand this latest model as GPT-4o, rather than waiting for GPT-5, suggests a deliberate strategy, one meant to manage expectations ahead of Google I/O. The assistant’s personality is also much more dynamic and engaging, to the point of literally sounding like Samantha from “Her,” which also puts it in direct competition with Character AI.

However, the key to major success, for OpenAI and others, lies in winning over Apple. And the recently reported deal struck between the two companies now makes a whole lot of sense. OpenAI could revolutionize iOS and Siri by integrating a smaller, on-device GPT-4o, adding native features like camera and screen streaming, and enhancing system-level actions and smart home APIs. GPT-4o would make an excellent foundation for a revamped, renewed Siri. If this is the case, it would make GPT-4o the AI agent for billions of users worldwide. That’s nothing to scoff at.

And as it relates to the future of AI, yesterday is bound to be looked back upon as a watershed moment. It was the release of ChatGPT that brought the AI chatbot into the mainstream. In the same way, GPT-4o will likely be the tool that brings the AI companion, assistant, private tutor, and therapist into the mainstream.

There is so much this model can and will do. And we’re just at the beginning of finding out.