AI models have been able to look at images and tell you what they see for years, but with the spring update, OpenAI has taken things to a new level.
With the introduction of GPT-4o in ChatGPT – even without audio and video functionality – OpenAI has unveiled one of the best AI vision models released to date.
Its success is partly down to the fact that it is natively multimodal, giving it a deeper understanding of images, video, audio and text. It can reason directly across image, speech, video and text, where other models first convert those other formats to text.
To test its abilities, I gave it a series of images and asked it to describe what it could see – the more accurate the description, the better the model. Oftentimes, AI vision models – including GPT-4 – miss one or two objects or get a description wrong.
GPT-4o vision test
In each test, ChatGPT-4o was given the image and the prompt “What is this?” without any context or additional information. This is closer to how people are likely to use the capability in the real world, and it is how I recently used it at an event in Paris.
The goal was to see how well it analyzed each image. On every occasion I followed up by asking whether it could detect if the image was AI-generated. All of the images were created with Ideogram from descriptions written by Claude 3 – so they are entirely AI.
After the first couple of photos in the conversation, it started spontaneously telling me whether it thought an image was AI-generated, without my having to ask.
I started with the ChatGPT desktop app, but it stopped responding, so I moved to the web version, which worked fine.
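If you want to reproduce this kind of test outside the ChatGPT interface, the same model is available through OpenAI’s API. Below is a minimal sketch, assuming the official openai Python SDK (v1+) and an OPENAI_API_KEY in the environment; the image URL is a placeholder rather than one of the images used here.

```python
# Minimal sketch: send an image plus the bare "What is this?" prompt to GPT-4o.
# Assumes the openai Python SDK (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# The image URL is a placeholder, not one of the images from this article.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is this?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/test-image.png"}},
        ],
    }
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```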
1. Recognizing objects
For this first test, I created an image of a red ceramic mug filled with hot coffee, placed on a saucer on a wooden table in a café, with the café itself left indistinct. The idea was to see how GPT-4o interprets the background and setting.
GPT-4o wrote: “This image shows a steaming cup of coffee in a red cup and saucer placed on a rustic wooden table. The background is blurred, depicting a cozy indoor setting with people sitting and talking in what appears to be a cave-like café or a unique interior with stone walls. The focus is on the cup of coffee, highlighting the steam rising from the hot drink.”
This is about as accurate and descriptive as it gets; it even paints a picture of the scene, going beyond simple description. For fun, I asked it: “Describe the thoughts of the person sitting at the table with this ready-to-drink coffee.”
It answered with a fairly descriptive overview, but this line stood out: “This coffee is more than just a drink; it is a ritual, a cherished moment of solitude and contemplation amidst the hustle and bustle of everyday life.”
2. Optical Character Recognition (OCR)
Optical character recognition (OCR) is a technology as old as computing, so it should be trivial for any vision model; even smartphones can do it on-device. I gave GPT-4o a simple image showing a close-up of an old wooden sign that said “Welcome to Oakville” in a vintage handwritten style. I wanted more than just the text from the AI.
GPT-4o delivered, writing: “This photo shows a vintage-looking sign that says ‘Welcome to Oakville’. The sign appears weathered, with an old, rustic finish, indicating that it has been exposed to the elements for some time. The text is in a bold white font on a red background, with some signs of wear, adding to its nostalgic charm. The sign is partly surrounded by greenery, suggesting that it may have been placed outside, perhaps at the entrance to a town or community called Oakville.”
Not only did it read the text correctly, it also offered more detail about the location, identifying a possible town or community and pulling out details of the sign itself.
3. Facial recognition
This is a controversial use of AI technology, and one that is heavily restricted in the European Union. But how well can GPT-4o describe a face? The image was an AI-generated depiction of a woman in her 40s with dark hair and warm brown eyes, wearing a friendly smile.
GPT-4o described it as follows: “This image is a close-up of a smiling woman with dark, curly hair. The focus is on her face, which is well lit, showing details of skin texture and features. She has a friendly expression, and her eyes are looking directly at the camera. The background is blurred, keeping the focus on the subject.”
I wanted a little more, including a guess at age. I asked, “How old is this person and what is her emotional state?” It replied that she is in her late 30s to early 40s and in a positive state of mind, explaining: “The general impression is that she is in a good place emotionally, and probably feels good and comfortable.”
4. Detecting emotions
Next, I wanted to see how well GPT-4o would do with an image of an old man with a sad expression. Would it capture that, or just describe the scene? Remember, the only input it had was “What is this?” and the image.
GPT-4o performed well. Its response read: “This photo shows an elderly man sitting on a bench outdoors, most likely in a park. The lighting indicates that the photo was taken during the golden hour, with soft sunlight highlighting his face. The man’s expression is thoughtful, with a slight hint of melancholy or contemplation. He is wearing a collared shirt and jacket, and the background is blurred, focusing attention on the man while providing a pleasant, natural setting.”
The AI model did exactly what I hoped for: it described the scene and made suggestions about what the man might be feeling at that moment.
5. Understanding the scene
How well can GPT-4o describe a scene from an image and identify elements within it? This is a useful skill, as AI models are increasingly used to help describe the real world; it is one of the headline features of ChatGPT’s new voice mode.
“This photo shows a lively outdoor farmers market on a sunny day,” GPT-4o announced. “Various booths have been set up under tents, displaying an abundance of fresh produce including fruits, vegetables and other goods. People wander through the market, browsing, shopping and interacting with vendors. The market appears to be located in a beautiful tree-lined area, with green foliage in the background. The atmosphere feels lively and community-oriented, with an emphasis on local and fresh produce.”
Not much more to say. It was accurate in its description of the image and the situation, provided more detail than I expected and highlighted some of the products on sale.
6. Image quality evaluation
I asked Ideogram to create four nearly identical images in a grid, giving each a different resolution or level of composition. I wanted to see how GPT-4o would describe this. It can handle a single image, but what about four of them, each slightly different?
“The image is a collection of four landscape photographs arranged in a grid, each displaying a stunning view of rolling hills covered in green grass and wildflowers under a cloud-filled sky,” GPT-4o wrote. So far so good.
It continued: “The top left photo shows a soft pink and blue gradient sky, suggesting dawn or dusk, with hills bathed in warm golden light, enhancing the vibrant green grass and yellow and white flowers. The top right photo shows a similar scene but in muted tones, suggesting early morning or late afternoon. The lower left image has cooler colors and a clearer blue sky, suggesting midday with softer lighting. The lower right image is warmer and more saturated, with a pink tint in the sky, suggesting the warm light of sunset. These differences in lighting and color tone illustrate how different times of day and photographic techniques can affect the mood and appearance of the same natural scene.”
Although it didn’t grasp the purpose of the four-image grid – to compare composition – it was able to describe each image individually, which was impressive. It even performed its own comparison of the photos.
7. Multiple object detection
Finally, there is an important test if a model is to be useful as an accessibility tool: how well it detects objects in a crowded scene. The prompt was the same as in the other tests. The image shows a cluttered desk, and I expected it to pick out at least one small item.
GPT-4o wrote: “The photo shows a modern, organized office workspace with a city view visible through large windows in the background. The desk is neatly arranged with many items typically found in a productive work environment, including a centrally placed laptop, a metal desk lamp and a coffee cup, suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers suggest active work and note-taking, while a small potted plant adds a touch of greenery. A framed photo of two people provides a personal touch, potentially serving as a source of motivation. The workspace is filled with office supplies such as pens, pencils and organizers, ensuring functionality. The urban backdrop of the city skyline with modern high-rise buildings creates a dynamic and inspiring environment.”
It worked. I followed up by asking for the color of the plant pot, and it answered right away. Then I asked it to find my glasses in the photo, and it correctly said: “Your glasses are on the desk to the right of the laptop, sitting on top of a notebook.”
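Follow-up questions like these work because the full conversation is resent with each request. Continuing the hypothetical sketch from earlier (reusing its client, messages and response), a follow-up might look like this:

```python
# Continuation of the earlier sketch: append the model's reply and a new
# question so GPT-4o can answer follow-ups about the same image.
messages.append(
    {"role": "assistant", "content": response.choices[0].message.content}
)
messages.append({"role": "user", "content": "What color is the plant pot?"})

follow_up = client.chat.completions.create(model="gpt-4o", messages=messages)
print(follow_up.choices[0].message.content)
```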
Summary
Every vision model I’ve used in the past has made at least one mistake, usually a major one like misidentifying an object or not picking up a color or brand.
GPT-4o got every one of these tests right. The move to true multimodality has been a game-changer for OpenAI.
It also demonstrates the potential value of smart glasses as the true future of data interaction. Forget the smartphone; let’s use vision to merge the real and the digital.