When GPT-4 was released in March this year, it was billed as an advanced model with multimodal capabilities. However, multimodality was nowhere in sight. After almost six months, OpenAI released a string of updates last week, the most notable being image and voice features, making GPT-4 truly multimodal and finally delivering the promised ‘Vision’ capability.
As showcased earlier this year by OpenAI co-founder Greg Brockman in a demo video of GPT-4’s functionalities, the varied uses of GPT-4 Vision have now been put to the test, and the results have been impressive. Here are a few of the standout features of GPT-4 Vision.
Identifying Objects

Be it a plant, an animal, a character, or any random object, GPT-4 can correctly identify it from an image and go on to generate a detailed description of it. In the screenshots below, ChatGPT correctly identified the main plant without any descriptive prompt, and located the character ‘Waldo’, respectively.
Transcribing Text

By inputting an image containing any form of text into ChatGPT Plus, the model can transcribe the content from the image. In the screenshot below, the image contains seventeenth-century handwriting from a manuscript of the philosopher and writer Robert Boyle.
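The article demonstrates transcription through the ChatGPT Plus interface, but readers who want to script similar requests could assemble the equivalent payload for OpenAI's chat API, which accepts images as base64-encoded data URLs inside a message's content list. The sketch below only builds the request body rather than sending it; the model name and the helper function are illustrative assumptions, not something the article specifies.

```python
import base64


def build_transcription_request(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a chat-completions payload asking a vision model to transcribe an image.

    Follows OpenAI's chat format for image inputs: the user message's content
    is a list mixing a text part and an image_url part. The model name below
    is a placeholder assumption and may differ from what is actually offered.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",  # assumed model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Transcribe all text visible in this image.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{b64}"},
                    },
                ],
            }
        ],
    }


# Example: construct (but do not send) a request for a manuscript scan.
payload = build_transcription_request(b"fake-image-bytes")
```

Sending `payload` would still require an HTTP client and an API key; the point here is only the shape of a multimodal message, where image data travels alongside the text prompt in the same turn.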
Deciphering Data

The model can easily read graphs, charts, and other forms of data, and infer results from them. In the screenshot below, a bar graph comparing the performance of two models on various competitive exams is shown.
Processing Multiple Conditions
The model can also comprehend and process images that impose multiple conditions. For example, in the image below, it reads a set of instructions and follows them to arrive at an answer.