This blog post was originally published at SOYNET’s website. It is reprinted here with the permission of SOYNET.

A Multi-Modal Conversational Model for Image Understanding and Generation

Visual ChatGPT allows users to perform complex visual tasks using text and image inputs. With the rapid advances in AI, there is a growing need for models that can handle modalities beyond text and images, such as video or voice. One challenge in building such a system is the amount of data and computational resources required. The newly released Visual ChatGPT addresses this: it is based on ChatGPT and incorporates a variety of Visual Foundation Models (VFMs) to bridge the gap between ChatGPT and visual information. It uses a Prompt Manager that supports various functions, including explicitly telling ChatGPT the capability of each VFM and handling the histories, priorities, and conflicts of different VFMs.

The Prompt Manager

The Prompt Manager in Visual ChatGPT helps the language model accurately understand and handle various visual language tasks by providing clear guidelines and prompts for different scenarios. It distinguishes among the Visual Foundation Models (VFMs) available in Visual ChatGPT and defines each one’s name, usage, inputs/outputs, and optional examples, so the model can select the appropriate VFM for a given task.
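As a rough illustration, the per-VFM metadata the Prompt Manager exposes can be thought of as a small registry rendered into a prompt section. The class, field names, and tool descriptions below are hypothetical stand-ins, not the paper's actual code:

```python
from dataclasses import dataclass

@dataclass
class VFMTool:
    """Metadata the Prompt Manager exposes to ChatGPT for one VFM."""
    name: str     # identifier the model uses to refer to the tool
    usage: str    # when this tool should be chosen
    inputs: str   # expected input format
    outputs: str  # produced output format

def build_tool_prompt(tools):
    """Render the tool descriptions into one prompt section."""
    return "\n".join(
        f"> {t.name}: {t.usage} Input: {t.inputs}. Output: {t.outputs}."
        for t in tools
    )

tools = [
    VFMTool("Depth Estimation", "Useful to predict the depth map of an image.",
            "an image path", "a depth image path"),
    VFMTool("Image Captioning", "Useful to describe what is in an image.",
            "an image path", "a text description"),
]
print(build_tool_prompt(tools))
```

Rendering the registry into text like this is what lets a purely language-based model "know" which visual tools exist and how to call them.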

The Prompt Manager also handles user queries by generating a unique filename for each newly uploaded image and appending a suffix prompt that forces Visual ChatGPT to use the foundation models instead of relying solely on its imagination. These prompts encourage the model to return specific outputs generated by the foundation models rather than generic responses, and to use the appropriate VFM in each scenario. This improves the accuracy and relevance of Visual ChatGPT’s responses to user queries, particularly those involving visual language tasks.
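A minimal sketch of the filename handling and suffix prompt described above; the exact suffix wording and the `image/` directory convention are assumptions for illustration, not the paper's verbatim prompt:

```python
import os
import uuid

def register_upload(original_name):
    """Assign a unique filename so later turns can reference the image unambiguously."""
    _, ext = os.path.splitext(original_name)
    return f"image/{uuid.uuid4().hex[:8]}{ext or '.png'}"

# Assumed paraphrase of the suffix prompt's intent, not the paper's exact wording.
SUFFIX_PROMPT = ("Visual ChatGPT is a text language model, so it must use the "
                 "foundation models to observe images rather than imagine their content.")

def wrap_query(user_text, image_path):
    """Attach the stored filename and the suffix prompt to the user's query."""
    return f"{user_text}\nThe image file is {image_path}.\n{SUFFIX_PROMPT}"

print(wrap_query("describe this picture", register_upload("flower.jpg")))
```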

Visual Foundation Model (VFM)

A visual foundation model is a machine learning model that processes visual information such as images or videos. VFMs are usually trained on large amounts of visual data and can recognize and extract features from visual inputs. Other models, such as ChatGPT, can then use these features to perform more complex tasks that require both language and visual understanding. For example, a VFM could identify the objects in an image, and ChatGPT could then generate a natural-language caption describing them. VFMs can be trained with various architectures, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
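To make the division of labor concrete, here is a toy sketch in which a stand-in "VFM" detects objects and the result is turned into a captioning prompt for the language model. The function names and the hard-coded detection result are hypothetical:

```python
def detect_objects(image_path):
    """Stand-in for a real VFM detector; returns a fixed toy result."""
    return ["flower", "vase"]

def caption_prompt(image_path):
    """Turn the VFM's output into a prompt the language model can caption from."""
    objects = detect_objects(image_path)
    return ("Write a one-sentence caption for an image containing: "
            + ", ".join(objects) + ".")

print(caption_prompt("photo.png"))
```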

Source: Visual ChatGPT paper

Example of Visual ChatGPT in Action

To understand how Visual ChatGPT works, consider the following scenario: a user uploads an image of a yellow flower and enters a complex language instruction “please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step.” With the help of Prompt Manager, Visual ChatGPT starts a chain of execution of related Visual Foundation Models.

In this case, it first applies the depth estimation model to detect the depth information, then utilizes the depth-to-image model to generate a figure of a red flower with the depth information, and finally leverages the style transfer VFM based on the Stable Diffusion model to change the style of this image into a cartoon.

During this pipeline, the Prompt Manager serves as a dispatcher for ChatGPT, providing the type of each visual format and recording the information-transformation process. When Visual ChatGPT obtains the “cartoon” hint from the Prompt Manager, it ends the execution pipeline and shows the final result.
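The three-step chain for the flower example can be sketched as plain function composition; the handlers below are placeholders standing in for the real depth-estimation, depth-to-image, and style-transfer VFMs:

```python
# Placeholder handlers standing in for the real VFMs.
def depth_estimation(image):
    return f"depth({image})"

def depth_to_image(depth, prompt):
    return f"generated[{prompt}]({depth})"

def style_transfer(image, style):
    return f"{style}-styled({image})"

def run_pipeline(image):
    """The chain the Prompt Manager dispatches for the flower example."""
    depth = depth_estimation(image)                     # step 1: predict depth
    red_flower = depth_to_image(depth, "a red flower")  # step 2: depth-conditioned generation
    return style_transfer(red_flower, "cartoon")        # step 3: change style to cartoon

print(run_pipeline("yellow_flower.png"))
```

In the actual system this chain is not hard-coded; the model decides each next step dynamically, with the Prompt Manager mediating the tool calls.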

Customized System Principles

Visual ChatGPT is designed to assist with various text- and visual-related tasks, such as visual question answering (VQA), image generation, and image editing. The system relies on a list of VFMs to solve these visual language (VL) tasks, and it is designed to avoid ambiguity and be strict about filename usage, ensuring that it retrieves and manipulates the correct image files.

Filename sensitivity is critical because one conversation round may contain multiple images and their updated versions. To tackle more challenging queries by decomposing them into subproblems, Chain-of-Thought (CoT) prompting is introduced in Visual ChatGPT to help decide, leverage, and dispatch multiple VFMs; Visual ChatGPT must therefore follow strict reasoning formats.
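The strict reasoning format resembles a ReAct-style template in which the model must emit its thought, the chosen tool, and the tool input in fixed fields. The template below is an assumed approximation for illustration, not the verbatim prompt from the paper:

```python
# Assumed approximation of the fixed reasoning format; {tool_names} is filled
# in from the Prompt Manager's registered VFMs.
REASONING_FORMAT = """Thought: Do I need to use a tool? Yes
Action: the action to take, one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action"""

print(REASONING_FORMAT.format(tool_names="Depth Estimation, Image Captioning"))
```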

The intermediate reasoning results are parsed using elaborate regex matching algorithms to construct the rational input format for the ChatGPT model to help determine the next execution, e.g., triggering a new VFM or returning the final response. Visual ChatGPT must be reliable as a language model and not fabricate image content or filenames. Therefore, prompts are designed to ensure that the model is loyal to the output of the vision foundation models.
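A hypothetical parser in that spirit: regexes pull the chosen action and its input out of the model's intermediate output, or detect that a final answer has been reached. The field names match the template above but are assumptions, not the paper's actual patterns:

```python
import re

# Hypothetical patterns in the spirit of the regex matching described above.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)

def parse_step(llm_output):
    """Decide the next execution: trigger a VFM or return the final response."""
    m = FINAL_RE.search(llm_output)
    if m:
        return ("finish", m.group(1).strip())
    m = ACTION_RE.search(llm_output)
    if m:
        return ("tool", m.group(1).strip(), m.group(2).strip())
    raise ValueError("could not parse model output")

print(parse_step("Thought: need depth\nAction: Depth Estimation\nAction Input: image/abc.png"))
```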


Limitations

Visual ChatGPT’s main limitation is its dependence on ChatGPT and the VFMs: the accuracy and effectiveness of the individual models it invokes heavily influence its overall performance. In addition, the heavy prompt engineering required to express VFMs in language and keep them distinguishable is time-consuming and demands expertise in both computer vision and natural language processing.

Another limitation is limited real-time performance. Because Visual ChatGPT automatically decomposes complex tasks into several subtasks, handling a single request may involve invoking multiple VFMs in sequence, which adds noticeable latency.

To deal with unsatisfactory results caused by VFM failures, the paper suggests a self-correction module that checks the consistency between execution results and human intentions and makes the corresponding edits. This self-correction behavior can lead the model to reason more extensively, significantly increasing inference time.
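The suggested self-correction loop could look roughly like the sketch below, where `execute` re-runs the pipeline and `check` compares the result against the user's intention. Both are placeholders; the paper proposes the idea without specifying an implementation:

```python
def self_correct(execute, check, max_rounds=3):
    """Re-run the pipeline until the result matches the intention or a round limit is hit."""
    result = execute()
    for _ in range(max_rounds):
        if check(result):
            break
        result = execute()  # re-invoke, ideally with corrective feedback
    return result

# Toy usage: the second attempt satisfies the check.
attempts = iter(["blurry output", "red cartoon flower"])
print(self_correct(lambda: next(attempts), lambda r: "cartoon" in r))
```

The round limit matters because each retry re-invokes VFMs, compounding the inference-time cost noted above.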

Despite these limitations, Visual ChatGPT has demonstrated great potential and competence across a range of tasks. Its integration of visual information into dialogue holds great promise for the future of AI systems.

Sweta Chaturvedi
Marketing Manager, SOYNET




