Visual ChatGPT
Visual ChatGPT connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.
Updates:
- We propose the template idea in Visual ChatGPT!
- A template is a pre-defined execution flow that assists ChatGPT in assembling complex tasks involving multiple foundation models.
- A template contains the experiential solution to complex tasks as determined by humans.
- A template can invoke multiple foundation models or even establish a new ChatGPT session.
- To define a template, simply add a class with the attribute `template_model = True`.
- Thanks to @ShengmingYin and @thebestannie for providing a template example in the `InfinityOutPainting` class (see the following gif).
- First, run `python visual_chatgpt.py --load "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"`.
- Second, say `extend the image to 2048x1024` to Visual ChatGPT!
- By simply creating an `InfinityOutPainting` template, Visual ChatGPT can seamlessly extend images to any size through collaboration with the existing `ImageCaptioning`, `ImageEditing`, and `VisualQuestionAnswering` foundation models, without the need for additional training.
- Visual ChatGPT needs the effort of the community! We crave your contribution to add new and interesting features!
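
The template idea described above can be sketched roughly as follows. The only detail taken from this README is the `template_model = True` class attribute. Everything else is an assumption made for illustration: the stub models, the `CaptionThenEdit` template, the `build_templates` helper, and the convention that a template's constructor parameter names identify the models it needs. This is not the project's actual implementation.

```python
import inspect


class ImageCaptioning:
    """Hypothetical stub standing in for a real foundation model."""

    def __init__(self, device):
        self.device = device

    def inference(self, image_path):
        return f"a caption for {image_path}"


class ImageEditing:
    """Hypothetical stub standing in for a real foundation model."""

    def __init__(self, device):
        self.device = device

    def inference(self, image_path, prompt):
        return f"{image_path} edited with '{prompt}'"


class CaptionThenEdit:
    """Hypothetical template: a fixed execution flow over two models."""

    template_model = True  # the attribute that marks this class as a template

    def __init__(self, ImageCaptioning, ImageEditing):
        # Assumed convention: parameter names name the required models.
        self.captioner = ImageCaptioning
        self.editor = ImageEditing

    def inference(self, image_path):
        caption = self.captioner.inference(image_path)
        return self.editor.inference(image_path, caption)


def build_templates(loaded_models, candidates):
    """Instantiate every template whose required models are all loaded."""
    templates = {}
    for cls in candidates:
        if not getattr(cls, "template_model", False):
            continue  # ordinary foundation model, not a template
        required = [
            name
            for name in inspect.signature(cls.__init__).parameters
            if name != "self"
        ]
        if all(name in loaded_models for name in required):
            kwargs = {name: loaded_models[name] for name in required}
            templates[cls.__name__] = cls(**kwargs)
    return templates


models = {
    "ImageCaptioning": ImageCaptioning("cuda:0"),
    "ImageEditing": ImageEditing("cuda:1"),
}
templates = build_templates(models, [CaptionThenEdit, ImageCaptioning])
result = templates["CaptionThenEdit"].inference("image/example.png")
print(result)
```

In this sketch a template is skipped whenever one of the models it depends on has not been loaded, so it never has to load models itself.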

Insight & Goal:
On the one hand, ChatGPT (or LLMs) serves as a general interface that provides a broad and diverse understanding of a wide range of topics. On the other hand, Foundation Models serve as domain experts by providing deep knowledge in specific domains. By leveraging both general and deep knowledge, we aim to build an AI that is capable of handling various tasks.
Demo

System Architecture

Quick Start
GPU memory usage
Here we list the GPU memory usage of each visual foundation model. You can choose which ones to load:
| Foundation Model | GPU Memory (MB) |
| --- | --- |
| ImageEditing | 3981 |
| InstructPix2Pix | 2827 |
| Text2Image | 3385 |
| ImageCaptioning | 1209 |
| Image2Canny | 0 |
| CannyText2Image | 3531 |
| Image2Line | 0 |
| LineText2Image | 3529 |
| Image2Hed | 0 |
| HedText2Image | 3529 |
| Image2Scribble | 0 |
| ScribbleText2Image | 3531 |
| Image2Pose | 0 |
| PoseText2Image | 3529 |
| Image2Seg | 919 |
| SegText2Image | 3529 |
| Image2Depth | 0 |
| DepthText2Image | 3531 |
| Image2Normal | 0 |
| NormalText2Image | 3529 |
| VisualQuestionAnswering | 1495 |
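
To check whether a planned `--load` string fits on your GPUs, the figures in the table can be summed per device. The sketch below is a rough estimate only. It assumes the `Model_device` format seen in the example command in the Updates section, and `parse_load` and `memory_per_device` are hypothetical helpers that are not part of the repository. Actual usage at runtime may differ from the listed numbers.

```python
from collections import defaultdict

# GPU memory per model in MB, copied from the table above.
GPU_MEMORY_MB = {
    "ImageEditing": 3981,
    "InstructPix2Pix": 2827,
    "Text2Image": 3385,
    "ImageCaptioning": 1209,
    "Image2Canny": 0,
    "CannyText2Image": 3531,
    "Image2Line": 0,
    "LineText2Image": 3529,
    "Image2Hed": 0,
    "HedText2Image": 3529,
    "Image2Scribble": 0,
    "ScribbleText2Image": 3531,
    "Image2Pose": 0,
    "PoseText2Image": 3529,
    "Image2Seg": 919,
    "SegText2Image": 3529,
    "Image2Depth": 0,
    "DepthText2Image": 3531,
    "Image2Normal": 0,
    "NormalText2Image": 3529,
    "VisualQuestionAnswering": 1495,
}


def parse_load(load):
    """Split "Model_device,Model_device" into a {model: device} dict."""
    assignments = {}
    for entry in load.split(","):
        # Split on the first underscore only, so "cuda:0" stays intact.
        name, device = entry.strip().split("_", 1)
        assignments[name] = device
    return assignments


def memory_per_device(load):
    """Sum the table's memory figures for each device in a load string."""
    totals = defaultdict(int)
    for name, device in parse_load(load).items():
        totals[device] += GPU_MEMORY_MB[name]
    return dict(totals)


load = "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"
print(memory_per_device(load))
# {'cuda:0': 1209, 'cuda:1': 3981, 'cuda:2': 1495}
```

Placing several models on one device adds their figures together, so `"ImageCaptioning_cuda:0,Text2Image_cuda:0"` would come to 4594 MB on `cuda:0` under this estimate.
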
Acknowledgement
We appreciate the open source of the following projects:
Contact Information
For help or issues using Visual ChatGPT, please submit a GitHub issue.
For other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).