Official repo for the paper: Visual ChatGPT

Tag: ChatGPT

Type: Code

URL: https://github.com/microsoft/visual-chatgpt

Created: Mar 29, 2023 09:02 AM

Visual ChatGPT

Visual ChatGPT connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.

See our paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Updates:

We propose the template idea in Visual ChatGPT!

A template is a pre-defined execution flow that assists ChatGPT in assembling complex tasks involving multiple foundation models.
A template contains the experiential solution to complex tasks as determined by humans.
A template can invoke multiple foundation models or even establish a new ChatGPT session.
To define a template, simply add a class with the attribute template_model = True.
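The points above can be illustrated with a minimal sketch. Only the template_model = True attribute comes from the text; the class names (StubCaptioner, StubEditor, CaptionThenEdit), the inference method name, and the is_template helper are hypothetical stand-ins, not the repository's actual code.

```python
class StubCaptioner:
    """Hypothetical stand-in for an image-captioning foundation model."""

    def inference(self, image_path):
        return f"a caption for {image_path}"


class StubEditor:
    """Hypothetical stand-in for an image-editing foundation model."""

    def inference(self, image_path, prompt):
        return f"{image_path} edited with '{prompt}'"


class CaptionThenEdit:
    # The attribute described in the text: it marks this class as a template.
    template_model = True

    def __init__(self, captioner, editor):
        # Assumption: a template reuses models that were created elsewhere
        # rather than loading its own weights.
        self.captioner = captioner
        self.editor = editor

    def inference(self, image_path):
        # Pre-defined execution flow: caption first, then edit with that caption.
        caption = self.captioner.inference(image_path)
        return self.editor.inference(image_path, caption)


def is_template(cls):
    """Return True if a class declares itself as a template."""
    return getattr(cls, "template_model", False) is True


template = CaptionThenEdit(StubCaptioner(), StubEditor())
result = template.inference("cat.png")
```

In this sketch the human-designed flow lives entirely in CaptionThenEdit.inference, so the chat model would only need to decide when to call the template, not how to chain the underlying models.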

Thanks to @ShengmingYin and @thebestannie for providing a template example in the InfinityOutPainting class (see the following gif)

First, run python visual_chatgpt.py --load "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"
Second, tell Visual ChatGPT: extend the image to 2048x1024
By simply creating an InfinityOutPainting template, Visual ChatGPT can seamlessly extend images to any size through collaboration with existing ImageCaptioning, ImageEditing, and VisualQuestionAnswering foundation models, without the need for additional training.
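The --load value in the command above pairs each model name with a device, separated by an underscore, with pairs joined by commas. The following is a rough sketch of how such a string could be split into a name-to-device mapping; parse_load is a hypothetical helper written for illustration and may differ from what visual_chatgpt.py does internally.

```python
def parse_load(spec):
    """Split 'Model_device,Model_device' into a {model: device} dict."""
    mapping = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first underscore; the model names in the example
        # above contain no underscores themselves.
        name, sep, device = entry.partition("_")
        if not sep or not name or not device:
            raise ValueError(f"expected 'Model_device', got {entry!r}")
        mapping[name] = device
    return mapping


load_dict = parse_load(
    "ImageCaptioning_cuda:0,ImageEditing_cuda:1,VisualQuestionAnswering_cuda:2"
)
# load_dict == {"ImageCaptioning": "cuda:0", "ImageEditing": "cuda:1",
#               "VisualQuestionAnswering": "cuda:2"}
```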

Visual ChatGPT depends on the effort of the community! We welcome contributions that add new and interesting features!

Insight & Goal:

On the one hand, ChatGPT (or LLMs) serves as a general interface that provides a broad and diverse understanding of a wide range of topics. On the other hand, Foundation Models serve as domain experts by providing deep knowledge in specific domains. By leveraging both general and deep knowledge, we aim to build an AI that is capable of handling various tasks.

Demo

System Architecture

Quick Start

GPU memory usage

Here we list the GPU memory usage of each visual foundation model, so you can choose which ones to load:

Foundation Model	GPU Memory (MB)
ImageEditing	3981
InstructPix2Pix	2827
Text2Image	3385
ImageCaptioning	1209
Image2Canny	0
CannyText2Image	3531
Image2Line	0
LineText2Image	3529
Image2Hed	0
HedText2Image	3529
Image2Scribble	0
ScribbleText2Image	3531
Image2Pose	0
PoseText2Image	3529
Image2Seg	919
SegText2Image	3529
Image2Depth	0
DepthText2Image	3531
Image2Normal	0
NormalText2Image	3529
VisualQuestionAnswering	1495
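As a rough planning aid, the figures in the table can be summed to estimate how much GPU memory a chosen set of models needs. The sketch below copies the table values; the helper names total_memory_mb and fits are hypothetical, and the result should be treated as an estimate, since actual usage may vary with hardware and settings.

```python
# GPU memory per foundation model in MB, copied from the table above.
GPU_MEMORY_MB = {
    "ImageEditing": 3981,
    "InstructPix2Pix": 2827,
    "Text2Image": 3385,
    "ImageCaptioning": 1209,
    "Image2Canny": 0,
    "CannyText2Image": 3531,
    "Image2Line": 0,
    "LineText2Image": 3529,
    "Image2Hed": 0,
    "HedText2Image": 3529,
    "Image2Scribble": 0,
    "ScribbleText2Image": 3531,
    "Image2Pose": 0,
    "PoseText2Image": 3529,
    "Image2Seg": 919,
    "SegText2Image": 3529,
    "Image2Depth": 0,
    "DepthText2Image": 3531,
    "Image2Normal": 0,
    "NormalText2Image": 3529,
    "VisualQuestionAnswering": 1495,
}


def total_memory_mb(models):
    """Sum the listed memory usage for the given model names."""
    unknown = [m for m in models if m not in GPU_MEMORY_MB]
    if unknown:
        raise KeyError(f"models not in the table: {unknown}")
    return sum(GPU_MEMORY_MB[m] for m in models)


def fits(models, budget_mb):
    """Check whether the listed usage stays within a memory budget in MB."""
    return total_memory_mb(models) <= budget_mb


# The three models used by the InfinityOutPainting example.
outpainting = ["ImageCaptioning", "ImageEditing", "VisualQuestionAnswering"]
total = total_memory_mb(outpainting)  # 1209 + 3981 + 1495 = 6685
```

By these figures, the three outpainting models would need about 6685 MB if placed on a single GPU, which is one reason to spread them across devices as in the --load example.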

Acknowledgement

We appreciate the following open-source projects:

Hugging Face, LangChain, Stable Diffusion, ControlNet, InstructPix2Pix, CLIPSeg, BLIP

Contact Information

For help or issues using Visual ChatGPT, please submit a GitHub issue.

For other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).