Meta is continuing to push forward with its research into new forms of generative AI models, today revealing its latest effort known as CM3leon (pronounced like “chameleon”).
CM3leon is a multimodal foundation model for text-to-image generation as well as image-to-text generation, which makes it useful for automatically generating captions for images.
AI-generated images are hardly a new concept at this point, with popular tools such as Stable Diffusion, DALL-E and Midjourney widely available.
What is new are the techniques Meta is using to build CM3leon and the performance that Meta claims the foundation model is able to achieve.
Text-to-image generation technologies today largely rely on diffusion models (the source of Stable Diffusion's name) to create an image. CM3leon uses something different: a token-based autoregressive model.
“Diffusion models have recently dominated image generation work due to their strong performance and relatively modest computational cost,” Meta researchers wrote in a research paper titled Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. “In contrast, token-based autoregressive models are known to also produce strong results, with even better global image coherence in particular, but are much more expensive to train and use for inference.”
With CM3leon, Meta researchers have demonstrated that a token-based autoregressive model can, in fact, be more efficient than a diffusion-based approach.
“CM3leon achieves state-of-the-art performance for text-to-image generation, despite being trained with five times less compute than previous transformer-based methods,” Meta researchers wrote in a blog post.
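To illustrate the distinction, here is a toy sketch of what token-based autoregressive image generation looks like in principle. This is illustrative only and not CM3leon's actual architecture: the vocabulary size, token count and stand-in "model" below are all hypothetical. The core idea is that an image is represented as a sequence of discrete tokens, predicted one at a time conditioned on the text prompt and on every token generated so far, then decoded back into pixels.

```python
# Toy sketch of token-based autoregressive image generation.
# Illustrative only: VOCAB_SIZE, IMAGE_TOKENS and next_token_distribution
# are hypothetical stand-ins, not CM3leon's real components.

import random

VOCAB_SIZE = 8192      # hypothetical size of the image-token codebook
IMAGE_TOKENS = 16      # hypothetical tokens per image (tiny for the demo)

def next_token_distribution(context):
    """Stand-in for a transformer's next-token prediction."""
    rng = random.Random(hash(tuple(context)) % (2**32))
    return [rng.random() for _ in range(VOCAB_SIZE)]

def sample(dist):
    """Greedy decoding: pick the most likely token."""
    return max(range(len(dist)), key=lambda i: dist[i])

def generate_image_tokens(prompt_tokens):
    context = list(prompt_tokens)      # condition on the text prompt
    image_tokens = []
    for _ in range(IMAGE_TOKENS):      # one token per step — autoregressive
        dist = next_token_distribution(context)
        tok = sample(dist)
        image_tokens.append(tok)
        context.append(tok)            # feed the new token back in
    return image_tokens

tokens = generate_image_tokens([101, 2003, 42])  # fake prompt token IDs
print(len(tokens))  # 16 image tokens, ready for a decoder to turn into pixels
```

A diffusion model, by contrast, refines an entire image at once over many denoising steps, rather than emitting it token by token.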
Meta’s ‘ethical’ approach to image training
The basic outline of how CM3leon works is broadly similar to that of existing text generation models.
Meta researchers started with a retrieval-augmented pre-training stage. Rather than just scraping publicly available images off the internet, which is a method that has caused some legal challenges for diffusion-based models, Meta has taken a different path.
“The ethical implications of image data sourcing in the domain of text-to-image generation have been a topic of considerable debate,” the Meta research paper states. “In this study, we use only licensed images from Shutterstock. As a result, we can avoid concerns related to image ownership and attribution, without sacrificing performance.”
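The retrieval-augmented part of pre-training can be sketched as follows. This is a minimal illustration, not Meta's pipeline: the corpus, embeddings and similarity function are all hypothetical stand-ins. The general idea of retrieval augmentation is that for each training example, the most similar documents from the (here, licensed) corpus are fetched and prepended to the context the model trains on.

```python
# Toy sketch of retrieval-augmented pre-training.
# Illustrative only: the corpus, embeddings and k are hypothetical.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical licensed corpus: (caption, embedding) pairs.
corpus = [
    ("a red chameleon on a branch", [0.9, 0.1, 0.0]),
    ("a city skyline at night",     [0.0, 0.2, 0.9]),
    ("a green lizard in the sun",   [0.8, 0.3, 0.1]),
]

def retrieve(query_embedding, k=2):
    """Return the k corpus captions most similar to the query."""
    ranked = sorted(corpus, key=lambda doc: cosine(query_embedding, doc[1]),
                    reverse=True)
    return [caption for caption, _ in ranked[:k]]

def build_training_context(example_text, example_embedding):
    # Retrieved neighbors are concatenated before the example itself.
    return retrieve(example_embedding) + [example_text]

ctx = build_training_context("a chameleon changing color", [0.85, 0.2, 0.05])
print(ctx)  # the two reptile captions first, then the training example
```

Retrieval lets a model of a given size draw on a large external corpus at training time, which is one way to stretch a fixed compute budget.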
After pre-training, the CM3leon model goes through a supervised fine-tuning (SFT) stage that Meta researchers claim produces highly optimized results, in terms of both resource utilization and image quality. SFT is an approach that is used by OpenAI to help train ChatGPT. Meta notes in its research paper that SFT is used to train the model to understand complex prompts, which is useful for generative tasks.
“We have found that instruction tuning notably amplifies multi-modal model performance across various tasks such as image caption generation, visual question answering, text-based editing, and conditional image generation,” the paper states.
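A common mechanic behind this kind of instruction tuning is loss masking, sketched below. This is a generic illustration, not Meta's or OpenAI's training code: the tokenizer and per-token losses are placeholders. Instruction and target are concatenated into one sequence, but the training loss counts only the positions belonging to the target, so the model learns to produce the answer rather than to parrot the prompt.

```python
# Toy sketch of loss masking in supervised fine-tuning (SFT).
# Illustrative only: tokenize() and the per-token losses are stand-ins.

def tokenize(text):
    return text.split()

def build_example(instruction, target):
    inst, tgt = tokenize(instruction), tokenize(target)
    tokens = inst + tgt
    mask = [0] * len(inst) + [1] * len(tgt)   # 1 = contributes to the loss
    return tokens, mask

def masked_loss(per_token_losses, mask):
    """Average loss over target positions only."""
    active = [l for l, m in zip(per_token_losses, mask) if m]
    return sum(active) / len(active)

tokens, mask = build_example("Caption this image:", "a chameleon on a branch")
# Pretend the model produced a loss of 1.0 at every position.
print(masked_loss([1.0] * len(tokens), mask))  # 1.0, averaged over 5 target tokens
```

In practice the same masking idea applies whether the target is a caption, an edited image's tokens, or an answer to a visual question, which is why one tuning recipe can lift performance across all the tasks the paper lists.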
Looking at the sample sets of generated images that Meta has shared in its blog post about CM3leon, the results are impressive and clearly show the model's ability to understand complex, multi-stage prompts, generating extremely high-resolution images as a result.
Currently, CM3leon is a research effort, and it's not clear when, or even if, Meta will make this technology publicly available in a service on one of its platforms. Given how powerful it seems to be, and the higher efficiency of generation, it does seem highly likely that CM3leon and its approach to generative AI will move beyond research (eventually).
Author: Sean Michael Kerner
Source: Venturebeat