Meta continues to advance its research into generative AI, and today it unveiled its latest effort: a model called CM3leon, which it claims outperforms competing systems, including DALL-E 2.
According to Hoshio, Meta claims that CM3leon is the best text-to-image tool available. CM3leon is a multimodal foundation model for both text-to-image and image-to-text generation, which makes it particularly useful for automatic image captioning: given an image as input, the model can produce a descriptive caption that accurately reflects its content.
AI-generated images are, of course, nothing new at this point; popular tools such as Stable Diffusion, DALL-E, and Midjourney are widely available. What is new are the techniques Meta used to build CM3leon and the performance Meta claims its foundation model can achieve.
Most current text-to-image systems rely on a class of AI models known as diffusion models to generate images. CM3leon takes a different approach: instead of a diffusion model, it uses a token-based autoregressive model. This means CM3leon breaks its input into small units called tokens and generates the output one token at a time, with each prediction conditioned on the tokens that came before, until the full sequence is complete and can be decoded into the final image.
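To make that distinction concrete, the following is a minimal sketch of token-based autoregressive decoding, the general technique described above. The toy `next_token_logits` function stands in for a trained transformer, and the token ids are illustrative; a real system like CM3leon would also use a learned image tokenizer to turn the generated tokens back into pixels, which is omitted here.

```python
# Minimal sketch of token-based autoregressive generation.
# The toy scoring function below stands in for a trained transformer.
import numpy as np

VOCAB_SIZE = 1024      # size of the (toy) image-token codebook
IMAGE_TOKENS = 16      # a real model emits hundreds of tokens per image

rng = np.random.default_rng(0)
# Toy stand-in for trained model weights: a fixed random score table.
W = rng.standard_normal((VOCAB_SIZE, VOCAB_SIZE))

def next_token_logits(prefix: list[int]) -> np.ndarray:
    """Score every candidate next token given the tokens so far."""
    last = prefix[-1] if prefix else 0
    return W[last]  # a real transformer would attend over the full prefix

def generate(prompt_tokens: list[int], n_tokens: int) -> list[int]:
    """Autoregressive decoding: emit one token at a time, feeding each
    generated token back in as context for the next prediction."""
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        logits = next_token_logits(tokens)
        tokens.append(int(np.argmax(logits)))  # greedy; sampling is also common
    return tokens[len(prompt_tokens):]

prompt = [12, 407, 88]  # a text prompt tokenized into (toy) token ids
image_tokens = generate(prompt, IMAGE_TOKENS)
print(image_tokens)  # a real image decoder would turn these into pixels
```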
“Diffusion models have recently become popular for image generation due to their strong performance and relatively low computational cost,” Meta’s research team wrote in a paper titled Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. “In contrast, token-based autoregressive models can produce better results with more coherent images, but are much more expensive to train and to use for inference.”
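For contrast, here is a toy sketch of the diffusion-style loop the quote refers to: rather than emitting tokens one at a time, a diffusion model starts from random noise and refines the entire image over a fixed number of steps. The `predict_noise` function and the update rule are simplified stand-ins for a trained denoising network and a proper sampler.

```python
# Toy sketch of a diffusion-style denoising loop: the whole image is
# updated at every step, rather than being emitted token by token.
import numpy as np

STEPS = 50
rng = np.random.default_rng(0)

def predict_noise(x: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for a trained network that estimates the noise in x."""
    return 0.1 * x  # a real model would also condition on the text prompt

x = rng.standard_normal((64, 64, 3))  # start from pure Gaussian noise
for step in reversed(range(STEPS)):
    x = x - predict_noise(x, step)    # remove a little noise each step
print(x.shape)  # (64, 64, 3): the full image is refined at every step
```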