In a groundbreaking development in the realm of artificial intelligence, researchers from Meta and other esteemed institutions have introduced Transfusion, a novel method that enables the training of multimodal models on both discrete and continuous data. This innovative approach promises to bridge the gap between language modeling and diffusion models, paving the way for more seamless integration across diverse modalities such as text, images, audio, and video.
Traditionally, language models, which predict the next word in a sequence, have dominated the discrete modality domain, while diffusion models and their generalizations have emerged as the state-of-the-art techniques for generating continuous modalities like images. Efforts to merge these two domains have included extending language models to leverage diffusion models or appending pre-trained diffusion models to language models. However, these methods often introduce architectural complexity or lose information when continuous modalities are quantized for processing by standard language models.
Transfusion, as detailed in a recent paper, offers a training methodology that enables a single transformer to generate both discrete and continuous modalities. By combining the language modeling (next-token prediction) loss with a diffusion loss, the model is trained on mixed-modality sequences, unifying the processing of text tokens and diffused continuous image representations. Because images are handled as continuous vectors rather than quantized into discrete tokens, no information is lost in the process.
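Conceptually, this amounts to summing the two objectives over a mixed-modality batch. The PyTorch-style sketch below illustrates the idea; the function name, the balancing coefficient `lam`, and the input shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise_target, lam=1.0):
    """Illustrative combined objective: next-token prediction on text tokens
    plus a diffusion (noise-prediction) loss on continuous image patches.
    `lam` is an assumed balancing hyperparameter."""
    # Standard language-modeling cross-entropy over the discrete text positions.
    lm_loss = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_targets.view(-1),
    )
    # DDPM-style mean-squared error between predicted and true noise
    # on the image patches.
    diffusion_loss = F.mse_loss(noise_pred, noise_target)
    return lm_loss + lam * diffusion_loss
```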
The research also involves pre-training multiple Transfusion models from scratch, with parameter counts reaching up to 7 billion, on a mix of text and image data. These models have been benchmarked against a variety of single- and cross-modal tasks, demonstrating strong scalability and performance. The work is described in the paper "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model."
Experiments show that Transfusion outperforms approaches that quantize images and train language models on discrete image tokens. The study further demonstrates that introducing modality-specific encoding and decoding layers improves the model's performance, even allowing each image to be compressed to as few as 16 patches. Scaled to 7 billion parameters and trained on 2 trillion multimodal tokens, the Transfusion model generates images and text on par with comparable-scale diffusion models and language models, effectively harnessing the strengths of both domains.
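To make the patch-count figure concrete, a quick back-of-the-envelope calculation helps; the image resolution and VAE downsampling factor below are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative patch-count arithmetic (all sizes are assumptions for the example).
image_size = 256          # input image resolution (pixels per side)
vae_downsample = 8        # a VAE encoder reducing spatial resolution 8x
latent_size = image_size // vae_downsample   # 32x32 latent grid

for patch_size in (2, 4, 8):
    n_patches = (latent_size // patch_size) ** 2
    print(f"patch size {patch_size}x{patch_size} -> {n_patches} transformer vectors")
# Larger patches trade per-image sequence length for heavier per-patch compression;
# an 8x8 patching of a 32x32 latent grid yields just 16 vectors per image.
```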
On the GenEval benchmark, the 7-billion-parameter Transfusion model surpasses popular models such as DALL-E 2 and SDXL in image generation, while matching the text generation capabilities of the Llama 1 model. This versatility positions Transfusion as a promising approach to training genuinely multimodal models that excel at both image and text generation.
The Architecture of Transfusion
At the heart of Transfusion lies a single, unified transformer that processes every sequence, regardless of modality. The transformer takes high-dimensional vectors as input and produces vectors of the same kind as output. To interface with each modality, lightweight, modality-specific components with unshared parameters are employed. For text, these are embedding matrices that map input token indices to vectors, plus an output projection that converts the transformer's output vectors into a discrete distribution over the vocabulary. For images, the researchers experimented with two ways to compress local patch vectors into a single transformer vector and back again: a simple linear layer, and the up and down blocks of a U-Net.
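A minimal sketch of these modality-specific components, assuming PyTorch and the simpler linear-projection variant (class names, dimensions, and the absence of the U-Net option are illustrative choices, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class TextIO(nn.Module):
    """Text modality: embedding matrix in, vocabulary projection out."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token indices -> vectors
        self.unembed = nn.Linear(d_model, vocab_size)    # output vectors -> logits

class ImageIO(nn.Module):
    """Image modality: linear projections between flattened latent patches and
    transformer vectors (the paper also explores U-Net up/down blocks here)."""
    def __init__(self, patch_dim: int, d_model: int):
        super().__init__()
        self.patchify = nn.Linear(patch_dim, d_model)    # patch -> transformer vector
        self.unpatchify = nn.Linear(d_model, patch_dim)  # transformer vector -> patch
```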
Text and images also differ in how attention should flow. Language models typically use causal masking to prevent leakage of information from future tokens, whereas image patches benefit from unmasked (bidirectional) attention. Transfusion reconciles the two by applying causal attention across the sequence as a whole while allowing bidirectional attention among the patches of each individual image.
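The resulting pattern can be pictured as a mask that is causal over the full sequence but fully connected within each image's span of patches. The helper below is a simplified illustration under that assumption; the function name and boolean convention are not from the paper.

```python
import torch

def transfusion_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Build an attention mask that is causal overall but lets every patch of an
    image attend to every other patch of the same image.
    `image_spans` holds (start, end) index pairs (end exclusive) for each image.
    Returns a boolean matrix where True means 'may attend'."""
    mask = torch.ones(seq_len, seq_len).tril().bool()   # causal lower-triangular part
    for start, end in image_spans:
        mask[start:end, start:end] = True               # bidirectional within the image
    return mask

# Example: a 10-position sequence whose positions 3..6 hold one image's patches.
print(transfusion_mask(10, [(3, 7)]))
```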
With its ability to handle both discrete and continuous data effectively, Transfusion is set to reshape the landscape of multimodal AI. As Meta and its collaborators continue to refine the approach, the potential for more advanced and integrated AI systems across industries and applications looks substantial. The future of multimodal processing is increasingly unified and powerful, thanks to the work on Transfusion.
Source: https://www.jiqizhixin.com/articles/2024-08-26-8