As part of its broader effort to remove language barriers and keep people connected, Meta has developed a multilingual base model that can understand nearly 100 languages from speech or text and produce translations into other languages in real time.
This multimodal technology, called SeamlessM4T, has been publicly released to help researchers develop and deploy universal applications capable of speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation. It is accompanied by SeamlessAlign, a multimodal translation dataset extracted from a total of 265,000 hours of speech and text, which has also been made available.
This represents a significant advance in AI applications in linguistics: it is a single system that can perform multiple speech- and text-related tasks, whereas previous approaches required a different system for each task, such as a dedicated system for speech-to-speech translation.
What can SeamlessM4T do?
As Meta explains, SeamlessM4T can implicitly recognize the source language without a separate language identification model. It recognizes speech and text in nearly 100 languages and can produce text in the same number of languages and speech in 36 languages. More interestingly, SeamlessM4T can recognize when more than one language is mixed within the same sentence and still produce a translation in the requested target language, whereas previous systems required different approaches for each task.
Experiments with BLASER 2.0, a tool for evaluating speech and text translations, show that this model outperforms current state-of-the-art models for speech-to-text translation. In particular, it performed better against background noise and speaker variation, with average improvements of 37% and 48%, respectively.
“SeamlessM4T outperforms previous state-of-the-art competitors and significantly improves translation performance for low- and medium-resource languages,” Meta wrote in a blog post. “In addition, it has maintained strong performance in high-resource languages (such as English).”
If developed further, this model could lead to large-scale global translation systems, allowing people who speak different languages to communicate more effectively.
Notably, Google is also active in this field, having introduced its Universal Speech Model (USM), which can perform automatic speech recognition (ASR) not only for common languages but also for uncommon ones.