As part of its broader effort to remove language barriers and keep people connected, Meta has developed a multilingual base model that can understand nearly 100 languages from speech or text and translate them, in real time, into speech or text in another language.
This multimodal technology, called SeamlessM4T, has been publicly released to help researchers develop universal applications capable of speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation. It has been released together with SeamlessAlign, a multimodal translation dataset drawn from a total of 265,000 hours of speech and text.
This represents a significant advance in AI applications in linguistics: a single system can perform multiple speech- and text-related tasks, whereas previous approaches required a separate system for each task, such as a dedicated model for speech-to-speech translation.
What can SeamlessM4T do?
As Meta explains, SeamlessM4T implicitly recognizes the source language, with no need for a separate language-identification model. It can recognize speech and text in nearly 100 languages, produce text in the same number of languages, and produce speech in 36 languages. More interestingly, SeamlessM4T can recognize when more than one language is mixed in the same sentence and still provide a translation in the requested target language, whereas previous systems required different approaches for each task.
Experiments with BLASER 2.0, a tool for evaluating translation quality in both speech and text, show that the model outperforms current state-of-the-art models for speech-to-text translation. In particular, it coped better with background noise and speaker variation, with average improvements of 37% and 48%, respectively.
“SeamlessM4T outperforms previous state-of-the-art competitors and significantly improves translation performance for low- and medium-resource languages,” Meta wrote in a blog post. “In addition, it has maintained strong performance in high-resource languages, such as English.”
With further development, this model could lead to large-scale global translation systems, allowing people who speak different languages to communicate more effectively.
Notably, Google is also active in this field: it has introduced its Universal Speech Model (USM), which can perform automatic speech recognition (ASR) not only for widely spoken languages but also for uncommon ones.
RCO NEWS