Chinese artificial intelligence researchers have unveiled an innovative model called FantasyTalking that can produce realistic and controllable talking-face videos from only a single still portrait image. The model is built on an advanced Video Diffusion Transformer architecture and uses audio-visual synchronization techniques to accurately coordinate lip movements, facial expressions, and body movements with the input audio.
According to the description on the project's GitHub page, the model uses a two-stage strategy to synchronize audio and image.
How FantasyTalking AI produces a talking video
In the first stage, clip-level training aligns the overall scene motion, including the face, surrounding objects, and background, with the input audio. In the second stage, lip movements are refined at the frame level using a lip-specific mask so that they match the audio precisely.
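The difference between the two stages can be illustrated with a toy loss calculation. The following is only a minimal sketch under simplifying assumptions, not the project's training code: frames are small grayscale grids, the function names `clip_level_loss` and `lip_masked_loss` are hypothetical, and plain mean squared error stands in for whatever objective the model actually uses. The first function scores the whole clip, while the second scores only the pixels marked by a per-frame lip mask.

```python
from typing import List

# One grayscale frame as rows of pixel values (a toy stand-in for real video data).
Frame = List[List[float]]


def clip_level_loss(pred: List[Frame], target: List[Frame]) -> float:
    """Mean squared error over every pixel of every frame in the clip."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p_row, t_row in zip(p_frame, t_frame):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def lip_masked_loss(
    pred: List[Frame], target: List[Frame], lip_masks: List[Frame]
) -> float:
    """Mean squared error restricted to pixels whose mask value is non-zero."""
    total, count = 0.0, 0
    for p_frame, t_frame, m_frame in zip(pred, target, lip_masks):
        for p_row, t_row, m_row in zip(p_frame, t_frame, m_frame):
            for p, t, m in zip(p_row, t_row, m_row):
                if m:  # only pixels inside the (hypothetical) lip region count
                    total += (p - t) ** 2
                    count += 1
    return total / count if count else 0.0
```

In this sketch an error outside the lip mask affects only the clip-level score, which mirrors the idea of first matching overall motion and then concentrating on the mouth region.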
One of the major challenges in computer graphics and computer vision has been generating animatable avatars from a single still image. Most previous methods relied on intermediate 3D representations such as 3DMM or FLAME to maintain realism and audio synchronization, but these were ineffective at reproducing subtle facial movements and natural animation.
The video below compares some results produced by this model with those of other models:
FantasyTalking also uses a dedicated module to control motion intensity, which allows the amount of facial and body animation to be adjusted. This makes it possible to produce videos that go beyond lip movement alone. Unlike many other models, the system uses a face-focused mechanism to preserve facial identity, which yields more natural and consistent results.
Other capabilities of the model include generating characters in different framings (close-up, half-body, full-body, or angled), as well as support for different visual styles (realistic or cartoon) and even animals.
Compared with advanced closed-source methods such as OmniHuman-1, the FantasyTalking model offers higher quality in terms of realism, identity preservation, motion coherence, and audio-visual alignment.




