Chinese artificial intelligence researchers have unveiled an innovative model called FantasyTalking that can produce realistic, controllable talking-face videos from a single static portrait image. The model is built on an advanced Video Diffusion Transformer architecture and uses audio-visual synchronization techniques to provide accurate coordination between lip movements, facial expressions, body movements and the input audio.
According to the project's GitHub page, the model uses a two-stage strategy to synchronize audio and video.
How FantasyTalking generates a talking avatar
In the first stage, clip-level training aligns the overall scene motion, including the face, the surrounding objects and the background, with the input audio. In the second stage, lip movements are refined at the frame level using specific masks so that they fully match the audio.
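The difference between the two stages can be illustrated with a toy loss calculation. This is only a minimal sketch and not the project's actual training code: it assumes each frame is a flat list of values, that the error is a plain mean squared error, and that the mask is a 0/1 list marking the lip region. The names `clip_level_loss`, `lip_masked_loss` and `lip_mask` are hypothetical.

```python
from typing import List

Frame = List[float]  # one flattened frame of (hypothetical) pixel or latent values


def clip_level_loss(pred: List[Frame], target: List[Frame]) -> float:
    """Mean squared error over every value in every frame of the clip."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def lip_masked_loss(pred: List[Frame], target: List[Frame],
                    lip_mask: List[Frame]) -> float:
    """Mean squared error restricted to positions where the mask is non-zero."""
    total, weight = 0.0, 0.0
    for p_frame, t_frame, m_frame in zip(pred, target, lip_mask):
        for p, t, m in zip(p_frame, t_frame, m_frame):
            total += m * (p - t) ** 2
            weight += m
    # An empty mask contributes no error rather than dividing by zero.
    return total / weight if weight > 0 else 0.0


# Two tiny frames with two values each; only the second value is "lip".
pred = [[0.0, 1.0], [2.0, 3.0]]
target = [[0.0, 0.0], [2.0, 1.0]]
lip_mask = [[0.0, 1.0], [0.0, 1.0]]

scene_error = clip_level_loss(pred, target)
lip_error = lip_masked_loss(pred, target, lip_mask)
```

In this toy setup the clip-level error averages over the whole scene, while the masked error looks only at the lip positions, which is why the second stage can concentrate on mouth detail.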
One of the major challenges in computer graphics and computer vision has been generating animatable avatars from a single static image. Most previous methods relied on intermediate 3D representations such as 3DMM or FLAME to maintain realism and audio synchronization, but these were ineffective at reproducing subtle facial movements and natural animation.
The video below compares some of the results produced by this model with those of other models:
FantasyTalking also uses a dedicated module to control motion intensity, which makes it possible to adjust the amount of facial and body animation. This feature allows the model to produce videos that go beyond lip movement alone. Unlike many other models, the system uses a face-focused mechanism to preserve facial identity, which yields more natural and coherent results.
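The idea of separate intensity controls for face and body can be pictured with a small sketch. This is an illustration of the concept only and does not reflect how the model implements its module: it assumes motion is a set of 2D keypoints, that intensity is a value between 0 and 1, and that a lower value simply pulls each keypoint back toward its rest position. `MotionIntensity` and `scale_motion` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class MotionIntensity:
    """Hypothetical user-facing controls: 0.0 = still, 1.0 = full motion."""
    face: float = 1.0
    body: float = 1.0

    def clamped(self) -> "MotionIntensity":
        """Return a copy with both values limited to the range [0, 1]."""
        def clamp(value: float) -> float:
            return max(0.0, min(1.0, value))
        return MotionIntensity(clamp(self.face), clamp(self.body))


def scale_motion(rest: List[Point], moved: List[Point],
                 strength: float) -> List[Point]:
    """Interpolate each keypoint between its rest and fully moved position."""
    return [
        (rx + strength * (mx - rx), ry + strength * (my - ry))
        for (rx, ry), (mx, my) in zip(rest, moved)
    ]


# Out-of-range settings are clamped before use.
settings = MotionIntensity(face=1.5, body=-0.2).clamped()

# Half-strength motion moves a keypoint halfway from rest to its target.
half_moved = scale_motion([(0.0, 0.0)], [(2.0, 4.0)], 0.5)
```

Keeping the face and body values separate mirrors the article's point that the amount of facial and body animation can be adjusted independently of lip movement.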
Other capabilities of this model include generating characters in different framings (close-up, half-body, full-body or angled), support for different visual styles (realistic or cartoon) and even animals.
Compared with advanced closed-source methods such as OmniHuman-1, the FantasyTalking model offers higher quality in terms of realism, identity preservation, motion coherence and audio-visual alignment.
RCO NEWS